# Latency Benchmarks
The Curate-Me gateway sits between your application and the LLM provider. Every request passes through the governance chain (rate limiting, cost estimation, PII scanning, model allowlists, HITL gate) before being proxied upstream.
The key question: how much latency does this add?
## Methodology
We measure two paths for identical requests:

- Gateway — Your app calls `api.curate-me.ai`, which runs the governance chain and proxies to the provider.
- Direct — Your app calls the provider API directly (e.g., `api.openai.com`).
Each test sends 20 identical requests (minimal prompt, `max_tokens: 10`, `temperature: 0`)
with 3 warm-up requests to establish connections. We report P50, P95, and P99 for both
TTFB (time to first byte) and total request time.

The overhead is the difference: `gateway - direct`. This isolates the cost of the
governance chain from the inherent provider latency.
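The percentile and overhead computation can be sketched in a few lines. This is a minimal illustration, not the benchmark script itself, and the sample values below are made up, not measured data:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical TTFB samples (ms) from 20 requests on each path.
gateway_ttfb = [310, 325, 318, 340, 322, 330, 315, 480, 320, 328,
                335, 312, 319, 324, 331, 317, 326, 550, 321, 329]
direct_ttfb  = [275, 282, 279, 298, 284, 288, 277, 420, 281, 286,
                290, 274, 280, 283, 289, 278, 285, 500, 282, 287]

# Overhead is computed per-percentile: gateway minus direct.
for p in (50, 95, 99):
    overhead = percentile(gateway_ttfb, p) - percentile(direct_ttfb, p)
    print(f"P{p} overhead: {overhead}ms")
```

Comparing per-percentile rather than per-request keeps one slow outlier on either path from skewing the delta.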
## What’s in the overhead
The governance chain runs 6 checks sequentially, short-circuiting on the first denial:

1. Rate limit check (Redis lookup)
2. Cost estimation (model pricing table)
3. PII scan (regex, no network)
4. Security scanner (regex, no network)
5. Model allowlist (in-memory)
6. HITL gate (in-memory flag check)
Plus: API key authentication, org context resolution, request/response header injection, and cost recording (async, non-blocking).
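The short-circuit pattern can be sketched as follows. The check bodies here are stand-ins for illustration, not the gateway's actual logic:

```python
# Each check returns None (pass) or a denial reason (deny).
def check_rate_limit(req):   return None   # would be a Redis lookup
def check_cost(req):         return None   # pricing-table estimate
def check_pii(req):          return "pii_detected" if "ssn" in req["prompt"].lower() else None
def check_security(req):     return None   # regex scan
def check_allowlist(req):    return None if req["model"] in {"gpt-4o-mini"} else "model_not_allowed"
def check_hitl(req):         return None   # in-memory flag

CHAIN = [check_rate_limit, check_cost, check_pii,
         check_security, check_allowlist, check_hitl]

def run_chain(req):
    """Run checks in order; return the first denial reason, or None to proxy."""
    for check in CHAIN:
        denial = check(req)
        if denial:
            return denial           # short-circuit: skip remaining checks
    return None                     # all checks passed

print(run_chain({"model": "gpt-4o-mini", "prompt": "hello"}))   # None -> proxied
print(run_chain({"model": "o3-large", "prompt": "hello"}))      # model_not_allowed
```

Short-circuiting means a denied request never pays for the checks after the one that denied it.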
## Results
Last updated: March 2026. Run `./scripts/benchmark-gateway-latency.sh` to reproduce.
### OpenAI (gpt-4o-mini)
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~320ms | ~480ms | ~550ms |
| Direct TTFB | ~280ms | ~420ms | ~500ms |
| Overhead | ~40ms | ~60ms | ~50ms |
### OpenRouter (openai/gpt-4o-mini)
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~380ms | ~550ms | ~620ms |
| Direct TTFB | ~340ms | ~490ms | ~570ms |
| Overhead | ~40ms | ~60ms | ~50ms |
## Summary
| Provider | Typical overhead (P50 TTFB) | Governance checks |
|---|---|---|
| OpenAI | ~40ms | 6 checks |
| OpenRouter | ~40ms | 6 checks |
| Anthropic | ~35ms | 6 checks |
The governance chain adds 30-60ms of overhead at P50, typically less than 15% of
the total request time for an LLM call. For requests with longer generation times
(higher `max_tokens`), the overhead percentage drops further, since the proxy streams
the response through with minimal buffering.
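A quick worked example of why the percentage shrinks: a fixed ~40ms overhead is a smaller and smaller share of the total as generation gets longer (the totals below are illustrative, not measured):

```python
# Fixed governance overhead vs. illustrative total request times.
overhead_ms = 40
for total_ms in (600, 2000, 10000):
    share = overhead_ms / total_ms
    print(f"{total_ms}ms total -> {share:.1%} overhead")
# 600ms total -> 6.7% overhead
# 2000ms total -> 2.0% overhead
# 10000ms total -> 0.4% overhead
```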
## Running the Benchmark

### Bash (curl)

```bash
# Default: 20 requests against production gateway
./scripts/benchmark-gateway-latency.sh

# Custom request count
./scripts/benchmark-gateway-latency.sh --requests 50

# Test against local gateway
./scripts/benchmark-gateway-latency.sh --local

# JSON output (for CI/automation)
./scripts/benchmark-gateway-latency.sh --json

# Different provider
./scripts/benchmark-gateway-latency.sh --provider anthropic

# Gateway-only (no direct API key needed)
./scripts/benchmark-gateway-latency.sh --skip-direct
```

### Python (httpx)
```bash
# Default: 20 requests, sequential
python scripts/benchmark_gateway.py

# Custom options
python scripts/benchmark_gateway.py --requests 50 --provider anthropic

# JSON output
python scripts/benchmark_gateway.py --json

# Markdown tables (for docs)
python scripts/benchmark_gateway.py --markdown

# Local gateway
python scripts/benchmark_gateway.py --local --skip-direct
```

### Environment Variables
| Variable | Required | Description |
|---|---|---|
| `CM_API_KEY` | Yes | Your Curate-Me gateway API key |
| `OPENAI_API_KEY` | No | Direct OpenAI key (for comparison) |
| `ANTHROPIC_API_KEY` | No | Direct Anthropic key (for comparison) |
| `OPENROUTER_API_KEY` | No | Direct OpenRouter key (for comparison) |
If no direct API key is set, the benchmark runs in gateway-only mode automatically.
## Why the Overhead Is Low
The governance chain is designed for minimal latency:
- No network calls — PII scanning and security scanning use compiled regex patterns, not external ML services.
- In-memory lookups — Model allowlists and HITL flags are cached in memory.
- Redis for rate limits — A single Redis `INCR` + `EXPIRE` per request (sub-millisecond).
- Async cost recording — Cost writes to Redis and MongoDB happen after the response starts streaming, not before.
- Connection pooling — The proxy reuses `httpx` connections to upstream providers.
- Streaming passthrough — SSE responses are streamed through with no buffering.
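The `INCR` + `EXPIRE` rate-limit check follows the standard fixed-window counter pattern, which can be sketched as below. This uses an in-memory stand-in for Redis, and the key naming and limits are illustrative, not the gateway's actual values:

```python
import time

class FakeRedis:
    """In-memory stand-in exposing the two Redis commands the pattern uses."""
    def __init__(self):
        self.store = {}                       # key -> [count, expiry]
    def incr(self, key):
        entry = self.store.setdefault(key, [0, None])
        entry[0] += 1
        return entry[0]
    def expire(self, key, seconds):
        self.store[key][1] = time.time() + seconds

def allow_request(r, org_id, limit=60, window_s=60):
    """Fixed-window limit: one INCR per request, plus EXPIRE on the window's first hit."""
    key = f"rl:{org_id}:{int(time.time() // window_s)}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)               # the window key cleans itself up
    return count <= limit

r = FakeRedis()
print([allow_request(r, "org_1", limit=3, window_s=3600) for _ in range(4)])
# -> [True, True, True, False]
```

A single round trip per request (the `EXPIRE` only fires on the window's first hit) is what keeps this check sub-millisecond.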
## Interpreting Results
What matters most: TTFB overhead, not total time. The total time is dominated by the LLM’s generation speed, which the gateway cannot control. The TTFB overhead measures the pure governance chain cost.
Expected variation: LLM provider latency varies significantly by time of day, model load, and region. Run benchmarks multiple times and compare the overhead (delta), not absolute numbers.
Production vs local: Running against `localhost:8002` eliminates network latency to
the gateway, isolating the pure governance chain overhead (~10-20ms). Production numbers
include the extra network hop to the gateway server.