# Latency Benchmarks
The Curate-Me gateway sits between your application and the LLM provider. Every request passes through the governance chain (rate limiting, cost estimation, PII scanning, model allowlists, HITL gate) before being proxied upstream.
The key question: how much latency does this add?
## Methodology
We measure two paths for identical requests:

- Gateway — Your app calls `api.curate-me.ai`, which runs the governance chain and proxies to the provider.
- Direct — Your app calls the provider API directly (e.g., `api.openai.com`).
Each test sends 20 identical requests (minimal prompt, `max_tokens: 10`, `temperature: 0`)
with 3 warm-up requests to establish connections. We report P50, P95, and P99 for both
TTFB (time to first byte) and total request time.

The overhead is the difference: `gateway - direct`. This isolates the cost of the
governance chain from the inherent provider latency.
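The percentile and overhead computation can be sketched in a few lines. This is a minimal illustration, not the benchmark script itself, and the sample values below are made up, not measured data:

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical TTFB samples (ms) from 20 requests on each path.
gateway_ttfb = [310, 325, 318, 340, 322, 330, 315, 480, 320, 328,
                335, 312, 319, 324, 331, 317, 326, 550, 321, 329]
direct_ttfb  = [275, 282, 279, 298, 284, 288, 277, 420, 281, 286,
                290, 274, 280, 283, 289, 278, 285, 500, 282, 287]

# Overhead is computed per-percentile: gateway minus direct.
for p in (50, 95, 99):
    overhead = percentile(gateway_ttfb, p) - percentile(direct_ttfb, p)
    print(f"P{p} overhead: {overhead}ms")
```

Comparing per-percentile rather than per-request keeps one slow outlier on either path from skewing the delta.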
## What’s in the overhead
The governance chain runs 6 checks sequentially, short-circuiting on the first denial:

1. Rate limit check (Redis lookup)
2. Cost estimation (model pricing table)
3. PII scan (regex, no network)
4. Security scanner (regex, no network)
5. Model allowlist (in-memory)
6. HITL gate (in-memory flag check)
Plus: API key authentication, org context resolution, request/response header injection, and cost recording (async, non-blocking).
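The short-circuit pattern can be sketched as follows. The check bodies here are stand-ins for illustration, not the gateway's actual logic:

```python
# Each check returns None (pass) or a denial reason (deny).
def check_rate_limit(req):   return None   # would be a Redis lookup
def check_cost(req):         return None   # pricing-table estimate
def check_pii(req):          return "pii_detected" if "ssn" in req["prompt"].lower() else None
def check_security(req):     return None   # regex scan
def check_allowlist(req):    return None if req["model"] in {"gpt-4o-mini"} else "model_not_allowed"
def check_hitl(req):         return None   # in-memory flag

CHAIN = [check_rate_limit, check_cost, check_pii,
         check_security, check_allowlist, check_hitl]

def run_chain(req):
    """Run checks in order; return the first denial reason, or None to proxy."""
    for check in CHAIN:
        denial = check(req)
        if denial:
            return denial           # short-circuit: skip remaining checks
    return None                     # all checks passed

print(run_chain({"model": "gpt-4o-mini", "prompt": "hello"}))   # None -> proxied
print(run_chain({"model": "o3-large", "prompt": "hello"}))      # model_not_allowed
```

Short-circuiting means a denied request never pays for the checks after the one that denied it.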
## Results
Last updated: March 2026. Run `./scripts/benchmark-gateway-latency.sh` to reproduce.
### OpenAI (gpt-4o-mini)
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~320ms | ~480ms | ~550ms |
| Direct TTFB | ~280ms | ~420ms | ~500ms |
| Overhead | ~40ms | ~60ms | ~50ms |
### OpenRouter (openai/gpt-4o-mini)
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~380ms | ~550ms | ~620ms |
| Direct TTFB | ~340ms | ~490ms | ~570ms |
| Overhead | ~40ms | ~60ms | ~50ms |
## Summary
| Provider | Typical overhead (P50 TTFB) | Governance checks |
|---|---|---|
| OpenAI | ~40ms | 6 checks |
| OpenRouter | ~40ms | 6 checks |
| Anthropic | ~35ms | 6 checks |
The governance chain adds 30-60ms of overhead at P50, typically less than 15% of
the total request time for an LLM call. For requests with longer generation times
(higher `max_tokens`), the overhead percentage drops further, since the proxy streams
the response through with minimal buffering.
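A quick worked example of why the percentage shrinks: a fixed ~40ms overhead is a smaller and smaller share of the total as generation gets longer (the totals below are illustrative, not measured):

```python
# Fixed governance overhead vs. illustrative total request times.
overhead_ms = 40
for total_ms in (600, 2000, 10000):
    share = overhead_ms / total_ms
    print(f"{total_ms}ms total -> {share:.1%} overhead")
# 600ms total -> 6.7% overhead
# 2000ms total -> 2.0% overhead
# 10000ms total -> 0.4% overhead
```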
## Running the Benchmark

### Bash (curl)

```bash
# Default: 20 requests against production gateway
./scripts/benchmark-gateway-latency.sh

# Custom request count
./scripts/benchmark-gateway-latency.sh --requests 50

# Test against local gateway
./scripts/benchmark-gateway-latency.sh --local

# JSON output (for CI/automation)
./scripts/benchmark-gateway-latency.sh --json

# Different provider
./scripts/benchmark-gateway-latency.sh --provider anthropic

# Gateway-only (no direct API key needed)
./scripts/benchmark-gateway-latency.sh --skip-direct
```

### Python (httpx)
```bash
# Default: 20 requests, sequential
python scripts/benchmark_gateway.py

# Custom options
python scripts/benchmark_gateway.py --requests 50 --provider anthropic

# JSON output
python scripts/benchmark_gateway.py --json

# Markdown tables (for docs)
python scripts/benchmark_gateway.py --markdown

# Local gateway
python scripts/benchmark_gateway.py --local --skip-direct
```

### Environment Variables
| Variable | Required | Description |
|---|---|---|
| `CM_API_KEY` | Yes | Your Curate-Me gateway API key |
| `OPENAI_API_KEY` | No | Direct OpenAI key (for comparison) |
| `ANTHROPIC_API_KEY` | No | Direct Anthropic key (for comparison) |
| `OPENROUTER_API_KEY` | No | Direct OpenRouter key (for comparison) |
If no direct API key is set, the benchmark runs in gateway-only mode automatically.
## Why the Overhead Is Low
The governance chain is designed for minimal latency:
- No network calls — PII scanning and security scanning use compiled regex patterns, not external ML services.
- In-memory lookups — Model allowlists and HITL flags are cached in memory.
- Redis for rate limits — A single Redis `INCR` + `EXPIRE` per request (sub-millisecond).
- Async cost recording — Cost writes to Redis and MongoDB happen after the response starts streaming, not before.
- Connection pooling — The proxy reuses `httpx` connections to upstream providers.
- Streaming passthrough — SSE responses are streamed through with no buffering.
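The `INCR` + `EXPIRE` rate-limit check follows the standard fixed-window counter pattern, which can be sketched as below. This uses an in-memory stand-in for Redis, and the key naming and limits are illustrative, not the gateway's actual values:

```python
import time

class FakeRedis:
    """In-memory stand-in exposing the two Redis commands the pattern uses."""
    def __init__(self):
        self.store = {}                       # key -> [count, expiry]
    def incr(self, key):
        entry = self.store.setdefault(key, [0, None])
        entry[0] += 1
        return entry[0]
    def expire(self, key, seconds):
        self.store[key][1] = time.time() + seconds

def allow_request(r, org_id, limit=60, window_s=60):
    """Fixed-window limit: one INCR per request, plus EXPIRE on the window's first hit."""
    key = f"rl:{org_id}:{int(time.time() // window_s)}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)               # the window key cleans itself up
    return count <= limit

r = FakeRedis()
print([allow_request(r, "org_1", limit=3, window_s=3600) for _ in range(4)])
# -> [True, True, True, False]
```

A single round trip per request (the `EXPIRE` only fires on the window's first hit) is what keeps this check sub-millisecond.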
## Interpreting Results
What matters most: TTFB overhead, not total time. The total time is dominated by the LLM’s generation speed, which the gateway cannot control. The TTFB overhead measures the pure governance chain cost.
Expected variation: LLM provider latency varies significantly by time of day, model load, and region. Run benchmarks multiple times and compare the overhead (delta), not absolute numbers.
Production vs local: Running against `localhost:8002` eliminates network latency to
the gateway, isolating the pure governance chain overhead (~10-20ms). Production numbers
include the extra network hop to the gateway server.