
Latency Benchmarks

The Curate-Me gateway sits between your application and the LLM provider. Every request passes through the governance chain (rate limiting, cost estimation, PII scanning, model allowlists, HITL gate) before being proxied upstream.

The key question: how much latency does this add?

Methodology

We measure two paths for identical requests:

  • Gateway — Your app calls api.curate-me.ai, which runs the governance chain and proxies to the provider.
  • Direct — Your app calls the provider API directly (e.g., api.openai.com).

Each test sends 20 identical requests (minimal prompt, max_tokens: 10, temperature: 0) with 3 warm-up requests to establish connections. We report P50, P95, and P99 for both TTFB (time to first byte) and total request time.

The overhead is the difference: gateway - direct. This isolates the cost of the governance chain from the inherent provider latency.
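The percentile/overhead arithmetic described above can be sketched in a few lines of Python. The sample values below are hypothetical, and the nearest-rank percentile shown here is an assumption; the actual benchmark scripts may interpolate differently:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = round(p / 100 * (len(ranked) - 1))
    return ranked[max(0, min(len(ranked) - 1, k))]

# Hypothetical TTFB samples in ms (real runs collect 20 per path)
gateway_ttfb = [318.0, 322.0, 335.0, 410.0, 480.0]
direct_ttfb = [279.0, 281.0, 290.0, 360.0, 420.0]

for p in (50, 95, 99):
    overhead = percentile(gateway_ttfb, p) - percentile(direct_ttfb, p)
    print(f"P{p} overhead: {overhead:.0f}ms")
```

Comparing per-percentile deltas rather than raw values keeps provider-side latency swings out of the overhead figure.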

What’s in the overhead

The governance chain runs 6 checks sequentially, short-circuiting on first denial:

  1. Rate limit check (Redis lookup)
  2. Cost estimation (model pricing table)
  3. PII scan (regex, no network)
  4. Security scanner (regex, no network)
  5. Model allowlist (in-memory)
  6. HITL gate (in-memory flag check)

Plus: API key authentication, org context resolution, request/response header injection, and cost recording (async, non-blocking).
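A sequential, short-circuiting chain like this can be sketched as a list of check functions, each returning a denial or passing the request on. The names (`Denial`, `run_chain`, the allowlist contents) are illustrative, not the gateway's actual internals:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Denial:
    check: str   # which governance check denied the request
    reason: str

# Each check returns None to pass, or a Denial to stop the chain.
Check = Callable[[dict], Optional[Denial]]

def run_chain(request: dict, checks: list[Check]) -> Optional[Denial]:
    for check in checks:
        denial = check(request)
        if denial is not None:
            return denial  # first denial wins; later checks never run
    return None  # all checks passed; proxy upstream

def model_allowlist(request: dict) -> Optional[Denial]:
    allowed = {"gpt-4o-mini", "claude-3-5-haiku"}  # hypothetical in-memory set
    if request.get("model") not in allowed:
        return Denial("model_allowlist", f"model {request.get('model')!r} not allowed")
    return None
```

Short-circuiting means a denied request pays only for the checks that ran, so the worst-case overhead applies only to requests that pass everything.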

Results

Last updated: March 2026. Run ./scripts/benchmark-gateway-latency.sh to reproduce.

OpenAI (gpt-4o-mini)

| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~320ms | ~480ms | ~550ms |
| Direct TTFB | ~280ms | ~420ms | ~500ms |
| Overhead | ~40ms | ~60ms | ~50ms |

OpenRouter (openai/gpt-4o-mini)

| Metric | P50 | P95 | P99 |
|---|---|---|---|
| Gateway TTFB | ~380ms | ~550ms | ~620ms |
| Direct TTFB | ~340ms | ~490ms | ~570ms |
| Overhead | ~40ms | ~60ms | ~50ms |

Summary

| Provider | Typical overhead (P50 TTFB) | Governance checks |
|---|---|---|
| OpenAI | ~40ms | 6 checks |
| OpenRouter | ~40ms | 6 checks |
| Anthropic | ~35ms | 6 checks |

The governance chain adds roughly 40ms of overhead at P50 (30-60ms across percentiles), which is typically less than 15% of the total request time for most LLM calls. For requests with longer generation times (higher max_tokens), the overhead percentage drops further, since the proxy streams the response through with minimal buffering.

Running the Benchmark

Bash (curl)

```bash
# Default: 20 requests against production gateway
./scripts/benchmark-gateway-latency.sh

# Custom request count
./scripts/benchmark-gateway-latency.sh --requests 50

# Test against local gateway
./scripts/benchmark-gateway-latency.sh --local

# JSON output (for CI/automation)
./scripts/benchmark-gateway-latency.sh --json

# Different provider
./scripts/benchmark-gateway-latency.sh --provider anthropic

# Gateway-only (no direct API key needed)
./scripts/benchmark-gateway-latency.sh --skip-direct
```

Python (httpx)

```bash
# Default: 20 requests, sequential
python scripts/benchmark_gateway.py

# Custom options
python scripts/benchmark_gateway.py --requests 50 --provider anthropic

# JSON output
python scripts/benchmark_gateway.py --json

# Markdown tables (for docs)
python scripts/benchmark_gateway.py --markdown

# Local gateway
python scripts/benchmark_gateway.py --local --skip-direct
```

Environment Variables

| Variable | Required | Description |
|---|---|---|
| CM_API_KEY | Yes | Your Curate-Me gateway API key |
| OPENAI_API_KEY | No | Direct OpenAI key (for comparison) |
| ANTHROPIC_API_KEY | No | Direct Anthropic key (for comparison) |
| OPENROUTER_API_KEY | No | Direct OpenRouter key (for comparison) |

If no direct API key is set, the benchmark runs in gateway-only mode automatically.
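The fallback logic amounts to an environment lookup per provider. This is a sketch of the behavior, not the benchmark script's actual code; the function name and return values are made up for illustration:

```python
import os

# Maps each provider to the env var holding its direct API key
DIRECT_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

def benchmark_mode(provider: str) -> str:
    """Return 'compare' when a direct key is available, else 'gateway-only'."""
    if os.environ.get(DIRECT_KEYS[provider]):
        return "compare"
    return "gateway-only"
```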

Why the Overhead Is Low

The governance chain is designed for minimal latency:

  • No network calls — PII scanning and security scanning use compiled regex patterns, not external ML services.
  • In-memory lookups — Model allowlists and HITL flags are cached in memory.
  • Redis for rate limits — Single Redis INCR + EXPIRE per request (sub-millisecond).
  • Async cost recording — Cost writes to Redis and MongoDB happen after the response starts streaming, not before.
  • Connection pooling — The proxy reuses httpx connections to upstream providers.
  • Streaming passthrough — SSE responses are streamed through with no buffering.
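The rate-limit bullet above (one INCR plus one EXPIRE) corresponds to a fixed-window counter. The sketch below assumes a fixed-window scheme and works against any Redis-style client exposing `incr`/`expire`; the class and key format are illustrative, not the gateway's actual schema:

```python
class FixedWindowLimiter:
    """Fixed-window rate limiter: one INCR + one EXPIRE per request."""

    def __init__(self, store, limit: int, window_s: int):
        self.store = store        # Redis-style client: incr(key), expire(key, ttl)
        self.limit = limit
        self.window_s = window_s

    def allow(self, org_id: str, now_s: int) -> bool:
        # Key rotates each window, so old counters expire on their own
        key = f"rl:{org_id}:{now_s // self.window_s}"
        count = self.store.incr(key)
        if count == 1:
            self.store.expire(key, self.window_s)  # set TTL on first hit only
        return count <= self.limit
```

With the real redis-py client, both commands can be batched into a single pipeline round trip, which is what keeps this check sub-millisecond.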

Interpreting Results

What matters most: TTFB overhead, not total time. The total time is dominated by the LLM’s generation speed, which the gateway cannot control. The TTFB overhead measures the pure governance chain cost.
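Separating TTFB from total time per request can be done by timing the first chunk of a streamed response. This helper is a sketch, written against any iterable of response chunks (for example, what httpx's `response.iter_raw()` yields); it is not the benchmark script's implementation:

```python
import time
from typing import Iterable, Tuple

def time_stream(chunks: Iterable[bytes]) -> Tuple[float, float]:
    """Consume a response stream; return (ttfb_ms, total_ms)."""
    start = time.perf_counter()
    first = None
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first byte arrived
    end = time.perf_counter()
    ttfb_ms = ((first if first is not None else end) - start) * 1000
    return ttfb_ms, (end - start) * 1000
```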

Expected variation: LLM provider latency varies significantly by time of day, model load, and region. Run benchmarks multiple times and compare the overhead (delta), not absolute numbers.

Production vs local: Running against localhost:8002 eliminates network latency to the gateway, isolating the pure governance chain overhead (~10-20ms). Production numbers include the extra network hop to the gateway server.