
Response Cache

The gateway includes a response cache that stores LLM responses and serves them for identical subsequent requests. This eliminates duplicate provider calls and reduces both latency and cost.

How It Works

When a request arrives, the gateway generates a cache key by hashing the response-influencing fields of the request body:

  • model, messages, system, temperature, max_tokens, top_p, top_k, stop, tools, tool_choice

Fields that don’t affect the response (like stream, user, metadata, seed) are excluded from the hash. Messages are normalized recursively for deterministic hashing.

Request → Normalize → SHA-256 → Redis Lookup
  ├── HIT  → Return cached response
  └── MISS → Proxy to provider → Cache response → Return
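The key-generation step above can be sketched as follows. The field list comes from this section; the serialization details (sorted-key JSON) are an assumption for illustration, not necessarily how the gateway canonicalizes requests:

```python
import hashlib
import json

# Only response-influencing fields are hashed (per the list above).
# Fields like stream, user, metadata, and seed are excluded.
CACHE_KEY_FIELDS = [
    "model", "messages", "system", "temperature", "max_tokens",
    "top_p", "top_k", "stop", "tools", "tool_choice",
]

def cache_key(request_body: dict) -> str:
    """Return a SHA-256 hex digest over the response-influencing fields."""
    relevant = {k: request_body[k] for k in CACHE_KEY_FIELDS if k in request_body}
    # Sorted keys and fixed separators give a deterministic serialization,
    # so equivalent requests always hash to the same key.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Note that two requests differing only in an excluded field (e.g. `stream` or `user`) produce the same key and therefore hit the same cache entry.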

Enabling the Cache

The cache is controlled by the FF_GATEWAY_RESPONSE_CACHE feature flag (disabled by default).

Once enabled globally, each organization can configure its own cache policy:

# Get current cache policy
curl https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx"

# Enable cache for your org with a 30-minute TTL
curl -X PATCH https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "default_ttl": 1800,
    "exclude_models": ["o1", "o3"]
  }'

Cache Policy Options

Field           Type      Default  Description
enabled         boolean   false    Enable/disable cache for this org
default_ttl     integer   3600     Time-to-live in seconds (60–86,400)
exclude_models  string[]  []       Models that should never be cached
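A minimal sketch of how these options might be validated and merged with defaults (function name and error message are illustrative, not the gateway's actual code):

```python
def validate_policy(policy: dict) -> dict:
    """Merge a partial policy with defaults and enforce the TTL bounds."""
    defaults = {"enabled": False, "default_ttl": 3600, "exclude_models": []}
    merged = {**defaults, **policy}
    # The documented valid range for default_ttl is 60 to 86,400 seconds.
    if not 60 <= merged["default_ttl"] <= 86_400:
        raise ValueError("default_ttl must be between 60 and 86400 seconds")
    return merged
```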

What Gets Cached

Request Type                        Cached?
Non-streaming chat completions      Yes
Non-streaming messages              Yes
Streaming requests (stream: true)   No
Tool-use completions                Yes (tools are part of the cache key)
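The rules in the table, combined with the per-org policy, reduce to a single cacheability check. This is a sketch (function name and policy shape are assumptions):

```python
def is_cacheable(request_body: dict, policy: dict) -> bool:
    """Decide whether a request may be served from / written to the cache."""
    if not policy.get("enabled"):
        return False
    # Streaming requests are never cached.
    if request_body.get("stream"):
        return False
    # Models on the org's exclude list are never cached.
    if request_body.get("model") in policy.get("exclude_models", []):
        return False
    return True
```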

Cached Response Format

When a response is served from cache, it includes the original response body, token usage, model name, and provider. The cache is transparent to clients — cached responses are indistinguishable from fresh ones.
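A sketch of what a stored cache entry might look like, based on the fields listed above (the exact layout and key names are assumptions; in practice the entry would be written to Redis with SETEX using the policy TTL):

```python
def build_cache_entry(response_body: dict, provider: str) -> dict:
    """Assemble the fields the docs say a cached response carries."""
    return {
        "body": response_body,                       # original response body
        "usage": response_body.get("usage", {}),     # token usage
        "model": response_body.get("model"),         # model name
        "provider": provider,                        # upstream provider
    }
```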

Cache Statistics

Monitor cache performance through the admin API:

curl https://api.curate-me.ai/gateway/admin/cache/stats \
  -H "X-CM-API-Key: cm_sk_xxx"

Returns:

{
  "hits": 1247,
  "misses": 3891,
  "sets": 3891,
  "evictions": 102,
  "hit_rate": 24.3,
  "total_entries": 3789
}
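The `hit_rate` field is the percentage of lookups served from cache. A quick check against the example numbers:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage, rounded to one decimal place."""
    total = hits + misses
    return round(100 * hits / total, 1) if total else 0.0

# From the example stats: 1247 / (1247 + 3891) ≈ 0.2427, i.e. 24.3%
```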

Cache Invalidation

# Flush all cached responses for your org
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache/{org_id} \
  -H "X-CM-API-Key: cm_sk_xxx"

# Flush the entire cache (admin only)
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache \
  -H "X-CM-API-Key: cm_sk_xxx"

Cache entries also expire automatically based on their TTL.

Cost Savings

The cache is most effective for:

  • Repeated queries — chatbots with common questions, FAQ agents
  • Development and testing — same prompts during iteration
  • Multi-user systems — shared prompts across users in the same org

A 25% hit rate on a $200/month LLM spend saves ~$50/month. Higher hit rates (common in testing and shared-prompt scenarios) save proportionally more.
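The savings estimate above assumes a cache hit costs nothing and that cached requests have roughly average cost, so savings scale linearly with hit rate:

```python
def monthly_savings(monthly_spend: float, hit_rate_pct: float) -> float:
    """Estimated monthly savings, assuming hits cost nothing and
    cached requests have average cost."""
    return monthly_spend * hit_rate_pct / 100

# $200/month spend at a 25% hit rate → $50/month saved
```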

Backend Implementation

File                           Purpose
src/gateway/response_cache.py  Cache engine, key generation, TTL management
src/gateway/gateway_cache.py   Admin API routes (stats, policy, invalidation)