# Response Cache
The gateway includes a response cache that stores LLM responses and serves them for identical subsequent requests. This eliminates duplicate provider calls and reduces both latency and cost.
## How It Works
When a request arrives, the gateway generates a cache key by hashing the response-influencing fields of the request body:

```
model, messages, system, temperature, max_tokens, top_p, top_k,
stop, tools, tool_choice
```

Fields that don't affect the response (such as `stream`, `user`, `metadata`, and `seed`) are excluded from the hash. Messages are normalized recursively for deterministic hashing.
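The key derivation described above could be sketched as follows. This is a minimal illustration, not the gateway's actual code; the real normalization lives in `src/gateway/response_cache.py`, and the field list is taken from the one above:

```python
import hashlib
import json

# Fields assumed to influence the response (from the list above).
CACHE_KEY_FIELDS = ("model", "messages", "system", "temperature",
                    "max_tokens", "top_p", "top_k", "stop", "tools", "tool_choice")

def cache_key(body: dict) -> str:
    # Keep only response-influencing fields; stream, user, metadata,
    # and seed are dropped before hashing.
    relevant = {k: body[k] for k in CACHE_KEY_FIELDS if k in body}
    # sort_keys=True serializes nested dicts deterministically, so two
    # logically identical requests always hash to the same key.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because excluded fields never reach the hash, two requests that differ only in `stream` or `user` resolve to the same cache entry.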
```
Request → Normalize → SHA-256 → Redis Lookup
                                  ├── HIT  → Return cached response
                                  └── MISS → Proxy to provider → Cache response → Return
```

## Enabling the Cache
The cache is controlled by the FF_GATEWAY_RESPONSE_CACHE feature flag (disabled by default).
Once the flag is enabled globally, each organization can configure its own cache policy:
```bash
# Get current cache policy
curl https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx"

# Enable cache for your org with a 30-minute TTL
curl -X PATCH https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "default_ttl": 1800,
    "exclude_models": ["o1", "o3"]
  }'
```

## Cache Policy Options
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Enable/disable the cache for this org |
| `default_ttl` | integer | `3600` | Time-to-live in seconds (60–86,400) |
| `exclude_models` | string[] | `[]` | Models that should never be cached |
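Server-side validation of a policy update might enforce the bounds in the table above. The function below is a hypothetical sketch, not the gateway's actual implementation:

```python
# Hypothetical validation of a cache policy PATCH body, enforcing the
# defaults and bounds from the table above.
MIN_TTL, MAX_TTL = 60, 86_400

def validate_policy(policy: dict) -> dict:
    enabled = policy.get("enabled", False)
    if not isinstance(enabled, bool):
        raise ValueError("enabled must be a boolean")
    ttl = policy.get("default_ttl", 3600)
    if not isinstance(ttl, int) or not (MIN_TTL <= ttl <= MAX_TTL):
        raise ValueError(f"default_ttl must be an integer in [{MIN_TTL}, {MAX_TTL}]")
    excluded = policy.get("exclude_models", [])
    if not all(isinstance(m, str) for m in excluded):
        raise ValueError("exclude_models must be a list of model names")
    return {"enabled": enabled, "default_ttl": ttl, "exclude_models": list(excluded)}
```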
## What Gets Cached
| Request Type | Cached? |
|---|---|
| Non-streaming chat completions | Yes |
| Non-streaming messages | Yes |
| Streaming requests (`stream: true`) | No |
| Tool-use completions | Yes (tools are part of the cache key) |
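The rules in the table, combined with the org's cache policy, amount to a simple pre-flight check. A hypothetical sketch (function name and shapes are illustrative, not the gateway's API):

```python
# Hypothetical pre-flight check: should this request go through the cache?
def is_cacheable(body: dict, policy: dict) -> bool:
    if not policy.get("enabled", False):
        return False          # org has the cache disabled
    if body.get("stream", False):
        return False          # streaming requests are never cached
    if body.get("model") in policy.get("exclude_models", []):
        return False          # model opted out via the policy
    return True               # tool-use is fine: tools are part of the key
```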
## Cached Response Format
When a response is served from cache, it includes the original response body, token usage, model name, and provider. The cache is transparent to clients — cached responses are indistinguishable from fresh ones.
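The transparent hit/miss flow from the diagram above could look like the sketch below. It is an illustration under assumptions: `redis_client` is anything exposing Redis-style `get`/`setex` (e.g. a `redis.Redis` instance), and `call_provider` stands in for the proxied upstream call:

```python
import json

# Hypothetical serve-from-cache flow: return a cached body verbatim on a
# hit, otherwise proxy to the provider and store the result with a TTL.
def complete(redis_client, call_provider, key: str, body: dict, ttl: int = 3600):
    cached = redis_client.get(key)
    if cached is not None:
        # HIT: the stored body is returned as-is, so clients cannot
        # tell it apart from a fresh provider response.
        return json.loads(cached)
    response = call_provider(body)                       # MISS: proxy upstream
    redis_client.setex(key, ttl, json.dumps(response))   # cache with policy TTL
    return response
```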
## Cache Statistics
Monitor cache performance through the admin API:
```bash
curl https://api.curate-me.ai/gateway/admin/cache/stats \
  -H "X-CM-API-Key: cm_sk_xxx"
```

Returns:

```json
{
  "hits": 1247,
  "misses": 3891,
  "sets": 3891,
  "evictions": 102,
  "hit_rate": 24.3,
  "total_entries": 3789
}
```

## Cache Invalidation
```bash
# Flush all cached responses for your org
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache/{org_id} \
  -H "X-CM-API-Key: cm_sk_xxx"

# Flush the entire cache (admin only)
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache \
  -H "X-CM-API-Key: cm_sk_xxx"
```

Cache entries also expire automatically based on their TTL.
## Cost Savings
The cache is most effective for:
- Repeated queries — chatbots with common questions, FAQ agents
- Development and testing — same prompts during iteration
- Multi-user systems — shared prompts across users in the same org
A 25% hit rate on a $200/month LLM spend saves ~$50/month. Higher hit rates (common in testing and shared-prompt scenarios) save proportionally more.
## Backend Implementation
| File | Purpose |
|---|---|
| `src/gateway/response_cache.py` | Cache engine, key generation, TTL management |
| `src/gateway/gateway_cache.py` | Admin API routes (stats, policy, invalidation) |