
Response Cache

The gateway includes a response cache that stores LLM responses and serves them for identical subsequent requests. This eliminates duplicate provider calls and reduces both latency and cost.

How It Works

When a request arrives, the gateway generates a cache key by hashing the response-influencing fields of the request body:

  • model, messages, system, temperature, max_tokens, top_p, top_k, stop, tools, tool_choice

Fields that don’t affect the response (like stream, user, metadata, seed) are excluded from the hash. Messages are normalized recursively for deterministic hashing.

Request → Normalize → SHA-256 → Redis Lookup
  ├── HIT  → Return cached response
  └── MISS → Proxy to provider → Cache response → Return
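The key-generation step above can be sketched as follows. The field list comes from this section; the serialization details (sorted-key JSON) are an assumption for illustration, not necessarily how the gateway canonicalizes requests:

```python
import hashlib
import json

# Only response-influencing fields are hashed (per the list above).
# Fields like stream, user, metadata, and seed are excluded.
CACHE_KEY_FIELDS = [
    "model", "messages", "system", "temperature", "max_tokens",
    "top_p", "top_k", "stop", "tools", "tool_choice",
]

def cache_key(request_body: dict) -> str:
    """Return a SHA-256 hex digest over the response-influencing fields."""
    relevant = {k: request_body[k] for k in CACHE_KEY_FIELDS if k in request_body}
    # Sorted keys and fixed separators give a deterministic serialization,
    # so equivalent requests always hash to the same key.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Note that two requests differing only in an excluded field (e.g. `stream` or `user`) produce the same key and therefore hit the same cache entry.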

Enabling the Cache

The cache is controlled by the FF_GATEWAY_RESPONSE_CACHE feature flag (disabled by default).

Once enabled globally, each organization can configure its own cache policy:

# Get current cache policy
curl https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx"

# Enable cache for your org with a 30-minute TTL
curl -X PATCH https://api.curate-me.ai/gateway/admin/cache/policy \
  -H "X-CM-API-Key: cm_sk_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "default_ttl": 1800,
    "exclude_models": ["o1", "o3"]
  }'

Cache Policy Options

Field           Type      Default  Description
enabled         boolean   false    Enable/disable cache for this org
default_ttl     integer   3600     Time-to-live in seconds (60–86,400)
exclude_models  string[]  []       Models that should never be cached
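A minimal sketch of how these options might be validated and merged with defaults (function name and error message are illustrative, not the gateway's actual code):

```python
def validate_policy(policy: dict) -> dict:
    """Merge a partial policy with defaults and enforce the TTL bounds."""
    defaults = {"enabled": False, "default_ttl": 3600, "exclude_models": []}
    merged = {**defaults, **policy}
    # The documented valid range for default_ttl is 60 to 86,400 seconds.
    if not 60 <= merged["default_ttl"] <= 86_400:
        raise ValueError("default_ttl must be between 60 and 86400 seconds")
    return merged
```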

What Gets Cached

Request Type                        Cached?
Non-streaming chat completions      Yes
Non-streaming messages              Yes
Streaming requests (stream: true)   No
Tool-use completions                Yes (tools are part of the cache key)
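The rules in the table, combined with the per-org policy, reduce to a single cacheability check. This is a sketch (function name and policy shape are assumptions):

```python
def is_cacheable(request_body: dict, policy: dict) -> bool:
    """Decide whether a request may be served from / written to the cache."""
    if not policy.get("enabled"):
        return False
    # Streaming requests are never cached.
    if request_body.get("stream"):
        return False
    # Models on the org's exclude list are never cached.
    if request_body.get("model") in policy.get("exclude_models", []):
        return False
    return True
```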

Cached Response Format

When a response is served from cache, it includes the original response body, token usage, model name, and provider. The cache is transparent to clients — cached responses are indistinguishable from fresh ones.
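A sketch of what a stored cache entry might look like, based on the fields listed above (the exact layout and key names are assumptions; in practice the entry would be written to Redis with SETEX using the policy TTL):

```python
def build_cache_entry(response_body: dict, provider: str) -> dict:
    """Assemble the fields the docs say a cached response carries."""
    return {
        "body": response_body,                       # original response body
        "usage": response_body.get("usage", {}),     # token usage
        "model": response_body.get("model"),         # model name
        "provider": provider,                        # upstream provider
    }
```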

Cache Statistics

Monitor cache performance through the admin API:

curl https://api.curate-me.ai/gateway/admin/cache/stats \
  -H "X-CM-API-Key: cm_sk_xxx"

Returns:

{
  "hits": 1247,
  "misses": 3891,
  "sets": 3891,
  "evictions": 102,
  "hit_rate": 24.3,
  "total_entries": 3789
}
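The `hit_rate` field is the percentage of lookups served from cache. A quick check against the example numbers:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a percentage, rounded to one decimal place."""
    total = hits + misses
    return round(100 * hits / total, 1) if total else 0.0

# From the example stats: 1247 / (1247 + 3891) ≈ 0.2427, i.e. 24.3%
```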

Cache Invalidation

# Flush all cached responses for your org
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache/{org_id} \
  -H "X-CM-API-Key: cm_sk_xxx"

# Flush the entire cache (admin only)
curl -X DELETE https://api.curate-me.ai/gateway/admin/cache \
  -H "X-CM-API-Key: cm_sk_xxx"

Cache entries also expire automatically based on their TTL.

Cost Savings

The cache is most effective for:

  • Repeated queries — chatbots with common questions, FAQ agents
  • Development and testing — same prompts during iteration
  • Multi-user systems — shared prompts across users in the same org

A 25% hit rate on a $200/month LLM spend saves ~$50/month. Higher hit rates (common in testing and shared-prompt scenarios) save proportionally more.
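The savings estimate above assumes a cache hit costs nothing and that cached requests have roughly average cost, so savings scale linearly with hit rate:

```python
def monthly_savings(monthly_spend: float, hit_rate_pct: float) -> float:
    """Estimated monthly savings, assuming hits cost nothing and
    cached requests have average cost."""
    return monthly_spend * hit_rate_pct / 100

# $200/month spend at a 25% hit rate → $50/month saved
```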

Backend Implementation

File                           Purpose
src/gateway/response_cache.py  Cache engine, key generation, TTL management
src/gateway/gateway_cache.py   Admin API routes (stats, policy, invalidation)