Runbook: Redis Incident Response
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1-SEV2
Customer impact: Rate limiting fails open, cost tracking stops, session cache miss
Required access: SSH to VPS
Related services: curateme-redis, curateme-redis-broker
This runbook covers diagnosing and resolving Redis issues on the Curate-Me platform. Redis is the backbone of the gateway governance chain — it handles rate limiting counters, real-time cost accumulation, hierarchical budget tracking, session state, and governance decision caching. Follow the steps in order.
Symptoms
- Rate limiting not enforced — requests exceeding configured limits pass through
- Cost tracking falls behind — dashboard daily spend shows stale or zero values
- Governance chain latency spikes: the X-CM-Governance-Time-Ms header jumps above 200ms
- Gateway logs show ConnectionError or TimeoutError referencing Redis
- Budget alerts not firing despite spend exceeding thresholds
- HITL gate decisions not cached, causing duplicate approval requests
- ./scripts/analytics health reports redis_connected: false or redis_latency_ms > 50
Step 1: Check Redis container health
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
# Container status (include stopped with -a)
docker ps -a --filter name=curateme-redis --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Resource usage
docker stats curateme-redis --no-stream
# Verify Redis responds
docker exec curateme-redis redis-cli ping
# Expected: PONG
# Server info
docker exec curateme-redis redis-cli info server | grep -E "redis_version|uptime_in_seconds|tcp_port"
# Recent errors in container logs
docker logs curateme-redis --tail 200 --timestamps 2>&1 | grep -iE "error|warning|oom|denied|overcommit"
Step 2: Memory diagnostics
docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|used_memory_peak_human|maxmemory_human|maxmemory_policy|mem_fragmentation_ratio"
| Metric | Healthy | Investigate if |
|---|---|---|
| used_memory_human | < 80% of maxmemory | > 90% of maxmemory |
| maxmemory_policy | allkeys-lru or volatile-lru | noeviction (writes fail when full) |
| mem_fragmentation_ratio | 1.0 to 1.5 | > 2.0 (fragmentation) or < 1.0 (swapping) |
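The usage thresholds in the table can be checked mechanically. The sketch below is a hypothetical helper, not part of the platform scripts: check_redis_memory reads INFO memory output on stdin and compares the raw used_memory and maxmemory byte counts. The WATCH label for the 80% to 90% band is an assumption, since the table leaves that range unclassified.

```shell
# Hypothetical helper: classify memory usage from `redis-cli info memory` output on stdin.
# Thresholds follow the table: below 80% healthy, above 90% investigate.
check_redis_memory() {
  # INFO output uses CRLF line endings, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      # maxmemory 0 means no limit is configured, so a percentage is meaningless.
      if (max + 0 == 0) { print "UNBOUNDED: maxmemory is 0 (no limit set)"; exit }
      pct = used * 100 / max
      if (pct > 90)      status = "INVESTIGATE"
      else if (pct < 80) status = "HEALTHY"
      else               status = "WATCH"   # assumed label for the 80-90% band
      printf "%s: %.1f%% of maxmemory\n", status, pct
    }'
}
```

On the VPS this could be run as: docker exec curateme-redis redis-cli info memory | check_redis_memory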
Check keyspace distribution to find what consumes memory:
docker exec curateme-redis redis-cli info keyspace
# Count keys by gateway namespace
for prefix in "gateway:ratelimit" "gateway:cost" "gateway:budget" "gateway:dlq" "gateway:hitl" "session"; do
  count=$(docker exec curateme-redis redis-cli --scan --pattern "${prefix}:*" | wc -l)
  echo "${prefix}:* => $count keys"
done
# Total key count
docker exec curateme-redis redis-cli dbsize
Step 3: Identify root cause
Cause A: Memory exhaustion
Redis has hit maxmemory and is either evicting keys or rejecting writes.
Diagnosis:
docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
docker exec curateme-redis redis-cli info stats | grep -E "evicted_keys|rejected_connections"
# If evicted_keys is climbing, data is being lost. If maxmemory_policy is noeviction, writes fail.
Fix (immediate): Flush stale keys that are safe to remove (rate-limit windows auto-recreate):
# Remove rate-limit and session keys that lack a TTL (orphaned)
for prefix in "gateway:ratelimit" "session"; do
  docker exec curateme-redis redis-cli --scan --pattern "${prefix}:*" | while read -r key; do
    # TTL returns -1 for a key that exists but has no expiry
    ttl=$(docker exec curateme-redis redis-cli ttl "$key")
    [ "$ttl" -eq -1 ] && docker exec curateme-redis redis-cli del "$key"
  done
done
Fix (immediate): Set eviction policy if currently noeviction:
docker exec curateme-redis redis-cli config set maxmemory-policy allkeys-lru
Fix (long-term): Increase maxmemory and persist:
docker exec curateme-redis redis-cli config set maxmemory 536870912 # 512MB
docker exec curateme-redis redis-cli config rewrite
Cause B: Connection limit reached
All client connections consumed; new connections rejected.
Diagnosis:
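docker exec curateme-redis redis-cli info clients | grep -E "connected_clients|maxclients|blocked_clients"
# rejected_connections is reported in the stats section, not the clients section
docker exec curateme-redis redis-cli info stats | grep rejected_connections
| Metric | Healthy | Investigate if |
|---|---|---|
| connected_clients | < 100 | near maxclients |
| rejected_connections | 0 | > 0 |
| blocked_clients | 0 | > 10 |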
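The three checks in the table can be combined into one pass. This is a minimal sketch, assuming the output of info clients and info stats is piped in together; check_redis_clients is a hypothetical name, and treating "near maxclients" as 90% or more is an assumption, not a documented threshold. If maxclients is absent from the INFO output, that check is skipped (config get maxclients returns the value directly).

```shell
# Hypothetical helper: flag connection problems from INFO clients + INFO stats on stdin.
check_redis_clients() {
  # INFO output uses CRLF line endings, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    $1 == "connected_clients"    { conn = $2 }
    $1 == "maxclients"           { max = $2 }
    $1 == "blocked_clients"      { blocked = $2 }
    $1 == "rejected_connections" { rejected = $2 }
    END {
      problems = 0
      # "near maxclients" is assumed to mean 90% or more of the limit.
      if (max + 0 > 0 && conn * 100 / max >= 90) {
        print "connected_clients near maxclients: " conn "/" max; problems++
      }
      if (rejected + 0 > 0) { print "rejected_connections > 0: " rejected; problems++ }
      if (blocked + 0 > 10) { print "blocked_clients > 10: " blocked; problems++ }
      if (problems == 0) print "OK"
    }'
}
```

A possible invocation: { docker exec curateme-redis redis-cli info clients; docker exec curateme-redis redis-cli info stats; } | check_redis_clients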
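Fix (immediate): Find and close idle connections, then cap idle time server-side:
# List clients idle > 300s
docker exec curateme-redis redis-cli client list | grep -oP 'id=\d+ .* idle=\d+' | \
  awk -F'idle=' '{if ($2 > 300) print $0}'
# Close one of the listed clients by its id (replace <id>)
docker exec curateme-redis redis-cli client kill id <id>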
# Set server-side idle timeout
docker exec curateme-redis redis-cli config set timeout 300
docker exec curateme-redis redis-cli config set maxclients 500
docker exec curateme-redis redis-cli config rewrite
Cause C: Persistence failures (AOF/RDB)
Background save or AOF rewrite is failing, which can degrade performance during fork.
Diagnosis:
docker exec curateme-redis redis-cli info persistence | grep -E "rdb_last_bgsave_status|aof_last_bgrewrite_status|aof_last_write_status"
docker logs curateme-redis --tail 100 2>&1 | grep -iE "rdb|aof|bgsave|fork|can't save"
Fix: If background saves fail because the fork cannot get enough memory (copy-on-write can need up to roughly 2x used memory under heavy writes), let writes continue while you free memory:
docker exec curateme-redis redis-cli config set stop-writes-on-bgsave-error no
# If AOF is corrupt, trigger manual rewrite:
docker exec curateme-redis redis-cli bgrewriteaof
If persistence is not critical (rate-limit and cost counters rebuild on their own), disable RDB snapshots with config set save "" (and AOF with config set appendonly no, if it is enabled), then run config rewrite.
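The Cause C diagnosis reduces to three status fields that should all read ok. The sketch below, with the hypothetical name persistence_failures, prints any of those fields whose value is not ok and returns a non-zero status so it can gate a script. It assumes INFO persistence output on stdin.

```shell
# Hypothetical helper: report persistence status fields that are not "ok".
# Reads `redis-cli info persistence` output on stdin; exits 1 if any field failed.
persistence_failures() {
  # INFO output uses CRLF line endings, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    ($1 == "rdb_last_bgsave_status" ||
     $1 == "aof_last_bgrewrite_status" ||
     $1 == "aof_last_write_status") && $2 != "ok" {
      print $1 "=" $2
      bad = 1
    }
    END { exit bad + 0 }'
}
```

A possible invocation: docker exec curateme-redis redis-cli info persistence | persistence_failures || echo "persistence needs attention"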
Cause D: High latency / slow commands
Diagnosis:
# Check slow log (commands exceeding 10ms default threshold)
docker exec curateme-redis redis-cli slowlog get 20
# Ops/sec baseline
docker exec curateme-redis redis-cli info stats | grep instantaneous_ops_per_sec
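# Latency sample: when stdout is not a TTY (as with docker exec without -t),
# --latency samples for the -i interval (10s here), prints one summary, and exits
docker exec curateme-redis redis-cli --latency -i 10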
# Check for dangerous commands (KEYS, FLUSHALL)
docker exec curateme-redis redis-cli slowlog get 50 | grep -iE "keys|flushall|flushdb"
Fix: Ensure no scripts or monitoring tools use KEYS *; always use SCAN. If slowlog shows specific expensive commands, trace them back to the calling service and optimize.
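One way to check for KEYS usage is a text search over the directories that hold operational scripts. The sketch below is a heuristic under stated assumptions: find_keys_usage is a hypothetical helper, it only catches redis-cli invocations on a single line, and it will miss KEYS calls made through client libraries.

```shell
# Hypothetical helper: list lines that invoke KEYS through redis-cli.
# $1 is the directory to search. Output format is file:line:content.
find_keys_usage() {
  # Match "redis-cli", optional flags, then KEYS as its own word (case-insensitive).
  # Words such as "keyspace" or "evicted_keys" do not match.
  grep -rniE 'redis-cli[^|;#]*[[:space:]]keys[[:space:]]' "$1"
}
```

For example, find_keys_usage ./scripts prints each offending line, and the function returns non-zero when nothing matches. Each hit is a candidate for replacement with redis-cli --scan --pattern, which the steps above already use.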
Step 4: Impact assessment
When Redis is unavailable, these gateway features degrade:
| Feature | Behavior when Redis is down |
|---|---|
| Rate limiting | Fails open — all requests pass, no enforcement |
| Cost accumulation | Lags — INCRBYFLOAT unavailable, fallback path may lose increments |
| Budget enforcement | Disabled — daily/per-request budgets not checked, spend can exceed limits |
| Hierarchical budgets | Disabled — org/team rollups stop updating |
| HITL gate caching | Disabled — duplicate approval prompts may appear |
| Dead-letter queue | Unavailable — failed usage records cannot be queued |
| Session state | Lost — active sessions lose context on reconnect |
Check the dead-letter queue and cost drift:
docker exec curateme-redis redis-cli llen gateway:dlq:usage_records
./scripts/analytics costs today
# Compare redis_daily_total vs mongodb_daily_total -- drift > 10% means records were lost
Step 5: Verify resolution
# Basic connectivity
docker exec curateme-redis redis-cli ping
# Memory headroom
docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
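# Connection headroom (rejected_connections is reported in the stats section)
docker exec curateme-redis redis-cli info clients | grep connected_clients
docker exec curateme-redis redis-cli info stats | grep rejected_connections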
# Slow log not growing
docker exec curateme-redis redis-cli slowlog len
# Verify rate limiting works (RateLimit-Remaining header should decrement across calls)
# -D - prints the response headers of the POST; -I would send a HEAD request and conflicts with -d
curl -s -o /dev/null -D - https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' \
  | grep -i ratelimit
# Cost accumulation recording
./scripts/analytics costs today
# Replay any dead-letter records accumulated during outage
docker exec curateme-redis redis-cli llen gateway:dlq:usage_records
# If > 0:
./scripts/analytics costs replay-dlq
# Full platform health
./scripts/analytics health
./scripts/errors recent
Escalation
If the issue persists after working through all steps:
- Collect diagnostics:
  docker exec curateme-redis redis-cli info > /tmp/redis-info.txt
  docker exec curateme-redis redis-cli slowlog get 100 > /tmp/redis-slowlog.txt
  docker exec curateme-redis redis-cli client list > /tmp/redis-clients.txt
  docker logs curateme-redis --tail 1000 > /tmp/redis-incident.log 2>&1
  ./scripts/errors by-source gateway > /tmp/gateway-errors.log
- Check cost drift and reconcile:
  ./scripts/analytics costs reconcile
- Record the timeline: when symptoms started, traffic volume, recent deploys
- Check for recent changes:
  ./scripts/errors recent --since 24h
- Contact the platform team with collected diagnostics, cost drift report, and timeline
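For the cost drift report, the 10% threshold from Step 4 can be computed from the two daily totals. This is a sketch under an assumption the runbook does not state: drift is measured relative to mongodb_daily_total, treating MongoDB as the durable record. cost_drift_pct is a hypothetical helper, and the totals are passed in by hand.

```shell
# Hypothetical helper: percentage drift between the Redis and MongoDB daily totals.
# $1 = redis_daily_total, $2 = mongodb_daily_total
# Prints the absolute drift in percent; exits 1 if it exceeds 10, 2 if undefined.
cost_drift_pct() {
  awk -v r="$1" -v m="$2" 'BEGIN {
    # Assumption: MongoDB total is the baseline; drift is undefined when it is 0.
    if (m + 0 == 0) { print "n/a"; exit 2 }
    d = (r - m) * 100 / m
    if (d < 0) d = -d
    printf "%.1f\n", d
    if (d > 10) exit 1
    exit 0
  }'
}
```

For example, cost_drift_pct 85 100 prints 15.0 and returns 1, which by the Step 4 rule means usage records were likely lost and the reconcile step should be run.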
Related Runbooks
- MongoDB Incident — MongoDB and Redis are both infrastructure dependencies with similar triage patterns
- Gateway High Latency — Redis slowness directly causes gateway latency spikes
- Cost Accumulation Lag — Redis is the backing store for real-time cost counters
- Rate Limit Exceeded — rate limit counters are stored in Redis