
Runbook: Redis Incident Response

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1-SEV2
Customer impact: Rate limiting fails open, cost tracking stops, session cache miss
Required access: SSH to VPS
Related services: curateme-redis, curateme-redis-broker

This runbook covers diagnosing and resolving Redis issues on the Curate-Me platform. Redis is the backbone of the gateway governance chain — it handles rate limiting counters, real-time cost accumulation, hierarchical budget tracking, session state, and governance decision caching. Follow the steps in order.


Symptoms

  • Rate limiting not enforced — requests exceeding configured limits pass through
  • Cost tracking falls behind — dashboard daily spend shows stale or zero values
  • Governance chain latency spikes — X-CM-Governance-Time-Ms header jumps above 200ms
  • Gateway logs show ConnectionError or TimeoutError referencing Redis
  • Budget alerts not firing despite spend exceeding thresholds
  • HITL gate decisions not cached, causing duplicate approval requests
  • ./scripts/analytics health reports redis_connected: false or redis_latency_ms > 50

Step 1: Check Redis container health

ssh $DEPLOY_USER@$PLATFORM_VPS_IP

# Container status (include stopped with -a)
docker ps -a --filter name=curateme-redis --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Resource usage
docker stats curateme-redis --no-stream

# Verify Redis responds
docker exec curateme-redis redis-cli ping
# Expected: PONG

# Server info
docker exec curateme-redis redis-cli info server | grep -E "redis_version|uptime_in_seconds|tcp_port"

# Recent errors in container logs
docker logs curateme-redis --tail 200 --timestamps 2>&1 | grep -iE "error|warning|oom|denied|overcommit"

Step 2: Memory diagnostics

docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|used_memory_peak_human|maxmemory_human|maxmemory_policy|mem_fragmentation_ratio"
| Metric | Healthy | Investigate if |
| --- | --- | --- |
| used_memory_human | < 80% of maxmemory | > 90% of maxmemory |
| maxmemory_policy | allkeys-lru or volatile-lru | noeviction (writes fail when full) |
| mem_fragmentation_ratio | 1.0 to 1.5 | > 2.0 (fragmentation) or < 1.0 (swapping) |
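The percentage thresholds in the table above can be checked mechanically. The following is a minimal sketch, not part of the platform tooling: the function name memory_status is hypothetical, and the intermediate "watch" band for 80% to 90% is an assumption, since the table only defines the two ends. It reads INFO memory output on stdin and uses the raw used_memory and maxmemory byte fields rather than the human-readable ones.

```shell
# Sketch: classify Redis memory usage from `INFO memory` output read on stdin.
# memory_status is a hypothetical helper; the "watch" band is an assumption.
memory_status() {
  # INFO output uses CRLF line endings, so strip \r before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      # maxmemory 0 means no limit is configured, so a percentage is undefined.
      if (max == 0) { print "no-limit"; exit }
      pct = used * 100 / max
      if (pct > 90)       status = "investigate"
      else if (pct >= 80) status = "watch"
      else                status = "healthy"
      printf "%s %.0f%%\n", status, pct
    }'
}
```

On the VPS this could be fed with `docker exec curateme-redis redis-cli info memory | memory_status`.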

Check keyspace distribution to find what consumes memory:

docker exec curateme-redis redis-cli info keyspace

# Count keys by gateway namespace
for prefix in "gateway:ratelimit" "gateway:cost" "gateway:budget" "gateway:dlq" "gateway:hitl" "session"; do
  count=$(docker exec curateme-redis redis-cli --scan --pattern "${prefix}:*" | wc -l)
  echo "${prefix}:* => $count keys"
done

# Total key count
docker exec curateme-redis redis-cli dbsize
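The loop above walks the keyspace once per prefix. On a large instance a single pass may be cheaper. The following is a minimal sketch, assuming keys are colon-delimited and that gateway namespaces are identified by their first two segments (as in the prefixes listed above); the function name count_by_namespace is hypothetical.

```shell
# Sketch: count keys per namespace from a list of key names read on stdin.
# count_by_namespace is a hypothetical helper, not an existing platform script.
count_by_namespace() {
  tr -d '\r' | awk -F: '
    NF == 0 { next }   # skip blank lines
    {
      # "gateway:<area>:..." is grouped as "gateway:<area>"; anything else by first segment.
      ns = ($1 == "gateway" && NF > 1) ? ($1 ":" $2) : $1
      count[ns]++
    }
    END { for (ns in count) print ns, count[ns] }' | LC_ALL=C sort
}
```

A possible invocation is `docker exec curateme-redis redis-cli --scan | count_by_namespace`, which also surfaces namespaces that are not in the hard-coded prefix list.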

Step 3: Identify root cause

Cause A: Memory exhaustion

Redis has hit maxmemory and is either evicting keys or rejecting writes.

Diagnosis:

docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
docker exec curateme-redis redis-cli info stats | grep -E "evicted_keys|rejected_connections"
# If evicted_keys is climbing, data is being lost. If maxmemory_policy is noeviction, writes fail.

Fix (immediate): Flush stale keys that are safe to remove (rate-limit windows auto-recreate):

# Remove rate-limit and session keys that lack a TTL (orphaned)
for prefix in "gateway:ratelimit" "session"; do
  docker exec curateme-redis redis-cli --scan --pattern "${prefix}:*" | while read -r key; do
    ttl=$(docker exec curateme-redis redis-cli ttl "$key")
    # TTL returns -1 for a key that exists but has no expiry
    if [ "$ttl" -eq -1 ]; then
      docker exec curateme-redis redis-cli del "$key"
    fi
  done
done

Fix (immediate): Set eviction policy if currently noeviction:

docker exec curateme-redis redis-cli config set maxmemory-policy allkeys-lru

Fix (long-term): Increase maxmemory and persist:

docker exec curateme-redis redis-cli config set maxmemory 536870912  # 512MB
docker exec curateme-redis redis-cli config rewrite
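The byte value above is 512 × 1024 × 1024. When choosing a different limit, the conversion can be scripted rather than typed by hand. A minimal sketch, with a hypothetical helper name:

```shell
# Sketch: convert a size in MB (1 MB = 1024 * 1024 bytes) to a byte count.
# mb_to_bytes is a hypothetical helper for building the maxmemory argument.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}
```

For example, `docker exec curateme-redis redis-cli config set maxmemory "$(mb_to_bytes 768)"` would set a 768MB limit.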

Cause B: Connection limit reached

All client connections consumed; new connections rejected.

Diagnosis:

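docker exec curateme-redis redis-cli info clients | grep -E "connected_clients|maxclients|blocked_clients"
# rejected_connections is reported in the stats section, not the clients section
docker exec curateme-redis redis-cli info stats | grep rejected_connections

| Metric | Healthy | Investigate if |
| --- | --- | --- |
| connected_clients | < 100 | near maxclients |
| rejected_connections | 0 | > 0 |
| blocked_clients | 0 | > 10 |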

Fix (immediate): Kill idle connections:

# List clients idle > 300s
docker exec curateme-redis redis-cli client list | grep -oP 'id=\d+ .* idle=\d+' | \
  awk -F'idle=' '{if ($2 > 300) print $0}'

# Set server-side idle timeout
docker exec curateme-redis redis-cli config set timeout 300
docker exec curateme-redis redis-cli config set maxclients 500
docker exec curateme-redis redis-cli config rewrite
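The grep -oP filter above depends on GNU grep. A portable alternative is to parse the key=value fields of CLIENT LIST directly. The following is a minimal sketch under that assumption; the function name idle_clients is hypothetical, and it only reports clients, it does not close them.

```shell
# Sketch: print "<id> <idle_seconds>" for clients idle longer than a threshold.
# Reads `CLIENT LIST` output on stdin; idle_clients is a hypothetical helper.
idle_clients() {
  # First argument is the idle threshold in seconds (defaults to 300).
  tr -d '\r' | awk -v limit="${1:-300}" '
    {
      id = ""; idle = ""
      # Each line is a space-separated list of key=value fields.
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "id")   id = kv[2]
        if (kv[1] == "idle") idle = kv[2]
      }
      if (id != "" && idle + 0 > limit + 0) print id, idle
    }'
}
```

Usage could look like `docker exec curateme-redis redis-cli client list | idle_clients 300`. Each reported id can then be closed individually with the Redis command `CLIENT KILL ID <id>` if waiting for the server-side timeout is not acceptable.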

Cause C: Persistence failures (AOF/RDB)

Background save or AOF rewrite is failing, which can degrade performance during fork.

Diagnosis:

docker exec curateme-redis redis-cli info persistence | grep -E "rdb_last_bgsave_status|aof_last_bgrewrite_status|aof_last_write_status"
docker logs curateme-redis --tail 100 2>&1 | grep -iE "rdb|aof|bgsave|fork|can't save"
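Each of the three status fields above reads ok when the last operation succeeded. For use in a script or alert, the check can be reduced to an exit code. A minimal sketch, with a hypothetical function name, that reads INFO persistence output on stdin:

```shell
# Sketch: print persistence status fields that are not "ok" and
# return non-zero if any were found. persistence_failures is a hypothetical helper.
persistence_failures() {
  # INFO output uses CRLF line endings, so strip \r before comparing values.
  tr -d '\r' | awk -F: '
    $1 ~ /^(rdb_last_bgsave_status|aof_last_bgrewrite_status|aof_last_write_status)$/ && $2 != "ok" {
      print $1 "=" $2
      failed = 1
    }
    END { exit (failed ? 1 : 0) }'
}
```

For example, `docker exec curateme-redis redis-cli info persistence | persistence_failures || echo "persistence degraded"`.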

Fix: If background saves fail due to memory (fork requires ~2x memory):

docker exec curateme-redis redis-cli config set stop-writes-on-bgsave-error no

# If AOF is corrupt, trigger manual rewrite:
docker exec curateme-redis redis-cli bgrewriteaof

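If persistence is not critical (rate-limit and cost counters rebuild on their own), disable RDB snapshots with config set save "" and persist the change with config rewrite. If AOF is enabled, config set appendonly no turns it off as well.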

Cause D: High latency / slow commands

Diagnosis:

# Check slow log (commands exceeding 10ms default threshold)
docker exec curateme-redis redis-cli slowlog get 20

# Ops/sec baseline
docker exec curateme-redis redis-cli info stats | grep instantaneous_ops_per_sec

# Round-trip latency (min/max/avg). redis-cli has no sample-count flag (-c enables cluster mode);
# on recent versions, when run without a TTY it reports once after the -i interval and exits
docker exec curateme-redis redis-cli --latency -i 1

# Check for dangerous commands (KEYS, FLUSHALL)
docker exec curateme-redis redis-cli slowlog get 50 | grep -iE "keys|flushall|flushdb"

Fix: Ensure no scripts or monitoring tools use KEYS * — always use SCAN. If slowlog shows specific expensive commands, trace them back to the calling service and optimize.


Step 4: Impact assessment

When Redis is unavailable, these gateway features degrade:

| Feature | Behavior when Redis is down |
| --- | --- |
| Rate limiting | Fails open: all requests pass, no enforcement |
| Cost accumulation | Lags: INCRBYFLOAT unavailable, fallback path may lose increments |
| Budget enforcement | Disabled: daily/per-request budgets not checked, spend can exceed limits |
| Hierarchical budgets | Disabled: org/team rollups stop updating |
| HITL gate caching | Disabled: duplicate approval prompts may appear |
| Dead-letter queue | Unavailable: failed usage records cannot be queued |
| Session state | Lost: active sessions lose context on reconnect |

Check the dead-letter queue and cost drift:

docker exec curateme-redis redis-cli llen gateway:dlq:usage_records
./scripts/analytics costs today
# Compare redis_daily_total vs mongodb_daily_total -- drift > 10% means records were lost
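The 10% drift threshold can be computed from the two totals. The runbook does not define the drift formula, so the sketch below assumes drift is the absolute difference relative to the MongoDB total; the function name cost_drift_pct is hypothetical, and the two arguments are the redis_daily_total and mongodb_daily_total values copied from the analytics output.

```shell
# Sketch: percentage drift between the Redis and MongoDB daily totals.
# Assumption: drift = |redis - mongo| / mongo * 100 (not defined by the runbook).
cost_drift_pct() {
  awk -v r="$1" -v m="$2" 'BEGIN {
    # Without a MongoDB baseline the ratio is undefined.
    if (m == 0) { print "n/a"; exit }
    d = (r - m) * 100 / m
    if (d < 0) d = -d
    printf "%.1f\n", d
  }'
}
```

For example, `cost_drift_pct 90 100` prints 10.0, which sits exactly at the threshold; anything above it would suggest lost records under this definition.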

Step 5: Verify resolution

# Basic connectivity
docker exec curateme-redis redis-cli ping

# Memory headroom
docker exec curateme-redis redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

# Connection headroom (rejected_connections lives in the stats section)
docker exec curateme-redis redis-cli info clients | grep connected_clients
docker exec curateme-redis redis-cli info stats | grep rejected_connections

# Slow log not growing
docker exec curateme-redis redis-cli slowlog len

# Verify rate limiting works (RateLimit-Remaining header should decrement)
# -D - prints the response headers; -I cannot be combined with -d because it forces a HEAD request
curl -s -o /dev/null -D - https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' \
  | grep -i ratelimit

# Cost accumulation recording
./scripts/analytics costs today

# Replay any dead-letter records accumulated during outage
docker exec curateme-redis redis-cli llen gateway:dlq:usage_records
# If > 0:
./scripts/analytics costs replay-dlq

# Full platform health
./scripts/analytics health
./scripts/errors recent

Escalation

If the issue persists after working through all steps:

  1. Collect diagnostics:
    • docker exec curateme-redis redis-cli info > /tmp/redis-info.txt
    • docker exec curateme-redis redis-cli slowlog get 100 > /tmp/redis-slowlog.txt
    • docker exec curateme-redis redis-cli client list > /tmp/redis-clients.txt
    • docker logs curateme-redis --tail 1000 > /tmp/redis-incident.log 2>&1
    • ./scripts/errors by-source gateway > /tmp/gateway-errors.log
  2. Check cost drift and reconcile: ./scripts/analytics costs reconcile
  3. Record the timeline: when symptoms started, traffic volume, recent deploys
  4. Check for recent changes: ./scripts/errors recent --since 24h
  5. Contact the platform team with collected diagnostics, cost drift report, and timeline