Runbook: Cost Accumulation Falls Behind
Owner: Platform Team
Backup owner: On-call engineer
Last validated: Not yet validated
Validation method: Manual drill
Severity trigger: SEV2
Customer impact: Budget enforcement fails; orgs can overspend before governance catches up
Required access: SSH (VPS), MongoDB, Redis
Related services: curateme-backend-gateway
Real-time cost tracking in the gateway uses a dual-write pattern: Redis for fast atomic increments (INCRBYFLOAT) and MongoDB for durable usage records. When these diverge, budget enforcement breaks — orgs can overspend before governance catches up. Follow these steps in order.
Symptoms
- Dashboard shows lower daily spend than expected
- Budget warnings trigger late or not at all
- Metered billing reports don’t match gateway logs
- X-CM-Daily-Cost response header shows stale values
- Hierarchical budget alerts (50%/75%/90%) fire late or not at all
Step 1: Assess the drift magnitude
Determine how far behind cost tracking has fallen and which layer is affected.
./scripts/analytics costs today
Expected output:
{
"redis_daily_total": 42.15,
"mongodb_daily_total": 38.90,
"drift_pct": 7.7,
"last_redis_write": "2026-05-04T14:32:01Z",
"last_mongo_write": "2026-05-04T14:31:58Z"
}
What to look for:
| Condition | Meaning | Severity |
|---|---|---|
| drift_pct < 2% | Normal: async writes have minor lag | Low |
| drift_pct 2-10% | Redis and MongoDB diverging | Medium |
| drift_pct > 10% | Significant cost tracking failure | High |
| redis_daily_total = 0 | Redis connection lost entirely | Critical |
| last_redis_write > 5 min old | Redis writes stopped | Critical |
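The bands in the table can be applied mechanically when the two totals are already at hand. The following is a minimal sketch, assuming drift is measured relative to the Redis total; that assumption reproduces the 7.7 in the sample output, but the formula actually used by ./scripts/analytics is not documented here. The helper names drift_pct and drift_severity are hypothetical and not part of the analytics scripts.

```shell
#!/bin/sh
# Sketch: recompute drift and map it to the severity bands in the table.
# Assumption: drift is relative to the Redis total (42.15 vs 38.90 -> 7.7).

drift_pct() {
    # $1 = redis_daily_total, $2 = mongodb_daily_total
    awk -v r="$1" -v m="$2" 'BEGIN {
        if (r <= 0) { print "n/a"; exit }
        d = (r - m) / r * 100
        if (d < 0) d = -d
        printf "%.1f\n", d
    }'
}

drift_severity() {
    # $1 = redis_daily_total, $2 = mongodb_daily_total
    awk -v r="$1" -v m="$2" 'BEGIN {
        # A zero Redis total is the "connection lost entirely" row.
        if (r <= 0) { print "Critical"; exit }
        d = (r - m) / r * 100
        if (d < 0) d = -d
        if (d < 2) print "Low"
        else if (d <= 10) print "Medium"
        else print "High"
    }'
}

drift_pct 42.15 38.90        # prints 7.7
drift_severity 42.15 38.90   # prints Medium
```

The sketch does not cover the last_redis_write row; a write timestamp older than 5 minutes is Critical regardless of the computed percentage.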
Step 2: Check Redis health
Cost accumulation depends on Redis INCRBYFLOAT for atomic counters and HSET for hierarchical budget tracking.
./scripts/analytics health
# Look for: redis_connected: true, redis_latency_ms < 10
If redis_connected is false or latency is high, check directly:
# On VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
redis-cli -u $REDIS_URL ping
# Expected: PONG
redis-cli -u $REDIS_URL info memory | grep used_memory_human
# Check if near maxmemory
redis-cli -u $REDIS_URL info stats | grep instantaneous_ops_per_sec
# Normal: 50-500 ops/sec. Over 5000 = possible hot key
Step 3: Identify the root cause
Cause A: Redis connection lost
The gateway falls back to a non-atomic read-then-write path when Redis is unreachable. Under concurrency, this loses increments.
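To see why the fallback loses increments, the interleaving can be replayed by hand. The sketch below is an illustration only: a temporary file stands in for the shared counter, nothing touches Redis or the gateway, and the function name simulate_lost_update is hypothetical.

```shell
#!/bin/sh
# Illustration: two requests use read-then-write instead of an atomic
# increment, and the second write overwrites the first.

simulate_lost_update() {
    # $1 = starting total, $2 = cost of request A, $3 = cost of request B
    counter=$(mktemp)
    printf '%s\n' "$1" > "$counter"

    # Both requests read the counter before either one writes.
    read_a=$(cat "$counter")
    read_b=$(cat "$counter")

    # Each writes back its own stale read plus its own cost.
    awk -v v="$read_a" -v c="$2" 'BEGIN { printf "%.2f\n", v + c }' > "$counter"
    awk -v v="$read_b" -v c="$3" 'BEGIN { printf "%.2f\n", v + c }' > "$counter"

    cat "$counter"
    rm -f "$counter"
}

simulate_lost_update 10.00 0.05 0.03   # prints 10.03, not 10.08
```

Request A's 0.05 disappears because request B computed its result from the value it read before A wrote. An atomic increment such as INCRBYFLOAT performs the read and the add as one step on the server, so this interleaving cannot occur.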
Diagnosis: redis_connected: false in health check, or backend logs show non_atomic_fallback warnings:
./scripts/errors by-source gateway | grep "non_atomic_fallback"
Fix (immediate): Restore Redis connectivity:
# Check container status
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps | grep redis"
# Restart if needed
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker restart curateme-redis"
Fix (reconcile lost data): After Redis is back, recalculate from MongoDB usage records:
./scripts/analytics costs reconcile
Cause B: Dead-letter queue backup
When MongoDB writes fail (connection timeout, disk full), usage records go to a Redis-backed dead-letter queue. If DLQ grows, MongoDB totals fall behind.
Diagnosis:
redis-cli LLEN gateway:dlq:usage_records
# Healthy: 0. Problem: > 100
Fix: Replay the DLQ once MongoDB is healthy:
# Verify MongoDB is accepting writes first
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-mongo mongosh --eval 'db.runCommand({ping:1})'"
# Replay failed records
./scripts/analytics costs replay-dlq
Cause C: High request volume backpressure
Cost recording is async but shares the event loop. Under extreme load (>1000 req/s), the async write queue can back up.
Diagnosis: Check gateway request rate and queue depth:
curl https://api.curate-me.ai/gateway/admin/metrics \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq '.cost_recorder'
Look for queue_depth > 100 or dropped_records > 0.
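The two thresholds can be turned into a quick check once the numbers have been read from the metrics output. This is a minimal sketch: the function name cost_recorder_status and the status labels it prints are hypothetical, and it assumes both values are non-negative integers.

```shell
#!/bin/sh
# Sketch: classify cost recorder state from queue_depth and dropped_records.

cost_recorder_status() {
    # $1 = queue_depth, $2 = dropped_records
    if [ "$2" -gt 0 ]; then
        # Dropped records mean cost data is already being lost.
        echo "dropping"
    elif [ "$1" -gt 100 ]; then
        echo "backed_up"
    else
        echo "ok"
    fi
}

cost_recorder_status 12 0     # prints ok
cost_recorder_status 250 0    # prints backed_up
cost_recorder_status 250 3    # prints dropping
```

Dropped records are checked first because they indicate lost data rather than delayed data, which changes whether a reconcile is needed after the load subsides.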
Fix (immediate): Reduce load if possible (rate limit a specific org, enable request queueing).
Fix (long-term): Increase cost recorder worker pool size in gateway configuration:
# In gateway environment
COST_RECORDER_WORKERS=4 # Default is 2
COST_RECORDER_BATCH_SIZE=50 # Default is 20
Cause D: Hierarchical budget desync
The hierarchical budget system (Org → Team → Key) uses separate Redis counters that can drift from the flat daily counter.
Diagnosis: Compare flat vs hierarchical totals:
# Check hierarchical budget nodes for an org
curl https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID \
-H "Authorization: Bearer $ADMIN_TOKEN"
If the org-level budget total differs from the flat redis_daily_total, the hierarchy has drifted.
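Because both totals are floating-point sums, an exact equality test can report drift that is only rounding noise. The sketch below compares the two values with a tolerance. The default of 0.01 is an assumption, since this runbook does not specify one, and the function name hierarchy_drifted is hypothetical. The response shape of the budgets endpoint is not shown here, so the totals are passed in as plain numbers.

```shell
#!/bin/sh
# Sketch: compare the org-level hierarchical total with the flat daily total.
# Assumption: differences up to the tolerance (default 0.01) are noise.

hierarchy_drifted() {
    # $1 = org-level budget total, $2 = flat redis_daily_total, $3 = tolerance
    awk -v h="$1" -v f="$2" -v tol="${3:-0.01}" 'BEGIN {
        d = h - f
        if (d < 0) d = -d
        # Exit status 0 means "drifted", so the function works in an if.
        exit ((d > tol) ? 0 : 1)
    }'
}

if hierarchy_drifted 41.80 42.15; then
    echo "hierarchy drifted"
else
    echo "hierarchy in sync"
fi
```

With the sample values the difference is 0.35, so the script prints "hierarchy drifted".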
Fix: Force hierarchy recalculation:
curl -X POST https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID/reconcile \
-H "Authorization: Bearer $ADMIN_TOKEN"
Step 4: Verify resolution
After applying fixes, confirm cost tracking is back in sync:
# Check drift has decreased
./scripts/analytics costs today
# drift_pct should be < 2%
# Check DLQ is drained
redis-cli LLEN gateway:dlq:usage_records
# Should be 0
# Send a test request and verify cost is recorded
curl -s https://api.curate-me.ai/v1/openai/chat/completions \
-H "X-CM-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}], "max_tokens": 5}' \
-D - -o /dev/null 2>&1 | grep -i "X-CM-Daily-Cost"
# Value should increment
# Verify Redis and MongoDB agree
./scripts/analytics health
Prevention
| Measure | How |
|---|---|
| Redis durability | Configure Redis with AOF persistence and monitor memory usage |
| DLQ monitoring | Alert when gateway:dlq:usage_records length exceeds 50 |
| Cost drift alerting | Alert when drift_pct exceeds 5% for more than 10 minutes |
| Budget reconciliation | Run ./scripts/analytics costs reconcile daily via cron |
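The drift alert in the table depends on duration as well as magnitude, so a single high sample should not fire it. The following is a minimal sketch of that rule, assuming samples arrive as lines of "<epoch_seconds> <drift_pct>" in time order. The input format and the function name sustained_drift are hypothetical, and how samples are collected is left open.

```shell
#!/bin/sh
# Sketch: report "alert" only when drift has stayed above the limit for
# longer than the window, up to and including the most recent sample.

sustained_drift() {
    # $1 = limit in percent (default 5), $2 = window in seconds (default 600)
    awk -v limit="${1:-5}" -v window="${2:-600}" '
        # Sample above the limit: start or extend the current streak.
        $2 > limit { if (start == "") start = $1; last = $1; next }
        # Sample at or below the limit: the streak is broken.
        { start = ""; last = "" }
        END {
            if (start != "" && last - start > window) print "alert"
            else print "ok"
        }'
}

printf '%s\n' "1000 6.1" "1300 7.4" "1660 8.0" | sustained_drift   # prints alert
printf '%s\n' "1000 6.1" "1300 4.0" "1660 8.0" | sustained_drift   # prints ok
```

The second call prints ok because the sample at 1300 dropped below 5%, which restarts the 10-minute clock.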
Related Runbooks
- Budget Exceeded — when budgets are exceeded due to lag
- Redis Incident — when Redis itself is the problem
- Gateway High Latency — cost recording can contribute to latency
Rollback
Revert the changes applied in Step 3. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.
Verification
After applying the fix, verify:
- The symptoms listed above are no longer present
- No new errors in gateway logs:
docker logs curateme-backend-gateway --tail=50
- Health check passes:
curl -s http://localhost:8002/health | jq .status
Escalation
- If cost drift exceeds 10% of daily spend, escalate immediately
- Collect: ./scripts/errors by-source gateway, ./scripts/analytics costs today, DLQ length
- If drift persists after Redis restart and reconciliation, check for code-level bugs in cost_recorder.py
- Contact platform team with drift percentage, affected time window, and org IDs impacted