Runbook: Cost Accumulation Falls Behind

Owner: Platform Team
Backup owner: On-call engineer
Last validated: Not yet validated
Validation method: Manual drill
Severity trigger: SEV2
Customer impact: Budget enforcement fails; orgs can overspend before governance catches up
Required access: SSH (VPS), MongoDB, Redis
Related services: curateme-backend-gateway


Real-time cost tracking in the gateway uses a dual-write pattern: Redis for fast atomic increments (INCRBYFLOAT) and MongoDB for durable usage records. When these diverge, budget enforcement breaks — orgs can overspend before governance catches up. Follow these steps in order.


Symptoms

  • Dashboard shows lower daily spend than expected
  • Budget warnings trigger late or not at all
  • Metered billing reports don’t match gateway logs
  • X-CM-Daily-Cost response header shows stale values
  • Hierarchical budget alerts (50%/75%/90%) fire late or not at all

Step 1: Assess the drift magnitude

Determine how far behind cost tracking has fallen and which layer is affected.

./scripts/analytics costs today

Expected output:

{
  "redis_daily_total": 42.15,
  "mongodb_daily_total": 38.90,
  "drift_pct": 7.7,
  "last_redis_write": "2026-05-04T14:32:01Z",
  "last_mongo_write": "2026-05-04T14:31:58Z"
}

What to look for:

| Condition | Meaning | Severity |
| --- | --- | --- |
| drift_pct < 2% | Normal: async writes have minor lag | Low |
| drift_pct 2-10% | Redis and MongoDB diverging | Medium |
| drift_pct > 10% | Significant cost tracking failure | High |
| redis_daily_total = 0 | Redis connection lost entirely | Critical |
| last_redis_write > 5 min old | Redis writes stopped | Critical |
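The severity table can be applied mechanically to the report. The sketch below is illustrative only: it assumes drift_pct is computed relative to the Redis total (which reproduces the 7.7 in the sample output), and the function names `drift_pct` and `classify` are hypothetical, not part of ./scripts/analytics.

```python
from datetime import datetime, timedelta


def drift_pct(redis_total: float, mongo_total: float) -> float:
    """Drift as a percentage of the Redis total (assumed formula)."""
    if redis_total == 0:
        return 0.0
    return abs(redis_total - mongo_total) / redis_total * 100


def classify(report: dict, now: datetime) -> str:
    """Map a `costs today` report to a severity from the table."""
    # Critical conditions are checked first because they override drift.
    if report["redis_daily_total"] == 0:
        return "Critical"
    last_redis = datetime.fromisoformat(
        report["last_redis_write"].replace("Z", "+00:00")
    )
    if now - last_redis > timedelta(minutes=5):
        return "Critical"
    pct = report["drift_pct"]
    if pct > 10:
        return "High"
    if pct >= 2:
        return "Medium"
    return "Low"
```

Checking the Critical rows first matters: a Redis outage can leave drift_pct looking small while no writes are landing at all.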

Step 2: Check Redis health

Cost accumulation depends on Redis INCRBYFLOAT for atomic counters and HSET for hierarchical budget tracking.

./scripts/analytics health
# Look for: redis_connected: true, redis_latency_ms < 10

If redis_connected is false or latency is high, check directly:

# On VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP

redis-cli -u $REDIS_URL ping
# Expected: PONG

redis-cli -u $REDIS_URL info memory | grep used_memory_human
# Check if near maxmemory

redis-cli -u $REDIS_URL info stats | grep instantaneous_ops_per_sec
# Normal: 50-500 ops/sec. Over 5000 = possible hot key

Step 3: Identify the root cause

Cause A: Redis connection lost

The gateway falls back to a non-atomic read-then-write path when Redis is unreachable. Under concurrency, this loses increments.
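To see why the fallback loses increments, consider two requests that both read the counter before either writes it back. The sketch below reproduces that interleaving deterministically with an in-memory stand-in; `InMemoryCounter` and both helper functions are hypothetical and do not mirror the gateway's actual code.

```python
class InMemoryCounter:
    """Stand-in for a single daily-cost counter (not a real Redis client)."""

    def __init__(self) -> None:
        self.value = 0.0

    def get(self) -> float:
        return self.value

    def set(self, value: float) -> None:
        self.value = value

    def incr_by_float(self, amount: float) -> float:
        # Models INCRBYFLOAT: read, add, and write happen as one step.
        self.value += amount
        return self.value


def non_atomic_interleaved(counter: InMemoryCounter, cost_a: float, cost_b: float) -> float:
    """Two read-then-write updates where B reads before A has written."""
    read_a = counter.get()
    read_b = counter.get()          # B sees the same stale value as A
    counter.set(read_a + cost_a)    # A writes its total
    counter.set(read_b + cost_b)    # B overwrites it; A's increment is lost
    return counter.get()


def atomic(counter: InMemoryCounter, cost_a: float, cost_b: float) -> float:
    """The same two updates applied as atomic increments."""
    counter.incr_by_float(cost_a)
    return counter.incr_by_float(cost_b)
```

With costs of 0.25 and 0.5, the interleaved path ends at 0.5 while the atomic path ends at 0.75. The missing 0.25 is the kind of shortfall that reconciliation has to recover.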

Diagnosis: redis_connected: false in health check, or backend logs show non_atomic_fallback warnings:

./scripts/errors by-source gateway | grep "non_atomic_fallback"

Fix (immediate): Restore Redis connectivity:

# Check container status
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps | grep redis"

# Restart if needed
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker restart curateme-redis"

Fix (reconcile lost data): After Redis is back, recalculate from MongoDB usage records:

./scripts/analytics costs reconcile
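Conceptually, reconciliation of this kind rebuilds each counter from the durable usage records. The following is a rough sketch under that assumption; the record fields `org_id` and `cost` and the `reconcile` function are hypothetical, and the real script may work differently.

```python
from collections import defaultdict


def reconcile(usage_records: list, counters: dict) -> dict:
    """Overwrite per-org counters with totals recomputed from usage records.

    Returns {org_id: (old_value, new_value)} for every counter that changed.
    """
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["org_id"]] += record["cost"]

    corrections = {}
    for org_id, total in totals.items():
        old = counters.get(org_id, 0.0)
        if old != total:
            corrections[org_id] = (old, total)
        counters[org_id] = total  # durable records are treated as the source of truth
    return corrections
```

Returning the corrections gives the operator the affected org IDs and the size of each adjustment, which is also useful if the incident needs to be escalated.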

Cause B: Dead-letter queue backup

When MongoDB writes fail (connection timeout, disk full), usage records go to a Redis-backed dead-letter queue. If DLQ grows, MongoDB totals fall behind.
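The dead-letter pattern described here can be illustrated with in-memory stand-ins. This is a sketch only: `FlakyStore`, `record_usage`, and `replay_dlq` are hypothetical names, and a `deque` takes the place of the Redis list.

```python
from collections import deque


class FlakyStore:
    """Stand-in for the durable usage-record store (not a real MongoDB client)."""

    def __init__(self) -> None:
        self.healthy = True
        self.records = []

    def insert(self, record: dict) -> None:
        if not self.healthy:
            raise ConnectionError("store unavailable")
        self.records.append(record)


def record_usage(store: FlakyStore, dlq: deque, record: dict) -> None:
    """Try the durable write; park the record in the DLQ on failure."""
    try:
        store.insert(record)
    except ConnectionError:
        dlq.append(record)


def replay_dlq(store: FlakyStore, dlq: deque) -> int:
    """Replay queued records in order; stop at the first failure."""
    replayed = 0
    while dlq:
        try:
            store.insert(dlq[0])
        except ConnectionError:
            break  # store is still unhealthy; keep remaining records queued
        dlq.popleft()  # remove only after the write succeeded
        replayed += 1
    return replayed
```

Removing a record only after its write succeeds means a failed replay leaves the queue intact, which is why the store should be confirmed healthy before replaying.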

Diagnosis:

redis-cli LLEN gateway:dlq:usage_records # Healthy: 0. Problem: > 100

Fix: Replay the DLQ once MongoDB is healthy:

# Verify MongoDB is accepting writes first
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-mongo mongosh --eval 'db.runCommand({ping:1})'"

# Replay failed records
./scripts/analytics costs replay-dlq

Cause C: High request volume backpressure

Cost recording is async but shares the event loop. Under extreme load (>1000 req/s), the async write queue can back up.
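A bounded async queue with a small worker pool shows how this backpressure turns into queue depth and dropped records. The sketch below is an assumption about the general shape, not the gateway's implementation; `CostRecorder`, `submit`, and `demo` are hypothetical, and `asyncio.sleep(0)` stands in for the durable write.

```python
import asyncio


class CostRecorder:
    """Illustrative bounded queue drained in batches by worker tasks."""

    def __init__(self, max_queue: int = 100, batch_size: int = 20) -> None:
        self.queue = asyncio.Queue(maxsize=max_queue)
        self.batch_size = batch_size
        self.dropped_records = 0
        self.written = []

    def submit(self, record: dict) -> None:
        """Enqueue without blocking the request path; count drops when full."""
        try:
            self.queue.put_nowait(record)
        except asyncio.QueueFull:
            self.dropped_records += 1

    async def worker(self) -> None:
        while True:
            batch = [await self.queue.get()]
            while len(batch) < self.batch_size and not self.queue.empty():
                batch.append(self.queue.get_nowait())
            await asyncio.sleep(0)  # stand-in for the batched durable write
            self.written.extend(batch)
            for _ in batch:
                self.queue.task_done()

    def metrics(self) -> dict:
        return {
            "queue_depth": self.queue.qsize(),
            "dropped_records": self.dropped_records,
        }


async def demo(n_records: int, workers: int = 2):
    """Submit a burst before any worker runs, then drain the queue."""
    recorder = CostRecorder()
    for seq in range(n_records):
        recorder.submit({"seq": seq, "cost": 0.01})
    before = recorder.metrics()
    tasks = [asyncio.create_task(recorder.worker()) for _ in range(workers)]
    await recorder.queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return before, recorder.metrics(), len(recorder.written)
```

Running `asyncio.run(demo(150))` fills the 100-slot queue and drops the remaining 50 records before the workers get a turn on the event loop. More workers or larger batches drain the queue faster but cannot recover records that were already dropped.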

Diagnosis: Check gateway request rate and queue depth:

curl https://api.curate-me.ai/gateway/admin/metrics \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.cost_recorder'

Look for queue_depth > 100 or dropped_records > 0.

Fix (immediate): Reduce load if possible (rate limit a specific org, enable request queueing).

Fix (long-term): Increase cost recorder worker pool size in gateway configuration:

# In gateway environment
# Default is 2
COST_RECORDER_WORKERS=4
# Default is 20
COST_RECORDER_BATCH_SIZE=50

Cause D: Hierarchical budget desync

The hierarchical budget system (Org → Team → Key) uses separate Redis counters that can drift from the flat daily counter.

Diagnosis: Compare flat vs hierarchical totals:

# Check hierarchical budget nodes for an org
curl https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"

If the org-level budget total differs from the flat redis_daily_total, the hierarchy has drifted.
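The comparison amounts to rolling key-level spend up through teams and checking the result against the flat total. The sketch below assumes a hypothetical `{team: {key: spend}}` shape for illustration; the actual response format of the budgets endpoint is not documented here, and the tolerance value is arbitrary.

```python
import math


def org_total_from_hierarchy(tree: dict) -> float:
    """Sum key-level spend up through teams to the org level."""
    return sum(sum(keys.values()) for keys in tree.values())


def hierarchy_drifted(flat_total: float, tree: dict, tolerance: float = 0.01) -> bool:
    """True when the hierarchical total differs from the flat daily total.

    `tolerance` is an assumed absolute slack (in currency units) that absorbs
    floating-point rounding; it is not a documented gateway setting.
    """
    return not math.isclose(
        flat_total, org_total_from_hierarchy(tree), abs_tol=tolerance
    )
```

A small tolerance avoids false positives from float accumulation, since both the flat and hierarchical counters are built from many fractional increments.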

Fix: Force hierarchy recalculation:

curl -X POST https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID/reconcile \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Step 4: Verify resolution

After applying fixes, confirm cost tracking is back in sync:

# Check drift has decreased
./scripts/analytics costs today
# drift_pct should be < 2%

# Check DLQ is drained
redis-cli LLEN gateway:dlq:usage_records
# Should be 0

# Send a test request and verify cost is recorded
# (grep -i because header names may be returned in lowercase)
curl -s https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}], "max_tokens": 5}' \
  -D - -o /dev/null 2>&1 | grep -i "X-CM-Daily-Cost"
# Value should increment

# Verify Redis and MongoDB agree
./scripts/analytics health

Prevention

| Measure | How |
| --- | --- |
| Redis high-availability | Configure Redis with AOF persistence and monitor memory usage |
| DLQ monitoring | Alert when gateway:dlq:usage_records length exceeds 50 |
| Cost drift alerting | Alert when drift_pct exceeds 5% for more than 10 minutes |
| Budget reconciliation | Run ./scripts/analytics costs reconcile daily via cron |
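The drift alert rule ("exceeds 5% for more than 10 minutes") needs state between samples, because a single spike should not fire it. A minimal sketch of that logic follows; the class `SustainedDriftAlert` is hypothetical, and in practice this would more likely be expressed as a rule in whatever monitoring system is in use.

```python
from datetime import datetime, timedelta


class SustainedDriftAlert:
    """Fire only when drift stays above the threshold for longer than `hold`."""

    def __init__(self, threshold_pct: float = 5.0,
                 hold: timedelta = timedelta(minutes=10)) -> None:
        self.threshold_pct = threshold_pct
        self.hold = hold
        self.breach_started = None  # time of the first sample above threshold

    def observe(self, at: datetime, drift_pct: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        if drift_pct <= self.threshold_pct:
            self.breach_started = None  # recovery resets the timer
            return False
        if self.breach_started is None:
            self.breach_started = at
        return at - self.breach_started > self.hold
```

Resetting the timer on any sample at or below the threshold means brief dips delay the alert, so the sampling interval should be short relative to the 10-minute window.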


Rollback

Revert any changes made in Step 3. If a configuration value was changed (for example COST_RECORDER_WORKERS or COST_RECORDER_BATCH_SIZE), restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

  • The symptoms listed above are no longer present
  • No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
  • Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

  1. If cost drift exceeds 10% of daily spend, escalate immediately
  2. Collect: ./scripts/errors by-source gateway, ./scripts/analytics costs today, DLQ length
  3. If drift persists after Redis restart and reconciliation, check for code-level bugs in cost_recorder.py
  4. Contact platform team with drift percentage, affected time window, and org IDs impacted