
Runbook: MongoDB Incident Response

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1
Customer impact: API key auth fails, audit logging stops, org config unavailable
Required access: SSH to VPS, MongoDB admin credentials
Related services: curateme-mongo

This runbook covers diagnosing and resolving MongoDB issues on the Curate-Me platform. MongoDB stores audit logs, org configurations, API keys, A2A inbox, fleet deployments, autopilot results, and usage records. Follow the steps in order.


Symptoms

  • Backend API returning 500 errors with ServerSelectionTimeoutError or AutoReconnect in logs
  • Dashboard pages loading with empty data or “Service unavailable” banners
  • Gateway governance chain works (Redis-backed) but audit trail is missing
  • A2A task inbox not accepting new tasks or returning stale results
  • Fleet deployment status stuck in deploying state
  • ./scripts/errors recent shows spikes in MongoDB-related errors
  • ./scripts/analytics health reports mongodb_connected: false
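To put a number on the first symptom, a log excerpt can be scanned for the two driver error names listed above. The following is a minimal sketch, not part of the platform tooling: the function name `count_mongo_errors` is hypothetical, and it only counts matching lines arriving on standard input.

```shell
# count_mongo_errors (hypothetical helper): read log lines on stdin and print
# how many lines mention each driver error named in the symptoms list.
count_mongo_errors() {
  awk '
    /ServerSelectionTimeoutError/ { sst++ }
    /AutoReconnect/               { ar++ }
    END {
      # Unset awk counters format as 0 with %d.
      printf "ServerSelectionTimeoutError=%d AutoReconnect=%d\n", sst, ar
    }
  '
}
```

For example, `docker logs curateme-backend-b2b --tail 2000 2>&1 | count_mongo_errors` would summarize the most recent backend log lines.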

Step 1: Check MongoDB container health

ssh "$DEPLOY_USER@$PLATFORM_VPS_IP"
# Container status (include stopped containers with -a)
docker ps -a --filter name=curateme-mongo --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Resource usage
docker stats curateme-mongo --no-stream
# Recent errors in container logs
docker logs curateme-mongo --tail 500 --timestamps 2>&1 | grep -iE "error|fatal|oom|killed"
# Verify MongoDB accepts connections (double quotes so the shell expands the password variable)
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.runCommand({ping: 1})"
# Expected: { ok: 1 }

Step 2: Connection pool diagnostics

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
  const s = db.serverStatus();
  print('Connections: ' + s.connections.current + ' / ' + s.connections.available);
  print('Active ops: ' + s.globalLock.activeClients.total);
  print('Queued R/W: ' + s.globalLock.currentQueue.readers + '/' + s.globalLock.currentQueue.writers);
"
Metric | Healthy | Investigate if
current connections | < 100 | > 200 (pool exhaustion)
available connections | > 500 | < 50 (running out)
Queued readers/writers | 0 | > 10 (lock contention)
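The "Investigate if" thresholds can be applied mechanically to the four numbers printed by the command above. This is a sketch under stated assumptions: the function name `classify_connections` is hypothetical, and values that fall between the "Healthy" and "Investigate if" columns are reported as `ok` because the table defines no action for them.

```shell
# classify_connections CURRENT AVAILABLE QUEUED_READERS QUEUED_WRITERS
# Hypothetical helper: prints one finding per crossed threshold, then a final
# verdict line ("ok" or "investigate").
classify_connections() {
  verdict=ok
  if [ "$1" -gt 200 ]; then
    echo "current=$1: possible pool exhaustion"
    verdict=investigate
  fi
  if [ "$2" -lt 50 ]; then
    echo "available=$2: running out of connections"
    verdict=investigate
  fi
  if [ "$3" -gt 10 ] || [ "$4" -gt 10 ]; then
    echo "queued R/W=$3/$4: possible lock contention"
    verdict=investigate
  fi
  echo "$verdict"
}
```

Calling `classify_connections 250 30 0 12` prints three findings followed by `investigate`.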

List active operations to find what is holding connections:

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
  db.currentOp({active: true}).inprog.forEach(function(op) {
    print(op.client + ' | ' + op.op + ' | ' + op.ns + ' | ' + op.secs_running + 's');
  });
"
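When the listing is long, sorting it by running time brings the likely culprits to the top. The sketch below assumes the `client | op | ns | Ns` line format printed by the command above; `longest_ops` is a hypothetical name, and lines without a numeric duration sort last.

```shell
# longest_ops [N] (hypothetical helper): read "client | op | ns | Ns" lines on
# stdin and print the N longest-running operations (default 5), longest first.
longest_ops() {
  awk -F' [|] ' '{
      secs = $4
      sub(/s$/, "", secs)        # strip the trailing "s" to get a number
      print secs "\t" $0         # prefix each line with its sort key
    }' |
    sort -t "$(printf '\t')" -k1,1nr |
    head -n "${1:-5}" |
    cut -f2-                     # drop the sort key again
}
```

Pipe the output of the `db.currentOp` listing into `longest_ops 10` to see the ten slowest active operations.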

Step 3: Identify root cause

Cause A: Container crash / OOM kill

Diagnosis:

docker inspect curateme-mongo --format='OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}'
# Exit code 137 = SIGKILL/OOM, 1 = internal error
dmesg | grep -i "oom\|mongo" | tail -20
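Exit code 137 is 128 + 9, the usual encoding for a process terminated by signal 9 (SIGKILL). A small helper can turn the two values from `docker inspect` into a readable verdict. This is an illustrative sketch; `explain_exit` is a hypothetical name and the messages are not produced by Docker itself.

```shell
# explain_exit OOM_KILLED EXIT_CODE (hypothetical helper)
# OOM_KILLED is "true" or "false" as printed by {{.State.OOMKilled}}.
explain_exit() {
  oom=$1
  code=$2
  if [ "$oom" = "true" ]; then
    echo "OOM kill: the kernel killed the container for exceeding its memory"
  elif [ "$code" -gt 128 ]; then
    # Exit codes above 128 encode 128 + the terminating signal number.
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "mongod exited with error code $code; check container logs"
  fi
}
```

Usage: `explain_exit $(docker inspect curateme-mongo --format '{{.State.OOMKilled}} {{.State.ExitCode}}')`.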

Fix (immediate): Restart the container and verify:

docker restart curateme-mongo && sleep 5
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.runCommand({ping: 1})"
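A fixed five-second sleep may be too short if mongod has to replay its journal after an unclean shutdown. One alternative is to poll the ping until it succeeds. The helper below is a generic sketch; `wait_until` is a hypothetical name, and the attempt count and delay are illustrative choices.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...] (hypothetical helper)
# Runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

For example: `wait_until 15 2 docker exec curateme-mongo mongosh --quiet -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.runCommand({ping: 1})"`.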

Fix (long-term): Set memory limits in docker-compose.production.yml and cap WiredTiger cache:

# Under the mongo service:
command: ["mongod", "--wiredTigerCacheSizeGB", "0.5"]
deploy:
  resources:
    limits:
      memory: 2G
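The 0.5 GB cache figure lines up with MongoDB's documented default sizing rule for WiredTiger, the larger of 50% of (RAM - 1 GB) and 256 MB, applied to the 2 GB limit. The sketch below reproduces that arithmetic so the cache can be re-derived if the memory limit changes; `wt_default_cache_gb` is a hypothetical helper, and whether the team chose 0.5 for this reason is an assumption.

```shell
# wt_default_cache_gb RAM_GB (hypothetical helper)
# Prints max(0.5 * (RAM_GB - 1), 0.25), the documented WiredTiger default
# cache size in GB for a given amount of memory.
wt_default_cache_gb() {
  awk -v ram="$1" 'BEGIN {
    cache = 0.5 * (ram - 1)
    if (cache < 0.25) cache = 0.25   # 256 MB floor
    printf "%.2f\n", cache
  }'
}
```

With a 2 GB limit, `wt_default_cache_gb 2` prints `0.50`; raising the limit to 4 GB would give `1.50`.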

Cause B: Connection pool exhaustion

Diagnosis: In the Step 2 output, current connections are above 200, or available connections have dropped below 50.

Fix (immediate): Kill operations that have been running for more than five minutes. MongoDB reports secs_running only for active operations, so the filter must match active: true:

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
  db.currentOp({active: true, secs_running: {'\$gt': 300}}).inprog.forEach(function(op) {
    db.killOp(op.opid);
    print('Killed: ' + op.opid + ' from ' + op.client);
  });
"

Fix (long-term): Ensure bounded connection pools in MONGO_URI:

MONGO_URI=mongodb://admin:password@host:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=30000
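Because each client process keeps its own pool, the worst-case connection count is roughly maxPoolSize multiplied by the number of client processes. The sketch below extracts maxPoolSize from a URI and does that multiplication. It assumes one pool per process, and `max_client_connections` is a hypothetical name.

```shell
# max_client_connections URI CLIENT_PROCESSES (hypothetical helper)
# Prints maxPoolSize * CLIENT_PROCESSES, or fails if the URI sets no maxPoolSize.
max_client_connections() {
  pool=$(printf '%s\n' "$1" |
    sed -n 's/.*[?&]maxPoolSize=\([0-9][0-9]*\).*/\1/p')
  if [ -z "$pool" ]; then
    echo "maxPoolSize not set in URI" >&2
    return 1
  fi
  echo $((pool * $2))
}
```

With the URI above and four client processes the result is 200, which is exactly the "Investigate if" boundary for current connections in Step 2, so a larger process count would call for a smaller maxPoolSize.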

Cause C: Slow queries

Diagnosis: Find long-running operations and enable the profiler:

# The profiler is per-database, so connect to curateme_b2b rather than the default database
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
  db.currentOp({active: true, secs_running: {'\$gt': 5}}).inprog.forEach(function(op) {
    print(op.opid + ' | ' + op.secs_running + 's | ' + op.ns + ' | ' + JSON.stringify(op.command));
  });
  db.setProfilingLevel(1, {slowms: 100});
"
# After a few minutes, check profiled slow queries:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
  db.system.profile.find({millis: {'\$gt': 100}}).sort({ts: -1}).limit(10).forEach(function(d) {
    print(d.ns + ' | ' + d.millis + 'ms | ' + JSON.stringify(d.command));
  });
"

Fix: Kill the slow operation with db.killOp(<opid>), using the opid that db.currentOp() reports for it, then add any missing indexes:

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
  db.a2a_inbox.createIndex({agent_id: 1, status: 1});
  db.a2a_audit_log.createIndex({timestamp: -1, org_id: 1});
  db.managed_runners.createIndex({fleet_id: 1, status: 1});
  db.autopilot_results.createIndex({org_id: 1, created_at: -1});
  db.usage_records.createIndex({org_id: 1, timestamp: -1});
  print('Indexes created');
"

Cause D: Disk space exhaustion

Diagnosis:

df -h /var/lib/docker
docker exec curateme-mongo du -sh /data/db/
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
  db.adminCommand({listDatabases: 1}).databases.forEach(function(d) {
    print(d.name + ': ' + (d.sizeOnDisk / 1024 / 1024).toFixed(2) + ' MB');
  });
"
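The `df` check can be turned into a pass/fail test for scripts or cron. The sketch below parses the portable `df -P` format; the helper name `disk_usage_alert` and the 85% threshold in the usage example are assumptions, not values taken from the platform configuration.

```shell
# disk_usage_alert THRESHOLD_PERCENT (hypothetical helper)
# Reads `df -P` output on stdin. Prints every filesystem at or above the
# threshold and exits 0 if any was found, 1 otherwise (grep-like semantics).
disk_usage_alert() {
  awk -v limit="$1" '
    NR > 1 {                       # skip the df header line
      use = $5
      sub(/%$/, "", use)           # "93%" -> "93"
      if (use + 0 >= limit) {
        printf "%s at %s%% (mounted on %s)\n", $1, use, $6
        hit = 1
      }
    }
    END { exit (hit ? 0 : 1) }
  '
}
```

Usage: `df -P /var/lib/docker | disk_usage_alert 85 && echo "disk pressure: consider Cause D"`.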

Fix (immediate): Rotate old data and compact:

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
  // Delete audit entries older than 90 days
  const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);
  const r = db.a2a_audit_log.deleteMany({timestamp: {'\$lt': cutoff}});
  print('Deleted ' + r.deletedCount + ' old audit entries');
  // Release the freed space back to the filesystem
  db.runCommand({compact: 'a2a_audit_log'});
  db.runCommand({compact: 'usage_records'});
"

Fix (long-term): Set TTL indexes for automatic expiration:

docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
  // Expire audit entries after 90 days
  db.a2a_audit_log.createIndex({timestamp: 1}, {expireAfterSeconds: 7776000});
  // system.profile is a capped collection: it overwrites its oldest entries
  // on its own and does not accept a TTL index, so none is created for it.
"
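The expireAfterSeconds value is a retention period expressed in seconds. A one-line helper makes it easy to derive or double-check the number when the retention policy changes; `days_to_seconds` is a hypothetical name.

```shell
# days_to_seconds DAYS (hypothetical helper): convert a retention period in
# days to the value expected by expireAfterSeconds.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}
```

`days_to_seconds 90` prints `7776000`, matching the audit log TTL above.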

Step 4: Backup and restore

If data corruption is suspected, restore from the most recent backup.

# Check available backups
ls -lah /home/curateme/backups/mongodb/
# Create emergency backup before restoring
# (docker cp does not expand wildcards, so keep the exact file name in a variable)
EMERGENCY_BACKUP="emergency-backup-$(date +%Y%m%d-%H%M).gz"
docker exec curateme-mongo mongodump \
  -u admin -p "$MONGO_ADMIN_PASSWORD" \
  --archive="/tmp/$EMERGENCY_BACKUP" --gzip
docker cp "curateme-mongo:/tmp/$EMERGENCY_BACKUP" /home/curateme/backups/mongodb/
# Stop app services, restore, restart
docker stop curateme-backend-b2b curateme-backend-gateway
docker cp /home/curateme/backups/mongodb/curateme-mongo-backup-YYYY-MM-DD.gz curateme-mongo:/tmp/restore.gz
docker exec curateme-mongo mongorestore \
  -u admin -p "$MONGO_ADMIN_PASSWORD" \
  --archive=/tmp/restore.gz --gzip --drop
docker start curateme-backend-b2b curateme-backend-gateway
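Before running a restore with --drop, it is worth confirming which backup is the most recent and that the archive is at least a readable gzip stream. The sketch below assumes the `curateme-mongo-backup-YYYY-MM-DD.gz` naming shown above, so that names sort chronologically; `latest_backup` is a hypothetical helper, and a passing `gzip -t` does not prove the dump inside is complete.

```shell
# latest_backup DIR (hypothetical helper)
# Prints the newest curateme-mongo-backup-*.gz in DIR after checking that it
# passes a gzip integrity test. Fails if no backup exists or it is corrupt.
latest_backup() {
  dir=$1
  newest=""
  # Glob results are sorted by name, so the last match has the latest date.
  for f in "$dir"/curateme-mongo-backup-*.gz; do
    if [ -e "$f" ]; then
      newest=$f
    fi
  done
  if [ -z "$newest" ]; then
    echo "no backups found in $dir" >&2
    return 1
  fi
  if ! gzip -t "$newest" 2>/dev/null; then
    echo "corrupt archive: $newest" >&2
    return 1
  fi
  echo "$newest"
}
```

Usage: `RESTORE_FROM=$(latest_backup /home/curateme/backups/mongodb) || exit 1`, then pass `"$RESTORE_FROM"` to `docker cp`.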

Step 5: Verify resolution

# MongoDB health
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
  print('Ping: ' + JSON.stringify(db.runCommand({ping: 1})));
  const s = db.serverStatus();
  print('Connections: ' + s.connections.current + '/' + s.connections.available);
  print('Uptime: ' + s.uptime + 's');
"
# Platform health
curl -s https://api.curate-me.ai/api/v1/health | python3 -m json.tool
curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/api/v1/admin/gateway/stats \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# Operational checks
./scripts/analytics health
./scripts/errors recent
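The second `curl` command prints only an HTTP status code, which still has to be interpreted. The mapping below is an illustrative sketch: `interpret_status` is a hypothetical helper, and the wording of each verdict is an assumption about how the stats endpoint behaves, not documented API behavior. The code `000` is what curl reports when it received no HTTP response at all.

```shell
# interpret_status HTTP_CODE (hypothetical helper): map the code printed by
# curl -w "%{http_code}" to a short verdict.
interpret_status() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth problem: check ADMIN_TOKEN" ;;
    5??)     echo "backend still failing" ;;
    000)     echo "no response: connection failed" ;;
    *)       echo "unexpected status $1" ;;
  esac
}
```

Usage: `interpret_status "$(curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/api/v1/admin/gateway/stats -H "Authorization: Bearer $ADMIN_TOKEN")"`.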

Escalation

If the issue persists after working through all steps:

  1. Collect diagnostics:
    • docker logs curateme-mongo --tail 1000 > /tmp/mongo-incident.log 2>&1
    • docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.serverStatus()" > /tmp/mongo-status.json
    • ./scripts/errors by-source backend > /tmp/backend-errors.log
  2. Record the timeline: when symptoms started, recent deploys, traffic changes
  3. Check for query regressions: ./scripts/errors recent --since 24h
  4. Contact the platform team with collected diagnostics and timeline
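The files collected in step 1 are easier to hand over as a single archive. The sketch below refuses to build a bundle if any expected file is missing or empty, which catches a diagnostic command that silently failed; `bundle_diagnostics` is a hypothetical helper and the archive name in the usage example is illustrative.

```shell
# bundle_diagnostics ARCHIVE FILE... (hypothetical helper)
# Creates a gzip-compressed tar archive of the given files and prints its
# path. Fails without creating anything if a file is missing or empty.
bundle_diagnostics() {
  archive=$1
  shift
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
  tar -czf "$archive" "$@" && echo "$archive"
}
```

Usage: `bundle_diagnostics "/tmp/mongo-incident-$(date +%Y%m%d-%H%M).tar.gz" /tmp/mongo-incident.log /tmp/mongo-status.json /tmp/backend-errors.log`.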

Related runbooks

  • Redis Incident: Redis and MongoDB are both infrastructure dependencies with similar triage patterns
  • Webhook Disaster Recovery: webhook delivery depends on MongoDB for event storage
  • Incident Response: MongoDB outages are typically SEV1 incidents requiring the full incident playbook