Runbook: MongoDB Incident Response
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1
Customer impact: API key auth fails, audit logging stops, org config unavailable
Required access: SSH to VPS, MongoDB admin credentials
Related services: curateme-mongo
This runbook covers diagnosing and resolving MongoDB issues on the Curate-Me platform. MongoDB stores audit logs, org configurations, API keys, A2A inbox, fleet deployments, autopilot results, and usage records. Follow the steps in order.
Symptoms
- Backend API returning 500 errors with ServerSelectionTimeoutError or AutoReconnect in logs
- Dashboard pages loading with empty data or “Service unavailable” banners
- Gateway governance chain works (Redis-backed) but audit trail is missing
- A2A task inbox not accepting new tasks or returning stale results
- Fleet deployment status stuck in deploying state
- ./scripts/errors recent shows spikes in MongoDB-related errors
- ./scripts/analytics health reports mongodb_connected: false
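As a rough first check, the driver error names from the list above can be counted in a log stream. The sketch below is illustrative only: count_mongo_errors is a hypothetical helper name, and it assumes the backend logs contain those error names verbatim.

```shell
# Count log lines that mention the MongoDB driver errors from the symptom list.
# Reads log text on stdin and prints the number of matching lines.
count_mongo_errors() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable under "set -e" while still printing 0.
  grep -cE 'ServerSelectionTimeoutError|AutoReconnect' || true
}

# Example (container name taken from Step 4 of this runbook):
#   docker logs curateme-backend-b2b --since 15m 2>&1 | count_mongo_errors
```

A count that keeps rising between two runs suggests the backend still cannot reach MongoDB.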
Step 1: Check MongoDB container health
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
# Container status (include stopped containers with -a)
docker ps -a --filter name=curateme-mongo --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Resource usage
docker stats curateme-mongo --no-stream
# Recent errors in container logs
docker logs curateme-mongo --tail 500 --timestamps 2>&1 | grep -iE "error|fatal|oom|killed"
# Verify MongoDB accepts connections
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.runCommand({ping: 1})"
# Expected: { ok: 1 }
Step 2: Connection pool diagnostics
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
const s = db.serverStatus();
print('Connections: ' + s.connections.current + ' / ' + s.connections.available);
print('Active ops: ' + s.globalLock.activeClients.total);
print('Queued R/W: ' + s.globalLock.currentQueue.readers + '/' + s.globalLock.currentQueue.writers);
"
| Metric | Healthy | Investigate if |
|---|---|---|
| Current connections | < 100 | > 200 (pool exhaustion) |
| Available connections | > 500 | < 50 (running out) |
| Queued readers/writers | 0 | > 10 (lock contention) |
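The thresholds in the table can be applied mechanically. This is a minimal sketch, assuming the four numbers are copied by hand from the Step 2 output. check_mongo_thresholds is a hypothetical helper, not an existing platform script, and applying the queue limit to readers and writers separately is an assumption.

```shell
# Compare Step 2 metrics with the "Investigate if" column of the table.
# Args: current connections, available connections, queued readers, queued writers.
# Prints one "investigate: ..." line per crossed threshold, or "ok" if none is crossed.
check_mongo_thresholds() {
  local current="$1" available="$2" queued_readers="$3" queued_writers="$4"
  local flagged=0
  if [ "$current" -gt 200 ]; then
    echo "investigate: pool exhaustion (current=$current)"
    flagged=1
  fi
  if [ "$available" -lt 50 ]; then
    echo "investigate: running out of connections (available=$available)"
    flagged=1
  fi
  if [ "$queued_readers" -gt 10 ] || [ "$queued_writers" -gt 10 ]; then
    echo "investigate: lock contention (queued=$queued_readers/$queued_writers)"
    flagged=1
  fi
  if [ "$flagged" -eq 0 ]; then
    echo "ok"
  fi
  return 0
}

# Example: check_mongo_thresholds 250 40 0 12
```

Values between the Healthy and Investigate columns print "ok" here, so the output only means that no investigate threshold was crossed.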
List active operations to find what is holding connections:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
db.currentOp({active: true}).inprog.forEach(function(op) {
print(op.client + ' | ' + op.op + ' | ' + op.ns + ' | ' + op.secs_running + 's');
});
"
Step 3: Identify root cause
Cause A: Container crash / OOM kill
Diagnosis:
docker inspect curateme-mongo --format='OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}'
# Exit code 137 = SIGKILL/OOM, 1 = internal error
dmesg | grep -i "oom\|mongo" | tail -20
Fix (immediate): Restart the container and verify:
docker restart curateme-mongo && sleep 5
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.runCommand({ping: 1})"
Fix (long-term): Set memory limits in docker-compose.production.yml and cap the WiredTiger cache:
# Under the mongo service:
command: ["mongod", "--wiredTigerCacheSizeGB", "0.5"]
deploy:
  resources:
    limits:
      memory: 2G
Cause B: Connection pool exhaustion
Diagnosis: In the Step 2 output, current connections are high (> 200) and available connections are approaching zero (< 50).
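The two connection counters can also be read together as a single utilization figure. This is a small sketch, assuming current plus available approximates the server's incoming connection limit; pool_utilization is a hypothetical helper name.

```shell
# Percentage of the server's connection budget in use, rounded down.
# Args: current connections, available connections (both from Step 2).
pool_utilization() {
  local current="$1" available="$2"
  local total=$((current + available))
  if [ "$total" -le 0 ]; then
    echo "no connection data" >&2
    return 1
  fi
  echo $((current * 100 / total))
}

# Example: pool_utilization 200 50   # prints 80
```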
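Fix (immediate): Kill client operations that have been running for more than five minutes. The filter targets active operations because secs_running is only reported while an operation is running:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
db.currentOp({active: true, secs_running: {'\$gt': 300}}).inprog.forEach(function(op) {
  if (op.client === undefined) return; // skip operations with no client address (typically internal)
  db.killOp(op.opid);
  print('Killed: ' + op.opid + ' from ' + op.client);
});
"
Fix (long-term): Ensure bounded connection pools in MONGO_URI:
MONGO_URI=mongodb://admin:password@host:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=30000
Cause C: Slow queries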
Diagnosis: Find long-running operations and enable the profiler:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
db.currentOp({active: true, secs_running: {'\$gt': 5}}).inprog.forEach(function(op) {
  print(op.opid + ' | ' + op.secs_running + 's | ' + op.ns + ' | ' + JSON.stringify(op.command));
});
// Profiling is per-database, so enable it on the application database
db.getSiblingDB('curateme_b2b').setProfilingLevel(1, {slowms: 100});
"
# After a few minutes, check profiled slow queries:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
db.getSiblingDB('curateme_b2b').system.profile.find({millis: {'\$gt': 100}}).sort({ts: -1}).limit(10).forEach(function(d) {
  print(d.ns + ' | ' + d.millis + 'ms | ' + JSON.stringify(d.command));
});
"
Fix: Kill the slow query with db.killOp(opid), using the opid printed above, then add missing indexes:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
db.a2a_inbox.createIndex({agent_id: 1, status: 1});
db.a2a_audit_log.createIndex({timestamp: -1, org_id: 1});
db.managed_runners.createIndex({fleet_id: 1, status: 1});
db.autopilot_results.createIndex({org_id: 1, created_at: -1});
db.usage_records.createIndex({org_id: 1, timestamp: -1});
print('Indexes created');
"
Cause D: Disk space exhaustion
Diagnosis:
df -h /var/lib/docker
docker exec curateme-mongo du -sh /data/db/
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
db.adminCommand({listDatabases: 1}).databases.forEach(function(d) {
print(d.name + ': ' + (d.sizeOnDisk / 1024 / 1024).toFixed(2) + ' MB');
});
"
Fix (immediate): Rotate old data and compact:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);
const r = db.a2a_audit_log.deleteMany({timestamp: {'\$lt': cutoff}});
print('Deleted ' + r.deletedCount + ' old audit entries');
db.runCommand({compact: 'a2a_audit_log'});
db.runCommand({compact: 'usage_records'});
"
Fix (long-term): Set a TTL index for automatic expiration:
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" curateme_b2b --eval "
db.a2a_audit_log.createIndex({timestamp: 1}, {expireAfterSeconds: 7776000}); // 90 days
// system.profile is a capped collection: it cannot take a TTL index and already discards its oldest entries
"
Step 4: Backup and restore
If data corruption is suspected, restore from the most recent backup.
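Picking the most recent archive can be scripted instead of read off a directory listing. This is a minimal sketch, assuming backups are .gz archives in a single directory and that modification time reflects backup age; latest_backup is a hypothetical helper name.

```shell
# Print the path of the newest *.gz archive in a backup directory,
# judged by modification time. Prints nothing if no archive exists.
latest_backup() {
  local dir="$1"
  # ls -t sorts newest first; errors from an unmatched glob are discarded.
  ls -1t "$dir"/*.gz 2>/dev/null | head -n 1
}

# Example (backup directory as used in this runbook):
#   latest_backup /home/curateme/backups/mongodb
```

Check the printed file's date against the incident timeline before restoring, since the newest archive may already contain the corrupted data.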
# Check available backups
ls -lah /home/curateme/backups/mongodb/
# Create emergency backup before restoring
BACKUP_FILE="emergency-backup-$(date +%Y%m%d-%H%M).gz"
docker exec curateme-mongo mongodump \
  -u admin -p "$MONGO_ADMIN_PASSWORD" \
  --archive="/tmp/$BACKUP_FILE" --gzip
# docker cp does not expand wildcards, so copy the exact file name
docker cp "curateme-mongo:/tmp/$BACKUP_FILE" /home/curateme/backups/mongodb/
# Stop app services, restore, restart
docker stop curateme-backend-b2b curateme-backend-gateway
docker cp /home/curateme/backups/mongodb/curateme-mongo-backup-YYYY-MM-DD.gz curateme-mongo:/tmp/restore.gz
docker exec curateme-mongo mongorestore \
  -u admin -p "$MONGO_ADMIN_PASSWORD" \
  --archive=/tmp/restore.gz --gzip --drop
docker start curateme-backend-b2b curateme-backend-gateway
Step 5: Verify resolution
# MongoDB health
docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "
print('Ping: ' + JSON.stringify(db.runCommand({ping: 1})));
const s = db.serverStatus();
print('Connections: ' + s.connections.current + '/' + s.connections.available);
print('Uptime: ' + s.uptime + 's');
"
# Platform health
curl -s https://api.curate-me.ai/api/v1/health | python3 -m json.tool
curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/api/v1/admin/gateway/stats \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Operational checks
./scripts/analytics health
./scripts/errors recent
Escalation
If the issue persists after working through all steps:
- Collect diagnostics:
  docker logs curateme-mongo --tail 1000 > /tmp/mongo-incident.log 2>&1
  docker exec curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" --eval "db.serverStatus()" > /tmp/mongo-status.json
  ./scripts/errors by-source backend > /tmp/backend-errors.log
- Record the timeline: when symptoms started, recent deploys, traffic changes
- Check for query regressions:
  ./scripts/errors recent --since 24h
- Contact the platform team with collected diagnostics and timeline
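The collected files are easier to hand over as a single archive. This is a minimal sketch, assuming tar is available on the VPS; bundle_diagnostics and the archive name are hypothetical, while the input paths are the ones written by the commands above.

```shell
# Bundle diagnostic files into one gzip-compressed tar archive.
# Args: output archive path, then any number of input files.
# Missing inputs are skipped so a partial collection still produces a bundle.
bundle_diagnostics() {
  local out="$1" n f
  shift
  # Rotate through the arguments once, keeping only files that exist.
  n=$#
  while [ "$n" -gt 0 ]; do
    f="$1"
    shift
    n=$((n - 1))
    if [ -f "$f" ]; then
      set -- "$@" "$f"
    fi
  done
  if [ "$#" -eq 0 ]; then
    echo "no diagnostic files found" >&2
    return 1
  fi
  tar -czf "$out" "$@" && echo "$out"
}

# Example:
#   bundle_diagnostics "/tmp/mongo-incident-$(date +%Y%m%d-%H%M).tar.gz" \
#     /tmp/mongo-incident.log /tmp/mongo-status.json /tmp/backend-errors.log
```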
Related Runbooks
- Redis Incident — Redis and MongoDB are both infrastructure dependencies with similar triage patterns
- Webhook Disaster Recovery — webhook delivery depends on MongoDB for event storage
- Incident Response — MongoDB outages are typically SEV1 incidents requiring the full incident playbook