Runbook: Incident Response
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1-SEV4
Customer impact: Varies by severity; see classification matrix
Required access: SSH to VPS, Slack #incidents channel
Related services: All production services
This playbook defines how the Curate-Me platform team declares, triages, escalates, and resolves production incidents. Print this page or bookmark it — you will need it under pressure.
Severity Classification Matrix
Every incident gets a severity level. Assign one immediately on declaration — you can adjust later as you learn more.
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| SEV1 — Critical | Complete service outage. Gateway not proxying requests. All customers affected. Data loss possible. | Acknowledge in 5 min. All hands on deck. | Gateway down, MongoDB unresponsive, VPS unreachable |
| SEV2 — Major | Partial outage or severe degradation. A core feature is broken for many customers. | Acknowledge in 15 min. Primary on-call engaged. | Dashboard login broken, rate limiting not enforcing, cost tracking stopped recording |
| SEV3 — Minor | Degraded experience for a subset of users. Workaround exists. | Acknowledge in 1 hour. Fix within business hours. | One provider failover stuck, PII scanner false positives for one org, runner provision slow |
| SEV4 — Low | Cosmetic issue, minor bug, or single-tenant edge case. No business impact. | Acknowledge in 4 hours. Scheduled fix. | Dashboard UI glitch, stale cache for one org, non-critical log noise |
When in doubt, round up. A SEV3 that turns out to be a SEV4 is fine. A SEV1 that was treated as SEV3 is not.
Incident Declaration
When to declare
Declare an incident when any of these are true:
- Gateway health check fails (/v1/health returns non-200)
- More than 3 customer-visible errors in a 5-minute window
- Any SEV1 or SEV2 condition from the matrix above
- A service dependency (Redis, MongoDB) is unreachable
- You are unsure whether something is an incident — declare it anyway
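The measurable triggers above can be sketched as a small shell check. This is a minimal sketch, not the team's tooling: the function name `should_declare` and its three arguments are assumptions made for illustration, and the last trigger ("declare it anyway") stays a human judgment call.

```shell
# Hypothetical helper: decide whether a measurable declaration trigger is met.
# Args: $1 = HTTP status returned by /v1/health
#       $2 = customer-visible errors seen in the last 5 minutes
#       $3 = "yes" if Redis and MongoDB are both reachable, otherwise "no"
# Prints the first trigger that matched; returns 0 to declare, 1 otherwise.
should_declare() (
  health_code=$1
  errors_5m=$2
  deps_ok=$3

  if [ "$health_code" != "200" ]; then
    echo "declare: gateway health check returned $health_code"
    exit 0
  fi
  # "More than 3" means exactly 3 errors does not trigger.
  if [ "$errors_5m" -gt 3 ]; then
    echo "declare: $errors_5m customer-visible errors in 5 minutes"
    exit 0
  fi
  if [ "$deps_ok" != "yes" ]; then
    echo "declare: a service dependency is unreachable"
    exit 0
  fi
  echo "no trigger matched"
  exit 1
)
```

For example, `should_declare 503 0 yes` prints the health-check trigger and returns 0. The status code could come from the same `curl -s -o /dev/null -w "%{http_code}"` call used in the resolution checklist.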
How to declare
- Post in the #incidents Slack channel with the format below
- Assign a severity level
- Assign an Incident Commander (IC) — usually the person who declared it
- Create a shared thread for all updates
Slack declaration format:
INCIDENT DECLARED -- SEV[1-4]
What: [one-line description]
Impact: [who is affected, what is broken]
IC: [your name]
Status: Investigating
Who gets notified
| Severity | Notification |
|---|---|
| SEV1 | Slack #incidents + phone call to Platform Lead + all engineers pinged |
| SEV2 | Slack #incidents + DM to on-call engineer |
| SEV3 | Slack #incidents thread |
| SEV4 | Slack #incidents thread (async) |
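To avoid retyping the declaration under pressure, the Slack format and the notification table above could be wrapped in two shell functions. This is a sketch only: `declare_message` and `notify_targets` are hypothetical names, and posting to Slack is left out because the runbook does not describe any Slack integration.

```shell
# Hypothetical helper: print a declaration in the Slack format shown above.
# Args: $1 = severity number (1-4), $2 = one-line description,
#       $3 = impact, $4 = Incident Commander name
declare_message() {
  case "$1" in
    1|2|3|4) ;;
    *) echo "severity must be 1-4, got: $1" >&2; return 2 ;;
  esac
  printf 'INCIDENT DECLARED -- SEV%s\n' "$1"
  printf 'What: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'IC: %s\n' "$4"
  printf 'Status: Investigating\n'
}

# Hypothetical helper: who to notify, copied from the notification table.
notify_targets() {
  case "$1" in
    1) echo "Slack #incidents + phone call to Platform Lead + all engineers pinged" ;;
    2) echo "Slack #incidents + DM to on-call engineer" ;;
    3) echo "Slack #incidents thread" ;;
    4) echo "Slack #incidents thread (async)" ;;
    *) echo "unknown severity: $1" >&2; return 2 ;;
  esac
}
```

Running `declare_message 1 "Gateway down" "All customers" "your name"` prints a five-line block ready to paste into #incidents.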
Initial Triage (First 5 Minutes)
Run these commands in order. The goal is to understand scope before you start fixing things.
1. Get the error landscape
# View recent production errors -- start here
./scripts/errors recent
# Get a summary of error counts by source
./scripts/errors summary
2. Check service health
# Overall health check
./scripts/analytics health
# Quick container status on VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
3. Check each critical service
# Gateway (the core product)
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-gateway --tail 50 --since 5m"
# Dashboard / Backend API
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-b2b --tail 50 --since 5m"
# Redis
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-redis redis-cli ping"
# MongoDB
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-mongo mongosh --quiet --eval 'db.runCommand({ping:1})'"
4. Check resource utilization
# Disk, memory, CPU on VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h / && echo '---' && free -h && echo '---' && uptime"
# Docker resource usage
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'"
5. Check today’s metrics for anomalies
./scripts/analytics snapshot today
./scripts/analytics costs
./scripts/analytics performance
Post your findings in the Slack incident thread before proceeding to fixes.
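Because one unreachable service can abort a pasted command sequence halfway, the triage steps above could be run through a small wrapper that labels each step and keeps going on failure. This is a sketch under assumptions: `triage_step` is a hypothetical name, and the commented usage lines simply reuse commands already listed in this section.

```shell
# Hypothetical wrapper: run one triage step, label its output, and continue
# even if the command fails, so a single dead service does not hide the rest.
# Args: $1 = label, remaining args = the command to run
triage_step() {
  label=$1
  shift
  printf '=== %s ===\n' "$label"
  if "$@" 2>&1; then
    printf '[ok] %s\n' "$label"
  else
    status=$?
    printf '[FAILED exit=%s] %s\n' "$status" "$label"
  fi
  return 0
}

# Usage (commented out so that sourcing this file has no side effects):
# triage_step "recent errors" ./scripts/errors recent
# triage_step "service health" ./scripts/analytics health
# triage_step "redis ping" ssh "$DEPLOY_USER@$PLATFORM_VPS_IP" \
#   "docker exec curateme-redis redis-cli ping"
```

Redirecting the combined output to a file, for example with `| tee triage.log`, gives a single artifact to paste into the incident thread.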
Escalation Clocks
Once an incident is declared, these timers start. If the current responder has not resolved the issue within the window, escalate.
| Time Elapsed | Action | Who |
|---|---|---|
| 0 min | Incident declared, IC assigned | Declaring engineer (L1) |
| 15 min | If no root cause identified, pull in second engineer | On-call engineer (L2) |
| 30 min | If SEV1/SEV2 not mitigated, escalate to Platform Lead | Platform Lead (L3) |
| 1 hour | If SEV1 still active, notify executive stakeholders | Executive (L4) |
| 2 hours | If SEV1 still active, consider external provider support tickets | Platform Lead + Executive |
Escalation is not failure. Escalate early. The IC retains ownership of the incident even after escalation.
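The escalation clock table can be read as a function of severity and elapsed time. The sketch below encodes the table rows as written. `escalation_due` is a hypothetical name, and applying the 15-minute row to every severity is an assumption, since the table does not restrict that row.

```shell
# Hypothetical helper: list the escalation steps that are due, following the
# escalation clock table.
# Args: $1 = severity number (1-4), $2 = minutes elapsed since declaration
escalation_due() (
  sev=$1
  elapsed=$2

  echo "0 min: incident declared, IC assigned"
  if [ "$elapsed" -ge 15 ]; then
    echo "15 min: if no root cause identified, pull in second engineer"
  fi
  if [ "$elapsed" -ge 30 ] && [ "$sev" -le 2 ]; then
    echo "30 min: if not mitigated, escalate to Platform Lead"
  fi
  if [ "$elapsed" -ge 60 ] && [ "$sev" -eq 1 ]; then
    echo "1 hour: if still active, notify executive stakeholders"
  fi
  if [ "$elapsed" -ge 120 ] && [ "$sev" -eq 1 ]; then
    echo "2 hours: if still active, consider external provider support tickets"
  fi
)
```

For a SEV1 that has been open for 45 minutes, `escalation_due 1 45` lists the 0, 15, and 30 minute rows.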
Communication Templates
Copy-paste these into Slack. Fill in the bracketed fields.
Initial notification
INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
Status: Investigating
Impact: [describe customer impact -- who, what, since when]
Current theory: [what we think is wrong, or "still triaging"]
Next step: [what we are doing right now]
ETA to next update: [time]
Status update (post every 15 min for SEV1, 30 min for SEV2)
INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
Status: [Investigating / Identified / Mitigating / Monitoring]
Root cause: [if known, or "still investigating"]
Actions taken: [what has been done since last update]
Next step: [what happens next]
ETA to resolution: [estimate or "unknown"]
Resolution announcement
INCIDENT RESOLVED -- SEV[X] -- [HH:MM UTC]
Duration: [start time] to [end time] ([X] minutes)
Impact: [summary of what was affected]
Root cause: [one-line root cause]
Resolution: [what fixed it]
Follow-up: Postmortem scheduled for [date/time]
Service Dependency Matrix
Use this to understand blast radius. If a service in the left column goes down, everything in the “Affected By Outage” column is impacted.
| Service | Depends On | Port | Affected By Outage |
|---|---|---|---|
| Gateway | Redis, MongoDB | 8002 | All customer API calls, cost tracking, governance enforcement |
| Dashboard | Backend API | 3001 | Admin UI, org management, analytics views |
| Backend API | MongoDB, Redis | 8001 | Dashboard, all admin operations, fleet management |
| Redis | — (standalone) | 6379 | Rate limiting fails open, cost accumulation stops, cache miss storm |
| MongoDB | — (standalone) | 27017 | All persistence, audit logs, API key validation, org config |
| Runner Containers | Gateway, MongoDB | Varies | Individual agent tasks (isolated per runner) |
| Caddy | — (reverse proxy) | 80/443 | All HTTPS traffic to all services |
Key insight: Redis and MongoDB are single points of failure. If either is down, assume SEV1 until proven otherwise.
Failure mode quick reference
| Dependency Down | Gateway Behavior | Dashboard Behavior |
|---|---|---|
| Redis | Rate limiting disabled (fails open), cost tracking stops | Session cache miss, slower page loads |
| MongoDB | API key auth fails, no audit logging, org config unavailable | Cannot load orgs, policies, or analytics |
| Redis + MongoDB | Gateway effectively down — cannot auth or govern | Dashboard non-functional |
| Backend API | Gateway unaffected (independent) | Dashboard returns 502 on all API calls |
| Caddy | No HTTPS termination — all public traffic fails | Unreachable |
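For quick lookup from a terminal, the failure mode table above could be mirrored in a shell function. This is only a sketch: `blast_radius` and its argument names are hypothetical, and the table remains the source of truth if the two ever disagree.

```shell
# Hypothetical lookup built from the failure mode quick reference.
# Arg: $1 = dependency that is down (redis, mongodb, backend-api, caddy)
blast_radius() {
  case "$1" in
    redis)
      echo "gateway: rate limiting disabled (fails open), cost tracking stops"
      echo "dashboard: session cache miss, slower page loads" ;;
    mongodb)
      echo "gateway: API key auth fails, no audit logging, org config unavailable"
      echo "dashboard: cannot load orgs, policies, or analytics" ;;
    backend-api)
      echo "gateway: unaffected (independent)"
      echo "dashboard: returns 502 on all API calls" ;;
    caddy)
      echo "gateway: no HTTPS termination, all public traffic fails"
      echo "dashboard: unreachable" ;;
    *)
      echo "unknown dependency: $1" >&2
      return 2 ;;
  esac
}
```

`blast_radius redis` prints the gateway and dashboard behavior to expect while Redis is down.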
Common Incident Patterns
Quick lookup table. Match the symptom to find the likely cause and the runbook with detailed steps.
| Symptom | Likely Cause | First Command | Runbook |
|---|---|---|---|
| Gateway returning 502 to all requests | Gateway container crashed or Caddy misconfigured | ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --filter name=gateway" | This page (SEV1 triage) |
| Gateway latency spike (>500ms governance time) | Redis slow or PII scanner on large payload | ./scripts/analytics performance | Gateway High Latency |
| All requests getting rate-limited | Rate limit misconfigured or Redis returning stale counts | ./scripts/errors by-source gateway | Rate Limit Exceeded |
| Cost tracking shows $0 for the day | Redis cost accumulator stopped flushing to MongoDB | ./scripts/analytics costs | Cost Accumulation Lag |
| Budget exceeded alerts firing incorrectly | Cost spike from expensive model or runaway automation | ./scripts/analytics costs | Budget Exceeded |
| Dashboard login page blank or erroring | Next.js build issue or backend API unreachable | ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-dashboard --tail 30" | This page (Step 3) |
| PII scanner blocking legitimate requests | Regex false positive on customer data | ./scripts/errors by-source pii | PII Blocked |
| Runner container stuck in “running” but not responding | OpenClaw gateway crash or OOM inside container | Check runner heartbeat via API | Runner Stuck |
| Runner fails to provision | Docker image pull failure or disk full | ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h /" | Runner Provision Failure |
| Multiple governance steps denying in sequence | Cascading policy misconfiguration | ./scripts/errors by-source governance | Governance Cascading Denials |
| Provider failover cycling between providers | All configured providers degraded | Check provider status pages | Provider Failover Loop |
Post-Incident Review
Schedule a postmortem within 48 hours of any SEV1 or SEV2 incident. SEV3 incidents get a lightweight written review. SEV4 does not require a postmortem.
Postmortem template
Use this structure for every postmortem document.
# Postmortem: [Incident Title]
**Date:** [YYYY-MM-DD]
**Severity:** SEV[X]
**Duration:** [start] to [end] ([X] minutes)
**Incident Commander:** [name]
**Author:** [name]
## Summary
[2-3 sentences: what happened, who was affected, how it was resolved]
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [first sign of issue] |
| HH:MM | [incident declared] |
| HH:MM | [root cause identified] |
| HH:MM | [mitigation applied] |
| HH:MM | [incident resolved] |
## Impact
- **Customer impact:** [number of orgs affected, request failure rate, duration]
- **Revenue impact:** [if any -- gateway downtime means customer calls fail]
- **Data impact:** [any data loss or corruption]
## Root Cause
[Detailed technical explanation. Be specific -- not "Redis was slow" but
"Redis maxmemory was set to 256MB and eviction policy was noeviction,
causing write failures when cost accumulator keys exceeded the limit."]
## What Went Well
- [things that worked during the response]
## What Went Poorly
- [things that slowed down detection or resolution]
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| [specific fix or improvement] | [name] | P1/P2/P3 | [date] |
## Lessons Learned
- [key takeaways for the team]
Postmortem rules
- Blameless. Focus on systems and processes, not individuals.
- Concrete action items. Every postmortem must produce at least one action item with an owner and due date.
- Share widely. Post the completed postmortem in #incidents and link it from the incident thread.
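Both the resolution announcement and the postmortem header ask for the incident duration in minutes. A minimal sketch for computing it from two UTC times follows. `duration_minutes` is a hypothetical helper, and it assumes two-digit `HH:MM` input and an incident shorter than 24 hours.

```shell
# Hypothetical helper: minutes between two UTC times given as HH:MM.
# If the end time is earlier than the start time, the incident is assumed to
# have crossed midnight once, so 24 hours are added.
duration_minutes() (
  start=$1
  end=$2
  s_h=${start%%:*}; s_m=${start##*:}
  e_h=${end%%:*};   e_m=${end##*:}
  # Strip one leading zero so "08" is not parsed as an invalid octal number.
  s_h=${s_h#0}; s_m=${s_m#0}; e_h=${e_h#0}; e_m=${e_m#0}
  total=$(( (e_h * 60 + e_m) - (s_h * 60 + s_m) ))
  if [ "$total" -lt 0 ]; then
    total=$(( total + 1440 ))
  fi
  echo "$total"
)
```

For example, `duration_minutes 14:05 15:20` prints `75`, which fills the `([X] minutes)` field.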
Escalation Contacts
| Role | Responsibility | When to Contact |
|---|---|---|
| On-Call Engineer | First responder. Triage, initial mitigation. | Any declared incident |
| Platform Lead | Architecture decisions, cross-service coordination. | SEV1 after 15 min, SEV2 after 30 min, any incident requiring infra changes |
| Executive | Customer communication, business impact decisions. | SEV1 after 1 hour, any data breach or security incident |
On-call expectations
- Acknowledge alerts within 5 minutes during on-call hours
- Have VPS SSH access and production credentials ready
- Keep the incident Slack thread updated every 15 minutes during active SEV1/SEV2
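The access expectation above can be checked at the start of a shift rather than discovered mid-incident. The sketch below only verifies that the two variables used by the SSH commands in this runbook are set. `oncall_preflight` is a hypothetical name, and it does not test the credentials themselves.

```shell
# Hypothetical pre-flight check for the start of an on-call shift.
# Verifies that the variables used by this runbook's SSH commands are set.
# Prints each missing variable; returns 0 only if all are present.
oncall_preflight() (
  missing=0
  for name in DEPLOY_USER PLATFORM_VPS_IP; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  exit "$missing"
)
```

If the check passes, a follow-up such as `ssh -o BatchMode=yes -o ConnectTimeout=5 "$DEPLOY_USER@$PLATFORM_VPS_IP" true` would confirm that key-based SSH access actually works.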
Incident Closure Criteria
An incident is not resolved until every item on this checklist is confirmed.
Resolution checklist
- Service health confirmed: all containers running, health endpoints returning 200
  ./scripts/analytics health
  ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format '{{.Names}}: {{.Status}}'"
- Error rate returned to baseline: no new errors of the same type
  ./scripts/errors recent
  ./scripts/errors summary
- Metrics look normal: latency, request volume, cost tracking all within expected ranges
  ./scripts/analytics performance
  ./scripts/analytics snapshot today
- Customer-facing impact verified resolved: test a real request through the gateway
  curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/v1/health
  curl -s -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai
- Incident thread updated: resolution announcement posted in Slack
- Monitoring confirmed: watch for 15 minutes after fix to ensure no recurrence
- Postmortem scheduled: for SEV1/SEV2, calendar invite sent within 24 hours
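The 15-minute watch in the checklist could be automated with a small polling loop. This is a sketch, not an existing script: `watch_health` is a hypothetical name, and the probe is passed in as a command so that the same loop works for the gateway and the dashboard.

```shell
# Hypothetical post-fix watch: run a probe repeatedly and stop at the first
# result that is not 200.
# Args: $1 = number of checks, $2 = seconds between checks,
#       remaining args = a command that prints an HTTP status code
watch_health() (
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    # A probe that exits non-zero (for example, connection refused) is a failure.
    code=$("$@") || code="probe-error"
    if [ "$code" != "200" ]; then
      echo "check $i/$checks failed: $code"
      exit 1
    fi
    echo "check $i/$checks ok"
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
)
```

Sixteen checks spaced 60 seconds apart span the 15-minute window: `watch_health 16 60 curl -s -o /dev/null -w '%{http_code}' https://api.curate-me.ai/v1/health`.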
Related Runbooks
- Deployment Procedure — rollback steps and emergency hotfix deploys during incidents
- Redis Incident — Redis failure is a common trigger for SEV1/SEV2 incidents
- MongoDB Incident — MongoDB failure affects persistence, auth, and audit logging
- Gateway High Latency — latency spikes are a frequent incident symptom