Runbook: Incident Response

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1-SEV4
Customer impact: Varies by severity (see classification matrix)
Required access: SSH to VPS, Slack #incidents channel
Related services: All production services

This runbook defines how the Curate-Me platform team declares, triages, escalates, and resolves production incidents. Print this page or bookmark it; you will need it under pressure.

Severity Classification Matrix

Every incident gets a severity level. Assign one immediately on declaration — you can adjust later as you learn more.

Severity	Criteria	Response Time	Example
SEV1 — Critical	Complete service outage. Gateway not proxying requests. All customers affected. Data loss possible.	Acknowledge in 5 min. All hands on deck.	Gateway down, MongoDB unresponsive, VPS unreachable
SEV2 — Major	Partial outage or severe degradation. A core feature is broken for many customers.	Acknowledge in 15 min. Primary on-call engaged.	Dashboard login broken, rate limiting not enforcing, cost tracking stopped recording
SEV3 — Minor	Degraded experience for a subset of users. Workaround exists.	Acknowledge in 1 hour. Fix within business hours.	One provider failover stuck, PII scanner false positives for one org, runner provision slow
SEV4 — Low	Cosmetic issue, minor bug, or single-tenant edge case. No business impact.	Acknowledge in 4 hours. Scheduled fix.	Dashboard UI glitch, stale cache for one org, non-critical log noise

When in doubt, round up. A SEV3 that turns out to be a SEV4 is fine. A SEV1 that was treated as SEV3 is not.
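
The acknowledgment windows in the matrix can be encoded as a small lookup, for example inside an alerting wrapper. This is a minimal sketch: the function name `ack_minutes` is hypothetical and not part of the platform scripts, and treating unrecognized input as SEV1 is an assumption that follows the round-up rule.

```shell
# Hypothetical helper: print the acknowledgment window (in minutes) for a severity.
# Values mirror the Response Time column of the severity matrix.
ack_minutes() {
  case "$1" in
    SEV1) echo 5 ;;
    SEV2) echo 15 ;;
    SEV3) echo 60 ;;
    SEV4) echo 240 ;;
    # Unrecognized input: round up and treat it as SEV1 (assumption).
    *) echo 5 ;;
  esac
}
```

For example, `ack_minutes SEV3` prints `60`.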

Incident Declaration

When to declare

Declare an incident when any of these are true:

Gateway health check fails (/v1/health returns non-200)
More than 3 customer-visible errors in a 5-minute window
Any SEV1 or SEV2 condition from the matrix above
A service dependency (Redis, MongoDB) is unreachable
You are unsure whether something is an incident — declare it anyway
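
The two measurable triggers in this list (health check status and error count) could be checked with a sketch like the one below. The function names are hypothetical. The error count is passed in by hand because the output format of `./scripts/errors` is not documented here, and the 10-second timeout is an assumption.

```shell
# Hypothetical helper: decide whether the measurable declaration triggers are met.
#   $1 = HTTP status returned by the gateway health check
#   $2 = customer-visible errors seen in the last 5 minutes
# Exit status 0 means "declare an incident".
should_declare() {
  health_code=$1
  error_count=$2
  # Trigger: /v1/health returns non-200
  if [ "$health_code" != "200" ]; then
    return 0
  fi
  # Trigger: more than 3 customer-visible errors in a 5-minute window
  if [ "$error_count" -gt 3 ]; then
    return 0
  fi
  return 1
}

# Probe the public health endpoint; curl reports 000 when no response is received.
gateway_health_code() {
  curl -s -o /dev/null -m 10 -w "%{http_code}" https://api.curate-me.ai/v1/health
}
```

A possible invocation is `should_declare "$(gateway_health_code)" 0 && echo "Declare an incident"`. The remaining triggers (dependency unreachable, general uncertainty) still need human judgment.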

How to declare

Post in the #incidents Slack channel with the format below
Assign a severity level
Assign an Incident Commander (IC) — usually the person who declared it
Create a shared thread for all updates

Slack declaration format:


INCIDENT DECLARED -- SEV[1-4]
What: [one-line description]
Impact: [who is affected, what is broken]
IC: [your name]
Status: Investigating
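
To avoid typos under pressure, the declaration can be rendered from arguments and pasted into Slack. A minimal sketch; `format_declaration` is a hypothetical name, and it only prints text rather than posting anywhere.

```shell
# Hypothetical helper: render the declaration message in the format above.
#   $1 = severity number (1-4), $2 = what, $3 = impact, $4 = IC name
format_declaration() {
  printf 'INCIDENT DECLARED -- SEV%s\n' "$1"
  printf 'What: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'IC: %s\n' "$4"
  printf 'Status: Investigating\n'
}
```

Example: `format_declaration 2 "Dashboard login broken" "All dashboard users" "your-name"`.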

Who gets notified

Severity	Notification
SEV1	Slack `#incidents` + phone call to Platform Lead + all engineers pinged
SEV2	Slack `#incidents` + DM to on-call engineer
SEV3	Slack `#incidents` thread
SEV4	Slack `#incidents` thread (async)

Initial Triage (First 5 Minutes)

Run these commands in order. The goal is to understand scope before you start fixing things.

1. Get the error landscape


# View recent production errors -- start here
./scripts/errors recent
 
# Get a summary of error counts by source
./scripts/errors summary

2. Check service health


# Overall health check
./scripts/analytics health
 
# Quick container status on VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"

3. Check each critical service


# Gateway (the core product)
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-gateway --tail 50 --since 5m"
 
# Dashboard / Backend API
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-b2b --tail 50 --since 5m"
 
# Redis
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-redis redis-cli ping"
 
# MongoDB
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-mongo mongosh --quiet --eval 'db.runCommand({ping:1})'"

4. Check resource utilization


# Disk, memory, CPU on VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h / && echo '---' && free -h && echo '---' && uptime"
 
# Docker resource usage
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'"

5. Check today’s metrics for anomalies


./scripts/analytics snapshot today
./scripts/analytics costs
./scripts/analytics performance

Post your findings in the Slack incident thread before proceeding to fixes.
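
One way to gather findings for the thread is to run a subset of the triage commands through a wrapper that echoes each command before its output. This is a sketch, not an existing platform script: `run`, `triage`, and the `DRY_RUN` variable are hypothetical names, and the command list is a shortened selection of the steps above.

```shell
# Hypothetical wrapper: echo each command, then run it unless DRY_RUN=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    # Keep going on failure so one dead service does not hide the rest.
    "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  fi
}

# Runs a subset of the triage steps in order; expects DEPLOY_USER and
# PLATFORM_VPS_IP to be set, as in the commands above.
triage() {
  vps="${DEPLOY_USER}@${PLATFORM_VPS_IP}"
  run ./scripts/errors recent
  run ./scripts/errors summary
  run ./scripts/analytics health
  run ssh "$vps" "docker exec curateme-redis redis-cli ping"
  run ssh "$vps" "docker exec curateme-mongo mongosh --quiet --eval 'db.runCommand({ping:1})'"
  run ssh "$vps" "df -h / && free -h && uptime"
}
```

Running `triage | tee triage.txt` keeps a copy that can be pasted into the incident thread. Setting `DRY_RUN=1` first prints the commands without executing them.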

Escalation Clocks

Once an incident is declared, these timers start. If the current responder has not resolved the issue within the window, escalate.

Time Elapsed	Action	Who
0 min	Incident declared, IC assigned	Declaring engineer (L1)
15 min	If no root cause identified, pull in second engineer	On-call engineer (L2)
30 min	If SEV1/SEV2 not mitigated, escalate to Platform Lead	Platform Lead (L3)
1 hour	If SEV1 still active, notify executive stakeholders	Executive (L4)
2 hours	If SEV1 still active, consider external provider support tickets	Platform Lead + Executive

Escalation is not failure. Escalate early. The IC retains ownership of the incident even after escalation.
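
The escalation table can be read as a function of elapsed time and severity. The sketch below is illustrative only: `escalation_step` and its output labels are hypothetical, and it assumes the table's conditions hold (no root cause identified, incident not mitigated).

```shell
# Hypothetical helper: print the furthest escalation step that is due.
#   $1 = minutes since declaration, $2 = severity (SEV1..SEV4)
escalation_step() {
  elapsed=$1
  sev=$2
  step="declaring-engineer"
  # 15 min: pull in a second engineer (any severity)
  if [ "$elapsed" -ge 15 ]; then step="second-engineer"; fi
  # 30 min: escalate to Platform Lead, SEV1/SEV2 only
  if [ "$elapsed" -ge 30 ]; then
    case "$sev" in SEV1|SEV2) step="platform-lead" ;; esac
  fi
  # 1 hour and 2 hours: SEV1 only
  if [ "$sev" = "SEV1" ]; then
    if [ "$elapsed" -ge 60 ]; then step="executive"; fi
    if [ "$elapsed" -ge 120 ]; then step="external-support-tickets"; fi
  fi
  echo "$step"
}
```

For instance, `escalation_step 45 SEV2` prints `platform-lead`, while `escalation_step 45 SEV3` stays at `second-engineer`.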

Communication Templates

Copy-paste these into Slack. Fill in the bracketed fields.

Initial notification


INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
Status: Investigating
Impact: [describe customer impact -- who, what, since when]
Current theory: [what we think is wrong, or "still triaging"]
Next step: [what we are doing right now]
ETA to next update: [time]

Status update (post every 15 min for SEV1, 30 min for SEV2)


INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
Status: [Investigating / Identified / Mitigating / Monitoring]
Root cause: [if known, or "still investigating"]
Actions taken: [what has been done since last update]
Next step: [what happens next]
ETA to resolution: [estimate or "unknown"]

Resolution announcement


INCIDENT RESOLVED -- SEV[X] -- [HH:MM UTC]
Duration: [start time] to [end time] ([X] minutes)
Impact: [summary of what was affected]
Root cause: [one-line root cause]
Resolution: [what fixed it]
Follow-up: Postmortem scheduled for [date/time]
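
The Duration field needs the number of minutes between two UTC timestamps. A minimal sketch of that arithmetic, using a hypothetical `duration_minutes` helper and assuming `HH:MM` input and an incident shorter than 24 hours:

```shell
# Hypothetical helper: minutes between two HH:MM UTC timestamps.
# An end time earlier than the start time is treated as crossing midnight.
duration_minutes() {
  start_h=${1%:*}; start_m=${1#*:}
  end_h=${2%:*};   end_m=${2#*:}
  # Strip one leading zero so values like 08 are not parsed as octal.
  start_h=${start_h#0}; start_m=${start_m#0}
  end_h=${end_h#0};     end_m=${end_m#0}
  start=$(( start_h * 60 + start_m ))
  end=$(( end_h * 60 + end_m ))
  echo $(( (end - start + 1440) % 1440 ))
}
```

`duration_minutes 13:50 14:35` prints `45`, and `duration_minutes 23:50 00:10` prints `20`.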

Service Dependency Matrix

Use this to understand blast radius. If a service in the left column goes down, everything in the “Affected By Outage” column is impacted.

Service	Depends On	Port	Affected By Outage
Gateway	Redis, MongoDB	8002	All customer API calls, cost tracking, governance enforcement
Dashboard	Backend API	3001	Admin UI, org management, analytics views
Backend API	MongoDB, Redis	8001	Dashboard, all admin operations, fleet management
Redis	None (standalone)	6379	Rate limiting fails open, cost accumulation stops, cache miss storm
MongoDB	None (standalone)	27017	All persistence, audit logs, API key validation, org config
Runner Containers	Gateway, MongoDB	Varies	Individual agent tasks (isolated per runner)
Caddy	None (reverse proxy)	80/443	All HTTPS traffic to all services

Key insight: Redis and MongoDB are single points of failure. If either is down, assume SEV1 until proven otherwise.

Failure mode quick reference

Dependency Down	Gateway Behavior	Dashboard Behavior
Redis	Rate limiting disabled (fails open), cost tracking stops	Session cache miss, slower page loads
MongoDB	API key auth fails, no audit logging, org config unavailable	Cannot load orgs, policies, or analytics
Redis + MongoDB	Gateway effectively down — cannot auth or govern	Dashboard non-functional
Backend API	Gateway unaffected (independent)	Dashboard returns 502 on all API calls
Caddy	No HTTPS termination — all public traffic fails	Unreachable
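
The Redis and MongoDB rows of the quick reference amount to a two-input lookup. The sketch below encodes them; `gateway_failure_mode` and `assume_sev1` are hypothetical names, the behavior strings follow the Gateway Behavior column, and the "Normal operation" label for the all-up case is an assumption added for completeness.

```shell
# Hypothetical helper: expected gateway behavior for a given dependency state.
#   $1 = Redis state (up|down), $2 = MongoDB state (up|down)
gateway_failure_mode() {
  case "$1/$2" in
    up/up)     echo "Normal operation" ;;
    down/up)   echo "Rate limiting disabled (fails open), cost tracking stops" ;;
    up/down)   echo "API key auth fails, no audit logging, org config unavailable" ;;
    down/down) echo "Gateway effectively down: cannot auth or govern" ;;
    *)         echo "Unknown state: $1/$2" >&2; return 2 ;;
  esac
}

# Either store being down is treated as SEV1 until proven otherwise.
assume_sev1() {
  [ "$1" = "down" ] || [ "$2" = "down" ]
}
```

For example, `assume_sev1 down up && gateway_failure_mode down up` reports the Redis-only failure mode.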

Common Incident Patterns

Quick lookup table. Match the symptom to find the likely cause and the runbook with detailed steps.

Symptom	Likely Cause	First Command	Runbook
Gateway returning 502 to all requests	Gateway container crashed or Caddy misconfigured	`ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps | grep gateway"`	This page (Initial Triage)
Gateway latency spike (>500ms governance time)	Redis slow or PII scanner on large payload	`./scripts/analytics performance`	Gateway High Latency
All requests getting rate-limited	Rate limit misconfigured or Redis returning stale counts	`./scripts/errors by-source gateway`	Rate Limit Exceeded
Cost tracking shows $0 for the day	Redis cost accumulator stopped flushing to MongoDB	`./scripts/analytics costs`	Cost Accumulation Lag
Budget exceeded alerts firing unexpectedly	Cost spike from expensive model or runaway automation	`./scripts/analytics costs`	Budget Exceeded
Dashboard login page blank or erroring	Next.js build issue or backend API unreachable	`ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-dashboard --tail 30"`	This page (Initial Triage, step 3)
PII scanner blocking legitimate requests	Regex false positive on customer data	`./scripts/errors by-source pii`	PII Blocked
Runner container stuck in “running” but not responding	OpenClaw gateway crash or OOM inside container	Check runner heartbeat via API	Runner Stuck
Runner fails to provision	Docker image pull failure or disk full	`ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h /"`	Runner Provision Failure
Multiple governance steps denying in sequence	Cascading policy misconfiguration	`./scripts/errors by-source governance`	Governance Cascading Denials
Provider failover cycling between providers	All configured providers degraded	Check provider status pages	Provider Failover Loop

Post-Incident Review

Schedule a postmortem within 48 hours of any SEV1 or SEV2 incident. SEV3 incidents get a lightweight written review. SEV4 does not require a postmortem.

Postmortem template

Use this structure for every postmortem document.


# Postmortem: [Incident Title]
 
**Date:** [YYYY-MM-DD]
**Severity:** SEV[X]
**Duration:** [start] to [end] ([X] minutes)
**Incident Commander:** [name]
**Author:** [name]
 
## Summary
[2-3 sentences: what happened, who was affected, how it was resolved]
 
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [first sign of issue] |
| HH:MM | [incident declared] |
| HH:MM | [root cause identified] |
| HH:MM | [mitigation applied] |
| HH:MM | [incident resolved] |
 
## Impact
- **Customer impact:** [number of orgs affected, request failure rate, duration]
- **Revenue impact:** [if any -- gateway downtime means customer calls fail]
- **Data impact:** [any data loss or corruption]
 
## Root Cause
[Detailed technical explanation. Be specific -- not "Redis was slow" but
"Redis maxmemory was set to 256MB and eviction policy was noeviction,
causing write failures when cost accumulator keys exceeded the limit."]
 
## What Went Well
- [things that worked during the response]
 
## What Went Poorly
- [things that slowed down detection or resolution]
 
## Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| [specific fix or improvement] | [name] | P1/P2/P3 | [date] |
 
## Lessons Learned
- [key takeaways for the team]

Postmortem rules

Blameless. Focus on systems and processes, not individuals.
Concrete action items. Every postmortem must produce at least one action item with an owner and due date.
Share widely. Post the completed postmortem in #incidents and link it from the incident thread.

Escalation Contacts

Role	Responsibility	When to Contact
On-Call Engineer	First responder. Triage, initial mitigation.	Any declared incident
Platform Lead	Architecture decisions, cross-service coordination.	SEV1 after 15 min, SEV2 after 30 min, any incident requiring infra changes
Executive	Customer communication, business impact decisions.	SEV1 after 1 hour, any data breach or security incident

On-call expectations

Acknowledge alerts within 5 minutes during on-call hours
Have VPS SSH access and production credentials ready
Keep the incident Slack thread updated every 15 minutes during active SEV1/SEV2

Incident Closure Criteria

An incident is not resolved until every item on this checklist is confirmed.

Resolution checklist

Service health confirmed — all containers running, health endpoints returning 200


./scripts/analytics health
ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format '{{.Names}}: {{.Status}}'"

Error rate returned to baseline — no new errors of the same type
```
./scripts/errors recent
./scripts/errors summary
```
Metrics look normal — latency, request volume, cost tracking all within expected ranges
```
./scripts/analytics performance
./scripts/analytics snapshot today
```

Customer-facing impact verified resolved — test a real request through the gateway


curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/v1/health
curl -s -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai

Incident thread updated — resolution announcement posted in Slack
Monitoring confirmed — watch for 15 minutes after fix to ensure no recurrence
Postmortem scheduled — for SEV1/SEV2, calendar invite sent within 24 hours
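
The 15-minute watch can be done by polling the health endpoint at a fixed interval and stopping at the first non-200 response. This is a sketch under assumptions: `watch_health` and `probe_gateway` are hypothetical names, and the 60-second interval is an illustrative choice rather than a documented setting.

```shell
# Hypothetical helper: poll a probe command N times and fail on the first bad result.
#   $1 = number of checks, $2 = seconds between checks, $3... = probe command
# The probe must print an HTTP status code; anything other than 200 is a failure.
watch_health() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    code=$("$@")
    if [ "$code" != "200" ]; then
      echo "check $i/$checks failed: got $code" >&2
      return 1
    fi
    i=$(( i + 1 ))
    # No sleep after the last check
    if [ "$i" -le "$checks" ]; then sleep "$interval"; fi
  done
  echo "stable: $checks consecutive 200 responses"
}

# Probe using the same curl invocation as the checklist.
probe_gateway() {
  curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/v1/health
}
```

`watch_health 16 60 probe_gateway` spans 15 minutes (16 checks, 60 seconds apart). Passing the probe as an argument keeps the loop reusable for the dashboard URL or a local test stub.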

Related Runbooks

Deployment Procedure: rollback steps and emergency hotfix deploys during incidents
Redis Incident: Redis failure is a common trigger for SEV1/SEV2 incidents
MongoDB Incident: MongoDB failure affects persistence, auth, and audit logging
Gateway High Latency: latency spikes are a frequent incident symptom