Runbook: Incident Response

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-04
Validation method: Manual review
Severity trigger: SEV1-SEV4
Customer impact: Varies by severity (see classification matrix)
Required access: SSH to VPS, Slack #incidents channel
Related services: All production services

This playbook defines how the Curate-Me platform team declares, triages, escalates, and resolves production incidents. Print this page or bookmark it — you will need it under pressure.


Severity Classification Matrix

Every incident gets a severity level. Assign one immediately on declaration — you can adjust later as you learn more.

| Severity | Criteria | Response Time | Example |
|----------|----------|---------------|---------|
| SEV1 (Critical) | Complete service outage. Gateway not proxying requests. All customers affected. Data loss possible. | Acknowledge in 5 min. All hands on deck. | Gateway down, MongoDB unresponsive, VPS unreachable |
| SEV2 (Major) | Partial outage or severe degradation. A core feature is broken for many customers. | Acknowledge in 15 min. Primary on-call engaged. | Dashboard login broken, rate limiting not enforcing, cost tracking stopped recording |
| SEV3 (Minor) | Degraded experience for a subset of users. Workaround exists. | Acknowledge in 1 hour. Fix within business hours. | One provider failover stuck, PII scanner false positives for one org, runner provision slow |
| SEV4 (Low) | Cosmetic issue, minor bug, or single-tenant edge case. No business impact. | Acknowledge in 4 hours. Scheduled fix. | Dashboard UI glitch, stale cache for one org, non-critical log noise |

When in doubt, round up. A SEV3 that turns out to be a SEV4 is fine. A SEV1 that was treated as SEV3 is not.


Incident Declaration

When to declare

Declare an incident when any of these are true:

  • Gateway health check fails (/v1/health returns non-200)
  • More than 3 customer-visible errors in a 5-minute window
  • Any SEV1 or SEV2 condition from the matrix above
  • A service dependency (Redis, MongoDB) is unreachable
  • You are unsure whether something is an incident — declare it anyway
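
The two measurable triggers above (health check status and error count) can be sketched as a small helper. This is only an illustration: `should_declare` is a hypothetical function, not part of `./scripts`, and it assumes you have already collected the HTTP status code and the 5-minute error count yourself.

```shell
#!/bin/sh
# Hypothetical helper: checks two of the declaration triggers listed above.
#   $1  HTTP status code returned by /v1/health
#   $2  number of customer-visible errors seen in the last 5 minutes
should_declare() {
    health_code="$1"
    errors_5m="$2"
    if [ "$health_code" != "200" ]; then
        echo "declare: health check returned $health_code"
    elif [ "$errors_5m" -gt 3 ]; then
        echo "declare: $errors_5m customer-visible errors in 5 min"
    else
        echo "ok"
    fi
}

# Example call (not executed here). curl prints 000 when it cannot connect,
# which also counts as non-200:
#   code="$(curl -s -o /dev/null -w '%{http_code}' https://api.curate-me.ai/v1/health)"
#   should_declare "$code" 0
```

An "ok" result does not override the last trigger in the list: if you are unsure, declare anyway.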

How to declare

  1. Post in the #incidents Slack channel with the format below
  2. Assign a severity level
  3. Assign an Incident Commander (IC) — usually the person who declared it
  4. Create a shared thread for all updates

Slack declaration format:

    INCIDENT DECLARED -- SEV[1-4]
    What: [one-line description]
    Impact: [who is affected, what is broken]
    IC: [your name]
    Status: Investigating
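
If you would rather generate the declaration than type it under pressure, a sketch like the following fills in the fields. `format_declaration` is a hypothetical helper, and it assumes the template is posted with one field per line. It only prints the text; pasting it into #incidents is still manual.

```shell
#!/bin/sh
# Hypothetical helper: prints a declaration in the format above.
#   $1 severity (1-4)   $2 what   $3 impact   $4 IC name
format_declaration() {
    sev="$1"
    what="$2"
    impact="$3"
    ic="$4"
    case "$sev" in
        1|2|3|4) ;;
        *) echo "severity must be 1-4" >&2; return 1 ;;
    esac
    printf 'INCIDENT DECLARED -- SEV%s\n' "$sev"
    printf 'What: %s\n' "$what"
    printf 'Impact: %s\n' "$impact"
    printf 'IC: %s\n' "$ic"
    printf 'Status: Investigating\n'
}

# Example call (not executed here):
#   format_declaration 1 "Gateway returning 502" "All customer API calls failing" "Alex"
```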

Who gets notified

| Severity | Notification |
|----------|--------------|
| SEV1 | Slack #incidents + phone call to Platform Lead + all engineers pinged |
| SEV2 | Slack #incidents + DM to on-call engineer |
| SEV3 | Slack #incidents thread |
| SEV4 | Slack #incidents thread (async) |

Initial Triage (First 5 Minutes)

Run these commands in order. The goal is to understand scope before you start fixing things.

1. Get the error landscape

    # View recent production errors -- start here
    ./scripts/errors recent

    # Get a summary of error counts by source
    ./scripts/errors summary

2. Check service health

    # Overall health check
    ./scripts/analytics health

    # Quick container status on VPS
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"

3. Check each critical service

    # Gateway (the core product)
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-gateway --tail 50 --since 5m"

    # Dashboard / Backend API
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-backend-b2b --tail 50 --since 5m"

    # Redis
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-redis redis-cli ping"

    # MongoDB
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker exec curateme-mongo mongosh --quiet --eval 'db.runCommand({ping:1})'"

4. Check resource utilization

    # Disk, memory, CPU on VPS
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h / && echo '---' && free -h && echo '---' && uptime"

    # Docker resource usage
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'"

5. Check today’s metrics for anomalies

    ./scripts/analytics snapshot today
    ./scripts/analytics costs
    ./scripts/analytics performance

Post your findings in the Slack incident thread before proceeding to fixes.
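
One way to capture findings for the thread is to wrap each triage command so its output is labeled and a failing check does not stop the rest. The wrapper below is a sketch only: `run_step` is a hypothetical name, not an existing script in the repo.

```shell
#!/bin/sh
# Hypothetical wrapper: run_step LABEL COMMAND [ARGS...]
# Prints a UTC-stamped header, the command's combined output, and its exit
# status. Always returns 0 so later triage steps still run.
run_step() {
    label="$1"
    shift
    printf '== %s [%s UTC] ==\n' "$label" "$(date -u +%H:%M:%S)"
    rc=0
    output="$("$@" 2>&1)" || rc=$?
    printf '%s\n' "$output"
    printf '%s\n' "-- exit status: $rc"
    return 0
}

# Example usage (not executed here), redirected to a file you can paste from:
#   {
#     run_step "errors recent" ./scripts/errors recent
#     run_step "health"        ./scripts/analytics health
#     run_step "redis ping"    ssh "$DEPLOY_USER@$PLATFORM_VPS_IP" "docker exec curateme-redis redis-cli ping"
#   } > triage.txt
```

Recording the exit status next to the output makes an unreachable host or a dead container stand out when you skim the file.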


Escalation Clocks

Once an incident is declared, these timers start. If the current responder has not resolved the issue within the window, escalate.

| Time Elapsed | Action | Who |
|--------------|--------|-----|
| 0 min | Incident declared, IC assigned | Declaring engineer (L1) |
| 15 min | If no root cause identified, pull in second engineer | On-call engineer (L2) |
| 30 min | If SEV1/SEV2 not mitigated, escalate to Platform Lead | Platform Lead (L3) |
| 1 hour | If SEV1 still active, notify executive stakeholders | Executive (L4) |
| 2 hours | If SEV1 still active, consider external provider support tickets | Platform Lead + Executive |

Escalation is not failure. Escalate early. The IC retains ownership of the incident even after escalation.
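
The clocks above reduce to a lookup on severity and elapsed minutes. The sketch below is illustrative: `escalation_action` is a hypothetical helper, and it assumes the incident is still active and unmitigated, so the "if" conditions in the table remain a judgment call for the IC.

```shell
#!/bin/sh
# Hypothetical helper: escalation_action SEVERITY ELAPSED_MINUTES
# Prints the most recent escalation step that has come due, assuming the
# incident is still active and unmitigated.
escalation_action() {
    sev="$1"
    elapsed="$2"
    if [ "$sev" -eq 1 ] && [ "$elapsed" -ge 120 ]; then
        echo "consider external provider support tickets"
    elif [ "$sev" -eq 1 ] && [ "$elapsed" -ge 60 ]; then
        echo "notify executive stakeholders"
    elif [ "$sev" -le 2 ] && [ "$elapsed" -ge 30 ]; then
        echo "escalate to Platform Lead"
    elif [ "$elapsed" -ge 15 ]; then
        echo "pull in second engineer if no root cause identified"
    else
        echo "IC owns triage"
    fi
}

# Example call (not executed here):
#   escalation_action 2 45
```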


Communication Templates

Copy-paste these into Slack. Fill in the bracketed fields.

Initial notification

    INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
    Status: Investigating
    Impact: [describe customer impact -- who, what, since when]
    Current theory: [what we think is wrong, or "still triaging"]
    Next step: [what we are doing right now]
    ETA to next update: [time]

Status update (post every 15 min for SEV1, 30 min for SEV2)

    INCIDENT UPDATE -- SEV[X] -- [HH:MM UTC]
    Status: [Investigating / Identified / Mitigating / Monitoring]
    Root cause: [if known, or "still investigating"]
    Actions taken: [what has been done since last update]
    Next step: [what happens next]
    ETA to resolution: [estimate or "unknown"]

Resolution announcement

    INCIDENT RESOLVED -- SEV[X] -- [HH:MM UTC]
    Duration: [start time] to [end time] ([X] minutes)
    Impact: [summary of what was affected]
    Root cause: [one-line root cause]
    Resolution: [what fixed it]
    Follow-up: Postmortem scheduled for [date/time]
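
The Duration field needs the elapsed minutes between two UTC times. A small sketch of that arithmetic follows. `to_minutes` and `duration_minutes` are hypothetical helpers, and they assume both times are in HH:MM form and that the incident lasted less than 24 hours.

```shell
#!/bin/sh
# Hypothetical helper: to_minutes HH:MM -> minutes since midnight
to_minutes() {
    h="${1%%:*}"
    m="${1##*:}"
    # Strip one leading zero so values like "08" are not read as octal.
    h="${h#0}"
    m="${m#0}"
    echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Hypothetical helper: duration_minutes START END (both HH:MM, UTC)
duration_minutes() {
    start="$(to_minutes "$1")"
    end="$(to_minutes "$2")"
    diff=$(( end - start ))
    # An end time earlier than the start is treated as crossing midnight once.
    if [ "$diff" -lt 0 ]; then
        diff=$(( diff + 1440 ))
    fi
    echo "$diff"
}

# Example call (not executed here):
#   duration_minutes 14:05 15:20
```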

Service Dependency Matrix

Use this to understand blast radius. If a service in the left column goes down, everything in the “Affected By Outage” column is impacted.

| Service | Depends On | Port | Affected By Outage |
|---------|------------|------|--------------------|
| Gateway | Redis, MongoDB | 8002 | All customer API calls, cost tracking, governance enforcement |
| Dashboard | Backend API | 3001 | Admin UI, org management, analytics views |
| Backend API | MongoDB, Redis | 8001 | Dashboard, all admin operations, fleet management |
| Redis | None (standalone) | 6379 | Rate limiting fails open, cost accumulation stops, cache miss storm |
| MongoDB | None (standalone) | 27017 | All persistence, audit logs, API key validation, org config |
| Runner Containers | Gateway, MongoDB | Varies | Individual agent tasks (isolated per runner) |
| Caddy | None (reverse proxy) | 80/443 | All HTTPS traffic to all services |

Key insight: Redis and MongoDB are single points of failure. If either is down, assume SEV1 until proven otherwise.
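
The "assume SEV1" rule can be expressed as a check on the two ping results from triage. This is a sketch under stated assumptions: `initial_severity` is a hypothetical helper, a healthy Redis is assumed to answer `PONG`, and the MongoDB check is assumed to exit with status 0 when the ping succeeds and non-zero when the server cannot be reached.

```shell
#!/bin/sh
# Hypothetical helper: initial_severity REDIS_REPLY MONGO_EXIT_STATUS
#   $1  output of `redis-cli ping` ("PONG" when healthy)
#   $2  exit status of the mongosh ping command (assumed 0 when healthy)
initial_severity() {
    redis_reply="$1"
    mongo_rc="$2"
    if [ "$redis_reply" != "PONG" ] || [ "$mongo_rc" -ne 0 ]; then
        echo "SEV1"
    else
        echo "unclassified"
    fi
}

# Example usage (not executed here):
#   redis_reply="$(ssh "$DEPLOY_USER@$PLATFORM_VPS_IP" "docker exec curateme-redis redis-cli ping")"
#   ssh "$DEPLOY_USER@$PLATFORM_VPS_IP" "docker exec curateme-mongo mongosh --quiet --eval 'db.runCommand({ping:1})'" >/dev/null
#   initial_severity "$redis_reply" "$?"
```

"unclassified" only means neither datastore forced a SEV1. Classify the incident against the severity matrix as usual.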

Failure mode quick reference

| Dependency Down | Gateway Behavior | Dashboard Behavior |
|-----------------|------------------|--------------------|
| Redis | Rate limiting disabled (fails open), cost tracking stops | Session cache miss, slower page loads |
| MongoDB | API key auth fails, no audit logging, org config unavailable | Cannot load orgs, policies, or analytics |
| Redis + MongoDB | Gateway effectively down: cannot auth or govern | Dashboard non-functional |
| Backend API | Gateway unaffected (independent) | Dashboard returns 502 on all API calls |
| Caddy | No HTTPS termination; all public traffic fails | Unreachable |

Common Incident Patterns

Quick lookup table. Match the symptom to find the likely cause and the runbook with detailed steps.

| Symptom | Likely Cause | First Command | Runbook |
|---------|--------------|---------------|---------|
| Gateway returning 502 to all requests | Gateway container crashed or Caddy misconfigured | `ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps | grep gateway"` | This page (SEV1 triage) |
| Gateway latency spike (>500ms governance time) | Redis slow or PII scanner on large payload | `./scripts/analytics performance` | Gateway High Latency |
| All requests getting rate-limited | Rate limit misconfigured or Redis returning stale counts | `./scripts/errors by-source gateway` | Rate Limit Exceeded |
| Cost tracking shows $0 for the day | Redis cost accumulator stopped flushing to MongoDB | `./scripts/analytics costs` | Cost Accumulation Lag |
| Budget exceeded alerts firing incorrectly | Cost spike from expensive model or runaway automation | `./scripts/analytics costs` | Budget Exceeded |
| Dashboard login page blank or erroring | Next.js build issue or backend API unreachable | `ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker logs curateme-dashboard --tail 30"` | This page (Step 3) |
| PII scanner blocking legitimate requests | Regex false positive on customer data | `./scripts/errors by-source pii` | PII Blocked |
| Runner container stuck in "running" but not responding | OpenClaw gateway crash or OOM inside container | Check runner heartbeat via API | Runner Stuck |
| Runner fails to provision | Docker image pull failure or disk full | `ssh $DEPLOY_USER@$PLATFORM_VPS_IP "df -h /"` | Runner Provision Failure |
| Multiple governance steps denying in sequence | Cascading policy misconfiguration | `./scripts/errors by-source governance` | Governance Cascading Denials |
| Provider failover cycling between providers | All configured providers degraded | Check provider status pages | Provider Failover Loop |

Post-Incident Review

Schedule a postmortem within 48 hours of any SEV1 or SEV2 incident. SEV3 incidents get a lightweight written review. SEV4 does not require a postmortem.

Postmortem template

Use this structure for every postmortem document.

    # Postmortem: [Incident Title]

    **Date:** [YYYY-MM-DD]
    **Severity:** SEV[X]
    **Duration:** [start] to [end] ([X] minutes)
    **Incident Commander:** [name]
    **Author:** [name]

    ## Summary

    [2-3 sentences: what happened, who was affected, how it was resolved]

    ## Timeline (all times UTC)

    | Time | Event |
    |------|-------|
    | HH:MM | [first sign of issue] |
    | HH:MM | [incident declared] |
    | HH:MM | [root cause identified] |
    | HH:MM | [mitigation applied] |
    | HH:MM | [incident resolved] |

    ## Impact

    - **Customer impact:** [number of orgs affected, request failure rate, duration]
    - **Revenue impact:** [if any -- gateway downtime means customer calls fail]
    - **Data impact:** [any data loss or corruption]

    ## Root Cause

    [Detailed technical explanation. Be specific -- not "Redis was slow" but "Redis maxmemory was set to 256MB and eviction policy was noeviction, causing write failures when cost accumulator keys exceeded the limit."]

    ## What Went Well

    - [things that worked during the response]

    ## What Went Poorly

    - [things that slowed down detection or resolution]

    ## Action Items

    | Action | Owner | Priority | Due Date |
    |--------|-------|----------|----------|
    | [specific fix or improvement] | [name] | P1/P2/P3 | [date] |

    ## Lessons Learned

    - [key takeaways for the team]

Postmortem rules

  • Blameless. Focus on systems and processes, not individuals.
  • Concrete action items. Every postmortem must produce at least one action item with an owner and due date.
  • Share widely. Post the completed postmortem in #incidents and link it from the incident thread.

Escalation Contacts

| Role | Responsibility | When to Contact |
|------|----------------|-----------------|
| On-Call Engineer | First responder. Triage, initial mitigation. | Any declared incident |
| Platform Lead | Architecture decisions, cross-service coordination. | SEV1 after 15 min, SEV2 after 30 min, any incident requiring infra changes |
| Executive | Customer communication, business impact decisions. | SEV1 after 1 hour, any data breach or security incident |

On-call expectations

  • Acknowledge alerts within 5 minutes during on-call hours
  • Have VPS SSH access and production credentials ready
  • Keep the incident Slack thread updated every 15 minutes during active SEV1/SEV2

Incident Closure Criteria

An incident is not resolved until every item on this checklist is confirmed.

Resolution checklist

  • Service health confirmed — all containers running, health endpoints returning 200
    ./scripts/analytics health
    ssh $DEPLOY_USER@$PLATFORM_VPS_IP "docker ps --format '{{.Names}}: {{.Status}}'"
  • Error rate returned to baseline — no new errors of the same type
    ./scripts/errors recent
    ./scripts/errors summary
  • Metrics look normal — latency, request volume, cost tracking all within expected ranges
    ./scripts/analytics performance
    ./scripts/analytics snapshot today
  • Customer-facing impact verified resolved — test a real request through the gateway
    curl -s -o /dev/null -w "%{http_code}" https://api.curate-me.ai/v1/health
    curl -s -o /dev/null -w "%{http_code}" https://dashboard.curate-me.ai
  • Incident thread updated — resolution announcement posted in Slack
  • Monitoring confirmed — watch for 15 minutes after fix to ensure no recurrence
  • Postmortem scheduled — for SEV1/SEV2, calendar invite sent within 24 hours
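
The 15-minute watch in the checklist can be scripted as a polling loop rather than done by eye. The following is a sketch only: `confirm_stable` is a hypothetical helper that repeats any probe command and stops at the first failure.

```shell
#!/bin/sh
# Hypothetical helper: confirm_stable CHECKS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to CHECKS times, sleeping INTERVAL_SECONDS between runs.
# Returns 1 at the first failing probe, 0 if every probe succeeds.
confirm_stable() {
    checks="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$checks" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "check $i of $checks failed"
            return 1
        fi
        i=$(( i + 1 ))
        # Sleep between probes, but not after the last one.
        if [ "$i" -le "$checks" ]; then
            sleep "$interval"
        fi
    done
    echo "stable: $checks consecutive checks passed"
}

# Example call (not executed here): 16 probes, 60 seconds apart, span 15 minutes.
# curl -f makes HTTP errors (status 400 and above) count as failures.
#   confirm_stable 16 60 curl -sf https://api.curate-me.ai/v1/health
```

A failure during the watch means the incident is not resolved. Return to triage instead of posting the resolution announcement.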