Public Incident Update (Status Page)
This runbook walks an on-call engineer through publishing a public
incident on status.curate-me.ai and the
in-dashboard degraded-status banner. Incidents are stored in MongoDB and
served by the public /api/v1/platform/health/status endpoint — no
deploy is required.
This procedure is item 16 of
SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.
When to publish an incident
Publish if any of the following are true:
- The system status API reports
degradedoroutageand the underlying cause is known. - A core dependency (Database, Gateway, Stripe, Email, Runner Control Plane) is failing and customer requests are visibly affected.
- A customer support ticket about a service issue has been opened in the last 30 minutes for an issue we have not yet acknowledged.
Do not publish for: a brief transient blip (< 60s) that has already self-recovered, internal-only failures, or a degraded state that has no customer impact (e.g. background analytics queue lag).
Severity matrix
| Severity | Examples | Status banner | Pager |
|---|---|---|---|
minor | One LLM provider degraded; failover working | Yes | No |
major | Gateway slow > 5s p99; Stripe webhooks lagging | Yes | Yes |
critical | Gateway down; Database down; customer-visible auth failure | Yes | Yes |
Status workflow
Status transitions: investigating → identified → monitoring →
resolved. Always publish a fresh comment on the incident at each
transition.
Publishing — the supported paths
Path A: dashboard admin (no SSH, fastest)
- Navigate to Dashboard → Settings → Public Status (admin only).
- Click New Incident. Fill in:
- Title — short, customer-language (e.g. “Elevated gateway latency for OpenAI requests”)
- Severity —
minor/major/critical - Message — what is impacted, what we are doing
- Save. The incident becomes visible at
https://status.curate-me.aiand within ~30 seconds the dashboard degraded-status banner appears for affected users. - Update the incident as state changes; click Resolve when fixed.
Path B: Mongo direct (no dashboard access)
If the dashboard is itself part of the outage, write directly:
ssh ${DEPLOY_USER}@${PLATFORM_VPS_IP}
docker exec -it curateme-mongo mongosh curatemedb.incidents.insertOne({
title: "Elevated gateway latency for OpenAI requests",
message: "We are investigating. Failover to alternate providers is active.",
severity: "major", // minor | major | critical
status: "investigating", // investigating | identified | monitoring | resolved
created_at: new Date().toISOString(),
resolved_at: null,
});To update:
db.incidents.updateOne(
{ _id: ObjectId("<paste-id>") },
{ $set: { status: "monitoring", message: "Latency is recovering. Continuing to monitor." } }
);To resolve:
db.incidents.updateOne(
{ _id: ObjectId("<paste-id>") },
{ $set: { status: "resolved", resolved_at: new Date().toISOString() } }
);The status API caches for 30 seconds, so the public page and dashboard banner update within that window.
Path C: HTTP API (admin token required)
curl -X POST https://api.curate-me.ai/api/v1/admin/incidents \
-H "Authorization: Bearer ${ADMIN_JWT}" \
-H "Content-Type: application/json" \
-d '{
"title": "Elevated gateway latency for OpenAI requests",
"message": "We are investigating. Failover is active.",
"severity": "major",
"status": "investigating"
}'Verifying customer-visible state
# Public status (no auth)
curl -s https://api.curate-me.ai/api/v1/platform/health/status | jq '.incidents, .status'
# Public status page renders the same data
open https://status.curate-me.aiThe dashboard StatusBanner polls the same endpoint every 60 seconds
and renders the most severe active incident in a yellow (degraded) or
red (outage) bar above the existing dashboard chrome.
Customer comms templates
Investigating (severity major):
We are investigating reports of elevated latency on requests proxied through the gateway. Customer requests may take longer than normal to return; no requests are being dropped. Updates every 15 minutes.
Monitoring:
Mitigation deployed. Latency has returned to normal. We will keep monitoring for the next 30 minutes before resolving.
Resolved:
The latency issue is resolved as of
<HH:MM UTC>. Root cause was<one-sentence>. Full post-incident review will be shared in<channel>within 5 business days.
Where the status page is linked from
| Surface | Link target |
|---|---|
| Marketing footer | https://status.curate-me.ai |
| Marketing /status page (full ui) | reads /api/v1/platform/health/status |
| Dashboard help menu | https://status.curate-me.ai |
| Dashboard degraded banner | reads /api/v1/platform/health/status |
API error docs (/troubleshooting) | https://status.curate-me.ai |
SLA doc (/sla) | https://status.curate-me.ai |
| Billing card-declined error | inline link to status page |
If you add a new customer-facing error surface, add a status-page link to it.
Related runbooks
- Incident Response Playbook — the internal triage / paging procedure (this runbook is the customer comms half of it).
- Webhook Disaster Recovery — Stripe / email provider webhook outages.
- Launch-Candidate Smoke Flake Handling — when an incident is caused by a failed launch-candidate smoke.