Public Incident Update (Status Page)

This runbook walks an on-call engineer through publishing a public incident to status.curate-me.ai and to the in-dashboard degraded-status banner. Incidents are stored in MongoDB and served by the public /api/v1/platform/health/status endpoint, so no deploy is required.

This procedure is item 16 of SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.

When to publish an incident

Publish if any of the following are true:

  • The system status API reports degraded or outage and the underlying cause is known.
  • A core dependency (Database, Gateway, Stripe, Email, Runner Control Plane) is failing and customer requests are visibly affected.
  • A customer has opened a support ticket in the last 30 minutes about a service issue we have not yet acknowledged.

Do not publish for: a brief transient blip (< 60s) that has already self-recovered, internal-only failures, or a degraded state that has no customer impact (e.g. background analytics queue lag).
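The criteria above can be sketched as a small decision helper. This is only an illustration of the rules as written: the `Signal` fields and the `should_publish` name are hypothetical, and the sketch assumes the exclusions take precedence over the triggers.

```python
from dataclasses import dataclass
from typing import Optional

TRANSIENT_BLIP_SECONDS = 60   # "brief transient blip (< 60s)"
TICKET_WINDOW_MINUTES = 30    # "opened in the last 30 minutes"


@dataclass
class Signal:
    """Hypothetical summary of what on-call knows right now."""
    api_status: str = "operational"   # assumed values: operational | degraded | outage
    cause_known: bool = False
    core_dependency_failing: bool = False
    customers_affected: bool = False  # set True when a customer ticket exists
    unacknowledged_ticket_age_min: Optional[float] = None  # None = no such ticket
    duration_seconds: float = 0.0
    self_recovered: bool = False
    internal_only: bool = False


def should_publish(s: Signal) -> bool:
    # Exclusions first (assumption: they override the triggers).
    if s.internal_only or not s.customers_affected:
        return False
    if s.self_recovered and s.duration_seconds < TRANSIENT_BLIP_SECONDS:
        return False

    # Any one trigger is enough.
    known_degradation = s.api_status in ("degraded", "outage") and s.cause_known
    recent_ticket = (
        s.unacknowledged_ticket_age_min is not None
        and s.unacknowledged_ticket_age_min <= TICKET_WINDOW_MINUTES
    )
    return known_degradation or s.core_dependency_failing or recent_ticket
```

Treat this as a checklist aid, not a replacement for on-call judgment; the runbook text remains the source of truth.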

Severity matrix

| Severity | Examples | Status banner | Pager |
| --- | --- | --- | --- |
| minor | One LLM provider degraded; failover working | Yes | No |
| major | Gateway slow > 5s p99; Stripe webhooks lagging | Yes | Yes |
| critical | Gateway down; Database down; customer-visible auth failure | Yes | Yes |
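For tooling that needs the matrix in machine-readable form, one possible encoding is a plain lookup table. The names `SEVERITY_MATRIX` and `actions_for` are hypothetical; the values are copied from the table above.

```python
# Severity matrix from the runbook, keyed by severity name.
SEVERITY_MATRIX = {
    "minor": {"status_banner": True, "pager": False},
    "major": {"status_banner": True, "pager": True},
    "critical": {"status_banner": True, "pager": True},
}


def actions_for(severity: str) -> dict:
    """Return the banner/pager actions for a severity, rejecting unknown values."""
    try:
        return SEVERITY_MATRIX[severity]
    except KeyError:
        allowed = ", ".join(SEVERITY_MATRIX)
        raise ValueError(f"unknown severity {severity!r}; expected one of: {allowed}") from None
```

Rejecting unknown values early guards against a typo such as "Major" or "sev1" reaching the incident record.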

Status workflow

Status transitions: investigating → identified → monitoring → resolved. Always publish a fresh comment on the incident at each transition.
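A minimal sketch of a transition check follows. It assumes transitions only move forward and that skipping a stage is allowed (for example investigating straight to monitoring, as in the Mongo update example later in this runbook); the runbook does not state whether the platform enforces either rule, and `is_valid_transition` is a hypothetical helper.

```python
STATUS_ORDER = ["investigating", "identified", "monitoring", "resolved"]


def is_valid_transition(current: str, new: str) -> bool:
    """True if `new` comes strictly later than `current` in the workflow."""
    for value in (current, new):
        if value not in STATUS_ORDER:
            raise ValueError(f"unknown status {value!r}")
    # Assumption: forward-only, stages may be skipped.
    return STATUS_ORDER.index(new) > STATUS_ORDER.index(current)
```

If an incident regresses after being resolved, opening a new incident may be cleaner than moving the old one backwards, but that is a policy choice this sketch does not settle.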

Publishing — the supported paths

Path A: dashboard admin (no SSH, fastest)

  1. Navigate to Dashboard → Settings → Public Status (admin only).
  2. Click New Incident. Fill in:
    • Title: short, customer-language (e.g. “Elevated gateway latency for OpenAI requests”)
    • Severity: minor / major / critical
    • Message: what is impacted, what we are doing
  3. Save. The incident becomes visible at https://status.curate-me.ai and within ~30 seconds the dashboard degraded-status banner appears for affected users.
  4. Update the incident as state changes; click Resolve when fixed.

Path B: Mongo direct (no dashboard access)

If the dashboard is itself part of the outage, write directly:

    # On your workstation: open a shell on the platform host
    ssh ${DEPLOY_USER}@${PLATFORM_VPS_IP}

    # On the host: open mongosh against the curateme database
    docker exec -it curateme-mongo mongosh curateme

    // In mongosh: create the incident
    db.incidents.insertOne({
      title: "Elevated gateway latency for OpenAI requests",
      message: "We are investigating. Failover to alternate providers is active.",
      severity: "major",        // minor | major | critical
      status: "investigating",  // investigating | identified | monitoring | resolved
      created_at: new Date().toISOString(),
      resolved_at: null,
    });

To update:

    db.incidents.updateOne(
      { _id: ObjectId("<paste-id>") },
      { $set: { status: "monitoring", message: "Latency is recovering. Continuing to monitor." } }
    );

To resolve:

    db.incidents.updateOne(
      { _id: ObjectId("<paste-id>") },
      { $set: { status: "resolved", resolved_at: new Date().toISOString() } }
    );
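If these writes are ever scripted rather than typed by hand, the documents could be built and validated first. The sketch below only constructs the payloads and does not talk to MongoDB; the field names come from the commands above, while `new_incident`, `resolve_update` and `utc_now_iso` are hypothetical helpers.

```python
from datetime import datetime, timezone
from typing import Optional

SEVERITIES = {"minor", "major", "critical"}
STATUSES = {"investigating", "identified", "monitoring", "resolved"}


def utc_now_iso() -> str:
    """UTC timestamp in the same shape as JavaScript's new Date().toISOString()."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def new_incident(title: str, message: str, severity: str,
                 status: str = "investigating", now: Optional[str] = None) -> dict:
    """Build the document passed to insertOne, rejecting unknown enum values."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity {severity!r}")
    if status not in STATUSES:
        raise ValueError(f"unknown status {status!r}")
    return {
        "title": title,
        "message": message,
        "severity": severity,
        "status": status,
        "created_at": now or utc_now_iso(),
        "resolved_at": None,
    }


def resolve_update(now: Optional[str] = None) -> dict:
    """Build the $set document used to resolve an incident."""
    return {"$set": {"status": "resolved", "resolved_at": now or utc_now_iso()}}
```

Validating before the write matters here because a direct Mongo insert bypasses whatever checks the dashboard or API may apply.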

The status API caches for 30 seconds, so the public page and dashboard banner update within that window.
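As a rough worked example of propagation delay: the 30-second cache bounds how stale the API response can be, and the dashboard banner is described later in this runbook as polling every 60 seconds. Assuming the cache and the poll timer are independent (an assumption; the runbook does not say how they interact), the worst case for a polling client is the sum of the two.

```python
CACHE_TTL_SECONDS = 30     # status API cache, from this runbook
BANNER_POLL_SECONDS = 60   # dashboard banner poll interval, from this runbook


def worst_case_delay(cache_ttl: int, poll_interval: int = 0) -> int:
    """Upper bound on the seconds before a client sees a new write.

    Worst case: the cache was refreshed just before the write, and the
    client polled just before that cache entry expired.
    """
    return cache_ttl + poll_interval


public_page_delay = worst_case_delay(CACHE_TTL_SECONDS)
banner_delay = worst_case_delay(CACHE_TTL_SECONDS, BANNER_POLL_SECONDS)
```

Under these assumptions a freshly loaded page sees the change within 30 seconds, while an already-open dashboard could take up to about 90 seconds, so wait that long before concluding the banner is broken.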

Path C: HTTP API (admin token required)

    curl -X POST https://api.curate-me.ai/api/v1/admin/incidents \
      -H "Authorization: Bearer ${ADMIN_JWT}" \
      -H "Content-Type: application/json" \
      -d '{
        "title": "Elevated gateway latency for OpenAI requests",
        "message": "We are investigating. Failover is active.",
        "severity": "major",
        "status": "investigating"
      }'
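The same request could be issued from a script. Below is a minimal sketch using only the Python standard library; it builds the request without sending it. The URL, headers and body fields mirror the curl command above, and `build_incident_request` is a hypothetical helper name.

```python
import json
import urllib.request

ADMIN_INCIDENTS_URL = "https://api.curate-me.ai/api/v1/admin/incidents"


def build_incident_request(admin_jwt: str, title: str, message: str,
                           severity: str, status: str = "investigating") -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an incident."""
    body = json.dumps({
        "title": title,
        "message": message,
        "severity": severity,
        "status": status,
    }).encode("utf-8")
    return urllib.request.Request(
        ADMIN_INCIDENTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_jwt}",
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the result to `urllib.request.urlopen(req, timeout=10)`. The response shape is not documented in this runbook, so inspect the status code and body rather than assuming particular fields.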

Verifying customer-visible state

    # Public status (no auth)
    curl -s https://api.curate-me.ai/api/v1/platform/health/status | jq '.incidents, .status'

    # Public status page renders the same data (open is macOS; use xdg-open on Linux)
    open https://status.curate-me.ai

The dashboard StatusBanner polls the same endpoint every 60 seconds and renders the most severe active incident in a yellow (degraded) or red (outage) bar above the existing dashboard chrome.
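The banner's selection rule can be illustrated as follows. This is a sketch, not the StatusBanner source: it assumes the payload carries a top-level `status` and an `incidents` list whose items use the same `title`, `severity` and `status` fields as the Mongo documents, and that any incident not marked resolved counts as active.

```python
from typing import Optional

SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}
BANNER_COLOUR = {"degraded": "yellow", "outage": "red"}


def banner_for(payload: dict) -> Optional[dict]:
    """Pick the most severe active incident and the bar colour, or None if nothing is active."""
    incidents = payload.get("incidents") or []
    active = [i for i in incidents if i.get("status") != "resolved"]
    if not active:
        return None
    # Unknown severities rank lowest so they never outrank a known one.
    worst = max(active, key=lambda i: SEVERITY_RANK.get(i.get("severity"), -1))
    return {
        "title": worst.get("title"),
        "severity": worst.get("severity"),
        "colour": BANNER_COLOUR.get(payload.get("status")),
    }
```

Running this against the output of the curl command above is a quick way to predict what the banner should show before checking the dashboard itself.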

Customer comms templates

Investigating (severity major):

We are investigating reports of elevated latency on requests proxied through the gateway. Customer requests may take longer than normal to return; no requests are being dropped. Updates every 15 minutes.

Monitoring:

Mitigation deployed. Latency has returned to normal. We will keep monitoring for the next 30 minutes before resolving.

Resolved:

The latency issue is resolved as of <HH:MM UTC>. Root cause was <one-sentence>. Full post-incident review will be shared in <channel> within 5 business days.

Where the status page is linked from

| Surface | Link target |
| --- | --- |
| Marketing footer | https://status.curate-me.ai |
| Marketing /status page (full UI) | reads /api/v1/platform/health/status |
| Dashboard help menu | https://status.curate-me.ai |
| Dashboard degraded banner | reads /api/v1/platform/health/status |
| API error docs (/troubleshooting) | https://status.curate-me.ai |
| SLA doc (/sla) | https://status.curate-me.ai |
| Billing card-declined error | inline link to status page |

If you add a new customer-facing error surface, add a status-page link to it.