Public Incident Update (Status Page)

This runbook walks an on-call engineer through publishing a public incident on status.curate-me.ai and the in-dashboard degraded-status banner. Incidents are stored in MongoDB and served by the public /api/v1/platform/health/status endpoint — no deploy is required.

This procedure is item 16 of SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.

When to publish an incident

Publish if any of the following are true:

The system status API reports degraded or outage and the underlying cause is known.
A core dependency (Database, Gateway, Stripe, Email, Runner Control Plane) is failing and customer requests are visibly affected.
A customer support ticket about a service issue has been opened in the last 30 minutes for an issue we have not yet acknowledged.

Do not publish for: a brief transient blip (< 60s) that has already self-recovered, internal-only failures, or a degraded state that has no customer impact (e.g. background analytics queue lag).

Severity matrix

Severity	Examples	Status banner	Pager
`minor`	One LLM provider degraded; failover working	Yes	No
`major`	Gateway slow > 5s p99; Stripe webhooks lagging	Yes	Yes
`critical`	Gateway down; Database down; customer-visible auth failure	Yes	Yes

Status workflow

Status transitions: investigating → identified → monitoring → resolved. Always publish a fresh comment on the incident at each transition.

Publishing — the supported paths

Path A: dashboard admin (no SSH, fastest)

Navigate to Dashboard → Settings → Public Status (admin only).
Click New Incident. Fill in:
- Title — short, customer-language (e.g. “Elevated gateway latency for OpenAI requests”)
- Severity — minor / major / critical
- Message — what is impacted, what we are doing
Save. The incident becomes visible at https://status.curate-me.ai and within ~30 seconds the dashboard degraded-status banner appears for affected users.
Update the incident as state changes; click Resolve when fixed.

Path B: Mongo direct (no dashboard access)

If the dashboard is itself part of the outage, write directly:


ssh ${DEPLOY_USER}@${PLATFORM_VPS_IP}
docker exec -it curateme-mongo mongosh -u admin -p "$MONGO_ADMIN_PASSWORD" \
  --authenticationDatabase admin curate_dashboard


db.incidents.insertOne({
  title: "Elevated gateway latency for OpenAI requests",
  message: "We are investigating. Failover to alternate providers is active.",
  severity: "major",            // minor | major | critical
  status: "investigating",       // investigating | identified | monitoring | resolved
  created_at: new Date().toISOString(),
  resolved_at: null,
});

To update:


db.incidents.updateOne(
  { _id: ObjectId("<paste-id>") },
  { $set: { status: "monitoring", message: "Latency is recovering. Continuing to monitor." } }
);

To resolve:


db.incidents.updateOne(
  { _id: ObjectId("<paste-id>") },
  { $set: { status: "resolved", resolved_at: new Date().toISOString() } }
);

The status API caches for 30 seconds, so the public page and dashboard banner update within that window.

Path C: HTTP API (admin token required)


curl -X POST https://api.curate-me.ai/api/v1/admin/incidents \
  -H "Authorization: Bearer ${ADMIN_JWT}" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Elevated gateway latency for OpenAI requests",
    "message": "We are investigating. Failover is active.",
    "severity": "major",
    "status": "investigating"
  }'

Verifying customer-visible state


# Public status (no auth)
curl -s https://api.curate-me.ai/api/v1/platform/health/status | jq '.incidents, .status'
 
# Public status page renders the same data
open https://status.curate-me.ai

The dashboard StatusBanner polls the same endpoint every 60 seconds and renders the most severe active incident in a yellow (degraded) or red (outage) bar above the existing dashboard chrome.

Customer comms templates

Investigating (severity major):

We are investigating reports of elevated latency on requests proxied through the gateway. Customer requests may take longer than normal to return; no requests are being dropped. Updates every 15 minutes.

Monitoring:

Mitigation deployed. Latency has returned to normal. We will keep monitoring for the next 30 minutes before resolving.

Resolved:

The latency issue is resolved as of <HH:MM UTC>. Root cause was <one-sentence>. Full post-incident review will be shared in <channel> within 5 business days.

Where the status page is linked from

Surface	Link target
Marketing footer	`https://status.curate-me.ai`
Marketing /status page (full ui)	reads `/api/v1/platform/health/status`
Dashboard help menu	`https://status.curate-me.ai`
Dashboard degraded banner	reads `/api/v1/platform/health/status`
API error docs (`/troubleshooting`)	`https://status.curate-me.ai`
SLA doc (`/sla`)	`https://status.curate-me.ai`
Billing card-declined error	inline link to status page

If you add a new customer-facing error surface, add a status-page link to it.

Incident Response Playbook — the internal triage / paging procedure (this runbook is the customer comms half of it).
Webhook Disaster Recovery — Stripe / email provider webhook outages.
Launch-Candidate Smoke Flake Handling — when an incident is caused by a failed launch-candidate smoke.