Runbook: Self-Serve Launch — Error UX Checklist
Owner: Gateway Team
Backup owner: Dashboard lead
Last validated: Not yet validated
Validation method: scripts/self-serve-flag-cutover-smoke.sh (OPEN + ROLLBACK phases)
Severity trigger: SEV2 (launch blocker: self-serve users cannot self-unblock)
Customer impact: Self-serve users hit a generic error with no actionable fix and abandon
Required access: Gateway + B2B API (staging or prod), dashboard
Related services: curateme-backend-gateway, curateme-backend-b2b
Self-serve users have no CSM to call. When the gateway denies a request — a missing provider key, a budget cap, a rate limit, a model that is not on the allowlist — the error itself has to carry the user to the fix. This runbook documents how a gateway error becomes a dashboard “Fix now” button, and the pre-launch checks that keep that path live.
Error flow: gateway → dashboard → action
1. Gateway builds a structured error
services/backend/src/gateway/error_response.py is the single source of error
shape. gateway_error_response() emits:
{
  "error": {
    "code": "provider_key_missing",
    "message": "No openai API key is configured for your organization. ...",
    "request_id": "req_abc123",
    "docs_url": "https://docs.curate-me.ai/quickstart#provider-keys",
    "remediation_info": {
      "provider": "openai",
      "action": "Add your openai API key",
      "dashboard_url": "/settings/provider-secrets",
      "docs_url": "/quickstart#provider-keys",
      "header_name": "X-Provider-Key"
    }
  }
}

Key fields:
- code: stable machine code (provider_key_missing, rate_limit_exceeded, budget_exceeded, model_not_allowed, …).
- remediation_info: the self-serve unblock payload (top-level, not buried under details). action is short imperative copy; dashboard_url is the relative in-app fix surface; docs_url is a relative docs path.
- docs_url (top level): absolute docs link from the _DOCS_URLS map, used as a “Learn more” fallback when no remediation_info.docs_url is present.
The convenience constructors that pass remediation_info today:
provider_key_missing_error, rate_limit_error, budget_exceeded_error,
model_not_allowed_error.
Every error response also carries the X-CM-Request-Id header (and
X-CM-Trace-Id when a trace is in scope) via the request spine — support uses
this to look the request up.
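The shape described above can be sketched in a few lines of Python. This is an illustration only, not the contents of error_response.py: build_error_body, build_error_headers, and the DOCS_URLS dict are hypothetical names, and the single docs entry is copied from the example payload.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the _DOCS_URLS map; the single entry is copied
# from the example payload in this runbook.
DOCS_URLS: Dict[str, str] = {
    "provider_key_missing": "https://docs.curate-me.ai/quickstart#provider-keys",
}


def build_error_body(
    code: str,
    message: str,
    request_id: str,
    remediation_info: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the JSON body; remediation_info sits directly under "error"."""
    error: Dict[str, Any] = {
        "code": code,
        "message": message,
        "request_id": request_id,
    }
    docs_url = DOCS_URLS.get(code)
    if docs_url is not None:
        error["docs_url"] = docs_url  # absolute "Learn more" fallback
    if remediation_info:
        error["remediation_info"] = dict(remediation_info)
    return {"error": error}


def build_error_headers(request_id: str, trace_id: Optional[str] = None) -> Dict[str, str]:
    """Request-spine headers; the trace header is added only when a trace exists."""
    headers = {"X-CM-Request-Id": request_id}
    if trace_id is not None:
        headers["X-CM-Trace-Id"] = trace_id
    return headers


body = build_error_body(
    "provider_key_missing",
    "No openai API key is configured for your organization.",
    "req_abc123",
    remediation_info={
        "provider": "openai",
        "action": "Add your openai API key",
        "dashboard_url": "/settings/provider-secrets",
    },
)
headers = build_error_headers("req_abc123")
```

Keeping remediation_info directly under error (rather than inside a details object) is what lets the dashboard read it without knowing each error's private structure.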
2. Dashboard normalizes the error
apps/dashboard/lib/api-errors.ts (normalizeResponseError) converts any
non-2xx Response into a NormalizedError and enriches the spine from
X-CM-* response headers when the body omits them.
apps/dashboard/lib/errors/normalize.ts (normalizeGatewayError) extracts
error.remediation_info and error.docs_url onto the NormalizedError:
- NormalizedError.remediationInfo: the backend hint object.
- NormalizedError.docsUrl: the top-level docs link.
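The normalization step can be sketched as follows. The real code is TypeScript in the dashboard; this Python version only illustrates the extraction and header-fallback logic. The dataclass fields, the "unknown_error" default, and the fallback message are assumptions, not values taken from the codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class NormalizedError:
    code: str
    message: str
    request_id: Optional[str] = None
    remediation_info: Optional[Dict[str, Any]] = None
    docs_url: Optional[str] = None


def normalize_gateway_error(body: Any, headers: Mapping[str, str]) -> NormalizedError:
    """Lift error.remediation_info and error.docs_url onto a flat error object."""
    error = body.get("error") if isinstance(body, dict) else None
    if not isinstance(error, dict):
        error = {}  # malformed or non-JSON body: fall back to defaults below

    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    remediation = error.get("remediation_info")

    return NormalizedError(
        code=error.get("code") or "unknown_error",  # assumed default
        message=error.get("message") or "Request failed",  # assumed default
        # Prefer the body; use the X-CM-Request-Id header when the body omits it.
        request_id=error.get("request_id") or lowered.get("x-cm-request-id"),
        remediation_info=remediation if isinstance(remediation, dict) else None,
        docs_url=error.get("docs_url"),
    )


normalized = normalize_gateway_error(
    {"error": {"code": "budget_exceeded", "message": "Budget cap reached"}},
    {"X-CM-Request-Id": "req_from_header"},
)
```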
3. Dashboard renders message + action
apps/dashboard/lib/error-remediations.ts resolves the best CTA via
resolveRemediationCTA(code, remediationInfo, docsUrl):
- Prefer the backend remediation_info.dashboard_url.
- Fall back to the static ERROR_REMEDIATIONS map (covers provider_key_missing, rate_limit_exceeded, budget_exceeded / daily_budget_exceeded, model_not_allowed, invalid_api_key).
- Fall back to a docs “Learn more” link.
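The precedence above can be sketched as a small pure function. This is a Python illustration of the ordering only; the actual resolveRemediationCTA is TypeScript, and the returned dict shape and the contents of ERROR_REMEDIATIONS here are assumptions (only the provider-secrets path appears in this runbook).

```python
from typing import Any, Dict, Optional

# Hypothetical static fallback map; the real one covers more codes.
ERROR_REMEDIATIONS: Dict[str, str] = {
    "provider_key_missing": "/settings/provider-secrets",
}


def resolve_remediation_cta(
    code: str,
    remediation_info: Optional[Dict[str, Any]],
    docs_url: Optional[str],
) -> Optional[Dict[str, str]]:
    """Return the best call to action, or None when nothing is derivable."""
    info = remediation_info or {}

    # 1. The backend hint wins: it knows the exact in-app fix surface.
    if info.get("dashboard_url"):
        return {"kind": "fix", "label": "Fix now", "href": info["dashboard_url"]}

    # 2. Static map keyed by error code.
    static_href = ERROR_REMEDIATIONS.get(code)
    if static_href:
        return {"kind": "fix", "label": "Fix now", "href": static_href}

    # 3. Docs link: the remediation-specific path first, then the top-level URL.
    docs = info.get("docs_url") or docs_url
    if docs:
        return {"kind": "docs", "label": "Learn more", "href": docs}

    return None


cta = resolve_remediation_cta(
    "provider_key_missing",
    {"dashboard_url": "/settings/provider-secrets"},
    "https://docs.curate-me.ai/quickstart#provider-keys",
)
```

Returning None for the no-remediation case keeps the decision to hide the banner in the rendering layer rather than in the resolver.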
apps/dashboard/components/ErrorCard.tsx (used by the route-level
error.tsx boundaries) derives the remediation from the thrown error and
renders apps/dashboard/components/RemediationBanner.tsx below the message —
a teal info card with a “Fix now” button and a “Learn more” docs link. The
banner uses the established --cm-* design tokens; it does not introduce
--mc-* vars or hardcoded dark backgrounds.
Pre-launch checklist
Run before flipping SELF_SERVE_PUBLIC_SIGNUP / SELF_SERVE_PAID_CHECKOUT on.
Backend
- Each remediable error in error_response.py passes remediation_info as a top-level argument (not under details): provider_key_missing_error, rate_limit_error, budget_exceeded_error, model_not_allowed_error.
- check_stripe_live_key() in services/backend/src/utils/security_checks.py raises in production when STRIPE_SECRET_KEY starts with sk_test_. It is called from check_production_security() (gateway lifespan) and eagerly at gateway module import in src/main_gateway.py.
- No Stripe test key in the production environment / vault.
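A guard of the kind described in the Stripe item might look like the sketch below. It is not the code in security_checks.py: the ENVIRONMENT variable name, the "production" value, and the InsecureConfigError class are assumptions made for illustration; only STRIPE_SECRET_KEY and the sk_test_ prefix come from this runbook.

```python
import os
from typing import Mapping, Optional


class InsecureConfigError(RuntimeError):
    """Hypothetical error type raised when production config is unsafe."""


def check_stripe_live_key(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise if a Stripe test key is configured in production."""
    env = os.environ if env is None else env

    # Assumption: the deployment environment is exposed as ENVIRONMENT.
    if env.get("ENVIRONMENT", "").lower() != "production":
        return

    key = env.get("STRIPE_SECRET_KEY", "")
    if key.startswith("sk_test_"):
        # Never echo the key itself into logs or error messages.
        raise InsecureConfigError(
            "STRIPE_SECRET_KEY is a test key (sk_test_*) in production"
        )


# Passing an explicit mapping keeps the check testable without touching os.environ.
check_stripe_live_key({"ENVIRONMENT": "staging", "STRIPE_SECRET_KEY": "sk_test_123"})
```

Calling the check at module import as well as in the lifespan hook means a misconfigured process fails before it can accept traffic.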
Dashboard
- NormalizedError carries remediationInfo + docsUrl (lib/errors/normalize.ts).
- normalizeGatewayError() extracts error.remediation_info and error.docs_url.
- lib/error-remediations.ts covers at least provider_key_missing, rate_limit_exceeded, daily_budget_exceeded, model_not_allowed.
- ErrorCard renders RemediationBanner when a remediation is derivable.
- RemediationBanner uses --cm-* tokens only (no --mc-*, no hardcoded dark backgrounds).
Smoke test
scripts/self-serve-flag-cutover-smoke.sh --env staging

Expected:
- CLOSED phase: all four launch flags start OFF.
- OPEN phase: flags enable in cutover order; the live gateway returns a structured error.code, a populated remediation_info, and an X-CM-Request-Id header (see assert_gateway_error_shape).
- ROLLBACK phase: flags disable within 10 minutes; with SELF_SERVE_PUBLIC_SIGNUP OFF, an unauthenticated onboard POST does not create a live account. It returns the waitlisted envelope (or a 4xx) with a human-readable message (see assert_onboard_gate_closed).
Note: the onboard gate intentionally returns HTTP 200 with status: "waitlisted", not 403, to avoid revealing API shape to bots (see services/backend/src/api/routes/platform/onboard.py). The smoke assertion accepts either response (the waitlisted envelope or a 4xx) and fails only if a live org_id is issued.
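The two assertions named above can be approximated in Python as shown below. The real helpers live in the smoke script and may check more; the field locations used here (a top-level or error-level message, a top-level org_id) are assumptions about the response envelopes, not confirmed shapes.

```python
from typing import Any, Dict, Mapping


def _require(condition: bool, reason: str) -> None:
    """Raise AssertionError explicitly so checks survive `python -O`."""
    if not condition:
        raise AssertionError(reason)


def assert_gateway_error_shape(body: Dict[str, Any], headers: Mapping[str, str]) -> None:
    """OPEN phase: structured code, populated remediation_info, request-id header."""
    error = body.get("error") or {}
    _require(bool(error.get("code")), "error.code missing")
    _require(bool(error.get("remediation_info")), "error.remediation_info missing or empty")
    lowered = {name.lower() for name in headers}
    _require("x-cm-request-id" in lowered, "X-CM-Request-Id header missing")


def assert_onboard_gate_closed(status: int, body: Dict[str, Any]) -> None:
    """ROLLBACK phase: waitlisted envelope or 4xx, and never a live org_id."""
    # Assumption: a live account would surface as a non-empty top-level org_id.
    _require(not body.get("org_id"), "gate is open: a live org_id was issued")
    waitlisted = status == 200 and body.get("status") == "waitlisted"
    _require(waitlisted or 400 <= status < 500, f"unexpected status {status}")
    # Assumption: the human-readable text is at body.message or body.error.message.
    message = body.get("message") or (body.get("error") or {}).get("message")
    _require(bool(message), "no human-readable message")


assert_onboard_gate_closed(200, {"status": "waitlisted", "message": "You are on the waitlist."})
```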
Troubleshooting
Dashboard shows a generic error with no “Fix now” button
- Confirm the gateway response body contains error.remediation_info (curl the failing call, inspect JSON). If absent, the gateway constructor is not passing remediation_info; check error_response.py.
- Confirm normalizeGatewayError() extracts it onto remediationInfo.
- Confirm the rendering surface (ErrorCard → RemediationBanner) is on the path that caught the error.
“Fix now” links to the wrong page
resolveRemediationCTA prefers the backend dashboard_url. If it is wrong,
fix the value in the gateway constructor’s remediation_info. The static map in
error-remediations.ts is only the fallback.
X-CM-Request-Id missing from the error response
Ensure the request spine is established before the route runs and that
gateway_error_response() is the response path (it stamps the header from the
spine contextvar). See the spine setup in error_response.py.
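The failure mode above is easier to see in a minimal contextvar sketch. The names request_id_var, establish_spine, and stamp_spine_headers are hypothetical and do not mirror the real spine code; the point is only that the header can be stamped solely when the variable was set earlier in the same context.

```python
import contextvars
from typing import Dict, Optional

# Hypothetical spine variable; the default of None models "spine not established".
request_id_var: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "request_id", default=None
)


def establish_spine(request_id: str) -> contextvars.Token:
    """Set the request id for the current context (done before the route runs)."""
    return request_id_var.set(request_id)


def stamp_spine_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the request id from the contextvar onto the response headers."""
    request_id = request_id_var.get()
    if request_id is not None:
        headers["X-CM-Request-Id"] = request_id
    return headers


# Without the spine, nothing is stamped: this is the symptom being debugged.
before = stamp_spine_headers({})

# With the spine established first, the header appears.
token = establish_spine("req_abc123")
after = stamp_spine_headers({})
request_id_var.reset(token)  # restore the previous value when the request ends
```

A response built on a path that bypasses the stamping function, or in a context where the variable was never set, shows the same missing-header symptom.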
Gateway refuses to start with a Stripe error
check_stripe_live_key() is doing its job: STRIPE_SECRET_KEY is an
sk_test_* key in production. Set a live key (sk_live_*) before enabling
paid checkout.
Rollback
This runbook documents checks, not a mutation. To roll back a launch, run the ROLLBACK phase of the cutover smoke:
scripts/self-serve-flag-cutover-smoke.sh --env staging --only-phase rollback