Runbook: Self-Serve Launch — Error UX Checklist

Owner: Gateway Team
Backup owner: Dashboard lead
Last validated: Not yet validated
Validation method: scripts/self-serve-flag-cutover-smoke.sh (OPEN + ROLLBACK phases)
Severity trigger: SEV2 (launch blocker: self-serve users cannot self-unblock)
Customer impact: Self-serve users hit a generic error with no actionable fix and abandon
Required access: Gateway + B2B API (staging or prod), dashboard
Related services: curateme-backend-gateway, curateme-backend-b2b


Self-serve users have no CSM to call. When the gateway denies a request — a missing provider key, a budget cap, a rate limit, a model that is not on the allowlist — the error itself has to carry the user to the fix. This runbook documents how a gateway error becomes a dashboard “Fix now” button, and the pre-launch checks that keep that path live.


Error flow: gateway → dashboard → action

1. Gateway builds a structured error

services/backend/src/gateway/error_response.py is the single source of error shape. gateway_error_response() emits:

{
  "error": {
    "code": "provider_key_missing",
    "message": "No openai API key is configured for your organization. ...",
    "request_id": "req_abc123",
    "docs_url": "https://docs.curate-me.ai/quickstart#provider-keys",
    "remediation_info": {
      "provider": "openai",
      "action": "Add your openai API key",
      "dashboard_url": "/settings/provider-secrets",
      "docs_url": "/quickstart#provider-keys",
      "header_name": "X-Provider-Key"
    }
  }
}

Key fields:

  • code — stable machine code (provider_key_missing, rate_limit_exceeded, budget_exceeded, model_not_allowed, …).
  • remediation_info — the self-serve unblock payload (top-level, not buried under details). action is short imperative copy; dashboard_url is the relative in-app fix surface; docs_url is a relative docs path.
  • docs_url (top level) — absolute docs link from the _DOCS_URLS map, used as a “Learn more” fallback when no remediation_info.docs_url is present.

The convenience constructors that pass remediation_info today: provider_key_missing_error, rate_limit_error, budget_exceeded_error, model_not_allowed_error.

Every error response also carries the X-CM-Request-Id header (and X-CM-Trace-Id when a trace is in scope) via the request spine — support uses this to look the request up.
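
The envelope described above can be sketched in a few lines. This is a minimal illustration, not the code in error_response.py: build_error and DOCS_URLS are hypothetical names, and the real gateway_error_response() reads the request ID from the spine rather than taking it as an argument.

```python
from typing import Any, Optional

# Hypothetical subset of the docs map; the real _DOCS_URLS lives in error_response.py.
DOCS_URLS = {
    "provider_key_missing": "https://docs.curate-me.ai/quickstart#provider-keys",
}


def build_error(
    code: str,
    message: str,
    request_id: str,
    remediation_info: Optional[dict[str, Any]] = None,
) -> tuple[dict[str, Any], dict[str, str]]:
    """Return (body, headers) for a structured gateway error."""
    error: dict[str, Any] = {
        "code": code,
        "message": message,
        "request_id": request_id,
    }
    docs_url = DOCS_URLS.get(code)
    if docs_url:
        # Absolute "Learn more" fallback link.
        error["docs_url"] = docs_url
    if remediation_info:
        # Sits directly under "error", never nested under "details".
        error["remediation_info"] = remediation_info
    # Support looks the request up by this header.
    headers = {"X-CM-Request-Id": request_id}
    return {"error": error}, headers


body, headers = build_error(
    "provider_key_missing",
    "No openai API key is configured for your organization.",
    "req_abc123",
    remediation_info={
        "provider": "openai",
        "action": "Add your openai API key",
        "dashboard_url": "/settings/provider-secrets",
        "docs_url": "/quickstart#provider-keys",
        "header_name": "X-Provider-Key",
    },
)
```

Keeping remediation_info at the same level as code means the dashboard can read it without knowing which constructor produced the error.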

2. Dashboard normalizes the error

apps/dashboard/lib/api-errors.ts (normalizeResponseError) converts any non-2xx Response into a NormalizedError and enriches the spine from X-CM-* response headers when the body omits them.

apps/dashboard/lib/errors/normalize.ts (normalizeGatewayError) extracts error.remediation_info and error.docs_url onto the NormalizedError:

  • NormalizedError.remediationInfo — the backend hint object.
  • NormalizedError.docsUrl — the top-level docs link.
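
The normalization step can be illustrated as follows. The dashboard code itself is TypeScript; this sketch mirrors the described behavior in Python for consistency with the gateway examples. The snake_case field names, the "unknown_error" default, and a trace_id field in the body are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class NormalizedError:
    status: int
    code: str
    message: str
    request_id: Optional[str] = None
    trace_id: Optional[str] = None
    remediation_info: Optional[dict[str, Any]] = None
    docs_url: Optional[str] = None


def normalize(
    status: int, body: Mapping[str, Any], headers: Mapping[str, str]
) -> NormalizedError:
    """Build a NormalizedError from a non-2xx response body and headers."""
    error = body.get("error") or {}
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return NormalizedError(
        status=status,
        code=error.get("code", "unknown_error"),
        message=error.get("message", "Request failed"),
        # Body values win; X-CM-* headers fill in what the body omits.
        request_id=error.get("request_id") or lowered.get("x-cm-request-id"),
        trace_id=error.get("trace_id") or lowered.get("x-cm-trace-id"),
        remediation_info=error.get("remediation_info"),
        docs_url=error.get("docs_url"),
    )
```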

3. Dashboard renders message + action

apps/dashboard/lib/error-remediations.ts resolves the best CTA via resolveRemediationCTA(code, remediationInfo, docsUrl):

  1. Prefer the backend remediation_info.dashboard_url.
  2. Fall back to the static ERROR_REMEDIATIONS map (covers provider_key_missing, rate_limit_exceeded, budget_exceeded / daily_budget_exceeded, model_not_allowed, invalid_api_key).
  3. Fall back to a docs “Learn more” link.
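
The resolution order above can be sketched as a single function. The real resolveRemediationCTA is TypeScript; this Python version only illustrates the three-step fallback. The return shape (kind, label, href) and the single STATIC_REMEDIATIONS entry are hypothetical.

```python
from typing import Any, Optional

# Hypothetical entry; the real ERROR_REMEDIATIONS map is in error-remediations.ts.
STATIC_REMEDIATIONS = {
    "provider_key_missing": {
        "label": "Add provider key",
        "href": "/settings/provider-secrets",
    },
}


def resolve_cta(
    code: str,
    remediation_info: Optional[dict[str, Any]] = None,
    docs_url: Optional[str] = None,
) -> Optional[dict[str, str]]:
    """Pick the best call to action for an error, or None if nothing applies."""
    info = remediation_info or {}

    # 1. The backend hint wins when it names an in-app fix surface.
    if info.get("dashboard_url"):
        return {
            "kind": "fix",
            "label": info.get("action", "Fix now"),
            "href": info["dashboard_url"],
        }

    # 2. Otherwise use the static per-code map.
    static = STATIC_REMEDIATIONS.get(code)
    if static:
        return {"kind": "fix", **static}

    # 3. Last resort: a docs link, preferring the one inside remediation_info.
    link = info.get("docs_url") or docs_url
    if link:
        return {"kind": "docs", "label": "Learn more", "href": link}

    return None
```

Because step 1 takes precedence, a wrong dashboard_url from the gateway overrides a correct static entry, which is why the gateway value is the first place to look when a button links to the wrong page.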

apps/dashboard/components/ErrorCard.tsx (used by the route-level error.tsx boundaries) derives the remediation from the thrown error and renders apps/dashboard/components/RemediationBanner.tsx below the message — a teal info card with a “Fix now” button and a “Learn more” docs link. The banner uses the established --cm-* design tokens; it does not introduce --mc-* vars or hardcoded dark backgrounds.


Pre-launch checklist

Run before flipping SELF_SERVE_PUBLIC_SIGNUP / SELF_SERVE_PAID_CHECKOUT on.

Backend

  • Each remediable error in error_response.py passes remediation_info as a top-level argument (not under details): provider_key_missing_error, rate_limit_error, budget_exceeded_error, model_not_allowed_error.
  • check_stripe_live_key() in services/backend/src/utils/security_checks.py raises in production when STRIPE_SECRET_KEY starts with sk_test_. It is called from check_production_security() (gateway lifespan) and eagerly at gateway module import in src/main_gateway.py.
  • No Stripe test key in the production environment / vault.
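
The Stripe key guard amounts to a prefix check that only fires in production. The sketch below is an assumption-laden illustration, not the code in security_checks.py: the function name, the ENVIRONMENT variable used to detect production, and the RuntimeError type are all hypothetical.

```python
import os
from typing import Mapping, Optional


def check_stripe_live_key_sketch(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise if a Stripe test key is configured in production."""
    env = os.environ if env is None else env

    # Assumption: production is identified by ENVIRONMENT=production.
    if env.get("ENVIRONMENT", "").lower() != "production":
        return

    key = env.get("STRIPE_SECRET_KEY", "")
    if key.startswith("sk_test_"):
        # Failing at startup keeps paid checkout from running against test mode.
        raise RuntimeError(
            "STRIPE_SECRET_KEY is a test key (sk_test_*) in production"
        )
```

Passing the environment in as a mapping keeps the check easy to exercise in a unit test without touching the real process environment.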

Dashboard

  • NormalizedError carries remediationInfo + docsUrl (lib/errors/normalize.ts).
  • normalizeGatewayError() extracts error.remediation_info and error.docs_url.
  • lib/error-remediations.ts covers at least provider_key_missing, rate_limit_exceeded, daily_budget_exceeded, model_not_allowed.
  • ErrorCard renders RemediationBanner when a remediation is derivable.
  • RemediationBanner uses --cm-* tokens only (no --mc-*, no hardcoded dark backgrounds).

Smoke test

scripts/self-serve-flag-cutover-smoke.sh --env staging

Expected:

  • CLOSED phase — all four launch flags start OFF.
  • OPEN phase — flags enable in cutover order; the live gateway returns a structured error.code, a populated remediation_info, and an X-CM-Request-Id header (see assert_gateway_error_shape).
  • ROLLBACK phase — flags disable within 10 minutes; with SELF_SERVE_PUBLIC_SIGNUP OFF, an unauthenticated onboard POST does not create a live account — it returns the waitlisted envelope (or a 4xx) with a human-readable message (see assert_onboard_gate_closed).

Note: the onboard gate intentionally returns 200 status: "waitlisted", not 403, to avoid revealing API shape to bots (see services/backend/src/api/routes/platform/onboard.py). The smoke assertion accepts either, and fails only if a live org_id is issued.
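
The two smoke assertions can be expressed as pure checks over an already-parsed response. The real assert_gateway_error_shape and assert_onboard_gate_closed live in the shell script; the Python function names and return types below are hypothetical.

```python
from typing import Any, Mapping


def gateway_error_shape_problems(
    body: Mapping[str, Any], headers: Mapping[str, str]
) -> list[str]:
    """Return a list of problems with an OPEN-phase error response (empty if OK)."""
    error = body.get("error")
    if not isinstance(error, dict):
        return ["body has no error object"]

    problems = []
    if not error.get("code"):
        problems.append("error.code missing")
    if not error.get("remediation_info"):
        problems.append("error.remediation_info missing or empty")
    # Header names are case-insensitive.
    if "x-cm-request-id" not in {name.lower() for name in headers}:
        problems.append("X-CM-Request-Id header missing")
    return problems


def onboard_gate_closed(body: Mapping[str, Any]) -> bool:
    """ROLLBACK phase: the gate is closed as long as no live org_id is issued.

    A 200 "waitlisted" envelope and a 4xx are both acceptable, so the status
    code is deliberately not inspected here.
    """
    return not body.get("org_id")
```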


Troubleshooting

Dashboard shows a generic error with no “Fix now” button

  1. Confirm the gateway response body contains error.remediation_info (curl the failing call, inspect JSON). If absent, the gateway constructor is not passing remediation_info — check error_response.py.
  2. Confirm normalizeGatewayError() extracts it onto remediationInfo.
  3. Confirm the rendering surface (ErrorCard → RemediationBanner) is on the path that caught the error.

If the “Fix now” button links to the wrong page, fix the dashboard_url value in the gateway constructor’s remediation_info. resolveRemediationCTA prefers the backend dashboard_url; the static map in error-remediations.ts is only the fallback.

X-CM-Request-Id missing from the error response

Ensure the request spine is established before the route runs and that gateway_error_response() is the response path (it stamps the header from the spine contextvar). See the spine setup in error_response.py.

Gateway refuses to start with a Stripe error

check_stripe_live_key() is doing its job: STRIPE_SECRET_KEY is an sk_test_* key in production. Set a live key (sk_live_*) before enabling paid checkout.


Rollback

This runbook documents checks, not a mutation. To roll back a launch, run the ROLLBACK phase of the cutover smoke:

scripts/self-serve-flag-cutover-smoke.sh --env staging --only-phase rollback