Launch-Candidate Smoke Flake Handling

When a pull request carries the launch-candidate label (or the launch-go-no-go alias, or its branch starts with launch-candidate-), the Self-Serve Staging Smoke becomes a blocking check. This runbook covers the three permitted responses when that smoke fails.
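The trigger condition above can be sketched as a small shell predicate. This is an illustration only, not the workflow's actual if-gate: the function name and the comma-separated label input are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: does a PR count as a launch candidate?
# $1 = comma-separated label names (hypothetical input format)
# $2 = head branch name
is_launch_candidate() {
  labels=",$1,"   # wrap in commas so only whole label names match
  branch="$2"
  case "$labels" in
    *,launch-candidate,*|*,launch-go-no-go,*) return 0 ;;
  esac
  case "$branch" in
    launch-candidate-*) return 0 ;;
  esac
  return 1
}

# Usage:
#   is_launch_candidate "bug,launch-go-no-go" "feature/x" && echo "smoke is blocking"
```

Wrapping the label list in commas avoids a false match on a label that merely starts with launch-candidate.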

This gate is item 15 of SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.

Required artifacts on every launch-candidate run

The workflow uploads an artifact named self-serve-staging-smoke-<run-id>-<attempt> containing the JSON proof packet from the smoke script:

File                        Source
01-token-create.json        BYOVM token create response
02-doctor.json              cm-runner doctor output
03a-hcloud-create.json      Hetzner VM provisioning response
03b-agent-discovered.json   Agent register-with-org discovery
03-agent-online.json        Agent online confirmation
04-launch.json              Template launch response (run_id)
05-run-final.json           Terminal run state
cleanup-response.json       Resource cleanup status

The artifacts presented at a go/no-go meeting must be less than 24 hours old. If the latest artifact is older than 24h, rerun the workflow via workflow_dispatch before the meeting.
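The 24-hour rule is simple enough to check mechanically. A minimal sketch, assuming you already have the artifact's creation time as epoch seconds (how you obtain it is left open here); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: is an artifact still within the 24h window?
# $1 = artifact creation time, epoch seconds
# $2 = current time, epoch seconds (defaults to now)
artifact_is_fresh() {
  created="$1"
  now="${2:-$(date +%s)}"
  max_age=$((24 * 60 * 60))
  [ $((now - created)) -le "$max_age" ]
}

# Usage (workflow name taken from the verification section of this runbook;
# assumes the workflow accepts a manual workflow_dispatch trigger):
#   artifact_is_fresh "$created_epoch" || gh workflow run "Self-Serve Staging Smoke (SS-010)"
```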

In addition, the launch room expects the following local-script proof packets (run alongside the CI smoke and attached to the go/no-go doc):

  • scripts/print-launch-urls.sh --launch-proof — URL contradiction audit, must exit 0.
  • scripts/self-serve-launch-check.sh — local dashboard + backend + routes matrix, must exit 0.
  • scripts/golden-path-smoke.sh — signup → checkout → first API call, must exit 0.
  • scripts/launch-command-center.sh today — daily evidence packet for item 20 (no DB access required).
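One way to run the exit-0 checks in a single pass is a small wrapper that reports each command's result and fails if any of them fails. This is a sketch; the run_proofs helper is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Sketch: run each command line in turn and report PASS/FAIL by exit code.
# Returns non-zero if any command failed.
run_proofs() {
  failed=0
  for cmd in "$@"; do
    if sh -c "$cmd"; then
      echo "PASS $cmd"
    else
      echo "FAIL $cmd"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (from the repository root):
#   run_proofs \
#     "scripts/print-launch-urls.sh --launch-proof" \
#     "scripts/self-serve-launch-check.sh" \
#     "scripts/golden-path-smoke.sh"
```

The daily evidence packet (scripts/launch-command-center.sh today) is left out of the usage line because the list above does not state an exit-code requirement for it.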

Response 1 — Fix the failure

This is the default. Most launch-candidate smoke failures are real regressions. Diagnose via the artifact files in this order:

  1. 01-token-create.json — non-zero status_code or missing token field means BYOVM registration is broken (item 14 wiring).
  2. 03a-hcloud-create.json — non-zero error means provisioning is broken; check Hetzner billing / quota.
  3. 03-agent-online.json — missing agent_id means the agent binary failed to call back; check cloud-init logs by re-running the workflow with keep_vm=true and SSHing in.
  4. 04-launch.json — non-2xx means the certified template launch API is broken (gateway, dispatcher, or template registry).
  5. 05-run-final.json with status: failed is the most common late failure mode; the embedded error field is the agent’s terminal message.
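A first pass over a downloaded artifact directory could look like the sketch below. It covers only the three signals that name a concrete field (token, agent_id, status) and uses crude grep matching; the exact JSON shapes are assumptions, and steps 2 and 4 are skipped because their response layout is not described here. A JSON-aware tool would be more robust.

```shell
#!/bin/sh
# Sketch: report the first failure signal found, in triage order.
# $1 = directory holding the unpacked proof packet.
# A missing file counts as a failure for the 01 and 03 checks.
first_failure() {
  dir="$1"
  if ! grep -q '"token"' "$dir/01-token-create.json" 2>/dev/null; then
    echo "01-token-create.json: no token field (BYOVM registration)"
  elif ! grep -q '"agent_id"' "$dir/03-agent-online.json" 2>/dev/null; then
    echo "03-agent-online.json: no agent_id (agent never called back)"
  elif grep -Eq '"status"[[:space:]]*:[[:space:]]*"failed"' "$dir/05-run-final.json" 2>/dev/null; then
    echo "05-run-final.json: run ended with status failed"
  else
    echo "no failure signal found by this sketch"
  fi
}

# Usage:
#   first_failure ./self-serve-staging-smoke-artifact
```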

Push the fix, re-run the workflow, confirm green, then merge.

Response 2 — Retry once for a known transient class

Transient failures we have observed on staging:

  • hcloud_provision_timeout — Hetzner image pull > 90s. Re-run.
  • agent_register_window_exceeded — VM provisioned but agent missed the registration window (cloud-init log shows systemd unit started late). Re-run.
  • gateway_503 on the first API call — usually a backend cold-start on a recently redeployed staging. Wait 60s and re-run.

A single retry is permitted. A second consecutive failure must escalate to Response 1 or Response 3. Do not retry indefinitely.
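The retry policy can be written down as a small decision function. This is a sketch of the rule as stated above; the function name and the output strings are made up for illustration.

```shell
#!/bin/sh
# Sketch: decide what to do after a launch-candidate smoke failure.
# $1 = failure class
# $2 = retries already used for this failure (defaults to 0)
next_action() {
  class="$1"
  retries_used="${2:-0}"
  if [ "$retries_used" -ge 1 ]; then
    # The single permitted retry is spent: go to Response 1 or Response 3.
    echo "escalate"
    return
  fi
  case "$class" in
    hcloud_provision_timeout|agent_register_window_exceeded) echo "rerun" ;;
    gateway_503) echo "wait-60s-then-rerun" ;;
    *) echo "escalate" ;;   # not a known transient class
  esac
}

# Usage:
#   next_action gateway_503 0
#   next_action hcloud_provision_timeout 1
```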

Response 3 — Waiver (last resort)

If the failure is in a non-launch surface (e.g. the smoke trips an infrastructure flake unrelated to self-serve), apply the launch-waiver label and add a comment in this exact format:

LAUNCH WAIVER (item 15)
- failure_class: <hcloud|agent|gateway|template|other>
- owner: <github-handle>
- risk: <one-paragraph statement of what is being waived>
- expected_followup: <link to issue tracking the real fix>
- approver: <go/no-go owner github-handle>
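To avoid typos in the waiver comment, it can be generated rather than typed. A minimal sketch, assuming the fields are written one per line; the waiver_comment helper and the PR number in the usage lines are placeholders.

```shell
#!/bin/sh
# Sketch: print a waiver comment body.
# Args: failure_class owner risk expected_followup approver
waiver_comment() {
  case "$1" in
    hcloud|agent|gateway|template|other) ;;
    *) echo "invalid failure_class: $1" >&2; return 1 ;;
  esac
  printf 'LAUNCH WAIVER (item 15)\n'
  printf -- '- failure_class: %s\n' "$1"
  printf -- '- owner: %s\n' "$2"
  printf -- '- risk: %s\n' "$3"
  printf -- '- expected_followup: %s\n' "$4"
  printf -- '- approver: %s\n' "$5"
}

# Usage (123 is a placeholder PR number):
#   gh pr edit 123 --add-label launch-waiver
#   waiver_comment gateway some-owner "risk text" "issue link" some-approver \
#     | gh pr comment 123 --body-file -
```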

The waiver label is the only sanctioned bypass; the workflow is still listed as failed in CI but the launch owner has accepted the risk.

Verifying the gate is wired

# Show the workflow YAML; the if-gate must reference launch-candidate.
gh workflow view "Self-Serve Staging Smoke (SS-010)" --yaml | grep -A2 launch-candidate

# Confirm the launch-waiver label exists on the repo.
gh label list --search launch-waiver

If launch-waiver is missing, create it:

gh label create launch-waiver \
  --color B71C1C \
  --description "Accepted launch-candidate smoke waiver (with risk note)"

When to remove the launch-candidate label

After merge to develop and successful deploy to staging, the launch owner should immediately remove the launch-candidate label from any other open PRs so the workflow doesn’t keep burning Hetzner VMs.
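The label sweep can be scripted with the GitHub CLI. The sketch below assumes gh is authenticated against the right repository; the function name is hypothetical. It touches only the launch-candidate label, not the launch-go-no-go alias or branch names.

```shell
#!/bin/sh
# Sketch: remove the launch-candidate label from every open PR that has it.
strip_launch_candidate() {
  gh pr list --state open --label launch-candidate --json number --jq '.[].number' |
  while read -r pr; do
    # </dev/null keeps gh from consuming the list being read by the loop.
    gh pr edit "$pr" --remove-label launch-candidate </dev/null &&
      echo "removed launch-candidate from #$pr"
  done
}

# Usage (run after the launch PR has merged and staging has deployed):
#   strip_launch_candidate
```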

Related runbooks: Incident Response · Webhook Disaster Recovery.