Launch-Candidate Smoke Flake Handling
When a pull request carries the launch-candidate label (or the
launch-go-no-go alias, or its branch starts with launch-candidate-),
the Self-Serve Staging Smoke
becomes a blocking check. This runbook covers the three permitted
responses when that smoke fails.
This gate is item 15 of
SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.
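
The three trigger conditions above can be expressed as one predicate. This is a minimal sketch for local reasoning only; the authoritative gate is the workflow's own `if:` condition, and the function name and the comma-separated label argument are assumptions made for illustration.

```shell
# is_launch_candidate LABELS BRANCH
#   LABELS: comma-separated label names on the pull request (assumed input shape)
#   BRANCH: head branch name
# Returns 0 when the smoke would be a blocking check, 1 otherwise.
is_launch_candidate() {
  labels=",$1,"   # pad with commas so only whole label names match
  branch="$2"
  case "$labels" in
    *,launch-candidate,*|*,launch-go-no-go,*) return 0 ;;
  esac
  case "$branch" in
    launch-candidate-*) return 0 ;;
  esac
  return 1
}
```

The comma padding keeps a label such as launch-candidate-ish from matching, and the branch test only matches a prefix, not a substring.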
Required artifacts on every launch-candidate run
The workflow uploads an artifact named
self-serve-staging-smoke-<run-id>-<attempt> containing the JSON proof
packet from the smoke script:
| File | Source |
|---|---|
| 01-token-create.json | BYOVM token create response |
| 02-doctor.json | cm-runner doctor output |
| 03a-hcloud-create.json | Hetzner VM provisioning response |
| 03b-agent-discovered.json | Agent register-with-org discovery |
| 03-agent-online.json | Agent online confirmation |
| 04-launch.json | Template launch response (run_id) |
| 05-run-final.json | Terminal run state |
| cleanup-response.json | Resource cleanup status |
Artifacts presented at a go/no-go meeting must come from a run within the
last 24 hours. If the newest artifact is older than 24h, rerun the workflow
via workflow_dispatch before the meeting.
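
The 24-hour rule is a simple age comparison. The sketch below assumes you already have the run's creation time as a Unix epoch; the function name is hypothetical and the conversion from a CI timestamp is left out.

```shell
# artifact_is_fresh CREATED_EPOCH [NOW_EPOCH]
# Returns 0 when the artifact is at most 24 hours old, 1 when it is older.
# NOW_EPOCH defaults to the current time and exists mainly for testing.
artifact_is_fresh() {
  created="$1"
  now="${2:-$(date +%s)}"
  max_age=$((24 * 60 * 60))   # 24 hours in seconds
  [ $((now - created)) -le "$max_age" ]
}
```

If the check fails, the rerun can likely be started with `gh workflow run "Self-Serve Staging Smoke (SS-010)"`, assuming the workflow needs no required inputs for a workflow_dispatch run.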
In addition the launch room expects the following local-script proof packets (run alongside the CI smoke, attached to the go/no-go doc):
- scripts/print-launch-urls.sh --launch-proof: URL contradiction audit, must exit 0.
- scripts/self-serve-launch-check.sh: local dashboard + backend + routes matrix, must exit 0.
- scripts/golden-path-smoke.sh: signup → checkout → first API call, must exit 0.
- scripts/launch-command-center.sh today: daily evidence packet for item 20 (no DB access required).
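
Since three of the local scripts must exit 0, it can help to run all four in one pass and see every failure rather than stopping at the first. The helper below is a hypothetical sketch, not part of the repo; it reads one command per line on stdin.

```shell
# run_launch_checks: run each command read from stdin, print PASS/FAIL per
# command, and return non-zero if any of them failed.
run_launch_checks() {
  failed=0
  while IFS= read -r cmd; do
    [ -n "$cmd" ] || continue
    # </dev/null stops the child from consuming the remaining command lines.
    if sh -c "$cmd" </dev/null; then
      echo "PASS  $cmd"
    else
      echo "FAIL  $cmd"
      failed=1
    fi
  done
  return "$failed"
}

# Example invocation from the repo root (commented out so that sourcing this
# file has no side effects):
#   run_launch_checks <<'EOF'
#   scripts/print-launch-urls.sh --launch-proof
#   scripts/self-serve-launch-check.sh
#   scripts/golden-path-smoke.sh
#   scripts/launch-command-center.sh today
#   EOF
```

Each script's own output is left untouched so it can still be attached to the go/no-go doc.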
Response 1 — Fix the failure
This is the default. Most launch-candidate smoke failures are real regressions. Diagnose via the artifact files in this order:
1. 01-token-create.json: non-zero status_code or missing token field means BYOVM registration is broken (item 14 wiring).
2. 03a-hcloud-create.json: non-zero error means provisioning is broken; check Hetzner billing / quota.
3. 03-agent-online.json: missing agent_id means the agent binary failed to call back; check cloud-init logs by re-running the workflow with keep_vm=true and SSHing in.
4. 04-launch.json: non-2xx means the certified template launch API is broken (gateway, dispatcher, or template registry).
5. 05-run-final.json: status: failed is the most common late failure mode; the embedded error field is the agent’s terminal message.
Push the fix, re-run the workflow, confirm green, then merge.
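
The triage order above can be walked mechanically once the artifact is unpacked into a directory. This is a rough sketch under several assumptions: jq is installed, each packet is a single top-level JSON object, and the field names are exactly those mentioned in the list. The HTTP status field of 04-launch.json is not documented here, so the sketch checks for run_id instead.

```shell
# packet_ok DIR FILE FILTER: true when the jq FILTER evaluates to true for
# DIR/FILE. A missing or unparsable file counts as a failure.
packet_ok() {
  jq -e "$3" "$1/$2" >/dev/null 2>&1
}

# first_broken_stage DIR: print the first packet that fails its check,
# or "none" when every check passes.
first_broken_stage() {
  dir="$1"
  packet_ok "$dir" 01-token-create.json '((.status_code // 0) == 0) and (.token != null)' \
    || { echo 01-token-create.json; return 0; }
  packet_ok "$dir" 03a-hcloud-create.json '(.error // 0) == 0' \
    || { echo 03a-hcloud-create.json; return 0; }
  packet_ok "$dir" 03-agent-online.json '.agent_id != null' \
    || { echo 03-agent-online.json; return 0; }
  # Assumption: a successful launch response carries run_id.
  packet_ok "$dir" 04-launch.json '.run_id != null' \
    || { echo 04-launch.json; return 0; }
  packet_ok "$dir" 05-run-final.json '.status != "failed"' \
    || { echo 05-run-final.json; return 0; }
  echo none
}
```

The artifact can typically be fetched with `gh run download <run-id> --name <artifact-name> --dir <dir>` before calling `first_broken_stage <dir>`.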
Response 2 — Retry once for a known transient class
Transient failures we have observed on staging:
- hcloud_provision_timeout: Hetzner image pull > 90s. Re-run.
- agent_register_window_exceeded: VM provisioned but agent missed the registration window (cloud-init log shows systemd unit started late). Re-run.
- gateway_503 on the first API call: usually a backend cold-start on a recently redeployed staging. Wait 60s and re-run.
A single retry is permitted. A second consecutive failure must escalate to Response 1 or Response 3. Do not retry indefinitely.
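
The retry rule reduces to a small decision table. The sketch below encodes it; the function name and the output strings are hypothetical, and the failure class is assumed to have been identified by a human reading the artifacts.

```shell
# next_action FAILURE_CLASS ATTEMPT
#   ATTEMPT is 1 for the first failed run and 2 for a failed retry.
# Prints one of: retry, retry-after-60s, escalate.
next_action() {
  class="$1"
  attempt="$2"
  # Only one retry is allowed, so any failure on attempt 2 or later escalates.
  if [ "$attempt" -ge 2 ]; then
    echo escalate
    return 0
  fi
  case "$class" in
    hcloud_provision_timeout|agent_register_window_exceeded) echo retry ;;
    gateway_503) echo retry-after-60s ;;
    *) echo escalate ;;   # not a known transient class
  esac
}
```

Here "escalate" means moving to Response 1 or Response 3; the sketch does not choose between them.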
Response 3 — Waiver (last resort)
If the failure is in a non-launch surface (e.g. the smoke trips an
infrastructure flake unrelated to self-serve), apply the
launch-waiver label and add a comment in this exact format:
LAUNCH WAIVER (item 15)
- failure_class: <hcloud|agent|gateway|template|other>
- owner: <github-handle>
- risk: <one-paragraph statement of what is being waived>
- expected_followup: <link to issue tracking the real fix>
- approver: <go/no-go owner github-handle>

The waiver label is the only sanctioned bypass; the workflow is still listed as failed in CI, but the launch owner has accepted the risk.
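
Because the comment format is exact, a draft can be checked before it is posted. The validator below is a hypothetical sketch: it only confirms that the header and the five fields are present, that failure_class is one of the listed values, and that no placeholder was left unfilled.

```shell
# validate_waiver FILE: returns 0 when FILE holds a complete waiver comment.
validate_waiver() {
  file="$1"
  # Header must appear verbatim on its own line.
  grep -qxF 'LAUNCH WAIVER (item 15)' "$file" || return 1
  # failure_class must be one of the enumerated values.
  grep -Eq -e '^- failure_class: (hcloud|agent|gateway|template|other)$' "$file" || return 1
  # Remaining fields must be present, non-empty, and not an unfilled <placeholder>.
  for field in owner risk expected_followup approver; do
    grep -Eq -e "^- ${field}: [^<[:space:]]" "$file" || return 1
  done
  return 0
}
```

Once a draft passes, the label and comment can be applied with `gh pr edit <number> --add-label launch-waiver` and `gh pr comment <number> --body-file <file>`.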
Verifying the gate is wired
# Show the workflow YAML — the if-gate must reference launch-candidate.
gh workflow view "Self-Serve Staging Smoke (SS-010)" --yaml | grep -A2 launch-candidate
# Confirm the launch-waiver label exists on the repo.
gh label list --search launch-waiver

If launch-waiver is missing, create it:
gh label create launch-waiver \
  --color B71C1C \
  --description "Accepted launch-candidate smoke waiver (with risk note)"

When to remove the launch-candidate label
After merge to develop and successful deploy to staging, the launch
owner should immediately remove the launch-candidate label from any
other open PRs so the workflow doesn’t keep burning Hetzner VMs.
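
The cleanup above can be scripted with the GitHub CLI. This is a sketch that assumes it runs inside the repository with an authenticated `gh`; the function name is hypothetical, and note that `gh pr list` returns at most 30 results unless `--limit` is raised.

```shell
# clear_launch_candidate_labels: remove the launch-candidate label from every
# open PR that still carries it.
clear_launch_candidate_labels() {
  gh pr list --state open --label launch-candidate --json number --jq '.[].number' |
  while IFS= read -r pr; do
    [ -n "$pr" ] || continue
    # </dev/null keeps gh from reading the remaining PR numbers on stdin.
    gh pr edit "$pr" --remove-label launch-candidate </dev/null
  done
}
```

The sketch leaves the launch-go-no-go alias and branch-name triggers alone, since the guidance above only mentions the launch-candidate label.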
Related runbooks: Incident Response · Webhook Disaster Recovery.