Launch-Candidate Smoke Flake Handling

When a pull request carries the launch-candidate label (or the launch-go-no-go alias, or its branch starts with launch-candidate-), the Self-Serve Staging Smoke becomes a blocking check. This runbook covers the three permitted responses when that smoke fails.

This gate is item 15 of SELF_SERVE_DEVELOP_DD502239_10_10_REMAINING_PLAN.
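
The blocking condition described above can be sketched as a small shell predicate, which is handy for checking locally whether a given PR would trip the gate. This is only an illustration: the real condition lives in the workflow YAML, and the function name and argument layout here are hypothetical.

```shell
# Sketch of the gate condition from this runbook (not the workflow's own code).
# $1: comma-separated PR labels, $2: head branch name.
is_launch_candidate() {
  labels=",$1,"
  branch="$2"
  # Either the label or its alias makes the smoke blocking.
  case "$labels" in
    *,launch-candidate,*|*,launch-go-no-go,*) return 0 ;;
  esac
  # So does a branch name that starts with launch-candidate-.
  case "$branch" in
    launch-candidate-*) return 0 ;;
  esac
  return 1
}

# Usage sketch (123 is a placeholder PR number; needs an authenticated gh):
#   labels=$(gh pr view 123 --json labels --jq '[.labels[].name] | join(",")')
#   branch=$(gh pr view 123 --json headRefName --jq '.headRefName')
#   is_launch_candidate "$labels" "$branch" && echo "smoke is blocking"
```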

Required artifacts on every launch-candidate run

The workflow uploads an artifact named self-serve-staging-smoke-<run-id>-<attempt> containing the JSON proof packet from the smoke script:

| File | Source |
| --- | --- |
| `01-token-create.json` | BYOVM token create response |
| `02-doctor.json` | `cm-runner doctor` output |
| `03a-hcloud-create.json` | Hetzner VM provisioning response |
| `03b-agent-discovered.json` | Agent register-with-org discovery |
| `03-agent-online.json` | Agent online confirmation |
| `04-launch.json` | Template launch response (`run_id`) |
| `05-run-final.json` | Terminal run state |
| `cleanup-response.json` | Resource cleanup status |

The go/no-go meeting must see artifacts produced within the last 24 hours. If the latest artifact is older than 24 hours, rerun the workflow with workflow_dispatch before the meeting.
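
The 24-hour rule above can be checked before the meeting with a sketch like the following. It treats the run's creation time as a stand-in for the artifact's age, which is an assumption, and it relies on GNU `date -d` (BSD/macOS `date` needs different flags). The function name is hypothetical.

```shell
# Returns 0 only when the timestamp parses and is at most 24h old.
# Any other outcome (empty, unparsable, too old) returns non-zero,
# so the caller errs on the side of rerunning the smoke.
# $1: ISO-8601 timestamp, e.g. the createdAt field of a workflow run
# $2: optional "now" in epoch seconds (defaults to the current time)
artifact_is_fresh() {
  [ -n "$1" ] || return 2
  created_epoch=$(date -u -d "$1" +%s 2>/dev/null) || return 2
  now_epoch=${2:-$(date -u +%s)}
  [ $((now_epoch - created_epoch)) -le 86400 ]
}

# Usage sketch (needs an authenticated gh; the workflow name is the one used
# in "Verifying the gate is wired"):
#   wf="Self-Serve Staging Smoke (SS-010)"
#   created=$(gh run list --workflow "$wf" --status success --limit 1 \
#               --json createdAt --jq '.[0].createdAt')
#   artifact_is_fresh "$created" || gh workflow run "$wf"
```

If the workflow declares required workflow_dispatch inputs, `gh workflow run` needs them passed with `-f name=value`.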

In addition, the launch room expects the following local-script proof packets (run alongside the CI smoke and attached to the go/no-go doc):

scripts/print-launch-urls.sh --launch-proof — URL contradiction audit, must exit 0.
scripts/self-serve-launch-check.sh — local dashboard + backend + routes matrix, must exit 0.
scripts/golden-path-smoke.sh — signup → checkout → first API call, must exit 0.
scripts/launch-command-center.sh today — daily evidence packet for item 20 (no DB access required).
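
One way to collect the four local proof packets listed above in a single pass is a small wrapper that records each script's output and exit status. This is a sketch, not an existing repo script: the function names and the `launch-proof` output directory are hypothetical, and it treats a non-zero exit from `launch-command-center.sh` as a failure even though the list above only states the exit-0 requirement for the first three.

```shell
# Runs one proof command, saves its combined output to a log file, and
# prints a PASS/FAIL line. $1: log path; remaining args: the command.
run_proof() {
  log="$1"; shift
  if "$@" >"$log" 2>&1; then
    echo "PASS: $*"
  else
    rc=$?
    echo "FAIL (exit $rc): $*"
    return 1
  fi
}

# Runs all four scripts from the repo root; returns non-zero if any failed.
run_all_proofs() {
  out="${1:-launch-proof}"
  mkdir -p "$out" || return 1
  failed=0
  run_proof "$out/print-launch-urls.log" scripts/print-launch-urls.sh --launch-proof || failed=1
  run_proof "$out/self-serve-launch-check.log" scripts/self-serve-launch-check.sh || failed=1
  run_proof "$out/golden-path-smoke.log" scripts/golden-path-smoke.sh || failed=1
  run_proof "$out/launch-command-center.log" scripts/launch-command-center.sh today || failed=1
  return "$failed"
}
```

Every script runs even after an earlier failure, so the go/no-go doc gets a complete set of logs rather than stopping at the first red result.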

Response 1 — Fix the failure

This is the default. Most launch-candidate smoke failures are real regressions. Diagnose via the artifact files in this order:

01-token-create.json — non-zero status_code or missing token field means BYOVM registration is broken (item 14 wiring).
03a-hcloud-create.json — non-zero error means provisioning is broken; check Hetzner billing / quota.
03-agent-online.json — missing agent_id means the agent binary failed to call back; check cloud-init logs by re-running the workflow with keep_vm=true and SSHing in.
04-launch.json — non-2xx means the certified template launch API is broken (gateway, dispatcher, or template registry).
05-run-final.json — status: failed is the most common late failure mode; the embedded error field is the agent’s terminal message.
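
The diagnosis order above can be partly automated once the artifact has been downloaded (for example with `gh run download <run-id>`). The sketch below requires `jq` and reports the first file that looks broken. The field names (`token`, `error`, `agent_id`, `run_id`, `status`) come from this runbook, but the exact JSON layout is an assumption, and the HTTP status checks for `01` and `04` are left out because the runbook does not say how they are encoded in the files.

```shell
# Reports a missing or empty packet file. $1: path to the file.
packet_file_present() {
  [ -s "$1" ] || { echo "$(basename "$1"): missing or empty"; return 1; }
}

# Prints the first problem found, in the runbook's order, and returns 1.
# Returns 0 when none of the checks trips. $1: proof packet directory.
triage_packet() {
  d="$1"
  packet_file_present "$d/01-token-create.json" || return 1
  jq -e '.token' "$d/01-token-create.json" >/dev/null ||
    { echo "01-token-create.json: token field missing"; return 1; }

  packet_file_present "$d/03a-hcloud-create.json" || return 1
  if jq -e '.error != null and .error != 0' "$d/03a-hcloud-create.json" >/dev/null; then
    echo "03a-hcloud-create.json: provisioning error"; return 1
  fi

  packet_file_present "$d/03-agent-online.json" || return 1
  jq -e '.agent_id' "$d/03-agent-online.json" >/dev/null ||
    { echo "03-agent-online.json: agent_id missing"; return 1; }

  packet_file_present "$d/04-launch.json" || return 1
  jq -e '.run_id' "$d/04-launch.json" >/dev/null ||
    { echo "04-launch.json: run_id missing"; return 1; }

  packet_file_present "$d/05-run-final.json" || return 1
  if jq -e '.status == "failed"' "$d/05-run-final.json" >/dev/null; then
    echo "05-run-final.json: run failed: $(jq -r '.error' "$d/05-run-final.json")"
    return 1
  fi
  return 0
}
```

A clean result from this helper does not prove the run is green; it only means none of these specific checks tripped.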

Push the fix, re-run the workflow, confirm green, then merge.

Response 2 — Retry once for a known transient class

Transient failures we have observed on staging:

hcloud_provision_timeout — Hetzner image pull > 90s. Re-run.
agent_register_window_exceeded — VM provisioned but agent missed the registration window (cloud-init log shows systemd unit started late). Re-run.
gateway_503 on the first API call — usually a backend cold-start on a recently redeployed staging. Wait 60s and re-run.

A single retry is permitted. A second consecutive failure must be escalated to Response 1 or Response 3. Do not retry indefinitely.
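
The retry policy above is simple enough to encode, which helps keep re-runs honest. The sketch below assumes the failure class has already been identified by a human or a log grep. The function names are hypothetical; the class names and the 60-second wait are the ones listed in this section.

```shell
# Returns 0 when a re-run is allowed: the failure class is a known transient
# one and the failed run was the first attempt.
# $1: failure class, $2: attempt number of the failed run (1-based).
retry_allowed() {
  [ "$2" -eq 1 ] || return 1
  case "$1" in
    hcloud_provision_timeout|agent_register_window_exceeded|gateway_503) return 0 ;;
    *) return 1 ;;
  esac
}

# Seconds to wait before re-running; only gateway_503 calls for a pause.
retry_delay() {
  case "$1" in
    gateway_503) echo 60 ;;
    *) echo 0 ;;
  esac
}

# Usage sketch ($run_id, $class and $attempt are supplied by the operator;
# the attempt number also appears in the artifact name):
#   if retry_allowed "$class" "$attempt"; then
#     sleep "$(retry_delay "$class")"
#     gh run rerun "$run_id" --failed
#   else
#     echo "escalate to Response 1 or Response 3"
#   fi
```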

Response 3 — Waiver (last resort)

If the failure is in a non-launch surface (e.g. the smoke trips an infrastructure flake unrelated to self-serve), apply the launch-waiver label and add a comment in this exact format:


LAUNCH WAIVER (item 15)
- failure_class: <hcloud|agent|gateway|template|other>
- owner: <github-handle>
- risk: <one-paragraph statement of what is being waived>
- expected_followup: <link to issue tracking the real fix>
- approver: <go/no-go owner github-handle>

The waiver label is the only sanctioned bypass. The workflow is still listed as failed in CI, but the launch owner has accepted the risk.
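
Because the comment format is exact, generating it from arguments avoids typos in the field names. The helper below is a sketch with a hypothetical name; it only renders the text and rejects a failure class outside the five values listed in the template.

```shell
# Prints a waiver comment in the format required above.
# Args: failure_class owner risk followup_url approver
render_waiver() {
  [ "$#" -eq 5 ] || {
    echo "usage: render_waiver <failure_class> <owner> <risk> <followup> <approver>" >&2
    return 1
  }
  case "$1" in
    hcloud|agent|gateway|template|other) ;;
    *) echo "unknown failure_class: $1" >&2; return 1 ;;
  esac
  cat <<EOF
LAUNCH WAIVER (item 15)
- failure_class: $1
- owner: $2
- risk: $3
- expected_followup: $4
- approver: $5
EOF
}

# Usage sketch (123, the handles, and the issue URL are placeholders):
#   render_waiver gateway some-owner "Risk statement here" \
#     "https://example.com/issues/1" some-approver |
#     gh pr comment 123 --body-file -
#   gh pr edit 123 --add-label launch-waiver
```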

Verifying the gate is wired


# Show the workflow YAML — the if-gate must reference launch-candidate.
gh workflow view "Self-Serve Staging Smoke (SS-010)" --yaml | grep -A2 launch-candidate
 
# Confirm the launch-waiver label exists on the repo.
gh label list --search launch-waiver

If launch-waiver is missing, create it:


gh label create launch-waiver \
  --color B71C1C \
  --description "Accepted launch-candidate smoke waiver (with risk note)"

When to remove the launch-candidate label

After merge to develop and successful deploy to staging, the launch owner should immediately remove the launch-candidate label from any other open PRs so the workflow doesn’t keep burning Hetzner VMs.
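
The label cleanup described above can be done in one pass with the GitHub CLI. This is a sketch under the assumption that the launch owner has an authenticated `gh` in the repo checkout; the function name is hypothetical.

```shell
# Removes a label from every open PR that carries it.
# $1: label name (defaults to launch-candidate).
clear_launch_label() {
  label="${1:-launch-candidate}"
  gh pr list --state open --label "$label" --limit 100 \
     --json number --jq '.[].number' |
  while read -r pr; do
    echo "removing $label from #$pr"
    # stdin is redirected so the edit cannot consume the remaining PR numbers
    gh pr edit "$pr" --remove-label "$label" </dev/null
  done
}

# Usage sketch:
#   clear_launch_label                    # the launch-candidate label
#   clear_launch_label launch-go-no-go    # the alias label
```

PRs whose branch name starts with launch-candidate- still trigger the smoke regardless of labels, so those need to be closed or renamed separately.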

Related runbooks: Incident Response · Webhook Disaster Recovery.