Managed runners are in private beta. Contact us for access.
Runbook: Runner Operations
Owner: Platform Team
Backup owner: On-call engineer
Last validated: May 19, 2026
Validation method: Manual drill + production canary
Severity trigger: SEV3
Customer impact: Runner provisioning or management unavailable
Required access: SSH (VPS), Docker, MongoDB
Related services: curateme-backend-gateway, runner containers
Related runbooks: Runner Startup SLO, Support: Runner Incidents
Infrastructure operations for managed runner hosts — bootstrapping, registration, rollback, and capacity tuning.
cm-runner agent (public path). Customer BYOVM machines now install the
agent from the public registry: ghcr.io/curate-me-ai/cm-runner:<vYYYY.M.D>.
The legacy localhost:5000/curate-me/cm-runner and the deleted
services/runner-agent/ source tree are no longer used — the canonical agent
source is packages/cm-runner/ and images are published from
deploy/runner-images/cm-runner/Dockerfile.
For the customer-facing install flow see the
Connect Your Machine quickstart; the
dashboard generates per-org install commands via InstallGuide.tsx.
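The `<vYYYY.M.D>` placeholder suggests date-style version tags. As a minimal sketch, the check below validates a tag before it is pasted into an install command. The function names are hypothetical, and the pattern assumes a four-digit year with month and day written without leading zeros; adjust it if published tags differ.

```shell
# Hypothetical helper: succeeds when the tag looks like vYYYY.M.D.
# Assumption: month and day carry no leading zeros.
is_runner_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]{4}\.([1-9]|1[0-2])\.([1-9]|[12][0-9]|3[01])$'
}

# Prints the full public image reference for a valid tag, or fails with a message.
runner_image_ref() {
  if is_runner_tag "$1"; then
    printf 'ghcr.io/curate-me-ai/cm-runner:%s\n' "$1"
  else
    printf 'unexpected tag format: %s\n' "$1" >&2
    return 1
  fi
}
```

For example, `docker pull "$(runner_image_ref v2026.5.19)"` would pull only if the tag passes the format check.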
1. Host Bootstrap — Bring a New Hetzner VPS Online
Prerequisites
- hcloud CLI installed and configured
- Access to the CurateMe Hetzner project
- Gateway API key with RUNNER_CREATE permission
Steps
# 1. Create server
hcloud server create \
--name cm-runner-$(date +%s) \
--type cx22 \
--image ubuntu-22.04 \
--location nbg1 \
--network $HETZNER_NETWORK_ID \
--firewall $HETZNER_FIREWALL_ID \
--user-data-from-file deploy/vps/cloud-init/cm-runner-user-data.yaml
# 2. Wait for server to boot and cloud-init to finish
hcloud server ssh cm-runner-XXXX "cloud-init status --wait"
# 3. Verify Docker is running
hcloud server ssh cm-runner-XXXX "docker info --format '{{.ServerVersion}}'"
# 4. Verify cm-runner agent is running
hcloud server ssh cm-runner-XXXX "systemctl status cm-runner"
# 5. Check agent heartbeat in gateway
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/byovm/agents | jq '.agents[] | select(.status=="online")'
Verification
- Agent shows status: online in the Your Machines (BYOVM) agent list
- Heartbeat timestamp is within the last 60 seconds
- systemctl status cm-runner shows active (running)
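The 60-second heartbeat check in the list above can be scripted. This is a minimal sketch: the function name is hypothetical, and because the gateway's heartbeat timestamp format is not documented here, the helper takes Unix epoch seconds and leaves the conversion to the caller.

```shell
# Hypothetical helper: succeeds when a heartbeat is at most max_age seconds old.
# Usage: heartbeat_is_fresh <last_seen_epoch> <now_epoch> [max_age_seconds]
# A heartbeat dated in the future (negative age, e.g. clock skew) counts as stale.
heartbeat_is_fresh() {
  last_seen=$1
  now=$2
  max_age=${3:-60}
  age=$((now - last_seen))
  [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}
```

A typical call is `heartbeat_is_fresh "$LAST_SEEN_EPOCH" "$(date +%s)" && echo fresh`, where `LAST_SEEN_EPOCH` is a placeholder for the converted heartbeat timestamp.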
You can also verify gateway-side health with:
./scripts/analytics health
2. Agent Registration
Generate Registration Token
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
https://api.curate-me.ai/gateway/admin/runners/byovm/register-token \
-d '{"org_id": "your-org-id", "ttl_seconds": 3600}'
Register Agent
# On the runner host. The payload is double-quoted so that $(hostname) expands;
# inside single quotes the shell would send the literal text "$(hostname)".
curl -X POST \
-H "Content-Type: application/json" \
https://api.curate-me.ai/gateway/admin/runners/byovm/register \
-d "{
\"registration_token\": \"TOKEN_FROM_ABOVE\",
\"agent_id\": \"agent-$(hostname)\",
\"hostname\": \"$(hostname)\",
\"capabilities\": [\"docker\", \"desktop\", \"openclaw\"]
}"
Verify
# Check agent appears and is online
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/byovm/agents
# Check heartbeat
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/byovm/agents/$AGENT_ID/heartbeat
If registration fails, check gateway error logs for the source of the issue:
./scripts/errors by-source gateway
Run Diagnostics
The cm-runner agent heartbeat reports capabilities (Docker version, disk
space, available memory, supported image pulls) on every cycle. To trigger an
on-demand diagnostic run from the host:
cm-runner agent --diagnostics-only
The agent reports the result to the control plane and the dashboard displays it next to the machine card. For deeper triage of a specific failure mode, see the dedicated pages under Troubleshooting:
- Invalid Registration Token
- Machine Offline
- Docker Socket Denied
- Image Pull Failed
- OpenClaw Boot Failed
- Missing Credentials
- Slow First Launch / Startup
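Before opening one of the pages above, it can help to look at the kinds of values the heartbeat reports. The sketch below takes a rough local snapshot on a Linux host. The function name and output keys are hypothetical, and this is not what `cm-runner agent --diagnostics-only` runs internally.

```shell
# Hypothetical helper: prints Docker version, free disk, and available memory.
host_snapshot() {
  # Docker server version, or "unavailable" when the CLI or daemon cannot be reached.
  if command -v docker >/dev/null 2>&1; then
    docker_version=$(docker info --format '{{.ServerVersion}}' 2>/dev/null) || docker_version=unavailable
  else
    docker_version=unavailable
  fi
  # Free space on the root filesystem in KiB (POSIX df output, column 4).
  disk_avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
  # Available memory in KiB; /proc/meminfo exists on Linux only.
  if [ -r /proc/meminfo ]; then
    mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  else
    mem_avail_kb=unknown
  fi
  printf 'docker_version=%s\ndisk_avail_kb=%s\nmem_avail_kb=%s\n' \
    "${docker_version:-unavailable}" "$disk_avail_kb" "${mem_avail_kb:-unknown}"
}
```

Running `host_snapshot` on the runner host prints three `key=value` lines that can be pasted into an incident ticket.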
3. Incident Rollback — Disable Hetzner Provider
Use this when the Hetzner provider is causing issues and you need to fall back to E2B.
Immediate Mitigation
# 1. Disable the Hetzner Your Machines (BYOVM) feature flag
# Set in .env or runtime config:
FF_RUNNER_BYOVM=false
# 2. All new runner requests will fall back to E2B automatically.
# Existing Hetzner sessions continue until they expire.
# 3. To force-drain existing Hetzner sessions:
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/drain \
-d '{"provider_type": "hetzner_vps"}'
# 4. Emergency kill switch (stops ALL runner operations):
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/activate \
-d '{"reason": "Hetzner provider incident", "actor": "ops-team"}'
Recovery
# 1. Fix the underlying issue
# 2. Re-enable feature flag: FF_RUNNER_BYOVM=true
# 3. Verify with canary:
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/canary \
-d '{"provider": "hetzner_vps"}'
# 4. Deactivate kill switch if it was activated:
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/deactivate
After recovery, confirm no lingering errors:
./scripts/errors by-source gateway
./scripts/analytics health
4. OpenClaw CLI Failure Rollback
When OpenClaw CLI is failing or producing incorrect results.
Immediate (simulate mode)
# Switch to simulate mode — CLI calls become no-ops
export OPENCLAW_AUDIT_MODE=simulate
export OPENCLAW_CLI_REQUIRED=false
# Restart backend
systemctl restart cm-backend
Permanent Fix
# 1. Check current version
openclaw --version
# 2. Reinstall/upgrade to known-good version
# (Current pin is 2026.5.22; 2026.4.2 remains an approved rollback target
# during the canary window — substitute it here only if rolling back.)
npm install -g @openclaw/cli@2026.5.22
# 3. Verify
openclaw --version
openclaw selftest
# 4. Restore enforcement
export OPENCLAW_AUDIT_MODE=enforce
export OPENCLAW_CLI_REQUIRED=true
systemctl restart cm-backend
5. Template / Skill Pack Rollback
When a skill pack causes runner failures.
Quarantine the Pack
# 1. Quarantine the problematic skill pack
curl -X DELETE \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/skill-packs/$PACK_ID
# 2. Revert template to previous skill_pack_version
curl -X PATCH \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
-H "Content-Type: application/json" \
https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID \
-d '{"skill_pack_id": "previous-known-good-pack-id"}'
# 3. Lock image to last-known-good
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID/lock-image \
-d '{"image_ref": "curate-me/runner-openclaw-locked:v1.2.3@sha256:abc..."}'
6. Slack Bridge Incident Rollback
When the Slack conversation bridge is causing issues.
# 1. Disable Slack channel adapter
export OPENCLAW_CHANNEL_SLACK_ENABLED=false
systemctl restart cm-backend
# 2. Kill affected sessions if needed
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/terminate
# 3. Check conversation mappings for the affected org
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
"https://api.curate-me.ai/gateway/admin/runners/conversations?org_id=$ORG_ID"
# 4. Re-enable after fix
export OPENCLAW_CHANNEL_SLACK_ENABLED=true
systemctl restart cm-backend
7. Capacity Tuning
Key Configuration Parameters
| Variable | Default | Description |
|---|---|---|
| HETZNER_BURST_ENABLED | false | Enable burst provisioning |
| HETZNER_BURST_MAX_HOSTS | 5 | Maximum burst hosts |
| HETZNER_BURST_IDLE_TIMEOUT_SECONDS | 900 | Idle timeout in seconds before scale-down |
| HETZNER_SERVER_TYPE | cx22 | Hetzner server type (2 vCPU, 4 GB RAM) |
| HETZNER_WARM_POOL_SIZE | 0 | Pre-provisioned warm hosts |
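To see which values are in effect before tuning, the sketch below prints each variable and falls back to the table's default when it is unset. The function name is hypothetical, and it only reflects the current shell's environment, which may differ from what the backend process was started with.

```shell
# Hypothetical helper: prints each capacity variable with its effective value,
# using the documented default when the variable is unset or empty.
show_capacity_config() {
  printf 'HETZNER_BURST_ENABLED=%s\n' "${HETZNER_BURST_ENABLED:-false}"
  printf 'HETZNER_BURST_MAX_HOSTS=%s\n' "${HETZNER_BURST_MAX_HOSTS:-5}"
  printf 'HETZNER_BURST_IDLE_TIMEOUT_SECONDS=%s\n' "${HETZNER_BURST_IDLE_TIMEOUT_SECONDS:-900}"
  printf 'HETZNER_SERVER_TYPE=%s\n' "${HETZNER_SERVER_TYPE:-cx22}"
  printf 'HETZNER_WARM_POOL_SIZE=%s\n' "${HETZNER_WARM_POOL_SIZE:-0}"
}
```

Calling `show_capacity_config` before and after a change gives a quick diff of what was actually modified.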
Monitoring Commands
# Total runner count from the inventory
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
"https://api.curate-me.ai/gateway/admin/runners/inventory?limit=0" | jq '.total'
# Runner minutes used today (quota consumption)
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
"https://api.curate-me.ai/gateway/admin/runners/quotas" | jq '.daily_runner_minutes_used'
# Provider health
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
"https://api.curate-me.ai/gateway/admin/runners/health" | jq '.'
# Cost SLO metrics
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
"https://api.curate-me.ai/gateway/admin/runners/cost-slo/report"
You can also use the operational tooling for a quick health snapshot:
./scripts/analytics health
./scripts/errors by-source gateway
Scaling Up
# Increase burst capacity
export HETZNER_BURST_MAX_HOSTS=10
export HETZNER_BURST_ENABLED=true
# Add warm pool for faster cold starts
export HETZNER_WARM_POOL_SIZE=2
# Apply and restart
systemctl restart cm-backend
Scaling Down
# Reduce burst capacity
export HETZNER_BURST_MAX_HOSTS=3
export HETZNER_BURST_IDLE_TIMEOUT_SECONDS=300 # 5 min idle timeout
# Drain idle hosts
curl -X POST \
-H "Authorization: Bearer $GATEWAY_TOKEN" \
https://api.curate-me.ai/gateway/admin/runners/drain \
-d '{"provider_type": "hetzner_vps", "idle_only": true}'
Rollback
For runner host changes, revert with git checkout HEAD~1 -- deploy/runner-images/ && docker compose up -d. For capacity changes, restore the previous values in the runner control plane configuration. Specific rollback procedures for templates and Slack bridge are in Sections 5 and 6 above.
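Restoring previous capacity values is easier if they were recorded before the change. The sketch below is one way to do that, assuming the settings are exported in the operator's shell as in the scaling examples; where the control plane actually stores them may differ. The function name is hypothetical, and values are assumed to be simple tokens without spaces or quotes.

```shell
# Hypothetical helper: writes the exported capacity variables to a file as
# "export KEY=value" lines so they can be re-applied later by sourcing it.
# Assumption: values are plain tokens (numbers, booleans, server type names).
snapshot_capacity_env() {
  env \
    | grep -E '^HETZNER_(BURST_ENABLED|BURST_MAX_HOSTS|BURST_IDLE_TIMEOUT_SECONDS|SERVER_TYPE|WARM_POOL_SIZE)=' \
    | sed 's/^/export /' > "$1"
}
```

Run `snapshot_capacity_env /tmp/capacity.before` ahead of a change, then `. /tmp/capacity.before` followed by the usual backend restart to roll back. The file path is a placeholder.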
Verification
After applying changes, verify:
- Runner health endpoint responds: curl -s http://localhost:8002/health | jq .status
- Existing runners still report heartbeats in the dashboard
- New runner provisioning works: launch a test runner from the dashboard
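After a restart the health endpoint may take a few seconds to come up, so a single curl can give a false negative. The sketch below wraps the check from the list above in a small retry loop. Both function names are hypothetical, and the attempt count and delay are illustrative rather than tuned values.

```shell
# Hypothetical helper: runs a command up to $1 times, sleeping $2 seconds
# between attempts, and succeeds as soon as the command does.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds when the local health endpoint answers without an HTTP error status.
health_ok() {
  curl -fsS http://localhost:8002/health >/dev/null
}
```

For example, `retry 10 3 health_ok && echo healthy` waits up to about 30 seconds before giving up.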
Related Runbooks
- Runner Provision Failure — when runner containers fail to provision
- Runner Stuck — when a runner is running but unresponsive
- Deployment Procedure — full production deployment workflow