
Managed runners are in private beta. Contact us for access.

Runbook: Runner Operations

  • Owner: Platform Team
  • Backup owner: On-call engineer
  • Last validated: May 19, 2026
  • Validation method: Manual drill + production canary
  • Severity trigger: SEV3
  • Customer impact: Runner provisioning or management unavailable
  • Required access: SSH (VPS), Docker, MongoDB
  • Related services: curateme-backend-gateway, runner containers
  • Related runbooks: Runner Startup SLO, Support: Runner Incidents


Infrastructure operations for managed runner hosts — bootstrapping, registration, rollback, and capacity tuning.

cm-runner agent (public path). Customer BYOVM machines now install the agent from the public registry: ghcr.io/curate-me-ai/cm-runner:<vYYYY.M.D>. The legacy localhost:5000/curate-me/cm-runner and the deleted services/runner-agent/ source tree are no longer used — the canonical agent source is packages/cm-runner/ and images are published from deploy/runner-images/cm-runner/Dockerfile.

For the customer-facing install flow see the Connect Your Machine quickstart; the dashboard generates per-org install commands via InstallGuide.tsx.


1. Host Bootstrap — Bring a New Hetzner VPS Online

Prerequisites

  • hcloud CLI installed and configured
  • Access to the CurateMe Hetzner project
  • Gateway API key with RUNNER_CREATE permission

Steps

# 1. Create the server (keep the generated name for the later steps)
SERVER_NAME="cm-runner-$(date +%s)"
hcloud server create \
  --name "$SERVER_NAME" \
  --type cx22 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --network "$HETZNER_NETWORK_ID" \
  --firewall "$HETZNER_FIREWALL_ID" \
  --user-data-from-file deploy/vps/cloud-init/cm-runner-user-data.yaml

# 2. Wait for the server to boot and cloud-init to finish
hcloud server ssh "$SERVER_NAME" "cloud-init status --wait"

# 3. Verify Docker is running
hcloud server ssh "$SERVER_NAME" "docker info --format '{{.ServerVersion}}'"

# 4. Verify the cm-runner agent is running
hcloud server ssh "$SERVER_NAME" "systemctl status cm-runner"

# 5. Check the agent heartbeat in the gateway
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents \
  | jq '.agents[] | select(.status=="online")'
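The checks in steps 4 and 5 may need a retry if the agent has not sent its first heartbeat yet. A minimal polling sketch, assuming a hypothetical helper named wait_until (not part of the runbook tooling) and a probe command that exits non-zero while the host is not ready:

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or TIMEOUT_SECONDS
# have been spent sleeping. Returns 0 on success, 1 on timeout, 2 on bad input.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  # Guard against a zero interval, which would loop forever.
  [ "$interval_s" -ge 1 ] || return 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval_s))
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}
```

For example, running `wait_until 120 5 systemctl is-active --quiet cm-runner` on the runner host polls every 5 seconds for up to two minutes.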

Verification

  • Agent shows status: online in Your Machines (BYOVM) agent list
  • Heartbeat timestamp is within the last 60 seconds
  • systemctl status cm-runner shows active (running)
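The 60-second heartbeat window can be checked mechanically. A minimal sketch, assuming the heartbeat timestamp is available as an ISO 8601 UTC string (the gateway's actual field name and format are not shown in this runbook) and that GNU date is available, as on the Ubuntu runner hosts; heartbeat_is_fresh is a hypothetical helper name:

```shell
# heartbeat_is_fresh TIMESTAMP [MAX_AGE_SECONDS]
# Exits 0 when TIMESTAMP (ISO 8601, e.g. 2026-05-19T12:00:00Z) is at most
# MAX_AGE_SECONDS old (default 60). Exits 2 if the timestamp cannot be parsed.
# Timestamps in the future are treated as stale (possible clock skew).
# Requires GNU date for the -d option.
heartbeat_is_fresh() {
  hb_ts=$1
  max_age=${2:-60}
  hb_epoch=$(date -u -d "$hb_ts" +%s) || return 2
  now_epoch=$(date -u +%s)
  age=$((now_epoch - hb_epoch))
  [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}
```

Called as `heartbeat_is_fresh "$TIMESTAMP" || echo "heartbeat is stale"`, it turns the second verification bullet into a scriptable check.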

You can also verify gateway-side health with:

./scripts/analytics health

2. Agent Registration

Generate Registration Token

curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/register-token \
  -d '{"org_id": "your-org-id", "ttl_seconds": 3600}'

Register Agent

# On the runner host.
# The body is passed as an unquoted here-document so that $(hostname) is
# expanded; inside single quotes it would be sent literally.
curl -X POST \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/register \
  -d @- <<EOF
{
  "registration_token": "TOKEN_FROM_ABOVE",
  "agent_id": "agent-$(hostname)",
  "hostname": "$(hostname)",
  "capabilities": ["docker", "desktop", "openclaw"]
}
EOF

Verify

# Check that the agent appears and is online
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents

# Check the heartbeat
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents/$AGENT_ID/heartbeat

If registration fails, check gateway error logs for the source of the issue:

./scripts/errors by-source gateway

Run Diagnostics

The cm-runner agent heartbeat reports capabilities (Docker version, disk space, available memory, supported image pulls) on every cycle. To trigger an on-demand diagnostic run from the host:

cm-runner agent --diagnostics-only

The agent reports the result to the control plane and the dashboard displays it next to the machine card. For deeper triage of a specific failure mode, see the dedicated pages under Troubleshooting.


3. Incident Rollback — Disable Hetzner Provider

Use this when the Hetzner provider is causing issues and you need to fall back to E2B.

Immediate Mitigation

# 1. Disable the Your Machines (BYOVM) feature flag.
#    Set in .env or runtime config:
FF_RUNNER_BYOVM=false

# 2. All new runner requests now fall back to E2B automatically.
#    Existing Hetzner sessions continue until they expire.

# 3. To force-drain existing Hetzner sessions:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/drain \
  -d '{"provider_type": "hetzner_vps"}'

# 4. Emergency kill switch (stops ALL runner operations):
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/activate \
  -d '{"reason": "Hetzner provider incident", "actor": "ops-team"}'
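Because the drain and kill-switch calls affect live sessions, it can help to preview the exact request before sending it. A minimal sketch of a dry-run wrapper; gw_post, GATEWAY_URL, and DRY_RUN are hypothetical names introduced here for illustration, not documented settings:

```shell
# Hypothetical wrapper for the gateway POST calls in this section.
# GATEWAY_URL and DRY_RUN are illustrative names, not documented settings.
GATEWAY_URL=${GATEWAY_URL:-https://api.curate-me.ai/gateway}

# gw_post PATH JSON_BODY
# With DRY_RUN=1, print the request instead of sending it.
gw_post() {
  gw_path=$1
  gw_body=$2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'POST %s%s %s\n' "$GATEWAY_URL" "$gw_path" "$gw_body"
    return 0
  fi
  # --fail makes curl exit non-zero on HTTP errors so scripts can react.
  curl -sS --fail -X POST \
    -H "Authorization: Bearer $GATEWAY_TOKEN" \
    -H "Content-Type: application/json" \
    "$GATEWAY_URL$gw_path" \
    -d "$gw_body"
}
```

Running `DRY_RUN=1 gw_post /admin/runners/drain '{"provider_type": "hetzner_vps"}'` prints the request for review; the same call without DRY_RUN sends it.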

Recovery

# 1. Fix the underlying issue

# 2. Re-enable the feature flag:
FF_RUNNER_BYOVM=true

# 3. Verify with a canary:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/canary \
  -d '{"provider": "hetzner_vps"}'

# 4. Deactivate the kill switch if it was activated:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/deactivate

After recovery, confirm no lingering errors:

./scripts/errors by-source gateway
./scripts/analytics health

4. OpenClaw CLI Failure Rollback

Use this when the OpenClaw CLI is failing or producing incorrect results.

Immediate (simulate mode)

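# Switch to simulate mode: CLI calls become no-ops.
# If cm-backend runs under systemd, set these in its environment
# (.env or runtime config); an export in this shell alone is not
# inherited by the restarted service.
export OPENCLAW_AUDIT_MODE=simulate
export OPENCLAW_CLI_REQUIRED=false

# Restart the backend
systemctl restart cm-backend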

Permanent Fix

# 1. Check the current version
openclaw --version

# 2. Reinstall/upgrade to a known-good version.
#    Current pin is 2026.5.22. 2026.4.2 remains an approved rollback target
#    during the canary window; substitute it here only if rolling back.
npm install -g @openclaw/cli@2026.5.22

# 3. Verify
openclaw --version
openclaw selftest

# 4. Restore enforcement
export OPENCLAW_AUDIT_MODE=enforce
export OPENCLAW_CLI_REQUIRED=true
systemctl restart cm-backend
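The version check in step 3 can be scripted so that an unexpected version stops the procedure before enforcement is restored. A minimal sketch, assuming a hypothetical helper name and that the bare version string has already been extracted (the exact output format of openclaw --version is not shown in this runbook):

```shell
# is_approved_openclaw_version VERSION
# Exits 0 when VERSION is the current pin or the approved rollback target
# named in this runbook, 1 otherwise.
is_approved_openclaw_version() {
  case "$1" in
    2026.5.22) return 0 ;;   # current pin
    2026.4.2)  return 0 ;;   # approved rollback target during the canary window
    *)         return 1 ;;
  esac
}
```

The list of approved versions is hard-coded here on purpose, so it has to be updated whenever the pin changes.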

5. Template / Skill Pack Rollback

Use this when a skill pack causes runner failures.

Quarantine the Pack

# 1. Quarantine the problematic skill pack
curl -X DELETE \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/skill-packs/$PACK_ID

# 2. Revert the template to the previous skill pack
curl -X PATCH \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID \
  -d '{"skill_pack_id": "previous-known-good-pack-id"}'

# 3. Lock the image to the last-known-good reference
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID/lock-image \
  -d '{"image_ref": "curate-me/runner-openclaw-locked:v1.2.3@sha256:abc..."}'
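A tag such as v1.2.3 can be re-pushed, while a sha256 digest identifies one exact image, so the lock in step 3 is only meaningful when image_ref carries a full digest. A small pre-flight check, assuming a hypothetical helper name and the standard 64-character lowercase hex digest form:

```shell
# is_digest_pinned IMAGE_REF
# Exits 0 when IMAGE_REF ends in @sha256:<64 lowercase hex characters>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

A truncated digest, like the abbreviated one in the example payload, fails this check, which is the intended behavior.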

6. Slack Bridge Incident Rollback

Use this when the Slack conversation bridge is causing issues.

# 1. Disable the Slack channel adapter
export OPENCLAW_CHANNEL_SLACK_ENABLED=false
systemctl restart cm-backend

# 2. Kill affected sessions if needed
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/terminate

# 3. Check conversation mappings for the affected org
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/conversations?org_id=$ORG_ID"

# 4. Re-enable after the fix
export OPENCLAW_CHANNEL_SLACK_ENABLED=true
systemctl restart cm-backend

7. Capacity Tuning

Key Configuration Parameters

Variable                            Default  Description
HETZNER_BURST_ENABLED               false    Enable burst provisioning
HETZNER_BURST_MAX_HOSTS             5        Maximum burst hosts
HETZNER_BURST_IDLE_TIMEOUT_SECONDS  900      Idle timeout before scale-down, in seconds
HETZNER_SERVER_TYPE                 cx22     Hetzner server type (2 vCPU, 4 GB RAM)
HETZNER_WARM_POOL_SIZE              0        Pre-provisioned warm hosts
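A typo in one of these variables would otherwise surface only after a backend restart. A minimal pre-restart check, assuming the backend expects true or false for the flag and non-negative integers for the numeric settings (the accepted formats are an assumption, and both helper names are hypothetical):

```shell
# is_uint VALUE: exits 0 when VALUE is a non-empty string of digits.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *)           return 0 ;;
  esac
}

# validate_capacity_env
# Checks the capacity variables in the current environment. Unset variables
# fall back to the documented defaults. Prints the first problem to stderr
# and exits 1, or exits 0 when everything looks plausible.
validate_capacity_env() {
  case "${HETZNER_BURST_ENABLED:-false}" in
    true|false) ;;
    *) echo "HETZNER_BURST_ENABLED must be true or false" >&2; return 1 ;;
  esac
  is_uint "${HETZNER_BURST_MAX_HOSTS:-5}" || {
    echo "HETZNER_BURST_MAX_HOSTS must be a non-negative integer" >&2; return 1; }
  is_uint "${HETZNER_BURST_IDLE_TIMEOUT_SECONDS:-900}" || {
    echo "HETZNER_BURST_IDLE_TIMEOUT_SECONDS must be a non-negative integer" >&2; return 1; }
  is_uint "${HETZNER_WARM_POOL_SIZE:-0}" || {
    echo "HETZNER_WARM_POOL_SIZE must be a non-negative integer" >&2; return 1; }
  return 0
}
```

Running `validate_capacity_env && systemctl restart cm-backend` would skip the restart when a value is malformed.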

Monitoring Commands

# Total runner count
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/inventory?limit=0" | jq '.total'

# Runner minutes used today
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/quotas" | jq '.daily_runner_minutes_used'

# Provider health
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/health" | jq '.'

# Cost SLO metrics
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/cost-slo/report"

You can also use the operational tooling for a quick health snapshot:

./scripts/analytics health
./scripts/errors by-source gateway

Scaling Up

# Increase burst capacity
export HETZNER_BURST_MAX_HOSTS=10
export HETZNER_BURST_ENABLED=true

# Add a warm pool for faster cold starts
export HETZNER_WARM_POOL_SIZE=2

# Apply and restart
systemctl restart cm-backend

Scaling Down

# Reduce burst capacity
export HETZNER_BURST_MAX_HOSTS=3
export HETZNER_BURST_IDLE_TIMEOUT_SECONDS=300   # 5 min idle timeout

# Drain idle hosts
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/drain \
  -d '{"provider_type": "hetzner_vps", "idle_only": true}'

Rollback

For runner host changes, revert with git checkout HEAD~1 -- deploy/runner-images/ && docker compose up -d. For capacity changes, restore the previous values in the runner control plane configuration. Specific rollback procedures for templates and Slack bridge are in Sections 5 and 6 above.
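git checkout HEAD~1 -- <path> restores the files under that path from the previous commit without moving HEAD, and leaves the restored versions staged. The throwaway-repository sketch below illustrates that behavior; the function name, file contents, and commit messages are illustrative only:

```shell
# Demonstrates a path-scoped revert in a temporary repository.
# The body runs in a subshell so the caller's working directory is untouched.
demo_path_revert() (
  set -e
  repo=$(mktemp -d)
  cd "$repo"
  git init -q
  mkdir -p deploy/runner-images/cm-runner
  echo "known-good" > deploy/runner-images/cm-runner/Dockerfile
  git add deploy
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "known good"
  echo "bad-change" > deploy/runner-images/cm-runner/Dockerfile
  git -c user.name=demo -c user.email=demo@example.com commit -q -am "bad change"

  # Restore only deploy/runner-images/ from the previous commit; HEAD stays put.
  git checkout -q HEAD~1 -- deploy/runner-images/

  cat deploy/runner-images/cm-runner/Dockerfile   # prints: known-good
  git status --short                              # lists the Dockerfile as a staged modification
  cd /
  rm -rf "$repo"
)
```

Because the revert is staged but not committed, git status keeps showing it until the change is committed or reset.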

Verification

After applying changes, verify:

  • Runner health endpoint responds: curl -s http://localhost:8002/health | jq .status
  • Existing runners still report heartbeats in the dashboard
  • New runner provisioning works: launch a test runner from the dashboard