Managed runners are in private beta. Contact us for access.

Runbook: Runner Operations

Owner: Platform Team
Backup owner: On-call engineer
Last validated: May 19, 2026
Validation method: Manual drill + production canary
Severity trigger: SEV3
Customer impact: Runner provisioning or management unavailable
Required access: SSH (VPS), Docker, MongoDB
Related services: curateme-backend-gateway, runner containers
Related runbooks: Runner Startup SLO, Support: Runner Incidents

Infrastructure operations for managed runner hosts — bootstrapping, registration, rollback, and capacity tuning.

cm-runner agent (public path). Customer BYOVM machines now install the agent from the public registry: ghcr.io/curate-me-ai/cm-runner:<vYYYY.M.D>. The legacy localhost:5000/curate-me/cm-runner and the deleted services/runner-agent/ source tree are no longer used — the canonical agent source is packages/cm-runner/ and images are published from deploy/runner-images/cm-runner/Dockerfile.
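As a minimal sketch of the tag convention above: the helper below validates a release tag and builds the full image reference. It assumes tags look like v2026.5.19 with no zero padding (the source only shows the <vYYYY.M.D> placeholder), and the function name is hypothetical.

```shell
# Public registry path from this runbook.
CM_RUNNER_REPO="ghcr.io/curate-me-ai/cm-runner"

# cm_runner_image_ref TAG
# Prints REPO:TAG when TAG matches the assumed vYYYY.M.D pattern,
# otherwise prints an error to stderr and returns 1.
cm_runner_image_ref() {
  if printf '%s\n' "$1" | grep -Eq '^v[0-9]{4}\.[0-9]{1,2}\.[0-9]{1,2}$'; then
    printf '%s:%s\n' "$CM_RUNNER_REPO" "$1"
  else
    echo "invalid tag: $1 (expected vYYYY.M.D)" >&2
    return 1
  fi
}

# Example:
#   docker pull "$(cm_runner_image_ref v2026.5.19)"
```

Rejecting anything that is not a dated tag (for example `latest`) keeps a host from silently picking up an unpinned image.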

For the customer-facing install flow see the Connect Your Machine quickstart; the dashboard generates per-org install commands via InstallGuide.tsx.

1. Host Bootstrap — Bring a New Hetzner VPS Online

Prerequisites

hcloud CLI installed and configured
Access to the CurateMe Hetzner project
Gateway API key with RUNNER_CREATE permission

Steps


# 1. Create server (keep the generated name for the steps below)
SERVER_NAME="cm-runner-$(date +%s)"
hcloud server create \
  --name "$SERVER_NAME" \
  --type cx22 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --network "$HETZNER_NETWORK_ID" \
  --firewall "$HETZNER_FIREWALL_ID" \
  --user-data-from-file deploy/vps/cloud-init/cm-runner-user-data.yaml

# 2. Wait for server to boot and cloud-init to finish
hcloud server ssh "$SERVER_NAME" "cloud-init status --wait"

# 3. Verify Docker is running
hcloud server ssh "$SERVER_NAME" "docker info --format '{{.ServerVersion}}'"

# 4. Verify cm-runner agent is running
hcloud server ssh "$SERVER_NAME" "systemctl status cm-runner"
 
# 5. Check agent heartbeat in gateway
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents | jq '.agents[] | select(.status=="online")'
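The checks above can fail for a short time right after the server is created, because SSH, Docker, and the agent come up at different moments. A small retry helper is one way to handle that; this is a sketch, and the helper name, attempt count, and delay are illustrative choices rather than part of the tooling.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 once all attempts are used up.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_n" -ge "$retry_attempts" ]; then
      echo "gave up after $retry_n attempts: $*" >&2
      return 1
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Example (substitute the real server name):
#   retry 30 10 hcloud server ssh "<server-name>" "systemctl is-active --quiet cm-runner"
```

The helper uses plain variables rather than `local`, so it should not be nested inside another `retry` call.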

Verification

Agent shows status: online in Your Machines (BYOVM) agent list
Heartbeat timestamp is within the last 60 seconds
systemctl status cm-runner shows active (running)
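The 60-second heartbeat rule can be checked mechanically. The sketch below assumes the heartbeat timestamp has already been converted to Unix epoch seconds; the source does not show the API's timestamp format, so that conversion is left to the caller, and the function name is hypothetical.

```shell
# heartbeat_is_fresh HEARTBEAT_EPOCH [NOW_EPOCH] [MAX_AGE_SECONDS]
# Succeeds when the heartbeat is at most MAX_AGE_SECONDS old (default 60).
# NOW_EPOCH defaults to the current time.
heartbeat_is_fresh() {
  hb_ts=$1
  hb_now=${2:-$(date +%s)}
  hb_max=${3:-60}
  hb_age=$((hb_now - hb_ts))
  # A heartbeat dated in the future (clock skew) is treated as not fresh.
  [ "$hb_age" -ge 0 ] && [ "$hb_age" -le "$hb_max" ]
}

# Example:
#   if heartbeat_is_fresh "$last_heartbeat_epoch"; then echo "agent OK"; fi
```

Treating future-dated heartbeats as stale is a deliberately conservative choice, so that clock skew between the host and the gateway is noticed rather than hidden.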

You can also verify gateway-side health with:


./scripts/analytics health

2. Agent Registration

Generate Registration Token


curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/register-token \
  -d '{"org_id": "your-org-id", "ttl_seconds": 3600}'

Register Agent


# On the runner host:
# The payload is double-quoted so that $(hostname) is expanded by the shell;
# inside single quotes it would be sent literally.
curl -X POST \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/register \
  -d "{
    \"registration_token\": \"TOKEN_FROM_ABOVE\",
    \"agent_id\": \"agent-$(hostname)\",
    \"hostname\": \"$(hostname)\",
    \"capabilities\": [\"docker\", \"desktop\", \"openclaw\"]
  }"
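When registration is scripted, it can be easier to build the payload in a helper so the hostname is always filled in by the shell. This is a sketch: the function name is hypothetical, the field names are taken from the request above, and no JSON escaping is performed, so it assumes the token and hostname contain no quotes or backslashes.

```shell
# build_register_payload TOKEN [HOSTNAME]
# Prints the registration JSON on one line. HOSTNAME defaults to this host.
build_register_payload() {
  reg_host=${2:-$(hostname)}
  printf '{"registration_token":"%s","agent_id":"agent-%s","hostname":"%s","capabilities":["docker","desktop","openclaw"]}\n' \
    "$1" "$reg_host" "$reg_host"
}

# Example:
#   curl -X POST \
#     -H "Content-Type: application/json" \
#     https://api.curate-me.ai/gateway/admin/runners/byovm/register \
#     -d "$(build_register_payload "$REGISTRATION_TOKEN")"
```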

Verify


# Check agent appears and is online
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents
 
# Check heartbeat
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/byovm/agents/$AGENT_ID/heartbeat

If registration fails, check gateway error logs for the source of the issue:


./scripts/errors by-source gateway

Run Diagnostics

The cm-runner agent heartbeat reports capabilities (Docker version, disk space, available memory, supported image pulls) on every cycle. To trigger an on-demand diagnostic run from the host:


cm-runner agent --diagnostics-only

The agent reports the result to the control plane, and the dashboard displays it next to the machine card. For deeper triage of a specific failure mode, see the dedicated pages under Troubleshooting.

3. Incident Rollback — Disable Hetzner Provider

Use this when the Hetzner BYOVM provider is causing issues. Note: Hetzner BYOVM is the production runner transport — there is no live E2B fallback. The e2b_provider.py code path exists but is not wired as a production provider, so disabling BYOVM halts new runner provisioning rather than failing over. Plan for a provisioning outage, and use the emergency kill switch below if you need to stop all runner activity.

Immediate Mitigation


# 1. Disable the Your Machines (BYOVM) feature flag
# Set in .env or runtime config:
FF_RUNNER_BYOVM=false
 
# 2. New runner provisioning is halted (there is no automatic E2B fallback in production).
#    Existing Hetzner sessions continue until they expire.
 
# 3. To force-drain existing Hetzner sessions:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/drain \
  -d '{"provider_type": "hetzner_vps"}'
 
# 4. Emergency kill switch (stops ALL runner operations):
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/activate \
  -d '{"reason": "Hetzner provider incident", "actor": "ops-team"}'
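If the flag lives in a .env file, step 1 can be scripted so the flip is repeatable during an incident. The helper below is a sketch under stated assumptions: the file contains plain KEY=VALUE lines, the helper name is hypothetical, and the file path is whatever your deployment actually uses.

```shell
# set_env_flag FILE KEY VALUE
# Replaces any existing KEY=... line in FILE with KEY=VALUE,
# or appends the line if KEY is not present. Creates FILE if missing.
set_env_flag() {
  env_file=$1
  env_key=$2
  env_value=$3
  env_tmp=$(mktemp) || return 1
  # Copy every line except existing assignments of KEY.
  # grep exits non-zero when nothing is left to copy, which is fine here.
  if [ -f "$env_file" ]; then
    grep -v "^${env_key}=" "$env_file" > "$env_tmp" || true
  fi
  printf '%s=%s\n' "$env_key" "$env_value" >> "$env_tmp"
  # Overwrite in place so an existing file keeps its owner and permissions.
  cat "$env_tmp" > "$env_file"
  rm -f "$env_tmp"
}

# Example (path is a placeholder):
#   set_env_flag /path/to/.env FF_RUNNER_BYOVM false
```

The same call with `true` covers the re-enable step during recovery. The service still has to reload its configuration before the change takes effect.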

Recovery


# 1. Fix the underlying issue
# 2. Re-enable feature flag: FF_RUNNER_BYOVM=true
# 3. Verify with canary:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/canary \
  -d '{"provider": "hetzner_vps"}'
 
# 4. Deactivate kill switch if it was activated:
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/security/kill-switch/deactivate

After recovery, confirm no lingering errors:


./scripts/errors by-source gateway
./scripts/analytics health

4. OpenClaw CLI Failure Rollback

When OpenClaw CLI is failing or producing incorrect results.

Immediate (simulate mode)


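# Switch to simulate mode (CLI calls become no-ops).
# Note: a systemd-managed service does not inherit variables exported in an
# interactive shell. Set these wherever cm-backend reads its environment
# (for example the .env or runtime config) before restarting.
export OPENCLAW_AUDIT_MODE=simulate
export OPENCLAW_CLI_REQUIRED=false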
 
# Restart backend
systemctl restart cm-backend

Permanent Fix


# 1. Check current version
openclaw --version
 
# 2. Reinstall/upgrade to known-good version
# (Current pin is 2026.5.22; 2026.4.2 remains an approved rollback target
#  during the canary window — substitute it here only if rolling back.)
npm install -g @openclaw/cli@2026.5.22
 
# 3. Verify
openclaw --version
openclaw selftest
 
# 4. Restore enforcement
export OPENCLAW_AUDIT_MODE=enforce
export OPENCLAW_CLI_REQUIRED=true
systemctl restart cm-backend
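Before restoring enforcement, it can help to check the installed version against the approved versions named above. The sketch below takes a bare version string as input; the output format of `openclaw --version` is not shown in this runbook, so extracting the bare version is left to the caller, and the function and variable names are hypothetical.

```shell
# Versions taken from this runbook: the current pin and the approved rollback target.
OPENCLAW_PINNED="2026.5.22"
OPENCLAW_ROLLBACK="2026.4.2"

# classify_openclaw_version VERSION
# Prints "pinned", "rollback", or "unapproved".
# Returns 1 for an unapproved version so it can gate a script.
classify_openclaw_version() {
  case $1 in
    "$OPENCLAW_PINNED")   echo pinned ;;
    "$OPENCLAW_ROLLBACK") echo rollback ;;
    *)                    echo unapproved; return 1 ;;
  esac
}

# Example (assumes $installed_version holds a bare version such as 2026.5.22):
#   classify_openclaw_version "$installed_version" || echo "do not restore enforcement yet"
```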

5. Template / Skill Pack Rollback

When a skill pack causes runner failures.

Quarantine the Pack


# 1. Quarantine the problematic skill pack
curl -X DELETE \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/skill-packs/$PACK_ID
 
# 2. Revert template to the previous known-good skill pack
curl -X PATCH \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID \
  -d '{"skill_pack_id": "previous-known-good-pack-id"}'
 
# 3. Lock image to last-known-good
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID/lock-image \
  -d '{"image_ref": "curate-me/runner-openclaw-locked:v1.2.3@sha256:abc..."}'
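Locking to last-known-good only holds if the reference carries a full digest, because a tag alone can be re-pointed. As a sketch, the hypothetical helper below checks that an image reference ends in a complete sha256 digest (64 lowercase hex characters) before it is sent to the lock-image endpoint.

```shell
# is_digest_pinned IMAGE_REF
# Succeeds when IMAGE_REF ends with @sha256:<64 lowercase hex characters>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Example:
#   is_digest_pinned "$IMAGE_REF" || { echo "refusing to lock an unpinned image" >&2; exit 1; }
```

A truncated placeholder digest such as the one in the example request fails this check, which is the intended behavior.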

6. Slack Bridge Incident Rollback

When the Slack conversation bridge is causing issues.


# 1. Disable Slack channel adapter
export OPENCLAW_CHANNEL_SLACK_ENABLED=false
systemctl restart cm-backend
 
# 2. Kill affected sessions if needed
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/terminate
 
# 3. Check conversation mappings for the affected org
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/conversations?org_id=$ORG_ID"
 
# 4. Re-enable after fix
export OPENCLAW_CHANNEL_SLACK_ENABLED=true
systemctl restart cm-backend

7. Capacity Tuning

Key Configuration Parameters

Variable	Default	Description
`HETZNER_BURST_ENABLED`	`false`	Enable burst provisioning
`HETZNER_BURST_MAX_HOSTS`	`5`	Maximum burst hosts
`HETZNER_BURST_IDLE_TIMEOUT_SECONDS`	`900`	Idle timeout before scale-down
`HETZNER_SERVER_TYPE`	`cx22`	Hetzner server type (2 vCPU, 4 GB RAM)
`HETZNER_WARM_POOL_SIZE`	`0`	Pre-provisioned warm hosts
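To see which values are in effect in a given shell, the defaults from the table can be applied with parameter expansion. This is a sketch: the helper names are hypothetical, and the validation rules (non-negative integers, `true`/`false` for the burst flag) are assumptions about what the backend accepts.

```shell
# print_capacity_config
# Prints each variable with its current value, or the documented default if unset.
print_capacity_config() {
  printf 'HETZNER_BURST_ENABLED=%s\n' "${HETZNER_BURST_ENABLED:-false}"
  printf 'HETZNER_BURST_MAX_HOSTS=%s\n' "${HETZNER_BURST_MAX_HOSTS:-5}"
  printf 'HETZNER_BURST_IDLE_TIMEOUT_SECONDS=%s\n' "${HETZNER_BURST_IDLE_TIMEOUT_SECONDS:-900}"
  printf 'HETZNER_SERVER_TYPE=%s\n' "${HETZNER_SERVER_TYPE:-cx22}"
  printf 'HETZNER_WARM_POOL_SIZE=%s\n' "${HETZNER_WARM_POOL_SIZE:-0}"
}

# is_non_negative_int VALUE
# Succeeds for strings made only of digits (0, 1, 2, ...).
is_non_negative_int() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# validate_capacity_config
# Checks the numeric settings and the burst flag; returns 1 on the first problem.
validate_capacity_config() {
  for cap_v in "${HETZNER_BURST_MAX_HOSTS:-5}" \
               "${HETZNER_BURST_IDLE_TIMEOUT_SECONDS:-900}" \
               "${HETZNER_WARM_POOL_SIZE:-0}"; do
    is_non_negative_int "$cap_v" || {
      echo "not a non-negative integer: $cap_v" >&2
      return 1
    }
  done
  case ${HETZNER_BURST_ENABLED:-false} in
    true|false) ;;
    *) echo "HETZNER_BURST_ENABLED must be true or false" >&2; return 1 ;;
  esac
}
```

Running `validate_capacity_config && print_capacity_config` before a restart gives a quick record of what is about to be applied.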

Monitoring Commands


# Total runner count
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/inventory?limit=0" | jq '.total'
 
# Daily runner minutes used (quota consumption)
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/quotas" | jq '.daily_runner_minutes_used'
 
# Provider health
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/health" | jq '.'
 
# Cost SLO metrics
curl -H "Authorization: Bearer $GATEWAY_TOKEN" \
  "https://api.curate-me.ai/gateway/admin/runners/cost-slo/report"

You can also use the operational tooling for a quick health snapshot:


./scripts/analytics health
./scripts/errors by-source gateway

Scaling Up


# Increase burst capacity
export HETZNER_BURST_MAX_HOSTS=10
export HETZNER_BURST_ENABLED=true
 
# Add warm pool for faster cold starts
export HETZNER_WARM_POOL_SIZE=2
 
# Apply and restart
systemctl restart cm-backend

Scaling Down


# Reduce burst capacity
export HETZNER_BURST_MAX_HOSTS=3
export HETZNER_BURST_IDLE_TIMEOUT_SECONDS=300  # 5 min idle timeout
 
# Apply and restart (as in Scaling Up)
systemctl restart cm-backend
 
# Drain idle hosts
curl -X POST \
  -H "Authorization: Bearer $GATEWAY_TOKEN" \
  -H "Content-Type: application/json" \
  https://api.curate-me.ai/gateway/admin/runners/drain \
  -d '{"provider_type": "hetzner_vps", "idle_only": true}'

Rollback

For runner host changes, revert with git checkout HEAD~1 -- deploy/runner-images/ && docker compose up -d. For capacity changes, restore the previous values in the runner control plane configuration. Specific rollback procedures for templates and Slack bridge are in Sections 5 and 6 above.

Verification

After applying changes, verify:

Runner health endpoint responds: curl -s http://localhost:8002/health | jq .status
Existing runners still report heartbeats in the dashboard
New runner provisioning works: launch a test runner from the dashboard

Related Runbooks

Runner Provision Failure: when runner containers fail to provision
Runner Stuck: when a runner is running but unresponsive
Deployment Procedure: full production deployment workflow
