Managed runners are in private beta. Contact us for access.

Runbook: Runner Provision Fails

Owner: Platform Team
Backup owner: On-call engineer
Last validated: Not yet validated
Validation method: Manual drill
Severity trigger: SEV2
Customer impact: New runner launches fail; existing runners unaffected
Required access: SSH (VPS), Docker, MongoDB, container registry
Related services: curateme-backend-gateway, Docker daemon, container registry

Managed runner containers are provisioned via the runner control plane, which pulls container images (OpenClaw-based), configures networking and env vars, and starts the container with VNC + gateway integration. This runbook covers failures at each stage of provisioning.

Symptoms

Runner status stuck in provisioning state for more than 2 minutes
Dashboard shows “Failed to start runner” error
Container never appears in docker ps
Container starts but immediately exits (visible in docker ps -a with Exited status)
VNC stream not available after container starts
Runner heartbeat never registers with gateway

Step 1: Determine the failure stage

Runner provisioning has 4 stages. Identify which one failed.


# Check runner status in the control plane
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Expected response:


{
  "runner_id": "rnr_abc123",
  "status": "provisioning",
  "provision_stage": "image_pull",
  "error": null,
  "created_at": "2026-05-04T14:30:00Z"
}

What to look for:

`provision_stage`	Meaning	Jump to
`image_pull`	Container image download failed	Cause A
`container_create`	Docker couldn’t create the container	Cause B
`container_start`	Container created but failed to start	Cause C
`health_check`	Container running but not healthy	Cause D
`null` or missing	Control plane never started provisioning	Check gateway logs
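
The lookup in the table can be scripted so the stage from the API response maps directly to the section to read next. This is a minimal sketch: the function name `stage_to_cause` is illustrative and not part of any existing tooling, and it assumes `jq` is installed on the machine running it.

```shell
# Map a provision_stage value to the section of this runbook to read next.
# The mapping mirrors the table above; stage_to_cause is a hypothetical helper.
stage_to_cause() {
  case "$1" in
    image_pull)       echo "Cause A" ;;
    container_create) echo "Cause B" ;;
    container_start)  echo "Cause C" ;;
    health_check)     echo "Cause D" ;;
    ""|null)          echo "Check gateway logs" ;;
    *)                echo "Unknown stage: $1" ;;
  esac
}

# Example (performs a real API call, so it is left commented out):
#   stage_to_cause "$(curl -s "https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID" \
#     -H "Authorization: Bearer $ADMIN_TOKEN" | jq -r '.provision_stage')"
```

`jq -r` prints the literal string `null` for a null or missing field, which is why that value is handled alongside the empty string.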

Step 2: Check infrastructure prerequisites


# On the runner VPS
ssh $DEPLOY_USER@$RUNNERS_VPS_IP
 
# Docker daemon health
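docker info --format 'Server Version: {{.ServerVersion}}, Containers: {{.Containers}}, Images: {{.Images}}'
# Should print the server version plus container and image counts; an error here means the daemon is unreachable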
 
# Resource availability
free -m | grep Mem
# Check available memory — each runner needs ~500-800MB
 
df -h /var/lib/docker
# Check disk space — images are 1-3GB each
 
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Check current container resource usage

Quick reference — resource requirements per runner:

Resource	Minimum	Recommended
Memory	512MB	768MB
Disk	3GB (image + workspace)	5GB
CPU	0.5 cores	1 core
Ports	5900 (VNC), 6080 (websockify), 18789 (gateway)	Per-container dynamic assignment
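
The minimums in the table can be turned into a quick capacity check before retrying a provision. The sketch below is an assumption-laden helper, not existing tooling: `can_fit_runner` is a hypothetical name, it takes the available memory (MB) and disk (GB) as arguments, and it only checks the memory and disk minimums.

```shell
# Minimums from the table above: 512 MB memory and 3 GB disk per runner.
MIN_MEM_MB=512
MIN_DISK_GB=3

# can_fit_runner AVAILABLE_MEM_MB AVAILABLE_DISK_GB
# Prints a verdict and returns 0 if one more runner fits, 1 otherwise.
can_fit_runner() {
  if [ "$1" -lt "$MIN_MEM_MB" ]; then
    echo "insufficient memory: ${1}MB available, ${MIN_MEM_MB}MB required"
    return 1
  fi
  if [ "$2" -lt "$MIN_DISK_GB" ]; then
    echo "insufficient disk: ${2}GB available, ${MIN_DISK_GB}GB required"
    return 1
  fi
  echo "ok: ${1}MB memory and ${2}GB disk available"
}

# Example on the runner VPS (reads live values; assumes procps free and GNU df):
#   mem=$(free -m | awk '/^Mem:/ {print $7}')    # "available" column
#   disk=$(df -BG --output=avail /var/lib/docker | tail -n 1 | tr -dc '0-9')
#   can_fit_runner "$mem" "$disk"
```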

Step 3: Identify the root cause

Cause A: Image pull failure

The container image (curate-me/openclaw-base, curate-me/desktop-base, etc.) must be available either from the private registry or GHCR.

Diagnosis:


# Check if the image exists locally
docker images | grep curate-me
 
# Try pulling manually
docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1
# Look for: authentication errors, network errors, "not found"

Common pull failures:

Error	Cause	Fix
`unauthorized`	Registry auth expired	`docker login ghcr.io` with PAT
`not found`	Image tag doesn’t exist	Check available tags, use `:latest`
`connection refused`	Network issue to registry	Check DNS, firewall rules
`no space left on device`	Disk full	`docker system prune -f`
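
The table can also be applied mechanically to the output of a failed pull. The following is a sketch only: `classify_pull_error` is a hypothetical helper, and it matches the same substrings listed in the table rather than any complete list of Docker error messages.

```shell
# classify_pull_error: read `docker pull` output on stdin and print the
# matching fix from the table above. Matching is by substring only.
classify_pull_error() {
  output=$(cat)
  case "$output" in
    *"no space left on device"*) echo "disk full: run docker system prune -f" ;;
    *unauthorized*)              echo "auth expired: run docker login ghcr.io with a PAT" ;;
    *"connection refused"*)      echo "network issue: check DNS and firewall rules" ;;
    *"not found"*)               echo "tag missing: check available tags" ;;
    *)                           echo "unrecognized: inspect the full output" ;;
  esac
}

# Example (performs a real pull, so it is left commented out):
#   docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1 | classify_pull_error
```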

Fix (auth):


# Re-authenticate with GitHub Container Registry
printf '%s' "$GITHUB_PAT" | docker login ghcr.io -u curate-me-ai --password-stdin

Fix (private registry on VPS 1):


# If using the private registry at 10.0.1.1:5000
docker pull 10.0.1.1:5000/curate-me/openclaw-base:latest

Cause B: Container creation failure

Docker can’t create the container — usually resource limits, port conflicts, or invalid configuration.

Diagnosis:


./scripts/errors by-source gateway | grep "runner_provision"
# Look for: "port already allocated", "OOM", "invalid env"

Fix (port conflict):


# Find what's using the ports
ss -tlnp | grep -E "5900|6080|18789"
 
# Kill the conflicting container
docker rm -f <conflicting_container>
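
A plain substring match on port numbers can also hit unrelated ports (15900, for example) or process IDs. A stricter variant, sketched below, matches only when the local address ends in one of the three runner ports. The helper name `runner_ports_in_use` is hypothetical, and it expects header-less `ss` output (`ss -tlnH`) on stdin.

```shell
# runner_ports_in_use: read `ss -tlnH` output on stdin and print each listening
# socket whose local address ends in exactly one of the runner ports.
# Field 4 of ss output is "Local Address:Port".
runner_ports_in_use() {
  awk '$4 ~ /:(5900|6080|18789)$/ { print $4 }'
}

# Example on the runner VPS:
#   ss -tlnH | runner_ports_in_use
# To find which container publishes a given port:
#   docker ps --filter "publish=5900" --format '{{.Names}}'
```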

Fix (resource limits): Adjust Docker memory limits or clean up stopped containers:


# Remove stopped containers
docker container prune -f
 
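# Remove unused images. With -a this also deletes cached runner images that no
# container currently uses, so the next provision has to pull them again.
docker image prune -a -f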

Cause C: Container starts but immediately exits

The container is created and starts, but the entrypoint script fails.

Diagnosis:


# Find the exited container
docker ps -a --filter "status=exited" | grep openclaw
 
# Check exit code and logs
docker inspect <container_id> --format '{{.State.ExitCode}}'
docker logs <container_id> 2>&1 | tail -50

Common exit codes:

Exit Code	Meaning	Fix
1	General error in entrypoint	Check logs for missing env vars or config
137	OOM killed (SIGKILL)	Confirm with `docker inspect <container_id> --format '{{.State.OOMKilled}}'`, then increase container memory limit
126	Permission denied on entrypoint	Check image is built correctly
127	Entrypoint not found	Check image tag is correct
139	Segfault	Check host kernel compatibility
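
Exit codes above 128 follow the usual shell convention of 128 plus the signal number, which is where 137 (128 + 9, SIGKILL) and 139 (128 + 11, SIGSEGV) come from. The sketch below decodes any exit code that way; `explain_exit_code` is a hypothetical helper, and the wording of its messages is illustrative.

```shell
# explain_exit_code CODE
# Codes above 128 conventionally mean "terminated by signal (CODE - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l N prints the name of signal number N (KILL for 9).
    name=$(kill -l "$sig" 2>/dev/null || echo "unknown")
    echo "exit $code: killed by signal $sig ($name)"
  else
    case "$code" in
      0)   echo "exit 0: clean exit" ;;
      126) echo "exit 126: entrypoint found but not executable" ;;
      127) echo "exit 127: entrypoint not found" ;;
      *)   echo "exit $code: application error, check container logs" ;;
    esac
  fi
}

# Example:
#   explain_exit_code "$(docker inspect <container_id> --format '{{.State.ExitCode}}')"
```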

Fix (missing env vars):

The runner containers require specific env vars. Check if they were injected:


docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -E "OPENAI|CM_|OPENCLAW"

Required env vars:

Var	Purpose
`OPENAI_API_KEY`	LLM routing through gateway
`OPENAI_BASE_URL`	Points to gateway (`https://api.curate-me.ai/v1/openai`)
`CM_GATEWAY_API_KEY`	Gateway auth
`CM_API_KEY`	Platform auth
`OPENCLAW_SKIP_CRON`	Resource optimization
`OPENCLAW_SKIP_BROWSER_CONTROL_SERVER`	Resource optimization
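
To see which of the required variables are absent without printing their values (several are secrets), the container's environment can be compared against the table. This is a sketch: `missing_env_vars` is a hypothetical helper, and it checks only for presence, not for correct values.

```shell
# Required variable names, taken from the table above.
REQUIRED_VARS="OPENAI_API_KEY OPENAI_BASE_URL CM_GATEWAY_API_KEY CM_API_KEY OPENCLAW_SKIP_CRON OPENCLAW_SKIP_BROWSER_CONTROL_SERVER"

# missing_env_vars: read NAME=value lines on stdin and print every required
# variable name that is absent. Values are never echoed.
missing_env_vars() {
  names=$(cut -d= -f1)
  for var in $REQUIRED_VARS; do
    printf '%s\n' "$names" | grep -qxF "$var" || echo "$var"
  done
}

# Example:
#   docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | missing_env_vars
```

Empty output means all six variables were injected.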

Fix (OOM): Increase memory limit or reduce OpenClaw resource usage:


# Ensure skip flags are set to reduce memory footprint
# See: OpenClaw Gateway Resource Optimization docs
# Without skip flags: ~5.7GB. With skip flags: ~500MB

Cause D: Container running but health check fails

The container is running but the runner control plane can’t verify it’s healthy — no heartbeat, VNC not responding, or gateway port not listening.

Diagnosis:


# Check if the container is running
docker ps | grep <container_id>
 
# Check if VNC is listening
docker exec <container_id> ss -tlnp | grep 5900
 
# Check if OpenClaw gateway is up
docker exec <container_id> curl -s http://localhost:18789/health
 
# Check if the agent can reach the gateway
docker exec <container_id> curl -s https://api.curate-me.ai/gateway/admin/runners/heartbeat \
  -H "X-CM-Agent-Token: $AGENT_TOKEN"

Fix (VNC not starting): The Xvfb + x11vnc stack may have failed:


docker exec <container_id> ps aux | grep -E "Xvfb|x11vnc|fluxbox"
# All three should be running
 
# Restart VNC stack inside container
docker exec <container_id> supervisorctl restart xvfb x11vnc fluxbox

Fix (OpenClaw gateway not starting): Node.js process may have crashed:


docker exec <container_id> ps aux | grep node
# Should show OpenClaw gateway process
 
# Check OpenClaw logs
docker exec <container_id> tail -n 30 /tmp/openclaw-gateway.log

Step 4: Clear stuck provisioning runners

If runners are stuck in provisioning and the underlying issue is fixed, clean them up:


# Mark stuck runners as failed so they can be retried
curl -X POST https://api.curate-me.ai/gateway/admin/runners/cleanup \
  -H "Authorization: Bearer $ADMIN_TOKEN"

To retry provisioning a specific runner:


curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/retry \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Step 5: Verify resolution


# Provision a test runner
curl -X POST https://api.curate-me.ai/gateway/admin/runners \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template": "default", "org_id": "'"$ORG_ID"'"}'
 
# Wait 30 seconds, then check status (set NEW_RUNNER_ID to the ID of the runner created above)
curl https://api.curate-me.ai/gateway/admin/runners/$NEW_RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# status should be "running"
 
# Verify VNC stream is accessible from dashboard
# Open: https://dashboard.curate-me.ai/runners/$NEW_RUNNER_ID
 
# Check runner appears in fleet
curl https://api.curate-me.ai/gateway/admin/runners \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.runners[] | select(.status=="running")'
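
Instead of a fixed wait, the status check can be polled until the runner reports `running` or a deadline passes. The sketch below is generic and makes several assumptions: `wait_for_running` and `fetch_status` are hypothetical names, the 120-second default comes from the two-minute threshold in Symptoms, and `failed` is assumed to be the status the control plane reports for a failed provision.

```shell
# wait_for_running FETCH_CMD [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Calls FETCH_CMD (any command that prints the current status) until it prints
# "running" (return 0), prints "failed" (return 1), or the timeout expires (return 1).
wait_for_running() {
  fetch=$1
  timeout=${2:-120}
  interval=${3:-5}
  waited=0
  status=""
  while [ "$waited" -lt "$timeout" ]; do
    status=$($fetch)
    case "$status" in
      running) echo "running after ${waited}s"; return 0 ;;
      failed)  echo "provisioning failed after ${waited}s"; return 1 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s, last status: $status"
  return 1
}

# Example (performs real API calls, so it is left commented out):
#   fetch_status() {
#     curl -s "https://api.curate-me.ai/gateway/admin/runners/$NEW_RUNNER_ID" \
#       -H "Authorization: Bearer $ADMIN_TOKEN" | jq -r '.status'
#   }
#   wait_for_running fetch_status 120 5
```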

Related Runbooks

Runner Stuck: when a runner is running but unresponsive
Gateway High Latency: runners route through gateway
Redis Incident: runner session budgets use Redis

Rollback

Revert any changes made in Steps 3 and 4. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

The symptoms listed above are no longer present
No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

If provisioning consistently fails across all runner types, check for host-level Docker issues: run dmesg | tail -50 and look for OOM kills or kernel errors
Collect: ./scripts/errors by-source gateway | grep runner, docker ps -a, free -m, df -h
If the Docker daemon itself is failing, check systemd status: systemctl status docker
Contact platform team with: runner ID, provision stage, container logs, and host resource snapshot
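
The host resource snapshot can be gathered in one pass before escalating. This is a sketch under stated assumptions: `collect_snapshot` and the output file names are hypothetical, it is meant to run on the runner VPS from the directory that contains `./scripts/errors`, and it stores the unfiltered gateway error output rather than the `grep runner` subset.

```shell
# collect_snapshot OUTDIR
# Runs each diagnostic command from the list above and stores its output in
# OUTDIR. A failing or missing command is recorded rather than aborting the run.
collect_snapshot() {
  outdir=$1
  mkdir -p "$outdir" || return 1
  run() {  # run NAME COMMAND...
    name=$1; shift
    "$@" > "$outdir/$name.txt" 2>&1 || echo "[command failed: $*]" >> "$outdir/$name.txt"
  }
  run gateway-errors ./scripts/errors by-source gateway
  run docker-ps      docker ps -a
  run memory         free -m
  run disk           df -h
  run dmesg          sh -c 'dmesg | tail -n 50'
  run docker-systemd systemctl status docker
  echo "snapshot written to $outdir"
}

# Example:
#   collect_snapshot "/tmp/runner-snapshot-$(date +%Y%m%d-%H%M%S)"
```

Attach the resulting directory, along with the runner ID, provision stage, and container logs, when contacting the platform team.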
