
Managed runners are in private beta. Contact us for access.

Runbook: Runner Provision Fails

Owner: Platform Team
Backup owner: On-call engineer
Last validated: Not yet validated
Validation method: Manual drill
Severity trigger: SEV2
Customer impact: New runner launches fail; existing runners unaffected
Required access: SSH (VPS), Docker, MongoDB, container registry
Related services: curateme-backend-gateway, Docker daemon, container registry


Managed runner containers are provisioned via the runner control plane, which pulls container images (OpenClaw-based), configures networking and env vars, and starts the container with VNC + gateway integration. This runbook covers failures at each stage of provisioning.


Symptoms

  • Runner status stuck in provisioning state for more than 2 minutes
  • Dashboard shows “Failed to start runner” error
  • Container never appears in docker ps
  • Container starts but immediately exits (visible in docker ps -a with Exited status)
  • VNC stream not available after container starts
  • Runner heartbeat never registers with gateway

Step 1: Determine the failure stage

Runner provisioning has 4 stages. Identify which one failed.

    # Check runner status in the control plane
    curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
      -H "Authorization: Bearer $ADMIN_TOKEN"

Expected response:

    {
      "runner_id": "rnr_abc123",
      "status": "provisioning",
      "provision_stage": "image_pull",
      "error": null,
      "created_at": "2026-05-04T14:30:00Z"
    }

What to look for:

| provision_stage | Meaning | Jump to |
| --- | --- | --- |
| image_pull | Container image download failed | Cause A |
| container_create | Docker couldn’t create the container | Cause B |
| container_start | Container created but failed to start | Cause C |
| health_check | Container running but not healthy | Cause D |
| null or missing | Control plane never started provisioning | Check gateway logs |
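The lookup in the table can be scripted so the on-call engineer gets the right section straight from the API response. This is a minimal sketch; the function name `stage_to_section` is illustrative and not part of any existing tooling.

```shell
# Map a provision_stage value to the runbook section that covers it.
# The mapping mirrors the table in Step 1; the function name is hypothetical.
stage_to_section() {
  case "${1:-}" in
    image_pull)       echo "Cause A" ;;
    container_create) echo "Cause B" ;;
    container_start)  echo "Cause C" ;;
    health_check)     echo "Cause D" ;;
    ""|null)          echo "Check gateway logs" ;;
    *)                echo "Unknown stage: $1"; return 1 ;;
  esac
}
```

One way to use it, assuming `jq` is installed, is to pass in the output of the status call piped through `jq -r '.provision_stage'`, which prints `null` when the field is null or missing.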

Step 2: Check infrastructure prerequisites

    # On the runner VPS
    ssh $DEPLOY_USER@$RUNNERS_VPS_IP

    # Docker daemon health (recent Docker versions print a Client section first,
    # so filter for the server fields instead of taking the first lines)
    docker info 2>&1 | grep -E "Server Version|Containers|Images"
    # Should show: Server Version, Containers, Images

    # Resource availability
    free -m | grep Mem
    # Check available memory: each runner needs ~500-800MB

    df -h /var/lib/docker
    # Check disk space: images are 1-3GB each

    docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
    # Check current container resource usage

Quick reference — resource requirements per runner:

| Resource | Minimum | Recommended |
| --- | --- | --- |
| Memory | 512MB | 768MB |
| Disk | 3GB (image + workspace) | 5GB |
| CPU | 0.5 cores | 1 core |
| Ports | 5900 (VNC), 6080 (websockify), 18789 (gateway) | Per-container dynamic assignment |
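The memory figures translate into a rough capacity estimate. The sketch below divides available memory by the per-runner figure, defaulting to the recommended 768MB. The function name is hypothetical, and the result ignores disk and CPU, so treat it as an upper bound.

```shell
# Estimate how many additional runners fit in the available memory.
# Usage: runner_capacity <available_mb> [per_runner_mb]
runner_capacity() {
  local available_mb per_runner_mb
  available_mb="$1"
  per_runner_mb="${2:-768}"   # recommended per-runner memory from the table
  # Shell integer division rounds down, which is the safe direction here.
  echo $(( available_mb / per_runner_mb ))
}

# Example, assuming a procps-ng `free` whose 7th column on the Mem: row is "available":
#   runner_capacity "$(free -m | awk '/^Mem:/ {print $7}')"
```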

Step 3: Identify the root cause

Cause A: Image pull failure

The container image (curate-me/openclaw-base, curate-me/desktop-base, etc.) must be available either from the private registry or GHCR.

Diagnosis:

    # Check if the image exists locally
    docker images | grep curate-me

    # Try pulling manually
    docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1
    # Look for: authentication errors, network errors, "not found"

Common pull failures:

| Error | Cause | Fix |
| --- | --- | --- |
| unauthorized | Registry auth expired | docker login ghcr.io with PAT |
| not found | Image tag doesn’t exist | Check available tags, use :latest |
| connection refused | Network issue to registry | Check DNS, firewall rules |
| no space left on device | Disk full | docker system prune -f |
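The error strings in the table can be matched mechanically against the output of a failed pull. This is a sketch under the assumption that the strings appear verbatim and in lowercase in the output. The category names it prints are made up for illustration.

```shell
# Classify captured `docker pull` output using the error strings from the table.
# Usage: classify_pull_error "<captured output>"
classify_pull_error() {
  case "$1" in
    *"no space left on device"*) echo "disk_full" ;;
    *unauthorized*)              echo "auth_expired" ;;
    *"not found"*)               echo "tag_missing" ;;
    *"connection refused"*)      echo "network" ;;
    *)                           echo "unknown" ;;
  esac
}

# Example:
#   classify_pull_error "$(docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1)"
```

Anything reported as `unknown` still needs a manual read of the full output.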

Fix (auth):

    # Re-authenticate with GitHub Container Registry
    echo "$GITHUB_PAT" | docker login ghcr.io -u curate-me-ai --password-stdin

Fix (private registry on VPS 1):

    # If using the private registry at 10.0.1.1:5000
    docker pull 10.0.1.1:5000/curate-me/openclaw-base:latest

Cause B: Container creation failure

Docker can’t create the container — usually resource limits, port conflicts, or invalid configuration.

Diagnosis:

    ./scripts/errors by-source gateway | grep "runner_provision"
    # Look for: "port already allocated", "OOM", "invalid env"

Fix (port conflict):

    # Find what's using the ports
    ss -tlnp | grep -E "5900|6080|18789"

    # Kill the conflicting container
    docker rm -f <conflicting_container>
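The `grep -E` pattern above also matches ports that merely contain those digits, such as 59001. A stricter check compares the port field exactly. The sketch below reads `ss -tln` or `ss -tlnp` output on stdin and assumes the usual column layout, with Local Address:Port in the fourth column. The function name is hypothetical.

```shell
# Print which of the runner ports (5900, 6080, 18789) are already bound.
# Reads `ss -tln` style output on stdin; assumes column 4 is Local Address:Port.
busy_runner_ports() {
  awk '
    {
      # The port is the last colon-separated piece, which also handles [::]:6080.
      n = split($4, parts, ":")
      port = parts[n]
      if (port == "5900" || port == "6080" || port == "18789") print port
    }
  ' | sort -un
}

# Example:
#   ss -tln | busy_runner_ports
```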

Fix (resource limits): Adjust Docker memory limits or clean up stopped containers:

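    # Remove stopped containers
    docker container prune -f

    # Remove unused images (-a deletes every image not used by a container,
    # including cached runner images, so the next launch has to pull again)
    docker image prune -a -f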

Cause C: Container starts but immediately exits

The container is created and starts, but the entrypoint script fails.

Diagnosis:

    # Find the exited container
    docker ps -a --filter "status=exited" | grep openclaw

    # Check exit code and logs
    docker inspect <container_id> --format '{{.State.ExitCode}}'
    docker logs <container_id> 2>&1 | tail -50

Common exit codes:

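| Exit Code | Meaning | Fix |
| --- | --- | --- |
| 1 | General error in entrypoint | Check logs for missing env vars or config |
| 137 | Killed by SIGKILL (128 + 9), usually the OOM killer | Increase container memory limit; confirm with docker inspect <container_id> --format '{{.State.OOMKilled}}' |
| 126 | Permission denied on entrypoint | Check image is built correctly |
| 127 | Entrypoint not found | Check image tag is correct |
| 139 | Segfault (128 + 11, SIGSEGV) | Check host kernel compatibility |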
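Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (9) and 139 to SIGSEGV (11). A small helper can decode any code the table does not list. This is an illustrative sketch that expects a numeric argument. The function name and messages are not from existing tooling.

```shell
# Translate a container exit code into a short explanation.
# Usage: explain_exit_code <numeric_code>
explain_exit_code() {
  local code
  code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    1)   echo "general error in entrypoint" ;;
    126) echo "entrypoint not executable" ;;
    127) echo "entrypoint not found" ;;
    *)
      # Codes 129-192 encode termination by signal (code - 128).
      if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application-specific exit code"
      fi
      ;;
  esac
}

# Example:
#   explain_exit_code "$(docker inspect <container_id> --format '{{.State.ExitCode}}')"
```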

Fix (missing env vars):

The runner containers require specific env vars. Check if they were injected:

docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -E "OPENAI|CM_|OPENCLAW"

Required env vars:

| Var | Purpose |
| --- | --- |
| OPENAI_API_KEY | LLM routing through gateway |
| OPENAI_BASE_URL | Points to gateway (https://api.curate-me.ai/v1/openai) |
| CM_GATEWAY_API_KEY | Gateway auth |
| CM_API_KEY | Platform auth |
| OPENCLAW_SKIP_CRON | Resource optimization |
| OPENCLAW_SKIP_BROWSER_CONTROL_SERVER | Resource optimization |
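Checking the list by eye is error-prone, so the comparison can be scripted. The sketch below reads KEY=VALUE lines on stdin, which is the shape the `docker inspect` command above produces, and prints every required variable that is missing or empty. The function name is hypothetical and the required list is copied from the table.

```shell
# Print required runner env vars that are absent or empty.
# Reads KEY=VALUE lines on stdin.
missing_runner_env() {
  local env_dump required name
  env_dump="$(cat)"
  required="OPENAI_API_KEY OPENAI_BASE_URL CM_GATEWAY_API_KEY CM_API_KEY OPENCLAW_SKIP_CRON OPENCLAW_SKIP_BROWSER_CONTROL_SERVER"
  for name in $required; do
    # Require NAME= at the start of a line followed by at least one character.
    if ! printf '%s\n' "$env_dump" | grep -q "^${name}=."; then
      echo "$name"
    fi
  done
}

# Example:
#   docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | missing_runner_env
```

Empty output means all six variables are set. The check does not validate their values.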

Fix (OOM): Increase memory limit or reduce OpenClaw resource usage:

    # Ensure skip flags are set to reduce memory footprint
    # See: OpenClaw Gateway Resource Optimization docs
    # Without skip flags: ~5.7GB. With skip flags: ~500MB

Cause D: Container running but health check fails

The container is running but the runner control plane can’t verify it’s healthy — no heartbeat, VNC not responding, or gateway port not listening.

Diagnosis:

    # Check if the container is running
    docker ps | grep <container_id>

    # Check if VNC is listening
    docker exec <container_id> ss -tlnp | grep 5900

    # Check if OpenClaw gateway is up
    docker exec <container_id> curl -s http://localhost:18789/health

    # Check if the agent can reach the gateway
    docker exec <container_id> curl -s https://api.curate-me.ai/gateway/admin/runners/heartbeat \
      -H "X-CM-Agent-Token: $AGENT_TOKEN"

Fix (VNC not starting): The Xvfb + x11vnc stack may have failed:

    docker exec <container_id> ps aux | grep -E "Xvfb|x11vnc|fluxbox"
    # All three should be running

    # Restart VNC stack inside container
    docker exec <container_id> supervisorctl restart xvfb x11vnc fluxbox

Fix (OpenClaw gateway not starting): Node.js process may have crashed:

    docker exec <container_id> ps aux | grep node
    # Should show OpenClaw gateway process

    # Check OpenClaw logs
    docker exec <container_id> cat /tmp/openclaw-gateway.log | tail -30

Step 4: Clear stuck provisioning runners

If runners are stuck in provisioning and the underlying issue is fixed, clean them up:

    # Mark stuck runners as failed so they can be retried
    curl -X POST https://api.curate-me.ai/gateway/admin/runners/cleanup \
      -H "Authorization: Bearer $ADMIN_TOKEN"

To retry provisioning a specific runner:

    curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/retry \
      -H "Authorization: Bearer $ADMIN_TOKEN"

Step 5: Verify resolution

    # Provision a test runner (the response is expected to include the new
    # runner's ID; export it as NEW_RUNNER_ID for the commands below)
    curl -X POST https://api.curate-me.ai/gateway/admin/runners \
      -H "Authorization: Bearer $ADMIN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"template": "default", "org_id": "'"$ORG_ID"'"}'

    # Wait 30 seconds, then check status
    curl https://api.curate-me.ai/gateway/admin/runners/$NEW_RUNNER_ID \
      -H "Authorization: Bearer $ADMIN_TOKEN"
    # status should be "running"

    # Verify VNC stream is accessible from dashboard
    # Open: https://dashboard.curate-me.ai/runners/$NEW_RUNNER_ID

    # Check runner appears in fleet
    curl https://api.curate-me.ai/gateway/admin/runners \
      -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.runners[] | select(.status=="running")'
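Instead of waiting a fixed 30 seconds, the status check can be polled until the runner reaches a terminal state. The sketch below makes several assumptions: the status endpoint returns a top-level `status` field as in the Step 1 example, `failed` is the value used for failed runners, and `jq` is installed. The function names `fetch_runner_status` and `wait_for_running` are hypothetical.

```shell
# Default status lookup. Assumes a top-level "status" field in the response.
# Redefine this function to exercise the loop without network access.
fetch_runner_status() {
  curl -s "https://api.curate-me.ai/gateway/admin/runners/$1" \
    -H "Authorization: Bearer $ADMIN_TOKEN" | jq -r '.status'
}

# Poll until the runner is running, has failed, or the attempts run out.
# Usage: wait_for_running <runner_id> [max_attempts] [delay_seconds]
wait_for_running() {
  local runner_id max_attempts delay attempt status
  runner_id="$1"
  max_attempts="${2:-12}"
  delay="${3:-5}"
  attempt=1
  status="unknown"
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(fetch_runner_status "$runner_id")"
    case "$status" in
      running) echo "running after $attempt attempt(s)"; return 0 ;;
      failed)  echo "provisioning failed"; return 1 ;;   # assumed status value
    esac
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
  echo "timed out in status: $status"
  return 1
}
```

With the defaults, `wait_for_running "$NEW_RUNNER_ID"` polls every 5 seconds for up to a minute and returns non-zero on failure or timeout, so it can gate the remaining verification steps.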


Rollback

Revert any changes made while working through Steps 3 and 4. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

  • The symptoms listed above are no longer present
  • No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
  • Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

  1. If provisioning consistently fails across all runner types, check for host-level Docker issues: dmesg | tail -50 for OOM kills or kernel errors
  2. Collect: ./scripts/errors by-source gateway | grep runner, docker ps -a, free -m, df -h
  3. If the Docker daemon itself is failing, check systemd status: systemctl status docker
  4. Contact platform team with: runner ID, provision stage, container logs, and host resource snapshot
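The items listed in step 2 can be gathered into one directory before contacting the platform team. This is a minimal sketch, assuming it is run from the directory that contains `./scripts/errors`. The function name and output file names are illustrative.

```shell
# Collect the escalation artifacts into a directory and print its path.
# Usage: collect_runner_diagnostics <output_dir>
collect_runner_diagnostics() {
  local out_dir
  out_dir="$1"
  mkdir -p "$out_dir" || return 1
  # Each command may fail without stopping the rest; stderr is captured
  # in the same file so the failure reason is preserved.
  { ./scripts/errors by-source gateway | grep runner; } > "$out_dir/gateway_errors.txt" 2>&1 || true
  docker ps -a > "$out_dir/docker_ps.txt" 2>&1 || true
  free -m      > "$out_dir/memory.txt"    2>&1 || true
  df -h        > "$out_dir/disk.txt"      2>&1 || true
  echo "$out_dir"
}

# Example:
#   collect_runner_diagnostics "/tmp/runner-diag-$(date +%Y%m%d-%H%M%S)"
```

The runner ID, provision stage, and container logs from step 4 still need to be added by hand, since they depend on which runner failed.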