Managed runners are in private beta. Contact us for access.
Runbook: Runner Provision Fails
- Owner: Platform Team
- Backup owner: On-call engineer
- Last validated: Not yet validated
- Validation method: Manual drill
- Severity trigger: SEV2
- Customer impact: New runner launches fail; existing runners unaffected
- Required access: SSH (VPS), Docker, MongoDB, container registry
- Related services: curateme-backend-gateway, Docker daemon, container registry
Managed runner containers are provisioned via the runner control plane, which pulls container images (OpenClaw-based), configures networking and env vars, and starts the container with VNC + gateway integration. This runbook covers failures at each stage of provisioning.
Symptoms
- Runner status stuck in the `provisioning` state for more than 2 minutes
- Dashboard shows “Failed to start runner” error
- Container never appears in `docker ps`
- Container starts but immediately exits (visible in `docker ps -a` with `Exited` status)
- VNC stream not available after container starts
- Runner heartbeat never registers with gateway
Step 1: Determine the failure stage
Runner provisioning has 4 stages. Identify which one failed.
# Check runner status in the control plane
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"
Expected response:
{
  "runner_id": "rnr_abc123",
  "status": "provisioning",
  "provision_stage": "image_pull",
  "error": null,
  "created_at": "2026-05-04T14:30:00Z"
}
What to look for:
| `provision_stage` | Meaning | Jump to |
|---|---|---|
| `image_pull` | Container image download failed | Cause A |
| `container_create` | Docker couldn’t create the container | Cause B |
| `container_start` | Container created but failed to start | Cause C |
| `health_check` | Container running but not healthy | Cause D |
| `null` or missing | Control plane never started provisioning | Check gateway logs |
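The lookup in the table above can be scripted so that the status response points directly at the right section. This is a minimal sketch that assumes the `provision_stage` field shown in the expected response; the function name `stage_to_cause` is hypothetical and not part of any project tooling.

```shell
# Map a provision_stage value to the matching section of this runbook.
# stage_to_cause is a hypothetical helper, not part of the project scripts.
stage_to_cause() {
  case "$1" in
    image_pull)       echo "Cause A" ;;
    container_create) echo "Cause B" ;;
    container_start)  echo "Cause C" ;;
    health_check)     echo "Cause D" ;;
    ""|null)          echo "Check gateway logs" ;;
    *)                echo "Unknown stage: $1" ;;
  esac
}
```

One possible use is to pass it the output of the status call piped through `jq -r '.provision_stage'`. `jq -r` prints `null` for a null or missing field, which the function routes to the gateway logs case.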
Step 2: Check infrastructure prerequisites
# On the runner VPS
ssh $DEPLOY_USER@$RUNNERS_VPS_IP
# Docker daemon health
docker info 2>&1 | head -5
# Should show: Server Version, Containers, Images
# Resource availability
free -m | grep Mem
# Check available memory — each runner needs ~500-800MB
df -h /var/lib/docker
# Check disk space — images are 1-3GB each
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Check current container resource usage
Quick reference (resource requirements per runner):
| Resource | Minimum | Recommended |
|---|---|---|
| Memory | 512MB | 768MB |
| Disk | 3GB (image + workspace) | 5GB |
| CPU | 0.5 cores | 1 core |
| Ports | 5900 (VNC), 6080 (websockify), 18789 (gateway) | Per-container dynamic assignment |
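As a rough capacity check, the Recommended column above can be turned into a headroom estimate. The sketch below assumes whole-number inputs read by hand from the `free -m` and `df -h` output, ignores CPU, and uses a hypothetical helper name `runners_that_fit`; treat the result as an approximation, not a scheduling guarantee.

```shell
# Estimate how many more runners a host can take, using the Recommended
# column above (768 MB memory, 5 GB disk per runner).
# runners_that_fit is a hypothetical helper; inputs are whole numbers.
runners_that_fit() {
  avail_mem_mb=$1
  avail_disk_gb=$2
  by_mem=$((avail_mem_mb / 768))
  by_disk=$((avail_disk_gb / 5))
  # The scarcer resource sets the limit.
  if [ "$by_mem" -lt "$by_disk" ]; then
    echo "$by_mem"
  else
    echo "$by_disk"
  fi
}
```

For example, `runners_that_fit 4096 40` prints `5`: memory allows 5 runners (4096 / 768, rounded down) while disk allows 8, so memory is the limiting resource.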
Step 3: Identify the root cause
Cause A: Image pull failure
The container image (curate-me/openclaw-base, curate-me/desktop-base, etc.) must be available either from the private registry or GHCR.
Diagnosis:
# Check if the image exists locally
docker images | grep curate-me
# Try pulling manually
docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1
# Look for: authentication errors, network errors, "not found"
Common pull failures:
| Error | Cause | Fix |
|---|---|---|
| `unauthorized` | Registry auth expired | `docker login ghcr.io` with PAT |
| `not found` | Image tag doesn’t exist | Check available tags, use `:latest` |
| `connection refused` | Network issue to registry | Check DNS, firewall rules |
| `no space left on device` | Disk full | `docker system prune -f` |
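The table above can be applied mechanically to the output of a failed pull. This is a sketch only: `classify_pull_error` is a hypothetical helper, and it matches just the substrings listed in the Error column, so unfamiliar messages still need to be read by hand.

```shell
# Suggest a fix from `docker pull` error text, following the table above.
# classify_pull_error is a hypothetical helper; the substrings are the ones
# listed in the Error column.
classify_pull_error() {
  case "$1" in
    *unauthorized*)              echo "auth: re-run docker login ghcr.io with a PAT" ;;
    *"no space left on device"*) echo "disk: run docker system prune -f" ;;
    *"connection refused"*)      echo "network: check DNS and firewall rules" ;;
    *"not found"*)               echo "tag: check available tags" ;;
    *)                           echo "unrecognized: inspect the full output" ;;
  esac
}
```

It could be called as `classify_pull_error "$(docker pull ghcr.io/curate-me-ai/openclaw-base:latest 2>&1)"`.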
Fix (auth):
# Re-authenticate with GitHub Container Registry
echo "$GITHUB_PAT" | docker login ghcr.io -u curate-me-ai --password-stdin
Fix (private registry on VPS 1):
# If using the private registry at 10.0.1.1:5000
docker pull 10.0.1.1:5000/curate-me/openclaw-base:latest
Cause B: Container creation failure
Docker can’t create the container — usually resource limits, port conflicts, or invalid configuration.
Diagnosis:
./scripts/errors by-source gateway | grep "runner_provision"
# Look for: "port already allocated", "OOM", "invalid env"
Fix (port conflict):
# Find what's using the ports
# Match the listening port exactly (a bare "5900" would also match 15900)
ss -tlnp | grep -E ':(5900|6080|18789)[[:space:]]'
# Remove the conflicting container
docker rm -f <conflicting_container>
Fix (resource limits): Adjust Docker memory limits or clean up stopped containers:
# Remove stopped containers
docker container prune -f
# Remove unused images
docker image prune -a -f
Cause C: Container starts but immediately exits
The container is created and starts, but the entrypoint script fails.
Diagnosis:
# Find the exited container
docker ps -a --filter "status=exited" | grep openclaw
# Check exit code and logs
docker inspect <container_id> --format '{{.State.ExitCode}}'
docker logs <container_id> 2>&1 | tail -50
Common exit codes:
| Exit Code | Meaning | Fix |
|---|---|---|
| 1 | General error in entrypoint | Check logs for missing env vars or config |
| 137 | OOM killed | Increase container memory limit |
| 126 | Permission denied on entrypoint | Check image is built correctly |
| 127 | Entrypoint not found | Check image tag is correct |
| 139 | Segfault | Check host kernel compatibility |
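The two largest codes in the table follow a general convention: an exit code above 128 means the process was terminated by a signal, and the signal number is the code minus 128. So 137 is signal 9 (SIGKILL) and 139 is signal 11 (SIGSEGV). The sketch below decodes that; `decode_exit_code` is a hypothetical helper name.

```shell
# Decode a container exit code. Codes above 128 conventionally mean the
# process was terminated by signal (code - 128).
# decode_exit_code is a hypothetical helper.
decode_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "signal $((code - 128))"
  else
    echo "exit status $code"
  fi
}
```

SIGKILL is not sent only by the OOM killer, so a 137 is worth confirming with `docker inspect <container_id> --format '{{.State.OOMKilled}}'` before raising memory limits.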
Fix (missing env vars):
The runner containers require specific env vars. Check if they were injected:
docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -E "OPENAI|CM_|OPENCLAW"
Required env vars:
| Var | Purpose |
|---|---|
| `OPENAI_API_KEY` | LLM routing through gateway |
| `OPENAI_BASE_URL` | Points to gateway (`https://api.curate-me.ai/v1/openai`) |
| `CM_GATEWAY_API_KEY` | Gateway auth |
| `CM_API_KEY` | Platform auth |
| `OPENCLAW_SKIP_CRON` | Resource optimization |
| `OPENCLAW_SKIP_BROWSER_CONTROL_SERVER` | Resource optimization |
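Scanning the `grep` output by eye makes it easy to miss an absent variable. A small filter can instead print only the names from the table above that are missing. This is a sketch under the assumption that the listing arrives one `NAME=value` per line; `missing_env_vars` is a hypothetical helper.

```shell
# Print each required variable that is absent from an env listing read on
# stdin (one NAME=value per line, as printed by docker inspect with the
# .Config.Env template). missing_env_vars is a hypothetical helper.
missing_env_vars() {
  env_dump=$(cat)
  for name in OPENAI_API_KEY OPENAI_BASE_URL CM_GATEWAY_API_KEY CM_API_KEY \
              OPENCLAW_SKIP_CRON OPENCLAW_SKIP_BROWSER_CONTROL_SERVER; do
    # Match the exact name at the start of a line, followed by "=".
    printf '%s\n' "$env_dump" | grep -q "^${name}=" || echo "$name"
  done
}
```

It can be fed from `docker inspect <container_id> --format '{{range .Config.Env}}{{println .}}{{end}}' | missing_env_vars`. Empty output means every required variable is present, though not necessarily set to a valid value.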
Fix (OOM): Increase memory limit or reduce OpenClaw resource usage:
# Ensure skip flags are set to reduce memory footprint
# See: OpenClaw Gateway Resource Optimization docs
# Without skip flags: ~5.7GB. With skip flags: ~500MB
Cause D: Container running but health check fails
The container is running but the runner control plane can’t verify it’s healthy — no heartbeat, VNC not responding, or gateway port not listening.
Diagnosis:
# Check if the container is running
docker ps | grep <container_id>
# Check if VNC is listening
docker exec <container_id> ss -tlnp | grep 5900
# Check if OpenClaw gateway is up
docker exec <container_id> curl -s http://localhost:18789/health
# Check if the agent can reach the gateway
docker exec <container_id> curl -s https://api.curate-me.ai/gateway/admin/runners/heartbeat \
  -H "X-CM-Agent-Token: $AGENT_TOKEN"
Fix (VNC not starting): The Xvfb + x11vnc stack may have failed:
docker exec <container_id> ps aux | grep -E "Xvfb|x11vnc|fluxbox"
# All three should be running
# Restart VNC stack inside container
docker exec <container_id> supervisorctl restart xvfb x11vnc fluxbox
Fix (OpenClaw gateway not starting): Node.js process may have crashed:
docker exec <container_id> ps aux | grep node
# Should show OpenClaw gateway process
# Check OpenClaw logs
docker exec <container_id> tail -30 /tmp/openclaw-gateway.log
Step 4: Clear stuck provisioning runners
If runners are stuck in provisioning and the underlying issue is fixed, clean them up:
# Mark stuck runners as failed so they can be retried
curl -X POST https://api.curate-me.ai/gateway/admin/runners/cleanup \
  -H "Authorization: Bearer $ADMIN_TOKEN"
To retry provisioning a specific runner:
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/retry \
  -H "Authorization: Bearer $ADMIN_TOKEN"
Step 5: Verify resolution
# Provision a test runner
curl -X POST https://api.curate-me.ai/gateway/admin/runners \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
  -d '{"template": "default", "org_id": "'"$ORG_ID"'"}'
# Wait 30 seconds, then check status
curl https://api.curate-me.ai/gateway/admin/runners/$NEW_RUNNER_ID \
-H "Authorization: Bearer $ADMIN_TOKEN"
# status should be "running"
# Verify VNC stream is accessible from dashboard
# Open: https://dashboard.curate-me.ai/runners/$NEW_RUNNER_ID
# Check runner appears in fleet
curl https://api.curate-me.ai/gateway/admin/runners \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.runners[] | select(.status=="running")'
Related Runbooks
- Runner Stuck — when a runner is running but unresponsive
- Gateway High Latency — runners route through gateway
- Redis Incident — runner session budgets use Redis
Rollback
Revert any changes made in Steps 3 and 4. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.
Verification
After applying the fix, verify:
- The symptoms listed above are no longer present
- No new errors in gateway logs: `docker logs curateme-backend-gateway --tail=50`
- Health check passes: `curl -s http://localhost:8002/health | jq .status`
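A service can take a few seconds to come back after a fix, so a single health check may fail even when the fix worked. One option is to retry the check a bounded number of times. The sketch below is a generic retry loop; `poll_until` is a hypothetical helper and the attempt count and delay are illustrative values, not tuned recommendations.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: poll_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# poll_until is a hypothetical helper, not part of the project scripts.
poll_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `poll_until 10 3 curl -sf http://localhost:8002/health` tries the health endpoint up to 10 times, 3 seconds apart. The `-f` flag makes `curl` return a non-zero status on HTTP error responses so the loop keeps retrying.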
Escalation
- If provisioning consistently fails across all runner types, check for host-level Docker issues: `dmesg | tail -50` for OOM kills or kernel errors
- Collect: `./scripts/errors by-source gateway | grep runner`, `docker ps -a`, `free -m`, `df -h`
- If the Docker daemon itself is failing, check systemd status: `systemctl status docker`
- Contact platform team with: runner ID, provision stage, container logs, and host resource snapshot