Runbook: Runner Container Stuck / Not Responding
This runbook covers diagnosing and recovering a managed OpenClaw runner container that has stopped responding while still showing a “running” status.
Symptoms
- Runner status shows `running` in the dashboard but the agent is not processing tasks
- No heartbeat received for more than 2 minutes (heartbeats fire every 60 seconds by default)
- A2A tasks sent to the runner remain in `pending` state indefinitely
- Desktop streaming (VNC) shows a frozen or blank screen
- Jobs dispatched via the BYOVM protocol are not being acknowledged
Step 1: Check runner status and heartbeat
Pull the runner’s current state from the control plane:
```bash
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"
```

Expected response:
```json
{
  "runner_id": "runner_abc123",
  "status": "running",
  "last_heartbeat": "2026-03-17T14:22:00Z",
  "uptime_seconds": 3847,
  "container_image": "ghcr.io/curate-me-ai/openclaw-base:latest",
  "resource_usage": {
    "memory_mb": 580,
    "cpu_percent": 12.4
  }
}
```

What to look for:
| Field | Healthy | Investigate if |
|---|---|---|
| `last_heartbeat` | Within last 120 seconds | More than 120 seconds ago |
| `memory_mb` | < 1500 MB | > 2000 MB (memory exhaustion) |
| `cpu_percent` | < 80% | Sustained 100% (process stuck in a loop) |
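The staleness check above can be scripted against the status response; here is a minimal sketch (the 120-second threshold comes from the table; the helper names are illustrative, and `date -d` assumes GNU date):

```bash
#!/usr/bin/env bash
# Sketch: decide whether a runner heartbeat is stale.
# heartbeat_age NOW_EPOCH ISO_TIMESTAMP -> prints age in seconds
heartbeat_age() {
  local hb_epoch
  hb_epoch=$(date -u -d "$2" +%s)  # GNU date; BSD/macOS needs `date -j -f` instead
  echo $(( $1 - hb_epoch ))
}

# is_stale AGE_SECONDS [THRESHOLD] -> succeeds if older than threshold (default 120s)
is_stale() {
  [ "$1" -gt "${2:-120}" ]
}

# Example using the sample response above, with "now" three minutes later:
now=$(date -u -d "2026-03-17T14:25:00Z" +%s)
age=$(heartbeat_age "$now" "2026-03-17T14:22:00Z")
is_stale "$age" && echo "heartbeat stale (${age}s old): investigate"
# prints: heartbeat stale (180s old): investigate
```

In practice you would feed `last_heartbeat` from the status endpoint into `heartbeat_age` together with the current time.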
Step 2: Attempt to connect via desktop streaming
If the runner has desktop streaming enabled, connect to the VNC session to visually inspect the container state:
```bash
# Get the desktop streaming URL
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/desktop/connect \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

This returns a Guacamole WebSocket URL. Open it in a browser to see the container desktop. Look for:
- Frozen screen — OpenClaw gateway process may have crashed
- Error dialog — Check the error message content
- Terminal output — Look for stack traces or out-of-memory messages
- Blank screen — Xvfb or x11vnc may have crashed
Step 3: Check container logs
Pull the most recent logs from the runner container:
```bash
curl "https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/logs?lines=100" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Common patterns in logs:
| Log Pattern | Meaning | Action |
|---|---|---|
| `FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory` | Node.js heap exhausted | Reduce context size, increase `--max-old-space-size` |
| `Error: connect ECONNREFUSED 127.0.0.1:18789` | OpenClaw gateway process crashed | Restart the runner |
| `WebSocket connection to 'ws://127.0.0.1:18789' failed` | Gateway process not accepting connections | Restart the runner |
| `SIGKILL` or `OOMKilled` | Container killed by the kernel OOM killer | Increase memory limit or reduce workload |
| No recent log lines at all | Process completely frozen | Force stop and restart |
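The triage table above can be turned into a quick classifier over a saved log file; a sketch (the pattern strings come from the table, the function name and messages are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: map known log patterns to a recommended action.
# classify_log FILE -> prints one recommended action
classify_log() {
  if grep -q "JavaScript heap out of memory" "$1"; then
    echo "heap exhausted: reduce context size / raise --max-old-space-size"
  elif grep -qE "ECONNREFUSED 127\.0\.0\.1:18789|ws://127\.0\.0\.1:18789' failed" "$1"; then
    echo "gateway down: restart the runner"
  elif grep -qE "SIGKILL|OOMKilled" "$1"; then
    echo "kernel OOM kill: increase memory limit or reduce workload"
  elif [ ! -s "$1" ]; then
    echo "no log output: force stop and restart"
  else
    echo "no known pattern: inspect manually"
  fi
}
```

Save the output of the logs endpoint to a file and run, e.g., `classify_log runner.log`.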
Step 4: Identify the root cause
Cause A: OpenClaw gateway crash
The OpenClaw gateway process inside the container (listening on `ws://127.0.0.1:18789`) has crashed and did not auto-restart.
Diagnosis: Container logs show `ECONNREFUSED` on port 18789, or the gateway process is absent from the process list.
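One way to script this diagnosis is a plain TCP probe of the gateway port; a sketch (assumes bash's `/dev/tcp` redirection is available, and that you would run it inside the container, e.g. via `docker exec`):

```bash
#!/usr/bin/env bash
# Sketch: succeed if something accepts TCP connections on HOST:PORT.
port_open() {
  # bash built-in TCP redirection; fails fast if nothing is listening
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Inside the container, the gateway should answer on 18789:
if port_open 127.0.0.1 18789; then
  echo "gateway listening"
else
  echo "gateway not accepting connections: proceed to force stop/restart"
fi
```

This avoids depending on `curl` or `nc` being installed in the container image.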
Fix: Force stop and restart the runner:
```bash
# Force stop
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Wait for status to become "stopped"
sleep 5

# Restart
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Cause B: Memory exhaustion
OpenClaw agents accumulate context over time. Long-running sessions with large conversation histories can exhaust the Node.js heap or the container memory limit.
Diagnosis: `memory_mb` is above 2000 MB, or container logs show `JavaScript heap out of memory`.
Fix (immediate): Restart the runner with a fresh session:
```bash
# Stop the runner
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Restart with an explicit Node.js heap limit
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"env_overrides": {"NODE_OPTIONS": "--max-old-space-size=1024"}}'
```

Fix (long-term): Configure the runner with resource limits and an idle timeout:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "idle_timeout_seconds": 1800,
    "session_max_turns": 100
  }'
```

Cause C: Network partition
The runner container has lost network connectivity to the gateway or to upstream LLM providers.
Diagnosis: Heartbeats stopped abruptly (not a gradual degradation), and the container is still consuming CPU/memory normally. Desktop streaming may also be unreachable.
Fix: Check the network configuration:
```bash
# For Docker-based runners, check the container network
docker inspect $CONTAINER_ID --format='{{.NetworkSettings.Networks}}'

# Test connectivity from inside the container
docker exec $CONTAINER_ID curl -s https://api.curate-me.ai/health
```

If the container has lost its network attachment, restart it:
```bash
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Step 5: Verify recovery
After restarting the runner, confirm it is healthy:
```bash
# Check status -- should show "running" with a recent heartbeat
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"

# Send a test A2A task to verify the agent is processing
curl -X POST https://api.curate-me.ai/gateway/a2a/tasks/send \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "target_agent_id": "'$RUNNER_ID'",
    "message": "health check ping",
    "priority": "high"
  }'
```

Prevention
Set idle timeouts
Runners that are not actively processing tasks should be automatically stopped to prevent resource waste and reduce the chance of memory-related stalls:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"idle_timeout_seconds": 1800}'
```

Configure resource limits
Set memory and CPU limits to prevent a single runner from consuming all host resources:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "cpu_limit": 1.5
  }'
```

Use OpenClaw skip flags
Reduce baseline memory usage by disabling unnecessary OpenClaw subsystems:
```bash
OPENCLAW_SKIP_CRON=1
OPENCLAW_SKIP_BROWSER_CONTROL_SERVER=1
OPENCLAW_SKIP_CANVAS_HOST=1
OPENCLAW_SKIP_GMAIL_WATCHER=1
OPENCLAW_SKIP_PROVIDERS=1
OPENCLAW_NO_RESPAWN=1
OPENCLAW_DISABLE_BONJOUR=1
```

These flags reduce per-container memory from ~5.7 GB to ~500 MB. See the runner configuration docs for the full list.
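If your control plane honors the `env_overrides` field shown in the restart examples earlier, the skip flags can be applied at start time; here is a sketch that only assembles the request payload (the endpoint's acceptance of these keys is an assumption, not verified here):

```bash
#!/usr/bin/env bash
# Sketch: assemble an env_overrides payload carrying the skip flags.
payload=$(cat <<'EOF'
{"env_overrides": {
  "OPENCLAW_SKIP_CRON": "1",
  "OPENCLAW_SKIP_BROWSER_CONTROL_SERVER": "1",
  "OPENCLAW_SKIP_CANVAS_HOST": "1",
  "OPENCLAW_SKIP_GMAIL_WATCHER": "1",
  "OPENCLAW_SKIP_PROVIDERS": "1",
  "OPENCLAW_NO_RESPAWN": "1",
  "OPENCLAW_DISABLE_BONJOUR": "1"
}}
EOF
)
echo "$payload"

# Then start the runner with it:
# curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
#   -H "Authorization: Bearer $ADMIN_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```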
Escalation
If the runner cannot be recovered after a force stop and restart:
- Collect the runner ID and org ID
- Pull the full container logs: `docker logs $CONTAINER_ID > runner-debug.log 2>&1`
- Check the runner audit trail in the dashboard: Runners > [Runner] > Audit Log
- Contact the platform team with the runner ID, logs, and the last known working timestamp
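The collection steps above can be bundled into a small helper; a sketch (filenames and layout are illustrative, and the `docker logs` step is skipped gracefully when Docker or the container ID is unavailable):

```bash
#!/usr/bin/env bash
# Sketch: gather escalation artifacts into one directory.
# collect_debug RUNNER_ID ORG_ID CONTAINER_ID -> prints the output directory
collect_debug() {
  local runner_id=$1 org_id=$2 container_id=$3 outdir
  outdir=$(mktemp -d)
  # Record the identifiers and collection time the platform team asks for
  printf 'runner_id=%s\norg_id=%s\ncollected_at=%s\n' \
    "$runner_id" "$org_id" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$outdir/metadata.txt"
  # Full container logs, when Docker is reachable
  if command -v docker >/dev/null 2>&1 && [ -n "$container_id" ]; then
    docker logs "$container_id" > "$outdir/runner-debug.log" 2>&1 || true
  fi
  echo "$outdir"  # hand this directory to the platform team
}

# Usage:
# collect_debug runner_abc123 "$ORG_ID" "$CONTAINER_ID"
```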