Runbook: Runner Container Stuck / Not Responding
This runbook covers diagnosing and recovering a managed OpenClaw runner container that has stopped responding while still showing a “running” status.
Symptoms
- Runner status shows `running` in the dashboard but the agent is not processing tasks
- No heartbeat received for more than 2 minutes (heartbeats fire every 60 seconds by default)
- A2A tasks sent to the runner remain in `pending` state indefinitely
- Desktop streaming (VNC) shows a frozen or blank screen
- Jobs dispatched via the BYOVM protocol are not being acknowledged
Step 1: Check runner status and heartbeat
Pull the runner’s current state from the control plane:
```bash
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"
```

Expected response:
```json
{
  "runner_id": "runner_abc123",
  "status": "running",
  "last_heartbeat": "2026-03-17T14:22:00Z",
  "uptime_seconds": 3847,
  "container_image": "ghcr.io/curate-me-ai/openclaw-base:latest",
  "resource_usage": {
    "memory_mb": 580,
    "cpu_percent": 12.4
  }
}
```

What to look for:
| Field | Healthy | Investigate if |
|---|---|---|
| `last_heartbeat` | Within last 120 seconds | More than 120 seconds ago |
| `memory_mb` | < 1500 MB | > 2000 MB (memory exhaustion) |
| `cpu_percent` | < 80% | Sustained 100% (process stuck in a loop) |
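The staleness check above can be scripted against the status response; here is a minimal sketch (the 120-second threshold comes from the table; the helper names are illustrative, and `date -d` assumes GNU date):

```bash
#!/usr/bin/env bash
# Sketch: decide whether a runner heartbeat is stale.
# heartbeat_age NOW_EPOCH ISO_TIMESTAMP -> prints age in seconds
heartbeat_age() {
  local hb_epoch
  hb_epoch=$(date -u -d "$2" +%s)  # GNU date; BSD/macOS needs `date -j -f` instead
  echo $(( $1 - hb_epoch ))
}

# is_stale AGE_SECONDS [THRESHOLD] -> succeeds if older than threshold (default 120s)
is_stale() {
  [ "$1" -gt "${2:-120}" ]
}

# Example using the sample response above, with "now" three minutes later:
now=$(date -u -d "2026-03-17T14:25:00Z" +%s)
age=$(heartbeat_age "$now" "2026-03-17T14:22:00Z")
is_stale "$age" && echo "heartbeat stale (${age}s old): investigate"
# prints: heartbeat stale (180s old): investigate
```

In practice you would feed `last_heartbeat` from the status endpoint into `heartbeat_age` together with the current time.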
Step 2: Attempt to connect via desktop streaming
If the runner has desktop streaming enabled, connect to the VNC session to visually inspect the container state:
```bash
# Get the desktop streaming URL
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/desktop/connect \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

This returns a Guacamole WebSocket URL. Open it in a browser to see the container desktop. Look for:
- Frozen screen — OpenClaw gateway process may have crashed
- Error dialog — Check the error message content
- Terminal output — Look for stack traces or out-of-memory messages
- Blank screen — Xvfb or x11vnc may have crashed
Step 3: Check container logs
Pull the most recent logs from the runner container:
```bash
curl "https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/logs?lines=100" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Common patterns in logs:
| Log Pattern | Meaning | Action |
|---|---|---|
| `FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory` | Node.js heap exhausted | Reduce context size, increase `--max-old-space-size` |
| `Error: connect ECONNREFUSED 127.0.0.1:18789` | OpenClaw gateway process crashed | Restart the runner |
| `WebSocket connection to 'ws://127.0.0.1:18789' failed` | Gateway process not accepting connections | Restart the runner |
| `SIGKILL` or `OOMKilled` | Container killed by the kernel OOM killer | Increase memory limit or reduce workload |
| No recent log lines at all | Process completely frozen | Force stop and restart |
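The triage table above can be turned into a quick classifier over a saved log file; a sketch (the pattern strings come from the table, the function name and messages are illustrative):

```bash
#!/usr/bin/env bash
# Sketch: map known log patterns to a recommended action.
# classify_log FILE -> prints one recommended action
classify_log() {
  if grep -q "JavaScript heap out of memory" "$1"; then
    echo "heap exhausted: reduce context size / raise --max-old-space-size"
  elif grep -qE "ECONNREFUSED 127\.0\.0\.1:18789|ws://127\.0\.0\.1:18789' failed" "$1"; then
    echo "gateway down: restart the runner"
  elif grep -qE "SIGKILL|OOMKilled" "$1"; then
    echo "kernel OOM kill: increase memory limit or reduce workload"
  elif [ ! -s "$1" ]; then
    echo "no log output: force stop and restart"
  else
    echo "no known pattern: inspect manually"
  fi
}
```

Save the output of the logs endpoint to a file and run, e.g., `classify_log runner.log`.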
Step 4: Identify the root cause
Cause A: OpenClaw gateway crash
The OpenClaw gateway process inside the container (listening on `ws://127.0.0.1:18789`) has crashed and did not auto-restart.
Diagnosis: Container logs show `ECONNREFUSED` on port 18789, or the gateway process is absent from the process list.
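One way to script this diagnosis is a plain TCP probe of the gateway port; a sketch (assumes bash's `/dev/tcp` redirection is available, and that you would run it inside the container, e.g. via `docker exec`):

```bash
#!/usr/bin/env bash
# Sketch: succeed if something accepts TCP connections on HOST:PORT.
port_open() {
  # bash built-in TCP redirection; fails fast if nothing is listening
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Inside the container, the gateway should answer on 18789:
if port_open 127.0.0.1 18789; then
  echo "gateway listening"
else
  echo "gateway not accepting connections: proceed to force stop/restart"
fi
```

This avoids depending on `curl` or `nc` being installed in the container image.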
Fix: Force stop and restart the runner:
```bash
# Force stop
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Wait for status to become "stopped"
sleep 5

# Restart
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Cause B: Memory exhaustion
OpenClaw agents accumulate context over time. Long-running sessions with large conversation histories can exhaust the Node.js heap or the container memory limit.
Diagnosis: `memory_mb` is above 2000 MB, or container logs show `JavaScript heap out of memory`.
Fix (immediate): Restart the runner with a fresh session:
```bash
# Stop the runner
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Restart with an explicit Node.js heap limit
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"env_overrides": {"NODE_OPTIONS": "--max-old-space-size=1024"}}'
```

Fix (long-term): Configure the runner with resource limits and an idle timeout:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "idle_timeout_seconds": 1800,
    "session_max_turns": 100
  }'
```

Cause C: Network partition
The runner container has lost network connectivity to the gateway or to upstream LLM providers.
Diagnosis: Heartbeats stopped abruptly (not a gradual degradation), and the container is still consuming CPU/memory normally. Desktop streaming may also be unreachable.
Fix: Check the network configuration:
```bash
# For Docker-based runners, check the container network
docker inspect $CONTAINER_ID --format='{{.NetworkSettings.Networks}}'

# Test connectivity from inside the container
docker exec $CONTAINER_ID curl -s https://api.curate-me.ai/health
```

If the container has lost its network attachment, restart it:
```bash
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Step 5: Verify recovery
After restarting the runner, confirm it is healthy:
```bash
# Check status -- should show "running" with a recent heartbeat
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"

# Send a test A2A task to verify the agent is processing
curl -X POST https://api.curate-me.ai/gateway/a2a/tasks/send \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "target_agent_id": "'$RUNNER_ID'",
    "message": "health check ping",
    "priority": "high"
  }'
```

Prevention
Set idle timeouts
Runners that are not actively processing tasks should be automatically stopped to prevent resource waste and reduce the chance of memory-related stalls:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"idle_timeout_seconds": 1800}'
```

Configure resource limits
Set memory and CPU limits to prevent a single runner from consuming all host resources:
```bash
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "cpu_limit": 1.5
  }'
```

Use OpenClaw skip flags
Reduce baseline memory usage by disabling unnecessary OpenClaw subsystems:
```bash
OPENCLAW_SKIP_CRON=1
OPENCLAW_SKIP_BROWSER_CONTROL_SERVER=1
OPENCLAW_SKIP_CANVAS_HOST=1
OPENCLAW_SKIP_GMAIL_WATCHER=1
OPENCLAW_SKIP_PROVIDERS=1
OPENCLAW_NO_RESPAWN=1
OPENCLAW_DISABLE_BONJOUR=1
```

These flags reduce per-container memory from ~5.7 GB to ~500 MB. See the runner configuration docs for the full list.
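If your control plane honors the `env_overrides` field shown in the restart examples earlier, the skip flags can be applied at start time; here is a sketch that only assembles the request payload (the endpoint's acceptance of these keys is an assumption, not verified here):

```bash
#!/usr/bin/env bash
# Sketch: assemble an env_overrides payload carrying the skip flags.
payload=$(cat <<'EOF'
{"env_overrides": {
  "OPENCLAW_SKIP_CRON": "1",
  "OPENCLAW_SKIP_BROWSER_CONTROL_SERVER": "1",
  "OPENCLAW_SKIP_CANVAS_HOST": "1",
  "OPENCLAW_SKIP_GMAIL_WATCHER": "1",
  "OPENCLAW_SKIP_PROVIDERS": "1",
  "OPENCLAW_NO_RESPAWN": "1",
  "OPENCLAW_DISABLE_BONJOUR": "1"
}}
EOF
)
echo "$payload"

# Then start the runner with it:
# curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
#   -H "Authorization: Bearer $ADMIN_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```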
Escalation
If the runner cannot be recovered after a force stop and restart:
- Collect the runner ID and org ID
- Pull the full container logs: `docker logs $CONTAINER_ID > runner-debug.log 2>&1`
- Check the runner audit trail in the dashboard: Runners > [Runner] > Audit Log
- Contact the platform team with the runner ID, logs, and the last known working timestamp
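The collection steps above can be bundled into a small helper; a sketch (filenames and layout are illustrative, and the `docker logs` step is skipped gracefully when Docker or the container ID is unavailable):

```bash
#!/usr/bin/env bash
# Sketch: gather escalation artifacts into one directory.
# collect_debug RUNNER_ID ORG_ID CONTAINER_ID -> prints the output directory
collect_debug() {
  local runner_id=$1 org_id=$2 container_id=$3 outdir
  outdir=$(mktemp -d)
  # Record the identifiers and collection time the platform team asks for
  printf 'runner_id=%s\norg_id=%s\ncollected_at=%s\n' \
    "$runner_id" "$org_id" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$outdir/metadata.txt"
  # Full container logs, when Docker is reachable
  if command -v docker >/dev/null 2>&1 && [ -n "$container_id" ]; then
    docker logs "$container_id" > "$outdir/runner-debug.log" 2>&1 || true
  fi
  echo "$outdir"  # hand this directory to the platform team
}

# Usage:
# collect_debug runner_abc123 "$ORG_ID" "$CONTAINER_ID"
```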