
Runbook: Runner Container Stuck / Not Responding

This runbook covers diagnosing and recovering a managed OpenClaw runner container that has stopped responding while still showing a “running” status.


Symptoms

  • Runner status shows running in the dashboard but the agent is not processing tasks
  • No heartbeat received for more than 2 minutes (heartbeats fire every 60 seconds by default)
  • A2A tasks sent to the runner remain in pending state indefinitely
  • Desktop streaming (VNC) shows a frozen or blank screen
  • Jobs dispatched via BYOVM protocol are not being acknowledged

Step 1: Check runner status and heartbeat

Pull the runner’s current state from the control plane:

```shell
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"
```

Expected response:

```json
{
  "runner_id": "runner_abc123",
  "status": "running",
  "last_heartbeat": "2026-03-17T14:22:00Z",
  "uptime_seconds": 3847,
  "container_image": "ghcr.io/curate-me-ai/openclaw-base:latest",
  "resource_usage": {
    "memory_mb": 580,
    "cpu_percent": 12.4
  }
}
```

What to look for:

| Field | Healthy | Investigate if |
| --- | --- | --- |
| `last_heartbeat` | Within last 120 seconds | More than 120 seconds ago |
| `memory_mb` | < 1500 MB | > 2000 MB (memory exhaustion) |
| `cpu_percent` | < 80% | Sustained 100% (process stuck in a loop) |

Step 2: Attempt to connect via desktop streaming

If the runner has desktop streaming enabled, connect to the VNC session to visually inspect the container state:

```shell
# Get the desktop streaming URL
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/desktop/connect \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

This returns a Guacamole WebSocket URL. Open it in a browser to see the container desktop. Look for:

  • Frozen screen — OpenClaw gateway process may have crashed
  • Error dialog — Check the error message content
  • Terminal output — Look for stack traces or out-of-memory messages
  • Blank screen — Xvfb or x11vnc may have crashed

Step 3: Check container logs

Pull the most recent logs from the runner container:

```shell
curl "https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/logs?lines=100" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

(Quote the URL: the unquoted `?` is subject to shell filename expansion.)

Common patterns in logs:

| Log pattern | Meaning | Action |
| --- | --- | --- |
| `FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory` | Node.js heap exhausted | Reduce context size, increase `--max-old-space-size` |
| `Error: connect ECONNREFUSED 127.0.0.1:18789` | OpenClaw gateway process crashed | Restart the runner |
| `WebSocket connection to 'ws://127.0.0.1:18789' failed` | Gateway process not accepting connections | Restart the runner |
| `SIGKILL` or `OOMKilled` | Container killed by the kernel OOM killer | Increase memory limit or reduce workload |
| No recent log lines at all | Process completely frozen | Force stop and restart |
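The table above can be folded into a small triage helper for scripted log checks. A sketch: the pattern strings mirror the table, and `classify_log` is a hypothetical name, not part of the platform tooling:

```shell
#!/usr/bin/env bash
# Map a runner log line to a recommended action, per the table above.
classify_log() {
  case "$1" in
    *"JavaScript heap out of memory"*) echo "heap-exhausted: reduce context or raise --max-old-space-size" ;;
    *"ECONNREFUSED 127.0.0.1:18789"*)  echo "gateway-crashed: restart the runner" ;;
    *"ws://127.0.0.1:18789"*)          echo "gateway-unreachable: restart the runner" ;;
    *SIGKILL*|*OOMKilled*)             echo "oom-killed: increase memory limit or reduce workload" ;;
    *)                                 echo "unknown: inspect manually" ;;
  esac
}

classify_log "Error: connect ECONNREFUSED 127.0.0.1:18789"
```

Piping the last 100 log lines through a loop over `classify_log` gives a quick first-pass verdict before reading the logs in full.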

Step 4: Identify the root cause

Cause A: OpenClaw gateway crash

The OpenClaw gateway process inside the container (ws://127.0.0.1:18789) has crashed and did not auto-restart.

Diagnosis: Container logs show ECONNREFUSED on port 18789 or the gateway process is absent from the process list.

Fix: Force stop and restart the runner:

```shell
# Force stop
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Wait for status to become "stopped"
sleep 5

# Restart
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
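The fixed `sleep 5` above is a guess at how long the stop takes; polling until the runner actually reports `stopped` is more robust. A sketch of such a poll loop, where `fetch_status` is a stand-in for the Step 1 status call piped through `jq -r .status` (a fake `fetch_status` is defined at the bottom so the sketch runs on its own):

```shell
#!/usr/bin/env bash
# Poll a status-producing command until it reports the desired value,
# or give up after a timeout.
wait_for_status() {
  local want="$1" timeout="${2:-60}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    [ "$(fetch_status)" = "$want" ] && return 0
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out after ${timeout}s waiting for status=${want}" >&2
  return 1
}

# Illustrative stand-in: a real fetch_status would curl the admin
# status endpoint and extract .status with jq.
fetch_status() { echo "stopped"; }

wait_for_status stopped 30 && echo "runner is stopped; safe to restart"
```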

Cause B: Memory exhaustion

OpenClaw agents accumulate context over time. Long-running sessions with large conversation histories can exhaust the Node.js heap or the container memory limit.

Diagnosis: memory_mb is above 2000 MB, or container logs show JavaScript heap out of memory.

Fix (immediate): Restart the runner with a fresh session:

```shell
# Stop the runner
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

# Restart with increased memory
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"env_overrides": {"NODE_OPTIONS": "--max-old-space-size=1024"}}'
```

Fix (long-term): Configure the runner with resource limits and idle timeout:

```shell
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "idle_timeout_seconds": 1800,
    "session_max_turns": 100
  }'
```

Cause C: Network partition

The runner container has lost network connectivity to the gateway or to upstream LLM providers.

Diagnosis: Heartbeats stopped abruptly (not a gradual degradation), and the container is still consuming CPU/memory normally. Desktop streaming may also be unreachable.

Fix: Check the network configuration:

```shell
# For Docker-based runners, check the container network
docker inspect $CONTAINER_ID --format='{{.NetworkSettings.Networks}}'

# Test connectivity from inside the container
docker exec $CONTAINER_ID curl -s https://api.curate-me.ai/health
```

If the container has lost its network attachment, restart it:

```shell
curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"force": true}'

curl -X POST https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/start \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Step 5: Verify recovery

After restarting the runner, confirm it is healthy:

```shell
# Check status -- should show "running" with a recent heartbeat
curl https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"

# Send a test A2A task to verify the agent is processing
curl -X POST https://api.curate-me.ai/gateway/a2a/tasks/send \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "target_agent_id": "'$RUNNER_ID'",
    "message": "health check ping",
    "priority": "high"
  }'
```

Prevention

Set idle timeouts

Runners that are not actively processing tasks should be automatically stopped to prevent resource waste and reduce the chance of memory-related stalls:

```shell
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"idle_timeout_seconds": 1800}'
```

Configure resource limits

Set memory and CPU limits to prevent a single runner from consuming all host resources:

```shell
curl -X PATCH https://api.curate-me.ai/gateway/admin/runners/$RUNNER_ID/config \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "memory_limit_mb": 2048,
    "cpu_limit": 1.5
  }'
```

Use OpenClaw skip flags

Reduce baseline memory usage by disabling unnecessary OpenClaw subsystems:

```shell
OPENCLAW_SKIP_CRON=1
OPENCLAW_SKIP_BROWSER_CONTROL_SERVER=1
OPENCLAW_SKIP_CANVAS_HOST=1
OPENCLAW_SKIP_GMAIL_WATCHER=1
OPENCLAW_SKIP_PROVIDERS=1
OPENCLAW_NO_RESPAWN=1
OPENCLAW_DISABLE_BONJOUR=1
```

These flags reduce per-container memory from ~5.7 GB to ~500 MB. See the runner configuration docs for the full list.
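For a self-managed Docker-based runner, the flags above would be passed as environment variables at launch. A hedged sketch (config fragment, not a prescribed invocation): the image tag matches the one shown in Step 1, but the container name is illustrative and any additional flags your deployment needs are omitted:

```shell
# Launch a runner container with the memory-reducing skip flags.
# Container name is illustrative; add port mappings, volumes, and
# limits as your deployment requires.
docker run -d --name openclaw-runner \
  -e OPENCLAW_SKIP_CRON=1 \
  -e OPENCLAW_SKIP_BROWSER_CONTROL_SERVER=1 \
  -e OPENCLAW_SKIP_CANVAS_HOST=1 \
  -e OPENCLAW_SKIP_GMAIL_WATCHER=1 \
  -e OPENCLAW_SKIP_PROVIDERS=1 \
  -e OPENCLAW_NO_RESPAWN=1 \
  -e OPENCLAW_DISABLE_BONJOUR=1 \
  ghcr.io/curate-me-ai/openclaw-base:latest
```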


Escalation

If the runner cannot be recovered after a force stop and restart:

  1. Collect the runner ID and org ID
  2. Pull the full container logs: docker logs $CONTAINER_ID > runner-debug.log 2>&1
  3. Check the runner audit trail in the dashboard: Runners > [Runner] > Audit Log
  4. Contact the platform team with the runner ID, logs, and the last known working timestamp