OpenClaw Boot Failed
Symptom
The runner container starts, but the session never transitions from
PROVISIONING to READY within the per-tier timeout. The dashboard launch
panel shows openclaw_boot_failed or openclaw_ready_timeout, and the cm-runner
agent log contains:
WARN openclaw_not_ready_yet attempts=N max=10
ERROR openclaw_boot_failed reason=startup_timeout | gateway_unreachable | out_of_memory | provider_key_missing
Likely causes
| Reason | What happened | Fix |
|---|---|---|
| startup_timeout | OpenClaw process is still doing first-launch setup when the supervisor gave up | Bump container memory; check disk I/O speed. |
| gateway_unreachable | The container can’t reach the configured CM_GATEWAY_PROXY_URL for LLM calls | Verify the value is set and reachable from inside the container. |
| out_of_memory | Kernel OOM-killed the OpenClaw process | Host needs more RAM or fewer concurrent sessions. |
| provider_key_missing | The session needs an LLM provider key the agent didn’t inject | See Missing Credentials. |
| Skip flags missing | OpenClaw is spawning ~19 worker processes and exhausting RAM | Confirm the executor is setting OPENCLAW_SKIP_* env vars (already the default in cm-runner). |
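When several sessions fail at once, it can help to tally which reason dominates before picking a fix from the table. The following is a minimal sketch, assuming each failure line carries a single `reason=<value>` token as in the Symptom excerpt; the function name `boot_failure_reasons` is hypothetical and not part of cm-runner.

```shell
# Hypothetical helper: read agent log lines on stdin and print
# "<reason> <count>" for every openclaw_boot_failed line.
# Assumes failure lines look like:
#   ERROR openclaw_boot_failed reason=<value>
boot_failure_reasons() {
  sed -n 's/.*openclaw_boot_failed.*reason=\([a-z_]*\).*/\1/p' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (not run here):
#   docker logs cm-runner --tail 1000 2>&1 | boot_failure_reasons
```

The `2>&1` in the usage line matters because `docker logs` replays the container's stderr on its own stderr, which a plain pipe would not capture.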
Fix
Step 1 — Read OpenClaw’s own log
The cm-runner executor streams the container’s stdout/stderr to the agent log, prefixed with the session ID:
docker logs cm-runner --tail 500 2>&1 | grep "session_$SESSION_ID"
Look for the last non-info line before the timeout. It almost always identifies the failure (missing key, port collision, malformed config).
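Picking out that last non-info line can be scripted. This is a rough sketch, assuming the log level appears as an uppercase word (INFO, DEBUG, WARN, ERROR) somewhere in each line, as the WARN and ERROR excerpts in the Symptom section suggest; `last_non_info_line` is a hypothetical name.

```shell
# Hypothetical filter: print the last line on stdin whose level is neither
# INFO nor DEBUG. The level is assumed to be a standalone uppercase word.
last_non_info_line() {
  grep -Ev '(^|[[:space:]])(INFO|DEBUG)([[:space:]]|$)' | tail -n 1
}

# Usage (not run here):
#   docker logs cm-runner --tail 500 2>&1 \
#     | grep "session_$SESSION_ID" | last_non_info_line
```

If OpenClaw emits structured (for example JSON) log lines instead, the pattern would need to match the level field rather than a bare word.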
Step 2 — Confirm gateway reachability from inside the container
docker exec session_$SESSION_ID curl -s -o /dev/null -w "%{http_code}\n" \
  "$CM_GATEWAY_PROXY_URL/health"
A 200 confirms the LLM proxy path is open. Anything else means that the agent’s
CM_GATEWAY_PROXY_URL points to the wrong host or that the container’s
network can’t reach it.
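The status code alone separates the two failure modes: curl prints 000 when it received no HTTP response at all (DNS failure, refused connection, timeout), while any other non-200 code means the host answered but is not healthy. A small sketch of that triage follows; the function name and verdict strings are hypothetical.

```shell
# Hypothetical helper: map the value printed by curl's %{http_code}
# to a short verdict.
classify_gateway_code() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "unreachable" ;;              # no HTTP response received
    *)   echo "reachable_but_unhealthy" ;;  # host answered with another code
  esac
}

# Usage (not run here). Single quotes make the container's shell, not the
# host's, expand CM_GATEWAY_PROXY_URL; this assumes the image ships `sh`
# and that the variable is set in the container's environment.
#   code="$(docker exec "session_$SESSION_ID" sh -c \
#     'curl -s -o /dev/null -m 5 -w "%{http_code}" "$CM_GATEWAY_PROXY_URL/health"')"
#   classify_gateway_code "$code"
```

Running the check through `sh -c` tests the URL exactly as the container sees it, whereas a double-quoted URL on the `docker exec` command line is expanded by the host shell before the command runs.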
Step 3 — Check memory headroom
docker stats --no-stream session_$SESSION_ID
The OpenClaw skip-flag profile (default in current cm-runner) keeps each session at ~450 MB steady-state. If usage stays above 2 GB, one of two things is happening:
- The session is running a heavy workload. This is expected; scale the host.
- A non-default OpenClaw configuration is spawning the full worker fan-out. Unset any OPENCLAW_SKIP_*=0 overrides.
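Both checks can be scripted. The sketch below assumes the `MemUsage` column of `docker stats` has the usual `<used> / <limit>` shape with binary units (for example `450MiB / 2GiB`) and treats 2 GB as 2048 MiB; the helper names are hypothetical, and no specific OPENCLAW_SKIP_* variable names are assumed beyond the prefix.

```shell
# Hypothetical helper: convert the "used" half of a docker stats MemUsage
# value (for example "450MiB / 2GiB") to whole MiB.
mem_used_mib() {
  printf '%s\n' "$1" | awk '{
    used = $1
    value = used + 0                 # numeric prefix, e.g. 1.5 from "1.5GiB"
    if (used ~ /GiB$/)      value *= 1024
    else if (used ~ /KiB$/) value /= 1024
    else if (used ~ /B$/ && used !~ /MiB$/) value /= 1048576
    printf "%d\n", value
  }'
}

# Hypothetical helper: from an `env` listing on stdin, print the names of
# OPENCLAW_SKIP_* variables that are explicitly set to 0.
disabled_skip_flags() {
  sed -n 's/^\(OPENCLAW_SKIP_[A-Za-z0-9_]*\)=0$/\1/p'
}

# Usage (not run here):
#   usage="$(docker stats --no-stream --format '{{.MemUsage}}' "session_$SESSION_ID")"
#   [ "$(mem_used_mib "$usage")" -gt 2048 ] && echo "above 2 GB"
#   docker exec "session_$SESSION_ID" env | disabled_skip_flags
```

A single `--no-stream` sample is only a snapshot, so repeat the measurement a few times before treating the reading as sustained.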
Step 4 — Rule out a bad image
If the same template fails on every machine, the published image is the suspect. Pin the template back to a known-good tag:
# Current pinned tag (default for new templates).
curl -X PATCH \
-H "X-CM-API-Key: cm_sk_your_key_here" \
-H "Content-Type: application/json" \
https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID \
-d '{"image_ref": "ghcr.io/curate-me-ai/openclaw-base:v2026.5.22"}'
# Roll back to the previous pin (v2026.4.2 stays available for 14 days
# post-canary as the approved rollback target).
curl -X PATCH \
-H "X-CM-API-Key: cm_sk_your_key_here" \
-H "Content-Type: application/json" \
https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID \
-d '{"image_ref": "ghcr.io/curate-me-ai/openclaw-base:v2026.4.2"}'
(See Runner Operations runbook § 5 for the full template / skill-pack rollback procedure.)
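Since the two requests differ only in the tag, the request body can be generated from it. This is a sketch only: `image_ref_payload` and the `CM_API_KEY` environment variable are hypothetical, and the tag check is a loose guess based on the two tags shown above (`v<number>.<number>.<number>`).

```shell
# Hypothetical helper: build the PATCH body for a given image tag.
# The repository path is copied from the commands above.
image_ref_payload() {
  case "$1" in
    v[0-9]*.[0-9]*.[0-9]*) ;;   # loose shape check, e.g. v2026.4.2
    *) echo "unexpected tag: $1" >&2; return 1 ;;
  esac
  printf '{"image_ref": "ghcr.io/curate-me-ai/openclaw-base:%s"}\n' "$1"
}

# Usage (not run here); CM_API_KEY is an assumed variable holding the key:
#   curl -X PATCH \
#     -H "X-CM-API-Key: $CM_API_KEY" \
#     -H "Content-Type: application/json" \
#     "https://api.curate-me.ai/gateway/admin/runners/templates/$TEMPLATE_ID" \
#     -d "$(image_ref_payload v2026.4.2)"
```

Reading the key from an environment variable rather than typing it inline keeps it out of shell history.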
Where to find logs
# Agent + OpenClaw merged
docker logs cm-runner --tail 1000 2>&1 | grep -E "openclaw|session_"
# Container itself
docker logs session_$SESSION_ID --tail 200
Server-side, look for runner_state_transition_failed, which records the rejected target state.
Related
- Missing Credentials
- Runner Startup SLO
- Runner Operations runbook: Section 4 — OpenClaw CLI failure rollback