Runbook: Rotate Claude Code OAuth Token (Autopilot Max-Sub)
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-22
Severity trigger: Depends on prod state; see “Production setup” below. SEV2 if no billable fallback is configured (autopilot fails on auth). SEV3 if a billable ANTHROPIC_FALLBACK_KEY is loaded (silent cost regression).
Required access: A laptop already signed into the Curate-Me Claude Max account; claude CLI; SSH to curateme@178.105.8.25.
Related runbook: OAuth Token Rotation, which covers the broader rotation system, alert decoder, and ANTHROPIC_FALLBACK_KEY break-glass.
Related issue: #2395
Production setup (as of 2026-05-22)
CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY in .env.production hold the same value — a long-lived OAuth token from claude setup-token. There is no separate billable pay-as-you-go fallback (sk-ant-api03-*) loaded in prod. So:
- If the long-lived OAT is intact, all autopilot LLM calls run at $0 via Max sub. The llm_auth_refresh_token_revoked events in logs are the auto-refresh probe failing harmlessly: the long-lived token doesn’t need refreshing, and the auto-refresh task expects an 8h-lifetime token (from claude auth login) that we don’t use.
- If the long-lived OAT is revoked (typically because someone ran claude setup-token elsewhere, or the Max account was signed in on another machine), autopilot LLM calls will start failing with 401s. This is the “actually broken” state this runbook recovers.
To check which state you’re in, run the “Confirm the symptom” check below alongside a recent autopilot run. Failed runs + 401s = the OAT is genuinely dead. Just auto-refresh noise + autopilot runs completing at $0 = harmless; no rotation needed.
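The two states above can be told apart mechanically once you have a count of failed 401 calls and a count of refresh-probe events. The following is a minimal sketch, not part of the rotation tooling: classify_oat_state is a hypothetical helper name, and the output labels (rotate, noise, healthy) are illustrative only.

```shell
# classify_oat_state FAILED_401_COUNT REVOKED_EVENT_COUNT
# Hypothetical helper: maps the two counts onto the states described above.
classify_oat_state() {
  failed_401=$1
  revoked_events=$2
  if [ "$failed_401" -gt 0 ]; then
    # Real calls are failing with 401: the OAT is genuinely dead.
    echo "rotate"
  elif [ "$revoked_events" -gt 0 ]; then
    # Only the auto-refresh probe is failing: harmless log noise.
    echo "noise"
  else
    echo "healthy"
  fi
}
```

For example, `classify_oat_state 0 12` prints `noise`, which matches the "no rotation needed" case.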
What this runbook is for
The Claude Code refresh token (or the long-lived OAT itself) has been revoked, so the VPS-side calls to Anthropic start failing with 401. Only a human session signed into the Curate-Me Claude Max account can mint a replacement.
The main OAuth runbook covers the full architecture. This one is “do these four commands now.”
Trigger
Run this runbook when autopilot LLM calls are actually failing — not just when refresh-probe noise appears in logs. Specifically:
- Autopilot template runs are failing with HTTP 401 at the Anthropic API. Check gateway_request_failed / anthropic_401_unauthorized events, or look for failed autopilot tasks with error: "401" in autopilot_results.
- Autopilot template runs that previously cost $0 (max_subscription_cli mode) now show non-zero cost in the cost recorder. That means a billable ANTHROPIC_FALLBACK_KEY has been configured and is now serving calls. Confirm via gateway_using_fallback_key log lines or the Cost dashboard showing autopilot-tagged spend.
- The reauth_required Teams alert from the rotation system fires. (See the main OAuth runbook for the full alert decoder.)
- A claude setup-token was just run somewhere else (invalidating the OAT in prod), so proactive rotation is required to restore service.
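For the first trigger, counting the 401-related events in a log stream might look like the sketch below. count_401_events is a hypothetical helper, and which container's logs carry these events is an assumption (the gateway container is used in the usage line only as an example).

```shell
# count_401_events: reads log lines on stdin and prints how many of them
# mention either event name from the trigger list.
# grep -c prints 0 but exits with status 1 when nothing matches, so
# "|| true" keeps the helper usable in scripts that run under `set -e`.
count_401_events() {
  grep -cE 'gateway_request_failed|anthropic_401_unauthorized' || true
}

# Usage (container choice is an assumption):
#   ssh curateme@178.105.8.25 \
#     'docker logs --since 1h curateme-backend-gateway 2>&1' | count_401_events
```

A line that contains both event names is counted once, because grep -c counts matching lines rather than individual matches.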
What is NOT a trigger
ssh curateme@178.105.8.25 'docker logs --since 1h curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'
A non-zero count of llm_auth_refresh_token_revoked events on its own is not a trigger. The auto-refresh task expects an 8h-lifetime OAuth token from claude auth login; we use the long-lived setup-token instead, so the probe fails every cycle but the actual token works fine. Count this as expected log noise unless paired with the symptoms above.
To confirm calls are actually working despite the noise:
# Should return HTTP 200:
curl -sS -o /dev/null -w "%{http_code}\n" -X POST https://api.curate-me.ai/v1/messages \
-H "Authorization: Bearer $(ssh curateme@178.105.8.25 'grep ^CM_FLEET_GATEWAY_KEY /home/curateme/platform/.env.production | cut -d= -f2')" \
-H "anthropic-version: 2023-06-01" -H "Content-Type: application/json" \
  -d '{"model":"claude-haiku-4-5","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}'
Why this isn’t automated
The refresh-token-revoked state can only be recovered by an interactive claude browser handshake on a machine signed into the Claude Max account. There is no API to obtain a new refresh token from a server. Everything downstream of obtaining the new auth.json is automated by scripts/rotate-claude-code-oat.sh.
Prerequisites
Before starting, confirm:
- You are at a laptop currently signed into the Curate-Me Claude Max account (typically Boris’s MacBook).
- claude CLI is installed and claude exits without prompting login.
- ssh curateme@178.105.8.25 works without a password (key-based auth is already configured per the platform conventions).
- You are NOT inside a Claude Code session: the claude setup-token and reauth flows need a real terminal where they can open a browser tab.
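The prerequisites above could be checked with a small preflight before starting. This is a sketch under stated assumptions, not what scripts/rotate-claude-code-oat.sh does: all function names are hypothetical, and the Claude Code session check assumes that Claude Code exports CLAUDECODE=1 into the shells it spawns. It does not verify that the laptop is signed into the right Max account.

```shell
# Succeeds when the shell appears to be running inside a Claude Code session.
# ASSUMPTION: Claude Code sets the CLAUDECODE environment variable.
in_claude_code_session() {
  [ -n "${CLAUDECODE:-}" ]
}

# Succeeds when the named command is on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds when key-based SSH works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
ssh_key_auth_works() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# preflight HOST: runs the three checks and reports the first failure.
preflight() {
  if in_claude_code_session; then
    echo "run this from a plain terminal, not a Claude Code session" >&2
    return 1
  fi
  if ! have_command claude; then
    echo "claude CLI not found on PATH" >&2
    return 1
  fi
  if ! ssh_key_auth_works "$1"; then
    echo "key-based SSH to $1 failed" >&2
    return 1
  fi
  echo "preflight ok"
}

# Usage:
#   preflight curateme@178.105.8.25
```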
TL;DR — use the helper script
./scripts/rotate-claude-code-oat.sh
It runs through every step in this runbook. The runbook below is what to do if you want to perform each step manually, or if the script fails halfway and you need to recover.
Step 1 — Confirm the symptom
ssh curateme@178.105.8.25 \
  'docker logs --since 1h curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'
If this returns a number greater than zero and autopilot runs are also failing with 401s (see “Trigger” above), the token is revoked and you should proceed. A non-zero count on its own is expected refresh-probe noise. If it returns 0, the symptom is something else; check the main OAuth runbook instead.
Step 2 — Obtain a fresh token locally
From a non-Claude-Code terminal on the laptop signed into Claude Max:
# Trigger an interactive sign-in flow. This will open a browser tab.
claude
# Once the browser flow completes and you're back at the prompt, confirm:
ls -l ~/.config/claude-code/auth.json
The file’s mtime should be within the last few minutes. If not, the sign-in didn’t actually update auth.json (Claude Code caches credentials in the macOS keychain, so the disk file lags). Force a write:
# Read the credential out of the macOS keychain and overwrite auth.json:
security find-generic-password -s "Claude Code-credentials" -w > ~/.config/claude-code/auth.json
chmod 600 ~/.config/claude-code/auth.json
Note: Creating a new token via claude setup-token will invalidate the previous one everywhere; that’s the entire reason for the current outage if someone signed in on another machine. Coordinate with whoever else uses the Max account before re-running this if you have shared access.
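The "mtime within the last few minutes" check in this step can be made explicit rather than eyeballed from ls -l. A minimal sketch, assuming a find that supports -mmin (both GNU find and the macOS find do, although it is not in POSIX); is_recent is a hypothetical helper name.

```shell
# is_recent FILE MINUTES
# Succeeds when FILE exists and was modified less than MINUTES minutes ago.
# find prints the path only if it matches, so empty output means
# "missing or too old".
is_recent() {
  [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# Usage:
#   is_recent ~/.config/claude-code/auth.json 5 \
#     || echo "auth.json is stale; force a write from the keychain"
```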
Step 3 — Copy the auth.json to the VPS
scp ~/.config/claude-code/auth.json curateme@178.105.8.25:~/.config/claude-code/auth.json
# Make sure permissions are tight on the VPS copy:
ssh curateme@178.105.8.25 'chmod 600 ~/.config/claude-code/auth.json'
The file lives at ~/.config/claude-code/auth.json on the VPS (i.e. /home/curateme/.config/claude-code/auth.json). All container mounts reference ${HOME}/.config/claude-code/auth.json in docker-compose.production.yml and bind it read-only into the containers at /home/runner/.config/claude-code/auth.json or /home/appuser/.config/claude-code/auth.json.
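One way to confirm the copy landed intact, without printing the credential itself, is to compare digests of the local and remote files. This is a sketch only: file_sha256 is a hypothetical helper, and the usage lines assume sha256sum is available on the VPS.

```shell
# file_sha256 FILE: prints the SHA-256 hex digest of FILE.
# Uses sha256sum where present (typical on Linux) and falls back to
# `shasum -a 256`, which ships with macOS.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Usage (assumes sha256sum exists on the VPS):
#   local_sum=$(file_sha256 ~/.config/claude-code/auth.json)
#   remote_sum=$(ssh curateme@178.105.8.25 \
#     'sha256sum ~/.config/claude-code/auth.json' | cut -d' ' -f1)
#   [ "$local_sum" = "$remote_sum" ] && echo "copy verified"
```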
Step 4 — Recreate all five containers that mount auth.json
auth.json is bind-mounted into the containers, so the file content updates immediately. Recreate all five services that mount it:
ssh curateme@178.105.8.25 'cd /opt/curate-me/platform && \
docker compose -f docker-compose.production.yml up -d --force-recreate \
  runner-agent runner-agent-blog backend-b2b backend-gateway celery-worker'
The five services that mount auth.json:
- runner-agent (curateme-runner-agent): autopilot runner for the platform org
- runner-agent-blog (curateme-runner-agent-blog): autopilot runner for the its-boris-blog org
- backend-b2b (curateme-backend-b2b): autopilot orchestrator + decomposer/reviewer LLM client
- backend-gateway (curateme-backend-gateway): gateway proxy that prefers Max-sub for Anthropic
- celery-worker (curateme-celery-worker): background workers that may invoke autopilot LLM calls
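After the recreate, it can help to confirm that all five containers are actually running again. The sketch below is an illustration, with missing_containers as a hypothetical helper: it reads running container names on stdin and prints any of the five that are absent.

```shell
# Container names taken from the list above.
EXPECTED_CONTAINERS="curateme-runner-agent curateme-runner-agent-blog curateme-backend-b2b curateme-backend-gateway curateme-celery-worker"

# missing_containers: reads running container names on stdin (one per line)
# and prints each expected name that is not among them.
missing_containers() {
  running=$(cat)
  for name in $EXPECTED_CONTAINERS; do
    # -x matches the whole line and -F treats the name as a literal string,
    # so curateme-runner-agent does not match curateme-runner-agent-blog.
    printf '%s\n' "$running" | grep -qxF "$name" || echo "$name"
  done
}

# Usage:
#   ssh curateme@178.105.8.25 "docker ps --format '{{.Names}}'" | missing_containers
```

Empty output means all five are up; any printed name is a container that did not come back.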
Why all five, not just the two runners
llm_auth.py’s in-process cache only holds the short-lived access token (~1h TTL) — not the refresh token, which always comes off disk. So the backends would self-heal eventually: either the cached access token expires, or the next 401 triggers invalidate_token_cache() and disk is re-read.
But “eventually” in a SEV3 is the wrong posture. Concretely, between rotation and self-heal:
- Any autopilot Claude Code subprocess already spawned has the OLD OAT baked into its env (worker.py injects CLAUDE_CODE_OAUTH_TOKEN at docker run time). It will keep hitting Anthropic with the dead token until it exits (typically minutes, possibly longer for long-running runs), generating noisy llm_auth_refresh_token_revoked and gateway_using_fallback_key alerts.
- The backends serve LLM calls with the cached OLD access token until they 401. Each of those 401s is a customer-visible failure or a fallback-key spend event.
Recreating all five clears that window: backends restart with empty caches, runners stop accepting new work from in-flight stale-token subprocesses, and the next call from any service reads the fresh auth.json directly. The cost is ~30s of unavailability per service vs. minutes of intermittent failures — easy trade.
Step 5 — Verify
Wait ~30 seconds for the containers to come back, then confirm the symptom is gone:
# Should print 0 — no fresh revoked-token errors:
ssh curateme@178.105.8.25 \
'docker logs --since 2m curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'
# Optional: spot-check the gateway's auth source. Should NOT log gateway_using_fallback_key:
ssh curateme@178.105.8.25 \
  'docker logs --since 2m curateme-backend-gateway 2>&1 | grep -E "max_subscription|fallback_key" | tail -5'
For a definitive end-to-end test, kick off a small autopilot run from the dashboard (e.g. a news_digest task) and check the cost recorder. It should report $0 with source=max_subscription rather than the paid path.
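Instead of a fixed ~30 second wait, the verification can poll until a check passes. The following is a generic retry sketch; retry_until is a hypothetical helper, and the attempt count and delay in the usage lines are illustrative values, not tuned ones.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Succeeds as soon as COMMAND succeeds; fails if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: wait up to about 30s for the gateway container to report running.
#   retry_until 6 5 ssh curateme@178.105.8.25 \
#     'docker inspect -f "{{.State.Running}}" curateme-backend-gateway | grep -q true'
```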
Step 6 — Close the loop
- Update GitHub #2395 with the rotation timestamp.
- If this was triggered by an alert from the rotation system, confirm the alert clears within 15 minutes (the dead-man checker re-runs at that cadence).
Notes on token lifetimes
- We use the long-lived OAT from claude setup-token, NOT the 8h claude auth login flow. Memory file infra_claude_token_rotation.md (2026-05-06) is the canonical source for this.
- Prod stores the SAME long-lived token value in both CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY. The latter is for legacy code paths that haven’t been ported off os.environ["ANTHROPIC_API_KEY"].
- The long-lived OAT is revoked whenever the user signs into Claude Code on another machine, runs claude logout, or runs claude setup-token again anywhere (which regenerates and invalidates the prior token). Coordinate with anyone else who uses the Max account before re-running.
- The access token derived from the OAT has a ~1 hour TTL. The claude_oauth_refresh Celery task tries to refresh it every 15 minutes, but the refresh flow expects an 8h-lifetime token shape; with the long-lived setup-token it always logs llm_auth_refresh_token_revoked. This log line is expected noise, not a trigger.
- The in-process _token_cache in llm_auth.py only holds the short-lived access token; the OAT always comes off disk on cache miss or 401. So a stale auth.json self-heals at the access-token TTL (~1h) without recreating containers, but recreating is faster than waiting, and necessary to flush in-flight subprocesses whose env was baked at spawn time.
- If you suspect the OAT has been quietly rotated rather than revoked, run Test 1 from the main OAuth runbook before doing a full rotation; a forced refresh from disk doesn’t require any container recreates.
Related runbooks
- OAuth Token Rotation: full architecture, alert decoder, fallback key, test playbook
- Incident Response: wider incident playbook if this rotation is part of a broader outage
- Deployment Procedure: recreating containers via deploy script as an alternative to direct docker compose invocation