Runbook: Rotate Claude Code OAuth Token (Autopilot Max-Sub)

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-05-22
Severity trigger: Depends on prod state (see “Production setup” below). SEV2 if no billable fallback is configured (autopilot fails on auth). SEV3 if a billable ANTHROPIC_FALLBACK_KEY is loaded (silent cost regression).
Required access: A laptop already signed into the Curate-Me Claude Max account; claude CLI; SSH to curateme@178.105.8.25.
Related runbook: OAuth Token Rotation (covers the broader rotation system, alert decoder, and ANTHROPIC_FALLBACK_KEY break-glass).
Related issue: #2395

Production setup (as of 2026-05-22)

CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY in .env.production hold the same value — a long-lived OAuth token from claude setup-token. There is no separate billable pay-as-you-go fallback (sk-ant-api03-*) loaded in prod. So:

If the long-lived OAT is intact, all autopilot LLM calls run at $0 via Max sub. The llm_auth_refresh_token_revoked events in logs are the auto-refresh probe failing harmlessly — the long-lived token doesn’t need refreshing, and the auto-refresh task expects an 8h-lifetime token (from claude auth login) that we don’t use.
If the long-lived OAT is revoked (typically because someone ran claude setup-token elsewhere, or the Max account was signed in on another machine), autopilot LLM calls will start failing with 401s. This is the “actually broken” state this runbook recovers.

To check which state you’re in, run the “Confirm the symptom” check below alongside a recent autopilot run. Failed runs + 401s = the OAT is genuinely dead. Just auto-refresh noise + autopilot runs completing at $0 = harmless; no rotation needed.
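
The decision rule above can be written down as a small function. This is a minimal sketch, not part of the rotation tooling: classify_oat_state is a hypothetical helper, and both inputs are counts you gather by hand (revoked-token log events, and autopilot calls that failed with HTTP 401 in the same window).

```shell
# Hypothetical helper: classify the token state from two hand-gathered counts.
#   $1 = number of llm_auth_refresh_token_revoked events in the window
#   $2 = number of autopilot calls that failed with HTTP 401 in the same window
classify_oat_state() {
  local revoked_count="$1" failed_401_count="$2"
  if [ "$failed_401_count" -gt 0 ]; then
    echo "rotate"       # calls are really failing: treat the OAT as dead
  elif [ "$revoked_count" -gt 0 ]; then
    echo "noise-only"   # refresh probe failing harmlessly: no rotation needed
  else
    echo "healthy"
  fi
}
```

For example, classify_oat_state 4 0 reports noise-only, while any non-zero second argument reports rotate regardless of the first.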

What this runbook is for

The Claude Code refresh token (or the long-lived OAT itself) has been revoked, so the VPS-side calls to Anthropic start failing with 401. Only a human session signed into the Curate-Me Claude Max account can mint a replacement.

The main OAuth runbook covers the full architecture. This one is “do these four commands now.”

Trigger

Run this runbook when autopilot LLM calls are actually failing — not just when refresh-probe noise appears in logs. Specifically:

Autopilot template runs are failing with HTTP 401 at the Anthropic API. Check gateway_request_failed / anthropic_401_unauthorized events, or look for failed autopilot tasks with error: "401" in autopilot_results.
Autopilot template runs that previously cost $0 (max_subscription_cli mode) now show non-zero cost in the cost recorder — that means a billable ANTHROPIC_FALLBACK_KEY has been configured and is now serving calls. Confirmed via gateway_using_fallback_key log lines or the Cost dashboard showing autopilot-tagged spend.
The reauth_required Teams alert from the rotation system fires. (See the main OAuth runbook for the full alert decoder.)
A claude setup-token was just run somewhere else (invalidating the OAT in prod) — proactive rotation is required to restore service.
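
The first trigger can be checked from a terminal. A minimal sketch, assuming the two event names appear verbatim in the container logs. count_auth_failures is a hypothetical helper name. Note that gateway_request_failed may also cover non-401 failures, so treat a non-zero count as a prompt to read the matching lines, not as proof.

```shell
# Count log lines that carry either auth-failure event named in the trigger list.
# Reads log text on stdin so it can be fed from `docker logs` over SSH.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the exit status clean.
count_auth_failures() {
  grep -c -E 'gateway_request_failed|anthropic_401_unauthorized' || true
}
```

Possible usage, assuming the gateway container is the one that emits these events (this runbook does not say which container does): ssh curateme@178.105.8.25 'docker logs --since 1h curateme-backend-gateway 2>&1' | count_auth_failures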

What is NOT a trigger


ssh curateme@178.105.8.25 'docker logs --since 1h curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'

A non-zero count of llm_auth_refresh_token_revoked events on its own is not a trigger. The auto-refresh task expects an 8h-lifetime OAuth token from claude auth login; we use the long-lived setup-token instead, so the probe fails every cycle but the actual token works fine. Count this as expected log noise unless paired with the symptoms above.

To confirm calls are actually working despite the noise:


# Should return HTTP 200:
curl -sS -o /dev/null -w "%{http_code}\n" -X POST https://api.curate-me.ai/v1/messages \
  -H "Authorization: Bearer $(ssh curateme@178.105.8.25 'grep ^CM_FLEET_GATEWAY_KEY /home/curateme/platform/.env.production | cut -d= -f2')" \
  -H "anthropic-version: 2023-06-01" -H "Content-Type: application/json" \
  -d '{"model":"claude-haiku-4-5","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}'
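
The status code printed by the probe above can be mapped to a next action. This is a sketch under assumptions: interpret_probe_code is a hypothetical helper, and a 401 from this endpoint could come from the gateway key used for the probe as well as from the OAT, so it narrows the problem without pinpointing it.

```shell
# Hypothetical helper: turn the probe's HTTP status into a coarse verdict.
# curl prints 000 for %{http_code} when it could not complete a request at all.
interpret_probe_code() {
  case "$1" in
    200) echo "ok" ;;            # calls work; revoked-token log lines are noise
    401) echo "auth-failed" ;;   # check the OAT, and the gateway key used for the probe
    000) echo "unreachable" ;;   # network or DNS problem, not an auth result
    *)   echo "inconclusive" ;;  # anything else needs a look at the response body
  esac
}
```

To use it, capture the curl output in a variable and pass it in, for example interpret_probe_code "$code".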

Why this isn’t automated

The refresh-token-revoked state can only be recovered by an interactive claude browser handshake on a machine signed into the Claude Max account. There is no API to obtain a new refresh token from a server. Everything downstream of obtaining the new auth.json is automated by scripts/rotate-claude-code-oat.sh.

Prerequisites

Before starting, confirm:

You are at a laptop currently signed into the Curate-Me Claude Max account (typically Boris’s MacBook).
The claude CLI is installed, and running claude starts a session without prompting for login.
ssh curateme@178.105.8.25 works without a password (key-based auth is already configured per the platform conventions).
You are NOT inside a Claude Code session — the claude setup-token and reauth flows need a real terminal where they can open a browser tab.
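
Most of these prerequisites can be checked mechanically. The sketch below is illustrative only. The function names are hypothetical, and the session check assumes that Claude Code exports CLAUDECODE=1 into the shells it spawns; if that assumption does not hold in your version, verify the terminal by eye. It cannot check which account the laptop is signed into.

```shell
# Succeeds when the shell looks like it was spawned by Claude Code.
# Assumption: Claude Code sets CLAUDECODE=1 in its child shells.
in_claude_code_session() {
  [ "${CLAUDECODE:-}" = "1" ]
}

# Succeeds when key-based SSH works. BatchMode makes ssh fail instead of
# prompting for a password, so a prompt can never hide a missing key.
ssh_key_auth_works() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

# Runs every check and reports all failures, returning non-zero if any failed.
preflight() {
  local host="$1" failed=0
  command -v claude >/dev/null 2>&1 || { echo "claude CLI not found on PATH"; failed=1; }
  if in_claude_code_session; then
    echo "inside a Claude Code session; switch to a plain terminal"; failed=1
  fi
  ssh_key_auth_works "$host" || { echo "key-based SSH to $host failed"; failed=1; }
  return "$failed"
}
```

Run it as preflight curateme@178.105.8.25 and fix anything it prints before starting Step 1.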

TL;DR — use the helper script


./scripts/rotate-claude-code-oat.sh

It runs through every step in this runbook. The runbook below is what to do if you want to perform each step manually, or if the script fails halfway and you need to recover.

Step 1 — Confirm the symptom


ssh curateme@178.105.8.25 \
  'docker logs --since 1h curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'

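A non-zero count confirms that the refresh probe is failing, which on its own is expected noise (see “What is NOT a trigger”). Proceed only if it is paired with one of the trigger symptoms: autopilot runs failing with 401s, non-zero autopilot cost, or the reauth_required alert. If it returns 0, the symptom is something else; check the main OAuth runbook instead.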

Step 2 — Obtain a fresh token locally

From a non-Claude-Code terminal on the laptop signed into Claude Max:


# Trigger an interactive sign-in flow. This will open a browser tab.
claude
 
# Once the browser flow completes and you're back at the prompt, confirm:
ls -l ~/.config/claude-code/auth.json

The file’s mtime should be within the last few minutes. If not, the sign-in didn’t actually update auth.json (Claude Code caches credentials in the macOS keychain — the disk file lags). Force a write:


# Read the credential out of the macOS keychain and overwrite auth.json:
security find-generic-password -s "Claude Code-credentials" -w > ~/.config/claude-code/auth.json
chmod 600 ~/.config/claude-code/auth.json
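
Before copying the file anywhere, it may help to confirm that it is non-empty, recently written, and locked down. A minimal sketch, with auth_file_ready as a hypothetical helper and a 10-minute default freshness window chosen for illustration. It does not validate the file's contents.

```shell
# Hypothetical helper: check that an auth file is non-empty, fresh, and mode 600.
#   $1 = path to the file, $2 = maximum age in minutes (default 10)
auth_file_ready() {
  local file="$1" max_age_min="${2:-10}"
  [ -s "$file" ] || { echo "missing or empty: $file"; return 1; }
  # find prints the path only if it was modified less than max_age_min minutes ago.
  [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ] \
    || { echo "stale: not modified in the last $max_age_min minutes"; return 1; }
  # -perm 600 is an exact match, so any extra group or world bits fail the check.
  [ -n "$(find "$file" -perm 600 2>/dev/null)" ] \
    || { echo "permissions are not 600"; return 1; }
  echo "ready"
}
```

For example: auth_file_ready ~/.config/claude-code/auth.json 10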

Note: Creating a new token via claude setup-token invalidates the previous one everywhere. That is the likely cause of the current outage if someone ran it, or signed in, on another machine. If access to the Max account is shared, coordinate with the other users before re-running it.

Step 3 — Copy the auth.json to the VPS


scp ~/.config/claude-code/auth.json curateme@178.105.8.25:~/.config/claude-code/auth.json
 
# Make sure permissions are tight on the VPS copy:
ssh curateme@178.105.8.25 'chmod 600 ~/.config/claude-code/auth.json'

The file lives at ~/.config/claude-code/auth.json on the VPS (i.e. /home/curateme/.config/claude-code/auth.json). All container mounts reference ${HOME}/.config/claude-code/auth.json in docker-compose.production.yml and bind it read-only into the containers at /home/runner/.config/claude-code/auth.json or /home/appuser/.config/claude-code/auth.json.
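
A quick way to confirm the copy landed intact is to compare digests on both ends. The sketch below assumes sha256sum is available on the VPS and either sha256sum or shasum on the laptop; sha256_of and digests_match are hypothetical helper names.

```shell
# Print the SHA-256 of a file using whichever tool the platform ships
# (sha256sum on most Linux hosts, shasum on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Succeeds only when both digests are non-empty and identical, so a failed
# remote command (empty output) never counts as a match.
digests_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Possible usage: set local_sum to the output of sha256_of ~/.config/claude-code/auth.json, set remote_sum to the output of ssh curateme@178.105.8.25 'sha256sum ~/.config/claude-code/auth.json | cut -d" " -f1', then run digests_match "$local_sum" "$remote_sum" and re-copy if it fails.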

Step 4 — Recreate all five containers that mount auth.json

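auth.json is bind-mounted into the containers, so the new file content is visible inside them immediately. Running processes can still hold the old token in memory or in their environment, so recreate all five services that mount it: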


ssh curateme@178.105.8.25 'cd /home/curateme/platform && \
  docker compose -f docker-compose.production.yml up -d --force-recreate \
    runner-agent runner-agent-blog backend-b2b backend-gateway celery-worker'

The five services that mount auth.json:

runner-agent (curateme-runner-agent) — autopilot runner for the platform org
runner-agent-blog (curateme-runner-agent-blog) — autopilot runner for the its-boris-blog org
backend-b2b (curateme-backend-b2b) — autopilot orchestrator + decomposer/reviewer LLM client
backend-gateway (curateme-backend-gateway) — gateway proxy that prefers Max-sub for Anthropic
celery-worker (curateme-celery-worker) — background workers that may invoke autopilot LLM calls
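
The service-to-container mapping above follows one pattern, which makes it easy to script checks over all five. A minimal sketch; SERVICES, container_for, and list_containers are hypothetical names introduced for illustration.

```shell
# The five compose services that mount auth.json, as listed above.
SERVICES="runner-agent runner-agent-blog backend-b2b backend-gateway celery-worker"

# Map a compose service name to its container name (curateme-<service>).
# Rejects anything outside the five known services.
container_for() {
  case "$1" in
    runner-agent|runner-agent-blog|backend-b2b|backend-gateway|celery-worker)
      echo "curateme-$1" ;;
    *)
      echo "unknown service: $1" >&2
      return 1 ;;
  esac
}

# Print all five container names, one per line.
list_containers() {
  local svc
  for svc in $SERVICES; do
    container_for "$svc"
  done
}
```

On the VPS, looping over list_containers and running docker inspect -f '{{.State.Status}} {{.State.StartedAt}}' on each name shows whether every container was actually recreated.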

Why all five, not just the two runners

llm_auth.py’s in-process cache only holds the short-lived access token (~1h TTL) — not the refresh token, which always comes off disk. So the backends would self-heal eventually: either the cached access token expires, or the next 401 triggers invalidate_token_cache() and disk is re-read.

But “eventually” is the wrong posture during a SEV2 or SEV3. Concretely, between rotation and self-heal:

Any autopilot Claude Code subprocess already spawned has the OLD OAT baked into its env (worker.py injects CLAUDE_CODE_OAUTH_TOKEN at docker run time). It will keep hitting Anthropic with the dead token until it exits — typically minutes, possibly longer for long-running runs — generating noisy llm_auth_refresh_token_revoked and gateway_using_fallback_key alerts.
The backends serve LLM calls with the cached OLD access token until they 401. Each of those 401s is a customer-visible failure or a fallback-key spend event.

Recreating all five closes that window: the backends restart with empty caches, the recreated runners no longer take work from in-flight stale-token subprocesses, and the next call from any service reads the fresh auth.json directly. The cost is ~30s of unavailability per service versus minutes of intermittent failures, which is an easy trade.

Step 5 — Verify

Wait ~30 seconds for the containers to come back, then confirm the symptom is gone:


# Should print 0 — no fresh revoked-token errors:
ssh curateme@178.105.8.25 \
  'docker logs --since 2m curateme-backend-b2b 2>&1 | grep -c llm_auth_refresh_token_revoked'
 
# Optional: spot-check the gateway's auth source. Should NOT log gateway_using_fallback_key:
ssh curateme@178.105.8.25 \
  'docker logs --since 2m curateme-backend-gateway 2>&1 | grep -E "max_subscription|fallback_key" | tail -5'

For a definitive end-to-end test, kick off a small autopilot run from the dashboard (e.g. a news_digest task) and check the cost recorder. It should report $0 with source=max_subscription rather than the paid path.
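
Instead of a fixed ~30 second wait, the verification can poll until a check passes. A minimal sketch of a generic retry loop; retry_until is a hypothetical helper, and the attempt count and delay are yours to choose.

```shell
# Retry a command until it succeeds or the attempts run out.
#   $1 = maximum attempts, $2 = seconds to sleep between attempts,
#   remaining arguments = the command to run
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Skip the sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Possible usage: retry_until 10 5 ssh curateme@178.105.8.25 'test "$(docker inspect -f "{{.State.Running}}" curateme-backend-b2b)" = true' waits up to roughly 50 seconds for the backend container to report running. A running container is not proof of working auth, so still run the checks above.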

Step 6 — Close the loop

Update GitHub #2395 with the rotation timestamp.
If this was triggered by an alert from the rotation system, confirm the alert clears within 15 minutes (the dead-man checker re-runs at that cadence).
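
The issue update can be posted from the terminal. A sketch, assuming the GitHub CLI (gh) is installed and authenticated, and that you run it from a checkout of the repository that owns #2395. rotation_note is a hypothetical helper and the message wording is only a suggestion.

```shell
# Build a one-line rotation note with a UTC timestamp.
#   $1 = optional timestamp override (defaults to the current UTC time)
rotation_note() {
  local when="${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  printf 'Claude Code OAT rotated at %s; all five auth.json services recreated and verified.\n' "$when"
}
```

Possible usage: gh issue comment 2395 --body "$(rotation_note)"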

Notes on token lifetimes

We use the long-lived OAT from claude setup-token — NOT the 8h claude auth login flow. Memory file infra_claude_token_rotation.md (2026-05-06) is the canonical source for this.
Prod stores the SAME long-lived token value in both CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY. The latter is for legacy code paths that haven’t been ported off os.environ["ANTHROPIC_API_KEY"].
The long-lived OAT is revoked whenever the user signs into Claude Code on another machine, runs claude logout, or runs claude setup-token again anywhere (which regenerates and invalidates the prior token). Coordinate with anyone else who uses the Max account before re-running.
The access token derived from the OAT has a ~1 hour TTL. The claude_oauth_refresh Celery task tries to refresh it every 15 minutes, but the refresh flow expects an 8h-lifetime token shape; with the long-lived setup-token it always logs llm_auth_refresh_token_revoked. This log line is expected noise, not a trigger.
The in-process _token_cache in llm_auth.py only holds the short-lived access token; the OAT always comes off disk on cache miss or 401. So a stale auth.json self-heals at the access-token TTL (~1h) without recreating containers — but recreating is faster than waiting, and necessary to flush in-flight subprocesses whose env was baked at spawn time.
If you suspect the OAT has been quietly rotated rather than revoked, run Test 1 from the main OAuth runbook before doing a full rotation — a forced refresh from disk doesn’t require any container recreates.
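
The 15-minute refresh cadence gives a rough baseline for how much revoked-token noise to expect in a log window. A back-of-the-envelope sketch, assuming exactly one llm_auth_refresh_token_revoked line per refresh cycle (an assumption; the real count per cycle may differ). expected_noise_events is a hypothetical helper.

```shell
# Estimate how many noise events a log window should contain.
#   $1 = window length in minutes, $2 = refresh interval in minutes (default 15)
# Integer division: partial cycles are not counted.
expected_noise_events() {
  local window_min="$1" interval_min="${2:-15}"
  echo $(( window_min / interval_min ))
}
```

Under that assumption, a --since 1h window yields about 4 events from noise alone, and a --since 2m window will usually show 0 whether or not the rotation worked, so pair the short-window check with the end-to-end autopilot run.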

Related runbooks

OAuth Token Rotation: full architecture, alert decoder, fallback key, test playbook
Incident Response: wider incident playbook if this rotation is part of a broader outage
Deployment Procedure: recreating containers via deploy script as an alternative to direct docker compose invocation