Skip to Content
RunbooksRunbook: Claude Code OAuth Token Rotation

Runbook: Claude Code OAuth Token Rotation

Operational reference for the token rotation system that keeps the Curate-Me platform authenticated against Anthropic’s Claude Code endpoints. Written after issue #1775 (the 36-hour silent outage on 2026-04-17).


Architecture

Three independent mechanisms cooperate to keep an accessToken in place on the VPS and a long-lived API key available as a fallback:

MechanismWhereCadenceRole
claude_oauth_refresh Celery taskVPS backend15 minAuthoritative refresher. Uses the stored refreshToken to mint a fresh accessToken via https://platform.claude.com/v1/oauth/token. Writes auth.json.
push-claude-auth.sh launchd jobBoris’s Mac30 minPassive heartbeat. Pushes only if the operator just ran claude setup-token (keychain prefix differs from VPS AND keychain written in the last 5 min). Otherwise just pings for the dead-man.
token_push_dead_man Celery taskVPS backend15 minSilence detector. Fires mac_silent Teams alert if no push event arrived in the last 60 min.
ANTHROPIC_FALLBACK_KEY env varVPS backendN/ADegraded-mode fallback. Long-lived sk-ant-api* key used when OAuth chain is fully exhausted. Pay-per-token, not Max subscription pricing. Emits gateway_using_fallback_key log line when active.

The VPS is the authoritative refresher. The Mac does not run proactive refresh — it only pushes when the user re-auths locally. This eliminates the two-writer race hazard that would poison each refresher’s copy of the rotating refreshToken.


Endpoints and client_id (single source of truth)

Defined in services/backend/src/config/anthropic_oauth.py — do not hardcode these elsewhere:

ConstantValue
ANTHROPIC_OAUTH_TOKEN_URLhttps://platform.claude.com/v1/oauth/token
ANTHROPIC_OAUTH_CLIENT_ID9d1c250a-e61b-44d9-88ed-5944d1962f5e
ANTHROPIC_OAUTH_PROBE_URLhttps://api.anthropic.com/api/oauth/claude_cli/roles
ANTHROPIC_OAUTH_ACCESS_PREFIXsk-ant-oat01-
ANTHROPIC_OAUTH_REFRESH_PREFIXsk-ant-ort01-
ANTHROPIC_API_KEY_PREFIXsk-ant-api

The Mac-side scripts/push-claude-auth.sh mirrors the probe URL as a bash variable. If you change the constants, update the shell script too.


Alert decoder

Teams token_push alerts carry a status field that maps to one of five variants. Each has a specific operator action:

🔴 reauth_required — refresh token revoked

Anthropic has invalidated the stored refreshToken (usually because the operator signed into Claude Code on another machine). Auto-rotation cannot recover; an interactive browser handshake is required.

Action:

# On Boris's Mac: claude setup-token # Then push immediately (bypasses the 30-min launchd cadence): bash scripts/push-claude-auth.sh

🟡 mac_silent — launchd job stopped

The Mac has not pushed any heartbeat for 60+ minutes. Either the launchd job crashed, the Mac is offline/asleep, or the network is down.

Action:

# On Boris's Mac: launchctl list | grep claude-auth-push tail -20 ~/logs/claude-auth-push.log # Reinstall the job if missing: bash scripts/install-auth-push-cron.sh

🟡 stale — no successful push in >2h

A failed streak has been going for >2 hours. The VPS’s auth.json may be from a stale token. Related to but distinct from mac_silent: this one fires when pushes are arriving but failing, not when they’re absent.

Action: Check VPS-side refresh first (it’s the authoritative refresher):

ssh curateme@178.105.8.25 docker logs curateme-backend-celery-beat 2>&1 | grep claude_oauth_refresh | tail -10

If the VPS task is failing, investigate that. If the VPS task is succeeding but the alert persists, the Mac-side flow has a problem.

🔴 failed — repeated push failures

Three or more consecutive pushes have failed. Usually means the VPS is unreachable from the Mac (network, SSH, or the gateway URL is down).

Action:

# On the Mac: tail -20 ~/logs/claude-auth-push.log ssh -o ConnectTimeout=5 curateme@178.105.8.25 true && echo "ssh ok" || echo "ssh broken" curl -sf https://api.curate-me.ai/api/v1/admin/system/token-push/last-seen && echo ok || echo endpoint down

🟢 recovered — back to ok

A failed | stale | mac_silent streak has ended. Informational only.


ANTHROPIC_FALLBACK_KEY — the break-glass fallback

If both the VPS refresh and the Mac push fail simultaneously (Mac offline + refreshToken revoked), the OAuth chain is fully exhausted. ANTHROPIC_FALLBACK_KEY kicks in to keep the gateway serving requests in degraded mode.

Procurement: create a long-lived API key at https://console.anthropic.com/settings/keys and set it in services/backend/.env.production as ANTHROPIC_FALLBACK_KEY=sk-ant-api-....

When active:

  • Log line gateway_using_fallback_key appears on every request
  • Cost shifts from $0/request (Max subscription) to pay-per-token
  • Token prefix in logs changes from sk-ant-oat01- to sk-ant-api

Recovery: run claude setup-token, push, verify OAuth chain is working via curl https://api.curate-me.ai/api/v1/admin/system/token-push/status. The fallback is automatically dropped on the next call once OAuth is back.


Testing the rotation system end-to-end

Without waiting for natural expiry. Run from the VPS:

Test 1 — Force OAuth refresh

docker exec -it curateme-backend-gateway python -c " from src.tasks.claude_oauth_refresh import refresh_claude_oauth_token print(refresh_claude_oauth_token.apply().get()) " # Expect: {"ok": True, "refreshed": True|False, "token_present": True, "token_valid": True, "error": None}

Test 2 — Simulate revoked refresh_token

# Corrupt auth.json (backup first): cp ~/.config/claude-code/auth.json /tmp/auth.json.bak python3 -c " import json with open('/home/curateme/.config/claude-code/auth.json') as f: d = json.load(f) d['claudeAiOauth']['refreshToken'] = 'sk-ant-ort01-bogus' with open('/home/curateme/.config/claude-code/auth.json', 'w') as f: json.dump(d, f) " # Trigger the task: docker exec -it curateme-backend-gateway python -c " from src.tasks.claude_oauth_refresh import refresh_claude_oauth_token print(refresh_claude_oauth_token.apply().get()) " # Expect: error='refresh_token_revoked' + Teams reauth_required card # Restore: cp /tmp/auth.json.bak ~/.config/claude-code/auth.json

Test 3 — Simulate Mac silence

# On the Mac: launchctl unload ~/Library/LaunchAgents/com.curateme.claude-auth-push.plist # Wait 60 min. Teams should show mac_silent. # Reload: launchctl load ~/Library/LaunchAgents/com.curateme.claude-auth-push.plist

Test 4 — Simulate fallback key activation

# On the VPS, temporarily break auth.json: mv ~/.config/claude-code/auth.json /tmp/auth.json.bak # With ANTHROPIC_FALLBACK_KEY set, gateway calls should still succeed. curl -s https://api.curate-me.ai/v1/anthropic/v1/messages \ -H "X-CM-API-Key: $TEST_KEY" \ -d '{"model":"claude-haiku-4-5-20251001","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}' # Check logs: docker logs curateme-backend-gateway 2>&1 | grep gateway_using_fallback_key | tail -5 # Restore: mv /tmp/auth.json.bak ~/.config/claude-code/auth.json

FileRole
services/backend/src/services/llm_auth.pyrefresh_token_from_keychain(), get_max_sub_token(), get_max_sub_token_with_source(), refresh_token_revoked()
services/backend/src/tasks/claude_oauth_refresh.pyAuthoritative 15-min refresh task
services/backend/src/tasks/max_sub_health_check.py5-min capability probe (token + round-trip)
services/backend/src/tasks/token_push_dead_man.py15-min dead-man for Mac silence
services/backend/src/api/routes/admin_system_ops.py/token-push, /token-push/status, /token-push/last-seen endpoints + alert dedup
services/backend/src/config/anthropic_oauth.pyCentralized endpoint + client_id constants
scripts/push-claude-auth.shMac-side passive-unless-reauth launchd job

Historical reference

  • Issue #1775 — the 36-hour silent outage on 2026-04-17 that prompted this system. Full root-cause analysis in the issue body.
  • ~/.claude/scheduled-tasks/check-claude-token/SKILL.md — original source of the correct endpoint/client_id/probe combinations.