Runbook: Claude Code OAuth Token Rotation
Operational reference for the token rotation system that keeps the Curate-Me platform authenticated against Anthropic’s Claude Code endpoints. Written after issue #1775 (the 36-hour silent outage on 2026-04-17).
Architecture
Three independent mechanisms cooperate to keep an accessToken in place on
the VPS and a long-lived API key available as a fallback:
| Mechanism | Where | Cadence | Role |
|---|---|---|---|
claude_oauth_refresh Celery task | VPS backend | 15 min | Authoritative refresher. Uses the stored refreshToken to mint a fresh accessToken via https://platform.claude.com/v1/oauth/token. Writes auth.json. |
push-claude-auth.sh launchd job | Boris’s Mac | 30 min | Passive heartbeat. Pushes only if the operator just ran claude setup-token (keychain prefix differs from VPS AND keychain written in the last 5 min). Otherwise just pings for the dead-man. |
token_push_dead_man Celery task | VPS backend | 15 min | Silence detector. Fires mac_silent Teams alert if no push event arrived in the last 60 min. |
ANTHROPIC_FALLBACK_KEY env var | VPS backend | N/A | Degraded-mode fallback. Long-lived sk-ant-api* key used when OAuth chain is fully exhausted. Pay-per-token, not Max subscription pricing. Emits gateway_using_fallback_key log line when active. |
The VPS is the authoritative refresher. The Mac does not run proactive
refresh — it only pushes when the user re-auths locally. This eliminates
the two-writer race hazard that would poison each refresher’s copy of the
rotating refreshToken.
Endpoints and client_id (single source of truth)
Defined in services/backend/src/config/anthropic_oauth.py — do not
hardcode these elsewhere:
| Constant | Value |
|---|---|
ANTHROPIC_OAUTH_TOKEN_URL | https://platform.claude.com/v1/oauth/token |
ANTHROPIC_OAUTH_CLIENT_ID | 9d1c250a-e61b-44d9-88ed-5944d1962f5e |
ANTHROPIC_OAUTH_PROBE_URL | https://api.anthropic.com/api/oauth/claude_cli/roles |
ANTHROPIC_OAUTH_ACCESS_PREFIX | sk-ant-oat01- |
ANTHROPIC_OAUTH_REFRESH_PREFIX | sk-ant-ort01- |
ANTHROPIC_API_KEY_PREFIX | sk-ant-api |
The Mac-side scripts/push-claude-auth.sh mirrors the probe URL as a bash
variable. If you change the constants, update the shell script too.
Alert decoder
Teams token_push alerts carry a status field that maps to one of five
variants. Each has a specific operator action:
🔴 reauth_required — refresh token revoked
Anthropic has invalidated the stored refreshToken (usually because the
operator signed into Claude Code on another machine). Auto-rotation cannot
recover; an interactive browser handshake is required.
Action:
# On Boris's Mac:
claude setup-token
# Then push immediately (bypasses the 30-min launchd cadence):
bash scripts/push-claude-auth.sh🟡 mac_silent — launchd job stopped
The Mac has not pushed any heartbeat for 60+ minutes. Either the launchd job crashed, the Mac is offline/asleep, or the network is down.
Action:
# On Boris's Mac:
launchctl list | grep claude-auth-push
tail -20 ~/logs/claude-auth-push.log
# Reinstall the job if missing:
bash scripts/install-auth-push-cron.sh🟡 stale — no successful push in >2h
A failed streak has been going for >2 hours. The VPS’s auth.json may be
from a stale token. Related to but distinct from mac_silent: this one
fires when pushes are arriving but failing, not when they’re absent.
Action: Check VPS-side refresh first (it’s the authoritative refresher):
ssh curateme@178.105.8.25
docker logs curateme-backend-celery-beat 2>&1 | grep claude_oauth_refresh | tail -10If the VPS task is failing, investigate that. If the VPS task is succeeding but the alert persists, the Mac-side flow has a problem.
🔴 failed — repeated push failures
Three or more consecutive pushes have failed. Usually means the VPS is unreachable from the Mac (network, SSH, or the gateway URL is down).
Action:
# On the Mac:
tail -20 ~/logs/claude-auth-push.log
ssh -o ConnectTimeout=5 curateme@178.105.8.25 true && echo "ssh ok" || echo "ssh broken"
curl -sf https://api.curate-me.ai/api/v1/admin/system/token-push/last-seen && echo ok || echo endpoint down🟢 recovered — back to ok
A failed | stale | mac_silent streak has ended. Informational only.
ANTHROPIC_FALLBACK_KEY — the break-glass fallback
If both the VPS refresh and the Mac push fail simultaneously (Mac offline +
refreshToken revoked), the OAuth chain is fully exhausted. ANTHROPIC_FALLBACK_KEY
kicks in to keep the gateway serving requests in degraded mode.
Procurement: create a long-lived API key at
https://console.anthropic.com/settings/keys and set it in
services/backend/.env.production as ANTHROPIC_FALLBACK_KEY=sk-ant-api-....
When active:
- Log line
gateway_using_fallback_keyappears on every request - Cost shifts from $0/request (Max subscription) to pay-per-token
- Token prefix in logs changes from
sk-ant-oat01-tosk-ant-api
Recovery: run claude setup-token, push, verify OAuth chain is working
via curl https://api.curate-me.ai/api/v1/admin/system/token-push/status.
The fallback is automatically dropped on the next call once OAuth is back.
Testing the rotation system end-to-end
Without waiting for natural expiry. Run from the VPS:
Test 1 — Force OAuth refresh
docker exec -it curateme-backend-gateway python -c "
from src.tasks.claude_oauth_refresh import refresh_claude_oauth_token
print(refresh_claude_oauth_token.apply().get())
"
# Expect: {"ok": True, "refreshed": True|False, "token_present": True, "token_valid": True, "error": None}Test 2 — Simulate revoked refresh_token
# Corrupt auth.json (backup first):
cp ~/.config/claude-code/auth.json /tmp/auth.json.bak
python3 -c "
import json
with open('/home/curateme/.config/claude-code/auth.json') as f: d = json.load(f)
d['claudeAiOauth']['refreshToken'] = 'sk-ant-ort01-bogus'
with open('/home/curateme/.config/claude-code/auth.json', 'w') as f: json.dump(d, f)
"
# Trigger the task:
docker exec -it curateme-backend-gateway python -c "
from src.tasks.claude_oauth_refresh import refresh_claude_oauth_token
print(refresh_claude_oauth_token.apply().get())
"
# Expect: error='refresh_token_revoked' + Teams reauth_required card
# Restore:
cp /tmp/auth.json.bak ~/.config/claude-code/auth.jsonTest 3 — Simulate Mac silence
# On the Mac:
launchctl unload ~/Library/LaunchAgents/com.curateme.claude-auth-push.plist
# Wait 60 min. Teams should show mac_silent.
# Reload:
launchctl load ~/Library/LaunchAgents/com.curateme.claude-auth-push.plistTest 4 — Simulate fallback key activation
# On the VPS, temporarily break auth.json:
mv ~/.config/claude-code/auth.json /tmp/auth.json.bak
# With ANTHROPIC_FALLBACK_KEY set, gateway calls should still succeed.
curl -s https://api.curate-me.ai/v1/anthropic/v1/messages \
-H "X-CM-API-Key: $TEST_KEY" \
-d '{"model":"claude-haiku-4-5-20251001","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}'
# Check logs:
docker logs curateme-backend-gateway 2>&1 | grep gateway_using_fallback_key | tail -5
# Restore:
mv /tmp/auth.json.bak ~/.config/claude-code/auth.jsonRelated files
| File | Role |
|---|---|
| services/backend/src/services/llm_auth.py | refresh_token_from_keychain(), get_max_sub_token(), get_max_sub_token_with_source(), refresh_token_revoked() |
| services/backend/src/tasks/claude_oauth_refresh.py | Authoritative 15-min refresh task |
| services/backend/src/tasks/max_sub_health_check.py | 5-min capability probe (token + round-trip) |
| services/backend/src/tasks/token_push_dead_man.py | 15-min dead-man for Mac silence |
| services/backend/src/api/routes/admin_system_ops.py | /token-push, /token-push/status, /token-push/last-seen endpoints + alert dedup |
| services/backend/src/config/anthropic_oauth.py | Centralized endpoint + client_id constants |
| scripts/push-claude-auth.sh | Mac-side passive-unless-reauth launchd job |
Historical reference
- Issue #1775 — the 36-hour silent outage on 2026-04-17 that prompted this system. Full root-cause analysis in the issue body.
~/.claude/scheduled-tasks/check-claude-token/SKILL.md— original source of the correct endpoint/client_id/probe combinations.