Runbook: Interactive Claude Session Operations
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-06-04
Severity trigger: SEV3 for a dangling session leaking spend or a stuck container; SEV2 if the cost sidecar fails open (sessions running without a budget hard-stop).
Required access: Dashboard admin or platform-staff role; SSH to the runner VPS (curateme@178.105.1.95); Mongo + Redis access on the platform VPS.
Related design: docs/strategy/INTERACTIVE_CLAUDE_CODE_RUNNER_TEAMS.md (§7 control primitive, §10 GitHub, §11 governance/auth).
Related runbook: Rotate Claude Code OAT, the OAT rotation mechanics this runbook builds on.
What this covers
The Interactive Claude Code Session Broker provides a live, steerable Claude Code session inside a per-org runner container that a human can watch and drive from either the dashboard (a Claude tab on the Live Session Console) or a Microsoft Teams thread. It is a gated preview.
This runbook is the operator’s guide to running it: turning the feature on per org, configuring the per-org GitHub App installation, the $0 OAT-direct auth path and its compensating controls, OAT rotation for interactive containers, finding and terminating a dangling session, and the streaming/takeover behavior operators will be asked about.
The session itself is owned by the runner control plane
(src/services/runner_control_plane/interactive_session.py). The autopilot
interactive_claude_session template is a launcher shim only — it calls
InteractiveSessionService.create_session() and returns the session_id +
stream_url. It owns no container.
1. Enabling the preview (INTERACTIVE_CLAUDE_SESSION)
The feature is gated by FeatureFlag.INTERACTIVE_CLAUDE_SESSION (default
OFF). Both the launcher shim and the gateway routes check it, so a session
cannot be started — and the Claude tab is hidden — until you grant it.
Per-org grant (preferred)
Flip the per-org override in the B2B Mongo org_feature_flag_overrides
collection. This is the same mechanism the webwright preview uses, and it takes
effect on the next request (no redeploy):
ssh curateme@178.105.8.25 \
'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
db.org_feature_flag_overrides.updateOne(
{ org_id: "org_DESIGN_PARTNER", flag_name: "interactive_claude_session" },
{ $set: { enabled: true, updated_at: new Date() } },
{ upsert: true }
)'\'''
To revoke, set enabled: false (do not delete the doc: an explicit false
override is auditable and beats falling back to the global default).
Global / environment toggle (staging, demos)
For a whole environment, set the env var. It is consulted only when no per-org override exists for the org:
FF_INTERACTIVE_CLAUDE_SESSION=true
Restart backend-gateway (and backend-b2b if you want the command center to
list it) so the new env is picked up. Prefer per-org grants in production:
the global toggle exposes the preview to every tenant.
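The precedence between the two toggles can be sketched as follows. This is a minimal illustration, not the platform's code: the function name is_flag_enabled and the set of accepted truthy strings are assumptions; only the override document shape and the env var name come from this runbook.

```python
from typing import Mapping, Optional


def is_flag_enabled(
    override_doc: Optional[Mapping],
    env: Mapping[str, str],
    env_var: str = "FF_INTERACTIVE_CLAUDE_SESSION",
) -> bool:
    """Resolve the preview flag for one org (hypothetical helper)."""
    # An explicit per-org override wins, whether it is true or false.
    if override_doc is not None and "enabled" in override_doc:
        return bool(override_doc["enabled"])
    # No override: fall back to the environment-wide toggle, default OFF.
    # The accepted truthy spellings here are an assumption.
    return env.get(env_var, "").strip().lower() in {"1", "true", "yes"}
```

Under this model, an org with an explicit enabled: false override stays off even when the environment toggle is on, which is why revoking by setting false is safer than deleting the doc.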
Verify the grant
# Should return enabled:true for the org you granted.
ssh curateme@178.105.8.25 \
'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
db.org_feature_flag_overrides.findOne(
{ org_id: "org_DESIGN_PARTNER", flag_name: "interactive_claude_session" })'\'''
In the dashboard, the Claude tab on /runners/[runnerId]/live appears for
granted orgs and is absent otherwise.
Where a session can run (transports)
A session needs a place to host the persistent claude subprocess. Two
transports are supported (#2638):
- local_docker (co-located): the gateway runs docker exec -i straight into the container (it owns the docker socket). The default for development / single-VPS setups.
- byovm / hetzner_vps (remote): the container lives on the separate runners-VPS, so the gateway drives it through the co-located cm-runner agent over the WS bridge. The agent spawns the claude subprocess inside the container and streams its stdio back. Prerequisites: (1) the runner’s agent is connected to the gateway WS bridge (GET /gateway/admin/runners/ws/...); (2) the runner host has CLAUDE_CODE_OAUTH_TOKEN in the agent’s env (the gateway never ships the shared Max-sub token over the bridge, so the agent injects its own, the same $0 mechanism byovm autopilot uses); (3) the cm-runner build is new enough to understand interactive.start (older agents simply don’t ack, so the start fails fast with an actionable error).
firecracker runners are not supported (their exec runs on a host vsock stub).
If a remote start fails with “agent … is not connected” or “did not
acknowledge interactive.start”, check the agent’s WS connection
(GET /gateway/admin/runners/bridge/status) and that the runner is running a
current cm-runner.
2. Configuring the per-org GitHub App installation
A session clones the target repo using a short-lived, repo-scoped GitHub App
installation token — never a long-lived PAT in the container, never a token
written to the session doc (only the github_installation_id reference is
persisted). See design §10 for the full credential flow.
Resolution order at create_session
- GitHub App installation token (preferred): resolved from github_app_settings keyed by org_id, minting a 10-minute RS256 JWT from the platform GITHUB_APP_ID + GITHUB_APP_PRIVATE_KEY and exchanging it for a ~1h repository-scoped token at /app/installations/{id}/access_tokens.
- PAT fallback: the org’s CredentialVault entry named github_token.
- Dev-only GITHUB_TOKEN env (never use in production).
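The resolution order above amounts to a first-match chain. The sketch below only models which source is chosen; it does not mint or exchange tokens. The names ResolvedCredential and resolve_github_credential, and the is_production guard on the dev-only fallback, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedCredential:
    source: str     # "github_app", "vault_pat", or "dev_env"
    reference: str  # an identifier for the source, never the secret itself


def resolve_github_credential(
    installation_id: Optional[str],
    vault_pat_present: bool,
    dev_env_token_present: bool,
    is_production: bool,
) -> Optional[ResolvedCredential]:
    """Pick the credential source in the documented order (hypothetical)."""
    # 1. Preferred: the per-org GitHub App installation.
    if installation_id:
        return ResolvedCredential("github_app", installation_id)
    # 2. Fallback: the org vault entry named github_token.
    if vault_pat_present:
        return ResolvedCredential("vault_pat", "github_token")
    # 3. Dev only: GITHUB_TOKEN from the environment. Refusing it in
    #    production is an assumption of this sketch.
    if dev_env_token_present and not is_production:
        return ResolvedCredential("dev_env", "GITHUB_TOKEN")
    return None
```

Returning a reference rather than the token mirrors the rule that only github_installation_id is persisted on the session doc.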
Configure the installation per org
The org must have installed the Curate-Me GitHub App on the target repo, and
the installation must be recorded in github_app_settings. This is populated
by the PR-review path (GitHubPRReviewer.set_installation); if an org uses
interactive sessions without ever wiring PR review, set it explicitly:
ssh curateme@178.105.8.25 \
'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
db.github_app_settings.updateOne(
{ org_id: "org_DESIGN_PARTNER" },
{ $set: { installation_id: "INSTALLATION_ID", updated_at: new Date() } },
{ upsert: true }
)'\'''
The caller can also pass github_installation_id directly in the launch inputs
to override the stored value for a single session.
Caveat (documented in design §10): the vault PAT fallback is keyed by
(org_id, agent_id), and interactive-session runner_ids are newly minted, so a token stored under an old/other agent_id will not be found. The provisioner looks under the stable agent_ids (global/default) and scans the org’s vault for any credential named github_token regardless of agent_id. If clone fails with “credential not found”, verify a github_token credential exists in the org vault, or prefer the App installation path above.
Scopes
For clone-only sessions the App needs contents:read + metadata:read. A
session that pushes needs contents:write / pull_requests:write — the
current App is scoped for PR review, not push (design §18 open question). If a
session needs to push and the token lacks the scope, it will fail at push time,
not clone time.
3. Auth path — interactive Claude Code runs $0 on the OAT
There is an unavoidable constraint (design §11, verified in worker.py and the
prod E2E): the Max-subscription OAT runs Claude Code LLM calls at $0 but
must hit https://api.anthropic.com directly. Routing the OAT — or any key —
through our gateway breaks the Claude Code CLI’s OAuth handshake (Not logged in). The CLI cannot run through the gateway proxy at all, so there is no
governed gateway-key path for interactive sessions to fall back to.
Every interactive session uses OAT-direct (single flag, #2650)
A session injects the long-lived Max-sub OAT as CLAUDE_CODE_OAUTH_TOKEN with
ANTHROPIC_BASE_URL=https://api.anthropic.com (auth_path: oat_direct on the
doc) — the same $0 path autopilot uses. The single
FeatureFlag.INTERACTIVE_CLAUDE_SESSION flag gates the whole feature; there is
no separate OAT-bypass flag (removed in #2650). Turning the feature on for an
org is all that’s needed — see §1.
The gateway-key auth_path value is retained only as forward-compat machinery. create_session refuses to boot a gateway-key session (an explicit oat_bypass: false) with an actionable error, because it would start a claude subprocess that dies seconds later (subprocess_dead). If governed gateway-proxy support for the Claude Code CLI ever lands, relaxing _assert_auth_path_viable is the single change that re-enables it.
Compensating controls (verify these are working)
Because individual calls go direct to Anthropic and skip the 15-stage governance chain, three session-level controls stand in for per-call governance:
- Token-count cost sidecar: the broker parses input_tokens / output_tokens from each stream-json usage event and keeps a process-local blended-rate estimate (~$3/$15 per Mtok; _accumulate_cost in interactive_session.py). This estimate is not authoritative billing: it is not written to cost_recorder’s Redis accumulators and does not create a Mongo usage record. Its job is to drive the per-session budget cap: after each completed turn (result event), _on_turn_complete persists the running estimate onto the interactive_sessions doc (cost_usd_accumulated) and terminates the session when it reaches per_session_budget_usd (default $5). On the OAT-direct path there is no per-call gateway record, so this session-level estimate + budget cap is the only cost control, which is why a sidecar that fails open is a SEV2.
- Rolling output PII scan every N turns (default 5) using security_scanner.py patterns.
- MCP-level HITL stays fully active: curate-me-mcp authenticates back to the gateway; those MCP calls still traverse governance (curate_request_approval HITL + cost + rate limit all fire).
If the sidecar fails open (sessions running with no budget hard-stop), treat it as SEV2: revoke interactive_claude_session for the affected org immediately (set enabled: false), then terminate any live session (§5) before investigating.
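To make the budget arithmetic concrete, here is a minimal sketch of a blended-rate estimate with a hard cap. It is not the code in interactive_session.py: the SessionCost class and its method names are assumptions, while the $3/$15 per Mtok rates and the $5 default cap come from the text above.

```python
from dataclasses import dataclass

INPUT_USD_PER_MTOK = 3.0    # blended input rate quoted in this runbook
OUTPUT_USD_PER_MTOK = 15.0  # blended output rate quoted in this runbook


@dataclass
class SessionCost:
    """Process-local cost estimate for one session (hypothetical model)."""

    per_session_budget_usd: float = 5.0
    cost_usd_accumulated: float = 0.0

    def accumulate(self, input_tokens: int, output_tokens: int) -> float:
        # Convert one usage event's token counts to dollars and add them.
        self.cost_usd_accumulated += (
            input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK
        ) / 1_000_000
        return self.cost_usd_accumulated

    def over_budget(self) -> bool:
        # The session is terminated once the estimate reaches the cap.
        return self.cost_usd_accumulated >= self.per_session_budget_usd
```

For example, one million input tokens cost about $3.00 under these rates, and a further 200,000 output tokens add another $3.00, which takes the session past the default $5 cap.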
4. OAT rotation for interactive containers
Interactive session containers read the Max-sub token from a read-only
bind-mount of the host auth.json, at the runner-image path
(/home/runner/.config/claude-code/auth.json) — not the backend-service
path. So a rotated host auth.json is visible to the container’s filesystem
immediately, but a Claude subprocess that already started has the token’s
derived access token cached in-process until its next 401 or expiry.
When you rotate the OAT (see the Rotate Claude Code OAT
runbook), the helper script also refreshes auth.json inside running
interactive session containers — it overwrites the read-only-mounted file in
place and is idempotent (safe to re-run; a no-op when no interactive session
containers are running).
To rotate and cover interactive sessions in one shot:
./scripts/rotate-claude-code-oat.sh
To confirm the script saw the interactive containers, watch for the
Refreshing auth.json in N interactive session container(s) step in its
output. If a long session needs the new token immediately (it is mid-turn
against a dead OAT), terminate and relaunch it (§5) — the relaunch reads the
fresh token at boot.
5. Finding and terminating a dangling session
A “dangling” session is one whose interactive_sessions doc is not terminal
(control_state not in failed/terminated) but whose container/subprocess is
gone, stuck, or leaking spend. The 60s reconciler (docker inspect liveness)
should catch most of these and flip them to failed, but operators sometimes
need to act manually.
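The simplest dangling case, a non-terminal doc whose container no longer exists, can be checked by comparing the two listings from the commands in this section. The sketch below is an illustration only: find_dangling is a hypothetical helper, it does not detect stuck or spend-leaking sessions whose container is still up, and it is not the reconciler's implementation.

```python
from typing import Iterable, List, Mapping

TERMINAL_STATES = {"failed", "terminated"}


def find_dangling(
    session_docs: Iterable[Mapping],
    live_container_names: Iterable[str],
) -> List[str]:
    """Return ids of non-terminal sessions with no live container."""
    live = set(live_container_names)
    return [
        doc["_id"]
        for doc in session_docs
        # Non-terminal according to Mongo...
        if doc.get("control_state") not in TERMINAL_STATES
        # ...but absent from the docker ps output on the runner VPS.
        and doc.get("container_name") not in live
    ]
```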
Find live sessions
# All non-terminal sessions, newest first.
ssh curateme@178.105.8.25 \
'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
db.interactive_sessions.find(
{ control_state: { $nin: ["failed", "terminated"] } },
{ _id:1, org_id:1, runner_id:1, container_name:1, control_state:1,
auth_path:1, cost_usd_accumulated:1, created_at:1 }
).sort({ created_at: -1 }).limit(50).toArray()'\'''
# The matching containers on the runner VPS.
ssh curateme@178.105.1.95 'docker ps --filter "name=claude-session-" \
--format "table {{.Names}}\t{{.Status}}\t{{.CreatedAt}}"'
Terminate cleanly (preferred)
Use the API so the platform sends proc.terminate() to the subprocess, marks
the doc terminated, and releases the control token. As platform staff
(cross-org reader) or the owning org’s admin:
curl -sS -X POST \
"https://api.curate-me.ai/gateway/admin/runners/${RUNNER_ID}/sessions/${SESSION_ID}/claude/terminate" \
-H "X-CM-API-Key: ${ADMIN_KEY}"
Force-terminate the container (last resort)
If the API path is unavailable (broker crashed), kill the container directly, then flip the doc so the UI and budget loop stop tracking it:
ssh curateme@178.105.1.95 "docker rm -f ${CONTAINER_NAME}"
ssh curateme@178.105.8.25 \
"docker exec curateme-mongo mongosh curate_me --quiet --eval '
db.interactive_sessions.updateOne(
{ _id: \"${SESSION_ID}\", org_id: \"${ORG_ID}\" },
{ \$set: { control_state: \"terminated\", terminated_at: new Date() } }
)'"
Always include org_id in the Mongo filter: every interactive_sessions query is compound-filtered by { _id, org_id } (the PRIMARY tenant key, audit #2576). A force-update without org_id is a tenant-isolation anti-pattern; don’t copy it into tooling.
Clean up stuck control tokens (rare)
If a session shows human_driving but nobody is connected, the Redis control
token may be orphaned. It carries a TTL and reverts to autonomous on expiry;
to clear it now:
ssh curateme@178.105.8.25 \
"docker exec curateme-redis redis-cli DEL interactive:control_token:${SESSION_ID}"
6. Streaming & takeover behavior (operator FAQ)
- Transport. Output streams over SSE (GET /claude/stream), fed from the Redis pub/sub channel interactive:events:{sid} with a replay buffer (interactive:replay:{sid}, last ~500 events). Reconnect + Last-Event-ID replay are handled by the dashboard’s existing SSE hook, so a browser refresh mid-stream resumes without data loss, provided the missed events are still in the replay buffer.
- Heartbeats. The broker emits a heartbeat event when stdout is idle (~10s) so proxies don’t close the connection and the UI shows “alive”.
- Control token (takeover). Exactly one surface (dashboard / Teams / API) holds the control token at a time, claimed via SET EX NX in Redis. While a human is driving (human_driving), turns from any other surface are rejected with 409. “Take Control” claims the token; if it’s held, the button becomes “Request Control” (notifies the holder) and the send box goes read-only with a banner naming the holder.
- States. starting → autonomous ⇄ human_driving, with paused (subprocess kept alive, context preserved, no turns accepted), awaiting_approval (a tool-use/MCP permission prompt is pending; all turns are blocked until approve/deny from either surface), and terminal failed / terminated.
- Approvals (HITL). A permission_prompt puts the session in awaiting_approval and surfaces an Approve/Deny modal (dashboard) or Adaptive Card (Teams). First decision wins (NX). MCP-level HITL stays active on the OAT-direct path (§3): those MCP calls still traverse the gateway.
- Teams parity. A reply in the bound Teams thread is delivered to the session’s stdin; !takeover / !pause / !resume / !release and !approve / !deny (or tap-to-approve) work from Teams. Prereq: teams_integrations.agent_channel must be set per org or inbound Teams routing silently no-ops.
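The control-token behavior described above follows the usual set-if-absent-with-expiry pattern. The sketch below models that pattern in memory so the takeover rules can be reasoned about without Redis. The ControlTokens class, the injectable clock, the TTL value, and the 200 status for an accepted turn are assumptions; the 409 rejection and the single-holder rule come from this section.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ControlTokens:
    """In-memory stand-in for SET key value NX EX ttl (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._held: Dict[str, Tuple[str, float]] = {}

    def holder(self, session_id: str) -> Optional[str]:
        entry = self._held.get(session_id)
        if entry is None:
            return None
        surface, expires_at = entry
        if expires_at <= self._clock():
            # TTL elapsed: the token is gone and control is unclaimed.
            del self._held[session_id]
            return None
        return surface

    def claim(self, session_id: str, surface: str, ttl_s: float) -> bool:
        # NX semantics: succeed only when nobody currently holds the token.
        if self.holder(session_id) is not None:
            return False
        self._held[session_id] = (surface, self._clock() + ttl_s)
        return True

    def turn_status(self, session_id: str, surface: str) -> int:
        # Turns from a surface other than the holder are rejected with 409.
        current = self.holder(session_id)
        return 409 if current is not None and current != surface else 200
```

Because the token expires on its own, an orphaned claim eventually clears without operator action; the manual DEL in §5 only shortens the wait.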
7. Close the loop
- Record any manual termination or flag change in the incident channel with the org_id / session_id.
- If you revoked interactive_claude_session for a cost incident, file a follow-up to investigate the cost sidecar before re-granting.
- If a runner left orphaned claude-session-* containers, confirm the 60s reconciler flipped their docs to failed; if not, that’s a reconciler bug, so open an issue.
Related runbooks
- Rotate Claude Code OAT — OAT rotation mechanics (interactive containers included).
- Runner Operations — broader runner container lifecycle and triage.
- Runner Stuck / Not Responding — when the host container, not the session, is the problem.
- Slack Integration Setup — control-surface bridge precedent.