Runbook: Interactive Claude Session Operations

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-06-04
Severity trigger: SEV3 for a dangling session leaking spend or a stuck container; SEV2 if the cost sidecar fails open (sessions running without a budget hard-stop).
Required access: Dashboard admin or platform-staff role; SSH to the runner VPS (curateme@178.105.1.95); Mongo + Redis access on the platform VPS.
Related design: docs/strategy/INTERACTIVE_CLAUDE_CODE_RUNNER_TEAMS.md (§7 control primitive, §10 GitHub, §11 governance/auth).
Related runbook: Rotate Claude Code OAT (the OAT rotation mechanics this runbook builds on).

What this covers

The Interactive Claude Code Session Broker is a live, steerable Claude Code session inside a per-org runner container that a human can watch and drive from either the dashboard (a Claude tab on the Live Session Console) or a Microsoft Teams thread. It is a gated preview.

This runbook is the operator’s guide to running it: turning the feature on per org, configuring the per-org GitHub App installation, the $0 OAT-direct auth path and its compensating controls, OAT rotation for interactive containers, finding and terminating a dangling session, and the streaming/takeover behavior operators will be asked about.

The session itself is owned by the runner control plane (src/services/runner_control_plane/interactive_session.py). The autopilot interactive_claude_session template is a launcher shim only — it calls InteractiveSessionService.create_session() and returns the session_id + stream_url. It owns no container.

1. Enabling the preview (`INTERACTIVE_CLAUDE_SESSION`)

The feature is gated by FeatureFlag.INTERACTIVE_CLAUDE_SESSION (default OFF). Both the launcher shim and the gateway routes check it, so a session cannot be started — and the Claude tab is hidden — until you grant it.

Per-org grant (preferred)

Flip the per-org override in the B2B Mongo org_feature_flag_overrides collection. This is the same mechanism the webwright preview uses, and it takes effect on the next request (no redeploy):


ssh curateme@178.105.8.25 \
  'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
    db.org_feature_flag_overrides.updateOne(
      { org_id: "org_DESIGN_PARTNER", flag_name: "interactive_claude_session" },
      { $set: { enabled: true, updated_at: new Date() } },
      { upsert: true }
    )'\'''

To revoke, set enabled: false (do not delete the doc — an explicit false override is auditable and beats falling back to the global default).

Global / environment toggle (staging, demos)

For a whole environment, set the env var. It is consulted only when no per-org override exists, so an explicit per-org override always wins:


FF_INTERACTIVE_CLAUDE_SESSION=true

Restart backend-gateway (and backend-b2b if you want the command center to list it) so the new env is picked up. Prefer per-org grants in production — the global toggle exposes the preview to every tenant.
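
The precedence between the per-org override and the environment toggle can be summarized in a few lines. This is a minimal sketch, not the platform's flag service: the function name resolve_flag and the exact parsing of the env value (a case-insensitive "true") are assumptions made for illustration.

```python
from typing import Mapping, Optional


def resolve_flag(override: Optional[Mapping[str, object]],
                 env: Mapping[str, str]) -> bool:
    """Return the effective flag value for one org (hypothetical helper).

    override: the org's org_feature_flag_overrides doc, or None if absent.
    env:      the process environment (e.g. os.environ).
    """
    # 1. An explicit per-org override always wins, including enabled: false.
    if override is not None and "enabled" in override:
        return bool(override["enabled"])
    # 2. Otherwise fall back to the environment-wide toggle.
    #    Assumption: the env value is compared case-insensitively to "true".
    raw = env.get("FF_INTERACTIVE_CLAUDE_SESSION", "")
    # 3. With neither set, the flag stays at its default (OFF).
    return raw.strip().lower() == "true"
```

This is why revoking with enabled: false is safer than deleting the doc: with the doc deleted, an environment where the global toggle is on would silently re-enable the org.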

Verify the grant


# Should return enabled:true for the org you granted.
ssh curateme@178.105.8.25 \
  'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
    db.org_feature_flag_overrides.findOne(
      { org_id: "org_DESIGN_PARTNER", flag_name: "interactive_claude_session" })'\'''

In the dashboard, the Claude tab on /runners/[runnerId]/live appears for granted orgs and is absent otherwise.

Where a session can run (transports)

A session needs a place to host the persistent claude subprocess. Two transports are supported (#2638):

local_docker (co-located): the gateway runs docker exec -i straight into the container (it owns the docker socket). This is the default for development and single-VPS setups.
byovm / hetzner_vps (remote): the container lives on the separate runners-VPS, so the gateway drives it through the co-located cm-runner agent over the WS bridge. The agent spawns the claude subprocess inside the container and streams its stdio back. Prerequisites: (1) the runner’s agent is connected to the gateway WS bridge (GET /gateway/admin/runners/ws/...); (2) the runner host has CLAUDE_CODE_OAUTH_TOKEN in the agent’s env, because the gateway never ships the shared Max-sub token over the bridge and the agent injects its own (the same $0 mechanism byovm autopilot uses); (3) the cm-runner build is new enough to understand interactive.start (older agents do not ack, so the start fails fast with an actionable error).
firecracker runners are not supported (their exec runs on a host vsock stub).

If a remote start fails with “agent … is not connected” or “did not acknowledge interactive.start”, check the agent’s WS connection (GET /gateway/admin/runners/bridge/status) and that the runner is running a current cm-runner.
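
The transport rules above amount to a small dispatch with two preflight checks for remote runners. The sketch below is illustrative only: plan_session_start, its parameters, and the returned labels are hypothetical names, and the error strings merely echo the messages quoted above.

```python
REMOTE_TRANSPORTS = {"byovm", "hetzner_vps"}


def plan_session_start(transport: str,
                       agent_connected: bool,
                       agent_acked: bool = True) -> str:
    """Decide how the claude subprocess would be hosted (hypothetical helper)."""
    if transport == "local_docker":
        # Co-located: the gateway owns the docker socket and execs directly.
        return "docker_exec"
    if transport in REMOTE_TRANSPORTS:
        # Remote: everything goes through the cm-runner agent on the WS bridge.
        if not agent_connected:
            raise RuntimeError("agent is not connected to the gateway WS bridge")
        if not agent_acked:
            # Older cm-runner builds do not understand interactive.start.
            raise RuntimeError("agent did not acknowledge interactive.start")
        return "agent_bridge"
    # firecracker and anything unknown is rejected up front.
    raise ValueError(f"transport {transport!r} is not supported")
```

Failing before any subprocess is spawned is what keeps the remote failure modes actionable: the operator sees which prerequisite is missing rather than a dead session.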

2. Configuring the per-org GitHub App installation

A session clones the target repo using a short-lived, repo-scoped GitHub App installation token — never a long-lived PAT in the container, never a token written to the session doc (only the github_installation_id reference is persisted). See design §10 for the full credential flow.

Resolution order at `create_session`

GitHub App installation token (preferred) — resolved from github_app_settings keyed by org_id, minting a 10-minute RS256 JWT from the platform GITHUB_APP_ID + GITHUB_APP_PRIVATE_KEY and exchanging it for a ~1h repository-scoped token at /app/installations/{id}/access_tokens.
PAT fallback — the org’s CredentialVault entry named github_token.
Dev-only GITHUB_TOKEN env (never use in production).
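
The resolution order can be read as a first-match chain. The following is a minimal sketch under stated assumptions: resolve_github_credential and the source labels are hypothetical, and each resolver is a stand-in callable rather than the real App-token, vault, or env lookup.

```python
from typing import Callable, Optional, Sequence, Tuple

# A resolver returns a token string, or None when its source has nothing.
Resolver = Callable[[], Optional[str]]


def resolve_github_credential(
        resolvers: Sequence[Tuple[str, Resolver]]) -> Tuple[str, str]:
    """Return (source, token) from the first resolver that yields a token."""
    for source, resolve in resolvers:
        token = resolve()
        if token:
            return source, token
    raise LookupError("credential not found")


# Example chain in the documented order; the lambdas are placeholders.
chain = [
    ("github_app", lambda: None),           # no installation recorded
    ("vault_pat", lambda: "ghp_example"),   # placeholder value, not a real PAT
    ("dev_env", lambda: None),              # dev only; never rely on this in prod
]
source, token = resolve_github_credential(chain)
```

Returning the source alongside the token makes it easy to log which path a session actually used without ever logging the token itself.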

Configure the installation per org

The org must have installed the Curate-Me GitHub App on the target repo, and the installation must be recorded in github_app_settings. This is populated by the PR-review path (GitHubPRReviewer.set_installation); if an org uses interactive sessions without ever wiring PR review, set it explicitly:


ssh curateme@178.105.8.25 \
  'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
    db.github_app_settings.updateOne(
      { org_id: "org_DESIGN_PARTNER" },
      { $set: { installation_id: "INSTALLATION_ID", updated_at: new Date() } },
      { upsert: true }
    )'\'''

The caller can also pass github_installation_id directly in the launch inputs to override the stored value for a single session.

Caveat (documented in design §10): the vault PAT fallback is keyed by (org_id, agent_id), and interactive-session runner_ids are newly minted, so a direct lookup will not find a token stored under an old or different agent_id. To compensate, the provisioner looks under the stable agent_ids (global / default) and scans the org’s vault for any credential named github_token regardless of agent_id. If clone fails with “credential not found”, verify that a github_token credential exists in the org vault, or prefer the App installation path above.

Scopes

For clone-only sessions the App needs contents:read + metadata:read. A session that pushes needs contents:write / pull_requests:write — the current App is scoped for PR review, not push (design §18 open question). If a session needs to push and the token lacks the scope, it will fail at push time, not clone time.
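
Because a missing scope only surfaces at push time, it can help to compare the token's granted permissions against what the session intends to do before launching. This is a sketch, not platform code: missing_permissions and the two requirement maps are hypothetical, and it assumes the granted permissions are available as a name-to-level map (the shape GitHub returns in the permissions field of an installation access token response).

```python
from typing import Dict

# Ordering of GitHub permission levels, lowest to highest.
LEVELS = {"read": 1, "write": 2, "admin": 3}

CLONE_ONLY = {"contents": "read", "metadata": "read"}
PUSH = {"contents": "write", "pull_requests": "write"}


def missing_permissions(granted: Dict[str, str],
                        required: Dict[str, str]) -> Dict[str, str]:
    """Return the subset of required permissions the token does not satisfy."""
    missing = {}
    for name, level in required.items():
        have = LEVELS.get(granted.get(name, ""), 0)  # absent counts as none
        if have < LEVELS[level]:
            missing[name] = level
    return missing


# A token scoped for PR review: can clone, cannot push.
review_token = {"contents": "read", "metadata": "read", "pull_requests": "write"}
```

With the review_token above, the clone-only check passes and the push check reports that contents:write is missing.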

3. Auth path — interactive Claude Code runs $0 on the OAT

There is an unavoidable constraint (design §11, verified in worker.py and the prod E2E): the Max-subscription OAT runs Claude Code LLM calls at $0 but must hit https://api.anthropic.com directly. Routing the OAT — or any key — through our gateway breaks the Claude Code CLI’s OAuth handshake (Not logged in). The CLI cannot run through the gateway proxy at all, so there is no governed gateway-key path for interactive sessions to fall back to.

Every interactive session uses OAT-direct (single flag, #2650)

A session injects the long-lived Max-sub OAT as CLAUDE_CODE_OAUTH_TOKEN with ANTHROPIC_BASE_URL=https://api.anthropic.com (auth_path: oat_direct on the doc) — the same $0 path autopilot uses. The single FeatureFlag.INTERACTIVE_CLAUDE_SESSION flag gates the whole feature; there is no separate OAT-bypass flag (removed in #2650). Turning the feature on for an org is all that’s needed — see §1.

The gateway-key auth_path value is retained only as forward-compat machinery. create_session refuses to boot a gateway-key session (an explicit oat_bypass: false) with an actionable error, because it would start a claude subprocess that dies seconds later (subprocess_dead). If governed gateway-proxy support for the Claude Code CLI ever lands, relaxing _assert_auth_path_viable is the single change that re-enables it.
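
The effect of the OAT-direct rule on the container environment can be sketched as follows. This is not the real _assert_auth_path_viable or create_session; build_session_env is a hypothetical helper that only illustrates the two env vars named above and the refusal of an explicit oat_bypass: false.

```python
from typing import Dict

ANTHROPIC_DIRECT = "https://api.anthropic.com"


def build_session_env(oat: str, oat_bypass: bool = True) -> Dict[str, str]:
    """Build the auth env for an interactive session (hypothetical helper)."""
    if not oat_bypass:
        # A gateway-key session would boot and then die (subprocess_dead),
        # so refuse up front with an actionable message.
        raise ValueError(
            "gateway-key sessions are not viable: the Claude Code CLI cannot "
            "authenticate through the gateway proxy; use auth_path oat_direct"
        )
    if not oat:
        raise ValueError("CLAUDE_CODE_OAUTH_TOKEN is empty")
    return {
        "CLAUDE_CODE_OAUTH_TOKEN": oat,
        "ANTHROPIC_BASE_URL": ANTHROPIC_DIRECT,  # direct, never the gateway
    }
```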

Compensating controls (verify these are working)

Because individual calls go direct to Anthropic and skip the 15-stage governance chain, three session-level controls stand in for per-call governance:

Token-count cost sidecar — the broker parses input_tokens / output_tokens from each stream-json usage event and keeps a process-local blended-rate estimate (~$3/$15 per Mtok; _accumulate_cost in interactive_session.py). This estimate is not authoritative billing — it is not written to cost_recorder’s Redis accumulators and does not create a Mongo usage record. Its job is to drive the per-session budget cap: after each completed turn (result event), _on_turn_complete persists the running estimate onto the interactive_sessions doc (cost_usd_accumulated) and terminates the session when it reaches per_session_budget_usd (default $5). On the OAT-direct path there is no per-call gateway record, so this session-level estimate + budget cap is the only cost control — which is why a sidecar that fails open is a SEV2.
Rolling output PII scan every N turns (default 5) using security_scanner.py patterns.
MCP-level HITL stays fully active — curate-me-mcp authenticates back to the gateway; those MCP calls still traverse governance (curate_request_approval HITL + cost + rate limit all fire).

If the sidecar fails open (sessions running with no budget hard-stop), treat it as SEV2: revoke interactive_claude_session for the affected org immediately (set enabled:false), then terminate any live session (§5) before investigating.
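
To make the budget hard-stop concrete, here is a minimal sketch of the estimate and the cap check. It is not the code in interactive_session.py: the CostSidecar class and its method names are hypothetical, and the rates are the approximate blended figures quoted above ($3 per million input tokens, $15 per million output tokens).

```python
from dataclasses import dataclass

INPUT_USD_PER_MTOK = 3.0    # approximate blended input rate
OUTPUT_USD_PER_MTOK = 15.0  # approximate blended output rate


@dataclass
class CostSidecar:
    per_session_budget_usd: float = 5.0
    cost_usd_accumulated: float = 0.0

    def accumulate(self, usage: dict) -> None:
        """Add one usage event's estimated cost to the running total."""
        in_tok = int(usage.get("input_tokens", 0))
        out_tok = int(usage.get("output_tokens", 0))
        self.cost_usd_accumulated += (
            in_tok * INPUT_USD_PER_MTOK + out_tok * OUTPUT_USD_PER_MTOK
        ) / 1_000_000

    def should_terminate(self) -> bool:
        """Checked after each completed turn: stop once the cap is reached."""
        return self.cost_usd_accumulated >= self.per_session_budget_usd


sidecar = CostSidecar()
turn = {"input_tokens": 200_000, "output_tokens": 50_000}  # about $1.35 per turn
for _ in range(3):
    sidecar.accumulate(turn)
stop_after_three = sidecar.should_terminate()   # about $4.05, under the cap
sidecar.accumulate(turn)
stop_after_four = sidecar.should_terminate()    # about $5.40, cap reached
```

A turn with 200,000 input and 50,000 output tokens is estimated at 0.60 + 0.75 = 1.35 USD, so with the default $5 cap the session is stopped after the fourth such turn. Note that the check runs per completed turn, so the final total can overshoot the cap by up to one turn's cost.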

4. OAT rotation for interactive containers

Interactive session containers read the Max-sub token from a read-only bind-mount of the host auth.json, at the runner-image path (/home/runner/.config/claude-code/auth.json) — not the backend-service path. So a rotated host auth.json is visible to the container’s filesystem immediately, but a Claude subprocess that already started has the token’s derived access token cached in-process until its next 401 or expiry.

When you rotate the OAT (see the Rotate Claude Code OAT runbook), the helper script also refreshes auth.json inside running interactive session containers — it overwrites the read-only-mounted file in place and is idempotent (safe to re-run; a no-op when no interactive session containers are running).

To rotate and cover interactive sessions in one shot:


./scripts/rotate-claude-code-oat.sh

To confirm the script saw the interactive containers, watch for the Refreshing auth.json in N interactive session container(s) step in its output. If a long session needs the new token immediately (it is mid-turn against a dead OAT), terminate and relaunch it (§5) — the relaunch reads the fresh token at boot.
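
Why "in place" matters: a single-file bind mount follows the file's inode, so a writer that replaces the file by renaming a new one over it leaves running containers looking at the old content. The sketch below only demonstrates that filesystem behavior on a scratch file; it makes no claim about how rotate-claude-code-oat.sh is implemented, and both function names are hypothetical.

```python
import os
import tempfile


def overwrite_in_place(path: str, data: str) -> None:
    """Truncate and rewrite the existing file; the inode is unchanged."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(data)


def replace_by_rename(path: str, data: str) -> None:
    """Write a sibling temp file and rename it over the target; new inode."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(data)
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as scratch:
    auth = os.path.join(scratch, "auth.json")
    overwrite_in_place(auth, '{"token": "old"}')
    inode_before = os.stat(auth).st_ino

    overwrite_in_place(auth, '{"token": "new"}')
    same_inode = os.stat(auth).st_ino == inode_before      # mount still sees it

    replace_by_rename(auth, '{"token": "newer"}')
    changed_inode = os.stat(auth).st_ino != inode_before   # mount would go stale
```

If a manual edit of the host auth.json is ever needed, prefer an editor or command that writes in place rather than one that saves via rename.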

5. Finding and terminating a dangling session

A “dangling” session is one whose interactive_sessions doc is not terminal (control_state not in failed/terminated) but whose container/subprocess is gone, stuck, or leaking spend. The 60s reconciler (docker inspect liveness) should catch most of these and flip them to failed, but operators sometimes need to act manually.

Find live sessions


# All non-terminal sessions, newest first.
ssh curateme@178.105.8.25 \
  'docker exec curateme-mongo mongosh curate_me --quiet --eval '\''
    db.interactive_sessions.find(
      { control_state: { $nin: ["failed", "terminated"] } },
      { _id:1, org_id:1, runner_id:1, container_name:1, control_state:1,
        auth_path:1, cost_usd_accumulated:1, created_at:1 }
    ).sort({ created_at: -1 }).limit(50).toArray()'\'''


# The matching containers on the runner VPS.
ssh curateme@178.105.1.95 'docker ps --filter "name=claude-session-" \
  --format "table {{.Names}}\t{{.Status}}\t{{.CreatedAt}}"'
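
Cross-referencing the two listings by hand is error-prone when there are many sessions. A minimal sketch of the comparison, assuming the Mongo docs and the docker ps names have already been fetched; find_dangling is a hypothetical helper and it only covers the "container is gone" case, not stuck or overspending sessions.

```python
from typing import Iterable, List

TERMINAL_STATES = {"failed", "terminated"}


def find_dangling(session_docs: Iterable[dict],
                  running_containers: Iterable[str]) -> List[dict]:
    """Return non-terminal session docs whose container is not running."""
    running = set(running_containers)
    return [
        doc for doc in session_docs
        if doc.get("control_state") not in TERMINAL_STATES
        and doc.get("container_name") not in running
    ]


docs = [
    {"_id": "s1", "org_id": "org_a", "control_state": "autonomous",
     "container_name": "claude-session-s1"},
    {"_id": "s2", "org_id": "org_a", "control_state": "human_driving",
     "container_name": "claude-session-s2"},
    {"_id": "s3", "org_id": "org_b", "control_state": "terminated",
     "container_name": "claude-session-s3"},
]
dangling = find_dangling(docs, ["claude-session-s1"])
```

In this example only s2 is reported: s1 has a live container and s3 is already terminal.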

Terminate cleanly (preferred)

Use the API so the platform sends proc.terminate() to the subprocess, marks the doc terminated, and releases the control token. As platform staff (cross-org reader) or the owning org’s admin:


curl -sS -X POST \
  "https://api.curate-me.ai/gateway/admin/runners/${RUNNER_ID}/sessions/${SESSION_ID}/claude/terminate" \
  -H "X-CM-API-Key: ${ADMIN_KEY}"
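
For tooling that terminates several sessions, the same call can be built in Python with the standard library. This is a sketch: build_terminate_request is a hypothetical helper, and the URL and header simply mirror the curl command above.

```python
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.curate-me.ai"


def build_terminate_request(runner_id: str, session_id: str,
                            admin_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST that terminates one session."""
    url = (
        f"{API_BASE}/gateway/admin/runners/{quote(runner_id, safe='')}"
        f"/sessions/{quote(session_id, safe='')}/claude/terminate"
    )
    return urllib.request.Request(
        url, method="POST", headers={"X-CM-API-Key": admin_key}
    )


req = build_terminate_request("runner_123", "sess_456", "ADMIN_KEY_PLACEHOLDER")
# To send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it makes the URL easy to log and review before anything is terminated.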

Force-terminate the container (last resort)

If the API path is unavailable (broker crashed), kill the container directly, then flip the doc so the UI and budget loop stop tracking it:


ssh curateme@178.105.1.95 "docker rm -f ${CONTAINER_NAME}"
 
ssh curateme@178.105.8.25 \
  "docker exec curateme-mongo mongosh curate_me --quiet --eval '
    db.interactive_sessions.updateOne(
      { _id: \"${SESSION_ID}\", org_id: \"${ORG_ID}\" },
      { \$set: { control_state: \"terminated\", terminated_at: new Date() } }
    )'"

Always include org_id in the Mongo filter — every interactive_sessions query is compound-filtered by { _id, org_id } (the PRIMARY tenant key, audit #2576). A force-update without org_id is a tenant-isolation anti-pattern; don’t copy it into tooling.
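
One way to keep tooling honest about the compound filter is to build it through a helper that refuses to omit org_id. A minimal sketch; session_filter and terminate_update are hypothetical names, and the update document mirrors the force-terminate command above.

```python
from datetime import datetime, timezone


def session_filter(session_id: str, org_id: str) -> dict:
    """Build the { _id, org_id } filter; never allow an _id-only query."""
    if not session_id or not org_id:
        raise ValueError("both session_id and org_id are required")
    return {"_id": session_id, "org_id": org_id}


def terminate_update(now: datetime) -> dict:
    """The $set used when flipping a doc to terminated by hand."""
    return {"$set": {"control_state": "terminated", "terminated_at": now}}


flt = session_filter("sess_456", "org_DESIGN_PARTNER")
upd = terminate_update(datetime.now(timezone.utc))
# With a pymongo collection this would be: coll.update_one(flt, upd)
```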

Clean up stuck control tokens (rare)

If a session shows human_driving but nobody is connected, the Redis control token may be orphaned. It carries a TTL and reverts to autonomous on expiry; to clear it now:


ssh curateme@178.105.8.25 \
  "docker exec curateme-redis redis-cli DEL interactive:control_token:${SESSION_ID}"

6. Streaming & takeover behavior (operator FAQ)

Transport. Output streams over SSE (GET /claude/stream), fed from the Redis pub/sub channel interactive:events:{sid} with a replay buffer (interactive:replay:{sid}, last ~500 events). Reconnect + Last-Event-ID replay are handled by the dashboard’s existing SSE hook — a browser refresh mid-stream resumes without data loss.
Heartbeats. The broker emits a heartbeat event when stdout is idle (~10s) so proxies don’t close the connection and the UI shows “alive”.
Control token (takeover). Exactly one surface (dashboard / Teams / API) holds the control token at a time, claimed via SET EX NX in Redis. While a human is driving (human_driving), turns from any other surface are rejected with 409. “Take Control” claims the token; if it’s held, the button becomes “Request Control” (notifies the holder) and the send box goes read-only with a banner naming the holder.
States. starting → autonomous ⇄ human_driving, with paused (subprocess kept alive, context preserved, no turns accepted), awaiting_approval (a tool-use/MCP permission prompt is pending — all turns blocked until approve/deny from either surface), and terminal failed / terminated.
Approvals (HITL). A permission_prompt puts the session in awaiting_approval and surfaces an Approve/Deny modal (dashboard) or Adaptive Card (Teams). First decision wins (NX). MCP-level HITL stays active on the OAT-direct path (§3) — those MCP calls still traverse the gateway.
Teams parity. A reply in the bound Teams thread is delivered to the session’s stdin; !takeover / !pause / !resume / !release and !approve / !deny (or tap-to-approve) work from Teams. Prereq: teams_integrations.agent_channel must be set per org or inbound Teams routing silently no-ops.
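
The control-token rules (exclusive claim, TTL expiry, 409 for other surfaces) can be illustrated with an in-memory stand-in for the Redis SET with NX and EX. This is a sketch under assumptions: the ControlTokens class, the 300-second TTL, and the 202 "accepted" status are all hypothetical; only the 409 rejection and the claim-if-absent-with-expiry semantics come from the description above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ControlTokens:
    """In-memory model of a claim-if-absent token with expiry."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds  # assumed TTL, for illustration only
        self._clock = clock
        self._held: Dict[str, Tuple[str, float]] = {}

    def holder(self, session_id: str) -> Optional[str]:
        """Current holder, or None if unclaimed or expired."""
        entry = self._held.get(session_id)
        if entry is None:
            return None
        surface, expires_at = entry
        if self._clock() >= expires_at:
            del self._held[session_id]  # expiry: reverts to autonomous
            return None
        return surface

    def claim(self, session_id: str, surface: str) -> bool:
        """Like SET key surface NX EX ttl: succeeds only if nobody holds it."""
        if self.holder(session_id) is not None:
            return False
        self._held[session_id] = (surface, self._clock() + self._ttl)
        return True

    def turn_status(self, session_id: str, surface: str) -> int:
        """409 if another surface is driving; 202 (assumed) otherwise."""
        current = self.holder(session_id)
        if current is not None and current != surface:
            return 409
        return 202
```

Against a real Redis with redis-py, the claim would look roughly like r.set(f"interactive:control_token:{sid}", surface, nx=True, ex=ttl), which returns a truthy value only for the surface that wins the race.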

7. Close the loop

Record any manual termination or flag change in the incident channel with the org_id / session_id.
If you revoked interactive_claude_session for a cost incident, file a follow-up to investigate the cost sidecar before re-granting.
If a runner left orphaned claude-session-* containers, confirm the 60s reconciler flipped their docs to failed; if not, that’s a reconciler bug — open an issue.

Related runbooks

Rotate Claude Code OAT: OAT rotation mechanics (interactive containers included).
Runner Operations: broader runner container lifecycle and triage.
Runner Stuck / Not Responding: when the host container, not the session, is the problem.
Slack Integration Setup: control-surface bridge precedent.