
Runbook: Rotate the Microsoft Graph Client Secret

Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-06-14
Validation method: Live rotation against the Curate-Me Agent Identity Entra app reg + token round-trip + live inbound-email replay on the curate-me.ai tenant
Severity trigger: SEV2 if the secret is already expired (all Graph calls fail: agent-identity outbound email + consumer inbound ingest stop); SEV3 if rotating proactively before expiry
Customer impact: While the env holds a dead secret, outbound agent email falls back down the send_email() chain (Graph → Resend → SES → SMTP → file) and inbound household email stops being ingested (no new approvals); no gateway/proxy outage
Required access: Entra admin (Application Administrator on the Curate-Me Agent Identity app reg), SSH to the platform VPS, local access to ~/Documents/fm-dogfood/
Related services: curateme-backend-b2b, curateme-backend-gateway, Microsoft Graph API, M365 subscription-renewal beat task
Time to complete: ~15 minutes


The Microsoft Graph client secret authenticates the platform’s app-only (client-credentials) calls to Microsoft Graph for the agent-identity feature: sending mail as family@curate-me.ai, the consumer plus-addressed inbound ingest lane, and the change-notification subscriptions that turn an inbound email into a pending approval. Entra client secrets expire, so they must be rotated on a schedule (or immediately if leaked).

The secret is the MICROSOFT_GRAPH_CLIENT_SECRET env var. It is consumed by GraphConfig.from_env() in services/backend/src/services/agent_identity/graph_client.py (which raises RuntimeError: Microsoft Graph not configured. Missing env vars: ... when it is blank) and lives in two places:

  • Prod: MICROSOFT_GRAPH_CLIENT_SECRET= in ~/platform/.env.production on the VPS
  • Local dogfood copy: ~/Documents/fm-dogfood/m365_graph_secret.txt (chmod 600)
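
The fail-fast behaviour described above can be sketched roughly as follows. This is a hypothetical stand-in, not the real GraphConfig from graph_client.py: the class name GraphConfigSketch, its field names, and the assumption that all three MICROSOFT_GRAPH_* variables are required are illustrative. Only the variable names and the error message prefix come from this runbook.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names come from this runbook; treating all three as required is an assumption.
REQUIRED_VARS = (
    "MICROSOFT_GRAPH_TENANT_ID",
    "MICROSOFT_GRAPH_CLIENT_ID",
    "MICROSOFT_GRAPH_CLIENT_SECRET",
)


@dataclass(frozen=True, repr=False)
class GraphConfigSketch:
    """Hypothetical stand-in for GraphConfig; not the production class."""

    tenant_id: str
    client_id: str
    client_secret: str

    def __repr__(self) -> str:
        # Keep the secret out of logs and tracebacks.
        return (
            f"GraphConfigSketch(tenant_id={self.tenant_id!r}, "
            f"client_id={self.client_id!r}, client_secret='***')"
        )

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "GraphConfigSketch":
        env = os.environ if env is None else env
        # A blank or whitespace-only value counts as missing.
        missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
        if missing:
            raise RuntimeError(
                "Microsoft Graph not configured. Missing env vars: " + ", ".join(missing)
            )
        return cls(*(env[name].strip() for name in REQUIRED_VARS))
```

Passing a plain dict instead of os.environ makes it possible to exercise the blank-secret path without touching a real environment.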

Secret hygiene — read first. NEVER echo, cat, or paste the secret value into a terminal, a transcript, a chat, or this runbook. Azure shows the secret value exactly once at creation time. Capture it straight from the portal Copy button into the clipboard and write it to a file with restrictive permissions (commands below use pbpaste, never echo <value>). The first dogfood secret was previously leaked into a transcript and had to be rotated — do not repeat that.


Prerequisites

Before starting, confirm you have:

  • Entra access — Application Administrator (or Global Admin) on the Curate-Me Agent Identity app registration in the curate-me.ai tenant
  • SSH to the VPS: ssh curateme@178.105.8.25
  • The tenant ID and client ID — read them from the VPS without printing the secret:
    ssh curateme@178.105.8.25 "grep -E '^MICROSOFT_GRAPH_(TENANT|CLIENT)_ID=' ~/platform/.env.production"
    (MICROSOFT_GRAPH_CLIENT_SECRET is intentionally NOT grepped here — never print it.)
  • A known-good inbound test address — one of the household’s verified senders (the mobile_allowed_senders row that maps an email to a member) so Step 6 can replay a real inbound email
  • A maintenance window note: between creating the new secret (Step 1) and recreating backend-b2b (Step 4), Graph calls keep authenticating with the old secret still loaded in memory; the cutover is the container restart, and the old secret is only deleted afterwards in Step 7

Step 1: Create the new client secret in Entra

  1. In the Entra admin center, go to Applications → App registrations.
  2. Open the app registration named Curate-Me Agent Identity.
  3. Go to Certificates & secrets → Client secrets → + New client secret.
  4. Description: graph-secret-<YYYY-MM> (e.g. graph-secret-2026-06). Expiry: 180 days (or your standard rotation window — shorter is safer).
  5. Click Add. The new secret Value is shown exactly once.
  6. Click the Copy icon on the Value column (NOT the Secret ID). The value is now on your clipboard.

Verification: A new row appears under Client secrets with your description and a future Expires date. Do NOT delete the old secret yet — both can coexist, and the old one stays valid until you remove it in Step 7.
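
To keep the description and the rotation reminder consistent between rotations, the naming rule and the expiry arithmetic can be sketched as below. The function names and the 14-day lead time are assumptions for illustration, not part of this runbook. The date Entra actually records may differ slightly from the computed one, so treat the result as a calendar reminder and not as the authoritative expiry.

```python
from datetime import date, timedelta


def secret_description(created: date) -> str:
    """Build the graph-secret-<YYYY-MM> description used in Step 1."""
    return f"graph-secret-{created:%Y-%m}"


def rotation_dates(created: date, lifetime_days: int = 180, lead_days: int = 14):
    """Return (expires_on, rotate_by); lead_days is an assumed safety margin."""
    expires_on = created + timedelta(days=lifetime_days)
    rotate_by = expires_on - timedelta(days=lead_days)
    return expires_on, rotate_by
```

For a secret created on 2026-06-14 with the 180-day window, this yields the description graph-secret-2026-06, an expiry of 2026-12-11, and a rotate-by date of 2026-11-27.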


Step 2: Capture the new secret into the local dogfood file

Write the clipboard contents straight to the chmod 600 file — never via echo <value>:

umask 077
pbpaste > ~/Documents/fm-dogfood/m365_graph_secret.txt
chmod 600 ~/Documents/fm-dogfood/m365_graph_secret.txt

Verification: Confirm length and permissions without revealing the value:

ls -l ~/Documents/fm-dogfood/m365_graph_secret.txt   # expect -rw------- (600)
wc -c < ~/Documents/fm-dogfood/m365_graph_secret.txt # non-zero byte count
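
The same two checks can be scripted so they are harder to skip. The sketch below is an illustration only (check_secret_file is a hypothetical helper, not an existing repo script); it inspects file metadata and never reads the contents.

```python
import stat
from pathlib import Path


def check_secret_file(path):
    """Return a list of problems with the secret file; an empty list means it looks fine.

    Only metadata is inspected (mode and size); the contents are never read.
    """
    p = Path(path).expanduser()
    if not p.is_file():
        return [f"{p} does not exist or is not a regular file"]
    st = p.stat()
    problems = []
    mode = stat.S_IMODE(st.st_mode)
    if mode & 0o077:
        # Any group/other permission bit means the file is broader than 600.
        problems.append(f"permissions are {mode:o}; expected 600")
    if st.st_size == 0:
        problems.append("file is empty; the clipboard may have been empty when pbpaste ran")
    return problems
```

Calling check_secret_file("~/Documents/fm-dogfood/m365_graph_secret.txt") and proceeding only on an empty list mirrors the manual ls and wc check above.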

Step 3: Update MICROSOFT_GRAPH_CLIENT_SECRET on the VPS

The secret lives in ~/platform/.env.production. Update it in place without printing the value. Copy the local file up first, then splice it into the env file with a python edit that reads from the uploaded file (no value on any command line):

# 1. Upload the captured secret to a temp file on the VPS (scp does not echo contents)
scp ~/Documents/fm-dogfood/m365_graph_secret.txt curateme@178.105.8.25:/tmp/graph_secret.new

# 2. On the VPS: back up the env file, replace only the MICROSOFT_GRAPH_CLIENT_SECRET line, shred the temp
ssh curateme@178.105.8.25 'cd ~/platform && cp .env.production .env.production.bak.$(date +%s) && \
python3 - <<"PY"
import pathlib
secret = pathlib.Path("/tmp/graph_secret.new").read_text().strip()
env = pathlib.Path(".env.production")
lines = env.read_text().splitlines()
key = "MICROSOFT_GRAPH_CLIENT_SECRET"
out, found = [], False
for ln in lines:
    if ln.startswith(key + "="):
        out.append(f"{key}={secret}"); found = True
    else:
        out.append(ln)
if not found:
    out.append(f"{key}={secret}")
env.write_text("\n".join(out) + "\n")
print(f"{key} updated; found_existing={found}")
PY
shred -u /tmp/graph_secret.new 2>/dev/null || rm -f /tmp/graph_secret.new'

This prints only MICROSOFT_GRAPH_CLIENT_SECRET updated; found_existing=True — never the value.
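
If you want to rehearse the splice before touching the real env file, the same line-replacement logic can be pulled into a pure function and run against a throwaway string. This is a sketch that mirrors the inline script above; replace_env_key is a hypothetical name, not a helper that exists in the repo.

```python
def replace_env_key(text: str, key: str, value: str):
    """Return (new_text, found_existing) with KEY=value updated or appended.

    Every other line, including comments and blank lines, passes through unchanged.
    """
    prefix = key + "="
    out, found = [], False
    for line in text.splitlines():
        if line.startswith(prefix):
            # Matching on "KEY=" avoids touching keys that merely share a prefix.
            out.append(prefix + value)
            found = True
        else:
            out.append(line)
    if not found:
        out.append(prefix + value)
    return "\n".join(out) + "\n", found
```

Because the match includes the trailing "=", a neighbouring key such as MICROSOFT_GRAPH_CLIENT_SECRET_OLD (a made-up name) would be left alone.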

Note: deploy-to-vps.sh does a git reset --hard origin/<branch> then restores .env.production from its .env.production.bak copy (see scripts/deploy-to-vps.sh “Restoring VPS-local config files” step). Editing .env.production directly is correct because it is VPS-local and never committed; the timestamped .bak.<epoch> you just made is your rollback copy.

Verification: Confirm the key is present and non-empty without printing the value:

ssh curateme@178.105.8.25 "cd ~/platform && grep -c '^MICROSOFT_GRAPH_CLIENT_SECRET=.\\+' .env.production" # expect: 1

Step 4: Recreate backend-b2b (and backend-gateway) to load the new secret

The env var is read at process start by GraphConfig.from_env(), so the containers must be recreated, not just signalled. Recreate the backend group via the deploy script’s backend path (it rebuilds and runs up -d for backend-b2b, backend-gateway, runner-agent, celery-worker, and celery-beat):

./scripts/deploy-to-vps.sh --backend

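If you want the env-only fast path (no code change to deploy), recreate just the two API containers in place on the VPS so they re-read .env.production. The m365_subscription_renewal beat task also depends on the secret, so if it runs in celery-worker or celery-beat, append those services to the command; otherwise they keep the old secret in memory until they are next recreated: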

ssh curateme@178.105.8.25 "cd ~/platform && docker compose -f docker-compose.production.yml --env-file .env.production up -d --force-recreate backend-b2b backend-gateway"

Verification: Both containers are healthy and have a fresh start time:

ssh curateme@178.105.8.25 "docker ps --filter name=curateme-backend-b2b --filter name=curateme-backend-gateway --format '{{.Names}}\t{{.Status}}'" # expect both 'Up ... (healthy)' with a recent uptime
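
If this check is ever automated, the tab-separated docker ps output can be evaluated with a small parser. The sketch below is an assumption-laden illustration (unhealthy_containers is a hypothetical helper); it relies only on the Names<TAB>Status format requested by the command above.

```python
def unhealthy_containers(ps_output: str, expected):
    """Return the expected container names that are missing or not 'Up ... (healthy)'."""
    status_by_name = {}
    for line in ps_output.splitlines():
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        name, status = line.split("\t", 1)
        status_by_name[name.strip()] = status.strip()
    bad = []
    for name in expected:
        status = status_by_name.get(name, "")
        # "(unhealthy)" and "(health: starting)" do not contain "(healthy)".
        if not (status.startswith("Up") and "(healthy)" in status):
            bad.append(name)
    return bad
```

An empty result for ["curateme-backend-b2b", "curateme-backend-gateway"] corresponds to the expected state. The uptime still has to be read by eye to confirm the containers were recreated recently.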

Step 5: Verify a client-credentials token round-trip

Confirm the new secret actually mints a Graph token. Run the round-trip inside curateme-backend-b2b so it uses the exact env the app sees, and have the container read the secret from its own environment (never paste it):

# -i attaches stdin so the heredoc actually reaches python inside the container
ssh curateme@178.105.8.25 'docker exec -i -e PYTHONPATH=/app curateme-backend-b2b python - <<"PY"
import os, httpx
tenant = os.environ["MICROSOFT_GRAPH_TENANT_ID"]
data = {
    "client_id": os.environ["MICROSOFT_GRAPH_CLIENT_ID"],
    "client_secret": os.environ["MICROSOFT_GRAPH_CLIENT_SECRET"],
    "grant_type": "client_credentials",
    "scope": "https://graph.microsoft.com/.default",
}
r = httpx.post(f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token", data=data, timeout=30)
print("status", r.status_code)
body = r.json()
# Print ONLY safe fields, never the access_token value
print("ok" if "access_token" in body else "FAIL", "expires_in", body.get("expires_in"), "error", body.get("error"))
PY'

Expected: status 200 and ok expires_in 3599 error None. A status 401 / error invalid_client (AADSTS7000215: Invalid client secret provided) means the env still holds the old/wrong value — re-check Step 3.

Verification: Output begins status 200 and contains ok. The script deliberately prints only expires_in/error, never the token. This mirrors the production auth path in graph_client.py::_get_access_token() (MSAL acquire_token_for_client against login.microsoftonline.com/<tenant>/oauth2/v2.0/token).
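
The "print only safe fields" rule can be made explicit with an allow-list, so that a later edit to the check cannot accidentally log the token. The helper below is a sketch, not code from graph_client.py; summarize_token_response and the SAFE_FIELDS selection are assumptions for illustration.

```python
# Fields considered safe to log; access_token and error_description are left out on purpose.
SAFE_FIELDS = ("token_type", "expires_in", "error", "error_codes")


def summarize_token_response(status_code: int, body: dict) -> str:
    """Build a log-safe one-line summary; the access_token value is never included."""
    verdict = "ok" if status_code == 200 and "access_token" in body else "FAIL"
    safe = {name: body[name] for name in SAFE_FIELDS if name in body}
    return f"status {status_code} {verdict} {safe}"
```

An allow-list is used instead of deleting access_token from the body so that any unexpected sensitive field is also dropped by default.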


Step 6: Verify a live inbound email still ingests

A token round-trip proves auth; this proves the subscription + ingest pipeline still works end-to-end with the new secret.

  1. From a verified sender address, send a short email to the household assistant mailbox (family@curate-me.ai, or the consumer plus-addressed alias assistant+<token>@curate-me.ai per M365_CONSUMER_INGEST_ADDRESS).

  2. Watch the backend log for the inbound webhook + dispatch:

    ssh curateme@178.105.8.25 "docker logs curateme-backend-b2b --since 5m 2>&1 | grep -iE 'graph_subscription|inbound|graph_message|approval|capture'"
  3. Confirm a new pending approval was created for the org (it should appear in the Family Manager app’s inbox / pending-approval queue).

If the subscription expired during the dead-secret window, re-arm it (idempotent) — this is the same call scripts/fm_m365_wire_org.py makes via ProvisioningService.ensure_inbound_subscription, and the 12-hourly m365_subscription_renewal beat task also covers it (it requires MICROSOFT_GRAPH_* plus M365_WEBHOOK_CLIENT_STATE, and for the consumer lane M365_CONSUMER_INGEST_USER_ID).

The wiring script lives at the repo root scripts/fm_m365_wire_org.py and is NOT baked into the backend image (the image build context is services/backend/, so /app/scripts/ only contains services/backend/scripts/*). It also imports from src.*, so it must run inside the container with PYTHONPATH=/app. Copy the repo-root script into the running container, then exec it:

# 1. Copy the repo-root wiring script into the container
ssh curateme@178.105.8.25 "docker cp ~/platform/scripts/fm_m365_wire_org.py curateme-backend-b2b:/tmp/fm_m365_wire_org.py"

# 2. Exec it inside the container (PYTHONPATH=/app so `src.*` imports resolve)
ssh curateme@178.105.8.25 'docker exec -e PYTHONPATH=/app curateme-backend-b2b \
  python /tmp/fm_m365_wire_org.py --org org_xxx \
  --mailbox family@curate-me.ai --m365-user-id <graph-user-guid>'

(Replace org_xxx and <graph-user-guid> with the household org id and the mailbox’s M365 user GUID; the script is idempotent and adopts the already-active identity without re-creating the Graph user.)

Verification: The log shows a graph_subscription/inbound dispatch line and a new pending approval exists for the household org. No graph_* lines log 401/invalid_client.


Step 7: Delete the old client secret in Entra

Only after Steps 5 and 6 pass, remove the superseded secret so a leaked old value cannot be used:

  1. Back in Entra → Curate-Me Agent Identity → Certificates & secrets → Client secrets.
  2. Find the previous secret (older Expires/created date — NOT the one you just made).
  3. Click the trash icon → Delete.

Verification: Only the new secret remains in the Client secrets list. Re-run the Step 5 token round-trip once more — it must still return status 200 / ok, confirming the live secret is the new one and nothing depended on the deleted secret.


Rollback / If it goes wrong

Symptom: token round-trip returns 401 invalid_client after the swap. The env holds a bad value (truncated paste, trailing newline, wrong app). The old secret is still valid until Step 7, so:

  1. Restore the previous env file from the timestamped backup made in Step 3:
    ssh curateme@178.105.8.25 "cd ~/platform && ls -t .env.production.bak.* | head -1" # find newest backup
    ssh curateme@178.105.8.25 "cd ~/platform && cp .env.production.bak.<epoch> .env.production"
  2. Recreate the containers again (Step 4 fast path) so they reload the known-good secret.
  3. Re-run Step 5. Once green, re-attempt the new-secret capture (Step 2) being careful with the clipboard, then re-do Steps 3–6.
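
Picking the backup by the epoch in its filename, instead of by modification time, can be sketched as below. newest_backup is a hypothetical helper; the filename pattern is the one produced by the cp command in Step 3.

```python
import re

# Matches only the timestamped copies from Step 3, not a plain .env.production.bak.
_BACKUP_RE = re.compile(r"^\.env\.production\.bak\.(\d+)$")


def newest_backup(filenames):
    """Return the .env.production.bak.<epoch> name with the largest epoch, or None."""
    candidates = []
    for name in filenames:
        match = _BACKUP_RE.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    return max(candidates)[1] if candidates else None
```

One caveat applies to either method: if Step 3 was run more than once, the newest backup already contains the bad value, so restore the copy taken before the first attempt.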

Symptom: outbound agent email stopped (no Graph send) but inbound is fine. Confirm the secret env var is non-empty (Step 3 verification) and that backend-b2b was actually recreated (Step 4 verification — check the uptime is recent). With a dead secret, send_email() silently falls back to Resend/SES/SMTP/file, so customers still get mail but not from the M365 mailbox.
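
The silent fallback is easier to reason about with a concrete shape in mind. The sketch below shows a generic ordered-fallback pattern; it is not the platform's send_email() implementation, and the provider callables are hypothetical. The relevant point is that a Graph failure is swallowed as long as a later provider succeeds, which is why a dead secret does not surface as a send error.

```python
from typing import Callable, Sequence, Tuple

# (provider name, send function); the send function raises on failure.
Provider = Tuple[str, Callable[[dict], None]]


def send_with_fallback(message: dict, providers: Sequence[Provider]) -> str:
    """Try each provider in order and return the name of the first that succeeds."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Logging or alerting on the returned provider name is one way to notice that mail has stopped leaving through Graph.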

Symptom: inbound emails stopped creating approvals. The change-notification subscription likely lapsed (Graph mail subscriptions are short-lived; the renewal task needs a valid secret). Re-run the Step 6 fm_m365_wire_org.py ensure_inbound_subscription call after the new secret is live.
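
To judge whether a subscription lapsed, compare its expirationDateTime (a property of the Graph subscription resource) with the current time. The sketch below is an illustration and not the renewal task's code: needs_renewal and the 24-hour margin are assumptions, with the margin chosen only because it exceeds the 12-hourly beat interval mentioned in this runbook.

```python
import re
from datetime import datetime, timedelta, timezone

# UTC timestamps with optional fractional seconds of any length.
_ISO_UTC = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_graph_datetime(value: str) -> datetime:
    """Parse a UTC timestamp such as 2026-06-20T11:00:00.0000000Z."""
    match = _ISO_UTC.match(value)
    if not match:
        raise ValueError(f"unsupported timestamp: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # Truncate or pad the fraction to microseconds (six digits).
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def needs_renewal(expiration: str, now: datetime,
                  margin: timedelta = timedelta(hours=24)) -> bool:
    """True when the subscription has expired or will expire within the margin."""
    return parse_graph_datetime(expiration) - now <= margin
```

Pass a timezone-aware value for now, for example datetime.now(timezone.utc), since the parsed expiry is timezone-aware.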

Never roll back by deleting the new Entra secret while the env still references it — that would leave both copies dead. Always restore the env first, verify, then clean up Entra.

After any rollback, shred the temp upload if it survived: ssh curateme@178.105.8.25 "shred -u /tmp/graph_secret.new 2>/dev/null || rm -f /tmp/graph_secret.new". If the secret value was ever exposed (printed, logged, pasted), treat it as leaked and immediately rotate again from Step 1.


  • Agent Identity Provisioning — first-time M365 user creation, aliases, and the Insufficient privileges / Invalid client secret troubleshooting table
  • OAuth Token Rotation — general token-rotation pattern
  • Webhook Disaster Recovery — recovering Graph change-notification subscriptions
  • Agent Identity Design Doc  — full architecture, §11 (Phase 0) referenced by graph_client.py
  • services/backend/src/services/agent_identity/graph_client.py (GraphConfig.from_env() for the env var contract, and _get_access_token() for the live token round-trip)
  • scripts/fm_m365_wire_org.py — idempotent org wiring + ensure_inbound_subscription (repo-root script; not baked into the backend image — docker cp it into the container to run, see Step 6)
  • services/backend/src/tasks/m365_subscription_renewal.py — the 12-hourly subscription-renewal beat task that also depends on MICROSOFT_GRAPH_CLIENT_SECRET