Runbook: Rotate the Microsoft Graph Client Secret
Owner: Platform Team
Backup owner: On-call engineer
Last validated: 2026-06-14
Validation method: Live rotation against the Curate-Me Agent Identity Entra app reg + token round-trip + live inbound-email replay on the curate-me.ai tenant
Severity trigger: SEV2 if the secret is already expired (all Graph calls fail: agent-identity outbound email + consumer inbound ingest stop); SEV3 if rotating proactively before expiry
Customer impact: While the env holds a dead secret, outbound agent email falls back down the send_email() chain (Graph → Resend → SES → SMTP → file) and inbound household email stops being ingested (no new approvals); no gateway/proxy outage
Required access: Entra admin (Application Administrator on the Curate-Me Agent Identity app reg), SSH to the platform VPS, local access to ~/Documents/fm-dogfood/
Related services: curateme-backend-b2b, curateme-backend-gateway, Microsoft Graph API, M365 subscription-renewal beat task
Time to complete: ~15 minutes
The Microsoft Graph client secret authenticates the platform’s app-only (client-credentials) calls to Microsoft Graph for the agent-identity feature: sending mail as family@curate-me.ai, the consumer plus-addressed inbound ingest lane, and the change-notification subscriptions that turn an inbound email into a pending approval. Entra client secrets expire, so they must be rotated on a schedule (or immediately if leaked).
The secret is the MICROSOFT_GRAPH_CLIENT_SECRET env var. It is consumed by GraphConfig.from_env() in services/backend/src/services/agent_identity/graph_client.py (which raises RuntimeError: Microsoft Graph not configured. Missing env vars: ... when it is blank) and lives in two places:
- Prod: MICROSOFT_GRAPH_CLIENT_SECRET= in ~/platform/.env.production on the VPS
- Local dogfood copy: ~/Documents/fm-dogfood/m365_graph_secret.txt (chmod 600)
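The env-var contract described above can be pictured with a small sketch. This is not the real GraphConfig.from_env(): the helper names missing_graph_env and require_graph_env are hypothetical, and the assumption that all three MICROSOFT_GRAPH_* variables are required is an inference. Only the variable names and the error wording come from this runbook.

```python
import os
from typing import List, Mapping

# Variable names as they appear in this runbook; treating all three as
# required is an assumption about the real loader.
REQUIRED_VARS = (
    "MICROSOFT_GRAPH_TENANT_ID",
    "MICROSOFT_GRAPH_CLIENT_ID",
    "MICROSOFT_GRAPH_CLIENT_SECRET",
)


def missing_graph_env(env: Mapping[str, str]) -> List[str]:
    """Return the names (never the values) of unset or blank variables."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_graph_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise the same style of error the runbook quotes when config is incomplete."""
    missing = missing_graph_env(env)
    if missing:
        raise RuntimeError(
            "Microsoft Graph not configured. Missing env vars: " + ", ".join(missing)
        )
```

Reporting only variable names keeps the check safe to run in a shared terminal, in line with the secret-hygiene rule.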
Secret hygiene: read first. NEVER echo, cat, or paste the secret value into a terminal, a transcript, a chat, or this runbook. Azure shows the secret value exactly once at creation time. Capture it straight from the portal Copy button into the clipboard and write it to a file with restrictive permissions (commands below use pbpaste, never echo <value>). The first dogfood secret was previously leaked into a transcript and had to be rotated; do not repeat that.
Prerequisites
Before starting, confirm you have:
- Entra access: Application Administrator (or Global Admin) on the Curate-Me Agent Identity app registration in the curate-me.ai tenant
- SSH to the VPS: ssh curateme@178.105.8.25
- The tenant ID and client ID: read them from the VPS without printing the secret:
  ssh curateme@178.105.8.25 "grep -E '^MICROSOFT_GRAPH_(TENANT|CLIENT)_ID=' ~/platform/.env.production"
  (MICROSOFT_GRAPH_CLIENT_SECRET is intentionally NOT grepped here; never print it.)
- A known-good inbound test address: one of the household’s verified senders (the mobile_allowed_senders row that maps an email to a member) so Step 6 can replay a real inbound email
- A maintenance window note: between updating .env.production (Step 3) and recreating backend-b2b (Step 4), Graph calls authenticate with the old secret still loaded in memory; the cutover is the container restart
Step 1: Create the new client secret in Entra
- Open the Entra admin center → Applications → App registrations.
- Open the app registration named Curate-Me Agent Identity.
- Go to Certificates & secrets → Client secrets → + New client secret.
- Description: graph-secret-<YYYY-MM> (e.g. graph-secret-2026-06). Expiry: 180 days (or your standard rotation window; shorter is safer).
- Click Add. The new secret Value is shown exactly once.
- Click the Copy icon on the Value column (NOT the Secret ID). The value is now on your clipboard.
Verification: A new row appears under Client secrets with your description and a future Expires date. Do NOT delete the old secret yet — both can coexist, and the old one stays valid until you remove it in Step 7.
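The description format and the 180-day window reduce to simple date arithmetic. The sketch below is illustrative: the function names and the 14-day lead time are assumptions, and the Expires column in Entra remains the authoritative date, since the portal may compute expiry differently.

```python
from datetime import date, timedelta
from typing import Tuple


def secret_description(created: date) -> str:
    """Build the graph-secret-<YYYY-MM> description used in Step 1."""
    return f"graph-secret-{created:%Y-%m}"


def rotation_schedule(
    created: date, lifetime_days: int = 180, lead_days: int = 14
) -> Tuple[date, date]:
    """Return (expected_expiry, rotate_by).

    lead_days is a hypothetical safety margin, not a value from this runbook.
    """
    expires = created + timedelta(days=lifetime_days)
    rotate_by = expires - timedelta(days=lead_days)
    return expires, rotate_by


if __name__ == "__main__":
    created = date(2026, 6, 14)
    expires, rotate_by = rotation_schedule(created)
    print(secret_description(created), expires.isoformat(), rotate_by.isoformat())
```

Putting the rotate_by date on the team calendar keeps the next rotation a proactive SEV3 rather than a SEV2 after expiry.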
Step 2: Capture the new secret into the local dogfood file
Write the clipboard contents straight to the chmod 600 file — never via echo <value>:
umask 077
pbpaste > ~/Documents/fm-dogfood/m365_graph_secret.txt
chmod 600 ~/Documents/fm-dogfood/m365_graph_secret.txt
Verification: Confirm length and permissions without revealing the value:
ls -l ~/Documents/fm-dogfood/m365_graph_secret.txt # expect -rw------- (600)
wc -c < ~/Documents/fm-dogfood/m365_graph_secret.txt # non-zero byte count
Step 3: Update MICROSOFT_GRAPH_CLIENT_SECRET on the VPS
The secret lives in ~/platform/.env.production. Update it in place without printing the value. Copy the local file up first, then splice it into the env file with a python edit that reads from the uploaded file (no value on any command line):
# 1. Upload the captured secret to a temp file on the VPS (scp does not echo contents)
scp ~/Documents/fm-dogfood/m365_graph_secret.txt curateme@178.105.8.25:/tmp/graph_secret.new
# 2. On the VPS: back up the env file, replace only the MICROSOFT_GRAPH_CLIENT_SECRET line, shred the temp
ssh curateme@178.105.8.25 'cd ~/platform && cp .env.production .env.production.bak.$(date +%s) && \
python3 - <<"PY"
import pathlib
secret = pathlib.Path("/tmp/graph_secret.new").read_text().strip()
env = pathlib.Path(".env.production")
lines = env.read_text().splitlines()
key = "MICROSOFT_GRAPH_CLIENT_SECRET"
out, found = [], False
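for ln in lines:
    if ln.startswith(key + "="):
        # Replace only the target key; every other line is kept verbatim
        out.append(f"{key}={secret}")
        found = True
    else:
        out.append(ln)
if not found:
    # Key was absent: append it rather than fail
    out.append(f"{key}={secret}")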
env.write_text("\n".join(out) + "\n")
print(f"{key} updated; found_existing={found}")
PY
shred -u /tmp/graph_secret.new 2>/dev/null || rm -f /tmp/graph_secret.new'
This prints only MICROSOFT_GRAPH_CLIENT_SECRET updated; found_existing=True, never the value.
Note: deploy-to-vps.sh does a git reset --hard origin/<branch> then restores .env.production from its .env.production.bak copy (see scripts/deploy-to-vps.sh “Restoring VPS-local config files” step). Editing .env.production directly is correct because it is VPS-local and never committed; the timestamped .bak.<epoch> you just made is your rollback copy.
Verification: Confirm the key is present and non-empty without printing the value:
ssh curateme@178.105.8.25 "cd ~/platform && grep -c '^MICROSOFT_GRAPH_CLIENT_SECRET=.\\+' .env.production"
# expect: 1
Step 4: Recreate backend-b2b (and backend-gateway) to load the new secret
The env var is read at process start by GraphConfig.from_env(), so the containers must be recreated, not just signalled. Recreate the backend group via the deploy script’s backend path (it rebuilds, then runs up -d for backend-b2b backend-gateway runner-agent celery-worker celery-beat):
./scripts/deploy-to-vps.sh --backend
If you want the env-only fast path (no code change to deploy), recreate just the two API containers in place on the VPS so they re-read .env.production:
ssh curateme@178.105.8.25 "cd ~/platform && docker compose -f docker-compose.production.yml --env-file .env.production up -d --force-recreate backend-b2b backend-gateway"
Verification: Both containers are healthy and have a fresh start time:
ssh curateme@178.105.8.25 "docker ps --filter name=curateme-backend-b2b --filter name=curateme-backend-gateway --format '{{.Names}}\t{{.Status}}'"
# expect both 'Up ... (healthy)' with a recent uptime
Step 5: Verify a client-credentials token round-trip
Confirm the new secret actually mints a Graph token. Run the round-trip inside curateme-backend-b2b so it uses the exact env the app sees, and have the container read the secret from its own environment (never paste it):
ssh curateme@178.105.8.25 'docker exec -i -e PYTHONPATH=/app curateme-backend-b2b python - <<"PY"
import os, httpx
tenant = os.environ["MICROSOFT_GRAPH_TENANT_ID"]
data = {
"client_id": os.environ["MICROSOFT_GRAPH_CLIENT_ID"],
"client_secret": os.environ["MICROSOFT_GRAPH_CLIENT_SECRET"],
"grant_type": "client_credentials",
"scope": "https://graph.microsoft.com/.default",
}
r = httpx.post(f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token", data=data, timeout=30)
print("status", r.status_code)
body = r.json()
# Print ONLY safe fields — never the access_token value
print("ok" if "access_token" in body else "FAIL",
"expires_in", body.get("expires_in"),
"error", body.get("error"))
PY'
Expected: status 200 and ok expires_in 3599 error None. A status 401 / error invalid_client (AADSTS7000215: Invalid client secret provided) means the env still holds the old/wrong value; re-check Step 3.
Verification: Output begins status 200 and contains ok. The script deliberately prints only expires_in/error, never the token. This mirrors the production auth path in graph_client.py::_get_access_token() (MSAL acquire_token_for_client against login.microsoftonline.com/<tenant>/oauth2/v2.0/token).
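The pass/fail reading of the token response can be captured in a small pure function, which is handy if the check is ever scripted. This is a sketch only: summarize_token_response is a hypothetical helper, not part of graph_client.py, and it deliberately never returns the access_token value.

```python
from typing import Any, Mapping


def summarize_token_response(status_code: int, body: Mapping[str, Any]) -> str:
    """Reduce a token-endpoint response to a one-line summary that is safe to log."""
    if status_code == 200 and "access_token" in body:
        # Report only the lifetime, never the token itself
        return f"ok expires_in={body.get('expires_in')}"
    error = body.get("error")
    if error == "invalid_client":
        return "FAIL invalid_client: re-check the secret written in Step 3"
    return f"FAIL status={status_code} error={error}"


if __name__ == "__main__":
    print(summarize_token_response(200, {"access_token": "redacted", "expires_in": 3599}))
    print(summarize_token_response(401, {"error": "invalid_client"}))
```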
Step 6: Verify a live inbound email still ingests
A token round-trip proves auth; this proves the subscription + ingest pipeline still works end-to-end with the new secret.
- From a verified sender address, send a short email to the household assistant mailbox (family@curate-me.ai, or the consumer plus-addressed alias assistant+<token>@curate-me.ai per M365_CONSUMER_INGEST_ADDRESS).
- Watch the backend log for the inbound webhook + dispatch:
  ssh curateme@178.105.8.25 "docker logs curateme-backend-b2b --since 5m 2>&1 | grep -iE 'graph_subscription|inbound|graph_message|approval|capture'"
- Confirm a new pending approval was created for the org (it should appear in the Family Manager app’s inbox / pending-approval queue).
If the subscription expired during the dead-secret window, re-arm it (idempotent) — this is the same call scripts/fm_m365_wire_org.py makes via ProvisioningService.ensure_inbound_subscription, and the 12-hourly m365_subscription_renewal beat task also covers it (it requires MICROSOFT_GRAPH_* plus M365_WEBHOOK_CLIENT_STATE, and for the consumer lane M365_CONSUMER_INGEST_USER_ID).
The wiring script lives at the repo root scripts/fm_m365_wire_org.py and is NOT baked into the backend image (the image build context is services/backend/, so /app/scripts/ only contains services/backend/scripts/*). It also imports from src.*, so it must run inside the container with PYTHONPATH=/app. Copy the repo-root script into the running container, then exec it:
# 1. Copy the repo-root wiring script into the container
ssh curateme@178.105.8.25 "docker cp ~/platform/scripts/fm_m365_wire_org.py curateme-backend-b2b:/tmp/fm_m365_wire_org.py"
# 2. Exec it inside the container (PYTHONPATH=/app so `src.*` imports resolve)
ssh curateme@178.105.8.25 'docker exec -e PYTHONPATH=/app curateme-backend-b2b \
python /tmp/fm_m365_wire_org.py --org org_xxx \
--mailbox family@curate-me.ai --m365-user-id <graph-user-guid>'
(Replace org_xxx and <graph-user-guid> with the household org id and the mailbox’s M365 user GUID; the script is idempotent and adopts the already-active identity without re-creating the Graph user.)
Verification: The log shows a graph_subscription/inbound dispatch line and a new pending approval exists for the household org. No graph_* lines log 401/invalid_client.
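The two log conditions in this verification (a dispatch line is present, and no graph_* line reports an auth failure) can be checked mechanically. The sketch below assumes free-form text log lines. The real log format is not specified in this runbook, so the patterns and the scan_backend_log name are illustrative.

```python
import re
from typing import Iterable, List, Tuple

# Patterns mirror the keywords named in this step; the actual log format is an assumption.
DISPATCH = re.compile(r"graph_subscription|inbound", re.IGNORECASE)
AUTH_FAILURE = re.compile(r"\b401\b|invalid_client", re.IGNORECASE)


def scan_backend_log(lines: Iterable[str]) -> Tuple[bool, List[str]]:
    """Return (dispatch_seen, auth_failure_lines) for a batch of log lines."""
    dispatch_seen = False
    failures: List[str] = []
    for line in lines:
        if DISPATCH.search(line):
            dispatch_seen = True
        # Only graph_* lines count as Graph auth failures
        if "graph_" in line.lower() and AUTH_FAILURE.search(line):
            failures.append(line.rstrip())
    return dispatch_seen, failures
```

A passing run is one where dispatch_seen is True and the failure list is empty. The log lines could come from the docker logs command in this step, piped to a file and read line by line.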
Step 7: Delete the old client secret in Entra
Only after Steps 5 and 6 pass, remove the superseded secret so a leaked old value cannot be used:
- Back in Entra → Curate-Me Agent Identity → Certificates & secrets → Client secrets.
- Find the previous secret (older Expires/created date, NOT the one you just made).
- Click the trash icon → Delete.
Verification: Only the new secret remains in the Client secrets list. Re-run the Step 5 token round-trip once more — it must still return status 200 / ok, confirming the live secret is the new one and nothing depended on the deleted secret.
Rollback / If it goes wrong
Symptom: token round-trip returns 401 invalid_client after the swap. The env holds a bad value (truncated paste, trailing newline, wrong app). The old secret is still valid until Step 7, so:
- Restore the previous env file from the timestamped backup made in Step 3:
  ssh curateme@178.105.8.25 "cd ~/platform && ls -t .env.production.bak.* | head -1" # find newest backup
  ssh curateme@178.105.8.25 "cd ~/platform && cp .env.production.bak.<epoch> .env.production"
- Recreate the containers again (Step 4 fast path) so they reload the known-good secret.
- Re-run Step 5. Once green, re-attempt the new-secret capture (Step 2) being careful with the clipboard, then re-do Steps 3–6.
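Before re-uploading a recaptured secret, it can help to inspect the local file for the bad-paste cases named above without ever printing its contents. The following is a minimal sketch: inspect_secret_file is a hypothetical helper, and it assumes a POSIX filesystem where the 600 mode is meaningful.

```python
import stat
from pathlib import Path
from typing import Dict, Union


def inspect_secret_file(path: Union[str, Path]) -> Dict[str, Union[bool, int]]:
    """Report permissions and whitespace facts about a secret file, never its value."""
    p = Path(path).expanduser()
    raw = p.read_bytes()
    stripped = raw.strip()
    mode = stat.S_IMODE(p.stat().st_mode)
    return {
        "mode_is_600": mode == 0o600,
        "byte_count": len(raw),
        "stripped_byte_count": len(stripped),
        # Leading/trailing whitespace, e.g. a trailing newline from the paste
        "has_surrounding_whitespace": raw != stripped,
        # Whitespace inside the value usually means the wrong thing was copied
        "has_inner_whitespace": any(ch in stripped for ch in b" \t\r\n"),
    }
```

The Step 3 splice strips leading and trailing whitespace, so surrounding whitespace alone is harmless there. Inner whitespace or an unexpectedly small stripped_byte_count is the stronger hint that the clipboard held something other than the secret Value.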
Symptom: outbound agent email stopped (no Graph send) but inbound is fine. Confirm the secret env var is non-empty (Step 3 verification) and that backend-b2b was actually recreated (Step 4 verification — check the uptime is recent). With a dead secret, send_email() silently falls back to Resend/SES/SMTP/file, so customers still get mail but not from the M365 mailbox.
Symptom: inbound emails stopped creating approvals. The change-notification subscription likely lapsed (Graph mail subscriptions are short-lived; the renewal task needs a valid secret). Re-run the Step 6 fm_m365_wire_org.py ensure_inbound_subscription call after the new secret is live.
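Whether a subscription has lapsed, or is about to, can be judged from the expirationDateTime field on the Graph subscription resource. The sketch below is an assumption-laden illustration and not the renewal task's code: needs_renewal and parse_graph_datetime are hypothetical names, and the 12-hour margin simply mirrors the beat cadence mentioned in this runbook.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_graph_datetime(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-06-14T18:00:00.1234567Z."""
    # Drop fractional seconds so older Python versions can parse the value
    trimmed = re.sub(r"\.\d+", "", value.strip())
    if trimmed.endswith("Z"):
        trimmed = trimmed[:-1] + "+00:00"
    parsed = datetime.fromisoformat(trimmed)
    # Treat a timestamp without an offset as UTC
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def needs_renewal(
    expiration: str, now: datetime, margin: timedelta = timedelta(hours=12)
) -> bool:
    """True if the subscription has expired or expires within the margin."""
    return parse_graph_datetime(expiration) - now <= margin
```

If this returns True for the household subscription after a dead-secret window, re-running the Step 6 wiring call is the remedy described above.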
Never roll back by deleting the new Entra secret while the env still references it — that would leave both copies dead. Always restore the env first, verify, then clean up Entra.
After any rollback, shred the temp upload if it survived: ssh curateme@178.105.8.25 "shred -u /tmp/graph_secret.new 2>/dev/null || rm -f /tmp/graph_secret.new". If the secret value was ever exposed (printed, logged, pasted), treat it as leaked and immediately rotate again from Step 1.
Related
- Agent Identity Provisioning: first-time M365 user creation, aliases, and the Insufficient privileges / Invalid client secret troubleshooting table
- OAuth Token Rotation: general token-rotation pattern
- Webhook Disaster Recovery: recovering Graph change-notification subscriptions
- Agent Identity Design Doc: full architecture, §11 (Phase 0) referenced by graph_client.py
- services/backend/src/services/agent_identity/graph_client.py: GraphConfig.from_env() (env var contract) and _get_access_token() (the live token round-trip)
- scripts/fm_m365_wire_org.py: idempotent org wiring + ensure_inbound_subscription (repo-root script; not baked into the backend image; docker cp it into the container to run, see Step 6)
- services/backend/src/tasks/m365_subscription_renewal.py: the 12-hourly subscription-renewal beat task that also depends on MICROSOFT_GRAPH_CLIENT_SECRET