Runbook: Family Manager Inbound Email Not Arriving

Owner: Family Manager Team
Backup owner: Platform on-call engineer
Last validated: 2026-06-14
Validation method: Live E2E on the curate-me.ai consumer-ingest mailbox (fm_inbound_e2e.py + real forwarded email → pending approval)
Severity trigger: SEV3 (a household’s forwarded email never becomes a proposal; no outage, no data loss)
Customer impact: Forwarded emails are silently dropped, so the household sees no new approval card; capture-by-photo/voice and the direct app path are unaffected
Required access: SSH (VPS 178.105.8.25), docker exec curateme-backend-b2b, MongoDB (b2b database), M365 admin center (Graph subscription recreate)
Related services: curateme-backend-b2b, curateme-celery-worker (beat), Microsoft Graph API, MongoDB, Redis
Time to complete: ~10–20 minutes

When a household forwards an email to its assistant address (assistant+<household-tag>@curate-me.ai on the shared M365 ingest lane, or <token>@in.curate-me.ai on the legacy Resend lane), it should become a pending approval card in the app. The happy path is: Microsoft Graph fires a change-notification webhook → the inbound dispatcher routes the message into the alias pipeline → the result is logged as ingest_routed_to_alias_pipeline with status=capture_created.

A forwarded email that never becomes a proposal has dropped out at one of six gates. Work through them in order: each is cheap to check, and clearing a gate rules it out before you move to the next. Every gate logs a content-free machine code, so the whole diagnosis is docker logs … | grep. No sender address, subject, or body is ever logged; you are matching on opaque ids and codes only.

All the greps below run against the B2B app container, which is where both the webhook route and the mobile inbound pipeline live:


ssh curateme@178.105.8.25
docker logs curateme-backend-b2b --since 30m 2>&1 | grep <event>
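The gate order described above can be sketched as a small lookup over captured log text. This is a hypothetical helper, not part of the codebase: only the log codes come from this runbook, while the table, the function name, and the assumption that a plain substring match is enough are illustrative.

```python
# Hypothetical triage helper. The codes are the ones named in this runbook;
# GATE_CODES and first_failing_gate are illustrative names, not real code.
GATE_CODES = [
    ("m365_ingest_subscription_no_client_state", "Step 1"),
    ("m365_subscription_renew_failed", "Step 1"),
    ("m365_ingest_subscription_renew_failed", "Step 1"),
    ("webhook_client_state_mismatch", "Step 2"),
    ("webhook_no_user_id", "Step 2"),
    ("webhook_no_message_id", "Step 2"),
    ("webhook_empty_payload", "Step 2"),
    ("webhook_invalid_json", "Step 2"),
    ("ingest_message_unroutable", "Step 3"),
    ("ingest_alias_routing_failed", "Step 3"),
    ("unverified_sender", "Step 4"),
    ("sender_auth_failed", "Step 5"),
    ("feature_disabled", "Step 6"),
]


def first_failing_gate(log_lines):
    """Return (step, code) for the earliest gate whose failure code appears.

    Gates are checked in runbook order, not in log order. Returns None when
    no known failure code is present in the given lines.
    """
    text = "\n".join(log_lines)
    for code, step in GATE_CODES:
        if code in text:
            return step, code
    return None
```

Feeding it the output of the docker logs command (for example, saved to a file and split into lines) gives a starting step; it does not replace reading the matching section.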

Prerequisites

Before starting, confirm you have:

SSH to the VPS — ssh curateme@178.105.8.25
The org id of the affected household (org_…) — needed for the per-org gates (Steps 4 and 6) and the Mongo lookups
Graph env configured — MICROSOFT_GRAPH_TENANT_ID, MICROSOFT_GRAPH_CLIENT_ID, MICROSOFT_GRAPH_CLIENT_SECRET set in ~/platform/.env.production (the Graph client secret value lives in ~/Documents/fm-dogfood/m365_graph_secret.txt on the operator machine — never paste the value into a ticket)
The ingest lane armed — M365_CONSUMER_INGEST_USER_ID and M365_CONSUMER_INGEST_ADDRESS set in ~/platform/.env.production. BOTH must be present or the shared-mailbox ingest branch never matches
M365_WEBHOOK_CLIENT_STATE set in ~/platform/.env.production — without it the webhook rejects every notification AND no subscription can be created
An approximate timestamp of the forward — narrow --since so the greps are readable
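The environment prerequisites above can be checked without printing any secret. The sketch below is a hypothetical helper that assumes ~/platform/.env.production uses plain KEY=VALUE lines (optionally prefixed with export); it reports only the names that are absent or empty, never a value.

```python
# Variable names are taken from the Prerequisites list; the helper itself
# and the KEY=VALUE file-format assumption are illustrative.
REQUIRED_VARS = (
    "MICROSOFT_GRAPH_TENANT_ID",
    "MICROSOFT_GRAPH_CLIENT_ID",
    "MICROSOFT_GRAPH_CLIENT_SECRET",
    "M365_CONSUMER_INGEST_USER_ID",
    "M365_CONSUMER_INGEST_ADDRESS",
    "M365_WEBHOOK_CLIENT_STATE",
)


def missing_vars(env_text, required=REQUIRED_VARS):
    """Return the required names that are absent or empty in env_text."""
    present = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        # Treat an empty or quote-only value as not set.
        if value.strip().strip("'\""):
            present.add(name)
    return [name for name in required if name not in present]
```

An empty result means all six names are set; anything else is safe to paste into a ticket because it contains names only.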

Step 1: Is the Graph subscription alive?

Graph change-notification subscriptions expire after at most ~70 hours (GraphClient.create_subscription requests 4200 minutes). When a subscription lapses, the webhook simply never fires again — and because nothing errors, this is the most common silent failure. The m365-subscription-renewal beat task (src.tasks.m365_subscription_renewal.renew_m365_subscriptions, scheduled crontab(minute=10, hour="*/12") in src/celery_app.py — every 12 hours) keeps it alive by PATCH-renewing each subscription, and recreating on a Graph 404.
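To make the timing concrete, here is a minimal arithmetic sketch. It assumes that each successful renewal resets the expiry to the same 4200-minute lifetime and that renewals happen only on sweep ticks; neither is confirmed by this runbook, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# From the runbook: create_subscription requests 4200 minutes (70 hours).
SUBSCRIPTION_LIFETIME = timedelta(minutes=4200)
# From the runbook: crontab(minute=10, hour="*/12"), i.e. every 12 hours.
RENEWAL_INTERVAL = timedelta(hours=12)


def expiry(renewed_at):
    """Expiry time of a subscription created or renewed at renewed_at."""
    return renewed_at + SUBSCRIPTION_LIFETIME


def missed_sweeps_tolerated():
    """Consecutive failed sweeps a fresh subscription survives.

    Sweeps land at 12h, 24h, ... after the last renewal. The last sweep
    that falls before expiry must succeed, so one is subtracted.
    """
    return SUBSCRIPTION_LIFETIME // RENEWAL_INTERVAL - 1
```

Under those assumptions a subscription renewed at 00:10 UTC lapses 2 days and 22 hours later, and up to four consecutive sweeps can fail before mail stops notifying. A sweep that has been silent for more than two days is therefore the first thing to suspect.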

First, confirm the beat task is actually running and what it did on its last pass:


docker logs curateme-backend-b2b --since 24h 2>&1 | grep m365_subscription_renewal_sweep

A healthy sweep logs m365_subscription_renewal_sweep with scanned/renewed/recreated/created/failed counts. If you see no sweep line in the last 12h, celery beat is not running the task; check the curateme-celery-worker container. If the sweep returns {"skipped": 1}, the Graph env is not configured (see Prerequisites). Watch for m365_subscription_renew_failed / m365_ingest_subscription_renew_failed with a non-404 error: those indicate a Graph credential or permission problem, not an expiry.
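The reading of a sweep result can be summarised as a small decision function. This is a sketch only: the count names come from the log line described above, but the exact shape of the returned dict and the function name are assumptions.

```python
def interpret_sweep(result):
    """Classify a renewal sweep result dict (assumed shape, see lead-in)."""
    if result.get("skipped"):
        # The runbook ties {"skipped": 1} to missing Graph env vars.
        return "graph_env_not_configured"
    if result.get("failed", 0) > 0:
        # Follow up by grepping for the *_renew_failed codes.
        return "renewal_failures"
    touched = sum(result.get(key, 0) for key in ("renewed", "recreated", "created"))
    if touched == 0:
        # Nothing failed, but nothing was kept alive either.
        return "nothing_renewed"
    return "healthy"
```

For example, `interpret_sweep({"scanned": 2, "renewed": 2, "failed": 0})` returns `"healthy"`, which matches the verification criterion for this step.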

For the shared consumer-ingest mailbox (the one that has no agent_identities row), the renewal lives in _ensure_ingest_subscription, keyed on M365_CONSUMER_INGEST_USER_ID. A healthy create/renew logs m365_ingest_subscription_ensured; a missing client state logs m365_ingest_subscription_no_client_state.

To force a renewal sweep immediately instead of waiting for the next beat tick:


docker exec -e PYTHONPATH=/app curateme-backend-b2b \
  python -c "from src.tasks.m365_subscription_renewal import renew_m365_subscriptions; print(renew_m365_subscriptions())"

If a per-org premium identity’s subscription is wedged and you want to recreate it directly, re-run the one-time wiring script; its ensure_inbound_subscription step is idempotent and recreates the Graph subscription. NOTE: fm_m365_wire_org.py lives in scripts/ at the repo root (~/platform/scripts/fm_m365_wire_org.py on the VPS) and is NOT inside the backend image (the container build context is services/backend/), so copy it into the container first, then exec it:


docker cp ~/platform/scripts/fm_m365_wire_org.py \
  curateme-backend-b2b:/app/scripts/fm_m365_wire_org.py
docker exec -e PYTHONPATH=/app curateme-backend-b2b \
  python /app/scripts/fm_m365_wire_org.py --org org_xxx \
  --mailbox family@curate-me.ai --m365-user-id <graph-user-guid>

Verification: The sweep output (or the forced run) shows renewed or created/recreated ≥ 1 with failed: 0, and a follow-up m365_subscription_ensured / m365_ingest_subscription_ensured line appears. Then ask the household to re-forward; proceed to Step 2 to confirm the webhook now fires.

Step 2: Did the webhook actually fire?

If the subscription is alive, Graph should POST the notification to https://api.curate-me.ai/api/v1/platform/agent-identity/m365-webhook (the default in provisioning_service._DEFAULT_WEBHOOK_NOTIFICATION_URL, overridable via M365_WEBHOOK_NOTIFICATION_URL). Each accepted notification logs:


docker logs curateme-backend-b2b --since 15m 2>&1 | grep webhook_notification_received

This line carries m365_user_id and message_id. If you see it, the webhook fired and dispatch began — skip to Step 3.

If you see nothing, check these failure modes:

webhook_client_state_mismatch — Graph’s clientState does not equal M365_WEBHOOK_CLIENT_STATE on the server. The subscription was created with a different secret than the one currently deployed. Re-run ensure_inbound_subscription (Step 1) so the subscription is recreated with the current M365_WEBHOOK_CLIENT_STATE. The route returns 200 on mismatch (Graph would otherwise retry), so this is silent without the grep.
webhook_no_user_id — Graph sent a resource path the dispatcher could not parse. _extract_user_id expects Users/<id>/Messages/<id> (live Graph capitalizes and drops mailFolders); a structural change here would surface as this warning.
webhook_no_message_id / webhook_empty_payload / webhook_invalid_json — malformed or empty Graph payloads; usually transient.
webhook_validation_handshake appears but no notifications follow — the subscription is being (re)created (the handshake echoes Graph’s validationToken) but real mail isn’t notifying yet; give it a minute and re-check.
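Two of the failure modes above, the clientState comparison and the resource-path parse, can be illustrated as follows. This is not the real _extract_user_id or the real route code; the regular expression, the tolerance for an optional mailFolders segment, and both function names are assumptions made for the sketch.

```python
import hmac
import re

# Accepts Users/<id>/Messages/<id> in any letter case, with an optional
# mailFolders/<name>/ segment (an assumption, for older-style paths).
_RESOURCE_RE = re.compile(
    r"^users/([^/]+)/(?:mailfolders/[^/]+/)?messages/([^/]+)$",
    re.IGNORECASE,
)


def parse_resource(resource):
    """Return (user_id, message_id), or None if the path is not recognised."""
    match = _RESOURCE_RE.match(resource.strip("/"))
    return (match.group(1), match.group(2)) if match else None


def client_state_ok(received, expected):
    """Constant-time comparison of the notification's clientState.

    An unset expected value rejects everything, mirroring the prerequisite
    that M365_WEBHOOK_CLIENT_STATE must be configured.
    """
    if not expected:
        return False
    return hmac.compare_digest((received or "").encode(), expected.encode())
```

A None from `parse_resource` corresponds to the webhook_no_user_id case, and a False from `client_state_ok` to webhook_client_state_mismatch.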

Confirm the path is public-path-exempt. The webhook is unauthenticated (Graph calls it with the clientState shared secret, not a JWT), so it MUST be in the tenant-isolation allowlist. Verify the exact path is present:


docker exec curateme-backend-b2b \
  grep -n "agent-identity/m365-webhook" /app/src/middleware/tenant_isolation.py

You should see "/api/v1/platform/agent-identity/m365-webhook" in PUBLIC_PATHS. If a refactor dropped it, the middleware 401s Graph before the route runs — Graph would retry then give up, and you would see no webhook_notification_received at all. (This is a documented past prod gap: the webhook was not in PUBLIC_PATHS and crash-looped silently.)

Verification: webhook_notification_received appears for a freshly forwarded email, and agent-identity/m365-webhook is present in PUBLIC_PATHS.

Step 3: Was the recipient a valid plus-tag (recipient match)?

The webhook fired, so now the dispatcher must recognize the message as ingest mail. In process_inbound_message, the shared-ingest branch is checked FIRST (before any identity lookup): if m365_user_id == M365_CONSUMER_INGEST_USER_ID, the message routes to _route_ingest_mailbox_message. The pipeline result logs:


docker logs curateme-backend-b2b --since 15m 2>&1 | grep ingest_routed_to_alias_pipeline

On the happy path this carries status=capture_created (and after handling, the message is deleted from the shared mailbox → graph_message_deleted, so household mail never accumulates in the shared inbox). Other branches:

ingest_message_unroutable (has_sender / recipient_count) — _match_ingest_recipients found no assistant+<tag>@<domain> recipient in To/Cc. This means the email was BCC’d to the assistant (BCC carries no routable recipient header — dropped by design, content-free), OR the forward went to a tag that does not match M365_CONSUMER_INGEST_ADDRESS. Confirm M365_CONSUMER_INGEST_ADDRESS matches the real shared mailbox (e.g. assistant@curate-me.ai) and that the household forwarded TO (not BCC’d) assistant+<their-tag>@curate-me.ai. The household’s actual tag is the address field on their active mobile_inbound_aliases row:
```
docker exec curateme-backend-b2b mongosh "$MONGO_URI" --quiet --eval \
  'db.getSiblingDB("curate_dashboard").mobile_inbound_aliases.findOne({org_id:"org_xxx", status:"active"}, {address:1})'
```
ingest_routed_to_alias_pipeline status=rejected with a rejection_code — the tag resolved to an org but a later gate rejected it; follow the rejection_code into Steps 4–6 (unknown_alias, not_consumer_household, feature_disabled, unverified_sender, sender_auth_failed, rate_limited).
ingest_alias_routing_failed — an unexpected pipeline crash; the mailbox copy is intentionally KEPT (the delete is skipped) so the message can be replayed once the fault is fixed. Investigate the accompanying traceback.
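The recipient match behind ingest_message_unroutable can be sketched like this. It is a simplified stand-in for _match_ingest_recipients, which may differ: the case-insensitive comparison and the function name are assumptions, and the caller is assumed to pass only To and Cc addresses (BCC never appears in those headers).

```python
def match_ingest_tags(recipients, ingest_address):
    """Return the plus-tags addressed to the shared ingest mailbox.

    recipients: To and Cc addresses of the message.
    ingest_address: the value of M365_CONSUMER_INGEST_ADDRESS.
    """
    local, _, domain = ingest_address.strip().lower().partition("@")
    prefix = local + "+"
    tags = []
    for address in recipients:
        r_local, _, r_domain = address.strip().lower().partition("@")
        # Require the same domain, the "<local>+" prefix, and a non-empty tag.
        if r_domain == domain and r_local.startswith(prefix) and len(r_local) > len(prefix):
            tags.append(r_local[len(prefix):])
    return tags
```

An empty list is the unroutable case: either the mail was BCC’d, or the address does not share the local part and domain of M365_CONSUMER_INGEST_ADDRESS.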

If the household has a premium per-org mailbox instead of the shared lane, the routing log is inbound_routed_to_consumer_capture (with status / rejection_code), and a non-premium/downgraded household’s mail is dropped content-free as inbound_consumer_household_mail_dropped. A B2B org’s mail (wrong tenancy) logs mobile_inbound_not_consumer_household.

Verification: ingest_routed_to_alias_pipeline shows status=capture_created. If it shows status=rejected, note the rejection_code and continue to the matching step below.

Step 4: Is the sender a VERIFIED sender for the org?

The inbound pipeline only ingests mail from senders the household has explicitly verified (mobile_allowed_senders rows with status="verified", gated by inbound_alias.is_verified_sender). An unknown or unverified From address is rejected as unverified_sender. The rejection records ONE metadata-only row (hashed sender + subject hash + ≤40-char redacted preview), never the body.


docker logs curateme-backend-b2b --since 15m 2>&1 | grep -E "mobile_inbound_rejected|unverified_sender"

A mobile_inbound_rejected line with rejection_code=unverified_sender means the forwarding parent’s address is not on the household’s verified-sender allowlist. Confirm what is verified for the org:


docker exec curateme-backend-b2b mongosh "$MONGO_URI" --quiet --eval \
  'db.getSiblingDB("curate_dashboard").mobile_allowed_senders.find({org_id:"org_xxx"}, {email:1, status:1, member_id:1}).toArray()'

Fix: Have the household add and verify the sender in the app (Settings → Forwarding senders), which mails a single-use signed link (48h TTL). On success the server logs mobile_inbound_sender_verified. A verified sender whose member row is inactive is rejected as member_inactive; over-limit verified traffic is rejected as rate_limited (per-sender cap 50/day, per-alias 100/day).

Verification: The sender shows status: "verified" in mobile_allowed_senders for the org. Re-forward; the rejection should no longer appear.
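The sender checks in this step can be read as an ordered gate. The sketch below uses the rejection codes and daily caps quoted above; the ordering of the checks, the treatment of a count equal to the cap as over-limit, and the function signature are all assumptions.

```python
PER_SENDER_DAILY_CAP = 50   # from the runbook
PER_ALIAS_DAILY_CAP = 100   # from the runbook


def sender_gate(sender_status, member_active, sender_count_today, alias_count_today):
    """Return a rejection code, or None when the sender may proceed.

    sender_status: status of the mobile_allowed_senders row, or None when
    the From address has no row for the org.
    """
    if sender_status != "verified":
        return "unverified_sender"
    if not member_active:
        return "member_inactive"
    # Assumption: reaching the cap blocks the next message.
    if sender_count_today >= PER_SENDER_DAILY_CAP or alias_count_today >= PER_ALIAS_DAILY_CAP:
        return "rate_limited"
    return None
```

This makes the practical point of the step explicit: adding a sender row is not enough, because a pending row, an inactive member, or a burst of forwards each produce a different rejection code.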

Step 5: Did SPF/DMARC fail (spoofed From)?

Even a verified-sender match is rejected if the message’s authentication verdict is an explicit failure — the From header is trivially spoofable over SMTP. Exchange Online stamps an Authentication-Results header on every delivered message; _graph_auth_verdicts parses spf=… / dmarc=… from it, and _sender_auth_failed rejects when either verdict is in {fail, permerror} (_SENDER_AUTH_FAIL_VERDICTS). Missing/unknown verdicts deliberately fail OPEN so a provider that omits the fields doesn’t brick every forward.


docker logs curateme-backend-b2b --since 15m 2>&1 | grep -E "mobile_inbound_rejected|sender_auth_failed"

A rejection_code=sender_auth_failed means the sender is on the allowlist but the message failed SPF or DMARC — i.e. the From address was forged, OR (benign) the parent’s mail provider has a real SPF/DMARC misconfiguration and Exchange stamped a fail. This is a correct security rejection: do NOT disable the check. Resolution is to have the legitimate parent forward from a domain that passes SPF/DMARC (most consumer providers — gmail.com, outlook.com — pass), or fix the sending domain’s DNS. There is no per-message override; the gate is invariant.

Verification: A re-forward from a properly-authenticated mailbox of the same verified sender no longer logs sender_auth_failed and proceeds to capture_created.
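The fail-open behaviour of the SPF/DMARC check in this step is easiest to see in code. The following is a sketch, not the real _graph_auth_verdicts or _sender_auth_failed: the regular expression, the choice to keep the first verdict seen for each method, and the function names are assumptions. Only the fail set {fail, permerror} is taken from the runbook.

```python
import re

_SENDER_AUTH_FAIL_VERDICTS = frozenset({"fail", "permerror"})
# Matches "spf=<verdict>" and "dmarc=<verdict>" tokens in the header value.
_VERDICT_RE = re.compile(r"\b(spf|dmarc)=([a-z]+)", re.IGNORECASE)


def auth_verdicts(header):
    """Extract {"spf": ..., "dmarc": ...} from an Authentication-Results value."""
    verdicts = {}
    for method, verdict in _VERDICT_RE.findall(header or ""):
        # Assumption: the first verdict per method wins.
        verdicts.setdefault(method.lower(), verdict.lower())
    return verdicts


def sender_auth_is_explicit_failure(header):
    """True only on an explicit fail/permerror; missing verdicts fail open."""
    verdicts = auth_verdicts(header)
    return any(verdicts.get(method) in _SENDER_AUTH_FAIL_VERDICTS for method in ("spf", "dmarc"))
```

Note that softfail, none, and a missing header all pass in this sketch, which is consistent with the statement that only explicit failures are rejected.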

Step 6: Is the FM_INBOUND_EMAIL flag off for this org?

Family Manager is explicit-OFF in production — every FM_* flag (including fm_inbound_email) defaults False when ENVIRONMENT resolves to production (feature_flags.py). The pipeline’s real tenant gate is the org-aware check at step 3.2 of process_inbound: resolution is per-org override (org_feature_flag_overrides) → FF_FM_INBOUND_EMAIL env var → environment default. A non-allowlisted production household is rejected with ZERO side effects as feature_disabled:


docker logs curateme-backend-b2b --since 15m 2>&1 | grep -E "mobile_inbound_feature_disabled_for_org|feature_disabled"

A mobile_inbound_feature_disabled_for_org org_id=org_xxx line means the household is not on the beta allowlist. Enable it with the sanctioned per-household tool (writes org_feature_flag_overrides rows; refuses any org that is not a consumer_household):


docker exec -e PYTHONPATH=/app curateme-backend-b2b \
  python /app/scripts/fm_beta_enable.py org_xxx --flags fm_inbound_email

(Note: a premium per-org M365 mailbox household is gated separately by FM_PREMIUM_IDENTITY, not FM_INBOUND_EMAIL. If that household’s mail logs mobile_inbound_m365_not_enabled / feature_not_enabled, enable fm_premium_identity — which fm_beta_enable.py refuses unless --force, by design — or use fm_m365_wire_org.py from Step 1, which sets that override as step 1 of its wiring.)

Verification: Re-running fm_beta_enable.py org_xxx --json shows fm_inbound_email: true for the org. A re-forward logs ingest_routed_to_alias_pipeline status=capture_created and a new pending approval appears in the app.
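The resolution order described in this step (per-org override, then FF_FM_INBOUND_EMAIL, then the environment default) can be sketched as below. This is not the code in feature_flags.py: the set of truthy strings, the default of True outside production, and the function signature are assumptions made for illustration.

```python
def resolve_flag(flag, org_overrides, env, environment):
    """Resolve a flag such as "fm_inbound_email" for one org.

    org_overrides: mapping of flag name to bool for this org (standing in
    for org_feature_flag_overrides rows).
    env: mapping of environment variables.
    environment: the resolved ENVIRONMENT value.
    """
    # 1. A per-org override wins, whether it enables or disables.
    if flag in org_overrides:
        return bool(org_overrides[flag])
    # 2. Then the FF_<FLAG> environment variable, if set.
    raw = env.get("FF_" + flag.upper())
    if raw is not None:
        return raw.strip().lower() in ("1", "true", "yes", "on")
    # 3. Finally the environment default: explicit-OFF in production.
    return environment != "production"
```

The consequence for triage is that a production household with no override row and no FF_FM_INBOUND_EMAIL set resolves to off, which is exactly the feature_disabled rejection.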

If it goes wrong / Rollback

Nothing in this runbook changes mail content — every fix is a subscription recreate, a verified-sender add, or a flag override. None mutate or delete a household’s captures or approvals.
fm_beta_enable.py is reversible: disable with ... org_xxx --disable (or --flags fm_inbound_email --disable). Disabling only flips the override row; it never deletes data.
A forced renewal sweep is idempotent — re-running renew_m365_subscriptions or fm_m365_wire_org.py will not create duplicate subscriptions (ensure_inbound_subscription short-circuits on an existing subscription unless force=True, and the renewal task only forces on a Graph 404).
If a message crashed mid-pipeline (ingest_alias_routing_failed), the shared-mailbox copy was intentionally kept (delete skipped). Graph will not re-notify, so once the fault is fixed, ask the household to re-forward — the deterministic _id idempotency guard means a true duplicate can never double-capture.
Never disable the SPF/DMARC gate (Step 5) or the verified-sender gate (Step 4) to “make it work” — both are load-bearing anti-spoof / anti-abuse invariants. The correct fix is always to fix the sender, not the gate.
Do not put secrets in tickets. The Graph client secret lives in ~/Documents/fm-dogfood/m365_graph_secret.txt (operator machine) and ~/platform/.env.production (VPS); M365_WEBHOOK_CLIENT_STATE lives only in ~/platform/.env.production. Reference the env var name, never the value.
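The idempotency guard mentioned in the rollback notes relies on a deterministic _id. The runbook does not say what the id is derived from, so the sketch below is purely illustrative: the inputs (org id plus a message key), the hash, the "cap_" prefix, and the in-memory store standing in for MongoDB’s unique _id index are all hypothetical.

```python
import hashlib


def capture_id(org_id, message_key):
    """Derive a stable id from hypothetical inputs; same inputs, same id."""
    digest = hashlib.sha256(f"{org_id}\x00{message_key}".encode()).hexdigest()
    return "cap_" + digest[:32]


class CaptureStore:
    """In-memory stand-in for a collection with a unique _id."""

    def __init__(self):
        self._docs = {}

    def insert_once(self, doc):
        """Insert doc unless its _id already exists; return True if inserted."""
        if doc["_id"] in self._docs:
            return False
        self._docs[doc["_id"]] = doc
        return True
```

Because the id depends only on its inputs, replaying the same message produces the same _id and the second insert is a no-op, which is why a re-forward after a crash is safe to request.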

Related

Agent Identity Provisioning: the per-org M365 user provisioning this inbound lane rides on
services/backend/src/services/agent_identity/inbound_dispatcher.py: webhook → ingest routing (_route_ingest_mailbox_message, _match_ingest_recipients)
services/backend/src/services/mobile/inbound_email.py: the 12-step inbound pipeline (verified-sender gate, SPF/DMARC, FM_INBOUND_EMAIL org gate)
services/backend/src/tasks/m365_subscription_renewal.py: the 12-hour subscription-renewal beat task
services/backend/scripts/fm_inbound_e2e.py: signed inbound webhook E2E driver (assert capture/approval creation)
services/backend/scripts/fm_beta_enable.py: per-household FM_* allowlist tool
scripts/fm_m365_wire_org.py: one-time premium-identity wiring (adoption + subscription create)