Runbook: Governance Policy Cascading Denials

Owner: Platform Team
Backup owner: On-call engineer
Last validated: Not yet validated
Validation method: Manual drill
Severity trigger: SEV2
Customer impact: Legitimate requests denied; appears as total outage to affected org
Required access: SSH (VPS), MongoDB, Redis
Related services: curateme-backend-gateway


The governance chain runs 14 checks in sequence, short-circuiting on the first denial. When multiple checks are misconfigured, legitimate requests hit a wall of denials that appears as a total outage to the customer. This runbook helps identify which stage is denying and why.


Symptoms

  • Spike in 403/422/429 errors from the gateway
  • Users report “all requests are being blocked”
  • Dashboard shows high denial rate across all governance steps
  • X-CM-Governance-Denied-Step response header present on most requests
  • Multiple orgs affected simultaneously (suggests global policy, not per-org)

Step 1: Identify the denying stage

The gateway tracks denials by governance stage. This is the single most important diagnostic.

./scripts/analytics snapshot today
# Look at: governance_denials_by_step

Expected output:

{
  "governance_denials_by_step": {
    "rate_limit": 12,
    "cost_estimate": 3,
    "pii_scan": 847,
    "security_scan": 2,
    "model_allowlist": 0,
    "hitl_gate": 1
  }
}

What to look for:

| Pattern | Meaning | Jump to |
| --- | --- | --- |
| One stage has 90%+ of denials | Single misconfigured check | Fix that specific cause below |
| pii_scan dominant | False positive PII patterns | Cause A |
| rate_limit dominant | RPM too low or burst traffic | Cause B |
| cost_estimate or hierarchical_budget dominant | Budget exhausted | Cause C |
| model_allowlist dominant | Model not in allowlist | Cause D |
| security_scan or content_safety dominant | Security scanner false positives | Cause E |
| Denials spread across many stages | Deeper configuration issue | Check org policy (Step 2) |
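The triage table above can be applied mechanically to a snapshot. The following is a minimal Python sketch, not part of the gateway tooling: the function name `triage` and the reuse of the 90% figure as a single threshold for "dominant" are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Stage -> cause section, copied from the triage table.
CAUSE_BY_STAGE = {
    "pii_scan": "Cause A",
    "rate_limit": "Cause B",
    "cost_estimate": "Cause C",
    "hierarchical_budget": "Cause C",
    "model_allowlist": "Cause D",
    "security_scan": "Cause E",
    "content_safety": "Cause E",
}

SPREAD = "Denials spread across stages; check org policy (Step 2)"


def triage(denials: Dict[str, int], threshold: float = 0.9) -> Tuple[Optional[str], str]:
    """Return (dominant_stage, next_action) for a governance_denials_by_step dict."""
    total = sum(denials.values())
    if total == 0:
        return None, "No denials recorded"
    # Pick the stage with the highest denial count.
    stage, count = max(denials.items(), key=lambda kv: kv[1])
    if count / total >= threshold:
        return stage, CAUSE_BY_STAGE.get(stage, "No listed cause; check org policy (Step 2)")
    return None, SPREAD
```

With the expected output shown in Step 1, `pii_scan` accounts for 847 of 865 denials (about 98%), so the sketch points at Cause A.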

Step 2: Review the affected org’s governance policy

curl https://api.curate-me.ai/gateway/admin/governance/policy \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID"

Cross-reference with recent denial logs:

./scripts/errors by-source gateway | grep "governance_denied"

Check if there was a recent policy change:

# Check audit log for policy modifications
# (the URL is quoted so the shell does not treat & as a control operator)
curl "https://api.curate-me.ai/gateway/admin/audit?action=policy_update&org_id=$ORG_ID" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
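The audit query carries two parameters joined by `&`, which a shell treats as a control operator unless the URL is quoted, and an org ID containing reserved characters would also need escaping. If the URL is built from a script instead, a sketch like the one below handles the encoding. The helper name `audit_url` is hypothetical; the endpoint and parameter names are the ones used in the command above.

```python
from urllib.parse import urlencode

AUDIT_ENDPOINT = "https://api.curate-me.ai/gateway/admin/audit"


def audit_url(org_id: str, action: str = "policy_update") -> str:
    """Build the audit-log URL with properly encoded query parameters."""
    query = urlencode({"action": action, "org_id": org_id})
    return f"{AUDIT_ENDPOINT}?{query}"
```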

Step 3: Identify the root cause

Cause A: PII scanner false positives

The PII scanner uses 33 regex patterns plus optional (opt-in) Presidio NER, covering API keys, credentials, financial data, and health data. Newly added patterns or newly enabled HIPAA/PCI modes can cause false positives.

Diagnosis: pii_scan has high denial count. Check what patterns are triggering:

./scripts/errors by-source gateway | grep "pii_detected" | head -20
# Look for: pattern_name, matched_text (redacted), severity
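To see which patterns fire most often, the matching log lines can be tallied by `pattern_name`. The sketch below assumes each log line is a JSON object containing a `pattern_name` field; that format is an assumption, not something this runbook confirms, so adapt the parsing to the real log output. The function name `top_patterns` is illustrative.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_patterns(lines: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Count pattern_name occurrences across log lines (assumed JSON per line)."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(record, dict) and "pattern_name" in record:
            counts[record["pattern_name"]] += 1
    return counts.most_common(n)
```

A single pattern dominating the tally suggests the targeted allowlist fix; many patterns firing together points more toward a recently enabled compliance mode.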

Fix (immediate): Switch PII action to log-only mode for the affected org:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pii_action": "ALLOW"}'

Fix (targeted): Add a per-request allowlist for specific patterns:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pii_allowlist": ["email", "phone"]}'

Fix (long-term): If HIPAA PHI or PCI-DSS mode was recently enabled, verify the org actually needs it. These modes add strict patterns that catch health/financial terminology.

Cause B: Rate limit too low or burst traffic

Diagnosis: rate_limit denials spiking. Check current limits vs usage:

# Check the org's current rate limit and usage
curl https://api.curate-me.ai/gateway/admin/rate-limit/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Tier defaults:

| Tier | RPM Limit |
| --- | --- |
| Free | 10 |
| Starter | 60 |
| Growth | 300 |
| Enterprise | 5,000 |

Fix (immediate): Temporarily raise the limit:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"rate_limit_rpm": 300}'

Fix (long-term): Upgrade the org’s tier or set a custom RPM override.
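When deciding between a tier upgrade and a custom override, it can help to compare the org's observed peak RPM against the tier defaults listed above. This is a rough sketch using only those default values; the function names are hypothetical, and the observed RPM has to come from the rate-limit endpoint or your own metrics.

```python
from typing import Optional

# Tier defaults from the table above, in ascending order.
TIER_RPM = {"Free": 10, "Starter": 60, "Growth": 300, "Enterprise": 5000}


def rpm_headroom(tier: str, observed_rpm: int) -> int:
    """Remaining requests per minute; negative means the tier limit is exceeded."""
    return TIER_RPM[tier] - observed_rpm


def smallest_fitting_tier(observed_rpm: int) -> Optional[str]:
    """Lowest tier whose default limit covers the observed RPM, or None."""
    for tier, limit in TIER_RPM.items():
        if observed_rpm <= limit:
            return tier
    return None  # beyond Enterprise defaults: needs a custom override
```

For example, an org on Starter peaking at 75 RPM is 15 over its limit, and Growth is the smallest tier whose default covers it.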

Cause C: Budget exhausted

The gateway enforces budgets at three levels: per-request max cost, daily budget (flat), and hierarchical budget (Org → Team → Key).

Diagnosis: cost_estimate or hierarchical_budget denials. Check spend:

./scripts/analytics costs today
# Check daily_cost vs daily_budget

Fix (immediate): Raise the daily budget:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"daily_budget": 200, "max_cost_per_request": 5.00}'

Fix (if hierarchical): Check and adjust hierarchy nodes:

curl https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# Look for exhausted nodes: team or key budgets may be hit even if org budget remains
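The hierarchical check can deny a request at any level of the Org → Team → Key path, which is why an org with budget remaining can still see denials. The sketch below illustrates that logic on a hand-built path. The `BudgetNode` shape and field names are assumptions for illustration; the actual response format of the budgets endpoint is not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BudgetNode:
    name: str
    level: str    # "org", "team", or "key"
    limit: float  # budget ceiling for this node
    spent: float  # spend recorded against this node


def first_exhausted(path: List[BudgetNode]) -> Optional[BudgetNode]:
    """Walk the Org -> Team -> Key path and return the first exhausted node."""
    for node in path:
        if node.spent >= node.limit:
            return node
    return None
```

Raising the org-level budget alone does nothing if `first_exhausted` would return a team or key node; the limit on that specific node has to change.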

Cause D: Model allowlist too restrictive

Diagnosis: model_allowlist denials. Check which models are being rejected:

./scripts/errors by-source gateway | grep "model_blocked"

Fix: Add the missing model to the org’s allowlist:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_allowlist": ["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-6", "claude-haiku-4-5"]}'
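If the PATCH replaces the whole `model_allowlist` rather than merging into it (an assumption; this runbook does not say which), sending only the missing model would drop the models already allowed. A cautious approach is to merge the current list with the rejected models before building the payload, as in this sketch. The helper name is hypothetical.

```python
import json
from typing import Iterable


def allowlist_payload(current: Iterable[str], missing: Iterable[str]) -> str:
    """Merge current and missing models, preserving order and removing duplicates."""
    merged = list(dict.fromkeys([*current, *missing]))
    return json.dumps({"model_allowlist": merged})
```

The returned string can be passed as the `-d` body of the PATCH request above.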

Cause E: Security scanner false positives

The content safety (stage 4.5) and security scan (stage 4.6) stages use regex patterns to detect prompt injection, jailbreaks, and data exfiltration. The AI classifier (stage 4.7) is intended to reduce false positives from those two stages, but it may not catch all of them.

Diagnosis: content_safety or security_scan denials. Check what patterns matched:

./scripts/errors by-source gateway | grep "security_finding"

Fix (immediate): The AI classifier should be reducing these. Check if it’s enabled:

curl https://api.curate-me.ai/gateway/admin/governance/policy \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "X-Org-ID: $ORG_ID" \
  | jq '.ai_classifier_enabled'

If false, enable it:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"ai_classifier_enabled": true}'

Step 4: Check for cascading effects

When one governance stage denies excessively, it can mask other issues. After fixing the primary cause:

# Re-check denial distribution
./scripts/analytics snapshot today

# Verify the fix reduced denials
./scripts/errors recent | grep "governance_denied" | wc -l
# Should be declining
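Because the chain short-circuits, fixing an early stage lets requests reach later stages for the first time, and a previously quiet stage may start denying. Comparing a snapshot taken before the fix with one taken after makes that visible. The sketch below assumes the `governance_denials_by_step` counters are cumulative for the day, so the difference between two snapshots is the denial count for the interval; verify that assumption against the analytics script before relying on it.

```python
from typing import Dict, List, Tuple


def interval_denials(before: Dict[str, int], after: Dict[str, int]) -> List[Tuple[str, int]]:
    """Per-stage denials between two snapshots, largest first, zero deltas omitted."""
    stages = set(before) | set(after)
    deltas = {s: after.get(s, 0) - before.get(s, 0) for s in stages}
    growing = [(s, d) for s, d in deltas.items() if d > 0]
    # Sort by delta descending, then stage name for a stable order.
    return sorted(growing, key=lambda kv: (-kv[1], kv[0]))
```

A stage that was near zero before the fix and now tops this list is a likely second misconfiguration that the first one was hiding.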

Step 5: Verify resolution

# Send a test request through the affected org
curl -s https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $ORG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"Hello"}], "max_tokens": 5}' \
  -w "\nHTTP Status: %{http_code}\n"
# Should return 200

# Check governance timing on the response
curl -v https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $ORG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"Hello"}], "max_tokens": 5}' \
  2>&1 | grep "X-CM-Governance"
# X-CM-Governance-Time-Ms should be present, no denial headers
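The header check at the end of this step can also be scripted when verifying several orgs. The sketch below takes response headers as a plain dict, however you obtained them, and reports problems. It relies only on the two header names mentioned in this runbook; the function name is illustrative.

```python
from typing import Dict, List


def governance_header_problems(headers: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the response looks healthy."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {k.lower(): v for k, v in headers.items()}
    problems = []
    denied = normalized.get("x-cm-governance-denied-step")
    if denied is not None:
        problems.append(f"denied at step: {denied}")
    if "x-cm-governance-time-ms" not in normalized:
        problems.append("missing X-CM-Governance-Time-Ms")
    return problems
```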

The Full Governance Chain Reference

For diagnosis, here are all 14 stages in execution order:

| Stage | Check | Short-circuits on |
| --- | --- | --- |
| 0 | Plan enforcement | Expired subscription, exceeded plan quota |
| 0.5 | Body size limit | Request body exceeds tier limit (1MB-100MB) |
| 1 | Rate limit | RPM exceeded for org/key |
| 1.5 | Plan entitlement | Legacy daily request quota |
| 1.7 | Reasoning token cap | Reasoning/thinking tokens exceed tier limit |
| 2 | Cost estimate | Estimated cost exceeds per-request or daily budget |
| 2.5 | Hierarchical budget | Org/Team/Key budget exhausted |
| 3 | Runner session budget | Per-session cost limit exceeded |
| 4 | PII scan | PII/secrets detected in request content |
| 4.5 | Content safety | Prompt injection or jailbreak patterns |
| 4.6 | Security scan | Advanced injection, exfiltration, encoded payloads |
| 4.7 | AI classifier | LLM-based false positive reduction (may override 4.5/4.6) |
| 5 | Model allowlist | Requested model not in org’s allowlist |
| 6 | HITL gate | High-cost request flagged for human approval |
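The short-circuit behavior explains why only one denying stage is reported per request, and why a later misconfiguration stays invisible until the earlier one is fixed. The following is a simplified sketch of that evaluation order, not the gateway's implementation: the check functions are stand-ins, and the AI classifier's ability to override stages 4.5 and 4.6 is deliberately not modeled.

```python
from typing import Callable, List, Optional, Tuple

# Each check is (stage name, function returning True when the request is allowed).
Check = Tuple[str, Callable[[dict], bool]]


def run_chain(request: dict, checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first denying stage, or None if all pass."""
    for name, allows in checks:
        if not allows(request):
            return name  # short-circuit: later checks never run
    return None


# Illustrative chain: the PII scan and model allowlist are both misconfigured.
checks: List[Check] = [
    ("rate_limit", lambda req: True),
    ("pii_scan", lambda req: False),
    ("model_allowlist", lambda req: req.get("model") in {"gpt-4o"}),
]
```

In this example a request for `gpt-4o-mini` is reported as a `pii_scan` denial; only after the PII check is fixed does the `model_allowlist` denial surface.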


Rollback

Revert the policy changes made in Step 3. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

  • The symptoms listed above are no longer present
  • No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
  • Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

  1. If denials affect multiple orgs simultaneously, check for a global policy change or platform-wide misconfiguration
  2. Collect: ./scripts/errors by-source gateway, ./scripts/analytics snapshot today, affected org IDs
  3. Check the denial tracker for escalation/suspension status: ./scripts/errors by-source gateway | grep "denial_escalation"
  4. Contact platform team with: which governance stage, how many orgs, and the time window