Runbook: Governance Policy Cascading Denials
- Owner: Platform Team
- Backup owner: On-call engineer
- Last validated: Not yet validated
- Validation method: Manual drill
- Severity trigger: SEV2
- Customer impact: Legitimate requests denied; appears as total outage to affected org
- Required access: SSH (VPS), MongoDB, Redis
- Related services: curateme-backend-gateway
The governance chain runs 14 checks in sequence, short-circuiting on the first denial. When multiple checks are misconfigured, legitimate requests hit a wall of denials that appears as a total outage to the customer. This runbook helps identify which stage is denying and why.
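To see why one misconfigured check can hide everything behind it, here is a minimal shell sketch of a short-circuiting chain. The check functions and their names are hypothetical stand-ins, not the gateway's real implementation; only the stop-at-first-denial behavior is taken from the description above.

```shell
#!/bin/sh
# Illustrative only: hypothetical checks that return 0 (allow) or 1 (deny).
check_rate_limit()      { return 0; }   # passes
check_pii_scan()        { return 1; }   # misconfigured: always denies
check_model_allowlist() { return 1; }   # also broken, but never reached

# Run each check in order and stop at the first denial.
run_chain() {
  for check in "$@"; do
    if ! "$check"; then
      echo "denied_by=$check"
      return 1
    fi
  done
  echo "allowed"
}

# prints: denied_by=check_pii_scan
run_chain check_rate_limit check_pii_scan check_model_allowlist || true
```

Only the first failing check is reported, so the broken allowlist check stays invisible until the PII check is fixed.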
Symptoms
- Spike in 403/422/429 errors from the gateway
- Users report “all requests are being blocked”
- Dashboard shows high denial rate across all governance steps
- X-CM-Governance-Denied-Step response header present on most requests
- Multiple orgs affected simultaneously (suggests global policy, not per-org)
Step 1: Identify the denying stage
The gateway tracks denials by governance stage. This is the single most important diagnostic.
./scripts/analytics snapshot today
# Look at: governance_denials_by_step

Expected output:

{
  "governance_denials_by_step": {
    "rate_limit": 12,
    "cost_estimate": 3,
    "pii_scan": 847,
    "security_scan": 2,
    "model_allowlist": 0,
    "hitl_gate": 1
  }
}

What to look for:
| Pattern | Meaning | Jump to |
|---|---|---|
| One stage has 90%+ of denials | Single misconfigured check | Fix that specific cause below |
| pii_scan dominant | False positive PII patterns | Cause A |
| rate_limit dominant | RPM too low or burst traffic | Cause B |
| cost_estimate or hierarchical_budget dominant | Budget exhausted | Cause C |
| model_allowlist dominant | Model not in allowlist | Cause D |
| security_scan or content_safety dominant | Security scanner false positives | Cause E |
| Denials spread across many stages | Deeper configuration issue | Check org policy (Step 2) |
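The 90% rule in the table can be applied mechanically. The sketch below assumes the denial counts have already been flattened to "stage count" lines; the function name dominant_stage and that input shape are assumptions for illustration, not part of ./scripts/analytics.

```shell
#!/bin/sh
# dominant_stage: read "stage count" lines on stdin and report the stage with
# the most denials plus its share of the total. A share of 90% or more
# suggests a single misconfigured check; otherwise denials are spread out.
dominant_stage() {
  awk '
    { total += $2; if ($2 > max) { max = $2; top = $1 } }
    END {
      if (total == 0) { print "no denials"; exit }
      share = int(100 * max / total)
      verdict = (share >= 90) ? "single-stage" : "spread"
      printf "%s %d%% %s\n", top, share, verdict
    }'
}

# Counts copied from the expected output in Step 1.
# prints: pii_scan 97% single-stage
printf '%s\n' "rate_limit 12" "cost_estimate 3" "pii_scan 847" \
  "security_scan 2" "model_allowlist 0" "hitl_gate 1" | dominant_stage

# With jq installed, the snapshot could be flattened like this (assumes the
# snapshot command prints the JSON shown in Step 1 on stdout):
#   ./scripts/analytics snapshot today \
#     | jq -r '.governance_denials_by_step | to_entries[] | "\(.key) \(.value)"' \
#     | dominant_stage
```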
Step 2: Review the affected org’s governance policy
curl https://api.curate-me.ai/gateway/admin/governance/policy \
  -H "Authorization: Bearer $ADMIN_TOKEN" -H "X-Org-ID: $ORG_ID"

Cross-reference with recent denial logs:

./scripts/errors by-source gateway | grep "governance_denied"

Check if there was a recent policy change:

# Check audit log for policy modifications
# (the URL is quoted so the shell does not treat & as a background operator)
curl "https://api.curate-me.ai/gateway/admin/audit?action=policy_update&org_id=$ORG_ID" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Step 3: Identify the root cause
Cause A: PII scanner false positives
The PII scanner uses 33 regex patterns plus optional (opt-in) Presidio NER, covering API keys, credentials, financial data, and health data. Newly added patterns or newly enabled HIPAA/PCI modes can cause false positives.
Diagnosis: pii_scan has a high denial count. Check which patterns are triggering:

./scripts/errors by-source gateway | grep "pii_detected" | head -20
# Look for: pattern_name, matched_text (redacted), severity

Fix (immediate): Switch PII action to log-only mode for the affected org:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pii_action": "ALLOW"}'

Fix (targeted): Add a per-request allowlist for specific patterns:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"pii_allowlist": ["email", "phone"]}'

Fix (long-term): If HIPAA PHI or PCI-DSS mode was recently enabled, verify the org actually needs it. These modes add strict patterns that catch health/financial terminology.
Cause B: Rate limit too low or burst traffic
Diagnosis: rate_limit denials spiking. Check current limits vs usage:
# Check the org's current rate limit and usage
curl https://api.curate-me.ai/gateway/admin/rate-limit/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Tier defaults:
| Tier | RPM Limit |
|---|---|
| Free | 10 |
| Starter | 60 |
| Growth | 300 |
| Enterprise | 5,000 |
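One quick way to tell "limit too low" from "burst traffic" is to compare the observed requests per minute with the tier default. The helper below is a sketch: the function names and the idea of passing the observed RPM as an argument are assumptions; the limits are copied from the table above.

```shell
#!/bin/sh
# tier_rpm: print the default RPM limit for a tier (values from the table above).
tier_rpm() {
  case "$1" in
    Free)       echo 10 ;;
    Starter)    echo 60 ;;
    Growth)     echo 300 ;;
    Enterprise) echo 5000 ;;
    *)          echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# rpm_headroom: compare an observed RPM against the tier default.
# Usage: rpm_headroom <tier> <observed_rpm>
rpm_headroom() {
  limit=$(tier_rpm "$1") || return 1
  if [ "$2" -gt "$limit" ]; then
    echo "over limit by $(( $2 - limit )) rpm (limit $limit)"
  else
    echo "within limit, $(( limit - $2 )) rpm headroom (limit $limit)"
  fi
}

rpm_headroom Starter 95    # prints: over limit by 35 rpm (limit 60)
```

An org with a custom RPM override would not match these defaults, so treat the result as a first approximation.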
Fix (immediate): Temporarily raise the limit:
curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"rate_limit_rpm": 300}'

Fix (long-term): Upgrade the org’s tier or set a custom RPM override.
Cause C: Budget exhausted
The gateway enforces budgets at three levels: per-request max cost, daily budget (flat), and hierarchical budget (Org → Team → Key).
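The three levels can be read as nested ceilings: a request has to fit under all of them. The sketch below is illustrative only. The function name, the argument order, the order in which levels are evaluated, and the idea of passing remaining budgets as plain numbers are all assumptions, not the gateway's actual logic.

```shell
#!/bin/sh
# budget_verdict: report which budget level (if any) would deny a request.
# Usage: budget_verdict <est_cost> <max_per_request> <daily_remaining> \
#                       <org_remaining> <team_remaining> <key_remaining>
budget_verdict() {
  awk -v cost="$1" -v per_req="$2" -v daily="$3" \
      -v org="$4" -v team="$5" -v key="$6" 'BEGIN {
    if (cost > per_req) { print "deny: per-request max"; exit }
    if (cost > daily)   { print "deny: daily budget"; exit }
    if (cost > org)     { print "deny: hierarchical (org)"; exit }
    if (cost > team)    { print "deny: hierarchical (team)"; exit }
    if (cost > key)     { print "deny: hierarchical (key)"; exit }
    print "allow"
  }'
}

# Org and daily budgets still have room, but the team node is exhausted.
# prints: deny: hierarchical (team)
budget_verdict 0.40 5.00 120.00 80.00 0.10 25.00
```

This is why raising only the daily budget may not clear the denials: an exhausted team or key node can still block the request.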
Diagnosis: cost_estimate or hierarchical_budget denials. Check spend:
./scripts/analytics costs today
# Check daily_cost vs daily_budget

Fix (immediate): Raise the daily budget:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"daily_budget": 200, "max_cost_per_request": 5.00}'

Fix (if hierarchical): Check and adjust hierarchy nodes:

curl https://api.curate-me.ai/gateway/admin/budgets/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN"
# Look for exhausted nodes: team or key budgets may be hit even if org budget remains

Cause D: Model allowlist too restrictive
Diagnosis: model_allowlist denials. Check which models are being rejected:
./scripts/errors by-source gateway | grep "model_blocked"

Fix: Add the missing model to the org’s allowlist:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_allowlist": ["gpt-4o", "gpt-4o-mini", "claude-sonnet-4-6", "claude-haiku-4-5"]}'

Cause E: Security scanner false positives
Content safety (stage 4.5) and security scan (stage 4.6) use regex patterns to detect prompt injection, jailbreaks, and data exfiltration. The AI classifier (stage 4.7) is intended to reduce false positives from these stages, but it may not catch all of them.
Diagnosis: content_safety or security_scan denials. Check what patterns matched:
./scripts/errors by-source gateway | grep "security_finding"

Fix (immediate): The AI classifier should be reducing these. Check if it’s enabled:

curl https://api.curate-me.ai/gateway/admin/governance/policy \
  -H "Authorization: Bearer $ADMIN_TOKEN" -H "X-Org-ID: $ORG_ID" \
  | jq '.ai_classifier_enabled'

If false, enable it:

curl -X PATCH https://api.curate-me.ai/api/v1/admin/gateway/policy/$ORG_ID \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"ai_classifier_enabled": true}'

Step 4: Check for cascading effects
Because the chain short-circuits on the first denial, a stage that denies excessively can mask problems in later stages: denied requests never reach those checks. After fixing the primary cause, confirm that no other stage has become dominant:

# Re-check denial distribution
./scripts/analytics snapshot today
# Verify the fix reduced denials
./scripts/errors recent | grep -c "governance_denied"
# Should be declining

Step 5: Verify resolution
# Send a test request through the affected org
curl -s https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $ORG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"Hello"}], "max_tokens": 5}' \
  -w "\nHTTP Status: %{http_code}\n"
# Should return 200

# Check governance timing on the response
curl -v https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $ORG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"Hello"}], "max_tokens": 5}' \
  2>&1 | grep "X-CM-Governance"
# X-CM-Governance-Time-Ms should be present, no denial headers

The Full Governance Chain Reference
For diagnosis, here are all 14 stages in execution order:
| Stage | Check | Short-circuits on |
|---|---|---|
| 0 | Plan enforcement | Expired subscription, exceeded plan quota |
| 0.5 | Body size limit | Request body exceeds tier limit (1MB-100MB) |
| 1 | Rate limit | RPM exceeded for org/key |
| 1.5 | Plan entitlement | Legacy daily request quota |
| 1.7 | Reasoning token cap | Reasoning/thinking tokens exceed tier limit |
| 2 | Cost estimate | Estimated cost exceeds per-request or daily budget |
| 2.5 | Hierarchical budget | Org/Team/Key budget exhausted |
| 3 | Runner session budget | Per-session cost limit exceeded |
| 4 | PII scan | PII/secrets detected in request content |
| 4.5 | Content safety | Prompt injection or jailbreak patterns |
| 4.6 | Security scan | Advanced injection, exfiltration, encoded payloads |
| 4.7 | AI classifier | LLM-based false positive reduction (may override 4.5/4.6) |
| 5 | Model allowlist | Requested model not in org’s allowlist |
| 6 | HITL gate | High-cost request flagged for human approval |
Related Runbooks
- Budget Exceeded — when cost budgets specifically are the issue
- Rate Limit Hit — rate limiting deep-dive
- PII Blocked — PII scanner deep-dive
- Gateway High Latency — governance chain can cause latency
Rollback
Revert the policy changes made in Step 3. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.
Verification
After applying the fix, verify:
- The symptoms listed above are no longer present
- No new errors in gateway logs:
  docker logs curateme-backend-gateway --tail=50
- Health check passes:
  curl -s http://localhost:8002/health | jq .status
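The header check from Step 5 can be wrapped in a small helper so the result is unambiguous. This sketch assumes the response headers are captured as text and piped in; the function name denied_step is hypothetical, and the only header name it relies on is X-CM-Governance-Denied-Step from the Symptoms list.

```shell
#!/bin/sh
# denied_step: read raw HTTP response headers on stdin and print the value of
# X-CM-Governance-Denied-Step, or "none" if the header is absent.
denied_step() {
  awk -F': *' '
    tolower($1) == "x-cm-governance-denied-step" {
      sub(/\r$/, "", $2); print $2; found = 1; exit
    }
    END { if (!found) print "none" }'
}

# Sample headers for illustration (the header value is made up).
# prints: pii_scan
printf 'HTTP/1.1 403 Forbidden\r\nX-CM-Governance-Denied-Step: pii_scan\r\n\r\n' | denied_step

# Against a live request, headers could be captured with curl -D - and the
# body discarded, for example:
#   curl -s -o /dev/null -D - <same request as Step 5> | denied_step
```

A result of none after the fix indicates that the test request was not denied by a governance stage.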
Escalation
- If denials affect multiple orgs simultaneously, check for a global policy change or platform-wide misconfiguration
- Collect: ./scripts/errors by-source gateway, ./scripts/analytics snapshot today, affected org IDs
- Check the denial tracker for escalation/suspension status:
  ./scripts/errors by-source gateway | grep "denial_escalation"
- Contact platform team with: which governance stage, how many orgs, and the time window