Runbook: Multi-Provider Failover Loops

Owner: Platform Team Backup owner: On-call engineer Last validated: Not yet validated Validation method: Manual drill Severity trigger: SEV1 Customer impact: Sustained 502 errors and high latency across all orgs using affected providers Required access: SSH (VPS), Redis, Docker Related services: curateme-backend-gateway

The gateway uses circuit breakers and fallback routing to handle provider outages. When multiple providers fail simultaneously or circuit breaker thresholds are misconfigured, the gateway can get stuck in a loop: try provider A → fail → try B → fail → try A again. This creates sustained high latency and cascading 502 errors.

Symptoms

High latency on all requests (10s+ instead of 1-2s)
Logs show rapid provider switching (routing_fallback events)
Circuit breaker toggling between open and half_open states
502 errors intermittently returned to users
Provider health endpoint shows multiple providers in degraded state
X-CM-Provider-Retries response header consistently shows max value

Step 1: Assess provider health

Determine which providers are healthy, degraded, or down.


./scripts/analytics health
# Look at: provider_status for each provider

Get detailed circuit breaker states:


curl https://api.curate-me.ai/gateway/admin/providers/health \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Expected response:


{
  "providers": {
    "openai": {"status": "healthy", "circuit": "closed", "success_rate": 0.99, "p95_ms": 450},
    "anthropic": {"status": "degraded", "circuit": "half_open", "success_rate": 0.72, "p95_ms": 3200},
    "google": {"status": "down", "circuit": "open", "success_rate": 0.0, "p95_ms": null}
  }
}

What to look for:

State	Meaning	Action
All `closed`	No failover issue — latency is elsewhere	Check Gateway High Latency
One `open`, others `closed`	Normal failover, working correctly	Monitor, no action needed
Multiple `half_open`	Providers flapping — the failover loop	Continue this runbook
All `open`	Total provider outage	Jump to Step 3, Cause A

Also check the provider status pages directly:

Provider	Status Page
OpenAI	status.openai.com
Anthropic	status.anthropic.com
Google AI	status.cloud.google.com
DeepSeek	status.deepseek.com

Step 2: Check for retry storms

Failed requests retrying amplify load on remaining healthy providers, potentially pushing them into failure too.


./scripts/errors by-source gateway | grep "routing_fallback"

Check retry rate:


curl https://api.curate-me.ai/gateway/admin/metrics \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics'

What to look for:

Metric	Healthy	Problem
`retries_per_minute`	< 10	> 100 (retry storm)
`fallback_rate`	< 5%	> 30% (systematic failures)
`avg_retries_per_request`	< 1.2	> 2.5 (every request retrying multiple times)

Step 3: Identify the root cause

Cause A: Multiple providers down simultaneously

Major cloud outages occasionally affect multiple LLM providers (shared infrastructure, regional issues).

Diagnosis: Multiple providers show open circuit breaker state. Status pages confirm outages.

Fix (immediate): Force traffic to the single remaining healthy provider:


# Block traffic to known-bad providers
curl -X POST https://api.curate-me.ai/gateway/admin/providers/google/circuit-breaker \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"state": "open", "reason": "manual override - provider confirmed down"}'
 
curl -X POST https://api.curate-me.ai/gateway/admin/providers/anthropic/circuit-breaker \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"state": "open", "reason": "manual override - provider degraded"}'

If ALL providers are down: Enable response cache as degraded-mode fallback:


curl -X POST https://api.curate-me.ai/gateway/admin/settings \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"cache_mode": "stale_if_error", "cache_ttl_seconds": 3600}'

Cause B: Circuit breaker thresholds too sensitive

If the circuit breaker opens after too few failures, transient errors cause premature failover.

Diagnosis: Circuit breakers toggling rapidly between half_open and open while the provider is actually healthy (confirmed by direct curl to provider).

Fix: Increase circuit breaker thresholds:


curl -X PUT https://api.curate-me.ai/gateway/admin/settings \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "circuit_breaker_failure_threshold": 10,
    "circuit_breaker_recovery_timeout_seconds": 30,
    "circuit_breaker_half_open_max_calls": 5
  }'

Default values for reference:

Setting	Default	Description
`failure_threshold`	5	Consecutive failures before opening
`recovery_timeout_seconds`	60	Seconds before trying half-open
`half_open_max_calls`	3	Test calls before closing

Cause C: Retry storm amplification

Each failed request retries up to max_provider_retries times, and each retry can trigger its own failover chain.

Diagnosis: retries_per_minute > 100. The retry load is causing otherwise healthy providers to degrade.

Fix (immediate): Reduce retry count to stop the amplification:


curl -X PUT https://api.curate-me.ai/gateway/admin/settings \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"max_provider_retries": 1}'

Fix (long-term): Add exponential backoff between retries. Review whether the default retry count of 3 is appropriate for your traffic volume.

Cause D: DNS resolution failure

If the gateway can’t resolve provider hostnames, all providers appear down simultaneously.

Diagnosis: All providers failing with connection errors, not HTTP errors. Check DNS:


# On VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
 
# Test DNS resolution
dig api.openai.com +short
dig api.anthropic.com +short
 
# Check /etc/resolv.conf for DNS server
cat /etc/resolv.conf

Fix: If DNS is failing, add a fallback DNS server:


# Temporarily add Google DNS
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

Step 4: Verify resolution

After applying fixes, confirm the failover loop has stopped:


# Check provider health — circuit breakers should stabilize
curl https://api.curate-me.ai/gateway/admin/providers/health \
  -H "Authorization: Bearer $ADMIN_TOKEN"
 
# Check retry rate has dropped
curl https://api.curate-me.ai/gateway/admin/metrics \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics'
 
# Send a test request
curl -s -w "\nHTTP: %{http_code}, Time: %{time_total}s\n" \
  https://api.curate-me.ai/v1/openai/chat/completions \
  -H "X-CM-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}], "max_tokens": 5}'
# HTTP: 200, Time should be < 3s
 
# Monitor error rate for 5 minutes
./scripts/errors recent

Prevention

Measure	How
Provider health monitoring	Alert when any provider circuit breaker opens
Retry budget	Set a per-minute retry cap to prevent storms
Diverse provider routing	Ensure fallback chains include providers on different infrastructure
DNS redundancy	Configure multiple DNS servers in `/etc/resolv.conf`
Load shedding	Under extreme failover, shed low-priority traffic to protect capacity

Gateway High Latency — when latency is high but not from failover
Budget Exceeded — failover to more expensive providers can spike costs
Cost Accumulation Lag — rapid failover can overwhelm cost recording

Rollback

Revert the changes described in the Procedure section. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

The symptoms listed above are no longer present
No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

If all providers are down for more than 15 minutes, declare SEV1 per the Incident Response playbook
Collect: provider health output, ./scripts/errors by-source gateway, retry metrics
If retry storms persist after reducing max_provider_retries to 1, check for application-level retries in customer code
Contact platform team with: which providers are affected, circuit breaker states, and total time in failover loop