Runbook: Multi-Provider Failover Loops
Owner: Platform Team Backup owner: On-call engineer Last validated: Not yet validated Validation method: Manual drill Severity trigger: SEV1 Customer impact: Sustained 502 errors and high latency across all orgs using affected providers Required access: SSH (VPS), Redis, Docker Related services: curateme-backend-gateway
The gateway uses circuit breakers and fallback routing to handle provider outages. When multiple providers fail simultaneously or circuit breaker thresholds are misconfigured, the gateway can get stuck in a loop: try provider A → fail → try B → fail → try A again. This creates sustained high latency and cascading 502 errors.
Symptoms
- High latency on all requests (10s+ instead of 1-2s)
- Logs show rapid provider switching (
routing_fallbackevents) - Circuit breaker toggling between
openandhalf_openstates - 502 errors intermittently returned to users
- Provider health endpoint shows multiple providers in degraded state
X-CM-Provider-Retriesresponse header consistently shows max value
Step 1: Assess provider health
Determine which providers are healthy, degraded, or down.
./scripts/analytics health
# Look at: provider_status for each providerGet detailed circuit breaker states:
curl https://api.curate-me.ai/gateway/admin/providers/health \
-H "Authorization: Bearer $ADMIN_TOKEN"Expected response:
{
"providers": {
"openai": {"status": "healthy", "circuit": "closed", "success_rate": 0.99, "p95_ms": 450},
"anthropic": {"status": "degraded", "circuit": "half_open", "success_rate": 0.72, "p95_ms": 3200},
"google": {"status": "down", "circuit": "open", "success_rate": 0.0, "p95_ms": null}
}
}What to look for:
| State | Meaning | Action |
|---|---|---|
All closed | No failover issue — latency is elsewhere | Check Gateway High Latency |
One open, others closed | Normal failover, working correctly | Monitor, no action needed |
Multiple half_open | Providers flapping — the failover loop | Continue this runbook |
All open | Total provider outage | Jump to Step 3, Cause A |
Also check the provider status pages directly:
| Provider | Status Page |
|---|---|
| OpenAI | status.openai.com |
| Anthropic | status.anthropic.com |
| Google AI | status.cloud.google.com |
| DeepSeek | status.deepseek.com |
Step 2: Check for retry storms
Failed requests retrying amplify load on remaining healthy providers, potentially pushing them into failure too.
./scripts/errors by-source gateway | grep "routing_fallback"Check retry rate:
curl https://api.curate-me.ai/gateway/admin/metrics \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics'What to look for:
| Metric | Healthy | Problem |
|---|---|---|
retries_per_minute | < 10 | > 100 (retry storm) |
fallback_rate | < 5% | > 30% (systematic failures) |
avg_retries_per_request | < 1.2 | > 2.5 (every request retrying multiple times) |
Step 3: Identify the root cause
Cause A: Multiple providers down simultaneously
Major cloud outages occasionally affect multiple LLM providers (shared infrastructure, regional issues).
Diagnosis: Multiple providers show open circuit breaker state. Status pages confirm outages.
Fix (immediate): Force traffic to the single remaining healthy provider:
# Block traffic to known-bad providers
curl -X POST https://api.curate-me.ai/gateway/admin/providers/google/circuit-breaker \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"state": "open", "reason": "manual override - provider confirmed down"}'
curl -X POST https://api.curate-me.ai/gateway/admin/providers/anthropic/circuit-breaker \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"state": "open", "reason": "manual override - provider degraded"}'If ALL providers are down: Enable response cache as degraded-mode fallback:
curl -X POST https://api.curate-me.ai/gateway/admin/settings \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"cache_mode": "stale_if_error", "cache_ttl_seconds": 3600}'Cause B: Circuit breaker thresholds too sensitive
If the circuit breaker opens after too few failures, transient errors cause premature failover.
Diagnosis: Circuit breakers toggling rapidly between half_open and open while the provider is actually healthy (confirmed by direct curl to provider).
Fix: Increase circuit breaker thresholds:
curl -X PUT https://api.curate-me.ai/gateway/admin/settings \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"circuit_breaker_failure_threshold": 10,
"circuit_breaker_recovery_timeout_seconds": 30,
"circuit_breaker_half_open_max_calls": 5
}'Default values for reference:
| Setting | Default | Description |
|---|---|---|
failure_threshold | 5 | Consecutive failures before opening |
recovery_timeout_seconds | 60 | Seconds before trying half-open |
half_open_max_calls | 3 | Test calls before closing |
Cause C: Retry storm amplification
Each failed request retries up to max_provider_retries times, and each retry can trigger its own failover chain.
Diagnosis: retries_per_minute > 100. The retry load is causing otherwise healthy providers to degrade.
Fix (immediate): Reduce retry count to stop the amplification:
curl -X PUT https://api.curate-me.ai/gateway/admin/settings \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"max_provider_retries": 1}'Fix (long-term): Add exponential backoff between retries. Review whether the default retry count of 3 is appropriate for your traffic volume.
Cause D: DNS resolution failure
If the gateway can’t resolve provider hostnames, all providers appear down simultaneously.
Diagnosis: All providers failing with connection errors, not HTTP errors. Check DNS:
# On VPS
ssh $DEPLOY_USER@$PLATFORM_VPS_IP
# Test DNS resolution
dig api.openai.com +short
dig api.anthropic.com +short
# Check /etc/resolv.conf for DNS server
cat /etc/resolv.confFix: If DNS is failing, add a fallback DNS server:
# Temporarily add Google DNS
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.confStep 4: Verify resolution
After applying fixes, confirm the failover loop has stopped:
# Check provider health — circuit breakers should stabilize
curl https://api.curate-me.ai/gateway/admin/providers/health \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Check retry rate has dropped
curl https://api.curate-me.ai/gateway/admin/metrics \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics'
# Send a test request
curl -s -w "\nHTTP: %{http_code}, Time: %{time_total}s\n" \
https://api.curate-me.ai/v1/openai/chat/completions \
-H "X-CM-API-Key: $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}], "max_tokens": 5}'
# HTTP: 200, Time should be < 3s
# Monitor error rate for 5 minutes
./scripts/errors recentPrevention
| Measure | How |
|---|---|
| Provider health monitoring | Alert when any provider circuit breaker opens |
| Retry budget | Set a per-minute retry cap to prevent storms |
| Diverse provider routing | Ensure fallback chains include providers on different infrastructure |
| DNS redundancy | Configure multiple DNS servers in /etc/resolv.conf |
| Load shedding | Under extreme failover, shed low-priority traffic to protect capacity |
Related Runbooks
- Gateway High Latency — when latency is high but not from failover
- Budget Exceeded — failover to more expensive providers can spike costs
- Cost Accumulation Lag — rapid failover can overwhelm cost recording
Rollback
Revert the changes described in the Procedure section. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.
Verification
After applying the fix, verify:
- The symptoms listed above are no longer present
- No new errors in gateway logs:
docker logs curateme-backend-gateway --tail=50 - Health check passes:
curl -s http://localhost:8002/health | jq .status
Escalation
- If all providers are down for more than 15 minutes, declare SEV1 per the Incident Response playbook
- Collect: provider health output,
./scripts/errors by-source gateway, retry metrics - If retry storms persist after reducing
max_provider_retriesto 1, check for application-level retries in customer code - Contact platform team with: which providers are affected, circuit breaker states, and total time in failover loop