Skip to Content
RunbooksRunbook: Multi-Provider Failover Loops

Runbook: Multi-Provider Failover Loops

Owner: Platform Team Backup owner: On-call engineer Last validated: Not yet validated Validation method: Manual drill Severity trigger: SEV1 Customer impact: Sustained 502 errors and high latency across all orgs using affected providers Required access: SSH (VPS), Redis, Docker Related services: curateme-backend-gateway


The gateway uses circuit breakers and fallback routing to handle provider outages. When multiple providers fail simultaneously or circuit breaker thresholds are misconfigured, the gateway can get stuck in a loop: try provider A → fail → try B → fail → try A again. This creates sustained high latency and cascading 502 errors.


Symptoms

  • High latency on all requests (10s+ instead of 1-2s)
  • Logs show rapid provider switching (routing_fallback events)
  • Circuit breaker toggling between open and half_open states
  • 502 errors intermittently returned to users
  • Provider health endpoint shows multiple providers in degraded state
  • X-CM-Provider-Retries response header consistently shows max value

Step 1: Assess provider health

Determine which providers are healthy, degraded, or down.

./scripts/analytics health # Look at: provider_status for each provider

Get detailed circuit breaker states:

curl https://api.curate-me.ai/gateway/admin/providers/health \ -H "Authorization: Bearer $ADMIN_TOKEN"

Expected response:

{ "providers": { "openai": {"status": "healthy", "circuit": "closed", "success_rate": 0.99, "p95_ms": 450}, "anthropic": {"status": "degraded", "circuit": "half_open", "success_rate": 0.72, "p95_ms": 3200}, "google": {"status": "down", "circuit": "open", "success_rate": 0.0, "p95_ms": null} } }

What to look for:

StateMeaningAction
All closedNo failover issue — latency is elsewhereCheck Gateway High Latency
One open, others closedNormal failover, working correctlyMonitor, no action needed
Multiple half_openProviders flapping — the failover loopContinue this runbook
All openTotal provider outageJump to Step 3, Cause A

Also check the provider status pages directly:

ProviderStatus Page
OpenAIstatus.openai.com 
Anthropicstatus.anthropic.com 
Google AIstatus.cloud.google.com 
DeepSeekstatus.deepseek.com 

Step 2: Check for retry storms

Failed requests retrying amplify load on remaining healthy providers, potentially pushing them into failure too.

./scripts/errors by-source gateway | grep "routing_fallback"

Check retry rate:

curl https://api.curate-me.ai/gateway/admin/metrics \ -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics'

What to look for:

MetricHealthyProblem
retries_per_minute< 10> 100 (retry storm)
fallback_rate< 5%> 30% (systematic failures)
avg_retries_per_request< 1.2> 2.5 (every request retrying multiple times)

Step 3: Identify the root cause

Cause A: Multiple providers down simultaneously

Major cloud outages occasionally affect multiple LLM providers (shared infrastructure, regional issues).

Diagnosis: Multiple providers show open circuit breaker state. Status pages confirm outages.

Fix (immediate): Force traffic to the single remaining healthy provider:

# Block traffic to known-bad providers curl -X POST https://api.curate-me.ai/gateway/admin/providers/google/circuit-breaker \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{"state": "open", "reason": "manual override - provider confirmed down"}' curl -X POST https://api.curate-me.ai/gateway/admin/providers/anthropic/circuit-breaker \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{"state": "open", "reason": "manual override - provider degraded"}'

If ALL providers are down: Enable response cache as degraded-mode fallback:

curl -X POST https://api.curate-me.ai/gateway/admin/settings \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{"cache_mode": "stale_if_error", "cache_ttl_seconds": 3600}'

Cause B: Circuit breaker thresholds too sensitive

If the circuit breaker opens after too few failures, transient errors cause premature failover.

Diagnosis: Circuit breakers toggling rapidly between half_open and open while the provider is actually healthy (confirmed by direct curl to provider).

Fix: Increase circuit breaker thresholds:

curl -X PUT https://api.curate-me.ai/gateway/admin/settings \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "circuit_breaker_failure_threshold": 10, "circuit_breaker_recovery_timeout_seconds": 30, "circuit_breaker_half_open_max_calls": 5 }'

Default values for reference:

SettingDefaultDescription
failure_threshold5Consecutive failures before opening
recovery_timeout_seconds60Seconds before trying half-open
half_open_max_calls3Test calls before closing

Cause C: Retry storm amplification

Each failed request retries up to max_provider_retries times, and each retry can trigger its own failover chain.

Diagnosis: retries_per_minute > 100. The retry load is causing otherwise healthy providers to degrade.

Fix (immediate): Reduce retry count to stop the amplification:

curl -X PUT https://api.curate-me.ai/gateway/admin/settings \ -H "Authorization: Bearer $ADMIN_TOKEN" \ -H "Content-Type: application/json" \ -d '{"max_provider_retries": 1}'

Fix (long-term): Add exponential backoff between retries. Review whether the default retry count of 3 is appropriate for your traffic volume.

Cause D: DNS resolution failure

If the gateway can’t resolve provider hostnames, all providers appear down simultaneously.

Diagnosis: All providers failing with connection errors, not HTTP errors. Check DNS:

# On VPS ssh $DEPLOY_USER@$PLATFORM_VPS_IP # Test DNS resolution dig api.openai.com +short dig api.anthropic.com +short # Check /etc/resolv.conf for DNS server cat /etc/resolv.conf

Fix: If DNS is failing, add a fallback DNS server:

# Temporarily add Google DNS echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf

Step 4: Verify resolution

After applying fixes, confirm the failover loop has stopped:

# Check provider health — circuit breakers should stabilize curl https://api.curate-me.ai/gateway/admin/providers/health \ -H "Authorization: Bearer $ADMIN_TOKEN" # Check retry rate has dropped curl https://api.curate-me.ai/gateway/admin/metrics \ -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.retry_metrics' # Send a test request curl -s -w "\nHTTP: %{http_code}, Time: %{time_total}s\n" \ https://api.curate-me.ai/v1/openai/chat/completions \ -H "X-CM-API-Key: $API_KEY" \ -H "Content-Type: application/json" \ -d '{"model": "gpt-4o-mini", "messages": [{"role":"user","content":"ping"}], "max_tokens": 5}' # HTTP: 200, Time should be < 3s # Monitor error rate for 5 minutes ./scripts/errors recent

Prevention

MeasureHow
Provider health monitoringAlert when any provider circuit breaker opens
Retry budgetSet a per-minute retry cap to prevent storms
Diverse provider routingEnsure fallback chains include providers on different infrastructure
DNS redundancyConfigure multiple DNS servers in /etc/resolv.conf
Load sheddingUnder extreme failover, shed low-priority traffic to protect capacity


Rollback

Revert the changes described in the Procedure section. If a configuration change was made, restore the previous value from the MongoDB audit log or Redis backup.

Verification

After applying the fix, verify:

  • The symptoms listed above are no longer present
  • No new errors in gateway logs: docker logs curateme-backend-gateway --tail=50
  • Health check passes: curl -s http://localhost:8002/health | jq .status

Escalation

  1. If all providers are down for more than 15 minutes, declare SEV1 per the Incident Response playbook
  2. Collect: provider health output, ./scripts/errors by-source gateway, retry metrics
  3. If retry storms persist after reducing max_provider_retries to 1, check for application-level retries in customer code
  4. Contact platform team with: which providers are affected, circuit breaker states, and total time in failover loop