Skip to Content
RunbooksMulti-Provider Failover Loops

Multi-Provider Failover Loops

Symptoms

  • High latency on all requests (10s+ instead of 1-2s)
  • Logs show rapid provider switching
  • Circuit breaker toggling between open and half-open
  • 502 errors intermittently returned to users

Likely Causes

  1. Multiple providers down simultaneously — rare but possible during major outages
  2. Circuit breaker thresholds too low — opens too quickly on transient errors
  3. Retry storm — failed requests retry, amplifying load on remaining providers
  4. DNS resolution failure — provider endpoints unreachable

Triage Steps

1. Check provider health

./scripts/analytics health # Look at: provider_status for each provider

2. Check circuit breaker states

# View current circuit breaker states curl https://api.curate-me.ai/gateway/admin/providers/health \ -H "Authorization: Bearer $JWT"

3. Check for retry storms

./scripts/errors by-source gateway | grep "routing_fallback"

Resolution

Force circuit breaker open

Stop sending traffic to a known-bad provider:

curl -X POST https://api.curate-me.ai/gateway/admin/providers/openai/circuit-breaker \ -H "Authorization: Bearer $JWT" \ -d '{"state": "open", "reason": "manual override during outage"}'

Reduce retry attempts

Temporarily lower retry count to reduce amplification:

# Set max retries to 1 (from default 3) curl -X PUT https://api.curate-me.ai/gateway/admin/settings \ -H "Authorization: Bearer $JWT" \ -d '{"max_provider_retries": 1}'

Escalation

If all providers are down, enable the response cache as a degraded-mode fallback and alert affected orgs.