Multi-Provider Failover Loops
Symptoms
- High latency on all requests (10s+ instead of 1-2s)
- Logs show rapid provider switching
- Circuit breaker toggling between open and half-open
- 502 errors intermittently returned to users
Likely Causes
- Multiple providers down simultaneously — rare but possible during major outages
- Circuit breaker thresholds too low — opens too quickly on transient errors
- Retry storm — failed requests retry, amplifying load on remaining providers
- DNS resolution failure — provider endpoints unreachable
Triage Steps
1. Check provider health
./scripts/analytics health
# Look at: provider_status for each provider2. Check circuit breaker states
# View current circuit breaker states
curl https://api.curate-me.ai/gateway/admin/providers/health \
-H "Authorization: Bearer $JWT"3. Check for retry storms
./scripts/errors by-source gateway | grep "routing_fallback"Resolution
Force circuit breaker open
Stop sending traffic to a known-bad provider:
curl -X POST https://api.curate-me.ai/gateway/admin/providers/openai/circuit-breaker \
-H "Authorization: Bearer $JWT" \
-d '{"state": "open", "reason": "manual override during outage"}'Reduce retry attempts
Temporarily lower retry count to reduce amplification:
# Set max retries to 1 (from default 3)
curl -X PUT https://api.curate-me.ai/gateway/admin/settings \
-H "Authorization: Bearer $JWT" \
-d '{"max_provider_retries": 1}'Escalation
If all providers are down, enable the response cache as a degraded-mode fallback and alert affected orgs.