Why Your Retry Loop Gets 0% Recovery for LLM API Failures
When I started building production AI applications, I assumed standard fault tolerance patterns would work. Retry and circuit breakers have solved distributed systems problems for decades.
But for LLM APIs, they fail spectacularly.
I ran 6000+ real API calls to prove it.
The Experiment
Four approaches tested:
- Plain API calls - no protection
- Simple retry - 3 attempts with exponential backoff
- Circuit breaker - fast fail after threshold
- Self-healing flywheel - adaptive fault recovery
Results
| Scenario | Plain | Retry | Circuit Breaker | Flywheel |
|---|---|---|---|---|
| normal | 96.5% | 95.0% | 95.1% | 97.1% |
| timeout | 0% | 0% | 0% | 91.9% |
| invalid_model | 0% | 0% | 0% | 86.2% |
| empty_body | 0% | 0% | 0% | 97.2% |
Recovery rate = share of calls that got a successful response within 30 seconds.
The Problem with Traditional Patterns
Retry assumes transient failures
```python
import time

def call_with_retry(prompt):
    # call_llm_api and exponential_backoff are assumed helpers
    for attempt in range(3):
        try:
            return call_llm_api(prompt)
        except TimeoutError:
            if attempt == 2:
                raise  # out of attempts: give up
            time.sleep(exponential_backoff(attempt))
```
But LLM API failures are often structural:
- Model temporarily unavailable
- Invalid model name
- Rate limits
A circuit breaker just fails fast. It protects callers from pile-ups, but it does not help when the underlying issue persists.
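To see why, here is a minimal circuit-breaker sketch (the threshold and cooldown values are hypothetical). Once open, it rejects calls outright; nothing in the loop ever changes the request, so a structural failure stays failed:

```python
import time

class CircuitBreaker:
    """Minimal sketch: after `threshold` consecutive failures the circuit
    opens and rejects calls for `cooldown` seconds, then allows one probe."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")  # no fix attempted
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
```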
The Flywheel Approach
Instead of assuming failures are transient, assume the current strategy might be wrong, and cycle through four steps (a sketch follows the list):
- Detect - Identify failure pattern
- Adapt - Switch to alternative strategy
- Learn - Record what worked
- Optimize - Improve over time
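Here is a minimal sketch of that loop. The names (`Flywheel`, `call`, the strategy list) are illustrative, not the production code, which is more elaborate: it tries strategies in order, classifies each failure, and remembers which strategy recovered which failure type so later calls jump straight to it.

```python
import time

class Flywheel:
    """Minimal sketch of detect -> adapt -> learn -> optimize.

    `strategies` is an ordered list of callables (prompt -> response),
    with the primary strategy first.
    """

    def __init__(self, strategies):
        self.strategies = strategies
        self.learned = {}  # failure signature -> index of strategy that recovered it

    def call(self, prompt, deadline=30.0):
        start = time.monotonic()
        tried, last_sig, idx = set(), None, 0
        while time.monotonic() - start < deadline:
            try:
                result = self.strategies[idx](prompt)
                if last_sig is not None:
                    self.learned[last_sig] = idx  # learn: remember what recovered this failure
                return result
            except Exception as exc:
                tried.add(idx)
                last_sig = type(exc).__name__  # detect: classify the failure
                remaining = [j for j in range(len(self.strategies)) if j not in tried]
                if not remaining:
                    break  # every strategy has failed once
                learned = self.learned.get(last_sig)
                # adapt + optimize: prefer the strategy that previously fixed this failure type
                idx = learned if learned in remaining else remaining[0]
        raise RuntimeError(f"all strategies failed (last failure: {last_sig})")
```

A strategy list might be: primary model, a fallback model, a longer timeout, a cached answer. The key design choice is that the client changes what it sends, not just when it resends.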
The Most Interesting Result
The invalid_model scenario starts at 0% recovery but climbs across three learning cycles, eventually reaching 100%; averaged over the whole run, that is the 86.2% in the table.
It learns from failures.
Verification
- 111 real failures recorded
- SHA-256: `116e49febfafc3f8503d2debe2f024446e21c601f5afe9b44e17a2a3ebec9179`
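To check the hash yourself, here is a minimal sketch; the file name `failures.jsonl` is a placeholder, so substitute whatever artifact you download:

```python
import hashlib

# "failures.jsonl" is a placeholder name, not the post's actual artifact name
with open("failures.jsonl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "116e49febfafc3f8503d2debe2f024446e21c601f5afe9b44e17a2a3ebec9179"
```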
I am 王桂桂, founder of NeuralBridge. Demo: https://neuralbridge-ai.surge.sh