Most LLM demos work perfectly.
Until they don’t.
You test your prompt in the playground. It responds beautifully. You wire it into production. A few users try it. Everything seems fine.
Then traffic increases.
Then Bedrock throttles.
Then retries start firing.
Then your queue depth spikes.
Then you accidentally DDoS your own model endpoint.
This is the moment most AI systems fail — not because of intelligence, but because of infrastructure.
If you're building a real production AI backend, you don’t just need prompts.
You need circuit breakers.
The Illusion of Reliability in LLM Systems
When we integrate an LLM into a system, it feels like calling any other API:
await callLLM(prompt)
But LLMs are not ordinary APIs.
They are:
- Capacity-constrained
- Rate-limited
- Token-limited
- Region-dependent
- Occasionally throttled
- Sometimes unavailable
And when they fail, they fail in bursts.
The Real Failure Modes
Let’s look at what actually happens in production.
1. Bedrock Throttling
You’ll see errors like:
ThrottlingException: Too many tokens per day
Or:
Rate exceeded
This is not a bug in your code.
This is capacity control.
But here’s where it becomes dangerous:
If your system retries immediately, you amplify the problem.
2. Retry Storms
Imagine 500 concurrent requests.
Each one gets throttled.
Each one retries instantly.
Now you have 1,000 requests.
They retry again.
Now you have 2,000.
You’ve created a retry storm.
Your queue explodes.
Your workers saturate.
Your AI endpoint collapses.
This is how fragile AI backends implode.
3. Naive Exponential Backoff Isn't Enough
Most developers think this solves it:
retryWithExponentialBackoff()
That’s necessary.
But it’s not sufficient.
Because if the upstream dependency (Bedrock) is hard-throttled for minutes or hours, exponential backoff just spreads out the pain.
You still keep hitting a failing system.
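For the cases where retrying is appropriate, jitter matters: plain exponential backoff lets throttled clients re-synchronize into a new burst. A minimal sketch of backoff with full jitter, assuming illustrative names and parameters (nothing here is a specific SDK's API):

```typescript
// Full jitter: delay = random(0, min(cap, base * 2^attempt)).
// Randomizing the whole range spreads retries out so throttled
// clients don't all fire again at the same instant.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry wrapper around any async call, bounded by maxAttempts.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError; // attempts exhausted: let the caller decide what to do
}
```

Even with jitter, this only smooths short bursts. It does nothing about a dependency that stays down.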
What you actually need is a circuit breaker.
What Is a Circuit Breaker (In AI Context)?
A circuit breaker is a control mechanism that:
- Detects repeated failures
- Stops sending traffic to a failing dependency
- Waits for recovery
- Gradually restores traffic
It prevents cascading failures.
It protects your infrastructure from external instability.
In LLM systems, it’s mandatory.
Designing Circuit Breakers for LLM Pipelines
1. Failure Threshold Detection
Track consecutive failures:
- ThrottlingException
- Timeout
- 5xx responses
- Token quota exceeded
If the failure rate exceeds a threshold (e.g., 30% in 1 minute):
Trip the breaker.
Store this state in:
- Memory (single worker)
- Redis (multi-instance)
- DynamoDB (serverless safe)
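The detection step can be sketched as a sliding-window failure counter. The 30%-in-1-minute threshold matches the example above; the class and parameter names are illustrative, and the in-memory array stands in for Redis or DynamoDB state:

```typescript
type Outcome = { timestamp: number; failed: boolean };

// Sliding-window failure-rate detector: records request outcomes and
// answers "should the breaker trip right now?"
class FailureWindow {
  private outcomes: Outcome[] = [];

  constructor(
    private windowMs = 60_000,          // look-back window: 1 minute
    private failureRateThreshold = 0.3, // trip at 30% failures
    private minSamples = 10,            // don't trip on one unlucky request
  ) {}

  record(failed: boolean, now = Date.now()): void {
    this.outcomes.push({ timestamp: now, failed });
  }

  shouldTrip(now = Date.now()): boolean {
    // Drop outcomes that have aged out of the window.
    this.outcomes = this.outcomes.filter((o) => now - o.timestamp <= this.windowMs);
    if (this.outcomes.length < this.minSamples) return false;
    const failures = this.outcomes.filter((o) => o.failed).length;
    return failures / this.outcomes.length >= this.failureRateThreshold;
  }
}
```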
2. Open the Circuit
When open:
Do NOT call Bedrock.
Instead:
- Return a graceful error
- Queue for later processing
- Route to fallback model
This prevents retry storms.
3. Half-Open State
After a cooldown (e.g., 60 seconds):
Allow limited traffic:
- 1 request
- Then 5
- Then 10
If successful → close the breaker.
If failed → reopen immediately.
Controlled recovery is critical.
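Putting the three states together, a minimal breaker might look like the sketch below. The cooldown mirrors the 60-second example above; the shape and names are assumptions, not a specific library's API:

```typescript
type State = "closed" | "open" | "half-open";

// Three-state circuit breaker: closed (normal), open (fast-fail),
// half-open (probe with limited traffic after a cooldown).
class CircuitBreaker {
  private state: State = "closed";
  private openedAt = 0;
  private consecutiveFailures = 0;
  private halfOpenSuccesses = 0;

  constructor(
    private failureThreshold = 5,          // consecutive failures before opening
    private cooldownMs = 60_000,           // wait before probing again
    private halfOpenSuccessesToClose = 3,  // probes needed to confirm recovery
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow limited traffic
      this.halfOpenSuccesses = 0;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half-open") {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.halfOpenSuccessesToClose) {
        this.state = "closed"; // recovery confirmed
      }
    }
    this.consecutiveFailures = 0;
  }

  recordFailure(now = Date.now()): void {
    this.consecutiveFailures++;
    // Any failure while half-open, or too many while closed, opens the circuit.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  currentState(): State {
    return this.state;
  }
}
```

The worker checks `canRequest()` before every model call and reports the outcome back, so the breaker sees every success and failure.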
Fallback Models: Your Safety Net
Circuit breakers should not just stop traffic.
They should degrade gracefully.
- Primary model: Claude Sonnet 4.5
- Fallback model: Claude 3 Sonnet
- Emergency fallback: Claude Haiku
If the high-tier model fails:
- Automatically switch to smaller model
- Reduce max_tokens
- Return simplified output
Users prefer partial functionality over total outage.
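A fallback router under these assumptions might look like this. The model IDs follow the tiers above, the max_tokens values are illustrative, and `callModel` stands in for your actual Bedrock invocation rather than a real SDK call:

```typescript
type ModelCall = (modelId: string, maxTokens: number) => Promise<string>;

// Ordered by preference: each tier is cheaper and smaller than the last,
// with a reduced token budget to lighten load during degradation.
const FALLBACK_CHAIN = [
  { modelId: "claude-sonnet-4.5", maxTokens: 4096 }, // primary
  { modelId: "claude-3-sonnet", maxTokens: 2048 },   // fallback
  { modelId: "claude-haiku", maxTokens: 1024 },      // emergency
];

async function callWithFallback(callModel: ModelCall): Promise<string> {
  let lastError: unknown;
  for (const { modelId, maxTokens } of FALLBACK_CHAIN) {
    try {
      return await callModel(modelId, maxTokens);
    } catch (err) {
      lastError = err; // this tier failed: try the next, smaller one
    }
  }
  throw lastError; // every tier failed: surface a graceful error upstream
}
```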
Auto-Disabling Failing Endpoints
In distributed AI systems, you might have:
- Multiple regions
- Multiple models
- Multiple inference profiles
If one endpoint begins failing:
Disable it automatically.
Maintain a health registry:
us-east-1: unhealthy
eu-west-1: healthy
Route traffic only to healthy regions.
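A minimal in-memory health registry, assuming a single instance (a multi-instance deployment would keep this state in Redis or DynamoDB, as noted earlier; the names here are illustrative):

```typescript
// Tracks per-region health and routes traffic only to healthy regions.
class HealthRegistry {
  private health = new Map<string, boolean>();

  constructor(regions: string[]) {
    for (const region of regions) this.health.set(region, true); // start healthy
  }

  markUnhealthy(region: string): void {
    this.health.set(region, false);
  }

  markHealthy(region: string): void {
    this.health.set(region, true);
  }

  healthyRegions(): string[] {
    return [...this.health.entries()]
      .filter(([, healthy]) => healthy)
      .map(([region]) => region);
  }

  // Pick a healthy region at random; fail loudly if none remain.
  pickRegion(): string {
    const healthy = this.healthyRegions();
    if (healthy.length === 0) throw new Error("no healthy regions");
    return healthy[Math.floor(Math.random() * healthy.length)];
  }
}
```

Each region's circuit breaker can feed this registry: when a breaker opens, mark the region unhealthy; when it closes, mark it healthy again.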
This is how resilient systems behave.
A Safe Architecture Pattern
API Gateway
↓
Request Lambda
↓
SQS
↓
Worker Lambda
↓
Circuit Breaker Layer
↓
LLM Call
↓
Fallback Router
↓
S3 + DynamoDB
Never let your worker blindly call the model.
Always call through a protective layer.
AI Systems Are Distributed Systems
LLM integration is not prompt engineering.
It’s distributed systems engineering.
If you wouldn’t connect your production system to a database without:
- Connection pooling
- Retry logic
- Circuit breakers
- Health checks
Then you shouldn’t connect it directly to an LLM either.
Final Thought
LLMs are probabilistic.
Infrastructure must be deterministic.
If you don’t design protective layers around your AI dependencies, your system will eventually fail under load.
Not because your model is bad.
But because your architecture is fragile.
And fragile systems don’t scale.