Most LLM demos work perfectly.
Until they don’t.
You test your prompt in the playground. It responds beautifully. You wire it into production. A few users try it. Everything seems fine.
Then traffic increases.
Then Bedrock throttles.
Then retries start firing.
Then your queue depth spikes.
Then you accidentally DDoS your own model endpoint.
This is the moment most AI systems fail — not because of intelligence, but because of infrastructure.
If you're building a real production AI backend, you don’t just need prompts.
You need circuit breakers.
The Illusion of Reliability in LLM Systems
When we integrate an LLM into a system, it feels like calling any other API:
await callLLM(prompt)
But LLMs are not ordinary APIs.
They are:
- Capacity-constrained
- Rate-limited
- Token-limited
- Region-dependent
- Occasionally throttled
- Sometimes unavailable
And when they fail, they fail in bursts.
The Real Failure Modes
Let’s look at what actually happens in production.
1. Bedrock Throttling
You’ll see errors like:
ThrottlingException: Too many tokens per day
Or:
Rate exceeded
This is not a bug in your code.
This is capacity control.
But here’s where it becomes dangerous:
If your system retries immediately, you amplify the problem.
2. Retry Storms
Imagine 500 concurrent requests.
Each one gets throttled.
Each one retries instantly.
Now you have 1,000 requests.
They retry again.
Now you have 2,000.
You’ve created a retry storm.
Your queue explodes.
Your workers saturate.
Your AI endpoint collapses.
This is how fragile AI backends implode.
3. Naive Exponential Backoff Isn't Enough
Most developers think this solves it:
retryWithExponentialBackoff()
That’s necessary.
But it’s not sufficient.
Because if the upstream dependency (Bedrock) is hard-throttled for minutes or hours, exponential backoff just spreads out the pain.
You still keep hitting a failing system.
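For the cases where retrying is appropriate, jitter matters: plain exponential backoff lets throttled clients re-synchronize into a new burst. A minimal sketch of backoff with full jitter, assuming illustrative names and parameters (nothing here is a specific SDK's API):

```typescript
// Full jitter: delay = random(0, min(cap, base * 2^attempt)).
// Randomizing the whole range spreads retries out so throttled
// clients don't all fire again at the same instant.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry wrapper around any async call, bounded by maxAttempts.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError; // attempts exhausted: let the caller decide what to do
}
```

Even with jitter, this only smooths short bursts. It does nothing about a dependency that stays down.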
What you actually need is a circuit breaker.
What Is a Circuit Breaker (In AI Context)?
A circuit breaker is a control mechanism that:
- Detects repeated failures
- Stops sending traffic to a failing dependency
- Waits for recovery
- Gradually restores traffic
It prevents cascading failures.
It protects your infrastructure from external instability.
In LLM systems, it’s mandatory.
Designing Circuit Breakers for LLM Pipelines
1. Failure Threshold Detection
Track consecutive failures:
- ThrottlingException
- Timeout
- 5xx responses
- Token quota exceeded
If the failure rate exceeds a threshold (e.g., 30% in 1 minute):
Trip the breaker.
Store this state in:
- Memory (single worker)
- Redis (multi-instance)
- DynamoDB (serverless safe)
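The detection step can be sketched as a sliding-window failure counter. The 30%-in-1-minute threshold matches the example above; the class and parameter names are illustrative, and the in-memory array stands in for Redis or DynamoDB state:

```typescript
type Outcome = { timestamp: number; failed: boolean };

// Sliding-window failure-rate detector: records request outcomes and
// answers "should the breaker trip right now?"
class FailureWindow {
  private outcomes: Outcome[] = [];

  constructor(
    private windowMs = 60_000,          // look-back window: 1 minute
    private failureRateThreshold = 0.3, // trip at 30% failures
    private minSamples = 10,            // don't trip on one unlucky request
  ) {}

  record(failed: boolean, now = Date.now()): void {
    this.outcomes.push({ timestamp: now, failed });
  }

  shouldTrip(now = Date.now()): boolean {
    // Drop outcomes that have aged out of the window.
    this.outcomes = this.outcomes.filter((o) => now - o.timestamp <= this.windowMs);
    if (this.outcomes.length < this.minSamples) return false;
    const failures = this.outcomes.filter((o) => o.failed).length;
    return failures / this.outcomes.length >= this.failureRateThreshold;
  }
}
```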
2. Open the Circuit
When open:
Do NOT call Bedrock.
Instead:
- Return a graceful error
- Queue for later processing
- Route to fallback model
This prevents retry storms.
3. Half-Open State
After a cooldown (e.g., 60 seconds):
Allow limited traffic:
- 1 request
- Then 5
- Then 10
If successful → close the breaker.
If failed → reopen immediately.
Controlled recovery is critical.
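Putting the three states together, a minimal breaker might look like the sketch below. The cooldown mirrors the 60-second example above; the shape and names are assumptions, not a specific library's API:

```typescript
type State = "closed" | "open" | "half-open";

// Three-state circuit breaker: closed (normal), open (fast-fail),
// half-open (probe with limited traffic after a cooldown).
class CircuitBreaker {
  private state: State = "closed";
  private openedAt = 0;
  private consecutiveFailures = 0;
  private halfOpenSuccesses = 0;

  constructor(
    private failureThreshold = 5,          // consecutive failures before opening
    private cooldownMs = 60_000,           // wait before probing again
    private halfOpenSuccessesToClose = 3,  // probes needed to confirm recovery
  ) {}

  canRequest(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow limited traffic
      this.halfOpenSuccesses = 0;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half-open") {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.halfOpenSuccessesToClose) {
        this.state = "closed"; // recovery confirmed
      }
    }
    this.consecutiveFailures = 0;
  }

  recordFailure(now = Date.now()): void {
    this.consecutiveFailures++;
    // Any failure while half-open, or too many while closed, opens the circuit.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  currentState(): State {
    return this.state;
  }
}
```

The worker checks `canRequest()` before every model call and reports the outcome back, so the breaker sees every success and failure.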
Fallback Models: Your Safety Net
Circuit breakers should not just stop traffic.
They should degrade gracefully.
- Primary model: Claude Sonnet 4.5
- Fallback model: Claude 3 Sonnet
- Emergency fallback: Claude Haiku
If the high-tier model fails:
- Automatically switch to smaller model
- Reduce max_tokens
- Return simplified output
Users prefer partial functionality over total outage.
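A fallback router under these assumptions might look like this. The model IDs follow the tiers above, the max_tokens values are illustrative, and `callModel` stands in for your actual Bedrock invocation rather than a real SDK call:

```typescript
type ModelCall = (modelId: string, maxTokens: number) => Promise<string>;

// Ordered by preference: each tier is cheaper and smaller than the last,
// with a reduced token budget to lighten load during degradation.
const FALLBACK_CHAIN = [
  { modelId: "claude-sonnet-4.5", maxTokens: 4096 }, // primary
  { modelId: "claude-3-sonnet", maxTokens: 2048 },   // fallback
  { modelId: "claude-haiku", maxTokens: 1024 },      // emergency
];

async function callWithFallback(callModel: ModelCall): Promise<string> {
  let lastError: unknown;
  for (const { modelId, maxTokens } of FALLBACK_CHAIN) {
    try {
      return await callModel(modelId, maxTokens);
    } catch (err) {
      lastError = err; // this tier failed: try the next, smaller one
    }
  }
  throw lastError; // every tier failed: surface a graceful error upstream
}
```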
Auto-Disabling Failing Endpoints
In distributed AI systems, you might have:
- Multiple regions
- Multiple models
- Multiple inference profiles
If one endpoint begins failing:
Disable it automatically.
Maintain a health registry:
us-east-1: unhealthy
eu-west-1: healthy
Route traffic only to healthy regions.
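A minimal in-memory health registry, assuming a single instance (a multi-instance deployment would keep this state in Redis or DynamoDB, as noted earlier; the names here are illustrative):

```typescript
// Tracks per-region health and routes traffic only to healthy regions.
class HealthRegistry {
  private health = new Map<string, boolean>();

  constructor(regions: string[]) {
    for (const region of regions) this.health.set(region, true); // start healthy
  }

  markUnhealthy(region: string): void {
    this.health.set(region, false);
  }

  markHealthy(region: string): void {
    this.health.set(region, true);
  }

  healthyRegions(): string[] {
    return [...this.health.entries()]
      .filter(([, healthy]) => healthy)
      .map(([region]) => region);
  }

  // Pick a healthy region at random; fail loudly if none remain.
  pickRegion(): string {
    const healthy = this.healthyRegions();
    if (healthy.length === 0) throw new Error("no healthy regions");
    return healthy[Math.floor(Math.random() * healthy.length)];
  }
}
```

Each region's circuit breaker can feed this registry: when a breaker opens, mark the region unhealthy; when it closes, mark it healthy again.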
This is how resilient systems behave.
A Safe Architecture Pattern
API Gateway
↓
Request Lambda
↓
SQS
↓
Worker Lambda
↓
Circuit Breaker Layer
↓
LLM Call
↓
Fallback Router
↓
S3 + DynamoDB
Never let your worker blindly call the model.
Always call through a protective layer.
AI Systems Are Distributed Systems
LLM integration is not prompt engineering.
It’s distributed systems engineering.
If you wouldn’t connect your production system to a database without:
- Connection pooling
- Retry logic
- Circuit breakers
- Health checks
Then you shouldn’t connect it directly to an LLM either.
Final Thought
LLMs are probabilistic.
Infrastructure must be deterministic.
If you don’t design protective layers around your AI dependencies, your system will eventually fail under load.
Not because your model is bad.
But because your architecture is fragile.
And fragile systems don’t scale.