saif ur rahman
Why Your LLM Pipeline Needs Circuit Breakers

Most LLM demos work perfectly.

Until they don’t.

You test your prompt in the playground. It responds beautifully. You wire it into production. A few users try it. Everything seems fine.

Then traffic increases.

Then Bedrock throttles.

Then retries start firing.

Then your queue depth spikes.

Then you accidentally DDoS your own model endpoint.

This is the moment most AI systems fail — not because of intelligence, but because of infrastructure.

If you're building a real production AI backend, you don’t just need prompts.

You need circuit breakers.

The Illusion of Reliability in LLM Systems

When we integrate an LLM into a system, it feels like calling any other API:

await callLLM(prompt)

But LLMs are not ordinary APIs.

They are:

  • Capacity-constrained
  • Rate-limited
  • Token-limited
  • Region-dependent
  • Occasionally throttled
  • Sometimes unavailable

And when they fail, they fail in bursts.

The Real Failure Modes

Let’s look at what actually happens in production.

1. Bedrock Throttling

You’ll see errors like:

ThrottlingException: Too many tokens per day

Or:

Rate exceeded

This is not a bug in your code.

This is capacity control.

But here’s where it becomes dangerous:

If your system retries immediately, you amplify the problem.

2. Retry Storms

Imagine 500 concurrent requests.

Each one gets throttled.

Each one retries instantly.

Now you have 1,000 requests.

They retry again.

Now you have 2,000.

You’ve created a retry storm.

Your queue explodes.
Your workers saturate.
Your AI endpoint collapses.

This is how fragile AI backends implode.
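The arithmetic above is simple doubling: when every request fails and immediately retries, each round doubles the in-flight load. A throwaway helper makes the growth explicit:

```javascript
// Back-of-envelope: with instant retries and a 100% failure rate,
// each retry round doubles the number of in-flight requests.
function loadAfterRounds(initial, rounds) {
  return initial * 2 ** rounds;
}

// loadAfterRounds(500, 2) === 2000
```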

3. Naive Exponential Backoff Isn’t Enough

Most developers think this solves it:

retryWithExponentialBackoff()

That’s necessary.

But it’s not sufficient.

Because if the upstream dependency (Bedrock) is hard-throttled for minutes or hours, exponential backoff just spreads out the pain.

You still keep hitting a failing system.
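For reference, a typical backoff-with-jitter helper looks like this (the function name and constants are illustrative). Notice that it only delays calls — it never stops them:

```javascript
// Exponential backoff with full jitter. Necessary, but on its own it
// only spreads the retries out: every attempt still hits the endpoint.
async function retryWithBackoff(fn, { retries = 5, baseMs = 200, capMs = 10_000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      // Full jitter: random delay in [0, min(cap, base * 2^attempt)).
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```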

What you actually need is a circuit breaker.

What Is a Circuit Breaker (In AI Context)?

A circuit breaker is a control mechanism that:

  1. Detects repeated failures
  2. Stops sending traffic to a failing dependency
  3. Waits for recovery
  4. Gradually restores traffic

It prevents cascading failures.

It protects your infrastructure from external instability.

In LLM systems, it’s mandatory.
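Those four steps can be sketched as a minimal in-memory state machine — closed, open, half-open. The class name and thresholds below are illustrative, not from any particular library:

```javascript
// Minimal circuit breaker: closed -> open -> (cooldown) -> half-open -> closed.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 60_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = "closed";
    this.openedAt = 0;
  }

  canRequest(now = Date.now()) {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow probe traffic
    }
    return this.state !== "open";
  }

  recordSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now = Date.now()) {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open"; // trip: stop sending traffic
      this.openedAt = now;
    }
  }
}
```

Every model call then checks `breaker.canRequest()` first and reports its outcome back with `recordSuccess()` or `recordFailure()`.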

Designing Circuit Breakers for LLM Pipelines

1. Failure Threshold Detection

Track consecutive failures:

  • ThrottlingException
  • Timeout
  • 5xx responses
  • Token quota exceeded

If failure rate exceeds a threshold (e.g., 30% in 1 minute):

Trip the breaker.

Store this state in:

  • Memory (single worker)
  • Redis (multi-instance)
  • DynamoDB (serverless safe)
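The "30% in 1 minute" rule can be sketched as a sliding window over recent outcomes. The `FailureWindow` name, window size, and the `minSamples` guard (so one early failure doesn't trip an idle breaker) are assumptions for illustration:

```javascript
// Sliding-window failure-rate tracker for threshold detection.
class FailureWindow {
  constructor({ windowMs = 60_000, rateThreshold = 0.3, minSamples = 10 } = {}) {
    this.windowMs = windowMs;
    this.rateThreshold = rateThreshold;
    this.minSamples = minSamples;
    this.events = []; // { at, failed }
  }

  record(failed, at = Date.now()) {
    this.events.push({ at, failed });
    // Drop events that have aged out of the window.
    this.events = this.events.filter((e) => at - e.at <= this.windowMs);
  }

  shouldTrip(at = Date.now()) {
    const recent = this.events.filter((e) => at - e.at <= this.windowMs);
    if (recent.length < this.minSamples) return false; // not enough data
    const failures = recent.filter((e) => e.failed).length;
    return failures / recent.length >= this.rateThreshold;
  }
}
```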

2. Open the Circuit

When open:

Do NOT call Bedrock.

Instead:

  • Return a graceful error
  • Queue for later processing
  • Route to fallback model

This prevents retry storms.
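The fast-fail path can be sketched as one wrapper, assuming a breaker object exposing `canRequest`/`recordSuccess`/`recordFailure` (hypothetical names):

```javascript
// Fast-fail wrapper: when the circuit is open, never touch the model.
async function guardedCall(breaker, callModel, onOpen) {
  if (!breaker.canRequest()) {
    return onOpen(); // graceful error, queue for later, or fallback model
  }
  try {
    const result = await callModel();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```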

3. Half-Open State

After a cooldown (e.g., 60 seconds):

Allow limited traffic:

  • 1 request
  • Then 5
  • Then 10

If successful → close the breaker.
If failed → reopen immediately.

Controlled recovery is critical.
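One way to sketch the 1 → 5 → 10 ramp (the `HalfOpenGate` name and step schedule are illustrative):

```javascript
// Gate that ramps traffic back up after a cooldown: 1 probe, then 5, then 10.
class HalfOpenGate {
  constructor(steps = [1, 5, 10]) {
    this.steps = steps;
    this.step = 0;
    this.inFlight = 0;
  }

  tryAcquire() {
    const limit = this.steps[Math.min(this.step, this.steps.length - 1)];
    if (this.inFlight >= limit) return false; // over the probe budget
    this.inFlight += 1;
    return true;
  }

  onSuccess() {
    this.inFlight -= 1;
    this.step += 1; // widen the gate after each successful probe
  }

  onFailure() {
    this.inFlight = 0;
    this.step = 0; // reopen immediately: back to square one
  }
}
```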

Fallback Models: Your Safety Net

Circuit breakers should not just stop traffic.

They should degrade gracefully.

Primary model:

Claude Sonnet 4.5

Fallback model:

Claude 3 Sonnet

Emergency fallback:

Claude Haiku

If high-tier model fails:

  • Automatically switch to smaller model
  • Reduce max_tokens
  • Return simplified output

Users prefer partial functionality over total outage.
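A minimal sketch of that fallback chain — the model IDs and token limits below are placeholders, not real Bedrock model IDs:

```javascript
// Try models in capability order; reduce max_tokens as we fall back.
const MODEL_CHAIN = [
  { id: "primary-claude-sonnet", maxTokens: 4096 },
  { id: "fallback-claude-3-sonnet", maxTokens: 2048 },
  { id: "emergency-claude-haiku", maxTokens: 1024 },
];

async function callWithFallback(invoke, chain = MODEL_CHAIN) {
  let lastErr;
  for (const model of chain) {
    try {
      return await invoke(model.id, model.maxTokens);
    } catch (err) {
      lastErr = err; // degrade to the next, smaller model
    }
  }
  throw lastErr; // total outage: every tier failed
}
```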


Auto-Disabling Failing Endpoints

In distributed AI systems, you might have:

  • Multiple regions
  • Multiple models
  • Multiple inference profiles

If one endpoint begins failing:

Disable it automatically.

Maintain a health registry:

us-east-1: unhealthy
eu-west-1: healthy

Route traffic only to healthy regions.
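A toy in-memory version of that registry (in production the state would live in Redis or DynamoDB so every worker shares one view):

```javascript
// Health registry mapping regions to status; updated by the breaker layer.
const registry = new Map([
  ["us-east-1", "unhealthy"],
  ["eu-west-1", "healthy"],
]);

function healthyRegions(reg = registry) {
  return [...reg.entries()]
    .filter(([, status]) => status === "healthy")
    .map(([region]) => region);
}

function pickRegion(reg = registry) {
  const candidates = healthyRegions(reg);
  if (candidates.length === 0) throw new Error("no healthy regions");
  // Round-robin or weighted selection could go here; random keeps it simple.
  return candidates[Math.floor(Math.random() * candidates.length)];
}
```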

This is how resilient systems behave.

A Safe Architecture Pattern

API Gateway
    ↓
Request Lambda
    ↓
SQS
    ↓
Worker Lambda
    ↓
Circuit Breaker Layer
    ↓
LLM Call
    ↓
Fallback Router
    ↓
S3 + DynamoDB

Never let your worker blindly call the model.

Always call through a protective layer.

AI Systems Are Distributed Systems

LLM integration is not prompt engineering.

It’s distributed systems engineering.

If you wouldn’t connect your production system to a database without:

  • Connection pooling
  • Retry logic
  • Circuit breakers
  • Health checks

Then you shouldn’t connect it directly to an LLM either.

Final Thought

LLMs are probabilistic.

Infrastructure must be deterministic.

If you don’t design protective layers around your AI dependencies, your system will eventually fail under load.

Not because your model is bad.

But because your architecture is fragile.

And fragile systems don’t scale.
