Debby McKinney

LiteLLM Broke at 300 RPS in Production. Here's How We Fixed It

The Incident

Tuesday, 2:47 PM
Our customer support chatbot is handling 280 RPS. Everything’s fine.

2:53 PM
Traffic hits 310 RPS. Response times spike. Users start complaining in Slack.

2:58 PM
P99 latency reaches 18 seconds. Some requests time out completely.

3:05 PM
We manually restart LiteLLM. Traffic drops during the restart. Users are angry.

This happened three times that week.

What We Thought the Problem Was

  • “Maybe we need more replicas”
  • “Let’s add a load balancer”
  • “Probably need better hardware”

We scaled horizontally and added three more LiteLLM instances.

Result

  • Cost increased 4×
  • Traffic hit 320 RPS
  • The same issues appeared
  • All instances struggled simultaneously

What the Problem Actually Was

LiteLLM is built on Python + FastAPI.

At low traffic (< 200 RPS), it works well.
Past 300 RPS, Python’s architecture becomes the bottleneck.

The Python Problem

  • GIL (Global Interpreter Lock): only one thread executes Python code at a time
  • Async overhead: event loop coordination adds latency
  • Memory pressure: heavy dependencies + long-running processes
  • GC pauses: garbage collection freezes request handling
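
You can see the GIL effect with a small, self-contained experiment (illustrative only, not our gateway code): CPU-bound work gets essentially no speedup from extra threads on CPython, because only one thread runs Python bytecode at a time.

import time
from concurrent.futures import ThreadPoolExecutor

def spin(n: int) -> int:
    # Pure-Python busy loop: holds the GIL for its whole run
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(workers: int, jobs: int = 4, n: int = 5_000_000) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(spin, [n] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"1 thread : {timed(1):.2f}s")
    print(f"4 threads: {timed(4):.2f}s")  # roughly the same wall time on CPython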

What We Observed at 350 RPS (Single Instance)

  • CPU: 85% (one core maxed due to GIL)
  • Memory: 3.2 GB → 5.1 GB → 6.8 GB (steadily climbing)
  • Latency: 200 ms → 2 s → 12 s → timeout
  • GC pauses: 100–300 ms every ~30 seconds

After 2 hours, memory reached 8 GB.
The process was killed by the OOM killer.

This isn’t a LiteLLM-specific issue.
It’s Python hitting its limits at high throughput.

How Bifrost Solves This

We needed production-grade infrastructure, not a prototype that breaks under load.

So we built Bifrost in Go, specifically for high-throughput LLM workloads.
It’s open source and MIT licensed.

Key Architectural Differences

  1. True Concurrency (No GIL)

Go’s goroutines execute in parallel across all CPU cores.

// Thousands of goroutines, truly parallel
go handleRequest(req1)
go handleRequest(req2)
go handleRequest(req3)
// All executing simultaneously

  2. Lightweight Concurrency

Go: 10,000 goroutines ≈ ~100 MB memory
Python: 10,000 threads / async tasks → out of memory

  3. Predictable Memory

Go's garbage collector is designed for low-latency systems:

  • Concurrent GC (doesn't stop the world)
  • Predictable pause times (typically < 1 ms)
  • No circular-reference memory leaks

  4. Native HTTP/2

  • Built-in HTTP/2 support
  • Request multiplexing
  • No external dependencies

The Real-World Difference

We ran the same production workload through both gateways.

Test: Customer support chatbot, real user traffic
Load: 500 RPS sustained

LiteLLM

(3 × t3.xlarge instances)

P50 latency: 2.1 s
P99 latency: 23.4 s
Memory per instance: 4–7 GB (climbing)
Timeout rate: 8%
Cost: ~$450/month
Stability: Restart required every 6–8 hours

Bifrost

(1 × t3.large instance)

P50 latency: 230 ms
P99 latency: 520 ms
Memory: 1.4 GB (stable)
Timeout rate: 0.1%
Cost: ~$60/month
Stability: 30+ days without restart

Result

  • 45× faster P99 latency
  • 7× cheaper
  • Actually stable

But Bifrost Isn’t Just About Performance

Rebuilding from scratch let us add production features LiteLLM doesn’t have.

1. Adaptive Load Balancing

Multiple API keys?
Bifrost continuously monitors:

  • Latency
  • Error rates

Traffic is automatically reweighted:

Real-time weight adjustment:
├─ Key 1: 1.2× weight (healthy)
├─ Key 2: 0.5× weight (high latency)
└─ Key 3: 1.0× weight (normal)

No manual intervention required.
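
Conceptually, this is health-aware weighted selection over your API keys. Here's a rough Python sketch of the idea; the stats, thresholds, and 1.2×/0.5× multipliers are illustrative guesses, not Bifrost's actual (Go) implementation.

import random

# Hypothetical per-key stats, as a health monitor might collect them
keys = {
    "key-1": {"weight": 1.0, "p99_ms": 400,  "error_rate": 0.001},
    "key-2": {"weight": 1.0, "p99_ms": 2600, "error_rate": 0.005},
    "key-3": {"weight": 1.0, "p99_ms": 800,  "error_rate": 0.004},
}

def reweight(stats, latency_budget_ms=1000, error_budget=0.01):
    # Throttle keys that blow the latency or error budget, mildly boost healthy ones
    for s in stats.values():
        w = 1.0
        if s["p99_ms"] > latency_budget_ms:
            w *= 0.5
        if s["error_rate"] > error_budget:
            w *= 0.5
        s["weight"] = 1.2 if w == 1.0 else w

def pick_key(stats):
    # Weighted random choice: healthier keys receive proportionally more traffic
    names = list(stats)
    return random.choices(names, weights=[stats[n]["weight"] for n in names], k=1)[0]

reweight(keys)
print(pick_key(keys))  # "key-2" (high latency) is picked far less often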

2. Semantic Caching

Not exact-match caching — semantic similarity.

“How do I reset my password?”

“What’s the password reset process?”

The second query hits the cache.

  • Cache hit rate: 40%
  • Cost savings: ~$1,200/month
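
Mechanically, semantic caching usually means embedding each prompt and doing a similarity lookup against previously answered prompts. Here's a toy sketch of the idea; the character-frequency "embedding" and the 0.9 threshold are stand-ins, not what Bifrost ships with.

import math

cache = []  # list of (embedding, cached_response)

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: crude character-frequency vector.
    # In production this would be a call to an actual embeddings API.
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(prompt: str, threshold: float = 0.9):
    q = embed(prompt)
    scored = [(cosine(q, emb), resp) for emb, resp in cache]
    best = max(scored, default=(0.0, None))
    return best[1] if best[0] >= threshold else None  # hit only if similar enough

def store(prompt: str, response: str) -> None:
    cache.append((embed(prompt), response))

store("How do I reset my password?", "Go to Settings → Security → Reset password.")
print(lookup("What's the password reset process?"))  # likely a cache hit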

3. Zero-Overhead Observability

Every request is logged with full context:

  • Inputs / outputs
  • Token usage
  • Latency breakdown
  • Cost per request

All async. Zero performance impact.
Built-in dashboard.

4. Production-Grade Failover

Primary provider down?
Bifrost automatically fails over.

We’ve had OpenAI incidents where traffic switched to Anthropic automatically.
Users didn’t notice.

Migration Was Surprisingly Easy

Expected: Days of refactoring
Actual: ~15 minutes

Step 1: Start Bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost

Step 2: Add API Keys

Visit: http://localhost:8080

Step 3: Change One Line in Code

Before

import openai
openai.api_key = "sk-..."

After

import openai
openai.api_base = "http://localhost:8080/openai"
openai.api_key = "sk-..."
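
Note: that snippet uses the legacy module-level config from openai<1.0. On the openai>=1.0 Python SDK, the equivalent is passing base_url to the client (the model name below is just an example):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/openai",  # Bifrost, same path as above
    api_key="sk-...",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whatever you route through Bifrost
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)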

Step 4: Deploy

That’s it.

Bifrost is OpenAI-compatible.
If your code works with OpenAI, it works with Bifrost.

Supports LangChain, LlamaIndex, LiteLLM SDK, and more.
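
For example, pointing LangChain at Bifrost is the same one-line change (this assumes the langchain-openai package, where base_url is the relevant knob):

from langchain_openai import ChatOpenAI

# ChatOpenAI speaks the OpenAI API, so it can target Bifrost's OpenAI-compatible endpoint
llm = ChatOpenAI(
    model="gpt-4o-mini",  # example model name
    base_url="http://localhost:8080/openai",
    api_key="sk-...",
)

print(llm.invoke("Hello").content)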

The Production Rollout

Week 1: 10% traffic

  • No issues
  • Latency down 60%

Week 2: 50% traffic

  • Still stable
  • Costs already dropping

Week 3: 100% migration

  • Shut down 2 of 3 LiteLLM instances
  • Performance better than ever

Three Months Later

  • Zero downtime incidents
  • Handling 800+ RPS during peaks
  • Monthly cost: $60 vs $450
  • No manual restarts

When to Use Each

Use LiteLLM if:

  • You’re prototyping
  • Traffic is < 100 RPS
  • You need deep Python ecosystem integration
  • You’re okay with manual scaling and monitoring

Use Bifrost if:

  • You’re running production workloads
  • Traffic > 200 RPS (or will be soon)
  • You care about P99 latency
  • You want predictable costs
  • You’re tired of restarting your gateway

Try Bifrost

Open source (MIT). Run it locally in 30 seconds:

git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up

Visit http://localhost:8080, add your API keys, and point your app at Bifrost.

Benchmark It Yourself

cd bifrost/benchmarks
./benchmark -provider bifrost -rate 500 -duration 60

Compare with your current setup.
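
Or measure from your own client code. Here's a minimal Python sketch that hammers the gateway and reports P50/P99; the concurrency, request count, and model name are arbitrary, and it assumes the same base_url as in the migration step.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/openai", api_key="sk-...")

def one_request(_: int) -> float:
    # Time a single chat completion through the gateway, in milliseconds
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[{"role": "user", "content": "ping"}],
    )
    return (time.perf_counter() - start) * 1000

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(one_request, range(500)))

print(f"P50: {statistics.median(latencies):.0f} ms")
print(f"P99: {latencies[int(len(latencies) * 0.99)]:.0f} ms")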

The Bottom Line

LiteLLM breaking at ~300 RPS wasn’t a bug.
It was Python hitting its architectural limits.

We needed production-grade infrastructure.
So we built it — in Go — and open sourced it.

If you’re hitting scale issues with your LLM gateway, you’re not alone.
We hit them too.

Bifrost solved them. Might solve yours.

Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started

Docs: https://docs.getbifrost.ai

Repo: https://github.com/maximhq/bifrost

Built by the team at Maxim AI. We also build evaluation and observability tools for production AI systems.
