The Incident
Tuesday, 2:47 PM
Our customer support chatbot is handling 280 RPS. Everything’s fine.
2:53 PM
Traffic hits 310 RPS. Response times spike. Users start complaining in Slack.
2:58 PM
P99 latency reaches 18 seconds. Some requests time out completely.
3:05 PM
We manually restart LiteLLM. Traffic drops during the restart. Users are angry.
This happened three times that week.
What We Thought the Problem Was
- “Maybe we need more replicas”
- “Let’s add a load balancer”
- “Probably need better hardware”
We scaled horizontally and added three more LiteLLM instances.
Result
- Cost increased 4×
- Traffic hit 320 RPS
- The same issues appeared
- All instances struggled simultaneously
What the Problem Actually Was
LiteLLM is built on Python + FastAPI.
At low traffic (< 200 RPS), it works well.
Past 300 RPS, Python’s architecture becomes the bottleneck.
The Python Problem
- GIL (Global Interpreter Lock): Only one thread executes Python code at a time (see the sketch after this list)
- Async overhead: Event loop coordination adds latency
- Memory pressure: Heavy dependencies + long-running processes
- GC pauses: Garbage collection freezes request handling
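You can see the GIL's effect with a tiny, self-contained experiment (illustrative only, not LiteLLM code): CPU-bound work gets essentially no speedup from extra threads on CPython.

```python
# GIL demo (illustrative): CPU-bound work gets almost no speedup from extra
# threads, because CPython runs only one thread's bytecode at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def busy_work(n: int = 2_000_000) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(workers: int, tasks: int = 4) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: busy_work(), range(tasks)))
    return time.perf_counter() - start

print(f"1 thread : {run(1):.2f}s")
print(f"4 threads: {run(4):.2f}s")  # roughly the same wall-clock time
```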
What We Observed at 350 RPS (Single Instance)
- CPU: 85% (one core maxed due to GIL)
- Memory: 3.2 GB → 5.1 GB → 6.8 GB (steadily climbing)
- Latency: 200 ms → 2 s → 12 s → timeout
- GC pauses: 100–300 ms every ~30 seconds
After 2 hours, memory reached 8 GB.
The process was killed by the OOM killer.
This isn’t a LiteLLM-specific issue.
It’s Python hitting its limits at high throughput.
How Bifrost Solves This
We needed production-grade infrastructure, not a prototype that breaks under load.
So we built Bifrost in Go, specifically for high-throughput LLM workloads.
It’s open source and MIT licensed.
Key Architectural Differences
- True Concurrency (No GIL)
Go’s goroutines execute in parallel across all CPU cores.
// Thousands of goroutines, truly parallel
go handleRequest(req1)
go handleRequest(req2)
go handleRequest(req3)
// All executing simultaneously
- Lightweight Concurrency
Go: 10,000 goroutines ≈ 100 MB of memory
Python: 10,000 threads or async tasks → out of memory
- Predictable Memory
Go's garbage collector is designed for low-latency systems:
  - Concurrent GC (doesn't stop the world)
  - Predictable pause times (typically < 1 ms)
  - No circular-reference memory leaks
- Native HTTP/2
  - Built-in HTTP/2 support
  - Request multiplexing
  - No external dependencies
The Real-World Difference
We ran the same production workload through both gateways.
Test: Customer support chatbot, real user traffic
Load: 500 RPS sustained
| Metric | LiteLLM (3 × t3.xlarge) | Bifrost (1 × t3.large) |
| --- | --- | --- |
| P50 latency | 2.1 s | 230 ms |
| P99 latency | 23.4 s | 520 ms |
| Memory | 4–7 GB per instance (climbing) | 1.4 GB (stable) |
| Timeout rate | 8% | 0.1% |
| Cost | ~$450/month | ~$60/month |
| Stability | Restart required every 6–8 hours | 30+ days without restart |
Result
- 45× faster P99 latency
- 7× cheaper
- Actually stable
But Bifrost Isn’t Just About Performance
Rebuilding from scratch let us add production features LiteLLM doesn’t have.
1. Adaptive Load Balancing
Multiple API keys?
Bifrost continuously monitors:
- Latency
- Error rates
Traffic is automatically reweighted in real time:
├─ Key 1: 1.2× weight (healthy)
├─ Key 2: 0.5× weight (high latency)
└─ Key 3: 1.0× weight (normal)
No manual intervention required.
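As a rough mental model (an illustrative sketch, not Bifrost's actual internals), this kind of adaptive routing boils down to weighted random selection where each key's weight shrinks as its recent latency or error rate grows:

```python
# Illustrative adaptive key selection (not Bifrost's code): keys with high
# recent latency or error rates get smaller weights and receive less traffic.
import random
from dataclasses import dataclass, field

@dataclass
class KeyStats:
    name: str
    latencies: list = field(default_factory=list)  # recent latencies, seconds
    errors: int = 0
    requests: int = 0

    def weight(self) -> float:
        recent = self.latencies[-50:]
        avg_latency = sum(recent) / max(len(recent), 1)
        error_rate = self.errors / max(self.requests, 1)
        # Slow or failing keys shrink toward a small floor so they can
        # recover once they're healthy again.
        return max(0.05, (1.0 / (1.0 + avg_latency)) * (1.0 - error_rate))

def pick_key(keys: list) -> KeyStats:
    weights = [k.weight() for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]

keys = [KeyStats("key-1"), KeyStats("key-2"), KeyStats("key-3")]
keys[1].latencies, keys[1].requests = [2.5] * 10, 10  # key-2 is slow
print(pick_key(keys).name)  # key-2 is chosen far less often
```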
2. Semantic Caching
Not exact-match caching — semantic similarity.
“How do I reset my password?”
“What’s the password reset process?”
The second query hits the cache.
- Cache hit rate: 40%
- Cost savings: ~$1,200/month
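Conceptually, a semantic cache stores an embedding alongside each response and serves the cached answer when a new query is similar enough. Here's a minimal sketch of that idea; the bag-of-words `embed()` is only a stand-in so the example runs anywhere, and none of this is Bifrost's actual implementation:

```python
# Illustrative semantic-cache sketch (not Bifrost's implementation).
# A real deployment would use a proper embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.3):
        # Threshold is low because the toy embedding is crude; real
        # embedding models score near-duplicate questions much higher.
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("How do I reset my password?", "Go to Settings -> Security -> Reset password.")
print(cache.get("What's the password reset process?"))  # similar enough: cache hit
```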
3. Zero-Overhead Observability
Every request is logged with full context:
- Inputs / outputs
- Token usage
- Latency breakdown
- Cost per request
All async. Zero performance impact.
Built-in dashboard.
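A common way to keep logging off the hot path (a generic sketch under that assumption, not Bifrost's internals) is a bounded in-memory queue drained by a background worker, so the request path only ever enqueues:

```python
# Generic async request-logging sketch: the request path only enqueues a
# record; a background thread does the actual I/O.
import json
import queue
import threading

log_queue = queue.Queue(maxsize=10_000)

def log_worker() -> None:
    while True:
        record = log_queue.get()
        print(json.dumps(record))  # stand-in for writing to a store/dashboard
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()

def log_request(record: dict) -> None:
    try:
        log_queue.put_nowait(record)  # never block the request path
    except queue.Full:
        pass  # drop the record rather than slow requests down

log_request({"model": "gpt-4o", "tokens": 182, "latency_ms": 240, "cost_usd": 0.0012})
log_queue.join()  # demo only: let the worker flush before the script exits
```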
4. Production-Grade Failover
Primary provider down?
Bifrost automatically fails over.
We’ve had OpenAI incidents where traffic switched to Anthropic automatically.
Users didn’t notice.
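The failover logic itself is simple in principle. Here's a hedged sketch (illustrative, not Bifrost's code), where `call_openai` and `call_anthropic` are hypothetical client functions you would supply:

```python
# Hedged failover sketch: try providers in priority order and fall through
# to the next one on any failure.
def complete_with_failover(prompt: str, providers: list) -> str:
    """providers: list of (name, callable) pairs, tried in order."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # timeouts, 5xx, rate limits, ...
            last_error = exc
            print(f"{name} failed ({exc}); trying next provider")
    raise RuntimeError("all providers failed") from last_error

# Usage with hypothetical client functions:
# answer = complete_with_failover(
#     "Hi", [("openai", call_openai), ("anthropic", call_anthropic)]
# )
```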
Migration Was Surprisingly Easy
Expected: Days of refactoring
Actual: ~15 minutes
Step 1: Start Bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Step 2: Add API Keys
Visit: http://localhost:8080
Step 3: Change One Line in Code
Before
import openai
openai.api_key = "sk-..."
After
import openai
openai.api_base = "http://localhost:8080/openai"
openai.api_key = "sk-..."
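The snippet above uses the pre-1.0 `openai` package. If you're on `openai>=1.0`, the equivalent change (assuming the same gateway URL and a model your keys can access) looks like this:

```python
# Equivalent setup for openai >= 1.0, pointed at the same gateway URL.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/openai",  # Bifrost, as configured above
    api_key="sk-...",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model your keys can access
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```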
Step 4: Deploy
That’s it.
Bifrost is OpenAI-compatible.
If your code works with OpenAI, it works with Bifrost.
Supports LangChain, LlamaIndex, LiteLLM SDK, and more.
The Production Rollout
Week 1: 10% traffic
No issues
Latency down 60%
Week 2: 50% traffic
Still stable
Costs already dropping
Week 3: 100% migration
Shut down 2 of 3 LiteLLM instances
Performance better than ever
Three Months Later
Zero downtime incidents
Handling 800+ RPS during peaks
Monthly cost: $60 vs $450
No manual restarts
When to Use Each
Use LiteLLM if:
- You’re prototyping
- Traffic is < 100 RPS
- You need deep Python ecosystem integration
- You’re okay with manual scaling and monitoring
Use Bifrost if:
- You’re running production workloads
- Traffic > 200 RPS (or will be soon)
- You care about P99 latency
- You want predictable costs
- You’re tired of restarting your gateway
Try Bifrost
Open source (MIT). Run it locally in 30 seconds:
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
Visit http://localhost:8080, add your API keys, and point your app at Bifrost.
Benchmark It Yourself
cd bifrost/benchmarks
./benchmark -provider bifrost -rate 500 -duration 60
Compare with your current setup.
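If you'd rather script your own comparison, a small async load generator is enough to get P50/P99 numbers. This sketch uses `httpx` and assumes an OpenAI-compatible chat-completions URL; adjust the URL, model, and key for whatever gateway you're testing:

```python
# Roll-your-own latency check (illustrative): fire N concurrent requests and
# report P50/P99. The URL below is an assumed path; adjust for your gateway.
import asyncio
import statistics
import time

import httpx  # pip install httpx

URL = "http://localhost:8080/openai/v1/chat/completions"  # assumption: adjust as needed
HEADERS = {"Authorization": "Bearer sk-..."}
BODY = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "ping"}]}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json=BODY, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main(n: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(*(one_request(client) for _ in range(n))))
    print(f"P50: {statistics.median(latencies) * 1000:.0f} ms")
    print(f"P99: {latencies[int(0.99 * (len(latencies) - 1))] * 1000:.0f} ms")

asyncio.run(main())
```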
The Bottom Line
LiteLLM breaking at ~300 RPS wasn’t a bug.
It was Python hitting its architectural limits.
We needed production-grade infrastructure.
So we built it — in Go — and open sourced it.
If you’re hitting scale issues with your LLM gateway, you’re not alone.
We hit them too.
Bifrost solved them. Might solve yours.
Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started
Docs: https://docs.getbifrost.ai
Repo: https://github.com/maximhq/bifrost
Built by the team at Maxim AI. We also build evaluation and observability tools for production AI systems.