Debby McKinney

LiteLLM Broke at 300 RPS in Production. Here's How We Fixed It

The Incident

Tuesday, 2:47 PM
Our customer support chatbot is handling 280 RPS. Everything’s fine.

2:53 PM
Traffic hits 310 RPS. Response times spike. Users start complaining in Slack.

2:58 PM
P99 latency reaches 18 seconds. Some requests time out completely.

3:05 PM
We manually restart LiteLLM. Traffic drops during the restart. Users are angry.

This happened three times that week.

What We Thought the Problem Was

  • “Maybe we need more replicas”
  • “Let’s add a load balancer”
  • “Probably need better hardware”

We scaled horizontally and added three more LiteLLM instances.

Result

  • Cost increased 4×
  • Traffic hit 320 RPS
  • The same issues appeared
  • All instances struggled simultaneously

What the Problem Actually Was

LiteLLM is built on Python + FastAPI.

At low traffic (< 200 RPS), it works well.
Past 300 RPS, Python’s architecture becomes the bottleneck.

The Python Problem

  • GIL (Global Interpreter Lock): only one thread executes Python code at a time (see the sketch below)
  • Async overhead: event loop coordination adds latency
  • Memory pressure: heavy dependencies + long-running processes
  • GC pauses: garbage collection freezes request handling
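
You can see the GIL ceiling for yourself. The snippet below is an illustrative stand-alone script (not from our codebase): it runs the same CPU-bound function twice serially, then in two threads. On CPython the threaded run is no faster.

import threading
import time

def burn(n: int) -> int:
    # Pure-Python CPU-bound loop; it holds the GIL the whole time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

start = time.perf_counter()
burn(N)
burn(N)
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# On CPython the two-thread run is roughly as slow as the serial run (often slower),
# because only one thread can execute Python bytecode at a time.
print(f"serial: {serial:.2f}s, two threads: {threaded:.2f}s")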

What We Observed at 350 RPS (Single Instance)

  • CPU: 85% (one core maxed due to the GIL)
  • Memory: 3.2 GB → 5.1 GB → 6.8 GB (steadily climbing)
  • Latency: 200 ms → 2 s → 12 s → timeout
  • GC pauses: 100–300 ms every ~30 seconds

After 2 hours, memory reached 8 GB.
The process was killed by the OOM killer.

This isn’t a LiteLLM-specific issue.
It’s Python hitting its limits at high throughput.

How Bifrost Solves This

We needed production-grade infrastructure, not a prototype that breaks under load.

So we built Bifrost in Go, specifically for high-throughput LLM workloads.
It’s open source and MIT licensed.

Key Architectural Differences

  1. True Concurrency (No GIL)

Go’s goroutines execute in parallel across all CPU cores.

// Thousands of goroutines, truly parallel
go handleRequest(req1)
go handleRequest(req2)
go handleRequest(req3)
// All executing simultaneously

  2. Lightweight Concurrency

  • Go: 10,000 goroutines ≈ ~100 MB of memory
  • Python: 10,000 threads or async tasks → out of memory

  3. Predictable Memory

Go's garbage collector is designed for low-latency systems:

  • Concurrent GC (no stop-the-world pauses)
  • Predictable pause times (typically < 1 ms)
  • No circular-reference memory leaks

  4. Native HTTP/2

  • Built-in HTTP/2 support
  • Request multiplexing
  • No external dependencies

The Real-World Difference

We ran the same production workload through both gateways.

Test: Customer support chatbot, real user traffic
Load: 500 RPS sustained

LiteLLM

(3 × t3.xlarge instances)

P50 latency: 2.1 s
P99 latency: 23.4 s
Memory per instance: 4–7 GB (climbing)
Timeout rate: 8%
Cost: ~$450/month
Stability: Restart required every 6–8 hours

Bifrost

(1 × t3.large instance)

P50 latency: 230 ms
P99 latency: 520 ms
Memory: 1.4 GB (stable)
Timeout rate: 0.1%
Cost: ~$60/month
Stability: 30+ days without restart

Result

  • 45× lower P99 latency
  • 7× cheaper
  • Actually stable

But Bifrost Isn’t Just About Performance

Rebuilding from scratch let us add production features LiteLLM doesn’t have.

1. Adaptive Load Balancing

Multiple API keys?
Bifrost continuously monitors:

  • Latency
  • Error rates

Traffic is automatically reweighted in real time:
├─ Key 1: 1.2× weight (healthy)
├─ Key 2: 0.5× weight (high latency)
└─ Key 3: 1.0× weight (normal)

No manual intervention required.
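
For intuition, here is a minimal sketch of latency- and error-aware key weighting in Python. This is not Bifrost's actual algorithm; the smoothing factor, penalty formula, and clamping bounds are made-up illustration values.

import random
from dataclasses import dataclass

@dataclass
class KeyStats:
    name: str
    ema_latency_ms: float = 200.0   # exponential moving average of observed latency
    error_rate: float = 0.0         # smoothed error rate in [0, 1]
    weight: float = 1.0

    def record(self, latency_ms: float, ok: bool, alpha: float = 0.1) -> None:
        # Fold each observation into the rolling stats.
        self.ema_latency_ms = (1 - alpha) * self.ema_latency_ms + alpha * latency_ms
        self.error_rate = (1 - alpha) * self.error_rate + alpha * (0.0 if ok else 1.0)
        # Penalize slow or failing keys; clamp so no key is fully starved.
        penalty = (self.ema_latency_ms / 200.0) + 5.0 * self.error_rate
        self.weight = max(0.1, min(2.0, 1.0 / penalty))

def pick_key(keys: list[KeyStats]) -> KeyStats:
    # Weighted random choice: healthier keys receive proportionally more traffic.
    return random.choices(keys, weights=[k.weight for k in keys], k=1)[0]

keys = [KeyStats("key-1"), KeyStats("key-2"), KeyStats("key-3")]
keys[1].record(latency_ms=1800, ok=False)          # simulate one slow, failing key
print({k.name: round(k.weight, 2) for k in keys})  # key-2's share drops automatically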

2. Semantic Caching

Not exact-match caching — semantic similarity.

“How do I reset my password?”

“What’s the password reset process?”

The second query hits the cache.

  • Cache hit rate: 40%
  • Cost savings: ~$1,200/month
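
Conceptually, semantic caching is embedding similarity plus a threshold. The sketch below shows the idea using the OpenAI embeddings API; the model name and the 0.9 cut-off are assumptions for illustration, not Bifrost's internals.

import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
cache: list[tuple[list[float], str]] = []  # (query embedding, cached response)

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def lookup(query: str, threshold: float = 0.9) -> str | None:
    q = embed(query)
    for vec, cached_response in cache:
        if cosine(q, vec) >= threshold:
            return cached_response   # semantic hit: close enough to a past query
    return None                      # miss: call the model, then store() the answer

def store(query: str, response: str) -> None:
    cache.append((embed(query), response))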

3. Zero-Overhead Observability

Every request is logged with full context:

  • Inputs / outputs
  • Token usage
  • Latency breakdown
  • Cost per request

All logging is async, so it stays off the request hot path.
Built-in dashboard.
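
The general pattern behind "zero overhead" logging is keeping the slow write off the hot path. A minimal Python illustration (not Bifrost's internals; the file sink and field names are placeholders):

import json
import queue
import threading

log_queue: queue.Queue = queue.Queue()

def log_writer() -> None:
    # Background worker: the only place slow I/O happens.
    while True:
        record = log_queue.get()
        with open("requests.log", "a") as f:
            f.write(json.dumps(record) + "\n")
        log_queue.task_done()

threading.Thread(target=log_writer, daemon=True).start()

def handle_request(prompt: str) -> str:
    response = "..."  # call the upstream model here
    # The handler only enqueues the record; it never waits on the write.
    log_queue.put({"input": prompt, "output": response, "latency_ms": 230})
    return response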

4. Production-Grade Failover

Primary provider down?
Bifrost automatically fails over.

We’ve had OpenAI incidents where traffic switched to Anthropic automatically.
Users didn’t notice.
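
Without gateway-level failover, that resilience has to live in application code. A rough sketch of what we no longer write by hand (model names and the broad exception handling are illustrative only):

from anthropic import Anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = Anthropic()

def ask(prompt: str) -> str:
    try:
        r = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            timeout=10,
        )
        return r.choices[0].message.content
    except Exception:
        # Primary provider failed or timed out: fall back to Anthropic.
        r = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return r.content[0].text

With Bifrost, the fallback happens inside the gateway and the client stays a single OpenAI-style call.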

Migration Was Surprisingly Easy

Expected: Days of refactoring
Actual: ~15 minutes

Step 1: Start Bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost

Step 2: Add API Keys

Visit: http://localhost:8080

Step 3: Add One Line of Code

Before

import openai
openai.api_key = "sk-..."

After

import openai
openai.api_base = "http://localhost:8080/openai"
openai.api_key = "sk-..."
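
If you're on the 1.x OpenAI Python SDK, the equivalent change (assuming the same /openai path) looks like this:

from openai import OpenAI

# Point the 1.x client at the gateway instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/openai", api_key="sk-...")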

Step 4: Deploy

That’s it.

Bifrost is OpenAI-compatible.
If your code works with OpenAI, it works with Bifrost.

Supports LangChain, LlamaIndex, LiteLLM SDK, and more.
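
For example, with the langchain-openai package you point ChatOpenAI at the gateway the same way (parameter names here are from recent langchain-openai releases and may differ in older versions):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",                      # any model your gateway routes
    base_url="http://localhost:8080/openai",  # route through the gateway
    api_key="sk-...",
)
print(llm.invoke("Hello").content)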

The Production Rollout

Week 1: 10% traffic

  • No issues
  • Latency down 60%

Week 2: 50% traffic

  • Still stable
  • Costs already dropping

Week 3: 100% migration

  • Shut down 2 of 3 LiteLLM instances
  • Performance better than ever

Three Months Later

  • Zero downtime incidents
  • Handling 800+ RPS during peaks
  • Monthly cost: $60 vs. $450
  • No manual restarts

When to Use Each
Use LiteLLM if:

  • You’re prototyping
  • Traffic is < 100 RPS
  • You need deep Python ecosystem integration
  • You’re okay with manual scaling and monitoring

Use Bifrost if:

  • You’re running production workloads
  • Traffic > 200 RPS (or will be soon)
  • You care about P99 latency
  • You want predictable costs
  • You’re tired of restarting your gateway

Try Bifrost

Bifrost is open source (MIT licensed). Run it locally in 30 seconds:

git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up

Visit http://localhost:8080, add your API keys, and point your app at Bifrost.

Benchmark It Yourself
cd bifrost/benchmarks
./benchmark -provider bifrost -rate 500 -duration 60

Compare with your current setup.
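
If you'd rather not use the bundled tool, a rough stdlib-only probe like the one below is enough to compare P50/P99 between gateways. The URL path, payload, and auth header are placeholders; adjust them to whatever your gateway expects.

import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/openai/chat/completions"  # placeholder: match your gateway's route
PAYLOAD = json.dumps({
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "ping"}],
}).encode()

def one_request(_: int) -> float:
    req = urllib.request.Request(
        URL,
        data=PAYLOAD,
        headers={"Content-Type": "application/json", "Authorization": "Bearer sk-..."},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000  # milliseconds

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(one_request, range(500)))

print(f"p50: {statistics.median(latencies):.0f} ms")
print(f"p99: {latencies[int(len(latencies) * 0.99) - 1]:.0f} ms")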

The Bottom Line

LiteLLM breaking at ~300 RPS wasn’t a bug.
It was Python hitting its architectural limits.

We needed production-grade infrastructure.
So we built it — in Go — and open sourced it.

If you’re hitting scale issues with your LLM gateway, you’re not alone.
We hit them too.

Bifrost solved them. Might solve yours.

Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started

Docs: https://docs.getbifrost.ai

Repo: https://github.com/maximhq/bifrost

Built by the team at Maxim AI. We also build evaluation and observability tools for production AI systems.
