The Incident
Tuesday, 2:47 PM
Our customer support chatbot is handling 280 RPS. Everything’s fine.
2:53 PM
Traffic hits 310 RPS. Response times spike. Users start complaining in Slack.
2:58 PM
P99 latency reaches 18 seconds. Some requests time out completely.
3:05 PM
We manually restart LiteLLM. Traffic drops during the restart. Users are angry.
This happened three times that week.
What We Thought the Problem Was
- “Maybe we need more replicas”
- “Let’s add a load balancer”
- “Probably need better hardware”
We scaled horizontally and added three more LiteLLM instances.
Result
- Cost increased 4×
- Traffic hit 320 RPS
- The same issues appeared
- All instances struggled simultaneously
What the Problem Actually Was
LiteLLM is built on Python + FastAPI.
At low traffic (< 200 RPS), it works well.
Past 300 RPS, Python’s architecture becomes the bottleneck.
The Python Problem
- GIL (Global Interpreter Lock): Only one thread executes Python code at a time (see the sketch after this list)
- Async overhead: Event loop coordination adds latency
- Memory pressure: Heavy dependencies + long-running processes
- GC pauses: Garbage collection freezes request handling
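You can see the GIL's effect with a tiny, self-contained experiment (illustrative only, not LiteLLM code): CPU-bound work gets essentially no speedup from extra threads on CPython.

```python
# GIL demo (illustrative): CPU-bound work gets almost no speedup from extra
# threads, because CPython runs only one thread's bytecode at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def busy_work(n: int = 2_000_000) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(workers: int, tasks: int = 4) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: busy_work(), range(tasks)))
    return time.perf_counter() - start

print(f"1 thread : {run(1):.2f}s")
print(f"4 threads: {run(4):.2f}s")  # roughly the same wall-clock time
```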
What We Observed at 350 RPS (Single Instance)
- CPU: 85% (one core maxed due to GIL)
- Memory: 3.2 GB → 5.1 GB → 6.8 GB (steadily climbing)
- Latency: 200 ms → 2 s → 12 s → timeout
- GC pauses: 100–300 ms every ~30 seconds
After 2 hours, memory reached 8 GB.
The process was killed by the OOM killer.
This isn’t a LiteLLM-specific issue.
It’s Python hitting its limits at high throughput.
How Bifrost Solves This
We needed production-grade infrastructure, not a prototype that breaks under load.
So we built Bifrost in Go, specifically for high-throughput LLM workloads.
It’s open source and MIT licensed.
Key Architectural Differences
- True Concurrency (No GIL)
Go’s goroutines execute in parallel across all CPU cores.
// Thousands of goroutines, truly parallel
go handleRequest(req1)
go handleRequest(req2)
go handleRequest(req3)
// All executing simultaneously
- Lightweight Concurrency
Go: 10,000 goroutines ≈ 100 MB of memory
Python: 10,000 threads or async tasks → out of memory
- Predictable Memory
Go's garbage collector is designed for low-latency systems:
  - Concurrent GC (doesn't stop the world)
  - Predictable pause times (typically < 1 ms)
  - No circular-reference memory leaks
- Native HTTP/2
  - Built-in HTTP/2 support
  - Request multiplexing
  - No external dependencies
The Real-World Difference
We ran the same production workload through both gateways.
Test: Customer support chatbot, real user traffic
Load: 500 RPS sustained
| Metric | LiteLLM (3 × t3.xlarge) | Bifrost (1 × t3.large) |
| --- | --- | --- |
| P50 latency | 2.1 s | 230 ms |
| P99 latency | 23.4 s | 520 ms |
| Memory | 4–7 GB per instance (climbing) | 1.4 GB (stable) |
| Timeout rate | 8% | 0.1% |
| Cost | ~$450/month | ~$60/month |
| Stability | Restart required every 6–8 hours | 30+ days without restart |
Result
- 45× faster P99 latency
- 7× cheaper
- Actually stable
But Bifrost Isn’t Just About Performance
Rebuilding from scratch let us add production features LiteLLM doesn’t have.
1. Adaptive Load Balancing
Multiple API keys?
Bifrost continuously monitors:
- Latency
- Error rates
Traffic is automatically reweighted in real time:
├─ Key 1: 1.2× weight (healthy)
├─ Key 2: 0.5× weight (high latency)
└─ Key 3: 1.0× weight (normal)
No manual intervention required.
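As a rough mental model (an illustrative sketch, not Bifrost's actual internals), this kind of adaptive routing boils down to weighted random selection where each key's weight shrinks as its recent latency or error rate grows:

```python
# Illustrative adaptive key selection (not Bifrost's code): keys with high
# recent latency or error rates get smaller weights and receive less traffic.
import random
from dataclasses import dataclass, field

@dataclass
class KeyStats:
    name: str
    latencies: list = field(default_factory=list)  # recent latencies, seconds
    errors: int = 0
    requests: int = 0

    def weight(self) -> float:
        recent = self.latencies[-50:]
        avg_latency = sum(recent) / max(len(recent), 1)
        error_rate = self.errors / max(self.requests, 1)
        # Slow or failing keys shrink toward a small floor so they can
        # recover once they're healthy again.
        return max(0.05, (1.0 / (1.0 + avg_latency)) * (1.0 - error_rate))

def pick_key(keys: list) -> KeyStats:
    weights = [k.weight() for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]

keys = [KeyStats("key-1"), KeyStats("key-2"), KeyStats("key-3")]
keys[1].latencies, keys[1].requests = [2.5] * 10, 10  # key-2 is slow
print(pick_key(keys).name)  # key-2 is chosen far less often
```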
2. Semantic Caching
Not exact-match caching — semantic similarity.
“How do I reset my password?”
“What’s the password reset process?”
The second query hits the cache.
- Cache hit rate: 40%
- Cost savings: ~$1,200/month
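Conceptually, a semantic cache stores an embedding alongside each response and serves the cached answer when a new query is similar enough. Here's a minimal sketch of that idea; the bag-of-words `embed()` is only a stand-in so the example runs anywhere, and none of this is Bifrost's actual implementation:

```python
# Illustrative semantic-cache sketch (not Bifrost's implementation).
# A real deployment would use a proper embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.3):
        # Threshold is low because the toy embedding is crude; real
        # embedding models score near-duplicate questions much higher.
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("How do I reset my password?", "Go to Settings -> Security -> Reset password.")
print(cache.get("What's the password reset process?"))  # similar enough: cache hit
```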
3. Zero-Overhead Observability
Every request is logged with full context:
- Inputs / outputs
- Token usage
- Latency breakdown
- Cost per request
All async. Zero performance impact.
Built-in dashboard.
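A common way to keep logging off the hot path (a generic sketch under that assumption, not Bifrost's internals) is a bounded in-memory queue drained by a background worker, so the request path only ever enqueues:

```python
# Generic async request-logging sketch: the request path only enqueues a
# record; a background thread does the actual I/O.
import json
import queue
import threading

log_queue = queue.Queue(maxsize=10_000)

def log_worker() -> None:
    while True:
        record = log_queue.get()
        print(json.dumps(record))  # stand-in for writing to a store/dashboard
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()

def log_request(record: dict) -> None:
    try:
        log_queue.put_nowait(record)  # never block the request path
    except queue.Full:
        pass  # drop the record rather than slow requests down

log_request({"model": "gpt-4o", "tokens": 182, "latency_ms": 240, "cost_usd": 0.0012})
log_queue.join()  # demo only: let the worker flush before the script exits
```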
4. Production-Grade Failover
Primary provider down?
Bifrost automatically fails over.
We’ve had OpenAI incidents where traffic switched to Anthropic automatically.
Users didn’t notice.
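The failover logic itself is simple in principle. Here's a hedged sketch (illustrative, not Bifrost's code), where `call_openai` and `call_anthropic` are hypothetical client functions you would supply:

```python
# Hedged failover sketch: try providers in priority order and fall through
# to the next one on any failure.
def complete_with_failover(prompt: str, providers: list) -> str:
    """providers: list of (name, callable) pairs, tried in order."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # timeouts, 5xx, rate limits, ...
            last_error = exc
            print(f"{name} failed ({exc}); trying next provider")
    raise RuntimeError("all providers failed") from last_error

# Usage with hypothetical client functions:
# answer = complete_with_failover(
#     "Hi", [("openai", call_openai), ("anthropic", call_anthropic)]
# )
```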
Migration Was Surprisingly Easy
Expected: Days of refactoring
Actual: ~15 minutes
Step 1: Start Bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Step 2: Add API Keys
Visit: http://localhost:8080
Step 3: Change One Line in Code
Before
import openai
openai.api_key = "sk-..."
After
import openai
openai.api_base = "http://localhost:8080/openai"
openai.api_key = "sk-..."
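The snippet above uses the pre-1.0 `openai` package. If you're on `openai>=1.0`, the equivalent change (assuming the same gateway URL and a model your keys can access) looks like this:

```python
# Equivalent setup for openai >= 1.0, pointed at the same gateway URL.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/openai",  # Bifrost, as configured above
    api_key="sk-...",
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model your keys can access
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```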
Step 4: Deploy
That’s it.
Bifrost is OpenAI-compatible.
If your code works with OpenAI, it works with Bifrost.
Supports LangChain, LlamaIndex, LiteLLM SDK, and more.
The Production Rollout
Week 1: 10% traffic
No issues
Latency down 60%
Week 2: 50% traffic
Still stable
Costs already dropping
Week 3: 100% migration
Shut down 2 of 3 LiteLLM instances
Performance better than ever
Three Months Later
Zero downtime incidents
Handling 800+ RPS during peaks
Monthly cost: $60 vs $450
No manual restarts
When to Use Each
Use LiteLLM if:
- You’re prototyping
- Traffic is < 100 RPS
- You need deep Python ecosystem integration
- You’re okay with manual scaling and monitoring
Use Bifrost if:
- You’re running production workloads
- Traffic > 200 RPS (or will be soon)
- You care about P99 latency
- You want predictable costs
- You’re tired of restarting your gateway
Try Bifrost
Open source (MIT). Run it locally in 30 seconds:
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
Visit http://localhost:8080, add your API keys, and point your app at Bifrost.
Benchmark It Yourself
cd bifrost/benchmarks
./benchmark -provider bifrost -rate 500 -duration 60
Compare with your current setup.
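If you'd rather script your own comparison, a small async load generator is enough to get P50/P99 numbers. This sketch uses `httpx` and assumes an OpenAI-compatible chat-completions URL; adjust the URL, model, and key for whatever gateway you're testing:

```python
# Roll-your-own latency check (illustrative): fire N concurrent requests and
# report P50/P99. The URL below is an assumed path; adjust for your gateway.
import asyncio
import statistics
import time

import httpx  # pip install httpx

URL = "http://localhost:8080/openai/v1/chat/completions"  # assumption: adjust as needed
HEADERS = {"Authorization": "Bearer sk-..."}
BODY = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "ping"}]}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json=BODY, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main(n: int = 100) -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(*(one_request(client) for _ in range(n))))
    print(f"P50: {statistics.median(latencies) * 1000:.0f} ms")
    print(f"P99: {latencies[int(0.99 * (len(latencies) - 1))] * 1000:.0f} ms")

asyncio.run(main())
```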
The Bottom Line
LiteLLM breaking at ~300 RPS wasn’t a bug.
It was Python hitting its architectural limits.
We needed production-grade infrastructure.
So we built it — in Go — and open sourced it.
If you’re hitting scale issues with your LLM gateway, you’re not alone.
We hit them too.
Bifrost solved them. Might solve yours.
Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started
Docs: https://docs.getbifrost.ai
Repo: https://github.com/maximhq/bifrost
Built by the team at Maxim AI. We also build evaluation and observability tools for production AI systems.