When we built Bifrost, we made performance a priority. But claims like "50x faster" need data. Here's how we benchmarked our gateway and what the results taught us about building high-performance infrastructure.
maximhq/bifrost on GitHub: Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support, and <100 µs overhead at 5k RPS.
Bifrost
The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
```bash
# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
```
Step 2: Configure via Web UI
```bash
# Open the built-in web interface
open http://localhost:8080
```
Step 3: Make your first API call
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
```
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
Why Benchmark at All
Early feedback on Bifrost was positive, but we heard the same question repeatedly: "How does it compare to LiteLLM?"
Fair question. LiteLLM has been around longer, has more users, and supports 100+ providers. But we built Bifrost in Go specifically for performance. We needed to prove it mattered.
The Testing Approach
We wanted honest, reproducible benchmarks. No cherry-picked scenarios. No artificial advantages.
Identical Hardware
All tests ran on AWS t3.medium instances (2 vCPUs, 4GB RAM). Same network. Same region. Same everything. This eliminated hardware as a variable.
We also tested on t3.xlarge (4 vCPUs, 16GB RAM) to see how both gateways scaled with better resources.
Sustained Load
Burst tests are misleading. A gateway might handle 5,000 RPS for 10 seconds but fall apart after a minute. We ran tests for 60+ seconds to ensure sustained performance.
Target: 500 RPS sustained on t3.medium, 5,000 RPS on t3.xlarge.
Mocked Providers
We used mocked OpenAI endpoints with consistent response times. This isolated gateway performance from provider variability. If OpenAI has a slow day, it shouldn't make LiteLLM look worse than Bifrost.
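The sketch below shows the shape of such an endpoint: a handler that sleeps for a fixed interval and then returns a canned chat-completion payload. It's illustrative, not the exact harness from our suite; the 1.5s delay mirrors the provider latency that shows up later in the per-operation breakdown.

```go
// mock_openai.go: a stand-in OpenAI endpoint with a fixed response time,
// so gateway overhead can be measured without provider variance.
package main

import (
	"net/http"
	"time"
)

// Fixed delay, roughly matching the ~1.5s provider call in our latency breakdown.
const simulatedLatency = 1500 * time.Millisecond

func main() {
	http.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(simulatedLatency) // every request costs exactly the same
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"id":"chatcmpl-mock","object":"chat.completion","model":"gpt-4o-mini",` +
			`"choices":[{"index":0,"message":{"role":"assistant","content":"Hello, Bifrost!"},"finish_reason":"stop"}],` +
			`"usage":{"prompt_tokens":5,"completion_tokens":3,"total_tokens":8}}`))
	})
	http.ListenAndServe(":9090", nil)
}
```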
Open Source Tooling
We open-sourced the entire benchmarking suite. Anyone can reproduce our results. The code is on GitHub. Run it yourself.
The Results
At 500 RPS (t3.medium)
Bifrost:
- p99 Latency: 1.68s
- Throughput: 424 req/sec
- Memory: 120MB
- Success Rate: 100%
LiteLLM:
- p99 Latency: 90.72s
- Throughput: 44.84 req/sec
- Memory: 372MB
- Success Rate: Degraded
The gap: 54x faster p99 latency, 9.4x higher throughput, 3x lighter memory footprint.
At 5,000 RPS (t3.xlarge)
Bifrost:
- Mean Overhead: 11µs
- Queue Wait: 1.67µs
- Memory: 3,340MB (21% of 16GB)
- Success Rate: 100%
LiteLLM:
- Could not sustain this load
- Performance degraded significantly
- Mean Overhead: ~500µs (45x higher)
What We Learned
1. Language Choice Matters More Than We Expected
We knew Go would be faster than Python. We didn't expect 50x faster.
The Global Interpreter Lock (GIL) in Python fundamentally limits parallelism. At 500 RPS, you're hitting that limit constantly. Go's goroutines handle concurrency effortlessly.
Memory management also diverged. Python's garbage collector paused noticeably under load. Go's GC rarely interrupted processing.
2. Database Logging Is a Hidden Bottleneck
Early Bifrost versions logged every request to PostgreSQL synchronously. This added 150-200µs per request. Not huge on its own, but at 5,000 RPS that meant 5,000 synchronous writes per second, 750,000-1,000,000µs of database-write time accrued every second.
The database became the bottleneck, not the gateway.
The fix: Async batch logging. Buffer 1,000 requests in memory. Write once every 100ms. Latency dropped from 200µs to 11µs. The database handled the load easily.
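In Go, that pattern looks roughly like the sketch below. This isn't Bifrost's actual logger; `writeBatch` stands in for the real bulk INSERT, and the log struct is trimmed to two fields.

```go
// Async batch logger: the hot path only sends to a channel; a background
// goroutine flushes to the database when the buffer fills or on a timer.
// writeBatch stands in for the real bulk INSERT into PostgreSQL.
package gateway

import "time"

type requestLog struct {
	Model     string
	LatencyUs int64
}

func runBatchLogger(logs <-chan requestLog, writeBatch func([]requestLog)) {
	const maxBatch = 1000                            // flush at 1,000 buffered entries...
	ticker := time.NewTicker(100 * time.Millisecond) // ...or every 100ms, whichever comes first
	defer ticker.Stop()

	buf := make([]requestLog, 0, maxBatch)
	flush := func() {
		if len(buf) == 0 {
			return
		}
		batch := make([]requestLog, len(buf))
		copy(batch, buf) // hand off a copy so buf can be reused immediately
		writeBatch(batch)
		buf = buf[:0]
	}

	for {
		select {
		case entry, ok := <-logs:
			if !ok { // channel closed: flush what's left and exit
				flush()
				return
			}
			buf = append(buf, entry)
			if len(buf) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```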
3. Connection Pooling Needs Aggressive Tuning
Default HTTP client settings assume low traffic. At scale, they fail.
We increased max idle connections to 100 and max connections per host to 20. Connection reuse jumped from 60% to 95%. This eliminated the 50-100ms overhead of creating new connections.
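In terms of Go's standard `net/http` client, that tuning maps onto the `Transport` fields roughly as below. Treat it as a sketch: the exact fields Bifrost sets (and whether the per-host limit applies to idle or total connections) may differ.

```go
// Shared upstream HTTP client. Go's zero-value Transport keeps only two idle
// connections per host, which forces constant reconnects and TLS handshakes
// at high RPS.
package gateway

import (
	"net/http"
	"time"
)

var upstreamClient = &http.Client{
	Timeout: 60 * time.Second, // hard ceiling; slow provider calls shouldn't hang forever
	Transport: &http.Transport{
		MaxIdleConns:        100,              // idle connections pooled across all providers
		MaxIdleConnsPerHost: 20,               // warm connections kept per provider host
		IdleConnTimeout:     90 * time.Second, // recycle idle connections eventually
	},
}
```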
4. Object Pooling Dramatically Reduces GC Pressure
Every request allocated fresh buffers for JSON marshaling, HTTP bodies, and response parsing. At 5,000 RPS, that multiplies into tens of thousands of short-lived allocations per second.
We implemented sync.Pool for all frequently allocated objects:
- JSON marshal buffers
- HTTP response bodies
- Channel objects
- Message structures
Memory allocations dropped from 372MB/sec to 140MB/sec. GC pauses reduced by 70%.
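The pattern itself is plain Go. Here's a minimal sketch for the JSON marshal buffers; the function name is illustrative:

```go
// Reusable buffers for JSON marshaling. Get hands back a previously used
// buffer instead of allocating a fresh one; Put returns it for the next
// request, which keeps allocation rate (and GC pressure) flat under load.
package gateway

import (
	"bytes"
	"encoding/json"
	"sync"
)

var marshalBufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func marshalRequest(v any) ([]byte, error) {
	buf := marshalBufPool.Get().(*bytes.Buffer)
	buf.Reset() // clear leftovers from the previous user
	defer marshalBufPool.Put(buf)

	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	// Copy out before the buffer goes back to the pool, since its backing
	// array will be reused by other goroutines.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}
```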
5. Queue Management Prevents Cascading Failures
When traffic spikes from 2K to 5K RPS, requests need somewhere to go. Without proper buffering, you just reject them immediately (502s).
We sized our job queue based on:
buffer_size = (target_rps * avg_latency_seconds) * 3
For 5K RPS with 2s average latency: 5000 * 2 * 3 = 30,000 slots.
This absorbed burst traffic without dropping requests.
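Expressed as Go constants feeding a buffered channel, the sizing is a few lines (a sketch; Bifrost's real queue carries richer job types than a bare `func()`):

```go
// Job queue sized from the formula above: steady-state backlog
// (target RPS x average latency) with a 3x cushion for bursts.
package gateway

const (
	targetRPS         = 5000
	avgLatencySeconds = 2
	burstMultiplier   = 3
	queueBufferSize   = targetRPS * avgLatencySeconds * burstMultiplier // 30,000
)

// Requests queue here instead of being rejected with a 502; workers drain it.
var jobQueue = make(chan func(), queueBufferSize)
```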
6. Goroutine Leaks Are Subtle but Deadly
Monitoring revealed slow goroutine accumulation. After 6 hours at 2K RPS:
- Expected: ~200 goroutines
- Actual: 4,783 goroutines
Error handling paths weren't properly cleaning up. Context cancellation fixed this. Goroutine count stayed stable even after days of load.
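The fix is the standard context pattern: bind every worker goroutine's lifetime to a cancellable context so it always has an exit path, even when the error branch fires first. A sketch (the upstream call signature is illustrative):

```go
// Every upstream call runs in a goroutine whose lifetime is bound to a
// cancellable context. Success, failure, or timeout: the goroutine always
// has a way out, so counts stay flat instead of creeping up.
package gateway

import (
	"context"
	"time"
)

func forwardWithTimeout(parent context.Context, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel() // runs on every return path, including errors

	errCh := make(chan error, 1) // buffered: the sender never blocks, even if nobody is listening
	go func() { errCh <- call(ctx) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done(): // timeout or upstream cancellation
		return ctx.Err()
	}
}
```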
7. Per-Operation Latency Breakdown Revealed Surprises
We instrumented every operation to measure latency:
t3.xlarge at 5K RPS:
- Queue wait: 1.67µs
- Key selection: 10ns
- Message formatting: 2.11µs
- JSON marshaling: 26.80µs
- HTTP request: 1.50s (provider)
- Response parsing: 2.11ms
Bifrost's total overhead: 11µs (everything except the actual provider call).
Most operations completed in microseconds or nanoseconds. The provider API call dominated latency, which is exactly what you want. The gateway adds negligible overhead.
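The instrumentation behind those numbers is simple: wrap each pipeline stage in a monotonic-clock timer and push the duration into a histogram. A sketch, with `recordStage` standing in for whatever metrics sink you use:

```go
// Wrap each pipeline stage in a timer. recordStage stands in for a
// histogram observation (Prometheus, OpenTelemetry, or similar).
package gateway

import "time"

func timeStage(name string, recordStage func(string, time.Duration), fn func() error) error {
	start := time.Now()
	err := fn()
	recordStage(name, time.Since(start)) // time.Since uses the monotonic clock
	return err
}

// Usage inside the request path:
//   err := timeStage("json_marshal", record, func() error {
//       body, err := marshalRequest(payload)
//       _ = body
//       return err
//   })
```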
Comparing Instance Sizes
We tested both t3.medium and t3.xlarge to understand scaling:
| Metric | t3.medium | t3.xlarge | Change |
|---|---|---|---|
| Bifrost Overhead | 59µs | 11µs | 81% faster |
| Queue Wait | 47.13µs | 1.67µs | 96% faster |
| JSON Marshaling | 63.47µs | 26.80µs | 58% faster |
| Memory Usage | 1,312MB | 3,340MB | +155% |
The t3.xlarge delivered 81% lower overhead while using only 21% of available memory. Plenty of headroom for traffic spikes.
Configuration Flexibility
One insight: there's no single "best" configuration. It depends on your priorities.
For speed (t3.xlarge profile):
- initial_pool_size: 15,000
- buffer_size: 20,000
- Higher memory, lower latency
For cost (t3.medium profile):
- initial_pool_size: 10,000
- buffer_size: 15,000
- Lower memory, acceptable latency
Teams can tune the speed/memory tradeoff for their workload.
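As a rough sketch, the two profiles boil down to a couple of knobs. The field names below are illustrative, not Bifrost's actual configuration schema:

```go
// The speed/memory tradeoff as two presets. Field names are illustrative,
// not Bifrost's actual configuration schema.
package gateway

type PoolConfig struct {
	InitialPoolSize int // objects pre-allocated at startup
	BufferSize      int // job queue capacity
}

var (
	// t3.xlarge profile: pre-allocate aggressively, trade memory for latency.
	speedProfile = PoolConfig{InitialPoolSize: 15000, BufferSize: 20000}
	// t3.medium profile: smaller pools, lower memory, acceptable latency.
	costProfile = PoolConfig{InitialPoolSize: 10000, BufferSize: 15000}
)
```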
What We'd Change
More Provider Testing
We tested with mocked OpenAI endpoints. Real provider latency varies. OpenAI might be fast today and slow tomorrow. Testing against live providers would add variability but reflect reality better.
Longer Endurance Tests
60 seconds proves sustained performance. But what about 6 hours? 24 hours? We've run longer tests internally, but they're not in the public benchmark suite yet.
Multi-Instance Scenarios
Production deployments run multiple instances behind load balancers. Benchmarking horizontal scaling would show real-world behavior better.
Geographic Distribution
All tests ran in us-east-1. Testing from multiple regions would reveal network latency impacts.
Try It Yourself
The benchmarking tool is open source. Run these tests on your own hardware:
```bash
git clone https://github.com/maximhq/bifrost-benchmarking.git
cd bifrost-benchmarking
go build benchmark.go

# Test Bifrost
./benchmark -provider bifrost -port 8080 -rate 1000 -duration 60

# Test LiteLLM
./benchmark -provider litellm -port 8000 -rate 1000 -duration 60
```
Compare results. Tweet at us if you find different numbers. We want accurate benchmarks more than favorable ones.
The Bottom Line
Building fast infrastructure requires measurement. "Feels fast" isn't good enough. Numbers don't lie.
Our benchmarks showed 50x performance improvements over Python-based alternatives. That number surprised even us. But it's reproducible. It's measurable. It's real.
If you're building high-throughput AI applications, gateway performance matters. Don't trust marketing claims (including ours). Run your own tests.
