Pranay Batta

How We Benchmarked Bifrost against LiteLLM (And What We Learned About Performance)

When we built Bifrost, we made performance a priority. But claims like "50x faster" need data. Here's how we benchmarked our gateway and what the results taught us about building high-performance infrastructure.

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start


Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Why Benchmark at All?

Early feedback on Bifrost was positive, but we heard the same question repeatedly: "How does it compare to LiteLLM?"

Fair question. LiteLLM has been around longer, has more users, and supports 100+ providers. But we built Bifrost in Go specifically for performance. We needed to prove it mattered.

The Testing Approach

We wanted honest, reproducible benchmarks. No cherry-picked scenarios. No artificial advantages.

Identical Hardware

All tests ran on AWS t3.medium instances (2 vCPUs, 4GB RAM). Same network. Same region. Same everything. This eliminated hardware as a variable.

We also tested on t3.xlarge (4 vCPUs, 16GB RAM) to see how both gateways scaled with better resources.

Sustained Load

Burst tests are misleading. A gateway might handle 5,000 RPS for 10 seconds but fall apart after a minute. We ran tests for 60+ seconds to ensure sustained performance.

Target: 500 RPS sustained on t3.medium, 5,000 RPS on t3.xlarge.

Mocked Providers

We used mocked OpenAI endpoints with consistent response times. This isolated gateway performance from provider variability. If OpenAI has a slow day, it shouldn't make LiteLLM look worse than Bifrost.
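
The mock only has to look like OpenAI from the gateway's point of view: accept a chat completion request, wait a fixed amount of time, return a canned response. Here is a minimal sketch of such a stub (illustrative only; the actual mock lives in the benchmarking repo, and the 1.5s delay and port are assumptions for this example):

// mock_openai.go: a stand-in /v1/chat/completions endpoint with a fixed delay,
// so gateway overhead can be measured without provider variance.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(1500 * time.Millisecond) // deterministic "provider" latency
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"id":"mock","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"ok"},"finish_reason":"stop"}]}`))
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}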

Open Source Tooling

We open-sourced the entire benchmarking suite. Anyone can reproduce our results. The code is on GitHub. Run it yourself.

The Results

At 500 RPS (t3.medium)

Bifrost:

  • p99 Latency: 1.68s
  • Throughput: 424 req/sec
  • Memory: 120MB
  • Success Rate: 100%

LiteLLM:

  • p99 Latency: 90.72s
  • Throughput: 44.84 req/sec
  • Memory: 372MB
  • Success Rate: Degraded

The gap: roughly 54x lower p99 latency, 9.4x higher throughput, and a 3x smaller memory footprint.

At 5,000 RPS (t3.xlarge)

Bifrost:

  • Mean Overhead: 11µs
  • Queue Wait: 1.67µs
  • Memory: 3,340MB (21% of 16GB)
  • Success Rate: 100%

LiteLLM:

  • Could not sustain this load
  • Performance degraded significantly
  • Mean Overhead: ~500µs (45x higher)

What We Learned

1. Language Choice Matters More Than We Expected

We knew Go would be faster than Python. We didn't expect 50x faster.

The Global Interpreter Lock (GIL) in Python fundamentally limits parallelism. At 500 RPS, you're hitting that limit constantly. Go's goroutines handle concurrency effortlessly.

Memory management also diverged. Python's garbage collector paused noticeably under load. Go's GC rarely interrupted processing.

2. Database Logging Is a Hidden Bottleneck

Early Bifrost versions logged every request to PostgreSQL synchronously. This added 150-200µs per request. Not huge on its own, but at 5,000 RPS that's 750,000-1,000,000µs of synchronous database write time every second, close to a full second of writes for every second of traffic.

The database became the bottleneck, not the gateway.

The fix: Async batch logging. Buffer 1,000 requests in memory. Write once every 100ms. Latency dropped from 200µs to 11µs. The database handled the load easily.
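
The shape of that fix is simple enough to sketch. This is not the exact Bifrost code (the LogEntry type and the flush callback are placeholders for the real request log and the batched PostgreSQL INSERT), but it shows the buffer-then-flush loop:

package logging

import "time"

// LogEntry is a placeholder for whatever a request log record carries.
type LogEntry struct {
	RequestID string
	Latency   time.Duration
}

// runLogWriter buffers entries in memory and flushes them in batches:
// when the buffer reaches 1,000 entries or every 100ms, whichever comes
// first. flush stands in for the batched database write and is expected
// to finish with the slice before returning, since the buffer is reused.
func runLogWriter(entries <-chan LogEntry, flush func([]LogEntry)) {
	batch := make([]LogEntry, 0, 1000)
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case e := <-entries:
			batch = append(batch, e)
			if len(batch) >= 1000 {
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}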

3. Connection Pooling Needs Aggressive Tuning

Default HTTP client settings assume low traffic. At scale, they fail.

We increased max idle connections to 100 and max connections per host to 20. Connection reuse jumped from 60% to 95%. This eliminated the 50-100ms overhead of creating new connections.
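
In Go terms that means tuning http.Transport instead of accepting the default client. A sketch using the numbers above (we read "max connections per host" as the idle-per-host limit here; the exact fields Bifrost tunes may differ):

package client

import (
	"net/http"
	"time"
)

// newPooledClient returns an HTTP client tuned for sustained throughput:
// keep plenty of idle connections alive so requests reuse them instead of
// paying the 50-100ms cost of a fresh connection and TLS handshake.
func newPooledClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        100,              // idle connections kept across all hosts
		MaxIdleConnsPerHost: 20,               // idle connections kept per provider host
		IdleConnTimeout:     90 * time.Second, // how long an idle connection may linger
	}
	return &http.Client{
		Transport: transport,
		Timeout:   60 * time.Second,
	}
}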

4. Object Pooling Dramatically Reduces GC Pressure

Every request allocated buffers for JSON marshaling, HTTP bodies, and response parsing. At 5,000 RPS, that's several allocations per request and tens of thousands per second.

We implemented sync.Pool for all frequently allocated objects:

  • JSON marshal buffers
  • HTTP response bodies
  • Channel objects
  • Message structures

Memory allocations dropped from 372MB/sec to 140MB/sec. GC pauses reduced by 70%.
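
The JSON marshal buffers are the easiest of these to show. A minimal sketch of the sync.Pool pattern (illustrative, not Bifrost's actual pool code):

package pool

import (
	"bytes"
	"sync"
)

// bufferPool reuses bytes.Buffers across requests so marshaling does not
// allocate a fresh buffer for every one of the thousands of requests per second.
var bufferPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// marshalWithPool runs an encode function against a pooled buffer and
// returns a copy of the bytes, so the buffer can safely go back to the pool.
func marshalWithPool(encode func(*bytes.Buffer) error) ([]byte, error) {
	buf := bufferPool.Get().(*bytes.Buffer)
	buf.Reset()               // clear anything left by the previous user
	defer bufferPool.Put(buf) // return the buffer when we're done

	if err := encode(buf); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}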

5. Queue Management Prevents Cascading Failures

When traffic spikes from 2K to 5K RPS, requests need somewhere to go. Without proper buffering, you just reject them immediately (502s).

We sized our job queue based on:

buffer_size = (target_rps * avg_latency_seconds) * 3

For 5K RPS with 2s latency: 5000 * 2 * 3 = 30,000 buffer size.

This absorbed burst traffic without dropping requests.
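
In Go this translates directly into a buffered channel sized by that formula. A small sketch (the Job type is a placeholder):

package queue

// Job is a placeholder for whatever a queued request carries.
type Job struct{ ID string }

// newJobQueue sizes the buffer as (target RPS * average latency) * 3,
// so a burst can sit in the queue instead of being rejected outright.
func newJobQueue(targetRPS int, avgLatencySeconds float64) chan Job {
	size := int(float64(targetRPS)*avgLatencySeconds) * 3 // e.g. 5000 * 2 * 3 = 30,000
	return make(chan Job, size)
}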

6. Goroutine Leaks Are Subtle but Deadly

Monitoring revealed slow goroutine accumulation. After 6 hours at 2K RPS:

  • Expected: ~200 goroutines
  • Actual: 4,783 goroutines

Error handling paths weren't properly cleaning up. Context cancellation fixed this. Goroutine count stayed stable even after days of load.
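
The fix is the standard context pattern: every goroutine that waits on a provider result also selects on ctx.Done(), so error paths and client disconnects can't strand it. A simplified example (not the exact Bifrost code):

package worker

import "context"

// handle launches a goroutine that waits for a provider result but always
// exits when the request's context is cancelled, so error paths and client
// disconnects cannot leak goroutines.
func handle(ctx context.Context, results <-chan string) (string, error) {
	out := make(chan string, 1)
	go func() {
		select {
		case r := <-results:
			out <- r
		case <-ctx.Done(): // cancelled or timed out: exit instead of blocking forever
		}
	}()

	select {
	case r := <-out:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}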

7. Per-Operation Latency Breakdown Revealed Surprises

We instrumented every operation to measure latency:

t3.xlarge at 5K RPS:

  • Queue wait: 1.67µs
  • Key selection: 10ns
  • Message formatting: 2.11µs
  • JSON marshaling: 26.80µs
  • HTTP request: 1.50s (provider)
  • Response parsing: 2.11ms

Bifrost's total overhead: 11µs (everything except the actual provider call).

Most operations completed in microseconds or nanoseconds. The provider API call dominated latency, which is exactly what you want. The gateway adds negligible overhead.
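
The instrumentation itself is unglamorous: wrap each stage in a timer and hand the duration to a metrics sink. A sketch of the idea (the stage name and record callback are placeholders, not Bifrost's internal API):

package timing

import "time"

// timeStage measures one named stage of request handling and reports the
// duration to whatever metrics sink you use (histogram, log line, etc.).
func timeStage(name string, record func(string, time.Duration), fn func() error) error {
	start := time.Now()
	err := fn()
	record(name, time.Since(start))
	return err
}

// Usage inside a request path, e.g.:
//   err := timeStage("json_marshal", recordFn, func() error { return marshalRequest() })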

Comparing Instance Sizes

We tested both t3.medium and t3.xlarge to understand scaling:

Metric              t3.medium   t3.xlarge   Improvement
Bifrost Overhead    59µs        11µs        81% faster
Queue Wait          47.13µs     1.67µs      96% faster
JSON Marshaling     63.47µs     26.80µs     58% faster
Memory Usage        1,312MB     3,340MB     +155% (more memory used)

The t3.xlarge delivered 81% lower overhead while using only 21% of available memory. Plenty of headroom for traffic spikes.

Configuration Flexibility

One insight: there's no single "best" configuration. It depends on your priorities.

For speed (t3.xlarge profile):

  • initial_pool_size: 15,000
  • buffer_size: 20,000
  • Higher memory, lower latency

For cost (t3.medium profile):

  • initial_pool_size: 10,000
  • buffer_size: 15,000
  • Lower memory, acceptable latency

Teams can tune the speed/memory tradeoff for their workload.
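
To make the tradeoff concrete, here are the two profiles side by side as a hypothetical settings struct (the field names mirror the bullets above; this is not Bifrost's actual configuration API):

package config

// PoolSettings is an illustrative pair of knobs for the speed/memory tradeoff.
type PoolSettings struct {
	InitialPoolSize int // objects pre-allocated into pools at startup
	BufferSize      int // job queue capacity
}

// SpeedProfile trades memory for latency (t3.xlarge-class hardware).
var SpeedProfile = PoolSettings{InitialPoolSize: 15000, BufferSize: 20000}

// CostProfile keeps memory lower at the cost of some latency (t3.medium-class).
var CostProfile = PoolSettings{InitialPoolSize: 10000, BufferSize: 15000}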

What We'd Change

More Provider Testing

We tested with mocked OpenAI endpoints. Real provider latency varies. OpenAI might be fast today and slow tomorrow. Testing against live providers would add variability but reflect reality better.

Longer Endurance Tests

60 seconds proves sustained performance. But what about 6 hours? 24 hours? We've run longer tests internally, but they're not in the public benchmark suite yet.

Multi-Instance Scenarios

Production deployments run multiple instances behind load balancers. Benchmarking horizontal scaling would show real-world behavior better.

Geographic Distribution

All tests ran in us-east-1. Testing from multiple regions would reveal network latency impacts.

Try It Yourself

The benchmarking tool is open source. Run these tests on your own hardware:

git clone https://github.com/maximhq/bifrost-benchmarking.git
cd bifrost-benchmarking
go build benchmark.go

# Test Bifrost
./benchmark -provider bifrost -port 8080 -rate 1000 -duration 60

# Test LiteLLM
./benchmark -provider litellm -port 8000 -rate 1000 -duration 60

Compare results. Tweet at us if you find different numbers. We want accurate benchmarks more than favorable ones.

The Bottom Line

Building fast infrastructure requires measurement. "Feels fast" isn't good enough. Numbers don't lie.

Our benchmarks showed 50x performance improvements over Python-based alternatives. That number surprised even us. But it's reproducible. It's measurable. It's real.

If you're building high-throughput AI applications, gateway performance matters. Don't trust marketing claims (including ours). Run your own tests.


Benchmarking Repository | Documentation | Results
