Debby McKinney

We Benchmarked 5 LLM Gateways at 5,000 RPS. Here's What Broke.

LLM gateways handle the traffic between applications and AI providers. At 100 requests per second, most gateways work fine. At 5,000 RPS, the differences become dramatic.

We ran sustained benchmark tests on identical hardware (AWS t3.medium) at 500 requests per second to see which gateways can actually handle production load. The results revealed performance differences of 50x or more.

The Test Setup

  1. Hardware: AWS t3.medium (2 vCPUs, 4GB RAM)
  2. Load: 500 requests per second sustained
  3. Duration: 60+ seconds
  4. Payload: Standard chat completion requests
  5. Providers: Mocked OpenAI endpoints for consistent testing

All gateways were tested on identical infrastructure to ensure fair comparison. We measured p99 latency (worst-case user experience), throughput (requests completed per second), memory usage, and success rate.
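
To make the measurement concrete, here is a minimal load-driver sketch in Go. It is not the open-source bifrost-benchmarking harness, just an illustration of how a paced request loop, p99 latency, and success rate fit together; the endpoint URL, request body, and rate are placeholder values.

// loadtest.go: a minimal, illustrative load driver (not the real benchmark tool).
package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sort"
    "sync"
    "time"
)

func main() {
    const (
        target   = "http://localhost:8080/v1/chat/completions" // assumed gateway endpoint
        rps      = 500                                         // sustained request rate
        duration = 60 * time.Second                            // sustained load window
    )
    body := []byte(`{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}`)

    var (
        mu        sync.Mutex
        latencies []time.Duration
        failures  int
        wg        sync.WaitGroup
    )

    ticker := time.NewTicker(time.Second / rps) // pace requests at the target rate
    defer ticker.Stop()
    deadline := time.Now().Add(duration)

    for time.Now().Before(deadline) {
        <-ticker.C
        wg.Add(1)
        go func() {
            defer wg.Done()
            start := time.Now()
            resp, err := http.Post(target, "application/json", bytes.NewReader(body))
            elapsed := time.Since(start)
            if resp != nil {
                resp.Body.Close()
            }
            mu.Lock()
            defer mu.Unlock()
            if err != nil || resp.StatusCode != http.StatusOK {
                failures++
            }
            latencies = append(latencies, elapsed)
        }()
    }
    wg.Wait()

    // p99 = the latency below which 99% of requests completed
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    fmt.Printf("requests=%d failures=%d p99=%v\n",
        len(latencies), failures, latencies[len(latencies)*99/100])
}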

The Results

Bifrost (Go-based)

Performance:

  • p99 Latency: 1.68s
  • Throughput: 424 req/sec
  • Memory: 120MB
  • Success Rate: 100%
  • Mean Overhead: 11µs at 5K RPS

GitHub: maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Bifrost handled the load without breaking. The Go-based architecture showed consistent performance throughout the test. Memory stayed stable at 120MB. No failed requests.

LiteLLM (Python-based)

Performance:

  • p99 Latency: 90.72s
  • Throughput: 44.84 req/sec
  • Memory: 372MB
  • Success Rate: Degraded significantly
  • Mean Overhead: ~500µs

LiteLLM struggled under sustained load. P99 latency hit 90 seconds (not milliseconds). Throughput dropped to 44 requests per second. Beyond 500 RPS, the gateway became unreliable.

The performance gap: 54x slower p99 latency, 9.4x lower throughput, 3x higher memory usage.

Kong AI Gateway (Lua/Go hybrid)

Performance:

  • Latency: Moderate overhead
  • Throughput: 2,000-3,000 RPS
  • Architecture: Lua + Go core
  • Success Rate: High

Kong handled production load well but added more overhead than pure Go implementations. The mature infrastructure and extensive features come with performance costs. Still suitable for enterprise workloads where governance features justify the overhead.

Portkey (TypeScript)

Performance:

  • Latency: Standard for Node.js
  • Throughput: Good for moderate loads
  • Architecture: TypeScript/Node.js
  • Success Rate: Reliable under designed load

Portkey performed well within its design parameters. Node.js architecture adds more overhead than compiled languages but provides good developer experience and extensive features. Better suited for moderate traffic patterns (under 2K RPS).

Helicone (Rust-based)

Performance:

  • P50 Latency: 8ms
  • Throughput: Scales horizontally
  • Architecture: Rust (compiled)
  • Success Rate: High

Helicone's Rust implementation delivered strong performance. It didn't quite match Go's overhead numbers, but it was significantly better than the interpreted-language gateways. A good choice when observability is a priority alongside performance.

What Actually Broke

Python Hit the Wall

LiteLLM's Python architecture couldn't sustain high request rates. The Global Interpreter Lock (GIL) bottlenecks parallelism. Memory climbed. Garbage collection paused processing. Beyond 500 RPS, the system became unreliable.

Database Logging Killed Latency

Synchronous database logging adds 100-200µs per request. At 5,000 RPS, that's 5,000+ blocking writes per second, which alone adds half a second or more of cumulative latency every second. Solutions: async batch logging (buffer 1,000 requests), time-based flushing (every 100ms), and in-memory logging with periodic persistence, as sketched below.
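
A minimal sketch of that pattern in Go (not Bifrost's or LiteLLM's actual logging code; writeBatch, the 1,000-entry threshold, and the 100ms ticker are illustrative):

// batchlogger.go: a sketch of async batch logging, not a gateway's real logger.
package gatewaylog

import "time"

// LogEntry is a placeholder record for one proxied request.
type LogEntry struct {
    Model     string
    LatencyMs float64
}

type BatchLogger struct {
    entries chan LogEntry
}

// NewBatchLogger starts a background flusher. writeBatch stands in for the
// actual database bulk insert.
func NewBatchLogger(writeBatch func([]LogEntry)) *BatchLogger {
    l := &BatchLogger{entries: make(chan LogEntry, 10000)}
    go func() {
        buf := make([]LogEntry, 0, 1000)
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        flush := func() {
            if len(buf) == 0 {
                return
            }
            writeBatch(buf) // one bulk write instead of one write per request
            buf = make([]LogEntry, 0, 1000)
        }
        for {
            select {
            case e := <-l.entries:
                buf = append(buf, e)
                if len(buf) >= 1000 { // size-based flush
                    flush()
                }
            case <-ticker.C: // time-based flush every 100ms
                flush()
            }
        }
    }()
    return l
}

// Log never blocks the request path; if the buffer is full, the entry is dropped.
func (l *BatchLogger) Log(e LogEntry) {
    select {
    case l.entries <- e:
    default:
    }
}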

Memory Allocations Triggered GC

Python and Node.js allocate new objects per request. At scale, garbage collection pauses become noticeable. Bifrost's object pooling dropped allocations from 372MB/sec to 140MB/sec.
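
For reference, the general technique looks like this in Go with sync.Pool; this is a generic sketch of object pooling, not Bifrost's internal code:

// pooling.go: reuse per-request buffers instead of allocating fresh ones.
package pooling

import (
    "bytes"
    "sync"
)

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handleRequest(payload []byte) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()            // reuse the previous request's buffer
    defer bufPool.Put(buf) // return it to the pool instead of leaving it for GC

    buf.Write(payload)
    // ... serialize / forward the request using buf ...
}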

Connection Pools Weren't Tuned

Default HTTP settings fail at scale. Without tuning, connection reuse drops from 90% to 40%, adding 50-100ms per request. Properly tuned pools maintained 95%+ reuse.
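
In Go, that tuning happens on http.Transport. The values below are illustrative, not any gateway's exact configuration; the important point is that the default of 2 idle connections per host collapses reuse under load:

// upstream.go: an HTTP client with connection pooling sized for high throughput.
package upstream

import (
    "net/http"
    "time"
)

func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        1000,             // total idle connections kept across all hosts
        MaxIdleConnsPerHost: 500,              // default is 2; far too low under sustained load
        MaxConnsPerHost:     0,                // 0 = no cap on concurrent connections per host
        IdleConnTimeout:     90 * time.Second, // keep warm connections around between bursts
    }
    return &http.Client{
        Transport: transport,
        Timeout:   60 * time.Second, // bound total request time
    }
}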

Architecture Matters at Scale

Compiled vs Interpreted Languages

Go and Rust (compiled to native code):

  • No interpreter overhead
  • Predictable performance
  • Efficient memory management
  • Native concurrency support

Python and Node.js (interpreted/JIT):

  • Interpreter overhead on every operation
  • GC pauses impact latency
  • GIL (Python) limits true parallelism
  • Higher memory footprint

At 100 RPS, the difference is negligible. At 5,000 RPS, it's a 50x performance gap.

Concurrency Models

Go's goroutines: Thousands of lightweight threads, minimal overhead, true parallelism

Python's threading: Limited by GIL, async helps but adds complexity

Node.js event loop: Single-threaded, good for I/O-bound but can bottleneck

Rust's async: Efficient but requires careful implementation
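
As a toy illustration of the goroutine model described above, the sketch below fans out 10,000 tasks with a simple channel semaphore bounding concurrency; the numbers and the simulated 2ms upstream call are arbitrary, not gateway code:

// fanout.go: thousands of lightweight goroutines, bounded by a semaphore.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const totalRequests = 10000
    sem := make(chan struct{}, 512) // cap in-flight work at 512 concurrent goroutines
    var wg sync.WaitGroup

    start := time.Now()
    for i := 0; i < totalRequests; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }()         // release the slot
            time.Sleep(2 * time.Millisecond) // stand-in for an upstream call
            _ = id
        }(i)
    }
    wg.Wait()
    fmt.Printf("completed %d tasks in %v\n", totalRequests, time.Since(start))
}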

When Performance Doesn't Matter

Not every application needs 5,000 RPS throughput:

LiteLLM works fine for:

  • Development environments
  • Internal tools with < 100 users
  • Prototyping and experimentation
  • Teams heavily invested in Python

Portkey makes sense for:

  • Moderate traffic (< 2K RPS)
  • Teams prioritizing features over raw speed
  • Applications where 100-200µs overhead is acceptable

Kong is ideal for:

  • Enterprises with existing Kong infrastructure
  • Complex governance requirements
  • Teams needing comprehensive feature sets

When Performance Is Critical

High-traffic production applications need optimized gateways:

Real-time chat: Every millisecond matters for user experience

Voice assistants: Latency compounds across multiple LLM calls

Agent loops: Agents make 10+ LLM calls per task, overhead multiplies

High-volume APIs: Cost per request matters at scale

For these use cases, gateway overhead of 500µs vs 11µs makes a measurable difference in both user experience and infrastructure costs.
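
A back-of-the-envelope example using those overhead figures: at 10 LLM calls per agent task, ~500µs of gateway overhead per call adds about 5ms per task, while ~11µs adds roughly 0.1ms. Across a million tasks a day, that is roughly 83 minutes versus about 2 minutes of cumulative added latency, before counting any retries or failovers.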

The Memory Factor

Memory usage impacts cost at scale:

LiteLLM: 372MB at moderate load

Bifrost: 120MB (3x lighter)

For 10 instances handling traffic:

  • LiteLLM: 3.72GB total
  • Bifrost: 1.2GB total

The difference: Ability to run on smaller instances (cost savings) or handle more traffic on same hardware (better utilization).

Benchmark Methodology

All tests used:

  • Identical AWS t3.medium instances
  • Same network conditions
  • Mocked OpenAI endpoints (consistent response times)
  • 60+ second sustained load
  • Identical request/response payloads

The benchmarking tool is open source. Anyone can reproduce these results.

Recommendations by Use Case

Choose Bifrost if:

  • Traffic > 2K RPS
  • Latency sensitivity (real-time applications)
  • Cost optimization priority
  • Self-hosted deployment

Choose LiteLLM if:

  • Traffic < 500 RPS
  • Python ecosystem preferred
  • Rapid prototyping
  • Moderate performance acceptable

Choose Kong if:

  • Enterprise governance required
  • Existing Kong infrastructure
  • Complex routing needs
  • Budget for commercial license

Choose Portkey if:

  • Features > raw performance
  • Managed service preferred
  • Teams with fewer than 100 users
  • Prompt management critical

Choose Helicone if:

  • Observability is priority
  • Performance + monitoring balance needed
  • Flexible deployment options
  • Cost tracking essential

The Bottom Line

At low request rates, gateway choice doesn't matter much. At production scale, architecture fundamentally determines performance.

Python-based gateways work for prototypes. Go and Rust implementations handle production load. The 50x performance difference isn't marketing hype. It's the measurable result of architectural choices.

Choose based on your actual traffic patterns, not theoretical capacity. But understand the scaling ceiling before you hit it in production.


Benchmark Repository: github.com/maximhq/bifrost-benchmarking

Run these tests yourself. Hardware costs $0.05/hour. Truth costs nothing.
