Debby McKinney

We Benchmarked 5 LLM Gateways at 5,000 RPS. Here's What Broke.

LLM gateways handle the traffic between applications and AI providers. At 100 requests per second, most gateways work fine. At 5,000 RPS, the differences become dramatic.

We ran sustained benchmark tests on identical hardware (AWS t3.medium) at 500 requests per second to see which gateways can actually handle production load. The results revealed performance differences of 50x or more.

The Test Setup

  1. Hardware: AWS t3.medium (2 vCPUs, 4GB RAM)
  2. Load: 500 requests per second sustained
  3. Duration: 60+ seconds
  4. Payload: Standard chat completion requests
  5. Providers: Mocked OpenAI endpoints for consistent testing

All gateways were tested on identical infrastructure to ensure fair comparison. We measured p99 latency (worst-case user experience), throughput (requests completed per second), memory usage, and success rate.
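
To make the measurement concrete, here is a minimal load-driver sketch in Go. It is not the open-source bifrost-benchmarking harness, just an illustration of how a paced request loop, p99 latency, and success rate fit together; the endpoint URL, request body, and rate are placeholder values.

// loadtest.go: a minimal, illustrative load driver (not the real benchmark tool).
package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sort"
    "sync"
    "time"
)

func main() {
    const (
        target   = "http://localhost:8080/v1/chat/completions" // assumed gateway endpoint
        rps      = 500                                         // sustained request rate
        duration = 60 * time.Second                            // sustained load window
    )
    body := []byte(`{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}`)

    var (
        mu        sync.Mutex
        latencies []time.Duration
        failures  int
        wg        sync.WaitGroup
    )

    ticker := time.NewTicker(time.Second / rps) // pace requests at the target rate
    defer ticker.Stop()
    deadline := time.Now().Add(duration)

    for time.Now().Before(deadline) {
        <-ticker.C
        wg.Add(1)
        go func() {
            defer wg.Done()
            start := time.Now()
            resp, err := http.Post(target, "application/json", bytes.NewReader(body))
            elapsed := time.Since(start)
            if resp != nil {
                resp.Body.Close()
            }
            mu.Lock()
            defer mu.Unlock()
            if err != nil || resp.StatusCode != http.StatusOK {
                failures++
            }
            latencies = append(latencies, elapsed)
        }()
    }
    wg.Wait()

    // p99 = the latency below which 99% of requests completed
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    fmt.Printf("requests=%d failures=%d p99=%v\n",
        len(latencies), failures, latencies[len(latencies)*99/100])
}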

The Results

Bifrost (Go-based)

Performance:

  • p99 Latency: 1.68s
  • Throughput: 424 req/sec
  • Memory: 120MB
  • Success Rate: 100%
  • Mean Overhead: 11µs at 5K RPS

GitHub: maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Bifrost handled the load without breaking. The Go-based architecture showed consistent performance throughout the test. Memory stayed stable at 120MB. No failed requests.

LiteLLM (Python-based)

Performance:

  • p99 Latency: 90.72s
  • Throughput: 44.84 req/sec
  • Memory: 372MB
  • Success Rate: Degraded significantly
  • Mean Overhead: ~500µs

LiteLLM struggled under sustained load. P99 latency hit 90 seconds (not milliseconds). Throughput dropped to 44 requests per second. Beyond 500 RPS, the gateway became unreliable.

The performance gap: 54x slower p99 latency, 9.4x lower throughput, 3x higher memory usage.

Kong AI Gateway (Lua/Go hybrid)

Performance:

  • Latency: Moderate overhead
  • Throughput: 2,000-3,000 RPS
  • Architecture: Lua + Go core
  • Success Rate: High

Kong handled production load well but added more overhead than pure Go implementations. The mature infrastructure and extensive features come with performance costs. Still suitable for enterprise workloads where governance features justify the overhead.

Portkey (TypeScript)

Performance:

  • Latency: Standard for Node.js
  • Throughput: Good for moderate loads
  • Architecture: TypeScript/Node.js
  • Success Rate: Reliable under designed load

Portkey performed well within its design parameters. Node.js architecture adds more overhead than compiled languages but provides good developer experience and extensive features. Better suited for moderate traffic patterns (under 2K RPS).

Helicone (Rust-based)

Performance:

  • P50 Latency: 8ms
  • Throughput: Scales horizontally
  • Architecture: Rust (compiled)
  • Success Rate: High

Helicone's Rust implementation delivered strong performance. It didn't quite match Go's overhead numbers, but it was significantly better than the interpreted-language gateways. A good choice when observability is a priority alongside performance.

What Actually Broke

Python Hit the Wall

LiteLLM's Python architecture couldn't sustain high request rates. The Global Interpreter Lock (GIL) bottlenecks parallelism. Memory climbed. Garbage collection paused processing. Beyond 500 RPS, the system became unreliable.

Database Logging Killed Latency

Synchronous database logging adds 100-200µs per request. At 5,000 RPS, that's 5,000+ blocking writes per second, which alone adds half a second or more of cumulative latency every second. Solutions: async batch logging (buffer 1,000 requests), time-based flushing (every 100ms), and in-memory logging with periodic persistence, as sketched below.
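
A minimal sketch of that pattern in Go (not Bifrost's or LiteLLM's actual logging code; writeBatch, the 1,000-entry threshold, and the 100ms ticker are illustrative):

// batchlogger.go: a sketch of async batch logging, not a gateway's real logger.
package gatewaylog

import "time"

// LogEntry is a placeholder record for one proxied request.
type LogEntry struct {
    Model     string
    LatencyMs float64
}

type BatchLogger struct {
    entries chan LogEntry
}

// NewBatchLogger starts a background flusher. writeBatch stands in for the
// actual database bulk insert.
func NewBatchLogger(writeBatch func([]LogEntry)) *BatchLogger {
    l := &BatchLogger{entries: make(chan LogEntry, 10000)}
    go func() {
        buf := make([]LogEntry, 0, 1000)
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        flush := func() {
            if len(buf) == 0 {
                return
            }
            writeBatch(buf) // one bulk write instead of one write per request
            buf = make([]LogEntry, 0, 1000)
        }
        for {
            select {
            case e := <-l.entries:
                buf = append(buf, e)
                if len(buf) >= 1000 { // size-based flush
                    flush()
                }
            case <-ticker.C: // time-based flush every 100ms
                flush()
            }
        }
    }()
    return l
}

// Log never blocks the request path; if the buffer is full, the entry is dropped.
func (l *BatchLogger) Log(e LogEntry) {
    select {
    case l.entries <- e:
    default:
    }
}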

Memory Allocations Triggered GC

Python and Node.js allocate new objects per request. At scale, garbage collection pauses become noticeable. Bifrost's object pooling dropped allocations from 372MB/sec to 140MB/sec.
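
For reference, the general technique looks like this in Go with sync.Pool; this is a generic sketch of object pooling, not Bifrost's internal code:

// pooling.go: reuse per-request buffers instead of allocating fresh ones.
package pooling

import (
    "bytes"
    "sync"
)

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handleRequest(payload []byte) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()            // reuse the previous request's buffer
    defer bufPool.Put(buf) // return it to the pool instead of leaving it for GC

    buf.Write(payload)
    // ... serialize / forward the request using buf ...
}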

Connection Pools Weren't Tuned

Default HTTP settings fail at scale. Without tuning, connection reuse drops from 90% to 40%, adding 50-100ms per request. Properly tuned pools maintained 95%+ reuse.
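
In Go, that tuning happens on http.Transport. The values below are illustrative, not any gateway's exact configuration; the important point is that the default of 2 idle connections per host collapses reuse under load:

// upstream.go: an HTTP client with connection pooling sized for high throughput.
package upstream

import (
    "net/http"
    "time"
)

func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        1000,             // total idle connections kept across all hosts
        MaxIdleConnsPerHost: 500,              // default is 2; far too low under sustained load
        MaxConnsPerHost:     0,                // 0 = no cap on concurrent connections per host
        IdleConnTimeout:     90 * time.Second, // keep warm connections around between bursts
    }
    return &http.Client{
        Transport: transport,
        Timeout:   60 * time.Second, // bound total request time
    }
}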

Architecture Matters at Scale

Compiled vs Interpreted Languages

Go and Rust (compiled to native code):

  • No interpreter overhead
  • Predictable performance
  • Efficient memory management
  • Native concurrency support

Python and Node.js (interpreted/JIT):

  • Interpreter overhead on every operation
  • GC pauses impact latency
  • GIL (Python) limits true parallelism
  • Higher memory footprint

At 100 RPS, the difference is negligible. At 5,000 RPS, it's a 50x performance gap.

Concurrency Models

Go's goroutines: Thousands of lightweight threads, minimal overhead, true parallelism

Python's threading: Limited by GIL, async helps but adds complexity

Node.js event loop: Single-threaded, good for I/O-bound but can bottleneck

Rust's async: Efficient but requires careful implementation
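
As a toy illustration of the goroutine model described above, the sketch below fans out 10,000 tasks with a simple channel semaphore bounding concurrency; the numbers and the simulated 2ms upstream call are arbitrary, not gateway code:

// fanout.go: thousands of lightweight goroutines, bounded by a semaphore.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const totalRequests = 10000
    sem := make(chan struct{}, 512) // cap in-flight work at 512 concurrent goroutines
    var wg sync.WaitGroup

    start := time.Now()
    for i := 0; i < totalRequests; i++ {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }()         // release the slot
            time.Sleep(2 * time.Millisecond) // stand-in for an upstream call
            _ = id
        }(i)
    }
    wg.Wait()
    fmt.Printf("completed %d tasks in %v\n", totalRequests, time.Since(start))
}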

When Performance Doesn't Matter

Not every application needs 5,000 RPS throughput:

LiteLLM works fine for:

  • Development environments
  • Internal tools with < 100 users
  • Prototyping and experimentation
  • Teams heavily invested in Python

Portkey makes sense for:

  • Moderate traffic (< 2K RPS)
  • Teams prioritizing features over raw speed
  • Applications where 100-200µs overhead is acceptable

Kong is ideal for:

  • Enterprises with existing Kong infrastructure
  • Complex governance requirements
  • Teams needing comprehensive feature sets

When Performance Is Critical

High-traffic production applications need optimized gateways:

Real-time chat: Every millisecond matters for user experience

Voice assistants: Latency compounds across multiple LLM calls

Agent loops: Agents make 10+ LLM calls per task, overhead multiplies

High-volume APIs: Cost per request matters at scale

For these use cases, gateway overhead of 500µs vs 11µs makes a measurable difference in both user experience and infrastructure costs.
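
A back-of-the-envelope example using those overhead figures: at 10 LLM calls per agent task, ~500µs of gateway overhead per call adds about 5ms per task, while ~11µs adds roughly 0.1ms. Across a million tasks a day, that is roughly 83 minutes versus about 2 minutes of cumulative added latency, before counting any retries or failovers.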

The Memory Factor

Memory usage impacts cost at scale:

LiteLLM: 372MB at moderate load

Bifrost: 120MB (3x lighter)

For 10 instances handling traffic:

  • LiteLLM: 3.72GB total
  • Bifrost: 1.2GB total

The difference: Ability to run on smaller instances (cost savings) or handle more traffic on same hardware (better utilization).

Benchmark Methodology

All tests used:

  • Identical AWS t3.medium instances
  • Same network conditions
  • Mocked OpenAI endpoints (consistent response times)
  • 60+ second sustained load
  • Identical request/response payloads

The benchmarking tool is open source. Anyone can reproduce these results.

Recommendations by Use Case

Choose Bifrost if:

  • Traffic > 2K RPS
  • Latency sensitivity (real-time applications)
  • Cost optimization priority
  • Self-hosted deployment

Choose LiteLLM if:

  • Traffic < 500 RPS
  • Python ecosystem preferred
  • Rapid prototyping
  • Moderate performance acceptable

Choose Kong if:

  • Enterprise governance required
  • Existing Kong infrastructure
  • Complex routing needs
  • Budget for commercial license

Choose Portkey if:

  • Features > raw performance
  • Managed service preferred
  • Teams with fewer than 100 users
  • Prompt management critical

Choose Helicone if:

  • Observability is priority
  • Performance + monitoring balance needed
  • Flexible deployment options
  • Cost tracking essential

The Bottom Line

At low request rates, gateway choice doesn't matter much. At production scale, architecture fundamentally determines performance.

Python-based gateways work for prototypes. Go and Rust implementations handle production load. The 50x performance difference isn't marketing hype. It's the measurable result of architectural choices.

Choose based on your actual traffic patterns, not theoretical capacity. But understand the scaling ceiling before you hit it in production.


Benchmark Repository: github.com/maximhq/bifrost-benchmarking

Run these tests yourself. Hardware costs $0.05/hour. Truth costs nothing.
