LLM gateways handle the traffic between applications and AI providers. At 100 requests per second, most gateways work fine. At 5,000 RPS, the differences become dramatic.
We ran sustained benchmark tests on identical hardware (AWS t3.medium, 500 RPS) to see which gateways can actually handle production load. The results revealed performance differences of 50x or more.
The Test Setup
- Hardware: AWS t3.medium (2 vCPUs, 4GB RAM)
- Load: 500 requests per second sustained
- Duration: 60+ seconds
- Payload: Standard chat completion requests
- Providers: Mocked OpenAI endpoints for consistent testing
All gateways were tested on identical infrastructure to ensure fair comparison. We measured p99 latency (worst-case user experience), throughput (requests completed per second), memory usage, and success rate.
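The actual harness is open source (linked at the end of this post). As a rough sketch of what this kind of test does (fire a fixed request rate for a sustained window and record p99 latency, throughput, and failures), here is a minimal Go load generator. The endpoint URL and payload are placeholders, not the benchmark's actual configuration.

package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sort"
    "sync"
    "time"
)

func main() {
    const (
        target   = "http://localhost:8080/v1/chat/completions" // placeholder gateway endpoint
        rps      = 500                                          // sustained request rate
        duration = 60 * time.Second                             // sustained load window
    )
    payload := []byte(`{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}`)

    var (
        mu        sync.Mutex
        latencies []time.Duration
        failures  int
        wg        sync.WaitGroup
    )

    ticker := time.NewTicker(time.Second / rps) // pace requests at the target RPS
    defer ticker.Stop()
    deadline := time.Now().Add(duration)

    for time.Now().Before(deadline) {
        <-ticker.C
        wg.Add(1)
        go func() {
            defer wg.Done()
            start := time.Now()
            resp, err := http.Post(target, "application/json", bytes.NewReader(payload))
            elapsed := time.Since(start)
            ok := err == nil && resp.StatusCode == http.StatusOK
            if err == nil {
                resp.Body.Close()
            }
            mu.Lock()
            defer mu.Unlock()
            if ok {
                latencies = append(latencies, elapsed)
            } else {
                failures++
            }
        }()
    }
    wg.Wait()

    // p99 = the latency below which 99% of successful requests completed.
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    if n := len(latencies); n > 0 {
        fmt.Printf("p99=%v  throughput=%.1f req/s  failures=%d\n",
            latencies[n*99/100], float64(n)/duration.Seconds(), failures)
    }
}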
The Results
Bifrost (Go-based)
Performance:
- p99 Latency: 1.68s
- Throughput: 424 req/sec
- Memory: 120MB
- Success Rate: 100%
- Mean Overhead: 11µs at 5K RPS
From the maximhq/bifrost repository ("Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS"):
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
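Because the endpoint is OpenAI-compatible, any HTTP client can talk to it. As an illustration only (assuming the gateway from the quick start above is listening on localhost:8080), the same request as the curl call looks like this in Go:

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Same payload as the curl example: model names use the "provider/model" form.
    payload := []byte(`{
        "model": "openai/gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
    }`)

    resp, err := http.Post(
        "http://localhost:8080/v1/chat/completions", // local gateway from the quick start
        "application/json",
        bytes.NewReader(payload),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}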
Bifrost handled the load without breaking. The Go-based architecture showed consistent performance throughout the test. Memory stayed stable at 120MB. No failed requests.
LiteLLM (Python-based)
Performance:
- p99 Latency: 90.72s
- Throughput: 44.84 req/sec
- Memory: 372MB
- Success Rate: Degraded significantly
- Mean Overhead: ~500µs
LiteLLM struggled under sustained load. P99 latency hit 90 seconds (not milliseconds). Throughput dropped to 44 requests per second. Beyond 500 RPS, the gateway became unreliable.
The performance gap: 54x slower p99 latency, 9.4x lower throughput, 3x higher memory usage.
Kong AI Gateway (Lua/Go hybrid)
Performance:
- Latency: Moderate overhead
- Throughput: 2,000-3,000 RPS
- Architecture: Lua + Go core
- Success Rate: High
Kong handled production load well but added more overhead than pure Go implementations. The mature infrastructure and extensive features come with performance costs. Still suitable for enterprise workloads where governance features justify the overhead.
Portkey (TypeScript)
Performance:
- Latency: Standard for Node.js
- Throughput: Good for moderate loads
- Architecture: TypeScript/Node.js
- Success Rate: Reliable under designed load
Portkey performed well within its design parameters. Node.js architecture adds more overhead than compiled languages but provides good developer experience and extensive features. Better suited for moderate traffic patterns (under 2K RPS).
Helicone (Rust-based)
Performance:
- P50 Latency: 8ms
- Throughput: Scales horizontally
- Architecture: Rust (compiled)
- Success Rate: High
Helicone's Rust implementation delivered strong performance. It doesn't quite match Go's overhead numbers, but it comes in well ahead of the interpreted-language gateways. A good choice when observability is a priority alongside performance.
What Actually Broke
Python Hit the Wall
LiteLLM's Python architecture couldn't sustain high request rates. The Global Interpreter Lock (GIL) bottlenecks parallelism. Memory climbed. Garbage collection paused processing. Beyond 500 RPS, the system became unreliable.
Database Logging Killed Latency
Synchronous database logging adds 100-200µs per request. At 5,000 RPS, that's 5,000 blocking writes per second, or roughly half a second to a full second of logging work every second. Solutions: async batch logging (buffer 1,000 requests), time-based flushing (every 100ms), and in-memory logging with periodic persistence, as sketched below.
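A minimal Go sketch of that async batching pattern, assuming a hypothetical writeBatch helper that persists a slice of log entries; the buffer size (1,000) and flush interval (100ms) mirror the numbers above:

package main

import (
    "fmt"
    "time"
)

// LogEntry is a placeholder for whatever a gateway records per request.
type LogEntry struct {
    Model     string
    LatencyMs float64
}

// writeBatch stands in for the real database write (hypothetical helper).
func writeBatch(batch []LogEntry) {
    fmt.Printf("flushed %d entries\n", len(batch))
}

// startAsyncLogger consumes entries from a channel and flushes them either
// when the buffer reaches batchSize or when the flush interval elapses,
// so request handlers never block on the database.
func startAsyncLogger(entries <-chan LogEntry, batchSize int, flushEvery time.Duration) {
    go func() {
        buf := make([]LogEntry, 0, batchSize)
        ticker := time.NewTicker(flushEvery)
        defer ticker.Stop()

        flush := func() {
            if len(buf) == 0 {
                return
            }
            writeBatch(buf)
            buf = make([]LogEntry, 0, batchSize)
        }

        for {
            select {
            case e, ok := <-entries:
                if !ok {
                    flush() // drain remaining entries on shutdown
                    return
                }
                buf = append(buf, e)
                if len(buf) >= batchSize {
                    flush() // size-based flush at 1,000 buffered requests
                }
            case <-ticker.C:
                flush() // time-based flush, e.g. every 100ms
            }
        }
    }()
}

func main() {
    logCh := make(chan LogEntry, 10_000) // request handlers send here without blocking
    startAsyncLogger(logCh, 1000, 100*time.Millisecond)

    for i := 0; i < 2500; i++ {
        logCh <- LogEntry{Model: "openai/gpt-4o-mini", LatencyMs: 1.2}
    }
    close(logCh)
    time.Sleep(200 * time.Millisecond) // give the logger time to drain in this demo
}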
Memory Allocations Triggered GC
Python and Node.js allocate new objects per request. At scale, garbage collection pauses become noticeable. Bifrost's object pooling dropped allocations from 372MB/sec to 140MB/sec.
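Object pooling in Go is typically done with sync.Pool, which reuses request-scoped buffers instead of allocating fresh ones. A minimal sketch of the idea (not Bifrost's actual implementation):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers so each request does not allocate
// a fresh one, which keeps the garbage collector's workload flat.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handleRequest(payload []byte) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()            // clear any data left over from the previous request
    defer bufPool.Put(buf) // return the buffer to the pool when done

    buf.Write(payload) // stand-in for serialization / proxying work
    _ = buf.Len()
}

func main() {
    payload := []byte(`{"model":"openai/gpt-4o-mini"}`)
    for i := 0; i < 100_000; i++ {
        handleRequest(payload)
    }
    fmt.Println("done; allocation rate stays low because buffers are reused")
}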
Connection Pools Weren't Tuned
Default HTTP settings fail at scale. Without tuning, connection reuse drops from 90% to 40%, adding 50-100ms per request. Properly tuned pools maintained 95%+ reuse.
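In Go, the knobs in question live on http.Transport. A hedged example of the kind of tuning meant here; the specific values are illustrative, not the benchmark's configuration:

package main

import (
    "net/http"
    "time"
)

// newUpstreamClient builds an HTTP client tuned for sustained high RPS
// against a small set of upstream hosts, so connections are reused
// instead of being re-established on every request.
func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        1000,             // total idle connections kept across all hosts
        MaxIdleConnsPerHost: 500,              // default is 2, which forces constant reconnects
        MaxConnsPerHost:     0,                // 0 = unlimited; cap it if upstreams rate-limit
        IdleConnTimeout:     90 * time.Second, // keep warm connections around between bursts
    }
    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second, // hard cap per request
    }
}

func main() {
    client := newUpstreamClient()
    _ = client // request handlers use this client to call provider APIs
}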
Architecture Matters at Scale
Compiled vs Interpreted Languages
Go and Rust (compiled to native code):
- No interpreter overhead
- Predictable performance
- Efficient memory management
- Native concurrency support
Python and Node.js (interpreted/JIT):
- Interpreter overhead on every operation
- GC pauses impact latency
- GIL (Python) limits true parallelism
- Higher memory footprint
At 100 RPS, the difference is negligible. At 5,000 RPS, it's a 50x performance gap.
Concurrency Models
Go's goroutines: Thousands of lightweight threads, minimal overhead, true parallelism (see the sketch after this list)
Python's threading: Limited by GIL, async helps but adds complexity
Node.js event loop: Single-threaded, good for I/O-bound but can bottleneck
Rust's async: Efficient but requires careful implementation
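As an illustration of why goroutines suit this workload, here is a minimal fan-out sketch that handles many proxied calls concurrently through a bounded worker pool. The worker count and forwardToProvider helper are illustrative, not taken from any gateway's code:

package main

import (
    "fmt"
    "sync"
    "time"
)

// forwardToProvider stands in for one proxied LLM call (hypothetical helper).
func forwardToProvider(id int) string {
    time.Sleep(10 * time.Millisecond) // simulate upstream latency
    return fmt.Sprintf("response %d", id)
}

func main() {
    const (
        totalRequests = 5000
        workers       = 200 // goroutines are cheap: 200 workers cost a few MB, not 200 OS threads
    )

    jobs := make(chan int)
    results := make(chan string, totalRequests)
    var wg sync.WaitGroup

    // Bounded worker pool: each worker is a goroutine pulling jobs off a channel.
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for id := range jobs {
                results <- forwardToProvider(id)
            }
        }()
    }

    start := time.Now()
    for i := 0; i < totalRequests; i++ {
        jobs <- i
    }
    close(jobs)
    wg.Wait()
    close(results)

    fmt.Printf("handled %d requests in %v with %d workers\n",
        totalRequests, time.Since(start), workers)
}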
When Performance Doesn't Matter
Not every application needs 5,000 RPS throughput:
LiteLLM works fine for:
- Development environments
- Internal tools with < 100 users
- Prototyping and experimentation
- Teams heavily invested in Python
Portkey makes sense for:
- Moderate traffic (< 2K RPS)
- Teams prioritizing features over raw speed
- Applications where 100-200µs overhead is acceptable
Kong is ideal for:
- Enterprises with existing Kong infrastructure
- Complex governance requirements
- Teams needing comprehensive feature sets
When Performance Is Critical
High-traffic production applications need optimized gateways:
Real-time chat: Every millisecond matters for user experience
Voice assistants: Latency compounds across multiple LLM calls
Agent loops: Agents make 10+ LLM calls per task, overhead multiplies
High-volume APIs: Cost per request matters at scale
For these use cases, gateway overhead of 500µs vs 11µs makes a measurable difference in both user experience and infrastructure costs.
The Memory Factor
Memory usage impacts cost at scale:
LiteLLM: 372MB at moderate load
Bifrost: 120MB (3x lighter)
For 10 instances handling traffic:
- LiteLLM: 3.72GB total
- Bifrost: 1.2GB total
The difference: the ability to run on smaller instances (cost savings) or handle more traffic on the same hardware (better utilization).
Benchmark Methodology
All tests used:
- Identical AWS t3.medium instances
- Same network conditions
- Mocked OpenAI endpoints (consistent response times)
- 60+ second sustained load
- Identical request/response payloads
The benchmarking tool is open source. Anyone can reproduce these results.
Recommendations by Use Case
Choose Bifrost if:
- Traffic > 2K RPS
- Latency sensitivity (real-time applications)
- Cost optimization priority
- Self-hosted deployment
Choose LiteLLM if:
- Traffic < 500 RPS
- Python ecosystem preferred
- Rapid prototyping
- Moderate performance acceptable
Choose Kong if:
- Enterprise governance required
- Existing Kong infrastructure
- Complex routing needs
- Budget for commercial license
Choose Portkey if:
- Features > raw performance
- Managed service preferred
- Teams under 100 users
- Prompt management critical
Choose Helicone if:
- Observability is priority
- Performance + monitoring balance needed
- Flexible deployment options
- Cost tracking essential
The Bottom Line
At low request rates, gateway choice doesn't matter much. At production scale, architecture fundamentally determines performance.
Python-based gateways work for prototypes. Go and Rust implementations handle production load. The 50x performance difference isn't marketing hype. It's the measurable result of architectural choices.
Choose based on your actual traffic patterns, not theoretical capacity. But understand the scaling ceiling before you hit it in production.
Benchmark Repository: github.com/maximhq/bifrost-benchmarking
Run these tests yourself. Hardware costs $0.05/hour. Truth costs nothing.
