I spent the last few weeks researching LLM gateway solutions for production teams. Here's what I found after testing five different options, talking to engineering teams running them at scale, and breaking things in my staging environment.
I didn't test every edge case: we focused on REST APIs with streaming responses, covered batch processing only lightly, and our traffic patterns might be different from yours. But here's what I learned.
Why Production Teams Need LLM Gateways
Here's what happened when we didn't use one:
Our application relied only on OpenAI. When they had an outage last month, our entire product went down, and customers were left waiting on support with nothing we could do but wait it out.
Then there's cost. We were using GPT-4 for simple tasks that Claude Haiku could handle for one-tenth the price. One weekend of refactoring our routing logic saved us $3,000 per month.
But managing multiple providers yourself creates its own problems. You end up writing custom code for each API, normalizing their different error formats, managing API keys, building retry logic from scratch, and spending hours debugging why Anthropic's rate limit response looks different from OpenAI's.
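To make that concrete, here's a rough sketch of the kind of hand-rolled fallback code this turns into. It's an illustration, not our actual code, and the model names and error handling are placeholders:

import openai
import anthropic

openai_client = openai.OpenAI()           # reads OPENAI_API_KEY
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def ask(prompt: str) -> str:
    # Try OpenAI first, fall back to Anthropic: two SDKs, two error
    # hierarchies, and two response shapes to normalize by hand.
    try:
        resp = openai_client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    except openai.OpenAIError:
        resp = anthropic_client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

Multiply that by every provider you support, then add retries, timeouts, and key management, and it stops being a weekend project.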
LLM gateways solve this. One API for all providers. Automatic fallbacks. Cost tracking that works. And your application won't crash because one provider is having issues.
Here are the five gateways that impressed me.
1. Bifrost (by Maxim AI)
What it is: A high-performance LLM gateway built in Go. It's designed for speed and reliability.
Best for: Customer-facing applications where latency matters. Real-time chat, high-traffic APIs, anything where users will notice if responses are slow.
The performance numbers caught my attention first. In our synthetic load tests, Bifrost added about 11 microseconds of latency at 5,000 requests per second. When I ran the same test with LiteLLM (which is Python-based), it added around 50 microseconds.
What really sold me was the P99 latency test. At 1,000 concurrent users, LiteLLM's slowest responses hit 28 seconds. Bifrost stayed under 50 milliseconds. If you're building a chatbot, that's the difference between users staying on your application and immediately leaving.
Now, I didn't test this with burst traffic or serverless deployments - our setup is traditional Kubernetes. Your results might differ depending on your infrastructure.
What makes it different:
Smart load balancing that actually works. Bifrost was the first gateway I found that automatically routes requests based on real-time performance. It monitors which providers are healthy, routes around failures, and prevents you from hitting rate limits. Most gateways claim to do this, but Bifrost's implementation is noticeably better.
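I don't have visibility into Bifrost's internals, so here's just a toy sketch of the general idea behind latency-aware provider selection (illustrative only, not Bifrost's implementation):

import random
from collections import deque

class ProviderStats:
    def __init__(self, name: str):
        self.name = name
        self.latencies = deque(maxlen=100)  # rolling window of recent request latencies
        self.healthy = True

    def record(self, seconds: float, ok: bool):
        self.latencies.append(seconds)
        self.healthy = ok

    def score(self) -> float:
        # Lower is better; unhealthy providers are pushed to the back of the line.
        if not self.healthy:
            return float("inf")
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0

def pick_provider(providers: list[ProviderStats]) -> ProviderStats:
    # Route to the provider with the best recent latency, breaking ties randomly.
    best = min(p.score() for p in providers)
    return random.choice([p for p in providers if p.score() == best])

A real gateway layers rate-limit awareness and circuit breaking on top of this kind of scoring, but the basic loop is the same: measure, score, route.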
It also has cluster mode built in, so you can run multiple instances without complicated setup. And here's what surprised me - it includes SSO, audit logs, team budgets, and role-based access control without adding latency. Most gateways make you choose between features and speed. Bifrost somehow does both.
Setup:
npx -y @maximhq/bifrost
In 30 seconds you have a gateway running with a web UI. Since it uses OpenAI's API format, integrating it is just changing your base URL. I had our staging environment switched over in under 10 minutes.
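For example, with the official OpenAI Python SDK the switch is just the base_url argument. The URL below is a placeholder - use whatever host and port your Bifrost instance exposes:

from openai import OpenAI

# Point the standard OpenAI client at the gateway instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-key")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
)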
Bifrost covers all the major providers - OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Azure OpenAI, Cohere, Mistral, Groq, Together AI, and Replicate. Plus they added support for any OpenAI-compatible endpoint, which means you can actually use custom or self-hosted models too.
For most production use cases, you're using one of these major providers anyway. LiteLLM does have broader coverage and a more mature open-source community - they've been around longer, with more contributors and community support. If that ecosystem and maximum provider choice matter more to you than raw performance, LiteLLM is a solid pick. But for our needs, Bifrost's speed and provider coverage were enough.
Why we chose it: For our use case (high-scale, customer-facing chat), the 11 microsecond overhead was too good to pass up. The enterprise features were a bonus we didn't expect at this performance level.
Pricing: Open-source and free to self-host. Enterprise support is available.
2. LiteLLM
This is probably the most popular open-source LLM gateway. Python-based, with both an SDK and proxy server.
If you're in a Python environment or need access to niche models, this is the default choice. The provider coverage is unmatched - over 100 providers including all the major ones (OpenAI, Anthropic, Google, Azure, AWS) plus specialized options like HuggingFace, Ollama, Replicate, Anyscale, and Perplexity.
For Python developers, setup is straightforward:
from litellm import completion

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    api_key="your-key"
)

# Switch to Claude without changing code
response = completion(
    model="claude-4-sonnet",
    messages=[{"role": "user", "content": "Hello"}]
)
Configuration uses YAML. The documentation is thorough, and there's a strong community.
Where it breaks down: Performance at scale. LiteLLM is written in Python using FastAPI. At low to moderate traffic, it performs well. But in our load tests, the limitations showed clearly.
At 500 requests per second, P99 latency hit 28 seconds. At 1,000 requests per second, it crashed - ran out of memory and started failing requests. The Python GIL and async overhead become real bottlenecks when handling thousands of concurrent requests.
I saw this in our staging environment. At 200 requests per second, everything ran smoothly. When I simulated higher traffic (around 2,000 requests per second), LiteLLM started timing out. Memory usage increased to over 8GB, and we got cascading failures.
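If you want to check this against your own traffic shape, a few dozen lines of asyncio are enough to generate comparable load. This is a rough sketch, not the tooling we actually used, and the endpoint and payload are placeholders:

import asyncio
import time
import httpx

GATEWAY_URL = "http://localhost:4000/v1/chat/completions"  # placeholder endpoint
PAYLOAD = {"model": "gpt-4", "messages": [{"role": "user", "content": "ping"}]}

async def one_request(client: httpx.AsyncClient, latencies: list, errors: list):
    start = time.perf_counter()
    try:
        resp = await client.post(GATEWAY_URL, json=PAYLOAD, timeout=30)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    except Exception as exc:
        errors.append(type(exc).__name__)

async def run(total_requests: int = 2000, concurrency: int = 200):
    latencies, errors = [], []
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(limits=httpx.Limits(max_connections=concurrency)) as client:
        async def bounded():
            async with sem:
                await one_request(client, latencies, errors)
        await asyncio.gather(*(bounded() for _ in range(total_requests)))
    latencies.sort()
    if latencies:
        p99 = latencies[max(int(len(latencies) * 0.99) - 1, 0)]
        print(f"ok={len(latencies)} errors={len(errors)} p99={p99:.3f}s")

asyncio.run(run())

Watch memory and error rates as you raise the concurrency; that's where the differences between gateways showed up for us.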
When to use it:
- Development and testing environments
- Prototyping and trying different models
- Internal tools with moderate traffic (under 500 RPS)
- When you need access to 100+ providers
- Python-first teams where ecosystem fit matters
When to avoid it:
- Customer-facing applications at scale
- Real-time features where every millisecond counts
- Production workloads requiring 99.9%+ uptime
The ecosystem is mature with active development, but if you're planning to handle thousands of requests per second in production, you'll likely hit performance issues.
Pricing: Fully open-source and free. You pay for hosting it yourself.
3. Portkey
Portkey is more than just a gateway - it's a full AI control plane with routing, observability, guardrails, and governance.
The observability depth is what sets it apart. Every request gets full traces showing you which user made the call, which models were tried, why they failed, which fallback was used, how long each step took, and the exact cost. This isn't just logging - it's distributed tracing for AI.
When our staging environment started using too many tokens, Portkey's traces showed us exactly which user and which prompt was causing it. That level of detail is valuable when debugging production issues.
from portkey_ai import Portkey

portkey = Portkey(
    api_key="your-portkey-key",
    virtual_key="your-provider-virtual-key"
)

response = portkey.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model="gpt-4"
)
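Routing policies attach to the client as configs. The snippet below sketches a fallback config; the schema and field names are my best recollection of Portkey's gateway-config format, so treat it as an approximation and check their docs before using it:

from portkey_ai import Portkey

# Fallback strategy: try the first target, move to the next on failure.
# NOTE: schema and virtual key IDs below are placeholders / assumptions, not verified.
fallback_config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "openai-virtual-key"},
        {"virtual_key": "anthropic-virtual-key"},
    ],
}

portkey = Portkey(api_key="your-portkey-key", config=fallback_config)

response = portkey.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model="gpt-4",
)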
Enterprise features:
- PII detection, content filtering, prompt injection detection
- SOC 2, HIPAA, GDPR compliance with full audit trails
- SSO/SAML, team permissions, role-based access
- Data residency controls
According to their team, they handle over 10 billion requests monthly with 99.9999% uptime. I couldn't independently verify this, but the platform felt stable during our testing.
The tradeoff: I measured latency overhead of 20-40 milliseconds when using advanced features like guardrails and detailed tracing. For a small team that just needs basic routing, Portkey is probably more than necessary. The learning curve is also steeper than simpler gateways.
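That overhead is easy to measure for your own workload: send the same request directly to the provider and through the gateway, and compare wall-clock time. A minimal sketch, assuming an OpenAI-compatible gateway endpoint (the URL and keys are placeholders):

import time
from openai import OpenAI

direct = OpenAI()  # straight to the provider, reads OPENAI_API_KEY
gated = OpenAI(base_url="https://your-gateway.example/v1", api_key="your-gateway-key")  # placeholder

def time_call(client: OpenAI) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
    return time.perf_counter() - start

# Average a handful of calls each way; the gap approximates the gateway's
# added latency plus any extra network hops. Rough, not a benchmark.
direct_avg = sum(time_call(direct) for _ in range(5)) / 5
gated_avg = sum(time_call(gated) for _ in range(5)) / 5
print(f"direct={direct_avg:.3f}s gateway={gated_avg:.3f}s overhead~={gated_avg - direct_avg:.3f}s")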
Why we didn't choose it: For our use case, the added latency and complexity weren't worth the governance features we didn't need yet. But I talked to a healthcare company using Portkey specifically for PII detection. Every LLM request gets scanned for protected health information, logged with full audit trails, and only routed to HIPAA-compliant providers. For them, the compliance features justified the cost.
If you're in a regulated industry or managing AI across multiple teams with governance requirements, Portkey's observability is among the best available.
Pricing: Free tier for development | Starts at $49/month | Enterprise custom pricing
4. Kong AI Gateway
Kong's API Gateway with AI-specific features added. If you're already using Kong, this is worth looking at.
Kong brings decades of API gateway experience to LLM routing - authentication, rate limiting, security, and observability at large scale. All the infrastructure pieces that matter when running production workloads.
# Install AI Proxy plugin
curl -X POST http://localhost:8001/services/ai-service/plugins \
  --data "name=ai-proxy" \
  --data "config.route_type=llm/v1/chat" \
  --data "config.auth.header_name=Authorization" \
  --data "config.model.provider=openai" \
  --data "config.model.name=gpt-4"
AI-specific capabilities:
- Unified API across OpenAI, Anthropic, AWS Bedrock, Azure AI, Google Vertex
- RAG pipelines built in
- PII removal across 12 languages
- Content filtering and safety controls
Where this makes sense: You're already using Kong for API management. That's the primary reason to choose this. The integration with existing Kong infrastructure is seamless, and you get unified observability across all your APIs.
Where it doesn't: If you're not already on Kong, the learning curve is significant. It's built for large enterprises, not small teams needing quick deployment. We evaluated this briefly but decided it was more complexity than we needed.
Pricing: Available through Kong Konnect (managed) or self-hosted | Enterprise custom pricing
5. Helicone AI Gateway
Helicone started as an observability platform and recently launched a Rust-based gateway. It's lightweight and fast.
According to their team, it holds around 8ms P50 latency with sub-5ms overhead even under load; I didn't benchmark this one myself. The gateway ships as a single 15MB binary that runs anywhere.
# Run with npx
npx @helicone/ai-gateway
# Or with Docker
docker run -p 8787:8787 helicone/ai-gateway
The observability is their core strength - request-level tracing, user tracking, cost forecasting, performance analytics, and real-time alerts. It's as comprehensive as Portkey's but with less complexity.
Flexible deployment:
- Cloud-hosted (managed service)
- Self-hosted (full control)
- Hybrid (self-host gateway, use cloud observability)
The consideration: The gateway is newer (launched mid-2024). Core routing is solid, but some advanced enterprise features are still developing. For most teams this isn't a problem, but large enterprises might want to validate specific requirements first.
Pricing:
- Gateway: Open-source and free to self-host
- Observability: Starts free, then $20/month for 100,000 requests
The separation is smart - you can self-host for free and only pay for observability if you want it.
How to Choose
After evaluating these gateways, here's what I learned:
Choose Bifrost if: Performance is critical. You're handling 5,000+ requests per second, serving customer-facing features, or building real-time applications where latency matters. The 11 microsecond overhead is hard to beat.
Choose LiteLLM if: You're in a Python environment with moderate traffic (under 500 RPS). The provider coverage is unmatched - over 100 models including specialized ones. Great for development, prototyping, and internal tools.
Choose Portkey if: You're in a regulated industry needing compliance controls (HIPAA, SOC 2) or managing AI across multiple teams. The observability and governance features are excellent, but you'll pay for it in latency (20-40ms overhead).
Choose Kong if: You're already using Kong for API management. Otherwise, the learning curve probably isn't worth it unless you're a large enterprise needing infrastructure-level control.
Choose Helicone if: You want performance and observability without enterprise complexity. Good for teams with data residency requirements who want self-hosted infrastructure with cloud monitoring.
Questions?
Have you deployed LLM gateways in production? What did you choose and why? What surprised you?
Still evaluating options? I can help with specific questions about performance, integration, or cost modeling at your scale. Leave a comment below.