When we started building AI applications at Maxim, we hit the same problem every team hits: managing multiple LLM providers is messy. Different APIs, different authentication patterns, different error formats. We needed a gateway.
We tried the existing options. LiteLLM gave us multi-provider support but couldn't handle production load. Kong had enterprise features but required complex setup. Portkey was feature-rich but came with vendor lock-in. None of them were built for the performance we needed.
So we built Bifrost.
GitHub: maximhq/bifrost
Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.
Bifrost
The fastest way to build AI applications that never go down
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…
The Performance Problem
Here's what we discovered running LLM gateways in production: latency matters more than most people realize.
Your application makes a request. The gateway routes it to OpenAI. OpenAI processes it (1-3 seconds). The gateway returns the response. That routing overhead? With Python-based gateways like LiteLLM, it adds 500+ microseconds per request.
Sounds small. But at 5,000 requests per second, that overhead becomes noticeable. More importantly, under load, Python gateways start breaking. Memory leaks. Connection pool exhaustion. GIL contention. Database slowdowns from logging.
We benchmarked LiteLLM at 500 RPS on identical hardware. At that rate, p99 latency hit 90.72 seconds. Not milliseconds. Seconds. Beyond 500 RPS, it broke completely.
Production AI applications need better.
Why Go
We chose Go for Bifrost for three reasons:
Compiled Performance
Go compiles to native machine code. No interpreter. No JIT warmup. Consistent performance from the first request.
Built-in Concurrency
Goroutines handle thousands of concurrent connections with minimal overhead. No GIL. No thread pool tuning. Just works.
Memory Efficiency
Go's garbage collector is optimized for low-latency applications. Memory footprint stays predictable under load.
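To make the concurrency point concrete, here is a toy sketch (illustrative Go, not Bifrost's code): the standard library already runs each HTTP request handler in its own goroutine, so thousands of slow upstream calls can be in flight at once with no thread-pool tuning. The two-second sleep stands in for a provider call.

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var inFlight int64

func main() {
	// net/http runs each handler in its own goroutine, so thousands of
	// slow upstream LLM calls can be in flight without any pool tuning.
	http.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt64(&inFlight, 1)
		defer atomic.AddInt64(&inFlight, -1)

		time.Sleep(2 * time.Second) // stand-in for a 1-3 second provider call
		fmt.Fprintf(w, `{"in_flight_during_this_request": %d}`, n)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}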
The results speak for themselves. Bifrost adds 11 microseconds of overhead at 5,000 RPS. That's 45x lower than LiteLLM's 500µs. Under sustained load, Bifrost maintains this performance while Python alternatives degrade.
Benchmark Results
We ran sustained benchmarks on identical t3.medium instances:
Bifrost vs LiteLLM at 500 RPS:
- p99 Latency: 1.68s (Bifrost) vs 90.72s (LiteLLM) → 54x lower
- Throughput: 424 req/sec (Bifrost) vs 44.84 req/sec (LiteLLM) → 9.4x higher
- Memory: 120MB (Bifrost) vs 372MB (LiteLLM) → 3x lighter
- Mean Overhead: 11µs (Bifrost) vs 500µs (LiteLLM) → 45x lower
At 5,000 RPS, Bifrost maintains 11µs overhead with 100% success rate. LiteLLM can't sustain this load.
But Performance Isn't Everything
Speed matters, but production AI needs more. We built Bifrost with features that matter for real deployments:
Zero-Config Deployment
npx -y @maximhq/bifrost
One command. No configuration files. No database setup. Running in 30 seconds.
Add providers through environment variables or the web UI. Start routing immediately.
Hierarchical Governance
Enterprise teams need budget controls at multiple levels. Bifrost supports:
- Customer-level spending caps (organization-wide)
- Team-level budgets (per department)
- Virtual key budgets (per application)
- Provider-level caps (per AI vendor)
All budgets are independent and checked in real time. When any budget is exceeded, requests are rejected immediately with HTTP 402. No surprise bills.
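Conceptually, the check looks like the sketch below (illustrative Go with made-up types, not Bifrost's internals): each level is its own budget, every level is checked, and the first exhausted one rejects the request, which the gateway surfaces as HTTP 402.

package main

import (
	"fmt"
	"sync"
)

// Budget tracks spend against a cap at one level of the hierarchy.
type Budget struct {
	mu       sync.Mutex
	Name     string
	CapUSD   float64
	SpentUSD float64
}

// Charge records cost against the budget, failing if the cap would be exceeded.
func (b *Budget) Charge(cost float64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.SpentUSD+cost > b.CapUSD {
		return fmt.Errorf("%s budget exceeded", b.Name) // surfaced to the caller as HTTP 402
	}
	b.SpentUSD += cost
	return nil
}

// authorize checks every level independently; any failure blocks the request.
// (A real implementation would also roll back charges made before the failure.)
func authorize(cost float64, levels ...*Budget) error {
	for _, b := range levels {
		if err := b.Charge(cost); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	customer := &Budget{Name: "customer", CapUSD: 1000}
	team := &Budget{Name: "team", CapUSD: 100}
	vkey := &Budget{Name: "virtual-key", CapUSD: 5}
	provider := &Budget{Name: "openai", CapUSD: 500}

	if err := authorize(6.00, customer, team, vkey, provider); err != nil {
		fmt.Println("rejected:", err) // "virtual-key budget exceeded"
	}
}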
Model Context Protocol (MCP)
AI agents need tools. Bifrost supports MCP with:
- STDIO, HTTP, and SSE connections
- Agent mode (autonomous execution)
- Code mode (TypeScript orchestration)
- Tool filtering per virtual key
- Governance controls on tool usage
Connect agents to filesystems, databases, and APIs through standardized MCP servers.
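As one illustration of what tool filtering per virtual key means, here is a sketch with hypothetical types (not Bifrost's MCP implementation): the tools advertised by connected MCP servers are filtered against the allowlist attached to the caller's virtual key before the model ever sees them.

package main

import "fmt"

// Tool is a tool advertised by a connected MCP server (hypothetical shape).
type Tool struct {
	Server string
	Name   string
}

// filterTools returns only the tools a given virtual key is allowed to call.
func filterTools(tools []Tool, allowed map[string]bool) []Tool {
	var out []Tool
	for _, t := range tools {
		if allowed[t.Server+"/"+t.Name] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	tools := []Tool{
		{Server: "filesystem", Name: "read_file"},
		{Server: "filesystem", Name: "write_file"},
		{Server: "postgres", Name: "query"},
	}
	// Allowlist attached to one virtual key: read-only access.
	allowed := map[string]bool{
		"filesystem/read_file": true,
		"postgres/query":       true,
	}
	fmt.Println(filterTools(tools, allowed)) // write_file is dropped
}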
Semantic Caching
Cache responses based on semantic similarity, not exact string matching. "What are your hours?" and "When do you open?" hit the same cache.
Reduces costs by 40-60% for applications with predictable query patterns.
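In sketch form, the mechanism looks like this (illustrative Go; in practice the vectors come from an embedding model rather than being hard-coded): embed the incoming prompt, compare it to the embeddings of cached prompts, and return the cached response when cosine similarity clears a threshold.

package main

import (
	"fmt"
	"math"
)

type cacheEntry struct {
	embedding []float64
	response  string
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup returns a cached response whose prompt embedding is close enough.
func lookup(cache []cacheEntry, queryEmb []float64, threshold float64) (string, bool) {
	for _, e := range cache {
		if cosine(e.embedding, queryEmb) >= threshold {
			return e.response, true
		}
	}
	return "", false
}

func main() {
	// Toy vectors standing in for real embeddings of cached prompts.
	cache := []cacheEntry{
		{embedding: []float64{0.9, 0.1, 0.2}, response: "We're open 9am-5pm."},
	}
	query := []float64{0.88, 0.12, 0.19} // embedding of "When do you open?"
	if resp, ok := lookup(cache, query, 0.95); ok {
		fmt.Println("cache hit:", resp)
	}
}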
Automatic Fallback
Configure multiple providers for the same model. If OpenAI throttles, Bifrost automatically fails over to Anthropic. Zero downtime. No manual intervention.
Weighted routing distributes load across providers based on health and performance.
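A stripped-down version of that routing loop is sketched below (illustrative Go; the upstream call is a placeholder, not Bifrost's implementation): a weighted pick chooses the primary provider, and the loop falls through to the others if it fails.

package main

import (
	"errors"
	"fmt"
	"math/rand"
)

type provider struct {
	name   string
	weight float64 // share of traffic when healthy
}

// pick chooses a provider with probability proportional to its weight.
func pick(ps []provider) int {
	total := 0.0
	for _, p := range ps {
		total += p.weight
	}
	r := rand.Float64() * total
	for i, p := range ps {
		if r < p.weight {
			return i
		}
		r -= p.weight
	}
	return len(ps) - 1
}

// complete tries the weighted pick first, then falls back to the others in order.
func complete(ps []provider, prompt string, call func(name, prompt string) (string, error)) (string, error) {
	start := pick(ps)
	for i := 0; i < len(ps); i++ {
		p := ps[(start+i)%len(ps)]
		if resp, err := call(p.name, prompt); err == nil {
			return resp, nil
		}
	}
	return "", errors.New("all providers failed")
}

func main() {
	providers := []provider{{"openai", 0.7}, {"anthropic", 0.3}}

	// Placeholder upstream call: pretend OpenAI is throttling.
	call := func(name, prompt string) (string, error) {
		if name == "openai" {
			return "", errors.New("429 rate limited")
		}
		return name + ": Hello!", nil
	}

	resp, err := complete(providers, "Hello", call)
	fmt.Println(resp, err) // anthropic answers, directly or after openai's 429
}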
Enterprise Security
- SSO integration (Google, GitHub)
- HashiCorp Vault for API key management
- Comprehensive audit logging (SOC 2, GDPR, HIPAA, ISO 27001)
- PII detection via guardrails (AWS Bedrock, Azure Content Safety, Patronus AI)
- Virtual keys for access control
What We Learned Building It
1. Connection Pooling Is Critical
Early versions of Bifrost had connection pool issues under sustained load. We implemented adaptive pooling that scales based on request rate and provider latency. Connection reuse jumped from 60% to 95%.
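Sizing a pool from request rate and provider latency is essentially Little's law: in-flight requests ≈ rate × latency. Here is a minimal sketch of that idea (an illustration, not necessarily the exact policy Bifrost uses):

package main

import (
	"fmt"
	"math"
	"time"
)

// targetPoolSize estimates how many upstream connections keep reuse high:
// by Little's law, in-flight requests ≈ rate × latency; headroom absorbs bursts.
func targetPoolSize(reqPerSec float64, providerLatency time.Duration, headroom float64) int {
	inFlight := reqPerSec * providerLatency.Seconds()
	return int(math.Ceil(inFlight * headroom))
}

func main() {
	// 500 RPS against a provider answering in ~1.5s needs roughly 750
	// concurrent connections; 20% headroom makes that 900.
	fmt.Println(targetPoolSize(500, 1500*time.Millisecond, 1.2))
}

In a Go HTTP client, a number like this would typically feed http.Transport settings such as MaxIdleConnsPerHost when the transport is built.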
2. Memory Management Makes or Breaks Performance
Go's garbage collector is good but not magic. We had to carefully manage buffer pooling, especially for large responses. Memory allocations dropped by 40% after optimization.
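The standard Go tool for this is sync.Pool. A minimal sketch of the pattern (illustrative, not Bifrost's exact buffer strategy):

package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// bufPool recycles buffers so large provider responses don't allocate
// a fresh buffer on every request.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// readBody drains a response body into a pooled buffer and returns the bytes.
func readBody(r io.Reader) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	io.Copy(buf, r)
	// Copy out before returning the buffer to the pool.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	body := strings.NewReader(`{"choices":[{"message":{"content":"hi"}}]}`)
	fmt.Println(string(readBody(body)))
}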
3. Goroutine Leaks Are Subtle
Monitoring goroutine counts revealed slow leaks in error handling paths. Context cancellation patterns fixed this. Now goroutine count stays stable even after days of sustained load.
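The fix is the standard context pattern: every goroutine that waits on an upstream call also selects on ctx.Done(), so it exits when the caller times out or walks away. A self-contained sketch (illustrative, not Bifrost code):

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// callProvider simulates a slow upstream call whose result is sent on a channel.
// Selecting on ctx.Done() lets the goroutine exit when the caller has already
// given up (timeout, client disconnect, fallback taken) instead of leaking.
func callProvider(ctx context.Context) <-chan string {
	out := make(chan string, 1) // buffered: a send never blocks even if unread
	go func() {
		select {
		case <-time.After(2 * time.Second): // stand-in for the provider call
			out <- "response"
		case <-ctx.Done():
			return // caller gave up; exit cleanly
		}
	}()
	return out
}

func main() {
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
		select {
		case r := <-callProvider(ctx):
			fmt.Println(r)
		case <-ctx.Done():
			// request abandoned
		}
		cancel()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine()) // stays small
}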
4. Database Logging Kills Performance
Synchronous database writes for logging add unacceptable latency. We made logging async with batched writes. Latency dropped from 200µs to 11µs.
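The shape of that change is a buffered channel plus a background writer that flushes in batches, so the request path only pays for a channel send. A sketch (illustrative; the real logger writes batched rows to a database rather than printing):

package main

import (
	"fmt"
	"time"
)

type logEntry struct {
	Model     string
	LatencyMS int
}

// startLogger consumes entries off a buffered channel and flushes them in
// batches, either when the batch fills up or on a timer.
func startLogger(entries <-chan logEntry, batchSize int, flushEvery time.Duration) {
	go func() {
		batch := make([]logEntry, 0, batchSize)
		ticker := time.NewTicker(flushEvery)
		defer ticker.Stop()

		flush := func() {
			if len(batch) == 0 {
				return
			}
			// Stand-in for one batched INSERT instead of N synchronous writes.
			fmt.Printf("flushing %d log rows\n", len(batch))
			batch = batch[:0]
		}

		for {
			select {
			case e, ok := <-entries:
				if !ok {
					flush()
					return
				}
				batch = append(batch, e)
				if len(batch) >= batchSize {
					flush()
				}
			case <-ticker.C:
				flush()
			}
		}
	}()
}

func main() {
	entries := make(chan logEntry, 1024) // request path: cheap channel send
	startLogger(entries, 100, 500*time.Millisecond)

	for i := 0; i < 250; i++ {
		entries <- logEntry{Model: "gpt-4o-mini", LatencyMS: 12}
	}
	close(entries)
	time.Sleep(time.Second) // let the final flush happen
}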
5. Provider Differences Are Everywhere
Every LLM provider has quirks. OpenAI returns different error codes than Anthropic. AWS Bedrock has different rate limit headers. Normalizing these differences took significant effort but makes the unified API actually work.
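To give a flavor of what that normalization involves, here is a simplified sketch (illustrative Go; the unified codes and mappings are made up for the example, not Bifrost's actual error model):

package main

import (
	"fmt"
	"net/http"
)

// GatewayError is the single error shape returned to callers,
// regardless of which provider produced it.
type GatewayError struct {
	Provider string
	Code     string // e.g. "rate_limited", "auth_failed", "bad_request"
	Status   int
}

// normalize maps provider-specific status codes and error types onto the
// gateway's unified codes. The mappings shown are illustrative, not exhaustive.
func normalize(provider string, status int, providerType string) GatewayError {
	code := "upstream_error"
	switch {
	case status == http.StatusTooManyRequests:
		code = "rate_limited"
	case status == http.StatusUnauthorized || status == http.StatusForbidden:
		code = "auth_failed"
	case provider == "anthropic" && providerType == "overloaded_error":
		code = "rate_limited" // Anthropic signals overload with its own error type
	case status >= 400 && status < 500:
		code = "bad_request"
	}
	return GatewayError{Provider: provider, Code: code, Status: status}
}

func main() {
	fmt.Println(normalize("openai", 429, "rate_limit_exceeded"))
	fmt.Println(normalize("anthropic", 529, "overloaded_error"))
}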
Open Source by Default
Bifrost is Apache 2.0 licensed. The entire codebase is public on GitHub. No enterprise-only performance features. No artificial limitations.
We built this for ourselves and open-sourced it because we think the ecosystem needs a high-performance option. Teams shouldn't choose between speed and features.
Enterprise features exist (SSO, Vault integration, priority support) but the core gateway is fully open.
Integrates With Maxim Platform
Bifrost connects to Maxim's AI quality platform for end-to-end workflows:
- Agent simulation and testing before production
- Unified evaluation frameworks
- Production observability with quality monitoring
- Data curation from production logs
This integration lets teams ship reliable AI agents 5x faster by unifying pre-release testing with production monitoring.
But Bifrost works standalone too. Use it as a pure gateway without the platform.
Try It
# Deploy Bifrost
npx -y @maximhq/bifrost
# Configure providers via environment
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
# Use OpenAI-compatible API
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Unified interface. 15+ providers. Zero configuration.
GitHub | Docs | Benchmarks
Building production AI? Hitting performance bottlenecks? Share your experiences in the comments.