Unless you have been living under a rock, you know AI has been the talk of the town since 2022, when ChatGPT, powered by a Large Language Model (LLM), was announced to the world.
Since then, LLMs such as OpenAI’s GPT series, Anthropic’s Claude, and Google’s Gemini have become the backbone of AI-powered applications.
However, as these AI-powered applications scale, they encounter the following issues:
If performance drops or prices rise with one LLM provider, migrating to another requires code changes every time.
Using different LLM providers becomes a bottleneck since each provider defines its API requests and responses differently.
Monitoring usage, optimizing cost, and debugging issues across different LLMs become a nightmare when everything is scattered across direct LLM provider APIs.
That’s where LLM gateways come in.
What is an LLM gateway?
In simple terms, an LLM gateway is a middleman that sits between your AI-powered app and different LLM providers, where it routes, manages, optimizes, and monitors requests and responses.
Below is an illustration of an LLM gateway.
In this guide, we will cover how a Go-based LLM gateway named Bifrost achieves large performance gains over LiteLLM, a Python-based LLM gateway.
Before we get started, here is what we will cover:
What is an LLM gateway?
What is Bifrost and why was it built?
Bifrost vs LiteLLM performance comparison
Advantages of a Go-based architecture over a Python-based architecture
Key Bifrost features that improve performance
How to get started with Bifrost
Let’s jump in!
What is Bifrost?
Bifrost is a high-performance LLM gateway designed to route, manage, and optimize requests between your AI application and multiple large language model providers such as OpenAI, Anthropic, Gemini and others.
It acts as a single entry point for all LLM traffic, providing a unified API, built-in observability, and performance optimizations that make it suitable for production AI systems.
You can learn more about the Bifrost LLM gateway here on their website.
Why Was Bifrost Built?
As LLM-powered applications moved from experimentation to production, teams began facing several challenges:
**Extreme Latency & Speed:** Existing LLM gateways became performance bottlenecks under high traffic.
**Reliability at Scale:** Under heavy load (500+ requests per second), other gateways started failing requests or consuming excessive memory.
**Unified Control:** Standardizing the fragmented API landscape so developers don't have to rewrite code when switching models.
Instead of being just another LLM proxy, Bifrost was built to be the fastest and most scalable LLM gateway. It is:
Blazing fast: Built in Go, Bifrost introduces <15µs internal overhead per request at 5000 RPS.
First-class observability: Native Prometheus metrics built-in - no wrappers, no sidecars, just drop it in and scrape.
Flexible transport: Supports HTTP and gRPC (planned) out of the box, so you don’t have to contort your infra to fit the tool. Bifrost bends to *your* system, not the other way around.
Feel free to explore Bifrost docs here.
Bifrost vs LiteLLM performance comparison
To compare Bifrost and LiteLLM, benchmark tests were run at 500 RPS. Below are the results.
In the throughput comparison, which measures the number of requests a gateway can successfully process per second (RPS), Bifrost outperformed LiteLLM, as shown below.
In the latency comparison, Bifrost was ~9.5x faster, with ~54x lower P99 latency, and used 68% less memory than LiteLLM on a t3.medium instance (2 vCPUs) with a tier 5 OpenAI key, as shown below.
A like-for-like benchmark based on the LiteLLM proxy’s own benchmarking setup was also run with the same load, same mocking behaviour, and same hardware profile, to get a clean, apples-to-apples comparison of the overhead added by the gateway layer itself. Bifrost again performed better than LiteLLM (see the accompanying chart).
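For readers unfamiliar with the P99 figure quoted in the latency comparison, here is a minimal sketch of how a 99th-percentile latency can be computed from raw samples using the nearest-rank method. The sample values are made up for illustration and are not taken from the benchmark; the function name `percentile` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of the
// given latency samples in milliseconds. It returns 0 for an empty input.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := make([]float64, len(samplesMs))
	copy(sorted, samplesMs) // sort a copy so the caller's slice is untouched
	sort.Float64s(sorted)

	// Nearest rank: the smallest rank covering at least p percent of samples.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Illustrative data: 98 fast requests and 2 slow outliers.
	samples := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 10)
	}
	samples = append(samples, 900, 950)

	fmt.Println("p50 (ms):", percentile(samples, 50))
	fmt.Println("p99 (ms):", percentile(samples, 99))
}
```

The median hides the two outliers while the P99 exposes them, which is why tail latency is a common yardstick for gateways under load.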
Advantages of Bifrost's Go-based architecture over LiteLLM's Python-based architecture
The performance gap between Bifrost (Go) and LiteLLM (Python) is a good example of why the choice of language and architecture is critical for infrastructure tools such as an LLM gateway.
While Python is the undisputed king of AI research and data science, Go (Golang) is widely considered superior for building the infrastructure (gateways, proxies, and cloud-native tools) that serves those models.
Here are the advantages of a Go-based architecture design for an LLM gateway over a Python-based one:
a). Higher Performance and Lower Latency
Go is a compiled language that produces native binaries, while Python is interpreted. This means:
Go executes faster at runtime.
It adds less overhead per request.
It provides more predictable response times under load.
For systems handling thousands of concurrent requests, this leads to lower latency and higher throughput.
b). Efficient Concurrency Model
Go was built with concurrency as a core feature using goroutines and channels. This means:
Goroutines are lightweight and cheap to create.
Go can handle many concurrent requests with minimal memory usage.
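Goroutines are scheduled by the Go runtime, so switching between them is cheaper than switching between the OS threads that back Python threads.
Python, on the other hand, is limited by the Global Interpreter Lock (GIL) in standard CPython, which keeps threads from running Python bytecode in parallel, so scaling concurrency usually means async frameworks or multiple processes.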
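To make the goroutine model concrete, here is a minimal sketch of a gateway-style fan-out: one goroutine per in-flight request, coordinated with a `sync.WaitGroup`. The `callProvider` function is a hypothetical stand-in that sleeps instead of calling a real LLM provider.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// callProvider simulates one upstream LLM call. The sleep stands in for
// network latency; no real provider is contacted.
func callProvider(id int) string {
	time.Sleep(20 * time.Millisecond)
	return fmt.Sprintf("response-%d", id)
}

// fanOut runs n simulated calls concurrently, one goroutine per request,
// and returns the responses indexed by request ID.
func fanOut(n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results[id] = callProvider(id) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	start := time.Now()
	results := fanOut(1000)
	fmt.Printf("handled %d requests in %v\n", len(results), time.Since(start))
}
```

Because the calls overlap, the total time should stay close to the latency of a single call rather than growing with the number of requests.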
c). Predictable Memory Management
Go’s garbage collector is optimized for low-latency server workloads. This means:
More stable memory usage.
Fewer unpredictable pauses.
Better performance consistency under load.
Python’s memory management can cause higher overhead and less predictable pauses in long-running services.
You can learn more about Bifrost's Go-based architecture here in the docs.
Key Bifrost features that improve performance
Bifrost is designed as a production-grade infrastructure layer rather than just a developer utility. Its core features work in harmony with its Go-based architecture to eliminate the bottlenecks typically found in Python-based proxies like LiteLLM.
Here is a breakdown of how its key features drive significant performance and reliability gains:
a). Adaptive Load Balancing
The Adaptive Load Balancing feature in Bifrost enables the LLM gateway to:
Continuously monitor performance metrics like latency, error rates, and throughput for each provider and API key.
Dynamically adjust where traffic goes so faster, healthier providers or keys get more requests.
Reduce traffic to an endpoint, or temporarily remove it from rotation, when it slows down or starts failing, until it recovers.
Bifrost’s adaptive load balancing feature improves performance by reducing slow or failed calls, keeping responses fast, and preventing bottlenecks that hurt overall throughput and reliability.
You can learn more about Bifrost’s adaptive load balancing feature here on the Bifrost docs.
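The idea behind adaptive load balancing can be sketched in a few lines of Go. This is not Bifrost's actual algorithm; it is a simplified illustration that keeps a moving average of latency and error rate per target and picks targets in proportion to a health-based weight. All names, the smoothing factor, and the error threshold are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// target holds rolling health statistics for one provider or API key.
type target struct {
	name      string
	latencyMs float64 // moving average of observed latency
	errRate   float64 // moving average of failures, between 0 and 1
}

const (
	alpha      = 0.2 // smoothing factor for the moving averages (assumed)
	maxErrRate = 0.5 // above this the target is temporarily skipped (assumed)
)

// observe folds one request outcome into the target's moving averages.
func (t *target) observe(latencyMs float64, failed bool) {
	t.latencyMs = alpha*latencyMs + (1-alpha)*t.latencyMs
	f := 0.0
	if failed {
		f = 1.0
	}
	t.errRate = alpha*f + (1-alpha)*t.errRate
}

// weight is higher for faster, healthier targets and zero for failing ones.
func (t *target) weight() float64 {
	if t.errRate > maxErrRate || t.latencyMs <= 0 {
		return 0
	}
	return (1 - t.errRate) / t.latencyMs
}

// pick selects a target in proportion to its weight. r must be in [0, 1),
// for example from rand.Float64(). It returns nil if no target is healthy.
func pick(targets []*target, r float64) *target {
	total := 0.0
	for _, t := range targets {
		total += t.weight()
	}
	if total == 0 {
		return nil
	}
	cut := r * total
	var last *target
	for _, t := range targets {
		w := t.weight()
		if w == 0 {
			continue
		}
		last = t
		cut -= w
		if cut < 0 {
			return t
		}
	}
	return last // guards against floating-point rounding at the upper edge
}

func main() {
	fast := &target{name: "key-a", latencyMs: 100}
	slow := &target{name: "key-b", latencyMs: 400}
	targets := []*target{fast, slow}

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pick(targets, rand.Float64()).name]++
	}
	fmt.Println("both healthy:", counts)

	// Simulate a run of failures on the fast key; it drops out of rotation.
	for i := 0; i < 5; i++ {
		fast.observe(100, true)
	}
	counts = map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pick(targets, rand.Float64()).name]++
	}
	fmt.Println("key-a failing:", counts)
}
```

In this sketch a skipped target recovers once successful observations pull its error average back under the threshold; a real gateway would also need a way to send occasional probe traffic to skipped targets.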
b). Semantic Caching
The semantic caching feature in Bifrost enables the LLM gateway to:
Not only cache exact matches, but use semantic similarity to detect when requests mean the same thing even if the text is phrased differently.
Return the stored response in milliseconds on a cache hit which is much faster than making a full LLM request.
Reduce repeated external API calls when similar queries occur frequently.
Bifrost’s semantic caching feature improves performance by providing instant responses for many requests and fewer expensive model calls, which cuts both latency and cost.
You can learn more about Bifrost’s semantic caching feature here on the Bifrost docs.
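As a rough illustration of the lookup logic, the sketch below caches responses keyed by a vector and returns a hit when a new prompt is similar enough. The `embed` function is a toy word-count stand-in that only captures word overlap; a real semantic cache would use an embedding model and a vector store. The type names and the 0.8 threshold are assumptions, not Bifrost's implementation.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// embed is a toy stand-in for an embedding model: it counts lowercase words.
func embed(text string) map[string]float64 {
	vec := map[string]float64{}
	for _, w := range strings.Fields(strings.ToLower(text)) {
		vec[strings.Trim(w, ".,?!")]++
	}
	return vec
}

// cosine returns the cosine similarity of two sparse vectors.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, v := range a {
		dot += v * b[k]
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type entry struct {
	vec      map[string]float64
	response string
}

// SemanticCache returns a stored response when a new prompt is similar
// enough to a previously cached one.
type SemanticCache struct {
	threshold float64
	entries   []entry
}

// Put stores a response under the embedding of its prompt.
func (c *SemanticCache) Put(prompt, response string) {
	c.entries = append(c.entries, entry{embed(prompt), response})
}

// Get returns the most similar cached response if it clears the threshold.
func (c *SemanticCache) Get(prompt string) (string, bool) {
	q := embed(prompt)
	bestIdx, bestSim := -1, 0.0
	for i, e := range c.entries {
		if sim := cosine(q, e.vec); sim > bestSim {
			bestIdx, bestSim = i, sim
		}
	}
	if bestIdx < 0 || bestSim < c.threshold {
		return "", false
	}
	return c.entries[bestIdx].response, true
}

func main() {
	cache := &SemanticCache{threshold: 0.8}
	cache.Put("What is the capital of France?", "Paris")

	queries := []string{
		"what is the capital of France, please?",
		"How do goroutines work?",
	}
	for _, q := range queries {
		if resp, ok := cache.Get(q); ok {
			fmt.Printf("hit  %q -> %s\n", q, resp)
		} else {
			fmt.Printf("miss %q\n", q)
		}
	}
}
```

The threshold is the main tuning knob: set it too low and unrelated prompts share answers, set it too high and the cache degrades to exact matching.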
c). Unified Interfaces
The unified interfaces feature in Bifrost enables the LLM gateway to:
Provide one consistent OpenAI-compatible API no matter which provider (OpenAI, Anthropic, AWS, etc.) is used.
Internally handle provider-specific quirks (input/output formats, errors, rate limits).
Bifrost’s unified interfaces feature improves performance by keeping provider-specific conversions and error handling inside the gateway, so your application code stays simple and requests follow a single, efficient path.
You can learn more about the unified interfaces feature here on the Bifrost docs.
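The pattern behind a unified interface can be sketched as a single request type plus one adapter per provider. The code below is a simplified illustration with mock adapters, not Bifrost's internals; all type and function names are hypothetical. It borrows the `provider/model` naming that appears in the request example later in this guide.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ChatRequest is the single request shape the application uses.
type ChatRequest struct {
	Model  string // "provider/model", for example "openai/gpt-4o-mini"
	Prompt string
}

// Provider adapts the unified request to one vendor's API.
type Provider interface {
	Chat(model, prompt string) (string, error)
}

// The adapters below are mocks; real ones would translate to each vendor's
// wire format and map vendor-specific errors to a common shape.
type openAIAdapter struct{}

func (openAIAdapter) Chat(model, prompt string) (string, error) {
	return "[openai:" + model + "] " + prompt, nil
}

type anthropicAdapter struct{}

func (anthropicAdapter) Chat(model, prompt string) (string, error) {
	return "[anthropic:" + model + "] " + prompt, nil
}

// Gateway routes a unified request to the adapter named in the model prefix.
type Gateway struct {
	providers map[string]Provider
}

func newGateway() *Gateway {
	return &Gateway{providers: map[string]Provider{
		"openai":    openAIAdapter{},
		"anthropic": anthropicAdapter{},
	}}
}

// Chat splits "provider/model" and forwards the call to the right adapter.
func (g *Gateway) Chat(req ChatRequest) (string, error) {
	name, model, ok := strings.Cut(req.Model, "/")
	if !ok {
		return "", errors.New(`model must look like "provider/model"`)
	}
	p, found := g.providers[name]
	if !found {
		return "", fmt.Errorf("unknown provider %q", name)
	}
	return p.Chat(model, req.Prompt)
}

func main() {
	g := newGateway()
	// Switching providers only changes the model string, not the calling code.
	for _, m := range []string{"openai/gpt-4o-mini", "anthropic/some-model"} {
		out, err := g.Chat(ChatRequest{Model: m, Prompt: "hello"})
		fmt.Println(out, err)
	}
}
```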
d). Built-In Metrics and Observability
The Built-In Metrics and Observability feature enables you to monitor and analyze every AI request and response in real time:
Native Prometheus-based metrics are supported without requiring external sidecars or wrappers.
It tracks key performance signals like latency, success rates, cache hit/miss rates, and throughput.
It helps teams identify slow paths, provider degradation, and optimization opportunities.
Bifrost’s Built-In Metrics and Observability feature improves performance by helping teams fine-tune configurations and prevent issues before they slow down systems.
You can learn more about the Built-In Metrics and Observability feature here on the Bifrost docs.
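To show what a Prometheus scrape target looks like in principle, here is a minimal standard-library sketch that exposes two counters in the Prometheus text exposition format. The metric names are hypothetical and are not the names Bifrost exports; a production service would normally use the official Prometheus client library instead of formatting the output by hand.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Counters updated on the request path. Atomic types keep them safe to
// update from many goroutines at once.
var (
	requestsTotal atomic.Int64
	cacheHits     atomic.Int64
)

// recordRequest is what a gateway would call once per handled request.
func recordRequest(cacheHit bool) {
	requestsTotal.Add(1)
	if cacheHit {
		cacheHits.Add(1)
	}
}

// renderMetrics formats the counters in the Prometheus text format.
func renderMetrics() string {
	return fmt.Sprintf(
		"# TYPE gateway_requests_total counter\ngateway_requests_total %d\n"+
			"# TYPE gateway_cache_hits_total counter\ngateway_cache_hits_total %d\n",
		requestsTotal.Load(), cacheHits.Load())
}

// metricsHandler would be mounted at /metrics for Prometheus to scrape.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

func main() {
	recordRequest(false)
	recordRequest(true)

	// Exercise the handler in-process instead of opening a real port.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```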
How to get started with Bifrost LLM gateway
Getting started with Bifrost is designed to be straightforward, especially if you are already using the OpenAI API.
There are two primary ways to integrate it: as a standalone HTTP server (easiest for most) or as a Go package directly in your code.
Let’s get started
Prerequisites
Go 1.23 or higher (not needed if using Docker)
Access to at least one AI model provider (OpenAI, Anthropic, etc.)
API keys for the providers you wish to use
Option A: Using Bifrost as an HTTP Server
This is the most common method. You run Bifrost as a proxy and simply point your application to it.
To use Bifrost as an HTTP server, follow the steps below.
Step 1: Configuration:
Create a file named config.json, which tells Bifrost which keys and models to use, as shown below:
{
  "openai": {
    "keys": [{
      "value": "env.OPENAI_API_KEY",
      "models": ["gpt-4o-mini"],
      "weight": 1.0
    }]
  }
}
Step 2: Set Environment Variables:
Add your API key to your environment so Bifrost can read it at startup instead of storing the secret in the config file:
export OPENAI_API_KEY=your_openai_api_key
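To make the `env.` prefix in config.json concrete, here is a minimal sketch of how a loader could resolve such values from the environment. This is an assumption about the convention for illustration only, not Bifrost's actual config loader; the struct and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// keyConfig and providerConfig mirror the shape of the config.json above.
type keyConfig struct {
	Value  string   `json:"value"`
	Models []string `json:"models"`
	Weight float64  `json:"weight"`
}

type providerConfig struct {
	Keys []keyConfig `json:"keys"`
}

// resolve replaces "env.NAME" with the value of the NAME environment
// variable. Values without the prefix are returned unchanged.
func resolve(v string) (string, error) {
	if !strings.HasPrefix(v, "env.") {
		return v, nil
	}
	name := strings.TrimPrefix(v, "env.")
	val, set := os.LookupEnv(name)
	if !set {
		return "", fmt.Errorf("environment variable %s is not set", name)
	}
	return val, nil
}

// loadConfig parses the JSON and resolves every key value.
func loadConfig(raw []byte) (map[string]providerConfig, error) {
	var cfg map[string]providerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	for _, p := range cfg {
		for i := range p.Keys {
			v, err := resolve(p.Keys[i].Value)
			if err != nil {
				return nil, err
			}
			p.Keys[i].Value = v // the slice shares storage with the map entry
		}
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"openai": {"keys": [{"value": "env.OPENAI_API_KEY", "models": ["gpt-4o-mini"], "weight": 1.0}]}}`)
	cfg, err := loadConfig(raw)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for name, p := range cfg {
		fmt.Printf("%s: %d key(s) loaded\n", name, len(p.Keys)) // never print the secret itself
	}
}
```

Failing fast on a missing variable, as this sketch does, surfaces a forgotten export at startup rather than on the first request.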
Step 3: Start the Bifrost HTTP Server:
You have two options to run the server: the Go binary, or a Docker setup if Go is not installed.
a. Using Go Binary
To use the Go binary, first install the transport package:
go install github.com/maximhq/bifrost/transports/bifrost-http@latest
Then run the server (make sure Go's bin directory, $GOPATH/bin, which defaults to ~/go/bin, is in your PATH):
bifrost-http -config config.json -port 8080
b. Using Docker
To use Docker, first download the Dockerfile:
curl -L -o Dockerfile https://raw.githubusercontent.com/maximhq/bifrost/main/transports/Dockerfile
Then build the Docker image:
docker build \
  --build-arg CONFIG_PATH=./config.json \
  --build-arg PORT=8080 \
  -t bifrost-transports .
Finally, run the Docker container:
docker run -p 8080:8080 -e OPENAI_API_KEY bifrost-transports
Step 4: Using the API:
Once the server is running, you can send requests to the HTTP endpoints.
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about Bifrost in Norse mythology."}
    ]
  }'
Note: For additional configurations in HTTP server setup, please read this.
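The same request as the curl command above can be sent from application code. Below is a minimal Go sketch using only the standard library. It assumes the server from Step 3 is listening on localhost:8080, and it prints the raw JSON response instead of decoding it, since the response schema is not shown in this guide. The helper names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// buildChatBody encodes a system prompt and a user prompt as a chat request.
func buildChatBody(model, system, user string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model: model,
		Messages: []message{
			{Role: "system", Content: system},
			{Role: "user", Content: user},
		},
	})
}

// sendChat posts the body to the gateway and returns the raw response body.
func sendChat(baseURL string, body []byte) (string, error) {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("gateway returned %s: %s", resp.Status, raw)
	}
	return string(raw), nil
}

func main() {
	body, err := buildChatBody(
		"openai/gpt-4o-mini",
		"You are a helpful assistant.",
		"Tell me about Bifrost in Norse mythology.",
	)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	out, err := sendChat("http://localhost:8080", body)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(out)
}
```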
Option B: Using Bifrost as a Go Package
If you are building a Go application and want to avoid the network hop of an external proxy, you can embed Bifrost directly.
To use Bifrost as a Go package, follow the steps below.
Step 1: Go Get Bifrost:
Run the following command to add Bifrost as a Go package in your project:
go get github.com/maximhq/bifrost/core@latest
Step 2: Implement Your Account Interface:
Define a type that implements Bifrost's account interface:
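// BaseAccount implements Bifrost's account interface.
// This snippet needs the "os" package and Bifrost's schemas package imported.
type BaseAccount struct{}

// GetConfiguredProviders lists the providers this account can use.
func (baseAccount *BaseAccount) GetConfiguredProviders() ([]schemas.ModelProvider, error) {
	return []schemas.ModelProvider{schemas.OpenAI}, nil
}

// GetKeysForProvider returns the API keys, and the models each key may serve.
func (baseAccount *BaseAccount) GetKeysForProvider(providerKey schemas.ModelProvider) ([]schemas.Key, error) {
	return []schemas.Key{
		{
			Value:  os.Getenv("OPENAI_API_KEY"),
			Models: []string{"gpt-4o-mini"},
		},
	}, nil
}

// GetConfigForProvider returns network and concurrency settings; defaults are used here.
func (baseAccount *BaseAccount) GetConfigForProvider(providerKey schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	return &schemas.ProviderConfig{
		NetworkConfig:            schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
	}, nil
}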
Bifrost uses these methods to get all the keys and configurations it needs to call the providers. You can check Additional Configurations for further customisations.
Step 3: Initialize Bifrost:
To initialize Bifrost, set up the Bifrost instance by providing your account implementation.
account := BaseAccount{}
client, err := bifrost.Init(schemas.BifrostConfig{
	Account: &account,
})
if err != nil {
	panic(err) // handle the initialization failure as appropriate for your app
}
Step 4: Use Bifrost:
Make your first LLM call!
bifrostResult, bifrostErr := client.ChatCompletionRequest(
	context.Background(),
	&schemas.BifrostRequest{
		Provider: schemas.OpenAI,
		Model:    "gpt-4o-mini", // make sure you have configured gpt-4o-mini in your account interface
		Input: schemas.RequestInput{
			ChatCompletionInput: bifrost.Ptr([]schemas.Message{{
				Role: schemas.RoleUser,
				Content: schemas.MessageContent{
					ContentStr: bifrost.Ptr("What is a LLM gateway?"),
				},
			}}),
		},
	},
)
// you can add model parameters by passing them in Params: &schemas.ModelParameters{...yourParams} in ChatCompletionRequest.
For more settings and configurations, check out the Documentation.
Bonus: You can use Maxim’s pre-made plugin from [github.com/maximhq/bifrost/plugins](https://github.com/maximhq/bifrost/tree/main/plugins) to add observability to Bifrost with a single line of configuration. First, install the package using the command below:
go get github.com/maximhq/bifrost/plugins/maxim
Then add observability as shown below.
maximPlugin, err := maxim.NewMaximLoggerPlugin(os.Getenv("MAXIM_API_KEY"), os.Getenv("MAXIM_LOG_REPO_ID"))
if err != nil {
	panic(err) // the plugin could not be created
}
client, err := bifrost.Init(schemas.BifrostConfig{
	Account: &account,
	Plugins: []schemas.Plugin{maximPlugin},
})
Remember to get your Maxim API Key and Log Repo ID from here.
Conclusion
In conclusion, scaling your AI-powered app from a proof of concept to a production-grade application can be a headache if you don’t use a fast and scalable LLM gateway.
While Python-based proxies like LiteLLM are excellent for early development, they can become the bottleneck in high-throughput environments.
Bifrost eliminates this hurdle by moving the gateway logic to a high-performance Go-based architecture.
By choosing Bifrost, you gain:
Infrastructure efficiency: handles roughly 10x more traffic with 68% less memory compared to the LiteLLM gateway.
Near-zero latency overhead: internal processing takes microseconds (<15µs per request), not milliseconds.
Production stability: 100% success rates even under heavy concurrency.
Finally, if you’re building or scaling an AI application and performance is becoming a bottleneck, explore Bifrost and try it yourself:
🌐 Website: https://getmax.im/bifrost-home
📦 GitHub: https://git.new/bifrostrepo
📘 Docs (Quickstart): https://getmax.im/bifrostdocs
📊 Benchmark vs LiteLLM: https://www.getmaxim.ai/blog/bifrost-a-drop-in-llm-proxy-40x-faster-than-litellm/
Test it in your own workload and see how a performance-first LLM gateway can change your system’s behavior at scale.