Imagine building an AI agent that uses only one LLM provider such as OpenAI, Gemini or Anthropic through their specific APIs.
In production, your agent’s LLM provider may hit a rate limit during a traffic spike or suffer a partial outage. You may also realize that different tasks in your agent’s workflow are better suited to different AI models entirely.
To avoid such issues, you can connect your agent to multiple AI model providers through an LLM gateway.
The LLM gateway handles AI model selection, authentication, failover, load balancing, routing and observability without changing your AI agent code.
In this guide, you will learn how to build an AI agent backed by multiple AI model providers (OpenAI, Gemini and Anthropic) using Bifrost.
Before we jump in, here is what we will cover:
What is an LLM Gateway (and why Bifrost)?
Setting up Bifrost with multiple AI providers
Building the AI agent
Implementing Load balancing
Implementing Routing Rules
Let’s jump in!
What is an LLM Gateway (and Why Bifrost)?
An LLM gateway is a middleware layer that sits between your AI agent and one or more AI model providers.
Instead of writing code for OpenAI's API format, Anthropic's authentication scheme, and Gemini's endpoint structure, your AI agent simply sends requests to the gateway in a standard format, and the gateway takes care of the rest.
In simple terms, an LLM gateway routes model requests without your agent needing to know how each LLM provider handles requests and responses.
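To make the idea concrete, here is a minimal sketch of how a gateway could decide where a request goes from the model string alone. It assumes the "provider/model" naming used later in this guide (for example openai/gpt-4o-mini); the split_model_id helper is hypothetical and is not part of Bifrost.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a "provider/model" string into its two parts.

    Only the first slash separates provider from model, so model names
    that contain slashes themselves are left intact.
    """
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, model


# The agent only changes this string; the request shape stays the same.
for model_id in ("openai/gpt-4o-mini", "gemini/gemini-2.0-flash"):
    provider, model = split_model_id(model_id)
    print(f"{model_id} -> provider={provider}, model={model}")
```

The agent code never touches provider-specific SDKs; switching providers is a one-string change.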
But why will we be using Bifrost in this guide?
Bifrost is an open-source LLM gateway built by Maxim AI to route, manage, and optimize requests between your AI application and multiple large language model providers.
Bifrost is written in Go (Golang) for performance. Here is what makes it stand out:
Low overhead: At 5,000 requests per second, Bifrost adds less than 15 microseconds of internal overhead per request, which matters at production scale.
Zero-config startup: You can launch Bifrost with a single npx command and configure everything through a web UI.
Built-in fallbacks and load balancing: If a provider fails or rate-limits you, Bifrost automatically routes to a backup. Traffic can be distributed across multiple keys or providers using weighted rules.
Semantic caching: Repeated or similar queries can be served from cache, which helps in cutting costs and reducing latency.
Observability: Bifrost offers Prometheus metrics, request tracing, and a built-in web dashboard out of the box.
Bifrost supports 15+ providers including OpenAI, Anthropic, Google Gemini (via Vertex or GenAI), AWS Bedrock, Azure, Mistral, Cohere, Groq, and more.
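The fallback behavior in the feature list above can be pictured with a short sketch. This is not Bifrost's implementation, only an illustration of ordered failover; ProviderError, call_with_fallback and the two stand-in provider functions are hypothetical names.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call that failed or was rate-limited."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stand-in for a provider that is currently rate-limiting requests.
    raise ProviderError("429 rate limited")


def healthy_backup(prompt: str) -> str:
    # Stand-in for a provider that answers normally.
    return f"echo: {prompt}"


print(call_with_fallback([("primary", flaky_primary), ("backup", healthy_backup)], "hi"))
```

With a gateway, this loop lives outside your agent, so the agent code stays free of retry and failover logic.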
You can learn more about the Bifrost LLM gateway on their website.
Prerequisites
Before we start, make sure you have the following:
Node.js 18+ (for running Bifrost via npx) or Docker (for containerized deployment)
API keys for the providers you want to use:
- OpenAI: platform.openai.com
- Anthropic: console.anthropic.com
- Google (Gemini): aistudio.google.com or Google Cloud (for Vertex)
Python 3.9+ or Node.js for writing the agent code
Setting up Bifrost with Multiple LLM providers
In this section, you will learn how to set up the Bifrost LLM gateway with multiple AI model providers for your AI agent or application.
Let’s get started.
Step 1: Install Bifrost using NPX binary
To install Bifrost, first create a project folder named multi-provider-agent and open it in your preferred code editor, such as VS Code or Cursor.
Then run the command below. The -app-dir flag determines where Bifrost stores all its data:
npx -y @maximhq/bifrost -app-dir ./my-bifrost-data
Step 2: Create a config.json file
Once you have installed Bifrost in your project, create a config.json file in the ./my-bifrost-data folder. Then add the configuration below, which defines multiple LLM providers and database persistence.
{
  "$schema": "https://www.getbifrost.ai/schema",
  "client": {
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [
        {
          "name": "openai-primary",
          "value": "env.OPENAI_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    },
    "anthropic": {
      "keys": [
        {
          "name": "anthropic-primary",
          "value": "env.ANTHROPIC_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    },
    "gemini": {
      "keys": [
        {
          "name": "gemini-primary",
          "value": "env.GEMINI_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    }
  },
  "config_store": {
    "enabled": true,
    "type": "sqlite",
    "config": {
      "path": "./config.db"
    }
  }
}
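Because the config refers to keys as env.NAME, a missing environment variable is an easy mistake to make. The sketch below is a hypothetical pre-flight check, not a Bifrost feature: it assumes only the providers / keys / value layout shown in the config above and reports which referenced variables are unset.

```python
import os
from typing import Mapping


def missing_env_keys(config: dict, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return env var names referenced as "env.NAME" in provider keys but not set."""
    missing = []
    for provider in config.get("providers", {}).values():
        for key in provider.get("keys", []):
            value = key.get("value", "")
            if value.startswith("env."):
                name = value[len("env."):]
                if not environ.get(name):
                    missing.append(name)
    return missing


# Trimmed-down version of the config above, used only for demonstration.
sample = {
    "providers": {
        "openai": {"keys": [{"name": "openai-primary", "value": "env.OPENAI_API_KEY"}]},
        "gemini": {"keys": [{"name": "gemini-primary", "value": "env.GEMINI_API_KEY"}]},
    }
}
print(missing_env_keys(sample, {"OPENAI_API_KEY": "sk-demo"}))
```

To check the real file, load ./my-bifrost-data/config.json with json.load and pass the result to missing_env_keys before starting the gateway.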
Step 3: Set up your API keys
After creating the config.json file, set your API keys as environment variables by running the commands below in the terminal, so the keys are never hardcoded.
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GEMINI_API_KEY="your-gemini-api-key"
Step 4: Start Bifrost Gateway server
Once you have set up your API keys, stop the server instance that is already running, then start the Bifrost LLM gateway again with the command below.
npx -y @maximhq/bifrost -app-dir ./my-bifrost-data
Bifrost will then listen on port 8080.
Finally, open the gateway dashboard at http://localhost:8080 in your browser.
Step 5: Verify the Setup
After installing and starting Bifrost, verify that it is working by running the curl command below in the terminal.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello!"}]
}'
You should then get a chat completion response from the Bifrost LLM gateway.
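If you prefer to run the same check from Python, the sketch below mirrors the curl command using only the standard library. The build_request and send helpers are hypothetical names; the URL and payload are the ones from the curl example.

```python
import json
import urllib.request

GATEWAY_CHAT_URL = "http://localhost:8080/v1/chat/completions"


def build_request(model: str, prompt: str, url: str = GATEWAY_CHAT_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the gateway running, send(build_request("openai/gpt-4o-mini", "Hello!")) should return the parsed response. Assuming the OpenAI-style response shape, the reply text sits under choices[0].message.content.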
Building an AI agent with multiple AI models using Bifrost and LangChain (LangGraph)
In this section, you will learn how to build an AI agent with multiple AI models using Bifrost and LangGraph.
You can learn how to set up Bifrost with LangGraph using the LangChain SDK in the Bifrost docs.
Let’s get started.
Step 1: Set up a Python virtual environment
To set up a Python virtual environment for your project, run the commands below in the terminal to create and activate a virtual environment.
# Create a virtual environment
python3 -m venv venv
# Activate it
source venv/bin/activate
Step 2: Install LangGraph dependencies
After setting up a Python virtual environment, install LangGraph dependencies in your project by running the command below in the terminal.
pip install langgraph langchain-openai
Step 3: Configure your AI agent
Once you have installed LangGraph dependencies, create an agent.py file in your project folder and add the following code that sets up a Multi-Provider Research Agent with LangGraph.
import os
from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
import operator
BIFROST_URL = "http://localhost:8080"
# ─────────────────────────────────────────────
# Model definitions — all pointing at Bifrost
# ─────────────────────────────────────────────
# Deep reasoning: complex planning and analysis
# NOTE: the "openrouter/" prefix assumes an OpenRouter provider is also
# configured in Bifrost. With only the config.json from this guide, use a
# Claude model ID prefixed with "anthropic/" instead.
planner_llm = ChatOpenAI(
    model="openrouter/claude-3.7-sonnet",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",  # Bifrost holds the real provider keys
    max_tokens=2048
)
# General purpose: tool use and synthesis
executor_llm = ChatOpenAI(
    model="openai/gpt-4o",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    max_tokens=2048
)
# Fast and cheap: lightweight formatting and summaries
summarizer_llm = ChatOpenAI(
    model="gemini/gemini-2.0-flash",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    max_tokens=1024
)
# ─────────────────────────────────────────────
# Tools
# ─────────────────────────────────────────────
@tool
def search_web(query: str) -> str:
    """Search the web for information on a given topic."""
    # Replace with a real search implementation (e.g. Tavily, SerpAPI)
    return f"[Search results for '{query}': Sample results would appear here]"

@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    try:
        # eval() is not a sandbox, even with empty builtins; use a real
        # expression parser before exposing this tool to untrusted input.
        result = eval(expression, {"__builtins__": {}})
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"

tools = [search_web, calculate]
tool_node = ToolNode(tools)
executor_with_tools = executor_llm.bind_tools(tools)
# ─────────────────────────────────────────────
# State
# ─────────────────────────────────────────────
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    research_plan: str
    final_summary: str
# ─────────────────────────────────────────────
# Nodes
# ─────────────────────────────────────────────
def planner_node(state: AgentState) -> dict:
    """
    Uses Claude 3.7 Sonnet via Bifrost.
    Analyses the user's request and produces a structured research plan.
    """
    messages = [
        SystemMessage(content="""You are a strategic research planner.
Given a user's question, create a clear, step-by-step research plan
that identifies: what information is needed, what tools to use,
and what the final answer should look like."""),
    ] + state["messages"]
    response = planner_llm.invoke(messages)
    return {
        "messages": [response],
        "research_plan": response.content
    }

def executor_node(state: AgentState) -> dict:
    """
    Uses GPT-4o via Bifrost with tool access.
    Executes the research plan by calling tools and gathering information.
    """
    messages = [
        SystemMessage(content=f"""You are a research executor.
Follow this plan carefully and use the available tools to gather
the required information:
{state.get('research_plan', 'Gather information to answer the user question.')}"""),
    ] + state["messages"]
    response = executor_with_tools.invoke(messages)
    return {"messages": [response]}

def summarizer_node(state: AgentState) -> dict:
    """
    Uses Gemini Flash via Bifrost.
    Takes the gathered research and produces a clean, concise final answer.
    """
    # Collect all content from the conversation
    research_content = "\n\n".join([
        msg.content for msg in state["messages"]
        if hasattr(msg, "content") and msg.content
    ])
    messages = [
        SystemMessage(content="""You are a concise summarizer.
Given research findings, produce a clear, well-structured final answer.
Be direct and helpful. Remove redundancy."""),
        HumanMessage(content=f"Research gathered:\n\n{research_content}\n\nProvide the final answer.")
    ]
    response = summarizer_llm.invoke(messages)
    return {
        "messages": [response],
        "final_summary": response.content
    }

def should_use_tools(state: AgentState) -> str:
    """Conditional edge: if the executor requested tool calls, run them."""
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "summarizer"
# ─────────────────────────────────────────────
# Graph assembly
# ─────────────────────────────────────────────
def build_research_agent() -> StateGraph:
    graph = StateGraph(AgentState)
    graph.add_node("planner", planner_node)
    graph.add_node("executor", executor_node)
    graph.add_node("tools", tool_node)
    graph.add_node("summarizer", summarizer_node)
    graph.set_entry_point("planner")
    graph.add_edge("planner", "executor")
    graph.add_conditional_edges(
        "executor",
        should_use_tools,
        {"tools": "tools", "summarizer": "summarizer"}
    )
    graph.add_edge("tools", "executor")  # Loop back after tool use
    graph.add_edge("summarizer", END)
    return graph.compile()
# ─────────────────────────────────────────────
# Run it
# ─────────────────────────────────────────────
if __name__ == "__main__":
    agent = build_research_agent()
    questions = [
        "What are the key differences between transformer and mamba architectures for language models?",
        "If a company grows revenue 23% YoY from $4.2M, what is the new revenue?",
    ]
    for question in questions:
        print(f"\n{'='*60}")
        print(f"Question: {question}")
        result = agent.invoke({
            "messages": [HumanMessage(content=question)],
            "research_plan": "",
            "final_summary": ""
        })
        print(f"\nFinal Answer:\n{result['final_summary']}")
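The calculate tool above relies on eval, and emptying the builtins does not make it safe to feed with model-generated input. One possible replacement, sketched here under the assumption that the tool only needs basic arithmetic, walks the parsed expression tree and allows nothing but numbers and a few operators. The safe_eval name is hypothetical.

```python
import ast
import operator

# Operators the calculator accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, /, % arithmetic without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_eval("4.2 * 1.23"))
```

Inside calculate, result = safe_eval(expression) could stand in for the eval call; names, attribute access and function calls all end in a ValueError, which the existing except clause already turns into an error string.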
Step 4: Run your agent
After setting up your AI agent, run it by executing the following command in the terminal.
python agent.py
Once the AI agent completes its workflow, you should see the final answer to each question printed in the terminal.
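Before spending tokens on a full run, it can help to check the graph's branching logic offline. The sketch below repeats the should_use_tools function from agent.py so that it runs on its own, and uses SimpleNamespace objects as stand-ins for LangChain messages. It is an illustrative test, not part of the LangGraph API.

```python
from types import SimpleNamespace


def should_use_tools(state: dict) -> str:
    """Copy of the conditional edge from agent.py, repeated so this sketch runs alone."""
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "summarizer"


# Stand-ins for LangChain messages: only the tool_calls attribute matters here.
with_call = SimpleNamespace(
    content="",
    tool_calls=[{"name": "calculate", "args": {"expression": "4.2 * 1.23"}}],
)
without_call = SimpleNamespace(content="Done.", tool_calls=[])

print(should_use_tools({"messages": [with_call]}))
print(should_use_tools({"messages": [without_call]}))
```

A message that requests a tool call sends the graph to the tools node, and any other message ends the loop at the summarizer.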
Implementing Load balancing
In this section, you will learn how to implement load balancing in your AI agent using Bifrost. But before we get started, let’s understand what load balancing is.
What is load balancing in Bifrost?
In Bifrost, load balancing means distributing AI requests across multiple API keys and AI models instead of sending all traffic to a single key.
Sending all your AI agent traffic to a single API key can lead to the following issues:
You can hit rate limits
Requests can fail when traffic spikes
One key outage breaks everything
Performance becomes unstable
However, if you implement load balancing using Bifrost, your AI agent can:
Share traffic across several keys
Automatically switch if one key fails
Avoid overloading a single provider
Bifrost implements load balancing using two methods: weighted load balancing, and model whitelisting and filtering.
In weighted load balancing, you assign a weight to each API key. In the example below, openai-primary handles ~70% of requests and openai-secondary handles ~30%.
"providers": {
  "openai": {
    "keys": [
      {
        "name": "openai-primary",
        "value": "env.OPENAI_API_KEY_PRIMARY",
        "models": ["gpt-4o", "gpt-4o-mini"],
        "weight": 0.7
      },
      {
        "name": "openai-secondary",
        "value": "env.OPENAI_API_KEY_SECONDARY",
        "models": [],
        "weight": 0.3
      }
    ]
  }
}
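To see why weights of 0.7 and 0.3 lead to roughly a 70/30 split, here is a minimal sketch of weighted random selection. It illustrates the general technique, not Bifrost's internal algorithm, and pick_key is a hypothetical helper.

```python
import random
from collections import Counter


def pick_key(keys: list[dict], rng: random.Random) -> str:
    """Pick one key name at random, in proportion to its weight."""
    names = [key["name"] for key in keys]
    weights = [key["weight"] for key in keys]
    return rng.choices(names, weights=weights, k=1)[0]


keys = [
    {"name": "openai-primary", "weight": 0.7},
    {"name": "openai-secondary", "weight": 0.3},
]

# Simulate 10,000 requests with a fixed seed so the run is repeatable.
rng = random.Random(0)
counts = Counter(pick_key(keys, rng) for _ in range(10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```

The split is statistical rather than exact: any single request can land on either key, and the shares only approach 70% and 30% over many requests.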
In model whitelisting and filtering, keys can be restricted to specific models for access control and cost management, as shown below. A request for gpt-4o will only be routed to premium-key, while a request for gpt-4o-mini will only use standard-key.
"providers": {
  "openai": {
    "keys": [
      {
        "name": "premium-key",
        "value": "env.OPENAI_API_KEY_PREMIUM",
        "models": ["gpt-4o", "o1-preview"],
        "weight": 1.0
      },
      {
        "name": "standard-key",
        "value": "env.OPENAI_API_KEY_STANDARD",
        "models": ["gpt-4o-mini"],
        "weight": 1.0
      }
    ]
  }
}
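The filtering step can be sketched as a simple membership check. The eligible_keys helper is hypothetical, and treating an empty models list as "no restriction" is an assumption based on how the earlier config.json leaves models empty for keys that serve every model. Under that assumption, a key with an empty list is eligible for every request, so explicit lists are what actually enforce a split between keys.

```python
def eligible_keys(keys: list[dict], model: str) -> list[str]:
    """Return the names of keys allowed to serve `model`.

    Assumption: an empty "models" list means the key is unrestricted.
    """
    return [
        key["name"]
        for key in keys
        if not key["models"] or model in key["models"]
    ]


keys = [
    {"name": "premium-key", "models": ["gpt-4o", "o1-preview"]},
    {"name": "standard-key", "models": ["gpt-4o-mini"]},
]

for model in ("gpt-4o", "gpt-4o-mini"):
    print(model, "->", eligible_keys(keys, model))
```

Weighted selection then applies only among the keys that survive this filter.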
You can learn more about load balancing in the Bifrost docs.
Implementing Routing Rules
In this section, you will learn how to implement routing rules in your AI agent. Before we get started, let’s understand what routing rules are in Bifrost.
What are routing rules in Bifrost?
Routing rules in Bifrost are like traffic signs for your AI requests. They tell Bifrost where to send each request based on conditions you define.
Instead of always using the same AI model or provider, routing rules let you say things like, “If the request looks like this, send it to that provider.”
For example:
If a request is for summarization, use a cheaper model
If the user is a premium customer, use a more powerful model
If one provider is too busy, send the request to another provider
Every time your AI agent sends a request, Bifrost checks these rules. The request is sent to the provider and model chosen by the first rule that matches.
If no rule matches the request, Bifrost simply falls back to its default behavior, such as load balancing between providers.
To implement routing rules, define them in the config.json file, as shown below.
"governance": {
  "routing_rules": [
    {
      "id": "rule-uuid-123",
      "name": "Premium Tier Route",
      "description": "Route premium users to fast provider",
      "enabled": true,
      "cel_expression": "headers[\"x-tier\"] == \"premium\"",
      "provider": "openrouter",
      "model": "claude-3.7-sonnet",
      "fallbacks": ["openrouter/gpt-4o", "openrouter/gpt-3.5-turbo"],
      "scope": "global",
      "scope_id": null,
      "priority": 10
    },
    {
      "id": "rule-uuid-456",
      "name": "Budget Overflow Route",
      "description": "Route to cheaper provider when budget is high",
      "enabled": true,
      "cel_expression": "budget_used > 85",
      "provider": "groq",
      "model": "llama-2-70b",
      "fallbacks": [],
      "scope": "team",
      "scope_id": "team-ml-ops",
      "priority": 5
    },
    {
      "id": "team-preference",
      "name": "ML Team Anthropic Route",
      "enabled": true,
      "cel_expression": "team_name == \"ml-research\"",
      "provider": "anthropic",
      "model": "claude-3-opus-20240229",
      "fallbacks": ["bedrock/claude-3-opus"],
      "scope": "team",
      "scope_id": "team-ml-research",
      "priority": 0
    }
  ]
},
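The first-match behavior can be illustrated with a short sketch. Python predicates stand in for the CEL expressions here, the Rule class and first_match function are hypothetical, and rules are checked in list order; how Bifrost orders rules by priority and scope is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # stand-in for a CEL expression
    provider: str
    model: str
    enabled: bool = True


def first_match(rules: list[Rule], request: dict) -> Optional[Rule]:
    """Return the first enabled rule whose predicate accepts the request."""
    for rule in rules:
        if rule.enabled and rule.matches(request):
            return rule
    return None  # caller falls back to default behavior


RULES = [
    Rule("Premium Tier Route",
         lambda r: r.get("headers", {}).get("x-tier") == "premium",
         "openrouter", "claude-3.7-sonnet"),
    Rule("Budget Overflow Route",
         lambda r: r.get("budget_used", 0) > 85,
         "groq", "llama-2-70b"),
]

hit = first_match(RULES, {"headers": {"x-tier": "premium"}, "budget_used": 90})
print(hit.name if hit else "no rule matched")
```

In this run both rules would accept the request, but only the first one in the list decides the route, which is why rule order matters as much as the conditions themselves.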
After defining the routing rules, send the attributes they match on with your AI agent's requests. For example, the x-tier header below satisfies the Premium Tier Route rule.
# Deep reasoning: complex planning and analysis
planner_llm = ChatOpenAI(
    model="openrouter/claude-3.7-sonnet",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    # Sent with every request from this client; matched by the rule's CEL expression
    default_headers={
        'x-tier': 'premium',
    },
    max_tokens=2048
)
You can learn more about routing rules in the Bifrost docs.
Conclusion
In conclusion, building AI agents that rely on a single LLM provider is risky and limits flexibility. By using an LLM gateway like Bifrost, you can create a multi-provider AI agent that is resilient, scalable, and future-proof.
Finally, if you’re building or scaling an AI application and performance is becoming a bottleneck, explore Bifrost and try it yourself:
🌐 Website: https://getmax.im/bifrost-home
📦 GitHub: https://git.new/bifrostrepo
📘 Docs (Quickstart): https://getmax.im/bifrostdocs








Top comments (8)
Nice Bonnie! I forgot I had to look into LLM gateways lol
LLM gateways are quite interesting.
They come in handy especially when building production ready AI agents.
Great tutorial. The model specialization pattern you're using here is something I've been moving toward too: different models for different cognitive tasks (planner vs. executor vs. summarizer) is a much cleaner design than hammering one model for everything.
One thing worth adding: the routing rules via CEL expressions are powerful but can get brittle as the ruleset grows. We found it helpful to version-control the routing config separately and treat changes to routing rules with the same rigor as code changes, since a bad rule can silently degrade agent quality without obvious errors.
Also curious if you've tested Bifrost under actual sustained load. The <15 microsecond overhead claim is impressive. Does that hold when the semantic cache has cold misses and all three providers are in play simultaneously?
Great points, Vic Chen.
On routing rules - I completely agree. Treat them like code. A bad rule can silently route premium traffic to a cheap model and you won't notice until users complain.
At Bifrost, they version control their rules in the same repo as the agent code and run tests against rule changes before deploying.
On the sustained load question - the <15µs overhead holds for the gateway routing itself. Cache misses add latency obviously (embedding lookup + vector search), but that's separate from the core routing overhead.
With all three providers active, the gateway just picks based on weights/rules - doesn't add significant overhead per additional provider. The Go concurrency model handles the parallel connections efficiently.
One thing worth noting: the overhead stays consistent under load because there's no GIL situation like Python gateways have. We've tested at 5k RPS and the p99 stays flat.
The GIL point is huge — and honestly underappreciated in a lot of gateway comparisons. In financial workloads where you need consistent sub-millisecond routing, Python's threading model becomes a real liability at scale. The p99 staying flat at 5k RPS in Go is exactly the kind of guarantee you need before putting this in front of anything latency-sensitive.
The versioned routing rules approach also resonates. In finance especially, being able to audit exactly which rule was active when a specific model was called (and why) is not optional — it's a compliance requirement. Treating routing config as code with proper PR reviews + tests is the right default from day one, not something you bolt on later.
Curious about the cache miss behavior in practice — for embeddings-based routing, what's the p99 on a cold lookup vs warm? And do you pre-warm on deploy or accept the initial cold-start penalty?
Strong write-up — especially the planner/executor/summarizer split.
One production pattern that helped us a lot: define an explicit routing SLO per path (e.g. p95 latency, error budget, cost/request), then auto-downgrade model class only when the path breaches budget for N consecutive windows.
It prevents “silent quality drift” from static fallback chains and gives you auditable routing decisions.
Have you tried exposing routing decisions as structured events (provider, model, rule_id, fallback_reason, latency_ms) for postmortems?
Neat approach with the unified gateway — does Bifrost handle streaming responses the same way across all three providers, or do you lose SSE compatibility with some?
Good question, Klement
Bifrost normalizes streaming across all providers to OpenAI's SSE format. So you write your client code once for OpenAI-style streaming and it works regardless of whether the backend is Anthropic, Gemini, or OpenAI.
The gateway handles the translation - Anthropic has slightly different SSE event structure, Gemini via Vertex has its own format. Bifrost converts them all to the standard data: {...} format your client expects.
No compatibility loss - Bifrost uses streaming in production across all three providers without changing client code.
Just point to Bifrost, request streaming, and it works.