Bonnie for CrossPostr

Posted on Feb 24 • Edited on Feb 28

Building an AI Agent with Multiple AI Model Providers Using an LLM Gateway (OpenAI, Anthropic, Gemini & Bifrost)

#programming #tutorial #opensource #webdev

Imagine building an AI agent that uses only one LLM provider such as OpenAI, Gemini or Anthropic through their specific APIs.

In production, that provider can hit a rate limit during a traffic spike or suffer a partial outage, or you may realize that different tasks in your agent’s workflow are better suited to different AI models entirely.

To avoid such issues, you can place an LLM gateway between your agent and multiple AI model providers.

The LLM gateway handles AI model selection, authentication, failover, load balancing, routing and observability without changing your AI agent code.

In this guide, you will learn how to build an AI agent that uses multiple AI model providers (OpenAI, Gemini and Anthropic) through Bifrost.

Before we jump in, here is what we will cover:

What is an LLM Gateway (and why Bifrost)?
Setting up Bifrost with multiple AI providers
Building the AI agent
Implementing Load balancing
Implementing Routing Rules

Let’s jump in!

What is an LLM Gateway (and Why Bifrost)?

An LLM gateway is a middleware layer that sits between your AI agent and one or more AI model providers.

Instead of writing code for OpenAI's API format, Anthropic's authentication scheme, and Gemini's endpoint structure, your AI agent simply sends requests to the gateway in a standard format, and the gateway takes care of the rest.

In simple terms, an LLM gateway routes model requests without your agent needing to know how each LLM provider handles requests and responses.
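
To make the idea of "one standard format" concrete, here is a minimal sketch of what the agent side can look like. It assumes the gateway accepts OpenAI-style chat payloads with a provider/model string, as the requests later in this guide do. The helper names split_model and build_chat_payload are hypothetical and exist only for illustration.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split a 'provider/model' string into its two parts."""
    provider, _, name = model.partition("/")
    if not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the same OpenAI-style payload regardless of the target provider."""
    split_model(model)  # validate the provider prefix early
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


# The agent code is identical for every provider; only the model string changes.
payloads = [
    build_chat_payload(m, "Hello!")
    for m in ("openai/gpt-4o-mini", "gemini/gemini-2.0-flash")
]
```

Everything provider-specific (authentication, endpoint, response translation) stays on the gateway side, which is why the payload builder never branches on the provider.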

But why will we be using Bifrost in this guide?

Bifrost is an open-source LLM gateway built by Maxim AI to route, manage, and optimize requests between your AI application and multiple large language model providers.

Bifrost is written in Go (Golang) for performance. Here is what makes it stand out:

Low overhead: At 5,000 requests per second, Bifrost adds less than 15 microseconds of internal overhead per request, which matters at production scale.
Zero-config startup: You can launch Bifrost with a single npx command and configure everything through a web UI.
Built-in fallbacks and load balancing: If a provider fails or rate-limits you, Bifrost automatically routes to a backup. Traffic can be distributed across multiple keys or providers using weighted rules.
Semantic caching: Repeated or similar queries can be served from cache, which helps in cutting costs and reducing latency.
Observability: Bifrost offers Prometheus metrics, request tracing, and a built-in web dashboard out of the box.
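
The fallback behaviour in the list above can be pictured with a small sketch. This is not Bifrost's implementation; it only illustrates the try-in-order idea. ProviderError, call_with_fallback and the two stand-in provider functions are hypothetical names, and no network calls are made.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call that failed (rate limit, outage, ...)."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this one failed
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in provider functions for the demo.
def rate_limited(prompt: str) -> str:
    raise ProviderError("429 rate limit")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = call_with_fallback(
    [("primary", rate_limited), ("backup", healthy)], "Hello!"
)
```

With a gateway, this loop lives outside your agent, so the agent only ever sees the successful response.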

Bifrost supports 15+ providers including OpenAI, Anthropic, Google Gemini (via Vertex or GenAI), AWS Bedrock, Azure, Mistral, Cohere, Groq, and more.

You can learn more about the Bifrost LLM gateway on their website.

Prerequisites

Before we start, make sure you have the following:

Node.js 18+ (for running Bifrost via npx) or Docker (for containerized deployment)
API keys for the providers you want to use:
- OpenAI: platform.openai.com
- Anthropic: console.anthropic.com
- Google (Gemini): aistudio.google.com or Google Cloud (for Vertex)
Python 3.9+ or Node.js for writing the agent code

Setting up Bifrost with Multiple LLM providers

In this section, you will learn how to set up the Bifrost LLM gateway with multiple AI model providers for your AI agent or application.

Let’s get started.

Step 1: Install Bifrost using NPX binary

To install Bifrost, first, create your project folder named multi-provider-agent and open it using your preferred code editor such as VS Code or Cursor.

Then run the command below. The -app-dir flag determines where Bifrost stores all its data:

npx -y @maximhq/bifrost -app-dir ./my-bifrost-data

Step 2: Create a config.json file

Once you have installed Bifrost in your project, create a config.json file in the ./my-bifrost-data folder. Then add the code below that defines multiple LLM providers and database persistence.

{
  "$schema": "https://www.getbifrost.ai/schema",
  "client": {
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [
        {
          "name": "openai-primary",
          "value": "env.OPENAI_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    },
    "anthropic": {
      "keys": [
        {
          "name": "anthropic-primary",
          "value": "env.ANTHROPIC_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    },
    "gemini": {
      "keys": [
        {
          "name": "gemini-primary",
          "value": "env.GEMINI_API_KEY",
          "models": [],
          "weight": 1.0
        }
      ]
    }
  },
  "config_store": {
    "enabled": true,
    "type": "sqlite",
    "config": {
      "path": "./config.db"
    }
  }
}
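
Because the config refers to keys as env.NAME, a missing environment variable is an easy mistake to make. The sketch below is a hypothetical pre-flight check, not part of Bifrost, that lists referenced variables that are unset. It assumes the providers/keys/value layout shown above and uses an inline sample so it runs without any files.

```python
import json
import os


def referenced_env_vars(config: dict) -> list[str]:
    """Collect the variable names behind every 'env.NAME' key value."""
    names = []
    for provider in config.get("providers", {}).values():
        for key in provider.get("keys", []):
            value = key.get("value", "")
            if value.startswith("env."):
                names.append(value[len("env."):])
    return names


def missing_env_vars(config: dict, environ=os.environ) -> list[str]:
    """Return referenced variables that are unset or empty."""
    return [name for name in referenced_env_vars(config) if not environ.get(name)]


# Inline sample mirroring the structure above. For the real file you could
# load ./my-bifrost-data/config.json with json.load instead.
sample = json.loads("""
{"providers": {
  "openai": {"keys": [{"name": "openai-primary", "value": "env.OPENAI_API_KEY"}]},
  "gemini": {"keys": [{"name": "gemini-primary", "value": "env.GEMINI_API_KEY"}]}
}}
""")
print(missing_env_vars(sample, {"OPENAI_API_KEY": "sk-test"}))  # → ['GEMINI_API_KEY']
```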

Step 3: Set up your API keys

After creating the config.json file, set your API keys as environment variables by running the commands below in the terminal, so they are never hardcoded. The config file refers to them as env.OPENAI_API_KEY and so on.

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GEMINI_API_KEY="your-gemini-api-key"

Step 4: Start Bifrost Gateway server

Once you have set up your API keys, start the Bifrost LLM gateway server by running the command below again (stop the instance that is already running first).

npx -y @maximhq/bifrost -app-dir ./my-bifrost-data

Bifrost will then listen on port 8080.

Finally, navigate to the gateway dashboard at http://localhost:8080.

Step 5: Verify the Setup

After installing and starting Bifrost, you can verify if it's working by running the curl command below in the terminal.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

You should then get a chat completion response from the Bifrost LLM gateway.
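
If you prefer to run the same check from Python, the sketch below builds the equivalent request with the standard library only. The build_request and send helpers are hypothetical names, and the commented-out lines assume the gateway returns an OpenAI-style response body.

```python
import json
import urllib.request

BIFROST_URL = "http://localhost:8080"  # port taken from the startup step above


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BIFROST_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


request = build_request("openai/gpt-4o-mini", "Hello!")
# reply = send(request)  # uncomment once Bifrost is running
# print(reply["choices"][0]["message"]["content"])  # assumes an OpenAI-style body
```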

Building an AI agent with multiple AI models using Bifrost and LangChain (LangGraph)

In this section, you will learn how to build an AI agent with multiple AI models using Bifrost and LangGraph.

The Bifrost docs describe how to set up Bifrost with LangGraph using the LangChain SDK.

Let’s get started.

Step 1: Set up a Python virtual environment

To set up a Python virtual environment for your project, run the commands below in the terminal to create and activate a virtual environment.

# Create a virtual environment
python3 -m venv venv

# Activate it
source venv/bin/activate

Step 2: Install LangGraph dependencies

After setting up a Python virtual environment, install LangGraph dependencies in your project by running the command below in the terminal.

pip install langgraph langchain-openai

Step 3: Configure your AI agent

Once you have installed LangGraph dependencies, create an agent.py file in your project folder and add the following code that sets up a Multi-Provider Research Agent with LangGraph.

import operator
from typing import Annotated, TypedDict

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode

BIFROST_URL = "http://localhost:8080"

# ─────────────────────────────────────────────
# Model definitions — all pointing at Bifrost
# ─────────────────────────────────────────────

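# Deep reasoning: complex planning and analysis
# Routed to the "anthropic" provider defined in config.json
planner_llm = ChatOpenAI(
    model="anthropic/claude-3-7-sonnet-20250219",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",  # placeholder: the real keys live in Bifrost
    max_tokens=2048
)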

# General purpose: tool use and synthesis
executor_llm = ChatOpenAI(
    model="openai/gpt-4o",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    max_tokens=2048
)

# Fast and cheap: lightweight formatting and summaries
summarizer_llm = ChatOpenAI(
    model="gemini/gemini-2.0-flash",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    max_tokens=1024
)

# ─────────────────────────────────────────────
# Tools
# ─────────────────────────────────────────────

@tool
def search_web(query: str) -> str:
    """Search the web for information on a given topic."""
    # Replace with a real search implementation (e.g. Tavily, SerpAPI)
    return f"[Search results for '{query}': Sample results would appear here]"

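@tool
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # WARNING: eval() is unsafe on untrusted input, even with builtins removed.
    # Acceptable for a local demo only; use a real expression parser in production.
    try:
        result = eval(expression, {"__builtins__": {}})
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"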

tools = [search_web, calculate]
tool_node = ToolNode(tools)
executor_with_tools = executor_llm.bind_tools(tools)

# ─────────────────────────────────────────────
# State
# ─────────────────────────────────────────────

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    research_plan: str
    final_summary: str

# ─────────────────────────────────────────────
# Nodes
# ─────────────────────────────────────────────

def planner_node(state: AgentState) -> dict:
    """
    Uses Claude 3.7 Sonnet via Bifrost.
    Analyses the user's request and produces a structured research plan.
    """
    messages = [
        SystemMessage(content="""You are a strategic research planner. 
        Given a user's question, create a clear, step-by-step research plan 
        that identifies: what information is needed, what tools to use, 
        and what the final answer should look like."""),
    ] + state["messages"]

    response = planner_llm.invoke(messages)

    return {
        "messages": [response],
        "research_plan": response.content
    }

def executor_node(state: AgentState) -> dict:
    """
    Uses GPT-4o via Bifrost with tool access.
    Executes the research plan by calling tools and gathering information.
    """
    messages = [
        SystemMessage(content=f"""You are a research executor. 
        Follow this plan carefully and use the available tools to gather 
        the required information:

        {state.get('research_plan', 'Gather information to answer the user question.')}"""),
    ] + state["messages"]

    response = executor_with_tools.invoke(messages)
    return {"messages": [response]}

def summarizer_node(state: AgentState) -> dict:
    """
    Uses Gemini Flash via Bifrost.
    Takes the gathered research and produces a clean, concise final answer.
    """
    # Collect all content from the conversation
    research_content = "\n\n".join([
        msg.content for msg in state["messages"] 
        if hasattr(msg, "content") and msg.content
    ])

    messages = [
        SystemMessage(content="""You are a concise summarizer. 
        Given research findings, produce a clear, well-structured final answer. 
        Be direct and helpful. Remove redundancy."""),
        HumanMessage(content=f"Research gathered:\n\n{research_content}\n\nProvide the final answer.")
    ]

    response = summarizer_llm.invoke(messages)

    return {
        "messages": [response],
        "final_summary": response.content
    }

def should_use_tools(state: AgentState) -> str:
    """Conditional edge: if the executor requested tool calls, run them."""
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "summarizer"

# ─────────────────────────────────────────────
# Graph assembly
# ─────────────────────────────────────────────

def build_research_agent():
    """Build and compile the planner -> executor -> summarizer graph."""
    graph = StateGraph(AgentState)

    graph.add_node("planner", planner_node)
    graph.add_node("executor", executor_node)
    graph.add_node("tools", tool_node)
    graph.add_node("summarizer", summarizer_node)

    graph.set_entry_point("planner")
    graph.add_edge("planner", "executor")
    graph.add_conditional_edges(
        "executor",
        should_use_tools,
        {"tools": "tools", "summarizer": "summarizer"}
    )
    graph.add_edge("tools", "executor")   # Loop back after tool use
    graph.add_edge("summarizer", END)

    return graph.compile()

# ─────────────────────────────────────────────
# Run it
# ─────────────────────────────────────────────

if __name__ == "__main__":
    agent = build_research_agent()

    questions = [
        "What are the key differences between transformer and mamba architectures for language models?",
        "If a company grows revenue 23% YoY from $4.2M, what is the new revenue?",
    ]

    for question in questions:
        print(f"\n{'='*60}")
        print(f"Question: {question}")

        result = agent.invoke({
            "messages": [HumanMessage(content=question)],
            "research_plan": "",
            "final_summary": ""
        })

        print(f"\nFinal Answer:\n{result['final_summary']}")
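
The calculate tool in the listing relies on eval, which is risky once the expression comes from a model or a user. As one possible alternative, the sketch below evaluates basic arithmetic by walking the parsed syntax tree and rejecting everything else. The name safe_eval is hypothetical, and the supported operator set is an assumption you may want to adjust.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / and % on numeric literals only."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))


revenue = safe_eval("4.2 * 1.23")  # the second sample question, in $M
```

Function calls, attribute access and names all raise ValueError here, so an input such as __import__('os') never executes.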

Step 4: Run your agent

After setting up your AI agent, run it by executing the following command in the terminal.

python agent.py

Once the AI agent completes its workflow, you should see the final answer for each question printed in the terminal.
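
While the agent runs, the messages list keeps growing. That is because the state declares messages as Annotated[list, operator.add], which tells LangGraph to combine each node's returned list with the existing one instead of overwriting it. The sketch below imitates that merge with plain Python lists. It is a simplified picture, not LangGraph's internals, and the message strings are made up.

```python
import operator

# Each node returns only its new messages; the reducer appends them to the state.
state = {"messages": ["user: question"]}
node_outputs = [
    {"messages": ["planner: plan"]},
    {"messages": ["executor: findings"]},
    {"messages": ["summarizer: answer"]},
]

for update in node_outputs:
    state["messages"] = operator.add(state["messages"], update["messages"])

print(len(state["messages"]))  # → 4
```

This is also why the summarizer can see the planner's and executor's output: by the time it runs, all earlier messages are in the state.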

Implementing Load balancing

In this section, you will learn how to implement load balancing in your AI agent using Bifrost. But before we get started, let’s understand what load balancing is.

What is load balancing in Bifrost?

In Bifrost, load balancing means distributing AI requests across multiple API keys and AI models instead of sending all traffic to a single key.

Sending all your AI agent traffic to a single API key can lead to the following issues:

You can hit rate limits
Requests can fail when traffic spikes
One key outage breaks everything
Performance becomes unstable

However, if you implement load balancing using Bifrost, your AI agent can:

Share traffic across several keys
Automatically switch if one key fails
Avoid overloading a single provider

Bifrost implements load balancing in two ways: weighted load balancing, and model whitelisting and filtering.

In weighted load balancing, you assign a weight to each API key. In the example below, openai-primary handles ~70% of requests and openai-secondary handles ~30%.

"providers": {
    "openai": {
      "keys": [
        {
          "name": "openai-primary",
          "value": "env.OPENAI_API_KEY_PRIMARY",
          "models": ["gpt-4o", "gpt-4o-mini"],
          "weight": 0.7
        },
        {
          "name": "openai-secondary",
          "value": "env.OPENAI_API_KEY_SECONDARY",
          "models": [],
          "weight": 0.3
        }
      ]
    }
}
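
To see how weights translate into a traffic split, here is a small simulation. It is only an illustration of weighted random selection using Python's random.choices, not a description of how Bifrost picks keys internally. The pick_key helper is a hypothetical name.

```python
import random
from collections import Counter

keys = [
    {"name": "openai-primary", "weight": 0.7},
    {"name": "openai-secondary", "weight": 0.3},
]


def pick_key(keys: list[dict], rng: random.Random) -> str:
    """Pick one key name with probability proportional to its weight."""
    names = [k["name"] for k in keys]
    weights = [k["weight"] for k in keys]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the demo is repeatable
counts = Counter(pick_key(keys, rng) for _ in range(10_000))
share = counts["openai-primary"] / 10_000  # close to 0.7 over many requests
```

The split is statistical, so a short burst of requests can deviate from 70/30 even though the long-run share converges to the weights.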

With model whitelisting and filtering, keys can be restricted to specific models for access control and cost management, as shown below. A request for gpt-4o will only be routed to premium-key, while a request for gpt-4o-mini will only use standard-key.

"providers": {
    "openai": {
      "keys": [
        {
          "name": "premium-key",
          "value": "env.OPENAI_API_KEY_PREMIUM",
          "models": ["gpt-4o", "o1-preview"],
          "weight": 1.0
        },
        {
          "name": "standard-key",
          "value": "env.OPENAI_API_KEY_STANDARD",
          "models": ["gpt-4o-mini"],
          "weight": 1.0
        }
      ]
    }
}
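
The filtering step can be sketched as a simple lookup over each key's models list. The sketch uses its own small key list and assumes that an empty models list means "no restriction", which is how the keys in the Step 2 config are written. The eligible_keys helper and catch-all-key are hypothetical and exist only for the demo.

```python
keys = [
    {"name": "premium-key", "models": ["gpt-4o", "o1-preview"]},
    {"name": "standard-key", "models": ["gpt-4o-mini"]},
    {"name": "catch-all-key", "models": []},  # hypothetical extra key for the demo
]


def eligible_keys(keys: list[dict], model: str) -> list[str]:
    """Return the names of keys allowed to serve `model`.

    Assumption: an empty `models` list means the key accepts any model.
    """
    return [k["name"] for k in keys if not k["models"] or model in k["models"]]


print(eligible_keys(keys, "gpt-4o"))           # → ['premium-key', 'catch-all-key']
print(eligible_keys(keys[:2], "gpt-4o-mini"))  # → ['standard-key']
```

Under that assumption, a key with an empty list takes part in every request, so strict separation needs an explicit model list on every key.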

You can learn more about load balancing in the Bifrost docs.

Implementing Routing Rules

In this section, you will learn how to implement routing rules in your AI agent. Before we get started, let’s understand what routing rules are in Bifrost.

What are routing rules in Bifrost?

Routing rules in Bifrost are like traffic signs for your AI requests. They tell Bifrost where to send each request based on conditions you define.

Instead of always using the same AI model or provider, routing rules let you say things like, “If the request looks like this, send it to that provider.”

For example:

If a request is for summarization, use a cheaper model
If the user is a premium customer, use a more powerful model
If one provider is too busy, send the request to another provider

Every time your AI agent sends a request, Bifrost checks these rules. The request is sent to the provider and model chosen by the first rule that matches.

If no rule matches the request, Bifrost simply falls back to its default behavior, such as load balancing between providers.

To implement routing rules, define them in the config.json file, as shown below.

"governance": {
    "routing_rules": [
      {
        "id": "rule-uuid-123",
        "name": "Premium Tier Route",
        "description": "Route premium users to fast provider",
        "enabled": true,
        "cel_expression": "headers[\"x-tier\"] == \"premium\"",
        "provider": "openrouter",
        "model": "claude-3.7-sonnet",
        "fallbacks": ["openrouter/gpt-4o", "openrouter/gpt-3.5-turbo"],
        "scope": "global",
        "scope_id": null,
        "priority": 10
      },
      {
        "id": "rule-uuid-456",
        "name": "Budget Overflow Route",
        "description": "Route to cheaper provider when budget is high",
        "enabled": true,
        "cel_expression": "budget_used > 85",
        "provider": "groq",
        "model": "llama-2-70b",
        "fallbacks": [],
        "scope": "team",
        "scope_id": "team-ml-ops",
        "priority": 5
      },
      {
        "id": "team-preference",
        "name": "ML Team Anthropic Route",
        "enabled": true,
        "cel_expression": "team_name == \"ml-research\"",
        "provider": "anthropic",
        "model": "claude-3-opus-20240229",
        "fallbacks": ["bedrock/claude-3-opus"],
        "scope": "team",
        "scope_id": "team-ml-research",
        "priority": 0
      }
    ]
  },
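
The first-match behaviour described above can be sketched in a few lines. Plain Python predicates stand in for the CEL expressions, and rules are checked in list order. This is a simplification: it does not model the priority or scope fields, and how Bifrost orders rules is best confirmed in its docs. The route function and Rule alias are hypothetical names.

```python
from typing import Callable, Optional

# Each rule is (name, predicate, target); predicates stand in for CEL expressions.
Rule = tuple[str, Callable[[dict], bool], str]

rules: list[Rule] = [
    (
        "Premium Tier Route",
        lambda req: req["headers"].get("x-tier") == "premium",
        "openrouter/claude-3.7-sonnet",
    ),
    (
        "Budget Overflow Route",
        lambda req: req.get("budget_used", 0) > 85,
        "groq/llama-2-70b",
    ),
]


def route(request: dict, rules: list[Rule]) -> Optional[str]:
    """Return the target of the first matching rule, or None for default behavior."""
    for _name, matches, target in rules:
        if matches(request):
            return target
    return None


premium = route({"headers": {"x-tier": "premium"}, "budget_used": 90}, rules)
```

Here the premium request matches both rules, but only the first one decides the target, which is why rule order matters.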

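After defining the routing rules, send requests that satisfy a rule's condition. For example, setting the x-tier header to premium on the planner model's requests matches the Premium Tier Route rule, as shown below.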

# Deep reasoning: complex planning and analysis
planner_llm = ChatOpenAI(
    model="anthropic/claude-3-7-sonnet-20250219",
    base_url=f"{BIFROST_URL}/v1",
    api_key="dummy",
    # Sent with every request; matches the "Premium Tier Route" rule
    default_headers={
        'x-tier': 'premium',
    },
    max_tokens=2048
)

You can learn more about routing rules in the Bifrost docs.

Conclusion

In conclusion, building AI agents that rely on a single LLM provider is risky and limits flexibility. By using an LLM gateway like Bifrost, you can create a multi-provider AI agent that is resilient, scalable, and future-proof.

Finally, if you’re building or scaling an AI application and performance is becoming a bottleneck, explore Bifrost and try it yourself:

🌐 Website: https://getmax.im/bifrost-home
📦 GitHub: https://git.new/bifrostrepo
📘 Docs (Quickstart): https://getmax.im/bifrostdocs

Top comments (8)

Anmol Baranwal • Feb 26

Nice Bonnie! I forgot I had to look into LLM gateways lol

Bonnie CrossPostr • Feb 28

LLM gateways are quite interesting.

They come in handy especially when building production ready AI agents.

klement Gunndu • Feb 24

Neat approach with the unified gateway — does Bifrost handle streaming responses the same way across all three providers, or do you lose SSE compatibility with some?

Bonnie CrossPostr • Feb 28

Good question, Klement

Bifrost normalizes streaming across all providers to OpenAI's SSE format. So you write your client code once for OpenAI-style streaming and it works regardless of whether the backend is Anthropic, Gemini, or OpenAI.

The gateway handles the translation - Anthropic has slightly different SSE event structure, Gemini via Vertex has its own format. Bifrost converts them all to the standard data: {...} format your client expects.

No compatibility loss - Bifrost uses streaming in production across all three providers without changing client code.

Just point to Bifrost, request streaming, and it works.

Jane Alesi • Mar 4

Strong write-up — especially the planner/executor/summarizer split.

One production pattern that helped us a lot: define an explicit routing SLO per path (e.g. p95 latency, error budget, cost/request), then auto-downgrade model class only when the path breaches budget for N consecutive windows.

It prevents “silent quality drift” from static fallback chains and gives you auditable routing decisions.

Have you tried exposing routing decisions as structured events (provider, model, rule_id, fallback_reason, latency_ms) for postmortems?

Hugo • May 23

Multi-provider agent routing is exactly where the industry is heading. The key challenge I've found isn't just connecting to multiple APIs - it's intelligently routing based on: 1) model strengths (coding vs creative vs analysis), 2) latency requirements (we run HK edge nodes for <200ms in APAC), and 3) cost optimization per request type. How does your gateway decide which provider to use for a given request? We use a combination of task classification and user-defined preferences, and it's reduced average costs significantly while maintaining quality.
