Self-hosted models (vLLM, Ollama, TGI) and cloud providers (OpenAI, Anthropic) require different configurations, API formats, and management. Bifrost provides a single unified interface for both—enabling seamless routing between self-hosted and cloud models without application code changes.
This guide shows how to configure Bifrost as a unified gateway for vLLM, Ollama, and cloud providers.
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is now running, with a web interface for visual configuration.
Why Unify Self-Hosted and Cloud Models?
Cost optimization: Route routine requests to inexpensive self-hosted models and reserve cloud models for demanding ones
Hybrid infrastructure: Use on-premises GPU capacity with cloud failover
Compliance: Keep sensitive data on self-hosted models, route general queries to cloud
Single interface: One API for all models, regardless of where they are deployed
Architecture
Application
↓ (single OpenAI-compatible API)
Bifrost Gateway
↓
├→ vLLM (self-hosted Llama 3)
├→ Ollama (self-hosted Mistral)
├→ OpenAI (cloud GPT-4o)
└→ Anthropic (cloud Claude)
Link to docs: https://docs.getbifrost.ai/deployment-guides/enterprise/overview#architecture
Result: Application uses identical code for all models. Bifrost handles routing transparently.
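Concretely, the same OpenAI-style request payload works against every backend; only the model string (namespaced as provider/model) changes. A minimal sketch, assuming the provider names configured later in this guide:

```python
# The same OpenAI-style payload works for every backend; only "model" changes.
def chat_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

models = [
    "vllm-local/meta-llama/Llama-3-8B-Instruct",  # self-hosted vLLM
    "ollama-local/mistral",                       # self-hosted Ollama
    "openai/gpt-4o-mini",                         # cloud OpenAI
    "anthropic/claude-3-5-haiku-20241022",        # cloud Anthropic
]
payloads = [chat_payload(m, "Hello!") for m in models]
```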
Step 1: Configure vLLM Provider
vLLM Setup (self-hosted OpenAI-compatible server):
# Start vLLM server with Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Add vLLM to Bifrost (Web UI):
- Go to "Providers" → "Add Provider"
- Select "Custom Provider"
- Configure:
  - Provider Name: vllm-local
  - Base URL: http://vllm-endpoint:8000
  - API Key: dummy (vLLM doesn't require a key)
  - Base Provider Type: OpenAI
  - Allowed Requests: Chat completion, streaming
- Save
Add vLLM to Bifrost (API):
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "vllm-local",
"keys": [
{
"name": "vllm-key-1",
"value": "dummy",
"weight": 1.0
}
],
"network_config": {
"base_url": "http://vllm-endpoint:8000",
"default_request_timeout_in_seconds": 60
},
"custom_provider_config": {
"base_provider_type": "openai",
"allowed_requests": {
"chat_completion": true,
"chat_completion_stream": true
}
}
}'
Test vLLM through Bifrost:
Link to docs: https://docs.getbifrost.ai/providers/supported-providers/vllm#vllm
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="your-api-key"
)
response = client.chat.completions.create(
model="vllm-local/meta-llama/Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Step 2: Configure Ollama Provider
Link to docs: https://docs.getbifrost.ai/providers/supported-providers/ollama
Ollama Setup (self-hosted):
# Start Ollama with Mistral
ollama serve
ollama pull mistral
Add Ollama to Bifrost (Web UI):
- Go to "Providers" → "Add Provider"
- Select "Custom Provider"
- Configure:
  - Provider Name: ollama-local
  - Base URL: http://ollama-endpoint:11434
  - API Key: dummy
  - Base Provider Type: OpenAI
- Save
Add Ollama to Bifrost (API):
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "ollama-local",
"keys": [
{
"name": "ollama-key-1",
"value": "dummy",
"weight": 1.0
}
],
"network_config": {
"base_url": "http://ollama-endpoint:11434",
"default_request_timeout_in_seconds": 60
},
"custom_provider_config": {
"base_provider_type": "openai",
"allowed_requests": {
"chat_completion": true,
"chat_completion_stream": true
}
}
}'
Test Ollama through Bifrost:
response = client.chat.completions.create(
model="ollama-local/mistral",
messages=[{"role": "user", "content": "Hello!"}]
)
Step 3: Add Cloud Providers
Configure OpenAI:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"keys": [
{
"name": "openai-key-1",
"value": "env.OPENAI_API_KEY",
"weight": 1.0
}
]
}'
Configure Anthropic:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "anthropic",
"keys": [
{
"name": "anthropic-key-1",
"value": "env.ANTHROPIC_API_KEY",
"weight": 1.0
}
]
}'
Step 4: Unified Routing with Virtual Keys
Link to docs: https://docs.getbifrost.ai/features/governance/virtual-keys#virtual-keys
Create a virtual key that routes across all providers:
Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"allowed_models": ["meta-llama/Llama-3-8B-Instruct"],
"weight": 0.5
},
{
"provider": "ollama-local",
"allowed_models": ["mistral"],
"weight": 0.3
},
{
"provider": "openai",
"allowed_models": ["gpt-4o-mini"],
"weight": 0.15
},
{
"provider": "anthropic",
"allowed_models": ["claude-3-5-haiku-20241022"],
"weight": 0.05
}
]
}'
Routing Strategy:
- 50% traffic → vLLM (cheapest, self-hosted)
- 30% traffic → Ollama (self-hosted backup)
- 15% traffic → OpenAI (cloud premium)
- 5% traffic → Anthropic (cloud fallback)
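Conceptually, weighted routing amounts to a weighted random draw over the configured providers. The sketch below illustrates that behavior; it is not Bifrost's actual implementation:

```python
import random

# (provider, weight) pairs mirroring the vk-unified config above
WEIGHTS = [
    ("vllm-local", 0.5),
    ("ollama-local", 0.3),
    ("openai", 0.15),
    ("anthropic", 0.05),
]

def pick_provider(weights, rng=random.random):
    """Weighted random choice; rng is injectable for testing."""
    r = rng() * sum(w for _, w in weights)
    for provider, w in weights:
        r -= w
        if r < 0:
            return provider
    return weights[-1][0]  # guard against floating-point edge cases
```

Over many requests, roughly 50% land on vLLM, 30% on Ollama, and so on; Bifrost applies this server-side, so the application never sees it.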
Use Case Examples
Use Case 1: Cost Optimization
Route simple queries to self-hosted models and complex ones to the cloud.
Free-tier Virtual Key (self-hosted only):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-free \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 1.0
}
],
"budget": {
"max_limit": 0,
"reset_duration": "1d"
}
}'
Premium Virtual Key (cloud models):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-premium \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"allowed_models": ["gpt-4o"],
"weight": 1.0
}
],
"budget": {
"max_limit": 100,
"reset_duration": "1M"
}
}'
Application:
# Free tier users → self-hosted
client_free = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-free"
)
# Premium users → cloud models
client_premium = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-premium"
)
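The two clients above differ only in their virtual key. A hypothetical helper for dispatching by tier (the tier names and helper are illustrative, not part of Bifrost; the keys match the ones created above):

```python
# Hypothetical tier → virtual-key mapping
TIER_KEYS = {"free": "vk-free", "premium": "vk-premium"}
GATEWAY_URL = "http://localhost:8080/v1"

def client_kwargs(tier: str) -> dict:
    """Arguments for OpenAI(...); only the virtual key differs per tier."""
    return {"base_url": GATEWAY_URL, "api_key": TIER_KEYS[tier]}
```

Usage: `OpenAI(**client_kwargs(user_tier))` — budget enforcement happens server-side per key.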
Use Case 2: Compliance and Data Sovereignty
Keep sensitive data on-premises, route general queries to cloud.
On-Premises Virtual Key (sensitive data):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-sensitive \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 1.0
},
{
"provider": "ollama-local",
"weight": 0.0
}
]
}'
Cloud Virtual Key (general queries):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-general \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"weight": 0.8
},
{
"provider": "anthropic",
"weight": 0.2
}
]
}'
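With both keys in place, routing reduces to selecting a virtual key in the application, for example based on a data-classification flag (how requests get classified is out of scope; this function is purely illustrative):

```python
def select_virtual_key(is_sensitive: bool) -> str:
    # Sensitive payloads stay on-premises via vk-sensitive;
    # everything else may use the cloud-backed vk-general.
    return "vk-sensitive" if is_sensitive else "vk-general"
```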
Use Case 3: Hybrid High Availability
Self-hosted primary with cloud failover.
Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-hybrid \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 0.9
},
{
"provider": "openai",
"weight": 0.1
}
]
}'
Behavior:
- Primary: vLLM handles 90% of traffic
- Failover: If vLLM is unavailable, requests automatically route to OpenAI
- No application code changes
Observability Across All Providers
Built-in Dashboard (http://localhost:8080):
- Real-time request logs (vLLM + Ollama + cloud)
- Token usage per model
- Cost tracking (cloud providers)
- Latency comparison (self-hosted vs cloud)
Prometheus Metrics:
# Requests by provider
sum by (provider) (rate(bifrost_requests_total[5m]))
# Compare latency: self-hosted vs cloud
avg by (provider) (bifrost_request_duration_seconds)
# Cost tracking (cloud only)
sum by (provider) (bifrost_cost_total)
Example Query:
# Percentage of traffic on self-hosted vs cloud
sum(rate(bifrost_requests_total{provider=~"vllm.*|ollama.*"}[5m]))
/
sum(rate(bifrost_requests_total[5m])) * 100
Advanced: Weighted Routing by Request Type
Route based on request characteristics.
Development Virtual Key (fast iteration on self-hosted):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "ollama-local",
"weight": 1.0
}
],
"rate_limit": {
"request_max_limit": 1000,
"request_reset_duration": "1h"
}
}'
Production Virtual Key (cloud reliability):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"weight": 0.7
},
{
"provider": "vllm-local",
"weight": 0.3
}
]
}'
Complete Setup Example
1. Start self-hosted models:
# vLLM
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
# Ollama
ollama serve
ollama pull mistral
2. Start Bifrost:
npx -y @maximhq/bifrost
3. Configure all providers (via Web UI at http://localhost:8080)
4. Create unified virtual key:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{"provider": "vllm-local", "weight": 0.5},
{"provider": "ollama-local", "weight": 0.3},
{"provider": "openai", "weight": 0.2}
]
}'
5. Use in application:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-unified"
)
# Automatically routes across vLLM, Ollama, and OpenAI
response = client.chat.completions.create(
model="gpt-4o-mini", # Model name determines routing
messages=[{"role": "user", "content": "Hello!"}]
)
Benefits
Single interface: One API for all models, self-hosted and cloud
Cost optimization: Route routine requests to self-hosted models and reserve cloud models for demanding ones
High availability: Automatic failover from self-hosted to cloud
Data sovereignty: Keep sensitive data on-premises
Observability: Unified monitoring across all providers
Zero code changes: The application doesn't need to know about the underlying infrastructure
Get Started
Install Bifrost:
npx -y @maximhq/bifrost
Docs: https://getmax.im/bifrostdocs
GitHub: https://git.new/bifrost
Key Takeaway: Bifrost provides a unified interface for vLLM, Ollama, and cloud providers (OpenAI, Anthropic) through a single OpenAI-compatible API. Enable cost optimization (route to self-hosted), compliance (keep sensitive data on-premises), and high availability (automatic cloud failover) without application code changes.
