Self-hosted models (vLLM, Ollama, TGI) and cloud providers (OpenAI, Anthropic) require different configurations, API formats, and management. Bifrost provides a single unified interface for both—enabling seamless routing between self-hosted and cloud models without application code changes.
This guide shows how to configure Bifrost as a unified gateway for vLLM, Ollama, and cloud providers.
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is now running, with a web interface for visual configuration.
Why Unify Self-Hosted and Cloud Models?
Cost optimization: Route routine requests to inexpensive self-hosted models and reserve cloud models for demanding ones
Hybrid infrastructure: Use on-premises GPU capacity with cloud failover
Compliance: Keep sensitive data on self-hosted models, route general queries to cloud
Single interface: One API for all models, regardless of where they are deployed
Architecture
Application
↓ (single OpenAI-compatible API)
Bifrost Gateway
↓
├→ vLLM (self-hosted Llama 3)
├→ Ollama (self-hosted Mistral)
├→ OpenAI (cloud GPT-4o)
└→ Anthropic (cloud Claude)
Link to docs: https://docs.getbifrost.ai/deployment-guides/enterprise/overview#architecture
Result: Application uses identical code for all models. Bifrost handles routing transparently.
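Concretely, the same OpenAI-style request payload works against every backend; only the model string (namespaced as provider/model) changes. A minimal sketch, assuming the provider names configured later in this guide:

```python
# The same OpenAI-style payload works for every backend; only "model" changes.
def chat_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

models = [
    "vllm-local/meta-llama/Llama-3-8B-Instruct",  # self-hosted vLLM
    "ollama-local/mistral",                       # self-hosted Ollama
    "openai/gpt-4o-mini",                         # cloud OpenAI
    "anthropic/claude-3-5-haiku-20241022",        # cloud Anthropic
]
payloads = [chat_payload(m, "Hello!") for m in models]
```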
Step 1: Configure vLLM Provider
vLLM Setup (self-hosted OpenAI-compatible server):
# Start vLLM server with Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000
Add vLLM to Bifrost (Web UI):
- Go to "Providers" → "Add Provider"
- Select "Custom Provider"
- Configure:
  - Provider Name: vllm-local
  - Base URL: http://vllm-endpoint:8000
  - API Key: dummy (vLLM doesn't require a key)
  - Base Provider Type: OpenAI
  - Allowed Requests: Chat completion, streaming
- Save
Add vLLM to Bifrost (API):
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "vllm-local",
"keys": [
{
"name": "vllm-key-1",
"value": "dummy",
"weight": 1.0
}
],
"network_config": {
"base_url": "http://vllm-endpoint:8000",
"default_request_timeout_in_seconds": 60
},
"custom_provider_config": {
"base_provider_type": "openai",
"allowed_requests": {
"chat_completion": true,
"chat_completion_stream": true
}
}
}'
Test vLLM through Bifrost:
Link to docs: https://docs.getbifrost.ai/providers/supported-providers/vllm#vllm
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="your-api-key"
)
response = client.chat.completions.create(
model="vllm-local/meta-llama/Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Step 2: Configure Ollama Provider
Link to docs: https://docs.getbifrost.ai/providers/supported-providers/ollama
Ollama Setup (self-hosted):
# Start Ollama with Mistral
ollama serve
ollama pull mistral
Add Ollama to Bifrost (Web UI):
- Go to "Providers" → "Add Provider"
- Select "Custom Provider"
- Configure:
  - Provider Name: ollama-local
  - Base URL: http://ollama-endpoint:11434
  - API Key: dummy
  - Base Provider Type: OpenAI
- Save
Add Ollama to Bifrost (API):
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "ollama-local",
"keys": [
{
"name": "ollama-key-1",
"value": "dummy",
"weight": 1.0
}
],
"network_config": {
"base_url": "http://ollama-endpoint:11434",
"default_request_timeout_in_seconds": 60
},
"custom_provider_config": {
"base_provider_type": "openai",
"allowed_requests": {
"chat_completion": true,
"chat_completion_stream": true
}
}
}'
Test Ollama through Bifrost:
response = client.chat.completions.create(
model="ollama-local/mistral",
messages=[{"role": "user", "content": "Hello!"}]
)
Step 3: Add Cloud Providers
Configure OpenAI:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"keys": [
{
"name": "openai-key-1",
"value": "env.OPENAI_API_KEY",
"weight": 1.0
}
]
}'
Configure Anthropic:
curl -X POST http://localhost:8080/api/providers \
-H "Content-Type: application/json" \
-d '{
"provider": "anthropic",
"keys": [
{
"name": "anthropic-key-1",
"value": "env.ANTHROPIC_API_KEY",
"weight": 1.0
}
]
}'
Step 4: Unified Routing with Virtual Keys
Link to docs: https://docs.getbifrost.ai/features/governance/virtual-keys#virtual-keys
Create a virtual key that routes across all providers:
Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"allowed_models": ["meta-llama/Llama-3-8B-Instruct"],
"weight": 0.5
},
{
"provider": "ollama-local",
"allowed_models": ["mistral"],
"weight": 0.3
},
{
"provider": "openai",
"allowed_models": ["gpt-4o-mini"],
"weight": 0.15
},
{
"provider": "anthropic",
"allowed_models": ["claude-3-5-haiku-20241022"],
"weight": 0.05
}
]
}'
Routing Strategy:
- 50% traffic → vLLM (cheapest, self-hosted)
- 30% traffic → Ollama (self-hosted backup)
- 15% traffic → OpenAI (cloud premium)
- 5% traffic → Anthropic (cloud fallback)
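Conceptually, weighted routing amounts to a weighted random draw over the configured providers. The sketch below illustrates that behavior; it is not Bifrost's actual implementation:

```python
import random

# (provider, weight) pairs mirroring the vk-unified config above
WEIGHTS = [
    ("vllm-local", 0.5),
    ("ollama-local", 0.3),
    ("openai", 0.15),
    ("anthropic", 0.05),
]

def pick_provider(weights, rng=random.random):
    """Weighted random choice; rng is injectable for testing."""
    r = rng() * sum(w for _, w in weights)
    for provider, w in weights:
        r -= w
        if r < 0:
            return provider
    return weights[-1][0]  # guard against floating-point edge cases
```

Over many requests, roughly 50% land on vLLM, 30% on Ollama, and so on; Bifrost applies this server-side, so the application never sees it.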
Use Case Examples
Use Case 1: Cost Optimization
Route simple queries to self-hosted models and complex ones to the cloud.
Free-tier Virtual Key (self-hosted only):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-free \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 1.0
}
],
"budget": {
"max_limit": 0,
"reset_duration": "1d"
}
}'
Premium Virtual Key (cloud models):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-premium \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"allowed_models": ["gpt-4o"],
"weight": 1.0
}
],
"budget": {
"max_limit": 100,
"reset_duration": "1M"
}
}'
Application:
# Free tier users → self-hosted
client_free = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-free"
)
# Premium users → cloud models
client_premium = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-premium"
)
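The two clients above differ only in their virtual key. A hypothetical helper for dispatching by tier (the tier names and helper are illustrative, not part of Bifrost; the keys match the ones created above):

```python
# Hypothetical tier → virtual-key mapping
TIER_KEYS = {"free": "vk-free", "premium": "vk-premium"}
GATEWAY_URL = "http://localhost:8080/v1"

def client_kwargs(tier: str) -> dict:
    """Arguments for OpenAI(...); only the virtual key differs per tier."""
    return {"base_url": GATEWAY_URL, "api_key": TIER_KEYS[tier]}
```

Usage: `OpenAI(**client_kwargs(user_tier))` — budget enforcement happens server-side per key.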
Use Case 2: Compliance and Data Sovereignty
Keep sensitive data on-premises, route general queries to cloud.
On-Premises Virtual Key (sensitive data):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-sensitive \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 1.0
},
{
"provider": "ollama-local",
"weight": 0.0
}
]
}'
Cloud Virtual Key (general queries):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-general \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"weight": 0.8
},
{
"provider": "anthropic",
"weight": 0.2
}
]
}'
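With both keys in place, routing reduces to selecting a virtual key in the application, for example based on a data-classification flag (how requests get classified is out of scope; this function is purely illustrative):

```python
def select_virtual_key(is_sensitive: bool) -> str:
    # Sensitive payloads stay on-premises via vk-sensitive;
    # everything else may use the cloud-backed vk-general.
    return "vk-sensitive" if is_sensitive else "vk-general"
```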
Use Case 3: Hybrid High Availability
Self-hosted primary with cloud failover.
Configuration:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-hybrid \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "vllm-local",
"weight": 0.9
},
{
"provider": "openai",
"weight": 0.1
}
]
}'
Behavior:
- Primary: vLLM handles 90% of traffic
- Failover: If vLLM is unavailable, requests automatically route to OpenAI
- No application code changes
Observability Across All Providers
Built-in Dashboard (http://localhost:8080):
- Real-time request logs (vLLM + Ollama + cloud)
- Token usage per model
- Cost tracking (cloud providers)
- Latency comparison (self-hosted vs cloud)
Prometheus Metrics:
# Requests by provider
sum by (provider) (rate(bifrost_requests_total[5m]))
# Compare latency: self-hosted vs cloud
avg by (provider) (bifrost_request_duration_seconds)
# Cost tracking (cloud only)
sum by (provider) (bifrost_cost_total)
Example Query:
# Percentage of traffic on self-hosted vs cloud
sum(rate(bifrost_requests_total{provider=~"vllm.*|ollama.*"}[5m]))
/
sum(rate(bifrost_requests_total[5m])) * 100
Advanced: Weighted Routing by Request Type
Route based on request characteristics.
Development Virtual Key (fast iteration on self-hosted):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "ollama-local",
"weight": 1.0
}
],
"rate_limit": {
"request_max_limit": 1000,
"request_reset_duration": "1h"
}
}'
Production Virtual Key (cloud reliability):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{
"provider": "openai",
"weight": 0.7
},
{
"provider": "vllm-local",
"weight": 0.3
}
]
}'
Complete Setup Example
1. Start self-hosted models:
# vLLM
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000
# Ollama
ollama serve
ollama pull mistral
2. Start Bifrost:
npx -y @maximhq/bifrost
3. Configure all providers (via Web UI at http://localhost:8080)
4. Create unified virtual key:
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
-H "Content-Type: application/json" \
-d '{
"provider_configs": [
{"provider": "vllm-local", "weight": 0.5},
{"provider": "ollama-local", "weight": 0.3},
{"provider": "openai", "weight": 0.2}
]
}'
5. Use in application:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="vk-unified"
)
# Automatically routes across vLLM, Ollama, and OpenAI
response = client.chat.completions.create(
model="gpt-4o-mini", # Model name determines routing
messages=[{"role": "user", "content": "Hello!"}]
)
Benefits
Single interface: One API for all models, self-hosted and cloud
Cost optimization: Route routine requests to self-hosted models and reserve cloud models for demanding ones
High availability: Automatic failover from self-hosted to cloud
Data sovereignty: Keep sensitive data on-premises
Observability: Unified monitoring across all providers
Zero code changes: The application doesn't need to know about the underlying infrastructure
Get Started
Install Bifrost:
npx -y @maximhq/bifrost
Docs: https://getmax.im/bifrostdocs
GitHub: https://git.new/bifrost
Key Takeaway: Bifrost provides a unified interface for vLLM, Ollama, and cloud providers (OpenAI, Anthropic) through a single OpenAI-compatible API. Enable cost optimization (route to self-hosted), compliance (keep sensitive data on-premises), and high availability (automatic cloud failover) without application code changes.
