Debby McKinney
Using Bifrost as Unified Gateway for vLLM and OpenAI-Compatible Endpoints

Self-hosted models (vLLM, Ollama, TGI) and cloud providers (OpenAI, Anthropic) require different configurations, API formats, and management. Bifrost provides a single unified interface for both—enabling seamless routing between self-hosted and cloud models without application code changes.

This guide shows how to configure Bifrost as a unified gateway for vLLM, Ollama, and cloud providers.

GitHub: maximhq / bifrost

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost AI Gateway


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Get started

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration…


Why Unify Self-Hosted and Cloud Models?

Cost optimization: Route routine requests to self-hosted models and reserve paid cloud models for the requests that need them

Hybrid infrastructure: Leverage on-premises GPU capacity with cloud failover

Compliance: Keep sensitive data on self-hosted, route general queries to cloud

Single interface: One API for all models regardless of deployment
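That single interface hinges on one convention: every request names its target with a provider-prefixed model string, such as `vllm-local/meta-llama/Llama-3-8B-Instruct` or `openai/gpt-4o-mini`. A minimal sketch of that convention — illustrative only, not Bifrost's actual parser:

```python
def split_model(model: str) -> tuple[str, str]:
    """Split a provider-prefixed model string into (provider, model name).
    Illustrates Bifrost's naming convention; not its real implementation."""
    provider, _, name = model.partition("/")
    return provider, name

# Self-hosted and cloud targets use the same convention:
print(split_model("vllm-local/meta-llama/Llama-3-8B-Instruct"))
# → ('vllm-local', 'meta-llama/Llama-3-8B-Instruct')
print(split_model("openai/gpt-4o-mini"))
# → ('openai', 'gpt-4o-mini')
```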


Architecture

Application
    ↓ (single OpenAI-compatible API)
Bifrost Gateway
    ↓
    ├→ vLLM (self-hosted Llama 3)
    ├→ Ollama (self-hosted Mistral)
    ├→ OpenAI (cloud GPT-4o)
    └→ Anthropic (cloud Claude)

Link to docs: https://docs.getbifrost.ai/deployment-guides/enterprise/overview#architecture

Result: Application uses identical code for all models. Bifrost handles routing transparently.
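Concretely, the request body is the same OpenAI-style payload for every backend; only the model string changes. A stdlib-only sketch (model names taken from the configurations in this guide):

```python
def chat_payload(model: str, content: str) -> dict:
    """Build an OpenAI-style chat request body; the shape is
    identical regardless of which backend the model string targets."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

# Same call shape for self-hosted and cloud targets:
for model in (
    "vllm-local/meta-llama/Llama-3-8B-Instruct",  # self-hosted vLLM
    "ollama-local/mistral",                       # self-hosted Ollama
    "openai/gpt-4o-mini",                         # cloud OpenAI
):
    payload = chat_payload(model, "Hello!")
    print(payload["model"])
```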


Step 1: Configure vLLM Provider

vLLM Setup (self-hosted OpenAI-compatible server):

# Start vLLM server with Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Add vLLM to Bifrost (Web UI):

  1. Go to "Providers" → "Add Provider"
  2. Select "Custom Provider"
  3. Configure:
    • Provider Name: vllm-local
    • Base URL: http://vllm-endpoint:8000
    • API Key: dummy (vLLM doesn't require an API key)
    • Base Provider Type: OpenAI
    • Allowed Requests: Chat completion, streaming
  4. Save

Add vLLM to Bifrost (API):

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vllm-local",
    "keys": [
      {
        "name": "vllm-key-1",
        "value": "dummy",
        "weight": 1.0
      }
    ],
    "network_config": {
      "base_url": "http://vllm-endpoint:8000",
      "default_request_timeout_in_seconds": 60
    },
    "custom_provider_config": {
      "base_provider_type": "openai",
      "allowed_requests": {
        "chat_completion": true,
        "chat_completion_stream": true
      }
    }
  }'

Test vLLM through Bifrost:

Link to docs: https://docs.getbifrost.ai/providers/supported-providers/vllm#vllm

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="vllm-local/meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)

Step 2: Configure Ollama Provider

Link to docs: https://docs.getbifrost.ai/providers/supported-providers/ollama

Ollama Setup (self-hosted):

# Start the Ollama server (runs in the foreground; background it or use a second shell)
ollama serve &

# Pull the Mistral model
ollama pull mistral

Add Ollama to Bifrost (Web UI):

  1. Go to "Providers" → "Add Provider"
  2. Select "Custom Provider"
  3. Configure:
    • Provider Name: ollama-local
    • Base URL: http://ollama-endpoint:11434
    • API Key: dummy
    • Base Provider Type: OpenAI
  4. Save

Add Ollama to Bifrost (API):

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "ollama-local",
    "keys": [
      {
        "name": "ollama-key-1",
        "value": "dummy",
        "weight": 1.0
      }
    ],
    "network_config": {
      "base_url": "http://ollama-endpoint:11434",
      "default_request_timeout_in_seconds": 60
    },
    "custom_provider_config": {
      "base_provider_type": "openai",
      "allowed_requests": {
        "chat_completion": true,
        "chat_completion_stream": true
      }
    }
  }'

Test Ollama through Bifrost:

response = client.chat.completions.create(
    model="ollama-local/mistral",
    messages=[{"role": "user", "content": "Hello!"}]
)

Step 3: Add Cloud Providers

Configure OpenAI:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "keys": [
      {
        "name": "openai-key-1",
        "value": "env.OPENAI_API_KEY",
        "weight": 1.0
      }
    ]
  }'

Configure Anthropic:

curl -X POST http://localhost:8080/api/providers \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "anthropic",
    "keys": [
      {
        "name": "anthropic-key-1",
        "value": "env.ANTHROPIC_API_KEY",
        "weight": 1.0
      }
    ]
  }'

Step 4: Unified Routing with Virtual Keys

Link to docs: https://docs.getbifrost.ai/features/governance/virtual-keys#virtual-keys

Create a virtual key that routes across all providers:

Configuration:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "vllm-local",
        "allowed_models": ["meta-llama/Llama-3-8B-Instruct"],
        "weight": 0.5
      },
      {
        "provider": "ollama-local",
        "allowed_models": ["mistral"],
        "weight": 0.3
      },
      {
        "provider": "openai",
        "allowed_models": ["gpt-4o-mini"],
        "weight": 0.15
      },
      {
        "provider": "anthropic",
        "allowed_models": ["claude-3-5-haiku-20241022"],
        "weight": 0.05
      }
    ]
  }'

Routing Strategy:

  • 50% traffic → vLLM (cheapest, self-hosted)
  • 30% traffic → Ollama (self-hosted backup)
  • 15% traffic → OpenAI (cloud premium)
  • 5% traffic → Anthropic (cloud fallback)
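To sanity-check what those weights imply over many requests, here is a stdlib-only simulation of weighted selection. This is not Bifrost's internal scheduler — just an illustration of the distribution the vk-unified weights produce:

```python
import random
from collections import Counter

# Weights from the vk-unified configuration above
WEIGHTS = {"vllm-local": 0.5, "ollama-local": 0.3, "openai": 0.15, "anthropic": 0.05}

random.seed(0)  # reproducible demo
draws = random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=10_000)

# Observed share of simulated traffic per provider
for provider, count in Counter(draws).most_common():
    print(f"{provider:12s} {count / len(draws):6.1%}")
```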

Use Case Examples

Use Case 1: Cost Optimization

Route simple queries to self-hosted models and complex ones to the cloud.

Free-tier Virtual Key (self-hosted only):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-free \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "vllm-local",
        "weight": 1.0
      }
    ],
    "budget": {
      "max_limit": 0,
      "reset_duration": "1d"
    }
  }'

Premium Virtual Key (cloud models):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-premium \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "openai",
        "allowed_models": ["gpt-4o"],
        "weight": 1.0
      }
    ],
    "budget": {
      "max_limit": 100,
      "reset_duration": "1M"
    }
  }'

Application:

# Free tier users → self-hosted
client_free = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-free"
)

# Premium users → cloud models
client_premium = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-premium"
)
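Choosing the right client then reduces to a key lookup. The helper below is hypothetical — the `vk-free` / `vk-premium` names come from the virtual keys configured above:

```python
# Hypothetical tier-to-virtual-key mapping; key names match the
# vk-free / vk-premium virtual keys configured above.
TIER_KEYS = {"free": "vk-free", "premium": "vk-premium"}

def virtual_key_for(tier: str) -> str:
    """Return the Bifrost virtual key that governs this tier's routing and budget."""
    if tier not in TIER_KEYS:
        raise ValueError(f"unknown tier: {tier}")
    return TIER_KEYS[tier]

print(virtual_key_for("free"))     # → vk-free
print(virtual_key_for("premium"))  # → vk-premium
```

Pass the returned value as `api_key` when constructing the OpenAI client pointed at Bifrost, as in the snippet above.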

Use Case 2: Compliance and Data Sovereignty

Keep sensitive data on-premises and route general queries to the cloud.

On-Premises Virtual Key (sensitive data):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-sensitive \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "vllm-local",
        "weight": 1.0
      },
      {
        "provider": "ollama-local",
        "weight": 0.0
      }
    ]
  }'

Cloud Virtual Key (general queries):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-general \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "openai",
        "weight": 0.8
      },
      {
        "provider": "anthropic",
        "weight": 0.2
      }
    ]
  }'

Use Case 3: Hybrid High Availability

Self-hosted primary with cloud failover.

Configuration:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-hybrid \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "vllm-local",
        "weight": 0.9
      },
      {
        "provider": "openai",
        "weight": 0.1
      }
    ]
  }'

Behavior:

  • Primary: vLLM handles 90% of traffic
  • Failover: If vLLM is unavailable, Bifrost automatically routes to OpenAI
  • No application code changes
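Bifrost performs this failover server-side, so the application never sees it — but the underlying pattern is simply try-primary-then-fall-back. A stdlib-only sketch with stub backends, for intuition only (not Bifrost's implementation):

```python
def complete_with_failover(primary, fallback, prompt: str) -> str:
    """Try the primary backend; on any error, fall back to the secondary.
    primary/fallback are callables standing in for provider calls."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)

# Demo with stubs: the self-hosted primary is down, cloud takes over.
def vllm_stub(prompt: str) -> str:
    raise ConnectionError("vLLM unreachable")

def openai_stub(prompt: str) -> str:
    return f"cloud answer to: {prompt}"

print(complete_with_failover(vllm_stub, openai_stub, "Hello!"))
# → cloud answer to: Hello!
```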

Observability Across All Providers

Built-in Dashboard (http://localhost:8080):

  • Real-time request logs (vLLM + Ollama + cloud)
  • Token usage per model
  • Cost tracking (cloud providers)
  • Latency comparison (self-hosted vs cloud)

Prometheus Metrics:

# Requests per provider
sum by (provider) (rate(bifrost_requests_total[5m]))

# Compare latency: self-hosted vs cloud
avg by (provider) (bifrost_request_duration_seconds)

# Cost tracking (cloud only)
sum by (provider) (bifrost_cost_total)

Example Query:

# Percentage of traffic on self-hosted vs cloud
sum(rate(bifrost_requests_total{provider=~"vllm.*|ollama.*"}[5m])) 
/ 
sum(rate(bifrost_requests_total[5m])) * 100

Advanced: Weighted Routing by Request Type

Use separate virtual keys to give different workloads different routing — for example, development traffic to self-hosted models and production traffic weighted toward the cloud.

Development Virtual Key (fast iteration on self-hosted):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "ollama-local",
        "weight": 1.0
      }
    ],
    "rate_limit": {
      "request_max_limit": 1000,
      "request_reset_duration": "1h"
    }
  }'

Production Virtual Key (cloud reliability):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-prod \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {
        "provider": "openai",
        "weight": 0.7
      },
      {
        "provider": "vllm-local",
        "weight": 0.3
      }
    ]
  }'

Complete Setup Example

1. Start self-hosted models:

# vLLM
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000

# Ollama (serve runs in the foreground; background it before pulling)
ollama serve &
ollama pull mistral

2. Start Bifrost:

npx -y @maximhq/bifrost

3. Configure all providers (via Web UI at http://localhost:8080)

4. Create unified virtual key:

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-unified \
  -H "Content-Type: application/json" \
  -d '{
    "provider_configs": [
      {"provider": "vllm-local", "weight": 0.5},
      {"provider": "ollama-local", "weight": 0.3},
      {"provider": "openai", "weight": 0.2}
    ]
  }'

5. Use in application:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="vk-unified"
)

# Automatically routes across vLLM, Ollama, and OpenAI
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Model name determines routing
    messages=[{"role": "user", "content": "Hello!"}]
)

Benefits

Single interface: One API for all models (self-hosted + cloud)

Cost optimization: Route routine requests to self-hosted models; pay for cloud models only when needed

High availability: Automatic failover from self-hosted to cloud

Data sovereignty: Keep sensitive data on-premises

Observability: Unified monitoring across all providers

Zero code changes: Application doesn't know about underlying infrastructure


Get Started

Install Bifrost:

npx -y @maximhq/bifrost

Docs: https://getmax.im/bifrostdocs

GitHub: https://git.new/bifrost


Key Takeaway: Bifrost provides a unified interface for vLLM, Ollama, and cloud providers (OpenAI, Anthropic) through a single OpenAI-compatible API. Enable cost optimization (route to self-hosted), compliance (keep sensitive data on-premises), and high availability (automatic cloud failover) without application code changes.
