If you're running an LLM gateway for multiple teams or customers, you need budget controls that actually work. Not just API key quotas. Not manual spending alerts. Real hierarchical budgets that prevent any single team from draining your entire monthly allocation.
Bifrost's four-tier budget system lets you set limits at Customer, Team, Virtual Key, and Provider levels. Each tier checks independently. Any violation blocks the request. No database queries. No network calls. Just in-memory validation with 11µs overhead.
Here's how to configure it.
The Budget Hierarchy
Four tiers, checked in order on every request:
Customer Budget - Your organization's total monthly cap. Blocks everything when exceeded.
Team Budget - Department or project limits. Multiple teams share the customer budget.
Virtual Key Budget - Per-API-key spending caps. Teams create multiple VKs for different services.
Provider Config Budget - Granular control per AI provider (OpenAI, Anthropic, etc.) within a single VK.
All four are optional. Use only the tiers you need.
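The check order above can be sketched in a few lines. This is an illustrative model, not Bifrost's actual internal API: the class and function names are assumptions, and a tier left as `None` is simply skipped, matching "use only the tiers you need."

```python
# Hypothetical sketch of the four-tier check order (names are illustrative).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Budget:
    requests: int   # max requests allowed in the current period
    used: int = 0   # requests consumed so far

    def allows(self) -> bool:
        return self.used < self.requests

def check_budgets(customer: Optional[Budget],
                  team: Optional[Budget],
                  vk: Optional[Budget],
                  provider: Optional[Budget]) -> Optional[str]:
    """Return the first violated tier, or None if the request may proceed."""
    for name, budget in (("customer", customer), ("team", team),
                         ("virtual_key", vk), ("provider", provider)):
        if budget is not None and not budget.allows():
            return name
    return None
```

Because the tiers are checked in order, an exhausted customer budget blocks the request even when the team and VK budgets still have headroom.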
Basic Configuration
Set up a virtual key with team and customer budgets:
```yaml
customer_budget:
  requests: 1000000
  tokens: 50000000
  duration: "1M"
team_budget:
  requests: 100000
  tokens: 5000000
  duration: "1M"
virtual_key_budget:
  requests: 10000
  tokens: 500000
  duration: "1w"
```
Now every request checks three budget tiers. If the VK hits 10,000 requests in one week, requests fail with 402 Payment Required. If the team hits 100,000 requests in one month, same thing. If the customer hits 1 million, everything stops.
Reset Durations
Budgets reset automatically based on duration:
- `1m` - One minute (testing only)
- `1h` - One hour
- `1d` - One day
- `1w` - One week
- `1M` - One month
No cron jobs. No manual resets. The system tracks when each budget period started and resets counters when the duration elapses.
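The lazy-reset idea can be sketched as follows. This is an assumption about the mechanism, not Bifrost's code; `1M` is approximated here as 30 days for simplicity.

```python
# Sketch of duration-based lazy resets for the documented units
# (1m/1h/1d/1w; "1M" approximated as 30 days).
from datetime import datetime, timedelta

UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800, "M": 2592000}

def parse_duration(spec: str) -> timedelta:
    count, unit = int(spec[:-1]), spec[-1]
    return timedelta(seconds=count * UNIT_SECONDS[unit])

def maybe_reset(period_start: datetime, spec: str, now: datetime):
    """Return (new_period_start, was_reset) for a lazily-reset counter."""
    if now - period_start >= parse_duration(spec):
        return now, True   # duration elapsed: zero the counters
    return period_start, False
```

Checking "has the period elapsed?" on each request is what makes cron jobs unnecessary: the first request after expiry triggers the reset.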
Provider-Level Budgets
You want to test three AI providers but cap spending on each one separately. Configure provider budgets within your virtual key:
```yaml
provider_configs:
  - provider: "openai"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 1.0
  - provider: "anthropic"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 0.0
```
Weight 1.0 means primary. Weight 0.0 means failover.
When OpenAI hits its 1M token budget, Bifrost automatically routes subsequent requests to Anthropic until OpenAI's budget resets.
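The failover behavior amounts to "pick the highest-weight provider that still has budget." A minimal sketch, with field names that are assumptions rather than Bifrost's internal structures:

```python
# Illustrative weight-plus-budget routing; dict keys are hypothetical.
def pick_provider(configs):
    """configs: dicts with 'provider', 'weight', 'used', 'limit' keys."""
    candidates = [c for c in configs if c["used"] < c["limit"]]
    if not candidates:
        return None  # every provider is over budget: the request fails
    return max(candidates, key=lambda c: c["weight"])["provider"]
```

With the configuration above, OpenAI wins while under budget; once its 1M tokens are spent, Anthropic is the only remaining candidate.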
Rate Limiting
Separate from budgets, you can set request rate limits at two levels:
VK-level rate limit - Maximum requests per time window for the entire virtual key.
Provider-level rate limit - Maximum requests per time window per AI provider.
```yaml
rate_limiting:
  vk_level:
    requests: 100
    duration: "1m"
  provider_level:
    - provider: "openai"
      requests: 50
      duration: "1m"
    - provider: "anthropic"
      requests: 30
      duration: "1m"
```
Rate limits use sliding windows. They return 429 Too Many Requests with a retry_after header. Budgets return 402 Payment Required with reset timestamps.
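A sliding window keeps per-request timestamps and drops those older than the window, rather than resetting a counter at fixed boundaries. A minimal sketch of the idea (Bifrost's real implementation is in-process Go; this just shows the mechanism):

```python
# Minimal sliding-window rate limiter over monotonic timestamps.
from collections import deque

class SlidingWindow:
    def __init__(self, limit: int, window_s: float):
        self.limit, self.window_s = limit, window_s
        self.hits = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False  # caller responds 429 with a retry_after hint
```

Unlike a fixed window, this prevents a burst at the boundary of two periods from doubling the effective rate.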
Error Responses
When a budget is exceeded, you get detailed error information:
```json
{
  "error": {
    "message": "Virtual key budget exceeded",
    "type": "budget_exceeded",
    "code": "vk_budget_limit",
    "details": {
      "tier": "virtual_key",
      "current_usage": {
        "requests": 10000,
        "tokens": 500000
      },
      "limits": {
        "requests": 10000,
        "tokens": 500000
      },
      "reset_at": "2025-02-10T00:00:00Z"
    }
  }
}
```
The reset_at field tells you exactly when the budget resets. Your application can use this to inform users or schedule retries.
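For example, a client can turn `reset_at` into a retry delay. The timestamp format below matches the error payload above (RFC 3339 / ISO 8601 in UTC); the helper name is illustrative.

```python
# Convert a budget error's reset_at timestamp into a retry delay.
from datetime import datetime, timezone
from typing import Optional

def seconds_until_reset(reset_at: str, now: Optional[datetime] = None) -> float:
    reset = datetime.fromisoformat(reset_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset - now).total_seconds())
```

A job scheduler can sleep for this many seconds before re-enqueueing work, instead of hammering the gateway with requests that will keep returning 402.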
Multi-Tenant Deployment
The four-tier hierarchy supports different organizational structures:
SaaS Platform
- Customer = Paying subscriber
- Team = Departments within subscriber
- VK = Different applications
Agency
- Customer = Client
- Team = Projects for that client
- VK = Specific campaigns or services
Enterprise Internal
- Customer = Company
- Team = Departments
- VK = Applications or environments (dev/staging/prod)
You don't reconfigure the system for different use cases. You just map your org structure to the appropriate tiers.
Performance
Budget checks happen entirely in-memory using atomic counters. No database queries. No external API calls.
Measured overhead at 5,000 RPS:
- Mean: 11µs
- P95: 18µs
- P99: 24µs
The budget system adds no perceptible latency; every validation completes in microseconds.
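The reason the checks are this cheap: each one is a load and a conditional increment on an in-process counter, with no I/O on the request path. A sketch of the pattern (using a lock-guarded counter for clarity; a Go implementation would typically use `sync/atomic` primitives):

```python
# Thread-safe check-and-consume counter: the whole budget check is one
# comparison and one increment, no database or network round-trip.
import threading

class Counter:
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0
        self._lock = threading.Lock()

    def try_consume(self) -> bool:
        with self._lock:
            if self.used >= self.limit:
                return False  # budget exhausted: reject the request
            self.used += 1
            return True
```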
Complete Example
Here's a full configuration for a SaaS platform:
```yaml
customers:
  - id: "customer-acme-corp"
    budget:
      requests: 1000000
      tokens: 50000000
      duration: "1M"
teams:
  - id: "team-engineering"
    budget:
      requests: 500000
      tokens: 25000000
      duration: "1M"
virtual_keys:
  - id: "vk-prod-api"
    budget:
      requests: 100000
      tokens: 5000000
      duration: "1w"
    rate_limiting:
      requests: 100
      duration: "1m"
    provider_configs:
      - provider: "openai"
        budget:
          tokens: 3000000
          duration: "1w"
        weight: 1.0
      - provider: "anthropic"
        budget:
          tokens: 2000000
          duration: "1w"
        weight: 0.0
```
This creates:
- Monthly cap of 1M requests for Acme Corp
- Monthly cap of 500K requests for Engineering team
- Weekly cap of 100K requests for the production API key
- Separate token budgets for OpenAI (primary) and Anthropic (failover)
- Rate limit of 100 requests per minute on the VK
When This Matters
You need hierarchical budgets when:
Running multi-tenant SaaS - Prevent one customer from impacting others. Enforce fair usage.
Managing agency clients - Hard caps per client prevent billing surprises.
Controlling internal costs - Cap sandbox environments to avoid accidental production-level spending during testing.
For simple single-tenant deployments, basic API key quotas work fine. But production multi-tenant systems need more control.
Implementation
Bifrost implements this as open-source infrastructure. The budget system is part of the core gateway.
Source: https://github.com/maximhq/bifrost
Documentation: https://docs.getbifrost.ai
The configuration examples, error responses, and performance numbers all come from the implementation and documentation.