Multi-tenant LLM deployments need budget controls that work at scale. Organizations with 10+ teams and 50+ virtual keys can't manually track spending per API key. And they can't let one misconfigured automation drain the entire monthly budget.
Bifrost, an open-source LLM gateway, implements four-tier hierarchical budget management:
Customer → Team → Virtual Key → Provider Config.
Each tier checks budgets independently. Any violation blocks the request.
This is what we learned building it.
The Core Problem
Production LLM infrastructure serves multiple stakeholders with different budget needs.
A SaaS platform might serve 50 customers, each with 3-5 teams. That's 150-250 logical budget boundaries. An agency managing AI services for 20 clients needs separate billing per client, with caps on individual API keys within each client.
Traditional solutions don't scale:
API key quotas work for single-tenant systems. They break when you need 100+ keys with hierarchical relationships.
Manual tracking works when you have 5 virtual keys. It doesn't work when you have 200, each with monthly budgets that reset on different schedules.
Provider-level limits (like OpenAI's organization quotas) protect the entire account. They don't protect Team A's budget from Team B's runaway script.
The solution is in-memory budget tracking at multiple levels, with checks on every request. No database queries. No external calls. Just fast, local validation against hierarchical limits.
Four-Tier Budget Hierarchy
Bifrost checks budgets at four levels, in order:
Customer Budget - Top-level spending cap for the entire organization. Blocks all requests when exceeded.
Team Budget - Department or project-level cap. Multiple teams share the customer budget. Each team has its own limit.
Virtual Key Budget - API key-level cap. Teams create multiple VKs for different services. Each VK has its own budget.
Provider Config Budget - Granular control per AI provider within a single virtual key. One VK can have separate budgets for OpenAI, Anthropic, and Gemini.
Implementation Details
All budget tracking happens in-memory using atomic counters. The request flow:
- Request arrives with virtual key ID
- Load customer/team/VK/provider hierarchy from cache
- Check all four budget tiers atomically
- If any tier is exceeded, reject request
- Otherwise, forward to provider and increment counters
Measured overhead: 11µs mean latency at 5,000 RPS in sustained load tests.
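The request flow above can be sketched in a few lines of Python. This is an illustration only, not Bifrost's code: the class and field names are hypothetical, a lock stands in for atomic counters, and a tier counts as exceeded once usage reaches its limit.

```python
import threading
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Budget:
    """One tier's limits and running usage (names are illustrative)."""
    name: str
    max_requests: int
    max_tokens: int
    requests: int = 0
    tokens: int = 0

    def exceeded(self) -> bool:
        return self.requests >= self.max_requests or self.tokens >= self.max_tokens


class BudgetChain:
    """Budgets for one request path: customer, team, virtual key, provider."""

    def __init__(self, tiers: List[Budget]):
        self.tiers = tiers
        # A lock stands in for the atomic counters described in the article.
        self._lock = threading.Lock()

    def check(self) -> Optional[str]:
        """Return the name of the first exceeded tier, or None if all pass."""
        with self._lock:
            for tier in self.tiers:
                if tier.exceeded():
                    return tier.name
        return None

    def record(self, tokens_used: int) -> None:
        """Charge one completed request to every tier in the chain."""
        with self._lock:
            for tier in self.tiers:
                tier.requests += 1
                tier.tokens += tokens_used


chain = BudgetChain([
    Budget("customer", max_requests=1_000_000, max_tokens=50_000_000),
    Budget("team", max_requests=100_000, max_tokens=5_000_000),
    Budget("virtual_key", max_requests=2, max_tokens=500_000),  # tiny cap for the demo
    Budget("provider_config", max_requests=10_000, max_tokens=500_000),
])

results = []
for _ in range(3):
    blocked_by = chain.check()
    results.append(blocked_by)
    if blocked_by is None:
        chain.record(tokens_used=1_200)  # token count would come from the provider response

print(results)  # [None, None, 'virtual_key']
```

The third request is rejected because the virtual key tier has reached its two-request cap, even though every other tier still has headroom.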
Budget Check Logic
Each budget tier has two limits:
- Request count - Track number of API calls
- Token count - Track cumulative tokens (input + output)
Example configuration:
customer_budget:
  requests: 1000000
  tokens: 50000000
  duration: "1M"
team_budget:
  requests: 100000
  tokens: 5000000
  duration: "1M"
virtual_key_budget:
  requests: 10000
  tokens: 500000
  duration: "1w"
Reset Durations
All budgets support five reset intervals:
- 1m - One minute (for testing/debugging)
- 1h - One hour
- 1d - One day
- 1w - One week
- 1M - One month
Resets happen automatically based on the configured duration. No manual intervention required.
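As a rough illustration of how these duration strings could map to reset times, here is a minimal Python sketch. The function name is hypothetical, and two things are assumptions: that a period is anchored at its start time, and that "1M" means the same day of the next month (clamped to day 28 to keep the date valid). The article does not say how Bifrost anchors periods.

```python
from datetime import datetime, timedelta, timezone

# Fixed-length intervals; "1M" is handled separately because months vary in length.
FIXED = {
    "1m": timedelta(minutes=1),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
}


def next_reset(period_start: datetime, duration: str) -> datetime:
    """Return when a budget period that began at period_start should reset."""
    if duration in FIXED:
        return period_start + FIXED[duration]
    if duration == "1M":
        # Assumption: same day next month, clamped to day 28 so the date always exists.
        year = period_start.year + (period_start.month // 12)
        month = period_start.month % 12 + 1
        return period_start.replace(year=year, month=month, day=min(period_start.day, 28))
    raise ValueError(f"unsupported duration: {duration!r}")


weekly = next_reset(datetime(2025, 2, 3, tzinfo=timezone.utc), "1w")
monthly = next_reset(datetime(2025, 12, 15, tzinfo=timezone.utc), "1M")
print(weekly.isoformat())   # 2025-02-10T00:00:00+00:00
print(monthly.isoformat())  # 2026-01-15T00:00:00+00:00
```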
Rate Limiting
Separate from budgets, Bifrost implements two-tier rate limiting:
Virtual Key Rate Limit - Requests per time window for the entire VK.
Provider Config Rate Limit - Requests per time window for specific providers.
Both use a sliding window algorithm. Configuration:
rate_limiting:
  vk_level:
    requests: 100
    duration: "1m"
  provider_level:
    - provider: "openai"
      requests: 50
      duration: "1m"
    - provider: "anthropic"
      requests: 30
      duration: "1m"
Rate limits block requests temporarily, until the window slides forward (429 Too Many Requests). Budgets block all requests until the budget period resets (402 Payment Required).
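For readers unfamiliar with sliding windows, here is a minimal sketch of one common variant, the sliding window log, which keeps a timestamp per accepted request. The article does not say which variant Bifrost uses, so treat this as a generic illustration; the class name is hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any trailing `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._stamps = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False  # the caller would answer 429 Too Many Requests
        self._stamps.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
decisions = [limiter.allow(t) for t in (0, 10, 20, 30, 61)]
print(decisions)  # [True, True, True, False, True]
```

The request at t=30 is rejected because three requests already sit inside the last 60 seconds. By t=61 the request from t=0 has left the window, so capacity is available again. In real use `now` would come from a monotonic clock such as `time.monotonic()`.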
Provider-Level Budgets
This solved a specific problem: teams want different spending limits per AI provider within the same virtual key.
Example: You're testing three providers. You allocate a $500 total budget but want to cap OpenAI at $200, Anthropic at $200, and Gemini at $100. If OpenAI hits $200, requests should fail over to Anthropic automatically.
Configuration (limits are expressed in tokens, and only the primary and failover providers are shown):
provider_configs:
  - provider: "openai"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 1.0
  - provider: "anthropic"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 0.0  # Failover only
Weight 1.0 means primary. Weight 0.0 means failover (only used when the primary fails or hits its budget).
When OpenAI's budget is exceeded, the routing layer automatically switches to Anthropic for subsequent requests until the OpenAI budget resets.
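The budget-driven failover described above can be sketched as a small selection function. This is a simplified illustration with hypothetical names, not Bifrost's routing code: it picks the highest-weight provider deterministically, whereas a real weighted router would likely distribute traffic across providers in proportion to weight.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderConfig:
    provider: str
    weight: float
    token_limit: int
    tokens_used: int = 0

    def has_budget(self) -> bool:
        return self.tokens_used < self.token_limit


def pick_provider(configs: List[ProviderConfig]) -> Optional[str]:
    """Prefer weighted providers with budget left, then weight-0.0 failovers."""
    available = [c for c in configs if c.has_budget()]
    primaries = [c for c in available if c.weight > 0]
    if primaries:
        # Simplification: take the highest weight instead of sampling by weight.
        return max(primaries, key=lambda c: c.weight).provider
    failovers = [c for c in available if c.weight == 0]
    return failovers[0].provider if failovers else None


configs = [
    ProviderConfig("openai", weight=1.0, token_limit=1_000_000),
    ProviderConfig("anthropic", weight=0.0, token_limit=1_000_000),
]

first = pick_provider(configs)          # primary still has budget
configs[0].tokens_used = 1_000_000      # OpenAI budget exhausted
second = pick_provider(configs)         # falls over to the weight-0.0 provider
configs[1].tokens_used = 1_000_000      # Anthropic exhausted too
third = pick_provider(configs)          # nothing left to route to

print(first, second, third)  # openai anthropic None
```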
Error Responses
Budget violations return specific HTTP status codes with detailed error information:
402 Payment Required - Budget exceeded
{
  "error": {
    "message": "Virtual key budget exceeded",
    "type": "budget_exceeded",
    "code": "vk_budget_limit",
    "details": {
      "tier": "virtual_key",
      "current_usage": {
        "requests": 10000,
        "tokens": 500000
      },
      "limits": {
        "requests": 10000,
        "tokens": 500000
      },
      "reset_at": "2025-02-10T00:00:00Z"
    }
  }
}
429 Too Many Requests - Rate limit exceeded
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_exceeded",
    "code": "vk_rate_limit",
    "retry_after": 42
  }
}
These responses come from Bifrost's documentation on error handling. The retry_after field in the 429 body tells clients when they can retry, and reset_at in the 402 body tells them when the budget resets.
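On the client side, the two error shapes call for different waiting strategies. The following sketch turns either response into a wait time. The function name is hypothetical, and it assumes retry_after is a number of seconds, which the example response does not state explicitly.

```python
import json
from datetime import datetime, timezone


def seconds_until_retry(status: int, body: str, now: datetime) -> float:
    """Return how long a client should wait before retrying, in seconds."""
    error = json.loads(body)["error"]
    if status == 429:
        # Assumption: retry_after is expressed in seconds.
        return float(error["retry_after"])
    if status == 402:
        # reset_at is an ISO 8601 UTC timestamp in the example response.
        raw = error["details"]["reset_at"].replace("Z", "+00:00")
        reset_at = datetime.fromisoformat(raw)
        return max(0.0, (reset_at - now).total_seconds())
    raise ValueError(f"not a budget or rate-limit error: {status}")


now = datetime(2025, 2, 9, 23, 0, tzinfo=timezone.utc)
budget_body = '{"error": {"type": "budget_exceeded", "details": {"reset_at": "2025-02-10T00:00:00Z"}}}'
rate_body = '{"error": {"type": "rate_limit_exceeded", "retry_after": 42}}'

budget_wait = seconds_until_retry(402, budget_body, now)
rate_wait = seconds_until_retry(429, rate_body, now)
print(budget_wait, rate_wait)  # 3600.0 42.0
```

A 429 is worth retrying after a short sleep. A 402 usually is not: the wait can be hours or days, so alerting someone or routing elsewhere is often the better response.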
Performance Characteristics
Benchmark results from 5,000 RPS sustained load:
- Mean overhead: 11µs
- P95 overhead: 18µs
- P99 overhead: 24µs
- Memory usage: 372MB resident (stable over 24 hours)
The budget system doesn't create performance bottlenecks. All checks are in-memory atomic operations. No database queries. No network calls.
Multi-Tenancy Patterns
Three common deployment patterns emerged:
SaaS Platform - One customer per paying subscriber. Teams represent departments within that subscriber. Virtual keys represent different applications.
Agency - One customer per client. Teams represent projects. Virtual keys represent specific campaigns or services.
Enterprise Internal - One customer for the entire company. Teams represent departments. Virtual keys represent applications or environments (dev, staging, prod).
The four-tier hierarchy supports all three patterns without configuration changes. You just map your organizational structure to the tiers.
What We Got Wrong
Initially, we implemented provider-level budgets as a separate top-level concept instead of as part of the hierarchy. This broke when users wanted multiple virtual keys sharing the same provider budget.
The solution was making provider budgets a property of virtual key configs, not a separate system. Now each VK has its own provider budgets, and teams can create as many VKs as needed with identical or different provider limits.
Another issue: We didn't include reset_at timestamps in budget-exceeded responses initially. Users had no way to know when budgets would reset. Adding timestamps to error responses fixed this.
What We'd Build Differently
If starting fresh, we'd make budget periods configurable per individual budget rather than per tier. Right now, all budgets at a given tier share the same reset schedule. Some users want daily budgets for virtual keys but monthly budgets for teams.
The complexity of supporting per-budget reset schedules isn't worth it for most use cases. But it would solve edge cases for teams with unusual billing cycles.
Another change: Real-time budget usage webhooks. Currently, users can query budget status via API, but there's no push notification when a budget reaches 80% or 90% utilization. Webhooks would enable proactive alerts before budgets are exhausted.
Production Deployment
Budget configuration happens via YAML or the Bifrost API. Example complete configuration:
customers:
  - id: "customer-001"
    budget:
      requests: 1000000
      tokens: 50000000
      duration: "1M"
    teams:
      - id: "team-engineering"
        budget:
          requests: 500000
          tokens: 25000000
          duration: "1M"
        virtual_keys:
          - id: "vk-prod-app"
            budget:
              requests: 100000
              tokens: 5000000
              duration: "1w"
            rate_limiting:
              requests: 100
              duration: "1m"
            provider_configs:
              - provider: "openai"
                budget:
                  tokens: 3000000
                  duration: "1w"
                weight: 1.0
              - provider: "anthropic"
                budget:
                  tokens: 2000000
                  duration: "1w"
                weight: 0.0
This configuration creates a complete budget hierarchy with automatic failover.
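To resolve the hierarchy for an incoming virtual key ID, a gateway can flatten a configuration like this into a lookup table at load time. The sketch below assumes the config nests teams under customers and virtual keys under teams, and that it has already been parsed into Python dicts. The function name is hypothetical and budgets are omitted for brevity.

```python
from typing import Dict, Tuple

# Parsed configuration, reduced to the IDs that define the hierarchy.
config = {
    "customers": [{
        "id": "customer-001",
        "teams": [{
            "id": "team-engineering",
            "virtual_keys": [{"id": "vk-prod-app"}],
        }],
    }]
}


def index_virtual_keys(config: dict) -> Dict[str, Tuple[str, str]]:
    """Map each virtual key ID to its (customer ID, team ID) ancestors."""
    index = {}
    for customer in config.get("customers", []):
        for team in customer.get("teams", []):
            for vk in team.get("virtual_keys", []):
                index[vk["id"]] = (customer["id"], team["id"])
    return index


vk_index = index_virtual_keys(config)
print(vk_index["vk-prod-app"])  # ('customer-001', 'team-engineering')
```

With the index built once, finding a key's customer and team on each request is a single dictionary lookup, which fits the no-database, in-memory design described earlier.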
Where This Matters
Budget controls matter most in three scenarios:
Multi-tenant SaaS - Prevent one customer's usage from impacting others. Enforce fair usage across tenants.
Cost-sensitive deployments - Hard caps on monthly spending. No surprises on the bill.
Development/testing environments - Cap sandbox environments to prevent accidental production-level costs during testing.
For single-tenant internal tools with one team, simpler solutions work fine. But production multi-tenant systems need hierarchical budgets.
Implementation
Bifrost implements this as open-source infrastructure. The budget system is part of the core gateway, not a separate service.
Source: https://git.new/bifrost
Documentation: https://docs.getbifrost.ai
The four-tier hierarchy, rate limiting, and provider-level budgets are all documented in the Bifrost gateway docs. All technical details in this article come from the implementation and documentation, not hypothetical designs.
