Pranay Batta

Posted on Feb 2

Building Hierarchical Budget Controls for Multi-Tenant LLM Gateways

#programming #ai #tutorial #devops

Multi-tenant LLM deployments need budget controls that work at scale. Organizations with 10+ teams and 50+ virtual keys can't manually track spending per API key. And they can't let one misconfigured automation drain the entire monthly budget.

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost

The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…


Bifrost implements four-tier hierarchical budget management:

Customer → Team → Virtual Key → Provider Config.

Each tier checks budgets independently. Any violation blocks the request.

This is what we learned building it.

The Core Problem

Production LLM infrastructure serves multiple stakeholders with different budget needs.

A SaaS platform might serve 50 customers, each with 3-5 teams. That's 150-250 logical budget boundaries. An agency managing AI services for 20 clients needs separate billing per client, with caps on individual API keys within each client.

Traditional solutions don't scale:

API key quotas work for single-tenant systems. They break when you need 100+ keys with hierarchical relationships.

Manual tracking works when you have 5 virtual keys. It doesn't work when you have 200, each with monthly budgets that reset on different schedules.

Provider-level limits (like OpenAI's organization quotas) protect the entire account. They don't protect Team A's budget from Team B's runaway script.

The solution is in-memory budget tracking at multiple levels, with checks on every request. No database queries. No external calls. Just fast, local validation against hierarchical limits.

Four-Tier Budget Hierarchy

Bifrost checks budgets at four levels, in order:

Customer Budget - Top-level spending cap for the entire organization. Blocks all requests when exceeded.

Team Budget - Department or project-level cap. Multiple teams share the customer budget. Each team has its own limit.

Virtual Key Budget - API key-level cap. Teams create multiple VKs for different services. Each VK has its own budget.

Provider Config Budget - Granular control per AI provider within a single virtual key. One VK can have separate budgets for OpenAI, Anthropic, and Gemini.

Implementation Details

All budget tracking happens in-memory using atomic counters. The request flow:

Request arrives with virtual key ID
Load customer/team/VK/provider hierarchy from cache
Check all four budget tiers atomically
If any tier is exceeded, reject request
Otherwise, forward to provider and increment counters
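The flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bifrost's actual code: the `Budget` class, `check_tiers`, and `record_usage` are hypothetical names, and plain integers stand in for the atomic counters a concurrent gateway would need.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Budget:
    """Hypothetical in-memory budget with a request limit and a token limit."""
    max_requests: int
    max_tokens: int
    used_requests: int = 0
    used_tokens: int = 0

    def exceeded(self) -> bool:
        # A tier counts as exhausted once either counter reaches its limit.
        return (self.used_requests >= self.max_requests
                or self.used_tokens >= self.max_tokens)

    def record(self, tokens: int) -> None:
        # Single-threaded sketch; a real gateway would use atomic increments.
        self.used_requests += 1
        self.used_tokens += tokens


# Tier order from the article: customer, team, virtual key, provider config.
TIERS = ("customer", "team", "virtual_key", "provider_config")


def check_tiers(chain: Dict[str, Budget]) -> Optional[str]:
    """Return the first exceeded tier, or None if the request may proceed."""
    for tier in TIERS:
        if chain[tier].exceeded():
            return tier
    return None


def record_usage(chain: Dict[str, Budget], tokens: int) -> None:
    """After a successful provider call, charge every tier in the chain."""
    for tier in TIERS:
        chain[tier].record(tokens)
```

A caller would run `check_tiers` before forwarding the request and `record_usage` once the provider reports token counts. Any non-None result maps to a rejected request.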

Measured overhead: 11µs mean latency at 5,000 RPS in sustained load tests.

Budget Check Logic

Each budget tier has two limits:

Request count - Track number of API calls
Token count - Track cumulative tokens (input + output)

Example configuration:

customer_budget:
  requests: 1000000
  tokens: 50000000
  duration: "1M"

team_budget:
  requests: 100000
  tokens: 5000000
  duration: "1M"

virtual_key_budget:
  requests: 10000
  tokens: 500000
  duration: "1w"

Reset Durations

All budgets support five reset intervals:

1m - One minute (for testing/debugging)
1h - One hour
1d - One day
1w - One week
1M - One month

Resets happen automatically based on the configured duration. No manual intervention required.
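To make the duration strings concrete, here is a small Python sketch that turns one into a next-reset timestamp. The function names are hypothetical, and the month rule (same day of the month, clamped to the month's length) is an assumption; the article does not say how Bifrost aligns monthly resets.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Fixed-length units. Note that "m" (minute) and "M" (month) differ only by case.
FIXED_UNITS = {
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}


def add_months(start: datetime, count: int) -> datetime:
    """Assumed month rule: same day-of-month, clamped to the month's length."""
    index = start.month - 1 + count
    year, month = start.year + index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return start.replace(year=year, month=month, day=day)


def next_reset(period_start: datetime, duration: str) -> datetime:
    """Return when a budget period that began at period_start should reset."""
    count, unit = int(duration[:-1]), duration[-1]
    if unit == "M":
        return add_months(period_start, count)
    if unit not in FIXED_UNITS:
        raise ValueError(f"unknown duration unit: {unit!r}")
    return period_start + count * FIXED_UNITS[unit]


start = datetime(2025, 2, 3, tzinfo=timezone.utc)
print(next_reset(start, "1w").isoformat())  # 2025-02-10T00:00:00+00:00
```

The case sensitivity is the main trap: a config that says "1m" where "1M" was intended resets every minute instead of every month.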

Rate Limiting

Separate from budgets, Bifrost implements two-tier rate limiting:

Virtual Key Rate Limit - Requests per time window for the entire VK.

Provider Config Rate Limit - Requests per time window for specific providers.

Both use a sliding-window algorithm. Configuration:

rate_limiting:
  vk_level:
    requests: 100
    duration: "1m"

  provider_level:
    - provider: "openai"
      requests: 50
      duration: "1m"
    - provider: "anthropic"
      requests: 30
      duration: "1m"

Rate limits block requests temporarily, until the window slides forward (429 Too Many Requests). Budgets block every request until the budget period resets (402 Payment Required).
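For readers unfamiliar with sliding windows, here is a minimal Python sketch of one common variant, the sliding-window log. The article does not say which variant Bifrost uses, so treat the class and method names as hypothetical and the approach as one possible implementation.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any span of `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._stamps = deque()  # timestamps of accepted requests, oldest first

    def _prune(self, now: float) -> None:
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def allow(self, now: float) -> bool:
        """Record and accept the request if the window has room."""
        self._prune(now)
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True

    def retry_after(self, now: float) -> float:
        """Seconds until the oldest request leaves the window (0 if room now)."""
        self._prune(now)
        if len(self._stamps) < self.limit:
            return 0.0
        return self.window - (now - self._stamps[0])
```

Passing `now` explicitly keeps the sketch deterministic; production code would read a monotonic clock instead. The log variant stores one timestamp per accepted request, so memory grows with the limit.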

Provider-Level Budgets

This solved a specific problem: teams want different spending limits per AI provider within the same virtual key.

Example: You're testing three providers. You allocate a $500 total budget but want to cap OpenAI at $200, Anthropic at $200, and Gemini at $100. If OpenAI hits $200, requests should fail over to Anthropic automatically.

Configuration (caps are expressed as token limits, and only two of the providers are shown):

provider_configs:
  - provider: "openai"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 1.0

  - provider: "anthropic"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 0.0  # Failover only

Weight 1.0 means primary. Weight 0.0 means failover: the provider is used only when the primary fails or exhausts its budget.

When OpenAI's budget is exceeded, the routing layer automatically switches to Anthropic for subsequent requests until the OpenAI budget resets.
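The primary/failover behavior can be sketched as a selection function. This Python example is an assumption-laden illustration: `ProviderConfig` and `select_provider` are hypothetical names, and it covers only the 1.0/0.0 case described above, not how fractional weights might split traffic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderConfig:
    """Hypothetical mirror of one provider_configs entry."""
    provider: str
    weight: float
    token_limit: int
    tokens_used: int = 0

    def has_budget(self) -> bool:
        return self.tokens_used < self.token_limit


def select_provider(configs: List[ProviderConfig]) -> Optional[str]:
    """Pick the highest-weight provider that still has budget.

    List order breaks ties. Returns None when every provider is exhausted.
    """
    available = [c for c in configs if c.has_budget()]
    if not available:
        return None
    return max(available, key=lambda c: c.weight).provider
```

With the configuration above, this returns "openai" until its weekly token budget is used up, then "anthropic" until one of the budgets resets.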

Error Responses

Budget and rate-limit violations return specific HTTP status codes with detailed error information:

402 Payment Required - Budget exceeded

{
  "error": {
    "message": "Virtual key budget exceeded",
    "type": "budget_exceeded",
    "code": "vk_budget_limit",
    "details": {
      "tier": "virtual_key",
      "current_usage": {
        "requests": 10000,
        "tokens": 500000
      },
      "limits": {
        "requests": 10000,
        "tokens": 500000
      },
      "reset_at": "2025-02-10T00:00:00Z"
    }
  }
}

429 Too Many Requests - Rate limit exceeded

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_exceeded",
    "code": "vk_rate_limit",
    "retry_after": 42
  }
}

These response formats come from Bifrost's documentation on error handling. The retry_after field tells clients how long to wait before they can retry.
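On the client side, the two status codes call for different reactions. The following Python sketch assumes the response bodies have exactly the shape shown above; `classify_error` and the returned action names are hypothetical.

```python
def classify_error(status: int, body: dict) -> dict:
    """Map a gateway error response to a client-side action."""
    error = body.get("error", {})
    if status == 429:
        # Rate limit: temporary, so wait and retry.
        return {"action": "retry", "wait": error.get("retry_after")}
    if status == 402:
        # Budget exhausted: retrying is pointless until the budget resets.
        details = error.get("details", {})
        return {
            "action": "stop",
            "tier": details.get("tier"),
            "reset_at": details.get("reset_at"),
        }
    # Anything else is left to the caller's normal error handling.
    return {"action": "raise"}
```

Treating 402 as a hard stop avoids retry loops that cannot succeed before reset_at, while 429 fits the usual wait-and-retry path.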

Performance Characteristics

Benchmark results from 5,000 RPS sustained load:

Mean overhead: 11µs
P95 overhead: 18µs
P99 overhead: 24µs
Memory usage: 372MB resident (stable over 24 hours)

The budget system doesn't create performance bottlenecks. All checks are in-memory atomic operations. No database queries. No network calls.

Multi-Tenancy Patterns

Three common deployment patterns emerged:

SaaS Platform - One customer per paying subscriber. Teams represent departments within that subscriber. Virtual keys represent different applications.

Agency - One customer per client. Teams represent projects. Virtual keys represent specific campaigns or services.

Enterprise Internal - One customer for the entire company. Teams represent departments. Virtual keys represent applications or environments (dev, staging, prod).

The four-tier hierarchy supports all three patterns without configuration changes. You just map your organizational structure to the tiers.

What We Got Wrong

Initially, we implemented provider-level budgets as a separate top-level concept rather than as part of the hierarchy. This broke when users wanted multiple virtual keys sharing the same provider budget.

The solution was making provider budgets a property of virtual key configs, not a separate system. Now each VK has its own provider budgets, and teams can create as many VKs as needed with identical or different provider limits.

Another issue: initially, we didn't include reset_at timestamps in budget-exceeded responses. Users had no way to know when budgets would reset. Adding timestamps to error responses fixed this.

What We'd Build Differently

If starting fresh, we'd make budget periods configurable per tier. Right now, all budgets at a given tier share the same reset schedule. Some users want daily budgets for virtual keys but monthly budgets for teams.

Supporting per-budget reset schedules adds complexity that most use cases don't need. But it would solve edge cases for teams with unusual billing cycles.

Another change: Real-time budget usage webhooks. Currently, users can query budget status via API, but there's no push notification when a budget reaches 80% or 90% utilization. Webhooks would enable proactive alerts before budgets are exhausted.
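The core of such an alert is detecting when a usage update crosses a threshold, so that each alert fires once rather than on every request. A hypothetical Python sketch follows; it is not an existing Bifrost feature, and the function name and default thresholds are assumptions taken from the 80% and 90% figures above.

```python
from typing import List, Sequence


def crossed_thresholds(before: int, after: int, limit: int,
                       thresholds: Sequence[float] = (0.8, 0.9)) -> List[float]:
    """Return the utilization thresholds crossed by moving from before to after.

    A threshold counts as crossed when usage was below it before the update
    and is at or above it afterwards.
    """
    return [t for t in thresholds if before / limit < t <= after / limit]
```

A webhook dispatcher would call this after each counter increment and send one notification per returned threshold.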

Production Deployment

Budget configuration happens via YAML or the Bifrost API. Example complete configuration:

customers:
  - id: "customer-001"
    budget:
      requests: 1000000
      tokens: 50000000
      duration: "1M"

    teams:
      - id: "team-engineering"
        budget:
          requests: 500000
          tokens: 25000000
          duration: "1M"

        virtual_keys:
          - id: "vk-prod-app"
            budget:
              requests: 100000
              tokens: 5000000
              duration: "1w"

            rate_limiting:
              requests: 100
              duration: "1m"

            provider_configs:
              - provider: "openai"
                budget:
                  tokens: 3000000
                  duration: "1w"
                weight: 1.0

              - provider: "anthropic"
                budget:
                  tokens: 2000000
                  duration: "1w"
                weight: 0.0

This configuration creates a complete budget hierarchy with automatic failover.
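To show how the nesting maps to the four tiers, here is a Python sketch that resolves the budget chain for one virtual key and provider. The dict mirrors the YAML above, trimmed to the fields the lookup needs. `budget_chain` is a hypothetical helper, and the linear scan stands in for the cache lookup described earlier.

```python
from typing import Optional

# Same structure as the YAML above, written as a dict to avoid a YAML dependency.
CONFIG = {"customers": [{
    "id": "customer-001",
    "budget": {"requests": 1000000, "tokens": 50000000, "duration": "1M"},
    "teams": [{
        "id": "team-engineering",
        "budget": {"requests": 500000, "tokens": 25000000, "duration": "1M"},
        "virtual_keys": [{
            "id": "vk-prod-app",
            "budget": {"requests": 100000, "tokens": 5000000, "duration": "1w"},
            "provider_configs": [
                {"provider": "openai", "weight": 1.0,
                 "budget": {"tokens": 3000000, "duration": "1w"}},
                {"provider": "anthropic", "weight": 0.0,
                 "budget": {"tokens": 2000000, "duration": "1w"}},
            ],
        }],
    }],
}]}


def budget_chain(config: dict, vk_id: str, provider: str) -> Optional[dict]:
    """Return the four budgets that apply to a (virtual key, provider) pair."""
    for customer in config["customers"]:
        for team in customer["teams"]:
            for vk in team["virtual_keys"]:
                if vk["id"] != vk_id:
                    continue
                for pc in vk["provider_configs"]:
                    if pc["provider"] == provider:
                        return {
                            "customer": customer["budget"],
                            "team": team["budget"],
                            "virtual_key": vk["budget"],
                            "provider_config": pc["budget"],
                        }
    return None
```

In this example the two provider token budgets (3,000,000 and 2,000,000) add up to the virtual key's 5,000,000-token budget, so the provider caps partition the key's allowance.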

Where This Matters

Budget controls matter most in three scenarios:

Multi-tenant SaaS - Prevent one customer's usage from impacting others. Enforce fair usage across tenants.

Cost-sensitive deployments - Hard caps on monthly spending. No surprises on the bill.

Development/testing environments - Cap sandbox environments to prevent accidental production-level costs during testing.

For single-tenant internal tools with one team, simpler solutions work fine. But production multi-tenant systems need hierarchical budgets.

Implementation

Bifrost implements this as open-source infrastructure. The budget system is part of the core gateway, not a separate service.

Source: https://git.new/bifrost

Documentation: https://docs.getbifrost.ai

The four-tier hierarchy, rate limiting, and provider-level budgets are all documented in the Bifrost gateway docs. All technical details in this article come from the implementation and documentation, not hypothetical designs.
