DEV Community

Debby McKinney


How to Set Up Four-Tier Budget Controls in Your LLM Gateway

If you're running an LLM gateway for multiple teams or customers, you need budget controls that actually work. Not just API key quotas. Not manual spending alerts. Real hierarchical budgets that prevent any single team from draining your entire monthly allocation.

Bifrost's four-tier budget system lets you set limits at the Customer, Team, Virtual Key, and Provider levels. Each tier is checked independently, and a violation at any tier blocks the request. No database queries. No network calls. Just in-memory validation with roughly 11µs of mean overhead.

maximhq / bifrost on GitHub

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start


Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Here's how to configure it.

The Budget Hierarchy

Four tiers, checked in order on every request:

Customer Budget - Your organization's total monthly cap. Blocks everything when exceeded.

Team Budget - Department or project limits. Multiple teams share the customer budget.

Virtual Key Budget - Per-API-key spending caps. Teams create multiple VKs for different services.

Provider Config Budget - Granular control per AI provider (OpenAI, Anthropic, etc.) within a single VK.

All four are optional. Use only the tiers you need.
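To make the check order concrete, here's a minimal Python sketch of how a hierarchical check might walk the four tiers in order and report the first violation. This is an illustration of the behavior described above, not Bifrost's actual Go implementation; the tier names and the "None means unconfigured" convention are assumptions for the example.

```python
# Hypothetical sketch of four-tier budget checking (not Bifrost source code).
# Each tier is a (name, used, limit) triple; a limit of None means the tier
# is not configured and is skipped.

def check_budgets(tiers):
    """Return None if every configured tier has headroom, or the name of
    the first tier whose request count has reached its limit."""
    for name, used, limit in tiers:
        if limit is not None and used >= limit:
            return name
    return None

# Checked in the documented order: customer, team, virtual key, provider.
tiers = [
    ("customer", 950_000, 1_000_000),
    ("team", 100_000, 100_000),    # at its cap -> blocks the request
    ("virtual_key", 4_200, 10_000),
    ("provider", 0, None),         # provider budget not configured
]
```

Here the team tier is exhausted, so the request is blocked even though the customer and virtual key still have headroom.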

Basic Configuration

Set up a virtual key with team and customer budgets:

customer_budget:
  requests: 1000000
  tokens: 50000000
  duration: "1M"

team_budget:
  requests: 100000
  tokens: 5000000
  duration: "1M"

virtual_key_budget:
  requests: 10000
  tokens: 500000
  duration: "1w"

Now every request checks three budget tiers. If the VK hits 10,000 requests in one week, requests fail with 402 Payment Required. If the team hits 100,000 requests in one month, same thing. If the customer hits 1 million, everything stops.

Reset Durations

Budgets reset automatically based on duration:

  • 1m - One minute (testing only)
  • 1h - One hour
  • 1d - One day
  • 1w - One week
  • 1M - One month

No cron jobs. No manual resets. The system tracks when each budget period started and resets counters when the duration elapses.
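The reset logic amounts to "remember when the period started, zero the counter when the duration has elapsed." A rough Python sketch, assuming the duration semantics listed above (with `1M` approximated as 30 days here for simplicity; a real implementation would use calendar-aware month boundaries):

```python
# Sketch of duration-based budget resets (illustrative assumptions, not
# Bifrost internals).
from datetime import datetime, timedelta, timezone

DURATIONS = {
    "1m": timedelta(minutes=1),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
    "1M": timedelta(days=30),  # real month handling needs calendar math
}

def maybe_reset(counter, period_start, duration, now):
    """Zero the counter and start a new period if the duration elapsed;
    return the (possibly unchanged) (counter, period_start) pair."""
    if now - period_start >= DURATIONS[duration]:
        return 0, now
    return counter, period_start
```

No scheduler is involved: the check runs lazily whenever a request touches the budget.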

Provider-Level Budgets

Say you want to use multiple AI providers but cap spending on each one separately. Configure provider budgets within your virtual key:

provider_configs:
  - provider: "openai"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 1.0

  - provider: "anthropic"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 0.0

Weight 1.0 means primary. Weight 0.0 means failover.

When OpenAI hits its 1M token budget, Bifrost automatically routes subsequent requests to Anthropic until OpenAI's budget resets.
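The routing behavior can be sketched as "pick the highest-weight provider that still has token budget left." A minimal Python illustration of that selection rule (field names here are invented for the example, not Bifrost's internal types):

```python
# Sketch of weight-based provider selection with budget failover
# (an illustration of the behavior described above, not Bifrost internals).

def pick_provider(configs, usage):
    """Choose the highest-weight provider whose token budget has headroom;
    return None if every provider is over budget."""
    for cfg in sorted(configs, key=lambda c: c["weight"], reverse=True):
        if usage.get(cfg["provider"], 0) < cfg["budget_tokens"]:
            return cfg["provider"]
    return None

configs = [
    {"provider": "openai", "budget_tokens": 1_000_000, "weight": 1.0},
    {"provider": "anthropic", "budget_tokens": 1_000_000, "weight": 0.0},
]
```

While OpenAI has headroom it always wins on weight; once its 1M tokens are spent, Anthropic becomes the only eligible choice.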

Rate Limiting

Separate from budgets, you can set request rate limits at two levels:

VK-level rate limit - Maximum requests per time window for the entire virtual key.

Provider-level rate limit - Maximum requests per time window per AI provider.

rate_limiting:
  vk_level:
    requests: 100
    duration: "1m"

  provider_level:
    - provider: "openai"
      requests: 50
      duration: "1m"
    - provider: "anthropic"
      requests: 30
      duration: "1m"

Rate limits use sliding windows. They return 429 Too Many Requests with a retry_after header. Budgets return 402 Payment Required with reset timestamps.
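A sliding window differs from a fixed window in that it counts requests over the trailing duration rather than per calendar bucket. A minimal Python sketch of the idea (illustrative only, not Bifrost's implementation):

```python
# Minimal sliding-window rate limiter sketch.
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # request times inside the current window

    def allow(self, now):
        """True if a request at time `now` fits in the trailing window."""
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            return False  # caller would respond 429 with retry_after
        self.timestamps.append(now)
        return True
```

With `SlidingWindowLimiter(100, 60)` you get the "100 requests per minute" behavior from the VK-level config above, measured over the trailing 60 seconds rather than reset on minute boundaries.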

Error Responses

When a budget is exceeded, you get detailed error information:

{
  "error": {
    "message": "Virtual key budget exceeded",
    "type": "budget_exceeded",
    "code": "vk_budget_limit",
    "details": {
      "tier": "virtual_key",
      "current_usage": {
        "requests": 10000,
        "tokens": 500000
      },
      "limits": {
        "requests": 10000,
        "tokens": 500000
      },
      "reset_at": "2025-02-10T00:00:00Z"
    }
  }
}

The reset_at field tells you exactly when the budget resets. Your application can use this to inform users or schedule retries.
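On the client side, handling this can be as simple as parsing `reset_at` and computing the wait. A small Python sketch, using the field layout from the error example above:

```python
# Client-side sketch: compute how long to wait until a budget resets,
# given a 402 error body shaped like the example above.
from datetime import datetime, timezone

def seconds_until_reset(error_body, now):
    """Seconds until reset_at, clamped to zero if it has already passed."""
    reset_at = datetime.fromisoformat(
        error_body["error"]["details"]["reset_at"].replace("Z", "+00:00")
    )
    return max(0.0, (reset_at - now).total_seconds())
```

Your retry logic can sleep for this duration, or surface it to users as "budget resets in N minutes."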

Multi-Tenant Deployment

The four-tier hierarchy supports different organizational structures:

SaaS Platform

  • Customer = Paying subscriber
  • Team = Departments within subscriber
  • VK = Different applications

Agency

  • Customer = Client
  • Team = Projects for that client
  • VK = Specific campaigns or services

Enterprise Internal

  • Customer = Company
  • Team = Departments
  • VK = Applications or environments (dev/staging/prod)

You don't reconfigure the system for different use cases. You just map your org structure to the appropriate tiers.

Performance

Budget checks happen entirely in-memory using atomic counters. No database queries. No external API calls.

Measured overhead at 5,000 RPS:

  • Mean: 11µs
  • P95: 18µs
  • P99: 24µs

The budget system adds negligible latency: all validations complete in microseconds.
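The key property is that the check-and-increment is a single atomic operation on in-memory state, with no I/O on the request path. Python has no true atomics, so this sketch uses a `threading.Lock` to stand in for the compare-and-increment a Go implementation would do with atomic primitives; it's a conceptual illustration, not Bifrost's code:

```python
# Sketch of an in-memory budget counter with atomic check-and-consume.
import threading

class BudgetCounter:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_consume(self, n=1):
        """Atomically reserve n requests; False means the budget is exhausted.
        Checking and incrementing under one lock prevents two concurrent
        requests from both squeezing past the limit."""
        with self._lock:
            if self.used + n > self.limit:
                return False
            self.used += n
            return True
```

Because nothing here touches a database or the network, the per-request cost is a lock acquisition and an integer compare, which is how microsecond-scale overhead is achievable.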

Complete Example

Here's a full configuration for a SaaS platform:

customers:
  - id: "customer-acme-corp"
    budget:
      requests: 1000000
      tokens: 50000000
      duration: "1M"

    teams:
      - id: "team-engineering"
        budget:
          requests: 500000
          tokens: 25000000
          duration: "1M"

        virtual_keys:
          - id: "vk-prod-api"
            budget:
              requests: 100000
              tokens: 5000000
              duration: "1w"

            rate_limiting:
              requests: 100
              duration: "1m"

            provider_configs:
              - provider: "openai"
                budget:
                  tokens: 3000000
                  duration: "1w"
                weight: 1.0

              - provider: "anthropic"
                budget:
                  tokens: 2000000
                  duration: "1w"
                weight: 0.0

This creates:

  1. Monthly cap of 1M requests for Acme Corp
  2. Monthly cap of 500K requests for Engineering team
  3. Weekly cap of 100K requests for the production API key
  4. Separate token budgets for OpenAI (primary) and Anthropic (failover)
  5. Rate limit of 100 requests per minute on the VK

When This Matters

You need hierarchical budgets when:

Running multi-tenant SaaS - Prevent one customer from impacting others. Enforce fair usage.

Managing agency clients - Hard caps per client prevent billing surprises.

Controlling internal costs - Cap sandbox environments to avoid accidental production-level spending during testing.

For simple single-tenant deployments, basic API key quotas work fine. But production multi-tenant systems need more control.

Implementation

Bifrost implements this as open-source infrastructure. The budget system is part of the core gateway.

Source: https://github.com/maximhq/bifrost

Documentation: https://docs.getbifrost.ai

The configuration examples, error responses, and performance numbers all come from the implementation and documentation.
