DEV Community

Debby McKinney


How to Set Up Four-Tier Budget Controls in Your LLM Gateway

If you're running an LLM gateway for multiple teams or customers, you need budget controls that actually work. Not just API key quotas. Not manual spending alerts. Real hierarchical budgets that prevent any single team from draining your entire monthly allocation.

Bifrost's four-tier budget system lets you set limits at the Customer, Team, Virtual Key, and Provider levels. Each tier is checked independently, and a violation at any tier blocks the request. No database queries. No network calls. Just in-memory validation with roughly 11µs of mean overhead.

maximhq / bifrost on GitHub

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start


Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

Here's how to configure it.

The Budget Hierarchy

Four tiers, checked in order on every request:

Customer Budget - Your organization's total monthly cap. Blocks everything when exceeded.

Team Budget - Department or project limits. Multiple teams share the customer budget.

Virtual Key Budget - Per-API-key spending caps. Teams create multiple VKs for different services.

Provider Config Budget - Granular control per AI provider (OpenAI, Anthropic, etc.) within a single VK.

All four are optional. Use only the tiers you need.
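To make the check order concrete, here's a minimal Python sketch of how a hierarchical check might walk the four tiers in order and report the first violation. This is an illustration of the behavior described above, not Bifrost's actual Go implementation; the tier names and the "None means unconfigured" convention are assumptions for the example.

```python
# Hypothetical sketch of four-tier budget checking (not Bifrost source code).
# Each tier is a (name, used, limit) triple; a limit of None means the tier
# is not configured and is skipped.

def check_budgets(tiers):
    """Return None if every configured tier has headroom, or the name of
    the first tier whose request count has reached its limit."""
    for name, used, limit in tiers:
        if limit is not None and used >= limit:
            return name
    return None

# Checked in the documented order: customer, team, virtual key, provider.
tiers = [
    ("customer", 950_000, 1_000_000),
    ("team", 100_000, 100_000),    # at its cap -> blocks the request
    ("virtual_key", 4_200, 10_000),
    ("provider", 0, None),         # provider budget not configured
]
```

Here the team tier is exhausted, so the request is blocked even though the customer and virtual key still have headroom.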

Basic Configuration

Set up a virtual key with team and customer budgets:

customer_budget:
  requests: 1000000
  tokens: 50000000
  duration: "1M"

team_budget:
  requests: 100000
  tokens: 5000000
  duration: "1M"

virtual_key_budget:
  requests: 10000
  tokens: 500000
  duration: "1w"

Now every request checks three budget tiers. If the VK hits 10,000 requests in one week, requests fail with 402 Payment Required. If the team hits 100,000 requests in one month, same thing. If the customer hits 1 million, everything stops.

Reset Durations

Budgets reset automatically based on duration:

  • 1m - One minute (testing only)
  • 1h - One hour
  • 1d - One day
  • 1w - One week
  • 1M - One month

No cron jobs. No manual resets. The system tracks when each budget period started and resets counters when the duration elapses.
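The reset logic amounts to "remember when the period started, zero the counter when the duration has elapsed." A rough Python sketch, assuming the duration semantics listed above (with `1M` approximated as 30 days here for simplicity; a real implementation would use calendar-aware month boundaries):

```python
# Sketch of duration-based budget resets (illustrative assumptions, not
# Bifrost internals).
from datetime import datetime, timedelta, timezone

DURATIONS = {
    "1m": timedelta(minutes=1),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
    "1M": timedelta(days=30),  # real month handling needs calendar math
}

def maybe_reset(counter, period_start, duration, now):
    """Zero the counter and start a new period if the duration elapsed;
    return the (possibly unchanged) (counter, period_start) pair."""
    if now - period_start >= DURATIONS[duration]:
        return 0, now
    return counter, period_start
```

No scheduler is involved: the check runs lazily whenever a request touches the budget.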

Provider-Level Budgets

Say you want to use multiple AI providers but cap spending on each one separately. Configure provider budgets within your virtual key:

provider_configs:
  - provider: "openai"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 1.0

  - provider: "anthropic"
    budget:
      tokens: 1000000
      duration: "1w"
    weight: 0.0

Weight 1.0 means primary. Weight 0.0 means failover.

When OpenAI hits its 1M token budget, Bifrost automatically routes subsequent requests to Anthropic until OpenAI's budget resets.
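The routing behavior can be sketched as "pick the highest-weight provider that still has token budget left." A minimal Python illustration of that selection rule (field names here are invented for the example, not Bifrost's internal types):

```python
# Sketch of weight-based provider selection with budget failover
# (an illustration of the behavior described above, not Bifrost internals).

def pick_provider(configs, usage):
    """Choose the highest-weight provider whose token budget has headroom;
    return None if every provider is over budget."""
    for cfg in sorted(configs, key=lambda c: c["weight"], reverse=True):
        if usage.get(cfg["provider"], 0) < cfg["budget_tokens"]:
            return cfg["provider"]
    return None

configs = [
    {"provider": "openai", "budget_tokens": 1_000_000, "weight": 1.0},
    {"provider": "anthropic", "budget_tokens": 1_000_000, "weight": 0.0},
]
```

While OpenAI has headroom it always wins on weight; once its 1M tokens are spent, Anthropic becomes the only eligible choice.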

Rate Limiting

Separate from budgets, you can set request rate limits at two levels:

VK-level rate limit - Maximum requests per time window for the entire virtual key.

Provider-level rate limit - Maximum requests per time window per AI provider.

rate_limiting:
  vk_level:
    requests: 100
    duration: "1m"

  provider_level:
    - provider: "openai"
      requests: 50
      duration: "1m"
    - provider: "anthropic"
      requests: 30
      duration: "1m"

Rate limits use sliding windows. They return 429 Too Many Requests with a retry_after header. Budgets return 402 Payment Required with reset timestamps.
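A sliding window differs from a fixed window in that it counts requests over the trailing duration rather than per calendar bucket. A minimal Python sketch of the idea (illustrative only, not Bifrost's implementation):

```python
# Minimal sliding-window rate limiter sketch.
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # request times inside the current window

    def allow(self, now):
        """True if a request at time `now` fits in the trailing window."""
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            return False  # caller would respond 429 with retry_after
        self.timestamps.append(now)
        return True
```

With `SlidingWindowLimiter(100, 60)` you get the "100 requests per minute" behavior from the VK-level config above, measured over the trailing 60 seconds rather than reset on minute boundaries.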

Error Responses

When a budget is exceeded, you get detailed error information:

{
  "error": {
    "message": "Virtual key budget exceeded",
    "type": "budget_exceeded",
    "code": "vk_budget_limit",
    "details": {
      "tier": "virtual_key",
      "current_usage": {
        "requests": 10000,
        "tokens": 500000
      },
      "limits": {
        "requests": 10000,
        "tokens": 500000
      },
      "reset_at": "2025-02-10T00:00:00Z"
    }
  }
}

The reset_at field tells you exactly when the budget resets. Your application can use this to inform users or schedule retries.
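On the client side, handling this can be as simple as parsing `reset_at` and computing the wait. A small Python sketch, using the field layout from the error example above:

```python
# Client-side sketch: compute how long to wait until a budget resets,
# given a 402 error body shaped like the example above.
from datetime import datetime, timezone

def seconds_until_reset(error_body, now):
    """Seconds until reset_at, clamped to zero if it has already passed."""
    reset_at = datetime.fromisoformat(
        error_body["error"]["details"]["reset_at"].replace("Z", "+00:00")
    )
    return max(0.0, (reset_at - now).total_seconds())
```

Your retry logic can sleep for this duration, or surface it to users as "budget resets in N minutes."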

Multi-Tenant Deployment

The four-tier hierarchy supports different organizational structures:

SaaS Platform

  • Customer = Paying subscriber
  • Team = Departments within subscriber
  • VK = Different applications

Agency

  • Customer = Client
  • Team = Projects for that client
  • VK = Specific campaigns or services

Enterprise Internal

  • Customer = Company
  • Team = Departments
  • VK = Applications or environments (dev/staging/prod)

You don't reconfigure the system for different use cases. You just map your org structure to the appropriate tiers.

Performance

Budget checks happen entirely in-memory using atomic counters. No database queries. No external API calls.

Measured overhead at 5,000 RPS:

  • Mean: 11µs
  • P95: 18µs
  • P99: 24µs

The budget system adds negligible latency: all validations complete in microseconds.
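The key property is that the check-and-increment is a single atomic operation on in-memory state, with no I/O on the request path. Python has no true atomics, so this sketch uses a `threading.Lock` to stand in for the compare-and-increment a Go implementation would do with atomic primitives; it's a conceptual illustration, not Bifrost's code:

```python
# Sketch of an in-memory budget counter with atomic check-and-consume.
import threading

class BudgetCounter:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_consume(self, n=1):
        """Atomically reserve n requests; False means the budget is exhausted.
        Checking and incrementing under one lock prevents two concurrent
        requests from both squeezing past the limit."""
        with self._lock:
            if self.used + n > self.limit:
                return False
            self.used += n
            return True
```

Because nothing here touches a database or the network, the per-request cost is a lock acquisition and an integer compare, which is how microsecond-scale overhead is achievable.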

Complete Example

Here's a full configuration for a SaaS platform:

customers:
  - id: "customer-acme-corp"
    budget:
      requests: 1000000
      tokens: 50000000
      duration: "1M"

    teams:
      - id: "team-engineering"
        budget:
          requests: 500000
          tokens: 25000000
          duration: "1M"

        virtual_keys:
          - id: "vk-prod-api"
            budget:
              requests: 100000
              tokens: 5000000
              duration: "1w"

            rate_limiting:
              requests: 100
              duration: "1m"

            provider_configs:
              - provider: "openai"
                budget:
                  tokens: 3000000
                  duration: "1w"
                weight: 1.0

              - provider: "anthropic"
                budget:
                  tokens: 2000000
                  duration: "1w"
                weight: 0.0

This creates:

  1. Monthly cap of 1M requests for Acme Corp
  2. Monthly cap of 500K requests for Engineering team
  3. Weekly cap of 100K requests for the production API key
  4. Separate token budgets for OpenAI (primary) and Anthropic (failover)
  5. Rate limit of 100 requests per minute on the VK

When This Matters

You need hierarchical budgets when:

Running multi-tenant SaaS - Prevent one customer from impacting others. Enforce fair usage.

Managing agency clients - Hard caps per client prevent billing surprises.

Controlling internal costs - Cap sandbox environments to avoid accidental production-level spending during testing.

For simple single-tenant deployments, basic API key quotas work fine. But production multi-tenant systems need more control.

Implementation

Bifrost implements this as open-source infrastructure. The budget system is part of the core gateway.

Source: https://github.com/maximhq/bifrost

Documentation: https://docs.getbifrost.ai

The configuration examples, error responses, and performance numbers all come from the implementation and documentation.
