Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.
I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.
The Problem with AWS Native Cost Tracking
AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.
What you get from CloudWatch + Cost Explorer:
- Aggregate Bedrock spend per region
- Invocation counts at the service level
- Basic alarms on total spend thresholds
What you do not get:
- Per-model token-level cost breakdowns
- Team or project-level budget enforcement
- Rate limiting by user, team, or API key
- Real-time cost tracking per request
- Automatic routing away from providers that exceed limits
If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.
The Gateway Approach
An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.
I tested three approaches:
| Feature | AWS Native (CloudWatch + Cost Explorer) | LiteLLM | Bifrost |
|---|---|---|---|
| LLM-specific cost tracking | Aggregate only | Per-request, per-model | Per-request, per-model |
| Budget hierarchy | Account-level billing alerts | Basic budget controls | 4-tier: Customer > Team > Virtual Key > Provider |
| Rate limiting | No native LLM rate limits | Basic rate limiting | Virtual Key + Provider Config levels, token and request limits |
| Reset durations | N/A | Limited options | 1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC) |
| Bedrock support | Native | Yes | Yes (provider type "bedrock") |
| Overhead | None | ~8ms (Python) | 11 microseconds (Go) |
| Deployment | N/A | Self-hosted or cloud | Self-hosted (runs in your VPC) |
| Language | N/A | Python | Go |
The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.
Setting Up Bifrost with AWS Bedrock
Bifrost runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.
Start the gateway:
```shell
npx -y @maximhq/bifrost
```
Full setup guide here.
Configure Bedrock as a provider:
```yaml
accounts:
  - id: "ml-team"
    providers:
      - id: "bedrock-claude"
        type: "bedrock"
        region: "us-east-1"
        model: "anthropic.claude-sonnet-4-20250514-v1:0"
        weight: 80
      - id: "bedrock-mistral"
        type: "bedrock"
        region: "us-west-2"
        model: "mistral.mistral-large-2407-v1:0"
        weight: 20
```
Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The provider configuration docs cover all Bedrock model formats and region options.
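To make the 80/20 split concrete, here is a minimal sketch of weight-proportional provider selection. This is illustrative only, not Bifrost's actual routing code; the `PROVIDERS` list mirrors the weights from the config above.

```python
import random

# Hypothetical provider table mirroring the 80/20 config above.
PROVIDERS = [
    {"id": "bedrock-claude", "weight": 80},
    {"id": "bedrock-mistral", "weight": 20},
]

def pick_provider(providers, rng=random):
    """Choose a provider with probability proportional to its weight."""
    total = sum(p["weight"] for p in providers)
    roll = rng.uniform(0, total)
    for p in providers:
        roll -= p["weight"]
        if roll <= 0:
            return p["id"]
    return providers[-1]["id"]  # guard against float rounding at the edge
```

Over many requests, roughly 80% land on `bedrock-claude` and 20% on `bedrock-mistral`, without any coordination between callers.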
Four-Tier Budget Hierarchy
This is where Bifrost separates itself from everything else I tested. The budget system has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.
```yaml
budgets:
  customer:
    - id: "acme-corp"
      limit: 5000
      period: "1M"
  team:
    - id: "ml-engineering"
      customer_id: "acme-corp"
      limit: 2000
      period: "1M"
  virtual_key:
    - id: "staging-key"
      team_id: "ml-engineering"
      limit: 500
      period: "1w"
  provider_config:
    - id: "bedrock-claude"
      limit: 1000
      period: "1M"
```
Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.
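The "all tiers must pass" rule can be sketched in a few lines. This is an illustration of the semantics, not Bifrost's implementation; the `Budget` class and the `spent` figures are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit: float  # dollars allowed per period
    spent: float  # dollars consumed so far in the current period

def request_allowed(cost, tiers):
    """A request passes only if every tier still has headroom for `cost`."""
    return all(b.spent + cost <= b.limit for b in tiers.values())

# Hypothetical mid-month state for the hierarchy configured above.
tiers = {
    "customer":        Budget(limit=5000, spent=3100),
    "team":            Budget(limit=2000, spent=1900),
    "virtual_key":     Budget(limit=500,  spent=120),
    "provider_config": Budget(limit=1000, spent=400),
}
```

With this state, a $50 request goes through, but a $150 request is blocked: the team tier would exceed its $2,000 cap even though every other tier has room.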
Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.
The governance docs have the full breakdown.
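As a rough sketch of what "calculated, not estimated" means, here is per-request cost arithmetic from token counts and a pricing table. The rates and the 10% cache discount below are invented for illustration; they are not real Bedrock prices or Bifrost's pricing data.

```python
# Hypothetical per-1K-token rates (input, output) -- NOT real prices.
PRICING = {
    "anthropic.claude-sonnet": (0.003, 0.015),
}

def request_cost(model, input_tokens, output_tokens, cached_input=0):
    """Compute dollars for one request from actual token usage."""
    in_rate, out_rate = PRICING[model]
    # Assume cached input tokens bill at 10% of the normal input rate.
    billable_in = (input_tokens - cached_input) + cached_input * 0.1
    return billable_in / 1000 * in_rate + output_tokens / 1000 * out_rate
```

A request with 1,000 input and 500 output tokens costs $0.0105 at these made-up rates; a fully cached prompt drops the input side to a tenth.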
Rate Limiting That Actually Works for LLMs
AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.
Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).
```yaml
rate_limits:
  virtual_key:
    - id: "staging-key"
      requests:
        limit: 100
        duration: "1h"
      tokens:
        limit: 50000
        duration: "1h"
  provider_config:
    - id: "bedrock-claude"
      requests:
        limit: 500
        duration: "1h"
```
Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.
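Calendar alignment is easy to get wrong, so here is a small sketch of what it implies for a few of the durations. This is my own illustration of the behavior described above, not Bifrost's code, and it only handles three of the durations.

```python
from datetime import datetime, timedelta, timezone

def next_reset(now, duration):
    """Next calendar-aligned UTC reset boundary for a few sample durations."""
    if duration == "1h":
        return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    if duration == "1d":
        # Midnight UTC, not 24 hours after the first request.
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        return midnight + timedelta(days=1)
    if duration == "1M":
        # First of the next calendar month at 00:00 UTC.
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return now.replace(year=year, month=month, day=1,
                           hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"duration not handled in this sketch: {duration}")
```

For a request at 2025-03-15 10:30 UTC, `"1d"` resets at 2025-03-16 00:00 UTC and `"1M"` at 2025-04-01 00:00 UTC, regardless of when the counter started accumulating.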
Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from routing. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.
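The exclusion behavior amounts to skipping any provider whose counter has hit its limit and falling through to the next one. A toy sketch, again illustrative rather than Bifrost's implementation, with window resets omitted:

```python
def route(providers, counters):
    """providers: list of (id, request_limit); counters: id -> requests used.

    Returns the first provider with headroom, incrementing its counter.
    A rate-limited provider is simply skipped; traffic shifts to the rest.
    """
    for pid, limit in providers:
        used = counters.get(pid, 0)
        if used < limit:
            counters[pid] = used + 1
            return pid
    raise RuntimeError("all providers rate limited")
```

With limits of 2 and 3, the first two requests go to the primary provider; the third is routed to the secondary with no manual intervention.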
Observability at Sub-Millisecond Overhead
Every request through Bifrost is captured: tokens used, latency, cost, response status. The observability layer adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.
What makes this useful for AWS teams:
- 14+ API filter options for querying logs. Filter by model, provider, team, cost range, status code, time window.
- WebSocket live updates. Watch requests flow through in real time. Useful during load testing or incident debugging.
- Single pane across providers. If you are running Bedrock plus OpenAI or Gemini as failover, all logs are in one place.
Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralised view saves real time.
Honest Trade-offs
No tool solves everything. Here is what to know:
Bifrost is self-hosted only. You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.
LiteLLM has broader provider coverage. 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.
AWS native tools have zero overhead. If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.
Go vs Python matters at scale. Bifrost's 11-microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The benchmarks back this up: 5,000 RPS on a single instance.
Bifrost is a newer project. The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.
When to Use What
Stick with AWS native tools if: You have one team, one model, and just need billing alerts.
Consider LiteLLM if: You need maximum provider coverage and are comfortable with Python-based overhead.
Use Bifrost if: You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want semantic caching and automatic failover alongside cost controls.
Quick Start
```shell
# 1. Start Bifrost in your VPC
npx -y @maximhq/bifrost

# 2. Configure Bedrock providers in bifrost.yaml
# 3. Set budget and rate limit tiers

# 4. Point your application at the gateway
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
export ANTHROPIC_API_KEY=your-bifrost-virtual-key
```
Every Bedrock request now has cost tracking, rate limiting, and observability built in.
AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.
Check the repo if you want to dig into the source.