Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.
I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.
The Problem with AWS Native Cost Tracking
AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.
What you get from CloudWatch + Cost Explorer:
- Aggregate Bedrock spend per region
- Invocation counts at the service level
- Basic alarms on total spend thresholds
What you do not get:
- Per-model token-level cost breakdowns
- Team or project-level budget enforcement
- Rate limiting by user, team, or API key
- Real-time cost tracking per request
- Automatic routing away from providers that exceed limits
If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.
The Gateway Approach
An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.
I tested three approaches:
| Feature | AWS Native (CloudWatch + Cost Explorer) | LiteLLM | Bifrost |
|---|---|---|---|
| LLM-specific cost tracking | Aggregate only | Per-request, per-model | Per-request, per-model |
| Budget hierarchy | Account-level billing alerts | Basic budget controls | 4-tier: Customer > Team > Virtual Key > Provider |
| Rate limiting | No native LLM rate limits | Basic rate limiting | Virtual Key + Provider Config levels, token and request limits |
| Reset durations | N/A | Limited options | 1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC) |
| Bedrock support | Native | Yes | Yes (provider type "bedrock") |
| Overhead | None | ~8ms (Python) | 11 microseconds (Go) |
| Deployment | N/A | Self-hosted or cloud | Self-hosted (runs in your VPC) |
| Language | N/A | Python | Go |
The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.
Setting Up Bifrost with AWS Bedrock
Bifrost runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.
Start the gateway:
```shell
npx -y @maximhq/bifrost
```
Full setup guide here.
Configure Bedrock as a provider:
```yaml
accounts:
  - id: "ml-team"
    providers:
      - id: "bedrock-claude"
        type: "bedrock"
        region: "us-east-1"
        model: "anthropic.claude-sonnet-4-20250514-v1:0"
        weight: 80
      - id: "bedrock-mistral"
        type: "bedrock"
        region: "us-west-2"
        model: "mistral.mistral-large-2407-v1:0"
        weight: 20
```
Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The provider configuration docs cover all Bedrock model formats and region options.
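To make the 80/20 split concrete, here is a minimal sketch of weight-proportional provider selection. This is illustrative only, not Bifrost's actual routing code; the `PROVIDERS` list mirrors the weights from the config above.

```python
import random

# Hypothetical provider table mirroring the 80/20 config above.
PROVIDERS = [
    {"id": "bedrock-claude", "weight": 80},
    {"id": "bedrock-mistral", "weight": 20},
]

def pick_provider(providers, rng=random):
    """Choose a provider with probability proportional to its weight."""
    total = sum(p["weight"] for p in providers)
    roll = rng.uniform(0, total)
    for p in providers:
        roll -= p["weight"]
        if roll <= 0:
            return p["id"]
    return providers[-1]["id"]  # guard against float rounding at the edge
```

Over many requests, roughly 80% land on `bedrock-claude` and 20% on `bedrock-mistral`, without any coordination between callers.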
Four-Tier Budget Hierarchy
This is where Bifrost separates itself from everything else I tested. The budget system has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.
```yaml
budgets:
  customer:
    - id: "acme-corp"
      limit: 5000
      period: "1M"
  team:
    - id: "ml-engineering"
      customer_id: "acme-corp"
      limit: 2000
      period: "1M"
  virtual_key:
    - id: "staging-key"
      team_id: "ml-engineering"
      limit: 500
      period: "1w"
  provider_config:
    - id: "bedrock-claude"
      limit: 1000
      period: "1M"
```
Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.
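The "all tiers must pass" rule can be sketched in a few lines. This is an illustration of the semantics, not Bifrost's implementation; the `Budget` class and the `spent` figures are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit: float  # dollars allowed per period
    spent: float  # dollars consumed so far in the current period

def request_allowed(cost, tiers):
    """A request passes only if every tier still has headroom for `cost`."""
    return all(b.spent + cost <= b.limit for b in tiers.values())

# Hypothetical mid-month state for the hierarchy configured above.
tiers = {
    "customer":        Budget(limit=5000, spent=3100),
    "team":            Budget(limit=2000, spent=1900),
    "virtual_key":     Budget(limit=500,  spent=120),
    "provider_config": Budget(limit=1000, spent=400),
}
```

With this state, a $50 request goes through, but a $150 request is blocked: the team tier would exceed its $2,000 cap even though every other tier has room.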
Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.
The governance docs have the full breakdown.
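As a rough sketch of what "calculated, not estimated" means, here is per-request cost arithmetic from token counts and a pricing table. The rates and the 10% cache discount below are invented for illustration; they are not real Bedrock prices or Bifrost's pricing data.

```python
# Hypothetical per-1K-token rates (input, output) -- NOT real prices.
PRICING = {
    "anthropic.claude-sonnet": (0.003, 0.015),
}

def request_cost(model, input_tokens, output_tokens, cached_input=0):
    """Compute dollars for one request from actual token usage."""
    in_rate, out_rate = PRICING[model]
    # Assume cached input tokens bill at 10% of the normal input rate.
    billable_in = (input_tokens - cached_input) + cached_input * 0.1
    return billable_in / 1000 * in_rate + output_tokens / 1000 * out_rate
```

A request with 1,000 input and 500 output tokens costs $0.0105 at these made-up rates; a fully cached prompt drops the input side to a tenth.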
Rate Limiting That Actually Works for LLMs
AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.
Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).
```yaml
rate_limits:
  virtual_key:
    - id: "staging-key"
      requests:
        limit: 100
        duration: "1h"
      tokens:
        limit: 50000
        duration: "1h"
  provider_config:
    - id: "bedrock-claude"
      requests:
        limit: 500
        duration: "1h"
```
Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.
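Calendar alignment is easy to get wrong, so here is a small sketch of what it implies for a few of the durations. This is my own illustration of the behavior described above, not Bifrost's code, and it only handles three of the durations.

```python
from datetime import datetime, timedelta, timezone

def next_reset(now, duration):
    """Next calendar-aligned UTC reset boundary for a few sample durations."""
    if duration == "1h":
        return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    if duration == "1d":
        # Midnight UTC, not 24 hours after the first request.
        midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
        return midnight + timedelta(days=1)
    if duration == "1M":
        # First of the next calendar month at 00:00 UTC.
        year, month = (now.year + 1, 1) if now.month == 12 else (now.year, now.month + 1)
        return now.replace(year=year, month=month, day=1,
                           hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"duration not handled in this sketch: {duration}")
```

For a request at 2025-03-15 10:30 UTC, `"1d"` resets at 2025-03-16 00:00 UTC and `"1M"` at 2025-04-01 00:00 UTC, regardless of when the counter started accumulating.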
Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from routing. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.
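The exclusion behavior amounts to skipping any provider whose counter has hit its limit and falling through to the next one. A toy sketch, again illustrative rather than Bifrost's implementation, with window resets omitted:

```python
def route(providers, counters):
    """providers: list of (id, request_limit); counters: id -> requests used.

    Returns the first provider with headroom, incrementing its counter.
    A rate-limited provider is simply skipped; traffic shifts to the rest.
    """
    for pid, limit in providers:
        used = counters.get(pid, 0)
        if used < limit:
            counters[pid] = used + 1
            return pid
    raise RuntimeError("all providers rate limited")
```

With limits of 2 and 3, the first two requests go to the primary provider; the third is routed to the secondary with no manual intervention.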
Observability at Sub-Millisecond Overhead
Every request through Bifrost is captured: tokens used, latency, cost, response status. The observability layer adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.
What makes this useful for AWS teams:
- 14+ API filter options for querying logs. Filter by model, provider, team, cost range, status code, time window.
- WebSocket live updates. Watch requests flow through in real time. Useful during load testing or incident debugging.
- Single pane across providers. If you are running Bedrock plus OpenAI or Gemini as failover, all logs are in one place.
Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralised view saves real time.
Honest Trade-offs
No tool solves everything. Here is what to know:
Bifrost is self-hosted only. You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.
LiteLLM has broader provider coverage. 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.
AWS native tools have zero overhead. If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.
Go vs Python matters at scale. Bifrost's 11-microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The benchmarks back this up: 5,000 RPS on a single instance.
Bifrost is a newer project. The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.
When to Use What
Stick with AWS native tools if: You have one team, one model, and just need billing alerts.
Consider LiteLLM if: You need maximum provider coverage and are comfortable with Python-based overhead.
Use Bifrost if: You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want semantic caching and automatic failover alongside cost controls.
Quick Start
```shell
# 1. Start Bifrost in your VPC
npx -y @maximhq/bifrost

# 2. Configure Bedrock providers in bifrost.yaml
# 3. Set budget and rate limit tiers

# 4. Point your application at the gateway
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
export ANTHROPIC_API_KEY=your-bifrost-virtual-key
```
Every Bedrock request now has cost tracking, rate limiting, and observability built in.
AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.
Check the repo if you want to dig into the source.