DEV Community

Debby McKinney

Best AWS Gateway for Tracking LLM Costs and Rate Limits

TL;DR: If you are running LLM workloads on AWS (Bedrock, SageMaker, or calling external APIs from EC2/Lambda), you probably do not have great visibility into per-team costs or rate limit management. Here is a look at gateway options that solve this, with a focus on what actually works for AWS-heavy setups.

The Problem with LLM Cost Tracking on AWS

If you are using AWS Bedrock, your cost tracking options are limited. CloudWatch gives you invocation counts and latency. AWS Cost Explorer shows aggregate Bedrock spend. But neither gives you:

  • Per-team or per-application cost breakdowns
  • Real-time budget enforcement (not after-the-fact alerts)
  • Rate limiting per user or per service
  • Unified view when you also use OpenAI or Anthropic directly

Most teams figure this out after the first surprise bill.

What to Look for in an LLM Gateway for AWS

A good gateway for AWS LLM workloads should handle:

  1. Cost tracking per team/service: Not just total spend, but who is spending what
  2. Budget enforcement: Hard caps that stop requests when limits are hit
  3. Rate limiting: Per-user, per-team, and per-provider throttling
  4. Multi-provider support: Because most teams use Bedrock AND direct API calls
  5. Low overhead: Your gateway should not become the bottleneck

Option 1: AWS API Gateway + Custom Lambda

You can build cost tracking yourself using API Gateway as a proxy, Lambda for request processing, and DynamoDB for tracking.
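A minimal sketch of the tracking logic such a Lambda might run. The pricing table, model name, and DynamoDB table name are illustrative assumptions, not real AWS pricing:

```python
# Illustrative per-1K-token prices -- real Bedrock pricing varies by model and region.
PRICING = {
    "anthropic.claude-3-sonnet": {"input": 0.003, "output": 0.015},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the dollar cost of one invocation from token counts."""
    p = PRICING[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

def record_usage(team: str, model: str, input_tokens: int, output_tokens: int) -> float:
    """Accumulate spend per team. A real Lambda would do an atomic
    DynamoDB update here rather than just returning the cost."""
    cost = request_cost(model, input_tokens, output_tokens)
    # import boto3  # in a real Lambda:
    # boto3.resource("dynamodb").Table("llm-usage").update_item(
    #     Key={"team": team},
    #     UpdateExpression="ADD spend :c",
    #     ExpressionAttributeValues={":c": Decimal(str(cost))},
    # )
    return cost
```

Even this toy version shows the catch: you also have to parse token counts out of every provider's response format yourself, which is exactly the LLM-aware plumbing a gateway gives you for free.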

Pros:

  • Fully within AWS ecosystem
  • You control everything

Cons:

  • You have to build and maintain everything
  • Lambda cold starts add latency
  • No built-in LLM-aware features (token counting, model pricing)
  • Cost tracking logic is your responsibility

This works for teams with dedicated platform engineering resources. For most teams, it is more effort than the problem is worth.

Option 2: Bifrost (Open Source, Self-Hosted)

Bifrost is an open-source LLM gateway written in Go. It supports Bedrock natively alongside 20+ other providers.

What it does for AWS cost tracking:

The four-tier budget hierarchy is where Bifrost stands out:

  • Customer level: Total organization budget
  • Team level: Per-team spending caps (e.g., engineering gets $500/month, marketing gets $200/month)
  • Virtual Key level: Per-application or per-service budgets with configurable reset durations
  • Provider Config level: Per-provider rate limits

When a budget is hit at any level, the gateway enforces it. If your Bedrock budget runs out, requests can automatically fall back to a cheaper provider or stop entirely. This is real-time enforcement, not an alert you see the next day.
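Conceptually, the enforcement is a walk up the hierarchy before each request is forwarded. This is a generic sketch of that logic, not Bifrost's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit: float      # dollar cap for the period
    spent: float = 0.0

def allow_request(chain: list[Budget], estimated_cost: float) -> bool:
    """Check every level (customer -> team -> virtual key) before forwarding.
    One exhausted level is enough to block or reroute the request."""
    return all(b.spent + estimated_cost <= b.limit for b in chain)

def record_spend(chain: list[Budget], cost: float) -> None:
    """After a successful request, charge the cost at every level."""
    for b in chain:
        b.spent += cost

# Example: org $1000/month, engineering team $500, one service key $100
customer, team, vkey = Budget(1000), Budget(500), Budget(100)
chain = [customer, team, vkey]
assert allow_request(chain, 5.0)       # all levels have headroom
vkey.spent = 99.0
assert not allow_request(chain, 5.0)   # the key budget is exhausted first
```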

Rate limiting:

Bifrost handles rate limiting at the Virtual Key level:

  • Token-based limits (max tokens per period)
  • Request-based limits (max requests per period)
  • Configurable reset durations (per minute, hour, day, week, month)

If a provider config exceeds its rate limits, that provider is excluded from routing. Other providers stay available.
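The mechanics behind combined request and token limits can be sketched as a fixed-window counter. This is an illustration of the concept, not Bifrost's internals:

```python
class WindowLimiter:
    """Fixed-window limiter with both a request cap and a token cap,
    resetting every period_s seconds (minute/hour/day/week/month in practice)."""

    def __init__(self, max_requests: int, max_tokens: int, period_s: float, start: float = 0.0):
        self.max_requests, self.max_tokens = max_requests, max_tokens
        self.period_s = period_s
        self.window_start = start
        self.requests = self.tokens = 0

    def allow(self, tokens: int, now: float) -> bool:
        """Pass time.monotonic() as `now` in real use; explicit here for clarity."""
        if now - self.window_start >= self.period_s:  # window rolled over: reset counters
            self.window_start, self.requests, self.tokens = now, 0, 0
        if self.requests + 1 > self.max_requests or self.tokens + tokens > self.max_tokens:
            return False                              # over either cap: exclude from routing
        self.requests += 1
        self.tokens += tokens
        return True

# 2 requests / 100 tokens per 60-second window
lim = WindowLimiter(max_requests=2, max_tokens=100, period_s=60)
assert lim.allow(40, now=0.0)
assert lim.allow(40, now=1.0)
assert not lim.allow(40, now=2.0)   # third request exceeds the request cap
assert lim.allow(40, now=61.0)      # new window, counters reset
```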

AWS Bedrock setup:

{
  "providers": {
    "bedrock": {
      "keys": [{
        "name": "bedrock-1",
        "value": "env.AWS_ACCESS_KEY",
        "models": ["anthropic.claude-3-sonnet"],
        "weight": 0.7
      }]
    },
    "openai": {
      "keys": [{
        "name": "openai-1",
        "value": "env.OPENAI_API_KEY",
        "models": ["gpt-4o-mini"],
        "weight": 0.3
      }]
    }
  }
}


Use provider-prefixed model names in your requests:

bedrock/anthropic.claude-3-sonnet
openai/gpt-4o-mini

Bifrost handles the authentication, request format translation, and cost logging for each provider.
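Calling the gateway from Python then looks like an ordinary OpenAI-style chat request. The base URL, port, and Authorization header below are assumptions for illustration (check the Bifrost docs for the actual endpoint and auth scheme); the part that matters is the provider-prefixed model name:

```python
import json
import urllib.request  # used only by the commented-out send below

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"  # assumed gateway address

def build_chat_request(model: str, prompt: str) -> dict:
    """OpenAI-style payload; the provider prefix tells the gateway where to route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("bedrock/anthropic.claude-3-sonnet", "Summarize our Q3 numbers.")

# Uncomment to actually send the request through the gateway:
# req = urllib.request.Request(
#     GATEWAY_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer YOUR_VIRTUAL_KEY"},  # hypothetical virtual key auth
# )
# print(urllib.request.urlopen(req).read().decode())
```

Swapping the model string to openai/gpt-4o-mini reroutes the same request to OpenAI with no other code changes.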

Performance: 11µs overhead per request, 5,000 RPS sustained throughput. Self-hosted, so your data stays within your AWS VPC. That matters for compliance.

Cost tracking:

The Model Catalog tracks pricing across all providers automatically. Every request is logged with token counts and calculated cost. You get one dashboard for Bedrock, OpenAI, Anthropic, and any other provider you configure.

Semantic caching:

The cache layer (Weaviate-backed) can reduce costs further by serving cached responses for similar queries. It is dual-layer: an exact hash match is tried first, then semantic similarity.

Option 3: Build on CloudWatch + Cost Explorer

If you just want visibility (not enforcement), you can set up CloudWatch dashboards for Bedrock metrics and use AWS Cost Explorer with tags.
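For example, the AWS/Bedrock namespace publishes InputTokenCount and OutputTokenCount metrics per model, which you can pull with boto3 (credentials and region assumed configured; the summing helper is just for convenience):

```python
from datetime import datetime, timedelta, timezone

def bedrock_input_tokens(model_id: str, days: int = 7) -> list[dict]:
    """Fetch daily InputTokenCount datapoints for one Bedrock model."""
    import boto3  # lazy import so the module loads even without boto3 installed
    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName="InputTokenCount",
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,             # one datapoint per day
        Statistics=["Sum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

def total_tokens(datapoints: list[dict]) -> float:
    """Sum the daily totals into one number."""
    return sum(d["Sum"] for d in datapoints)
```

Note what this gives you: a count of tokens already consumed. Turning that into dollars, attributing it to a team, or stopping the next request is all still on you.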

Pros:

  • No additional infrastructure
  • Native AWS tooling

Cons:

  • No real-time budget enforcement
  • No per-user or per-team granularity without custom tagging
  • Does not cover non-AWS providers
  • No rate limiting beyond Bedrock's built-in throttling

Comparison

| Feature                | API Gateway + Lambda | Bifrost             | CloudWatch     |
|------------------------|----------------------|---------------------|----------------|
| Per-team cost tracking | Build yourself       | Built-in            | Manual tagging |
| Real-time budget caps  | Build yourself       | Built-in            | No             |
| Rate limiting          | Build yourself       | Built-in            | Bedrock only   |
| Multi-provider         | Build yourself       | 20+ providers       | AWS only       |
| Overhead               | Lambda cold starts   | 11µs                | N/A            |
| Maintenance            | High                 | Low (single binary) | Low            |
| Self-hosted            | Yes                  | Yes                 | N/A            |
| Open source            | Your code            | Yes                 | No             |

Recommendation

If you need actual cost enforcement and rate limiting (not just monitoring), Bifrost is the most practical option for AWS-heavy teams. It is self-hosted, so it runs inside your VPC. The budget hierarchy maps well to how engineering organizations are structured. And it covers both AWS and non-AWS providers.

If you only need visibility and are fine with after-the-fact cost analysis, CloudWatch and Cost Explorer work without additional infrastructure.
