Your team ships a feature using GPT-4. It works great in staging, and then production traffic hits. Suddenly you are burning through API credits faster than anyone expected. Multiply that across three providers, five teams, and a few hundred thousand requests per day, and good luck figuring out where the money went.
We built Bifrost, an open-source LLM gateway in Go, and cost tracking was one of the first problems we had to solve properly. This post covers what we learned, how we designed spend management into the gateway layer, and what the alternatives look like. You can get started with the setup guide in under a minute.
TL;DR: Bifrost gives you per-request cost logging, four-tier budget hierarchies (Customer, Team, Virtual Key, Provider Config), auto-synced model pricing, and cache-aware cost calculations, all at about 11 microseconds of latency overhead. You can run it right now with npx -y @maximhq/bifrost. Full docs here.
The actual problem with LLM costs
Cloud compute costs are predictable. You pick an instance type, you know the hourly rate, you can forecast monthly spend within a few percent.
LLM costs are nothing like that.
A single API call costs somewhere between $0.0001 and $0.50 depending on the model, the input length, the output length, whether you are sending images or audio, and whether the context crosses the 128k token threshold (where pricing tiers change). That is per request.
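To make that spread concrete, here is a back-of-envelope sketch of per-token pricing. The rates below are illustrative placeholders, not live provider prices, and the helper is a simplification that ignores tiered and multi-modal pricing:

```python
# Back-of-envelope LLM call cost: tokens * per-token rate.
# Rates are illustrative placeholders, not current provider pricing.

def call_cost(input_tokens, output_tokens, in_price_per_1m, out_price_per_1m):
    """Cost in dollars for one request under simple per-token pricing."""
    return (input_tokens * in_price_per_1m
            + output_tokens * out_price_per_1m) / 1_000_000

# A short classification call on a cheap model...
cheap = call_cost(200, 10, in_price_per_1m=0.15, out_price_per_1m=0.60)
# ...versus a long-context call on a frontier model.
expensive = call_cost(120_000, 4_000, in_price_per_1m=2.50, out_price_per_1m=10.00)

print(f"{cheap:.6f}")   # fractions of a cent
print(f"{expensive:.2f}")  # tens of cents
```

Same API shape, four orders of magnitude apart in cost, which is why aggregate dashboards hide so much.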
Now add multi-provider routing. Your app might use OpenAI for chat, Anthropic for analysis, and a smaller model for classification. Each provider has different pricing structures, different token counting methods, and different billing cycles.
The result: engineering teams have no idea what they are spending until the invoice arrives.
What cost tracking actually requires
Most teams start with "we will check the provider dashboard." That breaks down fast for three reasons.
Per-request granularity. You need to know the cost of every single API call, tied to which customer, which team, and which feature triggered it. Provider dashboards give you aggregate numbers, not per-request attribution.
Real-time budget enforcement. Knowing you overspent last month does not help. You need the system to reject requests when a budget limit is hit, before the money is gone.
Multi-modal cost calculation. If your app sends images, audio, or very long contexts, the cost calculation is not a simple token multiplication. You need tiered pricing support, per-image costs, per-second audio costs, and character-based pricing for certain models.
How we built cost tracking in Bifrost
We wanted cost management to be a gateway-level concern, not something each application team has to implement. Here is how the pieces fit together.
Model Catalog with auto-synced pricing
The Model Catalog is the foundation. It maintains pricing data for every supported model across all providers. You can also force a pricing sync at any time via the API.
On startup, Bifrost downloads the latest pricing sheet and loads it into memory. When a ConfigStore (SQLite or PostgreSQL) is available, it also persists the data and re-syncs every 24 hours automatically. All lookups are O(1) from memory.
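The shape of that design can be sketched in a few lines. This is an illustrative model of the idea, not Bifrost's actual internals: pricing rows keyed by (provider, model) in a dict so lookups stay O(1), plus a timestamp that drives the 24-hour re-sync:

```python
import time

# Illustrative sketch of an in-memory pricing catalog with O(1) lookups.
# Field names and structure are assumptions, not Bifrost's real code;
# only the 24h re-sync interval comes from the post.
SYNC_INTERVAL_SECONDS = 24 * 60 * 60

class ModelCatalog:
    def __init__(self):
        self._prices = {}      # (provider, model) -> pricing row
        self._last_sync = 0.0

    def load(self, rows):
        """Replace the in-memory table with a freshly downloaded pricing sheet."""
        self._prices = {(r["provider"], r["model"]): r for r in rows}
        self._last_sync = time.time()

    def needs_sync(self):
        return time.time() - self._last_sync > SYNC_INTERVAL_SECONDS

    def lookup(self, provider, model):
        """O(1) dict lookup; raises KeyError for unknown models."""
        return self._prices[(provider, model)]

catalog = ModelCatalog()
catalog.load([{"provider": "openai", "model": "gpt-4o",
               "input_per_1m": 2.50, "output_per_1m": 10.00}])
print(catalog.lookup("openai", "gpt-4o")["input_per_1m"])
```

Keeping the whole table in memory is what makes per-request cost lookup cheap enough to sit in the hot path.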
The pricing data covers multiple modalities:
- Text: token-based and character-based pricing for chat completions, text completions, and embeddings
- Audio: token-based and duration-based pricing for speech synthesis and transcription
- Images: per-image costs with tiered pricing for high-token contexts
- Tiered pricing: automatic rate changes above 128k tokens, reflecting actual provider pricing
This means cost calculation is accurate for every request type, not an approximation based on token count alone.
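As a worked example of the tiered case: tokens up to the threshold bill at the base rate, tokens beyond it at the higher-tier rate. The 128k threshold comes from the post; the rates here are made up:

```python
# Tiered input-token pricing: tokens up to the threshold bill at the base
# rate, tokens beyond it at the above-threshold rate. Rates are illustrative.
TIER_THRESHOLD = 128_000

def tiered_input_cost(tokens, base_per_1m, above_per_1m):
    below = min(tokens, TIER_THRESHOLD)
    above = max(tokens - TIER_THRESHOLD, 0)
    return (below * base_per_1m + above * above_per_1m) / 1_000_000

# 200k-token context: first 128k at $1.25/1M, remaining 72k at $2.50/1M.
print(tiered_input_cost(200_000, base_per_1m=1.25, above_per_1m=2.50))
```

A naive `tokens * base_rate` calculation undercounts every long-context request, which is exactly the kind of request that costs the most.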
Four-tier budget hierarchy
This is where spend management happens. Bifrost supports budgets at four levels:
- Customer - set a spending cap for an entire customer account
- Team - limit spend per team within a customer
- Virtual Key - control costs per API key (useful for per-feature or per-environment budgets)
- Provider Config - cap total spend on a specific provider
Each budget has a max_limit, a reset_duration (daily, weekly, monthly), and tracks current_usage in real time.
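The enforcement logic amounts to a pre-flight check across every tier a request maps to. Here is a minimal sketch under assumed names (the class and function here are illustrative, not Bifrost's implementation): a request is allowed only when the virtual key, team, customer, and provider config budgets all have headroom, and on success the cost is charged to all of them:

```python
# Sketch of a hierarchical pre-flight budget check. Structure and names
# are illustrative assumptions, not Bifrost's actual code; the fields
# (max_limit, reset_duration, current_usage) come from the post.

class Budget:
    def __init__(self, max_limit, reset_duration):
        self.max_limit = max_limit
        self.reset_duration = reset_duration  # "daily" | "weekly" | "monthly"
        self.current_usage = 0.0

    def allows(self, estimated_cost):
        return self.current_usage + estimated_cost <= self.max_limit

    def record(self, cost):
        self.current_usage += cost

def check_and_record(budgets, estimated_cost):
    """Reject if any tier would exceed its cap; otherwise charge all tiers."""
    if any(not b.allows(estimated_cost) for b in budgets):
        return False
    for b in budgets:
        b.record(estimated_cost)
    return True

customer = Budget(500, "monthly")
team = Budget(100, "monthly")
vkey = Budget(10, "daily")

print(check_and_record([customer, team, vkey], 0.25))  # all tiers have room
vkey.current_usage = 9.9
print(check_and_record([customer, team, vkey], 0.25))  # key cap would be hit
```

Note that the most restrictive tier wins: a request can be within the customer budget and still be rejected because its virtual key is out of headroom.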
Here is what creating a customer with a budget looks like via the API:
curl --request POST \
--url http://localhost:8080/api/governance/customers \
--header 'Content-Type: application/json' \
--data '{
"name": "acme-corp",
"budget": {
"max_limit": 500,
"reset_duration": "monthly"
}
}'
The response includes the budget object with current_usage tracked automatically:
{
"customer": {
"id": "cust-abc123",
"name": "acme-corp",
"budget": {
"id": "bdgt-xyz",
"max_limit": 500,
"reset_duration": "monthly",
"current_usage": 0
}
}
}
When current_usage hits max_limit, requests are rejected. No surprises on the invoice.
LogStore: per-request cost audit trail
Every request that passes through Bifrost gets logged with full cost data. The LogStore captures:
- Provider and model used
- Input tokens, output tokens, total tokens
- Calculated cost (broken down into input cost, output cost, request cost, total cost)
- Latency
- Status (success or error)
- Timestamps
- Full input/output content (serialized as JSON)
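Conceptually, each log entry is one flat record carrying that cost breakdown. A sketch with an assumed schema (field names here are illustrative; Bifrost's actual storage format may differ):

```python
import json
import time

# Sketch of a per-request log record with the fields the LogStore captures.
# The schema is an illustrative assumption, not Bifrost's real format.

def make_log_entry(provider, model, input_tokens, output_tokens,
                   input_cost, output_cost, latency_ms, status, payload):
    return {
        "provider": provider,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
        "latency_ms": latency_ms,
        "status": status,                # "success" | "error"
        "timestamp": time.time(),
        "content": json.dumps(payload),  # full input/output, serialized
    }

entry = make_log_entry("openai", "gpt-4o", 1200, 300,
                       0.003, 0.003, 842, "success",
                       {"input": "...", "output": "..."})
print(entry["total_tokens"], round(entry["total_cost"], 6))
```

Having cost as a first-class column, rather than something recomputed later from token counts, is what makes the filtered queries below cheap.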
You can query this data with filters. Want to see all requests to OpenAI that cost more than $0.10 in the last hour? That is a single API call.
curl --request POST \
--url http://localhost:8080/api/logs/search \
--header 'Content-Type: application/json' \
--data '{
"filters": {
"providers": ["openai"],
"min_cost": 0.10,
"start_time": "2026-03-31T00:00:00Z"
},
"pagination": {
"limit": 50,
"sort_by": "cost",
"order": "desc"
}
}'
The response includes aggregated stats alongside individual logs: total requests, success rate, average latency, total tokens, and total cost for the query. This is the data you need for cost attribution and chargeback.
Getting started
You can have this running in under a minute:
npx -y @maximhq/bifrost
Or with Docker if you prefer containerized deployment. Then point your LLM calls at the Bifrost endpoint instead of directly at the provider — it works as a drop-in replacement for the OpenAI SDK, Anthropic SDK, and Bedrock SDK. Cost tracking, budget enforcement, and logging happen automatically.
Check the setup docs for configuration details.
Cache-aware cost tracking
This is a detail that matters more than you would expect.
Bifrost includes a dual-layer semantic cache (exact hash matching + semantic similarity via Weaviate). When a request hits the cache, the cost calculation changes:
- Direct cache hit (exact match): zero cost. The response comes from cache, no provider API call is made.
- Semantic cache hit (similar query found): the cost is the embedding generation cost only. No model inference cost.
- Cache miss with storage: the cost is the base model usage plus the embedding generation cost for storing the result.
If you are not tracking cache-aware costs, your cost reports will overcount. Every cache hit that gets reported at full model price inflates your numbers and hides the ROI of caching.
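The attribution rule boils down to a three-way branch on the cache outcome. A minimal sketch, with illustrative names and amounts:

```python
# Cache-aware cost attribution: the billed cost depends on how the request
# resolved. Function and outcome names are illustrative, not Bifrost's API.

def request_cost(outcome, model_cost=0.0, embedding_cost=0.0):
    if outcome == "direct_hit":       # exact match: no provider call at all
        return 0.0
    if outcome == "semantic_hit":     # similar query found: embedding only
        return embedding_cost
    if outcome == "miss_with_store":  # full inference + embedding for storage
        return model_cost + embedding_cost
    raise ValueError(f"unknown outcome: {outcome}")

print(request_cost("direct_hit", model_cost=0.12, embedding_cost=0.0001))
print(request_cost("semantic_hit", embedding_cost=0.0001))
print(request_cost("miss_with_store", model_cost=0.12, embedding_cost=0.0001))
```

Reporting every hit at full model price would make the first two cases look as expensive as the third, which is precisely the overcounting described above.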
How other tools handle cost tracking
Credit where it is due. There are several tools in this space, and they each take a different approach.
Helicone is a proxy-based observability platform. It logs requests and provides cost analytics through a dashboard. The cost tracking is solid, with per-request granularity. Where it differs from Bifrost: Helicone is primarily an observability tool. Budget enforcement and cache-aware cost calculations are not its focus. It is a good choice if you want analytics without gateway-level controls.
OpenRouter acts as a unified API layer across multiple LLM providers. It handles routing and gives you a single bill, which simplifies accounting. However, OpenRouter is a hosted proxy — your requests pass through their infrastructure. There is no self-hosted option, no budget enforcement at the gateway level, and no per-customer or per-team spend hierarchy. If you need cost attribution beyond "which model was called," you will need to build that yourself on top of their logs.
AWS API Gateway + Bedrock is what many AWS-native teams reach for. You get IAM-based access control and CloudWatch metrics. The limitation is that cost tracking is coarse-grained — you get aggregate billing through AWS Cost Explorer, not per-request cost breakdowns tied to your internal teams or customers. Building a four-tier budget hierarchy on top of AWS services means stitching together Lambda, DynamoDB, and custom billing logic. It works but it is a lot of glue code.
Kong AI Gateway and Cloudflare AI Gateway both provide rate limiting and basic analytics for AI API traffic. Kong gives you plugin-based extensibility, and Cloudflare gives you edge caching and DDoS protection. Neither provides built-in per-request cost calculation with multi-modal pricing awareness, and neither offers the kind of budget hierarchy where you can set spending caps at the customer, team, and key level with automatic enforcement.
LiteLLM is the most well-known Python-based proxy. It supports cost tracking and has wide model coverage. The trade-off is performance: LiteLLM adds roughly 8ms of latency overhead per request, while Bifrost adds 11 microseconds, roughly 700x less. At 5,000 RPS, that difference compounds. If your use case is low-throughput internal tooling, LiteLLM works fine. If you are running production workloads at scale, the latency overhead matters.
The math is straightforward: at 5,000 requests per second, 8ms overhead means 40 seconds of cumulative latency overhead per second of wall time. At 11 microseconds, it is 0.055 seconds.
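The arithmetic, restated as one line of code (same numbers as above):

```python
# Cumulative added latency per second of wall time = RPS * per-request overhead.
rps = 5_000
print(round(rps * 0.008, 3))     # 8 ms per request
print(round(rps * 0.000011, 3))  # 11 microseconds per request
```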
What we learned building this
A few things surprised us during development.
Pricing data goes stale fast. Providers update pricing regularly. We started with a static pricing file and quickly realized it needed to be auto-synced. The 24-hour sync interval with O(1) memory lookups was the balance we settled on. You can also trigger a manual pricing sync via POST /api/pricing/force-sync if a provider drops prices and you want immediate accuracy.
Budget enforcement needs to be in the hot path. We tried implementing budgets as an async check initially. The problem: by the time the async check ran, the request was already sent to the provider and the cost was incurred. Budget checks have to happen before the request goes upstream. That is why Bifrost handles it at the gateway layer with in-memory state.
Multi-modal cost calculation is harder than it looks. Text-only cost is straightforward: multiply tokens by price per token. But when a request includes images, the cost depends on the image resolution and the token context length. Audio adds per-second pricing. Some models charge per character instead of per token. The Model Catalog handles all of this, but getting it right required modelling each provider's pricing structure individually.
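To show why this is more than token arithmetic, here is a hedged sketch of a mixed-modality cost function. All rates and parameter names are illustrative placeholders, not any provider's real pricing:

```python
# Multi-modal cost sketch: one request can mix per-token, per-image, and
# per-second audio pricing. All rates are illustrative placeholders.

def multimodal_cost(text_tokens=0, images=0, audio_seconds=0.0,
                    token_per_1m=2.50, per_image=0.004,
                    per_audio_second=0.0001):
    return (text_tokens * token_per_1m / 1_000_000
            + images * per_image
            + audio_seconds * per_audio_second)

# A request with 3k tokens of text, two images, and 30 seconds of audio.
print(round(multimodal_cost(text_tokens=3_000, images=2, audio_seconds=30), 6))
```

Real providers complicate each term further (image cost varies with resolution, some models bill per character), which is why the post argues each provider's pricing structure has to be modelled individually.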
Cost attribution needs hierarchy. Flat per-key budgets are not enough for real organizations. An engineering team needs to know: "How much is Customer X spending? How much of that is Team Y? Which virtual key is burning through budget?" That is why we built the four-tier hierarchy (Customer, Team, Virtual Key, Provider Config). You can create virtual keys via the API and attach budgets to each level.
Wrapping up
LLM cost management is not optional for production systems. If you are routing requests across multiple providers without per-request cost tracking, budget enforcement, and cache-aware calculations, you are flying blind. For enterprise teams, Bifrost also supports audit logs, log exports, and intelligent load balancing.
Bifrost is open-source, written in Go, and runs with a single command. It handles cost tracking at the gateway layer so your application code does not have to.
If you are dealing with LLM spend management, give it a try and let us know what is missing. We are actively building based on what teams actually need.