Debby McKinney

5 Ways to Track and Cut Your LLM API Costs Without Switching Models

TL;DR: Most teams overspend on LLM APIs because they have zero visibility into what is actually costing them money. Before you downgrade models or cut features, try these five strategies at the gateway layer: cost tracking per team/key, semantic caching for repeated queries, weighted routing to cheaper providers for simple tasks, budget caps that prevent runaway spend, and model-level cost analysis. None of these require changing your application code.


The Visibility Problem

Here is what a typical LLM API bill looks like: one line item. "$4,237.89 — OpenAI API usage."

That tells you nothing. Which team spent the most? Which feature is the most expensive? Is that $4,000 going to a RAG pipeline that could use a cheaper model, or to a customer-facing chatbot that genuinely needs GPT-4o?

Without per-team, per-feature cost tracking, you are optimizing blind. And most teams stay blind until the bill gets painful enough to investigate.

Strategy 1: Per-Key Cost Tracking

The first step is knowing where the money goes. Set up virtual API keys for each team, project, or feature.

Instead of one shared API key, give each team their own key that routes through a gateway. The gateway logs every request with the key ID, model used, tokens consumed, and cost.

With Bifrost, this looks like:

{
  "virtual_keys": [
    {
      "id": "team-search",
      "display_name": "Search Team",
      "allowed_models": ["gpt-4o", "gemini-2.5-flash"]
    },
    {
      "id": "team-chatbot",
      "display_name": "Chatbot Team",
      "allowed_models": ["claude-sonnet-4-6", "gpt-4o"]
    }
  ]
}

Now your cost dashboard shows: Search team spent $1,500 this month, Chatbot team spent $3,200. You know exactly where to focus optimization efforts.

The built-in dashboard at localhost:8080 shows per-key cost breakdowns, token usage, and request counts in real time. No third-party observability tool needed for basic cost tracking.
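To make the routing concrete, here is a minimal sketch of how an application might address the gateway with its team's virtual key. It assumes the gateway exposes an OpenAI-compatible `/v1/chat/completions` endpoint and identifies the caller by the virtual key sent as the bearer token; the `gateway_request` helper is illustrative, not part of any SDK.

```python
# Build a chat request that routes through the gateway.
# Assumption: the gateway speaks the OpenAI-compatible wire format and
# attributes cost to whichever virtual key arrives as the bearer token.

GATEWAY_URL = "http://localhost:8080/v1"

def gateway_request(virtual_key: str, model: str, prompt: str) -> dict:
    """Return the URL, headers, and body for a gateway-routed chat call."""
    return {
        "url": f"{GATEWAY_URL}/chat/completions",
        "headers": {"Authorization": f"Bearer {virtual_key}"},
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Every request made with this key shows up under the Search team's tab.
req = gateway_request("team-search", "gemini-2.5-flash", "reset password steps")
```

Because attribution rides on the key rather than on application code, swapping a shared key for per-team keys is a one-line change in each service's configuration.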

Before you even start optimizing, figure out your baseline. The LLM cost calculator helps you estimate what you should be paying based on your token volumes and model choices.

Strategy 2: Semantic Caching

This is the lowest-effort, highest-impact optimization for most apps.

If your users ask similar questions repeatedly, you are paying full price for every request even when the answer is the same. Semantic caching catches these.

How it works in Bifrost:

  1. Exact match cache (Layer 1): Request gets hashed. If an identical request was made before, return the cached response. Zero cost, instant response.

  2. Semantic similarity cache (Layer 2): If no exact match, check vector similarity via Weaviate. "How do I reset my password?" and "I forgot my password, how do I change it?" are semantically similar. If the similarity score is above your threshold, return the cached response.

For customer support bots, FAQ chatbots, and search-heavy applications, caching can cut 30-50% of requests before they ever hit the LLM provider. That is 30-50% off your bill, immediately.

One caveat: semantic caching requires a Weaviate instance for the vector similarity layer. The exact-match cache works out of the box. If you are not ready to set up Weaviate, start with exact-match caching alone; it still catches a surprising number of repeated queries.
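The two layers described above can be sketched in a few lines. This is an illustrative toy, not Bifrost's implementation: `embed` is a stand-in bag-of-letters embedding where a real deployment would use an embedding model and Weaviate, and the 0.9 threshold is an arbitrary example value.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.9  # example value; tune against your own traffic

def embed(text: str) -> list[float]:
    # Toy letter-frequency embedding; a real system uses a model + Weaviate.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TwoLayerCache:
    def __init__(self):
        self.exact = {}      # sha256(prompt) -> response
        self.semantic = []   # (embedding, response) pairs

    def get(self, prompt: str):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:                      # Layer 1: exact match
            return self.exact[key]
        query = embed(prompt)                      # Layer 2: similarity scan
        for vec, response in self.semantic:
            if cosine(query, vec) >= SIMILARITY_THRESHOLD:
                return response
        return None                                # miss: call the provider

    def put(self, prompt: str, response: str):
        self.exact[hashlib.sha256(prompt.encode()).hexdigest()] = response
        self.semantic.append((embed(prompt), response))
```

A cache hit at either layer means the request never reaches the provider, which is where the 30-50% savings comes from.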

Strategy 3: Weighted Routing to Cheaper Models

Not every request needs your most expensive model.

A quick classification, a simple extraction, a boilerplate generation task: these can run on Gemini Flash or Llama 3.3 at a fraction of GPT-4o's cost, with comparable output quality for that task.

Set up weighted routing to send a percentage of traffic to cheaper models:

{
  "accounts": [
    {
      "id": "gemini-cheap",
      "provider": "gemini",
      "model": "gemini-2.5-flash",
      "weight": 60
    },
    {
      "id": "openai-premium",
      "provider": "openai",
      "model": "gpt-4o",
      "weight": 40
    }
  ]
}

60% of requests go to Gemini Flash (cheaper), 40% to GPT-4o (more capable). Start with a conservative split and increase the cheaper model's weight as you verify quality holds up.
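Under the hood, weighted routing is just sampling from the cumulative weight distribution. A minimal sketch (illustrative, not Bifrost internals), using the same accounts as the config above:

```python
import random

ACCOUNTS = [
    {"id": "gemini-cheap", "model": "gemini-2.5-flash", "weight": 60},
    {"id": "openai-premium", "model": "gpt-4o", "weight": 40},
]

def pick_account(accounts: list[dict], rng=random.random) -> dict:
    """Choose an account with probability proportional to its weight."""
    total = sum(a["weight"] for a in accounts)
    roll = rng() * total
    upto = 0.0
    for account in accounts:
        upto += account["weight"]
        if roll < upto:
            return account
    return accounts[-1]  # guard against floating-point edge cases
```

Ratcheting the split is then just editing the weights: move 60/40 to 70/30 once your quality checks on the cheaper model's responses hold up.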

Want to compare exact pricing per model before deciding? The model library lists 1000+ models across all 19 supported providers with current pricing. You can also filter by provider: OpenAI models, Anthropic models, Gemini models.

Strategy 4: Budget Caps That Actually Work

"We will monitor costs" is not a strategy. You need hard limits.

Bifrost supports a four-tier budget hierarchy:

Organization ($50,000/month cap)
  └── Team: Backend ($15,000/month)
       └── Virtual Key: RAG Pipeline ($5,000/month, $250/day)
            └── Provider: OpenAI ($3,000/month)

Each level can have daily, weekly, or monthly caps. When a cap is hit, you configure the behavior:

  • Hard stop: Request fails with a budget exceeded error. Good for dev/staging environments.
  • Soft failover: Request automatically routes to a cheaper model. Good for production where you want cost control without downtime.
{
  "budget": {
    "monthly_limit_usd": 5000,
    "daily_limit_usd": 250,
    "on_budget_exceeded": "failover_to_cheaper"
  }
}

This prevents the classic disaster: one engineer runs an automated batch job over the weekend, nobody notices until Monday, and you have a $12,000 surprise on the invoice.
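The two cap behaviors reduce to a simple check before each request. This sketch is illustrative (the function and exception names are mine, not Bifrost's API), mirroring the `hard_stop` and `failover_to_cheaper` modes from the config above:

```python
# Illustrative budget-cap enforcement: given the spend so far on a
# virtual key, either pass the request through, hard-stop it, or
# fail over to a cheaper model.

class BudgetExceeded(Exception):
    pass

def enforce_budget(
    spent_usd: float,
    request_cost_usd: float,
    cap_usd: float,
    mode: str,              # "hard_stop" or "failover_to_cheaper"
    fallback_model: str,
    model: str,
) -> str:
    """Return the model to use, or raise if the cap is hit in hard-stop mode."""
    if spent_usd + request_cost_usd <= cap_usd:
        return model                                # under budget: proceed
    if mode == "hard_stop":
        raise BudgetExceeded(f"cap of ${cap_usd} reached")
    return fallback_model                           # soft failover: degrade, don't fail
```

In the four-tier hierarchy, the same check runs at every level; the request only proceeds on the original model if the key, team, organization, and provider caps all pass.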

For teams running Claude Code across multiple developers, this is critical. Each developer gets a virtual key with a daily budget. No single developer can burn through the team's allocation.

Strategy 5: Model-Level Cost Analysis

Once you have per-key tracking running, analyze which models are actually worth their price.

Look at your cost-per-task, not cost-per-token. If GPT-4o takes 500 tokens to answer a question and Gemini Flash takes 800 tokens but costs 1/5th per token, Gemini Flash is still cheaper for that task despite using more tokens.
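The arithmetic from that example, worked out. The per-million-token prices here are illustrative placeholders chosen to match the 1:5 ratio in the text, not current list prices; check the model library for real numbers.

```python
# Cost-per-task, not cost-per-token: a model that uses more tokens can
# still be cheaper per answered question.

def cost_per_task(tokens: int, price_per_million_tokens_usd: float) -> float:
    return tokens * price_per_million_tokens_usd / 1_000_000

gpt4o = cost_per_task(500, 10.00)  # 500 tokens at an assumed $10 / 1M tokens
flash = cost_per_task(800, 2.00)   # 800 tokens at 1/5th the per-token price
# flash comes out cheaper per task despite using 60% more tokens
```

Run the same comparison with your own observed tokens-per-task and real pricing before moving a workload.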

Track these metrics per virtual key:

  • Cost per request (not just tokens): includes input + output tokens at the model's actual pricing
  • Requests per day per model: identifies which models are used most
  • Cache hit rate: shows how much caching is saving you
  • Failed requests: failed requests still cost money on some providers (you pay for input tokens even if the response errors out)

Bifrost's dashboard at :8080 shows all of this. For deeper analysis, you can export logs and build custom dashboards, but the built-in one covers most teams' needs.

If you are deciding between gateways for cost management, the buyer's guide compares how different gateways handle cost tracking and budget controls.

What This Looks Like in Practice

A team we talked to was spending $8,000/month on OpenAI. After implementing these five strategies through a gateway:

  1. Per-key tracking revealed their search feature used 60% of the budget
  2. Semantic caching on the search feature cut its costs by 40%
  3. Weighted routing moved simple search queries to Gemini Flash
  4. Budget caps prevented a test environment from burning real money
  5. Cost analysis showed two internal tools could switch to Llama entirely

Estimated monthly spend dropped to around $3,500. No model downgrades on customer-facing features. No code changes in the application layer. All routing and caching handled at the gateway.

Getting Started

The fastest path:

# Install Bifrost
npx -y @maximhq/bifrost

# Open dashboard, configure providers
open http://localhost:8080

# Point your app to the gateway
# base_url = http://localhost:8080/v1

Set up virtual keys for each team. Enable exact-match caching. Add a monthly budget cap. That alone will give you visibility and prevent surprises.

Then iterate: add weighted routing, set up semantic caching with Weaviate, tune your model splits based on the cost data you are now collecting.

The docs cover the full setup. Source is on GitHub, MIT licensed, free to self-host.


If your team is also evaluating how well your LLM outputs hold up after switching models or adjusting prompts, Maxim AI handles AI evaluation and testing. Useful for verifying that cost optimizations do not degrade quality.
