Debby McKinney

Top AI Gateways for Semantic Caching + Dynamic Routing for AI Cost Optimization

If you are running LLM calls in production, you already know the bill adds up fast. Every API call costs tokens. Every token costs money. And a good chunk of those calls are either duplicates or close enough to duplicates that you are paying twice for the same answer.

There are two levers that actually reduce LLM API costs at scale: semantic caching and dynamic routing. Used together, they can cut your spend significantly without touching your application code.

Let me break down how each works, then compare the gateways that support them.

Semantic Caching: Stop Paying for the Same Answer Twice

Traditional caching uses exact string matching. If the same prompt comes in, you return the cached response. That works, but users rarely phrase things identically.

Semantic caching goes further. It checks whether the meaning of a new query is similar enough to a cached one. If someone asks "What is TCP?" and then "Explain TCP protocol", a semantic cache recognizes those as the same intent and returns the cached response.

The cost math is simple:

  • Direct cache hit (exact match): $0 cost. No API call at all.
  • Semantic cache hit: You pay only for the embedding comparison. That is a fraction of a cent versus the full generation cost.
  • Cache miss: Normal API call, response gets cached for next time.

If your application has any pattern of repeated or similar queries (and most do), semantic caching pays for itself almost immediately.
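To make the mechanism concrete, here is a minimal, self-contained sketch of a semantic cache. The `embed` function is a toy stand-in (a hashed bag-of-words vector) so the example runs without an API key; a real system would call an embedding model and a vector store instead. Names like `SemanticCache` and the 0.8 threshold are illustrative, not any gateway's API.

```python
import hashlib
import math

# Toy embedding: hashes words into a fixed-size bag-of-words vector.
# A production cache would call an embedding model instead.
def embed(text: str, dims: int = 64) -> list[float]:
    vec = [0.0] * dims
    for word in text.lower().replace("?", "").split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # semantic hit: you paid only for the embedding
        return None  # miss: make the full API call, then put() the result

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("What is TCP?", "TCP is a reliable transport protocol.")
print(cache.get("what is tcp"))  # similar wording, cached answer comes back
```

The key design point is the threshold: too low and unrelated queries get wrong cached answers, too high and you lose hits on paraphrases. Gateways that do this in production tune it per workload.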

Dynamic Routing: Use the Cheapest Provider That Works

Not every query needs your most expensive model. A simple classification task does not need GPT-4o. A straightforward summarization does not need Claude Opus.

Dynamic routing lets you distribute requests across providers based on weights, budgets, and fallback rules. You set a daily budget on your expensive provider, and when that budget runs out, requests automatically route to a cheaper alternative.

Combined with automatic failover on rate limits, 5xx errors, and timeouts, you get both cost optimization and reliability from the same configuration.
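The budget-plus-weights logic can be sketched in a few lines. This is a generic illustration of the technique, not any gateway's internals; the `Provider` fields and `pick_provider` function are made up for the example.

```python
import random

class Provider:
    def __init__(self, name, weight, daily_budget=None):
        self.name = name
        self.weight = weight
        self.daily_budget = daily_budget  # dollars; None means unlimited
        self.spent = 0.0

    def has_budget(self, est_cost):
        return self.daily_budget is None or self.spent + est_cost <= self.daily_budget

def pick_provider(providers, est_cost):
    # Keep only providers with budget headroom, then choose among
    # them proportionally to their configured weights.
    eligible = [p for p in providers if p.has_budget(est_cost)]
    if not eligible:
        raise RuntimeError("all providers over budget")
    total = sum(p.weight for p in eligible)
    r = random.uniform(0, total)
    for p in eligible:
        r -= p.weight
        if r <= 0:
            return p
    return eligible[-1]

primary = Provider("openai-primary", weight=80, daily_budget=100.0)
fallback = Provider("openai-fallback", weight=20)

primary.spent = 100.0  # primary has exhausted its $100 daily budget
chosen = pick_provider([primary, fallback], est_cost=0.02)
print(chosen.name)  # only the fallback is still eligible
```

Real gateways layer retries, rate-limit detection, and health checks on top of this core selection loop.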

Gateway Comparison

Bifrost

Bifrost is an open-source LLM gateway written in Go. It combines both semantic caching and dynamic routing in one tool.

Caching: Dual-layer system. The first layer is exact hash matching for identical prompts. The second layer is semantic similarity matching using Weaviate as the vector store. You get the speed of exact matching when it applies, and the intelligence of semantic matching when it does not.
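The lookup order of a dual-layer cache can be sketched as follows. Here `semantic_lookup` stands in for a vector-store query (Bifrost uses Weaviate for that layer); the function names are illustrative, not Bifrost's API.

```python
import hashlib

def lookup(prompt, exact_cache, semantic_lookup):
    # Layer 1: exact hash match for byte-identical prompts (cheapest).
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Layer 2: semantic similarity search; returns None on a miss.
    return semantic_lookup(prompt)
```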

Routing: Weighted load balancing across providers with automatic failover. You can set budgets at four levels: Customer, Team, Virtual Key, and Provider Config. Budget reset frequencies go from 1 minute to 1 month (1m, 1h, 1d, 1w, 1M).

Here is what a budget-based routing config looks like in practice. Set a daily budget on your primary (expensive) provider:

{
  "providers": [
    {
      "name": "openai-primary",
      "provider": "openai",
      "apiKey": "sk-xxxxx",
      "weight": 80,
      "budget": {
        "amount": 100,
        "reset": "1d"
      }
    },
    {
      "name": "openai-fallback",
      "provider": "openai",
      "apiKey": "sk-yyyyy",
      "weight": 20
    }
  ]
}

When the primary provider hits its $100 daily budget, Bifrost automatically routes to the fallback. No code changes. No downtime.

Performance: 11µs latency overhead per request. 5,000 RPS sustained throughput. That is 50x faster than Python-based alternatives. The Go runtime makes a measurable difference when you are processing thousands of requests.

Setup: npx -y @maximhq/bifrost or Docker. Zero-config to start, full config through Web UI or JSON.

Portkey

Portkey is a commercial AI gateway that offers both caching and routing.

Caching: Supports simple caching and semantic caching. The semantic cache uses embeddings to match similar queries. You enable it by adding a cache header to your requests.

Routing: Supports load balancing, fallbacks, and conditional routing. You can route based on metadata, and it supports canary deployments for testing new models.

Tradeoffs: Portkey is a hosted service with a free tier and paid plans. If you need full control over your data and infrastructure, you would need their enterprise plan. It is not open source.

LiteLLM

LiteLLM is a Python-based open-source proxy that focuses on routing.

Caching: Supports Redis-based caching for exact matches. Semantic caching is available but is a secondary feature, not the primary focus.

Routing: This is where LiteLLM is strong. It supports 100+ LLM providers with a unified interface. You can set budgets, rate limits, and fallbacks.
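For reference, a minimal LiteLLM proxy config with a fallback route and Redis caching looks roughly like this. Field names follow LiteLLM's documented config format, but check the current docs before relying on it:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - gpt-4o: [gpt-4o-mini]   # route to the cheaper model on failure

litellm_settings:
  cache: true                  # Redis-backed exact-match caching
  cache_params:
    type: redis
```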

Tradeoffs: Written in Python, so you are looking at roughly 8ms of overhead per request compared to Bifrost's 11µs. For low-volume use cases that does not matter. At scale, it adds up. LiteLLM is a good choice if routing flexibility across many providers is your primary need and caching is secondary.

Helicone

Helicone is primarily an LLM observability and logging platform.

Caching: Supports response caching with a simple header toggle. Focused on exact match caching rather than semantic similarity.

Routing: Helicone is not a routing gateway. Its strength is in logging, analytics, and cost tracking. You would use it alongside a gateway, not as a replacement.

Tradeoffs: If your main need is visibility into LLM usage and costs, Helicone is excellent. But for caching and routing as cost optimization tools, you would need to pair it with another solution.

Which One Should You Pick?

It depends on what you need most:

  • Semantic caching + dynamic routing + performance: Bifrost. Open source, Go-based, dual-layer caching with Weaviate, full budget hierarchy. Best fit if you want both cost levers in one tool at high throughput.
  • Routing across 100+ providers: LiteLLM. Widest provider support, Python-based.
  • Hosted solution with managed caching and routing: Portkey. Commercial product with a polished UI.
  • Observability and cost tracking: Helicone. Pair it with one of the above.

If you are optimizing for cost, you want both caching and routing working together. A request that hits the semantic cache costs nearly nothing. A request that misses the cache gets routed to the most cost-effective provider. That combination is where the real savings come from.

Start with Bifrost if you want to try both in one setup: https://git.new/bifrost
