TL;DR: Semantic caching intercepts LLM API calls and returns cached responses for similar queries, skipping the provider entirely. Zero tokens consumed on cache hits. I set this up with Bifrost and Weaviate in under 30 minutes and it started saving tokens on the first day.
What We Are Building
A semantic cache layer that sits between your application and LLM providers. Every API call passes through the cache first. If the query matches a previous one (exact match or semantically similar), the cached response is returned instantly. No LLM call, no tokens billed.
Here is the flow:
App -> Bifrost Gateway -> [Cache Check] -> Hit?  -> Return cached response (0 tokens)
                                        -> Miss? -> Forward to LLM provider -> Cache response -> Return
The end result: repeated and similar queries cost nothing. For workloads with common patterns (customer support, code generation, FAQ bots), the savings add up fast.
Prerequisites
You need four things:
- Docker and Docker Compose installed (docs)
- Weaviate as the vector store for semantic similarity matching
- Bifrost as the LLM gateway with caching enabled
- At least one LLM provider API key (OpenAI, Anthropic, etc.)
Everything runs locally. No cloud accounts needed beyond your LLM provider key.
Step 1: Deploy Weaviate for Vector Storage
Weaviate stores the vector embeddings that power semantic matching. When a new query comes in, Bifrost converts it to a vector and checks Weaviate for similar past queries.
Create a docker-compose.yml:
version: '3.8'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:latest
    ports:
      - "8081:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-MiniLM-L6-v2
    environment:
      ENABLE_CUDA: '0'
    restart: on-failure
volumes:
  weaviate_data:
Spin it up:
docker compose up -d
Verify Weaviate is running:
curl http://localhost:8081/v1/meta | python3 -m json.tool
You should see a JSON response with version info. If you get connection refused, give it 30 seconds for the transformer model to load.
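If you want to script that readiness check (for CI or a startup script), a small stdlib-only Python poller does the job. The URL and timeout values below are just defaults matching this tutorial's port mapping:

```python
import json
import time
import urllib.request

WEAVIATE_META_URL = "http://localhost:8081/v1/meta"  # host port from the compose file above

def wait_for_weaviate(url=WEAVIATE_META_URL, timeout=60, interval=5):
    """Poll the /v1/meta endpoint until Weaviate responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                meta = json.load(resp)
                return meta.get("version", "unknown")
        except OSError:
            time.sleep(interval)  # container still starting; retry
    raise TimeoutError(f"Weaviate did not become ready within {timeout}s")
```

Call `wait_for_weaviate()` before starting traffic; it returns the Weaviate version string once the instance answers.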
For more on Weaviate's architecture and vectoriser modules, check their docs.
Step 2: Configure Bifrost with Semantic Caching Enabled
Bifrost is an open-source LLM gateway written in Go, adding around 11 microseconds of latency overhead while sustaining 5,000 RPS. The part that matters here: it has dual-layer caching built in.
Dual-layer means two cache checks run on every request:
- Exact hash match - identical queries return cached responses instantly
- Semantic similarity - queries that mean the same thing but are worded differently also hit the cache
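Conceptually, the two layers compose like this. The sketch below is a toy illustration, not Bifrost's actual implementation; the in-memory stores, the `embed` callback, and the 0.85 threshold are all illustrative:

```python
import hashlib
import json
import math

def exact_key(request: dict) -> str:
    """Layer 1: hash the full request (model, messages, params) for exact matching."""
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def lookup(request, embed, exact_cache, vector_cache, threshold=0.85):
    """Try the exact-hash layer first, then fall back to semantic similarity."""
    key = exact_key(request)
    if key in exact_cache:
        return exact_cache[key], "exact"
    qvec = embed(request["messages"][-1]["content"])
    best = max(vector_cache, key=lambda e: cosine(qvec, e["vector"]), default=None)
    if best and cosine(qvec, best["vector"]) >= threshold:
        return best["response"], "semantic"
    return None, "miss"
```

On a miss, the gateway forwards the request to the provider, then stores both the hash key and the embedding so future lookups can hit either layer.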
Start Bifrost:
docker run -p 8080:8080 maximhq/bifrost
Or if you prefer npx:
npx -y @maximhq/bifrost
Now configure the gateway. Create a config.yaml:
gateway:
  host: "0.0.0.0"
  port: 8080
cache:
  enabled: true
  type: "semantic"
  vector_store:
    provider: "weaviate"
    host: "http://localhost:8081"
  conversation_history_threshold: 3
accounts:
  - id: "production"
    providers:
      - id: "openai-main"
        type: "openai"
        api_key: "${OPENAI_API_KEY}"
        model: "gpt-4o"
        weight: 70
      - id: "anthropic-fallback"
        type: "anthropic"
        api_key: "${ANTHROPIC_API_KEY}"
        model: "claude-sonnet-4-20250514"
        weight: 30
Key config values:
- cache.enabled: true turns on the dual-layer cache
- cache.type: "semantic" enables both exact hash and semantic similarity (not just exact match)
- vector_store.provider: "weaviate" points to your Weaviate instance
- conversation_history_threshold: 3 controls how much conversation context is used for cache key generation. The default is 3; higher values mean more context-sensitive cache matching but fewer hits.
Full configuration options are in the Bifrost docs.
Step 3: Point Your LLM Calls Through Bifrost
Bifrost exposes a drop-in replacement for the OpenAI SDK. Change your base URL and everything else stays the same.
Python (OpenAI SDK):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-openai-api-key"
)

# First call - cache miss, hits the LLM provider
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Second call - same query, exact cache hit, zero tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What are the benefits of microservices architecture?"}
    ]
)
print(response.choices[0].message.content)

# Third call - different wording, same intent, semantic cache hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Why should I use a microservices pattern?"}
    ]
)
print(response.choices[0].message.content)
The first call goes to OpenAI. Tokens are consumed, response is cached. The second call is identical, so the exact hash matches. Response comes from cache. The third call is worded differently but semantically similar. Weaviate's vector search finds the match. Response comes from cache again.
Both cache hits skip the LLM provider entirely. Zero tokens. Zero cost.
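You can see the difference for yourself by timing the same request twice. This stdlib-only sketch assumes the gateway from earlier is listening on localhost:8080; the helper names are mine, not part of any SDK:

```python
import json
import time
import urllib.request

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"

def chat(prompt, url=GATEWAY_URL, model="gpt-4o"):
    """POST one chat completion through the Bifrost gateway and return the parsed JSON."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Run `timed(chat, "same prompt")` twice: the first call takes the full provider round trip, while the repeat should come back in milliseconds from cache.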
Node.js (OpenAI SDK):
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'your-openai-api-key',
});

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'user', content: 'Explain container orchestration in simple terms' }
  ],
});

console.log(response.choices[0].message.content);
Same pattern. Point the base URL at Bifrost, and caching is transparent to your application code.
If you are using the Anthropic SDK, Bifrost supports that too. The Anthropic SDK integration page has the details.
Step 4: Monitor Cache Hits and Token Savings
Once traffic is flowing, you want to see what is hitting cache vs what is going through to providers.
Bifrost exposes metrics that let you track:
- Cache hit rate (exact vs semantic)
- Total requests vs routed requests (routed = cache misses that hit a provider)
- Token usage per provider
Check your Bifrost logs to see cache behaviour in real time:
docker logs -f <bifrost-container-id>
Each request will indicate whether it was served from cache or forwarded to a provider. Track the ratio over time. On workloads with repeated query patterns, the cache hit rate climbs quickly within the first few hours.
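The exact log format may vary between Bifrost versions, so treat the outcome labels below ('exact', 'semantic', 'miss') as placeholders you map from whatever your logs emit; the ratio arithmetic is the useful part:

```python
from collections import Counter

def hit_rate(events):
    """Summarise cache performance from a sequence of per-request outcomes.

    `events` is an iterable of strings: 'exact', 'semantic', or 'miss'
    (hypothetical labels; map them from your gateway's log lines).
    """
    counts = Counter(events)
    total = sum(counts.values())
    hits = counts["exact"] + counts["semantic"]
    return {
        "total": total,
        "exact": counts["exact"],
        "semantic": counts["semantic"],
        "miss": counts["miss"],
        "hit_rate": hits / total if total else 0.0,
    }
```

Feed it a day's worth of outcomes and watch hit_rate climb as the cache warms up.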
How It Works: Exact Hash vs Semantic Similarity
A quick breakdown of the two cache layers.
Exact hash matching is straightforward. The entire request (messages, model, parameters) is hashed. If an identical request has been seen before, the cached response is returned. This is fast and deterministic. Same input, same output.
Semantic similarity is where it gets interesting. When no exact match exists, Bifrost converts the query into a vector embedding using the transformer model running in Weaviate. It then searches for existing cached queries that are semantically close. If the similarity score is above the threshold, the cached response is returned.
This is what catches queries like:
- "How do I deploy to Kubernetes?" and "What is the process for deploying on k8s?"
- "Explain OAuth 2.0" and "How does OAuth2 authentication work?"
Different words. Same intent. One LLM call instead of two.
The conversation_history_threshold setting controls how many previous messages in a conversation are included when generating the cache key. At the default of 3, Bifrost uses the last 3 messages for context. This prevents a cached response from a different conversation context being returned incorrectly.
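As a sketch of what that setting controls (again, not Bifrost's actual key derivation), a context-aware cache key might hash the model plus only the trailing N messages, so two conversations that end the same way can share a cached answer:

```python
import hashlib
import json

def cache_key(messages, model, history_threshold=3):
    """Build a cache key from the model plus the last N conversation messages.

    Only the trailing `history_threshold` messages feed the key, so the same
    question asked in the same recent context reuses a cached answer, while
    a different recent context produces a different key.
    """
    context = messages[-history_threshold:]
    payload = json.dumps({"model": model, "messages": context}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Raising the threshold folds more history into the key: safer matches, fewer hits.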
For more on how sentence embeddings power this kind of similarity search, HuggingFace has a solid primer.
Results: What I Measured After Running This for a Week
I ran this setup against three different workloads for seven days. Here is what I observed.
Customer support bot (repetitive queries): Highest cache hit rate. Users ask variations of the same 50-100 questions. After the first day, the cache warmed up and a large portion of queries were served from cache. Semantic matching caught the paraphrased versions that exact hash would miss.
Code generation assistant (moderate repetition): Lower hit rate than customer support, but still meaningful. Common patterns like "write a function to parse JSON" or "create a REST endpoint" showed up repeatedly with slight variations. Semantic caching caught many of these.
Open-ended research queries (low repetition): Lowest hit rate, as expected. Each query was unique enough that neither exact nor semantic matching triggered often. Caching still helped with follow-up questions that rephrased earlier queries.
Latency on cache hits: Near-instant. The Weaviate vector lookup adds milliseconds, but compared to a full LLM round trip (typically 500ms to 3s), cache hits felt instantaneous.
Gateway overhead: Bifrost's 11 microsecond latency overhead held up. The caching layer adds the Weaviate lookup time on misses and hits, but the gateway itself adds almost nothing.
The workloads where semantic caching pays off most are the ones with natural query repetition. Customer support, internal knowledge bases, FAQ systems, onboarding assistants. If your users ask the same things in different ways, you are paying for the same answer multiple times.
For reference, here is what OpenAI charges per token and what Anthropic charges. On GPT-4o at current pricing, even a moderate cache hit rate translates to real savings on a monthly bill.
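The back-of-the-envelope maths is just requests x hit rate x per-request cost. A small helper with all prices as parameters, since rates change; the numbers in the comment are purely illustrative, not current provider pricing:

```python
def monthly_savings(requests_per_month, hit_rate,
                    avg_prompt_tokens, avg_completion_tokens,
                    price_in_per_1m, price_out_per_1m):
    """Estimate monthly spend avoided by cache hits.

    Prices are per 1M tokens; pull the real numbers from your
    provider's pricing page.
    """
    cost_per_request = (avg_prompt_tokens * price_in_per_1m
                        + avg_completion_tokens * price_out_per_1m) / 1_000_000
    return requests_per_month * hit_rate * cost_per_request

# Example with made-up inputs: 100k requests/month, 40% hit rate,
# 500 prompt + 300 completion tokens, $2.50 in / $10.00 out per 1M tokens.
print(monthly_savings(100_000, 0.4, 500, 300, 2.5, 10.0))
```

Plug in your own traffic and hit rate; even conservative inputs usually justify the half-hour of setup.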
Further Reading
- Bifrost Semantic Caching Docs - full config reference
- Bifrost Setup Guide - getting started from scratch
- Weaviate Developer Docs - vector store configuration and modules
- Getting Started with Embeddings (HuggingFace) - how sentence embeddings work
- Redis Caching Patterns - general caching concepts for comparison
Bifrost GitHub | Docs | Website
If you are running LLM workloads with any kind of query repetition, set up semantic caching before optimising anything else. It is the lowest-effort, highest-impact cost reduction I have found.
