How Vector Embeddings Can Slash Your LLM API Costs by 80%
If you're building applications powered by large language models, you've probably noticed something painful: your API bill keeps growing. Not because your application is doing anything particularly complex, but because users keep asking variations of the same questions, and each variation triggers a fresh API call.
This is the hidden tax of working with LLMs, and today I want to show you how to solve it with semantic caching.
The Problem with Traditional Caching
Let's say you're building a customer support chatbot. A user asks, "How do I reset my password?" Your application calls the OpenAI API (or any other LLM provider), gets a response, and you wisely decide to cache it. Smart move.
But then another user comes along and asks, "I forgot my password, what should I do?" Your cache looks at this query, compares it character by character with what's stored, and finds no match. So off goes another API request. You pay again and the user waits again.
Here's what traditional caching sees:
"How do I reset my password?" ≠ "I forgot my password, what should I do?"
Different strings. Cache miss. End of story.
Now imagine this happening hundreds or thousands of times per day across all the different ways people phrase the same questions. Your cache hit rate stays frustratingly low, and your costs stay frustratingly high.
The fundamental issue is that traditional caching operates on syntax (the exact sequence of characters) when what we really care about is semantics, the meaning behind those characters.
Enter Semantic Caching
What if your cache could understand that "How do I reset my password?" and "I forgot my password, what should I do?" are essentially asking the same thing? What if it could recognize meaning, not just text?
This is exactly what semantic caching does. Instead of storing queries as raw strings, it converts them into vector embeddings: mathematical representations that capture the semantic meaning of text. When a new query arrives, the system converts it to a vector and searches for similar vectors in the cache. If it finds one above a certain similarity threshold, it returns the cached response without ever touching the LLM API.
The magic happens because embedding models are trained to place semantically similar text close together in vector space. "Reset my password" and "forgot my password" end up as neighboring vectors, even though they share few common words.
How It Works Under the Hood
┌───────────────────────────────────────────────────────────┐
│                        User Query                         │
│              "What's the capital of France?"              │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                  L1 Cache (Exact Match)                   │
│                    Caffeine In-Memory                     │
│                       < 1ms lookup                        │
└───────────────────────────────────────────────────────────┘
                              │ Miss
                              ▼
┌───────────────────────────────────────────────────────────┐
│                    Embedding Provider                     │
│        Convert text → [0.023, -0.891, 0.445, ...]         │
│               (ONNX, OpenAI, Ollama, Azure)               │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                L2 Cache (Semantic Search)                 │
│            Vector similarity search in storage            │
│             (Redis, Elasticsearch, In-Memory)             │
│                                                           │
│  Cached: "Tell me France's capital" → similarity: 0.94 ✓  │
└───────────────────────────────────────────────────────────┘
                              │ Hit!
                              ▼
┌───────────────────────────────────────────────────────────┐
│                      Return: "Paris"                      │
│                   Total time: ~10-50ms                    │
│                        Cost: $0.00                        │
└───────────────────────────────────────────────────────────┘
The dual-level architecture is key:
- L1 Cache: Exact matches with sub-millisecond latency
- L2 Cache: Semantic matches with millisecond latency, backed by vector storage (sketched below)
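To make the dual-level lookup concrete, here is a minimal, illustrative sketch in plain Java. This is not the library's internal code; the Embedder, VectorIndex, and Match types are hypothetical stand-ins for whatever L1 store, embedding client, and vector index you actually use.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

interface Embedder { float[] embed(String text); }

interface VectorIndex { Optional<Match> nearest(float[] query); }

record Match(String response, double similarity) {}

class TwoTierCache {
    private final Map<String, String> l1 = new ConcurrentHashMap<>();  // L1: exact-match, in-memory
    private final Embedder embedder;
    private final VectorIndex l2;                                      // L2: vector storage
    private final double threshold;

    TwoTierCache(Embedder embedder, VectorIndex l2, double threshold) {
        this.embedder = embedder;
        this.l2 = l2;
        this.threshold = threshold;
    }

    Optional<String> lookup(String query) {
        String exact = l1.get(query);              // exact string match first, sub-millisecond
        if (exact != null) {
            return Optional.of(exact);
        }
        float[] vector = embedder.embed(query);    // embed only on an L1 miss
        return l2.nearest(vector)                  // nearest-neighbour search in the vector store
                 .filter(m -> m.similarity() >= threshold)
                 .map(Match::response);
    }
}

The point of the sketch is the ordering: the cheap exact check runs first, and the embedding call (the only part with real latency and cost) happens only when the exact check fails.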
Introducing Semantic LLM Cache for Java
Semantic LLM Cache is an open-source library that brings this capability to Java and Spring Boot applications. It handles the complexity of embedding generation, vector storage, and similarity search, exposing a simple interface that feels native to Spring developers.
Let me walk you through how it works and how to integrate it into your application.
Getting Started
First, add the dependencies to your project. You'll need three components: the Spring Boot starter, a storage backend, and an embedding provider.
The library lives on GitHub: https://github.com/Jamalianpour/semantic-llm-cache
<!-- The core Spring Boot integration -->
<dependency>
    <groupId>io.github.jamalianpour</groupId>
    <artifactId>semantic-llm-cache-spring-boot-starter</artifactId>
    <version>0.0.1</version>
</dependency>

<!-- Storage: where vectors live (starting with in-memory for simplicity) -->
<dependency>
    <groupId>io.github.jamalianpour</groupId>
    <artifactId>semantic-llm-cache-storage-memory</artifactId>
    <version>0.0.1</version>
</dependency>

<!-- Embeddings: how text becomes vectors (using OpenAI here) -->
<dependency>
    <groupId>io.github.jamalianpour</groupId>
    <artifactId>semantic-llm-cache-embedding-openai</artifactId>
    <version>0.0.1</version>
</dependency>
Each component is modular by design. You can swap storage backends or embedding providers without changing your application code. This matters because your needs will evolve: you might start with in-memory storage for development and move to Redis or Elasticsearch for production.
Configuration
Next, configure the cache in your application.yml:
semantic-cache:
  embedding:
    provider: openai
    api-key: ${OPENAI_API_KEY}       # Your OpenAI API key
    model: text-embedding-3-small    # Cost-effective embedding model
  storage:
    type: memory                     # In-memory for development
  defaults:
    similarity-threshold: 0.92       # How similar queries must be (0.0 to 1.0)
    ttl: 24h                         # How long cached responses live
The similarity-threshold parameter is crucial. Setting it to 0.92 means a query must be at least 92% similar to a cached query to be considered a match. Too low, and you'll return incorrect responses for genuinely different questions. Too high, and you'll miss opportunities to serve from cache. I've found 0.90 to 0.95 works well for most conversational AI applications, but you should experiment with your specific use case.
Using the Annotation
Now comes the elegant part. To enable semantic caching on any method, simply add the @SemanticCache annotation:
@Service
public class CustomerSupportService {

    private final OpenAiClient openAiClient;

    public CustomerSupportService(OpenAiClient openAiClient) {
        this.openAiClient = openAiClient;
    }

    @SemanticCache(
        namespace = "support",   // Isolates this cache from others
        similarity = 0.92        // Override default threshold if needed
    )
    public String answerQuestion(String question) {
        // This only executes on cache misses
        return openAiClient.complete(question);
    }
}
That's genuinely all the code you need. When answerQuestion is called, the library intercepts the call, generates an embedding for the question, searches for similar cached entries, and either returns a cached response or proceeds to call your method and cache the result.
The namespace parameter creates logical separation between different caches. Your FAQ responses shouldn't interfere with your product recommendation cache, even if someone asks a similar-sounding question in both contexts.
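For example, two methods in the same application can cache into separate namespaces. A sketch, assuming the other annotation attributes fall back to your configured defaults and that recommendationClient is your own hypothetical client:

@SemanticCache(namespace = "faq")
public String answerFaq(String question) {
    return openAiClient.complete(question);
}

@SemanticCache(namespace = "recommendations")
public String recommendProducts(String request) {
    // Similar-sounding queries here never match entries in the "faq" namespace
    return recommendationClient.suggest(request);
}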
Understanding What Happens Under the Hood
To use semantic caching effectively, it helps to understand the flow of operations.
When a query arrives, the library first generates a vector embedding. If you're using OpenAI's text-embedding-3-small model, this produces a 1536-dimensional vector — essentially a list of 1536 numbers that mathematically represent the meaning of your text. This embedding generation takes around 50-100ms and costs a fraction of a cent.
Next, the library searches the vector storage for similar embeddings. It uses cosine similarity, which measures the angle between two vectors. Identical vectors have a similarity of 1.0. Completely unrelated vectors approach 0.0. The search returns the most similar cached entry along with its similarity score.
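If you want to see the math, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal, standalone Java version (independent of the library):

// Cosine similarity between two embedding vectors of equal length:
// dot(a, b) / (|a| * |b|), close to 1.0 for near-identical meanings and near 0.0 for unrelated ones.
static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}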
If the similarity exceeds your threshold, you have a cache hit. The library returns the stored response, and your LLM API is never called. Total latency: typically under 10ms for in-memory storage, under 50ms for Redis or Elasticsearch.
If the similarity falls below your threshold (or no similar entries exist), you have a cache miss. Your method executes normally, calling the LLM API. Before returning, the library caches both the query embedding and the response for future use.
Choosing Your Storage Backend
The library supports three storage backends, each suited to different scenarios.
In-Memory Storage keeps everything in the JVM heap using a ConcurrentHashMap. It's fast and requires no external infrastructure, making it perfect for development, testing, and small applications. The obvious limitation is that data disappears when your application restarts, and it doesn't work across multiple application instances.
semantic-cache:
  storage:
    type: memory
Redis Storage uses Redis Stack's vector search capabilities. It provides persistence, sub-millisecond latency, and works across distributed deployments. If you're already running Redis, this is often the natural choice for production.
semantic-cache:
  storage:
    type: redis
    redis:
      host: localhost
      port: 6379
You'll need Redis Stack (not plain Redis) because it includes the RediSearch module with vector similarity search. The easiest way to get started is with Docker:
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest
Elasticsearch Storage is designed for large-scale deployments. It handles millions of vectors efficiently and integrates well if you're already using Elasticsearch for search or logging. The HNSW algorithm provides approximate nearest neighbor search that scales remarkably well.
semantic-cache:
  storage:
    type: elasticsearch
    elasticsearch:
      uris: http://localhost:9200
Choosing Your Embedding Provider
Vector quality matters. Better embeddings lead to more accurate similarity matching, which means higher cache hit rates and fewer false positives.
OpenAI provides excellent embeddings with minimal setup. The text-embedding-3-small model offers a good balance of quality and cost at $0.02 per million tokens. For applications where embedding quality is critical, text-embedding-3-large provides measurably better results at a higher price point.
Azure OpenAI offers the same models through Azure's infrastructure, which matters for enterprises with compliance requirements or existing Azure investments.
Ollama lets you run embedding models locally. This eliminates embedding costs entirely and keeps all data on your infrastructure. The nomic-embed-text model produces good quality embeddings and runs efficiently on modest hardware.
semantic-cache:
  embedding:
    provider: ollama
    model: nomic-embed-text
    ollama-base-url: http://localhost:11434
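If the model isn't already available locally, pull it once with the Ollama CLI before starting your application:

ollama pull nomic-embed-text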
ONNX goes a step further, running models directly in your JVM without any external service. This is ideal for offline deployments or when you want to minimize dependencies. The library includes several pre-trained models:
semantic-cache:
  embedding:
    provider: onnx
    onnx:
      pretrained-model: ALL_MINILM_L6_V2   # Fast and lightweight
The trade-off with local models is generally quality. OpenAI's embeddings tend to outperform open-source alternatives, though the gap has been narrowing. For many applications, the cost savings of local embeddings outweigh the modest quality difference.
Multi-Tenant Caching
Real applications often serve multiple users or organizations, and you typically don't want User A's cached responses served to User B. The library handles this through context keys.
@SemanticCache(
    namespace = "support",
    similarity = 0.92,
    contextKeys = {"#userId"}   // Isolate cache by user
)
public String answerQuestion(String question, String userId) {
    return openAiClient.complete(question);
}
The contextKeys parameter accepts SpEL expressions that evaluate to isolation keys. The cache effectively becomes partitioned — a query from User A only matches cached entries from User A's previous queries.
You can combine multiple context keys for more complex isolation:
@SemanticCache(
    namespace = "support",
    contextKeys = {"#tenantId", "#department"}
)
public String answerQuestion(String question, String tenantId, String department) {
    return openAiClient.complete(question);
}
Cache Eviction
When underlying data changes, you need to invalidate cached responses. The @SemanticCacheEvict annotation handles this:
@SemanticCacheEvict(
    namespace = "faq",
    key = "#topic",
    similarity = 0.85   // Evict entries similar to this topic
)
public void updateFaqContent(String topic, String newContent) {
    faqRepository.update(topic, newContent);
}
This is particularly powerful because eviction is also semantic. Updating the "password reset" FAQ entry will evict cached responses for "reset password," "forgot password," and other similar queries — exactly the behavior you want.
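Putting the two annotations together, a FAQ service might look something like the sketch below. The FaqRepository type and the method names are assumptions for illustration; the annotations are used exactly as shown above.

@Service
public class FaqService {

    private final OpenAiClient openAiClient;
    private final FaqRepository faqRepository;

    public FaqService(OpenAiClient openAiClient, FaqRepository faqRepository) {
        this.openAiClient = openAiClient;
        this.faqRepository = faqRepository;
    }

    @SemanticCache(namespace = "faq", similarity = 0.92)
    public String answerFaq(String question) {
        // Only runs on a cache miss
        return openAiClient.complete(question);
    }

    @SemanticCacheEvict(namespace = "faq", key = "#topic", similarity = 0.85)
    public void updateFaq(String topic, String newContent) {
        // Saving new content also evicts semantically similar cached answers
        faqRepository.update(topic, newContent);
    }
}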
For bulk updates, you can clear an entire namespace:
@SemanticCacheEvict(namespace = "faq", allEntries = true)
public void rebuildAllFaqs() {
    // Clears all cached FAQ responses
}
Using Without Spring Boot
While the library is optimized for Spring Boot, the core components work in any Java application:
// Create embedding provider
EmbeddingProvider embeddings = OpenAiEmbeddingFactory.create(apiKey);

// Create storage backend
VectorStorage storage = RedisVectorStorageFactory.create(
    "redis://localhost:6379",
    1536   // Dimensions must match embedding model
);

// Build the cache
SemanticCache cache = SemanticCache.builder()
    .embeddingProvider(embeddings)
    .storage(storage)
    .config(CacheConfig.builder()
        .similarityThreshold(0.92)
        .ttl(Duration.ofHours(24))
        .build())
    .build();

// Use it directly
cache.put("How do I reset my password?", "To reset your password...");

Optional<CacheHit> hit = cache.get("I forgot my password");
if (hit.isPresent()) {
    System.out.println("Cache hit! Similarity: " + hit.get().similarity());
    System.out.println("Response: " + hit.get().response());
}
This programmatic API gives you full control and works with any framework or no framework at all.
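A common pattern with the programmatic API is a read-through helper: check the cache first and only call the model (and populate the cache) on a miss. A sketch built on the get/put calls shown above, where callLlm stands in for your own LLM client call:

public String answer(String question) {
    // Serve the cached response if a semantically similar question was answered before
    Optional<CacheHit> hit = cache.get(question);
    if (hit.isPresent()) {
        return hit.get().response();
    }

    // Otherwise call the LLM, then cache the new question/response pair for next time
    String response = callLlm(question);   // callLlm is your own LLM call (assumption)
    cache.put(question, response);
    return response;
}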
Conclusion
Semantic caching represents a fundamental shift in how we think about caching for AI applications. By operating on meaning rather than text, it unlocks cache hit rates that traditional approaches simply cannot achieve.
Semantic LLM Cache brings this capability to the Java ecosystem with a clean, modular design that respects how Spring developers build applications. Whether you're building chatbots, FAQ systems, RAG applications, or any other LLM-powered feature, semantic caching can meaningfully reduce your costs and improve response times.
The library is open source and available on GitHub. Contributions, feedback, and feature requests are always welcome.
GitHub: https://github.com/Jamalianpour/semantic-llm-cache
If you found this useful, consider giving the repository a star. It helps others discover the project and motivates continued development.