Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector
Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because an exact-match Redis cache misses when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that hurts both your latency and your budget.
Why Most Developers Get This Wrong
- Exact-Match Obsession: Using traditional Redis or Memcached key-value pairs, which completely misses semantically identical queries with different wordings.
- Database Abuse: Hand-rolling vector math inside the application layer instead of letting pgvector perform native, hardware-accelerated cosine distance queries.
- Network Bloat: Calling external APIs (like OpenAI) to embed the user's query before checking the cache, defeating the low-latency purpose of caching.
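To make the first pitfall concrete, here is a minimal plain-Java sketch of an exact-match cache. The class name ExactMatchCacheDemo and the sample answer are made up for illustration, and a HashMap stands in for Redis or Memcached.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo: a HashMap stands in for a Redis/Memcached key-value cache.
final class ExactMatchCacheDemo {
    private final Map<String, String> cache = new HashMap<>();

    // Even with basic normalization, the key is still the literal wording.
    static String key(String prompt) {
        return prompt.trim().toLowerCase(Locale.ROOT);
    }

    void put(String prompt, String answer) {
        cache.put(key(prompt), answer);
    }

    // Returns null on a cache miss.
    String get(String prompt) {
        return cache.get(key(prompt));
    }

    public static void main(String[] args) {
        ExactMatchCacheDemo demo = new ExactMatchCacheDemo();
        demo.put("Password reset steps", "Open Settings, then choose Reset password.");

        // Same wording, different casing: hit.
        System.out.println(demo.get("password reset steps") != null);
        // Same intent, different wording: miss, so the LLM would be called again.
        System.out.println(demo.get("How do I reset my password?") != null);
    }
}
```

Normalizing case and whitespace only rescues trivial variations; any real rephrasing produces a different key.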
The Right Way
Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.
- Use Spring AI Advisors: Implement a custom CallAroundAdvisor to transparently intercept prompts before they hit the external LLM provider.
- Local Embeddings: Use a local ONNX model (like all-MiniLM-L6-v2) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
- Cosine Similarity Thresholding: Query PostgreSQL using pgvector with an HNSW index, and accept only matches above a strict cosine similarity threshold (e.g., > 0.96, which corresponds to a cosine distance below 0.04).
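The threshold rule is easier to reason about with the math written out. The sketch below is plain Java with toy two-dimensional vectors rather than real embeddings; the class and method names are hypothetical, and in the actual setup pgvector computes the distance inside the database.

```java
// Plain-Java sketch of the threshold check; pgvector does this math in the database.
final class CosineThreshold {
    // Cosine similarity of two equal-length, non-zero vectors.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // pgvector's <=> operator returns cosine distance, which is 1 - similarity.
    static double cosineDistance(float[] a, float[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    // A cached entry counts as a hit only if it is at least minSimilarity close.
    static boolean isCacheHit(float[] query, float[] cached, double minSimilarity) {
        return cosineSimilarity(query, cached) >= minSimilarity;
    }

    public static void main(String[] args) {
        float[] cached = {1f, 0f};
        float[] near = {0.99f, 0.05f}; // toy vectors, not real embeddings
        float[] far = {0.6f, 0.8f};
        System.out.println(isCacheHit(near, cached, 0.96));
        System.out.println(isCacheHit(far, cached, 0.96));
    }
}
```

Because similarity and distance sum to 1, a similarity floor of 0.96 and a distance ceiling of 0.04 express the same cut-off; which one you write depends on whether the API speaks in similarity or in distance.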
Show Me The Code
Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:
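    import java.util.List;
    import java.util.Map;
    import org.springframework.ai.chat.client.advisor.api.*;
    import org.springframework.ai.chat.messages.AssistantMessage;
    import org.springframework.ai.chat.model.ChatResponse;
    import org.springframework.ai.chat.model.Generation;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.vectorstore.SearchRequest;
    import org.springframework.ai.vectorstore.VectorStore;

    // Assumes the Spring AI 1.0.0 milestone API (CallAroundAdvisor, SearchRequest.query); names differ in other releases.
    public class SemanticCacheAdvisor implements CallAroundAdvisor {
        private final VectorStore vectorStore; // e.g. a PgVectorStore bean
        private final double similarityThreshold = 0.96;
        public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }
        @Override public String getName() { return "SemanticCacheAdvisor"; }
        @Override public int getOrder() { return 0; }
        @Override
        public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
            String query = request.userText();
            List<Document> matches = vectorStore.similaritySearch(
                SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1));
            if (!matches.isEmpty()) {
                // Cache hit: wrap the stored answer in a ChatResponse and skip the LLM call
                String cached = matches.get(0).getMetadata().get("cached_response").toString();
                ChatResponse hit = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
                return new AdvisedResponse(hit, request.adviseContext());
            }
            // Cache miss: call the LLM, then store the prompt with its answer for next time
            AdvisedResponse response = chain.nextAroundCall(request);
            String answer = response.response().getResult().getOutput().getContent();
            vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
            return response;
        }
    }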
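If you want to reason about the hit/miss flow without a Spring context or a database, the same control flow can be sketched in plain Java. Everything here is a stand-in: the embedder and llm functions are injected assumptions, the vectors in main are toy values, and a linear scan replaces the pgvector lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Framework-free sketch of the advisor's control flow. The embedder and llm
// functions are injected stand-ins (assumptions), not Spring AI types.
final class InMemorySemanticCache {
    private static final class Entry {
        final float[] embedding;
        final String response;

        Entry(float[] embedding, String response) {
            this.embedding = embedding;
            this.response = response;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, float[]> embedder;
    private final Function<String, String> llm;
    private final double minSimilarity;
    int llmCalls = 0; // counts real LLM invocations so cache hits are observable

    InMemorySemanticCache(Function<String, float[]> embedder,
                          Function<String, String> llm,
                          double minSimilarity) {
        this.embedder = embedder;
        this.llm = llm;
        this.minSimilarity = minSimilarity;
    }

    String ask(String prompt) {
        float[] query = embedder.apply(prompt);
        Entry best = null;
        double bestSimilarity = -1.0;
        // Linear scan for the nearest entry; an HNSW index replaces this in pgvector.
        for (Entry entry : entries) {
            double similarity = cosine(query, entry.embedding);
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = entry;
            }
        }
        if (best != null && bestSimilarity >= minSimilarity) {
            return best.response; // cache hit: the LLM is never called
        }
        llmCalls++;
        String answer = llm.apply(prompt); // cache miss: pay for one LLM call
        entries.add(new Entry(query, answer));
        return answer;
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 2-d vectors standing in for real embeddings.
        Map<String, float[]> toy = Map.of(
            "Password reset steps", new float[] {1f, 0f},
            "How do I reset my password?", new float[] {0.99f, 0.05f});
        InMemorySemanticCache cache =
            new InMemorySemanticCache(toy::get, p -> "answer for: " + p, 0.96);
        cache.ask("Password reset steps");        // miss, calls the LLM
        cache.ask("How do I reset my password?"); // hit, served from cache
        System.out.println("LLM calls: " + cache.llmCalls);
    }
}
```

Injecting the embedder and the LLM as functions keeps the flow testable: a unit test can assert on llmCalls to confirm that a paraphrase was served from the cache.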
Key Takeaways
- Decouple Caching: Keep your business logic clean; use Spring AI's Advisor chain to handle semantic caching transparently without polluting your services.
- Index for Scale: Always create an HNSW index on your pgvector columns to maintain sub-10ms query times as your cache grows to millions of rows.
- Set Strict Thresholds: Keep your similarity threshold high (0.95+) to prevent "hallucinated" cache hits where distinct user intents are incorrectly matched.
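For the indexing takeaway, here is a rough sketch of what the underlying pgvector statements could look like if you manage the table yourself over JDBC. The table name semantic_cache, its columns, and the class PgVectorCacheSql are hypothetical; the vector(384) dimension assumes all-MiniLM-L6-v2, and the ?::vector cast assumes the PostgreSQL JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

// Hypothetical schema and lookup for a hand-managed pgvector cache table.
final class PgVectorCacheSql {
    static final String CREATE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS vector";

    // 384 matches the embedding size of all-MiniLM-L6-v2.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS semantic_cache ("
        + " id bigserial PRIMARY KEY,"
        + " prompt text NOT NULL,"
        + " response text NOT NULL,"
        + " embedding vector(384) NOT NULL)";

    // vector_cosine_ops makes the HNSW index serve the <=> (cosine distance) operator.
    static final String CREATE_INDEX =
        "CREATE INDEX IF NOT EXISTS semantic_cache_embedding_idx"
        + " ON semantic_cache USING hnsw (embedding vector_cosine_ops)";

    // ORDER BY <distance> LIMIT k is the query shape the index can accelerate.
    static final String NEAREST =
        "SELECT response, embedding <=> ?::vector AS distance"
        + " FROM semantic_cache ORDER BY embedding <=> ?::vector LIMIT 1";

    // pgvector's text input format is a bracketed, comma-separated list.
    static String toVectorLiteral(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }

    // Returns the cached response only if the nearest row is within maxDistance.
    static Optional<String> lookup(Connection conn, float[] query, double maxDistance)
            throws SQLException {
        String literal = toVectorLiteral(query);
        try (PreparedStatement ps = conn.prepareStatement(NEAREST)) {
            ps.setString(1, literal);
            ps.setString(2, literal);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && rs.getDouble("distance") <= maxDistance) {
                    return Optional.of(rs.getString("response"));
                }
                return Optional.empty();
            }
        }
    }
}
```

The threshold is applied in Java after fetching the single nearest row, so the SQL keeps the plain ORDER BY ... LIMIT form. A similarity floor of 0.96 translates to calling lookup with a maxDistance of 0.04.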