Stop Burning Cash on Duplicated LLM Queries: High-Performance Semantic Caching with Spring AI and PgVector
With enterprise LLM API costs skyrocketing in 2026, blindly forwarding every user prompt to external providers is architectural malpractice. You are paying premium rates for semantically identical queries that your system could easily resolve locally in under 10 milliseconds.
Why Most Developers Get This Wrong
- Exact string matching: Exact-key lookups in Redis or Memcached miss as soon as the wording changes, even though "How do I reset my password?" and "Password reset steps" express the same user intent.
- Cloud-based embedding latency: Round-tripping to external embedding APIs just to check your cache defeats the performance benefits of caching in the first place.
- Loose similarity thresholds: A static cosine similarity threshold chosen without accounting for domain-specific embedding drift leads to incorrect cache hits.
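The first pitfall is easy to reproduce without any infrastructure. The sketch below is a hypothetical in-memory stand-in for an exact-key cache such as Redis (the class name `ExactMatchCache` and its methods are illustrative, not part of any library). Even with case and whitespace normalization, a paraphrase produces a different key.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical exact-key cache, standing in for a Redis or Memcached lookup.
final class ExactMatchCache {
    private final Map<String, String> entries = new HashMap<>();

    // Normalizing case and whitespace is about as far as an exact-key cache can go.
    static String normalize(String prompt) {
        return prompt.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    void put(String prompt, String response) {
        entries.put(normalize(prompt), response);
    }

    Optional<String> get(String prompt) {
        return Optional.ofNullable(entries.get(normalize(prompt)));
    }
}
```

After `put("How do I reset my password?", ...)`, a lookup for the same sentence with different casing or spacing hits, but a lookup for "Password reset steps" returns an empty `Optional`, so the request still goes to the LLM.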
The Right Way
Intercept incoming queries at the gateway, generate embeddings locally using ONNX, and run a vector similarity search against PgVector with a strict threshold.
- Local Embedding Generation: Use Spring AI's local ONNX runtime support or a local Ollama instance to generate embeddings without a network round trip to an external provider. The sub-2ms target depends on model size and hardware.
- PgVector Cosine Similarity: Leverage PostgreSQL's pgvector extension with an HNSW index to query cached responses using cosine distance (<=>). Note that <=> returns a distance, so similarity is 1 minus that value.
- Adaptive Thresholding: Enforce a strict similarity threshold (e.g., > 0.92 for all-MiniLM-L6-v2) to prevent serving irrelevant cached answers, and retune it per embedding model and domain.
- TTL-backed Vector Eviction: PostgreSQL has no built-in row TTL, so store a creation or expiry timestamp with each cache entry and purge expired rows with a scheduled job, or use a soft-delete flag, to invalidate stale cache entries automatically.
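To make the lookup rules above concrete, here is a minimal in-memory sketch in plain Java. It is not how pgvector or Spring AI work internally: a linear scan stands in for the HNSW index, embeddings are passed in as `float[]` by the caller, and the class name `InMemorySemanticCache` is hypothetical. It only illustrates how cosine similarity, the threshold check, and TTL eviction fit together.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory model of the semantic cache; a real setup would query pgvector.
final class InMemorySemanticCache {
    private static final class Entry {
        final float[] embedding;
        final String response;
        final Instant createdAt;

        Entry(float[] embedding, String response, Instant createdAt) {
            this.embedding = embedding;
            this.response = response;
            this.createdAt = createdAt;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;
    private final Duration ttl;

    InMemorySemanticCache(double threshold, Duration ttl) {
        this.threshold = threshold;
        this.ttl = ttl;
    }

    // Cosine similarity = dot(a, b) / (|a| * |b|). pgvector's <=> returns 1 minus this value.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat a zero vector as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    void put(float[] embedding, String response, Instant now) {
        entries.add(new Entry(embedding.clone(), response, now));
    }

    Optional<String> lookup(float[] queryEmbedding, Instant now) {
        // TTL eviction: drop entries older than the configured lifetime.
        entries.removeIf(e -> e.createdAt.plus(ttl).isBefore(now));

        // Nearest neighbour by cosine similarity (linear scan instead of HNSW).
        Entry best = null;
        double bestScore = -1.0;
        for (Entry e : entries) {
            double score = cosineSimilarity(queryEmbedding, e.embedding);
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        // Serve the cached answer only if it clears the similarity threshold.
        return (best != null && bestScore >= threshold)
                ? Optional.of(best.response)
                : Optional.empty();
    }
}
```

Passing `now` explicitly keeps the TTL logic deterministic and easy to test. In a PostgreSQL-backed version the same two checks would typically become a `WHERE` clause on the timestamp column plus a bound on the `<=>` distance.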
Show Me The Code
Here is how to implement a high-performance semantic cache query using Spring AI's native VectorStore API:
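import java.util.List;
import java.util.Optional;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class SemanticCacheService {
    private static final double SIMILARITY_THRESHOLD = 0.92;
    private final VectorStore vectorStore; // injected PgVectorStore

    public SemanticCacheService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public Optional<String> getCachedResponse(String query) {
        // Spring AI 1.0 builder API; pre-1.0 milestones used SearchRequest.query(query).withTopK(1)
        SearchRequest searchRequest = SearchRequest.builder()
                .query(query)
                .topK(1)
                .similarityThreshold(SIMILARITY_THRESHOLD)
                .build();
        List<Document> results = vectorStore.similaritySearch(searchRequest);
        // Optional.map tolerates a missing "cached_response" key; Stream.map(...).findFirst() would throw
        return results.stream().findFirst()
                .map(doc -> (String) doc.getMetadata().get("cached_response"));
    }
}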
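The lookup is only half of the flow: on a miss, the response has to be fetched from the LLM and written back so the next similar query hits. Below is a rough cache-aside sketch in plain Java. The class `CacheAsideChat` and its three function parameters are hypothetical stand-ins: `cacheLookup` for a method like `getCachedResponse`, `cacheWrite` for whatever stores the query and response in the vector store, and `llmCall` for the external provider.

```java
import java.util.Optional;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical cache-aside wrapper; the three collaborators are injected as plain functions.
final class CacheAsideChat {
    private final Function<String, Optional<String>> cacheLookup; // e.g. a semantic cache lookup
    private final BiConsumer<String, String> cacheWrite;          // stores (query, response)
    private final Function<String, String> llmCall;               // paid call to the LLM provider

    CacheAsideChat(Function<String, Optional<String>> cacheLookup,
                   BiConsumer<String, String> cacheWrite,
                   Function<String, String> llmCall) {
        this.cacheLookup = cacheLookup;
        this.cacheWrite = cacheWrite;
        this.llmCall = llmCall;
    }

    String answer(String query) {
        // 1. Try the cache first.
        Optional<String> cached = cacheLookup.apply(query);
        if (cached.isPresent()) {
            return cached.get();
        }
        // 2. On a miss, pay for the LLM call once...
        String fresh = llmCall.apply(query);
        // 3. ...and store the result so later similar queries are served locally.
        cacheWrite.accept(query, fresh);
        return fresh;
    }
}
```

With Spring AI, the write side would presumably add a `Document` whose text is the query and whose metadata carries the answer under the same `cached_response` key that the lookup reads.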
Key Takeaways
- Drastically Cut Costs: Intercepting repetitive prompts locally can cut your LLM API bill by up to 40%, depending on how much of your traffic is semantically repeated and how quickly the cache warms up.
- Sub-10ms Latency: Local embedding generation combined with PgVector HNSW indexing turns a slow LLM call into a fast local lookup on every cache hit.
- Spring AI is Production-Ready: Stop writing custom vector database boilerplate; use Spring AI's native PgVectorStore and SearchRequest APIs to do the heavy lifting.
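As a back-of-the-envelope check on the cost claim, the savings track the cache hit rate. The sketch below assumes every cache hit avoids exactly one paid call, that all calls cost roughly the same, and that local embedding and lookup costs are negligible. The class and method names are illustrative, and any numbers you feed it are your own.

```java
// Hypothetical helper for estimating savings from a semantic cache.
final class CacheSavings {
    // Fraction of paid LLM calls avoided: hits / total queries.
    static double savedFraction(long cacheHits, long totalQueries) {
        if (totalQueries <= 0 || cacheHits < 0 || cacheHits > totalQueries) {
            throw new IllegalArgumentException("need 0 <= cacheHits <= totalQueries and totalQueries > 0");
        }
        return (double) cacheHits / totalQueries;
    }

    // Estimated amount saved, given the bill you would pay without a cache.
    static double estimatedSavings(double billWithoutCache, double hitRate) {
        if (hitRate < 0.0 || hitRate > 1.0) {
            throw new IllegalArgumentException("hitRate must be between 0 and 1");
        }
        return billWithoutCache * hitRate;
    }
}
```

Under these assumptions, a 40% saving requires roughly a 40% hit rate, which is worth measuring on real traffic before quoting a number.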