Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector

#java #ai #llm #systemdesign


Your enterprise is likely paying for duplicate LLM API calls because your Redis cache misses when a user asks "How do I reset my password?" and the cached key is "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that hurts both your latency and your budget.

Why Most Developers Get This Wrong

Exact-Match Obsession: Using traditional Redis or Memcached key-value pairs, which miss queries that mean the same thing but are worded differently.
Database Abuse: Hand-rolling vector math in the application layer instead of letting pgvector run cosine distance queries natively inside PostgreSQL, backed by an index.
Network Bloat: Calling external APIs (like OpenAI) to embed the user's query before checking the cache, defeating the low-latency purpose of caching.
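
To make the first point concrete, here is a minimal sketch of an exact-match lookup missing a paraphrase that a similarity check would catch. The three-dimensional vectors are made up for illustration and merely stand in for real embeddings; the class name and the cached value are hypothetical too.

```java
import java.util.HashMap;
import java.util.Map;

class ExactVsSemantic {
    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Exact-match cache keyed on the raw prompt text.
        Map<String, String> exact = new HashMap<>();
        exact.put("Password reset steps", "<cached answer>");
        // The paraphrase is a different string, so this lookup is a miss.
        System.out.println(exact.containsKey("How do I reset my password?"));

        // Toy vectors (NOT real embeddings): two paraphrases and one unrelated prompt.
        double[] cached = {0.90, 0.10, 0.40};
        double[] paraphrase = {0.88, 0.12, 0.43};
        double[] unrelated = {0.10, 0.95, 0.05};
        System.out.println(cosine(cached, paraphrase) > 0.96); // close enough to reuse
        System.out.println(cosine(cached, unrelated) > 0.96);  // too far apart
    }
}
```

In production the similarity math belongs in pgvector, as the second point says; the Java version here only shows what the database is computing for you.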

The Right Way

Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.

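Use Spring AI Advisors: Implement a custom CallAroundAdvisor (the advisor interface in the Spring AI 1.0.0 milestone releases; the 1.0 GA line renames it CallAdvisor) to transparently intercept prompts before they hit the external LLM provider.
Local Embeddings: Use a local ONNX model (like all-MiniLM-L6-v2) inside your JVM process to generate query embeddings in a few milliseconds for short queries, avoiding external network hops. Measure on your own hardware before relying on a specific number.
Cosine Similarity Thresholding: Query PostgreSQL using pgvector with an HNSW index, accepting a match only above a strict cosine similarity threshold (e.g., > 0.96). pgvector's <=> operator returns cosine distance, which is 1 minus similarity, so that threshold is a distance below 0.04.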
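
If you manage the cache table yourself rather than letting the vector store create it, the pgvector side might look like the sketch below. The table, column, and index names are assumptions, as is the class itself; `vector(384)` matches the output size of all-MiniLM-L6-v2, and `hnsw`, `vector_cosine_ops`, and `<=>` are pgvector's own index type, operator class, and cosine distance operator.

```java
class PgVectorCacheSql {
    // Hypothetical schema: one row per cached prompt/response pair.
    static final String CREATE_TABLE = """
        CREATE TABLE IF NOT EXISTS llm_cache (
            id        bigserial PRIMARY KEY,
            prompt    text NOT NULL,
            response  text NOT NULL,
            embedding vector(384) NOT NULL
        )""";

    // HNSW index with the cosine operator class, so <=> ordering can use it.
    static final String CREATE_INDEX = """
        CREATE INDEX IF NOT EXISTS llm_cache_embedding_hnsw
        ON llm_cache USING hnsw (embedding vector_cosine_ops)""";

    // Nearest neighbour by cosine distance; bind the query embedding to both placeholders.
    static final String LOOKUP = """
        SELECT response, embedding <=> ?::vector AS distance
        FROM llm_cache
        ORDER BY embedding <=> ?::vector
        LIMIT 1""";

    /** pgvector reports cosine distance, so convert the similarity threshold once. */
    static double maxDistance(double similarityThreshold) {
        return 1.0 - similarityThreshold;
    }

    /** A row counts as a cache hit only if its distance is within the allowed maximum. */
    static boolean isHit(double distance, double similarityThreshold) {
        return distance <= maxDistance(similarityThreshold);
    }

    public static void main(String[] args) {
        System.out.println(CREATE_INDEX);
        System.out.println(isHit(0.01, 0.96));
    }
}
```

Fetching the single nearest row with ORDER BY and LIMIT, then comparing its distance in Java, keeps the query in the shape the HNSW index is built to serve.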

Show Me The Code

Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:

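// Written against the Spring AI 1.0.0-M3/M4 milestone API; later releases rename several of these types and calls.
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private final VectorStore vectorStore; // e.g. a PgVectorStore bean
    private final double similarityThreshold = 0.96;

    public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }

    @Override public String getName() { return "SemanticCacheAdvisor"; }
    @Override public int getOrder() { return 0; }

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        var matches = vectorStore.similaritySearch(
            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1));
        if (!matches.isEmpty()) {
            // Cache hit: wrap the stored answer in a ChatResponse without calling the LLM.
            String cached = matches.get(0).getMetadata().get("cached_response").toString();
            var hit = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
            return new AdvisedResponse(hit, request.adviseContext());
        }
        // Cache miss: call the model, then store the answer text keyed by the query's embedding.
        AdvisedResponse response = chain.nextAroundCall(request);
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
        return response;
    }
}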
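
The advisor follows a plain cache-aside flow: look up, return on a hit, otherwise call the model and store the answer. If you want to unit-test that flow without Spring or a database, a framework-free sketch could look like this. Everything in it (the class name, the in-memory list, the pluggable similarity function) is a hypothetical stand-in for the vector store, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

class CacheAsideSketch {
    record Entry(String query, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    // Hypothetical scorer: returns a similarity where 1.0 means "same meaning".
    private final BiFunction<String, String, Double> similarity;
    private final double threshold;

    CacheAsideSketch(BiFunction<String, String, Double> similarity, double threshold) {
        this.similarity = similarity;
        this.threshold = threshold;
    }

    /** Returns a cached answer when one is similar enough, otherwise asks the model once. */
    String answer(String query, Function<String, String> llm) {
        Entry best = null;
        double bestScore = threshold; // only scores at or above the threshold count as hits
        for (Entry e : entries) {
            double score = similarity.apply(query, e.query());
            if (score >= bestScore) {
                best = e;
                bestScore = score;
            }
        }
        if (best != null) {
            return best.answer();             // hit: the model is never called
        }
        String fresh = llm.apply(query);      // miss: pay for the call once...
        entries.add(new Entry(query, fresh)); // ...and remember the answer
        return fresh;
    }
}
```

A test can pass in a fake similarity function and a counting lambda as the model, then assert that a repeated question triggers only one call.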

Key Takeaways

Decouple Caching: Keep your business logic clean; use Spring AI's Advisor chain to handle semantic caching transparently without polluting your services.
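Index for Scale: Create an HNSW index (with the vector_cosine_ops operator class) on your pgvector column so nearest-neighbor lookups stay fast as your cache grows to millions of rows; without an index, every lookup is a sequential scan over the whole table.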
Set Strict Thresholds: Keep your similarity threshold high (0.95+) to prevent "hallucinated" cache hits where distinct user intents are incorrectly matched.