Large Language Models are powerful, but they're also expensive and slow when handling repetitive queries. If your AI application receives thousands of similar questions every day, repeatedly calling an LLM for nearly identical requests is inefficient.
What if you could intelligently reuse previous AI responses—even when the wording is different?
This is where Semantic Caching comes in.
In this article, we'll build a production-ready semantic caching layer using Spring AI and PgVector, enabling Java developers to dramatically reduce AI costs, lower latency, and improve user experience.
The Problem: Traditional Caching Doesn't Work for AI
Consider these user queries:
What is Spring Boot?
Explain Spring Boot framework.
Can you tell me about Spring Boot?
A traditional cache such as Redis treats these as completely different keys:
cache.get("What is Spring Boot?");
cache.get("Explain Spring Boot framework.");
Result:
❌ Cache Miss
❌ New LLM Call
❌ Increased Cost
❌ Higher Latency
Although the intent is identical, traditional caching cannot understand meaning.
Semantic caching solves this problem.
What is Semantic Caching?
Semantic caching stores:
User query
Query embedding
AI response
When a new request arrives:
Generate an embedding
Search for similar embeddings
Return cached response if similarity exceeds a threshold
Otherwise call the LLM and store the result
Instead of matching text, we match meaning.
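The four steps above can be sketched in plain Java without any database. This is a minimal in-memory illustration, not the Spring AI or PgVector API: the class name `InMemorySemanticCache`, the `embedder` function (a stand-in for an embedding model), and the linear scan are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** In-memory illustration of the lookup flow; a real system would keep vectors in PgVector. */
final class InMemorySemanticCache {

    /** One cached entry: the original query, its embedding, and the stored response. */
    record Entry(String query, float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, float[]> embedder; // hypothetical stand-in for an embedding model
    private final double threshold;

    InMemorySemanticCache(Function<String, float[]> embedder, double threshold) {
        this.embedder = embedder;
        this.threshold = threshold;
    }

    /** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the most similar cached response whose similarity exceeds the threshold. */
    Optional<String> lookup(String query) {
        float[] q = embedder.apply(query);
        Entry best = null;
        double bestScore = threshold;
        for (Entry e : entries) {
            double score = cosine(q, e.embedding());
            if (score > bestScore) {
                best = e;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best).map(Entry::response);
    }

    /** Returns a cached answer if one is close enough; otherwise calls the LLM and stores the result. */
    String getOrCompute(String query, Function<String, String> llm) {
        return lookup(query).orElseGet(() -> {
            String response = llm.apply(query);
            entries.add(new Entry(query, embedder.apply(query), response));
            return response;
        });
    }
}
```

With an embedder that maps paraphrases to nearby vectors, a second call with different wording returns the first stored response and the `llm` function is not invoked again.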
Why Use PgVector?
PgVector extends PostgreSQL with vector similarity search capabilities.
Benefits:
Open source
No additional vector database required
Works directly with PostgreSQL
Supports cosine similarity
Production-ready
Easy integration with Spring AI
For many enterprise applications, PgVector eliminates the need for separate infrastructure like Pinecone or Weaviate.
High-Level Architecture
                User Query
                    |
                    v
            Generate Embedding
                    |
                    v
        PgVector Similarity Search
                    |
         +----------+----------+
         |                     |
     Cache Hit             Cache Miss
         |                     |
         v                     v
  Cached Response           Call LLM
         |                     |
         +----------+----------+
                    |
                    v
              Return Result
This architecture reduces both latency and token consumption.
Technology Stack
Java 21
Spring Boot 3
Spring AI
PostgreSQL
PgVector
OpenAI Embeddings
Maven
Setting Up PgVector
Enable the extension:
CREATE EXTENSION IF NOT EXISTS vector;
Create a cache table:
CREATE TABLE semantic_cache (
    id         BIGSERIAL PRIMARY KEY,
    query      TEXT NOT NULL,
    response   TEXT NOT NULL,
    embedding  VECTOR(1536),
    created_at TIMESTAMP DEFAULT NOW()
);
Create an index for fast similarity search:
CREATE INDEX semantic_cache_embedding_idx
    ON semantic_cache
    USING ivfflat (embedding vector_cosine_ops);
The index becomes increasingly important as cached entries grow into the thousands or millions.
Maven Dependencies
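Add the Spring AI, PostgreSQL, and PgVector dependencies. The first two entries carry no version, so the build is assumed to import the Spring AI BOM and inherit from the Spring Boot parent. In Spring AI 1.0 and later the OpenAI starter is published as spring-ai-starter-model-openai. The later examples also rely on Spring Data JPA (which brings in JdbcTemplate) and Lombok.
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>
<dependency>
    <groupId>com.pgvector</groupId>
    <artifactId>pgvector</artifactId>
    <version>0.1.6</version>
</dependency>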
Application Configuration
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/ai_db
    username: postgres
    password: postgres
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
Store sensitive credentials using environment variables or a secret management solution.
Entity Model
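import jakarta.persistence.*;
import java.time.LocalDateTime;
import lombok.*;

@Entity
@Table(name = "semantic_cache")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class SemanticCache {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String query;

    @Column(columnDefinition = "TEXT")
    private String response;

    // The embedding column is not mapped here; the repository
    // queries it directly through JdbcTemplate.

    private LocalDateTime createdAt;
}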
Embedding Generation Service
Spring AI makes embedding generation straightforward.
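import lombok.RequiredArgsConstructor;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    // Converts the text into an embedding vector using the configured model.
    public float[] generateEmbedding(String text) {
        return embeddingModel.embed(text);
    }
}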
Every query will be converted into a high-dimensional vector representation.
Similarity Search Repository
Using PostgreSQL cosine similarity:
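import com.pgvector.PGvector;
import java.util.List;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
@RequiredArgsConstructor
public class SemanticCacheRepository {

    private final JdbcTemplate jdbcTemplate;

    public Optional<String> findSimilarResponse(
            PGvector embedding,
            double threshold) {

        // <=> is cosine distance, so similarity = 1 - distance.
        // Ordering by the raw distance (ascending) with LIMIT is what
        // allows PgVector to use the ivfflat index.
        String sql = """
            SELECT response
            FROM semantic_cache
            WHERE 1 - (embedding <=> ?) > ?
            ORDER BY embedding <=> ?
            LIMIT 1
            """;

        List<String> responses = jdbcTemplate.query(
            sql,
            ps -> {
                ps.setObject(1, embedding);
                ps.setDouble(2, threshold);
                ps.setObject(3, embedding);
            },
            (rs, rowNum) -> rs.getString("response")
        );

        return responses.stream().findFirst();
    }
}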
The threshold controls how strict the cache matching should be.
Typical values:
Threshold    Behavior
0.70         Aggressive caching
0.80         Balanced
0.90         Very strict
For most production systems, 0.80–0.85 works well.
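To make the effect of the threshold concrete, the sketch below computes cosine distance (what the `<=>` operator returns) for a pair of small vectors and checks it against several thresholds. The class and method names are hypothetical, and the two-dimensional vectors are toy values chosen only so the arithmetic is easy to follow.

```java
/** Illustrates how a similarity threshold turns a cosine distance into a hit or a miss. */
final class SimilarityThreshold {

    /** Cosine distance = 1 - cosine similarity, for two equal-length, non-zero vectors. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    /** Mirrors the SQL condition: 1 - distance > threshold. */
    static boolean isHit(double distance, double similarityThreshold) {
        return 1.0 - distance > similarityThreshold;
    }

    public static void main(String[] args) {
        // Vectors at a 45-degree angle: similarity is about 0.707.
        double distance = cosineDistance(new double[]{1, 0}, new double[]{1, 1});
        for (double threshold : new double[]{0.70, 0.80, 0.90}) {
            System.out.println(threshold + " -> " + (isHit(distance, threshold) ? "hit" : "miss"));
        }
    }
}
```

The same pair of vectors counts as a hit at 0.70 and a miss at 0.80 and 0.90, which is why a lower threshold produces more hits but also more incorrect matches.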
Semantic Cache Service
Now let's connect everything.
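import com.pgvector.PGvector;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
@Slf4j
public class SemanticCacheService {

    // Minimum cosine similarity for a cached answer to be reused.
    private static final double SIMILARITY_THRESHOLD = 0.85;

    private final EmbeddingService embeddingService;
    private final SemanticCacheRepository repository;
    // Assumes a ChatClient bean is defined, for example one built
    // from the auto-configured ChatClient.Builder.
    private final ChatClient chatClient;

    public String getResponse(String query) {
        // 1. Embed the incoming query.
        float[] vector = embeddingService.generateEmbedding(query);
        PGvector embedding = new PGvector(vector);

        // 2. Look for a sufficiently similar cached query.
        Optional<String> cachedResponse =
            repository.findSimilarResponse(embedding, SIMILARITY_THRESHOLD);

        if (cachedResponse.isPresent()) {
            log.info("Semantic Cache Hit");
            return cachedResponse.get();
        }

        // 3. Cache miss: call the LLM and store the result for next time.
        log.info("Semantic Cache Miss");
        String response = chatClient.prompt(query)
            .call()
            .content();

        saveResponse(query, response, embedding);
        return response;
    }

    private void saveResponse(
            String query,
            String response,
            PGvector embedding) {
        // Persist cache record
    }
}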
This is the core semantic caching workflow.
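The `saveResponse` step is left as a stub above. One way it could be filled in is sketched below with plain JDBC rather than JdbcTemplate, so the example depends only on the JDK. The class `SemanticCacheWriter` and its methods are hypothetical; the sketch assumes the `semantic_cache` table defined earlier and passes the embedding in PgVector's text form, cast to `vector` on the server.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper that inserts one cache record; created_at falls back to its default. */
final class SemanticCacheWriter {

    static final String INSERT_SQL =
        "INSERT INTO semantic_cache (query, response, embedding) "
            + "VALUES (?, ?, CAST(? AS vector))";

    /** Formats a float array in PgVector's text form, e.g. [0.5,1.0,-2.0]. */
    static String toVectorLiteral(float[] values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.append(']').toString();
    }

    /** Stores the query, the LLM response, and the query embedding as one row. */
    static void save(Connection conn, String query, String response, float[] embedding)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, query);
            ps.setString(2, response);
            ps.setString(3, toVectorLiteral(embedding));
            ps.executeUpdate();
        }
    }
}
```

In the Spring service, the same statement could be issued through `jdbcTemplate.update(...)` from a repository method; that wiring is omitted here.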
Real-World Example
Imagine an HR chatbot receiving these questions:
What is the company's leave policy?
How many annual leaves do employees get?
Can I take paid vacation days?
Without semantic caching:
3 LLM Requests
3 API Charges
3 Response Generations
With semantic caching:
1 LLM Request
2 Cache Hits
Much Lower Cost
At enterprise scale, savings of this kind can add up to thousands of dollars per month, depending on traffic volume and model pricing.
Performance Results
A typical benchmark:
Scenario              Response Time
OpenAI API Call       1500–3000 ms
Semantic Cache Hit    20–50 ms
Improvement:
Up to 90% faster responses
Up to 80% lower AI costs
Reduced API rate-limit pressure
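Actual numbers vary depending on model choice, cache hit rate, and infrastructure. Note that with the service shown earlier, a cache hit still makes one embedding request, so that round trip is part of the hit latency.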
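How much of that improvement reaches users depends on the hit rate. As a rough sketch, average latency can be estimated as a weighted average of the hit and miss latencies. The formula is a simplifying assumption (it ignores the lookup cost paid on a miss), and the class name and the midpoint values of 35 ms and 2250 ms are illustrative only.

```java
/** Back-of-the-envelope estimate of average latency for a given cache hit rate. */
final class CacheLatencyEstimate {

    /** Weighted average of hit and miss latency; hitRate must be between 0 and 1. */
    static double expectedLatencyMs(double hitRate, double hitMs, double missMs) {
        if (hitRate < 0 || hitRate > 1) {
            throw new IllegalArgumentException("hitRate must be in [0, 1]");
        }
        return hitRate * hitMs + (1 - hitRate) * missMs;
    }

    public static void main(String[] args) {
        // Midpoints of the ranges in the table above (assumed values).
        double hitMs = 35;
        double missMs = 2250;
        for (double hitRate : new double[]{0.0, 0.25, 0.5, 0.75}) {
            System.out.println(hitRate + " -> " + expectedLatencyMs(hitRate, hitMs, missMs) + " ms");
        }
    }
}
```

Under these assumptions the fraction of LLM calls avoided equals the hit rate, so a cache that rarely hits saves little even if each hit is very fast.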
Production Considerations
- Cache Expiration
AI responses can become outdated.
Add TTL support:
DELETE FROM semantic_cache
WHERE created_at < NOW() - INTERVAL '30 days';
Schedule cleanup jobs regularly.
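Between cleanup runs, stale rows are still in the table. A read-side check can keep them from being served. The sketch below mirrors the SQL condition in Java; the class `CacheTtl` and its method are hypothetical names, and the 30-day TTL is taken from the query above.

```java
import java.time.Duration;
import java.time.Instant;

/** Read-side expiry check that mirrors: created_at < NOW() - INTERVAL '30 days'. */
final class CacheTtl {

    static final Duration DEFAULT_TTL = Duration.ofDays(30);

    /** True when the entry is strictly older than the TTL at the given instant. */
    static boolean isExpired(Instant createdAt, Instant now, Duration ttl) {
        return createdAt.isBefore(now.minus(ttl));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-01-31T00:00:00Z");
        Instant old = Instant.parse("2024-12-31T23:59:59Z");
        Instant recent = Instant.parse("2025-01-15T00:00:00Z");
        System.out.println(isExpired(old, now, DEFAULT_TTL));    // older than 30 days
        System.out.println(isExpired(recent, now, DEFAULT_TTL)); // within 30 days
    }
}
```

The equivalent filter can also be added to the lookup query's WHERE clause so that expired rows never match.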
- Multi-Tenant Systems
Store tenant IDs:
tenant_id VARCHAR(50)
Only search cache within the current tenant.
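The tenant filter has to be applied before the best match is chosen; otherwise a closer match from another tenant could win. A minimal in-memory sketch of that ordering follows. The `Entry` record with a precomputed similarity and the class name are assumptions for illustration; in PostgreSQL the same effect comes from adding a `tenant_id = ?` condition to the lookup query.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Picks the best cached response for one tenant only. */
final class TenantScopedLookup {

    /** A candidate match with its tenant and an already computed similarity score. */
    record Entry(String tenantId, double similarity, String response) {}

    static Optional<String> bestForTenant(List<Entry> candidates, String tenantId, double threshold) {
        return candidates.stream()
            .filter(e -> e.tenantId().equals(tenantId))   // tenant isolation first
            .filter(e -> e.similarity() > threshold)      // then the similarity cut-off
            .max(Comparator.comparingDouble(Entry::similarity))
            .map(Entry::response);
    }
}
```

Even if another tenant holds a higher-scoring entry, the lookup returns only a response that belongs to the requesting tenant, or nothing.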
- Response Quality Monitoring
Track:
Cache hit rate
Similarity score
User feedback
Incorrect cache matches
Observability is critical when deploying semantic caching at scale.
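As a starting point for the first two metrics, hit rate and similarity score can be tracked with a few thread-safe counters. The class below is a hypothetical sketch using only the JDK; a production service would more likely publish these values through its existing metrics library.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Thread-safe counters for cache hit rate and the average similarity of hits. */
final class CacheMetrics {

    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final DoubleAdder hitSimilaritySum = new DoubleAdder();

    void recordHit(double similarity) {
        hits.increment();
        hitSimilaritySum.add(similarity);
    }

    void recordMiss() {
        misses.increment();
    }

    /** Fraction of lookups served from the cache; 0 when nothing has been recorded. */
    double hitRate() {
        long total = hits.sum() + misses.sum();
        return total == 0 ? 0.0 : (double) hits.sum() / total;
    }

    /** Mean similarity of cache hits; 0 when there have been no hits. */
    double averageHitSimilarity() {
        long h = hits.sum();
        return h == 0 ? 0.0 : hitSimilaritySum.sum() / h;
    }
}
```

An average hit similarity that sits only slightly above the threshold suggests many borderline matches, which is where incorrect cache hits tend to appear.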
- Hybrid Cache Strategy
Best practice:
Redis
|
Semantic Cache
|
LLM
Flow:
Redis lookup
Semantic lookup
LLM call
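Exact repeats are served from Redis without an embedding call, paraphrases are caught by the semantic cache, and only genuinely new questions reach the LLM.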
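The three-step flow can be sketched as a small class in which a `ConcurrentHashMap` stands in for Redis and two functions stand in for the semantic lookup and the LLM call. All names here are hypothetical, and the normalization rule (trim, lower-case, collapse whitespace) is an assumption about what counts as an exact repeat.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Exact-match cache in front of a semantic cache in front of the LLM. */
final class HybridCache {

    private final Map<String, String> exact = new ConcurrentHashMap<>(); // stand-in for Redis
    private final Function<String, Optional<String>> semanticLookup;     // stand-in for PgVector search
    private final Function<String, String> llm;                          // stand-in for the LLM call

    HybridCache(Function<String, Optional<String>> semanticLookup, Function<String, String> llm) {
        this.semanticLookup = semanticLookup;
        this.llm = llm;
    }

    /** Canonical key so trivial differences in case or spacing still match exactly. */
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    String answer(String query) {
        String key = normalize(query);

        // 1. Exact lookup: no embedding call needed.
        String cached = exact.get(key);
        if (cached != null) {
            return cached;
        }

        // 2. Semantic lookup, then 3. LLM call only if that also misses.
        String response = semanticLookup.apply(query).orElseGet(() -> llm.apply(query));

        // Remember the answer under the exact key for future identical queries.
        exact.put(key, response);
        return response;
    }
}
```

Writing semantic hits back into the exact tier is a design choice: it saves an embedding call on the next identical query, at the cost of also pinning any incorrect semantic match until that key expires.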
- Embedding Model Consistency
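Never change embedding models without re-embedding the cached queries and rebuilding the index.
For example, text-embedding-3-small and text-embedding-3-large produce different vector spaces (and, by default, different dimensions: 1536 versus 3072), so vectors from one cannot be compared meaningfully with vectors from the other.
Mixing them will reduce search accuracy.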
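One simple safeguard is to record which model produced each stored vector and to skip entries that do not match the model used for the current query. The sketch below shows such a check; the `StoredVector` record and the idea of keeping a model name alongside each vector are assumptions, not part of the schema shown earlier.

```java
/** Guards against comparing vectors that came from different embedding models. */
final class EmbeddingCompatibility {

    /** A stored embedding tagged with the name of the model that produced it. */
    record StoredVector(String model, float[] values) {}

    /** True only when the model name and the dimension both match the query embedding. */
    static boolean isComparable(StoredVector stored, String queryModel, float[] queryVector) {
        return stored.model().equals(queryModel)
            && stored.values().length == queryVector.length;
    }
}
```

With this information stored per row, a model migration can be rolled out gradually: old entries are ignored rather than matched incorrectly, and can be re-embedded or deleted in the background.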
When Should You Use Semantic Caching?
Use semantic caching when:
✅ Users ask repetitive questions
✅ Building AI chatbots
✅ Customer support assistants
✅ Internal company knowledge bases
✅ HR assistants
✅ Documentation search systems
✅ High-volume AI applications
Avoid it when:
❌ Every query is unique
❌ Responses depend heavily on real-time data
❌ Accuracy requirements are extremely strict
Key Takeaways
Traditional caching fails for AI applications because it relies on exact text matching.
Semantic caching matches user intent using vector embeddings.
Spring AI simplifies embedding generation and LLM integration.
PgVector provides efficient vector similarity search directly inside PostgreSQL.
Cache hits can reduce response times from seconds to milliseconds.
Production systems should include TTL policies, monitoring, and hybrid cache strategies.
Properly implemented semantic caching can significantly reduce AI infrastructure costs while improving user experience.
Final Thoughts
As AI applications scale, managing cost and latency becomes just as important as model quality. Semantic caching is one of the highest-impact optimizations you can implement because it reduces unnecessary LLM calls while delivering faster responses to users.
With Spring AI and PgVector, Java developers can build a robust semantic caching layer using technologies they already know—without introducing a dedicated vector database.
If you're building AI-powered applications with Spring Boot, semantic caching should be one of the first production optimizations on your roadmap.
Have you implemented semantic caching in your AI applications? Share your experience and performance gains in the comments.