Lav Kumar Dixit

Semantic Caching with Spring AI and PgVector: Reduce LLM Costs and Improve Response Time by 90%

Large Language Models are powerful, but they're also expensive and slow when handling repetitive queries. If your AI application receives thousands of similar questions every day, repeatedly calling an LLM for nearly identical requests is inefficient.

What if you could intelligently reuse previous AI responses—even when the wording is different?

This is where Semantic Caching comes in.

In this article, we'll build a production-ready semantic caching layer using Spring AI and PgVector, enabling Java developers to dramatically reduce AI costs, lower latency, and improve user experience.

The Problem: Traditional Caching Doesn't Work for AI

Consider these user queries:

What is Spring Boot?
Explain Spring Boot framework.
Can you tell me about Spring Boot?

A traditional cache such as Redis treats these as completely different keys:

cache.get("What is Spring Boot?");
cache.get("Explain Spring Boot framework.");

Result:

❌ Cache Miss

❌ New LLM Call

❌ Increased Cost

❌ Higher Latency

Although the intent is identical, traditional caching cannot understand meaning.

Semantic caching solves this problem.

What is Semantic Caching?

Semantic caching stores:

User query
Query embedding
AI response

When a new request arrives:

Generate an embedding
Search for similar embeddings
Return cached response if similarity exceeds a threshold
Otherwise call the LLM and store the result

Instead of matching text, we match meaning.
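The four steps above can be sketched in plain Java before any database is involved. This is a minimal in-memory illustration, not the PgVector-backed implementation built later: `InMemorySemanticCache`, the `embedder` function, and the `llm` function are hypothetical stand-ins for an embedding model and a chat model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** In-memory illustration of the lookup flow; a real deployment would query PgVector instead. */
final class InMemorySemanticCache {

    /** One cached entry: the original query, its embedding, and the stored answer. */
    record Entry(String query, float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, float[]> embedder; // hypothetical stand-in for an embedding model
    private final Function<String, String> llm;       // hypothetical stand-in for the LLM call
    private final double threshold;

    InMemorySemanticCache(Function<String, float[]> embedder,
                          Function<String, String> llm,
                          double threshold) {
        this.embedder = embedder;
        this.llm = llm;
        this.threshold = threshold;
    }

    String getResponse(String query) {
        float[] embedding = embedder.apply(query);          // 1. generate an embedding
        Optional<Entry> best = findMostSimilar(embedding);  // 2. search for similar embeddings
        if (best.isPresent()
                && cosineSimilarity(best.get().embedding(), embedding) > threshold) {
            return best.get().response();                   // 3. similarity exceeds threshold: cache hit
        }
        String response = llm.apply(query);                 // 4. otherwise call the LLM and store the result
        entries.add(new Entry(query, embedding, response));
        return response;
    }

    /** Linear scan for the closest stored embedding; a vector index replaces this at scale. */
    private Optional<Entry> findMostSimilar(float[] embedding) {
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry entry : entries) {
            double score = cosineSimilarity(entry.embedding(), embedding);
            if (score > bestScore) {
                bestScore = score;
                best = entry;
            }
        }
        return Optional.ofNullable(best);
    }

    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Embeddings must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // treat a zero vector as dissimilar to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With an embedder that maps both Spring Boot questions to nearby vectors, the second question is served from the cache and the LLM function runs only once for the pair.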

Why Use PgVector?

PgVector extends PostgreSQL with vector similarity search capabilities.

Benefits:

Open source
No additional vector database required
Works directly with PostgreSQL
Supports cosine similarity
Production-ready
Easy integration with Spring AI

For many enterprise applications, PgVector eliminates the need for separate infrastructure like Pinecone or Weaviate.

High-Level Architecture

            User Query
                |
                v
        Generate Embedding
                |
                v
    PgVector Similarity Search
                |
     +----------+----------+
     |                     |
 Cache Hit            Cache Miss
     |                     |
     v                     v
Cached Response        Call LLM
     |                     |
     +----------+----------+
                |
                v
          Return Result

This architecture reduces both latency and token consumption.

Technology Stack

Java 21
Spring Boot 3
Spring AI
PostgreSQL
PgVector
OpenAI Embeddings
Maven

Setting Up PgVector

Enable the extension:

CREATE EXTENSION IF NOT EXISTS vector;

Create a cache table:

CREATE TABLE semantic_cache (
    id BIGSERIAL PRIMARY KEY,
    query TEXT NOT NULL,
    response TEXT NOT NULL,
    embedding VECTOR(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

Create an index for fast similarity search:

CREATE INDEX semantic_cache_embedding_idx
ON semantic_cache
USING ivfflat (embedding vector_cosine_ops);

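The index becomes increasingly important as cached entries grow into the thousands or millions. IVFFlat is an approximate index whose clusters are computed when the index is built, so it is best created once the table already holds representative data.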

Maven Dependencies

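Add the Spring AI OpenAI starter, the PostgreSQL JDBC driver, and the pgvector Java library. The code in this article also uses JdbcTemplate, JPA annotations, and Lombok, so the project additionally needs a starter such as spring-boot-starter-data-jpa plus Lombok. Spring AI 1.0 renamed the OpenAI starter to spring-ai-starter-model-openai, so use the artifact ID that matches your Spring AI version.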

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>

<dependency>
    <groupId>com.pgvector</groupId>
    <artifactId>pgvector</artifactId>
    <version>0.1.6</version>
</dependency>

Application Configuration

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/ai_db
    username: postgres
    password: postgres

  ai:
    openai:
      api-key: ${OPENAI_API_KEY}

Store sensitive credentials using environment variables or a secret management solution.

Entity Model
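import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.time.LocalDateTime;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;

@Entity
@Table(name = "semantic_cache")
@Getter
@Setter
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class SemanticCache {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String query;

    @Column(columnDefinition = "TEXT")
    private String response;

    // The embedding column is not mapped here; the repository below
    // accesses it through JdbcTemplate.
    private LocalDateTime createdAt;
}

Embedding Generation Service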

Spring AI makes embedding generation straightforward.

import lombok.RequiredArgsConstructor;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    public float[] generateEmbedding(String text) {
        return embeddingModel.embed(text);
    }
}

Every query will be converted into a high-dimensional vector representation.

Similarity Search Repository

Using PostgreSQL cosine similarity:

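import com.pgvector.PGvector;
import java.util.List;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
@RequiredArgsConstructor
public class SemanticCacheRepository {

    private final JdbcTemplate jdbcTemplate;

    public Optional<String> findSimilarResponse(
            PGvector embedding,
            double threshold) {

        // <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
        // Ordering by the raw distance (ascending) returns the same row as ordering
        // by similarity descending, and it is the form the ivfflat index can serve.
        String sql = """
            SELECT response
            FROM semantic_cache
            WHERE 1 - (embedding <=> ?) > ?
            ORDER BY embedding <=> ?
            LIMIT 1
        """;

        List<String> responses =
                jdbcTemplate.query(
                        sql,
                        ps -> {
                            ps.setObject(1, embedding);
                            ps.setDouble(2, threshold);
                            ps.setObject(3, embedding);
                        },
                        (rs, rowNum) -> rs.getString("response")
                );

        return responses.stream().findFirst();
    }
}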

The threshold controls how strict the cache matching should be.

Typical values:

Threshold    Behavior
0.70         Aggressive caching
0.80         Balanced
0.90         Very strict

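For most production systems, 0.80–0.85 is a reasonable starting point, but the right value depends on the embedding model and should be tuned against real queries and observed false matches.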
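Because pgvector's `<=>` operator returns a cosine distance rather than a similarity, the threshold comparison is easy to get backwards. The sketch below shows the conversion the SQL performs with `1 - (embedding <=> ?)`; the class and method names are hypothetical and exist only for illustration.

```java
/** Illustrates how a cosine distance from pgvector maps onto the similarity threshold. */
final class ThresholdCheck {

    /** pgvector's <=> operator yields cosine distance; similarity is 1 - distance. */
    static double similarityFromDistance(double cosineDistance) {
        return 1.0 - cosineDistance;
    }

    /** Mirrors the SQL filter: a row qualifies when 1 - distance is greater than the threshold. */
    static boolean isCacheHit(double cosineDistance, double threshold) {
        return similarityFromDistance(cosineDistance) > threshold;
    }
}
```

For a stored entry at cosine distance 0.18 (similarity 0.82), `isCacheHit` returns true for thresholds 0.70 and 0.80 and false for 0.90, which is the sense in which higher thresholds are stricter.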

Semantic Cache Service

Now let's connect everything.

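import com.pgvector.PGvector;
import java.util.Optional;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
@Slf4j
public class SemanticCacheService {

    private static final double SIMILARITY_THRESHOLD = 0.85;

    private final EmbeddingService embeddingService;

    private final SemanticCacheRepository repository;

    // Spring AI auto-configures a ChatClient.Builder; define a ChatClient bean
    // (for example builder.build()) so this field can be injected.
    private final ChatClient chatClient;

    // Used only by saveResponse; moving the insert into the repository would be cleaner.
    private final JdbcTemplate jdbcTemplate;

    public String getResponse(String query) {

        float[] vector = embeddingService.generateEmbedding(query);

        PGvector embedding = new PGvector(vector);

        Optional<String> cachedResponse =
                repository.findSimilarResponse(embedding, SIMILARITY_THRESHOLD);

        if (cachedResponse.isPresent()) {
            log.info("Semantic Cache Hit");
            return cachedResponse.get();
        }

        log.info("Semantic Cache Miss");

        String response =
                chatClient.prompt(query)
                          .call()
                          .content();

        saveResponse(query, response, embedding);

        return response;
    }

    private void saveResponse(
            String query,
            String response,
            PGvector embedding) {

        // Persist the cache record so that later, similar queries can hit it.
        jdbcTemplate.update(
                "INSERT INTO semantic_cache (query, response, embedding) VALUES (?, ?, ?)",
                query, response, embedding);
    }
}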

This is the core semantic caching workflow.

Real-World Example

Imagine an HR chatbot receiving these questions:

What is the company's leave policy?
How many annual leaves do employees get?
Can I take paid vacation days?

Without semantic caching:

3 LLM Requests
3 API Charges
3 Response Generations

With semantic caching:

1 LLM Request
2 Cache Hits
Much Lower Cost

At enterprise scale, with thousands of such near-duplicate questions per day, the avoided LLM calls can add up to substantial monthly savings.

Performance Results

A typical benchmark:

Scenario              Response Time
OpenAI API Call       1500–3000 ms
Semantic Cache Hit    20–50 ms

Improvement:

Up to 90% faster responses
Up to 80% lower AI costs
Reduced API rate-limit pressure

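Actual numbers vary depending on model choice and infrastructure. Note that a cache hit still requires an embedding call for the incoming query, so end-to-end latency on a hit includes that call in addition to the database lookup.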
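How much of this improvement an application actually sees depends on its hit rate. The following sketch works through the arithmetic under simple assumptions: a fixed latency per hit and per miss, and LLM spend proportional to the number of LLM calls. `CacheEconomics` and its methods are hypothetical helpers, and embedding costs are deliberately left out.

```java
/** Back-of-the-envelope estimates for a cache with a given hit rate. */
final class CacheEconomics {

    /** Average latency when a fraction hitRate of requests is served from the cache. */
    static double expectedLatencyMs(double hitRate, double hitLatencyMs, double missLatencyMs) {
        requireFraction(hitRate);
        return hitRate * hitLatencyMs + (1.0 - hitRate) * missLatencyMs;
    }

    /** LLM spend after caching, assuming every hit avoids exactly one LLM call. */
    static double expectedLlmCost(double hitRate, double costWithoutCache) {
        requireFraction(hitRate);
        return (1.0 - hitRate) * costWithoutCache;
    }

    private static void requireFraction(double hitRate) {
        if (hitRate < 0.0 || hitRate > 1.0) {
            throw new IllegalArgumentException("hitRate must be between 0 and 1");
        }
    }
}
```

For example, with a 50% hit rate, 50 ms hits, and 2000 ms misses, `expectedLatencyMs(0.5, 50, 2000)` gives 1025 ms on average, so the headline numbers are only reached when most traffic is repetitive.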

Production Considerations

  1. Cache Expiration

AI responses can become outdated.

Add TTL support:

DELETE FROM semantic_cache
WHERE created_at < NOW() - INTERVAL '30 days';

Schedule cleanup jobs regularly.
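One way to wire this up is sketched below, using only the JDK. `CacheTtl` is a hypothetical helper: `isExpired` applies the same 30-day rule as the SQL above to a single entry (useful for a read-time check), and `scheduleCleanup` runs a caller-supplied task, such as the DELETE statement, at a fixed interval. In a Spring application the `@Scheduled` annotation would be the more usual choice.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** TTL helpers for cache entries; names and defaults are illustrative assumptions. */
final class CacheTtl {

    /** Mirrors the 30-day interval used in the cleanup SQL. */
    static final Duration DEFAULT_TTL = Duration.ofDays(30);

    /** True when createdAt + ttl lies strictly before now, matching created_at < NOW() - INTERVAL. */
    static boolean isExpired(Instant createdAt, Instant now, Duration ttl) {
        return createdAt.plus(ttl).isBefore(now);
    }

    /**
     * Runs the given cleanup task immediately and then once per interval.
     * The caller is responsible for shutting the returned scheduler down.
     */
    static ScheduledExecutorService scheduleCleanup(Runnable cleanup, Duration interval) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(cleanup, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Checking `isExpired` at read time as well as deleting in the background means a stale entry is never returned, even if a cleanup run is delayed.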

  2. Multi-Tenant Systems

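Store a tenant ID with every cache entry:

ALTER TABLE semantic_cache ADD COLUMN tenant_id VARCHAR(50);

Restrict every cache lookup to the current tenant (for example with AND tenant_id = ? in the WHERE clause) so that one tenant never receives another tenant's cached answer.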
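The tenant filter has to be applied before the best match is chosen; otherwise a closer match from another tenant can win. A minimal in-memory sketch of that rule follows, where `Candidate` and `bestForTenant` are hypothetical names and the similarity scores are assumed to have been computed already.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Picks the best cached response for one tenant from pre-scored candidates. */
final class TenantScopedLookup {

    /** A cached row with its owning tenant and its similarity to the incoming query. */
    record Candidate(String tenantId, String response, double similarity) {}

    static Optional<String> bestForTenant(List<Candidate> candidates,
                                          String tenantId,
                                          double threshold) {
        return candidates.stream()
                .filter(c -> c.tenantId().equals(tenantId)) // tenant isolation comes first
                .filter(c -> c.similarity() > threshold)    // then the usual threshold
                .max(Comparator.comparingDouble(Candidate::similarity))
                .map(Candidate::response);
    }
}
```

In the PgVector query the same ordering falls out naturally when the tenant condition is part of the WHERE clause.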

  3. Response Quality Monitoring

Track:

Cache hit rate
Similarity score
User feedback
Incorrect cache matches

Observability is critical when deploying semantic caching at scale.
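As a starting point, the first two metrics can be tracked with a few thread-safe counters. This is a sketch only: `CacheMetrics` is a hypothetical class, and a production service would more likely publish these values through its existing metrics library.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Thread-safe counters for cache hit rate and the average similarity of hits. */
final class CacheMetrics {

    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final DoubleAdder hitSimilaritySum = new DoubleAdder();

    /** Call on every cache hit with the similarity score of the matched entry. */
    void recordHit(double similarity) {
        hits.increment();
        hitSimilaritySum.add(similarity);
    }

    /** Call whenever the request falls through to the LLM. */
    void recordMiss() {
        misses.increment();
    }

    /** Fraction of requests served from the cache, or 0 when nothing was recorded. */
    double hitRate() {
        long hitCount = hits.sum();
        long total = hitCount + misses.sum();
        return total == 0 ? 0.0 : (double) hitCount / total;
    }

    /** Mean similarity of hits; values hovering near the threshold suggest risky matches. */
    double averageHitSimilarity() {
        long hitCount = hits.sum();
        return hitCount == 0 ? 0.0 : hitSimilaritySum.sum() / hitCount;
    }
}
```

Logging the similarity score alongside each hit also makes it possible to review incorrect matches later and adjust the threshold with evidence.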

  4. Hybrid Cache Strategy

Best practice:

Redis
|
Semantic Cache
|
LLM

Flow:

Redis lookup
Semantic lookup
LLM call

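Exact repeats are answered from Redis without computing an embedding, near-duplicates are answered from the semantic cache, and only genuinely new questions reach the LLM.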
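The three-tier flow can be sketched as follows. This is an illustration under stated assumptions: a `ConcurrentHashMap` stands in for Redis, and `semanticLookup` and `llm` are hypothetical functions representing the semantic cache and the model call. Storing new answers in the semantic cache is assumed to happen inside whatever backs those functions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Exact-match tier in front of a semantic tier in front of the LLM. */
final class HybridCache {

    private final Map<String, String> exactCache = new ConcurrentHashMap<>(); // stand-in for Redis
    private final Function<String, Optional<String>> semanticLookup;          // hypothetical semantic tier
    private final Function<String, String> llm;                               // hypothetical LLM call

    HybridCache(Function<String, Optional<String>> semanticLookup,
                Function<String, String> llm) {
        this.semanticLookup = semanticLookup;
        this.llm = llm;
    }

    String getResponse(String query) {
        String key = normalize(query);
        String exact = exactCache.get(key);
        if (exact != null) {
            return exact;                                // tier 1: exact match, no embedding needed
        }
        String response = semanticLookup.apply(query)    // tier 2: semantic match
                .orElseGet(() -> llm.apply(query));      // tier 3: LLM call
        exactCache.put(key, response);                   // future exact repeats stop at tier 1
        return response;
    }

    /** Cheap normalization so trivial differences in case and spacing share one key. */
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }
}
```

The exact tier is worth having because it skips the embedding call entirely, which is the dominant cost of a semantic cache hit.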

  5. Embedding Model Consistency

Never change embedding models without re-indexing vectors.

For example:

text-embedding-3-small

and

text-embedding-3-large

produce different vector spaces.

Similarity scores between vectors from different models are not meaningful, so mixing them leads to missed or incorrect cache matches.
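A simple safeguard is to record which model produced each stored vector and to refuse comparisons across models. The sketch below assumes a hypothetical extra column (for example `embedding_model`) holding the model name; `EmbeddingModelGuard` and `StoredEmbedding` are illustrative names. The dimension check matters in practice as well: text-embedding-3-large produces 3072-dimensional vectors by default, which would not fit the VECTOR(1536) column defined earlier.

```java
/** Guards against comparing embeddings that come from different models. */
final class EmbeddingModelGuard {

    /** A stored vector together with the name of the model that produced it. */
    record StoredEmbedding(String model, float[] values) {}

    /** True only when the stored vector matches both the current model and its dimension. */
    static boolean isComparable(StoredEmbedding stored, String currentModel, int currentDimensions) {
        return stored.model().equals(currentModel)
                && stored.values().length == currentDimensions;
    }
}
```

Entries that fail this check are candidates for re-embedding with the current model, or for deletion, before the new model goes live.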

When Should You Use Semantic Caching?

Use semantic caching when:

✅ Users ask repetitive questions

✅ Building AI chatbots

✅ Customer support assistants

✅ Internal company knowledge bases

✅ HR assistants

✅ Documentation search systems

✅ High-volume AI applications

Avoid it when:

❌ Every query is unique

❌ Responses depend heavily on real-time data

❌ Accuracy requirements are extremely strict

Key Takeaways

Traditional caching fails for AI applications because it relies on exact text matching.
Semantic caching matches user intent using vector embeddings.
Spring AI simplifies embedding generation and LLM integration.
PgVector provides efficient vector similarity search directly inside PostgreSQL.
Cache hits can reduce response times from seconds to milliseconds.
Production systems should include TTL policies, monitoring, and hybrid cache strategies.
Properly implemented semantic caching can significantly reduce AI infrastructure costs while improving user experience.

Final Thoughts

As AI applications scale, managing cost and latency becomes just as important as model quality. Semantic caching is one of the highest-impact optimizations you can implement because it reduces unnecessary LLM calls while delivering faster responses to users.

With Spring AI and PgVector, Java developers can build a robust semantic caching layer using technologies they already know—without introducing a dedicated vector database.

If you're building AI-powered applications with Spring Boot, semantic caching should be one of the first production optimizations on your roadmap.

Have you implemented semantic caching in your AI applications? Share your experience and performance gains in the comments.

#Java #SpringBoot #SpringAI #PgVector #ArtificialIntelligence #GenerativeAI #LLM #PostgreSQL #BackendDevelopment #JavaDeveloper #MachineLearning #VectorDatabase #OpenAI #SoftwareArchitecture #PerformanceOptimization #DevOps #CloudNative #AIEngineering #SemanticSearch #TechBlog