
丁久

Originally published at dingjiu1989-hue.github.io

AI Caching Strategies: Semantic Caching, Cache Invalidation, Cost Reduction, and Latency Improvement

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.


LLM API calls are expensive and slow. A typical GPT-4 response costs on the order of a cent and takes several seconds. For production applications serving thousands of users, caching is not optional; it is an economic necessity. Here is how to cache AI responses effectively.

The Case for Caching

Without caching, every user query hits the LLM API. This means every query costs money, every query takes seconds, and your API costs scale linearly with usage.

With caching, repeated or similar queries return instant, free responses. For applications where users ask similar questions, like a documentation chatbot or a customer support bot, cache hit rates of 40% to 70% are achievable.

Caching also improves consistency. LLMs are non-deterministic. The same prompt can produce different responses each time. Caching ensures users see consistent answers to the same question, which builds trust.

Exact Match Caching

Exact match caching is the simplest approach. Hash the input and use it as a cache key. If the exact same prompt is seen again, return the cached response.

This works well for applications with a fixed set of queries or templates. A code documentation chatbot might see the same question "How do I use the authentication middleware?" dozens of times per day.

Implement exact match caching with Redis or Memcached. Set a TTL based on how often the underlying knowledge changes. For static documentation, a TTL of 24 hours is reasonable. For rapidly changing data, reduce the TTL to minutes.
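A minimal sketch of this approach with redis-py is below. The cache key is a SHA-256 hash of the model name plus the prompt, and `call_llm` is a placeholder for whatever API client you actually use.

```python
import hashlib

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(model: str, prompt: str) -> str:
    # Hash model + prompt so different models never share an entry.
    digest = hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
    return f"llm:exact:{digest}"

def cached_completion(model: str, prompt: str, ttl_seconds: int = 86400) -> str:
    key = cache_key(model, prompt)
    hit = r.get(key)
    if hit is not None:
        return hit  # exact match: free and instant
    response = call_llm(model, prompt)  # placeholder for your real API call
    r.set(key, response, ex=ttl_seconds)  # TTL matches how fast the data goes stale
    return response
```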

The limitation is obvious: exact match catches only identical queries. "How do I reset my password?" and "How can I reset my password?" are different cache keys even though they mean the same thing.

Semantic Caching

Semantic caching uses embeddings to find similar queries. When a query arrives, compute its embedding and search for nearby cached queries. If a sufficiently similar query exists, return its cached response.

The threshold for semantic similarity depends on your use case. A cosine similarity of 0.95 or higher indicates effectively identical queries. Between 0.85 and 0.95, consider the context: minor wording differences may still warrant the same response.

Semantic caching requires a vector database. Pinecone, Weaviate, or pgvector with PostgreSQL all work well. The cache stores the query embedding, the response, and metadata like timestamp and response model.
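To make the mechanics concrete, here is a minimal in-memory sketch using cosine similarity over numpy vectors. `embed()` stands in for whatever embedding model you use, and in production the linear scan would be replaced by a vector database query.

```python
import time

import numpy as np

SIMILARITY_THRESHOLD = 0.92  # tune per use case; ~0.95 is effectively identical

# Each entry: (query embedding, cached response, metadata)
semantic_cache: list[tuple[np.ndarray, str, dict]] = []

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query: str) -> str | None:
    query_vec = embed(query)  # placeholder: your embedding model
    best_score, best_response = 0.0, None
    for vec, response, _meta in semantic_cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

def semantic_store(query: str, response: str, model: str) -> None:
    semantic_cache.append(
        (embed(query), response, {"ts": time.time(), "model": model})
    )
```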

The trade-off is latency and cost. Computing embeddings adds milliseconds to each query. Searching the vector index adds more. The semantic cache must be fast enough that the overhead does not exceed the savings from avoided API calls.

Cache Invalidation

Stale responses are worse than no cache. An AI assistant giving outdated information erodes trust and can cause real problems if the information involves pricing, policies, or technical specifications.

Invalidate cache entries when the underlying knowledge changes. If you update your documentation, clear all cached responses that reference the changed pages. This requires tracking which documents each cached response depends on.
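One way to track those dependencies is with Redis sets: each cached answer records which document IDs it drew on, and updating a document deletes every dependent entry. The key names here are illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)

def cache_with_dependencies(key: str, response: str, doc_ids: list[str], ttl: int = 86400) -> None:
    r.set(key, response, ex=ttl)
    # Reverse mapping: document -> cache keys that depend on it.
    for doc_id in doc_ids:
        r.sadd(f"deps:{doc_id}", key)

def invalidate_document(doc_id: str) -> None:
    dep_key = f"deps:{doc_id}"
    dependent_keys = r.smembers(dep_key)
    if dependent_keys:
        r.delete(*dependent_keys)  # drop every cached answer built on this doc
    r.delete(dep_key)
```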

Time-based invalidation is simpler but less precise. Set TTLs based on data freshness requirements. Pricing information might have a one-hour TTL. General knowledge might have a 24-hour TTL. Company-specific policies might be somewhere in between.

For semantic caches, invalidation is trickier. A single document update can affect semantically similar but non-identical queries. When in doubt, clear the entire cache for the affected context. The temporary increase in API costs is better than serving stale information.

Multi-Tier Caching

A multi-tier caching strategy balances cost, latency, and freshness. Implement three tiers: exact match, semantic, and model response.

Exact match cache: Redis with short TTL. Catches repeated identical queries. Sub-millisecond lookup. This tier handles the most common case.

Semantic cache: Vector database with medium TTL. Catches similar but non-identical queries. Millisecond lookup. This tier increases hit rate significantly.

Model response cache: Cache the raw API response for identical prompts. This tier is useful when provider-side prompt caching, such as Anthropic's prompt caching, applies. It reduces costs but does not reduce latency as much, since the API call still happens.
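Putting the tiers together, a lookup might flow like the sketch below, reusing the exact-match and semantic helpers from earlier; `call_llm` remains a placeholder for the real API call.

```python
def answer(model: str, query: str) -> str:
    # Tier 1: exact match (sub-millisecond Redis lookup).
    key = cache_key(model, query)
    hit = r.get(key)
    if hit is not None:
        return hit

    # Tier 2: semantic match (vector search, millisecond lookup).
    similar = semantic_lookup(query)
    if similar is not None:
        r.set(key, similar, ex=3600)  # promote to the exact-match tier
        return similar

    # Tier 3: call the model and populate both caches.
    response = call_llm(model, query)
    r.set(key, response, ex=86400)
    semantic_store(query, response, model)
    return response
```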

Caching with Streaming

Streaming responses complicate caching. If you stream tokens as the model generates them, a cache hit has no live token stream to forward; the cached answer already exists in full.

The solution is to cache complete responses and serve them as non-streaming when a cache hit occurs. For cache misses, stream the response as normal and cache the complete response for future requests.
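A sketch of that flow as an async generator is below: chunks are forwarded to the client as they arrive and the assembled response is cached once the stream ends. `stream_llm` is a placeholder for your streaming API client, and the Redis calls reuse the exact-match helpers from earlier.

```python
from typing import AsyncIterator

async def stream_or_cache(model: str, query: str) -> AsyncIterator[str]:
    key = cache_key(model, query)
    hit = r.get(key)
    if hit is not None:
        # Cache hit: return the stored answer in one chunk, no streaming needed.
        yield hit
        return

    chunks: list[str] = []
    async for chunk in stream_llm(model, query):  # placeholder streaming client
        chunks.append(chunk)
        yield chunk  # forward tokens to the user as they arrive

    # Cache the fully assembled response for future requests.
    r.set(key, "".join(chunks), ex=86400)
```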

Alternatively, use a hybrid approach. Serve cached responses instantly for common queries and stream fresh responses for unique queries. Most users do not need streaming for simple factual questions.


Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.
