Designing a semantic cache layer for cost and latency optimization in LLM systems.
Most LLM cost isn’t spent on novelty.
It’s spent on repetition: requests that are semantically identical but syntactically different.
PromptCache was built to eliminate that redundancy.
The Invisible Cost Leak in LLM Systems
If you’re running an LLM in production, you are almost certainly paying for this:
- "How do I reset my password?"
- "I forgot my password, what do I do?"
- "Steps to reset account password?"
- "Help me change password"
Different strings.
Same intent.
Same answer.
Different billable request.
Traditional caching doesn't help because:
"How do I reset my password?" != "Steps to reset account password?"
Exact match fails.
But meaning hasn't changed.
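To make the failure concrete: classic memoization keys the cache on the raw string, so every paraphrase is a miss. A minimal sketch:

```python
exact_cache = {}

def exact_get_or_set(prompt, compute):
    # Classic exact-match memoization: the key is the literal string,
    # so "reset my password" and "how do I change my password?" are
    # treated as completely unrelated entries.
    if prompt not in exact_cache:
        exact_cache[prompt] = compute(prompt)
    return exact_cache[prompt]
```

Every rephrasing of the same question triggers a fresh (billable) computation.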
That's where semantic caching comes in.
The Theory: Why This Works
LLMs don't understand text as strings.
They convert text into vectors (embeddings).
Two sentences with similar meaning produce vectors that are close together in high-dimensional space.
Example (simplified):
"Reset my password"
↓
[0.12, -0.87, 0.44, ...]
"How do I change my password?"
↓
[0.11, -0.89, 0.41, ...]
These vectors are very close.
So instead of asking:
"Have I seen this exact string before?"
We ask:
"Have I seen something semantically similar before?"
If the similarity is high enough, we reuse the answer.
That's semantic caching.
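"Close" is usually measured with cosine similarity. A minimal sketch using the toy three-dimensional vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

reset_pw = [0.12, -0.87, 0.44]   # "Reset my password"
change_pw = [0.11, -0.89, 0.41]  # "How do I change my password?"

print(round(cosine_similarity(reset_pw, change_pw), 4))  # → 0.9993
```

A similarity that close to 1.0 is well above a typical reuse threshold, so these two prompts would share a cache entry.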
How It Works in Practice
When a request comes in:
```
User Prompt
  ↓
Embedding
  ↓
Vector search in Redis
  ↓
High similarity?
  ├─ Yes → Return cached response
  └─ No  → Call LLM and store result
```
You're adding a semantic memoization layer in front of your LLM.
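The flow above can be sketched as a minimal in-memory version. The `embed` and `call_llm` parameters are hypothetical stand-ins for a real embedder and LLM client, and the linear scan stands in for a proper vector index like Redis:

```python
import math

THRESHOLD = 0.92
_store = []  # list of (embedding, response) pairs

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def get_or_set(prompt, embed, call_llm):
    vector = embed(prompt)  # User Prompt -> Embedding
    # Vector search: find the nearest stored embedding (linear scan here).
    best = max(_store, key=lambda entry: _cosine(vector, entry[0]), default=None)
    if best is not None and _cosine(vector, best[0]) >= THRESHOLD:
        return best[1], True  # High similarity -> return cached response
    response = call_llm(prompt)  # Otherwise call the LLM...
    _store.append((vector, response))  # ...and store the result
    return response, False
```

Same shape as the diagram: embed, search, compare against a threshold, and only fall through to the LLM on a miss.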
Real Results
In a support-heavy workload with repetitive queries:
- ~60% cache hit rate
- ~50% reduction in token usage
- ~40% lower API spend
Results vary by workload density and repetition patterns, but in structured environments, the impact is immediate.
Example Implementation
Here's a simplified example using Redis vector search:
```python
from promptcache import SemanticCache
from promptcache.backends.redis_vector import RedisVectorBackend
from promptcache.embedders.openai import OpenAIEmbedder
from promptcache.types import CacheMeta

embedder = OpenAIEmbedder(model="text-embedding-3-small")

backend = RedisVectorBackend(
    url="redis://localhost:6379/0",
    dim=embedder.dim,
)

cache = SemanticCache(
    backend=backend,
    embedder=embedder,
    namespace="support-bot",
    threshold=0.92,
)

meta = CacheMeta(
    model="gpt-4.1-mini",
    system_prompt="You are a helpful support assistant.",
)

result = cache.get_or_set(
    prompt="How can I change my password?",
    llm_call=my_llm_call,
    extract_text=lambda r: r.output_text,
    meta=meta,
)

print(result.cache_hit)
```
That's it.
No orchestration framework required.
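One note on the snippet above: `my_llm_call` is left undefined. All the cache needs is a callable whose return value works with the `extract_text` lambda. An offline stub (hypothetical, purely to illustrate the contract) could look like:

```python
from dataclasses import dataclass

@dataclass
class FakeResponse:
    output_text: str  # matches the extract_text lambda: r.output_text

def my_llm_call(prompt: str) -> FakeResponse:
    # Hypothetical stand-in for a real LLM client call. In production this
    # would hit your provider's API; it runs only on a cache miss.
    return FakeResponse(output_text=f"(stubbed answer for: {prompt})")
```

In production you'd swap the stub's body for your actual client call and return the provider's response object directly.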
If you want to try this approach, I packaged it up here:
Install:
```shell
pip install promptcache-ai
```
When This Works Best
Semantic caching is powerful when:
- Prompts are repetitive
- Temperature is low
- Answers are stable
- Volume is high
It won't help much for:
- Highly personalized prompts
- Creative writing
- Rapidly changing context
In those cases, novelty dominates repetition, and caching provides diminishing returns.
The Bigger Insight
Most LLM systems are fundamentally stateless.
They recompute answers even when nothing meaningful has changed.
Semantic caching introduces selective memory, reusing intelligence only when it is economically justified.
Instead of optimizing prompts endlessly, sometimes the smarter move is optimizing infrastructure.
If you're building LLM systems in production, semantic caching is one of the highest-leverage optimizations you can add.
But optimizing cost raised a more uncomfortable question:
What guarantees that a cache hit is actually correct?
In the next article, we examine how high hit rates can silently mask semantic errors, and why PromptCache evolved beyond threshold tuning.
Intelligence is expensive.
Memory is cheap.
Use both wisely.
Top comments (2)
Prompt caching is one of those optimizations that sounds obvious in retrospect but took me embarrassingly long to implement properly. The part that surprised me most was how much of our traffic was near-duplicate after we started logging and analyzing prompts. We had assumed each user request was unique, but it turned out about 30% were semantically equivalent queries phrased differently. One wrinkle I ran into: TTL management is trickier than it looks. We set cache TTL too long initially and started serving stale responses after we updated our system prompt. Now we version our system prompts and use that as part of the cache key. Curious if your implementation handles prompt versioning or if you just flush on deploy.
Great point! TTL alone isn't sufficient once behavior changes. Here, cache identity isn't just the prompt embedding. It's effectively `(namespace, model, ctx_hash, embedder)`, where `ctx_hash` is a deterministic hash over the instruction context (system prompt, tool schema, and so on). That means if we change the system prompt or tool schema, the logical cache partition changes automatically. No global flush required, and no risk of serving responses generated under a different instruction contract.
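A deterministic context hash of that shape can be sketched like this (illustrative only, not PromptCache's actual implementation; this version covers just the system prompt and tool schema):

```python
import hashlib
import json

def ctx_hash(system_prompt: str, tool_schema: dict) -> str:
    # Serialize deterministically (sorted keys, fixed separators) so the
    # same instruction context always yields the same digest.
    payload = json.dumps(
        {"system_prompt": system_prompt, "tool_schema": tool_schema},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Changing the system prompt changes the logical cache partition:
key_v1 = ("support-bot", "gpt-4.1-mini", ctx_hash("You are helpful.", {}), "text-embedding-3-small")
key_v2 = ("support-bot", "gpt-4.1-mini", ctx_hash("You are terse.", {}), "text-embedding-3-small")
```

Because the hash is part of the cache key, entries written under the old system prompt simply stop matching once the prompt changes.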
TTL is used purely for lifecycle control (memory pressure, economic windowing), not correctness guarantees.
In practice I've found that TTL controls the staleness horizon, while context hashing controls behavioral equivalence.
Without the second, semantic similarity alone can't guarantee safe reuse.
Versioning the system prompt explicitly, like you're doing, achieves the same isolation just at a more visible layer.
This is exactly where semantic caching becomes infrastructure engineering rather than just an optimization.