DEV Community

Discussion on: PromptCache Part I: Stop Paying Twice for the Same LLM Answer

Matthew Hou

Prompt caching is one of those optimizations that sounds obvious in retrospect but took me embarrassingly long to implement properly. The part that surprised me most was how much of our traffic was near-duplicate after we started logging and analyzing prompts. We had assumed each user request was unique, but it turned out about 30% were semantically equivalent queries phrased differently.

One wrinkle I ran into: TTL management is trickier than it looks. We set cache TTL too long initially and started serving stale responses after we updated our system prompt. Now we version our system prompts and use that as part of the cache key. Curious if your implementation handles prompt versioning or if you just flush on deploy.
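For readers who want to picture the versioning approach, here is a minimal sketch, not the actual setup described above. It assumes an exact-match cache with light whitespace and case normalization (no semantic matching), and the names `SYSTEM_PROMPT_VERSION` and `cache_key` are hypothetical.

```python
import hashlib

# Hypothetical version label; bump it whenever the system prompt changes.
SYSTEM_PROMPT_VERSION = "v2"


def cache_key(prompt: str, prompt_version: str = SYSTEM_PROMPT_VERSION) -> str:
    """Build an exact-match cache key that includes the system prompt version."""
    # Collapse whitespace and lowercase so trivially different phrasings collide.
    normalized = " ".join(prompt.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Prefixing the version means entries written under an older system prompt
    # are never looked up again; they simply age out via TTL.
    return f"{prompt_version}:{digest}"
```

With a key like this, a system prompt update only needs a version bump, and old entries stop matching without a flush.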

Tasos Nikolaou

Great point! TTL alone isn't sufficient once behavior changes. Here, cache identity isn't just prompt_embedding. It's effectively:
(namespace, model, ctx_hash, embedder)

Where ctx_hash is a deterministic hash over:

  • system prompt
  • tool definitions
  • any static behavioral constraints

That means if we change the system prompt or tool schema, the logical cache partition changes automatically. No global flush required, and no risk of serving responses generated under a different instruction contract.
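A rough sketch of how such a context hash could be computed, assuming the system prompt, tool definitions, and constraints are all JSON-serializable. The names `CacheIdentity` and `compute_ctx_hash` are illustrative, not the library's actual API.

```python
import hashlib
import json
from typing import NamedTuple


class CacheIdentity(NamedTuple):
    """Illustrative partition key: (namespace, model, ctx_hash, embedder)."""
    namespace: str
    model: str
    ctx_hash: str
    embedder: str


def compute_ctx_hash(system_prompt: str, tools: list, constraints: list) -> str:
    """Deterministic hash over everything that defines the instruction contract."""
    # sort_keys and fixed separators give a canonical serialization, so the
    # hash does not depend on dict key order or incidental whitespace.
    payload = json.dumps(
        {"system_prompt": system_prompt, "tools": tools, "constraints": constraints},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any edit to the system prompt or a tool schema yields a different `ctx_hash`, so lookups land in a fresh partition and the old entries are simply never consulted.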

TTL is used purely for lifecycle control (memory pressure, economic windowing), not correctness guarantees.

In practice I've found that TTL controls the staleness horizon, and context hashing controls behavioral equivalence.

Without the second, semantic similarity alone can't guarantee safe reuse.
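To make that division of labor concrete, here is a simplified in-memory sketch. It assumes embeddings are plain lists of floats, uses a linear scan with cosine similarity, and the class name, the 0.9 threshold, and the one-hour TTL are all hypothetical choices for illustration, not the real implementation.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Entry:
    embedding: list
    response: str
    expires_at: float


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PartitionedSemanticCache:
    def __init__(self, threshold=0.9, ttl_seconds=3600.0):
        self.threshold = threshold      # minimum similarity for a hit
        self.ttl_seconds = ttl_seconds  # lifecycle control only
        self._partitions = {}           # identity tuple -> list of Entry

    def put(self, identity, embedding, response, now=None):
        now = time.time() if now is None else now
        entry = Entry(embedding, response, now + self.ttl_seconds)
        self._partitions.setdefault(identity, []).append(entry)

    def get(self, identity, embedding, now=None):
        now = time.time() if now is None else now
        # Correctness: only entries in the same identity partition are candidates.
        live = [e for e in self._partitions.get(identity, []) if e.expires_at > now]
        if identity in self._partitions:
            self._partitions[identity] = live  # lifecycle: drop expired entries
        best = max(live, key=lambda e: cosine(e.embedding, embedding), default=None)
        if best is not None and cosine(best.embedding, embedding) >= self.threshold:
            return best.response
        return None
```

The similarity check only ever runs inside one partition, so a near-identical query issued under a different `ctx_hash` is a miss by construction, regardless of how long the TTL is.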

Versioning the system prompt explicitly, like you're doing, achieves the same isolation just at a more visible layer.

This is exactly where semantic caching becomes infrastructure engineering rather than just an optimization.