Hey everyone,
I’ve been working on an open-source tool called CacheSentry, and I’m looking for feedback from people building real LLM apps.
The problem it focuses on is prompt-cache regressions.
In long-prompt apps, the beginning of the prompt is often mostly stable:
- system instructions
- tool schemas
- policies
- retrieved context
- memory
- conversation structure
But small dynamic fields can accidentally get inserted near the front:
- UUIDs
- timestamps
- request IDs
- session IDs
- dynamic metadata
- shuffled tool/schema order
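That can silently destroy the stable prefix and reduce prompt-cache reuse. Prompt caches generally match on an exact leading prefix, so a single changed value near the front turns everything after it into a cache miss.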
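To make the failure mode concrete, here is a minimal sketch (not CacheSentry code). It assumes whitespace-split words as a stand-in for real tokenizer output, and `build_prompt` and `shared_prefix_len` are hypothetical helpers written just for this illustration. It compares two requests built from the same template and counts how many leading tokens they share.

```python
import uuid


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(request_id: str, id_first: bool) -> list[str]:
    """Build a toy prompt; whitespace tokens approximate real tokens (assumption)."""
    stable = ("You are a support agent. Follow the policies. " * 50).split()
    dynamic = [f"request_id={request_id}"]
    question = "What is my order status?".split()
    if id_first:
        return dynamic + stable + question
    return stable + dynamic + question


# Two requests with the ID at the very front of the prompt.
early_a = build_prompt(str(uuid.uuid4()), id_first=True)
early_b = build_prompt(str(uuid.uuid4()), id_first=True)

# Two requests with the ID placed after the stable block.
late_a = build_prompt(str(uuid.uuid4()), id_first=False)
late_b = build_prompt(str(uuid.uuid4()), id_first=False)

early_shared = shared_prefix_len(early_a, early_b)
late_shared = shared_prefix_len(late_a, late_b)
print(f"ID first: {early_shared} shared tokens")
print(f"ID later: {late_shared} shared tokens")
```

With the ID first, the two requests diverge at token zero, so nothing is reusable. With the ID moved after the stable block, the whole 400-word block is shared. Real providers add their own rules on top of this (minimum cacheable length, block granularity), so treat the numbers as illustrative only.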
In one controlled validation I ran:
- Stable prompt: 2,816 cached tokens
- UUID near the front: 0 cached tokens
- UUID moved later: 2,816 cached tokens again
That was the moment I realized this is the kind of issue most teams probably won’t catch in code review.
So I built CacheSentry.
It analyzes prompt traces and can:
- compare against a known-good baseline
- detect dynamic fields near the stable prefix
- estimate reusable token loss
- identify the culprit field
- fail CI when cacheability regresses
- compare predictions with runtime cache signals where available
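For anyone who wants a feel for what a baseline check like this could look like, here is a rough sketch. It is not CacheSentry's actual API: `CacheabilityReport`, `stable_prefix`, `compare`, and `ci_exit_code` are hypothetical names, and the traces are assumed to be pre-tokenized lists of strings from several runs of the same template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheabilityReport:
    baseline_prefix: int             # tokens stable across baseline runs
    candidate_prefix: int            # tokens stable across candidate runs
    lost_tokens: int                 # reusable tokens lost by the change
    first_divergence: Optional[str]  # first candidate token that varies


def stable_prefix(runs: list[list[str]]) -> int:
    """Length of the token prefix shared by every run of one template."""
    n = 0
    for column in zip(*runs):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def compare(baseline_runs: list[list[str]],
            candidate_runs: list[list[str]]) -> CacheabilityReport:
    """Compare the stable prefix of a candidate template to a known-good baseline."""
    base = stable_prefix(baseline_runs)
    cand = stable_prefix(candidate_runs)
    culprit = None
    if cand < base and cand < len(candidate_runs[0]):
        # The token at the divergence point is the likely culprit field.
        culprit = candidate_runs[0][cand]
    return CacheabilityReport(base, cand, max(base - cand, 0), culprit)


def ci_exit_code(report: CacheabilityReport, max_lost_tokens: int = 0) -> int:
    """Return a non-zero exit code when the regression exceeds the budget."""
    return 1 if report.lost_tokens > max_lost_tokens else 0


# Toy traces: the candidate template moved the request ID ahead of the tools.
baseline_runs = [
    ["sys", "tools", "policy", "req=1", "question"],
    ["sys", "tools", "policy", "req=2", "question"],
]
candidate_runs = [
    ["sys", "req=1", "tools", "policy", "question"],
    ["sys", "req=2", "tools", "policy", "question"],
]

report = compare(baseline_runs, candidate_runs)
exit_code = ci_exit_code(report)
print(report, exit_code)
```

In a CI job the exit code would be passed to `sys.exit` so the pipeline fails when the stable prefix shrinks. The `max_lost_tokens` budget is one possible design choice that lets teams tolerate small, intentional changes.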
The idea is simple:
Unit tests check correctness.
Evals check output quality.
Observability checks latency and cost.
CacheSentry checks whether a prompt/template change broke cacheability.
Repo:
https://github.com/PS4Emp/cachesentry
I’m especially interested in feedback from people building agents, RAG systems, long-context apps, LiteLLM gateways, or OpenTelemetry-based LLM observability.
Has anyone here seen prompt caching behave unexpectedly because dynamic content moved too early in the prompt?