Every time an LLM agent re-sends a system prompt, a tool schema, or a block of retrieved documents, the GPU recomputes attention state it has already computed. MinIO calls that the "recompute tax," and its new MemKV cache is built to stop paying it. The company claims up to 95% better GPU utilization for inference-heavy pipelines. That number is worth unpacking before you wire MemKV into your stack.
The recompute tax, defined
A transformer generates text in two phases. Prefill processes the entire input prompt at once and builds a key/value (KV) tensor for every token in every attention layer. Decode then generates output tokens one at a time, reusing that KV state so it never re-reads earlier tokens from scratch. The KV cache is what makes decode fast.
The problem is that the KV cache normally lives in GPU high-bandwidth memory (HBM) and disappears the moment a request finishes or gets evicted under memory pressure. For a single chatbot turn that is fine. For agentic workloads it is wasteful, because those workloads re-send the same tokens constantly:
- A multi-step agent replays its full system prompt and tool definitions on every step.
- A RAG pipeline prepends the same retrieved passages across follow-up questions.
- A batch job runs hundreds of prompts that share an identical instruction header.
Each of those shared prefixes triggers a fresh prefill. Prefill is compute-bound: its cost grows roughly with prompt length times model size, and the attention portion grows faster than linearly as prompts get long. A long, reused prefix can therefore burn seconds of GPU time producing a KV cache that is effectively identical to one you computed a minute ago. That is the tax.
The KV cache is also bigger than most people expect. For a 70B-class model using grouped-query attention at 16-bit precision, the cache runs roughly 300 KB per token, so a 10,000-token context is around 3 GB of state. Multiply by concurrent requests and it is easy to see why HBM fills up and caches get evicted.
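The per-token figure falls out of simple arithmetic. Here is a minimal sketch, assuming an illustrative 70B-class configuration (80 layers, 8 KV heads, head dimension 128, 2 bytes per value). Those numbers are assumptions for the example, not figures from MinIO, so substitute your own model's config.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed 70B-class grouped-query-attention configuration (illustrative only).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
context_bytes = per_token * 10_000

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{context_bytes / 1e9:.2f} GB for a 10,000-token context")
```

Halving `bytes_per_value` (an 8-bit KV cache) or reducing `num_kv_heads` shrinks the cache proportionally, which also shrinks whatever you later have to move between tiers.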
What MemKV actually changes
MemKV's pitch is tiering. Instead of letting the KV cache live and die in HBM, it persists attention state to a memory or storage tier that is meant to be quicker to read back than the state is to recompute, then hands it back when a matching prefix shows up again. A reused system prompt gets its KV cache loaded instead of recomputed. Across nodes, one machine's prefill can populate a cache that another machine reads.
This is the same idea behind prefix caching in vLLM and the prompt-caching features cloud providers expose, extended past the boundary of a single GPU's memory. The win is real when prefix reuse is high: if 90% of your token volume is shared boilerplate, eliminating its recompute removes most of your prefill cost.
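To make "a matching prefix shows up again" concrete, here is a generic sketch of block-based prefix matching. It is not MemKV's or vLLM's actual implementation; `BLOCK_SIZE`, `block_hashes`, and `cached_prefix_tokens` are hypothetical names chosen for the example, and the cache is a plain set of hashes rather than real KV tensors.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cached block (assumed value for illustration)


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block of tokens, chaining in the previous block's hash
    so that every hash identifies the whole prefix up to that block."""
    hashes: list[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[str],
                         block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens whose blocks are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # a prefix match ends at the first missing block
        hits += 1
    return hits * block_size


system_prompt = list(range(64))                      # stand-in token IDs
request_a = system_prompt + list(range(1000, 1020))
request_b = system_prompt + list(range(2000, 2030))

cache = set(block_hashes(request_a))                 # pretend request A was prefilled
reused = cached_prefix_tokens(request_b, cache)
print(f"{reused} of {len(request_b)} tokens can skip prefill")
```

Chaining the hashes is the design choice that matters: a block only counts as a hit if everything before it matched too, which is what the KV state actually depends on.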
So what does "95% better GPU utilization" describe? Read it as the share of redundant prefill MemKV can remove under favorable conditions — heavy reuse, stable prefixes, a fast path back to the cached bytes. It is not a promise that every workload gets 95% faster, and it is not a claim about absolute hardware utilization. Treat it as a ceiling, not a baseline.
When reload beats recompute
Offloading is not free. Loading a KV cache means moving those gigabytes back to the GPU, and that transfer competes with the recompute it replaces. The decision comes down to one comparison:
- Recompute cost scales with prompt length and model FLOPs. For a given prefix it is fixed by your model and your GPU.
- Reload cost scales with cache size divided by the bandwidth between the cache tier and the GPU.
Inside a single node, NVLink or PCIe 5 moves a few-gigabyte cache in well under a second — comfortably faster than a multi-second prefill. Across a network, a 3 GB cache over a 100 Gbps link still lands in roughly a quarter of a second. But push the cache to slower object storage, or run on a congested network, and reload can cost more than just recomputing from scratch.
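That comparison can be roughed out on paper before any benchmarking. The sketch below assumes a 70B-parameter model, the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass, and a sustained throughput of 4e14 FLOP/s. Every one of those numbers, along with the 0.5 GB/s storage figure, is an assumption for illustration, not a measurement.

```python
def reload_seconds(cache_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move a KV cache from its tier to the GPU, ignoring latency."""
    return cache_bytes / bandwidth_bytes_per_s


def recompute_seconds(prefix_tokens: int, model_params: float,
                      effective_flops: float) -> float:
    """Rough prefill time: ~2 FLOPs per parameter per token.
    Ignores the attention term that grows with sequence length."""
    return 2 * model_params * prefix_tokens / effective_flops


CACHE_BYTES = 3e9          # ~3 GB cache for a 10,000-token prefix
PREFIX_TOKENS = 10_000
MODEL_PARAMS = 70e9        # assumed model size
EFFECTIVE_FLOPS = 4e14     # assumed sustained throughput, not a measured figure

recompute = recompute_seconds(PREFIX_TOKENS, MODEL_PARAMS, EFFECTIVE_FLOPS)

# Bandwidths in bytes per second; 100 Gbps is 12.5 GB/s.
tiers = {
    "100 Gbps network": 12.5e9,
    "slow object storage (assumed 0.5 GB/s)": 0.5e9,
}
for name, bandwidth in tiers.items():
    reload_s = reload_seconds(CACHE_BYTES, bandwidth)
    winner = "reload" if reload_s < recompute else "recompute"
    print(f"{name}: reload {reload_s:.2f}s vs recompute {recompute:.2f}s -> {winner}")
```

Under these assumptions the network tier wins comfortably and the slow storage tier loses, which is the whole argument in two lines of output. Real numbers will differ, since this ignores request latency, queueing, and contention on the link.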
Benchmark reload-versus-recompute on your own hardware before committing. The break-even point depends on your interconnect bandwidth, model size, and prefix-reuse rate. A cache tier that helps a co-located GPU cluster can quietly slow down a workload spread across a slow network.
There is also a correctness dimension. A cached KV block is only valid if the tokens, model weights, quantization, and attention configuration that produced it all match the current request exactly. A keying scheme that is too loose serves stale or mismatched state; one that is too strict never hits. This is the unglamorous engineering that decides whether tiered KV caching works in production.
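One way to think about the keying problem is to fold everything that affects the KV values into the cache key. This is a hypothetical sketch, not MemKV's scheme: `KVCacheContext`, its fields, and the example values are all invented for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class KVCacheContext:
    """Everything besides the tokens that changes the KV values (hypothetical fields)."""
    model_id: str
    weights_revision: str
    quantization: str
    kv_dtype: str
    attention_config: str


def kv_cache_key(token_ids: list[int], ctx: KVCacheContext) -> str:
    """Deterministic key: identical tokens and context always hash the same."""
    payload = json.dumps(
        {"ctx": asdict(ctx), "tokens": token_ids},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ctx = KVCacheContext(
    model_id="example-70b",            # placeholder name, not a real model
    weights_revision="rev-1",
    quantization="fp16",
    kv_dtype="float16",
    attention_config="gqa-8kv-rope",
)
tokens = [101, 7592, 2088, 102]

same = kv_cache_key(tokens, ctx) == kv_cache_key(list(tokens), ctx)
differs = kv_cache_key(tokens, ctx) != kv_cache_key(tokens, replace(ctx, quantization="int8"))
print(f"same inputs hit: {same}, changed quantization misses: {differs}")
```

A key like this errs on the strict side: any change to weights or quantization invalidates the cache, which costs hit rate but never serves mismatched state. A production key would more likely be computed per block of tokens than per whole prompt.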
Should you adopt it?
MemKV — and KV cache offloading generally — pays off when three things are true: your prefixes are long, they are heavily reused, and the GPUs sit close to the cache tier. Agentic systems and RAG pipelines usually satisfy the first two. The third is an infrastructure decision you control.
Before adopting any KV cache tier, measure your prefix-reuse rate. Log the token-level overlap between consecutive requests for a day. If shared prefixes are under a third of your token volume, fix prompt structure first — stabilizing your system prompt and retrieval order often beats adding a caching layer.
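A first-pass measurement needs only logged token IDs. The sketch below assumes you can export each request as a list of token IDs in arrival order; `common_prefix_len` and `prefix_reuse_rate` are hypothetical helper names, and the tiny `requests` list is made-up data.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared leading run of tokens between two requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_reuse_rate(requests: list[list[int]]) -> float:
    """Share of all logged tokens that repeat the previous request's prefix."""
    total = sum(len(r) for r in requests)
    if total == 0:
        return 0.0
    reused = sum(
        common_prefix_len(prev, cur)
        for prev, cur in zip(requests, requests[1:])
    )
    return reused / total


# Made-up token IDs standing in for a day of logged requests.
requests = [
    [1, 2, 3, 4, 10],
    [1, 2, 3, 4, 11, 12],
    [1, 2, 3, 5],
]
rate = prefix_reuse_rate(requests)
print(f"prefix reuse rate: {rate:.0%}")
```

Comparing only consecutive requests understates reuse when traffic from different agents or tenants is interleaved. Grouping the log by session or by system prompt before computing the rate gives a fairer picture.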
If your workload is single-turn, short-context, or has a unique prompt every time, the recompute tax is small and offloading adds complexity for little return. The tax is only worth eliminating once you are actually paying a lot of it.
The recompute tax is real, and for agentic and RAG workloads it can be a large line item. MemKV is a credible way to stop paying it — provided you verify the reload path is genuinely faster than recompute on your own hardware, rather than trusting a headline number.
Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.