Alibaba Cloud Smart Studio

KV-Pool: 4.5x Agent Inference Throughput with Persistent KV Cache

Why Agent Workloads Are Expensive

LLM inference cost scales with context length, and agent workloads make this especially expensive. Consider a coding agent helping a developer refactor a module. The agent reads the file, proposes an edit, applies it, runs tests, sees a failure, reads the error log, and tries again.

Each of these steps is a separate LLM call, and each call carries the entire conversation history. By the final step, the context has grown to 30K+ tokens, but the new information is just a few lines of test output. Without a cache, the model recomputes the whole context from scratch on every call.
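To make the cost concrete, here is a minimal sketch that counts the tokens a model has to prefill over one session, with and without reuse of the earlier context. The session shape (an 8K-token opening prompt followed by 11 turns of 2K tokens each, ending at 30K) is a hypothetical example, not a measurement, and output tokens are ignored for simplicity.

```python
def prefill_tokens(turn_sizes, reuse_prefix):
    """Total tokens prefilled across a session.

    turn_sizes:   new tokens appended to the context on each turn.
    reuse_prefix: if True, assume KV states for the prior context are cached,
                  so only the newly appended tokens need prefill.
    """
    total = 0
    context = 0
    for new_tokens in turn_sizes:
        context += new_tokens
        total += new_tokens if reuse_prefix else context
    return total


# Hypothetical session: 8K-token initial prompt, then 11 turns of 2K tokens.
turns = [8000] + [2000] * 11

without_cache = prefill_tokens(turns, reuse_prefix=False)  # full context every turn
with_cache = prefill_tokens(turns, reuse_prefix=True)      # only new tokens
savings = without_cache / with_cache

print(f"without cache: {without_cache} tokens")
print(f"with cache:    {with_cache} tokens")
print(f"ratio:         {savings:.1f}x")
```

Under these assumptions the uncached session prefills 228,000 tokens to deliver 30,000 tokens of actual content, and the gap widens as the session gets longer.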

KV-Pool: Reuse What You Already Computed

To maximize GPU utilization, improve throughput, and reduce inference latency, we introduced an optimized KV-Pool service.

KV-Pool persists KV cache across requests in a shared, GPU-resident memory pool. When the next request arrives with overlapping context, the system performs a prefix match against cached entries, skips the redundant prefill computation, and only processes the new tokens. This means the model does not re-read the system prompt, conversation history, or prior tool results that it has already encoded.

The cache is indexed by token-level prefix matching: as long as the beginning of a new request matches a cached sequence, the corresponding KV states are loaded directly from the pool instead of being recomputed. The longer the shared prefix, the more computation is saved. In multi-turn agent sessions where context grows incrementally, hit rates compound with each successive turn.
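The lookup described above can be illustrated with a toy prefix index. This is a sketch under simplifying assumptions, not KV-Pool's implementation: the class name `PrefixCache` and its methods are hypothetical, the index is a plain in-memory trie, and it stores no KV tensors at all. It only shows how the length of the matched prefix determines how many tokens still need prefill.

```python
class PrefixCache:
    """Toy token-level prefix index (a trie).

    A real pool would attach KV states to cached positions; this sketch
    only tracks which token prefixes have been seen.
    """

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence so later requests can match its prefixes."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
turn_1 = [101, 7, 7, 42, 9]      # hypothetical token IDs for the first request
turn_2 = turn_1 + [15, 16]       # same context plus two new tokens

cold = cache.match_len(turn_1)   # nothing cached yet, so no prefix matches
cache.insert(turn_1)
warm = cache.match_len(turn_2)   # the entire previous request is a prefix
to_prefill = len(turn_2) - warm  # only the appended tokens remain
```

Note that matching is strictly from the start of the sequence: a request that differs in its first token gets no reuse, even if everything after it is identical.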

Benchmark

Why This Workload

We benchmarked KV-Pool using conversation traces captured from real Claude Code interactions, not synthetic data. We chose this workload deliberately: coding agents are among the most demanding agent use cases, and their traffic patterns amplify the exact bottleneck KV-Pool addresses.

  • Long inputs, short outputs: the model spends most of its compute on prefill, not generation.

  • Heavy context reuse across turns: each turn appends a small amount of new content to the same growing context. Most of the input is repeated from prior turns.

  • Where KV-Pool pays off most: when the ratio of reusable context to new tokens is high, cache hit rates climb toward the theoretical maximum.
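The relationship in the last bullet can be written down directly. As a rough sketch, if a request carries r reusable tokens for every new token, the best-case token-level hit rate is r / (r + 1). The function name `hit_rate` and the sample ratios are illustrative assumptions; real hit rates also depend on eviction and cache capacity, which this ignores.

```python
def hit_rate(reuse_ratio):
    """Best-case cache hit rate for a request.

    reuse_ratio: reusable (already cached) tokens per new token.
    Assumes every reusable token is actually found in the cache.
    """
    if reuse_ratio < 0:
        raise ValueError("reuse_ratio must be non-negative")
    return reuse_ratio / (reuse_ratio + 1.0)


# Hypothetical ratios, from a short chat turn to a long agent session.
rates = {ratio: hit_rate(ratio) for ratio in (1, 4, 19, 99)}

for ratio, rate in rates.items():
    print(f"{ratio:>3} reusable tokens per new token -> {rate:.0%} hit rate")
```

A hit rate around 95% therefore corresponds to roughly 19 reusable tokens for each new one, which is the regime long multi-turn coding sessions tend to sit in.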

These results reflect agent-specific workloads. Other use cases such as chatbots, RAG, and batch processing will also benefit from KV-Pool, but the magnitude of improvement will vary depending on context overlap and turn structure.

Setup

  • Hardware: H20 GPUs (4-card and 8-card configurations)

  • Models: MiniMax M2.5, DeepSeek V4 Flash, Qwen3.5-122B, Qwen3.5-397B

  • Concurrency: 16 parallel sessions

  • Duration: 600-second sustained load window

  • Data: Multi-turn coding assistant session replays

Benchmark Results

Across all four models, the pattern is consistent:

  • Input Throughput: improved up to 4.5x

  • TTFT (time to first token): dropped 47-91%

  • Average Total Latency: dropped 41-70%

  • Cache Hit Rate: reached 94.9-96.2% with KV-Pool enabled

The key takeaway: cache benefits are stable and predictable. Despite their different architectures and scales, all four models converge on hit rates of roughly 95% under this workload. If your application has similar multi-turn, long-context patterns, you can expect similar gains.

In practical terms, an agent task that previously required the user to wait through several seconds of latency on every turn now feels closer to a real-time conversation. The model responds fast enough that inference is no longer the bottleneck in the agent loop.

Instead, the limiting factor shifts to the agent framework itself: tool execution, file I/O, API calls. This is a meaningful threshold. When inference latency drops below the time spent on tool actions, the user stops noticing the model and starts experiencing the agent as a continuous workflow.

What This Means in Practice

Faster Agent Interactions

TTFT is a critical factor in agent workloads. Every LLM call in an agent loop blocks until the first token arrives, and these calls happen sequentially.

KV-Pool reduces TTFT by up to 91%, significantly lowering the latency of each agent call. Agent loops complete faster, and tasks that previously felt sluggish become responsive. Whether it's a coding assistant iterating through file edits, a review agent processing feedback rounds, or a documentation generator building content incrementally, the experience stays fast as context grows.
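Because the calls are sequential, the time spent waiting for first tokens adds up linearly across the loop. The sketch below works through that arithmetic. The 2.0-second baseline TTFT and the 20-call loop are hypothetical inputs chosen for illustration; only the 91% figure comes from the benchmark above, and it is the best case, not a typical value.

```python
def blocked_seconds(num_calls, ttft_seconds, ttft_reduction=0.0):
    """Total time an agent loop spends waiting for first tokens.

    num_calls:      sequential LLM calls in the loop.
    ttft_seconds:   baseline time to first token per call.
    ttft_reduction: fractional TTFT reduction, e.g. 0.91 for 91%.
    """
    if not 0.0 <= ttft_reduction <= 1.0:
        raise ValueError("ttft_reduction must be between 0 and 1")
    return num_calls * ttft_seconds * (1.0 - ttft_reduction)


# Hypothetical loop: 20 sequential calls at a 2.0 s baseline TTFT.
baseline = blocked_seconds(20, 2.0)
best_case = blocked_seconds(20, 2.0, ttft_reduction=0.91)

print(f"baseline:  {baseline:.1f} s waiting on first tokens")
print(f"best case: {best_case:.1f} s waiting on first tokens")
```

This counts only the wait for the first token of each call. Generation time and tool execution are separate and are not affected by the cache.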

More Users on the Same Hardware

Higher throughput means the same GPU deployment can serve significantly more concurrent agent sessions at acceptable latency. For teams scaling their user base, this defers the need for additional hardware and keeps per-user infrastructure cost flat.

This matters for agent workloads specifically because each user session is long-lived and context-heavy. A single coding assistant session can occupy GPU memory for dozens of turns, and the context only grows. With KV-Pool, the same hardware absorbs the additional load because the per-request compute cost drops significantly.

GPU Revenue Potential

For teams looking to monetize their GPU infrastructure, KV-Pool directly improves the return on every card. Using market pricing as a reference, our team estimated the revenue potential of different model deployments under agent workloads. The results show healthy gross margins even at moderate utilization levels.

With KV-Pool enabled, the same GPUs process more tokens per hour, which means more revenue from the same hardware investment.

Get Started

KV-Pool fits both cases: deploying high-performance open-source models for internal use, or serving third-party customers for profit.

Try it in Smart Studio →

Contact us → for partnership details and custom deployment options.