DEV Community

Aamer Mihaysi

DeepSeek-V4: Finally, a Context Window Built for Agents

Most long-context models are benchmarks in search of a use case. DeepSeek-V4 is different. It is built for the one workload that actually needs a million tokens: agents running long-horizon tasks.

The specs are straightforward. Two MoE checkpoints: V4-Pro at 1.6T total parameters with 49B active, and V4-Flash at 284B total with 13B active. Both ship with a 1M-token context window. But the headline is not the window size. It is what happens to inference cost as you use it.

At 1M tokens, V4-Pro requires 27% of the single-token FLOPs of V3.2, and its KV cache uses 10% of the memory. V4-Flash drops further: 10% of the FLOPs, 7% of the KV cache. Against a standard grouped-query attention baseline, V4 uses roughly 2% of the cache size. These are not incremental gains. They are the difference between a demo and a production deployment.
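To make the memory claim concrete, here is a back-of-the-envelope KV-cache calculator. The layer count, head count, and head dimension below are assumed for illustration, not published V4 configs; only the relative ~2% figure comes from the post.

```python
# Back-of-the-envelope KV-cache sizing. All shapes below are ASSUMED
# for illustration; only the ~2% relative savings is from the post.

def kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes for keys + values across all layers, in GiB."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # K and V
    return elems * bytes_per_elem / 2**30

# Hypothetical GQA baseline: 60 layers, 8 KV heads of dim 128, BF16 (2 bytes).
baseline = kv_cache_gib(1_000_000, 60, 8, 128, 2)

# A cache compressed to ~2% of that baseline, as the post claims for V4
# against a standard GQA baseline at 1M tokens.
compressed = baseline * 0.02

print(f"baseline:   {baseline:.1f} GiB")
print(f"compressed: {compressed:.1f} GiB")
```

Under these assumed shapes, the baseline cache alone exceeds the memory of a single accelerator, while the compressed version fits comfortably; that gap is what separates a demo from a deployment.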

Hybrid Attention

The architecture splits attention into two mechanisms that alternate across layers.

Compressed Sparse Attention (CSA) compresses KV entries 4x using softmax-gated pooling, then runs a lightning indexer in FP4 to select top-k blocks per query. A sliding window handles the most recent uncompressed tokens.
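A rough shape of the CSA path in NumPy. The 4x softmax-gated pooling and top-k block selection follow the description above, but the gate (a random projection here), block size, and scoring are placeholders, not DeepSeek's implementation, and the real indexer runs in FP4.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_pool_4x(kv, gate_w):
    """Softmax-gated pooling: each group of 4 KV entries collapses to one,
    weighted by a learned gate (a random projection stands in here)."""
    T, d = kv.shape
    kv = kv[: T - T % 4].reshape(-1, 4, d)   # (T/4, 4, d)
    gates = softmax(kv @ gate_w, axis=1)     # (T/4, 4, 1)
    return (gates * kv).sum(axis=1)          # (T/4, d)

def topk_blocks(q, compressed_k, block=4, k=2):
    """Toy indexer: score compressed blocks against the query and keep
    the top-k block indices; attention then only visits those blocks."""
    n_blocks = len(compressed_k) // block
    scores = (compressed_k[: n_blocks * block]
              .reshape(n_blocks, block, -1) @ q).max(axis=1)
    return np.argsort(scores)[-k:]

rng = np.random.default_rng(0)
kv = rng.standard_normal((64, 16))
gate_w = rng.standard_normal((16, 1))
pooled = gated_pool_4x(kv, gate_w)                       # 64 -> 16 entries
picked = topk_blocks(rng.standard_normal(16), pooled)    # 2 of 4 blocks
print(pooled.shape, sorted(picked))
```

The point of the two-stage design: pooling shrinks the cache 4x up front, and the indexer then prunes the attention pattern so each query touches only a few blocks of the already-compressed stream.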

Heavily Compressed Attention (HCA) goes further: 128x compression, then dense attention over the compressed stream. The compression is aggressive enough that dense attention becomes cheap.
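A sketch of why 128x compression makes dense attention affordable. The mean-pool compressor below is a crude stand-in for whatever learned compressor V4 actually uses; the arithmetic is the point: the score matrix per query shrinks by 128x.

```python
import numpy as np

def mean_compress(kv, ratio=128):
    """Crude stand-in for 128x compression: average each block of 128
    entries. V4's actual compressor is learned, not a mean."""
    T, d = kv.shape
    return kv[: T - T % ratio].reshape(-1, ratio, d).mean(axis=1)

def dense_attn(q, k, v):
    s = q @ k.T
    w = np.exp(s - s.max())   # stable softmax over all compressed entries
    w /= w.sum()
    return w @ v

rng = np.random.default_rng(1)
seq = rng.standard_normal((1_024 * 128, 32))  # ~131k-token toy stream
k = mean_compress(seq)                        # -> 1,024 compressed entries
out = dense_attn(rng.standard_normal(32), k, k)

# Each query attends over 1,024 compressed entries instead of 131,072
# raw ones: dense attention over the compressed stream is cheap.
print(k.shape, out.shape)
```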

Layers alternate between CSA and HCA. Storage uses FP8 for most KV entries, BF16 only for RoPE dimensions.

What Actually Changes for Agents

Interleaved thinking across tool calls. V3.2 discarded reasoning traces when a new user message arrived. For multi-turn agent workflows, this meant the model lost accumulated state. V4 preserves reasoning content across user message boundaries when tool calls are present.
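The retention rule can be sketched as a history-pruning policy. The message schema below is hypothetical (the real API's field names may differ): at a new user message, V3.2-style pruning drops all earlier reasoning, while V4-style pruning keeps reasoning attached to assistant turns that issued tool calls.

```python
# Hypothetical message schema: dicts with 'role', optional 'reasoning'
# and 'tool_calls'. Field names are illustrative, not the DeepSeek API.

def prune_v32(history, new_user_msg):
    """V3.2-style: a new user message discards all earlier reasoning."""
    kept = [{k: v for k, v in m.items() if k != "reasoning"} for m in history]
    return kept + [new_user_msg]

def prune_v4(history, new_user_msg):
    """V4-style: keep reasoning on assistant turns that issued tool calls,
    so accumulated agent state survives the user-message boundary."""
    kept = []
    for m in history:
        if m.get("reasoning") and not m.get("tool_calls"):
            m = {k: v for k, v in m.items() if k != "reasoning"}
        kept.append(m)
    return kept + [new_user_msg]

history = [
    {"role": "assistant", "reasoning": "plan A", "tool_calls": ["run_tests"]},
    {"role": "tool", "content": "3 failures"},
    {"role": "assistant", "reasoning": "scratch thoughts", "content": "done"},
]
new = {"role": "user", "content": "now fix the flaky test"}
print(sum("reasoning" in m for m in prune_v32(history, new)))  # 0
print(sum("reasoning" in m for m in prune_v4(history, new)))   # 1
```

In the toy history, the plan behind the `run_tests` call survives the new user message under the V4 policy; under V3.2 the agent would re-derive it from scratch.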

Tool-call schema with dedicated tokens. V4 introduces a DSML special token and an XML-based tool-call format. This removes a class of JSON escaping failures that plague string-based tool calls.
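Why an XML-style format sidesteps the escaping class of bugs: a JSON string argument containing quotes and newlines must be escaped character-by-character, and models that template strings into JSON routinely get this wrong, while an XML element carries the payload as element text. The tag names below are made up for illustration; the post does not spell out the actual DSML schema.

```python
import json
import xml.etree.ElementTree as ET

# Argument with quotes, backslashes, and a newline -- the classic
# failure mode for string-templated JSON tool calls.
code = 'print("hello\\world")\nprint(\'done\')'

# JSON round-trips fine *if* a real serializer is used...
ok = json.loads(json.dumps({"cmd": code}))["cmd"]
assert ok == code

# ...but a model emitting the string verbatim produces invalid JSON:
broken = '{"cmd": "' + code + '"}'
try:
    json.loads(broken)
except json.JSONDecodeError:
    print("naive JSON templating fails")

# An XML-style call (tag names hypothetical, not the real DSML schema)
# carries the payload as element text, with no hand-rolled escaping:
call = ET.Element("tool_call", name="shell")
ET.SubElement(call, "arg", name="cmd").text = code
parsed = ET.fromstring(ET.tostring(call)).find("arg").text
assert parsed == code
print("xml round-trip ok")
```

A dedicated special token marking the tool-call boundary removes the remaining ambiguity of where the call starts and ends inside free-form model output.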

DSec: a sandbox built for RL rollouts. The agent behavior was trained with RL against real tool environments. DeepSeek Elastic Compute exposes four execution substrates: function calls, containers, microVMs (Firecracker), and full VMs (QEMU).

The Numbers

  • Terminal Bench 2.0: 67.9
  • SWE Verified: 80.6 resolved
  • MCPAtlas Public: 73.6
  • Toolathlon: 51.8

V4-Pro-Max hits a 67% pass rate on DeepSeek's internal R&D coding benchmark, versus 47% for Sonnet 4.5 and 70% for Opus 4.5.

Long-context retrieval holds at 0.59 accuracy on MRCR 8-needle at 1M tokens.

The Real Test

V4-Pro is at parity with frontier closed models on agent tasks. The open question is whether the community's tool harnesses adapt to the DSML schema and whether the interleaved thinking gains transfer to out-of-domain agent frameworks.

The model is on the Hub. The architecture is documented. The sandbox is described. What happens next depends on whether the ecosystem builds around these primitives or ignores them in favor of the next benchmark chase.

Top comments (2)

Rasmus Ros

A million tokens only matters if people can afford to use it. That seems like the real point here. Lower cost and less memory make long-running agent work more practical, not just more impressive on paper.

Keeping context across tool steps also seems more useful than most headline model upgrades. The part I would watch is adoption. If the tool format fits easily into what people already use, this could land. If not, it risks being another strong model with extra friction.

Suny Choudhary

Bigger context windows definitely help agents, but they do not remove the hard parts.

More context means the agent can carry longer workflows, more files, and more history. But it also means more noise, more stale assumptions, and more places for bad instructions or sensitive data to hide.

For agents, context size is only half the problem. The real question is what gets included, what gets ignored, and how reliably the agent can reason over that much context without drifting.