DeepSeek's new open models give everyone a million-token memory by default

#openweights #longcontext #deepseek #attention

DeepSeek has previewed its V4 model family, led by a 1.6 trillion-parameter flagship, and made a one-million-token context window the default across all its services. The weights are downloadable and self-hostable, putting frontier-scale long context in reach of smaller labs and individuals without per-token payment to a closed provider.

Key facts

What: DeepSeek previewed two free-to-download V4 models that can read a million tokens at once, with that window offered as the standard setting rather than a premium add-on.
When: 2026-06-29
Primary source: read the source

A large language model has no persistent memory. Each time it answers, it re-reads everything in front of it — your question, the conversation so far, any documents you pasted — and that pile of text is the context. The context window is the hard ceiling on how much it can hold at once. For years that ceiling was a few thousand words, then tens of thousands. Pushing it to a million has been possible but expensive, usually sold as a special, pricey tier. DeepSeek's move is to make a million the everyday default.

The family comes in two sizes. V4-Pro is the big one: 1.6 trillion parameters in total, but only about 49 billion of them (roughly 3%) switch on for any given token. That design is called a mixture of experts: instead of running the entire brain for every token, the model routes each token to a small, relevant subset of specialists, so it stays affordable to run despite its enormous size. V4-Flash is the smaller, cheaper, faster sibling, meant for everyday chat and quick edits, and DeepSeek says it keeps up with Pro on simpler agent tasks.
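To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. The expert count (8), the number chosen per token (2), and the random router scores are illustrative assumptions, not DeepSeek's published configuration; only the 1.6 trillion and 49 billion figures come from the article.

```python
import math
import random

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over only the selected experts so their weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Figures from the article: 1.6 trillion total, about 49 billion active.
TOTAL_PARAMS = 1.6e12
ACTIVE_PARAMS = 49e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # roughly 3% of the model per token

# Hypothetical router output for one token: 8 experts, 2 chosen.
random.seed(0)
router_scores = [random.gauss(0, 1) for _ in range(8)]
chosen = top_k_route(router_scores, k=2)
```

Only the experts listed in `chosen` would run for that token; the rest sit idle, which is why the cost per token tracks the active parameter count rather than the total.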

Making a million-token window affordable comes down to how the model handles its KV cache: the running set of notes it stores about every previous token, which grows steadily the longer the conversation gets. At a million tokens those notes become a mountain of memory, and the model normally has to consult every note for every new token it writes. DeepSeek's approach, which it calls sparse attention plus token-wise compression, stops doing that. The model attends to a sparse, relevant slice of the past and compresses the rest, the equivalent of skimming back to the few parts of a long report that matter while keeping a compressed gist of everything else. That is what makes a million-token window cheap enough to leave switched on for everyone.
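A rough sketch of why the cache becomes a mountain, and of what attending to a sparse slice can look like. The model shape below (60 layers, 8 key/value heads, 128 dimensions per head, 2 bytes per value) is a hypothetical example, and the top-k selection is a generic stand-in for sparse attention, not DeepSeek's actual mechanism, which the article does not detail.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Memory for an uncompressed KV cache: one key and one value vector
    per token, per layer, per key/value head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def top_k_positions(query, keys, k):
    """Indices of the k cached keys with the largest dot product against the query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical model shape, chosen only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128
per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
full_window = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM)
print(f"{per_token / 1024:.0f} KiB per token, {full_window / 1e9:.0f} GB at one million tokens")

# Toy cache of four 2-dimensional keys; attend to only the 2 most relevant.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = top_k_positions([1.0, 0.0], keys, k=2)
```

Under these assumed dimensions the uncompressed cache costs 240 KiB per token, or about 246 GB for a full million-token window, which is the scale of problem that compression and sparse selection are meant to shrink.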

Long context is the foundation under a lot of useful work. Feeding an AI an entire codebase, a stack of legal contracts, a year of email threads, or a long research transcript depends on how much it can hold at once. Making a million tokens the floor rather than a luxury lowers the bar for everyone building those tools, and because the weights are open, smaller labs and individuals get access to frontier-scale long context without paying a closed provider per token. DeepSeek also says V4-Pro leads open models in math, science, and coding and trails only the very top closed model on general world knowledge, which would further narrow the gap between open and closed AI.

The honest caveat is the difference between a model that supports a million tokens and one that uses them well. Long-context models have a well-documented habit of paying close attention to the beginning and end of a huge input while glossing over the middle — the so-called lost-in-the-middle problem — and aggressive sparse attention can make that worse, because skimming is exactly the behavior that risks missing a buried detail. All of DeepSeek's quality claims also come from DeepSeek's own report; nobody outside the company has independently stress-tested the million-token recall yet. Treat the window as a real and welcome capability, but wait for outside long-context retrieval tests before trusting it to never drop the one sentence that mattered on page 400. One practical note for anyone already building on DeepSeek: the older chat and reasoner endpoints retire on July 24, with traffic shifting to V4-Flash, so existing integrations will need a look.
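The kind of outside retrieval test mentioned above is often run as a "needle in a haystack" check: bury one fact at different depths in filler text and see whether the model can still quote it. The sketch below is a minimal harness under stated assumptions. `forgetful_model` is a made-up stand-in that only sees the start and end of its input; a real test would replace it with a call to whatever model client you use.

```python
def build_haystack(filler, needle, total_sentences, depth):
    """Place `needle` at a fractional depth (0.0 = start, 1.0 = end) in filler text."""
    position = round(depth * total_sentences)
    sentences = [filler] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)

def recall_by_depth(ask_model, needle, answer, depths, total_sentences=1000):
    """Map each depth to True/False: did the model's reply contain the answer?"""
    results = {}
    for depth in depths:
        prompt = build_haystack("The sky was grey that day.", needle, total_sentences, depth)
        reply = ask_model(prompt + " What is the vault code?")
        results[depth] = answer in reply
    return results

# Stand-in "model" that only reads the first and last 2,000 characters,
# mimicking a lost-in-the-middle failure. Swap in a real client call here.
def forgetful_model(prompt):
    visible = prompt[:2000] + prompt[-2000:]
    return "7431" if "7431" in visible else "I don't know."

scores = recall_by_depth(forgetful_model, "The vault code is 7431.", "7431", [0.0, 0.5, 1.0])
```

With this stand-in, the needle is recalled at the start and end of the input but missed in the middle. A real evaluation would use far longer inputs, many depths, and varied needles before drawing any conclusion about a model.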

Originally published on Ground Truth, where every claim is checked against the primary source.
