Baidu Unlimited OCR Holds the KV Cache Constant for 40+ Pages: Reference Sliding Window Attention

#ai #machinelearning #llm

What: The Unlimited OCR release (Baidu, arXiv 2606.23050) is a 3-billion-parameter open OCR model whose decoder replaces standard attention with Reference Sliding Window Attention (R-SWA) — the trick that lets it transcribe 40+ pages in a single forward pass.

Why: The KV cache is the memory that grows with every token a model writes; on a 40-page transcription that growth can dominate inference memory and slow generation, so holding the cache constant is what makes one-pass, whole-document OCR practical.

vs prior: A standard decoder makes each new token attend to the entire growing output, so its KV cache grows linearly; R-SWA makes each token attend to the fixed document plus only the last 128 output tokens, so the cache stays a constant size.

Think of it as

A scribe copying a long book — the source kept open on the desk, and only the last line they wrote still in view.

                 COPYING ONE 40-PAGE BOOK
                            │
              ┌─────────────┴─────────────┐
              │                           │
     ┌────────▼────────┐         ┌────────▼────────┐
     │  R-SWA scribe   │         │   plain scribe  │
     │  (disciplined)  │         │  (stacks pages) │
     └────────┬────────┘         └────────┬────────┘
              │                           │
     desk = source book pinned    desk = every page copied
       + last 128 output tokens      stacked so far, growing
              │                           │
              ▼                           ▼
     ✓ desk never overflows      ✗ desk overflows by page 40
       KV cache stays constant     KV cache grows linearly

reference tokens = the source book kept open on the desk, always in reach
sliding window = only the last line the scribe glances at to keep continuity
constant KV cache = a desk that never overflows, however long the book
linear KV growth = stacking every page you've copied on the desk until it spills
40+ pages in one pass = copying a whole long book in a single sitting

Quick glossary

KV cache — The stored keys and values for every token already processed, so attention never recomputes them. In a standard decoder it grows with every token generated, and on long sequences it is often the dominant memory cost of inference.

Reference Sliding Window Attention (R-SWA) — Baidu's replacement for every decoder attention layer: each generated token attends to all reference tokens plus only the preceding 128 output tokens, instead of the entire growing sequence. That caps the cache at a constant size.

Reference tokens — The document (visual) tokens the encoder produces from the pages being read. R-SWA keeps these fully visible to every output token — they are the fixed part of the attention window, never slid past.

Sliding window attention — An attention mask where each token sees only the last W tokens, not all of them. It bounds memory but, on its own, would slide off the document a model is reading — which is the problem R-SWA's pinned reference tokens fix.

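Forward pass — One run of the model over its inputs to produce outputs. Autoregressive decoding still runs the decoder once per generated token, so "a single forward pass" here means one continuous decoding run over the whole 40+ page document, rather than chunking the document into many smaller runs and stitching the results.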

Active parameters — Unlimited OCR has 3 billion total parameters but activates only 500 million per token — a sparse design where most weights stay idle on any given step, keeping compute low.

DeepSeek-OCR encoder — A high-compression visual encoder that turns a page image into a small number of tokens. Pairing it with R-SWA's constant-cache decoder is what lets dozens of pages fit inside a single 32,000-token context.

The news. On June 22, 2026, Baidu released Unlimited OCR, a 3-billion-parameter (500 million active) end-to-end OCR model that transcribes 40+ pages of documents in a single forward pass under a 32,000-token context. It replaces every decoder attention layer with Reference Sliding Window Attention (R-SWA), which holds the KV cache at a constant size throughout decoding instead of letting it grow with output length, and reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6. Weights and code are public under CC-BY 4.0.

Picture a scribe copying a long book by hand. The trick that keeps the desk clear is not memory — it's what stays on the desk. The source book lies open, always in reach, and the scribe glances at only the last line they wrote to keep the handwriting and spelling continuous. They never re-read the hundred pages already copied; those go in a drawer. So the desk holds the same two things on page 1 and on page 200 — the source, and the current line — and it never overflows, no matter how long the book.

A standard transformer decoder is the opposite scribe: it keeps every page it has copied stacked on the desk, because each new token attends to all previous tokens. That stack is the KV cache, and its size grows linearly with the length of the output. That is fine for a one-paragraph answer and ruinous for a 40-page transcription, where the output is enormous. On long sequences the cache is often already the biggest memory cost in inference; let it grow with every page and a long document becomes much harder to fit and serve efficiently.

R-SWA is the disciplined scribe. It replaces every decoder attention layer so that each generated token attends to exactly two things: the full set of reference tokens — the document the encoder produced, kept pinned and fully visible — plus only the preceding 128 output tokens, a short sliding window over what was just written. The document never slides out of view, but the output history does. Because both pieces are bounded — the document is fixed and the window is 128 — the KV cache stays a constant size from the first page to the fortieth. This is the move a plain sliding window can't make on its own: slide a fixed window over everything and you'd lose the document you're reading; R-SWA exempts the reference tokens from the slide.
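The attention rule described above can be sketched as a visibility mask. This is a minimal illustration, not Baidu's implementation: the function names `rswa_visible` and `rswa_mask` are hypothetical, and the layout (reference tokens first, output tokens after) and the choice to count only preceding tokens are assumptions of the sketch.

```python
def rswa_visible(query_pos, n_ref, window=128):
    """Key positions the output token at `query_pos` may attend to.

    Assumed layout: positions 0..n_ref-1 are reference (document) tokens,
    positions n_ref and up are generated output tokens.
    """
    if query_pos < n_ref:
        raise ValueError("query_pos must index an output token")
    # Pinned part: every reference token, however far back it sits.
    pinned = range(0, n_ref)
    # Sliding part: the preceding `window` output tokens, never reaching
    # back past the first output position.
    start = max(n_ref, query_pos - window)
    recent = range(start, query_pos)
    return list(pinned) + list(recent)


def rswa_mask(n_ref, n_out, window=128):
    """Boolean mask with one row per output token and one column per position."""
    total = n_ref + n_out
    mask = []
    for query_pos in range(n_ref, total):
        visible = set(rswa_visible(query_pos, n_ref, window))
        mask.append([key_pos in visible for key_pos in range(total)])
    return mask


# Tiny example: 3 reference tokens, 5 output tokens, window of 2.
for row in rswa_mask(n_ref=3, n_out=5, window=2):
    print("".join("x" if ok else "." for ok in row))
# xxx.....
# xxxx....
# xxxxx...
# xxx.xx..
# xxx..xx.
```

The first three columns stay marked in every row (the pinned document), while the marked output columns drift to the right as decoding proceeds. With the real window of 128, each row holds at most `n_ref + 128` visible positions regardless of how many tokens have been written.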

Here is why "grows with sequence length" is the term that hurts. KV-cache memory is a product: 2 (one key and one value per token) × layers × KV heads × head dimension × bytes per element × sequence length. Only the sequence-length factor moves as the model writes more. R-SWA freezes that factor for the output: instead of the sequence length climbing toward 32,000, the output's contribution is clamped at 128, while the reference tokens add a fixed, encoder-compressed amount. Pair that constant-cache decoder with DeepSeek-OCR's high-compression visual encoder, which compresses each page image into far fewer visual tokens, and dozens of pages fit in one 32,000-token pass.
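To make the product concrete, here is a rough calculator. The decoder shape (`n_layers`, `n_kv_heads`, `head_dim`) and the reference-token count `N_REF` are hypothetical values picked for round numbers, not figures from the paper; the leading factor of 2 accounts for storing both a key and a value per token.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence."""
    # 2x: one key vector and one value vector per token, per head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * cached_tokens


# Hypothetical decoder shape, for illustration only (not from the paper).
SHAPE = dict(n_layers=24, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
N_REF = 4_000   # hypothetical number of reference (document) tokens
WINDOW = 128    # output window size stated in the article


def standard_cache(n_out):
    # Standard decoder: every reference and every output token stays cached.
    return kv_cache_bytes(N_REF + n_out, **SHAPE)


def rswa_cache(n_out):
    # R-SWA: all reference tokens plus at most WINDOW output tokens.
    return kv_cache_bytes(N_REF + min(n_out, WINDOW), **SHAPE)


MIB = 2 ** 20
print(standard_cache(12_000) / MIB)  # 1500.0
print(rswa_cache(12_000) / MIB)      # 387.0
```

Under these made-up dimensions the standard cache keeps climbing with output length, while the R-SWA figure is the same at 128 output tokens as at 12,000. The absolute numbers mean nothing for the real model; only the shape of the comparison carries over.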

Walk the numbers on one long document. Say transcribing 40 pages produces roughly 12,000 output tokens (illustrative; the real count depends on the document). A standard decoder's cache holds all 12,000 on top of the document tokens, and the 12,000th token attends back across 11,999 earlier outputs, so both memory and per-token attention work climb with every page. R-SWA caps the output window at 128, so that same final token attends to just the last 128 outputs plus the fixed document tokens, and the output's contribution to the cache stays flat at 128 entries whether the document is 4 pages or 40. That clamp, from a number that grows with the page count to a constant 128, is the decoder-side reason this can pair with a compressed visual encoder and read 40+ pages in one forward pass.
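The same walk-through can be simulated by counting cache entries step by step. This sketch assumes a rolling buffer that evicts the oldest output entry, which is one plausible way to realize a fixed window and not necessarily how the released code does it; `simulate_cache_sizes` and the 4,000 reference tokens are hypothetical.

```python
from collections import deque


def simulate_cache_sizes(n_ref, n_out, window=128):
    """Return (step, standard_entries, rswa_entries) after each generated token."""
    standard_outputs = []                  # standard decoder keeps everything
    rswa_outputs = deque(maxlen=window)    # oldest entry is dropped automatically
    sizes = []
    for step in range(1, n_out + 1):
        standard_outputs.append(step)      # stand-in for one cached K/V entry
        rswa_outputs.append(step)
        sizes.append((step,
                      n_ref + len(standard_outputs),
                      n_ref + len(rswa_outputs)))
    return sizes


sizes = simulate_cache_sizes(n_ref=4_000, n_out=12_000)
for step in (128, 1_200, 12_000):
    print(sizes[step - 1])
# (128, 4128, 4128)
# (1200, 5200, 4128)
# (12000, 16000, 4128)
```

After the first 128 tokens the R-SWA column stops moving, while the standard column tracks the output length one for one.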

Attention scheme	Each output token attends to…	KV cache vs output length	Where it earns its keep
Standard causal attention	every previous token	Grows linearly	Accurate, but memory explodes on long outputs
Plain sliding-window attention	only the last ~W tokens (W is a fixed window, model-dependent)	Constant (~W)	Cheap streaming, but it slides off the document being read
R-SWA (Unlimited OCR)	all reference tokens + the last 128 outputs [paper]	Constant	Long-document OCR: keeps the full source visible while bounding output memory

The honest caveats. The 128-token window is a default, and a short window is a bet that the next line of a transcription rarely depends on text written thousands of tokens earlier — true for reading a document top-to-bottom, less obviously true for tasks with long-range output structure. And the constant-cache win leans on the encoder doing real work: if the reference tokens themselves were not compressed, "all reference tokens" would be its own large, fixed cost. But the deeper lesson generalizes past OCR — the paper itself notes R-SWA is "a general-purpose parsing attention mechanism… equally applicable to tasks such as ASR, translation, etc." Once you accept that an output token rarely needs the entire output history, the question stops being "how do we shrink the cache" and becomes "what must stay pinned, and how short can the window be" — and the cache stops growing at all.

Related explainers

FlashMemory — lookahead sparse attention — also bounds the KV cache by having each token attend to fewer keys; R-SWA bounds it by a fixed window plus a pinned reference instead.
SP-KV — self-pruned KV cache — drops low-value KV pairs to shrink the cache; R-SWA never writes the far output history in the first place.
SubQ 1.1 — subquadratic sparse attention — near-linear attention for million-token context; R-SWA is the OCR-decoder cousin of the same "don't attend to everything" idea.
DeepSeek V4 — long-context cost cut to a fraction — attacks the same long-context memory pressure at the architecture level, and shares the optical-compression encoder lineage R-SWA builds on.

FAQ

What is Reference Sliding Window Attention (R-SWA)?

R-SWA is the decoder attention scheme in Baidu's Unlimited OCR (arXiv 2606.23050, June 2026). It replaces every decoder attention layer so that each generated token attends to all reference tokens — the document tokens the encoder produced — plus only the preceding 128 output tokens, rather than the entire growing output sequence. Because the document is fixed and the output window is capped at 128, the KV cache stays a constant size throughout decoding instead of growing linearly with output length. That is what lets the 3-billion-parameter (500 million active) model transcribe 40+ pages in a single 32,000-token forward pass.

Why does holding the KV cache constant matter for OCR?

The KV cache is often the dominant memory cost of long-sequence inference, and in a standard decoder it grows with every token generated. Transcribing a long document produces an enormous output, so a linearly growing cache can strain GPU memory and slow generation, which is one reason OCR pipelines often chunk a document into many small passes and stitch the results. R-SWA caps the output's contribution to the cache at 128 tokens, so memory does not grow with page count, and the whole document can be read in one forward pass. Baidu reports new end-to-end state-of-the-art on OmniDocBench v1.5 and v1.6 with this design.

How is R-SWA different from normal sliding-window attention?

Plain sliding-window attention (as in some streaming models) lets each token see only the last W tokens of everything — which bounds memory but would slide the document a model is reading out of view. R-SWA splits the window in two: the reference (document) tokens are pinned and stay fully visible to every output token, while only the output history is subject to the 128-token slide. So the model keeps the entire source in sight while still bounding the part of the cache that would otherwise grow — the constant-size cache without losing the document.
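A small count makes the difference visible. The sketch below assumes the reference tokens sit at the front of the sequence and that plain sliding-window attention applies one window over reference and output alike; the function names, the 4,000 reference tokens, and the plain-window size of 4,096 are all hypothetical.

```python
def visible_reference_plain_swa(n_ref, n_out_so_far, window):
    """Reference tokens still in view when one window slides over everything."""
    query_pos = n_ref + n_out_so_far       # position of the token being generated
    start = max(0, query_pos - window)     # oldest position still inside the window
    return max(0, n_ref - start)


def visible_reference_rswa(n_ref, n_out_so_far, window=128):
    """Under R-SWA the reference tokens are exempt from the slide."""
    return n_ref


for n_out in (0, 1_000, 5_000):
    plain = visible_reference_plain_swa(4_000, n_out, window=4_096)
    pinned = visible_reference_rswa(4_000, n_out)
    print(n_out, plain, pinned)
# 0 4000 4000
# 1000 3096 4000
# 5000 0 4000
```

Even with a generous plain window, the document falls out of view once enough output has been written, whereas the pinned count never changes.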

Originally posted on Learn AI Visually.