I’ve always been drawn to the world of AI, while also enjoying the low-level mindset of embedded systems—understanding how things work under the hood and finding opportunities to optimize them. That combination naturally led me to explore how large models can be made faster and more efficient. During this exploration, I came across an academic paper on Sparse-K Attention, a technique designed to reduce a significant portion of the heavy computation inside the Attention mechanism. After reading it, I knew I wanted to implement it myself and see what impact it would have on a real inference engine. I chose to integrate it into llama.cpp, which exposes the full computational graph and makes it possible to observe how a change in the mask can influence the entire Attention flow.
What Is Attention, Anyway?
Before diving into Sparse-K, it’s important to understand what the Attention mechanism actually does, and why it’s so computationally expensive.
When a language model receives a sentence like:
“The cat sat on the gray carpet,”
it doesn’t just read the words — it needs to understand the meaning, the relationships, and how each word connects to the others.
To achieve this, every word evaluates how strongly it relates to every other word in the sentence.
This creates a dense web of connections, where each token computes a score against all other tokens.
In other words:
N tokens → N × N attention score computations.
And that quadratic growth is exactly what makes Attention so costly, especially for long sequences.
Why is this such a performance problem?
Because the computation scales quadratically — O(N²).
Once the sequence becomes long, the number of attention connections explodes.
A sequence of 2048 tokens requires over 4 million score computations in a single Attention layer.
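To make that growth concrete, here is a minimal C++ sketch of the dense score computation (my own illustration, not code from any particular engine): every query row is compared against every key row, so the work grows as N × N.

```cpp
#include <cstddef>
#include <vector>

// Dense attention scores: for N tokens with head dimension d, every token's
// query is compared against every token's key, giving N * N dot products
// (over 4 million when N = 2048).
std::vector<float> dense_scores(const std::vector<float>& Q,   // N x d, row-major
                                const std::vector<float>& K,   // N x d, row-major
                                std::size_t N, std::size_t d) {
    std::vector<float> scores(N * N, 0.0f);
    for (std::size_t i = 0; i < N; ++i) {        // each query token...
        for (std::size_t j = 0; j < N; ++j) {    // ...attends to each key token
            float dot = 0.0f;
            for (std::size_t k = 0; k < d; ++k) {
                dot += Q[i * d + k] * K[j * d + k];
            }
            scores[i * N + j] = dot;             // one of the N * N score entries
        }
    }
    return scores;
}
```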
As I looked deeper into these matrices, I realized that most of these connections don’t meaningfully contribute to the final prediction.
Many of them are weak or irrelevant.
Here’s a visual intuition: instead of a dense web where every token attends to every other token, we often only need a few meaningful connections:

And that is exactly what leads us to Sparse-K.
So What Does Sparse-K Actually Do?
Sparse-K introduces a simple but powerful idea: instead of evaluating all pairwise relationships between tokens, it keeps only the top-K most meaningful connections for each token — and ignores the rest.
This means that instead of performing millions of computations, the model works with only a small subset that still preserves most of the important information.
The beauty of this approach is that it doesn’t change the structure of the Attention mechanism — it simply reduces the amount of data it needs to process.
In practice, this dramatically reduces the number of computations, lowers memory pressure, and speeds up the entire inference pipeline.
From Attention Scores to a Sparse Mask
In standard Attention, after computing the similarity scores (Q × K), a full matrix of values is produced.
Sparse-K takes this matrix and applies an additional step:
For each token, only the top-K highest values are selected.
All other values are “removed” using a mask.
The removed values are assigned negative infinity (−∞), or in practice a very large negative number, so they contribute nothing after the Softmax.
As a result, the Softmax and all subsequent computations are performed only on the truly relevant connections.
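The following self-contained C++ sketch illustrates these steps for a single row of scores (it is my own simplified illustration, not the llama.cpp implementation): it selects the top-K entries, builds an additive 0 / −∞ mask, and applies the Softmax to the masked row.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Build an additive Sparse-K mask for one row of attention scores:
// the K largest positions get 0, all others get -infinity.
std::vector<float> sparse_k_mask_row(const std::vector<float>& row, std::size_t K) {
    const float NEG_INF = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(row.size(), NEG_INF);

    // Indices of the K largest scores in this row.
    std::vector<std::size_t> idx(row.size());
    std::iota(idx.begin(), idx.end(), 0);
    K = std::min(K, row.size());
    std::partial_sort(idx.begin(), idx.begin() + K, idx.end(),
                      [&](std::size_t a, std::size_t b) { return row[a] > row[b]; });

    for (std::size_t i = 0; i < K; ++i) {
        mask[idx[i]] = 0.0f;   // keep this connection
    }
    return mask;
}

// Softmax over (score + mask): positions masked with -inf come out as exactly 0.
std::vector<float> masked_softmax(const std::vector<float>& row,
                                  const std::vector<float>& mask) {
    std::vector<float> out(row.size());
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < row.size(); ++i) {
        max_val = std::max(max_val, row[i] + mask[i]);
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < row.size(); ++i) {
        out[i] = std::exp(row[i] + mask[i] - max_val);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}
```

Positions masked with −∞ come out of the Softmax as exactly zero, which is why masking them is equivalent to skipping them in the later stages.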

This image shows the attention score computations between each token and all other tokens in the sentence.
For each token, only the top-K highest scores are selected and highlighted in a different color.
In the next step, a sparse mask is constructed: the selected positions receive a mask value of zero (so their scores pass through unchanged),
while all remaining positions are assigned negative infinity (−∞), so they do not participate in the later computation stages.

This illustrates the final Sparse-K attention mask after top-K selection.
Why Are Masks the Right Solution?
The decision to implement Sparse-K using a mask is not accidental — it is a critical architectural choice.
Masks are already a fundamental part of the Attention mechanism.
Every Transformer model relies on several types of masks, such as:
Causal masks, which prevent the model from attending to future tokens
Sequence separation masks, which distinguish between different input segments
Sparse-K does not introduce a new mechanism; instead, it integrates naturally into the existing masking system.
Rather than modifying operators, rewriting computations, or adding complex control logic to the Attention flow, the approach is simple and elegant:
Build an additional mask that represents the Top-K selection
Combine it with the existing base mask
Use it within the same Attention mechanism that already exists
This allows Sparse-K to fit seamlessly into the current pipeline,
without breaking compatibility, without altering the computational graph structure,
and without requiring deep changes to the core implementation.
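As a small sketch of the "combine it with the existing base mask" step (my own illustration, not the PR's code, using the common additive convention where 0 means allowed and −∞ means blocked):

```cpp
#include <cstddef>
#include <vector>

// Combine an existing additive base mask (e.g. a causal mask) with the
// Sparse-K mask. Both use the same convention: 0 = allowed, -inf = blocked.
// Adding them keeps a position only if BOTH masks allow it.
std::vector<float> combine_masks(const std::vector<float>& base_mask,
                                 const std::vector<float>& sparse_k_mask) {
    std::vector<float> combined(base_mask.size());
    for (std::size_t i = 0; i < base_mask.size(); ++i) {
        combined[i] = base_mask[i] + sparse_k_mask[i];   // -inf dominates
    }
    return combined;
}
```

Because −∞ dominates the sum, a connection survives only if both the causal constraint and the Top-K selection keep it.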
An important additional benefit of a mask-based approach is that it allows us to reuse highly optimized Attention implementations, such as Flash Attention.
Since Flash Attention is already designed to work with masks as part of its interface, Sparse-K can directly leverage these optimizations instead of bypassing them.
In other words, using a mask enables Sparse-K to benefit from all existing Attention optimizations,
achieving meaningful performance improvements while preserving efficiency, compatibility, and maintainability.
Integration into llama.cpp
Sparse-K integrates directly into the Attention block, at a natural point in the computation flow:
after the similarity scores Q × K are computed, and before the subsequent Attention computation stages.
At this point, the Sparse-K mask is constructed, sparsifying the connections and allowing the computation to proceed only on a relevant subset of tokens.
All other parts of the model — the Embedding layer, the MLP computations, and the multi-head composition — remain completely unchanged.
In this way, Sparse-K targets the most computationally expensive part of the model, while leaving the rest of the architecture intact.
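Conceptually, the insertion point looks roughly like the sketch below. This is a simplified illustration rather than the actual code from the pull request: `build_sparse_k_mask` is a hypothetical placeholder standing in for the mask construction, and the ggml call signatures follow recent versions of the library and may differ from the one you have.

```cpp
#include "ggml.h"

// Hypothetical placeholder: builds the 0 / -inf Top-K mask from the raw scores.
// Declared here only to mark where the mask would be constructed; this is not
// an existing ggml or llama.cpp function.
struct ggml_tensor * build_sparse_k_mask(struct ggml_context * ctx,
                                         struct ggml_tensor  * kq,
                                         int top_k);

// Simplified sketch of the attention step with Sparse-K inserted.
// The real llama.cpp graph-building code is more involved; names are illustrative.
struct ggml_tensor * attn_scores_sparse_k(struct ggml_context * ctx,
                                          struct ggml_tensor  * q,
                                          struct ggml_tensor  * k,
                                          struct ggml_tensor  * kq_mask,  // existing causal/base mask
                                          float kq_scale,
                                          int   top_k) {
    // 1. Similarity scores Q x K^T (this is the N x N part).
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);

    // 2. Build the Sparse-K mask from the scores and fold it into the base mask.
    struct ggml_tensor * sparse_mask = build_sparse_k_mask(ctx, kq, top_k);
    struct ggml_tensor * mask        = ggml_add(ctx, kq_mask, sparse_mask);

    // 3. Masked softmax: -inf positions drop out, exactly as with the causal mask.
    return ggml_soft_max_ext(ctx, kq, mask, kq_scale, 0.0f);
}
```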
Measurements and Performance
With the implementation complete, I moved on to measuring its impact.
Sparse-K reduces the score computation inside the Attention mechanism by roughly a factor of N / K (about 64× in the 2048-token, K = 32 configuration shown below).
Performance benchmarks showed a significant speedup in prompt processing time,
while evaluation time and memory usage remained nearly unchanged.

The following image illustrates the dramatic improvement achieved with Sparse-K.
On the left side, Baseline Attention is shown:
each token computes a relationship with every other token, producing a dense matrix of size
2048 × 2048, which results in approximately 4.2 million computations per Attention layer.
On the right side, Sparse-K Attention is shown:
for each token, only the K most relevant connections are retained (in this example, K = 32),
reducing the computation to 2048 × 32, or about 65 thousand computations per layer.
It is important to note that this computation is repeated independently in every Attention layer of the model.
As the model becomes deeper and contains more Attention layers — as illustrated in the figure —
the computational savings accumulate and multiply.
In modern models with dozens of Attention layers,
switching from dense to sparse computation at each layer leads to a substantial reduction in total computation
across the entire inference process.
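A few lines of C++ reproduce this arithmetic; the 32-layer count below is a hypothetical example chosen only to show how the savings scale with depth.

```cpp
#include <cstdio>

int main() {
    const long long N        = 2048;  // context length from the example above
    const long long K        = 32;    // Top-K connections kept per token
    const long long n_layers = 32;    // hypothetical layer count, for illustration only

    const long long dense  = N * N;   // ~4.2 million score computations per layer
    const long long sparse = N * K;   // ~65 thousand score computations per layer

    std::printf("per layer  : %lld vs %lld (%lldx fewer)\n", dense, sparse, dense / sparse);
    std::printf("whole model: %lld vs %lld score computations\n",
                dense * n_layers, sparse * n_layers);
    return 0;
}
```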
The next image presents a table of performance improvements.
The most significant gains are observed during the initial prompt processing, where the bulk of the Attention computation occurs.
How to Choose K?
Choosing K in Sparse-K is a direct tradeoff between speed and accuracy. There is no single “correct” value — it depends on the context length and the model size.
Practical rules of thumb:
- Based on context length (N):
N = 2048 → K ≈ 32–64
N = 4096 → K ≈ 64–128
- Based on model size:
Small models (2B–7B) → K ≈ 32–64
Medium models (13B–34B) → K ≈ 64–128
Large models (70B+) → K ≈ 128–256
- Simple heuristic: K ≈ √N (e.g., N = 2048 → √N ≈ 45, which lands in the 32–64 range above)
In practice:
smaller K favors speed,
larger K favors accuracy.
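As an illustration of these rules of thumb, a tiny helper (my own sketch, not part of the PR) can combine the √N heuristic with a clamp to the ranges above:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Illustrative heuristic: start from sqrt(N), then clamp to a [k_min, k_max]
// range chosen for the model size. The defaults match the small/medium ranges
// above; pass wider bounds (e.g. 128-256) for large models. These are the
// article's rules of thumb, not hard requirements.
int choose_k(int n_ctx, int k_min = 32, int k_max = 128) {
    int k = static_cast<int>(std::lround(std::sqrt(static_cast<double>(n_ctx))));
    return std::clamp(k, k_min, k_max);
}

int main() {
    std::printf("N = 2048 -> K = %d\n", choose_k(2048));   // sqrt(2048) ~ 45
    std::printf("N = 4096 -> K = %d\n", choose_k(4096));   // sqrt(4096) = 64
    return 0;
}
```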
Summary
Sparse-K offers an effective way to reduce the computational cost of the Attention mechanism without changing the model’s structure or disrupting the existing pipeline.
Instead of computing all pairwise token relationships, it focuses on a small subset of connections that actually carry meaningful information.
By selecting Top-K connections per token and applying a sparse mask, the overall amount of computation is significantly reduced, performance improves, and the speed–accuracy tradeoff becomes explicitly controllable.
Working on Sparse-K highlights how carefully examining an existing mechanism can unlock powerful optimizations—without modifying the model itself.
Sparse-K doesn’t make Attention smarter — it simply stops wasting computation on what doesn’t matter.
Thanks for reading — feel free to share thoughts, questions, or ideas in the comments.
The implementation is available as an open pull request and is currently under review:
👉 https://github.com/ggml-org/llama.cpp/pull/16817


