Attention is the fundamental mechanism that enables LLMs to capture long-range dependencies and context. It was introduced as part of the Transformer architecture in the 2017 paper Attention Is All You Need. As models have grown in size, new variants have emerged to address the computational cost of standard Multi-Head Attention.
Multi-Head Latent Attention (MLA), introduced in the DeepSeek-V2 paper, represents a novel approach to efficient attention. MLA introduces a low-rank compression technique that reduces the memory footprint without sacrificing model performance. MLA is used by the DeepSeek-V3 and DeepSeek-R1 models available in the Vertex Model Garden and deployable to Google Kubernetes Engine.
In this blog post, we’ll start with the standard attention mechanism and build up to what makes Multi-Head Latent Attention special.
Attention
To illustrate how single-head attention works, consider this example sentence:
The animal didn’t cross the street, because it was too tired.
How can we understand what “it” means? We need to look at surrounding words, or more precisely, tokens. Attention allows us to analyze this context mathematically.
There are three components that make attention work: queries, keys, and values. We compare the Query word (“it”) to each Key in the sentence using the dot product. The dot product measures how “similar” two vectors are. A higher dot product means more “attention” or relevance. This is reflected in the QKᵀ term of the attention calculation. A scaling factor, based on the dimension of the keys, is applied to keep the dot products from becoming too large.
A softmax then turns these scaled scores into weights, which we use to weigh the Value vectors, the vectors that hold the actual information associated with each key. Words that are more “relevant” get more weight in understanding our Query word. This weighted sum of the Value vectors becomes the Attention Output: a context-aware representation of our word “it”.
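To make this concrete, here’s a minimal sketch of scaled dot-product attention in PyTorch. The function name, tensor shapes, and toy inputs are illustrative choices, not code from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    # Compare every query to every key, and scale by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    # Softmax turns the scores into weights that sum to 1 per query
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors: the context-aware output
    return weights @ V

# Toy self-attention over 10 tokens with 64-dimensional vectors
x = torch.randn(10, 64)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([10, 64])
```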
Positional Encodings
Before we move on, there’s one more crucial piece to the puzzle: positional encodings. So far, we haven’t considered the order of words in the sentence. Without this information, “Cross the street” and “Street the cross” would be treated identically because attention, by itself, is order-agnostic.
To address this, we add positional encodings. These are special vectors that tell the attention mechanism the position of each word in the sequence. In Attention Is All You Need, sinusoidal positional encodings were used. They create a unique “position signature” for each word with sine and cosine functions of different frequencies.
These positional encodings are added to the input word representations before we calculate Queries, Keys, and Values. Altogether, the attention mechanism is now position-aware.
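Here’s a short sketch of how these sinusoidal encodings can be computed, assuming PyTorch; the sequence length and model dimension are toy values.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    position = torch.arange(seq_len).unsqueeze(1)                      # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                                  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# Added to the token embeddings before computing Queries, Keys, and Values
embeddings = torch.randn(10, 64)
x = embeddings + sinusoidal_positional_encoding(10, 64)
```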
While much of the original Transformer architecture has stood the test of time, a new approach to positional encodings is now commonly used. Rotary Position Embedding, or RoPE, was introduced in 2021 in the RoFormer paper. Rather than adding a position term to the embeddings, RoPE rotates the query and key vectors by an angle that depends on each token’s position. Because these rotations cancel out in the dot product, the attention score ends up depending on the relative distance between words.
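Below is a rough sketch of the idea, following the rotate-pairs-of-dimensions formulation from RoFormer; the helper name and shapes are my own, not code from the paper.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, d) with d even; positions: (seq_len,)
    d = x.size(-1)
    freqs = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    angles = positions.unsqueeze(1).float() * freqs              # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) pair by its position-dependent angle
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

q = apply_rope(torch.randn(10, 64), torch.arange(10))
k = apply_rope(torch.randn(10, 64), torch.arange(10))
# Dot products between q and k now depend on relative position.
```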
For a more detailed understanding of positional encoding techniques, I recommend the Designing Positional Encoding blog post.
Multi-Head Attention
So far, we’ve described the attention mechanism using one set of Query, Key, and Value projections. To capture more nuanced relationships, we can use Multi-Head Attention, or MHA. MHA uses multiple sets of QKV projections: “Head 1”, “Head 2”, “Head 3”, and so on. Each head learns to focus on different aspects of the relationships between words. For example, one head might focus on grammatical relationships and another on semantic relationships like synonymy or antonymy. Each head calculates its own attention output, and MHA concatenates these outputs and projects the resulting vector to obtain the final output.
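Here’s a minimal Multi-Head Attention sketch in PyTorch, without masking or dropout; the layer sizes are illustrative and not taken from any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        # Project and split into heads: (batch, n_heads, seq, d_head)
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        # Each head runs scaled dot-product attention independently
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5
        out = F.softmax(scores, dim=-1) @ v
        # Concatenate the heads and apply the final output projection
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out)

y = MultiHeadAttention()(torch.randn(2, 10, 512))
```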
During autoregressive generation, the Key and Value tensors for earlier tokens in a sequence don’t change, so they can be cached to avoid recomputing them for every new token. This key-value (KV) cache can become a memory bottleneck that slows down inference, especially for long sequences.
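To get a feel for the scale, here’s a back-of-the-envelope estimate assuming fp16 storage; the layer count, head count, and sequence length are illustrative numbers, not from any specific model.

```python
# Rough KV cache estimate for one sequence, assuming fp16 (2 bytes per value)
n_layers, n_heads, d_head = 32, 32, 128
seq_len, bytes_per_value = 32_768, 2

# Factor of 2 for storing both keys and values
kv_cache_bytes = 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value
print(f"{kv_cache_bytes / 1e9:.1f} GB per sequence")  # ~17.2 GB
```

The techniques below shrink this footprint by reducing the number of Key and Value heads (MQA, GQA) or by shrinking what gets cached per token (MLA).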
Multi-Query and Grouped-Query Attention
Fortunately, new techniques have emerged to address this issue, by reducing the number of keys and values.
Let’s look at Multi-Query Attention, or MQA, first. Instead of each Query head having its own Key and Value set, all Query heads share a single Key and Value set. Because the KV cache size scales with the number of Key and Value heads, MQA can shrink the cache by roughly a factor of the head count.
However, there’s a trade-off. With only a single shared Key and Value set, model quality can suffer. Grouped-Query Attention, or GQA, is a middle ground. Instead of the one shared Key and Value set of MQA, it uses a small number of shared Key and Value sets, called “groups”, each serving several Query heads.
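The sketch below shows the shared-head idea; setting n_kv_heads to 1 recovers MQA, and setting it equal to n_heads recovers standard MHA. The sizes are illustrative.

```python
import torch
import torch.nn.functional as F

n_heads, n_kv_heads, d_head, seq = 8, 2, 64, 10
q = torch.randn(n_heads, seq, d_head)
k = torch.randn(n_kv_heads, seq, d_head)   # only n_kv_heads key heads are cached
v = torch.randn(n_kv_heads, seq, d_head)   # only n_kv_heads value heads are cached

# Each group of n_heads // n_kv_heads query heads reuses the same K/V head
group = n_heads // n_kv_heads
k_rep = k.repeat_interleave(group, dim=0)  # (n_heads, seq, d_head)
v_rep = v.repeat_interleave(group, dim=0)

scores = q @ k_rep.transpose(-2, -1) / d_head**0.5
out = F.softmax(scores, dim=-1) @ v_rep    # (n_heads, seq, d_head)
```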
Introducing MLA
Ideally, we want to shrink the KV cache without sacrificing performance. And that’s where Multi-Head Latent Attention, or MLA, comes in.
MLA tackles this KV cache problem with compression. MLA compresses, or down-projects, the keys and values into a much smaller low-rank latent representation, which is then up-projected during the attention calculation. The compressed latent in a lower-dimensional space is derived with a down-projection matrix Wᴰᴷⱽ, and the keys and values can be recovered with the up-projection matrices Wᵁᴷ and Wᵁⱽ, respectively. Only this small latent needs to be stored in the KV cache.
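Here’s a simplified sketch of the compression step, with the positional terms omitted; the dimensions are illustrative, and the variable names loosely follow the Wᴰᴷⱽ, Wᵁᴷ, and Wᵁⱽ notation above rather than any reference implementation.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project values

h = torch.randn(10, d_model)   # hidden states for 10 tokens
c_kv = W_dkv(h)                # compressed latent: only this (10 x 64) tensor is cached
k = W_uk(c_kv).view(10, n_heads, d_head)   # per-head keys, reconstructed on the fly
v = W_uv(c_kv).view(10, n_heads, d_head)   # per-head values, reconstructed on the fly
```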
MLA requires a modified approach to positional information called decoupled RoPE. Standard RoPE rotates the query and key vectors directly with position information, but applying it to MLA’s compressed representation would interfere with the up-projection step and hinder inference efficiency. Instead, with decoupled RoPE, the relative position information is carried by separate, small query and key vectors, which are concatenated with the content-based queries and keys produced from the compressed latent. This allows positional information to be applied efficiently without interfering with the compression/decompression process.
In the end, Multi-Head Latent Attention provides faster inference and a smaller memory footprint. According to the DeepSeek-V2 paper, this approach maintains or even improves performance compared to standard Multi-Head Attention. Compressing the keys and values to a low-rank representation apparently doesn’t lose too much information, and may even aid generalization.
To try out MLA in action, you can deploy DeepSeek-R1 or DeepSeek-V3 671B on Google Kubernetes Engine (GKE) using graphics processing units (GPUs) across multiple nodes with this guide. You can also deploy DeepSeek from the Vertex Model Garden. Feel free to connect on LinkedIn, X, and Bluesky to continue the discussion!