
Karl Weinmeister for Google Cloud


Attention Evolved: How Multi-Head Latent Attention Works

Compressing keys and values to reduce the cache size is MLA’s key innovation

Attention is the fundamental mechanism that enables LLMs to capture long-range dependencies and context. It was introduced as part of the Transformer architecture in 2017 in Attention Is All You Need. As models have grown in size, new variants have emerged to address the computational cost of standard Multi-Head Attention.

Multi-Head Latent Attention (MLA), introduced in the DeepSeek-V2 paper, represents a novel approach to efficient attention. MLA introduces a low-rank compression technique that reduces the memory footprint without sacrificing model performance. MLA is used by the DeepSeek-V3 and DeepSeek-R1 models available in the Vertex Model Garden and deployable to Google Kubernetes Engine.

In this blog post, we’ll start with the standard attention mechanism and build up to what makes Multi-Head Latent Attention special.

Attention

To illustrate how single-head attention works, consider this example sentence:

The animal didn’t cross the street because it was too tired.

How can we understand what “it” means? We need to look at surrounding words, or more precisely, tokens. Attention allows us to analyze this context mathematically.

Self-attention for pronoun “it” (source)

There are three components that make attention work: queries, keys, and values. We compare the Query word (“it”) to each Key in the sentence using the dot product. The dot product measures how “similar” two vectors are. A higher dot product means more “attention” or relevance. This is reflected in the QKᵀ term of the attention calculation. A scaling factor, based on the dimension of the keys, is applied to keep the dot products from becoming too large.

We then use these scores to weight the Value vectors, which hold the actual information associated with each key. Words that are more “relevant” get more weight in understanding our Query word. The weighted sum of these Value vectors becomes the Attention Output: a context-aware representation of our word “it”.

Attention mechanism overview (source)
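
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single head. The shapes and random values are purely illustrative toy inputs, not taken from any real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional Q/K/V projections
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```

Each row of the output is a context-aware representation of the corresponding token, built from the value vectors of the tokens it attended to.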

Positional Encodings

Before we move on, there’s one more crucial piece to the puzzle: positional encodings. So far, we haven’t considered the order of words in the sentence. Without this information, “Cross the street” and “Street the cross” would be treated identically because attention, by itself, is order-agnostic.

To address this, we add positional encodings. These are special vectors that tell the attention mechanism the position of each word in the sequence. In Attention Is All You Need, sinusoidal positional encodings were used. They create a unique “position signature” for each word with sine and cosine functions of different frequencies.

Visualization of positional encodings (source)

These positional encodings are added to the input word representations before we calculate Queries, Keys, and Values. Altogether, the attention mechanism is now position-aware.
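
Here is a short sketch of how these sinusoidal encodings can be computed and added to token embeddings; the sequence length and model dimension are arbitrary toy values:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine/cosine position signatures as in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return pe

# The encodings are simply added to the token embeddings
embeddings = np.random.default_rng(0).normal(size=(10, 16))
position_aware = embeddings + sinusoidal_positional_encoding(10, 16)
```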

While much of the original Transformer architecture has stood the test of time, a new approach to positional encodings is now commonly used. Rotary Position Embedding, or RoPE, was introduced in 2021 in the RoFormer paper. Rather than adding a position term, RoPE rotates the query and key vectors by angles that depend on their positions. This rotation allows the model to understand the relationship between words based on their distance from each other.

Illustration of position embedded into a vector through rotation
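
The sketch below illustrates the rotation idea, assuming the common half-split pairing of dimensions (the exact pairing and frequency schedule vary by implementation). The key property it demonstrates is that the dot product between two rotated copies of a vector depends only on how far apart their positions are:

```python
import numpy as np

def rope_rotate(x, position, base=10000):
    """Rotate pairs of dimensions of a query/key vector by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied to each (x1, x2) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.ones(8)
# The same vector at positions (3, 7) and (10, 14): both pairs are 4 apart,
# so the dot products of the rotated copies are identical.
print(rope_rotate(q, 3) @ rope_rotate(q, 7))
print(rope_rotate(q, 10) @ rope_rotate(q, 14))
```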

For a more detailed understanding of positional encoding techniques, I recommend the Designing Positional Encoding blog post.

Multi-Head Attention

So far, we’ve described the attention mechanism using one set of Query, Key, and Value projections. To capture more nuanced relationships, we can use Multi-Head Attention, or MHA. MHA uses multiple sets of QKV projections (“Head 1”, “Head 2”, “Head 3”, and so on). Each head learns to focus on different aspects of the relationships between words. For example, one head might focus on grammatical relationships and another on semantic relationships like synonymy or antonymy. Each head calculates its own attention output, and MHA concatenates these outputs and projects the resulting vector to obtain the final output.

Multi-Head attention calculation (source)
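
A compact sketch of this split, attend, concatenate, and project flow, with illustrative dimensions (4 heads over a 16-dimensional model and randomly initialized projection matrices):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split Q, K, V into heads, attend per head, concatenate, and project."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)             # (n_heads, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax per head
    heads = weights @ Vh                                   # per-head attention outputs
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                     # final output projection

d_model, n_heads, seq_len = 16, 4, 6
rng = np.random.default_rng(0)
x, Wq, Wk, Wv, Wo = (rng.normal(size=s) for s in
                     [(seq_len, d_model)] + [(d_model, d_model)] * 4)
out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)    # shape (6, 16)
```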

Because the Key and Value tensors for earlier tokens in a sequence do not change during generation, they can be cached to avoid unnecessary recomputation. This key-value (KV) cache can become a memory bottleneck that slows down inference, especially for longer texts.

Illustration of how the KV Cache grows with each token in the sequence
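
To get a feel for why this matters, here is a rough back-of-the-envelope estimate for a hypothetical model (32 layers, 32 heads of dimension 128, fp16 values). The exact numbers are illustrative; the point is that the cache grows linearly with sequence length:

```python
# Rough KV-cache size for standard multi-head attention:
# 2 (K and V) * layers * heads * head_dim * seq_len * bytes per value
def kv_cache_bytes(seq_len, n_layers, n_heads, d_head, bytes_per_value=2):  # 2 bytes = fp16
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 heads of size 128, fp16
for seq_len in (1_000, 10_000, 100_000):
    gb = kv_cache_bytes(seq_len, 32, 32, 128) / 1e9
    print(f"{seq_len:>7} tokens -> {gb:.1f} GB of KV cache")
```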

Multi-Query and Grouped-Query Attention

Fortunately, new techniques have emerged to address this issue by reducing the number of Key and Value sets that must be cached.

Let’s look at Multi-Query Attention, or MQA, first. Instead of each Query head having its own Key and Value set, all Query heads share a single Key and Value set. Because the KV cache size scales with the number of Key and Value heads, MQA can significantly reduce the cache size.

However, there’s a trade-off: by sharing a single Key and Value set, model performance can suffer. Grouped-Query Attention, or GQA, is a middle ground. Instead of the single shared Key and Value set of MQA, it uses a small number of shared Key and Value sets, called “groups”.
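
The cache savings follow directly from the number of Key/Value heads. A quick sketch with illustrative dimensions (32 query heads of size 128, and a hypothetical 8 groups for GQA):

```python
# Per-token, per-layer KV-cache entries scale with the number of Key/Value heads:
#   MHA: one K/V set per query head
#   GQA: one K/V set per group
#   MQA: a single shared K/V set
def kv_entries_per_token(n_kv_heads, d_head):
    return 2 * n_kv_heads * d_head   # 2 = one entry for K, one for V

d_head = 128
print("MHA:", kv_entries_per_token(32, d_head))   # 8192 cached values per token
print("GQA:", kv_entries_per_token(8, d_head))    # 2048 (8 groups)
print("MQA:", kv_entries_per_token(1, d_head))    # 256
```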

Introducing MLA

Ideally, we want to shrink the KV cache without sacrificing performance. That’s where MLA, or Multi-Head Latent Attention, comes in.

KV Cache Comparison (source)

MLA tackles this KV cache problem with compression. MLA compresses, or down-projects, the keys and values into a smaller, low-rank latent matrix. This compressed matrix is then up-projected during the attention calculation. The compressed latent matrix in lower-dimensional space is derived with a down-projection matrix Wᴰᴷⱽ; the keys and values are recovered with the up-projection matrices Wᵁᴷ and Wᵁⱽ, respectively.

Calculation of compressed latent matrix, keys, and values
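
Here is a minimal sketch of that down-project, cache, and up-project flow. The dimensions and random weights are toy values chosen for illustration; the important part is that only the small latent vector would be stored in the KV cache:

```python
import numpy as np

d_model, d_latent, d_head, n_heads = 64, 16, 16, 4
rng = np.random.default_rng(0)

W_DKV = rng.normal(size=(d_model, d_latent))            # down-projection (compression)
W_UK  = rng.normal(size=(d_latent, n_heads * d_head))   # up-projection for keys
W_UV  = rng.normal(size=(d_latent, n_heads * d_head))   # up-projection for values

h = rng.normal(size=(1, d_model))    # hidden state for one new token

# Only the small latent vector is cached: d_latent values per token,
# instead of 2 * n_heads * d_head for standard multi-head attention.
c_kv = h @ W_DKV                     # (1, d_latent) compressed latent

# Keys and values are recovered (up-projected) when attention is computed
k = (c_kv @ W_UK).reshape(n_heads, d_head)
v = (c_kv @ W_UV).reshape(n_heads, d_head)
print(c_kv.shape, k.shape, v.shape)  # (1, 16) (4, 16) (4, 16)
```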

MLA requires a modified approach to handling token positions, called decoupled RoPE. Standard RoPE rotates the query and key vectors in place with position information. With MLA’s compression, applying that rotation to the compressed keys would entangle the up-projection with each token’s position, which would hinder inference efficiency. Instead, with decoupled RoPE, the relative position information is carried by separate, small query and key vectors that are concatenated with the content parts derived from the compressed representation. This allows positional information to be applied efficiently without interfering with the compression/decompression process.

Query (q) and Key (k) vectors are formed by concatenating their compressed (C) and positional (R) parts
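
The sketch below shows just the concatenation step, with made-up dimensions for the content and RoPE parts. The point is that each final query and key is simply [content ; positional], so the attention score cleanly splits into a content term plus a position term:

```python
import numpy as np

d_content, d_rope = 16, 8   # illustrative per-head content dims and RoPE dims
rng = np.random.default_rng(0)

q_C = rng.normal(size=d_content)   # content part of a query head (from the latent path)
q_R = rng.normal(size=d_rope)      # decoupled query part that carries RoPE
k_C = rng.normal(size=d_content)   # content part of a key, up-projected from the latent
k_R = rng.normal(size=d_rope)      # decoupled key part that carries RoPE

# Final query/key = [content ; positional]
q = np.concatenate([q_C, q_R])
k = np.concatenate([k_C, k_R])

# The dot product is the sum of a content score and a position score
assert np.isclose(q @ k, q_C @ k_C + q_R @ k_R)
```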

In the end, Multi-Head Latent Attention provides faster inference and a smaller memory footprint. According to the DeepSeek-V2 paper, this approach maintains or even improves performance compared to standard Multi-Head Attention. Compressing the keys and values to a low-rank representation evidently loses little useful information, and may even aid generalization.

To try out MLA in action, you can deploy DeepSeek-R1 or DeepSeek-V3 671B on Google Kubernetes Engine (GKE) using graphics processing units (GPUs) across multiple nodes with this guide. You can also deploy DeepSeek from the Vertex Model Garden. Feel free to connect on LinkedIn, X, and Bluesky to continue the discussion!

