Jimin Lee

Matryoshka Magic: How MatFormer Delivers Multi-Size Models from a Single Training

MatFormer: The Russian Doll Approach to AI Models

Today, we're diving into a fascinating technique Google recently mentioned they used when announcing the Gemma 3n models: MatFormer (Matryoshka Nested Transformers).

https://live.staticflickr.com/4039/4312413546_763699022f_b.jpg

Matryoshka is the classic Russian nesting doll, where a smaller doll fits perfectly inside a larger one. True to its name, MatFormer applies this concept to the Transformer architecture, nesting a smaller network within a larger one. Let’s break down why this technique emerged and how it works its magic.


MatFormer: The Big Doll Holds the Little Doll

The Google Gemma 3n release included two models: a 4-Billion parameter model (E4B) and a 2-Billion parameter model (E2B). The crucial point here is that the E2B model wasn't trained separately from E4B.

Think of E4B as the largest Matryoshka doll. Using the MatFormer technique, E2B—the smaller, nested doll—was trained concurrently as part of the process of training the larger E4B.

This groundbreaking idea is based on the DeepMind paper, "MatFormer: Nested Transformer for Elastic Inference." The core concept is creating a single, overarching Transformer network that contains smaller, fully functional Transformer networks nested within it. You can even nest the smallest network inside a slightly larger one, and so on.

A little abstract? Let's get into why this is such a big deal.


The Motivation Behind MatFormer

We constantly see major LLM releases, like LLaMA 3, announcing models in various sizes—say, 8B and 70B. Why do companies bother releasing multiple sizes simultaneously?

1. The Performance-vs-Resource Dilemma

It’s no secret: bigger models deliver better performance, but they demand significantly more resources (GPU memory/VRAM and processing time), leading to slower inference speeds. However, not every task requires the brute force of a 70B model. For simpler tasks, a smaller model is often far more efficient. This is why smaller siblings are released.

2. The Inefficiency of Double Training

The problem? Traditionally, you have to train the big model and the small model separately. That means paying the training compute twice and stretching out the time it takes to get to market.

3. The Performance Gap (The Little Guy's Struggle)

Despite being from the same family, smaller models generally underperform their larger counterparts. While some performance difference is inevitable, the smaller model is missing out. If the little sibling could just glance at what the bigger sibling is learning, their skills would improve. Training separately prevents any knowledge transfer.

4. The Limited Choice Conundrum

When a company releases only 8B and 70B models, users are forced to pick one. It's like going shopping and finding a shirt that's either a little too small (Size L) or a little too big (Size XL), and you have to choose one. Wouldn't it be great if you could get something perfectly tailored—a size between L and XL?

MatFormer was born from the ambitious idea of solving all these problems with a single training run.


Deconstructing the MatFormer Architecture

While the MatFormer idea can be applied to any part of the Transformer, we’ll focus on the Feed Forward Network (FFN), which often accounts for around 60% of a model's parameters.


Quick Technical Summary (Skip to Mix & Match if you're not into the technical details): The MatFormer core concept applies a nesting structure to the FFN’s weight matrix W. During training, the system randomly selects and updates only a subset of this nested matrix for each data batch. This allows for the simultaneous, efficient training of multiple high-performing models of different sizes. For deployment, you simply extract the necessary subset of W.

Transformer Refresher

A Transformer block consists mainly of the Attention mechanism and the FFN.

Zooming in on the FFN, it’s a two-step process:

  1. "Blow Up" (Dimension Expansion): The vector that passed through the Attention layer is expanded by a factor of 4x to 6x. (e.g., 1024 dimensions -> 4096 dimensions).
  2. "Squeeze Down" (Dimension Reduction): The expanded dimension is reduced back to the original size. (e.g., 4096 dimensions -> 1024 dimensions).

Both steps are accomplished by multiplying the input by weight matrices (W). This W is the FFN's key parameter and a major determinant of an LLM's size. The larger the expansion dimension, the better the model's expressiveness and performance, but the larger W (and the overall model) becomes.

In the diagram, the two "Linear" blocks represent the weight matrices that make up W, which is our focus.
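
If you prefer code to diagrams, here's a minimal PyTorch sketch of a standard FFN using the example dimensions above (1024 -> 4096 -> 1024). The class name, GELU activation, and sizes are illustrative choices of mine, not any particular model's implementation.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Plain Transformer FFN: expand ("blow up"), nonlinearity, reduce ("squeeze down")."""
    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)    # expansion: 1024 -> 4096
        self.act = nn.GELU()
        self.w_out = nn.Linear(d_ff, d_model)   # reduction: 4096 -> 1024

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.act(self.w_in(x)))

x = torch.randn(2, 16, 1024)        # (batch, seq_len, d_model)
print(FeedForward()(x).shape)       # torch.Size([2, 16, 1024])
```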

The Nested FFN Structure of MatFormer

MatFormer creates a nested structure for the FFN’s expansion/reduction weight matrices, W.

In the FFN diagram above, notice the nested boxes between the diamond-shaped Linear layers:

  • The outermost box (yellow) contains the blue one.
  • The blue box contains the orange one.
  • The orange box contains the red one.

These boxes represent the dimension size to which the FFN's input vector is expanded:

  • The Largest Model: Uses the entire W matrix, expanding the input to the size of the yellow box (the largest dimension).
  • The Smallest Model: Uses only a subset of W, expanding the input to the size of the red box (the smallest dimension).

The genius is this: instead of creating a separate W for each size, you create one giant W and, when a smaller dimension is needed, you simply use a slice of the full W. It’s nesting at the matrix level!

Crucially, the weight matrix for the smallest size (red) is always located in the top-left corner of the largest W. This ensures that no matter which size you select, the information within the red-box W is always shared and updated during training. This forces the red section to become the most important, compact, and information-dense core of the entire model.
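
Here's a rough PyTorch sketch of what "using a slice of the full W" could look like in practice. The MatFFN class, the four slice sizes, and the GELU activation are my own illustrative choices rather than the paper's reference code; the point is only that every sub-model reuses the leading rows and columns of the same weight matrices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatFFN(nn.Module):
    """Nested FFN sketch: one full-size W; smaller models use only its leading slice."""
    def __init__(self, d_model: int = 1024, d_ff_full: int = 4096):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff_full)   # full ("yellow") expansion matrix
        self.w_out = nn.Linear(d_ff_full, d_model)  # full ("yellow") reduction matrix

    def forward(self, x: torch.Tensor, d_ff_sub: int) -> torch.Tensor:
        # Use only the first d_ff_sub hidden units: the "top-left" slice of W.
        h = F.linear(x, self.w_in.weight[:d_ff_sub], self.w_in.bias[:d_ff_sub])
        h = F.gelu(h)
        return F.linear(h, self.w_out.weight[:, :d_ff_sub], self.w_out.bias)

ffn = MatFFN()
x = torch.randn(2, 16, 1024)
for d_ff_sub in (512, 1024, 2048, 4096):     # red, orange, blue, yellow
    print(d_ff_sub, ffn(x, d_ff_sub).shape)  # output shape stays (2, 16, 1024)
```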


Training the MatFormer

How do you train this huge, nested W efficiently? Training all sizes (red, orange, blue, yellow) simultaneously would be computationally prohibitive.

MatFormer proposes a clever solution: for each training data batch, you randomly select only one of the nested sizes and update only that corresponding part of the W matrix.

  1. Random Selection: For the first data batch, the system might randomly select the "blue box size."
  2. Partial Update: Only the parameters in the blue subset of W (which includes the red, orange, and blue slices) are updated.
  3. Iteration: For the next batch, it might randomly select the "red box size," and only the red subset of W is updated.

The MatFormer paper adjusted the selection probability to ensure that all sizes get trained fairly.
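
In code, the training loop might look something like the sketch below, continuing the hypothetical MatFFN from above and using a dummy loss instead of a real language-modeling objective. Because the forward pass only touches the selected slice, the gradients outside that slice are zero, so only the chosen sub-matrix is effectively updated.

```python
import random
import torch

granularities = [512, 1024, 2048, 4096]   # hypothetical red, orange, blue, yellow sizes

ffn = MatFFN()                             # nested FFN from the sketch above
opt = torch.optim.SGD(ffn.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(8, 16, 1024)           # stand-in for a real training batch
    d_ff_sub = random.choice(granularities)  # 1. randomly pick one nested size

    out = ffn(x, d_ff_sub)                 # forward pass touches only the chosen slice
    loss = out.pow(2).mean()               # dummy loss, for illustration only

    opt.zero_grad()
    loss.backward()                        # 2. gradients are zero outside the chosen slice,
    opt.step()                             #    so only that sub-matrix is effectively updated
```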

The Two Magic Outcomes

  1. Compute Savings: You don't train all sizes at once. You only train one slice per step, dramatically reducing the overall computational load.
  2. Core Knowledge Consolidation: The smallest W (red) is always included in the update, regardless of which larger slice is selected. Whether you're training the blue model or the yellow model, the red slice is part of that process and gets updated. This means the smallest, core part of the model is updated the most frequently, forcing it to absorb the most crucial information efficiently. The little guy gets to peek at the big guy's homework!

Deployment and Utilization: The Mix & Match Strategy

Once training is complete, it’s time for deployment.

  • Full Power Model: Uses the entire W matrix.
  • Light Weight Model: The smallest W is simply cut out—just the red box slice—and deployed.
  • Mid-Sized Model: The blue slice (red + orange + blue) is cut out and deployed.

You've achieved multiple performance points from a single set of weights.
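
As a sketch, extraction can be as simple as copying the leading slice out of the trained nested FFN into a regular, standalone module (again building on the hypothetical MatFFN above; the file name is just an example):

```python
import torch
import torch.nn as nn

def extract_ffn(ffn: MatFFN, d_ff_sub: int) -> nn.Sequential:
    """Cut the leading d_ff_sub slice out of the trained nested FFN
    and return a standalone FFN of that size (illustrative sketch)."""
    d_model = ffn.w_in.in_features
    small_in = nn.Linear(d_model, d_ff_sub)
    small_out = nn.Linear(d_ff_sub, d_model)
    with torch.no_grad():
        small_in.weight.copy_(ffn.w_in.weight[:d_ff_sub])
        small_in.bias.copy_(ffn.w_in.bias[:d_ff_sub])
        small_out.weight.copy_(ffn.w_out.weight[:, :d_ff_sub])
        small_out.bias.copy_(ffn.w_out.bias)
    return nn.Sequential(small_in, nn.GELU(), small_out)

# "Red box" model: only the smallest slice ever needs to be stored or loaded.
red_ffn = extract_ffn(ffn, 512)
torch.save(red_ffn.state_dict(), "red_ffn.pt")
```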

The Mix & Match Tailored Fit

A real Transformer model is built by stacking multiple layers, each with its own FFN W matrix. This is where MatFormer truly shines.

Just as you might wear a loose-fitting XL for pants but a tight-fitting L for a shirt, you can select a different W size for every single layer of the model. This is the Mix & Match capability.

For example, Layer 1 could use the red W slice, Layer 2 could use the blue W slice, and Layer 3 could use the yellow W slice.
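
A toy version of Mix & Match, continuing the hypothetical MatFFN sketch: pick a slice size per layer, run them as a stack, and tally roughly how many FFN weights the resulting custom model actually needs (which is also how you'd tune the mix to fit a GPU budget, as discussed next).

```python
import torch

layer_sizes = [512, 2048, 4096]            # e.g. red, blue, yellow for layers 1-3
layers = [MatFFN() for _ in layer_sizes]   # stand-ins; real blocks also have attention, norms, etc.

def mixed_forward(x: torch.Tensor) -> torch.Tensor:
    for ffn, d_ff_sub in zip(layers, layer_sizes):
        x = x + ffn(x, d_ff_sub)           # residual connection around each sliced FFN
    return x

print(mixed_forward(torch.randn(2, 16, 1024)).shape)

# Rough FFN weight count for this mix: 2 * d_model * d_ff_sub per layer.
d_model = 1024
print(sum(2 * d_model * d for d in layer_sizes))  # tweak layer_sizes to hit a parameter budget
```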

Why does this matter?

Maximizing GPU Resources!

Say your GPU can barely handle a 30B parameter model. Traditionally, you'd be stuck choosing between an 8B or a 70B model, meaning you’d have to settle for the 8B model and leave a lot of computational headroom unused. With MatFormer, you can Mix & Match the layer-by-layer W sizes to create a custom model that totals exactly 30B parameters, fully utilizing your GPU's capacity without waste.

Flexible Environment Scaling

MatFormer enables deployment flexibility based on the environment and performance needs:

  • Powerful Cloud Environment: Uses the largest (yellow) W slice for every layer.
  • Standard PC Environment: Mixes the layer sizes, maybe yellow, blue, and orange slices, to balance speed and power.
  • Smartphone or Edge Devices: Primarily uses the orange or red W slices for minimal footprint and maximum speed.

How's the Performance?

Performance is the ultimate test. The results presented in the MatFormer paper are highly encouraging.

Same-Size Model Comparison

  • Red Triangles (Conventional Training): Models trained the traditional, separate way.
  • Blue Triangles (MatFormer Training): Models trained via MatFormer and then extracted at a fixed size.

The MatFormer-trained models consistently show better performance than their conventionally trained counterparts. This is attributed to the smaller W subsets being updated frequently and concurrently with the larger models, resulting in more robust and efficient representations across the board.

Mix & Match Performance

  • Blue Stars (Mix & Match): Custom models created by mixing layer sizes.

The Mix & Match models land between the fixed-size MatFormer models (blue triangles) closest to them in effective parameter count. This confirms that Mix & Match performance scales predictably with size, validating the tailored model strategy.


MatFormer vs. MoE (Mixture of Experts)

Reading about using only a subset of a network, you might be reminded of Mixture of Experts (MoE). MoE activates only a select few "Experts" (FFNs) per input at inference time, which resembles MatFormer's partial computation.

However, there is a critical distinction:

| Feature | MatFormer | MoE |
| --- | --- | --- |
| Model structure | Nested/overlapping FFN matrices | Parallel, independent FFNs (Experts) |
| Memory requirement (inference) | Only the selected size's W is loaded into memory | All expert parameters must be loaded into memory |
| Parameter efficiency | Single training run yields multiple true-size models | High parameter count, but low computational cost |
| Primary goal | Deriving multiple, efficient model sizes from one weight set | Maximizing performance via sparse computation at inference |

MatFormer's models are physically nested, meaning the smallest model literally requires only the smallest subset of parameters to be loaded into memory for deployment. In contrast, while an MoE only computes on a few experts, it still needs the entire network's parameters loaded into memory, leading to much higher memory demands for the smallest deployable model.
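
To make the memory argument concrete, here's a back-of-the-envelope comparison of per-layer FFN weight counts using made-up sizes and expert counts (not real model configs):

```python
d_model, d_ff = 1024, 4096

# MatFormer: deploying the smallest (red) model loads only the red slice per layer.
matformer_red = 2 * d_model * 512

# Hypothetical MoE layer: even if only 2 of 8 experts compute per token,
# all 8 experts still sit in memory.
num_experts, active_experts = 8, 2
moe_loaded   = num_experts    * 2 * d_model * d_ff
moe_computed = active_experts * 2 * d_model * d_ff

print(f"MatFormer red slice loaded : {matformer_red:,}")
print(f"MoE weights loaded         : {moe_loaded:,}")
print(f"MoE weights computed/token : {moe_computed:,}")
```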


Conclusion: The Custom-Fit Strategy for the LLM Era

MatFormer is an innovative idea that addresses the inherent inefficiencies in the training and deployment of Large Language Models.

By using the Matryoshka doll concept—nesting smaller weight matrices (W) within a single, larger one—and selectively updating slices during a single training run, it executes a One Model, Multi Size strategy without sacrificing performance.

Crucially, the Mix & Match feature allows engineers like us to build a custom-fit model for specific hardware, maximizing the use of a GPU from a powerful cloud server all the way down to an edge device. MatFormer offers unprecedented flexibility and efficiency in the broad deployment of LLMs.
