My impressions after reading the first chapters of the book “Build a DeepSeek Model (From Scratch)” (an edition in progress on manning.com)
Image from Book Cover on manning.com
Build a DeepSeek Model (From Scratch)
The book’s core premise is to demystify and teach the implementation of the techniques that allowed the DeepSeek model to achieve performance comparable to leading proprietary systems, but at a fraction of the training cost. DeepSeek was selected as the subject because its release marked a pivotal moment in open-source AI, demonstrating that open models could rival the intelligence level of closed-source giants.
Having said that, the scope deliberately excludes reproducing DeepSeek’s proprietary training data, replicating its exact model weights, diving into the massive-scale distributed training infrastructure required for hundreds of billions of parameters, or addressing production concerns like serving models to millions of users or implementing safety filters. The focus remains on clarity, comprehension, and implementation from first principles.
In fact, the goal (as I understand it) is to give the reader both a theoretical understanding and the practical skills to build a scaled-down, functional model on consumer hardware, without needing a supercomputer.
⭕ Disclaimer: I’m not affiliated with Manning Publications or any of the authors, and I won’t see a dime if you buy and read this book. My enthusiasm is purely for the subject matter, not a hidden affiliate link. Read it or don’t — your LLM journey, and my financial stability, remain entirely independent!
The Book’s Structure (so far…)
The book is structured around a four-stage roadmap that systematically covers DeepSeek’s technical breakthroughs, each a targeted solution to a fundamental bottleneck in scaling large language models (LLMs): computational complexity, memory, and parameter count.
Stage 1 & 2: Architectural Innovations
These stages focus on replacing standard Transformer components with highly efficient alternatives:
- Key-Value (KV) Cache: A foundational building block for efficient inference, introduced first (a minimal sketch follows this list).
- Multi-Head Latent Attention (MLA): Replaces standard attention to tackle the speed and memory bottleneck associated with long input sequences.
- DeepSeek Mixture-of-Experts (MoE): Replaces the dense feed-forward network to tackle model capacity and scaling issues efficiently.
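To give a flavor of that first building block, here is my own toy sketch of the KV-cache idea (not code from the book): during autoregressive decoding, each token's keys and values are stored once, so every new token attends against the cache instead of reprocessing the whole prefix.

```python
# Toy single-head KV cache (my own illustration, not the book's code).
import torch
import torch.nn.functional as F

d = 16                                   # head dimension (arbitrary toy value)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """One decoding step for a single token embedding x_t of shape [d]."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)             # cache this token's key ...
    v_cache.append(x_t @ Wv)             # ... and value for future steps
    K = torch.stack(k_cache)             # [t, d]  all keys cached so far
    V = torch.stack(v_cache)             # [t, d]
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                      # attention output for the new token

for _ in range(5):                       # generate 5 toy tokens
    out = decode_step(torch.randn(d))
print(out.shape)  # torch.Size([16])
```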
Stage 3: Advanced Training Techniques
This stage explores methods used to optimize the training process and resource utilization:
- Multi-Token Prediction (MTP): A smarter training objective that accelerates both training and inference by predicting multiple future tokens simultaneously.
- FP8 Quantization: The use of an 8-bit floating-point format to improve computational efficiency by compressing model weights and activations.
- DualPipe Parallelism: An efficient scheduling strategy that overlaps different training tasks to minimize GPU idle time during large-scale training.
Stage 4: Post-Training Methods
This final stage details how the base model is refined and compressed for real-world use:
- Reinforcement Learning (RL): Applied using methods such as Pure RL and Final RL to instill advanced reasoning capabilities in the model, exemplified by the creation of DeepSeek-R1.
- Knowledge Distillation: A technique to compress the knowledge of a large “teacher” model (e.g., DeepSeek-V3 with 671 billion parameters) into smaller, highly performant “student” models (as small as 1.5 billion parameters) for practical deployment.
Scope and Prerequisites
The book is best treated as a hands-on tutorial and expects the reader to have a foundation in machine learning and deep learning concepts, comfort with Python, familiarity with PyTorch, and some exposure to the Transformer architecture.
But… as a Manning Early Access Program (MEAP) title, the book is released in iterations, and the official GitHub repository mirrors this structure. The repository is essential to the learning process, as it provides runnable code for the completed chapters. Readers can clone, run, and adapt the notebooks directly in their own development environments, turning theoretical knowledge into practical, hands-on experience.
# Chapter 4: Mixture of Experts in DeepSeek
Large Language Models have grown exponentially in size, with parameters reaching into the hundreds of billions. This growth presents significant computational challenges, particularly for inference. DeepSeek models use a **Mixture of Experts (MoE)** architecture to tackle this problem.
## What is Mixture of Experts?
MoE is a neural network architecture that employs multiple specialized sub-networks (experts) and a routing mechanism to direct different inputs to the appropriate experts. This approach allows models to scale to enormous parameter counts while keeping computational costs manageable.
In this chapter, we'll implement the DeepSeek MoE architecture, which includes:
1. **Shared Experts** - Always active for all tokens
2. **Routed Experts** - Selectively activated based on token content
3. **Load Balancing** - Techniques to ensure even expert utilization
The DeepSeek approach combines the efficiency of sparse activation with the reliability of always-on shared experts, creating a robust and computationally efficient architecture.
## Required Libraries
Let's begin by importing the necessary Python libraries:
- `math`: For mathematical operations like square root and pi
- `contextlib.nullcontext`: For optional context management
- `typing.Optional`: For type hints with optional parameters
- `torch`: PyTorch library for tensor operations and neural networks
- `torch.nn`: Neural network modules and functions
- `torch.nn.functional`: Functional interface for neural network operations
```python
import math
from contextlib import nullcontext
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
```
## Core MoE Components
In this notebook, we'll implement two critical components of the DeepSeek MoE architecture:
1. **ExpertFFN**: The individual expert networks
2. **DeepSeekMoE**: The complete MoE layer with routing
Let's examine how DeepSeek designs these components to achieve an efficient balance between parameter count and computational cost.
```python
def _gelu(x: torch.Tensor) -> torch.Tensor:
    # Slightly faster tanh-based GELU approximation
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) *
                                       (x + 0.044715 * torch.pow(x, 3))))
class ExpertFFN(nn.Module):
    """
    A 2-layer MLP expert. Hidden dim is usually smaller than a dense FFN
    (e.g., 0.25 × d_model in DeepSeek-V3).
    """
    def __init__(self, d_model: int, hidden: int, dropout: float = 0.0):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden, bias=False)
        self.fc2 = nn.Linear(hidden, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.dropout(_gelu(self.fc1(x))))
class DeepSeekMoE(nn.Module):
    """
    DeepSeek-V3 style Mixture-of-Experts (MoE) layer.

    This MoE layer incorporates both routed experts (selected by a router)
    and shared experts (applied to all inputs). It is designed based on
    the architecture described in the DeepSeek-V3 paper.

    Args:
        d_model (int): The dimension of the input and output features.
        n_routed_exp (int): The number of routed experts.
        n_shared_exp (int, optional): The number of shared experts. Defaults to 1.
        top_k (int, optional): The number of routed experts to select for each token.
            Defaults to 8.
        routed_hidden (int, optional): The hidden dimension for routed experts.
            Defaults to 2048.
        shared_hidden (Optional[int], optional): The hidden dimension for shared experts.
            If None, uses routed_hidden. Defaults to None.
        bias_lr (float, optional): Learning rate for the router bias (updated online).
            Defaults to 0.01.
        fp16_router (bool, optional): Whether to use FP16 precision for router calculations.
            Defaults to False.
    """
    def __init__(
        self,
        d_model: int,
        n_routed_exp: int,
        n_shared_exp: int = 1,
        top_k: int = 8,
        routed_hidden: int = 2_048,
        shared_hidden: Optional[int] = None,
        bias_lr: float = 0.01,
        fp16_router: bool = False,
    ):
        super().__init__()
        # The number of selected experts (top_k) must not exceed the total number of routed experts.
        assert top_k <= n_routed_exp, "k must be ≤ number of routed experts"

        self.d_model = d_model
        self.n_routed = n_routed_exp
        self.n_shared = n_shared_exp
        self.top_k = top_k
        self.bias_lr = bias_lr
        self.fp16_router = fp16_router

        # Module list for the routed experts.
        self.routed = nn.ModuleList(
            [ExpertFFN(d_model, routed_hidden) for _ in range(n_routed_exp)]
        )
        # Determine the hidden dimension for shared experts. Use routed_hidden if shared_hidden is not provided.
        hidden_shared = shared_hidden or routed_hidden
        # Module list for the shared experts.
        self.shared = nn.ModuleList(
            [ExpertFFN(d_model, hidden_shared) for _ in range(n_shared_exp)]
        )

        # Register a parameter for the centroids used by the router.
        # Centroids represent the "preference" of each expert for different input features.
        self.register_parameter("centroids", nn.Parameter(torch.empty(n_routed_exp, d_model)))
        # Initialize centroids with a normal distribution.
        nn.init.normal_(self.centroids, std=d_model ** -0.5)
        # Register a buffer for the router bias. This bias is updated online
        # without using standard gradient descent, hence it's not a parameter.
        self.register_buffer("bias", torch.zeros(n_routed_exp))
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the DeepSeekMoE layer.

        Args:
            x (torch.Tensor): The input tensor of shape [B, S, D], where B is
                batch size, S is sequence length, and D is d_model.

        Returns:
            torch.Tensor: The output tensor of shape [B, S, D], which is the
                sum of the input, shared expert outputs, and routed
                expert outputs.
        """
        # Get dimensions of the input tensor.
        B, S, D = x.shape
        # Reshape the input to [N, D], where N = B * S (number of tokens).
        x_flat = x.reshape(-1, D)  # [N, D] with N=B*S

        # 1) Shared path: Process the input through all shared experts and sum their outputs.
        shared_out = torch.zeros_like(x)
        for exp in self.shared:
            shared_out += exp(x)
        # (Optional) Scale the shared expert output by the number of shared experts.
        # This can help in balancing the contribution of shared vs. routed experts.
        # shared_out = shared_out / max(1, self.n_shared)

        # 2) Router logits: Calculate the affinity of each token to each routed expert.
        # Use autocasting to FP16 if fp16_router is True and the device is CUDA.
        use_autocast = self.fp16_router and x.is_cuda
        device_type = "cuda" if x.is_cuda else x.device.type
        with torch.autocast(device_type=device_type, enabled=use_autocast):
            # Calculate logits by taking the dot product of the flattened input with the expert centroids.
            logits = F.linear(x_flat, self.centroids)  # [N, E]
            # Add the router bias to the logits. Ensure bias matches the logits' dtype.
            logits = logits + self.bias.to(logits.dtype)
        # Select the top_k experts with the highest logits for each token.
        topk_logits, topk_idx = torch.topk(logits, self.top_k, dim=-1)  # [N, k]
        # Apply softmax to the top_k logits to get gating weights.
        # Ensure the gate weights have the same dtype as the input for subsequent calculations.
        gate = F.softmax(topk_logits, dim=-1, dtype=x.dtype)  # [N, k]

        # 3) Dispatch per expert: Route tokens to their selected experts and combine outputs.
        routed_out = torch.zeros_like(x_flat)  # [N, D]
        # Iterate through each routed expert.
        for i in range(self.n_routed):
            # Create a mask to identify which tokens selected the current expert (expert i).
            mask = (topk_idx == i)
            # Find the indices of the rows (tokens) and columns (which of the top-k) where expert i was selected.
            row_idx, which_k = mask.nonzero(as_tuple=True)  # 1-D each
            # If no tokens selected this expert, skip.
            if row_idx.numel() == 0:
                continue
            # Select the input tokens that are routed to expert i.
            exp_in = x_flat.index_select(0, row_idx)  # [Ti, D] where Ti is the number of tokens routed to expert i
            # Pass the selected tokens through the expert's FFN.
            out = self.routed[i](exp_in)  # [Ti, D]
            # Get the gating weights for the tokens routed to expert i.
            w = gate[row_idx, which_k].unsqueeze(-1)  # [Ti, 1]
            # Scale the expert output by the gating weights and add it to the routed_out tensor
            # at the original token positions using index_add_.
            routed_out.index_add_(0, row_idx, out * w)

        # Reshape the routed output back to the original [B, S, D] shape.
        routed_out = routed_out.view(B, S, D)
        # The final output is the sum of the original input, shared expert outputs, and routed expert outputs.
        return x + shared_out + routed_out
    @torch.no_grad()
    def update_bias(self, x: torch.Tensor):
        """
        Updates the router bias based on expert load.

        This method is typically called once per optimizer step using the
        same batch of tokens that was passed through the forward method.
        It uses the current router logits (including the current bias) to
        estimate the load on each expert and adjusts the bias to encourage
        a more balanced distribution of tokens across experts.

        Args:
            x (torch.Tensor): The input tensor of shape [B, S, D], identical
                to the input used in the corresponding forward pass.
        """
        # Calculate the total number of tokens.
        N = x.shape[0] * x.shape[1]
        # Calculate the router logits (affinity scores) for each token and expert, including the current bias.
        logits = F.linear(x.reshape(-1, self.d_model), self.centroids) + self.bias
        # Determine the top_k experts selected for each token based on the current logits.
        _, idx = torch.topk(logits, self.top_k, dim=-1)
        # Count how many times each expert was selected as one of the top_k.
        counts = torch.bincount(idx.flatten(), minlength=self.n_routed).float()
        # Calculate the average number of times an expert should ideally be selected.
        avg = counts.sum() / max(1, self.n_routed)
        # Calculate the "violation" for each expert. A positive violation means
        # the expert is under-loaded compared to the average, and its bias
        # should be increased to make it more likely to be selected in the future.
        # A negative violation means it's over-loaded, and its bias should be decreased.
        # Add a small epsilon (1e-6) to the denominator to avoid division by zero.
        violation = (avg - counts) / (avg + 1e-6)
        # Update the bias using a smooth, bounded update based on the violation.
        # torch.tanh() squashes the violation into the range [-1, 1], preventing
        # excessively large bias updates. The bias_lr controls the step size.
        self.bias.add_(self.bias_lr * torch.tanh(violation))
```
## Expert Network Implementation
### GELU Activation Function
We begin with a custom implementation of the GELU (Gaussian Error Linear Unit) activation function. This tanh-based approximation is slightly faster than the exact, erf-based implementation:
$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right)\right)$
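As a quick sanity check (my own addition, not from the book), the hand-rolled `_gelu` should match PyTorch's built-in tanh-approximate GELU, `F.gelu(x, approximate="tanh")`, up to floating-point tolerance:

```python
# Sanity check: the custom _gelu matches PyTorch's built-in tanh approximation.
x = torch.linspace(-4.0, 4.0, steps=1_000)
assert torch.allclose(_gelu(x), F.gelu(x, approximate="tanh"), atol=1e-6)
```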
### ExpertFFN Class
The `ExpertFFN` class implements a simple 2-layer MLP (Multi-Layer Perceptron) that serves as an individual expert in our MoE architecture:
1. **Architecture**: Input → Linear → GELU → Dropout → Linear → Output
2. **Parameters**:
- `d_model`: Input/output dimension (model dimension)
- `hidden`: Hidden layer dimension (typically smaller than in standard FFNs)
- `dropout`: Dropout rate for regularization
In DeepSeek-V3, the hidden dimension for experts is approximately 0.25 × d_model, making each expert significantly smaller than a traditional FFN. This is part of what makes MoE efficient - the experts are individually small.
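To make that concrete, here is a rough parameter count using toy dimensions (d_model = 1024, expert hidden = 256, i.e. 0.25 × d_model) rather than DeepSeek-V3's real sizes; the same 2-layer MLP with the conventional 4 × d_model hidden width stands in for a dense FFN:

```python
# Rough size comparison (toy numbers, not DeepSeek-V3's actual dimensions).
d_model = 1024
expert = ExpertFFN(d_model, hidden=d_model // 4)        # 0.25 x d_model hidden
dense = ExpertFFN(d_model, hidden=4 * d_model)          # stand-in for a dense FFN

n_expert = sum(p.numel() for p in expert.parameters())  # 2 * 1024 * 256  = 524,288
n_dense = sum(p.numel() for p in dense.parameters())    # 2 * 1024 * 4096 = 8,388,608
print(f"expert: {n_expert:,} params, dense FFN: {n_dense:,} params")
```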
### DeepSeekMoE Class
The `DeepSeekMoE` class implements DeepSeek's innovative approach to Mixture of Experts. It combines several key design choices (a quick sizing sketch follows this list):
1. **Hybrid Expert Structure**:
- **Shared Experts**: Always active for all tokens
- **Routed Experts**: Selectively activated based on token content
2. **Top-K Routing**:
- Each token activates K out of E total experts
- Weighted combination of expert outputs
3. **Router Design**:
- Uses simple dot-product similarity with learned centroids
- Optional mixed-precision routing for efficiency
- Dynamic bias adjustment for load balancing
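To see what this buys us, the sketch below (my own addition, using the toy configuration tested later in this chapter) instantiates the layer and compares the total parameter count against the parameters actually active for a single token (all shared experts plus the top_k routed experts):

```python
# Total vs. per-token active parameters (toy configuration, not DeepSeek-V3 scale).
moe = DeepSeekMoE(d_model=1024, n_routed_exp=16, n_shared_exp=2, top_k=8)

total = sum(p.numel() for p in moe.parameters())
per_routed = sum(p.numel() for p in moe.routed[0].parameters())
per_shared = sum(p.numel() for p in moe.shared[0].parameters())
router = moe.centroids.numel()

# Each token uses: every shared expert + top_k routed experts + the router.
active = moe.n_shared * per_shared + moe.top_k * per_routed + router
print(f"total: {total:,}  active per token: {active:,}  ({active / total:.0%})")
```

With only 16 routed experts and top_k = 8 the saving looks modest; at DeepSeek-V3 scale, with far more routed experts but still only a handful active per token, the active fraction becomes a small slice of the total parameter count.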
## Understanding the Forward Pass
The forward method in `DeepSeekMoE` contains the core logic of how tokens are processed through both shared and routed experts. Let's break down the key steps:
### 1. Shared Path
All tokens are processed by all shared experts, and their outputs are summed. This ensures a base level of processing for every token regardless of routing decisions.
### 2. Router Calculation
The router uses a simple but effective approach (reproduced on toy tensors in the sketch after this list):
- Computes dot-product similarity between tokens and expert centroids
- Adds the dynamic bias to adjust for load balancing
- Optionally uses mixed precision for efficiency on GPUs
- Selects the top-k experts for each token based on similarity scores
- Normalizes the scores with softmax to get gate values (weights)
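The routing math is easy to reproduce on toy tensors. The sketch below (my own illustration, independent of the class above) shows the core of step 2: a dot-product score against centroids, top-k selection, and a softmax over only the selected scores:

```python
# Minimal top-k routing on toy tensors (illustration only).
N, D, E, k = 6, 8, 4, 2                       # tokens, d_model, experts, top-k
tokens = torch.randn(N, D)
centroids = torch.randn(E, D)                 # one "preference" vector per expert
bias = torch.zeros(E)                         # load-balancing bias (all zero here)

scores = tokens @ centroids.T + bias          # [N, E] affinity of each token to each expert
topk_scores, topk_idx = torch.topk(scores, k, dim=-1)
gate = F.softmax(topk_scores, dim=-1)         # weights over the k chosen experts only

print(topk_idx)                               # which 2 experts each of the 6 tokens picked
print(gate.sum(dim=-1))                       # each row sums to 1
```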
### 3. Selective Expert Computation
For each expert, the implementation (illustrated in the sketch after this list):
- Identifies which tokens selected this expert
- Processes only those tokens (avoiding unnecessary computation)
- Weights the outputs by the corresponding gate values
- Accumulates the weighted outputs in the final result
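The bookkeeping in step 3 relies on `nonzero`, `index_select`, and `index_add_`. Continuing the toy routing example above, this is roughly how the tokens for a single expert are gathered and their weighted outputs scattered back (again my own sketch, mirroring but not copying the class above):

```python
# Gather-and-scatter pattern for one expert (continuing the toy example above).
expert_id = 0
combined = torch.zeros(N, D)                    # accumulator for all experts' outputs

mask = (topk_idx == expert_id)                  # [N, k] True where this expert was chosen
row_idx, which_k = mask.nonzero(as_tuple=True)  # token indices and their top-k slot
if row_idx.numel() > 0:
    expert_in = tokens.index_select(0, row_idx)      # only the routed tokens
    expert_out = expert_in * 2.0                     # stand-in for ExpertFFN(expert_in)
    weights = gate[row_idx, which_k].unsqueeze(-1)   # gate value for each routed token
    combined.index_add_(0, row_idx, expert_out * weights)

print(combined.abs().sum(dim=-1))               # non-zero only for tokens that chose expert 0
```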
### 4. Residual Connection
Finally, the original input is added to both the shared and routed paths, forming a residual connection that helps with gradient flow and training stability.
### Load Balancing
The separate `update_bias` method implements an online load balancing mechanism that adjusts the router's bias terms to encourage more balanced expert utilization over time. This is crucial for preventing the "rich get richer" problem where a few experts might dominate the routing.
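How this fits into a training loop is sketched below; the exact call site is a judgment call (this is my reading of the method's docstring, not code from the book), but the idea is to call `update_bias` once per optimizer step on the same batch the layer just saw:

```python
# Sketch of where update_bias could fit in a training step (illustrative only).
moe = DeepSeekMoE(d_model=1024, n_routed_exp=16, n_shared_exp=2, top_k=8)
opt = torch.optim.AdamW(moe.parameters(), lr=3e-4)

for _ in range(3):                       # a few dummy steps
    x = torch.randn(2, 64, 1024)         # [B, S, D] batch of token embeddings
    out = moe(x)
    loss = out.pow(2).mean()             # dummy loss just to exercise the backward pass
    loss.backward()
    opt.step()
    opt.zero_grad()
    moe.update_bias(x)                   # nudge the router bias toward balanced expert load
```

Note that `bias` is registered as a buffer rather than a parameter, so the optimizer never touches it; only `update_bias` moves it.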
## Testing the DeepSeek MoE Implementation
Let's test our implementation with some sample data to verify it's working correctly. We'll create a model with:
- Model dimension (`d_model`): 1024
- Number of routed experts: 16
- Number of shared experts: 2
- Top-K experts per token: 8
Then we'll pass a batch of random token embeddings through the model and verify the output shape.
```python
# Test the DeepSeekMoE class
d_model = 1024
n_routed_exp = 16
n_shared_exp = 2
top_k = 8

model = DeepSeekMoE(d_model, n_routed_exp, n_shared_exp, top_k)

# Create random input data
batch_size_new = 2
seq_len_new = 64
random_input_new = torch.randn(batch_size_new, seq_len_new, d_model)

# Pass the random input to the model's forward method
output_new = model(random_input_new)
print("New output shape:", output_new.shape)
```
## Conclusion: The Impact of MoE in DeepSeek
DeepSeek's implementation of Mixture of Experts represents a significant advancement in large language model architecture. By combining always-on shared experts with selectively activated routed experts, DeepSeek achieves:
1. **Parameter Efficiency**: The model can have a vast number of parameters (hundreds of billions) while only activating a small fraction for any given token.
2. **Computational Efficiency**: By activating only top-k experts per token, computation scales sub-linearly with the number of parameters.
3. **Quality Preservation**: The shared experts ensure that all tokens receive high-quality processing, addressing a common weakness of pure sparse MoE models.
4. **Load Balancing**: The dynamic bias adjustment mechanism prevents expert imbalance, ensuring all experts are utilized effectively.
This architecture is a key reason why DeepSeek models can achieve state-of-the-art performance while maintaining reasonable inference costs. The hybrid approach strikes an excellent balance between the pure dense models (like GPT) and pure sparse models (like earlier MoE architectures), getting the best of both worlds.
In a full model implementation, these MoE layers would replace some or all of the standard FFN layers in transformer blocks, while keeping the attention mechanisms unchanged.
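As a final illustration (my own sketch, not the book's code; it uses PyTorch's stock `nn.MultiheadAttention` where the book would use MLA), here is roughly what a transformer block looks like with the FFN slot replaced by the `DeepSeekMoE` layer defined above:

```python
# Transformer block with the FFN replaced by a DeepSeekMoE layer (sketch only;
# standard multi-head attention stands in for the book's MLA).
class MoETransformerBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # The MoE layer takes the place of the usual dense feed-forward network.
        self.moe = DeepSeekMoE(d_model, n_routed_exp=16, n_shared_exp=2, top_k=8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MoE sub-layer. DeepSeekMoE adds its own residual around its input,
        # so here the residual is taken around the normalized activations
        # (a simplification of the usual pre-norm arrangement).
        return self.moe(self.norm2(x))

block = MoETransformerBlock()
print(block(torch.randn(2, 64, 1024)).shape)   # torch.Size([2, 64, 1024])
```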
Conclusion
In the end, if you’re a perpetual learner like me, I think this is a good read!
Thanks for reading 👍
Links
- Book on Manning site: https://www.manning.com/books/build-a-deepseek-model-from-scratch
- GitHub repository of the book: https://github.com/VizuaraAI/DeepSeek-From-Scratch/tree/main
