M Shojaei

SwiGLU: The FFN Upgrade I Use to Get Free Performance

Here’s why your Transformer’s feed-forward network is probably outdated. For years, the default was a simple MLP block with a ReLU or GELU activation. That’s cheap, but it’s not what’s running inside the models that matter today. Llama, Mistral, PaLM, and Apple’s foundation models all use a variant of a Gated Linear Unit, specifically SwiGLU.

This post will show you exactly what SwiGLU is, why it works, and how to implement it. We’ll skip the academic fluff and focus on the mechanics and the common gotchas I've seen trip up teams in production. This isn't just theory; it's a small code change that has a measurable impact on model quality.


1. From Simple Activations to Gated Information Flow

A neural network without non-linear activations collapses into a single linear map, no matter how many layers you stack. Functions like ReLU (max(0, x)) solve this by bending and folding the data space, letting the model learn complex patterns.

But a simple activation function is a blunt instrument. It treats every feature in a vector the same way—pushing it through an identical mathematical curve.

The next logical step was the Gated Linear Unit (GLU). The core idea is to split the input into two parallel paths: one carries the data, and the other learns a "gate" that decides how much of the data to let through.

import torch

# The original GLU concept: split the input into a data path and a gate path
x = torch.randn(4, 512)      # a batch of token vectors (illustrative shapes)
W1 = torch.randn(512, 1024)  # data projection
W2 = torch.randn(512, 1024)  # gate projection

data_path = x @ W1
gate_path = x @ W2

# The gate uses a sigmoid to produce values from 0 to 1
gate_values = torch.sigmoid(gate_path)

# Element-wise multiply: the gate selectively dampens or passes the data
output = data_path * gate_values

This dynamic, data-dependent filtering is more powerful than a static ReLU. It allows the network to route information more intelligently. The original GLU paper spawned several variants, including ReGLU (ReLU gate) and GEGLU (GELU gate). The one that won out is SwiGLU.
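
To make the family concrete, here's a quick sketch of the variants, reusing data_path and gate_path from the snippet above (the shapes there are purely illustrative); the only thing that changes is the gate's non-linearity:

import torch.nn.functional as F

glu    = data_path * torch.sigmoid(gate_path)  # original GLU
reglu  = data_path * F.relu(gate_path)         # ReGLU
geglu  = data_path * F.gelu(gate_path)         # GEGLU
swiglu = data_path * F.silu(gate_path)         # SwiGLU (Swish / SiLU gate)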

2. The Math That Matters: What is SwiGLU?

SwiGLU simply replaces the sigmoid function in the GLU's gate with another activation: Swish (also known as SiLU in PyTorch).

Swish is defined as \text{Swish}(x) = x \cdot \sigma(x), where \sigma is the sigmoid function. It's a smoother function than ReLU that doesn't completely kill negative values, which helps gradients flow during training.
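
A quick sanity check in PyTorch, which ships Swish under the name SiLU (F.silu / nn.SiLU):

import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, steps=9)
manual = x * torch.sigmoid(x)           # Swish(x) = x * sigmoid(x)
builtin = F.silu(x)                     # PyTorch's built-in SiLU

print(torch.allclose(manual, builtin))  # True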

So, the full SwiGLU operation becomes:

\text{SwiGLU}(x) = \text{Swish}(xW_1 + b_1) \odot (xW_2 + b_2)

where \odot denotes element-wise multiplication. In code, it's even simpler.

3. The Code: A Drop-in Replacement (With a Catch)

Here is a standard SwiGLU module in PyTorch. It’s what you’ll find inside Llama or Mistral’s feed-forward blocks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """
    A standard SwiGLU FFN implementation.
    Reference: Noam Shazeer's "GLU Variants Improve Transformer"
    (https://arxiv.org/abs/2002.05202)
    """
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        # The SwiGLU paper recommends the hidden dimension be 2/3 of the FFN dimension
        hidden_dim = int(2 * d_ffn / 3)

        self.w1 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w2 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # First linear projection for the gate, activated by SiLU (Swish)
        gate = F.silu(self.w1(x))
        # Second linear projection for the data
        data = self.w2(x)
        # Element-wise multiplication, followed by the final projection
        return self.w3(gate * data)
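
A quick shape check, continuing from the module above and using hypothetical Llama-7B-like sizes (d_model=4096, d_ffn=16384) purely as an example:

ffn = SwiGLU(d_model=4096, d_ffn=16384)

x = torch.randn(2, 16, 4096)  # (batch, sequence, d_model)
print(ffn(x).shape)           # torch.Size([2, 16, 4096]) -- d_model is preserved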

The critical detail: A traditional FFN has two matrices (d_model -> d_ffn and d_ffn -> d_model). SwiGLU has three. To keep the parameter count and FLOPs roughly equivalent to a standard GELU-based FFN, you can't just keep the same hidden dimension.

The 2/3 rule comes from Shazeer's GLU Variants paper, and PaLM and Llama follow the same convention: set the inner SwiGLU dimension to 2/3 of the standard FFN dimension. For example, if your old FFN expanded d_model=4096 to d_ffn=16384, the SwiGLU equivalent would have a hidden dimension of roughly int(2/3 * 16384) = 10922. This keeps the parameter count comparable.
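
To verify the arithmetic, here's a minimal parameter-count comparison for those sizes, using a bias-free two-matrix GELU FFN as the baseline (the counts in the comments are exact for these dimensions, not measured benchmarks):

import torch.nn as nn

d_model, d_ffn = 4096, 16384

# Standard two-matrix FFN with a GELU activation
gelu_ffn = nn.Sequential(
    nn.Linear(d_model, d_ffn, bias=False),
    nn.GELU(),
    nn.Linear(d_ffn, d_model, bias=False),
)

# Three-matrix SwiGLU with the 2/3 rule applied internally
swiglu_ffn = SwiGLU(d_model=d_model, d_ffn=d_ffn)

def count(module):
    return sum(p.numel() for p in module.parameters())

print(count(gelu_ffn))    # 134,217,728
print(count(swiglu_ffn))  # 134,209,536 -- within 0.01% of the baseline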

4. Why Does This Tweak Actually Work?

This small architectural change brings several benefits that compound at scale.

  • Richer Representations: Because Swish is non-zero for negative inputs and the gating is multiplicative, the network can model more complex interactions. The product of two linear projections lets a single block capture quadratic interactions between input features, something a plain linear-plus-ReLU layer cannot represent exactly.
  • Smoother Gradients: Swish has a smooth, non-monotonic curve. Unlike ReLU, its derivative is non-zero almost everywhere, which prevents "dead neurons" and stabilizes training by providing a more consistent gradient signal.
  • Dynamic Feature Selection: The gating mechanism allows the FFN block to act as a dynamic router. For each token, it can learn to amplify important features and suppress irrelevant ones, a job previously left mostly to the attention layers.
  • Proven at Scale: This isn't a speculative tweak. It's battle-tested.
    • Google PaLM & Gemini: Use SwiGLU.
    • Meta Llama 2 & 3: Use SwiGLU.
    • Mistral & Mixtral: Use SwiGLU.
    • Apple Intelligence: Reports confirm a standard SwiGLU FFN.

When this many production-grade models converge on a single component, it’s not an accident. It’s because it delivers a better trade-off between parameter count, training stability, and final model quality.

[2002.05202] GLU Variants Improve Transformer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.


5. Pitfalls & Fixes (The Real World)

Just swapping nn.GELU for a SwiGLU module isn't enough. I've seen a few common mistakes.

  1. Ignoring Hidden Dimensions: As mentioned, just plugging in SwiGLU with the old d_ffn will increase your parameter count by ~50%. You must adjust the intermediate dimension down. The 2/3 rule is a good starting point, but it's a tunable hyperparameter. In practice, rounding to a nearby value that's divisible by a hardware-friendly multiple (Llama rounds up to a multiple of 256) improves GPU utilization and training speed; see the sketch after this list.
  2. Activation Outliers: The multiplicative gating can sometimes produce very large activation values ("spikes"). This is rarely a problem for FP32 or BFloat16 training, but it can destabilize low-precision regimes such as FP8 training and aggressive quantization. Research into "Smooth-SwiGLU" is ongoing to address this for extreme-scale training.
  3. Hype and Alternatives: SwiGLU is the incumbent, but it's not the final word. Research on activations is active. Nemotron-4 340B from NVIDIA, for instance, uses Squared ReLU (ReLU²). Other work on sparse LLMs suggests that functions like dReLU can offer better performance with higher activation sparsity, which is critical for faster inference. Keep an eye on this space.
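
For pitfall 1, here's a minimal sketch of that dimension calculation with the rounding step Llama's reference code applies (its multiple_of parameter defaults to 256):

def swiglu_hidden_dim(d_ffn: int, multiple_of: int = 256) -> int:
    """Apply the 2/3 rule, then round up to a hardware-friendly multiple."""
    hidden = int(2 * d_ffn / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(swiglu_hidden_dim(16384))  # 11008 -- the FFN width Llama-2-7B actually uses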

My Opinion

If you're building a new Transformer from scratch in 2024 or fine-tuning an older architecture, swapping the FFN for a properly dimensioned SwiGLU block is one of the highest-ROI changes you can make. It's a low-effort, low-risk upgrade that aligns your model with proven, state-of-the-art architectures.

Most of the knowledge in an LLM is stored in its feed-forward layers. Improving their capacity and dynamics gives you a direct, measurable lift. Don't cargo-cult it, but understand that the switch from static activations to dynamic gating is a fundamental improvement.

What You Can Do Now

  • Review your model's FFN: If it's using a plain GELU or ReLU, benchmark a version with a SwiGLU block.
  • Implement SwiGLU correctly: Use the three-matrix design and adjust the hidden dimension to 2/3 * d_ffn as a starting point.
  • Validate the change: Monitor your validation loss. You should see a small but consistent improvement or faster convergence for the same parameter budget.

This is not a magic bullet, but it's a piece of solid, validated engineering that has become the standard for a reason.


Sources & Further Reading

  • Original SwiGLU proposal: Shazeer, "GLU Variants Improve Transformer" (arXiv:2002.05202)
  • Large-scale application: Chowdhery et al., "PaLM" (arXiv:2204.02311)
  • Swish activation: Ramachandran et al., "Searching for Activation Functions" (arXiv:1710.05941)
  • Activation sparsity: Liu et al., "Discovering Efficient Activation Functions for Sparse LLMs" (arXiv:2402.03804)
