<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Swarit Shukla</title>
    <description>The latest articles on DEV Community by Swarit Shukla (@swaritshukla).</description>
    <link>https://dev.to/swaritshukla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874476%2F4d18f40b-9c14-4429-8ebc-716b824b1461.jpg</url>
      <title>DEV Community: Swarit Shukla</title>
      <link>https://dev.to/swaritshukla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/swaritshukla"/>
    <language>en</language>
    <item>
      <title>The Elegance of MoE: How Gemma 4's 26B Model Runs Like a 4B Model</title>
      <dc:creator>Swarit Shukla</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:57:28 +0000</pubDate>
      <link>https://dev.to/swaritshukla/the-elegance-of-moe-how-gemma-4s-26b-model-runs-like-a-4b-model-4kl4</link>
      <guid>https://dev.to/swaritshukla/the-elegance-of-moe-how-gemma-4s-26b-model-runs-like-a-4b-model-4kl4</guid>
      <description>&lt;p&gt;Google recently dropped its new family of open-source AI models, Gemma 4, but the variant that truly captured my interest is Gemma-4-26B-A4B-IT. The question is: how can a 26 billion parameter model only activate 4 billion parameters at a time? This is where the elegance lies. By only activating 4 billion parameters, it reduces the cost of compute a lot. So what’s the magic behind this? It turns out it uses a clever architecture called MoE (Mixture of Experts) that lets the model choose experts, and hence it only activates 4 billion parameters, making it extremely fast and compute-efficient.&lt;/p&gt;

&lt;p&gt;A Mixture of Experts model is not a giant monolith. Internally, it is divided into experts (for example, 128). Experts specialize in different fields like coding, physics, calculus, and literature. So instead of using a giant neural network, it uses smaller expert neural networks. Note that these experts are not predefined—the neural network learns this itself during backpropagation.&lt;/p&gt;

&lt;p&gt;Dense models vs Mixture of Experts&lt;br&gt;
Traditional dense models differ from Mixture of Experts. In a dense model, every input token is processed by all of the parameters; in an MoE, it is not.&lt;/p&gt;

&lt;p&gt;MoE uses a router that assigns each token to only the top k experts (usually 2 or 8). The router (itself a small neural network) takes the token as input and computes a probability for every expert. The k experts with the highest probabilities are assigned that token.&lt;/p&gt;

&lt;p&gt;So at a time, only four billion parameters are activated, and the remaining 22 billion sit idle.&lt;/p&gt;
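
&lt;p&gt;To make the routing step concrete, here is a minimal sketch in plain Python. The expert count of 128 and top-k of 2 follow the article’s examples; the random logits are a stand-in for the scores a real learned router would produce:&lt;/p&gt;

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 128  # expert count from the article's example
TOP_K = 2          # route each token to the 2 highest-scoring experts

def softmax(logits):
    """Convert raw router scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_logits, k=TOP_K):
    """Pick the k experts with the highest probability for this token."""
    probs = softmax(token_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# A stand-in for the router network's output: one score per expert.
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
for expert_id, prob in route(logits):
    print(f"token routed to expert #{expert_id} with weight {prob:.4f}")
```

&lt;p&gt;Only the two chosen experts run; every other expert’s parameters are skipped for this token.&lt;/p&gt;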

&lt;p&gt;The restaurant analogy&lt;br&gt;
Think of it this way—there are two restaurants:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Dense restaurant&lt;br&gt;
You place an order.&lt;br&gt;
In a dense restaurant, that order is passed to every chef, and every chef works on it. It doesn’t matter that the order is for pasta; even the dessert chef pitches in. After every chef has worked on the dish, out comes a delicious pasta.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MoE restaurant&lt;br&gt;
This is where the router—the manager of experts—comes into play.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In an MoE, instead of the order directly going to the chefs, it first goes to the manager. The manager then decides which two chefs will work on the dish. If the dish ordered is pasta, then the two chefs working on it would be:&lt;/p&gt;

&lt;p&gt;The Vegetable Chef (Entremetier): Boils the starch (the pasta noodles)&lt;br&gt;
The Sauce Chef (Saucier): Cooks the hot, savory meat sauce to pour over the top&lt;br&gt;
Together, they create a delicious pasta without making all the chefs in the restaurant work on it. It’s as good as the one made by a dense restaurant, but with fewer chefs involved.&lt;/p&gt;

&lt;p&gt;(The idea for this analogy came from The Bear show—it’s an amazing show, by the way. Check it out.)&lt;/p&gt;

&lt;p&gt;Total vs Active parameters&lt;br&gt;
Total parameters – This represents the amount of diverse knowledge an LLM has. Let’s say a model has 128 experts and 26 billion parameters. Those 26 billion parameters are spread across 128 experts in their fields. Some are good at math, some at literature, and they might also go niche—like an expert in pop culture, movies, and music.&lt;/p&gt;

&lt;p&gt;Active parameters – This represents the compute cost of the model. So if a model has 26 billion parameters but only activates 4 billion at a time, the model’s compute cost and response time become that of a 4 billion parameter model.&lt;/p&gt;
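
&lt;p&gt;The arithmetic behind that claim is simple. A quick sketch, using the 26B total / 4B active figures above:&lt;/p&gt;

```python
TOTAL_PARAMS = 26e9   # knowledge capacity: all experts combined
ACTIVE_PARAMS = 4e9   # compute cost: parameters touched per token

# Per-token compute scales with the parameters actually used,
# so the MoE runs at roughly the FLOP budget of a 4B dense model.
compute_ratio = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {compute_ratio:.1%}")                       # 15.4%
print(f"rough speedup vs dense 26B: {TOTAL_PARAMS / ACTIVE_PARAMS:.1f}x")  # 6.5x
```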

&lt;p&gt;The vRAM twist&lt;br&gt;
Even though the model becomes extremely efficient at generating inference and reduces the compute cost significantly, there’s still the angle of vRAM.&lt;/p&gt;

&lt;p&gt;It doesn’t matter that the model activates only 4 billion parameters at a time: all 26 billion parameters must still be loaded into vRAM. So even though the compute demands are those of a small model, you still need enough vRAM to hold the whole thing, which may force you onto a high-end system.&lt;/p&gt;

&lt;p&gt;It might give you fast responses and consume less energy, but you will still need a powerful device.&lt;/p&gt;
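
&lt;p&gt;A rough back-of-the-envelope sketch of that memory requirement. The precisions are illustrative assumptions, and real deployments also need headroom for the KV cache and activations:&lt;/p&gt;

```python
TOTAL_PARAMS = 26e9  # every expert must be resident, active or not

def vram_gb(params, bytes_per_param):
    """Raw weight storage in GB (1e9 bytes) at a given precision."""
    return params * bytes_per_param / 1e9

print(f"fp16:  {vram_gb(TOTAL_PARAMS, 2):.0f} GB")    # 52 GB
print(f"int8:  {vram_gb(TOTAL_PARAMS, 1):.0f} GB")    # 26 GB
print(f"4-bit: {vram_gb(TOTAL_PARAMS, 0.5):.0f} GB")  # 13 GB
```

&lt;p&gt;Even aggressively quantized, the full 26B weights dwarf what a 4B dense model would need, which is the twist.&lt;/p&gt;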

&lt;p&gt;An intuitive demonstration&lt;br&gt;
Let’s say you input the sequence “Indian cuisine is very…” and the LLM has to complete it.&lt;/p&gt;

&lt;p&gt;Input – The token “Indian” arrives&lt;br&gt;
Router’s evaluation – Based on mathematical evaluation of the token “Indian”, the router selects the top 2 experts, which could be the ones that specialize in geography (#34) and food (#87). (Modern LLMs consist of multiple layers stacked together, so the router assigns the token to different experts repeatedly as it goes deeper into the architecture.)&lt;br&gt;
Computation – Only the parameters in experts #34 and #87 get activated and used for computation, while the remaining parameters stay idle&lt;br&gt;
Repetition – The model repeats the process, but this time the router might choose completely different experts based on the next token&lt;/p&gt;

&lt;p&gt;The History&lt;br&gt;
It might seem like a very novel idea to most people, but the actual concept was introduced more than three decades ago, in 1991, by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and the “Godfather of AI,” Geoffrey E. Hinton. The paper was titled “Adaptive Mixtures of Local Experts.”&lt;/p&gt;
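
&lt;p&gt;The layer-by-layer repetition described in the demonstration can be sketched as a toy loop. Random scores stand in for the learned router, and the layer count is illustrative:&lt;/p&gt;

```python
import random

random.seed(1)

NUM_EXPERTS = 128
NUM_LAYERS = 4   # illustrative; real models stack many more layers
TOP_K = 2

def router(token, layer):
    """Stand-in for the learned router: score every expert, keep the top 2.

    A real router scores experts from the token's hidden state; here we
    use random numbers purely to show the shape of the selection.
    """
    scores = [(random.random(), expert_id) for expert_id in range(NUM_EXPERTS)]
    scores.sort(reverse=True)
    return [expert_id for _, expert_id in scores[:TOP_K]]

# Each token gets routed afresh at every layer, usually to different experts.
for token in ["Indian", "cuisine", "is", "very"]:
    picks = [router(token, layer) for layer in range(NUM_LAYERS)]
    print(token, "->", picks)
```

&lt;p&gt;Note how the chosen expert IDs change both per token and per layer, which is exactly the repetition step above.&lt;/p&gt;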

&lt;p&gt;The modern implementation of this idea came in 2017. The paper, titled “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” was written by Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean (the Google Brain team). This paper further refined the idea by introducing the concept of sparsity—it forced the neural network to activate only a small number of parameters at a time, making them highly efficient.&lt;/p&gt;

&lt;p&gt;by Swarit Shukla&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>mixtureofexperts</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
