Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.
Large language models keep getting larger.
Hundreds of billions of parameters. Trillions of parameters. Yet somehow, many of these models remain surprisingly fast and affordable to run.
How?
The trick is that most modern frontier models don't use all of their parameters for every token.
Instead, they use a technique called Mixture of Experts (MoE).
Think of it like replacing a single giant software service with a fleet of specialized microservices. Rather than every request hitting every service, a router decides which specialists should handle a particular request.
That's the core idea behind MoE.
Let's break down how it works, why it matters, and what challenges engineers face when running MoE models in production.
1. The Scaling Problem
Traditional transformer models are dense.
When a token enters a transformer layer:
- Attention runs.
- The feed-forward network (MLP) runs.
- Every parameter in that layer participates in computation.
If you double the parameter count of a dense model, you roughly double the compute needed for every token.
This creates a painful tradeoff:
- More parameters → better quality
- More parameters → slower inference and more expensive training
Researchers wanted a way to increase model capacity without increasing computation proportionally.
MoE emerged as one of the most successful solutions. Instead of activating every parameter, MoE activates only a small subset for each token.
2. The Restaurant Analogy
Imagine a restaurant with eight specialists:
- Pizza chef
- Sushi chef
- Pastry chef
- Grill chef
- Salad chef
- Soup chef
- Pasta chef
- Dessert chef
When a customer orders pizza, there's no reason for all eight chefs to work on the order.
The restaurant manager simply routes the request to the relevant specialists.
MoE applies the same idea.
Instead of one large neural network handling every token:
- Multiple expert networks exist
- A router chooses which experts should process each token
- Only selected experts perform computation
The result is a model that can contain many more parameters than are actually used for any individual inference step.
3. What Actually Changes Inside a Transformer?
One surprising fact about MoE:
Most of the transformer remains unchanged.
Typically:
- Attention layers stay dense
- Embeddings stay dense
- Normalization layers stay dense
The feed-forward network (MLP) is replaced by a collection of experts.
A standard transformer block looks roughly like:
Input
↓
Attention
↓
Feed Forward Network
↓
Output
An MoE block becomes:
Input
↓
Attention
↓
Router
↓
Selected Experts
↓
Combine Outputs
↓
Output
Each expert is often just another feed-forward network.
Instead of one MLP, you may have:
Expert 1
Expert 2
Expert 3
...
Expert 64
The router decides which ones should handle each token.
4. How Routing Works
The router is usually a lightweight neural network, often a single linear layer followed by a softmax.
For each token it produces scores:
Token: "database"
Expert 1: 0.05
Expert 2: 0.61
Expert 3: 0.09
Expert 4: 0.25
The model then selects the top experts.
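Here is a minimal sketch of that scoring and selection step in plain Python. The function names (`softmax`, `top_k_experts`) are made up for illustration, and the sketch assumes the router scores are softmax probabilities; a real router works on batched tensors.

```python
import math

def softmax(logits):
    """Turn raw router logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(probs, k):
    """Return the indices of the k highest-scoring experts, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Scores from the "database" example, indexed from 0 (Expert 1 is index 0).
probs = [0.05, 0.61, 0.09, 0.25]
selected = top_k_experts(probs, k=2)  # indices 1 and 3, i.e. Expert 2 and Expert 4
```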
Top-2 Routing
Historically, many MoE systems used Top-2 routing:
Selected:
Expert 2
Expert 4
Both experts process the token.
Their outputs are combined using the router probabilities as weights.
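A rough sketch of that weighted combination is below. The toy "experts" just scale their input, and `combine_top_k` is a hypothetical name. Renormalizing the selected weights so they sum to 1 is an assumption here: some implementations do it and some use the raw probabilities, so it is left as a flag.

```python
def make_scaling_expert(scale):
    """Toy stand-in for an expert MLP: multiplies every value by `scale`."""
    def expert(token):
        return [scale * v for v in token]
    return expert

def combine_top_k(token, experts, probs, k=2, renormalize=True):
    """Run the k best-scoring experts and sum their outputs, weighted by router score."""
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in chosen]
    if renormalize:  # assumption: rescale the chosen weights to sum to 1
        total = sum(weights)
        weights = [w / total for w in weights]
    out = [0.0] * len(token)
    for w, i in zip(weights, chosen):
        expert_out = experts[i](token)
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
probs = [0.05, 0.61, 0.09, 0.25]
output = combine_top_k([1.0, -1.0], experts, probs, k=2)
```

Only Expert 2 and Expert 4 are ever called; the other two do no work for this token.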
Switch Routing
Later, Google's Switch Transformer simplified this further.
Instead of selecting two experts:
Selected:
Expert 2
Only one expert runs.
This significantly reduces communication and inference overhead while preserving much of the benefit.
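A minimal sketch of top-1 routing follows. It assumes the gate-scaling convention described in the Switch Transformer paper, where the chosen expert's output is multiplied by its router probability. `switch_forward` and the toy experts are hypothetical names used only for illustration.

```python
def make_scaling_expert(scale):
    """Toy stand-in for an expert MLP: multiplies every value by `scale`."""
    def expert(token):
        return [scale * v for v in token]
    return expert

def switch_forward(token, experts, probs):
    """Send the token to the single best-scoring expert and scale by its gate value."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    gate = probs[best]  # keeping the gate in the output lets the router receive gradients
    return best, [gate * v for v in experts[best](token)]

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
probs = [0.05, 0.61, 0.09, 0.25]
best, output = switch_forward([1.0, 2.0], experts, probs)  # best == 1, i.e. Expert 2
```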
5. Why MoE Models Are So Efficient
Let's compare two hypothetical models.
Dense Model
100B parameters
100B active per token
MoE Model
8 experts × 100B parameters
= 800B total parameters
Only 2 experts active
= 200B active parameters
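The MoE model holds eight times the parameters of the dense model, yet it activates only a quarter of its own weights (200B of 800B) for each token. Its per-token compute is roughly twice the dense model's rather than eight times.
This simplified count ignores the attention and embedding parameters that every token shares.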
This is often called conditional computation.
Different inputs trigger different computation paths.
The model effectively says:
"Not every problem requires every part of my brain."
This is one reason MoE architectures became attractive for large-scale LLMs. They allow parameter counts to grow much faster than inference cost.
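The arithmetic from the comparison above can be written as a small helper. This is only a back-of-the-envelope sketch: `moe_param_counts` is a made-up name, and the `shared_params` argument (attention, embeddings) is an assumption added for completeness, set to zero to match the example.

```python
def moe_param_counts(num_experts, params_per_expert, active_experts, shared_params=0):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + active_experts * params_per_expert
    return total, active

B = 10**9  # one billion
total, active = moe_param_counts(num_experts=8, params_per_expert=100 * B, active_experts=2)
active_fraction = active / total  # 0.25: a quarter of the weights run per token
```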
6. The Hidden Engineering Challenges
The basic idea sounds simple.
Production systems quickly reveal the hard parts.
Challenge 1: Expert Collapse
Suppose the router learns:
90% of tokens → Expert 7
Now Expert 7 receives almost all of the training signal.
Other experts receive little data and become useless.
Researchers combat this with load-balancing losses that encourage more even utilization.
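One well-known form of such a loss is the auxiliary loss from the Switch Transformer paper: the number of experts times the sum, over experts, of (fraction of tokens sent to the expert) × (mean router probability for the expert). The sketch below assumes top-1 routing and a small coefficient `alpha`; the function name is hypothetical, and a real implementation would compute this on tensors inside the training loop.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token.
    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=lambda i: p[i])] += 1
    f = [c / num_tokens for c in counts]
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens spread evenly across four experts.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
# Four tokens that all prefer the same expert.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

The collapsed router scores a higher loss than the balanced one, which is the pressure that nudges training toward even utilization.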
Challenge 2: Distributed Communication
Imagine:
GPU 1 → Experts 1-8
GPU 2 → Experts 9-16
GPU 3 → Experts 17-24
A batch of tokens may need experts spread across multiple machines.
Now inference becomes a networking problem.
Token activations must be shuffled between devices before expert computation can occur.
In many MoE deployments, communication becomes a significant bottleneck.
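To get a feel for the shuffle, here is a single-process sketch that only plans the dispatch. It reuses the eight-experts-per-GPU layout from the example; the function names and the sample assignments are invented for illustration. Real systems perform this exchange with collective communication between devices, which this sketch does not attempt.

```python
from collections import defaultdict

EXPERTS_PER_GPU = 8  # matches the layout above: GPU 1 holds Experts 1-8, and so on

def gpu_for_expert(expert_id):
    """Map a 1-based expert id to the 1-based GPU that hosts it."""
    return (expert_id - 1) // EXPERTS_PER_GPU + 1

def plan_dispatch(assignments):
    """Group tokens by destination GPU and count how many must cross devices.

    assignments: list of (token_id, source_gpu, expert_id) tuples.
    """
    buckets = defaultdict(list)
    cross_device = 0
    for token_id, source_gpu, expert_id in assignments:
        dest = gpu_for_expert(expert_id)
        buckets[dest].append(token_id)
        if dest != source_gpu:
            cross_device += 1
    return dict(buckets), cross_device

# Hypothetical routing decisions for five tokens.
assignments = [(0, 1, 3), (1, 1, 12), (2, 2, 9), (3, 3, 2), (4, 2, 24)]
buckets, cross_device = plan_dispatch(assignments)
```

In this toy batch, three of the five tokens have to leave the GPU they started on before any expert can run.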
Challenge 3: Load Imbalance
Real traffic isn't uniform.
Some experts become hot.
Others remain mostly idle.
This creates GPU utilization problems similar to uneven request distribution in distributed systems.
Modern routing approaches focus heavily on balancing expert workloads.
Challenge 4: Token Dropping
Experts often have limited capacity.
If too many tokens are routed to one expert:
Capacity: 1000 tokens
Incoming: 1500 tokens
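Some tokens may need to be rerouted or dropped.
A dropped token is not deleted. It typically skips the expert computation in that layer and continues through the residual connection.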
Managing these overflow situations becomes part of production MoE serving infrastructure.
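A simple way to picture the overflow handling is a first-come, first-served capacity check. This is a sketch under that assumption; `apply_capacity` is a hypothetical helper, and production systems may prioritize or reroute tokens differently.

```python
def apply_capacity(expert_assignments, capacity):
    """Keep tokens until each expert is full; report the rest as overflow.

    expert_assignments: expert id chosen for each token, in arrival order.
    Returns (kept_token_ids, overflow_token_ids).
    """
    load = {}
    kept, overflow = [], []
    for token_id, expert_id in enumerate(expert_assignments):
        if load.get(expert_id, 0) < capacity:
            load[expert_id] = load.get(expert_id, 0) + 1
            kept.append(token_id)
        else:
            overflow.append(token_id)  # candidate for rerouting or dropping
    return kept, overflow

# The example above: 1500 tokens all routed to one expert with room for 1000.
kept, overflow = apply_capacity([7] * 1500, capacity=1000)
```

Here 1000 tokens are processed and the remaining 500 land in the overflow list.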
7. What MoE Looks Like in Production
For developers building AI systems, the practical implications are interesting.
Memory Footprint
An MoE model may advertise:
600B parameters
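But only a fraction are active for any token.
Compute cost may resemble a much smaller dense model.
Memory does not shrink the same way: every expert must stay loaded, because any of them may be selected for the next token.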
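To put a rough number on the footprint: the weights of all experts are stored regardless of which ones run, so weight memory follows the advertised total. The estimate below assumes 2 bytes per parameter (a 16-bit format) and ignores activations and caches; `weight_memory_gb` is a made-up helper name.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 10**9

# Assumption: 16-bit weights, i.e. 2 bytes per parameter.
footprint_gb = weight_memory_gb(600 * 10**9, bytes_per_param=2)  # 1200.0 GB
```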
Inference Isn't Automatically Cheaper
Many developers assume:
Fewer active parameters
=
Lower latency
Not always.
Routing overhead, expert communication, and distributed synchronization can erase part of the theoretical gain.
Serving MoE efficiently often requires specialized inference stacks.
Observability Matters
Production teams increasingly monitor:
- Expert utilization
- Router entropy
- Token distribution
- Expert hot spots
- Cross-device traffic
An overloaded expert can become the AI equivalent of a hot database shard.
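Two of these signals are easy to compute from router output. The sketch below is illustrative only: the function names are hypothetical, and it assumes you can log the router probabilities and the selected expert for each token.

```python
import math
from collections import Counter

def router_entropy(probs):
    """Shannon entropy (in nats) of one token's router distribution.

    Near zero means the router is very confident; near log(num_experts)
    means it is spreading probability almost evenly.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

def expert_utilization(selected_experts, num_experts):
    """Fraction of tokens handled by each expert (0-based expert ids)."""
    counts = Counter(selected_experts)
    total = len(selected_experts)
    return [counts.get(i, 0) / total for i in range(num_experts)]

uniform_entropy = router_entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
utilization = expert_utilization([0, 0, 0, 1], num_experts=2)  # [0.75, 0.25]
```

A utilization list dominated by one entry is the hot-shard pattern to alert on.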
Routing Becomes Product Behavior
Recent research suggests routing patterns can become task-specific.
Different prompt categories often activate different expert combinations, meaning the routing system itself becomes part of the model's learned intelligence.
Conclusion
Mixture of Experts is one of the most important ideas behind modern large-scale AI systems.
Instead of making every token pass through every parameter, MoE introduces specialization:
- Experts perform different computations
- Routers choose which experts to use
- Only a small subset activates per token
The result is a model that can grow dramatically in total capacity while keeping computation relatively manageable.
For software engineers, MoE feels surprisingly familiar.
It's essentially:
- Service routing
- Load balancing
- Resource scheduling
- Distributed systems
...implemented inside a neural network.
As AI systems continue scaling, understanding MoE is becoming as important as understanding transformers themselves.
Question: If you were designing an MoE model, would you optimize primarily for maximum model quality, or for predictable production latency and infrastructure simplicity? Why?
AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.