How Smaller, Specialised Models Can Work Better Than One Giant Model
Mixture of experts (MoE) is a machine learning approach that divides an artificial intelligence model into separate sub-networks (or experts), each specialising in a subset of the input data, so that together they perform a task jointly.
Building Bigger Models
The belief is that if we make models bigger, they will be smarter.
So we keep increasing:
- Number of parameters (the model’s internal “brain cells”)
- Amount of training data
- Computing power needed to train and run them
Example:
- GPT-3 → 175 billion parameters
- GPT-4 → reportedly more than 1 trillion (the exact figure has not been published)
But...
- Costs are exploding (a rough estimate follows below)
- Improvements may be slowing down
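To get a feel for why costs explode, a common rule of thumb from the scaling-laws literature is that training compute is roughly 6 × parameters × training tokens. The sketch below plugs in ballpark numbers; the token counts are illustrative assumptions, not official figures for any particular model.

```python
# Rough training-compute estimate using the common approximation
# FLOPs ≈ 6 × parameters × training tokens.
# Token counts here are illustrative assumptions, not published figures.

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

small = training_flops(175e9, 300e9)   # a GPT-3-scale model
large = training_flops(1e12, 10e12)    # a hypothetical 1-trillion-parameter model

print(f"175B-parameter model: {small:.1e} FLOPs")
print(f"1T-parameter model:   {large:.1e} FLOPs ({large / small:.0f}x more compute)")
```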
The Idea of Mixture of Experts (MoE)
Instead of one huge model doing everything, MoE builds many smaller models (called “experts”), each good at one thing, and adds a router (a smaller model) that chooses which expert to use for each task.
Example:
- Expert A: good at math
- Expert B: good at writing
- Expert C: good at coding
If you ask a math question → the router sends it to Expert A only.
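As a toy illustration of that routing idea, here is a short sketch. The experts and keyword rules are made up for demonstration; in a real MoE the router is a small learned network, not keyword matching.

```python
# Toy sketch of routing a question to one specialised expert.
# The experts and routing rules are hypothetical placeholders.

EXPERTS = {
    "math": lambda q: f"[math expert] working on: {q}",
    "writing": lambda q: f"[writing expert] drafting: {q}",
    "coding": lambda q: f"[coding expert] implementing: {q}",
}

def route(question: str) -> str:
    """Pick one expert with simple keyword rules (illustration only)."""
    q = question.lower()
    if any(word in q for word in ("solve", "equation", "integral", "sum")):
        return "math"
    if any(word in q for word in ("python", "function", "bug", "compile")):
        return "coding"
    return "writing"

question = "Solve the equation 2x + 3 = 7"
chosen = route(question)
print(EXPERTS[chosen](question))   # only the chosen expert does any work
```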
How It Works in Simple Terms
- You send a question (the input).
- The router examines it and determines which experts should handle it.
- Only those experts “wake up” and work on the question.
- Their answers are combined into one final response.
You don’t waste compute on experts you don’t need, each expert becomes highly skilled in its own area, and the whole system is faster and cheaper to run.
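For readers who want to see the mechanism itself, here is a minimal numerical sketch of top-k gating, with experts reduced to bare weight matrices so the example stays short. The layer sizes, expert count, and top-k value of 2 are arbitrary choices for illustration.

```python
# Minimal top-k gating sketch: a router scores the experts, the best k run,
# and their outputs are blended by softmax weights.
# Experts are plain weight matrices here, not full networks.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 8, 4, 2

expert_weights = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
router_weights = rng.normal(size=(d_model, num_experts))   # the "router"

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route input x to its top-k experts and combine their outputs."""
    scores = x @ router_weights                   # one score per expert
    chosen = np.argsort(scores)[-top_k:]          # indices of the k highest scores
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                          # softmax over the chosen experts
    # Only the selected experts run; their outputs are mixed by the gate weights.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, chosen))

x = rng.normal(size=d_model)
print(moe_layer(x).shape)   # (8,) -- same shape as the input, like any other layer
```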
Analogy
Think of a team project:
Instead of one person trying to do everything, you have many people with different strengths.
When a task comes up, you call the right person for the job.
That’s how MoE works — it’s teamwork inside AI.
Why MoE Is Smarter, Not Larger
|  | One Big Model | Mixture of Experts (MoE) |
| --- | --- | --- |
| Compute | Uses all parts for every task | Uses only a few experts per task |
| Cost | Expensive | More efficient |
| Speed | Slower | Faster |
| Specialisation | General at everything | Great at specific things |
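To make the compute row concrete, here is a back-of-the-envelope comparison of how many parameters actually run per token. Every number is made up for illustration and does not describe any real model.

```python
# Back-of-the-envelope: parameters that actually run per token.
# Every number below is a hypothetical illustration, not a real model.

dense_params = 70e9                   # a dense model uses all of its parameters

num_experts = 16                      # an MoE with 16 experts...
params_per_expert = 4e9               # ...of 4B parameters each
shared_params = 6e9                   # attention, embeddings, router, etc.
top_k = 2                             # only 2 experts run for any given token

moe_total = shared_params + num_experts * params_per_expert
moe_active = shared_params + top_k * params_per_expert

print(f"Dense model: {dense_params / 1e9:.0f}B parameters, all active per token")
print(f"MoE model:   {moe_total / 1e9:.0f}B parameters total, "
      f"but only {moe_active / 1e9:.0f}B active per token")
```

The two models have a similar total capacity, but the MoE pays for only a fraction of it on each token, which is where the cost and speed rows come from.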
Andrew Ng often argues that smaller, focused systems plus smart orchestration can outperform huge, general-purpose models.
His view, in short: “You don’t always need a bigger model; you need the right workflow.”
MoE is the architecture version of that same idea: smarter routing and specialisation.
While MoE enables specialisation through structure and routing, each expert can also be further specialised via fine-tuning.
MoE is not a new idea, but its relevance today comes from the need to scale AI efficiently, not just aggressively.
Why It Matters
- Saves energy and money
- Reduces latency, i.e. faster answers
- Easier to update or improve individual experts
- Encourages modular design, in line with Andrew Ng’s view that smaller systems working together can achieve more