Three heavyweights dropped this year: Gemma 4 (Google), Llama 4 (Meta), Mistral Small 4 (Mistral). All free to run. All structurally different. Here's which one fits which job.
Short answer: long context → Llama 4 Scout. License-clean commercial use → Mistral Small 4. On-device → Gemma 4 E2B / E4B.
## Quick Take
| | Gemma 4 (31B / 26B MoE) | Llama 4 Scout | Mistral Small 4 |
|---|---|---|---|
| Architecture | Dense (31B) · MoE (26B/A4B) | MoE (17B active / 109B) | MoE (~22B active / 119B) |
| Context | E2B/E4B 128K · 31B/26B 256K | 10M | 256K |
| License | Google Gemma ToU | Llama 4 Community | Apache 2.0 |
| Multimodal | text + image + video + OCR (E2B/E4B add audio) | text + image (early fusion) | text + image (first in Small series) |
| Edge fit | Excellent (E2B/E4B) | Low | Low (multi-GPU even quantized) |
## MoE vs Dense
MoE works like a bank of specialized tellers: only the relevant experts fire per input. Llama 4 Scout: 109B total, 17B active. Mistral Small 4: 119B total across 128 experts, ~22B active. Gemma 4 26B takes the "small MoE" path: 26B total, ~3.8B active, targeting 4B-class speed with bigger-model intelligence.
Gemma 4 E2B, E4B, and 31B are Dense. Every parameter fires on every token. Higher compute per parameter, but memory requirements scale linearly and planning is easier.
One MoE trap people hit: inference compute drops, but all weights still need to sit in memory. Llama 4 Scout in fp16 = ~218GB VRAM. 4-bit = ~55GB. "Only 17B active so it's lightweight" is wrong.
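That memory math can be sketched in a few lines. The parameter counts below come from this article's figures, and KV cache plus activation overhead is ignored, so real usage runs higher:

```python
# Rough VRAM estimate: MoE memory scales with TOTAL params, not active ones.
# Overhead (KV cache, activations, runtime buffers) is ignored here.

def weight_gb(total_params_b: float, bits_per_weight: int) -> float:
    """Gigabytes needed just to hold the weights."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

scout_total, scout_active = 109, 17  # Llama 4 Scout, per the article

print(f"fp16, all 109B weights:  {weight_gb(scout_total, 16):.1f} GB")  # 218.0
print(f"4-bit, all 109B weights: {weight_gb(scout_total, 4):.1f} GB")   # 54.5
# The trap: counting only the active 17B suggests a much smaller number.
print(f"fp16 if only active 17B counted: {weight_gb(scout_active, 16):.1f} GB")  # 34.0
```

The last line is the wrong mental model: routing picks 17B per token, but all 109B must be resident.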
## Context Window — 10M, 256K, 128K
Llama 4 Scout's 10M is the outlier. Meta got there via iRoPE — interleaved RoPE that holds accuracy past the training sequence length. Practical impact: you can drop an entire monorepo into one prompt and skip the RAG pipeline altogether.
Mistral Small 4 sits at 256K. Gemma 4's small variants (E2B/E4B) are 128K; the medium 31B and 26B MoE jump to 256K. For normal-scale work — books, research paper batches, long meeting transcripts — 128K is already more than enough.
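A back-of-envelope check shows why 10M changes the calculus for repo-scale input. This uses the common rough heuristic of ~4 characters per token; real tokenizers vary, so treat the result as an order-of-magnitude estimate:

```python
# Does a body of text fit in a given context window?
# ~4 chars/token is a rough heuristic for English prose and code.

def fits_in_context(total_chars: int, context_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    est_tokens = total_chars / chars_per_token
    return est_tokens <= context_tokens

repo_chars = 30_000_000  # a hypothetical ~30MB-of-text monorepo → ~7.5M tokens

print(fits_in_context(repo_chars, 128_000))     # False — needs chunking/RAG
print(fits_in_context(repo_chars, 256_000))     # False — still needs chunking
print(fits_in_context(repo_chars, 10_000_000))  # True — fits in one Scout prompt
```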
## Benchmarks
- Llama 4 Maverick on SWE-bench: 76.8 to 80.8 depending on the evaluation variant. Open-source top tier — but not "absolute #1." GLM-5 (77.8) shows up right next to it on SWE-bench Verified.
- Llama 4 Scout is smaller than Maverick but wins on repo-scale analysis thanks to 10M context.
- Gemma 4 31B shines on multimodal tasks relative to its size class.
- Mistral Small 4 (per Mistral's evals) matches or surpasses GPT-OSS 120B and Qwen-class models on several key benchmarks — at ~22B active.
Benchmarks and day-to-day use diverge. Run them yourself before committing.
## Multimodal — Images, Video, Audio
None of these three is text-only in 2026.
- Gemma 4 is natively multimodal across every variant: text, image, video, OCR. E2B and E4B add native audio input — voice assistants and on-device transcription become direct use cases.
- Llama 4 Scout/Maverick use early fusion — text and vision tokens unified inside the foundation model.
- Mistral Small 4 is the first in the Mistral Small series to support native vision. Images ride in the normal API message array alongside text, inside the same 256K window.
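As a sketch of what that message array can look like, here is a mixed text-plus-image user turn in the OpenAI-style chat format that most hosted providers expose. The field names and the model id are assumptions, not confirmed schema, so check the provider's API docs before shipping:

```python
# Hypothetical text+image request body in OpenAI-style chat format.
# Field names ("image_url", etc.) and the model id are assumptions.

def vision_message(prompt: str, image_url: str) -> dict:
    """One user turn combining text and an image in the same message array."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

payload = {
    "model": "mistral-small-4",  # placeholder id, not an official endpoint name
    "messages": [vision_message("What does this chart show?",
                                "https://example.com/chart.png")],
}
print(payload["messages"][0]["content"][1]["type"])  # image_url
```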
## Licenses (Actually Read Before Shipping)
- Mistral Small 4 / Apache 2.0 — no meaningful restrictions beyond standard attribution and notice terms. Fine-tune, redistribute, embed in SaaS, ship it.
- Llama 4 Community — commercial use fine below 700M MAU, but Meta's approval is required above that (sole discretion). Also: mandatory "Built with Llama" badge on a related web or in-app page.
- Gemma 4 / Google Gemma ToU — you can't use Gemma outputs to train competing LLMs, and AI-adjacent services need to read the clauses carefully.
## Edge Deployment Reality
| Model | fp16 VRAM | 4-bit VRAM | Realistic hardware |
|---|---|---|---|
| Gemma 4 E4B | ~8GB | ~3GB | Laptop / phone |
| Gemma 4 31B | ~62GB | ~16GB | RTX 4090 / M2 Max |
| Llama 4 Scout | ~218GB | ~55GB | Multi-GPU / H100 at Int4 |
| Mistral Small 4 | ~238GB | ~60GB | Multi-GPU / high-end workstation |
Gemma 4 E4B at 4-bit = ~3GB. Runs on a laptop. For smartphone deployments E2B is the target. Llama 4 Scout and Mistral Small 4 stay in server territory even quantized — the full MoE weights have to fit in memory regardless of active count.
## How to Combine All Three
Routing by request type is more realistic than picking one:
```
request type                     → model
--------------------------------------------------------------
whole-doc / whole-repo analysis  → Llama 4 Scout (10M context)
image + video + audio input      → Gemma 4
commercial API traffic           → Mistral Small 4 (Apache 2.0)
```
Using hosted APIs (Together AI, Groq, Fireworks) on top of this routing lets you optimize both cost and capability together.
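A minimal router along those lines might look like the sketch below. The model ids and the token threshold are placeholders, not official endpoint names:

```python
# Toy request router for the mapping above. Model ids are placeholders
# for whatever endpoints you actually deploy (hosted or self-hosted).

ROUTES = {
    "long_context": "llama-4-scout",    # whole-doc / whole-repo analysis
    "multimodal":   "gemma-4",          # image / video / audio input
    "default":      "mistral-small-4",  # license-clean commercial traffic
}

def pick_model(est_tokens: int, has_media: bool) -> str:
    if est_tokens > 256_000:   # beyond the Mistral/Gemma windows
        return ROUTES["long_context"]
    if has_media:
        return ROUTES["multimodal"]
    return ROUTES["default"]

print(pick_model(5_000_000, False))  # llama-4-scout
print(pick_model(2_000, True))       # gemma-4
print(pick_model(2_000, False))      # mistral-small-4
```

In production you would estimate tokens from request size and branch on declared content types, but the routing logic stays this simple.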
## FAQ
Q. How does Scout actually handle 10M tokens?
iRoPE — Meta's interleaved version of RoPE position encoding. Extends accuracy well past training length.
Q. Which is most commercial-friendly?
Mistral Small 4. Apache 2.0. No MAU cap, no branding requirement.
Q. Is MoE always better than Dense?
No. Inference compute drops, but memory scales with total parameters. Edge = Dense small or compact MoE like Gemma 4 26B. MoE only pays off with multi-GPU.
Q. Best at coding?
Llama 4 Maverick (76.8–80.8 on SWE-bench) — top tier, not #1. GLM-5 (77.8) is right there too. Mistral Small 4 is fine for general code review; Scout's 10M wins whole-repo work.
## Sources
- Hugging Face — Welcome Gemma 4
- Meta AI — The Llama 4 herd
- Llama 4 Community License
- Mistral Small 4 announcement
Originally published at GoCodeLab. Always read each model's official license before commercial deployment — this post is not legal advice.