This article was originally published on runaihome.com
When Meta released Llama 4 Scout in April 2025, the headline was irresistible: a 17B active-parameter model with a 10 million token context window. Home lab forums lit up. People assumed Scout would run on their existing RTX 4090, since 17B models at Q4 need roughly 10–12 GB of VRAM.
Then they typed ollama pull llama4:scout and watched 67 GB download.
The "17B active parameters" claim is technically accurate — Scout only activates 17 billion weights per token during inference. But the model has 109 billion total parameters spread across 16 expert networks, and all 109 billion must be loaded into memory before a single token generates. The active-vs-total distinction matters for compute throughput, not for how much VRAM or RAM you need.
This is the part most Scout guides skip. Here's the actual hardware math.
How Scout's MoE Architecture Works (and Why It Matters for VRAM)
Scout is a Mixture-of-Experts (MoE) model. Unlike a dense model where every weight fires on every token, MoE divides the feedforward layers into 16 specialized "expert" networks. A router decides which 1–2 experts handle each token. The attention layers are dense (always active); the FFN experts are sparse (mostly idle).
The math behind 17B active parameters:
- Attention layers shared across all tokens: ~7B parameters
- Each of the 16 expert FFN networks: ~6.4B parameters each
- Active at inference: attention (7B) + ~1–2 experts (~6.4–12.8B) ≈ 13–20B, averaging ~17B
The catch: all 16 experts must be in memory so the router can dispatch tokens to whichever expert it selects. You can't pre-load only the "active" experts — you don't know which one until runtime.
This is why Scout's memory footprint is similar to a 109B dense model, not a 17B one. The inference compute resembles a 17B model (fast generation once loaded). The loading requirement resembles a 109B model (slow setup, large memory).
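The parameter breakdown above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in this section (7B for attention, 6.4B per expert, 16 experts); the function names are made up for illustration.

```python
# Approximate figures quoted in the breakdown above, in billions of parameters.
ATTENTION_B = 7.0    # dense layers, used for every token
EXPERT_B = 6.4       # one FFN expert
NUM_EXPERTS = 16

def total_params_b() -> float:
    """Everything that must be resident in memory before generation starts."""
    return ATTENTION_B + NUM_EXPERTS * EXPERT_B

def active_params_b(experts_per_token: int) -> float:
    """Weights actually used to compute a single token."""
    return ATTENTION_B + experts_per_token * EXPERT_B

print(f"resident in memory: {total_params_b():.1f}B")
for k in (1, 2):
    print(f"active with {k} expert(s): {active_params_b(k):.1f}B")
```

The resident total lands near 109B while the per-token figure stays between roughly 13B and 20B, which is the gap between what you must load and what you actually compute with.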
The VRAM Reality: Scout's File Sizes by Quantization
The authoritative numbers come from Unsloth's GGUF releases and the official Ollama library:
| Quantization | Bits/weight | File size | Minimum VRAM (GPU-only) |
|---|---|---|---|
| BF16 (unquantized) | 16 | 216 GB | 4× A100 80GB |
| Q8_0 | 8 | 115 GB | Mac Studio M4 Ultra 192GB |
| Q4_K_M (Ollama default) | ~4.5 | 67 GB | 3× RTX 4090 or Mac 128GB |
| UD-Q4_K_XL (Unsloth) | 4.5 | 65.6 GB | Same as above |
| UD-Q2_K_XL (Unsloth, recommended) | 2.71 | 42.2 GB | 2× RTX 3090 |
| UD-IQ1_S (Unsloth, smallest) | 1.78 | 33.8 GB | RTX 5090 32GB (partial offload) |
The Unsloth dynamic quantization (UD) variants quantize the large MoE expert layers more aggressively while leaving attention and embedding layers at 4–6 bit. This preserves quality better than applying uniform low-bit quantization across the entire model.
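One way to see the effect of mixed precision is to back out the average bits stored per weight from each file size. The sketch below assumes 109 billion total parameters, treats 1 GB as 10^9 bytes, and ignores file metadata; `effective_bits_per_weight` is a hypothetical helper, not part of any tool.

```python
TOTAL_PARAMS = 109e9  # total parameter count quoted in this article

# File sizes in GB, copied from the table above.
FILE_SIZES_GB = {
    "BF16": 216.0,
    "Q8_0": 115.0,
    "Q4_K_M": 67.0,
    "UD-Q4_K_XL": 65.6,
    "UD-Q2_K_XL": 42.2,
    "UD-IQ1_S": 33.8,
}

def effective_bits_per_weight(file_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per parameter, ignoring metadata overhead."""
    return file_gb * 1e9 * 8 / params

for name, size in FILE_SIZES_GB.items():
    bits = effective_bits_per_weight(size)
    print(f"{name:12s} {size:6.1f} GB -> {bits:.2f} bits/weight")
```

Under these assumptions the two smallest UD files average noticeably more bits per weight than their nominal labels, which is what you would expect if the attention and embedding layers are kept at higher precision than the experts.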
The practical takeaway:
- ollama pull llama4:scout gives you the 67 GB Q4_K_M version
- You need 70+ GB of combined memory (VRAM or unified) to run it without CPU offloading
- For single-GPU users, CPU offloading is the only path — with significant speed penalties
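The fit-or-offload decision can be sketched as a simple lookup against the file sizes from the quantization table. This is a rough illustration, not a sizing tool: `pick_quant` and its `reserve_gb` parameter are hypothetical, and the default of zero reserve compares raw file size only, with no allowance for KV cache or runtime buffers.

```python
# (name, file size in GB) from the quantization table, largest first.
QUANTS = [
    ("BF16", 216.0),
    ("Q8_0", 115.0),
    ("Q4_K_M", 67.0),
    ("UD-Q4_K_XL", 65.6),
    ("UD-Q2_K_XL", 42.2),
    ("UD-IQ1_S", 33.8),
]

def pick_quant(memory_gb: float, reserve_gb: float = 0.0):
    """Return (quant name, GB that must spill to system RAM).

    reserve_gb is a hypothetical allowance for KV cache and buffers.
    """
    budget = memory_gb - reserve_gb
    for name, size in QUANTS:
        if size <= budget:
            return name, 0.0
    # Nothing fits: fall back to the smallest file and report the overflow.
    name, size = QUANTS[-1]
    return name, round(size - budget, 1)

for label, gb in [("1x RTX 4090", 24), ("2x RTX 3090", 48), ("3x RTX 4090", 72)]:
    print(label, pick_quant(gb))
```

In practice you would pass a nonzero `reserve_gb`, since context and runtime buffers also need memory on top of the weights.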
GPU Decision Matrix: What Configuration Runs Scout How Fast
| Hardware | Combined VRAM | Best quant that fits | Approx. tok/s | Assessment |
|---|---|---|---|---|
| Single RTX 4090 (24GB) | 24 GB | UD-IQ1_S with ~10GB CPU offload | ~10–15 | Possible, slow |
| Single RTX 5090 (32GB) | 32 GB | UD-IQ1_S (~34GB, minor offload) | ~20–28 | Borderline, usable |
| 2× RTX 3090 (48GB) | 48 GB | UD-Q2_K_XL (42.2GB, fits) | ~28–38 | Good value path |
| 2× RTX 4090 (48GB) | 48 GB | UD-Q2_K_XL (42.2GB, fits) | ~35–48 | High-end consumer |
| 3× RTX 4090 (72GB) | 72 GB | Q4_K_M (67GB, fits) | ~40–55 | Top-tier home setup |
| Mac Studio M4 Max (128GB) | 128 GB unified | Q4_K_M (67GB, fits easily) | ~18–25 | Best single-device option |
| Mac Studio M4 Ultra (192GB) | 192 GB unified | Q8_0 (115GB, fits) | ~12–18 | Full quality, lower throughput |
Notes:
- Tok/s estimates are for Scout's active-parameter profile (~17B active), which generates tokens faster than a dense 70B model but slower than a dense 17B model (overhead from expert routing)
- With CPU offloading, tokens per second drop sharply as more layers land in system RAM: PCIe 4.0 bandwidth (32 GB/s) versus GDDR6X (1008 GB/s on RTX 4090) is the bottleneck
- For multi-GPU tensor parallelism with llama.cpp, you need a recent build with true tensor parallel support; NVLink is gone from consumer cards after the RTX 3090, so RTX 40-series and newer GPUs communicate over PCIe
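The bandwidth note can be turned into a rough ceiling on generation speed: if each token requires reading the active weights once, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes 17B active parameters and the average precision implied by the 67 GB Q4_K_M file over 109B parameters. It ignores KV cache reads, compute time, and routing overhead, so treat the result as an upper bound, not a prediction.

```python
ACTIVE_PARAMS = 17e9                 # active parameters per token, as quoted above
BITS_PER_WEIGHT = 67e9 * 8 / 109e9   # about 4.9, implied by the 67 GB Q4_K_M file

def max_tokens_per_second(bandwidth_gb_s: float) -> float:
    """Upper bound when generation is limited purely by reading weights."""
    bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

for name, bw in [("RTX 4090 GDDR6X", 1008), ("PCIe 4.0", 32)]:
    print(f"{name}: at most ~{max_tokens_per_second(bw):.0f} tok/s")
```

Under these assumptions the VRAM ceiling sits a little under 100 tok/s, while anything bottlenecked on the 32 GB/s link is limited to low single digits, which is why offloaded configurations fall off so steeply.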
The clear winner for a single home device with no compromise: Mac Studio M4 Max with 128GB. Its unified memory holds Scout at full Q4 quality with room to spare, though the $1,999 entry price is for the base configuration and the 128GB option costs more. See the Mac Studio vs dual RTX 4090 breakdown for full context.
For NVIDIA setups, the 2× RTX 3090 path running UD-Q2_K_XL (42.2 GB, fits in 48 GB combined) hits a reasonable price-to-performance point. Used RTX 3090s run $700–$1,000 on eBay — the value analysis is here.
Don't want to buy hardware to evaluate Scout? Rent an A100 80GB on RunPod for $1.19/hr and run Q4_K_M (67 GB, which fits in 80 GB; Q8_0 at 115 GB does not) to see if it fits your workload before committing to hardware.
Scout vs Llama 3.3 70B: The Quality Comparison
The benchmark picture is more nuanced than headlines suggest.
| Benchmark | Scout (109B MoE) | Llama 3.3 70B | Winner |
|---|---|---|---|
| MMLU (general knowledge) | 79.6% | 86.0% | Llama 3.3 70B |
| MMLU-Pro (harder variant) | 74.3% | 68.9% | Scout |
| MGSM (multilingual math) | 90.6% | 91.1% | Llama 3.3 70B (marginal) |
| GPQA Diamond (PhD-level science) | 57.2% | — | No comparison (Llama score not listed) |
| DocVQA (document understanding) | 94.4% | N/A (text only) | Scout (unique capability) |
| ChartQA | 88.8% | N/A (text only) | Scout (unique capability) |
| HumanEval (coding) | — | 88.4% | No comparison (Scout score not listed) |
The headline MMLU number (standard test) goes to Llama 3.3 70B by 6.4 percentage points. On MMLU-Pro, a harder benchmark designed to weed out pattern-matching, Scout wins by 5.4 points. Scout's GPQA Diamond score (57.2%) is notably strong and suggests it handles difficult scientific problems well despite losing the general-knowledge headline, though the table lists no Llama 3.3 70B figure to compare it against.
The more important comparison for local AI use: Scout is a multimodal model; Llama 3.3 70B is text-only. If your use case includes reading PDFs, analyzing charts, or processing images alongside text, Scout has no head-to-head competitor at this open-weights quality level.
For pure text tasks — coding, writing, summarization — Llama 3.3 70B remains the stronger choice, and it fits better in commodity hardware at 40 GB for Q4_K_M versus Scout's 67 GB. For reference, the quantization quality tradeoffs for either model are covered here.
The 10 Million Token Context Window (and Why You Probably Won't Use It)
Meta's official context limit for Scout is 10 million tokens — roughly 7.5 million words, or a small library of documents. The actual Ollama library entry shows 128K as the functional context window, and this is intentional: a 10M context requires extraordinary memory.
At Q4 (67 GB model), processing a 10M token context in a single pass would require additional KV cache memory proportional to context length. At 10M tokens with Scout's architecture, the KV cache alone would dwarf the model weights — running int
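To get a feel for the scale, here is a back-of-envelope KV cache estimate using the standard formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The architecture values below (48 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions not stated in this article, and the sketch ignores any attention scheme that would shrink the cache for some layers.

```python
# ASSUMED architecture values, not given in this article; verify against
# the model's published config before relying on them.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # 16-bit cache entries

def kv_cache_gb(context_tokens: int) -> float:
    """Memory for keys and values across all layers, in GB (1e9 bytes)."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * per_token / 1e9

for tokens in (128_000, 10_000_000):
    print(f"{tokens:>10,} tokens -> {kv_cache_gb(tokens):,.1f} GB of KV cache")
```

Under these assumptions a 128K context needs on the order of 25 GB of cache, while 10M tokens would need roughly 2 TB, many times the 67 GB of weights.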