Llama 4 Scout for Local AI in 2026: What "17B Active Parameters" Actually Means for Your GPU

#llama4 #scout #localai #vram

This article was originally published on runaihome.com

When Meta released Llama 4 Scout in April 2025, the headline was irresistible: a 17B active-parameter model with a 10 million token context window. Home lab forums lit up. People assumed Scout would run on their existing RTX 4090, since 17B models at Q4 need roughly 10–12 GB of VRAM.

Then they typed ollama pull llama4:scout and watched 67 GB download.

The "17B active parameters" claim is technically accurate — Scout only activates 17 billion weights per token during inference. But the model has 109 billion total parameters spread across 16 expert networks, and all 109 billion must be loaded into memory before a single token generates. The active-vs-total distinction matters for compute throughput, not for how much VRAM or RAM you need.

This is the part most Scout guides skip. Here's the actual hardware math.

How Scout's MoE Architecture Works (and Why It Matters for VRAM)

Scout is a Mixture-of-Experts (MoE) model. Unlike a dense model where every weight fires on every token, MoE divides the feedforward layers into 16 specialized "expert" networks. A router decides which 1–2 experts handle each token. The attention layers are dense (always active); the FFN experts are sparse (mostly idle).
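As a rough illustration of the routing step described above, here is a toy top-k router in Python. The random scores stand in for a learned gating layer, and the names `softmax`, `route`, and `NUM_EXPERTS` are made up for this sketch; it is not Meta's implementation, only the general shape of the idea.

```python
import math
import random

NUM_EXPERTS = 16  # Scout's expert count, per the article


def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(scores, top_k=1):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for token in ["The", "GPU", "has", "24", "GB"]:
        # Hypothetical stand-in for the learned router: one random score per expert.
        scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        chosen = route(scores, top_k=1)
        print(token, "->", [(i, round(w, 3)) for i, w in chosen])
```

Each token can land on a different expert, and the choice is only known once the scores are computed. That is the property that forces every expert to stay resident in memory.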

The math behind 17B active parameters:

Attention layers shared across all tokens: ~7B parameters
Each of the 16 expert FFN networks: ~6.4B parameters each
Active at inference: attention (7B) + ~1–2 experts (~6.4–12.8B) ≈ 13–20B, averaging ~17B

The catch: all 16 experts must be in memory so the router can dispatch tokens to whichever expert it selects. You can't pre-load only the "active" experts — you don't know which one until runtime.

This is why Scout's memory footprint is similar to a 109B dense model, not a 17B one. The inference compute resembles a 17B model (fast generation once loaded). The loading requirement resembles a 109B model (slow setup, large memory).
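The active-versus-total arithmetic can be sketched in a few lines of Python. The constants are the article's approximate figures from the breakdown above; the function names and the 4.5 bits-per-weight example value are assumptions made for illustration.

```python
ATTENTION_B = 7.0   # always-active shared parameters, in billions (article's estimate)
EXPERT_B = 6.4      # parameters per expert, in billions (article's estimate)
NUM_EXPERTS = 16


def total_params_b():
    """Parameters that must be resident in memory, in billions."""
    return ATTENTION_B + NUM_EXPERTS * EXPERT_B


def active_params_b(experts_per_token):
    """Parameters actually used to produce one token, in billions."""
    return ATTENTION_B + experts_per_token * EXPERT_B


def weight_gb(params_b, bits_per_weight):
    """Billions of parameters * bits / 8 = gigabytes (taking 1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


if __name__ == "__main__":
    print("total (B):", round(total_params_b(), 1))
    for k in (1, 2):
        print(f"active with {k} expert(s) (B):", round(active_params_b(k), 1))
    # Memory scales with the total; per-token reads scale with the active count.
    print("resident GB at 4.5 bits:", round(weight_gb(total_params_b(), 4.5), 1))
    print("read per token GB at 4.5 bits:", round(weight_gb(17.0, 4.5), 1))
```

The two `weight_gb` calls make the split explicit: the first number decides whether the model loads at all, the second influences how fast it generates once loaded.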

The VRAM Reality: Scout's File Sizes by Quantization

The authoritative numbers come from Unsloth's GGUF releases and the official Ollama library:

Quantization	Bits/weight	File size	Example minimum hardware
BF16 (unquantized)	16	216 GB	4× A100 80GB
Q8_0	8	115 GB	Mac Studio M4 Ultra 192GB
Q4_K_M (Ollama default)	~4.5	67 GB	2× RTX 4090 or Mac 128GB
UD-Q4_K_XL (Unsloth)	4.5	65.6 GB	Same as above
UD-Q2_K_XL (Unsloth, recommended)	2.71	42.2 GB	2× RTX 3090
UD-IQ1_S (Unsloth, smallest)	1.78	33.8 GB	RTX 5090 32GB (partial offload)

The Unsloth dynamic quantization (UD) variants use a smart approach: they quantize the large MoE expert layers more aggressively, while leaving attention and embedding layers at 4–6 bit. This preserves quality better than applying uniform low-bit quantization across the entire model.
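One way to see the mixed-precision effect is to back out the average bits per weight implied by each file size in the table above. This is a rough sketch that assumes 109 billion parameters, 1 GB = 1e9 bytes, and ignores file metadata; `effective_bits` is a hypothetical helper name.

```python
TOTAL_PARAMS = 109e9  # total parameter count quoted in the article

# (label, file size in GB) copied from the table above
FILES = [
    ("BF16", 216.0),
    ("Q8_0", 115.0),
    ("Q4_K_M", 67.0),
    ("UD-Q4_K_XL", 65.6),
    ("UD-Q2_K_XL", 42.2),
    ("UD-IQ1_S", 33.8),
]


def effective_bits(file_gb, params=TOTAL_PARAMS):
    """Average bits stored per parameter, implied by the file size."""
    return file_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for label, size_gb in FILES:
        print(f"{label:12s} {size_gb:6.1f} GB  ~{effective_bits(size_gb):.2f} bits/weight on average")
```

For the dynamic variants, the implied average comes out above the nominal bit width in the quant's name, which is consistent with some layers being kept at higher precision than the experts.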

The practical takeaway:

ollama pull llama4:scout gives you the 67 GB Q4_K_M version
You need 70+ GB of combined memory (VRAM or unified) to run it without CPU offloading
For single-GPU users, CPU offloading is the only path — with significant speed penalties
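The takeaway above can be turned into a small fit check. This sketch uses the file sizes from the article's table and assumes about 3 GB of headroom for runtime buffers (inferred from the "67 GB file, 70+ GB needed" figure); `plan` and `QUANTS_GB` are hypothetical names, and real headroom depends on context length and runtime.

```python
# File sizes in GB, from the article's quantization table.
QUANTS_GB = {
    "BF16": 216.0,
    "Q8_0": 115.0,
    "Q4_K_M": 67.0,
    "UD-Q4_K_XL": 65.6,
    "UD-Q2_K_XL": 42.2,
    "UD-IQ1_S": 33.8,
}


def plan(memory_gb, headroom_gb=3.0):
    """Return (quant_name, offload_gb).

    Picks the largest quant that fits in memory_gb with headroom. If none
    fits, returns the smallest quant and how many GB would spill to system RAM.
    """
    fitting = [(size, name) for name, size in QUANTS_GB.items()
               if size + headroom_gb <= memory_gb]
    if fitting:
        size, name = max(fitting)
        return name, 0.0
    name = min(QUANTS_GB, key=QUANTS_GB.get)
    return name, QUANTS_GB[name] + headroom_gb - memory_gb


if __name__ == "__main__":
    for label, mem in [("1x 24 GB", 24), ("2x 24 GB", 48), ("3x 24 GB", 72)]:
        name, offload = plan(mem)
        print(f"{label}: {name}, offload {offload:.1f} GB")
```

On unified-memory machines, pass the portion of memory the GPU can actually use rather than the total installed RAM.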

GPU Decision Matrix: What Configuration Runs Scout How Fast

Hardware	Combined VRAM	Best quant that fits	Approx. tok/s	Assessment
Single RTX 4090 (24GB)	24 GB	UD-IQ1_S with ~10GB CPU offload	~10–15	Possible, slow
Single RTX 5090 (32GB)	32 GB	UD-IQ1_S (~34GB, minor offload)	~20–28	Borderline, usable
2× RTX 3090 (48GB)	48 GB	UD-Q2_K_XL (42.2GB, fits)	~28–38	Good value path
2× RTX 4090 (48GB)	48 GB	UD-Q2_K_XL (42.2GB, fits)	~35–48	High-end consumer
3× RTX 4090 (72GB)	72 GB	Q4_K_M (67GB, fits)	~40–55	Top-tier home setup
Mac Studio M4 Max (128GB)	128 GB unified	Q4_K_M (67GB, fits easily)	~18–25	Best single-device option
Mac Studio M4 Ultra (192GB)	192 GB unified	Q8_0 (115GB, fits)	~12–18	Full quality, lower throughput

Notes:

Tok/s estimates are for Scout's active-parameter profile (~17B active), which generates tokens faster than a dense 70B model but slower than a dense 17B model (overhead from expert routing)
With CPU offloading, tokens per second drop sharply as more layers land in system RAM: PCIe 4.0 x16 bandwidth (about 32 GB/s) versus GDDR6X (1008 GB/s on the RTX 4090) is the bottleneck
For multi-GPU tensor parallelism with llama.cpp, you need a recent build with true tensor parallel support; NVLink is gone from consumer cards starting with the RTX 40-series (Ada), and the RTX 3090 was the last consumer GPU to offer it
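The bandwidth note above can be made concrete with a back-of-the-envelope ceiling: if every active weight must be read once per token, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes 17B active parameters at 4.5 bits and, for the mixed case, that active weights are spread evenly between fast and slow memory. Both function names are hypothetical, and real throughput will be lower because compute, routing, and synchronization are ignored.

```python
def gb_per_token(active_params_b=17.0, bits_per_weight=4.5):
    """Gigabytes of weights read to generate one token."""
    return active_params_b * bits_per_weight / 8


def max_tokens_per_s(bandwidth_gb_s, active_params_b=17.0, bits_per_weight=4.5):
    """Upper bound when all weights sit behind one memory bandwidth."""
    return bandwidth_gb_s / gb_per_token(active_params_b, bits_per_weight)


def mixed_tokens_per_s(fraction_fast, fast_gb_s, slow_gb_s,
                       active_params_b=17.0, bits_per_weight=4.5):
    """Upper bound when a fraction of the weights sits in fast memory
    and the rest must cross a slower link on every token."""
    gb = gb_per_token(active_params_b, bits_per_weight)
    seconds = gb * (fraction_fast / fast_gb_s + (1.0 - fraction_fast) / slow_gb_s)
    return 1.0 / seconds


if __name__ == "__main__":
    vram, pcie = 1008.0, 32.0  # GB/s figures quoted in the notes above
    print("all in VRAM, ceiling tok/s:", round(max_tokens_per_s(vram), 1))
    for frac in (0.9, 0.7, 0.5):
        print(f"{int(frac * 100)}% in VRAM, ceiling tok/s:",
              round(mixed_tokens_per_s(frac, vram, pcie), 1))
```

Even a modest slow-memory fraction dominates the per-token time, which is why offloaded setups fall so far below the all-VRAM ceiling.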

The clear winner for a single home device: Mac Studio M4 Max with 128GB, which has enough unified memory to run Scout at full Q4 quality. Note that the $1,999 entry price is for the base memory configuration; the 128GB configuration costs more. See the Mac Studio vs dual RTX 4090 breakdown for full context.

For NVIDIA setups, the 2× RTX 3090 path running UD-Q2_K_XL (42.2 GB, fits in 48 GB combined) hits a reasonable price-to-performance point. Used RTX 3090s run $700–$1,000 on eBay — the value analysis is here.

Don't want to buy hardware to evaluate Scout? Rent an A100 80GB on RunPod for $1.19/hr and run Q4_K_M (67 GB fits in 80 GB; Q8_0 at 115 GB does not) to see if it fits your workload before committing to hardware.

Scout vs Llama 3.3 70B: The Quality Comparison

The benchmark picture is more nuanced than headlines suggest.

Benchmark	Scout (109B MoE)	Llama 3.3 70B	Winner
MMLU (general knowledge)	79.6%	86.0%	Llama 3.3 70B
MMLU-Pro (harder follow-up benchmark)	74.3%	68.9%	Scout
MGSM (multilingual math)	90.6%	91.1%	Llama 3.3 70B (marginal)
GPQA Diamond (PhD-level science)	57.2%	not listed	No comparison available
DocVQA (document understanding)	94.4%	N/A (text only)	Scout (unique capability)
ChartQA	88.8%	N/A (text only)	Scout (unique capability)
HumanEval (coding)	not listed	88.4%	No comparison available

The headline MMLU number (standard test) goes to Llama 3.3 70B by 6.4 percentage points. On MMLU-Pro, a harder, more reasoning-heavy follow-up to MMLU that is less forgiving of pattern-matching, Scout wins by 5.4 points. The GPQA Diamond score (57.2%) is notably strong, which hints that Scout handles difficult scientific problems well, although the table has no Llama 3.3 70B figure to compare it against.

The more important comparison for local AI use: Scout is a multimodal model; Llama 3.3 70B is text-only. If your use case includes reading PDFs, analyzing charts, or processing images alongside text, Scout has no head-to-head competitor at this open-weights quality level.

For pure text tasks (coding, writing, summarization) Llama 3.3 70B remains the stronger choice, and it fits commodity hardware more easily: about 40 GB at Q4_K_M versus Scout's 67 GB. For reference, the quantization quality tradeoffs for either model are covered here.

The 10 Million Token Context Window (and Why You Probably Won't Use It)

Meta's official context limit for Scout is 10 million tokens — roughly 7.5 million words, or a small library of documents. The actual Ollama library entry shows 128K as the functional context window, and this is intentional: a 10M context requires extraordinary memory.
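To get a feel for why long contexts are so memory-hungry, here is the standard key-value cache size formula as a Python sketch. The layer count, KV-head count, and head dimension in the demo are placeholder assumptions, not figures from this article; check the model's published config before relying on them. The formula also assumes every layer caches the full context, so architectures that use local or chunked attention in some layers would need less.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Naive KV cache size in GB (1 GB = 1e9 bytes).

    The leading 2 accounts for one key and one value vector per layer, per token.
    bytes_per_value=2 corresponds to 16-bit cache entries.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


if __name__ == "__main__":
    # Placeholder architecture values, assumed for illustration only.
    arch = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    for label, tokens in [("128K", 131_072), ("1M", 1_000_000), ("10M", 10_000_000)]:
        print(f"{label:>5} tokens: ~{kv_cache_gb(tokens, **arch):,.0f} GB of KV cache")
```

The cache grows linearly with context length, so going from 128K to 10M tokens multiplies this cost by roughly 76, whatever the exact per-token figure turns out to be.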

At Q4 (67 GB model), processing a 10M token context in a single pass would require additional KV cache memory proportional to context length. At 10M tokens with Scout's architecture, the KV cache alone would dwarf the model weights — running int