Meta dropped Llama 4 Scout on April 5, 2026, and the benchmark reaction was -- honestly, a little mixed. It doesn't top every leaderboard. Gemma 4 31B beats it on reasoning.
But if you've actually read the spec sheet, you know why the story's more interesting than a single GPQA number. A 10 million token context window. Multimodal baked into training from day one, not stapled on later. And an architecture so efficient it runs on the same GPU card you might already have sitting under your desk.
There's also a license situation that's gotten almost no coverage. And if you're building something real with Llama 4 Scout, you need to understand it before you ship.
Quick Verdict
Rating: 4.0/5
Llama 4 Scout offers a genuinely impressive hardware-to-capability ratio -- the best local multimodal story in open-weight AI right now. The 10M context window is a real technical moat for specific workloads. It runs on 12GB VRAM. The Ollama setup takes under five minutes.
The deduction is for the license. The Llama 4 Community License isn't Apache 2.0. It has a 700M MAU ceiling, attribution requirements, and -- this is the part people aren't talking about -- a flat prohibition on EU-based developers using the multimodal features. If you're in the EU and building a multimodal product, this is a hard blocker.
For most commercial developers outside the EU, Scout is an excellent choice. Read the license first.
What Is Llama 4 Scout?
Llama 4 Scout is Meta's efficiency-focused open-weight model, released April 5, 2026 alongside Llama 4 Maverick (the larger sibling). Both are natively multimodal -- trained from scratch on text, images, and video, not fine-tuned on multimodal data after the fact. That distinction matters more than most coverage acknowledges.
The model spec:
- 17 billion active parameters, 109 billion total
- 16 experts (Mixture-of-Experts architecture)
- 10 million token context window -- 40x larger than Gemma 4's 256K
- Multimodal inputs: text, images (up to 8 per prompt), video
- Trained on: 30+ trillion tokens of text, image, and video data
- Available on: Hugging Face (meta-llama/Llama-4-Scout-17B-16E-Instruct)
The MoE architecture is how Meta pulled off the hardware story. Only 17B parameters fire at inference time despite a 109B parameter knowledge base distributed across 16 experts. Result: GPT-4-class knowledge depth at 12GB VRAM. That's accessible hardware.
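To make the "only 17B fire at inference" idea concrete, here is a toy Mixture-of-Experts routing sketch in NumPy. The expert count matches Scout's 16, but the dimensions, the top-1 routing rule, and the gating math are simplified assumptions for illustration, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32           # toy hidden dimension (Scout's is far larger)
N_EXPERTS = 16   # matches Scout's expert count

# Each "expert" is a small feed-forward weight matrix; a router scores them.
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
router = rng.normal(size=(D, N_EXPERTS))

def moe_forward(x: np.ndarray) -> tuple[np.ndarray, int]:
    """Route a token vector to its single best expert (top-1 routing)."""
    scores = x @ router              # one score per expert
    chosen = int(np.argmax(scores))  # only this expert's weights are used
    return x @ experts[chosen], chosen

token = rng.normal(size=D)
out, expert_id = moe_forward(token)
print(expert_id)   # which of the 16 experts fired for this token
print(out.shape)   # (32,)
```

Only one expert's matrix multiply actually runs per token -- that selectivity is the mechanism behind 17B active parameters drawn from a 109B total.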
The License: What Developers Actually Need to Know
OK so. The license.
Most coverage of Llama 4 Scout calls it "open source" and moves on. It isn't -- not by OSI definition, and not in the way Apache 2.0 is open. The Llama 4 Community License is a custom commercial license from Meta, and it has restrictions that will matter for some development teams.
What it permits:
You can download the weights, run the model, build commercial products, fine-tune on proprietary data, and charge for what you build. No per-token royalties. No revenue percentage owed to Meta. For solo developers and most companies, that's a real and useful grant.
What it restricts:
The 700M MAU cap. If your product or service has more than 700 million monthly active users, you can't use Llama 4 without negotiating a separate license with Meta. This only affects a handful of companies. But if you're one of them, or if you're building infrastructure you expect to scale to that level, it's worth knowing going in.
Attribution. Every consumer-facing product built with Llama 4 must display "Built with Llama" prominently -- on the website, in the app UI, somewhere visible. This isn't a footnote in the terms; it's a contractual requirement. Apache 2.0 has attribution requirements too, but they're less prescriptive about placement.
The EU multimodal restriction. This is the one that's gotten almost no coverage, and it's a real issue.
If you are an individual domiciled in the EU, or a company with its principal place of business in the EU, the Llama 4 Community License does not grant you rights to use Llama 4's multimodal features. Text generation: fine. Images, video, multimodal inputs: blocked.
The carve-out says EU companies with business operations outside the EU may distribute globally using standard practices -- but if your HQ is in Berlin or Paris or Amsterdam and you're building a multimodal app, you need to either use a different model (Gemma 4's Apache 2.0 has no such restriction) or get legal involved.
For most US and UK developers, this doesn't apply. For EU-based teams building with multimodal inputs, it's a hard blocker. Check the Llama 4 Community License directly before you build anything.
Benchmark Performance
Llama 4 Scout isn't the top open-weight model by benchmark score. If benchmark leaderboard position is your primary criterion, Gemma 4 31B wins most categories.
Where Scout actually stands:
GPQA Diamond (graduate-level reasoning): 74.3%. Gemma 4 31B scores 84.3% -- a 10-point gap that's real and consistent. On reasoning-heavy tasks, Scout isn't the benchmark leader.
Versus its actual competitive tier: Against Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 -- the models it was sized and priced to compete with -- Scout performs well. Meta's internal benchmarks show it outperforming these models across coding, multilingual, and reasoning tasks. That's the honest comparison for the hardware tier.
Context-length tasks: This is where the 10M window turns into a real performance story. If your task involves an entire codebase, a 500-page legal document, or multi-session conversation history, Scout isn't just leading -- Gemma 4 literally can't match it at 256K.
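If you want to sanity-check whether a workload actually needs the 10M window, a rough token count helps. The sketch below walks a directory and applies the common ~4 characters-per-token heuristic -- an approximation, not Scout's actual tokenizer.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb; real tokenizers vary by content

def estimate_tokens(root: str, suffixes: tuple[str, ...] = (".py", ".md", ".txt")) -> int:
    """Approximate the token count of all matching files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

# e.g. estimate_tokens("./my_repo") -- if this lands well under 256K,
# Gemma 4's window already covers it; past that, Scout's 10M is the draw.
```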
The benchmarks tell a story about specialization, not overall dominance. If you need pure reasoning on single queries, Gemma 4 31B. If you need multimodal or long-context on modest hardware, Scout.
What Scout Can Actually Do With Images and Video
Natively multimodal means the vision capability is integrated into Scout's base training, not an add-on module. In practice:
- Image analysis: Up to 8 images per prompt. Describe, compare, identify objects, extract text from photos, reason about visual content across multiple images simultaneously.
- Document understanding: Feed in scanned PDFs or document images and ask questions. Works better than OCR-then-prompt approaches for complex layouts.
- Video comprehension: Video clips as input, with the model reasoning across frames. Useful for summarization, action recognition, and temporal reasoning tasks.
- Interleaved vision+text reasoning: The model was trained to handle alternating text and images in a single prompt, which makes it more natural to build workflows that mix document pages, screenshots, and text instructions.
The "up to 8 images per prompt" cap is worth noting for batch image analysis workflows. It's not unlimited. But for most real-world product use cases -- product photo analysis, receipt scanning, document QA -- 8 is plenty per turn.
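For batch workflows, working within the 8-image cap mostly means chunking. A minimal sketch -- the cap value comes from the spec above; the filenames and batching logic are generic placeholders:

```python
from typing import Iterator

MAX_IMAGES_PER_PROMPT = 8  # Scout's per-prompt image limit

def batch_images(paths: list[str]) -> Iterator[list[str]]:
    """Yield image paths in groups that fit a single Scout prompt."""
    for i in range(0, len(paths), MAX_IMAGES_PER_PROMPT):
        yield paths[i : i + MAX_IMAGES_PER_PROMPT]

# 20 receipts -> three prompts of 8, 8, and 4 images
batches = list(batch_images([f"receipt_{n}.jpg" for n in range(20)]))
print([len(b) for b in batches])  # [8, 8, 4]
```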
Running Llama 4 Scout Locally
The local deployment story is genuinely good. I expected friction at the 109B parameter size. There isn't much.
Ollama (recommended starting point):
```shell
ollama pull llama4
ollama run llama4
```
That's it. Ollama handles quantization automatically, defaulting to Q4_K_M which runs on ~12GB VRAM. The Ollama model library entry supports Scout's multimodal inputs natively, so you can pass image paths directly in supported interfaces.
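For programmatic use, images go to Ollama's HTTP API as base64 strings. The sketch below assumes Ollama's default port (11434) and the base64 `images` field of its `/api/generate` endpoint -- check the Ollama API docs for your version before relying on the exact shape.

```python
import base64
import json
import urllib.request

def build_image_request(model: str, prompt: str, image_path: str) -> dict:
    """Build an Ollama /api/generate payload with one base64-encoded image."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [encoded], "stream": False}

if __name__ == "__main__":
    # Hypothetical local call -- requires a running Ollama instance.
    payload = build_image_request("llama4", "Describe this receipt.", "receipt.jpg")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```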
LM Studio:
Visual model browser, search for "Llama 4 Scout," one-click download from Hugging Face. Built-in chat interface for testing, plus a local server mode that exposes an OpenAI-compatible API on localhost:1234. Useful if you want to test local inference against existing OpenAI API client code without changing anything.
Direct Hugging Face / llama.cpp:
Weights at meta-llama/Llama-4-Scout-17B-16E-Instruct (instruction-tuned) and meta-llama/Llama-4-Scout-17B-16E (base). GGUF quantized versions are available from the community. Standard Transformers, vLLM, and TGI integrations all work.
Hardware requirement summary at 4-bit quantization: 12GB VRAM. RTX 3080 12GB, RTX 4070, or better. MacBook Pro M3 Pro (18GB unified memory) runs it too. If you're on an older card with 8GB, you'll need to drop to smaller quantization or use CPU offload (slow).
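A back-of-envelope check on the memory arithmetic: at 4-bit quantization each parameter takes half a byte, so the 17B active parameters come to roughly 8GB of weights in play per token, while the full 109B expert set is several times larger. How a given runtime keeps the inactive experts in system RAM or memory-mapped from disk is runtime-specific -- the calculation below is a rough sketch, not a measurement of Ollama's behavior.

```python
def quantized_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB for a parameter count and quantization level."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

active = quantized_gib(17, 4)    # ~7.9 GiB active per token
total = quantized_gib(109, 4)    # ~50.8 GiB for the full expert set
print(f"{active:.1f} GiB active, {total:.1f} GiB total")
```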
Llama 4 Scout vs Gemma 4: When to Choose Each
I covered Gemma 4's Apache 2.0 angle separately in our Gemma 4 review. The short version for developers choosing between them:
Choose Llama 4 Scout if:
- Your workloads involve very long contexts -- full codebases, book-length documents, extended chat histories. The 10M context window is a real advantage Gemma 4 can't match.
- You need multimodal capability and you're outside the EU. Scout's native training on images and video is strong.
- You're already in the Meta AI ecosystem (Meta AI Studio, Llama API) and want consistency.
- You want GPU-efficient inference on large-knowledge-base reasoning -- 17B active parameters at 109B knowledge depth is a good trade.
Choose Gemma 4 31B if:
- Reasoning accuracy, coding quality, and math performance are the priority. Gemma 4 31B leads the open-weight field here.
- Your team is EU-based and building multimodal products. Gemma 4's Apache 2.0 has no geographic restrictions.
- You want the cleanest possible commercial license -- Apache 2.0 means no custom terms, no legal review required.
- You don't need 10M context and want the strongest per-query results.
Honestly? Both have a place. They're solving different problems well.
Who Should Use Llama 4 Scout
Developers building long-context applications. Legal tech, research tools, financial document analysis, code review pipelines -- anything that regularly processes documents or contexts that exceed 100K tokens. Scout's 10M window is the only open-weight option at that scale.
Privacy-first products with multimodal requirements. If you're building an app that processes user-submitted images or video and the data can't leave the user's device or your infrastructure, Scout running locally is one of very few viable options. Hosted APIs for multimodal AI are the alternative -- but they mean your users' images are going somewhere.
Infrastructure teams evaluating API cost reduction. Multimodal API calls to GPT-4o Vision or Claude add up. If your application makes large volumes of image-processing calls, self-hosting Scout on GPU infrastructure can change the unit economics significantly.
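Whether self-hosting pencils out is simple arithmetic. Here is a toy break-even calculator where every number is a placeholder -- substitute your actual API pricing and GPU costs; none of these figures come from any vendor's rate card.

```python
def breakeven_calls_per_month(api_cost_per_call: float, gpu_cost_per_month: float) -> float:
    """Monthly call volume at which a fixed-cost GPU beats per-call API pricing."""
    return gpu_cost_per_month / api_cost_per_call

# Hypothetical figures -- replace with real quotes:
calls = breakeven_calls_per_month(api_cost_per_call=0.01, gpu_cost_per_month=600.0)
print(f"{calls:,.0f} image calls/month to break even")  # 60,000
```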
Not for EU-based devs building multimodal products. I keep coming back to this because the EU restriction is real and largely uncovered. If that's you -- use Gemma 4. It works.
Limitations Worth Knowing
Benchmark reality: Scout isn't the top open-weight model for pure reasoning. If you're using benchmarks to select a model for a coding assistant or document reasoning tool, run the Gemma 4 numbers too.
The license requires actual legal review for EU teams. "Open source" is how most people are describing Llama 4, and it's not accurate. The EU multimodal restriction could block commercial deployment for teams that don't read the fine print.
Vision input caps at 8 images per prompt. For very high-throughput image processing workflows, that constraint matters at the architecture level.
Fine-tuning the full 109B parameter model needs serious GPU resources. The 17B active parameter architecture makes inference efficient; full fine-tuning at 109B scale is still expensive. Most teams will use the instruct-tuned checkpoint as-is.
Final Verdict
Llama 4 Scout earns a 4.0/5. The 10M context window is genuinely remarkable and has no equivalent in open-weight AI right now. The MoE efficiency story means you can actually run it on hardware real developers own. The multimodal capabilities are native, not patched in.
The license drag is real. It's not Apache 2.0. The EU multimodal restriction is a hard blocker for a significant part of the global developer community, and the "Built with Llama" attribution requirement is more prescriptive than most developers expect. These aren't disqualifying -- but they require eyes-open assessment before you commit to building on Scout.
Pull it in Ollama today and test it. For context-heavy or multimodal workloads outside the EU, it's likely the best local option you have.
Model specs, license terms, and benchmark figures verified against Meta's official Llama 4 documentation and Hugging Face model card as of April 2026. License reviewed at llama.com/llama4/license. Compare with our Google Gemma 4 review for the Apache 2.0 alternative.