Meta Llama 4 Scout & Maverick — The Complete Production Guide: 17B Active MoE, 10M Context, iRoPE, and the vLLM/Ollama Deployment Playbook


Published on the ManoIT Tech Blog (Korean original). On April 5, 2025, Meta released Llama 4 Scout and Llama 4 Maverick — the first open-weights models in the Llama family to use a Mixture-of-Experts (MoE) architecture, the first to be natively multimodal, and — with Scout — the first to deliver a real 10M-token context window on a single H100 GPU. This post unpacks the architecture, benchmarks, license caveats, deployment options, the Llama Guard 4 safety stack, and the Behemoth delay — from a production-deployment perspective grounded in what ManoIT recommends to customers.


1. The Llama 4 family at a glance — Scout / Maverick / Behemoth

Llama 4 was always planned as a three-model family. April 5 brought two models; the largest one, Behemoth, is still in private training. Knowing where each model fits is the first decision in any adoption review.

| Model | Active / Total params | Experts | Context | Primary use | Status |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | 17B / 109B | 16 | 10M tokens | Long-document analysis, code-base RAG, video summarization | Released 2025-04-05 (Hugging Face, Ollama) |
| Llama 4 Maverick | 17B / 400B | 128 | 1M tokens | Multimodal assistant, GPT-4o replacement | Released 2025-04-05 (Hugging Face, Ollama, watsonx) |
| Llama 4 Behemoth | 288B / ~2T | 16 | undisclosed | Teacher model for Scout/Maverick, STEM-heavy | Delayed to fall 2025 (internal evaluation) |

The pattern is "same 17B active, different expert pool." Scout routes each token to 1-of-16 experts; Maverick routes 1-of-128 (plus a shared expert that every token passes through). With active parameters equal, the per-token compute cost is the same — but the diversity of the expert pool raises the model's expressive ceiling. That's why Maverick beats GPT-4o on multimodal aggregates while Scout fits in a single-H100 Int4 footprint.
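To make the routing idea concrete, here is a minimal top-1 MoE block in PyTorch. This is a sketch under our own naming, not Meta's implementation — the released models also alternate dense and MoE layers, and Maverick adds the shared expert mentioned above:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Toy 1-of-N expert routing: per-token compute is one expert FFN,
    while total capacity grows with n_experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        gate, idx = probs.max(dim=-1)                 # pick 1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():                            # run only the routed tokens
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Same active compute per token, different pools:
scout_ffn = Top1MoE(d_model=512, d_ff=2048, n_experts=16)      # "Scout": 1-of-16
maverick_ffn = Top1MoE(d_model=512, d_ff=2048, n_experts=128)  # "Maverick": 1-of-128
print(scout_ffn(torch.randn(4, 512)).shape)                    # torch.Size([4, 512])
```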

2. iRoPE architecture — the secret behind a 10M context

Scout's 10M-token reach comes from iRoPE (Interleaved Rotary Position Embeddings). Traditional transformers apply a position encoding (RoPE) at every attention layer — but as the context grows, the position signal becomes noise and length generalization collapses.

| Traditional limitation | iRoPE answer | Operational effect |
| --- | --- | --- |
| RoPE on every layer → can't generalize beyond the 8K~128K training length | Interleave a NoPE (no position encoding) layer every 4 layers | Train at 256K, extrapolate at inference to 10M |
| Long-context noise blurs token-token relationships | NoPE layers do global attention over the full causal mask | Near-perfect needle-in-a-haystack at 10M |
| RoPE index overflow at long contexts | NoPE layers remove absolute positional dependency | Whole-video and whole-codebase indexing become viable |

The idea is simple: mix layers that need positional encoding with ones that don't. RoPE layers learn local order; NoPE layers freely connect distant tokens by meaning. The result is reliable generalization to context lengths the model never saw at training time — which is why Scout retrieves accurately at 10M tokens.

```
┌─────────────── Llama 4 Scout attention stack (concept) ───────────┐
│                                                                    │
│   Layer 1  : RoPE Attention   (learns local order)                 │
│   Layer 2  : RoPE Attention                                        │
│   Layer 3  : RoPE Attention                                        │
│   Layer 4  : NoPE Attention   ← global causal mask, position-free  │
│   Layer 5  : RoPE Attention                                        │
│   Layer 6  : RoPE Attention                                        │
│   Layer 7  : RoPE Attention                                        │
│   Layer 8  : NoPE Attention   ← every 4 layers                     │
│   ...                                                              │
│                                                                    │
│   Each token routed to 1-of-16 experts (Top-1 MoE)                 │
│   → Active params at inference: 17B / Total params: 109B           │
└────────────────────────────────────────────────────────────────────┘
```
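A toy single-head rendering of that schedule — our reading of the public description, not Meta's code: RoPE layers rotate queries/keys by position, and every 4th layer skips positional encoding entirely:

```python
import torch
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Plain rotary position embedding over the last dim."""
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angle = torch.arange(seq, dtype=x.dtype)[:, None] * freq[None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * angle.cos() - x2 * angle.sin(),
                      x1 * angle.sin() + x2 * angle.cos()), dim=-1)

def attn_layer(q, k, v, layer_idx: int, nope_every: int = 4):
    """iRoPE schedule: every `nope_every`-th layer drops positional
    encoding and attends purely by content over the causal mask."""
    if (layer_idx + 1) % nope_every != 0:      # layers 1-3, 5-7, ... : RoPE
        q, k = rope(q), rope(k)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = k = v = torch.randn(1, 8, 64)              # (batch, seq, head_dim), one head
print(attn_layer(q, k, v, layer_idx=3).shape)  # layer 4 -> NoPE path
```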

3. Benchmarks — Llama 4 vs GPT-4o vs Gemini 2.0 Flash

Meta's own numbers show clear wins for in-class comparisons. As always — these are vendor benchmarks; run your own evals before committing.

| Benchmark | Scout 17B/16E | Maverick 17B/128E | GPT-4o | Gemini 2.0 Flash | What it means |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 74.3 | 80.5 | 78.0 | 77.6 | Maverick wins multi-domain reasoning |
| MATH (Hendrycks) | 50.3 | 61.2 | 76.6 | 56.1 | STEM gap vs GPT-4o / o-series remains |
| GPQA Diamond | 57.2 | 69.8 | 53.6 | 60.1 | Graduate-level science — Maverick #1 |
| ChartQA (image) | 88.8 | 90.0 | 85.7 | 87.3 | Multimodal chart understanding — both SOTA |
| DocVQA | 94.4 | 94.4 | 92.8 | 92.1 | Document-image QA — new bar set |
| MMMU | 69.4 | 73.4 | 69.1 | 71.7 | Multimodal exam aggregate — Maverick wins |
| Long Context (10M, NIAH) | ~99% | N/A | 128K limit | ~1M limit | Scout's exclusive territory |

The figure that should give a buyer pause is MATH. Llama 4 caught up or pulled ahead in general and multimodal domains, but pure math/STEM still favors OpenAI's o-series by a clear margin. If your workload is STEM-heavy reasoning, Llama 4 alone is not enough — pair it with o-series or wait for Behemoth GA. If your workload is long-document RAG, multimodal assistants, or codebase analysis, Llama 4 is the strongest cost-to-quality option on the market today.
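"Run your own evals" can start very small. A minimal exact-match loop against whatever OpenAI-compatible endpoint you stand up — the vLLM server from section 5.1 is assumed here, and the golden set and metric are placeholders for your domain data:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is a dummy by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

GOLDEN = [  # replace with your own domain QA pairs
    {"q": "What is the capital of South Korea?", "a": "Seoul"},
]

def run_eval(model: str) -> float:
    hits = 0
    for item in GOLDEN:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["q"]}],
            temperature=0.0,                 # deterministic-ish for scoring
        )
        answer = resp.choices[0].message.content or ""
        hits += item["a"].lower() in answer.lower()
    return hits / len(GOLDEN)

print(run_eval("meta-llama/Llama-4-Maverick-17B-128E-Instruct"))
```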

4. License — "Open Weights" but not OSI Open Source

Llama 4 ships under the Llama 4 Community License Agreement — and this is the place buyers most often misread. It is not Apache 2.0 or MIT, and it does not meet the OSI Open Source definition. Five clauses to inspect before you ship.

| Clause | Constraint | What to do |
| --- | --- | --- |
| ① 700M MAU cap | Products exceeding 700M monthly active users must negotiate a separate license with Meta | Startups and mid-market: irrelevant. Hyperscale SaaS: legal review |
| ② EU multimodal carve-out | EU users / EU companies cannot use the vision features (text-only allowed) | EU services must disable Maverick vision or fork to a text-only path |
| ③ Acceptable Use Policy | Threat modeling, malicious cybersec, CSAM, fraud, etc. are prohibited | Pair with Llama Guard 4 input/output filtering + your own content policy |
| ④ "Built with Llama" attribution | Products built on Llama must display "Built with Llama" | Add to About page / docs footer |
| ⑤ Derivative naming | Derivative model names must start with "Llama-" | e.g. Llama-Manoit-Customer-Support-v1 |

One-liner: "Commercial use is fine, but check the five clauses — 700M MAU, EU vision, AUP, attribution, naming." For most Korean SaaS, B2B tools, and internal projects, clause ① is moot. EU multimodal services need clause ② in their pre-launch checklist.

5. Deployment options — Ollama, vLLM, watsonx.ai, Hugging Face

Llama 4 launched day-one on Hugging Face, Ollama, watsonx.ai, Together, Fireworks, Groq, and Oracle OCI. Pick by workload character:

| Option | Best for | Strengths | Constraints |
| --- | --- | --- | --- |
| Ollama (local) | Development, PoC, offline tools | One-line `ollama run llama4:scout`, automatic GGUF quantization | Single-node, no multi-GPU sharding |
| vLLM (self-hosted) | Low-latency production serving | PagedAttention, continuous batching, OpenAI-compatible API | You operate the GPUs and NCCL |
| HF TGI | Hugging Face standardization | Token streaming, tensor parallelism | Slightly lower throughput than vLLM |
| watsonx.ai / Bedrock | Enterprise compliance | VPC isolation, SOC2, HIPAA, EU data residency | 3-5× the per-token cost vs self-hosting |
| Together / Fireworks / Groq | Cheap token consumption | $0.27/1M tokens (Maverick at some vendors) | Vendor lock-in, residency review needed |

5.1 vLLM production deploy (Maverick, 8×H100, FP8)

```bash
# Step 1: get a Hugging Face token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxx"

# Step 2: install vLLM 0.8.3+ (first release with Llama 4 MoE support)
pip install --upgrade vllm

# Step 3: serve Maverick (8×H100, FP8 quantization)
# Reduce --max-model-len if the KV cache OOMs at the full 1M window.
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --port 8000

# Step 4: hit the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [{"role": "user", "content": "Summarize Llama 4 adoption trends in one paragraph"}]
  }'
```
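Because the server is OpenAI-compatible, the official openai Python SDK works unchanged. A short streaming variant (the api_key is a dummy — vLLM does not check it by default):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user",
               "content": "Summarize Llama 4 adoption trends in one paragraph"}],
    stream=True,                                    # token-by-token streaming
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```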

5.2 Ollama local (Scout)

```bash
# Step 1: install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: pull Scout (~60GB at Q4_K_M)
ollama pull llama4:scout

# Step 3: interactive
ollama run llama4:scout

# Step 4: API mode (port 11434)
curl http://localhost:11434/api/chat -d '{
  "model": "llama4:scout",
  "messages": [
    {"role": "user", "content": "How would I summarize a 10MB log file in a single inference?"}
  ]
}'
```
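The same call from Python, if you prefer it to curl. Ollama streams NDJSON by default, so `"stream": False` is set to get a single JSON object back:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama4:scout",
        "messages": [
            {"role": "user",
             "content": "How would I summarize a 10MB log file in a single inference?"}
        ],
        "stream": False,        # one JSON response instead of NDJSON chunks
    },
    timeout=600,                # long-context generations can take a while
)
print(resp.json()["message"]["content"])
```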

Practical hardware floor: Scout fits in a single H100 (80GB) at Int4; Maverick wants at least 4×H100 (80GB) at Int4, or the full 8×H100 node for the FP8 deployment shown above. A Mac Studio M2 Ultra (192GB) can run Scout Q4 for PoC purposes at 5-10 tok/s.
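The floor follows from weight arithmetic alone. A back-of-envelope check (weights only — the KV cache, which grows with batch size and context length, comes on top, which is why long-context serving wants headroom):

```python
def weight_gb(total_params_b: float, bits: int) -> float:
    """Weight memory in GB for a parameter count (in billions) at a precision."""
    return total_params_b * 1e9 * bits / 8 / 1e9

print(f"Scout    Int4: {weight_gb(109, 4):6.1f} GB  vs 1×H100 =  80 GB")  # ~54.5
print(f"Maverick FP8:  {weight_gb(400, 8):6.1f} GB  vs 8×H100 = 640 GB")  # ~400.0
print(f"Maverick Int4: {weight_gb(400, 4):6.1f} GB  vs 4×H100 = 320 GB")  # ~200.0
```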

6. Llama Guard 4 — the 12B multimodal safety stack

Alongside Llama 4, Meta released Llama Guard 4 (12B) — a multimodal classifier that scores both text and images against 13 risk categories (violence, self-harm, sexual content, CSAM, hate, criminal facilitation, CBRN, etc.). The recommended ManoIT pipeline applies guards on both input and output:

```
┌──────────────────────── User request ────────────────────────┐
│  POST /v1/chat   { messages: [...] }                          │
└──────────────────────────────┬────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  ① Prompt Guard (86M, fast)     │
              │  Jailbreak / Prompt Injection   │
              └────────────────────────────────┘
                               │ pass
                               ▼
              ┌────────────────────────────────┐
              │  ② Llama Guard 4 (12B)          │
              │  Input safety classifier (S1~13)│
              └────────────────────────────────┘
                               │ safe
                               ▼
              ┌────────────────────────────────┐
              │  ③ Llama 4 Maverick / Scout     │
              │  Generate response              │
              └────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  ④ Llama Guard 4 (output)       │
              │  unsafe → block / redact        │
              └────────────────────────────────┘
                               │ safe
                               ▼
                       Final response
```
| Guard model | Role | Size | Latency impact |
| --- | --- | --- | --- |
| Prompt Guard 2 (86M) | First-pass jailbreak / prompt-injection filter | 86M (DeBERTa-class) | +10~20ms |
| Llama Guard 4 (12B) | 13-category classifier (input + output) | 12B multimodal | +150~300ms |
| CyberSecEval | Detect security flaws in generated code | Eval framework | Pre-deploy static check |
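As a control-flow sketch, the four stages wire together like this. The three callables are placeholders for however you serve the guard and generation models (vLLM endpoints, HF pipelines, ...) — nothing here is a real Meta API, only the ordering matters:

```python
from typing import Callable

def guarded_chat(
    user_message: str,
    prompt_guard: Callable[[str], bool],  # ① True = injection/jailbreak detected
    llama_guard: Callable[[str], str],    # ②④ returns "safe" or an S1..S13 label
    generate: Callable[[str], str],       # ③ the actual Llama 4 call
) -> str:
    if prompt_guard(user_message):
        return "Request blocked (prompt injection detected)."
    verdict = llama_guard(user_message)
    if verdict != "safe":
        return f"Request blocked (category {verdict})."
    answer = generate(user_message)
    if llama_guard(answer) != "safe":     # output-side check before returning
        return "Response withheld by output filter."
    return answer

# Wiring with trivial stubs, just to show the flow:
print(guarded_chat("hello",
                   prompt_guard=lambda s: False,
                   llama_guard=lambda s: "safe",
                   generate=lambda s: "Hi! How can I help?"))
```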

7. The Behemoth delay — what it means

Llama 4 Behemoth (2T total / 288B active) was not part of the April 5 release. It was originally targeted for April, slipped to June, and is now expected in fall 2025. Reporting attributes the delay to internal concerns about whether Behemoth's gain over Maverick justifies a public rollout.

| Date | Event | Meaning |
| --- | --- | --- |
| 2025-01 | Behemoth ~75% trained, April launch announced | Planned co-release with Scout / Maverick |
| 2025-04-05 | Only Scout / Maverick released; Behemoth slips to June | "Need more internal evaluation" |
| 2025-05-15 | Behemoth pushed to fall 2025 | Strong on STEM, unclear lift on general workloads |
| 2025-06-10 | Meta superintelligence lab reportedly forms | Top-tier model team carved out, Behemoth realigned |

The delay does not affect a Llama 4 adoption decision. Scout and Maverick already compete with GPT-4o and Gemini 2.0 Flash, and the reason enterprises pick open-weights models is data sovereignty, on-prem control, and customization — not a single benchmark crown. If STEM is your hot path, run a hybrid with Claude Opus 4.7 or the OpenAI o-series until Behemoth GA.

8. ManoIT 8-step production checklist

| Step | Check | Concrete action |
| --- | --- | --- |
| ① License review | 5 clauses — 700M MAU, EU vision, AUP, "Built with Llama", naming | Legal review → impact memo |
| ② Model selection | Scout vs Maverick by workload (text / multimodal / context length) | Long-doc RAG → Scout, multimodal → Maverick |
| ③ Infra design | FP8 / Int4 quantization, TP degree, KV cache memory budget | Maverick: 8×H100 FP8, Scout: 1×H100 Int4 |
| ④ Serving stack | vLLM / TGI / Ollama choice, autoscaling policy | vLLM 0.8.3+ recommended, KEDA for GPU pool autoscale |
| ⑤ Safety stack | Prompt Guard 2 + Llama Guard 4 on input and output | Budget +200ms latency, define category policy |
| ⑥ Eval pipeline | Korean benchmarks + domain golden datasets | KMMLU, KoBEST, in-house RAG accuracy automated |
| ⑦ Observability | Token throughput, GPU util, guard block rate, RAG hit rate | OpenTelemetry GenAI semconv + Grafana |
| ⑧ Governance | Model card, retention, audit log, rollback runbook | Model card + prompt versioning + sampled response retention |

9. Field guide — which model for which job

| Scenario | Recommended | Why |
| --- | --- | --- |
| 50 contracts side-by-side, clause comparison | Scout | 10M context, single node, lowest cost |
| Whole-codebase RAG (200K LOC) | Scout | Fit the entire repo in context |
| Multimodal customer-support chatbot (image + text) | Maverick | DocVQA / ChartQA leader, GPT-4o replacement |
| 20-hour video summarization | Scout | Native long-context, full-video indexing |
| STEM math | OpenAI o-series · Claude Opus | MATH/AIME gap remains, wait for Behemoth GA |
| Real-time voice assistant | Maverick + Whisper (STT) | <500ms response, multimodal context |
| On-prem medical / financial RAG | Scout + Llama Guard 4 | Data sovereignty, 700M MAU irrelevant |
| EU-resident multimodal service | Maverick (text only) or GPT-4o | EU vision license restriction |

10. Conclusion — what Llama 4 changes, and what it doesn't

The significance of Scout and Maverick is that open-weights models reached SOTA in multimodal and ultra-long-context territory. Through 2024 and early 2025, open models could compete with GPT-4o on English text reasoning, but multimodal work and >100K-token contexts had a real gap. iRoPE and native multimodal training closed it.

What did not change: the STEM / math reasoning gap vs OpenAI o-series remains, and that won't close until Behemoth GA. "Open Weights" is not "Open Source" — the five Llama 4 Community License clauses (especially EU multimodal exclusion and 700M MAU) require a legal review before adoption. And safety and governance cost does not vanish because the weights are public — Llama Guard 4 + Prompt Guard + your own eval pipeline are not optional.

ManoIT's recommendation for Q2 2025: standardize on Scout as the default RAG engine (legal, technical document analysis), Maverick as the multimodal assistant backbone (customer support, automation workflows). Hybridize with Claude Opus 4.7 or the OpenAI o-series where STEM is the hot path. For multimodal services with EU exposure, gate adoption on a license review. We will revisit this guidance when Behemoth GA arrives in fall 2025.


This article was co-authored by the ManoIT engineering team with Anthropic Claude Opus 4.7. Korean original published on the ManoIT tech blog. Sources: Meta AI - The Llama 4 herd, Llama 4 Official Model Page, Hugging Face Llama-4-Scout, Hugging Face Llama-4-Maverick, Hugging Face Llama 4 Release Blog, IBM watsonx.ai Llama 4 Announcement, Ollama Llama 4 Library, Computerworld - Behemoth Pause, Protect AI - Llama 4 Vulnerability Assessment.


