Meta Llama 4 Scout & Maverick — The Complete Production Guide: 17B Active MoE, 10M Context, iRoPE, and the vLLM/Ollama Deployment Playbook


Published on the ManoIT Tech Blog (Korean original). On April 5, 2025, Meta released Llama 4 Scout and Llama 4 Maverick — the first open-weights models in the Llama family to use a Mixture-of-Experts (MoE) architecture, the first to be natively multimodal, and — with Scout — the first to deliver a real 10M-token context window on a single H100 GPU. This post unpacks the architecture, benchmarks, license caveats, deployment options, the Llama Guard 4 safety stack, and the Behemoth delay — from a production-deployment perspective grounded in what ManoIT recommends to customers.


1. The Llama 4 family at a glance — Scout / Maverick / Behemoth

Llama 4 was always planned as a three-model family. April 5 brought two models; the largest one, Behemoth, is still in private training. Knowing where each model fits is the first decision in any adoption review.

| Model | Active / Total params | Experts | Context | Primary use | Status |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | 17B / 109B | 16 | 10M tokens | Long-document analysis, code-base RAG, video summarization | Released 2025-04-05 (Hugging Face, Ollama) |
| Llama 4 Maverick | 17B / 400B | 128 | 1M tokens | Multimodal assistant, GPT-4o replacement | Released 2025-04-05 (Hugging Face, Ollama, watsonx) |
| Llama 4 Behemoth | 288B / ~2T | 16 | undisclosed | Teacher model for Scout/Maverick, STEM-heavy | Delayed to fall 2025 (internal evaluation) |

The pattern is "same 17B active, different expert pool." Scout routes each token to 1-of-16 experts; Maverick routes 1-of-128 (plus a shared expert that every token passes through). With active parameters equal, the per-token compute cost is the same — but the diversity of the expert pool raises the model's expressive ceiling. That's why Maverick beats GPT-4o on multimodal aggregates while Scout fits in a single-H100 Int4 footprint.
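To make the routing idea concrete, here is a minimal top-1 MoE block in PyTorch. This is a sketch under our own naming, not Meta's implementation — the released models also alternate dense and MoE layers, and Maverick adds the shared expert mentioned above:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Toy 1-of-N expert routing: per-token compute is one expert FFN,
    while total capacity grows with n_experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        gate, idx = probs.max(dim=-1)                 # pick 1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():                            # run only the routed tokens
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Same active compute per token, different pools:
scout_ffn = Top1MoE(d_model=512, d_ff=2048, n_experts=16)      # "Scout": 1-of-16
maverick_ffn = Top1MoE(d_model=512, d_ff=2048, n_experts=128)  # "Maverick": 1-of-128
print(scout_ffn(torch.randn(4, 512)).shape)                    # torch.Size([4, 512])
```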

2. iRoPE architecture — the secret behind a 10M context

Scout's 10M-token reach comes from iRoPE (Interleaved Rotary Position Embeddings). Traditional transformers apply a position encoding (RoPE) at every attention layer — but as the context grows, the position signal becomes noise and length generalization collapses.

| Traditional limitation | iRoPE answer | Operational effect |
| --- | --- | --- |
| RoPE on every layer → can't generalize beyond the 8K~128K training length | Interleave a NoPE (no position encoding) layer every 4 layers | Train at 256K, extrapolate at inference to 10M |
| Long-context noise blurs token-token relationships | NoPE layers do global attention over the full causal mask | Near-perfect needle-in-a-haystack at 10M |
| RoPE index overflow at long contexts | NoPE layers remove absolute positional dependency | Whole-video and whole-codebase indexing become viable |

The idea is simple: mix layers that need positional encoding with ones that don't. RoPE layers learn local order; NoPE layers freely connect distant tokens by meaning. The result is reliable generalization to context lengths the model never saw at training time — which is why Scout retrieves accurately at 10M tokens.

```
┌─────────────── Llama 4 Scout attention stack (concept) ───────────┐
│                                                                    │
│   Layer 1  : RoPE Attention   (learns local order)                 │
│   Layer 2  : RoPE Attention                                        │
│   Layer 3  : RoPE Attention                                        │
│   Layer 4  : NoPE Attention   ← global causal mask, position-free  │
│   Layer 5  : RoPE Attention                                        │
│   Layer 6  : RoPE Attention                                        │
│   Layer 7  : RoPE Attention                                        │
│   Layer 8  : NoPE Attention   ← every 4 layers                     │
│   ...                                                              │
│                                                                    │
│   Each token routed to 1-of-16 experts (Top-1 MoE)                 │
│   → Active params at inference: 17B / Total params: 109B           │
└────────────────────────────────────────────────────────────────────┘
```
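A toy single-head rendering of that schedule — our reading of the public description, not Meta's code: RoPE layers rotate queries/keys by position, and every 4th layer skips positional encoding entirely:

```python
import torch
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Plain rotary position embedding over the last dim."""
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angle = torch.arange(seq, dtype=x.dtype)[:, None] * freq[None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * angle.cos() - x2 * angle.sin(),
                      x1 * angle.sin() + x2 * angle.cos()), dim=-1)

def attn_layer(q, k, v, layer_idx: int, nope_every: int = 4):
    """iRoPE schedule: every `nope_every`-th layer drops positional
    encoding and attends purely by content over the causal mask."""
    if (layer_idx + 1) % nope_every != 0:      # layers 1-3, 5-7, ... : RoPE
        q, k = rope(q), rope(k)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = k = v = torch.randn(1, 8, 64)              # (batch, seq, head_dim), one head
print(attn_layer(q, k, v, layer_idx=3).shape)  # layer 4 -> NoPE path
```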

3. Benchmarks — Llama 4 vs GPT-4o vs Gemini 2.0 Flash

Meta's own numbers show clear wins for in-class comparisons. As always — these are vendor benchmarks; run your own evals before committing.

| Benchmark | Scout 17B/16E | Maverick 17B/128E | GPT-4o | Gemini 2.0 Flash | What it means |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 74.3 | 80.5 | 78.0 | 77.6 | Maverick wins multi-domain reasoning |
| MATH (Hendrycks) | 50.3 | 61.2 | 76.6 | 56.1 | STEM gap vs GPT-4o / o-series remains |
| GPQA Diamond | 57.2 | 69.8 | 53.6 | 60.1 | Graduate-level science — Maverick #1 |
| ChartQA (image) | 88.8 | 90.0 | 85.7 | 87.3 | Multimodal chart understanding — both SOTA |
| DocVQA | 94.4 | 94.4 | 92.8 | 92.1 | Document-image QA — new bar set |
| MMMU | 69.4 | 73.4 | 69.1 | 71.7 | Multimodal exam aggregate — Maverick wins |
| Long Context (10M, NIAH) | ~99% | N/A | 128K limit | ~1M limit | Scout's exclusive territory |

The figure that should give a buyer pause is MATH. Llama 4 caught up or pulled ahead in general and multimodal domains, but pure math/STEM still favors OpenAI's o-series by a clear margin. If your workload is STEM-heavy reasoning, Llama 4 alone is not enough — pair it with o-series or wait for Behemoth GA. If your workload is long-document RAG, multimodal assistants, or codebase analysis, Llama 4 is the strongest cost-to-quality option on the market today.
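"Run your own evals" can start very small. A minimal exact-match loop against whatever OpenAI-compatible endpoint you stand up — the vLLM server from section 5.1 is assumed here, and the golden set and metric are placeholders for your domain data:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is a dummy by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

GOLDEN = [  # replace with your own domain QA pairs
    {"q": "What is the capital of South Korea?", "a": "Seoul"},
]

def run_eval(model: str) -> float:
    hits = 0
    for item in GOLDEN:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["q"]}],
            temperature=0.0,                 # deterministic-ish for scoring
        )
        answer = resp.choices[0].message.content or ""
        hits += item["a"].lower() in answer.lower()
    return hits / len(GOLDEN)

print(run_eval("meta-llama/Llama-4-Maverick-17B-128E-Instruct"))
```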

4. License — "Open Weights" but not OSI Open Source

Llama 4 ships under the Llama 4 Community License Agreement — and this is the place buyers most often misread. It is not Apache 2.0 or MIT, and it does not meet the OSI Open Source definition. Five clauses to inspect before you ship.

| Clause | Constraint | What to do |
| --- | --- | --- |
| ① 700M MAU cap | Products exceeding 700M monthly active users must negotiate a separate license with Meta | Startups and mid-market: irrelevant. Hyperscale SaaS: legal review |
| ② EU multimodal carve-out | EU users / EU companies cannot use the vision features (text-only allowed) | EU services must disable Maverick vision or fork to a text-only path |
| ③ Acceptable Use Policy | Threat modeling, malicious cybersec, CSAM, fraud, etc. are prohibited | Pair with Llama Guard 4 input/output filtering + your own content policy |
| ④ "Built with Llama" attribution | Products built on Llama must display "Built with Llama" | Add to About page / docs footer |
| ⑤ Derivative naming | Derivative model names must start with "Llama-" | e.g. Llama-Manoit-Customer-Support-v1 |

One-liner: "Commercial use is fine, but check the five clauses — 700M MAU, EU vision, AUP, attribution, naming." For most Korean SaaS, B2B tools, and internal projects, clause ① is moot. EU multimodal services need clause ② in their pre-launch checklist.

5. Deployment options — Ollama, vLLM, watsonx.ai, Hugging Face

Llama 4 launched day-one on Hugging Face, Ollama, watsonx.ai, Together, Fireworks, Groq, and Oracle OCI. Pick by workload character:

| Option | Best for | Strengths | Constraints |
| --- | --- | --- | --- |
| Ollama (local) | Development, PoC, offline tools | One-line `ollama run llama4:scout`, automatic GGUF quantization | Single-node, no multi-GPU sharding |
| vLLM (self-hosted) | Low-latency production serving | PagedAttention, continuous batching, OpenAI-compatible API | You operate the GPUs and NCCL |
| HF TGI | Hugging Face standardization | Token streaming, tensor parallelism | Slightly lower throughput than vLLM |
| watsonx.ai / Bedrock | Enterprise compliance | VPC isolation, SOC2, HIPAA, EU data residency | 3-5× the per-token cost vs self-hosting |
| Together / Fireworks / Groq | Cheap token consumption | $0.27/1M tokens (Maverick at some vendors) | Vendor lock-in, residency review needed |

5.1 vLLM production deploy (Maverick, 8×H100, FP8)

```bash
# Step 1: get a Hugging Face token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxx"

# Step 2: install vLLM 0.8.3+ (first release with Llama 4 MoE support)
pip install --upgrade vllm

# Step 3: serve Maverick (8×H100, FP8 quantization)
# Reduce --max-model-len if the KV cache OOMs at the full 1M window.
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 1048576 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --port 8000

# Step 4: hit the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    "messages": [{"role": "user", "content": "Summarize Llama 4 adoption trends in one paragraph"}]
  }'
```
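Because the server is OpenAI-compatible, the official openai Python SDK works unchanged. A short streaming variant (the api_key is a dummy — vLLM does not check it by default):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    messages=[{"role": "user",
               "content": "Summarize Llama 4 adoption trends in one paragraph"}],
    stream=True,                                    # token-by-token streaming
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```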

5.2 Ollama local (Scout)

```bash
# Step 1: install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Step 2: pull Scout (~60GB at Q4_K_M)
ollama pull llama4:scout

# Step 3: interactive
ollama run llama4:scout

# Step 4: API mode (port 11434)
curl http://localhost:11434/api/chat -d '{
  "model": "llama4:scout",
  "messages": [
    {"role": "user", "content": "How would I summarize a 10MB log file in a single inference?"}
  ]
}'
```
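The same call from Python, if you prefer it to curl. Ollama streams NDJSON by default, so `"stream": False` is set to get a single JSON object back:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama4:scout",
        "messages": [
            {"role": "user",
             "content": "How would I summarize a 10MB log file in a single inference?"}
        ],
        "stream": False,        # one JSON response instead of NDJSON chunks
    },
    timeout=600,                # long-context generations can take a while
)
print(resp.json()["message"]["content"])
```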

Practical hardware floor: Scout fits in a single H100 (80GB) at Int4; Maverick wants at least 4×H100 (80GB) at Int4, or the full 8×H100 node for the FP8 deployment shown above. A Mac Studio M2 Ultra (192GB) can run Scout Q4 for PoC purposes at 5-10 tok/s.
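The floor follows from weight arithmetic alone. A back-of-envelope check (weights only — the KV cache, which grows with batch size and context length, comes on top, which is why long-context serving wants headroom):

```python
def weight_gb(total_params_b: float, bits: int) -> float:
    """Weight memory in GB for a parameter count (in billions) at a precision."""
    return total_params_b * 1e9 * bits / 8 / 1e9

print(f"Scout    Int4: {weight_gb(109, 4):6.1f} GB  vs 1×H100 =  80 GB")  # ~54.5
print(f"Maverick FP8:  {weight_gb(400, 8):6.1f} GB  vs 8×H100 = 640 GB")  # ~400.0
print(f"Maverick Int4: {weight_gb(400, 4):6.1f} GB  vs 4×H100 = 320 GB")  # ~200.0
```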

6. Llama Guard 4 — the 12B multimodal safety stack

Alongside Llama 4, Meta released Llama Guard 4 (12B) — a multimodal classifier that scores both text and images against 13 risk categories (violence, self-harm, sexual content, CSAM, hate, criminal facilitation, CBRN, etc.). The recommended ManoIT pipeline applies guards on both input and output:

```
┌──────────────────────── User request ────────────────────────┐
│  POST /v1/chat   { messages: [...] }                          │
└──────────────────────────────┬────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  ① Prompt Guard (86M, fast)     │
              │  Jailbreak / Prompt Injection   │
              └────────────────────────────────┘
                               │ pass
                               ▼
              ┌────────────────────────────────┐
              │  ② Llama Guard 4 (12B)          │
              │  Input safety classifier (S1~13)│
              └────────────────────────────────┘
                               │ safe
                               ▼
              ┌────────────────────────────────┐
              │  ③ Llama 4 Maverick / Scout     │
              │  Generate response              │
              └────────────────────────────────┘
                               │
                               ▼
              ┌────────────────────────────────┐
              │  ④ Llama Guard 4 (output)       │
              │  unsafe → block / redact        │
              └────────────────────────────────┘
                               │ safe
                               ▼
                       Final response
```
| Guard model | Role | Size | Latency impact |
| --- | --- | --- | --- |
| Prompt Guard 2 (86M) | First-pass jailbreak / prompt-injection filter | 86M (DeBERTa-class) | +10~20ms |
| Llama Guard 4 (12B) | 13-category classifier (input + output) | 12B multimodal | +150~300ms |
| CyberSecEval | Detect security flaws in generated code | Eval framework | Pre-deploy static check |
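As a control-flow sketch, the four stages wire together like this. The three callables are placeholders for however you serve the guard and generation models (vLLM endpoints, HF pipelines, ...) — nothing here is a real Meta API, only the ordering matters:

```python
from typing import Callable

def guarded_chat(
    user_message: str,
    prompt_guard: Callable[[str], bool],  # ① True = injection/jailbreak detected
    llama_guard: Callable[[str], str],    # ②④ returns "safe" or an S1..S13 label
    generate: Callable[[str], str],       # ③ the actual Llama 4 call
) -> str:
    if prompt_guard(user_message):
        return "Request blocked (prompt injection detected)."
    verdict = llama_guard(user_message)
    if verdict != "safe":
        return f"Request blocked (category {verdict})."
    answer = generate(user_message)
    if llama_guard(answer) != "safe":     # output-side check before returning
        return "Response withheld by output filter."
    return answer

# Wiring with trivial stubs, just to show the flow:
print(guarded_chat("hello",
                   prompt_guard=lambda s: False,
                   llama_guard=lambda s: "safe",
                   generate=lambda s: "Hi! How can I help?"))
```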

7. The Behemoth delay — what it means

Llama 4 Behemoth (2T total / 288B active) was not part of the April 5 release. It was originally targeted for April, slipped to June, and is now expected in fall 2025. Reporting attributes the delay to internal concerns about whether Behemoth's gain over Maverick justifies a public rollout.

| Date | Event | Meaning |
| --- | --- | --- |
| 2025-01 | Behemoth ~75% trained, April launch announced | Planned co-release with Scout / Maverick |
| 2025-04-05 | Only Scout / Maverick released; Behemoth slips to June | "Need more internal evaluation" |
| 2025-05-15 | Behemoth pushed to fall 2025 | Strong on STEM, unclear lift on general workloads |
| 2025-06-10 | Meta superintelligence lab reportedly forms | Top-tier model team carved out, Behemoth realigned |

The delay does not affect a Llama 4 adoption decision. Scout and Maverick already compete with GPT-4o and Gemini 2.0 Flash, and the reason enterprises pick open-weights models is data sovereignty, on-prem control, and customization — not a single benchmark crown. If STEM is your hot path, run a hybrid with Claude Opus 4.7 or the OpenAI o-series until Behemoth GA.

8. ManoIT 8-step production checklist

| Step | Check | Concrete action |
| --- | --- | --- |
| ① License review | 5 clauses — 700M MAU, EU vision, AUP, "Built with Llama", naming | Legal review → impact memo |
| ② Model selection | Scout vs Maverick by workload (text / multimodal / context length) | Long-doc RAG → Scout, multimodal → Maverick |
| ③ Infra design | FP8 / Int4 quantization, TP degree, KV cache memory budget | Maverick: 8×H100 FP8, Scout: 1×H100 Int4 |
| ④ Serving stack | vLLM / TGI / Ollama choice, autoscaling policy | vLLM 0.8.3+ recommended, KEDA for GPU pool autoscale |
| ⑤ Safety stack | Prompt Guard 2 + Llama Guard 4 on input and output | Budget +200ms latency, define category policy |
| ⑥ Eval pipeline | Korean benchmarks + domain golden datasets | KMMLU, KoBEST, in-house RAG accuracy automated |
| ⑦ Observability | Token throughput, GPU util, guard block rate, RAG hit rate | OpenTelemetry GenAI semconv + Grafana |
| ⑧ Governance | Model card, retention, audit log, rollback runbook | Model card + prompt versioning + sampled response retention |

9. Field guide — which model for which job

| Scenario | Recommended | Why |
| --- | --- | --- |
| 50 contracts side-by-side, clause comparison | Scout | 10M context, single node, lowest cost |
| Whole-codebase RAG (200K LOC) | Scout | Fit the entire repo in context |
| Multimodal customer-support chatbot (image + text) | Maverick | DocVQA / ChartQA leader, GPT-4o replacement |
| 20-hour video summarization | Scout | Native long-context, full-video indexing |
| STEM math | OpenAI o-series · Claude Opus | MATH/AIME gap remains, wait for Behemoth GA |
| Real-time voice assistant | Maverick + Whisper (STT) | <500ms response, multimodal context |
| On-prem medical / financial RAG | Scout + Llama Guard 4 | Data sovereignty, 700M MAU irrelevant |
| EU-resident multimodal service | Maverick (text only) or GPT-4o | EU vision license restriction |

10. Conclusion — what Llama 4 changes, and what it doesn't

The significance of Scout and Maverick is that open-weights models reached SOTA in multimodal and ultra-long-context territory. Through 2024 and early 2025, open models could compete with GPT-4o on English text reasoning, but multimodal work and >100K-token contexts had a real gap. iRoPE and native multimodal training closed it.

What did not change: the STEM / math reasoning gap vs OpenAI o-series remains, and that won't close until Behemoth GA. "Open Weights" is not "Open Source" — the five Llama 4 Community License clauses (especially EU multimodal exclusion and 700M MAU) require a legal review before adoption. And safety and governance cost does not vanish because the weights are public — Llama Guard 4 + Prompt Guard + your own eval pipeline are not optional.

ManoIT's recommendation for Q2 2025: standardize on Scout as the default RAG engine (legal, technical document analysis), Maverick as the multimodal assistant backbone (customer support, automation workflows). Hybridize with Claude Opus 4.7 or the OpenAI o-series where STEM is the hot path. For multimodal services with EU exposure, gate adoption on a license review. We will revisit this guidance when Behemoth GA arrives in fall 2025.


This article was co-authored by the ManoIT engineering team with Anthropic Claude Opus 4.7. Korean original published on the ManoIT tech blog. Sources: Meta AI - The Llama 4 herd, Llama 4 Official Model Page, Hugging Face Llama-4-Scout, Hugging Face Llama-4-Maverick, Hugging Face Llama 4 Release Blog, IBM watsonx.ai Llama 4 Announcement, Ollama Llama 4 Library, Computerworld - Behemoth Pause, Protect AI - Llama 4 Vulnerability Assessment.


