DEV Community

Jangwook Kim

Posted on • Originally published at effloow.com

Llama 4 Scout: Run Meta's Vision Model on One GPU

In April 2025, Meta released Llama 4 Scout and fundamentally changed the economics of running a capable vision-language model. Before Scout, getting GPT-4-class multimodal performance on your own hardware required either massive compute or painful trade-offs. Scout changed that equation: 109 billion total parameters, but only 17 billion activate per token — and it fits on hardware you might already own.

A year on, the ecosystem around Scout has matured significantly. Ollama, vLLM, and cloud providers like Groq all support it. Fine-tuning pipelines exist. And the 10 million token context window remains unmatched in the open-weight world. This guide covers everything you need to actually deploy and use it.

Why Scout's Architecture Is Different

Scout is a Mixture-of-Experts (MoE) model with 109 billion total parameters organized across 16 experts. During inference, only 2 experts activate per token — meaning the model uses just 17 billion active parameters for any given computation.

This matters for a deceptively simple reason: compute per token scales with active parameters, not total parameters. Scout generates tokens at roughly the speed of a dense 17B model while delivering quality closer to what you'd expect from a 50–70B dense model, because the experts specialize. Memory is the catch: all 109B parameters must stay resident, which is why Scout still needs about 58GB even at 4-bit quantization.
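One nuance worth pinning down with arithmetic: per-token compute touches only the 17B active parameters, but every expert's weights must sit in memory. A back-of-the-envelope sketch (illustrative only; real footprints add KV cache and runtime overhead):

```python
# Memory scales with total parameters (all experts resident);
# per-token compute scales with active parameters only.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

resident = weight_gb(109, 16)  # all 109B at FP16 must fit: ~218 GB
active = weight_gb(17, 16)     # weights touched per token:  ~34 GB

print(f"Resident: ~{resident:.0f} GB, active per token: ~{active:.0f} GB")
```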

Beyond the MoE design, Scout uses an early fusion approach for multimodality. Rather than processing images through a separate encoder and projecting the features into a language model, Scout's vision and language understanding are unified from the start. Text tokens and image patch tokens flow through the same transformer layers. This makes multi-image reasoning and long-context vision tasks substantially more reliable than bolt-on architectures.

The context window is 10 million tokens — genuinely unique in open-weight models. In practice, this means Scout can reason over entire codebases, process hundreds of images in a single prompt, or read through a full research corpus without chunking.

Hardware Requirements: What You Actually Need

You'll often see "24GB VRAM" quoted as the floor, and that's technically true, but the story has nuance. Here's the realistic breakdown:

24GB consumer GPU (RTX 3090/4090)

At 1.78-bit quantization, Scout fits into 22–23GB of VRAM with approximately 20 tokens per second throughput. This is usable for local experimentation but not production. At Q4_K_M quantization the model is ~58GB — it won't fit in 24GB, so 1.78-bit is your only option at this tier. Quality is noticeably reduced versus higher-precision variants.

Single H100 (80GB)

This is the sweet spot for production deployments. Scout runs at INT4 (4-bit quantization) on a single H100, achieving 460+ tokens per second on Groq's infrastructure. At FP8, you get better quality on reasoning tasks but need roughly 50GB. The H100 supports FP8 natively; A100 does not.

Apple Silicon (M-series Macs)

Scout requires 64GB+ of unified memory on Apple Silicon at Q4_K_M quantization (~58GB). An M4 Pro with 64GB gets you 15–25 tokens per second through Metal GPU acceleration, which is comfortable for interactive use. Expect the ~58GB download to take well over an hour on a 100 Mbps connection, and make sure you have 70GB of free disk space.

Memory at a glance:

Hardware            Quantization          VRAM / RAM     Speed (tok/s)   Use Case
RTX 3090 / 4090     1.78-bit              ~22GB          ~20             Local dev only
Single H100 80GB    INT4 / FP8            40–50GB        200–460+        Production API
2× A100 80GB        FP8 tensor parallel   ~100GB total   300+            High-throughput
Apple M4 Pro 64GB   Q4_K_M                58GB unified   15–25           Local developer
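The weight-memory figures above follow from simple arithmetic: total parameters times bits per weight. The bits-per-weight values below are rough effective averages (Q4_K_M mixes 4- and 6-bit blocks, and real GGUF files vary by which tensors get quantized), so treat these as estimates, not exact file sizes.

```python
# Estimate quantized weight size: total params x bits-per-weight / 8.
TOTAL_PARAMS = 109e9  # all experts must fit, not just the active 17B

for name, bpw in [("Q4_K_M", 4.3), ("1.78-bit (IQ1_M)", 1.78)]:
    gb = TOTAL_PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")
```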

Running Scout with Ollama

Ollama is the fastest path to a working local deployment. You need Ollama 0.6 or later — Scout support was added in that release.

# Check and update Ollama
ollama --version
# If < 0.6, download latest from ollama.com

# Pull and run Scout (downloads Q4_K_M, ~58GB)
ollama run llama4

# For a smaller footprint, pull a specific quantization
ollama pull llama4:scout-17b-16e-instruct-q4_k_m

Ollama automatically uses Metal on Apple Silicon and CUDA on NVIDIA GPUs. No additional configuration required.

Sending an image:

# Via CLI
ollama run llama4 "Describe this image: ./path/to/image.png"
# (the CLI detects image file paths embedded in the prompt)

# Via REST API (local server)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama4",
    "prompt": "What is in this image?",
    "images": ["'$(base64 -i image.png)'"]
  }'

Python client:

import ollama

response = ollama.chat(
    model="llama4",
    messages=[
        {
            "role": "user",
            "content": "Describe the architecture in this diagram.",
            "images": ["./architecture.png"],
        }
    ],
)
print(response["message"]["content"])

Deploying with vLLM for Production

For production workloads, vLLM v0.8.3 or later is required. Scout and Maverick support was added in that version.

Single GPU deployment:

pip install "vllm>=0.8.3"

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --max-model-len 131072 \
  --quantization fp8

This serves Scout with ~130K effective context on a single H100 at FP8 quality. The --max-model-len parameter controls context length; lower values reduce KV cache memory requirements.
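To see why --max-model-len dominates memory planning, here's a rough KV-cache estimate. The layer and head counts are assumptions taken from Scout's published config (48 layers, 8 KV heads, head dim 128); check the model's config.json, and note that real deployments shrink this further with FP8 KV cache and attention optimizations.

```python
# Rough upper bound on KV cache for one sequence at FP16.
# Per-token bytes = 2 (K and V) x layers x kv_heads x head_dim x 2 (FP16 bytes).
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128  # assumed from Scout's config.json
BYTES_FP16 = 2

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 196,608 B/token
for ctx in (131_072, 262_144):
    print(f"{ctx:>7} tokens -> ~{per_token * ctx / 1e9:.0f} GB KV cache")
```

At ~26 GB for a single full-length 131K sequence, it's clear why halving --max-model-len is the first lever to pull when a deployment runs out of memory.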

Tensor-parallel (2× GPU), FP8:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --max-model-len 262144

Two GPUs double the effective context to ~260K, and FP8 improves quality on reasoning-heavy tasks compared to INT4.

Large-scale (8× H100) with 1M token context:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000

For Docker deployments, add --ipc=host — NCCL requires shared memory for inter-GPU communication and will fail without it.

docker run --gpus all --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2

vLLM exposes an OpenAI-compatible API, so any client that works with OpenAI's SDK will work with Scout out of the box.

Using Cloud APIs (No Hardware Required)

If you don't want to manage GPU infrastructure, several providers offer Scout via API:

Groq has the most transparent pricing: $0.11 per million input tokens, $0.34 per million output tokens. Scout runs at 460+ tokens per second on Groq's LPU hardware — significantly faster than GPU-based inference for most request sizes.
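Per-token pricing translates to a monthly budget with simple arithmetic. A minimal sketch using the Groq rates quoted above; the request volume and token counts are made-up example numbers, and you should verify current rates before budgeting:

```python
# Groq's listed Scout pricing, in dollars per token.
PRICE_IN = 0.11 / 1_000_000
PRICE_OUT = 0.34 / 1_000_000

def monthly_cost(requests: int, in_tokens: int, out_tokens: int) -> float:
    """Dollars per month for a uniform request profile."""
    return requests * (in_tokens * PRICE_IN + out_tokens * PRICE_OUT)

# e.g. 100k requests/month, 2k input + 500 output tokens each
print(f"${monthly_cost(100_000, 2_000, 500):.2f}/month")  # $39.00/month
```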

Together AI, Fireworks AI, and inference.net also serve Scout with competitive pricing ranging from $0.07–$0.30 per million input tokens. Together AI requires a minimum $5 credit to start.

IBM watsonx.ai offers both Scout and Maverick for enterprise deployments with SLA guarantees.

All providers serve Scout through an OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="your-groq-api-key",
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)

Vision Capabilities in Practice

Scout's early fusion multimodality is most noticeable in tasks that require reasoning across multiple images or correlating visual content with long text.

What it does well:

  • Multi-image reasoning: Ask Scout to compare three product screenshots or track changes across a diagram sequence — it handles cross-image references reliably.
  • Document parsing: Long PDFs or technical diagrams with text overlay. The 10M context window means you can pass hundreds of pages of scanned documents without chunking.
  • Code screenshot analysis: Drop in a screenshot of an error message or a UI bug; Scout can read it accurately and suggest fixes.
  • Image captioning at scale: Batch image description tasks that previously required expensive GPT-4V API calls.

Where it has limits:

Scout's vision benchmarks (MMMU, MATH-Vision) lag behind Gemma 4's 26B MoE model for pure mathematical image reasoning; on competitive vision-only benchmarks, Gemma 4 scores higher. Scout's advantage is its context window and multi-image handling, not raw single-image QA accuracy.

Fine-Tuning Scout on a Single GPU

Fine-tuning Llama 4 Scout requires careful handling of the MoE architecture. Standard frameworks struggle with Scout in 4-bit precision because of the vision layers — most implementations OOM. Unsloth is currently the only framework that handles QLoRA fine-tuning for Scout reliably.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_seq_length=8192,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Key considerations:

  • Do not quantize vision layers. Unsloth's dynamic quantization selectively applies lower bits to MoE layers while leaving attention and vision layers at higher precision.
  • Unsloth achieves 1.5× faster training and 50% lower VRAM than vanilla QLoRA on Scout.
  • With QLoRA + Unsloth, a single 24GB GPU can run fine-tuning — though training will be slow (~4 samples/second at batch size 1).

How Scout Compares to Alternatives

Model                 Active Params    Context   Vision   License             Min VRAM (Q4)
Llama 4 Scout         17B (of 109B)    10M       Native   Llama 4 Community   58GB
Gemma 4 26B MoE       3.8B (of 26B)    128K      Native   Apache 2.0          ~14GB
Qwen 3.5 VL 72B       72B dense        128K      Native   Apache 2.0          ~40GB
Mistral Small 4 24B   24B dense        128K      Native   Apache 2.0          ~13GB
Llama 4 Maverick      17B (of 400B)    1M        Native   Llama 4 Community   Multi-GPU only

Scout's unique position is the 10M context window — no other open-weight model is close. If your use case involves long-document reasoning, multi-image analysis, or codebase-wide context, Scout has no real competitor. If you need lighter hardware footprint and don't need extended context, Gemma 4 26B MoE is a compelling alternative at roughly a quarter of the memory cost.

One licensing note: Scout uses the Llama 4 Community License, not a fully permissive open-source license. Commercial use is allowed, but there are restrictions on using Llama outputs to train competing models and deployment at scale (>700M MAU) requires a separate agreement. For Apache 2.0 with no restrictions, Gemma 4 or Qwen 3.5 are cleaner choices.

Common Mistakes and Troubleshooting

Ollama shows OOM with llama4

You're likely on a machine with less than 64GB unified memory or VRAM. Try the explicit 1.78-bit variant: ollama pull llama4:scout-17b-16e-instruct-iq1_m — this fits in ~22GB.

vLLM fails to initialize Scout

Ensure you're on v0.8.3+. Earlier versions don't have the Scout model runner. Also check --ipc=host is set for multi-GPU Docker deployments.

Fine-tuning OOM despite using 4-bit

Use Unsloth, not vanilla HuggingFace PEFT. Standard PEFT cannot handle Scout's vision layers at 4-bit and will exceed VRAM on any consumer GPU.

Slow vision inference on Apple Silicon

Ollama uses Metal automatically, but confirm via ollama ps — it should show llama4 using GPU. If not, try OLLAMA_GPU_DRIVER=metal ollama serve.

Model gives incoherent outputs at 1.78-bit

This is expected. At extreme quantization, model quality degrades noticeably. If your use case requires consistent quality, you need hardware that can hold the ~58GB Q4 weights: 64GB+ unified memory on Apple Silicon, or an 80GB-class GPU on NVIDIA.

FAQ

Q: Is Llama 4 Scout free to use commercially?

Yes, with conditions. The Llama 4 Community License allows commercial use at no cost for most companies. Restrictions apply if you use Llama outputs to train competing models or if your deployment exceeds 700 million monthly active users — at that scale, Meta requires a separate license agreement.

Q: How does the 10M token context actually work in practice?

Few providers serve Scout at the full 10M token limit due to KV cache memory costs. Groq and similar APIs typically cap at 128K–1M tokens for hosted inference. Self-hosted vLLM on 8× H100s can serve up to 1M token context. The full 10M requires specialized infrastructure.

Q: Can Scout process video?

Not natively. Scout handles static images — including multiple images in a single prompt. For video, you'd need to sample frames and pass them as a sequence of images, which works within Scout's long-context capability but requires preprocessing.
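The frame-sampling step is plain index arithmetic; decoding the chosen frames is up to you (e.g. OpenCV's VideoCapture or ffmpeg). A minimal sketch, where the helper name and parameters are illustrative:

```python
# Pick evenly spaced frame indices, then decode just those frames and
# pass them to Scout as a list of images.

def sample_frame_indices(total_frames: int, fps: float, every_s: float) -> list[int]:
    """Return frame indices sampled every `every_s` seconds."""
    step = max(1, round(fps * every_s))
    return list(range(0, total_frames, step))

# 30 fps, 10-second clip, one frame every 2 seconds -> 5 frames
print(sample_frame_indices(300, 30.0, 2.0))  # [0, 60, 120, 180, 240]
```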

Q: What's the difference between Scout and Maverick?

Both have 17B active parameters per token, but Maverick is a 400B total parameter model with 128 experts vs Scout's 109B / 16 experts. Maverick delivers higher quality on complex tasks but requires multi-GPU setups to run locally. Scout is the practical choice for single-GPU deployment.

Q: Does Scout support function calling / tool use?

Yes. The instruct variant supports structured tool use and JSON mode. It works well for agentic workflows where the model needs to parse visual inputs and trigger downstream API calls based on what it sees.

Key Takeaways

Llama 4 Scout represents a meaningful shift in what's achievable with a single GPU in the open-weight model world. The 10M token context window is genuinely unprecedented and unlocks use cases — multi-document reasoning, codebase analysis, large-scale image processing — that required proprietary APIs not long ago.

The hardware requirements are higher than the "17B active parameter" headline suggests. You need 24GB VRAM at heavily quantized precision for local use, or 64GB+ unified memory on Apple Silicon at practical quality. For production, a single H100 is the realistic minimum.

If your workload involves vision, long-context, or both — and you want to avoid per-token API costs at scale — Scout is the best open-weight option available in 2026. For teams that don't need the extended context and want lighter hardware requirements, Gemma 4 26B MoE is the alternative worth benchmarking.

Bottom Line

Llama 4 Scout is the best open-weight vision-language model for long-context tasks in 2026. The 10M token context window is unique, the MoE architecture makes single-GPU deployment feasible, and the ecosystem (Ollama, vLLM, Groq) is mature. Hardware requirements are steeper than the marketing implies — budget for a 64GB Mac or H100 for real-world use.
