Jovan Chan

Posted on Jun 2 • Originally published at runaihome.com

Mistral Small 4 for Local AI in 2026: The 119B MoE Hardware Reality

#mistral #localai #hardware #gpu


TL;DR: Mistral Small 4 is a 119B MoE model with 6B active parameters per token: GPT-4-class in coding and reasoning, multimodal, and fully open-weight. The problem is that Q4_K_M quantization lands at ~74 GB, so no single consumer GPU can hold it. Your two realistic local paths are three RTX 4090s (GPU cost alone: ~$3,300–4,500 used; new cards cost far more) or a Mac Studio M3 Ultra with 96 GB ($3,999). For most readers, the Mistral API at $0.15/M input tokens removes all of this friction.

	3× RTX 4090	Mac Studio M3 Ultra 96 GB	RunPod H100 PCIe
Best for	Privacy-first, high-volume inference	Silent desk setup, macOS workflow	Occasional bursts without sunk cost
Upfront cost	~$3,300–4,500 (used GPUs only)	$3,999	$0
Ongoing cost	~$0.19–0.24/hr electricity	~$0.04/hr electricity	$1.99/hr
Speed at Q4_K_M	22–32 tok/s	~8–12 tok/s	60–80 tok/s
The catch	Motherboard with three x16 slots + 2,000 W PSU	M3 Ultra only; M4 Max 64 GB won't fit	Data leaves your machine

Honest take: Unless you're generating 10M+ tokens per month with hard privacy requirements, the API at $0.15/M input tokens is cheaper and faster than any consumer hardware setup for this model. The local path here requires real justification.

What Mistral Small 4 actually is

Released March 2026 (model version: mistral-small-4-119b-2603), Mistral Small 4 is a Mixture-of-Experts (MoE) model, an architecture Mistral AI previously shipped in its open-weight Mixtral releases. The name is confusing: "Small" refers to its inference cost, not its parameter count.

The architecture: 119 billion total parameters, 128 experts, 4 active per token. That means the model activates just 6 billion parameters per forward pass (8 billion if you include embedding and output layers). In practice, inference compute roughly matches a 6–8B dense model, not the headline 119B. This is the same efficiency trick DeepSeek V3 and Llama 4 Scout use—you store a large model in memory, but think with a fraction of it.
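To make "128 experts, 4 active per token" concrete, here is a minimal sketch of generic top-k expert routing. It illustrates the general MoE technique only; Mistral's actual router is not described in this article, and the function name route_token and the random router scores are assumptions made up for the example.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 4          # experts active per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, gate_weight)] for the top_k highest-scoring experts.

    Generic top-k gating: keep the k largest logits, then softmax over just those.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token. A real router computes these
# from the token's hidden state; random values stand in for them here.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits)
for expert, weight in chosen:
    print(f"expert {expert:3d}  gate weight {weight:.3f}")
print(f"experts skipped for this token: {NUM_EXPERTS - len(chosen)}")
```

Every expert still has to sit in memory because the next token may pick a different four, which is why the storage cost tracks the 119B total while the compute cost tracks the active subset.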

What Mistral consolidated into this release:

Magistral (their reasoning model) → Mistral Small 4 matches it on math and multi-step logic
Devstral (their coding agent model) → Mistral Small 4 has the same code execution profile
Mistral Small 3.x (their instruct model) → now superseded for general chat

Context window: 262,144 tokens. Modalities: text and image. License: Apache 2.0.

The 40% latency improvement and 3× throughput gain over Mistral Small 3 come mainly from the MoE switch: each token costs roughly 6B parameters' worth of FLOPs instead of 24B. The model is larger in memory, but faster when memory bandwidth isn't the bottleneck (i.e., on professional hardware with NVLink or other high-bandwidth interconnects).
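The compute-versus-memory trade can be put into rough numbers. The sketch below uses the common rule of thumb of about 2 FLOPs per active parameter per generated token, which is an approximation that ignores attention over the context; the parameter counts are the ones quoted in this article.

```python
ACTIVE_PARAMS_SMALL4 = 6e9    # active per token (article figure)
TOTAL_PARAMS_SMALL4 = 119e9   # stored in memory (article figure)
DENSE_PARAMS_SMALL3 = 24e9    # Mistral Small 3.x dense size (article figure)


def forward_flops_per_token(active_params):
    """Rule of thumb: ~2 FLOPs (one multiply, one add) per active parameter."""
    return 2 * active_params


compute_ratio = (forward_flops_per_token(DENSE_PARAMS_SMALL3)
                 / forward_flops_per_token(ACTIVE_PARAMS_SMALL4))
memory_ratio = TOTAL_PARAMS_SMALL4 / DENSE_PARAMS_SMALL3
active_fraction = ACTIVE_PARAMS_SMALL4 / TOTAL_PARAMS_SMALL4

print(f"compute per token: {compute_ratio:.1f}x less than the 24B dense model")
print(f"weights to store:  {memory_ratio:.1f}x more than the 24B dense model")
print(f"share of weights used per token: {active_fraction:.1%}")
```

Under these assumptions the model needs about a quarter of the arithmetic but about five times the storage, which is the whole hardware problem the rest of this article deals with.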

The benchmark story

Mistral's published numbers for Small 4:

Benchmark	Mistral Small 4	GPT-OSS 120B	Mistral Small 3.2 24B
MMLU Pro	78	~76	71
LiveCodeBench	64	63	53

MMLU Pro at 78 places it in the same tier as GPT-4-class models for general knowledge. LiveCodeBench at 64, one point ahead of the similarly sized GPT-OSS 120B at 63, is the more meaningful number for the home lab audience: this is the model you'd reach for when you want GPT-4o-level coding help running locally, with 262K context and vision input.

For direct comparison, Mistral Small 3.2 (the 24B dense model covered in our Llama 3.3 vs Qwen3 vs Mistral comparison) scored 53 on LiveCodeBench. Mistral Small 4 is not a slight upgrade—it's a different product tier.

The honest benchmark caveat: these numbers come from Mistral's release materials. Third-party evaluations on specific tasks vary, and the coding lead over dense models narrows when token budgets are short (MoE excels at reasoning-heavy tasks where many experts contribute).

The hardware math: quantization options

Mistral Small 4's GGUF quantization file sizes (bartowski builds, as of May 2026):

Quantization	File size	VRAM needed (weights + overhead, no KV cache)	Quality vs FP16
Q2_K	~45 GB	~46–48 GB	Significant degradation
Q3_K_M	~52 GB	~54–56 GB	Noticeable degradation on reasoning
Q4_K_M	~74 GB	~76–78 GB	Recommended minimum
Q5_K_M	~89 GB	~91–93 GB	Near-lossless
FP16	~244 GB	~250+ GB	Reference quality
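One way to sanity-check a table like this is to convert each file size back into effective bits per weight. The sketch below assumes 1 GB = 10^9 bytes and exactly 119 billion parameters, both of which are simplifications; the file sizes are the approximate ones from the table.

```python
TOTAL_PARAMS = 119e9  # headline parameter count (rounded)

# Approximate file sizes in GB, copied from the table above.
FILE_SIZES_GB = {
    "Q2_K": 45,
    "Q3_K_M": 52,
    "Q4_K_M": 74,
    "Q5_K_M": 89,
    "FP16": 244,
}


def effective_bits_per_weight(file_size_gb, total_params=TOTAL_PARAMS):
    """Average bits stored per parameter, given a file size in decimal GB."""
    return file_size_gb * 1e9 * 8 / total_params


for name, size_gb in FILE_SIZES_GB.items():
    bpw = effective_bits_per_weight(size_gb)
    print(f"{name:7s} {size_gb:4d} GB  ~{bpw:.2f} bits/weight")
```

The K-quants come out above their nominal bit width because they mix precisions and store scaling data alongside the weights. The FP16 row landing slightly above 16 bits suggests the parameter count, the file size, or both are rounded, so treat every figure here as approximate.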

Mistral's NVFP4 checkpoint (NVFP4 is NVIDIA's 4-bit floating-point format) is designed to slot into a single H100 80 GB for cloud deployment. For local GGUF-based inference via llama.cpp or Ollama, Q4_K_M is the realistic floor where quality doesn't obviously degrade on coding and reasoning tasks.

Total VRAM needed is weights + ~2–4 GB OS/framework overhead + KV cache; the table above leaves the KV cache out. KV cache scales linearly with context length: at 8K context it's cheap (~2–3 GB for most quantization levels); at the full 262K context window it gets expensive. Plan your headroom accordingly.
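The linear growth of the KV cache is easy to sketch. The formula below is the standard one for attention with a 16-bit cache (keys plus values, per layer, per KV head). The layer count, KV head count, and head dimension are HYPOTHETICAL placeholders picked only to show the scaling; they are not Mistral Small 4's published architecture, so the absolute numbers will differ on the real model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The leading 2 accounts for storing both K and V.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# HYPOTHETICAL architecture values, not taken from the model card.
HYPOTHETICAL = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 262_144):
    gib = kv_cache_bytes(ctx, **HYPOTHETICAL) / 2**30
    print(f"{ctx:>7,} tokens -> {gib:5.1f} GiB of KV cache")
```

Whatever the real dimensions are, going from 8K to 262K context multiplies the cache by 32, so a setup that barely fits the weights has no room for long prompts unless the runtime quantizes or offloads the cache.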

The consumer GPU path

Single RTX 4090 (24 GB): skip it

With 24 GB of VRAM, a single RTX 4090 holds only 30–35% of Q4_K_M's transformer layers. The rest, roughly 65%, spills to system RAM. With 64 GB DDR5 as overflow, expect 5–10 tokens/second. Interactive chat at that speed is painful. You'd spend the same money more usefully on cloud inference.

Even an RTX 5090 (32 GB VRAM) misses Q4_K_M by more than 40 GB. The 5090 shines at 32B-and-under models; Mistral Small 4 simply isn't in its wheelhouse.

Two RTX 4090s (48 GB combined): Q2_K only, usable

Two 4090s running tensor parallelism over PCIe give you 48 GB of combined VRAM. Q2_K at ~45–48 GB fits, just barely, with minimal CPU spillover. On the Q2_K build, expect 14–20 tok/s with 32 GB of DDR5 as buffer memory.

The problem is Q2_K quality. On reasoning and coding benchmarks, 2-bit quantization of a 119B MoE model introduces visible degradation. You're paying the full hardware premium for a meaningfully worse model. If you can afford two 4090s, read the next section.

The multi-GPU wiring specifics—NVLink vs. PCIe tensor parallelism, which frameworks support it—are covered in our multi-GPU local AI guide.

Three RTX 4090s (72 GB combined): Q4_K_M, the real entry point

Three RTX 4090s give 72 GB of combined VRAM. Q4_K_M at ~74 GB is slightly over—you'll still have a few GB spilling to RAM—but effective throughput jumps to 22–32 tok/s, which is usable for interactive work and background batch tasks.
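The fit-or-spill arithmetic behind these GPU configurations can be written down in a few lines. This is a deliberately optimistic sketch: it compares file size against combined VRAM only and ignores framework overhead and KV cache, so real spill will be larger. The helper name placement is made up for the example.

```python
def placement(model_gb, vram_gb):
    """Return (fraction of weights resident in VRAM, GB spilled to system RAM)."""
    resident_gb = min(model_gb, vram_gb)
    spill_gb = max(0.0, model_gb - vram_gb)
    return resident_gb / model_gb, spill_gb


# (label, combined VRAM in GB, quantization, file size in GB), from the article.
CONFIGS = [
    ("1x RTX 4090", 24, "Q4_K_M", 74),
    ("1x RTX 5090", 32, "Q4_K_M", 74),
    ("2x RTX 4090", 48, "Q2_K", 45),
    ("3x RTX 4090", 72, "Q4_K_M", 74),
]

for label, vram_gb, quant, model_gb in CONFIGS:
    fraction, spill_gb = placement(model_gb, vram_gb)
    print(f"{label:12s} {quant:7s} {fraction:6.1%} in VRAM, {spill_gb:4.0f} GB spilled")
```

The speed penalty is not proportional to the spilled share: any layers running from system RAM sit on the slow path for every token, which is why even a small spill is worth eliminating if a slightly smaller quantization allows it.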

What this actually costs (May 2026):

RTX 4090 used (eBay completed listings): ~$1,099 each → 3× = ~$3,300
RTX 4090 new (Amazon): ~$2,755 each → 3× = ~$8,265 (production ended October 2024; prices reflect remaining inventory)
Realistic used-market budget for 3× GPUs: $3,300–4,500

Add to that:

HEDT or server platform motherboard with 3+ PCIe x16 slots: ~$400–700
PSU: three 4090s draw ~450 W each at full load; plan for 1,800 W total system draw, which means a 2,000 W PSU minimum: ~$350–500
CPU, RAM (64 GB minimum), NVMe: ~$600–900

Total realistic build: $5,000–7,000 all-in for a three-4090 Mistral Small 4 rig. Not cheap, and you're dealing with hardware from a discontinued GPU generation that's no longer under warranty when purchased used.
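If you want to rerun the build total with your own prices, a small sketch like the following does it. The ranges are the itemized ones listed above; anything not itemized there (case, cooling, risers, shipping) is left out, so the sum is a floor rather than a full quote.

```python
# (low, high) price ranges in USD, taken from the itemized list above.
PARTS_USD = {
    "3x RTX 4090 (used)": (3300, 4500),
    "Motherboard with 3+ PCIe x16 slots": (400, 700),
    "Power supply": (350, 500),
    "CPU, RAM, NVMe": (600, 900),
}

low_total = sum(low for low, _ in PARTS_USD.values())
high_total = sum(high for _, high in PARTS_USD.values())

for part, (low, high) in PARTS_USD.items():
    print(f"{part:36s} ${low:>5,} to ${high:>5,}")
print(f"{'Itemized total':36s} ${low_total:>5,} to ${high_total:>5,}")
```

Swap in current listings before buying; used GPU prices move quickly and dominate the total.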

Power draw reality: at 1,600 W continuous draw (80% load), you're adding ~$0.19–0.24/hour in electricity at US average rates ($0.12–0.15/kWh). Running 8 hours per day, that's ~$46–58/month just in power.
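The power math is simple enough to rerun with your own electricity rate, and it extends naturally to a cost per million generated tokens. The sketch below assumes a 30-day month and single-stream decoding with no batching, which is the least favorable case for any setup; the wattage, rates, speeds, and the RunPod hourly price are the figures quoted in this article.

```python
SYSTEM_DRAW_KW = 1.6          # continuous draw of the three-4090 rig
RATES_USD_PER_KWH = (0.12, 0.15)
LOCAL_TOK_PER_S = (22, 32)    # three-4090 speed at Q4_K_M
RUNPOD_USD_PER_HOUR = 1.99
RUNPOD_TOK_PER_S = (60, 80)


def hourly_cost(draw_kw, usd_per_kwh):
    return draw_kw * usd_per_kwh


def monthly_cost(draw_kw, usd_per_kwh, hours_per_day=8, days=30):
    return hourly_cost(draw_kw, usd_per_kwh) * hours_per_day * days


def cost_per_million_tokens(usd_per_hour, tok_per_s):
    tokens_per_hour = tok_per_s * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


for rate in RATES_USD_PER_KWH:
    print(f"${rate:.2f}/kWh: ${hourly_cost(SYSTEM_DRAW_KW, rate):.2f}/hr, "
          f"${monthly_cost(SYSTEM_DRAW_KW, rate):.0f}/month at 8 h/day")

for speed in LOCAL_TOK_PER_S:
    usd = cost_per_million_tokens(hourly_cost(SYSTEM_DRAW_KW, RATES_USD_PER_KWH[0]), speed)
    print(f"local at {speed} tok/s: ${usd:.2f} per million tokens (electricity only)")

for speed in RUNPOD_TOK_PER_S:
    usd = cost_per_million_tokens(RUNPOD_USD_PER_HOUR, speed)
    print(f"RunPod H100 at {speed} tok/s: ${usd:.2f} per million tokens")
```

Batching several requests at once would lower the per-token cost for both the local rig and the rented H100, so read these as upper bounds for a single interactive user rather than as serving costs.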

The Apple Silicon path

Apple Silicon's unified memory makes it the most straightforward consumer hardware for oversized MoE models. The memory bandwidth is lower than multi-4090 setups, but you don't need PCIe topology negotiations, and the thermal situation is manageable in a desktop form factor.

Mac Studio M4 Max with 96 GB (configure-to-order): slower but fits

The Mac Studio M4 Max can be configured with up to 96 GB of unified memory as a configure-to-order option (chec
