This article was originally published on runaihome.com
TL;DR: Mistral Small 4 is a 119B MoE model with 6B active parameters per token—GPT-4-class in coding and reasoning, multimodal, and fully open-weight. The problem is Q4_K_M quantization lands at ~74 GB, so no single consumer GPU gets you there. Your two realistic local paths are three RTX 4090s (GPU cost alone: ~$3,300–5,500 depending on new vs. used) or a Mac Studio M3 Ultra with 96 GB ($3,999). For most readers, the Mistral API at $0.15/M input tokens removes all of this friction.
| | 3× RTX 4090 | Mac Studio M3 Ultra 96 GB | RunPod H100 PCIe |
|---|---|---|---|
| Best for | Privacy-first, high-volume inference | Silent desk setup, macOS workflow | Occasional bursts without sunk cost |
| Upfront cost | ~$3,300–5,500 (GPUs only) | $3,999 | $0 |
| Ongoing cost | ~$0.19–0.24/hr electricity | ~$0.04/hr electricity | $1.99/hr |
| Speed at Q4_K_M | 22–32 tok/s | ~8–12 tok/s | 60–80 tok/s |
| The catch | Three-slot motherboard + 2,000 W PSU | M3 Ultra only; M4 Max 64 GB won't fit | Data leaves your machine |
Honest take: Unless you're generating 10M+ tokens per month with hard privacy requirements, the API at $0.15/M input tokens is cheaper and faster than any consumer hardware setup for this model. The local path here requires real justification.
What Mistral Small 4 actually is
Released March 2026 (model version: mistral-small-4-119b-2603), Mistral Small 4 is a Mixture-of-Experts model, a design Mistral AI previously shipped in its Mixtral releases. The name is confusing: "Small" refers to its inference cost, not its parameter count.
The architecture: 119 billion total parameters, 128 experts, 4 active per token. That means the model activates just 6 billion parameters per forward pass (8 billion if you include embedding and output layers). In practice, inference compute roughly matches a 6–8B dense model, not the headline 119B. This is the same efficiency trick DeepSeek V3 and Llama 4 Scout use—you store a large model in memory, but think with a fraction of it.
What Mistral consolidated into this release:
- Magistral (their reasoning model) → Mistral Small 4 matches it on math and multi-step logic
- Devstral (their coding agent model) → Mistral Small 4 has the same code execution profile
- Mistral Small 3.x (their instruct model) → now superseded for general chat
Context window: 262,144 tokens. Modalities: text and image. License: Apache 2.0.
The 40% latency improvement and 3× throughput gain over Mistral Small 3 come entirely from the MoE switch: you're doing 6B worth of FLOPs per token instead of 24B. The model is larger in memory, but faster when memory bandwidth isn't the bottleneck (i.e., on professional hardware with NVLink or high-bandwidth interconnects).
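To make the compute argument concrete, here is a rough sketch of the per-token arithmetic. It leans on the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter, which is an assumption and not a figure from Mistral's materials; the constant and function names are illustrative.

```python
# Rough per-token compute comparison.
# ASSUMPTION: ~2 FLOPs per active parameter per forward pass
# (a common rule of thumb, not a Mistral-published figure).

TOTAL_PARAMS_B = 119   # Mistral Small 4, total parameters (billions)
ACTIVE_PARAMS_B = 6    # parameters activated per token (billions)
DENSE_PARAMS_B = 24    # Mistral Small 3.x dense model (billions)


def flops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass FLOPs for a single token."""
    return 2 * active_params_b * 1e9


active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
compute_ratio = flops_per_token(DENSE_PARAMS_B) / flops_per_token(ACTIVE_PARAMS_B)

print(f"Share of weights active per token: {active_fraction:.1%}")
print(f"Dense 24B vs MoE 6B compute per token: {compute_ratio:.1f}x")
```

The memory side does not shrink the same way: all 119B parameters still have to be resident somewhere, which is why the file sizes matter more than the active parameter count when you pick hardware.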
The benchmark story
Mistral's published numbers for Small 4:
| Benchmark | Mistral Small 4 | GPT-OSS 120B | Mistral Small 3.2 24B |
|---|---|---|---|
| MMLU Pro | 78 | ~76 | 71 |
| LiveCodeBench | 64 | 63 | 53 |
MMLU Pro at 78 places it in the same tier as GPT-4-class models for general knowledge. LiveCodeBench at 64, edging out GPT-OSS 120B at 63, is the more meaningful number for the home lab audience: this is the model you'd reach for when you want GPT-4o-level coding help running locally, with 262K context and vision input.
For direct comparison, Mistral Small 3.2 (the 24B dense model covered in our Llama 3.3 vs Qwen3 vs Mistral comparison) scored 53 on LiveCodeBench. Mistral Small 4 is not a slight upgrade—it's a different product tier.
The honest benchmark caveat: these numbers come from Mistral's release materials. Third-party evaluations on specific tasks vary, and the coding lead over dense models narrows when token budgets are short (MoE excels at reasoning-heavy tasks where many experts contribute).
The hardware math: quantization options
Mistral Small 4's GGUF quantization file sizes (bartowski builds, as of May 2026):
| Quantization | File size | VRAM needed (weights + overhead, before KV cache) | Quality vs FP16 |
|---|---|---|---|
| Q2_K | ~45 GB | ~46–48 GB | Significant degradation |
| Q3_K_M | ~52 GB | ~54–56 GB | Noticeable degradation on reasoning |
| Q4_K_M | ~74 GB | ~76–78 GB | Recommended minimum |
| Q5_K_M | ~89 GB | ~91–93 GB | Near-lossless |
| FP16 | ~244 GB | ~250+ GB | Reference quality |
The NVFP4 checkpoint (Mistral's own format) is designed to slot into a single H100 80 GB for cloud deployment. For local GGUF-based inference via llama.cpp or Ollama, Q4_K_M is the realistic floor where quality doesn't obviously degrade on coding and reasoning tasks.
VRAM needed is weights + ~2–4 GB OS/framework overhead + KV cache. KV cache scales with context length: at 8K context it's cheap (~2–3 GB for most quantization levels); at the full 262K context window it gets expensive. Plan your headroom accordingly.
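As a back-of-the-envelope sketch of that headroom planning, the snippet below adds the three components and scales the KV cache linearly with context length from the 8K estimate above. Linear scaling is an assumption (it ignores KV cache quantization and any sliding-window attention the model may use), and the function names are illustrative.

```python
from typing import Tuple

# VRAM budget sketch: weights + overhead + KV cache.
# ASSUMPTION: KV cache grows linearly with context length, anchored at the
# article's ~2-3 GB estimate for 8K context. Real usage depends on the
# model's attention layout and on KV cache quantization.

KV_GB_AT_8K = (2.0, 3.0)   # low/high estimate at 8,192 tokens
OVERHEAD_GB = (2.0, 4.0)   # OS/framework overhead, low/high


def kv_cache_gb(context_tokens: int) -> Tuple[float, float]:
    """Scale the 8K KV cache estimate linearly to another context length."""
    scale = context_tokens / 8192
    return (KV_GB_AT_8K[0] * scale, KV_GB_AT_8K[1] * scale)


def total_vram_gb(weights_gb: float, context_tokens: int) -> Tuple[float, float]:
    """Low/high total memory estimate for a quantization and context length."""
    kv_lo, kv_hi = kv_cache_gb(context_tokens)
    return (weights_gb + OVERHEAD_GB[0] + kv_lo,
            weights_gb + OVERHEAD_GB[1] + kv_hi)


for ctx in (8192, 32768, 262144):
    lo, hi = total_vram_gb(74.0, ctx)  # Q4_K_M file size from the table
    print(f"{ctx:>7} tokens: ~{lo:.0f}-{hi:.0f} GB")
```

Treat the long-context rows as an upper-bound sanity check, not a measurement; the point is that context length, not just the quantization level, decides whether a given memory pool is enough.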
The consumer GPU path
Single RTX 4090 (24 GB): skip it
At 24 GB VRAM, a single RTX 4090 holds only 30–35% of Q4_K_M's transformer layers in VRAM. The rest, roughly 65%, spills to system RAM. With 64 GB DDR5 as overflow, expect 5–10 tokens/second. Interactive chat at that speed is painful. You'd spend the same money more usefully on cloud inference.
Even an RTX 5090 (32 GB VRAM) misses Q4_K_M by more than 40 GB. The 5090 shines at 32B-and-under models; Mistral Small 4 simply isn't in its wheelhouse.
Two RTX 4090s (48 GB combined): Q2_K only, usable
Two 4090s connected via PCIe tensor parallelism give you 48 GB of combined VRAM. Q2_K at ~45–48 GB just fits, with minimal CPU spillover. On the Q2_K build, expect 14–20 tok/s with 32 GB of DDR5 as buffer memory.
The problem is Q2_K quality. On reasoning and coding benchmarks, 2-bit quantization of a 119B MoE model introduces visible degradation. You're paying the full hardware premium for a meaningfully worse model. If you can afford two 4090s, read the next section.
The multi-GPU wiring specifics—NVLink vs. PCIe tensor parallelism, which frameworks support it—are covered in our multi-GPU local AI guide.
Three RTX 4090s (72 GB combined): Q4_K_M, the real entry point
Three RTX 4090s give 72 GB of combined VRAM. Q4_K_M at ~74 GB is slightly over—you'll still have a few GB spilling to RAM—but effective throughput jumps to 22–32 tok/s, which is usable for interactive work and background batch tasks.
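The fit arithmetic behind the one-, two-, and three-card verdicts can be sketched as follows. This is a simplification that only compares the ~74 GB Q4_K_M file against raw VRAM; it ignores KV cache and framework overhead, so real headroom is tighter than it suggests. The helper names are illustrative.

```python
# How much of the ~74 GB Q4_K_M file fits in VRAM on each setup.
# SIMPLIFICATION: ignores KV cache and framework overhead.

Q4_K_M_GB = 74.0

CONFIGS_GB = {
    "1x RTX 4090": 24,
    "1x RTX 5090": 32,
    "2x RTX 4090": 48,
    "3x RTX 4090": 72,
}


def resident_fraction(vram_gb: float, weights_gb: float = Q4_K_M_GB) -> float:
    """Share of the weights that can stay in VRAM (capped at 100%)."""
    return min(1.0, vram_gb / weights_gb)


def shortfall_gb(vram_gb: float, weights_gb: float = Q4_K_M_GB) -> float:
    """Weights that must spill to system RAM, in GB."""
    return max(0.0, weights_gb - vram_gb)


for name, vram in CONFIGS_GB.items():
    print(f"{name}: {resident_fraction(vram):.0%} in VRAM, "
          f"{shortfall_gb(vram):.0f} GB spills to RAM")
```

Passing a different `weights_gb` (for example the Q2_K or Q5_K_M file size) shows quickly which quantization a given card count can hold.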
What this actually costs (May 2026):
- RTX 4090 used (eBay completed listings): ~$1,099 each → 3× = ~$3,300
- RTX 4090 new (Amazon): ~$2,755 each → 3× = ~$8,265 (production ended October 2024; prices reflect remaining inventory)
- Realistic used build with 3× GPUs: $3,300–4,500 for the GPUs
Add to that:
- HEDT or server platform motherboard with 3+ PCIe x16 slots: ~$400–700
- PSU: three 4090s draw ~450 W each at full load; plan for 1,800 W total system draw, which means a 2,000 W PSU minimum: ~$350–500
- CPU, RAM (64 GB minimum), NVMe: ~$600–900
Total realistic build: $5,000–7,000 all-in for a three-4090 Mistral Small 4 rig. Not cheap, and you're dealing with hardware from a discontinued GPU generation that's no longer under warranty when purchased used.
Power draw reality: at 1,600 W continuous draw (80% load), you're adding ~$0.19–0.24/hour in electricity at US average rates ($0.12–0.15/kWh). Running 8 hours per day, that's ~$46–58/month just in power.
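The electricity arithmetic is simple enough to sketch and adapt to your own tariff. The snippet below takes the 1,600 W draw, the $0.12–0.15/kWh range, and the 8-hour day from the paragraph above; the 30-day month and the function names are assumptions for illustration.

```python
# Electricity cost sketch for a multi-GPU rig.
# Inputs come from the article; swap in your own tariff and duty cycle.
# ASSUMPTION: a 30-day month.


def hourly_cost(draw_watts: float, rate_per_kwh: float) -> float:
    """Dollar cost of running at draw_watts for one hour."""
    return draw_watts / 1000 * rate_per_kwh


def monthly_cost(draw_watts: float, rate_per_kwh: float,
                 hours_per_day: float = 8, days: int = 30) -> float:
    """Dollar cost for a month at a fixed daily duty cycle."""
    return hourly_cost(draw_watts, rate_per_kwh) * hours_per_day * days


for rate in (0.12, 0.15):
    print(f"${rate:.2f}/kWh: ${hourly_cost(1600, rate):.2f}/hr, "
          f"${monthly_cost(1600, rate):.0f}/month")
```

If your rig idles most of the day, lower `draw_watts` for those hours; sustained inference is the only time the cards sit near full draw.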
The Apple Silicon path
Apple Silicon's unified memory makes it the most straightforward consumer hardware for oversized MoE models. The memory bandwidth is lower than multi-4090 setups, but you don't need PCIe topology negotiations, and the thermal situation is manageable in a desktop form factor.
Mac Studio M4 Max with 96 GB (configure-to-order): slower but fits
The Mac Studio M4 Max can be configured with up to 96 GB of unified memory as a configure-to-order option (chec