This article was originally published on runaihome.com
TL;DR: MiniMax M3 is a 428B-parameter Mixture-of-Experts model (~23B active) with frontier-tier coding scores and a 1M-token context. The honest problem: even a 4-bit GGUF is ~265GB, and the one machine that could have run it — a maxed Mac Studio — got gutted by the 2026 DRAM shortage. For almost every home lab, the API is the move; wait for community distills before buying hardware.
| | MiniMax M3 (local) | MiniMax M3 (API) | A "runnable" local model (Qwen3.6 35B-A3B) |
|---|---|---|---|
| Best for | Labs with 256GB+ RAM already on hand | Everyone who wants M3's quality today | Anyone with a single 24GB GPU |
| Price / Cost | $10k+ in hardware, if you can source the RAM | $0.30/$1.20 per M tokens (launch promo) | ~$1,070 used RTX 3090 |
| The catch | Q4 GGUF is 265GB — no single consumer box fits it | Not local; data leaves your machine | Not frontier-tier, but actually runs |
Honest take: M3 is a genuinely impressive open-weight model, but in June 2026 it's a data-center model wearing an "open weights" badge. Run it on the API, keep your local stack on Qwen3.6 or Gemma 4, and revisit when memory prices fall or a distilled M3 lands.
What MiniMax M3 actually is
MiniMax released M3 on June 1, 2026. It's an open-weight Mixture-of-Experts model with roughly 428B total parameters and ~23B activated per token, spread across 256 fine-grained experts. The headline feature is a 1-million-token context window (1,048,576 tokens, with up to 512K output tokens) made practical by a new attention design MiniMax calls MSA — MiniMax Sparse Attention.
MSA keeps a Grouped-Query Attention backbone and layers block-level sparse selection on top of real, uncompressed key-values. MiniMax reports more than 9× faster prefill and more than 15× faster decoding at 1M context versus the previous M2 generation, with per-token compute cut to roughly 1/20th. That's the part that matters for anyone thinking about long-context agentic work: the speedup isn't from a smaller model, it's from not attending to every token.
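To make "not attending to every token" concrete, here is a toy single-query sketch of block-level sparse selection over uncompressed keys and values. It is not MiniMax's MSA implementation: the function name, the mean-key block scoring rule, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration.

```python
import math


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Toy block-sparse attention for one query (illustrative only, not MSA)."""
    n = len(keys)
    dim = len(query)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Split token positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Assumed scoring rule: query dotted with each block's mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    # Keep only the top_k highest-scoring blocks.
    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])
    positions = [i for b in chosen for i in blocks[b]]

    # Ordinary softmax attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, positions)) / total
        for d in range(out_dim)
    ]
    return output, positions


# Eight tokens, blocks of two, keep one block: only 2 of 8 keys are touched.
keys = [[0, 0], [0, 0], [5, 0], [5, 0], [-1, 0], [-1, 0], [0, 1], [0, 1]]
values = [[float(i)] for i in range(8)]
output, positions = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
print(positions, output, f"attended {len(positions) / len(keys):.0%} of keys")
```

The cost of the softmax step scales with the number of selected positions rather than the full context length, which is the general mechanism behind the reported long-context speedups; the real selection rule and block sizes are whatever MiniMax's implementation uses.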
On benchmarks, M3 punches at the frontier. It scores 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, 74.2% on MCP Atlas, and 83.5 on BrowseComp — the last figure beating Claude Opus 4.7's 79.3 on autonomous browsing. It edges past GPT-5.5 and Gemini 3.1 Pro on the coding/agent metrics while sitting just below Claude Opus 4.8 overall.
One important correction, because a lot of the early write-ups got this wrong: M3 is not the 229B model. That was MiniMax M2.7. M3 is the bigger 428B MoE with native multimodal input (text, image, and video). If a guide tells you M3 is 229.9B/9.8B-active, it's quoting the wrong generation.
The number that breaks the dream: 265GB
Here's where the "open weight" story collides with physics. The full BF16 weights are about 855GB. Nobody runs that at home. So the question is what the quantized GGUFs look like — and Unsloth has published Dynamic 2.0 quants for exactly this model.
| Quant (Unsloth Dynamic 2.0) | Disk / memory footprint | What can hold it |
|---|---|---|
| UD-IQ1_M (1-bit) | ~128 GB | Quality falls off a cliff; not recommended |
| UD-Q2_K_XL (2-bit) | ~143 GB | Needs 192GB+ RAM realistically |
| UD-Q4_K_XL (4-bit) | ~265 GB | The "real" quant — needs 320GB+ |
| Q8_0 / UD-Q8_K_XL | ~453–464 GB | Multi-GPU server territory |
For local LLM work, Q4_K_M-class quantization is the floor for keeping a model's quality intact. For M3 that's the 265GB UD-Q4_K_XL file — and that's just weights, before KV cache and context allocation. Push toward long context and you're adding tens of gigabytes on top.
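The footprint figures above follow from parameter count times bits per weight. The sketch below reproduces the BF16 number and backs out the effective bits per weight implied by each GGUF size. It assumes decimal gigabytes (10^9 bytes) and the ~428B total parameter count quoted earlier; the helper names are illustrative.

```python
TOTAL_PARAMS = 428e9  # total parameters, as quoted in this article


def weights_gb(params, bits_per_weight):
    """Weight storage in decimal GB for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(file_gb, params=TOTAL_PARAMS):
    """Average bits per weight implied by a file size in decimal GB."""
    return file_gb * 1e9 * 8 / params


print(f"BF16: {weights_gb(TOTAL_PARAMS, 16):.0f} GB")
for name, size_gb in [("UD-IQ1_M", 128), ("UD-Q2_K_XL", 143), ("UD-Q4_K_XL", 265)]:
    print(f"{name}: {effective_bits_per_weight(size_gb):.2f} bits/weight")
```

The implied averages sit above the nominal bit width in each quant's name, which is consistent with mixed-precision quants that keep some tensors at higher precision. None of this includes KV cache or runtime buffers.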
To put 265GB in perspective: eleven RTX 3090s (24GB each) add up to 264GB, which is 1GB short of the weights alone, so twelve cards is the practical minimum, and even that leaves only about 23GB for KV cache and buffers. At June 2026 used prices (an RTX 3090 averages around $1,070, with listings ranging $966–$1,189), that's roughly $12,000–$13,000 in GPUs alone, before the motherboard, PCIe risers, PSUs, and the power bill to feed twelve 350W cards. Even then, llama.cpp's NVIDIA support for M3 is preliminary, and naive PCIe sharding of a model this size is slow.
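The card count and the bill come down to a few lines of arithmetic. This sketch uses only the figures quoted in this section; the `overhead_gb` parameter is a hypothetical knob for KV cache and buffers, not a measured value.

```python
import math

MODEL_GB = 265         # UD-Q4_K_XL weights
VRAM_PER_CARD_GB = 24  # RTX 3090
CARD_WATTS = 350
PRICE_LOW, PRICE_AVG, PRICE_HIGH = 966, 1070, 1189  # used prices quoted above


def cards_needed(model_gb, vram_gb, overhead_gb=0.0):
    """Smallest card count whose pooled VRAM covers weights plus overhead."""
    return math.ceil((model_gb + overhead_gb) / vram_gb)


cards = cards_needed(MODEL_GB, VRAM_PER_CARD_GB)
spare_gb = cards * VRAM_PER_CARD_GB - MODEL_GB
print(f"{cards} cards, {spare_gb} GB spare")
print(f"GPU cost: ${cards * PRICE_LOW:,} to ${cards * PRICE_HIGH:,} (avg ${cards * PRICE_AVG:,})")
print(f"GPU power draw at full load: {cards * CARD_WATTS} W")
```

Pooled VRAM is an optimistic measure: real multi-GPU splits lose some capacity to per-device buffers and uneven layer placement, so treat the count as a lower bound.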
If you've been following the GDDR7 shortage and NVIDIA's consumer GPU freeze, you already know this is the worst possible moment to be buying twelve high-VRAM cards.
The Mac angle — and why it just collapsed
For a model this size, the usual home-lab answer is unified memory: one Apple Silicon box with a huge RAM pool that the GPU can address directly. A 256GB or 512GB Mac Studio used to be the cleanest way to run a 400B-class MoE without a GPU farm.
That option is gone as of June 2026. Here's the timeline:
- March 2026: Apple pulled the 512GB unified-memory option from the Mac Studio M3 Ultra and raised the 256GB upgrade price by $400, citing the same AI-driven DRAM squeeze we covered in the DDR5 and SSD price surge.
- May 2026: Apple removed the 256GB option too.
- June 2026: The M3 Ultra Mac Studio ships with 96GB as its only memory configuration.
So the device that was supposed to be the answer — a maxed Mac Studio — now tops out at 96GB. That doesn't fit even the 143GB 2-bit quant, let alone the 265GB Q4. And to be clear about the spec confusion floating around: there is no "Mac Studio M4 Ultra." Apple shipped the 2025 Mac Studio with an M4 Max base and an M3 Ultra at the top; there was never an M4 Ultra SKU. Any guide promising "M3 at Q4 on an M4 Ultra 192GB" is describing a machine that doesn't exist running a quant that wouldn't fit if it did.
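The same fit check works for any single-box memory pool. A minimal sketch, using the quant sizes from the table earlier (taking the low end of the Q8 range) and a hypothetical `reserve_gb` for the OS, KV cache, and buffers:

```python
# Sizes in GB from the Unsloth table earlier in this article.
QUANT_SIZES_GB = {
    "UD-IQ1_M": 128,
    "UD-Q2_K_XL": 143,
    "UD-Q4_K_XL": 265,
    "Q8_0": 453,
}


def quants_that_fit(memory_gb, reserve_gb=0):
    """Names of quants whose weights fit in memory_gb minus a reserve."""
    usable = memory_gb - reserve_gb
    return [name for name, size in QUANT_SIZES_GB.items() if size <= usable]


for pool_gb in (96, 256, 512):
    print(pool_gb, "GB:", quants_that_fit(pool_gb) or "nothing fits")
```

Even with zero reserve, a 96GB pool holds none of the quants, and a 256GB pool stops short of the 4-bit file.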
The M4 Max Mac Studio starts at $1,999 but caps even lower on memory. For context on where Apple Silicon genuinely shines for local AI, our Mac Studio M4 Max vs Mac Mini M4 Pro guide covers the models you can actually buy and run today.
What speed should you even expect?
Throughput data for M3 specifically is thin this early, but the architecture tells you most of what you need. Decode speed on any quantized LLM is memory-bandwidth-bound, not compute-bound, which is why NPUs with big TOPS numbers still lose on tokens/second. With ~23B active parameters per token, M3 does about as much work per step as a ~23B dense model, but those active expert weights still have to be streamed out of memory for every token, so the bandwidth of whatever holds the weights sets the ceiling.
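A rough ceiling on decode speed falls out of that bandwidth argument: tokens per second cannot exceed memory bandwidth divided by the bytes of active weights read per token. The sketch below is a back-of-envelope bound, not a benchmark. It assumes roughly 4.95 bits per weight (265GB spread over 428B parameters, applied uniformly to the active set), and the two bandwidth figures are hypothetical round numbers rather than measurements of any specific machine.

```python
def decode_tokens_per_sec_upper_bound(active_params, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling: bytes/s available over weight bytes read per token.

    Ignores KV cache reads, routing overhead, and compute time,
    so real throughput will be lower.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 23e9  # ~23B active per token, as quoted in this article
BITS = 4.95           # assumed average for the 4-bit quant

for label, bandwidth in [("hypothetical 800 GB/s pool", 800), ("hypothetical 100 GB/s pool", 100)]:
    ceiling = decode_tokens_per_sec_upper_bound(ACTIVE_PARAMS, BITS, bandwidth)
    print(f"{label}: at most ~{ceiling:.0f} tokens/sec")
```

The bound only applies to the memory tier the active weights are actually read from. If experts spill to slower RAM or across PCIe, the slowest link in the path is the number to plug in.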
For a concrete reference point: the previous-gen MiniMax M2.5 (229B) at Q4_K_M was clocked at roughly 12 tokens/sec on an RTX PRO 6000 Blackwell — a 96GB workstation card. M3 is nearly twice the total size, so even on hardware that can hold a usable quant, expect low-double-digit tokens/sec at best, dropping further as you fill that 1M context. That's usable for batch/agentic work, painful for interactive chat. (More on where that 96GB ceiling sits in our RTX PRO 6000 Blackwell deep dive.)
If you genuinely want to run it locally
Say you already have a 256GB-plus RAM server or a multi-GPU rig and you want to try. Here's the honest setup path as of mid-June 2026.
1. llama.cpp support is preliminary. M3 isn't in a released llama.cpp build yet. You build from the open pull request:
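```bash
# Fetch llama.cpp and check out the open pull request that adds M3 support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
# Configure with CUDA enabled, then build only the CLI and server binaries
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server
```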
2. Pull the Unsloth Dynamic GGUF. For a CPU-RAM-heavy box, UD-Q2_K_XL (143GB) is the realistic entry; UD-Q4_K_XL (26