This article was originally published on aifoss.dev
TL;DR: MiniMax M3 is a 428B-parameter (23B active) open-weight MoE released June 1, 2026, with a 1M-token context, native multimodality, and 59.0% on SWE-Bench Pro — edging out GPT-5.5. The weights are public, but two things bite self-hosters: it needs a multi-GPU server (200GB+ VRAM quantized), and the MiniMax Community License blocks commercial use without a separate agreement.
| MiniMax M3 | DeepSeek V4 Pro | GLM-5.1 | |
|---|---|---|---|
| Best for | 1M-context agentic + multimodal | Frontier reasoning at any cost | MIT coding model you can ship |
| License | MiniMax Community (non-commercial) | MIT (commercial OK) | MIT (commercial OK) |
| Size | 428B / 23B active | 1.6T / 49B active | 744B |
| The catch | Restrictive license + 200GB+ VRAM | ~862GB weights, datacenter-only | 24GB GPU + 256GB RAM for 2-bit |
Honest take: M3 is the most capable open-weight model you can download today for long-context and multimodal agent work — but if you want to sell anything built on it, the license sends you back to DeepSeek V4 or GLM-5.1.
What MiniMax M3 actually is
MiniMax, the Shanghai lab behind the MiniMax-M and Hailuo lines, released M3 on June 1, 2026 and published the weights to HuggingFace within roughly ten days of launch. It is a Mixture-of-Experts model: 428 billion total parameters, but only about 23 billion activate per token. That sparsity is the whole point — you get the knowledge capacity of a near-half-trillion-parameter model with the per-token compute of a 23B dense one.
Three things make M3 stand out from the crowd of 2026 open-weight drops:
- A genuine 1M-token context window. Not a marketing 1M that degrades at 64k — the architecture is built for it.
- Native multimodality. Text, image, and video understanding (plus computer-use style desktop interaction) are trained in from the first step, not bolted on with an adapter.
- Frontier coding scores. 59.0% on SWE-Bench Pro at launch, which is a number that until recently belonged to closed frontier labs.
The headline trick is MiniMax Sparse Attention (MSA). Instead of attending to every past token, a lightweight index branch scans incoming tokens and selects which blocks of history are worth attention, then runs full attention only on those. At 1M context, MiniMax reports MSA cuts per-token compute to roughly one-twentieth of a dense baseline, with more than 9× faster prefill and more than 15× faster decoding. The sparse block size is 128 — a detail that matters the moment you try to serve it (more on that below).
The benchmarks, with the asterisks
MiniMax's own launch numbers:
| Benchmark | MiniMax M3 | Context |
|---|---|---|
| SWE-Bench Pro | 59.0% | beats GPT-5.5 (58.6%), Gemini 3.1 Pro (54.2%); below Claude Opus 4.8 (~69%) |
| Terminal-Bench 2.1 | 66.0% | agentic terminal tasks |
| SWE-fficiency | 34.8% | code-edit efficiency |
| KernelBench Hard | 28.8% | low-level kernel generation |
| MCP Atlas | 74.2% | tool-calling over MCP |
| BrowseComp | 83.5 | autonomous browsing — leads Opus on this |
The asterisk: at launch these were vendor-reported. Independent third-party scores from Artificial Analysis and LMArena were not yet published when M3 dropped, and TechTimes flagged the launch as "frontier claims, unverified benchmarks." By mid-June the model is on OpenRouter with independent benchmark listings, but treat the 59.0% as MiniMax's claim until a neutral harness confirms it on your own task distribution. SWE-Bench Pro is also a coding-only signal — strong agentic browsing (the 83.5 BrowseComp) is arguably the more differentiated result here.
For comparison, GLM-5.1 scored 58.4% on SWE-Bench Pro and DeepSeek's V4 line targets the same frontier tier — so M3 is in the pack on raw coding, not running away from it. Where it pulls ahead is the combination: nobody else offers 1M context + native multimodal + this coding score in one downloadable checkpoint.
The hardware reality nobody puts in the headline
This is where "open weight" stops meaning "you can run it." A 428B model is large even when sparse:
- A rough 4-bit weight-only estimate lands around 214GB before KV cache and activation overhead.
- NVIDIA's MXFP8 checkpoint (
MiniMax-M3-MXFP8, quantized from the FP16 release) is around 440GB and is designed for a multi-GPU node — the official SGLang config uses tensor parallelism across 8 GPUs. - Even aggressive community dynamic quants (Q2/Q3-class GGUF) are expected to need somewhere in the 75–150GB memory range to load at all.
So the practical floor is a workstation or server with stacked GPUs (think 4× to 8× 48GB cards, or datacenter Blackwell/MI350-class accelerators), plus a large pool of system RAM if you offload MoE layers to CPU. This is not a RTX 4090 model. A single 4090's 24GB doesn't hold a meaningful fraction of it.
If you want to try M3 before committing to a hardware build, renting a multi-GPU instance on RunPod for an afternoon is far cheaper than buying the cards — an 8×H100 or B200 node by the hour will tell you whether M3's long-context behavior actually helps your workload. For sizing a permanent local rig, the multi-GPU build math on runaihome.com is the better reference than any model card.
A note for AMD owners: M3 got day-0 support on AMD Instinct (MI350X/MI355X), and the MXFP8 weights run natively on both Blackwell (B200/B300/GB200) and CDNA4. This is one of the better-supported launches for non-NVIDIA datacenter hardware in 2026.
Serving it: the one flag that breaks everything
M3 supports vLLM and SGLang from day one, plus HuggingFace Transformers. The gotcha is MSA's block size. Because the sparse index works on 128-token blocks, your KV cache block size must match — vLLM's default of 16 misaligns the sparse attention indexing and produces garbage or errors.
A minimal vLLM launch on a multi-GPU node looks like this:
# 8-GPU tensor-parallel serve; block size MUST be 128 to match MSA
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
--tensor-parallel-size 8 \
--block-size 128 \
--max-model-len 1000000 \
--trust-remote-code \
--served-model-name minimax-m3
# expected: an OpenAI-compatible endpoint on :8000
# INFO: Started server process
# INFO: Uvicorn running on http://0.0.0.0:8000
Once it's up, it speaks the OpenAI API, so anything you already point at a local endpoint — Open WebUI, your agent framework, an IDE plugin — just works. If you've set up vLLM before, the only new thing to internalize is the --block-size 128 requirement and the 1M --max-model-len (which you should lower unless you actually need it; the KV cache at full 1M context is enormous).
A problem worth flagging from early deployment reports: people serving M3 at the full 1M context without lowering it OOM their nodes on the KV cache, not the weights. Set --max-model-len to what your task needs (32k or 128k is plenty for most coding work) and you reclaim a lot of memory. The 1M number is a capability ceiling, not a default you should run at.
The license is the real story for self-hosters
Here's the part that reorders everything above. M3 ships under the MiniMax Community License, not MIT or Apache. The relevant terms:
- Free: personal use (including self-hosted deployment for coding, app/agent/tool development, research, experimentation), and use by non-profits, academic institutions, and researchers for non-commercial purposes.
- Prohibited without a separate written agreement from MiniMax: any commercial use of the model or derivative works.
Read that twice if you're building a product. "Open weight" here means you can download, run, fine-tune, and learn from M3 freely — but the moment money changes hands for something M3 powers, you need a license from MiniMax. That is a cat
Top comments (0)