MiniMax M3 Review 2026: Open-Weight 1M-Context Frontier

#minimax #llm #moe #localllm

This article was originally published on aifoss.dev

TL;DR: MiniMax M3 is a 428B-parameter (23B active) open-weight MoE released June 1, 2026, with a 1M-token context, native multimodality, and 59.0% on SWE-Bench Pro — edging out GPT-5.5. The weights are public, but two things bite self-hosters: it needs a multi-GPU server (200GB+ VRAM quantized), and the MiniMax Community License blocks commercial use without a separate agreement.

	MiniMax M3	DeepSeek V4 Pro	GLM-5.1
Best for	1M-context agentic + multimodal	Frontier reasoning at any cost	MIT coding model you can ship
License	MiniMax Community (non-commercial)	MIT (commercial OK)	MIT (commercial OK)
Size	428B / 23B active	1.6T / 49B active	744B
The catch	Restrictive license + 200GB+ VRAM	~862GB weights, datacenter-only	24GB GPU + 256GB RAM for 2-bit

Honest take: M3 is the most capable open-weight model you can download today for long-context and multimodal agent work — but if you want to sell anything built on it, the license sends you back to DeepSeek V4 or GLM-5.1.

What MiniMax M3 actually is

MiniMax, the Shanghai lab behind the MiniMax-M and Hailuo lines, released M3 on June 1, 2026 and published the weights to HuggingFace within roughly ten days of launch. It is a Mixture-of-Experts model: 428 billion total parameters, but only about 23 billion activate per token. That sparsity is the whole point — you get the knowledge capacity of a near-half-trillion-parameter model with the per-token compute of a 23B dense one.

Three things make M3 stand out from the crowd of 2026 open-weight drops:

A genuine 1M-token context window. Not a marketing 1M that degrades at 64k — the architecture is built for it.
Native multimodality. Text, image, and video understanding (plus computer-use style desktop interaction) are trained in from the first step, not bolted on with an adapter.
Frontier coding scores. 59.0% on SWE-Bench Pro at launch, which is a number that until recently belonged to closed frontier labs.

The headline trick is MiniMax Sparse Attention (MSA). Instead of attending to every past token, a lightweight index branch scans incoming tokens and selects which blocks of history are worth attention, then runs full attention only on those. At 1M context, MiniMax reports MSA cuts per-token compute to roughly one-twentieth of a dense baseline, with more than 9× faster prefill and more than 15× faster decoding. The sparse block size is 128 — a detail that matters the moment you try to serve it (more on that below).

The benchmarks, with the asterisks

MiniMax's own launch numbers:

Benchmark	MiniMax M3	Context
SWE-Bench Pro	59.0%	beats GPT-5.5 (58.6%), Gemini 3.1 Pro (54.2%); below Claude Opus 4.8 (~69%)
Terminal-Bench 2.1	66.0%	agentic terminal tasks
SWE-fficiency	34.8%	code-edit efficiency
KernelBench Hard	28.8%	low-level kernel generation
MCP Atlas	74.2%	tool-calling over MCP
BrowseComp	83.5	autonomous browsing — leads Opus on this

The asterisk: at launch these were vendor-reported. Independent third-party scores from Artificial Analysis and LMArena were not yet published when M3 dropped, and TechTimes flagged the launch as "frontier claims, unverified benchmarks." By mid-June the model is on OpenRouter with independent benchmark listings, but treat the 59.0% as MiniMax's claim until a neutral harness confirms it on your own task distribution. SWE-Bench Pro is also a coding-only signal — strong agentic browsing (the 83.5 BrowseComp) is arguably the more differentiated result here.

For comparison, GLM-5.1 scored 58.4% on SWE-Bench Pro and DeepSeek's V4 line targets the same frontier tier — so M3 is in the pack on raw coding, not running away from it. Where it pulls ahead is the combination: nobody else offers 1M context + native multimodal + this coding score in one downloadable checkpoint.

The hardware reality nobody puts in the headline

This is where "open weight" stops meaning "you can run it." A 428B model is large even when sparse:

A rough 4-bit weight-only estimate lands around 214GB before KV cache and activation overhead.
NVIDIA's MXFP8 checkpoint (MiniMax-M3-MXFP8, quantized from the FP16 release) is around 440GB and is designed for a multi-GPU node — the official SGLang config uses tensor parallelism across 8 GPUs.
Even aggressive community dynamic quants (Q2/Q3-class GGUF) are expected to need somewhere in the 75–150GB memory range to load at all.

So the practical floor is a workstation or server with stacked GPUs (think 4× to 8× 48GB cards, or datacenter Blackwell/MI350-class accelerators), plus a large pool of system RAM if you offload MoE layers to CPU. This is not a RTX 4090 model. A single 4090's 24GB doesn't hold a meaningful fraction of it.

If you want to try M3 before committing to a hardware build, renting a multi-GPU instance on RunPod for an afternoon is far cheaper than buying the cards — an 8×H100 or B200 node by the hour will tell you whether M3's long-context behavior actually helps your workload. For sizing a permanent local rig, the multi-GPU build math on runaihome.com is the better reference than any model card.

A note for AMD owners: M3 got day-0 support on AMD Instinct (MI350X/MI355X), and the MXFP8 weights run natively on both Blackwell (B200/B300/GB200) and CDNA4. This is one of the better-supported launches for non-NVIDIA datacenter hardware in 2026.

Serving it: the one flag that breaks everything

M3 supports vLLM and SGLang from day one, plus HuggingFace Transformers. The gotcha is MSA's block size. Because the sparse index works on 128-token blocks, your KV cache block size must match — vLLM's default of 16 misaligns the sparse attention indexing and produces garbage or errors.

A minimal vLLM launch on a multi-GPU node looks like this:

# 8-GPU tensor-parallel serve; block size MUST be 128 to match MSA
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --max-model-len 1000000 \
  --trust-remote-code \
  --served-model-name minimax-m3

# expected: an OpenAI-compatible endpoint on :8000
# INFO: Started server process
# INFO: Uvicorn running on http://0.0.0.0:8000

Once it's up, it speaks the OpenAI API, so anything you already point at a local endpoint — Open WebUI, your agent framework, an IDE plugin — just works. If you've set up vLLM before, the only new thing to internalize is the --block-size 128 requirement and the 1M --max-model-len (which you should lower unless you actually need it; the KV cache at full 1M context is enormous).

A problem worth flagging from early deployment reports: people serving M3 at the full 1M context without lowering it OOM their nodes on the KV cache, not the weights. Set --max-model-len to what your task needs (32k or 128k is plenty for most coding work) and you reclaim a lot of memory. The 1M number is a capability ceiling, not a default you should run at.

The license is the real story for self-hosters

Here's the part that reorders everything above. M3 ships under the MiniMax Community License, not MIT or Apache. The relevant terms:

Free: personal use (including self-hosted deployment for coding, app/agent/tool development, research, experimentation), and use by non-profits, academic institutions, and researchers for non-commercial purposes.
Prohibited without a separate written agreement from MiniMax: any commercial use of the model or derivative works.

Read that twice if you're building a product. "Open weight" here means you can download, run, fine-tune, and learn from M3 freely — but the moment money changes hands for something M3 powers, you need a license from MiniMax. That is a cat