MiniMax-M3 is live on DEVUP AI — routable through the same OpenAI-compatible gateway as our other 170+ models, billed in DZD via Edahabia/CIB. Model ID: MiniMaxAI/MiniMax-M3.
Most launch-day coverage repeated MiniMax's press numbers without reading the actual paper behind the architecture. This post goes one level deeper: what the arXiv paper actually claims, where it diverges from the PR figures, what M3 costs and how it's exposed on our platform, and production-grade integration code — retries, timeouts, streaming, and multimodal, all handled correctly.
What's actually new
Three capabilities that, until now, only closed-source models combined in one checkpoint:
- Frontier-tier coding/agentic performance — competitive with GPT-5.5 and Gemini 3.1 Pro on agentic benchmarks
- 1M-token context (with a guaranteed floor — more on that below)
- Native multimodality — image and video input, trained in from step zero, not bolted on
Architecture: MoE, ~428B total parameters, ~23B active per token.
Under the hood: MiniMax Sparse Attention (MSA)
Here's the mechanism, straight from the technical report (Lai et al., MiniMax, June 2026):
MSA is a blockwise sparse attention built on top of Grouped Query Attention (GQA) — not a replacement for GQA, an extension of it. It has two branches:
- Index Branch — a lightweight scorer that ranks key-value blocks and independently selects a Top-k subset per GQA group. This is the detail most summaries miss: selection isn't global, it's group-specific, which is what lets the model retrieve different relevant context for different attention heads within the same layer.
- Main Branch — performs exact block-sparse attention, but only over the blocks the Index Branch selected. No approximation inside the selected blocks — the sparsity is in which blocks get attended to, not in how attention is computed within them.
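The two-branch structure above can be illustrated with a toy, single-query sketch. This is not MiniMax's kernel or the paper's scoring function: the mean-pooled block summary, the dot-product scorer, and the names `selectTopKBlocks` and `sparseAttend` are assumptions chosen for readability. It only shows the shape of the idea: each GQA group ranks blocks on its own, then exact softmax attention runs over just the blocks that group selected.

```typescript
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Index branch (toy, assumed scorer): score each block by query · mean(block keys),
// then keep the ids of the k highest-scoring blocks.
function selectTopKBlocks(query: Vec, keys: Vec[], blockSize: number, k: number): number[] {
  const scored: { block: number; score: number }[] = [];
  for (let start = 0; start < keys.length; start += blockSize) {
    const block = keys.slice(start, start + blockSize);
    const mean = query.map((_, d) => block.reduce((acc, key) => acc + key[d], 0) / block.length);
    scored.push({ block: start / blockSize, score: dot(query, mean) });
  }
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k).map((s) => s.block).sort((a, b) => a - b);
}

// Main branch: exact softmax attention, restricted to tokens inside the selected blocks.
function sparseAttend(query: Vec, keys: Vec[], values: Vec[], blockSize: number, blocks: number[]): Vec {
  const tokenIds: number[] = [];
  for (const b of blocks) {
    const end = Math.min((b + 1) * blockSize, keys.length);
    for (let i = b * blockSize; i < end; i++) tokenIds.push(i);
  }
  const logits = tokenIds.map((i) => dot(query, keys[i]) / Math.sqrt(query.length));
  const maxLogit = Math.max(...logits);
  const weights = logits.map((l) => Math.exp(l - maxLogit));
  const total = weights.reduce((acc, w) => acc + w, 0);
  const out: Vec = values[0].map(() => 0);
  tokenIds.forEach((tokenId, j) => {
    for (let d = 0; d < out.length; d++) out[d] += (weights[j] / total) * values[tokenId][d];
  });
  return out;
}

// Four blocks of two keys each; value i is just [i] so the outputs are easy to read.
const toyKeys: Vec[] = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]];
const toyValues: Vec[] = toyKeys.map((_, i) => [i]);
const TOY_BLOCK_SIZE = 2;

// Two GQA groups with different queries select different blocks from the same KV cache.
const groupQueries: Vec[] = [[1, 0], [0, 1]];
const groupSelections = groupQueries.map((q) => selectTopKBlocks(q, toyKeys, TOY_BLOCK_SIZE, 1));
const groupOutputs = groupQueries.map((q, g) =>
  sparseAttend(q, toyKeys, toyValues, TOY_BLOCK_SIZE, groupSelections[g])
);
```

In this toy setup the first group selects block 0 and the second selects block 1, so the two groups attend to different tokens even though they share one set of keys and values.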
To make that sparsity translate into wall-clock speed rather than just fewer FLOPs on paper, MiniMax co-designed the GPU kernel with exp-free Top-k selection and KV-outer sparse attention, specifically to keep tensor-core utilization high under block-granular memory access — a detail that matters because naive sparse attention implementations often lose their theoretical speedup to poor hardware utilization.
The numbers, precisely: On a 109B-parameter research checkpoint with native multimodal training, MSA matches GQA quality while cutting per-token attention compute by 28.4x at 1M context. Paired with the co-designed kernel, that's 14.2x prefill and 7.6x decoding wall-clock speedup on H800 GPUs.
The gap between the two curves widens with sequence length — GQA's cost grows steeply while MSA stays close to flat, which is the entire point of a sparse mechanism: the advantage barely matters at 32K tokens and becomes decisive at 1M.
Compare that to the press figures circulating since launch — "~1/20th compute, 9x prefill, 15x decode." These aren't wrong, but they're not the same number as the paper's, and the discrepancy is worth understanding rather than hand-waving: the paper's 28.4x figure is specifically attention compute on a smaller 109B test model, while the press figures describe end-to-end per-token compute on the production 428B/23B-active M3 — a different denominator (attention-only vs. full forward pass) and a different model scale. Both can be true simultaneously. If you're doing capacity planning, use the paper's numbers for attention-layer cost modeling and treat the press figures as a rough end-to-end proxy, not interchangeable data points.
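To see why an attention-only reduction and an end-to-end reduction are different quantities, a simple Amdahl-style model helps. The sketch below is illustrative only: the attention share of per-token compute (`attentionFraction`) is a hypothetical input, not a figure from the paper or from MiniMax, and the function name `endToEndReduction` is ours. It does not reconcile the two sets of numbers, which also come from different model sizes.

```typescript
// If a fraction f of per-token compute is attention and attention gets r times cheaper,
// the whole forward pass gets 1 / ((1 - f) + f / r) times cheaper.
function endToEndReduction(attentionFraction: number, attentionReduction: number): number {
  if (attentionFraction < 0 || attentionFraction > 1) {
    throw new RangeError("attentionFraction must be between 0 and 1");
  }
  if (attentionReduction <= 0) {
    throw new RangeError("attentionReduction must be positive");
  }
  return 1 / ((1 - attentionFraction) + attentionFraction / attentionReduction);
}

// 28.4x is the paper's attention-compute figure at 1M context on the 109B checkpoint.
const PAPER_ATTENTION_REDUCTION = 28.4;

// Hypothetical attention shares, chosen only to show how sensitive the result is.
const hypotheticalFractions = [0.5, 0.9, 0.98];
const endToEndEstimates = hypotheticalFractions.map((f) => ({
  attentionFraction: f,
  reduction: endToEndReduction(f, PAPER_ATTENTION_REDUCTION),
}));
```

With these hypothetical shares the end-to-end reduction comes out at roughly 1.9x, 7.6x, and 18.3x, so the same 28.4x attention figure can correspond to very different whole-model numbers depending on how much of the forward pass attention accounts for.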
The inference kernel itself is open-sourced: github.com/MiniMax-AI/MSA. Worth cloning if you're evaluating self-hosting.
Benchmarks — with the caveats that matter
MiniMax reports:
| Benchmark | M3 Score | Notes |
|---|---|---|
| SWE-Bench Pro | 59.0% | Ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro |
| Terminal-Bench 2.1 | 66.0% | |
| MCP Atlas | 74.2% | |
| BrowseComp | 83.5 | vs. Claude Opus 4.7's 79.3 |
Two things before you weight these for a production decision:
- The comparison baseline is stale. MiniMax benchmarked against Claude Opus 4.7 — Opus 4.8 shipped four days before M3's launch. Against the actual current frontier, M3 sits further back than the launch post implies.
- Self-reported, self-scaffolded. Several results ran on MiniMax's own infrastructure with agent scaffolding (Claude Code, Mini-SWE-Agent, Terminus). Weights are now public on Hugging Face, so independently reproduced numbers should surface over the coming weeks — re-check before leaning on these for anything critical.
The long-horizon autonomy demos are the more interesting signal for agentic use cases: M3 ran unsupervised for ~12 hours reproducing an ICLR 2025 paper (18 commits, 23 figures, core experimental claims validated), and separately spent ~24 hours on a CUDA/Hopper kernel optimization task — 147 submissions, 1,959 tool calls, pushing hardware utilization from 7.6% to 71.3%, continuing to improve well past the point where most models plateau and stop.
On DEVUP AI: specs and pricing
| Spec | Value |
|---|---|
| Model ID | MiniMaxAI/MiniMax-M3 |
| Pricing | 105 DZD in / 420 DZD out / 21 DZD cached — per 1M tokens |
| Context window (as exposed) | 524K tokens |
| Architecture | MoE, ~428B total / ~23B active |
| Capabilities | Public · JSON mode · Function calling · Multimodal |
One detail worth noticing: MiniMax's official ceiling is 1M tokens, but they only guarantee 512K. 512 × 1024 = 524,288, which is exactly the 524K we expose. We're not truncating the model's capability; we're exposing the guaranteed floor rather than the marketing ceiling, so your context-length assumptions hold under load instead of degrading silently past the guaranteed range.
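The pricing and context figures above translate into a small budgeting helper. This is a sketch under stated assumptions: that cached tokens are the part of the prompt billed at the cached rate instead of the input rate, and that the 524,288-token window has to hold the prompt plus the requested output. Neither is confirmed by the table, and `estimateCostDzd` and `fitsGuaranteedContext` are hypothetical names, not part of any SDK.

```typescript
// Prices from the specs table, in DZD per 1M tokens.
const PRICE_DZD_PER_MILLION = { input: 105, output: 420, cachedInput: 21 };

// 512 * 1024 tokens: the guaranteed context floor exposed on the platform.
const GUARANTEED_CONTEXT_TOKENS = 512 * 1024;

interface TokenUsage {
  promptTokens: number;
  cachedPromptTokens: number; // assumed to be a subset of promptTokens
  completionTokens: number;
}

function estimateCostDzd(usage: TokenUsage): number {
  if (usage.cachedPromptTokens > usage.promptTokens) {
    throw new RangeError("cachedPromptTokens cannot exceed promptTokens");
  }
  const uncached = usage.promptTokens - usage.cachedPromptTokens;
  const total =
    uncached * PRICE_DZD_PER_MILLION.input +
    usage.cachedPromptTokens * PRICE_DZD_PER_MILLION.cachedInput +
    usage.completionTokens * PRICE_DZD_PER_MILLION.output;
  return total / 1_000_000;
}

// Assumption: prompt and requested output share the same window.
function fitsGuaranteedContext(promptTokens: number, maxOutputTokens: number): boolean {
  return promptTokens + maxOutputTokens <= GUARANTEED_CONTEXT_TOKENS;
}

// A 400K-token prompt with a 4,096-token answer, first cold and then mostly cached.
const coldCostDzd = estimateCostDzd({ promptTokens: 400_000, cachedPromptTokens: 0, completionTokens: 4096 });
const warmCostDzd = estimateCostDzd({ promptTokens: 400_000, cachedPromptTokens: 300_000, completionTokens: 4096 });
```

Under these assumptions the cold request costs about 43.72 DZD and the mostly cached one about 18.52 DZD, which is why prompt caching matters more than output length for long-context workloads.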
Production integration
Basic completion, with real retry/error handling
The gateway returns OpenAI-compatible error bodies (error.message, error.type, error.code). 400/401/403 are never retryable — fix the payload, key, or balance. 429 and 500 are, and 429 responses include a Retry-After header you should actually respect instead of guessing a backoff.
```typescript
import { DevUpAI } from "devupai";

const MODEL = "MiniMaxAI/MiniMax-M3";

const devup = new DevUpAI({
  apiKey: process.env.DEVUP_API_KEY!, // never hardcode; pull from a secrets manager in prod
});

interface DevUpErrorBody {
  error: { message: string; type: string; code: string };
}

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

/**
 * Calls MiniMax-M3 through the DEVUP AI gateway with production-grade handling:
 * - explicit timeout (default 2 minutes); long-context requests (up to 524K tokens)
 *   can legitimately take minutes, not seconds, so raise timeoutMs for the largest prompts
 * - retry only on documented-retryable statuses (429, 500)
 * - honors the Retry-After header on 429 instead of guessing a backoff
 * - exponential backoff as a fallback when Retry-After is absent or not a number
 *   of seconds; jitter is added in both cases
 */
async function callMiniMaxM3(
  messages: ChatMessage[],
  { timeoutMs = 120_000, maxRetries = 3 }: { timeoutMs?: number; maxRetries?: number } = {}
): Promise<string> {
  let attempt = 0;
  while (true) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const completion = await devup.chat.completions.create(
        { model: MODEL, messages, max_tokens: 4096 },
        { signal: controller.signal }
      );
      return completion.choices[0]?.message?.content ?? "";
    } catch (err: any) {
      const status: number | undefined = err?.status ?? err?.response?.status;
      const body: DevUpErrorBody | undefined = err?.error ?? err?.response?.data;
      const code = body?.error?.code;
      const retryable = status === 429 || status === 500;
      if (!retryable || attempt >= maxRetries) {
        throw new Error(
          `MiniMax-M3 request failed (${status ?? "unknown"} / ${code ?? "no_code"}): ${
            body?.error?.message ?? err.message
          }`
        );
      }
      // Number(undefined) and non-numeric header values yield NaN, which would make
      // setTimeout fire immediately, so fall back to exponential backoff in that case.
      const retryAfterSec = Number(err?.response?.headers?.["retry-after"]);
      const retryAfterMs =
        Number.isFinite(retryAfterSec) && retryAfterSec > 0 ? retryAfterSec * 1000 : 2 ** attempt * 1000;
      const jitterMs = Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, retryAfterMs + jitterMs));
      attempt++;
    } finally {
      clearTimeout(timer);
    }
  }
}
```
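One edge case worth handling separately: the HTTP specification allows `Retry-After` to carry either a number of seconds or an HTTP date. Whether the gateway ever sends the date form is not documented here, so treat the following as a defensive sketch. `retryDelayMs` and its 60-second cap are hypothetical choices, not part of the SDK.

```typescript
// Converts a Retry-After header value into a delay in milliseconds.
// Accepts delay-seconds ("5") or an HTTP date; otherwise falls back to
// exponential backoff based on the attempt number. All delays are capped.
function retryDelayMs(
  retryAfter: string | undefined,
  attempt: number,
  nowMs: number,
  maxDelayMs: number = 60_000
): number {
  const fallback = Math.min(Math.pow(2, attempt) * 1000, maxDelayMs);
  if (!retryAfter) return fallback;

  const trimmed = retryAfter.trim();
  if (/^\d+$/.test(trimmed)) {
    return Math.min(Number(trimmed) * 1000, maxDelayMs);
  }

  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return fallback;

  // A date in the past means "retry now"; never return a negative delay.
  return Math.min(Math.max(dateMs - nowMs, 0), maxDelayMs);
}

// Example inputs: seconds form, date form, and a missing header on the third attempt.
const delayFromSeconds = retryDelayMs("5", 0, 0);
const delayFromDate = retryDelayMs(
  "Wed, 21 Oct 2015 07:28:10 GMT",
  0,
  Date.UTC(2015, 9, 21, 7, 28, 0)
);
const delayFromBackoff = retryDelayMs(undefined, 2, 0);
```

Passing the current time in as `nowMs` keeps the function pure and easy to test; in a retry loop you would pass `Date.now()` and still add jitter on top of the returned delay.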
Streaming (SSE)
For anything user-facing, stream rather than block — especially at long context, where full-response latency can be substantial.
```typescript
const stream = await devup.chat.completions.create({
  model: MODEL,
  messages: [{ role: "user", content: "Summarize this 400K-token incident log and flag anomalies." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
  if (chunk.usage) {
    console.log(`\n[tokens] prompt=${chunk.usage.prompt_tokens} completion=${chunk.usage.completion_tokens}`);
  }
}
```
Multimodal input
```typescript
const completion = await devup.chat.completions.create({
  model: MODEL,
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: "https://example.com/architecture-diagram.png" } },
        { type: "text", text: "Review this system architecture diagram and flag single points of failure." },
      ],
    },
  ],
});
```
Tool calling
Standard OpenAI-compatible tools array works unmodified. One tuning note specific to tool calling: MiniMax's own recommended defaults (temperature=1.0, top_p=0.95) are tuned for general generation quality, not argument-parsing reliability. For tool-calling workloads, drop temperature below 1.0 — consistent with our platform-wide guidance for structured function-call output regardless of model.
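As a minimal sketch of what that looks like in practice, the snippet below builds an OpenAI-style request body with a `tools` array and validates the arguments of a returned tool call before acting on them. The `get_order_status` tool, the `parseOrderStatusArgs` helper, and the temperature of 0.3 are all illustrative assumptions; the only guidance from the text above is to go below 1.0.

```typescript
// Shape of a tool call in OpenAI-compatible responses: arguments arrive as a JSON string.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Hypothetical tool definition, declared with a JSON Schema for its parameters.
const orderTools = [
  {
    type: "function" as const,
    function: {
      name: "get_order_status",
      description: "Look up the current status of an order by its id.",
      parameters: {
        type: "object",
        properties: { orderId: { type: "string" } },
        required: ["orderId"],
      },
    },
  },
];

// Request body to send to the chat completions endpoint.
const toolRequestBody = {
  model: "MiniMaxAI/MiniMax-M3",
  messages: [{ role: "user" as const, content: "Where is order A-42?" }],
  tools: orderTools,
  temperature: 0.3, // illustrative value below the 1.0 default
};

interface ParsedArgs {
  ok: boolean;
  orderId?: string;
  reason?: string;
}

// Never trust model-generated arguments: parse and validate before calling real code.
function parseOrderStatusArgs(call: ToolCall): ParsedArgs {
  if (call.function.name !== "get_order_status") {
    return { ok: false, reason: "unexpected tool: " + call.function.name };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(call.function.arguments);
  } catch (_err) {
    return { ok: false, reason: "arguments are not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "arguments must be a JSON object" };
  }
  const orderId = (parsed as { orderId?: unknown }).orderId;
  if (typeof orderId !== "string" || orderId.length === 0) {
    return { ok: false, reason: "orderId must be a non-empty string" };
  }
  return { ok: true, orderId: orderId };
}
```

When validation fails, a common pattern is to send the `reason` back to the model as the tool result so it can correct the call, rather than retrying the same request blindly.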
API vs. self-hosting the open weights
Weights are public on Hugging Face (~428B total / ~23B active MoE), so choosing between the API and self-hosting is a real architectural decision:
| | Via DEVUP AI (API) | Self-hosted (open weights) |
|---|---|---|
| Time to integrate | Minutes — OpenAI-compatible endpoint | Days — the released MSA kernel targets H800; broader GPU/framework support (llama.cpp, etc.) is still maturing and currently falls back to dense attention |
| Cost profile | Pay-per-token, DZD billing, no upfront infra | High upfront GPU cost — 428B params means serious VRAM even with MoE sparsity |
| Data residency | Traffic routed through our gateway; see our DPA for compliance requirements | Full control — matters if you're bound by strict residency rules |
| Long-context throughput | Shared infra, subject to queueing under load | Full control over batching/throughput tuning, but you own the kernel engineering |
One honest note on data handling: M3 is developed in Shanghai, and Chinese entities operate under legal obligations to cooperate with state intelligence requests. That's not a reason to avoid the model — it's a reason to be deliberate about what you send it, same as with any third-party inference endpoint. If you're processing regulated or sensitive data, self-hosting the open weights (or scrubbing payloads before routing through any API) is the safer default.
Why this matters for us
Long context, native multimodal, and agentic coding — at open-weight economics, with a documented and open-sourced attention kernel — means Algerian and MENA teams building RAG pipelines, code-review agents, or document-heavy workflows now have a genuinely frontier-adjacent option that doesn't require a foreign card or USD billing. One key, one DZD invoice, 170+ models including this one.
Primary sources: arXiv:2606.13392 (MSA technical report), MiniMax-M3 on Hugging Face, DEVUP AI model page. Secondary: MiniMax official model page, VentureBeat, The Decoder, Artificial Analysis, TechTimes.
