Jangwook Kim

Posted on • Originally published at effloow.com

MiniMax M2.5 API Guide: 80% SWE-Bench at $0.15/M Tokens

If you are building a coding agent that will run thousands of agentic loops per day, the model you choose determines whether your infrastructure bill is $50/day or $1,000/day — for nearly identical task performance.

MiniMax M2.5 is the clearest illustration of that gap yet. It scores 80.2% on SWE-Bench Verified — within 0.6 percentage points of Claude Opus 4.6 at 80.8% — while charging $0.15 per million input tokens versus $5.00 for Opus. On output tokens, where agentic workflows spend most of their budget, the ratio is roughly 26× in M2.5's favour.

Effloow Lab replayed M2.5's published evaluation data against competing models and built a cost-per-task comparison table to quantify what that gap means in production. The full methodology is in data/lab-runs/minimax-m2-5-paper-poc.md.

What MiniMax M2.5 Actually Is

MiniMax is a Chinese AI company founded in 2021. Its M-series models have been benchmarking progressively closer to frontier proprietary models since the M1 generation. M2.5 is the company's coding-specialist release, trained specifically for agentic workflows and software engineering tasks — and released with open weights on Hugging Face.

Under the hood, M2.5 runs a 230 billion parameter Mixture-of-Experts (MoE) architecture, but only activates 8 of its 256 experts per forward pass. That means roughly 10 billion parameters are active at any given moment — close to a dedicated 10B dense model for inference compute, while retaining the emergent capability of a much larger parameter space learned during training.

That architectural choice is the reason for the price efficiency. You pay inference costs proportional to ~10B active parameters, not 230B.

Key specifications:

| Specification | Value |
| --- | --- |
| Total parameters | 230B |
| Active parameters / token | ~10B (8 of 256 experts) |
| Context window (API) | 196,608 tokens |
| Max output | 131,072 tokens |
| Training approach | CISPO RL, 200K+ real-world environments |
| Open weights | Yes (MiniMaxAI/MiniMax-M2.5 on Hugging Face) |

Two API variants are available: M2.5 standard for cost-sensitive batch workloads and M2.5-Highspeed for latency-sensitive interactive pipelines. Both produce identical benchmark results — the only difference is throughput and a small output price premium for the faster variant.

The Benchmark Numbers

MiniMax published its evaluation methodology openly. The company ran SWE-Bench Verified using Claude Code as the scaffolding with the default system prompt overridden, averaged over four runs, and cross-validated on Droid and Opencode scaffoldings.

Published MiniMax M2.5 benchmark results:

| Benchmark | Score | Context |
| --- | --- | --- |
| SWE-Bench Verified | 80.2% | Top-3 globally at time of release |
| Multi-SWE-Bench | 51.3% | #1 globally (ahead of Opus 4.6 at 50.3%) |
| BrowseComp | 76.3% | Agentic web research tasks |
| Speed vs M2.1 | 37% faster | 22.8 min/task vs 31.3 min |
| Tool-call efficiency | 20% fewer rounds vs M2.1 | Reduced latency and token spend |

Multi-SWE-Bench is particularly significant. It tests multilingual repositories across 10+ programming languages — a harder benchmark that requires generalization beyond English-only GitHub issues. M2.5's first-place ranking there suggests it is not overfitted to English codebases.

The average token consumption per SWE-Bench Verified task was 3.52 million tokens (input + output combined). That number is the basis for Effloow Lab's cost comparison table in the next section.

Benchmark Replay: Cost-Per-Task Analysis

Effloow Lab reproduced the cost-per-task comparison using M2.5's published 3.52M tokens/task average and each model's publicly listed API pricing. The calculation applies a 70/30 input/output split — conservative for agentic coding workloads where context accumulates but per-turn code generation is bounded. Full methodology in data/lab-runs/minimax-m2-5-paper-poc.md.

| Model | SWE-Bench | Input $/M | Output $/M | Est. cost/task |
| --- | --- | --- | --- | --- |
| MiniMax M2.5 | 80.2% | $0.15 | $0.95 | ~$1.37 |
| MiniMax M2.5-Highspeed | 80.2% | $0.15 | $1.15 | ~$1.58 |
| Claude Haiku 4.5 | 73.3% | $1.00 | $5.00 | ~$7.74 |
| Gemini 3.1 Pro | 80.6% | $2.00 | $12.00 | ~$17.55 |
| Claude Sonnet 4.6 | ~79.6% | $3.00 | $15.00 | ~$23.23 |
| Claude Opus 4.6 | 80.8% | $5.00 | $25.00 | ~$38.72 |

Estimates use published 3.52M avg tokens/task at 70/30 input/output split. Production ratios vary by agent design.
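
To make the arithmetic reproducible, here is a minimal sketch of the same estimate in Python. Prices and the 3.52M tokens/task figure come from the table above; the 70/30 split is the stated assumption, and the model keys are just labels.

# Sketch of the cost-per-task estimate: 3.52M tokens/task at a 70/30
# input/output split, priced at each model's published per-million rates.
TOKENS_PER_TASK = 3_520_000
INPUT_SHARE, OUTPUT_SHARE = 0.70, 0.30

pricing = {  # (input $/M, output $/M)
    "minimax-m2.5": (0.15, 0.95),
    "minimax-m2.5-highspeed": (0.15, 1.15),
    "claude-haiku-4.5": (1.00, 5.00),
    "claude-opus-4.6": (5.00, 25.00),
}

for model, (inp, out) in pricing.items():
    cost = (TOKENS_PER_TASK * INPUT_SHARE / 1e6) * inp + (TOKENS_PER_TASK * OUTPUT_SHARE / 1e6) * out
    print(f"{model}: ~${cost:.2f}/task")
# minimax-m2.5 -> ~$1.37/task, claude-opus-4.6 -> ~$38.72/task, matching the table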

Three patterns stand out:

The Haiku comparison is counterintuitive. Claude Haiku 4.5 is marketed as the budget-tier model, but at $1/$5 pricing it costs roughly 5.7× more per task than M2.5 — while scoring 6.9 percentage points lower on SWE-Bench. For high-volume agentic coding, M2.5 is both cheaper and more capable than Haiku.

Sonnet vs M2.5 is a 17× price gap for 0.6 SWE-Bench points. Claude Sonnet 4.6 is an excellent model at ~79.6% SWE-Bench, but the cost ratio over M2.5 is hard to justify unless you specifically need Claude's particular output style, tool-call format, or compliance posture.

M2.5-Highspeed is worth its ~21% output premium for latency-sensitive workloads. The throughput upgrade is real, and for interactive coding tools where a human is waiting, the premium is small relative to the overall cost advantage over alternatives.

Architecture Deep Dive: Why MoE Changes the Cost Equation

The cost story is structural, not promotional. Understanding why helps predict whether M2.5's pricing is durable.

In a standard dense transformer, every forward pass activates all parameters. In M2.5's MoE architecture, each token selects 8 experts from a pool of 256. Physically, the model stores 230B parameters, but each individual forward pass routes through roughly 8 × (230B ÷ 256) ≈ 7.2B expert parameters, which, together with the shared attention and embedding layers every token passes through, lines up with the ~10B active figure in the spec table. That is close to small-model territory for compute, while the full 230B capacity is available in aggregate across different token types and task domains.

What MiniMax added beyond raw architecture:

CISPO algorithm. Standard RL optimization on MoE models causes expert routing instability — a few high-utilization experts get overtrained while others atrophy. MiniMax's CISPO algorithm regularizes expert load during large-scale training to prevent collapse. This is why the model generalizes well rather than excelling at one narrow task type.

Process rewards during agent rollouts. Standard RL training uses outcome rewards: did the final answer pass the test suite? For 30-minute agentic loops with thousands of tokens, outcome rewards give the model poor signal — you cannot easily trace which intermediate decision led to success or failure. M2.5 introduces process rewards at intermediate steps, giving the training run finer-grained attribution. This is what drives the 20% reduction in tool-calling rounds: the model learns to make more decisive, targeted calls rather than tentative multi-step exploration.

200,000+ real-world environment simulations. The training data covers real software development workflows — pull request discussions, CI/CD interactions, code review cycles — not just polished benchmark instances. This explains the strong Multi-SWE-Bench cross-language performance.

API Setup: From Zero to Coding Agent

MiniMax provides an OpenAI-compatible API and also supports the Anthropic SDK — in fact, MiniMax officially recommends the Anthropic SDK as the primary integration path for M2.5. This is an unusual positioning that likely reflects stronger prompt formatting for long-context agentic tasks.
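
As a quick sketch of that Anthropic-SDK path, the snippet below uses the official Anthropic Python SDK pointed at MiniMax. The base_url shown is an assumption about where the Anthropic-compatible endpoint lives, so confirm the exact path in the MiniMax platform docs.

# Hedged sketch: calling M2.5 through the Anthropic SDK.
# The base_url is an assumption; check the MiniMax docs for the exact
# Anthropic-compatible endpoint path.
from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimax.io/anthropic",  # assumed endpoint
)

message = client.messages.create(
    model="minimax-m2.5",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Explain what this stack trace means: ..."}],
)
print(message.content[0].text)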

Getting API Access

  1. Sign up at platform.minimax.io
  2. Generate an API key from the dashboard
  3. Free tier: 20 RPM (requests per minute) — sufficient for development and testing
  4. Paid usage bills per token with no subscription required

Basic Chat Request (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimax.io/v1",
)

response = client.chat.completions.create(
    model="minimax-m2.5-highspeed",  # or "minimax-m2.5" for standard
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that parses a git diff and extracts changed line numbers."
        }
    ],
    max_tokens=2048,
)

print(response.choices[0].message.content)

Switch between minimax-m2.5 and minimax-m2.5-highspeed in the model field; no other parameters or API changes are needed.

Tool-Calling Agent Loop

M2.5 reduces tool-calling rounds by 20% compared to M2.1, which compounds across long agentic sessions. Here is a minimal loop pattern using file reading as a tool:

import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimax.io/v1",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path to read"
                    }
                },
                "required": ["path"]
            }
        }
    }
]

messages = [
    {
        "role": "user",
        "content": "Read requirements.txt and identify any outdated packages."
    }
]

while True:
    response = client.chat.completions.create(
        model="minimax-m2.5",
        messages=messages,
        tools=tools,
    )
    msg = response.choices[0].message

    # No tool calls means the model has produced its final answer.
    if not msg.tool_calls:
        print(msg.content)
        break

    # Echo the assistant turn back into the transcript. Dumping the tool calls
    # to plain dicts keeps the payload portable across OpenAI-compatible endpoints.
    messages.append({
        "role": "assistant",
        "content": msg.content,
        "tool_calls": [call.model_dump() for call in msg.tool_calls],
    })

    # Execute each requested tool and append the result as a "tool" message.
    for call in msg.tool_calls:
        if call.function.name == "read_file":
            args = json.loads(call.function.arguments)
            try:
                with open(args["path"]) as f:
                    result = f.read()
            except FileNotFoundError:
                result = f"File not found: {args['path']}"

            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })

This pattern works without modification against the MiniMax API because the tool schema and message structure follow the OpenAI spec.

Automatic Prompt Caching

One of M2.5's developer-friendly features is automatic cache support with zero configuration. Unlike Claude's explicit cache_control breakpoints or OpenAI's 1024-token minimum requirement, MiniMax's cache activates without any API parameter changes. For applications with stable system prompts — most agents — cache benefits apply from the first deployment without engineering investment.
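
As an illustration (reusing the OpenAI-compatible client from the setup section above), the only requirement is keeping the long, stable prefix identical across calls; there is no cache parameter to set. The diffs below are hypothetical placeholders.

# Sketch: no cache_control markers or cache flags. Repeating the same stable
# system prompt across requests is enough for MiniMax's automatic caching.
SYSTEM_PROMPT = "You are a code-review agent for our Python monorepo. ..."  # long, stable prefix

diffs = ["diff --git a/app.py ...", "diff --git a/utils.py ..."]  # hypothetical inputs
for diff in diffs:
    resp = client.chat.completions.create(
        model="minimax-m2.5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix every call
            {"role": "user", "content": f"Review this diff:\n{diff}"},
        ],
        max_tokens=1024,
    )
    print(resp.choices[0].message.content)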

M2.5 Standard vs M2.5-Highspeed

The two variants serve different workloads. Use this table to route:

| Factor | M2.5 Standard | M2.5-Highspeed |
| --- | --- | --- |
| Benchmark results | Identical | Identical |
| Throughput | Standard TPS | Higher TPS |
| Input cost | $0.15/M | $0.15/M |
| Output cost | $0.95/M | $1.15/M |
| Best for | Batch jobs, nightly CI, data pipelines | Interactive agents, real-time tooling |

For batch processing — nightly security scans, CI code review, mass test generation, data pipeline refactors — standard is clearly the better economic choice. For interactive developer tools where a human is waiting on a response, highspeed reduces perceived latency at a ~21% output cost premium.
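
A simple routing rule captures that split. The helper below is an illustrative policy, not an official recommendation:

# Sketch: route to the highspeed variant only when a human is waiting.
def pick_model(interactive: bool) -> str:
    # Interactive sessions pay the ~21% output premium for lower latency;
    # batch jobs (nightly CI, pipelines) stay on standard pricing.
    return "minimax-m2.5-highspeed" if interactive else "minimax-m2.5"

model = pick_model(interactive=False)  # e.g. a nightly security-scan job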

Where to Run M2.5

Beyond the official MiniMax API, several inference providers host M2.5:

  • Together AI: together.ai/models/minimax-m2-5
  • Lambda AI: available as a hosted endpoint
  • Azure AI Foundry: MiniMax M2 family listed under third-party models
  • Ollama: ollama run minimax-m2.5 (local, requires sufficient VRAM)
  • Hugging Face: weights at MiniMaxAI/MiniMax-M2.5 for self-hosted vLLM/SGLang deploys

For self-hosting with vLLM, tool-call parser support is included. SGLang also supports M2.5 through its OpenAI-compatible server. See the vLLM production guide for deployment patterns.
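
For a self-hosted deployment, the same OpenAI client code points at the local server. The sketch below assumes vLLM's default OpenAI-compatible server on port 8000 after something like vllm serve MiniMaxAI/MiniMax-M2.5; tensor-parallel and tool-parser flags depend on your hardware and vLLM version.

# Sketch: talking to a locally served copy of the open weights via vLLM's
# OpenAI-compatible server (default port 8000; API key unused unless configured).
from openai import OpenAI

local = OpenAI(api_key="not-needed-locally", base_url="http://localhost:8000/v1")

resp = local.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Suggest a test plan for a pagination bug fix."}],
    max_tokens=1024,
)
print(resp.choices[0].message.content)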

Common Mistakes When Adopting M2.5

Treating the API as a drop-in swap without testing. The base URL is api.minimax.io/v1 (not api.openai.com/v1). While the schema is OpenAI-compatible, test your error handling and response parsing before deploying — minor differences in response metadata can surface unexpectedly.

Ignoring the 196K context limit. The underlying architecture supports 1M tokens, which MiniMax may expose in future API versions. Current integrations should plan around 196K. For very long agentic sessions, implement context management strategies — compressing prior turns, pruning completed tool-call chains, and tracking token spend with an observability layer like Langfuse.

Routing creative writing through M2.5. Like any coding-specialized model, M2.5 was trained on software engineering tasks. The cost advantage is most relevant in that domain. For open-ended creative work or tasks with very different token distributions from SWE-Bench, benchmark comparisons should not be extrapolated.

Not monitoring tool-call depth. Even at 20% fewer rounds than M2.1, deep tool-call chains accumulate context. Implement a round counter and handle graceful timeout before the context window fills — a common issue in production agentic loops regardless of which model you use.
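
A minimal sketch covering the last two points, building on the tool-calling loop above (client, messages, and tools reused from that example). The token estimate is a rough character count rather than a real tokenizer, and both limits are arbitrary illustrative values.

# Sketch: bound the loop and prune old tool exchanges before the 196K window fills.
MAX_ROUNDS = 25          # illustrative cap, tune per workload
TOKEN_BUDGET = 150_000   # stay comfortably under the context limit

def rough_tokens(message: dict) -> int:
    # Crude ~4-characters-per-token estimate; use a real tokenizer in production.
    return len(str(message.get("content") or "")) // 4

def prune_history(history: list[dict]) -> list[dict]:
    # Drop the oldest assistant/tool turns first; keep the opening task prompt.
    pruned = list(history)
    i = 1
    while sum(rough_tokens(m) for m in pruned) > TOKEN_BUDGET and i < len(pruned) - 2:
        if pruned[i].get("role") in ("assistant", "tool"):
            pruned.pop(i)
        else:
            i += 1
    return pruned

for round_index in range(MAX_ROUNDS):
    messages = prune_history(messages)
    response = client.chat.completions.create(
        model="minimax-m2.5", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    if not msg.tool_calls:
        print(msg.content)
        break
    # ... append the assistant turn and tool results exactly as in the loop above ...
else:
    # No final answer within the cap: fail gracefully instead of looping forever.
    raise RuntimeError(f"Agent did not finish within {MAX_ROUNDS} rounds")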

Frequently Asked Questions

Q: Is MiniMax M2.5 open-source and can I self-host it?

Yes. Weights are on Hugging Face (MiniMaxAI/MiniMax-M2.5) and can be run via Ollama (ollama run minimax-m2.5), vLLM, or SGLang. The model is also available through Together AI, Lambda AI, and Azure AI Foundry. The license permits commercial use.

Q: How does M2.5 compare to DeepSeek V4 on cost?

Both are MoE models targeting price-performance efficiency. DeepSeek V4-Pro (1.6T total parameters, Apache 2.0, 1M context, $0.27/$1.10 per million tokens) has a larger parameter pool and wider context. M2.5 has a more polished API experience, automatic caching, and strong Multi-SWE-Bench performance. The choice between them primarily depends on context length needs and API ecosystem preferences. See the DeepSeek V4 migration guide for the V4 API details.

Q: Does M2.5 work with OpenAI Agents SDK and LangGraph?

Both frameworks accept any OpenAI-compatible endpoint. Set base_url="https://api.minimax.io/v1" and model="minimax-m2.5" in the provider configuration. The main consideration is prompt format compatibility — some frameworks use Claude or GPT-specific system prompt templates that may need adjustment for M2.5.
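
For LangGraph/LangChain specifically, a minimal sketch via langchain-openai looks like this; parameter names assume a recent langchain_openai release, so adjust for your version.

# Sketch: a LangChain/LangGraph-compatible chat model pointed at MiniMax.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="minimax-m2.5",
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimax.io/v1",
)

# Usable directly, or handed to a LangGraph node / create_react_agent.
print(llm.invoke("List three common causes of flaky CI tests.").content)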

Q: What is the rate limit on the free tier?

Free and starter accounts are limited to 20 requests per minute (RPM). For production workloads with higher concurrency, a paid token plan is required. Tier-specific rate limits are listed on the token plan page at platform.minimax.io/docs/guides/pricing-token-plan.

Q: Is there a context caching API similar to Claude's cache_control?

No explicit configuration is needed. Automatic cache support is included and active by default. Repeated system prompts are cached without any code changes. This is simpler than Claude's explicit breakpoint syntax but offers less control over exactly which segments are cached.

Key Takeaways

MiniMax M2.5 is a credible alternative to frontier proprietary models for production coding agents — not just on benchmarks, but on the economics that determine whether an agent system is viable at scale.

The 230B MoE architecture with CISPO training produces results that land within 1 percentage point of Claude Opus 4.6 on SWE-Bench Verified, at $0.15/M input tokens versus $5.00. Per-task cost comparisons from Effloow Lab's benchmark replay put M2.5 at roughly $1.37/task against $38.72/task for Opus — a 28× gap for comparable coding benchmark performance.

M2.5-Highspeed adds throughput for interactive tooling at a ~21% output cost premium. For batch workloads, standard M2.5 is the better default. Automatic caching removes one operational concern entirely.

For teams already running agentic coding pipelines on Claude or GPT models, the case for a routing evaluation is straightforward. The SWE-Bench numbers are close enough that task quality differences will be marginal in most workloads, while the cost reduction is structural.

Bottom Line

MiniMax M2.5 delivers Opus-tier coding performance at a fraction of Haiku-tier pricing. For agentic coding pipelines — large codebase navigation, multi-file edits, CI integration loops — this is the most cost-effective option in the frontier performance tier as of early 2026. Worth evaluating before you commit to another model's output pricing at scale.
