For most of the last two years, choosing the “best coding model” usually meant picking GPT, Claude, or Gemini, paying the per-token rate, and accepting closed weights. That is no longer the only path. Chinese labs are now shipping coding-focused models that either publish weights or price APIs low enough to change the economics of coding agents.
MiniMax M3 landed on June 1, 2026, and is the clearest signal yet. It is open-weight, built for coding and agentic workflows, supports a 1,000,000-token context window, and adds native multimodality. It joins DeepSeek’s V4 family and Alibaba’s Qwen 3.7 as serious contenders for teams that want lower cost, more control, and less vendor lock-in.
The three contenders
MiniMax M3
MiniMax M3 is the newest arrival. MiniMax positions it as a frontier coding model with:
- 1,000,000-token context window
- Native multimodality for image and video input
- Computer-use capabilities
- Coding and agentic workflow focus
- New MSA architecture
MiniMax says open weights and a technical report will follow within roughly ten days of launch. Parameter counts have not been disclosed.
Read the full breakdown: what is MiniMax M3.
DeepSeek V4-Pro
DeepSeek V4-Pro is the reasoning-and-coding workhorse. It is a thinking model that returns a reasoning_content chain before the final answer, which can help on multi-file refactors, dependency changes, renames, and signature updates.
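As a rough sketch of how a client might separate the reasoning chain from the final answer, the helper below assumes the response follows the OpenAI-compatible chat-completion shape, with `reasoning_content` sitting next to `content` on the message. The function name `split_reasoning` and the sample payload are illustrative and are not taken from DeepSeek's documentation.

```python
from typing import Any


def split_reasoning(response: dict[str, Any]) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style chat completion dict.

    Assumes the first choice's message may carry `reasoning_content`;
    a non-thinking model simply yields an empty reasoning string.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Hypothetical payload, for illustration only.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The rename touches utils.py and api.py.",
                "content": "Update both call sites, then rerun the tests.",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample)
```

Keeping the two strings apart lets you log the reasoning for review while passing only the answer to downstream tools.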
DeepSeek also pairs V4-Pro with a cheaper non-thinking V4-Flash variant. Its biggest differentiator is cost.
Official site and API: deepseek.com.
Qwen 3.7
Qwen 3.7 is Alibaba’s flagship family, led by Qwen3.7-Max-Preview. It is a reasoning model with a 1,000,000-token context window and is positioned for long-horizon agent work.
The key caveat: as of its mid-May 2026 launch, Qwen3.7-Max is proprietary and closed-weight. Alibaba has a strong track record of open-sourcing lower tiers, but open 3.7 weights have not shipped.
More details: what is Qwen 3.7.
Open-source repos: github.com/QwenLM.
Spec comparison
| Spec | MiniMax M3 | DeepSeek V4-Pro | Qwen3.7-Max-Preview |
|---|---|---|---|
| Vendor | MiniMax | DeepSeek | Alibaba / Qwen |
| Released | June 1, 2026 | 2026 | May 2026 preview |
| Open weights | Announced; weights due within ~10 days of launch | Expected, based on DeepSeek’s R1/V3 track record | Not yet; flagship is closed-weight |
| Context window | 1,000,000 tokens | Not stated here | 1,000,000 tokens |
| Multimodal | Yes: image, video, computer use | No; text + reasoning | Text-focused reasoning |
| Reasoning / thinking mode | Yes | Yes, via reasoning_content | Yes, extended thinking |
| Parameter count | Not disclosed | Not disclosed here | Not disclosed here |
| Architecture | MSA | Not stated here | Not stated here |
If open weights are a hard requirement, filter first:
- Use MiniMax M3 if you want open weights plus multimodality.
- Use DeepSeek V4-Pro if you want open releases and low API cost.
- Use Qwen3.7-Max only if you are comfortable with a hosted, closed-weight flagship today.
Coding and agentic strength
The public evidence is uneven, so do not compare unsupported numbers as if they were equivalent.
MiniMax M3 launched with vendor-reported coding and agentic benchmarks:
| Benchmark, vendor-reported by MiniMax | MiniMax M3 |
|---|---|
| SWE-Bench Pro | 59.0% |
| Terminal-Bench 2.1 | 66.0% |
| SWE-fficiency | 34.8% |
| KernelBench Hard | 28.8% |
| MCP Atlas | 74.2% |
| PostTrainBench | 0.37 |
| SVG-Bench | Reported above Opus 4.7 |
| OmniDocBench | Reported above Gemini 3.1 Pro |
| Claw-Eval | Reported highest in its set |
SWE-Bench Pro and Terminal-Bench test practical software engineering work: resolving issues, editing code, and operating in a terminal. MCP Atlas focuses on tool use and agent orchestration. You can sanity-check the broader SWE-Bench field on the SWE-Bench leaderboard.
DeepSeek V4-Pro and Qwen 3.7 do not have directly comparable published numbers in the same format here, so a cell-by-cell benchmark table would be misleading.
What is documented:
- DeepSeek V4-Pro: reported by third-party comparisons as landing within a few coding benchmark points of GPT-5.5 while costing much less. Its practical edge is the reasoning chain, especially for multi-file changes. Setup and pricing details: how to use DeepSeek V4-Pro with Cursor.
- Qwen 3.7: reported 57 on the Artificial Analysis Intelligence Index at launch, plus roughly 1,475 Elo on LM Arena with a top-ten coding placement. Alibaba positions it for long autonomous runs and heavy tool use.
Practical takeaway:
- Pick MiniMax M3 if you want the most transparent agentic-coding evidence today.
- Pick DeepSeek V4-Pro if reasoning quality and cost matter most.
- Pick Qwen3.7-Max if you want top composite-intelligence results and can accept a hosted API.
For a broader Qwen comparison, see Qwen 3.7 vs GPT-5.5 vs Opus 4.7.
Context window and long-context cost
MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. DeepSeek V4-Pro’s context size is not stated here.
A million tokens is roughly 700,000 to 750,000 words. That can hold:
- A mid-sized repository
- Long documentation sets
- Multiple PDFs
- Long-running conversation history
- Agent traces and tool outputs
But a large context window is only a ceiling. It does not guarantee perfect recall or stable reasoning across the entire prompt. It also costs money: every token you send is billed.
Use long context only when needed.
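The words-to-tokens rule of thumb above can be turned into a quick pre-flight check. This is a minimal sketch that assumes the article's ratio of 700 to 750 words per 1,000 tokens. Real tokenizers vary by model, and source code usually tokenizes more densely than prose, so treat the result as a rough bound. The function names are hypothetical.

```python
# Words per 1,000 tokens, taken from the "700,000 to 750,000 words
# per million tokens" rule of thumb. Integers avoid float rounding.
WORDS_PER_1K_TOKENS_LOW = 700
WORDS_PER_1K_TOKENS_HIGH = 750


def count_words(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())


def estimate_tokens(word_count: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a given word count."""
    # Ceiling division: -(-a // b)
    low = -(-word_count * 1000 // WORDS_PER_1K_TOKENS_HIGH)
    high = -(-word_count * 1000 // WORDS_PER_1K_TOKENS_LOW)
    return low, high


def fits_in_window(word_count: int, window: int = 1_000_000) -> bool:
    """Conservative check: the high estimate must fit the window."""
    return estimate_tokens(word_count)[1] <= window
```

Using the high estimate for the fit check errs on the side of sending less, which matches the advice to reserve the full window for cases that need it.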
A practical context strategy:
1. Start with the smallest useful prompt.
2. Include only the files relevant to the task.
3. Add dependency files when the model asks or when tests fail.
4. Cache stable project context where the provider supports it.
5. Avoid sending generated logs, build output, or duplicate docs unless needed.
6. Use the 1M-token window for repo-wide analysis, not every request.
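The first two steps of that strategy can be sketched as a small selection routine. This is one possible approach, not a prescribed method: the `SourceFile` fields, the `relevant` flag, and the smallest-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    path: str
    tokens: int     # estimated token count for this file
    relevant: bool  # hypothetical flag: file is directly touched by the task


def select_context(files: list[SourceFile], budget: int) -> list[SourceFile]:
    """Pick task-relevant files, smallest first, until the budget is spent."""
    chosen: list[SourceFile] = []
    used = 0
    relevant = sorted((f for f in files if f.relevant), key=lambda f: f.tokens)
    for f in relevant:
        if used + f.tokens > budget:
            break  # files are sorted, so nothing later fits either
        chosen.append(f)
        used += f.tokens
    return chosen
```

Logs and build output never enter the prompt here because they are not marked relevant. Files that do not fit the budget are left out and can be added on a later turn if the model asks for them or tests fail.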
MiniMax says its MSA architecture is designed for long-context efficiency. Its API uses a standard rate up to 512K input tokens and a separate long-context rate above that threshold. That split reflects the practical reality: long context is a premium tier.
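To budget around a two-tier scheme like this, a helper can take both rates as parameters. The sketch below makes two assumptions that MiniMax has not confirmed: that "512K" means 512,000 tokens, and that a request above the threshold is billed entirely at the long-context rate. No actual MiniMax rates are encoded, so you must supply your own.

```python
def input_rate(
    input_tokens: int,
    standard_rate: float,
    long_context_rate: float,
    threshold: int = 512_000,  # assumption: 512K read as 512,000 tokens
) -> float:
    """Return the per-1M-token input rate that applies to one request."""
    if input_tokens <= threshold:
        return standard_rate
    # Assumption: the whole request moves to the long-context tier.
    return long_context_rate


def input_cost(
    input_tokens: int,
    standard_rate: float,
    long_context_rate: float,
    threshold: int = 512_000,
) -> float:
    """Return the input cost in the same currency as the supplied rates."""
    rate = input_rate(input_tokens, standard_rate, long_context_rate, threshold)
    return input_tokens / 1_000_000 * rate
```

If the provider turns out to bill only the tokens above the threshold at the higher rate, the cost function would need a marginal calculation instead.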
For more tactics, see how to reduce agent token costs.
Price and access
Price is the reason this comparison matters. The same coding-agent workload can cost much less on these models than on Western flagship APIs. That pricing pressure is part of the broader Chinese LLM price war 2026.
DeepSeek V4-Pro has the clearest published per-token pricing.
| Token type | DeepSeek V4-Pro rate per 1M tokens |
|---|---|
| Input, cache miss | $0.435 |
| Input, cache hit | $0.003625 |
| Output | $0.87 |
That output rate is roughly 1/34 the cost of GPT-5.5 output. The non-thinking V4-Flash variant is cheaper still at $0.14 / $0.28 per million input/output tokens. A heavy day of coding-assistant use can land around $1.
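To see how those rates add up for a given workload, here is a small calculator using the V4-Pro figures from the table above. The token volumes and the cache-hit ratio in the example are hypothetical and should be replaced with your own usage data.

```python
# USD per 1M tokens, copied from the DeepSeek V4-Pro table above.
RATE_INPUT_MISS = 0.435
RATE_INPUT_HIT = 0.003625
RATE_OUTPUT = 0.87


def v4_pro_cost(
    input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.0
) -> float:
    """Estimate cost in USD for a given token volume and cache-hit ratio."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        miss_tokens * RATE_INPUT_MISS
        + hit_tokens * RATE_INPUT_HIT
        + output_tokens * RATE_OUTPUT
    )
    return total / 1_000_000


# Hypothetical day: 2M input tokens, half served from cache, 500K output.
daily_cost = v4_pro_cost(2_000_000, 500_000, cache_hit_ratio=0.5)
```

Under those assumed volumes the estimate comes to roughly $0.87, which is in line with the "around $1" figure for a heavy day.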
MiniMax M3 uses token plans rather than one published per-token price:
| Plan | Price |
|---|---|
| Plus | $20 |
| Max | $50 |
| Ultra | $120 |
Its API uses a standard input rate up to 512K tokens and a long-context rate above that. MiniMax has not published an exact per-token figure here, so do not assume one. The plan model may fit teams that prefer predictable monthly spend.
API setup details: how to use the MiniMax M3 API.
Qwen 3.7 is billed per token through Alibaba Cloud. The Max preview went live in May 2026. Alibaba has priced recent Qwen models aggressively, but preview pricing can shift, so check Alibaba Cloud’s current model docs before estimating production cost.
Open weights change the ceiling:
- MiniMax M3: self-hosting becomes possible once weights are published.
- DeepSeek V4-Pro: DeepSeek has a history of open releases.
- Qwen3.7-Max: cannot be self-hosted today because flagship weights are closed.
If avoiding vendor lock-in is a requirement, that difference matters.
Which one should you pick?
| Your priority | Best fit | Why |
|---|---|---|
| Agentic coding with published benchmarks | MiniMax M3 | Vendor-reported SWE-Bench Pro, Terminal-Bench, and MCP Atlas numbers |
| Multimodal input | MiniMax M3 | Image, video, and computer-use support |
| Lowest cost on high-volume API traffic | DeepSeek V4-Pro | $0.87/1M output, cheaper Flash variant, cache-hit pricing |
| Reasoning-driven code quality | DeepSeek V4-Pro | Thinking chain helps on multi-file dependencies |
| Top composite-intelligence score on a public board | Qwen3.7-Max | Reported AA Intelligence Index score of 57 at launch |
| Long-horizon autonomous agents | Qwen3.7-Max or MiniMax M3 | Both target extended tool-use workflows |
| Self-hosting / no vendor lock-in | MiniMax M3 or DeepSeek V4-Pro | MiniMax has committed to publishing weights and DeepSeek has a record of open releases; the Qwen flagship is closed |
Simple decision tree:
Need open weights?
  Yes -> MiniMax M3 or DeepSeek V4-Pro
  No -> Qwen3.7-Max is also viable
Need multimodal input?
  Yes -> MiniMax M3
  No -> Continue
Need the lowest API bill?
  Yes -> DeepSeek V4-Pro
  No -> Continue
Need long-horizon hosted agents?
  Yes -> Qwen3.7-Max or MiniMax M3
Need benchmark transparency for agentic coding?
  Yes -> MiniMax M3
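One way to read the tree is as an ordered series of checks where the first "Yes" wins, after the open-weights answer has narrowed the pool. The function below is a sketch of that reading only. The parameter names are hypothetical, and the tree could reasonably be interpreted differently when several needs apply at once.

```python
def pick_models(
    *,
    open_weights: bool,
    multimodal: bool = False,
    lowest_bill: bool = False,
    long_horizon: bool = False,
    benchmark_transparency: bool = False,
) -> list[str]:
    """Walk the decision tree in order and return the suggested models."""
    pool = ["MiniMax M3", "DeepSeek V4-Pro"]
    if not open_weights:
        pool.append("Qwen3.7-Max")  # closed-weight flagship is also viable

    checks = [
        (multimodal, ["MiniMax M3"]),
        (lowest_bill, ["DeepSeek V4-Pro"]),
        (long_horizon, ["Qwen3.7-Max", "MiniMax M3"]),
        (benchmark_transparency, ["MiniMax M3"]),
    ]
    for needed, picks in checks:
        if needed:
            # Keep only picks that survived the open-weights filter.
            return [m for m in picks if m in pool]
    return pool
```

For example, asking for open weights and long-horizon agents leaves only MiniMax M3, because the Qwen flagship is filtered out in the first step.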
Test them yourself
Leaderboards measure someone else’s workload. Your codebase is the real benchmark.
All three models expose APIs, and the practical way to choose is to run the same prompts against each one and compare:
- Correctness
- Patch quality
- Tool-call structure
- Latency
- Token usage
- Cost
- Failure modes
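If you record each run yourself, a few of these dimensions can be rolled up per model with very little code. The sketch below assumes you log a pass/fail result, wall-clock latency, and total token count for every prompt. The `Run` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    model: str
    passed: bool       # did the patch pass your own tests?
    latency_s: float   # wall-clock seconds for the request
    total_tokens: int  # prompt plus completion tokens


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Group runs by model and compute pass rate and simple averages."""
    by_model: dict[str, list[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    return {
        model: {
            "pass_rate": sum(r.passed for r in rs) / len(rs),
            "avg_latency_s": mean(r.latency_s for r in rs),
            "avg_tokens": mean(r.total_tokens for r in rs),
        }
        for model, rs in by_model.items()
    }
```

Patch quality and failure modes still need a human read, so this covers only the parts of the comparison that reduce to numbers.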
Use Apidog to set up one project with three environments:
Environment 1: MiniMax M3
Environment 2: DeepSeek V4-Pro
Environment 3: Qwen3.7-Max
Then create one OpenAI-compatible chat request and switch environments.
Example request shape:
{
  "model": "{{MODEL_NAME}}",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior software engineer. Return concise implementation steps and code changes."
    },
    {
      "role": "user",
      "content": "Refactor this module to remove duplicated validation logic. Explain the files that need to change."
    }
  ],
  "temperature": 0.2
}
Use environment variables for provider-specific values:
BASE_URL={{MODEL_BASE_URL}}
API_KEY={{MODEL_API_KEY}}
MODEL_NAME={{MODEL_NAME}}
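Outside Apidog, the same request can be assembled in a short script. The sketch below assumes each provider exposes an OpenAI-compatible `/chat/completions` path under its base URL, and that the three values are exported as process environment variables named `MODEL_BASE_URL`, `MODEL_API_KEY`, and `MODEL_NAME`. Both are assumptions, so check each provider's documentation for the actual endpoint.

```python
import json
import os
import urllib.request

SYSTEM_PROMPT = (
    "You are a senior software engineer. "
    "Return concise implementation steps and code changes."
)


def build_chat_request(
    base_url: str, api_key: str, model: str, user_prompt: str
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        # Assumed path; providers may differ.
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def request_from_env(user_prompt: str) -> urllib.request.Request:
    """Read provider settings from the environment (assumed variable names)."""
    return build_chat_request(
        os.environ["MODEL_BASE_URL"],
        os.environ["MODEL_API_KEY"],
        os.environ["MODEL_NAME"],
        user_prompt,
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Switching providers then means changing only the three environment values, which mirrors the environment switch in Apidog.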
In Apidog, you can:
- Send the same prompt batch to all three models.
- Diff responses in one place.
- Save golden responses and replay them after prompt changes.
- Validate tool_calls and reasoning_content with JSON Schema assertions.
- Track whether a prompt edit breaks your agent contract.
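The kind of contract those assertions enforce can also be approximated in a script. The check below is a hand-rolled stand-in for JSON Schema, not Apidog's implementation. It assumes the OpenAI-style `tool_calls` layout, where each entry carries a `function` object with a `name` and a JSON-encoded `arguments` string. The function name is hypothetical.

```python
import json
from typing import Any


def check_agent_contract(
    message: dict[str, Any], require_reasoning: bool = False
) -> list[str]:
    """Return a list of contract problems; an empty list means it passed."""
    problems: list[str] = []

    if require_reasoning and not isinstance(
        message.get("reasoning_content"), str
    ):
        problems.append("reasoning_content missing or not a string")

    for i, call in enumerate(message.get("tool_calls") or []):
        fn = call.get("function") or {}
        if not isinstance(fn.get("name"), str):
            problems.append(f"tool_calls[{i}].function.name missing")
        try:
            # Arguments are expected to be a JSON-encoded string.
            json.loads(fn.get("arguments", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append(
                f"tool_calls[{i}].function.arguments is not valid JSON"
            )
    return problems
```

Running this against saved responses after each prompt edit gives a quick signal when a model starts emitting malformed tool calls.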
Download Apidog here: Download Apidog.
MiniMax setup details: how to use the MiniMax M3 API.
Frequently asked questions
Which is the best open-weight coding model in 2026 right now?
For launch-day agentic-coding evidence, MiniMax M3 leads because it published task-level numbers such as SWE-Bench Pro 59.0% and Terminal-Bench 2.1 66.0%, though they are vendor-reported.
DeepSeek V4-Pro is the value pick: strong coding performance at roughly 1/34 the GPT-5.5 output price. Qwen3.7-Max has a top composite leaderboard result but is not open-weight yet.
The honest answer: run your own workload before committing.
Are all three truly open-weight?
No.
- MiniMax M3: open-weight, with weights and a technical report due within roughly ten days of its June 1, 2026 launch.
- DeepSeek V4-Pro: DeepSeek has a long record of open-weight releases across R1 and V3.
- Qwen3.7-Max-Preview: proprietary and closed-weight as of mid-May 2026.
More Qwen details: what is Qwen 3.7.
Which has the biggest context window?
MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. That is roughly 700,000 to 750,000 words.
DeepSeek V4-Pro’s context window is not stated here.
Remember: a large context window is not the same as perfect recall, and every token is billed.
Which is cheapest to run?
Based on published per-token rates, DeepSeek V4-Pro is the clear leader:
- $0.435 / 1M input tokens on cache miss
- $0.003625 / 1M input tokens on cache hit
- $0.87 / 1M output tokens
The cheaper V4-Flash variant is $0.14 / $0.28 per million input/output tokens.
MiniMax M3 uses monthly token plans. Qwen3.7-Max bills per token through Alibaba Cloud. If you self-host an open-weight model, your marginal cost becomes hardware rather than API tokens.
More pricing context: Chinese LLM price war 2026.
Is MiniMax M3 better than DeepSeek V4-Pro at coding?
The numbers are not directly comparable yet.
MiniMax M3 published SWE-Bench Pro and Terminal-Bench results at launch. DeepSeek V4-Pro has not reported those same tasks in the same format here.
M3’s edge is published agentic-coding evidence plus multimodality. DeepSeek’s edge is price and reasoning-driven performance on multi-file refactors.
The fair test is to run identical prompts against both models on your own repository.
The short version
Choose based on the constraint that matters most:
- MiniMax M3: best fit for published agentic-coding benchmarks, 1M context, multimodality, and open-weight direction.
- DeepSeek V4-Pro: best fit for low-cost, high-volume coding agents and reasoning-heavy refactors.
- Qwen3.7-Max: best fit for hosted long-horizon agents and top composite-intelligence results, if closed weights are acceptable.
Benchmarks will move, and several MiniMax M3 numbers are still vendor-reported. The durable workflow is simple: run the same prompts against all three APIs in one Apidog project, compare outputs and costs, and let your own workload pick the winner.