Hassann

Posted on Jun 1 • Originally published at apidog.com

MiniMax M3 vs DeepSeek V4-Pro vs Qwen 3.7: Best Open-Weight Coding Model in 2026

For most of the last two years, choosing the “best coding model” usually meant picking GPT, Claude, or Gemini, paying the per-token rate, and accepting closed weights. That is no longer the only path. Chinese labs are now shipping coding-focused models that either publish weights or price APIs low enough to change the economics of coding agents.


MiniMax M3 landed on June 1, 2026, and is the clearest signal yet. It is announced as open-weight (the weights themselves are due shortly after launch), built for coding and agentic workflows, supports a 1,000,000-token context window, and adds native multimodality. It joins DeepSeek’s V4 family and Alibaba’s Qwen 3.7 as serious contenders for teams that want lower cost, more control, and less vendor lock-in.

The three contenders

MiniMax M3

MiniMax M3 is the newest arrival. MiniMax positions it as a frontier coding model with:

1,000,000-token context window
Native multimodality for image and video input
Computer-use capabilities
Coding and agentic workflow focus
New MSA architecture

MiniMax says open weights and a technical report will follow within roughly ten days of launch. Parameter counts have not been disclosed.

Read the full breakdown: what is MiniMax M3.

DeepSeek V4-Pro

DeepSeek V4-Pro is the reasoning-and-coding workhorse. It is a thinking model that returns a reasoning_content chain before the final answer, which can help on multi-file refactors, dependency changes, renames, and signature updates.
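A minimal sketch of separating the reasoning chain from the final answer. It assumes an OpenAI-compatible response in which the assistant message carries a `reasoning_content` field next to `content`; the `split_reasoning` helper and the sample payload are hypothetical and only illustrate the shape.

```python
from typing import Optional, Tuple


def split_reasoning(response: dict) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a chat-completion-style response dict."""
    message = response["choices"][0]["message"]
    # reasoning_content may be missing on non-thinking models, so default to None.
    reasoning = message.get("reasoning_content")
    answer = message.get("content") or ""
    return reasoning, answer


# Hypothetical payload, for illustration only; not captured from a real API call.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "rename() is called from two modules, so both need updating.",
                "content": "Update utils.py and cli.py to use the new signature.",
            }
        }
    ]
}

reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the two fields apart lets you log the reasoning for review while passing only the answer to downstream tooling.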

DeepSeek also pairs V4-Pro with a cheaper non-thinking V4-Flash variant. Its biggest differentiator is cost.

Official site and API: deepseek.com.

Qwen 3.7

Qwen 3.7 is Alibaba’s flagship family, led by Qwen3.7-Max-Preview. It is a reasoning model with a 1,000,000-token context window and is positioned for long-horizon agent work.

The key caveat: as of its mid-May 2026 launch, Qwen3.7-Max is proprietary and closed-weight. Alibaba has a strong track record of open-sourcing lower tiers, but open 3.7 weights had not shipped.

More details: what is Qwen 3.7.

Open-source repos: github.com/QwenLM.

Spec comparison

Spec	MiniMax M3	DeepSeek V4-Pro	Qwen3.7-Max-Preview
Vendor	MiniMax	DeepSeek	Alibaba / Qwen
Released	June 1, 2026	2026	May 2026 preview
Open weights	Announced; weights due within ~10 days of launch	Yes, based on DeepSeek’s R1/V3 track record	Not yet; flagship is closed-weight
Context window	1,000,000 tokens	Not stated here	1,000,000 tokens
Multimodal	Yes: image, video, computer use	No; text + reasoning	Text-focused reasoning
Reasoning / thinking mode	Yes	Yes, via `reasoning_content`	Yes, extended thinking
Parameter count	Not disclosed	Not disclosed here	Not disclosed here
Architecture	MSA	Not stated here	Not stated here

If open weights are a hard requirement, filter first:

Use MiniMax M3 if you want open weights plus multimodality.
Use DeepSeek V4-Pro if you want open releases and low API cost.
Use Qwen3.7-Max only if you are comfortable with a hosted, closed-weight flagship today.

Coding and agentic strength

The public evidence is uneven, so do not treat numbers from different sources and benchmarks as if they were equivalent.

MiniMax M3 launched with vendor-reported coding and agentic benchmarks:

Benchmark, vendor-reported by MiniMax	MiniMax M3
SWE-Bench Pro	59.0%
Terminal-Bench 2.1	66.0%
SWE-fficiency	34.8%
KernelBench Hard	28.8%
MCP Atlas	74.2%
PostTrainBench	0.37
SVG-Bench	Reported above Opus 4.7
OmniDocBench	Reported above Gemini 3.1 Pro
Claw-Eval	Reported highest in its set

SWE-Bench Pro and Terminal-Bench test practical software engineering work: resolving issues, editing code, and operating in a terminal. MCP Atlas focuses on tool use and agent orchestration. You can sanity-check the broader SWE-Bench field on the SWE-Bench leaderboard.

DeepSeek V4-Pro and Qwen 3.7 do not have directly comparable published numbers in the same format here, so a cell-by-cell benchmark table would be misleading.

What is documented:

DeepSeek V4-Pro: reported by third-party comparisons as landing within a few coding benchmark points of GPT-5.5 while costing much less. Its practical edge is the reasoning chain, especially for multi-file changes. Setup and pricing details: how to use DeepSeek V4-Pro with Cursor.
Qwen 3.7: reported 57 on the Artificial Analysis Intelligence Index at launch, plus roughly 1,475 Elo on LM Arena with a top-ten coding placement. Alibaba positions it for long autonomous runs and heavy tool use.

Practical takeaway:

Pick MiniMax M3 if you want the most transparent agentic-coding evidence today.
Pick DeepSeek V4-Pro if reasoning quality and cost matter most.
Pick Qwen3.7-Max if you want top composite-intelligence results and can accept a hosted API.

For a broader Qwen comparison, see Qwen 3.7 vs GPT-5.5 vs Opus 4.7.

Context window and long-context cost

MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. DeepSeek V4-Pro’s context size is not stated here.

A million tokens is roughly 700,000 to 750,000 words. That can hold:

A mid-sized repository
Long documentation sets
Multiple PDFs
Long-running conversation history
Agent traces and tool outputs
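The word estimate above is simple arithmetic, sketched here with integer math. The 70 to 75 words per 100 tokens ratio is the article's rule of thumb, which is an approximation for English prose; source code usually tokenizes less efficiently, so treat the result as a rough ceiling. The function names are illustrative.

```python
WORDS_PER_100_TOKENS_LOW = 70   # lower bound of the rule of thumb
WORDS_PER_100_TOKENS_HIGH = 75  # upper bound of the rule of thumb


def words_for_tokens(tokens: int) -> tuple:
    """Approximate (low, high) word counts that fit in `tokens`."""
    return (tokens * WORDS_PER_100_TOKENS_LOW // 100,
            tokens * WORDS_PER_100_TOKENS_HIGH // 100)


def tokens_for_words(words: int) -> int:
    """Conservative token estimate using the low words-per-token ratio."""
    # Ceiling division, so the estimate never undercounts.
    return -(-words * 100 // WORDS_PER_100_TOKENS_LOW)


def fits(words: int, context_tokens: int = 1_000_000) -> bool:
    """True if `words` of text should fit in the given context window."""
    return tokens_for_words(words) <= context_tokens


print(words_for_tokens(1_000_000))  # (700000, 750000)
print(fits(500_000), fits(800_000))  # True False
```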

But a large context window is only a ceiling. It does not guarantee perfect recall or stable reasoning across the entire prompt. It also costs money: every token you send is billed.

Use long context only when needed.

A practical context strategy:

1. Start with the smallest useful prompt.
2. Include only the files relevant to the task.
3. Add dependency files when the model asks or when tests fail.
4. Cache stable project context where the provider supports it.
5. Avoid sending generated logs, build output, or duplicate docs unless needed.
6. Use the 1M-token window for repo-wide analysis, not every request.
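The strategy above can be sketched as a small packing routine: walk the files in priority order and include each one only if it still fits the token budget. The 4-characters-per-token estimate is a crude assumption (real tokenizers differ), and `estimate_tokens` and `pack_context` are hypothetical helper names.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Crude heuristic: about 4 characters per token. An assumption, not a tokenizer.
    return max(1, len(text) // 4)


def pack_context(files: Dict[str, str], priority: List[str], budget: int) -> List[str]:
    """Pick files in priority order without exceeding `budget` tokens."""
    chosen: List[str] = []
    used = 0
    for path in priority:
        cost = estimate_tokens(files[path])
        if used + cost > budget:
            continue  # skip a file that would overflow; a smaller one may still fit
        chosen.append(path)
        used += cost
    return chosen


# Illustrative file contents; sizes chosen to make the budget arithmetic obvious.
files = {
    "validator.py": "x" * 400,   # ~100 tokens
    "build.log": "x" * 800,      # ~200 tokens
    "models.py": "x" * 200,      # ~50 tokens
}
print(pack_context(files, ["validator.py", "build.log", "models.py"], budget=160))
# ['validator.py', 'models.py']
```

Starting with a small budget and raising it only when the model asks for more files keeps most requests well below the long-context tier.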

MiniMax says its MSA architecture is designed for long-context efficiency. Its API uses a standard rate up to 512K input tokens and a separate long-context rate above that threshold. That split reflects the practical reality: long context is a premium tier.

For more tactics, see how to reduce agent token costs.

Price and access

Price is the reason this comparison matters. The same coding-agent workload can cost much less on these models than on Western flagship APIs. That pricing pressure is part of the broader Chinese LLM price war 2026.

DeepSeek V4-Pro has the clearest published per-token pricing.

Token type	DeepSeek V4-Pro rate per 1M tokens
Input, cache miss	$0.435
Input, cache hit	$0.003625
Output	$0.87

That output rate is roughly 1/34 the cost of GPT-5.5 output. The non-thinking V4-Flash variant is cheaper still at $0.14 / $0.28 per million input/output tokens. A heavy day of coding-assistant use can land around $1.
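To see how a day of usage adds up, here is a rough estimator using the V4-Pro rates from the table above. The workload figures (1M cache-miss input tokens, 4M cache-hit input tokens, 500K output tokens) are hypothetical, chosen only to illustrate the arithmetic.

```python
# USD per 1M tokens, taken from the DeepSeek V4-Pro table in this article.
RATES = {"input_miss": 0.435, "input_hit": 0.003625, "output": 0.87}


def estimate_cost(miss_tokens: int, hit_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one batch of usage."""
    total = (
        miss_tokens * RATES["input_miss"]
        + hit_tokens * RATES["input_hit"]
        + output_tokens * RATES["output"]
    )
    return total / 1_000_000


# Hypothetical heavy day: 1M uncached input, 4M cached input, 500K output.
cost = estimate_cost(1_000_000, 4_000_000, 500_000)
print(f"${cost:.4f}")  # 0.435 + 0.0145 + 0.435 = $0.8845
```

The cache-hit rate is small enough that repeated project context barely moves the total, which is why caching stable context pays off.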

MiniMax M3 uses token plans rather than one published per-token price:

Plan	Monthly price
Plus	$20
Max	$50
Ultra	$120

Its API uses a standard input rate up to 512K tokens and a long-context rate above that. MiniMax has not published an exact per-token figure here, so do not assume one. The plan model may fit teams that prefer predictable monthly spend.

API setup details: how to use the MiniMax M3 API.

Qwen 3.7 is billed per token through Alibaba Cloud. The Max preview went live in May 2026. Alibaba has priced recent Qwen models aggressively, but preview pricing can shift, so check Alibaba Cloud’s current model docs before estimating production cost.

Open weights change the ceiling:

MiniMax M3: self-hosting becomes possible once weights are published.
DeepSeek V4-Pro: DeepSeek has a history of open releases.
Qwen3.7-Max: cannot be self-hosted today because flagship weights are closed.

If avoiding vendor lock-in is a requirement, that difference matters.

Which one should you pick?

Your priority	Best fit	Why
Agentic coding with published benchmarks	MiniMax M3	Vendor-reported SWE-Bench Pro, Terminal-Bench, and MCP Atlas numbers
Multimodal input	MiniMax M3	Image, video, and computer-use support
Lowest cost on high-volume API traffic	DeepSeek V4-Pro	$0.87/1M output, cheaper Flash variant, cache-hit pricing
Reasoning-driven code quality	DeepSeek V4-Pro	Thinking chain helps on multi-file dependencies
Top composite-intelligence score on a public board	Qwen3.7-Max	Reported AA Intelligence Index score of 57 at launch
Long-horizon autonomous agents	Qwen3.7-Max or MiniMax M3	Both target extended tool-use workflows
Self-hosting / no vendor lock-in	MiniMax M3 or DeepSeek V4-Pro	Both depend on open-weight availability (M3 weights are due after launch); Qwen flagship is closed

Simple decision tree:

Need open weights?
  Yes -> MiniMax M3 or DeepSeek V4-Pro
  No  -> Qwen3.7-Max is also viable

Need multimodal input?
  Yes -> MiniMax M3
  No  -> Continue

Need the lowest API bill?
  Yes -> DeepSeek V4-Pro
  No  -> Continue

Need long-horizon hosted agents?
  Yes -> Qwen3.7-Max or MiniMax M3

Need benchmark transparency for agentic coding?
  Yes -> MiniMax M3
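The tree above can be read as a short function. This is one possible interpretation, not the only one: the questions are checked top to bottom, the open-weights answer narrows the candidates first, and the first later "yes" decides. The `recommend` function and its parameter names are illustrative.

```python
from typing import List

M3, DEEPSEEK, QWEN = "MiniMax M3", "DeepSeek V4-Pro", "Qwen3.7-Max"


def recommend(
    open_weights: bool = False,
    multimodal: bool = False,
    lowest_cost: bool = False,
    long_horizon_hosted: bool = False,
    benchmark_transparency: bool = False,
) -> List[str]:
    """Walk the decision tree from top to bottom and return the matching models."""
    # Open weights is a filter: it removes the closed-weight flagship.
    candidates = [M3, DEEPSEEK] if open_weights else [M3, DEEPSEEK, QWEN]
    if multimodal:
        return [M3]
    if lowest_cost:
        return [DEEPSEEK]
    if long_horizon_hosted:
        return [m for m in candidates if m in (QWEN, M3)]
    if benchmark_transparency:
        return [M3]
    return candidates


print(recommend(open_weights=True))                            # ['MiniMax M3', 'DeepSeek V4-Pro']
print(recommend(open_weights=True, long_horizon_hosted=True))  # ['MiniMax M3']
```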

Test them yourself

Leaderboards measure someone else’s workload. Your codebase is the real benchmark.

All three models expose APIs, and the practical way to choose is to run the same prompts against each one and compare:

Correctness
Patch quality
Tool-call structure
Latency
Token usage
Cost
Failure modes
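A minimal sketch of collecting some of these measurements in code. It assumes each model is wrapped in a callable that takes a prompt and returns an OpenAI-style response dict with a `usage.total_tokens` field; that field name, the `compare` helper, and the stub clients are assumptions for illustration. Correctness and patch quality still need human review or your own test suite.

```python
import time
from typing import Callable, Dict, List


def compare(prompt: str, clients: Dict[str, Callable[[str], dict]]) -> List[dict]:
    """Send one prompt to every client and record latency, tokens, and failures."""
    rows = []
    for name, call in clients.items():
        start = time.perf_counter()
        try:
            result, error = call(prompt), None
        except Exception as exc:  # record the failure mode instead of aborting the run
            result, error = {}, str(exc)
        rows.append({
            "model": name,
            "latency_s": time.perf_counter() - start,
            "total_tokens": result.get("usage", {}).get("total_tokens"),
            "error": error,
        })
    return rows


# Stub clients standing in for real API wrappers (hypothetical).
def fake_ok(prompt: str) -> dict:
    return {"usage": {"total_tokens": 42}}


def fake_timeout(prompt: str) -> dict:
    raise RuntimeError("timeout")


for row in compare("Refactor this module.", {"model-a": fake_ok, "model-b": fake_timeout}):
    print(row)
```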

Use Apidog to set up one project with three environments:

Environment 1: MiniMax M3
Environment 2: DeepSeek V4-Pro
Environment 3: Qwen3.7-Max

Then create one OpenAI-compatible chat request and switch environments.

Example request shape:

{
  "model": "{{MODEL_NAME}}",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior software engineer. Return concise implementation steps and code changes."
    },
    {
      "role": "user",
      "content": "Refactor this module to remove duplicated validation logic. Explain the files that need to change."
    }
  ],
  "temperature": 0.2
}

Use environment variables for provider-specific values:

BASE_URL={{MODEL_BASE_URL}}
API_KEY={{MODEL_API_KEY}}
MODEL_NAME={{MODEL_NAME}}
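Outside Apidog, the same request can be sketched with the Python standard library. This assumes each provider exposes an OpenAI-style `/chat/completions` path under `BASE_URL` and accepts a bearer token; check each provider's documentation before relying on either. The `build_chat_request` and `send` helpers are hypothetical names.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a senior software engineer. "
    "Return concise implementation steps and code changes."
)


def build_chat_request(env: dict, user_prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat request from BASE_URL, API_KEY, MODEL_NAME."""
    payload = {
        "model": env["MODEL_NAME"],
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.2,
    }
    # Assumption: the provider serves an OpenAI-style path under BASE_URL.
    url = env["BASE_URL"].rstrip("/") + "/chat/completions"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {env['API_KEY']}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Pass `dict(os.environ)` as `env` to read the three variables from the environment, then switch providers by changing only those values.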

In Apidog, you can:

Send the same prompt batch to all three models.
Diff responses in one place.
Save golden responses and replay them after prompt changes.
Validate tool_calls and reasoning_content with JSON Schema assertions.
Track whether a prompt edit breaks your agent contract.
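The same kind of contract check can be sketched in plain Python. This is a hand-rolled structural check, not JSON Schema, and it assumes OpenAI-style `tool_calls` entries where `function.arguments` is a JSON-encoded string; the `check_message` helper is a hypothetical name.

```python
import json
from typing import List


def check_message(message: dict) -> List[str]:
    """Return a list of contract problems found in an assistant message."""
    problems: List[str] = []
    reasoning = message.get("reasoning_content")
    if reasoning is not None and not isinstance(reasoning, str):
        problems.append("reasoning_content must be a string when present")
    for i, call in enumerate(message.get("tool_calls") or []):
        if not isinstance(call, dict):
            problems.append(f"tool_calls[{i}] is not an object")
            continue
        function = call.get("function") or {}
        name = function.get("name")
        if not isinstance(name, str) or not name:
            problems.append(f"tool_calls[{i}].function.name is missing")
        try:
            # Assumption: arguments arrive as a JSON-encoded string.
            json.loads(function.get("arguments", ""))
        except (TypeError, ValueError):
            problems.append(f"tool_calls[{i}].function.arguments is not valid JSON")
    return problems


# Hypothetical messages, for illustration only.
good = {"tool_calls": [{"function": {"name": "read_file", "arguments": "{\"path\": \"a.py\"}"}}]}
bad = {"reasoning_content": 5, "tool_calls": [{"function": {"arguments": "{oops"}}]}
print(check_message(good))  # []
print(check_message(bad))
```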

Download Apidog.

MiniMax setup details: how to use the MiniMax M3 API.

Frequently asked questions

Which is the best open-weight coding model in 2026 right now?

For launch-day agentic-coding evidence, MiniMax M3 leads because it published task-level numbers such as SWE-Bench Pro 59.0% and Terminal-Bench 2.1 66.0%, though they are vendor-reported.

DeepSeek V4-Pro is the value pick: strong coding performance at roughly 1/34 the GPT-5.5 output price. Qwen3.7-Max has a top composite leaderboard result but is not open-weight yet.

The honest answer: run your own workload before committing.

Are all three truly open-weight?

No.

MiniMax M3: open-weight, with weights and a technical report due within roughly ten days of its June 1, 2026 launch.
DeepSeek V4-Pro: DeepSeek has a long record of open-weight releases across R1 and V3.
Qwen3.7-Max-Preview: proprietary and closed-weight as of mid-May 2026.

More Qwen details: what is Qwen 3.7.

Which has the biggest context window?

MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. That is roughly 700,000 to 750,000 words.

DeepSeek V4-Pro’s context window is not stated here.

Remember: a large context window is not the same as perfect recall, and every token is billed.

Which is cheapest to run?

Based on published per-token rates, DeepSeek V4-Pro is the clear leader:

$0.435 / 1M input tokens on cache miss
$0.003625 / 1M input tokens on cache hit
$0.87 / 1M output tokens

The cheaper V4-Flash variant is $0.14 / $0.28 per million input/output tokens.

MiniMax M3 uses monthly token plans. Qwen3.7-Max bills per token through Alibaba Cloud. If you self-host an open-weight model, your marginal cost becomes hardware rather than API tokens.

More pricing context: Chinese LLM price war 2026.

Is MiniMax M3 better than DeepSeek V4-Pro at coding?

The numbers are not directly comparable yet.

MiniMax M3 published SWE-Bench Pro and Terminal-Bench results at launch. DeepSeek V4-Pro has not reported those same tasks in the same format here.

M3’s edge is published agentic-coding evidence plus multimodality. DeepSeek’s edge is price and reasoning-driven performance on multi-file refactors.

The fair test is to run identical prompts against both models on your own repository.

The short version

Choose based on the constraint that matters most:

MiniMax M3: best fit for published agentic-coding benchmarks, 1M context, multimodality, and open-weight direction.
DeepSeek V4-Pro: best fit for low-cost, high-volume coding agents and reasoning-heavy refactors.
Qwen3.7-Max: best fit for hosted long-horizon agents and top composite-intelligence results, if closed weights are acceptable.

Benchmarks will move, and several MiniMax M3 numbers are still vendor-reported. The durable workflow is simple: run the same prompts against all three APIs in one Apidog project, compare outputs and costs, and let your own workload pick the winner.
