
Hassann

Posted on • Originally published at apidog.com

MiniMax M3 vs DeepSeek V4-pro vs Qwen 3.7: Best Open-Weight Coding Model in 2026

For most of the last two years, choosing the “best coding model” usually meant picking GPT, Claude, or Gemini, paying the per-token rate, and accepting closed weights. That is no longer the only path. Chinese labs are now shipping coding-focused models that either publish weights or price APIs low enough to change the economics of coding agents.


MiniMax M3 landed on June 1, 2026, and is the clearest signal yet. MiniMax positions it as open-weight (the weights are due within roughly ten days of launch), built for coding and agentic workflows, with a 1,000,000-token context window and native multimodality. It joins DeepSeek’s V4 family and Alibaba’s Qwen 3.7 as serious contenders for teams that want lower cost, more control, and less vendor lock-in.

The three contenders

MiniMax M3

MiniMax M3 is the newest arrival. MiniMax positions it as a frontier coding model with:

  • 1,000,000-token context window
  • Native multimodality for image and video input
  • Computer-use capabilities
  • Coding and agentic workflow focus
  • New MSA architecture

MiniMax says open weights and a technical report will follow within roughly ten days of launch. Parameter counts have not been disclosed.

Read the full breakdown: what is MiniMax M3.

DeepSeek V4-Pro

DeepSeek V4-Pro is the reasoning-and-coding workhorse. It is a thinking model that returns a reasoning_content chain before the final answer, which can help on multi-file refactors, dependency changes, renames, and signature updates.

DeepSeek also pairs V4-Pro with a cheaper non-thinking V4-Flash variant. Its biggest differentiator is cost.

Official site and API: deepseek.com.

Qwen 3.7

Qwen 3.7 is Alibaba’s flagship family, led by Qwen3.7-Max-Preview. It is a reasoning model with a 1,000,000-token context window and is positioned for long-horizon agent work.

The key caveat: as of its mid-May 2026 launch, Qwen3.7-Max is proprietary and closed-weight. Alibaba has a strong track record of open-sourcing lower tiers, but no open 3.7 weights had shipped at the time of writing.

More details: what is Qwen 3.7.

Open-source repos: github.com/QwenLM.

Spec comparison

| Spec | MiniMax M3 | DeepSeek V4-Pro | Qwen3.7-Max-Preview |
| --- | --- | --- | --- |
| Vendor | MiniMax | DeepSeek | Alibaba / Qwen |
| Released | June 1, 2026 | 2026 | May 2026 preview |
| Open weights | Yes, weights within ~10 days | Yes, based on DeepSeek’s R1/V3 track record | Not yet; flagship is closed-weight |
| Context window | 1,000,000 tokens | Not stated here | 1,000,000 tokens |
| Multimodal | Yes: image, video, computer use | No; text + reasoning | Text-focused reasoning |
| Reasoning / thinking mode | Yes | Yes, via reasoning_content | Yes, extended thinking |
| Parameter count | Not disclosed | Not disclosed here | Not disclosed here |
| Architecture | MSA | Not stated here | Not stated here |

If open weights are a hard requirement, filter first:

  • Use MiniMax M3 if you want open weights plus multimodality.
  • Use DeepSeek V4-Pro if you want open releases and low API cost.
  • Use Qwen3.7-Max only if you are comfortable with a hosted, closed-weight flagship today.

Coding and agentic strength

The public evidence is uneven, so do not treat numbers from different sources and formats as if they were directly comparable.

MiniMax M3 launched with vendor-reported coding and agentic benchmarks:

| Benchmark (vendor-reported by MiniMax) | MiniMax M3 |
| --- | --- |
| SWE-Bench Pro | 59.0% |
| Terminal-Bench 2.1 | 66.0% |
| SWE-fficiency | 34.8% |
| KernelBench Hard | 28.8% |
| MCP Atlas | 74.2% |
| PostTrainBench | 0.37 |
| SVG-Bench | Reported above Opus 4.7 |
| OmniDocBench | Reported above Gemini 3.1 Pro |
| Claw-Eval | Reported highest in its set |

SWE-Bench Pro and Terminal-Bench test practical software engineering work: resolving issues, editing code, and operating in a terminal. MCP Atlas focuses on tool use and agent orchestration. You can sanity-check the broader SWE-Bench field on the SWE-Bench leaderboard.

DeepSeek V4-Pro and Qwen 3.7 do not have directly comparable published numbers in the same format here, so a cell-by-cell benchmark table would be misleading.

What is documented:

  • DeepSeek V4-Pro: reported by third-party comparisons as landing within a few coding benchmark points of GPT-5.5 while costing much less. Its practical edge is the reasoning chain, especially for multi-file changes. Setup and pricing details: how to use DeepSeek V4-Pro with Cursor.
  • Qwen 3.7: reported 57 on the Artificial Analysis Intelligence Index at launch, plus roughly 1,475 Elo on LM Arena with a top-ten coding placement. Alibaba positions it for long autonomous runs and heavy tool use.

Practical takeaway:

  • Pick MiniMax M3 if you want the most transparent agentic-coding evidence today.
  • Pick DeepSeek V4-Pro if reasoning quality and cost matter most.
  • Pick Qwen3.7-Max if you want top composite-intelligence results and can accept a hosted API.

For a broader Qwen comparison, see Qwen 3.7 vs GPT-5.5 vs Opus 4.7.

Context window and long-context cost

MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. DeepSeek V4-Pro’s context size is not stated here.

A million tokens is roughly 700,000 to 750,000 words. That can hold:

  • A mid-sized repository
  • Long documentation sets
  • Multiple PDFs
  • Long-running conversation history
  • Agent traces and tool outputs

But a large context window is only a ceiling. It does not guarantee perfect recall or stable reasoning across the entire prompt. It also costs money: every token you send is billed.

Use long context only when needed.

A practical context strategy:

1. Start with the smallest useful prompt.
2. Include only the files relevant to the task.
3. Add dependency files when the model asks or when tests fail.
4. Cache stable project context where the provider supports it.
5. Avoid sending generated logs, build output, or duplicate docs unless needed.
6. Use the 1M-token window for repo-wide analysis, not every request.
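The strategy above can be approximated in code. This is a minimal sketch, not a provider feature: it estimates tokens from word counts using the rough ratio quoted earlier (about 0.75 words per token) and adds files in priority order until a budget is reached. The function names, the ratio constant, and the budget are illustrative assumptions, and a real tokenizer will give different counts, especially for source code.

```python
import math

# Assumption: ~0.75 words per token, the upper end of the rough ratio quoted above.
# Real tokenizers differ, and source code usually tokenizes less efficiently than prose.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on the whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def select_files(files: list[tuple[str, str]], budget_tokens: int) -> tuple[list[str], int]:
    """Pick files in the given priority order without exceeding the budget.

    `files` is a list of (path, content) pairs, most relevant first.
    Returns the chosen paths and the estimated tokens used.
    """
    chosen: list[str] = []
    used = 0
    for path, content in files:
        cost = estimate_tokens(content)
        if used + cost > budget_tokens:
            continue  # this file does not fit; a smaller one later may still fit
        chosen.append(path)
        used += cost
    return chosen, used
```

For example, with a 1,000-token budget and three files estimated at 400, 4,000, and 200 tokens, the sketch keeps the first and third and skips the large one. Swapping `estimate_tokens` for the provider's own tokenizer, where one is available, would make the budget more trustworthy.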

MiniMax says its MSA architecture is designed for long-context efficiency. Its API uses a standard rate up to 512K input tokens and a separate long-context rate above that threshold. That split reflects the practical reality: long context is a premium tier.

For more tactics, see how to reduce agent token costs.

Price and access

Price is the reason this comparison matters. The same coding-agent workload can cost much less on these models than on Western flagship APIs. That pricing pressure is part of the broader Chinese LLM price war 2026.

DeepSeek V4-Pro has the clearest published per-token pricing.

| Token type | DeepSeek V4-Pro rate per 1M tokens |
| --- | --- |
| Input, cache miss | $0.435 |
| Input, cache hit | $0.003625 |
| Output | $0.87 |

That output rate is roughly 1/34 the cost of GPT-5.5 output. The non-thinking V4-Flash variant is cheaper still at $0.14 / $0.28 per million input/output tokens. A heavy day of coding-assistant use can land around $1.
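To see how a day of usage adds up at these rates, here is a small worked calculation. The rates come from the table above; the token volumes are a hypothetical "heavy day" chosen only for illustration, and the function name is made up for this sketch.

```python
# Rates in USD per 1M tokens, copied from the DeepSeek V4-Pro table above.
RATE_INPUT_CACHE_MISS = 0.435
RATE_INPUT_CACHE_HIT = 0.003625
RATE_OUTPUT = 0.87


def daily_cost(miss_tokens: int, hit_tokens: int, output_tokens: int) -> float:
    """Return the USD cost for one day's token usage at the rates above."""
    per_million = 1_000_000
    return (
        miss_tokens / per_million * RATE_INPUT_CACHE_MISS
        + hit_tokens / per_million * RATE_INPUT_CACHE_HIT
        + output_tokens / per_million * RATE_OUTPUT
    )


# Hypothetical heavy day: 1.5M uncached input, 5M cached input, 300K output tokens.
cost = daily_cost(1_500_000, 5_000_000, 300_000)
print(f"${cost:.2f}")  # prints $0.93
```

In this example most of the bill comes from uncached input ($0.65) and output ($0.26), while 5M cached input tokens cost under two cents, which is why cache-friendly prompts matter for agent loops that resend the same context.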

MiniMax M3 uses token plans rather than one published per-token price:

| Plan | Price (monthly plan) |
| --- | --- |
| Plus | $20 |
| Max | $50 |
| Ultra | $120 |

Its API uses a standard input rate up to 512K tokens and a long-context rate above that. MiniMax has not published an exact per-token figure here, so do not assume one. The plan model may fit teams that prefer predictable monthly spend.

API setup details: how to use the MiniMax M3 API.

Qwen 3.7 is billed per token through Alibaba Cloud. The Max preview went live in May 2026. Alibaba has priced recent Qwen models aggressively, but preview pricing can shift, so check Alibaba Cloud’s current model docs before estimating production cost.

Open weights change the ceiling:

  • MiniMax M3: self-hosting becomes possible once weights are published.
  • DeepSeek V4-Pro: DeepSeek has a history of open releases.
  • Qwen3.7-Max: cannot be self-hosted today because flagship weights are closed.

If avoiding vendor lock-in is a requirement, that difference matters.

Which one should you pick?

| Your priority | Best fit | Why |
| --- | --- | --- |
| Agentic coding with published benchmarks | MiniMax M3 | Vendor-reported SWE-Bench Pro, Terminal-Bench, and MCP Atlas numbers |
| Multimodal input | MiniMax M3 | Image, video, and computer-use support |
| Lowest cost on high-volume API traffic | DeepSeek V4-Pro | $0.87/1M output, cheaper Flash variant, cache-hit pricing |
| Reasoning-driven code quality | DeepSeek V4-Pro | Thinking chain helps on multi-file dependencies |
| Top composite-intelligence score on a public board | Qwen3.7-Max | Reported AA Intelligence Index score of 57 at launch |
| Long-horizon autonomous agents | Qwen3.7-Max or MiniMax M3 | Both target extended tool-use workflows |
| Self-hosting / no vendor lock-in today | MiniMax M3 or DeepSeek V4-Pro | Both are tied to open-weight availability; Qwen flagship is closed |

Simple decision tree:

Need open weights?
  Yes -> MiniMax M3 or DeepSeek V4-Pro
  No  -> Qwen3.7-Max is also viable

Need multimodal input?
  Yes -> MiniMax M3
  No  -> Continue

Need the lowest API bill?
  Yes -> DeepSeek V4-Pro
  No  -> Continue

Need long-horizon hosted agents?
  Yes -> Qwen3.7-Max or MiniMax M3

Need benchmark transparency for agentic coding?
  Yes -> MiniMax M3
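If you want the decision tree in a form you can drop into a script, one possible encoding is a set intersection over requirements. This is only one reading of the tree above: the requirement keys and the `shortlist` function are names invented for this sketch, and an empty result simply means no single model satisfies every stated requirement.

```python
MODELS = ("MiniMax M3", "DeepSeek V4-Pro", "Qwen3.7-Max")

# Each requirement maps to the models the decision tree above recommends for it.
FITS = {
    "open_weights": {"MiniMax M3", "DeepSeek V4-Pro"},
    "multimodal": {"MiniMax M3"},
    "lowest_api_bill": {"DeepSeek V4-Pro"},
    "hosted_long_horizon": {"Qwen3.7-Max", "MiniMax M3"},
    "benchmark_transparency": {"MiniMax M3"},
}


def shortlist(*requirements: str) -> list[str]:
    """Return the models that satisfy every listed requirement, in a stable order."""
    remaining = set(MODELS)
    for requirement in requirements:
        remaining &= FITS[requirement]  # raises KeyError on an unknown requirement
    return [model for model in MODELS if model in remaining]
```

Calling `shortlist("open_weights", "lowest_api_bill")` narrows the field to DeepSeek V4-Pro, while `shortlist("multimodal", "lowest_api_bill")` returns an empty list, a signal that you need to rank your priorities rather than treat them all as hard requirements.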

Test them yourself

Leaderboards measure someone else’s workload. Your codebase is the real benchmark.

All three models expose APIs, and the practical way to choose is to run the same prompts against each one and compare:

  • Correctness
  • Patch quality
  • Tool-call structure
  • Latency
  • Token usage
  • Cost
  • Failure modes
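Some of these criteria are easy to tabulate once you record each run. The sketch below is an assumption about how you might log results, not part of any provider's API or of Apidog: the `Run` fields and the `summarize` function are hypothetical, and `passed` stands for whatever correctness check you apply to your own repository.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    model: str
    passed: bool        # did the patch pass your own tests?
    latency_s: float    # wall-clock seconds for the request
    total_tokens: int   # prompt + completion tokens reported for the request


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Group runs by model and report pass rate, mean latency, and mean tokens."""
    by_model: dict[str, list[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    return {
        model: {
            "pass_rate": sum(r.passed for r in group) / len(group),
            "mean_latency_s": mean(r.latency_s for r in group),
            "mean_tokens": mean(r.total_tokens for r in group),
        }
        for model, group in by_model.items()
    }
```

Patch quality and failure modes still need a human read, so treat a summary like this as a first filter rather than a verdict.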

Use Apidog to set up one project with three environments:

Environment 1: MiniMax M3
Environment 2: DeepSeek V4-Pro
Environment 3: Qwen3.7-Max

Then create one OpenAI-compatible chat request and switch environments.

Example request shape:

{
  "model": "{{MODEL_NAME}}",
  "messages": [
    {
      "role": "system",
      "content": "You are a senior software engineer. Return concise implementation steps and code changes."
    },
    {
      "role": "user",
      "content": "Refactor this module to remove duplicated validation logic. Explain the files that need to change."
    }
  ],
  "temperature": 0.2
}

Use environment variables for provider-specific values:

BASE_URL={{MODEL_BASE_URL}}
API_KEY={{MODEL_API_KEY}}
MODEL_NAME={{MODEL_NAME}}
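Outside Apidog, the same request can be scripted with the Python standard library. This is a sketch under two assumptions that you should verify against each provider's documentation: that the endpoint follows the OpenAI-style `/chat/completions` path under `BASE_URL`, and that it accepts a bearer token in the `Authorization` header. The `build_request`, `send`, and `main` names are illustrative.

```python
import json
import os
import urllib.request

SYSTEM_PROMPT = (
    "You are a senior software engineer. "
    "Return concise implementation steps and code changes."
)


def build_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        # Assumption: the provider serves the OpenAI-style /chat/completions path.
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def main() -> None:
    # BASE_URL, API_KEY, and MODEL_NAME mirror the environment variables above.
    request = build_request(
        os.environ["BASE_URL"],
        os.environ["API_KEY"],
        os.environ["MODEL_NAME"],
        "Refactor this module to remove duplicated validation logic. "
        "Explain the files that need to change.",
    )
    print(send(request)["choices"][0]["message"]["content"])
```

Set the three environment variables for one provider, call `main()`, then switch the variables and run it again to get a like-for-like comparison with the same prompt.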

In Apidog, you can:

  • Send the same prompt batch to all three models.
  • Diff responses in one place.
  • Save golden responses and replay them after prompt changes.
  • Validate tool_calls and reasoning_content with JSON Schema assertions.
  • Track whether a prompt edit breaks your agent contract.
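If you also want a scripted check of the agent contract, a hand-rolled validator is one option. The sketch below is not Apidog's JSON Schema assertion feature; it is a plain-Python stand-in that assumes the OpenAI-style message shape, where each entry in `tool_calls` carries a `function` object with a `name` and a JSON-encoded `arguments` string. Only request `reasoning_content` from models that return it, such as the thinking models described earlier.

```python
import json


def check_message(message: dict, expect_reasoning: bool = False) -> list[str]:
    """Return a list of contract violations for one assistant message (empty if OK)."""
    problems: list[str] = []
    if expect_reasoning and not isinstance(message.get("reasoning_content"), str):
        problems.append("missing reasoning_content")
    for index, call in enumerate(message.get("tool_calls") or []):
        function = call.get("function") or {}
        if not isinstance(function.get("name"), str):
            problems.append(f"tool_calls[{index}]: missing function.name")
        try:
            # In the OpenAI-style format, arguments arrive as a JSON-encoded string.
            arguments = json.loads(function.get("arguments", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append(f"tool_calls[{index}]: arguments are not valid JSON")
            continue
        if not isinstance(arguments, dict):
            problems.append(f"tool_calls[{index}]: arguments must be a JSON object")
    return problems
```

Running this over saved responses after every prompt edit gives a quick signal when a model starts emitting malformed tool calls, before the breakage reaches your agent loop.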

Download Apidog here.

MiniMax setup details: how to use the MiniMax M3 API.

Frequently asked questions

Which is the best open-weight coding model in 2026?

For launch-day agentic-coding evidence, MiniMax M3 leads because it published task-level numbers such as SWE-Bench Pro 59.0% and Terminal-Bench 2.1 66.0%, though they are vendor-reported.

DeepSeek V4-Pro is the value pick: strong coding performance at roughly 1/34 the GPT-5.5 output price. Qwen3.7-Max has a top composite leaderboard result but is not open-weight yet.

The honest answer: run your own workload before committing.

Are all three truly open-weight?

No.

  • MiniMax M3: open-weight, with weights and a technical report due within roughly ten days of its June 1, 2026 launch.
  • DeepSeek V4-Pro: DeepSeek has a long record of open-weight releases across R1 and V3.
  • Qwen3.7-Max-Preview: proprietary and closed-weight as of mid-May 2026.

More Qwen details: what is Qwen 3.7.

Which has the biggest context window?

MiniMax M3 and Qwen3.7-Max both advertise a 1,000,000-token context window. That is roughly 700,000 to 750,000 words.

DeepSeek V4-Pro’s context window is not stated here.

Remember: a large context window is not the same as perfect recall, and every token is billed.

Which is cheapest to run?

Based on published per-token rates, DeepSeek V4-Pro is the clear leader:

  • $0.435 / 1M input tokens on cache miss
  • $0.003625 / 1M input tokens on cache hit
  • $0.87 / 1M output tokens

The cheaper V4-Flash variant is $0.14 / $0.28 per million input/output tokens.

MiniMax M3 uses monthly token plans. Qwen3.7-Max bills per token through Alibaba Cloud. If you self-host an open-weight model, your marginal cost becomes hardware rather than API tokens.

More pricing context: Chinese LLM price war 2026.

Is MiniMax M3 better than DeepSeek V4-Pro at coding?

The numbers are not directly comparable yet.

MiniMax M3 published SWE-Bench Pro and Terminal-Bench results at launch. DeepSeek V4-Pro has not reported those same tasks in the same format here.

M3’s edge is published agentic-coding evidence plus multimodality. DeepSeek’s edge is price and reasoning-driven performance on multi-file refactors.

The fair test is to run identical prompts against both models on your own repository.

The short version

Choose based on the constraint that matters most:

  • MiniMax M3: best fit for published agentic-coding benchmarks, 1M context, multimodality, and open-weight direction.
  • DeepSeek V4-Pro: best fit for low-cost, high-volume coding agents and reasoning-heavy refactors.
  • Qwen3.7-Max: best fit for hosted long-horizon agents and top composite-intelligence results, if closed weights are acceptable.

Benchmarks will move, and several MiniMax M3 numbers are still vendor-reported. The durable workflow is simple: run the same prompts against all three APIs in one Apidog project, compare outputs and costs, and let your own workload pick the winner.
