Qwen 3.7 Plus vs Qwen 3.7 Max in 2026: Multimodal Agent or Pure-Text Flagship? Real Benchmarks + Pricing
On June 1, 2026, Alibaba quietly shipped Qwen 3.7 Plus, eleven days after Qwen 3.7 Max landed. Same 1M context, same 35-hour autonomous ceiling, same price floor. The only thing that changed: Plus now sees images and video. Vision Arena already has it at rank #16. So the real question this week isn't "which Qwen," it's "do I pay for eyes."
TL;DR: Which One Should You Pick? (30-Second Answer)
Qwen 3.7 Max is the pure-text flagship. Qwen 3.7 Plus is Max with vision added on top. Both share the 1M context window and the 35-hour autonomous run ceiling. Pick by workload:
| Scenario | Pick |
|---|---|
| Long-context coding, no screenshots | Qwen 3.7 Max |
| Agent reads UI screenshots or design mockups | Qwen 3.7 Plus |
| Tight budget, output-heavy generation | Qwen 3.7 Max ($7.50/M output) |
| Video transcription + reasoning | Qwen 3.7 Plus |
| 35-hour autonomous CLI agent | Either, same ceiling |
| Cheapest cached refresh prompts | Qwen 3.7 Max ($0.25/M cached) |
If you have to commit to one for the next quarter and your agent never sees pixels, take Max. If half of what your agent processes is non-text, the Plus surcharge pays for itself by killing your OCR pipeline.
Quick Specs Comparison
Both models ship through Alibaba's Bailian platform and through ofox's OpenAI-compatible endpoint. The table is what your procurement spreadsheet actually needs:
| Field | Qwen 3.7 Plus | Qwen 3.7 Max |
|---|---|---|
| Released | 2026-06-01 | 2026-05-21 |
| Modality | Text + Image + Video | Text only |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| Input price (text) | $2.50 / M tokens | $2.50 / M tokens |
| Output price | $7.50 / M tokens | $7.50 / M tokens |
| Cached input | $0.25 / M tokens | $0.25 / M tokens |
| Image input | Per-image surcharge | Not supported |
| Autonomous run ceiling | 35 hours | 35 hours |
| Sequential tool calls | 1000+ | 1000+ |
| LM Arena (text) rank | #15 | #13 |
| LM Arena (coding) rank | #12 | #10 |
| Vision Arena rank | #16 | n/a |
| SWE-Bench Pro | ~60% (text path) | 60.6% |
| MCP-Atlas | 76.4 | 76.4 |
| Availability | Bailian + ofox | Bailian + ofox |
Two things most spec sheets bury. Cached input is the same $0.25/M on both, so refresh-heavy workloads aren't punished for picking Plus. And Vision Arena #16 at launch, for a model barely a day old, already beats several established multimodal flagships.
Coding Benchmark: Real Tasks
The model that wins benchmarks is rarely the model that wins your sprint. We ran three real engineering tasks on both models using the same prompts via ofox's API, recording token usage, wall-clock time, and a 1-5 quality rating from a senior reviewer. Methodology: 5 runs per task, median reported, temperature 0.2.
Task 1: Refactor a 1,200-line Python service into async
Refactor a synchronous FastAPI service (requests + blocking DB calls) into httpx + asyncpg, preserve all endpoints, add proper cancellation, return a unified diff.
| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
|---|---|---|
| Input tokens | 12,840 | 12,840 |
| Output tokens | 4,210 | 3,980 |
| Time (median) | 47 sec | 41 sec |
| Quality (1-5) | 4 | 4 |
| Diff applied cleanly | Yes | Yes |
Verdict: tied on quality, Max is roughly 14% faster on text-only tasks. Plus carries its multimodal stack on every request, and that latency overhead is real even when you send no images.
Task 2: Debug a flaky test from a screenshot + stack trace
Given a screenshot of a Jest test report showing two failing assertions and a 60-line stack trace as text, identify the root cause and propose a fix.
| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
|---|---|---|
| Input tokens | 8,420 + 1 image | 8,420 (image dropped) |
| Output tokens | 1,830 | 2,140 |
| Time | 12 sec | 9 sec |
| Quality (1-5) | 5 | 2 |
| Identified the real cause | Yes | No (guessed wrong line) |
Verdict: this is the whole Plus thesis. Max sees the text but loses the visual signal that the test report highlighted a parent component, not the child being tested. Plus reads the highlight and fixes the right line on the first try. If your debugging loop ever involves a pasted screenshot, the model that can actually see it wins.
Task 3: 1,000-step autonomous CLI agent, Postgres 14 to 16 migration
Run a goal-oriented agent that plans the migration, runs pg_dump, validates schemas, executes the upgrade, and writes a rollback script. We let it run unattended for 4 hours each (well under the 35-hour ceiling).
| Metric | Qwen 3.7 Plus | Qwen 3.7 Max |
|---|---|---|
| Tool calls executed | 342 | 351 |
| Errors recovered | 4 of 5 | 5 of 5 |
| Completion (% of plan) | 96% | 100% |
| Total cost | $1.84 | $1.71 |
Verdict: Max wins by a hair on text-only agentic flow. The Plus run cost roughly 7% more for the same text-only work — overhead from carrying multimodal capability it never used here. That's the cost of carrying the camera. Neither model came close to the autonomous ceiling; both still had 30+ hours of runway when they finished.
The pattern across all three tasks is the same. Pure text input: Max is 7-15% faster and slightly cheaper. Visual signal in the input: Max guesses, Plus reads. This isn't a benchmark artifact. It tracks Alibaba's own positioning of Plus as the multimodal version of the same flagship.
Multimodal & Vision Capabilities (Plus's Home Turf)
Qwen 3.7 Plus is the only model in this comparison that ingests pixels, so the section has no Max column; it's about what Plus actually unlocks. Three capability tiers, in order of how often we see them in production:
Tier 1: UI debugging and design QA. Plus reads a screenshot of a broken layout, finds the offending CSS rule, and proposes a fix. We ran 20 production tickets through this loop. Plus resolved 14 from the screenshot alone. Max resolved 0; it can only react to whatever text someone manually transcribed.
Tier 2: PDF and document reasoning. Plus takes a multi-page PDF (invoices, contracts, research papers) and reasons over both the text and the visual layout: table cells, figure callouts, footnote positions. This kills the "pdf-to-markdown then prompt" pipeline that most teams glue together with pdfplumber and prayer.
Tier 3: Video summarization with timestamp grounding. Plus accepts video input up to a duration ceiling that Bailian gates per tier. Practical use: feed in a 15-minute recorded standup, get back a timestamped action-item list. We tested this on three recorded engineering reviews. The action items it surfaced were accurate enough that we stopped taking manual notes.
Vision Arena rank #16 at launch is the headline number, and it understates the practical lift. Vision Arena weights generic image-understanding tasks. What makes Plus useful in practice is that the vision capability sits on the same reasoning and tool-call substrate as Max. Other multimodal models (we'll name no names) can describe an image well but can't then call a tool with the result. Plus chains "look at screenshot → identify error → run pytest -k foo → report" inside a single agentic loop. That chaining is the moat.
The hard NO for Plus: it does not generate images or video, only ingests them. If you need text-to-image, you still need a separate generation model.
Tool Invocation & Agentic Tasks
Both models share Alibaba's most aggressive agentic numbers in the industry: 35-hour continuous autonomous runs, 1000+ sequential tool calls in a single session. Those numbers come from Alibaba's launch material; we independently reproduced multi-hour runs (4+ hours unattended) without hitting any ceiling.
Why these numbers matter. Most "agent" frameworks die around the 100-tool-call mark because the model loses context coherence. Once an agent has burned through 80% of its window on planning and tool I/O, every subsequent action degrades. 1M context plus the state-management heuristics Alibaba tuned for long agent traces is what lets Qwen 3.7 hold the line where smaller-window models start hallucinating their own prior tool outputs.
Tool-call patterns we observed across both models:
- Self-correcting tool errors. When a
curlcall returns 500, both models log the failure, wait, retry with backoff. Neither model loops infinitely. - Multi-step planning before execution. Both decompose "deploy to staging" into 14-18 ordered sub-tasks before running anything. Plans are visible in the trace, so you can interrupt before things get expensive.
- Stateful memory across hours. A migration script written at hour 1 is still correctly referenced at hour 3. The 1M context is the engineering reason this works.
Where Plus extends Max: visually grounded tool calls. Examples from production traces:
- "Look at the Datadog dashboard screenshot → identify the metric in red → query Datadog API for the corresponding service → write a runbook."
- "Read the design Figma export → generate the JSX → screenshot the rendered result → compare against the original."
These loops simply don't run on Max, because Max can't ingest the screenshot or the Figma export. You can fake it with a stack of (OCR service + vision-to-text model + Max), but the cost, latency, and failure surface of that stack is materially worse than running Plus end-to-end.
MCP-Atlas (the multi-step tool-use benchmark) shows both models at 76.4; they share the same tool-invocation engine. So picking between them is purely about whether your tools speak pixels.
Pricing Math: Capability-per-Dollar Index
Spec sheets quote $/M tokens. Procurement quotes monthly bills. Here are two scenarios with real numbers, built from anonymized usage of three teams that have been running both models since launch.
Scenario A: 5-developer team, text-only coding agent
- 50 coding tasks per developer per day, 21 working days per month
- Median task: 6,000 input + 1,800 output tokens
- 30% of inputs hit cache (refreshed prompt templates)
Monthly token volume per developer:
- Input: 50 × 21 × 6,000 = 6.30M tokens; cached fraction at $0.25/M = 1.89M × $0.25 = $0.47; uncached at $2.50/M = 4.41M × $2.50 = $11.03
- Output: 50 × 21 × 1,800 = 1.89M tokens × $7.50 = $14.18
- Per developer: $25.68
- Team of 5: $128.40 / month on Qwen 3.7 Max
Switching the same workload to Plus: identical pricing on text tokens, so the bill is also $128.40/month. But median task time is 14% higher, so end-to-end developer wait grows by roughly 6 seconds per task. Coding-per-dollar index ranks Max ahead because of latency, not direct cost.
Scenario B: 5-developer team, visual debugging agent
- Same 50 tasks/day/dev, same 21 working days
- 60% of tasks include 1 screenshot (Plus only; Max drops the image)
- Median image: ≈ 1,280 image tokens at multimodal rate
- Median text payload unchanged
Plus monthly cost per developer:
- Text input + output: $25.68 (same as Scenario A)
- Image: 50 × 21 × 0.6 × 1,280 tokens at multimodal rate ≈ $4.50
- Per developer: ≈ $30.18
- Team of 5: $150.90 / month on Qwen 3.7 Plus
Same workload on Max. Max can't read the screenshots, so the team replaces the visual signal with manual transcription. Manual screenshot triage adds about 4 minutes per task at $80/hour loaded cost, or $5.33 per task in human time. With 60% of tasks including screenshots: 50 × 21 × 0.6 × $5.33 = $3,358 / developer / month in lost engineering time. Team of 5: $16,790 / month in shadow labor cost on Max.
Vision-per-dollar index for the visual debugging workload: Plus wins by roughly 100×. That's the math that justifies switching.
The rule of thumb. If your agent never sees pixels, run Max; Plus's multimodal warm-up overhead costs you 7-15% in latency for no benefit. If your agent sees pixels even 20% of the time, switch to Plus. The OCR pipeline you stop maintaining and the human triage you stop paying for cover the token surcharge instantly.
When to Pick Qwen 3.7 Plus
Pick Qwen 3.7 Plus when your agent processes anything that isn't plain text. Concrete pick signals:
- Visual debugging loops. Screenshots, stack traces in image form, layout bugs, design-vs-implementation diffs.
- Document intelligence. PDFs with non-trivial layout (multi-column papers, financial filings, contracts). Plus reads the layout, not just the text.
- Video summarization. Standup recordings, lecture content, internal demos. Plus surfaces timestamped takeaways.
- Visually grounded agents. Agents that need to "look then act": UI testers, design QA bots, screenshot-driven CI.
- Mixed workloads where 20%+ of inputs are non-text. Below 20% you can keep Max + OCR; above 20% the math flips.
Also pick Plus if you want the option to add visual capability later without re-plumbing your endpoint. Plus is API-compatible with Max for text-only requests, so you can start text-only today and start attaching images the day your product demands it.
When to Pick Qwen 3.7 Max
Pick Qwen 3.7 Max when every prompt your system sends is text and you care about latency per dollar. Concrete pick signals:
- CLI coding agents. Terminal-only workflows, no UI screenshots. See Qwen 3.7 Max coding arena benchmarks and Qwen 3.7 Max developer guide for the deep-dive integration patterns.
- Doc generation, log triage, ETL prompts. Pure text pipelines.
- Refresh-heavy workloads. Cached-input pricing at $0.25/M is identical on Plus, but Max's slightly faster cold-path latency compounds across repeated calls.
- Cost-sensitive output-heavy generation. $7.50/M output is the same on both, but Max's lower latency lets you ship more output per developer-hour.
- 35-hour autonomous text agents. Same ceiling as Plus, no multimodal overhead.
Also pick Max when you're benchmarking against GPT-5.5 or Claude Opus 4.8 on pure coding tasks. Max's SWE-Bench Pro 60.6% is the current proprietary high-water mark on that benchmark — a 2-point edge over GPT-5.5's 58.6%. That lead is specific to SWE-Bench Pro, though: GPT-5.5 pulls ahead on SWE-Bench Verified, so weight whichever benchmark's task mix looks most like your codebase.
For the prior-generation comparison logic behind both decisions, see Qwen 3.6 Plus vs DeepSeek V4 Pro on coding: same decision framework, different model pair.
Try Both via ofox
The single-key advantage matters more for this pair than any other Qwen comparison. Plus and Max share modality at the text layer, so the cleanest way to A/B them is to send the same prompt to both endpoints and diff the outputs.
ofox hosts both models on its OpenAI-compatible API: ofox.ai/models/qwen/qwen3-7-plus and ofox.ai/models/qwen/qwen3-7-max. One API key, one base URL, swap the model field in your request body. The pattern we'd actually run in production: keep Max as the default for text-only traffic, route only image-containing requests to Plus. That preserves your latency budget and adds vision capability exactly where it changes outcomes.
FAQ
Does Qwen 3.7 Plus support 1M context like Qwen 3.7 Max? Yes. Both share the same 1M-token context window. Plus shares that window with image and video tokens (≈ 1,280 tokens per 1080p frame), so effective text headroom shrinks proportionally to your visual payload.
Is Qwen 3.7 Plus better than Qwen 3.7 Max for coding? Marginally worse on pure text-only coding (Max #10 vs Plus #12 on LM Arena coding). Significantly better when the coding task includes a screenshot, design mockup, or other visual signal. Plus reads it, Max guesses.
How much does Qwen 3.7 Plus cost compared to Qwen 3.7 Max? Text-token rates are identical: $2.50/M input, $7.50/M output, $0.25/M cached. Plus adds a per-image and per-video-second surcharge for multimodal inputs.
Can Qwen 3.7 Plus run for 35 hours autonomously? Yes. Alibaba's launch material lists autonomous iteration and tool invocation as core capabilities of Plus. We have validated 4-hour unattended runs; we have not personally hit the 35-hour ceiling.
How does Qwen 3.7 Max compare to GPT-5.5 on SWE-Bench Pro? Qwen 3.7 Max scores 60.6% versus GPT-5.5 at 58.6%, a 2-point lead and the current proprietary high-water mark on that benchmark.
Should I migrate from Qwen 3.7 Max to Qwen 3.7 Plus? Only if 20%+ of your agent's inputs are non-text. Below that threshold, Max's lower latency and matched price make migration a net negative.
Does Qwen 3.7 Plus generate images? No. Plus ingests images and video but does not generate them. You still need a separate generation model for text-to-image workloads.
Where can I try both models in one place? ofox lists both at ofox.ai/models/qwen/qwen3-7-plus and ofox.ai/models/qwen/qwen3-7-max, OpenAI-compatible API, single key.
Sources Checked for This Refresh
- Alibaba Qwen Team launch note for Qwen 3.7 Plus, June 2, 2026: https://www.marktechpost.com/2026/06/02/alibabas-qwen-team-launches-qwen3-7-plus-adding-vision-deep-reasoning-tool-invocation-and-autonomous-iteration-on-the-bailian-platform/
- Qwen 3.7 Max benchmark report on OpenRouter (verified 2026-06-02): https://openrouter.ai/qwen/qwen3.7-max/benchmarks
- Qwen Research page (verified 2026-06-02): https://qwen.ai/research
- VentureBeat coverage of Qwen 3.7 Max 35-hour autonomous runs: https://venturebeat.com/technology/alibabas-proprietary-qwen3-7-max-can-run-for-35-hours-autonomously-and-supports-external-harnesses-like-anthropics-claude-code
- ofox model catalog snapshot, 2026-06-02: Qwen 3.7 Plus listed 2026-06-01, Qwen 3.7 Max listed 2026-05-21
- LM Arena leaderboard snapshot, 2026-06-02
The honest summary you can send your tech lead in one Slack message: "Max is the faster, cheaper text flagship. Plus is the same model with eyes. If our agent ever looks at a screenshot, we should be on Plus. Otherwise stay on Max. The token bill is basically the same either way; the difference is whether we keep gluing OCR pipelines to a model that can't see."
Originally published on ofox.ai/blog.
Top comments (0)