Alibaba shipped Qwen 3.7 Plus a few days after Qwen 3.7 Max. The short version: Plus is Max with vision. It keeps the same 1M-token context and agentic backbone, adds image and video input, and lands at roughly one-sixth of Max’s input price and about one-fifth of its output price. If you’ve been following the family, our guide to what Qwen 3.7 is covers the text flagship; this post focuses on what the Plus variant adds and how to start using it.
One thing to flag up front: Qwen 3.7 Plus is API-only and proprietary. There are no open weights, which breaks from Qwen’s usual open-source pattern. Since Plus ships only as an API, implementation work comes down to sending requests, debugging payloads, and validating responses. That is where Apidog becomes useful, especially for multimodal API testing.
The short answer
Qwen 3.7 Plus is the multimodal, lower-cost sibling of Qwen 3.7 Max. You can pass it screenshots, design mockups, documents, or video, and it can reason over them as first-class input.
It is built for agents that operate graphical interfaces. For example, it can inspect an app screenshot and return pixel coordinates for the next action, such as:
{
  "action": "click",
  "x": 487,
  "y": 232
}
On pure text tasks, Max still has a small edge. For tasks with visual input, Plus is the better fit and costs much less. The main tradeoff is that the weights are closed.
What’s new versus Qwen 3.7 Max
Three changes matter for developers.
1. Plus accepts visual input
Qwen 3.7 Max is text-only. Qwen 3.7 Plus accepts:
- Text
- Images
- Video
That enables use cases such as screenshot analysis, document and PDF reading, chart interpretation, and video understanding from one model.
2. Plus supports GUI grounding
Plus is positioned as a multimodal interactive agent for:
- Browser automation
- GUI navigation
- Screenshot-driven workflows
- Hybrid GUI + CLI agents
Instead of only describing a screen, it can return concrete UI actions, such as where to click.
Example target response format:
{
  "steps": [
    {
      "type": "click",
      "target": "Submit button",
      "coordinates": {
        "x": 487,
        "y": 232
      }
    }
  ]
}
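If your agent receives a reply in that shape, it can be worth turning it into typed objects before acting on it. The following is a minimal sketch, assuming the reply is exactly the JSON shown above and that `click` is the only step type; `ClickStep` and `parse_steps` are hypothetical names, not part of any Qwen SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickStep:
    """One click action extracted from the model reply (hypothetical type)."""
    target: str
    x: int
    y: int


def parse_steps(raw: str) -> list[ClickStep]:
    """Parse a {"steps": [...]} reply into typed click steps."""
    data = json.loads(raw)
    steps = []
    for item in data["steps"]:
        # This sketch only understands the "click" step shown in the example.
        if item["type"] != "click":
            raise ValueError(f"unsupported step type: {item['type']!r}")
        coords = item["coordinates"]
        steps.append(ClickStep(item["target"], int(coords["x"]), int(coords["y"])))
    return steps


example = (
    '{"steps": [{"type": "click", "target": "Submit button", '
    '"coordinates": {"x": 487, "y": 232}}]}'
)
print(parse_steps(example))
```

Failing loudly on an unknown step type is a deliberate choice here: an automation layer that silently skips steps is harder to debug than one that stops.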
3. Plus is cheaper
Plus runs at a budget tier well below Max while keeping the 1M-token context window.
| | Qwen 3.7 Plus | Qwen 3.7 Max |
|---|---|---|
| Input modalities | Text, image, video | Text only |
| Context window | 1M tokens, shared with vision | 1M tokens |
| Input / output per 1M tokens | $0.40 / $1.60 | $2.50 / $7.50 |
| Cached input per 1M tokens | $0.08 | $0.25 |
| GUI grounding, ScreenSpot Pro | 79.0 | N/A (text-only) |
| Terminal-Bench | 70.3 | 69.7 |
| Autonomous run ceiling | 35 hours | 35 hours |
Benchmarks
The launch numbers, backed by early hands-on reviews, show a clear pattern: Plus matches or slightly trails Max on text, then pulls ahead when vision matters.
Key numbers:
- ScreenSpot Pro: 79.0. This measures GUI grounding: the ability to inspect a screenshot and produce exact pixel coordinates. Max cannot run this benchmark because it is text-only.
- Terminal-Bench: 70.3. Slightly ahead of Max’s 69.7.
- SWE-Bench Pro: about 60%. Roughly level with Max’s 60.6%.
- MCP-Atlas: 76.4. Tied with Max on tool-use orchestration.
- LM Arena: Plus sits slightly behind Max on text and coding: #15 vs #13 for text, and #12 vs #10 for coding.
Use Plus when the task includes a visual signal: screenshots, mockups, charts, documents, or video. For a text-only comparison, see our Qwen 3.7 vs GPT-5.5 vs Opus 4.7 comparison.
As always, treat vendor and early-reviewer benchmark numbers as directional, not absolute.
Pricing: the budget multimodal tier
Qwen 3.7 Plus pricing:
| Token type | Price per 1M tokens |
|---|---|
| Input | $0.40 |
| Output | $1.60 |
| Cached input | $0.08 |
That makes Plus roughly:
- 6x cheaper than Max on input ($0.40 vs $2.50)
- Nearly 5x cheaper than Max on output ($1.60 vs $7.50)
- About 3x cheaper than Max on cached input ($0.08 vs $0.25)
The important implementation detail: image and video tokens share the same 1M-token context budget. A high-resolution screenshot can consume thousands of tokens, and video frames can add up quickly.
When building your cost model, estimate:
total_context = text_tokens + image_tokens + video_tokens
If your app sends large screenshots or long videos, reduce the amount of text history you attach to each request.
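The budget formula and the pricing table can be combined into a small estimator. This is a rough sketch, not an official billing calculator: the function names are hypothetical, the token counts in the example are made up, and it assumes cached tokens are billed at the cached rate while the rest of the input is billed at the normal input rate.

```python
CONTEXT_LIMIT = 1_000_000  # shared 1M-token window

# USD per 1M tokens, copied from the pricing table above.
PLUS_PRICES = {"input": 0.40, "cached_input": 0.08, "output": 1.60}


def total_context(text_tokens: int, image_tokens: int, video_tokens: int) -> int:
    """Tokens a request occupies in the shared context window."""
    return text_tokens + image_tokens + video_tokens


def remaining_text_budget(image_tokens: int, video_tokens: int) -> int:
    """Tokens left for prompt text and history once media is accounted for."""
    return max(0, CONTEXT_LIMIT - image_tokens - video_tokens)


def request_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated cost of one request.

    Assumption: cached tokens are billed at the cached rate and the rest
    of the input at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * PLUS_PRICES["input"]
        + cached_tokens * PLUS_PRICES["cached_input"]
        + output_tokens * PLUS_PRICES["output"]
    )
    return total / 1_000_000


# Hypothetical token counts; real values depend on the tokenizer and media settings.
used = total_context(text_tokens=40_000, image_tokens=12_000, video_tokens=300_000)
print(used, remaining_text_budget(image_tokens=12_000, video_tokens=300_000))
print(round(request_cost_usd(input_tokens=used, output_tokens=2_000), 4))
```

Running the numbers per request rather than per month makes it easier to see when a single long video, not the text history, dominates both the context window and the bill.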
For more context on pricing pressure across Chinese models, see our breakdown of the 2026 Chinese LLM price war.
The catch: proprietary and API-only
Qwen built much of its traction on open weights. Earlier Qwen models often shipped under Apache 2.0 or open-use licenses, which let teams download, fine-tune, and run models inside private infrastructure.
Qwen 3.7 Plus is different.
It is delivered strictly as a managed commercial API through Alibaba Cloud Model Studio. You cannot:
- Download the weights
- Self-host the model
- Run it offline
- Deploy it inside an air-gapped environment
For regulated or air-gapped environments, that may be a blocker.
An open-weight Plus variant has been floated for Q3 2026, but it is not confirmed. If open weights are mandatory, Qwen 3.7 Plus is not the right choice today. Rivals like Step 3.7 Flash ship under Apache 2.0 and undercut it on price.
How to access Qwen 3.7 Plus
You have two practical options.
Option 1: Use the API
Call Qwen 3.7 Plus through Alibaba Cloud Model Studio.
The endpoint is OpenAI-compatible, so the request structure is familiar if you have used OpenAI-style chat completions before. Our Qwen 3.7 API guide walks through authentication and the first request.
For multimodal requests, add image or video parts to the message payload.
Option 2: Try it in chat first
Use chat.qwen.ai before writing code. If you want no-cost routes for testing the family, see our Qwen 3.7 for free guide.
Minimal multimodal API example
Here is a basic Python example using the OpenAI-compatible client.
from openai import OpenAI

# Point the OpenAI client at the Model Studio compatible-mode endpoint.
client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[
        {
            "role": "user",
            # Content is a list of parts: one text part and one image part.
            "content": [
                {
                    "type": "text",
                    "text": "Which button submits this form? Give pixel coordinates."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/screenshot.png"
                    }
                },
            ],
        }
    ],
)

# The reply text lives on the first choice.
print(resp.choices[0].message.content)
Check the Model Studio docs for the exact model identifier and regional base URL. International and China endpoints can differ.
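The example above points at a publicly reachable image URL. Screenshots captured by an agent usually live on local disk instead. A common approach with OpenAI-style endpoints is to inline the image as a base64 data URL; the sketch below assumes the Model Studio compatible endpoint accepts data URLs in `image_url.url`, which you should confirm in its docs along with any size limits. The helper names are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Build an image_url content part that inlines the image as a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def image_part_from_file(path: str) -> dict:
    """Read a local image and wrap it as a content part."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognizable image file: {path}")
    return image_part_from_bytes(Path(path).read_bytes(), mime_type)


# Usage: swap the URL-based part in the request for an inlined one, e.g.
# content = [{"type": "text", "text": "..."}, image_part_from_file("screenshot.png")]
part = image_part_from_bytes(b"\x89PNG")
print(part["image_url"]["url"][:22])
```

Base64 inflates the payload by about a third, so for large or repeated images a hosted URL may still be the better choice.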
Example: ask for structured GUI actions
For agents, avoid free-form responses when possible. Ask the model to return JSON that your automation layer can parse.
from openai import OpenAI
import json

client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Spell out the exact JSON schema so the reply is machine-parseable.
prompt = """
You are controlling a browser from screenshots.
Return only valid JSON in this format:
{
  "action": "click" | "type" | "wait" | "done",
  "target": "short description",
  "x": number,
  "y": number,
  "text": "optional text to type"
}
Task: Find the submit button and return the click coordinates.
"""

resp = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/form-screenshot.png"
                    }
                },
            ],
        }
    ],
)

# Parse the reply; json.loads raises if the model returns anything but JSON.
raw = resp.choices[0].message.content
action = json.loads(raw)
print(action)
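`json.loads(raw)` raises if the reply contains anything besides the JSON object, for example a Markdown code fence or a sentence of explanation around it. A more defensive parser is sketched below. It assumes the action schema from the prompt above; `parse_action` and `ALLOWED_ACTIONS` are hypothetical names.

```python
import json
import re

ALLOWED_ACTIONS = {"click", "type", "wait", "done"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_action(raw: str) -> dict:
    """Extract and validate the action object from a model reply."""
    # Take the span from the first "{" to the last "}" so that code
    # fences or stray prose around the JSON are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    action = json.loads(match.group(0))

    kind = action.get("action")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {kind!r}")
    if kind == "click" and not (_is_number(action.get("x")) and _is_number(action.get("y"))):
        raise ValueError("click action needs numeric x and y")
    if kind == "type" and not isinstance(action.get("text"), str):
        raise ValueError("type action needs a text string")
    return action


fence = "`" * 3  # built at runtime to keep this example readable
reply = fence + 'json\n{"action": "click", "target": "Submit", "x": 487, "y": 232}\n' + fence
print(parse_action(reply))
```

Validating before dispatch keeps malformed replies from turning into clicks at arbitrary positions.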
Your automation code can then route the result:
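# `browser` is assumed to be a Playwright-style page object
# that exposes mouse, keyboard, and wait_for_timeout.
if action["action"] == "click":
    browser.mouse.click(action["x"], action["y"])
elif action["action"] == "type":
    browser.keyboard.type(action["text"])
elif action["action"] == "wait":
    browser.wait_for_timeout(1000)  # pause for 1000 ms
elif action["action"] == "done":
    pass  # task complete; stop the agent loop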
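The routing above passes the model's coordinates straight to the mouse. That only works if the screenshot you sent has the same pixel dimensions as the surface you click on. If you downscale screenshots to save tokens, the coordinates need to be mapped back. The sketch below assumes the model's coordinates refer to the image as sent, which is worth verifying against a known button; `scale_point` is a hypothetical helper.

```python
def scale_point(x: float, y: float, sent_size: tuple, screen_size: tuple) -> tuple:
    """Map a point from the sent image's pixel space to the real screen."""
    sent_w, sent_h = sent_size
    screen_w, screen_h = screen_size
    if min(sent_w, sent_h, screen_w, screen_h) <= 0:
        raise ValueError("image and screen dimensions must be positive")
    return round(x * screen_w / sent_w), round(y * screen_h / sent_h)


# A 2560x1440 screen captured and downscaled to 1280x720 before sending.
print(scale_point(487, 232, sent_size=(1280, 720), screen_size=(2560, 1440)))
# → (974, 464)
```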
Who should use Qwen 3.7 Plus
Use Qwen 3.7 Plus for workloads like:
- Computer-use agents: agents that inspect screenshots and click through real interfaces.
- GUI automation: browser, desktop, or app workflows where coordinates matter.
- Screenshot-to-code and mockup-to-UI: the model reads a design and generates frontend code.
- Document, PDF, and chart understanding: useful when visual layout matters, not just text extraction.
- Video understanding: especially when you need a lower per-token cost.
- Long agentic runs: up to the 35-hour ceiling with many sequential tool calls.
Stick with Max if your workload is pure text and you are optimizing for SWE-Bench Pro scores or text-only latency. For mixed workloads, Plus is usually the more practical default.
If you are comparing Plus against other open and budget models, see our MiniMax M3 vs DeepSeek V4 vs Qwen 3.7 comparison.
Testing Qwen 3.7 Plus with Apidog
Because Qwen 3.7 Plus is API-only, your development workflow lives around API requests.
Multimodal requests are easy to get wrong. You need to validate:
- Image payload structure
- Video payload structure
- Auth headers
- Regional base URLs
- Model identifiers
- JSON response formats
- Tool-calling loops
- Long-running agent traces
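Some of these checks can also run in code before a request leaves your app. The following is a minimal pre-flight sketch for the payload-structure items, not a full schema validator. It knows only the `text` and `image_url` part shapes used in this post, so a video part would be reported as unknown until you add its shape from the Model Studio docs. `validate_messages` is a hypothetical helper.

```python
from urllib.parse import urlparse

VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_messages(messages: list) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in VALID_ROLES:
            problems.append(f"messages[{i}]: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text message, nothing more to check
        if not isinstance(content, list):
            problems.append(f"messages[{i}]: content must be a string or a list of parts")
            continue
        for j, part in enumerate(content):
            where = f"messages[{i}].content[{j}]"
            kind = part.get("type") if isinstance(part, dict) else None
            if kind == "text":
                if not isinstance(part.get("text"), str):
                    problems.append(f"{where}: text part needs a string 'text' field")
            elif kind == "image_url":
                # A frequent mistake is passing the URL string directly
                # instead of an object with a "url" key.
                image = part.get("image_url")
                url = image.get("url") if isinstance(image, dict) else None
                if not isinstance(url, str) or urlparse(url).scheme not in {"http", "https", "data"}:
                    problems.append(f"{where}: image_url.url must be an http(s) or data URL")
            else:
                problems.append(f"{where}: unknown part type {kind!r}")
    return problems


bad = [{"role": "user", "content": [{"type": "image_url", "image_url": "https://example.com/a.png"}]}]
print(validate_messages(bad))
```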
Apidog helps you test and debug those requests before they reach production.
You can use it to:
- Send Qwen 3.7 Plus API requests
- Inspect raw request and response bodies
- Manage Model Studio keys across environments
- Mock endpoints while your app is still being built
- Debug agent call sequences
For agent workflows, Apidog’s AI agent debugger helps you inspect the full call chain and identify where a run failed.
Download Apidog to test, debug, and mock the Qwen 3.7 Plus API before shipping.
FAQ
Is Qwen 3.7 Plus open source?
No. It is proprietary and available only as a managed API through Alibaba Cloud Model Studio. You cannot download or self-host the weights. An open-weight variant has been suggested for Q3 2026, but it is not confirmed.
Qwen 3.7 Plus or Max: which should I use?
Use Plus if you need vision, screenshots, PDFs, video, or lower pricing. Use Max if your workload is pure text and you are optimizing for SWE-Bench Pro scores or text-only latency.
How much does Qwen 3.7 Plus cost?
Qwen 3.7 Plus costs $0.40 per million input tokens, $1.60 per million output tokens, and $0.08 per million cached input tokens.
Does Qwen 3.7 Plus handle video?
Yes. It accepts text, images, and video as input. Visual tokens share the 1M-token context budget, so large media payloads reduce available text context.
What is the context window?
Qwen 3.7 Plus has a 1M-token context window shared across text, image, and video tokens.
How do I access Qwen 3.7 Plus?
Use the Alibaba Cloud Model Studio API, or try it in the browser at chat.qwen.ai.
The bottom line
Qwen 3.7 Plus takes Alibaba’s agentic flagship, adds vision, and cuts the price to a budget tier. For developers building computer-use agents, screenshot-driven coding tools, document workflows, or video-understanding apps, it is one of the cheaper frontier-tier multimodal options available.
The tradeoff is clear: closed weights and a hard dependency on Alibaba’s managed API.
If that works for your stack, start with the API. Test multimodal payloads, inspect responses, and mock production flows in Apidog before sending real traffic.

