Hassann

Posted on Jun 3 • Originally published at apidog.com

Qwen 3.7 Plus: Alibaba's multimodal agent model, benchmarks and pricing

Alibaba shipped Qwen 3.7 Plus a few days after Qwen 3.7 Max. The short version: Plus is Max with vision. It keeps the same 1M-token context and agentic backbone, adds image and video input, and costs roughly one-sixth of Max’s price on input and about one-fifth on output. If you’ve been following the family, our guide to what Qwen 3.7 is covers the text flagship; this post focuses on what the Plus variant adds and how to start using it.


One thing to flag up front: Qwen 3.7 Plus is API-only and proprietary. There are no open weights, which breaks from Qwen’s usual open-source pattern. Since Plus ships only as an API, implementation work comes down to sending requests, debugging payloads, and validating responses. That is where Apidog becomes useful, especially for multimodal API testing.

The short answer

Qwen 3.7 Plus is the multimodal, lower-cost sibling of Qwen 3.7 Max. You can pass it screenshots, design mockups, documents, or video, and it can reason over them as first-class input.

It is built for agents that operate graphical interfaces. For example, it can inspect an app screenshot and return pixel coordinates for the next action, such as:

{
  "action": "click",
  "x": 487,
  "y": 232
}

On pure text tasks, Max still has a small edge. For tasks with visual input, Plus is the better fit and costs much less. The main tradeoff is that the weights are closed.

What’s new versus Qwen 3.7 Max

Three changes matter for developers.

1. Plus accepts visual input

Qwen 3.7 Max is text-only. Qwen 3.7 Plus accepts:

Text
Images
Video

That enables use cases such as screenshot analysis, document and PDF reading, chart interpretation, and video understanding from one model.

2. Plus supports GUI grounding

Plus is positioned as a multimodal interactive agent for:

Browser automation
GUI navigation
Screenshot-driven workflows
Hybrid GUI + CLI agents

Instead of only describing a screen, it can return concrete UI actions, such as where to click.

Example target response format:

{
  "steps": [
    {
      "type": "click",
      "target": "Submit button",
      "coordinates": {
        "x": 487,
        "y": 232
      }
    }
  ]
}
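A minimal sketch of how an automation layer might parse a response shaped like the example above. The `Step` dataclass and `parse_steps` helper are hypothetical names used for illustration, and the schema is the example target format from this post, not a documented output contract.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    type: str
    target: str
    x: int
    y: int


def parse_steps(raw: str) -> list[Step]:
    """Parse the example `steps` payload into typed Step objects."""
    data = json.loads(raw)
    steps = []
    for item in data["steps"]:
        coords = item["coordinates"]
        steps.append(
            Step(item["type"], item["target"], int(coords["x"]), int(coords["y"]))
        )
    return steps


example = (
    '{"steps": [{"type": "click", "target": "Submit button", '
    '"coordinates": {"x": 487, "y": 232}}]}'
)
steps = parse_steps(example)
print(steps[0])
```

Typed steps make it easier to reject malformed output early, before any click reaches a real interface.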

3. Plus is cheaper

Plus runs at a budget tier well below Max while keeping the 1M-token context window.

Feature	Qwen 3.7 Plus	Qwen 3.7 Max
Input modalities	Text, image, video	Text only
Context window	1M tokens, shared with vision	1M tokens
Input / output per 1M tokens	$0.40 / $1.60	$2.50 / $7.50
Cached input per 1M tokens	$0.08	$0.25
GUI grounding, ScreenSpot Pro	79.0	N/A (text-only)
Terminal-Bench	70.3	69.7
Autonomous run ceiling	35 hours	35 hours

Benchmarks

The launch numbers, backed by early hands-on reviews, show a clear pattern: Plus matches or slightly trails Max on text, then pulls ahead when vision matters.

Key numbers:

ScreenSpot Pro: 79.0. This measures GUI grounding: the ability to inspect a screenshot and produce exact pixel coordinates. Max cannot run this benchmark because it is text-only.
Terminal-Bench: 70.3. Slightly ahead of Max’s 69.7.
SWE-Bench Pro: about 60%. Roughly level with Max’s 60.6%.
MCP-Atlas: 76.4. Tied with Max on tool-use orchestration.
LM Arena: Plus sits slightly behind Max, at #15 vs #13 for text and #12 vs #10 for coding.

Use Plus when the task includes a visual signal: screenshots, mockups, charts, documents, or video. For a text-only comparison, see our Qwen 3.7 vs GPT-5.5 vs Opus 4.7 comparison.

As always, treat vendor and early-reviewer benchmark numbers as directional, not absolute.

Pricing: the budget multimodal tier

Qwen 3.7 Plus pricing:

Token type	Price per 1M tokens
Input	$0.40
Output	$1.60
Cached input	$0.08

That makes Plus roughly:

About 6x cheaper than Max on input ($0.40 vs $2.50)
Nearly 5x cheaper than Max on output ($1.60 vs $7.50)
About 3x cheaper than Max on cached input ($0.08 vs $0.25)
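The ratios above follow directly from the listed prices. A short worked calculation, using only the per-1M-token figures from the comparison table in this post:

```python
# Prices in USD per 1M tokens, copied from the comparison table in this post.
PLUS = {"input": 0.40, "output": 1.60, "cached_input": 0.08}
MAX = {"input": 2.50, "output": 7.50, "cached_input": 0.25}

# How many times more Max costs than Plus for each token type.
ratios = {kind: MAX[kind] / PLUS[kind] for kind in PLUS}

for kind, ratio in ratios.items():
    print(f"{kind}: Max costs about {ratio:.1f}x as much as Plus")
```

The gap is widest on fresh input tokens and narrowest on cached input, so workloads with heavy prompt caching see a smaller relative saving.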

The important implementation detail: image and video tokens share the same 1M-token context budget. A high-resolution screenshot can consume thousands of tokens, and video frames can add up quickly.

When building your cost model, estimate:

total_context = text_tokens + image_tokens + video_tokens

If your app sends large screenshots or long videos, reduce the amount of text history you attach to each request.
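As a rough sketch, the budget formula above can be turned into a per-request estimate. The `estimate_request` helper and the token counts are hypothetical, and the calculation assumes image and video tokens are billed at the regular input rate, which you should confirm against the Model Studio pricing page.

```python
CONTEXT_LIMIT = 1_000_000  # 1M-token window shared by text, image, and video

# Plus prices in USD per 1M tokens, from the pricing table in this post.
INPUT_PRICE = 0.40
OUTPUT_PRICE = 1.60


def estimate_request(text_tokens, image_tokens, video_tokens, output_tokens):
    """Return (total_context, fits_in_window, estimated_cost_usd)."""
    total_context = text_tokens + image_tokens + video_tokens
    fits = total_context <= CONTEXT_LIMIT
    # Assumption: visual tokens are charged like any other input token.
    cost = (total_context * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
    return total_context, fits, cost


# Hypothetical token counts for a single agent step.
total, fits, cost = estimate_request(
    text_tokens=20_000, image_tokens=6_000, video_tokens=0, output_tokens=500
)
print(total, fits, round(cost, 6))
```

Multiplying the per-step cost by the expected number of steps gives a first-order budget for a long agent run.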

For more context on pricing pressure across Chinese models, see our breakdown of the 2026 Chinese LLM price war.

The catch: proprietary and API-only

Qwen built much of its traction on open weights. Earlier Qwen models often shipped under Apache 2.0 or open-use licenses, which let teams download, fine-tune, and run models inside private infrastructure.

Qwen 3.7 Plus is different.

It is delivered strictly as a managed commercial API through Alibaba Cloud Model Studio. You cannot:

Download the weights
Self-host the model
Run it offline
Deploy it inside an air-gapped environment

For regulated or air-gapped environments, that may be a blocker.

An open-weight Plus variant has been floated for Q3 2026, but it is not confirmed. If open weights are mandatory, Qwen 3.7 Plus is not the right choice today. Rivals like Step 3.7 Flash ship under Apache 2.0 and undercut it on price.

How to access Qwen 3.7 Plus

You have two practical options.

Option 1: Use the API

Call Qwen 3.7 Plus through Alibaba Cloud Model Studio.

The endpoint is OpenAI-compatible, so the request structure is familiar if you have used OpenAI-style chat completions before. Our Qwen 3.7 API guide walks through authentication and the first request.

For multimodal requests, add image or video parts to the message payload.
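When the screenshot lives on disk rather than at a public URL, one common approach with OpenAI-style chat completions is to embed it as a base64 data URL. The sketch below builds such a message; the helper names are hypothetical, and it assumes the Model Studio endpoint accepts data URLs in `image_url` parts, which is worth verifying in the docs. Video parts are not shown because this post does not specify their payload format.

```python
import base64


def image_part_from_bytes(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style `image_url` content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_user_message(prompt: str, image_bytes: bytes) -> dict:
    """Combine a text part and an image part into one user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            image_part_from_bytes(image_bytes),
        ],
    }


# Placeholder bytes; in practice read them from a real screenshot file.
message = build_user_message("Describe this screen.", b"\x89PNG placeholder bytes")
print(message["content"][1]["image_url"]["url"][:22])
```

The resulting dictionary can be passed as one element of the `messages` list in a chat completions call.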

Option 2: Try it in chat first

Use chat.qwen.ai before writing code. If you want no-cost routes for testing the family, see our Qwen 3.7 for free guide.

Minimal multimodal API example

Here is a basic Python example using the OpenAI-compatible client.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Which button submits this form? Give pixel coordinates."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/screenshot.png"
                    }
                },
            ],
        }
    ],
)

print(resp.choices[0].message.content)

Check the Model Studio docs for the exact model identifier and regional base URL. International and China endpoints can differ.
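One way to keep region and model choices out of the code is to resolve them from environment-style settings. This is a sketch under assumptions: the variable names `QWEN_REGION`, `QWEN_MODEL`, and `DASHSCOPE_API_KEY` are illustrative, and the China base URL is an assumption to confirm against the Model Studio docs (the international one is the URL used in the example above).

```python
import os

# Assumed regional base URLs; verify both against the Model Studio docs.
BASE_URLS = {
    "intl": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "cn": "https://dashscope.aliyuncs.com/compatible-mode/v1",
}


def load_settings(env=os.environ):
    """Resolve API key, base URL, and model id from environment-style settings."""
    region = env.get("QWEN_REGION", "intl")
    if region not in BASE_URLS:
        raise ValueError(f"Unknown region: {region!r}")
    return {
        "api_key": env.get("DASHSCOPE_API_KEY", ""),
        "base_url": BASE_URLS[region],
        "model": env.get("QWEN_MODEL", "qwen3.7-plus"),
    }


# A plain dict stands in for os.environ here so the example runs anywhere.
settings = load_settings(
    {"QWEN_REGION": "intl", "DASHSCOPE_API_KEY": "YOUR_MODEL_STUDIO_KEY"}
)
print(settings["base_url"])
```

The `api_key` and `base_url` values can then be passed to the `OpenAI(...)` constructor, and `model` to the completion call.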

Example: ask for structured GUI actions

For agents, avoid free-form responses when possible. Ask the model to return JSON that your automation layer can parse.

from openai import OpenAI
import json

client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

prompt = """
You are controlling a browser from screenshots.

Return only valid JSON in this format:
{
  "action": "click" | "type" | "wait" | "done",
  "target": "short description",
  "x": number,
  "y": number,
  "text": "optional text to type"
}

Task: Find the submit button and return the click coordinates.
"""

resp = client.chat.completions.create(
    model="qwen3.7-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/form-screenshot.png"
                    }
                },
            ],
        }
    ],
)

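raw = resp.choices[0].message.content
# json.loads raises json.JSONDecodeError if the reply is not pure JSON,
# so production code should catch that and retry or fall back.
action = json.loads(raw)

print(action)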

Your automation code can then route the result:

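# `browser` is assumed to be a Playwright-style page object.
if action["action"] == "click":
    browser.mouse.click(action["x"], action["y"])
elif action["action"] == "type":
    browser.keyboard.type(action["text"])
elif action["action"] == "wait":
    browser.wait_for_timeout(1000)  # pause one second before the next screenshot
elif action["action"] == "done":
    pass  # task finished; stop the agent loop here
else:
    raise ValueError(f"Unsupported action: {action['action']}")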
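Model replies do not always arrive as clean JSON; a reply wrapped in a markdown code fence or missing a field would break a naive parse. The sketch below shows one defensive approach. The `parse_action` helper is a hypothetical name, and the validation rules simply mirror the JSON format requested in the prompt above.

```python
import json

ALLOWED_ACTIONS = {"click", "type", "wait", "done"}
FENCE = "`" * 3  # markdown code fence marker


def parse_action(raw: str) -> dict:
    """Parse a model reply into a validated action dict, or raise ValueError."""
    text = raw.strip()
    # Strip a surrounding markdown fence (with optional "json" tag) if present.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):].strip()
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(action, dict) or action.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"Unexpected action payload: {action!r}")
    if action["action"] == "click":
        for key in ("x", "y"):
            if not isinstance(action.get(key), (int, float)):
                raise ValueError(f"Click action needs a numeric {key!r}")
    if action["action"] == "type" and not isinstance(action.get("text"), str):
        raise ValueError("Type action needs a 'text' string")
    return action


# Simulated reply wrapped in a fence, as some models produce.
reply = f'{FENCE}json\n{{"action": "click", "target": "Submit button", "x": 487, "y": 232}}\n{FENCE}'
print(parse_action(reply))
```

Raising a single exception type keeps the agent loop simple: on `ValueError`, it can re-prompt the model or abort the step instead of acting on bad coordinates.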

Who should use Qwen 3.7 Plus

Use Qwen 3.7 Plus for workloads like:

Computer-use agents: agents that inspect screenshots and click through real interfaces.
GUI automation: browser, desktop, or app workflows where coordinates matter.
Screenshot-to-code and mockup-to-UI: the model reads a design and generates frontend code.
Document, PDF, and chart understanding: useful when visual layout matters, not just text extraction.
Video understanding: especially when you need a lower per-token cost.
Long agentic runs: up to the 35-hour ceiling with many sequential tool calls.

Stick with Max if your workload is pure text and you are optimizing for SWE-Bench Pro scores or text-only latency. For mixed workloads, Plus is usually the more practical default.

If you are comparing Plus against other open and budget models, see our MiniMax M3 vs DeepSeek V4 vs Qwen 3.7 comparison.

Testing Qwen 3.7 Plus with Apidog

Because Qwen 3.7 Plus is API-only, your development workflow lives around API requests.

Multimodal requests are easy to get wrong. You need to validate:

Image payload structure
Video payload structure
Auth headers
Regional base URLs
Model identifiers
JSON response formats
Tool-calling loops
Long-running agent traces

Apidog helps you test and debug those requests before they reach production.

You can use it to:

Send Qwen 3.7 Plus API requests
Inspect raw request and response bodies
Manage Model Studio keys across environments
Mock endpoints while your app is still being built
Debug agent call sequences

For agent workflows, Apidog’s AI agent debugger helps you inspect the full call chain and identify where a run failed.

Download Apidog to test, debug, and mock the Qwen 3.7 Plus API before shipping.

FAQ

Is Qwen 3.7 Plus open source?

No. It is proprietary and available only as a managed API through Alibaba Cloud Model Studio. You cannot download or self-host the weights. An open-weight variant has been suggested for Q3 2026, but it is not confirmed.

Qwen 3.7 Plus or Max: which should I use?

Use Plus if you need vision, screenshots, PDFs, video, or lower pricing. Use Max if your workload is pure text and you are optimizing for SWE-Bench Pro scores or text-only latency.

How much does Qwen 3.7 Plus cost?

Qwen 3.7 Plus costs $0.40 per million input tokens, $1.60 per million output tokens, and $0.08 per million cached input tokens.

Does Qwen 3.7 Plus handle video?

Yes. It accepts text, images, and video as input. Visual tokens share the 1M-token context budget, so large media payloads reduce available text context.

What is the context window?

Qwen 3.7 Plus has a 1M-token context window shared across text, image, and video tokens.

How do I access Qwen 3.7 Plus?

Use the Alibaba Cloud Model Studio API, or try it in the browser at chat.qwen.ai.

The bottom line

Qwen 3.7 Plus takes Alibaba’s agentic flagship, adds vision, and cuts the price to a budget tier. For developers building computer-use agents, screenshot-driven coding tools, document workflows, or video-understanding apps, it is one of the cheaper frontier-tier multimodal options available.

The tradeoff is clear: closed weights and a hard dependency on Alibaba’s managed API.

If that works for your stack, start with the API. Test multimodal payloads, inspect responses, and mock production flows in Apidog before sending real traffic.
