Preecha

Posted on Jun 13

Qwen3.6-Plus API: Beats Claude on Terminal Benchmarks

TL;DR

Qwen3.6-Plus is now officially available. It scores 78.8% on SWE-bench Verified and 61.6% on Terminal-Bench 2.0, where it beats Claude Opus 4.5. It also ships with a 1M token context window, a new preserve_thinking parameter for agent loops, and compatibility with Claude Code, OpenClaw, and Qwen Code through OpenAI-compatible and Anthropic-compatible APIs.

From preview to release

If you tested Qwen3.6-Plus Preview on OpenRouter, the official release is the production-ready version of that model. The preview launched quietly on March 30 with no waitlist and free access through OpenRouter. In its first two days, it processed more than 400 million completion tokens across roughly 400,000 requests, which works out to about 1,000 completion tokens per request on average.

The official release moves Qwen3.6-Plus to Alibaba Cloud Model Studio with a stable API, SLA-backed uptime, and a new API parameter designed for multi-step agent workflows.

This guide focuses on implementation:

What changed from the preview
How to call the API
How to enable preserve_thinking
How to connect Qwen3.6-Plus to Claude Code, OpenClaw, and Qwen Code
How to test the integration with Apidog before deploying

What Qwen3.6-Plus is

Qwen3.6-Plus is a hosted mixture-of-experts model from Alibaba's Qwen team. Like the Qwen3.5 series, it uses sparse activation, so only a subset of parameters is active for each token. That helps deliver strong performance without the compute cost of a similarly capable dense model.

Launch specs:

1 million token context window by default
Mandatory chain-of-thought reasoning, same as the preview
New preserve_thinking parameter for agentic tasks
Native multimodal support for vision, video, and document understanding
OpenAI-compatible API
Anthropic-compatible API
OpenAI Responses API support

Smaller open-source variants are expected to follow. The hosted API model itself does not ship with downloadable weights, so if you need to self-host, you will have to wait for those variants.

Benchmark results

Coding agents

Qwen3.6-Plus is close to Claude Opus 4.5 on most SWE-bench tasks and leads the comparison on terminal operations.

Terminal-Bench 2.0 evaluates real shell workflows, including file operations, process control, and multi-step terminal tasks under a 3-hour timeout with 32 CPU cores and 48GB RAM.

Qwen3.6-Plus scores 61.6% compared with Claude Opus 4.5 at 59.3%. That matters if your agent needs to run shell-heavy workflows such as builds, test execution, file edits, and process management.

General agents and tool use

Benchmark	Claude Opus 4.5	Qwen3.6-Plus
TAU3-Bench	70.2%	70.7%
DeepPlanning	33.9%	41.5%
MCPMark	42.3%	48.2%
MCP-Atlas	71.8%	74.1%
WideSearch	76.4%	74.3%

MCPMark tests GitHub MCP v0.30.3 tool calls, with Playwright responses truncated at 32K tokens. Qwen3.6-Plus leading at 48.2% is relevant if you are building MCP-based developer tooling.

DeepPlanning at 41.5% versus Claude Opus 4.5 at 33.9% also points to stronger long-horizon planning performance.
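To see the size of each gap at a glance, the percentage-point differences can be computed directly from the table. This is a small sketch using only the scores listed above; the `point_gaps` helper is an illustrative name, not part of any benchmark tooling.

```python
# Scores copied from the table above: (Claude Opus 4.5, Qwen3.6-Plus).
scores = {
    "TAU3-Bench": (70.2, 70.7),
    "DeepPlanning": (33.9, 41.5),
    "MCPMark": (42.3, 48.2),
    "MCP-Atlas": (71.8, 74.1),
    "WideSearch": (76.4, 74.3),
}


def point_gaps(table):
    """Return Qwen3.6-Plus minus Claude Opus 4.5, in percentage points."""
    return {
        name: round(qwen - claude, 1)
        for name, (claude, qwen) in table.items()
    }


gaps = point_gaps(scores)

# Print the largest Qwen3.6-Plus leads first; negative values favor Claude.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:+.1f} points")
```

A positive value means Qwen3.6-Plus leads on that row. WideSearch is the only row in this table where the sign flips.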

Reasoning and knowledge

Benchmark	Claude Opus 4.5	Qwen3.6-Plus
GPQA	87.0%	90.4%
LiveCodeBench v6	84.8%	87.1%
IFEval strict	90.9%	94.3%
MMLU-Pro	89.5%	88.5%

GPQA evaluates graduate-level science reasoning. IFEval strict measures whether a model follows precise formatting and constraint instructions.

Qwen3.6-Plus leads on both GPQA and IFEval strict. The IFEval result in particular matters for structured output, multi-step agents, and tool-calling workflows, where instruction drift can break execution.

Multimodal

Qwen3.6-Plus is a native multimodal model and leads several document, spatial, and object detection benchmarks.

Benchmark	Qwen3.6-Plus	Notes
OmniDocBench 1.5	91.2%	Top in table
RefCOCO avg	93.5%	Top in table
We-Math	89.0%	Top in table
CountBench	97.6%	Top in table
OSWorld-Verified	62.5%	Behind Claude, which scored 66.3%

Claude Opus 4.5 leads OSWorld-Verified at 66.3% versus Qwen3.6-Plus at 62.5%. For document understanding and spatial grounding tasks, Qwen3.6-Plus leads in the reported comparison.

How to call the API

Qwen3.6-Plus is available through Alibaba Cloud Model Studio. Get an API key from modelstudio.alibabacloud.com.

Regional base URLs:

Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1
US Virginia: https://dashscope-us.aliyuncs.com/compatible-mode/v1
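If your code runs in more than one region, a small lookup keeps the base URL in one place. The sketch below is illustrative: `REGION_BASE_URLS`, the region keys, and `base_url_for` are hypothetical names chosen for this example, and the URLs are copied from the list above.

```python
# Hypothetical helper: the region keys and function name are illustrative.
# The URLs are the regional base URLs listed in the article.
REGION_BASE_URLS = {
    "singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "us-virginia": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",
}


def base_url_for(region: str) -> str:
    """Return the OpenAI-compatible base URL for a region name."""
    key = region.strip().lower()
    if key not in REGION_BASE_URLS:
        known = ", ".join(sorted(REGION_BASE_URLS))
        raise ValueError(f"Unknown region {region!r}; expected one of: {known}")
    return REGION_BASE_URLS[key]
```

You could then pass `base_url_for("singapore")` as the `base_url` argument when constructing the client.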

Basic streaming call

Install the OpenAI Python SDK if needed:

pip install openai

Then call the OpenAI-compatible endpoint:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=[
        {
            "role": "user",
            "content": "Review this Python function and find bugs."
        }
    ],
    extra_body={
        "enable_thinking": True
    },
    stream=True,
)

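reasoning = ""        # chain-of-thought text streamed before the answer
answer = ""           # final answer text
is_answering = False  # flips to True once the first answer token arrives

for chunk in completion:
    # Some chunks (for example, usage-only chunks) carry no choices.
    if not chunk.choices:
        continue

    delta = chunk.choices[0].delta

    # reasoning_content is not a standard OpenAI SDK field, so read it defensively.
    reasoning_piece = getattr(delta, "reasoning_content", None)
    if reasoning_piece and not is_answering:
        reasoning += reasoning_piece

    if delta.content:
        is_answering = True
        answer += delta.content
        print(delta.content, end="", flush=True)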

Use this pattern when you want to stream the final answer while optionally collecting reasoning content separately.
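The accumulation logic can be exercised offline before you spend tokens on a live call. The sketch below wraps it in a function and feeds it stand-in chunks; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the attribute layout used in the loop above (`choices[0].delta.content` and `delta.reasoning_content`).

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning and answer text from streamed chat chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        reasoning_piece = getattr(delta, "reasoning_content", None)
        if reasoning_piece:
            reasoning_parts.append(reasoning_piece)
        content_piece = getattr(delta, "content", None)
        if content_piece:
            answer_parts.append(content_piece)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Build a stand-in chunk with the attribute layout the loop expects."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk(reasoning="Check the loop bounds. "),
    SimpleNamespace(choices=[]),  # a chunk with no choices
    fake_chunk(content="Off-by-one "),
    fake_chunk(content="in range()."),
]

reasoning_text, answer_text = collect_stream(demo_chunks)
```

Keeping the parser separate from the network call also makes it easy to swap in the real `completion` iterator later.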

Use `preserve_thinking` for agent loops

The preview version only kept reasoning from the current turn. The official release adds preserve_thinking.

When you set preserve_thinking: true, the model retains chain-of-thought from previous turns in the conversation. Alibaba recommends this for agent scenarios because multi-step agents often need to reference the reasoning that led to earlier decisions.

It is disabled by default to control token usage. Enable it for multi-turn agent workflows.

completion = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=conversation_history,
    extra_body={
        "enable_thinking": True,
        "preserve_thinking": True,
    },
    stream=True,
)

Minimal multi-turn agent pattern

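# Assumes `client` is the OpenAI client created in the basic streaming example.
conversation = []

def agent_step(user_message, preserve=True):
    """Send one user turn and record the assistant reply in the history."""
    conversation.append({
        "role": "user",
        "content": user_message
    })

    response = client.chat.completions.create(
        model="qwen3.6-plus",
        messages=conversation,
        extra_body={
            "enable_thinking": True,
            "preserve_thinking": preserve,
        },
        stream=False,
    )

    message = response.choices[0].message

    assistant_turn = {
        "role": "assistant",
        "content": message.content
    }

    # Assumption: preserve_thinking reads earlier reasoning from the history,
    # so keep reasoning_content on the assistant turn when the API returns it.
    # Verify this against the Model Studio documentation before relying on it.
    reasoning = getattr(message, "reasoning_content", None)
    if preserve and reasoning:
        assistant_turn["reasoning_content"] = reasoning

    conversation.append(assistant_turn)

    return message.content

# Example: multi-step code review agent
result = agent_step("Analyze the auth module for security issues.")
result = agent_step("Now suggest fixes for the top 3 issues you found.")
result = agent_step("Write tests that validate each fix.")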

Without preserve_thinking, the model on step 3 may not retain why it selected the issues in step 1. With preserve_thinking, the reasoning chain is preserved across turns.

Use Qwen3.6-Plus with Claude Code

The Qwen API supports the Anthropic protocol, so Claude Code can run against Qwen3.6-Plus by changing environment variables.

Install Claude Code:

npm install -g @anthropic-ai/claude-code

Set the Qwen-compatible Anthropic endpoint:

export ANTHROPIC_MODEL="qwen3.6-plus"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.6-plus"
export ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
export ANTHROPIC_AUTH_TOKEN=your_dashscope_api_key

claude

That lets you keep your Claude Code workflow while routing model calls to Qwen3.6-Plus.
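A missing variable is an easy mistake to make with this setup, so a quick pre-flight check can save a confusing first run. This is a minimal sketch: `missing_claude_code_vars` is a hypothetical helper, and it only checks that the four variables from the export commands above are set, not that their values are valid.

```python
import os

# The four variables exported in the shell snippet above.
REQUIRED_VARS = (
    "ANTHROPIC_MODEL",
    "ANTHROPIC_SMALL_FAST_MODEL",
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
)


def missing_claude_code_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_claude_code_vars()
    if missing:
        print("Set these before running claude:", ", ".join(missing))
    else:
        print("All Claude Code variables are set.")
```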

Use Qwen3.6-Plus with OpenClaw

OpenClaw, formerly Moltbot / Clawdbot, is an open-source self-hosted coding agent.

Install it:

# Node.js 22+
curl -fsSL https://molt.bot/install.sh | bash

Set your API key and start the dashboard:

export DASHSCOPE_API_KEY=your_key
openclaw dashboard

Edit ~/.openclaw/openclaw.json and merge these fields. Do not overwrite the whole file.

{
  "models": {
    "providers": [
      {
        "name": "alibaba-coding-plan",
        "baseUrl": "https://coding-intl.dashscope.aliyuncs.com/v1",
        "apiKey": "${DASHSCOPE_API_KEY}",
        "models": [
          {
            "id": "qwen3.6-plus",
            "reasoning": true
          }
        ]
      }
    ]
  },
  "agents": {
    "defaults": {
      "models": ["qwen3.6-plus"]
    }
  }
}
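Merging by hand is error-prone when the file already has other settings. The sketch below shows one way to merge the fragment programmatically. It makes several assumptions: the file is plain JSON without comments, providers are identified by their `name` field, and `merge_config` and `update_config_file` are hypothetical helpers rather than anything OpenClaw ships.

```python
import json
from pathlib import Path


def merge_config(existing: dict, patch: dict) -> dict:
    """Recursively merge patch into existing without dropping unrelated keys."""
    merged = dict(existing)
    for key, value in patch.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_config(current, value)
        elif key == "providers" and isinstance(current, list) and isinstance(value, list):
            # Assumption: a provider is identified by "name". Same-named
            # entries are replaced, new ones are appended.
            by_name = {provider.get("name"): provider for provider in current}
            for provider in value:
                by_name[provider.get("name")] = provider
            merged[key] = list(by_name.values())
        else:
            # Scalars and other lists are replaced by the patch value.
            merged[key] = value
    return merged


def update_config_file(path: Path, patch: dict) -> None:
    """Read a JSON config, merge the patch, and write the result back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_config(existing, patch)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

Back up the original file first, since this rewrites it in place and normalizes its formatting.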

Use Qwen3.6-Plus with Qwen Code

Qwen Code is Alibaba's open-source terminal agent for the Qwen model family. It provides 1,000 free API calls per day when you sign in with Qwen Code OAuth.

Install and authenticate:

npm install -g @qwen-code/qwen-code@latest

qwen

# In the Qwen Code UI, type:
/auth

After authentication, you can use Qwen3.6-Plus directly from the terminal agent.

Why `preserve_thinking` changes agent behavior

Most LLM APIs treat each turn independently. The model generates a response, reasoning is discarded, and the next turn starts with the visible conversation only.

That works for simple Q&A. It is weaker for agents running 10 to 20 step tasks because the model may lose track of why it made earlier choices.

preserve_thinking keeps the reasoning from previous turns available when generating the next response. In practice, an agent working through a repository-level task on step 8 can still use its analysis from steps 2, 4, and 6.

Alibaba's benchmarks also indicate this can reduce redundant reasoning. If the model does not need to re-derive context it already established, it can use fewer tokens per turn on complex multi-step workflows.

Use preserve_thinking when your agent:

Reviews or modifies multiple files over several steps
Uses tools repeatedly
Needs to keep a plan consistent
Must explain or act on earlier decisions
Chains terminal operations, tests, and code edits
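One way to apply this checklist in code is to decide the flag from a short description of the task. The sketch below is a heuristic, not an Alibaba recommendation: the `AgentTask` fields, the `thinking_options` helper, and the three-step threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentTask:
    """Hypothetical description of a workflow; field names are illustrative."""
    expected_steps: int
    uses_tools: bool = False
    edits_multiple_files: bool = False


def thinking_options(task: AgentTask, min_steps: int = 3) -> dict:
    """Build an extra_body dict, enabling preserve_thinking for agent-like tasks."""
    preserve = (
        task.expected_steps >= min_steps  # arbitrary threshold, tune for your agent
        or task.uses_tools
        or task.edits_multiple_files
    )
    return {"enable_thinking": True, "preserve_thinking": preserve}


# A single-turn question keeps the default; a tool-using agent opts in.
simple_question = thinking_options(AgentTask(expected_steps=1))
repo_refactor = thinking_options(AgentTask(expected_steps=8, uses_tools=True))
```

The returned dict can be passed as `extra_body` in the chat completion call, so simple Q&A requests do not pay for reasoning history they do not need.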

What Qwen3.6-Plus is best for

Repository-level bug fixing

SWE-bench Verified at 78.8% and SWE-bench Pro at 56.6% make Qwen3.6-Plus competitive for automated code repair, review, and repository-level debugging pipelines.

Terminal automation

Its lead on Terminal-Bench 2.0 suggests it is a good fit for shell-heavy workflows such as:

Multi-step file operations
Process management
Build pipelines
Test execution
CLI-driven debugging

MCP tool calling

MCPMark at 48.2% makes Qwen3.6-Plus a strong option for MCP-based tool integrations, especially GitHub and browser automation workflows.

Long-context document analysis

The 1M token context window is useful for:

Full codebase review
Large specification analysis
Multi-file reasoning
Long policy, legal, or technical document processing
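Before sending a whole codebase, it helps to estimate whether it fits. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not the Qwen tokenizer, so treat the result as a coarse check only. The `estimate_tokens` and `fits_in_context` helpers and the output reserve are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # context window stated in the article
CHARS_PER_TOKEN = 4                # rough assumption, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(documents, reserve_for_output: int = 50_000) -> bool:
    """Check whether the documents plausibly fit, leaving room for the reply."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= CONTEXT_WINDOW_TOKENS - reserve_for_output
```

Code and non-English text often tokenize less efficiently than English prose, so leave extra headroom or count tokens with the provider's own tooling when accuracy matters.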

Frontend code generation

The Qwen team's internal QwenWebBench covers seven categories:

Web Design
Web Apps
Games
SVG
Data Visualization
Animation
3D

Qwen3.6-Plus scores 1501.7 versus 1517.9 for Claude Opus 4.5. That is a gap of about 16 points, so the two models are close on this reported frontend generation benchmark.

Multilingual workflows

Qwen3.6-Plus scores 84.3% on WMT24++ and 88.2% on MAXIFE across 23 language settings, making it useful for non-English and multilingual applications.

Testing Qwen3.6-Plus API calls with Apidog

The API is OpenAI-compatible, so you can test it in Apidog like any other HTTP API.

Create a POST request:

https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions

Add your API key as an environment variable and pass it in the Authorization header:

Authorization: Bearer {{DASHSCOPE_API_KEY}}

Example request body:

{
  "model": "qwen3.6-plus",
  "messages": [
    {
      "role": "user",
      "content": "Review this Python function and find bugs."
    }
  ],
  "stream": false,
  "enable_thinking": true
}

Write assertions to validate the response before wiring it into production code:

pm.test("Response contains choices", () => {
  const body = pm.response.json();

  pm.expect(body).to.have.property("choices");
  pm.expect(body.choices).to.be.an("array").and.not.empty;
  pm.expect(body.choices[0].message.content).to.be.a("string").and.not.empty;
});

pm.test("No empty reasoning when thinking enabled", () => {
  const choice = pm.response.json().choices[0];

  if (choice.message.reasoning_content !== undefined) {
    pm.expect(choice.message.reasoning_content).to.not.be.empty;
  }
});

For agent workflows, create an Apidog Test Scenario that chains multiple requests:

Send the first prompt.
Save the assistant response.
Send the second prompt with conversation history.
Enable preserve_thinking.
Validate the response structure after each step.
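The same chain can be prototyped in code before you build the scenario. This is a sketch under stated assumptions: `validate_chat_response`, `run_scenario`, and `stub_send` are hypothetical helpers, `send` stands in for whatever performs the HTTP POST, and placing `preserve_thinking` at the top level of the JSON body mirrors how `enable_thinking` appears in the example request body above.

```python
def validate_chat_response(body: dict) -> str:
    """Apply the structural checks from the assertions and return the text."""
    choices = body.get("choices")
    if not isinstance(choices, list) or not choices:
        raise ValueError("response has no choices")
    content = choices[0].get("message", {}).get("content")
    if not isinstance(content, str) or not content:
        raise ValueError("first choice has no text content")
    return content


def run_scenario(prompts, send):
    """Send prompts in order, carrying the conversation history forward."""
    history, answers = [], []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        body = send({
            "model": "qwen3.6-plus",
            "messages": list(history),  # copy so later turns do not mutate it
            "stream": False,
            "enable_thinking": True,
            "preserve_thinking": True,  # assumed top-level, like enable_thinking
        })
        answer = validate_chat_response(body)  # validate after each step
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers


def stub_send(request_body):
    """Offline stand-in for the HTTP call; reports how many messages it saw."""
    count = len(request_body["messages"])
    return {"choices": [{"message": {"content": f"saw {count} messages"}}]}


results = run_scenario(["first prompt", "second prompt"], stub_send)
```

With the stub, the second request sees three messages (user, assistant, user), which confirms the history is being chained. Swap `stub_send` for a real HTTP call once the structure checks pass.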

You can also use Apidog Smart Mock to generate test responses while developing your orchestration layer. That lets you test request handling, retries, parsing, and error paths without calling the live API on every run.

What's coming next

The Qwen team confirmed that smaller open-source variants are expected within days. These are planned to follow the Qwen3.5 pattern: sparse MoE models with public Apache 2.0 weights.

The roadmap also includes:

Longer-horizon repository-level tasks
More complex multi-file problem solving
Continued multimodal agent development
GUI agents and visual coding as first-class capabilities

The Qwen3.5 open-source variants became widely deployed self-hosted models shortly after release. If Qwen3.6 follows the same pattern, the smaller variants may become common choices for self-hosted coding agents after they land.

Conclusion

Qwen3.6-Plus narrows the gap with Claude Opus 4.5 on coding tasks and leads in terminal operations, MCP tool calling, and long-horizon planning in the reported benchmarks.

For developers, the main implementation points are:

Use the OpenAI-compatible API for standard chat completions.
Set enable_thinking to true when you need reasoning output.
Enable preserve_thinking for multi-step agents.
Use the Anthropic-compatible endpoint for Claude Code.
Test requests, assertions, mocks, and multi-turn scenarios in Apidog before deployment.

The official API adds production stability, SLA coverage, and the new agent-focused preserve_thinking parameter. If you are building coding agents, terminal automation, or MCP-based tooling, Qwen3.6-Plus is worth benchmarking against your current model stack.

FAQ

What is the difference between Qwen3.6-Plus and the preview?

The preview, qwen/qwen3.6-plus-preview, launched on OpenRouter on March 30, 2026. The official release adds the preserve_thinking parameter, SLA-backed uptime, and full Model Studio support. Smaller open-source variants are also expected.

What is `preserve_thinking` and when should I use it?

By default, only reasoning from the current turn is kept. When preserve_thinking: true is set, the model retains chain-of-thought from previous conversation turns. Use it for multi-step agent loops where previous reasoning should inform later actions.

How does Qwen3.6-Plus compare to Claude Opus 4.5?

Claude Opus 4.5 leads on SWE-bench Verified, 80.9% versus 78.8%, and OSWorld-Verified, 66.3% versus 62.5%.

Qwen3.6-Plus leads on:

Terminal-Bench 2.0: 61.6% versus 59.3%
MCPMark: 48.2% versus 42.3%
DeepPlanning: 41.5% versus 33.9%
GPQA: 90.4% versus 87.0%

Can I use Qwen3.6-Plus with Claude Code?

Yes. Set ANTHROPIC_BASE_URL to the DashScope Anthropic-compatible endpoint, ANTHROPIC_MODEL to qwen3.6-plus, and ANTHROPIC_AUTH_TOKEN to your DashScope API key.

Is Qwen3.6-Plus open source?

The hosted API model is not open-weight. The Qwen team has said that smaller variants with public weights are expected within days.

How do I get free access?

Qwen Code provides 1,000 free API calls per day when you sign in with Qwen Code OAuth. Install Qwen Code:

npm install -g @qwen-code/qwen-code@latest

Run it:

qwen

Then type:

/auth

What context window does Qwen3.6-Plus support?

It supports 1 million tokens by default. Some official benchmarks used 256K for standardized comparison, but the API default is 1M.

How do I test the API integration before deploying?

Import the OpenAI-compatible endpoint into Apidog, add your API key as an environment variable, write response assertions, and use Smart Mock for offline development. For multi-turn agents, chain requests into a Test Scenario to validate behavior end to end.
