Wanda

Posted on • Originally published at apidog.com

Qwen3.6-Plus API: Beats Claude on Terminal Benchmarks

TL;DR

Qwen3.6-Plus is now officially released. It scores 78.8% on SWE-bench Verified and 61.6% on Terminal-Bench 2.0, outperforming Claude Opus 4.5 on terminal tasks. Features include a 1M token context window, the new preserve_thinking parameter for agent loops, and seamless integration with Claude Code, OpenClaw, and Qwen Code via an OpenAI-compatible API.

Try Apidog today


From Preview to Release

If you read our previous guide on Qwen 3.6 Plus Preview on OpenRouter, you know what this model can deliver. The preview launched on March 30 with no waitlist and free OpenRouter access, processing over 400 million tokens in just two days.

The official release brings a production-ready model available via Alibaba Cloud Model Studio. Now you get a stable API, SLA-backed uptime, and a new API parameter (preserve_thinking) that improves multi-step agent workflows.

This guide covers the key changes, how to use the API, and how to test your integrations with Apidog before production deployment.


What Qwen3.6-Plus Is

Qwen3.6-Plus is a hosted mixture-of-experts model from Alibaba's Qwen team. Like the Qwen3.5 series, it uses sparse activation for efficient compute.

Key specs:

  • 1M token context window
  • Mandatory chain-of-thought reasoning
  • New preserve_thinking parameter for agentic tasks
  • Native multimodal support (vision, video, document understanding)
  • OpenAI-compatible API, Anthropic-compatible API, OpenAI Responses API

Open-source smaller variants will be available soon for self-hosted setups.


Benchmark Results

Coding Agents

Qwen3.6-Plus trails Claude Opus 4.5 on SWE-bench Verified (78.8% vs 80.9%), but leads on terminal operations.

Terminal-Bench Comparison

Terminal-Bench 2.0 evaluates real shell operations: file management, process control, and compute-heavy multi-step workflows. Qwen3.6-Plus scores 61.6%, beating Claude Opus 4.5 at 59.3%.

General Agents and Tool Use

Benchmark       Claude Opus 4.5   Qwen3.6-Plus
TAU3-Bench      70.2%             70.7%
DeepPlanning    33.9%             41.5%
MCPMark         42.3%             48.2%
MCP-Atlas       71.8%             74.1%
WideSearch      76.4%             74.3%

MCPMark tests GitHub MCP tool calls. Qwen3.6-Plus leads on key planning and tool use tasks.

Reasoning and Knowledge

Benchmark          Claude Opus 4.5   Qwen3.6-Plus
GPQA               87.0%             90.4%
LiveCodeBench v6   84.8%             87.1%
IFEval strict      90.9%             94.3%
MMLU-Pro           89.5%             88.5%

Qwen3.6-Plus leads in science reasoning and instruction-following benchmarks, key for structured agentic tasks.

Multimodal

Benchmark          Qwen3.6-Plus   Notes
OmniDocBench 1.5   91.2%          Top in table
RefCOCO avg        93.5%          Top in table
We-Math            89.0%          Top in table
CountBench         97.6%          Top in table
OSWorld-Verified   62.5%          Behind Claude Opus 4.5 (66.3%)

Qwen3.6-Plus is ahead in document, spatial, and object detection tasks, though Claude leads in desktop automation.


How to Call the API

Qwen3.6-Plus is available on Alibaba Cloud Model Studio. Get your API key at modelstudio.alibabacloud.com.

Regional Base URLs:

  • Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
  • Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1
  • US Virginia: https://dashscope-us.aliyuncs.com/compatible-mode/v1
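If you deploy to more than one region, a tiny helper keeps base URL selection in one place. A minimal sketch (REGION_BASE_URLS and pick_base_url are illustrative names of my own, not part of any SDK):

```python
# Map Model Studio regions to their OpenAI-compatible base URLs.
# These are the three endpoints listed above; the helper is illustrative.
REGION_BASE_URLS = {
    "singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "us-virginia": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",
}

def pick_base_url(region: str) -> str:
    """Return the base URL for a region, defaulting to Singapore."""
    return REGION_BASE_URLS.get(region.lower(), REGION_BASE_URLS["singapore"])
```

Pass the result as `base_url` when constructing the OpenAI client.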

Basic Call With Streaming

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=[{"role": "user", "content": "Review this Python function and find bugs."}],
    extra_body={"enable_thinking": True},
    stream=True
)

reasoning = ""   # accumulated chain-of-thought
answer = ""      # accumulated final answer
is_answering = False

for chunk in completion:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content streams first; once content arrives, the answer begins
    if getattr(delta, "reasoning_content", None) and not is_answering:
        reasoning += delta.reasoning_content
    if delta.content:
        is_answering = True
        answer += delta.content
        print(delta.content, end="", flush=True)

The preserve_thinking Parameter

The preview only retained reasoning from the current turn. The official release adds preserve_thinking.

When preserve_thinking: true is set, the model keeps chain-of-thought from all prior turns—ideal for multi-step agent workflows. Disabled by default to save tokens.

completion = client.chat.completions.create(
    model="qwen3.6-plus",
    messages=conversation_history,
    extra_body={
        "enable_thinking": True,
        "preserve_thinking": True, # keep reasoning across all turns
    },
    stream=True
)

Use Qwen3.6-Plus With Claude Code

Qwen's API supports Anthropic's protocol. Use Claude Code with Qwen3.6-Plus by setting environment variables:

npm install -g @anthropic-ai/claude-code

export ANTHROPIC_MODEL="qwen3.6-plus"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.6-plus"
export ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
export ANTHROPIC_AUTH_TOKEN=your_dashscope_api_key

claude

Use Qwen3.6-Plus With OpenClaw

OpenClaw is a self-hosted coding agent. Install and configure for Model Studio:

# Install (Node.js 22+)
curl -fsSL https://molt.bot/install.sh | bash

export DASHSCOPE_API_KEY=your_key
openclaw dashboard

Edit ~/.openclaw/openclaw.json to include:

{
  "models": {
    "providers": [{
      "name": "alibaba-coding-plan",
      "baseUrl": "https://coding-intl.dashscope.aliyuncs.com/v1",
      "apiKey": "${DASHSCOPE_API_KEY}",
      "models": [{"id": "qwen3.6-plus", "reasoning": true}]
    }]
  },
  "agents": {
    "defaults": {"models": ["qwen3.6-plus"]}
  }
}

Use Qwen3.6-Plus With Qwen Code

Qwen Code is Alibaba's open-source terminal agent. 1,000 free API calls/day with Qwen OAuth.

npm install -g @qwen-code/qwen-code@latest
qwen
# Type /auth to sign in and activate free tier

Why preserve_thinking Changes Agent Behavior

Typical LLM APIs reset reasoning each turn. For multi-step agent tasks, this causes context drift.

With preserve_thinking, the model keeps all prior reasoning visible, making decisions more consistent over complex workflows and reducing repeated reasoning (saves tokens).

Example agent loop:

conversation = []

def agent_step(user_message, preserve=True):
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="qwen3.6-plus",
        messages=conversation,
        extra_body={
            "enable_thinking": True,
            "preserve_thinking": preserve,
        },
        stream=False
    )

    message = response.choices[0].message
    conversation.append({"role": "assistant", "content": message.content})
    return message.content

# Multi-step code review agent
result = agent_step("Analyze the auth module for security issues.")
result = agent_step("Now suggest fixes for the top 3 issues you found.")
result = agent_step("Write tests that validate each fix.")

Without preserve_thinking, step 3 sees only the final answers from earlier turns, not the chain-of-thought that produced them, so the model may re-derive or lose intermediate conclusions from step 1.


What It's Best For

  • Repository-level bug fixing: SWE-bench Verified 78.8%, Pro 56.6%. Strong for automated code repair/review pipelines.
  • Terminal automation: Top performer on Terminal-Bench 2.0; ideal for shell-heavy workflows and build pipelines.
  • MCP tool calling: MCPMark at 48.2%—best for MCP-based integrations.
  • Long-context document/code analysis: 1M token window handles codebase reviews and large documents.
  • Frontend code generation: Nearly tied with Claude Opus 4.5 for frontend tasks (QwenWebBench 1501.7 vs 1517.9).
  • Multilingual scenarios: WMT24++ at 84.3%, MAXIFE at 88.2% across 23 languages.
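For long-context work, it helps to sanity-check input size before sending a request. A crude sketch, assuming roughly 4 characters per token for English text (a real tokenizer will differ, so treat this only as a pre-flight estimate):

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English.
    This is a heuristic, not the model's actual tokenizer."""
    return max(1, len(text) // 4)

def fits_in_context(docs, limit=1_000_000):
    """Check whether a set of documents plausibly fits the 1M-token window."""
    return sum(rough_token_count(d) for d in docs) <= limit
```

Run the check before concatenating a codebase or document set into one prompt, and fall back to chunking when it fails.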

Testing Qwen3.6-Plus API Calls With Apidog

The endpoint is OpenAI-compatible. Import it into Apidog and test like any other API.

Apidog Testing

  • POST to https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions
  • API key: Authorization: Bearer {{DASHSCOPE_API_KEY}}
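When building the request body in Apidog, the raw JSON can be sketched as follows. build_chat_payload is my own helper, and I am assuming enable_thinking and preserve_thinking sit at the top level of the body, which is where the Python SDK merges extra_body fields:

```python
import json

def build_chat_payload(prompt: str, preserve: bool = False) -> str:
    """Build the JSON body for POST /chat/completions (sketch).
    Field placement of the thinking flags is an assumption based on
    how the OpenAI SDK merges extra_body into the request."""
    body = {
        "model": "qwen3.6-plus",
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": True,
    }
    if preserve:
        body["preserve_thinking"] = True
    return json.dumps(body)
```

Paste the output into the Apidog request body, or use it directly from a script.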

Sample response assertions:

pm.test("Response contains choices", () => {
  const body = pm.response.json();
  pm.expect(body).to.have.property("choices");
  pm.expect(body.choices[0].message.content).to.be.a("string").and.not.empty;
});

pm.test("No empty reasoning when thinking enabled", () => {
  const choice = pm.response.json().choices[0];
  if (choice.message.reasoning_content !== undefined) {
    pm.expect(choice.message.reasoning_content).to.not.be.empty;
  }
});
  • Use Smart Mock in Apidog to generate test responses without hitting the live API.
  • For multi-turn agents, create a Test Scenario chaining requests. Validate that preserve_thinking carries reasoning across turns before production.
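The same multi-turn validation can also be scripted outside Apidog. A minimal sketch, where FakeClient is a stand-in stub (swap in the real OpenAI client pointed at the Dashscope base URL for a live run):

```python
class FakeClient:
    """Stub that returns canned replies; replaces the live API in tests."""
    def __init__(self, canned):
        self.canned = iter(canned)

    def chat(self, messages):
        return next(self.canned)

def run_scenario(client, turns):
    """Replay user turns in order, asserting each reply is non-empty."""
    history = []
    replies = []
    for turn in turns:
        history.append({"role": "user", "content": turn})
        reply = client.chat(history)
        assert reply, "empty reply at turn: " + turn
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

This mirrors an Apidog Test Scenario in plain Python, which is handy for CI regression runs.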

Download Apidog free to start testing.


What's Coming Next

Smaller open-source variants will be released soon, following the Qwen3.5 pattern (sparse MoE, Apache 2.0 weights).

Roadmap:

  • Longer-horizon, repository-level tasks (complex, multi-file problem solving)
  • Multimodal agent development, including GUI agents and visual coding

Qwen3.5 open-source models quickly became a default for self-hosted coding agents. Expect the same for Qwen3.6 variants.


Conclusion

Qwen3.6-Plus closes the gap with Claude Opus 4.5 on coding and leads in terminal, MCP tool use, and planning. With a 1M token context, Anthropic protocol support, and the new preserve_thinking parameter, it's ready for production agentic systems.

The official API brings stability, SLA coverage, and reliable agent-focused workflows.

Apidog simplifies testing: import the endpoint, add assertions, use mocks, and run regression tests as you update your model or API version.


FAQ

What is the difference between Qwen3.6-Plus and the preview?

The preview (qwen/qwen3.6-plus-preview) launched on OpenRouter on March 30, 2026. The official release adds the preserve_thinking parameter, SLA-backed uptime, and full Model Studio support. Smaller open-source variants are also coming.

What is preserve_thinking and when should I use it?

By default, only current-turn reasoning is kept. Set preserve_thinking: true to retain reasoning from all previous turns. Use for multi-step agent loops where past reasoning should inform next actions.

How does Qwen3.6-Plus compare to Claude Opus 4.5?

Claude leads on SWE-bench Verified (80.9% vs 78.8%) and OSWorld-Verified (66.3% vs 62.5%). Qwen3.6-Plus leads on Terminal-Bench 2.0 (61.6% vs 59.3%), MCPMark (48.2% vs 42.3%), DeepPlanning (41.5% vs 33.9%), and GPQA (90.4% vs 87.0%).

Can I use Qwen3.6-Plus with Claude Code?

Yes. Set ANTHROPIC_BASE_URL to the Dashscope Anthropic-compatible endpoint, ANTHROPIC_MODEL to qwen3.6-plus, and ANTHROPIC_AUTH_TOKEN to your Dashscope API key.

Is Qwen3.6-Plus open source?

The hosted API model is not open-weight. Smaller variants with public weights will be released soon.

How do I get free access?

Install Qwen Code (npm install -g @qwen-code/qwen-code@latest), run qwen, then /auth. Sign in with Qwen Code OAuth for 1,000 free API calls/day against Qwen3.6-Plus.

What context window does it support?

1 million tokens by default. Some benchmarks used 256K for comparison, but the API default is 1M.

How do I test the API integration before deploying?

Import the endpoint into Apidog, add your API key as an environment variable, write response assertions, and use Smart Mock for offline development. Chain requests into a Test Scenario to validate multi-turn agent behavior before production deployment.
