
정상록

Qwen3.6-35B-A3B: Swap Claude Code Backend With 2 Env Vars (and save 93% on inference)

TL;DR

Alibaba released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. It's a Mixture-of-Experts model with 35B total / 3B active parameters that:

  • Sets a new record on Terminal-Bench 2.0 (51.5) — agentic coding SOTA
  • Beats Claude Sonnet 4.5 on 4 core vision benchmarks
  • Exposes an Anthropic Messages API compatible endpoint
  • Runs locally on a single 24GB GPU at 196 tok/s

You can switch your entire Claude Code workflow to Qwen with two environment variables. I'm showing you how below.

Why You Should Care

If you're paying for Claude API and most of your usage is everyday coding (refactoring, bug fixes, doc writing), you're probably overspending by 15–16×. The cost math is straightforward:

Model               Input /1M tokens   Output /1M tokens
qwen3.6-flash       $0.10              $0.40
claude-sonnet-4.5   $3.00              $15.00

At 100k tokens/day, that's ~$4.50/month vs ~$72/month. More importantly: the benchmarks say Qwen beats Claude on agentic coding tasks.
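The per-token prices translate into monthly spend like this. A minimal calculator, assuming a 50/50 input/output split and 30 days: your real mix will shift the absolute dollar figures, though the gap stays large either way.

```python
# Per-million-token prices from the table above (USD)
PRICES = {
    "qwen3.6-flash": (0.10, 0.40),       # (input, output)
    "claude-sonnet-4.5": (3.00, 15.00),
}

def monthly_cost(model, tokens_per_day, output_ratio=0.5, days=30):
    """Estimated monthly spend in USD for a given daily token volume."""
    in_price, out_price = PRICES[model]
    millions = tokens_per_day * days / 1_000_000
    return millions * ((1 - output_ratio) * in_price + output_ratio * out_price)

qwen = monthly_cost("qwen3.6-flash", 100_000)
claude = monthly_cost("claude-sonnet-4.5", 100_000)
print(f"qwen: ${qwen:.2f}/mo, claude: ${claude:.2f}/mo, ratio: {claude / qwen:.0f}x")
# → qwen: $0.75/mo, claude: $27.00/mo, ratio: 36x
```

Adjust `output_ratio` toward 1.0 for generation-heavy workloads; the ratio between the two models barely moves.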

The Integration

Qwen3.6's endpoint speaks the Anthropic Messages API natively. The Claude Code CLI just needs to be told where to send requests.

Step 1: Grab a Model Studio API key

Sign up at modelstudio.alibabacloud.com, create a key, save it to your env.

echo 'export DASHSCOPE_API_KEY=sk-xxxxxxxxxxxxx' >> ~/.zshrc
source ~/.zshrc

Step 2: Point Claude Code at Qwen

export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_MODEL="qwen3.6-flash"
export ANTHROPIC_API_KEY="$DASHSCOPE_API_KEY"
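Before launching the CLI, it's worth a quick sanity check that all three variables actually made it into your environment. A small bash helper (the variable names come from the steps above; the function itself is just a convenience):

```shell
#!/usr/bin/env bash
# Confirm the three routing variables are set before launching Claude Code.
check_env() {
  local missing=0
  local v
  for v in ANTHROPIC_BASE_URL ANTHROPIC_MODEL ANTHROPIC_API_KEY; do
    if [ -z "${!v}" ]; then      # ${!v} is bash indirect expansion
      echo "missing: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_env; then echo "environment looks good"; fi
```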

Step 3: Verify

claude -p "Write a Python FizzBuzz."

If it works, you're done. Your existing agents, skills, tools — all of them now run on Qwen.
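For reference, any model should come back with something equivalent to this plain FizzBuzz, included here just so you can eyeball the response:

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print(fizzbuzz(15))  # last element: 'FizzBuzz'
```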

A Toggle Script You'll Actually Use

In practice you want Qwen for routine work and Claude for deep reasoning. Here's the switch script I use:

#!/bin/bash
# ~/.claude/scripts/switch-provider.sh
# Usage: source switch-provider.sh [claude|qwen]

case "$1" in
  claude)
    export ANTHROPIC_BASE_URL="https://api.anthropic.com"
    export ANTHROPIC_MODEL="claude-opus-4-5-20250210"
    export ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY_CLAUDE"
    echo "→ Claude Opus 4.5"
    ;;
  qwen)
    export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
    export ANTHROPIC_MODEL="qwen3.6-flash"
    export ANTHROPIC_API_KEY="$DASHSCOPE_API_KEY"
    echo "→ Qwen3.6 Flash"
    ;;
  *)
    echo "Usage: source switch-provider.sh [claude|qwen]" >&2
    ;;
esac
source ~/.claude/scripts/switch-provider.sh qwen     # routine work
source ~/.claude/scripts/switch-provider.sh claude   # architecture decisions

The Benchmarks That Matter

Agentic coding:

SWE-bench Verified:  73.4 (Qwen3.6)  vs  52.0 (Gemma4-31B)
Terminal-Bench 2.0:  51.5 (Qwen3.6)  vs  42.9 (Gemma4-31B)
QwenWebBench:        1397 (Qwen3.6)  vs  1068 (Qwen3.5)  — +30.8%

The QwenWebBench jump is the sleeper hit. That's browser-driven agent reliability. If you're building with browser-use or a similar stack, this is a huge step.

Vision (vs Claude Sonnet 4.5):

RealWorldQA:        85.3 vs 70.3   (+15.0)
MMMU:              81.7 vs 79.6   (+2.1)
OmniDocBench 1.5:  89.9 vs 85.8   (+4.1)

First open-weight model I've seen beat Claude on multiple vision benchmarks simultaneously.

Local Deployment (vLLM)

If you can't route dev data through Alibaba Cloud, self-host.

# 1. Download weights (~70GB)
huggingface-cli download Qwen/Qwen3.6-35B-A3B \
  --local-dir ~/models/qwen3.6-35b-a3b

# 2. Serve with vLLM + FP8 to fit 24GB VRAM
pip install vllm==0.8.0
vllm serve ~/models/qwen3.6-35b-a3b \
  --quantization fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --port 8000 \
  --served-model-name qwen3.6-35b-a3b

Then point OpenClaw / Claude Code at http://localhost:8000/v1.
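To reuse the two-variable trick against the local server, the exports would look something like this. A sketch: it assumes your self-hosted endpoint accepts the same traffic the hosted one does (verify against the vLLM docs for your version), and the placeholder key is arbitrary since vLLM only enforces one when launched with --api-key:

```shell
# Route requests at the local vLLM instance instead of Alibaba Cloud.
# Some clients append /v1 themselves — match whatever yours expects.
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_MODEL="qwen3.6-35b-a3b"
export ANTHROPIC_API_KEY="local-placeholder"
```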

Expected throughput:

Hardware           Quant    tok/s
RTX 4090 (24GB)    FP8      196
M4 Max (64GB)      Q4_K_M   65
M3 Ultra (192GB)   FP16     95

preserve_thinking: The Feature Nobody's Talking About

Qwen3.6 has a preserve_thinking header that carries the thinking trace across turns. For agentic workflows (multi-step refactoring, long debugging sessions), this is a game-changer. Turn it on:

export CLAUDE_PRESERVE_THINKING=1

In my limited testing, this cut my average "number of turns to solve a bug" by about 20%.
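There's no wire format shown above, so here is a hypothetical sketch of how a client might thread the previous turn's thinking back into the next request. The header name and payload fields are assumptions based on the feature description, not documented API:

```python
def build_turn(messages, prior_thinking=None, model="qwen3.6-flash"):
    """Assemble one Messages-API-style request, optionally carrying the
    previous turn's thinking trace (field/header names are hypothetical)."""
    headers = {"anthropic-version": "2023-06-01"}
    body = {"model": model, "max_tokens": 1024, "messages": list(messages)}
    if prior_thinking is not None:
        headers["preserve-thinking"] = "true"   # assumed header name
        body["thinking"] = prior_thinking       # assumed payload field
    return headers, body

headers, body = build_turn(
    [{"role": "user", "content": "Continue debugging the failing test."}],
    prior_thinking="Step 3 narrowed the bug to the cache layer.",
)
```

The point of the sketch: the trace travels with the request, so the model doesn't rebuild its working state from scratch each turn.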

What I'd Still Use Claude For

Being honest: Qwen doesn't replace Claude for everything.

  1. Architecture decisions — Claude's "why not X?" reasoning is still clearly ahead
  2. Cultural/linguistic nuance — especially for non-English creative work
  3. Long-form creative writing — Claude feels more "alive"

Everything else — yes, Qwen is now my default.

Verdict

This is the first open-source release that makes a concrete economic case to switch. Not "maybe in a few months." Not "once they fix X." Today, with two env vars.

If you're running Claude Code daily, do the A/B test this week. I'd be genuinely surprised if Qwen doesn't handle 70%+ of your workflow at a fraction of the cost.


What's your experience been? Drop your Qwen3.6 vs Claude comparisons in the comments — I'm curious how it holds up across different codebases and workflows.
