
Alibaba's Qwen team released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. It is a sparse mixture-of-experts model with 35 billion total parameters but only about 3 billion active per token. It scores 73.4% on SWE-bench Verified and 37.0 on MCPMark, which makes it one of the strongest open-weight models for agentic coding right now.
This post walks through serving it locally with vLLM, calling it from Python with the OpenAI SDK, and wiring up tool calling so the model can act as a coding agent.
Note: This is a personal summary based on publicly available information, not the official view of any company.
## Prerequisites
- vLLM 0.19.0 or later (required for Qwen3.6 architecture support)
- NVIDIA GPU (RTX 4090 24GB works for single-GPU, multi-GPU for larger context)
- Python 3.12
- The model downloads automatically from Hugging Face on first launch

## Install vLLM
```bash
python -m venv qwen36-env
source qwen36-env/bin/activate
pip install "vllm>=0.19.0"   # quote it, or the shell treats >= as a redirect
```
Older vLLM versions do not support the Qwen3.6 MoE architecture. If you hit errors mentioning `Qwen3MoeSparseMoeBlock`, your vLLM is too old.
## Start the vLLM Server
### Basic (inference only)
```bash
vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3
```
The `--reasoning-parser qwen3` flag enables thinking mode, in which the model generates internal reasoning steps before its final answer. This improves accuracy on coding tasks.
On a single RTX 4090, keep `--max-model-len` at 32768 or 65536. The full 262,144-token context will OOM.
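A back-of-the-envelope KV-cache estimate shows why. The architecture numbers below (48 layers, 4 KV heads of dimension 128, FP16 cache) are illustrative assumptions, not published Qwen3.6 specs:

```python
def kv_cache_gib(context_len: int, layers: int = 48, kv_heads: int = 4,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV-cache size in GiB: 2 (K and V) * layers * kv_heads
    * head_dim * bytes * tokens. Architecture numbers are assumptions."""
    total = 2 * layers * kv_heads * head_dim * bytes_per_elem * context_len
    return total / 1024**3

for ctx in (32_768, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB")
```

With these assumptions a single 262k-token sequence already eats ~24 GiB of KV cache on top of the weights, which is why the full context needs multiple GPUs.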
### With tool calling
```bash
vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
`--tool-call-parser qwen3_coder` is mandatory. Without it, the model still emits tool-call JSON, but as plain text: vLLM never parses it into structured `tool_calls` objects. This is the most common setup mistake, and it fails silently.
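If you are unsure whether the parser is active, a quick heuristic check can flag leaked tool-call text. This is a sketch; the `<tool_call>` tag is the Qwen3-style wrapper and may differ for Qwen3.6:

```python
import json

def looks_like_leaked_tool_call(message) -> bool:
    """Heuristic: no structured tool_calls, but the text content looks like
    a tool-call payload (a sign the --tool-call-parser flag is missing)."""
    if getattr(message, "tool_calls", None):
        return False  # parsed correctly into structured calls
    text = getattr(message, "content", None) or ""
    if "<tool_call>" in text:  # Qwen-style XML wrapper leaking through
        return True
    try:
        payload = json.loads(text)
    except ValueError:
        return False
    return isinstance(payload, dict) and "name" in payload and "arguments" in payload
```

Run it on a response to a prompt that should trigger a tool call; if it returns `True`, restart the server with the parser flags.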
### Multi-GPU (example: 4 GPUs)
```bash
vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Once running, the OpenAI-compatible API is available at `http://localhost:8000/v1`.
## Basic Chat From Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # local vLLM does not require a real key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Write a Python generator for the Fibonacci sequence."}
    ],
    temperature=0.7,
    max_tokens=2048,
)

print(response.choices[0].message.content)
```
Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI SDK works directly. Swapping from `gpt-4o` to a local Qwen3.6 is a one-line `base_url` change.
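With the reasoning parser enabled, recent vLLM versions return the thinking trace in a separate `reasoning_content` field on the message rather than mixing it into `content`. The field name here follows vLLM's reasoning-parser convention; verify it against your version. A small helper keeps the two apart:

```python
def split_reasoning(message):
    """Separate the thinking trace from the final answer.
    `reasoning_content` is vLLM's reasoning-parser field (an assumption
    here; confirm it exists in your vLLM version)."""
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning, answer
```

Usage: `thoughts, answer = split_reasoning(response.choices[0].message)`, then log `thoughts` for debugging and show only `answer` to users.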
## Tool Calling (Function Calling)
This is where it gets interesting. Qwen3.6-35B-A3B was explicitly trained on tool-use patterns, scoring 37.0 on MCPMark compared to 18.1 for Gemma 4-31B.
```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search for files in the project by keyword",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search keyword"
                    },
                    "file_extension": {
                        "type": "string",
                        "description": "File extension filter (e.g. .py, .ts)"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    },
                    "content": {
                        "type": "string",
                        "description": "Content to write"
                    }
                },
                "required": ["path", "content"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {
            "role": "system",
            "content": "You are a coding agent. Search, read, and modify files to complete the user's request."
        },
        {
            "role": "user",
            "content": "Find all Python files that handle database connections and change the pool size from 5 to 20."
        }
    ],
    tools=tools,
    tool_choice="auto",
    temperature=1.0,  # recommended for thinking mode
    max_tokens=4096,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
        print("---")
else:
    print(message.content)
```
## Agent Loop (Multi-Turn Tool Calling)
A real coding agent needs to call a tool, feed the result back to the model, then let it decide the next action. Here is a minimal loop:
```python
def run_agent(user_request: str, tools: list, max_steps: int = 10):
    messages = [
        {
            "role": "system",
            "content": "You are a coding agent. Use the available tools to complete the task."
        },
        {"role": "user", "content": user_request},
    ]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-35B-A3B",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=1.0,
            max_tokens=4096,
        )
        assistant_message = response.choices[0].message
        messages.append(assistant_message)

        # No tool calls means the model considers the task finished
        if not assistant_message.tool_calls:
            print(f"[Done] {assistant_message.content}")
            return assistant_message.content

        # Execute each requested tool and feed the result back
        for call in assistant_message.tool_calls:
            tool_name = call.function.name
            tool_args = json.loads(call.function.arguments)
            print(f"[Step {step + 1}] {tool_name}({tool_args})")
            result = execute_tool(tool_name, tool_args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "Reached maximum steps"


def execute_tool(name: str, args: dict) -> dict:
    """Replace with real file system operations."""
    if name == "search_files":
        return {
            "files": [
                {"path": "src/db/connection.py", "match": "pool_size=5"},
                {"path": "src/db/config.py", "match": "POOL_SIZE = 5"},
            ]
        }
    elif name == "read_file":
        return {"content": f"# Contents of {args['path']}"}
    elif name == "write_file":
        return {"status": "ok", "path": args["path"]}
    return {"error": "unknown tool"}
```
Replace `execute_tool` with real file system calls and you have a working local coding agent.
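As a sketch of what "real" could look like, here is a filesystem-backed `execute_tool` that confines every path to a project root. The sandboxing approach and the result cap are design choices of this example, not part of the model's tooling:

```python
import pathlib

ROOT = pathlib.Path(".").resolve()  # project root the agent is allowed to touch

def execute_tool(name: str, args: dict) -> dict:
    """Filesystem-backed tools, restricted to ROOT (a minimal sketch)."""
    if name == "search_files":
        ext = args.get("file_extension", "")
        hits = []
        for p in ROOT.rglob(f"*{ext}" if ext else "*"):
            try:
                if p.is_file() and args["query"] in p.read_text(errors="ignore"):
                    hits.append({"path": str(p.relative_to(ROOT))})
            except OSError:
                continue  # unreadable file: skip it
        return {"files": hits[:50]}  # cap results to keep the context small
    if name in ("read_file", "write_file"):
        path = (ROOT / args["path"]).resolve()
        if path != ROOT and ROOT not in path.parents:
            return {"error": "path escapes the project root"}
        if name == "read_file":
            return {"content": path.read_text()}
        path.write_text(args["content"])
        return {"status": "ok", "path": args["path"]}
    return {"error": f"unknown tool: {name}"}
```

Capping search results matters in practice: an uncapped grep over a large repo can blow past the 32k context before the agent takes its first action.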
## Thinking Mode
Qwen3.6 has two inference modes:
| Mode | Temperature | Use case |
|---|---|---|
| Thinking (recommended) | 1.0 | Complex coding, debugging, design decisions |
| Non-thinking | 0.7 | Simple completions, quick answers |
When `--reasoning-parser qwen3` is set at server startup, thinking mode is on by default.
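If the Qwen3.6 chat template follows the Qwen3 convention, thinking can also be toggled per request via `chat_template_kwargs` in `extra_body`. Both names are an assumption carried over from Qwen3; verify them against the Qwen3.6 model card:

```python
def thinking_kwargs(enable: bool) -> dict:
    """extra_body payload toggling thinking mode per request.
    chat_template_kwargs / enable_thinking follow the Qwen3 convention
    (an assumption; confirm they apply to Qwen3.6's template)."""
    return {"chat_template_kwargs": {"enable_thinking": enable}}

# Example (assumes the vLLM server from above is running):
# client.chat.completions.create(
#     model="Qwen/Qwen3.6-35B-A3B",
#     messages=[{"role": "user", "content": "Rename this variable."}],
#     temperature=0.7,  # non-thinking mode pairs with the lower temperature
#     extra_body=thinking_kwargs(False),
# )
```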
## Hardware Guide
| Setup | max-model-len | Notes |
|---|---|---|
| RTX 4090 (24GB) × 1 | 32,768 | Handles most coding tasks |
| RTX 4090 × 2 (TP=2) | 65,536 | Enough for repo-wide context |
| A100 80GB × 1 | 131,072 | Comfortable single-GPU setup |
| H100 × 4 (TP=4) | 262,144 | Full context, production use |
The FP8 variant (`Qwen/Qwen3.6-35B-A3B-FP8`) uses less VRAM with nearly identical performance.
## Key Takeaways
- vLLM 0.19.0+ serves Qwen3.6-35B-A3B as an OpenAI-compatible API at `localhost:8000/v1`
- Tool calling requires `--enable-auto-tool-choice --tool-call-parser qwen3_coder` at startup; without it, tool calls silently fail
- The standard OpenAI Python SDK works directly, so switching from a cloud API to local inference is a one-line change
- Use `temperature=1.0` with thinking mode for best coding accuracy
- Apache 2.0 license, free for commercial use

The model is three days old, so tool-calling stability is still being validated by the community. Test on your own workloads before shipping to production.
## References
- Qwen/Qwen3.6-35B-A3B - Hugging Face
- Qwen3.5 & Qwen3.6 Usage Guide - vLLM Recipes
- QwenLM/Qwen3.6 - GitHub