Lavelle Hatcher Jr
Serving Qwen3.6-35B-A3B With vLLM and Building a Coding Agent With Tool Calling


Alibaba's Qwen team released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. It is a sparse mixture-of-experts model with 35 billion total parameters but only about 3 billion active per token. It scores 73.4% on SWE-bench Verified and 37.0 on MCPMark, which makes it one of the strongest open-weight models for agentic coding right now.

This post walks through serving it locally with vLLM, calling it from Python with the OpenAI SDK, and wiring up tool calling so the model can act as a coding agent.

Note: This is a personal summary based on publicly available information, not the official view of any company.

Prerequisites

  • vLLM 0.19.0 or later (required for Qwen3.6 architecture support)
  • NVIDIA GPU (RTX 4090 24GB works for single-GPU, multi-GPU for larger context)
  • Python 3.12
  • The model downloads automatically from Hugging Face on first launch

Install vLLM
python -m venv qwen36-env
source qwen36-env/bin/activate
pip install "vllm>=0.19.0"

Older vLLM versions do not support the Qwen3.6 MoE architecture. If you hit errors about Qwen3MoeSparseMoeBlock, your vLLM is too old.

Start the vLLM Server

Basic (inference only)

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3

The --reasoning-parser qwen3 flag tells vLLM how to separate the model's thinking output — the internal reasoning steps it generates before its final answer — from the answer itself. Thinking mode improves accuracy on coding tasks.

On a single RTX 4090, keep --max-model-len at 32768 or 65536. The full 262,144 context will OOM.

With tool calling

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

--tool-call-parser qwen3_coder is mandatory. Without it, the model generates tool call JSON but vLLM will not parse it into structured tool_calls objects. This is the most common setup mistake and it fails silently.

Multi-GPU (example: 4 GPUs)

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Once running, the OpenAI-compatible API is available at http://localhost:8000/v1.

Basic Chat From Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM local does not require a real key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Write a Python generator for the Fibonacci sequence."}
    ],
    temperature=0.7,
    max_tokens=2048,
)

print(response.choices[0].message.content)

Since vLLM exposes an OpenAI-compatible endpoint, the standard OpenAI SDK works directly. Swapping from gpt-4o to a local Qwen3.6 is a one-line base_url change.
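
With the reasoning parser enabled, vLLM returns the thinking trace separately from the final answer, conventionally as a `reasoning_content` attribute on the message alongside `content`. A small defensive accessor (a sketch — the attribute name follows vLLM's reasoning-parser convention and is worth verifying against your vLLM version; the function name is mine):

```python
def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    vLLM's reasoning parser puts the thinking trace in `reasoning_content`
    (assumed attribute name) and the final answer in `content`. getattr
    keeps this working against servers that return no reasoning field.
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer
```

To print only the final answer: `_, answer = split_reasoning(response.choices[0].message)`.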

Tool Calling (Function Calling)

This is where it gets interesting. Qwen3.6-35B-A3B was explicitly trained on tool-use patterns, scoring 37.0 on MCPMark compared to 18.1 for Gemma 4-31B.

import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_files",
            "description": "Search for files in the project by keyword",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search keyword"
                    },
                    "file_extension": {
                        "type": "string",
                        "description": "File extension filter (e.g. .py, .ts)"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    }
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file at a given path",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "File path"
                    },
                    "content": {
                        "type": "string",
                        "description": "Content to write"
                    }
                },
                "required": ["path", "content"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {
            "role": "system",
            "content": "You are a coding agent. Search, read, and modify files to complete the user's request."
        },
        {
            "role": "user",
            "content": "Find all Python files that handle database connections and change the pool size from 5 to 20."
        }
    ],
    tools=tools,
    tool_choice="auto",
    temperature=1.0,  # recommended for thinking mode
    max_tokens=4096,
)

message = response.choices[0].message

if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
        print("---")
else:
    print(message.content)
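
Note that `call.function.arguments` is a JSON string, and models occasionally emit arguments that are not valid JSON — a bare `json.loads` then raises mid-loop. A small guard (a sketch; the helper name and `_parse_error` key are mine) that returns an error dict the agent can feed back to the model instead of crashing:

```python
import json


def parse_tool_args(raw: str) -> dict:
    """Parse a tool call's arguments string, tolerating malformed JSON.

    Returns {"_parse_error": ...} instead of raising, so an agent loop
    can report the failure back to the model as a tool result.
    """
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"_parse_error": f"invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"_parse_error": f"expected a JSON object, got {type(args).__name__}"}
    return args
```

Swap this in for the direct `json.loads` call wherever tool calls are handled.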

Agent Loop (Multi-Turn Tool Calling)

A real coding agent needs to call a tool, feed the result back to the model, then let it decide the next action. Here is a minimal loop:

def run_agent(user_request: str, tools: list, max_steps: int = 10):
    messages = [
        {
            "role": "system",
            "content": "You are a coding agent. Use the available tools to complete the task."
        },
        {"role": "user", "content": user_request}
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="Qwen/Qwen3.6-35B-A3B",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=1.0,
            max_tokens=4096,
        )

        assistant_message = response.choices[0].message
        messages.append(assistant_message)

        if not assistant_message.tool_calls:
            print(f"[Done] {assistant_message.content}")
            return assistant_message.content

        for call in assistant_message.tool_calls:
            tool_name = call.function.name
            tool_args = json.loads(call.function.arguments)

            print(f"[Step {step+1}] {tool_name}({tool_args})")

            result = execute_tool(tool_name, tool_args)

            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })

    return "Reached maximum steps"


def execute_tool(name: str, args: dict) -> dict:
    """Replace with real file system operations."""
    if name == "search_files":
        return {
            "files": [
                {"path": "src/db/connection.py", "match": "pool_size=5"},
                {"path": "src/db/config.py", "match": "POOL_SIZE = 5"},
            ]
        }
    elif name == "read_file":
        return {"content": f"# Contents of {args['path']}"}
    elif name == "write_file":
        return {"status": "ok", "path": args["path"]}
    return {"error": "unknown tool"}

Replace execute_tool with real file system calls and you have a working local coding agent.
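
If you do wire read_file/write_file to the real file system, confine them to the project root first — a misfired tool call can otherwise touch arbitrary paths. One way to do it with the standard library (a sketch; the function name is mine):

```python
import os


def resolve_in_root(root: str, path: str) -> str:
    """Resolve `path` relative to `root`, rejecting escapes like '../../etc'.

    Returns the absolute path if it stays inside root, else raises ValueError.
    """
    root_abs = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_abs, path))
    if candidate != root_abs and not candidate.startswith(root_abs + os.sep):
        raise ValueError(f"path escapes project root: {path}")
    return candidate
```

Call it at the top of every file tool, e.g. `open(resolve_in_root(PROJECT_ROOT, args["path"]))`.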

Thinking Mode

Qwen3.6 has two inference modes:

| Mode | Temperature | Use case |
| --- | --- | --- |
| Thinking (recommended) | 1.0 | Complex coding, debugging, design decisions |
| Non-thinking | 0.7 | Simple completions, quick answers |

When --reasoning-parser qwen3 is set at server startup, thinking mode is on by default.
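
vLLM's OpenAI-compatible endpoint accepts non-standard request fields through the SDK's extra_body escape hatch, and Qwen3-series chat templates expose an enable_thinking switch. Whether Qwen3.6 keeps the same kwarg is an assumption — check the model card. A per-request toggle would look like:

```python
# Assumed Qwen3-style chat template kwarg -- verify against the model card.
non_thinking = {"chat_template_kwargs": {"enable_thinking": False}}

# Passed through the OpenAI SDK like so (requires the running server):
# response = client.chat.completions.create(
#     model="Qwen/Qwen3.6-35B-A3B",
#     messages=[{"role": "user", "content": "Quick one-liner, no reasoning."}],
#     temperature=0.7,  # non-thinking temperature
#     extra_body=non_thinking,
# )
```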

Hardware Guide

| Setup | max-model-len | Notes |
| --- | --- | --- |
| RTX 4090 (24GB) × 1 | 32,768 | Handles most coding tasks |
| RTX 4090 × 2 (TP=2) | 65,536 | Enough for repo-wide context |
| A100 80GB × 1 | 131,072 | Comfortable single-GPU setup |
| H100 × 4 (TP=4) | 262,144 | Full context, production use |

The FP8 variant (Qwen/Qwen3.6-35B-A3B-FP8) uses less VRAM with nearly identical performance.
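
Serving the FP8 build only changes the model id; the flags from the single-GPU tool-calling setup above carry over unchanged:

```shell
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```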

Key Takeaways

  • vLLM 0.19.0+ serves Qwen3.6-35B-A3B as an OpenAI-compatible API at localhost:8000/v1
  • Tool calling requires --enable-auto-tool-choice --tool-call-parser qwen3_coder at startup — without it, tool calls silently fail
  • The standard OpenAI Python SDK works directly, so switching from a cloud API to local inference is a one-line change
  • Use temperature=1.0 with thinking mode for best coding accuracy
  • Apache 2.0 license, free for commercial use

The model is three days old, so tool calling stability is still being validated by the community. Test on your own workloads before shipping to production.

References

  • Qwen/Qwen3.6-35B-A3B - Hugging Face
  • Qwen3.5 & Qwen3.6 Usage Guide - vLLM Recipes
  • QwenLM/Qwen3.6 - GitHub
