# How to Run Qwen3.6-35B on Your Mac at 77 tok/s

Level: intermediate

Estimated time: 20-40 minutes (most of it is the model download)

Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4) and 48 GB of unified RAM


What are we setting up?

A local server compatible with the OpenAI API that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple's machine learning framework for Apple Silicon. When you're done, you'll have an endpoint at http://127.0.0.1:7979 that you can point any OpenAI-compatible client at (OpenCode, Continue, Cursor, etc.).

| Metric | Measured value |
| --- | --- |
| Generation throughput | ~77 tok/s |
| TTFT (time to first token) | ~0.25 s |
| Context window | 65,536 – 131,072 tokens |
| RAM required | ~20 GB model + ~12 GB KV cache |

Prerequisites

Hardware

- Mac with an Apple Silicon chip (M1 Pro/Max/Ultra or M2/M3/M4 equivalents)
- Minimum 48 GB of unified RAM (the quantized model takes ~20 GB; the KV cache needs up to ~12 GB more). A quick way to check is shown right after this list.
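
If you're not sure how much unified memory your Mac has, here is a minimal sketch (it relies on `sysctl hw.memsize`, the standard macOS way to read total RAM):

```python
# check_ram.py - verify the Mac has enough unified memory for this setup (macOS only)
import subprocess

mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
mem_gb = mem_bytes / 1024 ** 3
print(f"Unified RAM: {mem_gb:.0f} GB")
if mem_gb < 48:
    print("Below the 48 GB this guide assumes; the model may not fit comfortably.")
```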

Software

```bash
# Check Python version (you need 3.11+)
python3 --version

# Check that you have git
git --version
```

If you don't have Python 3.11, install it with Homebrew:

```bash
brew install python@3.11
```

Step 1 — Create the virtual environment

From the folder where you want to install everything:

```bash
mkdir mlx-server && cd mlx-server
python3.11 -m venv .venv
source .venv/bin/activate
```

Step 2 — Install dependencies

```bash
pip install --upgrade pip

# MLX and the OpenAI API-compatible server
pip install mlx-lm
pip install mlx-openai-server
```

Verify the installation:

```bash
mlx-openai-server --help
```

Step 3 — Download the model

The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.

```bash
# Optional pre-download (recommended to track progress)
python3 -c "
from mlx_lm import load
model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
print('Model downloaded successfully')
"
```

Note: Some Hugging Face repositories require an account and accepting the model's terms before you can download; this model's repository does not, so no login is needed.
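
If you'd rather fetch the weights without also loading them into RAM, here is a minimal sketch using huggingface_hub (assumption: it is already installed as a dependency of mlx-lm; it downloads into the standard Hugging Face cache):

```python
# prefetch.py - download the model files to the local Hugging Face cache
# without loading them into memory (a progress bar is shown per file).
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="mlx-community/Qwen3.6-35B-A3B-4bit")
print(f"Model files cached at: {path}")
```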


Step 4 — Start the server

Option A — Direct command (simpler)

```bash
mlx-openai-server launch \
  --model-path mlx-community/Qwen3.6-35B-A3B-4bit \
  --model-type lm \
  --host 127.0.0.1 \
  --port 7979 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3_5 \
  --enable-auto-tool-choice \
  --context-length 65536 \
  --temperature 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --repetition-penalty 1.05 \
  --max-bytes 12884901888 \
  --prompt-cache-size 3 \
  --log-level INFO
```

Option B — Startup script (recommended)

Save the following script as start-mlx-server.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="$SCRIPT_DIR/.venv"

# Default profile: high_context
# Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
PROFILE="${MLX_PROFILE:-high_context}"

MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
HOST="127.0.0.1"
PORT="7979"

TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3_5"

TEMPERATURE="0.7"
TOP_P="0.8"
TOP_K="20"
MIN_P="0.0"
REPETITION_PENALTY="1.05"
MAX_CACHE_BYTES="12884901888"  # 12 GB

DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"

case "$PROFILE" in
    baseline)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS=""
        ;;
    high_context)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS=""
        ;;
    speculative)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    speculative_high)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    *)
        echo "Unknown PROFILE: $PROFILE"
        echo "Options: baseline, high_context, speculative, speculative_high"
        exit 1
        ;;
esac

exec "$VENV/bin/mlx-openai-server" launch \
    --model-path "$MODEL_PATH" \
    --model-type lm \
    --host "$HOST" \
    --port "$PORT" \
    --tool-call-parser "$TOOL_CALL_PARSER" \
    --reasoning-parser "$REASONING_PARSER" \
    --enable-auto-tool-choice \
    --context-length "$CONTEXT_LENGTH" \
    --temperature "$TEMPERATURE" \
    --top-p "$TOP_P" \
    --top-k "$TOP_K" \
    --min-p "$MIN_P" \
    --repetition-penalty "$REPETITION_PENALTY" \
    --max-bytes "$MAX_CACHE_BYTES" \
    --prompt-cache-size "$PROMPT_CACHE_SIZE" \
    --log-level INFO \
    $EXTRA_ARGS
```

Make the script executable and run it:

```bash
chmod +x start-mlx-server.sh
./start-mlx-server.sh
```

Usage examples:

```bash
./start-mlx-server.sh                                      # high_context (default)
MLX_PROFILE=baseline ./start-mlx-server.sh                 # maximum throughput
MLX_PROFILE=speculative ./start-mlx-server.sh              # speculative decoding
MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh
```

Step 5 — Verify it works

In another terminal, send a test request:

```bash
curl http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
    "max_tokens": 100
  }'
```

You should see a JSON response with the choices[0].message.content field.
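
If you want to see how close your Mac gets to the throughput numbers quoted at the top, here is a minimal sketch using only the standard library. It assumes the server includes the standard OpenAI usage block in its responses; if it doesn't, count tokens another way:

```python
# benchmark_local.py - rough end-to-end throughput check against the local server
import json
import time
import urllib.request

payload = {
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    "max_tokens": 200,
}
req = urllib.request.Request(
    "http://127.0.0.1:7979/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

# "usage" is part of the standard OpenAI response schema (assumption: the
# server fills it in).
tokens = body["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f} s -> {tokens / elapsed:.1f} tok/s "
      "(includes prompt processing, so it reads slightly below the pure generation rate)")
```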


Stopping the server

```bash
pkill -f mlx-openai-server
```

Or if you have the stop-mlx-server.sh script:

```bash
#!/usr/bin/env bash
pkill -f mlx-openai-server && echo "Server stopped."
```

Connect with your favorite client

The server exposes a 100% OpenAI-compatible API. Just point the base_url to your local server.

OpenCode

Create or edit the opencode.json file in the root of your project:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX Local (Qwen3.6-35B)",
      "options": {
        "baseURL": "http://127.0.0.1:7979/v1"
      },
      "models": {
        "mlx-community/Qwen3.6-35B-A3B-4bit": {
          "name": "Qwen3.6-35B-A3B-4bit (local MLX)",
          "limit": {
            "context": 65536,
            "output": 32768
          }
        }
      }
    }
  }
}
```

Continue / Cursor

```
Base URL: http://127.0.0.1:7979/v1
API Key:  any-value  (the server does not validate it)
Model:    mlx-community/Qwen3.6-35B-A3B-4bit
```

Python (openai SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:7979/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Explain what a transformer is"}]
)
print(response.choices[0].message.content)
```
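
Streaming works through the same endpoint. A minimal sketch with the openai SDK, assuming the server accepts stream=True as OpenAI-compatible servers typically do:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:7979/v1", api_key="local")

# Print tokens as they arrive instead of waiting for the full response.
stream = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Summarize what MLX is in two sentences"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```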

Configuration profiles

| Profile | Context | Cache | tok/s measured | When to use |
| --- | --- | --- | --- | --- |
| baseline | 65,536 | 3 entries | 77.4 | Maximum throughput |
| high_context | 131,072 | 5 entries | 75.7 | Long documents, extended contexts (default) |

The performance difference between the two profiles (~2%) is within the noise margin. Use high_context if you work with large files or very long conversations.


Key parameters explained

| Parameter | Value | Why it matters |
| --- | --- | --- |
| --max-bytes | 12884901888 (12 GB) | Critical. Without this limit the model's KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens |
| --prompt-cache-size | 3 (LRU entries) | Limits how many conversations the prefix cache keeps in memory |
| --context-length | 65536 (64k tokens) | Maximum context window per request |
| --temperature | 0.7 | Balance between creativity and coherence |
| --repetition-penalty | 1.05 | Reduces repetitions in long responses |
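
For reference, the magic number passed to --max-bytes is just 12 GiB expressed in bytes, so it is easy to resize for your machine:

```python
# --max-bytes is specified in bytes; 12884901888 is exactly 12 GiB.
GIB = 1024 ** 3
assert 12 * GIB == 12884901888

# Example: a 16 GiB cap instead.
print(16 * GIB)  # 17179869184
```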

Troubleshooting

The server disconnects after 30,000 tokens

This was a known bug with the Qwen3.6-35B-A3B model due to its hybrid MoE architecture. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server works correctly up to 60,000+ tokens (verified).


Architecture notes (for the curious)

Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all of its parameters for every token, it activates only a subset of "experts", which makes it efficient for its size. The 4-bit version quantizes the weights to 4 bits, reducing RAM usage from ~70 GB to ~20 GB with minimal quality loss.
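
A rough back-of-the-envelope for those numbers (a sketch; real quantized checkpoints also store quantization scales and keep some layers at higher precision, so treat the exact figures as approximations):

```python
# Approximate weight memory for a 35B-parameter model at different precisions.
params = 35e9

bf16_gb = params * 2 / 1e9    # 2 bytes per weight   -> ~70 GB
q4_gb   = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~17.5 GB

print(f"16-bit: ~{bf16_gb:.0f} GB, 4-bit: ~{q4_gb:.1f} GB")
# Quantization scales, embeddings and runtime buffers push the 4-bit
# figure up to roughly the ~20 GB quoted above.
```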

MLX leverages Apple Silicon's unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That's why a Mac with 48 GB can run a model that on a PC would require a GPU with 80 GB of VRAM.


Top comments (1)

GoDaddy LLC

This is a seriously clean setup—getting ~77 tok/s on a local 35B model with that footprint is impressive.

What stands out is how close this gets to a “personal inference stack” that can realistically replace a chunk of API usage for dev workflows. A year ago this would’ve sounded like overkill; now it’s starting to look… practical 😄

Also +1 on calling out --max-bytes—that’s one of those parameters people ignore right up until their machine politely runs out of memory and exits.

The OpenAI-compatible layer is a smart move too. Being able to swap between local and hosted models without changing client code is basically future-proofing your setup.

I like the profile approach as well—most people underestimate how much context tuning matters until they hit real-world long prompts.

That said, the 48GB RAM requirement is still the “you shall not pass” gate for a lot of devs. Apple really turned RAM into the new GPU here.

The MoE explanation is a nice touch—people see “35B” and assume it’s unusable locally, but architecture + quantization changes the game.

Also, running this locally with no API key validation feels powerful and slightly dangerous at the same time 😅

Overall, this feels less like a tutorial and more like a glimpse of where local-first AI workflows are heading.