# How to Run Qwen3.6-35B on Your Mac at 77 tok/s

Level: intermediate

Estimated time: 20-40 minutes (most of it is the model download)

Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4) and 48 GB of unified RAM


What are we setting up?

A local server compatible with the OpenAI API that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple's machine learning framework for Apple Silicon. When you're done, you'll have an endpoint at http://127.0.0.1:7979 that you can point any OpenAI-compatible client at (OpenCode, Continue, Cursor, etc.).

| Metric | Measured value |
| --- | --- |
| Generation throughput | ~77 tok/s |
| TTFT (time to first token) | ~0.25 s |
| Context window | 65,536 – 131,072 tokens |
| RAM required | ~20 GB model + ~12 GB KV cache |

Prerequisites

Hardware

- Mac with an Apple Silicon chip (M1 Pro/Max/Ultra or M2/M3/M4 equivalents)
- Minimum 48 GB of unified RAM (the quantized model takes ~20 GB; the KV cache needs up to ~12 GB more). A quick way to check is shown right after this list.
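
If you're not sure how much unified memory your Mac has, here is a minimal sketch (it relies on `sysctl hw.memsize`, the standard macOS way to read total RAM):

```python
# check_ram.py - verify the Mac has enough unified memory for this setup (macOS only)
import subprocess

mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
mem_gb = mem_bytes / 1024 ** 3
print(f"Unified RAM: {mem_gb:.0f} GB")
if mem_gb < 48:
    print("Below the 48 GB this guide assumes; the model may not fit comfortably.")
```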

Software

```bash
# Check Python version (you need 3.11+)
python3 --version

# Check that you have git
git --version
```

If you don't have Python 3.11, install it with Homebrew:

```bash
brew install python@3.11
```

Step 1 — Create the virtual environment

From the folder where you want to install everything:

```bash
mkdir mlx-server && cd mlx-server
python3.11 -m venv .venv
source .venv/bin/activate
```

Step 2 — Install dependencies

```bash
pip install --upgrade pip

# MLX and the OpenAI API-compatible server
pip install mlx-lm
pip install mlx-openai-server
```

Verify the installation:

```bash
mlx-openai-server --help
```

Step 3 — Download the model

The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.

```bash
# Optional pre-download (recommended to track progress)
python3 -c "
from mlx_lm import load
model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
print('Model downloaded successfully')
"
```

Note: Some Hugging Face repositories require an account and accepting the model's terms before you can download; this model's repository does not, so no login is needed.
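
If you'd rather fetch the weights without also loading them into RAM, here is a minimal sketch using huggingface_hub (assumption: it is already installed as a dependency of mlx-lm; it downloads into the standard Hugging Face cache):

```python
# prefetch.py - download the model files to the local Hugging Face cache
# without loading them into memory (a progress bar is shown per file).
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="mlx-community/Qwen3.6-35B-A3B-4bit")
print(f"Model files cached at: {path}")
```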


Step 4 — Start the server

Option A — Direct command (simpler)

```bash
mlx-openai-server launch \
  --model-path mlx-community/Qwen3.6-35B-A3B-4bit \
  --model-type lm \
  --host 127.0.0.1 \
  --port 7979 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3_5 \
  --enable-auto-tool-choice \
  --context-length 65536 \
  --temperature 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --repetition-penalty 1.05 \
  --max-bytes 12884901888 \
  --prompt-cache-size 3 \
  --log-level INFO
```

Option B — Startup script (recommended)

Save the following script as start-mlx-server.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="$SCRIPT_DIR/.venv"

# Default profile: high_context
# Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
PROFILE="${MLX_PROFILE:-high_context}"

MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
HOST="127.0.0.1"
PORT="7979"

TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3_5"

TEMPERATURE="0.7"
TOP_P="0.8"
TOP_K="20"
MIN_P="0.0"
REPETITION_PENALTY="1.05"
MAX_CACHE_BYTES="12884901888"  # 12 GB

DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"

case "$PROFILE" in
    baseline)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS=""
        ;;
    high_context)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS=""
        ;;
    speculative)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    speculative_high)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    *)
        echo "Unknown PROFILE: $PROFILE"
        echo "Options: baseline, high_context, speculative, speculative_high"
        exit 1
        ;;
esac

exec "$VENV/bin/mlx-openai-server" launch \
    --model-path "$MODEL_PATH" \
    --model-type lm \
    --host "$HOST" \
    --port "$PORT" \
    --tool-call-parser "$TOOL_CALL_PARSER" \
    --reasoning-parser "$REASONING_PARSER" \
    --enable-auto-tool-choice \
    --context-length "$CONTEXT_LENGTH" \
    --temperature "$TEMPERATURE" \
    --top-p "$TOP_P" \
    --top-k "$TOP_K" \
    --min-p "$MIN_P" \
    --repetition-penalty "$REPETITION_PENALTY" \
    --max-bytes "$MAX_CACHE_BYTES" \
    --prompt-cache-size "$PROMPT_CACHE_SIZE" \
    --log-level INFO \
    $EXTRA_ARGS
```

Make the script executable and run it:

```bash
chmod +x start-mlx-server.sh
./start-mlx-server.sh
```

Usage examples:

```bash
./start-mlx-server.sh                                      # high_context (default)
MLX_PROFILE=baseline ./start-mlx-server.sh                 # maximum throughput
MLX_PROFILE=speculative ./start-mlx-server.sh              # speculative decoding
MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh
```

Step 5 — Verify it works

In another terminal, send a test request:

```bash
curl http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
    "max_tokens": 100
  }'
```

You should see a JSON response with the choices[0].message.content field.
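
If you want to see how close your Mac gets to the throughput numbers quoted at the top, here is a minimal sketch using only the standard library. It assumes the server includes the standard OpenAI usage block in its responses; if it doesn't, count tokens another way:

```python
# benchmark_local.py - rough end-to-end throughput check against the local server
import json
import time
import urllib.request

payload = {
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    "max_tokens": 200,
}
req = urllib.request.Request(
    "http://127.0.0.1:7979/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
elapsed = time.perf_counter() - start

# "usage" is part of the standard OpenAI response schema (assumption: the
# server fills it in).
tokens = body["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f} s -> {tokens / elapsed:.1f} tok/s "
      "(includes prompt processing, so it reads slightly below the pure generation rate)")
```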


Stopping the server

```bash
pkill -f mlx-openai-server
```

Or if you have the stop-mlx-server.sh script:

```bash
#!/usr/bin/env bash
pkill -f mlx-openai-server && echo "Server stopped."
```

Connect with your favorite client

The server exposes a 100% OpenAI-compatible API. Just point the base_url to your local server.

OpenCode

Create or edit the opencode.json file in the root of your project:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX Local (Qwen3.6-35B)",
      "options": {
        "baseURL": "http://127.0.0.1:7979/v1"
      },
      "models": {
        "mlx-community/Qwen3.6-35B-A3B-4bit": {
          "name": "Qwen3.6-35B-A3B-4bit (local MLX)",
          "limit": {
            "context": 65536,
            "output": 32768
          }
        }
      }
    }
  }
}
```

Continue / Cursor

```
Base URL: http://127.0.0.1:7979/v1
API Key:  any-value  (the server does not validate it)
Model:    mlx-community/Qwen3.6-35B-A3B-4bit
```

Python (openai SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:7979/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Explain what a transformer is"}]
)
print(response.choices[0].message.content)
```
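
Streaming works through the same endpoint. A minimal sketch with the openai SDK, assuming the server accepts stream=True as OpenAI-compatible servers typically do:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:7979/v1", api_key="local")

# Print tokens as they arrive instead of waiting for the full response.
stream = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Summarize what MLX is in two sentences"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```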

Configuration profiles

| Profile | Context | Cache | tok/s measured | When to use |
| --- | --- | --- | --- | --- |
| baseline | 65,536 | 3 entries | 77.4 | Maximum throughput |
| high_context | 131,072 | 5 entries | 75.7 | Long documents, extended contexts (default) |

The performance difference between the two profiles (~2%) is within the noise margin. Use high_context if you work with large files or very long conversations.


Key parameters explained

| Parameter | Value | Why it matters |
| --- | --- | --- |
| --max-bytes | 12884901888 (12 GB) | Critical. Without this limit the model's KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens |
| --prompt-cache-size | 3 (LRU entries) | Limits how many conversations the prefix cache keeps in memory |
| --context-length | 65536 (64k tokens) | Maximum context window per request |
| --temperature | 0.7 | Balance between creativity and coherence |
| --repetition-penalty | 1.05 | Reduces repetitions in long responses |
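
For reference, the magic number passed to --max-bytes is just 12 GiB expressed in bytes, so it is easy to resize for your machine:

```python
# --max-bytes is specified in bytes; 12884901888 is exactly 12 GiB.
GIB = 1024 ** 3
assert 12 * GIB == 12884901888

# Example: a 16 GiB cap instead.
print(16 * GIB)  # 17179869184
```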

Troubleshooting

The server disconnects after 30,000 tokens

This was a known bug with the Qwen3.6-35B-A3B model due to its hybrid MoE architecture. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server works correctly up to 60,000+ tokens (verified).


Architecture notes (for the curious)

Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all of its parameters for every token, it activates only a subset of "experts", which makes it efficient for its size. The 4-bit version quantizes the weights to 4 bits, reducing RAM usage from ~70 GB to ~20 GB with minimal quality loss.
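
A rough back-of-the-envelope for those numbers (a sketch; real quantized checkpoints also store quantization scales and keep some layers at higher precision, so treat the exact figures as approximations):

```python
# Approximate weight memory for a 35B-parameter model at different precisions.
params = 35e9

bf16_gb = params * 2 / 1e9    # 2 bytes per weight   -> ~70 GB
q4_gb   = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~17.5 GB

print(f"16-bit: ~{bf16_gb:.0f} GB, 4-bit: ~{q4_gb:.1f} GB")
# Quantization scales, embeddings and runtime buffers push the 4-bit
# figure up to roughly the ~20 GB quoted above.
```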

MLX leverages Apple Silicon's unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That's why a Mac with 48 GB can run a model that on a PC would require a GPU with 80 GB of VRAM.


Top comments (1)

GoDaddy LLC

This is a seriously clean setup—getting ~77 tok/s on a local 35B model with that footprint is impressive.

What stands out is how close this gets to a “personal inference stack” that can realistically replace a chunk of API usage for dev workflows. A year ago this would’ve sounded like overkill; now it’s starting to look… practical 😄

Also +1 on calling out --max-bytes—that’s one of those parameters people ignore right up until their machine politely runs out of memory and exits.

The OpenAI-compatible layer is a smart move too. Being able to swap between local and hosted models without changing client code is basically future-proofing your setup.

I like the profile approach as well—most people underestimate how much context tuning matters until they hit real-world long prompts.

That said, the 48GB RAM requirement is still the “you shall not pass” gate for a lot of devs. Apple really turned RAM into the new GPU here.

The MoE explanation is a nice touch—people see “35B” and assume it’s unusable locally, but architecture + quantization changes the game.

Also, running this locally with no API key validation feels powerful and slightly dangerous at the same time 😅

Overall, this feels less like a tutorial and more like a glimpse of where local-first AI workflows are heading.