Level: intermediate
Estimated time: 20-40 minutes (most of it is the model download)
Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4) and 48 GB of unified RAM
What are we setting up?
A local server compatible with the OpenAI API that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple's machine learning framework for Apple Silicon. When you're done, you'll have an endpoint at http://127.0.0.1:7979 that you can point any OpenAI-compatible client to (OpenCode, Continue, Cursor, etc.).
| Metric | Measured value |
|---|---|
| Generation throughput | ~77 tok/s |
| TTFT (time-to-first-token) | ~0.25 s |
| Context window | 65,536 – 131,072 tokens |
| RAM required | ~20 GB model + ~12 GB KV cache |
Prerequisites
Hardware
- Mac with Apple Silicon chip (M1 Pro/Max/Ultra or M2/M3/M4 equivalents)
- Minimum 48 GB of unified RAM (the quantized model takes ~20 GB; the KV cache needs up to 12 GB additional)
Software
# Check Python version (you need 3.11+)
python3 --version
# Check that you have git
git --version
If you don't have Python 3.11, install it with Homebrew:
brew install python@3.11
Step 1 — Create the virtual environment
From the folder where you want to install everything:
mkdir mlx-server && cd mlx-server
python3.11 -m venv .venv
source .venv/bin/activate
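If you want to confirm the environment is active before installing anything, both of these should point inside .venv:
# Both should resolve to .venv/bin/python3
which python3
python3 -V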
Step 2 — Install dependencies
pip install --upgrade pip
# MLX and the OpenAI API-compatible server
pip install mlx-lm
pip install mlx-openai-server
Verify the installation:
mlx-openai-server --help
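As an extra sanity check, you can confirm that MLX sees the GPU from inside the venv (this just calls MLX's default_device; it should print something like Device(gpu, 0)):
# Should print the GPU device, e.g. Device(gpu, 0)
python3 -c "import mlx.core as mx; print(mx.default_device())"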
Step 3 — Download the model
The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.
# Optional pre-download (recommended to track progress)
python3 -c "
from mlx_lm import load
model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
print('Model downloaded successfully')
"
Note: some Hugging Face repositories require a huggingface.co account and accepting the model's terms before downloading; this one does not.
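To check where the weights landed and how much space they take, the following assumes the default Hugging Face cache location (~/.cache/huggingface; adjust if you set HF_HOME):
# Size of the whole Hugging Face cache
du -sh ~/.cache/huggingface/hub
# The model snapshot directory
ls ~/.cache/huggingface/hub | grep -i qwen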
Step 4 — Start the server
Option A — Direct command (simpler)
mlx-openai-server launch \
--model-path mlx-community/Qwen3.6-35B-A3B-4bit \
--model-type lm \
--host 127.0.0.1 \
--port 7979 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3_5 \
--enable-auto-tool-choice \
--context-length 65536 \
--temperature 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.0 \
--repetition-penalty 1.05 \
--max-bytes 12884901888 \
--prompt-cache-size 3 \
--log-level INFO
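Once the server logs show it is ready, you can check it is reachable from another terminal. This assumes the standard OpenAI-style GET /v1/models listing endpoint, which OpenAI-compatible servers generally expose:
# Should return a JSON list containing the loaded model id
curl -s http://127.0.0.1:7979/v1/models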
Option B — Startup script (recommended)
Save the following script as start-mlx-server.sh:
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="$SCRIPT_DIR/.venv"
# Default profile: high_context
# Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
PROFILE="${MLX_PROFILE:-high_context}"
MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
HOST="127.0.0.1"
PORT="7979"
TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3_5"
TEMPERATURE="0.7"
TOP_P="0.8"
TOP_K="20"
MIN_P="0.0"
REPETITION_PENALTY="1.05"
MAX_CACHE_BYTES="12884901888" # 12 GB
DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"
case "$PROFILE" in
baseline)
CONTEXT_LENGTH="65536"
PROMPT_CACHE_SIZE="3"
EXTRA_ARGS=""
;;
high_context)
CONTEXT_LENGTH="131072"
PROMPT_CACHE_SIZE="5"
EXTRA_ARGS=""
;;
speculative)
CONTEXT_LENGTH="65536"
PROMPT_CACHE_SIZE="3"
EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
;;
speculative_high)
CONTEXT_LENGTH="131072"
PROMPT_CACHE_SIZE="5"
EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
;;
*)
echo "Unknown PROFILE: $PROFILE"
echo "Options: baseline, high_context, speculative, speculative_high"
exit 1
;;
esac
exec "$VENV/bin/mlx-openai-server" launch \
--model-path "$MODEL_PATH" \
--model-type lm \
--host "$HOST" \
--port "$PORT" \
--tool-call-parser "$TOOL_CALL_PARSER" \
--reasoning-parser "$REASONING_PARSER" \
--enable-auto-tool-choice \
--context-length "$CONTEXT_LENGTH" \
--temperature "$TEMPERATURE" \
--top-p "$TOP_P" \
--top-k "$TOP_K" \
--min-p "$MIN_P" \
--repetition-penalty "$REPETITION_PENALTY" \
--max-bytes "$MAX_CACHE_BYTES" \
--prompt-cache-size "$PROMPT_CACHE_SIZE" \
--log-level INFO \
$EXTRA_ARGS
Make it executable and run it:
chmod +x start-mlx-server.sh
./start-mlx-server.sh
Usage examples:
./start-mlx-server.sh # high_context (default)
MLX_PROFILE=baseline ./start-mlx-server.sh # maximum throughput
MLX_PROFILE=speculative ./start-mlx-server.sh # speculative decoding
MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh
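If you want the server to survive closing the terminal, a minimal sketch with plain nohup works (nothing MLX-specific here; launchd would be the more permanent option):
# Run in the background, keep logs, remember the PID
nohup ./start-mlx-server.sh > mlx-server.log 2>&1 &
echo $! > mlx-server.pid
tail -f mlx-server.log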
Step 5 — Verify it works
In another terminal, send a test request:
curl http://127.0.0.1:7979/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3.6-35B-A3B-4bit",
"messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
"max_tokens": 100
}'
You should see a JSON response with the choices[0].message.content field.
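If you have jq installed (brew install jq), you can extract just the assistant text from the same request:
# Same request, but print only choices[0].message.content
curl -s http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'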
Stopping the server
pkill -f mlx-openai-server
Or if you have the stop-mlx-server.sh script:
#!/usr/bin/env bash
pkill -f mlx-openai-server && echo "Server stopped."
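To confirm the process is really gone and the port is free again:
# lsof exits non-zero when nothing is listening, so the echo fires
lsof -i :7979 || echo "Port 7979 is free"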
Connect with your favorite client
The server exposes a 100% OpenAI-compatible API. Just point the base_url to your local server.
OpenCode
Create or edit the opencode.json file in the root of your project:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"mlx-local": {
"npm": "@ai-sdk/openai-compatible",
"name": "MLX Local (Qwen3.6-35B)",
"options": {
"baseURL": "http://127.0.0.1:7979/v1"
},
"models": {
"mlx-community/Qwen3.6-35B-A3B-4bit": {
"name": "Qwen3.6-35B-A3B-4bit (local MLX)",
"limit": {
"context": 65536,
"output": 32768
}
}
}
}
}
}
Continue / Cursor
Base URL: http://127.0.0.1:7979/v1
API Key: any-value (the server does not validate it)
Model: mlx-community/Qwen3.6-35B-A3B-4bit
Python (openai SDK)
from openai import OpenAI
client = OpenAI(
base_url="http://127.0.0.1:7979/v1",
api_key="local"
)
response = client.chat.completions.create(
model="mlx-community/Qwen3.6-35B-A3B-4bit",
messages=[{"role": "user", "content": "Explain what a transformer is"}]
)
print(response.choices[0].message.content)
Configuration profiles
| Profile | Context | Cache | tok/s measured | When to use |
|---|---|---|---|---|
| baseline | 65,536 | 3 entries | 77.4 | Maximum throughput |
| high_context | 131,072 | 5 entries | 75.7 | Long documents, extended contexts (default) |
The performance difference between the two profiles (~2%) is within the noise margin. Use high_context if you work with large files or very long conversations.
Key parameters explained
| Parameter | Value | Why it matters |
|---|---|---|
| --max-bytes 12884901888 | 12 GB | Critical. Without this limit the model's KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens |
| --prompt-cache-size 3 | 3 LRU entries | Limits how many conversations the prefix cache keeps in memory |
| --context-length 65536 | 64k tokens | Maximum context window per request |
| --temperature 0.7 | — | Balance between creativity and coherence |
| --repetition-penalty 1.05 | — | Reduces repetitions in long responses |
Troubleshooting
The server disconnects after 30,000 tokens
This was a known bug with the Qwen3.6-35B-A3B model due to its hybrid MoE architecture. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server works correctly up to 60,000+ tokens (verified).
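If you suspect memory growth, a quick generic way to watch the server's resident memory while you push long prompts (plain ps/pgrep, not a feature of mlx-openai-server):
# ps reports RSS in kilobytes on macOS
watch_pid=$(pgrep -f mlx-openai-server | head -1)
while kill -0 "$watch_pid" 2>/dev/null; do
  ps -o rss= -p "$watch_pid" | awk '{printf "%.1f GB\n", $1/1024/1024}'
  sleep 5
done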
Architecture notes (for the curious)
Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all parameters for every token, it activates only a subset of "experts", which makes it efficient for its size. The 4-bit quantized version reduces RAM usage from ~70 GB to ~20 GB with minimal quality loss.
MLX leverages Apple Silicon's unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That's why a Mac with 48 GB can run a model that on a PC would require a GPU with 80 GB of VRAM.