I spent an evening trying to get a free, local AI coding assistant running on my Mac: a 96GB Apple Silicon machine. No API keys. No subscriptions. Just my machine and an open-source model doing the thinking.
It took some trial and error: wrong Python versions, models that looped forever, a 72B model that took 18 minutes to answer one question, and a sneaky bug buried in a config file. But by the end, I had a setup that actually works.
This article walks through everything: the concepts, the setup, the failures (the best part), and the working result.
What We're Building
The goal is simple: run OpenCode, an open-source agentic coding CLI (think Cursor or Claude Code, but in your terminal), powered by a large language model running entirely on your Mac.
No cloud. No API costs. No data leaving your machine.
Here's what the final result looks like: you type a coding question in your terminal, and a 30-billion-parameter AI model running on your Mac's GPU thinks about it, reads your files, edits your code, and runs commands, all locally.
The Architecture (Three Layers)
Before we install anything, let's understand what we're building. There are three layers:
+-------------------------------------+
| OpenCode (The App) |
| What you interact with. It sends |
| your messages and executes tools |
| like reading files, editing code, |
| and running shell commands. |
| "The Hands" |
+--------------+----------------------+
| HTTP (localhost:8080)
|
+--------------v----------------------+
| mlx_lm.server (The Hub) |
| Translates between OpenCode and |
| the model. Converts text to |
| numbers, runs inference on the |
| GPU, parses tool calls. |
| "The Translator" |
+--------------+----------------------+
| MLX Framework (GPU)
|
+--------------v----------------------+
| The LLM (The Brain) |
| A neural network with billions of |
| parameters. Takes in numbers, |
| outputs numbers. Doesn't "know" |
| about files or code -- just |
| predicts the next token. |
| "The Brain" |
+-------------------------------------+
OpenCode is the app you type into. It's "the hands": it can read files, edit code, and run commands, but it doesn't know what to do. It asks the brain.
mlx_lm.server is the middleman. It runs an HTTP server on your Mac (port 8080) that speaks the same API as OpenAI's. OpenCode doesn't even know the model is local; it just talks to localhost:8080 as if it were calling a cloud API.
The LLM is the brain. It takes in numbers and outputs numbers. That's it. It has no hands, no eyes, no ability to touch your filesystem. It can only think and suggest.
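Because the middle layer speaks the OpenAI chat-completions schema, the request OpenCode sends to localhost:8080 looks like any other OpenAI API call. Here's a minimal sketch of such a request body; the endpoint and field names follow the OpenAI schema that mlx_lm.server emulates, and the message content is just illustrative:

```python
import json

# The endpoint mlx_lm.server exposes locally (OpenAI-compatible).
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit",
    "messages": [
        {"role": "user", "content": "What's in main.py?"}
    ],
    # OpenCode also advertises its tools so the model can request them.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "read_file",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                },
            },
        }
    ],
}

body = json.dumps(payload)
print(body[:60])
```

Nothing in this payload says "local model"; swap the `url` for OpenAI's and the same client code would work, which is exactly why OpenCode can treat the two interchangeably.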
What Are Tokens?
Models don't read text; they read tokens, which are numbers. Before your message reaches the model, it's converted:
"Fix the bug in auth.py"
|
v (tokenizer)
[15640, 279, 8563, 304, 4428, 2386]
|
v (model inference on GPU)
[791, 4546, 374, 389, 1584, 220, ...]
|
v (detokenizer)
"The issue is on line 42..."
The component that does this conversion is called a tokenizer. Each model ships with its own tokenizer; you can think of it as the model's dictionary.
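To make the encode/decode round trip concrete, here's a toy tokenizer. Real tokenizers use subword schemes like BPE with vocabularies of ~100k entries; this word-level version and its tiny vocabulary are purely illustrative:

```python
# A toy word-level tokenizer. Real tokenizers split text into subword
# pieces (BPE or similar); this vocabulary is invented for illustration.
vocab = {"Fix": 0, "the": 1, "bug": 2, "in": 3, "auth.py": 4, "<unk>": 5}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text: str) -> list[int]:
    """Text -> token IDs. The model only ever sees the IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

def decode(ids: list[int]) -> str:
    """Token IDs -> text."""
    return " ".join(inv_vocab[i] for i in ids)

ids = encode("Fix the bug in auth.py")
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # Fix the bug in auth.py
```

The model's entire job is to take a sequence like `[0, 1, 2, 3, 4]` and predict the next ID in the sequence, over and over.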
What Are Tool Calls?
Here's the key concept that makes an LLM useful as a coding assistant. The model can't actually read your files (remember, it's just a brain with no hands). But it can output a tool call: a structured request that says "hey, I need someone to read this file for me."
You: "What's in main.py?"
|
v
Model thinks... outputs:
<tool_call>
<function=read_file>{"path": "main.py"}
|
v
mlx_lm.server parses this into structured JSON
|
v
OpenCode receives: { "function": "read_file", "args": {"path": "main.py"} }
OpenCode executes it, reads the file, sends contents back to the model
|
v
Model receives file contents, thinks again, responds to you
This back-and-forth (model requests a tool, app executes it, sends results back) is the loop that makes AI coding assistants work. The model is the brain, OpenCode is the hands.
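The loop above can be sketched in a few lines. Everything here is a stand-in, not the real OpenCode internals: `call_model` stubs out the HTTP round trip to the server, and `read_file` stubs out a real tool:

```python
# A minimal sketch of the agent loop. `call_model` and the tool
# implementations are stand-ins, not the real OpenCode internals.
def read_file(path: str) -> str:
    # Stub: a real implementation would read the file from disk.
    return f"(contents of {path})"

TOOLS = {"read_file": read_file}

def call_model(messages):
    # Stub for the HTTP round trip to mlx_lm.server. First turn: the
    # model requests a tool; second turn: it answers using the result.
    if messages[-1]["role"] == "user":
        return {"tool_call": {"function": "read_file",
                              "args": {"path": "main.py"}}}
    return {"content": "main.py defines your entry point."}

def agent_loop(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = call_model(messages)
        if "tool_call" not in reply:  # plain answer: we're done
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["function"]](**call["args"])  # the "hands"
        messages.append({"role": "tool", "content": result})

print(agent_loop("What's in main.py?"))
```

Real agents add limits (max iterations, tool permission prompts), but the core control flow is exactly this while loop.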
Why Parsers Matter
Different models output tool calls in different formats. Some use XML-style tags, some use raw JSON, some use special tokens. The tool parser in mlx_lm.server needs to understand the model's specific format. If there's a mismatch (and we'll see exactly this bug later), things crash.
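A small sketch makes the mismatch tangible. The XML-ish grammar below is a simplified imitation of the Qwen3-Coder style shown earlier, not the full spec, and the regex parser is mine, not the one mlx-lm actually ships:

```python
import json
import re

# Two syntaxes for the same tool request (simplified imitations).
xml_style = '<tool_call>\n<function=read_file>{"path": "main.py"}'
json_style = '{"function": "read_file", "arguments": {"path": "main.py"}}'

def parse_xml_style(text: str) -> dict:
    """Toy parser for the XML-ish format: tag for the name, JSON args."""
    match = re.search(r"<function=(\w+)>(\{.*\})", text, re.DOTALL)
    return {"function": match.group(1),
            "arguments": json.loads(match.group(2))}

def parse_json_style(text: str) -> dict:
    """Toy parser for the raw-JSON format."""
    return json.loads(text.strip())

print(parse_xml_style(xml_style))
# Feed the XML-style output to the JSON parser and it dies immediately:
try:
    parse_json_style(xml_style)
except json.JSONDecodeError as e:
    print("parser mismatch ->", e)
```

That `JSONDecodeError` on mismatched input is precisely the failure mode we'll hit later in this article.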
What Is MLX?
MLX is Apple's machine learning framework, built specifically for Apple Silicon chips (M1, M2, M3, M4, and their Pro/Max/Ultra variants).
What makes it special: Apple Silicon has unified memory. Unlike a PC where the CPU and GPU have separate memory pools, your Mac's CPU and GPU share the same RAM. MLX takes advantage of this --a model loaded into memory is accessible to the GPU without any copying.
This is why Apple Silicon Macs are surprisingly good at running LLMs. A Mac with 96GB of unified memory can run models that would require an expensive NVIDIA GPU on a PC.
mlx-lm is a Python package built on MLX. It provides:
- A model loader that downloads models from HuggingFace
- An inference engine that runs models on your Mac's GPU
- mlx_lm.server, an OpenAI-compatible HTTP server
Setting Up
Prerequisites
- A Mac with Apple Silicon (M1 or newer)
- At least 32GB of RAM (64GB+ recommended)
- macOS with Homebrew installed
Step 1: Install Python 3.12
macOS ships with an older Python. We need 3.12+ for the latest mlx-lm:
brew install python@3.12
Step 2: Create a Virtual Environment
/opt/homebrew/bin/python3.12 -m venv ~/mlx-env
source ~/mlx-env/bin/activate
pip install --upgrade pip
pip install mlx-lm
Gotcha I hit: I first tried using the system Python (3.9) and got Model type glm4_moe_lite not supported. Newer model architectures require newer versions of mlx-lm, which in turn require newer Python. If you see architecture errors, check your Python version first.
Step 3: Install OpenCode
# via npm
npm install -g opencode-ai
# or via Homebrew
brew install anomalyco/tap/opencode
Check the OpenCode docs for the latest install method.
Step 4: Start the Model Server
Let's start with the model I recommend (more on why later):
source ~/mlx-env/bin/activate
mlx_lm.server --model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit --port 8080
The first run downloads the model (~33GB). Subsequent runs load from cache.
Tip: Add a shell alias so you don't have to remember the full command:
# Add to ~/.zshrc
alias mlx='source ~/mlx-env/bin/activate'
alias qwen='mlx_lm.server --model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit --port 8080'
Then just:
mlx && qwen
Step 5: Configure OpenCode
Create or edit ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"mlx": {
"npm": "@ai-sdk/openai-compatible",
"name": "MLX (local)",
"options": {
"baseURL": "http://localhost:8080/v1"
},
"models": {
"mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit": {
"name": "Qwen3-Coder 30B 8bit",
"tools": { "task": true }
}
}
}
}
}
The key parts:
- baseURL points to your local mlx_lm.server
- npm tells OpenCode which SDK adapter to use (OpenAI-compatible)
- tools.task enables sub-agent task spawning
Now launch OpenCode, select your local model, and start coding!
The Model Hunt
This is the fun part, and the most educational. I tried several models before finding what works. Here's the journey.
Attempt 1: GLM-4.7-Flash-4bit
What: A 4-bit quantized version of Zhipu's GLM-4 model, a Mixture of Experts (MoE) architecture. (Quantization is a compression technique: 4-bit means each parameter is stored in just 4 bits instead of the original 16 or 32, which shrinks the model to fit in less RAM at the cost of some accuracy. 8-bit is a middle ground.)
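The quantization arithmetic is worth internalizing, because it predicts roughly how much RAM a download will need. A back-of-envelope sketch (parameters times bits per parameter, divided by 8 for bytes); real usage adds overhead for the KV cache and activations, so treat these as lower bounds:

```python
# Rough weight footprint: params (billions) x bits / 8 bits-per-byte
# gives gigabytes. Ignores KV cache and runtime overhead, so the real
# number is always somewhat higher.
def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8  # 1B params at 8-bit ~ 1 GB

for name, params, bits in [
    ("30B @ 8-bit", 30, 8),
    ("72B @ 4-bit", 72, 4),
    ("80B @ 4-bit", 80, 4),
]:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB of weights")
```

This lines up with the sizes seen later in the article: the 30B 8-bit model weighs in around 30GB of weights (~33GB in practice), and the 72B 4-bit around 36GB (~42GB in practice).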
What happened: It loaded fine and could answer simple questions. But when I asked it to do anything complex, like build a project, it got stuck in repetitive loops, running the same commands over and over.
Lesson: Not all models are built for agentic coding. GLM-4.7 is a general-purpose chat model. It doesn't have the instruction-following precision needed to use tools reliably in a multi-step workflow.
Attempt 2: Qwen2.5-72B-Instruct-4bit
What: A massive 72-billion parameter dense model, quantized to 4-bit. About 42GB in RAM.
What happened: It downloaded (~40GB), loaded into memory, and I sent it a message with about 74,000 tokens of context. Eighteen minutes later... broken pipe. Connection crashed.
Lesson: Just because a model fits in your RAM doesn't mean it's fast enough to be useful. Dense models process every single parameter for every token. At 72B parameters, even on a fast Mac, it's painfully slow for interactive coding.
Attempt 3: Devstral-Small-2-24B-8bit
What: Mistral's coding-focused model, 24B parameters, 8-bit quantization.
What happened:
WARNING - Received tools but model does not support tool calling.
Dead on arrival. The mlx-lm server checks its list of tool parsers to see if it knows how to read a model's tool call format. At the time (mlx-lm v0.30.6), there was no Mistral/Devstral parser. The model could probably output tool calls, but the server couldn't parse them.
Lesson: Before downloading a multi-gigabyte model, check if mlx-lm has a tool parser for it. You can check the parsers available in your installed version:
ls ~/mlx-env/lib/python3.12/site-packages/mlx_lm/tool_parsers/
On v0.30.6, the available parsers are:
- glm47.py for GLM-4 models
- json_tools.py for models using raw JSON tool calls
- kimi_k2.py for Kimi K2
- qwen3_coder.py for the Qwen3-Coder family
- longcat.py, minimax_m2.py, function_gemma.py for other specific models
No Mistral parser. No Devstral.
Attempt 4: Qwen3-Coder-30B-A3B-Instruct-8bit (Winner)
What: Alibaba's coding-focused model. 30 billion total parameters, but it's a Mixture of Experts (MoE) model that only activates 3 billion parameters per token.
What happened: It worked. Beautifully. Tool calls parsed correctly, it followed instructions well, and it was fast enough for interactive use.
Why it works:
- MoE architecture: Only 3B of 30B parameters are active per token. This means inference is as fast as a 3B model while having the knowledge of a 30B model.
- Coding-focused training: It was specifically trained for code generation and tool use.
- Supported parser: mlx-lm has the qwen3_coder parser built in.
- ~33GB RAM: Fits comfortably on a 64GB+ Mac with room to spare.
Attempt 5: Qwen3-Coder-Next-4bit (Winner, after a bug fix)
What: The bigger sibling, with 80B total parameters (still 3B active) in a 512-expert mixture-of-experts design. It scores 70.6% on SWE-Bench Verified, putting it in the league of models many times its active size.
What happened: It downloaded (~45GB), loaded fine, but crashed instantly when I tried to use it:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This turned out to be the most interesting bug of the night. More on that in the next section.
Model Comparison
| Model | Type | RAM | Speed | Tool Calls | Verdict |
|---|---|---|---|---|---|
| GLM-4.7-Flash-4bit | MoE | ~16GB | Fast | Yes | Loops on complex tasks |
| Qwen2.5-72B-4bit | Dense | ~42GB | Very slow | Untested | Too slow for interactive use |
| Devstral-24B-8bit | Dense | ~25GB | Medium | No (no parser) | Dead on arrival |
| Qwen3-Coder-30B-8bit | MoE | ~33GB | Fast | Yes | Recommended |
| Qwen3-Coder-Next-4bit | MoE | ~50GB | Fast | Yes (after fix) | Best quality (needs 64GB+ RAM) |
Understanding Tool Calls (and the Bug We Found)
This section is about the most interesting problem I hit, and it teaches a lot about how this whole system works under the hood.
The Crash
After downloading Qwen3-Coder-Next-4bit, the model loaded fine. But the moment it tried to make a tool call (say, reading a file), the server crashed:
File "mlx_lm/tool_parsers/json_tools.py", line 11, in parse_tool_call
return json.loads(text.strip())
json.decoder.JSONDecodeError: Expecting value: line 1 column 1
The Investigation
The error is in json_tools.py --a parser that expects raw JSON. But wait... Qwen3-Coder-Next is a Qwen3-Coder model. It should be using the qwen3_coder parser, which expects XML-style tool calls like this:
<tool_call>
<function=read_file>{"path": "main.py"}
So why is the server trying to parse it as JSON?
The Root Cause
Every model ships with a file called tokenizer_config.json in its HuggingFace download. Among other things, this file contains a field called tool_parser_type that tells mlx_lm.server which parser to use.
I checked Qwen3-Coder-Next's config:
"tool_parser_type": "json_tools"
There it is. The model was shipped with the wrong parser label. The model outputs tool calls in qwen3_coder XML format, but its config file tells the server to use the json_tools JSON parser. The server obediently tries to parse XML as JSON -> crash.
The Fix
The fix is a one-line change in the cached config file:
# Find the cached config
find ~/.cache/huggingface/hub/models--mlx-community--Qwen3-Coder-Next-4bit \
-name "tokenizer_config.json"
# Edit it: change "json_tools" to "qwen3_coder"
Change:
"tool_parser_type": "json_tools"
To:
"tool_parser_type": "qwen3_coder"
After restarting the server: everything worked perfectly. Tool calls parsed, files read, code edited, smooth as butter.
Note: This fix lives in your local HuggingFace cache. If you delete the cache or re-download the model, you'll need to apply it again. Hopefully the model authors will fix this upstream.
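Since the fix has to be reapplied after any re-download, it's handy to script it. Here's a hedged sketch; the cache path is where HuggingFace downloads usually land on macOS (adjust if yours differs), and `fix_parser` is my helper name, not part of any library:

```python
import json
from pathlib import Path

def fix_parser(config_path: Path, correct_parser: str = "qwen3_coder") -> bool:
    """Rewrite tool_parser_type in a tokenizer_config.json if wrong.

    Returns True if the file was patched, False if already correct.
    """
    config = json.loads(config_path.read_text())
    if config.get("tool_parser_type") == correct_parser:
        return False
    config["tool_parser_type"] = correct_parser
    config_path.write_text(json.dumps(config, indent=2))
    return True

# Typical HuggingFace cache location for this model (assumption: default
# cache dir; rglob simply finds nothing if the model isn't downloaded).
cache = Path.home() / ".cache/huggingface/hub/models--mlx-community--Qwen3-Coder-Next-4bit"
for cfg in cache.rglob("tokenizer_config.json"):
    print(cfg, "patched" if fix_parser(cfg) else "already ok")
```

Re-run it after each re-download and it's a no-op when the config is already correct.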
The Full Tool Call Flow
Now that we understand the bug, here's the complete flow of a tool call:
+-----------------------------------------------------+
| 1. You type: "What's in main.py?" |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 2. OpenCode sends HTTP POST to localhost:8080 |
| with your message + available tools list |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 3. mlx_lm.server tokenizes your message |
| "What's in main.py?" -> [token IDs...] |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 4. Model runs inference on GPU |
| Outputs token IDs that decode to: |
| |
| <tool_call> |
| <function=read_file>{"path": "main.py"} |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 5. Tool parser (qwen3_coder) reads the XML format |
| Converts to structured JSON: |
| { "function": "read_file", |
| "arguments": {"path": "main.py"} } |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 6. OpenCode receives the parsed tool call |
| Executes it: reads main.py from disk |
| Sends file contents back to the model |
+--------------+--------------------------------------+
|
+--------------v--------------------------------------+
| 7. Model receives file contents, thinks again |
| Outputs a natural language response |
| "main.py contains a Flask application with..." |
+-----------------------------------------------------+
Step 5 is where the bug was. The wrong parser tried to JSON-parse XML-formatted output, and crashed.
Tips and Recommendations
RAM Guide
Your Mac's unified memory determines which models you can run:
| RAM | What You Can Run |
|---|---|
| 16GB | Small models only (7B-4bit). Tight. |
| 32GB | 7B-8bit or 13B-4bit comfortably |
| 64GB | Qwen3-Coder-30B-8bit (sweet spot) |
| 96GB+ | Qwen3-Coder-Next-4bit, or multiple models |
Rule of thumb: the model should use at most 75% of your RAM so macOS and other apps still have room.
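The 75% rule of thumb is easy to turn into a quick check before downloading. A trivial sketch (the function name and the 0.75 headroom default are just this article's rule, not an official guideline):

```python
# Budget at most ~75% of unified memory for the model so macOS and
# other apps keep breathing room.
def max_model_gb(ram_gb: int, headroom: float = 0.75) -> float:
    return ram_gb * headroom

for ram in (16, 32, 64, 96):
    print(f"{ram}GB Mac -> budget ~{max_model_gb(ram):.0f}GB for the model")
```

By this measure a 64GB Mac budgets ~48GB, which is why the ~33GB Qwen3-Coder-30B-8bit is the sweet spot there, and the ~45-50GB Next variant wants 96GB.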
MoE vs Dense Models
This is an important concept for choosing models:
Dense models (like Qwen2.5-72B) activate every parameter for every token. If the model has 72 billion parameters, all 72 billion participate in every calculation. Accurate, but slow.
Mixture of Experts (MoE) models (like Qwen3-Coder-30B) have many parameters but only activate a small subset per token. Qwen3-Coder-30B has 30B total parameters but only 3B active at any time. This means:
- Speed comparable to a 3B dense model
- Quality closer to a 30B dense model
- Same RAM usage as loading the full 30B (all parameters must be in memory, even if only some are used per token)
For local inference, MoE models are the way to go: you get a much better speed-to-quality ratio.
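The MoE trade-off fits in two lines of arithmetic. This is back-of-envelope reasoning, not a benchmark: per-token compute scales roughly with active parameters, while RAM scales with total parameters:

```python
# Back-of-envelope MoE math: compute tracks *active* params, memory
# tracks *total* params. Illustrative ratios, not measured numbers.
def moe_summary(total_b: float, active_b: float) -> dict:
    return {
        "compute_vs_dense": active_b / total_b,  # fraction of dense per-token work
        "ram_vs_dense": 1.0,                     # all experts stay in memory
    }

s = moe_summary(total_b=30, active_b=3)
print(f"Per-token compute: {s['compute_vs_dense']:.0%} of an equally sized dense model")
```

For Qwen3-Coder-30B that's roughly 10% of the per-token work of a 30B dense model, while still paying the full 30B memory bill.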
Before You Download a Model, Check These
- Does mlx-lm have a tool parser for it? Check mlx_lm/tool_parsers/ in your install. No parser = no tool calling.
- Is it a coding model? General chat models tend to struggle with the precise, multi-step tool-use patterns that coding assistants require.
- MoE or Dense? For interactive use, strongly prefer MoE.
- What quantization? 4-bit uses less RAM but lower quality. 8-bit is a good balance. Check whether it fits in your machine's RAM.
- Is the tool_parser_type correct? After downloading, peek at the model's tokenizer_config.json to make sure the parser type matches what the model actually outputs.
Debugging Tips
- Model stuck in loops? The model may not be good enough for agentic coding. Try a different one.
- "Model does not support tool calling"? No parser available for this model architecture in your version of mlx-lm.
- JSON decode errors on tool calls? Likely a parser mismatch. Check tool_parser_type in the model's tokenizer_config.json.
- Very slow responses? Probably a dense model that's too large. Switch to a MoE model or use a smaller/more quantized variant.
- Model loads but OpenCode can't connect? Make sure the server is running on the same port OpenCode is configured to use (default: 8080).
Conclusion
Running a local LLM as your AI coding assistant is absolutely doable on Apple Silicon. The key ingredients:
- mlx-lm to serve the model on your Mac's GPU
- OpenCode as the agentic coding interface
- The right model: Qwen3-Coder-30B for the sweet spot, or Qwen3-Coder-Next if you have the RAM
The setup isn't plug-and-play (yet). You'll need to understand the three-layer architecture, check tool parser compatibility, and maybe fix a config file or two. But once it's running, you have a surprisingly capable AI coding assistant that's completely free, completely private, and completely local.
The models are getting better fast. What required an API call to Claude or GPT a year ago is now running on your laptop. It's a good time to start experimenting.
Resources
- MLX on GitHub
- mlx-lm on GitHub
- OpenCode
- MLX Community on HuggingFace, with pre-converted models for Apple Silicon
- Qwen3-Coder-30B-A3B-Instruct-8bit
- Qwen3-Coder-Next-4bit