DEV Community

Bruno Mello
Running Local LLMs as Your AI Coding Assistant on Apple Silicon

I spent an evening trying to get a free, local AI coding assistant running on my Mac -- a 96GB Apple Silicon machine. No API keys. No subscriptions. Just my machine and an open-source model doing the thinking.

It took some trial and error -- wrong Python versions, models that looped forever, a 72B model that took 18 minutes to answer one question, and a sneaky bug buried in a config file. But by the end, I had a setup that actually works.

This article walks through everything: the concepts, the setup, the failures (the best part), and the working result.

What We're Building

The goal is simple: run OpenCode -- an open-source agentic coding CLI (think Cursor or Claude Code, but in your terminal) -- powered by a large language model running entirely on your Mac.

No cloud. No API costs. No data leaving your machine.

Here's what the final result looks like: you type a coding question in your terminal, and a 30-billion-parameter AI model running on your Mac's GPU thinks about it, reads your files, edits your code, and runs commands -- all locally.

The Architecture (Three Layers)

Before we install anything, let's understand what we're building. There are three layers:

+-------------------------------------+
|        OpenCode (The App)           |
|  What you interact with. It sends   |
|  your messages and executes tools   |
|  like reading files, editing code,  |
|  and running shell commands.        |
|           "The Hands"               |
+--------------+----------------------+
               | HTTP (localhost:8080)
               |
+--------------v----------------------+
|     mlx_lm.server (The Hub)         |
|  Translates between OpenCode and    |
|  the model. Converts text to        |
|  numbers, runs inference on the     |
|  GPU, parses tool calls.            |
|         "The Translator"            |
+--------------+----------------------+
               | MLX Framework (GPU)
               |
+--------------v----------------------+
|        The LLM (The Brain)          |
|  A neural network with billions of  |
|  parameters. Takes in numbers,      |
|  outputs numbers. Doesn't "know"    |
|  about files or code -- just        |
|  predicts the next token.           |
|          "The Brain"                |
+-------------------------------------+

OpenCode is the app you type into. It's "the hands" -- it can read files, edit code, and run commands, but it doesn't know what to do. It asks the brain.

mlx_lm.server is the middleman. It runs an HTTP server on your Mac (port 8080) that speaks the same API as OpenAI's. OpenCode doesn't even know the model is local -- it just talks to localhost:8080 as if it were calling a cloud API.

The LLM is the brain. It takes in numbers and outputs numbers. That's it. It has no hands, no eyes, no ability to touch your filesystem. It can only think and suggest.

What Are Tokens?

Models don't read text -- they read tokens, which are numbers. Before your message reaches the model, it's converted:

"Fix the bug in auth.py"
        |
        v  (tokenizer)
[15640, 279, 8563, 304, 4428, 2386]
        |
        v  (model inference on GPU)
[791, 4546, 374, 389, 1584, 220, ...]
        |
        v  (detokenizer)
"The issue is on line 42..."

The component that does this conversion is called a tokenizer. Each model ships with its own tokenizer -- you can think of it as the model's dictionary.
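To make the round trip concrete, here is a toy sketch. The vocabulary and token IDs below are invented for illustration -- real tokenizers map subword pieces (not whole words) over vocabularies of ~100k+ entries bundled with the model:

```python
# Toy tokenizer: illustrative only. Real tokenizers use byte-pair
# encoding over subwords; these IDs and words are made up.
VOCAB = {"Fix": 15640, "the": 279, "bug": 8563, "in": 304, "auth": 4428, ".py": 2386}
INV = {v: k for k, v in VOCAB.items()}

def tokenize(text):
    # Naive whole-word split, treating a ".py" suffix as its own token.
    out = []
    for word in text.split():
        if word.endswith(".py"):
            out += [VOCAB[word[:-3]], VOCAB[".py"]]
        else:
            out.append(VOCAB[word])
    return out

def detokenize(ids):
    # Rejoin tokens, gluing ".py" back onto the preceding word.
    return "".join(
        tok if tok == ".py" else " " + tok for tok in (INV[i] for i in ids)
    ).strip()

ids = tokenize("Fix the bug in auth.py")
print(ids)                 # [15640, 279, 8563, 304, 4428, 2386]
print(detokenize(ids))     # Fix the bug in auth.py
```

The model only ever sees the list of integers in the middle; the text on either end exists purely for humans.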

What Are Tool Calls?

Here's the key concept that makes an LLM useful as a coding assistant. The model can't actually read your files -- remember, it's just a brain with no hands. But it can output a tool call: a structured request that says "hey, I need someone to read this file for me."

You: "What's in main.py?"
        |
        v
Model thinks... outputs:
  <tool_call>
  <function=read_file>{"path": "main.py"}
        |
        v
mlx_lm.server parses this into structured JSON
        |
        v
OpenCode receives: { "function": "read_file", "args": {"path": "main.py"} }
OpenCode executes it, reads the file, sends contents back to the model
        |
        v
Model receives file contents, thinks again, responds to you

This back-and-forth -- model requests a tool, app executes it, sends results back -- is the loop that makes AI coding assistants work. The model is the brain, OpenCode is the hands.
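That loop is short enough to sketch end-to-end. Everything below is a toy: `fake_model` stands in for the real model (in the actual setup, the "thinking" step is an HTTP call to mlx_lm.server), and only `read_file` is a real tool:

```python
import os
import tempfile

def fake_model(messages):
    # Stand-in for the model server: on the first call it "decides" to
    # read a file; once a tool result is present, it answers in text.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"function": "read_file",
                              "arguments": {"path": messages[-1]["path"]}}}
    return {"content": "The file contains: " + messages[-1]["content"]}

def read_file(path):
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}   # the "hands" the app provides

def agent_loop(question, path):
    messages = [{"role": "user", "content": question, "path": path}]
    while True:
        reply = fake_model(messages)
        if "tool_call" not in reply:        # plain answer: we're done
            return reply["content"]
        call = reply["tool_call"]           # model asked for a tool
        result = TOOLS[call["function"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('hello')")
answer = agent_loop("What's in this file?", f.name)
os.unlink(f.name)
print(answer)   # The file contains: print('hello')
```

The real loop is the same shape, just with more tools (edit, shell, search) and a model on the other end of an HTTP connection.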

Why Parsers Matter

Different models output tool calls in different formats. Some use XML-style tags, some use raw JSON, some use special tokens. The tool parser in mlx_lm.server needs to understand the model's specific format. If there's a mismatch -- and we'll see exactly this bug later -- things crash.
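For example, the same read-file request might arrive in either of these two shapes, and a parser built for one chokes on the other. This is a minimal regex-based sketch, not the actual mlx-lm parser code:

```python
import json
import re

# The same tool call in two wire formats.
xml_style = '<tool_call>\n<function=read_file>{"path": "main.py"}'
json_style = '{"name": "read_file", "arguments": {"path": "main.py"}}'

def parse_xml_style(text):
    # Pull the function name and JSON arguments out of the XML-ish tags.
    m = re.search(r"<function=(\w+)>(\{.*\})", text, re.S)
    return {"function": m.group(1), "arguments": json.loads(m.group(2))}

def parse_json_style(text):
    obj = json.loads(text)
    return {"function": obj["name"], "arguments": obj["arguments"]}

# Both parsers recover the same structured call from their own format:
assert parse_xml_style(xml_style) == parse_json_style(json_style)

# Feeding XML output to the JSON parser reproduces the crash class
# we'll hit later in this article:
try:
    parse_json_style(xml_style)
except json.JSONDecodeError as e:
    print("parser mismatch:", e)
```

Same information, incompatible encodings -- which is why picking the right parser per model matters so much.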

What Is MLX?

MLX is Apple's machine learning framework, built specifically for Apple Silicon chips (M1, M2, M3, M4, and their Pro/Max/Ultra variants).

What makes it special: Apple Silicon has unified memory. Unlike a PC where the CPU and GPU have separate memory pools, your Mac's CPU and GPU share the same RAM. MLX takes advantage of this --a model loaded into memory is accessible to the GPU without any copying.

This is why Apple Silicon Macs are surprisingly good at running LLMs. A Mac with 96GB of unified memory can run models that would require an expensive NVIDIA GPU on a PC.

mlx-lm is a Python package built on MLX. It provides:

  • A model loader that downloads models from HuggingFace
  • An inference engine that runs models on your Mac's GPU
  • mlx_lm.server -- an OpenAI-compatible HTTP server

Setting Up

Prerequisites

  • A Mac with Apple Silicon (M1 or newer)
  • At least 32GB of RAM (64GB+ recommended)
  • macOS with Homebrew installed

Step 1: Install Python 3.12

macOS ships with an older Python. We need 3.12+ for the latest mlx-lm:

brew install python@3.12

Step 2: Create a Virtual Environment

/opt/homebrew/bin/python3.12 -m venv ~/mlx-env
source ~/mlx-env/bin/activate
pip install --upgrade pip
pip install mlx-lm

Gotcha I hit: I first tried using the system Python (3.9) and got Model type glm4_moe_lite not supported. Newer model architectures require newer versions of mlx-lm, which require newer Python. If you see architecture errors, check your Python version first.

Step 3: Install OpenCode

# via npm
npm install -g opencode-ai

# or via Homebrew
brew install anomalyco/tap/opencode

Check the OpenCode docs for the latest install method.

Step 4: Start the Model Server

Let's start with the model I recommend (more on why later):

source ~/mlx-env/bin/activate
mlx_lm.server --model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit --port 8080

The first run downloads the model (~33GB). Subsequent runs load from cache.

Tip: Add a shell alias so you don't have to remember the full command:

# Add to ~/.zshrc
alias mlx='source ~/mlx-env/bin/activate'
alias qwen='mlx_lm.server --model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit --port 8080'

Then just: mlx && qwen

Step 5: Configure OpenCode

Create or edit ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit": {
          "name": "Qwen3-Coder 30B 8bit",
          "tools": { "task": true }
        }
      }
    }
  }
}

The key parts:

  • baseURL points to your local mlx_lm.server
  • npm tells OpenCode which SDK adapter to use (OpenAI-compatible)
  • tools.task enables sub-agent task spawning

Now launch OpenCode, select your local model, and start coding!
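Before pointing OpenCode at the server, it's worth a direct sanity check that the server answers. Here's a stdlib-only sketch; the `send` helper is my own naming, and it assumes the default port and the model from Step 4:

```python
import json
import urllib.request

def build_chat_request(model, prompt):
    # Standard OpenAI-style chat-completions payload, the shape
    # mlx_lm.server accepts on /v1/chat/completions.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

payload = build_chat_request(
    "mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit", "Say hi in one word."
)
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

def send(request):
    # Only call this with mlx_lm.server running.
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server up, `print(send(req))` should return a short completion; a connection-refused error means the server isn't running or is on a different port.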

The Model Hunt

This is the fun part -- and the most educational. I tried several models before finding what works. Here's the journey.

Attempt 1: GLM-4.7-Flash-4bit

What: A 4-bit quantized version of Zhipu's GLM-4 model, a Mixture of Experts (MoE) architecture. (Quantization is a compression technique -- 4-bit means each parameter is stored in just 4 bits instead of the original 16 or 32, which shrinks the model to fit in less RAM at the cost of some accuracy. 8-bit is a middle ground.)
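The RAM math behind quantization is back-of-the-envelope simple: weights take roughly parameters × bits ÷ 8 bytes, plus overhead for activations and the KV cache (which is why observed footprints run a few GB above these numbers):

```python
def weight_gb(params_billion, bits):
    # bits per parameter -> bytes, scaled to GB (params are in billions)
    return params_billion * bits / 8

print(weight_gb(30, 8))   # 30.0 GB of weights -> ~33GB observed with overhead
print(weight_gb(30, 4))   # 15.0 GB
print(weight_gb(72, 4))   # 36.0 GB -> the 72B model's ~42GB footprint
print(weight_gb(80, 4))   # 40.0 GB
```

This is the quick mental check to run before any download: weights at the listed bit width, plus a few GB of headroom.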

What happened: It loaded fine and could answer simple questions. But when I asked it to do anything complex --like build a project --it got stuck in repetitive loops, running the same commands over and over.

Lesson: Not all models are built for agentic coding. GLM-4.7 is a general-purpose chat model. It doesn't have the instruction-following precision needed to use tools reliably in a multi-step workflow.

Attempt 2: Qwen2.5-72B-Instruct-4bit

What: A massive 72-billion parameter dense model, quantized to 4-bit. About 42GB in RAM.

What happened: It downloaded (~40GB), loaded into memory, and I sent it a message with about 74,000 tokens of context. Eighteen minutes later... broken pipe. Connection crashed.

Lesson: Just because a model fits in your RAM doesn't mean it's fast enough to be useful. Dense models process every single parameter for every token. At 72B parameters, even on a fast Mac, it's painfully slow for interactive coding.

Attempt 3: Devstral-Small-2-24B-8bit

What: Mistral's coding-focused model, 24B parameters, 8-bit quantization.

What happened:

WARNING - Received tools but model does not support tool calling.

Dead on arrival. The mlx-lm server checks its list of tool parsers to see if it knows how to read a model's tool call format. At the time (mlx-lm v0.30.6), there was no Mistral/Devstral parser. The model could probably output tool calls, but the server couldn't parse them.

Lesson: Before downloading a multi-gigabyte model, check if mlx-lm has a tool parser for it. You can check the parsers available in your installed version:

ls ~/mlx-env/lib/python3.12/site-packages/mlx_lm/tool_parsers/

On v0.30.6, the available parsers are:

  • glm47.py -- for GLM-4 models
  • json_tools.py -- for models using raw JSON tool calls
  • kimi_k2.py -- for Kimi K2
  • qwen3_coder.py -- for the Qwen3-Coder family
  • longcat.py, minimax_m2.py, function_gemma.py -- for other specific models

No Mistral parser. No Devstral.

Attempt 4: Qwen3-Coder-30B-A3B-Instruct-8bit (Winner)

What: Alibaba's coding-focused model. 30 billion total parameters, but it's a Mixture of Experts (MoE) model that only activates 3 billion parameters per token.

What happened: It worked. Beautifully. Tool calls parsed correctly, it followed instructions well, and it was fast enough for interactive use.

Why it works:

  • MoE architecture: Only 3B of 30B parameters are active per token. This means inference is as fast as a 3B model while having the knowledge of a 30B model.
  • Coding-focused training: It was specifically trained for code generation and tool use.
  • Supported parser: mlx-lm has the qwen3_coder parser built in.
  • ~33GB RAM: Fits comfortably on a 64GB+ Mac with room to spare.

Attempt 5: Qwen3-Coder-Next-4bit (Winner, after a bug fix)

What: The bigger sibling -- 80B total parameters (still 3B active), spread across 512 experts. It scores 70.6% on SWE-Bench Verified, putting it in the league of models many times its active size.

What happened: It downloaded (~45GB), loaded fine, but crashed instantly when I tried to use it:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

This turned out to be the most interesting bug of the night. More on that in the next section.

Model Comparison

| Model | Type | RAM | Speed | Tool Calls | Verdict |
|---|---|---|---|---|---|
| GLM-4.7-Flash-4bit | MoE | ~16GB | Fast | Yes | Loops on complex tasks |
| Qwen2.5-72B-4bit | Dense | ~42GB | Very slow | Untested | Too slow for interactive use |
| Devstral-24B-8bit | Dense | ~25GB | Medium | No (no parser) | Dead on arrival |
| Qwen3-Coder-30B-8bit | MoE | ~33GB | Fast | Yes | Recommended |
| Qwen3-Coder-Next-4bit | MoE | ~50GB | Fast | Yes (after fix) | Best quality (needs 64GB+ RAM) |

Understanding Tool Calls -- and the Bug We Found

This section is about the most interesting problem I hit, and it teaches a lot about how this whole system works under the hood.

The Crash

After downloading Qwen3-Coder-Next-4bit, the model loaded fine. But the moment it tried to make a tool call --say, reading a file --the server crashed:

File "mlx_lm/tool_parsers/json_tools.py", line 11, in parse_tool_call
    return json.loads(text.strip())
json.decoder.JSONDecodeError: Expecting value: line 1 column 1

The Investigation

The error is in json_tools.py -- a parser that expects raw JSON. But wait... Qwen3-Coder-Next is a Qwen3-Coder model. It should be using the qwen3_coder parser, which expects XML-style tool calls like this:

<tool_call>
<function=read_file>{"path": "main.py"}

So why is the server trying to parse it as JSON?

The Root Cause

Every model ships with a file called tokenizer_config.json in its HuggingFace download. Among other things, this file contains a field called tool_parser_type that tells mlx_lm.server which parser to use.

I checked Qwen3-Coder-Next's config:

"tool_parser_type": "json_tools"

There it is. The model was shipped with the wrong parser label. The model outputs tool calls in qwen3_coder XML format, but its config file tells the server to use the json_tools JSON parser. The server obediently tries to parse XML as JSON -> crash.

The Fix

The fix is a one-line change in the cached config file:

# Find the cached config
find ~/.cache/huggingface/hub/models--mlx-community--Qwen3-Coder-Next-4bit \
  -name "tokenizer_config.json"

# Edit it -- change "json_tools" to "qwen3_coder"

Change:

"tool_parser_type": "json_tools"

To:

"tool_parser_type": "qwen3_coder"

After restarting the server, everything worked perfectly. Tool calls parsed, files read, code edited -- smooth as butter.

Note: This fix lives in your local HuggingFace cache. If you delete the cache or re-download the model, you'll need to apply it again. Hopefully the model authors will fix this upstream.
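If you'd rather script the fix than edit by hand (handy after a cache wipe), something like this works. It's a sketch: the function name is mine, and you should verify the cache directory path matches your machine before running it:

```python
import json
import pathlib

def fix_tool_parser(cache_dir, wrong="json_tools", right="qwen3_coder"):
    # Rewrite tool_parser_type in every tokenizer_config.json found
    # under the model's cache directory; return the paths changed.
    fixed = []
    for cfg_path in pathlib.Path(cache_dir).expanduser().rglob("tokenizer_config.json"):
        cfg = json.loads(cfg_path.read_text())
        if cfg.get("tool_parser_type") == wrong:
            cfg["tool_parser_type"] = right
            cfg_path.write_text(json.dumps(cfg, indent=2))
            fixed.append(str(cfg_path))
    return fixed

# Example (path assumed from the standard HuggingFace cache layout):
# fix_tool_parser(
#     "~/.cache/huggingface/hub/models--mlx-community--Qwen3-Coder-Next-4bit"
# )
```

Rerunning it is harmless: once the field reads `qwen3_coder`, the function finds nothing left to change.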

The Full Tool Call Flow

Now that we understand the bug, here's the complete flow of a tool call:

+-----------------------------------------------------+
|  1. You type: "What's in main.py?"                  |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  2. OpenCode sends HTTP POST to localhost:8080       |
|     with your message + available tools list         |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  3. mlx_lm.server tokenizes your message             |
|     "What's in main.py?" -> [token IDs...]           |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  4. Model runs inference on GPU                      |
|     Outputs token IDs that decode to:               |
|                                                      |
|     <tool_call>                                      |
|     <function=read_file>{"path": "main.py"}          |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  5. Tool parser (qwen3_coder) reads the XML format  |
|     Converts to structured JSON:                     |
|     { "function": "read_file",                       |
|       "arguments": {"path": "main.py"} }            |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  6. OpenCode receives the parsed tool call           |
|     Executes it: reads main.py from disk             |
|     Sends file contents back to the model            |
+--------------+--------------------------------------+
               |
+--------------v--------------------------------------+
|  7. Model receives file contents, thinks again       |
|     Outputs a natural language response              |
|     "main.py contains a Flask application with..."   |
+-----------------------------------------------------+

Step 5 is where the bug was. The wrong parser tried to JSON-parse XML-formatted output, and crashed.

Tips and Recommendations

RAM Guide

Your Mac's unified memory determines which models you can run:

| RAM | What You Can Run |
|---|---|
| 16GB | Small models only (7B at 4-bit). Tight. |
| 32GB | 7B at 8-bit or 13B at 4-bit comfortably |
| 64GB | Qwen3-Coder-30B-8bit (sweet spot) |
| 96GB+ | Qwen3-Coder-Next-4bit, or multiple models |

Rule of thumb: the model should use at most 75% of your RAM so macOS and other apps still have room.
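That rule of thumb as a one-line check (the 75% ceiling is this article's heuristic, not a hard macOS limit):

```python
def fits(model_gb, ram_gb, headroom=0.75):
    # Leave ~25% of unified memory free for macOS and other apps.
    return model_gb <= ram_gb * headroom

print(fits(33, 64))   # True  -> Qwen3-Coder-30B-8bit on a 64GB Mac
print(fits(50, 64))   # False -> Qwen3-Coder-Next-4bit wants more room
print(fits(50, 96))   # True
```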

MoE vs Dense Models

This is an important concept for choosing models:

Dense models (like Qwen2.5-72B) activate every parameter for every token. If the model has 72 billion parameters, all 72 billion participate in every calculation. Accurate, but slow.

Mixture of Experts (MoE) models (like Qwen3-Coder-30B) have many parameters but only activate a small subset per token. Qwen3-Coder-30B has 30B total parameters but only 3B active at any time. This means:

  • Speed comparable to a 3B dense model
  • Quality closer to a 30B dense model
  • Same RAM usage as loading the full 30B (all parameters must be in memory, even if only some are used per token)

For local inference, MoE models are the way to go: you get a much better speed-to-quality ratio.
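A rough way to see the trade-off in numbers: per-token compute scales with active parameters, while RAM scales with total parameters. The ~2 FLOPs-per-parameter figure is a standard approximation that ignores attention and expert-routing overhead:

```python
def per_token_gflops(active_params_billion):
    # ~2 FLOPs (a multiply + an add) per active parameter per token.
    return 2 * active_params_billion

dense_72b = per_token_gflops(72)   # all 72B parameters touched per token
moe_3b_active = per_token_gflops(3)  # only 3B of 30B experts fire per token

print(dense_72b)                 # 144 GFLOPs per token
print(moe_3b_active)             # 6 GFLOPs per token
print(dense_72b / moe_3b_active) # 24.0x less compute per token
```

That 24x gap is why the 72B dense model needed 18 minutes while the MoE model felt interactive, even though both fit in memory.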

Before You Download a Model, Check These

  1. Does mlx-lm have a tool parser for it? Check mlx_lm/tool_parsers/ in your install. No parser = no tool calling.
  2. Is it a coding model? General chat models tend to struggle with the precise, multi-step tool-use patterns that coding assistants require.
  3. MoE or Dense? For interactive use, strongly prefer MoE.
  4. What quantization? 4-bit uses less RAM but gives lower quality; 8-bit is a good balance. Check that the RAM footprint fits your machine.
  5. Is the tool_parser_type correct? After downloading, peek at the model's tokenizer_config.json to make sure the parser type matches what the model actually outputs.

Debugging Tips

  • Model stuck in loops? The model may not be good enough for agentic coding. Try a different one.
  • "Model does not support tool calling"? No parser available for this model architecture in your version of mlx-lm.
  • JSON decode errors on tool calls? Likely a parser mismatch. Check tool_parser_type in the model's tokenizer_config.json.
  • Very slow responses? Probably a dense model that's too large. Switch to a MoE model or use a smaller/more quantized variant.
  • Model loads but OpenCode can't connect? Make sure the server is running on the same port OpenCode is configured to use (default: 8080).

Conclusion

Running a local LLM as your AI coding assistant is absolutely doable on Apple Silicon. The key ingredients:

  1. mlx-lm to serve the model on your Mac's GPU
  2. OpenCode as the agentic coding interface
  3. The right model -- Qwen3-Coder-30B for the sweet spot, or Qwen3-Coder-Next if you have the RAM

The setup isn't plug-and-play (yet). You'll need to understand the three-layer architecture, check tool parser compatibility, and maybe fix a config file or two. But once it's running, you have a surprisingly capable AI coding assistant that's completely free, completely private, and completely local.

The models are getting better fast. What required an API call to Claude or GPT a year ago is now running on your laptop. It's a good time to start experimenting.
