DEV Community

Kunal
Kunal

Posted on • Originally published at kunalganglani.com

How to Run Local Agentic AI on Your Mac With MLX After WWDC 2026

Originally published at kunalganglani.com — read it there for inline code, hero image, and live links.

Apple gave local agentic AI on Mac a headline slot at WWDC 2026 and then explained it in 13 minutes and 37 seconds. No cloud API calls, no data leaving your machine, no subscription fees. Autonomous agents running entirely on Apple Silicon. The pitch was compelling. The actual setup instructions? Barely there.

I watched the session, opened my terminal, and immediately hit a wall of decisions that Apple glossed over. This post fills that gap. How to run local agentic AI on your Mac with MLX after WWDC 2026: the model selection, the speculative decoding trick that nearly doubles your throughput, the multimodal setup, and the final step everyone skips — actually wiring it into a coding agent you can use every day.

What Apple's WWDC 2026 MLX Session Actually Covered (And What It Didn't)

Apple's session — listed as a headline highlight on the WWDC26 developer portal alongside "Explore distributed inference and training with MLX" and "Meet Core AI" — made the case for local inference well. Privacy, low latency, offline access. These are real advantages. If you've ever had your internet drop mid-session with Claude Code and been left staring at a dead terminal, you know exactly why this matters.

Here's the official session if you want the Apple pitch:

[YOUTUBE:wykPErJ8M-8|WWDC26: Run local agentic AI on the Mac using MLX | Apple]

But 13 minutes means they glossed over every decision that actually matters. Which model do you download? How do you configure speculative decoding? What about multimodal input? How do you expose the model as an API endpoint that tools like Pi or Claude Code can connect to?

These are the questions that separate a toy demo from a genuinely useful local AI workflow. The Hacker News community noticed the gap immediately. A blog post by Kyle Howells covering the practical setup hit 396 upvotes and 99 comments within 19 hours. That kind of organic signal tells you exactly how many developers were left searching for answers after the session.

I've been running local LLM inference on Apple Silicon since the M1 days, and the gap between Apple's polished demos and what you actually need to do on your machine has always been wide. Let's close it.

Choosing Your Model: Gemma 4 26B-A4B vs Qwen3 35B-A3B

Model selection is the first decision and the one that matters most. Two models dominate the local agentic AI conversation on Mac right now: Google's Gemma 4 26B-A4B and Alibaba's Qwen3 35B-A3B.

Both use Mixture-of-Experts architecture, which is the only reason they run on consumer hardware at all. The total parameter counts look intimidating (26B and 35B), but only a fraction activate per token. Gemma 4 fires roughly 4B parameters per inference step. Qwen3 fires about 3B. That's what makes them fast enough for interactive use.

Here's how they compare on an M1 Max with 64 GB unified memory, based on Kyle Howells' benchmarks:

Dimension Gemma 4 26B-A4B (Q4_K_XL) Qwen3 35B-A3B (Q4)
Disk size ~16 GB (17 GB with MTP + projector) ~20 GB
Baseline generation speed 58.2 tok/s 38.0 tok/s
With MTP speculative decoding 69–90+ tok/s ~44 tok/s
Multimodal (images/screenshots) Yes, via projector No (not via llama.cpp MTP path)
MoE active parameters ~4B ~3B
Min RAM recommended 24 GB 32 GB

This one's straightforward. Gemma 4 26B-A4B is the better choice for most Mac developers. It's faster, it supports multimodal input (critical if you want to feed screenshots to your agent), and at ~17 GB total it fits comfortably in 24 GB of unified memory. That means M2 Pro, M3 Pro, and anything above can run it without swapping.

Qwen3 has its strengths — I've covered them in my Qwen3 agent capabilities review — but for this specific use case, Gemma 4 wins on every axis that matters for a local coding workflow.

If you're on a machine with only 16 GB of RAM, neither of these will work well. Look at Gemma 4 12B or check my local LLM hardware requirements guide for options at every memory tier.

How to Set Up llama.cpp With Metal on macOS

Two main paths for running local agentic AI on Mac: mlx-lm (Apple's official framework) or llama.cpp with Metal acceleration. Both work. I'll cover both, but llama.cpp currently has better MTP speculative decoding support and more mature GGUF model compatibility. The practitioner community has largely converged on it for this workflow, and for good reason.

The setup has three steps. Build llama.cpp. Download the model. Run it.

Building llama.cpp is standard CMake. Clone the repo, create a build directory, run cmake with Metal enabled, then make. The Metal backend compiles automatically on macOS with Xcode command-line tools installed. If you've built any C++ project on a Mac, you already know the drill.

Downloading the model is where a lot of tutorials create unnecessary friction. The Hacker News commenter Aurornis correctly pointed out that llama.cpp has a built-in -hf flag that downloads models directly from Hugging Face. Instead of manually navigating the Hugging Face web UI, hunting for the right GGUF file, and downloading 16 GB through your browser, you can pass -hf unsloth/gemma-4-26B-A4B-it-GGUF and llama.cpp handles the rest. This is the kind of workflow shortcut the WWDC session and most tutorials just skip.

The specific file you want is gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf. The Q4_K_XL quantization hits the sweet spot between quality and speed. Lower quantizations save a couple gigabytes but the quality drop is noticeable in coding tasks. Higher quantizations (Q6, Q8) are measurably better but won't fit comfortably in 24 GB machines once you add the MTP draft head.

Once you have the model, a basic run looks like launching llama-cli with -ngl 999 (offload all layers to GPU), -fa on (flash attention), and -c 4096 (context window). On an M1 Max 64 GB, this baseline setup gives you about 58.2 tokens per second generation speed and 298 tokens per second prompt processing.

58 tok/s is usable. It's not fast. For agentic coding workflows where the agent makes multiple tool calls per task, you need more. That's where speculative decoding comes in.

MTP Speculative Decoding: The Setup Apple Didn't Fully Explain

Multi-Token Prediction (MTP) speculative decoding is the single biggest performance unlock for running local agentic AI on Mac. The WWDC session barely mentioned it. Here's how it works.

Normal autoregressive generation produces one token at a time. Full forward pass, one token out, repeat. MTP adds a lightweight "draft head" — a small model that predicts multiple tokens ahead in a single step. The main model then verifies those predictions in parallel. When the predictions are correct (which they are surprisingly often for structured output like code), you effectively get multiple tokens for the cost of one forward pass.

The MTP draft head for Gemma 4 is a separate file: gemma-4-26B-A4B-it-Q8_0-MTP.gguf. It's quantized at Q8_0 (higher precision than the main model) because accuracy in the draft head directly affects acceptance rates. The file is small — the total model folder including the draft head and multimodal projector comes to about 17 GB.

To enable it, add two flags to your llama.cpp command: --model-draft pointing to the MTP file, and --spec-type draft-mtp. There's a third critical parameter: --spec-draft-n-max, which controls how many tokens the draft head predicts ahead.

Here's where most tutorials get it wrong. Kyle Howells initially tested with --spec-draft-n-max 3 (predicting 3 tokens ahead) and got 69.2 tok/s — a 19% improvement over baseline. But Unsloth's MTP guide specifically recommends starting with 2, not 3. Predicting 3 tokens ahead increases speculation overhead, and the third token's acceptance rate drops off significantly. With tuning, Kyle Howells reports speeds north of 90 tok/s — a 55% improvement over the 58 tok/s baseline.

90+ tokens per second on a laptop chip from 2021. No cloud. No API key. No data leaving your machine. That's the number that made me sit up and pay attention.

One important benchmarking caveat. The Hacker News commenter liuliu pointed out that short 128-token benchmarks overstate MTP acceptance rates because early output tends to have higher acceptance. For realistic benchmarking, you want system prompts of at least 1,000–3,000 tokens (simulating a real agent's context) and generation lengths at longer contexts (32k–64k tokens). The tools/llama-bench utility in llama.cpp automates this sweep — use it instead of eyeballing single-prompt results.

Having shipped production AI systems that depend on fast inference loops, I can tell you that the difference between 58 tok/s and 90 tok/s isn't just a benchmark number. It's the difference between an agent that feels sluggish on multi-step tasks and one that keeps up with your thinking. When your agent needs four tool calls to fix a bug, that 55% speedup compounds.

Adding Multimodal Support: Screenshots as Agent Input

One of Gemma 4's underrated features for agentic coding is multimodal input. You can feed the model screenshots — of a UI you're building, an error dialog, a design mock — and the agent reasons about what it sees.

This requires the Gemma 4 multimodal projector file, which maps image embeddings into the model's text space. It's included in the same Hugging Face repo as the main model and MTP draft head. The projector adds minimal disk overhead (the full folder stays around 17 GB).

In llama.cpp, you enable it with the --mmproj flag pointing to the projector file. Combined with MTP speculative decoding, you now have a local model that can:

  1. Read your code changes
  2. Look at a screenshot of the resulting UI
  3. Suggest fixes based on what it sees
  4. Execute those fixes through tool calls
  5. Repeat the cycle autonomously

All at 90+ tokens per second. All offline.

Qwen3 35B-A3B doesn't support this through the llama.cpp MTP path. Full stop. If your workflow involves any visual feedback loop — and for frontend development, that's practically every task — Gemma 4 is the only viable choice in this weight class.

I've been running exactly this setup for UI iteration work over the past week. The ability to paste a screenshot into the agent context instead of typing out "the button is 3 pixels too far to the left" changes the workflow completely. It's one of those features where once you have it, going back feels broken.

Running an OpenAI-Compatible API Server

A fast model with multimodal support is useless if your tools can't talk to it. This is the integration layer — the part that connects your local model to coding agents, IDE extensions, and anything else that speaks the OpenAI API protocol.

Two options.

Option 1: llama.cpp's built-in server. Running llama-server instead of llama-cli starts an HTTP server that exposes /v1/chat/completions and other OpenAI-compatible endpoints. Same model flags, same MTP configuration, but now accessible over HTTP on localhost. This is what Kyle Howells used in his viral setup.

Option 2: mlx-lm's server. Apple's mlx-lm package (5,800+ stars, 765 forks on GitHub) includes mlx_lm.server, which provides the same OpenAI-compatible endpoint but runs natively on the MLX framework. Installation is a single pip install mlx-lm. The server integrates directly with Hugging Face Hub, so you can point it at any of the thousands of pre-quantized models at huggingface.co/mlx-community.

The mlx-lm path is simpler — no CMake build step, no GGUF file management. But as of this writing, llama.cpp has more mature MTP speculative decoding support and gives you finer control over draft model configuration. If you're optimizing for raw throughput on Gemma 4 with MTP, llama.cpp is the better choice. If you want the easiest possible setup and are willing to trade some speed, mlx-lm gets you running in under five minutes.

Either way, once the server is running, any tool that supports custom OpenAI API endpoints can connect. Set the base URL to http://localhost:8080/v1 (or whatever port you configured), use any string as the API key (local servers don't authenticate), and you're live.

This is the pattern Apple's WWDC session gestured at but didn't walk through. And it's the pattern that makes local inference actually useful rather than a novelty. I've written about this architecture in the context of agent frameworks — the OpenAI-compatible API has become the de facto standard for tool-model communication, and running it locally just swaps the cloud endpoint for localhost.

Wiring It Into a Coding Agent: Pi, Claude Code, and Beyond

With your local server running, the last step is connecting a coding agent. This is where it stops being a demo and starts being a workflow.

Pi is the terminal-based coding agent Kyle Howells used. It connects to any OpenAI-compatible API endpoint, runs in your terminal, and can execute shell commands, read/write files, and iterate on code autonomously. For a fully local, privacy-first workflow, Pi + local Gemma 4 is the stack with zero external dependencies.

Claude Code works if you want a hybrid setup. You can configure Claude Code to use a custom API endpoint, pointing it at your local server for routine tasks while falling back to Anthropic's cloud models for the hard stuff. I've covered free Claude Code alternatives that work similarly.

Other tools that work out of the box: Continue (VS Code extension), Aider, and any agent framework that supports the OpenAI chat completions API.

Here's the thing nobody's saying about local inference: the value isn't replacing cloud models entirely. A local 26B MoE model is not as capable as Claude Sonnet or GPT-4. It's worse at complex multi-file refactors, worse at novel architectural decisions, worse at anything requiring deep reasoning across large contexts.

But it's better for a specific category of tasks: rapid iteration loops where latency matters more than raw intelligence. Fixing a lint error. Generating a boilerplate function. Writing a test. Formatting a commit message. For these, 90 tok/s locally beats 40 tok/s from a cloud API, especially when you factor in network variability and the cognitive cost of a dropped connection.

I've shipped enough features with agentic coding workflows to know the most productive setup is hybrid: local model for the fast inner loop, cloud model for the hard outer loop. The OpenAI-compatible API makes switching between them trivial.

Common Pitfalls (And How the HN Community Flagged Them)

The Hacker News discussion around Kyle Howells' post surfaced several gotchas that'll save you hours of debugging:

Benchmarking with too-short prompts. 128 tokens of generation is not enough to get reliable MTP acceptance-rate measurements. As Aurornis noted, early output has inflated acceptance rates. Use tools/llama-bench for proper sweeps, and test with realistic system prompts of 1,000–3,000+ tokens.

Skipping the -hf flag. Manually downloading GGUF files from Hugging Face is unnecessarily tedious. The -hf flag downloads models directly. Especially useful when you're iterating on quantization levels and want to quickly try Q4 vs Q5 vs Q6.

Setting --spec-draft-n-max too high. More draft tokens isn't always better. The acceptance rate drops with each additional token, and verification overhead increases. Start with 2. Benchmark. Then try 3 and see if it actually helps for your workload. For coding tasks with predictable structure, 3 sometimes wins. For open-ended generation, 2 is almost always better.

Ignoring context length configuration. The default -c 4096 context window is fine for quick tests but too small for real agent work. System prompts for coding agents easily hit 2,000–3,000 tokens, and you need room for file contents, tool outputs, and conversation history. Bump it to 8192 or 16384 if your memory allows. Larger contexts slow down prompt processing, but that's a worthwhile tradeoff.

Not building with Flash Attention. The -fa on flag enables Flash Attention, which is critical for Apple Silicon performance. Without it, you're leaving significant throughput on the table, especially at longer context lengths. I missed this on my first build and spent 20 minutes wondering why my numbers looked wrong.

MLX vs llama.cpp: Which Path Should You Choose?

I keep getting this question, and the honest answer is: it depends on what you're optimizing for.

Choose llama.cpp if:

  • You want maximum throughput with MTP speculative decoding
  • You need multimodal support via the Gemma 4 projector
  • You're comfortable building from source and managing GGUF files
  • You want fine-grained control over inference parameters

Choose mlx-lm if:

  • You want a five-minute setup with pip install mlx-lm
  • You're pulling models from mlx-community on Hugging Face
  • You value Apple-native integration and expect MLX to improve faster
  • You want LoRA fine-tuning support built in
  • You want distributed inference across multiple Macs (MLX supports this via mx.distributed)

The WWDC session positions MLX as Apple's strategic bet. The mlx-lm repository ships with quantization tools, a fine-tuning pipeline, and the OpenAI-compatible server. Apple is investing serious engineering effort here — expect MTP support and multimodal projectors to land in mlx-lm if they haven't already by the time you read this.

My prediction: within six months, the performance gap between mlx-lm and llama.cpp on Apple Silicon closes to near zero, and mlx-lm becomes the default choice for Mac developers. But right now, today, if you want the fastest possible local agentic AI setup, llama.cpp with the configuration described above is the path.

The Minimum Hardware You Actually Need

Let's be specific.

  • 24 GB unified memory (M2 Pro, M3 Pro, M4 Pro): Gemma 4 26B-A4B with MTP and projector fits. You'll have about 7 GB of headroom for the OS and other apps. Expect slightly lower speeds than the M1 Max benchmarks above due to memory bandwidth differences, but still well above 40 tok/s.

  • 32 GB unified memory (M1 Max, M2 Max, M3 Max, M4 Max): Comfortable territory. Both Gemma 4 and Qwen3 35B-A3B fit with room to spare. This is the sweet spot for most developers.

  • 64 GB+ unified memory (Max/Ultra configs): You can run multiple models simultaneously, keep larger context windows, or move to higher quantizations (Q6, Q8) for better output quality. If you're serious about local AI development, this is where you want to be.

  • 16 GB unified memory (M1, M2, M3 base): Neither Gemma 4 26B nor Qwen3 35B will fit. Period. Look at smaller models like Gemma 4 12B or Qwen3 8B. Less capable, but still useful for the rapid-iteration inner loop.

For a deeper dive on hardware tiers, I've covered this extensively in my Apple Silicon vs NVIDIA for local LLMs comparison and the M4 Max vs M5 Max breakdown.

What This Means for Mac Development Going Forward

Apple putting "Run local agentic AI on the Mac using MLX" as a headline WWDC session isn't just a technical announcement. It's a strategic statement. Apple is positioning the Mac as the developer machine for private, local AI — a direct counter to NVIDIA's RTX Spark and the entire cloud-inference ecosystem.

The pieces are coming together faster than most people realize. The hardware (unified memory architecture with massive bandwidth) was already there. The framework (MLX, with 5,800+ stars and active Apple engineering investment) is maturing fast. The models (MoE architectures like Gemma 4 that activate only 4B parameters per step) finally make 90+ tok/s possible on consumer hardware. And the ecosystem (OpenAI-compatible APIs that let any tool connect to any model) means you're not locked into Apple's tooling.

I've been building software for over 14 years, and this is one of those moments where the infrastructure is ahead of the adoption curve. The technology to run a genuinely useful local coding agent on a Mac exists right now. Most developers just don't know the steps to set it up. That's what Apple's WWDC session should have been. That's what this post is.

The developers who build local-first AI agent workflows today will have a significant advantage when the inevitable API outage, pricing change, or privacy regulation hits everyone else. Start with Gemma 4 26B-A4B, enable MTP speculative decoding, wire it into your coding agent of choice, and see what 90 tokens per second feels like when nothing ever has to leave your machine.

Frequently Asked Questions

Can I run local agentic AI on a Mac with only 16 GB of RAM?

Not with the Gemma 4 26B-A4B model discussed here — it needs around 17 GB just for the model files. You can run smaller models like Gemma 4 12B or Qwen3 8B on 16 GB machines, though they'll be less capable. For the full agentic workflow with MTP speculative decoding, 24 GB is the realistic minimum.

What is MTP speculative decoding and why does it matter?

MTP (Multi-Token Prediction) speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies those predictions in parallel with the main model. When predictions are accepted — which happens often for structured output like code — you get multiple tokens for the cost of one forward pass. On Gemma 4, this can boost generation speed from 58 tok/s to over 90 tok/s.

Is MLX better than llama.cpp for local LLMs on Mac?

Right now, llama.cpp has more mature MTP speculative decoding support and finer control over inference parameters, making it faster for this specific workflow. MLX is easier to set up (one pip install) and has Apple's active investment behind it. The performance gap is closing, and MLX will likely become the default choice for Mac developers within the next six months.

Can I use this local setup with Claude Code or other coding tools?

Yes. Both llama.cpp and mlx-lm expose OpenAI-compatible API endpoints. Any tool that lets you configure a custom API base URL — including Claude Code, Continue, Aider, and most agent frameworks — can connect to your locally running model. Set the base URL to localhost, use any string as the API key, and you're connected.

How does local Gemma 4 compare to cloud models like Claude or GPT-4?

A local 26B MoE model is not as capable as frontier cloud models for complex reasoning, multi-file refactors, or novel architectural decisions. Where it excels is speed and privacy for routine tasks: fixing lint errors, generating boilerplate, writing tests, and other high-frequency, low-complexity work. The most productive setup is hybrid — local for the fast inner loop, cloud for the hard outer loop.


Originally published on kunalganglani.com

Top comments (0)