Originally published at kunalganglani.com — read it there for inline code, hero image, and live links.
The WWDC26 Session That Started Everything
Local agentic AI on Mac refers to running autonomous coding agents entirely on Apple Silicon hardware, using frameworks like MLX and llama.cpp with Metal acceleration, without sending a single token to a cloud API. Apple made this workflow official at WWDC26 with a dedicated 13-minute session titled "Run local agentic AI on the Mac using MLX." That's Apple's first explicit endorsement of running third-party LLMs locally for agentic developer workflows. But here's the thing nobody's saying about that session: 13 minutes isn't enough to actually build anything. They gave developers a concept and no complete implementation. This post fills that gap.
The Hacker News community confirmed the demand almost immediately. Kyle Howells, an independent developer, published a hands-on benchmark post covering this exact setup and it hit 400 upvotes and 100+ comments within 19 hours. Apple's companion YouTube session on distributed inference with MLX accumulated nearly 30,000 views in its first few days. Developers aren't just curious about running local LLM inference on Mac. They're desperate for it.
I've been running local models on Apple Silicon since the M1 Pro days, and this is the first time a local agentic coding setup has felt genuinely usable. Not "technically works if you squint" usable. Actually-ship-code-with-it usable. Here's exactly how to build it.
Why Gemma 4 26B-A4B Changes the Local LLM Game on Mac
The model at the center of this setup is Google's Gemma 4 26B-A4B, and it's the reason this workflow finally works. Gemma 4 is a Mixture-of-Experts (MoE) architecture with 26 billion total parameters but only about 4 billion active parameters per token. That distinction is everything for Apple Silicon inference.
A traditional 26B dense model would be brutally slow on consumer Mac hardware. Because Gemma 4 activates only a fraction of its parameters for each token, you get the intelligence of a much larger model at the speed and memory cost of a much smaller one. The Q4_K_XL quantized GGUF file from Unsloth is approximately 16 GB. With the MTP draft head and multimodal projector included, the full folder is about 17 GB.
That's the sweet spot. On a Mac with 32 GB or more of unified memory, you can load this model entirely into GPU-accessible memory with room to spare for context and your actual development tools. I've written about local LLM hardware requirements before, and the general rule holds: you need roughly 1.2x the model file size in available memory for comfortable inference. Gemma 4 26B-A4B sits right in that window for the Mac Studio and MacBook Pro configurations most developers actually own.
If you've been following the Gemma 4 12B comparisons, the 26B-A4B variant is a real step up in reasoning quality while being only marginally slower, thanks to MoE. For agentic coding work — where the model needs to plan multi-step changes, understand file relationships, and generate coherent diffs — that extra reasoning capability is the difference between a toy and a tool.
How Fast Is Local Agentic AI on Apple Silicon?
Kyle Howells published the most thorough publicly available benchmark data for this setup, tested on an M1 Max with 64 GB unified memory running macOS 15.7.7. Here's what he found:
| Configuration | Prompt tok/s | Generation tok/s |
|---|---|---|
| Gemma 4 26B-A4B Q4_K_XL, llama.cpp Metal (baseline) | 298.0 | 58.2 |
| + MTP Q8_0 draft model (--spec-draft-n-max 3) | — | 69.2 |
| + MTP (--spec-draft-n-max 2, Unsloth recommended) | — | ~65-80+ |
The baseline of 58.2 generation tokens per second is already usable. Adding the MTP speculative decoding draft model bumps that to 69.2 tok/s — roughly a 19% improvement. Unsloth claims up to 2x speedup under optimal conditions with longer agentic prompts.
But here's the part that most people missed. The Hacker News thread exposed a critical flaw in these numbers. A commenter named Aurornis pointed out that the 128-token benchmark length is way too short for meaningful MTP measurement. Early output tokens tend to have artificially high acceptance rates. As liuliu explained in the same thread, realistic agentic workloads involve system prompts of 1,000 to 3,000+ tokens and generation runs of hundreds or thousands of tokens. At those lengths, MTP acceptance rates stabilize and the speedup becomes more consistent.
The real benchmark for a coding agent isn't 128 tokens on a trivial prompt. It's 2,000 tokens of generated code after a 3,000-token system prompt at 32k context depth.
I always tell people to benchmark at realistic workloads. Having shipped agentic coding workflows myself, I can tell you the gap between a micro-benchmark and actual usage is huge. The good news: at realistic context lengths, MTP speculative decoding tends to perform better, not worse, because longer generation runs give the draft model more opportunities to predict correctly.
The Complete Local Agentic Stack: Four Components
The full local AI agentic stack on Mac consists of exactly four components:
llama.cpp with Metal acceleration — the inference engine that talks directly to Apple's GPU via Metal. You build it from source with
cmakeand Metal support enabled. This is the runtime that actually moves tensors through the model.Gemma 4 26B-A4B in GGUF format — the main model file (
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf), approximately 16 GB. Quantized by Unsloth using their UD (Ultra Dynamic) quantization method for optimal quality-at-size.MTP Q8_0 draft head — the speculative decoding model (
gemma-4-26B-A4B-it-Q8_0-MTP.gguf). This small model predicts multiple tokens ahead, and llama.cpp verifies them against the main model in parallel. When predictions match — which happens a lot for code — you get free speed.Gemma 4 multimodal projector — enables the model to process images and screenshots. For a coding agent, this means you can feed it a screenshot of your UI and ask it to fix layout issues. Total disk footprint for all four components: approximately 17 GB.
Every piece of this stack is open source. llama.cpp is arguably the most battle-tested local inference engine in existence. The GGUF model format is a community standard. And the whole thing runs through an OpenAI-compatible API endpoint via llama-server, so any coding agent that speaks OpenAI's protocol can plug in immediately.
If you're coming from an Ollama background, this is the same underlying engine (Ollama wraps llama.cpp) but with more direct control over speculative decoding parameters and Metal-specific optimizations.
[YOUTUBE:CzgK02zsRg4|WWDC26: Explore distributed inference and training with MLX | Apple]
How to Set Up MTP Speculative Decoding Correctly
Speculative decoding with MTP (Multi-Token Prediction) is where most developers will either get a real speed boost or accidentally make things slower. The concept is straightforward: the draft model predicts the next N tokens, and the main model verifies them in a single forward pass. If the predictions are accepted, you've generated N tokens for the cost of one verification step.
The key flag is --spec-draft-n-max, which controls how many draft tokens to predict at once. The intuitive assumption is "more is better" — predict 4 or 5 tokens ahead for maximum throughput. This is wrong.
Unsloth recommends starting with --spec-draft-n-max 2, and their reasoning is sound. As you increase the number of draft tokens, the probability that all of them are accepted drops exponentially. If your acceptance rate per token is 70%, predicting 2 tokens gives you a ~49% chance of full acceptance. Predicting 4 drops that to ~24%. When draft tokens are rejected, the main model has to regenerate from the rejection point, and you've wasted compute on the rejected predictions.
The optimal number varies by model, quantization level, and even the type of content being generated. Code tends to be more predictable than natural language prose, which means higher acceptance rates and more room for aggressive draft counts. In my experience building local agentic coding setups, --spec-draft-n-max 2 is consistently safe. Bumping to 3 sometimes helps and sometimes hurts depending on the prompt structure.
Practical advice: start with 2, run llama-bench (included in llama.cpp's tools directory) at your actual system prompt length and target context size, and tune from there. Don't trust micro-benchmarks.
Why Apple Is Betting Hard on Local Agentic AI
WWDC26 was not subtle about Apple's direction. Beyond the "Run local agentic AI on the Mac using MLX" session, Apple published an entire constellation of related sessions:
- "Build agentic app experiences with the Foundation Models framework" (21:43)
- "Xcode, agents, and you" (24:03)
- "Speedrun your game port with agentic coding" (28:00)
- "Create UI prototypes using agents in Xcode" (18:11)
- "Explore distributed inference and training with MLX" (22:09)
- "Explore numerical computing in Swift with MLX"
Agentic AI is not a side feature at WWDC26. It's the overarching developer theme. Apple is telling developers: the Mac should be a first-class platform for running autonomous coding agents locally, not just a thin client for cloud APIs.
And the technical foundation is there to back it up. Apple's unified memory architecture is uniquely suited for LLM inference because the GPU and CPU share the same memory pool. No PCIe bottleneck copying tensors between system RAM and VRAM. An M4 Max with 128 GB unified memory can load models that would require multiple NVIDIA GPUs on a traditional workstation. Apple's MLX framework — now at 26,900+ GitHub stars with 1,890+ commits — is purpose-built to exploit this architecture.
The companion library mlx-lm (5,800 stars, 765 forks) provides the high-level Python and Swift APIs for running, fine-tuning, and serving LLMs. It includes an OpenAI-compatible server endpoint that works identically to llama.cpp's server mode, giving developers a choice of inference backends.
I wrote about Apple's Foundation Models strategy earlier this month, and the MLX push confirms what I suspected: Apple wants developers building on-device AI experiences that don't depend on any cloud provider. If you're building AI agents for Mac users, this is the direction to bet on.
Connecting Your Local Model to a Coding Agent
Once you have llama.cpp running with the Gemma 4 model and MTP draft head, the next step is exposing it as an OpenAI-compatible API server. llama.cpp's llama-server binary does this out of the box. You launch it with the same model flags plus a --port argument, and it serves a /v1/chat/completions endpoint that any OpenAI-compatible client can hit.
Kyle Howells uses Pi as his terminal coding agent. Pi connects to the local server endpoint, sends prompts with tool-calling instructions, and orchestrates multi-step coding workflows — reading files, making changes, running tests, iterating. Because it speaks the OpenAI protocol, switching between a local model and a cloud API is a one-line configuration change.
Same pattern I described in my post on free Claude Code alternatives. The ecosystem has converged on OpenAI's API format as the common language for LLM tool integration. Whether you're using Aider, Hermes Agent, or Pi, the local server just works.
The multimodal projector adds another dimension. With the Gemma 4 multimodal support loaded, you can pass screenshots directly to the model through the API. For UI development work, this means your agent can look at what it built, compare it to a design mockup, and iterate. All locally, all offline. No screenshots leaving your machine.
For developers who care about AI security and data privacy, this is the killer feature. Your code, your context, your screenshots — none of it touches a third-party server. I've seen enough prompt injection attacks and supply chain incidents targeting AI developer tools to know this isn't paranoia. It's pragmatism.
What the Hacker News Thread Got Right (and Wrong)
The Hacker News discussion around Kyle Howells' post generated some genuinely valuable technical corrections worth internalizing.
What the thread got right:
128 tokens is far too short for meaningful MTP benchmarks. Aurornis nailed this one — the acceptance rate is artificially high early in generation. Use llama-bench for proper sweeps. System prompt length matters enormously too. liuliu pointed out that real agentic prompts are 1,000-3,000+ tokens, which significantly affects prefill speed and overall throughput.
Context length scaling is the real test. Measuring generation speed at 32k-64k tokens matters because coding agents accumulate massive context as they read files and plan changes. Oh, and a useful practical tip from the thread: llama.cpp's -hf flag can download models directly from HuggingFace, saving the manual download step.
What the thread missed:
Most commenters obsessed over raw generation speed without talking about quality. A coding agent that generates 80 tok/s but writes broken code is worse than one at 50 tok/s that ships correct changes. Gemma 4 26B-A4B's MoE architecture gives it reasoning capabilities that punch well above what you'd expect from 4B active parameters. After working with various local LLMs for coding, I've found that model intelligence at the task level matters more than raw speed once you cross the usability threshold. This setup crosses that threshold decisively.
The thread also skipped over Apple's MLX framework as an alternative backend. While llama.cpp with Metal is the more mature option right now, mlx-lm is catching up fast and offers tighter integration with Apple's ecosystem. If you're a Swift developer building native Mac apps with embedded AI agent capabilities, MLX is probably the better long-term bet.
Minimum Hardware Requirements for This Setup
I've benchmarked local LLM setups across Apple Silicon tiers, so let me be specific about what you actually need for the Gemma 4 26B-A4B agentic stack:
| Mac Configuration | Feasibility | Expected Performance |
|---|---|---|
| M1/M2/M3 with 16 GB | ❌ Not feasible | Model won't fit with usable context |
| M1/M2/M3 with 24 GB | ⚠️ Tight | Model loads but limited context window, ~30 tok/s |
| M1/M2/M3 Pro/Max with 32 GB | ✅ Good | ~40-55 tok/s, comfortable 8k context |
| M1/M2/M3/M4 Max with 64 GB | ✅ Excellent | 58-80+ tok/s, 32k+ context |
| M4 Max/Ultra with 128+ GB | ✅ Overkill | Can run larger models or multiple agents |
The sweet spot for most developers is 32-64 GB of unified memory. The model file is 16 GB, and you need headroom for KV cache (which scales with context length), the MTP draft model, the multimodal projector, and your actual operating system and dev tools.
If you're on 24 GB, look at the Gemma 4 12B variant instead. Less capable for complex agentic tasks, but it'll run comfortably with room to spare.
For anyone considering a hardware upgrade specifically for local AI work, I've compared the Mac Studio vs PC builds and the M5 Max's AI capabilities. The unified memory advantage on Apple Silicon is real for this workload. It's not marketing.
Beyond Gemma 4: Alternative Models Worth Trying
Gemma 4 26B-A4B is the current best choice for this stack, but it's not the only option. Kyle Howells also tested Qwen3.6 35B-A3B as an alternative MoE model. The broader ecosystem of models that work well here:
- Qwen3.6 35B-A3B — another MoE architecture with different strengths, particularly strong on multilingual code and longer-context reasoning
- Mistral models via Ollama or llama.cpp — smaller Mistral variants work well when you need lower latency over peak intelligence
- DeepSeek Coder variants — solid if your use case is pure code generation without the multimodal requirement
One thing to know: MTP speculative decoding is model-specific. Not every model ships with an MTP draft head. Gemma 4 is the best-supported option for this technique right now, which is a big reason it's my default recommendation. As more model families adopt MTP, this will change.
For developers who want to try multiple models, the OpenAI-compatible API layer makes switching trivial. Point your coding agent at a different llama-server instance and the rest of your workflow stays identical. Same agent framework portability principle I've talked about in the context of vibe coding tools. Decouple the agent from the model.
What This Means for the Future of Mac Development
Apple's WWDC26 push on local agentic AI isn't just about faster inference. It's about making the Mac the definitive platform for AI-powered development. MLX framework, Metal-accelerated inference, unified memory architecture, native Swift integration. No other consumer hardware platform can match this combination for running local agents.
Here's my prediction: within the next year, Xcode will ship with built-in support for local AI agents that run entirely on your Mac. The WWDC26 sessions on "Xcode, agents, and you" and "Create UI prototypes using agents in Xcode" point directly to this. Apple isn't building MLX as a curiosity. They're building infrastructure.
For developers who care about privacy, cost, and offline reliability, this is the production AI setup to invest in today. You're not paying per token. You're not sending proprietary code to a third-party API. You're not dependent on internet connectivity. And with Gemma 4's MTP speculative decoding delivering 60-80+ tok/s on hardware most developers already own, you're not sacrificing usability either.
By the end of 2026, "local-first agentic coding" will be a standard workflow for Mac developers, not an enthusiast curiosity. Apple just made that official. The 13-minute session was the announcement. This guide is the implementation.
If you're building agentic AI workflows on Mac, stop waiting for Apple to finish writing the tutorial. Build the stack now, start shipping code with it, and be ready when Xcode's native agent support lands. The future of agentic development isn't in the cloud. It's on your desk.
Frequently Asked Questions
Can I run a local coding agent on a Mac with only 16 GB of RAM?
Not with Gemma 4 26B-A4B. The model file alone is 16 GB, leaving no room for context, the operating system, or your development tools. You'd need to use a smaller model like Gemma 4 12B or a 7-8B parameter model instead. For the full agentic stack described here, 32 GB is the practical minimum.
How does MTP speculative decoding actually speed up local LLM inference?
MTP uses a small draft model to predict multiple tokens ahead of the main model. The main model then verifies these predictions in a single forward pass. When the predictions match — which happens frequently for structured content like code — you effectively generate multiple tokens for the cost of one verification step. The speedup ranges from 15-100% depending on content type and configuration.
Is the local coding agent as smart as Claude or GPT-4 for coding tasks?
No. Gemma 4 26B-A4B is a capable model, but frontier cloud models like Claude Sonnet 4.6 and GPT-4.1 still outperform it on complex reasoning and large-scale refactoring tasks. The tradeoff is clear: local agents give you privacy, zero cost per token, and offline access, while cloud agents give you peak intelligence. Many developers use both — local for routine tasks and cloud APIs for complex ones.
What is Apple's MLX framework and how does it relate to llama.cpp?
MLX is Apple's open-source array framework designed specifically for machine learning on Apple Silicon. It's similar in purpose to PyTorch but optimized for unified memory and Metal GPU acceleration. llama.cpp is a separate, community-built inference engine that also supports Metal. Both can run the same models on Mac. MLX offers tighter Apple ecosystem integration, while llama.cpp has a larger community and more mature tooling for speculative decoding.
Do I need to be online to use this local agentic coding setup?
No. Once you've downloaded the model files (approximately 17 GB total), everything runs entirely offline. The coding agent connects to a local API server on your machine. No internet connection is required for inference, and no code or context data leaves your computer.
Which Apple Silicon chip is best for running local AI coding agents?
The M4 Max with 64 GB or 128 GB unified memory currently offers the best balance of performance and value. The M4 Ultra is faster but significantly more expensive. Even the older M1 Max with 64 GB (as tested in Kyle Howells' benchmarks) delivers perfectly usable 58-69 tok/s speeds. The key factor is memory capacity, not chip generation — more memory means larger models and longer context windows.
Originally published on kunalganglani.com
Top comments (0)