Apple's M4 chip put a Neural Engine and unified memory into laptops and desktops that don't require a server budget. For developers who want to run language models without OpenAI's bill, the 24GB MacBook Air or Mac mini is the cheapest serious entry point. The question isn't whether local LLMs work on it — they do. The question is which models fit, how fast they run, and where the cliffs are.
We tested this configuration the way most readers will use it: a base M4 (not Pro or Max), 24GB unified memory, macOS Sequoia 15.x, running Ollama and llama.cpp against models we'd actually use for coding, summarization, and JSON-mode tool calls.
The 24GB Memory Budget
Unified memory means your CPU, GPU, and Neural Engine share one pool. On a 24GB machine, macOS reserves a chunk for itself and other apps; by default the GPU can address roughly 16-18GB of it. You can raise that ceiling with sudo sysctl iogpu.wired_limit_mb=20480 to give Metal more headroom, but push too far and the system starts swapping, and the kernel refuses outright if you ask for more than it considers safe. Conservatively, plan on ~18GB for model weights plus KV cache.
That budget rules out 70B-class models entirely (a 70B Q4_K_M GGUF is ~40GB) and makes 30B-class models a tight squeeze. The realistic sweet spot is 7B-14B parameters at 4-bit quantization, with 32B at 4-bit working if you close everything else.
Quick math for GGUF Q4_K_M weights:
- 7B: ~4.5 GB
- 8B (Llama 3.1): ~4.9 GB
- 13B: ~7.5 GB
- 14B (Qwen 2.5): ~9 GB
- 22B (Mistral Small): ~13 GB
- 32B (Qwen 2.5): ~19 GB
- 70B: ~40 GB (won't fit)
Add 1-3GB for KV cache depending on context length, and you can see where the cliff is.
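The KV cache portion is easy to estimate yourself. A minimal sketch — the per-token formula (2 tensors × layers × KV heads × head dim × element size) is the standard transformer KV cache accounting; the Llama 3.1 8B shape numbers come from its published config, and the function name is ours:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache size in GiB: 2 (K and V) * layers * KV heads * head dim
    * element size, per token, times context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
print(kv_cache_gb(32, 8, 128, 8192))   # → 1.0 GiB at 8k context
print(kv_cache_gb(32, 8, 128, 32768))  # → 4.0 GiB at 32k context
```

Grouped-query attention is why 8B stays cheap here: with full multi-head attention (32 KV heads instead of 8), the same context would cost 4× as much.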
For inference speed on Apple Silicon, the headline number is memory bandwidth, not core count. The base M4 has ~120 GB/s; the M4 Pro hits ~273 GB/s; the M4 Max reaches ~410 GB/s. A model that decodes at 7 tokens/sec on a base M4 will run roughly twice as fast on an M4 Pro with the same prompt and quantization. If you're choosing hardware specifically to run LLMs, bandwidth matters more than RAM size once you're past 24GB.
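You can turn the bandwidth numbers into a back-of-envelope speed ceiling: each decoded token streams the full weight set from memory once, so bandwidth divided by weight size bounds tokens/sec. A rough sketch (function name ours; it ignores KV cache reads and compute overhead, so real numbers land at or below it):

```python
def decode_ceiling(bandwidth_gb_s, weight_gb):
    """Upper bound on decode tokens/sec: every token reads all weights once."""
    return bandwidth_gb_s / weight_gb

# 8B at Q4_K_M is ~4.9 GB of weights
print(round(decode_ceiling(120, 4.9), 1))  # base M4 → 24.5
print(round(decode_ceiling(273, 4.9), 1))  # M4 Pro  → 55.7
```

Note how closely the base-M4 ceiling matches the measured 24-28 tok/s for the 8B model: decode really is bandwidth-bound on this hardware.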
What Models Actually Run Well
On a base M4 with 24GB, here's what we measured running Ollama 0.4.x with default settings on a freshly booted machine. Numbers are decode tokens/sec on a 200-token prompt with 500-token output, single user, no batching.
- Llama 3.1 8B Q4_K_M: 24-28 tok/s. Excellent for code completion, summarization, and tool use. The 8B model is the default we'd suggest if you only install one.
- Qwen 2.5 Coder 7B Q4_K_M: 26-30 tok/s. Stronger than Llama 3.1 8B on code-specific tasks (HumanEval and MBPP scores are higher in the official paper). Replace your generalist 8B with this if you mostly write code.
- Qwen 2.5 14B Q4_K_M: 12-14 tok/s. Noticeably smarter on reasoning prompts. Still usable interactively if you're not waiting on it letter-by-letter.
- Mistral Small 22B Q4_K_M: 7-9 tok/s. Slow enough that you'll feel it. We'd reach for this only when 14B clearly fails.
- Qwen 2.5 32B Q4_K_M: 4-6 tok/s. Technically fits with the iogpu.wired_limit_mb bump, but the system starts swapping under any other load. Run it only when you have nothing else open.
Prompt processing (the time before the first token) scales with prompt length. A 4,000-token prompt on the 14B model takes ~12 seconds to ingest before output starts. For agentic coding workflows that stuff a whole file into context, this matters more than steady-state tokens/sec.
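To see why, split total latency into prefill and decode. The ~12 seconds for a 4,000-token prompt implies roughly 330 tok/s of prompt processing on the 14B model; combined with its 12-14 tok/s decode, a hedged sketch (derived rates, not separate measurements):

```python
def total_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Prefill (prompt ingest) time plus decode (generation) time, in seconds."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# 14B on base M4: ~330 tok/s prefill, ~13 tok/s decode (midpoint of 12-14)
print(round(total_latency_s(4000, 500, 330, 13), 1))  # → 50.6 s end to end
```

For a long-prompt, short-answer workload (say 4,000 tokens in, 50 out), prefill dominates; for chat-style short prompts with long answers, decode speed is what you feel.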
Ollama vs llama.cpp vs MLX
Three tools, three audiences.
Ollama wraps llama.cpp with a model registry, automatic GGUF downloads, and a REST API on localhost:11434. The CLI is two commands: ollama pull qwen2.5-coder:7b and ollama run qwen2.5-coder:7b. This is where you should start. The OpenAI-compatible endpoint at /v1/chat/completions means most existing client libraries work without changes.
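Because the endpoint is OpenAI-compatible, a plain HTTP call is all you need. A minimal stdlib-only sketch — it assumes a local Ollama daemon with the model already pulled, and the helper names are ours:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model, prompt):
    # Standard OpenAI chat-completions shape; Ollama's /v1 endpoint accepts it as-is
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model, prompt):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With Ollama running:
# print(chat("qwen2.5-coder:7b", "Write a one-line Python hello world."))
```

Swapping a cloud provider for this is usually a one-line change in existing clients: point the base URL at localhost:11434/v1 and set the model name.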
llama.cpp is what Ollama runs underneath. Use it directly when you need flags Ollama doesn't expose: speculative decoding, grammar-constrained output, custom RoPE scaling, or KV cache quantization. The -fa flash attention flag and -ctk q4_0 -ctv q4_0 (quantized KV cache) together can let you push context length significantly further on a 24GB machine.
MLX is Apple's native ML framework. The mlx-lm package supports the same models in a different format (look for mlx-community/*-4bit repos on Hugging Face). On the same model and quantization, MLX is typically 10-25% faster than llama.cpp on Apple Silicon because its kernels are written natively for Metal and unified memory rather than ported to them. The downside is a smaller ecosystem and fewer integrations. If you only need one model for a specific app, MLX is worth the switch.
When Local Beats Cloud
Local LLMs aren't replacing Claude or GPT-4 for every task. The honest tradeoff: a 4-bit 14B model on your laptop is roughly comparable to GPT-3.5 from 2023 on most benchmarks. It loses to current frontier models on hard reasoning, long-context retrieval, and instruction following.
Where local wins:
- Privacy-sensitive code review: you control where the prompt and source go.
- Batch processing: a 5,000-document summarization job over a weekend costs you electricity, not API tokens.
- Offline development: airplanes, training rooms, anywhere the WiFi is unreliable.
- Tool-use prototyping: iterate on tool schemas without paying for each test run.
- Latency-sensitive autocomplete: 30 tokens/sec locally beats cloud round-trip latency for short completions.
If your workflow is "ask a hard question once a day," cloud models are still the right answer. If it's "make 500 cheap calls a day to summarize, classify, or autocomplete," the math favors a one-time hardware purchase.
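That math is easy to run yourself. A rough break-even sketch — every number here (hardware price, per-call API cost, power draw, electricity rate) is an illustrative assumption, not a measurement:

```python
def breakeven_days(hardware_cost, calls_per_day, api_cost_per_call,
                   watts=30, kwh_price=0.15):
    """Days until a one-time hardware purchase beats per-call API pricing.
    Assumes the machine runs 24/7 at the given average wattage."""
    api_daily = calls_per_day * api_cost_per_call
    power_daily = watts / 1000 * 24 * kwh_price
    saving = api_daily - power_daily
    if saving <= 0:
        return float("inf")  # API is cheaper than the electricity alone
    return hardware_cost / saving

# e.g. a hypothetical $999 Mac mini vs 500 calls/day at $0.01/call
print(round(breakeven_days(999, 500, 0.01)))  # → 204 days
```

The shape of the result is the point, not the exact day count: at hundreds of calls per day the hardware pays for itself within a year, while at a handful of calls per day it never does.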