GaltRanch

Posted on May 21 • Originally published at astrolexis.space

Apple Silicon as a Serious AI Dev Box: What an M4 Max Actually Does With a 70B Model

#apple #ai #machinelearning #mac

Originally published on the AstroLexis blog. Cross-posted here for the community.

If you're shopping for an LLM workstation in 2026, the default mental model is still "NVIDIA GPU, lots of VRAM, big tower." That's not wrong, but it's also not the only correct answer anymore. Apple Silicon (M3, M4, M5) has quietly become one of the best local AI development boxes on the market, and almost nobody outside of MLX Twitter is talking about the actual numbers. Here's what an M4 Max really does, where it crushes NVIDIA, where it doesn't, and why I built SiliconMon to see what's happening underneath.

The thesis: unified memory changes the math

The single architectural decision that makes Apple Silicon competitive for AI workloads is unified memory. On a typical NVIDIA system, the model weights live in dedicated GPU VRAM, separate from system RAM, connected by a PCIe bus. On Apple Silicon, there's one pool of memory — say, 128 GB on an M4 Max — and the CPU, GPU, and Neural Engine all see the same physical pages. No copy between host and device, no PCIe bottleneck on transfers, no juggling layers between cards.

For LLM inference, this matters more than people initially expect:

You can load a 70B parameter model in 4-bit quantization (~40 GB) directly into the unified pool, addressable by the GPU, without renting an enterprise card.
Context window expansion is painless. Going from 4K to 32K tokens of context grows the KV cache roughly eightfold, but it doesn't require swapping or specialized layer offloading; it just uses more of the same pool.
Multimodal workloads (vision encoder + LLM + speech) coexist in one address space. ClearCaps' on-device captioning pipeline runs WhisperKit, an LLM, and Apple SpeakerKit on the same chip with no inter-device coordination.
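
To make the memory math concrete, here is a back-of-envelope sketch of where the "~40 GB" and the context cost come from. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumption that roughly matches a Llama-style 70B with grouped-query attention, and 4.5 bits per weight is an assumed figure for 4-bit weights plus their quantization scales. Your model and quantization scheme may differ.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one K and one V tensor per layer, per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed shape: Llama-style 70B, fp16 KV cache (2 bytes per value).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

weights = weights_gb(70, 4.5)
kv_4k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 4_096)
kv_32k = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, 32_768)

print(f"weights:        {weights:.1f} GB")
print(f"KV cache @ 4K:  {kv_4k:.2f} GB")
print(f"KV cache @ 32K: {kv_32k:.2f} GB")
print(f"total @ 32K:    {weights + kv_32k:.1f} GB")
```

Under these assumptions the weights come to about 39 GB and the KV cache grows from about 1.3 GB at 4K to about 10.7 GB at 32K, so the whole thing still fits in a 64 GB pool with room left for the OS.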

The trade-off: total memory bandwidth on Apple Silicon (around 400-800 GB/s on the Max and Ultra tiers, less on base and Pro chips) is below a top-tier NVIDIA card (HBM3 cards push north of 3 TB/s). For pure inference throughput on small models that fit easily in a 4090's 24 GB, NVIDIA still wins. For anything larger than ~20B parameters where you'd otherwise need multi-GPU setups, Apple's unified pool starts looking very attractive.
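
Why bandwidth is the number to watch: at batch size 1, generating each token means reading essentially all of the weights once, so decode speed is capped at roughly bandwidth divided by model size. The sketch below is that rule of thumb and nothing more. It assumes a dense model and ignores KV cache reads, compute time, and framework overhead, so treat the result as a ceiling rather than a prediction.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes every generated token streams the full weights from memory once.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth figures taken from the ranges quoted in the text above.
for label, bandwidth in [("Apple, low end of range", 400),
                         ("Apple, high end of range", 800),
                         ("HBM3-class card", 3000)]:
    ceiling = decode_ceiling_tok_s(bandwidth, weights_gb=40)
    print(f"{label:26s} {bandwidth:5d} GB/s -> at most {ceiling:.0f} tok/s on a 40 GB model")
```

With these inputs the ceilings are 10 and 20 tok/s for 400 and 800 GB/s on a 40 GB model. Real decode speed lands below the ceiling, and the HBM3 figure only applies if the model fits in that card's memory in the first place.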

Real numbers on M-series for LLM inference

The tokens-per-second numbers depend heavily on quantization, framework (MLX vs llama.cpp), and whether you're measuring prefill or decode. Here's a rough baseline for decode speed on the most common configurations, with 4-bit quantized weights running on MLX:

Chip	Unified RAM	7B model	13B model	30B model	70B model
M2 Pro	32 GB	~45 tok/s	~22 tok/s	~8 tok/s	not viable
M3 Max	64 GB	~75 tok/s	~38 tok/s	~16 tok/s	~5 tok/s
M4 Max	128 GB	~110 tok/s	~55 tok/s	~28 tok/s	~10 tok/s
M3 Ultra	256 GB	~130 tok/s	~70 tok/s	~36 tok/s	~14 tok/s

For interactive use, anything above 15 tokens/second feels "instant" to a human reader. That means an M3 Max comfortably handles 30B models for interactive chat, and an M4 Max handles 70B models if you're patient on long generations.

The number that matters most for indie developers: an M4 Pro Mac mini at about $1,400 with 24 GB unified memory runs quantized 7B and 13B models at comfortably interactive speeds. That's a usable AI workstation for the price of a mid-range laptop, with zero noise, zero rack space, and 20W idle power draw.

Where Apple Silicon wins

Models that don't fit on a single consumer NVIDIA card. A 70B model in 4-bit needs ~40 GB. The biggest consumer NVIDIA card (5090) ships with 32 GB. You can split across multiple cards, but inter-card communication becomes the bottleneck. M4 Max with 128 GB swallows the whole model and has headroom for 32K context.
Power efficiency. An M4 Max under sustained inference load draws 30-50W. The equivalent NVIDIA workstation can pull 600-900W. If you're paying for electricity (anyone running 24/7 self-hosted inference) the OpEx delta is enormous.
Acoustic profile. Mac Studio is silent. Mac mini is silent. A workstation with two RTX cards is a lawnmower. For anyone working from home, this is non-negotiable.
Out-of-the-box experience. macOS + MLX + Homebrew + Ollama installs in twenty minutes and just works. CUDA-on-Linux remains a persistent source of pain.
Multimodal workflows. Unified memory means you can pipeline speech-to-text, LLM, and TTS without ever materializing intermediate buffers across PCIe.
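
To put the power gap in money terms, here is a quick sketch using the wattage ranges quoted above. The electricity price of $0.15 per kWh is a placeholder assumption, so plug in your own rate.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Cost of running a constant load around the clock for one year."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


PRICE = 0.15  # assumed electricity price in USD per kWh

for label, watts in [("M4 Max, high end", 50),
                     ("NVIDIA workstation, low end", 600),
                     ("NVIDIA workstation, high end", 900)]:
    print(f"{label:30s} {watts:4d} W -> ${annual_cost_usd(watts, PRICE):,.2f} per year")
```

This models sustained 24/7 load, which is the worst case for both machines. A box that idles most of the day will cost far less on either side.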

Where Apple Silicon loses

Training and fine-tuning. The Mac is great for inference, but the training stack (PyTorch on MPS, MLX training APIs) is still meaningfully behind CUDA. Anything beyond LoRA on small models is faster on NVIDIA.
Throughput per dollar at scale. If you're running production serving with hundreds of concurrent requests, a rack of L40S cards beats a fleet of Mac Studios on raw cost-per-token. Apple wins for development; NVIDIA wins for production serving above a certain volume.
Software ecosystem for very new research. Cutting-edge research code lands on CUDA first. The Mac port arrives weeks to months later, sometimes with reduced functionality.
Tooling visibility. NVIDIA gives you nvidia-smi, nvtop, NVIDIA Nsight, profiling tools that work on day one. macOS gives you Activity Monitor and a vague sense of where your watts are going. This last gap is why I ended up writing SiliconMon.

What you can't see (and why SiliconMon exists)

When you fire up Ollama, llama.cpp, MLX, LM Studio, ComfyUI, or vLLM on a Mac, the operating system shows you almost nothing useful. Activity Monitor reports CPU% per process, but the GPU and Neural Engine residency are invisible. Memory pressure is a single colored bar. Power draw is hidden behind powermetrics, which requires sudo and outputs an unreadable wall of text.
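
If you want to tame that wall of text yourself, a small parser goes a long way. The sketch below pulls the per-block power lines out of powermetrics output. The line format ("GPU Power: 512 mW") is an assumption based on what I see on Apple Silicon; it may differ between macOS releases, and the sample text is made up for illustration.

```python
import re

# Assumed line format, e.g. "GPU Power: 512 mW". The exact wording can vary
# between macOS releases, so treat this pattern as a starting point.
POWER_LINE = re.compile(r"^(CPU|GPU|ANE) Power:\s*(\d+)\s*mW", re.MULTILINE)


def parse_power_mw(text: str) -> dict[str, int]:
    """Return milliwatt readings keyed by block name for the lines present."""
    return {name: int(mw) for name, mw in POWER_LINE.findall(text)}


# Hypothetical excerpt standing in for the output of something like:
#   sudo powermetrics --samplers cpu_power -i 1000 -n 1
SAMPLE = """\
CPU Power: 4210 mW
GPU Power: 18350 mW
ANE Power: 0 mW
Combined Power (CPU + GPU + ANE): 22560 mW
"""

reading = parse_power_mw(SAMPLE)
print(reading)
print(f"total: {sum(reading.values()) / 1000:.2f} W")
```

This still needs sudo to collect real samples, which is exactly the friction described above.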

I'd been running multiple local LLM stacks for over a year and had no way to answer simple questions:

When I run Ollama and ComfyUI simultaneously, are they sharing the GPU or fighting for it?
Is my 70B model actually using the Neural Engine, or is it entirely on the GPU?
What's the package power draw during inference vs idle? Am I thermal throttling on a long generation?
Why does the system feel sluggish — am I swapping unified memory, or is something else going on?
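
The swap question at least has a partial answer without any extra tooling: `sysctl vm.swapusage` reports swap totals. Here is a minimal sketch that parses its output. The sample string is illustrative, and the parser assumes the usual "total = ...M  used = ...M  free = ...M" layout.

```python
import re

SWAP_FIELD = re.compile(r"(total|used|free) = ([\d.]+)([KMG])")
UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_swapusage(text: str) -> dict[str, float]:
    """Parse `sysctl vm.swapusage` output into megabytes per field."""
    return {field: float(value) * UNIT_TO_MB[unit]
            for field, value, unit in SWAP_FIELD.findall(text)}


# Illustrative sample of what the command prints on macOS.
SAMPLE = "vm.swapusage: total = 3072.00M  used = 1725.25M  free = 1346.75M  (encrypted)"

usage = parse_swapusage(SAMPLE)
print(f"swap used: {usage['used']:.0f} MB of {usage['total']:.0f} MB")
```

On a real Mac you would feed it `subprocess.run(["sysctl", "vm.swapusage"], capture_output=True, text=True).stdout`. A "used" figure that keeps climbing during a generation is a strong hint that the model plus context no longer fits in the unified pool.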

Existing tools each gave fragments. asitop wraps powermetrics, so it needs sudo, is command-line only, and goes unmaintained for long stretches. macmon and mactop are likewise terminal-only. Stats and iStat Menus are general-purpose and don't know what an MLX process is. None of them detect "this Python process is actually serving Llama 4 via vLLM" or "this is Ollama loading a Qwen3 quantization."
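
For a sense of what that kind of detection involves, here is a toy classifier that labels a process from its name and command line. The rules are illustrative guesses at common patterns, not SiliconMon's actual rule set, and a real detector would also want to look at loaded libraries.

```python
from typing import Optional, Sequence


def classify(process_name: str, argv: Sequence[str]) -> Optional[str]:
    """Guess which AI runtime a process belongs to, or None if unknown.

    The patterns below are hypothetical examples chosen for illustration.
    """
    name = process_name.lower()
    cmdline = " ".join(argv).lower()

    if name == "ollama":
        return "Ollama"
    if name.startswith("llama-"):  # e.g. llama-server, llama-cli
        return "llama.cpp"
    if name.startswith("python"):
        # A bare "python" tells us nothing; the arguments carry the signal.
        if "mlx_lm" in cmdline:
            return "MLX"
        if "vllm" in cmdline:
            return "vLLM"
        if "comfyui" in cmdline:
            return "ComfyUI"
    return None


examples = [
    ("ollama", ["ollama", "serve"]),
    ("python3.12", ["python3.12", "-m", "mlx_lm.generate", "--model", "some-model"]),
    ("llama-server", ["llama-server", "-hf", "some/repo"]),
    ("python3", ["python3", "manage.py", "runserver"]),
]
for name, argv in examples:
    print(f"{name:14s} -> {classify(name, argv)}")
```

The hard part in practice is the long tail: wrappers, renamed binaries, and Python processes that import an inference library without mentioning it on the command line.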

So I built SiliconMon. It does three things the others don't:

AI workload detection. SiliconMon recognizes the canonical names and command-line patterns of MLX, Ollama, llama.cpp, LM Studio, ComfyUI, vLLM, and Hugging Face's transformers stack. When you see "Inference 47% • Ollama: qwen3-32b" in the menu bar, that's because the detector matched the process name, command line arguments, and loaded library set.
IOReport-based residency. Real CPU/GPU/ANE residency numbers from Apple's IOReport private framework, the same source Apple uses internally. Sampled once per second, no sudo required, sub-1% CPU footprint at idle.
Energy unit correctness across chip generations. M5 Max ships IOReport channels with mixed energy units — millijoules, nanojoules, microjoules — in the same response. Getting the conversion wrong is a 30× error on power numbers. SiliconMon has explicit per-channel unit handling and a regression test for every M-series chip we support.
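
The energy unit problem is easy to show in miniature. The sketch below normalizes per-channel energy deltas to joules before dividing by the sampling interval. The channel names and unit labels are hypothetical placeholders and this is not SiliconMon's implementation; the point is only that each channel carries its own unit and an unknown unit should fail loudly instead of being guessed.

```python
# Divisors that convert a raw reading in the given unit to joules.
UNIT_DIVISOR = {"mJ": 1e3, "uJ": 1e6, "µJ": 1e6, "nJ": 1e9}


def to_joules(value: float, unit: str) -> float:
    """Convert one energy reading to joules, refusing unknown units."""
    try:
        return value / UNIT_DIVISOR[unit]
    except KeyError:
        raise ValueError(f"unknown energy unit: {unit!r}") from None


def average_watts(deltas, interval_s: float) -> float:
    """Average power over an interval from (channel, delta, unit) readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    total_joules = sum(to_joules(delta, unit) for _, delta, unit in deltas)
    return total_joules / interval_s


# Hypothetical one-second sample with three channels in three different units.
sample = [
    ("CPU Energy", 1_500, "mJ"),
    ("GPU Energy", 2_000_000, "uJ"),
    ("ANE Energy", 500_000_000, "nJ"),
]
print(f"{average_watts(sample, interval_s=1.0):.2f} W")
```

Treating all three readings as the same unit would silently produce a plausible-looking but wrong number, which is why a per-chip regression test is worth the effort.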

How to think about buying a Mac for local AI

Rough buying guide based on what I'd actually recommend to friends asking:

Hobbyist / curious: M4 Pro Mac mini, 24 GB unified, $1,400. Runs 7B and 13B models smoothly. Won't handle 30B+ comfortably. Best dollar-for-LLM machine on the market for non-pros.

Developer running local LLMs daily: M3 Max MacBook Pro 14"/16" with 64 GB unified, $3,200-3,600. Handles 30B models for interactive use, fine for 70B if you're patient.

Serious indie / small team self-hosted AI: M3 Ultra Mac Studio with 256 GB unified, $5,500-7,500. Runs 70B comfortably and 120B+ models in quantized form. Silent, sits under a desk, draws less power than a microwave. Sweet spot for self-hosted AI assistants like Kulvex AI.

Production / training: Use NVIDIA. The Mac isn't the right tool for serving at scale or training large models.

Software stack: what to install on day one

# Homebrew (if you don't have it)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Ollama — easiest entry point
brew install ollama
ollama serve &
ollama run qwen3:14b

# MLX — for Python-side LLM work
pip install mlx mlx-lm
python -m mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Hello, world"

# llama.cpp
brew install llama.cpp
llama-server -hf Qwen/Qwen3-32B-GGUF:Q4_K_M

# LM Studio — GUI alternative
# Download from https://lmstudio.ai

# SiliconMon — see what's actually happening
open https://astrolexis.space/siliconmon

The honest take

If you're already invested in CUDA, building Linux workstations, and serving inference at scale: Apple Silicon is probably not for you, and that's fine. NVIDIA's lead on production infrastructure is real and not closing soon.

If you're an indie developer, a researcher who needs to iterate locally, a security-conscious team that can't ship code to the cloud, or anyone who values a quiet, low-power, easy-to-set-up AI workstation — Apple Silicon is dramatically better than its reputation. The M4 generation is the inflection point. The M5 Max coming later this year extends the lead.

Buy the unified memory, not the cores. If you're agonizing between the cheaper config and the next tier up, always go for more RAM. Models grow, context windows grow, and you can't upgrade Mac memory after purchase.

— Bruno Galtranch, founder, AstroLexis LLC. Questions on Apple Silicon for AI: contact@astrolexis.space.
