I Run a 40GB AI Model on a MacBook. Three Months of MLX on M1 Max Has Changed How I Think About Apple Silicon.
It's Just a Laptop. But It's Running a 40GB Model Right Now.
I'm drafting this on a MacBook Pro. Qwen 3.6 35B-A3B MoE Q8 — about 40GB of weights — is pinned in Metal memory right now, and the fan is quiet.
That sentence still feels weird to write. A year ago I would have assumed "run a 35B model locally" meant a dedicated rig with an H100, or at least a pair of 4090s. Turns out it means a MacBook Pro M1 Max with the 64GB unified memory variant, MLX, and about a weekend of config tuning.
This post is a three-month dev diary on that setup. Not a product review. Not a "10x your AI productivity" take. Just what I've learned that isn't in the Apple keynote or the MLX README.
And since Tim Cook has been CEO for 14+ years with no named successor, I ended up thinking about what changes if the person running Apple changes — and what doesn't. Short version: a lot less than most market takes assume. The laptop on my desk is why.
The Setup: 64GB of Unified Memory, One Model, Zero Cloud
Hardware is an M1 Max MacBook Pro with the full 64GB unified memory. Yes, it's a $3k-class setup. That's the first honest thing to say.
The model is Qwen 3.6 35B-A3B MoE, Q8 quantization. Weights are ~40GB in Metal memory via mx.metal.set_wired_limit(45GB). That pin is load-bearing — without it the macOS memory compressor will happily try to page out the model while you're mid-inference.
Hard ceiling at set_memory_limit(48GB). Scratch buffers capped at set_cache_limit(512MB). Buffer left for OS + apps: ~14-16GB, tight but stable. Everything runs offline. No cloud fallback. No API key. Just the laptop.
For that ~14-16GB buffer to actually hold: no Docker, no 30-tab Chrome session. I used to keep Chrome open with dozens of tabs; the memory pressure during long inference was noticeable enough that I stopped. My background load during heavy generation is Xcode (SwiftUI work) + terminal + editor. That's it.
The Q8 Tax: Trading Speed for Sanity
I moved from Q4 to Q8 on April 17. The motivation was pure quality. Q4 output was noticeably more muddled on longer reasoning tasks, especially anything requiring numerical precision or sustained argument.
Q8 runs in the 35-50 tok/s range depending on context length. Q4 was faster — probably 10-15% more tok/s — but the output just wasn't as good. When you're generating content you'll actually publish, that tradeoff isn't close.
The honest take: if your use case is chat-style short responses, Q4 might be fine. For long-form drafting, research synthesis, or anything that has to be correct-ish without a human checking every sentence, Q8 earns its extra memory.
The fp16 Moment: 21.18 to 26.22 tok/s From One Env Var
Running MLX on M1 Max defaults to bf16 for many kernels. For Qwen 3.6 MoE specifically, that was costing real throughput.
Setting MLX_FORCE_FP16=1 in the LaunchAgent environment bumped tok/s from 21.18 to 26.22. That's +24% from one flag. No recompile. No re-quantization. No weight re-download.
I don't know the full story of why bf16 is the default if fp16 wins here — the MLX team almost certainly has a good reason at the kernel level. But empirically, on this hardware with this model, the flag is free speed.
Persisted it in the LaunchAgent plist, restarted, never looked back.
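For reference, the override can live directly in the LaunchAgent plist via launchd's EnvironmentVariables key. A minimal sketch (the label, binary path, and KeepAlive setup here are illustrative, not my exact plist):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.mlx-backend</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/mlx-backend</string>
    </array>
    <!-- launchd injects these into the child process's environment -->
    <key>EnvironmentVariables</key>
    <dict>
        <key>MLX_FORCE_FP16</key>
        <string>1</string>
    </dict>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

Reload with launchctl and the flag survives reboots, which is the whole point of persisting it there.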
What Metal Memory Actually Wants: 45GB Wired, 48GB Ceiling, 512MB Scratch
Out of the box, Apple's memory compressor is aggressive. It will look at your 40GB model sitting in RAM, decide some of it is "idle," and start compressing pages. Every decompression on a subsequent inference is thrash.
The fix for MLX on M1 Max is a three-line config (pseudo-code — real calls take bytes, I'm using GB suffixes for readability):
- set_wired_limit(45GB): weights stay pinned, the compressor can't touch them
- set_memory_limit(48GB): hard ceiling, prevents runaway scratch buffers
- set_cache_limit(512MB): caps the Metal compile cache
Before this, compressed swap on my machine was 19.69GB. After, it sits at 1.7GB. That's a 10x improvement on memory pressure from three lines of config. The buffer for macOS + Chrome + everything else stays at ~14-16GB, which survives a full day of normal laptop use. (I wrote up the full debugging path for the memory compression issue here — it took me longer than I'd like to admit to figure out.)
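In actual MLX code those three lines look roughly like this. A sketch, not gospel: the limit APIs take byte counts, and the exact module path for these calls has shifted between MLX releases, so treat mx.metal.* as illustrative for whatever version you're on:

```python
def gb(n: float) -> int:
    """Convert gigabytes to the raw byte counts the MLX limit APIs expect."""
    return int(n * 1024**3)

try:
    import mlx.core as mx

    # Pin the weights in Metal memory so the macOS compressor can't touch them.
    mx.metal.set_wired_limit(gb(45))
    # Hard ceiling on total Metal allocations (weights + scratch).
    mx.metal.set_memory_limit(gb(48))
    # Cap the Metal compile/buffer cache.
    mx.metal.set_cache_limit(gb(0.5))
except ImportError:
    pass  # not on Apple Silicon, or MLX not installed
```

Run this once at backend startup, before the first model load, so the weights land inside the wired region from the start.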
The MoE Saturation Wall at 500 Tokens (The Thing Nobody Warns You About)
Qwen 3.6 is a Mixture-of-Experts model. On paper, sparse activation means you're only touching a fraction of weights per token, which is why it fits in 40GB at all.
What the papers don't emphasize: MoE models have a soft quality ceiling on single generation length. For Qwen 3.6 specifically, output degrades past roughly 500 tokens. Past 800 you start getting word salad. Past 1500 you get paragraphs that apologize to themselves mid-sentence.
The workaround is sectional generation. Split long outputs into 250-400 token sections, generate each independently, concatenate. State resets between calls. The model stays coherent the whole way through.
I automated it: a FastAPI endpoint that takes a research brief plus an ordered list of sections (heading + 1-sentence instruction + target word count) and fires one MLX call per section with max_tokens hard-capped under the degen zone. No shared context across calls. Outputs concatenate into a full draft. Maybe 40 lines of Python. If there's interest I'll clean it up and drop it as part of a small OSS package alongside the memory-safe runtime config.
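Stripped of the FastAPI wrapper, the core loop is small enough to show here. This is a sketch under my assumptions: generate stands in for whatever per-call MLX generation function you wrap (one fresh call per section, no shared context), and the 400-token cap mirrors the margin under the degen zone:

```python
from typing import Callable, List, Tuple

MAX_SECTION_TOKENS = 400  # stay safely under the ~500-token degradation zone

def draft_sections(brief: str,
                   sections: List[Tuple[str, str, int]],
                   generate: Callable[[str, int], str]) -> str:
    """Generate each section independently, then concatenate into one draft.

    sections: (heading, one-sentence instruction, target word count).
    generate: any (prompt, max_tokens) -> text callable; state resets
    between calls because each call is a brand-new generation.
    """
    parts = []
    for heading, instruction, words in sections:
        prompt = f"{brief}\n\n## {heading}\n{instruction} (~{words} words)"
        # Hard-cap tokens so a single call never enters the degen zone.
        parts.append(f"## {heading}\n" + generate(prompt, MAX_SECTION_TOKENS))
    return "\n\n".join(parts)
```

Swap in a real MLX call for generate and this is most of the endpoint; the FastAPI layer is just request parsing around it.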
This isn't an MLX issue. It's how MoE attention routing behaves under sustained sampling. Took me a while to isolate the variable.
The 4 AM Ghost: Managing Metal's Memory Drift
Even with wired_limit pinning, Metal accumulates scratch buffers over time. Long inference sessions leave compile cache and intermediate allocations that don't always free cleanly. After a couple of days of uptime, tok/s drifts down 5-10%.
The fix is a scheduled restart. A LaunchAgent kills the backend every day at 4 AM local time, and KeepAlive brings it straight back up. Takes about 60 seconds end-to-end, roughly 40 of which are MLX warmup.
It's not elegant. A properly designed memory system wouldn't need this. But it works, it's invisible because it runs while I sleep, and the next morning tok/s is back at baseline. I'll take a cron job over a memory leak any day.
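launchd can do the 4 AM kick natively, no cron needed. A sketch of the restart job (label, target job name, and the gui/501 domain are illustrative; 501 is the default first-user UID on macOS), using launchctl kickstart -k against the KeepAlive'd backend job:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.mlx-backend.restart</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/launchctl</string>
        <string>kickstart</string>
        <!-- -k kills the running instance before restarting it -->
        <string>-k</string>
        <string>gui/501/com.example.mlx-backend</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>4</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
```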
What I Actually Lose vs Cloud (And What I Don't)
Honest comparison. What you lose going local:
- Peak throughput: 26 tok/s here vs ~60-100 tok/s on cloud APIs
- Context window: 32k practical on this setup vs 200k+ cloud
- Scale: one user at a time vs unlimited parallel
What you don't lose:
- Quality: Q8 is close enough to cloud that most tasks don't notice
- Latency: sub-1s first token local vs 500-1500ms network round-trip
- Cost: $0 marginal per call vs $3-15 per million tokens
- Privacy: weights and prompts never leave the laptop
- Availability: works offline, works when the cloud provider has an outage
For a solo dev with one user (me), the tradeoff leans local hard. Mileage varies if you're serving an API.
The Thing Nobody Prices About Apple Silicon: Unified Memory
Here's the structural point most Apple Silicon takes miss.
On x86 + Nvidia, VRAM is separate from system RAM. A $3k gaming laptop ships with at most 16GB of VRAM — physically cannot hold Qwen 35B Q8, period. To match the 40GB I'm using here, you'd need two RTX 3090s (24GB each, NVLink bridge to share weights): ~$1,400-1,800 used for the cards alone, plus PSU, case, cooling, CPU. Easily another $1,500 before you have a running machine. And even then each forward pass is sharding across PCIe — not unified memory. Two 4090s don't even solve it cleanly because Nvidia dropped NVLink on the 4090 line.
Meanwhile this thing fits in a backpack and runs at a quiet coffee shop.
On Apple Silicon, the 40GB of model weights live in the same physical RAM the OS and Chrome use. No PCIe bottleneck between CPU and GPU compute — they literally share memory. That's not a Metal-is-faster-than-CUDA claim (per-op, it usually isn't). It's an architecture claim.
Which is why this MacBook runs models that most gaming desktops physically cannot. The chip speed is a subplot. The memory layout is the actual moat. (I made a longer version of this argument here, back when I was still surprised it was working at all.)
Three Months In, I'm Long the Ecosystem
Three months of MLX on M1 Max later, here's what I actually believe: I'm long the ecosystem, not the CEO.
Whoever succeeds Tim Cook next can reshape pricing, Services tiers, or the iPhone upgrade cadence. They can't reverse unified memory architecture in a quarter. They can't make pip install mlx-lm any harder than it already is, which is one command. They can't retroactively ship a gaming laptop with 40GB of usable VRAM for $3k.
The developer experience moat — pip install mlx-lm and you're done, with CUDA nowhere in sight — compounds quietly every time a solo dev gets a 35B model to run on their first try. That's the flywheel the market underprices.
I could be wrong on the broader empire thesis. But the laptop on my desk still runs the model. That floor doesn't move.
Come along for the ride — see me fall or thrive, whichever comes first.