DEV Community

SleepyQuant

Posted on • Originally published at sleepyquant.rest

How I Budget 64 GB Unified Memory on M1 Max for a 35B Model + Long-Running Agent Loops

The first lie I had to unlearn after buying a 64 GB Mac for local LLM work was that I had 64 GB to use for the model.

You don't. After macOS, your browser, your editor, and whatever else you keep open during a workday, the actual usable headroom for ML is about 44-46 GB. That's enough for a 35B-parameter model in Q8 with some breathing room, but only if you're explicit about what else is allowed to live in memory.

This is the budget I run, what it leaves for other work, and how to recalculate for your own setup.

The actual budget on my Mac

Here's the layout I'm running right now, mid-workday with everything I normally have open:

M1 Max 64 GB Unified Memory
─────────────────────────────────────────────────
│ macOS kernel + system services       6.5 GB  │
│ WindowServer + UI compositor         1.8 GB  │
│ Safari (8 tabs, mid-weight)          2.4 GB  │
│ Swift IDE (Xcode-class)              2.7 GB  │
│ Spotlight + background indexing      0.5 GB  │
│ Discord                              0.3 GB  │
│ Terminal + tmux sessions             0.4 GB  │
│ Chrome (3 tabs incl. one heavy SPA)  4.0 GB  │
├───────────────────────────────────────────────┤
│ SYSTEM + WORKFLOW SUBTOTAL          18.6 GB  │
├───────────────────────────────────────────────┤
│ Python runtime + libs                1.2 GB  │
│ MLX model weights (35B Q8)          35.0 GB  │
│ Metal cache (capped)                 0.5 GB  │
│ Agent context buffers                2.0 GB  │
├───────────────────────────────────────────────┤
│ ML SUBTOTAL                         38.7 GB  │
├───────────────────────────────────────────────┤
│ Free + reclaimable buffer            6.7 GB  │
└───────────────────────────────────────────────┘
TOTAL ALLOCATED:                      64.0 GB

That ~6.7 GB free buffer is what I have left for spikes: Chrome opening a heavier tab, a Spotlight reindex burst, a build kicking off. If the buffer drops under 3 GB, macOS starts compressing memory aggressively, and inference latency spikes.

The number I tune to: keep system + workflow under 20 GB so ML has at least 44 GB to play with, including buffer.
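The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch: the figures are copied from the table above, while the function name `summarize` and the two threshold constants are illustrative names for the 20 GB ceiling and the 3 GB buffer floor mentioned in the text.

```python
# Figures copied from the budget table above, in GB.
SYSTEM_AND_WORKFLOW = {
    "macOS kernel + system services": 6.5,
    "WindowServer + UI compositor": 1.8,
    "Safari": 2.4,
    "Swift IDE": 2.7,
    "Spotlight + background indexing": 0.5,
    "Discord": 0.3,
    "Terminal + tmux sessions": 0.4,
    "Chrome": 4.0,
}
ML = {
    "Python runtime + libs": 1.2,
    "MLX model weights (35B Q8)": 35.0,
    "Metal cache (capped)": 0.5,
    "Agent context buffers": 2.0,
}
TOTAL_GB = 64.0
WORKFLOW_CEILING_GB = 20.0  # tuning target from the text
MIN_BUFFER_GB = 3.0         # below this, compression gets aggressive


def summarize(system, ml, total_gb=TOTAL_GB):
    """Return subtotals, the leftover buffer, and two pass/fail checks."""
    system_gb = round(sum(system.values()), 1)
    ml_gb = round(sum(ml.values()), 1)
    buffer_gb = round(total_gb - system_gb - ml_gb, 1)
    return {
        "system_gb": system_gb,
        "ml_gb": ml_gb,
        "buffer_gb": buffer_gb,
        "workflow_ok": system_gb < WORKFLOW_CEILING_GB,
        "buffer_ok": buffer_gb >= MIN_BUFFER_GB,
    }


report = summarize(SYSTEM_AND_WORKFLOW, ML)
print(report)
```

Swap in your own numbers per app; if `workflow_ok` or `buffer_ok` comes back False, something has to close before the model loads.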

Why 35B Q8 specifically fits

Different model sizes and quantizations land in different memory bands. Rough numbers for the common ones I've tested or measured:

| Model size | Quant | Resident memory | What's left on 64 GB Mac |
|---|---|---|---|
| 7B | Q4 | ~4 GB | ~42 GB (comfortable) |
| 7B | Q8 | ~7 GB | ~39 GB (comfortable) |
| 14B | Q4 | ~8 GB | ~38 GB (comfortable) |
| 14B | Q8 | ~14 GB | ~32 GB (comfortable) |
| 32B | Q4 | ~18 GB | ~28 GB (comfortable) |
| 32B | Q8 | ~32 GB | ~14 GB (tight) |
| 35B MoE | Q4 | ~19 GB | ~27 GB (comfortable) |
| 35B MoE | Q8 | ~35 GB | ~11 GB (very tight) |
| 70B | Q4 | ~38 GB | ~8 GB (won't run with my workflow) |
| 70B | Q8 | ~70 GB | doesn't fit at all |

35B Q8 is the largest model where I can still keep my normal dev workflow open. Anything bigger and I have to close apps to make room. 70B Q4 technically fits but leaves no headroom for the agent loop or browser.

This is also why I swapped from Q4 to Q8 instead of going from 35B to 70B. Q8 of the same model gave me a quality lift I could measure on real outputs; 70B Q4 would have forced me to close half my workspace. Quality-per-headroom favored the upgrade I made.
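The resident-memory column in the table above roughly follows parameters times bits per weight, divided by 8. Here is a rough sketch of that estimate. Two things are my assumptions rather than measurements from the post: the ~4.5 effective bits for Q4 (4-bit weights plus per-group scales), and the ~46 GB ML budget, which is inferred from the table's "what's left" column. Real footprints vary by quantization scheme and runtime.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def headroom_gb(params_billion, bits_per_weight, ml_budget_gb=46.0):
    """Memory left in the ML budget after loading the weights.

    ml_budget_gb defaults to ~46 GB, inferred from the table above.
    """
    return ml_budget_gb - weights_gb(params_billion, bits_per_weight)


# Assumed effective bits per weight (Q4 includes scale overhead).
Q4_BITS = 4.5
Q8_BITS = 8.0

for size in (7, 14, 32, 35, 70):
    for name, bits in (("Q4", Q4_BITS), ("Q8", Q8_BITS)):
        print(
            f"{size}B {name}: ~{weights_gb(size, bits):.0f} GB weights, "
            f"~{headroom_gb(size, bits):.0f} GB left"
        )
```

The estimate lands within a GB or two of the table for every row, which is close enough for deciding what to try loading.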

How to measure your own baseline

The fastest way to see your actual numbers: open Activity Monitor, switch to the Memory tab, sort by Memory descending. The "Memory Used" total at the bottom shows your committed footprint. The "Memory Pressure" graph shows whether macOS is comfortable or struggling.

For a more precise read, three terminal commands:

# System-wide memory state
memory_pressure -Q | head

# Per-process resident memory in KB, largest first (top 15)
ps -axm -o rss,command | sort -nr | head -15

# Page counts: active, wired, compressed, free (page size is on the first line)
vm_stat

Run these mid-workday with everything you normally have open, before you load the model. That's your baseline. Subtract from 64 GB. Whatever's left is your ML budget.
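If you'd rather have a single number than read page counts by eye, the `vm_stat` output can be converted to GB. This is a sketch under one assumption I'm making for illustration: it counts active, wired, and compressor-occupied pages as "used", which only approximates what Activity Monitor reports. The function names and the `SAMPLE` text are hypothetical; `read_vm_stat()` only works on macOS.

```python
import re
import subprocess

GIB = 1024 ** 3


def parse_vm_stat(text):
    """Return (page_size_bytes, {counter_name: pages}) from vm_stat output."""
    page_size = int(re.search(r"page size of (\d+) bytes", text).group(1))
    pages = {}
    for line in text.splitlines():
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line)
        if match:
            pages[match.group(1).strip()] = int(match.group(2))
    return page_size, pages


def baseline_gb(text):
    """Approximate used memory: active + wired + compressor pages."""
    page_size, pages = parse_vm_stat(text)
    used_pages = (
        pages.get("Pages active", 0)
        + pages.get("Pages wired down", 0)
        + pages.get("Pages occupied by compressor", 0)
    )
    return used_pages * page_size / GIB


def ml_budget_gb(text, total_gb=64.0):
    """Whatever is left after the baseline is the ML budget."""
    return total_gb - baseline_gb(text)


def read_vm_stat():
    """Run vm_stat (macOS only) and return its raw text output."""
    return subprocess.run(
        ["vm_stat"], capture_output=True, text=True, check=True
    ).stdout


# Illustrative sample, not a real measurement from my machine.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              131072.
Pages active:                            655360.
Pages inactive:                          262144.
Pages wired down:                        196608.
Pages occupied by compressor:            131072.
"""

print(f"baseline: {baseline_gb(SAMPLE):.1f} GB")
print(f"ML budget: {ml_budget_gb(SAMPLE):.1f} GB")
```

On a Mac, pass `read_vm_stat()` instead of `SAMPLE`. Treat the result as a floor for your baseline, since inactive and cached pages can also be slow to reclaim.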

If your baseline is over 20 GB, you have less ML room than I do. Some choices: close Chrome, reduce open browser tabs, kill Slack/Discord during inference sessions, or accept a smaller model.

What changes if you have less or more RAM

The shape of the budget holds across Mac generations, but the thresholds shift.

M2 Air 16 GB: roughly 6-8 GB system baseline. Leaves ~8-10 GB for ML. Realistic models: 7B Q4 only, with minimal multitasking.

M2 Pro 32 GB: ~12 GB baseline. Leaves ~20 GB for ML. Realistic: 14B Q8 or 32B Q4 with light workflow. 35B too tight.

M1/M2 Max 64 GB (my setup): ~18-20 GB baseline. Leaves ~44 GB. Realistic: 35B Q8 with normal workflow, 70B Q4 if you close most apps.

M2 Ultra 128 GB: ~20-22 GB baseline. Leaves ~106 GB. Realistic: 70B Q8 comfortable, 100B+ Q4 possible.

M2 Ultra 192 GB: similar baseline. Leaves ~170 GB. Realistic: 100B+ Q8, multiple models loaded simultaneously, or one large model + heavy concurrent workload.

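The pattern: on 64 GB and larger machines, about 18-22 GB goes to "being a Mac" with a normal dev workflow open, and that figure barely moves as total RAM grows. Heavier browser/IDE habits can add up to another 10 GB. On 16-32 GB machines the baseline is lower (roughly 6-12 GB), but so is everything else. Above the baseline, the leftover scales linearly with what you bought.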
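To recalculate for a different machine, filter the model table by whatever is left after your baseline. A small sketch: the footprints are the approximate figures from the model table earlier, and `reserve_gb` is a parameter I'm introducing for illustration. It covers everything besides weights that must stay free (runtime, context buffers, spike buffer), and you have to pick it for your own setup.

```python
# Approximate resident memory per model in GB, from the table earlier,
# sorted by footprint.
MODELS_GB = [
    ("7B Q4", 4),
    ("7B Q8", 7),
    ("14B Q4", 8),
    ("14B Q8", 14),
    ("32B Q4", 18),
    ("35B MoE Q4", 19),
    ("32B Q8", 32),
    ("35B MoE Q8", 35),
    ("70B Q4", 38),
    ("70B Q8", 70),
]


def fitting_models(total_gb, baseline_gb, reserve_gb):
    """List models whose weights plus the reserve fit after the baseline."""
    budget = total_gb - baseline_gb
    return [name for name, gb in MODELS_GB if gb + reserve_gb <= budget]


# My setup: 20 GB workflow ceiling, 3.7 GB runtime/cache/context
# plus a 3 GB spike buffer held in reserve.
print(fitting_models(64, 20, 6.7)[-1])

# A 16 GB machine with an 8 GB baseline and a 3 GB reserve.
print(fitting_models(16, 8, 3.0))
```

With a leaner workflow or a smaller reserve, more rows pass the filter, which is the trade the hardware list above describes.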

What goes wrong if you over-budget

The failure modes from over-allocating memory to ML, in order of how often I've hit them:

1. Inference latency spikes. Memory pressure triggers macOS compression. Decode tok/s drops from 26 to 8-12 silently. The model still responds, just slower. You assume the model degraded, when actually the memory layer did.

2. Random app evictions. macOS will start force-quitting background apps to free pages. Discord disappears, your IDE loses unsaved buffers, Spotify silently stops. Usually no notification.

3. Full system freeze. If compression saturates and the kernel can't recover, the whole machine locks and needs a hard reboot. I hit this twice in one week before I tuned memory caps; the write-up of the fix is in my 6-layer MLX defense post.

4. SSD wear from swap. macOS swaps pages to SSD when compression can't free enough memory. Heavy daily inference + tight memory = a measurable increase in SSD writes. Apple Silicon SSDs have decent endurance, but the wear isn't zero.

The first two are warnings. The third is the failure mode that costs you a workday. Budget accordingly.
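Since the first failure mode is silent, it helps to watch decode throughput directly. Below is a minimal sketch of a rolling-average check. The class name, window size, and 50% warning ratio are my illustrative choices, not something measured; the 26 tok/s baseline is the figure from item 1.

```python
from collections import deque


class DecodeRateMonitor:
    """Flag sustained drops in decode tok/s against a known-good baseline."""

    def __init__(self, baseline_tok_s, window=5, warn_ratio=0.5):
        self.baseline = baseline_tok_s
        self.warn_ratio = warn_ratio
        self.samples = deque(maxlen=window)  # most recent tok/s readings

    def record(self, tokens, seconds):
        """Add one generation's throughput; return True if degraded."""
        self.samples.append(tokens / seconds)
        return self.degraded()

    def degraded(self):
        # Wait for a full window so one slow request doesn't trigger it.
        if len(self.samples) < self.samples.maxlen:
            return False
        mean = sum(self.samples) / len(self.samples)
        return mean < self.baseline * self.warn_ratio


monitor = DecodeRateMonitor(baseline_tok_s=26, window=3)
for tokens, seconds in [(260, 10), (260, 10), (260, 10)]:
    healthy_flag = monitor.record(tokens, seconds)  # stays False at 26 tok/s
for tokens, seconds in [(100, 10), (100, 10), (100, 10)]:
    slow_flag = monitor.record(tokens, seconds)  # True once the window is all 10 tok/s
print(healthy_flag, slow_flag)
```

When the flag trips, check memory pressure before blaming the model.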

What this isn't

This budget is for one workflow: continuous local LLM inference with a multi-agent setup, plus normal dev work in parallel, on a 64 GB M1 Max. The principles generalize but the numbers don't.

If you're a researcher doing batch jobs, you can shut down your dev workflow during runs and free up the 18-20 GB system budget for the model. That lets you push to 50+ GB ML allocation on the same hardware.

If you're a single-shot interactive user (one prompt, read answer, repeat), you can be looser with the cache caps. Memory growth from cached buffers doesn't have time to build up between prompts.

If you're a multi-tenant server operator running inference for multiple users, you need to budget per concurrent session. The numbers in this post assume one user (me).

If you're choosing a Mac to buy for local LLM work, the practical guidance: 32 GB if you want 14B; 64 GB if you want 35B with workflow; 128 GB+ if you want 70B or want headroom for the next model generation. RAM on Apple Silicon isn't upgradeable, so buy more than you think you need.

The smaller lesson

Unified memory is not a free lunch. The advantage over discrete VRAM (no copy overhead, model + workflow share pool) comes with the responsibility to be explicit about who gets what. Default macOS behavior assumes you're not running a 35 GB model. You have to opt into the budget.

If you've worked out a different budget that fits your workflow on the same RAM, I'd genuinely like to see it. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.
