MLX Memory Safety Checklist
6-Layer Defense for M1/M2 Apple Silicon
A solo public notebook from SleepyQuant.
The problem
I froze my M1 Max twice in one week running Qwen 3.6 35B-A3B Q8 for a 12-agent stack.
Symptoms before the fix:
- Memory compressor hit 19.69 GB of compressed pages
- macOS started swapping random background apps (Safari tabs, IDE windows)
- After ~6 hours uptime: full system freeze, hard reboot only option
- MLX inference latency drifted from ~26 tok/s → ~14 tok/s before the freeze hit
Root cause: MLX on Apple Silicon uses unified memory + Metal command buffers that grow without explicit cleanup. Default macOS memory_pressure thresholds don't kick in fast enough for a 35GB-resident model + per-inference Metal cache buildup.
After the 6-layer defense below, same workload runs steady:
- Compressed memory: <1.7 GB (-91%)
- Metal active: ~35 GB (model weights, expected)
- Metal cache: <100 MB (was unbounded before)
- Free + reclaimable: ~30 GB buffer
- Zero freezes in 7 days continuous run
Here's exactly what each layer does and how to ship it.
Layer 1 — Metal wired_limit cap
What it does: tells the Metal driver the maximum number of bytes it can pin in physical RAM (un-pageable).
Set to ~70% of total unified memory. On 64GB M1 Max:
import mlx.core as mx
mx.metal.set_wired_limit(45 * 1024**3) # 45 GB
Why this matters: without a cap, Metal can grow past comfortable headroom and force macOS to compress everything else. With a 45 GB cap, the OS keeps ~19 GB of breathing room for apps, IDE, and browser.
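If you run the same script on machines with different RAM, the 70% rule can be computed instead of hard-coded. A minimal sketch; hw.memsize is the standard macOS sysctl for total physical memory:

import subprocess
import mlx.core as mx

# Total unified memory in bytes, then pin at most ~70% of it (Layer 1 rule of thumb)
total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"], text=True).strip())
mx.metal.set_wired_limit(int(total_bytes * 0.70))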
Layer 2 — Metal cache_limit cap
What it does: caps the Metal allocator's internal buffer reuse cache. Different from wired memory — this is the "scratch" that builds per-inference.
mx.metal.set_cache_limit(512 * 1024**2) # 512 MB
Why 512 MB: empirically enough to keep inference fast (cache hit on common shapes) without unbounded growth on long generation runs. Set lower (256 MB) if you have <32GB total.
Layer 3 — memory_limit (soft ceiling)
mx.metal.set_memory_limit(48 * 1024**3) # 48 GB
This is MLX's own soft ceiling. Slightly higher than wired_limit to allow some pageable allocation but still bounded.
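Taken together, Layers 1-3 are three lines at process startup, before the model loads. With the 64 GB values from above:

import mlx.core as mx

GB = 1024**3
mx.metal.set_wired_limit(45 * GB)        # Layer 1: cap on un-pageable Metal memory
mx.metal.set_cache_limit(512 * 1024**2)  # Layer 2: bound the buffer reuse cache
mx.metal.set_memory_limit(48 * GB)       # Layer 3: soft ceiling, above the wired cap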
Layer 4 — Explicit clear_cache() after long inference
Hook into your generation loop:
def generate_with_cleanup(model, prompt, max_tokens):
    output = model.generate(prompt, max_tokens=max_tokens)
    # Only long runs accumulate enough Metal cache to be worth the clear cost
    if max_tokens >= 500:
        mx.metal.clear_cache()
    return output
Why threshold at 500 tokens: short generations don't accumulate enough cache to matter. Long ones (essay drafts, multi-section content, reasoning chains) do. Clearing on every call costs ~5-10ms per inference; clearing on threshold saves that overhead.
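If generation is called from more than one place, the same rule works as a decorator. A sketch; clears_cache and CLEAR_THRESHOLD are my names, not MLX API:

import functools
import mlx.core as mx

CLEAR_THRESHOLD = 500  # tokens; below this, per-call cache buildup is negligible

def clears_cache(fn):
    # Wrap any generate-style callable; clear the Metal cache after long runs
    @functools.wraps(fn)
    def wrapper(*args, max_tokens=256, **kwargs):
        out = fn(*args, max_tokens=max_tokens, **kwargs)
        if max_tokens >= CLEAR_THRESHOLD:
            mx.metal.clear_cache()
        return out
    return wrapper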
Layer 5 — 5-minute memory pressure watchdog
Background thread that polls macOS memory_pressure every 5 minutes. If pressure crosses the "warn" threshold, force a clear_cache() and log it:
import re, subprocess, time, threading
import mlx.core as mx

def _free_pct(out):
    # Parse "System-wide memory free percentage: 18%" from memory_pressure output
    m = re.search(r"free percentage:\s*(\d+)", out)
    return int(m.group(1)) if m else 100  # assume healthy if the line is missing

def memory_watchdog():
    while True:
        out = subprocess.check_output(
            ["memory_pressure", "-Q"], text=True
        )
        if "warn" in out.lower() or _free_pct(out) < 15:
            mx.metal.clear_cache()
            print(f"[watchdog] forced cache clear, free={_free_pct(out)}%")
        time.sleep(300)

threading.Thread(target=memory_watchdog, daemon=True).start()
This is the "if all else fails" net. Catches drift cases that the per-inference threshold misses.
Layer 6 — Nightly restart via LaunchAgent
The honest one. Even with all 5 layers above, multi-day uptime accumulates fragmentation. Easiest fix: scheduled restart at 4 AM local time.
LaunchAgent plist (~/Library/LaunchAgents/com.yourapp.backend.plist):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.yourapp.backend</string>
  <key>ProgramArguments</key>
  <array>
    <string>/path/to/your/start.sh</string>
  </array>
  <key>KeepAlive</key>
  <true/>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>4</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>EnvironmentVariables</key>
  <dict>
    <key>MLX_FORCE_FP16</key><string>1</string>
  </dict>
</dict>
</plist>
Load: launchctl load ~/Library/LaunchAgents/com.yourapp.backend.plist
Why nightly not weekly: model warmup is ~60 seconds; nightly restart is barely noticeable but resets all accumulated state. Weekly meant the freeze caught me before the restart fired.
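One launchd wrinkle to verify in your own setup: with KeepAlive set, launchd relaunches the process when it exits, but it won't kill an already-running instance when StartCalendarInterval fires. If your start.sh doesn't terminate the previous instance itself, a scheduled kickstart forces the restart (the label matches the plist above):

launchctl kickstart -k gui/$(id -u)/com.yourapp.backend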
Verification commands
Run these while your inference workload is active to verify each layer is doing its job:
# Check system-wide memory pressure + compressed/free pages
sudo memory_pressure -Q
vm_stat | grep -E "(Pages active|Pages compressed|Pages free)"
# Check MLX's memory accounting — note these counters are per-process,
# so run the same calls inside your running backend to see the real numbers
python -c "import mlx.core as mx; print(mx.metal.get_active_memory()/1024**3, 'GB active')"
python -c "import mlx.core as mx; print(mx.metal.get_cache_memory()/1024**3, 'GB cache')"
# Check LaunchAgent loaded
launchctl list | grep yourapp
Healthy steady-state targets (35GB model on 64GB Mac):
- Pages compressed: <2 GB (~125k pages; vm_stat pages are 16 KB on Apple Silicon)
- Metal active: ~35 GB
- Metal cache: <500 MB
- Pages free + inactive: ~30 GB buffer (~1.9M pages)
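Since the MLX counters are per-process, a small helper inside the backend makes the two Metal targets checkable without shelling out. A sketch; log_mlx_memory is my name, the getters are the same calls used above:

import mlx.core as mx

def log_mlx_memory():
    # MLX's view of Metal memory, from inside the serving process
    GB = 1024**3
    print(f"Metal active: {mx.metal.get_active_memory() / GB:.1f} GB (target ~35)")
    print(f"Metal cache:  {mx.metal.get_cache_memory() / 1024**2:.0f} MB (target <500)")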
What happens when each layer fails
| Layer fails | Symptom |
|---|---|
| 1 (wired_limit) | Compressed memory climbs past 5 GB within hours |
| 2 (cache_limit) | Metal cache grows unbounded, eventually swap thrash |
| 3 (memory_limit) | Allocation errors mid-inference (rare, hard to catch) |
| 4 (clear_cache hook) | Slow drift over long generations, latency creep |
| 5 (watchdog) | Edge cases sneak past, freeze possible after 8+ hours |
| 6 (nightly restart) | Multi-day uptime hits fragmentation wall around day 3-4 |
All 6 together: zero freezes in continuous 7-day runs on the 12-agent workload.
What this is and isn't
This is the setup that worked for one specific workload: 35GB Qwen MoE Q8 + 12-agent multi-tenant inference on a 64GB M1 Max. Numbers are real, from my own backend.
It is not a universal recipe. If you're running:
- Smaller models (<10 GB): Layer 1 cap can be tighter (15-20 GB), Layer 5 watchdog less critical
- Larger Macs (128 GB Studio): cap can be 80-90 GB
- Single-user dev workload: nightly restart may be overkill
Test each layer independently. Watch the verification commands. Adjust thresholds to your workload.
Want more posts like this?
I'm building a multi-agent quant stack on one M1 Max, public notebook style. Local AI engineering, MLX deep-dives, paper-trading transparency, all numbers (good and bad).
Subscribe to SleepyQuant Weekly at sleepyquant.rest — see me fall or thrive, whichever comes first.
Last updated 2026-04-27. Numbers from my own backend running Qwen 3.6 35B-A3B Q8 on M1 Max 64GB since 2026-04-20. If you find a layer that helped or didn't help in your setup, reply to the welcome email — I'd genuinely like to compare notes.