SleepyQuant

Posted on • Originally published at sleepyquant.rest

MLX Memory Safety Checklist: 6-Layer Defense for M1/M2 Apple Silicon


A solo public notebook from SleepyQuant.


The problem

I froze my M1 Max twice in one week running Qwen 3.6 35B-A3B Q8 for a 12-agent stack.

Symptoms before the fix:

  • Memory compressor hit 19.69 GB of compressed pages
  • macOS started swapping random background apps (Safari tabs, IDE windows)
  • After ~6 hours uptime: full system freeze, hard reboot only option
  • MLX inference latency drifted from ~26 tok/s → ~14 tok/s before the freeze hit

Root cause: MLX on Apple Silicon uses unified memory + Metal command buffers that grow without explicit cleanup. Default macOS memory_pressure thresholds don't kick in fast enough for a 35GB-resident model + per-inference Metal cache buildup.

After the 6-layer defense below, same workload runs steady:

  • Compressed memory: <1.7 GB (-91%)
  • Metal active: ~35 GB (model weights, expected)
  • Metal cache: <100 MB (was unbounded before)
  • Free + reclaimable: ~30 GB buffer
  • Zero freezes in 7 days continuous run

Here's exactly what each layer does and how to ship it.


Layer 1 — Metal wired_limit cap

What it does: tells the Metal driver the maximum number of bytes it can pin in physical RAM (un-pageable).

Set to ~70% of total unified memory. On 64GB M1 Max:

import mlx.core as mx
mx.metal.set_wired_limit(45 * 1024**3)  # 45 GB

Why this matters: without a cap, Metal can grow past comfortable headroom and force macOS to compress everything else. With 45GB cap, the OS keeps ~19GB breathing room for app + IDE + browser.


Layer 2 — Metal cache_limit cap

What it does: caps the Metal allocator's internal buffer reuse cache. Different from wired memory — this is the "scratch" that builds per-inference.

mx.metal.set_cache_limit(512 * 1024**2)  # 512 MB

Why 512 MB: empirically enough to keep inference fast (cache hit on common shapes) without unbounded growth on long generation runs. Set lower (256 MB) if you have <32GB total.


Layer 3 — memory_limit (soft ceiling)

mx.metal.set_memory_limit(48 * 1024**3)  # 48 GB

This is MLX's own soft ceiling. Slightly higher than wired_limit to allow some pageable allocation but still bounded.
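The three caps from Layers 1-3 can be derived from total unified memory instead of hard-coded. A minimal sketch; `memory_limits` is a hypothetical helper, and the heuristic (wired at ~70%, soft ceiling 3 GB above wired, cache 512 MB on 32 GB+ machines, 256 MB below) just encodes the numbers used in this post:

```python
GB = 1024**3
MB = 1024**2

def memory_limits(total_gb):
    """Derive the three caps from total unified memory:
    wired ~= 70% of RAM, soft ceiling slightly above wired, cache fixed."""
    wired = round(total_gb * 0.70) * GB            # Layer 1: un-pageable cap
    soft = wired + 3 * GB                          # Layer 3: higher but still bounded
    cache = (512 if total_gb >= 32 else 256) * MB  # Layer 2: scratch cache cap
    return wired, soft, cache

wired, soft, cache = memory_limits(64)  # 45 GB, 48 GB, 512 MB on a 64 GB M1 Max
# mx.metal.set_wired_limit(wired)
# mx.metal.set_cache_limit(cache)
# mx.metal.set_memory_limit(soft)
```

On a machine with a different RAM size, one call gives you all three values to pass to the Layer 1-3 setters above.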


Layer 4 — Explicit clear_cache() after long inference

Hook into your generation loop:

def generate_with_cleanup(model, prompt, max_tokens):
    output = model.generate(prompt, max_tokens=max_tokens)
    if max_tokens >= 500:
        mx.metal.clear_cache()
    return output

Why threshold at 500 tokens: short generations don't accumulate enough cache to matter. Long ones (essay drafts, multi-section content, reasoning chains) do. Clearing on every call costs ~5-10ms per inference; clearing on threshold saves that overhead.
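The same policy can also watch cache size directly, so a burst of short generations still triggers a clear once the cache hits its Layer 2 cap. A sketch; `should_clear_cache` is a hypothetical helper, and in practice you would feed it the live value from `mx.metal.get_cache_memory()`:

```python
def should_clear_cache(tokens_generated, cache_bytes,
                       token_threshold=500, cache_cap=512 * 1024**2):
    """Clear when the generation was long enough to matter, or when the
    cache has already grown to its cap regardless of generation length."""
    return tokens_generated >= token_threshold or cache_bytes >= cache_cap

should_clear_cache(120, 600 * 1024**2)  # short run but cache over its cap -> True
```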


Layer 5 — 5-minute memory pressure watchdog

Background thread that polls macOS memory_pressure every 5 min. If pressure crosses "warn" threshold, force clear_cache() + log:

import subprocess, time, threading

def memory_watchdog():
    while True:
        out = subprocess.check_output(
            ["memory_pressure", "-Q"], text=True
        )
        # Parse "System-wide memory free percentage: 18%" from output
        if "warn" in out.lower() or _free_pct(out) < 15:
            mx.metal.clear_cache()
            print(f"[watchdog] forced cache clear, free={_free_pct(out)}%")
        time.sleep(300)

threading.Thread(target=memory_watchdog, daemon=True).start()

This is the "if all else fails" net. Catches drift cases that the per-inference threshold misses.


Layer 6 — Nightly restart via LaunchAgent

The honest one. Even with all 5 layers above, multi-day uptime accumulates fragmentation. Easiest fix: scheduled restart at 4 AM local time.

LaunchAgent plist (~/Library/LaunchAgents/com.yourapp.backend.plist):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.yourapp.backend</string>
    <key>ProgramArguments</key>
    <array>
        <string>/path/to/your/start.sh</string>
    </array>
    <key>KeepAlive</key>
    <true/>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key><integer>4</integer>
        <key>Minute</key><integer>0</integer>
    </dict>
    <key>EnvironmentVariables</key>
    <dict>
        <key>MLX_FORCE_FP16</key><string>1</string>
    </dict>
</dict>
</plist>

Load: launchctl load ~/Library/LaunchAgents/com.yourapp.backend.plist

Why nightly not weekly: model warmup is ~60 seconds; nightly restart is barely noticeable but resets all accumulated state. Weekly meant the freeze caught me before the restart fired.


Verification commands

Run these while your inference workload is active to verify each layer is doing its job:

# Check Metal active + cache + compressed memory
sudo memory_pressure -Q
vm_stat | grep -E "(Pages active|Pages compressed|Pages free)"

# Check MLX limits applied
python -c "import mlx.core as mx; print(mx.metal.get_active_memory()/1024**3, 'GB active')"
python -c "import mlx.core as mx; print(mx.metal.get_cache_memory()/1024**3, 'GB cache')"

# Check LaunchAgent loaded
launchctl list | grep yourapp

Healthy steady-state targets (35GB model on 64GB Mac):

  • Pages compressed: <130k pages (~2 GB; vm_stat counts 16 KiB pages on Apple Silicon)
  • Metal active: ~35 GB
  • Metal cache: <500 MB
  • Pages free + inactive: >1.9M pages (~30 GB)
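vm_stat reports page counts, not bytes, and Apple Silicon uses 16 KiB pages (vm_stat prints the page size in its header line; Intel Macs use 4 KiB), so it's worth doing the conversion explicitly. A minimal sketch:

```python
PAGE_SIZE = 16 * 1024  # Apple Silicon page size; check vm_stat's header on your machine

def pages_to_gb(pages, page_size=PAGE_SIZE):
    """Convert a vm_stat page count into GiB."""
    return pages * page_size / 1024**3

round(pages_to_gb(125_000), 1)  # -> 1.9 (GiB of compressed pages)
```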

What happens when each layer fails

  • Layer 1 (wired_limit) fails: compressed memory climbs past 5 GB within hours
  • Layer 2 (cache_limit) fails: Metal cache grows unbounded, eventual swap thrash
  • Layer 3 (memory_limit) fails: allocation errors mid-inference (rare, hard to catch)
  • Layer 4 (clear_cache hook) fails: slow drift over long generations, latency creep
  • Layer 5 (watchdog) fails: edge cases sneak past, freeze possible after 8+ hours
  • Layer 6 (nightly restart) fails: multi-day uptime hits the fragmentation wall around day 3-4

All 6 together: zero freezes in continuous 7-day runs on 12-agent workload.


What this is and isn't

This is the setup that worked for one specific workload: 35GB Qwen MoE Q8 + 12-agent multi-tenant inference on a 64GB M1 Max. Numbers are real, from my own backend.

It is not a universal recipe. If you're running:

  • Smaller models (<10 GB): Layer 1 cap can be tighter (15-20 GB), Layer 5 watchdog less critical
  • Larger Macs (128 GB Studio): cap can be 80-90 GB
  • Single-user dev workload: nightly restart may be overkill

Test each layer independently. Watch the verification commands. Adjust thresholds to your workload.


Want more posts like this?

I'm building a multi-agent quant stack on one M1 Max, public notebook style. Local AI engineering, MLX deep-dives, paper-trading transparency, all numbers (good and bad).

Subscribe to SleepyQuant Weekly at sleepyquant.rest — see me fall or thrive, whichever comes first.


Last updated 2026-04-27. Numbers from my own backend running Qwen 3.6 35B-A3B Q8 on M1 Max 64GB since 2026-04-20. If you find a layer that helped or didn't help in your setup, reply to the welcome email — I'd genuinely like to compare notes.
