DEV Community

Gugubibi

Posted on • Originally published at sleepyquant.rest

What 19 GB of Memory Compression Taught Me About MLX on M1 Max

The moment something was wrong

I opened Activity Monitor on my M1 Max one afternoon and saw this: Memory Used 60.74 GB out of 64, compressed memory 19.69 GB, swap starting to fill. The SwiftUI dashboard I use to drive my multi-agent quant stack had hung. Python — the backend process holding an MLX-loaded Qwen 3.6 35B-A3B model — reported 44 GB in Activity Monitor's "Memory" column.

My first thought was the obvious one: memory leak. Shut it down, restart, move on.

That would have been wrong. What I found instead was a much more interesting problem about how macOS handles Metal unified memory when a large model sits idle between inferences — and the fix turned out to be a single MLX API call I had never used.

This is the honest write-up: what broke, what I measured, what the fix actually was, and what I'm still not sure about.

What I was actually running

One M1 Max, 64 GB unified memory. One Python process holding the MLX framework with a Q8-quantized 35B-A3B MoE model loaded. About 35 GB of that goes to model weights in Metal-accessible memory; the rest of the process is the FastAPI backend, twelve specialized agents sharing the single model through a priority queue, a SQLite paper-trading book, and assorted content-generation loops.
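The sharing pattern — many agents, one model, one priority queue — can be sketched with nothing but the standard library. This is an illustrative sketch, not the actual backend: `submit`, `model_worker`, and the agent names are made up, and a `str.upper` stub stands in for the real MLX generate call.

```python
import itertools
import queue
import threading

# Minimal sketch (names illustrative): agents submit through one priority
# queue; a single worker thread owns the model and serves requests serially.
_requests: "queue.PriorityQueue" = queue.PriorityQueue()
_tiebreak = itertools.count()  # keeps tuple comparison away from the payload

def submit(priority: int, agent: str, prompt: str) -> "queue.Queue":
    """Lower number = higher priority. Returns a one-slot reply queue."""
    reply: "queue.Queue" = queue.Queue(maxsize=1)
    _requests.put((priority, next(_tiebreak), agent, prompt, reply))
    return reply

def model_worker(run_inference) -> None:
    """The only thread that touches the model; drains in priority order."""
    while True:
        _prio, _n, _agent, prompt, reply = _requests.get()
        reply.put(run_inference(prompt))
        _requests.task_done()

# Stub inference in place of the real MLX generate call:
threading.Thread(target=model_worker, args=(str.upper,), daemon=True).start()
print(submit(0, "risk-agent", "check exposure").get(timeout=5))  # → CHECK EXPOSURE
```

The one-slot reply queue is what lets twelve callers block independently on their own answers while the model stays strictly single-threaded.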

Uptime at the point of the snapshot: just under 8 hours since the last backend restart.

In normal operation, Activity Monitor should show something like:

  • Python process: ~35-40 GB in the "Memory" column
  • Wired: 2-3 GB (kernel)
  • Compressed: low single digits
  • Free + reclaimable inactive: 15-20 GB

What I saw instead:

  • Python process: 44 GB
  • Compressed: 19.69 GB
  • Swap: 1.57 GB and climbing
  • Free: 3 GB

The compressed number was the interesting one. Not the total.

Why compressed memory is the signal, not the problem

macOS has an in-kernel memory compressor that tries to keep a working set resident by compressing pages that processes have allocated but aren't actively touching. When compressed memory grows, it usually means somewhere a process has a big chunk of memory that's "cold" — allocated but not referenced often enough to count as active.

A rough rule of thumb for macOS's compressor is about two-to-one. At that ratio, 19.69 GB of compressed memory represents something like 40 GB of original "owed" pages being squeezed down.
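The back-of-envelope math, assuming that ~2:1 ratio holds:

```python
def estimated_cold_gb(compressed_gb: float, ratio: float = 2.0) -> float:
    """Rough size of the original pages behind a compressed-memory figure."""
    return compressed_gb * ratio

# 19.69 GB compressed at ~2:1 implies roughly 39-40 GB of cold pages
print(round(estimated_cold_gb(19.69), 1))  # → 39.4
```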

On a normal desktop, this is invisible and fine. On a machine running a 35 GB model, it's a red flag: if the model weights are being compressed and decompressed as the compressor swaps them in and out of a resident state, every inference pays a cost to decompress pages before Metal can use them. CPU cycles burn. Latency drifts. Over hours, the machine becomes sluggish in a way that's hard to attribute.
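You don't need Activity Monitor open to watch this number — macOS's `vm_stat` reports compressor occupancy in pages. A small parser (illustrative helper; the page size is read from the `vm_stat` header, and the sample output below is constructed to match the 19.69 GB snapshot):

```python
import re

def compressed_gb_from_vm_stat(output: str) -> float:
    """Parse `vm_stat` output and return GB occupied by the memory compressor."""
    page_size = int(re.search(r"page size of (\d+) bytes", output).group(1))
    pages = int(re.search(r"Pages occupied by compressor:\s+(\d+)\.", output).group(1))
    return pages * page_size / 1024**3

# Constructed sample matching the snapshot in this post (16 KB pages on Apple Silicon):
sample = (
    "Mach Virtual Memory Statistics: (page size of 16384 bytes)\n"
    "Pages occupied by compressor:            1289748.\n"
)
print(round(compressed_gb_from_vm_stat(sample), 2))  # → 19.68
```

Against a live machine you'd feed it `subprocess.check_output(["vm_stat"], text=True)` instead of the sample string.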

The question became: why are my model weights going inactive between inferences in the first place?

The thing I didn't know about Apple Silicon Metal

On Apple Silicon, CPU and GPU share the same physical RAM. That's the unified memory advantage. But "unified" doesn't mean "all memory is treated the same." Metal exposes a few storage modes, and the one MLX uses by default for model weights is shared — accessible to both CPU and GPU.

Here's the thing I had to learn the hard way: shared storage pages are pageable. They can be marked inactive by the kernel. They can be compressed. From the operating system's perspective, a chunk of Metal-allocated memory that isn't actively being read or written looks exactly like a process's idle heap. It gets the same treatment.

So the loop I was producing was this:

  1. Model loaded into Metal shared storage (~35 GB)
  2. Inference fires, GPU reads weights, decoder runs
  3. Inference finishes
  4. Seconds pass. No one touches the weights.
  5. Kernel marks pages inactive
  6. Compressor kicks in, squeezes cold pages
  7. Next inference arrives
  8. GPU needs to read weights → decompress first → latency
  9. Return to 1.

Over hours, the compressor works harder and harder. The machine isn't leaking memory. It's thrashing a 35 GB working set against a compression algorithm that assumes cold data will stay cold. It won't stay cold. It's a running model.

The fix I should have known about six months ago

MLX has an API called mx.metal.set_wired_limit(bytes). It tells Metal: "keep up to N bytes of memory resident and uncompressible." I had never called it. By default no wired limit is set — nothing is pinned, which means nothing is protected from the compressor.

I set it to 45 GB — enough to cover the ~35 GB of model weights plus a few GB of KV cache and scratch. Added two more for good measure:

  • mx.metal.set_cache_limit(512 MB) — cap the Metal compile cache so it can't drift over time.
  • mx.metal.set_memory_limit(48 GB) — hard ceiling so Metal refuses to allocate beyond that. Fail loudly instead of OOM.

All three calls go in _load_model before mlx_lm.load() allocates weights, so Metal knows the budget up front.
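Concretely, the ordering can be sketched like this. The function names (`metal_budget`, `apply_metal_budget`) and the 35 + 10 GB split are my illustration of the numbers above, not code from the actual backend; the deferred import keeps the byte arithmetic runnable on any machine, while the MLX calls only execute where MLX is installed.

```python
def metal_budget(weights_gb: float = 35, headroom_gb: float = 10,
                 cache_mb: int = 512, ceiling_gb: float = 48) -> dict:
    """Byte values for the three MLX limits: wired = weights + KV/scratch headroom."""
    return {
        "wired":  int((weights_gb + headroom_gb) * 1024**3),
        "cache":  cache_mb * 1024**2,
        "memory": int(ceiling_gb * 1024**3),
    }

def apply_metal_budget(budget: dict) -> None:
    """Call before mlx_lm.load() so Metal knows the budget before weights land."""
    import mlx.core as mx  # deferred: only runs on a machine with MLX installed
    mx.metal.set_wired_limit(budget["wired"])
    mx.metal.set_cache_limit(budget["cache"])
    mx.metal.set_memory_limit(budget["memory"])

b = metal_budget()
print(b["wired"] // 1024**3)  # → 45
```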

Results (one backend restart later):

Metric                           Before      After
Python "Memory" column           44 GB       ~40 GB
Compressed                       19.69 GB    1.7 GB
Swap                             1.57 GB     1.6 GB (historical, drains)
Free + reclaimable inactive      3 GB        ~30 GB

Compressed memory dropped by 91%. The model wasn't leaking. The kernel just wasn't pinning it, because I had never told it to.

Four more layers I added because I don't trust a single fix

Getting to 1.7 GB compressed on a fresh restart is nice. Keeping it there over days of uptime is different. I layered four more defenses in case any of them mattered:

Clear the Metal compile cache after heavy inference. My content pipeline regularly runs inferences with max_tokens ≥ 500 (sectional generation for long-form writeups). Metal accumulates a compile/scratch cache that doesn't matter for a single run but drifts over many. Added mx.metal.clear_cache() as an automatic hook at the end of any inference above that token threshold.
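The hook can be written as a decorator. This is a sketch, not the backend's actual code: the `clear_fn` injection point is mine, added so the logic is testable without MLX; when it's left as None the decorator falls back to mx.metal.clear_cache at call time.

```python
TOKEN_THRESHOLD = 500  # the heavy-inference cutoff from the post

def clears_cache_after(threshold: int = TOKEN_THRESHOLD, clear_fn=None):
    """Run the inference, then clear the Metal cache if max_tokens was large.

    clear_fn defaults to mx.metal.clear_cache at call time; injectable for tests.
    """
    def wrap(infer):
        def inner(prompt, max_tokens, **kw):
            try:
                return infer(prompt, max_tokens=max_tokens, **kw)
            finally:
                if max_tokens >= threshold:
                    fn = clear_fn
                    if fn is None:
                        import mlx.core as mx  # only on a machine with MLX
                        fn = mx.metal.clear_cache
                    fn()
        return inner
    return wrap

# Usage with a stub standing in for mx.metal.clear_cache:
calls = []

@clears_cache_after(clear_fn=lambda: calls.append("cleared"))
def fake_infer(prompt, max_tokens):
    return f"{len(prompt)} in, {max_tokens} out"

fake_infer("hello", max_tokens=600)
print(calls)  # → ['cleared']
```

The try/finally matters: the cache still gets cleared even if the inference itself raises.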

A memory-pressure watchdog. A background task polls psutil.virtual_memory() every five minutes. If Metal cache exceeds 1 GB, clear it automatically. If total system memory used exceeds 60 GB, print a warning. Not an alarm — just a log signal I can grep later.
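The decision logic inside that poll is simple enough to isolate as a pure function. This is an illustrative version (the real loop presumably reads psutil.virtual_memory().used and the Metal cache size; the function name and thresholds-as-arguments shape are mine):

```python
GB = 1024**3

def watchdog_actions(metal_cache_bytes: int, system_used_bytes: int,
                     cache_cap: int = 1 * GB, warn_cap: int = 60 * GB) -> list:
    """Pure decision logic for the five-minute poll, thresholds from the post."""
    actions = []
    if metal_cache_bytes > cache_cap:
        actions.append("clear_metal_cache")
    if system_used_bytes > warn_cap:
        actions.append("log_memory_warning")
    return actions

print(watchdog_actions(int(1.3 * GB), 61 * GB))   # → ['clear_metal_cache', 'log_memory_warning']
print(watchdog_actions(200 * 1024**2, 40 * GB))   # → []
```

Keeping the thresholds as defaulted parameters is what makes the later re-tuning ("pick thresholds from actual distribution percentiles") a one-line change.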

A nightly restart. Every night at 4 AM local time, the backend does os._exit(1). LaunchAgent KeepAlive respawns it in about a minute. Fresh MLX state, fresh Python heap. The warmup cost (~60 seconds of MLX reload) is free because I'm asleep and nothing depends on it.
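The "is it time yet" check inside the background loop can be as small as this (a sketch under my own assumptions — the minimum-uptime guard is there so a respawn at 4:01 doesn't immediately restart again; the real implementation may differ):

```python
from datetime import datetime, timedelta

def due_for_restart(now: datetime, last_restart: datetime,
                    hour: int = 4, min_uptime: timedelta = timedelta(hours=1)) -> bool:
    """True during the restart hour, unless the process just came up."""
    return now.hour == hour and (now - last_restart) >= min_uptime

boot = datetime(2026, 4, 1, 20, 0)
print(due_for_restart(datetime(2026, 4, 2, 4, 1), boot))   # → True
print(due_for_restart(datetime(2026, 4, 2, 3, 59), boot))  # → False
```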

Manual unload / reload API. POST /resources/mlx-unload sets a flag, drops the model reference, calls mx.metal.clear_cache(). Inference calls after that fail fast with a clear error. POST /resources/mlx-reload brings the model back in about 60 seconds. This is for when I want the full 40 GB of Metal memory for something else temporarily. Trade scanners and the paper engine keep running because they don't depend on MLX at all — they're pure Python against SQLite.
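The fail-fast behavior behind those endpoints boils down to a small state holder. A minimal sketch — `ModelHolder` is a hypothetical name, the loader would really be a closure around mlx_lm.load(), and here a `str.upper` stub stands in for the model:

```python
class ModelHolder:
    """Holds the model behind the unload/reload endpoints; fails fast when empty."""

    def __init__(self, loader):
        self._loader = loader        # e.g. a closure around mlx_lm.load(...)
        self._model = loader()

    def unload(self) -> None:
        self._model = None           # the real version also calls mx.metal.clear_cache()

    def reload(self) -> None:
        self._model = self._loader()

    def generate(self, prompt: str):
        if self._model is None:
            raise RuntimeError("model unloaded; POST /resources/mlx-reload first")
        return self._model(prompt)

holder = ModelHolder(lambda: str.upper)  # stub loader in place of mlx_lm.load
print(holder.generate("hi"))             # → HI
holder.unload()
```

After unload(), every inference call raises immediately with a clear message instead of hanging or returning garbage — which is what lets the non-MLX parts of the stack keep running untouched.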

All five together survive multiple-day uptime without drift.

The parts I'm still not sure about

The 45 GB wired limit is a guess. It works on my machine with this exact model. If I added a second model, or switched to a denser quantization, or loaded more aggressive KV cache — I'd need to re-tune. I don't have a systematic way to pick the number other than "model weights plus headroom, less than the point where the rest of macOS starves."

The set_memory_limit(48 GB) hard ceiling may be too aggressive. I haven't stress-tested what happens when the limit is actually hit. Probably Metal throws an OutOfMemoryError and the inference fails with a clear traceback, which is what I want. But I haven't caused it on purpose yet.

The watchdog threshold — clear cache above 1 GB, warn above 60 GB — is arbitrary. I set those based on vibes and one afternoon of measurement. A more disciplined version would instrument several days of data and pick thresholds from actual distribution percentiles.

The nightly restart is the scariest one. It assumes nothing important is mid-execution at 4 AM. For now that's true because I'm a solo operator. For a multi-user production stack, it would not be acceptable, and I'd need a graceful-drain + cutover pattern instead.

What I'd tell past-me six months ago

If you're running a large MLX model on Apple Silicon and you've never touched mx.metal.set_wired_limit, check Activity Monitor's Compressed Memory number after a few hours of uptime. If it's in double-digit GB, you're probably paying a compression/decompression tax on every inference.

The fix is three lines:

import mlx.core as mx
mx.metal.set_wired_limit(45 * 1024**3)     # pin the model in resident RAM
mx.metal.set_cache_limit(512 * 1024**2)    # cap Metal compile/scratch
mx.metal.set_memory_limit(48 * 1024**3)    # fail loud above this, don't OOM

That's it. Works on M1 and M2 generations. I haven't tested on M3 or M4 Pro / Max, but the API is the same and the underlying Metal behavior should be too.

The broader lesson I'm taking away: unified memory is a genuine advantage for local-first AI, but it inherits the OS's defaults for normal application memory. A 35 GB working set of neural-network weights is not what macOS's memory manager was designed for. The API to tell it "treat this differently" is there; I just had to know it existed.

What's next

I'm packaging the full hygiene layer as a small open-source helper — tentatively mlx-memory-safe — so anyone running MLX on a Mac can drop it in with one import instead of reading three sections of this post to rediscover the same fixes. Should land on GitHub and PyPI in the next week or two, with a separate write-up of the package internals.

If you've hit something similar, or if you've tested set_wired_limit on M3/M4 and seen different behavior, I'd love to hear about it. I still don't have a clean mental model for when shared storage mode pages leave the wired set under real-world pressure, and that gap is the next thing I want to understand.

Come along for the ride.


Disclaimer: This post reflects one solo operator's configuration on one M1 Max with 64 GB of unified memory in April 2026, running MLX + Qwen 3.6 35B-A3B Q8. Specific numbers (compressed GB, tok/s, wired limit) will differ on other hardware, other models, and other workloads. Test on your own setup before adopting any threshold as a default.
