<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gugubibi</title>
    <description>The latest articles on DEV Community by Gugubibi (@gugubibi).</description>
    <link>https://dev.to/gugubibi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885340%2F8cd2f97f-12d9-43c1-ace7-84a4532d823b.png</url>
      <title>DEV Community: Gugubibi</title>
      <link>https://dev.to/gugubibi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gugubibi"/>
    <language>en</language>
    <item>
      <title>What 19 GB of Memory Compression Taught Me About MLX on M1 Max</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:29:24 +0000</pubDate>
      <link>https://dev.to/gugubibi/what-19-gb-of-memory-compression-taught-me-about-mlx-on-m1-max-3eha</link>
      <guid>https://dev.to/gugubibi/what-19-gb-of-memory-compression-taught-me-about-mlx-on-m1-max-3eha</guid>
      <description>&lt;h1&gt;
  
  
  What 19 GB of Memory Compression Taught Me About MLX on M1 Max
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The moment something was wrong
&lt;/h2&gt;

&lt;p&gt;I opened Activity Monitor on my M1 Max one afternoon and saw this: Memory Used 60.74 GB out of 64, compressed memory 19.69 GB, swap starting to fill. The SwiftUI dashboard I use to drive my multi-agent quant stack had hung. Python — the backend process holding an MLX-loaded Qwen 3.6 35B-A3B model — reported 44 GB in Activity Monitor's "Memory" column.&lt;/p&gt;

&lt;p&gt;My first thought was the obvious one: memory leak. Shut it down, restart, move on.&lt;/p&gt;

&lt;p&gt;That would have been wrong. What I found instead was a much more interesting problem about how macOS handles Metal unified memory when a large model sits idle between inferences — and the fix turned out to be a single MLX API call I had never used.&lt;/p&gt;

&lt;p&gt;This is the honest write-up: what broke, what I measured, what the fix actually was, and what I'm still not sure about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was actually running
&lt;/h2&gt;

&lt;p&gt;One M1 Max, 64 GB unified memory. One Python process holding the MLX framework with a Q8-quantized 35B-A3B MoE model loaded. About 35 GB of that goes to model weights in Metal-accessible memory; the rest of the process is the FastAPI backend, twelve specialized agents sharing the single model through a priority queue, a SQLite paper-trading book, and assorted content-generation loops.&lt;/p&gt;

&lt;p&gt;Uptime at the point of the snapshot: just under 8 hours since the last backend restart.&lt;/p&gt;

&lt;p&gt;In normal operation, Activity Monitor should show something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python process: ~35-40 GB in the "Memory" column&lt;/li&gt;
&lt;li&gt;Wired: 2-3 GB (kernel)&lt;/li&gt;
&lt;li&gt;Compressed: low single digits&lt;/li&gt;
&lt;li&gt;Free + reclaimable inactive: 15-20 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I saw instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python process: 44 GB&lt;/li&gt;
&lt;li&gt;Compressed: &lt;strong&gt;19.69 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Swap: 1.57 GB and climbing&lt;/li&gt;
&lt;li&gt;Free: 3 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compressed number was the interesting one. Not the total.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why compressed memory is the signal, not the problem
&lt;/h2&gt;

&lt;p&gt;macOS has an in-kernel memory compressor that tries to keep a working set resident by compressing pages that processes have allocated but aren't actively touching. When compressed memory grows, it usually means somewhere a process has a big chunk of memory that's "cold" — allocated but not referenced often enough to count as active.&lt;/p&gt;

&lt;p&gt;Two-to-one is a rough rule-of-thumb compression ratio for the macOS compressor, so 19.69 GB of compressed pages implies something like 40 GB of cold "owed" memory being squeezed in.&lt;/p&gt;
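
&lt;p&gt;As a quick sanity check, the implied cold set is just the compressed figure times the assumed ratio:&lt;/p&gt;

```python
# Back-of-envelope check of the estimate above. The 2:1 ratio is a rule
# of thumb for the macOS compressor, not a measured constant.
def implied_cold_set_gb(compressed_gb, ratio=2.0):
    """Estimate how much original memory the compressed pages represent."""
    return compressed_gb * ratio

print(implied_cold_set_gb(19.69))  # prints 39.38
```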

&lt;p&gt;On a normal desktop, this is invisible and fine. On a machine running a 35 GB model, it's a red flag: if the model weights are being compressed and decompressed as the compressor swaps them in and out of a resident state, every inference pays a cost to decompress pages before Metal can use them. CPU cycles burn. Latency drifts. Over hours, the machine becomes sluggish in a way that's hard to attribute.&lt;/p&gt;

&lt;p&gt;The question became: why are my model weights going inactive between inferences in the first place?&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I didn't know about Apple Silicon Metal
&lt;/h2&gt;

&lt;p&gt;On Apple Silicon, CPU and GPU share the same physical RAM. That's the unified memory advantage. But "unified" doesn't mean "all memory is treated the same." Metal exposes a few storage modes, and the one MLX uses by default for model weights is &lt;code&gt;shared&lt;/code&gt; — accessible to both CPU and GPU.&lt;/p&gt;

&lt;p&gt;Here's the thing I had to learn the hard way: &lt;code&gt;shared&lt;/code&gt; storage pages are pageable. They can be marked inactive by the kernel. They can be compressed. From the operating system's perspective, a chunk of Metal-allocated memory that isn't actively being read or written looks exactly like a process's idle heap. It gets the same treatment.&lt;/p&gt;

&lt;p&gt;So the loop I was producing was this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model loaded into Metal shared storage (~35 GB)&lt;/li&gt;
&lt;li&gt;Inference fires, GPU reads weights, decoder runs&lt;/li&gt;
&lt;li&gt;Inference finishes&lt;/li&gt;
&lt;li&gt;Seconds pass. No one touches the weights.&lt;/li&gt;
&lt;li&gt;Kernel marks pages inactive&lt;/li&gt;
&lt;li&gt;Compressor kicks in, squeezes cold pages&lt;/li&gt;
&lt;li&gt;Next inference arrives&lt;/li&gt;
&lt;li&gt;GPU needs to read weights → decompress first → latency&lt;/li&gt;
&lt;li&gt;Return to 1.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over hours, the compressor works harder and harder. The machine isn't leaking memory. It's thrashing a 35 GB working set against a compression algorithm that assumes cold data will stay cold. It won't stay cold. It's a running model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix I should have known about six months ago
&lt;/h2&gt;

&lt;p&gt;MLX has an API called &lt;code&gt;mx.metal.set_wired_limit(bytes)&lt;/code&gt;. It tells Metal: "keep up to N bytes of memory resident and uncompressible." I had never called it. By default nothing is wired, which means nothing is protected from the compressor.&lt;/p&gt;

&lt;p&gt;I set it to 45 GB — enough to cover the ~35 GB of model weights plus a few GB of KV cache and scratch. Added two more for good measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mx.metal.set_cache_limit(512 MB)&lt;/code&gt; — cap MLX's cache of freed Metal buffers so it can't drift over time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mx.metal.set_memory_limit(48 GB)&lt;/code&gt; — hard ceiling so Metal refuses to allocate beyond that. Fail loudly instead of OOM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three calls go in &lt;code&gt;_load_model&lt;/code&gt; before &lt;code&gt;mlx_lm.load()&lt;/code&gt; allocates weights, so Metal knows the budget up front.&lt;/p&gt;

&lt;p&gt;Results (one backend restart later):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python "Memory" column&lt;/td&gt;
&lt;td&gt;44 GB&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compressed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19.69 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.7 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap&lt;/td&gt;
&lt;td&gt;1.57 GB&lt;/td&gt;
&lt;td&gt;1.6 GB (leftover; drains over time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free + reclaimable inactive&lt;/td&gt;
&lt;td&gt;3 GB&lt;/td&gt;
&lt;td&gt;~30 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compressed memory dropped by 91%. The model wasn't leaking. The kernel just wasn't pinning it, because I had never told it to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four more layers I added because I don't trust a single fix
&lt;/h2&gt;

&lt;p&gt;Getting to 1.7 GB compressed on a fresh restart is nice. Keeping it there over days of uptime is different. I layered four more defenses in case any of them mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear the Metal buffer cache after heavy inference.&lt;/strong&gt; My content pipeline regularly runs &lt;code&gt;max_tokens ≥ 500&lt;/code&gt; inferences (sectional generation for long-form writeups). MLX keeps a cache of freed buffers for reuse that doesn't matter for a single run but drifts over hours. I added &lt;code&gt;mx.metal.clear_cache()&lt;/code&gt; as an automatic hook at the end of any inference above that token threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A memory-pressure watchdog.&lt;/strong&gt; A background task polls &lt;code&gt;psutil.virtual_memory()&lt;/code&gt; every five minutes. If Metal cache exceeds 1 GB, clear it automatically. If total system memory used exceeds 60 GB, print a warning. Not an alarm — just a log signal I can grep later.&lt;/p&gt;
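
&lt;p&gt;A minimal sketch of that watchdog, assuming your MLX version exposes &lt;code&gt;mx.metal.get_cache_memory()&lt;/code&gt;; the thresholds are the ones from this post, and the function names are illustrative, not the production code:&lt;/p&gt;

```python
import time

GB = 1024 ** 3
CACHE_CLEAR_BYTES = 1 * GB    # clear the Metal cache above this
WARN_USED_BYTES = 60 * GB     # log a warning above this much system memory

def check_memory_pressure(cache_bytes, used_bytes):
    """Decide what one watchdog poll should do. Pure, so it is testable."""
    actions = []
    if cache_bytes > CACHE_CLEAR_BYTES:
        actions.append("clear_cache")
    if used_bytes > WARN_USED_BYTES:
        actions.append("warn")
    return actions

def watchdog_loop(poll_seconds=300):
    # Imports kept local: both assume the live-stack environment.
    import psutil
    import mlx.core as mx
    while True:
        used = psutil.virtual_memory().used
        cache = mx.metal.get_cache_memory()
        for action in check_memory_pressure(cache, used):
            if action == "clear_cache":
                mx.metal.clear_cache()
            else:
                print(f"[watchdog] system memory used: {used / GB:.1f} GB")
        time.sleep(poll_seconds)
```

Keeping the decision logic separate from the polling loop makes the thresholds easy to re-tune later from logged data.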

&lt;p&gt;&lt;strong&gt;A nightly restart.&lt;/strong&gt; Every night at 4 AM local time, the backend does &lt;code&gt;os._exit(1)&lt;/code&gt;. LaunchAgent &lt;code&gt;KeepAlive&lt;/code&gt; respawns it in about a minute. Fresh MLX state, fresh Python heap. The warmup cost (~60 seconds of MLX reload) is free because I'm asleep and nothing depends on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual unload / reload API.&lt;/strong&gt; &lt;code&gt;POST /resources/mlx-unload&lt;/code&gt; sets a flag, drops the model reference, calls &lt;code&gt;mx.metal.clear_cache()&lt;/code&gt;. Inference calls after that fail fast with a clear error. &lt;code&gt;POST /resources/mlx-reload&lt;/code&gt; brings the model back in about 60 seconds. This is for when I want the full 40 GB of Metal memory for something else temporarily. Trade scanners and the paper engine keep running because they don't depend on MLX at all — they're pure Python against SQLite.&lt;/p&gt;
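
&lt;p&gt;The unload/reload behavior reduces to a small state machine; this is a schematic of it with the FastAPI wiring and the real loader omitted, and with illustrative names:&lt;/p&gt;

```python
class ModelResource:
    """Holds the MLX model reference behind an availability flag."""

    def __init__(self, loader):
        self._loader = loader      # callable that loads and returns the model
        self._model = None

    def reload(self):
        # Roughly 60 seconds in practice for a 35 GB model.
        self._model = self._loader()

    def unload(self):
        # Drop the reference so Metal memory can be freed.
        # The production version also calls mx.metal.clear_cache() here.
        self._model = None

    def infer(self, prompt):
        if self._model is None:
            raise RuntimeError("model unloaded; POST /resources/mlx-reload first")
        return self._model(prompt)
```

The fail-fast error in `infer` is the whole point: a clear exception beats a silent hang when the model is deliberately gone.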

&lt;p&gt;All five layers together survive multi-day uptime without drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parts I'm still not sure about
&lt;/h2&gt;

&lt;p&gt;The 45 GB wired limit is a guess. It works on my machine with this exact model. If I added a second model, or switched to a denser quantization, or loaded more aggressive KV cache — I'd need to re-tune. I don't have a systematic way to pick the number other than "model weights plus headroom, less than the point where the rest of macOS starves."&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;set_memory_limit(48 GB)&lt;/code&gt; hard ceiling may be too aggressive. I haven't stress-tested what happens when the limit is actually hit. Probably Metal throws an OutOfMemoryError and the inference fails with a clear traceback, which is what I want. But I haven't caused it on purpose yet.&lt;/p&gt;

&lt;p&gt;The watchdog threshold — clear cache above 1 GB, warn above 60 GB — is arbitrary. I set those based on vibes and one afternoon of measurement. A more disciplined version would instrument several days of data and pick thresholds from actual distribution percentiles.&lt;/p&gt;

&lt;p&gt;The nightly restart is the scariest one. It assumes nothing important is mid-execution at 4 AM. For now that's true because I'm a solo operator. For a multi-user production stack, it would not be acceptable, and I'd need a graceful-drain + cutover pattern instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past-me six months ago
&lt;/h2&gt;

&lt;p&gt;If you're running a large MLX model on Apple Silicon and you've never touched &lt;code&gt;mx.metal.set_wired_limit&lt;/code&gt;, check Activity Monitor's Compressed Memory number after a few hours of uptime. If it's in double-digit GB, you're probably paying a compression/decompression tax on every inference.&lt;/p&gt;

&lt;p&gt;The fix is three calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx.core&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;
&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_wired_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# pin the model in resident RAM
&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_cache_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# cap Metal compile/scratch
&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_memory_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# fail loud above this, don't OOM
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Works on M1 and M2 generations. I haven't tested on M3 or M4 Pro / Max, but the API is the same and the underlying Metal behavior should be too.&lt;/p&gt;

&lt;p&gt;The broader lesson I'm taking away: unified memory is a genuine advantage for local-first AI, but it inherits the OS's defaults for normal application memory. A 35 GB working set of neural-network weights is not what macOS's memory manager was designed for. The API to tell it "treat this differently" is there; I just had to know it existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I'm packaging the full hygiene layer as a small open-source helper — tentatively &lt;code&gt;mlx-memory-safe&lt;/code&gt; — so anyone running MLX on a Mac can drop it in with one import instead of reading three sections of this post to rediscover the same fixes. Should land on GitHub and PyPI in the next week or two, with a separate write-up of the package internals.&lt;/p&gt;

&lt;p&gt;If you've hit something similar, or if you've tested &lt;code&gt;set_wired_limit&lt;/code&gt; on M3/M4 and seen different behavior, I'd love to hear about it. I still don't have a clean mental model for when &lt;code&gt;shared&lt;/code&gt; storage mode pages leave the wired set under real-world pressure, and that gap is the next thing I want to understand.&lt;/p&gt;

&lt;p&gt;Come along for the ride.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This post reflects one solo operator's configuration on one M1 Max with 64 GB of unified memory in April 2026, running MLX + Qwen 3.6 35B-A3B Q8. Specific numbers (compressed GB, tok/s, wired limit) will differ on other hardware, other models, and other workloads. Test on your own setup before adopting any threshold as a default.&lt;/p&gt;

</description>
      <category>mlx</category>
      <category>apple</category>
      <category>ai</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why Apple Silicon Quietly Won the Local-AI Race (April 2026)</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 09:30:37 +0000</pubDate>
      <link>https://dev.to/gugubibi/why-apple-silicon-quietly-won-the-local-ai-race-april-2026-34g7</link>
      <guid>https://dev.to/gugubibi/why-apple-silicon-quietly-won-the-local-ai-race-april-2026-34g7</guid>
      <description>&lt;h1&gt;
  
  
  Why Apple Silicon Quietly Won the Local-AI Race (April 2026)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Executive summary
&lt;/h2&gt;

&lt;p&gt;While the public AI narrative is dominated by capex wars and cloud GPU shortages, a quieter shift has happened on the desktop. A single Apple Silicon laptop with 64GB of unified memory now runs a 35-billion-parameter mixture-of-experts model at usable speed, with no API key, no rate limit, and no per-token bill. SleepyQuant — a build-in-public AI quant trading project — runs its full 12-agent stack on one M1 Max. Last week we swapped the primary inference model from a 4-bit to an 8-bit quantization. RAM went from about 19GB to about 35GB active. Decode speed dropped from roughly 50 tokens per second to about 10. The post that follows is the honest account of that trade, why it was the right call, and what unified memory architecture actually changes for anyone trying to ship local-first AI in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thesis
&lt;/h2&gt;

&lt;p&gt;The default assumption of the last two years is that meaningful AI requires meaningful infrastructure: a data center, a GPU cluster, an API contract. Apple's hardware bet quietly inverts that assumption for a specific category of work — single-operator inference of capable open-weight models on commodity hardware.&lt;/p&gt;

&lt;p&gt;The mechanism is unified memory architecture, or UMA. On a traditional desktop, the CPU and GPU each own separate memory pools. To run a large model on the GPU, the model weights must be copied across the PCIe bus, then activations move back and forth for every layer. The cost is latency, energy, and an effective ceiling on model size set by the GPU's dedicated VRAM. On Apple Silicon, CPU, GPU, and Neural Engine cores share one unified memory pool on the same package. There is no copy step. The same 64GB of physical RAM is available to whichever processing unit needs it, in whatever ratio the workload demands.&lt;/p&gt;

&lt;p&gt;This sounds like an engineering footnote. It is not. It is the mechanism that lets a 35B-parameter model fit and run on a $4,000 laptop instead of an $80,000 server. For workloads that are bounded by single-user inference latency and privacy — exactly the workloads small builders, indie developers, and solo operators care about — that changes the economics of building with AI from "raise a seed round for compute" to "buy the laptop."&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep dive: what we actually run
&lt;/h2&gt;

&lt;p&gt;SleepyQuant runs on one M1 Max with 64GB of unified memory. The primary inference engine is the MLX framework — Apple's open-source machine learning library tuned for Apple Silicon. The model is Qwen 3.6 35B-A3B, a sparse mixture-of-experts (MoE) architecture, served at 8-bit quantization. The active model footprint is around 35GB. With Python's process overhead and the rest of the agent stack loaded, total active and wired memory sits between 44GB and 47GB. That leaves a sliver of headroom under the 48GB practical ceiling we set for ourselves before macOS starts swap-paging into the SSD and the user-visible latency falls off a cliff.&lt;/p&gt;

&lt;p&gt;Decode throughput at 8-bit is approximately 10 tokens per second. At 4-bit, the same model decoded at 49–60 tokens per second. The 5x slowdown is real, and it is not free. The reason we accepted it is that 8-bit is meaningfully sharper on data-aware tasks — content evaluation against a fact list, fabrication detection in generated drafts, structured output parsing. For a build-in-public project where every published number should be defensible, "slightly slower but more truthful" is the right trade. For a real-time chat application, it would not be.&lt;/p&gt;

&lt;p&gt;The sparse MoE design adds one more wrinkle. Qwen 3.6 35B-A3B activates only ~3B parameters per token, which is what makes its decode throughput tractable on commodity hardware in the first place. But MoE models degenerate into repetitive word-salad when forced to generate long single completions — anything past about 500 output tokens reliably produces collapsing prose where the same phrases re-circulate. The fix is not "buy a denser model"; the fix is sectional generation. Long content gets split into 250–400-token sections that are generated independently and concatenated. The model never has to hold a 1500-word output in its working window at once. This is a structural workaround for an architectural property of MoE, not a hack.&lt;/p&gt;
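
&lt;p&gt;Sectional generation is a simple loop in practice; &lt;code&gt;generate&lt;/code&gt; here stands in for the real MLX decode call, and the per-section token budget is the 250–400 range described above:&lt;/p&gt;

```python
def generate_long_form(outline, generate, max_section_tokens=400):
    """Generate each section independently and concatenate.

    `outline` is a list of section prompts; `generate` is any callable
    mapping (prompt, max_tokens) to text -- a stand-in for the MLX call.
    The model never holds the full document in one completion, which is
    what sidesteps the MoE long-output degeneration.
    """
    sections = []
    for prompt in outline:
        sections.append(generate(prompt, max_tokens=max_section_tokens))
    return "\n\n".join(sections)
```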

&lt;p&gt;On top of that base inference layer, SleepyQuant orchestrates twelve specialized agents — content drafting, quality evaluation, trading scan, risk analysis, news ingestion, and so on — sharing the single MLX runtime through a sequential lock that prevents two simultaneous Metal GPU calls from crashing the device. The lock turns into a priority queue: user-facing chat outranks agent tool calls, which outrank background automation. Twelve agents share one inference engine, not twelve cloud endpoints.&lt;/p&gt;
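
&lt;p&gt;The lock-turned-priority-queue can be sketched with the standard library; the three tiers and their ordering are from the description above, the rest is illustrative:&lt;/p&gt;

```python
import heapq
import itertools

# Lower number = higher priority: chat outranks agent tool calls,
# which outrank background automation.
PRIORITY = {"chat": 0, "agent_tool": 1, "background": 2}

class InferenceQueue:
    """Single-consumer queue: one Metal call at a time, highest tier first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tiebreak within a tier

    def submit(self, tier, request):
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), request))

    def next_request(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A worker thread pops one request at a time, which is what keeps two simultaneous Metal calls from ever being issued.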

&lt;p&gt;The full operational footprint: one laptop, one model on disk, one Metal-bound process, no recurring infrastructure cost. The bill of materials is the laptop and the electricity to run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counter-argument: when Apple Silicon loses
&lt;/h2&gt;

&lt;p&gt;The story above is selective. Apple Silicon is the wrong tool for several common AI workloads, and pretending otherwise sets up failure.&lt;/p&gt;

&lt;p&gt;Training is the obvious one. Pre-training a foundation model from scratch, or even continued pre-training on a domain-specific corpus, demands cluster-grade compute and high-bandwidth interconnects that consumer hardware does not provide. The unified memory advantage works in the inference direction; in the training direction, dedicated GPU farms remain dominant.&lt;/p&gt;

&lt;p&gt;Multi-tenant serving is the second loss case. A single MLX-bound laptop serves one inference at a time through a lock. That works for a solo operator running an internal stack. It does not work for a SaaS product with concurrent users, where horizontal scaling on cloud GPU is the rational architecture.&lt;/p&gt;

&lt;p&gt;High-throughput batch inference is the third. If the workload is "score 100,000 documents tonight," a multi-GPU server with batched attention will eat the laptop's lunch. The laptop wins on per-token cost for low volume; cloud batch wins on throughput per dollar at scale.&lt;/p&gt;

&lt;p&gt;Continuous fine-tuning is the fourth, and the one most people forget. The Apple Silicon stack excels at running pre-trained models efficiently. It is weaker at adapting them quickly. If the strategy depends on retraining on yesterday's market data every night to stay competitive, single-laptop inference is a structural disadvantage compared to a hedge fund operating its own GPU cluster.&lt;/p&gt;

&lt;p&gt;These limitations are real. They constrain where the local-first thesis applies. They do not invalidate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;The local-first Apple Silicon stack is the right answer for a specific shape of project: a single operator (or small team), inference-dominant workloads, sensitivity to per-token cost, sensitivity to data leaving the machine, and acceptable latency at the throughput a sequential lock allows. Build-in-public projects, indie research, internal tooling, privacy-sensitive personal automation — all of these fit the shape.&lt;/p&gt;

&lt;p&gt;For training, multi-tenant serving, high-throughput batch, and continuous fine-tuning at production scale, the cloud GPU stack remains the right answer.&lt;/p&gt;

&lt;p&gt;What changed in 2026 is not that Apple Silicon is suddenly competitive everywhere. What changed is that the band of workloads for which a single laptop is sufficient has widened to include things that, two years ago, demanded a serious infrastructure budget. A 35B-parameter MoE running on one M-series chip at 10 tokens per second is not a benchmark to brag about against H100 clusters. It is, however, a baseline good enough to run a real product, on a real budget, with no vendor in the loop. For a category of builders who used to be priced out of meaningful AI infrastructure, that is the entire point.&lt;/p&gt;

&lt;p&gt;More posts in this series — including the honest 4-bit vs 8-bit benchmark numbers, the sectional generation pattern in detail, and the 12-agent priority-queue design — live in the &lt;a href="https://sleepyquant.rest/blog/" rel="noopener noreferrer"&gt;SleepyQuant blog archive&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This post is engineering observation, not financial or hardware purchasing advice. Specific tokens-per-second numbers reflect the SleepyQuant configuration on one M1 Max with 64GB unified memory in April 2026; results on other hardware or quantizations will differ. Verify benchmarks against your own workload before making allocation decisions.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>SleepyQuant Weekly · 2026W16</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 08:26:56 +0000</pubDate>
      <link>https://dev.to/gugubibi/sleepyquant-weekly-2026w16-1lb0</link>
      <guid>https://dev.to/gugubibi/sleepyquant-weekly-2026w16-1lb0</guid>
      <description>&lt;h2&gt;
  
  
  This week in paper trading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Round-trips: 464&lt;/li&gt;
&lt;li&gt;Win rate: 38.1%&lt;/li&gt;
&lt;li&gt;Realized PnL: -34.58 USDT&lt;/li&gt;
&lt;li&gt;Net return: +20.23%&lt;/li&gt;
&lt;li&gt;Max drawdown: 3.14%&lt;/li&gt;
&lt;li&gt;R:R ratio: 0.8&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure vault: what broke, what changed
&lt;/h2&gt;

&lt;p&gt;Past 7 days · 49 losing trades · total -24.63 USDT&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Slippage&lt;/strong&gt; cluster × 25 across APT/USDT, BNB/USDT, ETH/USDT, LINK/USDT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Failure&lt;/strong&gt; cluster × 24 across APT/USDT, ARB/USDT, ATOM/USDT, AVAX/USDT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APT/USDT&lt;/strong&gt; — Execution Slippage × 5 (-1.32 USDT, avg -0.26 per trade)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Strategy adjustments shipped / queued for next week:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[65% conf]&lt;/strong&gt; Scanner-wide: cut position size 25% + tighten stop loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[60% conf]&lt;/strong&gt; Global: scan interval 8 → 12 minutes to filter noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[85% conf]&lt;/strong&gt; Temporarily remove APT/USDT from scan list for 48 hours&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  News that mattered
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔥 Trending: Bio Protocol (BIO) — Rank #365 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Pudgy Penguins (PENGU) — Rank #108 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: RaveDAO (RAVE) — Rank #33 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Based (BASED) — Rank #722 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Bitcoin (BTC) — Rank #1 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One operating insight
&lt;/h2&gt;

&lt;p&gt;The main lesson this week is simple: trust the quiet tape.&lt;/p&gt;

&lt;p&gt;When the engine scans widely but trades narrowly, that usually means the filters are doing their job. A lower trade count is cheaper than forcing mediocre entries, especially when the failure vault is already pointing at repeat mistakes like noisy confirmation, weak follow-through, or execution drift. The right response is not "make the bot trade more." The right response is to tighten the decision path, preserve RAM for the live stack, and keep publishing the real numbers so the system can keep learning in public.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack and infra
&lt;/h2&gt;

&lt;p&gt;The stack right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple M1 Max, 64GB unified memory&lt;/li&gt;
&lt;li&gt;MLX Qwen 3.6 35B-A3B 8-bit quant (primary inference)&lt;/li&gt;
&lt;li&gt;A lightweight CLI layer for build-time automation&lt;/li&gt;
&lt;li&gt;12 AI agents coordinating in one local process&lt;/li&gt;
&lt;li&gt;Binance spot + futures paper trading via ccxt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model swap from 4-bit to 8-bit this week traded raw decode speed (about 50 tokens per second down to about 10) for sharper data-aware evaluation. Worthwhile for content quality scoring; less worthwhile for high-frequency scan loops, which still rely on cached deterministic signals.&lt;/p&gt;




&lt;p&gt;If you're building local-first trading systems, hit reply and tell me what you optimize for first: speed, cost, or control. The next issue covers the inverted-control experiment: running the same signal backward on a parallel paper book to test whether the edge is real or whether the bot is anti-correlated with itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compiled from live operating data. Every number in this issue came from the running system, not a deck.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>The Inverted Control: What 24 Hours of Running Our Own Bot Backwards Revealed</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 03:37:25 +0000</pubDate>
      <link>https://dev.to/gugubibi/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed-402g</link>
      <guid>https://dev.to/gugubibi/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed-402g</guid>
      <description>&lt;h1&gt;
  
  
  The Inverted Control: What 24 Hours of Running Our Own Bot Backwards Revealed
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Executive Summary
&lt;/h2&gt;

&lt;p&gt;After roughly 500 paper round-trips showed a persistent sub-35% win rate with average losses larger than average wins, we stopped scaling the live side and ran a cheap experiment: a second paper book that executes the exact opposite of every signal the bot produces, on the same universe, same cadence, same fee model.&lt;/p&gt;

&lt;p&gt;Twenty-four hours in, the inverted book is winning 70.59% of round-trips versus 15.79% on the standard book. Both books are still losing in absolute terms because fees dominate at small sample. The important number is not the win rate gap. It is whether the inverted book's gross edge clears the fee floor by the time we hit the 100-round-trip decision point, roughly 8 to 12 days out.&lt;/p&gt;

&lt;p&gt;This post walks through the setup, the data so far, where the reading could be wrong, and the specific decision that happens at 100 round-trips.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Thesis
&lt;/h2&gt;

&lt;p&gt;A bot that loses more than random is either extracting no signal, or extracting signal with the sign reversed. Those two hypotheses produce identical win-rate readings in a one-book world. They are only separable by running a second book with the signal flipped.&lt;/p&gt;

&lt;p&gt;The second hypothesis is rarer but well-documented: overfit features trained on stale microstructure, labels that got reversed in a pipeline step, crowding where yesterday's "bullish" marker is now a faded trade. None of those are visible from inside a single losing book. All of them flip sign when you flip the signal.&lt;/p&gt;

&lt;p&gt;Running the inverted control is the lowest-cost diagnostic that distinguishes the two hypotheses. In the first hypothesis (no signal), the inverted book converges to the same losing distribution, minus fee drag. In the second hypothesis (inverted signal), the inverted book diverges: higher win rate, smaller loss magnitude, possibly net-positive once sample grows past fee-drag territory.&lt;/p&gt;

&lt;p&gt;The point of running the control is not to find a winning strategy. It is to stop guessing about which of those two worlds the bot is actually in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Two paper books, same engine, same universe, same fee schedule.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book 1 — standard signal.&lt;/strong&gt; Every decision from the scanner is executed as issued. LONG is LONG, BUY is BUY.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Book 2 — inverted mirror.&lt;/strong&gt; Every decision is flipped programmatically before execution. LONG becomes SHORT, BUY becomes SELL (or hold, since the spot lane is accumulate-only during this window, making the flip mostly a futures test).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both books start from identical simulated ~$1000 balances. Both pay realistic exchange-tier fees on open and close — no free-trade assumption, which is where most inversion backtests fail.&lt;/p&gt;

&lt;p&gt;Universe: 30 USDT pairs on a major exchange, perps plus spot. Scan cadence 15 minutes. Leverage cap 3x. Drawdown hard stop 8% per book. Spot exit signals ignored in Book 2 for this window — the test isolates the futures direction bet.&lt;/p&gt;

&lt;p&gt;The test completes at 100 post-flip round-trips on Book 2. At that point one of three decisions is on the table.&lt;/p&gt;
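&lt;p&gt;The flip itself is a lookup table, not a model change. A minimal sketch of the mirror, with illustrative names rather than the production code:&lt;/p&gt;

```python
# Sketch of the Book 2 mirror. Names are illustrative, not the actual code.
FLIP = {
    "LONG": "SHORT",   # futures direction bet, fully mirrored
    "SHORT": "LONG",
    "BUY": "SELL",     # spot; executed as a hold while the lane is accumulate-only
}

def invert(signal: str) -> str:
    """Anything without a defined mirror (e.g. HOLD) passes through unchanged."""
    return FLIP.get(signal, signal)

print(invert("LONG"), invert("HOLD"))  # SHORT HOLD
```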

&lt;h2&gt;
  
  
  Deep Dive: 24 Hours of Parallel Data
&lt;/h2&gt;

&lt;p&gt;Windowed to the period since the flip went live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book 1 — standard.&lt;/strong&gt; 38 round-trips closed. Win rate 15.79%. Net result negative on the order of tens of USD.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Book 2 — inverted.&lt;/strong&gt; 17 round-trips closed. Win rate 70.59%. Net result also negative, but by a much smaller per-round-trip magnitude (roughly 25x better than standard).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The win-rate gap from 15.79% to 70.59% is the headline. Measured against the signal's own historical 34% baseline, a run like this is unlikely to be pure chance, though at this sample variance still has room to surprise. A purely random signal in this setup would produce win rates clustering around 45-55% on both books. A noise signal (first hypothesis) would produce roughly symmetric rates on both books. What shows up instead — an asymmetric split heavily favoring the inverse — is the fingerprint of a signal that carries information with the wrong sign.&lt;/p&gt;
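&lt;p&gt;The chance claim is cheap to quantify with stdlib Python. Book 2's 70.59% on 17 round-trips is 12 wins; the exact binomial tail for that run is roughly 7% against a fair coin but well under 1% against the signal's own 34% historical rate, which is why the comparison to the book's own history, not to a coin, carries the argument:&lt;/p&gt;

```python
from math import comb

def binom_tail(wins: int, n: int, p: float) -> float:
    """P(X >= wins) for X ~ Binomial(n, p), exact, stdlib only."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(wins, n + 1))

# Book 2: 12 of 17 round-trips won (70.59%).
p_coin = binom_tail(12, 17, 0.50)  # tail probability vs. a fair coin
p_hist = binom_tail(12, 17, 0.34)  # tail probability vs. the historical 34% rate
print(f"{p_coin:.3f} {p_hist:.4f}")  # 0.072 0.0023
```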

&lt;p&gt;Per-symbol, the inversion's effect is not uniform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Book 1 WR&lt;/th&gt;
&lt;th&gt;Book 2 WR&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ZEC/USDT&lt;/td&gt;
&lt;td&gt;12.5% (8 RT)&lt;/td&gt;
&lt;td&gt;80.0% (5 RT)&lt;/td&gt;
&lt;td&gt;Inversion strongly helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARB/USDT&lt;/td&gt;
&lt;td&gt;25.0% (4 RT)&lt;/td&gt;
&lt;td&gt;100% (3 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOGE/USDT&lt;/td&gt;
&lt;td&gt;0.0% (5 RT)&lt;/td&gt;
&lt;td&gt;100% (2 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNI/USDT&lt;/td&gt;
&lt;td&gt;0.0% (4 RT)&lt;/td&gt;
&lt;td&gt;100% (1 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps (micro sample)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BCH/USDT&lt;/td&gt;
&lt;td&gt;0.0% (1 RT)&lt;/td&gt;
&lt;td&gt;100% (1 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps (micro sample)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NEAR/USDT&lt;/td&gt;
&lt;td&gt;28.6% (7 RT)&lt;/td&gt;
&lt;td&gt;0.0% (2 RT)&lt;/td&gt;
&lt;td&gt;Inversion hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ADA/USDT&lt;/td&gt;
&lt;td&gt;50.0% (4 RT)&lt;/td&gt;
&lt;td&gt;33.3% (3 RT)&lt;/td&gt;
&lt;td&gt;Inversion hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Five of seven symbols with both-book data favor inversion. Two do not. The symbols where inversion fails are the ones where the standard book was already near or above 30% — consistent with an "invert only what's clearly broken, leave the rest" hybrid strategy that may emerge at higher sample.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fee Floor
&lt;/h3&gt;

&lt;p&gt;Every round-trip costs roughly the open-plus-close fee on a major exchange, and each book pays it independently. With Book 2 running in parallel, total fees double.&lt;/p&gt;

&lt;p&gt;That doubles the bar. Book 2's improvement in gross profit-and-loss has to clear two fee stacks, not one. An inversion signal that wins on gross but gets eaten by the fee floor is a classic mean-reversion trap: backtests that ignore fees look clean, but the live book pays them either way and bleeds out.&lt;/p&gt;

&lt;p&gt;At 17 round-trips, Book 2's net-negative result is dominated by fee drag, not by losses on individual trades. The interesting question is whether that fee drag, as a percentage of gross result, shrinks as sample grows. If the gross per-round-trip edge holds at roughly current magnitude, net-positive becomes plausible around round-trip 50-70. If the gross edge compresses as the signal gets noisier at larger sample, net-positive never arrives.&lt;/p&gt;
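&lt;p&gt;The fee-floor arithmetic is simple enough to sketch. Every number below is a placeholder for illustration; the fee tier and gross edge are assumptions, not the book's actual figures:&lt;/p&gt;

```python
# Back-of-envelope fee floor with assumed (not actual) numbers.
fee_per_side = 0.0005          # hypothetical 0.05% taker fee per side
fee_rt = 2 * fee_per_side      # each round-trip pays open + close

def net_pnl(n_rts: int, gross_edge_rt: float) -> float:
    """Cumulative net fraction after n round-trips at a constant gross edge."""
    return n_rts * (gross_edge_rt - fee_rt)

# If the gross edge holds at, say, 0.15% per round-trip (assumed), the book
# clears the floor; at 0.08% it never does, and the parallel standard book
# is burning its own fee stack on top.
print(round(net_pnl(50, 0.0015), 4), round(net_pnl(50, 0.0008), 4))  # 0.025 -0.01
```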

&lt;h2&gt;
  
  
  Counter-Argument: Why This Reading Could Be Wrong
&lt;/h2&gt;

&lt;p&gt;Taking the opposite side of our own preliminary conclusion:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample is too small.&lt;/strong&gt; Seventeen round-trips on Book 2 is the sample size a drunk person at a blackjack table has after twenty minutes. Win-rate distributions at n=17 are wide enough that a 70.59% result can reverse to 35% over the next 30 trips without surprising anyone. Any reading here is provisional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent regime shift.&lt;/strong&gt; The standard book's historical 34% win rate was compiled over weeks. The 15.79% since the flip is over 24 hours. A regime change (one market day of trend-heavy action on symbols the scanner dislikes, for example) could compress the standard book's rate artificially without the underlying signal being any more broken than it was a week ago. That would make the inversion's apparent edge a mirage of timing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asymmetric fee burn.&lt;/strong&gt; Book 2's inverted futures positions may open and close in ways that pay funding rate differently than Book 1's. If the test period coincides with a funding regime that favors one side, some of the apparent gross edge is just "Book 2 happened to be on the right side of funding this week."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The symbols where inversion fails are the ones we actually trade most.&lt;/strong&gt; The test might reveal that inversion works only on low-activity symbols, while the symbols driving Book 1's meaningful losses (higher-sample names like BTC, ETH, SOL, which Book 2 has not yet traded in this window) are not in the inverted-signal camp. A strategy that only works on low-volume names is not a strategy worth running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The signal might be improving organically.&lt;/strong&gt; Book 1's live standard-signal win rate (across all history, not just this window) has been creeping toward 34% from the 27% it hit in the worst stretch earlier in April. If the signal is already self-correcting, the inversion's apparent edge evaporates before the test window closes.&lt;/p&gt;

&lt;p&gt;Any one of those could be what is actually going on. We are not going to know until the sample grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;The decision point is 100 round-trips on Book 2, expected 8 to 12 days out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands net-positive with win rate above 55%:&lt;/strong&gt; the inversion locks in. The live signal gets flipped permanently, along with the take-profit and stop-loss asymmetry (swap from 3% TP / -2% SL to 2% TP / -3% SL to match the inverted payoff shape). Live trading remains paused until the paper side clears a 30-day rolling benchmark of Binance Simple Earn at roughly 0.42% per month — the honest passive bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands net-negative or drawdown exceeds 8%:&lt;/strong&gt; the futures lane is disabled entirely. Spot accumulation remains. The diagnosis shifts from "inverted signal" to "no signal," and the rebuild restarts on features, not direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands mixed — gross positive but net-negative, or win rate high but below 55%:&lt;/strong&gt; the hybrid path becomes the next experiment. Invert only the symbols where Book 1's rolling win rate sits below 40%. Leave the ones above 40% standard. Re-run the control on that subset.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the reader should take from this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If you are running a paper book that loses more than random:&lt;/strong&gt; run the inverted control before killing the strategy. The setup is one column in the trades table (&lt;code&gt;book_id&lt;/code&gt;) and one branch in the execute function. The cost is near zero, the answer is close to binary, and the information gained far exceeds the cost.&lt;/p&gt;
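&lt;p&gt;In practice, the "one column, one branch" setup looks roughly like this, sketched with an in-memory SQLite table. Schema and function names are hypothetical, not the actual SleepyQuant code:&lt;/p&gt;

```python
import sqlite3

# The one column: book_id distinguishes the two paper books in one table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (symbol TEXT, side TEXT, book_id TEXT)")

FLIP = {"LONG": "SHORT", "SHORT": "LONG"}

def execute_trade(symbol: str, side: str, book_id: str) -> None:
    if book_id == "inverted":  # the one branch
        side = FLIP.get(side, side)
    db.execute("INSERT INTO trades VALUES (?, ?, ?)", (symbol, side, book_id))

execute_trade("ZEC/USDT", "LONG", "standard")
execute_trade("ZEC/USDT", "LONG", "inverted")
rows = db.execute("SELECT side, book_id FROM trades ORDER BY rowid").fetchall()
print(rows)  # [('LONG', 'standard'), ('SHORT', 'inverted')]
```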

&lt;p&gt;&lt;strong&gt;If you are watching SleepyQuant for the outcome:&lt;/strong&gt; the result arrives at 100 round-trips. We publish either an "inversion locks in, here is the updated config" or a "futures lane disabled, here is why" — whichever the numbers say, not whichever is more flattering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are here for the general lesson:&lt;/strong&gt; a losing signal is not automatically noise. Sometimes it is a working signal with the sign reversed. The diagnostic is cheap. The implication — that your model has been right about structure and wrong about direction — is unusual enough that most builders never check. The check itself is worth more than the result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow the experiment
&lt;/h2&gt;

&lt;p&gt;We publish one email per week with the round-trip count, the current win rates on both books, the fee-drag ratio, and whatever the honest read is at that point. No trading advice, no signals, no "buy at X." Just the numbers and what we are and are not willing to conclude from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscribe at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;&lt;/strong&gt; → the verdict lands in your inbox.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Show HN: SleepyQuant – a 12-agent crypto quant running on one Mac</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:47:25 +0000</pubDate>
      <link>https://dev.to/gugubibi/show-hn-sleepyquant-a-12-agent-crypto-quant-running-on-one-mac-4dhh</link>
      <guid>https://dev.to/gugubibi/show-hn-sleepyquant-a-12-agent-crypto-quant-running-on-one-mac-4dhh</guid>
      <description>&lt;h1&gt;
  
  
  Show HN: SleepyQuant – a 12-agent crypto quant running on one Mac
&lt;/h1&gt;

&lt;p&gt;Hey everyone,&lt;/p&gt;

&lt;p&gt;SleepyQuant is a solo experiment I've been running for the last couple of weeks: 12 local AI agents coordinating a paper crypto trading book on a single Apple M1 Max. No cloud inference, no API bills, no vendor black box. Every agent prompt, every losing trade, every round-trip gets written up weekly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack (all local):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple M1 Max, 64 GB RAM&lt;/li&gt;
&lt;li&gt;MLX Qwen 2.5 32B Q8 as the primary agent model&lt;/li&gt;
&lt;li&gt;DeepSeek R1 14B Q8 as a lazy-loaded reasoning lane for research tasks&lt;/li&gt;
&lt;li&gt;Priority queue on the MLX inference lock so user chat preempts automation&lt;/li&gt;
&lt;li&gt;FastAPI backend, SwiftUI macOS app, SQLite for state, ChromaDB for agent memory&lt;/li&gt;
&lt;li&gt;Binance paper via ccxt, spot + futures, 70/30 allocation, 10x leverage on the futures lane&lt;/li&gt;
&lt;/ul&gt;
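&lt;p&gt;The priority queue on the inference lock is the piece people ask about most, so here is a minimal sketch of the pattern: waiters queue with a priority, and release hands the lock to the lowest number first, so user chat (priority 0) jumps ahead of queued automation. This is an illustration of the idea, not the production implementation:&lt;/p&gt;

```python
import heapq
import itertools
import threading
import time

class PriorityLock:
    """Mutex whose waiters are served lowest-priority-number first."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = False
        self._waiters = []             # heap of (priority, seq, event)
        self._seq = itertools.count()  # FIFO tie-break within a priority level

    def acquire(self, priority: int) -> None:
        with self._mutex:
            if not self._held and not self._waiters:
                self._held = True
                return
            ev = threading.Event()
            heapq.heappush(self._waiters, (priority, next(self._seq), ev))
        ev.wait()  # block until release() hands us the lock

    def release(self) -> None:
        with self._mutex:
            if self._waiters:
                _, _, ev = heapq.heappop(self._waiters)
                ev.set()               # hand off directly to the best waiter
            else:
                self._held = False

plock = PriorityLock()
plock.acquire(priority=1)              # automation currently holds the lock
order = []

def worker(p):
    plock.acquire(p)
    order.append(p)
    plock.release()

threads = [threading.Thread(target=worker, args=(p,)) for p in (3, 0, 2)]
for t in threads:
    t.start()
time.sleep(0.2)                        # let all three queue behind the holder
plock.release()
for t in threads:
    t.join()
print(order)                           # [0, 2, 3]: the chat-priority waiter goes first
```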

&lt;p&gt;&lt;strong&gt;What's deliberately boring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The paper book is roughly $78 equivalent. Not a typo. The real-mode transition gate requires three consecutive green days before anything touches real capital, and even then the first real trade is capped tiny. If the strategy can't handle $78, I'd rather find out for free.&lt;/li&gt;
&lt;li&gt;Tight scalp TP/SL (2.0% / -1.5% on futures) with a hard -8% daily drawdown stop.&lt;/li&gt;
&lt;li&gt;Every losing trade gets a post-mortem. The failure vault is public in the weekly newsletter, with root-cause classification (technical / news / execution slippage) and the exact param changes shipped as a response.&lt;/li&gt;
&lt;li&gt;Funding rate guard — refuses to open futures positions when our side is paying extreme funding. Shipped after the scanner was quietly bleeding basis points for three days straight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agents (one role each):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A COO / dispatcher, a trading lead, separate futures + spot executors, a CFO, a CTO with filesystem + shell tools, an R&amp;amp;D / failure analyst, a legal / compliance officer, a resource monitor, a QA engineer, a news intelligence watcher, and a content / SEO writer.&lt;/p&gt;

&lt;p&gt;Each agent has a focused system prompt + a small set of skill handlers. The COO routes CEO requests to the right specialist instead of one monolithic agent trying to do everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live paper P&amp;amp;L widget + weekly newsletter:&lt;/strong&gt; &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;https://sleepyquant.rest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two things I'd genuinely want feedback on — please weigh in below:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is 12 agents worth the routing overhead?&lt;/strong&gt; Or would a single bigger agent with tool use be cleaner at this scale? I keep flip-flopping and would love to hear from anyone who's been through the same decomposition choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLX unload strategies on Apple Silicon?&lt;/strong&gt; Right now my reasoning model auto-unloads after 2 minutes idle, which works but feels crude. If you're running MLX in production on a Mac, how do you free RAM when you need it back?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
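&lt;p&gt;For context on question 2, the crude 2-minute auto-unload is essentially a resettable timer. A generic sketch, with illustrative names: &lt;code&gt;unload_fn&lt;/code&gt; stands in for whatever actually frees your model (for MLX, typically dropping the model reference and clearing the Metal buffer cache, if your version exposes a cache-clear call):&lt;/p&gt;

```python
import threading
import time

class IdleUnloader:
    """Call touch() after every inference; unload_fn fires after idle_s of silence."""

    def __init__(self, unload_fn, idle_s: float = 120.0):
        self._unload_fn = unload_fn
        self._idle_s = idle_s
        self._timer = None
        self._lock = threading.Lock()

    def touch(self) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()   # activity resets the countdown
            self._timer = threading.Timer(self._idle_s, self._unload_fn)
            self._timer.daemon = True
            self._timer.start()

# Demo with a short threshold instead of 120 s:
unloaded = []
u = IdleUnloader(lambda: unloaded.append(time.time()), idle_s=0.1)
u.touch()             # pretend an inference just finished
time.sleep(0.3)       # stay idle past the threshold
print(len(unloaded))  # 1
```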

&lt;p&gt;&lt;strong&gt;Try it or follow along:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live paper P&amp;amp;L widget + weekly write-up:&lt;/strong&gt; &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;https://sleepyquant.rest&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscribe to the weekly post-mortem newsletter&lt;/strong&gt; — Beehiiv, free, one email per week, no upsells, no signals, no affiliate links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cadence:&lt;/strong&gt; every Tuesday. If the book dies, I'll write up that too&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions in the comments about the architecture, the failure vault, the priority queue design, or why local-first LLM agents are worth the effort on a 64 GB machine. Fire away.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>SleepyQuant — Twitter brand assets (bio + pinned tweet)</title>
      <dc:creator>Gugubibi</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:47:21 +0000</pubDate>
      <link>https://dev.to/gugubibi/sleepyquant-twitter-brand-assets-bio-pinned-tweet-1joe</link>
      <guid>https://dev.to/gugubibi/sleepyquant-twitter-brand-assets-bio-pinned-tweet-1joe</guid>
      <description>&lt;h2&gt;
  
  
  Profile
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Display name (50 chars max):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SleepyQuant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bio (160 chars max — landing + newsletter + 1-line pitch):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI trades while the CEO sleeps. 12 local agents + one Mac M1 Max running a paper crypto book in public. Weekly post-mortems, zero hype.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(139 chars — room for a trailing link to sleepyquant.rest in the website field rather than in the bio text itself.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Location:&lt;/strong&gt; &lt;code&gt;Runs on a Mac in a closet&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Website:&lt;/strong&gt; &lt;code&gt;https://sleepyquant.rest&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pinned tweet
&lt;/h2&gt;

&lt;p&gt;One tweet, no thread. Meant to be the first thing a new visitor sees. No question on purpose — it's a brand statement, not a conversation opener.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One Mac. 12 AI agents. A $78 paper crypto book.

I run a quant experiment while I sleep and post the whole journey — every win, every dumb loss, every architecture note — every week.

Live P&amp;amp;L + the newsletter: sleepyquant.rest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(278 chars — right under the 280 limit, no line-break tricks, reads in one pass.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternate pinned tweet (if the first one feels too cold)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I ship weekly regardless of wins or losses.

Week 1 on paper: +2.65%, 9 round-trips, 3 losses with full post-mortems, funding-rate guard shipped mid-week.

Everything runs locally on one Mac. sleepyquant.rest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(242 chars.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep the bio and pinned tweet aligned on tone. Reader should see the bio, then the pinned, and the two should feel like one voice.&lt;/li&gt;
&lt;li&gt;Don't use "crypto trading bot" — implies signals and gets flagged by X ad policy. Use "paper crypto book" or "quant experiment".&lt;/li&gt;
&lt;li&gt;Update the pinned weekly — roll in the latest round-trip number so it never feels stale. The alternate version is a good template for that weekly refresh.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
