<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SleepyQuant</title>
    <description>The latest articles on DEV Community by SleepyQuant (@sleepyquant).</description>
    <link>https://dev.to/sleepyquant</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885340%2F8cd2f97f-12d9-43c1-ace7-84a4532d823b.png</url>
      <title>DEV Community: SleepyQuant</title>
      <link>https://dev.to/sleepyquant</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sleepyquant"/>
    <language>en</language>
    <item>
      <title>I Run a 40GB AI Model on a MacBook. Three Months of MLX on M1 Max Has Changed How I Think About Apple Silicon.</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:53:02 +0000</pubDate>
      <link>https://dev.to/sleepyquant/i-run-a-40gb-ai-model-on-a-macbook-three-months-of-mlx-on-m1-max-has-changed-how-i-think-about-h6j</link>
      <guid>https://dev.to/sleepyquant/i-run-a-40gb-ai-model-on-a-macbook-three-months-of-mlx-on-m1-max-has-changed-how-i-think-about-h6j</guid>
      <description>&lt;h1&gt;
  
  
  I Run a 40GB AI Model on a MacBook. Three Months of MLX on M1 Max Has Changed How I Think About Apple Silicon.
&lt;/h1&gt;

&lt;h2&gt;
  
  
  It's Just a Laptop. But It's Running a 40GB Model Right Now.
&lt;/h2&gt;

&lt;p&gt;I'm drafting this on a MacBook Pro. Qwen 3.6 35B-A3B MoE Q8 — about 40GB of weights — is pinned in Metal memory right now, and the fan is quiet.&lt;/p&gt;

&lt;p&gt;That sentence still feels weird to write. A year ago I would have assumed "run a 35B model locally" meant a dedicated rig with an H100, or at least a pair of 4090s. Turns out it means a MacBook Pro M1 Max with the 64GB unified memory variant, MLX, and about a weekend of config tuning.&lt;/p&gt;

&lt;p&gt;This post is a three-month dev diary on that setup. Not a product review. Not a "10x your AI productivity" take. Just what I've learned that isn't in the Apple keynote or the MLX README.&lt;/p&gt;

&lt;p&gt;And since Tim Cook has been CEO for 14+ years with no named successor, I ended up thinking about what changes if the person running Apple changes — and what doesn't. Short version: a lot less than most market takes assume. The laptop on my desk is why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: 64GB of Unified Memory, One Model, Zero Cloud
&lt;/h2&gt;

&lt;p&gt;Hardware is an M1 Max MacBook Pro with the full 64GB unified memory. Yes, it's a $3k-class setup. That's the first honest thing to say.&lt;/p&gt;

&lt;p&gt;The model is Qwen 3.6 35B-A3B MoE, Q8 quantization. Weights are ~40GB in Metal memory via &lt;code&gt;mx.metal.set_wired_limit(45GB)&lt;/code&gt;. That pin is load-bearing — without it the macOS memory compressor will happily try to page out the model while you're mid-inference.&lt;/p&gt;

&lt;p&gt;Hard ceiling at &lt;code&gt;set_memory_limit(48GB)&lt;/code&gt;. Scratch buffers capped at &lt;code&gt;set_cache_limit(512MB)&lt;/code&gt;. Buffer left for OS + apps: ~14-16GB, tight but stable. Everything runs offline. No cloud fallback. No API key. Just the laptop.&lt;/p&gt;

&lt;p&gt;For that ~14-16GB buffer to actually hold: no Docker, no 30-tab Chrome session. I used to keep Chrome open with dozens of tabs; the memory pressure during long inference was noticeable enough that I stopped. My background load during heavy generation is Xcode (SwiftUI work) + terminal + editor. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Q8 Tax: Trading Speed for Sanity
&lt;/h2&gt;

&lt;p&gt;I moved from Q4 to Q8 on April 17. The motivation was pure quality. Q4 output was noticeably more muddled on longer reasoning tasks, especially anything requiring numerical precision or sustained argument.&lt;/p&gt;

&lt;p&gt;Q8 runs in the 35-50 tok/s range depending on context length. Q4 was faster — probably 10-15% more tok/s — but the output just wasn't as good. When you're generating content you'll actually publish, that tradeoff isn't close.&lt;/p&gt;

&lt;p&gt;The honest take: if your use case is chat-style short responses, Q4 might be fine. For long-form drafting, research synthesis, or anything that has to be correct-ish without a human checking every sentence, Q8 earns its extra memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fp16 Moment: 21.18 to 26.22 tok/s From One Env Var
&lt;/h2&gt;

&lt;p&gt;Running MLX on M1 Max defaults to bf16 for many kernels. For Qwen 3.6 MoE specifically, that was costing real throughput.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;MLX_FORCE_FP16=1&lt;/code&gt; in the LaunchAgent environment bumped tok/s from 21.18 to 26.22. That's +24% from one flag. No recompile. No re-quantization. No weight re-download.&lt;/p&gt;

&lt;p&gt;I don't know the full story of why bf16 is the default if fp16 wins here — the MLX team almost certainly has a good reason at the kernel level. But empirically, on this hardware with this model, the flag is free speed.&lt;/p&gt;

&lt;p&gt;Persisted it in the LaunchAgent plist, restarted, never looked back.&lt;/p&gt;
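&lt;p&gt;For anyone replicating this: the flag has to live in the agent's plist itself, since launchd jobs don't read shell dotfiles. The stanza uses the standard &lt;code&gt;EnvironmentVariables&lt;/code&gt; launchd key (the rest of the plist is whatever your agent already has):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;key&amp;gt;EnvironmentVariables&amp;lt;/key&amp;gt;
&amp;lt;dict&amp;gt;
    &amp;lt;key&amp;gt;MLX_FORCE_FP16&amp;lt;/key&amp;gt;
    &amp;lt;string&amp;gt;1&amp;lt;/string&amp;gt;
&amp;lt;/dict&amp;gt;
&lt;/code&gt;&lt;/pre&gt;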

&lt;h2&gt;
  
  
  What Metal Memory Actually Wants: 45GB Wired, 48GB Ceiling, 512MB Scratch
&lt;/h2&gt;

&lt;p&gt;Out of the box, Apple's memory compressor is aggressive. It will look at your 40GB model sitting in RAM, decide some of it is "idle," and start compressing pages. Every decompression on a subsequent inference is thrash.&lt;/p&gt;

&lt;p&gt;The fix for MLX on M1 Max is a three-line config (pseudo-code — real calls take bytes, I'm using GB suffixes for readability):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;set_wired_limit(45GB)&lt;/code&gt; — weights stay pinned, compressor can't touch them&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;set_memory_limit(48GB)&lt;/code&gt; — hard ceiling, prevents runaway scratch buffers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;set_cache_limit(512MB)&lt;/code&gt; — caps MLX's recycled-buffer cache so scratch allocations get released instead of hoarded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before this, compressed swap on my machine sat at 19.69GB. After, it sits at 1.7GB: more than a 10x reduction in memory pressure from three lines of config. The buffer for macOS + Chrome + everything else stays at ~14-16GB, which survives a full day of normal laptop use. (I wrote up the full debugging path for the memory compression issue &lt;a href="https://dev.to/blog/what-19-gb-of-memory-compression-taught-me-about-mlx-on-m1-max"&gt;here&lt;/a&gt; — it took me longer than I'd like to admit to figure out.)&lt;/p&gt;
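&lt;p&gt;As a sketch, the three lines map to real MLX calls like this (byte math spelled out; the exact module path varies by mlx version, and newer releases expose these directly as &lt;code&gt;mx.set_wired_limit&lt;/code&gt; etc., so treat the &lt;code&gt;mx.metal&lt;/code&gt; spelling as an assumption to check against your install):&lt;/p&gt;

```python
# Memory limits for MLX on a 64GB M1 Max (values from the post).
# The real calls take bytes, so compute them explicitly.
GB = 1024**3
MB = 1024**2

WIRED_LIMIT = 45 * GB    # weights stay pinned, out of the compressor's reach
MEMORY_LIMIT = 48 * GB   # hard ceiling on total Metal allocations
CACHE_LIMIT = 512 * MB   # cap on MLX's recycled-buffer cache

def configure_metal_memory():
    # Imported lazily so the constants are usable off Apple Silicon too.
    import mlx.core as mx
    mx.metal.set_wired_limit(WIRED_LIMIT)
    mx.metal.set_memory_limit(MEMORY_LIMIT)
    mx.metal.set_cache_limit(CACHE_LIMIT)
```

&lt;p&gt;Call &lt;code&gt;configure_metal_memory()&lt;/code&gt; once at process start, before loading weights.&lt;/p&gt;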

&lt;h2&gt;
  
  
  The MoE Saturation Wall at 500 Tokens (The Thing Nobody Warns You About)
&lt;/h2&gt;

&lt;p&gt;Qwen 3.6 is a Mixture-of-Experts model. On paper, sparse activation means you're only touching a fraction of weights per token, which is why it fits in 40GB at all.&lt;/p&gt;

&lt;p&gt;What the papers don't emphasize: MoE models have a soft quality ceiling on single generation length. For Qwen 3.6 specifically, output degrades past roughly 500 tokens. Past 800 you start getting word salad. Past 1500 you get paragraphs that apologize to themselves mid-sentence.&lt;/p&gt;

&lt;p&gt;The workaround is sectional generation. Split long outputs into 250-400 token sections, generate each independently, concatenate. State resets between calls. The model stays coherent the whole way through.&lt;/p&gt;

&lt;p&gt;I automated it: a FastAPI endpoint that takes a research brief plus an ordered list of sections (heading + 1-sentence instruction + target word count) and fires one MLX call per section with &lt;code&gt;max_tokens&lt;/code&gt; hard-capped under the degen zone. No shared context across calls. Outputs concatenate into a full draft. Maybe 40 lines of Python. If there's interest I'll clean it up and drop it as part of a small OSS package alongside the memory-safe runtime config.&lt;/p&gt;
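&lt;p&gt;A stripped-down sketch of that loop (illustrative names, not the actual endpoint; &lt;code&gt;generate&lt;/code&gt; stands in for whatever mlx-lm wrapper you call):&lt;/p&gt;

```python
# Sectional generation: one independent call per section, no shared
# context, each capped below the ~500-token degradation zone.
MAX_SECTION_TOKENS = 400

def draft(brief, sections, generate):
    """sections: ordered list of (heading, instruction) pairs.
    generate: callable(prompt, max_tokens) returning the section text."""
    parts = []
    for heading, instruction in sections:
        prompt = f"{brief}\n\n## {heading}\n{instruction}"
        parts.append(f"## {heading}\n" + generate(prompt, MAX_SECTION_TOKENS))
    # State resets between calls; concatenation is the only shared state.
    return "\n\n".join(parts)
```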

&lt;p&gt;This isn't an MLX issue. It's how MoE attention routing behaves under sustained sampling. Took me a while to isolate the variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 AM Ghost: Managing Metal's Memory Drift
&lt;/h2&gt;

&lt;p&gt;Even with wired_limit pinning, Metal accumulates scratch buffers over time. Long inference sessions leave compile cache and intermediate allocations that don't always free cleanly. After a couple of days of uptime, tok/s drifts down 5-10%.&lt;/p&gt;

&lt;p&gt;The fix is a scheduled restart. I have a LaunchAgent with &lt;code&gt;StartCalendarInterval&lt;/code&gt; (not &lt;code&gt;KeepAlive&lt;/code&gt;, which only restarts a job after it exits) set up to kill and relaunch the backend every day at 4 AM local time. Takes about 60 seconds end-to-end — roughly 40 of those are MLX warmup.&lt;/p&gt;

&lt;p&gt;It's not elegant. A properly designed memory system wouldn't need this. But it works, it's invisible because it runs while I sleep, and the next morning tok/s is back at baseline. I'll take a cron job over a memory leak any day.&lt;/p&gt;
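&lt;p&gt;One common launchd shape for this (labels and UID are placeholders, not my actual setup): a sibling agent that fires at 04:00 and runs &lt;code&gt;launchctl kickstart -k&lt;/code&gt; against the backend's service target.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;key&amp;gt;ProgramArguments&amp;lt;/key&amp;gt;
&amp;lt;array&amp;gt;
    &amp;lt;string&amp;gt;/bin/launchctl&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;kickstart&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;-k&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;gui/501/com.example.backend&amp;lt;/string&amp;gt;
&amp;lt;/array&amp;gt;
&amp;lt;key&amp;gt;StartCalendarInterval&amp;lt;/key&amp;gt;
&amp;lt;dict&amp;gt;
    &amp;lt;key&amp;gt;Hour&amp;lt;/key&amp;gt;&amp;lt;integer&amp;gt;4&amp;lt;/integer&amp;gt;
    &amp;lt;key&amp;gt;Minute&amp;lt;/key&amp;gt;&amp;lt;integer&amp;gt;0&amp;lt;/integer&amp;gt;
&amp;lt;/dict&amp;gt;
&lt;/code&gt;&lt;/pre&gt;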

&lt;h2&gt;
  
  
  What I Actually Lose vs Cloud (And What I Don't)
&lt;/h2&gt;

&lt;p&gt;Honest comparison. What you lose going local:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak throughput: 26 tok/s here vs ~60-100 tok/s on cloud APIs&lt;/li&gt;
&lt;li&gt;Context window: 32k practical on this setup vs 200k+ cloud&lt;/li&gt;
&lt;li&gt;Scale: one user at a time vs unlimited parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you don't lose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quality: Q8 is close enough to cloud that most tasks don't notice&lt;/li&gt;
&lt;li&gt;Latency: sub-1s first token local vs 500-1500ms network round-trip&lt;/li&gt;
&lt;li&gt;Cost: $0 marginal per call vs $3-15 per million tokens&lt;/li&gt;
&lt;li&gt;Privacy: weights and prompts never leave the laptop&lt;/li&gt;
&lt;li&gt;Availability: works offline, works when the cloud provider has an outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a solo dev with one user (me), the tradeoff leans hard toward local. Your mileage will vary if you're serving an API.&lt;/p&gt;
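&lt;p&gt;The cost line deserves one piece of arithmetic. Treating the laptop as a ~$3k sunk cost and using the quoted cloud band of $3-15 per million tokens, the break-even point works out like this (a back-of-envelope sketch that ignores electricity and assumes the hardware is the only extra spend):&lt;/p&gt;

```python
# Tokens needed before local generation beats cloud on total cost.
HARDWARE_COST_USD = 3000.0

def breakeven_tokens(cloud_usd_per_million):
    """Generated tokens at which cloud API fees equal the hardware cost."""
    return HARDWARE_COST_USD / cloud_usd_per_million * 1_000_000

cheap_api = breakeven_tokens(3.0)    # vs a $3 per million-token API
pricey_api = breakeven_tokens(15.0)  # vs a $15 per million-token API
```

&lt;p&gt;That's roughly 200M-1B generated tokens before the hardware pays for itself, setting aside that you probably wanted the laptop anyway.&lt;/p&gt;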

&lt;h2&gt;
  
  
  The Thing Nobody Prices About Apple Silicon: Unified Memory
&lt;/h2&gt;

&lt;p&gt;Here's the structural point most Apple Silicon takes miss.&lt;/p&gt;

&lt;p&gt;On x86 + Nvidia, VRAM is separate from system RAM. A $3k gaming laptop ships with at most 16GB of VRAM — physically cannot hold Qwen 35B Q8, period. To match the 40GB I'm using here, you'd need two RTX 3090s (24GB each, NVLink bridge to share weights): ~$1,400-1,800 used for the cards alone, plus PSU, case, cooling, CPU. Easily another $1,500 before you have a running machine. And even then each forward pass is sharding across PCIe — not unified memory. Two 4090s don't even solve it cleanly because Nvidia dropped NVLink on the 4090 line.&lt;/p&gt;

&lt;p&gt;Meanwhile this thing fits in a backpack and runs quietly in a coffee shop.&lt;/p&gt;

&lt;p&gt;On Apple Silicon, the 40GB of model weights live in the same physical RAM the OS and Chrome use. No PCIe bottleneck between CPU and GPU compute — they literally share memory. That's not a Metal-is-faster-than-CUDA claim (per-op, it usually isn't). It's an architecture claim.&lt;/p&gt;

&lt;p&gt;Which is why this MacBook runs models that most gaming desktops physically cannot. The chip speed is a subplot. The memory layout is the actual moat. (I made a longer version of this argument &lt;a href="https://dev.to/blog/why-apple-silicon-quietly-won-the-local-ai-race-april-2026"&gt;here&lt;/a&gt;, back when I was still surprised it was working at all.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Months In, I'm Long the Ecosystem
&lt;/h2&gt;

&lt;p&gt;Three months of MLX on M1 Max later, here's what I actually believe: I'm long the ecosystem, not the CEO.&lt;/p&gt;

&lt;p&gt;Whoever succeeds Tim Cook next can reshape pricing, Services tiers, or the iPhone upgrade cadence. They can't reverse unified memory architecture in a quarter. They can't make &lt;code&gt;pip install mlx-lm&lt;/code&gt; harder than &lt;code&gt;pip install mlx-lm&lt;/code&gt;. They can't retroactively ship a gaming laptop with 40GB of usable VRAM for $3k.&lt;/p&gt;

&lt;p&gt;The developer experience moat — &lt;code&gt;pip install mlx-lm&lt;/code&gt; and you're done, with CUDA nowhere in sight — compounds quietly every time a solo dev gets a 35B model to run on their first try. That's the flywheel the market underprices.&lt;/p&gt;

&lt;p&gt;I could be wrong on the broader empire thesis. But the laptop on my desk still runs the model. That floor doesn't move.&lt;/p&gt;

&lt;p&gt;Come along for the ride — see me fall or thrive, whichever comes first.&lt;/p&gt;

</description>
      <category>applesilicon</category>
      <category>mlx</category>
      <category>llm</category>
      <category>qwen</category>
    </item>
    <item>
      <title>FPT Corporation and the AI Consulting Margin Compression: Why Vietnam's Biggest Tech Firm Lost a Third of Its Market Cap</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:01:51 +0000</pubDate>
      <link>https://dev.to/sleepyquant/fpt-corporation-and-the-ai-consulting-margin-compression-why-vietnams-biggest-tech-firm-lost-a-205g</link>
      <guid>https://dev.to/sleepyquant/fpt-corporation-and-the-ai-consulting-margin-compression-why-vietnams-biggest-tech-firm-lost-a-205g</guid>
      <description>&lt;h1&gt;
  
  
  FPT Corporation and the AI Consulting Margin Compression: Why Vietnam's Biggest Tech Firm Lost a Third of Its Market Cap
&lt;/h1&gt;

&lt;h2&gt;
  
  
  An IT Giant Most Western Investors Have Never Heard Of
&lt;/h2&gt;

&lt;p&gt;FPT Corporation, Vietnam's largest IT services firm, is down ~33.8% from its 52-week high. This drawdown mirrors a broader sector-wide slump: TCS fell 21.4%, Wipro dropped 23.1%, and Infosys declined roughly 16% over the same window. The market appears to be repricing the entire labor-arbitrage consulting model at once, not punishing FPT in isolation.&lt;/p&gt;

&lt;p&gt;Here's what makes it interesting: in 9M2025, FPT still grew revenue +10.3% YoY (VND 49,887 billion ≈ $1.96B USD) and pre-tax profit +17.6% YoY (VND 9,540 billion ≈ $374M USD). The fundamentals didn't crash. The expectations did.&lt;/p&gt;

&lt;p&gt;I went down this rabbit hole after watching &lt;a href="https://www.youtube.com/watch?v=Pj0Y2zgcg-8" rel="noopener noreferrer"&gt;Mèo Giải Thích's&lt;/a&gt; Vietnamese-language deep dive on FPT (388k+ views). What follows is a case study in &lt;strong&gt;AI consulting margin compression&lt;/strong&gt; — one of the cleanest sector-wide pricing events I've seen in IT services in the past year. Below: what FPT actually does, the AI catalyst that hit the entire sector at once, and the counter-case the market isn't pricing in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs24kyy0vct6n3b5nu1dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs24kyy0vct6n3b5nu1dd.png" alt="Bar chart: FPT down 33.8% from 52-week peak (worst), Wipro -23.1%, FPT 1Y -22.2%, TCS -21.4%, Infosys -16.5%. AI margin compression hit the entire labor-arbitrage IT consulting sector simultaneously." width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What FPT Actually Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From banana flour machines to Vietnam's largest IT firm
&lt;/h3&gt;

&lt;p&gt;The founding story is almost too literal to be real: in 1988, the acronym FPT stood for "Food Processing Technology." Early FPT was drying cigarettes and installing air conditioners. Then came the pivot in 1990 — a $1 million computer contract with the Soviet Academy of Sciences changed everything. Within roughly a decade, FPT had become Vietnam's dominant IT firm. Understanding their current engine, though, requires looking at three distinct pillars rather than the single "IT" label.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three pillars: Technology, Telecom, Education
&lt;/h3&gt;

&lt;p&gt;According to the official 9M2025 earnings report (~$1.96B USD in nine-month revenue), Technology remains the undisputed core: about 62% of group revenue and 45% of group pre-tax profit. Telecom follows as a steady cash cow, contributing 29% of revenue (≈$539M USD) with surprising margin expansion — pre-tax profit grew +21% despite limited market-size headroom. Education rounds out the trio at just 9%; historically high-margin, but recent stagnation hints at real competitive pressure (more on that next).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Japan is FPT's biggest customer
&lt;/h3&gt;

&lt;p&gt;What fascinates me most about FPT's Tech segment is where the money actually lives: overseas markets capture roughly 80–90% of that division's inbound revenue. Japan sits firmly as #1, followed by the US and APAC. Why? Because demographic collapse there has created an IT labor shortage so severe that Japanese planners are now recruiting half a million Indian tech workers to fill the gap. FPT's labor-cost advantage is the bridge Vietnamese firms have been crossing for years. In 2024, FPT also opened two AI factories — one in Vietnam, one in Japan — but they're still too small to materially move group numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Margin Story Hidden in Education
&lt;/h2&gt;

&lt;p&gt;Education accounts for just 9% of FPT's total revenue, yet it has historically been a cash cow with pre-tax margins hovering between 40-50%, according to Mèo Giải Thích. This profitability stems from a vertically integrated talent pipeline: FPT operates schools ranging from K-12 through university, and many graduates join the company directly. By internalizing recruitment, FPT drastically reduces external hiring friction and retraining costs while ensuring new hires are already aligned with its specific technical culture and operational standards. It is an elegant self-sustaining loop where education fuels technology growth without depending on volatile external labor markets.&lt;/p&gt;

&lt;p&gt;However, the official 9M2025 earnings data reveals a sharp divergence from that high-margin narrative. Education revenue grew only +1.0% YoY to VND 5,195 billion (≈$204M USD). This stagnation suggests headwinds are biting harder than anticipated. Vietnam's K-12 fee waiver in public schools has eroded the addressable market for private tuition, with families increasingly opting for free state alternatives over premium rates at FPT institutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Market Lost Faith — The P/E Compression
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From 30x to 15x in eighteen months
&lt;/h3&gt;

&lt;p&gt;I first understood valuation through a coffee-shop analogy from Mèo Giải Thích's video: if a shop earns $1 million a year but sells for $20 million, the Price-to-Earnings ratio is 20. Buyers are paying twenty years of current profits upfront for the future growth they expect.&lt;/p&gt;

&lt;p&gt;FPT's stock chart tells the same story in real time. P/E peaked around 30x when optimism was highest, normalized to roughly 19x over recent quarters, and now sits near 15x. The compression signals that investors have drastically lowered their growth assumptions.&lt;/p&gt;
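&lt;p&gt;The mechanics fit in two lines of arithmetic: hold earnings flat and price moves one-for-one with the multiple, so the 30x-to-15x path implies a ~50% drawdown from compression alone (illustrative numbers from the analogy, not a model of FPT):&lt;/p&gt;

```python
# Price = P/E multiple x earnings per share. Hold EPS flat and the
# multiple does all the work.
eps = 1.0                # normalize earnings to 1
peak_price = 30 * eps    # the optimism multiple
today_price = 15 * eps   # the current multiple
drawdown = (peak_price - today_price) / peak_price  # 0.5
```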

&lt;h3&gt;
  
  
  How FPT compares to Indian IT consulting peers
&lt;/h3&gt;

&lt;p&gt;Here's the part that should make any FPT bull pause: the Indian IT consulting comparables aren't trading much higher. As of April 2026, TCS sits around ~19x trailing P/E, Infosys around ~18x, Wipro around ~16x. &lt;strong&gt;FPT at ~15x is trading at a discount to all three.&lt;/strong&gt; Sector-wide compression explains most of the move, but FPT carries an additional discount on top — the market is pricing in either smaller scale, less diversified revenue base, or company-specific execution risk that its global peers don't have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqah8fx2x54hs1izgd3h6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqah8fx2x54hs1izgd3h6.png" alt="Bar chart: FPT trades at ~15x trailing P/E, Wipro 16x, Infosys 18x, TCS 19x. FPT trades at a discount to all three Indian IT consulting peers as of April 2026." width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the official 9M2025 numbers actually show
&lt;/h3&gt;

&lt;p&gt;The official 9M2025 data backs the deceleration. Tech segment revenue grew only +10.7% YoY against the segment's 24% historical CAGR (2018-2024). Total group revenue reached VND 49,887 billion (≈$1.96B USD) — still positive, but well off the trajectory the old multiple priced in. The gap between former hype and current reality is why the multiple collapsed from 30x toward 15x. But P/E compression doesn't happen in a vacuum — there was a specific catalyst.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Market Lost Faith — The AI Catalyst Behind the Margin Compression
&lt;/h2&gt;

&lt;h3&gt;
  
  
  February 23: The Anthropic shot heard around IT services
&lt;/h3&gt;

&lt;p&gt;The catalyst for the sector's re-rating arrived on &lt;strong&gt;February 23, 2026&lt;/strong&gt;, when &lt;a href="https://claude.com/blog/how-ai-helps-break-cost-barrier-cobol-modernization" rel="noopener noreferrer"&gt;Anthropic published "How AI helps break the cost barrier to COBOL modernization"&lt;/a&gt;. They claimed Claude Code could map dependencies across thousands of lines of legacy code, document workflows, and identify risks that "would take human analysts months to surface." This was not an abstract tech update — it was a direct shot at the consulting layer where firms charge premium hourly rates for human-led modernization work, exactly FPT's core moat in digital transformation and system integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  IBM down 13.2% in a single day, FPT followed
&lt;/h3&gt;

&lt;p&gt;The market reacted immediately: &lt;a href="https://www.cnbc.com/2026/02/23/ibm-is-the-latest-ai-casualty-shares-are-tanking-on-anthropic-cobol-threat.html" rel="noopener noreferrer"&gt;IBM stock fell 13.2% that same day&lt;/a&gt;. The pricing signal suggested investors were rapidly discounting future labor-arbitrage margins across global IT services providers. FPT's own decline accelerated after this date — the timing is suggestive rather than coincidental within the broader -16% to -23% sector drawdown seen across TCS, Infosys, and Wipro. An insider sale by board member Bùi Quang Ngọc near the peak (timing flagged by the Mèo Giải Thích video) drew attention, but the source video itself cautioned against over-reading: founders Trương Gia Bình and Đỗ Cao Bảo did not sell, and a single insider transaction without volume context is closer to noise than signal. (For a tangentially-related build-in-public take on how a single missing argument in production code can compound into outsized loss, see &lt;a href="https://dev.to/blog/how-a-missing-book-id-kwarg-quietly-tanked-my-inverted-alpha-paper-trade/"&gt;my recent post on a one-line trading bug&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counter-Case Worth Hearing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI doesn't only delete consulting — it creates new categories
&lt;/h3&gt;

&lt;p&gt;The bear case assumes AI simply deletes consulting hours, but it also creates entirely new categories. Companies still need someone to deploy these models into actual operations, integrate complex stacks with legacy systems, and train staff on the new workflows. FPT has explicitly pivoted in response, declaring an "AI-first" strategic direction. They are building infrastructure rather than relying on manual code migration alone. Their flagship vehicle is the FPT AI Factory, positioned as a "one-stop shop" for AI and Cloud services. At CES 2026, FPT showcased AI-first innovations across industries from automotive to semiconductor design.&lt;/p&gt;

&lt;h3&gt;
  
  
  What FPT's own numbers say about the pivot
&lt;/h3&gt;

&lt;p&gt;Per FPT's own reporting, their AI and Data Analytics service lines grew +41% YoY — tangible demand, though importantly off a small base; AI services are still single-digit percent of group revenue, not yet large enough to offset the deceleration in legacy Tech consulting. Chairman Trương Gia Bình has publicly emphasized future bets on Quantum Computing, Cybersecurity, UAVs, and Railway Tech, all underpinned by core AI capabilities. Two AI factories are operational — one in Vietnam, one opened in Japan in 2024 — aimed directly at the labor shortages and digital transformation demand that have driven FPT's overseas growth for years. At a ~15x P/E, the market is pricing low odds that this pivot scales fast enough. That is where the optionality sits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Watching, From Outside
&lt;/h2&gt;

&lt;p&gt;I am trying to understand FPT as a business, not as a ticker. The recent drawdown is stark, but the real story lies in three leading indicators that reveal whether the company can pivot from legacy arbitrage to AI-driven value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New Contract Value (NCV)&lt;/strong&gt; — the leading indicator for future Technology segment revenue. If NCV stagnates while signed revenue keeps growing through backlog consumption, that's demand friction showing up before it hits the top line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tech segment pre-tax margin trend&lt;/strong&gt; — the canary for AI pricing pressure. As AI tools compress billable hours per project (the same dynamic &lt;a href="https://www.cnbc.com/2026/02/23/ibm-is-the-latest-ai-casualty-shares-are-tanking-on-anthropic-cobol-threat.html" rel="noopener noreferrer"&gt;behind IBM's 13% drop&lt;/a&gt;), it shows up here long before it shows up in total sales volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Factory contribution to group revenue&lt;/strong&gt; — the strategic execution check. If the two factories (Vietnam + Japan) can move from low-single-digit % to mid-single-digit % of group revenue over the next 4-8 quarters, the bull pivot is landing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is a recommendation. I'm watching because the case is interesting, not because I have an edge. The cost economics are also why I keep coming back to the &lt;a href="https://dev.to/blog/why-apple-silicon-quietly-won-the-local-ai-race-april-2026/"&gt;Apple Silicon angle on local AI&lt;/a&gt; — the same dynamic that compresses FPT's consulting margins is what makes running a 35B model on a laptop suddenly viable. Credit again to &lt;a href="https://www.youtube.com/watch?v=Pj0Y2zgcg-8" rel="noopener noreferrer"&gt;Mèo Giải Thích&lt;/a&gt; for doing the heavy synthesis on the Vietnamese-language side; this post is my attempt to put that story into English with the &lt;a href="https://fpt.com/-/media/project/fpt-corporation/fpt/ir/information-disclosures/year-report/2025/october/fpt_earnings-report-9m2025.pdf" rel="noopener noreferrer"&gt;official 9M2025 earnings numbers&lt;/a&gt; cross-checked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lesson Beyond FPT
&lt;/h2&gt;

&lt;p&gt;Pulling back from the company-specific drama reveals a sector-wide structural shift. Tata Consultancy Services is down 21.4%, Infosys 16.5%, Wipro 23.1%, FPT 22.2% over the same window — peak-to-trough on FPT is closer to 33.8%. Four major labor-arbitrage IT consulting firms across two continents, all repricing in the same direction at the same time.&lt;/p&gt;

&lt;p&gt;This is not a Vietnam story or even an FPT story. It is the entire "humans do the consulting work" business model getting re-rated by AI. The survivors will pivot fast from "we sell hours" to "we sell the AI that does the hours." The casualties will stay too long in the now-commoditizing layer.&lt;/p&gt;

&lt;p&gt;The IBM chart on Feb 23 and the FPT chart in the weeks after are saying exactly the same thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;This analysis was anchored on a Vietnamese-language YouTube video by &lt;strong&gt;Mèo Giải Thích&lt;/strong&gt; (Explaining Cat), a Vietnamese economics explainer channel — the primary narrative spine. The hard numbers and corroborating data points came from the following public sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mèo Giải Thích — "Tôi phân tích FPT để bạn không phải làm"&lt;/strong&gt; — &lt;a href="https://www.youtube.com/watch?v=Pj0Y2zgcg-8" rel="noopener noreferrer"&gt;YouTube video, 18 min, 388k+ views&lt;/a&gt;. The Vietnamese-language analysis that triggered this deep dive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPT Corporation — 9M2025 Earnings Report&lt;/strong&gt; — &lt;a href="https://fpt.com/-/media/project/fpt-corporation/fpt/ir/information-disclosures/year-report/2025/october/fpt_earnings-report-9m2025.pdf" rel="noopener noreferrer"&gt;PDF (October 2025)&lt;/a&gt; and &lt;a href="https://fpt.com/en/news/fpt-news/ket-qua-kinh-doanh-9-thang-dau-nam-2025" rel="noopener noreferrer"&gt;investor news release&lt;/a&gt;. Source for all official segment revenue, profit, and growth numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FPT Software press releases&lt;/strong&gt; — &lt;a href="https://fptsoftware.com/newsroom/news-and-press-releases/news/fpt_sets_direction_for_tech_innovation" rel="noopener noreferrer"&gt;AI strategic direction&lt;/a&gt;, &lt;a href="https://fptsoftware.com/newsroom/news-and-press-releases/news/ces-2026-fpt-showcases-ai-first-innovations-across-industries" rel="noopener noreferrer"&gt;CES 2026 AI showcase&lt;/a&gt;, and &lt;a href="https://fptsoftware.com/newsroom/news-and-press-releases/news/fpt-global-it-services-signed-revenue-surpassed-1-3-b-usd" rel="noopener noreferrer"&gt;Global IT Services $1.3B signed revenue announcement&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic — "How AI helps break the cost barrier to COBOL modernization"&lt;/strong&gt; (February 23, 2026) — &lt;a href="https://claude.com/blog/how-ai-helps-break-cost-barrier-cobol-modernization" rel="noopener noreferrer"&gt;primary blog post&lt;/a&gt; and &lt;a href="https://resources.anthropic.com/code-modernization-playbook" rel="noopener noreferrer"&gt;Code Modernization Playbook&lt;/a&gt;. Source for the AI catalyst dating and capability claims.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNBC — "IBM is the latest AI casualty. Shares tank 13% on Anthropic programming language threat"&lt;/strong&gt; (&lt;a href="https://www.cnbc.com/2026/02/23/ibm-is-the-latest-ai-casualty-shares-are-tanking-on-anthropic-cobol-threat.html" rel="noopener noreferrer"&gt;February 23, 2026&lt;/a&gt;). Source for the IBM market reaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IT Pro — Anthropic vs IBM debate on COBOL modernization&lt;/strong&gt; — &lt;a href="https://www.itpro.com/software/development/anthropic-says-claude-code-can-help-streamline-cost-prohibitive-cobol-modernization-but-ibm-says-its-not-that-simple-decades-of-hardware-software-integration-cannot-be-replicated-by-moving-code" rel="noopener noreferrer"&gt;counter-view from IBM&lt;/a&gt;. Included for the skeptical counter-perspective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yahoo Finance&lt;/strong&gt; — live equity data pulled 2026-04-22 for &lt;a href="https://finance.yahoo.com/quote/FPT.VN/" rel="noopener noreferrer"&gt;FPT.VN&lt;/a&gt;, &lt;a href="https://finance.yahoo.com/quote/TCS.NS/" rel="noopener noreferrer"&gt;TCS.NS (Tata Consultancy Services)&lt;/a&gt;, &lt;a href="https://finance.yahoo.com/quote/INFY/" rel="noopener noreferrer"&gt;INFY (Infosys)&lt;/a&gt;, and &lt;a href="https://finance.yahoo.com/quote/WIT/" rel="noopener noreferrer"&gt;WIT (Wipro)&lt;/a&gt;. Source for sector-wide drawdown comparison.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This is independent analysis grounded in publicly available sources. Not financial advice. Numbers stated are as of the source date noted; equity prices move continuously and any specific level cited may be stale by the time you read this. The author holds no position in FPT Corporation, Tata Consultancy Services, Infosys, Wipro, or IBM at the time of writing. Mèo Giải Thích is credited as the anchor source and was not consulted for this article.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>finance</category>
      <category>programming</category>
    </item>
    <item>
      <title>How a Missing book_id Kwarg Quietly Tanked My Inverted-Alpha Paper Trade</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Tue, 21 Apr 2026 09:10:43 +0000</pubDate>
      <link>https://dev.to/sleepyquant/how-a-missing-bookid-kwarg-quietly-tanked-my-inverted-alpha-paper-trade-3fgh</link>
      <guid>https://dev.to/sleepyquant/how-a-missing-bookid-kwarg-quietly-tanked-my-inverted-alpha-paper-trade-3fgh</guid>
      <description>&lt;h2&gt;
  
  
  Executive summary
&lt;/h2&gt;

&lt;p&gt;I ran an inverted-alpha paper-trading experiment to test whether inverting my live signals would produce net-positive P&amp;amp;L over 100 round-trips. The inverted-alpha book (Book 2) hit a 63% win rate — good enough to celebrate — but the per-trade average loss was six times larger than the per-trade win size. The shape of the P&amp;amp;L didn't match any thesis I had. After a few days of staring at the numbers, I traced the problem to a single missing keyword argument in the close-order routing path. A one-line fix dropped the per-round-trip cost on the inverted book from about $0.29 to under $0.02 — roughly a 21x reduction in per-trade bleed. This post is the story of finding the bug, why it hid for three days, and the structural test I should have written up front.&lt;/p&gt;

&lt;h2&gt;
  
  
  The signal that didn't match any thesis
&lt;/h2&gt;

&lt;p&gt;A quick refresher on the inverted-alpha setup (I covered the original thesis in more detail in &lt;a href="https://dev.to/blog/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed/"&gt;"The Inverted Control"&lt;/a&gt;). I run a multi-book paper-trading experiment on the same live signal source. Book 1 executes the signal as-is. Book 2 executes the inverted side of every signal — if Book 1 goes long, Book 2 goes short on the same symbol and size. The idea is simple: if my signal has negative edge on average, its inverse should have positive edge, net of fees. Historical shadow analysis said the inversion would have produced roughly +$40 on 496 round-trips where Book 1 actually lost about $70. The live test was going to confirm or deny that in new market conditions.&lt;/p&gt;

&lt;p&gt;Two days in, Book 2 looked weird. The win rate was sitting around 63% — higher than Book 1's 34%, which is what you'd expect if the inversion thesis held. But the net P&amp;amp;L on Book 2 was already deeply negative, with an average per-round-trip loss three times worse than Book 1. The shape didn't make sense: a book that wins 63% of the time shouldn't bleed faster than one that wins 34% of the time unless the losing trades are massively larger than the winning trades. And they were. The average win was small and the average loss was huge. The R-multiple on Book 2 was roughly inverted from what the mirror design implied.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I initially suspected
&lt;/h2&gt;

&lt;p&gt;My first hypothesis was that the inversion thesis was just wrong in the current regime. Maybe the market had shifted from trending to mean-reverting, and the signal that had been losing in trend mode was now correct in the new regime — which would make its inverse wrong. That's an honest failure mode, and if that's what was happening, I needed to kill the test early.&lt;/p&gt;

&lt;p&gt;My second hypothesis was sample-size variance. Eighty round-trips is not a lot. A handful of asymmetric outliers can make the per-trade average look catastrophic before the law of large numbers smooths things out. I considered waiting for 200 round-trips before acting.&lt;/p&gt;

&lt;p&gt;Neither hypothesis explained the specific R-multiple asymmetry. If the signal had flipped edge direction, the win rate should have dropped toward 50% or below, not landed at 63%. If it was pure variance, the wins and losses should have been roughly symmetric around the expected mean. What I was seeing — high win rate, small wins, large losses — is the mechanical signature of something clipping the wins and letting the losses run.&lt;/p&gt;
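&lt;p&gt;The arithmetic behind that signature is worth writing down. Using the summary's figures (a 63% win rate with an average loss about six times the average win, normalized to a 1-unit win), the expected value per trade is firmly negative despite the high win rate:&lt;/p&gt;

```python
# Why a 63% win rate can still bleed: expected value per round-trip when
# the average loss runs ~6x the average win (figures from this post,
# normalized so the average win = 1 unit).
win_rate = 0.63
avg_win = 1.0
avg_loss = 6.0

ev = win_rate * avg_win - (1 - win_rate) * avg_loss
print(f"EV per trade: {ev:+.2f} units")
# → EV per trade: -1.59 units
```

A book with that profile loses about 1.6 average-win-units per trade while winning almost two trades out of three, which is exactly the shape that invites regime theories instead of bug hunts.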

&lt;h2&gt;
  
  
  The trace that revealed it
&lt;/h2&gt;

&lt;p&gt;I went into the logs. For each closed position on Book 2, I pulled the close-order record and checked which book the close actually hit. Every single one had routed to Book 1's ledger. Book 2's open positions existed. Book 2's trades showed up in the comparison snapshot. But Book 2's closes were landing on Book 1, which meant Book 2 positions were only closing when Book 1's mirror trade hit its own TP or SL — at Book 1's magnitudes, not Book 2's.&lt;/p&gt;

&lt;p&gt;That's the asymmetry I was seeing. Book 2's take-profit threshold (set symmetrically with Book 1 for the experiment) never fired on its own positions. Book 2 closed when Book 1's signal exited — and since Book 2 is the inverse, Book 1's winning exits were Book 2's losing exits, at Book 1's take-profit magnitude. Meanwhile, Book 1's losing exits (at its smaller stop-loss magnitude) were Book 2's winning exits. Wins capped at small, losses running to large. The R-multiple wasn't mysteriously inverted; it was mechanically forced that way by a routing bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause — one missing kwarg
&lt;/h2&gt;

&lt;p&gt;Found it in the futures TP/SL monitor. The loop fetches all open positions across every book without a per-book filter (intentional — one loop watches the whole portfolio). For each position that trips its TP or SL threshold, it constructs a close-order and hands it to the execution engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (the bug)
&lt;/span&gt;&lt;span class="n"&gt;close_order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimulatedOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lane&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;futures&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;close_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price_vnd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;live_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;leverage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;leverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auto-close: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;close_reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;close_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_futures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The monitor passes no &lt;code&gt;book_id&lt;/code&gt;. Downstream, &lt;code&gt;execute_futures&lt;/code&gt; defaults &lt;code&gt;book_id=1&lt;/code&gt; when the argument isn't provided. The close-execution query then filters the position table by that default &lt;code&gt;book_id&lt;/code&gt;, looking for a Book 1 position matching this symbol to close. For a Book 2 position that needs to close, the query finds nothing that matches — Book 1 has no such position. The execution path returns cleanly with zero matches. No exception. No warning. Just a silent no-op.&lt;/p&gt;

&lt;p&gt;The monitor logs a cheerful "Auto-close" message. The database state is unchanged. The position keeps running until the Book 1 mirror signal decides to exit, at which point the close finally lands on the correct book via a completely different code path (the mirror-fire routing in the execution engine). That's why Book 2 positions did eventually close — through Book 1's exit, not their own.&lt;/p&gt;
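&lt;p&gt;The failure mode is easy to reproduce in miniature. This is not the real engine, just a hypothetical sketch of a close path whose &lt;code&gt;book_id&lt;/code&gt; defaults to 1 and whose zero-match case returns cleanly:&lt;/p&gt;

```python
# Minimal reproduction of the silent no-op (hypothetical stand-in for the
# real execution engine; the names only mirror the post).
positions = [{"book_id": 2, "symbol": "BTCUSDT", "qty": 1.0}]  # a Book 2 position

def execute_close(symbol, book_id=1):
    """Close positions matching (symbol, book_id). book_id defaults to Book 1."""
    matches = [p for p in positions
               if p["symbol"] == symbol and p["book_id"] == book_id]
    for p in matches:
        positions.remove(p)
    return {"closed": len(matches)}   # zero matches is not treated as an error

# The monitor omits book_id, so the close silently does nothing:
assert execute_close("BTCUSDT") == {"closed": 0}
assert len(positions) == 1            # the Book 2 position is still open

# Routing to the position's own book actually closes it:
assert execute_close("BTCUSDT", book_id=2) == {"closed": 1}
assert positions == []
```

The uncomfortable property is the return statement: to the caller, an empty match set and a successful close are indistinguishable.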

&lt;h2&gt;
  
  
  The 1-line fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;close_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_futures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;close_order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;book_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole patch. Route the close to the same book the position lives on. I added a comment block above the call referencing the session log where the bug was diagnosed, so the next person reading this code has some archaeology to work with if they're wondering why the kwarg is suddenly important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;p&gt;The fix went live with the backend restart. Book 2 had 88 round-trips on its books at that moment. I locked that as the pre-fix baseline and started counting post-fix round-trips separately.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Round-trips&lt;/th&gt;
&lt;th&gt;Avg cost per round-trip&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-fix (contaminated by bug)&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;td&gt;about $0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-fix (clean)&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;about $0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 21x reduction in per-trade cost isn't the inverted-alpha signal suddenly working. It's the mirror book's own take-profit and stop-loss thresholds finally firing, instead of being clipped by Book 1's exit timing. Wins land at the size they were designed to land at. Losses stop at the size they were designed to stop at. The R-multiple on Book 2 is now something close to symmetric, which is what the inverted-alpha experiment was supposed to measure in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I'm not claiming
&lt;/h2&gt;

&lt;p&gt;Eighty-seven post-fix round-trips is still not a lot. The number could drift back toward zero or turn positive or stay mildly negative as the sample grows. What I'm claiming is narrow: the bug was contaminating the signal to the point where no verdict was meaningful, and fixing it moved the book roughly to break-even on post-fix trades — which at least lets the actual inverted-alpha thesis get tested on its own merits. Whether the thesis itself holds up over 100+ clean round-trips is still open.&lt;/p&gt;

&lt;p&gt;I'm also not claiming that a bug of this shape should be impossible for anyone smart to write. I wrote it. I shipped it. It ran for three days producing data that looked like a meaningful signal and wasn't. The uncomfortable part is how convincing the bad data was — a 63% win rate with a tidy asymmetric R-multiple is exactly the kind of shape that generates theories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway: test cross-book routing, not just book behavior
&lt;/h2&gt;

&lt;p&gt;Every unit test I had written pointed at Book 1 behavior in isolation. Does the close logic work? Does the TP trigger at the right threshold? Does the position close update the balance correctly? All of those passed. What I hadn't written was a test that opens a position on the inverted-alpha book (Book 2), triggers its TP, and asserts that the resulting close lands on Book 2's ledger and not Book 1's. A single-line assertion in the right place would have caught this bug before it shipped.&lt;/p&gt;

&lt;p&gt;If you're running a multi-book or multi-account framework where the routing surface is implicit — where a missing keyword argument silently falls back to a default account — write the cross-routing assertion. It's the test that only exists once you have more than one book, and it's the test that stops being optional the moment silent no-ops can masquerade as winning trades.&lt;/p&gt;
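&lt;p&gt;As a sketch, the missing test might look like this (hypothetical names; an in-memory ledger stands in for the real engine):&lt;/p&gt;

```python
# Cross-book routing assertion: a close triggered by Book 2's own TP must
# land on Book 2's ledger, not on the default book. (Hypothetical harness.)
class FakeEngine:
    def __init__(self):
        self.ledger = {1: [], 2: []}          # book_id -> closed symbols

    def execute_futures(self, symbol, book_id=1):
        self.ledger[book_id].append(symbol)   # record where the close lands

def close_position(engine, pos):
    # The code path under test: the fix is forwarding pos["book_id"].
    engine.execute_futures(pos["symbol"], book_id=pos["book_id"])

def test_close_routes_to_owning_book():
    engine = FakeEngine()
    close_position(engine, {"symbol": "ETHUSDT", "book_id": 2})
    assert engine.ledger[2] == ["ETHUSDT"]    # landed on Book 2...
    assert engine.ledger[1] == []             # ...and not on Book 1

test_close_routes_to_owning_book()
```

Delete the &lt;code&gt;book_id=&lt;/code&gt; forwarding in &lt;code&gt;close_position&lt;/code&gt; and the test fails immediately, which is the whole point.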




&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed/"&gt;"The Inverted Control"&lt;/a&gt; — the original inverted-alpha thesis and why I set up the multi-book experiment in the first place&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/apple-silicon-local-ai-2026-04/"&gt;"Why Apple Silicon Quietly Won the Local-AI Race"&lt;/a&gt; — the stack this whole system runs on, one M1 Max&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/memory-compression-mlx-m1-max-april-2026/"&gt;"What 19 GB of Memory Compression Taught Me About MLX"&lt;/a&gt; — a companion story of another silent failure mode that hid in plain sight&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This post is engineering observation from a solo paper-trading experiment, not financial advice. The numbers reflect one specific configuration on a paper book denominated in a non-USD unit and converted for readability; results in any real live book will differ. Verify your own framework before trusting signal data from a multi-book setup.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>ai</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>What 19 GB of Memory Compression Taught Me About MLX on M1 Max</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:29:24 +0000</pubDate>
      <link>https://dev.to/sleepyquant/what-19-gb-of-memory-compression-taught-me-about-mlx-on-m1-max-3eha</link>
      <guid>https://dev.to/sleepyquant/what-19-gb-of-memory-compression-taught-me-about-mlx-on-m1-max-3eha</guid>
      <description>&lt;h1&gt;
  
  
  What 19 GB of Memory Compression Taught Me About MLX on M1 Max
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The moment something was wrong
&lt;/h2&gt;

&lt;p&gt;I opened Activity Monitor on my M1 Max one afternoon and saw this: Memory Used 60.74 GB out of 64, compressed memory 19.69 GB, swap starting to fill. The SwiftUI dashboard I use to drive my multi-agent quant stack had hung. Python — the backend process holding an MLX-loaded Qwen 3.6 35B-A3B model — reported 44 GB in Activity Monitor's "Memory" column.&lt;/p&gt;

&lt;p&gt;My first thought was the obvious one: memory leak. Shut it down, restart, move on.&lt;/p&gt;

&lt;p&gt;That would have been wrong. What I found instead was a much more interesting problem about how macOS handles Metal unified memory when a large model sits idle between inferences — and the fix turned out to be a single MLX API call I had never used.&lt;/p&gt;

&lt;p&gt;This is the honest write-up: what broke, what I measured, what the fix actually was, and what I'm still not sure about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I was actually running
&lt;/h2&gt;

&lt;p&gt;One M1 Max, 64 GB unified memory. One Python process holding the MLX framework with a Q8-quantized 35B-A3B MoE model loaded. About 35 GB of that goes to model weights in Metal-accessible memory; the rest of the process is the FastAPI backend, twelve specialized agents sharing the single model through a priority queue, a SQLite paper-trading book, and assorted content-generation loops.&lt;/p&gt;

&lt;p&gt;Uptime at the point of the snapshot: just under 8 hours since the last backend restart.&lt;/p&gt;

&lt;p&gt;In normal operation, Activity Monitor should show something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python process: ~35-40 GB in the "Memory" column&lt;/li&gt;
&lt;li&gt;Wired: 2-3 GB (kernel)&lt;/li&gt;
&lt;li&gt;Compressed: low single digits&lt;/li&gt;
&lt;li&gt;Free + reclaimable inactive: 15-20 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I saw instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python process: 44 GB&lt;/li&gt;
&lt;li&gt;Compressed: &lt;strong&gt;19.69 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Swap: 1.57 GB and climbing&lt;/li&gt;
&lt;li&gt;Free: 3 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compressed number was the interesting one. Not the total.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why compressed memory is the signal, not the problem
&lt;/h2&gt;

&lt;p&gt;macOS has an in-kernel memory compressor that tries to keep a working set resident by compressing pages that processes have allocated but aren't actively touching. When compressed memory grows, it usually means somewhere a process has a big chunk of memory that's "cold" — allocated but not referenced often enough to count as active.&lt;/p&gt;

&lt;p&gt;The compressor typically manages roughly a two-to-one ratio, so 19.69 GB of compressed pages suggests something like 40 GB of original memory being squeezed in.&lt;/p&gt;

&lt;p&gt;On a normal desktop, this is invisible and fine. On a machine running a 35 GB model, it's a red flag: if the model weights are being compressed and decompressed as the compressor swaps them in and out of a resident state, every inference pays a cost to decompress pages before Metal can use them. CPU cycles burn. Latency drifts. Over hours, the machine becomes sluggish in a way that's hard to attribute.&lt;/p&gt;

&lt;p&gt;The question became: why are my model weights going inactive between inferences in the first place?&lt;/p&gt;
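&lt;p&gt;You can watch the compressor directly with &lt;code&gt;vm_stat&lt;/code&gt;: "Pages occupied by compressor" is the compressed RAM actually in use, and "Pages stored in compressor" is the uncompressed size of what was squeezed into it. A small parser (my own helper, not part of any library), run here against a sample capture constructed to match the snapshot above; on a Mac you would feed it the live output of &lt;code&gt;vm_stat&lt;/code&gt; instead:&lt;/p&gt;

```python
# Parse macOS `vm_stat` output to see the compressor's footprint.
# Apple Silicon uses 16 KiB pages; vm_stat prints the page size itself.
import re

SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages stored in compressor:                   2560000.
Pages occupied by compressor:                 1290000.
"""

def compressor_stats(text):
    page = int(re.search(r"page size of (\d+)", text).group(1))
    stored = int(re.search(r"Pages stored in compressor:\s+(\d+)", text).group(1))
    occupied = int(re.search(r"Pages occupied by compressor:\s+(\d+)", text).group(1))
    to_gb = lambda pages: pages * page / 2**30
    return {"compressed_gb": to_gb(occupied),   # RAM the compressor occupies
            "original_gb": to_gb(stored)}       # uncompressed size of those pages

stats = compressor_stats(SAMPLE)
print(f"{stats['compressed_gb']:.2f} GB compressed, "
      f"holding {stats['original_gb']:.2f} GB of pages")
# → 19.68 GB compressed, holding 39.06 GB of pages
```

A ratio near 2:1 between "stored" and "occupied" is the compressor working as designed; the problem is only what it chose to compress.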

&lt;h2&gt;
  
  
  The thing I didn't know about Apple Silicon Metal
&lt;/h2&gt;

&lt;p&gt;On Apple Silicon, CPU and GPU share the same physical RAM. That's the unified memory advantage. But "unified" doesn't mean "all memory is treated the same." Metal exposes a few storage modes, and the one MLX uses by default for model weights is &lt;code&gt;shared&lt;/code&gt; — accessible to both CPU and GPU.&lt;/p&gt;

&lt;p&gt;Here's the thing I had to learn the hard way: &lt;code&gt;shared&lt;/code&gt; storage pages are pageable. They can be marked inactive by the kernel. They can be compressed. From the operating system's perspective, a chunk of Metal-allocated memory that isn't actively being read or written looks exactly like a process's idle heap. It gets the same treatment.&lt;/p&gt;

&lt;p&gt;So the loop I was producing was this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model loaded into Metal shared storage (~35 GB)&lt;/li&gt;
&lt;li&gt;Inference fires, GPU reads weights, decoder runs&lt;/li&gt;
&lt;li&gt;Inference finishes&lt;/li&gt;
&lt;li&gt;Seconds pass. No one touches the weights.&lt;/li&gt;
&lt;li&gt;Kernel marks pages inactive&lt;/li&gt;
&lt;li&gt;Compressor kicks in, squeezes cold pages&lt;/li&gt;
&lt;li&gt;Next inference arrives&lt;/li&gt;
&lt;li&gt;GPU needs to read weights → decompress first → latency&lt;/li&gt;
&lt;li&gt;Return to 1.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over hours, the compressor works harder and harder. The machine isn't leaking memory. It's thrashing a 35 GB working set against a compression algorithm that assumes cold data will stay cold. It won't stay cold. It's a running model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix I should have known about six months ago
&lt;/h2&gt;

&lt;p&gt;MLX has an API called &lt;code&gt;mx.metal.set_wired_limit(bytes)&lt;/code&gt;. It tells Metal: "keep up to N bytes of memory resident and uncompressible." I had never called it. The default is unlimited-but-unpinned, which means nothing is protected.&lt;/p&gt;

&lt;p&gt;I set it to 45 GB — enough to cover the ~35 GB of model weights plus a few GB of KV cache and scratch. Added two more for good measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mx.metal.set_cache_limit(512 MB)&lt;/code&gt; — cap the Metal compile cache so it can't drift over time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mx.metal.set_memory_limit(48 GB)&lt;/code&gt; — hard ceiling so Metal refuses to allocate beyond that. Fail loudly instead of OOM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three calls go in &lt;code&gt;_load_model&lt;/code&gt; before &lt;code&gt;mlx_lm.load()&lt;/code&gt; allocates weights, so Metal knows the budget up front.&lt;/p&gt;

&lt;p&gt;Results (one backend restart later):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python "Memory" column&lt;/td&gt;
&lt;td&gt;44 GB&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compressed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19.69 GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.7 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Swap&lt;/td&gt;
&lt;td&gt;1.57 GB&lt;/td&gt;
&lt;td&gt;1.6 GB (historical, drains)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free + reclaimable inactive&lt;/td&gt;
&lt;td&gt;3 GB&lt;/td&gt;
&lt;td&gt;~30 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compressed memory dropped by 91%. The model wasn't leaking. The kernel just wasn't pinning it, because I had never told it to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four more layers I added because I don't trust a single fix
&lt;/h2&gt;

&lt;p&gt;Getting to 1.7 GB compressed on a fresh restart is nice. Keeping it there over days of uptime is different. I layered four more defenses in case any of them mattered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clear the Metal compile cache after heavy inference.&lt;/strong&gt; My content pipeline runs &lt;code&gt;max_tokens ≥ 500&lt;/code&gt; inferences regularly (sectional generation for long-form writeups). Metal accumulates a compile/scratch cache that doesn't matter for a single run but drifts. Added &lt;code&gt;mx.metal.clear_cache()&lt;/code&gt; as an automatic hook at the end of any inference above that token threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A memory-pressure watchdog.&lt;/strong&gt; A background task polls &lt;code&gt;psutil.virtual_memory()&lt;/code&gt; every five minutes. If Metal cache exceeds 1 GB, clear it automatically. If total system memory used exceeds 60 GB, print a warning. Not an alarm — just a log signal I can grep later.&lt;/p&gt;
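&lt;p&gt;A minimal sketch of that watchdog, with the MLX and psutil calls injected as callables so the loop itself stays testable. In production, &lt;code&gt;get_cache&lt;/code&gt; and &lt;code&gt;clear_cache&lt;/code&gt; would be &lt;code&gt;mx.metal.get_cache_memory&lt;/code&gt; and &lt;code&gt;mx.metal.clear_cache&lt;/code&gt;, and &lt;code&gt;get_used&lt;/code&gt; would read &lt;code&gt;psutil.virtual_memory().used&lt;/code&gt;; the thresholds are the arbitrary ones described above:&lt;/p&gt;

```python
import asyncio

CACHE_LIMIT = 1 * 1024**3    # clear the Metal cache above 1 GB
WARN_LIMIT = 60 * 1024**3    # warn above 60 GB of system memory used

async def memory_watchdog(get_cache, clear_cache, get_used,
                          interval=300, cycles=None):
    """Poll every `interval` seconds; `cycles=None` runs forever."""
    n = 0
    while cycles is None or n < cycles:
        if get_cache() > CACHE_LIMIT:
            clear_cache()                  # Metal cache crept past the cap
        used = get_used()
        if used > WARN_LIMIT:
            print(f"WARN: system memory used {used / 1024**3:.1f} GB")
        await asyncio.sleep(interval)
        n += 1
```

Wiring it up is one line in the backend's startup hook: &lt;code&gt;asyncio.create_task(memory_watchdog(...))&lt;/code&gt;.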

&lt;p&gt;&lt;strong&gt;A nightly restart.&lt;/strong&gt; Every night at 4 AM local time, the backend does &lt;code&gt;os._exit(1)&lt;/code&gt;. LaunchAgent &lt;code&gt;KeepAlive&lt;/code&gt; respawns it in about a minute. Fresh MLX state, fresh Python heap. The warmup cost (~60 seconds of MLX reload) is free because I'm asleep and nothing depends on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual unload / reload API.&lt;/strong&gt; &lt;code&gt;POST /resources/mlx-unload&lt;/code&gt; sets a flag, drops the model reference, calls &lt;code&gt;mx.metal.clear_cache()&lt;/code&gt;. Inference calls after that fail fast with a clear error. &lt;code&gt;POST /resources/mlx-reload&lt;/code&gt; brings the model back in about 60 seconds. This is for when I want the full 40 GB of Metal memory for something else temporarily. Trade scanners and the paper engine keep running because they don't depend on MLX at all — they're pure Python against SQLite.&lt;/p&gt;

&lt;p&gt;Together with the wired limit, all five layers survive multi-day uptime without drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parts I'm still not sure about
&lt;/h2&gt;

&lt;p&gt;The 45 GB wired limit is a guess. It works on my machine with this exact model. If I added a second model, or switched to a denser quantization, or loaded more aggressive KV cache — I'd need to re-tune. I don't have a systematic way to pick the number other than "model weights plus headroom, less than the point where the rest of macOS starves."&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;set_memory_limit(48 GB)&lt;/code&gt; hard ceiling may be too aggressive. I haven't stress-tested what happens when the limit is actually hit. Probably Metal throws an OutOfMemoryError and the inference fails with a clear traceback, which is what I want. But I haven't caused it on purpose yet.&lt;/p&gt;

&lt;p&gt;The watchdog threshold — clear cache above 1 GB, warn above 60 GB — is arbitrary. I set those based on vibes and one afternoon of measurement. A more disciplined version would instrument several days of data and pick thresholds from actual distribution percentiles.&lt;/p&gt;

&lt;p&gt;The nightly restart is the scariest one. It assumes nothing important is mid-execution at 4 AM. For now that's true because I'm a solo operator. For a multi-user production stack, it would not be acceptable, and I'd need a graceful-drain + cutover pattern instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past-me six months ago
&lt;/h2&gt;

&lt;p&gt;If you're running a large MLX model on Apple Silicon and you've never touched &lt;code&gt;mx.metal.set_wired_limit&lt;/code&gt;, check Activity Monitor's Compressed Memory number after a few hours of uptime. If it's in double-digit GB, you're probably paying a compression/decompression tax on every inference.&lt;/p&gt;

&lt;p&gt;The fix is three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx.core&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;
&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_wired_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# pin the model in resident RAM
&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_cache_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# cap Metal compile/scratch
&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_memory_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# fail loud above this, don't OOM
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Works on M1 and M2 generations. I haven't tested on M3 or M4 Pro / Max, but the API is the same and the underlying Metal behavior should be too.&lt;/p&gt;

&lt;p&gt;The broader lesson I'm taking away: unified memory is a genuine advantage for local-first AI, but it inherits the OS's defaults for normal application memory. A 35 GB working set of neural-network weights is not what macOS's memory manager was designed for. The API to tell it "treat this differently" is there; I just had to know it existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I'm packaging the full hygiene layer as a small open-source helper — tentatively &lt;code&gt;mlx-memory-safe&lt;/code&gt; — so anyone running MLX on a Mac can drop it in with one import instead of reading three sections of this post to rediscover the same fixes. Should land on GitHub and PyPI in the next week or two, with a separate write-up of the package internals.&lt;/p&gt;

&lt;p&gt;If you've hit something similar, or if you've tested &lt;code&gt;set_wired_limit&lt;/code&gt; on M3/M4 and seen different behavior, I'd love to hear about it. I still don't have a clean mental model for when &lt;code&gt;shared&lt;/code&gt; storage mode pages leave the wired set under real-world pressure, and that gap is the next thing I want to understand.&lt;/p&gt;

&lt;p&gt;Come along for the ride.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This post reflects one solo operator's configuration on one M1 Max with 64 GB of unified memory in April 2026, running MLX + Qwen 3.6 35B-A3B Q8. Specific numbers (compressed GB, tok/s, wired limit) will differ on other hardware, other models, and other workloads. Test on your own setup before adopting any threshold as a default.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mlx</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Why Apple Silicon Quietly Won the Local-AI Race (April 2026)</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Sat, 18 Apr 2026 09:30:37 +0000</pubDate>
      <link>https://dev.to/sleepyquant/why-apple-silicon-quietly-won-the-local-ai-race-april-2026-34g7</link>
      <guid>https://dev.to/sleepyquant/why-apple-silicon-quietly-won-the-local-ai-race-april-2026-34g7</guid>
      <description>&lt;h1&gt;
  
  
  Why Apple Silicon Quietly Won the Local-AI Race (April 2026)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Executive summary
&lt;/h2&gt;

&lt;p&gt;While the public AI narrative is dominated by capex wars and cloud GPU shortages, a quieter shift has happened on the desktop. A single Apple Silicon laptop with 64GB of unified memory now runs a 35-billion-parameter mixture-of-experts model at usable speed, with no API key, no rate limit, and no per-token bill. SleepyQuant — a public notebook from one solo finance + tech enthusiast — runs twelve specialized agents sharing a single MLX model instance on one M1 Max. Last week I swapped the primary inference quantization from 4-bit to 8-bit. Active model memory went from about 19GB to about 35GB. Decode speed initially dropped from ~50 tokens per second to ~10. Then a reader on r/LocalLLaMA pointed out that M1/M2 GPUs lack native bf16 compute; casting the non-quantized weights to fp16 brought the same 8-bit model up to ~26 tokens per second. The post that follows is an honest account of the Q4→Q8 trade, a look at what unified memory architecture actually changes for anyone shipping local-first AI in 2026, and a teaser on the fp16 fix (full write-up in a follow-up post).&lt;/p&gt;

&lt;h2&gt;
  
  
  Thesis
&lt;/h2&gt;

&lt;p&gt;The default assumption of the last two years is that meaningful AI requires meaningful infrastructure: a data center, a GPU cluster, an API contract. Apple's hardware bet quietly inverts that assumption for a specific category of work — single-operator inference of capable open-weight models on commodity hardware.&lt;/p&gt;

&lt;p&gt;The mechanism is unified memory architecture, or UMA. On a traditional desktop, the CPU and GPU each own separate memory pools. To run a large model on the GPU, the model weights must be copied across the PCIe bus, then activations move back and forth for every layer. The cost is latency, energy, and an effective ceiling on model size set by the GPU's dedicated VRAM. On Apple Silicon, CPU, GPU, and Neural Engine cores share one unified memory pool on the same package. There is no copy step. The same 64GB of physical RAM is available to whichever processing unit needs it, in whatever ratio the workload demands.&lt;/p&gt;

&lt;p&gt;This sounds like an engineering footnote. It is not. It is the mechanism that lets a 35B-parameter model fit and run on a $4,000 laptop instead of an $80,000 server. For workloads that are bounded by single-user inference latency and privacy — exactly the workloads small builders, indie developers, and solo operators care about — that changes the economics of building with AI from "raise a seed round for compute" to "buy the laptop."&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep dive: what I actually run
&lt;/h2&gt;

&lt;p&gt;My setup: one M1 Max with 64GB of unified memory. The primary inference engine is MLX — Apple's open-source machine learning library tuned for Apple Silicon. The model is Qwen 3.6 35B-A3B, a sparse mixture-of-experts (MoE) architecture, served at 8-bit quantization. The active model footprint is around 35GB. With Python's process overhead and the rest of the agent stack loaded, total active and wired memory sits around 40-44GB. I've pinned the model weights with &lt;code&gt;mx.metal.set_wired_limit(45 * 1024**3)&lt;/code&gt; (the limit is given in bytes, so 45GB) and cap total Metal allocation at 48GB so macOS can't page model weights out to SSD when things get busy.&lt;/p&gt;
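&lt;p&gt;As a minimal sketch of that config: the byte math below runs anywhere, and the commented-out &lt;code&gt;mx.metal&lt;/code&gt; calls are real MLX setters that take sizes in bytes. The 45/48GB thresholds are this machine's numbers, not defaults — test your own headroom first.&lt;/p&gt;

```python
# Byte math for the Metal limits described above. The MLX calls are
# commented out so the sketch runs on any machine; uncomment on a Mac
# with mlx installed.
GB = 1024 ** 3

def metal_limits(wired_gb: int = 45, total_gb: int = 48) -> tuple[int, int]:
    """Return (wired, total) limits in bytes for the two MLX setters."""
    return wired_gb * GB, total_gb * GB

wired, total = metal_limits()
# import mlx.core as mx
# mx.metal.set_wired_limit(wired)    # keep ~45 GB of weights resident
# mx.metal.set_memory_limit(total)   # hard-cap Metal allocation at 48 GB
```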

&lt;p&gt;Decode throughput at 8-bit started around 10 tokens per second on the default path, and moved to ~26 tokens per second after a reader's tip to force fp16 compute (M1/M2 lack native bf16 — details in a follow-up post). At 4-bit, the same model decoded at 49–60 tokens per second. The 5x slowdown from Q4 was real; the fp16 recovery was a reader's gift. The reason I accepted the 8-bit path in the first place is that it's meaningfully sharper on data-aware tasks — content evaluation against a fact list, fabrication detection in generated drafts, structured output parsing. For a public notebook where every number should be defensible, "slightly slower but more truthful" is the right trade. For a real-time chat application, it would not be.&lt;/p&gt;

&lt;p&gt;The sparse MoE design adds one more wrinkle. Qwen 3.6 35B-A3B activates only ~3B parameters per token, which is what makes its decode throughput tractable on commodity hardware in the first place. But MoE models degenerate into repetitive word-salad when forced to generate long single completions — anything past about 500 output tokens reliably produces collapsing prose where the same phrases re-circulate. The fix is not "buy a denser model"; the fix is sectional generation. Long content gets split into 250–400-token sections that are generated independently and concatenated. The model never has to hold a 1500-word output in its working window at once. This is a structural workaround for an architectural property of MoE, not a hack.&lt;/p&gt;
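&lt;p&gt;The sectional pattern reduces to a small loop. The sketch below assumes a generic per-section &lt;code&gt;generate&lt;/code&gt; callable (an mlx_lm-style function); the prompt wiring is illustrative, not the author's actual pipeline:&lt;/p&gt;

```python
def generate_sectional(generate, section_prompts, max_tokens=400):
    """Sectional generation as described above: each 250-400-token
    section is produced independently and concatenated, so the MoE never
    has to sustain one long completion. `generate(prompt, max_tokens=...)`
    is a stand-in for whatever per-section call the stack uses."""
    sections = []
    for prompt in section_prompts:
        # Each call starts fresh; prior sections are not in the window.
        sections.append(generate(prompt, max_tokens=max_tokens).strip())
    return "\n\n".join(sections)
```

The usual follow-up is a cheap stitching pass (dedupe transition phrases, fix pronouns) rather than asking the model to regenerate the whole piece.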

&lt;p&gt;On top of that base inference layer, twelve specialized agents — content drafting, quality evaluation, trading scan, risk analysis, news ingestion, and so on — share the single MLX runtime through a sequential lock that prevents two simultaneous Metal GPU calls from crashing the device. The lock turns into a priority queue: user-facing chat outranks agent tool calls, which outrank background automation. Twelve agents share one inference engine, not twelve cloud endpoints.&lt;/p&gt;
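&lt;p&gt;One way to build such a lock (illustrative, not the author's implementation) is a condition variable over a heap of waiting tickets, so the lowest-numbered tier always runs next and ties resolve first-come-first-served:&lt;/p&gt;

```python
import heapq
import threading

class PriorityInferenceLock:
    """Serialize Metal calls while letting higher-priority callers jump
    the queue. Tier numbers follow the text: 0 = user chat, 1 = agent
    tool call, 2 = background automation. Sketch only."""
    def __init__(self):
        self._cv = threading.Condition()
        self._waiting = []  # heap of (priority, seq): FIFO within a tier
        self._seq = 0
        self._busy = False

    def acquire(self, priority: int) -> None:
        with self._cv:
            self._seq += 1
            ticket = (priority, self._seq)
            heapq.heappush(self._waiting, ticket)
            # Wait until the GPU is free AND we are the best waiter.
            while self._busy or self._waiting[0] != ticket:
                self._cv.wait()
            heapq.heappop(self._waiting)
            self._busy = True

    def release(self) -> None:
        with self._cv:
            self._busy = False
            self._cv.notify_all()
```

Wrapping every Metal call in `acquire(tier)` / `release()` gives exactly one in-flight inference, which is what keeps two simultaneous GPU calls from crashing the device.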

&lt;p&gt;The full operational footprint: one laptop, one model on disk, one Metal-bound process, no recurring infrastructure cost. The bill of materials is the laptop and the electricity to run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counter-argument: when Apple Silicon loses
&lt;/h2&gt;

&lt;p&gt;The story above is selective. Apple Silicon is the wrong tool for several common AI workloads, and pretending otherwise sets up failure.&lt;/p&gt;

&lt;p&gt;Training is the obvious one. Pre-training a foundation model from scratch, or even continued pre-training on a domain-specific corpus, demands cluster-grade compute and high-bandwidth interconnects that consumer hardware does not provide. The unified memory advantage works in the inference direction; in the training direction, dedicated GPU farms remain dominant.&lt;/p&gt;

&lt;p&gt;Multi-tenant serving is the second loss case. A single MLX-bound laptop serves one inference at a time through a lock. That works for a solo operator running an internal stack. It does not work for a SaaS product with concurrent users, where horizontal scaling on cloud GPU is the rational architecture.&lt;/p&gt;

&lt;p&gt;High-throughput batch inference is the third. If the workload is "score 100,000 documents tonight," a multi-GPU server with batched attention will eat the laptop's lunch. The laptop wins on per-token cost for low volume; cloud batch wins on throughput per dollar at scale.&lt;/p&gt;

&lt;p&gt;Continuous fine-tuning is the fourth, and the one most people forget. The Apple Silicon stack excels at running pre-trained models efficiently. It is weaker at adapting them quickly. If the strategy depends on retraining on yesterday's market data every night to stay competitive, single-laptop inference is a structural disadvantage compared to a hedge fund operating its own GPU cluster.&lt;/p&gt;

&lt;p&gt;These limitations are real. They constrain where the local-first thesis applies. They do not invalidate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;The local-first Apple Silicon stack is the right answer for a specific shape of project: a single operator (or small team), inference-dominant workloads, sensitivity to per-token cost, sensitivity to data leaving the machine, and acceptable latency at the throughput a sequential lock allows. Build-in-public projects, indie research, internal tooling, privacy-sensitive personal automation — all of these fit the shape.&lt;/p&gt;

&lt;p&gt;For training, multi-tenant serving, high-throughput batch, and continuous fine-tuning at production scale, the cloud GPU stack remains the right answer.&lt;/p&gt;

&lt;p&gt;What changed in 2026 is not that Apple Silicon is suddenly competitive everywhere. What changed is that the band of workloads for which a single laptop is sufficient has widened to include things that, two years ago, demanded a serious infrastructure budget. A 35B-parameter MoE running on one M-series chip at 26 tokens per second (with the fp16 fix) is not a benchmark to brag about against H100 clusters. It is, however, a baseline good enough to run a real experiment, on a real budget, with no vendor in the loop. For a category of builders and enthusiasts who used to be priced out of meaningful AI infrastructure, that is the entire point.&lt;/p&gt;

&lt;p&gt;More notes in this series — including the fp16 cast that got me from 10 to 26 tokens per second, the honest 4-bit vs 8-bit quality comparison, the sectional generation pattern in detail, the 12-agent priority-queue design, and the Metal wired-limit trick that fixed 19GB of memory-compression thrash — live in the &lt;a href="https://sleepyquant.rest/blog/" rel="noopener noreferrer"&gt;SleepyQuant blog archive&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This post is engineering observation, not financial or hardware purchasing advice. Specific tokens-per-second numbers reflect the SleepyQuant configuration on one M1 Max with 64GB unified memory in April 2026; results on other hardware or quantizations will differ. Verify benchmarks against your own workload before making allocation decisions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mlx</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>SleepyQuant Weekly · 2026W16</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Sat, 18 Apr 2026 08:26:56 +0000</pubDate>
      <link>https://dev.to/sleepyquant/sleepyquant-weekly-2026w16-1lb0</link>
      <guid>https://dev.to/sleepyquant/sleepyquant-weekly-2026w16-1lb0</guid>
      <description>&lt;h2&gt;
  
  
  This week in paper trading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Round-trips: 464&lt;/li&gt;
&lt;li&gt;Win rate: 38.1%&lt;/li&gt;
&lt;li&gt;Realized PnL: -34.58 USDT&lt;/li&gt;
&lt;li&gt;Net return: +20.23%&lt;/li&gt;
&lt;li&gt;Max drawdown: 3.14%&lt;/li&gt;
&lt;li&gt;R:R ratio: 0.8&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure vault: what broke, what changed
&lt;/h2&gt;

&lt;p&gt;Past 7 days · 49 losing trades · total -24.63 USDT&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Slippage&lt;/strong&gt; cluster × 25 across APT/USDT, BNB/USDT, ETH/USDT, LINK/USDT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Failure&lt;/strong&gt; cluster × 24 across APT/USDT, ARB/USDT, ATOM/USDT, AVAX/USDT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APT/USDT&lt;/strong&gt; — Execution Slippage × 5 (-1.32 USDT, avg -0.26 per trade)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Strategy adjustments shipped / queued for next week:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;[65% conf]&lt;/strong&gt; Scanner-wide: cut position size 25% + tighten stop loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[60% conf]&lt;/strong&gt; Global: scan interval 8 → 12 minutes to filter noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[85% conf]&lt;/strong&gt; Temporarily remove APT/USDT from scan list for 48 hours&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  News that mattered
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔥 Trending: Bio Protocol (BIO) — Rank #365 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Pudgy Penguins (PENGU) — Rank #108 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: RaveDAO (RAVE) — Rank #33 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Based (BASED) — Rank #722 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;🔥 Trending: Bitcoin (BTC) — Rank #1 &lt;em&gt;(via CoinGecko Trending)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One operating insight
&lt;/h2&gt;

&lt;p&gt;The main lesson this week is simple: trust the quiet tape.&lt;/p&gt;

&lt;p&gt;When the engine scans widely but trades narrowly, that usually means the filters are doing their job. A lower trade count is cheaper than forcing mediocre entries, especially when the failure vault is already pointing at repeat mistakes like noisy confirmation, weak follow-through, or execution drift. The right response is not "make the bot trade more." The right response is to tighten the decision path, preserve RAM for the live stack, and keep publishing the real numbers so the system can keep learning in public.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack and infra
&lt;/h2&gt;

&lt;p&gt;The stack right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple M1 Max, 64GB unified memory&lt;/li&gt;
&lt;li&gt;MLX Qwen 3.6 35B-A3B 8-bit quant (primary inference)&lt;/li&gt;
&lt;li&gt;A lightweight CLI layer for build-time automation&lt;/li&gt;
&lt;li&gt;12 AI agents coordinating in one local process&lt;/li&gt;
&lt;li&gt;Binance spot + futures paper trading via ccxt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model swap from 4-bit to 8-bit this week traded raw decode speed (about 50 tokens per second down to about 10) for sharper data-aware evaluation. Worthwhile for content quality scoring; less worthwhile for high-frequency scan loops, which still rely on cached deterministic signals.&lt;/p&gt;




&lt;p&gt;If you're building local-first trading systems, hit reply and tell me what you optimize for first: speed, cost, or control. The next issue covers the inverted-control experiment: running the same signal backward on a parallel paper book to test whether the edge is real or whether the bot is anti-correlated with itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compiled from live operating data. Every number in this issue came from the running system, not a deck.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>The Inverted Control: What 24 Hours of Running Our Own Bot Backwards Revealed</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Sat, 18 Apr 2026 03:37:25 +0000</pubDate>
      <link>https://dev.to/sleepyquant/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed-402g</link>
      <guid>https://dev.to/sleepyquant/the-inverted-control-what-24-hours-of-running-our-own-bot-backwards-revealed-402g</guid>
      <description>&lt;h1&gt;
  
  
  The Inverted Control: What 24 Hours of Running Our Own Bot Backwards Revealed
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Executive Summary
&lt;/h2&gt;

&lt;p&gt;After roughly 500 paper round-trips showed a persistent sub-35% win rate with average losses larger than average wins, we stopped scaling the live side and ran a cheap experiment: a second paper book that executes the exact opposite of every signal the bot produces, on the same universe, same cadence, same fee model.&lt;/p&gt;

&lt;p&gt;Twenty-four hours in, the inverted book is winning 70.59% of round-trips versus 15.79% on the standard book. Both books are still losing in absolute terms because fees dominate at small sample. The important number is not the win rate gap. It is whether the inverted book's gross edge clears the fee floor by the time we hit the 100-round-trip decision point, roughly 8 to 12 days out.&lt;/p&gt;

&lt;p&gt;This post walks through the setup, the data so far, where the reading could be wrong, and the specific decision that happens at 100 round-trips.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Thesis
&lt;/h2&gt;

&lt;p&gt;A bot that loses more than random is either extracting no signal, or extracting signal with the sign reversed. Those two hypotheses produce identical win-rate readings in a one-book world. They are only separable by running a second book with the signal flipped.&lt;/p&gt;

&lt;p&gt;The second hypothesis is rarer but well-documented: overfit features trained on stale microstructure, labels that got reversed in a pipeline step, crowding where yesterday's "bullish" marker is now a faded trade. None of those are visible from inside a single losing book. All of them flip sign when you flip the signal.&lt;/p&gt;

&lt;p&gt;Running the inverted control is the lowest-cost diagnostic that distinguishes the two hypotheses. In the first hypothesis (no signal), the inverted book converges to the same losing distribution, minus fee drag. In the second hypothesis (inverted signal), the inverted book diverges: higher win rate, smaller loss magnitude, possibly net-positive once sample grows past fee-drag territory.&lt;/p&gt;

&lt;p&gt;The point of running the control is not to find a winning strategy. It is to stop guessing about which of those two worlds the bot is actually in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Two paper books, same engine, same universe, same fee schedule.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book 1 — standard signal.&lt;/strong&gt; Every decision from the scanner is executed as issued. LONG is LONG, BUY is BUY.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Book 2 — inverted mirror.&lt;/strong&gt; Every decision is flipped programmatically before execution. LONG becomes SHORT, BUY becomes SELL (or hold, since the spot lane is accumulate-only during this window, making the flip mostly a futures test).&lt;/li&gt;
&lt;/ul&gt;
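&lt;p&gt;The flip itself can be a single routing function. This sketch is illustrative; the names (&lt;code&gt;route_signal&lt;/code&gt;, &lt;code&gt;book_id&lt;/code&gt;) are assumptions for the example, not the engine's actual identifiers:&lt;/p&gt;

```python
# Two-book router: Book 1 executes as issued, Book 2 inverts direction.
INVERT = {"LONG": "SHORT", "SHORT": "LONG", "BUY": "SELL", "SELL": "BUY"}

def route_signal(signal: str, book_id: int,
                 spot_accumulate_only: bool = True) -> str:
    """Book 2 flips every decision before execution; spot sells become
    holds while the spot lane is accumulate-only, so the flip is mostly
    a futures test during this window."""
    if book_id == 1:
        return signal
    flipped = INVERT.get(signal, signal)
    if spot_accumulate_only and flipped == "SELL":
        return "HOLD"
    return flipped
```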

&lt;p&gt;Both books start from identical simulated ~$1000 balances. Both pay realistic exchange-tier fees on open and close — no free-trade assumption, which is where most inversion backtests fail.&lt;/p&gt;

&lt;p&gt;Universe: 30 USDT pairs on a major exchange, perps plus spot. Scan cadence 15 minutes. Leverage cap 3x. Drawdown hard stop 8% per book. Spot exit signals ignored in Book 2 for this window — the test isolates the futures direction bet.&lt;/p&gt;

&lt;p&gt;The test completes at 100 post-flip round-trips on Book 2. At that point one of three decisions is on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive: 24 Hours of Parallel Data
&lt;/h2&gt;

&lt;p&gt;Windowed to the period since the flip went live:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book 1 — standard.&lt;/strong&gt; 38 round-trips closed. Win rate 15.79%. Net result negative on the order of tens of USD.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Book 2 — inverted.&lt;/strong&gt; 17 round-trips closed. Win rate 70.59%. Net result also negative, but by a much smaller per-round-trip magnitude (roughly 25x better than standard).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The win-rate gap from 15.79% to 70.59% is the headline. At this sample it is suggestive rather than conclusive, but it is also not the pattern noise would produce. A purely random signal in this setup would produce win rates clustering around 45-55% on both books, and a noise signal (the first hypothesis) would produce roughly symmetric rates on both books. What shows up instead, an asymmetric split heavily favoring the inverse, is the fingerprint of a signal that carries information with the wrong sign.&lt;/p&gt;
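&lt;p&gt;A quick back-of-envelope check on Book 2 alone: treat each round-trip as an independent coin flip and ask how often chance alone produces at least 12 winners in 17. This ignores the asymmetric TP/SL payoff and any correlation between trades, so it is a rough sanity check, not a verdict:&lt;/p&gt;

```python
from math import comb

def binom_tail(wins: int, n: int, p: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(n, p): the chance a coin-flip
    signal produces at least this many winners."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(wins, n + 1))

# 12 of 17 round-trips won on Book 2 (70.59%).
p_value = binom_tail(12, 17)
print(f"{p_value:.3f}")  # ~0.072
```

A one-sided tail around 7% is consistent with the honest reading: the gap is real enough to keep the experiment running, and small enough in sample that it could still reverse as the round-trip count grows.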

&lt;p&gt;Per-symbol, the inversion's effect is not uniform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Book 1 WR&lt;/th&gt;
&lt;th&gt;Book 2 WR&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ZEC/USDT&lt;/td&gt;
&lt;td&gt;12.5% (8 RT)&lt;/td&gt;
&lt;td&gt;80.0% (5 RT)&lt;/td&gt;
&lt;td&gt;Inversion strongly helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARB/USDT&lt;/td&gt;
&lt;td&gt;25.0% (4 RT)&lt;/td&gt;
&lt;td&gt;100% (3 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOGE/USDT&lt;/td&gt;
&lt;td&gt;0.0% (5 RT)&lt;/td&gt;
&lt;td&gt;100% (2 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNI/USDT&lt;/td&gt;
&lt;td&gt;0.0% (4 RT)&lt;/td&gt;
&lt;td&gt;100% (1 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps (micro sample)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BCH/USDT&lt;/td&gt;
&lt;td&gt;0.0% (1 RT)&lt;/td&gt;
&lt;td&gt;100% (1 RT)&lt;/td&gt;
&lt;td&gt;Inversion helps (micro sample)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NEAR/USDT&lt;/td&gt;
&lt;td&gt;28.6% (7 RT)&lt;/td&gt;
&lt;td&gt;0.0% (2 RT)&lt;/td&gt;
&lt;td&gt;Inversion hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ADA/USDT&lt;/td&gt;
&lt;td&gt;50.0% (4 RT)&lt;/td&gt;
&lt;td&gt;33.3% (3 RT)&lt;/td&gt;
&lt;td&gt;Inversion hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Five of seven symbols with both-book data favor inversion. Two do not. The symbols where inversion fails are the ones where the standard book was already near or above 30% — consistent with an "invert only what's clearly broken, leave the rest" hybrid strategy that may emerge at higher sample.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fee Floor
&lt;/h3&gt;

&lt;p&gt;Every round-trip pays roughly the open-plus-close fee on a major exchange, and each book pays its own fees independently. With Book 2 running in parallel, total fee spend doubles.&lt;/p&gt;

&lt;p&gt;That doubles the bar. Book 2's improvement in gross profit-and-loss has to clear two fee stacks, not one. An inversion signal that wins on gross but gets eaten by the fee floor is a classic mean-reversion trap: backtests ignoring fees look clean, live books ignoring fees bleed out.&lt;/p&gt;

&lt;p&gt;At 17 round-trips, Book 2's net-negative result is dominated by fee drag, not by losses on individual trades. The interesting question is whether that fee drag, as a percentage of gross result, shrinks as sample grows. If the gross per-round-trip edge holds at roughly current magnitude, net-positive becomes plausible around round-trip 50-70. If the gross edge compresses as the signal gets noisier at larger sample, net-positive never arrives.&lt;/p&gt;
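&lt;p&gt;The fee-floor arithmetic is worth making explicit. The sketch below assumes a 0.1% taker fee per leg, which is a common exchange tier but an assumption here, not the actual fee schedule in use:&lt;/p&gt;

```python
def fee_floor(fee_per_leg: float = 0.001, legs: int = 2) -> float:
    """Gross edge per round-trip (as a fraction of notional) needed just
    to cover fees: one leg to open, one to close. The 0.1% rate is an
    assumed tier, not the author's actual schedule."""
    return fee_per_leg * legs

def net_edge(gross_edge: float, fee_per_leg: float = 0.001) -> float:
    """What survives after the fee floor; negative means fee drag wins."""
    return gross_edge - fee_floor(fee_per_leg)

# A 0.15% gross edge per round-trip is still underwater at this tier:
print(round(net_edge(0.0015), 6))
```

Under these assumptions, the inverted book's gross edge has to exceed 0.2% of notional per round-trip before net-positive is even possible, which is why the 100-round-trip decision point watches fee drag as a share of gross rather than win rate alone.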

&lt;h2&gt;
  
  
  Counter-Argument: Why This Reading Could Be Wrong
&lt;/h2&gt;

&lt;p&gt;Taking the opposite side of our own preliminary conclusion:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample is too small.&lt;/strong&gt; Seventeen round-trips on Book 2 is the sample size a drunk person at a blackjack table has after twenty minutes. Win-rate distributions at n=17 are wide enough that a 70.59% result can reverse to 35% over the next 30 trips without surprising anyone. Any reading here is provisional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent regime shift.&lt;/strong&gt; The standard book's historical 34% win rate was compiled over weeks. The 15.79% since the flip is over 24 hours. A regime change (one market day of trend-heavy action on symbols the scanner dislikes, for example) could compress the standard book's rate artificially without the underlying signal being any more broken than it was a week ago. That would make the inversion's apparent edge a mirage of timing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asymmetric fee burn.&lt;/strong&gt; Book 2's inverted futures positions may open and close in ways that pay funding rate differently than Book 1's. If the test period coincides with a funding regime that favors one side, some of the apparent gross edge is just "Book 2 happened to be on the right side of funding this week."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inversion may only work on the symbols we trade least.&lt;/strong&gt; The test might reveal that inversion works on low-activity symbols that produce little volume, while the symbols driving Book 1's meaningful losses (higher-sample names like BTC, ETH, SOL, which Book 2 has not yet traded in this window) are not in the inverted-signal camp. A strategy that only works on low-volume names is not a strategy worth running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The signal might be improving organically.&lt;/strong&gt; Book 1's live standard-signal win rate (across all history, not just this window) has been creeping toward 34% from the 27% it hit in the worst stretch earlier in April. If the signal is already self-correcting, the inversion's apparent edge evaporates before the test window closes.&lt;/p&gt;

&lt;p&gt;Any one of those could be what is actually going on. We are not going to know until the sample grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Verdict
&lt;/h2&gt;

&lt;p&gt;The decision point is 100 round-trips on Book 2, expected 8 to 12 days out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands net-positive with win rate above 55%:&lt;/strong&gt; the inversion locks in. The live signal gets flipped permanently, along with the take-profit and stop-loss asymmetry (swap from 3% TP / -2% SL to 2% TP / -3% SL to match the inverted payoff shape). Live trading remains paused until the paper side clears a 30-day rolling benchmark of Binance Simple Earn at roughly 0.42% per month — the honest passive bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands net-negative or drawdown exceeds 8%:&lt;/strong&gt; the futures lane is disabled entirely. Spot accumulation remains. The diagnosis shifts from "inverted signal" to "no signal," and the rebuild restarts on features, not direction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If Book 2 lands mixed — gross positive but net-negative, or win rate high but below 55%:&lt;/strong&gt; the hybrid path becomes the next experiment. Invert only the symbols where Book 1's rolling win rate sits below 40%. Leave the ones above 40% standard. Re-run the control on that subset.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the reader should take from this
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If you are running a paper book that loses more than random:&lt;/strong&gt; run the inverted control before killing the strategy. The setup is one column in the trades table (&lt;code&gt;book_id&lt;/code&gt;) and one branch in the execute function. The cost is near zero, the answer is binary, and the information gained far exceeds it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are watching SleepyQuant for the outcome:&lt;/strong&gt; the result arrives at 100 round-trips. We publish either an "inversion locks in, here is the updated config" or a "futures lane disabled, here is why" — whichever the numbers say, not whichever is more flattering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you are here for the general lesson:&lt;/strong&gt; a losing signal is not automatically noise. Sometimes it is a working signal with the sign reversed. The diagnostic is cheap. The implication — that your model has been right about structure and wrong about direction — is unusual enough that most builders never check. The check itself is worth more than the result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow the experiment
&lt;/h2&gt;

&lt;p&gt;We publish one email per week with the round-trip count, the current win rates on both books, the fee-drag ratio, and whatever the honest read is at that point. No trading advice, no signals, no "buy at X." Just the numbers and what we are and are not willing to conclude from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subscribe at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;&lt;/strong&gt; → the verdict lands in your inbox.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mlx</category>
      <category>python</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>SleepyQuant – a 12-agent crypto quant running on one Mac</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:47:25 +0000</pubDate>
      <link>https://dev.to/sleepyquant/show-hn-sleepyquant-a-12-agent-crypto-quant-running-on-one-mac-4dhh</link>
      <guid>https://dev.to/sleepyquant/show-hn-sleepyquant-a-12-agent-crypto-quant-running-on-one-mac-4dhh</guid>
      <description>&lt;p&gt;SleepyQuant – a 12-agent crypto quant running on one Mac&lt;/p&gt;

&lt;p&gt;Hey everyone,&lt;/p&gt;

&lt;p&gt;SleepyQuant is a solo experiment I've been running for the last couple of weeks: 12 local AI agents coordinating a paper crypto trading book on a single Apple M1 Max. No cloud inference, no API bills, no vendor black box. Every agent prompt, every losing trade, every round-trip gets written up weekly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack (all local):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple M1 Max, 64 GB RAM&lt;/li&gt;
&lt;li&gt;MLX Qwen 2.5 32B Q8 as the primary agent model&lt;/li&gt;
&lt;li&gt;DeepSeek R1 14B Q8 as a lazy-loaded reasoning lane for research tasks&lt;/li&gt;
&lt;li&gt;Priority queue on the MLX inference lock so user chat preempts automation&lt;/li&gt;
&lt;li&gt;FastAPI backend, SwiftUI macOS app, SQLite for state, ChromaDB for agent memory&lt;/li&gt;
&lt;li&gt;Binance paper via ccxt, spot + futures, 70/30 allocation, 10x leverage on the futures lane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's deliberately boring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The paper book is roughly $78 equivalent. Not a typo. The real-mode transition gate requires three consecutive green days before anything touches real capital, and even then the first real trade is capped tiny. If the strategy can't handle $78, I'd rather find out for free.&lt;/li&gt;
&lt;li&gt;Tight scalp TP/SL (2.0% / -1.5% on futures) with a hard -8% daily drawdown stop.&lt;/li&gt;
&lt;li&gt;Every losing trade gets a post-mortem. The failure vault is public in the weekly newsletter, with root-cause classification (technical / news / execution slippage) and the exact param changes shipped as a response.&lt;/li&gt;
&lt;li&gt;Funding rate guard — refuses to open futures positions when our side is paying extreme funding. Shipped after the scanner was quietly bleeding basis points for three days straight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agents (one role each):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A COO / dispatcher, a trading lead, separate futures + spot executors, a CFO, a CTO with filesystem + shell tools, an R&amp;amp;D / failure analyst, a legal / compliance officer, a resource monitor, a QA engineer, a news intelligence watcher, and a content / SEO writer.&lt;/p&gt;

&lt;p&gt;Each agent has a focused system prompt + a small set of skill handlers. The COO routes CEO requests to the right specialist instead of one monolithic agent trying to do everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live paper P&amp;amp;L widget + weekly newsletter:&lt;/strong&gt; &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;https://sleepyquant.rest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two things I'd genuinely want feedback on — please weigh in below:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Is 12 agents worth the routing overhead?&lt;/strong&gt; Or would a single bigger agent with tool use be cleaner at this scale? I keep flip-flopping and would love to hear from anyone who's been through the same decomposition choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MLX unload strategies on Apple Silicon?&lt;/strong&gt; Right now my reasoning model auto-unloads after 2 minutes idle, which works but feels crude. If you're running MLX in production on a Mac, how do you free RAM when you need it back?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
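&lt;p&gt;&lt;em&gt;For context on question 2, the crude idle-unload looks roughly like this. The 120-second timeout is the real value; the class, the loader, and the threading shape are hypothetical, and the MLX-specific cache clear is only noted in a comment.&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the crude idle-unload: drop the model reference after
# N idle seconds so unified memory becomes reclaimable.
import gc
import threading
import time

IDLE_SECONDS = 120  # "auto-unloads after 2 minutes idle"


class LazyModel:
    def __init__(self, loader):
        self._loader = loader      # callable that loads the weights
        self._model = None
        self._last_used = 0.0
        self._lock = threading.Lock()

    def generate(self, prompt: str):
        with self._lock:
            if self._model is None:
                self._model = self._loader()  # lazy (re)load
            self._last_used = time.monotonic()
            return self._model(prompt)

    def maybe_unload(self):
        """Call periodically from a background thread or timer."""
        with self._lock:
            idle = time.monotonic() - self._last_used
            if self._model is not None and idle > IDLE_SECONDS:
                self._model = None
                gc.collect()
                # With MLX you would also clear the Metal buffer cache
                # here (mx.clear_cache()) so macOS can reclaim the RAM.
```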

&lt;p&gt;&lt;strong&gt;Try it or follow along:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live paper P&amp;amp;L widget + weekly write-up:&lt;/strong&gt; &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;https://sleepyquant.rest&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscribe to the weekly post-mortem newsletter&lt;/strong&gt; — Beehiiv, free, one email per week, no upsells, no signals, no affiliate links&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cadence:&lt;/strong&gt; every Tuesday. If the book dies, I'll write that up too&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions in the comments about the architecture, the failure vault, the priority queue design, or why local-first LLM agents are worth the effort on a 64 GB machine. Fire away.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write weekly at &lt;a href="https://sleepyquant.rest" rel="noopener noreferrer"&gt;sleepyquant.rest&lt;/a&gt;. One email a week, real numbers, no signals. &lt;a href="https://sleepyquant.rest/#subscribe" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; — come along to see me fall or thrive, whichever comes first.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mlx</category>
      <category>python</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>SleepyQuant — Twitter brand assets (bio + pinned tweet)</title>
      <dc:creator>SleepyQuant</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:47:21 +0000</pubDate>
      <link>https://dev.to/sleepyquant/sleepyquant-twitter-brand-assets-bio-pinned-tweet-1joe</link>
      <guid>https://dev.to/sleepyquant/sleepyquant-twitter-brand-assets-bio-pinned-tweet-1joe</guid>
      <description>&lt;h2&gt;
  
  
  Profile
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Display name (50 chars max):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SleepyQuant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bio (160 chars max — landing + newsletter + 1-line pitch):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI trades while the CEO sleeps. 12 local agents + one Mac M1 Max running a paper crypto book in public. Weekly post-mortems, zero hype.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(135 chars — room for a trailing link to sleepyquant.rest in the website field rather than in the bio text itself.)&lt;/em&gt;&lt;/p&gt;
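&lt;p&gt;&lt;em&gt;A plain &lt;code&gt;len()&lt;/code&gt; check confirms the headroom. Note X's own counter weights every URL as 23 characters and some non-ASCII code points as 2, which &lt;code&gt;len()&lt;/code&gt; does not model; the bio above has no URL, so the naive count holds.&lt;/em&gt;&lt;/p&gt;

```python
# Sanity check on the bio copy against the 160-char bio limit.
bio = (
    "AI trades while the CEO sleeps. 12 local agents + one Mac M1 Max "
    "running a paper crypto book in public. Weekly post-mortems, zero hype."
)
print(len(bio))  # 135
```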

&lt;p&gt;&lt;strong&gt;Location:&lt;/strong&gt; &lt;code&gt;Runs on a Mac in a closet&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Website:&lt;/strong&gt; &lt;code&gt;https://sleepyquant.rest&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pinned tweet
&lt;/h2&gt;

&lt;p&gt;One tweet, no thread. Meant to be the first thing a new visitor sees. No question on purpose — it's a brand statement, not a conversation opener.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One Mac. 12 AI agents. A $78 paper crypto book.

I run a quant experiment while I sleep and post the whole journey — every win, every dumb loss, every architecture note — every week.

Live P&amp;amp;L + the newsletter: sleepyquant.rest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Comfortably under the 280-character limit, no line-break tricks, reads in one pass.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternate pinned tweet (if the first one feels too cold)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I ship weekly regardless of wins or losses.

Week 1 on paper: +2.65%, 9 round-trips, 3 losses with full post-mortems, funding-rate guard shipped mid-week.

Everything runs locally on one Mac. sleepyquant.rest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Also comfortably under the 280-character limit.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep the bio and pinned tweet aligned on tone. Reader should see the bio, then the pinned, and the two should feel like one voice.&lt;/li&gt;
&lt;li&gt;Don't use "crypto trading bot" — implies signals and gets flagged by X ad policy. Use "paper crypto book" or "quant experiment".&lt;/li&gt;
&lt;li&gt;Update the pinned weekly — roll in the latest round-trip number so it never feels stale. The alternate version is a good template for that weekly refresh.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>quant</category>
      <category>mlx</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
