DEV Community: SleepyQuant

I Blamed the Model for Months. The Bug Was My Sampler.

SleepyQuant — Fri, 29 May 2026 09:35:03 +0000

I Blamed the Model for Months. The Bug Was My Sampler.

40GB In, Word Salad Out

Running local LLMs on M1 Max hardware is one of those setups that looks great on paper — unified memory, no PCIe bottleneck, offline and private. For about a year I ran mlx-community/Qwen3.6-35B-A3B-8bit, a 35B Mixture-of-Experts model that consumed ~40GB of the machine's 64GB pool. Generation speed was fine. Quality, past about 150 words, was not.

The output turned to garbage — run-on sentences that looped without structure, synonyms piling on synonyms, no paragraph ever landing a conclusion. I assumed the MoE architecture was the problem. That assumption sent me down months of tooling: sectional generation scaffolding, retry logic, truncation heuristics, all built to work around what I thought was a hard architectural limit.

The model was fine. My code was broken.

The Diagnostic: Four Configs, One Metric

I stopped guessing and wrote a small diagnostic script. Same prompt, four sampler configurations, one measurement: the longest continuous stretch of text without a sentence terminator. A coherent paragraph rarely runs beyond 40 words between periods. A collapse runs into the hundreds.

Config	Words out	Longest run (no period)
temp=0.6 + rep-penalty ON	581	472 ← collapse
temp=0.6, no processors	479	27 ✓
temp=0.6 + top_p=0.9, no processors	510	31 ✓
greedy temp=0, no processors	490	27 ✓

Row one is what production was running. The rest are controls. Row two settled it: remove the processor, the collapse disappears. The model was never the problem.

The Root Cause: A Logits Processor That Fought Itself

My codebase had a custom _repetition_penalty_processor. On every decoding step, it divided each recently-generated token’s logit by 1.15, across a sliding 128-token window. The intent was to discourage repetition. The effect was the opposite.

By penalizing recently-used tokens, the processor forced the model to reach for synonyms. Those synonyms got penalized next step. The model kept reaching further — abstract nouns, tangential phrases, concepts that drifted further from the original topic — until coherence collapsed entirely. It’s a feedback loop. The harder you penalize recent tokens, the more the model is pushed toward low-probability continuations that spiral into incoherence.

Turn the processor off, and the same model generates cleanly past 500 words. The architecture wasn’t the problem. The sampler config was.

The Fix

Three changes in mlx_inference.py:

Swapped MLX_MODEL_PRIMARY to "mlx-community/Qwen3.6-27B-4bit" — the dense 4-bit variant, ~18GB RSS instead of ~40GB.
Changed make_sampler(temp=0.6) to make_sampler(temp=0.6, top_p=0.9). Top-p sampling limits generation to the smallest set of tokens whose cumulative probability reaches 90%, pruning the low-probability tail without introducing a penalty feedback loop.
Removed logits_processors=[_repetition_penalty_processor] from the main generate() call entirely.

Before shipping, I validated the smaller model against a set of already-scored drafts: Pearson r=0.89 correlation against the old model’s scores, with 7 of 8 gate decisions (the ≥70 publish threshold) matching. That’s 8 samples — enough to justify a swap for an evaluation-only workload, not enough to make architectural claims about generation quality at scale.

What 22GB Back Feels Like on Apple Silicon

There’s no separate VRAM bucket on unified memory. The GPU and CPU share the same 64GB pool, which means a 40GB model isn’t just using "graphics memory" — it’s consuming RAM that macOS, the browser, the terminal, and everything else also depends on.

When the model shrank from ~40GB to ~18GB, system free memory jumped from 65% to 93%. Compressed swap dropped. Background processes stopped stuttering. The machine just felt different. Before, keeping Chrome open alongside Xcode and a running LLM was a tradeoff I had to think about. With ~46GB of headroom instead of ~14GB, it stopped being a question.

On a unified memory machine, the RAM number isn’t just a spec. It’s the entire working experience. Every gigabyte you save on the model is a gigabyte that macOS can use to keep your browser alive, your terminal responsive, your compiler from stalling. The 22GB reclaim from switching models was worth more than any hardware upgrade I could have bought.

How I measured

Everything here is first-party, measured on my own M1 Max (64 GB) — no external benchmarks. The four-config table comes from a single diagnostic script: same prompt each run, scoring the longest stretch of text with no sentence terminator. The r=0.89 figure is from re-scoring eight already-graded drafts with both Qwen3.6-35B-A3B-8bit and Qwen3.6-27B-4bit and comparing the two sets of scores. The memory numbers (65% → 93% free, ~22 GB reclaimed) are system readings taken before and after the model swap. The whole point is that you can reproduce any of this on your own hardware.

Don’t Size for Prestige

The 40GB model was chosen for generation quality — the reasoning being that bigger MoE models produce better long-form text. But the actual job the model was doing was evaluation: scoring draft content on a 0–100 scale. For that job, the 27B-4bit is sufficient, the correlation numbers confirm it, and it costs 22GB less.

I spent months building scaffolding for a wrong assumption. The bug was three lines of sampler code I wrote myself. Know what a model actually does before deciding which one to run.

How I Budget 64 GB Unified Memory on M1 Max for a 35B Model + Long-Running Agent Loops

SleepyQuant — Tue, 19 May 2026 01:20:11 +0000

How I Budget 64 GB Unified Memory on M1 Max for a 35B Model + Long-Running Agent Loops

The first lie I had to unlearn buying a 64 GB Mac for local LLM work was that I had 64 GB to use for the model.

You don't. After macOS, your browser, your editor, and whatever else you keep open during a workday, the actual usable headroom for ML is about 48-50 GB. That's enough for a 35 B parameter model in Q8 with some breathing room — but only if you're explicit about what else is allowed to live in memory.

This is the budget I run, what it leaves for other work, and how to recalculate for your own setup.

The actual budget on my Mac

Here's the layout I'm running right now, mid-workday with everything I normally have open:

M1 Max 64 GB Unified Memory
─────────────────────────────────────────────────
│ macOS kernel + system services       6.5 GB  │
│ WindowServer + UI compositor         1.8 GB  │
│ Safari (8 tabs, mid-weight)          2.4 GB  │
│ Swift IDE (Xcode-class)              2.7 GB  │
│ Spotlight + background indexing      0.5 GB  │
│ Discord                              0.3 GB  │
│ Terminal + tmux sessions             0.4 GB  │
│ Chrome (3 tabs incl. one heavy SPA)  4.0 GB  │
├───────────────────────────────────────────────┤
│ SYSTEM + WORKFLOW SUBTOTAL          18.6 GB  │
├───────────────────────────────────────────────┤
│ Python runtime + libs                1.2 GB  │
│ MLX model weights (35B Q8)          35.0 GB  │
│ Metal cache (capped)                 0.5 GB  │
│ Agent context buffers                2.0 GB  │
├───────────────────────────────────────────────┤
│ ML SUBTOTAL                         38.7 GB  │
├───────────────────────────────────────────────┤
│ Free + reclaimable buffer            6.7 GB  │
└───────────────────────────────────────────────┘
TOTAL ALLOCATED:                      64.0 GB

That ~6.7 GB free buffer is what I have left for spikes. Chrome opening a heavier tab, a Spotlight reindex burst, a build kicking off. If the buffer drops under 3 GB, macOS starts compressing memory aggressively, and inference latency spikes.

The number I tune to: keep system + workflow under 20 GB so ML has at least 44 GB to play with, including buffer.

Why 35B Q8 specifically fits

Different model sizes and quantizations land in different memory bands. Rough numbers for the common ones I've tested or measured:

Model size	Quant	Resident memory	What's left on 64 GB Mac
7B	Q4	~4 GB	~42 GB (comfortable)
7B	Q8	~7 GB	~39 GB (comfortable)
14B	Q4	~8 GB	~38 GB (comfortable)
14B	Q8	~14 GB	~32 GB (comfortable)
32B	Q4	~18 GB	~28 GB (comfortable)
32B	Q8	~32 GB	~14 GB (tight)
35B MoE	Q4	~19 GB	~27 GB (comfortable)
35B MoE	Q8	~35 GB	~11 GB (very tight)
70B	Q4	~38 GB	~8 GB (won't run with my workflow)
70B	Q8	~70 GB	doesn't fit at all

35B Q8 is the largest model where I can still keep my normal dev workflow open. Anything bigger and I have to close apps to make room. 70B Q4 technically fits but leaves no headroom for the agent loop or browser.

This is also why I swapped from Q4 to Q8 instead of going from 35B to 70B. Q8 of the same model gave me a quality lift I could measure on real outputs; 70B Q4 would have forced me to close half my workspace. Quality-per-headroom favored the upgrade I made.

How to measure your own baseline

The fastest way to see your actual numbers: open Activity Monitor, switch to the Memory tab, sort by Memory descending. The "Memory Used" total at the bottom shows your committed footprint. The "Memory Pressure" graph shows whether macOS is comfortable or struggling.

For a more precise read, three terminal commands:

# System-wide memory state
memory_pressure -Q | head

# Per-process memory (top consumers)
ps -axm -o rss,command | sort -nr | head -15

# Pages active vs compressed vs free
vm_stat

Run these mid-workday with everything you normally have open, before you load the model. That's your baseline. Subtract from 64 GB. Whatever's left is your ML budget.

If your baseline is over 20 GB, you have less ML room than I do. Some choices: close Chrome, reduce open browser tabs, kill Slack/Discord during inference sessions, or accept a smaller model.

What changes if you have less or more RAM

The shape of the budget holds across Mac generations, but the thresholds shift.

M2 Air 16 GB: roughly 6-8 GB system baseline. Leaves ~8-10 GB for ML. Realistic models: 7B Q4 only, with minimal multitasking.

M2 Pro 32 GB: ~12 GB baseline. Leaves ~20 GB for ML. Realistic: 14B Q8 or 32B Q4 with light workflow. 35B too tight.

M1/M2 Max 64 GB (my setup): ~18-20 GB baseline. Leaves ~44 GB. Realistic: 35B Q8 with normal workflow, 70B Q4 if you close most apps.

M2 Ultra 128 GB: ~20-22 GB baseline. Leaves ~106 GB. Realistic: 70B Q8 comfortable, 100B+ Q4 possible.

M3 Ultra 192 GB: similar baseline. Leaves ~170 GB. Realistic: 100B+ Q8, multiple models loaded simultaneously, or one large model + heavy concurrent workload.

The pattern: about 18-22 GB goes to "being a Mac" regardless of total RAM, plus another 0-10 GB depending on your browser/IDE habits. The leftover scales linearly with what you bought.

What goes wrong if you over-budget

The failure modes from over-allocating memory to ML, in order of how often I've hit them:

1. Inference latency spikes. Memory pressure triggers macOS compression. Decode tok/s drops from 26 to 8-12 silently. The model still responds, just slower. You assume the model degraded, when actually the memory layer did.

2. Random app evictions. macOS will start force-quitting background apps to free pages. Discord disappears, your IDE loses unsaved buffers, Spotify silently stops. Usually no notification.

3. Full system freeze. If compression saturates and the kernel can't recover, the whole machine locks. I hit this twice in one week before I tuned memory caps — write-up of the fix is in my 6-layer MLX defense post. Hard reboot required.

4. Swap to SSD wear. macOS will swap pages to SSD if compression fails. Heavy daily inference + tight memory = measurable SSD write amplification. Apple Silicon SSDs have decent endurance, but it's not zero.

The first two are warnings. The third is the failure mode that costs you a workday. Budget accordingly.

What this isn't

This budget is for one workflow: continuous local LLM inference with a multi-agent setup, plus normal dev work in parallel, on a 64 GB M1 Max. The principles generalize but the numbers don't.

If you're a researcher doing batch jobs, you can shut down your dev workflow during runs and free up the 18-20 GB system budget for the model. That lets you push to 50+ GB ML allocation on the same hardware.

If you're a single-shot interactive user (one prompt, read answer, repeat), you can be looser with the cache caps. The accumulated drift doesn't have time to build up.

If you're a multi-tenant server operator running inference for multiple users, you need to budget per concurrent session. The numbers in this post assume one user (me).

If you're choosing a Mac to buy for local LLM work, the practical guidance: 32 GB if you want 14B; 64 GB if you want 35B with workflow; 128 GB+ if you want 70B or want headroom for the next model generation. Apple Silicon non-upgradeable RAM means buy more than you think you need.

The smaller lesson

Unified memory is not a free lunch. The advantage over discrete VRAM (no copy overhead, model + workflow share pool) comes with the responsibility to be explicit about who gets what. Default macOS behavior assumes you're not running a 35 GB model. You have to opt into the budget.

If you've worked out a different budget that fits your workflow on the same RAM, I'd genuinely like to see it. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

SleepyQuant — Tue, 19 May 2026 01:09:18 +0000

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

I tested both. Same machine (M1 Max 64 GB), same model (Qwen 3.6 35B-A3B Q8), same prompts, same generation lengths. llama.cpp came out about 30% faster on raw decode throughput. I stayed on MLX anyway.

This is the breakdown of what each gets right, where the speed gap actually shows up, and why I stayed on the slower one. Hopefully useful if you're picking your local inference stack from scratch.

Setup

Hardware: MacBook Pro M1 Max, 64 GB unified memory, 1 TB SSD, on macOS Sequoia
Model: mlx-community/Qwen3.6-35B-A3B-8bit for MLX, equivalent GGUF Q8_0 for llama.cpp (Qwen3.6-35B-A3B-Q8_0.gguf)
Prompts: 5 prompts, mix of short Q&A (50 tokens output) and longer content generation (500 tokens output), 3 runs each, warm-cache results
MLX setup: MLX_FORCE_FP16=1, wired_limit 45 GB, cache_limit 512 MB
llama.cpp setup: Metal backend enabled, --n-gpu-layers -1 (all on GPU), threads 8, context 8192

I'll caveat upfront: this is one machine, one model, one workload pattern. Your numbers will be in the same shape but different magnitudes. If you replicate and your numbers diverge significantly, I'd genuinely like to know.

Raw throughput

Metric	MLX (fp16)	llama.cpp (Metal)
Decode tok/s (steady state, 500-token gen)	26.22	34
Prefill tok/s (1k token prompt)	~190	~245
Cold-start latency (first token)	1.8s	1.2s
Memory resident	~35 GB	~37 GB
Memory peak (under load)	~42 GB	~44 GB

llama.cpp wins all three speed measures. About 30% faster on decode, similar margin on prefill. Cold start is also faster — less Python/MLX import overhead, more direct Metal binding.

Memory usage is comparable. MLX edges out slightly because it doesn't keep a separate GGUF reader buffer. Not enough to be a deciding factor.

If your primary need is "generate as many tokens as fast as possible, batch workload, throughput-bound" — llama.cpp wins. Stop reading and switch.

Where the speed difference doesn't show up

In my actual day-to-day usage, the 30% speed gap doesn't translate to 30% better experience. Here's why.

Interactive chat: I read at maybe 5-7 tok/s of comprehension. Whether the model generates at 26 or 34 tok/s, I'm waiting for me, not the model. The gap is invisible.

Agent loops where I parse output: the bottleneck is round-trip time, not generation speed. A 100-token JSON response takes 3.8s on MLX vs 2.9s on llama.cpp. Both feel fast to me. The agent loop spends most of its time on tool execution and HTTP calls, not inference.

Sectional generation for long content (which I do because of MoE degeneration on long contexts): I generate 300-token sections. Each section takes ~11 seconds on MLX, ~8.5 on llama.cpp. Difference is 2.5 seconds per section, ~15 seconds total for a 6-section blog post. Imperceptible vs the time I spend reviewing and editing the output.

The speed gap matters when you're running batch inference (e.g., re-scoring a dataset, generating 10k synthetic examples). For interactive or agent workloads on a single user, it's largely cosmetic.

Where MLX wins (that's why I stayed)

These are the reasons I haven't switched, in rough priority order.

1. Python-native API. MLX is pip install mlx mlx-lm and you're calling Python functions directly. Tokenizers integrate cleanly with HuggingFace patterns I already know. Chat templates work through standard tokenizer.apply_chat_template. No subprocess, no IPC, no llama.cpp server.

llama.cpp has Python bindings (llama-cpp-python), but they're a layer over the C++ core. Some features lag the main project, tokenizer behavior occasionally diverges from upstream HF, and the bindings need to be rebuilt when you upgrade. Not deal-breakers, but friction adds up.

2. Quantization and format flexibility. MLX reads safetensors directly. I can swap from Q4 to Q8 to fp16 by changing the model path. No re-conversion step.

llama.cpp uses GGUF, which is a different format. To switch quants, you either download a different GGUF (if someone published one) or convert from safetensors yourself with convert.py. For exotic models or custom finetunes, this is real overhead.

3. The MoE handling is genuinely better. Qwen 3.6 35B-A3B is a Mixture-of-Experts model. MLX's MoE routing implementation has been measurably more stable for me on long generations. llama.cpp had a few weeks early in 2026 where Qwen MoE inference would silently produce different outputs run-to-run because of router determinism bugs. Fixed now, but it shook my confidence.

4. Sectional generation pipeline already built. I have a working setup for sectional gen that integrates with my agent stack. Switching to llama.cpp would mean re-implementing the same flow. Re-implementation cost: probably 1-2 weeks, including testing and the inevitable bugs in the new setup. 30% speed gain doesn't pay back 2 weeks of work for an interactive workload.

5. Apple is investing in MLX directly. This is a soft factor, not a benchmark. But MLX is Apple's first-party ML framework for their silicon. Improvements compound faster when the chip designer is also writing the framework. llama.cpp is community-maintained, brilliant work, but not Apple-funded.

Where llama.cpp wins (and might still be the right pick)

To be fair to the project, here's where llama.cpp is clearly the better choice.

Batch inference at scale. If you're scoring 100k prompts overnight, the 30% speed advantage compounds. Save 8 hours on a 24-hour job.

Embedded/constrained environments. llama.cpp compiles to a tiny binary and runs on a wider range of hardware. If you're shipping a desktop app to users with mixed Macs, llama.cpp gives you broader compatibility.

Quantization research. llama.cpp's quant ecosystem (Q4_K_M, Q5_K_S, IQ-quants) is broader and more experimental than MLX. If you're testing new quant strategies, llama.cpp moves faster.

Cross-platform deployment. Need the same inference code to run on Linux, Mac, Windows, Android? llama.cpp does all four. MLX is Apple-only.

Independent of Python ecosystem. No GIL, no Python import dance. If you're already writing C++ or Rust, llama.cpp slots in cleaner.

What I'd recommend by use case

A short list, since this post has gotten long.

Use case	Pick	Why
Interactive chat on a Mac	MLX	Speed gap invisible, Python integration matters more
Agent loops on a Mac	MLX	Same as chat, plus MoE stability
Local API server (single user)	Either	Personal pick. Toss-up.
Batch dataset scoring	llama.cpp	30% speed gap × N tokens = real hours saved
Cross-platform desktop app	llama.cpp	MLX is Apple-only
Embedded inference (mobile, edge)	llama.cpp	Smaller binary, fewer deps
Custom quantization R&D	llama.cpp	Broader quant ecosystem

If you're a Mac dev building an agent stack or chat app for yourself: MLX is the easier path. If you're shipping inference at scale or beyond Mac: llama.cpp.

Will I switch later?

Maybe. The condition I'm watching: if MLX's release cadence slows or if llama.cpp's Python bindings catch up on tokenizer behavior, the trade looks different.

I check in on both every couple of months. Last check was April 2026. Plan to check again July 2026. If anything material changes, I'll write the update.

If you've benchmarked these two on different hardware or a different model and your conclusion is different, I'd genuinely like to see your numbers. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.

MoE Degeneration on Long Context — Why My 35B Model Started Repeating Itself

SleepyQuant — Tue, 19 May 2026 01:09:13 +0000

MoE Degeneration on Long Context — Why My 35B Model Started Repeating Itself

The first 600 tokens looked great. Coherent prose, on-topic, the same voice I'd been getting from Qwen 3.6 35B-A3B Q8 for weeks. Then something snapped. The next 200 tokens were a chain of synonyms — "leadership management administration supervision oversight stewardship" running for half a paragraph. After that, partial sentences. After that, nonsense.

I assumed I'd hit a quant artifact. Q8 isn't lossless. Maybe the model was confused. Re-ran with a fresh prompt at the same max_tokens. Same collapse around the 600-700 token mark.

It wasn't quantization. It's an MoE-specific failure mode that gets worse on long generations. And the fix isn't tuning sampling — it's not generating long sequences at all.

What the output looked like

Here's a sanitized example of what I saw. The prompt asked for a 1500-word blog post outline.

Tokens 0-600 (coherent):

The Apple Silicon memory model differs fundamentally from x86. Where Intel and AMD systems separate DRAM from VRAM with explicit bandwidth boundaries, M-series chips share a single unified memory pool. This has real implications for ML inference: model weights and runtime allocations compete for the same physical bytes...

Tokens 600-800 (degradation starts):

One of the key considerations when working with this architecture involves understanding the relationship between allocation, deallocation, retention, release, management, oversight, supervision, administration, governance, stewardship, custody, oversight again, the role of overseeing the management of the allocation in a managed manner...

Tokens 800+ (collapse):

The the the system system the system the management of of the the of allocation system the the management the the management oversight oversight oversight...

The transition from coherent to degraded happened over about 100 tokens. Before that, the output was on-voice and useful. After, it was unsalvageable.

What I think is happening

Caveat first: I don't have authoritative access to Qwen's training data or routing internals. This is a working hypothesis built from observing the symptom across hundreds of generations and reading public research on MoE behavior.

Mixture-of-experts models route each token to a small subset of available "expert" sub-networks. In Qwen 3.6 35B-A3B, the "A3B" means roughly 3 billion active parameters per token out of 35 billion total. The router picks which experts handle which tokens based on attention patterns and learned routing weights.

On short generations (under 400 tokens), the router behavior is stable. Each token's expert selection has plenty of attention context to score against, and the experts that win tend to be the right ones for the topic.

On long generations, two things start to drift:

The attention context fills with the model's own output. By token 600, more than half the context is what the model just generated. The router's routing weights are now being computed against generated content, not the original prompt. If any expert produced even mildly low-quality output, that output now influences which experts get picked next.
Routing collapse on dominant experts. When the router has been picking the same few experts consistently, attention weights start concentrating on those experts' "vocabulary." The model develops a self-reinforcing loop where the experts good at certain word categories (abstract nouns, conjunctions, hedge phrases) keep winning the routing competition.

Combine these and you get the synonym-chain pattern: experts good at abstract management vocabulary win routing, the output reinforces their winning, and the chain spirals.

Dense (non-MoE) models hit similar degradation but later — usually past 1500-2000 tokens — because there's no router to collapse. MoE seems to fail earlier in this specific mode.

The fix that worked: sectional generation

Once I understood it was a context-buildup problem, the fix was simple: don't generate long sequences. Generate short ones and concatenate.

Specifically: split your target content into sections of 250-400 tokens each, generate each section independently with its own prompt, then concatenate.

def generate_long_content(prompt_skeleton, sections):
    outputs = []
    for section_name, section_instruction in sections:
        section_prompt = f"{prompt_skeleton}\n\nWrite the section: {section_name}\nInstruction: {section_instruction}\nTarget length: 300 tokens."
        section_output = generate(model, tokenizer, section_prompt, max_tokens=400)
        outputs.append(f"## {section_name}\n\n{section_output}\n")
    return "\n".join(outputs)

Each section gets a fresh attention context. The router never sees more than 400 tokens of generated content at once. Degradation never starts because the runway is too short.

Trade-off: section transitions can feel slightly choppy because each section was generated independently. For a blog post, this is usually fine — the human-written section headings paper over the seam. For continuous prose like a novel, you'd want extra glue prompts to maintain flow.

The other trade-off: it's slower. Five 300-token generations take longer than one 1500-token generation because of per-call overhead. In my measurements, sectional gen was about 30-40% slower for the same total token count. The quality difference more than justifies it.

What didn't work

Before landing on sectional gen, I tried a few things that sounded reasonable.

Temperature dropping. Lowered temperature from 0.7 to 0.3 hoping more deterministic sampling would avoid the synonym chain. It didn't. The degradation still started around token 600. Lower temperature just made the synonym chain more repetitive, not absent.

Repetition penalty. Added a repetition_penalty=1.15 to the generate call. This helped slightly — pushed the collapse out to token 700-800. But it didn't prevent the underlying routing collapse, and at higher penalties the output started avoiding common words (articles, prepositions) in weird ways.

Top-p tightening. Dropped top_p from 0.9 to 0.7. Same story as temperature drop — the collapse still happened, just with a smaller vocabulary.

Longer prompt. Padded the prompt with more context, hoping the model would have more to anchor on. Made degradation slightly worse if anything — more context for the router to chase as the generation continued.

The pattern across all the failed attempts: sampling adjustments treat the symptom (low-quality token at position N) but don't fix the cause (routing dynamics on long generations). Sectional generation fixes the cause by avoiding the long generation entirely.

How to detect it in your own runs

If you suspect MoE degeneration, the easiest signal is a word-overlap check on the output. Compare token sets across sliding windows:

def detect_collapse(text, window=100, overlap_threshold=0.65):
    tokens = text.split()
    for i in range(0, len(tokens) - window):
        window_tokens = set(tokens[i:i+window])
        next_window = set(tokens[i+50:i+150])
        overlap = len(window_tokens & next_window) / len(window_tokens)
        if overlap > overlap_threshold:
            return i  # position where collapse starts
    return None

If the same 100-word window has more than 65% overlap with the next 100 words, you're probably in collapse territory. Truncate the output at the collapse position, regenerate the rest with a fresh context.

This is the same check I bake into my generation wrapper. When it fires, the wrapper retries the section with a smaller token budget. It's not elegant, but it's robust.

What this isn't

This is calibrated for Qwen 3.6 35B-A3B Q8 specifically. The exact collapse threshold (600-700 tokens for me) will differ for other models.

Smaller MoE models (Qwen 1.5 14B MoE, Mixtral 8x7B): degradation may start later because fewer experts means simpler routing dynamics. Or it may start earlier if those experts overlap heavily. I haven't tested these.

Dense models (Llama 3 70B, Qwen 2.5 dense): you'll hit similar degradation but usually past 1500-2000 tokens. Sectional gen still helps but is less urgent.

Higher quantization (Q4 vs Q8): I tested Q4 of the same model briefly before swapping to Q8 for quality. Q4 collapsed earlier (around token 400-500), consistent with the hypothesis that quantization noise compounds with routing instability.

The smaller lesson

When you see word salad from a long generation, your first instinct is probably sampling tuning. Mine was. It almost never works as a primary fix.

The pattern to internalize: long-sequence degradation on MoE models is a routing problem, not a sampling problem. The fix is structural (don't generate long sequences) not numeric (don't tune temperature). Sectional generation forces the structure.

If you've hit this on a different MoE model and found a different fix, I'd genuinely like to know which one. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.

Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing

SleepyQuant — Mon, 18 May 2026 13:21:03 +0000

Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing

I lost two hours last week to a Qwen 3.6 quirk that doesn't show up in any quickstart guide. My agent kept returning malformed JSON. Logs showed the model output started with <think> and a 200-token reasoning monologue before the actual JSON I asked for. Parser exploded every time.

The fix is one keyword argument. The frustration is that nothing in the obvious places — model card, MLX docs, generic chat template examples — tells you about it.

If you're running Qwen 3.6 MoE for an agent setup and your structured outputs are broken, read on.

The symptom

I had a tool-calling loop that asked Qwen to emit JSON. Something like:

prompt = "Return a JSON object with keys 'action' and 'target'."
response = generate(model, tokenizer, prompt)
data = json.loads(response)

Worked fine with Qwen 2.5. Broke immediately with Qwen 3.6. The output looked like:

<think>
The user wants a JSON object. I need to think about what action and target make sense.
Let me consider the context...
[200 more tokens of reasoning]
</think>

{"action": "search", "target": "weather"}

JSON parser saw the <think> block as garbage, threw a JSONDecodeError. Easy enough to spot once I logged the raw output. But it took me a while to realize this was a model feature, not a prompt problem.

What's actually happening

Qwen 3.6 ships with reasoning mode default-on. The chat template injects markers — <think> and </think> — and the model is trained to fill them with its chain-of-thought before producing the user-facing answer. For interactive chat, this is sometimes useful: you can show or hide the reasoning to a user, and the reasoning content does measurably improve answer quality on hard problems.

For an agent loop that parses structured output, it's silently destructive. Every response starts with hundreds of tokens you have to strip before you can use the actual answer. And worse, the reasoning length is unpredictable — sometimes 50 tokens, sometimes 800 — so your max_tokens budget gets eaten by thinking instead of output. On a memory-tight Mac running a 35B model already, those wasted tokens also fragment Metal cache faster — separate problem but they compound. (I wrote up the memory side in my MLX memory safety checklist if that's the angle you hit first.)

The fix

In apply_chat_template, pass enable_thinking=False:

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # <-- this
)
response = generate(model, tokenizer, text)

That's it. No <think> blocks, no reasoning preamble, just the answer. JSON parses cleanly. max_tokens budget goes to the actual response.

Where the flag has to go

This took me embarrassingly long to figure out. The flag belongs at template apply time, not at generation time. You can't pass it to model.generate() and have it work. You can't set it as a tokenizer kwarg at load time. It only has effect inside apply_chat_template.

I tried these wrong things first:

# These do nothing — flag is ignored
generate(model, tokenizer, prompt, enable_thinking=False)
tokenizer = AutoTokenizer.from_pretrained(model_id, enable_thinking=False)
model.generate(prompt, enable_thinking=False)

If you've inherited a codebase where chat formatting is wrapped in a custom function, the wrapper probably calls apply_chat_template somewhere. That's the spot. Patch it there.

When you actually want thinking on

For interactive chat where a user reads the response, leaving enable_thinking=True (the default) usually helps. The model is genuinely smarter on multi-step reasoning when it gets to think out loud. Math problems, code debugging, multi-constraint planning — all measurably better with thinking on.

So the rule isn't "always disable." It's "disable for any path where the output gets machine-parsed, kept on for any path where a human reads it."

In my own setup (a multi-agent local stack on M1 Max — full hardware notes in the 19 GB memory compression writeup), I split into two generate functions:

def generate_for_agent(messages, max_tokens=512):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=False  # parser-safe
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

def generate_for_chat(messages, max_tokens=2000):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=True  # quality boost for chat
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

Two functions, two contexts. Same model, same tokenizer, different chat template flag. Clean separation.

Why the docs don't surface this

This is my speculation, not authoritative — but here's what I think happened. Qwen 3.6 launched as Alibaba's flagship reasoning model. The whole pitch is "thinks before it answers." Disabling that flag in the quickstart would undercut the marketing of the feature itself. So the docs assume you want thinking on by default, and the flag is buried in API reference, not the first-page tutorial.

If your use case is agent JSON, you'll find this gotcha on day one. If your use case is human chat, you might never need to touch the flag and won't see why anyone would.

It's a real-world case where the default optimizes for the most demo-worthy path, not the most common production path.

Verification

After patching, you can verify the flag took effect by inspecting the rendered template before generation:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
print(text[-200:])  # tail of the prompt

You should see the assistant generation prompt with no <think> marker. If you see <think> in the tail, the flag didn't apply — most likely because you're calling a wrapper that doesn't pass it through.

You can also check by inspecting the first 100 tokens of any response. Reasoning-on output starts with <think>. Reasoning-off output starts with the actual answer.

What this isn't

This is specifically Qwen 3.6 behavior. Earlier Qwen versions (2.5 and below) don't have the enable_thinking flag because reasoning mode wasn't a feature yet. Other reasoning-mode models (DeepSeek-R1, the o1 family on the OpenAI API) have similar dynamics but different flags or modes — check their respective chat templates.

If your output isn't parsable but doesn't have <think> blocks, the cause is somewhere else. Common alternatives I've hit:

Trailing whitespace or newlines in the response — strip before parsing
Markdown code-fence wrapping around the JSON — strip json ` and `
Model adding explanatory text before/after the JSON — tighten the system prompt with explicit "no preamble, no explanation"

The <think> block fix only solves the reasoning-leak case. The other cases need other fixes.

The smaller lesson

When a new model breaks an existing pipeline silently, the bug is usually in the chat template, not the generate call. The template is the interface between your code and the model's expectations. Most upstream API changes happen there.

For Qwen 3.6, the gotcha is enable_thinking. For the next model in two months, it'll be something else. The diagnostic habit — log the rendered template, not just the response — saves hours over the year.

If you've hit a different Qwen 3.6 surprise that nobody flags, I'd genuinely like to know. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.

MLX Memory Safety Checklist: 6-Layer Defense for M1/M2 Apple Silicon

SleepyQuant — Mon, 04 May 2026 08:54:01 +0000

MLX Memory Safety Checklist

6-Layer Defense for M1/M2 Apple Silicon

A solo public notebook from SleepyQuant.

The problem

I froze my M1 Max twice in one week running Qwen 3.6 35B-A3B Q8 for a 12-agent stack.

Symptoms before the fix:

Memory compressor hit 19.69 GB of compressed pages
macOS started swapping random background apps (Safari tabs, IDE windows)
After ~6 hours uptime: full system freeze, hard reboot only option
MLX inference latency drifted from ~26 tok/s → ~14 tok/s before the freeze hit

Root cause: MLX on Apple Silicon uses unified memory + Metal command buffers that grow without explicit cleanup. Default macOS memory_pressure thresholds don't kick in fast enough for a 35GB-resident model + per-inference Metal cache buildup.

After the 6-layer defense below, same workload runs steady:

Compressed memory: <1.7 GB (-91%)
Metal active: ~35 GB (model weights, expected)
Metal cache: <100 MB (was unbounded before)
Free + reclaimable: ~30 GB buffer
Zero freezes in 7 days continuous run

Here's exactly what each layer does and how to ship it.

Layer 1 — Metal `wired_limit` cap

What it does: tells Metal driver max bytes it can pin in physical RAM (un-pageable).

Set to ~70% of total unified memory. On 64GB M1 Max:

import mlx.core as mx
mx.metal.set_wired_limit(45 * 1024**3)  # 45 GB

Why this matters: without a cap, Metal can grow past comfortable headroom and force macOS to compress everything else. With 45GB cap, the OS keeps ~19GB breathing room for app + IDE + browser.

Layer 2 — Metal `cache_limit` cap

What it does: caps the Metal allocator's internal buffer reuse cache. Different from wired memory — this is the "scratch" that builds per-inference.

mx.metal.set_cache_limit(512 * 1024**2)  # 512 MB

Why 512 MB: empirically enough to keep inference fast (cache hit on common shapes) without unbounded growth on long generation runs. Set lower (256 MB) if you have <32GB total.

Layer 3 — `memory_limit` (soft ceiling)

mx.metal.set_memory_limit(48 * 1024**3)  # 48 GB

This is MLX's own soft ceiling. Slightly higher than wired_limit to allow some pageable allocation but still bounded.

Layer 4 — Explicit `clear_cache()` after long inference

Hook into your generation loop:

def generate_with_cleanup(model, prompt, max_tokens):
    output = model.generate(prompt, max_tokens=max_tokens)
    if max_tokens >= 500:
        mx.metal.clear_cache()
    return output

Why threshold at 500 tokens: short generations don't accumulate enough cache to matter. Long ones (essay drafts, multi-section content, reasoning chains) do. Clearing on every call costs ~5-10ms per inference; clearing on threshold saves that overhead.

Layer 5 — 5-minute memory pressure watchdog

Background thread that polls macOS memory_pressure every 5 min. If pressure crosses "warn" threshold, force clear_cache() + log:

import subprocess, time, threading

def memory_watchdog():
    while True:
        out = subprocess.check_output(
            ["memory_pressure", "-Q"], text=True
        )
        # Parse "System-wide memory free percentage: 18%" from output
        if "warn" in out.lower() or _free_pct(out) < 15:
            mx.metal.clear_cache()
            print(f"[watchdog] forced cache clear, free={_free_pct(out)}%")
        time.sleep(300)

threading.Thread(target=memory_watchdog, daemon=True).start()

This is the "if all else fails" net. Catches drift cases that the per-inference threshold misses.

Layer 6 — Nightly restart via LaunchAgent

The honest one. Even with all 5 layers above, multi-day uptime accumulates fragmentation. Easiest fix: scheduled restart at 4 AM local time.

LaunchAgent plist (~/Library/LaunchAgents/com.yourapp.backend.plist):

<dict>
    <key>Label</key>
    <string>com.yourapp.backend</string>
    <key>ProgramArguments</key>
    <array>
        <string>/path/to/your/start.sh</string>
    </array>
    <key>KeepAlive</key>
    <true/>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key><integer>4</integer>
        <key>Minute</key><integer>0</integer>
    </dict>
    <key>EnvironmentVariables</key>
    <dict>
        <key>MLX_FORCE_FP16</key><string>1</string>
    </dict>
</dict>

Load: launchctl load ~/Library/LaunchAgents/com.yourapp.backend.plist

Why nightly not weekly: model warmup is ~60 seconds; nightly restart is barely noticeable but resets all accumulated state. Weekly meant the freeze caught me before the restart fired.

Verification commands

Run these while your inference workload is active to verify each layer is doing its job:

# Check Metal active + cache + compressed memory
sudo memory_pressure -Q
vm_stat | grep -E "(Pages active|Pages compressed|Pages free)"

# Check MLX limits applied
python -c "import mlx.core as mx; print(mx.metal.get_active_memory()/1024**3, 'GB active')"
python -c "import mlx.core as mx; print(mx.metal.get_cache_memory()/1024**3, 'GB cache')"

# Check LaunchAgent loaded
launchctl list | grep yourapp

Healthy steady-state targets (35GB model on 64GB Mac):

Pages compressed: <500k pages (~2 GB)
Metal active: ~35 GB
Metal cache: <500 MB
Pages free + inactive: >7M pages (~30 GB)

What happens when each layer fails

Layer fails	Symptom
1 (wired_limit)	Compressed memory climbs past 5 GB within hours
2 (cache_limit)	Metal cache grows unbounded, eventually swap thrash
3 (memory_limit)	Allocation errors mid-inference (rare, hard to catch)
4 (clear_cache hook)	Slow drift over long generations, latency creep
5 (watchdog)	Edge cases sneak past, freeze possible after 8+ hours
6 (nightly restart)	Multi-day uptime hits fragmentation wall around day 3-4

All 6 together: zero freezes in continuous 7-day runs on 12-agent workload.

What this is and isn't

This is the setup that worked for one specific workload: 35GB Qwen MoE Q8 + 12-agent multi-tenant inference on a 64GB M1 Max. Numbers are real, from my own backend.

It is not a universal recipe. If you're running:

Smaller models (<10 GB): Layer 1 cap can be tighter (15-20 GB), Layer 5 watchdog less critical
Larger Macs (128 GB Studio): cap can be 80-90 GB
Single-user dev workload: nightly restart may be overkill

Test each layer independently. Watch the verification commands. Adjust thresholds to your workload.

Want more posts like this?

I'm building a multi-agent quant stack on one M1 Max, public notebook style. Local AI engineering, MLX deep-dives, paper-trading transparency, all numbers (good and bad).

Subscribe to SleepyQuant Weekly at sleepyquant.rest — see me fall or thrive, whichever comes first.

Last updated 2026-04-27. Numbers from my own backend running Qwen 3.6 35B-A3B Q8 on M1 Max 64GB since 2026-04-20. If you find a layer that helped or didn't help in your setup, reply to the welcome email — I'd genuinely like to compare notes.

Yen Intervention Crypto: Why This Isn't August 2024 Round Two

SleepyQuant — Mon, 04 May 2026 06:59:14 +0000

Yen Intervention Crypto: Why This Isn't August 2024 Round Two

TL;DR

Tokyo intervened in the FX market on April 30, 2026, and USD/JPY round-tripped from 160.72 down to 155.5 before bouncing back to 157.2 in the Asia session. Some crypto desks are already typing "carry trade unwind, round two." I think they're skipping a step. The yen intervention crypto question this week is real — but one-shot FX intervention is not the same animal as the structural rate hike that crashed Bitcoin in August 2024. The signal for crypto isn't this print. It's whether the BoJ pivots from spending reserves to raising rates during the May 1 to May 6 holiday window. Here's what I'm watching, and why I'm sitting on hands.

The Setup

Last Wednesday, USD/JPY hit 160.72 — the weakest yen reading in a long stretch. Background: US-Iran tensions have been spooking Japan's energy import bill for weeks, and traders piled into shorts ahead of the move. Tokyo blinked.

Finance Minister Katayama said it was "near time for decisive action." Official Atsushi Mimura warned shorts directly: "this is the last advice if you want out" (quotes from VnEconomy reporting on the April 30 intervention). Then the BoJ stepped through the door:

Yen ripped roughly 3% against the dollar in the US session
USD/JPY round-trip: 160.72 → 155.5 → 157.2 in Asia
DXY fell 0.92% to 98.06

Japan's holiday runs May 1 to May 6. That means another intervention can land at overseas venues while Tokyo is technically closed. That's the cliffhanger heading into next week.

Why Crypto Traders Open a Yen Headline at All

Short version: yen is the world's primary funding currency. People borrow yen at near-zero rates, swap into dollars, and buy higher-yielding stuff — Treasuries, equities, crypto. When yen strengthens fast, those positions get squeezed, and someone has to sell what's liquid to cover. Crypto trades 24/7, so it gets sold first.

August 2024 is the reference incident every macro-aware crypto trader knows. The BoJ raised its policy rate to 0.25% on July 31, 2024 (Bank of Japan policy statement). The yen ripped against the dollar over the next few sessions. Bitcoin fell from roughly $65,000 toward $49,000 (CoinGecko BTC/USD historical, August 1-5). Nikkei dropped −12% on August 5, 2024 — its worst single-day drop since 1987 (Tokyo Stock Exchange close data). The move wasn't about crypto fundamentals. It was about leverage being forced off a curve that suddenly cost more to ride.

That's the playbook some traders are dusting off this week.

Why I Think It's the Wrong Playbook

August 2024 was a rate hike. That's a structural change in the cost of borrowing yen. Once funding gets more expensive, the carry math doesn't recover — positions have to come off, and they stay off.

April 30, 2026 was an intervention. Tokyo spent reserves to push the print. Mechanically that's a one-shot move. Without a follow-on rate move, the carry math is unchanged: yen funding is still cheap, the trade still works. History on solo Japan FX intervention is consistent — the move fades unless paired with policy. The 2022 round, when MoF burned roughly $60 billion across multiple September-October episodes (Japan Ministry of Finance public disclosure), bought weeks of relief but never regime change. Yen weakness resumed once the intervention impulse faded.

The 2.5% pop on USD/JPY is dramatic on a chart. By itself, it is not an unwind catalyst.

What I'm Actually Watching

Three things, ranked by how much they would shift the read:

BoJ tone during the holiday window (May 1 to May 6). If officials start hinting that rate normalization is back on the table, that is the August 2024 setup arriving in slow motion. If they keep talking intervention only, it stays noise.
US-Iran headline track. The yen weakness was driven by energy import fear. A real escalation hits crypto directly through risk-off, not through the carry channel. That's a separate thread to pull.
DXY behavior. The −0.92% move is mildly supportive for risk assets if it sticks. If DXY rebuilds above 99 inside a week, the intervention got fully retraced, and the yen pressure is right back where it started.

Where I Could Be Wrong

The honest counter case: I could be wrong about the holiday window. Japan can intervene again with thinner liquidity overseas, force a second 2 to 3% pop, and trigger forced covering at funds that are running real carry exposure. That cascade doesn't need a rate hike to start. Just enough mechanical pain. If USD/JPY breaks 153 and stays there for more than a session, I would flip the read.

What I'm Doing

Nothing. No hedge, no rotation, no take. The signal is not here yet. The book stays put, and I'll re-read this on Tuesday or Wednesday when Tokyo is back at desk and policy intent gets clearer.

This is a working note, not a call. The book is a public journal — every read here is provisional, and I'll update or scrap as the data evolves. The rest of the trades and the failures live at sleepyquant.rest.

Come along for the ride — see me fall or thrive, whichever comes first.

Sources

April 30, 2026 intervention event and quotes from Finance Minister Katayama / official Atsushi Mimura: VnEconomy report.
BoJ policy rate hike to 0.25% on July 31, 2024: Bank of Japan public policy statement.
Bitcoin price action $65,000 → $49,000 over August 1-5, 2024: CoinGecko BTC/USD historical.
Nikkei 225 −12% on August 5, 2024 (worst single-day drop since 1987): Tokyo Stock Exchange close data.
2022 Ministry of Finance intervention scale (~$60 billion across September-October episodes): Japan MoF public disclosure.
USD/JPY round-trip 160.72 → 155.5 → 157.2 and DXY −0.92% to 98.06 on April 30, 2026: standard FX data feeds.

The 0.42% Bar: A Passive Yield Benchmark for Every Crypto Trading Bot (April 2026)

SleepyQuant — Mon, 27 Apr 2026 16:13:30 +0000

The 0.42% Bar: A Passive Yield Benchmark for Every Crypto Trading Bot (April 2026)

Most "is my trading bot any good?" conversations start from the wrong place. People compare bot returns to zero, or to "the market," or to whatever random chart is in front of them. None of those are the right bar.

The right bar is a passive yield benchmark — the next-best thing you could have done with the same capital, doing nothing.

That next-best thing is not a single number. It depends on how much work, risk, and lockup you are willing to accept. So the honest framing is not "does the bot beat X?" but "at what point does the bot become more attractive than any of the passive options at its own risk tier and above?"

This post walks through five publicly checkable passive yields as of April 2026 and lays out what a crypto trading bot actually has to clear before it deserves capital.

Why passive benchmarks matter more than "the market"

A trading bot is not competing against the S&P 500. It is competing against a spectrum of alternatives that an investor could park the same USD or USDT into right now, with far less operational risk.

If a bot returns 3% in a year and the lowest-effort alternative returns 5%, the bot has lost — even though its return is positive. Losses are not just red PnL. Losses include every basis point the bot fails to earn against an easier path.

Every benchmark below is a real product an investor can access this week. None of them require running a Mac, maintaining code, or watching drawdowns on a Sunday night.

Tier 1 — Crypto earn products (lowest friction)

Binance Simple Earn Flexible USDT: ~0.42%/month (~5.16% APY)

This is the baseline. A USDT deposit on Binance Simple Earn Flexible earns roughly 0.42% per month at current promotional rates. The funds are liquid, the balance is visible, subscription is one click. April 2026 rates are variable and may drift, but the 0.42%/month range has been persistent for months.

Cake Finance Flexi USDT: ~0.29%/month (~3.53% APY)

A slightly lower-yield competitor, also liquid, with roughly 0.29%/month as of April 2026. Useful as a second data point for "what does the market pay to park USDT with no work?"

What this means: any bot that returns less than 0.42%/month on USDT, net of fees and slippage, is paying the operator to underperform a one-click product.

Tier 2 — Fiat instruments (moderate friction, no crypto risk)

US 1-Year Treasury Bill: ~4.3% annualized

T-Bills are among the lowest-risk dollar-denominated yields in the world. The 1-year T-Bill is a bellwether for risk-free USD return. At ~4.3% annualized in April 2026, a T-Bill yields more than the crypto earn tier on an annualized basis and carries essentially zero counterparty risk relative to a crypto exchange.

The trade-off is friction. Buying T-Bills requires a brokerage account. The money is effectively locked for the term. But for an investor comparing "where do I park this for 12 months," the T-Bill yield is the honest number to beat.

Vietnamese bank 12-month Certificate of Deposit: ~5.5-6% annualized

Vietnamese commercial banks offer 12-month CDs in the 5.5-6% range in April 2026. Locked up, insured within local deposit guarantees, VND denominated. An investor with VND already in the system can pick this up with almost no work.

For a bot operating in a VND context, this is the explicit bar. Any trading strategy must outperform a passive 5.5% annual deposit after tax and effort, or the operator is trading for entertainment.

Tier 3 — Equity index averages (long-horizon only)

VN30 Index 10-year average: roughly 10-12% annualized (before dividends)

The VN30 is Vietnam's blue-chip index. Its long-run average price return sits around 10-12% annualized over 10-year windows, though individual years swing wildly. Drawdowns of 30-40% have happened. This is not a one-click product — it requires a brokerage account and the stomach to hold through volatility.

S&P 500 10-year average: roughly 10-13% annualized (historical)

The SPY comparison needs no introduction. 10-year rolling averages have clustered in the 10-13% range historically, including dividends. Like VN30, real experience includes multi-year drawdowns that kill most would-be passive holders halfway through.

Equity indices are the highest passive bar, but they are also the noisiest. The honest comparison is multi-year, not monthly.

So what does a trading bot actually have to beat?

Pick the tier that matches the bot's risk profile, and the bot has to clear the corresponding passive.

Bot risk profile	Passive peer	Monthly yield to clear
Stablecoin, low leverage, low drawdown	Binance Simple Earn	0.42%
USD risk tolerance, no crypto risk	1Y T-Bill	0.36%
VND investor, moderate risk tolerance	VN bank 12M CD	0.45-0.50%
Long horizon, volatility tolerant	VN30 or SPY 10Y avg	0.83-1.00%

A bot that clears 0.42%/month on USDT sustainably is competitive with a one-click product. A bot that clears 1%/month sustainably is competitive with long-run equity returns — at far greater drawdown risk. Anything less and the honest answer is to stop running the bot.

The gate we use on our own bot

We gate any capital scaling on clearing these bars on rolling 30-day paper trading windows, before any allocation increase. The first gate is 0.42%/month after fees. The second is the VN bank CD. The third is a long-run equity-index equivalent.

We are currently in an edge-rebuild phase. Paper returns are being measured against these passives every day, not against zero. That is the only comparison that tells the truth.

How we measure the bot against these bars

Day-to-day measurement is not a single P&L number. It is a 30-day rolling window comparing the bot's net return — after fees, slippage, and any failed trades — against the corresponding passive yield for the same period. Every closed trade goes into a paper book. Every open position is marked to market once per scan cycle. The bot's monthly equivalent is then computed as net return divided by days elapsed, multiplied by thirty.

If that monthly equivalent clears Tier 1 for thirty consecutive days, the bot graduates to a small allocation increase. Any drawdown over eight percent in a single book triggers an automatic kill switch. Both books — the standard signal and the inverted-mirror experiment — are measured against the same passive bars.

What this framing is not

This is not a "bot versus the market" comparison. The market index is a single number, often chosen because it is convenient. The honest framing is "bot versus the easiest thing an investor with the same capital could have done instead." That alternative is rarely the index. For most retail crypto operators, the alternative is parking USDT on a one-click product. For a USD investor, it is a T-Bill. For a VND investor, it is a twelve-month deposit.

If the bot does not clear the easiest alternative for that capital pool, it has no business managing capital at scale. More posts on the methodology, the open paper-trading numbers, and what the bot has to clear next live in the SleepyQuant blog archive.

Disclaimer: This post is educational. Nothing here is investment advice. Yield numbers are approximate April 2026 snapshots and may shift. Always verify current rates from primary sources before acting.

Newsletter: If this kind of honest benchmark math is useful, the weekly newsletter covers the current numbers, the paper trading results, and what the bot had to beat that week. Sign up at sleepyquant.rest.

I Trained a Crypto Quantile Predictor on 47M Klines. The Transformer Lost to LightGBM.

SleepyQuant — Mon, 27 Apr 2026 16:13:27 +0000

I Trained a Crypto Quantile Predictor on 47M Klines. The Transformer Lost to LightGBM.

This is what 47.68 million klines, 27 LightGBM models, and one failed Transformer spike taught me about building a crypto quantile predictor that holds up under out-of-sample stress — and what the OOS calibration numbers actually showed when I stopped narrating and started measuring.

The honest answer to "is your trading edge real" is "wait two to four years for enough live round-trips and find out." That's what statistical significance actually requires for the Sharpe levels retail traders chase. Two years of paper trading. Four if your sample-per-day is thin.

I've watched enough people decide three months of decent paper is enough, flip live trading, and blow up by month nine to know the math isn't the hard part. The hard part is the patience. I didn't have it either, so I ran a 3-week compressed sprint to compress that wait into offline out-of-sample validation on historical klines. 47.68 million of them. This is what the data showed, what broke, and why the model I shipped is the boring one.

The data setup

Binance Vision archives are free and well-structured. I pulled 1-minute klines for 30 perpetual pairs across 2023-2026 — 1,093 monthly parquet files, 2.4 GB on disk, 47.68M rows. Integrity checks all passed (no gaps wider than 5 minutes, no negative volumes, no zero-spread rows).

From that I generated 9.53M training rows by sampling every 5 minutes per pair. Each row had 10 numeric features (RSI, EMA gaps, ATR, realized volatility, return windows) and 3 categorical features. Each row also had 3 forward-return labels: returns at 5 minutes, 10 minutes, and 30 minutes ahead.

Three horizons because crypto's signal-to-noise ratio is awful at 5 minutes and decent at 30. I wanted the calibration data to tell me which horizon was actually predictable, not assume one.

Why LightGBM became the crypto quantile predictor I shipped

I went with LightGBM quantile regression as the first attempt. Three reasons.

First: quantile regression gives you P10, P50, P90 instead of a single point estimate. For a trading gate you want "what's the floor of my downside under this signal" more than "what's the expected return." Point estimates lie. Tail quantiles don't lie as much.

Second: walk-forward CV is cheap with gradient boosting. I split the historical kline window (2023-Q4 through 2026-Q1, all already-closed bars at the time of writing in late April 2026) into three folds, training on past, testing on next-period OOS. 27 models total: 3 folds × 3 horizons × 3 quantiles. Trained in 45 minutes on a single M1 Max core.

Third: it's interpretable enough to debug. When fold 2 (2025-Q3 drift regime) showed negative TP/SL uplift on the 5-minute horizon, I could see in the feature importance plot that the model had over-weighted ATR — fixed by adjusting the lookback window.

OOS results across folds: P10 hit rate stayed within 0.6% of the 10% target. P90 hit rate stayed within 0.6% of the 90% target. Directional accuracy ranged 51.17-52.59% across the three folds, with the most recent regime (fold 3, 2026-Q1) hitting the high end. Tail-quantile improvement over a fixed-baseline TP/SL: roughly 10% consistent across all three folds. Modest but real.

The Transformer spike that didn't beat the bar

Then I burned 3.5 hours on a Transformer spike. I wanted to see if attention could pick up cross-pair structure that gradient boosting was missing.

First attempt: PyTorch with MPS backend, attention layer hit NaN at epoch 4. Known PyTorch MPS attention instability on Apple Silicon — the softmax saturates when you have tiny-std return features that get z-scored to ±100σ before clipping.

Second attempt: tighter feature normalization (±5σ post-clip), still NaN at epoch 6. Different layer this time.

Third attempt: dropped MPS, ran on CPU. No NaN. Slow — about 10 minutes per epoch. Trained for 8 epochs, beat LightGBM on the OOS tail-quantile metric by exactly 0.40%.

The pre-set go/no-go bar was 5%. So the Transformer cleared 8% of the bar.

I cut it. Saved the spike report (spike_report.json with the FAIL_BAR verdict), saved the failed training scripts for reference, and skipped the planned full sweep that would have burned 6 more hours of CPU time and a week of deploy work for marginal gain.

This is the result I would have wanted to find. Not the dramatic win, the boring confirmation. LightGBM with the current feature set already captures most of the extractable signal. More model class isn't the bottleneck. More features (or more horizons, or different labels) might be.

What 30 minutes told me that 5 minutes didn't

Per-horizon uplift was the most useful thing the OOS analysis surfaced. On the 30-minute horizon, the model improved fixed-baseline TP/SL by +33%, +125%, +172% across the three quantiles tested. On the 5-minute horizon: -87%, +168%, +87%. The 30-minute numbers are uniformly positive. The 5-minute numbers swing wildly — which is honest about how noisy a 5-minute crypto forecast actually is.

I switched the default TP/SL planning horizon from 5 minutes (which the original v1 predictor used) to 30 minutes. Inference still runs on 5-minute scan cadence. The horizon change was config, not architecture.

Pivot to the winning book

The deployment plan originally targeted Book 14 — a new book I'd just enabled with no track record. I caught myself mid-sprint and pivoted to Book 13 instead, which had 41 round-trips and 51.22% win rate at the time. The reasoning: deploying a quantile-regression overlay onto a book with no history is guess plus guess. Onto a book with proof, it's guess plus proof.

The pivot also surfaced an unrelated bug. Three symbols in the B13 invert-long list — DOGE, AVAX, ATOM — were responsible for the bulk of the negative PnL across 28 round-trips, while six symbols in the invert-short list carried the positive contribution. I pruned the three losers from the invert list. Free win, no model change, just cleaning out the asymmetric bleeders. (If you want the post-mortem on a similar book-routing bug that cost me a week of bad numbers, it's in How a Missing book_id Kwarg Quietly Tanked My Inverted-Alpha Paper Trade.)

The honest scope

This is paper-only. None of it has touched live trading yet. The plan is: log the v2 predictor in shadow mode alongside v1 for 3-5 days, verify alignment >85% across regimes, then wire v2's TP/SL into B13 paper trades. After 30-50 paper round-trips with v2 active, if the net is positive and the win rate holds above 55%, $5 of USDT goes onto the winning symbol subset. Not before.

The OOS calibration was clean. That doesn't mean the model is right. It means the model is consistent under the data slices I tested, which is necessary but not sufficient. Live execution introduces fees, slippage, and regime shifts the historical sample didn't see. I'll know in 30-50 more paper round-trips whether the offline numbers transferred or not.

What I'd tell my past self

Three things from the sprint that surprised me.

The first: the Transformer FAIL_BAR was a faster, more honest signal than I expected. Spending 3 hours to confirm "the simple model already won" beats spending a week to confirm "the complex model didn't beat the simple one by enough." Spike, set a clear bar before you start, accept the verdict.

The second: tail-quantile calibration mattered more than directional accuracy. 52% directional sounds barely-above-random and it is. P10 calibration within 0.6% of target across three regimes is genuinely useful — it lets the trading gate ask "what's my realistic downside under this signal" with a number it can trust.

The third: deploying onto the winning book is not the same as deploying onto the convenient book. I almost shipped onto B14 because it was newer and cleaner. B13 had 41 round-trips of proof. The pivot took 30 minutes of decision and saved an unknown amount of "wait what is this signal even doing" debugging later.

Subscribe + follow along

I'm running this whole stack on a single M1 Max — paper trading 30 pairs, MLX inference for the language pieces (the Apple Silicon write-up covers why that hardware choice mattered), LightGBM for the trading pieces. The earlier inverted-control bot post-mortem is the closest sibling to this one in spirit — both are the paper-trail of an idea that survived contact with reality. All numbers in this post (and the sprint reports they came from) are real. The wins and the FAIL_BARs both.

If you want the next post — probably the 30-50 paper round-trip post-mortem on whether the OOS numbers held in live execution — subscribe at sleepyquant.rest.

Come along for the ride — see me fall or thrive, whichever comes first.

I Run a 40GB AI Model on a MacBook. Three Months of MLX on M1 Max Has Changed How I Think About Apple Silicon.

SleepyQuant — Thu, 23 Apr 2026 07:53:02 +0000

I Run a 40GB AI Model on a MacBook. Three Months of MLX on M1 Max Has Changed How I Think About Apple Silicon.

It's Just a Laptop. But It's Running a 40GB Model Right Now.

I'm drafting this on a MacBook Pro. Qwen 3.6 35B-A3B MoE Q8 — about 40GB of weights — is pinned in Metal memory right now, and the fan is quiet.

That sentence still feels weird to write. A year ago I would have assumed "run a 35B model locally" meant a dedicated rig with an H100, or at least a pair of 4090s. Turns out it means a MacBook Pro M1 Max with the 64GB unified memory variant, MLX, and about a weekend of config tuning.

This post is a three-month dev diary on that setup. Not a product review. Not a "10x your AI productivity" take. Just what I've learned that isn't in the Apple keynote or the MLX README.

And since Tim Cook has been CEO for 14+ years with no named successor, I ended up thinking about what changes if the person running Apple changes — and what doesn't. Short version: a lot less than most market takes assume. The laptop on my desk is why.

The Setup: 64GB of Unified Memory, One Model, Zero Cloud

Hardware is an M1 Max MacBook Pro with the full 64GB unified memory. Yes, it's a $3k-class setup. That's the first honest thing to say.

The model is Qwen 3.6 35B-A3B MoE, Q8 quantization. Weights are ~40GB in Metal memory via mx.metal.set_wired_limit(45GB). That pin is load-bearing — without it the macOS memory compressor will happily try to page out the model while you're mid-inference.

Hard ceiling at set_memory_limit(48GB). Scratch buffers capped at set_cache_limit(512MB). Buffer left for OS + apps: ~14-16GB, tight but stable. Everything runs offline. No cloud fallback. No API key. Just the laptop.

For that ~14-16GB buffer to actually hold: no Docker, no 30-tab Chrome session. I used to keep Chrome open with dozens of tabs; the memory pressure during long inference was noticeable enough that I stopped. My background load during heavy generation is Xcode (SwiftUI work) + terminal + editor. That's it.

The Q8 Tax: Trading Speed for Sanity

I moved from Q4 to Q8 on April 17. The motivation was pure quality. Q4 output was noticeably more muddled on longer reasoning tasks, especially anything requiring numerical precision or sustained argument.

Q8 runs in the 35-50 tok/s range depending on context length. Q4 was faster — probably 10-15% more tok/s — but the output just wasn't as good. When you're generating content you'll actually publish, that tradeoff isn't close.

The honest take: if your use case is chat-style short responses, Q4 might be fine. For long-form drafting, research synthesis, or anything that has to be correct-ish without a human checking every sentence, Q8 earns its extra memory.

The fp16 Moment: 21.18 to 26.22 tok/s From One Env Var

Running MLX on M1 Max defaults to bf16 for many kernels. For Qwen 3.6 MoE specifically, that was costing real throughput.

Setting MLX_FORCE_FP16=1 in the LaunchAgent environment bumped tok/s from 21.18 to 26.22. That's +24% from one flag. No recompile. No re-quantization. No weight re-download.

I don't know the full story of why bf16 is the default if fp16 wins here — the MLX team almost certainly has a good reason at the kernel level. But empirically, on this hardware with this model, the flag is free speed.

Persisted it in the LaunchAgent plist, restarted, never looked back.

What Metal Memory Actually Wants: 45GB Wired, 48GB Ceiling, 512MB Scratch

Out of the box, Apple's memory compressor is aggressive. It will look at your 40GB model sitting in RAM, decide some of it is "idle," and start compressing pages. Every decompression on a subsequent inference is thrash.

The fix for MLX on M1 Max is a three-line config (pseudo-code — real calls take bytes, I'm using GB suffixes for readability):

set_wired_limit(45GB) — weights stay pinned, compressor can't touch them
set_memory_limit(48GB) — hard ceiling, prevents runaway scratch buffers
set_cache_limit(512MB) — caps Metal compile cache

Before this, compressed swap on my machine was 19.69GB. After, it sits at 1.7GB. That's a 10x improvement on memory pressure from three lines of config. The buffer for macOS + Chrome + everything else stays at ~14-16GB, which survives a full day of normal laptop use. (I wrote up the full debugging path for the memory compression issue here — it took me longer than I'd like to admit to figure out.)

The MoE Saturation Wall at 500 Tokens (The Thing Nobody Warns You About)

Qwen 3.6 is a Mixture-of-Experts model. On paper, sparse activation means you're only touching a fraction of weights per token, which is why it fits in 40GB at all.

What the papers don't emphasize: MoE models have a soft quality ceiling on single generation length. For Qwen 3.6 specifically, output degrades past roughly 500 tokens. Past 800 you start getting word salad. Past 1500 you get paragraphs that apologize to themselves mid-sentence.

The workaround is sectional generation. Split long outputs into 250-400 token sections, generate each independently, concatenate. State resets between calls. The model stays coherent the whole way through.

I automated it: a FastAPI endpoint that takes a research brief plus an ordered list of sections (heading + 1-sentence instruction + target word count) and fires one MLX call per section with max_tokens hard-capped under the degen zone. No shared context across calls. Outputs concatenate into a full draft. Maybe 40 lines of Python. If there's interest I'll clean it up and drop it as part of a small OSS package alongside the memory-safe runtime config.

This isn't an MLX issue. It's how MoE attention routing behaves under sustained sampling. Took me a while to isolate the variable.

The 4 AM Ghost: Managing Metal's Memory Drift

Even with wired_limit pinning, Metal accumulates scratch buffers over time. Long inference sessions leave compile cache and intermediate allocations that don't always free cleanly. After a couple of days of uptime, tok/s drifts down 5-10%.

The fix is a scheduled restart. I have a LaunchAgent KeepAlive set up to kill and relaunch the backend every day at 4 AM local time. Takes about 60 seconds end-to-end — roughly 40 of those are MLX warmup.

It's not elegant. A properly designed memory system wouldn't need this. But it works, it's invisible because it runs while I sleep, and the next morning tok/s is back at baseline. I'll take a cron job over a memory leak any day.

What I Actually Lose vs Cloud (And What I Don't)

Honest comparison. What you lose going local:

Peak throughput: 26 tok/s here vs ~60-100 tok/s on cloud APIs
Context window: 32k practical on this setup vs 200k+ cloud
Scale: one user at a time vs unlimited parallel

What you don't lose:

Quality: Q8 is close enough to cloud that most tasks don't notice
Latency: sub-1s first token local vs 500-1500ms network round-trip
Cost: $0 marginal per call vs $3-15 per million tokens
Privacy: weights and prompts never leave the laptop
Availability: works offline, works when the cloud provider has an outage

For a solo dev with one user (me), the tradeoff leans local hard. Mileage varies if you're serving an API.

The Thing Nobody Prices About Apple Silicon: Unified Memory

Here's the structural point most Apple Silicon takes miss.

On x86 + Nvidia, VRAM is separate from system RAM. A $3k gaming laptop ships with at most 16GB of VRAM — physically cannot hold Qwen 35B Q8, period. To match the 40GB I'm using here, you'd need two RTX 3090s (24GB each, NVLink bridge to share weights): ~$1,400-1,800 used for the cards alone, plus PSU, case, cooling, CPU. Easily another $1,500 before you have a running machine. And even then each forward pass is sharding across PCIe — not unified memory. Two 4090s don't even solve it cleanly because Nvidia dropped NVLink on the 4090 line.

Meanwhile this thing fits in a backpack and runs at a quiet coffee shop.

On Apple Silicon, the 40GB of model weights live in the same physical RAM the OS and Chrome use. No PCIe bottleneck between CPU and GPU compute — they literally share memory. That's not a Metal-is-faster-than-CUDA claim (per-op, it usually isn't). It's an architecture claim.

Which is why this MacBook runs models that most gaming desktops physically cannot. The chip speed is a subplot. The memory layout is the actual moat. (I made a longer version of this argument here, back when I was still surprised it was working at all.)

Three Months In, I'm Long the Ecosystem

Three months of MLX on M1 Max later, here's what I actually believe: I'm long the ecosystem, not the CEO.

Whoever succeeds Tim Cook next can reshape pricing, Services tiers, or the iPhone upgrade cadence. They can't reverse unified memory architecture in a quarter. They can't make pip install mlx-lm harder than pip install mlx-lm. They can't retroactively ship a gaming laptop with 40GB of usable VRAM for $3k.

The developer experience moat — pip install mlx-lm and you're done, with CUDA nowhere in sight — compounds quietly every time a solo dev gets a 35B model to run on their first try. That's the flywheel the market underprices.

I could be wrong on the broader empire thesis. But the laptop on my desk still runs the model. That floor doesn't move.

Come along for the ride — see me fall or thrive, whichever comes first.

FPT Corporation and the AI Consulting Margin Compression: Why Vietnam's Biggest Tech Firm Lost a Third of Its Market Cap

SleepyQuant — Wed, 22 Apr 2026 06:01:51 +0000

FPT Corporation and the AI Consulting Margin Compression: Why Vietnam's Biggest Tech Firm Lost a Third of Its Market Cap

An IT Giant Most Western Investors Have Never Heard Of

FPT Corporation, Vietnam's largest IT services firm, is down ~33.8% from its 52-week high. This drawdown mirrors a broader sector-wide slump: TCS fell 21.4%, Wipro dropped 23.1%, and Infosys declined roughly 16% over the same window. The market appears to be repricing the entire labor-arbitrage consulting model at once, not punishing FPT in isolation.

Here's what makes it interesting: in 9M2025, FPT still grew revenue +10.3% YoY (VND 49,887 billion ≈ $1.96B USD) and pre-tax profit +17.6% YoY (VND 9,540 billion ≈ $374M USD). The fundamentals didn't crash. The expectations did.

I went down this rabbit hole after watching Mèo Giải Thích's Vietnamese-language deep dive on FPT (388k+ views). What follows is a case study in AI consulting margin compression — one of the cleanest sector-wide pricing events I've seen in IT services in the past year. Below: what FPT actually does, the AI catalyst that hit the entire sector at once, and the counter-case the market isn't pricing in.

What FPT Actually Does

From banana flour machines to Vietnam's largest IT firm

The founding story is almost too literal to be real: in 1988, the acronym FPT stood for "Food Processing Technology." Early FPT was drying cigarettes and installing air conditioners. Then came the pivot in 1990 — a $1 million computer contract with the Soviet Academy of Sciences changed everything. Within roughly a decade, FPT had become Vietnam's dominant IT firm. Understanding their current engine, though, requires looking at three distinct pillars rather than the single "IT" label.

Three pillars: Technology, Telecom, Education

According to the official 9M2025 earnings report (~$1.96B USD in nine-month revenue), Technology remains the undisputed core: about 62% of group revenue and 45% of group pre-tax profit. Telecom follows as a steady cash cow, contributing 29% of revenue (≈$539M USD) with surprising margin expansion — pre-tax profit grew +21% despite limited market-size headroom. Education rounds out the trio at just 9%; historically high-margin, but recent stagnation hints at real competitive pressure (more on that next).

Why Japan is FPT's biggest customer

What fascinates me most about FPT's Tech segment is where the money actually lives: overseas markets capture roughly 80–90% of that division's inbound revenue. Japan sits firmly as #1, followed by the US and APAC. Why? Because demographic collapse there has created an IT labor shortage so severe that Japanese planners are now recruiting half a million Indian tech workers to fill the gap. FPT's labor-cost advantage is the bridge Vietnamese firms have been crossing for years. In 2024, FPT also opened two AI factories — one in Vietnam, one in Japan — but they're still too small to materially move group numbers.

The Margin Story Hidden in Education

Education accounts for just 9% of FPT's total revenue, yet it has historically been a cash cow with pre-tax margins hovering between 40-50%, according to Mèo Giải Thích. This profitability stems from a vertically integrated talent pipeline: FPT operates schools ranging from K-12 through university, and many graduates join the company directly. By internalizing recruitment, FPT drastically reduces external hiring friction and retraining costs while ensuring new hires are already aligned with its specific technical culture and operational standards. It is an elegant self-sustaining loop where education fuels technology growth without depending on volatile external labor markets.

However, the official 9M2025 earnings data reveals a sharp divergence from that high-margin narrative. Education revenue grew only +1.0% YoY to VND 5,195 billion (≈$204M USD). This stagnation suggests headwinds are biting harder than anticipated. Vietnam's K-12 fee waiver in public schools has eroded the addressable market for private tuition, with families increasingly opting for free state alternatives over premium rates at FPT institutions.

Why The Market Lost Faith — The P/E Compression

From 30x to 15x in eighteen months

I first understood valuation through a coffee-shop analogy from Mèo Giải Thích's video: if a shop earns $1 million a year but sells for $20 million, the Price-to-Earnings ratio is 20. Buyers are paying twenty years of current profits upfront for the future growth they expect.

FPT's stock chart tells the same story in real time. P/E peaked around 30x when optimism was highest, normalized to roughly 19x over recent quarters, and now sits near 15x. The compression signals that investors have drastically lowered their growth assumptions.

How FPT compares to Indian IT consulting peers

Here's the part that should make any FPT bull pause: the Indian IT consulting comparables aren't trading much higher. As of April 2026, TCS sits around ~19x trailing P/E, Infosys around ~18x, Wipro around ~16x. FPT at ~15x is trading at a discount to all three. Sector-wide compression explains most of the move, but FPT carries an additional discount on top — the market is pricing in either smaller scale, less diversified revenue base, or company-specific execution risk that its global peers don't have.

What the official 9M2025 numbers actually show

The official 9M2025 data backs the deceleration. Tech segment revenue grew only +10.7% YoY against the segment's 24% historical CAGR (2018-2024). Total group revenue reached VND 49,887 billion (≈$1.96B USD) — still positive, but well off the trajectory the old multiple priced in. The gap between former hype and current reality is why the multiple collapsed from 30x toward 15x. But P/E compression doesn't happen in a vacuum — there was a specific catalyst.

Why The Market Lost Faith — The AI Catalyst Behind the Margin Compression

February 23: The Anthropic shot heard around IT services

The catalyst for the sector's re-rating arrived on February 23, 2026, when Anthropic published "How AI helps break the cost barrier to COBOL modernization". They claimed Claude Code could map dependencies across thousands of lines of legacy code, document workflows, and identify risks that "would take human analysts months to surface." This was not an abstract tech update — it was a direct shot at the consulting layer where firms charge premium hourly rates for human-led modernization work, exactly FPT's core moat in digital transformation and system integration.

IBM down 13.2% in a single day, FPT followed

The market reacted immediately: IBM stock fell 13.2% that same day. The pricing signal suggested investors were rapidly discounting future labor-arbitrage margins across global IT services providers. FPT's own decline accelerated after this date — the timing is suggestive rather than coincidental within the broader -16% to -23% sector drawdown seen across TCS, Infosys, and Wipro. An insider sale by board member Bùi Quang Ngọc near the peak (timing flagged by the Mèo Giải Thích video) drew attention, but the source video itself cautioned against over-reading: founders Trương Gia Bình and Đỗ Cao Bảo did not sell, and a single insider transaction without volume context is closer to noise than signal. (For a tangentially-related build-in-public take on how a single missing argument in production code can compound into outsized loss, see my recent post on a one-line trading bug.)

The Counter-Case Worth Hearing

AI doesn't only delete consulting — it creates new categories

The bear case assumes AI simply deletes consulting hours, but it also creates entirely new categories. Companies still need someone to deploy these models into actual operations, integrate complex stacks with legacy systems, and train staff on the new workflows. FPT has explicitly pivoted in response, declaring an "AI-first" strategic direction. They are building infrastructure rather than relying on manual code migration alone. Their flagship vehicle is the FPT AI Factory, positioned as a "one-stop shop" for AI and Cloud services. At CES 2026, FPT showcased AI-first innovations across industries from automotive to semiconductor design.

What FPT's own numbers say about the pivot

Per FPT's own reporting, their AI and Data Analytics service lines grew +41% YoY — tangible demand, though importantly off a small base; AI services are still single-digit percent of group revenue, not yet large enough to offset the deceleration in legacy Tech consulting. Chairman Trương Gia Bình has publicly emphasized future bets on Quantum Computing, Cybersecurity, UAVs, and Railway Tech, all underpinned by core AI capabilities. Two AI factories are operational — one in Vietnam, one opened in Japan in 2024 — aimed directly at the labor shortages and digital transformation demand that have driven FPT's overseas growth for years. At a ~15x P/E, the market is pricing low odds that this pivot scales fast enough. That is where the optionality sits.

What I'm Watching, From Outside

I am trying to understand FPT as a business, not as a ticker. The recent drawdown is stark, but the real story lies in three leading indicators that reveal whether the company can pivot from legacy arbitrage to AI-driven value:

New Contract Value (NCV) — the leading indicator for future Technology segment revenue. If NCV stagnates while signed revenue keeps growing through backlog consumption, that's demand friction showing up before it hits the top line.
Tech segment pre-tax margin trend — the canary for AI pricing pressure. As AI tools compress billable hours per project (the same dynamic behind IBM's 13% drop), it shows up here long before it shows up in total sales volume.
AI Factory contribution to group revenue — the strategic execution check. If the two factories (Vietnam + Japan) can move from single-digit % to materially mid-single-digit % over the next 4-8 quarters, the bull pivot is landing.

None of this is a recommendation. I'm watching because the case is interesting, not because I have an edge. The cost economics are also why I keep coming back to the Apple Silicon angle on local AI — the same dynamic that compresses FPT's consulting margins is what makes running a 35B model on a laptop suddenly viable. Credit again to Mèo Giải Thích for doing the heavy synthesis on the Vietnamese-language side; this post is my attempt to put that story into English with the official 9M2025 earnings numbers cross-checked.

The Lesson Beyond FPT

Pulling back from the company-specific drama reveals a sector-wide structural shift. Tata Consultancy Services is down 21.4%, Infosys 16.5%, Wipro 23.1%, FPT 22.2% over the same window — peak-to-trough on FPT is closer to 33.8%. Four major labor-arbitrage IT consulting firms across two continents, all repricing in the same direction at the same time.

This is not a Vietnam story or even an FPT story. It is the entire "humans do the consulting work" business model getting re-rated by AI. The survivors will pivot fast from "we sell hours" to "we sell the AI that does the hours". The casualties will stay too long in the now-commoditizing layer.

The IBM chart on Feb 23 and the FPT chart in the weeks after are saying exactly the same thing.

Sources

This analysis was anchored on a Vietnamese-language YouTube video by Mèo Giải Thích (Explaining Cat), a Vietnamese economics explainer channel — the primary narrative spine. The hard numbers and corroborating data points came from the following public sources:

Mèo Giải Thích — "Tôi phân tích FPT để bạn không phải làm" — YouTube video, 18 min, 388k+ views. The Vietnamese-language analysis that triggered this deep dive.
FPT Corporation — 9M2025 Earnings Report — PDF (October 2025) and investor news release. Source for all official segment revenue, profit, and growth numbers.
FPT Software press releases — AI strategic direction, CES 2026 AI showcase, and Global IT Services $1.3B signed revenue announcement.
Anthropic — "How AI helps break the cost barrier to COBOL modernization" (February 23, 2026) — primary blog post and Code Modernization Playbook. Source for the AI catalyst dating and capability claims.
CNBC — "IBM is the latest AI casualty. Shares tank 13% on Anthropic programming language threat" (February 23, 2026). Source for the IBM market reaction.
IT Pro — Anthropic vs IBM debate on COBOL modernization — counter-view from IBM. Included for the skeptical counter-perspective.
Yahoo Finance — live equity data pulled 2026-04-22 for FPT.VN, TCS.NS (Tata Consultancy Services), INFY (Infosys), and WIT (Wipro). Source for sector-wide drawdown comparison.

This is independent analysis grounded on publicly available sources. Not financial advice. Numbers stated are as of the source date noted; equity prices move continuously and any specific level cited may be stale by the time you read this. The author holds no position in FPT Corporation, Tata Consultancy Services, Infosys, Wipro, or IBM at the time of writing. Mèo Giải Thích is credited as the anchor source and was not consulted for this article.

If this was useful, I write weekly at sleepyquant.rest. One email a week, real numbers, no signals. Subscribe — come along to see me fall or thrive, whichever comes first.

How a Missing book_id Kwarg Quietly Tanked My Inverted-Alpha Paper Trade

SleepyQuant — Tue, 21 Apr 2026 09:10:43 +0000

Executive summary

I ran an inverted-alpha paper-trading experiment to test whether inverting my live signals would produce net-positive P&L over 100 round-trips. The inverted-alpha book (Book 2) hit a 63% win rate — good enough to celebrate — but the per-trade average loss was six times larger than the per-trade win size. The shape of the P&L didn't match any thesis I had. After a few days of staring at the numbers, I traced the problem to a single missing keyword argument in the close-order routing path. One line of fix, and the per-round-trip cost on the inverted book dropped from about $0.29 to under $0.02 — roughly a 21x reduction in per-trade bleed. This post is the story of finding the bug, why it hid for three days, and the structural test I should have written up front.

The signal that didn't match any thesis

A quick refresher on the inverted-alpha setup (I covered the original thesis in more detail in "The Inverted Control"). I run a multi-book paper-trading experiment on the same live signal source. Book 1 executes the signal as-is. Book 2 executes the inverted side of every signal — if Book 1 goes long, Book 2 goes short on the same symbol and size. The idea is simple: if my signal has negative edge on average, its inverse should have positive edge, less fees. Historical shadow analysis said the inversion would have produced roughly +$40 on 496 round-trips where Book 1 actually lost about $70. The live test was going to confirm or deny that in new market conditions.

Two days in, Book 2 looked weird. The win rate was sitting around 63% — higher than Book 1's 34%, which is what you'd expect if the inversion thesis held. But the net P&L on Book 2 was already deeply negative, with an average per-round-trip loss three times worse than Book 1. The shape didn't make sense: a book that wins 63% of the time shouldn't bleed faster than one that wins 34% of the time unless the losing trades are massively larger than the winning trades. And they were. The average win was small and the average loss was huge. The R-multiple on Book 2 was roughly inverted from what the mirror design implied.

What I initially suspected

My first hypothesis was that the inversion thesis was just wrong in the current regime. Maybe the market had shifted from trending to mean-reverting, and the signal that had been losing in trend mode was now correct in the new regime — which would make its inverse wrong. That's an honest failure mode, and if that's what was happening, I needed to kill the test early.

My second hypothesis was sample-size variance. Eighty round-trips is not a lot. A handful of asymmetric outliers can make the per-trade average look catastrophic before the law of large numbers smooths things out. I considered waiting for 200 round-trips before acting.

Neither hypothesis explained the specific R-multiple asymmetry. If the signal had flipped edge direction, the win rate should have dropped toward 50% or below, not landed at 63%. If it was pure variance, the wins and losses should have been roughly symmetric around the expected mean. What I was seeing — high win rate, small wins, large losses — is the mechanical signature of something clipping the wins and letting the losses run.

The trace that revealed it

I went into the logs. For each closed position on Book 2, I pulled the close-order record and checked which book the close actually hit. Every single one had routed to Book 1's ledger. Book 2's open positions existed. Book 2's trades showed up in the comparison snapshot. But Book 2's closes were landing on Book 1, which meant Book 2 positions were only closing when Book 1's mirror trade hit its own TP or SL — at Book 1's magnitudes, not Book 2's.

That's the asymmetry I was seeing. Book 2's take-profit threshold (set symmetrically with Book 1 for the experiment) never fired on its own positions. Book 2 closed when Book 1's signal exited — and since Book 2 is the inverse, Book 1's winning exits were Book 2's losing exits, at Book 1's take-profit magnitude. Meanwhile, Book 1's losing exits (at its smaller stop-loss magnitude) were Book 2's winning exits. Wins capped at small, losses running to large. The R-multiple wasn't mysteriously inverted; it was mechanically forced that way by a routing bug.

The root cause — one missing kwarg

Found it in the futures TP/SL monitor. The loop fetches all open positions across every book without a per-book filter (intentional — one loop watches the whole portfolio). For each position that trips its TP or SL threshold, it constructs a close-order and hands it to the execution engine:

# Before (the bug)
close_order = SimulatedOrder(
    lane="futures", symbol=pos.symbol, action=close_action,
    quantity=pos.quantity, price_vnd=live_price, leverage=pos.leverage,
    note=f"Auto-close: {close_reason}",
)
close_result = await engine.execute_futures(close_order)

The monitor passes no book_id. Downstream, execute_futures defaults book_id=1 when the argument isn't provided. The close-execution query then filters the position table by that default book_id, looking for a Book 1 position matching this symbol to close. For a Book 2 position that needs to close, the query finds nothing that matches — Book 1 has no such position. The execution path returns cleanly with zero matches. No exception. No warning. Just a silent no-op.

The monitor logs a cheerful "Auto-close" message. The database state is unchanged. The position keeps running until the Book 1 mirror signal decides to exit, at which point the close finally lands on the correct book via a completely different code path (the mirror-fire routing in the execution engine). That's why Book 2 positions did eventually close — through Book 1's exit, not their own.

The 1-line fix

# After
close_result = await engine.execute_futures(close_order, book_id=pos.book_id)

That's the whole patch. Route the close to the same book the position lives on. I added a comment block above the call referencing the session log where the bug was diagnosed, so the next person reading this code has some archaeology to work with if they're wondering why the kwarg is suddenly important.

Before and after

The fix went live with the backend restart. Book 2 had 88 round-trips on its books at that moment. I locked that as the pre-fix baseline and started counting post-fix round-trips separately.

Window	Round-trips	Avg cost per round-trip
Pre-fix (contaminated by bug)	88	about $0.29
Post-fix (clean)	87	about $0.01

The 21x reduction in per-trade cost isn't the inverted-alpha signal suddenly working. It's the mirror book's own take-profit and stop-loss thresholds finally firing, instead of being clipped by Book 1's exit timing. Wins land at the size they were designed to land at. Losses stop at the size they were designed to stop at. The R-multiple on Book 2 is now something close to symmetric, which is what the inverted-alpha experiment was supposed to measure in the first place.

The part I'm not claiming

Eighty-seven post-fix round-trips is still not a lot. The number could drift back toward zero or turn positive or stay mildly negative as the sample grows. What I'm claiming is narrow: the bug was contaminating the signal to the point where no verdict was meaningful, and fixing it moved the book roughly to break-even on post-fix trades — which at least lets the actual inverted-alpha thesis get tested on its own merits. Whether the thesis itself holds up over 100+ clean round-trips is still open.

I'm also not claiming that a bug this shape should be impossible for anyone smart to write. I wrote it. I shipped it. It ran for three days producing data that looked like a meaningful signal and wasn't. The uncomfortable part is how convincing the bad data was — a 63% win rate with a tidy asymmetric R-multiple is exactly the kind of shape that generates theories.

Takeaway: test cross-book routing, not just book behavior

Every unit test I had written pointed at Book 1 behavior in isolation. Does the close logic work? Does the TP trigger at the right threshold? Does the position close update the balance correctly? All of those passed. What I hadn't written was a test that opens a position on the inverted-alpha book (Book 2), triggers its TP, and asserts that the resulting close lands on Book 2's ledger and not Book 1's. A single-line assertion in the right place would have caught this bug before it shipped.

If you're running a multi-book or multi-account framework where the routing surface is implicit — where a missing keyword argument silently falls back to a default account — write the cross-routing assertion. It's the test that only exists once you have more than one book, and it's the test that stops being optional the moment silent no-ops can masquerade as winning trades.

DEV Community: SleepyQuant

I Blamed the Model for Months. The Bug Was My Sampler.

I Blamed the Model for Months. The Bug Was My Sampler.

40GB In, Word Salad Out

The Diagnostic: Four Configs, One Metric

The Root Cause: A Logits Processor That Fought Itself

The Fix

What 22GB Back Feels Like on Apple Silicon

How I measured

Don’t Size for Prestige

How I Budget 64 GB Unified Memory on M1 Max for a 35B Model + Long-Running Agent Loops

How I Budget 64 GB Unified Memory on M1 Max for a 35B Model + Long-Running Agent Loops

The actual budget on my Mac

Why 35B Q8 specifically fits

How to measure your own baseline

What changes if you have less or more RAM

What goes wrong if you over-budget

What this isn't

The smaller lesson

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

Setup

Raw throughput

Where the speed difference doesn't show up

Where MLX wins (that's why I stayed)

Where llama.cpp wins (and might still be the right pick)

What I'd recommend by use case

Will I switch later?

MoE Degeneration on Long Context — Why My 35B Model Started Repeating Itself

MoE Degeneration on Long Context — Why My 35B Model Started Repeating Itself

What the output looked like

What I think is happening

The fix that worked: sectional generation

What didn't work

How to detect it in your own runs

What this isn't

The smaller lesson

Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing

Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing

The symptom

What's actually happening

The fix

Where the flag has to go

When you actually want thinking on

Why the docs don't surface this

Verification

What this isn't

The smaller lesson

MLX Memory Safety Checklist: 6-Layer Defense for M1/M2 Apple Silicon

MLX Memory Safety Checklist

6-Layer Defense for M1/M2 Apple Silicon

The problem

Layer 1 — Metal wired_limit cap

Layer 2 — Metal cache_limit cap

Layer 3 — memory_limit (soft ceiling)

Layer 4 — Explicit clear_cache() after long inference

Layer 5 — 5-minute memory pressure watchdog

Layer 6 — Nightly restart via LaunchAgent

Verification commands

What happens when each layer fails

What this is and isn't

Want more posts like this?

Yen Intervention Crypto: Why This Isn't August 2024 Round Two

Yen Intervention Crypto: Why This Isn't August 2024 Round Two

TL;DR

The Setup

Why Crypto Traders Open a Yen Headline at All

Why I Think It's the Wrong Playbook

What I'm Actually Watching

Where I Could Be Wrong

What I'm Doing

Sources

The 0.42% Bar: A Passive Yield Benchmark for Every Crypto Trading Bot (April 2026)

The 0.42% Bar: A Passive Yield Benchmark for Every Crypto Trading Bot (April 2026)

Why passive benchmarks matter more than "the market"

Tier 1 — Crypto earn products (lowest friction)

Tier 2 — Fiat instruments (moderate friction, no crypto risk)

Tier 3 — Equity index averages (long-horizon only)

So what does a trading bot actually have to beat?

The gate we use on our own bot

Layer 1 — Metal `wired_limit` cap

Layer 2 — Metal `cache_limit` cap

Layer 3 — `memory_limit` (soft ceiling)

Layer 4 — Explicit `clear_cache()` after long inference