This is the second entry in a curious builder's diary. In the first one, a self-taught web developer borrowed a satellite engineer's intuition, teamed up with an AI agent that writes the C++ he cannot, and asked a ridiculous question: can you run frontier AI on the MacBook on your desk? That article ended with a 470 GB download in progress. This one ends differently.
1. The Moment
You: Say hello.
S-MoE [RAM: 48.0GB | Ring: auto]: Hello, world!
That's it. That's the whole story, really.
Qwen3-235B-A22B-Instruct-2507 — a genuinely frontier Mixture of Experts model, 235 billion parameters, the kind of thing that is supposed to live behind an API key and a $250,000 GPU cluster — answered a question on a consumer MacBook. Not a distilled version. Not a 7B "assistant" wearing a big model's name. The actual 235B model, streaming its experts off the SSD, token by token, through a C++ engine and Metal kernels that a web developer and an AI agent built together.
It is slow. It is imperfect. It is also, as far as I can tell, something nobody else has quite done this way. And I think the most honest thing I can do is tell you exactly what it took — because the last bug standing between us and "Hello, world!" was not in the GPU kernels at all. It was the kind of bug a web developer knows in his bones.
2. The Numbers, Plainly
| Thing | Reality |
|---|---|
| Model | Qwen3-235B-A22B-Instruct-2507 (Apache 2.0) |
| Original checkpoint | ~470 GB bfloat16 |
| The Vault (routed experts, SMOE-Q4) | ~117 GB, cold on NVMe, 0 bytes of RAM |
| The Surface Scout (dense backbone) | ~16 GB, resident in Unified Memory |
| Machine | MacBook, 48 GB Unified Memory |
| Ring buffer | Auto-sized: 25% of free RAM, ~10 MB per expert slot |
| Observed NVMe throughput | ~3 GB/s of expert streaming, prediction miss rate 0.0% |
| Speed | ~0.1–0.2 tokens/second. A word every few seconds. A paragraph while you make coffee. |
Yes, you read the last line right. This is not a chatbot you'd use to write emails. It is an existence proof: the memory wall has a door in it, and the door is the SSD.
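The figures in the table can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope model, not code from the engine: the 128 experts per layer and 8 active experts per token are assumptions taken from the published Qwen3-235B-A22B configuration, and every name in it (`Budget`, `ring_slots`, and so on) is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Figures marked "source" come from the table above; the rest are assumptions.
struct Budget {
    double vault_bytes;        // source: ~117 GB of routed experts (SMOE-Q4)
    int layers;                // source: 94 layers
    int experts_per_layer;     // assumption: 128, from the published model config
    int active_per_token;      // assumption: 8 experts routed per token per layer
    double nvme_bytes_per_sec; // source: ~3 GB/s observed
};

// Average on-disk size of one routed expert.
inline double expert_bytes(const Budget& b) {
    return b.vault_bytes / (static_cast<double>(b.layers) * b.experts_per_layer);
}

// Bytes that must arrive from the SSD for one token if nothing is cached.
inline double bytes_per_token(const Budget& b) {
    return expert_bytes(b) * b.layers * b.active_per_token;
}

// Upper bound on tokens/second imposed by SSD bandwidth alone.
inline double io_bound_tokens_per_sec(const Budget& b) {
    return b.nvme_bytes_per_sec / bytes_per_token(b);
}

// Number of expert slots in a ring sized as a fraction of free RAM.
inline std::uint64_t ring_slots(double free_ram_bytes, double fraction, double slot_bytes) {
    assert(slot_bytes > 0.0);
    return static_cast<std::uint64_t>(free_ram_bytes * fraction / slot_bytes);
}
```

Under those assumptions, `Budget{117e9, 94, 128, 8, 3e9}` gives roughly 9.7 MB per expert, which lines up with the ~10 MB slot size, about 7.3 GB of expert reads per token, and an I/O ceiling near 0.4 tokens/second. If that model is right, the gap down to the observed 0.1–0.2 comes from somewhere other than raw SSD bandwidth.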
3. The Bug That Taught Me the Whole Point
Here is my favourite part, because it is where the "web developer mind in low-level code" experiment paid for itself.
After we shattered the model to 4-bit (our first 2-bit attempt mathematically crushed the fine-grained experts — the model produced only spaces and punctuation, like a ghost tapping on a table), the engine ran. The I/O streamed. The Metal kernels hummed. The hidden-state norms were healthy, finite, boringly correct across all 94 layers.
And the model answered every question with:
::
Two colons. That's all. A 235-billion-parameter mountain of intelligence, and it gave us punctuation.
Every instinct from the systems side pointed down — into the dequantisation math, the RoPE rotations, the ring buffer synchronisation. That is where serious low-level bugs are supposed to live. The AI agent and I checked all of it. Greedy decoding on a short raw prompt produced flawless English: "Hello! How can I...". The engine was innocent.
The culprit was upstream, in the smallest, most web-shaped detail imaginable: the chat template. Qwen3 ships in two lineages: the original "thinking" model, and the non-thinking Instruct-2507 we had shattered. The thinking lineage's template injects the <think> and </think> tokens into every prompt. And in the Instruct-2507 weights, the embeddings of those two tokens are untrained: near-zero rows, ~150× smaller than every normal token, sitting in the vocabulary like unexploded blanks. We were whispering two meaningless syllables into the model's ear at the start of every conversation, and the residual stream collapsed on contact.
One line changed — load the tokenizer from the same checkpoint you shattered — and the mountain said hello.
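A cheap sanity check could have flagged this earlier: compare the norm of every embedding row against the median and list the outliers. What follows is a minimal sketch of that idea, assuming the embedding table is already loaded as a row-major float array. The function names and the `ratio` threshold are hypothetical, not part of S-MoE.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L2 norm of each row in a row-major [rows x dim] embedding table.
inline std::vector<double> row_norms(const std::vector<float>& table,
                                     std::size_t rows, std::size_t dim) {
    assert(table.size() == rows * dim);
    std::vector<double> norms(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (std::size_t c = 0; c < dim; ++c) {
            const double v = table[r * dim + c];
            sum += v * v;
        }
        norms[r] = std::sqrt(sum);
    }
    return norms;
}

// Indices of rows whose norm is more than `ratio` times smaller than the median.
inline std::vector<std::size_t> suspicious_rows(const std::vector<double>& norms,
                                                double ratio) {
    if (norms.empty()) return {};
    std::vector<double> sorted = norms;
    const auto mid = sorted.begin() + static_cast<std::ptrdiff_t>(sorted.size() / 2);
    std::nth_element(sorted.begin(), mid, sorted.end());
    const double median = *mid;
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < norms.size(); ++i) {
        if (norms[i] * ratio < median) out.push_back(i);
    }
    return out;
}
```

A flagged row is only a problem if the chat template actually emits that token, since a vocabulary may well contain other reserved entries that were never trained. The useful step is to intersect the flagged indices with the token IDs of a rendered prompt.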
The lesson I keep relearning: the terrifying low-level layer was fine. What broke was a config mismatch, a wrong template, an error message silently swallowed by a pipe: the exact species of bug every web developer has hunted at 2 AM in a misconfigured staging environment. My instincts were useless for writing Metal kernels, and genuinely useful for debugging the system around them. That mix, an agent that can write memory_order_acquire fences and a human who reflexively asks "wait, are we even sending the right payload?", is the actual experiment here.
4. An Honest Word About 16 GB
The original dream — the one in the manifesto — was 16 GB. "16 GB of RAM should give you the same intelligence as 512 GB, just slower."
We almost did it. And I want to be precise about the "almost", because I promised this diary would be sincere.
The architecture held: the 117 GB of expert weights truly occupy zero bytes of RAM. They stream from the SSD at the moment of need, exactly as the phonon metaphor promised. That part of the claim survived contact with reality completely.
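To make "stream from the SSD at the moment of need" concrete, here is a minimal sketch of reading one expert into a reusable slot with POSIX `pread`. It assumes a hypothetical vault layout in which experts are stored back to back at a fixed size. The real S-MoE file format and I/O path may differ, and `read_expert` is an illustrative name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Reads expert number `expert_index` from an open vault file into `slot`.
// Assumed layout (hypothetical): fixed-size experts stored contiguously.
// Returns false on I/O error or if the file ends before the expert does.
inline bool read_expert(int fd, std::uint64_t expert_index,
                        std::size_t expert_bytes,
                        std::vector<std::uint8_t>& slot) {
    assert(expert_bytes > 0);
    slot.resize(expert_bytes);
    const off_t base = static_cast<off_t>(expert_index * expert_bytes);
    std::size_t done = 0;
    while (done < expert_bytes) {
        // pread takes an explicit offset, so concurrent readers can share one fd.
        const ssize_t n = pread(fd, slot.data() + done, expert_bytes - done,
                                base + static_cast<off_t>(done));
        if (n < 0 && errno == EINTR) continue; // interrupted, retry
        if (n <= 0) return false;              // error or unexpected end of file
        done += static_cast<std::size_t>(n);
    }
    return true;
}
```

Because the slot is reused for the next expert, the weights only ever occupy the ring buffer, never a full resident copy of the vault.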
What we could not shrink is the Surface Scout: the model's own dense backbone, the part that must stay resident to predict what fires next. For a 235B frontier model, that backbone alone weighs ~16 GB in bfloat16, as much as the entire memory of the machine we were aiming for. Chasing the frontier on 16 GB was, in hindsight, a step too far: there would be no room left for the ring buffer, the KV-cache, or macOS itself.
So the honest, recalibrated claim is this:
A 32–48 GB MacBook now runs a 235B frontier model locally, with intelligence that does not degrade — only speed does. And a 16 GB Mac runs the same architecture with smaller fine-grained MoE models, same principle, same sovereignty.
I have made my peace with this. Frontier-on-32GB was science fiction in January. Being annoyed that it isn't frontier-on-16GB feels like complaining about the legroom on your first flight to the Moon.
5. What's Next
The system that says "Hello, world!" today is the slowest version of itself that will ever exist. The roadmap is public and concrete: computing expert routing from the heavy model's own hidden state instead of the Scout's approximation, replacing the GPU's waitUntilCompleted with MTLSharedEvent pipelining, and eventually a small distilled Scout that predicts activations at a fraction of the cost. None of this requires new hardware. All of it requires more evenings.
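For readers unfamiliar with the first roadmap item, routing in a Mixture of Experts layer usually means scoring every expert against the current hidden state and keeping the top few. The sketch below shows that generic top-k step only. It is not the S-MoE router or the Scout's approximation, and all names in it are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Router scores for one layer: logits[e] = dot(router_weights[e], hidden).
// `router_weights` is row-major [num_experts x hidden.size()].
inline std::vector<float> router_logits(const std::vector<float>& router_weights,
                                        const std::vector<float>& hidden,
                                        std::size_t num_experts) {
    const std::size_t dim = hidden.size();
    assert(router_weights.size() == num_experts * dim);
    std::vector<float> logits(num_experts, 0.0f);
    for (std::size_t e = 0; e < num_experts; ++e) {
        for (std::size_t d = 0; d < dim; ++d) {
            logits[e] += router_weights[e * dim + d] * hidden[d];
        }
    }
    return logits;
}

// Indices of the k highest-scoring experts, best first.
inline std::vector<std::size_t> top_k(const std::vector<float>& logits, std::size_t k) {
    k = std::min(k, logits.size());
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}
```

In this framing, a prediction miss is any expert in the true top-k set that the prefetcher did not already have in the ring.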
And I'll keep being honest about authorship: the AI agent wrote the C++ and the Metal shaders, reasoned through memory fences and page alignment, and found bugs in code I could only read like a tourist reads street signs. I brought the question, the direction, the stubbornness, and — it turns out — the debugging instincts of someone who has been lied to by config files for more than ten years. Neither of us could have built this alone. I am not sure anyone has built quite this pairing before, and that, more than the tokens per second, is what I think is worth noticing.
The mountain doesn't just vibrate anymore.
It talks.
S-MoE is open source under the MIT License. The code, the philosophy, and the mistakes are all public: github.com/melasistema/s-moe.
Built by Luca Visciola and an AI agent who hopefully knows C++ and low-level programming, because I do not 😅.
Acknowledgements: The seismic tomography concept remains a philosophical translation of the published work of Ing. Filippo Biondi on SAR-based subsurface imaging. The first words belong to the mountain; the intuition to listen for them belongs to him.