Two of the biggest models you can run on a laptop. One head-to-head. No cloud. My MacBook can hold two giant models, and I kept wondering which one I should actually trust as my daily driver. So instead of guessing, I made them fight. Here is the whole thing in under two minutes.
The setup is a MacBook Pro M5 Max with 128 GB of memory. In one corner, GLM-4.5-Air: about 106 billion parameters, kept at a clean 6-bit. In the other, DeepSeek V4 Flash: a 284 billion parameter model squeezed all the way down to 2-bit so it fits. Both land in the 80 to 90 GB range on disk (87 GB and 81 GB), so on a 128 GB machine only one of them loads at a time. Reaching for the other costs a full reload. That is just the reality of running frontier-sized models on a machine that also has to run everything else.
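For a rough sense of why those bit widths matter, here is a back-of-the-envelope size estimate. It is only a sketch: the function name is made up for illustration, and it assumes every weight is stored at exactly the nominal bit width. Real quantized files tend to come out larger (the table below lists 87 GB and 81 GB), presumably because they also store scaling data and keep some tensors at higher precision.

```python
def estimated_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Naive size of the weights in GB (10^9 bytes), ignoring all overhead."""
    bytes_total = total_params * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal figures from this post; the results are lower bounds, not measured sizes.
glm_gb = estimated_weight_gb(106e9, 6)
deepseek_gb = estimated_weight_gb(284e9, 2)

print(f"GLM-4.5-Air at 6-bit:       ~{glm_gb:.1f} GB")
print(f"DeepSeek V4 Flash at 2-bit: ~{deepseek_gb:.1f} GB")
```

Either estimate fits in 128 GB on its own, but the two together do not, which matches the one-at-a-time constraint.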
I went in assuming the smaller, higher-precision model would feel sharper, and the giant 2-bit one would be the slow, fuzzy novelty. From where I sit now, I had it backwards.
The contenders
| Spec | GLM-4.5-Air | DeepSeek V4 Flash |
|---|---|---|
| Total parameters | 106B | 284B |
| Quantization | 6-bit (MLX) | 2-bit (GGUF) |
| Weights on disk | 87 GB | 81 GB |
| Context window | 128K | 200K |
| Runtime | MLX / Apple Silicon | llama.cpp / Metal |
I gave both the exact same four prompts — a reasoning riddle, a coding task, a creative writing bit, and the gotcha question that trips up almost every model — and measured speed the way you actually feel it: tokens per second, end to end.
Round zero: speed
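First surprise. The bigger model is faster. DeepSeek averaged about 35.8 tokens per second; GLM came in around 31, so DeepSeek was roughly 15% quicker. A 284 billion parameter model, crushed to 2-bit, out-paced a 6-bit model a little over a third its size. My read is that on Apple Silicon, where generation speed is largely limited by how fast weights can be read from memory, the heavier quantization cuts the bytes read per token enough to come out ahead.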
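If you want to take the same kind of end-to-end measurement yourself, a minimal sketch looks like this. It is not the harness used for this post. `generate` is a hypothetical stand-in for whatever streaming call your runtime exposes, assumed here to yield one token at a time, and `fake_generate` exists only so the example runs without a model.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(generate: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Time one full generation, from request to last token, and return tokens/sec."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))  # drain the stream, counting tokens
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("nan")


def fake_generate(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a model: echoes the prompt word by word."""
    for word in prompt.split():
        time.sleep(0.001)  # pretend each token takes about a millisecond
        yield word


rate = tokens_per_second(fake_generate, "one two three four")
print(f"{rate:.1f} tokens/sec")
```

Because the clock starts before the first token arrives, prompt processing counts against the total, which is closer to how speed feels in practice than a decode-only figure.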
Round one: the reasoning riddle
I asked the old chestnut: if a hen and a half lays an egg and a half in a day and a half, how many eggs do six hens lay in six days? The answer is 24.
GLM fell for the trap and answered 12. It did the clever part right — doubling the hens and the eggs — but then it also tripled the days, which is exactly the mistake the riddle is built to catch. DeepSeek kept its head and landed on 24. That was the one round where the gap was real, and it went to the 2-bit model.
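The 24 is easy to check by working out the laying rate per hen per day first. This is my own arithmetic, not either model's output, and the function name is just for illustration.

```python
from fractions import Fraction

# Given: 1.5 hens lay 1.5 eggs in 1.5 days.
HENS = Fraction(3, 2)
EGGS = Fraction(3, 2)
DAYS = Fraction(3, 2)

# Eggs per hen per day: 1.5 / (1.5 * 1.5) = 2/3.
RATE = EGGS / (HENS * DAYS)


def eggs_laid(hens, days):
    """Eggs produced by `hens` hens over `days` days at the riddle's rate."""
    return RATE * hens * days


print(RATE)             # 2/3
print(eggs_laid(6, 6))  # 24
```

Both the hens and the days scale by a factor of four from the original statement, so the egg count scales by sixteen: 1.5 × 16 = 24.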
Round two: coding
Write a memoized Fibonacci function, then tell me what fib(30) returns. Both wrote clean, correct, memoized code and both got 832040. A genuine tie. If your work is mostly coding, either one will serve you well.
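For reference, one common way to write this in Python is shown below. It is a sketch of the task, not a transcript of what either model produced, and it assumes the usual convention that fib(0) = 0 and fib(1) = 1.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, caching every intermediate result."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 2:
        return n  # fib(0) = 0, fib(1) = 1
    return fib(n - 1) + fib(n - 2)


print(fib(30))  # 832040
```

The cache means each value is computed once, so the call does linear work instead of the exponential blow-up of the naive recursion.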
Round three: creative writing
A four-line poem about a robot tasting rain. Both delivered something real. GLM played it gentle and tidy; DeepSeek got a little stranger and more vivid. That one comes down to taste, not correctness.
Round four: the gotcha
How many times does the letter R appear in "strawberry"? This is the question that has embarrassed a lot of big models. Both got it right — three. Nice to see, honestly.
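The ground truth here takes one line to confirm, which is part of what makes the question such a good gotcha. The helper name below is my own.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "R"))  # 3
```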
The verdict
DeepSeek V4 Flash takes it. It was faster, and it was the only one that solved the reasoning riddle. They tied everywhere else. The part I keep turning over is that the winner is the most compressed model in the room. Two-bit beat six-bit.
I do not want to over-claim from four prompts — this is a quick head-to-head, not a full benchmark suite, and a different set of tasks could shift the picture. But the takeaway for me is simple: on a 128 GB machine, more raw parameters at low precision can beat fewer parameters at high precision. I had been spending memory on precision when I probably should have been spending it on size.
The bigger point is the one I keep coming back to in everything I write here. Both of these models ran on my desk, offline, for free. No subscription, no tokens metered, nothing leaving the machine. A couple of years ago a 284 billion parameter model was a data-center thing. Now it is something I can load on a laptop and argue with about chickens. That is a wild place to be, and it is only moving in one direction.
Run local models yourself (free)
If you want to try this, the abliterated MLX models I have converted for Apple Silicon are all free to pull:
- My Hugging Face: abliterated MLX models for Apple Silicon
- claude-code-local: run Claude Code against a local model, no API key. Point MLX_MODEL at a repo and go.
The narration in the video is local text-to-speech, and the whole thing was rendered on the same MacBook. Nothing in this comparison touched the cloud.
Originally published at Nice Dreamz Wholesale. I convert abliterated MLX models for Apple Silicon — all free at my Hugging Face. For local-AI consulting for compliance-sensitive firms, see AirGap AI.