1-bit, 545 megabytes, zero API keys — local AI that beats GPT-5.4
By Vilius Vystartas | May 2026
I ran the same 10 agent coding tasks against 8 locally running models on my Mac. No cloud, no API keys, no per-token billing. The results surprised me enough that I ran them twice.
The leaderboard
| Model | Bits | Size | Score | Time |
|---|---|---|---|---|
| Qwen 3.5 9B | 4-bit | ~5GB | 83% | 190s |
| AgenticQwen 8B | 4-bit | ~5GB | 82% | 189s |
| Bonsai 4B | 1-bit | 545MB | 80% | 18s |
| Ternary Bonsai 1.7B | 2-bit | 442MB | 80% | 10s |
| Bonsai 8B | 1-bit | 1.1GB | 80% | 15s |
| Ternary Bonsai 4B | 2-bit | 1.0GB | 80% | 20s |
| Ternary Bonsai 8B | 2-bit | 2.1GB | 78% | 22s |
| Bonsai 1.7B | 1-bit | 237MB | 73% | 8s |
A 545MB model beats GPT-5.4
Bonsai 4B at 1-bit quantization scores 80% on the same tasks where GPT-5.4 scored 75%. Half a gigabyte. No data center. Every request is processed locally, with no network latency. It's also more than 10x faster than the Qwen models (18s vs 190s) because there's far less to compute per token.
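The 545MB figure is roughly what the arithmetic predicts. A minimal back-of-envelope sketch, assuming "4B" means a nominal 4e9 parameters (the exact count isn't published in the table):

```python
# At 1 bit per weight, on-disk size is roughly params / 8 bytes.
# The remainder of the 545MB reported in the leaderboard would be
# higher-precision tensors (embeddings, quantization scales) and metadata.

def quantized_size_mb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in megabytes for a quantized model."""
    return params * bits_per_weight / 8 / 1e6

weights = quantized_size_mb(4e9, 1)
print(f"1-bit 4B weights alone: ~{weights:.0f} MB")
```

That leaves about 45MB of the reported file for everything stored above 1 bit, which is in the range you'd expect for embeddings and per-block scales.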
The 4-bit baselines tie Claude
The 4-bit Qwen models at ~5GB score 82-83%, matching Claude Sonnet 4's cloud performance. On a Mac. These aren't toys.
1-bit vs 2-bit (ternary): the extra bit is dead weight
At the 1.7B size, ternary helps: 80% vs 73%. But at 4B the two tie (80%), and at 8B the 2-bit version actually scores slightly lower (78% vs 80%). The extra bit doubles the disk footprint (1.0GB vs 545MB, 2.1GB vs 1.1GB) for zero gain. At these sizes, 1-bit quantization has already captured everything the model can offer.
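The "dead weight" point is just the size formula applied across the lineup: moving from 1-bit to 2-bit doubles weight storage at every size, so it only pays off where the score improves. A quick sketch, again treating the model names as nominal parameter counts ("1.7B" = 1.7e9, an assumption):

```python
# Compare 1-bit vs 2-bit weight storage for each Bonsai size.
# The ratio is always exactly 2x, so the score column alone decides
# whether the second bit is worth it.

SIZES = {"1.7B": 1.7e9, "4B": 4e9, "8B": 8e9}

def weight_mb(params: float, bits: float) -> float:
    return params * bits / 8 / 1e6

for name, params in SIZES.items():
    one_bit = weight_mb(params, 1)
    two_bit = weight_mb(params, 2)
    print(f"{name}: 1-bit ~{one_bit:.0f} MB vs 2-bit ~{two_bit:.0f} MB")
```

These estimates line up with the leaderboard (442MB, 1.0GB, and 2.1GB for the ternary models), which suggests the on-disk sizes really are dominated by the weights.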
What this means
You can run an agent coding model that beats GPT-5.4 on a laptop with no internet. For regulated industries — healthcare, finance, government — this removes the compliance headache. No data leaves the device. No vendor API agreement to negotiate. No per-request billing to track.
The Bonsai findings are also on benchmarks.workswithagents.dev, refreshed with each run and shown alongside the cloud models for direct comparison.
I didn't expect a 545MB quantized model to beat a cutting-edge cloud API. But here we are.