▶ Watch the race on YouTube: https://www.youtube.com/watch?v=2KeTDDodE0A
April 22, 2026. Anthropic's Claude Code Max plan jumped to $100 a month. I ran a live three-way AI race on the exact same prompt — Gemma 31B local, Llama 70B local, and Claude cloud — on a single MacBook, to see how close a free local stack gets to the paid cloud. Two of three contestants finished with zero cloud calls.
If you just want the video, it's the one linked above: FREE AI on a MacBook vs Claude Cloud — Hexagon Shootout.
If you want the repo, it's here: github.com/nicedreamzapp/claude-code-local.
Keep reading for the setup, the numbers, and the three things that surprised me.
The setup — same prompt, three contestants
Hardware: M5 Max MacBook Pro, 128 GB unified memory, Apple Silicon.
- Gemma 31B — local, Apple MLX, 4-bit quantized (Google's code-specialized model)
- Llama 70B — local, Apple MLX, 8-bit quantized (Meta's generalist)
- Claude cloud — the real Anthropic API, using Claude Code unchanged
Same prompt to every contestant:
Build a single HTML file with inline JavaScript that shows a ball bouncing inside a rotating hexagon. Include gravity and realistic bounce physics.
Simple enough that the answer should be a few kilobytes of code. Interesting enough that it exposes how well a model handles real math — collision detection against rotating geometry, energy conservation, boundary clamping. When models trip, they trip here.
Every run was recorded end-to-end with a live stats panel: elapsed seconds, output bytes, tokens-per-second. No cherry-picking, no post-hoc edits to the physics code, no "here's what it SHOULD have said." What you see is what came out.
The results
| Contestant | Time to ship working HTML | Tokens/sec | Cloud calls |
|---|---|---|---|
| Claude cloud | 22 s | N/A (data center) | yes (via API) |
| Gemma 31B local | 56 s | ~30 | zero |
| Llama 70B local | 2:17 | ~11 | zero |
Claude cloud finished first — it's a data center somewhere. Gemma 31B finished clean in under a minute with working physics. Llama 70B took the longest and produced the most verbose output, but also landed a working demo in the end.
The headline isn't that one is "best." It's that two of the three ran with Wi-Fi that could have been off the entire time. That's the number that matters for anyone dealing with NDAs, PHI, client files, or just a flight without connectivity.
Three things that surprised me
1. Bigger isn't better when "bigger" is a generalist
I went in expecting Llama 70B to beat Gemma 31B on code quality; it has more than twice the parameter count. Instead, Gemma finished cleaner and faster on this specific task.
Why: Gemma 4 is a Google model fine-tuned heavily for coding and math. Llama 3.3 70B is Meta's generalist — it's excellent at conversation, reasoning, creative writing, but it wasn't tuned to punch above its weight on HTML canvas physics.
If you're buying a local model for coding, you're better off with a 30B that's code-tuned than a 70B that's general. Don't count parameters; read the model card.
2. Claude Code's harness chokes local models
Claude Code (the CLI agent) sends a 29,000-token system prompt with 60 tool schemas in every request. That's tuned for the cloud — where a frontier model can happily chew through 30K tokens of context before even starting. On a local 70B, that prefill takes a minute or two before generation begins.
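The prefill cost is easy to estimate: the full prompt has to be processed before the first output token appears. A back-of-envelope sketch — the 29,000-token figure is from above, but the prefill rates are illustrative assumptions, not measured numbers:

```javascript
// Prefill latency ≈ prompt tokens / prefill throughput.
// 29,000 tokens is the harness prompt size cited above; the throughput
// numbers below are illustrative assumptions, not benchmarks.
function prefillSeconds(promptTokens, prefillTokensPerSec) {
  return promptTokens / prefillTokensPerSec;
}

const harnessPrompt = 29000;
// e.g. a local 70B prefilling at ~300 tok/s vs a data center at ~5,000 tok/s
console.log(prefillSeconds(harnessPrompt, 300).toFixed(0) + " s");  // ~97 s before the first token
console.log(prefillSeconds(harnessPrompt, 5000).toFixed(0) + " s"); // ~6 s
```

At local prefill speeds, the harness alone costs more wall-clock time than Claude cloud needed for the entire task.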
When I bypassed Claude Code and hit the MLX server directly with just the prompt, Llama 70B's wall-clock time dropped from 7+ minutes to under 2.
The tradeoff: without Claude Code's harness you lose the Write/Edit/Bash tool-use loop, so you can't use Claude Code as an agent, only as a generator. For research, benchmarking, or any single-shot prompt, direct is way faster. For actual coding sessions, the overhead is real but it's what buys you the agent loop.
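Hitting the server directly means sending just the prompt, with no harness system prompt and no tool schemas. A minimal sketch, assuming the local server exposes an Anthropic-compatible `/v1/messages` endpoint on port 4000 — the endpoint path, port, and model id here are assumptions, so match them to whatever your server actually serves:

```javascript
// Build a bare Anthropic-style chat request: no 29K-token system prompt,
// no tool schemas, just the user prompt. Endpoint, port, and model id are
// assumptions; adjust to your local server's actual config.
function bareRequest(prompt, model = "llama-3.3-70b") {
  return {
    url: "http://localhost:4000/v1/messages",
    options: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        model,
        max_tokens: 4096,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage (network call, so left commented out):
// const { url, options } = bareRequest("Build a single HTML file ...");
// fetch(url, options).then(r => r.json()).then(console.log);
```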
3. Circle-approximation collision is the cheat code
All three models eventually produced a bouncing ball. The ones that worked used circle-approximation collision — treat the hexagon as a circle of its apothem radius for collision purposes, reflect velocity when the ball exceeds that radius, clamp the ball back to exactly inside. Five lines of math, reliable, and the hexagon can rotate as wildly as you want.
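The loop the successful runs converged on looks roughly like this — a hand-written sketch of the technique, not any model's verbatim output:

```javascript
// Ball bouncing inside a hexagon approximated as a circle of apothem radius.
// When the ball crosses the boundary, reflect its velocity about the surface
// normal, damp it, and clamp the ball back inside so it can never leak out.
const R = 200;                              // hexagon circumradius
const apothem = R * Math.cos(Math.PI / 6);  // inscribed-circle radius
const ball = { x: 0, y: -50, vx: 3, vy: 0, r: 10 };
const gravity = 0.4;
const restitution = 0.9;                    // fraction of speed kept per bounce

function step(b) {
  b.vy += gravity;
  b.x += b.vx;
  b.y += b.vy;

  const dist = Math.hypot(b.x, b.y);
  const limit = apothem - b.r;
  if (dist > limit) {
    // Unit normal from the center toward the ball.
    const nx = b.x / dist, ny = b.y / dist;
    // Reflect: v' = v - 2(v·n)n, then damp.
    const dot = b.vx * nx + b.vy * ny;
    b.vx = (b.vx - 2 * dot * nx) * restitution;
    b.vy = (b.vy - 2 * dot * ny) * restitution;
    // Clamp exactly onto the boundary — this is the line that makes it leak-proof.
    b.x = nx * limit;
    b.y = ny * limit;
  }
  return b;
}

// Run a few hundred frames; the ball must always stay inside the circle.
for (let i = 0; i < 500; i++) step(ball);
```

Because the boundary is a circle, the hexagon's rotation never enters the collision math at all — the hexagon is purely cosmetic.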
The ones that failed tried to do proper polygon-edge collision — compute the six edges of the rotating hexagon each frame, compute point-to-line distance for each, reflect off the appropriate edge. That's the "right" way, and it fails constantly because discrete timesteps and floating-point error let the ball slip through an edge mid-rotation, and then the model doesn't know how to clamp it back.
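For reference, here is the per-frame geometry the polygon approach requires — a sketch assuming a regular hexagon centered at the origin. The math is correct in principle; the failure mode is everything around it (which edge to reflect off, and what to do once the ball is already outside):

```javascript
// Per-frame work for exact polygon collision: recompute the six rotating
// vertices, then a point-to-segment distance for each edge. Correct in
// principle, but a large timestep can carry the ball clean past an edge.
function hexVertices(R, theta) {
  return Array.from({ length: 6 }, (_, i) => {
    const a = theta + (i * Math.PI) / 3;
    return { x: R * Math.cos(a), y: R * Math.sin(a) };
  });
}

function pointToSegment(p, a, b) {
  const abx = b.x - a.x, aby = b.y - a.y;
  const t = Math.max(0, Math.min(1,
    ((p.x - a.x) * abx + (p.y - a.y) * aby) / (abx * abx + aby * aby)));
  return Math.hypot(p.x - (a.x + t * abx), p.y - (a.y + t * aby));
}

// Sanity check: the distance from the center to the nearest edge is the
// apothem, R·cos(30°) ≈ 0.866·R.
const verts = hexVertices(1, 0);
const d = Math.min(...verts.map((v, i) =>
  pointToSegment({ x: 0, y: 0 }, v, verts[(i + 1) % 6])));
```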
I wouldn't have predicted this. The "simple" approximation is strictly better for the demo because it can't leak. For anything more complex than one ball, the polygon approach is necessary — but for a benchmark, approximation wins.
Who should care
- Developers on laptops with 64+ GB of Apple Silicon unified memory: you can run this today; your hardware already supports it.
- Anyone dealing with confidential work — lawyers, accountants, doctors, contractors handling NDAs or PHI: the cost isn't $0 vs $100, it's "does your data leave the machine" vs "does it not."
- Frequent flyers and people who travel to places with bad internet: a 70B model on a laptop keeps working when the plane's Wi-Fi is $18 and throttled.
- Anyone curious whether Apple's bet on unified memory was actually about AI: it was.
How to run it yourself
The repo is MIT licensed and open source. Full setup is in the README:
→ github.com/nicedreamzapp/claude-code-local
The project pairs a native-MLX Anthropic-API-compatible server with Claude Code. Point Claude Code at localhost:4000 and the official CLI talks to your local model as if it were the cloud API. Swap models with one env var. Ship code without the subscription.
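The wiring is just environment variables. A sketch with assumed values — `ANTHROPIC_BASE_URL` is Claude Code's standard endpoint override, but the model-selection variable name below is hypothetical, so check the repo's README for the real one:

```shell
# Point the official Claude Code CLI at the local MLX server instead of
# Anthropic's API. LOCAL_MODEL is a hypothetical variable name; the repo's
# README documents the actual model switch.
export ANTHROPIC_BASE_URL="http://localhost:4000"
# export LOCAL_MODEL="gemma-31b"
claude
```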
Around 2,000 stars in the first month. If it's useful, a star helps.
TL;DR
- Claude cloud: $100/mo, 22 seconds to a working hexagon.
- Gemma 31B on my MacBook: $0, 56 seconds to a working hexagon.
- Llama 70B on my MacBook: $0, 2:17 to a working hexagon.
- Two of three ran with zero cloud calls.
- Free AI on Apple Silicon is real, now, for a huge slice of what people use cloud APIs for.
The receipts, in video form: youtube.com/watch?v=2KeTDDodE0A
Originally published at Marijuana Union.
