Faster, Cheaper, Local: The Myth and Reality of Replacing Claude for Coding

Like many, I started actively using LLMs for basic coding scenarios around June 2023, and it was a breakthrough. The only problem back then was fitting a prompt into the context window, so I built my own techniques (I even made codeprompter.com, a free service to compact code context). Plugins came and went, but I kept using web UIs because they were predictable. Fast-forward to 2025: the new kid on the block, Claude Code (a CLI/plugin tool), flipped my entire workflow. It's not cheap, and if you don't develop some basic skills, it can get exponentially expensive. Can you replace it? Let's dive in.

Token Costs Escalate Quickly

Ten days. $170. That was the burn rate of my last experiment with Claude tokens — which, if you do the math, scales to thousands a year. The quality of the code, though? Magic. You still need to prompt carefully, still need to git commit often, but the results are on a whole different level: clean logic, production-ready output, even solid documentation if you ask for it. Feed it examples of your preferred style and you'll get a masterpiece.
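
For the curious, here's the back-of-the-envelope math (a trivial sketch; the $170-per-10-days figure is my own measured burn rate, yours will vary):

```python
# Back-of-the-envelope burn rate, using the numbers above.
spend_usd = 170   # cost of the 10-day experiment
days = 10

daily = spend_usd / days   # ~$17/day
yearly = daily * 365       # ~$6,205/year

print(f"${daily:.0f}/day, ~${yearly:,.0f}/year")
```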

A one-off $170 isn't a deal-breaker, but it's unsustainable long term. So I went hunting for cheaper ways to run an AI coder. Here's what I found.

Local Qwen3 Coder 30B MLX on Mac M1 Max

I spun up Qwen3 Coder 30B A3B Instruct (4-bit) locally with MLX via LM Studio and tried several CLI front-ends: Claude Code Router, llxprt, and Qwen Code.
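
As a quick smoke test, you can hit LM Studio's local OpenAI-compatible server directly (it listens on localhost:1234 by default once the server is enabled); the model ID below is a placeholder for whatever LM Studio shows for your loaded model:

```python
# Minimal sanity check against LM Studio's local OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b-instruct",  # placeholder: use the ID LM Studio displays
    messages=[{"role": "user", "content": "Reverse a linked list in Python."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```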

Speed (chat mode): 50–60 tokens/sec — insane for my 32GB M1 Max. It felt like GPT-4 level reasoning, running locally.
CLI mode: A different story. CLI tools fire off many background requests (updating TODOs, listing folders, planning steps), so real workflows took 20–30 minutes for just a few commands. And each request drags along a huge context window (~30K tokens); see the rough math below.
Scaling up: Could a $10K Mac Studio Ultra (512GB, M3 Ultra) push ~100–120 tokens/sec? Probably. But that's pricey, ages quickly, and still doesn't fix the quality gap.
Verdict: Works fine in chat, basically unusable in CLI. Might suit hobbyists with high-end rigs, but not me.
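
The rough math behind that CLI verdict; every number here except the measured generation speed is an assumption:

```python
# Why agentic CLI sessions crawl on local hardware (illustrative numbers only).
requests_per_task = 20   # assumption: round-trips for one small task
context_tokens = 30_000  # context each request carries (from above)
output_tokens = 300      # assumption: average tokens generated per request
prefill_tps = 400        # assumption: prompt-processing speed on an M1 Max
generate_tps = 55        # measured: midpoint of 50-60 tokens/sec

per_request_s = context_tokens / prefill_tps + output_tokens / generate_tps
total_min = requests_per_task * per_request_s / 60
print(f"~{per_request_s:.0f}s per request, ~{total_min:.0f} min per task")
# ~80s per request, ~27 min per task (ignoring any prompt caching):
# the time goes into re-reading the context, not into generating code.
```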

Qwen3 Coder on Vast.ai

Vast.ai is a great idea, honestly. They let you rent GPUs cheaply and run any model with pre-built templates. You can spin up an instance in minutes.

But… I already manage a hundred instances across different projects. Adding more is overhead I don't want.

Verdict: Fantastic service, but managing yet another set of instances isn't worth it for me. For others with fewer moving parts, it could be perfect.

Qwen3 Coder 480B on Cerebras

If you've never heard of Cerebras, go and try it right now. Seriously. Remember my local 50–60 tokens/sec? Cerebras gives you 1,800–3,000 tokens/sec. It's so fast you'll blink and your app is built. Their Discord community support is outstanding, too.

But here's the catch: they frame it as "24M tokens/day for $50 per month." In practice the accounting is less generous than it sounds. CLI tools blast out many requests, each resending the full context, so I hit my daily limit in just hours. I wasn't alone; plenty of devs hit the same wall (see this Reddit thread).
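
A rough sketch of why the budget evaporates (the per-request figures are assumptions carried over from my local experiment above):

```python
# How fast an agentic CLI eats "24M tokens/day" (illustrative numbers only).
daily_budget = 24_000_000    # tokens/day on the $50/month plan, as marketed
tokens_per_request = 30_000  # assumption: full context resent on most requests
requests_per_task = 20       # assumption: round-trips per small coding task

max_requests = daily_budget / tokens_per_request   # ~800 requests/day
max_tasks = max_requests / requests_per_task       # ~40 small tasks/day
print(f"~{max_requests:.0f} requests/day, ~{max_tasks:.0f} small tasks/day")
# A long refactoring session can burn through that well before the day ends.
```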

And when I actually used it on my projects, speed wasn't the bottleneck — debug time was. Qwen3 still didn't match the coding quality of Sonnet 3.7.

Verdict: Wild speed, great service, but model quality lags Anthropic.

Cerebras at warp speed

Back to Claude

After this tour, I went back to Claude. Even Sonnet 3.7 feels like driving a brand-new car compared to the alternatives. Smooth, predictable, fewer surprises.

But I can't keep burning tokens at $170 per 10 days (~$6K/year). My solution: switch to two Pro subscriptions (and Max if needed). The key change? I spend more time crafting better prompts. I run drafts through GPT-based agents to find flaws, polish them, and only then send them to Claude. The result: spectacular output at a predictable cost.
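
Here's a minimal sketch of that pre-flight review step, using the OpenAI Python SDK; the model name and the reviewer instructions are placeholders for whatever agent you prefer:

```python
# Pre-flight a Claude prompt through a cheap reviewer before spending tokens on it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

draft = """Refactor the payment module to ...
(acceptance criteria, style examples, constraints go here)"""

review = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any capable, cheap reviewer model
    messages=[
        {"role": "system",
         "content": "You review coding prompts. List ambiguities, missing "
                    "constraints, and likely misreadings. Be terse."},
        {"role": "user", "content": draft},
    ],
)
print(review.choices[0].message.content)
# Fix what it flags, then send the polished prompt to Claude once,
# instead of iterating expensively inside the agent loop.
```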

The secret sauce wasn't more tokens — it was fewer, higher-quality requests. Same speed, lower bill.

Closing Thoughts

I still dream of the day when a local model can match Sonnet's coding reliability. We're not there yet. But hats off to open source — the fact that Llama, Gemma, DeepSeek, Qwen and others run on a laptop at all is incredible. Five years ago I'd have bet this would take 20. And yet, here we are.

Such a time to be alive.

by Russ Anvarov | Algohit Inc.
