DEV Community

Tsunamayo

Claude Code Token Crisis: Why I Built a Local Agent Instead of Switching to Codex

The Exodus

It's April 2026 and Claude Code developers are in crisis:

  • Max plan users ($100-200/mo) hitting daily limits by afternoon
  • Anthropic admitted tokens drain "way faster than expected"
  • OpenAI Codex launched at $20/mo with no limits
  • OpenClaw hit 346K stars — but has a CVSS 8.8 RCE vulnerability

Developers are leaving. But they don't have to.

The Real Problem

Claude Code burns tokens on everything:

  • Reading a file: ~2K tokens
  • Searching code: ~5K tokens
  • Each agent subprocess: ~50K tokens
  • A complex refactoring session: 500K+ tokens

Most of these are routine operations that don't need Opus 4.6's reasoning power.
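The imbalance above suggests a simple routing rule: send routine operations to a cheap local model and keep Opus only for decisions. A minimal sketch of that idea (the operation names and token figures are illustrative, taken from the list above — this is not helix-agents' actual router):

```python
# Hypothetical cost-based router: routine ops go to a local model,
# everything else stays with Opus. Token costs mirror the figures above.

ROUTINE_OPS = {"read_file", "search_code", "summarize", "research"}

# Rough per-operation token cost when Opus handles it directly.
OPUS_COST = {
    "read_file": 2_000,
    "search_code": 5_000,
    "agent_subprocess": 50_000,
}

def route(op: str) -> str:
    """Return which tier should execute the operation."""
    return "local" if op in ROUTINE_OPS else "opus"

def opus_tokens_saved(ops: list) -> int:
    """Tokens Opus avoids spending when routine ops are delegated."""
    return sum(OPUS_COST.get(op, 0) for op in ops if route(op) == "local")

print(route("read_file"))                               # local
print(opus_tokens_saved(["read_file", "search_code"]))  # 7000
```

Even this naive rule captures most of the savings, because the expensive decisions are rare and the cheap routine calls are constant.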

The Solution: Local Delegation

helix-agents v0.9.0 is an MCP server that keeps you on Claude while cutting token usage by 60-80%.

```
Claude Code (Opus 4.6) — makes decisions
  ↓ delegates via MCP
helix-agents (local, $0)
  ├── gemma4:31b — research, vision, tools
  ├── Qdrant memory — persistent across sessions
  └── Computer Use — browser automation
```

Opus decides what to do. Local models do the work.

gemma4: Released Yesterday, Default Today

Google DeepMind released gemma4 on April 2nd. helix-agents adopted it as the default model on Day 1 — the fastest adoption of any MCP tool:

  • AIME 89.2% — math reasoning rivaling closed models
  • LiveCodeBench 80% — strong code generation
  • 256K context — handle massive codebases
  • Vision + Function Calling — multimodal agent capabilities
  • Apache 2.0 — fully open, no restrictions
  • Runs on 20GB VRAM — accessible hardware requirements

Windows Computer Use

Claude Code's Computer Use is macOS only. helix-agents brings it to Windows via Playwright + helix-pilot integration — making it the only MCP tool offering Computer Use on Windows today.

Multi-Provider Architecture

helix-agents isn't just about gemma4. It's a unified MCP runtime supporting three providers:

| Provider | Use Case | Examples |
| --- | --- | --- |
| `ollama` | Local LLM (free) | gemma4:31b, qwen3.5:122b, deckard-uncensored |
| `codex` | Repo-scale coding | Codex CLI integration, sandboxed execution |
| `openai-compatible` | Hosted APIs | GPT, Mistral, Groq |

All 11 MCP tools (think, agent_task, fork_task, computer_use, etc.) work identically across all providers. Switch with one command:

```python
providers(action="use", provider="codex")     # Switch to Codex
providers(action="use", provider="ollama")    # Back to local
providers(action="use_auto")                  # Auto-select
```

This means:

  • Routine tasks → Ollama ($0)
  • Repo-scale coding → Codex
  • High quality but not Opus → OpenAI-compatible

Claude Code + helix-agents = optimal model at optimal cost for every task.

The multi-provider runtime has been stable since v0.4.0 — zero breaking changes through v0.9.0.
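The `providers()` calls above can be modeled as a small registry with an active provider and a cost-aware auto-select. The sketch below is an assumption about how such a registry could work — the provider names match the table, but the internals (costs, the `scope` heuristic) are invented for illustration, not helix-agents' actual code:

```python
# Illustrative provider registry mirroring the providers() tool.
# Costs and the scope-based auto-select heuristic are assumptions.

PROVIDERS = {
    "ollama": {"cost": 0.0, "scope": "routine"},
    "codex": {"cost": 20.0, "scope": "repo"},
    "openai-compatible": {"cost": 10.0, "scope": "hosted"},
}

_active = "ollama"

def providers(action: str, provider: str = None, task: str = "routine") -> str:
    """Switch the active provider, or auto-select one by task scope."""
    global _active
    if action == "use":
        if provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {provider}")
        _active = provider
    elif action == "use_auto":
        # Cheapest provider whose scope matches the task.
        matches = [p for p, cfg in PROVIDERS.items() if cfg["scope"] == task]
        _active = min(matches or ["ollama"], key=lambda p: PROVIDERS[p]["cost"])
    return _active

providers(action="use", provider="codex")   # -> "codex"
providers(action="use_auto")                # routine task -> "ollama"
```

The key design property this models: the tool surface stays identical across providers, so switching backends never changes how Claude calls the 11 MCP tools.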

Why Not Just Switch to Codex?

| | Claude Code + helix-agents | Codex | OpenClaw |
| --- | --- | --- | --- |
| Cost | $100 + $0 local | $20 | Free |
| Quality | Opus 4.6 decisions | GPT-5.3 | Varies |
| Security | Local, no cloud | OpenAI cloud | CVE-2026-25253, 12% malicious skills |
| Token limit | Effectively 5-10x more | Unlimited | N/A |
| Ecosystem | Claude Code native | Separate tool | Separate tool |
| Computer Use | Windows + macOS | No | No |

The key insight: you don't need to abandon Claude's quality to solve the cost problem.

What's in v0.9.0

Built by analyzing Claude Code's actual source architecture:

  • Fork-style context — subagents inherit parent context
  • gemma4:31b default — vision + reasoning + function calling
  • 280 tests passing — production-ready
  • Computer Use — browser/desktop automation (Windows!)
  • Qdrant shared memory — persistent vector search
  • JSONL tracing — full observability
  • OOM auto-fallback — gemma4 → gemma3 → gemma3:4b
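The OOM auto-fallback in the list above can be sketched as a simple chain: try each model largest-first and drop down when the runtime runs out of VRAM. A hypothetical sketch (`run_model` stands in for the real inference call; the actual tool presumably inspects Ollama's error responses rather than catching `MemoryError`):

```python
# Hypothetical OOM fallback chain: gemma4:31b -> gemma3 -> gemma3:4b.
# `run_model` is an injected inference function that may raise MemoryError.

FALLBACK_CHAIN = ["gemma4:31b", "gemma3", "gemma3:4b"]

def generate_with_fallback(prompt, run_model):
    """Try models largest-first; return (model_used, output)."""
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            return model, run_model(model, prompt)
        except MemoryError as err:  # VRAM exhausted: fall back to a smaller model
            last_err = err
    raise RuntimeError("all fallback models failed") from last_err

# Fake runner for demonstration: pretend only the smallest model fits.
def fake_run(model, prompt):
    if model != "gemma3:4b":
        raise MemoryError(f"{model} does not fit")
    return f"ok from {model}"

print(generate_with_fallback("hi", fake_run))  # ('gemma3:4b', 'ok from gemma3:4b')
```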

Real Savings

| Task | Opus tokens | With helix-agents | Saved |
| --- | --- | --- | --- |
| Explore 50 files | 100K | 2K | 98% |
| Code review 500 lines | 30K | 1K | 97% |
| Multi-step research | 200K | 3K | 98% |

Quick Start (2 minutes)

```shell
git clone https://github.com/tsunamayo7/helix-agent.git
cd helix-agent && uv sync
ollama pull gemma4:31b
uv run python server.py
```

Add to ~/.claude/settings.json:

```json
{
  "mcpServers": {
    "helix-agents": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-agent", "python", "server.py"]
    }
  }
}
```

For Anthropic

This isn't an anti-Claude tool. It's a retention tool:

  • Users stay on Claude instead of switching to Codex
  • Max plan subscriptions continue
  • Token pressure decreases naturally
  • Users get a better experience and stay loyal

The best response to "Claude Code is too expensive" isn't "switch to Codex." It's "make Claude Code more efficient."

GitHub: tsunamayo7/helix-agent


Built during the 2026 token crisis. Because the best code assistant shouldn't come with a timer.

Top comments (1)

Steriani Karamanlis

The token burn problem you're describing is real, and it gets worse when you factor in that frontier pricing has been flat for 9 consecutive weeks while the spread between frontier and budget models sits at 7.1x. For agentic workflows, where tokens compound across every loop, the model selection decision matters enormously. Most teams are paying frontier prices for tasks that a well-chosen budget model handles just as well.
