Stop Using "Senior" AI for Junior Tasks: How I Cut Token Costs by 85%
Here’s a take most developers get wrong: The best AI tool isn't the most powerful one. It's the right one for the job.
I use Claude Code every day. I also use a local 14B model that can't hold a candle to it. But routing tasks between them cut my token usage by ~85% on a real project last week — with identical output quality.
Here is the logic behind this "Model Routing" and how you can replicate it.
The Problem: Hiring an Architect to Paint Walls
Imagine you're writing a new feature. You type `/implement add a caching layer`. Typically, a high-end tool like Claude Code:
- Reads your entire codebase for context.
- Thinks through the architecture.
- Writes the boilerplate.
- Generates the imports.
- Formats the code.
Steps 1–2? That's Claude doing what only Claude can do (high-level reasoning).
Steps 3–5? That's a job for a 14B local model. By using Claude for all of it, you're paying senior rates for junior execution.
The Routing Table
The core insight is this: local models fail at reasoning, not execution. Give them a clear spec, and they write perfectly acceptable code. The spec is the hard part, and that's what Claude is for.

| Task | Routed to | Why |
| --- | --- | --- |
| Reading the codebase, thinking through architecture | Claude | High-level reasoning |
| Boilerplate, imports, formatting | Local 14B model | Cheap execution against a clear spec |
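Conceptually, the routing decision is just a lookup from task type to model tier. Here is a toy Bash sketch of that idea; the task labels and model names are illustrative, not the orchestrator's actual internals:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map a task type to the model tier that should handle it.
# Reasoning-heavy work goes to the expensive model; mechanical
# execution goes to the cheap local one.
route_task() {
  case "$1" in
    plan|architecture|spec)   echo "claude" ;;      # high-level reasoning
    code|boilerplate|format)  echo "ollama:14b" ;;  # execution from a clear spec
    *)                        echo "claude" ;;      # when unsure, don't risk it
  esac
}

route_task plan    # -> claude
route_task code    # -> ollama:14b
```

The fallback branch matters: a misrouted reasoning task produces bad code silently, while a misrouted execution task only wastes a few tokens.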
The Pipeline: Claude Plans, Ollama Codes
I built ai-orchestrator — a pure Bash setup that wires Claude Code and Ollama into a single workflow:
```
> /implement add a Redis caching layer to the user service
```
What happens under the hood:
- Planner (Claude): Analyzes the task and generates `task_context.md`.
- Coder (Ollama `qwen3-coder:30b`): Takes the spec and writes the actual code.
- Validator: Runs `tsc --noEmit` (or `mypy`, etc.) to ensure the syntax is correct.
- Reviewer (Ollama `qwen2.5-coder:7b`): Checks N files in parallel for logic errors.
- Fix Loop: Automatically iterates up to 3 rounds if the build fails.
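The steps above can be sketched as a plain Bash skeleton. Everything here is illustrative: `run_planner`, `run_coder`, and friends are stand-ins for the real `claude`/`ollama`/`tsc` invocations (stubbed so the skeleton runs anywhere), and the actual script's function names may differ:

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_FIX_ROUNDS=3

# Stubs standing in for the real CLI calls (claude / ollama / tsc).
run_planner()   { echo "plan written to task_context.md"; }
run_coder()     { echo "code written by local model"; }
run_validator() { return 0; }   # e.g. tsc --noEmit; non-zero means build failed
run_reviewer()  { echo "review ok"; }
run_fixer()     { echo "fix attempt by local model"; }

run_planner      # 1. Claude: task -> compact spec
run_coder        # 2. Ollama: spec -> code

# 3 & 5. Validate, re-entering the fix loop up to MAX_FIX_ROUNDS times.
round=0
until run_validator; do
  round=$((round + 1))
  if [ "$round" -gt "$MAX_FIX_ROUNDS" ]; then
    echo "build still failing after $MAX_FIX_ROUNDS fix rounds" >&2
    exit 1
  fi
  echo "fix round $round"
  run_fixer
done

run_reviewer     # 4. Ollama: logic review
echo "pipeline done"
```

The key property is that only step 1 ever talks to Claude; every retry in the fix loop burns local tokens, not API credit.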
The Result: Claude only sees your task description + a compact plan (~300–500 tokens). Ollama handles the bulk of the file contents at zero cost.
Configuration & Setup
One JSON file controls the entire brain of the operation. You can swap models instantly:
```json
{
  "models": {
    "coder": "qwen3-coder:30b-a3b-q4_K_M",
    "reviewer": "qwen2.5-coder:7b",
    "commit": "qwen2.5-coder:7b",
    "embedding": "nomic-embed-text"
  }
}
```
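Since the orchestrator is pure Bash + jq, reading a model name out of that config is a single `jq` call. A minimal sketch, assuming the config lives in a `config.json` shaped like the snippet above (the real script may organize this differently):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the sample config from the article (normally it ships with the repo).
cat > config.json <<'EOF'
{
  "models": {
    "coder": "qwen3-coder:30b-a3b-q4_K_M",
    "reviewer": "qwen2.5-coder:7b",
    "commit": "qwen2.5-coder:7b",
    "embedding": "nomic-embed-text"
  }
}
EOF

# Look up the model assigned to a given role; -r strips the JSON quotes.
get_model() { jq -r --arg role "$1" '.models[$role]' config.json; }

coder_model=$(get_model coder)
echo "coder -> $coder_model"   # prints: coder -> qwen3-coder:30b-a3b-q4_K_M
```

Swapping models really is a one-line config edit: nothing else in the scripts has to know which model fills which role.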
Install in one line
The installer detects your hardware (RAM/VRAM) and automatically picks the right model tier for you.
```shell
curl -sSL https://raw.githubusercontent.com/Mybono/ai-orchestrator/main/scripts/install.sh | bash
```
Requirements: Claude Code CLI + Ollama. No Python or Node runtime needed — just pure Bash and jq.
Real World Results
On a recent TypeScript project with 12 files changed:
- Claude processed: Task description + generated plan.
- Ollama wrote: All 12 files.
- Token Savings: ~85% compared to a pure Claude Code workflow.
- Quality: The code passed type-check and review on the first round.
Key Features:
- `/implement` — Full plan → code → build check → review pipeline.
- `/review` — Check the current diff against project standards.
- `/stats` — Track your token savings (day/week/month).
- `/commit` — Let a local LLM write your commit messages.
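As an illustration of what `/commit` might do under the hood: pipe the staged diff into the local model and use the reply as the message. This is a guess at the mechanics, not the repo's actual code, and the `ollama` call is stubbed here so the sketch runs without a model installed:

```shell
#!/usr/bin/env bash
set -euo pipefail

COMMIT_MODEL="qwen2.5-coder:7b"   # matches the "commit" role in config.json

# Stub: the real call would be `ollama run "$COMMIT_MODEL"` reading the
# prompt from stdin. Replace this function body to use the actual CLI.
ollama_run() { echo "feat: add Redis caching layer to user service"; }

generate_commit_message() {
  local diff
  diff=$(git diff --staged 2>/dev/null || true)
  printf 'Write a one-line conventional commit message for this diff:\n%s\n' "$diff" \
    | ollama_run "$COMMIT_MODEL"
}

msg=$(generate_commit_message)
echo "$msg"
# A real /commit would then run: git commit -m "$msg"
```

Commit messages are a perfect local-model task: the diff is all the context needed, and a wrong answer costs you one `git commit --amend`, not a broken build.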
Final Thought
Claude is expensive because reasoning is expensive. Don't spend it on writing for-loops.
👉 Repo: https://github.com/Mybono/ai-orchestrator
What's your current setup for managing AI costs? Are you running anything locally or sending everything to the cloud? Let's discuss in the comments!
#ai #productivity #programming #llm #opensource
