DEV Community

Artur DefToExplore


I stopped using Claude for 80% of my coding tasks. Here's what I use instead.

Stop Using "Senior" AI for Junior Tasks: How I Cut Token Costs by 85%

Here’s a take most developers get wrong: The best AI tool isn't the most powerful one. It's the right one for the job.

I use Claude Code every day. I also use a local 14B model that can't hold a candle to it. But routing tasks between them cut my token usage by ~85% on a real project last week — with identical output quality.

Here is the logic behind this "Model Routing" and how you can replicate it.


The Problem: Hiring an Architect to Paint Walls

Imagine you're writing a new feature. You type /implement add a caching layer. Typically, a high-end tool like Claude Code:

  1. Reads your entire codebase for context.
  2. Thinks through the architecture.
  3. Writes the boilerplate.
  4. Generates the imports.
  5. Formats the code.

Steps 1–2? That's Claude doing what only Claude can do (high-level reasoning).
Steps 3–5? That's a job for a 14B local model. By using Claude for all of it, you're paying senior rates for junior execution.

The Routing Table

The core insight is this: Local models fail at reasoning, not execution. Give them a clear spec, and they write perfectly acceptable code. The spec is the hard part — that's what Claude is for.
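A minimal sketch of what such routing could look like in Bash. The keyword buckets below are illustrative, not the tool's actual logic — the point is only that the decision is cheap to make up front:

```shell
#!/usr/bin/env bash
# Hypothetical router: reasoning-heavy tasks go to Claude,
# mechanical tasks go to a local Ollama model.
route_task() {
  local task="$1"
  case "$task" in
    *architect*|*design*|*plan*|*spec*)
      echo "claude" ;;    # high-level reasoning -> senior model
    *boilerplate*|*format*|*imports*|*rename*)
      echo "ollama" ;;    # mechanical execution -> local model
    *)
      echo "claude" ;;    # default to the stronger model when unsure
  esac
}

route_task "generate boilerplate for cache client"   # -> ollama
route_task "design the caching architecture"         # -> claude
```

Defaulting to the stronger model on ambiguous input is deliberate: a wrong route to the local model costs a failed attempt, while a wrong route to Claude only costs a few tokens.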


The Pipeline: Claude Plans, Ollama Codes

I built ai-orchestrator — a pure Bash setup that wires Claude Code and Ollama into a single workflow:

> /implement add a Redis caching layer to the user service

What happens under the hood:

  1. Planner (Claude): Analyzes the task and generates task_context.md.
  2. Coder (Ollama qwen3-coder:30b): Takes the spec and writes the actual code.
  3. Validator: Runs tsc --noEmit (or mypy, etc.) to ensure syntax is correct.
  4. Reviewer (Ollama qwen2.5-coder:7b): Checks N files in parallel for logic errors.
  5. Fix Loop: Automatically iterates up to 3 rounds if the build fails.

The Result: Claude only sees your task description + a compact plan (~300–500 tokens). Ollama handles the bulk of the file contents at zero API cost.
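The control flow of those five steps can be sketched in plain Bash. Everything here is schematic: the `claude` and `ollama` calls are replaced by placeholder functions, and the real script's flags and file names may well differ:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder stubs for the real CLI calls (illustrative only).
plan_with_claude()   { echo "plan: add Redis cache to user service" > task_context.md; }
code_with_ollama()   { echo "// generated code for: $(cat task_context.md)"; }
validate()           { true; }   # stands in for: tsc --noEmit
review_with_ollama() { true; }   # parallel logic review of changed files

plan_with_claude                 # 1. Claude writes the spec once
for attempt in 1 2 3; do         # 5. fix loop: up to 3 rounds
  code_with_ollama > generated.ts  # 2. local model writes the code
  if validate && review_with_ollama; then   # 3 + 4. build check + review
    echo "pipeline passed on attempt $attempt"
    break
  fi
done
```

Note that Claude is called exactly once, outside the loop — retries are handled entirely by the free local models.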


Configuration & Setup

One JSON file controls the entire brain of the operation. You can swap models instantly:

{
  "models": {
    "coder":     "qwen3-coder:30b-a3b-q4_K_M",
    "reviewer":  "qwen2.5-coder:7b",
    "commit":    "qwen2.5-coder:7b",
    "embedding": "nomic-embed-text"
  }
}
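Since the orchestrator is Bash + jq, pulling a model name out of that config is a one-liner. A sketch, assuming the file is saved as `config.json` (the actual path in the repo may differ):

```shell
# Write the example config, then read the coder model with jq.
cat > config.json <<'EOF'
{
  "models": {
    "coder":     "qwen3-coder:30b-a3b-q4_K_M",
    "reviewer":  "qwen2.5-coder:7b"
  }
}
EOF

CODER_MODEL="$(jq -r '.models.coder' config.json)"
echo "$CODER_MODEL"   # qwen3-coder:30b-a3b-q4_K_M
```

Swapping models really is just an edit to this file — no code changes needed.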

Install in one line

The installer detects your hardware (RAM/VRAM) and automatically picks the right model tier for you.

curl -sSL https://raw.githubusercontent.com/Mybono/ai-orchestrator/main/scripts/install.sh | bash

Requirements: Claude Code CLI + Ollama. No Python or Node runtime needed — just pure Bash and jq.
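Before piping the installer into bash, you can verify the prerequisites yourself. A small check script (tool names as listed above; adjust if your binaries are named differently):

```shell
#!/usr/bin/env bash
# Report whether a CLI tool is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# The three tools the orchestrator needs.
for tool in claude ollama jq; do
  check_tool "$tool"
done
```

`command -v` is the portable way to test for a binary; it works in any POSIX shell, which matters for a pure-Bash project.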


Real World Results

On a recent TypeScript project with 12 files changed:

  • Claude processed: Task description + generated plan.
  • Ollama wrote: All 12 files.
  • Token Savings: ~85% compared to a pure Claude Code workflow.
  • Quality: The code passed type-check and review on the first round.
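As a sanity check on that 85% figure: if Claude's share of the work drops from processing full file contents to just the task description and plan, the arithmetic is simple. The token counts below are invented for illustration, not measurements from the project:

```shell
# Illustrative token math, not real measurements.
full_workflow=20000     # tokens Claude would process writing all files itself
routed_workflow=3000    # task description + compact plan only
savings=$(( (full_workflow - routed_workflow) * 100 / full_workflow ))
echo "${savings}%"      # 85%
```

The exact percentage varies per project, but the shape is fixed: savings scale with how much of the diff the local model writes.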

Key Features:

  • /implement — Full plan → code → build check → review pipeline.
  • /review — Check current diff against project standards.
  • /stats — Track your token savings (day/week/month).
  • /commit — Let a local LLM write your commit messages.

Final Thought

Claude is expensive because reasoning is expensive. Don't spend it on writing for-loops.

👉 Repo: https://github.com/Mybono/ai-orchestrator


What's your current setup for managing AI costs? Are you running anything locally or sending everything to the cloud? Let's discuss in the comments!

#ai #productivity #programming #llm #opensource
