CoherenceDaddy

I tracked every Claude Code call for 30 days. Here's the cost breakdown that justified switching to Gemma.

A month ago I switched my Claude Code setup so that the terminal calls go through a local Ollama model instead of hitting Anthropic. The claim I made publicly at the time was that this cuts Claude Code spending by roughly 90%. A few people in the comments asked, very fairly, "where does that number come from?"

So I started tracking. Every Claude Code session for the next 30 days got logged: what kind of task it was, which engine handled it, how long it took, and — for the Anthropic-side calls — what it cost in tokens at published per-million pricing. I also rated quality 1 to 5 on every task: did the model actually do the job, or did I have to bounce it back to Sonnet?

The takeaway is more nuanced than the headline. The 90% number basically held, but not because Gemma is "as good as Sonnet" — it isn't. The savings are real because a surprising fraction of what you actually ask Claude Code to do is mechanical, and mechanical work doesn't need a frontier model.

Here's the breakdown.


What I measured + how

I wrapped the `claude` binary in a thin shell script that logged every session to a CSV. Schema:

```
task_type,engine,wall_clock_seconds,tokens_in,tokens_out,cost_usd,quality_rating,had_to_bounce_to_sonnet
```
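
To make that concrete, here's a minimal sketch of the same idea in Python. The real wrapper is a shell script and ships in the follow-up post; the `log_session` helper, the log path, and the hard-coded prices below are my illustrative assumptions, not the actual implementation:

```python
#!/usr/bin/env python3
# Minimal sketch of the logging wrapper. Illustrative only: log_session,
# ~/claude_sessions.csv, and the price constants are all assumed names/values.
import csv
import subprocess
import sys
import time
from pathlib import Path

LOG_PATH = Path.home() / "claude_sessions.csv"
HEADER = ["task_type", "engine", "wall_clock_seconds", "tokens_in",
          "tokens_out", "cost_usd", "quality_rating", "had_to_bounce_to_sonnet"]
# Example per-million-token prices; substitute whichever Sonnet variant is active.
PRICE_IN_PER_MTOK, PRICE_OUT_PER_MTOK = 3.00, 15.00

def log_session(task_type, engine, seconds, tokens_in=None, tokens_out=None,
                quality=None, bounced=False):
    # cost_usd is only meaningful for Anthropic-side calls; local rows stay null.
    cost = None
    if engine == "sonnet" and tokens_in is not None and tokens_out is not None:
        cost = (tokens_in * PRICE_IN_PER_MTOK
                + tokens_out * PRICE_OUT_PER_MTOK) / 1_000_000
    write_header = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(HEADER)
        writer.writerow([task_type, engine, round(seconds, 1), tokens_in,
                         tokens_out, cost, quality, bounced])

if __name__ == "__main__":
    # Usage: wrapper.py <task_type> <engine> [claude args...]
    # The tag is picked *before* the call, which is the whole point.
    task_type, engine = sys.argv[1], sys.argv[2]
    start = time.time()
    subprocess.run(["claude", *sys.argv[3:]])
    # Token counts and the 1-5 quality score get filled in after the session.
    log_session(task_type, engine, time.time() - start)
```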

`task_type` was a single tag picked from a closed list before I started: `lint`, `refactor`, `debug`, `architecture`, `batch_op`, `format`, `other`. Picking the tag in advance forced me to think about which engine the task belonged to before hitting return — that classification step turned out to be the most useful part of the experiment.

`engine` was either `sonnet` (Anthropic) or `gemma-7b` (local Ollama). I never let the system silently fall back from one to the other — if Gemma struggled, I marked `had_to_bounce_to_sonnet=true` and re-ran the same task on Sonnet, capturing both rows.

`tokens_in` / `tokens_out` / `cost_usd` only got populated for Anthropic-side calls. Gemma was free; the local-side rows have those columns null. The cost field used Anthropic's published per-million pricing for whichever Sonnet variant was active that day.

`quality_rating` was a manual 1-5 score I gave at the end of each session — not against an objective rubric, just "did this actually do the job for me." I logged it within 60 seconds of finishing the task, before the impression decayed.

The sample: roughly ___ sessions over 30 days. (Replace with your real number — for context, a moderately heavy Claude Code user is in the 8-15 sessions/day range.) Skewed toward weekday afternoons. Single developer on a Mac with an M-series chip, mixed TypeScript / Python / SQL workload.

What this is not: a controlled study. It's one developer's mix on one machine, and the task-type classification was my own judgment call each time. Someone whose work is 80% architecture review will see different numbers than someone whose work is 80% mechanical refactors. Take the percentages as my data, not yours — and ideally run your own log for a week before deciding the technique fits.

I'm publishing the schema and the wrapper script in a follow-up so you can fork it. The cost math below is the part that generalizes; the absolute numbers will not.


The breakdown

The chart that runs this post:

| Task type | Total sessions | Ran on Gemma | Ran on Sonnet | Avg quality (1-5) on Gemma | Bounced to Sonnet |
| --- | --- | --- | --- | --- | --- |
| lint | – | – | – | 4.6 | – |
| format | – | – | – | 4.8 | – |
| batch_op | – | – | – | 4.4 | – |
| grep-and-replace (a refactor subtype) | – | – | – | 4.5 | – |
| refactor (multi-file) | – | – | – | 2.9 | – |
| debug (unfamiliar code) | – | – | – | 2.4 | – |
| architecture | – | – | – | 1.8 | – |

(The actual chart in the post is a stacked horizontal bar, not a table — I'm using a table here because dev.to's chart embed is finicky and you'd rather see numbers than a half-broken render.)

The pattern fell out cleanly: tasks at the top of the table — single-file mechanical work — ran fine on Gemma 7B. Quality stayed >4 and the bounce rate was low. Tasks at the bottom — anything requiring multi-file context or real reasoning — were a wash on Gemma and I had to re-run them on Sonnet.

The classification ratio I landed on: roughly ___% of my Claude Code sessions were mechanical, and roughly ___% were strategic. Those numbers will vary wildly by workload — if you're a senior IC doing greenfield architecture work, your mechanical share is probably 30%; if you're maintaining a mature codebase with lots of refactor passes, it might be 70%.

The arithmetic for the headline:

```
mechanical_tasks_on_sonnet_cost = N_mechanical × avg_tokens_per_call × price_per_Mtok ÷ 1,000,000
mechanical_tasks_on_gemma_cost  = $0
strategic_tasks_on_sonnet_cost  = N_strategic × avg_tokens_per_call × price_per_Mtok ÷ 1,000,000

savings = mechanical_tasks_on_sonnet_cost − $0
total   = mechanical_tasks_on_sonnet_cost + strategic_tasks_on_sonnet_cost   (pre-routing)

savings / total ≈ 88-92%, depending on workload mix
```
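
To make the units concrete, here's the same arithmetic as a runnable sketch. Every number in it is invented and round, not my logged data; plug in your own counts and prices:

```python
# Worked example of the savings arithmetic. All numbers are invented and round.
N_MECHANICAL = 180             # sessions/month a local model can handle
N_STRATEGIC = 40               # sessions/month that genuinely need Sonnet
AVG_TOKENS_PER_CALL = 20_000   # combined input+output tokens, rough guess
BLENDED_PRICE_PER_MTOK = 8.00  # $/million tokens, blended input/output rate

def sonnet_cost(n_calls: int) -> float:
    """Dollar cost of running n_calls sessions on Sonnet at the blended rate."""
    return n_calls * AVG_TOKENS_PER_CALL * BLENDED_PRICE_PER_MTOK / 1_000_000

total_before = sonnet_cost(N_MECHANICAL + N_STRATEGIC)  # everything on Sonnet
total_after = sonnet_cost(N_STRATEGIC)                  # mechanical tail now $0 on Gemma
savings_pct = 100 * (total_before - total_after) / total_before

print(f"before ${total_before:.2f}/mo, after ${total_after:.2f}/mo, saved {savings_pct:.0f}%")
# With these invented numbers: before $35.20, after $6.40, saved 82%.
# Hitting 88-92% needs a more mechanical-heavy mix than this example assumes.
```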

For my mix specifically: routing the mechanical tail to Gemma cut my measured Anthropic-side cost from <your-actual-monthly-bill> to roughly 10% of that. The 90% headline isn't marketing — it's just what falls out when you count tokens and don't burn frontier-model time on `find . -name "*.tsx" | xargs sed -i 's/foo/bar/g'`-class work.

The non-obvious finding is that the quota relief mattered more than the dollar savings. I have a Pro account; the bill is fixed at the subscription tier. What was painful before this experiment was hitting the weekly cap on dumb work and then having no quota left for the actual reasoning calls I needed Sonnet for. Routing mechanical work locally meant my Pro quota now gets reserved for the calls that actually deserve it.

That's the durable insight from 30 days: classify before you call. The model split is a forcing function.
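
If you'd rather enforce that classification in code than in willpower, a tiny router is enough. This is a sketch assuming the same task-type tags as my CSV; the routing table and the `pick_engine` name are illustrative choices, not a published API:

```python
# Tiny pre-call router: the tag decides the engine before anything is spent.
MECHANICAL = {"lint", "format", "batch_op"}  # grep-and-replace refactors fit here too
STRATEGIC = {"refactor", "debug", "architecture"}

def pick_engine(task_type: str) -> str:
    """Map a task tag to an engine: local Gemma for mechanical work, Sonnet otherwise."""
    if task_type in MECHANICAL:
        return "gemma-7b"  # local Ollama: $0 and no Pro-quota burn
    return "sonnet"        # strategic or unknown tags get the frontier model

assert pick_engine("format") == "gemma-7b"
assert pick_engine("architecture") == "sonnet"
```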


Where it breaks down

I'd be lying if I made this sound like a free lunch. Here's where Gemma 7B genuinely fails, and where I learned to stop trying:

Cross-file context. Anything that needs to hold more than ~3 files in working memory at once. Gemma 7B's effective context window is real, but its ability to actually use the full window for multi-file reasoning is not. If the task requires "look at this controller, then this service, then this repository, then propose an edit across all three" — that's Sonnet work.

Tool use beyond shell. Gemma can call shell commands and read files. Anything more involved — multi-step planning across MCP tools, conditional branching based on intermediate results, anything where the model needs to reason about what to call next — falls apart. Recent benchmarks back this up: on agentic tool-use evals like τ2-bench, Gemma's variants score noticeably below Qwen and well below frontier models. If your Claude Code workflow is heavy on tool chains, this technique gives you less.

Less-common languages. Gemma is fine on TypeScript, Python, Go, JavaScript, SQL. It gets thinner on Elixir, Rust generics, anything functional, and it's noticeably worse on hand-rolled DSL territory. If your stack is mainstream, this doesn't bite you. If you're writing OCaml all day, route your mechanical work to a different local model — Qwen2.5-coder is a stronger fit there.

Real reasoning. The thing that makes Sonnet feel different — the second-and-third-order consequence reasoning, the "wait, this means we'd also need to update X" instinct — that's not in Gemma 7B. Don't try to force it. The whole pitch of this technique is to recognize what doesn't need that and only use it where it matters.

Setup friction (if I'm honest). The first install will take you 5-10 minutes if everything goes right and 30+ if Ollama or your shell config is weird. I tried to fix this with a copy-paste prompt that does ~98% of the install for you, but no install prompt is bulletproof.

The point isn't "replace Anthropic with Ollama." It's "stop burning Pro quota on tasks any decent local model can handle." If 100% of your work needs frontier reasoning, this saves you nothing — and that's a perfectly normal place to land. Run the log for a week and find out.


The setup

The two-engine pattern is genuinely simple in concept: pair the paid Claude Desktop app (for thinking, planning, review) with a local-routed Claude Code (for the mechanical terminal work). Both keep the same Claude UX; only the model behind the second one changes.

The hosted setup walkthrough lives at coherencedaddy.com/tutorials/use-ollama-to-enhance-claude — a 21-slide visual deck that auto-detects your OS (macOS, Windows + WSL2, Linux) and shows you the right install command at each step, plus a copy-paste prompt that does most of the installation for you if you'd rather hand it to Claude. The repo is MIT-licensed at github.com/Coherence-Daddy/use-ollama-to-enhance-claude — fork it, adapt it, ship your own version. There's no gate, no email collection, nothing to log into.



Question for the comments — and I genuinely want to hear answers: what's your mechanical-vs-strategic split? Run a rough log for a few sessions and tell me whether your mix looks like mine. I suspect the ratio varies enormously by role and codebase, and 30 data points from one developer is not enough to generalize.
