Ken Imoto

Posted on Jun 27 • Originally published at doi.org

When the Free Executor Cost More: 40 Trials on Opus + Local Qwen Ended Up the Most Expensive Cloud Arm

#agents #ai #llm #programming

"Use a strong model to orchestrate, a cheap model to execute." This is now the default cost-aware recipe for agentic coding.

I believed it and ran the experiment. Opus 4.7 as the orchestrator, locally-hosted Qwen 3.5-9B (zero token cost) as the executor. This should beat running Opus alone on cost. Has to.

It did the opposite.

The supposedly "free" configuration (Opus + Qwen) came out as the most expensive cloud arm on all three of the code-repair tasks I ran. Higher than Opus solo. Higher than Opus + Haiku. And of course much higher than Haiku solo. As someone who actually built a GPU PC believing "local means cheap," I find this somewhat inconvenient.

I wrote up the 40 trials worth of numbers and the mechanism as a paper, archived on Zenodo: DOI 10.5281/zenodo.20978074 / GitHub repo.

This post walks through what happened across the 40 trials and why "free" turned out to be the most expensive option — all from real measurements.

TL;DR

40 trials × 4 configurations × 3 tasks, judged by a deterministic harness (mypy + ruff + pytest exit codes only). No LLM-as-judge anywhere in the loop.

Opus orchestrates + Qwen executes is the most expensive cloud arm on every task. More expensive than Opus solo.
The cause is not the executor's tokens — it's the orchestrator's prompt-cache re-reads. Opus keeps reading Qwen's returned summaries on every turn, and its own input volume grows to 1.4–5.3× that of Opus running alone.
Haiku solo is 5.5× cheaper than Opus solo on the largest task — but fails 25% of the time within the per-arm iteration cap. Within cloud-only options, Opus + Haiku is the most balanced.

If the intuition "the executor's tokens are free, therefore this is cheap" feels obvious, this post is about why that intuition breaks.

What I Measured

The four arms

arm	orchestrator	executor	role split
A	Opus 4.7	(solo)	one model does everything
B	Opus 4.7	Qwen 3.5-9B (local / Ollama)	Opus plans + verifies, Qwen edits
C	Opus 4.7	Haiku 4.5 (Anthropic SDK sub-loop)	Opus plans + verifies, Haiku edits
D	Haiku 4.5	(solo)	one cheap model does everything

All four arms use the Anthropic SDK with the same tool surface: str_replace_editor (view/create/str_replace/insert) and a bash tool with a 120-second timeout. The orchestrator arms (B, C) get one extra tool: delegate_to_executor.

Anthropic prompt caching is enabled identically on every call — system, tool definitions, and the most recent user message are marked with cache_control: ephemeral. No temperature or seed is set, so trial-to-trial variance reflects sampling noise.

The three tasks

All three operate on the typer repository at commit b210c0e (v0.26.8, MIT license). Each trial starts with git checkout -- . && git clean -fd to restore the base state.

T1 — Breakage recovery: 25 errors injected via AST (10 mypy + 10 ruff + 5 pytest collection failures). The agent has to return the harness to fully green.
T2 — Refactor: Move get_params_from_function from typer/utils.py to a new module typer/_param_extractor.py. Update every import site. All tests still passing.
T3 — Feature-add: Implement get_version_banner(prefix, uppercase) -> str, re-export from typer/__init__.py, and pass a SHA-256-fingerprinted test file.

The judge

mypy + ruff check + pytest — exit code 0 = success, anything else = failure. Per-task verifiers (verify-T2.sh / verify-T3.sh) add structural checks (function actually moved to the new module, fingerprinted test unmodified, etc.).

No LLM is ever asked "is this OK?". The judgment is deterministic and reproducible.

Results (success-only medians, n=3 per cell)

arm	task	n_succ/total	wall (s)	iters	cost ($)	success rate
A Opus solo	T1	3/3	253	36	1.74	1.00
A Opus solo	T2	3/4	233	26	1.11	0.75
A Opus solo	T3	3/3	69	6	0.17	1.00
B Opus+Qwen	T1	3/4	484	38	2.27	0.75
B Opus+Qwen	T2	3/3	443	27	1.38	1.00
B Opus+Qwen	T3	3/3	348	12	0.42	1.00
C Opus+Haiku	T1	3/3	400	28	1.67	1.00
C Opus+Haiku	T2	3/3	275	20	0.92	1.00
C Opus+Haiku	T3	3/3	145	11	0.38	1.00
D Haiku solo	T1	3/4	758	89	0.30	0.75
D Haiku solo	T2	3/4	507	70	0.23	0.75
D Haiku solo	T3	3/3	208	29	0.08	1.00

Bold = per-column winner. Total Anthropic spend across 40 trials: $35.98 — cheap for a paper.

The row worth staring at is arm B. On all three tasks, its cost ($) is the cloud-arm worst ($2.27 / $1.38 / $0.42). Qwen's tokens cost zero. Opus + Qwen is more expensive than Opus alone anyway.

Why "Free" Cost the Most

Compare Opus-side token consumption (input + cache_read_input) across arms:

arm role	T1 (Opus-side in + cache_r)	T2	T3
A (Opus solo)	534,586	226,474	13,320
B (Opus + Qwen)	733,142	313,914	62,864
C (Opus + Haiku)	421,622	159,640	44,016

B-over-A ratio (Opus-side only): 1.38× on T1, 1.39× on T2, 5.26× on T3.

Qwen's tokens are free. But Opus itself is reading 1.4–5.3× more tokens than it would running alone.

The mechanism. When Opus calls delegate_to_executor, Qwen returns a stdout summary (capped at 4000 chars in my implementation). That summary lands in Opus's context. Anthropic prompt caching marks the most recent message for cache_write, and the next turn reads it back via cache_read. Across 30–80 turns, Opus ends up reading the "what Qwen did" summary over and over and over.

Each re-read is billed at the cache_read rate ($1.50/M token = 10% of Opus input). The executor is free; the orchestrator is not. Which sounds obvious in hindsight, except the word "free" in a sentence tends to short-circuit human reasoning. Mine, anyway.

Stated correctly: the orchestrator's cost is proportional to how many times it re-reads the executor's returned summaries, not to the executor's raw token count. This reads more like a middle-management observation than an LLM finding, but the data says what it says.

Why T3 Blew Up to 5.3×

The most extreme case is T3, the smallest task — about 6 iterations.

Same mechanism, different ratio. The base context (system + tools + initial prompt) is cache_write-ed once on the first turn and cache_read cheaply thereafter. On long tasks (T1, T2), that base is a small fraction of the cumulative input. On a short task, it's a big fraction. So "base re-read every turn + executor summary re-read every turn" overhead dominates everything else, and T3's B/A ratio spikes to 5.3×.

Conversely, arm C (Opus + Haiku) has a smaller cache_read footprint than arm A on T1 and T2 (0.79× and 0.70× of A). Haiku does substantive work that Opus would have otherwise had to do itself, and the substance translates into useful summaries instead of dead weight. Which is the opposite end of the Qwen-summary-bloat story.

When Orchestration Would Win (cases I deliberately excluded)

The "strong orchestrator + cheap executor" recipe falters in iterative tool-loops because, over dozens of turns, the orchestrator's cache_read becomes the dominant cost line. One-shot routing has no such problem.

The experiment was, in that sense, designed against arm B:

Executor returns are free-form (Qwen stdout summary up to 4000 chars). If you constrain returns to "one structured diff and nothing else," the orchestrator's accumulated context shrinks.
Tasks are sequential (T1/T2/T3 cannot be parallelized within a single trial). Tasks where the orchestrator can dispatch "go edit these three places at once" might pay for orchestration overhead.

Re-running arm B with tightly-bounded executor returns is the next experiment on my list. I expect T3 to invert. T1 is harder to call.

Practical Takeaways

Sitting with these numbers, here is how my own agentic coding setup changed:

For tasks that finish in a handful of iterations, Opus solo is the cheapest cloud option. T3: $0.17, 69 seconds, 6 iterations — cloud-best. "Opus is expensive" is a one-shot framing. Across an iterative loop, Opus's per-iteration efficiency pays.
For tasks that need dozens of iterations, the model with the lowest per-iteration cost wins on dollars. T1: Haiku solo at $0.30 is 5.5× cheaper than the cheapest cloud arm. But it fails 25% of the time, so retry-adjusted expected cost narrows the gap to 4.2×.
For cloud-only setups, Opus + Haiku is the most balanced. Ties Opus solo on T1, wins T2 on cost, narrowly loses to Opus solo on T3. The safe pick if you don't want Haiku-solo's failure rate.
If you're going to use a local Qwen "for free," constrain the executor return size structurally. Free-form stdout returns just shift the cost to the orchestrator's cache_read line.

"Strong + cheap" composition has a narrower design surface than it seems. Unless you also specify what and how much the executor is allowed to return, you regenerate the "orchestrator-becomes-expensive" pattern. I regenerated it three times suspecting measurement error before finally accepting it.

Limitations

The honesty section:

n=3 per cell. Mann-Whitney U p-values use a normal approximation where 0.050 is the small-sample floor — it means "as different as this sample size can show." Trust the Cliff's delta effect sizes; don't over-read p-value differences.
All three tasks are on the typer repo. Generalization needs other codebases. The harness, breakage injector, runner, and analysis are all MIT-licensed in the repo, so reproducing this on your own codebase is cheap.
The orchestrator system prompt is asymmetric. It instructs "do not edit directly, delegate instead." This mirrors a real deployment shape but is a real confounder in the results.

Reproducing

Tested on Ubuntu 22.04 with Python 3.10+, uv 0.4+, and anthropic Python SDK 0.83+. Arm B uses Ollama 0.4+ running qwen3.5:9b. If you skip arm B, Ollama is not needed.

git clone https://github.com/kenimo49/free-executor-paradox
cd free-executor-paradox
# run arm A on T3, one trial
python scripts/runners/runner.py --arm A --task T3 --trial 1

The repo README and paper PDF have the full reproducibility setup — harness, breakage injection, runner, and analysis scripts are all included.

Closing

The cost debate around agentic coding tends to fixate on what the executor costs per token. The dominant term is actually what the orchestrator re-reads, and how often. Qwen here is just one instance of the pattern — every "free local executor" that comes next will hit the same issue. Free executor tokens don't make orchestrator cache_read free.

I wrote it up as a paper because numbers are harder to argue with than vibes. The most satisfying outcome would be someone replying "got the same thing on my codebase" or "actually got the opposite, here's why."

Paper (Zenodo): When Free Executors Cost More — DOI 10.5281/zenodo.20978074
Code + data + harness: https://github.com/kenimo49/free-executor-paradox
GitHub Release: v1.0.0

Top comments (1)

kensichgam • Jun 28

An interesting article. It would be interesting to see how the costs compare if the executor also summarizes all the previous stdouts before passing it to the orchestrator