I Rebuilt My AI Agent From a 600-Line Script Into a Harness. Token Cost Dropped 40%.

#ai #agents #harness #claudecode

The 600-line script that ate my API budget

My first real agent was one Python file. About 600 lines. It read a task, pulled in everything it might conceivably need -- the full repo tree, every doc, the entire conversation history, all tool definitions -- jammed it into one giant prompt, and let the model loop until it declared victory.

It worked. That was the trap. It worked well enough that I didn't question the architecture for two months.

Then I looked at the bill.

A single medium task -- "add a field to this API and update the tests" -- was burning through tokens like a space heater left on in July. I traced one run: 14 model calls, and by call 9 the prompt had ballooned past 80,000 tokens. Most of it was sediment. Tool outputs from step 2 that nothing downstream ever read again. A directory listing the model had already digested. Three docs I'd loaded "just in case" that turned out to be never.

I was paying full price, every call, to re-send garbage the model had stopped caring about.

So I tore the script down and rebuilt it as a harness. Same model. Same tasks. Token cost dropped roughly 40%, and the output quality went up, not down. This is the before/after, with the three changes that did almost all of the work.

What "harness" actually means here

Quick definition, because the word got thrown around a lot in 2026 and half the time nobody agrees on it.

A harness is everything wrapped around the model that isn't the model: context management, tool execution, the loop that decides when to stop, the guardrails, the observability. Mitchell Hashimoto's framing from early 2026 is the one that stuck with me -- the agent is the model plus the harness, and you only control one of those two. You can't make the model smarter on a Tuesday afternoon. You can make the thing around it stop wasting its attention.

My 600-line script had a harness. It was just a bad one, smeared across the same file as the business logic, with every decision made implicitly. Rebuilding it didn't mean adding a framework. It meant pulling the harness out into its own thing and making three implicit decisions explicit.

Change 1: stop carrying the whole conversation forever

This was the big one. Maybe 25 of the 40 points came from here.

The old script appended every tool result to the running message list and re-sent the whole pile on every call. Step 12 was still hauling around the raw output of step 2. The model didn't need it. I was paying for it anyway, at the per-call rate, compounding.

The fix is a pattern Anthropic documented for long-running agents, and it sounds almost too simple: don't let one context window run the entire job. Reset it. Carry forward a small written summary of state instead of the full transcript.

Concretely, I split the loop into a planner and a worker. The worker does a chunk of the task, writes a short progress note to a file, and exits. The next worker starts in a clean context, reads the note, and keeps going. State lives on disk, not in the prompt.

Before -- everything accumulates in memory:

# the script: one ever-growing message list
messages = [system_prompt]
while not done:
    messages.append(call_model(messages))   # full history, every time
    messages.append(run_tools(messages[-1]))  # raw tool dumps pile up
# by iteration 12, messages is a small novel

After -- state on disk, context stays small:

# the harness: each worker starts fresh, reads a note
def worker(task):
    state = read_progress("progress.md")      # ~400 tokens, not 80k
    ctx = build_context(task, state)          # only what this step needs
    result = run_until_checkpoint(ctx)
    write_progress("progress.md", summarize(result))
    return result.done

Anthropic's own write-up on this calls the underlying problem "context anxiety" -- past a certain fill level, the model's output quality actually drops as the window gets crowded. So a bloated prompt isn't just expensive. It makes the agent worse. I'd been paying extra money to degrade my own results. That stung in a specific way.

The progress-file trick borrows from something every decent engineer already does without thinking: git history tells you what changed, and a scratch note tells you where you are. Give a fresh agent both and it picks up the thread in one read instead of re-deriving the world.

Change 2: gate the tools instead of showing all of them

The script handed the model all 18 tool definitions on every single call. Database tools, file tools, HTTP tools, a shell, the works. Tool schemas aren't free -- good descriptions are verbose on purpose, because vague ones cause misfires. Eighteen of them was a few thousand tokens of overhead per call, most of it irrelevant to whatever step we were on.

Worse than the cost: a model staring at 18 options picks wrong more often. Ask someone to "grab the tool from the drawer" when the drawer has 18 tools and you'll get the wrong one some of the time. Show them the three that fit the job and they're fine.

So I gated them. The harness decides which tools are even visible based on the current phase. Planning phase sees read-only tools. Editing phase sees file and shell tools. Nothing sees the database tools unless the task actually touches data.

# the harness exposes a phase-scoped subset, not the whole arsenal
TOOLSETS = {
    "plan":   ["read_file", "list_dir", "search"],
    "edit":   ["read_file", "write_file", "run_shell"],
    "verify": ["run_tests", "read_file"],
}

def tools_for(phase):
    return [TOOL_REGISTRY[name] for name in TOOLSETS[phase]]

This was maybe 8 of the 40 points on cost. The quieter win was accuracy: wrong-tool calls dropped noticeably, which meant fewer wasted retry loops, which is itself fewer tokens. The savings compound in directions you don't predict.

The 2026 guidance backs this up -- the move is to give the agent a small, relevant result set and let it stop when it has enough, not to dump the entire toolbox and the entire database into the window and hope. Less is genuinely more here, and for once that's not a poster slogan, it's a line item.

Change 3: load context in stages, not all upfront

The script's loading logic was "grab everything, then think." Full repo tree, all the docs, the lot, before the model had done anything.

Most of it was dead weight. For a task touching three files, the model does not need a map of all 400.

The harness loads in stages. Start with a thin slice: the task, a directory outline, the progress note. Let the model ask for more. It requests a file, the harness reads it in. It needs a doc, the harness fetches that one doc. Context grows to fit the actual problem instead of pre-loading for an imagined worst case.

# staged: the model pulls what it needs, the harness doesn't pre-stuff
def build_context(task, state):
    ctx = [task, state, dir_outline()]   # thin to start
    # files get pulled in on demand during the loop, not here
    return ctx

This was the last chunk of the 40%, and it's the one I'd most want to do over again because it overlaps with change 1. They're the same instinct from two angles: the window should hold what this step needs, and nothing it merely might have needed. Everything else lives outside the model -- on disk, behind a tool call, one fetch away.

That outside-the-model idea is the thread tying all three changes together. The 600-line script treated the context window as the place where everything had to be at once. The harness treats it as a workbench: you bring up the part you're working on, finish, put it back, bring up the next one.

The numbers, honestly

So I can stand behind them: this was my own multi-step coding agent, measured across the same batch of tasks before and after, same model, total tokens summed across all model calls in a run.

	600-line script	harness	change
avg tokens / task	~118k	~70k	-41%
wrong-tool calls / task	~2.3	~0.6	-74%
retry loops / task	~3	~1	-67%
my confidence it'd finish	"probably"	"yes"	priceless

The 40% is real but it's mine, not a benchmark. Your script's sediment is in different places than mine was. The published 2026 work lands in a similar zone -- context-mode patterns that keep bulky tool output out of the window report north of 20% reduction with no accuracy hit, and the broad finding that token usage explains most of the performance variance in multi-agent setups, more than model choice. The exact percentage isn't the point. The direction is: the cost was never the model. It was what I kept feeding it.

What I'd tell past me

The thing I got wrong for two months: I thought "make the agent better" meant a better prompt or a better model. It meant a better harness. The model never changed. The script and the harness ran the identical model on the identical tasks, and one cost 40% less because it stopped treating the context window like a junk drawer.

If you've got a script doing real work right now, you don't need to rewrite it into a framework. You need three smaller things:

Reset the context. Stop re-sending the whole transcript. Keep state in a file, start fresh, read the file.
Gate the tools. Show the model the three tools the current step needs, not all eighteen.
Stage the loading. Start thin. Let the model pull more on demand instead of pre-stuffing for a worst case that rarely arrives.

I wrote about a related failure -- letting the agent grade its own homework -- when I turned on auto-approve and broke CI in 30 minutes. Same lesson from a different angle: the harness is where the wins are, not the model.

Find the call where your prompt is biggest. Look at what's in it. I'd bet most of it is sediment the model stopped reading three steps ago. That sediment is your 40%.

Want the full picture? Context resets, tool gating, observability, and the rest of the harness anatomy are the spine of Harness Engineering Guide -- the book this article is pulled from, on building systems that control AI agents instead of just prompting them.