AttestDojo

Posted on Jun 10 • Edited on Jun 15 • Originally published at dev.to

We Cut Our AI Agent Costs by 60%. Here's What Worked.

#ai #llm #costoptimization #agents

We run a self-healing AI agent system (Kaizen Harness, open source on GitHub). Council debates on architecture, daily tech scans, trajectory logging, automated patching. Tokens add up fast. After a month of tuning, we cut costs 60% with no measurable quality loss. Here are the patterns that moved the needle, from biggest impact down.

1. Context engineering: stop re-reading your own history

This was the single biggest win. Our agents were burning 40-50% of tokens re-parsing conversation history that hadn't changed since turn 3. The fix, derived from production patterns used by Manus and Cognition:

Append-only design. Every agent response starts with a [STATUS] header that replaces the full history recap: goal, completed steps, next step. Three short fields.

[STATUS] Building PR auto-review pattern. Step 2/4 complete (diff parser done). Next: wire council debate.

The model treats it as an attention anchor. No re-reading 2,000 tokens of conversation to remember where we are.
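A minimal sketch of rendering that header, assuming the three fields are tracked in a small record. `AgentStatus` and `render_status` are illustrative names, not taken from the Kaizen Harness repo.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    """Hypothetical container for the three fields in the [STATUS] header."""
    goal: str
    completed: str
    next_step: str


def render_status(status: AgentStatus) -> str:
    # One compact line emitted at the top of every agent response.
    return f"[STATUS] {status.goal}. {status.completed}. Next: {status.next_step}."


header = render_status(AgentStatus(
    goal="Building PR auto-review pattern",
    completed="Step 2/4 complete (diff parser done)",
    next_step="wire council debate",
))
print(header)
```

Keeping the fields structured makes it easy to update one of them per turn without rewriting the whole line by hand.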

Static tool definitions first. Our tool registry is ~800 tokens of JSON schemas. Placing it before any dynamic content keeps the prompt prefix identical from turn to turn, so the KV cache can reuse it. Moving tool definitions from the middle of prompts to the top saved ~15% per session.
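One way to picture the ordering, as a sketch only: `build_prompt` and its arguments are hypothetical, and the example assumes the cache matches on an exact, unchanged prefix. The static part is serialized deterministically so it comes out byte-identical on every turn, and the dynamic turns are only ever appended after it.

```python
import json


def build_prompt(tool_schemas: list, system_rules: str, dynamic_turns: list) -> str:
    # Static prefix: sort_keys makes the JSON serialization deterministic,
    # so the same registry always produces the same bytes.
    static_prefix = system_rules + "\n" + json.dumps(tool_schemas, sort_keys=True)
    # Dynamic suffix: conversation content goes after the stable prefix.
    return static_prefix + "\n" + "\n".join(dynamic_turns)


tools = [{"name": "read_file", "parameters": {"path": "string"}}]
turn_1 = build_prompt(tools, "You are a patching agent.", ["user: fix the test"])
turn_2 = build_prompt(tools, "You are a patching agent.",
                      ["user: fix the test", "agent: [STATUS] ..."])
print(turn_2.startswith(turn_1))
```

If anything dynamic (a timestamp, a session id) sneaks into the prefix, the shared prefix shrinks to whatever precedes it, so it is worth checking this in a test.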

Compaction trigger. After turn 5, auto-insert a [CONTEXT UPDATE] block summarizing everything the agent needs. Old context is not deleted, but it's no longer in the active attention window. This alone cut our long-session costs by 35%.
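A rough sketch of such a trigger, under stated assumptions: `active_context` is a hypothetical helper, `summarize` stands in for whatever produces the summary (in our case the agent itself), and "no longer in the active window" is modelled here as "not sent with the next call". The full turn list is left untouched.

```python
COMPACT_AFTER_TURN = 5  # threshold from the text


def active_context(turns: list, summarize) -> list:
    """Return the messages to send on the next call; `turns` itself is never modified."""
    if len(turns) <= COMPACT_AFTER_TURN:
        return list(turns)
    older, recent = turns[:COMPACT_AFTER_TURN], turns[COMPACT_AFTER_TURN:]
    # The summary replaces the older turns in the active prompt only.
    update = "[CONTEXT UPDATE] " + summarize(older)
    return [update] + recent


history = [f"turn {i}" for i in range(1, 8)]
context = active_context(history, lambda older: f"{len(older)} earlier turns summarized")
print(context)
```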

2. Route by task tier, not by default model

Our default was "call Claude for everything." Claude is great at creative reasoning. It is also expensive for tasks that don't need it.

We split tasks into three tiers:

Tier	Task	Model	Cost vs Claude
Creative	Architecture decisions, debate synthesis, public-facing content	Claude Sonnet 4	1x
Planning	Feature scoping, issue triage, PRD drafts	DeepSeek V3.2	0.1x
Utility	Log parsing, health checks, format validation	Gemini 2.5 Flash	0.02x

The tier names are in the prompt. The agent classifies its own task before choosing a model. Simple routing cut our total spend by half, because 70% of agent tasks are utility and planning, not creative reasoning.
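The routing step can be sketched roughly like this. The tier labels and model names come from the table; `Tier`, `MODEL_BY_TIER`, and `pick_model` are illustrative names, and falling back to the creative tier on an unrecognized label is an assumption made for this example, not something the harness is known to do.

```python
from enum import Enum


class Tier(Enum):
    CREATIVE = "creative"
    PLANNING = "planning"
    UTILITY = "utility"


MODEL_BY_TIER = {
    Tier.CREATIVE: "Claude Sonnet 4",
    Tier.PLANNING: "DeepSeek V3.2",
    Tier.UTILITY: "Gemini 2.5 Flash",
}


def pick_model(tier_label: str) -> str:
    """Map the agent's self-reported tier to a model name."""
    try:
        tier = Tier(tier_label.strip().lower())
    except ValueError:
        # Assumption: when the label is unrecognized, prefer quality over cost.
        tier = Tier.CREATIVE
    return MODEL_BY_TIER[tier]


print(pick_model("Utility"))
```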

3. Local models for private tasks

Some runs should never touch a cloud API. System health checks, internal logs, config validation. We added Ollama + MLX models as first-class seats in the council debate script:

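Qwen3.6 35B MoE (3B active) for reasoning tasks. Only 3B params are active per token, which keeps generation fast, but all 35B weights still have to be resident in memory (roughly 18GB at 4-bit), so size your RAM for the full model rather than the active slice.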
North Mini Code 1B (4-bit) for code diffs and syntax checks. Sub-second on M4.

These have no per-call dollar cost, but the bigger win is latency: high-frequency tasks no longer wait on an API round trip. Our self-healing loop now runs entirely locally: failure detection, classification, and patching never leave the machine.
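As a sketch of keeping private tasks on the machine, assuming an Ollama server on its default port and its `/api/generate` endpoint: the task-kind names, `seat_for`, `build_local_request`, and `run_local` are all hypothetical, and the model tag is whatever you have pulled locally.

```python
import json
import urllib.request

# Hypothetical labels for the task kinds listed above.
PRIVATE_TASKS = {"health_check", "internal_logs", "config_validation"}

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def seat_for(task_kind: str) -> str:
    """Return "local" for private task kinds, "cloud" for everything else."""
    return "local" if task_kind in PRIVATE_TASKS else "cloud"


def build_local_request(model_tag: str, prompt: str) -> urllib.request.Request:
    # stream=False asks for one JSON object instead of a token stream.
    body = json.dumps({"model": model_tag, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_local(model_tag: str, prompt: str, timeout: float = 60.0) -> str:
    # Needs a running Ollama server; the generated text is in the "response" field.
    request = build_local_request(model_tag, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Checking `seat_for(...)` before any network call is the part that matters: a private task should have no code path that can reach a cloud client.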

4. What didn't work

Prompt compression tools. Tried three different "auto-summarize your context" libraries. All of them lost critical details the agent needed later. Manual compaction triggers (the [CONTEXT UPDATE] pattern above) worked better because the agent decides what matters.

"Just use a cheaper model for everything." Swapping Claude for DeepSeek on creative tasks produced technically correct but flat advice. No edge detection. The tiered routing was necessary because quality degrades on the wrong task type.

The numbers

Running 3 agents in continuous mode for 30 days before and after:

Metric	Before	After
Monthly API spend	$410	$165
Avg tokens per session	12,400	5,100
Council debate cost	$0.48/debate	$0.14/debate
Context rot sessions (>10 turns, quality degrades)	22%	6%

No increase in error rates. Self-healing success rate unchanged at 91%.
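For anyone who wants to check the headline figure against the table, here is the arithmetic as a short script. The inputs are the before/after values above; `reduction_pct` is just a helper name for this example.

```python
rows = {
    "Monthly API spend ($)": (410, 165),
    "Avg tokens per session": (12_400, 5_100),
    "Council debate cost ($)": (0.48, 0.14),
}


def reduction_pct(before: float, after: float) -> float:
    # Relative reduction, in percent, rounded to one decimal place.
    return round((before - after) / before * 100, 1)


reductions = {name: reduction_pct(b, a) for name, (b, a) in rows.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% lower")
```

Monthly spend comes out at 59.8% lower, which is where the rounded 60% comes from; tokens per session drop 58.9% and per-debate cost 70.8%. The context rot row is already a percentage, so its change is 16 percentage points (22% to 6%).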

Your turn

The context engineering patterns cost nothing to implement. Try the [STATUS] header in your next agent prompt and see if the model stops re-summarizing history. The tiered routing is a config change away if you're already using OpenRouter.

Repo with the actual scripts: Kaizen Harness. The council debate config and model registry are in patterns/council/.

What's your biggest token waste source?

Top comments (2)

Alex Shev • Jun 11

Cost reduction usually comes from workflow shape, not one magic model swap. Cache what is stable, shrink context to what the task needs, and make the agent prove progress before giving it another expensive turn.

The best agent tooling will expose those decisions directly in the run log.

AttestDojo • Jun 11

Alex, we've been running with this since we read your comment. Here is what we found:

"Cache what is stable" — Just shipped a dual-layer semantic cache (SHA256 exact match + cosine similarity on nomic-embed-text embeddings) with model-dependent TTLs: 12h for free local models, 48h for paid API calls. Council debates were re-asking the same strategic questions repeatedly. The surprising finding: the free-tier models are where the waste compounds. A council debate firing 8 Ollama models can burn minutes of compute on a question asked 3 hours ago. A 0.95 semantic match threshold paid for itself immediately.

"Shrink context to what the task needs" — This one we already had covered. Context engineering rules (append-only outputs, structured to-do lists, static tool definitions first, compaction, session turn caps) plus a muncher-first protocol that loads file slices instead of whole files.

"Make the agent prove progress" — The most interesting one. We built three evaluation loops: a council debate critic that stops when it finds no more flaws, a PRD reviewer that won't pass until scores clear the threshold, and a code healer that reverts after three failed fix attempts. The pattern is the same in all three: separate the evaluator from the generator. A model cannot reliably grade its own work.

Appreciate the push. Your framing of workflow shape over model selection is right, and it's more actionable than most cost reduction advice out there.
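Two of the ideas in this reply are small enough to sketch. What follows is an illustration under stated assumptions, not the shipped code: `DualLayerCache`, `heal`, and their parameters are hypothetical names, `embed` stands in for a call to an embedding model such as nomic-embed-text, and the in-memory dict and linear scan are simplifications. The TTLs, the 0.95 threshold, and the three-attempt limit are the values mentioned above.

```python
import hashlib
import math
import time

# --- Sketch 1: exact-match layer plus semantic layer, with per-model TTLs ---
TTL_SECONDS = {"local": 12 * 3600, "paid": 48 * 3600}
SIMILARITY_THRESHOLD = 0.95


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DualLayerCache:
    def __init__(self, embed, clock=time.time):
        self.embed = embed    # assumed: callable mapping a question to a vector
        self.clock = clock    # injectable so expiry can be tested
        self.entries = {}     # sha256 hex digest -> (embedding, answer, expires_at)

    @staticmethod
    def _key(question):
        return hashlib.sha256(question.encode("utf-8")).hexdigest()

    def put(self, question, answer, model_kind):
        expires_at = self.clock() + TTL_SECONDS[model_kind]
        self.entries[self._key(question)] = (self.embed(question), answer, expires_at)

    def get(self, question):
        now = self.clock()
        # Evict expired entries before looking anything up.
        self.entries = {k: v for k, v in self.entries.items() if v[2] > now}
        exact = self.entries.get(self._key(question))
        if exact is not None:                  # layer 1: identical text
            return exact[1]
        query = self.embed(question)           # layer 2: closest stored question
        best = max(self.entries.values(), key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= SIMILARITY_THRESHOLD:
            return best[1]
        return None


# --- Sketch 2: separate generator and evaluator, revert after three failures ---
MAX_FIX_ATTEMPTS = 3


def heal(original, generate_fix, evaluate):
    """Return (result, healed). generate_fix and evaluate are separate callables."""
    feedback = None
    for _ in range(MAX_FIX_ATTEMPTS):
        candidate = generate_fix(original, feedback)
        passed, feedback = evaluate(candidate)  # the evaluator is not the generator
        if passed:
            return candidate, True
    return original, False                      # revert to the untouched original
```

The evaluator's feedback is passed into the next generation attempt, so each retry has something new to work with instead of repeating the same fix.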