AttestDojo

Posted on • Originally published at github.com

We Cut Our AI Agent Costs by 60%. Here's What Worked.

We run a self-healing AI agent system (Kaizen Harness — open source, GitHub). Council debates on architecture, daily tech scans, trajectory logging, automated patching. Tokens add up fast. After a month of tuning, we cut costs 60% with zero quality loss. Here are the patterns that moved the needle, from biggest impact down.

1. Context engineering: stop re-reading your own history

This was the single biggest win. Our agents were burning 40-50% of tokens re-parsing conversation history that hadn't changed since turn 3. The fix, derived from production patterns used by Manus and Cognition:

Append-only design. Every agent response starts with a [STATUS] header that replaces the full history recap. It carries three fields: the goal, the completed steps, and the next step.

[STATUS] Building PR auto-review pattern. Step 2/4 complete (diff parser done). Next: wire council debate.

The model treats it as an attention anchor. No re-reading 2,000 tokens of conversation to remember where we are.
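A minimal sketch of how such a header could be rendered from structured state, so the format stays identical on every turn. The `Status` class and its field names are hypothetical and not taken from the Kaizen Harness repo.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Hypothetical container for the three [STATUS] fields."""
    goal: str
    step: int
    total_steps: int
    last_done: str
    next_step: str

    def render(self) -> str:
        # One compact line: goal, progress, next action.
        return (
            f"[STATUS] {self.goal}. "
            f"Step {self.step}/{self.total_steps} complete ({self.last_done}). "
            f"Next: {self.next_step}."
        )


status = Status(
    goal="Building PR auto-review pattern",
    step=2,
    total_steps=4,
    last_done="diff parser done",
    next_step="wire council debate",
)
header = status.render()
print(header)
```

Keeping the state in a small structure rather than free text makes it easy to update one field per turn without rewriting the whole header.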

Static tool definitions first. Our tool registry is ~800 tokens of JSON schemas. Placing it before dynamic content means the KV cache can reuse it across turns. Moving tool definitions from the middle of prompts to the top saved ~15% per session.
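One way to picture the ordering, as a sketch only: serialize the tool registry once into a byte-stable prefix, then append everything that changes per turn. The tool names, the `build_prompt` helper, and the plain-string prompt layout are assumptions for illustration; whether a provider actually caches the prefix depends on that provider.

```python
import json

# Hypothetical two-tool registry standing in for the real ~800-token one.
TOOLS = [
    {"name": "read_file",
     "parameters": {"type": "object",
                    "properties": {"path": {"type": "string"}}}},
    {"name": "run_tests",
     "parameters": {"type": "object", "properties": {}}},
]

# Serialize once with sorted keys so the prefix is byte-identical every turn.
STATIC_PREFIX = "TOOLS:\n" + json.dumps(TOOLS, sort_keys=True) + "\n"


def build_prompt(status_header: str, turns: list[str]) -> str:
    """Static content first, dynamic content (status, turns) after it."""
    return STATIC_PREFIX + status_header + "\n" + "\n".join(turns)


p1 = build_prompt("[STATUS] Step 1/4 complete.", ["user: start the review"])
p2 = build_prompt("[STATUS] Step 2/4 complete.",
                  ["user: start the review", "agent: diff parsed"])
print(p1.startswith(STATIC_PREFIX) and p2.startswith(STATIC_PREFIX))
```

Anything that varies per turn (timestamps, status, new messages) stays below the prefix, since a change near the top would invalidate the shared portion.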

Compaction trigger. After turn 5, auto-insert a [CONTEXT UPDATE] block summarizing everything the agent needs. Old context is not deleted, but it's no longer in the active attention window. This alone cut our long-session costs by 35%.
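The trigger can be sketched roughly as follows. The `Session` class is hypothetical, the `summarize` callable stands in for the agent writing its own summary, and firing exactly once at turn 5 is a simplifying assumption.

```python
from typing import Callable, List, Optional

COMPACT_AFTER_TURN = 5  # threshold mentioned in the text


class Session:
    """Hypothetical session: full log is kept, active window is compacted."""

    def __init__(self, summarize: Callable[[List[str]], str]) -> None:
        self.log: List[str] = []          # full history, never deleted
        self.summarize = summarize        # stand-in for the agent's summary
        self.update_block: Optional[str] = None
        self.active_from = 0              # first log index still sent verbatim

    def add_turn(self, text: str) -> None:
        self.log.append(text)
        # Fire once, right after turn 5 has been recorded.
        if self.update_block is None and len(self.log) == COMPACT_AFTER_TURN:
            self.update_block = "[CONTEXT UPDATE] " + self.summarize(self.log)
            self.active_from = len(self.log)

    def active_context(self) -> List[str]:
        """What gets sent to the model: summary block plus later turns."""
        head = [self.update_block] if self.update_block else []
        return head + self.log[self.active_from:]


session = Session(summarize=lambda turns: f"{len(turns)} turns summarized")
for i in range(1, 8):
    session.add_turn(f"turn {i}")
print(session.active_context())
```

The full log stays available for trajectory logging or debugging; only the prompt sent to the model shrinks.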

2. Route by task tier, not by default model

Our default was "call Claude for everything." Claude is great at creative reasoning. It is also expensive for tasks that don't need it.

We split tasks into three tiers:

| Tier | Task | Model | Cost vs Claude |
| --- | --- | --- | --- |
| Creative | Architecture decisions, debate synthesis, public-facing content | Claude Sonnet 4 | 1x |
| Planning | Feature scoping, issue triage, PRD drafts | DeepSeek V3.2 | 0.1x |
| Utility | Log parsing, health checks, format validation | Gemini 2.5 Flash | 0.02x |

The tier names are in the prompt. The agent classifies its own task before choosing a model. Simple routing cut our total spend by half, because 70% of agent tasks are utility and planning, not creative reasoning.
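A rough sketch of the routing step, assuming the agent emits a line such as `TIER: utility` when it classifies its task. The label format, the model ID strings, and the fallback to the creative tier are all assumptions for illustration; real model IDs depend on the provider.

```python
import re

# Placeholder model IDs; relative costs are the ones from the table above.
TIERS = {
    "creative": {"model": "claude-sonnet-4", "relative_cost": 1.0},
    "planning": {"model": "deepseek-v3.2", "relative_cost": 0.1},
    "utility": {"model": "gemini-2.5-flash", "relative_cost": 0.02},
}
DEFAULT_TIER = "creative"  # assumption: fail toward quality, not toward cost


def pick_model(agent_reply: str) -> str:
    """Read the agent's self-assigned tier and map it to a model ID."""
    match = re.search(r"TIER:\s*(\w+)", agent_reply, flags=re.IGNORECASE)
    tier = match.group(1).lower() if match else DEFAULT_TIER
    # Unknown labels fall back to the default tier.
    return TIERS.get(tier, TIERS[DEFAULT_TIER])["model"]


print(pick_model("TIER: utility\nParsing the nightly health log."))
print(pick_model("tier: Planning"))
print(pick_model("No label at all"))
```

Falling back to the most capable tier costs more when classification fails, but it avoids silently sending a creative task to a utility model.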

3. Local models for private tasks

Some runs should never touch a cloud API. System health checks, internal logs, config validation. We added Ollama + MLX models as first-class seats in the council debate script:

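  • Qwen3.6 35B MoE (3B active) for reasoning tasks. Only 3B params are active per token, which keeps inference fast on a Mac. The full set of expert weights still has to be loaded, so whether it fits in 16GB RAM depends on how aggressively the model is quantized.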
  • North Mini Code 1B (4-bit) for code diffs and syntax checks. Sub-second on M4.

The main benefit here is not dollar savings (local inference has no per-call cost) but eliminating API latency for high-frequency tasks and keeping private data on the machine. Our self-healing loop now runs entirely local: failure detection, classification, and patching never leave the machine.
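For the Ollama side, a local call can be sketched with only the standard library. This assumes an Ollama server on its default port and uses its `/api/generate` endpoint; the model tag below is a placeholder, and the MLX path is not covered here.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_local_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(model: str, prompt: str, timeout: float = 30.0) -> str:
    """Send the request; requires a running Ollama server with the model pulled."""
    request = build_local_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# "local-health-check-model" is a placeholder tag, not a real model name.
req = build_local_request("local-health-check-model",
                          "Classify this failure: disk full on /var")
print(req.full_url, req.get_method())
```

Calling `ask_local(...)` with a model tag you have pulled returns the generated text as a string.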

4. What didn't work

Prompt compression tools. Tried three different "auto-summarize your context" libraries. All of them lost critical details the agent needed later. Manual compaction triggers (the [CONTEXT UPDATE] pattern above) worked better because the agent decides what matters.

"Just use a cheaper model for everything." Swapping Claude for DeepSeek on creative tasks produced technically correct but flat advice that missed edge cases. The tiered routing was necessary because quality degrades on the wrong task type.

The numbers

Measured with 3 agents running in continuous mode, for 30 days before the changes and 30 days after:

| Metric | Before | After |
| --- | --- | --- |
| Monthly API spend | $410 | $165 |
| Avg tokens per session | 12,400 | 5,100 |
| Council debate cost | $0.48/debate | $0.14/debate |
| Context rot sessions (>10 turns, quality degrades) | 22% | 6% |

No increase in error rates. Self-healing success rate unchanged at 91%.
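To see how the headline 60% relates to the individual rows, the relative reductions can be computed directly from the before/after figures. This is plain arithmetic on the numbers reported above; the `reduction` helper is just for illustration.

```python
# Before/after values copied from the table above.
rows = {
    "Monthly API spend ($)": (410, 165),
    "Avg tokens per session": (12_400, 5_100),
    "Council debate cost ($)": (0.48, 0.14),
    "Context rot sessions (%)": (22, 6),
}


def reduction(before: float, after: float) -> float:
    """Relative reduction, in percent of the 'before' value."""
    return (before - after) / before * 100


for name, (before, after) in rows.items():
    print(f"{name}: {reduction(before, after):.1f}% lower")
```

The monthly spend row works out to roughly a 59.8% drop, which is where the rounded 60% figure comes from.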

Your turn

The context engineering patterns cost nothing to implement. Try the [STATUS] header in your next agent prompt and see if the model stops re-summarizing history. The tiered routing is a config change away if you're already using OpenRouter.

Repo with the actual scripts: Kaizen Harness. The council debate config and model registry are in patterns/council/.

What's your biggest token waste source?
