Alexander Velikiy

Posted on • Originally published at greatcto.systems

Everyone is squeezing context. We stopped putting everything in one context.

The standard advice for reducing LLM costs: truncate your prompts, use a cheaper model, compress your system prompt, enable caching, add "Be concise." to every instruction and hope for the best.

All valid. All treating the symptom.

We did something different.


The real problem isn't prompt size. It's context architecture.

When great_cto runs a feature pipeline — architect, PM, senior-dev, QA, security officer — each agent starts by reading the same stack of documents:

  • ARCH-*.md — full architecture decisions, 3–8k tokens each
  • PLAN-*.md — implementation plans, 4–10k tokens
  • decisions.md — every architectural decision made since the project started
  • lessons.md — every lesson learned, including that one time someone forgot to add an index

Six agents. Each reads all of it. Most of it irrelevant to the task at hand.

A senior-dev implementing a Stripe webhook doesn't need the 200-line deep-dive into the auth system. They need two sentences: "We use Stripe. Card data never touches our infra."

The information was right. The delivery unit was wrong. We were running a library where everyone gets every book, every time.


Phase 1: Stop sending full documents. Send summaries.

Every artifact in great_cto now has a paired .summary.md — auto-generated, ≤250 tokens, structured for the consuming agent:

# ARCH — Multi-tenant auth system · summary
- **Decision:** SAML over OIDC for enterprise; JWT internally
- **Stack:** Node 20, Passport.js, PostgreSQL row-level security
- **Risks:** SAML metadata rotation, session fixation on tenant switch
- **Full doc:** docs/architecture/ARCH-auth.md

Agents read the summary first. If they need depth — the path to the full doc is right there. In practice, 80% of reads stop at the summary. The other 20% at least know exactly what they're looking for.
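
The summary-first read can be sketched roughly as follows. This is a minimal illustration, not great_cto's actual code: `parseSummary` and `resolveRead` are hypothetical names, and the only thing taken from the post is the bullet format of the summary itself.

```javascript
// Hypothetical helper: parse the "- **Key:** value" bullets of a .summary.md file.
function parseSummary(text) {
  const fields = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    const m = line.match(/^- \*\*(.+?):\*\*\s*(.*)$/);
    if (m) fields[m[1].toLowerCase()] = m[2];
  }
  return fields;
}

// Hypothetical decision step: stay on the summary unless the agent needs depth,
// in which case follow the "Full doc" pointer instead of guessing a path.
function resolveRead(summaryText, needsDepth) {
  const fields = parseSummary(summaryText);
  const fullDoc = fields["full doc"] ?? null;
  return needsDepth && fullDoc
    ? { read: "full", path: fullDoc }
    : { read: "summary", fields };
}

// The bullet lines from the example summary above.
const exampleSummary = [
  "- **Decision:** SAML over OIDC for enterprise; JWT internally",
  "- **Stack:** Node 20, Passport.js, PostgreSQL row-level security",
  "- **Risks:** SAML metadata rotation, session fixation on tenant switch",
  "- **Full doc:** docs/architecture/ARCH-auth.md",
].join("\n");

console.log(resolveRead(exampleSummary, false).read);
console.log(resolveRead(exampleSummary, true).path);
```

The point of the shape is that the full document is never loaded speculatively: the agent pays for depth only after the summary has told it where the depth lives.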

The numbers:

| | Before | v2.19.0 |
| --- | --- | --- |
| 13 artifacts, per agent read | 21,459 tokens | 2,216 tokens |
| Reduction | | –89.7% |

The summary generates automatically via a PostToolUse hook the moment any agent writes an artifact. Anthropic Haiku if you have an API key (~$0.0005/call). OpenRouter Kimi K2 as fallback. Deterministic keyword heuristic if neither — zero cost, works offline, mildly embarrassed about the quality but gets the job done.

No config. No manual steps. Write artifact, get summary.
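
The fallback order described above (Haiku, then OpenRouter, then an offline heuristic) might look something like this. It is a sketch under assumptions: the environment variable names `ANTHROPIC_API_KEY` and `OPENROUTER_API_KEY` are guesses at what the script checks, and `heuristicSummary` is a deliberately naive stand-in, not the keyword heuristic that ships in `scripts/generate-summary.mjs`.

```javascript
// Choose a summarizer from whatever credentials are available.
// The variable names are assumptions, not confirmed by the post.
function pickSummarizer(env) {
  if (env.ANTHROPIC_API_KEY) return "anthropic-haiku";
  if (env.OPENROUTER_API_KEY) return "openrouter-kimi-k2";
  return "heuristic"; // zero cost, works offline
}

// Illustrative deterministic fallback: keep the first heading and the
// first sentence of each of the first few non-heading lines.
function heuristicSummary(markdown, maxBullets = 4) {
  const lines = markdown.split(/\r?\n/).map((l) => l.trim());
  const heading = lines.find((l) => l.startsWith("#")) ?? "# Untitled";
  const bullets = lines
    .filter((l) => l && !l.startsWith("#"))
    .slice(0, maxBullets)
    .map((l) => "- " + l.split(/[.!?](?:\s|$)/)[0]);
  return [heading.replace(/^#+\s*/, "# ") + " · summary", ...bullets].join("\n");
}

console.log(pickSummarizer({}));
console.log(heuristicSummary("# ARCH auth\n\nWe use SAML. More text.\n\nJWT internally."));
```

Because the last branch has no network dependency and no randomness, the same artifact always yields the same summary, which is what makes the hook safe to run on every write.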


Phase 2: Stop injecting the entire memory. Filter it to the task.

decisions.md is an append-only log. It grows. A typical project after three months: 40–80 entries — database choices, API decisions, security tradeoffs, that one auth approach you tried and abandoned at 2am.

Before v2.19.0, the architect agent received the full file every time. 3–5k tokens, of which maybe 200 were actually relevant to the task. The model read all of it, politely, and quietly ignored most of it.

Now: one call to scripts/memory-filter.mjs "add Stripe webhook integration" decisions.md --k=5

The filter scores each entry against the task title. For "add Stripe webhook integration" — you get the PCI decision, the webhook signature lesson, the relevant security pattern. Not the database choice from six months ago that has nothing to do with anything.
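
To make the scoring idea concrete, here is a minimal keyword-overlap version. The post does not describe the real scoring function, so treat everything here as an assumption: `tokenize`, `filterMemory`, the stopword list, and the sample entries are all hypothetical. Only the `k` parameter and the opt-out switch mirror what the article mentions.

```javascript
// Hypothetical stopword list; words too generic to signal relevance.
const STOPWORDS = new Set(["a", "an", "the", "to", "of", "for", "and", "in", "on", "add", "we", "use"]);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 1 && !STOPWORDS.has(w));
}

// Score each entry by how many distinct task words it contains,
// drop entries with no overlap, and keep the top k (ties keep file order).
function filterMemory(task, entries, { k = 5, disabled = false } = {}) {
  if (disabled) return entries; // e.g. GREAT_CTO_DISABLE_MEMORY_FILTER=1
  const taskWords = new Set(tokenize(task));
  return entries
    .map((text, index) => {
      const words = new Set(tokenize(text));
      let score = 0;
      for (const w of taskWords) if (words.has(w)) score += 1;
      return { text, index, score };
    })
    .filter((e) => e.score > 0)
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, k)
    .map((e) => e.text);
}

// Made-up entries for illustration only.
const sampleEntries = [
  "PCI: we use Stripe; card data never touches our infra",
  "Lesson: verify webhook signature before parsing the Stripe payload",
  "Database: PostgreSQL chosen over MySQL",
];

console.log(filterMemory("add Stripe webhook integration", sampleEntries, { k: 5 }));
```

With these sample entries the database decision scores zero and is dropped, while the webhook lesson ranks above the PCI decision because it matches two task words instead of one.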

The numbers:

| | Before | v2.19.0 |
| --- | --- | --- |
| decisions.md inject per agent pair | 946 tokens | 544 tokens |
| Reduction | | –42.5% |

Latency: ~50ms heuristic, ~200ms Haiku. Cost: <$0.0001 per call. Opt-out: GREAT_CTO_DISABLE_MEMORY_FILTER=1 (for when you miss the old noise).


The combined pipeline: before vs. after

Six agents per feature. Each reads artifacts and memory.

| | Before | v2.19.0 |
| --- | --- | --- |
| Total tokens per feature | 134,430 | 16,560 |
| Reduction | | –87.7% |
| Cost saved (Sonnet $3/1M) | | $0.35 per feature |
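
The per-feature totals can be reproduced from the per-agent figures reported in the two earlier sections, assuming each of the six agents reads both the artifacts and the memory injection once. That assumption is mine, but it matches the published totals exactly:

```javascript
const AGENTS = 6;
const PRICE_PER_MILLION_USD = 3; // Sonnet input price quoted in the post

// Per-agent token counts from Phase 1 (artifacts) and Phase 2 (decisions.md).
const before = { artifacts: 21459, memory: 946 };
const after = { artifacts: 2216, memory: 544 };

function tokensPerFeature({ artifacts, memory }) {
  return AGENTS * (artifacts + memory);
}

const totalBefore = tokensPerFeature(before); // 134430
const totalAfter = tokensPerFeature(after); // 16560
const reductionPct = (1 - totalAfter / totalBefore) * 100;
const savedUsd = ((totalBefore - totalAfter) / 1e6) * PRICE_PER_MILLION_USD;

console.log(totalBefore, totalAfter, reductionPct.toFixed(1) + "%", "$" + savedUsd.toFixed(2));
// prints: 134430 16560 87.7% $0.35
```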

This is with a small project — 13 artifacts, 7 decisions. The savings compound with scale: at 50 artifacts and 50 decisions (a project six months in), the legacy number climbs past 600k tokens per feature run. The filtered number stays roughly flat.

That's the interesting property of this architecture: the noise grows with the project, the signal doesn't.


What this isn't

This is not prompt compression. We're not removing information — we're delivering it at the right granularity, to the right agent, at the right moment.

The full docs are still there. The full decisions.md is still there. Any agent that needs depth can read it — the summary tells them exactly where to look. The filter acknowledges it might miss something ("if you suspect a relevant lesson is missing, read the full file directly"). It's a hint, not a wall.

We're not betting on the model being smart enough to ignore irrelevant noise. We're not hoping a "Be concise." instruction somewhere will solve a structural problem. We're betting on information architecture: the same principle that makes an indexed database faster than a full table scan.

The index doesn't know less than the table. It knows where to look.


Getting it

Everything shipped in v2.19.0:

  • scripts/generate-summary.mjs (flags: --all, --check, --force)
  • scripts/memory-filter.mjs (flags: --k=N, --heuristic, --stats)
  • agents/_shared/artifact-summary-contract.md — the producer/consumer contract
  • 31 tests, all green
npx great-cto upgrade

Summaries generate on first --all run, then stay fresh automatically. Memory filter activates in architect and senior-dev agents — no config needed.


What's next

Phase 3: session-scoped read cache. When five agents in one pipeline all read PROJECT.md, only the first actually reads the file. The rest get a cache stub with a hash. Target: additional –15% on multi-agent runs.
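
Since Phase 3 is still a plan, the following is only a guess at its shape: a per-session cache where the first read of a path returns the content and later reads return a small stub carrying a hash. `createReadCache` and `fnv1a` are hypothetical names, and the FNV-1a style hash (computed over UTF-16 code units) is a dependency-free stand-in for whatever hash the real implementation uses.

```javascript
// Small non-cryptographic hash, used here only to fingerprint file content.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Wrap any synchronous file reader with a session-scoped cache.
// First read of a path: content plus hash. Repeat reads: stub with hash only.
function createReadCache(readFile) {
  const seen = new Map(); // path -> content hash
  return function read(path) {
    if (seen.has(path)) {
      return { cached: true, path, hash: seen.get(path) };
    }
    const content = readFile(path);
    const hash = fnv1a(content);
    seen.set(path, hash);
    return { cached: false, path, hash, content };
  };
}

// Usage with a fake reader standing in for the filesystem.
let diskReads = 0;
const read = createReadCache(() => {
  diskReads += 1;
  return "# Project\nStack: Node 20";
});

const firstRead = read("PROJECT.md");
const secondRead = read("PROJECT.md");
console.log(firstRead.cached, secondRead.cached, diskReads);
```

The stub keeps the hash so a later agent can tell whether the content it already has in context is still the version on disk.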

Phase 4: system prompt audit across all 30+ agent files. Removing filler. Enforcing token budgets. Finding the seven places we wrote "carefully" when the model was going to be careful anyway.
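
A prompt audit along these lines could start as a simple lint pass. Everything in this sketch is assumed: the filler list, the `auditPrompt` name, the budget value, and the four-characters-per-token estimate, which is only a rough rule of thumb and not a real tokenizer.

```javascript
// Hypothetical filler list; the real audit may flag different words.
const FILLER = ["carefully", "please", "very", "simply"];

// Rough estimate only: about four characters per token for English text.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Report the estimated size of a prompt, whether it exceeds its budget,
// and which filler words it contains.
function auditPrompt(name, text, budget) {
  const tokens = estimateTokens(text);
  const words = text.toLowerCase().match(/[a-z']+/g) ?? [];
  const filler = words.filter((w) => FILLER.includes(w));
  return { name, tokens, overBudget: tokens > budget, filler };
}

console.log(auditPrompt("qa.md", "Carefully review the diff and report issues.", 8));
```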

The full plan is public: docs/plans/PLAN-token-economy-2026-q2.md
