A lot of AI coding workflows degrade the exact same way.
At first, everything feels incredible.
Your coding agent:
- understands the project
- moves insanely fast
- eliminates boilerplate
- compounds your momentum
Then a few weeks later:
AGENTS.md turns into a novel.
Prompts get bloated.
The model starts missing obvious things.
Responses become inconsistent.
Token usage quietly becomes absurd.
I kept running into this while building Empirical.
Eventually I realized the problem wasn’t:
“The model needs more context.”
The problem was:
“The model is carrying too much irrelevant context at once.”
That distinction changed everything.
The Hidden Failure Mode of Coding Agents
Most teams solve AI memory like this:
“Just add it to the prompt.”
And over time the context fills up with:
Permanent Context Soup
- architecture decisions
- coding standards
- deployment notes
- UI preferences
- old implementation details
- temporary fixes
- abandoned experiments
- half-finished thoughts
Eventually every request drags all of it around forever.
Even when most of it has absolutely nothing to do with the current task.
That creates a brutal signal-to-noise problem.
The model starts giving temporary junk and critical architecture decisions equal weight.
You can actually feel the degradation happen.
Symptoms:
- the agent gets fuzzier
- architecture drift increases
- outputs become inconsistent
- you spend more time correcting than building
Bigger Context Windows Aren’t the Real Solution
I think the industry is optimizing the wrong thing right now.
Everyone keeps pushing toward:
Bigger Everything
- million-token windows
- infinite memory
- larger context sizes
- stuffing more into prompts
But humans don’t work that way.
Good engineering teams don’t bring every document into every meeting.
Most information is situational.
Most memory should stay dormant until it becomes relevant.
That was the shift for me.
Not:
“How do I fit more into context?”
But:
“How do I load only what matters right now?”
What Worked Better
I started treating AI memory like layered working memory instead of a permanently stuffed prompt.
1. Lean Persistent Context
Keep permanent instructions extremely small.
Only things like:
- architecture principles
- coding philosophy
- project identity
- non-negotiables
That layer should stay lean on purpose.
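As a rough sketch, the whole persistent layer can fit in a handful of lines. Every entry below is an illustrative placeholder, not a recommendation from any real project:

```python
# Illustrative persistent layer: a few durable principles, nothing else.
# Each entry is an example placeholder standing in for your own rules.
PERSISTENT_CONTEXT = """\
Architecture: hexagonal; domain code never imports framework code.
Philosophy: small, reviewable changes; every bug fix ships with a test.
Identity: internal billing API, Python 3.12, FastAPI.
Non-negotiable: no secrets in source; config comes from the environment.
"""
```

If that layer ever needs scrolling, something that belongs in a lower layer has leaked into it.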
2. Retrieved Context
Pull implementation knowledge dynamically based on:
Relevance Signals
- semantic similarity
- current task
- related code paths
- previous work in the same area
Only relevant context enters the active prompt.
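Here is a minimal sketch of that retrieval step. The `embed()` function is a hashed bag-of-words placeholder, a stand-in for whatever embedding model you actually use; the shape of the logic is the point: score stored knowledge against the current task and keep only the top few hits.

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder embedding (hashed bag-of-words). Swap in a real model."""
    vec = [0.0] * 256
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors from embed() are unit-length, so cosine is just a dot product.
    return sum(x * y for x, y in zip(a, b))

def retrieve(task: str, knowledge: list[str], k: int = 3, floor: float = 0.2) -> list[str]:
    """Return only the stored entries relevant to the current task."""
    query = embed(task)
    scored = [(cosine(query, embed(entry)), entry) for entry in knowledge]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:k] if score >= floor]
```

Semantic similarity is just the simplest relevance signal; in practice you would also boost entries touching the same code paths or recent work in the same area.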
3. Session Context
Use temporary working memory for:
Active Work
- bugs
- in-progress features
- short-lived implementation decisions
Then let it expire naturally instead of polluting long-term memory forever.
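A sketch of that expiry behavior, plus how the three layers combine into a single prompt. This reuses `PERSISTENT_CONTEXT` and `retrieve()` from the earlier sketches, and the two-hour TTL is an arbitrary illustrative choice:

```python
import time

class SessionMemory:
    """Short-lived working notes that expire instead of becoming permanent."""

    def __init__(self, ttl_seconds: float = 2 * 60 * 60):
        self.ttl = ttl_seconds
        self._notes: list[tuple[float, str]] = []  # (timestamp, note)

    def add(self, note: str) -> None:
        self._notes.append((time.time(), note))

    def active(self) -> list[str]:
        # Drop anything older than the TTL; expired notes never reach the
        # prompt and never get promoted to long-term memory.
        cutoff = time.time() - self.ttl
        self._notes = [(t, n) for t, n in self._notes if t >= cutoff]
        return [n for _, n in self._notes]

def build_prompt(task: str, knowledge: list[str], session: SessionMemory) -> str:
    # Persistent principles + retrieved knowledge + live session notes.
    # Everything else stays dormant until it becomes relevant.
    parts = [PERSISTENT_CONTEXT, *retrieve(task, knowledge), *session.active(), task]
    return "\n\n".join(parts)
```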
What Changed
The biggest surprise wasn’t even the token savings.
It was how much sharper the agents became once the noise disappeared.
After reducing context bloat:
- responses became more focused
- architecture stayed more consistent
- prompt babysitting dropped significantly
- outputs drifted less between sessions
The token reduction was just the measurable side effect.
Results
| Workflow | Context Reduction |
|---|---|
| Smaller focused tasks | ~22% |
| Larger iterative workflows | Up to ~45% |
That compounds fast once agents start looping.
The Bigger Realization
I think a lot of AI tooling is accidentally recreating bad human organizational habits.
We already know what happens when people dump everything into:
Organizational Chaos
- giant docs
- giant meetings
- giant Slack threads
- giant Notion pages
Clarity collapses.
Coding agents seem to behave better when memory works more like human working memory:
Better Memory Pattern
- small active focus
- relevant recall
- long-term memory separated from immediate attention
That mattered far more than raw context size.
Full Breakdown
I wrote the complete breakdown here:
- retrieval architecture
- layered memory strategy
- implementation lessons
- where the 22–45% savings actually came from
→ Reducing Coding Agent Context Usage by 22–45% with Retrieval-Based Memory Systems