DEV Community: John Lee

How to make an AI coding agent actually yours

John Lee — Tue, 23 Jun 2026 12:59:46 +0000

If you work with an AI coding agent every day, you know the feeling.

You clearly agreed on a convention yesterday — open a new session and it's a blank slate. That I always use type over interface in TypeScript, that I said I didn't like that pattern in code review, the root cause of the bug we barely cornered last week — it acts like it's hearing every bit of it for the first time.

Do that enough times and you land on "I'll just do it myself."

I built Monet to fix this — to make a generic AI agent act like my agent. A system that learns my conventions, remembers how I like to work, and keeps track of the project's history on its own.

How it works

The core is simple.

Write — the agent decides for itself. You never say "remember this." As it works, it records the decisions it makes, the patterns it spots, the issues it runs into. It filters the noise and keeps the signal.

Read — what was useful comes back first. It's not plain keyword search. The memories that actually got referenced and helped solve problems surface first; the zombie memories nobody touches sink to the bottom on their own.

Grow — it gets smarter as it piles up. The first task is slow — it doesn't know the codebase, the conventions, the bugs that keep blowing up. But once memory builds, the next task is faster: the pattern you found yesterday, the decision you made last week, that bug's root cause — no need to dig them up again. After a month or so, the agent stops feeling like a generic tool and starts moving like a dedicated engineer who knows this project inside out.
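To make the Read step concrete, here is a minimal sketch of usefulness-weighted recall. It is not Monet's actual ranking: the `Memory` fields, the scoring formula, and the 30-day half-life are assumptions invented for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    times_referenced: int        # hypothetical counter: how often an agent actually used it
    days_since_last_use: float   # hypothetical idle time

def score(memory: Memory, query: str, half_life_days: float = 30.0) -> float:
    """Keyword overlap, boosted by past usage and decayed by idle time."""
    query_words = set(query.lower().split())
    memory_words = set(memory.text.lower().split())
    overlap = len(query_words & memory_words)
    if overlap == 0:
        return 0.0
    usage_boost = math.log1p(memory.times_referenced)   # diminishing returns on usage
    decay = 0.5 ** (memory.days_since_last_use / half_life_days)
    return overlap * (1.0 + usage_boost) * decay

def recall(memories: list[Memory], query: str, limit: int = 3) -> list[Memory]:
    """Return matching memories, most useful first."""
    ranked = sorted(memories, key=lambda m: score(m, query), reverse=True)
    return [m for m in ranked if score(m, query) > 0][:limit]

memories = [
    Memory("use type over interface in TypeScript", times_referenced=12, days_since_last_use=1),
    Memory("old note about interface naming in TypeScript", times_referenced=0, days_since_last_use=120),
]
top = recall(memories, "TypeScript interface or type")
```

Both notes match the query, but the one that keeps getting referenced ranks first, while the untouched one decays toward the bottom without anyone deleting it.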

How I got here

At first I just piled notes into one file — the agent jotting down what it learned as markdown, and me include-ing it at the start of each session. Simple, but the noise grew as the file grew.

So four months ago I built a proper memory system. The old Monet. MCP-based, built for agents to read and write, with team sharing in mind. But chasing team sharing made the solo experience fuzzy. It worked, technically — it just didn't fit my workflow.

So I tore it down. A few weeks ago I set the team-sharing goal aside and rebuilt Monet from scratch around one question: can I actually use this every day? As I write this, 12 agents read and write on the new Monet. There's no monitoring yet so I can't pull exact numbers, but searches are down and reads/writes are way up from before — which means the agent is curating what matters on its own.

Honestly

At the vibe-coding stage, memory doesn't matter much. Most of it is brand-new features, and the bugs are simple. You tell the agent "fix this" and it's done inside the context window.

But once the app gets complex, it's a different story. To change one line you have to check ten related pieces of logic, and the agent crawls file to file hunting side effects. Mistakes go up. Even with 1M tokens, three context-compactions later you're back where you started.

At home I build fun things with agents on fresh code; at work I wrestle with 20-year-old code every day. At work, agent memory isn't optional. Without it, the work doesn't move.

So I started building a file-based indexed memory for myself. That was the start of Monet. These days I deliberately send the agents on laps at work — just to gather context. Most tickets wrap inside 20% context. The time I save is obvious; what matters more is the stress is gone.

Best of all: where I used to rack my brain over "how did I fix that bug again," now I just ask Kiro (our coding agent at work). It usually knows.

Once you have dozens of agents and millions of lines of code, context stops being a byte problem and becomes an infrastructure problem. And at that point, memory isn't a nice-to-have — it decides whether the work is even possible.

Want to try it?

Homepage: monet.team-monet.com
GitHub: github.com/team-monet/with-monet — the install harness (Apache-2.0)
100% local: your code never leaves your machine — on-device embeddings, no network, no telemetry. Memory is a single SQLite file at ~/.monet that you can open, back up, and export yourself.
Free to use. The engine is a closed compiled binary, but the interface is standard MCP — works out of the box with Claude Code, Cursor, Codex, and other MCP-capable agents.

I'd especially love for these people to try it:

developers who work with AI agents seriously, every day
anyone who's felt the fatigue of "do I have to explain that again…"
anyone who thinks "agent memory? why would I even need that?" (seriously — I want the counterarguments too)

Every example and scenario in this post is from real experience.

Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests

John Lee — Tue, 26 May 2026 11:24:59 +0000

Would you hire an engineer based on their SAT score?

Of course not. You look at how they solve problems. How they handle ambiguity. Whether they adapt when their first approach fails. You're evaluating behavior, not just knowledge.

Yet somehow, this is exactly what we do with LLMs. We test them like students — multiple choice, fill in the blank, write a function from a spec — and call it "evaluation." We rank models by MMLU scores and HumanEval pass rates as if those numbers tell us everything we need to know.

They don't. Here's why.

What Are We Actually Measuring?

Let's look at three of the most widely-used LLM benchmarks. Not at their scores, but at what they actually measure.

MMLU: The Encyclopedia Test

MMLU gives an LLM multiple-choice questions across 57 subjects, including law, medicine, and philosophy. Pick the right answer from four options. That's it.

What it measures: breadth of knowledge. How much the model has memorized.

What it doesn't measure: whether the model knows when to apply that knowledge. Whether it can tell the difference between a situation that needs legal reasoning and one that just needs common sense. Whether it knows what it doesn't know.

It's the written part of a driving test. Passing it doesn't mean you can drive.

HumanEval: The Coding Interview Problem

HumanEval shows a function signature and a docstring. The model fills in the body. If the code passes the test cases on the first try, it's a pass. This is measured as pass@1 — first-attempt pass rate.
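For reference, with one sample per problem, pass@1 is simply the fraction of problems solved on the first try. When several samples are drawn per problem, the HumanEval paper (Chen et al., 2021) uses an unbiased estimator, sketched below. The sample counts are made-up numbers, not real benchmark data.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them passed."""
    if n - c < k:
        return 1.0   # too few failures to fill k draws, so at least one draw passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up (samples, passes) pairs for three problems.
results = [(10, 3), (10, 0), (10, 10)]
benchmark_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n per problem, which is why the metric says nothing about what the model does after a failed attempt.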

What it measures: can the model translate a spec into working code in one shot?

What it doesn't measure: what happens when the test fails? Does the model debug systematically or flail randomly? If there's an existing codebase with conflicting patterns, does it notice? Does it know when to refactor instead of patching?

One function. One attempt. That's not how software gets built.

SWE-bench: The First-Day Assignment

SWE-bench is the most realistic of the three. It gives the model a real GitHub issue and access to the full repository. The task: produce a patch that resolves the issue. Evaluation is binary — the repo's test suite either passes or it doesn't.

What it measures: can the model navigate a real codebase and fix a real bug?

What it doesn't measure: anything about the approach path. Did the model grep for the right files efficiently, or did it read half the repository first? Did it understand the existing architecture, or did it brute-force a patch that works but violates every design pattern in the project? Did it learn something from this issue that it could apply to the next one?

SWE-bench evaluates the destination, not the journey.

The Pattern: All Three Measure "First Impressions"

Benchmark	What it measures	What it misses
MMLU	Knowledge recall	Application judgment
HumanEval	First-pass coding	Debugging, iteration, adaptation
SWE-bench	One-shot bug fixing	Approach path, cross-session learning

These benchmarks share a fundamental assumption: evaluation happens once, in a single session, with a single correct answer.

But real AI coding agents don't work that way. They work across sessions. They learn from yesterday's mistakes. They reuse context from last week's debugging session. The quality of their work depends not just on what they know, but on how they behave over time.

This isn't a knowledge problem. It's a behavior problem. And no amount of harder questions on MMLU-Pro will solve it.

We Hire Humans by Behavior. Why Do We Test LLMs by Knowledge?

Think about how you hire an engineer.

You glance at their GPA. You look at their GitHub. Maybe you give them a take-home assignment. But none of that is the deciding factor.

The deciding factor comes from the interview. And what do you ask?

"Tell me about the hardest technical decision you made last year."
"Walk me through a time you disagreed with a teammate and how you resolved it."
"Here's a problem. Show me how you'd think about it — not the answer, the thinking."

These are behavioral questions. They don't measure what the candidate knows. They measure how the candidate operates. And they work because past behavior predicts future performance.

Now look at LLM evaluation. Where are the behavioral questions?

There aren't any. We're stuck at the "checking GPA" stage, watching every model score in the 90th percentile and pretending that tells us something useful about how they'll perform on real work.

Same Problem, Different Minds

Here's what behavioral evaluation actually looks like.

Take the same bug ticket and give it to three different models. Don't just check who fixes it — watch how they approach it.

Model A reads the ticket and immediately greps for the relevant code. Within 30 seconds, it has a first patch. It's fast, intuitive, pattern-matching. This model would thrive in rapid prototyping — where speed and gut instinct matter more than architectural rigor.

Model B starts by decomposing the ticket into three sub-tasks. It reproduces each one independently before attempting any fix. It's methodical, structured, systematic. This model belongs on complex architecture work — where missing an edge case costs weeks.

Model C searches git log for similar issues first. It studies existing patches to understand the codebase's conventions before writing anything. It's cautious, precedent-driven, learning from history. This model fits maintenance and bug fixing — where consistency with existing patterns matters more than clever solutions.

All three models fix the bug. Their scores are identical. But their behavioral profiles are completely different. And that difference determines which role each model is actually suited for.

This is what behavioral benchmarks should measure. Not "did the model solve the problem?" but "how did the model solve it?" — and what does that tell us about where it belongs?

A Proposal: Behavioral Benchmarks

I should be clear: this is a proposal, not an established framework. I'm not citing a paper because there isn't one. (Though interestingly, an April 2026 preprint by Tang et al. argues for "in-situ behavioral evaluation" for LLM fairness — suggesting the idea is in the air.) If I'm wrong about any of this, I hope you'll correct me in the comments.

Here's the definition I'm working with:

A Behavioral Benchmark is an evaluation framework that profiles how an LLM approaches problems — its cognitive patterns — rather than just scoring the correctness of its answers.

Where existing benchmarks ask "how many did it get right?", behavioral benchmarks ask "what kind of thinker is this?"

I propose four dimensions to observe:

Dimension	Observation Question	What It Reveals
Decomposition	Does it jump straight to execution, or break the problem down first?	Top-down architect vs. bottom-up executor
Approach	Does it search for similar patterns, or reason from first principles?	Maintenance engineer vs. innovator
Recovery	When stuck, does it change strategy or double down on the same path?	Adaptive vs. persistent
Consistency	Does it show the same approach pattern across similar problems?	Predictable vs. creative
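As one illustration of how a dimension could be turned into a number, here is a rough sketch for Consistency. It assumes each run on a similar problem has already been labelled by its first action; both the labels and the metric are hypothetical, not an established measure.

```python
from collections import Counter

def consistency(first_actions: list[str]) -> float:
    """Share of runs that open with the most common first action (1.0 = always the same)."""
    if not first_actions:
        raise ValueError("need at least one run")
    most_common_count = Counter(first_actions).most_common(1)[0][1]
    return most_common_count / len(first_actions)

# Hypothetical labels for how one model opened four similar bug tickets.
runs = ["grep", "grep", "git_log", "grep"]
run_consistency = consistency(runs)   # 3 of 4 runs started the same way
```

A real profile would need far more than the first action, but even this crude ratio separates a model that always opens the same way from one that varies.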

Think of it this way:

MMLU asks: "What does this candidate know?"
Behavioral benchmarks ask: "How does this candidate work?"
And that second question determines role fit.

Why Now

In 2026, coding agents aren't demos anymore. They're daily tools on real engineering teams. And teams are starting to ask questions that our benchmarks can't answer:

"Which model should I use for our legacy codebase maintenance?"
"Our junior devs need a pair programmer — which model's debugging style fits them?"
"We need consistency. Which model produces the most predictable behavior week over week?"

These are role-fit questions. Hiring questions. And we're trying to answer them with SAT scores.

The race for smarter models is maturing. The next frontier isn't a higher MMLU score — it's understanding what each model is actually good for. And we can't get there without behavioral evaluation.

Let's Define This Together

I don't think I've nailed this. The four dimensions I proposed are a starting point, not a destination. Maybe there are better axes. Maybe the whole framing is wrong and someone smarter has already solved this.

Here are a few things I'm probably wrong about — please correct me:

Decomposition style is a stable trait of a model, not just a reflection of the prompt
Recovery behavior can be measured without also measuring the harness/framework around the model
Consistency across sessions is more important for team adoption than raw capability
Role-fit evaluation will eventually matter more than accuracy benchmarks for enterprise adoption

If you're building coding agents, evaluating models, or just frustrated that your "top-ranked" LLM doesn't behave the way you expected — I want to hear from you. What behavioral dimensions matter on your team?

I'm thinking about this while building Monet — an open-source platform for AI agents to share and control knowledge at the team level.

All examples and scenarios in this post are based on real experiences, adapted for the blog format.

Token Economics: The Real Cost of AI Coding Agents

John Lee — Thu, 21 May 2026 12:37:45 +0000

How prompt caching actually works

When an LLM processes your input, it doesn't just read and forget. For tokens that appear in the same position across multiple requests, the model can reuse its previous computation. This is called prefix caching.

Request 1: [System Prompt] [Conversation Turn 1] [Turn 2]
           └── 260K tokens computed from scratch ──┘
           Cost: expensive

Request 2: [System Prompt] [Conversation Turn 1] [Turn 3]
           └──── 255K tokens → CACHE HIT! ────┘├── 5K new ──┤
           Cost: nearly free

The catch? Only the prefix — tokens from the start that match exactly — benefit from caching. Change one token at the beginning, and the entire cache is invalidated.

This is why my 4:20 PM request (300K input, $0.0096) was so cheap — 295K of those tokens were cached from previous turns. And why my 9:20 AM request (257K, $0.4455) was so expensive — it was a fresh session with zero cache.
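The prefix rule can be sketched in a few lines. This is a simplified model that compares whole tokens one by one; real providers cache in larger blocks and apply their own rules, so treat the function and the toy token lists as assumptions.

```python
def cached_prefix_length(previous: list[str], current: list[str]) -> int:
    """Number of leading tokens that are identical in both requests."""
    length = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        length += 1
    return length

request_1 = ["<system>", "turn1", "turn2"]
appended = ["<system>", "turn1", "turn2", "turn3"]       # only adds to the end
edited = ["<system v2>", "turn1", "turn2", "turn3"]      # changes the very first token

reusable_after_append = cached_prefix_length(request_1, appended)
reusable_after_edit = cached_prefix_length(request_1, edited)
```

Appending keeps all three earlier tokens reusable. Editing the first token leaves nothing to reuse, even though everything after it is unchanged.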

The transcript trap

Most coding agents today use what I call the "transcript" approach: every turn appends the latest exchange to the conversation history and sends the entire thing back to the model.

Turn 1:  17K tokens → cache miss → $0.029
Turn 2:  22K tokens → 17K cached → $0.0007
Turn 3:  27K tokens → 22K cached → $0.0008
...
Turn 10: 62K tokens → 57K cached → $0.0019

This looks great. The marginal cost per turn is tiny because most of each request is cached: 77% on turn 2, rising to 92% by turn 10. The transcript approach is, economically speaking, a cache lottery, and while the session stays alive, you keep winning.

But here's the problem: sessions don't stay alive forever.

Context windows fill up. Compaction kicks in. Cache TTLs expire (usually 5–10 minutes). When any of these happen, your next request is a cache miss — and suddenly you're paying the full 46x penalty.

That 9:20 AM spike? That was compaction. The session crossed the context window limit, Hermes compressed the history into a summary, and the next request started fresh. $0.44 for one turn.
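The 46x figure appears to be the ratio between the two request costs quoted earlier. A quick check, using only the numbers from this post:

```python
cold_request_cost = 0.4455   # USD, 9:20 AM request on a fresh session (from the post)
warm_request_cost = 0.0096   # USD, 4:20 PM request with 295K cached tokens (from the post)

def miss_penalty(cold: float, warm: float) -> float:
    """How many times more a cache-miss request costs than a cache-hit one."""
    return cold / warm

penalty = miss_penalty(cold_request_cost, warm_request_cost)
print(f"a cache miss costs about {penalty:.0f}x a cached request")
```

This compares whole requests, not per-token rates; the cached request was actually the larger of the two (300K vs 257K input tokens).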

A different approach: structured state

What if, instead of sending the entire conversation transcript, you sent only a structured summary of what matters?

Turn 1:  [State]  →  3K tokens → cache miss → $0.005
Turn 2:  [State]  →  3K tokens → 1K cached  → $0.0001
Turn 3:  [State]  →  3K tokens → 1K cached  → $0.0001

Not only is the first turn cheaper (3K vs 17K), but the cached portion — the state schema itself — is too small to ever expire meaningfully. And when a session inevitably ends? The next session starts at 3K again, not 17K.

I tested this with a real 44-turn debugging session. The transcript was 3,777 tokens. The extracted state: 740 tokens. An 80.4% reduction in prompt tokens — and the state-based agent produced higher-quality code with better structure.
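A minimal sketch of what "structured state" could look like, plus the reduction arithmetic. The `SessionState` fields and the example values are hypothetical; only the token counts (3,777 and 740) come from the session described above.

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Hypothetical structured state; field names are illustrative only."""
    goal: str
    decisions: list[str] = field(default_factory=list)
    open_issues: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Compact text form that gets sent instead of the full transcript."""
        lines = [f"GOAL: {self.goal}"]
        lines += [f"DECISION: {d}" for d in self.decisions]
        lines += [f"OPEN: {i}" for i in self.open_issues]
        return "\n".join(lines)

def reduction(transcript_tokens: int, state_tokens: int) -> float:
    """Fraction of prompt tokens saved by sending state instead of the transcript."""
    return 1 - state_tokens / transcript_tokens

state = SessionState(
    goal="fix flaky login test",
    decisions=["retry logic lives in the client, not the test"],
    open_issues=["timeout value still unconfirmed"],
)
saved = reduction(3777, 740)   # about 0.804, the 80.4% quoted above
```

The point of the structure is that dead ends and tool output never enter the state, so its size tracks what was decided, not how long the session ran.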

The real economics

The transcript approach looks cheaper turn-by-turn because caching hides the cost. But it's fragile:

Cache TTL: 5–10 minutes of inactivity and you lose it
Context limits: Long sessions get compacted, breaking the cache
Quality: Noise accumulates. Debugging chatter, tool outputs, dead ends — all cached, all inflating the prompt

The state approach is more expensive turn-by-turn (no massive cache to lean on), but it's predictable. The cost is fixed regardless of session length, and quality doesn't degrade.

Which one is cheaper? It depends on your session pattern:

Pattern	Transcript	State
Short session (< 10 turns)	Cheaper (cache wins)	Slightly more expensive
Long session (20+ turns)	Cheap until compaction → then expensive	Consistently cheap
Cross-session	Context evaporates → full restart	State persists → cheap restart

What this means for building agents

I'm building Monet, an open-source memory platform for AI agents. This token economics analysis pushed me to rethink our architecture:

Don't fight caching — design for it. Structure your agent context so the prefix is stable and cacheable. A fixed schema at the top means every turn reuses it.
Extract signal from noise. Transcripts are mostly debugging noise. Structured state is signal. Fewer tokens, better outputs.
Plan for the cache miss. Your architecture shouldn't require the cache to be cheap. If a cache miss means a 46x cost spike, you've built on sand.
Cross-session continuity is the real bottleneck. Caching helps within a session. State helps across sessions. Both matter.
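The first point, "design for caching", can be sketched as a prompt builder that always puts the unchanging part first. The schema text and function name are assumptions for illustration, not Monet's real prompt layout.

```python
# Fixed text, byte-for-byte identical on every turn, so it can serve as a cacheable prefix.
STATE_SCHEMA = (
    "You are a coding agent.\n"
    "State fields: GOAL, DECISION, OPEN.\n"
)

def build_prompt(state_text: str, user_message: str) -> str:
    """Stable schema first, changing content last."""
    return STATE_SCHEMA + state_text + "\n" + user_message

turn_1 = build_prompt("GOAL: fix login bug", "Start with the failing test.")
turn_2 = build_prompt(
    "GOAL: fix login bug\nDECISION: patch the client",
    "Now update the docs.",
)
```

Anything that varies per turn (timestamps, session ids, the state itself) belongs after the schema; putting it first would change the prefix on every request.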

Token economics isn't just about counting tokens. It's about understanding the hidden structure of how models process them — and designing systems that work with that structure instead of against it.


I'm experimenting with this problem directly through Monet — an open-source platform for AI agents to share and control knowledge at the team level.

I'm looking for pilot partner teams. I'll help you set up Monet for your team, and together we'll find the automation points that fit your workflow. Interested? Leave a comment or open a GitHub Issue.

github.com/team-monet/monet?utm_source=devto&utm_medium=post&utm_campaign=blog-launch

All examples and scenarios in this post are based on real experiences, adapted for the blog format.

Claude Opus Prices Just Crashed 67%. Is Anthropic Still Making Money?

John Lee — Tue, 19 May 2026 11:09:25 +0000

Claude Opus pricing just collapsed. 67% in one year.

Price	Opus 4 (2025)	Opus 4.7 (2026)
Output	$75 / MTok	$25 / MTok
Input	$15 / MTok	$5 / MTok

At this rate, Opus 4.8 will be $15. Maybe $10.

So I got curious: if prices are falling this fast... how much does Anthropic actually make per token? Spent a weekend doing napkin math. It's probably wrong in three places. Please fix it in the comments.

What does one token actually cost?

Rent an H100 GPU: ~$2/hr (committed use discount).

At 500 tokens/sec with batching:

500 tokens/sec × 3,600 sec = 1.8M tokens/hr, so $2/hr ÷ 1.8M tokens/hr ≈ $1.11 per million tokens

Anthropic charges $25.

That's a 23x markup. 💀
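The napkin math, spelled out so the inputs are easy to swap. The rental rate and throughput are the post's assumptions, not measured values.

```python
gpu_cost_per_hour = 2.00     # USD, assumed H100 rental rate from the post
tokens_per_second = 500      # assumed batched throughput from the post
list_price_per_mtok = 25.0   # USD, Opus 4.7 output price from the table above

tokens_per_hour = tokens_per_second * 3600                          # 1.8M tokens
cost_per_mtok = gpu_cost_per_hour / (tokens_per_hour / 1_000_000)   # about $1.11
markup = list_price_per_mtok / cost_per_mtok                        # about 22.5x
```

Halving the throughput assumption to 250 tokens/sec doubles the raw cost to about $2.22 per million tokens, which is why that number is the first thing worth challenging.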

But that's too simple

Add the real costs:

What	Per MTok
Raw GPU	$1.11
Infra overhead (networking, cooling, idle)	$0.44
Training amortization ($300M ÷ 500T tokens)	$0.60
Total unit cost	$2.15

Still. $2.15 to make, $25 to sell. Roughly an 11x markup, right?

Wrong. Nobody pays list price.

Cache hits: $0.50, which is 90% below the $5 input price (98% below the $25 output price)
Batch API: 50% off
Enterprise: negotiated down

My guess: average effective price is ~$15-20/MTok.

Margin: still healthy at ~88%. But thinning fast.
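A quick check of where the ~88% comes from. The unit cost components are the post's estimates, and the effective prices are guesses, so this only shows that the arithmetic hangs together.

```python
# Raw GPU + infra overhead + training amortization, USD per MTok (from the table above).
unit_cost = 1.11 + 0.44 + 0.60

def gross_margin(price: float, cost: float) -> float:
    """Share of the selling price left after unit cost."""
    return (price - cost) / price

# List price plus the guessed effective price range and its midpoint.
margins = {price: gross_margin(price, unit_cost) for price in (15.0, 17.5, 20.0, 25.0)}
```

At the $17.50 midpoint the margin lands just under 88%; at the $25 list price it would be above 91%.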

The dirty secret: the tokenizer tax

Opus 4.7 introduced a "new tokenizer." It uses 35% more tokens for the exact same text.

So that "$25" price tag? For the same work you did on Opus 4, you're actually paying:

$25 × 1.35 = $33.75 effective

The real price drop isn't 67%. It's more like 55%.
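The same adjustment as a few lines of arithmetic. The 35% token inflation is the post's claim and is taken as given here.

```python
old_price = 75.0         # Opus 4 output, USD per MTok
new_price = 25.0         # Opus 4.7 output, USD per MTok
token_inflation = 1.35   # the post's claim: 35% more tokens for the same text

effective_price = new_price * token_inflation   # 33.75 per MTok of Opus 4-sized work
headline_drop = 1 - new_price / old_price       # about 0.67
real_drop = 1 - effective_price / old_price     # 0.55
```

The comparison only holds for identical text; workloads dominated by cached input would see a different effective price.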

Is this intentional margin engineering, or a genuine technical trade-off? You tell me.

So how much does Anthropic actually make?

Per token: ~$15 per million tokens in gross margin (my guess)

Per year: Still burning $1-2 billion

R&D alone is $500M-$1B/yr. A hundred million free users. Safety research. Sales team. The next training run.

Tokens are profitable. The company isn't.

My prediction

Opus 4.8: $15/MTok output. New tokenizer: 50% more tokens.

The headline will say "prices dropped again." Your bill will stay the same.

Tell me where I'm wrong

Is 500 tok/sec per H100 realistic for a frontier MoE model?
What do enterprise contracts actually pay?
Is the 35% tokenizer overhead a margin play or a real trade-off?

If you work in AI infra, cloud pricing, or know Anthropic's real costs — correct me in the comments.

I think about this stuff because I'm experimenting with this problem directly through Monet — an open-source platform for AI agents to share and control knowledge at the team level. Token economics determines what's possible.

github.com/team-monet/monet

Does Your Coding Agent Need Memory?

John Lee — Thu, 14 May 2026 12:04:14 +0000

You start a coding agent. You tell it what you need. It searches the repo, reads a few files, thinks for a moment, and writes the change.

It works.

Then you ask it to do something similar the next day. And it searches the same files again. Reads the same code again. Asks you the same clarifying question you already answered yesterday.

That slowly gets annoying.

This is where memory enters the picture. But before jumping to "just add memory," it is worth asking what memory actually does for a coding agent — and when it is actually useful.

What coding agents usually do

Coding agents are not doing one thing. They write new code, edit existing code, generate tests, refactor modules, and help with bugs, issues, and PRs. Some tasks take two minutes. Some take an afternoon. The scope varies a lot.

But the shape of the work is fairly consistent.

How they do it

A coding agent works through a task roughly like this:

search for the relevant code
read that code
inspect nearby files and dependencies
analyze what the code is doing
plan the change
make the change
review and verify the result

That is the loop. Most agents work turn by turn, but the useful unit for thinking about their memory is the task. A task is where understanding builds up, gets used, and either carries forward or gets lost.

Where memory fits

The first half of that loop — search, read, inspect, analyze — is where the agent spends most of its time understanding things. It reads files, traces dependencies, figures out patterns, and forms an internal picture of what is going on.

Memory sits between that understanding and the next task.

It is not part of the chat. It is not inside the context window. It lives between the code itself and the agent's working context, keeping useful things available after the task ends.

Things worth keeping include:

facts about the codebase
user preferences and conventions
decisions that were already made
known issues and failure patterns
useful procedures and workflows

These are small things individually, but they add up across tasks.

The obvious question: why not just use markdown docs?

Most projects already have README.md, CONTRIBUTING.md, architecture docs, and convention guides. Those files hold the stable project rules. They are easy for humans to read and maintain. They live in the repo, get versioned with Git, and everyone sees the same version.

So if docs already exist, why does a coding agent need memory at all?

Because docs and memory do different jobs.

Docs are human-centered. They store what the team agrees is true — architecture, conventions, shared definitions. They are built to last. They are also slow to update during a task. Nobody wants to open a PR just to record "the agent should look in src/utils/ first when searching for helpers."

Memory is agent-centered. It stores the smaller, task-level things the agent discovers while working. The search path that worked. The file structure quirk that tripped it up last time. The bug pattern it just learned. These are not always worth putting into docs, but they are worth keeping for the next task.

Docs hold the rules. Memory holds the useful leftovers from doing the work.

What is lost without memory

Without memory, every task starts fresh. That means:

explaining the same thing again and again
forgetting project rules the agent already learned
missing user preferences that were stated earlier
re-asking decisions that were already settled
re-reading the same code again and again
repeating old mistakes just to get back to the same insight

The cost is not dramatic in one task. It is the accumulation across tens and hundreds of tasks that adds up. Every re-read, every repeated mistake, every rediscovery of something that was already understood — that is all time and context that could have been saved.

What memory gives back

When memory is present, a few things change:

Context and time are saved. The agent does not restart from zero every time.
Re-reading and rediscovery drop. It already knows where to look and what to expect.
Past insights stay accessible. Something learned last week is available today.
Repeated mistakes decrease. Known failure patterns are recorded and recalled.
Fewer wrong turns. The agent makes better initial guesses about where to search and what to change.
Code changes do not erase everything. Even when code changes, old memory provides a starting point.
Later runs build on earlier ones. Each task can improve on the last instead of repeating it.

In practice, this means the agent spends less time understanding and more time doing. The quality of the first attempt goes up because it has seen similar situations before.

What happens when code changes

One natural concern: if the code changes, won't the memory become wrong?

Yes, sometimes. Old memory can go stale.

But stale memory is still often cheaper than starting over. If the agent remembers "the auth logic lives in src/auth/ and uses JWT," and the code has since moved to src/security/, the memory is stale — but it is still a better starting point than searching the entire repo blind.

The agent can re-check the code, notice the change, update the memory, and save the corrected version. That turns a stale memory into a corrected one. The next run benefits from the correction.

This is the real pattern: memory does not need to be perfect. It just needs to be usable enough that the cost of correcting it is less than the cost of starting from scratch.
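The re-check-and-correct cycle can be sketched as follows. The function name, the fake repo layout, and the idea of storing a directory hint are assumptions for illustration, not how any particular memory system works.

```python
import tempfile
from pathlib import Path

def locate(repo: Path, remembered_dir: str, filename: str) -> tuple[Path, str]:
    """Try the remembered directory first; if stale, search the repo and return a corrected hint."""
    candidate = repo / remembered_dir / filename
    if candidate.exists():
        return candidate, remembered_dir            # memory was still right
    for found in sorted(repo.rglob(filename)):      # memory is stale: fall back to a search
        corrected = found.parent.relative_to(repo).as_posix()
        return found, corrected                     # caller saves the corrected hint
    raise FileNotFoundError(filename)

# Tiny fake repo where the auth code has moved from src/auth to src/security.
repo = Path(tempfile.mkdtemp())
(repo / "src" / "security").mkdir(parents=True)
(repo / "src" / "security" / "jwt.py").write_text("# token checks\n")

path, updated_memory = locate(repo, "src/auth", "jwt.py")
```

The stale hint costs one failed lookup, and the corrected value is what the next run starts from.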

What this could look like for teams

Now imagine this across a team instead of a single agent.

One agent discovers a bug pattern in the payment module. Another agent, working on a different task, runs into the same pattern. In a world without shared memory, the second agent repeats the same debugging steps. With shared memory, it sees the pattern, checks the known fix, and gets back to work.

Shared memory could hold:

team conventions that every agent follows
recurring decisions that should not be re-litigated
project-specific patterns that repeat across tasks
known pitfalls that every agent should avoid

At that point, the system starts to look less like a collection of chatbots and more like a working system. The agents are not just processing individual tasks. They are accumulating useful knowledge as a group.

That is further out. But the path starts with a single agent that remembers.

Memory is not a feature you bolt on to make an agent smarter. It is a way to stop paying for the same understanding over and over again.

The real question is not "does your coding agent need memory?" It is "what understanding are you currently paying to rediscover every time?"

How Are You Managing Your AI's Context Window?

John Lee — Mon, 11 May 2026 12:34:28 +0000

Your AI coding agent has a 200K token context window. Maybe 500K. Maybe a million.

So... what actually changed?

Honestly, I'm still figuring that out. I expected bigger windows to deliver better results. The reality has been more nuanced.

1. The Window Got Bigger. Did Anything Actually Change?

The narrative is seductive: "200K tokens! I can dump my entire codebase in there." "1M tokens? Every issue, every doc, every chat log."

This is like saying "my hard drive is 2TB, so I'll keep every file on my desktop." Technically possible. But do you actually do that?

Research consistently shows that as context windows grow, retrieval accuracy degrades. The "lost in the middle" problem is real — AI pays most attention to the beginning and end, and everything in between fades. Bigger haystacks make needles harder to find.

But here's what I find more interesting: how are we actually using these bigger windows? Model spec comparisons are easy. "200K vs 1M" is a number you can compare. But "how well am I managing my context" has no number. It's invisible. So nobody looks at it.

2. What Actually Happens Inside a Claude Code Session

Here's what I've observed over a few months of using Claude Code with my team. No quantified data — just experiential patterns. If you've done actual measurement on this, I'd honestly love to hear about it.

A typical session has this rhythm:

Context gathering eats up a surprising amount of time. Reading issues. Scanning docs. Exploring the codebase to figure out what's what. It repeats at the start of every session.
Re-verification is weirdly common. My Claude discovers something. Tomorrow, my Claude (or my teammate's Claude) re-discovers the same thing. Not because the AI isn't capable. Because the AIs don't share memory.
Actual problem solving gets less time than you'd think. After the first two phases, you finally get to the work you opened the session for.

Here's what matters: this isn't waste because the AI isn't smart enough. It's waste because the AIs don't share what they know. We've built incredible systems for CI/CD, code review, documentation. But when it comes to how our AI agents share knowledge as a team? Almost nothing.

What about your team?

3. Three Patterns I Keep Seeing

1. The Dump Truck

"I have 200K tokens. Here's every file in the repo, 47 issues, the company handbook. Go."

I get it. You don't know what's relevant ahead of time. The temptation to "just put everything in" is real.

But then your AI is reasoning against mostly irrelevant context. Finding patterns in noise. Confidently proposing solutions to problems you don't have. Unnecessary noise eventually eats away at reasoning quality.

I did this early on. Still catch myself doing it. I haven't found a perfect solution — but just being aware of the pattern has helped.

2. Groundhog Day

"Our project uses pnpm workspaces. Auth is in packages/auth. Don't touch legacy/. Alice owns deployments."

Your human colleagues learned this on day one. Your AI has to re-learn it every single session.

If a human teammate asked you to re-explain the project structure every morning before they could start working, you'd have a serious conversation. But we accept this from AI without question. Why haven't we automated this repetition away yet?

3. The Genius Silo

This is the most fascinating one. And the most unsettling.

Same Claude model. Wildly different outcomes. When a senior engineer who knows the product's bones by heart picks up Claude, the AI becomes a "genius." The codebase's history, known landmines, unwritten conventions — all this invisible context dissolves into the AI's reasoning. Sessions are fast, almost magical.

When a junior engineer with less context picks up the exact same Claude, they come back empty-handed. Their Claude re-discovers, from scratch, what the senior's Claude figured out months ago. Burns tokens. Burns time. Builds frustration.

Here's what this means: AI, as a tool, isn't lifting the team's collective productivity. It's trapped in individual silos of personal experience. The senior gets faster and faster. The junior stays stuck. Claude has become a personal assistant, not a team tool.

And the team lead sees none of this. Doesn't know what the senior's Claude knows. Doesn't know what the junior's Claude is painfully re-learning. These invisible walls are completely hidden.

Is this happening on your team too? Or have you found a different way?

4. What I've Been Trying (Hypothesis Stage)

After months of experimenting, I've roughly settled on four principles. These are working hypotheses — if you've found better approaches, I genuinely want to hear them.

1. Relevance Over Volume

I stopped asking "how much can I fit?" and started asking "what actually matters right now?"

A small, well-curated context beats a massive dump. I'm convinced of this through experience. What "well-curated" actually means in practice — still experimenting.
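
One rough way to picture "relevance over volume" is a budgeted selection: score each candidate note against the current task, then keep the best-scoring notes until a token budget runs out. The sketch below is an illustration only, not how Monet or any particular tool works. The `Note` type, the word-overlap score, and the 4-characters-per-token estimate are all assumptions chosen to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """A hypothetical unit of stored context."""
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def relevance(task: str, note: Note) -> float:
    # Toy score: fraction of task words that also appear in the note.
    task_words = set(task.lower().split())
    note_words = set(note.text.lower().split())
    if not task_words:
        return 0.0
    return len(task_words & note_words) / len(task_words)


def select_context(task: str, notes: list[Note], budget_tokens: int) -> list[Note]:
    """Keep the most relevant notes that fit the budget; drop zero-score notes."""
    ranked = sorted(notes, key=lambda n: relevance(task, n), reverse=True)
    chosen: list[Note] = []
    used = 0
    for note in ranked:
        if relevance(task, note) == 0.0:
            break  # the list is sorted, so everything after this is irrelevant too
        cost = estimate_tokens(note.text)
        if used + cost <= budget_tokens:
            chosen.append(note)
            used += cost
    return chosen
```

A real scorer would probably use embeddings or usage history rather than word overlap. The point is only that the budget is spent in order of relevance instead of being filled with whatever is lying around.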

2. Persistence Over Repetition

When my AI discovers something valuable — a pattern, a gotcha, an insight — I try not to let it die with the session.

At the end of each Claude Code session, I ask myself: "What did my Claude learn today that my teammate's Claude should know tomorrow?" It's not perfect, but it has saved the opening minutes of my next session more times than I can count.
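
To make that end-of-session habit concrete, here is a minimal sketch, assuming learnings are stored as JSON Lines in a file the team shares. The `Learning` shape and the function names are hypothetical, and nothing here is tied to Claude Code itself.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Learning:
    """One thing a session discovered (hypothetical shape)."""
    topic: str
    insight: str
    source_session: str


def dump_learnings(learnings: list[Learning]) -> str:
    """Serialize learnings as JSON Lines, ready to append to a shared file."""
    return "".join(json.dumps(asdict(item)) + "\n" for item in learnings)


def load_learnings(jsonl: str) -> list[Learning]:
    """Parse JSON Lines back into Learning records, skipping blank lines."""
    return [Learning(**json.loads(line)) for line in jsonl.splitlines() if line.strip()]


def session_preamble(learnings: list[Learning]) -> str:
    """Render stored learnings as a short block for the start of the next session."""
    lines = ["Known project context:"]
    lines += [f"- [{item.topic}] {item.insight}" for item in learnings]
    return "\n".join(lines)
```

Appending the output of `dump_learnings` to a file in the repo and feeding `session_preamble` into the next session is about the simplest version of "don't let it die with the session."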

3. Domain Sync

Transplanting the senior engineer's business context into the AI's baseline assumptions.

When a senior tells their Claude "this component is perf-critical, O(n²) won't fly," that judgment has months of domain knowledge baked into it. Domain Sync is about making that knowledge accessible to every teammate's Claude.

It's about converting individual expertise into the team's prompt assets. How far this can be automated — I don't know yet. But the direction feels right.
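
One possible shape for such a prompt asset, sketched under assumptions: each piece of senior judgment is attached to a path glob, and any teammate's session pulls in the notes that match the files it is about to touch. `DomainNote`, `notes_for`, and the `src/render/*` path are hypothetical names for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class DomainNote:
    """A piece of domain judgment scoped to part of the codebase (hypothetical)."""
    path_glob: str  # which files the note applies to
    note: str       # the judgment itself
    author: str     # who it came from, so it can be questioned later


def notes_for(paths: list[str], notes: list[DomainNote]) -> list[DomainNote]:
    """Return the notes whose glob matches at least one of the touched paths."""
    return [n for n in notes if any(fnmatch(p, n.path_glob) for p in paths)]
```

Scoping by path keeps the senior's "O(n²) won't fly" warning out of sessions that never go near that component, which matters for the relevance principle above.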

4. Routinized Results Verification

Not blindly trusting the AI's output. Systematically filtering it through past incidents and accumulated history.

A senior developer, reviewing Claude's code, unconsciously checks: "We had a similar PR that broke tests last time..." "This pattern looks like the one that caused the outage last year..." This filtering instinct — knowing how to reject well — is what truly separates seniors from juniors.

The problem: this filtering instinct has remained private, tacit knowledge. How do we turn "knowing how to ask well" into "knowing how to filter well" — and make that filtering instinct a baseline routine for every team Claude? This is what I'm most preoccupied with lately.
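
A first, very rough approximation of that routine could be a list of past incidents, each with a pattern to watch for, checked against the lines a change adds. This is a sketch under assumptions: the `Incident` type, the regex-based matching, and the example incident in the test are hypothetical, and real senior judgment is obviously richer than a regex.

```python
import re
from dataclasses import dataclass


@dataclass
class Incident:
    """A past failure the team wants every review to remember (hypothetical)."""
    pattern: str  # regex that flags code resembling the incident
    lesson: str   # what went wrong last time


def review_against_history(diff_text: str, incidents: list[Incident]) -> list[str]:
    """Return one warning per added diff line that matches a known incident."""
    warnings: list[str] = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header of a unified diff.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for incident in incidents:
            if re.search(incident.pattern, line):
                warnings.append(f"{incident.lesson}: {line[1:].strip()}")
    return warnings
```

The warnings can go back to the agent as review comments, so the "we had a similar PR that broke tests last time" check no longer depends on which human happens to be reviewing.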

5. What I Actually Want to Know — Let's Think Together

With context windows exploding toward infinity, are we falling into the quantity trap while losing sight of quality?

What actually determines real-world productivity isn't benchmark scores. It's the quality of context optimized for your specific product. But that doesn't show up on any benchmark. So nobody looks at it. So I'm asking.

Four Questions

1. Experience Replication:
Is your senior engineer's AI know-how and business context being transferred to other team members — or is it trapped inside individual chat windows? How many Genius Silos exist on your team?

2. The Noise Paradox:
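As windows grow bigger, AI paradoxically loses the plot. This is the "lost in the middle" effect: details buried in the middle of a long context get used less reliably than details near the start or end. What filtering are you doing to counter this? Not just "use less context": are there smarter ways to structure it?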

3. Knowledge Expiration:
In the "store it and forget it" pile, is stale, contaminated context quietly poisoning your AI's judgment? Is last year's "never touch legacy/" silently overriding this year's "migration complete, it's safe now"?

4. Building the Team Brain:
Is your team's AI getting smarter over time — or stuck in an endless loop of Groundhog Day explanations? Do you have any way to tell?
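
On question 3 (Knowledge Expiration), here is one shape an answer could take, as a sketch under assumptions: every stored rule carries a review date and can explicitly supersede an older rule. The `Rule` fields, the rule ids, and the dates in the test are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Rule:
    """A stored piece of team context with an expiry check (hypothetical)."""
    rule_id: str
    text: str
    review_by: date                   # after this date, a human should re-confirm it
    supersedes: Optional[str] = None  # id of an older rule this one replaces


def active_rules(rules: list[Rule], today: date) -> tuple[list[Rule], list[Rule]]:
    """Split rules into (live, stale), dropping anything that was superseded."""
    superseded = {r.supersedes for r in rules if r.supersedes}
    live: list[Rule] = []
    stale: list[Rule] = []
    for rule in rules:
        if rule.rule_id in superseded:
            continue  # a newer rule replaced this one; never show it to the agent
        if rule.review_by < today:
            stale.append(rule)  # past its review date: flag it instead of trusting it
        else:
            live.append(rule)
    return live, stale
```

With this, last year's "never touch legacy/" disappears as soon as "migration complete, it's safe now" is recorded as its replacement, and anything nobody has re-confirmed lands in a review pile instead of the prompt.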

6. Closing

I've been staring at this problem for months. Building tools. Running experiments with my team. But I don't have the answers. I'm still experimenting.

So I'm asking: how are you managing your AI coding agent's context window?

Ideas? I want them. Disagreements? Even better. If your experience is "dumping everything into a big window works fine for us," I genuinely want to hear about that too. Let's figure this out together.

I'm experimenting with this problem directly through Monet — an open-source platform for AI agents to share and control knowledge at the team level.

All examples and scenarios in this post are based on real experiences, adapted for the blog format.