<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Lee</title>
    <description>The latest articles on DEV Community by John Lee (@johnonlee).</description>
    <link>https://dev.to/johnonlee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924610%2Fe3db9d16-677c-4971-9de6-071496991f48.jpeg</url>
      <title>DEV Community: John Lee</title>
      <link>https://dev.to/johnonlee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/johnonlee"/>
    <language>en</language>
    <item>
      <title>Why We Need Behavioral Benchmarks for LLMs — Not Just More Knowledge Tests</title>
      <dc:creator>John Lee</dc:creator>
      <pubDate>Tue, 26 May 2026 11:24:59 +0000</pubDate>
      <link>https://dev.to/johnonlee/why-we-need-behavioral-benchmarks-for-llms-not-just-more-knowledge-tests-490f</link>
      <guid>https://dev.to/johnonlee/why-we-need-behavioral-benchmarks-for-llms-not-just-more-knowledge-tests-490f</guid>
      <description>&lt;p&gt;&lt;strong&gt;Would you hire an engineer based on their SAT score?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course not. You look at how they solve problems. How they handle ambiguity. Whether they adapt when their first approach fails. You're evaluating behavior, not just knowledge.&lt;/p&gt;

&lt;p&gt;Yet somehow, this is exactly what we do with LLMs. We test them like students — multiple choice, fill in the blank, write a function from a spec — and call it "evaluation." We rank models by MMLU scores and HumanEval pass rates as if those numbers tell us everything we need to know.&lt;/p&gt;

&lt;p&gt;They don't. Here's why.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are We Actually Measuring?
&lt;/h2&gt;

&lt;p&gt;Let's look at three of the most widely used LLM benchmarks. Not at their scores, but at what they actually measure.&lt;/p&gt;

&lt;h3&gt;
  
  
  MMLU: The Encyclopedia Test
&lt;/h3&gt;

&lt;p&gt;MMLU gives an LLM multiple-choice questions across 57 subjects, including law, medicine, and philosophy. Pick the right answer from four options. That's it.&lt;/p&gt;

&lt;p&gt;What it measures: breadth of knowledge. How much the model has memorized.&lt;/p&gt;

&lt;p&gt;What it doesn't measure: whether the model knows when to apply that knowledge. Whether it can tell the difference between a situation that needs legal reasoning and one that just needs common sense. Whether it knows what it doesn't know.&lt;/p&gt;

&lt;p&gt;It's a driving written test. Passing it doesn't mean you can drive.&lt;/p&gt;

&lt;h3&gt;
  
  
  HumanEval: The Coding Interview Problem
&lt;/h3&gt;

&lt;p&gt;HumanEval shows a function signature and a docstring. The model fills in the body. If the code passes the test cases on the first try, it's a pass. This is measured as pass@1 — first-attempt pass rate.&lt;/p&gt;

&lt;p&gt;What it measures: can the model translate a spec into working code in one shot?&lt;/p&gt;

&lt;p&gt;What it doesn't measure: what happens when the test fails? Does the model debug systematically or flail randomly? If there's an existing codebase with conflicting patterns, does it notice? Does it know when to refactor instead of patching?&lt;/p&gt;

&lt;p&gt;One function. One attempt. That's not how software gets built.&lt;/p&gt;

&lt;h3&gt;
  
  
  SWE-bench: The First-Day Assignment
&lt;/h3&gt;

&lt;p&gt;SWE-bench is the most realistic of the three. It gives the model a real GitHub issue and access to the full repository. The task: produce a patch that resolves the issue. Evaluation is binary — the repo's test suite either passes or it doesn't.&lt;/p&gt;

&lt;p&gt;What it measures: can the model navigate a real codebase and fix a real bug?&lt;/p&gt;

&lt;p&gt;What it doesn't measure: anything about the approach path. Did the model grep for the right files efficiently, or did it read half the repository first? Did it understand the existing architecture, or did it brute-force a patch that works but violates every design pattern in the project? Did it learn something from this issue that it could apply to the next one?&lt;/p&gt;

&lt;p&gt;SWE-bench evaluates the destination, not the journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern: All Three Measure "First Impressions"
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;What it misses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU&lt;/td&gt;
&lt;td&gt;Knowledge recall&lt;/td&gt;
&lt;td&gt;Application judgment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval&lt;/td&gt;
&lt;td&gt;First-pass coding&lt;/td&gt;
&lt;td&gt;Debugging, iteration, adaptation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench&lt;/td&gt;
&lt;td&gt;One-shot bug fixing&lt;/td&gt;
&lt;td&gt;Approach path, cross-session learning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These benchmarks share a fundamental assumption: &lt;strong&gt;evaluation happens once, in a single session, with a single correct answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But real AI coding agents don't work that way. They work across sessions. They learn from yesterday's mistakes. They reuse context from last week's debugging session. The quality of their work depends not just on what they know, but on how they behave over time.&lt;/p&gt;

&lt;p&gt;This isn't a knowledge problem. It's a behavior problem. And no amount of harder questions on MMLU-Pro will solve it.&lt;/p&gt;




&lt;h2&gt;
  
  
  We Hire Humans by Behavior. Why Do We Test LLMs by Knowledge?
&lt;/h2&gt;

&lt;p&gt;Think about how you hire an engineer.&lt;/p&gt;

&lt;p&gt;You glance at their GPA. You look at their GitHub. Maybe you give them a take-home assignment. But none of that is the deciding factor.&lt;/p&gt;

&lt;p&gt;The deciding factor comes from the interview. And what do you ask?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Tell me about the hardest technical decision you made last year."&lt;/li&gt;
&lt;li&gt;"Walk me through a time you disagreed with a teammate and how you resolved it."&lt;/li&gt;
&lt;li&gt;"Here's a problem. Show me how you'd think about it — not the answer, the thinking."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are behavioral questions. They don't measure what the candidate knows. They measure how the candidate operates. And they work because past behavior predicts future performance.&lt;/p&gt;

&lt;p&gt;Now look at LLM evaluation. Where are the behavioral questions?&lt;/p&gt;

&lt;p&gt;There aren't any. We're stuck at the "checking GPA" stage, watching every model score in the 90th percentile and pretending that tells us something useful about how they'll perform on real work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Same Problem, Different Minds
&lt;/h2&gt;

&lt;p&gt;Here's what behavioral evaluation actually looks like.&lt;/p&gt;

&lt;p&gt;Take the same bug ticket and give it to three different models. Don't just check who fixes it — watch how they approach it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt; reads the ticket and immediately greps for the relevant code. Within 30 seconds, it has a first patch. It's fast, intuitive, pattern-matching. This model would thrive in rapid prototyping — where speed and gut instinct matter more than architectural rigor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model B&lt;/strong&gt; starts by decomposing the ticket into three sub-tasks. It reproduces each one independently before attempting any fix. It's methodical, structured, systematic. This model belongs on complex architecture work — where missing an edge case costs weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model C&lt;/strong&gt; searches git log for similar issues first. It studies existing patches to understand the codebase's conventions before writing anything. It's cautious, precedent-driven, learning from history. This model fits maintenance and bug fixing — where consistency with existing patterns matters more than clever solutions.&lt;/p&gt;

&lt;p&gt;All three models fix the bug. Their scores are identical. But their behavioral profiles are completely different. And that difference determines which role each model is actually suited for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is what behavioral benchmarks should measure.&lt;/strong&gt; Not "did the model solve the problem?" but "how did the model solve it?" — and what does that tell us about where it belongs?&lt;/p&gt;




&lt;h2&gt;
  
  
  A Proposal: Behavioral Benchmarks
&lt;/h2&gt;

&lt;p&gt;I should be clear: this is a proposal, not an established framework. I'm not citing a paper because there isn't one. (Though interestingly, an April 2026 preprint by Tang et al. &lt;a href="https://arxiv.org/abs/2605.12530" rel="noopener noreferrer"&gt;argues for "in-situ behavioral evaluation" for LLM fairness&lt;/a&gt; — suggesting the idea is in the air.) If I'm wrong about any of this, I hope you'll correct me in the comments.&lt;/p&gt;

&lt;p&gt;Here's the definition I'm working with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Behavioral Benchmark is an evaluation framework that profiles how an LLM approaches problems — its cognitive patterns — rather than just scoring the correctness of its answers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where existing benchmarks ask "how many did it get right?", behavioral benchmarks ask "what kind of thinker is this?"&lt;/p&gt;

&lt;p&gt;I propose four dimensions to observe:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Observation Question&lt;/th&gt;
&lt;th&gt;What It Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decomposition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it jump straight to execution, or break the problem down first?&lt;/td&gt;
&lt;td&gt;Top-down architect vs. bottom-up executor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it search for similar patterns, or reason from first principles?&lt;/td&gt;
&lt;td&gt;Maintenance engineer vs. innovator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When stuck, does it change strategy or double down on the same path?&lt;/td&gt;
&lt;td&gt;Adaptive vs. persistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does it show the same approach pattern across similar problems?&lt;/td&gt;
&lt;td&gt;Predictable vs. creative&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
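&lt;p&gt;To make the four dimensions concrete, here is a minimal sketch of how observations might be recorded and how the Consistency dimension could be scored. Everything in it is an assumption for illustration: the &lt;code&gt;RunObservation&lt;/code&gt; fields, the boolean encoding, and the "share of runs matching the most common pattern" metric are hypothetical, not an established method.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunObservation:
    """One observed run of a model on a task (all fields are hypothetical)."""
    decomposed_first: bool     # Decomposition: broke the problem down before acting
    searched_precedent: bool   # Approach: looked for similar patterns first
    changed_strategy: bool     # Recovery: switched strategy when stuck


def consistency(runs: list[RunObservation]) -> float:
    """Share of runs that match the most common behavior pattern (1.0 = fully predictable)."""
    if not runs:
        raise ValueError("need at least one run")
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)


runs = [
    RunObservation(True, False, True),
    RunObservation(True, False, True),
    RunObservation(False, True, False),
    RunObservation(True, False, True),
]
print(consistency(runs))  # 3 of the 4 runs share one pattern
```

&lt;p&gt;A real profile would likely need richer signals than three booleans, but even this toy version shows that two models with identical pass rates can produce different numbers.&lt;/p&gt;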

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MMLU asks: "What does this candidate know?"&lt;/li&gt;
&lt;li&gt;Behavioral benchmarks ask: "How does this candidate work?"&lt;/li&gt;
&lt;li&gt;And that second question determines role fit.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Now
&lt;/h2&gt;

&lt;p&gt;In 2026, coding agents aren't demos anymore. They're daily tools on real engineering teams. And teams are starting to ask questions that our benchmarks can't answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Which model should I use for our legacy codebase maintenance?"&lt;/li&gt;
&lt;li&gt;"Our junior devs need a pair programmer — which model's debugging style fits them?"&lt;/li&gt;
&lt;li&gt;"We need consistency. Which model produces the most predictable behavior week over week?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are role-fit questions. Hiring questions. And we're trying to answer them with SAT scores.&lt;/p&gt;

&lt;p&gt;The race for smarter models is maturing. The next frontier isn't a higher MMLU score — it's understanding what each model is actually good for. And we can't get there without behavioral evaluation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Define This Together
&lt;/h2&gt;

&lt;p&gt;I don't think I've nailed this. The four dimensions I proposed are a starting point, not a destination. Maybe there are better axes. Maybe the whole framing is wrong and someone smarter has already solved this.&lt;/p&gt;

&lt;p&gt;Here are a few things I'm probably wrong about — please correct me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decomposition style is a stable trait of a model, not just a reflection of the prompt&lt;/li&gt;
&lt;li&gt;Recovery behavior can be measured without also measuring the harness/framework around the model&lt;/li&gt;
&lt;li&gt;Consistency across sessions is more important for team adoption than raw capability&lt;/li&gt;
&lt;li&gt;Role-fit evaluation will eventually matter more than accuracy benchmarks for enterprise adoption&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building coding agents, evaluating models, or just frustrated that your "top-ranked" LLM doesn't behave the way you expected — I want to hear from you. What behavioral dimensions matter on your team?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm thinking about this while building &lt;a href="https://github.com/team-monet/monet?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=blog-launch" rel="noopener noreferrer"&gt;Monet&lt;/a&gt; — an open-source platform for AI agents to share and control knowledge at the team level.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All examples and scenarios in this post are based on real experiences, adapted for the blog format.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Token Economics: The Real Cost of AI Coding Agents</title>
      <dc:creator>John Lee</dc:creator>
      <pubDate>Thu, 21 May 2026 12:37:45 +0000</pubDate>
      <link>https://dev.to/johnonlee/token-economics-the-real-cost-of-ai-coding-agents-3a92</link>
      <guid>https://dev.to/johnonlee/token-economics-the-real-cost-of-ai-coding-agents-3a92</guid>
      <description>&lt;h2&gt;
  
  
  How prompt caching actually works
&lt;/h2&gt;

&lt;p&gt;When an LLM processes your input, it doesn't just read and forget. For tokens that appear in the same position across multiple requests, the model can reuse its previous computation. This is called &lt;strong&gt;prefix caching&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1: [System Prompt] [Conversation Turn 1] [Turn 2]
           └── 260K tokens computed from scratch ──┘
           Cost: expensive

Request 2: [System Prompt] [Conversation Turn 1] [Turn 3]
           └──── 255K tokens → CACHE HIT! ────┘├── 5K new ──┤
           Cost: nearly free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch? Only the &lt;strong&gt;prefix&lt;/strong&gt; — tokens from the start that match exactly — benefit from caching. Change one token at the beginning, and the entire cache is invalidated.&lt;/p&gt;

&lt;p&gt;This is why my 4:20 PM request (300K input, $0.0096) was so cheap — 295K of those tokens were cached from previous turns. And why my 9:20 AM request (257K, $0.4455) was so expensive — it was a fresh session with zero cache.&lt;/p&gt;
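&lt;p&gt;The prefix rule can be sketched in a few lines. This is a simplified model, assuming requests are plain lists of token IDs and that the cache matches token by token. Real providers typically cache in larger blocks, and the function name &lt;code&gt;cached_prefix_length&lt;/code&gt; and the token values are made up for illustration.&lt;/p&gt;

```python
from itertools import takewhile


def cached_prefix_length(previous: list[int], current: list[int]) -> int:
    """Count leading tokens shared by two requests; only these can hit a prefix cache."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(previous, current))
    return sum(1 for _ in matching)


system_prompt = [101, 102, 103]
turn_1 = [7, 8]
request_1 = system_prompt + turn_1
request_2 = system_prompt + turn_1 + [9, 10]     # appended turn: the old request is a prefix
request_3 = [999] + system_prompt[1:] + turn_1   # first token changed: nothing matches

print(cached_prefix_length(request_1, request_2))  # 5
print(cached_prefix_length(request_1, request_3))  # 0
```

&lt;p&gt;Appending to the end keeps every earlier token reusable, while a single change at the front drops the reusable count to zero.&lt;/p&gt;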




&lt;h2&gt;
  
  
  The transcript trap
&lt;/h2&gt;

&lt;p&gt;Most coding agents today use what I call the "transcript" approach: every turn appends the latest exchange to the conversation history and sends the entire thing back to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1:  17K tokens → cache miss → $0.029
Turn 2:  22K tokens → 17K cached → $0.0007
Turn 3:  27K tokens → 22K cached → $0.0008
...
Turn 10: 62K tokens → 57K cached → $0.0019
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks great. The marginal cost per turn is tiny because 90%+ of tokens are cached. The transcript approach is, economically speaking, a &lt;strong&gt;cache lottery&lt;/strong&gt; — and while the session stays alive, you keep winning.&lt;/p&gt;

&lt;p&gt;But here's the problem: sessions don't stay alive forever.&lt;/p&gt;

&lt;p&gt;Context windows fill up. Compaction kicks in. Cache TTLs expire (usually 5–10 minutes). When any of these happens, your next request is a cache miss, and suddenly you're paying the full 46x penalty: the gap between my $0.4455 cold request and my $0.0096 cached one.&lt;/p&gt;

&lt;p&gt;That 9:20 AM spike? That was compaction. The session crossed the context window limit, Hermes compressed the history into a summary, and the next request started fresh. $0.44 for one turn.&lt;/p&gt;
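&lt;p&gt;Here is a rough cost model for the warm-versus-cold gap. The price and discount below are placeholders, not any provider's actual rates, so the ratio it prints will not match my 46x figure. The point is only the shape: the same request gets much more expensive the moment the cached share drops to zero.&lt;/p&gt;

```python
def request_cost(total_tokens: int, cached_tokens: int,
                 price_per_mtok: float, cached_discount: float) -> float:
    """Input cost in dollars when cached tokens are billed at a discounted rate."""
    fresh_tokens = total_tokens - cached_tokens
    cached_price = price_per_mtok * (1 - cached_discount)
    return (fresh_tokens * price_per_mtok + cached_tokens * cached_price) / 1_000_000


PRICE = 2.0      # hypothetical dollars per million uncached input tokens
DISCOUNT = 0.9   # hypothetical: cached tokens cost 90% less

warm = request_cost(260_000, 255_000, PRICE, DISCOUNT)  # session alive, prefix cached
cold = request_cost(260_000, 0, PRICE, DISCOUNT)        # after compaction or TTL expiry
print(f"warm ${warm:.4f}, cold ${cold:.4f}, ratio {cold / warm:.1f}x")
```

&lt;p&gt;The deeper the cache discount and the larger the cached share, the bigger the spike when the cache is lost.&lt;/p&gt;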




&lt;h2&gt;
  
  
  A different approach: structured state
&lt;/h2&gt;

&lt;p&gt;What if, instead of sending the entire conversation transcript, you sent only a structured summary of what matters?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1:  [State]  →  3K tokens → cache miss → $0.005
Turn 2:  [State]  →  3K tokens → 1K cached  → $0.0001
Turn 3:  [State]  →  3K tokens → 1K cached  → $0.0001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only is the first turn cheaper (3K vs 17K), but the cached portion — the state schema itself — is too small to ever expire meaningfully. And when a session inevitably ends? The next session starts at 3K again, not 17K.&lt;/p&gt;

&lt;p&gt;I tested this with a real 44-turn debugging session. The transcript was 3,777 tokens. The extracted state: 740 tokens. An &lt;strong&gt;80.4% reduction&lt;/strong&gt; in prompt tokens — and the state-based agent produced higher-quality code with better structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real economics
&lt;/h2&gt;

&lt;p&gt;The transcript approach looks cheaper turn-by-turn because caching hides the cost. But it's fragile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache TTL:&lt;/strong&gt; 5–10 minutes of inactivity and you lose it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context limits:&lt;/strong&gt; Long sessions get compacted, breaking the cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; Noise accumulates. Debugging chatter, tool outputs, dead ends — all cached, all inflating the prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The state approach is more expensive turn-by-turn (no massive cache to lean on), but it's predictable. The cost is fixed regardless of session length, and quality doesn't degrade.&lt;/p&gt;

&lt;p&gt;Which one is cheaper? It depends on your session pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Transcript&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Short session (&amp;lt; 10 turns)&lt;/td&gt;
&lt;td&gt;Cheaper (cache wins)&lt;/td&gt;
&lt;td&gt;Slightly more expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long session (20+ turns)&lt;/td&gt;
&lt;td&gt;Cheap until compaction → then expensive&lt;/td&gt;
&lt;td&gt;Consistently cheap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-session&lt;/td&gt;
&lt;td&gt;Context evaporates → full restart&lt;/td&gt;
&lt;td&gt;State persists → cheap restart&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What this means for building agents
&lt;/h2&gt;

&lt;p&gt;I'm building Monet, an open-source memory platform for AI agents. This token economics analysis pushed me to rethink our architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't fight caching — design for it.&lt;/strong&gt; Structure your agent context so the prefix is stable and cacheable. A fixed schema at the top means every turn reuses it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract signal from noise.&lt;/strong&gt; Transcripts are mostly debugging noise. Structured state is signal. Fewer tokens, better outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan for the cache miss.&lt;/strong&gt; Your architecture shouldn't require the cache to be cheap. If a cache miss means a 46x cost spike, you've built on sand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-session continuity is the real bottleneck.&lt;/strong&gt; Caching helps within a session. State helps across sessions. Both matter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
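&lt;p&gt;The first point, a stable prefix, might look like this in practice. This is a sketch rather than Monet's actual code: &lt;code&gt;STATE_SCHEMA&lt;/code&gt;, &lt;code&gt;build_prompt&lt;/code&gt;, and the state fields are hypothetical names chosen for the example.&lt;/p&gt;

```python
import json

# Hypothetical fixed schema text; it never changes between turns, so it stays cacheable.
STATE_SCHEMA = "You are a coding agent. State fields: goal, decisions, open_issues."


def build_prompt(state: dict, user_message: str) -> str:
    """Put the stable schema first and the volatile parts last to keep the prefix identical."""
    state_block = json.dumps(state, sort_keys=True)  # deterministic key order
    return "\n\n".join([STATE_SCHEMA, state_block, user_message])


turn_1 = build_prompt({"goal": "fix login bug", "decisions": []}, "Where is the auth code?")
turn_2 = build_prompt({"goal": "fix login bug", "decisions": ["use src/auth"]}, "Patch it.")

# Both prompts begin with the same schema text, so that part can be a cache hit.
print(turn_1.startswith(STATE_SCHEMA), turn_2.startswith(STATE_SCHEMA))
```

&lt;p&gt;Sorting the keys matters more than it looks: if serialization order varied between turns, even unchanged state would stop matching byte for byte.&lt;/p&gt;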




&lt;p&gt;Token economics isn't just about counting tokens. It's about understanding the hidden structure of how models process them — and designing systems that work with that structure instead of against it.&lt;/p&gt;


&lt;p&gt;I'm experimenting with this problem directly through Monet — an open-source platform for AI agents to share and control knowledge at the team level.&lt;/p&gt;

&lt;p&gt;I'm looking for pilot partner teams. I'll help you set up Monet for your team, and together we'll find the automation points that fit your workflow. Interested? Leave a comment or open a GitHub Issue.&lt;/p&gt;

&lt;p&gt;github.com/team-monet/monet?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=blog-launch&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All examples and scenarios in this post are based on real experiences, adapted for the blog format.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Opus Prices Just Crashed 67%. Is Anthropic Still Making Money?</title>
      <dc:creator>John Lee</dc:creator>
      <pubDate>Tue, 19 May 2026 11:09:25 +0000</pubDate>
      <link>https://dev.to/johnonlee/claude-opus-prices-just-crashed-67-is-anthropic-still-making-money-173c</link>
      <guid>https://dev.to/johnonlee/claude-opus-prices-just-crashed-67-is-anthropic-still-making-money-173c</guid>
      <description>&lt;p&gt;Claude Opus pricing just collapsed. &lt;strong&gt;67% in one year.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Opus 4 (2025)&lt;/th&gt;
&lt;th&gt;Opus 4.7 (2026)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$75 / MTok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$25 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;$15 / MTok&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$5 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this rate, Opus 4.8 will be $15. Maybe $10.&lt;/p&gt;

&lt;p&gt;So I got curious: if prices are falling this fast... &lt;strong&gt;how much does Anthropic actually make per token?&lt;/strong&gt; Spent a weekend doing napkin math. It's probably wrong in three places. Please fix it in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  What does one token actually cost?
&lt;/h2&gt;

&lt;p&gt;Rent an H100 GPU: &lt;strong&gt;~$2/hr&lt;/strong&gt; (committed use discount).&lt;/p&gt;

&lt;p&gt;At 500 tokens/sec with batching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;500 tokens/sec × 3,600 sec/hr = 1.8M tokens/hr
$2/hr ÷ 1.8M tokens/hr ≈ $1.11 per million tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anthropic charges $25.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's a 23x markup.&lt;/strong&gt; 💀&lt;/p&gt;
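&lt;p&gt;If you want to plug in your own numbers, the napkin math fits in one small function. The $2/hr rate and 500 tokens/sec throughput are my guesses from above, and the function name is just for illustration.&lt;/p&gt;

```python
def gpu_cost_per_mtok(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Raw GPU cost in dollars to generate one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


raw_cost = gpu_cost_per_mtok(gpu_dollars_per_hour=2.0, tokens_per_second=500)
markup = 25 / raw_cost
print(f"${raw_cost:.2f} per MTok, {markup:.1f}x markup at $25 list price")
```

&lt;p&gt;With these inputs the markup comes out at about 22.5x, which I rounded to 23x. Halve the throughput and the raw cost doubles, so the whole estimate is only as good as that 500 tokens/sec guess.&lt;/p&gt;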




&lt;h2&gt;
  
  
  But that's too simple
&lt;/h2&gt;

&lt;p&gt;Add the real costs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Per MTok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw GPU&lt;/td&gt;
&lt;td&gt;$1.11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infra overhead (networking, cooling, idle)&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training amortization ($300M ÷ 500T tokens)&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total unit cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2.15&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Still. $2.15 to make, $25 to sell. &lt;strong&gt;Nearly a 12x markup&lt;/strong&gt;, right?&lt;/p&gt;

&lt;p&gt;Wrong. Nobody pays list price.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hits: 98% cheaper ($0.50)&lt;/li&gt;
&lt;li&gt;Batch API: 50% off&lt;/li&gt;
&lt;li&gt;Enterprise: negotiated down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My guess: &lt;strong&gt;average effective price is ~$15-20/MTok.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Margin: still healthy at ~88%. But thinning fast.&lt;/p&gt;
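&lt;p&gt;Here is the margin calculation spelled out, using the unit costs from the table above. The effective prices are my guesses, not published figures, so treat the output as a sensitivity check rather than a result.&lt;/p&gt;

```python
UNIT_COSTS = {          # dollars per million tokens, taken from the table above
    "raw_gpu": 1.11,
    "infra_overhead": 0.44,
    "training_amortization": 0.60,
}


def gross_margin(effective_price: float, unit_cost: float) -> float:
    """Fraction of the selling price left after paying the unit cost."""
    return (effective_price - unit_cost) / effective_price


unit_cost = sum(UNIT_COSTS.values())
for price in (25.0, 20.0, 15.0):   # list price, then both ends of the guessed range
    print(f"${price:.0f}/MTok: margin {gross_margin(price, unit_cost):.0%}")
```

&lt;p&gt;The ~88% figure corresponds to the midpoint of my guessed range, about $17.50 per MTok.&lt;/p&gt;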




&lt;h2&gt;
  
  
  The dirty secret: the tokenizer tax
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 introduced a "new tokenizer." It uses &lt;strong&gt;35% more tokens&lt;/strong&gt; for the exact same text.&lt;/p&gt;

&lt;p&gt;So that "$25" price tag? For the same work you did on Opus 4, you're actually paying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$25 × 1.35 = $33.75 effective
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real price drop isn't 67%. It's more like 55%.&lt;/p&gt;
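&lt;p&gt;The same adjustment as a reusable snippet, in case the 35% figure turns out to be different for your workload. It assumes the overhead applies uniformly to all of your text, which is a simplification.&lt;/p&gt;

```python
def effective_price(list_price: float, token_inflation: float) -> float:
    """Price per unit of old-tokenizer text when the same text now needs more tokens."""
    return list_price * (1 + token_inflation)


old_price = 75.0                          # Opus 4 output, dollars per MTok
new_price = effective_price(25.0, 0.35)   # Opus 4.7 output with 35% more tokens
real_drop = 1 - new_price / old_price
print(f"effective ${new_price:.2f}, real drop {real_drop:.0%}")
```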

&lt;p&gt;Is this intentional margin engineering, or a genuine technical trade-off? You tell me.&lt;/p&gt;




&lt;h2&gt;
  
  
  So how much does Anthropic actually make?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Per token:&lt;/strong&gt; ~$15 per million tokens in gross margin (my guess)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per year:&lt;/strong&gt; Still burning &lt;strong&gt;$1-2 billion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;R&amp;amp;D alone is $500M-$1B/yr. A hundred million free users. Safety research. Sales team. The next training run.&lt;/p&gt;

&lt;p&gt;Tokens are profitable. The company isn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  My prediction
&lt;/h2&gt;

&lt;p&gt;Opus 4.8: $15/MTok output. New tokenizer: 50% more tokens.&lt;/p&gt;

&lt;p&gt;The headline will say "prices dropped again." Your bill will stay the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tell me where I'm wrong
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Is 500 tok/sec per H100 realistic for a frontier MoE model?&lt;/li&gt;
&lt;li&gt;What do enterprise contracts actually pay?&lt;/li&gt;
&lt;li&gt;Is the 35% tokenizer overhead a margin play or a real trade-off?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you work in AI infra, cloud pricing, or know Anthropic's real costs — &lt;strong&gt;correct me in the comments.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I think about this stuff because I'm experimenting with this problem directly through &lt;a href="https://github.com/team-monet/monet?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=blog-launch" rel="noopener noreferrer"&gt;Monet&lt;/a&gt; — an open-source platform for AI agents to share and control knowledge at the team level. Token economics determines what's possible.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;github.com/team-monet/monet&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Does Your Coding Agent Need Memory?</title>
      <dc:creator>John Lee</dc:creator>
      <pubDate>Thu, 14 May 2026 12:04:14 +0000</pubDate>
      <link>https://dev.to/johnonlee/does-your-coding-agent-need-memory-4io9</link>
      <guid>https://dev.to/johnonlee/does-your-coding-agent-need-memory-4io9</guid>
      <description>&lt;p&gt;You start a coding agent. You tell it what you need. It searches the repo, reads a few files, thinks for a moment, and writes the change.&lt;/p&gt;

&lt;p&gt;It works.&lt;/p&gt;

&lt;p&gt;Then you ask it to do something similar the next day. And it searches the same files again. Reads the same code again. Asks you the same clarifying question you already answered yesterday.&lt;/p&gt;

&lt;p&gt;That slowly gets annoying.&lt;/p&gt;

&lt;p&gt;This is where memory enters the picture. But before jumping to "just add memory," it is worth asking what memory actually does for a coding agent — and when it is actually useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  What coding agents usually do
&lt;/h2&gt;

&lt;p&gt;Coding agents are not doing one thing. They write new code, edit existing code, generate tests, refactor modules, and help with bugs, issues, and PRs. Some tasks take two minutes. Some take an afternoon. The scope varies a lot.&lt;/p&gt;

&lt;p&gt;But the shape of the work is fairly consistent.&lt;/p&gt;

&lt;h2&gt;
  
  
  How they do it
&lt;/h2&gt;

&lt;p&gt;A coding agent works through a task roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;search for the relevant code&lt;/li&gt;
&lt;li&gt;read that code&lt;/li&gt;
&lt;li&gt;inspect nearby files and dependencies&lt;/li&gt;
&lt;li&gt;analyze what the code is doing&lt;/li&gt;
&lt;li&gt;plan the change&lt;/li&gt;
&lt;li&gt;make the change&lt;/li&gt;
&lt;li&gt;review and verify the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the loop. Most agents work turn by turn, but the useful unit for thinking about their memory is the task. A task is where understanding builds up, gets used, and either carries forward or gets lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where memory fits
&lt;/h2&gt;

&lt;p&gt;The first half of that loop — search, read, inspect, analyze — is where the agent spends most of its time understanding things. It reads files, traces dependencies, figures out patterns, and forms an internal picture of what is going on.&lt;/p&gt;

&lt;p&gt;Memory sits between that understanding and the next task.&lt;/p&gt;

&lt;p&gt;It is not part of the chat. It is not inside the context window. It lives between the code itself and the agent's working context, keeping useful things available after the task ends.&lt;/p&gt;

&lt;p&gt;Things worth keeping include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;facts about the codebase&lt;/li&gt;
&lt;li&gt;user preferences and conventions&lt;/li&gt;
&lt;li&gt;decisions that were already made&lt;/li&gt;
&lt;li&gt;known issues and failure patterns&lt;/li&gt;
&lt;li&gt;useful procedures and workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are small things individually, but they add up across tasks.&lt;/p&gt;
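&lt;p&gt;As a rough illustration, a store for these items could be as small as the sketch below. It is an in-memory toy under stated assumptions: the class name, the method names, and the five kind labels are hypothetical, and a real memory layer would also need persistence and some way to expire stale notes.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# The five kinds mirror the list above; the names themselves are hypothetical.
KINDS = {"fact", "preference", "decision", "known_issue", "procedure"}


@dataclass
class MemoryStore:
    """Keeps small task-level notes between tasks, grouped by kind."""
    notes: dict[str, list[str]] = field(default_factory=dict)

    def remember(self, kind: str, note: str) -> None:
        """Record one note under a known kind."""
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        self.notes.setdefault(kind, []).append(note)

    def recall(self, kind: str) -> list[str]:
        """Return a copy of the notes for one kind (empty if none exist)."""
        return list(self.notes.get(kind, []))


memory = MemoryStore()
memory.remember("fact", "helpers live in src/utils/")
memory.remember("decision", "keep the existing retry wrapper")
print(memory.recall("fact"))
```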

&lt;h2&gt;
  
  
  The obvious question: why not just use markdown docs?
&lt;/h2&gt;

&lt;p&gt;Most projects already have &lt;code&gt;README.md&lt;/code&gt;, &lt;code&gt;CONTRIBUTING.md&lt;/code&gt;, architecture docs, and convention guides. Those files hold the stable project rules. They are easy for humans to read and maintain. They live in the repo, get versioned with Git, and everyone sees the same version.&lt;/p&gt;

&lt;p&gt;So if docs already exist, why does a coding agent need memory at all?&lt;/p&gt;

&lt;p&gt;Because docs and memory do different jobs.&lt;/p&gt;

&lt;p&gt;Docs are &lt;strong&gt;human-centered&lt;/strong&gt;. They store what the team agrees is true — architecture, conventions, shared definitions. They are built to last. They are also slow to update during a task. Nobody wants to open a PR just to record "the agent should look in &lt;code&gt;src/utils/&lt;/code&gt; first when searching for helpers."&lt;/p&gt;

&lt;p&gt;Memory is &lt;strong&gt;agent-centered&lt;/strong&gt;. It stores the smaller, task-level things the agent discovers while working. The search path that worked. The file structure quirk that tripped it up last time. The bug pattern it just learned. These are not always worth putting into docs, but they are worth keeping for the next task.&lt;/p&gt;

&lt;p&gt;Docs hold the rules. Memory holds the useful leftovers from doing the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is lost without memory
&lt;/h2&gt;

&lt;p&gt;Without memory, every task starts fresh. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explaining the same thing again and again&lt;/li&gt;
&lt;li&gt;forgetting project rules the agent already learned&lt;/li&gt;
&lt;li&gt;missing user preferences that were stated earlier&lt;/li&gt;
&lt;li&gt;re-asking about decisions that were already settled&lt;/li&gt;
&lt;li&gt;re-reading the same code again and again&lt;/li&gt;
&lt;li&gt;repeating old mistakes just to get back to the same insight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost is small in any one task. It compounds across tens and hundreds of tasks. Every re-read, every repeated mistake, every rediscovery of something already understood is time and context that could have been saved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What memory gives back
&lt;/h2&gt;

&lt;p&gt;When memory is present, a few things change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context and time are saved.&lt;/strong&gt; The agent does not restart from zero every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-reading and rediscovery drop.&lt;/strong&gt; It already knows where to look and what to expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Past insights stay accessible.&lt;/strong&gt; Something learned last week is available today.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeated mistakes decrease.&lt;/strong&gt; Known failure patterns are recorded and recalled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong turns become rarer.&lt;/strong&gt; The agent makes better initial guesses about where to search and what to change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code changes do not erase everything.&lt;/strong&gt; Even when code changes, old memory provides a starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Later runs build on earlier ones.&lt;/strong&gt; Each task can improve on the last instead of repeating it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this means the agent spends less time understanding and more time doing. The quality of the first attempt goes up because it has seen similar situations before.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when code changes
&lt;/h2&gt;

&lt;p&gt;One natural concern: if the code changes, won't the memory become wrong?&lt;/p&gt;

&lt;p&gt;Yes, sometimes. Old memory can go stale.&lt;/p&gt;

&lt;p&gt;But stale memory is still often cheaper than starting over. If the agent remembers "the auth logic lives in &lt;code&gt;src/auth/&lt;/code&gt; and uses JWT," and the code has since moved to &lt;code&gt;src/security/&lt;/code&gt;, the memory is stale — but it is still a better starting point than searching the entire repo blind.&lt;/p&gt;

&lt;p&gt;The agent can re-check the code, notice the change, update the memory, and save the corrected version. That turns a stale memory into a corrected one. The next run benefits from the correction.&lt;/p&gt;

&lt;p&gt;This is the real pattern: memory does not need to be perfect. It just needs to be usable enough that the cost of correcting it is less than the cost of starting from scratch.&lt;/p&gt;
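&lt;p&gt;The re-check-and-correct loop can be sketched in a few lines. This is a minimal illustration, not how any particular agent implements memory: the &lt;code&gt;recall_location&lt;/code&gt; function, the dictionary layout, and the &lt;code&gt;jwt.py&lt;/code&gt; marker file are all hypothetical names chosen for the example.&lt;/p&gt;

```python
from pathlib import Path
import tempfile

def recall_location(memory, key, repo_root):
    """Return a verified directory for `key`, correcting stale memory if needed."""
    repo_root = Path(repo_root)
    remembered = memory[key]["path"]
    marker = memory[key]["marker"]  # hypothetical: a file name the agent expects there

    # Cheap check first: is the remembered location still valid?
    if (repo_root / remembered / marker).is_file():
        return remembered

    # Stale: the remembered file name still narrows the search.
    for candidate in sorted(repo_root.rglob(marker)):
        corrected = candidate.parent.relative_to(repo_root).as_posix()
        memory[key]["path"] = corrected  # save the corrected version
        return corrected

    # Nothing found: drop the entry so it cannot mislead later runs.
    del memory[key]
    return None

# Demo: memory says src/auth, but the code has moved to src/security.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "src" / "security").mkdir(parents=True)
    (Path(repo) / "src" / "security" / "jwt.py").touch()
    memory = {"auth": {"path": "src/auth", "marker": "jwt.py"}}
    print(recall_location(memory, "auth", repo))  # prints "src/security"
    print(memory["auth"]["path"])                 # prints "src/security"
```

&lt;p&gt;The stale entry costs one failed check, and the next run starts from the corrected path.&lt;/p&gt;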

&lt;h2&gt;
  
  
  What this could look like for teams
&lt;/h2&gt;

&lt;p&gt;Now imagine this across a team instead of a single agent.&lt;/p&gt;

&lt;p&gt;One agent discovers a bug pattern in the payment module. Another agent, working on a different task, runs into the same pattern. In a world without shared memory, the second agent repeats the same debugging steps. With shared memory, it sees the pattern, checks the known fix, and gets back to work.&lt;/p&gt;

&lt;p&gt;Shared memory could hold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;team conventions that every agent follows&lt;/li&gt;
&lt;li&gt;recurring decisions that should not be re-litigated&lt;/li&gt;
&lt;li&gt;project-specific patterns that repeat across tasks&lt;/li&gt;
&lt;li&gt;known pitfalls that every agent should avoid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the system starts to look less like a collection of chatbots and more like a working team. The agents are not just processing individual tasks. They are accumulating useful knowledge as a group.&lt;/p&gt;

&lt;p&gt;That is further out. But the path starts with a single agent that remembers.&lt;/p&gt;




&lt;p&gt;Memory is not a feature you bolt on to make an agent smarter. It is a way to stop paying for the same understanding over and over again.&lt;/p&gt;

&lt;p&gt;The real question is not "does your coding agent need memory?" It is "what understanding are you currently paying to rediscover every time?"&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How Are You Managing Your AI's Context Window?</title>
      <dc:creator>John Lee</dc:creator>
      <pubDate>Mon, 11 May 2026 12:34:28 +0000</pubDate>
      <link>https://dev.to/johnonlee/how-are-you-managing-your-ais-context-window-324g</link>
      <guid>https://dev.to/johnonlee/how-are-you-managing-your-ais-context-window-324g</guid>
      <description>&lt;p&gt;Your AI coding agent has a 200K-token context window. Maybe 500K. Maybe a million.&lt;/p&gt;

&lt;p&gt;So... what actually changed?&lt;/p&gt;

&lt;p&gt;Honestly, I'm still figuring that out. I expected bigger windows to deliver better results. The reality has been more nuanced.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Window Got Bigger. Did Anything Actually Change?
&lt;/h2&gt;

&lt;p&gt;The narrative is seductive: "200K tokens! I can dump my entire codebase in there." "1M tokens? Every issue, every doc, every chat log."&lt;/p&gt;

&lt;p&gt;This is like saying "my hard drive is 2TB, so I'll keep every file on my desktop." Technically possible. But do you actually do that?&lt;/p&gt;

&lt;p&gt;Research on long-context models suggests that a bigger window does not guarantee better recall. The "lost in the middle" effect is well documented: models tend to use information at the beginning and end of the context most reliably, and accuracy drops for material buried in between. Bigger haystacks make needles harder to find.&lt;/p&gt;
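&lt;p&gt;One commonly suggested mitigation is to order context so the most relevant pieces sit at the edges of the prompt and the weakest in the middle. Here is a rough sketch of that idea. It assumes each chunk already has a relevance score from some earlier retrieval step; the function name and the scores are hypothetical, and I have not measured how much this helps in practice.&lt;/p&gt;

```python
def order_for_long_context(chunks):
    """Place the highest-scoring chunks at the edges of the prompt.

    `chunks` is a list of (relevance_score, text) pairs.
    """
    ranked = sorted(chunks, key=lambda pair: pair[0], reverse=True)
    front, back = [], []
    for index, (_, text) in enumerate(ranked):
        # Alternate: best chunk to the front, second best to the back, and so on.
        if index % 2 == 0:
            front.append(text)
        else:
            back.append(text)
    # Reverse the back half so the weakest chunks end up in the middle.
    return front + back[::-1]

chunks = [(0.9, "a"), (0.1, "e"), (0.5, "c"), (0.7, "b"), (0.3, "d")]
print(order_for_long_context(chunks))  # prints ['a', 'c', 'e', 'd', 'b']
```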

&lt;p&gt;But here's what I find more interesting: &lt;strong&gt;how are we actually using these bigger windows?&lt;/strong&gt; Model spec comparisons are easy. "200K vs 1M" is a number you can compare. But "how well am I managing my context" has no number. It's invisible. So nobody looks at it.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What Actually Happens Inside a Claude Code Session
&lt;/h2&gt;

&lt;p&gt;Here's what I've observed over a few months of using Claude Code with my team. No quantified data — just experiential patterns. If you've done actual measurement on this, I'd honestly love to hear about it.&lt;/p&gt;

&lt;p&gt;A typical session has this rhythm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context gathering eats up a surprising amount of time.&lt;/strong&gt; Reading issues. Scanning docs. Exploring the codebase to figure out what's what. It repeats at the start of every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-verification is weirdly common.&lt;/strong&gt; My Claude discovers something. Tomorrow, my Claude (or my teammate's Claude) re-discovers the same thing. Not because the AI isn't capable. Because the AIs don't share memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual problem solving gets less time than you'd think.&lt;/strong&gt; After the first two phases, you finally get to the work you opened the session for.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what matters: this isn't waste because the AI isn't smart enough. It's waste because &lt;strong&gt;the AIs don't share what they know&lt;/strong&gt;. We've built incredible systems for CI/CD, code review, documentation. But when it comes to how our AI agents share knowledge as a team? Almost nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What about your team?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Three Patterns I Keep Seeing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Dump Truck
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"I have 200K tokens. Here's every file in the repo, 47 issues, the company handbook. Go."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I get it. You don't know what's relevant ahead of time. The temptation to "just put everything in" is real.&lt;/p&gt;

&lt;p&gt;But then your AI is reasoning against mostly irrelevant context. Finding patterns in noise. Confidently proposing solutions to problems you don't have. That noise eventually eats away at reasoning quality.&lt;/p&gt;

&lt;p&gt;I did this early on. Still catch myself doing it. I haven't found a perfect solution — but just being aware of the pattern has helped.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Groundhog Day
&lt;/h3&gt;

&lt;p&gt;"Our project uses pnpm workspaces. Auth is in &lt;code&gt;packages/auth&lt;/code&gt;. Don't touch &lt;code&gt;legacy/&lt;/code&gt;. Alice owns deployments."&lt;/p&gt;

&lt;p&gt;Your human colleagues learned this on day one. Your AI has to re-learn it every single session.&lt;/p&gt;

&lt;p&gt;If a human teammate asked you to re-explain the project structure every morning before they could start working, you'd have a serious conversation. But we accept this from AI without question. Why haven't we automated this repetition away yet?&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Genius Silo
&lt;/h3&gt;

&lt;p&gt;This is the most fascinating one. And the most unsettling.&lt;/p&gt;

&lt;p&gt;Same Claude model. Wildly different outcomes. When a senior engineer who knows the product's bones by heart picks up Claude, the AI becomes a "genius." The codebase's history, known landmines, unwritten conventions — all this invisible context dissolves into the AI's reasoning. Sessions are fast, almost magical.&lt;/p&gt;

&lt;p&gt;When a junior engineer with less context picks up the exact same Claude, they come back empty-handed. Their Claude re-discovers, from scratch, what the senior's Claude figured out months ago. Burns tokens. Burns time. Builds frustration.&lt;/p&gt;

&lt;p&gt;Here's what this means: AI, as a tool, isn't lifting the team's collective productivity. It's &lt;strong&gt;trapped in individual silos of personal experience&lt;/strong&gt;. The senior gets faster and faster. The junior stays stuck. Claude has become a personal assistant, not a team tool.&lt;/p&gt;

&lt;p&gt;And the team lead sees none of this. Doesn't know what the senior's Claude knows. Doesn't know what the junior's Claude is painfully re-learning. &lt;strong&gt;These invisible walls are completely hidden.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this happening on your team too? Or have you found a different way?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. What I've Been Trying (Hypothesis Stage)
&lt;/h2&gt;

&lt;p&gt;After months of experimenting, I've roughly settled on four principles. These are working hypotheses — if you've found better approaches, I genuinely want to hear them.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Relevance Over Volume
&lt;/h3&gt;

&lt;p&gt;I stopped asking "how much can I fit?" and started asking "what actually matters right now?"&lt;/p&gt;

&lt;p&gt;A small, well-curated context beats a massive dump. I'm convinced of this through experience. What "well-curated" actually means in practice — still experimenting.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Persistence Over Repetition
&lt;/h3&gt;

&lt;p&gt;When my AI discovers something valuable — a pattern, a gotcha, an insight — I try not to let it die with the session.&lt;/p&gt;

&lt;p&gt;At the end of each Claude Code session, I ask myself: "What did my Claude learn today that my teammate's Claude should know tomorrow?" It's not perfect, but it has saved the opening minutes of my next session more times than I can count.&lt;/p&gt;
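&lt;p&gt;The simplest version of this habit is a notes file that gets appended at the end of a session and read at the start of the next one. The sketch below assumes a JSON Lines file; &lt;code&gt;save_note&lt;/code&gt;, &lt;code&gt;load_notes&lt;/code&gt;, and the field names are hypothetical, not part of Claude Code or any other tool.&lt;/p&gt;

```python
import json
from pathlib import Path

def save_note(notes_file, topic, note):
    """Append one session finding to a JSON Lines file."""
    with Path(notes_file).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"topic": topic, "note": note}) + "\n")

def load_notes(notes_file):
    """Read all saved findings; return an empty list if none exist yet."""
    path = Path(notes_file)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

&lt;p&gt;Appending rather than rewriting keeps each session's findings intact, and a plain text file can be committed to the repo so a teammate's session can load the same notes.&lt;/p&gt;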

&lt;h3&gt;
  
  
  3. Domain Sync
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Transplanting the senior engineer's business context into the AI's baseline assumptions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a senior tells their Claude "this component is perf-critical, O(n²) won't fly," that judgment has months of domain knowledge baked into it. Domain Sync is about making that knowledge accessible to every teammate's Claude.&lt;/p&gt;

&lt;p&gt;It's about converting individual expertise into the team's prompt assets. How far this can be automated — I don't know yet. But the direction feels right.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Routinized Results Verification
&lt;/h3&gt;

&lt;p&gt;Not blindly trusting the AI's output. Systematically filtering it through past incidents and accumulated history.&lt;/p&gt;

&lt;p&gt;A senior developer, reviewing Claude's code, unconsciously checks: "We had a similar PR that broke tests last time..." "This pattern looks like the one that caused the outage last year..." This filtering instinct — knowing how to &lt;em&gt;reject&lt;/em&gt; well — is what truly separates seniors from juniors.&lt;/p&gt;

&lt;p&gt;The problem: this filtering instinct has remained private, tacit knowledge. How do we turn "knowing how to ask well" into "knowing how to filter well" — and make that filtering instinct a baseline routine for every team Claude? This is what I'm most preoccupied with lately.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What I Actually Want to Know — Let's Think Together
&lt;/h2&gt;

&lt;p&gt;With context windows exploding toward infinity, are we falling into the quantity trap while losing sight of quality?&lt;/p&gt;

&lt;p&gt;What actually determines real-world productivity isn't benchmark scores. It's &lt;strong&gt;the quality of context optimized for your specific product&lt;/strong&gt;. But that doesn't show up on any benchmark. So nobody looks at it. So I'm asking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Questions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Experience Replication:&lt;/strong&gt;&lt;br&gt;
Is your senior engineer's AI know-how and business context being transferred to other team members — or is it trapped inside individual chat windows? How many Genius Silos exist on your team?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Noise Paradox:&lt;/strong&gt;&lt;br&gt;
As windows grow bigger, AI paradoxically loses the plot (Lost in the Middle). What filtering are you doing to counter this? Not just "use less context" — are there smarter ways to structure it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Knowledge Expiration:&lt;/strong&gt;&lt;br&gt;
In the "store it and forget it" pile, is stale, contaminated context quietly poisoning your AI's judgment? Is last year's "never touch &lt;code&gt;legacy/&lt;/code&gt;" silently overriding this year's "migration complete, it's safe now"?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Building the Team Brain:&lt;/strong&gt;&lt;br&gt;
Is your team's AI getting smarter over time — or stuck in an endless loop of Groundhog Day explanations? Do you have any way to tell?&lt;/p&gt;
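&lt;p&gt;For the Knowledge Expiration question, one possible starting point is to date every stored rule, let the newest entry per topic win, and flag topics that nobody has confirmed in a while. This is only a sketch under those assumptions: the field names, the dates, and the age threshold are all made up for illustration.&lt;/p&gt;

```python
from datetime import date

def current_rules(entries):
    """Keep only the most recent entry per topic.

    `entries` is a list of dicts with "topic", "rule", and "recorded" (a date).
    """
    latest = {}
    for entry in entries:
        kept = latest.get(entry["topic"])
        if kept is None or entry["recorded"] > kept["recorded"]:
            latest[entry["topic"]] = entry
    return latest

def stale_topics(entries, today, max_age_days):
    """List topics whose newest entry is older than `max_age_days`."""
    return sorted(
        topic
        for topic, entry in current_rules(entries).items()
        if (today - entry["recorded"]).days > max_age_days
    )

# Hypothetical entries: the newer "legacy" rule overrides the older one.
entries = [
    {"topic": "legacy", "rule": "never touch legacy/", "recorded": date(2025, 3, 1)},
    {"topic": "legacy", "rule": "migration complete, it's safe now", "recorded": date(2026, 2, 1)},
    {"topic": "pnpm", "rule": "project uses pnpm workspaces", "recorded": date(2025, 1, 10)},
]
print(current_rules(entries)["legacy"]["rule"])      # prints "migration complete, it's safe now"
print(stale_topics(entries, date(2026, 5, 11), 365))  # prints ['pnpm']
```

&lt;p&gt;A flagged topic is not necessarily wrong. It is just a prompt for someone, human or agent, to re-check it against the code.&lt;/p&gt;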




&lt;h2&gt;
  
  
  6. Closing
&lt;/h2&gt;

&lt;p&gt;I've been staring at this problem for months. Building tools. Running experiments with my team. But I don't have the answers. I'm still experimenting.&lt;/p&gt;

&lt;p&gt;So I'm asking: &lt;strong&gt;how are you managing your AI coding agent's context window?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ideas? I want them. Disagreements? Even better. If your experience is "dumping everything into a big window works fine for us," I genuinely want to hear about that too. Let's figure this out together.&lt;/p&gt;




&lt;p&gt;I'm experimenting with this problem directly through &lt;a href="https://github.com/team-monet/monet?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=blog-launch" rel="noopener noreferrer"&gt;Monet&lt;/a&gt; — an open-source platform for AI agents to share and control knowledge at the team level.&lt;/p&gt;

&lt;p&gt;I'm looking for pilot partner teams. I'll help you set up Monet for your team, and together we'll find the automation points that fit your workflow. Interested? Leave a comment or open a &lt;a href="https://github.com/team-monet/monet/issues?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=blog-launch" rel="noopener noreferrer"&gt;GitHub Issue&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All examples and scenarios in this post are based on real experiences, adapted for the blog format.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
