DEV Community: Amit Ben-Ari

Agentic Context Engineering (ACE) Explained: Why Your AI Agents Need a Playbook, Not a Prompt

Amit Ben-Ari — Wed, 20 May 2026 13:07:55 +0000

A team from Stanford, SambaNova, and UC Berkeley recently published the ACE paper - and it's the most substantive academic contribution to context engineering I've seen in a while. The core idea: give your AI agent a structured "playbook" that it maintains and refines itself, task by task. The result? A +10.6% performance improvement on agent benchmarks, and +8.6% on domain-specific finance reasoning, using a smaller open-source model that matched top-ranked production agents.

ACE works through three components - a Generator, a Reflector, and a Curator - that mirror how humans actually learn: attempt, reflect, consolidate. The Curator's key trick is issuing small, surgical edits to the playbook rather than rewriting it wholesale. That single design choice is what prevents the context degradation most agents suffer silently.

Here's the honest caveat: you probably shouldn't run out and implement this today. But you should understand what it proves, because the principles transfer directly to how you work with AI agents right now.

The Problem ACE Is Trying to Solve

If you've spent time building or using AI agents, you've encountered two failure modes that nobody talks about clearly enough.

The first is context rot - the gradual degradation in output quality as an agent's context window fills up with irrelevant history, redundant instructions, and stale reasoning. This is the "why does this agent get worse over time" problem most practitioners diagnose too late.

The second failure mode is structural: your agent runs the same class of task five hundred times and never gets better at it. The system prompt you wrote in week one is the same system prompt running in week twenty. Every lesson learned in production, every hard-won heuristic, every edge case your team discovered - none of it feeds back into the agent's context. It starts from scratch every time.

The ACE paper names two specific mechanisms that cause this, and both are worth adding to your vocabulary.

Brevity bias is what happens when you try to automate prompt optimization. Methods like DSPy and GEPA can evolve your prompts automatically - but they tend to favor short, compact instructions over rich, specific ones. Domain knowledge gets compressed out in favor of instructions that benchmark well on average. You end up with a leaner prompt that performs worse on the nuanced cases your domain actually cares about.

Context collapse is what happens when you try to solve the stale-context problem by asking the model to iteratively rewrite its accumulated context. Each rewrite seems clean. But over many rounds, specific details get gradually squeezed out. The researchers described it through VentureBeat's coverage as "like overwriting a document so many times that key notes disappear" - a kind of digital amnesia baked into the optimization loop itself.

Both problems share the same root cause: treating context as something to be compressed or replaced, rather than something to be curated.

What ACE Actually Is

ACE treats an agent's context as a structured, evolving playbook - a living document of strategies, domain rules, and hard-won lessons that the agent itself maintains, one task at a time.

Rather than one model trying to act, reflect, and update its own context simultaneously (which creates the compression pressure that leads to context collapse), ACE splits the work across three distinct roles:

Generator - attempts the task and produces a reasoning trace of what it did and why
Reflector - analyzes that trace, identifies what succeeded and what failed, and extracts transferable lessons
Curator - takes those lessons and integrates them into the playbook as small, targeted edits

The design mirrors how experienced engineers actually learn: you ship something, do a postmortem, and update your runbook. You don't throw the runbook away and rewrite it from memory after every incident.

Why "Incremental Deltas" Matter

The single most important engineering decision in ACE is how the Curator writes to the playbook. It doesn't rewrite it. It issues delta operations - ADD, UPDATE, or REMOVE - against individual bullets in the structured playbook.

This is the mechanism that prevents context collapse. Rewriting the whole playbook every round is exactly how you generate the information loss the paper is trying to solve. Each full rewrite introduces compression pressure; the model can't perfectly reconstruct what was there before. Over time, specificity erodes.

The incremental delta approach is the difference between git commit and "open the document and retype it from memory." The former gives you tracked, small, reversible changes with full history. The latter guarantees unbounded information loss at each step - and there's no diff to recover from.

The Playbook, Specifically

The playbook isn't a prose system prompt. It's a structured collection of bullets, each carrying:

A unique ID for surgical targeting
A section tag (e.g., strategies_and_hard_rules, domain_knowledge, edge_cases)
Usage metadata - how many times this bullet contributed to a successful vs. failed outcome
Any other relevant context metadata

This structure is what makes localized retrieval and surgical updates possible. Rather than injecting the entire accumulated playbook into every prompt, the system can retrieve only the bullets relevant to the current task. Specific beats generic, every time.

Here's a simplified representation of what a playbook bullet looks like, drawn from reference implementations:

{
  "id": "rule_042",
  "section": "strategies_and_hard_rules",
  "content": "When the user specifies a date range, always validate that the end date is after the start date before querying the API. Return a structured error if not.",
  "usage": {
    "helpful": 14,
    "harmful": 1
  },
  "added_at": "2025-11-03T09:14:22Z",
  "last_updated": "2025-11-18T16:42:07Z"
}

The usage counts matter. They're how the Curator knows whether to reinforce, update, or remove a bullet when the playbook grows large. Bullets that consistently contribute to failure are candidates for removal or revision. Bullets that prove reliable accumulate weight. The playbook becomes, over time, a compressed representation of what actually works in this specific domain.

What the Paper Proves

The ACE paper reports:

+10.6% performance improvement on the AppWorld agent benchmark - a complex, multi-step agent evaluation suite
+8.6% improvement on finance domain reasoning tasks
On the AppWorld leaderboard, ACE running on a smaller open-source model matched the top-ranked production agent overall, and surpassed it on the harder test-challenge split
Adaptation speed was faster and cost fewer tokens than baseline methods - incremental deltas are cheaper than monolithic rewrites, not just more effective
Notably, ACE achieved this without ground-truth labels. It curates from execution feedback alone - what worked, what didn't - without needing a human-annotated training signal

That last point matters more than it sounds. Most learning-from-experience systems need labeled data to know what "better" looks like. ACE infers it from the agent's own execution trace.

ACE vs. The Other Things It Looks Like

If you know the optimization landscape, ACE will remind you of several things. Here's where it fits and where it doesn't:

ACE vs. Prompt Engineering

Prompt engineering produces a static artifact - you craft one good instruction set and it doesn't change unless you change it manually. ACE is dynamic by design: the context evolves as the agent works. The insight the paper crystallizes is that a static prompt is a ceiling, not a foundation. Domain complexity accumulates faster than manual prompt iteration can track.

ACE vs. DSPy / GEPA

DSPy and GEPA are prompt optimization frameworks - they evolve the prompt instructions themselves, usually through gradient-based or few-shot methods. ACE evolves the context that lives behind the prompt - accumulated strategies, domain rules, hard-won heuristics. These aren't competitive approaches; a well-resourced team could conceivably run both. DSPy optimizes how the agent asks; ACE optimizes what the agent knows.

ACE vs. RAG / Vector Memory

RAG answers the question "what do I know?" - it retrieves relevant documents or facts for each query from an external knowledge base. ACE answers a different question: "what have I learned from doing this task before?" RAG gives the agent reference material. ACE gives the agent accumulated experience. Different problems. Both worth solving.

The Honest Limitations

Reflector quality is the bottleneck. The whole system depends on the Reflector's ability to extract meaningful insights from a reasoning trace. In specialized domains where even frontier models have limited capability, the Reflector produces weak or noisy lessons - and those lessons corrupt the playbook. As AltexSoft notes in their breakdown, this dependence on Reflector quality is the system's most significant single point of failure.

Error accumulation compounds. Bad reflections lead to bad Curator edits. Bad Curator edits persist in the playbook. Without robust evaluation loops catching drift, the playbook can degrade gradually - encoding the wrong lessons confidently. Garbage in, confidently curated garbage out.

Benchmark generalization is unproven. The paper validates on AppWorld (agentic tasks) and finance reasoning. Coding agents, medical reasoning, creative work, long-horizon planning - all untested. As Emergent Mind's breakdown of the paper flags, the leap from "works on these two benchmarks" to "generalizes broadly" hasn't been demonstrated yet. That doesn't mean it won't generalize - it means you'd be betting on an assumption the paper doesn't support.

Inference cost goes up. Three roles, a growing playbook, and per-task reflection loops cost more tokens per task than a static system prompt. The paper shows that adaptation cost is lower than monolithic-rewrite baselines - but it's still more expensive than doing nothing. For high-frequency, low-stakes tasks, the economics may not pencil out.

What This Actually Means If You're Not a Research Team

The gap between "this is an interesting research result" and "this changes how I work" is where most explainers stop. Here's where they shouldn't.

For Application Developers

You probably don't need to implement a full Generator/Reflector/Curator loop for your production agent. Two reference implementations are available on GitHub - ace-agents and ACE-open - but both are research-grade and not production-hardened. Running them in production today means owning the maintenance yourself.

The principles, however, are immediately transferable:

Version your context. Your agent's system prompt and configuration should be treated like source code - tracked, diffed, and rolled back when things go wrong.
Prefer incremental edits over wholesale rewrites. When a session produces a lesson, add it as a discrete bullet. Don't rewrite the whole configuration to incorporate it.
Structure your context as tagged, discrete items rather than long prose blobs. Prose doesn't support surgical updates. Structured bullets do.

For Engineering Teams Using Claude Code, Cursor, or Copilot

Your CLAUDE.md or AGENTS.md file is already a primitive, human-maintained ACE playbook. You add project-specific rules, patterns to follow, edge cases to watch for. ACE just proposes making the maintenance automatic rather than manual.

The practical application: when a coding session produces a hard-won lesson - a pattern that keeps breaking, an API quirk you learned the expensive way - add it as a discrete bullet to your agent config. Apply the ACE principles manually: one lesson, one bullet, one commit. Don't rewrite the whole file to incorporate it.

Over time, your CLAUDE.md accumulates the same kind of domain expertise ACE builds automatically. The difference is the feedback loop speed. The principle is identical.

For PMs and Solopreneurs

The research-grade architecture isn't the point for you. The point is the insight underneath it: the payoff in working with AI is in reusing and refining your context, not rebuilding it each session.

Every time you reconstruct your project context from scratch - repasting the same background, re-explaining the same constraints, reminding the model of the same patterns - you're leaving performance on the table. The practical, available-today version of "evolving playbook" is a saved, reusable context stack you curate over time as you learn what works.

At HiveTrail, we build Mesh - a desktop tool that gives you the human-scale version of an evolving context playbook. Instead of automated delta updates, you save and refine your own Stacks: collections of Notion pages, local files, and prompt snippets that you curate over time. If you've been manually rebuilding context for every AI session, that's the workflow Mesh replaces. See how it works →

Where ACE Fits in 2026's Context Engineering Stack

Anthropic's 2026 Agentic Coding Trends Report named context engineering the most important skill shift for developers this year. ACE is the first formal, measurable academic framework to validate why: context isn't a static resource to be managed, it's a dynamic artifact to be engineered.

Other frameworks are coming. Dynamic Cheatsheet, DSPy, GEPA, and Anthropic's own writing on effective context engineering for agents all point in the same direction. The research consensus is converging on a shared principle: context is a first-class engineered artifact with its own versioning, curation, and lifecycle management.

Whether you implement ACE specifically is less important than internalizing what it proves. An agent that learns from its own execution trace will outperform one that doesn't. A context that accumulates structured lessons will outperform one that's rebuilt fresh each session. These aren't speculative claims anymore - they're benchmarked results from a peer-reviewed paper.

The question for 2026 isn't whether context engineering matters. It's whether your team treats it as seriously as the research now says it deserves.

The trend ACE validates - that context is an engineered artifact worth curating deliberately - is the entire reason we built Mesh. Join the beta →

FAQ

What is Agentic Context Engineering (ACE)?
ACE is a framework proposed in a 2025 paper from Stanford, SambaNova, and UC Berkeley that treats an AI agent's context as an evolving playbook. Rather than relying on a static system prompt, ACE maintains a structured, growing document of strategies and lessons that the agent updates itself - through a loop involving a Generator, Reflector, and Curator.

What problem does ACE solve?
ACE targets two specific failure modes in AI agent design. Brevity bias describes how automated prompt optimization tends to favor short, generic instructions that lose domain-specific detail. Context collapse describes how iterative full rewrites of an agent's context gradually erode important information, the same way repeatedly retyping a document from memory introduces progressive information loss. ACE addresses both by accumulating context incrementally rather than compressing or rewriting it.

How does ACE work?
ACE runs a three-role loop on every task. The Generator attempts the task and produces a reasoning trace. The Reflector analyzes the trace and extracts lessons about what worked and what didn't. The Curator applies those lessons as small, targeted delta operations (ADD, UPDATE, REMOVE) to individual bullets in a structured playbook. Over time, the playbook accumulates genuine domain expertise without collapsing under the weight of its own rewrites.

How is ACE different from prompt engineering or RAG?
Prompt engineering produces a static instruction set you craft once and maintain manually. RAG retrieves reference documents at query time to give the model relevant information. ACE evolves the context itself - specifically, the accumulated strategies and lessons an agent has learned from past task attempts. The three approaches address different problems and are complementary rather than competitive.

Should I implement ACE in my own agent?
Most production teams don't need a full ACE implementation today - the available open-source implementations are research-grade and not production-hardened. The more immediately applicable takeaway is to apply the underlying principles manually: structure your agent's context as discrete, tagged bullets rather than prose, prefer incremental additions over wholesale rewrites, and version your context over time the same way you version source code.

About the Author

Founder of HiveTrail, where he builds context management tools for LLM workflows and agentic AI. HiveTrail's flagship product, Mesh, is a desktop app in beta that helps developers and teams assemble reusable, curated context stacks from Notion, local files, and prompt libraries - the human-scale version of what ACE automates.

Claude Code Context Window Rot: Why Sessions Get Dumber (And How to Fix It)

Amit Ben-Ari — Wed, 06 May 2026 06:21:47 +0000

You know the pattern. You open a fresh Claude Code session, paste in the task, and the first few responses are sharp. Claude reads the right files, makes sensible decisions, and follows your architecture conventions. Then - somewhere around the 45-minute mark - things start to drift.

It asks about a file path that it read twenty minutes ago. It contradicts an architectural decision you both agreed on earlier in the session. It proposes a refactor that undoes work it just finished. The responses get wordier but somehow less useful.

You didn't change anything. The model didn't crash. There's no error in the log. But something clearly went wrong.

Here's what I've learned after building HiveTrail Mesh - a tool designed specifically to solve this problem upstream - and after hitting this wall dozens of times in my own Claude Code sessions: the degradation isn't random, and it isn't a bug. It's structural. And most of the common fixes only address symptoms. The root cause starts before you type your first message.

What's actually happening: context rot explained

The phenomenon has a name. Anthropic's own engineering documentation calls it context rot, defining it directly: "Context must be treated as a finite resource with diminishing marginal returns." (Anthropic, Effective Context Engineering for AI Agents, September 2025)

But what does that actually mean mechanically?

Every Claude Code session has a context window - a finite amount of text (measured in tokens) that the model can "see" at any given moment. Your messages, Claude's responses, every file it reads, every bash command output, every tool call result - it all goes in. And here's the critical part: the transformer architecture underlying Claude processes all of this by comparing every token against every other token to compute relationships. This works beautifully when the context is small and tightly relevant. As the context grows with noisy, loosely related content, those pairwise comparisons get stretched thin.

The result: performance doesn't fall off a cliff. It degrades gradually. Attention that should be focused on your current task gets diluted by everything that's accumulated in the window since you started.

The research on this is unambiguous. Chroma's 2025 study, Context Rot: How Increasing Input Tokens Impacts LLM Performance, tested 18 frontier models - including GPT-4.1, Claude 4, and Gemini 2.5 - and found that every single one performed worse as input length increased. Not some. Not most under certain conditions. All of them, without exception. Some models held at 95% accuracy and then dropped to 60% once the input crossed a certain length threshold. The drop wasn't gradual - it was a cliff edge after a saturation point.

Morph's analysis of the same research adds an important nuance for coding agents specifically: coding agents have three properties that make context rot worse than average. First, context is accumulative - every file read, grep result, and tool output stays in the window for the rest of the session. Second, distractor density is high - code search returns semantically similar results across many files, creating more opportunities for attention to be misled. Third, task horizons are long - a real coding task takes 15–60 minutes, during which context continuously degrades.

The lost-in-the-middle effect

There's a second mechanism compounding the problem. Stanford researchers (Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024) established one of the most cited findings in LLM reliability: models attend strongly to the beginning and end of their context, and poorly to the middle.

In a multi-document question-answering setup with 20 documents, accuracy dropped by more than 30% when the relevant document was placed in positions 5–15 rather than position 1 or 20. The model had all the information it needed - it just couldn't effectively attend to it from the middle of a crowded window.

Critically, this effect doesn't require the context window to be full. It appears at moderate context lengths and worsens as the window grows. The relevant information doesn't need to be absent - it just needs to be in the wrong position relative to everything else that's accumulated.

For a typical Claude Code session, the practical consequences are specific and predictable:

Constraints drift into the dead zone. The "don't use global state," "always write tests," and "never touch the config file" rules you specified at the start of the session - those are now buried in the middle of a window that's grown to include dozens of file reads and tool call outputs. They're technically still there. The model just can't attend to them as effectively as it did when they were fresh.
Early architectural decisions lose weight. The approach you agreed on for the auth layer in message 8 is now somewhere in the middle of a 60-message session. The model didn't "forget" it in a human sense - but it can no longer give that decision the same attention weight it would if it were in the first few messages.
File content read early becomes unreliable. If Claude read your config.py at the start of the session to understand the data model, and the session has since grown substantially, the model may produce code that contradicts that schema - not because the information isn't there, but because it's no longer in a high-attention position.

Morph's analysis of context rot points to a particularly counterintuitive finding: models performed better on randomly shuffled documents than on logically structured ones. Logical structure creates plausible distractors - adjacent documents share terminology, concepts, and patterns, making it harder for the model to isolate the relevant signal. A well-organised codebase is, perversely, harder for an LLM to search than a randomly arranged one. This is why structured, labelled context blocks - where sections are explicitly tagged by type - outperform even well-organised raw dumps.

The upshot: position and structure matter as much as content. A piece of information in the first 10% of the context window has meaningfully more influence on the model's output than the same information in the middle 60%. Every message you exchange with Claude pushes your earlier constraints further toward that low-attention zone.

The numbers are worse than you think

This isn't theoretical. A developer recently filed GitHub issue #34685 on the Claude Code repository documenting exactly what this looks like in practice, using the 1M context window model (claude-opus-4-6):

At ~20% context usage: noticeable degradation - losing track of earlier decisions, circular reasoning, forgetting details
At ~40% context usage: automatic context compression kicked in, removing scrollback history
At ~48% context usage: Claude actively told the user "I'm deep enough in this context that I'm not being effective" and recommended starting a fresh session

That's less than halfway through a context window that Anthropic advertises as capable of holding entire codebases. The effective high-quality range appears to be roughly the first 400K tokens. As the issue author asks: if the practical ceiling is 400K tokens, should that be what's communicated to users rather than 1M?

This isn't a criticism of Anthropic - it's a fundamental constraint of how transformer attention works at scale. The compaction mechanics are also relevant here: Claude Code reserves a ~33K token buffer and triggers auto-compaction when usage hits ~83.5% of the window. But compaction is lossy. As MindStudio's analysis notes, running /compact after quality has already dropped is less effective than doing it proactively - the compressed summary may capture confused outputs alongside the good ones.

How to spot context rot in your sessions before it gets critical

Context rot is insidious because it rarely announces itself. You don't get an error. You get a gradual erosion of quality that's easy to attribute to a bad prompt or a complex task. Here are the specific signals worth watching for:

Claude re-reads files it already processed. If you notice Claude using a read_file tool call on a file it clearly already loaded earlier in the session, the earlier read has degraded in the window. It's compensating by fetching the information again - burning more context in the process.

It contradicts earlier architectural decisions. You settled on an approach for your auth layer 30 messages ago. Now Claude is suggesting something that contradicts it without acknowledging the earlier decision. The constraint has drifted into the middle of the context window and lost attention weight.

Responses get longer but less precise. This is a counter-intuitive tell. As context fills, the model sometimes compensates by generating more text - adding hedging, restating things, or providing options rather than direct answers. Word count goes up, signal goes down.

Circular debugging loops. You tell Claude a fix didn't work. It suggests a variation. That doesn't work either. It cycles back to an approach it already tried. This is classic context rot - the session history is creating distractors that prevent the model from making forward progress.

It forgets your "don't do X" rules. Constraints stated early in the session - "don't use global state," "always write tests for new functions," "never modify the config file directly" - are exactly the kind of instruction that ends up in the middle of a long context and loses effective attention weight.

If you're seeing two or more of these in a session, you're in context rot territory. The sooner you act, the less you lose.

The standard fixes - and why they only solve half the problem

Let me be clear: the standard context management advice is genuinely useful. Here's what works:

/clear between unrelated tasks. If you've finished one feature and are starting something new, clear the session. You lose the history, but you regain a clean, high-attention context for the new task.

/compact at natural task boundaries. As Anthropic's own best practices documentation puts it: "Most best practices are based on one constraint: Claude's context window fills up fast, and performance degrades as it fills." Running /compact after completing a distinct phase - finished the auth module, finished a refactor - creates a compressed checkpoint before quality degrades.

Keep CLAUDE.md lean and targeted. Your CLAUDE.md file is read into every session. Each token in it costs budget before you type your first message. Keep it focused on things that are genuinely session-critical: architecture patterns, hard constraints, and current sprint context. Remove anything that isn't actively guiding behavior.

Break large tasks into focused sessions. A 200K context session covering five loosely related tasks is far more prone to rot than five 40K sessions each tackling one thing precisely.

These techniques all work. But here's the fundamental limitation of every single one of them: they're reactive. They manage the window after it has already started filling with noise. They don't address what goes in at session start - and that's where most of the damage is actually done.

The structural fix: treat pre-session context as an engineering problem

Here's the insight that changed how I work, and that drove the entire design of Mesh.

Context rot isn't primarily a volume problem. It's a signal-to-noise ratio problem.

A session loaded with 50,000 tokens of tightly relevant, well-structured context will outperform a session loaded with 15,000 tokens of loosely assembled raw files. More tokens can mean worse results - not because the model ran out of space, but because the relevant signal is now competing for attention against noise.

Google's Chrome DevTools team documented this empirically in a January 2026 post on token-efficient formats for AI assistance: when they replaced raw JSON performance trace data with a compact, structured format designed specifically for LLM consumption, they were able to fit dramatically more useful context in the window. The conclusion: "The optimized format enables a performance agent that can maintain a longer conversation history and provide more accurate, context-aware answers without getting overwhelmed by noise."

The New Stack's analysis of token-efficient data preparation quantified this more directly: organizations typically discover 40–60% waste in existing serialization approaches. Field names repeated across every record, verbose JSON nesting, raw prose with filler content - these consume tokens without conveying information the model needs.

This reframes the problem entirely. Instead of asking "how do I manage my context window during a session?", the better question is: "what's the highest signal-to-noise ratio I can achieve in the context I load before the session starts?"

What a token-optimised context block actually looks like

Let me make this concrete. Here's a typical approach most developers use when starting a Claude Code session for a feature task:

Context provided (raw approach):
- Entire Notion spec document (pasted): ~4,200 tokens
- Three related source files (full content): ~6,800 tokens  
- Recent Git log (last 30 commits, full messages): ~2,100 tokens
- Relevant tickets (copy-pasted from Jira): ~1,800 tokens
Total: ~14,900 tokens - loaded before typing one word of the actual task

Much of this is noise. The Notion spec includes meeting notes, background context, and historical discussion that are irrelevant to the specific task. The full source files include sections that won't be touched. The Git log includes commits from unrelated workstreams. The ticket copy-pastes include formatting artifacts, metadata, and comment threads.

Now contrast that with a Mesh-assembled context block for the same task:

<context>
  <task_scope>Implement rate limiting for the GitHub API service (issue #41)</task_scope>

  <relevant_code>
    <file path="src/services/github_service.py" scope="rate_limit_related">
      <!-- Only the 3 methods relevant to rate limit implementation - 280 tokens -->
    </file>
    <file path="src/models/config.py" scope="rate_limit_config">
      <!-- Config class with existing rate limit fields - 95 tokens -->
    </file>
  </relevant_code>

  <spec_excerpt privacy_scanned="true">
    <!-- Core requirements from Notion: 4 bullet points, no filler - 180 tokens -->
  </spec_excerpt>

  <git_context>
    <!-- Last 5 commits touching github_service.py only - 140 tokens -->
  </git_context>
</context>

Approximate token count for the structured block: ~700 tokens. Same task, same information where it matters - roughly 95% fewer tokens devoted to context before the session starts. That's not an artificial example. It's the difference between a session that degrades in 40 minutes and one that stays sharp for two hours.

The structured format does two things beyond compression. First, it uses labels (<task_scope>, <relevant_code>, <spec_excerpt>) that help the model's attention mechanism understand the type of each piece of context - which improves retrieval and reduces the lost-in-the-middle effect. Second, it front-loads the most important constraint (the task scope), putting it where primacy bias works in your favor.

A practical pre-session context checklist

Before you start your next Claude Code session on a substantive task, run through these five steps:

1. Define the scope precisely. Write one sentence describing what this session will accomplish. This becomes your <task_scope> or the first line of your CLAUDE.md session block. If you can't write it in one sentence, the task may be too broad for a single session.

2. Identify the 3–5 files that actually matter. For most feature tasks, there are 2–5 files in your codebase that are genuinely relevant. Load those, not the entire service layer. If you're uncertain, use /context to check usage after loading and trim ruthlessly.

3. Filter your external sources for recency and relevance. If you're pulling from Notion or a ticket system, bring only the spec requirements for this task - not the full page, not the historical comments, not the background. A 200-token excerpt of the right spec section is worth more than a 4,000-token full page dump.

4. Structure with clear section labels. Whether you're building the context manually or using a tool, add structural labels to the top of each context section. ## Task scope, ## Relevant files, ## Constraints - these act as attention anchors that help the model navigate a longer context without losing track.

5. Set a pre-session token budget ceiling. For a focused single-feature session, aim for under 3,000 tokens of loaded context before the task prompt. For a larger architectural task, 5,000–8,000 tokens is reasonable. If your pre-session context exceeds that, you're loading noise - go back and filter.

This checklist isn't the last word on the topic - for a deeper treatment, see the practical guide to building an LLM context stack and the foundational piece on context engineering published here on the HiveTrail blog.

Raw inputs vs. structured inputs: what the token difference actually looks like

Before going into the real-world evidence, it helps to see the cost difference side by side. This table compares a typical raw-input approach against a structured, filtered context block for the same feature task - implementing rate limiting for a GitHub API service.

Input type	Approach	Approx. tokens	Relevance to task	Attention risk
Full Notion spec page	Paste entire page	~4,200	~15% (rest is history, discussion, background)	High - noise pushes task scope into middle
Three full source files	Load full file content	~6,800	~30% (most code won't be touched)	High - irrelevant methods create distractors
Full Git log (30 commits)	Raw `git log` output	~2,100	~10% (most commits are unrelated workstreams)	Medium - commit messages from other features add noise
Jira tickets copy-paste	Full ticket with comments	~1,800	~20% (useful requirements buried in thread)	Medium - comment threads inflate context
Raw total		~14,900 tokens	~18% signal	High rot risk from first message
Filtered spec excerpt	Requirements only	~180	~95%	Minimal
Scoped file sections	Only methods relevant to rate limiting	~375	~90%	Minimal
Filtered Git history	Commits touching target files only	~140	~95%	Minimal
Structured XML block	Labelled, token-budgeted, privacy-scanned	~700	~92%	Low - labels act as attention anchors
Structured total		~700 tokens	~93% signal	Low rot risk, session stays sharp longer

The 14,900-token raw session and the 700-token structured session contain the same useful information. The difference is the noise ratio, and according to the Chroma research, that noise directly degrades every output the model produces for the rest of the session.

Real-world evidence: the cost of ignoring input quality

This isn't a purely theoretical argument. There are documented cases from real engineers, filed publicly, that illustrate exactly how this plays out in production development work.

The Ramp engineer and the 100K token search spiral. Anton Biryukov, a software engineer at Ramp, described a scenario that will be familiar to anyone doing serious Claude Code work: "Claude Code can burn 100K+ tokens searching Datadog, Braintrust, databases, and source code. Then compaction kicks in."

The mechanism here is important. Biryukov isn't describing a session that got too long - he's describing a session where the input quality of each tool call is poor. Each Datadog query, each database search, each source file read returns raw, unfiltered output that gets dumped wholesale into the context. By the time compaction triggers, the window is full of search results, most of which were intermediary steps rather than final answers. The session has to restart from a compressed summary that may have lost the connections between findings.

The fix isn't a bigger context window. The 1M context window, which Anthropic rolled out to address exactly this kind of workflow, helped - but Biryukov's observation was made after the 1M window shipped. More capacity delays the problem. Filtering input quality at the source eliminates it.

The RBAC implementation that lost its own schema decisions. GitHub issue #28984 documents a particularly instructive failure mode. A developer asked Claude Code to implement an RBAC system across routes, middleware, database schema, and tests - a realistic multi-file task. At ~15–20 tool calls in, compaction triggered. After compaction:

Claude lost track of schema decisions it had made earlier in the session
It re-read files it had already processed, burning more context
It contradicted its own recent work, requiring the developer to intervene and correct it
The compaction summary had compressed away the precise architectural decisions that made the earlier work coherent

The developer's proposed fix was: "Compaction should prioritize retaining file contents and architectural decisions over conversation history." That's the right instinct - but it's still a reactive frame. The deeper issue is that loading full file contents for every file that might be touched (rather than the specific functions that will be touched) bloated the context from the start, accelerating the timeline to compaction.

Simon Willison and MCP context pollution. Simon Willison - one of the most widely-read voices in the AI developer tools community, and the creator of Datasette - identified the exact same input quality problem in a different context: "Context pollution is why I rarely used MCP."

The problem: connecting 5–6 MCP integrations (Figma, Playwright, GitHub, a few custom servers) caused their tool definitions to consume 30–50% of the available context window before a single task message was sent. The model was starting every session with half its attention budget already spent on tool schemas it might never need for the current task.

Anthropic's solution - on-demand tool loading, where tool definitions are only loaded when the semantic search identifies them as relevant - reduced this dramatically. Willison's reaction: "Now there's no reason not to hook up dozens or hundreds of MCPs."

This is precisely the upstream argument applied to tools rather than files: don't load everything that might be relevant. Load only what's relevant to the current task. The principle is identical whether you're talking about file contents, spec documents, Git history, or MCP tool definitions.

The pattern across all three cases. None of these are model failures. The models are working as designed. The constraint in every case is what was competing for the model's attention before the productive work started. Each case also has the same structural fix: filter inputs for relevance at the source, before they enter the context window, rather than managing the consequences of noisy inputs mid-session.

This is the distinction that most context management advice misses. /compact is a good tool for a session that's already running. A structured, token-budgeted pre-session context is a better tool for a session that hasn't started yet.

Frequently asked questions

What is context rot in Claude Code?

Context rot is the gradual degradation of output quality that occurs as a Claude Code session's context window fills with accumulated tokens - messages, file reads, tool outputs, and responses. As more tokens accumulate, the model's attention mechanisms become less effective at focusing on what matters for the current task, producing less accurate, less coherent responses. The term was formalized by Chroma's 2025 research and is now used in Anthropic's own documentation.

Why does Claude Code get worse as the session gets longer?

The transformer architecture underlying Claude compares every token in the context against every other token to compute relationships. As the context grows, this comparison gets spread thinner - the signal-to-noise ratio drops. Information from earlier in the session, especially constraints and architectural decisions, tends to fall into the "middle" of the context window where attention is weakest. This is the lost-in-the-middle effect, documented in Liu et al.'s Stanford research.

Does /compact fix context rot in Claude Code?

/compact helps, but it's a reactive mitigation, not a structural fix. It compresses conversation history into a summary, freeing token budget. However, it works best when run proactively at natural task boundaries - not after quality has already degraded. If you compact a session that's already showing context rot symptoms, the summary may preserve confused outputs alongside the good ones.

How many tokens does Claude Code use before quality starts to drop?

Based on the GitHub issue #34685 report, real-world degradation with Opus 4.6 was reported starting at approximately 20% of the 1M context window - around 200K tokens. By 40%, automatic compaction triggered. By 48%, the model itself recommended restarting the session. For a 200K context window model, observable degradation can start considerably earlier depending on the quality of what's been loaded.

How can I improve Claude Code's output quality at the start of a session?

Focus on input quality, not just session hygiene. Load only the files directly relevant to the current task. Extract relevant excerpts from specs and tickets rather than pasting full documents. Structure your context with clear section labels. Set a token budget for pre-session context (under 3,000 tokens for focused tasks) and trim anything that doesn't directly serve the work. This is what the HiveTrail Mesh beta automates - pulling from Notion, local files, and Git, filtering for relevance, and assembling a structured, token-budgeted context block before the session starts.

The upstream fix

Every piece of advice in the "manage your context window" ecosystem - /clear, /compact, CLAUDE.md hygiene, focused sessions - is sound. I use all of it. But it treats a problem that's partly self-inflicted.

The real leverage is upstream: what you load before the session starts determines the baseline signal-to-noise ratio the session will operate at. A session starting with 700 tokens of structured, relevant context is fundamentally different from one starting with 15,000 tokens of loosely assembled raw content - even if the total context capacity is identical.

I built HiveTrail Mesh because I kept hitting this wall in my own development work. The tool assembles a token-optimised, structured XML context block from your Notion pages, local files, and Git history - scanning for privacy issues, filtering for relevance, and enforcing token budgets - so your Claude Code session starts with maximum signal and minimum noise.

If you want to try the pre-session context workflow described in this post, HiveTrail Mesh beta is open now.

Have questions about context engineering for Claude Code? The comment section is open, or reach out on Hivetrail's website. If this post was useful, the model benchmark series on HiveTrail has related data on how context quality affects model output - including why a cheaper model with better context often outperforms a more expensive one with shallow context.

How to Build an LLM Context Stack: A Practical Playbook for Developers (2026)

Amit Ben-Ari — Thu, 30 Apr 2026 10:04:43 +0000

Here is a pattern that repeats dozens of times a day in developer workflows everywhere. A developer opens Claude or ChatGPT, copies a chunk of code from their editor, pastes a Notion spec, adds a rough description of what they want, and sends it. The output is mediocre - generic, missing context, wrong in ways that take longer to fix than to write from scratch. So the developer refines the prompt, adds more text, pastes more files. The output gets marginally better. Twenty minutes have passed.

The problem is almost never the model. It is the context.

As Philipp Schmid, Technical Lead at Hugging Face, put it plainly in his widely-read 2025 post on context engineering:

"Most agent failures are not model failures anymore, they are context failures."

And Andrej Karpathy - former Director of AI at Tesla and co-founder of OpenAI - framed the discipline even more sharply when he endorsed the term "context engineering" on X in June 2025:

"In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

This post is about building that discipline into your personal LLM workflow - not as an abstract concept, but as a concrete, repeatable context engineering workflow you can apply to any task. We will cover what a context stack is, where to source its components, how to budget tokens wisely, why structure matters, and how to add a privacy layer before anything leaves your machine. By the end, you will have a practical playbook you can apply starting today.

What is a context stack (and why you need one)

The context window is the model's working memory - the total number of tokens it can process in a single request, shared between your input and its output. As of 2026, that window ranges from 128K tokens (DeepSeek, Mistral) to 10 million (Llama 4 Scout), with Claude's 200K and GPT-5.2's 400K sitting in the mid-tier.

A context stack is different. It is the curated set of information sources you assemble before you open an LLM session - the deliberate process of deciding what goes in, in what order, and in what structure. Think of the context window as a whiteboard. The context stack is the act of deciding, before the meeting, exactly which diagrams, reference sheets, and notes to pin on it.

Most developers skip this step entirely. They treat the context window like a junk drawer - pasting in whatever seems related and hoping the model figures it out. This works for trivial tasks. For anything complex, it guarantees degraded output.

There are four layers to a well-built context stack:

Task definition - exactly what you want the model to do, and any hard constraints
Project knowledge - the background context the model needs to understand your specific situation
Task-specific data - the files, diffs, records, or documents directly relevant to this request
Output format - the structure, tone, length, or schema you expect in the response

These four layers map to how a skilled human consultant would approach a complex task: first understand the goal, then understand the environment, then review the specific case, then produce structured output. Every LLM task benefits from this same mental model.

The science behind why context quality beats context quantity

Before we get into the mechanics, it is worth understanding why this matters at the research level - because the findings are counterintuitive and somewhat alarming.

In July 2025, Chroma Research published a landmark study: Context Rot: How Increasing Input Tokens Impacts LLM Performance. The researchers tested 18 frontier models - including GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, and Qwen3-235B - and reached a finding that should change how every developer thinks about context:

"We demonstrate that even under these minimal conditions, model performance degrades as input length increases, often in surprising and non-uniform ways."

Every single model they tested got worse as input length increased. Not some. Not most. All 18. And the degradation began well before the context window was close to full - a model with a 200K token window showed measurable performance decay at 50K tokens.

This is not a bug. It is an architectural property of transformer-based attention. As you add tokens, the signal-to-noise ratio in the model's attention mechanism deteriorates. More input means more relationships the model has to compute between tokens, and the important signal competes with an ever-larger pool of noise.

The Chroma study also identified a particularly counterintuitive finding: models perform better on shuffled context than on coherent, well-structured context when that context is long. Logical document flow actually hurts retrieval performance, because coherent narrative creates stronger positional patterns that bias the attention mechanism away from the actual target information.

This compounds a separate, well-established finding. In 2024, Nelson F. Liu and colleagues at Stanford published "Lost in the Middle: How Language Models Use Long Contexts" in the Transactions of the Association for Computational Linguistics. They found that LLM performance follows a U-shaped curve across input positions: models attend strongly to the beginning and end of context, and significantly less to the middle. When the answer document was placed at position 10 out of 20 documents in a question-answering task, GPT-3.5-Turbo's accuracy dropped by more than 20 percentage points compared to when the answer was at position 1 or position 20.

The practical implication is stark: if you paste a long project spec, then a git diff, then a README, then your actual question - the model may largely ignore the middle material, even though it "fits" in the context window.

The solution is not a better model. The solution is a better context stack.

The 5 sources most developers already have (and underuse)

Building a context stack does not require new tools, infrastructure, or RAG pipelines. The raw material is already sitting in your existing workflow. Here is how to think about each source and when to use it.

1. Notion pages and internal documentation

If you use Notion for specs, design docs, decision logs, or sprint planning, you are sitting on the single most valuable context source for product and feature work. Yet most developers either paste raw Notion content (which includes formatting noise, sidebar metadata, and orphaned draft text) or ignore it entirely.

What to extract: the specific sections relevant to the current task. For a PR description, that might be two paragraphs from the feature spec and the acceptance criteria. For a code review, it might be the architecture decision record. Not the whole document - the relevant slices.

When not to include it: if the page is outdated, heavily formatted, or longer than ~1,500 tokens and only tangentially related to the task. Stale context is worse than no context - the model treats it as signal and reasons from incorrect assumptions.

2. Git diffs and commit history

Git output is underused context gold, especially for developer-facing tasks like PR descriptions, release notes, code reviews, and regression analysis. A well-scoped git diff of the changed files, combined with the last 5 commit messages, gives a model nearly everything it needs to understand what changed, why, and how it fits into recent history.

The key is scoping. git diff HEAD~1 of a large codebase will produce thousands of tokens of noise. git diff HEAD~1 -- src/services/github_service.py gives you targeted signal. Specificity is the discipline.

Here is the practical toolkit for assembling Git context:

# Get the diff for only the files you changed on this branch
git diff main...HEAD -- $(git diff --name-only main...HEAD)

# Scope to a specific file or directory
git diff main...HEAD -- src/services/github_service.py

# Get the commit narrative for this branch (concise, one line per commit)
git log main...HEAD --oneline

# Count approximate tokens before pasting (rough estimate: 1 token ≈ 4 chars)
git diff main...HEAD -- src/services/ | wc -c
# Divide by 4 for a token estimate. If over 2,000, scope more tightly.

# Show only changed function signatures without full body (useful for large files)
git diff main...HEAD -- src/services/github_service.py | grep "^[+-].*def "

These two together - git diff scoped to changed files plus git log --oneline - typically produce 500–2,000 tokens of extremely high-density context. That is more useful information per token than almost any other source in your stack.

3. Local files: specs, READMEs, configs, and schemas

Local files are the forgotten layer. Developers habitually paste code snippets into LLM chats, but rarely include the broader file context that explains why the code is written the way it is - the architecture notes in a README, the data schema that a function operates on, the config file that controls feature flags.

For code tasks, the most useful local files are usually: the file being modified (scoped to the relevant functions, not the whole file), any type definitions or interfaces it imports, and the test file for that module. This trio gives the model the implementation, the contract, and the expected behavior - the three things it needs to produce correct code.

For writing tasks (docs, PR descriptions, release notes), the most useful local files are usually the previous version of the document being updated and any style guide or template you want the output to follow.

4. Previous LLM output and conversation checkpoints

This is the most underused source of all. If you have already used an LLM to produce a structured analysis, a draft spec, or a set of decisions - that output is a compressed, high-quality summary of prior reasoning. Feeding it back into a subsequent task saves both tokens and cognitive work.

A practical pattern: after any significant LLM session, extract and save a "context summary" - a brief structured document capturing the key decisions, constraints, and outputs from that session. Next time you return to the same feature or problem, this becomes the top-level context block in your stack.

This is especially valuable for Claude Code or Cursor workflows, where sessions can span multiple days. The manual checkpoint discipline - saving a HANDOFF.md before clearing context - is one of the highest-leverage habits in a modern dev workflow. The dev.to post "I Spent Four Weeks Reading 200+ Sources on Context Engineering" documented this concretely: a single 805-line SESSION.md was consuming 17,000 tokens on every session start, loading irrelevant historical context into every conversation. After trimming it to 59 lines, task benchmark scores jumped 32 points.

5. Reusable context blocks

Reusable context blocks are the equivalent of code snippets for context engineering. They are pre-written, curated descriptions of recurring context elements - your tech stack, coding conventions, project architecture, output templates, or persona definitions. You write them once and pull them into any relevant task.

Examples: a 300-token description of your service architecture that you include whenever asking for code that touches multiple services; a 200-token output format spec that defines the PR description template you want Claude to follow every time; a 150-token description of your team's coding standards that prevents the model from generating patterns you will just reject.

The return on investment for reusable context blocks is asymmetric: 30 minutes writing them pays dividends across hundreds of future tasks.

Token budgeting: how to decide what stays and what gets cut

Knowing your sources is not enough. The harder discipline is deciding how much of each source to include, and in what order.

The most actionable guidance comes from production data. Based on analysis of Anthropic and Manus production systems, published by AI practitioner Dex Horthy, the safe zone for context utilization is below 60% of the available window. At 70% utilization, precision begins to drop. At 85%, hallucination rates rise meaningfully.

This means a developer using Claude with a 200K token context window should aim to keep total input context below 120K tokens - and in practice, far below that for most tasks. The gap between focused and unfocused context is not marginal. Research compiled from the LongMemEval benchmark found that a focused 300-token context outperformed an unfocused 113,000-token context on the same task. The model with less context - but the right context - produced better results than the model with a context window 377 times larger. This is the defining empirical argument for context stack discipline: more is not better, right is better.

A focused 5,000-token context stack will almost always outperform a sprawling 80,000-token dump.

This validates what the Stanford research documented empirically. A model does not become more capable because you fed it more text. It becomes more distracted.

The ordering rule: the "lost in the middle" effect has a direct practical corollary for how you assemble your stack. Always put your highest-priority context at the beginning or end of your input, never in the middle. Specifically:

Opening block: task definition and constraints (what you want, what it must and must not do)
Middle blocks: project knowledge and supporting files (where degraded attention is acceptable because this context is background, not critical)
Closing block: task-specific data (the diff, the function, the document to be modified) - put this as close to your actual question as possible

This ordering ensures the model's strongest attention lands on the most critical inputs: the instructions and the specific target.

A practical token budget framework for common tasks:

Task	Task definition	Project context	Task-specific data	Total target
PR description	~200 tokens	~500 tokens	~1,500 tokens (diff + commits)	~2,500
Code review	~200 tokens	~300 tokens	~2,000 tokens (file + tests)	~2,500
Feature spec draft	~300 tokens	~800 tokens	~500 tokens (similar prior specs)	~1,600
Bug investigation	~200 tokens	~400 tokens	~1,500 tokens (error + relevant code)	~2,100

Note that even the most context-heavy of these tasks stays well under 3,000 tokens. That is not a constraint - it is the point. Focused, high-signal context at 3,000 tokens will produce better output than unfocused context at 30,000 tokens, because the signal-to-noise ratio is dramatically higher.

Structuring for the model: why XML context outperforms plain text

How you format your context stack matters as much as what you put in it. A plain text dump - even a well-curated one - forces the model to spend attention inferring boundaries between components. Explicit structure eliminates that overhead.

Anthropic's own prompt engineering documentation recommends XML tags for structuring complex inputs, specifically because Claude parses them more accurately than other delimiter formats. The guidance applies broadly: XML-style tags help any model parse multi-component inputs by providing semantic clarity about what each block contains and why it is there.

Here is the contrast in practice. This is a typical unstructured paste:

I need to write a PR description. Here is the diff: [paste of git diff] Here is the spec: [paste of Notion content] Here is the previous PR description format we use: [paste of template]

Here is the same content as a structured context stack:

<task>
  Write a PR description following the format in <output_template>.
  The change should be described from the perspective of the reviewer.
</task>

<output_template>
  ## Summary
  [1-2 sentences describing what changed and why]

  ## Changes
  - [bullet per meaningful change]

  ## Testing
  [what was tested and how]

  Closes #[issue number]
</output_template>

<feature_spec>
  [relevant 2 paragraphs from the Notion spec]
</feature_spec>

<git_diff>
  [scoped diff of changed files]
</git_diff>

The structured version does several things the unstructured version cannot. It tells the model the role of each block before the model reads it. It separates the instruction from the data from the output contract. It eliminates ambiguity about where one component ends and another begins.

Research cited by FlowHunt found that strategic use of descriptive XML tags can improve AI response quality by up to 40% for complex, multi-component tasks. The gain is largest when prompts include multiple distinct types of information - which is exactly the definition of a context stack.

The naming of your tags matters. <feature_spec> is clearer than <doc>. <git_diff> is clearer than <code>. <output_template> is clearer than <format>. The model reasons better when tag names communicate the purpose of the content, not just its type.

The privacy layer: what to scrub before your context leaves your machine

A context stack pulled from Notion, Git, and local files will frequently contain material that should not be sent to a third-party API: API keys, authentication tokens, database credentials, internal employee names, customer email addresses, financial data, or proprietary architecture details.

The risk is not hypothetical. As context stacks grow richer and more comprehensive, the surface area for accidental data exposure grows with them. A .env file pasted alongside application code, a Notion page that includes a vendor contract summary, a git diff that touches a config file - each of these is a potential privacy incident waiting to happen.

Before any context stack leaves your machine, run a brief mental checklist:

Strip these automatically:

Environment variables and .env file content
API keys, tokens, secrets, and credentials (even if they look expired)
Database connection strings
Internal IP addresses and hostnames in infrastructure files

Review these contextually:

Customer or user data (names, emails, IDs, transaction records)
Employee personal information
Vendor contracts, pricing, and commercial terms
Internal financial data or projections

Safe to include:

Anonymized code and logic
Technical architecture descriptions that do not reveal security posture
Public documentation and specs
Your own notes and task descriptions

A practical rule: if the data would require a legal review before you shared it with a contractor, it should not go into an LLM context without explicit approval. Most LLM API providers retain input data for safety and moderation purposes, and enterprise compliance teams are increasingly auditing AI usage for exactly this category of exposure.

The privacy layer is not an obstacle to building a rich context stack - it is what makes it safe to build a rich one. When your scrubbing step is fast and systematic, you can be more aggressive about including genuinely useful context.

Before/after: the same task, two ways

The best way to understand the value of a context stack is to see it applied to a real task. Below is a PR description workflow - one of the highest-frequency developer tasks - executed with and without a structured context stack.

The task: write a PR description for a bug fix in a GitHub service that was returning stale cached data after a user revoked a token.

Without a context stack (the typical approach)

The developer opens Claude, types:

"Write a PR description for my bug fix. It was a caching issue in the GitHub service."

Output (representative):

## Summary
Fix caching issue in GitHub service.

## Changes
- Fixed bug with caching
- Updated cache logic

## Testing
Tested locally.

This is the output that makes developers distrust AI writing assistance. It is generic, it contains no specific information, and it would be rejected in any serious code review. The developer then spends 10 minutes adding the specifics manually - at which point they might as well have written it from scratch.

With a structured context stack

The developer takes 3 minutes to assemble the following:

<task>
  Write a PR description for a bug fix. Follow the output_template format.
  The reviewer is a senior backend engineer who will want to understand
  the root cause and verify the fix is complete.
</task>

<output_template>
  ## Summary
  [1-2 sentences: what changed and why]

  ## Root cause
  [Brief technical explanation of what was broken]

  ## Fix
  [What the fix does and why it works]

  ## Testing
  [How it was tested]

  Closes #[issue number]
</output_template>

<feature_context>
  GitHub service caches API responses to reduce rate limit exposure.
  Cache TTL is 5 minutes. Token revocation is handled by the auth service
  and emits a `token.revoked` event on the internal event bus.
</feature_context>

<git_diff>
  diff --git a/services/github_service.py b/services/github_service.py
  -    def get_repos(self, user_id: str) -> list[Repo]:
  -        cached = self.cache.get(f"repos:{user_id}")
  -        if cached:
  -            return cached
  +    def get_repos(self, user_id: str) -> list[Repo]:
  +        token = self.auth.get_token(user_id)
  +        if token is None:
  +            self.cache.invalidate(f"repos:{user_id}")
  +            raise TokenRevokedError(user_id)
  +        cached = self.cache.get(f"repos:{user_id}")
  +        if cached:
  +            return cached
</git_diff>

<recent_commits>
  abc1234 Add token.revoked event listener to auth service
  def5678 Implement cache invalidation interface on CacheService
</recent_commits>

Output:

## Summary
Fix stale cache returns in GitHub service after token revocation, where
previously revoked users could still receive cached repo data for up to
5 minutes post-revocation.

## Root cause
The `get_repos` method checked the cache before validating the user's
token state. When a token was revoked via the auth service, the cache
was not invalidated, allowing the stale response to be served until
the TTL expired.

## Fix
Added a token validity check before the cache lookup. If the token is
absent (revoked), the cache entry is immediately invalidated and a
`TokenRevokedError` is raised. This ensures revoked users receive a
correct error response rather than stale data, regardless of cache TTL.

## Testing
- Verified cache invalidation fires on `token.revoked` event (unit test)
- Confirmed `TokenRevokedError` raised for revoked user (integration test)
- Tested TTL expiry path still returns fresh data for valid tokens

Closes #412

This output requires no editing. It accurately describes the root cause from the diff, correctly characterizes the fix, and reflects the testing from the commit history. The total assembly time was under 3 minutes; the editing time was zero.

The difference is not the model. The same model produced both outputs. The difference is the context stack.

This is exactly what HiveTrail Mesh was built to do. Mesh assembles your context stack automatically - pulling from Notion, Git, and local files into structured XML, with privacy scanning built in. It's in beta and free to try. Join the beta →

Frequently asked questions

What is an LLM context stack?

An LLM context stack is the curated set of information sources you assemble before an LLM session - a deliberate process of selecting, structuring, and ordering the context you feed into the model for a specific task. It includes task definition, project background, task-specific data, and output format instructions. Unlike the context window (the model's working memory), the context stack is what you build to fill that window intelligently.

How much of the context window should I fill?

Keep total context utilization below 60% of the available window for reliable results. Based on production data from Anthropic and Manus systems: at 70% utilization, precision begins to drop; at 85%, hallucination rates increase. For Claude Sonnet 4.6 with its 200K token window, this means keeping input context below approximately 120K tokens - and in practice, most single tasks can be handled in 2,000–5,000 tokens of focused context.

Why does LLM output quality degrade with more context?

The primary mechanism is attention dilution combined with the "lost in the middle" effect. As context grows longer, the transformer's attention mechanism computes more pairwise token relationships, and the signal-to-noise ratio falls. Chroma's 2025 study found that every one of 18 tested frontier models degraded as input length increased - this is an architectural property, not a bug that model updates will fix. Stanford's "Lost in the Middle" research showed an additional effect: models attend poorly to information in the middle third of long prompts, with accuracy drops exceeding 20 percentage points compared to content at the beginning or end.

What should I include in my LLM context?

In priority order: (1) task goal and hard constraints - always include this first; (2) output format or template - so the model knows the target structure before reading the data; (3) project background - the minimal architecture/spec context needed to understand your specific situation; (4) task-specific data - the actual code, diff, document, or record you want the model to work with. Include only what is directly relevant to the current task. When in doubt, leave it out.

Is XML better than plain text for LLM context?

For multi-component contexts - anything with more than one distinct type of information - yes. Anthropic explicitly recommends XML tags for structuring complex Claude inputs. XML tags communicate the role of each block before the model reads the content, reducing the attention overhead spent inferring boundaries. The benefit scales with task complexity: a simple one-paragraph question gains nothing from XML; a four-component context stack gains substantially.

What's the difference between a context stack and RAG?

RAG (Retrieval-Augmented Generation) is an automated, infrastructure-level pattern: a system automatically retrieves relevant content from a vector database or search index and injects it into each model call at runtime. A context stack is a curated, human-in-the-loop assembly process: you select and structure the relevant sources for a specific task. RAG is appropriate for production applications that need to handle many different queries automatically. A context stack is appropriate for individual developer workflows where you have direct knowledge of what context is most relevant. They solve overlapping but distinct problems.

How do I prevent PII leaks when using LLMs with company data?

Before assembling your context stack, run a systematic scrub: remove all environment variables, API keys, tokens, and credentials; remove or anonymize customer-identifying data (names, emails, IDs); and flag any vendor contract or financial information for review. The safest approach is a rule-based scanner that runs on all context before it is exported - catching secrets patterns, email formats, and credential-like strings automatically. For teams, consider making this a required step in any LLM workflow that touches production data or customer records.

Context Engineering vs Prompt Engineering: What the Shift Means for Developers

Amit Ben-Ari — Tue, 28 Apr 2026 10:19:08 +0000

You've been here before.

You're asking Claude or ChatGPT to help with something you've explained before - your project's architecture, your team's conventions, the fact that you use tRPC and not REST for internal APIs. The model gives you a technically correct answer that's completely wrong for your codebase.

You tweak the prompt. You add more detail. You try again. The result is marginally better. You spend ten minutes rewriting the request when the real problem was never the request at all.

This is the moment where prompt engineering reaches its limit - and where context engineering begins. When evaluating context engineering vs prompt engineering, the real issue isn't how you phrased the request. It's the information environment the model was working with when it answered.

Almost every article written on this topic frames it as a massive enterprise infrastructure problem requiring specialized platform teams. That framing is accurate for certain use cases. But it leaves out the 95% of developers and product managers who aren't building multi-agent platforms. They're just trying to get reliable, high-quality output from AI tools they use every single day.

This article is written for them.

What Prompt Engineering Actually Is (and Where It Earns Its Place)

Prompt engineering is the practice of crafting the input you send to a language model to get a better response. It encompasses techniques like role assignment ("you are a senior TypeScript engineer"), chain-of-thought reasoning ("think step by step"), few-shot examples, output format constraints, and negative prompting ("do not use Redux").

It's real, it works, and it's worth learning. For bounded, well-defined tasks where the model already has everything it needs to do the job, prompt engineering can make a significant difference:

Classifying a support ticket
Generating a SQL query from a description
Summarizing a meeting transcript
Writing a unit test for a function you paste in
Drafting a short email

In these cases, the model has the domain knowledge, the task is self-contained, and the challenge is communication - getting the model to understand exactly what you want. That's where prompt engineering shines.

The problem is that this represents a surprisingly small fraction of what developers and PMs actually use AI for day-to-day.

Where Prompt Engineering Breaks Down

Here's what breaks prompt engineering: when the model needs to know things it doesn't have access to.

Not things it doesn't know globally - models like Claude and GPT-4o are extraordinarily knowledgeable. But things it doesn't know about your specific situation: your codebase, your architectural decisions, your Notion docs, your team conventions, the business context behind this sprint.

Sound familiar? These are the exact failure modes developers hit constantly:

The Architecture Amnesia problem. You ask Claude to help refactor a module. It produces clean code using patterns your team deliberately moved away from six months ago. Nothing in your prompt told it you'd made that decision - because you shouldn't have to explain your entire tech history in every message.

The Convention Gap. You ask for a new component. It uses Styled Components. Your project uses Tailwind. You said nothing about this because you assumed it was obvious. It isn't. The model doesn't know what it can't see.

The Session Reset. In session 1, you built up a shared understanding with the model - you explained your domain, your data model, your naming conventions. In session 2, all of that is gone. The model is a blank slate. You start over.

The Prompt Spiral. You spend more time crafting and refining the prompt than you would have spent just doing the task yourself. You've accidentally made AI slower than not using AI.

These aren't failures of instruction quality. They're failures of information availability. As Thoughtworks engineer Bharani Subramaniam put it: "Context engineering is curating what the model sees so that you get a better result."

No amount of prompt refinement fixes a model that simply never saw the relevant information.

What Context Engineering Actually Changes

The shift from prompt engineering to context engineering is a shift in what you're optimizing.

Prompt engineering optimizes the instruction. Context engineering optimizes the information environment - what the model knows, remembers, has access to, and what it doesn't need to waste tokens on.

Andrej Karpathy, former co-founder of OpenAI, put it precisely when he posted on X in June 2025:

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."

His framing matters: it's both an art and a science. Science because there are principles - right information, right structure, right size. Art because judgment is always involved in deciding what belongs and what doesn't.

For an individual developer or PM, context engineering boils down to four practical levers:

1. What the model knows

This is your project conventions, architectural decisions, domain knowledge, business rules - everything a well-onboarded teammate would know on day three that you'd never want to explain again.

A practical example: developer Thomas Landgraf documented his approach to creating deep technical knowledge documents for Claude Code. Working on a complex IoT platform (Eclipse Ditto), he created a structured ditto-advanced-knowledge.md covering specialized policies, API patterns, and edge cases that the model's general training didn't cover. His reported outcome: "Features that previously took days of trial-and-error now ship in hours. The AI suggests optimizations I wouldn't have thought of." The model didn't get smarter - it got better information.

2. What the model remembers across sessions

Stateless models can't retain anything across conversations. Context engineering gives them a memory layer - not through technical magic, but through deliberately maintained files that get loaded at the start of each session. This is where LLM context quality starts to compound: consistent context in means consistent quality out.

The most direct implementation for Claude Code users is a CLAUDE.md file: a structured document at the root of your project that the model reads automatically. Claude Directory's guide captures the progression developers experience when they invest in this properly:

"Week 1: You write a basic CLAUDE.md and start structuring your prompts better. Claude's output improves noticeably. Month 1: Your CLAUDE.md is refined from real sessions. Claude feels like it knows your project. Month 3: Claude produces code that passes your review on the first try most of the time."

The same principle applies beyond Claude Code - any tool that accepts a system prompt or persistent instructions benefits from this approach.

3. How context is structured

More context is not better context. This is one of the most common misunderstandings developers have when they first encounter this idea.

Research has consistently shown that model performance degrades as context gets noisier. A Chroma Research study published in July 2025 testing 18 LLMs including Claude 4, GPT-4.1, and Gemini 2.5 found that "models do not use their context uniformly; performance grows increasingly unreliable as input length grows" - even on simple retrieval tasks. This failure mode has been called "context rot": the degradation that happens when the context window is filled with irrelevant or poorly structured material.

The practical implication: a well-structured 800-token context block will outperform a dumped 8,000-token blob of documentation. Structured context for LLMs isn't just tidier - it's measurably more effective. Token optimization for LLMs means being deliberate about every element in the context window: structure matters, format matters, and what you leave out matters as much as what you include.

As LiquidMetal AI put it in a concrete example: a developer working on a financial dashboard with compliance requirements didn't need to include their entire 15,000-token regulatory document. They extracted the single relevant constraint - "Dashboard must maintain full data accessibility for SEC compliance - no lazy loading permitted" - and injected that. The model understood the business context, the technical constraint, and produced a compliant implementation. Right information, minimal tokens.

4. What gets left out

Context engineering is as much about exclusion as inclusion. This is especially relevant for developers working with sensitive codebases - pasting entire files into a chat interface can expose API keys, internal credentials, customer data, or proprietary business logic that should never leave your local environment.

Filtering out your .env variables or proprietary business logic before pasting a snippet protects your data and instantly improves LLM context quality. A proper context engineering practice always includes a step for what not to include: credentials, PII, internal configurations, and irrelevant noise. Protecting your data and improving output quality are the same action.

When to Use Context Engineering: A Practical Guide

The conceptual distinction is useful, but what you actually need day-to-day is a signal for which approach to reach for. Here's a simple framework:

Situation	What to use
Single task, self-contained, model already has the knowledge it needs	Prompt engineering
Recurring task where you find yourself re-explaining the same background	Context engineering
Output quality degrades over a long conversation	Context engineering (context rot)
You're spending more time on the prompt than the task	Context engineering
You need to synthesize multiple sources (code + docs + tickets)	Context engineering
You want consistent output across multiple sessions or team members	Context engineering

The key diagnostic question is: am I fixing the instruction, or am I compensating for missing information?

If you've refined a prompt three times and it still feels wrong, the problem is almost certainly the second one. The model is working with incomplete information, and no rewording of the question will fix that.

Context Engineering Without RAG

Here's where most writing on this topic loses the individual developer: it assumes you're building a retrieval pipeline. That's a legitimate context engineering approach for production AI systems at scale. But it's not the starting point for a developer who wants better results from Claude Code or ChatGPT tomorrow morning.

The good news: you can practice context engineering with zero infrastructure. The tools you need are already in your editor.

CLAUDE.md / system prompt files. For Claude Code users, this is the single highest-leverage starting point. A well-designed CLAUDE.md tells the model your stack, your conventions, your architectural decisions, what commands to run to test things, and what patterns to avoid. Cole Medin's open-source context engineering template - which has become a popular reference point in the community - demonstrates how a minimal CLAUDE.md paired with a PRPs/ folder for feature-specific context can fundamentally change the quality of what you get back.

Reusable context blocks. Identify the parts of your context that are relevant across multiple tasks - your data model, your API patterns, your team's naming conventions - and store them as structured markdown files. Load them manually when relevant rather than re-explaining every session.

Task-scoped context assembly. Before running a complex task, spend two minutes assembling the relevant context: the specific files, the relevant docs, the issue reference, the acceptance criteria. This is the manual version of what a RAG pipeline does automatically. It feels slow at first and becomes very fast with practice.

Structured formats for complex tasks. Plain prose context is harder for a model to navigate than structured context. A simple format like XML or structured markdown - with clear sections for architecture, conventions, task details, and constraints - consistently outperforms an equivalent amount of unstructured text.

HiveTrail Mesh streamlines this exact workflow by assembling token-optimized, structured context from Notion pages, local files, and Git repositories - with built-in privacy scanning to filter out what shouldn't be included - and exporting it ready to paste. It's one of the context engineering tools for developers that removes the assembly overhead without requiring any pipeline infrastructure. Doing this manually works, and many developers do it that way. The tradeoff is time. If you're running this kind of context assembly dozens of times a week, automating it pays back quickly.

The Real-World Impact: What Changes When You Make the Shift

The qualitative shift developers describe when they start practicing context engineering consistently is notable. It's not that the model suddenly becomes more intelligent. It's that it stops being a stranger to your project.

Freelance fullstack developer Christopher Groß wrote about this on Dev.to from a year of daily Claude Code use:

"When I start working on a new project, the first thing I create is the CLAUDE.md. Not the first component, not the first feature - the context file... After that I save myself from rebuilding that context in every single AI session. I don't explain 'we use Tailwind, not Styled Components' or 'all texts must go through i18n' anymore. It's in the file. Claude reads it, follows it."

The practical difference he describes: asking for a new section on the services page now produces code that uses his design system, follows TypeScript strict mode, and contains no hardcoded strings - without him having to specify any of that. The context file did the work.

This mirrors a broader pattern. Developers who invest in context engineering report a compound effect: the upfront time to structure context is paid back quickly, and returns accelerate over time as the context files mature.

Frequently Asked Questions

Is prompt engineering dead?

No. Prompt engineering is a real skill with a genuine use case: getting the best instruction into the context window. What's changing is that it's increasingly understood as one component of a larger practice, not the whole game. The industry didn't abandon web design when UX emerged as a distinct discipline - it recognized both were needed. The same applies here.

Do I need to set up RAG to practice context engineering?

No. RAG is one way to populate a context window with relevant information automatically at runtime. But manually curated context files, structured prompts, and task-scoped assembly are all valid approaches that require no infrastructure. RAG becomes relevant when the scale of information makes manual curation impractical.

What's the difference between context engineering and just adding more context?

Quality and structure, not quantity. More tokens in the context window doesn't mean better output - it can actively hurt performance through context rot. Context engineering is about selecting the right information and presenting it in the right structure. The goal is signal density, not volume.

Can a product manager (non-developer) practice context engineering?

Absolutely. Context engineering at the individual level is fundamentally about curation and structure, not code. A PM assembling a structured context block from Notion spec pages, issue references, and business constraints before asking an AI to draft a requirements document is doing context engineering. No technical background required.

Why does the same prompt produce different quality output on different days?

Because context is different. The session state, what you included, how you structured it, how much of the conversation history is still in the window - all of these affect output. Inconsistency is usually a context problem, not a model problem.

The Takeaway

The developer at the beginning of this article wasn't writing bad prompts. They were treating a context problem like an instruction problem - and no amount of prompt refinement fixes that.

The shift to context engineering isn't about learning a new technology. It's about changing the mental model: from "how do I ask this better?" to "what does the model need to know right now?"

Once you make that shift, the improvement in output quality isn't incremental. It's categorical.

Start with your definitive guide to context engineering →

Try HiveTrail Mesh beta - context assembly built for developers →

Amit is the founder of HiveTrail, building Mesh - a desktop tool that assembles structured, token-optimized context from Notion, local files, and Git for developers and PMs who work with LLMs daily.

Context Engineering for Developers: A Practical Guide (2026)

Amit Ben-Ari — Fri, 24 Apr 2026 08:43:52 +0000

You've been there. You paste something into Claude or ChatGPT, get a mediocre answer, then realize two seconds later that the three files that would have made the difference are sitting open in your editor. You add them, resend, and suddenly the response is exactly what you needed.

That moment of realization - this is what the model needed to see - is context engineering. You've been doing it since the first day you used an LLM. You just didn't have a name for it, or a system.

Now you do. And the difference between doing it deliberately versus doing it by accident is the difference between AI that occasionally impresses you and AI that you can actually depend on.

This guide is a practical walkthrough for developers, engineers, and PMs who work with LLMs daily. Not the kind of "context engineering" that means building RAG pipelines or multi-agent orchestration systems - there are plenty of excellent guides for that. This one is about the workflow layer: how to assemble the right context for your next task, right now, from the sources you already have.

What context engineering actually is

In June 2025, Andrej Karpathy posted what became the most widely quoted definition in AI engineering:

"In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Andrej Karpathy on X

Shopify CEO Tobi Lütke had framed the same idea a day earlier: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

Both were naming something practitioners had already been wrestling with. The term caught because it was precise.

Here's the cleanest way to think about it:

Prompt engineering is about how you ask. The wording, structure, and format of your instruction.

Context engineering is about what the model sees before it processes your instruction. The documents, code, history, specs, and data you include - or exclude - shapes the entire output.

The analogy Karpathy himself uses: think of the LLM as a CPU and its context window as RAM. The model can only work with what's currently loaded into that working memory. Your job, as the developer or practitioner building a task, is to act like an operating system - curating what gets loaded, in what format, so the model has exactly what it needs for this specific computation.

The prompt is a question. Context engineering is everything that determines whether the model has what it needs to answer it well.

What it is not

Context engineering is often confused with adjacent concepts. Being clear about the distinctions matters:

It is not just RAG. Retrieval-Augmented Generation is one technical pattern for assembling context automatically. Context engineering is the broader discipline - RAG is one tool within it.
It is not CLAUDE.md or AGENTS.md. Those are context configuration artifacts for persistent sessions. Valuable, but they address the always-on layer. Session-level context - assembling the right payload for a specific task - is a separate and equally important problem.
It is not fine-tuning. Fine-tuning modifies the model. Context engineering modifies what the model sees at runtime. These are fundamentally different levers.

The LangChain team's thorough breakdown of context engineering strategies describes four core approaches - write, select, compress, isolate - and shows how they interact in agent systems. It's worth reading for the system-building layer. This guide focuses on the practitioner layer: doing this well, manually, session by session.

The four problems that poor context engineering creates

Before getting to solutions, it's worth naming the failure modes precisely. Each of these shows up in daily work.

1. Context rot

Chroma Research published a study in July 2025 - "Context Rot: How Increasing Input Tokens Impacts LLM Performance" - testing 18 state-of-the-art LLMs including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. The finding: model performance degrades as input length increases, often in non-uniform and surprising ways. The study found that adding irrelevant context - not just length, but noise - significantly degrades model performance even on simple tasks.

This matters because of how most LLM sessions actually run. You start a session with a specific task. You get partway through. You add more files. Ask follow-up questions. Reference earlier parts of the conversation. By the time you're on message fifteen, the model is working in a context that has doubled in size, the original clear specification is buried, and the earlier assumptions are silently competing with the later ones.

The model isn't getting worse. The context is getting noisy.

The Adobe research team's NoLiMa benchmark (presented at ICML 2025) showed this quantitatively: at 32K tokens, 11 out of 12 tested models dropped below 50% of their performance in short contexts. GPT-4o dropped from 99.3% accuracy at 1K tokens to 69.7% at 32K tokens - on the same task, with the same core information present.

The practical upshot: a long session with accumulated context is not better than a fresh, focused one. It is frequently worse.

2. Context poverty

The opposite problem is providing too little - or the wrong kind. A diffstat summary instead of the actual diff. A Notion page title instead of the page content. A vague task description without the spec it's supposed to implement.

The model fills the gaps with plausible-sounding completions drawn from its training data. These completions can feel coherent while being wrong for your specific codebase, your specific architecture, or your specific customer. You get generic output that could have been written for any company, because the model didn't have what it needed to write specifically for yours.

This is the mechanism we documented in our Haiku vs. Sonnet experiment: Claude Code running Sonnet 4.6 was given a high-level diffstat. Claude Haiku 4.5 was given a 380KB structured XML file containing full file contents, unified diffs, and commit metadata. Haiku - a smaller, cheaper model - produced a measurably better PR description. The model that could see the primary source material didn't need to guess.

3. Context contamination

Manual context assembly under time pressure creates a specific and underappreciated risk: accidentally including sensitive data.

According to LayerX's Enterprise AI & SaaS Data Security Report 2025, 77% of enterprise employees paste data into GenAI tools, and 82% of that activity happens through personal, unmanaged accounts outside enterprise oversight. Of file uploads to AI tools, 40% contain PII or PCI data.

This isn't malice - it's workflow. A developer debugging a production issue pastes the relevant log file. What they didn't notice: an API key three lines above the function, or an internal database hostname, or a customer email in a comment. The Samsung incident in March 2023 - where semiconductor engineers pasted proprietary source code and meeting notes into ChatGPT, effectively transferring trade secrets to OpenAI's servers - involved no bad actors. Just engineers using the tool the way engineers use tools, without a review step.

Three separate incidents. Twenty days. No one noticed until the damage was done.

4. Context drift

Different sessions get different context. Two developers on the same team give the same task to the same model on the same day and get wildly different results, because one included the architecture decision record and the other didn't. One pasted the relevant Notion spec; the other described it from memory.

No reproducibility, no shared baseline, no way to know which output to trust. The model's capability becomes unreliable not because it's inconsistent - it's quite consistent given the same input - but because the inputs are never actually the same.

The proof: context quality beats model tier

Every claim in this guide rests on a reproducible finding from our own practice. We ran a controlled comparison to generate a PR description for the same feature branch using two different setups:

Setup A - Claude Code with Sonnet 4.6

Context provided: a high-level diffstat (standard Claude Code behaviour)

Setup B - Claude Haiku 4.5 via web chat

Context provided: a 380KB structured XML file containing the full content of every changed file, unified diffs for every file, per-commit metadata, and a structured commit log. 106,120 tokens of primary source material.

Haiku won. Clearly. The output named the product feature correctly, described cross-component dependencies accurately, explained test coverage, and wrote in the register of a senior engineer who understood the codebase. Sonnet's output referenced "the Stack" without explanation and missed architectural context that was obvious once you could see the actual code.

The lesson isn't "use Haiku." It's that the model that sees the primary source material will outperform the model working from summaries - regardless of model tier.

This finding is consistent with the broader research. The Faros engineering team documented a similar pattern: agents with access to specific, structured, codebase-level context consistently outperformed agents with generic high-level guidance, even when the guiding documentation was comprehensive. "A rule like 'follow DRY principles' helped in theory but didn't prevent the specific anti-patterns unique to each codebase."

Context specificity beats context volume. Primary sources beat summaries. Structured beats unstructured.

Five principles for practical context engineering

These are the principles we've distilled from building Mesh and from the research above. They apply whether you're assembling context manually or through tooling.

Principle 1 - Relevant is better than large

The goal is not to fit as much as possible into the context window. It is to include only what the model needs for this specific task. Every irrelevant token competes for the model's attention - the Chroma and NoLiMa research both confirm this happens at a measurable level.

When building a context stack, start with what the model must see to produce a correct, specific answer. Add things only if they are genuinely load-bearing for this task. When in doubt, leave it out.

Principle 2 - Primary sources over summaries

Always prefer the actual document, file, or data over a description of it.

If the model needs to understand a design decision, give it the ADR - not your recollection of the ADR. If it's writing a PR description, give it the diff - not the diffstat. If it's reviewing a feature spec, give it the spec - not the bullet points you extracted from the spec last week.

Summaries introduce interpretation. Models working from summaries are working from your interpretation of what mattered. They may be missing the one detail that would have changed the output.

Principle 3 - Structure helps the model reason

Unstructured prose dumps are harder for models to reason over than well-structured context. Markdown with clear headers, labelled sections, XML with descriptive element names - these give the model anchors. It can locate the relevant section, understand what it contains, and reference it explicitly in its output.

This is why the PR Brief in our experiment was exported as structured XML: <file>, <diff>, <commit>, <metadata>. The model could parse the payload, not just read it.

The O'Reilly context engineering guide frames this as the difference between "writing the prompt" and "writing the screenplay" - you're not crafting an instruction, you're designing an information environment the model will navigate.

Principle 4 - Just-in-time, not always-on

Context should be assembled fresh for each task. Stale context - a Notion page from two weeks ago, a file cached before yesterday's refactor, notes from a meeting that's since been superseded - is worse than no context. It's confidently wrong.

The "just-in-time" principle means: fetch your sources at the moment you need them, not in advance. Build the context stack when you're about to run the task, not when you think you might run it later.

Principle 5 - Privacy is a step, not an afterthought

Context assembly is the moment your data is most exposed. You're gathering real files, real Notion pages, real git history, and combining them into a payload you're about to send to an external server.

A review step - even thirty seconds of scanning before you export - catches the things that manual assembly misses. API keys buried in config files. PII in a comment you forgot was there. Internal hostnames that shouldn't leave your network. The Samsung engineers weren't careless by disposition; they were working fast, under time pressure, in the way developers always work.

The gap between "safe" and "unsafe" context assembly is usually one quick scan step before export.

A practical workflow: building your context stack

Here's the end-to-end workflow for assembling context deliberately. Walk through this for any non-trivial task before opening your LLM client.

Step 1 - Define the task scope precisely

Before you assemble anything, answer one question: What does the model need to produce a correct, specific, non-generic output for this task? Write it down if it helps. Not "help me with authentication" but "I need to implement token refresh in our Express API, following the pattern in auth/session.ts, without changing the existing middleware signature."

The more precise your scope, the clearer it becomes which context is load-bearing and which is noise.

Step 2 - Map your sources

For the task you've defined, which information sources are actually required? Common categories:

Relevant code files - not the whole codebase, the specific files and modules that directly touch the task
Specification or design docs - the Notion page, the ADR, the GitHub issue, the feature brief
Git context - the relevant diff, recent commits in the affected area, any WIP changes
Reference implementations - examples of the pattern you want to follow elsewhere in the codebase

Write the list. Two or three items usually covers 80% of what the model needs.

Step 3 - Select and filter, don't dump

From each source, include only the sections directly relevant to the task. A 40-page Notion database is not a context stack - it's noise. The three pages in that database that define the relevant data model and auth flows are.

If you find yourself including something "just in case," it probably shouldn't be there. Irrelevant context doesn't help - based on the research, it actively hurts.

Step 4 - Structure for the model

Organise your assembled context with clear section headings, labels, and hierarchy. If you're building this manually in a text editor, use markdown headers to separate sections. If you're using a tool that exports to XML, use descriptive element names that communicate meaning.

The model reads structure as signal. <spec>, <current_implementation>, and <example_pattern> tell the model something about how to use each section. A flat paste of three files tells it nothing.

Step 5 - Token-check before you send

Know your model's effective window for this task type, and count your tokens before sending. Most LLM interfaces show a token count. Most developers ignore it until things break.

If you're approaching the model's effective range (not the advertised maximum - the NoLiMa and Chroma research both show effective range is considerably lower), trim aggressively. The last item added is usually the least essential.

Step 6 - Run the privacy gate

Before any context leaves your machine, spend thirty seconds scanning for things that shouldn't be there. API keys, OAuth tokens, internal hostnames, customer names, email addresses, financial data.

This step is a personal hygiene baseline regardless of whether you're at a company with a security team or a solo builder working from a home office. External servers are external servers.

Step 7 - Separate assembly from generation

This is the shift that makes everything else sustainable.

Build your context payload first. Then open your LLM client. Don't build context inside the chat thread as you go - that's how sessions accumulate noise, how privacy review gets skipped, and how context drift between sessions becomes permanent.

Treating context assembly as a distinct workflow step - upstream of generation - is what separates ad-hoc AI use from a reproducible, reliable process.

Context engineering by role

The principles above apply universally. How they translate to practice depends on where you sit.

The developer

Your context stack for a coding task typically needs: the specific files directly involved in the change (not the whole module), the relevant git history for those files, the linked issue or ticket with acceptance criteria, and any architectural decision records that govern the pattern being implemented.

The common failure mode: giving the model only the file you're currently editing when the actual dependency that's causing the bug lives elsewhere. The model produces a fix that works in isolation and breaks in integration, because it couldn't see what it was integrating with.

Quick checklist for coding tasks:

The specific files being changed (glob-selected, not whole repo dumps)
The diff or changeset if working from existing code
The issue/ticket defining what "done" looks like
One reference implementation showing the established pattern

The product manager

Your context stack for a feature brief, spec review, or roadmap document typically needs: the relevant user research or customer feedback, the existing product spec or PRD section this relates to, competitive positioning context if relevant, and the current sprint or milestone context.

The common failure mode: asking the model to "write a spec for X" with no product context and getting an output that could apply to any software company. The model is not being lazy - it genuinely doesn't know your product, your users, or your architectural constraints. That's context you have to provide.

Quick checklist for PM tasks:

The user research or feedback that motivated this feature
The relevant section of your existing PRD or product principles
Any constraints (technical, timeline, resource) that shape scope
An example of a spec or brief that matched the format you want

The solo builder / LLM power user

Your challenge is different: you work across many contexts (your product, client work, research, writing) and rebuilding a context stack from scratch each session is the biggest time tax.

The solution is a personal context library - reusable blocks you've pre-assembled: a "this is my product" block, a "this is my stack" block, a "this is my code style" block. You compose from these fast, rather than reconstructing from source each time.

The goal isn't to have one massive always-on context. It's to have well-organised, modular pieces you can quickly combine into a precise payload for each task.

Quick checklist for solo builders:

A saved "company/project context" block (architecture, tone, core constraints)
The specific task documents for today's session
A saved "instructions" block for task-specific behaviour you want to persist
A clear separation between sessions - don't carry yesterday's context into today's task

What this looks like with a tool built for it

The workflow above works manually. If you're doing this once or twice a week, a text editor and some discipline is sufficient.

If you're doing this daily - assembling context from Notion, local files, git diffs, GitHub issues, and saved reusable blocks, across multiple projects - the manual workflow becomes a bottleneck. The friction of assembly creates pressure to skip steps: to use a summary instead of the source, to skip the privacy review, to copy from yesterday's session instead of fetching fresh.

HiveTrail Mesh is the tool we built specifically for this workflow. The design maps directly to the seven steps above:

Connect your sources - link your Notion workspace, point Mesh at local directories with glob patterns, connect your GitHub account, or draw from your saved Context Blocks library. Sources are connected once; fetched just-in-time on every export.

Build the stack - drag relevant items into the Stack, reorder them by priority, pin the ones you always need for this project. Every item shows a live token count. The model-aware counter updates as you add and remove items.

Token-check and trim - the Output Editor gives you a live view of your assembled context, with a token count updating in real time against your target model's effective window. Trim until you're within range.

Privacy gate - before export, Mesh's Privacy Scanner runs automatically, flagging API keys, PII, and internal paths. The Exit Gate blocks unsafe exports. Nothing leaves until it's clean.

Export - to clipboard, directly to your LLM client, or as a file. The assembled context is model-agnostic - you bring it into Claude, Gemini, GPT-4o, or Claude Code. Context assembly is handled; generation is your choice.

The JIT principle is built into the architecture: every item in the stack is fetched at the moment of export, not cached from a previous session. The Notion page you're exporting today reflects today's version of that page.

Mesh is currently in limited beta. If this workflow resonates, request early access here.

Frequently asked questions

What is context engineering, in plain terms?

Context engineering is the discipline of deciding what information enters a language model's context window for a given task - what to include, how much, how it's structured, and when it's fetched. It's the work that happens before you write your prompt, and it determines most of the quality of the output you get back.

What's the difference between context engineering and prompt engineering?

Prompt engineering is about how you ask the question - the wording, structure, and format of your instruction. Context engineering is about what the model sees before it processes your question. They're complementary, but context engineering has a higher ceiling: you can write a perfect prompt and still get a mediocre answer if the model doesn't have the information it needs.

Does context engineering replace RAG?

No. Retrieval-Augmented Generation is one technical approach to assembling context automatically at query time. It's a specific implementation pattern within context engineering. For many production systems, RAG is the right tool. For session-level, practitioner-level work - assembling context for a specific task today - manual or semi-manual context assembly is often more precise and more controllable than automated retrieval.

How do I know if my context is too large?

Check your token count against the model's effective range - not its advertised maximum. The NoLiMa benchmark showed that 11 of 12 tested models dropped below 50% performance at 32K tokens, even for models claiming 128K+ context windows. In practice, aim for the smallest context that contains everything load-bearing for your task. If you're adding something "just in case," it's probably noise.

What is context rot and how do I prevent it?

Context rot is the performance degradation that occurs as LLM context length grows - documented by Chroma Research across 18 state-of-the-art models. The practical prevention: start fresh sessions for distinct tasks rather than accumulating everything in one thread, fetch context just-in-time rather than carrying it over from previous sessions, and aggressively trim irrelevant content before sending.

What tools exist for context engineering?

For pipeline and agent-level context engineering: LlamaIndex for data ingestion and retrieval, LangGraph for agent orchestration, and CLAUDE.md / AGENTS.md configuration files for persistent coding agent context. For session-level, practitioner-level context assembly: HiveTrail Mesh, which handles source connection, stack assembly, token management, privacy scanning, and structured export for individual workflows.

The bottom line

Next time you're about to paste something into your LLM client, pause for a moment and ask: is this the context the model actually needs to produce a specific, correct answer for this task?

That question is context engineering. The discipline is the same whether you're architecting a 10-agent pipeline or putting together a feature brief on a Tuesday afternoon. The difference is whether you're doing it deliberately, with a system - or by instinct, hoping the model figures out what you needed it to see.

The good news: the gap between casual and deliberate is not large. A clear task definition, a short source list, structured assembly, a token check, and a thirty-second privacy review. That's the workflow. It takes a few extra minutes and it moves the quality ceiling significantly.

The models are good. Give them what they need to work with.

HiveTrail builds tools for developers and PMs who work with LLMs daily. HiveTrail Mesh is a desktop application for assembling, managing, and securely exporting structured context from Notion, local files, git repositories, and reusable context blocks. Currently in limited beta - request early access.

Related reading from HiveTrail:

We Had Gemini Blind-Judge Three Claude-Generated Pull Requests. Here's the Template It Built.

Amit Ben-Ari — Thu, 16 Apr 2026 07:00:00 +0000

Originally published at hivetrail.com

Most AI-generated pull request descriptions have a problem that's easy to miss: they sound right.

The structure is there. The sections are filled in. The tone is professional. And somewhere in the Testing section, the AI has written something like "comprehensive test coverage was added to ensure correct functionality" - which is confident, grammatically correct, and completely useless to anyone trying to review your code.

The AI didn't hallucinate because it's a bad model. It hallucinated because you gave it a diffstat and a list of commit subjects, and asked it to describe a 32-file, 27-commit feature. It did its best with what it had. The output looks like a PR description. It just isn't one.

This post is about what a real AI-generated PR description looks like - and how you can build one. The backstory is short: we ran the same one-line prompt against three different context conditions, then asked Gemini 3 Pro to evaluate the outputs blind. It didn't know which model produced which text. It didn't know how many tokens each used. It judged purely on engineering utility.

It ranked them. Then it did something more useful: it told us exactly what the ideal PR description looks like, by identifying the best element from each output and explaining why it worked.

We turned that synthesis into a template. It's below. Use it today, regardless of your tooling.

The Experiment, In Brief

We were building a real feature - Git Tools for HiveTrail Mesh - 27 commits, 32 files. After wrapping up, we ran the same prompt against three conditions:

Prompt: "Based on the staged changes / recent commits, write me a PR title and description."

Tool	Model	Context
Condition A: Claude Code	Sonnet 4.6	Native git: `diffstat` + `--oneline` commit log (~61K tokens)
Condition B: Claude web chat	Sonnet 4.6	Mesh PR Brief: 106K tokens of full files, diffs, & structured commit log
Condition C: Claude web chat	Haiku 4.5	Mesh PR Brief: 106K tokens of full files, diffs, & structured commit log

Gemini 3 Pro then evaluated all three outputs without knowing the conditions - no model names, no token counts, just the raw PR text.

Posts 1 and 2 covered the results in detail. The short version: Condition A (native Claude Code) came in last in every evaluation. Both Mesh-context outputs beat it substantially, and Haiku 4.5 with full context outranked Sonnet 4.6 without it.

But the most useful thing Gemini produced wasn't the ranking. It was the synthesis.

What Gemini Said the Ideal PR Looks Like

After ranking the three outputs, Gemini identified the single strongest element from each and described what you'd get if you combined them:

"The ideal PR description would use the structure and design rationale of [Condition B], the actionable test plan of [Condition C - Claude Code fed the Mesh XML], and the crisp inline code formatting and bug-fix callouts of [Condition B]."

Two things are immediately notable here. First, every element that made it into the ideal template came from a Mesh-context output. The native Claude Code PR - working from a diffstat and oneline commit log alone - contributed nothing to the synthesis. Second, the Mesh outputs each contributed different strengths depending on how the context was consumed, which means context quality is necessary but not sufficient. Structure, model, and interface all still matter.

Here's what each element actually means in practice:

Structure layered by architectural tier + Key Design Decisions section (from Condition B - Mesh + Sonnet 4.6 via web chat)

Grouping changes by layer - Models, Service Layer, State & Stack, UI, Bug Fixes, Tests - means a backend engineer can jump straight to the service layer, a UI reviewer can go straight to the components section, and a PM can read the summary and Key Design Decisions without parsing file lists. The Key Design Decisions section is the part most PRs skip entirely: it explains why architectural choices were made, not just what changed. Gemini flagged this as "invaluable for team alignment and long-term maintainability." It's also the section an AI is most likely to hallucinate if it didn't read the actual code, because the reasoning lives in implementation decisions, not in commit messages.

Actionable, scenario-specific test plan (from Condition C - Claude Code fed the Mesh XML)

There's a meaningful difference between "41 tests passing" and "trigger a file-read failure on a locked binary - confirm the stack card shows an orange warning icon and the edit dialog displays an error banner with partial content." The first is a status report. The second is a verification guide. Gemini specifically praised this output for providing "specific, actionable steps to verify the feature" that "remove ambiguity" for QA and PMs. This level of specificity requires the AI to know what your failure states actually look like - information that lives in the diff, not in the commit subject line.

Rigorous inline code formatting + dedicated Bug Fixes & Hardening section (from Condition B - Mesh + Sonnet 4.6 via web chat)

Backtick formatting for every variable, class name, and file path makes a PR scannable. commit_log stands out from surrounding prose; "the commit_log fallback" does not. Separately, pulling bug fixes out of the "What's Changed" section into their own dedicated block is a PM-facing signal: it shows that the PR handles edge cases, not just the happy path. Gemini called this "a great PM practice." It's also easy to miss in a flat file-by-file list.

The Template

Here it is. Copy the block below directly - it's ready to paste into your PR description field or a reusable snippet.

Then scroll past it for annotation on what each section is for and who it serves.

## [feat/fix/chore](#issue-number): [short imperative description]

### Summary
[2–3 sentences: what this is and where it fits in the product - name the feature
and its context within the broader system, not just what files changed]

**[Workflow or feature name]** - [what it does] for [user goal]
**[Second workflow, if applicable]** - [same pattern]

---

### Key Design Decisions

- **[Decision name]:** [What was decided] - [the alternative considered and why
  this approach won, or the constraint it addresses]
- **[Decision name]:** [What was decided] - [tradeoff or edge case it handles]
- **[Decision name]:** [What was decided] - [why the obvious alternative was
  rejected]

---

### What's Changed

#### Models & Architecture
- `ModelName` - [what it is, one line]
- `AnotherModel` - [discriminated union support, computed fields, etc.]

#### Service Layer
- `service_file.py` - [stateless/stateful, what operations it covers]

#### State & Stack
- `HandlerName` - [how it integrates into the lifecycle]
- `ManagerMethod` - [what new capability it exposes]

#### UI - [Panel/Component Area]
- `ComponentName` - [what it renders or manages]
- `DialogName` - [tabs, actions, edge case handling]

#### Bug Fixes & Hardening
- Fixed [specific issue] by [specific mechanism] to prevent [failure mode]
- Changed `fallback` from `"old_value"` to `correct_value` to prevent
  [specific error class]
- Downgraded `[log_method]` from `info` to `debug` to reduce [noise type]

---

### Test Plan

- [ ] [Core scenario]: [exact setup steps] → confirm [specific expected output]
- [ ] [Edge case]: Trigger [failure condition] (e.g., [concrete example]) →
  confirm [specific UI state or error behavior]
- [ ] [Selection/state scenario]: [user action] → confirm [downstream behavior]
- [ ] [Persistence scenario]: Save [config], reload app → confirm [state restored]
- [ ] [Regression check]: Confirm no regressions on [adjacent feature or flow]

---

### Testing

- [N] new tests in `test_file.py` covering [specific functions and scenarios]
- [Total] total tests passing
- Test approach: [real repos vs. mocks, integration points, what's not covered]

Closes #[issue-number]

Section-by-Section: What Each Part Does and Why It's There

Summary
The first thing a reviewer reads, and the section most PRs get wrong. "Adds git tools support" is a file-level description. "Introduces Git Tools as a fourth source type alongside Notion, Local Files, and Context Blocks - providing two workflows for assembling LLM context from a local repository" is a product-level description. The difference matters: a reviewer who doesn't know your architecture shouldn't have to reconstruct it from file names. Place the feature in context. Name what it's for and who it's for.

Key Design Decisions
The most underused section in most PR descriptions, and the one that pays the longest dividends. Future maintainers - including you, eight months from now - don't need a file list. They need to know why the base branch field is a dropdown instead of a text input (to prevent stale scan targets), why partial failures surface as a warning state instead of an error (so users can still insert partial content), why subprocess calls are wrapped with --no-pager (to prevent ANSI corruption in generated XML). These decisions look arbitrary in the code. A dedicated section makes them legible.

This is also the section an AI is most likely to fill with plausible-sounding nonsense if it didn't read your code. If your Key Design Decisions could apply to any feature in any codebase, the AI was guessing.

What's Changed, layered by architectural tier
A flat file list puts the cognitive burden on the reviewer. Grouping by layer - Models, Service, State, UI, Bug Fixes - lets different reviewers navigate to their section. A UI specialist doesn't need to parse service layer changes to find the component work. A backend engineer doesn't need to read the dialog code to find the async lifecycle integration. The grouping itself is a form of documentation.

Bug Fixes & Hardening as a separate section
Don't bury these in "What's Changed." Pulling them out into their own block does two things: it makes them visible to reviewers who might otherwise miss a "" → [] fallback fix buried in a bullet list, and it signals to non-technical stakeholders that the PR handles edge cases, not just the happy path. One-line bug fixes are worth calling out explicitly.

Test Plan
There is a large gap between "full pytest coverage" and a step-by-step verification guide. The test plan serves a different audience than the Testing section: it's for QA engineers, PMs, and reviewers doing manual verification. Each item should have a specific setup, a specific action, and a specific expected outcome. "Trigger a file-read failure and confirm the stack card shows an orange warning icon" is actionable. "Verify error handling works correctly" is not.

The test plan is also the section that is most directly dependent on knowing what your failure states look like - which requires reading the implementation.

Testing
Quantitative confidence: specific counts and coverage areas, not "full coverage." "41 new tests in test_git_service.py covering parse, pre-checks, scan, merge logic, and XML generation - 199 total passing - using real temp repos, no mocks" gives a reviewer immediate signal about test quality and approach. "Full pytest coverage" does not.

Why This Template Is Hard to Fill Without Rich Context

You can use this template right now with any AI assistant. It will fill every section. The question is whether it's filling them or extracting them.

When an AI has thin context - a diffstat, oneline commit subjects, maybe a file count - it generates plausible content based on what PRs like yours typically say. The result is PR descriptions that are coherent and wrong in ways that are hard to spot without reading the code.

Consider three specific sections:

Bug Fixes & Hardening requires the AI to have read the actual diffs. A diffstat tells you that content_reader_service.py had 12 lines changed. It doesn't tell you that those 12 lines fixed a BOM-aware encoding issue for UTF-16 LE/BE files, or that the previous code was hitting a Windows cp1252 default that caused garbled output. That detail lives in the implementation. An AI without it will either leave the section empty, write something generic, or - most dangerously - write something specific-sounding that isn't accurate.

Key Design Decisions requires understanding the architectural alternatives you considered and rejected. Why is commit_count a @computed_field instead of a stored value? Why does the base branch field disable immediately on path change? The answers exist in the code and in the reasoning that shaped it. An AI working from commit subjects has no access to that reasoning, so it will write decisions that sound plausible but describe different choices than the ones you actually made.

Test Plan requires knowing what your failure states look like and what the UI does in each one. "Trigger a file-read failure on a locked binary and confirm the stack card shows an orange warning icon with an enabled Insert button" is only writable if the AI read the warning state implementation in BaseStackCard. A diffstat says base_stack_card.py | 8 +++. That's not enough.

This isn't a flaw in the model. Sonnet 4.6 and Haiku 4.5 are both capable of writing excellent PR descriptions. The difference in our experiment wasn't model capability - it was whether the model had the content to extract from, or had to invent it.

Native Claude Code received a git diff --stat and a --oneline commit log. It produced a reasonable-looking PR description. Mesh provided 106K tokens of structured XML - full file content, unified diffs, commit metadata, and a structured commit log. The same prompt, the same model, completely different output.

How Mesh Generates That Context

Mesh's PR Brief workflow is straightforward. You point it at a local repository, select a base branch from an auto-populated dropdown (populated from get_git_branches(), not free-text input - this prevents stale scan targets), and get a checklist of changed files and commits. You deselect anything irrelevant - generated files, lock files, assets you don't want in the context - and Mesh generates a structured XML document containing full file content, unified diffs, and a structured commit log with per-commit metadata.

The output for the feature in this experiment was a 380KB XML file: 106,120 tokens, 379,281 characters. That's the document Claude web chat received when it wrote the PR descriptions in Conditions B and C.

The economics are worth pausing on. Condition B (Mesh + Sonnet 4.6) and Condition C (Mesh + Haiku 4.5) used identical context. Haiku 4.5 costs a fraction of Sonnet 4.6 per token - and it produced a PR that Gemini ranked ahead of native Sonnet 4.6 by a substantial margin. For teams watching LLM API costs, this is the significant finding: when context quality is high, you can step down to a cheaper model without sacrificing output quality. The context is doing most of the heavy lifting. Model tier matters less than you'd expect when the input is rich enough.

The conclusion from our experiment: the gap between a mediocre AI-generated PR and an excellent one is not primarily a model selection problem. It's a context assembly problem. Better context enables better outputs from cheaper models - that's an improvement in both quality and cost simultaneously.

Start With the Template

The template above is free. Use it today. It'll make your PR descriptions better regardless of what tool you use to fill it in - even if you fill it in manually.

What the template can't do is generate its own content. The Key Design Decisions section requires you (or your AI) to know why you built it the way you did. The Test Plan requires knowing what your failure states look like. The Bug Fixes section requires reading the actual diffs.

If you want an AI to fill this template accurately - not plausibly, accurately - it needs to see enough of your codebase to extract those answers rather than guess at them. That's the problem Mesh is built to solve.

HiveTrail Mesh is a standalone desktop application that acts as a just-in-time context engine, assembling structured XML from your local git repositories, Notion docs, and local files - and running a privacy scanner against the output before anything leaves your machine. Proprietary secrets, API keys, and internal paths get masked locally. Nothing is sent to a cloud service during context assembly.

Mesh is currently in beta. If you're a developer who writes PRs, generates commit messages, or uses LLMs for code work, join the beta here.

Claude Haiku 4.5 Outperformed Sonnet 4.6 on PR Writing - Context Was the Difference

Amit Ben-Ari — Tue, 14 Apr 2026 06:30:00 +0000

Originally published at hivetrail.com

I ran the same prompt on three different setups and had Gemini 3 Pro evaluate the results blind.

The setup using Claude Haiku 4.5 - Anthropic's smallest, cheapest model - produced a better pull request description than Claude Code running Sonnet 4.6.

Before I explain why, a transparency note: to keep this objective, I didn't grade these myself. I handed all three outputs to Gemini 3 Pro and asked it to evaluate them from the perspective of a senior developer and product manager. I agree entirely with its verdict.

The reason Haiku won has nothing to do with Haiku being a better model than Sonnet. It isn't. The reason is that Haiku was given something Sonnet wasn't: the actual evidence.

What Claude Code actually saw

When I asked Claude Code (Sonnet 4.6) to write a PR title and description for a completed feature branch, it ran two commands:

git log main..HEAD --oneline
git diff main...HEAD --stat

If you're not familiar with these flags: --oneline returns the abbreviated commit SHA and the subject line of each commit message. That's it - no body, no diff, no file content. --stat returns a summary of which files changed and how many lines were added or removed. Also no actual content.

The result was a 61K token session that cost $0.12 and completed in about 25 seconds. Three entries in the session log. Claude Code was fast, cheap, and working from a diffstat.

Now here's what the Haiku session saw: a 380KB structured XML file containing the full content of every changed file, unified diffs for every file, per-commit metadata with author and timestamp, a structured commit log, and uncommitted change warnings. 106,120 tokens. The same feature branch, assembled into a document designed to give a model everything it needs to reason about the change.

The difference isn't token count. It's that one setup asked the model to reconstruct what happened from a summary. The other gave it the primary source material.

What the models did with what they had

The gap in output quality shows up most clearly in three specific places.

Product context. The Claude Code output refers to "the Stack" without explanation. A developer already working in this codebase knows what that means. A reviewer who doesn't is left to guess. The Haiku output opens with: "Introduces Git Tools as a fourth source type in HiveTrail Mesh, alongside Notion, Local Files, and Context Blocks." Same fact, but the second version works for anyone reading the PR - now or six months from now in the commit history.

Workflow accuracy. The Claude Code output describes the Commit Brief feature as something that "scans a single commit." That's not quite right. Commit Brief scans uncommitted changes - staged, unstaged, and untracked files - to help you write a commit message for work you haven't committed yet. It's a subtle distinction, but it's exactly the kind of thing a model gets wrong when it's reasoning from a commit log rather than reading the actual implementation. Haiku got it right because it read the implementation. A diffstat will never surface that kind of semantic precision - and neither will any model reasoning from one.

Test evidence. The Claude Code output states: "Full pytest coverage in tests/services/test_git_service.py." The Haiku output states: "41 new tests in test_git_service.py covering parsing, pre-checks, scan, default checks, save-merging, and XML generation. 199 total tests passing." One is an assertion. The other is evidence. When Gemini evaluated these, it called the Claude Code version a "trust me" statement and said the Haiku version "brings the receipts." That's not a model intelligence gap. Claude Code just didn't see the test file.

The ranking

Gemini evaluated three outputs in total: Claude Code + Sonnet 4.6, the XML context + Haiku 4.5, and the XML context + Sonnet 4.6. The full ranking:

XML context + Sonnet 4.6 - strongest overall structure, best inclusion of design decisions
XML context + Haiku 4.5 - excellent markdown formatting, highly scannable, clear separation of bug fixes
Claude Code + Sonnet 4.6 - strong test plan section, but flat formatting and limited context throughout

Sonnet with the full context ranked first, which is expected. But Haiku with the full context ranked second - above Sonnet without it. The model tier mattered less than the context quality.

Here's where the economics get interesting. Haiku is dramatically cheaper than Sonnet. Combined with prompt caching absorbing most of the input cost, the Haiku generation cost effectively nothing. You get a senior-level PR description for pennies - not by paying for a smarter model, but by changing what you feed it.

Why this happens and what to do about it

Claude Code is a general-purpose coding assistant optimizing for speed, cost, and interactivity across hundreds of different tasks. Running a full branch diff for every PR request would be slow and expensive, and most of the time unnecessary. The context assembly tradeoff it makes is reasonable for a general tool.

The problem is that PR description quality is directly proportional to how much of the change the model can actually see. A diffstat tells you what changed. It doesn't tell you why, how the pieces fit together architecturally, what edge cases were handled, or what the test coverage actually covers. When you ask a model to write a PR from a diffstat, you're asking it to reconstruct the full picture from a thumbnail.

The fix is to separate context assembly from text generation. Stop letting your general-purpose AI coding assistant decide what context matters. Assemble the context yourself - or use a specialized tool to do it - and hand that to whatever model you want to use.

Concretely: a PR Brief for this feature branch was 380KB of structured XML. Feeding that into Claude web chat with Haiku 4.5 and a one-sentence prompt produced the second-ranked output in a three-way evaluation. The generation step cost effectively nothing. The quality came from the context.

This is what HiveTrail Mesh does for the context assembly step. It generates a PR Brief - structured XML containing file content, diffs, commit metadata, and checklist controls so you can include or exclude specific files and commits - from any local git repository. You take that output into any model you want: Claude web chat, Gemini, GPT-4o, or back into Claude Code via file reference. The generation step is your choice. The context assembly is handled.

A question worth asking about your last PR

Look at the commands your AI tool actually ran to generate your last pull request description. Most tools log this. If you wouldn't sit down and write a PR description manually from that output - if you'd want to open a few files, read through the diffs, check the test coverage - then you're asking the model to do something you wouldn't do yourself, with less information than you'd give yourself.

The model tier you're paying for doesn't change the quality of the raw material it's reasoning from. Context does.

If you're writing PRs for a project you care about, give the HiveTrail Mesh beta a look.

Why git log --oneline is Killing Your AI-Generated PRs

Amit Ben-Ari — Thu, 09 Apr 2026 06:30:00 +0000

Originally published at hivetrail.com

I build HiveTrail Mesh, a context assembly tool for LLMs. I use Claude Code, as well as multiple other LLMs, daily. And recentry, I asked it to write a pull request description for a feature I'd just finished - 27 commits across 32 files, several days of real work.

The output was competent. It had a title, a summary, a list of changed files. But reading it back, something felt off. I knew what was in those commits. The architectural decisions, the edge cases I'd hunted down, the bug fixes that weren't obvious from the file names. None of it was there. The PR read like someone had skimmed the index of a book and written a summary without reading the chapters.

So I did what any developer building a context tool probably would: I went looking for why.

What Claude Code Actually Sends to the Model

When you ask Claude Code to write a PR description, it runs two git commands:

git log main..HEAD --oneline
git diff main...HEAD --stat

The --oneline flag returns one line per commit: the abbreviated SHA hash and the subject line. That's it. No commit body, no co-author notes, no extended description you carefully wrote.

The --stat flag returns a diffstat - a summary of which files changed and how many lines were added or removed. Again, no actual content. No diffs, no file contents, no context about what changed or why.

So the model is working from something like:

7c22302 fix(git-tools): harden subprocess wrapper against local gitconfig pollution
6b5fd96 feat(git-tools): replace base branch input with auto-populated select
1246c27 fix(git-tools): resolve validation, state drift, and arch leaks
...

Plus a file change summary. For a 27-commit feature branch, that's the equivalent of asking someone to explain a film by reading the chapter titles on a DVD menu.

The model is good enough to produce something coherent from this - but coherent isn't the same as accurate, and it certainly isn't the same as useful.

What a Good PR Description Actually Needs

Before getting to the fix, it's worth being specific about what's missing.

A PR description serves two audiences with different needs. Developers reviewing the code want to know which layers of the codebase were touched, what edge cases were handled, and why certain decisions were made the way they were. Product managers and QA engineers want to understand user impact, workflow changes, and how to verify the feature works.

When the model only has commit subject lines and a file list, it can infer what changed from the file names. It cannot infer:

The why behind architectural decisions
Edge cases that were discovered and handled mid-implementation
Bug fixes that are buried in commits whose subject lines don't make them obvious
The distinction between new features and hardening work
Testing specifics - what's covered, how, and what a reviewer should manually verify

These are exactly the things that separate a PR description that's useful from one that's just technically accurate.

The Prompt Fix: Better Claude Code Instructions

Before reaching for a different tool, most developers will try the obvious thing first: write a better prompt. And it's a fair instinct - you can absolutely instruct Claude Code to run more thorough git commands, request full diff content, and follow a specific PR structure. Something like "run git log with full commit bodies, fetch the complete diff for each changed file, then write a PR description organized by architectural layer with a key design decisions section" will produce meaningfully better output than the default.

But there are real costs to this approach worth understanding. The most immediate is token burn - asking Claude Code to fetch full diffs and structured commit logs for a large branch will consume significantly more context than the default --oneline summary approach, which adds up quickly if you're on a metered plan. The less obvious problem is consistency. Claude Code operates within a conversation context that degrades over a long session: early instructions get compressed, memory gets summarised, and the careful prompt you wrote at the start of a session may not be fully honoured three hours later when you finally hit merge. You're also now maintaining a prompt, not just a workflow.

For this test, I deliberately used the simplest possible prompt - "based on the staged changes / recent commits, write me a PR title and description" - across all three methods. The goal was to measure what each approach produces from its own capabilities, not what it produces when coached. Prompt engineering can close some of the gap, but it can't change what data the model is actually working from.

The Manual Fix: Give the Model the Full Context

The underlying problem is simple: the model is summarising a summary. The fix is to give it the actual data.

Here's what that looks like manually. Before asking your LLM to write the PR, run this yourself and include the output in your prompt:

Full commit log with bodies:

git log main..HEAD --pretty=format:"%H%n%an%n%ad%n%s%n%b%n---"

Actual file diffs:

git diff main...HEAD

Or per-commit diffs if the full diff is too large:

git log main..HEAD --patch

A word of warning on token volume: for a large feature branch, git diff main...HEAD on a 27-commit, 32-file branch produced around 106,000 tokens in my test - roughly 379KB of XML-structured content. That's well beyond what you'd paste into a chat window, and it approaches or exceeds the context limits of many models.

This is where you need to be selective. For smaller branches - a few commits, a handful of files - pasting the full diff directly works fine. For larger branches, you have options:

Feed it to a model with a large context window (Gemini Pro handles this comfortably)
Trim to the files and commits most relevant to the PR's purpose
Use the structured approach described in the next section

Either way, the quality difference is immediate. When I ran the same prompt - "based on the staged changes / recent commits, write me a PR title and description" - with the full structured context versus Claude Code's summary approach, the outputs were not comparable. The full-context version knew about BOM-aware file encoding, NiceGUI deleted-slot errors, the decision to use @computed_field to eliminate state drift. The summary version knew that git_service.py was a new file.

What the Results Actually Showed

I ran three versions of the same PR description for the same feature branch:

Version 1 - Claude Code (Sonnet 4.6)
Working from --oneline and --stat only. Produced a competent, file-oriented description with a good "Key design decisions" section. Flat formatting, no inline code styling, read like a wall of text.

Version 2 - Claude web chat (Sonnet 4.6) + full structured context
The same model, but fed the complete PR Brief XML (106k tokens of structured diffs, commit metadata, and file content). Layered by architectural section, included product context, named specific edge cases and why they were handled the way they were, referenced exact test counts.

Version 3 - Claude web chat (Haiku 4.5) + full structured context
The cheapest Claude model, same full context. Produced a description nearly as strong as Version 2, with better structured sections for testing guidance and explicit "Key Design Decisions."

I asked Gemini 3 Pro to evaluate all three as a neutral third party, framed as a senior developer and product manager. The ranking: Version 2 first, Version 3 second, Version 1 third.

The conclusion that stood out: Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context. The model tier mattered less than the context quality.

Gemini's summary of the gap was pointed: Version 1 "forces the reviewer to do the heavy lifting." Versions 2 and 3 "treat the PR description as living documentation."

Google Gemini 3 Pro's comparison table:

PR Text	Primary Strength	Primary Weakness	Best Audience
PR 1 (Claude Code)	Exceptional, step-by-step test plan.	Flat formatting makes it hard to scan. Lacks inline code styling.	QA and Product Managers.
PR 2 (Sonnet + Mesh)	Strongest structural hierarchy and includes "Key Design Decisions."	Testing section is a status report, not an actionable guide.	Senior Developers and Architects.
PR 3 (Haiku + Mesh)	Excellent markdown formatting (inline backticks) and explicit bug-fix callouts.	Lacks the deeper architectural "why" found in PR 2.	General Code Reviewers.

Version 1: Claude Code with Sonnet 4.6:

Title:
feat(#282): add Git Tools source type (Commit Brief + PR Brief)

---

Description:

Summary
Introduces Git Tools as a new source type in the Stack, enabling users to generate token-optimized LLM context from local git repositories.

Two item types are supported:

- Commit Brief - scans a single commit and produces a structured XML summary of changed files and diff content
- PR Brief - scans a branch against a base branch, producing a structured XML summary with per-commit metadata and a selectable file/commit
checklist

What's included
- services/git_service.py - new service with full subprocess wrapper...
- models/stack/git_tools.py - Pydantic models for CommitBriefStackItem...
- core/state/sources/git_tools.py + core/state/stack/handlers/git_tools.py...
- components/common/forms/git_tools/ - CommitBriefForm, PRBriefForm...
- components/sections/stack/dialogs/edit_commit_brief.py + edit_pr_brief.py - edit dialogs with...
- components/sections/source/panels/git_tools/ - source panel with auto-populated base branch select ...
- Full pytest coverage in tests/services/test_git_service.py

Key design decisions
- Base branch is auto-populated from the repo rather than free-text input...
- Partial generation failures surface as a warning state...
- Commit log is stored as structured...
- subprocess calls are wrapped to suppress local...
- Follows the existing async stack lifecycle...

Version 2: Claude Sonnet 4.6 chat with Mesh:

**feat(#282): Add Git Tools source type (Commit Brief + PR Brief)**

---

## Summary

Introduces Git Tools as a fourth source type in HiveTrail Mesh, alongside Notion, Local Files, and Context Blocks. Provides two workflows for assembling LLM context from a git repository:

- **Commit Brief** - scans uncommitted changes (staged/unstaged/untracked), assembles file content + unified diffs against HEAD for commit message generation
- **PR Brief** - scans branch-wide changes vs a base branch, includes file content, diffs, structured commit log, and uncommitted change warnings for PR description generation

---

## What's Changed

**Models & Architecture**

- New `GitChangedFile`, `GitCommitEntry`, ... models with Pydantic...
- `PRBriefStackItem.commit_count` as a `@computed_field` to eliminate...
- `warning` status added to `StackItemBaseModel`...
- Git item types registered in `StackItemUnion`...

**Service Layer**

- New `git_service.py` - stateless subprocess service covering...
- Hardened `_run_git` with... to prevent ANSI corruption and pager hangs
- BOM-aware file reading (UTF-16 LE/BE) in...
- `GitResult` dataclass for structured content...
- `generate_git_content` produces structured... XML nodes with ... CDATA string

**State & Stack**

- `GitToolsSource` bridge in... wrapping all git service calls
- `GitToolsHandler` integrated into the standard async stack lifecycle...
- `StackManager.update_git_item`, ... extended for git types
- Stack persistence (`StackPersistence`) handles ... in save/load
- Full facade coverage on...

**UI - Source Panel**

- `GitToolsPanel` with ..., following ... conventions
- `CommitBriefForm` and ... forms with path selector, scan button...
- `FileChecklist` and ... reusable components with select/deselect all, .... with local timezone formatting
- Base Branch field auto-populated from ... disables immediately on path change to prevent stale state
- Uncommitted changes warning banner in...
- Files/Commits sections hidden until...

**UI - Stack**

- `GitToolsStackCard` with type-specific... confirmed Reload action
- `EditCommitBriefDialog` and ... with Settings/Content tabs, ... banner for partial failures
- Insert action enabled for ...
- Edit routing wired in ...

**Bug Fixes & Hardening**

- `notify` called before ... to prevent NiceGUI ... errors
- Stale scan ... on path change
- `commit_log` fallback ... prevent Pydantic ...
- Output manager processing logs ...

**Tests**

- 41 new tests in ... covering parse, pre-checks, scan, ... and PR XML generation (199 total passing)

Version 3: Claude Haiku 4.5 chat with Mesh:

**feat(#282): complete Git Tools integration-forms, dialogs, stack cards, and content generation**

---

# PR Description

## Overview

This PR completes the Git Tools feature (Commit Brief and PR Brief) end-to-end, integrating UI components, edit dialogs, stack management, and async content generation. The feature is now wired into the source panel, stack lifecycle, and preset persistence, with full test coverage for the git_service module.

## What's Included

### Forms & Components

- **CommitBriefForm** ... forms with path selection, branch auto-population, file/commit checklists, and rescan workflows
- **FileChecklist** ...Reusable list components with ... dialogs
- **GitToolsStackCard**: Stack ... confirmation
- **EditCommitBriefDialog** ... dialogs with Settings/Content tabs, ... display

### State & Architecture

- **GitToolsSource** (`core/.../git_tools.py`): Domain bridge wrapping ... asyncio threading
- **GitToolsHandler** (`core/.../git_tools.py`): Pre-generated ... (no JIT reads)
- **git_service.py**: Stateless subprocess service with ... merge logic
- **Stack persistence** updated to serialize/deserialize ...
- **StateManager facade** ... git-specific methods ...

### UI Integration

- Git Tools added to source panel...
- Stack card factory routes ...
- Stack component instantiates both ...
- BaseStackCard updated to handle ... enabled Insert button
- Output manager log messages ...

### Content Generation

- **generate_git_content()**: Async XML generator with BOM-aware file reading...
- Returns **GitResult** dataclass with ...
- Partial success mode: items transition to `warning` state ...

### Data Models

- **GitChangedFile**, ...
- **ScanResult**, ... dataclasses
- FileStatus ... enums
- Pydantic `@computed_field` for ...

### Testing

- 41 new git_service tests covering:
    - _parse_name_status with ...
    - Pre-checks (is_repo, ...)
    - scan_commit_changes with ... states
    - scan_pr_changes with ... flag
    - apply_default_checks (...)
    - merge_scan_with_saved (preserve user ...)
    - parse_commit_log and ...
    - generate_git_content integration with ...

## Key Design Decisions

1. **Pre-generated content model**: Git items store content at ...
2. **Graceful partial success**: When some files fail ... insert partial context.
3. **Clean Architecture enforcement**: GitToolsSource wraps ... No direct service imports.
4. **Concurrent diff fetching**: Diffs fetched in parallel with ... avoid resource exhaustion.
5. **Merge logic for rescan**: When users rescan, ... across repository state changes.
6. **BOM-aware encoding detection**: UTF-16 files (with BOM) and UTF-8 with BOM are decoded correctly; Windows cp1252 default avoided.

## Testing Guidance

- All 199 tests pass (including 41 new ... tests)
- git_service tests use ...
- Integration tested via forms/dialogs in the app
- Warning state rendering tested in ...

---

**Closes #282**

The Easier Path: Structured Context Assembly

Constructing that full context manually - running the right git commands, handling encoding issues, structuring the output so the model can navigate the 106k-token PR Brief without losing the thread - is non-trivial. For a one-off experiment it's fine. As a repeatable workflow before every PR, it's friction most developers won't sustain.

This is exactly the problem I built HiveTrail Mesh to solve. The PR Brief source type runs the git commands, structures the output as navigable XML with per-commit nodes, handles BOM-aware encoding, and lets you select which files and commits to include before the context gets assembled. The output goes to your clipboard, ready to paste into whichever LLM you want to use.

If you want to try it, Mesh is in limited beta and free during the beta period. But honestly, if you want to test the manual approach first on your next branch, the git commands above will get you there.

Quick Reference: The Git Commands That Actually Feed the Model

If you want to try the full-context approach on your next branch before merging, these are the three commands worth knowing:

Full commit log with bodies (not just subject lines):


git log main..HEAD --pretty=format:"%H%n%s%n%b%n---"

Complete diff across the branch:


git diff main...HEAD

Per-commit diffs with context (useful for smaller branches):


git log main..HEAD --patch

A practical note on scope: for a large feature branch, the full diff will be large - potentially 100k+ tokens. Before feeding it to a model, skim the file list and drop binaries, generated files, and lockfile changes. What remains is usually 20-40% smaller and significantly more useful to the model.

If you'd rather not run and filter these manually every time, this is the workflow HiveTrail Mesh automates - structured XML output, file selection, token count before you export.

The Broader Point

Claude Code isn't doing something wrong. It's making a reasonable tradeoff - fast, low-cost, good enough for most cases. The --oneline approach keeps the token cost down and the response time fast. For a quick commit message or a small fix, it's fine.

But for complex feature branches where the PR description is going to be read by your team, reviewed by senior engineers, and live in your repository history for years - it's worth spending an extra 30 seconds to give the model the full picture.

The quality of your AI output is constrained by the quality of the context you provide. For PR descriptions, the full diff is the context. Everything else is a summary of a summary.

Amit builds HiveTrail Mesh, a context assembly tool for developers working with LLMs. If you found this useful, join our beta.

We Ran the Same Experiment Twice. Different Feature, Different Models, Same Winner.

Amit Ben-Ari — Tue, 07 Apr 2026 07:00:00 +0000

Originally published at hivetrail.com

How two independent PR generation benchmarks pointed to the same conclusion about context quality - and why your model choice matters less than you think.

Here's a finding that should change how you think about AI tooling: in two independent experiments using real production code, a "budget" model fed rich context consistently outperformed flagship models operating on shallow git summaries. The budget model didn't just win. It won by a landslide, unanimously, against models that cost significantly more per token.

This isn't a post about which model is best. It's about why the question itself might be the wrong one to ask.

The setup

HiveTrail Mesh is a context assembly tool. One of its core features is PR Brief - it scans a git branch against a base branch, reads every changed file in full, assembles all diffs and commit metadata into a structured XML document, and hands it to an LLM. The output is typically a 100K–380K token document containing everything an LLM needs to write a comprehensive PR description.

We used this workflow as the basis for both experiments. The prompt in each case was deliberately simple:

Based on the staged changes / recent commits, write me a PR title and description.

No elaborate prompting. No chain-of-thought instructions. Just the raw context and a task.

Experiment 1: The budget model vs. the flagship agent

The first experiment ran on the Git Tools feature - a substantial new addition to HiveTrail Mesh covering 27 commits across 32 files, with async XML generation, state management, UI components, and 41 new tests.

We ran three conditions:

Condition A - Claude Code (Sonnet 4.6), native git context. Claude Code ran git log main..HEAD --oneline and git diff main...HEAD --stat - the standard abbreviated approach. Generated in about 25 seconds.

Condition B - Haiku 4.5, Mesh context. Mesh assembled a 380KB XML file (~106K tokens) covering every changed file, diff, and commit. Haiku 4.5 received this in full.

Condition C - Sonnet 4.6, Mesh context. Same Mesh XML, same prompt, given to Sonnet 4.6.

Gemini 3 Pro evaluated all three as a senior software developer and product manager.

The verdict was unambiguous. The Mesh-fed PRs were called "significantly stronger" across every dimension: product context, workflow clarity, architectural structure, technical depth, and testing visibility. The Claude Code version was characterised as reading like "a rough draft or a quick brain dump before hitting Create Pull Request."

This wasn't a knock on Sonnet 4.6. It was a knock on what Sonnet 4.6 was given to work with.

Claude Code - like most agentic coding tools - acts like a developer who skims the commit titles and says "looks good to me." It reads summaries: which files changed, roughly how many lines, what the commit subjects say. HiveTrail Mesh acts like the reviewer who actually pulls down the branch and reads every single file. The difference in output reflects that difference in reading.

Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context. A cheaper, faster model given the complete picture wrote a better PR than a more capable model working from a summary.

But here's the part that should really give you pause: Haiku 4.5 didn't just beat Sonnet 4.6's native shallow context - it beat Sonnet 4.6 when both were fed the exact same Mesh XML. The budget model outperformed the flagship on a level playing field.

Final ranking:

Haiku 4.5 + Mesh - best overall structure, key design decisions, quantified test coverage
Sonnet 4.6 + Mesh - excellent markdown, clear bug-fix callouts, strong architecture section
Sonnet 4.6 native (Claude Code) - good test plan, but flat structure and shallow context throughout

Experiment 2: Can Gemini CLI beat its own model family?

Several months later, we ran a second experiment on a completely different feature - the GitHub API integration for HiveTrail Mesh, covering 24 files and 22 commits.

The framing this time was sharper. The question wasn't "which model is best" - it was "can an agentic tool using native git context compete with the same model family when context is properly assembled?"

Gemini CLI was the subject under test. It has its own git tooling, can run shell commands, and is built by the same team behind the models it would be competing against. If any tool could close the context gap through smart native tool use, Gemini CLI was the candidate.

We set it against seven Gemini models - ranging from Gemini 3 Fast to Gemini 3.1 Pro with high thinking - all fed via HiveTrail Mesh. We also added Haiku 4.5 via Mesh as an external reference point, since it had won Experiment 1.

Three independent judges evaluated all nine PR texts blind, without knowing which model produced which:

Google Gemini 3 Pro
Anthropic Claude Opus 4.6
OpenAI ChatGPT

Scoring: 9 points for 1st place, 1 point for last. Maximum possible: 27.

Rank	Model	Gemini Pro	Opus 4.6	ChatGPT	Total
1	Haiku 4.5 + Mesh	9	9	9	27
2	Gemini Flash 3 preview (Thinking Low) + Mesh	8	7	8	23
3	Gemini 3 Fast + Mesh	7	6	4	17
4	Gemini 3.1 Pro preview (Thinking High) + Mesh	2	8	6	16
Tied 5	ChatGPT + Mesh	6	1	7	14
Tied 5	Gemini Flash 3 preview (Thinking High) + Mesh	5	4	5	14
7	Gemini 3.1 Flash Light preview (Thinking High) + Mesh	3	5	3	11
8	Gemini 3 Pro + Mesh	4	3	2	9
9	Gemini CLI (native context)	1	2	1	4

Two results stand out.

First, Haiku 4.5 received a perfect score - 9 from every judge, unanimously, with a 4-point gap over second place. All three judges independently placed it first for the same reasons: dedicated test coverage sections, specific method names and API behaviors called out by name, explicit reasoning behind architectural decisions, and reviewer notes that no other entry included. Opus 4.6 called it "the most complete and production-grade PR description" of the nine.

Second, and more telling: Gemini CLI finished last. Not second to last - last, with 4 points, behind every Mesh-fed entry including smaller, cheaper Gemini variants. Its own model family, given better context by a different tool, beat it at every position in the table.

The reason is the same as Experiment 1. Gemini CLI ran git log -n 10 --stat and a few shell commands. Fast, low-cost, reasonable for most tasks - but it produced the same shallow picture. The resulting PR covered the surface of the changes without the architectural reasoning, edge case handling, or quantified test results that the Mesh-fed models could draw on because they had actually read the code.

It's worth noting that the Mesh PR Brief isn't just raw file content dumped into a prompt. It's structured XML - commits organized chronologically, files grouped by change type, diffs nested within their commit context. That structure helps LLMs navigate 100K+ token documents more efficiently than a flat wall of text would. So "full context" here means both more information and better-organized information. Both matter.

After the main competition, we ran Claude Code on the same feature - not as a competitor, but as a consistency check. Same pattern as Experiment 1: a short, surface-level PR based on abbreviated git output. The shallow-context behavior isn't specific to any one tool or vendor. It's structural - it's what happens when speed is optimized over depth of reading.

The pattern

Context quality sets the ceiling. Model choice determines where within that ceiling you land.

Run both experiments side by side and the picture is hard to argue with.

Experiment 1 tested context delivery method with the same model family. Mesh-assembled context won over native git context regardless of model tier - and the budget model beat the flagship even on a level playing field.

Experiment 2 tested whether a sophisticated agentic tool could close that gap through smart native tool use. It couldn't - and it finished last against its own model family.

Different features. Different PR Briefs. Different competitive sets. Different judges. The only constant was the relationship between context quality and output quality.

When an AI tool reads a few lines of git log to write a PR, it isn't producing a poor result because it's a bad model. It's producing a poor result because it has been given a poor picture of what changed and why. Give any capable model the full picture - every file, every diff, every commit, structured and organized - and the output improves dramatically.

The implication runs both ways. A "budget" model with rich context outperforms a flagship with shallow context. And a flagship with shallow context produces flagship-priced shallow output.

What this means for your workflow

If you're using AI tools for PR descriptions today, the most impactful change probably isn't switching models.

Agentic coding tools are optimized for speed and low token cost - they read summaries, not full file content. That's the right tradeoff for interactive coding tasks, where you want fast feedback and low latency. For a PR covering 20+ files and weeks of work, summary-level context produces summary-level output.

The alternative is deliberate context assembly before you prompt: read every changed file in full, preserve the diff structure, organize commits chronologically, package everything in a format the LLM can navigate. You could build a script to do this - pull every changed file, run the diffs, format it into structured XML. It's achievable engineering. It's also a few days of work to do properly, and more to maintain as your codebase evolves.

That's exactly why we built HiveTrail Mesh's PR Brief. Point it at a branch and within seconds it has scanned every changed file, assembled the diffs, and produced a structured 100,000+ token XML document - faster than most agentic tools complete their own context gathering. The remaining time in the workflow is just the LLM responding, which varies by model (a few seconds for smaller models, up to ~30 seconds for the larger ones). The total end-to-end time is competitive with agentic coding tools - with dramatically better output to show for it. Use any LLM you prefer: Claude, Gemini, ChatGPT, whatever fits your workflow. The model choice, as these experiments suggest, matters less than you might expect.

For teams where PRs serve as living documentation, get reviewed by multiple people, or feed downstream into release notes - the tradeoff is straightforward. For a solo developer pushing a two-file fix, probably not worth it.

What we didn't test

In the spirit of intellectual honesty:

Prompt engineering. Both experiments used a minimal prompt. A carefully crafted prompt might narrow the gap somewhat - though we'd expect the ceiling to remain lower without full file content. It's also worth noting that the Mesh PR Brief's structured XML format is itself a form of context organization: commits are sequenced chronologically, files are grouped by change type, and diffs are nested within their commit context. That structure likely helps LLMs parse large documents more efficiently than flat CLI output would.

Other writing tasks. Both experiments focused on PR descriptions. Commit messages, technical documentation, and code review summaries likely follow the same pattern, but we haven't tested them.

Newer model releases. These experiments used models current at the time of testing. Rankings will shift as new models release - though the underlying dynamic (context quality determines ceiling) should hold.

Cost efficiency. Haiku 4.5 is significantly cheaper per token than most of the models it beat. The cost-per-quality-point story is compelling but token pricing changes frequently enough that any number we published here would be stale quickly.

Closing thought

The most useful takeaway from two experiments isn't a model recommendation. It's a workflow question worth asking before you prompt: what does the model actually see?

If the answer is "a handful of commit subject lines and a diffstat," you've already constrained the output - regardless of which model is on the other end.

The models are good enough. The context is usually the bottleneck.

HiveTrail Mesh is a context assembly tool for developers and product teams. PR Brief assembles a token-optimized, structured XML document from your git branch - ready to paste into any LLM. Try the beta →

DEV Community: Amit Ben-Ari

Agentic Context Engineering (ACE) Explained: Why Your AI Agents Need a Playbook, Not a Prompt

The Problem ACE Is Trying to Solve

What ACE Actually Is

Why "Incremental Deltas" Matter

The Playbook, Specifically

What the Paper Proves

ACE vs. The Other Things It Looks Like

The Honest Limitations

What This Actually Means If You're Not a Research Team

For Application Developers

For Engineering Teams Using Claude Code, Cursor, or Copilot

For PMs and Solopreneurs

Where ACE Fits in 2026's Context Engineering Stack

FAQ

About the Author

Claude Code Context Window Rot: Why Sessions Get Dumber (And How to Fix It)

What's actually happening: context rot explained

The lost-in-the-middle effect

The numbers are worse than you think

How to spot context rot in your sessions before it gets critical

The standard fixes - and why they only solve half the problem

The structural fix: treat pre-session context as an engineering problem

What a token-optimised context block actually looks like

A practical pre-session context checklist

Raw inputs vs. structured inputs: what the token difference actually looks like

Real-world evidence: the cost of ignoring input quality

Frequently asked questions

The upstream fix

How to Build an LLM Context Stack: A Practical Playbook for Developers (2026)

What is a context stack (and why you need one)

The science behind why context quality beats context quantity

The 5 sources most developers already have (and underuse)

1. Notion pages and internal documentation

2. Git diffs and commit history

3. Local files: specs, READMEs, configs, and schemas

4. Previous LLM output and conversation checkpoints

5. Reusable context blocks

Token budgeting: how to decide what stays and what gets cut

Structuring for the model: why XML context outperforms plain text

The privacy layer: what to scrub before your context leaves your machine

Before/after: the same task, two ways

Without a context stack (the typical approach)

With a structured context stack

Frequently asked questions

What to read next

Context Engineering vs Prompt Engineering: What the Shift Means for Developers

What Prompt Engineering Actually Is (and Where It Earns Its Place)

Where Prompt Engineering Breaks Down

What Context Engineering Actually Changes

1. What the model knows

2. What the model remembers across sessions

3. How context is structured

4. What gets left out

When to Use Context Engineering: A Practical Guide

Context Engineering Without RAG

The Real-World Impact: What Changes When You Make the Shift

Frequently Asked Questions

The Takeaway

Context Engineering for Developers: A Practical Guide (2026)

What context engineering actually is

What it is not

The four problems that poor context engineering creates

1. Context rot

2. Context poverty

3. Context contamination

4. Context drift

The proof: context quality beats model tier

Five principles for practical context engineering

Principle 1 - Relevant is better than large

Principle 2 - Primary sources over summaries

Principle 3 - Structure helps the model reason

Principle 4 - Just-in-time, not always-on

Principle 5 - Privacy is a step, not an afterthought

A practical workflow: building your context stack

Context engineering by role

The developer

The product manager

The solo builder / LLM power user

What this looks like with a tool built for it