<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Ben-Ari</title>
    <description>The latest articles on DEV Community by Amit Ben-Ari (@amitba).</description>
    <link>https://dev.to/amitba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827860%2F69b55ac3-d46d-4cf2-b76c-4e35fca060c1.jpg</url>
      <title>DEV Community: Amit Ben-Ari</title>
      <link>https://dev.to/amitba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amitba"/>
    <language>en</language>
    <item>
      <title>Context Engineering vs Prompt Engineering: What the Shift Means for Developers</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:19:08 +0000</pubDate>
      <link>https://dev.to/amitba/context-engineering-vs-prompt-engineering-what-the-shift-means-for-developers-32db</link>
      <guid>https://dev.to/amitba/context-engineering-vs-prompt-engineering-what-the-shift-means-for-developers-32db</guid>
      <description>&lt;p&gt;You've been here before.&lt;/p&gt;

&lt;p&gt;You're asking Claude or ChatGPT to help with something you've explained before - your project's architecture, your team's conventions, the fact that you use tRPC and not REST for internal APIs. The model gives you a technically correct answer that's completely wrong for your codebase.&lt;/p&gt;

&lt;p&gt;You tweak the prompt. You add more detail. You try again. The result is marginally better. You spend ten minutes rewriting the request when the real problem was never the request at all.&lt;/p&gt;

&lt;p&gt;This is the moment where prompt engineering reaches its limit - and where context engineering begins. The real issue isn't how you phrased the request; it's the information environment the model was working with when it answered.&lt;/p&gt;

&lt;p&gt;Almost every article written on this topic frames it as a massive enterprise infrastructure problem requiring specialized platform teams. That framing is accurate for certain use cases. But it leaves out the vast majority of developers and product managers who aren't building multi-agent platforms. They're just trying to get reliable, high-quality output from AI tools they use every single day.&lt;/p&gt;

&lt;p&gt;This article is written for them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Prompt Engineering Actually Is (and Where It Earns Its Place)
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is the practice of crafting the input you send to a language model to get a better response. It encompasses techniques like role assignment ("you are a senior TypeScript engineer"), chain-of-thought reasoning ("think step by step"), few-shot examples, output format constraints, and negative prompting ("do not use Redux").&lt;/p&gt;

&lt;p&gt;It's real, it works, and it's worth learning. For bounded, well-defined tasks where the model already has everything it needs to do the job, prompt engineering can make a significant difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classifying a support ticket&lt;/li&gt;
&lt;li&gt;Generating a SQL query from a description&lt;/li&gt;
&lt;li&gt;Summarizing a meeting transcript&lt;/li&gt;
&lt;li&gt;Writing a unit test for a function you paste in&lt;/li&gt;
&lt;li&gt;Drafting a short email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, the model has the domain knowledge, the task is self-contained, and the challenge is communication - getting the model to understand exactly what you want. That's where prompt engineering shines.&lt;/p&gt;
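
&lt;p&gt;To make that concrete, here's what those techniques look like stacked together for the ticket-classification case. The categories and wording below are illustrative, not a recommended taxonomy:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a support triage assistant for a B2B SaaS product.

Classify the ticket below into exactly one category:
billing, bug, feature_request, or account_access.
Respond with the category name only.

Example
Ticket: "I was charged twice for the March invoice."
Category: billing

Ticket: "The export button does nothing when I click it on Safari."
Category:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Role, output constraint, one example. For a task this bounded, that's usually all the engineering the prompt needs.&lt;/p&gt;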

&lt;p&gt;The problem is that this represents a surprisingly small fraction of what developers and PMs actually use AI for day-to-day.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Prompt Engineering Breaks Down
&lt;/h2&gt;

&lt;p&gt;Here's what breaks prompt engineering: the moment the model needs to know things it doesn't have access to.&lt;/p&gt;

&lt;p&gt;Not things it doesn't know globally - models like Claude and GPT-4o are extraordinarily knowledgeable. But things it doesn't know &lt;em&gt;about your specific situation&lt;/em&gt;: your codebase, your architectural decisions, your Notion docs, your team conventions, the business context behind this sprint.&lt;/p&gt;

&lt;p&gt;Sound familiar? These are the exact failure modes developers hit constantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architecture Amnesia problem.&lt;/strong&gt; You ask Claude to help refactor a module. It produces clean code using patterns your team deliberately moved away from six months ago. Nothing in your prompt told it you'd made that decision - because you shouldn't have to explain your entire tech history in every message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Convention Gap.&lt;/strong&gt; You ask for a new component. It uses Styled Components. Your project uses Tailwind. You said nothing about this because you assumed it was obvious. It isn't. The model doesn't know what it can't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Session Reset.&lt;/strong&gt; In session 1, you built up a shared understanding with the model - you explained your domain, your data model, your naming conventions. In session 2, all of that is gone. The model is a blank slate. You start over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Prompt Spiral.&lt;/strong&gt; You spend more time crafting and refining the prompt than you would have spent just doing the task yourself. You've accidentally made AI slower than not using AI.&lt;/p&gt;

&lt;p&gt;These aren't failures of instruction quality. They're failures of information availability. As &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html" rel="noopener noreferrer"&gt;Thoughtworks engineer Bharani Subramaniam put it&lt;/a&gt;: &lt;strong&gt;"Context engineering is curating what the model sees so that you get a better result."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No amount of prompt refinement fixes a model that simply never saw the relevant information.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Context Engineering Actually Changes
&lt;/h2&gt;

&lt;p&gt;The shift from prompt engineering to context engineering is a shift in what you're optimizing.&lt;/p&gt;

&lt;p&gt;Prompt engineering optimizes the &lt;em&gt;instruction&lt;/em&gt;. Context engineering optimizes the &lt;em&gt;information environment&lt;/em&gt; - what the model knows, remembers, has access to, and what it doesn't need to waste tokens on.&lt;/p&gt;

&lt;p&gt;Andrej Karpathy, a co-founder of OpenAI, put it precisely when he &lt;a href="https://x.com/karpathy/status/1937902205765607626" rel="noopener noreferrer"&gt;posted on X in June 2025&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;His framing matters: it's both an art and a science. Science because there are principles - right information, right structure, right size. Art because judgment is always involved in deciding what belongs and what doesn't.&lt;/p&gt;

&lt;p&gt;For an individual developer or PM, context engineering boils down to four practical levers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;What the model knows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is your project conventions, architectural decisions, domain knowledge, business rules - everything a well-onboarded teammate would know on day three that you'd never want to explain again.&lt;/p&gt;

&lt;p&gt;A practical example: developer Thomas Landgraf &lt;a href="https://thomaslandgraf.substack.com/p/context-engineering-for-claude-code" rel="noopener noreferrer"&gt;documented his approach&lt;/a&gt; to creating deep technical knowledge documents for Claude Code. Working on a complex IoT platform (Eclipse Ditto), he created a structured &lt;code&gt;ditto-advanced-knowledge.md&lt;/code&gt; covering specialized policies, API patterns, and edge cases that the model's general training didn't cover. His reported outcome: &lt;em&gt;"Features that previously took days of trial-and-error now ship in hours. The AI suggests optimizations I wouldn't have thought of."&lt;/em&gt; The model didn't get smarter - it got better information.&lt;/p&gt;
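
&lt;p&gt;Landgraf's actual file is specific to Eclipse Ditto, but the shape travels well. A sketch of what such a knowledge document can contain - every domain detail below is invented for illustration:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# payments-advanced-knowledge.md  (hypothetical example)

## Domain concepts the model's training won't cover
- "Settlement batch": our internal grouping of captured charges, closed nightly at 02:00 UTC

## API patterns we rely on
- All state changes go through command endpoints; resources are never PATCHed directly

## Edge cases that bite
- Refunds against a closed settlement batch create a compensating entry; they never mutate the batch

## Known pitfalls
- The sandbox gateway silently truncates metadata fields longer than 500 characters
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;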

&lt;h3&gt;
  
  
  2. &lt;strong&gt;What the model remembers across sessions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stateless models can't retain anything across conversations. Context engineering gives them a memory layer - not through technical magic, but through deliberately maintained files that get loaded at the start of each session. This is where LLM context quality starts to compound: consistent context in means consistent quality out.&lt;/p&gt;

&lt;p&gt;The most direct implementation for Claude Code users is a &lt;code&gt;CLAUDE.md&lt;/code&gt; file: a structured document at the root of your project that the model reads automatically. &lt;a href="https://www.claudedirectory.org/blog/context-engineering-claude-code" rel="noopener noreferrer"&gt;Claude Directory's guide&lt;/a&gt; captures the progression developers experience when they invest in this properly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Week 1: You write a basic CLAUDE.md and start structuring your prompts better. Claude's output improves noticeably. Month 1: Your CLAUDE.md is refined from real sessions. Claude feels like it knows your project. Month 3: Claude produces code that passes your review on the first try most of the time."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The same principle applies beyond Claude Code - any tool that accepts a system prompt or persistent instructions benefits from this approach.&lt;/p&gt;
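
&lt;p&gt;There's no single canonical format for these files. As a rough sketch - every project detail here is invented, swap in your own - a starting point might look like:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CLAUDE.md

## Stack
- React + TypeScript (strict mode), Tailwind, tRPC for internal APIs - no REST internally

## Conventions
- All user-facing strings go through i18n; no hardcoded text in components
- Server state via React Query; we do not use Redux

## Commands
- `pnpm test` runs unit tests; `pnpm typecheck` must pass before any commit

## Do not
- Introduce Styled Components; this project standardized on Tailwind
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The point isn't completeness. It's capturing the decisions you're tired of repeating.&lt;/p&gt;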

&lt;h3&gt;
  
  
  3. &lt;strong&gt;How context is structured&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;More context is not better context. This is one of the most common misunderstandings developers have when they first encounter this idea.&lt;/p&gt;

&lt;p&gt;Research has consistently shown that model performance degrades as context gets noisier. A &lt;a href="https://www.firecrawl.dev/blog/context-engineering" rel="noopener noreferrer"&gt;Chroma Research study published in July 2025&lt;/a&gt; testing 18 LLMs including Claude 4, GPT-4.1, and Gemini 2.5 found that &lt;em&gt;"models do not use their context uniformly; performance grows increasingly unreliable as input length grows"&lt;/em&gt; - even on simple retrieval tasks. This failure mode has been called "context rot": the degradation that happens when the context window is filled with irrelevant or poorly structured material.&lt;/p&gt;

&lt;p&gt;The practical implication: a well-structured 800-token context block will outperform a dumped 8,000-token blob of documentation. Structured context for LLMs isn't just tidier - it's measurably more effective. Token optimization for LLMs means being deliberate about every element in the context window: structure matters, format matters, and what you leave out matters as much as what you include.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://liquidmetal.ai/casesAndBlogs/context-engineering-claude-code/" rel="noopener noreferrer"&gt;LiquidMetal AI put it in a concrete example&lt;/a&gt;: a developer working on a financial dashboard with compliance requirements didn't need to include their entire 15,000-token regulatory document. They extracted the single relevant constraint - "Dashboard must maintain full data accessibility for SEC compliance - no lazy loading permitted" - and injected that. The model understood the business context, the technical constraint, and produced a compliant implementation. Right information, minimal tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;What gets left out&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Context engineering is as much about exclusion as inclusion. This is especially relevant for developers working with sensitive codebases - pasting entire files into a chat interface can expose API keys, internal credentials, customer data, or proprietary business logic that should never leave your local environment.&lt;/p&gt;

&lt;p&gt;Filtering out your &lt;code&gt;.env&lt;/code&gt; variables or proprietary business logic before pasting a snippet protects your data and instantly improves LLM context quality. A proper context engineering practice always includes a step for what &lt;em&gt;not&lt;/em&gt; to include: credentials, PII, internal configurations, and irrelevant noise. Protecting your data and improving output quality are the same action.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use Context Engineering: A Practical Guide
&lt;/h2&gt;

&lt;p&gt;The conceptual distinction is useful, but what you actually need day-to-day is a signal for which approach to reach for. Here's a simple framework:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;What to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single task, self-contained, model already has the knowledge it needs&lt;/td&gt;
&lt;td&gt;Prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recurring task where you find yourself re-explaining the same background&lt;/td&gt;
&lt;td&gt;Context engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output quality degrades over a long conversation&lt;/td&gt;
&lt;td&gt;Context engineering (context rot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You're spending more time on the prompt than the task&lt;/td&gt;
&lt;td&gt;Context engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You need to synthesize multiple sources (code + docs + tickets)&lt;/td&gt;
&lt;td&gt;Context engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want consistent output across multiple sessions or team members&lt;/td&gt;
&lt;td&gt;Context engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key diagnostic question is: &lt;strong&gt;am I fixing the instruction, or am I compensating for missing information?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've refined a prompt three times and it still feels wrong, the problem is almost certainly the second one. The model is working with incomplete information, and no rewording of the question will fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Engineering Without RAG
&lt;/h2&gt;

&lt;p&gt;Here's where most writing on this topic loses the individual developer: it assumes you're building a retrieval pipeline. That's a legitimate context engineering approach for production AI systems at scale. But it's not the starting point for a developer who wants better results from Claude Code or ChatGPT tomorrow morning.&lt;/p&gt;

&lt;p&gt;The good news: you can practice context engineering with zero infrastructure. The tools you need are already in your editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLAUDE.md / system prompt files.&lt;/strong&gt; For Claude Code users, this is the single highest-leverage starting point. A well-designed CLAUDE.md tells the model your stack, your conventions, your architectural decisions, what commands to run to test things, and what patterns to avoid. &lt;a href="https://github.com/coleam00/context-engineering-intro" rel="noopener noreferrer"&gt;Cole Medin's open-source context engineering template&lt;/a&gt; - which has become a popular reference point in the community - demonstrates how a minimal CLAUDE.md paired with a &lt;code&gt;PRPs/&lt;/code&gt; folder for feature-specific context can fundamentally change the quality of what you get back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reusable context blocks.&lt;/strong&gt; Identify the parts of your context that are relevant across multiple tasks - your data model, your API patterns, your team's naming conventions - and store them as structured markdown files. Load them manually when relevant rather than re-explaining every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-scoped context assembly.&lt;/strong&gt; Before running a complex task, spend two minutes assembling the relevant context: the specific files, the relevant docs, the issue reference, the acceptance criteria. This is the manual version of what a RAG pipeline does automatically. It feels slow at first and becomes very fast with practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured formats for complex tasks.&lt;/strong&gt; Plain prose context is harder for a model to navigate than structured context. A simple format like XML or structured markdown - with clear sections for architecture, conventions, task details, and constraints - consistently outperforms an equivalent amount of unstructured text.&lt;/p&gt;
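
&lt;p&gt;A minimal skeleton for that kind of payload - the section names and project details here are one reasonable choice, not a standard - might look like:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;architecture&amp;gt;
React front end, tRPC for internal APIs (no REST internally), Postgres behind it.
&amp;lt;/architecture&amp;gt;

&amp;lt;conventions&amp;gt;
Tailwind only (no Styled Components). All user-facing text through i18n.
TypeScript strict mode. We do not use Redux.
&amp;lt;/conventions&amp;gt;

&amp;lt;task&amp;gt;
Add a new section to the services page describing the consulting offering.
Acceptance criteria: uses existing design-system components, no hardcoded strings.
&amp;lt;/task&amp;gt;

&amp;lt;constraints&amp;gt;
No new dependencies. Follow the pattern of the existing pricing section.
&amp;lt;/constraints&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;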

&lt;p&gt;&lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; streamlines this exact workflow by assembling token-optimized, structured context from Notion pages, local files, and Git repositories - with built-in privacy scanning to filter out what shouldn't be included - and exporting it ready to paste. It's one of the context engineering tools for developers that removes the assembly overhead without requiring any pipeline infrastructure. Doing this manually works, and many developers do it that way. The tradeoff is time. If you're running this kind of context assembly dozens of times a week, automating it pays back quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Impact: What Changes When You Make the Shift
&lt;/h2&gt;

&lt;p&gt;The qualitative shift developers describe when they start practicing context engineering consistently is notable. It's not that the model suddenly becomes more intelligent. It's that it stops being a stranger to your project.&lt;/p&gt;

&lt;p&gt;Freelance fullstack developer Christopher Groß &lt;a href="https://dev.to/grossbyte/context-engineering-why-your-prompt-is-the-smallest-problem-3li"&gt;wrote about this on Dev.to&lt;/a&gt; from a year of daily Claude Code use:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"When I start working on a new project, the first thing I create is the CLAUDE.md. Not the first component, not the first feature - the context file... After that I save myself from rebuilding that context in every single AI session. I don't explain 'we use Tailwind, not Styled Components' or 'all texts must go through i18n' anymore. It's in the file. Claude reads it, follows it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The practical difference he describes: asking for a new section on the services page now produces code that uses his design system, follows TypeScript strict mode, and contains no hardcoded strings - without him having to specify any of that. The context file did the work.&lt;/p&gt;

&lt;p&gt;This mirrors a broader pattern. Developers who invest in context engineering report a compound effect: the upfront time to structure context is paid back quickly, and returns accelerate over time as the context files mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is prompt engineering dead?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Prompt engineering is a real skill with a genuine use case: getting the best instruction into the context window. What's changing is that it's increasingly understood as one component of a larger practice, not the whole game. The industry didn't abandon web design when UX emerged as a distinct discipline - it recognized both were needed. The same applies here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need to set up RAG to practice context engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. RAG is one way to populate a context window with relevant information automatically at runtime. But manually curated context files, structured prompts, and task-scoped assembly are all valid approaches that require no infrastructure. RAG becomes relevant when the scale of information makes manual curation impractical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between context engineering and just adding more context?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quality and structure, not quantity. More tokens in the context window doesn't mean better output - it can actively hurt performance through context rot. Context engineering is about selecting the &lt;em&gt;right&lt;/em&gt; information and presenting it in the &lt;em&gt;right structure&lt;/em&gt;. The goal is signal density, not volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can a product manager (non-developer) practice context engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely. Context engineering at the individual level is fundamentally about curation and structure, not code. A PM assembling a structured context block from Notion spec pages, issue references, and business constraints before asking an AI to draft a requirements document is doing context engineering. No technical background required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the same prompt produce different quality output on different days?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because context is different. The session state, what you included, how you structured it, how much of the conversation history is still in the window - all of these affect output. Inconsistency is usually a context problem, not a model problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The developer at the beginning of this article wasn't writing bad prompts. They were treating a context problem like an instruction problem - and no amount of prompt refinement fixes that.&lt;/p&gt;

&lt;p&gt;The shift to context engineering isn't about learning a new technology. It's about changing the mental model: from "how do I ask this better?" to "what does the model need to know right now?"&lt;/p&gt;

&lt;p&gt;Once you make that shift, the improvement in output quality isn't incremental. It's categorical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hivetrail.com/blog/context-engineering-for-developers" rel="noopener noreferrer"&gt;&lt;strong&gt;Start with your definitive guide to context engineering →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;&lt;strong&gt;Try HiveTrail Mesh beta - context assembly built for developers →&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Amit is the founder of HiveTrail, building Mesh - a desktop tool that assembles structured, token-optimized context from Notion, local files, and Git for developers and PMs who work with LLMs daily.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>contextengineering</category>
      <category>llm</category>
      <category>developers</category>
    </item>
    <item>
      <title>Context Engineering for Developers: A Practical Guide (2026)</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:43:52 +0000</pubDate>
      <link>https://dev.to/amitba/context-engineering-for-developers-a-practical-guide-2026-pi1</link>
      <guid>https://dev.to/amitba/context-engineering-for-developers-a-practical-guide-2026-pi1</guid>
      <description>&lt;p&gt;You've been there. You paste something into Claude or ChatGPT, get a mediocre answer, then realize two seconds later that the three files that would have made the difference are sitting open in your editor. You add them, resend, and suddenly the response is exactly what you needed.&lt;/p&gt;

&lt;p&gt;That moment of realization - &lt;em&gt;this is what the model needed to see&lt;/em&gt; - is context engineering. You've been doing it since the first day you used an LLM. You just didn't have a name for it, or a system.&lt;/p&gt;

&lt;p&gt;Now you do. And the difference between doing it deliberately versus doing it by accident is the difference between AI that occasionally impresses you and AI that you can actually depend on.&lt;/p&gt;

&lt;p&gt;This guide is a practical walkthrough for developers, engineers, and PMs who work with LLMs daily. Not the kind of "context engineering" that means building RAG pipelines or multi-agent orchestration systems - there are plenty of excellent guides for that. This one is about the workflow layer: how to assemble the right context for your next task, right now, from the sources you already have.&lt;/p&gt;




&lt;h2&gt;
  
  
  What context engineering actually is
&lt;/h2&gt;

&lt;p&gt;In June 2025, Andrej Karpathy posted what became the most widely quoted definition in AI engineering:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;(&lt;a href="https://x.com/karpathy/status/1937902205765607626" rel="noopener noreferrer"&gt;Andrej Karpathy on X&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shopify CEO Tobi Lütke had framed the same idea a day earlier: &lt;em&gt;"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Both were naming something practitioners had already been wrestling with. The term caught on because it was precise.&lt;/p&gt;

&lt;p&gt;Here's the cleanest way to think about it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; is about how you ask. The wording, structure, and format of your instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is about what the model sees before it processes your instruction. The documents, code, history, specs, and data you include - or exclude - shape the entire output.&lt;/p&gt;

&lt;p&gt;The analogy Karpathy himself uses: think of the LLM as a CPU and its context window as RAM. The model can only work with what's currently loaded into that working memory. Your job, as the developer or practitioner driving the task, is to act like an operating system - curating what gets loaded, and in what format, so the model has exactly what it needs for this specific computation.&lt;/p&gt;

&lt;p&gt;The prompt is a question. Context engineering is everything that determines whether the model has what it needs to answer it well.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it is not
&lt;/h3&gt;

&lt;p&gt;Context engineering is often confused with adjacent concepts. Being clear about the distinctions matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is not just RAG.&lt;/strong&gt; Retrieval-Augmented Generation is one technical pattern for assembling context automatically. Context engineering is the broader discipline - RAG is one tool within it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not CLAUDE.md or AGENTS.md.&lt;/strong&gt; Those are context configuration artifacts for persistent sessions. Valuable, but they address the always-on layer. Session-level context - assembling the right payload for a specific task - is a separate and equally important problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not fine-tuning.&lt;/strong&gt; Fine-tuning modifies the model. Context engineering modifies what the model sees at runtime. These are fundamentally different levers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://blog.langchain.com/context-engineering-for-agents/" rel="noopener noreferrer"&gt;LangChain team's thorough breakdown of context engineering strategies&lt;/a&gt; describes four core approaches - write, select, compress, isolate - and shows how they interact in agent systems. It's worth reading for the system-building layer. This guide focuses on the practitioner layer: doing this well, manually, session by session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four problems that poor context engineering creates
&lt;/h2&gt;

&lt;p&gt;Before getting to solutions, it's worth naming the failure modes precisely. Each of these shows up in daily work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Context rot
&lt;/h3&gt;

&lt;p&gt;Chroma Research published a study in July 2025 - "&lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;Context Rot: How Increasing Input Tokens Impacts LLM Performance&lt;/a&gt;" - testing 18 state-of-the-art LLMs including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. The finding: model performance degrades as input length increases, often in non-uniform and surprising ways. The study found that &lt;strong&gt;adding irrelevant context - not just length, but noise - significantly degrades model performance even on simple tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This matters because of how most LLM sessions actually run. You start a session with a specific task. You get partway through. You add more files. Ask follow-up questions. Reference earlier parts of the conversation. By the time you're on message fifteen, the model is working in a context that has doubled in size, the original clear specification is buried, and the earlier assumptions are silently competing with the later ones.&lt;/p&gt;

&lt;p&gt;The model isn't getting worse. The context is getting noisy.&lt;/p&gt;

&lt;p&gt;The Adobe research team's &lt;a href="https://github.com/adobe-research/NoLiMa" rel="noopener noreferrer"&gt;NoLiMa benchmark&lt;/a&gt; (presented at ICML 2025) showed this quantitatively: at 32K tokens, 11 out of 12 tested models dropped below 50% of their performance in short contexts. GPT-4o dropped from 99.3% accuracy at 1K tokens to 69.7% at 32K tokens - on the same task, with the same core information present.&lt;/p&gt;

&lt;p&gt;The practical upshot: a long session with accumulated context is not better than a fresh, focused one. It is frequently worse.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context poverty
&lt;/h3&gt;

&lt;p&gt;The opposite problem is providing too little - or the wrong kind. A diffstat summary instead of the actual diff. A Notion page title instead of the page content. A vague task description without the spec it's supposed to implement.&lt;/p&gt;

&lt;p&gt;The model fills the gaps with plausible-sounding completions drawn from its training data. These completions can feel coherent while being wrong for your specific codebase, your specific architecture, or your specific customer. You get generic output that could have been written for any company, because the model didn't have what it needed to write specifically for yours.&lt;/p&gt;

&lt;p&gt;This is the mechanism we documented in our &lt;a href="https://hivetrail.com/blog/claude-haiku-vs-sonnet-ai-pr-descriptions" rel="noopener noreferrer"&gt;Haiku vs. Sonnet experiment&lt;/a&gt;: Claude Code running Sonnet 4.6 was given a high-level diffstat. Claude Haiku 4.5 was given a 380KB structured XML file containing full file contents, unified diffs, and commit metadata. Haiku - a smaller, cheaper model - produced a measurably better PR description. The model that could see the primary source material didn't need to guess.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context contamination
&lt;/h3&gt;

&lt;p&gt;Manual context assembly under time pressure creates a specific and underappreciated risk: accidentally including sensitive data.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://go.layerxsecurity.com/the-layerx-enterprise-ai-saas-data-security-report-2025" rel="noopener noreferrer"&gt;LayerX's Enterprise AI &amp;amp; SaaS Data Security Report 2025&lt;/a&gt;, 77% of enterprise employees paste data into GenAI tools, and 82% of that activity happens through personal, unmanaged accounts outside enterprise oversight. Of file uploads to AI tools, 40% contain PII or PCI data.&lt;/p&gt;

&lt;p&gt;This isn't malice - it's workflow. A developer debugging a production issue pastes the relevant log file. What they didn't notice: an API key three lines above the function, or an internal database hostname, or a customer email in a comment. The &lt;a href="https://incidentdatabase.ai/cite/768/" rel="noopener noreferrer"&gt;Samsung incident in March 2023&lt;/a&gt; - where semiconductor engineers pasted proprietary source code and meeting notes into ChatGPT, effectively transferring trade secrets to OpenAI's servers - involved no bad actors. Just engineers using the tool the way engineers use tools, without a review step.&lt;/p&gt;

&lt;p&gt;Three separate incidents. Twenty days. No one noticed until the damage was done.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Context drift
&lt;/h3&gt;

&lt;p&gt;Different sessions get different context. Two developers on the same team give the same task to the same model on the same day and get wildly different results, because one included the architecture decision record and the other didn't. One pasted the relevant Notion spec; the other described it from memory.&lt;/p&gt;

&lt;p&gt;No reproducibility, no shared baseline, no way to know which output to trust. The model's capability becomes unreliable not because it's inconsistent - it's quite consistent given the same input - but because the inputs are never actually the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  The proof: context quality beats model tier
&lt;/h2&gt;

&lt;p&gt;The core claim of this guide rests on a reproducible finding from our own practice. We ran a controlled comparison to generate a PR description for the same feature branch using two different setups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup A - Claude Code with Sonnet 4.6&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Context provided: a high-level diffstat (standard Claude Code behaviour)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup B - Claude Haiku 4.5 via web chat&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Context provided: a 380KB structured XML file containing the full content of every changed file, unified diffs for every file, per-commit metadata, and a structured commit log. 106,120 tokens of primary source material.&lt;/p&gt;

&lt;p&gt;Haiku won. Clearly. The output named the product feature correctly, described cross-component dependencies accurately, explained test coverage, and wrote in the register of a senior engineer who understood the codebase. Sonnet's output referenced "the Stack" without explanation and missed architectural context that was obvious once you could see the actual code.&lt;/p&gt;

&lt;p&gt;The lesson isn't "use Haiku." It's that &lt;strong&gt;the model that sees the primary source material will outperform the model working from summaries - regardless of model tier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This finding is consistent with the broader research. The &lt;a href="https://www.faros.ai/blog/context-engineering-for-developers" rel="noopener noreferrer"&gt;Faros engineering team&lt;/a&gt; documented a similar pattern: agents with access to specific, structured, codebase-level context consistently outperformed agents with generic high-level guidance, even when the guiding documentation was comprehensive. &lt;em&gt;"A rule like 'follow DRY principles' helped in theory but didn't prevent the specific anti-patterns unique to each codebase."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Context specificity beats context volume. Primary sources beat summaries. Structured beats unstructured.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five principles for practical context engineering
&lt;/h2&gt;

&lt;p&gt;These are the principles we've distilled from building Mesh and from the research above. They apply whether you're assembling context manually or through tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 1 - Relevant is better than large
&lt;/h3&gt;

&lt;p&gt;The goal is not to fit as much as possible into the context window. It is to include only what the model needs for this specific task. Every irrelevant token competes for the model's attention - the Chroma and NoLiMa research both confirm this happens at a measurable level.&lt;/p&gt;

&lt;p&gt;When building a context stack, start with what the model &lt;em&gt;must&lt;/em&gt; see to produce a correct, specific answer. Add things only if they are genuinely load-bearing for this task. When in doubt, leave it out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 2 - Primary sources over summaries
&lt;/h3&gt;

&lt;p&gt;Always prefer the actual document, file, or data over a description of it.&lt;/p&gt;

&lt;p&gt;If the model needs to understand a design decision, give it the ADR - not your recollection of the ADR. If it's writing a PR description, give it the diff - not the diffstat. If it's reviewing a feature spec, give it the spec - not the bullet points you extracted from the spec last week.&lt;/p&gt;

&lt;p&gt;Summaries introduce interpretation. Models working from summaries are working from your interpretation of what mattered. They may be missing the one detail that would have changed the output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 3 - Structure helps the model reason
&lt;/h3&gt;

&lt;p&gt;Unstructured prose dumps are harder for models to reason over than well-structured context. Markdown with clear headers, labelled sections, XML with descriptive element names - these give the model anchors. It can locate the relevant section, understand what it contains, and reference it explicitly in its output.&lt;/p&gt;

&lt;p&gt;This is why the PR Brief in our experiment was exported as structured XML: &lt;code&gt;&amp;lt;file&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;diff&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;commit&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;metadata&amp;gt;&lt;/code&gt;. The model could parse the payload, not just read it.&lt;/p&gt;
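
&lt;p&gt;The real export runs to 106K tokens, but its shape - simplified here, with invented file names and commit text - is roughly:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;pr_brief&amp;gt;
  &amp;lt;metadata branch="feature/git-tools" commits="27" files_changed="32" /&amp;gt;
  &amp;lt;commit sha="..."&amp;gt;feat(git): add diff source to the stack builder&amp;lt;/commit&amp;gt;
  &amp;lt;file path="src/git/diff-source.ts"&amp;gt;
    ...full current content of the changed file...
  &amp;lt;/file&amp;gt;
  &amp;lt;diff path="src/git/diff-source.ts"&amp;gt;
    ...unified diff for the same file...
  &amp;lt;/diff&amp;gt;
&amp;lt;/pr_brief&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;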

&lt;p&gt;The &lt;a href="https://www.oreilly.com/radar/context-engineering-bringing-engineering-discipline-to-prompts-part-1/" rel="noopener noreferrer"&gt;O'Reilly context engineering guide&lt;/a&gt; frames this as the difference between "writing the prompt" and "writing the screenplay" - you're not crafting an instruction, you're designing an information environment the model will navigate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 4 - Just-in-time, not always-on
&lt;/h3&gt;

&lt;p&gt;Context should be assembled fresh for each task. Stale context - a Notion page from two weeks ago, a file cached before yesterday's refactor, notes from a meeting that's since been superseded - is worse than no context. It's confidently wrong.&lt;/p&gt;

&lt;p&gt;The "just-in-time" principle means: fetch your sources at the moment you need them, not in advance. Build the context stack when you're about to run the task, not when you think you might run it later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 5 - Privacy is a step, not an afterthought
&lt;/h3&gt;

&lt;p&gt;Context assembly is the moment your data is most exposed. You're gathering real files, real Notion pages, real git history, and combining them into a payload you're about to send to an external server.&lt;/p&gt;

&lt;p&gt;A review step - even thirty seconds of scanning before you export - catches the things that manual assembly misses. API keys buried in config files. PII in a comment you forgot was there. Internal hostnames that shouldn't leave your network. The Samsung engineers weren't careless by disposition; they were working fast, under time pressure, in the way developers always work.&lt;/p&gt;

&lt;p&gt;The gap between "safe" and "unsafe" context assembly is usually one quick scan step before export.&lt;/p&gt;




&lt;h2&gt;
  
  
  A practical workflow: building your context stack
&lt;/h2&gt;

&lt;p&gt;Here's the end-to-end workflow for assembling context deliberately. Walk through this for any non-trivial task before opening your LLM client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Define the task scope precisely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you assemble anything, answer one question: &lt;em&gt;What does the model need to produce a correct, specific, non-generic output for this task?&lt;/em&gt; Write it down if it helps. Not "help me with authentication" but "I need to implement token refresh in our Express API, following the pattern in &lt;code&gt;auth/session.ts&lt;/code&gt;, without changing the existing middleware signature."&lt;/p&gt;

&lt;p&gt;The more precise your scope, the clearer it becomes which context is load-bearing and which is noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Map your sources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the task you've defined, which information sources are actually required? Common categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Relevant code files&lt;/em&gt; - not the whole codebase, the specific files and modules that directly touch the task&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Specification or design docs&lt;/em&gt; - the Notion page, the ADR, the GitHub issue, the feature brief&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Git context&lt;/em&gt; - the relevant diff, recent commits in the affected area, any WIP changes&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Reference implementations&lt;/em&gt; - examples of the pattern you want to follow elsewhere in the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Write the list. Two or three items usually cover 80% of what the model needs.&lt;/p&gt;
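
&lt;p&gt;For the token-refresh task from Step 1, that list might look like this (file and page names beyond &lt;code&gt;auth/session.ts&lt;/code&gt; are hypothetical):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: implement token refresh in the Express API (Step 1 scope)

Sources:
1. auth/session.ts and auth/middleware.ts         - code directly touched
2. Notion page "Session and token lifecycle"      - the governing spec
3. git log and diff for auth/ over the last month - recent history in the affected area
4. api/verifySignature.ts                         - existing pattern for token validation
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;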

&lt;p&gt;&lt;strong&gt;Step 3 - Select and filter, don't dump&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From each source, include only the sections directly relevant to the task. A 40-page Notion database is not a context stack - it's noise. The three pages in that database that define the relevant data model and auth flows are.&lt;/p&gt;

&lt;p&gt;If you find yourself including something "just in case," it probably shouldn't be there. Irrelevant context doesn't help - based on the research, it actively hurts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - Structure for the model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organise your assembled context with clear section headings, labels, and hierarchy. If you're building this manually in a text editor, use markdown headers to separate sections. If you're using a tool that exports to XML, use descriptive element names that communicate meaning.&lt;/p&gt;

&lt;p&gt;The model reads structure as signal. &lt;code&gt;&amp;lt;spec&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;current_implementation&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;example_pattern&amp;gt;&lt;/code&gt; tell the model something about how to use each section. A flat paste of three files tells it nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 - Token-check before you send&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Know your model's effective window for this task type, and count your tokens before sending. Many LLM interfaces show a token count; most developers ignore it until things break.&lt;/p&gt;

&lt;p&gt;If you're approaching the model's effective range (not the advertised maximum - the NoLiMa and Chroma research both show effective range is considerably lower), trim aggressively. The last item added is usually the least essential.&lt;/p&gt;
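
&lt;p&gt;If your interface doesn't surface a count, a rough heuristic - about four characters per token for English prose and code - is enough to catch a payload that's wildly over budget. A sketch in TypeScript; it's an approximation, not a tokenizer:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function estimateTokens(text: string): number {
  // ~4 characters per token is a common rule of thumb for English text and code.
  // Use a real tokenizer when precision matters.
  return Math.ceil(text.length / 4);
}

// Sum the estimate across the sections of an assembled stack and compare it
// against a conservative working budget, not the model's advertised maximum.
function fitsBudget(sections: string[], budgetTokens: number): boolean {
  let total = 0;
  for (const section of sections) {
    total += estimateTokens(section);
  }
  console.log("estimated tokens:", total, "/ budget:", budgetTokens);
  return total &amp;lt;= budgetTokens;
}

// Hypothetical usage: three sections of a stack against a 24K-token working budget.
const stack = [
  "&amp;lt;spec&amp;gt; ...feature spec... &amp;lt;/spec&amp;gt;",
  "&amp;lt;current_implementation&amp;gt; ...relevant files... &amp;lt;/current_implementation&amp;gt;",
  "&amp;lt;example_pattern&amp;gt; ...reference implementation... &amp;lt;/example_pattern&amp;gt;",
];
if (!fitsBudget(stack, 24_000)) {
  console.warn("Trim the stack before sending - start with the last item added.");
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;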

&lt;p&gt;&lt;strong&gt;Step 6 - Run the privacy gate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before any context leaves your machine, spend thirty seconds scanning for things that shouldn't be there. API keys, OAuth tokens, internal hostnames, customer names, email addresses, financial data.&lt;/p&gt;

&lt;p&gt;This step is a personal hygiene baseline regardless of whether you're at a company with a security team or a solo builder working from a home office. External servers are external servers.&lt;/p&gt;
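
&lt;p&gt;A thirty-second manual read is the baseline. If you want a mechanical backstop, even a few regex checks catch the obvious cases - the patterns below are illustrative and deliberately incomplete, not a substitute for a proper secret scanner:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative pre-export check. These patterns catch only the obvious cases
// (AWS-style keys, bearer tokens, emails, internal hostnames); they are not a
// substitute for a real secret scanner or your organisation's tooling.
const checks: { label: string; pattern: RegExp }[] = [
  { label: "AWS access key", pattern: /AKIA[0-9A-Z]{16}/ },
  { label: "Bearer token", pattern: /Bearer\s+[A-Za-z0-9\-._~+\/]{20,}/ },
  { label: "Email address", pattern: /[\w.+-]+@[\w-]+\.[\w.-]+/ },
  { label: "Internal hostname", pattern: /[a-z0-9-]+\.internal\b/ },
];

function scanBeforeExport(payload: string): string[] {
  const findings: string[] = [];
  for (const check of checks) {
    if (check.pattern.test(payload)) {
      findings.push(check.label);
    }
  }
  return findings;
}

// Hypothetical usage: block the export if anything was flagged.
const payload = "DB_HOST=db01.internal\nAWS_KEY=AKIAABCDEFGHIJKLMNOP\n... assembled context ...";
const findings = scanBeforeExport(payload);
if (findings.length &amp;gt; 0) {
  console.error("Export blocked - review these findings first:", findings);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;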

&lt;p&gt;&lt;strong&gt;Step 7 - Separate assembly from generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the shift that makes everything else sustainable.&lt;/p&gt;

&lt;p&gt;Build your context payload first. Then open your LLM client. Don't build context inside the chat thread as you go - that's how sessions accumulate noise, how privacy review gets skipped, and how context drift between sessions becomes permanent.&lt;/p&gt;

&lt;p&gt;Treating context assembly as a distinct workflow step - upstream of generation - is what separates ad-hoc AI use from a reproducible, reliable process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context engineering by role
&lt;/h2&gt;

&lt;p&gt;The principles above apply universally. How they translate to practice depends on where you sit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The developer
&lt;/h3&gt;

&lt;p&gt;Your context stack for a coding task typically needs: the specific files directly involved in the change (not the whole module), the relevant git history for those files, the linked issue or ticket with acceptance criteria, and any architectural decision records that govern the pattern being implemented.&lt;/p&gt;

&lt;p&gt;The common failure mode: giving the model only the file you're currently editing when the actual dependency that's causing the bug lives elsewhere. The model produces a fix that works in isolation and breaks in integration, because it couldn't see what it was integrating with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick checklist for coding tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The specific files being changed (glob-selected, not whole repo dumps)&lt;/li&gt;
&lt;li&gt;The diff or changeset if working from existing code&lt;/li&gt;
&lt;li&gt;The issue/ticket defining what "done" looks like&lt;/li&gt;
&lt;li&gt;One reference implementation showing the established pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The product manager
&lt;/h3&gt;

&lt;p&gt;Your context stack for a feature brief, spec review, or roadmap document typically needs: the relevant user research or customer feedback, the existing product spec or PRD section this relates to, competitive positioning context if relevant, and the current sprint or milestone context.&lt;/p&gt;

&lt;p&gt;The common failure mode: asking the model to "write a spec for X" with no product context and getting an output that could apply to any software company. The model is not being lazy - it genuinely doesn't know your product, your users, or your architectural constraints. That's context you have to provide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick checklist for PM tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user research or feedback that motivated this feature&lt;/li&gt;
&lt;li&gt;The relevant section of your existing PRD or product principles&lt;/li&gt;
&lt;li&gt;Any constraints (technical, timeline, resource) that shape scope&lt;/li&gt;
&lt;li&gt;An example of a spec or brief that matched the format you want&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The solo builder / LLM power user
&lt;/h3&gt;

&lt;p&gt;Your challenge is different: you work across many contexts (your product, client work, research, writing) and rebuilding a context stack from scratch each session is the biggest time tax.&lt;/p&gt;

&lt;p&gt;The solution is a personal context library - reusable blocks you've pre-assembled: a "this is my product" block, a "this is my stack" block, a "this is my code style" block. You compose from these fast, rather than reconstructing from source each time.&lt;/p&gt;

&lt;p&gt;The goal isn't to have one massive always-on context. It's to have well-organised, modular pieces you can quickly combine into a precise payload for each task.&lt;/p&gt;
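
&lt;p&gt;On disk, that library can be as simple as a folder of small markdown files you compose per task - the names here are one possible layout, not a convention:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context-blocks/
  product-overview.md     # what the product does, who it's for, core constraints
  stack.md                # languages, frameworks, hosting, CI commands
  code-style.md           # naming, testing, and review conventions
  clients/
    acme-brief.md         # per-client or per-project background
  instructions/
    pr-description.md     # reusable task instructions you want applied consistently
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;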

&lt;p&gt;&lt;strong&gt;Quick checklist for solo builders:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A saved "company/project context" block (architecture, tone, core constraints)&lt;/li&gt;
&lt;li&gt;The specific task documents for today's session&lt;/li&gt;
&lt;li&gt;A saved "instructions" block for task-specific behaviour you want to persist&lt;/li&gt;
&lt;li&gt;A clear separation between sessions - don't carry yesterday's context into today's task&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What this looks like with a tool built for it
&lt;/h2&gt;

&lt;p&gt;The workflow above works manually. If you're doing this once or twice a week, a text editor and some discipline is sufficient.&lt;/p&gt;

&lt;p&gt;If you're doing this daily - assembling context from Notion, local files, git diffs, GitHub issues, and saved reusable blocks, across multiple projects - the manual workflow becomes a bottleneck. The friction of assembly creates pressure to skip steps: to use a summary instead of the source, to skip the privacy review, to copy from yesterday's session instead of fetching fresh.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; is the tool we built specifically for this workflow. The design maps directly to the seven steps above:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect your sources&lt;/strong&gt; - link your Notion workspace, point Mesh at local directories with glob patterns, connect your GitHub account, or draw from your saved Context Blocks library. Sources are connected once; fetched just-in-time on every export.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the stack&lt;/strong&gt; - drag relevant items into the Stack, reorder them by priority, pin the ones you always need for this project. Every item shows a live token count. The model-aware counter updates as you add and remove items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token-check and trim&lt;/strong&gt; - the Output Editor gives you a live view of your assembled context, with a token count updating in real time against your target model's effective window. Trim until you're within range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy gate&lt;/strong&gt; - before export, Mesh's Privacy Scanner runs automatically, flagging API keys, PII, and internal paths. The Exit Gate blocks unsafe exports. Nothing leaves until it's clean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export&lt;/strong&gt; - to clipboard, directly to your LLM client, or as a file. The assembled context is model-agnostic - you bring it into Claude, Gemini, GPT-4o, or Claude Code. Context assembly is handled; generation is your choice.&lt;/p&gt;

&lt;p&gt;The JIT principle is built into the architecture: every item in the stack is fetched at the moment of export, not cached from a previous session. The Notion page you're exporting today reflects today's version of that page.&lt;/p&gt;

&lt;p&gt;Mesh is currently in limited beta. If this workflow resonates, &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;request early access here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is context engineering, in plain terms?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context engineering is the discipline of deciding what information enters a language model's context window for a given task - what to include, how much, how it's structured, and when it's fetched. It's the work that happens before you write your prompt, and it determines most of the quality of the output you get back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between context engineering and prompt engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt engineering is about how you ask the question - the wording, structure, and format of your instruction. Context engineering is about what the model sees before it processes your question. They're complementary, but context engineering has a higher ceiling: you can write a perfect prompt and still get a mediocre answer if the model doesn't have the information it needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does context engineering replace RAG?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Retrieval-Augmented Generation is one technical approach to assembling context automatically at query time. It's a specific implementation pattern within context engineering. For many production systems, RAG is the right tool. For session-level, practitioner-level work - assembling context for a specific task today - manual or semi-manual context assembly is often more precise and more controllable than automated retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I know if my context is too large?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Check your token count against the model's effective range - not its advertised maximum. The NoLiMa benchmark showed that 11 of 12 tested models dropped below 50% performance at 32K tokens, even for models claiming 128K+ context windows. In practice, aim for the smallest context that contains everything load-bearing for your task. If you're adding something "just in case," it's probably noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is context rot and how do I prevent it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Context rot is the performance degradation that occurs as LLM context length grows - documented by &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;Chroma Research&lt;/a&gt; across 18 state-of-the-art models. The practical prevention: start fresh sessions for distinct tasks rather than accumulating everything in one thread, fetch context just-in-time rather than carrying it over from previous sessions, and aggressively trim irrelevant content before sending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What tools exist for context engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For pipeline and agent-level context engineering: &lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; for data ingestion and retrieval, &lt;a href="https://langchain-ai.github.io/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; for agent orchestration, and CLAUDE.md / AGENTS.md configuration files for persistent coding agent context. For session-level, practitioner-level context assembly: &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt;, which handles source connection, stack assembly, token management, privacy scanning, and structured export for individual workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Next time you're about to paste something into your LLM client, pause for a moment and ask: &lt;em&gt;is this the context the model actually needs to produce a specific, correct answer for this task?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question is context engineering. The discipline is the same whether you're architecting a 10-agent pipeline or putting together a feature brief on a Tuesday afternoon. The difference is whether you're doing it deliberately, with a system - or by instinct, hoping the model figures out what you needed it to see.&lt;/p&gt;

&lt;p&gt;The good news: the gap between casual and deliberate is not large. A clear task definition, a short source list, structured assembly, a token check, and a thirty-second privacy review. That's the workflow. It takes a few extra minutes and it moves the quality ceiling significantly.&lt;/p&gt;

&lt;p&gt;The models are good. Give them what they need to work with.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;HiveTrail builds tools for developers and PMs who work with LLMs daily. &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; is a desktop application for assembling, managing, and securely exporting structured context from Notion, local files, git repositories, and reusable context blocks. Currently in limited beta - &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;request early access&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related reading from HiveTrail:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hivetrail.com/blog/claude-haiku-vs-sonnet-ai-pr-descriptions" rel="noopener noreferrer"&gt;Claude Haiku 4.5 Outperformed Sonnet 4.6 on PR Writing - Context Was the Difference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hivetrail.com/blog/why-git-log-oneline-kills-ai-prs" rel="noopener noreferrer"&gt;Why git log --oneline kills AI PR descriptions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hivetrail.com/blog/prevent-api-key-pii-leaks-llm-prompts" rel="noopener noreferrer"&gt;The hidden risk of pasting code into LLMs: how to prevent API key and PII leaks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hivetrail.com/blog/llm-context-assembly-pr-generation" rel="noopener noreferrer"&gt;LLM context assembly for PR generation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>developers</category>
      <category>llm</category>
      <category>devex</category>
    </item>
    <item>
      <title>We Had Gemini Blind-Judge Three Claude-Generated Pull Requests. Here's the Template It Built.</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:00:00 +0000</pubDate>
      <link>https://dev.to/amitba/we-had-gemini-blind-judge-three-claude-generated-pull-requests-heres-the-template-it-built-295a</link>
      <guid>https://dev.to/amitba/we-had-gemini-blind-judge-three-claude-generated-pull-requests-heres-the-template-it-built-295a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://hivetrail.com/blog/ai-pull-request-template" rel="noopener noreferrer"&gt;hivetrail.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most AI-generated pull request descriptions have a problem that's easy to miss: they sound right.&lt;/p&gt;

&lt;p&gt;The structure is there. The sections are filled in. The tone is professional. And somewhere in the Testing section, the AI has written something like "comprehensive test coverage was added to ensure correct functionality" - which is confident, grammatically correct, and completely useless to anyone trying to review your code.&lt;/p&gt;

&lt;p&gt;The AI didn't hallucinate because it's a bad model. It hallucinated because you gave it a diffstat and a list of commit subjects, and asked it to describe a 32-file, 27-commit feature. It did its best with what it had. The output looks like a PR description. It just isn't one.&lt;/p&gt;

&lt;p&gt;This post is about what a real AI-generated PR description looks like - and how you can build one. The backstory is short: we ran the same one-line prompt against three different context conditions, then asked Gemini 3 Pro to evaluate the outputs blind. It didn't know which model produced which text. It didn't know how many tokens each used. It judged purely on engineering utility.&lt;/p&gt;

&lt;p&gt;It ranked them. Then it did something more useful: it told us exactly what the ideal PR description looks like, by identifying the best element from each output and explaining why it worked.&lt;/p&gt;

&lt;p&gt;We turned that synthesis into a template. It's below. Use it today, regardless of your tooling.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Experiment, In Brief
&lt;/h2&gt;

&lt;p&gt;We were building a real feature - Git Tools for &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; - 27 commits, 32 files. After wrapping up, we ran the same prompt against three conditions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; &lt;em&gt;"Based on the staged changes / recent commits, write me a PR title and description."&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Condition A:&lt;/strong&gt; Claude Code&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Native git: &lt;code&gt;diffstat&lt;/code&gt; + &lt;code&gt;--oneline&lt;/code&gt; commit log (~61K tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Condition B:&lt;/strong&gt; Claude web chat&lt;/td&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Mesh PR Brief: 106K tokens of full files, diffs, &amp;amp; structured commit log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Condition C:&lt;/strong&gt; Claude web chat&lt;/td&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;Mesh PR Brief: 106K tokens of full files, diffs, &amp;amp; structured commit log&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini 3 Pro then evaluated all three outputs without knowing the conditions - no model names, no token counts, just the raw PR text.&lt;/p&gt;

&lt;p&gt;Posts 1 and 2 covered the results in detail. The short version: Condition A (native Claude Code) came in last in every evaluation. Both Mesh-context outputs beat it substantially, and Haiku 4.5 with full context outranked Sonnet 4.6 without it.&lt;/p&gt;

&lt;p&gt;But the most useful thing Gemini produced wasn't the ranking. It was the synthesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Gemini Said the Ideal PR Looks Like
&lt;/h2&gt;

&lt;p&gt;After ranking the three outputs, Gemini identified the single strongest element from each and described what you'd get if you combined them:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The ideal PR description would use the structure and design rationale of [Condition B], the actionable test plan of [Condition C - Claude Code fed the Mesh XML], and the crisp inline code formatting and bug-fix callouts of [Condition B]."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two things are immediately notable here. First, every element that made it into the ideal template came from a Mesh-context output. The native Claude Code PR - working from a diffstat and oneline commit log alone - contributed nothing to the synthesis. Second, the Mesh outputs each contributed different strengths depending on which model consumed the context, which means context quality is necessary but not sufficient. Structure, model, and interface all still matter.&lt;/p&gt;

&lt;p&gt;Here's what each element actually means in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure layered by architectural tier + Key Design Decisions section&lt;/strong&gt; (from Condition B - Mesh + Sonnet 4.6 via web chat)&lt;/p&gt;

&lt;p&gt;Grouping changes by layer - Models, Service Layer, State &amp;amp; Stack, UI, Bug Fixes, Tests - means a backend engineer can jump straight to the service layer, a UI reviewer can go straight to the components section, and a PM can read the summary and Key Design Decisions without parsing file lists. The Key Design Decisions section is the part most PRs skip entirely: it explains &lt;em&gt;why&lt;/em&gt; architectural choices were made, not just what changed. Gemini flagged this as "invaluable for team alignment and long-term maintainability." It's also the section an AI is most likely to hallucinate if it didn't read the actual code, because the reasoning lives in implementation decisions, not in commit messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable, scenario-specific test plan&lt;/strong&gt; (from Condition C - Mesh + Haiku 4.5 via web chat)&lt;/p&gt;

&lt;p&gt;There's a meaningful difference between "41 tests passing" and "trigger a file-read failure on a locked binary - confirm the stack card shows an orange warning icon and the edit dialog displays an error banner with partial content." The first is a status report. The second is a verification guide. Gemini specifically praised this output for providing "specific, actionable steps to verify the feature" that "remove ambiguity" for QA and PMs. This level of specificity requires the AI to know what your failure states actually look like - information that lives in the diff, not in the commit subject line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rigorous inline code formatting + dedicated Bug Fixes &amp;amp; Hardening section&lt;/strong&gt; (from Condition B - Mesh + Sonnet 4.6 via web chat)&lt;/p&gt;

&lt;p&gt;Backtick formatting for every variable, class name, and file path makes a PR scannable. &lt;code&gt;commit_log&lt;/code&gt; stands out from surrounding prose; "the commit_log fallback" does not. Separately, pulling bug fixes out of the "What's Changed" section into their own dedicated block is a PM-facing signal: it shows that the PR handles edge cases, not just the happy path. Gemini called this "a great PM practice." It's also easy to miss in a flat file-by-file list.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Template
&lt;/h2&gt;

&lt;p&gt;Here it is. Copy the block below directly - it's ready to paste into your PR description field or to save as a reusable snippet.&lt;/p&gt;

&lt;p&gt;Then scroll past it for an annotated breakdown of what each section is for and who it serves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## [feat/fix/chore](#issue-number): [short imperative description]&lt;/span&gt;

&lt;span class="gu"&gt;### Summary&lt;/span&gt;
[2–3 sentences: what this is and where it fits in the product - name the feature
and its context within the broader system, not just what files changed]

&lt;span class="gs"&gt;**[Workflow or feature name]**&lt;/span&gt; - [what it does] for [user goal]
&lt;span class="gs"&gt;**[Second workflow, if applicable]**&lt;/span&gt; - [same pattern]
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;### Key Design Decisions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**[Decision name]:**&lt;/span&gt; [What was decided] - [the alternative considered and why
  this approach won, or the constraint it addresses]
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**[Decision name]:**&lt;/span&gt; [What was decided] - [tradeoff or edge case it handles]
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**[Decision name]:**&lt;/span&gt; [What was decided] - [why the obvious alternative was
  rejected]
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;### What's Changed&lt;/span&gt;

&lt;span class="gu"&gt;#### Models &amp;amp; Architecture&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`ModelName`&lt;/span&gt; - [what it is, one line]
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`AnotherModel`&lt;/span&gt; - [discriminated union support, computed fields, etc.]

&lt;span class="gu"&gt;#### Service Layer&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`service_file.py`&lt;/span&gt; - [stateless/stateful, what operations it covers]

&lt;span class="gu"&gt;#### State &amp;amp; Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`HandlerName`&lt;/span&gt; - [how it integrates into the lifecycle]
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`ManagerMethod`&lt;/span&gt; - [what new capability it exposes]

&lt;span class="gu"&gt;#### UI - [Panel/Component Area]&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`ComponentName`&lt;/span&gt; - [what it renders or manages]
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`DialogName`&lt;/span&gt; - [tabs, actions, edge case handling]

&lt;span class="gu"&gt;#### Bug Fixes &amp;amp; Hardening&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Fixed [specific issue] by [specific mechanism] to prevent [failure mode]
&lt;span class="p"&gt;-&lt;/span&gt; Changed &lt;span class="sb"&gt;`fallback`&lt;/span&gt; from &lt;span class="sb"&gt;`"old_value"`&lt;/span&gt; to &lt;span class="sb"&gt;`correct_value`&lt;/span&gt; to prevent
  [specific error class]
&lt;span class="p"&gt;-&lt;/span&gt; Downgraded &lt;span class="sb"&gt;`[log_method]`&lt;/span&gt; from &lt;span class="sb"&gt;`info`&lt;/span&gt; to &lt;span class="sb"&gt;`debug`&lt;/span&gt; to reduce [noise type]
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;### Test Plan&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [ ] [Core scenario]: [exact setup steps] → confirm [specific expected output]
&lt;span class="p"&gt;-&lt;/span&gt; [ ] [Edge case]: Trigger [failure condition] (e.g., [concrete example]) →
  confirm [specific UI state or error behavior]
&lt;span class="p"&gt;-&lt;/span&gt; [ ] [Selection/state scenario]: [user action] → confirm [downstream behavior]
&lt;span class="p"&gt;-&lt;/span&gt; [ ] [Persistence scenario]: Save [config], reload app → confirm [state restored]
&lt;span class="p"&gt;-&lt;/span&gt; [ ] [Regression check]: Confirm no regressions on [adjacent feature or flow]
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;### Testing&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; [N] new tests in &lt;span class="sb"&gt;`test_file.py`&lt;/span&gt; covering [specific functions and scenarios]
&lt;span class="p"&gt;-&lt;/span&gt; [Total] total tests passing
&lt;span class="p"&gt;-&lt;/span&gt; Test approach: [real repos vs. mocks, integration points, what's not covered]

Closes #[issue-number]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Section-by-Section: What Each Part Does and Why It's There
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
The first thing a reviewer reads, and the section most PRs get wrong. "Adds git tools support" is a file-level description. "Introduces Git Tools as a fourth source type alongside Notion, Local Files, and Context Blocks - providing two workflows for assembling LLM context from a local repository" is a product-level description. The difference matters: a reviewer who doesn't know your architecture shouldn't have to reconstruct it from file names. Place the feature in context. Name what it's for and who it's for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Design Decisions&lt;/strong&gt;&lt;br&gt;
The most underused section in most PR descriptions, and the one that pays the longest dividends. Future maintainers - including you, eight months from now - don't need a file list. They need to know &lt;em&gt;why&lt;/em&gt; the base branch field is a dropdown instead of a text input (to prevent stale scan targets), &lt;em&gt;why&lt;/em&gt; partial failures surface as a warning state instead of an error (so users can still insert partial content), &lt;em&gt;why&lt;/em&gt; subprocess calls are wrapped with &lt;code&gt;--no-pager&lt;/code&gt; (to prevent ANSI corruption in generated XML). These decisions look arbitrary in the code. A dedicated section makes them legible.&lt;/p&gt;

&lt;p&gt;This is also the section an AI is most likely to fill with plausible-sounding nonsense if it didn't read your code. If your Key Design Decisions could apply to any feature in any codebase, the AI was guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Changed, layered by architectural tier&lt;/strong&gt;&lt;br&gt;
A flat file list puts the cognitive burden on the reviewer. Grouping by layer - Models, Service, State, UI, Bug Fixes - lets different reviewers navigate to their section. A UI specialist doesn't need to parse service layer changes to find the component work. A backend engineer doesn't need to read the dialog code to find the async lifecycle integration. The grouping itself is a form of documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug Fixes &amp;amp; Hardening as a separate section&lt;/strong&gt;&lt;br&gt;
Don't bury these in "What's Changed." Pulling them out into their own block does two things: it makes them visible to reviewers who might otherwise miss a &lt;code&gt;""&lt;/code&gt; → &lt;code&gt;[]&lt;/code&gt; fallback fix buried in a bullet list, and it signals to non-technical stakeholders that the PR handles edge cases, not just the happy path. One-line bug fixes are worth calling out explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Plan&lt;/strong&gt;&lt;br&gt;
There is a large gap between "full pytest coverage" and a step-by-step verification guide. The test plan serves a different audience than the Testing section: it's for QA engineers, PMs, and reviewers doing manual verification. Each item should have a specific setup, a specific action, and a specific expected outcome. "Trigger a file-read failure and confirm the stack card shows an orange warning icon" is actionable. "Verify error handling works correctly" is not.&lt;/p&gt;

&lt;p&gt;The test plan is also the section that is most directly dependent on knowing what your failure states look like - which requires reading the implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;br&gt;
Quantitative confidence: specific counts and coverage areas, not "full coverage." "41 new tests in &lt;code&gt;test_git_service.py&lt;/code&gt; covering parse, pre-checks, scan, merge logic, and XML generation - 199 total passing - using real temp repos, no mocks" gives a reviewer immediate signal about test quality and approach. "Full pytest coverage" does not.&lt;/p&gt;
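&lt;p&gt;Those numbers are cheap to capture and include in your context. For a pytest project, something like this - the paths are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# last line is the collected-test count for the new test module
pytest tests/services/test_git_service.py --collect-only -q | tail -n 1

# last line is the full-suite pass count to quote in the Testing section
pytest -q | tail -n 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;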




&lt;h2&gt;
  
  
  Why This Template Is Hard to Fill Without Rich Context
&lt;/h2&gt;

&lt;p&gt;You can use this template right now with any AI assistant. It will fill every section. The question is whether it's &lt;em&gt;filling&lt;/em&gt; them or &lt;em&gt;extracting&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;When an AI has thin context - a diffstat, oneline commit subjects, maybe a file count - it generates plausible content based on what PRs like yours typically say. The result is PR descriptions that are coherent and wrong in ways that are hard to spot without reading the code.&lt;/p&gt;

&lt;p&gt;Consider three specific sections:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug Fixes &amp;amp; Hardening&lt;/strong&gt; requires the AI to have read the actual diffs. A diffstat tells you that &lt;code&gt;content_reader_service.py&lt;/code&gt; had 12 lines changed. It doesn't tell you that those 12 lines fixed a BOM-aware encoding issue for UTF-16 LE/BE files, or that the previous code was hitting a Windows cp1252 default that caused garbled output. That detail lives in the implementation. An AI without it will either leave the section empty, write something generic, or - most dangerously - write something specific-sounding that isn't accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Design Decisions&lt;/strong&gt; requires understanding the architectural alternatives you considered and rejected. Why is &lt;code&gt;commit_count&lt;/code&gt; a &lt;code&gt;@computed_field&lt;/code&gt; instead of a stored value? Why does the base branch field disable immediately on path change? The answers exist in the code and in the reasoning that shaped it. An AI working from commit subjects has no access to that reasoning, so it will write decisions that sound plausible but describe different choices than the ones you actually made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Plan&lt;/strong&gt; requires knowing what your failure states look like and what the UI does in each one. "Trigger a file-read failure on a locked binary and confirm the stack card shows an orange warning icon with an enabled Insert button" is only writable if the AI read the &lt;code&gt;warning&lt;/code&gt; state implementation in &lt;code&gt;BaseStackCard&lt;/code&gt;. A diffstat says &lt;code&gt;base_stack_card.py | 8 +++&lt;/code&gt;. That's not enough.&lt;/p&gt;

&lt;p&gt;This isn't a flaw in the model. Sonnet 4.6 and Haiku 4.5 are both capable of writing excellent PR descriptions. The difference in our experiment wasn't model capability - it was whether the model had the content to extract from, or had to invent it.&lt;/p&gt;

&lt;p&gt;Native Claude Code received a &lt;code&gt;git diff --stat&lt;/code&gt; and a &lt;code&gt;--oneline&lt;/code&gt; commit log. It produced a reasonable-looking PR description. Mesh provided 106K tokens of structured XML - full file content, unified diffs, commit metadata, and a structured commit log. The same prompt, the same model, completely different output.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Mesh Generates That Context
&lt;/h2&gt;

&lt;p&gt;Mesh's PR Brief workflow is straightforward. You point it at a local repository, select a base branch from an auto-populated dropdown (populated from &lt;code&gt;get_git_branches()&lt;/code&gt;, not free-text input - this prevents stale scan targets), and get a checklist of changed files and commits. You deselect anything irrelevant - generated files, lock files, assets you don't want in the context - and Mesh generates a structured XML document containing full file content, unified diffs, and a structured commit log with per-commit metadata.&lt;/p&gt;
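&lt;p&gt;If you want to reproduce that branch list outside Mesh, the generic git equivalent is a one-liner (this is plain git plumbing, not Mesh's internal implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# local branch names, suitable for populating a base-branch selector
git for-each-ref --format='%(refname:short)' refs/heads/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;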

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9kr5seig423xwdf03yu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9kr5seig423xwdf03yu.webp" alt="HiveTrail Mesh’s PR Brief interface automates the git context assembly process, structuring full diffs and commit logs into LLM-ready XML while letting you select exactly which files to include." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output for the feature in this experiment was a 380KB XML file: 106,120 tokens, 379,281 characters. That's the document Claude web chat received when it wrote the PR descriptions in Conditions B and C.&lt;/p&gt;

&lt;p&gt;The economics are worth pausing on. Condition B (Mesh + Sonnet 4.6) and Condition C (Mesh + Haiku 4.5) used identical context. Haiku 4.5 costs a fraction of Sonnet 4.6 per token - and it produced a PR that Gemini ranked ahead of native Sonnet 4.6 by a substantial margin. For teams watching LLM API costs, this is the significant finding: when context quality is high, you can step down to a cheaper model without sacrificing output quality. The context is doing most of the heavy lifting. Model tier matters less than you'd expect when the input is rich enough.&lt;/p&gt;

&lt;p&gt;The conclusion from our experiment: the gap between a mediocre AI-generated PR and an excellent one is not primarily a model selection problem. It's a context assembly problem. Better context enables better outputs from cheaper models - that's an improvement in both quality and cost simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With the Template
&lt;/h2&gt;

&lt;p&gt;The template above is free. Use it today. It'll make your PR descriptions better regardless of what tool you use to fill it in - even if you fill it in manually.&lt;/p&gt;

&lt;p&gt;What the template can't do is generate its own content. The Key Design Decisions section requires you (or your AI) to know why you built it the way you did. The Test Plan requires knowing what your failure states look like. The Bug Fixes section requires reading the actual diffs.&lt;/p&gt;

&lt;p&gt;If you want an AI to fill this template accurately - not plausibly, accurately - it needs to see enough of your codebase to extract those answers rather than guess at them. That's the problem Mesh is built to solve.&lt;/p&gt;

&lt;p&gt;HiveTrail Mesh is a standalone desktop application that acts as a just-in-time context engine, assembling structured XML from your local git repositories, Notion docs, and local files - and running a privacy scanner against the output before anything leaves your machine. Proprietary secrets, API keys, and internal paths get masked locally. Nothing is sent to a cloud service during context assembly.&lt;/p&gt;

&lt;p&gt;Mesh is currently in beta. If you're a developer who writes PRs, generates commit messages, or uses LLMs for code work, &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;join the beta here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>git</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Haiku 4.5 Outperformed Sonnet 4.6 on PR Writing - Context Was the Difference</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Tue, 14 Apr 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/amitba/claude-haiku-45-outperformed-sonnet-46-on-pr-writing-context-was-the-difference-3jim</link>
      <guid>https://dev.to/amitba/claude-haiku-45-outperformed-sonnet-46-on-pr-writing-context-was-the-difference-3jim</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://hivetrail.com/blog/claude-haiku-vs-sonnet-ai-pr-descriptions" rel="noopener noreferrer"&gt;hivetrail.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I ran the same prompt on three different setups and had Gemini 3 Pro evaluate the results blind.&lt;/p&gt;

&lt;p&gt;The setup using Claude Haiku 4.5 - Anthropic's smallest, cheapest model - produced a better pull request description than Claude Code running Sonnet 4.6.&lt;/p&gt;

&lt;p&gt;Before I explain why, a transparency note: to keep this objective, I didn't grade these myself. I handed all three outputs to Gemini 3 Pro and asked it to evaluate them from the perspective of a senior developer and product manager. I agree entirely with its verdict.&lt;/p&gt;

&lt;p&gt;The reason Haiku won has nothing to do with Haiku being a better model than Sonnet. It isn't. The reason is that Haiku was given something Sonnet wasn't: the actual evidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Claude Code actually saw&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When I asked Claude Code (Sonnet 4.6) to write a PR title and description for a completed feature branch, it ran two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log main..HEAD &lt;span class="nt"&gt;--oneline&lt;/span&gt;
git diff main...HEAD &lt;span class="nt"&gt;--stat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're not familiar with these flags: &lt;code&gt;--oneline&lt;/code&gt; returns the abbreviated commit SHA and the subject line of each commit message. That's it - no body, no diff, no file content. &lt;code&gt;--stat&lt;/code&gt; returns a summary of which files changed and how many lines were added or removed. Also no actual content.&lt;/p&gt;

&lt;p&gt;The result was a 61K token session that cost $0.12 and completed in about 25 seconds. Three entries in the session log. Claude Code was fast, cheap, and working from a diffstat.&lt;/p&gt;

&lt;p&gt;Now here's what the Haiku session saw: a 380KB structured XML file containing the full content of every changed file, unified diffs for every file, per-commit metadata with author and timestamp, a structured commit log, and uncommitted change warnings. 106,120 tokens. The same feature branch, assembled into a document designed to give a model everything it needs to reason about the change.&lt;/p&gt;

&lt;p&gt;The difference isn't token count. It's that one setup asked the model to reconstruct what happened from a summary. The other gave it the primary source material.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What the models did with what they had&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The gap in output quality shows up most clearly in three specific places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product context.&lt;/strong&gt; The Claude Code output refers to "the Stack" without explanation. A developer already working in this codebase knows what that means. A reviewer who doesn't is left to guess. The Haiku output opens with: "Introduces Git Tools as a fourth source type in HiveTrail Mesh, alongside Notion, Local Files, and Context Blocks." Same fact, but the second version works for anyone reading the PR - now or six months from now in the commit history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow accuracy.&lt;/strong&gt; The Claude Code output describes the Commit Brief feature as something that "scans a single commit." That's not quite right. Commit Brief scans uncommitted changes - staged, unstaged, and untracked files - to help you write a commit message for work you haven't committed yet. It's a subtle distinction, but it's exactly the kind of thing a model gets wrong when it's reasoning from a commit log rather than reading the actual implementation. Haiku got it right because it read the implementation. A diffstat will never surface that kind of semantic precision - and neither will any model reasoning from one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test evidence.&lt;/strong&gt; The Claude Code output states: "Full pytest coverage in tests/services/test_git_service.py." The Haiku output states: "41 new tests in test_git_service.py covering parsing, pre-checks, scan, default checks, save-merging, and XML generation. 199 total tests passing." One is an assertion. The other is evidence. When Gemini evaluated these, it called the Claude Code version a "trust me" statement and said the Haiku version "brings the receipts." That's not a model intelligence gap. Claude Code just didn't see the test file.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The ranking&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Gemini evaluated three outputs in total: Claude Code + Sonnet 4.6, the XML context + Haiku 4.5, and the XML context + Sonnet 4.6. The full ranking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;XML context + Sonnet 4.6 - strongest overall structure, best inclusion of design decisions&lt;/li&gt;
&lt;li&gt;XML context + Haiku 4.5 - excellent markdown formatting, highly scannable, clear separation of bug fixes&lt;/li&gt;
&lt;li&gt;Claude Code + Sonnet 4.6 - strong test plan section, but flat formatting and limited context throughout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sonnet with the full context ranked first, which is expected. But Haiku with the full context ranked second - above Sonnet without it. The model tier mattered less than the context quality.&lt;/p&gt;

&lt;p&gt;Here's where the economics get interesting. Haiku is dramatically cheaper than Sonnet. Combined with prompt caching absorbing most of the input cost, the Haiku generation cost effectively nothing. You get a senior-level PR description for pennies - not by paying for a smarter model, but by changing what you feed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why this happens and what to do about it&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Claude Code is a general-purpose coding assistant optimizing for speed, cost, and interactivity across hundreds of different tasks. Running a full branch diff for every PR request would be slow and expensive, and most of the time unnecessary. The context assembly tradeoff it makes is reasonable for a general tool.&lt;/p&gt;

&lt;p&gt;The problem is that PR description quality is directly proportional to how much of the change the model can actually see. A diffstat tells you what changed. It doesn't tell you why, how the pieces fit together architecturally, what edge cases were handled, or what the test coverage actually covers. When you ask a model to write a PR from a diffstat, you're asking it to reconstruct the full picture from a thumbnail.&lt;/p&gt;

&lt;p&gt;The fix is to separate context assembly from text generation. Stop letting your general-purpose AI coding assistant decide what context matters. Assemble the context yourself - or use a specialized tool to do it - and hand that to whatever model you want to use.&lt;/p&gt;
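&lt;p&gt;A minimal manual version of that assembly step, assuming a &lt;code&gt;main&lt;/code&gt; base branch, might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# bundle full commit messages and the complete branch diff into one file,
# then hand that file to whichever model you prefer
{
  echo "## Commits"
  git log main..HEAD --pretty=format:"%H%n%an%n%ad%n%s%n%b%n---"
  echo
  echo "## Diff"
  git diff main...HEAD
} &gt; pr-context.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;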

&lt;p&gt;Concretely: a PR Brief for this feature branch was 380KB of structured XML. Feeding that into Claude web chat with Haiku 4.5 and a one-sentence prompt produced the second-ranked output in a three-way evaluation. The generation step cost effectively nothing. The quality came from the context.&lt;/p&gt;

&lt;p&gt;This is what &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; does for the context assembly step. It generates a PR Brief - structured XML containing file content, diffs, commit metadata, and checklist controls so you can include or exclude specific files and commits - from any local git repository. You take that output into any model you want: Claude web chat, Gemini, GPT-4o, or back into Claude Code via file reference. The generation step is your choice. The context assembly is handled.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A question worth asking about your last PR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Look at the commands your AI tool actually ran to generate your last pull request description. Most tools log this. If you wouldn't sit down and write a PR description manually from that output - if you'd want to open a few files, read through the diffs, check the test coverage - then you're asking the model to do something you wouldn't do yourself, with less information than you'd give yourself.&lt;/p&gt;

&lt;p&gt;The model tier you're paying for doesn't change the quality of the raw material it's reasoning from. Context does.&lt;/p&gt;

&lt;p&gt;If you're writing PRs for a project you care about, &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;give the HiveTrail Mesh beta a look&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why git log --oneline is Killing Your AI-Generated PRs</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Thu, 09 Apr 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/amitba/why-git-log-oneline-is-killing-your-ai-generated-prs-5gbm</link>
      <guid>https://dev.to/amitba/why-git-log-oneline-is-killing-your-ai-generated-prs-5gbm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://hivetrail.com/blog/why-git-log-oneline-kills-ai-prs" rel="noopener noreferrer"&gt;hivetrail.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I build &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt;, a context assembly tool for LLMs. I use Claude Code, along with several other LLMs, daily. Recently, I asked it to write a pull request description for a feature I'd just finished - 27 commits across 32 files, several days of real work.&lt;/p&gt;

&lt;p&gt;The output was competent. It had a title, a summary, a list of changed files. But reading it back, something felt off. I knew what was in those commits. The architectural decisions, the edge cases I'd hunted down, the bug fixes that weren't obvious from the file names. None of it was there. The PR read like someone had skimmed the index of a book and written a summary without reading the chapters.&lt;/p&gt;

&lt;p&gt;So I did what any developer building a context tool probably would: I went looking for why.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Claude Code Actually Sends to the Model
&lt;/h2&gt;

&lt;p&gt;When you ask Claude Code to write a PR description, it runs two git commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log main..HEAD &lt;span class="nt"&gt;--oneline&lt;/span&gt;
git diff main...HEAD &lt;span class="nt"&gt;--stat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--oneline&lt;/code&gt; flag returns one line per commit: the abbreviated SHA hash and the subject line. That's it. No commit body, no co-author notes, no extended description you carefully wrote.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--stat&lt;/code&gt; flag returns a diffstat - a summary of which files changed and how many lines were added or removed. Again, no actual content. No diffs, no file contents, no context about &lt;em&gt;what&lt;/em&gt; changed or &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So the model is working from something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7c22302 fix(git-tools): harden subprocess wrapper against local gitconfig pollution
6b5fd96 feat(git-tools): replace base branch input with auto-populated select
1246c27 fix(git-tools): resolve validation, state drift, and arch leaks
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a file change summary. For a 27-commit feature branch, that's the equivalent of asking someone to explain a film by reading the chapter titles on a DVD menu.&lt;/p&gt;

&lt;p&gt;The model is good enough to produce something coherent from this - but coherent isn't the same as accurate, and it certainly isn't the same as useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Good PR Description Actually Needs
&lt;/h2&gt;

&lt;p&gt;Before getting to the fix, it's worth being specific about what's missing.&lt;/p&gt;

&lt;p&gt;A PR description serves two audiences with different needs. Developers reviewing the code want to know which layers of the codebase were touched, what edge cases were handled, and why certain decisions were made the way they were. Product managers and QA engineers want to understand user impact, workflow changes, and how to verify the feature works.&lt;/p&gt;

&lt;p&gt;When the model only has commit subject lines and a file list, it can infer &lt;em&gt;what&lt;/em&gt; changed from the file names. It cannot infer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The why&lt;/strong&gt; behind architectural decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases&lt;/strong&gt; that were discovered and handled mid-implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug fixes&lt;/strong&gt; that are buried in commits whose subject lines don't make them obvious&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The distinction&lt;/strong&gt; between new features and hardening work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing specifics&lt;/strong&gt; - what's covered, how, and what a reviewer should manually verify&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are exactly the things that separate a PR description that's useful from one that's just technically accurate.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt Fix: Better Claude Code Instructions
&lt;/h2&gt;

&lt;p&gt;Before reaching for a different tool, most developers will try the obvious thing first: write a better prompt. And it's a fair instinct - you can absolutely instruct Claude Code to run more thorough git commands, request full diff content, and follow a specific PR structure. Something like &lt;em&gt;"run git log with full commit bodies, fetch the complete diff for each changed file, then write a PR description organized by architectural layer with a key design decisions section"&lt;/em&gt; will produce meaningfully better output than the default.&lt;/p&gt;
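&lt;p&gt;If you go this route, it's worth persisting the instruction instead of retyping it - for example as a snippet in your project's CLAUDE.md. A rough sketch, expanding on the instruction above (contents illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;## PR descriptions

When asked to write a PR title and description:

- Run `git log main..HEAD --pretty=format:"%H%n%s%n%b%n---"` to get full commit bodies
- Run `git diff main...HEAD` to read the actual changes, not just the diffstat
- Organize the description by architectural layer
- Include a "Key Design Decisions" section and a step-by-step test plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;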

&lt;p&gt;But there are real costs to this approach worth understanding. The most immediate is token burn - asking Claude Code to fetch full diffs and structured commit logs for a large branch will consume significantly more context than the default &lt;code&gt;--oneline&lt;/code&gt; summary approach, which adds up quickly if you're on a metered plan. The less obvious problem is consistency. Claude Code operates within a conversation context that degrades over a long session: early instructions get compressed, memory gets summarized, and the careful prompt you wrote at the start of a session may not be fully honored three hours later when you finally hit merge. You're also now maintaining a prompt, not just a workflow.&lt;/p&gt;

&lt;p&gt;For this test, I deliberately used the simplest possible prompt - &lt;em&gt;"based on the staged changes / recent commits, write me a PR title and description"&lt;/em&gt; - across all three methods. The goal was to measure what each approach produces from its own capabilities, not what it produces when coached. Prompt engineering can close some of the gap, but it can't change what data the model is actually working from.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Manual Fix: Give the Model the Full Context
&lt;/h2&gt;

&lt;p&gt;The underlying problem is simple: the model is summarizing a summary. The fix is to give it the actual data.&lt;/p&gt;

&lt;p&gt;Here's what that looks like manually. Before asking your LLM to write the PR, run the following yourself and include the output in your prompt:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full commit log with bodies:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log main..HEAD &lt;span class="nt"&gt;--pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;format:&lt;span class="s2"&gt;"%H%n%an%n%ad%n%s%n%b%n---"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual file diffs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git diff main...HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Or per-commit diffs if the full diff is too large:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log main..HEAD &lt;span class="nt"&gt;--patch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A word of warning on token volume: for a large feature branch, &lt;code&gt;git diff main...HEAD&lt;/code&gt; alone can run to tens of thousands of tokens - the fully structured context for this 27-commit, 32-file branch came out to around 106,000 tokens in my test, roughly 379KB. That's well beyond what you'd casually paste into a chat window, and it approaches or exceeds the context limits of many models.&lt;/p&gt;

&lt;p&gt;This is where you need to be selective. For smaller branches - a few commits, a handful of files - pasting the full diff directly works fine. For larger branches, you have options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feed it to a model with a large context window (Gemini Pro handles this comfortably)&lt;/li&gt;
&lt;li&gt;Trim to the files and commits most relevant to the PR's purpose (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Use the structured approach described in the next section&lt;/li&gt;
&lt;/ul&gt;
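&lt;p&gt;A rough sketch of the size check and the trimming step - the pathspecs and exclude patterns are illustrative, adjust them to your repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# quick size check (bytes) before deciding whether to trim
git diff main...HEAD | wc -c

# then limit the diff to the paths that matter and exclude generated noise
git diff main...HEAD -- 'src/' 'tests/' ':(exclude)*.lock' ':(exclude)dist/'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;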

&lt;p&gt;Either way, the quality difference is immediate. When I ran the same prompt - &lt;em&gt;"based on the staged changes / recent commits, write me a PR title and description"&lt;/em&gt; - with the full structured context versus Claude Code's summary approach, the outputs were not comparable. The full-context version knew about BOM-aware file encoding, NiceGUI deleted-slot errors, the decision to use &lt;code&gt;@computed_field&lt;/code&gt; to eliminate state drift. The summary version knew that &lt;code&gt;git_service.py&lt;/code&gt; was a new file.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Results Actually Showed
&lt;/h2&gt;

&lt;p&gt;I ran three versions of the same PR description for the same feature branch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 1 - Claude Code (Sonnet 4.6)&lt;/strong&gt;&lt;br&gt;
Working from &lt;code&gt;--oneline&lt;/code&gt; and &lt;code&gt;--stat&lt;/code&gt; only. Produced a competent, file-oriented description with a good "Key design decisions" section. Flat formatting, no inline code styling, read like a wall of text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 2 - Claude web chat (Sonnet 4.6) + full structured context&lt;/strong&gt;&lt;br&gt;
The same model, but fed the complete PR Brief XML (106k tokens of structured diffs, commit metadata, and file content). Layered by architectural section, included product context, named specific edge cases and why they were handled the way they were, referenced exact test counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 3 - Claude web chat (Haiku 4.5) + full structured context&lt;/strong&gt;&lt;br&gt;
The cheapest Claude model, same full context. Produced a description nearly as strong as Version 2, with better structured sections for testing guidance and explicit "Key Design Decisions."&lt;/p&gt;

&lt;p&gt;I asked Gemini 3 Pro to evaluate all three as a neutral third party, framed as a senior developer and product manager. The ranking: Version 2 first, Version 3 second, Version 1 third.&lt;/p&gt;

&lt;p&gt;The conclusion that stood out: &lt;strong&gt;Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context.&lt;/strong&gt; The model tier mattered less than the context quality.&lt;/p&gt;

&lt;p&gt;Gemini's summary of the gap was pointed: Version 1 "forces the reviewer to do the heavy lifting." Versions 2 and 3 "treat the PR description as living documentation."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google Gemini 3 Pro's comparison table:&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;PR Text&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Strength&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Weakness&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best Audience&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PR 1 (Claude Code)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exceptional, step-by-step test plan.&lt;/td&gt;
&lt;td&gt;Flat formatting makes it hard to scan. Lacks inline code styling.&lt;/td&gt;
&lt;td&gt;QA and Product Managers.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PR 2 (Sonnet + Mesh)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strongest structural hierarchy and includes "Key Design Decisions."&lt;/td&gt;
&lt;td&gt;Testing section is a status report, not an actionable guide.&lt;/td&gt;
&lt;td&gt;Senior Developers and Architects.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PR 3 (Haiku + Mesh)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent markdown formatting (inline backticks) and explicit bug-fix callouts.&lt;/td&gt;
&lt;td&gt;Lacks the deeper architectural "why" found in PR 2.&lt;/td&gt;
&lt;td&gt;General Code Reviewers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Version 1:&lt;/strong&gt; Claude Code with Sonnet 4.6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Title:
feat(#282): add Git Tools source type (Commit Brief + PR Brief)
&lt;span class="p"&gt;
---
&lt;/span&gt;
Description:

Summary
Introduces Git Tools as a new source type in the Stack, enabling users to generate token-optimized LLM context from local git repositories.

Two item types are supported:
&lt;span class="p"&gt;
-&lt;/span&gt; Commit Brief - scans a single commit and produces a structured XML summary of changed files and diff content
&lt;span class="p"&gt;-&lt;/span&gt; PR Brief - scans a branch against a base branch, producing a structured XML summary with per-commit metadata and a selectable file/commit
checklist

What's included
&lt;span class="p"&gt;-&lt;/span&gt; services/git_service.py - new service with full subprocess wrapper...
&lt;span class="p"&gt;-&lt;/span&gt; models/stack/git_tools.py - Pydantic models for CommitBriefStackItem...
&lt;span class="p"&gt;-&lt;/span&gt; core/state/sources/git_tools.py + core/state/stack/handlers/git_tools.py...
&lt;span class="p"&gt;-&lt;/span&gt; components/common/forms/git_tools/ - CommitBriefForm, PRBriefForm...
&lt;span class="p"&gt;-&lt;/span&gt; components/sections/stack/dialogs/edit_commit_brief.py + edit_pr_brief.py - edit dialogs with...
&lt;span class="p"&gt;-&lt;/span&gt; components/sections/source/panels/git_tools/ - source panel with auto-populated base branch select ...
&lt;span class="p"&gt;-&lt;/span&gt; Full pytest coverage in tests/services/test_git_service.py

Key design decisions
&lt;span class="p"&gt;-&lt;/span&gt; Base branch is auto-populated from the repo rather than free-text input...
&lt;span class="p"&gt;-&lt;/span&gt; Partial generation failures surface as a warning state...
&lt;span class="p"&gt;-&lt;/span&gt; Commit log is stored as structured...
&lt;span class="p"&gt;-&lt;/span&gt; subprocess calls are wrapped to suppress local...
&lt;span class="p"&gt;-&lt;/span&gt; Follows the existing async stack lifecycle...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Version 2:&lt;/strong&gt; Claude Sonnet 4.6 chat with Mesh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**feat(#282): Add Git Tools source type (Commit Brief + PR Brief)**&lt;/span&gt;
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;## Summary&lt;/span&gt;

Introduces Git Tools as a fourth source type in HiveTrail Mesh, alongside Notion, Local Files, and Context Blocks. Provides two workflows for assembling LLM context from a git repository:
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Commit Brief**&lt;/span&gt; - scans uncommitted changes (staged/unstaged/untracked), assembles file content + unified diffs against HEAD for commit message generation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**PR Brief**&lt;/span&gt; - scans branch-wide changes vs a base branch, includes file content, diffs, structured commit log, and uncommitted change warnings for PR description generation
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gu"&gt;## What's Changed&lt;/span&gt;

&lt;span class="gs"&gt;**Models &amp;amp; Architecture**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; New &lt;span class="sb"&gt;`GitChangedFile`&lt;/span&gt;, &lt;span class="sb"&gt;`GitCommitEntry`&lt;/span&gt;, ... models with Pydantic...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`PRBriefStackItem.commit_count`&lt;/span&gt; as a &lt;span class="sb"&gt;`@computed_field`&lt;/span&gt; to eliminate...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`warning`&lt;/span&gt; status added to &lt;span class="sb"&gt;`StackItemBaseModel`&lt;/span&gt;...
&lt;span class="p"&gt;-&lt;/span&gt; Git item types registered in &lt;span class="sb"&gt;`StackItemUnion`&lt;/span&gt;...

&lt;span class="gs"&gt;**Service Layer**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; New &lt;span class="sb"&gt;`git_service.py`&lt;/span&gt; - stateless subprocess service covering...
&lt;span class="p"&gt;-&lt;/span&gt; Hardened &lt;span class="sb"&gt;`_run_git`&lt;/span&gt; with... to prevent ANSI corruption and pager hangs
&lt;span class="p"&gt;-&lt;/span&gt; BOM-aware file reading (UTF-16 LE/BE) in...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`GitResult`&lt;/span&gt; dataclass for structured content...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`generate_git_content`&lt;/span&gt; produces structured... XML nodes with ... CDATA string

&lt;span class="gs"&gt;**State &amp;amp; Stack**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`GitToolsSource`&lt;/span&gt; bridge in... wrapping all git service calls
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`GitToolsHandler`&lt;/span&gt; integrated into the standard async stack lifecycle...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`StackManager.update_git_item`&lt;/span&gt;, ... extended for git types
&lt;span class="p"&gt;-&lt;/span&gt; Stack persistence (&lt;span class="sb"&gt;`StackPersistence`&lt;/span&gt;) handles ... in save/load
&lt;span class="p"&gt;-&lt;/span&gt; Full facade coverage on...

&lt;span class="gs"&gt;**UI - Source Panel**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`GitToolsPanel`&lt;/span&gt; with ..., following ... conventions
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`CommitBriefForm`&lt;/span&gt; and ... forms with path selector, scan button...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`FileChecklist`&lt;/span&gt; and ... reusable components with select/deselect all, .... with local timezone formatting
&lt;span class="p"&gt;-&lt;/span&gt; Base Branch field auto-populated from ... disables immediately on path change to prevent stale state
&lt;span class="p"&gt;-&lt;/span&gt; Uncommitted changes warning banner in...
&lt;span class="p"&gt;-&lt;/span&gt; Files/Commits sections hidden until...

&lt;span class="gs"&gt;**UI - Stack**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`GitToolsStackCard`&lt;/span&gt; with type-specific... confirmed Reload action
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`EditCommitBriefDialog`&lt;/span&gt; and ... with Settings/Content tabs, ... banner for partial failures
&lt;span class="p"&gt;-&lt;/span&gt; Insert action enabled for ...
&lt;span class="p"&gt;-&lt;/span&gt; Edit routing wired in ...

&lt;span class="gs"&gt;**Bug Fixes &amp;amp; Hardening**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`notify`&lt;/span&gt; called before ... to prevent NiceGUI ... errors
&lt;span class="p"&gt;-&lt;/span&gt; Stale scan ... on path change
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`commit_log`&lt;/span&gt; fallback ... prevent Pydantic ...
&lt;span class="p"&gt;-&lt;/span&gt; Output manager processing logs ...

&lt;span class="gs"&gt;**Tests**&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; 41 new tests in ... covering parse, pre-checks, scan, ... and PR XML generation (199 total passing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Version 3:&lt;/strong&gt; Claude Haiku 4.5 chat with Mesh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**feat(#282): complete Git Tools integration-forms, dialogs, stack cards, and content generation**&lt;/span&gt;
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gh"&gt;# PR Description&lt;/span&gt;

&lt;span class="gu"&gt;## Overview&lt;/span&gt;

This PR completes the Git Tools feature (Commit Brief and PR Brief) end-to-end, integrating UI components, edit dialogs, stack management, and async content generation. The feature is now wired into the source panel, stack lifecycle, and preset persistence, with full test coverage for the git_service module.

&lt;span class="gu"&gt;## What's Included&lt;/span&gt;

&lt;span class="gu"&gt;### Forms &amp;amp; Components&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**CommitBriefForm**&lt;/span&gt; ... forms with path selection, branch auto-population, file/commit checklists, and rescan workflows
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**FileChecklist**&lt;/span&gt; ...Reusable list components with ... dialogs
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**GitToolsStackCard**&lt;/span&gt;: Stack ... confirmation
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**EditCommitBriefDialog**&lt;/span&gt; ... dialogs with Settings/Content tabs, ... display

&lt;span class="gu"&gt;### State &amp;amp; Architecture&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**GitToolsSource**&lt;/span&gt; (&lt;span class="sb"&gt;`core/.../git_tools.py`&lt;/span&gt;): Domain bridge wrapping ... asyncio threading
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**GitToolsHandler**&lt;/span&gt; (&lt;span class="sb"&gt;`core/.../git_tools.py`&lt;/span&gt;): Pre-generated ... (no JIT reads)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**git_service.py**&lt;/span&gt;: Stateless subprocess service with ... merge logic
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Stack persistence**&lt;/span&gt; updated to serialize/deserialize ...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**StateManager facade**&lt;/span&gt; ... git-specific methods ...

&lt;span class="gu"&gt;### UI Integration&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Git Tools added to source panel...
&lt;span class="p"&gt;-&lt;/span&gt; Stack card factory routes ...
&lt;span class="p"&gt;-&lt;/span&gt; Stack component instantiates both ...
&lt;span class="p"&gt;-&lt;/span&gt; BaseStackCard updated to handle ... enabled Insert button
&lt;span class="p"&gt;-&lt;/span&gt; Output manager log messages ...

&lt;span class="gu"&gt;### Content Generation&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**generate_git_content()**&lt;/span&gt;: Async XML generator with BOM-aware file reading...
&lt;span class="p"&gt;-&lt;/span&gt; Returns &lt;span class="gs"&gt;**GitResult**&lt;/span&gt; dataclass with ...
&lt;span class="p"&gt;-&lt;/span&gt; Partial success mode: items transition to &lt;span class="sb"&gt;`warning`&lt;/span&gt; state ...

&lt;span class="gu"&gt;### Data Models&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**GitChangedFile**&lt;/span&gt;, ...
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**ScanResult**&lt;/span&gt;, ... dataclasses
&lt;span class="p"&gt;-&lt;/span&gt; FileStatus ... enums
&lt;span class="p"&gt;-&lt;/span&gt; Pydantic &lt;span class="sb"&gt;`@computed_field`&lt;/span&gt; for ...

&lt;span class="gu"&gt;### Testing&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; 41 new git_service tests covering:
&lt;span class="p"&gt;    -&lt;/span&gt; _parse_name_status with ...
&lt;span class="p"&gt;    -&lt;/span&gt; Pre-checks (is_repo, ...)
&lt;span class="p"&gt;    -&lt;/span&gt; scan_commit_changes with ... states
&lt;span class="p"&gt;    -&lt;/span&gt; scan_pr_changes with ... flag
&lt;span class="p"&gt;    -&lt;/span&gt; apply_default_checks (...)
&lt;span class="p"&gt;    -&lt;/span&gt; merge_scan_with_saved (preserve user ...)
&lt;span class="p"&gt;    -&lt;/span&gt; parse_commit_log and ...
&lt;span class="p"&gt;    -&lt;/span&gt; generate_git_content integration with ...

&lt;span class="gu"&gt;## Key Design Decisions&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; &lt;span class="gs"&gt;**Pre-generated content model**&lt;/span&gt;: Git items store content at ...
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="gs"&gt;**Graceful partial success**&lt;/span&gt;: When some files fail ... insert partial context.
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="gs"&gt;**Clean Architecture enforcement**&lt;/span&gt;: GitToolsSource wraps ... No direct service imports.
&lt;span class="p"&gt;4.&lt;/span&gt; &lt;span class="gs"&gt;**Concurrent diff fetching**&lt;/span&gt;: Diffs fetched in parallel with ... avoid resource exhaustion.
&lt;span class="p"&gt;5.&lt;/span&gt; &lt;span class="gs"&gt;**Merge logic for rescan**&lt;/span&gt;: When users rescan, ... across repository state changes.
&lt;span class="p"&gt;6.&lt;/span&gt; &lt;span class="gs"&gt;**BOM-aware encoding detection**&lt;/span&gt;: UTF-16 files (with BOM) and UTF-8 with BOM are decoded correctly; Windows cp1252 default avoided.

&lt;span class="gu"&gt;## Testing Guidance&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; All 199 tests pass (including 41 new ... tests)
&lt;span class="p"&gt;-&lt;/span&gt; git_service tests use ...
&lt;span class="p"&gt;-&lt;/span&gt; Integration tested via forms/dialogs in the app
&lt;span class="p"&gt;-&lt;/span&gt; Warning state rendering tested in ...
&lt;span class="p"&gt;
---
&lt;/span&gt;
&lt;span class="gs"&gt;**Closes #282**&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Easier Path: Structured Context Assembly
&lt;/h2&gt;

&lt;p&gt;Constructing that full context manually - running the right git commands, handling encoding issues, structuring the output so the model can navigate the 106k-token PR Brief without losing the thread - is non-trivial. For a one-off experiment it's fine. As a repeatable workflow before every PR, it's friction most developers won't sustain.&lt;/p&gt;
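
&lt;p&gt;To make that friction concrete: the quick reference section below lists the individual commands, but stitched into a repeatable step they look roughly like this. This is a minimal sketch, not Mesh's output format - it assumes &lt;code&gt;main&lt;/code&gt; as the base branch and simply concatenates the commit log and the full diff into one paste-ready file, with none of the encoding handling or per-file structure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
# Minimal sketch: commit log + full branch diff in one paste-ready file.
# Assumes 'main' is the base branch; no encoding handling, no file selection.
{
  echo "=== COMMITS ==="
  git log main..HEAD --pretty=format:"%H%n%s%n%b%n---"
  echo
  echo "=== FULL DIFF ==="
  git diff main...HEAD
} &gt; pr-context.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
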

&lt;p&gt;This is exactly the problem I built HiveTrail Mesh to solve. The PR Brief source type runs the git commands, structures the output as navigable XML with per-commit nodes, handles BOM-aware encoding, and lets you select which files and commits to include before the context gets assembled. The output goes to your clipboard, ready to paste into whichever LLM you want to use.&lt;/p&gt;

&lt;p&gt;If you want to try it, &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;Mesh is in limited beta and free during the beta period&lt;/a&gt;. But honestly, if you'd rather test the manual approach first on your next branch, the git commands in the quick reference below will get you there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztdtuwhcdq4i7n2ats9r.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztdtuwhcdq4i7n2ats9r.webp" alt="HiveTrail Mesh’s PR Brief interface automates the git context assembly process, structuring full diffs and commit logs into LLM-ready XML while letting you select exactly which files to include." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: The Git Commands That Actually Feed the Model
&lt;/h2&gt;

&lt;p&gt;If you want to try the full-context approach on your next branch before merging, these are the three commands worth knowing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full commit log with bodies (not just subject lines):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
git log main..HEAD &lt;span class="nt"&gt;--pretty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;format:&lt;span class="s2"&gt;"%H%n%s%n%b%n---"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete diff across the branch:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
git diff main...HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Per-commit diffs with context (useful for smaller branches):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
git log main..HEAD &lt;span class="nt"&gt;--patch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A practical note on scope: for a large feature branch, the full diff will be large - potentially 100k+ tokens. Before feeding it to a model, skim the file list and drop binaries, generated files, and lockfile changes. What remains is usually 20-40% smaller and significantly more useful to the model.&lt;/p&gt;
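
&lt;p&gt;Instead of hand-editing the diff after the fact, git's pathspec exclusions can drop the noisy paths at generation time, and a quick character count gives a rough token estimate (on the order of four characters per token for mixed code and prose). The exclusion patterns below are just examples - swap in whatever your project actually generates.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
# Branch diff minus lockfiles and generated output (example patterns only)
git diff main...HEAD -- . ':(exclude)package-lock.json' ':(exclude)*.lock' ':(exclude)dist/*'

# Rough token estimate: character count divided by ~4
git diff main...HEAD -- . ':(exclude)package-lock.json' ':(exclude)*.lock' ':(exclude)dist/*' | wc -c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same &lt;code&gt;:(exclude)&lt;/code&gt; pathspecs work with &lt;code&gt;git log --patch&lt;/code&gt;, so per-commit diffs can be filtered the same way.&lt;/p&gt;
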

&lt;p&gt;If you'd rather not run and filter these manually every time, this is the workflow &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;HiveTrail Mesh&lt;/a&gt; automates - structured XML output, file selection, token count before you export.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Broader Point
&lt;/h2&gt;

&lt;p&gt;Claude Code isn't doing something wrong. It's making a reasonable tradeoff - fast, low-cost, good enough for most cases. The &lt;code&gt;--oneline&lt;/code&gt; approach keeps the token cost down and the response time fast. For a quick commit message or a small fix, it's fine.&lt;/p&gt;

&lt;p&gt;But for complex feature branches where the PR description is going to be read by your team, reviewed by senior engineers, and live in your repository history for years - it's worth spending an extra 30 seconds to give the model the full picture.&lt;/p&gt;

&lt;p&gt;The quality of your AI output is constrained by the quality of the context you provide. For PR descriptions, the full diff is the context. Everything else is a summary of a summary.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/amitba"&gt;Amit&lt;/a&gt; builds HiveTrail Mesh, a context assembly tool for developers working with LLMs. &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;If you found this useful, join our beta.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>git</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
    <item>
      <title>We Ran the Same Experiment Twice. Different Feature, Different Models, Same Winner.</title>
      <dc:creator>Amit Ben-Ari</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:00:00 +0000</pubDate>
      <link>https://dev.to/amitba/we-ran-the-same-experiment-twice-different-feature-different-models-same-winner-93n</link>
      <guid>https://dev.to/amitba/we-ran-the-same-experiment-twice-different-feature-different-models-same-winner-93n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://hivetrail.com/blog/llm-context-assembly-pr-generation" rel="noopener noreferrer"&gt;hivetrail.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;How two independent PR generation benchmarks pointed to the same conclusion about context quality - and why your model choice matters less than you think.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Here's a finding that should change how you think about AI tooling: in two independent experiments using real production code, a "budget" model fed rich context consistently outperformed flagship models operating on shallow git summaries. The budget model didn't just win. It won by a landslide, unanimously, against models that cost significantly more per token.&lt;/p&gt;

&lt;p&gt;This isn't a post about which model is best. It's about why the question itself might be the wrong one to ask.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;HiveTrail Mesh is a context assembly tool. One of its core features is PR Brief - it scans a git branch against a base branch, reads every changed file in full, assembles all diffs and commit metadata into a structured XML document, and hands it to an LLM. The output is typically a 100K–380K token document containing everything an LLM needs to write a comprehensive PR description.&lt;/p&gt;

&lt;p&gt;We used this workflow as the basis for both experiments. The prompt in each case was deliberately simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Based on the staged changes / recent commits, write me a PR title and description.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No elaborate prompting. No chain-of-thought instructions. Just the raw context and a task.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment 1: The budget model vs. the flagship agent
&lt;/h2&gt;

&lt;p&gt;The first experiment ran on the Git Tools feature - a substantial new addition to HiveTrail Mesh covering 27 commits across 32 files, with async XML generation, state management, UI components, and 41 new tests.&lt;/p&gt;

&lt;p&gt;We ran three conditions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition A - Claude Code (Sonnet 4.6), native git context.&lt;/strong&gt; Claude Code ran &lt;code&gt;git log main..HEAD --oneline&lt;/code&gt; and &lt;code&gt;git diff main...HEAD --stat&lt;/code&gt; - the standard abbreviated approach. Generated in about 25 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition B - Haiku 4.5, Mesh context.&lt;/strong&gt; Mesh assembled a 380KB XML file (~106K tokens) covering every changed file, diff, and commit. Haiku 4.5 received this in full.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition C - Sonnet 4.6, Mesh context.&lt;/strong&gt; Same Mesh XML, same prompt, given to Sonnet 4.6.&lt;/p&gt;

&lt;p&gt;Gemini 3 Pro evaluated all three outputs, prompted to act as a senior software developer and product manager.&lt;/p&gt;

&lt;p&gt;The verdict was unambiguous. The Mesh-fed PRs were called "significantly stronger" across every dimension: product context, workflow clarity, architectural structure, technical depth, and testing visibility. The Claude Code version was characterised as reading like "a rough draft or a quick brain dump before hitting Create Pull Request."&lt;/p&gt;

&lt;p&gt;This wasn't a knock on Sonnet 4.6. It was a knock on what Sonnet 4.6 was given to work with.&lt;/p&gt;

&lt;p&gt;Claude Code - like most agentic coding tools - acts like a developer who skims the commit titles and says "looks good to me." It reads summaries: which files changed, roughly how many lines, what the commit subjects say. HiveTrail Mesh acts like the reviewer who actually pulls down the branch and reads every single file. The difference in output reflects that difference in reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context.&lt;/strong&gt; A cheaper, faster model given the complete picture wrote a better PR than a more capable model working from a summary.&lt;/p&gt;

&lt;p&gt;But here's the part that should really give you pause: Haiku 4.5 didn't just beat Sonnet 4.6's native shallow context - &lt;strong&gt;it beat Sonnet 4.6 when both were fed the exact same Mesh XML.&lt;/strong&gt; The budget model outperformed the flagship on a level playing field.&lt;/p&gt;

&lt;p&gt;Final ranking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Haiku 4.5 + Mesh&lt;/strong&gt; - best overall structure, key design decisions, quantified test coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet 4.6 + Mesh&lt;/strong&gt; - excellent markdown, clear bug-fix callouts, strong architecture section&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet 4.6 native (Claude Code)&lt;/strong&gt; - good test plan, but flat structure and shallow context throughout&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Experiment 2: Can Gemini CLI beat its own model family?
&lt;/h2&gt;

&lt;p&gt;Several months later, we ran a second experiment on a completely different feature - the GitHub API integration for HiveTrail Mesh, covering 24 files and 22 commits.&lt;/p&gt;

&lt;p&gt;The framing this time was sharper. &lt;strong&gt;The question wasn't "which model is best" - it was "can an agentic tool using native git context compete with the same model family when context is properly assembled?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini CLI was the subject under test. It has its own git tooling, can run shell commands, and is built by the same team behind the models it would be competing against. If any tool could close the context gap through smart native tool use, Gemini CLI was the candidate.&lt;/p&gt;

&lt;p&gt;We set it against seven Gemini models - ranging from Gemini 3 Fast to Gemini 3.1 Pro with high thinking - all fed via HiveTrail Mesh. We also added Haiku 4.5 via Mesh as an external reference point, since it had won Experiment 1.&lt;/p&gt;

&lt;p&gt;Three independent judges evaluated all nine PR texts blind, without knowing which model produced which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Gemini 3 Pro&lt;/li&gt;
&lt;li&gt;Anthropic Claude Opus 4.6&lt;/li&gt;
&lt;li&gt;OpenAI ChatGPT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scoring: each judge awarded 9 points for 1st place down to 1 point for 9th, so the maximum possible total for a single entry was 27 - a first-place vote from all three judges.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Gemini Pro&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;ChatGPT&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Haiku 4.5 + Mesh&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Gemini Flash 3 preview (Thinking Low) + Mesh&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Gemini 3 Fast + Mesh&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro preview (Thinking High) + Mesh&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tied 5&lt;/td&gt;
&lt;td&gt;ChatGPT + Mesh&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tied 5&lt;/td&gt;
&lt;td&gt;Gemini Flash 3 preview (Thinking High) + Mesh&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Flash Light preview (Thinking High) + Mesh&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Gemini 3 Pro + Mesh&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gemini CLI (native context)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two results stand out.&lt;/p&gt;

&lt;p&gt;First, Haiku 4.5 received a perfect score - 9 from every judge, unanimously, with a 4-point gap over second place. All three judges independently placed it first for the same reasons: dedicated test coverage sections, specific method names and API behaviors called out by name, explicit reasoning behind architectural decisions, and reviewer notes that no other entry included. Opus 4.6 called it "the most complete and production-grade PR description" of the nine.&lt;/p&gt;

&lt;p&gt;Second, and more telling: &lt;strong&gt;Gemini CLI finished last.&lt;/strong&gt; Not second to last - last, with 4 points, behind every Mesh-fed entry including smaller, cheaper Gemini variants. Its own model family, given better context by a different tool, beat it at every position in the table.&lt;/p&gt;

&lt;p&gt;The reason is the same as Experiment 1. Gemini CLI ran &lt;code&gt;git log -n 10 --stat&lt;/code&gt; and a few shell commands. Fast, low-cost, reasonable for most tasks - but it produced the same shallow picture. The resulting PR covered the surface of the changes without the architectural reasoning, edge case handling, or quantified test results that the Mesh-fed models could draw on because they had actually read the code.&lt;/p&gt;

&lt;p&gt;It's worth noting that the Mesh PR Brief isn't just raw file content dumped into a prompt. It's structured XML - commits organized chronologically, files grouped by change type, diffs nested within their commit context. That structure helps LLMs navigate 100K+ token documents more efficiently than a flat wall of text would. So "full context" here means both &lt;em&gt;more&lt;/em&gt; information and &lt;em&gt;better-organized&lt;/em&gt; information. Both matter.&lt;/p&gt;

&lt;p&gt;After the main competition, we ran Claude Code on the same feature - not as a competitor, but as a consistency check. Same pattern as Experiment 1: a short, surface-level PR based on abbreviated git output. The shallow-context behavior isn't specific to any one tool or vendor. It's structural - it's what happens when speed is optimized over depth of reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context quality sets the ceiling. Model choice determines where within that ceiling you land.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run both experiments side by side and the picture is hard to argue with.&lt;/p&gt;

&lt;p&gt;Experiment 1 tested context delivery method with the same model family. Mesh-assembled context won over native git context regardless of model tier - and the budget model beat the flagship even on a level playing field.&lt;/p&gt;

&lt;p&gt;Experiment 2 tested whether a sophisticated agentic tool could close that gap through smart native tool use. It couldn't - and it finished last against its own model family.&lt;/p&gt;

&lt;p&gt;Different features. Different PR Briefs. Different competitive sets. Different judges. The only constant was the relationship between context quality and output quality.&lt;/p&gt;

&lt;p&gt;When an AI tool reads a few lines of git log to write a PR, it isn't producing a poor result because it's a bad model. It's producing a poor result because it has been given a poor picture of what changed and why. Give any capable model the full picture - every file, every diff, every commit, structured and organized - and the output improves dramatically.&lt;/p&gt;

&lt;p&gt;The implication runs both ways. A "budget" model with rich context outperforms a flagship with shallow context. And a flagship with shallow context produces flagship-priced shallow output.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this means for your workflow
&lt;/h2&gt;

&lt;p&gt;If you're using AI tools for PR descriptions today, the most impactful change probably isn't switching models.&lt;/p&gt;

&lt;p&gt;Agentic coding tools are optimized for speed and low token cost - they read summaries, not full file content. That's the right tradeoff for interactive coding tasks, where you want fast feedback and low latency. For a PR covering 20+ files and weeks of work, summary-level context produces summary-level output.&lt;/p&gt;

&lt;p&gt;The alternative is deliberate context assembly before you prompt: read every changed file in full, preserve the diff structure, organize commits chronologically, package everything in a format the LLM can navigate. You could build a script to do this - pull every changed file, run the diffs, format it into structured XML. It's achievable engineering. It's also a few days of work to do properly, and more to maintain as your codebase evolves.&lt;/p&gt;
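
&lt;p&gt;For a sense of scale, a bare-bones version of that script fits in a page of shell. This is an illustrative sketch only - not Mesh's actual format; the base branch, element names, and CDATA wrapping are all assumptions - and it ignores exactly the things that make the real version work: encoding detection, binary files, renames, escaping, and token budgeting.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
#!/usr/bin/env bash
# Illustrative sketch only. Assumes 'main' as the base branch, UTF-8 text files,
# and file contents that never contain ']]&gt;'.
base=main
{
  echo '&lt;pr-context&gt;'
  echo '  &lt;commits&gt;'
  git log "$base"..HEAD --pretty=format:'    &lt;commit hash="%H" subject="%s"/&gt;'
  echo
  echo '  &lt;/commits&gt;'
  echo '  &lt;files&gt;'
  git diff --name-only "$base"...HEAD | while read -r f; do
    [ -f "$f" ] || continue                     # skip deleted paths
    printf '    &lt;file path="%s"&gt;&lt;![CDATA[\n' "$f"
    cat "$f"
    printf '\n]]&gt;&lt;/file&gt;\n'
  done
  echo '  &lt;/files&gt;'
  echo '&lt;/pr-context&gt;'
} &gt; pr-context.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The skeleton is the easy part. The maintenance cost shows up the moment you need per-commit diffs, selective file inclusion, or files that aren't plain UTF-8 - which is the gap between a weekend script and a tool you'd trust before every PR.&lt;/p&gt;
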

&lt;p&gt;That's exactly why we built HiveTrail Mesh's PR Brief. Point it at a branch and within seconds it has scanned every changed file, assembled the diffs, and produced a structured 100,000+ token XML document - faster than most agentic tools complete their own context gathering. The remaining time in the workflow is just the LLM responding, which varies by model (a few seconds for smaller models, up to ~30 seconds for the larger ones). The total end-to-end time is competitive with agentic coding tools - with dramatically better output to show for it. Use any LLM you prefer: Claude, Gemini, ChatGPT, whatever fits your workflow. The model choice, as these experiments suggest, matters less than you might expect.&lt;/p&gt;

&lt;p&gt;For teams where PRs serve as living documentation, get reviewed by multiple people, or feed downstream into release notes - the tradeoff is straightforward. For a solo developer pushing a two-file fix, probably not worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we didn't test
&lt;/h2&gt;

&lt;p&gt;In the spirit of intellectual honesty:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering.&lt;/strong&gt; Both experiments used a minimal prompt. A carefully crafted prompt might narrow the gap somewhat - though we'd expect the ceiling to remain lower without full file content. And as noted above, the PR Brief's structured XML is itself a form of context organization, so part of what we measured is better-organized input as well as more of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other writing tasks.&lt;/strong&gt; Both experiments focused on PR descriptions. Commit messages, technical documentation, and code review summaries likely follow the same pattern, but we haven't tested them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newer model releases.&lt;/strong&gt; These experiments used models current at the time of testing. Rankings will shift as new models release - though the underlying dynamic (context quality determines ceiling) should hold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost efficiency.&lt;/strong&gt; Haiku 4.5 is significantly cheaper per token than most of the models it beat. The cost-per-quality-point story is compelling but token pricing changes frequently enough that any number we published here would be stale quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;The most useful takeaway from two experiments isn't a model recommendation. It's a workflow question worth asking before you prompt: &lt;em&gt;what does the model actually see?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "a handful of commit subject lines and a diffstat," you've already constrained the output - regardless of which model is on the other end.&lt;/p&gt;

&lt;p&gt;The models are good enough. The context is usually the bottleneck.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;HiveTrail Mesh is a context assembly tool for developers and product teams. PR Brief assembles a token-optimized, structured XML document from your git branch - ready to paste into any LLM. &lt;a href="https://hivetrail.com/mesh" rel="noopener noreferrer"&gt;Try the beta →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
