Originally published at hivetrail.com
How two independent PR generation benchmarks pointed to the same conclusion about context quality - and why your model choice matters less than you think.
Here's a finding that should change how you think about AI tooling: in two independent experiments using real production code, a "budget" model fed rich context consistently outperformed flagship models operating on shallow git summaries. The budget model didn't just win. It won by a landslide, unanimously, against models that cost significantly more per token.
This isn't a post about which model is best. It's about why the question itself might be the wrong one to ask.
The setup
HiveTrail Mesh is a context assembly tool. One of its core features is PR Brief - it scans a git branch against a base branch, reads every changed file in full, assembles all diffs and commit metadata into a structured XML document, and hands it to an LLM. The output is typically a 100K–380K token document containing everything an LLM needs to write a comprehensive PR description.
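The assembly idea can be pictured with a small sketch. This is hypothetical code, not the actual PR Brief implementation - the names (`ChangedFile`, `assemble_brief`) are made up for illustration - but it shows the shape of the output: full file contents, per-file diffs, and commit metadata wrapped in one navigable XML document.

```python
# Hypothetical sketch of the assembly idea, NOT the real PR Brief code:
# full file contents, diffs, and commit metadata in one structured document.
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr

@dataclass
class ChangedFile:
    path: str
    content: str  # the full file, not just the changed hunks
    diff: str

def assemble_brief(commits: list[str], files: list[ChangedFile]) -> str:
    parts = ["<pr-context>", "  <commits>"]
    parts += [f"    <commit>{escape(c)}</commit>" for c in commits]
    parts.append("  </commits>")
    for f in files:
        parts.append(f"  <file path={quoteattr(f.path)}>")
        parts.append(f"    <content>{escape(f.content)}</content>")
        parts.append(f"    <diff>{escape(f.diff)}</diff>")
        parts.append("  </file>")
    parts.append("</pr-context>")
    return "\n".join(parts)
```

The key design point is in the `content` field: the document carries every changed file in full, which is where the 100K+ token sizes come from.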
We used this workflow as the basis for both experiments. The prompt in each case was deliberately simple:
Based on the staged changes / recent commits, write me a PR title and description.
No elaborate prompting. No chain-of-thought instructions. Just the raw context and a task.
Experiment 1: The budget model vs. the flagship agent
The first experiment ran on the Git Tools feature - a substantial new addition to HiveTrail Mesh covering 27 commits across 32 files, with async XML generation, state management, UI components, and 41 new tests.
We ran three conditions:
Condition A - Claude Code (Sonnet 4.6), native git context. Claude Code ran git log main..HEAD --oneline and git diff main...HEAD --stat - the standard abbreviated approach. Generated in about 25 seconds.
Condition B - Haiku 4.5, Mesh context. Mesh assembled a 380KB XML file (~106K tokens) covering every changed file, diff, and commit. Haiku 4.5 received this in full.
Condition C - Sonnet 4.6, Mesh context. Same Mesh XML, same prompt, given to Sonnet 4.6.
Gemini 3 Pro evaluated all three as a senior software developer and product manager.
The verdict was unambiguous. The Mesh-fed PRs were called "significantly stronger" across every dimension: product context, workflow clarity, architectural structure, technical depth, and testing visibility. The Claude Code version was characterized as reading like "a rough draft or a quick brain dump before hitting Create Pull Request."
This wasn't a knock on Sonnet 4.6. It was a knock on what Sonnet 4.6 was given to work with.
Claude Code - like most agentic coding tools - acts like a developer who skims the commit titles and says "looks good to me." It reads summaries: which files changed, roughly how many lines, what the commit subjects say. HiveTrail Mesh acts like the reviewer who actually pulls down the branch and reads every single file. The difference in output reflects that difference in reading.
Haiku 4.5 with full context outperformed Sonnet 4.6 with shallow context. A cheaper, faster model given the complete picture wrote a better PR than a more capable model working from a summary.
But here's the part that should really give you pause: Haiku 4.5 didn't just beat Sonnet 4.6's native shallow context - it beat Sonnet 4.6 when both were fed the exact same Mesh XML. The budget model outperformed the flagship on a level playing field.
Final ranking:
- Haiku 4.5 + Mesh - best overall structure, key design decisions, quantified test coverage
- Sonnet 4.6 + Mesh - excellent markdown, clear bug-fix callouts, strong architecture section
- Sonnet 4.6 native (Claude Code) - good test plan, but flat structure and shallow context throughout
Experiment 2: Can Gemini CLI beat its own model family?
Several months later, we ran a second experiment on a completely different feature - the GitHub API integration for HiveTrail Mesh, covering 24 files and 22 commits.
The framing this time was sharper. The question wasn't "which model is best" - it was "can an agentic tool using native git context compete with the same model family when context is properly assembled?"
Gemini CLI was the subject under test. It has its own git tooling, can run shell commands, and is built by the same team behind the models it would be competing against. If any tool could close the context gap through smart native tool use, Gemini CLI was the candidate.
We set it against six Gemini models - ranging from Gemini 3 Fast to Gemini 3.1 Pro with high thinking - all fed via HiveTrail Mesh. We also added ChatGPT and Haiku 4.5 via Mesh as external reference points, since Haiku 4.5 had won Experiment 1.
Three independent judges evaluated all nine PR texts blind, without knowing which model produced which:
- Google Gemini 3 Pro
- Anthropic Claude Opus 4.6
- OpenAI ChatGPT
Scoring: with nine entries, each judge awards 9 points for 1st place down to 1 point for 9th. Maximum possible across three judges: 27.
| Rank | Model | Gemini Pro | Opus 4.6 | ChatGPT | Total |
|---|---|---|---|---|---|
| 1 | Haiku 4.5 + Mesh | 9 | 9 | 9 | 27 |
| 2 | Gemini Flash 3 preview (Thinking Low) + Mesh | 8 | 7 | 8 | 23 |
| 3 | Gemini 3 Fast + Mesh | 7 | 6 | 4 | 17 |
| 4 | Gemini 3.1 Pro preview (Thinking High) + Mesh | 2 | 8 | 6 | 16 |
| Tied 5 | ChatGPT + Mesh | 6 | 1 | 7 | 14 |
| Tied 5 | Gemini Flash 3 preview (Thinking High) + Mesh | 5 | 4 | 5 | 14 |
| 7 | Gemini 3.1 Flash Light preview (Thinking High) + Mesh | 3 | 5 | 3 | 11 |
| 8 | Gemini 3 Pro + Mesh | 4 | 3 | 2 | 9 |
| 9 | Gemini CLI (native context) | 1 | 2 | 1 | 4 |
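The totals in the table follow a simple Borda-style tally, which can be sketched in a few lines (the `tally` helper is illustrative, not part of the evaluation tooling; ranks taken from the table above):

```python
# Borda-style tally behind the table: with nine entries, a rank of r from a
# judge is worth 10 - r points, so 1st place earns 9 and 9th place earns 1.
def tally(rankings: dict[str, list[int]]) -> dict[str, int]:
    """rankings maps an entry name to its rank (1 = best) from each judge."""
    return {name: sum(10 - r for r in ranks) for name, ranks in rankings.items()}

scores = tally({
    "Haiku 4.5 + Mesh": [1, 1, 1],     # ranked first by all three judges
    "Gemini CLI (native)": [9, 8, 9],  # last with two judges, 8th with one
})
# Haiku 4.5: 9 + 9 + 9 = 27 (the maximum); Gemini CLI: 1 + 2 + 1 = 4
```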
Two results stand out.
First, Haiku 4.5 received a perfect score - 9 from every judge, unanimously, with a 4-point gap over second place. All three judges independently placed it first for the same reasons: dedicated test coverage sections, specific method names and API behaviors called out by name, explicit reasoning behind architectural decisions, and reviewer notes that no other entry included. Opus 4.6 called it "the most complete and production-grade PR description" of the nine.
Second, and more telling: Gemini CLI finished last. Not second to last - last, with 4 points, behind every Mesh-fed entry including smaller, cheaper Gemini variants. Every Mesh-fed model from its own family, given better context by a different tool, finished ahead of it.
The reason is the same as Experiment 1. Gemini CLI ran git log -n 10 --stat and a few shell commands. Fast, low-cost, reasonable for most tasks - but it produced the same shallow picture. The resulting PR covered the surface of the changes without the architectural reasoning, edge case handling, or quantified test results that the Mesh-fed models could draw on because they had actually read the code.
It's worth noting that the Mesh PR Brief isn't just raw file content dumped into a prompt. It's structured XML - commits organized chronologically, files grouped by change type, diffs nested within their commit context. That structure helps LLMs navigate 100K+ token documents more efficiently than a flat wall of text would. So "full context" here means both more information and better-organized information. Both matter.
After the main competition, we ran Claude Code on the same feature - not as a competitor, but as a consistency check. Same pattern as Experiment 1: a short, surface-level PR based on abbreviated git output. The shallow-context behavior isn't specific to any one tool or vendor; it's structural, the result of optimizing for speed over depth of reading.
The pattern
Context quality sets the ceiling. Model choice determines where within that ceiling you land.
Run both experiments side by side and the picture is hard to argue with.
Experiment 1 tested context delivery method with the same model family. Mesh-assembled context won over native git context regardless of model tier - and the budget model beat the flagship even on a level playing field.
Experiment 2 tested whether a sophisticated agentic tool could close that gap through smart native tool use. It couldn't - and it finished last against its own model family.
Different features. Different PR Briefs. Different competitive sets. Different judges. The only constant was the relationship between context quality and output quality.
When an AI tool reads a few lines of git log to write a PR, it isn't producing a poor result because it's a bad model. It's producing a poor result because it has been given a poor picture of what changed and why. Give any capable model the full picture - every file, every diff, every commit, structured and organized - and the output improves dramatically.
The implication runs both ways. A "budget" model with rich context outperforms a flagship with shallow context. And a flagship with shallow context produces flagship-priced shallow output.
What this means for your workflow
If you're using AI tools for PR descriptions today, the most impactful change probably isn't switching models.
Agentic coding tools are optimized for speed and low token cost - they read summaries, not full file content. That's the right tradeoff for interactive coding tasks, where you want fast feedback and low latency. For a PR covering 20+ files and weeks of work, summary-level context produces summary-level output.
The alternative is deliberate context assembly before you prompt: read every changed file in full, preserve the diff structure, organize commits chronologically, package everything in a format the LLM can navigate. You could build a script to do this - pull every changed file, run the diffs, format it into structured XML. It's achievable engineering. It's also a few days of work to do properly, and more to maintain as your codebase evolves.
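A do-it-yourself version of that script might start out like the sketch below. It assumes the git CLI is on PATH and uses only standard git commands; real tooling would also need to handle deleted files, renames, binary content, and token budgeting - which is where the "few days of work" goes.

```python
# Rough DIY sketch of context assembly, assuming the git CLI is available.
# Real tooling must also handle deleted files, binary content, and renames.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

def build_context(base: str = "main") -> str:
    # Commits on this branch, oldest first, then every changed file in full.
    commits = git("log", "--reverse", "--format=%h %s", f"{base}..HEAD")
    changed = git("diff", "--name-only", f"{base}...HEAD").split()
    chunks = [f"<commits>\n{commits}</commits>"]
    for path in changed:
        content = git("show", f"HEAD:{path}")             # full file content
        diff = git("diff", f"{base}...HEAD", "--", path)  # per-file diff
        chunks.append(
            f'<file path="{path}">\n{content}<diff>\n{diff}</diff>\n</file>'
        )
    return "\n".join(chunks)
```

Note the three-dot `base...HEAD` range for diffs (changes since the merge base) versus the two-dot `base..HEAD` range for commits (commits on the branch only) - mixing these up is a common source of noisy context.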
That's exactly why we built HiveTrail Mesh's PR Brief. Point it at a branch and within seconds it has scanned every changed file, assembled the diffs, and produced a structured 100,000+ token XML document - faster than most agentic tools complete their own context gathering. The remaining time in the workflow is just the LLM responding, which varies by model (a few seconds for smaller models, up to ~30 seconds for the larger ones). The total end-to-end time is competitive with agentic coding tools - with dramatically better output to show for it. Use any LLM you prefer: Claude, Gemini, ChatGPT, whatever fits your workflow. The model choice, as these experiments suggest, matters less than you might expect.
For teams where PRs serve as living documentation, get reviewed by multiple people, or feed downstream into release notes, the tradeoff clearly favors deliberate context assembly. For a solo developer pushing a two-file fix, it's probably not worth the setup.
What we didn't test
In the spirit of intellectual honesty:
Prompt engineering. Both experiments used a minimal prompt. A carefully crafted prompt might narrow the gap somewhat - though we'd expect the ceiling to remain lower without full file content, and the structured XML format itself already does some of the work a prompt would otherwise have to.
Other writing tasks. Both experiments focused on PR descriptions. Commit messages, technical documentation, and code review summaries likely follow the same pattern, but we haven't tested them.
Newer model releases. These experiments used models current at the time of testing. Rankings will shift as new models release - though the underlying dynamic (context quality determines ceiling) should hold.
Cost efficiency. Haiku 4.5 is significantly cheaper per token than most of the models it beat. The cost-per-quality-point story is compelling but token pricing changes frequently enough that any number we published here would be stale quickly.
Closing thought
The most useful takeaway from two experiments isn't a model recommendation. It's a workflow question worth asking before you prompt: what does the model actually see?
If the answer is "a handful of commit subject lines and a diffstat," you've already constrained the output - regardless of which model is on the other end.
The models are good enough. The context is usually the bottleneck.
HiveTrail Mesh is a context assembly tool for developers and product teams. PR Brief assembles a token-optimized, structured XML document from your git branch - ready to paste into any LLM. Try the beta →