Lewis

From 75% to 98% Precision: The Research Paper That Changed How a Startup Prompts AI

GPT 5.3 and Opus 4.6 dropped on the same day last month. The team at Every, the tech publication that’s become one of the go-to sources for hands-on AI coverage, ran a “vibe check”. They had their team compare the models head to head, and nobody could pick a clear winner.

Their CEO Dan Shipper uses them 50/50. Other team members each landed on a different mix. The consensus: each model has different strengths, and the best approach is to use both depending on the task.

But “depending on the task” is doing a lot of heavy lifting in that sentence. How do you figure out which model is best for which task? And once you’ve picked one, how do you write the prompt that unlocks its peak performance on that specific job? Throw in all the other models you may need to consider, and this becomes a wicked problem.

Left unsolved, this wickedness means you’re leaving serious performance on the table, whether you’re building products on top of AI or trying to get the most out of these models in your work.

The default approach for most teams is to roll with the vibes. Pick a model. Write prompts by hand. A/B test against a benchmark. Wait for results. Tweak. Repeat. Try to build intuition about what each model is good at. Maybe add some few-shot examples. It’s better than nothing, but you won’t achieve differentiated results.

The more sophisticated approach is automated prompt optimizers. These are systems where an LLM writes a prompt, scores it against a benchmark, reflects on the results, and tries to write a better one — looping until the score plateaus. The best use evolutionary approaches, maintaining multiple candidate prompts and breeding the best performers together. This is better than doing it by hand, but it hits its own ceiling — one that elite AI researchers recently pinned down to two constraints.

Brevity bias and context collapse

In their 2026 ICLR paper, the research team identified two core failure modes with prompt optimizers. The first is brevity bias: the optimizers tend to converge on short, generic prompts — because short prompts are safe. “Be careful with edge cases” never hurts on any particular test case, so it survives selection round after round. “When processing this specific type of input, check for Y” only helps 10% of the time, so it gets pruned. Over many rounds, the specific stuff dies and the generic stuff lives. You end up with prompts that are OK at everything and great at nothing.
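A toy simulation makes the selection dynamics concrete. The numbers here are invented for illustration: a generic bullet adds a small gain on every test case, a specific bullet adds a large gain on the 10% of cases it applies to. Their expected value is identical, but on small eval batches the specific bullet often contributes nothing at all, so greedy round-by-round selection tends to prune it:

```python
# Toy illustration (hypothetical numbers) of brevity bias in prompt selection.
import random

random.seed(0)

def batch_gain(applies_frac, gain, batch_size=5):
    """Observed score contribution of one bullet on one small eval batch."""
    hits = sum(random.random() < applies_frac for _ in range(batch_size))
    return hits * gain

generic = [batch_gain(1.0, 1) for _ in range(1000)]    # +1 on every case
specific = [batch_gain(0.1, 10) for _ in range(1000)]  # +10 on 10% of cases

# Same expected value (~5 per batch for both), but the specific bullet scores
# zero on most small batches (~0.9^5 of them), so it keeps getting pruned.
zero_batches = sum(g == 0 for g in specific) / len(specific)
print(sum(generic) / 1000, sum(specific) / 1000, zero_batches)
```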

The second failure mode is context collapse. When the optimizer asks an LLM to rewrite a large, detailed prompt, the LLM compresses it — sometimes catastrophically. The researchers showed an example where 18,000 tokens of accumulated knowledge collapsed to 122 tokens, and performance dropped below the baseline. The system literally forgot everything it had learned.

Until recently, that was the landscape. A/B testing was too slow. Automated optimization converged on mediocrity. Neither approach scaled with the speed of model progression.

I’ve been bumping up against this exact problem myself with an AI product I’m building. So when one of the startups I work with, Macroscope, published a detailed article outlining a new approach to prompt optimization they call “auto-tuning,” I leaned in extra hard.

The A.C.E. in the hole

Macroscope does AI code review. Their product needs to work across every programming language developers use, and each language has different idioms. A prompt that catches real bugs in Go flags noise in Python. Adding few-shot examples for one language broke another. New models shipped faster than they could build full intuition about the old ones.

Then at the tail-end of last year, Macroscope found the aforementioned ICLR paper. In addition to diagnosing their exact problem, it proposed a solution. The researchers called it Agentic Context Engineering, or ACE.

Instead of trying to find one perfect prompt through iterative rewriting, ACE builds a playbook. Three LLM roles work together: a Generator that attempts the task, a Reflector that diagnoses what went wrong, and a Curator that adds a specific bullet point to a master playbook based on what it learned.

The key constraint is that the Curator can only add or update individual bullets. It never rewrites the whole prompt. This prevents the context from collapsing into a generic summary — the failure mode that plagues traditional prompt optimizers.

The result is a prompt that accumulates hundreds of specific, detailed entries over time. Not “be careful with date formatting,” but “when processing Venmo transactions, use datetime range comparisons, not string matching.” The model reads the full playbook at inference time and naturally pays attention to whichever entries are relevant for the current task.
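The three-role loop can be sketched in a few lines. The `llm_generate` and `llm_reflect` functions below are placeholders for real model calls, and the dict-of-numbered-bullets playbook is a simplification of the paper’s structured context, but the Curator’s add-only constraint is the point:

```python
# Minimal sketch of the ACE loop: Generator attempts a task, Reflector
# diagnoses the failure, Curator appends ONE playbook bullet. The llm_*
# functions are placeholders for real model calls.

def llm_generate(task, playbook):          # placeholder: model attempts the task
    return f"attempt at {task!r} using {len(playbook)} playbook bullets"

def llm_reflect(task, attempt, expected):  # placeholder: model diagnoses the miss
    return f"lesson learned from {task!r}"

def ace_step(task, expected, playbook):
    """One Generator -> Reflector -> Curator iteration."""
    attempt = llm_generate(task, playbook)
    if attempt == expected:
        return playbook                    # nothing to learn from a success
    lesson = llm_reflect(task, attempt, expected)
    # Curator constraint: add or update a single bullet; never rewrite
    # the whole playbook, so accumulated detail can't collapse.
    bullet_id = max(playbook, default=0) + 1
    playbook[bullet_id] = lesson
    return playbook

playbook = {}
for task, expected in [("parse dates", "ok"), ("match venmo txns", "ok")]:
    playbook = ace_step(task, expected, playbook)

print(playbook)  # one specific bullet accumulated per failure
```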

ACE showed roughly 10% improvement on agent benchmarks, matching top-ranked production agents while using a smaller open-source model. 10% is cute, but Macroscope pushed ACE further.

Prompt x Model x Language

ACE optimizes the playbook for a fixed model — you pick one model and the system improves the prompt for that model. Macroscope asked a different question: what if we run this process across every model simultaneously? Same task, same benchmark, but now the system is building and testing playbooks for GPT, Gemini, Opus, and others in parallel — discovering not just the best prompt, but the best model-prompt combination.

It’s closer to having a dedicated prompt engineer iterate on prompts for every model at once, except auto-tune can test ideas in parallel and doesn’t get tired.
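Under that framing, the outer search might look something like the sketch below. The model names, scores, and both inner functions are illustrative stand-ins for the real ACE loop and eval harness, not Macroscope’s implementation:

```python
# Hypothetical sketch of the cross-model search: tune a playbook per model in
# parallel, then keep the best (model, playbook) pair for the subtask.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["gpt", "gemini", "opus"]        # illustrative model names

def tune_playbook(model, subtask):        # placeholder: run the ACE loop for this model
    return f"playbook tuned for {model} on {subtask}"

def benchmark(model, playbook, subtask):  # placeholder: score on a held-out benchmark
    return {"gpt": 0.91, "gemini": 0.88, "opus": 0.95}[model]

def best_combo(subtask):
    with ThreadPoolExecutor() as pool:
        playbooks = dict(zip(MODELS, pool.map(lambda m: tune_playbook(m, subtask), MODELS)))
    # Discover not just the best prompt, but the best model-prompt combination.
    return max(((m, playbooks[m], benchmark(m, playbooks[m], subtask)) for m in MODELS),
               key=lambda t: t[2])

model, playbook, score = best_combo("detect-bugs")
print(model, score)
```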

And when they did this, they discovered something unexpected.

Finding subtask-model fit

The system found that models have stable behavioral signatures — personality traits, essentially — that they can’t turn off. And it learned to exploit them.

GPT-5.2 hedges. When GPT is uncertain, it says things like “this could potentially cause an issue” instead of committing. The hedging leaks through even with explicit instructions to be decisive. But auto-tune discovered that this hedging correlates strongly with false positives. The model is expressing genuine uncertainty, and that uncertainty is a useful signal. Modal language like “could,” “potentially,” and “may” became a rejection filter.

Gemini 3 rambles. Gemini sometimes thinks out loud mid-response. “Wait, let me re-read that.” “However, on second thought…” When it does this, it’s usually about to get the answer wrong. The self-correction is also a tell. Auto-tune learned to catch it and those phrases became rejection signals. Not every model does this though. Opus, for example, doesn’t ramble, so it doesn’t need this filter.
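As a rough sketch, these signatures reduce to per-model rejection patterns. The phrase lists and model keys below are illustrative, not Macroscope’s actual configuration:

```python
# Sketch of per-model rejection filters built from behavioral signatures:
# hedging language for GPT, mid-response self-correction for Gemini, and
# no ramble filter at all for Opus. Patterns are illustrative.
import re

REJECTION_PATTERNS = {
    "gpt":    [r"\bcould\b", r"\bpotentially\b", r"\bmay\b"],   # hedging -> likely false positive
    "gemini": [r"wait, let me re-read", r"on second thought"],  # self-correction -> likely wrong
    "opus":   [],                                               # doesn't ramble; no filter needed
}

def passes_filter(model, finding):
    """Reject a finding that shows the model's known uncertainty tells."""
    text = finding.lower()
    return not any(re.search(p, text) for p in REJECTION_PATTERNS[model])

print(passes_filter("gpt", "This could potentially cause an issue"))  # False
print(passes_filter("opus", "This causes a race condition"))          # True
```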

Once you understand each model’s natural tendencies, you can start assigning them to the tasks they’re suited for — even pairing them in concert on the same task to achieve results neither could produce alone.

One of auto-tune’s most useful findings was to pair a permissive model for detection with a strict model for validation. Tell the detection model to flag everything, false positives acceptable. Then use a different model to ruthlessly filter out anything that involves hedging, speculation, or claims that can’t be proven from the code. One optimizes for recall, the other for precision.
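A minimal sketch of that pairing, with both model calls stubbed out and the findings and hedge list invented for illustration:

```python
# Two-stage pipeline: a permissive detector optimizes for recall, a strict
# validator optimizes for precision. Both stages are placeholder functions.

def permissive_detect(diff):   # placeholder: "flag everything" model
    return [
        "Definite off-by-one in loop bound",
        "This may potentially leak the file handle",
    ]

def strict_validate(finding):  # placeholder: strict model rejects hedged claims
    hedges = ("may", "might", "could", "potentially")
    return not any(h in finding.lower().split() for h in hedges)

def review(diff):
    candidates = permissive_detect(diff)                   # recall stage
    return [f for f in candidates if strict_validate(f)]   # precision stage

print(review("...diff..."))  # only the unhedged finding survives
```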

The differences unearthed by auto-tune are not subtle. Given the same “flag everything” directive, Opus flags 199 potential issues. GPT flags 3,923. Same task, 20x different output.

The team said: “We probably wouldn’t have tried pairing different models for different subtasks without auto-tune — it seemed unnecessarily complex.”

Near perfect precision

Remember, ACE achieved roughly 10% improvement on agent benchmarks. Macroscope’s results were more dramatic.

Overall precision jumped from 75% to 98% — meaning nearly every comment the system leaves is now correct. It catches 3.5x more high-severity bugs while leaving 22% fewer comments overall. Nitpicks dropped 64% in Python and 80% in TypeScript.

Since launching v3, developer thumbs-up reactions increased 30%, comments per PR dropped 37%, and developers are resolving 10% more of the issues flagged.

To achieve these results, Macroscope also layered in a few additional engineering enhancements that they detail in their post, for instance:

  • Severity-weighted scoring — a critical bug scores 125x higher than a low-severity one.
  • Learning rate controls — at low rates, the system tweaks wording. At high rates, it rewrites entire sections.
  • Anti-overfitting guidance — the system is instructed to identify underlying patterns across a batch of results, not make changes to address a single specific failure.
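The first of those, severity-weighted scoring, is simple to sketch. Only the 125x critical-to-low ratio comes from Macroscope’s post; the intermediate weights and the findings format are assumptions:

```python
# Sketch of severity-weighted scoring: reward catching severe bugs far more
# than accumulating nitpicks. Only the 125x ratio is from the source; the
# intermediate weights are assumed for illustration.

SEVERITY_WEIGHT = {"critical": 125, "high": 25, "medium": 5, "low": 1}

def score(findings):
    """Sum the weights of the findings that turned out to be correct."""
    return sum(SEVERITY_WEIGHT[f["severity"]] for f in findings if f["correct"])

findings = [
    {"severity": "critical", "correct": True},
    {"severity": "low", "correct": True},
    {"severity": "low", "correct": False},  # incorrect comments score nothing
]
print(score(findings))  # 126
```

Under a scheme like this, one caught critical bug outweighs a hundred caught nitpicks, which pushes the optimizer away from comment-volume and toward severity.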

Beyond code review

Here’s why I think this matters for more than just one code review startup.

Every AI product faces the same underlying problem. Too many models. They change too fast. Different prompts work differently across tasks. And the behavioral signatures auto-tune discovered — the hedging, the rambling, the calibration differences — aren’t specific to code review. They’re properties of how these models reason. A model that hedges when reviewing code hedges when analyzing a legal contract. A model that rambles before getting code wrong rambles before getting a medical assessment wrong.

Anywhere you have judgment calls — where models can disagree, and the pattern of their disagreement carries information — this approach applies. Legal review. Medical triage. Content moderation. Financial risk assessment. The principle is the same: the right architecture routes each subtask to the model whose natural calibration fits it best, and crafts the prompt that maximizes that fit.

The model-agnostic advantage

There’s one final structural angle here that I think is underappreciated.

The labs — OpenAI, Google, Anthropic — are locked into their own models. OpenAI is never going to tell you to use Gemini for detection and Opus for validation. They’re incentivized to make their suite of models work for everything. That’s a reasonable strategy for them, but it means they’ll never find the cross-model combinations that auto-tune surfaces.

Companies that aren’t locked into one family of models have an inherent advantage: they can actually search the full space. Every model, every prompt, every combination. Auto-tuning allows you to tap into that advantage — and every time a new model drops, the system can re-run and find new optimal combinations automatically.

Macroscope’s full technical deep dive covers a lot more than I could here — including the specific ML techniques they borrowed, their benchmarking methodology, and the limitations of the approach. If this topic interests you, I’d recommend reading it in full.

References

Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kamanuru, V., Rainton, J., Wu, C., Ji, M., Li, H., Thakker, U., Zou, J., & Olukotun, K. (2025). Agentic Context Engineering (ACE). ICLR 2026. https://arxiv.org/abs/2510.04618

Macroscope. (2026). We (Basically) Stopped Writing Prompts. https://macroscope.com/blog/we-stopped-writing-prompts

Every. (2026). GPT 5.3 Codex vs. Opus 4.6: The Great Convergence. https://every.to/vibe-check/codex-vs-opus
