<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lewis</title>
    <description>The latest articles on DEV Community by Lewis (@lewiska).</description>
    <link>https://dev.to/lewiska</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819009%2Fa55c8ee0-054a-493a-a7c3-08594310c9bb.png</url>
      <title>DEV Community: Lewis</title>
      <link>https://dev.to/lewiska</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lewiska"/>
    <language>en</language>
    <item>
      <title>AI code lasts longer than human code, new study finds (but not for the reason you’d hope).</title>
      <dc:creator>Lewis</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:09:50 +0000</pubDate>
      <link>https://dev.to/lewiska/ai-code-lasts-longer-than-human-code-new-study-finds-but-not-for-the-reason-youd-hope-3a2i</link>
      <guid>https://dev.to/lewiska/ai-code-lasts-longer-than-human-code-new-study-finds-but-not-for-the-reason-youd-hope-3a2i</guid>
      <description>&lt;p&gt;&lt;strong&gt;AI-generated code survives 16% longer in production than human code before anyone touches it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the headline finding from a &lt;a href="https://arxiv.org/abs/2601.16809" rel="noopener noreferrer"&gt;new study&lt;/a&gt; out of Concordia University's DAS Lab, led by Emad Shihab and accepted at EASE 2026.&lt;/p&gt;

&lt;p&gt;The researchers tracked over 200,000 individual code units across 201 open-source projects using survival analysis, a method borrowed from medical research, to answer a simple question: how long does AI-generated code last in production?&lt;/p&gt;
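&lt;p&gt;For intuition, survival analysis treats each code unit like a patient in a clinical trial: the "event" is the first modification, and units still untouched when the study window closes are censored rather than discarded. Here's a minimal Kaplan-Meier sketch on toy data (pure Python, purely illustrative; the study's actual pipeline is far richer):&lt;/p&gt;

```python
# Kaplan-Meier survival estimate for code-change "events" (illustrative
# sketch only, not the study's actual method).
# Each sample: (weeks until the code unit was modified, observed_flag).
# observed_flag is False when the unit was never modified before the
# study window closed (right-censored).

def kaplan_meier(samples):
    """Return [(time, survival_probability)] at each observed event time."""
    event_times = sorted({t for t, observed in samples if observed})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d, _ in samples if d >= t)       # still unmodified at t
        events = sum(1 for d, o in samples if d == t and o)  # modified exactly at t
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Toy data: durations in weeks; False = still untouched at study end.
samples = [(2, True), (5, True), (5, False), (8, True), (12, False)]
for t, s in kaplan_meier(samples):
    print(f"week {t}: {s:.2f} of code still unmodified")
```

The censored entries matter: they keep "never touched" code in the denominator instead of silently dropping it, which is exactly why the method suits this question.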

&lt;p&gt;The answer: agent-authored code has a 15.4% lower modification rate than human code. At any given moment, it faces roughly 16% less risk of being changed.&lt;/p&gt;

&lt;p&gt;Sounds like a win for AI coding agents, right? &lt;/p&gt;

&lt;p&gt;Not necessarily.&lt;/p&gt;

&lt;p&gt;The researchers wanted to understand &lt;em&gt;why&lt;/em&gt; AI code was being left untouched for longer. Is it because it's higher quality, or something else?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data suggests something else.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When they examined what happens when AI code &lt;em&gt;is&lt;/em&gt; finally modified, they found that 26.3% of modifications to AI code are bug fixes, compared to 23% for human code. When someone finally touches AI-generated code, it's more likely to be because something was broken.&lt;/p&gt;

&lt;p&gt;So AI code appears to contain more latent bugs, yet those bugs sit unaddressed for longer. Why?&lt;/p&gt;

&lt;p&gt;The researchers point to a well-documented phenomenon in software engineering as one possible reason: the "Don't touch my code!" effect. Developers avoid modifying code they didn't write. AI-generated code has no human author, so nobody feels responsible for maintaining it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nuance in the per-tool data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The study tracked five AI coding tools, and results varied considerably between them.&lt;/p&gt;

&lt;p&gt;Cursor, one of the most sophisticated tools tested, had the lowest corrective modification rate of any tool at just 13.8%. When someone touches Cursor-assisted code, it's rarely to fix a bug.&lt;/p&gt;

&lt;p&gt;Yet Claude Code, also a powerful offering, had a corrective rate of 44.4%, nearly double the human baseline.&lt;/p&gt;

&lt;p&gt;One possible explanation: Cursor tends to keep the code visible in the interface as you work, while Claude Code's interface abstracts the code further from the developer's view.&lt;/p&gt;

&lt;p&gt;It's a sensible theory: how much a developer sees, understands, and engages with the code during generation may matter as much as the quality of the tool itself.&lt;/p&gt;

&lt;p&gt;But a stronger clue for why AI-generated code survives longer comes from a separate study entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The review burden&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.ft.com/content/7cab4ec7-4712-4137-b602-119a44f771de" rel="noopener noreferrer"&gt;Amazon recently summoned&lt;/a&gt; a large group of engineers for a "deep dive" into a spate of outages, including incidents tied to AI coding tools. A briefing note cited "novel GenAI usage for which best practices and safeguards are not yet fully established" as a contributing factor. &lt;/p&gt;

&lt;p&gt;Researchers at NAIST (Nara Institute of Science and Technology), in a &lt;a href="https://arxiv.org/abs/2602.17091" rel="noopener noreferrer"&gt;paper accepted at MSR 2026&lt;/a&gt;, analyzed 1,664 merged agentic pull requests across 197 open-source projects. They found that 75% of agentic PRs pass through review with zero revisions. Three out of four AI-generated PRs sail through without a single change requested.&lt;/p&gt;

&lt;p&gt;There is a growing chorus of developers complaining about the ballooning burden of code review. As AI coding agents improve, engineers who use them ship more code. As engineers ship more code, the volume of code that has to pass through review skyrockets.&lt;/p&gt;

&lt;p&gt;This tidal wave of review work could be what's driving developers to rubber-stamp bugs into production at AWS, in the studies above, and beyond.&lt;/p&gt;

&lt;p&gt;Bugs that would have been caught in a more thorough review slip through. And if nobody engaged deeply with the code during review (nor at time of generation), nobody understands it well enough to feel equipped, or responsible, to maintain it later.&lt;/p&gt;

&lt;p&gt;Amazon's response is to require junior and mid-level engineers to get senior sign-off on all AI-assisted changes. But adding more human sign-off to a process that's already struggling to keep up cannot fix the core tension.&lt;/p&gt;

&lt;p&gt;So how &lt;em&gt;do&lt;/em&gt; we prevent orphaned, buggy code from filling up codebases, without drowning humans in an impossible mountain of manual review?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The end of human code review&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best articulation I've seen of what the answer should look like comes from Kayvon Beykpour, formerly CEO of Periscope and head of product at Twitter, and now cofounder of the AI code review tool Macroscope.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://x.com/kayvz/status/2016934777396609428" rel="noopener noreferrer"&gt;widely shared post&lt;/a&gt;, he predicted that "soon, human engineers will review close to zero pull requests," and that instead, "code review will become always-on and increasingly automatic as code is being written. A new orchestration layer will emerge where agents will decide when PRs are ready to merge and only (infrequently) escalate to humans."&lt;/p&gt;

&lt;p&gt;Beykpour argues code review needs to be pulled closer to where code is being written, not delayed until a PR is opened. Specialized review agents should continuously analyze code as it's generated, verify correctness, and coordinate with coding agents to address issues in real time. &lt;/p&gt;

&lt;p&gt;If the AI-generated code in the Concordia study had been continuously reviewed by a dedicated agent as it was written, the bugs would have already been caught, and it wouldn't have mattered that an engineer waved the code into production.&lt;/p&gt;

&lt;p&gt;Then, at the PR stage, Beykpour says AI agents should orchestrate "merge readiness": assessing whether the code was sufficiently tested, evaluating blast radius, checking trust profiles, and deciding whether human escalation is actually required. &lt;/p&gt;
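&lt;p&gt;As a purely illustrative sketch of what such a merge-readiness gate might check (the field names and rules here are invented for the example, not Macroscope's actual logic):&lt;/p&gt;

```python
# Hypothetical merge-readiness gate combining the signals Beykpour lists:
# test sufficiency, blast radius, and trust profile. All fields and
# thresholds are invented for this sketch.

def escalate_to_human(pr):
    """Return True when the PR should wait for a human reviewer."""
    if not pr["tests_sufficient"]:
        return True                          # untested change: always escalate
    if pr["blast_radius"] == "high":
        return True                          # touches critical surface area
    return pr["trust_profile"] == "untrusted"  # low-trust history: escalate

# A small, well-tested, trusted change merges without human review.
print(escalate_to_human({
    "tests_sufficient": True,
    "blast_radius": "low",
    "trust_profile": "trusted",
}))  # prints False
```

The point of the sketch is the routing decision itself: most PRs fall through to automatic merge, and human attention is reserved for the escalations.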

&lt;p&gt;When low-risk PRs are taken off an engineer's plate, they have more time and bandwidth for the reviews that actually matter. And when those reviews reach them, they know they're important.&lt;/p&gt;

&lt;p&gt;The first glimpse of this future is "&lt;a href="https://macroscope.com/blog/introducing-approvability" rel="noopener noreferrer"&gt;Approvability&lt;/a&gt;," a feature rolled out by Beykpour's team last month that automatically evaluates every PR against two hurdles before deciding whether it can merge without a human reviewer.&lt;/p&gt;

&lt;p&gt;Trusting an AI to decide whether code can merge without a human reviewer will seem reckless to some, but this is how every generation of programming has evolved. &lt;/p&gt;

&lt;p&gt;When the compiler took over the task of writing machine code in the 1950s, programmers didn't trust it, so they inspected its binary output line by line, swapping writing for reviewing. Over time, a set of checks and balances was built around the compiler (listing prints, error diagnostics, optimization passes, and so on) until manual verification became redundant.&lt;/p&gt;

&lt;p&gt;The same pattern played out with operating systems, CI tools, and cloud platforms. Each initially added a burden of oversight, and each eventually earned enough trust that the oversight became unnecessary.&lt;/p&gt;

&lt;p&gt;The research above helps us diagnose the problems with LLM-written code: it contains more bugs, gets left untouched for longer, and mostly sails through review unchallenged. But the teams that solve these challenges won't be the ones who review harder. They'll be the ones who review smarter.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>agents</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>From 75% to 98% Precision: The Research Paper That Changed How a Startup Prompts AI</title>
      <dc:creator>Lewis</dc:creator>
      <pubDate>Wed, 11 Mar 2026 20:30:37 +0000</pubDate>
      <link>https://dev.to/lewiska/from-75-to-98-precision-the-research-paper-that-changed-how-a-startup-prompts-ai-4ll6</link>
      <guid>https://dev.to/lewiska/from-75-to-98-precision-the-research-paper-that-changed-how-a-startup-prompts-ai-4ll6</guid>
      <description>&lt;p&gt;GPT 5.3 and Opus 4.6 dropped on the same day last month. The team at Every, the tech publication that’s become one of the go-to sources for hands-on AI coverage, &lt;a href="https://every.to/vibe-check/codex-vs-opus" rel="noopener noreferrer"&gt;ran&lt;/a&gt; a “vibe check”. They had their team compare the models head to head, and nobody could pick a clear winner.&lt;/p&gt;

&lt;p&gt;Their CEO Dan Shipper uses them 50/50. Other team members each landed on a different mix. The consensus: each model has different strengths, and the best approach is to use both depending on the task.&lt;/p&gt;

&lt;p&gt;But “depending on the task” is doing a lot of heavy lifting in that sentence. How do you figure out which model is best for which task? And once you’ve picked one, how do you write the prompt that unlocks its peak performance on that specific job? Throw in all of the other models that you may need to consider and this becomes a wicked problem.&lt;/p&gt;

&lt;p&gt;Left unsolved, this wickedness means you’re leaving serious performance on the table, whether you’re building products on top of AI or trying to get the most out of these models in your work.&lt;/p&gt;

&lt;p&gt;The default approach for most teams is to roll with the vibes. Pick a model. Write prompts by hand. A/B test against a benchmark. Wait for results. Tweak. Repeat. Try to build intuition about what each model is good at. Maybe add some few-shot examples. It’s better than nothing, but you won’t achieve differentiated results.&lt;/p&gt;

&lt;p&gt;The more sophisticated approach is automated prompt optimizers: systems where an LLM writes a prompt, scores it against a benchmark, reflects on the results, and tries to write a better one, looping until the score plateaus. The best of them use evolutionary approaches, maintaining multiple candidate prompts and breeding the best performers together. This beats doing it by hand, but it hits its own ceiling, one that AI researchers recently pinned down to two failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brevity bias and context collapse
&lt;/h2&gt;

&lt;p&gt;In a 2026 ICLR &lt;a href="https://arxiv.org/abs/2510.04618" rel="noopener noreferrer"&gt;paper&lt;/a&gt;, a team of researchers identified two core failure modes in prompt optimizers. The first is brevity bias: optimizers tend to converge on short, generic prompts, because short prompts are safe. “Be careful with edge cases” never hurts on any particular test case, so it survives selection round after round. “When processing this specific type of input, check for Y” only helps 10% of the time, so it gets pruned. Over many rounds, the specific advice dies and the generic advice lives. You end up with prompts that are OK at everything and great at nothing.&lt;/p&gt;

&lt;p&gt;The second failure mode is context collapse. When the optimizer asks an LLM to rewrite a large, detailed prompt, the LLM compresses it — sometimes catastrophically. The researchers showed an example where 18,000 tokens of accumulated knowledge collapsed to 122 tokens, and performance dropped below the baseline. The system literally forgot everything it had learned.&lt;/p&gt;

&lt;p&gt;Until recently, that was the landscape. A/B testing was too slow. Automated optimization converged on mediocrity. Neither approach scaled with the speed of model progression.&lt;/p&gt;

&lt;p&gt;I’ve been bumping up against this exact problem myself with an AI product I’m building. So when one of the startups I work with, Macroscope, &lt;a href="https://macroscope.com/blog/we-stopped-writing-prompts" rel="noopener noreferrer"&gt;published a detailed article&lt;/a&gt; outlining a new approach to prompt optimization they call “auto-tuning,” I leaned in extra hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The A.C.E. in the hole
&lt;/h2&gt;

&lt;p&gt;Macroscope does AI code review. Their product needs to work across every programming language developers use, and each language has different idioms. A prompt that catches real bugs in Go flags noise in Python. Adding few-shot examples for one language broke another. New models shipped faster than they could build full intuition about the old ones.&lt;/p&gt;

&lt;p&gt;Then, at the tail end of last year, Macroscope found the aforementioned ICLR paper. In addition to diagnosing their exact problem, it proposed a solution. The researchers called it Agentic Context Engineering, or ACE.&lt;/p&gt;

&lt;p&gt;Instead of trying to find one perfect prompt through iterative rewriting, ACE builds a playbook. Three LLM roles work together: a Generator that attempts the task, a Reflector that diagnoses what went wrong, and a Curator that adds a specific bullet point to a master playbook based on what it learned.&lt;/p&gt;

&lt;p&gt;The key constraint is that the Curator can only add or update individual bullets. It never rewrites the whole prompt. This prevents the context from collapsing into a generic summary — the failure mode that plagues traditional prompt optimizers.&lt;/p&gt;

&lt;p&gt;The result is a prompt that accumulates hundreds of specific, detailed entries over time. Not “be careful with date formatting,” but “when processing Venmo transactions, use datetime range comparisons, not string matching.” The model reads the full playbook at inference time and naturally pays attention to whichever entries are relevant for the current task.&lt;/p&gt;
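&lt;p&gt;A minimal sketch of the playbook mechanic (a dict-backed structure invented for illustration; in the paper the three roles are LLMs, not plain functions):&lt;/p&gt;

```python
# ACE-style playbook sketch. The key constraint from the paper: the
# Curator may ADD or UPDATE individual bullets, but there is deliberately
# no operation that rewrites the playbook wholesale, which is what
# prevents context collapse.

class Playbook:
    def __init__(self):
        self.bullets = {}  # bullet_id -> lesson text

    def curate(self, bullet_id, text):
        """Add a new bullet or refine an existing one by id."""
        self.bullets[bullet_id] = text

    def render(self):
        """The full playbook handed to the model at inference time."""
        return "\n".join(f"- {text}" for text in self.bullets.values())

pb = Playbook()
pb.curate("venmo-dates", "When processing Venmo transactions, use datetime "
                         "range comparisons, not string matching.")
pb.curate("go-errors", "In Go, flag ignored error returns from deferred "
                       "Close() calls.")
print(pb.render())
```

Because lessons only ever accumulate or get refined in place, an 18,000-token playbook can never be "summarized" down to 122 tokens by an overzealous rewrite.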

&lt;p&gt;ACE showed roughly 10% improvement on agent benchmarks, matching top-ranked production agents while using a smaller open-source model. 10% is cute, but Macroscope pushed ACE further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt x Model x Language
&lt;/h2&gt;

&lt;p&gt;ACE optimizes the playbook for a fixed model — you pick one model and the system improves the prompt for that model. Macroscope asked a different question: what if we run this process across every model simultaneously? Same task, same benchmark, but now the system is building and testing playbooks for GPT, Gemini, Opus, and others in parallel — discovering not just the best prompt, but the best model-prompt combination.&lt;/p&gt;

&lt;p&gt;It’s closer to having a dedicated prompt engineer iterate on prompts for every model at once, except auto-tune can test ideas in parallel and doesn’t get tired.&lt;/p&gt;

&lt;p&gt;And when they did this, they discovered something unexpected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding subtask-model fit
&lt;/h2&gt;

&lt;p&gt;The system found that models have stable behavioral signatures — personality traits, essentially — that they can’t turn off. And it learned to exploit them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GPT-5.2 hedges.&lt;/em&gt; When GPT is uncertain, it says things like “this could potentially cause an issue” instead of committing. The hedging leaks through even with explicit instructions to be decisive. But auto-tune discovered that this hedging correlates strongly with false positives. The model is expressing genuine uncertainty, and that uncertainty is a useful signal. Modal language like “could,” “potentially,” and “may” became a rejection filter.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Gemini 3 rambles.&lt;/em&gt; Gemini sometimes thinks out loud mid-response. “Wait, let me re-read that.” “However, on second thought…” When it does this, it’s usually about to get the answer wrong. The self-correction is also a tell. Auto-tune learned to catch it and those phrases became rejection signals. Not every model does this though. Opus, for example, doesn’t ramble, so it doesn’t need this filter.&lt;/p&gt;
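&lt;p&gt;A toy version of such rejection filters might look like this (the phrase lists are illustrative guesses based on the examples above, not Macroscope's actual signals):&lt;/p&gt;

```python
import re

# Illustrative rejection filters built from the behavioral "tells"
# described above. The phrase lists are hypothetical examples.

HEDGING = re.compile(r"\b(could|potentially|may|might)\b", re.IGNORECASE)
RAMBLING = re.compile(r"(wait, let me re-read|on second thought)", re.IGNORECASE)

def accept_finding(model, response):
    """Drop findings whose phrasing signals the model is unsure or derailing."""
    if HEDGING.search(response):
        return False  # hedging correlates with false positives
    if model == "gemini" and RAMBLING.search(response):
        return False  # mid-response self-correction is a Gemini-specific tell
    return True

print(accept_finding("gpt", "This could potentially cause a null deref."))
print(accept_finding("gpt", "This dereferences a null pointer on line 12."))
```

Note the per-model branch: the rambling filter only applies where the tell exists, which mirrors the article's point that Opus doesn't need it.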

&lt;p&gt;Once you understand each model’s natural tendencies, you can start assigning them to the tasks they’re suited for — even pairing them in concert on the same task to achieve results neither could produce alone.&lt;/p&gt;

&lt;p&gt;One of auto-tune’s most useful findings was to pair a permissive model for detection with a strict model for validation. Tell the detection model to flag everything, false positives acceptable. Then use a different model to ruthlessly filter out anything that involves hedging, speculation, or claims that can’t be proven from the code. One optimizes for recall, the other for precision.&lt;/p&gt;
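&lt;p&gt;The recall/precision split can be sketched as a two-stage pipeline (hypothetical code; call_model stands in for real LLM clients):&lt;/p&gt;

```python
# Hypothetical sketch of the permissive-detector / strict-validator split
# described above. call_model is a stand-in for real LLM clients.

def review(diff, call_model):
    # Stage 1: a permissive model casts a wide net (optimizes for recall).
    candidates = call_model("detector",
        "Flag everything suspicious; false positives are acceptable.\n" + diff)
    # Stage 2: a strict model keeps only provable findings (precision).
    return [finding for finding in candidates
            if call_model("validator",
                "Reject unless provable from the code alone: " + finding)
            == "confirmed"]

# Stub client so the sketch runs without any API; a real pipeline would
# route these two roles to different model families.
def stub(role, prompt):
    if role == "detector":
        return ["possible null deref in parse()", "naming nitpick"]
    return "confirmed" if "null deref" in prompt else "rejected"

print(review("def parse(x): return x.value", stub))
# prints ['possible null deref in parse()']
```

The design point is that neither model has to be good at both jobs: the detector is rewarded for over-flagging precisely because the validator cleans up after it.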

&lt;p&gt;The differences unearthed by auto-tune are not subtle. Given the same “flag everything” directive, Opus flags 199 potential issues. GPT flags 3,923. Same task, 20x different output.&lt;/p&gt;

&lt;p&gt;The team said: “We probably wouldn’t have tried pairing different models for different subtasks without auto-tune — it seemed unnecessarily complex.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Near perfect precision
&lt;/h2&gt;

&lt;p&gt;Remember, ACE achieved roughly 10% improvement on agent benchmarks. Macroscope’s results were more dramatic.&lt;/p&gt;

&lt;p&gt;Overall precision jumped from 75% to 98% — meaning nearly every comment the system leaves is now correct. It catches 3.5x more high-severity bugs while leaving 22% fewer comments overall. Nitpicks dropped 64% in Python and 80% in TypeScript.&lt;/p&gt;

&lt;p&gt;Since launching v3, developer thumbs-up reactions increased 30%, comments per PR dropped 37%, and developers are resolving 10% more of the issues flagged.&lt;/p&gt;

&lt;p&gt;To achieve these results, Macroscope also layered in a few additional engineering enhancements that they detail in their post, for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Severity-weighted scoring&lt;/strong&gt; — a critical bug scores 125x higher than a low-severity one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate controls&lt;/strong&gt; — at low rates, the system tweaks wording. At high rates, it rewrites entire sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-overfitting guidance&lt;/strong&gt; — the system is instructed to identify underlying patterns across a batch of results, not make changes to address a single specific failure.&lt;/li&gt;
&lt;/ul&gt;
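&lt;p&gt;The severity weighting, for example, might look like this (the 125x critical-to-low ratio is from the post; the intermediate weights are my assumptions):&lt;/p&gt;

```python
# Hypothetical severity-weighted scorer. The 125x critical-to-low ratio
# comes from Macroscope's post; the medium/high weights are assumed here.

SEVERITY_WEIGHT = {"low": 1, "medium": 5, "high": 25, "critical": 125}

def playbook_score(findings):
    """Score a candidate prompt so one correct critical catch outweighs
    a pile of correct nitpicks during optimization."""
    return sum(SEVERITY_WEIGHT[f["severity"]] * f["correct"] for f in findings)

findings = [
    {"severity": "critical", "correct": True},
    {"severity": "low", "correct": True},
    {"severity": "low", "correct": False},  # wrong findings score zero
]
print(playbook_score(findings))  # prints 126
```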

&lt;h2&gt;
  
  
  Beyond code review
&lt;/h2&gt;

&lt;p&gt;Here’s why I think this matters for more than just one code review startup.&lt;/p&gt;

&lt;p&gt;Every AI product faces the same underlying problem. Too many models. They change too fast. Different prompts work differently across tasks. And the behavioral signatures auto-tune discovered — the hedging, the rambling, the calibration differences — aren’t specific to code review. They’re properties of how these models reason. A model that hedges when reviewing code hedges when analyzing a legal contract. A model that rambles before getting code wrong rambles before getting a medical assessment wrong.&lt;/p&gt;

&lt;p&gt;Anywhere you have judgment calls — where models can disagree, and the pattern of their disagreement carries information — this approach applies. Legal review. Medical triage. Content moderation. Financial risk assessment. The principle is the same: the right architecture routes each subtask to the model whose natural calibration fits it best, and crafts the prompt that maximizes that fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model-agnostic advantage
&lt;/h2&gt;

&lt;p&gt;There’s one final structural angle here that I think is underappreciated.&lt;/p&gt;

&lt;p&gt;The labs — OpenAI, Google, Anthropic — are locked into their own models. OpenAI is never going to tell you to use Gemini for detection and Opus for validation. They’re incentivized to make their suite of models work for everything. That’s a reasonable strategy for them, but it means they’ll never find the cross-model combinations that auto-tune surfaces.&lt;/p&gt;

&lt;p&gt;Companies that aren’t locked into one family of models have an inherent advantage: they can actually search the full space. Every model, every prompt, every combination. Auto-tuning allows you to tap into that advantage — and every time a new model drops, the system can re-run and find new optimal combinations automatically.&lt;/p&gt;

&lt;p&gt;Macroscope’s full technical deep dive covers a lot more than I could here — including the specific ML techniques they borrowed, their benchmarking methodology, and the limitations of the approach. If this topic interests you, &lt;a href="https://macroscope.com/blog/we-stopped-writing-prompts" rel="noopener noreferrer"&gt;I’d recommend reading it in full&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kamanuru, V., Rainton, J., Wu, C., Ji, M., Li, H., Thakker, U., Zou, J., &amp;amp; Olukotun, K. (2025). Agentic Context Engineering (ACE). ICLR 2026. &lt;a href="https://arxiv.org/abs/2510.04618" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2510.04618&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Macroscope. (2026). We (Basically) Stopped Writing Prompts. &lt;a href="https://macroscope.com/blog/we-stopped-writing-prompts" rel="noopener noreferrer"&gt;https://macroscope.com/blog/we-stopped-writing-prompts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every. (2026). GPT 5.3 Codex vs. Opus 4.6: The Great Convergence. &lt;a href="https://every.to/vibe-check/codex-vs-opus" rel="noopener noreferrer"&gt;https://every.to/vibe-check/codex-vs-opus&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>codereview</category>
      <category>promptengineering</category>
    </item>
  </channel>
</rss>
