DEV Community

After 2 years of AI-assisted coding, I automated the one thing that actually improved quality: AI Pair Programming

Sakiharu on February 16, 2026

After nearly 2 years of AI-assisted development — from ChatGPT 3.5 to Claude Code — I kept hitting the same problem: every model makes mistakes it ...
chovy

The AI-reviews-AI loop is underrated. Biggest issue I've hit with single-model generation is that the model gets anchored on its own assumptions — a second model asking "why did you do it this way?" catches stuff that linting never will.

Curious about the cost side though. Running two models per commit adds up fast if you're shipping multiple times a day. Have you found the quality improvement offsets the token spend, or do you gate it to only run on certain file types?

We've been automating a lot of our content pipeline the same way — one AI generates, another critiques. Built postammo.com around that idea for social media content specifically. The adversarial review step made the output way less generic.

Sakiharu

I'm on max plans for both models, and token usage looks pretty comfortable, so I'd say the cost is manageable. As someone who doesn't come from a Node.js background, the dual-agent review loop is really my main way of catching deep issues that I wouldn't spot by reading the code alone.
Great question though — I'll add a token tracking module to the package so we can get accurate numbers instead of guessing. Thanks for raising it.
And applying the dual-agent pattern to content creation is a great idea. The same principle should work well there.

choutos

The observation that each model has consistent failure patterns rather than random bugs is underappreciated. Claude getting sloppy on error handling in long contexts, Codex over-engineering abstractions: these are predictable weaknesses, which means a second agent can be specifically tuned to watch for them.

We run a multi-agent setup ourselves and the "grading your own exam" problem is real. The manual copy-paste phase you describe is painfully familiar. Automating the review loop is the obvious next step but the hard part is knowing when to stop iterating. Two agents can get into an infinite refinement cycle if you're not careful. Did you find a good heuristic for convergence?

Sakiharu

Great question! Our current approach has a few layers:

  • Tag-based convergence. Each round is tagged: [CODE], [PASS], [NEEDS_WORK], [CONSENSUS]. The loop stops when the reviewer issues [PASS] with an explicit reason. No rubber-stamp passes are allowed: the policy requires at least one concrete justification.
  • Round cap. After 5 rounds without consensus, the system flags the task for human intervention, where you can [OVERRIDE] or [HANDOFF]. In practice, most tasks converge in 2-3 rounds.
  • Upfront alignment, the real heuristic. Honestly, with Claude Code + Codex the ping-pong refinement loop is rare. The more common failure mode is the direction being wrong from the start. So for complex tasks I align on goals and approach with the author agent before the loop even begins. If the direction is right, convergence comes naturally; if it's wrong, no amount of review rounds will fix it.
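For anyone implementing something similar, the control flow can be sketched roughly like this. The tag names come from the convention above; the function shape, messages, and everything else are a hypothetical illustration, not the actual package code:

```typescript
// Sketch of the tag-based convergence loop described above. Tag names come
// from the post; the function shape and return strings are hypothetical.
type Tag = "CODE" | "PASS" | "NEEDS_WORK" | "CONSENSUS";

interface Review {
  tag: Tag;
  justification?: string; // required for a [PASS] to count
}

const MAX_ROUNDS = 5; // round cap before human [OVERRIDE] / [HANDOFF]

function runLoop(review: (round: number) => Review): string {
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const r = review(round);
    // No rubber-stamp passes: a [PASS] must carry a concrete justification.
    if (r.tag === "PASS" && r.justification) {
      return `converged in ${round} round(s)`;
    }
  }
  return "escalated to human"; // no consensus after the cap
}
```

The key detail is that the termination condition checks the justification, not just the tag, which is what blocks rubber-stamp passes.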

Curious about your setup — are you using different models for each role, or the same model in different contexts?

Ingo Steinke, web developer

Good point! I also see that AI bots stick to their initial assumptions, sometimes doing effective work while still consistently moving in the wrong direction. The few cases where "vibe coding" worked for me as a senior were either

  • boilerplate code commonly documented in numerous tutorials
  • greenfield code in strongly typed languages
  • things that I should have known...
  • ...but search engines failed to reveal for some reason.

Claude successfully set up working MVP code for a classic WordPress plugin and a similar Chrome browser extension. Copilot did a code review and found two potential security issues in the extension code. The initial concept had been drafted by another AI agent based on a random idea that popped up in a casual conversation. In the end we decided it was better not to waste any more effort finishing and publishing the browser extension, since the requirements were already based on flawed assumptions.

Thus, pair programming at the coding level isn't enough, but it may still be better than naive vibe coding without any external challenger at all.

Sakiharu

This is a really important point and honestly something I’m still figuring out. Code-level review catches bugs, but it can’t fix flawed assumptions — and we’ve run into that too.
What’s helped so far is aligning on goals and approach with the author agent before the review loop starts. The loop handles code quality, but the direction-setting has to happen upfront. When I skip that step, I get exactly what you described — polished code built on flawed premises.
Your Chrome extension story resonates. A reviewer agent would’ve caught the security issues, but wouldn’t have questioned whether the extension should exist at all. That kind of judgment is still on us.
So yeah — pair programming at the code level is necessary but not sufficient. Still learning where the boundaries are.

Matthew Hou

This matches my experience. The quality delta from AI isn't in the code it writes — it's in the speed at which you can iterate the review cycle.

The thing I've found: AI is good at catching "this works but it's wrong" issues when you frame the review correctly. "Check this for edge cases" gets mediocre results. "You are a security engineer reviewing this for potential injection vulnerabilities" gets substantially better results. The framing matters more than the code quality.
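As a toy illustration of that framing effect, the only difference between the two prompts can be the prefix. The function name and wording here are hypothetical, just to show the shape:

```typescript
// Toy illustration of role framing: the only difference between the two
// prompts is the prefix. Function name and strings are hypothetical.
function reviewPrompt(code: string, role?: string): string {
  const framing = role
    ? `You are a ${role} reviewing this code for issues in your specialty.`
    : "Check this for edge cases.";
  return `${framing}\n\n${code}`;
}
```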

What specifically did you automate in the review cycle? I'm curious whether you're catching logic errors or mostly style/linting issues.

Sakiharu

It's mostly logic and design issues, not style/linting. Things like the race condition in the AionUI PR (double super.kill() without an idempotent guard), swallowed errors that silently hide failures, or mismatches between what the code does and what it claims to do. A linter would never catch any of those.
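For context, the guard pattern in question looks roughly like this. The class names are illustrative, not the actual AionUI code:

```typescript
// Minimal sketch of an idempotent kill guard; names are illustrative.
class BaseProcess {
  killCount = 0;
  kill(): void {
    this.killCount++; // stands in for the real teardown work
  }
}

class ManagedProcess extends BaseProcess {
  private killed = false;

  kill(): void {
    if (this.killed) return; // a second, racing call becomes a no-op
    this.killed = true;
    super.kill();
  }
}
```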

Your point about "security engineer" framing is interesting — I could see adding specialized review passes on top of the general loop. General review for logic and design, then a focused pass for security. Layered review rather than one-shot.
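A layered setup could be as simple as running the same code through several role-framed passes. The types and the pass contents here are hypothetical, just to show the structure:

```typescript
// Hypothetical shape of a layered review: one general pass plus focused,
// role-framed passes over the same code. All names are illustrative.
interface ReviewPass {
  role: string;                    // e.g. "security engineer"
  run: (code: string) => string[]; // findings for this pass
}

function layeredReview(code: string, passes: ReviewPass[]): Map<string, string[]> {
  const findings = new Map<string, string[]>();
  for (const pass of passes) {
    findings.set(pass.role, pass.run(code));
  }
  return findings;
}
```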

Mykola Kondratiuk

This resonates a lot. I've been building side projects with Claude Code for months now and honestly the biggest lesson was exactly this - one agent reviewing its own output is like proofreading your own essay, you just gloss over things.

The characteristic failure patterns you describe are spot on too, Claude getting sloppy on error handling in long contexts is something I hit constantly. I ended up building a security scanning step into my workflow for similar reasons - not a second agent exactly, but a dedicated pass that only looks for vulnerabilities and missed edge cases. Caught stuff I would have shipped otherwise.

Curious though - do you find the reviewer agent sometimes introduces new issues? Like overcorrecting or suggesting refactors that break the original intent? That's been my experience when I tried having a second model do full rewrites instead of just flagging problems.

Sakiharu

Yeah, we ran into this early on too. Our solution was to let the author agent challenge the reviewer’s feedback before changing any code. So instead of blindly applying every suggestion, they discuss it first — the reviewer flags an issue, the author can push back with reasoning, and they go back and forth until they reach consensus. Only then does the code get modified.
In practice, most of the time the author accepts the reviewer’s feedback and makes the fix. But sometimes it pushes back and holds its ground — and often it’s right to. That back-and-forth filters out the overcorrections before they ever touch the code.
We have a 5-round cap on any single disagreement — if they can’t reach consensus, it escalates to human judgment. Hasn’t happened yet though. Turns out when both agents have to justify their position, they converge pretty quickly.
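Mechanically, that challenge-before-change protocol can be sketched as a tiny state machine. This is a toy model; the move names and the function are hypothetical:

```typescript
// Toy model of the challenge-before-change protocol: the author may accept a
// finding or rebut it; code changes only after consensus.
type AuthorMove = { kind: "accept" } | { kind: "rebut"; reasoning: string };
type ReviewerMove = "withdraw" | "hold";

function negotiate(
  author: (round: number) => AuthorMove,
  reviewer: (rebuttal: string) => ReviewerMove,
  maxRounds = 5, // cap per disagreement before human escalation
): "apply_fix" | "drop_finding" | "escalate" {
  for (let round = 1; round <= maxRounds; round++) {
    const move = author(round);
    if (move.kind === "accept") return "apply_fix"; // author makes the fix
    if (reviewer(move.reasoning) === "withdraw") return "drop_finding"; // pushback held up
    // reviewer holds its ground: another round of discussion
  }
  return "escalate"; // no consensus after the cap: human judgment
}
```

The point of the structure is that "apply_fix" is the only outcome that touches the code, so overcorrections get filtered in the rebuttal path.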
Your security scanning step is smart though — a focused pass for a specific concern. General review + specialized scan is probably the strongest combination. Something I'd like to explore adding to the loop as well.

Mykola Kondratiuk

That author-challenges-reviewer step is a really elegant fix. Way cleaner than trying to tune the reviewer to be less aggressive upfront - you're basically adding a negotiation layer before any code changes happen, which is a smarter place to put the friction. The 5-round cap with human escalation is a nice touch too, love systems that have a clear fallback rather than looping forever. Definitely stealing that idea.

Ali Farhat

Codex is getting very close to autonomous development.

Vic Chen

The context window asymmetry point is really interesting — hadn't thought about that as an architectural advantage. When one agent compresses and loses info, the other still has it. That's basically distributed memory across agents.

I've been building something similar for financial data pipelines. One agent writes the data extraction logic, another validates the output against known constraints (e.g., SEC filing totals must balance). The domain-specific reviewer catches things a general code reviewer never would — like silently dropping rows when a filing has an unusual format.
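A domain-constrained validator in that spirit can be tiny. Everything here, including the field names and the tolerance, is a made-up sketch of the idea, not the actual pipeline:

```typescript
// Hypothetical domain-specific validator: extracted line items must sum to
// the filing's reported total, and an empty extraction is treated as a
// possible silent row drop. Names and tolerance are illustrative.
interface Filing {
  reportedTotal: number;
  lineItems: number[];
}

function validateFiling(f: Filing, tolerance = 0.005): string[] {
  const issues: string[] = [];
  if (f.lineItems.length === 0) {
    issues.push("no rows extracted: possible silent row drop");
  }
  const sum = f.lineItems.reduce((a, b) => a + b, 0);
  if (Math.abs(sum - f.reportedTotal) > tolerance) {
    issues.push(`line items sum to ${sum}, filing reports ${f.reportedTotal}`);
  }
  return issues;
}
```

This is the kind of check a general code reviewer would never make, because it depends on knowing the domain invariant, not the code.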

Question: have you experimented with giving the reviewer agent access to git blame or commit history? Seems like knowing why code was written a certain way (not just what it does) would make the review pass significantly better, especially for legacy codebases.