<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stephen Metcalfe</title>
    <description>The latest articles on DEV Community by Stephen Metcalfe (@raithlin).</description>
    <link>https://dev.to/raithlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F309554%2Fb8348be7-2a64-44b0-ab50-0ba1898afaa1.jpeg</url>
      <title>DEV Community: Stephen Metcalfe</title>
      <link>https://dev.to/raithlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raithlin"/>
    <language>en</language>
    <item>
      <title>I Reviewed 200+ AI-Generated PRs. Here's the 4-Round Protocol I Use Now.</title>
      <dc:creator>Stephen Metcalfe</dc:creator>
      <pubDate>Mon, 15 Jun 2026 12:54:08 +0000</pubDate>
      <link>https://dev.to/raithlin/i-reviewed-200-ai-generated-prs-heres-the-4-round-protocol-i-use-now-28l8</link>
      <guid>https://dev.to/raithlin/i-reviewed-200-ai-generated-prs-heres-the-4-round-protocol-i-use-now-28l8</guid>
      <description>&lt;p&gt;Your teammate used Claude to generate a new API endpoint. The code looks great — clean formatting, proper error handling, even comments. You skim through it, see it follows conventions, CI is green. You approve.&lt;/p&gt;

&lt;p&gt;Two weeks later, the endpoint silently drops a decimal place on currency conversions. A financial report is wrong for three days before anyone notices.&lt;/p&gt;

&lt;p&gt;This scenario is playing out in hundreds of teams right now. Not because AI generates "bad code" — but because &lt;strong&gt;AI-generated code fails in ways human code doesn't&lt;/strong&gt;, and your existing review process wasn't designed for it.&lt;/p&gt;

&lt;h2&gt;The Problem With Reviewing AI Code&lt;/h2&gt;

&lt;p&gt;AI doesn't flag uncertainty. It presents everything with equal confidence. A human developer might write &lt;code&gt;// not sure about the caching here&lt;/code&gt; — that nervous comment tells you exactly where to look. AI never writes that comment. It writes &lt;code&gt;// Transform the input to match the expected schema&lt;/code&gt; with full confidence, even when the transformation is wrong.&lt;/p&gt;

&lt;p&gt;After reviewing hundreds of AI-generated PRs over the past year, I found a pattern. The bugs aren't in formatting. They're in the places a quick glance won't reach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Off-by-one errors in loops you skimmed&lt;/li&gt;
&lt;li&gt;Missing auth checks on new endpoints&lt;/li&gt;
&lt;li&gt;Elegant abstractions that create maintenance nightmares&lt;/li&gt;
&lt;li&gt;Code that solves the &lt;em&gt;wrong problem&lt;/em&gt; perfectly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generic "review this code" prompts won't catch these. You need a system.&lt;/p&gt;

&lt;h2&gt;The 4-Round Protocol&lt;/h2&gt;

&lt;p&gt;I built a review protocol specifically for AI-generated code. Four rounds, each targeting a different failure mode. Total time: ~15 minutes for a typical PR, up to 35 minutes for a large one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;What You're Catching&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Surface Scan&lt;/td&gt;
&lt;td&gt;Logic errors, off-by-one, wrong assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Security Deep-Dive&lt;/td&gt;
&lt;td&gt;Injection, auth gaps, data leaks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Architecture Smell Check&lt;/td&gt;
&lt;td&gt;Wrong patterns, tech debt, doesn't fit the system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Comparison Pass&lt;/td&gt;
&lt;td&gt;Does this match what we actually asked for?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;each round uses a separate AI prompt that forces a different lens on the same code.&lt;/strong&gt; You're not asking the AI to "review this code" four times — you're asking four different, targeted questions.&lt;/p&gt;

&lt;p&gt;Let me show you the two rounds that catch the most issues.&lt;/p&gt;

&lt;h2&gt;Round 1: The Surface Scan&lt;/h2&gt;

&lt;p&gt;This is the round most likely to find something. Most AI bugs live here: logic errors, wrong assumptions, off-by-one bugs. The code &lt;em&gt;looks&lt;/em&gt; correct, but it's subtly wrong in exactly the ways a quick glance won't catch.&lt;/p&gt;

&lt;p&gt;Here's the prompt I use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review this code for logic errors only. Do NOT suggest style improvements,
documentation, or refactoring. I want you to find:

1. Off-by-one errors, wrong comparisons, or inverted logic
2. Wrong default values or assumptions about data shape
3. Missing edge case handling (null, empty, zero, max values)
4. Race conditions or non-atomic operations on shared state

For each issue found, state:
- The exact line number
- Why it's wrong
- What the correct behavior should be

If you find zero issues, explain why each edge case IS handled,
not just say "looks good."

[Paste the PR description or requirements if available]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical instruction is the final requirement: &lt;strong&gt;"If you find zero issues, explain why each edge case IS handled."&lt;/strong&gt; Without it, the AI will happily say "looks good" and move on. Forcing it to justify the all-clear surfaces gaps that a simple yes/no answer hides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; AI-generated tests will pass. AI knows what the code does, so it writes tests that confirm the code's behavior — &lt;em&gt;including its bugs&lt;/em&gt;. Perfect test coverage means nothing if the tests are testing the wrong thing.&lt;/p&gt;
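&lt;p&gt;To make that trap concrete, here is a minimal Python sketch. The &lt;code&gt;convert&lt;/code&gt; and &lt;code&gt;buggy_convert&lt;/code&gt; helpers and the sample rate are made up for illustration, and the bug is modelled on the dropped decimal place from the opening scenario. The first test derives its expected value from the code under test, so it passes despite the bug. The second uses a value worked out by hand and exposes it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from decimal import Decimal, ROUND_HALF_UP


def convert(amount, rate):
    """Convert a currency amount, keeping two decimal places (hypothetical helper)."""
    product = Decimal(amount) * Decimal(rate)
    return product.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def buggy_convert(amount, rate):
    """Same helper with a silent bug: it keeps only one decimal place."""
    product = Decimal(amount) * Decimal(rate)
    return product.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# A test that mirrors the implementation passes even though the code is wrong,
# because the expected value comes from the code under test.
mirrored_expected = buggy_convert("19.99", "1.0850")
assert buggy_convert("19.99", "1.0850") == mirrored_expected

# A test with an independently computed expectation exposes the bug.
# 19.99 * 1.0850 = 21.689150, which rounds to 21.69 at two decimal places.
independent_expected = Decimal("21.69")
assert convert("19.99", "1.0850") == independent_expected
assert buggy_convert("19.99", "1.0850") != independent_expected
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When reviewing AI-written tests, one useful question is where each expected value came from: the requirements, or the code itself.&lt;/p&gt;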

&lt;h2&gt;Round 2: The Security Deep-Dive&lt;/h2&gt;

&lt;p&gt;This is the scary one. AI models are trained on massive amounts of public code, including code with security vulnerabilities. They don't understand security — they understand patterns. If the most common Stack Overflow solution uses &lt;code&gt;eval()&lt;/code&gt; or concatenates SQL strings, the AI will reproduce that pattern with full confidence.&lt;/p&gt;

&lt;p&gt;The most common AI security failures: SQL injection, insecure deserialization (pickle, Marshal, &lt;code&gt;YAML.load&lt;/code&gt;), BOLA/IDOR (authenticated but accessing someone else's resource), mass assignment, and SSRF.&lt;/p&gt;
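&lt;p&gt;As an illustration of the first item, here is a small Python sketch using the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module. The table, the function names, and the payload are assumptions chosen for the example. It contrasts a concatenated query with a parameterized one.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import sqlite3

# In-memory database with a made-up users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # Parameterized: the value is passed separately from the SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)  # the injected OR clause matches every row
blocked = find_user_safe(payload)   # the payload is treated as a literal name
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both functions look equally tidy in a diff, which is why this pattern tends to survive a skim.&lt;/p&gt;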

&lt;p&gt;Here's the prompt that catches what your brain won't think to check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a malicious actor with valid API credentials who wants to exploit
this code. Walk through every possible thing you could try:

- Can you access data you shouldn't be able to?
- Can you escalate privileges?
- Can you cause the system to leak internal information?
- Can you trigger unexpected behavior with edge inputs?
- Can you cause the system to consume excessive resources?

Think step by step. List at least 5 distinct attack vectors. If you can't
find 5, you're not thinking creatively enough.

After listing individual vectors, describe at least 2 attack chains where
you combine multiple steps to achieve something none of the individual
vectors accomplish alone.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip: use a different AI model for this round than the one that generated the code.&lt;/strong&gt; If Claude wrote the code, use GPT-4 to review it. Different training data means different blind spots, so a second model can catch vulnerabilities that the generating model consistently misses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; Auth checks that look right but aren't. AI will write &lt;code&gt;if current_user.present?&lt;/code&gt; — the user is authenticated, but the code doesn't check if they're authorized for &lt;em&gt;that specific resource&lt;/em&gt;. The check looks secure but isn't.&lt;/p&gt;
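&lt;p&gt;The snippet above is Ruby. The same gap can be sketched in Python as follows. The &lt;code&gt;User&lt;/code&gt;, &lt;code&gt;Invoice&lt;/code&gt;, and &lt;code&gt;Forbidden&lt;/code&gt; types and both handler functions are hypothetical, and a real application would plug into its framework's own auth layer.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int


class Forbidden(Exception):
    """Raised when a caller may not access a resource."""


def get_invoice_authenticated_only(current_user, invoice):
    # Looks secure: anonymous callers are rejected,
    # but any logged-in user can read any invoice (BOLA/IDOR).
    if current_user is None:
        raise Forbidden("login required")
    return invoice


def get_invoice_authorized(current_user, invoice):
    # Also checks that the caller owns this specific resource.
    if current_user is None:
        raise Forbidden("login required")
    if invoice.owner_id != current_user.id:
        raise Forbidden("not your invoice")
    return invoice


alice, mallory = User(1), User(2)
alice_invoice = Invoice(id=10, owner_id=alice.id)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;get_invoice_authenticated_only(mallory, alice_invoice)&lt;/code&gt; returns Alice's invoice, while &lt;code&gt;get_invoice_authorized(mallory, alice_invoice)&lt;/code&gt; raises &lt;code&gt;Forbidden&lt;/code&gt;.&lt;/p&gt;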

&lt;h2&gt;The Two Rules That Make This Work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clear context between rounds.&lt;/strong&gt; Don't run all 4 rounds in the same chat thread. Start a fresh conversation for each round. If you run Round 2 in the same context as Round 1, the AI already has its Round 1 answers in front of it and tends to anchor its new analysis to them. Fresh context forces independent analysis. Costs 30 seconds, worth every one of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Run all reviews first, fix once.&lt;/strong&gt; The naive approach is to fix issues one at a time — fix the logic bug, review, fix the security hole, review. This creates whack-a-mole: fixing the architecture can introduce a new security hole. Instead: run all 4 rounds, collect every issue, send the complete list to the AI in one shot, then re-run all rounds on the result.&lt;/p&gt;
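&lt;p&gt;If you script the loop, the two rules might look roughly like this Python sketch. The &lt;code&gt;ask_model&lt;/code&gt; callable is an assumption: it stands for whatever function sends one prompt to a model in a brand-new conversation and returns the reply. &lt;code&gt;fake_model&lt;/code&gt; is a stand-in so the example runs without any API, and the prompts are shortened.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
def run_review(code, round_prompts, ask_model):
    """Run every round in its own fresh conversation and collect all findings."""
    findings = []
    for name, prompt in round_prompts.items():
        # Rule 1: one independent call per round, with no earlier output included.
        reply = ask_model(prompt + "\n\n" + code)
        findings.append((name, reply))
    return findings


def build_fix_request(findings):
    """Rule 2: combine every finding into a single fix request."""
    lines = ["Fix all of the following issues in one pass:"]
    for name, reply in findings:
        lines.append("[" + name + "] " + reply)
    return "\n".join(lines)


def fake_model(prompt):
    # Stand-in for a real model call (hypothetical); reports the prompt size.
    return "reviewed " + str(len(prompt)) + " characters"


prompts = {
    "surface": "Review this code for logic errors only.",
    "security": "You are a malicious actor with valid API credentials.",
}
findings = run_review("def f(): pass", prompts, fake_model)
fix_request = build_fix_request(findings)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the fixes land, the same &lt;code&gt;run_review&lt;/code&gt; call can be repeated on the new code, which matches the "re-run all rounds" step.&lt;/p&gt;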

&lt;h2&gt;When to Skip a Round&lt;/h2&gt;

&lt;p&gt;Not every PR needs all 4 rounds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Rounds to Run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Comment or docs change&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Variable rename&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typo fix&lt;/td&gt;
&lt;td&gt;1 &amp;amp; 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New API endpoint&lt;/td&gt;
&lt;td&gt;1, 2 &amp;amp; 4 (skip 3 if it follows existing patterns)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New feature, new patterns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;All 4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth or payment change&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;All 4&lt;/strong&gt; — extra time on round 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-generated bugfix&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;All 4&lt;/strong&gt; — the fix might work but introduce new bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;General principle: if AI generated the code, lean toward running more rounds. That's the whole point.&lt;/p&gt;
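&lt;p&gt;For teams that want to automate the choice, the table could be encoded as a simple lookup. This is only a sketch: the scenario keys are made-up labels, and the fallback to all four rounds is my reading of the general principle above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
ALL_ROUNDS = (1, 2, 3, 4)

# Keys are hypothetical labels; the round lists follow the table above.
ROUNDS_BY_SCENARIO = {
    "docs_change": (),
    "variable_rename": (),
    "typo_fix": (1, 4),
    "new_endpoint_existing_patterns": (1, 2, 4),
    "new_feature": ALL_ROUNDS,
    "auth_or_payment_change": ALL_ROUNDS,
    "ai_generated_bugfix": ALL_ROUNDS,
}


def rounds_for(scenario):
    # Unknown scenarios default to every round: when unsure, run more rounds.
    return ROUNDS_BY_SCENARIO.get(scenario, ALL_ROUNDS)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;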

&lt;h2&gt;The Round Everyone Skips&lt;/h2&gt;

&lt;p&gt;Round 4 — the Comparison Pass — is the most commonly skipped and the most dangerous to skip. It asks one question: &lt;strong&gt;does this code actually solve the problem we asked for?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI is excellent at solving the problem you &lt;em&gt;typed&lt;/em&gt;, not the problem you &lt;em&gt;meant&lt;/em&gt;. It takes your words literally. The most common failure: AI solves the first 80% of a ticket perfectly and quietly ignores the last 20% because it "didn't seem important." The code is perfect — for the wrong thing.&lt;/p&gt;

&lt;p&gt;If you have a ticket or issue, paste it in and make the AI verify each acceptance criterion against the code. You'll be surprised what's missing.&lt;/p&gt;
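&lt;p&gt;One way to set that up is to generate the prompt from the ticket's acceptance criteria. The Python sketch below is an assumption about how such a prompt could be assembled, not the Round 4 prompt from the guide. The function name, the MET/PARTIAL/MISSING labels, and the sample ticket are all invented for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
def comparison_prompt(ticket_title, criteria, code):
    """Build a prompt asking a model to check each acceptance criterion."""
    lines = [
        "Ticket: " + ticket_title,
        "For each acceptance criterion below, answer MET, PARTIAL, or MISSING,",
        "and quote the code that supports your answer.",
        "",
    ]
    # Number the criteria so the reply can be matched back to the ticket.
    for number, criterion in enumerate(criteria, start=1):
        lines.append(str(number) + ". " + criterion)
    lines += ["", "Code:", code]
    return "\n".join(lines)


prompt = comparison_prompt(
    "Add refund endpoint",
    [
        "Refunds over the limit need approval",
        "Refund amount keeps two decimal places",
    ],
    "def refund(): pass",
)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Asking for a verdict per criterion makes a silently skipped requirement show up as an explicit MISSING instead of disappearing into a general summary.&lt;/p&gt;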

&lt;h2&gt;Making It Stick&lt;/h2&gt;

&lt;p&gt;Here's how to adopt this without overwhelming yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Day 1-2:&lt;/strong&gt; Run only Round 1 on all your PRs. Get comfortable with the prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 3-4:&lt;/strong&gt; Add Round 2. You'll likely find something within the first few PRs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 5-6:&lt;/strong&gt; Add Rounds 3 and 4.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Day 7:&lt;/strong&gt; Reflect. What failure patterns did you see most? Which round caught the most issues?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best review process is the one that evolves. When you find a pattern this protocol doesn't catch, add your own round. When a prompt stops finding bugs, retire it.&lt;/p&gt;

&lt;h2&gt;Where This Comes From&lt;/h2&gt;

&lt;p&gt;I've been using versions of this protocol for over a year. It has saved me from shipping bugs that I would have approved on a first pass. Not every time, but often enough that running it is automatic now.&lt;/p&gt;

&lt;p&gt;I wrote the full protocol — all 4 rounds, 12 copy-paste prompts, the "traps to watch for" in each round, a printable checklist, and the review loop workflow — into a guide. It's called &lt;a href="https://raithlin.gumroad.com/l/ai-code-review-protocol" rel="noopener noreferrer"&gt;The AI Code Review Protocol&lt;/a&gt; and it's on Gumroad for $19 (launch price of $12).&lt;/p&gt;

&lt;p&gt;If you want the complete version with Rounds 3 and 4, the architecture smell checklist, the PII audit prompt, and the automation approaches — that's there. If this post was useful, the guide goes deeper.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've built your own review process for AI code, I'd genuinely like to hear what works for you. I'm &lt;a class="mentioned-user" href="https://dev.to/raithlin"&gt;@raithlin&lt;/a&gt; on X, or drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
