Lakshmi Sravya Vedantham
I Put GPT-4 and Claude in the Same Repo and Made Them Review Each Other's PRs. It Got Weird.

I built a tool called model-diff that puts LLM outputs side by side so you can see exactly where they agree, where they diverge, and how confident each one sounds while being completely wrong.

Then one day I had a thought I probably should have dismissed: what if the models reviewed each other's code instead of mine?

Reader, I did not dismiss it.


The Setup

I needed a real feature. I picked a retry mechanism with exponential backoff. Boring enough to be useful, just complex enough to have real architectural opinions about.

I asked Claude to implement it first:

import time
import random
from typing import Callable, Any

def retry_with_backoff(
    func: Callable,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
) -> Any:
    last_exception = None

    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt == max_retries - 1:
                break
            # Exponential backoff: double the delay each attempt, capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            if jitter:
                # Randomize to 50-100% of the delay so concurrent callers
                # don't all retry in lockstep.
                delay *= (0.5 + random.random() * 0.5)
            time.sleep(delay)

    raise last_exception

Clean. Functional. Does exactly what I needed.
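Before handing it over for review, a quick sanity check on the backoff math itself. With jitter turned off, the schedule is deterministic:

```python
base_delay, max_delay = 1.0, 30.0

# Delay before retry N doubles each time, capped at max_delay.
delays = [min(base_delay * (2 ** attempt), max_delay) for attempt in range(6)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Exactly what you want from exponential backoff: fast first retries, then a ceiling so you never sleep for minutes.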

Then I pasted this into GPT-4 and said: "Review this PR."


Round 1: The Part Where They Agreed (I Expected This)

GPT-4 came back with reasonable style notes:

"Consider renaming func to operation for clarity. Also, last_exception = None could be typed as Optional[Exception] = None for mypy compliance."

Fair. I ran model-diff on both outputs and the syntax/naming feedback clustered together — both models flagged similar things. Style is basically solved. The models have read the same PEP 8 documents.

I made Claude look at GPT-4's review. Claude agreed on the type hint. Pushed back mildly on operation vs func. We were in code review harmony.

It did not last.


Round 2: The Architecture Disagreement (Things Get Spicy)

GPT-4 came back with a second round that went further:

"I'd recommend extracting this into a RetryConfig dataclass to separate configuration from execution logic. You'll want to add callback hooks for logging retry attempts — on_retry: Optional[Callable[[int, Exception], None]] = None. Consider whether this should be a decorator rather than a function wrapper for better ergonomics at call sites."

A dataclass. Hooks. A decorator variant. For a retry function.
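For the record, here's roughly what GPT-4 was asking for. This is my reconstruction from its review comments, not code it wrote — names and defaults are my guesses:

```python
import functools
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class RetryConfig:
    max_retries: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0
    jitter: bool = True
    # Called with (attempt_number, exception) before each retry sleep.
    on_retry: Optional[Callable[[int, Exception], None]] = None

def retry_with_backoff(func: Callable[[], Any],
                       config: Optional[RetryConfig] = None) -> Any:
    config = config or RetryConfig()
    last_exception: Optional[Exception] = None
    for attempt in range(config.max_retries):
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt == config.max_retries - 1:
                break
            if config.on_retry is not None:
                config.on_retry(attempt, e)
            delay = min(config.base_delay * (2 ** attempt), config.max_delay)
            if config.jitter:
                delay *= 0.5 + random.random() * 0.5
            time.sleep(delay)
    raise last_exception  # type: ignore[misc]

def with_retry(config: Optional[RetryConfig] = None):
    """The decorator variant GPT-4 wanted for call-site ergonomics."""
    def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(f)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            return retry_with_backoff(lambda: f(*args, **kwargs), config)
        return wrapper
    return decorator
```

It works. It's also three abstractions where I previously had one function.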

I showed this to Claude. Claude's response:

"The RetryConfig dataclass adds indirection without meaningful benefit at this scale. If this function is being called in three places, the current signature is clearer. YAGNI applies here — the callback hook is useful if you have a logging infrastructure to hook into, but adding it speculatively increases the API surface without solving a current problem."

Claude said YAGNI to GPT-4's face.

I ran model-diff. The divergence was stark: GPT-4 trending toward extensibility and abstraction, Claude trending toward sufficiency and restraint. Neither was wrong in the abstract. They were optimizing for different imagined futures.

This is when I started to suspect something.


Round 3: The Weird Part

I gave GPT-4 Claude's YAGNI critique and asked it to respond:

"The original author seems to be optimizing for the immediate use case, but production retry logic almost always evolves to need observability. The callback pattern costs nothing now and prevents a refactor later."

"The original author." It was talking about Claude like Claude was a junior dev who didn't think ahead.

I showed Claude the full thread. Claude's response:

"The counter-argument assumes this will become production retry logic requiring observability. Without knowing the actual deployment context, adding a callback hook is a guess about future requirements. The existing implementation handles the stated problem. If observability is needed, it can be added when it's needed."

Claude was now subtly defending its own original choices.

Meanwhile GPT-4, unprompted, had added max_jitter_factor as a new parameter suggestion.

They were going in circles. GPT-4 kept adding features. Claude kept saying the features weren't necessary. Neither of them had asked me a single question about what I was actually building.


The Reveal

I stepped back and looked at my actual use case. I'm calling an external API in a data pipeline. Three call sites. No distributed tracing. No ops team watching dashboards. The retry logic just needs to not crash and to wait a reasonable amount of time.

Claude's original implementation was correct for my situation. GPT-4 was designing for a microservices production environment that I do not have. It was giving me genuinely good advice — for a different project.

The uncomfortable part: GPT-4 never asked what the project was. Claude never asked either. They both reviewed against their own mental model of "what retry logic is usually for," and those mental models are slightly different because they were trained differently.


What model-diff Actually Showed Me

When I looked at the diff across all rounds, a pattern emerged. GPT-4's suggestions consistently trended toward more abstraction, more extensibility, more hooks. Claude's responses consistently trended toward sufficiency, specificity, restraint. These aren't random variations — they're systematic tendencies.

Neither model is wrong. But neither model knows your codebase, your team size, your deployment environment, or your tolerance for abstraction. They review against their training distribution, not your actual requirements.

AI code review is genuinely useful — it catches things I miss and occasionally spots a real bug. But the moment you're in an architectural discussion, you're not getting a neutral technical opinion. You're getting a perspective shaped by what that model saw most during training.

The models review confidently. They don't flag uncertainty about your context. They just opine.


The Practical Takeaway

Use AI code review for what it's good at: style, type safety, common bugs, missing edge cases.

But when a model starts arguing for a RetryConfig dataclass and callback hooks on a 25-line utility function, that's not the model being smart about your code. That's the model pattern-matching to production-scale architecture because that's what it's seen the most of.

The fix is simple and annoying: give it context. Tell it the team size, the deployment environment, whether this is a prototype or a production system. The models can incorporate that. They just won't ask for it.
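What that looks like in practice: a context preamble prepended to every review prompt. The wording below is just my template — the exact phrasing doesn't matter, the facts do:

```python
# Hypothetical template: swap in the facts of your own project.
CONTEXT = (
    "Project context: a data pipeline calling one external API. "
    "Three call sites. No distributed tracing, no ops team. "
    "Solo developer; prefer the smallest change that works."
)

code_under_review = "..."  # paste the diff or file contents here
prompt = f"{CONTEXT}\n\nReview this PR:\n{code_under_review}"
```

With that preamble in place, the RetryConfig suggestions mostly disappear.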


Have you tried pitting two models against each other on the same code? I'm curious whether the GPT-4-goes-abstract, Claude-stays-minimal pattern holds for other people — or if I just got lucky with a particularly philosophical retry function.

Drop it in the comments. Or just tell me I'm wrong about YAGNI. GPT-4 would.


The full experiment transcript — the retry function, all four rounds of review, and the model-diff output — is on GitHub: examples/pr-review-experiment

model-diff is the tool I used to run this. It compares any two models on the same prompt side by side.

pip install model-diff
model-diff "Review this code" --models gpt-4o,claude-sonnet-4-6
