You open a 1,200-line PR at 9 AM. The CI is green. You scroll for four minutes, click "Approve," type "LGTM," and move to the next one.
You didn't review it. You logged it.
AI coding assistants generate modules in seconds. But reading code happens at human speed. This asymmetry creates a bottleneck: code influx permanently exceeds verification capacity. We're not reviewing anymore. We're performing security theater in the PR queue.
The Tech Debt Singularity
The Tech Debt Singularity is the point at which the rate of incoming code permanently exceeds the rate at which it can be reviewed. Not a temporary backlog. A structural break.
LLMs don't get bored writing 50 similar functions. They don't care if the PR is 2,000 lines. Humans do. We skim. We trust the tests. We hope.
The result: code that works in isolation but fights itself architecturally. Six months later, you add a feature and realize the codebase is incoherent. One module uses factories. Another uses dependency injection. A third hard-codes everything. Each decision was locally correct. Together, they're a maze.
Research consistently shows that bugs caught in review cost dramatically less to fix than bugs caught in production. But only if you actually review the code.
Three Symptoms You're Rubber Stamping
1. The Nitpick Review
600-line PR. Dense logic with state management, async flows, error handling. Understanding it takes 30 minutes. You have 10.
You rename userData to userProfile. Point out a missing space. Flag a linting error CI already caught. Three thoughtful-looking comments. Click "Approve."
You reviewed syntax. Not logic.
2. Time-to-Approve Collapse
1,200 lines approved in four minutes. Even scanning at 300 lines per minute, you can't verify logic, edge cases, or abstraction quality. You can scroll. You can spot-check. You can't review.
Google's research confirms: meaningful reviews take time. When approval time drops while PR size stays constant, you're not reviewing faster. You're reviewing less.
3. Blind Trust in AI
"Generated by o1, so the logic is sound." Wrong.
AI assistants generate plausible code. They don't understand your business logic. Your payment service has a race condition when webhooks arrive simultaneously. Your database uses soft-deletes that need special handling. The LLM doesn't know. It generates code that looks right but breaks in context.
Research backs this up. CodeRabbit's analysis of 470 open-source PRs found AI-generated code contained 1.7x more issues overall than human-written code. Logic and correctness issues were 75% more common. Error handling gaps appeared at nearly 2x the rate. Security vulnerabilities, particularly XSS, showed up at 2.74x higher rates.
The fundamental problem: 66% of developers cite "AI solutions that are almost right, but not quite" as their biggest frustration with AI tools. Teams at Span have been measuring this effect across millions of lines of code. Their data shows that AI-generated code creates specific patterns of defects that require different verification strategies than human-written code.
Without careful review, you catch these issues in production.
The Review Quality Framework
You need metrics that show when you're cutting corners. Most engineering metrics tools track PR velocity or cycle time. Metrics that reward speed. But speed isn't safety.
What matters is review quality. Three metrics expose rubber stamping:
1. Review Burden Ratio
Time spent reviewing relative to lines changed and complexity. When large, complex PRs get approved in minutes, the ratio collapses. This is your smoke detector.
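A minimal sketch of the metric, using a hypothetical PR record (the field names are illustrative; populate them from whatever your Git host's API exposes):

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    """Minimal PR record. Fields are illustrative, not a real API schema."""
    number: int
    lines_changed: int
    review_minutes: float  # reviewer time between opening the PR and approval

def review_burden_ratio(pr: PullRequest) -> float:
    """Minutes of review per 100 lines changed. Lower means shallower review."""
    if pr.lines_changed == 0:
        return float("inf")
    return pr.review_minutes / pr.lines_changed * 100

# The 1,200-line PR approved in four minutes from the intro:
big_pr = PullRequest(number=101, lines_changed=1200, review_minutes=4)
small_pr = PullRequest(number=102, lines_changed=150, review_minutes=25)

for pr in (big_pr, small_pr):
    ratio = review_burden_ratio(pr)
    flag = "RUBBER STAMP?" if ratio < 2 else "ok"
    print(f"PR #{pr.number}: {ratio:.2f} min per 100 lines -> {flag}")
```

The `< 2` threshold is an arbitrary starting point; calibrate it against PRs your team agrees were reviewed properly.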
2. Review Coverage
Percentage of lines actually viewed, not just scrolled past. Under 80% coverage means skimming. GitHub, for example, exposes per-file "viewed" state through its API, but most teams never look at it.
3. Time to Substantive Feedback
How long before real feedback, not linting nitpicks. If first comments are always "rename this variable" or "fix spacing," you're reviewing cosmetics while skipping logic.
When teams track these metrics, patterns emerge. One team found their velocity was up 40% but review depth down 60%. They were shipping bugs faster.
The challenge: most teams don't track this systematically. It requires correlating PR metadata with actual reviewer activity, understanding code complexity, and distinguishing substantive review from rubber stamping. This is particularly critical now that AI-generated code is flooding PRs. Span's research shows this code requires different verification strategies, making review depth metrics even more important.
Five Strategies to Stop Rubber Stamping
1. Enforce 400-Line PR Limits
SmartBear's study with Cisco found review effectiveness drops sharply after 400 lines. Beyond that, reviewers skim.
Set a hard limit: 400 lines maximum. Larger changes need multiple PRs with clear dependencies. This slows shipping. That's the point. You're choosing correctness over speed.
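The limit is easy to enforce in CI. Here's a minimal Python sketch built on `git diff --numstat`; the `origin/main` base branch is an assumption, so adjust it to your repo's default:

```python
import subprocess
import sys

MAX_LINES = 400       # hard limit from the SmartBear/Cisco finding
BASE = "origin/main"  # assumption: change to your default branch

def total_changed_lines(numstat_output: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files report "-\t-\tpath"
            total += int(added) + int(deleted)
    return total

def gate(base: str = BASE) -> None:
    """Run in CI: exit nonzero when the PR exceeds the line limit."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    n = total_changed_lines(out)
    if n > MAX_LINES:
        print(f"PR touches {n} lines (limit {MAX_LINES}). Split it.")
        sys.exit(1)
    print(f"PR size OK: {n} lines.")
```

The triple-dot `base...HEAD` diffs against the merge base, which is what you want for a PR: it counts only the branch's own changes, not drift on main.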
AI makes this easier. Ask Claude or ChatGPT to split a 1,200-line refactor into three atomic PRs. It takes seconds.
2. Use AI-Generated PR Summaries
Have AI compress the PR into a summary before review:
- What changed and why
- High-level architecture decisions
- Edge cases considered
- Tests added
The summary doesn't replace review. It gives you context so you can focus on logic instead of parsing syntax. Read the summary first, then review the code.
3. Surface Declining Review Quality to Management
When Review Burden Ratio trends down, show management the data. "We can't review this volume safely. We need to slow shipping or add reviewers."
This feels like admitting failure. But the alternative is shipping unreviewed code and catching bugs in production. Multiple studies show code review catches the majority of defects before release. Rubber stamping defers those bugs.
Treat reviewer capacity as a hard constraint. If you won't deploy without tests, don't deploy without real reviews.
4. Rotate Review Responsibilities
Don't bottleneck all reviews through one senior engineer. Rotate across the team. This distributes effort and reduces the temptation to rubber stamp when overwhelmed.
Junior engineers improve at reviewing through practice. Senior engineers get breathing room.
5. Make "I Don't Have Time" Acceptable
Create a culture where "I can't review this properly right now" is acceptable. This beats rushed approval.
If reviewers consistently lack time, that's data. The team is under-resourced for the code volume being generated. Use that signal to adjust expectations or add capacity.
The Decision Framework: When Is a Review Good Enough?
Use this three-question framework before approving:
1. Can I explain the core logic to someone else?
If you can't summarize what the code does and why, you didn't review it. You scanned it.
2. Have I considered failure modes?
What breaks if the network times out? What happens with malformed input? If you haven't thought about edge cases, you didn't review.
This is particularly important for AI-generated code. Span's analysis shows AI code contains significantly more error handling gaps and edge case issues than human code. You need to actively verify these patterns, not assume they're handled.
3. Does this fit our architecture?
Does this match existing patterns? If it introduces a new approach, is that justified? Inconsistent patterns compound into technical debt.
If you answer "no" to any question, don't approve. Request time or ask for the PR to be split.
What Rubber Stamping Actually Costs
Rubber stamping erodes architecture gradually. When nobody reads code holistically, each PR optimizes locally. Together, they create global incoherence.
Example: You approve three PRs in one day. One uses factories. Another uses dependency injection. A third hard-codes dependencies. None are wrong individually. Together, they create inconsistency.
Six months later, new engineers spend weeks learning idiosyncrasies. Features take longer because every change navigates conflicting patterns. This is how brownfield codebases form. Not from malice. From inattention compounded over time.
AI accelerates this. It generates more code faster than humans can, magnifying the cost of each unreviewed decision. And the data shows AI code creates specific architectural challenges. GitClear's analysis of 153 million changed lines of code found code churn doubled and copy/pasted code increased significantly, with AI-generated code resembling "an itinerant contributor, prone to violate the DRY-ness of the repos visited."
The teams who understand this best are those measuring it. Span's platform detects AI-generated code at the chunk level, letting teams see exactly where these patterns appear and correlate them with code quality outcomes. This visibility matters because you can't fix what you can't measure.
Implementation Checklist
This week:
- Set a 400-line PR size limit in your repo settings
- Add a PR template that requires a summary section
- Start tracking Review Burden Ratio manually (spreadsheet with PR size, review time, approval)
This month:
- Review your team's average time-to-approve vs. PR size
- Identify PRs approved in under 5 minutes that exceeded 500 lines
- Present findings to your team
This quarter:
- Implement automated review quality tracking
- Rotate review responsibilities weekly
- Measure defect rates before and after implementing limits
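The "this month" audit above is a one-screen script once your tracking spreadsheet exports to CSV. The column names here are hypothetical; match them to whatever your spreadsheet actually uses:

```python
import csv
import io

# Assumed CSV export from the manual tracking spreadsheet.
CSV_DATA = """pr,lines_changed,minutes_to_approve
101,1200,4
102,320,18
103,650,3
"""

def suspicious_approvals(
    csv_text: str, max_lines: int = 500, min_minutes: float = 5
) -> list[int]:
    """PRs over max_lines approved in under min_minutes:
    prime rubber-stamp candidates."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (int(row["lines_changed"]) > max_lines
                and float(row["minutes_to_approve"]) < min_minutes):
            flagged.append(int(row["pr"]))
    return flagged

print(suspicious_approvals(CSV_DATA))  # [101, 103]
```

The flagged list isn't proof of rubber stamping, just where to start the conversation with your team.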
Conclusion
Code review is the last defense against chaos. When you rubber stamp, you're crossing your fingers and hoping tests are comprehensive.
The temptation to skim is constant. You're busy. The PR looks fine. CI is green. But "looks fine" isn't "is fine." This is especially true now that AI code makes up a growing percentage of PRs. Code that looks correct but contains subtle defects requires thorough verification.
Set limits. Track data. When Review Burden Ratio drops, treat it as a signal to slow down or add capacity.
AI code generation is fast. But speed without verification is just technical debt accumulating faster than you can pay it off. The hard part isn't generating code. The hard part is understanding it.
That part is still on you.
Be honest: Have you approved a PR this week that you didn't actually read? Tell us your "Rubber Stamp" horror stories in the comments.

