Toni Antunovic

Posted on May 19 • Originally published at lucidshark.com

When Every PR Is a Rubber Stamp: What Automated Gates Catch That Exhausted Reviewers Miss

#codereview #devsecops #aitools #productivity

This article was originally published on LucidShark Blog.

Mitchell Hashimoto's post about "AI psychosis" hit 1,757 upvotes on Hacker News on May 16. The same weekend, a thread titled "Is the norm now that PRs are basically rubber stamps" climbed to 148 points on r/ExperiencedDevs. Both conversations are about the same underlying problem, approached from opposite ends.

Hashimoto warned about companies that have fully surrendered judgment to AI agents: ship bugs fast, agents will fix them. The Reddit thread described the downstream consequence: reviewers so overwhelmed by AI-generated PR volume that approval is the path of least resistance. Connect those two trends and you get a feedback loop that no team is immune to.

The numbers behind the loop: CodeRabbit's 2026 data shows AI-generated PRs contain 1.7x more issues than human-written ones. PR additions are up 18% since AI adoption accelerated. Incidents per PR are up 24%. Review capacity has not increased at all. When output accelerates faster than verification capacity, review becomes theater.

What Exhausted Reviewers Actually Miss

Code review fatigue is not hypothetical. It is a cognitive load problem. When a reviewer has seen forty PRs in a day, the mental bandwidth required to spot a subtle security flaw, a misused async/await, or a near-duplicate function is simply not available. The reviewer pattern-matches on surface signals: tests pass, description looks reasonable, author is trusted, approve.

This is not a failure of professionalism. It is how human attention works under sustained load. Automated gates do not get tired. They apply the same analysis to commit 1 and commit 1,000. The question is what specifically they catch that fatigued humans miss.

1. Hardcoded Secrets Hidden in Refactors

A common AI coding agent pattern: the agent refactors a config module, moves connection logic into a new helper, and in the process inlines a test credential it found in a comment three files away. The reviewer sees "refactor database connection handling" in the PR title, skims the diff at 4pm, approves.

import psycopg2

# Before refactor (in a comment, no one notices):
# db_url = "postgresql://admin:dev_password_123@localhost/mydb"

# After agent refactor (now in actual code):
def get_connection():
    # Hardcoded credential: a secret scanner should block this commit
    return psycopg2.connect(
        "postgresql://admin:dev_password_123@prod.internal/mydb"
    )

A pre-commit secret scanner catches this in 40 milliseconds. A tired reviewer approves it in 40 seconds.
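To make the check concrete, here is a minimal sketch of a secret scan over the added lines of a diff. The rule names, the two patterns, and the `scan_added_lines` helper are illustrative assumptions, not how any particular scanner is implemented; production tools ship far larger rule sets plus entropy checks.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration only.
SECRET_PATTERNS = {
    # scheme://user:password@host
    "credentialed_url": re.compile(r"[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@]+@\S+", re.IGNORECASE),
    # Shape of an AWS access key ID
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scan_added_lines(diff_text):
    """Return (line_number_in_diff, rule_name) for every added line that matches a rule."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the diff adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line[1:]):
                findings.append((number, rule_name))
    return findings


example_diff = '''\
+++ b/db/connection.py
+def get_connection():
+    return psycopg2.connect(
+        "postgresql://admin:dev_password_123@prod.internal/mydb"
+    )
'''
findings = scan_added_lines(example_diff)
print(findings)
```

A hook built on this idea would exit non-zero whenever `findings` is non-empty, which is what stops the commit.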

2. Dependency Additions That Bypass SCA

AI agents add dependencies without ceremony. The agent needs a utility, it runs npm install some-package --save, and the package.json change is buried in a 400-line diff. Most reviewers do not manually audit every new dependency for license conflicts, known CVEs, or malicious lifecycle hooks.

"dependencies": {
    "react": "^18.2.0",
    "axios": "^1.6.0",
+   "lodash-merge-deep": "^2.1.3",
+   "fast-xml-parser": "^4.3.0",
+   "xmldom-qsa": "^0.1.2"
  }

That third package, xmldom-qsa, is a typosquat of the legitimate xmldom. The real package has 4.2 million weekly downloads. The fake one has 12. An SCA scanner resolving against the npm registry flags it immediately. A reviewer scanning a dependency diff at the end of a long day does not.
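One way such a flag could work, sketched under assumptions: compare each newly added package name against a list of well-known packages and treat a low-download lookalike as suspicious. The `possible_typosquat_of` function, the thresholds, and the `POPULAR` set are hypothetical; a real SCA tool would pull names and download counts from the registry rather than take them as arguments.

```python
from difflib import SequenceMatcher

# Illustrative allowlist; a real scanner would use registry data.
POPULAR = {"react", "axios", "xmldom"}


def possible_typosquat_of(name, weekly_downloads, popular_packages,
                          min_downloads=1000, min_similarity=0.8):
    """Return the popular package that `name` may be imitating, or None."""
    # Well-known or widely used packages are not suspects.
    if name in popular_packages or weekly_downloads >= min_downloads:
        return None
    name_parts = name.split("-")
    for known in sorted(popular_packages):
        # Heuristic 1: a popular name embedded as one hyphen-separated part.
        embeds_known_name = known in name_parts
        # Heuristic 2: a spelling that is almost identical to a popular name.
        close_spelling = SequenceMatcher(None, name, known).ratio() >= min_similarity
        if embeds_known_name or close_spelling:
            return known
    return None


# Download counts are the ones quoted in the article.
print(possible_typosquat_of("xmldom-qsa", 12, POPULAR))
print(possible_typosquat_of("xmldom", 4_200_000, POPULAR))
```

The heuristics are intentionally crude. They produce false positives for legitimate plugins that embed a popular name, so a gate like this works best when it only examines packages that are new in the diff.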

3. Async Errors Swallowed Silently

AI coding agents are reliably inconsistent with async error handling. They write correct-looking async/await code that leaves rejections unhandled and unlogged in ways that only surface under specific runtime conditions. This class of bug consistently passes tests (because the tests use controlled inputs that do not trigger the error path) and passes human review (because the code looks syntactically correct).

// AI-generated: looks fine, reviewer approves
async function processWebhook(payload) {
  const result = await validateSignature(payload);
  // if validateSignature rejects, nothing here logs or handles the error;
  // the rejection propagates to the caller, and if the caller never awaits or
  // catches it, Node reports an unhandled rejection (fatal by default since Node 15)
  return transformPayload(result);
}

// What it should look like:
async function processWebhook(payload) {
  try {
    const result = await validateSignature(payload);
    return transformPayload(result);
  } catch (err) {
    logger.error('Webhook validation failed', { err, payloadId: payload.id });
    throw err;
  }
}

Static analysis tools with async-pattern rules catch this. Reviewers facing a large surface of async code often approve it without tracing every error path.

4. Test Coverage Theater

AI coding agents write tests efficiently, and they write tests that pass. What they write less reliably are tests that fail when they should: tests that cover the actual invariants of the code rather than the happy path with minor variations.

# AI-generated test suite: 94% coverage, all green
def test_calculate_discount():
    assert calculate_discount(100, 10) == 90

def test_calculate_discount_zero():
    assert calculate_discount(0, 10) == 0

def test_calculate_discount_full():
    assert calculate_discount(100, 100) == 0

# What is NOT tested:
# - discount > 100 (negative result)
# - negative price
# - discount = None (TypeError not caught)
# - floating point precision with large prices

Coverage threshold checks tell you the number. Branch coverage analysis tells you which branches were never exercised. A reviewer approving a PR with "94% coverage, all tests green" has no reason to dig into what the missing 6% represents. An automated branch analysis does.
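For contrast, here is a sketch of tests that target the untested cases listed above. The article does not show `calculate_discount`, so the implementation below is a hypothetical stand-in with explicit validation, and `raises` is a small helper used in place of a test framework so the example stays self-contained.

```python
def calculate_discount(price, discount_percent):
    """Hypothetical implementation; assumed for illustration only."""
    if price is None or discount_percent is None:
        raise TypeError("price and discount_percent are required")
    if price < 0:
        raise ValueError("price must not be negative")
    if not 0 <= discount_percent <= 100:
        raise ValueError("discount_percent must be between 0 and 100")
    return price * (100 - discount_percent) / 100


def raises(expected, fn, *args):
    """Return True if fn(*args) raises `expected`, False if it returns normally."""
    try:
        fn(*args)
    except expected:
        return True
    return False


# Each test fails if the corresponding guard is removed from the implementation.
def test_discount_over_100_is_rejected():
    assert raises(ValueError, calculate_discount, 100, 150)


def test_negative_price_is_rejected():
    assert raises(ValueError, calculate_discount, -5, 10)


def test_missing_discount_is_rejected():
    assert raises(TypeError, calculate_discount, 100, None)


def test_happy_path_still_holds():
    assert calculate_discount(100, 10) == 90
```

Floating point precision with large prices is left out here because the right fix (for example, decimal arithmetic) depends on how the real function represents money.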

5. Near-Duplicate Logic Accumulation

AI coding agents generate correct code for the task in front of them. They do not have a global view of the codebase. A function that formats currency values gets written three times across three modules because the agent does not know the other two exist. Each version works. None is obviously wrong. A reviewer approving each PR in isolation has no reason to flag it.

// modules/payments/utils.ts (written by agent in Sprint 12)
function formatCurrency(amount: number, currency: string): string {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency }).format(amount);
}

// modules/invoicing/helpers.ts (written by agent in Sprint 14)
function formatAmount(value: number, currencyCode: string): string {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency: currencyCode }).format(value);
}

// modules/reporting/display.ts (written by agent in Sprint 16)
const toCurrencyString = (n: number, cur: string) =>
  new Intl.NumberFormat('en-US', { style: 'currency', currency: cur }).format(n);

Six sprints later, a locale bug gets fixed in one function. The other two keep the bug. Duplication detection at the diff level catches this before it accumulates.
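A rough sketch of how near-duplicate detection can work: replace every identifier with a placeholder so renamed functions and parameters compare as equal, then measure how similar the token sequences are. The tokenizer, the `KEYWORDS` set, and the 0.85 cutoff are assumptions for illustration; real tools typically compare syntax trees, which also copes with differences such as arrow functions versus `function` declarations.

```python
import re
from difflib import SequenceMatcher

# Words kept as-is; every other identifier becomes "ID".
KEYWORDS = {"function", "return", "new", "const", "number", "string"}
IDENTIFIER = re.compile(r"[A-Za-z_$][\w$]*")
# Identifiers, single-quoted strings, the arrow token, then any other symbol.
TOKEN = re.compile(r"[A-Za-z_$][\w$]*|'[^']*'|=>|\S")


def normalized_tokens(source):
    tokens = []
    for token in TOKEN.findall(source):
        if IDENTIFIER.fullmatch(token) and token not in KEYWORDS:
            token = "ID"
        tokens.append(token)
    return tokens


def similarity(source_a, source_b):
    """Similarity ratio between 0.0 and 1.0 over normalized token sequences."""
    matcher = SequenceMatcher(None, normalized_tokens(source_a), normalized_tokens(source_b))
    return matcher.ratio()


FORMAT_CURRENCY = """
function formatCurrency(amount: number, currency: string): string {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency }).format(amount);
}
"""
FORMAT_AMOUNT = """
function formatAmount(value: number, currencyCode: string): string {
  return new Intl.NumberFormat('en-US', { style: 'currency', currency: currencyCode }).format(value);
}
"""
IS_WEEKEND = """
function isWeekend(day: number): boolean {
  return day === 0 || day === 6;
}
"""

THRESHOLD = 0.85
print(similarity(FORMAT_CURRENCY, FORMAT_AMOUNT) >= THRESHOLD)
print(similarity(FORMAT_CURRENCY, IS_WEEKEND) >= THRESHOLD)
```

The first comparison clears the threshold even though no function or parameter name matches. `IS_WEEKEND` is an unrelated function added here as a negative control.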

The compounding problem: Each of these defect classes is individually low-severity. A missing catch block is not a P0. A duplicate function is not a CVE. But in an AI-accelerated codebase where 50 PRs ship per week instead of 10, these defects compound faster than any team can manually track. The quality debt accrues invisibly until a prod incident makes it visible.

The Cognitive Load Math

Human code review has an attention budget. Research from SmartBear consistently shows that reviewers who inspect more than 200-400 lines of code per session show measurably decreased defect detection rates. The optimal review session is 60-90 minutes, under 400 lines, with clear context.

The average AI-generated PR in 2026 is 18% larger than it was in 2024. If your team ships 30 AI-assisted PRs per week and each averages 250 lines, you need 7,500 lines of review capacity per week. At SmartBear's ceiling of 400 lines per session, that is roughly 19 focused review sessions (7,500 / 400 = 18.75). Most engineering teams do not have that capacity as a dedicated activity. Review happens in 10-minute windows between meetings.
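The arithmetic is easy to rerun with your own team's numbers. This sketch uses the figures from the paragraph above; the variable names are illustrative.

```python
import math

prs_per_week = 30
lines_per_pr = 250
max_lines_per_session = 400          # upper end of the 200-400 line range
session_minutes = (60, 90)           # session length range cited above

lines_per_week = prs_per_week * lines_per_pr
sessions_needed = math.ceil(lines_per_week / max_lines_per_session)

# Total focused review time implied by that many sessions.
hours_low = sessions_needed * session_minutes[0] / 60
hours_high = sessions_needed * session_minutes[1] / 60

print(lines_per_week)         # 7500
print(sessions_needed)        # 19
print(hours_low, hours_high)  # 19.0 28.5
```

In other words, the best case is 19 to 28.5 hours of focused review per week, before counting any PR that exceeds the 400-line ceiling on its own.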

Automated gates do not replace review. They compress the surface area that requires human judgment. When a pre-commit hook has already verified no secrets were introduced, no known-vulnerable packages were added, test coverage did not drop, and no async error paths were left uncaught, the reviewer's attention is freed for the things that actually require human judgment: architecture decisions, business logic correctness, API contract changes.

The "Harness Engineering" Principle

Hashimoto's most actionable idea from his agentic workflow posts is what he calls harness engineering: when an agent makes a mistake, do not just correct it. Build a validation rule that the agent can use to self-check before producing output.

Applied to the rubber-stamp problem, this means encoding your quality expectations as machine-checkable rules at the commit layer, not as reviewer heuristics that vary by cognitive load. The rules run before the code reaches any human. By the time a reviewer sees a PR, the automated harness has already enforced the baseline.

The workflow looks like this:

#!/bin/sh
# .git/hooks/pre-commit (or pre-push, depending on team preference)

# Exit non-zero as soon as any gate fails
set -e

# 1. Secret detection
lucidshark scan secrets --staged --fail-on-detect

# 2. Dependency audit
lucidshark scan dependencies --lockfile --check-new-additions

# 3. SAST
lucidshark scan sast --staged --severity=medium

# 4. Coverage regression check
lucidshark scan coverage --threshold=80 --branch-coverage

# 5. Duplication detection on staged changes
lucidshark scan duplication --staged --similarity=0.85

# On failure the agent gets the error, self-corrects, re-commits

The agent loop becomes: write code, commit, gate runs, gate fails, agent reads the error output, agent fixes the issue, agent re-commits. The reviewer receives a PR where the automated harness has already passed. The reviewer's job is to evaluate intent and architecture, not to manually re-implement a secret scanner.
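The fail-fast behavior that drives this loop can be sketched as a small gate runner. This is an assumption-laden illustration, not LucidShark's implementation: `run_gates` and the `GATE_FAILED` and `ALL_GATES_PASSED` markers are hypothetical, and the demo gates are placeholder commands standing in for real scanners.

```python
import subprocess
import sys


def run_gates(gates):
    """Run (name, command) pairs in order and stop at the first failure.

    Returns (exit_code, report). The report is what an agent would read
    in order to correct the commit.
    """
    for name, command in gates:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            report = f"GATE_FAILED: {name}\n{result.stdout}{result.stderr}"
            return result.returncode, report
    return 0, "ALL_GATES_PASSED"


# Placeholder gates: each just runs a one-line Python program.
demo_gates = [
    ("always-passes", [sys.executable, "-c", "pass"]),
    ("always-fails", [sys.executable, "-c",
                      "import sys; print('SECRET_DETECTED: demo'); sys.exit(1)"]),
    ("never-reached", [sys.executable, "-c", "pass"]),
]

exit_code, report = run_gates(demo_gates)
print(exit_code)
print(report)
# In a real hook, finish with sys.exit(exit_code) so git aborts the commit.
```

Stopping at the first failure keeps the error output short, which matters when the consumer is an agent reading it back into its context.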

What This Looks Like in Practice

A concrete example of the loop working correctly:

1. Developer prompts Claude Code to implement a new webhook handler for Stripe events.
2. Claude Code writes the handler, writes the tests, adds stripe as a dependency, and stages the commit.
3. The pre-commit hook runs LucidShark's secret scan. It flags that the Stripe webhook secret is read from an environment variable in the code but also has a hardcoded fallback string left over from the agent's test setup.
4. The hook fails with output: SECRET_DETECTED: STRIPE_WEBHOOK_SECRET_FALLBACK in src/webhooks/stripe.ts:14
5. Claude Code reads the error, removes the fallback, uses only the environment variable, re-stages, re-commits.
6. The hook runs again. Passes. PR opens.
7. Reviewer sees: "All automated checks passed." Reviews for business logic correctness in 8 minutes instead of 25.
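Output like the hook failure in this example is useful to an agent because it is easy to parse. The sketch below assumes the line format is exactly what the example shows (rule, identifier, path, line number); the article does not document a format, so `Finding`, `FINDING_RE`, and `parse_finding` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    rule: str
    identifier: str
    path: str
    line: int


# Assumed shape: "<RULE>: <IDENTIFIER> in <path>:<line>"
FINDING_RE = re.compile(
    r"^(?P<rule>[A-Z_]+): (?P<identifier>\S+) in (?P<path>.+):(?P<line>\d+)$"
)


def parse_finding(output_line):
    """Parse one line of hook output into a Finding, or return None."""
    match = FINDING_RE.match(output_line.strip())
    if match is None:
        return None
    return Finding(match["rule"], match["identifier"], match["path"], int(match["line"]))


finding = parse_finding(
    "SECRET_DETECTED: STRIPE_WEBHOOK_SECRET_FALLBACK in src/webhooks/stripe.ts:14"
)
print(finding)
```

With the file and line number extracted, the fix is a targeted edit rather than a search through the whole diff.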

The key insight: The gate does not slow the agent down significantly. The agent's correction loop happens in seconds. What it does is move defect detection from the reviewer, who sees the defect after 48 hours in a 400-line diff, to the moment of commit, when the context is still fresh and the fix is a one-line change.

Local-First Matters Here

Cloud-based code review tools address some of this, but they introduce their own problem: they run after the commit is pushed, which is after the agent has already moved on. A cloud bot that comments on a PR 3 minutes after push is useful. A local hook that catches the defect at commit time, before the PR exists, is categorically more effective because the agent can self-correct in the same session.

There is also a data privacy argument. AI-generated code often contains work-in-progress logic, internal API structures, and business-sensitive implementations. Sending that code to a third-party cloud analysis service at every commit is a data exposure policy decision, not just a developer tooling choice. Local-first analysis runs on your machine. Nothing leaves your environment.

Finally, local tools run without network latency, without per-seat pricing that scales with team size, and without service dependencies that add failure modes to your development loop. When Anthropic had three outages in April 2026, teams whose quality gates depended on cloud AI analysis services lost their quality enforcement during the outage window. Local tools kept running.

The Practical Gate Stack

Based on the defect classes most commonly introduced by AI coding agents and most commonly missed by fatigued reviewers, a minimum viable gate stack looks like this:

Gate | What It Catches | Why Humans Miss It
Secret detection | Inlined credentials, tokens, fallback strings | Hidden in large diffs, looks like test data
Dependency SCA | New packages, CVEs, typosquats, license violations | Reviewers don't audit every new package.json entry
SAST | SQL injection, XSS, async error swallowing | Requires tracing every code path under load
Branch coverage | Untested error paths, missing edge cases | Coverage % looks fine; branches are invisible
Duplication detection | Near-duplicate functions across files | Each PR looks isolated; cross-PR context is lost

This is not a comprehensive security posture. It is a baseline that catches the five most common AI agent defect patterns without requiring any manual effort per PR.

LucidShark runs this entire stack locally, at commit time, with MCP integration for Claude Code.

When your AI agent stages a commit, LucidShark's pre-commit hooks run secret detection, SCA, SAST, coverage regression, and duplication analysis before the commit lands. Errors surface directly in the agent's context as structured output, so Claude Code can self-correct before the PR ever opens. Your reviewers see only code that has already passed the automated harness.

No cloud service. No per-seat pricing. No data leaving your environment. The gate runs in 200 ms, not 3 minutes.

Start with LucidShark for free at lucidshark.com and configure your first pre-commit gate in under five minutes.
