Mitko Tschimev

What Reviewing AI-Generated PRs Actually Looks Like (Part 4)

In Parts 1-3, I covered how we built AI agents that write production code: the architecture, workflow, and CI implementation.

Part 4 is what happens after the agent pushes a PR.

This isn't theory. Here's what we learned about code review, trust, and what works (and what doesn't).

The Review Shift

Before agents

Code review covered:

  • Syntax and style
  • Lint/format compliance
  • Test coverage
  • Logic correctness
  • Architecture fit

Reviewers asked: "Did you check coverage for the new code?" "Why are the E2E tests failing?" "Does this follow our best practices and principles?" PRs typically took multiple review cycles.

With agents

Agent PRs arrive pre-reviewed:

  • Lint + format already run
  • Tests written (unit + integration + E2E when feasible)
  • Self-review checklist passed
  • 7-perspective internal review completed
  • Inline comments (max 10) from internal review addressed

Reviewers focus on:

  • Edge cases (race conditions, payment retries)
  • Business logic (is this the right approach?)
  • Architecture boundaries (should this be a separate service?)
  • Rollback risk (what breaks if we revert?)

Review cycles dropped significantly.

Why Human Review Is Mandatory

Agents can't be fully trusted yet

We're working on processes to guarantee that trust, but we're not there yet.

Even after passing internal review (7 perspectives, self-review, full test suite), agent PRs need human eyes.

What agents miss:

  1. Edge cases — Race conditions, payment retry logic, cascading failures. Agents write the happy path. Reviewers ask "what breaks this?"

  2. Architecture patterns — We use the repository pattern. Agents sometimes try direct DB queries. Reviewers catch violations before merge.

  3. Clean code principles — Naming, abstraction levels, SRP violations. Agents optimize for "works," not "maintainable."

  4. Business context — "This works, but is this the right solution?" Agents implement the ticket. Reviewers question the approach.

The review process is a living document

We don't treat agent review as static. Every week, the team updates:

  • .cursorrules — Add architecture patterns agents violated
  • implement-SKILL.md — Expand review checklist based on what reviewers flag

Example evolution:

  • Week 1: Agent added hardcoded API key → added to checklist
  • Week 4: Agent skipped rate limiting on new endpoint → added to checklist
  • Week 8: Agent wrote vague PR descriptions → updated skill with required fields

The pattern: Reviewers teach the agent. Agent self-checks improve over time. Fewer review comments on repeat issues.

This isn't "set and forget." It's continuous improvement. Agents learn from human feedback, encoded into rules and checklists.

Trust Evolution

Phase 1: Skepticism (Early PRs)

Every agent PR got:

  • Line-by-line review
  • "Did the agent understand the requirement?" questions
  • Manual re-testing in local env

The team treated agent code as inherently suspicious.

Phase 2: Calibration (Mid-Volume)

The team started pattern-matching:

  • Agent PRs with good .pr-description.md → trust faster
  • Agent PRs that passed internal review with a clean result → skim, approve
  • Agent PRs with changes_required from security/architecture → deep review

Trust built on consistency, not individual PRs.
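
A minimal sketch of how these calibration heuristics could be written down as a triage function. The `AgentPr` shape, the `reviewDepth` name, and the `"standard"` fallback are assumptions for illustration; only the `clean` and `changes_required` outcomes and the security/architecture perspectives come from the list above. Combining "clean review" with "good description" before skimming is also an assumption.

```typescript
// Hypothetical data model: the post names the outcomes but not the real types.
type Verdict = "clean" | "changes_required";

interface InternalFinding {
  perspective: string; // e.g. "security", "architecture"
  verdict: Verdict;
}

interface AgentPr {
  hasGoodDescription: boolean; // .pr-description.md covers the required sections
  findings: InternalFinding[]; // results of the agent's internal review
}

type ReviewDepth = "skim" | "standard" | "deep";

function reviewDepth(pr: AgentPr): ReviewDepth {
  // changes_required from security or architecture always triggers a deep review.
  const flaggedByRiskyPerspective = pr.findings.some(
    (f) =>
      f.verdict === "changes_required" &&
      (f.perspective === "security" || f.perspective === "architecture"),
  );
  if (flaggedByRiskyPerspective) return "deep";

  // Assumption: skim only when every perspective reported clean AND the
  // description is good. An empty findings list never counts as clean.
  const allClean =
    pr.findings.length > 0 && pr.findings.every((f) => f.verdict === "clean");
  if (allClean && pr.hasGoodDescription) return "skim";

  return "standard";
}
```

Writing the rule down makes the team's implicit pattern-matching explicit, which also makes it easier to argue about and adjust.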

Phase 3: Routine (Later PRs)

Nobody asks "Did an agent write this?"

They ask: "Does this solve the problem?"

Review focuses on substance, not authorship.

What Changed in Practice

Agent PRs consistently arrive cleaner than human PRs:

  • Zero lint/format fixes required — Agent runs them before pushing
  • Fewer revision cycles — Agent handles mechanical issues internally
  • Faster reviews — Reviewers skip style checks, focus on logic
  • Similar (or higher) merge rates — Quality isn't worse; often better

The shift: reviewers spend less time on mechanical checks, more on architectural fit.

What Still Needs Humans

Agents handle:

  • ✅ Syntax, style, format
  • ✅ Test coverage (unit/integration/E2E)
  • ✅ Self-review checklist (defined in implement-SKILL.md)
  • ✅ Internal review (7 perspectives)

Humans handle:

  • ❌ Edge case discovery (race conditions, retries, cascading failures)
  • ❌ Product decisions (UX, error messages, feature scope)
  • ❌ Architecture boundaries (when to split a service, introduce a queue)
  • ❌ Rollback risk assessment (what breaks if we revert this?)

The shift: Agents don't replace reviewers. They shift what reviewers spend time on.

PR Description Quality Matters

Agent PRs include .pr-description.md with:

  1. Title (H1) — becomes GitHub PR title, follows conventional commits
  2. What changed — file-level summary
  3. Why — links to Jira ticket + Confluence Execution Plan
  4. Testing approach — which tests run, manual steps if needed
  5. Rollback plan — what to revert if this breaks production
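
The structure above lends itself to a mechanical check before a PR is opened. The following is a rough sketch, not the actual tooling: the `##` heading names, the `validatePrDescription` function, and the list of allowed commit types (the commonly used Angular-style set) are assumptions, since the post lists the topics but not the exact file layout.

```typescript
// Assumed heading text; the post only names the topics each section covers.
const REQUIRED_SECTIONS = ["What changed", "Why", "Testing approach", "Rollback plan"];

// Simplified Conventional Commits header: type, optional (scope), optional "!",
// then ": " and a description. The type list is a common convention, not a spec requirement.
const CONVENTIONAL_TITLE =
  /^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([^)]+\))?!?: \S.*$/;

function validatePrDescription(markdown: string): string[] {
  const problems: string[] = [];
  let title: string | null = null;
  const headings: string[] = [];

  for (const line of markdown.split(/\r?\n/)) {
    const h1 = /^# (.+)$/.exec(line);
    if (h1 !== null && title === null) title = h1[1].trim(); // first H1 is the PR title
    const h2 = /^## (.+)$/.exec(line);
    if (h2 !== null) headings.push(h2[1].trim().toLowerCase());
  }

  if (title === null) {
    problems.push("missing H1 title");
  } else if (!CONVENTIONAL_TITLE.test(title)) {
    problems.push("title does not follow conventional commits");
  }

  for (const section of REQUIRED_SECTIONS) {
    if (headings.indexOf(section.toLowerCase()) === -1) {
      problems.push(`missing section: ${section}`);
    }
  }
  return problems;
}
```

An empty result means the description has the expected shape. It says nothing about whether the content is any good, which is still the reviewer's call.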

Why this matters

Reviewers read the PR description before opening the diff.

Good description → trust the approach → skim implementation → approve faster.

Bad description → question the approach → deep dive every file → request changes.

Agent-written descriptions are consistent. Human-written descriptions vary wildly.

Review Checklist Changes Over Time

We iterated the review checklist (in implement-SKILL.md) based on what reviewers flagged most:

v1 (initial)

  • Code compiles
  • Tests pass
  • No lint errors

v2 (after 20 PRs)

  • Added: "Check for hardcoded secrets or API keys"
  • Added: "Verify error messages are user-friendly"

v3 (after 50 PRs)

  • Added: "Ensure new endpoints have rate limiting"
  • Added: "Check Confluence plan matches implementation"

v4 (current)

  • Added: "Verify observability: logs, metrics, traces"
  • Added: "Check for breaking changes (DB schema, API contracts)"

The pattern: Reviewers teach the agent what to self-check by updating the checklist. Over time, fewer review comments on those topics.
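
Because every version only adds items, the checklist can be modeled as an append-only list tagged with the version that introduced each check. This is an illustrative sketch: the `ChecklistItem` shape and `checklistAt` helper are hypothetical, and the real checklist lives as prose in implement-SKILL.md.

```typescript
interface ChecklistItem {
  text: string;
  addedIn: number; // checklist version that introduced the item
}

// Items copied from the versions listed above.
const CHECKLIST: ChecklistItem[] = [
  { text: "Code compiles", addedIn: 1 },
  { text: "Tests pass", addedIn: 1 },
  { text: "No lint errors", addedIn: 1 },
  { text: "Check for hardcoded secrets or API keys", addedIn: 2 },
  { text: "Verify error messages are user-friendly", addedIn: 2 },
  { text: "Ensure new endpoints have rate limiting", addedIn: 3 },
  { text: "Check Confluence plan matches implementation", addedIn: 3 },
  { text: "Verify observability: logs, metrics, traces", addedIn: 4 },
  { text: "Check for breaking changes (DB schema, API contracts)", addedIn: 4 },
];

// Returns every check that was active at the given checklist version.
function checklistAt(version: number): string[] {
  return CHECKLIST.filter((item) => item.addedIn <= version).map((item) => item.text);
}

console.log(checklistAt(1).length); // 3
console.log(checklistAt(4).length); // 9
```

Keeping the version on each item preserves the history of why a check exists, which helps when deciding later whether a check can be retired.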

What Breaks (and How We Fixed It)

Problem 1: Agent ignores architectural constraints

Example: Agent added a direct DB query in the API layer (we use the repository pattern).

Fix: Added to .cursorrules:

Repository pattern is mandatory. Never write `prisma.query()` in controllers or routes.

Result: Hasn't happened since.
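
For readers unfamiliar with the rule, here is a minimal sketch of the layering it enforces. All names (`User`, `UserRepository`, `InMemoryUserRepository`, `getUserController`) are hypothetical and not taken from the real codebase; the in-memory class stands in for a repository that would wrap the ORM client in production.

```typescript
interface User {
  id: string;
  email: string;
}

// The repository is the only layer allowed to talk to the database.
interface UserRepository {
  findById(id: string): Promise<User | null>;
}

// In-memory stand-in so the sketch runs without a database. A production
// implementation would wrap the ORM client behind the same interface.
class InMemoryUserRepository implements UserRepository {
  private readonly users = new Map<string, User>();

  constructor(seed: User[]) {
    for (const user of seed) this.users.set(user.id, user);
  }

  async findById(id: string): Promise<User | null> {
    return this.users.get(id) ?? null;
  }
}

interface HttpResponse {
  status: number;
  body: unknown;
}

// The controller depends on the interface and never sees the DB client.
async function getUserController(repo: UserRepository, id: string): Promise<HttpResponse> {
  const user = await repo.findById(id);
  if (user === null) return { status: 404, body: { error: "User not found" } };
  return { status: 200, body: user };
}
```

A side benefit of this boundary is testability: controller tests can use the in-memory repository, while the real repository gets its own database-backed tests.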

Problem 2: Agent writes tests that pass but don't test the right thing

Example: Agent mocked the entire service layer, so the integration test never actually hit the database.

Fix: Updated implement-SKILL.md:

Integration tests must use real DB (testcontainers). No service-layer mocks in integration tests.

Result: Agent now writes meaningful integration tests.
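
A rule like this could also be backed by a cheap mechanical check. The sketch below is one possible heuristic, not something the post describes: it assumes tests use Jest- or Vitest-style module mocks (`jest.mock(...)` / `vi.mock(...)`) and that service-layer modules live under a `service/` or `services/` directory. The `findServiceLayerMocks` name is hypothetical.

```typescript
// Scans the source text of an integration test and returns the module paths
// of any mocked service-layer modules. Purely textual, so it can miss mocks
// created in other ways (manual stubs, dependency injection, etc.).
function findServiceLayerMocks(testSource: string): string[] {
  // Matches jest.mock("path") or vi.mock('path') and captures the module path.
  const moduleMock = /\b(?:jest|vi)\.mock\(\s*["'`]([^"'`]+)["'`]/g;
  // Assumed convention: service modules sit under service/ or services/.
  const servicePath = /(^|\/)services?\//;

  const offenders: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = moduleMock.exec(testSource)) !== null) {
    const modulePath = match[1];
    if (servicePath.test(modulePath)) offenders.push(modulePath);
  }
  return offenders;
}
```

Run against integration test files only, a non-empty result would be a signal for the agent (or CI) to rewrite the test against a real database instead.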

Problem 3: PR descriptions were too vague

Example: "Fixed the bug" (no context on what broke or how it's fixed).

Fix: Added to implement-SKILL.md:

.pr-description.md MUST include:
- Root cause (1-2 sentences)
- What changed to fix it
- How to verify the fix

Result: PR descriptions now include enough context for reviewers to skip reading the full Jira ticket.

Key Takeaways

  1. Human review is mandatory. Agents miss edge cases, architecture violations, and clean code principles. Review isn't optional.

  2. Review time improves. Reviewers skip mechanical checks and focus on substance.

  3. Revision cycles drop. Agent handles lint/format/test internally before pushing.

  4. Trust builds gradually. Eventually the team stops caring who wrote the code.

  5. The review process is a living document. Update .cursorrules and implement-SKILL.md weekly. Agents learn from reviewer feedback.

  6. PR descriptions matter more than you think. Good description = faster approval.

What's Next

This concludes our 4-part series on AI task automation:

  • Part 1: Architecture (Jira webhooks → GitHub → Cursor agents)
  • Part 2: 5-stage workflow (human-in-the-loop transitions)
  • Part 3: CI implementation (agent chain, 7-perspective review, bug investigation)
  • Part 4: Code review and trust evolution (this post)

If you're building AI automation for your team, the lesson is:

Agents don't replace reviewers. They shift what reviewers spend time on.

Design for that shift. Update your checklists weekly. Let trust build gradually. And never skip human review—agents can't catch everything.


Mitko Tschimev — Technical lead at 1inch. I write about engineering leadership, architecture, and automation.


Have you reviewed AI-generated PRs? What's the hardest part—trust, quality, or something else? Drop a comment.
