Mitko Tschimev

Posted on May 17

What Reviewing AI-Generated PRs Actually Looks Like (Part 4)

#ai #codereview #automation #devops

In Parts 1-3, I covered how we built AI agents that write production code: the architecture, workflow, and CI implementation.

Part 4 is what happens after the agent pushes a PR.

This isn't theory. Here's what we learned about code review, trust, and what works (and what doesn't).

The Review Shift

Before agents

Code review covered:

Syntax and style
Lint/format compliance
Test coverage
Logic correctness
Architecture fit

Reviewers asked questions like "Did you check coverage for the new code?", "Why are the E2E tests failing?", and "Do you follow all best practices and principles?" PRs typically went through multiple review cycles.

With agents

Agent PRs arrive pre-reviewed:

Lint + format already run
Tests written (unit + integration + E2E when feasible)
Self-review checklist passed
7-perspective internal review completed
Inline comments (max 10) from internal review addressed

Reviewers focus on:

Edge cases (race conditions, payment retries)
Business logic (is this the right approach?)
Architecture boundaries (should this be a separate service?)
Rollback risk (what breaks if we revert?)

Review cycles dropped significantly.

Why Human Review Is Mandatory

Agents can't be fully trusted yet

We're working on processes that would let us guarantee the quality of agent output, but we're not there yet.

Even after passing internal review (7 perspectives, self-review, full test suite), agent PRs need human eyes.

What agents miss:

Edge cases — Race conditions, payment retry logic, cascading failures. Agents write the happy path. Reviewers ask "what breaks this?"
Architecture patterns — We use repository pattern. Agents sometimes try direct DB queries. Reviewers catch violations before merge.
Clean code principles — Naming, abstraction levels, SRP violations. Agents optimize for "works" not "maintainable."
Business context — "This works, but is this the right solution?" Agents implement the ticket. Reviewers question the approach.

The review process is a living document

We don't treat agent review as static. Every week, the team updates:

.cursorrules — Add architecture patterns agents violated
implement-SKILL.md — Expand review checklist based on what reviewers flag

Example evolution:

Week 1: Agent added hardcoded API key → added to checklist
Week 4: Agent skipped rate limiting on new endpoint → added to checklist
Week 8: Agent wrote vague PR descriptions → updated skill with required fields

The pattern: reviewers teach the agent, the agent's self-checks improve over time, and repeat issues draw fewer review comments.

This isn't "set and forget." It's continuous improvement. Agents learn from human feedback, encoded into rules and checklists.
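Some checklist items, such as the Week 1 hardcoded API key rule, can also be backed by a mechanical check. The following is a minimal sketch of that idea, not something the setup described here is known to use. The function name and the regex heuristic are assumptions for illustration, and a real secret scanner would cover many more cases.

```typescript
// Heuristic only (hypothetical): flags a quoted literal of 8+ characters assigned
// to a name that looks like a credential, e.g. apiKey = "..." or token: "...".
const SECRET_ASSIGNMENT =
  /\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*["'][^"']{8,}["']/i;

// Returns the 1-based line numbers that look like hardcoded secrets.
function findHardcodedSecrets(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (SECRET_ASSIGNMENT.test(line)) {
      hits.push(index + 1);
    }
  });
  return hits;
}
```

Reading the key from configuration (for example `process.env.API_KEY`) does not match, because no quoted literal follows the assignment.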

Trust Evolution

Phase 1: Skepticism (Early PRs)

Every agent PR got:

Line-by-line review
"Did the agent understand the requirement?" questions
Manual re-testing in local env

The team treated agent code as inherently suspicious.

Phase 2: Calibration (Mid-Volume)

The team started pattern-matching:

Agent PRs with good .pr-description.md → trust faster
Agent PRs that passed internal review with a clean status → skim, approve
Agent PRs with changes_required from security/architecture → deep review

Trust built on consistency, not individual PRs.
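The calibration rules above amount to a small triage function. This is a sketch under assumptions: the post does not show the internal review's output format, so the types, field names, and the "standard" fallback level are hypothetical. Only the `clean` and `changes_required` statuses and the security/architecture perspectives come from the text.

```typescript
// Hypothetical shapes; the real internal-review output is not shown in the post.
type Verdict = "clean" | "changes_required";

interface PerspectiveResult {
  perspective: string; // e.g. "security", "architecture", "style"
  verdict: Verdict;
}

interface AgentPr {
  hasGoodDescription: boolean; // .pr-description.md is present and complete
  internalReview: PerspectiveResult[];
}

type ReviewDepth = "skim" | "standard" | "deep";

// Perspectives whose objections always trigger a deep human review.
const HIGH_RISK_PERSPECTIVES = ["security", "architecture"];

function reviewDepth(pr: AgentPr): ReviewDepth {
  const flagged = pr.internalReview.filter((r) => r.verdict === "changes_required");
  const highRisk = flagged.some((r) => HIGH_RISK_PERSPECTIVES.indexOf(r.perspective) !== -1);
  if (highRisk) {
    return "deep";
  }
  if (flagged.length === 0 && pr.hasGoodDescription) {
    return "skim";
  }
  return "standard"; // assumption: everything else gets a normal review
}
```

Every path still ends in a human review. The function only suggests how much attention a PR needs.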

Phase 3: Routine (Later PRs)

Nobody asks "Did an agent write this?"

They ask: "Does this solve the problem?"

Review focuses on substance, not authorship.

What Changed in Practice

Agent PRs consistently arrive cleaner than human PRs:

Zero lint/format fixes required — Agent runs them before pushing
Fewer revision cycles — Agent handles mechanical issues internally
Faster reviews — Reviewers skip style checks, focus on logic
Similar (or higher) merge rates — Quality isn't worse; often better

The shift: reviewers spend less time on mechanical checks, more on architectural fit.

What Still Needs Humans

Agents handle:

✅ Syntax, style, format
✅ Test coverage (unit/integration/E2E)
✅ Self-review checklist (defined in implement-SKILL.md)
✅ Internal review (7 perspectives)

Humans handle:

❌ Edge case discovery (race conditions, retries, cascading failures)
❌ Product decisions (UX, error messages, feature scope)
❌ Architecture boundaries (when to split a service, introduce a queue)
❌ Rollback risk assessment (what breaks if we revert this?)

The shift: Agents don't replace reviewers. They shift what reviewers spend time on.

PR Description Quality Matters

Agent PRs include .pr-description.md with:

Title (H1) — becomes GitHub PR title, follows conventional commits
What changed — file-level summary
Why — links to Jira ticket + Confluence Execution Plan
Testing approach — which tests run, manual steps if needed
Rollback plan — what to revert if this breaks production
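A structure like this can be checked before the PR is opened. The sketch below assumes each topic is an H2 heading with the names shown. The post lists the topics but not the exact headings, so `REQUIRED_SECTIONS` is a guess. The list of commit types follows a widely used convention; the Conventional Commits specification itself only defines `feat` and `fix`.

```typescript
// Hypothetical heading names; adjust to whatever .pr-description.md actually uses.
const REQUIRED_SECTIONS = ["What changed", "Why", "Testing approach", "Rollback plan"];

// Conventional Commits header: type, optional (scope), optional "!", then ": description".
const CONVENTIONAL_TITLE =
  /^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([^)]+\))?!?: .+$/;

// Returns a list of problems; an empty list means the description passes.
function validatePrDescription(markdown: string): string[] {
  const problems: string[] = [];
  const lines = markdown.split("\n").map((line) => line.trim());

  // The H1 becomes the GitHub PR title.
  const h1Lines = lines.filter((line) => line.indexOf("# ") === 0);
  if (h1Lines.length === 0) {
    problems.push("missing H1 title");
  } else if (!CONVENTIONAL_TITLE.test(h1Lines[0].slice(2).trim())) {
    problems.push("title does not follow conventional commits");
  }

  // Collect the text of every H2 heading, case-insensitively.
  const h2Titles = lines
    .filter((line) => line.indexOf("## ") === 0)
    .map((line) => line.slice(3).trim().toLowerCase());
  for (const section of REQUIRED_SECTIONS) {
    if (h2Titles.indexOf(section.toLowerCase()) === -1) {
      problems.push("missing section: " + section);
    }
  }
  return problems;
}
```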

Why this matters

Reviewers read the PR description before opening the diff.

Good description → trust the approach → skim implementation → approve faster.

Bad description → question the approach → deep dive every file → request changes.

Agent-written descriptions are consistent. Human-written descriptions vary wildly.

Review Checklist Changes Over Time

We iterated on the review checklist (in implement-SKILL.md) based on what reviewers flagged most:

v1 (initial)

Code compiles
Tests pass
No lint errors

v2 (after 20 PRs)

Added: "Check for hardcoded secrets or API keys"
Added: "Verify error messages are user-friendly"

v3 (after 50 PRs)

Added: "Ensure new endpoints have rate limiting"
Added: "Check Confluence plan matches implementation"

v4 (current)

Added: "Verify observability: logs, metrics, traces"
Added: "Check for breaking changes (DB schema, API contracts)"

The pattern: Reviewers teach the agent what to self-check by updating the checklist. Over time, fewer review comments on those topics.
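The version history above can be modelled as data, which makes it easy to see what the agent was asked to self-check at any point. This is only an illustrative sketch: the checklist itself lives in implement-SKILL.md, and the `ChecklistItem` type and function names are hypothetical. The item texts are taken from the list above.

```typescript
interface ChecklistItem {
  text: string;
  addedIn: number; // checklist version that introduced the item
}

const CHECKLIST: ChecklistItem[] = [
  { text: "Code compiles", addedIn: 1 },
  { text: "Tests pass", addedIn: 1 },
  { text: "No lint errors", addedIn: 1 },
  { text: "Check for hardcoded secrets or API keys", addedIn: 2 },
  { text: "Verify error messages are user-friendly", addedIn: 2 },
  { text: "Ensure new endpoints have rate limiting", addedIn: 3 },
  { text: "Check Confluence plan matches implementation", addedIn: 3 },
  { text: "Verify observability: logs, metrics, traces", addedIn: 4 },
  { text: "Check for breaking changes (DB schema, API contracts)", addedIn: 4 },
];

// Items are only ever added, so version N contains everything up to N.
function checklistAt(version: number): string[] {
  return CHECKLIST.filter((item) => item.addedIn <= version).map((item) => item.text);
}

// Renders the checklist as Markdown task-list lines.
function renderChecklist(version: number): string {
  return checklistAt(version)
    .map((text) => "- [ ] " + text)
    .join("\n");
}
```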

What Breaks (and How We Fixed It)

Problem 1: Agent ignores architectural constraints

Example: Agent added a direct DB query in the API layer (we use the repository pattern).

Fix: Added to .cursorrules:

Repository pattern is mandatory. Never write `prisma.query()` in controllers or routes.

Result: Hasn't happened since.
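For readers unfamiliar with the rule, here is a minimal sketch of what it asks for. All names (`User`, `UserRepository`, `getUserHandler`) are hypothetical, and an in-memory class stands in for the Prisma-backed implementation, which the post does not show. The calls are synchronous for brevity; a real data-store call would return a Promise.

```typescript
interface User {
  id: string;
  email: string;
}

// The repository interface is the only way upper layers reach the data store.
interface UserRepository {
  findById(id: string): User | null;
}

// Stand-in implementation; a production version would wrap the database client here.
class InMemoryUserRepository implements UserRepository {
  private users: { [id: string]: User } = {};

  save(user: User): void {
    this.users[user.id] = user;
  }

  findById(id: string): User | null {
    return this.users[id] || null;
  }
}

interface HandlerResponse {
  status: number;
  body: User | { error: string };
}

// The handler depends on the interface and never touches the database client.
function getUserHandler(repo: UserRepository, id: string): HandlerResponse {
  const user = repo.findById(id);
  if (user === null) {
    return { status: 404, body: { error: "user not found" } };
  }
  return { status: 200, body: user };
}
```

Because the handler only sees `UserRepository`, a direct query in a controller or route stands out in review as a new dependency on the database client.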

Problem 2: Agent writes tests that pass but don't test the right thing

Example: Agent mocked the entire service layer, so the integration test never actually hit the database.

Fix: Updated implement-SKILL.md:

Integration tests must use real DB (testcontainers). No service-layer mocks in integration tests.

Result: Agent now writes meaningful integration tests.
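The failure mode is easy to reproduce in miniature. In this sketch, `FakeDb` is a hypothetical stand-in for a real database (the setup above uses testcontainers) and models only a unique-key constraint. All class and function names are assumptions for illustration.

```typescript
// Stand-in for a real database: rejects duplicate keys.
class FakeDb {
  private ids: string[] = [];

  insert(id: string): void {
    if (this.ids.indexOf(id) !== -1) {
      throw new Error("duplicate key: " + id);
    }
    this.ids.push(id);
  }

  count(): number {
    return this.ids.length;
  }
}

interface OrderService {
  placeOrder(id: string): void;
}

// Real service: writes through to the database.
class DbOrderService implements OrderService {
  private db: FakeDb;

  constructor(db: FakeDb) {
    this.db = db;
  }

  placeOrder(id: string): void {
    this.db.insert(id);
  }
}

// Code under test: maps a constraint violation to HTTP 409.
function handlePlaceOrder(service: OrderService, id: string): number {
  try {
    service.placeOrder(id);
    return 201;
  } catch {
    return 409;
  }
}

// Service-layer mock: always succeeds, so the duplicate is never detected.
const mockedService: OrderService = { placeOrder: () => undefined };
const mockedStatuses = [
  handlePlaceOrder(mockedService, "o1"),
  handlePlaceOrder(mockedService, "o1"),
];

// Real layers: the second insert hits the constraint.
const sketchDb = new FakeDb();
const realService = new DbOrderService(sketchDb);
const realStatuses = [
  handlePlaceOrder(realService, "o1"),
  handlePlaceOrder(realService, "o1"),
];
```

With the mock, both calls return 201 and the 409 path is never exercised. With the real layers, the second call returns 409 and the store holds exactly one row, which is the behavior an integration test is supposed to verify.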

Problem 3: PR descriptions were too vague

Example: "Fixed the bug" (no context on what broke or how it's fixed).

Fix: Added to implement-SKILL.md:

.pr-description.md MUST include:
- Root cause (1-2 sentences)
- What changed to fix it
- How to verify the fix

Result: PR descriptions now include enough context for reviewers to skip reading the full Jira ticket.

Key Takeaways

Human review is mandatory. Agents miss edge cases, architecture violations, and clean code principles.
Review time improves. Reviewers skip mechanical checks and focus on substance.
Revision cycles drop. Agent handles lint/format/test internally before pushing.
Trust builds gradually. Eventually the team stops caring who wrote the code.
The review process is a living document. Update .cursorrules and implement-SKILL.md weekly. Agents learn from reviewer feedback.
PR descriptions matter more than you think. Good description = faster approval.

What's Next

This concludes our 4-part series on AI task automation:

Part 1: Architecture (Jira webhooks → GitHub → Cursor agents)
Part 2: 5-stage workflow (human-in-the-loop transitions)
Part 3: CI implementation (agent chain, 7-perspective review, bug investigation)
Part 4: Code review and trust evolution (this post)

If you're building AI automation for your team, the lesson is:

Agents don't replace reviewers. They shift what reviewers spend time on.

Design for that shift. Update your checklists weekly. Let trust build gradually. And never skip human review—agents can't catch everything.

Mitko Tschimev — Technical lead at 1inch. I write about engineering leadership, architecture, and automation.

X: https://x.com/MTschimev
LinkedIn: https://linkedin.com/in/mitko-tschimev

Have you reviewed AI-generated PRs? What's the hardest part—trust, quality, or something else? Drop a comment.
