What Reviewing AI-Generated PRs Actually Looks Like (Part 4)
In Parts 1-3, I covered how we built AI agents that write production code: the architecture, workflow, and CI implementation.
Part 4 is what happens after the agent pushes a PR.
This isn't theory. Here's what we learned about code review, trust, and what works (and what doesn't).
The Review Shift
Before agents
Code review covered:
- Syntax and style
- Lint/format compliance
- Test coverage
- Logic correctness
- Architecture fit
Reviewers asked: "Did you check coverage for the new code?" "Why are the E2E tests failing?" "Do you follow all best practices and principles?" Multiple review cycles per PR.
With agents
Agent PRs arrive pre-reviewed:
- Lint + format already run
- Tests written (unit + integration + E2E when feasible)
- Self-review checklist passed
- 7-perspective internal review completed
- Inline comments (max 10) from internal review addressed
Reviewers focus on:
- Edge cases (race conditions, payment retries)
- Business logic (is this the right approach?)
- Architecture boundaries (should this be a separate service?)
- Rollback risk (what breaks if we revert?)
Review cycles dropped significantly.
Why Human Review Is Mandatory
Agents can't be fully trusted yet
We're working on processes that would let us guarantee the quality of agent output, but we're not there yet.
Even after passing internal review (7 perspectives, self-review, full test suite), agent PRs need human eyes.
What agents miss:
Edge cases — Race conditions, payment retry logic, cascading failures. Agents write the happy path. Reviewers ask "what breaks this?"
Architecture patterns — We use repository pattern. Agents sometimes try direct DB queries. Reviewers catch violations before merge.
Clean code principles — Naming, abstraction levels, SRP violations. Agents optimize for "works" not "maintainable."
Business context — "This works, but is this the right solution?" Agents implement the ticket. Reviewers question the approach.
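To make the edge-case point concrete, here is a minimal sketch of a payment retry that looks correct on the happy path and still double-charges. Everything in it is hypothetical and only for illustration: `FakeGateway`, `GatewayTimeout`, and both `charge_*` functions are assumptions, not our payment code. The fake gateway executes the first charge and then loses the response, which is the case a reviewer asking "what breaks this?" tends to probe.

```python
import uuid


class GatewayTimeout(Exception):
    """Raised when the gateway does not answer in time."""


class FakeGateway:
    """Hypothetical gateway: the first call charges, then times out."""

    def __init__(self):
        self.charges = []      # every charge that was actually executed
        self._seen_keys = {}   # idempotency key -> charge id
        self._calls = 0

    def charge(self, amount, idempotency_key=None):
        self._calls += 1
        if idempotency_key is not None and idempotency_key in self._seen_keys:
            # Replay of a known request: return the original charge.
            return self._seen_keys[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount))
        if idempotency_key is not None:
            self._seen_keys[idempotency_key] = charge_id
        if self._calls == 1:
            # The money moved, but the response was lost.
            raise GatewayTimeout()
        return charge_id


def charge_happy_path(gateway, amount, attempts=3):
    """Retries on timeout with no idempotency key: can double-charge."""
    for _ in range(attempts):
        try:
            return gateway.charge(amount)
        except GatewayTimeout:
            continue
    raise GatewayTimeout()


def charge_idempotent(gateway, amount, attempts=3):
    """Same retry loop, but every attempt reuses one idempotency key."""
    key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return gateway.charge(amount, idempotency_key=key)
        except GatewayTimeout:
            continue
    raise GatewayTimeout()
```

Both functions return a charge id and would pass a test that only checks the return value. The difference only shows up when you count the charges recorded by the gateway.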
The review process is a living document
We don't treat agent review as static. Every week, the team updates:
- .cursorrules: Add architecture patterns agents violated
- implement-SKILL.md: Expand review checklist based on what reviewers flag
Example evolution:
- Week 1: Agent added hardcoded API key → added to checklist
- Week 4: Agent skipped rate limiting on new endpoint → added to checklist
- Week 8: Agent wrote vague PR descriptions → updated skill with required fields
The pattern: Reviewers teach the agent. Agent self-checks improve over time. Fewer review comments on repeat issues.
This isn't "set and forget." It's continuous improvement. Agents learn from human feedback, encoded into rules and checklists.
Trust Evolution
Phase 1: Skepticism (Early PRs)
Every agent PR got:
- Line-by-line review
- "Did the agent understand the requirement?" questions
- Manual re-testing in local env
The team treated agent code as inherently suspicious.
Phase 2: Calibration (Mid-Volume)
The team started pattern-matching:
- Agent PRs with good .pr-description.md → trust faster
- Agent PRs that passed internal review with clean → skim, approve
- Agent PRs with changes_required from security/architecture → deep review
Trust built on consistency, not individual PRs.
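The calibration heuristics above can be written down as a small triage function. This is a sketch under assumptions, not a tool we run: the `AgentPR` shape, the perspective names in `HIGH_RISK`, and the "standard" fallback depth are all hypothetical. Only the `clean` and `changes_required` verdicts come from the internal review described in this series, and requiring both a clean review and a good description before skimming is a conservative choice of mine.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPR:
    """Hypothetical summary of what a reviewer sees on an agent PR."""
    has_good_description: bool
    # Verdict per internal-review perspective, e.g. {"security": "clean"}.
    internal_review: dict = field(default_factory=dict)


# Perspectives whose objections always warrant a closer look (assumed names).
HIGH_RISK = {"security", "architecture"}


def review_depth(pr: AgentPR) -> str:
    """Return "deep", "standard", or "skim" for a given agent PR."""
    flagged = {
        name for name, verdict in pr.internal_review.items()
        if verdict == "changes_required"
    }
    if flagged & HIGH_RISK:
        return "deep"
    # An empty review result is treated as "not clean" rather than trusted.
    all_clean = bool(pr.internal_review) and all(
        verdict == "clean" for verdict in pr.internal_review.values()
    )
    if all_clean and pr.has_good_description:
        return "skim"
    return "standard"
```

In practice this logic lived in reviewers' heads. Writing it out mostly helps a team agree on when a skim is acceptable.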
Phase 3: Routine (Later PRs)
Nobody asks "Did an agent write this?"
They ask: "Does this solve the problem?"
Review focuses on substance, not authorship.
What Changed in Practice
Agent PRs consistently arrive cleaner than human PRs:
- Zero lint/format fixes required — Agent runs them before pushing
- Fewer revision cycles — Agent handles mechanical issues internally
- Faster reviews — Reviewers skip style checks, focus on logic
- Similar (or higher) merge rates — Quality isn't worse; often better
The shift: reviewers spend less time on mechanical checks, more on architectural fit.
What Still Needs Humans
Agents handle:
- ✅ Syntax, style, format
- ✅ Test coverage (unit/integration/E2E)
- ✅ Self-review checklist (defined in implement-SKILL.md)
- ✅ Internal review (7 perspectives)
Humans handle:
- ❌ Edge case discovery (race conditions, retries, cascading failures)
- ❌ Product decisions (UX, error messages, feature scope)
- ❌ Architecture boundaries (when to split a service, introduce a queue)
- ❌ Rollback risk assessment (what breaks if we revert this?)
The shift: Agents don't replace reviewers. They shift what reviewers spend time on.
PR Description Quality Matters
Agent PRs include .pr-description.md with:
- Title (H1) — becomes GitHub PR title, follows conventional commits
- What changed — file-level summary
- Why — links to Jira ticket + Confluence Execution Plan
- Testing approach — which tests run, manual steps if needed
- Rollback plan — what to revert if this breaks production
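A structure like this is easy to check mechanically before a human opens the PR. The sketch below assumes the fields appear as `##` headings with exactly these names and that the title follows the common conventional-commit types. Both are assumptions, since the list above names the fields but not their exact formatting, and `validate_pr_description` is a hypothetical helper.

```python
import re

# Section headings are assumed; the post lists the fields, not the exact text.
REQUIRED_SECTIONS = ["What changed", "Why", "Testing approach", "Rollback plan"]

# Conventional-commit style H1, e.g. "# fix(payments): retry on timeout".
TITLE_RE = re.compile(
    r"^# (feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w./-]+\))?!?: \S.*$"
)


def validate_pr_description(markdown: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks complete."""
    problems = []
    lines = markdown.splitlines()
    # The title is the first non-empty line of the file.
    first = next((line for line in lines if line.strip()), "")
    if not TITLE_RE.match(first):
        problems.append("title must be an H1 in conventional-commit form")
    # Collect the text of every heading at level 2 or deeper.
    headings = {
        line.lstrip("#").strip().lower()
        for line in lines
        if line.startswith("##")
    }
    for section in REQUIRED_SECTIONS:
        if section.lower() not in headings:
            problems.append(f"missing section: {section}")
    return problems
```

A description titled "# Fixed the bug" with only a "What changed" section would come back with four problems: the title plus the three missing sections.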
Why this matters
Reviewers read the PR description before opening the diff.
Good description → trust the approach → skim implementation → approve faster.
Bad description → question the approach → deep dive every file → request changes.
Agent-written descriptions are consistent. Human-written descriptions vary wildly.
Review Checklist Changes Over Time
We iterated the review checklist (in implement-SKILL.md) based on what reviewers flagged most:
v1 (initial)
- Code compiles
- Tests pass
- No lint errors
v2 (after 20 PRs)
- Added: "Check for hardcoded secrets or API keys"
- Added: "Verify error messages are user-friendly"
v3 (after 50 PRs)
- Added: "Ensure new endpoints have rate limiting"
- Added: "Check Confluence plan matches implementation"
v4 (current)
- Added: "Verify observability: logs, metrics, traces"
- Added: "Check for breaking changes (DB schema, API contracts)"
The pattern: Reviewers teach the agent what to self-check by updating the checklist. Over time, fewer review comments on those topics.
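One way to keep this history auditable is to store the checklist as data, with each version recording only what it added. The sketch below transcribes the four versions listed above. The `CHECKLIST_VERSIONS` structure and the `checklist_at` helper are hypothetical, since our checklist actually lives as prose in implement-SKILL.md.

```python
# Additions per version, transcribed from the list above.
CHECKLIST_VERSIONS = {
    "v1": ["Code compiles", "Tests pass", "No lint errors"],
    "v2": [
        "Check for hardcoded secrets or API keys",
        "Verify error messages are user-friendly",
    ],
    "v3": [
        "Ensure new endpoints have rate limiting",
        "Check Confluence plan matches implementation",
    ],
    "v4": [
        "Verify observability: logs, metrics, traces",
        "Check for breaking changes (DB schema, API contracts)",
    ],
}


def checklist_at(version: str) -> list[str]:
    """Return the cumulative checklist up to and including `version`."""
    if version not in CHECKLIST_VERSIONS:
        raise KeyError(f"unknown checklist version: {version}")
    items = []
    # Dicts preserve insertion order, so versions are visited oldest first.
    for name, added in CHECKLIST_VERSIONS.items():
        items.extend(added)
        if name == version:
            break
    return items
```

Keeping additions separate from the cumulative list makes it obvious which reviewer finding introduced which check.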
What Breaks (and How We Fixed It)
Problem 1: Agent ignores architectural constraints
Example: Agent added a direct DB query in the API layer (we use repository pattern).
Fix: Added to .cursorrules:
Repository pattern is mandatory. Never write `prisma.query()` in controllers or routes.
Result: Hasn't happened since.
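A rule in .cursorrules relies on the agent following it. A cheap complement is a CI check that fails when the rule is broken. The following is a rough sketch, not something from our pipeline: the directory names in `RESTRICTED_DIRS`, the assumption that the client is always bound to an identifier called `prisma`, and the helpers `find_violations` and `scan` are all hypothetical. It is a plain text scan, so it will miss aliased clients and multi-line expressions.

```python
import re
from pathlib import Path

# Directory names are assumptions; adjust to the real project layout.
RESTRICTED_DIRS = {"controllers", "routes"}

# Any member access on an identifier named `prisma`,
# e.g. prisma.user.findMany( or prisma.$queryRaw
PRISMA_CALL = re.compile(r"\bprisma\s*\.\s*\$?\w+")


def find_violations(path: str, source: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for each direct Prisma access."""
    if not RESTRICTED_DIRS & set(Path(path).parts):
        return []
    return [
        (path, number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        # Skip single-line comments; block comments are not handled.
        if PRISMA_CALL.search(line) and not line.lstrip().startswith("//")
    ]


def scan(root: str) -> list[tuple[str, int, str]]:
    """Walk `root` and check every TypeScript file under a restricted directory."""
    violations = []
    for file in Path(root).rglob("*.ts"):
        violations.extend(
            find_violations(str(file), file.read_text(encoding="utf-8"))
        )
    return violations
```

A real setup would more likely use a lint rule in the project's own toolchain. The point of the sketch is that the constraint can be enforced by a machine as well as stated to the agent.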
Problem 2: Agent writes tests that pass but don't test the right thing
Example: Agent mocked the entire service layer, so the integration test never hit the database.
Fix: Updated implement-SKILL.md:
Integration tests must use real DB (testcontainers). No service-layer mocks in integration tests.
Result: Agent now writes meaningful integration tests.
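The failure mode is easy to reproduce in miniature. In the sketch below, the repository contains a deliberate bug in its query. The check that mocks the service layer passes anyway, while the one that runs against a database exposes the bug. All names (`UserRepository`, `UserService`, `handler`, the two `check_*` functions) are hypothetical, and an in-memory SQLite database stands in for the real database that testcontainers would provide, purely to keep the example self-contained.

```python
import sqlite3
from unittest.mock import MagicMock


class UserRepository:
    """Repository with a deliberate bug: wrong column in the WHERE clause."""

    def __init__(self, conn):
        self.conn = conn

    def find_email(self, user_id):
        row = self.conn.execute(
            # Bug: filters on email instead of id.
            "SELECT email FROM users WHERE email = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


class UserService:
    def __init__(self, repo):
        self.repo = repo

    def contact_for(self, user_id):
        return self.repo.find_email(user_id) or "unknown"


def handler(service, user_id):
    """Stand-in for an API handler."""
    return {"contact": service.contact_for(user_id)}


def check_with_mocked_service():
    # Passes, but never touches the repository or the database.
    service = MagicMock()
    service.contact_for.return_value = "a@example.com"
    return handler(service, 1) == {"contact": "a@example.com"}


def check_with_real_database():
    # Runs the real query, so the bug becomes visible.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
    service = UserService(UserRepository(conn))
    return handler(service, 1) == {"contact": "a@example.com"}
```

The mocked check only proves that the handler forwards whatever the mock returns. That is why the rule bans service-layer mocks specifically in integration tests.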
Problem 3: PR descriptions were too vague
Example: "Fixed the bug" (no context on what broke or how it's fixed).
Fix: Added to implement-SKILL.md:
.pr-description.md MUST include:
- Root cause (1-2 sentences)
- What changed to fix it
- How to verify the fix
Result: PR descriptions now include enough context for reviewers to skip reading the full Jira ticket.
Key Takeaways
Human review is mandatory. Agents miss edge cases, architecture violations, and clean code principles. Review isn't optional.
Review time improves. Reviewers skip mechanical checks and focus on substance.
Revision cycles drop. Agent handles lint/format/test internally before pushing.
Trust builds gradually. Eventually the team stops caring who wrote the code.
The review process is a living document. Update .cursorrules and implement-SKILL.md weekly. Agents learn from reviewer feedback.
PR descriptions matter more than you think. Good description = faster approval.
What's Next
This concludes our 4-part series on AI task automation:
- Part 1: Architecture (Jira webhooks → GitHub → Cursor agents)
- Part 2: 5-stage workflow (human-in-the-loop transitions)
- Part 3: CI implementation (agent chain, 7-perspective review, bug investigation)
- Part 4: Code review and trust evolution (this post)
If you're building AI automation for your team, the lesson is:
Agents don't replace reviewers. They shift what reviewers spend time on.
Design for that shift. Update your checklists weekly. Let trust build gradually. And never skip human review—agents can't catch everything.
Mitko Tschimev — Technical lead at 1inch. I write about engineering leadership, architecture, and automation.
Have you reviewed AI-generated PRs? What's the hardest part—trust, quality, or something else? Drop a comment.