Four years into running a codebase with half a million lines of Python and TypeScript, my PR review process looked like this: a senior engineer skims the diff, leaves three style comments and an "LGTM," and merges. Real bugs ship anyway. The ones that make it through are always the boring kind — edge cases in error handling, cache invalidation timing, stuff nobody thinks to check.
So I put an AI agent on it. Not a GitHub Copilot suggestion. An autonomous agent that runs on every commit, reads the full diff, maintains a persistent review history, and posts comments without human prompting.
The agent is called nebula-eng-reviewer and it runs on Nebula. Here is what it found, what it missed, and what surprised me about the whole experiment.
Setup
The agent is configured via a YAML file that tells it which repository to watch, which severity levels to use, and how to track commit history so it never asks the same question twice.
    agent: nebula-eng-reviewer
    goals:
      - description: "Review every commit on thirdweb-dev/nebula-web, post inline comments"
        priority: high
    triggers:
      - type: schedule
        cron: "*/6 * * * *"
        description: "Check for new commits on open PRs"
      - type: event
        event: github:push
        description: "Trigger on push to open PRs"
The review prompt is straightforward: look for bugs, architectural risks, and security issues. Rate each finding RED (blocks merge), YELLOW (concern, discuss), or GREEN (acknowledge good fix). Post inline comments with the exact line numbers and a comment ID so the same issue does not get re-flagged on subsequent runs.
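For concreteness, here is roughly the shape of a single finding as I think about it. This is my own illustration of the severity-plus-comment-ID contract, not Nebula's actual schema:

    # Illustrative only: my mental model of one finding, not Nebula's schema.
    from dataclasses import dataclass
    from typing import Literal

    @dataclass
    class Finding:
        comment_id: str  # stable ID so later rounds can reference it instead of repeating it
        severity: Literal["RED", "YELLOW", "GREEN"]
        file: str
        line: int
        summary: str

    finding = Finding(
        comment_id="PR3223-R1-C2",
        severity="RED",
        file="api/invites.py",
        line=42,
        summary="Two users can accept the same invite simultaneously.",
    )

The part that matters is the stable comment_id: it is what lets a diff-based reviewer stay quiet about things it has already said.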
The Numbers
Over one week the agent completed 43 review rounds across three pull requests. It filed 27 issues:
- 8 RED (critical bugs, architectural risks)
- 11 YELLOW (concerns worth discussing)
- 8 GREEN (acknowledgments)
Of the 8 RED findings, 7 were confirmed by human reviewers. One was a false positive. Of the 11 YELLOW findings, 9 resulted in code changes. The agent had 2 misses — bugs that shipped through and were caught later in staging.
What the Agent Caught That Nobody Else Did
1. TOCTOU Race Condition in Invite Acceptance
PR #3223 was adding workspace features. The accept_invite endpoint checked if an invite was still valid, then accepted it. Between those two operations, another user could accept the same invite.
    async def accept_invite(invite_id: str):
        invite = await db.get(Invite, invite_id)
        if invite.status == "active":  # check
            invite.status = "accepted"  # race window here
            invite.accepted_at = datetime.utcnow()
            await db.commit()
The agent flagged this as RED — "Two users can accept the same invite simultaneously. This creates a duplicate membership." Human reviewers had been looking at the UI changes and missed it entirely.
The fix required an atomic constraint at the database level, not a code-level check.
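For anyone who wants the shape of that fix: one way to make the transition atomic is a single conditional UPDATE, so the check and the write cannot interleave. This is a sketch against the snippet above, not the actual patch, and the error type is mine:

    from datetime import datetime
    from sqlalchemy import update

    async def accept_invite(invite_id: str):
        # One atomic statement: it only matches while the invite is still
        # active, so two concurrent acceptors cannot both succeed.
        result = await db.execute(
            update(Invite)
            .where(Invite.id == invite_id, Invite.status == "active")
            .values(status="accepted", accepted_at=datetime.utcnow())
        )
        await db.commit()
        if result.rowcount == 0:
            raise InviteAlreadyAcceptedError(invite_id)  # hypothetical error type

The race check now lives inside the database's own locking rather than in application code.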
2. Detached ORM Object Across Sessions
In the same PR, update_workspace passed a SQLAlchemy object between sessions:
    async def update_workspace(workspace_id: str, data: dict):
        workspace = await db.get(Workspace, workspace_id)
        # workspace is now detached from the session
        # after this function returns
        return workspace

    # In the calling code:
    workspace = await update_workspace("ws_123", {"name": "New"})
    workspace.name = "Other"  # This silently fails — object is detached
    await db.commit()
The agent caught this on the first review round. RED: "workspace object will be detached after the function returns. Any modifications in the caller will not persist."
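The sketch below shows one way to restructure it, assuming the same AsyncSession setup as the snippet above (async_session is a stand-in for whatever sessionmaker the project uses): apply the changes inside the function, within one session scope, instead of handing a live ORM object back for the caller to mutate.

    async def update_workspace(workspace_id: str, data: dict) -> Workspace:
        async with async_session() as session:  # assumed sessionmaker
            workspace = await session.get(Workspace, workspace_id)
            for field, value in data.items():
                # Mutations happen while the object is still attached,
                # so they are part of this session's commit.
                setattr(workspace, field, value)
            await session.commit()
            await session.refresh(workspace)
            return workspace

The returned object is still detached, but by then it is a finished snapshot; nothing the caller does to it is silently expected to persist.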
3. Pointer-Capture Regression in iframes
PR #638 was redesigning the agent activity section. The agent noticed that the iframe containers were missing the pointerEvents: 'auto' style, which would cause pointer capture to break on nested interactive elements.
This was a YELLOW flag. It would have been a production incident within 48 hours if deployed.
4. TOFU Cache Migration Hazard
PR #3326 rerouted SSH connections through a gateway. The agent found that the known-hosts (TOFU) cache used bare VM names as keys. After deploy, every cached entry would point to the wrong host and break connections for all existing devices.
Red: "After this deploys, all 50+ team members will get host key mismatch errors on their next deploy."
The fix was to namespace cache keys by host. The agent found this in the first review round. It took another commit to fix.
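The namespacing itself is a one-liner. A sketch of the idea, with a key format I am guessing at rather than quoting from the patch:

    # Hypothetical helper: scope known-hosts (TOFU) cache keys to the gateway
    # they were learned through, so pre-migration entries miss cleanly instead
    # of matching the wrong host after the routing change.
    def tofu_cache_key(vm_name: str, gateway_host: str) -> str:
        return f"{gateway_host}:{vm_name}"

    # Before: "build-runner-3"
    # After:  "ssh-gateway.internal:build-runner-3"

Old entries simply stop matching, which downgrades "50+ people hit host key mismatches" to "everyone re-verifies once."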
What the Agent Got Wrong
Not every flag was a real issue. The agent flagged two things that were intentional design decisions:
- A useLayoutEffect height lock that looked like a performance bug but was actually a deliberate layout stabilization pattern.
- A flatMap multi-agent windowing pattern that the agent thought would cause ordering issues, but was correct given the eventual-consistency model.
Both were YELLOW, not RED, so they went through discussion. The agent learned from the corrections and did not re-flag them.
What the Agent Missed
This is the part most people skip. The agent missed two bugs that made it to staging:
- Session token expiry not propagated to WebSocket connections. HTTP requests correctly validated token expiry, but the WebSocket handler used a cached token reference that expired independently.
- A missing await on an async health check that caused the health endpoint to return 200 even when the database was unreachable (sketched below).
Both were caught by our staging environment, but the agent should have seen them in the diff. Neither was in the files the agent had the strongest coverage on. This tells me the review scope needs adjustment.
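The missing await deserves a closer look, because it is exactly the kind of bug that looks fine in a diff. A sketch of the failure mode, with illustrative names rather than the real endpoint:

    async def check_db() -> bool:
        ...  # pings the database; returns False when it is unreachable

    async def health():
        # BUG: without `await`, check_db() returns a coroutine object,
        # which is always truthy, so the failure branch never runs.
        if not check_db():
            return {"status": "down"}, 503
        return {"status": "ok"}, 200

    # Fix: `if not await check_db():`

A reviewer that specifically hunts for un-awaited coroutines would catch this; that is the kind of scope adjustment I mean.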
How the Review Tracker Works
The agent maintains a persistent issue tracker per PR, stored in memory. Each round, it reads previous comments and skips anything already resolved. This prevents the "same comment every 6 hours" problem that makes bot reviewers unbearable.
    PR #638 tracker:
    Round 1: 3 comments (1 RED, 1 YELLOW, 1 GREEN)
    Round 2: RED issue still open, posted follow-up
    Round 3: RED issue fixed, posted GREEN ack
    Rounds 4-10: GREEN acks on each new commit
    Final: 13 review comments, 20 issue comments
The tracker is visible in the thread — every comment has an ID, and the agent references them by number so you know whether it repeated itself or caught something new.
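I do not have the agent's internals, but the behavior matches a small per-PR map from comment ID to status. A minimal sketch of what such a tracker needs, not Nebula's actual implementation:

    from dataclasses import dataclass, field

    @dataclass
    class ReviewTracker:
        pr_number: int
        # comment_id -> "open" | "resolved"
        issues: dict[str, str] = field(default_factory=dict)

        def is_new(self, comment_id: str) -> bool:
            # Only post findings the agent has not raised before.
            return comment_id not in self.issues

        def mark_open(self, comment_id: str) -> None:
            self.issues[comment_id] = "open"

        def mark_resolved(self, comment_id: str) -> None:
            self.issues[comment_id] = "resolved"

All of the anti-noise behavior in the rounds above reduces to one check: is_new() before posting, mark_resolved() when a fix lands.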
The Real Cost
43 rounds. 27 issues. 7 confirmed criticals. 2 misses. 2 false positives.
The agent costs about $0.50 per review round in inference, so the week ran around $21 total. A senior engineer would have spent maybe 2 hours across those three PRs. The cost is similar. The coverage is different — the agent does not get tired at 11pm and it does not skip the boring files.
The biggest value was not catching bugs. It was being thorough about things nobody thought to check. The cache migration hazard in PR #3326 would have taken 30 minutes of team debugging after deploy. The agent found it in two seconds on commit one.
Takeaways After One Week
- The agent is better at cross-file analysis than any single reviewer. It reads every changed file, not just the ones you tell it to look at.
- Inline comments with persistent IDs are essential. Without them you get duplicate noise and nobody trusts the bot.
- Severity levels matter. RED means "this blocks merge." YELLOW means "discuss." GREEN means "good, acknowledge." If everything is a warning, nothing is.
- The agent misses things too. Session-scoped state and timing bugs are hard for a diff-based reviewer. You need both.
- $21 for a week of 24/7 review is a good deal. The tool is not replacing engineers. It is catching the stuff that falls through the cracks between 6pm and 9am.
The agent is still running. Next week I will test it on a larger codebase with more PRs to see if the false positive rate changes at scale.