What I learned building a tiny PR-review bot in 10 days

#agents #ai #opensource #codereview

For the past ten days I've been building a GitHub bot that reviews pull requests against a team's own written conventions. Not a generic "linter with vibes" reviewer — a small, focused one that knows the specific rules your team has agreed to: the kind that live in CLAUDE.md files, onboarding docs, and Notion pages nobody reads anymore.

I want to write about it because, honestly, I'm a little obsessed. Not with the bot itself (which is fine, it's a sprint project), but with the shape of the problem: what it actually takes to make an LLM useful, repeatable, and measurable at a real task. There's a version of "AI tooling" right now that's all vibes and demos. There's a different version where you have to build the boring substrate first. This sprint was a tour of the boring substrate, and I came out of it more curious than when I went in.

Source is public at github.com/azaz101hassan/ai-pr-review-copilot.

What I built

The bot listens for GitHub pull_request webhooks, runs a multi-turn agent loop over the diff against a vector-indexed knowledge base of team conventions, and posts findings as inline review comments plus a single walkthrough comment. A Next.js dashboard reads from the same SQLite store so an operator can see at a glance what the bot has been doing.

The stack is deliberately boring: NestJS + BullMQ + Redis for the webhook → queue → worker pipeline, SQLite + Drizzle as the audit log, Chroma + Voyage voyage-code-3 for retrieval, Claude Haiku 4.5 as the reviewer LLM by default. The discipline is in how the pieces fit, not in what they are.

Screenshot: a real walkthrough comment the bot posted on PR #15, which had 774 changed lines, over the 500-line size gate. The bot explains its own reasoning instead of posting a low-confidence review on a PR it can't review well.
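To make the size gate described above concrete, here is a minimal sketch of how such a check could look. It is an illustration, not the bot's actual code: `MAX_CHANGED_LINES`, `gateBySize`, the `GateDecision` shape, and the message wording are all hypothetical. Only the 500-line limit and the 774-line PR come from this post.

```typescript
// Hypothetical sketch of a PR size gate. Names and message text are assumptions.
const MAX_CHANGED_LINES = 500; // the size gate mentioned in the post

type GateDecision =
  | { action: "review" }
  | { action: "skip"; explanation: string };

// Decide whether to run a full review or to explain why the PR was skipped.
function gateBySize(changedLines: number, limit: number = MAX_CHANGED_LINES): GateDecision {
  if (changedLines <= limit) {
    return { action: "review" };
  }
  return {
    action: "skip",
    explanation:
      `This PR changes ${changedLines} lines, over the ${limit}-line size gate, ` +
      `so a full review was skipped. Consider splitting it into smaller PRs.`,
  };
}

const decision = gateBySize(774);
console.log(decision.action === "skip" ? decision.explanation : "running full review");
```

The point of returning an explanation rather than a bare boolean is that the skip itself becomes something the bot can post, so the author sees why no findings arrived.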

Where I think this actually shines

The thing that surprised me most, halfway through the sprint, was that the bot got better the smaller I made its scope. So before I get to what I learned, let me tell you where I think a tool like this actually earns its keep:

Onboarding new contributors. A new teammate joins, opens a PR, and gets the same "we don't add CRUD to the database service" comment a senior would have written by hand. The bot writes the comment once; the senior gets their afternoon back.
Open-source maintainers with a house style. Repos that have written conventions (a CONTRIBUTING.md, a project-style doc, an explicit "we prefer this pattern over that one") but don't have the bandwidth to enforce them on every drive-by PR.
Codebases with architectural rules linters can't catch. "Controllers should be thin." "Repositories live behind interface tokens." "No direct process.env reads outside the config module." These are real rules in a lot of mature codebases, and they're invisible to ESLint or RuboCop.
Small, focused PRs in active teams. Where the rule-violation rate is high (because there are a lot of rules) but the diff is small enough that a reviewer's attention is the actual bottleneck.

What it explicitly doesn't try to do is replace CodeRabbit, Sourcery, or any of the great commercial reviewers. They're better at general-purpose review than I could build solo. The wedge here is "review the things commercial tools can't see because they don't know your team's specific rules."

Three things I didn't expect to learn

1. The LLM is the smallest part of the system

I went in thinking the prompt was going to be the hard part. It wasn't. The hard parts were the queue (so GitHub retries don't double-review), the audit log (so every job has a row in the database before any external call), the eval harness (so I could tell if a change made the bot better or worse), and the dashboard (so I could see what the bot had been doing without grepping logs).

The LLM is the surface users see. The data model is what makes the surface trustworthy.
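The queue and audit-log points can be sketched in a few lines. This is a simplified illustration under stated assumptions: in-memory objects stand in for the SQLite audit log and the BullMQ queue, and `ReviewIntake`, `AuditRow`, and the `acme/api` repo name are hypothetical. Keying on the delivery ID is also an assumption on my part for this sketch; GitHub does send an `X-GitHub-Delivery` header with each webhook, but a key built from the PR number and head commit would work the same way.

```typescript
// Hypothetical sketch: in-memory stand-ins for the audit log and the job queue.
type JobStatus = "queued" | "running" | "done" | "failed";

interface AuditRow {
  jobId: string;
  repo: string;
  prNumber: number;
  status: JobStatus;
}

class ReviewIntake {
  private auditLog: Record<string, AuditRow> = {};
  private queue: string[] = [];

  // Returns true if a new job was queued, false if this delivery was already recorded.
  accept(deliveryId: string, repo: string, prNumber: number): boolean {
    if (this.auditLog[deliveryId] !== undefined) {
      return false; // a retry of a delivery we already have: do nothing
    }
    // Write the audit row first, so the job is traceable before any external call.
    this.auditLog[deliveryId] = { jobId: deliveryId, repo, prNumber, status: "queued" };
    this.queue.push(deliveryId);
    return true;
  }

  pendingJobs(): number {
    return this.queue.length;
  }

  row(deliveryId: string): AuditRow | undefined {
    return this.auditLog[deliveryId];
  }
}

const intake = new ReviewIntake();
intake.accept("delivery-1", "acme/api", 15);
intake.accept("delivery-1", "acme/api", 15); // simulated retry of the same delivery
console.log(`pending jobs: ${intake.pendingJobs()}`);
```

Because the duplicate check and the audit write happen before anything is enqueued, a retried delivery never produces a second review, and every queued job already has a row to hang failures on.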

2. You need an eval thermometer, even on a 10-day sprint

Around Day 5 I almost skipped building the eval harness. "It's 10 days, I'll just dogfood the bot on my own PRs." That was the laziest instinct of the sprint, and I'm so glad I overruled it.

The harness doesn't make the bot better — that's still me, editing prompts and curating rules. What it does is tell me which of my changes moved the needle in which direction. Without it, every prompt edit is vibes. Three different times in the sprint, the harness told me a change I was sure was a win actually wasn't, or vice versa. That's the entire ROI right there.

It's dead simple: a capture script that runs real fixtures through the real API and records the output, and a score script that runs in CI and compares against thresholds. The expensive part runs when I want it to; the cheap part runs on every PR. If you build one AI tool this year, build the thermometer first.
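The cheap half of that split, the score script, might look roughly like this. Treat it as a sketch: the `FixtureResult` shape, the rule IDs, and the 0.6 threshold are hypothetical, and it assumes each fixture lists a rule ID at most once. It computes micro-averaged precision, recall, and F1 over captured outputs and compares F1 against a minimum.

```typescript
// Hypothetical fixture shape: rule IDs expected vs. rule IDs the bot flagged.
interface FixtureResult {
  fixture: string;
  expected: string[]; // rule IDs a human says should be flagged
  actual: string[]; // rule IDs the bot flagged in the captured run
}

interface Score {
  precision: number;
  recall: number;
  f1: number;
}

// Micro-averaged scoring: pool true/false positives and false negatives across fixtures.
function scoreFixtures(results: FixtureResult[]): Score {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const r of results) {
    for (const id of r.actual) {
      if (r.expected.indexOf(id) >= 0) tp++;
      else fp++;
    }
    for (const id of r.expected) {
      if (r.actual.indexOf(id) < 0) fn++;
    }
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// The CI gate: a run passes only if F1 meets the configured minimum.
function passesGate(score: Score, minF1: number): boolean {
  return score.f1 >= minF1;
}

const captured: FixtureResult[] = [
  { fixture: "fat-controller", expected: ["thin-controllers"], actual: ["thin-controllers", "no-process-env"] },
  { fixture: "env-read", expected: ["no-process-env"], actual: [] },
];
const evalScore = scoreFixtures(captured);
console.log(evalScore, passesGate(evalScore, 0.6) ? "PASS" : "FAIL");
```

In CI, the same check would typically end by exiting non-zero on a failing gate, so a prompt edit that drops the score blocks the merge instead of going unnoticed.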

3. Failure modes are a feature

The bot can fail in several interesting ways: turn cap exceeded, rate-limited, malformed model output, the model inventing a rule that wasn't in the retrieved set. Early on, all of these lived in logs and nowhere else. Day 8 was about making them visible.

Now there are three small chips on the dashboard: hallucinated findings dropped, top failure modes, cached tool calls. They render nothing when the count is zero (a clean state shouldn't look like a problem). When they light up, they tell the operator what's wrong without anyone having to open the database.

The whole point of these chips is that the bot can fail in interesting ways and the operator should know which kind of interesting. Hiding the failures behind a green "all good" badge would have been faster to ship and worse forever.
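For the "model invented a rule" failure mode, one way to make it both harmless and visible is to filter findings against the retrieved rule set and count what gets dropped. The sketch below assumes that approach; `Finding`, `dropHallucinated`, and `chipLabel` are hypothetical names, and the real bot's data model may differ.

```typescript
// Hypothetical shapes for a finding and the result of filtering.
interface Finding {
  ruleId: string;
  file: string;
  line: number;
  message: string;
}

interface FilterResult {
  kept: Finding[];
  droppedCount: number;
}

// Keep only findings that cite a rule actually present in the retrieved set.
function dropHallucinated(findings: Finding[], retrievedRuleIds: string[]): FilterResult {
  const kept = findings.filter((f) => retrievedRuleIds.indexOf(f.ruleId) >= 0);
  return { kept, droppedCount: findings.length - kept.length };
}

// A chip renders nothing (null) when its count is zero, so a clean state stays quiet.
function chipLabel(label: string, count: number): string | null {
  return count > 0 ? `${label}: ${count}` : null;
}

const modelFindings: Finding[] = [
  { ruleId: "thin-controllers", file: "users.controller.ts", line: 42, message: "Move this logic into a service." },
  { ruleId: "made-up-rule", file: "users.controller.ts", line: 57, message: "Cites a rule that was never retrieved." },
];
const filtered = dropHallucinated(modelFindings, ["thin-controllers", "no-process-env"]);
console.log(filtered.kept.length, chipLabel("hallucinated findings dropped", filtered.droppedCount));
```

The dropped count is the useful part: the bad finding never reaches the PR, but the operator still sees that the model tried to post it.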

What I haven't done

In the spirit of an honest retro:

No learning loop. Findings from real PRs don't feed back into the bot. The seed rules are small enough (~50) that human curation still wins; building a real feedback loop would have eaten the rest of the sprint.
Real-world accuracy is mediocre. On synthetic test fixtures, the bot hits ~0.72 F1. On snippets from actual past PRs, it drops to ~0.33 F1. It's good on clean diffs, worse on messy ones. The 500-line size gate keeps it out of the messy zone for now; the right next move is more curated real-PR fixtures, not threshold gymnastics.
A local-model migration is the next major piece. The whole reason I built the eval harness was to be able to swap the model and know whether the swap was a win, not just guess. I'm easing off the daily-sprint pace from here, but the substrate is in place for whenever I pick that thread back up.

What stuck with me

Every "AI feature" I've built before this one, I built like the LLM was the product. This time I built it like the LLM was a load-bearing component in a system whose other parts also had to be load-bearing. The audit log, the queue, the eval gate, the dashboard chips — none of them are AI. All of them are what makes the AI part trustworthy.

I started the sprint thinking I was going to learn about prompt engineering. I ended it thinking about queue semantics and observability surfaces. That turned out to be the much better thing to have learned.

Source at github.com/azaz101hassan/ai-pr-review-copilot. If you're building something in this shape, or if you think a bot like this would help with your team's conventions, I'd love to hear about it.