Hopkins Jesse

Posted on May 18

I Automated My PR Reviews With AI — Saved 6 Hours/Week (Full Setup)

#automation #tutorial #productivity #ai

I used to hate reviewing pull requests. Not the coding part. The context switching.

In early 2026, I was spending about 8 hours a week just reading diffs. Most of it was boilerplate. Type updates. Minor refactors. Copy-paste errors.

It felt like busywork. So I stopped doing it manually.

I built a local agent that scans every incoming PR on my team’s main repository. It checks for logic errors, security holes, and style consistency. It posts a summary comment within 45 seconds.

I still review the complex stuff. But the noise is gone.

Here is exactly how I set it up, what broke, and the numbers that convinced me to keep it.

The Problem Wasn't Code Quality

My team ships about 40 PRs a week. We are a small startup, so everyone reviews everything.

The issue wasn't that we missed bugs. We have decent test coverage. The issue was fatigue.

By Thursday afternoon, my brain was mush. I would approve a PR just to clear my inbox. I caught three actual bugs in January because I was too tired to read the diff properly. That was unacceptable.

I tried GitHub Copilot Chat. It helped, but I had to copy-paste code into the sidebar. Then I had to copy the answer back. It added friction.

I needed something that ran automatically. Something that lived in the CI pipeline but acted like a senior dev.

The Stack: Local LLMs + Custom Scripts

I didn't want to send our proprietary code to a public API. Privacy is non-negotiable for us.

So I went local. I run a Llama-3-70b quantized model on a dedicated workstation with two RTX 4090s. It’s overkill for some, but inference speed matters here.

For the orchestration, I used Python. No fancy frameworks. Just requests, pygithub, and ollama.

The flow is simple:

GitHub webhook triggers on pull_request.opened.
Python script fetches the diff.
Script sends diff to local Ollama instance.
LLM returns a structured JSON response.
Script posts comment to PR.

I tried using LangChain initially. It was too slow. The abstraction layers added 2-3 seconds of latency per call. I stripped it down to raw HTTP requests.

The Prompt Engineering Struggle

Getting the LLM to shut up was harder than getting it to talk.

My first version wrote essays. It praised my variable naming. It suggested adding comments to obvious code. It was annoying.

I had to force it into a strict schema. I told it to only speak if it found a problem. If the code was fine, it should return an empty list.

Here is the system prompt that finally worked. I tweaked it for two weeks before it stabilized.

SYSTEM_PROMPT = """
You are a senior backend engineer. Review the following git diff.
Return ONLY a JSON object with this structure:
{
  "summary": "One sentence overview",
  "issues": [
    {
      "line_number": int,
      "severity": "high" | "medium" | "low",
      "comment": "Specific fix suggestion"
    }
  ]
}

Rules:
- Ignore formatting changes.
- Ignore test files.
- If no issues found, return empty issues list.
- Be concise. No fluff.
"""

The key was the "Ignore formatting changes" rule. Without it, the AI would flag every whitespace adjustment.

The Data: Before and After

I tracked my time for four weeks before automation and four weeks after. I used a simple Toggl track setup.

Metric	Manual Review	AI-Assisted Review	Change
Avg Time per PR	12 minutes	3 minutes	-75%
PRs Reviewed/Week	40	40	0%
Bugs Caught	4	5	+25%
Mental Fatigue Score	8/10	3/10	-62%

The "Bugs Caught" number went up slightly. Why? Because the AI caught two edge cases in database migrations that I had glossed over. It noticed a missing index on a new query.

I probably would have caught it later. But catching it in review saved us a hotfix deployment.

The mental fatigue score is subjective. But I can tell you I don't dread opening GitHub anymore.

Where It Failed (And How I Fixed It)

It wasn't all smooth sailing.

In week two, the AI started hallucinating imports. It suggested adding import os when os was already imported at the top of the file. It couldn't see the full file context, only the diff.

This was a classic context window problem.

I fixed it by changing the trigger. Instead of sending just the diff, I now send the diff plus the full file content for any file changed by more than 50%.

This increased token usage by about 30%. But it reduced false positives by 90%.

Another failure was tone. The AI was rude. It said things like "This is stupid code."

I had to add a negative constraint to the prompt: "Never use condescending language. Be professional."

Developers

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.

DEV Community