Hopkins Jesse

I Automated My PR Reviews With AI — Saved 6 Hours/Week (Full Setup)

I used to dread Monday mornings. Not because of the work itself, but because of the pull request backlog.

By 9:30 AM, I would have fifteen open PRs staring at me. Half were trivial style fixes. The other half were complex logic changes that required actual brain power.

I spent hours nitpicking variable names and missing semicolons. It was exhausting and unproductive.

In March 2026, I decided enough was enough. I built a local AI agent to handle the first pass of code reviews.

The result? I saved about six hours every week. More importantly, my actual review quality improved because I was focusing on architecture, not syntax.

Here is exactly how I set it up, what broke, and why you should probably do this too.

The Problem With Human Reviewers

Let’s look at the data from my team’s GitHub repo in Q1 2026.

We averaged 45 PRs per week. Each PR took me roughly 15 minutes for a first-pass review. That is 11.25 hours of pure review time every week.

But here is the kicker. About 40% of those PRs had basic issues that CI/CD pipelines missed. Things like unused imports, inconsistent error handling, or missing type definitions in TypeScript files.

I was acting as a linter with a pulse. It was a waste of my senior engineer salary.

I needed something that could read the diff, understand the context of the entire codebase, and flag logical inconsistencies before I even looked at it.

Existing tools like Copilot Chat were helpful, but they required manual prompting. I wanted automation. I wanted the bot to comment on the PR automatically when specific conditions were met.

The Stack: Local LLMs and GitHub Actions

I did not want to send our proprietary code to external APIs. Privacy concerns are real, and in 2026, most companies still ban sending core logic to public cloud LLMs.

So I went local.

I used Ollama to run Llama 3.3 70B on a dedicated Mac Studio M3 Ultra in the office. It is fast enough for inference and keeps data entirely on-premise.

For the orchestration, I wrote a Python script triggered by GitHub Actions.

The flow is simple (sketched in code right after the list):

  1. A PR is opened or updated.
  2. GitHub Action triggers the Python script.
  3. The script fetches the diff and relevant file contexts.
  4. It sends this to the local Ollama instance.
  5. The LLM generates a structured JSON response.
  6. The script posts comments back to the PR.
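
Below is a condensed sketch of that script. It is illustrative, not my production code: retries and error handling are stripped, PR_NUMBER is an environment variable I pass in from the workflow, and GITHUB_REPOSITORY is set automatically by Actions. The GitHub endpoints and Ollama's /api/generate call are the standard ones.

import json
import os

import requests

GITHUB_API = "https://api.github.com"
OLLAMA_URL = "http://localhost:11434/api/generate"
TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ["GITHUB_REPOSITORY"]   # "owner/repo", set by Actions
PR_NUMBER = os.environ["PR_NUMBER"]      # passed in from the workflow

def fetch_diff() -> str:
    # Requesting the diff media type returns the raw unified diff as text.
    resp = requests.get(
        f"{GITHUB_API}/repos/{REPO}/pulls/{PR_NUMBER}",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github.diff",
        },
    )
    resp.raise_for_status()
    return resp.text

def ask_model(system_prompt: str, user_prompt: str) -> dict:
    # format="json" tells Ollama to constrain the output to valid JSON.
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3.3:70b",
            "system": system_prompt,
            "prompt": user_prompt,
            "format": "json",
            "stream": False,
        },
        timeout=300,  # a 70B model on one box is not instant
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

def post_summary(review: dict) -> None:
    # A plain issue comment carries the one-sentence summary.
    resp = requests.post(
        f"{GITHUB_API}/repos/{REPO}/issues/{PR_NUMBER}/comments",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"body": f"AI first-pass review: {review['summary']}"},
    )
    resp.raise_for_status()

The prompts that feed ask_model are covered in the implementation section below.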

The Implementation Details

The hardest part was not the AI model. It was context management.

You cannot just dump an entire codebase into a prompt. You will hit token limits or confuse the model. I had to be smart about what I sent.

I focused on "changed files" and their "direct dependencies." If auth.ts changed, I also included user_model.ts in the context window.

Here is the core prompt structure I used. It is strict about output format to make parsing easier.

SYSTEM_PROMPT = """
You are a Senior Staff Engineer reviewing a Pull Request.
Your goal is to identify logical errors, security vulnerabilities, and performance bottlenecks.
Ignore style issues (linting handles those).

Output ONLY valid JSON in this format:
{
  "summary": "Brief 1-sentence overview",
  "critical_issues": [
    {
      "file": "path/to/file.ts",
      "line_start": 10,
      "line_end": 15,
      "severity": "high",
      "comment": "Explanation of the issue and suggested fix"
    }
  ],
  "questions": ["Clarifying questions for the author"]
}

If no critical issues are found, return empty arrays.
"""

def get_review_prompt(diff_content, related_files):
    """Build the user prompt: the diff plus the related-file context."""
    return f"""
    Here is the git diff:
    {diff_content}

    Here are the contents of related files for context:
    {related_files}

    Review the code based on the system instructions.
    """
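To turn that JSON into inline comments, the script maps each critical issue onto a review comment on the new side of the diff. This sketch reuses the constants from the orchestration snippet earlier; head_sha comes from the pull request event payload (github.event.pull_request.head.sha).

def post_inline_comments(review: dict, head_sha: str) -> None:
    # One review comment per issue, anchored to the new side of the diff.
    for issue in review.get("critical_issues", []):
        resp = requests.post(
            f"{GITHUB_API}/repos/{REPO}/pulls/{PR_NUMBER}/comments",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={
                "body": f"[{issue['severity']}] {issue['comment']}",
                "commit_id": head_sha,
                "path": issue["file"],
                "line": issue["line_end"],
                "side": "RIGHT",
            },
        )
        if resp.status_code not in (200, 201):
            # GitHub rejects lines that are not part of the diff; skip those
            # instead of failing the whole run (the model does get them wrong).
            print(f"Skipped comment on {issue['file']}: {resp.status_code}")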

I deployed this as a GitHub Actions workflow. It runs on pull_request_target events.
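
The workflow itself is short. Here is a sketch of mine, with the script path as a placeholder; the job runs on a self-hosted runner so it can reach the Ollama box on the office network.

name: AI PR Review

on:
  pull_request_target:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Run AI review
        run: python scripts/ai_review.py   # placeholder path
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}

One caveat worth knowing: pull_request_target grants the workflow write permissions on the base repository, which is what lets the bot comment on PRs from forks, but it also means you should not execute untrusted PR code in that job.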

The First Week: It Was Messy

I will be honest. The first three days were a disaster.

The model was hallucinating methods that did not exist. It complained about missing imports that were actually present in files I forgot to include in the context.

It also had a tone problem. It sounded condescending. "You clearly failed to understand the async pattern here," it wrote on one PR.

My teammate Sarah nearly quit.

I had to tweak two things. First, I improved the context retrieval logic to include parent directories for import resolution. Second, I added a tone filter to the system prompt.

I changed the instruction from "Review the code" to "Act as a helpful pair programmer. Be concise and kind."

The difference was night and day. By day five, the false positive rate dropped from 35% to under 8%.

The Results After 30 Days

I tracked the metrics for April 2026. Here is the comparison between March (manual only) and April (AI-assisted).

Metric | March 2026 | April 2026 | Change
Avg. weekly review time | 11.25 hrs | ~5.25 hrs | -6 hrs

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.
