I Automated My PR Reviews With AI — Saved 6 Hours/Week

#ai #automation #productivity #tutorial

I used to hate reviewing pull requests. Not the code itself, but the administrative overhead. Checking if variable names matched our style guide. Verifying that error handling wasn't just a console.log. Making sure no one committed an .env file by accident.

It ate up about six hours of my week. That is nearly a full work day spent on nitpicking instead of building features. In early 2026, I decided to stop doing it manually.

I built a lightweight agent using local LLMs and GitHub Actions. It doesn't replace human review. It handles the boring stuff so I can focus on architecture and logic. Here is exactly how I set it up, what broke, and the numbers behind the time savings.

The Problem With "Smart" Reviewers

Most AI review tools in 2024 and 2025 were too noisy. They would flag every single line. They suggested changing let to const even when mutability was required later. They hallucinated libraries that didn't exist.

I tried three different SaaS platforms. All of them cost over $50 per developer per month. None of them understood our specific context. Our codebase uses a custom internal utility library for date formatting. The generic models kept suggesting date-fns or moment.js, which we banned three years ago.

The turning point came when I realized I didn't need a generalist. I needed a specialist that knew our repo's conventions. I also wanted something that ran locally or in our private CI pipeline to avoid sending proprietary code to public APIs.

The Stack: Local LLMs + GitHub Actions

I settled on a simple stack. No complex orchestration frameworks. No vector databases for this specific task. Just a focused prompt and a small model.

Hardware: Mac Studio M2 Ultra (32GB RAM)
Model: Llama-3-8B-Instruct (quantized to Q4_K_M)
Runner: GitHub Actions self-hosted runner
Tool: Ollama for local inference

Using an 8B parameter model might sound weak. For syntax checks and pattern matching, it is plenty. It runs in under 2 seconds on my machine. In the CI environment, it takes about 15 seconds. That is acceptable for a pre-merge check.

The key was restricting the scope. I told the AI to ignore business logic. It only checks for:

Security leaks (API keys, secrets)
Console logs in production code
Missing type definitions in TypeScript files
Deviations from our ESLint config that the linter itself misses

The Setup Process

First, I installed Ollama on our self-hosted GitHub runner. This ensures our code never leaves our infrastructure. Then I pulled the Llama-3 model.

ollama pull llama3:8b-instruct-q4_K_M
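Before wiring the model into CI, it can help to confirm the runner actually has it on disk. The sketch below is one way to do that, assuming Ollama is listening on its default local port and that its /api/tags endpoint returns a "models" list with a "name" field per entry. The function names are my own and not part of the original script.

```python
import json
import urllib.request

# Assumption: Ollama runs on its default local address on the runner
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_pulled(tags_payload, model_name):
    """Return True if model_name appears in a parsed /api/tags payload."""
    models = tags_payload.get("models", [])
    return any(entry.get("name") == model_name for entry in models)


def fetch_tags(url=OLLAMA_TAGS_URL):
    """Ask the local Ollama server which models it has on disk."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

Calling model_is_pulled(fetch_tags(), "llama3:8b-instruct-q4_K_M") at the start of the CI step lets the job fail early with a clear message instead of timing out on the first review.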

Next, I created a Python script called reviewer.py. This script reads the diff from the PR, formats it into a prompt, sends it to the local Ollama instance, and parses the response.

Here is the core logic for the prompt construction. Notice how I explicitly forbid it from rewriting code. I only want flags.

import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3:8b-instruct-q4_K_M"

def get_diff(pr_number):
    # Fetch the diff with the gh CLI; check=True raises if gh exits non-zero
    result = subprocess.run(
        ["gh", "pr", "diff", str(pr_number)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def analyze_diff(diff_content):
    prompt = f"""
    You are a senior backend engineer. Review this git diff.

    RULES:
    1. Only flag security risks, console.logs, or missing types.
    2. Ignore business logic correctness.
    3. Do not rewrite code or suggest replacements. Only flag issues.
    4. If no issues found, return empty JSON array.
    5. Output MUST be valid JSON format.

    DIFF:
    {diff_content}
    """

    # Call the local Ollama API; stream=False returns one JSON object
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)

    # The model's raw text reply lives in the "response" field
    return body["response"]
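
The function above returns the model's reply as raw text, and small models do not always return clean JSON even when told to. Here is a minimal sketch of a tolerant parser for that reply. The helper name parse_findings and the shape of each finding are my assumptions for illustration, not the exact code from reviewer.py.

```python
import json


def parse_findings(model_text):
    """Extract a list of findings from the model's raw text reply.

    Returns [] when the reply contains no parsable JSON array.
    """
    # Look for the outermost [...] so chatter around the JSON is ignored
    start = model_text.find("[")
    end = model_text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(model_text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only object entries; stray strings or numbers are not findings
    return [item for item in parsed if isinstance(item, dict)]
```

Treating an unparsable reply as "no findings" keeps the check quiet on bad output. A stricter setup could fail the step instead, which trades noise for safety.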

This script runs as a step in our GitHub Action workflow. If it finds issues, it posts them as comments on the PR. If it finds nothing, it stays silent. Silence is golden.
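
The "comment or stay silent" behavior can be sketched roughly like this. It assumes findings arrive as a list of dicts and that the gh CLI is authenticated on the runner. The keys "file", "line", and "issue" are hypothetical, as are the function names.

```python
import subprocess


def format_comment(findings):
    """Build the PR comment body, or None when there is nothing to report."""
    if not findings:
        return None
    lines = ["Automated review flags:"]
    for finding in findings:
        # "file", "line" and "issue" are hypothetical keys for illustration
        location = f"{finding.get('file', '?')}:{finding.get('line', '?')}"
        lines.append(f"- {location} {finding.get('issue', 'unspecified issue')}")
    return "\n".join(lines)


def post_comment(pr_number, findings):
    """Post one comment via the GitHub CLI; do nothing if there are no findings."""
    body = format_comment(findings)
    if body is None:
        return False
    subprocess.run(
        ["gh", "pr", "comment", str(pr_number), "--body", body],
        check=True,
    )
    return True
```

Batching everything into a single comment keeps the PR thread readable. One comment per finding would be noisier, which is the exact problem the SaaS tools had.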

The Failure Phase

My first version was a disaster. I used GPT-4o via API initially. It worked well but cost $120 in the first month for our team of five. The latency was also high. Each review took 45 seconds. Developers started complaining that the CI pipeline felt sluggish.

Switching to the local Llama-3 model solved the cost and latency. But accuracy dropped. The model started flagging valid TypeScript generics as errors. It hated our use of optional chaining.

I fixed this by adding few-shot examples to the prompt. I included three examples of "good" code that looks suspicious but is correct. This reduced false positives by 80%.
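
To show what that looks like in practice, here is a rough sketch of splicing known-good examples into the prompt ahead of the diff. The three snippets are hypothetical stand-ins, not the ones from our repo, and build_prompt is a name I made up for this illustration.

```python
# Hypothetical snippets: valid TypeScript that should NOT be flagged
FEW_SHOT_EXAMPLES = [
    "const city = user?.address?.city;",
    "function first<T>(items: T[]): T | undefined { return items[0]; }",
    "const cache = new Map<string, Promise<Response>>();",
]


def build_prompt(rules_text, diff_content, examples=FEW_SHOT_EXAMPLES):
    """Assemble the review prompt with known-good examples ahead of the diff."""
    example_block = "\n".join(
        f"Example {i} (correct, do not flag):\n{snippet}"
        for i, snippet in enumerate(examples, start=1)
    )
    return (
        f"{rules_text.strip()}\n\n"
        "CODE THAT LOOKS SUSPICIOUS BUT IS CORRECT:\n"
        f"{example_block}\n\n"
        "DIFF:\n"
        f"{diff_content}\n"
    )
```

Placing the examples between the rules and the diff means the model reads them right before the code it has to judge. Keeping them in a plain list also makes it cheap to add a new one whenever a fresh false positive shows up.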