Hopkins Jesse

Posted on May 20

I Automated My PR Reviews With AI — Saved 6 Hours/Week (Full Setup)

#ai #automation #productivity #tutorial

I used to hate reviewing pull requests. Not the code itself, but the repetitive nitpicking. Checking for consistent variable naming. Verifying error handling patterns. Making sure every new function had a JSDoc comment.

It was boring work. It also took up about six hours of my week. That is time I could have spent building features or fixing actual bugs.

In early 2026, the hype around AI agents finally settled into useful tools. We moved past the "chat with your codebase" phase. We entered the "agent acts on your behalf" phase.

I decided to test if an AI agent could handle the mundane parts of my code reviews. I wanted it to catch style issues, missing tests, and documentation gaps. I did not want it to judge architecture or logic. That is still a human job.

The result was surprising. It did not replace me. But it cut my review time by 70%. Here is exactly how I set it up using open-source tools and a local LLM.

The Problem With Manual Reviews

My team follows a strict convention. We use TypeScript. We enforce functional programming patterns where possible. We require unit tests for any new business logic.

Humans are bad at consistency. I might miss a missing type definition on Tuesday because I am tired. On Thursday, I might catch it immediately. This inconsistency frustrates junior developers. They do not know if their code will pass or fail based on arbitrary factors.

Linters help. ESLint and Prettier catch syntax errors. But they cannot check semantic quality. They cannot tell if a function name matches its implementation. They cannot verify if a new API endpoint has proper error logging.

I needed a layer between the linter and my eyes. A filter that handles the checklist items. This lets me focus on the hard stuff. Does this algorithm scale? Is this security vulnerability real?

Choosing the Right Stack for 2026

By 2026, running large language models locally is trivial on modern dev machines. I have a MacBook Pro with an M3 Max chip. It handles 70B parameter models comfortably for inference.

I avoided closed APIs for two reasons. Cost and privacy. Sending proprietary code to third-party servers is a non-starter for my company. Local execution keeps everything in-house.

I selected Ollama as the runtime. It is stable and easy to integrate. For the model, I chose Llama-3.3-70B-Instruct. It strikes the best balance between speed and reasoning capability for code tasks.

For the orchestration layer, I wrote a simple Python script. It uses the GitHub API to fetch diff data. It sends the diff to the local LLM. It posts the results back as a PR comment.

You could use LangChain or LlamaIndex. I found them overkill for this specific task. A direct HTTP request to the Ollama API is faster and easier to debug.

The Implementation Details

The core logic is straightforward. Fetch the diff. Prompt the model. Parse the response.

The prompt engineering was the hardest part. Early versions were too chatty. They would praise my code or offer unsolicited architectural advice. I had to constrain the output strictly.

I forced the model to output JSON. This makes parsing reliable. If the JSON is invalid, the script retries once. If it fails again, it posts a generic error message.

Here is the system prompt I settled on after three weeks of tweaking:

SYSTEM_PROMPT = """
You are a senior code reviewer. Your job is to check for specific issues only.
Ignore architecture, design patterns, and business logic.

Check for:
1. Missing JSDoc comments on exported functions.
2. Inconsistent variable naming (camelCase vs snake_case).
3. Lack of error handling in async/await blocks.
4. Console.log statements left in production code.

Output format: JSON array of objects.
Each object must have:
- "file": string
- "line": number
- "issue": string
- "severity": "warning" or "error"

If no issues are found, return an empty array [].
Do not include any text outside the JSON.
"""

The Python script runs as a GitHub Action. It triggers on pull_request events. It only runs on diffs larger than 50 lines. Small changes do not need AI review. This saves compute resources.

Handling False Positives

The first week was rough. The AI flagged valid code as errors. It hated our custom hook patterns. It thought our error boundary wrappers were redundant.

I had to tune the temperature. I set it to 0.1. Code review needs determinism, not creativity. Higher temperatures led to hallucinated issues.

I also added a "ignore list" feature. If the AI flags a pattern we use intentionally, I add it to the config. The script skips those files or patterns in future runs.

This tuning process took about four hours. It was worth it. Now the false positive rate is under 5%. That is acceptable for a helper tool.

The Results After One Month

I tracked my time manually for four weeks. Before automation, I spent an average of 90 minutes per day on PR reviews. Most of that was scanning for minor issues.

After deployment, my daily review time dropped to 25 minutes. The AI catches the low-hanging fruit. I only step in when the AI reports nothing or flags a complex issue.

Here is the breakdown of my weekly time savings:

| Task | Time Before (Hours) |

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.

DEV Community