I built an AI code reviewer as a GitHub Action — here's what I learned

#opensource #ai #devtools #github

If you've spent time on software engineering teams, you know pull request reviews are the ultimate bottleneck. They're slow, inconsistent, and often skipped entirely under deadline pressure. Reviewers get fatigued, rubber-stamp approvals become the norm, and suddenly, subtle bugs creep into the codebase. Human review is essential for architectural alignment, but for catching obvious code smells or logical flaws, it relies heavily on mental energy we simply don't always have.

Meanwhile, LLMs have exploded in capability. They are genuinely good at reading diffs, understanding context, and pointing out issues. I kept expecting someone to release a dead-simple, plug-and-play GitHub Action that harnesses this power without requiring massive enterprise subscriptions or clunky self-hosted runners. But when I looked around, I couldn't find a lightweight, open-source tool that just works out of the box. So, I built it myself.

Enter Argus. Argus is a GitHub Action that acts as an automated first pass for pull requests. Whenever a developer opens, synchronizes, or reopens a PR, Argus triggers on the event, fetches the diff, and sends each modified file's changes to Groq's Llama 3.3 70B model. The LLM then analyzes the code for potential bugs, security vulnerabilities, or performance bottlenecks.

Instead of dumping a massive wall of text into a single comment, Argus parses the structured output from the model and posts specific, inline review comments directly on the problematic lines. It categorizes each comment with a severity label—like high, medium, or low—so developers know exactly what needs immediate attention. Setting it up is ridiculously easy. Just drop this snippet into your workflow:

name: Argus Code Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Rozer402/argus@main
        with:
          groq_api_key: ${{ secrets.GROQ_API_KEY }}
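
To make the inline-comment step described above more concrete, here is a rough Python sketch of how model findings could be turned into the request body for GitHub's "create a review for a pull request" REST endpoint (`POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews`). This is not Argus's actual code: the finding shape (`path`, `line`, `severity`, `message`), the label format, and the function name are assumptions made for illustration.

```python
# Hypothetical mapping from severity to a visible label in the comment body.
SEVERITY_LABELS = {"high": "[HIGH]", "medium": "[MEDIUM]", "low": "[LOW]"}


def build_review_payload(findings, commit_sha):
    """Build the JSON body for a single pull request review with inline comments.

    Each finding is an assumed dict: {"path", "line", "severity", "message"}.
    """
    comments = []
    for finding in findings:
        label = SEVERITY_LABELS.get(finding["severity"], "[INFO]")
        comments.append({
            "path": finding["path"],    # file path relative to the repo root
            "line": finding["line"],    # line number in the new version of the file
            "side": "RIGHT",            # comment on the "after" side of the diff
            "body": f"{label} {finding['message']}",
        })
    # "COMMENT" posts feedback without approving or requesting changes.
    return {"commit_id": commit_sha, "event": "COMMENT", "comments": comments}


example = build_review_payload(
    [{"path": "app/auth.py", "line": 12, "severity": "high",
      "message": "Hardcoded API secret; load it from the environment instead."}],
    commit_sha="abc123",
)
```

Sending all comments in one review keeps the PR timeline to a single entry instead of one notification per comment.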

What genuinely surprised me was how exceptionally good Llama 3.3 70B is at reading and understanding diffs. I was skeptical an open-weights model could handle the nuance of isolated code changes, but it proved me wrong. During testing, Argus caught a hardcoded API secret I accidentally committed, flagged a missing await on an async function that would have caused a nasty race condition, and pointed out an unused variable in my own code. It wasn't hallucinating generic advice; it provided razor-sharp, context-aware feedback that saved me from pushing broken code.

Ironically, the AI wasn't the bottleneck; the plumbing was. The hardest part of building Argus was prompt engineering the model to reliably return structured JSON output so every single comment maps to an exact line number in the GitHub PR diff. GitHub's API is strict here: if you try to post an inline comment on a line that isn't part of the diff, the API rejects the request, and unless that error is handled, the action fails. Getting the LLM to consistently return valid JSON with line numbers that match the diff took many iterations and careful fallback logic.
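
A minimal Python sketch of that kind of fallback logic, under stated assumptions: the model is asked for a JSON array of findings, each with an integer `line`, and the input is a single file's unified-diff patch that starts at its first `@@` hunk header. The function names are hypothetical and this is not taken from the Argus source.

```python
import json
import re

# Matches a unified-diff hunk header and captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def commentable_lines(patch):
    """Return the new-file line numbers that appear inside the patch's hunks."""
    lines = set()
    new_line = None
    for row in patch.splitlines():
        match = HUNK_HEADER.match(row)
        if match:
            new_line = int(match.group(1))
            continue
        if new_line is None or row.startswith("-") or row.startswith("\\"):
            continue  # before first hunk, deleted line, or "\ No newline" marker
        lines.add(new_line)  # added ("+") or context (" ") line
        new_line += 1
    return lines


def parse_findings(raw):
    """Parse the model reply as a JSON list, tolerating prose around the array."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("["), raw.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return []
    return data if isinstance(data, list) else []


def valid_findings(raw, patch):
    """Keep only findings whose line number can actually receive a comment."""
    allowed = commentable_lines(patch)
    return [
        f for f in parse_findings(raw)
        if isinstance(f, dict) and isinstance(f.get("line"), int) and f["line"] in allowed
    ]
```

Dropping a finding that points outside the diff is a deliberate trade-off in this sketch: losing one comment is cheaper than having the whole review request rejected.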

Another major hurdle was building the .argus/config.yml system. I quickly realized that different teams have wildly different tolerances for automated feedback. If the bot comments on every minor stylistic choice, developers will get annoyed and ignore it. So, I implemented a configuration system so teams can fine-tune the action's behavior directly in their repo. By setting severity thresholds (like only showing high severity issues) and ignoring specific file paths, teams can easily control the signal-to-noise ratio, which is critical for real-world usage.
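
As an illustration of the two knobs mentioned above, here is a small Python sketch that filters findings by a severity threshold and a list of ignored path patterns. The key names `min_severity` and `ignore_paths` are hypothetical, not the real .argus/config.yml schema, and the dictionary stands in for an already-parsed config file.

```python
from fnmatch import fnmatch

# Ordering used to compare severities against the configured threshold.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

# Hypothetical result of parsing .argus/config.yml (key names are assumptions).
config = {"min_severity": "high", "ignore_paths": ["docs/*", "*.lock"]}


def apply_config(findings, config):
    """Drop findings in ignored paths or below the configured severity."""
    threshold = SEVERITY_RANK[config.get("min_severity", "low")]
    ignored = config.get("ignore_paths", [])
    kept = []
    for finding in findings:
        if any(fnmatch(finding["path"], pattern) for pattern in ignored):
            continue  # path matches an ignore pattern
        if SEVERITY_RANK.get(finding["severity"], 0) < threshold:
            continue  # below the team's noise threshold
        kept.append(finding)
    return kept


findings = [
    {"path": "src/api.py", "severity": "high", "message": "Missing await"},
    {"path": "src/api.py", "severity": "low", "message": "Unused variable"},
    {"path": "docs/setup.md", "severity": "high", "message": "Token in example"},
]
visible = apply_config(findings, config)
```

With the example config, only the high-severity finding in `src/api.py` survives: the low-severity one falls under the threshold and the `docs/` file is ignored.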

If I were to build this from scratch again, I'd definitely start with the config system from day one instead of hardcoding everything. In early versions, I baked all assumptions, thresholds, and ignored paths directly into the core logic. When I started testing across different codebases, those hardcoded rules immediately broke down. Refactoring the action to read and parse a .argus/config.yml file late in the game was messy. Building with user configuration in mind right from the start would have saved me a massive amount of technical debt.

If you're tired of PRs lingering in review purgatory or just want a fast, automated second pair of eyes on your code, give Argus a shot. You can find the repo at https://github.com/Rozer402/argus. It's free, open source, and uses Groq's generous free tier, so you don't have to worry about racking up an API bill just to get quality code reviews. Drop it into your workflow, tweak the config, and let the AI do the heavy lifting.

Let your team focus on the architecture, and let Argus catch the bugs.

Top comments (4)

Alex Shev • Jun 15

The strongest use case for an AI reviewer is probably not replacing the human reviewer, but creating a first-pass risk map.

If it can point to "this file changed auth behavior," "this looks like an unchecked edge case," or "tests do not cover the changed path," then the human review starts in the right place instead of from a blank diff.

Aditya Bhusal • Jun 16

Honestly, this reframe hit harder than expected. I’ve been calling it an “AI reviewer,” which sets the wrong expectation entirely. It's not competing with the human; it's taking care of the cold start.
A dev opening a 40-file diff will spend the first 10 minutes just trying to figure out where the risk even is. If Argus gives them a map before they start, the actual review is 10x more focused.
I’m expanding this into a dashboard (DevLens) that tracks these risk signals across PRs over time: which files keep showing up as risky, and which paths consistently lack coverage.
Curious if you’ve seen any internal tooling that gets this right. Most tools I've found either go too far or stay too shallow.

Alex Shev • Jun 16

DevLens sounds like the right evolution. A risk map before review is much easier to trust than a bot pretending to be the reviewer. If it can show why a PR is risky over time, it becomes workflow intelligence instead of another comment generator.

Alex Shev • Jun 17

The best internal tools I have seen split the problem in two: deterministic checks for things that must never be subjective, and an AI reviewer for suspicious patterns, missing context, or risk narration.

The risk map works when it can point to evidence: file touched, dependency changed, migration missing, permission boundary crossed, or test coverage absent. If it only says "looks risky," people stop trusting it quickly.