Automate Python Code Reviews with Free Local LLMs and GitHub Actions

Paying for GPT-4o or Claude API calls every time someone opens a pull request adds up quickly on a busy repo. A self-hosted Ollama instance on a machine you already own — or a GPU-enabled self-hosted GitHub Actions runner — lets you run a capable open-weight model for the cost of electricity. The result is a first-pass automated review that catches common Python issues and leaves a comment on the PR before any human reads the diff.

This is not a replacement for human review. An open-weight 7B model running locally will miss subtle concurrency bugs, architectural problems, and context it has never seen. What it reliably does is reduce the amount of low-signal noise a human reviewer has to wade through: undocumented parameters, obvious type mismatches, functions that shadow builtins, missing error handling in obvious paths. That alone is worth setting up if your team is small and review time is scarce.

The Shape of the Workflow

The basic loop has four parts: a GitHub Actions workflow triggers on pull_request, a Python script fetches the diff via the GitHub REST API, the script sends that diff to a locally running Ollama server, and the script posts the model's response back to the PR as a comment through the same API.

Here is a minimal workflow file for a self-hosted runner that has Ollama already installed and the model pre-pulled:

# .github/workflows/llm-review.yml
name: LLM Code Review

on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - '**.py'

jobs:
  review:
    runs-on: self-hosted   # requires GPU runner with Ollama installed
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Wait for Ollama
        run: curl --retry 10 --retry-delay 2 --retry-connrefused http://localhost:11434/api/tags

      - name: Run review script
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
          MODEL: qwen2.5-coder:7b
        run: python scripts/llm_review.py

The paths filter limits runs to PRs that touch Python files, which avoids burning runner time on documentation-only changes. If your runner is not persistent (for example, you spin it up on demand), replace the Wait for Ollama step with steps that install Ollama, start the server, wait for it to respond, and pull the model before the review step runs.
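The Wait for Ollama step only proves that the server answers on /api/tags; it does not prove the model in MODEL has been pulled. A minimal sketch of a fail-fast check, assuming the /api/tags response lists pulled models under a "models" key with a "name" field. The helper names fetch_tags and model_is_pulled are hypothetical, not part of the script in this article.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434"):
    """Return the parsed JSON from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as r:
        return json.load(r)


def model_is_pulled(tags, model):
    """Check whether `model` appears in an /api/tags response.

    Assumption: each entry under "models" carries a "name" such as
    "qwen2.5-coder:7b", and untagged models are listed as "<name>:latest".
    """
    names = {entry.get("name", "") for entry in tags.get("models", [])}
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names
```

Calling model_is_pulled(fetch_tags(), os.environ["MODEL"]) at the top of the review script, and exiting with a clear message when it returns False, would turn a confusing model-not-found failure into an obvious setup error.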

The Python review script does three things: fetch the diff, prompt the model, post the comment. Here is a stripped-down version:

# scripts/llm_review.py
import json
import os
import textwrap
import urllib.request

GITHUB_API = "https://api.github.com"
OLLAMA_URL = "http://localhost:11434/api/generate"

def gh(path, method="GET", body=None):
    token = os.environ["GH_TOKEN"]
    req = urllib.request.Request(
        f"{GITHUB_API}{path}",
        data=json.dumps(body).encode() if body else None,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method=method,
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

def get_diff():
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    req = urllib.request.Request(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr}",
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github.v3.diff",
        },
    )
    with urllib.request.urlopen(req) as r:
        return r.read().decode()

def ask_ollama(diff):
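    # Dedent the fixed instructions on their own, then append the diff.
    # Interpolating the diff before dedent() makes dedent a no-op, because
    # diff lines start at column 0 and leave no common indentation to strip.
    instructions = textwrap.dedent("""\
        You are a Python code reviewer. Review the following git diff for:
        - Bugs or likely runtime errors
        - Missing or incorrect type annotations
        - Functions that shadow Python builtins
        - Missing error handling in obvious paths
        - Style issues that violate PEP 8

        Be concise. List specific findings only. Do not repeat the diff back.
        If the change looks correct, say so briefly.

        DIFF:
    """)
    prompt = instructions + diff[:12000]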
    payload = {"model": os.environ["MODEL"], "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as r:
        return json.loads(r.read())["response"]

def post_comment(body):
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    gh(f"/repos/{repo}/issues/{pr}/comments", method="POST", body={"body": body})

if __name__ == "__main__":
    diff = get_diff()
    if not diff.strip():
        print("Empty diff, skipping.")
    else:
        review = ask_ollama(diff)
        post_comment(f"**LLM first-pass review** (model: `{os.environ['MODEL']}`)\n\n{review}\n\n---\n*Automated review. Not a substitute for human review.*")

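The diff is truncated at 12,000 characters before being sent to the model. For a 7B model running with a 4K–8K context window, a 40-file diff sent wholesale will be silently truncated or will produce incoherent output. At a rough three to four characters per token, 12,000 characters comes to about 3,000–4,000 tokens, which fits an 8K window comfortably, is borderline at 4K, and still covers most single-feature PRs. Note that Ollama applies its own context length (the num_ctx option), which can default to less than the model supports; if reviews seem to ignore the end of the diff, raise num_ctx in the request options or lower the character ceiling. For larger diffs, you can split by file and send one prompt per changed file, then aggregate.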
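The split-by-file approach can be sketched as follows. This is a minimal illustration, assuming the diff uses the standard "diff --git a/path b/path" file headers that GitHub returns; split_diff_by_file and review_by_file are hypothetical helpers, and ask stands in for any function that takes a diff string and returns review text, such as ask_ollama.

```python
import re
from typing import Callable, Dict


def split_diff_by_file(diff: str) -> Dict[str, str]:
    """Split a unified git diff into {path: diff text for that file}."""
    chunks: Dict[str, str] = {}
    # Split at the start of every line that begins a new file section.
    for part in re.split(r"(?m)^(?=diff --git )", diff):
        if not part.startswith("diff --git "):
            continue  # skip anything before the first file header
        header = part.splitlines()[0]
        match = re.match(r"diff --git a/(.+) b/(.+)$", header)
        path = match.group(2) if match else header
        chunks[path] = part
    return chunks


def review_by_file(diff: str, ask: Callable[[str], str],
                   limit: int = 12000, suffix: str = ".py") -> str:
    """Review each changed file separately and join the results."""
    sections = []
    for path, chunk in split_diff_by_file(diff).items():
        if not path.endswith(suffix):
            continue  # only send Python files to the model
        sections.append(f"### {path}\n\n{ask(chunk[:limit])}")
    return "\n\n".join(sections)
```

In the main block, review = review_by_file(diff, ask_ollama) would replace the single ask_ollama(diff) call. Each file then gets the full character budget, at the cost of one model call per file and no cross-file reasoning.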

Choosing a Model

Three models are worth considering for this specific task. The tradeoffs map directly to the RAM available on your runner.

qwen2.5-coder:7b is the practical default. It runs in approximately 6–7 GB of VRAM or RAM, fits on a consumer GPU (RTX 3060 or similar), and performs well on Python-focused tasks. Alibaba's Qwen2.5-Coder series was explicitly trained on code, which matters more for targeted review work than general instruction-following ability.

mistral:7b is an acceptable alternative if you already have it pulled or if you want a model with stronger general-language generation for more verbose review comments. It is not specifically trained on code, so it will miss some language-specific patterns that a coder model catches, but its instruction-following is reliable.

qwen2.5-coder:32b or similar 30B+ models produce noticeably better reviews (they can reason about multi-file interactions and catch subtler bugs) but require roughly 22–24 GB of VRAM. That means a 24 GB card at minimum, and an A100-class or multi-GPU setup if you want headroom for longer prompts, which changes the cost calculus significantly.

For a first deployment, start with qwen2.5-coder:7b. You can upgrade the model string in the workflow env var without touching anything else.
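If you manage several runners with different hardware, the choice can be written down as a small lookup. This is only a sketch: the memory thresholds are the approximate figures quoted above rather than measurements, and pick_model and MODEL_TIERS are hypothetical names.

```python
from typing import Optional

# (minimum GB of VRAM or RAM, model tag), largest first.
# Thresholds are the rough figures from this article, not benchmarks.
MODEL_TIERS = [
    (24.0, "qwen2.5-coder:32b"),
    (7.0, "qwen2.5-coder:7b"),
]


def pick_model(available_gb: float) -> Optional[str]:
    """Return the largest listed model that fits, or None if none fits."""
    for min_gb, tag in MODEL_TIERS:
        if available_gb >= min_gb:
            return tag
    return None
```

The returned tag is what you would place in the MODEL env var of the workflow; a None result means the runner is too small for either model at these thresholds.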

GitHub-hosted runners (ubuntu-latest, macos-latest) cannot reach a localhost Ollama service — the runner is an ephemeral cloud VM with no persistent local service. You must use a self-hosted runner, or install Ollama from scratch on each run and pull the model during the job. Pulling a 7B model from scratch adds 3–8 minutes to every workflow run, depending on the runner's network connection. That is acceptable for low-frequency PRs; it becomes painful on a busy repo.

Self-Hosted Runners and the Cold-Start Problem

If you run Ollama on a persistent self-hosted runner (a spare workstation, a homelab server, or a cloud VM you control), the model is already pulled and may still be loaded in memory from the previous run, so job startup time drops to a few seconds. You register the runner from the repository's Settings > Actions > Runners page (or the organization-level equivalent), and it then picks up jobs like any other runner.

The cold-start problem appears when you do not have a persistent machine. In that case, you have two options. First, install Ollama and pull the model at the start of every job:

- name: Install Ollama
  run: curl -fsSL https://ollama.com/install.sh | sh

- name: Start Ollama server
  run: ollama serve &

- name: Wait for server
  run: curl --retry 15 --retry-delay 3 --retry-connrefused http://localhost:11434/api/tags

# ollama pull talks to the running server, so it runs after the wait,
# and in the foreground so the review step cannot start before it finishes
- name: Pull model
  run: ollama pull qwen2.5-coder:7b

This works on any Linux runner but adds several minutes per run. Second, cache the model files. Ollama stores models under ~/.ollama/models by default. You can cache that directory with actions/cache keyed on the model name, which reduces subsequent pull times to a cache-restore operation — usually under 30 seconds for a warm cache. The cache approach is documented in community workflows and is the most practical path for ephemeral runners.

For GPU runners on cloud providers, actuated.dev offers GPU-enabled ephemeral runners with the NVIDIA driver pre-installed. That cuts driver setup time to roughly 30 seconds (cached) and keeps the security model of ephemeral environments while giving you access to the hardware Ollama needs for sub-minute inference on 7B models.

Honest Limits

Automated LLM review works best as a first filter, not a gate. A few specific limits to plan around:

A 7B model will miss logic bugs that require understanding the broader codebase context — any bug that requires tracing through three or four files is unlikely to be caught. The model only sees the diff, not the full project.

Hallucinated findings are real. The model will occasionally flag something as a bug that is intentional. Human reviewers need to treat the output as a checklist to consider, not a verdict to accept. Adding the disclaimer line to the posted comment (as in the script above) makes that expectation explicit.

Diff truncation silently degrades quality. If your PR changes 3,000 lines, the model sees only the first portion and nothing in the posted comment says so. You either need to split by file, raise the truncation limit (and accept worse performance on models running with a small context), or run a model configured with a longer context window, which costs more memory.
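One way to make truncation visible instead of silent is to cut on a line boundary and return a note for the posted comment. A minimal sketch, assuming the same character ceiling as the script above; truncate_diff is a hypothetical helper.

```python
from typing import Tuple


def truncate_diff(diff: str, limit: int = 12000) -> Tuple[str, str]:
    """Return (diff text to send, note for the PR comment).

    The note is an empty string when nothing was cut.
    """
    if len(diff) <= limit:
        return diff, ""
    # Cut at the last newline inside the limit so no line is sent half-finished.
    cut = diff.rfind("\n", 0, limit)
    kept = diff[: cut + 1] if cut != -1 else diff[:limit]
    kept_lines = len(kept.splitlines())
    total_lines = len(diff.splitlines())
    note = (
        f"*Diff truncated: the model saw the first {kept_lines} "
        f"of {total_lines} diff lines.*"
    )
    return kept, note
```

Appending the note to the comment body tells the human reviewer which part of the PR had no automated first pass.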

The model has no knowledge of your codebase conventions. It will not flag violations of internal style guides, project-specific API contracts, or patterns that are acceptable in your context but look wrong in isolation. A .github/REVIEW_GUIDELINES.md pasted into the system prompt can help — up to a point.
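A sketch of how the guidelines file could be supplied, assuming the request goes to Ollama's /api/generate endpoint, which accepts an optional "system" field alongside "prompt". The build_payload helper, the max_chars cap, and the wording of the system message are hypothetical; the file path is the one named above.

```python
from pathlib import Path

GUIDELINES_PATH = Path(".github/REVIEW_GUIDELINES.md")


def build_payload(prompt: str, model: str,
                  guidelines_path: Path = GUIDELINES_PATH,
                  max_chars: int = 4000) -> dict:
    """Build an /api/generate payload, adding guidelines as the system prompt.

    The guidelines are capped at max_chars because they share the context
    window with the diff. A missing file simply leaves "system" unset.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    if guidelines_path.is_file():
        guidelines = guidelines_path.read_text(encoding="utf-8")[:max_chars]
        payload["system"] = (
            "You are a Python code reviewer. "
            "Follow these project conventions:\n\n" + guidelines
        )
    return payload
```

In ask_ollama, payload = build_payload(prompt, os.environ["MODEL"]) would replace the inline dictionary. Every character of guidelines reduces the room left for the diff, so a short, specific file tends to serve better than a full style guide.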

With those limits stated, the setup described here takes an afternoon to wire together and costs nothing ongoing if you have a machine to run it on. For teams where review bandwidth is the bottleneck, filtering out even part of the review noise before a human looks at a PR is a real productivity gain.

Originally published at pickuma.com.