
Ayi NEDJIMI


How to Build a Self-Hosted AI Code Review Tool in Python

Every team has the same code review problem: PRs sit for days, reviewers miss subtle logic bugs, and security issues slip through because nobody carefully checked the authentication layer. Linters catch syntax and style issues, but they don't reason about intent. A language model can — and you can run it entirely on your own infrastructure without sending a single line of your source code to a third party.

This guide walks you through building a self-hosted AI code review tool in Python. It reads a git diff, sends it to a locally hosted language model, and returns structured review comments you can pipe directly into your CI workflow.

Why Self-Hosted Matters

Sending your source code to an external API is a significant trust decision. For proprietary code, regulated industries, or anything security-sensitive, you want model inference happening inside your own perimeter. Ollama handles this cleanly: it runs quantized open-weight models (GGUF format) locally and exposes an OpenAI-compatible HTTP endpoint under /v1, so the OpenAI Python SDK can talk to it by changing only the base URL. The compatibility layer covers chat completions, which is all this tool needs, but not every OpenAI feature. You get a familiar API surface with zero data egress.

The architecture is intentionally simple:

  • A Python script reads a git diff (or file path)
  • It splits the diff into manageable chunks
  • Each chunk is sent to the local LLM with a structured system prompt
  • The model returns JSON-formatted review comments
  • You aggregate and display them — or feed them into your CI gate
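
The "JSON-formatted review comments" in the fourth step are only as reliable as the model producing them. As a minimal sketch, assuming each comment is an object with severity, line, message, and fix fields (the shape the system prompt later in this guide asks for), a small validator can normalize entries before they are aggregated. The names ReviewComment and parse_comment are hypothetical and not part of the script below.

```python
from dataclasses import dataclass
from typing import Any, Optional

ALLOWED_SEVERITIES = ("critical", "warning", "suggestion")


@dataclass
class ReviewComment:
    severity: str
    message: str
    line: Optional[int] = None
    fix: Optional[str] = None


def parse_comment(raw: Any) -> Optional[ReviewComment]:
    """Return a normalized ReviewComment, or None when the entry is unusable."""
    if not isinstance(raw, dict):
        return None
    message = raw.get("message")
    if not isinstance(message, str) or not message.strip():
        return None  # a comment without a message carries no information
    severity = str(raw.get("severity", "")).lower()
    if severity not in ALLOWED_SEVERITIES:
        # Assumption: unknown severities are downgraded rather than dropped
        severity = "suggestion"
    line = raw.get("line")
    if isinstance(line, bool) or not isinstance(line, int):
        line = None  # tolerate "12", 12.0, or missing values by discarding them
    fix = raw.get("fix")
    if not isinstance(fix, str) or not fix.strip():
        fix = None
    return ReviewComment(severity, message.strip(), line, fix)
```

Downgrading unknown severities is a design choice: it keeps the finding visible without letting a malformed label block a merge.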

Setting Up

You need Python 3.11+, the openai SDK (it works against any compatible endpoint), and Ollama running locally with a code-focused model. codellama:13b works well; deepseek-coder:6.7b is faster and nearly as accurate for review tasks.

pip install openai
ollama pull deepseek-coder:6.7b

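Store your config in a .env file. The script reads these values from environment variables, so swapping models requires no code changes. Note that the script does not load the file itself: export the variables in your shell first (in bash, for example, set -a; source .env; set +a) or load them before the client is created: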

OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_API_KEY=ollama
OLLAMA_MODEL=deepseek-coder:6.7b
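
The reviewer only calls os.getenv, so something has to put the .env values into the process environment. A minimal sketch of a dependency-free loader follows, assuming simple KEY=VALUE lines with optional quotes and # comments. The function name load_env_file is hypothetical, and it does not attempt the full syntax that dedicated dotenv libraries handle.

```python
import os


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    loaded: dict[str, str] = {}
    try:
        with open(path, encoding="utf-8") as f:
            for raw_line in f:
                line = raw_line.strip()
                # Skip blanks, comments, and anything that is not a KEY=VALUE pair
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                key = key.strip()
                value = value.strip().strip("'\"")
                if key:
                    loaded[key] = value
                    # Variables already set in the shell or CI take precedence
                    os.environ.setdefault(key, value)
    except FileNotFoundError:
        pass  # no .env file: rely on the process environment alone
    return loaded
```

Call load_env_file() before any os.getenv lookup. Because it uses setdefault, values injected by CI secrets still win over the file.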

The Core Reviewer

The script reads a diff from a file argument or stdin (which makes it trivial to wire into a git hook), sends it to the model, and parses the structured output.

import os, json, sys
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.getenv("OLLAMA_API_KEY", "ollama"),
)
MODEL = os.getenv("OLLAMA_MODEL", "deepseek-coder:6.7b")

SYSTEM_PROMPT = (
    "You are a senior software engineer performing a code review.\n"
    "Analyze the provided code diff and return a JSON array of review comments.\n"
    "Each comment must have: severity (critical/warning/suggestion), "
    "line (int or null), message (str), fix (str or null).\n"
    "Return ONLY valid JSON. No prose outside the JSON array."
)

def review_diff(diff_text: str, max_chunk_chars: int = 6000) -> list[dict]:
    lines = diff_text.splitlines(keepends=True)
    chunks, current, current_len = [], [], 0
    for line in lines:
        if current_len + len(line) > max_chunk_chars and current:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))

    all_comments = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Review this diff:\n\n{chunk}"},
            ],
            temperature=0.1,
            max_tokens=1024,
        )
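        # content can be None (for example, an empty completion), so guard before stripping
        raw = (response.choices[0].message.content or "").strip()
        # Models often wrap JSON in a Markdown code fence; strip it before parsing
        if raw.startswith("`"):
            raw = raw.strip("`").removeprefix("json").strip()
        try:
            comments = json.loads(raw)
        except json.JSONDecodeError:
            # Report the failure instead of silently dropping this chunk's review
            print("warning: model returned invalid JSON for a chunk; skipping", file=sys.stderr)
            continue
        if isinstance(comments, list):
            # Keep only JSON objects so callers can rely on dict access
            all_comments.extend(c for c in comments if isinstance(c, dict))
    return all_comments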

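SEVERITY_ORDER = {"critical": 0, "warning": 1, "suggestion": 2}

def severity_rank(comment: dict) -> int:
    # Unknown or missing severities sort last instead of raising ValueError
    return SEVERITY_ORDER.get(str(comment.get("severity")).lower(), len(SEVERITY_ORDER))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Use a context manager so the file handle is closed promptly
        with open(sys.argv[1], encoding="utf-8") as f:
            diff = f.read()
    else:
        diff = sys.stdin.read()
    # Ignore entries that are not JSON objects so .get() is always safe
    comments = [c for c in review_diff(diff) if isinstance(c, dict)]
    has_critical = False
    for c in sorted(comments, key=severity_rank):
        severity = str(c.get("severity") or "?").upper()
        line = c.get("line")
        print(f"[{severity}] line {line if line is not None else '?'}: {c.get('message', '')}")
        if c.get("fix"):
            print(f"{c['fix']}\n")
        if severity == "CRITICAL":
            has_critical = True
    sys.exit(1 if has_critical else 0)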

The script exits with code 1 if any critical issue is found, making it trivial to use as a blocking pre-push hook or CI gate.
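
To make the pre-push idea concrete, here is a sketch of a hook written in Python. It assumes the reviewer is saved as reviewer.py in the repository root (the name used in the CI step later in this guide) and that origin/main is the branch you compare against. Both are assumptions to adjust, and the function names are hypothetical.

```python
import subprocess
import sys


def outgoing_diff(base: str = "origin/main") -> str:
    """Return the diff between the merge base with `base` and HEAD."""
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails, rather than reviewing nothing
    )
    return result.stdout


def run_reviewer(diff_text: str, reviewer: str = "reviewer.py") -> int:
    """Pipe the diff to the reviewer on stdin and return its exit code."""
    if not diff_text.strip():
        return 0  # nothing to review, so let the push through
    proc = subprocess.run([sys.executable, reviewer], input=diff_text, text=True)
    return proc.returncode


def main() -> int:
    return run_reviewer(outgoing_diff())
```

Saved as .git/hooks/pre-push with a python3 shebang, an executable bit, and a final sys.exit(main()) line, a non-zero return blocks the push. Running git push --no-verify skips the hook when you need to bypass it.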

Integrating into CI

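For GitHub Actions, run the reviewer on every pull request diff. The OLLAMA_BASE_URL secret must point at an endpoint the runner can reach: a GitHub-hosted runner such as ubuntu-latest cannot see a private Ollama instance, so to keep the diff inside your network, use a self-hosted runner and change runs-on accordingly: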

name: AI Code Review
on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install openai
      - name: Run AI review
        env:
          OLLAMA_BASE_URL: ${{ secrets.OLLAMA_BASE_URL }}
          OLLAMA_API_KEY: ${{ secrets.OLLAMA_API_KEY }}
          OLLAMA_MODEL: deepseek-coder:6.7b
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr.diff
          python reviewer.py pr.diff

For self-hosted CI (Gitea Actions, GitLab CI, Jenkins), point OLLAMA_BASE_URL at your internal Ollama instance. The runner needs network access to it, but nothing leaves your perimeter. If your Ollama node lives on a private subnet, use a dedicated runner in that subnet rather than routing through a proxy.

Hardening the Prompt for Security Review

The default prompt covers general code quality. When you want security-focused output — useful as a pre-merge gate on sensitive services — specialize the system prompt:

SECURITY_PROMPT = (
    "You are a security-focused code reviewer.\n"
    "Flag only security vulnerabilities: injection flaws, auth bypasses, "
    "insecure deserialization, hardcoded credentials, missing input validation, "
    "race conditions, and OWASP Top 10 patterns.\n"
    "Return a JSON array: [{severity, cwe, line, message, fix}]. "
    "Return ONLY valid JSON."
)

Swap this in for SYSTEM_PROMPT. The cwe field is useful if you want to integrate findings with a vulnerability tracker or feed them into a risk scoring pipeline.
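
Before the cwe field reaches a tracker, it helps to normalize it, since a model may write the same identifier as 89, "89", "cwe-89", or "CWE 89". The sketch below is one way to do that. The helper name normalize_cwe is hypothetical, and it only fixes formatting: it cannot tell whether the ID actually matches the finding.

```python
import re
from typing import Any, Optional

# Accepts an optional "CWE" prefix followed by digits and nothing else
_CWE_PATTERN = re.compile(r"^\s*(?:CWE[-_ ]?)?(\d+)\s*$", re.IGNORECASE)


def normalize_cwe(value: Any) -> Optional[str]:
    """Return 'CWE-<n>' for recognizable inputs, otherwise None."""
    if value is None or isinstance(value, bool):
        return None
    match = _CWE_PATTERN.match(str(value))
    if not match:
        return None  # free text such as "SQL injection" is rejected, not guessed
    return f"CWE-{int(match.group(1))}"
```

The pattern is deliberately strict: a value with trailing prose is rejected rather than partially parsed, so questionable entries surface for manual review.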

Keep in mind that language models produce false positives at a non-trivial rate. Treat this layer as a fast first-pass triage, not a substitute for manual review. For a structured view of what to actually check before shipping to production, our security hardening checklists cover the most common vulnerability classes by language and framework.

Splitting Large Diffs by File

Chunking by character count works, but it can split a file mid-hunk and confuse the model. Splitting by file boundary gives better results:

import re

def split_diff_by_file(diff_text: str) -> list[str]:
    parts = re.split(r'(?=^diff --git )', diff_text, flags=re.MULTILINE)
    return [p for p in parts if p.strip()]

def review_all_files(diff_text: str) -> list[dict]:
    all_comments = []
    for file_diff in split_diff_by_file(diff_text):
        all_comments.extend(review_diff(file_diff))
    return all_comments

For very large files (300+ changed lines), further split on the @@ hunk markers. As a rule of thumb, review quality tends to degrade once a single request carries more than roughly 4000 tokens of diff, and smaller, focused chunks usually produce better output than one large dump.
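
A sketch of that hunk-level split, assuming unified diff input for a single file. The function name split_file_diff_by_hunk is hypothetical. It repeats the file header on every piece so the model still knows which file each hunk belongs to.

```python
import re


def split_file_diff_by_hunk(file_diff: str) -> list[str]:
    """Split one file's diff into per-hunk pieces, repeating the file header on each."""
    # Zero-width lookahead keeps the "@@ " marker at the start of each hunk
    parts = re.split(r"(?=^@@ )", file_diff, flags=re.MULTILINE)
    header, hunks = parts[0], parts[1:]
    if not hunks:
        # No hunk markers (for example, a binary or rename-only entry): keep as-is
        return [file_diff] if file_diff.strip() else []
    return [header + hunk for hunk in hunks]
```

Each returned piece can be passed to review_diff on its own. Line numbers in the @@ headers are preserved, which gives the model a better chance of reporting a meaningful line value.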

The Takeaway

Self-hosted AI code review earns its place in the pipeline as a fast, cheap first-pass filter. It catches common patterns — missing error handling, SQL queries built with f-strings, hardcoded secrets, unvalidated user input — before a human reviewer ever opens the PR. The setup is lightweight: Ollama, one Python file, a CI step.

What it won't replace: architectural review, business logic validation, and nuanced security analysis that requires understanding your domain. The model doesn't know your codebase's invariants or threat model. But for the low-hanging fruit, it consistently earns its keep.

From here, you can extend this foundation: add a SQLite store to track comment trends over time, wire up the GitHub Reviews API to post inline comments on the PR diff, or build a prompt library with different reviewer personas (security, performance, readability). The pattern is solid — the specialization is up to you.
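
As one possible starting point for the SQLite idea, the sketch below stores each comment with a timestamp and a caller-supplied reference, then counts findings by severity. The table layout, the ref label, and the function names are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = "reviews.db") -> sqlite3.Connection:
    """Open (or create) the comment store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comments (
               id INTEGER PRIMARY KEY,
               reviewed_at TEXT NOT NULL,
               ref TEXT NOT NULL,
               severity TEXT NOT NULL,
               line INTEGER,
               message TEXT NOT NULL
           )"""
    )
    return conn


def record_comments(conn: sqlite3.Connection, ref: str, comments: list) -> int:
    """Insert the review comments for one run; returns the number of rows stored."""
    now = datetime.now(timezone.utc).isoformat()
    rows = []
    for c in comments:
        if not isinstance(c, dict):
            continue  # skip malformed entries from the model
        line = c.get("line")
        rows.append((
            now,
            ref,
            str(c.get("severity", "suggestion")).lower(),
            line if isinstance(line, int) else None,
            str(c.get("message", "")),
        ))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO comments (reviewed_at, ref, severity, line, message) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)


def severity_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return how many stored comments exist per severity."""
    cur = conn.execute("SELECT severity, COUNT(*) FROM comments GROUP BY severity")
    return dict(cur.fetchall())
```

Passing ":memory:" as the path gives a throwaway database for experiments. In CI, the ref could be a branch name or PR identifier of your choosing.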


I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.
