zhongqiyue

Posted on Jun 4

I Tried to Build an AI Code Reviewer Without Sharing My Code — Here's What Worked

#webdev #python #ai #tutorial

A few weeks ago, I was staring down a pull request with 800+ lines of changes. The team was moving fast, and I wanted a quick sanity check on style, potential bugs, and security concerns. My first instinct? Ask ChatGPT. But then the paranoia set in: I'd be pasting proprietary code into a black box. Our legal team would have a heart attack. So I went looking for a way to run AI-powered code review without sending sensitive data to a third party.

I'm not going to pretend this was a smooth ride. I banged my head against local LLMs, tried to wrangle Python scripts, and eventually landed on a pattern that actually works. This article is the one I wish I'd found back then.

The Problem: Privacy + Convenience Don't Mix

My team uses GitHub Copilot for inline suggestions, but it sends code snippets to Microsoft's cloud. For our internal tools and customer-facing code, that's a hard no. I needed a review bot that could:

Accept diffs of any size
Return structured feedback (issues with line numbers, severity, suggestions)
Run entirely on our infrastructure or via a trusted endpoint

I didn't want to maintain a Kubernetes cluster of GPUs, but I also didn't want to sign another enterprise agreement.

What I Tried That Failed (Miserably)

Attempt 1: LM Studio + Ollama

I downloaded Llama 3 8B via Ollama. It ran on my laptop. I wrote a Python script to feed it a diff and ask for review. Results? Terrible. The model would hallucinate line numbers, ignore the prompt, and sometimes just ramble about "great code!" without any actionable feedback. Plus, I had to manually handle the API and parse text responses. Not reproducible.

Attempt 2: LangChain + Local LLM

LangChain seemed like the obvious answer. I set up a chain with a prompt template and a local model. The setup was clunky, dependencies were heavy, and the structured output parser kept breaking. When it finally worked, the latency was 30+ seconds per review. Not usable for real-time PR checks.

Attempt 3: Hosted "Enterprise" APIs

I evaluated a few providers that promised data privacy. Most required a yearly contract, a dedicated endpoint, and a minimum spend. For a solo developer or small team, overkill.

What Finally Worked: A Simple API Wrapper with Structured Prompts

I stepped back and realized the core problem wasn't the model — it was the integration pattern. I needed:

A generic API interface (OpenAI-compatible, so I could swap backends easily)
A way to force structured JSON output (JSON mode or function calling, even with local models)
Retry and rate limiting built in
Minimal dependencies

I wrote a thin Python module that does exactly that. It works with any API endpoint that supports the chat completions format. Here's the core of it.

import json
import time
import httpx

class AICodeReviewer:
    def __init__(self, api_url: str, api_key: str, model: str = "gpt-4o"):
        self.client = httpx.Client(base_url=api_url, timeout=120.0)
        self.api_key = api_key
        self.model = model

    def review_diff(self, diff: str, max_retries: int = 3) -> dict:
        prompt = f"""
You are a senior code reviewer. Analyze the following diff and return a JSON object with:
- 'summary': a short summary of the changes
- 'issues': an array of objects each with 'line', 'severity' (critical/warning/info), 'message', and optionally 'suggestion'
- 'score': an integer from 1 to 10

Diff:
{diff}
"""
        for attempt in range(max_retries):
            try:
                response = self.client.post(
                    "/v1/chat/completions",
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    json={
                        "model": self.model,
                        "messages": [{"role": "user", "content": prompt}],
                        "response_format": {"type": "json_object"},
                        "max_tokens": 2048
                    }
                )
                response.raise_for_status()
                data = response.json()
                content = data["choices"][0]["message"]["content"]
                return json.loads(content)
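            except (httpx.HTTPError, json.JSONDecodeError, KeyError, IndexError) as e:
                # httpx.HTTPError covers bad status codes as well as timeouts and connection errors
                print(f"Attempt {attempt+1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff, skipped after the final attempt
        raise RuntimeError("All retries exhausted")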

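The response_format field is key: even smaller local models can produce valid JSON if the prompt spells out the expected fields and the request asks for structured output. (Most OpenAI-compatible local backends now support this.) Keep in mind that JSON mode only promises parseable JSON, not that the keys match what you asked for.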
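Since parseable JSON is not the same as correctly shaped JSON, it can help to check the parsed result before trusting it. Here is a minimal sketch of such a check, assuming the field names from the prompt above. The function name validate_review and the constant ALLOWED_SEVERITIES are made up for this example and are not part of the original module.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def validate_review(review: object) -> list[str]:
    """Return a list of problems in a parsed review; an empty list means it looks OK."""
    if not isinstance(review, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if not isinstance(review.get("summary"), str):
        problems.append("'summary' is missing or not a string")

    score = review.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        problems.append("'score' is missing or not an integer from 1 to 10")

    issues = review.get("issues")
    if not isinstance(issues, list):
        problems.append("'issues' is missing or not an array")
        return problems

    for i, issue in enumerate(issues):
        if not isinstance(issue, dict):
            problems.append(f"issues[{i}] is not an object")
            continue
        line = issue.get("line")
        if isinstance(line, bool) or not isinstance(line, int):
            problems.append(f"issues[{i}].line is not an integer")
        severity = issue.get("severity")
        if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
            problems.append(f"issues[{i}].severity is not critical/warning/info")
        if not isinstance(issue.get("message"), str):
            problems.append(f"issues[{i}].message is not a string")
    return problems
```

One way to use it: inside review_diff, treat a non-empty result the same way as a JSONDecodeError and let the retry loop ask again.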

I tested this with a self-hosted endpoint I set up for internal use (pointed at an API like https://ai.interwestinfo.com/, but any compatible provider works). The code doesn't care whether the model lives on a GPU cluster or a Raspberry Pi.

Making It a CLI Tool

I wrapped the above into a CLI that takes a file or piped diff:

$ cat example.diff | python review.py --endpoint https://my-ai-server.com --api-key $KEY --model code-review-7b

Output:

{
  "summary": "Adds authentication middleware",
  "issues": [
    {
      "line": 34,
      "severity": "critical",
      "message": "Hardcoded JWT secret",
      "suggestion": "Use environment variable"
    },
    {
      "line": 102,
      "severity": "info",
      "message": "Missing error handling for token expiry"
    }
  ],
  "score": 6
}
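The post does not show review.py itself, so here is a minimal sketch of what the argument handling could look like. It assumes the flags from the command above (--endpoint, --api-key, --model) and an optional file path. The names build_parser and main, and the review callable passed into main, are assumptions for this example.

```python
import argparse
import json
import sys
from typing import Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Review a diff via an OpenAI-compatible endpoint"
    )
    parser.add_argument("diff_file", nargs="?", help="path to a diff; omit to read stdin")
    parser.add_argument("--endpoint", required=True)
    parser.add_argument("--api-key", required=True)
    parser.add_argument("--model", default="gpt-4o")
    return parser


def main(
    argv: Optional[Sequence[str]],
    review: Callable[[argparse.Namespace, str], dict],
) -> int:
    """Parse arguments, load the diff, print the review as JSON, return an exit code."""
    args = build_parser().parse_args(argv)
    if args.diff_file:
        with open(args.diff_file, encoding="utf-8") as fh:
            diff = fh.read()
    else:
        diff = sys.stdin.read()  # supports: cat example.diff | python review.py ...
    if not diff.strip():
        print("empty diff, nothing to review", file=sys.stderr)
        return 1
    print(json.dumps(review(args, diff), indent=2))
    return 0


# Wiring it to the class from earlier might look like this (not executed here):
#   def run(args, diff):
#       return AICodeReviewer(args.endpoint, args.api_key, args.model).review_diff(diff)
#   sys.exit(main(None, run))
```

Passing the reviewer in as a callable keeps the CLI testable without a live endpoint.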

Now I can pipe any diff into this tool, get structured feedback, and integrate it into a GitHub Action or pre-commit hook.
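For a GitHub Action or pre-commit hook, the JSON has to become a pass/fail signal. Here is a minimal sketch of one way to do that, assuming the output shape shown above. The function exit_code_for and the SEVERITY_RANK table are made-up names, and the choice to fail only on critical issues is just a default for the example.

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def exit_code_for(review: dict, fail_on: str = "critical") -> int:
    """Return 1 if any issue is at or above the fail_on severity, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    for issue in review.get("issues", []):
        severity = issue.get("severity") if isinstance(issue, dict) else None
        # Unknown or malformed severities count as "info" so a sloppy
        # model response cannot fail the build on its own
        rank = SEVERITY_RANK.get(severity, 0) if isinstance(severity, str) else 0
        if rank >= threshold:
            return 1
    return 0
```

A hook script could load the tool's JSON output with json.load(sys.stdin) and call sys.exit(exit_code_for(review)), so that a critical finding blocks the commit or fails the CI job.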

Lessons Learned & Trade-offs

Model quality matters more than size. In my testing, a well-tuned 8B model beat a generic 70B model for this specific task. I spent most of my effort on prompt engineering, not hardware.
Structured output is non-negotiable. Without JSON mode, parsing free text is a nightmare. Always use response_format or function calling.
Privacy ≠ Local. A dedicated hosted endpoint with a signed data processing agreement can be as private as running on your own hardware, without the ops cost. I ended up using a small VPS with llama.cpp because I enjoy tinkering, but many teams should just buy a managed endpoint.
Latency is real. Even with streaming, review times of 5-20 seconds are normal. For CI/CD, that's fine. For real-time commenting, annoying.
Not all code is reviewable by LLMs. Deep architectural flaws or race conditions? Don't expect AI to catch those. This tool is for surface-level style and obvious bugs.
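The code earlier in the post uses response_format. For the function-calling alternative mentioned in the list above, here is a rough sketch of the request payload and response parsing in the OpenAI chat completions "tools" format. The tool name submit_review and the helper names are invented for this example, and support for tools and tool_choice varies between local backends, so treat it as something to verify against your own endpoint.

```python
import json

REVIEW_TOOL = {
    "type": "function",
    "function": {
        "name": "submit_review",  # hypothetical tool name
        "description": "Submit the structured code review.",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "issues": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "line": {"type": "integer"},
                            "severity": {
                                "type": "string",
                                "enum": ["critical", "warning", "info"],
                            },
                            "message": {"type": "string"},
                            "suggestion": {"type": "string"},
                        },
                        "required": ["line", "severity", "message"],
                    },
                },
                "score": {"type": "integer", "minimum": 1, "maximum": 10},
            },
            "required": ["summary", "issues", "score"],
        },
    },
}


def build_tool_payload(model: str, diff: str) -> dict:
    """Build a chat completions request that forces a call to the review tool."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a senior code reviewer."},
            {"role": "user", "content": f"Review this diff:\n{diff}"},
        ],
        "tools": [REVIEW_TOOL],
        "tool_choice": {"type": "function", "function": {"name": "submit_review"}},
    }


def parse_tool_response(data: dict) -> dict:
    """Extract the review from the first tool call; arguments arrive as a JSON string."""
    call = data["choices"][0]["message"]["tool_calls"][0]
    return json.loads(call["function"]["arguments"])
```

The upside over plain JSON mode is that the schema travels with the request instead of living only in the prompt text.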

What I'd Do Differently Next Time

Start with the simplest possible integration first. I wasted days trying to get LangChain's output parser to work when response_format already existed.
Write more tests for prompt robustness. Different models interpret "json_object" slightly differently; some need explicit schema examples.
Build the CLI after the core library. I prematurely over-abstracted.
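For models that need an explicit schema example, one option is to embed a placeholder object directly in the prompt. Here is a minimal sketch, assuming the same fields as the prompt earlier in the post. EXAMPLE_REVIEW and build_prompt are made-up names, and the placeholder values are not real findings.

```python
import json

# Placeholder object that shows the model the expected keys and value types
EXAMPLE_REVIEW = {
    "summary": "One-sentence description of the change",
    "issues": [
        {
            "line": 12,
            "severity": "warning",
            "message": "What is wrong",
            "suggestion": "How to fix it",
        }
    ],
    "score": 7,
}


def build_prompt(diff: str) -> str:
    """Build a review prompt that includes an explicit example of the JSON shape."""
    example = json.dumps(EXAMPLE_REVIEW, indent=2)
    return (
        "You are a senior code reviewer. Analyze the diff and reply with a single JSON object.\n"
        "Use exactly these keys and value types. The values below are placeholders, not findings:\n"
        f"{example}\n"
        "'severity' must be one of: critical, warning, info. 'suggestion' is optional.\n"
        "'score' is an integer from 1 to 10.\n\n"
        f"Diff:\n{diff}\n"
    )
```

A prompt robustness test could then run the same small diff through each candidate model and assert that the parsed response has the expected keys.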

I'm still iterating on this — next up is adding batch review of multiple files and caching results to speed up repeated checks. The whole repository is a single Python file and a Dockerfile. It's not a product, just a tool that solved my itch.
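As a rough idea of what batching and caching could look like (one possible approach, not the author's implementation): split a git-style diff into per-file chunks, and key a small on-disk cache on a hash of the model name plus the chunk. The function names, the cache layout, and the reliance on "diff --git" header lines are all assumptions for this sketch.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def split_by_file(diff: str) -> list[str]:
    """Split a git-format diff into one chunk per file, using 'diff --git' headers."""
    chunks, current = [], []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git ") and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def cached_review(
    diff: str, model: str, review: Callable[[str], dict], cache_dir: Path
) -> dict:
    """Return a cached review for this model and diff, calling review() only on a miss."""
    key = hashlib.sha256(f"{model}\n{diff}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = review(diff)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Reviewing per file also keeps each request smaller, which helps with models that have a short context window.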

How do you handle AI-powered code review in a privacy-sensitive environment? Are you running local models or trusting a provider? I'd love to hear what's working (or not) for you.
