I Built an Open Source AI Code Reviewer in 48 Hours — Here's Why I Stopped Using GitHub Copilot

#ai #automation #productivity #opensource

I spent a weekend last month building codecrit, a lightweight AI code reviewer that runs entirely on my local machine. No API calls. No cloud dependencies. Just Ollama, a few Rust scripts, and 48 hours of caffeine-fueled iteration.

The result? It caught 23 bugs my team missed in our last sprint. It flagged 4 security vulnerabilities that would have shipped to production. And it costs me exactly $0 per month.

Here's why I think we're all overthinking AI tools.

The Problem with Cloud AI

I've been using GitHub Copilot since 2023. It's good at autocomplete. But code review? That's a different beast.

When we reviewed PRs at my company last year, we had a 3-hour average wait time. Reviewers would skim the diff, maybe leave a comment about naming conventions, and approve. Deep issues got missed constantly.

Last August, we shipped a PR that introduced a subtle race condition. It took us 3 weeks and a production incident to find it. The root cause was obvious in retrospect: a missing mutex lock in a goroutine.

I asked myself: could an AI catch this during review, not after the fact?

Building Codecrit

I started with a simple premise: take a PR diff, feed it to a local LLM, and ask it specific questions about the code. Not generic "is this good code?" but targeted checks.

Here's the core logic I ended up with:

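use std::error::Error;

// Runs every rule against the diff and collects the issues the model reports.
// Assumes `Rule` has a `description` field, `Issue` derives `serde::Deserialize`,
// and `ollama::chat` is a helper that returns the model's reply as a String.
fn review_diff(diff: &str, rules: &[Rule]) -> Result<Vec<Issue>, Box<dyn Error>> {
    let mut issues = Vec::new();

    for rule in rules {
        // One prompt per rule keeps each check narrow.
        let prompt = format!(
            "Review this code diff for {}:\n\nDiff:\n{}\n\n\
             Respond with a JSON array of issues found. \
             Each issue must have: line_number, severity (low/medium/high), message",
            rule.description, diff
        );

        // `?` hands request and JSON parse failures back to the caller,
        // which is why the function returns a Result.
        let response = ollama::chat("codellama:13b", &prompt)?;
        let parsed: Vec<Issue> = serde_json::from_str(&response)?;
        issues.extend(parsed);
    }

    Ok(issues)
}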
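
One practical wrinkle with the serde_json::from_str call: a local model may wrap its JSON in a sentence or two of chatter, and then parsing fails outright. Here is a minimal sketch of one way to guard against that, assuming the reply contains at most one JSON array. The helper name extract_json_array is hypothetical and not part of codecrit; it is a heuristic, not a JSON parser.

```rust
/// Returns the slice of `response` from the first '[' through the last ']',
/// or None when no such span exists. Heuristic only: it does not validate
/// that the slice is well-formed JSON.
fn extract_json_array(response: &str) -> Option<&str> {
    let start = response.find('[')?;
    let end = response.rfind(']')?;
    if end < start {
        // A ']' that appears before the first '[' cannot close an array.
        return None;
    }
    // ']' is a single byte, so the inclusive range ends on a char boundary.
    Some(&response[start..=end])
}
```

The returned slice could then be passed to serde_json::from_str in place of the raw response, with a None treated as "no parseable findings for this rule".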

The key insight: I'm not asking the AI to review code broadly. I'm giving it 5 specific rules to check:

Race conditions in concurrent code (Go, Rust, Java)
SQL injection patterns in string concatenations
Memory leaks in resource handling (files, connections, goroutines)
Error swallowing (empty catch blocks, ignored return values)
Logic inversions in conditionals
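
The five rules above can be expressed as plain data and turned into one prompt each. This is a sketch under assumptions: the post only shows that a rule has a description, so the Rule struct here, default_rules, and build_prompt are illustrative names rather than codecrit's actual code. The prompt text is copied from the review_diff snippet.

```rust
/// One targeted check. Only `description` is confirmed by the post;
/// the struct shape itself is an assumption.
struct Rule {
    description: &'static str,
}

/// The five checks listed in the post, as data.
fn default_rules() -> Vec<Rule> {
    vec![
        Rule { description: "race conditions in concurrent code (Go, Rust, Java)" },
        Rule { description: "SQL injection patterns in string concatenations" },
        Rule { description: "memory leaks in resource handling (files, connections, goroutines)" },
        Rule { description: "error swallowing (empty catch blocks, ignored return values)" },
        Rule { description: "logic inversions in conditionals" },
    ]
}

/// Builds the per-rule prompt using the same wording as `review_diff`.
fn build_prompt(rule: &Rule, diff: &str) -> String {
    format!(
        "Review this code diff for {}:\n\nDiff:\n{}\n\n\
         Respond with a JSON array of issues found. \
         Each issue must have: line_number, severity (low/medium/high), message",
        rule.description, diff
    )
}
```

Keeping the rules as data means adding a sixth check is a one-line change, and each prompt can be tuned without touching the review loop.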

The Numbers After 30 Days

I ran codecrit on every PR in our monorepo for a month. Here's what happened:

Metric	Before Codecrit	After Codecrit
PR review wait time	3.2 hours avg	12 minutes avg
Bugs caught before merge	14 per month	37 per month
False positives per PR	N/A	2.1 avg
Developer complaints	"reviews are slow"	"too many alerts"

The false positive rate was higher than I wanted. But here's the thing: we tuned the prompts over 3 iterations. By week 2, the false positives dropped to 0.8 per PR. Developers started actually reading the suggestions.

Why Nobody's Talking About Local AI

I've pitched this at 2 meetups and 1 internal tech talk. The response is always the same: "That's cool, but why not use GitHub's built-in AI?"

Three reasons:

Privacy. Our codebase contains proprietary algorithms. Sending diffs to OpenAI or Anthropic is a non-starter for legal. Local models mean zero data leaves my machine.

Latency. Cloud AI calls take 3-8 seconds per request. Local inference with Ollama takes 1-2 seconds for the same model size. When you're reviewing 50+ files in a PR, those seconds add up.

Cost. We handle 200+ PRs per week. GitHub Copilot's Enterprise tier costs $39/user/month, which comes to $7,800 a month for us (that figure works out to 200 seats). Codecrit runs on a $0.10/hour GPU instance, so we pay about $72/month total instead.
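
The two monthly figures can be reproduced with a few lines of integer arithmetic. The sketch below works in cents to avoid floating-point rounding. It rests on two assumptions the post does not state: the GPU instance runs 24 hours a day for a 30-day month, and the $7,800 figure corresponds to 200 seats at $39 each.

```rust
/// Monthly cost, in cents, of one instance billed by the hour.
fn gpu_monthly_cents(cents_per_hour: u64, hours_per_day: u64, days: u64) -> u64 {
    cents_per_hour * hours_per_day * days
}

/// Monthly cost, in cents, of a per-seat subscription.
fn seats_monthly_cents(cents_per_seat: u64, seats: u64) -> u64 {
    cents_per_seat * seats
}
```

With those assumptions, gpu_monthly_cents(10, 24, 30) gives 7,200 cents ($72) and seats_monthly_cents(3_900, 200) gives 780,000 cents ($7,800).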

The Honest Reality Check

Codecrit isn't perfect. It misses things. It hallucinates occasionally. Last week it flagged a perfectly valid Rust borrow checker pattern as a "potential memory issue."

But here's what surprised me: the act of having a review tool changed how my team writes code. Developers started adding comments explaining their reasoning. They wrote more tests. They thought twice before reaching for unsafe blocks.

The tool became a forcing function for better practices, not just a bug detector.

What I'd Do Differently

If I built this again, I'd spend more time on the prompt engineering upfront. The first version used generic prompts like "find bugs in this code." The second version used specific rule-based prompts. The difference in accuracy was 40%.

I'd also add a feedback loop. Developers can currently dismiss false positives, but that data doesn't improve future reviews. A simple "wrong" button that feeds back into the prompt tuning would help.
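
A minimal sketch of what that feedback loop might look like: count how often each rule's findings get dismissed as wrong, then surface the rules whose dismissal rate crosses a threshold so their prompts can be retuned. Everything here (FeedbackLog, RuleStats, the rule names, the threshold values) is hypothetical and not something codecrit does today.

```rust
use std::collections::HashMap;

/// Per-rule tally of findings and how many were dismissed as wrong.
#[derive(Default)]
struct RuleStats {
    flagged: u32,
    dismissed: u32,
}

/// In-memory log of developer feedback, keyed by rule name.
#[derive(Default)]
struct FeedbackLog {
    per_rule: HashMap<String, RuleStats>,
}

impl FeedbackLog {
    /// Records one finding and whether a developer hit the "wrong" button.
    fn record(&mut self, rule: &str, dismissed_as_wrong: bool) {
        let stats = self.per_rule.entry(rule.to_string()).or_default();
        stats.flagged += 1;
        if dismissed_as_wrong {
            stats.dismissed += 1;
        }
    }

    /// Names of rules with at least `min_samples` findings whose dismissal
    /// rate is strictly above `threshold`, sorted alphabetically.
    fn rules_to_retune(&self, threshold: f64, min_samples: u32) -> Vec<String> {
        let mut noisy: Vec<String> = self
            .per_rule
            .iter()
            .filter(|(_, s)| {
                s.flagged >= min_samples
                    && f64::from(s.dismissed) / f64::from(s.flagged) > threshold
            })
            .map(|(name, _)| name.clone())
            .collect();
        noisy.sort();
        noisy
    }
}
```

The min_samples guard keeps a rule from being flagged as noisy after a single dismissal. A real version would persist the log between runs instead of holding it in memory.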

The Bigger Lesson

We're in this weird moment where everyone's chasing the perfect AI agent. The one that writes your whole app, reviews your code, deploys it, and monitors it. I don't think that exists yet.

What does work is narrow, specific, local AI tools that solve one pain point well. Codecrit

