DEV Community

brian austin

I built an automated code reviewer using Claude. It explains every decision. #openclawchallenge

OpenClaw Challenge Submission 🦞

Most AI code reviewers just flag problems. Mine explains why something is a problem — and shows its reasoning before giving the verdict.

This is my entry for the #openclawchallenge — building AI that's transparent about what it says and why.

The problem with opaque AI reviewers

I was using a popular AI code review tool and it flagged a function as "inefficient." No explanation. No alternative. Just: inefficient.

I pushed back. Asked why. Got: "Consider optimizing this code."

Useless. I need to understand the reasoning to trust the verdict.

So I built my own — one that shows its work.

The system prompt that makes Claude think out loud

import anthropic
import re

client = anthropic.Anthropic(api_key="YOUR_API_KEY")  # simplylouie.com/developers — $2/month

SYSTEM_PROMPT = """
You are a senior code reviewer. For every issue you identify, you MUST:

1. Show your reasoning inside <think> tags BEFORE giving your verdict
2. In your reasoning, ask yourself:
   - Is this a real problem or a style preference?
   - What's the actual impact (performance, security, maintainability)?
   - Would a senior engineer at Google flag this in a PR review?
   - What's the simplest fix that preserves intent?
3. After reasoning, give a structured verdict:
   - SEVERITY: critical | warning | suggestion
   - ISSUE: one sentence
   - WHY: one sentence explaining the impact
   - FIX: the corrected code or specific change

If you find no real issues, say so explicitly. Do not invent problems to seem thorough.
"""

def review_code(code: str, language: str = "python") -> dict:
    """Review code with full reasoning transparency."""

    prompt = f"""Review this {language} code:

```{language}
{code}
```

Show your reasoning for each potential issue before giving your verdict."""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}]
    )

    raw = response.content[0].text

    # Extract reasoning blocks
    think_blocks = re.findall(r'<think>(.*?)</think>', raw, re.DOTALL)
    verdict = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()

    return {
        "reasoning": [t.strip() for t in think_blocks],
        "verdict": verdict,
        "raw": raw
    }


def format_review(result: dict) -> str:
    """Pretty-print the review with reasoning visible."""
    output = []

    if result["reasoning"]:
        output.append("## 🧠 Claude's Reasoning Process")
        for i, thought in enumerate(result["reasoning"], 1):
            output.append(f"\n### Thought {i}:")
            output.append(thought)

    output.append("\n## 📋 Review Verdict")
    output.append(result["verdict"])

    return "\n".join(output)
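The parsing step is easy to sanity-check without an API call. Here's a small sketch that runs the same regex extraction on a canned response (the sample text is made up for illustration, not real Claude output):

```python
import re

def split_reasoning(raw: str) -> dict:
    """Split a response into <think> reasoning blocks and the remaining verdict."""
    think_blocks = re.findall(r'<think>(.*?)</think>', raw, re.DOTALL)
    verdict = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
    return {"reasoning": [t.strip() for t in think_blocks], "verdict": verdict}

# Canned response standing in for a real Claude reply
sample = """<think>
The loop filters in Python instead of SQL. Real problem at scale.
</think>
SEVERITY: warning
ISSUE: Python-side filtering"""

result = split_reasoning(sample)
print(result["reasoning"])  # one reasoning block, <think> tags stripped
print(result["verdict"])    # the verdict text on its own
```

The non-greedy `(.*?)` with `re.DOTALL` is what keeps multiple `<think>` blocks from being merged into one match.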

Real test: reviewing a common Python mistake

I fed it this code:

def get_users(db, active=True):
    users = []
    for user in db.query("SELECT * FROM users"):
        if user.active == active:
            users.append(user)
    return users

Here's what Claude's reasoning looked like:

<think>
Let me think about this carefully. The function fetches ALL users from the database 
then filters in Python. 

Is this a real problem? Yes — at small scale it's fine, but:
- At 1,000 users: minor overhead
- At 100,000 users: loading entire table into memory is a real performance issue
- At 1,000,000 users: this will likely cause OOM errors or extreme latency

Would a senior Google engineer flag this in a PR review? Yes, immediately — this 
is a classic N+1 adjacent pattern. The fix is trivial: push the filter to SQL.

Is the fix simple and preserves intent? Yes:
SELECT * FROM users WHERE active = ? 
with the active parameter passed in.

Severity: warning (not critical because it works at small scale, but this is 
exactly how systems fail under load).
</think>

Then the verdict:

SEVERITY: warning
ISSUE: Full table scan with Python-side filtering
WHY: Loads all users into memory before filtering — fails under load at 10k+ rows
FIX:

def get_users(db, active=True):
    return db.query("SELECT * FROM users WHERE active = ?", (active,)).fetchall()
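The suggested fix is easy to verify against a throwaway database. A quick sketch using sqlite3 (the table shape is assumed to match the example; sqlite also uses `?` placeholders, so the parameterized query carries over directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ada", 1), ("bob", 0), ("cyn", 1)])

def get_users(db, active=True):
    # Filter in SQL with a bound parameter; no full table scan into Python
    return db.execute(
        "SELECT * FROM users WHERE active = ?", (int(active),)
    ).fetchall()

print(get_users(conn))         # [('ada', 1), ('cyn', 1)]
print(get_users(conn, False))  # [('bob', 0)]
```

As a bonus, the bound `?` parameter also rules out SQL injection, which string-formatting the filter would not.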

Why transparency matters for code review

When Claude shows its reasoning, I can:

  1. Disagree with the reasoning — if the thought process is wrong, I know why the verdict is wrong
  2. Learn from the reasoning — seeing how to evaluate code quality is more valuable than the verdict itself
  3. Trust the verdict — when the reasoning is solid, I can ship faster

Opaque AI review = either trust blindly or reject blindly. Neither is good engineering.

The cost angle

I'm running this on a flat-rate Claude API for $2/month (not usage-based). For a solo developer doing code review on 10-20 PRs a week, usage-based APIs add up fast.

Tool                          Cost          Reasoning shown?
Claude.ai Pro                 $20/month     No
Claude API (pay-per-token)    $5-40/month   Yes, if you build it
SimplyLouie API               $2/month      Yes — flat rate
GitHub Copilot                $10/month     No

Flat-rate means I run code reviews without watching token counts.
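For a rough sense of where the pay-per-token number comes from, here's a back-of-envelope calculation. Every figure in it — per-token prices and tokens per review — is my own assumption for illustration, not a measured cost:

```python
# All figures are assumptions for illustration only
reviews_per_week = 15        # middle of the 10-20 PR range
input_tokens = 3_000         # code + prompt per review (assumed)
output_tokens = 2_000        # reasoning + verdict per review (assumed)
price_in = 15 / 1_000_000    # $ per input token (assumed Opus-tier pricing)
price_out = 75 / 1_000_000   # $ per output token (assumed Opus-tier pricing)

per_review = input_tokens * price_in + output_tokens * price_out
monthly = per_review * reviews_per_week * 4

print(f"${per_review:.3f} per review, ~${monthly:.2f}/month")
```

Under those assumptions it lands around $12/month — inside the $5-40 range above, and the bill scales with every extra review you run.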

Run it yourself

# Install
pip install anthropic

# Set your key (get one at simplylouie.com/developers)
export ANTHROPIC_API_KEY="your-key"

# Run
python reviewer.py

Full code on the gist linked below.

The #openclawchallenge connection

The challenge asks: what should AIs say — and what shouldn't they say — about the code they review?

My answer: they should show their reasoning, not hide it. An AI that flags problems without explaining why is making editorial judgments without accountability. The <think> blocks make the AI's assumptions visible and challengeable.

That's the model I want reviewing my code: one that argues with itself before telling me I'm wrong.


What's the most useful thing an AI code reviewer could tell you that current tools don't? Drop it in the comments — genuinely curious what's missing from existing tools.

API access: simplylouie.com/developers — $2/month flat rate, no per-token billing
