I built an automated code reviewer using Claude. It explains every decision. #openclawchallenge
Most AI code reviewers just flag problems. Mine explains why something is a problem — and shows its reasoning before giving the verdict.
This is my entry for the #openclawchallenge — building AI that's transparent about what it says and why.
## The problem with opaque AI reviewers
I was using a popular AI code review tool and it flagged a function as "inefficient." No explanation. No alternative. Just: inefficient.
I pushed back. Asked why. Got: "Consider optimizing this code."
Useless. I need to understand the reasoning to trust the verdict.
So I built my own — one that shows its work.
## The system prompt that makes Claude think out loud
````python
import anthropic
import re

# Reads ANTHROPIC_API_KEY from the environment (set in the "Run it yourself" step).
client = anthropic.Anthropic()  # simplylouie.com/developers — $2/month

SYSTEM_PROMPT = """
You are a senior code reviewer. For every issue you identify, you MUST:

1. Show your reasoning inside <think> tags BEFORE giving your verdict
2. In your reasoning, ask yourself:
   - Is this a real problem or a style preference?
   - What's the actual impact (performance, security, maintainability)?
   - Would a senior engineer at Google flag this in a PR review?
   - What's the simplest fix that preserves intent?
3. After reasoning, give a structured verdict:
   - SEVERITY: critical | warning | suggestion
   - ISSUE: one sentence
   - WHY: one sentence explaining the impact
   - FIX: the corrected code or specific change

If you find no real issues, say so explicitly. Do not invent problems to seem thorough.
"""


def review_code(code: str, language: str = "python") -> dict:
    """Review code with full reasoning transparency."""
    prompt = f"""Review this {language} code:

```{language}
{code}
```

Show your reasoning for each potential issue before giving your verdict."""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.content[0].text

    # Extract reasoning blocks, then strip them out to leave only the verdict
    think_blocks = re.findall(r'<think>(.*?)</think>', raw, re.DOTALL)
    verdict = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()

    return {
        "reasoning": [t.strip() for t in think_blocks],
        "verdict": verdict,
        "raw": raw,
    }


def format_review(result: dict) -> str:
    """Pretty-print the review with reasoning visible."""
    output = []
    if result["reasoning"]:
        output.append("## 🧠 Claude's Reasoning Process")
        for i, thought in enumerate(result["reasoning"], 1):
            output.append(f"\n### Thought {i}:")
            output.append(thought)
    output.append("\n## 📋 Review Verdict")
    output.append(result["verdict"])
    return "\n".join(output)
````
## Real test: reviewing a common Python mistake
I fed it this code:
```python
def get_users(db, active=True):
    users = []
    for user in db.query("SELECT * FROM users"):
        if user.active == active:
            users.append(user)
    return users
```
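If you want to reproduce the test, the call is just the two functions from above wired together (assuming everything lives in the same `reviewer.py` file):

```python
# Assumes review_code and format_review from the snippet above are in scope.
code_under_review = '''
def get_users(db, active=True):
    users = []
    for user in db.query("SELECT * FROM users"):
        if user.active == active:
            users.append(user)
    return users
'''

result = review_code(code_under_review, language="python")
print(format_review(result))
```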
Here's what Claude's reasoning looked like:
```
<think>
Let me think about this carefully. The function fetches ALL users from the database
then filters in Python.

Is this a real problem? Yes — at small scale it's fine, but:
- At 1,000 users: minor overhead
- At 100,000 users: loading entire table into memory is a real performance issue
- At 1,000,000 users: this will likely cause OOM errors or extreme latency

Would a senior Google engineer flag this in a PR review? Yes, immediately — this
is a classic N+1 adjacent pattern. The fix is trivial: push the filter to SQL.

Is the fix simple and preserves intent? Yes:
SELECT * FROM users WHERE active = ?
with the active parameter passed in.

Severity: warning (not critical because it works at small scale, but this is
exactly how systems fail under load).
</think>
```
Then the verdict:
```
SEVERITY: warning
ISSUE: Full table scan with Python-side filtering
WHY: Loads all users into memory before filtering — fails under load at 10k+ rows
FIX:

def get_users(db, active=True):
    return db.query("SELECT * FROM users WHERE active = ?", (active,)).fetchall()
```
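The verdict format is rigid on purpose, which makes it easy to post-process. This isn't part of the reviewer above, just a sketch: a hypothetical `parse_verdict` helper could turn the text into structured findings, for example to fail a CI step on anything critical.

```python
import re

def parse_verdict(verdict: str) -> list[dict]:
    """Hypothetical helper (not part of the original reviewer): split the
    SEVERITY/ISSUE/WHY/FIX verdict text into structured findings."""
    pattern = re.compile(
        r"SEVERITY:\s*(?P<severity>critical|warning|suggestion)\s*"
        r"ISSUE:\s*(?P<issue>.*?)\s*"
        r"WHY:\s*(?P<why>.*?)\s*"
        r"FIX:\s*(?P<fix>.*?)(?=SEVERITY:|\Z)",
        re.DOTALL | re.IGNORECASE,
    )
    return [
        {key: value.strip() for key, value in match.groupdict().items()}
        for match in pattern.finditer(verdict)
    ]

# Example: fail the build on critical findings
findings = parse_verdict(result["verdict"])
if any(f["severity"].lower() == "critical" for f in findings):
    raise SystemExit("Critical review findings -- failing the build")
```

Running the verdict above through it yields a single finding with `severity == "warning"`.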
## Why transparency matters for code review
When Claude shows its reasoning, I can:
- Disagree with the reasoning — if the thought process is wrong, I know why the verdict is wrong
- Learn from the reasoning — seeing how to evaluate code quality is more valuable than the verdict itself
- Trust the verdict — when the reasoning is solid, I can ship faster
Opaque AI review = either trust blindly or reject blindly. Neither is good engineering.
## The cost angle
I'm running this on a flat-rate Claude API for $2/month (not usage-based). For a solo developer doing code review on 10-20 PRs a week, usage-based APIs add up fast.
| Tool | Cost | Reasoning shown? |
|---|---|---|
| Claude.ai Pro | $20/month | No |
| Claude API (pay-per-token) | $5-40/month | Yes, if you build it |
| SimplyLouie API | $2/month | Yes — flat rate |
| GitHub Copilot | $10/month | No |
Flat-rate means I run code reviews without watching token counts.
## Run it yourself
```bash
# Install
pip install anthropic

# Set your key (get one at simplylouie.com/developers)
export ANTHROPIC_API_KEY="your-key"

# Run
python reviewer.py
```
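The snippet earlier only defines the functions, so `python reviewer.py` needs a small entry point at the bottom of the file. Here's a minimal sketch; taking the file path as a CLI argument (with a stdin fallback) is my own assumption, not necessarily how the gist lays it out:

```python
# Bottom of reviewer.py -- minimal entry point (assumed layout, see note above).
# Usage: python reviewer.py path/to/file.py   or   python reviewer.py < file.py
import sys
from pathlib import Path

if __name__ == "__main__":
    source = Path(sys.argv[1]).read_text() if len(sys.argv) > 1 else sys.stdin.read()
    print(format_review(review_code(source)))
```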
Full code on the gist linked below.
## The #openclawchallenge connection
The challenge asks: what should AIs say — and what shouldn't they say — about the code they review?
My answer: they should show their reasoning, not hide it. An AI that flags problems without explaining why is making editorial judgments without accountability. The `<think>` blocks make the AI's assumptions visible and challengeable.
That's the model I want reviewing my code: one that argues with itself before telling me I'm wrong.
What's the most useful thing an AI code reviewer could tell you that current tools don't? Drop it in the comments — genuinely curious what's missing from existing tools.
API access: simplylouie.com/developers — $2/month flat rate, no per-token billing