Most AI coding tools are one-shot: you ask, they answer, you decide if it's good. That's not an agent — that's autocomplete with better vocabulary. Code Genie works differently. It writes code, reviews it, finds its own problems, and iterates until it's satisfied. I built it by hand while learning crewAI, and the process taught me more about agentic systems than any tutorial ever could.
The Problem With One-Shot Code Generation
If you've used Claude, ChatGPT, or any AI coding tool for more than a week, you've hit the wall.
You write a prompt. You get back code that looks right. You drop it in, run it, and something breaks — not catastrophically, just quietly. An edge case that wasn't handled. An assumption baked in that doesn't match your data. A function that works in isolation but fails in context.
The problem isn't that the model is bad. The problem is that nobody checked the output before it reached you.
Human code review exists for exactly this reason. A second set of eyes catches things the original author missed — not because the reviewer is smarter, but because they're reading it fresh, looking for problems instead of assuming it works. One-shot AI tools skip that step entirely. They generate, hand off, and move on. The gap between "looks right" and "is right" is yours to close.
I wanted to close it automatically.
What an Agentic Loop Actually Is
An agentic loop is: generate → evaluate → decide → repeat.
That's it. Three steps and a condition. What makes it agentic is the decide step — something has to determine whether the output is good enough to stop, or whether it needs another pass. Without that decision layer, you just have automation. With it, you have something that can improve its own output without a human in the loop.
Applied to code generation, the loop looks like this: a writer produces a snippet, a reviewer evaluates it against a set of criteria, the reviewer's verdict either passes the output or sends it back with specific feedback, and the writer tries again with that feedback in hand. The loop runs until the output passes review or hits a retry limit.
Simple pattern. Surprisingly powerful in practice.
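The pattern fits in a few lines of plain Python. The writer and reviewer below are toy stand-in functions, not crewAI agents; in Code Genie they're backed by an LLM, but the control flow is the same:

```python
def agentic_loop(task, writer, reviewer, max_retries=3):
    """Generate -> evaluate -> decide -> repeat, with a retry cap."""
    feedback = None
    best = None
    for attempt in range(1, max_retries + 1):
        draft = writer(task, feedback)      # generate
        passed, feedback = reviewer(draft)  # evaluate
        best = draft
        if passed:                          # decide
            return best, attempt
    return best, max_retries                # cap hit: surface the best attempt

# Toy stand-ins: the writer "fixes" its output once it sees feedback.
def toy_writer(task, feedback):
    return "fixed code" if feedback else "buggy code"

def toy_reviewer(draft):
    return (True, None) if draft == "fixed code" else (False, "off-by-one in loop")

result, attempts = agentic_loop("example task", toy_writer, toy_reviewer)
# result == "fixed code", reached on the second attempt
```

Everything else in this article is elaboration of those dozen lines: who the writer is, what the reviewer checks, and where the cap sits.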
Why crewAI
crewAI is a Python framework for building multi-agent systems. The core abstraction is exactly what it sounds like: you define a crew of agents, give each one a role, a goal, and a backstory, assign them tasks, and let the framework handle the orchestration.
crewAI manages the execution order, passes output between agents, and keeps the loop running. You don't write the plumbing. You write the agents and the tasks, and crewAI connects them.
I chose it because I wanted to learn agentic patterns hands-on, and crewAI's abstractions are close enough to how you'd think about the problem naturally that the learning curve doesn't get in the way of actually building. A writer and a reviewer are intuitive roles. Defining them as crewAI agents felt like a natural translation of the mental model.
What crewAI doesn't handle — and this is important — is your judgment about what good code means. The framework will run your loop faithfully. It won't tell you whether your review criteria are any good. That part you have to figure out yourself, and that's where most of the real work lives.
The Writer Agent
The writer agent is the simpler of the two — its job is straightforward. Given a task, produce code that accomplishes it.
In crewAI terms, the writer looks something like this:
```python
from crewai import Agent

# `llm` is the model handle, configured elsewhere (e.g. a Claude model).
writer = Agent(
    role="Software Engineer",
    goal="Write clean, correct, production-ready code that solves the given task",
    backstory="""You are an experienced software engineer who writes clear,
    well-structured code. You follow established conventions, handle edge cases,
    and write code that other developers can read and maintain.""",
    llm=llm,
    verbose=True,
)
```
The task it receives includes the prompt — what to build — plus any context about the project: language, conventions, constraints. This is where the system from article one plugs directly in. If the repo has a CLAUDE.md, that context gets passed to the writer before it generates a single line. If there's a code-writing Skill, that gets included too. The writer doesn't start cold — it starts knowing your project the same way it would in any other session.
The first pass isn't expected to be perfect. It's expected to be a serious attempt — something substantive enough for the reviewer to actually evaluate. That framing matters. You're not asking for perfection on the first try. You're asking for something worth reviewing.
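As a sketch of how that context assembly might work, here's a hypothetical helper (not part of crewAI, and the parameter names are my own) that prepends CLAUDE.md contents and any Skill text to the task description before it reaches the writer:

```python
def build_writer_prompt(task_prompt: str, claude_md: str = "", skill: str = "") -> str:
    """Compose the writer's task description from the project's context layers.

    Hypothetical helper for illustration: CLAUDE.md conventions and any
    code-writing Skill text go first, so the writer never starts cold.
    """
    parts = []
    if claude_md:
        parts.append(f"Project conventions:\n{claude_md}")
    if skill:
        parts.append(f"Code-writing skill:\n{skill}")
    parts.append(f"Task:\n{task_prompt}")
    return "\n\n".join(parts)

desc = build_writer_prompt("Write a CSV parser", claude_md="Use type hints.")
```

The composed string would then become the `description` of the crewAI Task handed to the writer.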
The Reviewer Agent
The reviewer is where the real design decisions live.
The naive approach is to have Claude re-read its own output and ask "is this good?" That doesn't work well. The model has too much context about what it was trying to do — it reads charitably, filling in gaps with intent rather than scrutinizing what's actually there.
The fix is persona separation. The reviewer is defined as a different agent with a different goal — not "did this accomplish what was intended" but "what's wrong with this code."
```python
reviewer = Agent(
    role="Senior Code Reviewer",
    goal="""Review code critically and objectively. Identify bugs, edge cases,
    convention violations, and anything that would fail in production.
    Do not give the benefit of the doubt.""",
    backstory="""You are a senior engineer with high standards and low tolerance
    for sloppy code. You've seen too many production incidents caused by code
    that looked fine in review. You are thorough, specific, and direct.""",
    llm=llm,
    verbose=True,
)
```
The backstory isn't decoration — it shapes behavior. A reviewer told to be skeptical and specific produces meaningfully different feedback than one told to be helpful. Same model, different instruction set, genuinely different output.
The reviewer's task is to return one of two things: a pass with brief justification, or a fail with specific, actionable feedback. Vague feedback — "this could be improved" — is useless in a loop. The writer needs to know exactly what to fix.
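One way to make that verdict machine-readable is to instruct the reviewer to lead its answer with a fixed marker and parse it out afterward. The `VERDICT: PASS` / `VERDICT: FAIL` convention below is an assumption of this sketch, not a crewAI feature:

```python
import re

def parse_review(review_text: str):
    """Extract (passed, feedback) from a reviewer's response.

    Assumes the review task told the reviewer to start with
    'VERDICT: PASS' or 'VERDICT: FAIL' -- a convention of this sketch.
    """
    match = re.search(r"VERDICT:\s*(PASS|FAIL)", review_text, re.IGNORECASE)
    if not match:
        # An unparseable review counts as a failure; the raw text is the feedback.
        return False, review_text.strip()
    passed = match.group(1).upper() == "PASS"
    feedback = review_text[match.end():].strip()
    return passed, feedback

passed, feedback = parse_review("VERDICT: FAIL\nThe loop skips the last element.")
# passed is False; feedback is the specific fix the writer needs
```

Treating a malformed review as a fail is deliberate: in a loop, the safe default is another pass, not a rubber stamp.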
The Loop
With the writer and reviewer defined, wiring them together is surprisingly clean in crewAI:
```python
from crewai import Crew, Process

crew = Crew(
    agents=[writer, reviewer],
    tasks=[write_task, review_task],
    process=Process.sequential,
    verbose=True,
)
```
Sequential process means the writer runs first and hands off to the reviewer. The reviewer's verdict drives the loop: if the review passes, the loop exits and returns the code. If it fails, the feedback gets added to the writer's context and the crew runs again.
The retry limit is non-negotiable. Without it, a loop that keeps failing will keep running — and Claude API calls are not free. In practice, three iterations is the sweet spot. If the code hasn't passed review by the third attempt, something is wrong with either the task definition or the review criteria, and a human should look at it. The loop surfaces the best attempt and stops.
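That outer loop might be sketched like this. `run_crew` is a stand-in for a single `crew.kickoff()` pass plus verdict parsing; it's assumed to return the code, a pass/fail flag, and the reviewer's feedback:

```python
def run_with_retries(run_crew, task_prompt: str, max_retries: int = 3):
    """Run one writer+reviewer pass per attempt, feeding review
    feedback back in, and stop hard at the retry cap."""
    feedback = ""
    last_code = None
    for attempt in range(1, max_retries + 1):
        code, passed, feedback = run_crew(task_prompt, feedback)
        last_code = code
        if passed:
            return code, attempt
    # Cap hit: surface the most recent attempt for a human to inspect.
    return last_code, max_retries

# Toy crew pass that fails its first attempt, then passes.
calls = {"n": 0}
def toy_run(prompt, feedback):
    calls["n"] += 1
    passed = calls["n"] >= 2
    return f"attempt {calls['n']}", passed, "" if passed else "handle empty input"

code, attempts = run_with_retries(toy_run, "write a CSV parser")
```

The cap is the only thing standing between a miscalibrated reviewer and an unbounded API bill, which is why it lives in the wrapper rather than in any agent's instructions.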
What you learn quickly is that the loop is only as good as the reviewer's criteria. A reviewer that's too strict loops forever. A reviewer that's too lenient rubber-stamps bad code. Calibrating that bar — specific enough to catch real problems, reasonable enough to actually pass good code — is where most of the iteration happens when you're building this.
What Building It By Hand Taught Me
The honest answer is that the first version was too strict.
The reviewer's criteria were thorough — maybe too thorough. It flagged missing docstrings on every function, objected to variable names it considered insufficiently descriptive, and refused to pass anything that didn't include explicit error handling regardless of whether the task called for it. Almost nothing passed on the first review. The loop ran to the retry limit constantly, burning tokens on debates about naming conventions instead of catching real bugs.
The fix was humbling: I had to think harder about what "good code" actually means in context. A reviewer that holds a utility snippet to production application standards isn't useful — it's just expensive. Tightening the review criteria to focus on correctness and edge cases, and leaving style to the CLAUDE.md conventions layer, is what made the loop productive rather than just busy.
The second thing I learned is that verbose output is worth the cost while you're building. crewAI's verbose=True flag logs every agent action, every task handoff, every decision point. It's noisy. It's also the only way to understand why the loop is behaving the way it is. I left it on far longer than I needed to, and I don't regret it. You can't debug a loop you can't see.
The third thing — and this one surprised me — is that the reviewer persona genuinely changes the output. I expected the separation to be mostly cosmetic, a way of organizing prompts cleanly. It's not. A reviewer told to be skeptical and specific catches things that a neutral re-read misses. The backstory isn't decoration. It's instruction.
The moment it clicked was a loop run where the reviewer caught an off-by-one error the writer had introduced in a retry — an error that wasn't in the first attempt, only appeared in the revision, and would have been invisible in a one-shot workflow. That's the whole point, in one example.
What's Next
Code Genie is still early. The loop works, the output is measurably better than one-shot generation for the cases it's designed for, and the crewAI abstractions held up well enough that I'd use the framework again.
But the review layer is still bespoke — written specifically for Code Genie, not portable to other projects. The next logical step is extracting it into a reusable Skill: a code review Skill that any project can plug into its own workflow without building the full loop from scratch. That's the next article.
If you're thinking about building something similar, start with the reviewer, not the writer. The writer is easy — Claude already knows how to write code. The hard part is defining what "good enough" means precisely enough that a loop can act on it. Get that right first and the rest follows.
Tags: crewai, agentic-ai, code-generation, ai-code-review, python
Series: Building With AI Agents — Article 2 of 12