Giovanni

Posted on • Originally published at wows.dev

I Built an AI Code Reviewer That Asks Questions Instead of Handing Out Answers

Every AI code tool I've tried wants to do the same thing: rewrite my student's function for them. That's great in production. It's terrible in education.

I teach programming; I've trained over 1,000 developers through bootcamps across Europe. I know exactly what happens when a beginner gets handed the corrected version of their code: they copy it, they move on, they learn nothing. The whole point of a code review in education is to make the student think harder, not think less.

So I built something different. A workflow that takes a GitHub repo, runs it through an LLM with structured evaluation prompts, and returns scored feedback that teaches instead of fixing. No auto-corrections. No rewritten functions. Just pointed questions and concept explanations tied to specific lines.

It runs on N8N, costs almost nothing per evaluation, and took about an afternoon to set up.

What It Actually Does

You give it a GitHub repo URL. It comes back with:

  • Scores across structured metrics: naming & expressiveness, structure & decomposition, logic & control flow, and (optionally) assignment completeness
  • Per-file breakdowns with 2–3 sentence assessments per metric
  • Line-referenced suggestions that point at exact code and ask questions instead of providing fixes
  • A next-steps block identifying the single highest-impact habit the student should build

Every suggestion is typed as positive, improvement, or concern, so the feedback isn't just a wall of criticism. The LLM is explicitly told never to write corrected code. Instead, it explains why something matters and challenges the student to figure out the fix.
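To make the shape concrete, here's what one typed suggestion might look like in the response. The field names are illustrative assumptions, not the workflow's actual schema:

```javascript
// One suggestion entry as the evaluator might emit it.
// Field names are illustrative, not the workflow's actual schema.
const ALLOWED_TYPES = ["positive", "improvement", "concern"];

const suggestion = {
  file: "script.js",
  line: 1,
  type: "improvement", // one of ALLOWED_TYPES
  metric: "naming & expressiveness",
  feedback:
    "What does `x` tell a reader about what this variable stores? " +
    "What name would communicate its purpose immediately?",
};

console.log(ALLOWED_TYPES.includes(suggestion.type));
```

Note there is no `suggestedCode` field anywhere: the schema itself makes it impossible to hand the student a fix.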

The Pipeline in 60 Seconds

The whole thing is an N8N workflow with three stages.

Stage 1: Repository breakdown. Fetch the repo's file tree from GitHub, then run a quick LLM pass to figure out which folders contain student-written source code. This filters out node_modules, build artifacts, config files, and framework boilerplate before the expensive evaluation starts.
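Before the LLM pass, a plain ignore-list pre-filter can cut the obvious noise. This is a sketch under assumed conventions (the folder names and extensions are my own guesses, not the workflow's exact list):

```javascript
// Hypothetical pre-filter on the GitHub file tree: drop dependency and
// build folders, keep only likely source-code extensions. The LLM pass
// then only has to judge the remaining candidates.
const IGNORED = ["node_modules/", "dist/", "build/", ".git/", "coverage/"];
const SOURCE_EXTENSIONS = [".js", ".ts", ".jsx", ".tsx", ".html", ".css"];

function filterStudentFiles(paths) {
  return paths.filter(
    (p) =>
      !IGNORED.some((dir) => p.startsWith(dir) || p.includes("/" + dir)) &&
      SOURCE_EXTENSIONS.some((ext) => p.endsWith(ext))
  );
}

const tree = [
  "src/index.js",
  "node_modules/lodash/lodash.js",
  "dist/bundle.js",
  "README.md",
];
console.log(filterStudentFiles(tree)); // → ["src/index.js"]
```

The cheap static filter runs first so the LLM only arbitrates the ambiguous cases, which keeps the Stage 1 token cost small.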

Stage 2: Evaluation. Download the target files, prepend line numbers, validate the total size (a hard cap at 3,000 lines prevents surprise token bills), and submit everything to the LLM with a carefully structured prompt. The prompt defines exact scoring rubrics, enforces teaching-oriented feedback, and demands JSON output.
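The line-numbering and size-cap steps are simple enough to sketch directly. This is a minimal version, assuming 1-based numbering and a cap counted in total lines across all files (matching the 3,000-line cap described above):

```javascript
// Prepend 1-based line numbers so the LLM can reference exact lines
// in its suggestions.
const MAX_TOTAL_LINES = 3000;

function numberLines(source) {
  return source
    .split("\n")
    .map((line, i) => `${i + 1}: ${line}`)
    .join("\n");
}

// Hard cap before any tokens are spent: fail fast on oversized repos.
function validateSize(files) {
  const total = files.reduce((sum, f) => sum + f.content.split("\n").length, 0);
  if (total > MAX_TOTAL_LINES) {
    throw new Error(`Submission too large: ${total} lines (cap ${MAX_TOTAL_LINES})`);
  }
  return total;
}

console.log(numberLines('var x = "todos";\nfetch(x);'));
// → '1: var x = "todos";\n2: fetch(x);'
```

Numbering the lines yourself, rather than trusting the model to count, is what makes the line-referenced suggestions reliable.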

Stage 3: Pack the result. Parse and validate the LLM's JSON response, clamp all scores to 0–10, compute an overall score, and send it back through the webhook. Optionally, a bonus node transforms the JSON into clean GitHub-compatible Markdown.
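The clamp-and-aggregate step can be sketched in a few lines. The plain mean here is an assumption; the real workflow may weight metrics differently:

```javascript
// Defensive post-processing of the LLM's JSON: clamp every metric to
// 0–10, then average into a one-decimal overall score.
function clamp(score) {
  return Math.min(10, Math.max(0, score));
}

function overallScore(metrics) {
  const values = Object.values(metrics).map(clamp);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.round(mean * 10) / 10; // round to one decimal place
}

const scores = { naming: 3, structure: 3, logic: 5, completeness: 10 };
console.log(overallScore(scores)); // → 5.3
```

Clamping matters because LLMs occasionally emit an 11 or a -1 even when the prompt specifies the range; never trust numeric output from a model without bounds-checking it.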

The entire flow fires from a single POST request. One curl command, one structured response.

Why the Prompt Design Matters Most

The hard part isn't the pipeline. It's the prompt.

Without careful constraints, the LLM defaults to "helpful developer" mode: it rewrites functions, suggests exact code changes, and hands out vague scores. Three design decisions fix this:

Explicit persona. The system prompt opens by framing the LLM as a patient, experienced mentor whose purpose is helping the student think about code better. Not a linter. Not a grading machine.

Constrained scoring bands. Scores below 7 must reference specific issues. Scores above 8 must reference specific strengths. This prevents the LLM from handing out comfortable 7s across every metric, which is exactly what it does without this constraint.

Feedback rules that enforce pedagogy. Rule 1: teach the concept, never write corrected code. Rule 3: use questions to activate thinking. These two rules are the difference between "rename x to todosUrl" and "what does x tell a reader about what this variable stores? What name would communicate its purpose immediately?"

A Real Example

Here's a student submission: a simple fetch-filter-display exercise with the kind of mistakes you see in every beginner codebase:

```javascript
var x = "https://jsonplaceholder.typicode.com/todos";
fetch(x)
  .then(function(r) { return r.json(); })
  .then(function(data) {
    var arr = [];
    for (var i = 0; i < data.length; i++) {
      if (data[i].completed == false) {
        arr.push(data[i].title);
      }
    }
    var html = "";
    for (var j = 0; j < 5; j++) {
      html = html + "<p>" + arr[j] + "</p>";
    }
    document.getElementById("output").innerHTML = html;
  });
```

The code works. But look at what the reviewer catches: x as a URL variable name, r for the response, loose equality with ==, no .catch(), an unguarded loop that renders undefined if fewer than 5 items match, and zero function decomposition.

The feedback doesn't say "rename x to TODOS_API_URL." It says: "x tells us nothing about what it stores. What is this string? What name would communicate that immediately? Constants in JavaScript are often written in UPPER_SNAKE_CASE to signal they won't change."

That's the difference. The student has to close the gap themselves.

Final score: 5.3/10, with naming at 3, structure at 3, logic at 5, and completeness at 10. Honest, specific, and actionable.

Who This Is For

If you teach programming, whether at a bootcamp, a university, or in corporate training, and you're tired of writing the same "rename this variable" comment for the 400th time, this solves that. The workflow handles the repetitive feedback so you can focus on the students who need deeper one-on-one attention.

It's also useful if you lead a team with junior developers. The review format works just as well for pull request feedback as it does for homework; the teaching-oriented approach helps juniors internalize patterns instead of just applying patches.

And if you're a self-taught developer looking for honest feedback on your own projects, you can point it at your own repos. It's surprisingly effective at catching habits you've gone blind to.

What It Costs

N8N's cloud starter tier is ~$20/month for 2,500 executions, more than enough for a teacher. Self-host it and the only cost is LLM tokens. A typical student project evaluation runs between a fraction of a cent and a few cents. Complex projects with larger codebases or top-tier models can push higher, but for the teaching use case, mid-tier models work surprisingly well.

One practical note: the workflow uses a cheap model for utility tasks like folder detection and Markdown transformation, and reserves the more capable model for the actual evaluation. That split keeps costs down without sacrificing feedback quality where it matters.

Try It: 200 Free Beta Spots

I've opened a limited closed beta: 200 spots to test the full workflow on a running instance. No setup, no API keys, no N8N account needed. Submit a public GitHub repo URL and get back a scored, line-by-line evaluation.

👉 codegrader.wows.dev/beta

If you want to see what the output looks like before committing, here's the result for the example above.

Want the Full Build Guide?

This post covers the what and why. If you want the how (every N8N node, every code block, the complete prompt, the JSON output template, and the Markdown transformation bonus), I wrote the full step-by-step breakdown:

👉 How to Build an AI Code Reviewer That Teaches Instead of Fixing

The full article includes a downloadable N8N workflow JSON you can import directly into your instance and start using today.


I'm Giovanni — I build tools for developers and teach programming at scale. Currently shipping CodeGrader and ZipOps from a self-hosted K8s cluster that costs less than $100/month. Find me at wows.dev.
