DEV Community

Alexandre Amado de Castro

Posted on • Originally published at platformtoolsmith.com

Agentic AI Code Review: From Confidently Wrong to Evidence-Based

Archbot flagged a "blocker" on a PR. It cited the diff, built a plausible chain of reasoning, and suggested a fix.

It was completely wrong. Not "LLMs are sometimes wrong" wrong — more like convincing enough that a senior engineer spent 20 minutes disproving it.

The missing detail wasn't subtle. It was a guard clause sitting in a helper two files away.
Archbot just didn't have that file.

That failure mode wasn't a prompt problem.
It was a context problem.

So I stopped trying to predict what context the model would need up-front, and switched to an agentic loop: give the model tools to fetch evidence as it goes, and require it to end with a structured "submit review" action.

This post is the architectural why and how (and the reliability plumbing that made it work). There are good hosted AI review tools now. This post is about the pattern underneath them.

> [!NOTE]
> Names, repos, and examples are intentionally generalized. This is about the design patterns, not a particular company.

If you want the backstory on what Archbot is and why I built it, start with the original post: Building Archbot: AI Code Review for GitHub Enterprise.

The fixed pipeline (and where it breaks)

My original design was what a lot of "LLM workflow" systems converge to:

  1. Give the model the PR diff.
  2. Give it some representation of the repo.
  3. Ask it to pick the files that matter.
  4. Feed only those files back in for the real review.
  5. Optionally run a second pass to critique the first pass.

That critique pass is helpful for catching obvious nonsense and reducing overconfident tone.
But it can't solve the core issue: if both passes are reasoning over the same clipped context, they're both still guessing.

It looks like this:

Agentic Pipeline Flowchart

On paper, it's elegant:

  • The model is great at prioritizing.
  • Keeping context small should reduce hallucinations.
  • A critique pass should catch the worst bad takes.

In practice, it fails in a very specific way.

The core failure: you can't pre-select the missing piece

Code review isn't "read these files".
It's "follow the chain of evidence until you understand the behavior".

And chains are not predictable.

You start in handler.go, notice it calls validate(), jump to validate.go, see it wraps a client, jump to client.go, and only then realize the bug is a timeout default in config/defaults.go.

The fixed pipeline made that exploration impossible.

Once Phase 1 picked the "important" files, Phase 2 was stuck. If the model realized it needed one more file to confirm a claim, it had no way to fetch it.

That led to two bad outcomes:

  • False positives with confidence. The model would infer behavior from a partial call chain and present it as fact.
  • Missed risk. The model would never see the one file that made a change dangerous, because that file didn't look important from the diff.

I could tune prompts.
I could add more heuristics to file selection.
I could increase the file budget.

But that's all the same bet: that I can guess the right context ahead of time.

That bet is wrong more often than it feels.

The shift: stop guessing context, let the model fetch it

The agentic version flips the contract:

  • The system does not attempt to build a perfect context.
  • It gives the model a toolset for finding evidence.
  • It loops until the model either submits a review (terminal tool) or hits a budget/timeout.

Instead of a fixed pipeline, it's an exploration loop:

Agentic Loop Sequence Diagram

You can summarize the new rule as:

"If you can't cite it, go fetch it."

Under the hood, this ended up being three pieces:

  1. A loop that alternates between "model turn" and "tool turn", with budgets and context hygiene.
  2. A tool interface that can mark certain tools as terminal.
  3. Different toolsets per mode (full review vs chat vs command), reusing the same loop.

The biggest reliability improvement wasn't "more tools".
It was making the model end by calling a terminal tool.

In a fixed prompt pipeline, the model ends by printing Markdown.
That makes it tempting to optimize for sounding right.

In an agentic loop, the model ends by submitting an object.
That changes the incentives:

  • It's harder to hand-wave; you have to populate fields.
  • You can enforce structure (severity, inline comments, evidence links).
  • The loop can treat "no submission" as failure.

So what does the terminal tool actually look like? And how do you wire all this together without it becoming a mess? Let's build it.

What changed in review quality

The qualitative shift after going agentic wasn't "more comments".
It was more explainable comments.

The best code review feedback has three properties:

  • It points at a specific behavior.
  • It cites the relevant code.
  • It proposes a fix (or at least a direction) with trade-offs.

Tools make that possible.

Instead of:

"This might break retries."

You get:

"In foo/bar.go:123, the new call bypasses withRetry(...). All other call sites use withRetry(...) (see matches in search_code). If that's intentional, we should document why; otherwise, wrap it."

Here's what that looks like in practice:

Archbot code review showing a 🔴 Blocker finding about SQL driver registration with specific file reference, detailed explanation of the panic risk, and a suggested fix

When the model has the ability (and expectation) to fetch evidence, it stops guessing.

The way I measure that shift: does it cite exact locations, or hedge? Does it go fetch more evidence when uncertain, or just keep talking? How often do I have to spend time disproving a comment?

In other words: does it behave like a cautious reviewer, or a persuasive one?

Build it yourself: the pieces you need

Here's how the pieces fit together — the tool interface, a few representative tools, the terminal tool, and how they wire into the loop.

The examples below are simplified Go to show the minimum shape. They're not production code, but they're structurally complete — you could extend these into a working system.

I didn't rebuild Archbot from scratch to go agentic. I took the existing review pipeline and wrapped it in a small loop: model turn → tool turn → repeat, plus a single terminal action (submit_code_review) to end the run. A hand-rolled loop is the fastest way to prove the product behavior (does it fetch evidence? does it stop guessing?) before you commit to a framework.

> [!TIP]
> Starting greenfield? Google's ADK is a solid default — it provides a similar tool interface with built-in orchestration, tracing, and callback hooks. The patterns below (tool contracts, terminal actions, structured output, context hygiene) apply either way.

Step 1: Define the tool contract

Every tool implements the same contract. The key method is IsTerminal() — it's what lets the loop know when to stop.

```go
type Tool interface {
    Name() string
    Description() string
    InputSchema() map[string]interface{}
    Execute(ctx context.Context, input map[string]interface{}) (string, error)
    IsTerminal() bool
}
```

Step 2: Build your context-fetching tools

Most tools follow the same pattern: parse input, call an API, return a string. Keep them boring and deterministic.

One thing I didn't appreciate early: these tools are part of your product surface area. Treat them like APIs — stable contracts, deterministic outputs, and tests that lock in behavior. If search_code returns garbage, the model will reason over garbage. The shortest path from "I suspect X" to "here's the evidence" should be one tool call.

Here's get_file_content:

```go
type GetFileContent struct {
    ghClient *github.Client
    owner    string
    repo     string
    headSHA  string
}

func (t *GetFileContent) Name() string     { return "get_file_content" }
func (t *GetFileContent) IsTerminal() bool { return false }
func (t *GetFileContent) Description() string {
    return "Fetch the full content of a file at the PR's head commit."
}

func (t *GetFileContent) InputSchema() map[string]interface{} {
    return map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "path": map[string]interface{}{
                "type":        "string",
                "description": "File path relative to the repo root",
            },
        },
        "required": []string{"path"},
    }
}

func (t *GetFileContent) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
    path, _ := input["path"].(string)
    if path == "" {
        return "", fmt.Errorf("path is required")
    }

    content, err := t.ghClient.GetFileContent(ctx, t.owner, t.repo, path, t.headSHA)
    if err != nil {
        return "", fmt.Errorf("fetching %s: %w", path, err)
    }

    return addLineNumbers(content), nil // prefix each line with its number for precise citations
}
```
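That addLineNumbers helper is what lets the model cite exact locations instead of vaguely gesturing at a file. A minimal version (my sketch, not Archbot's actual implementation) could be:

```go
package main

import (
	"fmt"
	"strings"
)

// addLineNumbers prefixes each line with its 1-based line number so the
// model can cite exact locations like "foo/bar.go:123" in its findings.
func addLineNumbers(content string) string {
	lines := strings.Split(content, "\n")
	var b strings.Builder
	for i, line := range lines {
		fmt.Fprintf(&b, "%d: %s\n", i+1, line)
	}
	return strings.TrimRight(b.String(), "\n")
}

func main() {
	fmt.Println(addLineNumbers("package main\n\nfunc main() {}"))
}
```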

And here's search_code — the tool the model reaches for when it sees a function call in a diff and wants to know "where else is this used?"

```go
type SearchCode struct {
    repoFiles map[string]string // path -> content (from repomix or local clone)
    diffText  string
}

func (t *SearchCode) Name() string     { return "search_code" }
func (t *SearchCode) IsTerminal() bool { return false }

// Description() and InputSchema() follow the same pattern as GetFileContent
// and are omitted here for brevity.

func (t *SearchCode) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
    query, _ := input["query"].(string)
    if query == "" {
        return "", fmt.Errorf("query is required")
    }
    queryLower := strings.ToLower(query)

    var matches []string

    // search repo files (production version uses ripgrep or tree-sitter for precision)
    for path, content := range t.repoFiles {
        lines := strings.Split(content, "\n")
        for i, line := range lines {
            if strings.Contains(strings.ToLower(line), queryLower) {
                matches = append(matches, fmt.Sprintf("%s:%d: %s", path, i+1, strings.TrimSpace(line)))
            }
        }
    }

    // search the PR diff too
    for i, line := range strings.Split(t.diffText, "\n") {
        if strings.Contains(strings.ToLower(line), queryLower) {
            matches = append(matches, fmt.Sprintf("(diff):%d: %s", i+1, strings.TrimSpace(line)))
        }
    }

    if len(matches) == 0 {
        return fmt.Sprintf("No matches found for %q", query), nil
    }
    return strings.Join(matches, "\n"), nil
}
```

Step 3: Define the terminal tool

This is what makes the loop an agent instead of a chatbot. The model doesn't print freeform text — it calls submit_code_review with structured data, and the loop ends.

One subtle but important implementation detail: treat the terminal tool as non-executable. When the model emits submit_code_review, the loop captures the structured payload and ends immediately. Your application code (outside the loop) turns that payload into an actual GitHub review.

That also gives you a schema you can validate. Here's what the model submits:

```json
{
  "summary": "What changed + why it matters",
  "inline_comments": [
    {
      "path": "foo/bar.go",
      "line": 123,
      "severity": "blocker",
      "comment": "Specific issue, evidence, and a suggested fix"
    }
  ],
  "high_level_feedback": [
    "Design-level note that isn't tied to a single line"
  ]
}
```

If the model puts "blocker" in the severity field but can't provide a path + line, that's the model telling you it doesn't have evidence.

Here's the implementation:

```go
type SubmitCodeReview struct{}

func (t *SubmitCodeReview) Name() string     { return "submit_code_review" }
func (t *SubmitCodeReview) IsTerminal() bool { return true }
func (t *SubmitCodeReview) Description() string {
    return `Submit your final code review. Every blocker and should-fix MUST
include a file path and line number. If you cannot cite evidence, downgrade
to nice-to-have or omit the finding.`
}

func (t *SubmitCodeReview) InputSchema() map[string]interface{} {
    return map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "summary": map[string]interface{}{"type": "string"},
            "inline_comments": map[string]interface{}{
                "type": "array",
                "items": map[string]interface{}{
                    "type": "object",
                    "properties": map[string]interface{}{
                        "path":     map[string]interface{}{"type": "string"},
                        "line":     map[string]interface{}{"type": "integer"},
                        "severity": map[string]interface{}{"type": "string", "enum": []string{"blocker", "should-fix", "nice-to-have"}},
                        "comment":  map[string]interface{}{"type": "string"},
                    },
                    "required": []string{"path", "line", "severity", "comment"},
                },
            },
            "high_level_feedback": map[string]interface{}{
                "type": "array",
                "items": map[string]interface{}{"type": "string"},
            },
        },
        "required": []string{"summary", "inline_comments"},
    }
}

func (t *SubmitCodeReview) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
    // In practice, the loop intercepts this before Execute runs.
    out, _ := json.MarshalIndent(input, "", "  ")
    return string(out), nil
}
```
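Because the review arrives as a structured object, the application layer can enforce the evidence rule mechanically before posting anything to GitHub. A sketch of that validation — the field names match the schema above, but the function itself is my addition, not Archbot's code:

```go
package main

import "fmt"

// InlineComment mirrors one entry of the inline_comments array in the
// submit_code_review schema.
type InlineComment struct {
	Path     string `json:"path"`
	Line     int    `json:"line"`
	Severity string `json:"severity"`
	Comment  string `json:"comment"`
}

// validateEvidence rejects blockers and should-fixes that lack a concrete
// citation: no path and line, no finding above nice-to-have.
func validateEvidence(comments []InlineComment) error {
	for i, c := range comments {
		if c.Severity == "nice-to-have" {
			continue
		}
		if c.Path == "" || c.Line <= 0 {
			return fmt.Errorf("comment %d: severity %q requires a path and line", i, c.Severity)
		}
	}
	return nil
}

func main() {
	err := validateEvidence([]InlineComment{{Severity: "blocker", Comment: "uncited claim"}})
	fmt.Println(err) // non-nil: uncited blockers are rejected
}
```

You can choose how strict to be on failure: reject the whole submission and tell the model to try again, or automatically downgrade uncited findings. Either way, the check is cheap and deterministic.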

Step 4: Write the loop

This is the core of the agent. Conceptually, it's tiny:

```go
// Simplified: the real code has retries, token accounting, and timeouts.
for iteration := 1; iteration <= maxIterations; iteration++ {
    resp, err := model.Converse(ctx, messages, tools)
    if err != nil {
        return nil, err
    }

    if resp.HasToolCall("submit_code_review") {
        return resp.ToolInput("submit_code_review"), nil
    }

    toolMessages, err := executeTools(ctx, resp.ToolCalls)
    if err != nil {
        return nil, err
    }

    messages = append(messages, toolMessages...)
    messages = shrinkStaleToolResults(messages, maxToolResultChars)
}
return nil, fmt.Errorf("agent loop ended without submit_code_review")
```

That shrinkStaleToolResults line is doing more work than it appears to. The loop manages context aggressively:

  • It keeps the diff and the most recent evidence.
  • It truncates older tool results (the giant file dumps you needed 3 turns ago).
  • If the model hits a context overflow anyway, it retries the same iteration after shrinking tool results harder.
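shrinkStaleToolResults itself can be surprisingly small. One possible shape — the Message type and truncation marker are my simplifications, not Archbot's actual representation:

```go
package main

import "fmt"

// Message is a simplified conversation entry.
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// shrinkStaleToolResults truncates every tool result except the most recent
// one, leaving a stub so the model knows the evidence existed and can
// re-fetch it if it becomes relevant again.
func shrinkStaleToolResults(messages []Message, maxChars int) []Message {
	lastTool := -1
	for i, m := range messages {
		if m.Role == "tool" {
			lastTool = i
		}
	}
	out := make([]Message, len(messages))
	copy(out, messages)
	for i := range out {
		if out[i].Role == "tool" && i != lastTool && len(out[i].Content) > maxChars {
			out[i].Content = out[i].Content[:maxChars] + "\n[...truncated: re-fetch if needed]"
		}
	}
	return out
}

func main() {
	msgs := []Message{
		{Role: "tool", Content: "a very long file dump from three turns ago"},
		{Role: "tool", Content: "fresh evidence"},
	}
	for _, m := range shrinkStaleToolResults(msgs, 10) {
		fmt.Println(m.Content)
	}
}
```

The stub text matters: telling the model the result was truncated (and that it can re-fetch) is what keeps shrinking from silently destroying evidence the model still believes it has.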

Why bother with all this shrinking? Why not just stuff the entire repo into the prompt and skip tool calls entirely? Because long context has its own failure modes.

Even if your model can accept huge inputs, performance doesn't scale linearly with tokens. Two patterns show up in research and in practice:

  1. "Lost in the middle." Models tend to over-weight the beginning and end of long contexts, and under-use relevant info buried in the middle.
  2. Distraction. Irrelevant context measurably reduces accuracy; the model starts pattern-matching on junk.

If you want to go deep on the evidence, start with "Lost in the Middle" (Liu et al.) and "Large Language Models Can Be Easily Distracted by Irrelevant Context" (Shi et al.).

The practical takeaway isn't "never use long context". It's this:

Treat tokens like budget, not storage.

This is boring plumbing. It is also the difference between a system that works on small PRs and a system that survives the messy ones.

Step 5: Wire it together

This is the entry point. You assemble your tools, set your budgets, run the loop, and handle the structured output.

```go
func RunReview(ctx context.Context, model Model, pr PRContext) (*ReviewResult, error) {
    tools := []Tool{
        &GetPRInfo{pr: pr},
        &GetPRDiff{pr: pr},
        &GetFileContent{ghClient: pr.Client, owner: pr.Owner, repo: pr.Repo, headSHA: pr.HeadSHA},
        &SearchCode{repoFiles: pr.RepoFiles, diffText: pr.Diff},
        &SubmitCodeReview{},
    }

    config := LoopConfig{
        SystemPrompt: reviewSystemPrompt,
        InitialMessage: fmt.Sprintf(
            "Review PR #%d: %s\nAuthor: %s\nBase: %s ← Head: %s",
            pr.Number, pr.Title, pr.Author, pr.Base, pr.Head,
        ),
        Tools:              tools,
        MaxIterations:      15,
        MaxToolResultChars: 120_000,
    }

    result, err := RunLoop(ctx, model, config)
    if err != nil {
        return nil, err
    }

    if result.TerminalToolName != "submit_code_review" {
        return nil, fmt.Errorf("loop ended without submitting a review")
    }

    return parseReviewResult(result.TerminalToolInput)
}
```

The tools are simple. The loop is simple. The power comes from letting the model decide which tools to call, and in what order.

That terminal tool pattern also gives you a clean way to run the same loop in different modes:

  • Full PR review: includes submit_code_review.
  • Interactive chat: no terminal tool; the loop returns assistant text.
  • Command mode: terminal tool is "submit result" for that command.

One loop. Different endings.

Step 6: Add guardrails

Agentic doesn't mean "let it run forever". It means you give the model freedom inside a box:

  • Iteration caps. Hard limit on round-trips.
  • Timeouts. Total wall-clock budget.
  • Tool result limits. Max characters per tool output, and stricter limits for stale outputs.
  • Self-critique before submission. The terminal tool instructions include a checklist: cite evidence, downgrade uncertain claims, avoid bikeshedding, don't invent runtime failures.

If you're building your own version, don't copy my numbers. Copy the pattern: budgets are part of the interface.

Step 7: Evaluate on real PRs

Don't trust vibes. Pick 5-10 PRs where you already know what the real risks were — a missed nil check, a broken migration, a security issue someone caught in manual review. Run your loop against those PRs and check:

  • Did it find the real issue?
  • Did it cite the right file and line?
  • Did it hallucinate problems that don't exist?
  • When it was uncertain, did it fetch more evidence or just assert?

That gives you a ground-truth baseline. From there, iterate on your tools (not your prompts).
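The ground-truth check can start as something this simple — a table of known issues and a crude keyword match against the agent's comments. All names here are hypothetical; a real harness would also count false positives and score evidence quality:

```go
package main

import (
	"fmt"
	"strings"
)

// knownIssue is one ground-truth finding from a PR you already reviewed.
type knownIssue struct {
	prNumber int
	path     string
	line     int
	keyword  string // e.g. "nil check", "migration"
}

// scoreReview counts how many known issues the agent's inline comments hit,
// matching on file path plus a case-insensitive keyword.
func scoreReview(issues []knownIssue, comments []string) (hits int) {
	for _, issue := range issues {
		for _, c := range comments {
			if strings.Contains(c, issue.path) &&
				strings.Contains(strings.ToLower(c), issue.keyword) {
				hits++
				break
			}
		}
	}
	return hits
}

func main() {
	issues := []knownIssue{
		{prNumber: 42, path: "db/migrate.go", line: 17, keyword: "migration"},
	}
	comments := []string{
		"db/migrate.go:17: this migration drops a column without a backfill",
	}
	fmt.Println(scoreReview(issues, comments)) // prints "1"
}
```

Even this crude recall number is enough to compare two versions of your toolset and see whether a change actually helped.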

The trade-offs (because there are always trade-offs)

Agentic review is not a free lunch.

  • Latency: tool calls add round-trips. You have to make tools fast and cacheable.
  • Cost: more turns can mean more tokens. Budgeting is mandatory.
  • Tool design: bad tools produce bad behavior. (If search_code returns garbage, the model will reason over garbage.)
  • Security: tools are an exfiltration surface. Your tool layer needs authorization and redaction.

But for code review, the win is that the model starts behaving like a cautious reviewer: it looks things up.

Don't build a bigger prompt. Build a loop where the model can fetch evidence, update its hypothesis, and only finish when it submits a structured, checkable result. When you stop forcing the model to guess context, you stop debugging prompt vibes and start debugging actual interfaces.

Want to build your own agentic review system? Let's talk through the architecture.
