DEV Community

Abhiram C Divakaran

Building a Code Review Agent That Learns From Every Decision

Most AI-powered developer tools share a fundamental limitation: they reset to zero after every interaction. Close the tab, and the system forgets everything—your preferences, your team’s standards, and the context behind past decisions.
I wanted the opposite.
Instead of a stateless reviewer, I set out to build a code review agent that adapts over time—one that pays attention to which suggestions developers accept, which they reject, and gradually aligns itself with how a team actually works.
The result is a review system that evolves. After a handful of pull requests, it stops behaving like a generic linter and starts resembling a teammate who understands your codebase and your norms.
System Overview
At a high level, the agent sits in front of pull requests and executes a tight feedback loop:
Recall — Retrieve past review patterns and team conventions
Review — Analyze the current diff and generate structured feedback
Retain — Store developer decisions to refine future behavior
A developer opens a PR, triggers the review, and receives annotated feedback. Each comment can be accepted or rejected, and that signal feeds directly back into the system.
The interface is intentionally simple:
Left: PR metadata and file list
Center: syntax-highlighted diff
Right: structured review comments with actions
Each comment includes severity, location, category, and—when applicable—a suggested fix.
The key is what happens after interaction: repeated rejection of a specific suggestion type (e.g., stylistic nitpicks) suppresses it in future reviews. The system adapts without explicit configuration.
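A minimal sketch of that suppression logic (the rejection threshold, record shape, and function names here are assumptions for illustration, not the article's exact implementation):

```python
from collections import Counter

REJECTION_THRESHOLD = 3  # assumed cutoff; tune per team


def suppress_rejected_categories(comments, feedback_records):
    """Drop comment categories the team has repeatedly rejected."""
    rejections = Counter(
        r["category"] for r in feedback_records if r["action"] == "rejected"
    )
    suppressed = {cat for cat, n in rejections.items() if n >= REJECTION_THRESHOLD}
    return [c for c in comments if c["category"] not in suppressed]
```

Because the filter reads past decisions at review time, no configuration file ever needs to change; the team's behavior is the configuration.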
Memory as a First-Class Primitive
The most interesting part of the system isn’t the model—it’s the memory layer.
Instead of treating each review as an isolated task, the agent uses two primitives:
retain() — persist feedback decisions
recall() — retrieve relevant historical patterns
Retaining Feedback
Each developer action is stored as a simple, human-readable record:
```python
async def retain_feedback(repo: str, pr_id: str, comment: str, file: str, action: str):
    payload = {
        "collection": f"reviews:{repo}",
        "content": f"PR #{pr_id} | File: {file} | Comment: {comment} | Developer {action} this suggestion.",
        "metadata": {"pr_id": pr_id, "file": file, "action": action}
    }
    ...  # persist the payload to the memory store
```
Notably, the system avoids rigid schemas. Instead of structured JSON objects, it stores plain language summaries.
Recalling Context
When a new review starts, the system retrieves patterns:
```python
async def recall_context(repo: str) -> dict:
    ...  # query the memory store for patterns under f"reviews:{repo}"
    return {"past_patterns": past_patterns or "No past patterns yet."}
```
These patterns are injected directly into the model prompt.
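The article doesn't show the prompt assembly itself; a minimal sketch of how recalled patterns might be inlined ahead of the diff (the function name and prompt wording are assumptions):

```python
def build_review_prompt(diff_chunk: str, past_patterns: str) -> str:
    """Place recalled team history before the diff so the model sees it first."""
    return (
        "You are a code reviewer. Past team decisions:\n"
        f"{past_patterns}\n\n"
        "Review the following diff and return structured JSON comments:\n"
        f"{diff_chunk}"
    )
```

Because the patterns are plain sentences, they drop straight into the prompt with no serialization step in between.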
Why Plain Text Wins
This design choice turned out to be critical.
LLMs don’t need structured records—they need interpretable context. A sentence like “Developer rejected this suggestion” is immediately useful without parsing overhead. It aligns naturally with how the model reasons.
The Review Pipeline
The backend is a lightweight service built around three endpoints:
GET /prs — fetch PR data
POST /review — execute the full review pipeline
POST /feedback — record Accept/Reject decisions
The core flow lives inside the review endpoint:
```python
@app.post("/review")
async def review_pr(request: ReviewRequest):
    memory = await recall_context(request.repo)   # 1. Recall past patterns
    chunks = parse_diff(request.diff)             # 2. Split the diff into file-level chunks
    comments = await generate_review(...)         # 3. Generate structured feedback
    return {"comments": comments, "memory_used": memory}
```
Diff Parsing
Diffs are split into file-level chunks, each annotated with additions and deletions. This improves the model’s ability to anchor feedback to specific lines.
Edge cases are unavoidable—malformed diffs, missing headers, unusual filenames—so a fallback treats the entire diff as a single block when needed. Not elegant, but robust.
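A minimal sketch of that split-then-fallback behavior, assuming Git-style `diff --git` headers (the article doesn't show its `parse_diff`, so the chunk shape here is illustrative):

```python
import re


def parse_diff(diff: str):
    """Split a unified diff into file-level chunks; fall back to one block."""
    chunks = []
    current = None
    for line in diff.splitlines():
        m = re.match(r"^diff --git a/(\S+) b/(\S+)", line)
        if m:
            current = {"file": m.group(2), "lines": []}
            chunks.append(current)
        elif current is not None:
            current["lines"].append(line)
    if not chunks:
        # Malformed or headerless input: treat the whole diff as one block
        return [{"file": "unknown", "lines": diff.splitlines()}]
    return chunks
```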
Model Output
The model is instructed to return strictly structured JSON:
file
line number
severity
category
comment
optional suggestion
A defensive fallback wraps malformed responses into a valid structure when parsing fails—a necessity during early iterations.
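A sketch of that defensive wrapper (field defaults are assumptions; the article only specifies that malformed output is coerced into a valid structure):

```python
import json


def parse_model_output(raw: str):
    """Parse the model's JSON; wrap malformed output in a valid fallback comment."""
    try:
        comments = json.loads(raw)
        if isinstance(comments, list):
            return comments
    except json.JSONDecodeError:
        pass
    # Fallback: keep the raw text so the review is never silently dropped
    return [{
        "file": "unknown",
        "line": 0,
        "severity": "info",
        "category": "general",
        "comment": raw.strip(),
    }]
```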
Example Output
On a PR introducing an authentication endpoint, the agent produced:
Critical (security) — direct SQL string interpolation → injection risk
Critical (security) — MD5 used for hashing → insecure
Warning (bug) — database connection not closed
Praise (documentation) — clear and helpful docstring
The inclusion of positive feedback is intentional. Purely negative reviews are easy to ignore; balanced feedback increases engagement and trust.
What Actually Matters
Several lessons emerged during development:

  1. Memory Is the Differentiator. The first review is average. The tenth is meaningfully better. The Accept/Reject loop isn’t a feature—it’s the mechanism that makes the system improve. Without it, you’re just building another static reviewer.
  2. Human-Readable Context Outperforms Structured Data. For LLM-driven systems, readability beats schema design. Storing feedback as natural language eliminates translation layers and lets the model reason directly over prior decisions.
  3. Diff Handling Is Non-Trivial. Unified diffs contain numerous edge cases. Any production system needs defensive parsing and sensible fallbacks.
  4. Latency Shapes UX. End-to-end response time sits around 2–3 seconds. That’s fast enough to feel interactive, which is essential for developer adoption.
  5. Build for Offline and Demo Scenarios. Both the memory layer and model calls include fallbacks: default team standards when memory is unavailable, and mock review responses when APIs are not configured. This made development smoother and ensured the system works even without external dependencies.

Where This Goes Next

Two extensions stand out:

GitHub Integration
Replacing static PR data with live pull requests is straightforward. GitHub’s diff format is directly compatible, requiring only API integration.

Team-Aware Memory
Currently, all feedback is stored per repository. A more refined approach would segment memory by team, allowing different groups within the same repo to maintain distinct review preferences.

The Core Insight

Most AI tools operate as one-shot systems. They respond, then forget. Adding memory changes the trajectory entirely.

Each Accept or Reject is a small signal. Individually, they’re trivial. At scale, they compound into a system that reflects how a team actually writes and reviews code.

That compounding effect is what transforms a generic assistant into something genuinely useful. And that’s the part worth building.
