
shubham yewale

From Stateless to Adaptive: Designing a Code Review Agent That Actually Learns

AI code reviewers are everywhere now—but most of them share a critical flaw: they don’t improve.
You can run the same tool across dozens of pull requests, reject the same irrelevant suggestions repeatedly, and it will still make those exact suggestions again tomorrow. There’s no accumulation of context, no adjustment, no memory.
That limitation isn’t just inconvenient—it fundamentally caps how useful these systems can become.
So instead of building another reviewer, I focused on a different question:
What would a code review agent look like if it could learn continuously from developer feedback?
The Shift: Reviews as a Feedback System
Traditional code review tools operate like functions:
Input: diff
Output: comments
No retained state
What I built instead behaves more like a system with feedback control:
It observes past decisions
It adapts future outputs
It converges toward team-specific norms
Every time a developer accepts or rejects a suggestion, the system updates its internal understanding of what “good feedback” looks like for that team.
Over time, this creates a meaningful shift:
Fewer irrelevant suggestions
Better alignment with team conventions
More trust in the output
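As a rough sketch of that feedback loop (the class, field names, and threshold below are illustrative assumptions, not the actual implementation), accept/reject signals can be folded into per-category weights that bias what the reviewer emits next:

```python
from collections import defaultdict

class FeedbackLoop:
    """Toy sketch: per-category weights updated from accept/reject signals."""

    def __init__(self):
        # Start neutral; positive values mean the team finds the category useful.
        self.weights = defaultdict(float)

    def record(self, category: str, accepted: bool):
        # Observe a past decision and adapt future output.
        self.weights[category] += 1.0 if accepted else -1.0

    def should_emit(self, category: str, threshold: float = -2.0) -> bool:
        # Suppress categories the team has repeatedly rejected.
        return self.weights[category] > threshold

loop = FeedbackLoop()
for _ in range(3):
    loop.record("style-nitpick", accepted=False)
print(loop.should_emit("style-nitpick"))  # repeated rejections suppress the category
```

The point is not the scoring rule (which is deliberately crude here) but the shape of the loop: every decision is persisted, and every future review consults the accumulated state.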
The Core Loop
The architecture is intentionally simple but powerful. Each review goes through three stages:

  1. Context Retrieval
Before analyzing a pull request, the system pulls in historical signals:
Previously accepted suggestions
Previously rejected suggestions
Implicit team preferences
This is not static configuration; it is learned behavior.
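In outline, the retrieval stage might look like this (a sketch under assumptions; the function name and record format are mine, not the project's actual API):

```python
def retrieve_context(memory: list[str], max_items: int = 10) -> str:
    """Assemble historical feedback signals into a prompt preamble.

    `memory` holds plain-language feedback records, one sentence per
    past accept/reject decision, stored in the same form the model reads.
    """
    # Most recent decisions first; cap the window to keep the prompt small.
    recent = memory[-max_items:]
    if not recent:
        return "No prior feedback for this team."
    return "Team feedback history:\n" + "\n".join(f"- {r}" for r in recent)

records = [
    "PR #12 | File: db.py | Comment: Close the cursor | Developer accepted this suggestion.",
]
print(retrieve_context(records))
```

Because the records are already plain language, no transformation is needed between storage and prompting.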
  2. Diff Analysis and Review Generation
The pull request diff is parsed into structured chunks and passed, along with historical context, into a large language model. The model is constrained to produce structured output:
File and line references
Severity levels
Categorized issues
Optional fixes
This ensures the output is not just readable, but actionable.
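The structured-output contract can be sketched as a small schema (field names here are illustrative, not the project's actual ones):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewComment:
    file: str            # file the comment applies to
    line: int            # line number within the diff
    severity: str        # e.g. "low" | "medium" | "high"
    category: str        # e.g. "security", "performance", "style"
    message: str         # the review comment itself
    suggested_fix: Optional[str] = None  # optional replacement snippet

comment = ReviewComment(
    file="auth.py",
    line=42,
    severity="high",
    category="security",
    message="Use parameterized queries instead of string interpolation.",
    suggested_fix='cursor.execute("SELECT * FROM users WHERE id = %s", (uid,))',
)
```

Constraining the model to emit records of this shape is what makes comments mappable back onto specific diff lines, and is also what makes Accept/Reject a well-defined signal to capture.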
  3. Feedback Capture
Every interaction (Accept or Reject) is captured and persisted. This is the most important step. Without it, the system cannot evolve. With it, every review becomes training data.

Memory Design: Simplicity Over Structure
One of the more surprising design decisions was how to store feedback. The obvious approach is structured data:

```json
{ "file": "auth.py", "category": "security", "action": "rejected" }
```

Instead, I chose plain language:

"PR #42 | File: auth.py | Comment: Use parameterized queries | Developer rejected this suggestion."

This works better for a simple reason: the model consumes it directly. There is no need for transformation or interpretation. The same representation used for storage is used for reasoning. This reduces system complexity and improves alignment with how LLMs process information.

Making Reviews More Precise
A key technical challenge was ensuring that generated comments map cleanly onto the diff.

Diff Chunking
Rather than sending the entire diff as one monolithic block, it is split into file-level segments with metadata (additions, deletions, headers). This improves:
Line reference accuracy
Contextual grounding
Output consistency

Handling Imperfect Inputs
Real-world diffs are messy:
Missing headers
Partial hunks
Irregular formatting
The system includes fallback logic that treats ambiguous input as a single chunk, prioritizing resilience over strict correctness.

Output That Developers Actually Engage With
A typical review might include:
Security vulnerabilities (e.g., unsafe SQL usage)
Cryptographic issues (e.g., weak hashing)
Resource management bugs
Positive reinforcement (well-written documentation, clean abstractions)
The last category, praise, is not accidental. If every comment is negative, developers disengage. Balanced feedback makes the system feel less like a tool and more like a collaborator.

Performance Considerations
Speed matters more than it seems.
The full pipeline (context retrieval, diff parsing, model inference) completes in a few seconds. That's fast enough to feel interactive, which is critical for adoption: if the system lags, it won't be used, regardless of how good the feedback is.

Designing for Real-World Usage
Two practical decisions made a significant difference:
  4. Graceful Degradation
The system works even without external dependencies:
No memory service → fallback context
No model access → mock responses
This enables:
Local development
Reliable demos
Reduced friction for onboarding
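The degradation pattern above can be sketched as follows (the client objects and fallback strings are placeholders, not the project's real modules):

```python
def get_context(memory_client=None) -> str:
    # Fall back to a neutral preamble when no memory service is configured.
    if memory_client is None:
        return "No feedback history available."
    return memory_client.fetch_context()

def get_review(model_client=None, diff: str = "") -> str:
    # Without model access, return a canned response so demos and local
    # development still work end to end.
    if model_client is None:
        return "[mock review] No issues found in diff."
    return model_client.review(diff)

print(get_context())   # fallback context
print(get_review())    # mock response
```

Each external dependency gets a default that keeps the pipeline runnable, so a missing service degrades output quality rather than breaking the tool.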
  5. Feedback as a First-Class Interaction
The Accept/Reject mechanism is not UI decoration; it is the engine of improvement. Without it, the system stagnates. With it, the system compounds value over time.

Where This Becomes Interesting
Once you introduce memory, new possibilities emerge:

Team-Specific Behavior
Different teams within the same repository often have different standards. Segmenting memory by team allows the agent to adapt at a finer granularity.

Real Repository Integration
Hooking into live pull requests via APIs is straightforward. The challenge isn't data access; it's maintaining responsiveness and reliability at scale.

The Larger Takeaway
The real innovation here isn't in model choice, API design, or UI layout. It's in treating feedback as a persistent signal.
Most AI systems today are transactional: they answer and forget. This system is incremental: it learns and adjusts.
That shift changes the trajectory of the product:
From static tool → adaptive assistant
From generic output → team-aligned insight
From isolated interactions → compounding intelligence
Over enough iterations, those small improvements stop being incremental. They become identity. And that's when the system stops feeling like AI and starts feeling like part of the team.
