The problem: code review agents don’t really “learn”
Most AI code review tools feel impressive for about 5 minutes.
They catch syntax issues, suggest formatting improvements, and maybe flag obvious bugs.
But they don’t remember past mistakes.
If your codebase keeps repeating the same issue, the agent gives the same generic advice every time.
We didn’t want a smarter reviewer.
We wanted a reviewer that gets better over time.
What we built
We built a code review agent with persistent memory using an LLM and Hindsight.
The idea is simple:
Every review should make the next review better.
What pushed us to build this
While testing early versions, we noticed the agent kept missing repeated issues across multiple pull requests.
For example, it would flag a missing null check in one PR, but completely forget about it in the next one.
Even worse, it kept giving the same generic suggestion without improving.
That’s when we realized the problem wasn’t intelligence.
It was memory.
System overview
- Developer pushes code
- Agent reviews code
- Mistakes are stored in memory
- Future reviews use that memory
This turns stateless reviews into evolving feedback.
Core idea: feedback loop
Step 1: Analyze code
```python
def review_code(diff):
    # Ask the LLM to analyze the diff and return structured issues
    issues = analyze_with_llm(diff)
    return issues
```
Step 2: Store memory
```python
def store_memory(issue):
    # Persist the issue pattern and its fix so future reviews can recall it
    hindsight.retain({
        "pattern": issue.pattern,
        "fix": issue.fix
    })
```
Step 3: Recall past issues
```python
def get_memory(code):
    # Retrieve previously stored issues relevant to this code
    return hindsight.recall(query=code)
Step 4: Improve output
```python
def enhanced_review(diff):
    # Combine the new diff with recalled past issues for context-aware feedback
    past = get_memory(diff)
    return analyze_with_context(diff, past)
```
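To see how the four steps fit together, here is a minimal, self-contained sketch of the feedback loop. `MemoryStore`, `Issue`, and the naive keyword matching are stand-ins we made up for illustration; in our setup, Hindsight's retain/recall fills that role with semantic search.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    pattern: str
    fix: str

class MemoryStore:
    """In-memory stand-in for the retain/recall API (illustrative only)."""
    def __init__(self):
        self._items = []

    def retain(self, issue):
        self._items.append(issue)

    def recall(self, query):
        # Naive substring match; a real store would use semantic search
        return [i for i in self._items if i.pattern in query]

def enhanced_review(diff, memory):
    # Recall past issues first, then flag (and store) anything new
    comments = [f"Seen before: {i.pattern} -> try: {i.fix}" for i in memory.recall(diff)]
    if "== None" in diff:
        memory.retain(Issue(pattern="== None", fix="use 'is None'"))
        comments.append("New issue: use 'is None' instead of '== None'")
    return comments

memory = MemoryStore()
first = enhanced_review("if user == None:", memory)   # nothing recalled yet
second = enhanced_review("if item == None:", memory)  # recalls the prior pattern
```

On the second run, the same class of bug is recognized from memory instead of being rediscovered from scratch.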
Before vs After
Before:
- Generic feedback
- Same suggestions repeated
After:
- Personalized feedback
- Pattern recognition
- Continuous improvement
One clear change we observed:
Earlier, the agent would say:
"Handle null values properly"
After adding memory, it started saying:
"You’ve had similar null-check issues in previous PRs. Consider centralizing validation."
That shift made the feedback actually useful.
DevOps integration
This agent runs inside a CI pipeline:
- Triggered on pull requests
- Reviews code automatically
- Posts comments
- Stores learning after each run
So instead of being a one-time tool, it becomes part of the development workflow.
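As a sketch of the "posts comments" step, here is a small helper that renders findings into a single markdown comment body. The dict shape (`file`, `severity`, `message`) and the function name are assumptions we chose for illustration; actually posting the comment goes through your forge's API (e.g. the GitHub REST API).

```python
def format_pr_comment(issues):
    """Render review findings as one markdown PR comment body.

    Assumes each issue is a dict with 'file', 'severity', and 'message'
    keys -- a shape chosen for illustration, not a Hindsight type.
    """
    if not issues:
        return "Automated review: no issues found."
    lines = ["### Automated review findings", ""]
    for issue in issues:
        lines.append(
            f"- **{issue['severity'].upper()}** `{issue['file']}`: {issue['message']}"
        )
    return "\n".join(lines)

comment = format_pr_comment([
    {"file": "auth.py", "severity": "high", "message": "Missing null check on user input"},
])
```

Posting one consolidated comment per run keeps the PR thread readable compared to one comment per finding.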
Where things broke (and what we learned)
At one point, we made a mistake.
We stored too many low-quality issues in memory.
The result?
The agent started:
- recalling irrelevant issues
- giving noisy suggestions
- becoming less accurate
It actually got worse.
Fixing memory quality
We fixed this by filtering what gets stored:
```python
if issue.severity > threshold:
    store_memory(issue)
```
We also started:
- prioritizing recent issues
- ignoring low-impact suggestions
This made the memory system much more reliable.
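Putting the severity gate and the recency rule together, here is a sketch of the filtering we ended up with. The 1–5 severity scale, the threshold, the 90-day window, and the dict shape are illustrative choices on our side, not Hindsight settings.

```python
import time

SEVERITY_THRESHOLD = 3   # assumption: severity scored 1-5 by the reviewer
MAX_AGE_DAYS = 90        # assumption: memories older than this are ignored

def should_retain(issue):
    # Gate what enters memory: only issues above the severity threshold
    return issue["severity"] >= SEVERITY_THRESHOLD

def prune_memories(memories, now=None):
    # Keep recent, high-severity memories and sort them newest-first
    now = now if now is not None else time.time()
    cutoff = now - MAX_AGE_DAYS * 86400
    kept = [
        m for m in memories
        if m["severity"] >= SEVERITY_THRESHOLD and m["stored_at"] >= cutoff
    ]
    return sorted(kept, key=lambda m: m["stored_at"], reverse=True)

now = 10_000_000.0
memories = [
    {"pattern": "missing null check", "severity": 5, "stored_at": 9_000_000.0},
    {"pattern": "nit: naming", "severity": 1, "stored_at": 9_500_000.0},   # too minor
    {"pattern": "old sql issue", "severity": 4, "stored_at": 1_000_000.0}, # too old
]
kept = prune_memories(memories, now=now)
```

Filtering at write time (`should_retain`) plus pruning at read time (`prune_memories`) is what stopped the agent from drowning in its own low-quality recollections.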
What surprised us
We expected better reviews.
We didn’t expect behavior change.
The agent started:
- referencing past mistakes
- recognizing patterns
- suggesting consistent fixes
At some point, it stopped feeling like a tool.
It felt like a junior engineer that learns over time.
Lessons learned
- Stateless AI has limits
- Memory is more powerful than prompt tuning
- Not all feedback should be remembered
- Feedback loops create real improvement
- DevOps integration is essential
Final thought
Most AI tools react.
This one remembers.
And that changes everything.
Links
- Hindsight GitHub: https://github.com/vectorize-io/hindsight
- Docs: https://hindsight.vectorize.io/
- Agent memory: https://vectorize.io/what-is-agent-memory