The problem: code review agents don’t really “learn”
Most AI code review tools feel impressive for about 5 minutes.
They catch syntax issues, suggest formatting improvements, and maybe flag obvious bugs.
But they don’t remember past mistakes.
If your codebase keeps repeating the same issue, the agent gives the same generic advice every time.
We didn’t want a smarter reviewer.
We wanted a reviewer that gets better over time.
What we built
We built a code review agent with persistent memory using an LLM and Hindsight.
The idea is simple:
Every review should make the next review better.
What pushed us to build this
While testing early versions, we noticed the agent kept missing repeated issues across multiple pull requests.
For example, it would flag a missing null check in one PR, but completely forget about it in the next one.
Even worse, it kept giving the same generic suggestion without improving.
That’s when we realized the problem wasn’t intelligence.
It was memory.
System overview
- Developer pushes code
- Agent reviews code
- Mistakes are stored in memory
- Future reviews use that memory
This turns stateless reviews into evolving feedback.
Core idea: feedback loop
Step 1: Analyze code
```python
def review_code(diff):
    # Ask the LLM to analyze the diff and return structured issues
    issues = analyze_with_llm(diff)
    return issues
```
Step 2: Store memory
```python
def store_memory(issue):
    # Persist the issue pattern and its fix so future reviews can recall it
    hindsight.retain({
        "pattern": issue.pattern,
        "fix": issue.fix
    })
```
Step 3: Recall past issues
```python
def get_memory(code):
    # Retrieve previously stored issues relevant to this code
    return hindsight.recall(query=code)
Step 4: Improve output
```python
def enhanced_review(diff):
    # Combine the new diff with recalled past issues for context-aware feedback
    past = get_memory(diff)
    return analyze_with_context(diff, past)
```
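To see how the four steps fit together, here is a minimal, self-contained sketch of the feedback loop. `MemoryStore`, `Issue`, and the naive keyword matching are stand-ins we made up for illustration; in our setup, Hindsight's retain/recall fills that role with semantic search.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    pattern: str
    fix: str

class MemoryStore:
    """In-memory stand-in for the retain/recall API (illustrative only)."""
    def __init__(self):
        self._items = []

    def retain(self, issue):
        self._items.append(issue)

    def recall(self, query):
        # Naive substring match; a real store would use semantic search
        return [i for i in self._items if i.pattern in query]

def enhanced_review(diff, memory):
    # Recall past issues first, then flag (and store) anything new
    comments = [f"Seen before: {i.pattern} -> try: {i.fix}" for i in memory.recall(diff)]
    if "== None" in diff:
        memory.retain(Issue(pattern="== None", fix="use 'is None'"))
        comments.append("New issue: use 'is None' instead of '== None'")
    return comments

memory = MemoryStore()
first = enhanced_review("if user == None:", memory)   # nothing recalled yet
second = enhanced_review("if item == None:", memory)  # recalls the prior pattern
```

On the second run, the same class of bug is recognized from memory instead of being rediscovered from scratch.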
Before vs After
Before:
- Generic feedback
- Same suggestions repeated
After:
- Personalized feedback
- Pattern recognition
- Continuous improvement
One clear change we observed:
Earlier, the agent would say:
"Handle null values properly"
After adding memory, it started saying:
"You’ve had similar null-check issues in previous PRs. Consider centralizing validation."
That shift made the feedback actually useful.
DevOps integration
This agent runs inside a CI pipeline:
- Triggered on pull requests
- Reviews code automatically
- Posts comments
- Stores learning after each run
So instead of being a one-time tool, it becomes part of the development workflow.
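As a sketch of the "posts comments" step, here is a small helper that renders findings into a single markdown comment body. The dict shape (`file`, `severity`, `message`) and the function name are assumptions we chose for illustration; actually posting the comment goes through your forge's API (e.g. the GitHub REST API).

```python
def format_pr_comment(issues):
    """Render review findings as one markdown PR comment body.

    Assumes each issue is a dict with 'file', 'severity', and 'message'
    keys -- a shape chosen for illustration, not a Hindsight type.
    """
    if not issues:
        return "Automated review: no issues found."
    lines = ["### Automated review findings", ""]
    for issue in issues:
        lines.append(
            f"- **{issue['severity'].upper()}** `{issue['file']}`: {issue['message']}"
        )
    return "\n".join(lines)

comment = format_pr_comment([
    {"file": "auth.py", "severity": "high", "message": "Missing null check on user input"},
])
```

Posting one consolidated comment per run keeps the PR thread readable compared to one comment per finding.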
Where things broke (and what we learned)
At one point, we made a mistake.
We stored too many low-quality issues in memory.
The result?
The agent started:
- recalling irrelevant issues
- giving noisy suggestions
- becoming less accurate
It actually got worse.
Fixing memory quality
We fixed this by filtering what gets stored:
```python
if issue.severity > threshold:
    store_memory(issue)
```
We also started:
- prioritizing recent issues
- ignoring low-impact suggestions
This made the memory system much more reliable.
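Putting the severity gate and the recency rule together, here is a sketch of the filtering we ended up with. The 1–5 severity scale, the threshold, the 90-day window, and the dict shape are illustrative choices on our side, not Hindsight settings.

```python
import time

SEVERITY_THRESHOLD = 3   # assumption: severity scored 1-5 by the reviewer
MAX_AGE_DAYS = 90        # assumption: memories older than this are ignored

def should_retain(issue):
    # Gate what enters memory: only issues above the severity threshold
    return issue["severity"] >= SEVERITY_THRESHOLD

def prune_memories(memories, now=None):
    # Keep recent, high-severity memories and sort them newest-first
    now = now if now is not None else time.time()
    cutoff = now - MAX_AGE_DAYS * 86400
    kept = [
        m for m in memories
        if m["severity"] >= SEVERITY_THRESHOLD and m["stored_at"] >= cutoff
    ]
    return sorted(kept, key=lambda m: m["stored_at"], reverse=True)

now = 10_000_000.0
memories = [
    {"pattern": "missing null check", "severity": 5, "stored_at": 9_000_000.0},
    {"pattern": "nit: naming", "severity": 1, "stored_at": 9_500_000.0},   # too minor
    {"pattern": "old sql issue", "severity": 4, "stored_at": 1_000_000.0}, # too old
]
kept = prune_memories(memories, now=now)
```

Filtering at write time (`should_retain`) plus pruning at read time (`prune_memories`) is what stopped the agent from drowning in its own low-quality recollections.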
What surprised us
We expected better reviews.
We didn’t expect behavior change.
The agent started:
- referencing past mistakes
- recognizing patterns
- suggesting consistent fixes
At some point, it stopped feeling like a tool.
It felt like a junior engineer that learns over time.
Lessons learned
- Stateless AI has limits
- Memory is more powerful than prompt tuning
- Not all feedback should be remembered
- Feedback loops create real improvement
- DevOps integration is essential
Final thought
Most AI tools react.
This one remembers.
And that changes everything.
Links
- Hindsight GitHub: https://github.com/vectorize-io/hindsight
- Docs: https://hindsight.vectorize.io/
- Agent memory: https://vectorize.io/what-is-agent-memory