Sanjeetha
Fixing blind spots in code reviews with Hindsight memory

The problem: code review agents don’t really “learn”

Most AI code review tools feel impressive for about 5 minutes.

They catch syntax issues, suggest formatting improvements, and maybe flag obvious bugs.

But they don’t remember past mistakes.

If your codebase keeps repeating the same issue, the agent gives the same generic advice every time.

We didn’t want a smarter reviewer.

We wanted a reviewer that gets better over time.


What we built

We built a code review agent with persistent memory using an LLM and Hindsight.

The idea is simple:

Every review should make the next review better.


What pushed us to build this

While building this, we noticed our agent kept missing repeated issues across multiple pull requests.

For example, it would flag a missing null check in one PR, but completely forget about it in the next one.

Even worse, it kept giving the same generic suggestion without improving.

That’s when we realized the problem wasn’t intelligence.

It was memory.


System overview

  1. Developer pushes code
  2. Agent reviews code
  3. Mistakes are stored in memory
  4. Future reviews use that memory

This turns stateless reviews into evolving feedback.


Core idea: feedback loop

Step 1: Analyze code

```python
def review_code(diff):
    # Ask the LLM to analyze the diff and return structured issues
    issues = analyze_with_llm(diff)
    return issues
```

Step 2: Store memory

```python
def store_memory(issue):
    # Persist the issue pattern and its fix so future reviews can recall it
    hindsight.retain({
        "pattern": issue.pattern,
        "fix": issue.fix
    })
```

Step 3: Recall past issues

```python
def get_memory(code):
    # Recall past issues relevant to the code under review
    return hindsight.recall(query=code)

Step 4: Improve output

```python
def enhanced_review(diff):
    # Combine the fresh diff with recalled history before analyzing
    past = get_memory(diff)
    return analyze_with_context(diff, past)
```
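Putting the four steps together, the whole loop fits in a few lines. This is a minimal sketch: `InMemoryStore` is a stand-in for the Hindsight store (substring matching instead of real semantic recall), and the issue data is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    pattern: str
    fix: str

class InMemoryStore:
    """Tiny stand-in for the Hindsight store, for illustration only."""
    def __init__(self):
        self.entries = []

    def retain(self, entry):
        self.entries.append(entry)

    def recall(self, query):
        # Naive substring match; a real store would do semantic retrieval
        return [e for e in self.entries if e["pattern"] in query]

memory = InMemoryStore()

def store_memory(issue):
    memory.retain({"pattern": issue.pattern, "fix": issue.fix})

def enhanced_review(diff):
    past = memory.recall(diff)
    return [f"Seen before: {e['pattern']} -> {e['fix']}" for e in past]

# First PR: remember a recurring issue
store_memory(Issue("missing null check", "centralize validation"))

# Later PR touching similar code: the memory surfaces it
print(enhanced_review("user lookup has a missing null check"))
```

The key property is that `enhanced_review` gets strictly more context with every merged PR, without any prompt changes.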

Before vs After

Before:

  • Generic feedback
  • Same suggestions repeated

After:

  • Personalized feedback
  • Pattern recognition
  • Continuous improvement

One clear change we observed:

Earlier, the agent would say:

> "Handle null values properly"

After adding memory, it started saying:

> "You’ve had similar null-check issues in previous PRs. Consider centralizing validation."

That shift made the feedback actually useful.


DevOps integration

This agent runs inside a CI pipeline:

  • Triggered on pull requests
  • Reviews code automatically
  • Posts comments
  • Stores learning after each run

So instead of being a one-time tool, it becomes part of the development workflow.
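The CI wiring can be sketched as a single entry point. The hook names and the stub review function below are illustrative, not a real CI API; the hooks are injected so the same orchestration runs in CI and in tests.

```python
def run_ci_review(diff, review_fn, post_comment_fn, store_fn):
    """CI entry point: review the PR diff, post comments, persist learning."""
    issues = review_fn(diff)           # agent reviews the code
    for issue in issues:
        post_comment_fn(issue["fix"])  # post a review comment
        store_fn(issue)                # store learning for the next run
    return len(issues)

# Example wiring with stub hooks:
posted, stored = [], []
count = run_ci_review(
    "+ user.profile.name  # added without a null check",
    lambda d: [{"pattern": "missing null check", "fix": "guard before access"}],
    posted.append,
    stored.append,
)
```

In a real pipeline, `post_comment_fn` would call the code host's PR comment API and `store_fn` would call `store_memory` from above.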


Where things broke (and what we learned)

At one point, we made a mistake.

We stored too many low-quality issues in memory.

The result?

The agent started:

  • recalling irrelevant issues
  • giving noisy suggestions
  • becoming less accurate

It actually got worse.


Fixing memory quality

We fixed this by filtering what gets stored:

```python
# threshold is a tunable cutoff; only severe issues are worth remembering
if issue.severity > threshold:
    store_memory(issue)
```

We also started:

  • prioritizing recent issues
  • ignoring low-impact suggestions

This made the memory system much more reliable.
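Those two rules can be made concrete. The threshold and the retention window below are assumptions for the sketch, not values from our actual setup:

```python
import time

SEVERITY_THRESHOLD = 2   # assumed cutoff; tune per project
MAX_AGE_DAYS = 90        # assumed retention window for memories

def should_store(issue):
    # Only remember issues severe enough to be worth recalling later
    return issue["severity"] > SEVERITY_THRESHOLD

def prune_memories(memories, now=None):
    # Drop stale entries so recent issues dominate recall
    now = time.time() if now is None else now
    cutoff = now - MAX_AGE_DAYS * 24 * 3600
    return [m for m in memories if m["stored_at"] >= cutoff]
```

Running `prune_memories` before each recall keeps the memory small and biased toward what the team is getting wrong right now.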


What surprised us

We expected better reviews.

We didn’t expect behavior change.

The agent started:

  • referencing past mistakes
  • recognizing patterns
  • suggesting consistent fixes

At some point, it stopped feeling like a tool.

It felt like a junior engineer that learns over time.


Lessons learned

  1. Stateless AI has limits
  2. Memory is more powerful than prompt tuning
  3. Not all feedback should be remembered
  4. Feedback loops create real improvement
  5. DevOps integration is essential

Final thought

Most AI tools react.

This one remembers.

And that changes everything.

