Patel Yug

How I Built a Code Review Agent That Remembers Everything Your Codebase Has Ever Done

Most code review tools have no memory. Every pull request starts from zero — no context about why that pattern was rejected three months ago, no awareness that this new service is tightly coupled to a fragile module downstream, no institutional knowledge baked in. They review code, not codebases.

I got tired of watching the same mistakes get made repeatedly on our project, not because developers were careless, but because the relevant knowledge lived in old Slack threads and forgotten PR comments. So I built ReviewMind: an AI code review agent with persistent memory and a live, interactive dependency graph that visually maps every change against your entire repository history.


The Problem I Was Actually Solving

Generic review tools give you linting errors and style suggestions. Useful, but shallow. What they can't tell you is:

  • "We rejected this singleton pattern in March because it caused race conditions in the auth module."
  • "This new utility function touches the same data pipeline that broke production last quarter."
  • "Three developers have tried to implement this same caching approach — here's why it keeps getting reverted."

That kind of context is what long-tenured engineers carry in their heads and what newer team members lack. It lives in human memory, or it disappears. ReviewMind's job is to capture it, store it, and surface it automatically on every new submission.


The Architecture: Three Layers Working Together

I designed ReviewMind around three infrastructure components, each with a very specific job.

1. The Vector Database — Storing "Why"

This is the core of the memory system. I used Weaviate as the vector database, storing semantic embeddings of:

  • Past code diffs and their associated review outcomes
  • Project-specific architectural decisions and style guides
  • Historical review feedback ("rejected because X", "approved after Y was changed")

When a new PR comes in, the agent doesn't just look at the raw code. It queries Weaviate for semantically similar past changes — finding architectural patterns that rhyme with what's being submitted, even if the code looks different on the surface.

# Querying Weaviate for semantically similar past reviews.
# Assumes `weaviate_client` is a weaviate-client v3 Client (this builder
# syntax is the v3 API) and `embed` returns the embedding as a list of floats.
def retrieve_similar_reviews(code_diff: str, top_k: int = 5):
    embedding = embed(code_diff)
    results = (
        weaviate_client.query
        .get("ReviewHistory", ["diff", "feedback", "outcome"])
        .with_near_vector({"vector": embedding})
        .with_limit(top_k)
        .do()
    )
    return results["data"]["Get"]["ReviewHistory"]

This is what gives the agent its "institutional memory." It doesn't just know your code — it knows your team's decisions.

2. Redis — The High-Speed Context Engine

Redis handles two critical jobs in ReviewMind:

Semantic Caching: If a developer submits a PR whose diff is 95%+ similar (measured by cosine similarity between embeddings) to one that was reviewed recently and is still in the cache, Redis serves the cached review instantly, with no LLM call needed. This cuts costs significantly on large teams and keeps latency low.

Real-time Session State: The interactive dependency graph updates live as users explore it. Redis keeps the active graph state in memory so the visualization layer stays snappy even as the repository grows into hundreds of modules.

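import json

# Cache check before hitting the LLM. This lookup is an exact match on the
# diff hash; `redis_client` is assumed to be a connected redis-py client.
def get_cached_review(diff_hash: str):
    cached = redis_client.get(f"review:{diff_hash}")
    if cached:
        return json.loads(cached)
    return None

# Store a review under its diff hash. `ttl` is in seconds (86400 = 24 hours).
def cache_review(diff_hash: str, review: dict, ttl: int = 86400):
    redis_client.setex(f"review:{diff_hash}", ttl, json.dumps(review))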

3. The LLM — Reasoning With Context

The LLM (I used Claude via the Anthropic API) is the reasoning layer. It receives:

  1. The current code diff
  2. Retrieved historical context from Weaviate
  3. The dependency graph footprint of the changed files

It then synthesizes all three into a structured review — not just "this looks wrong" but "this pattern was specifically flagged in PR #247 because it caused a cascade failure in the notification service, which this change touches directly."

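import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
anthropic_client = anthropic.Anthropic()

# `format_context` is a separate helper that renders the history and
# dependency lists as plain text for the prompt.
def generate_review(diff: str, history: list, dependencies: list):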
    context = format_context(history, dependencies)
    prompt = f"""
You are reviewing a code change with full institutional memory.

Historical context:
{context}

Current diff:
{diff}

Provide a structured review referencing specific past decisions where relevant.
Flag any downstream dependencies at risk.
    """
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
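
The `format_context` helper called above is not shown in the post. A minimal sketch of what it might look like follows, assuming each history item is a dict with the `feedback` and `outcome` properties requested from Weaviate and each dependency is a module path string. The output layout is an illustration, not ReviewMind's actual format.

```python
from typing import Dict, List


def format_context(history: List[Dict[str, str]], dependencies: List[str]) -> str:
    """Render retrieved reviews and affected modules as plain text for the prompt."""
    lines = ["Past reviews:"]
    if not history:
        lines.append("- (none found)")
    for item in history:
        # Keys mirror the Weaviate properties queried earlier: feedback, outcome.
        outcome = item.get("outcome", "unknown")
        feedback = item.get("feedback", "").strip()
        lines.append(f"- [{outcome}] {feedback}")

    lines.append("")
    lines.append("Modules touched by this change:")
    if not dependencies:
        lines.append("- (none resolved)")
    for module in dependencies:
        lines.append(f"- {module}")
    return "\n".join(lines)
```

Keeping the outcome in front of each feedback line gives the model a compact signal of whether a similar change was accepted or rejected before it reads the details.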

The Feature That Makes It Visual: The Dependency Impact Graph

This is the part I'm most proud of. Every code submission generates a live, interactive System Mind Map — a visual graph that maps the relational footprint of the new code against the existing repository architecture.

The graph does three things:

1. Maps dependencies — every file the changed code touches, directly or transitively, gets plotted as a node.

2. Flags risk — files or modules that are downstream of the change get color-coded by risk level: green for safe, amber for potentially affected, red for historically fragile modules.

3. Encodes memory visually — components that have violated historically established rules in past reviews are highlighted differently. A developer can look at the graph and immediately see: "That module has a pattern flag from three previous reviews."

This is where persistent memory stops being a backend concept and becomes something you can actually see and interact with. Instead of reading through a wall of text, you point at a node in the graph and understand the blast radius of your change in seconds.
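
The graph logic described above can be sketched as a breadth-first walk over a reverse dependency map. This is an illustration under stated assumptions rather than ReviewMind's implementation: the `dependents` map, the `fragile` set, and the rule "downstream and fragile is red, downstream is amber, unreachable is green" are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Dict, List, Set


def impact_graph(
    dependents: Dict[str, List[str]],
    changed: List[str],
    fragile: Set[str],
) -> Dict[str, str]:
    """Colour each module by how the change reaches it.

    `dependents` maps a module to the modules that depend on it, so walking
    it from the changed files finds everything downstream, transitively.
    """
    colours: Dict[str, str] = {name: "changed" for name in changed}
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            # Downstream modules with a history of problems get the highest risk.
            colours[child] = "red" if child in fragile else "amber"

    # Every remaining module is unreachable from the change.
    for module, children in dependents.items():
        for name in [module, *children]:
            if name not in seen:
                colours.setdefault(name, "green")
    return colours
```

A frontend could take the returned mapping as node attributes; the set of non-green nodes is the blast radius of the change.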


The Data Flow End to End

Here's how a complete review runs:

  1. Developer submits a PR → the diff is extracted and hashed
  2. Cache check → Redis looks for a near-identical past review
  3. Memory retrieval → Weaviate finds semantically similar past reviews and architectural decisions
  4. Dependency resolution → the graph layer maps which modules are touched
  5. LLM synthesis → Claude combines diff + history + dependencies into a structured critique
  6. Visual render → the frontend draws the interactive dependency map, color-coded by memory flags
  7. Result caching → Redis stores the output for future similar PRs

The whole pipeline runs in under 3 seconds for typical PR sizes — fast enough that it doesn't interrupt a developer's flow.
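
As a rough sketch, the server-side steps (1 to 5 and 7) could be wired together as below. Rendering (step 6) happens in the frontend and is left out. The function name `run_review` and its callable parameters are assumptions for illustration; passing each stage in as a function keeps the example runnable without Redis, Weaviate, or an LLM. The cache lookup here is the exact-hash variant.

```python
import hashlib
from typing import Callable, List, Optional


def run_review(
    diff: str,
    get_cached: Callable[[str], Optional[dict]],
    retrieve: Callable[[str], List[dict]],
    resolve_deps: Callable[[str], List[str]],
    generate: Callable[[str, List[dict], List[str]], str],
    store: Callable[[str, dict], None],
) -> dict:
    # Step 1: hash the diff so it can be used as a cache key.
    diff_hash = hashlib.sha256(diff.encode("utf-8")).hexdigest()

    # Step 2: return early on a cache hit, skipping retrieval and the LLM.
    cached = get_cached(diff_hash)
    if cached is not None:
        return cached

    # Steps 3 and 4: gather historical context and the dependency footprint.
    history = retrieve(diff)
    dependencies = resolve_deps(diff)

    # Step 5: let the LLM combine diff, history, and dependencies.
    review = {
        "text": generate(diff, history, dependencies),
        "dependencies": dependencies,
    }

    # Step 7: cache the result for future identical submissions.
    store(diff_hash, review)
    return review
```

With the helpers from earlier sections, the call would look roughly like `run_review(diff, get_cached_review, retrieve_similar_reviews, resolve_deps, generate_review, cache_review)`, where `resolve_deps` stands in for the graph layer.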


What I Learned Building This

Memory storage is easy. Memory retrieval is hard. Getting Weaviate queries to surface genuinely useful historical context — and not just superficially similar code — required careful tuning of the embedding strategy and chunking approach. Storing entire file diffs as single embeddings doesn't work well. Breaking them into semantic units (function-level, module-level) made a significant difference.
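
One way to get function-level units, assuming the repository is Python and the post-change file contents are available, is to split each file with the standard `ast` module and embed every function separately. The helper name `function_chunks` is hypothetical, and other languages would need their own parser.

```python
import ast
from typing import List, Tuple


def function_chunks(source: str) -> List[Tuple[str, str]]:
    """Return (qualified_name, source_text) for each function and method."""
    tree = ast.parse(source)
    chunks: List[Tuple[str, str]] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Nested functions stay inside their parent's chunk.
                segment = ast.get_source_segment(source, child)
                chunks.append((prefix + child.name, segment or ""))
            elif isinstance(child, ast.ClassDef):
                # Descend into classes so methods become their own chunks.
                visit(child, prefix + child.name + ".")

    visit(tree, "")
    return chunks
```

Each chunk can then be embedded and stored alongside its qualified name, so a retrieved match points at a specific function instead of a whole file.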

Redis semantic caching needs a similarity threshold, not exact matching. Pure hash-based caching misses too many cache opportunities. I implemented a lightweight similarity check using embedding cosine distance before deciding whether to serve from cache or hit the LLM.
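
A minimal sketch of such a similarity check is shown below. It assumes the embeddings of previously reviewed diffs are available as plain lists of floats next to their cached reviews, and it scans them linearly; the function names and the in-memory list are illustrative, and the 0.95 default mirrors the "95%+ similar" rule mentioned earlier.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as dissimilar to everything instead of dividing by zero.
    return dot / norm if norm else 0.0


def find_cached_review(
    embedding: Sequence[float],
    cached: List[Tuple[Sequence[float], dict]],
    threshold: float = 0.95,
) -> Optional[dict]:
    """Return the review of the most similar past diff, if it clears the threshold."""
    best_score, best_review = 0.0, None
    for past_embedding, review in cached:
        score = cosine_similarity(embedding, past_embedding)
        if score > best_score:
            best_score, best_review = score, review
    return best_review if best_score >= threshold else None
```

A linear scan is fine for a few hundred cached entries; beyond that, the nearest-neighbour lookup would more sensibly be delegated to the vector index.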

The visualization layer is where engineers actually engage. Text-based review output gets skimmed. The interactive graph gets studied. Developers naturally want to click on the flagged nodes and understand why they're red. That engagement is where institutional knowledge actually transfers.

LLM context windows fill up fast with history. I learned to be aggressive about summarizing and ranking historical context before sending it to the LLM. Sending the five most semantically relevant past reviews outperforms sending the twenty most recent ones.
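
The ranking step might look like the sketch below. It assumes each candidate carries a numeric `score` (higher means more relevant) and a `feedback` string, and it uses a character budget as a crude stand-in for a token budget; the field names, `top_k`, and `max_chars` defaults are hypothetical.

```python
from typing import Dict, List


def select_context(
    candidates: List[Dict],
    top_k: int = 5,
    max_chars: int = 4000,
) -> List[Dict]:
    """Keep the highest-scoring reviews that fit within a character budget."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    selected: List[Dict] = []
    used = 0
    for item in ranked[:top_k]:
        cost = len(item["feedback"])
        if used + cost > max_chars:
            continue  # skip entries that would overflow the budget
        selected.append(item)
        used += cost
    return selected
```

Summarizing long feedback before this step would let more entries fit, at the cost of an extra LLM call per stored review.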

Incremental architecture beats big-bang design. I started with just the LLM review: no memory, no graph. I added Weaviate next, then Redis caching, then the visualization layer. Each step was independently useful and testable. If I'd tried to build every layer simultaneously from the start, I'd still be debugging the integration.


Where It Goes From Here

The immediate next step is cross-repository memory — letting ReviewMind learn from architectural decisions made across multiple projects within an organization. Right now, the memory is scoped to a single repo. The infrastructure already supports it; it's a matter of building the right isolation and access control layer.

The Hindsight framework has been instrumental in structuring how the agent retains and retrieves episodic memory — I'd encourage anyone building memory-augmented agents to start there rather than rolling their own. The Hindsight documentation covers the core retain/recall primitives well, and the Vectorize agent memory overview is worth reading before you design your storage schema.

The thing I keep coming back to is this: the value of a code review compounds over time. The tenth review on a codebase should be dramatically better than the first, because by then the agent has seen what breaks, what holds, and what the team actually values. That's what ReviewMind is building toward — a reviewer that gets better the longer it works with you.


Built by Yug Patel, Team Code Warriors.