Aleksandr Ishchenko

Posted on Jun 15

Building a RAG-Powered Code Reviewer That Actually Understands Your Codebase

#ai #php #rag #showdev

Most AI code review tools give you generic advice. "Add type hints." "Handle exceptions." Useful, sure — but the same advice you'd get from any linter or a quick ChatGPT prompt.

What if your AI reviewer could say: "Add type hints to the constructor — consistent with how OrderProcessor.php and OrderPlaceAfter.php already do it in your project"?

That's the difference between a generic AI tool and one that understands your codebase. I built the latter. Here's how.

The Problem

I've been a Magento/PHP developer for 12+ years. Magento has complex architectural patterns — plugins, observers, dependency injection via XML, area-scoped configuration. When tools like CodeRabbit or GitHub Copilot review Magento code, they're getting better at repository-wide context — Copilot indexes your workspace, CodeRabbit reads related files. But they still treat Magento XML configurations as static text files rather than active dependency injection and event routing maps.

They don't inherently know that:

A class registered as a <plugin> in di.xml must implement at least one interceptor method, named with a before, after, or around prefix plus the target method's name (for example, beforeSave for save)
An observer for sales_order_place_after must implement Magento\Framework\Event\ObserverInterface
Direct SQL queries bypass Magento's entire database abstraction layer — repositories, resource models, caching, plugins, observers, and business logic like MSI or event-driven extensions

I wanted a tool that understands these things. Not because it was trained on Magento docs, but because it read my actual project and understands the patterns we follow.

The Approach: RAG for Code

RAG (Retrieval-Augmented Generation) is a pattern where you retrieve relevant context before sending a query to an LLM. For code review, this means:

1. Index the codebase: parse every PHP file into functions/methods/classes, generate embeddings, store in a vector database
2. Enrich with framework knowledge: parse di.xml, events.xml, module.xml to understand how classes are registered (as plugins, observers, preferences)
3. Retrieve context: when reviewing a file, search for similar code, related classes, and architectural patterns from the indexed codebase
4. Generate review: send the file + retrieved context to the LLM
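The retrieve and generate steps above can be sketched in a few lines. This is a minimal illustration, not the project's code: `Chunk`, `overlap_score`, `retrieve`, and `build_prompt` are hypothetical names, and the word-overlap scorer is a toy stand-in for embedding similarity.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    path: str  # file the chunk came from
    name: str  # class or method name
    text: str  # source code of the chunk


def _words(text: str) -> set[str]:
    """Lowercased identifier-like tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(a: str, b: str) -> float:
    """Toy relevance score: Jaccard overlap of word sets (stand-in for embeddings)."""
    wa, wb = _words(a), _words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def retrieve(index: list[Chunk], query: str, k: int = 5) -> list[Chunk]:
    """Retrieval step: return the k indexed chunks most relevant to the query."""
    ranked = sorted(index, key=lambda c: overlap_score(c.text, query), reverse=True)
    return ranked[:k]


def build_prompt(reviewed_code: str, context: list[Chunk]) -> str:
    """Generation step: combine the file under review with the retrieved context."""
    parts = ["Review the following PHP file. Project context:"]
    for chunk in context:
        parts.append(f"--- {chunk.path} :: {chunk.name}\n{chunk.text}")
    parts.append(f"--- FILE UNDER REVIEW\n{reviewed_code}")
    return "\n\n".join(parts)
```

The prompt returned by `build_prompt` is what would be sent to the LLM; the model never sees the whole codebase, only the few chunks that scored highest.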

The key insight: the LLM doesn't need to be trained on Magento. It just needs to see how your project does things, and it can spot inconsistencies.

Why RAG and not just a huge context window?

Fair question. Gemini offers a 1-2M token context window. Why not dump the entire module in there?

For a single small module (20-50 KB) — you absolutely could. But this approach is designed for a different scale: a real Magento project with vendor/magento/ (200+ core modules) plus app/code/ (dozens of custom modules). That's megabytes of code. Sending all of it into context for every review is slow, expensive, and — as my experiments showed — counterproductive.

When I tested RAG with irrelevant context (a "customer hobby" module as context for an "order processing" review), the results were worse than simple mode. The LLM got confused by unrelated code. RAG gives you surgical, relevant context — 5 precisely selected code chunks instead of 500 files where 490 are noise.

That said, for small projects, long-context is simpler and may be sufficient. RAG pays off at scale.

Architecture

The system has three layers beneath the CLI:

┌─────────────────────────────────┐
│  CLI: review / index / search   │
├─────────────────────────────────┤
│  Agent Layer                    │
│  ├── RAG Review Service         │
│  ├── Search Service             │
│  └── Review Service (simple)    │
├─────────────────────────────────┤
│  Core Engine                    │
│  ├── LLM Providers (Gemini...)  │
│  ├── Embedding (BGE-small)      │
│  └── PHP Parser (tree-sitter)   │
├─────────────────────────────────┤
│  Framework Layer                │
│  ├── Magento Config Parser      │
│  └── Module Indexer             │
└─────────────────────────────────┘

Why layers matter: The LLM and embedding components know nothing about PHP or Magento. The PHP parser knows nothing about Magento. Only the framework layer has Magento-specific knowledge. By my estimate, adding Symfony or Laravel support would reuse about 80% of the codebase.

LLM providers are behind an abstraction — switching from Gemini to Claude is one environment variable. Same for embeddings. This was a deliberate architectural choice: provider-agnostic from day one.
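A provider abstraction of this kind might look like the sketch below. The names are illustrative assumptions, not the project's actual API: `LLMProvider`, `EchoProvider`, `get_provider`, and the `LLM_PROVIDER` environment variable are all hypothetical, and the Gemini and Claude classes are placeholders that do not call any real client.

```python
import os
from collections.abc import Mapping
from typing import Optional, Protocol


class LLMProvider(Protocol):
    """Anything that can turn a prompt into a completion."""

    def complete(self, prompt: str) -> str: ...


class GeminiProvider:
    def complete(self, prompt: str) -> str:
        # Placeholder: a real provider would call the Gemini API here.
        raise NotImplementedError("Gemini client not wired up in this sketch")


class ClaudeProvider:
    def complete(self, prompt: str) -> str:
        # Placeholder: a real provider would call the Claude API here.
        raise NotImplementedError("Claude client not wired up in this sketch")


class EchoProvider:
    """Offline stand-in that needs no API key, useful in tests."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


PROVIDERS = {"gemini": GeminiProvider, "claude": ClaudeProvider, "echo": EchoProvider}


def get_provider(env: Optional[Mapping[str, str]] = None) -> LLMProvider:
    """Pick a provider from the (hypothetical) LLM_PROVIDER variable."""
    env = os.environ if env is None else env
    name = env.get("LLM_PROVIDER", "gemini").lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}")
    return PROVIDERS[name]()
```

The review services would depend only on `LLMProvider`, so adding a provider means adding one class and one dictionary entry.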

How Indexing Works

When you run mage-audit index ./my-module, the system:

1. Parses PHP files with tree-sitter — not regex, not string splitting. Tree-sitter builds an AST (Abstract Syntax Tree), so we extract classes, methods, and functions at their logical boundaries. A 500-line file becomes 15-20 separate chunks, each a complete logical unit.
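Chunking at logical boundaries amounts to a tree walk. The sketch below assumes nodes that expose `.type`, `.children`, `.start_byte`, and `.end_byte`, as tree-sitter nodes do, and node type names taken from the tree-sitter-php grammar. `CodeChunk` and `iter_chunks` are hypothetical names, and the demo uses stub nodes rather than a real parser so that it runs without any grammar installed.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Iterator

# Node types assumed from the tree-sitter-php grammar.
CHUNK_TYPES = {"class_declaration", "method_declaration", "function_definition"}


@dataclass(frozen=True)
class CodeChunk:
    kind: str
    start_byte: int
    end_byte: int
    text: str


def iter_chunks(node, source: bytes) -> Iterator[CodeChunk]:
    """Depth-first walk that yields one chunk per class, method, or function node."""
    if node.type in CHUNK_TYPES:
        text = source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
        yield CodeChunk(node.type, node.start_byte, node.end_byte, text)
    for child in node.children:
        yield from iter_chunks(child, source)


# Demo with stub nodes; a real run would pass the parsed tree's root node instead.
source = b"<?php class A { function f() {} }"
m_start = source.index(b"function")
method = SimpleNamespace(type="method_declaration", children=[],
                         start_byte=m_start, end_byte=m_start + len(b"function f() {}"))
cls = SimpleNamespace(type="class_declaration", children=[method],
                      start_byte=source.index(b"class"), end_byte=len(source))
root = SimpleNamespace(type="program", children=[cls], start_byte=0, end_byte=len(source))
chunks = list(iter_chunks(root, source))
```

In this sketch a class yields its own chunk and each of its methods yields another, so method-level and class-level matches are both possible at search time. Whether the real tool keeps both is not stated here.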

2. Parses Magento XML configs as structured data — di.xml becomes a map of plugins (with target classes, sort orders, disabled flags) and preferences (interface-to-implementation mappings). events.xml becomes a list of observers with event names, handler classes, and methods. This isn't text parsing — it's structured extraction that understands what these configs mean in the Magento DI/event system.
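As an illustration of that structured extraction, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. `Plugin`, `parse_di_xml`, and `parse_events_xml` are hypothetical names, not the project's API, and the sketch only reads the attributes mentioned above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plugin:
    target: str                # class or interface being intercepted
    name: str                  # plugin name from di.xml
    plugin_class: str          # class that implements the interceptor methods
    sort_order: Optional[int]  # None when sortOrder is not set
    disabled: bool


def parse_di_xml(xml_text: str) -> tuple[list[Plugin], dict[str, str]]:
    """Return (plugins, preferences) where preferences maps interface -> implementation."""
    root = ET.fromstring(xml_text)
    plugins = []
    for type_el in root.iter("type"):
        for p in type_el.findall("plugin"):
            order = p.get("sortOrder")
            plugins.append(Plugin(
                target=type_el.get("name", ""),
                name=p.get("name", ""),
                plugin_class=p.get("type", ""),
                sort_order=int(order) if order is not None else None,
                disabled=p.get("disabled", "false").lower() == "true",
            ))
    preferences = {el.get("for", ""): el.get("type", "") for el in root.iter("preference")}
    return plugins, preferences


def parse_events_xml(xml_text: str) -> list[tuple[str, str]]:
    """Return (event name, observer class) pairs."""
    root = ET.fromstring(xml_text)
    return [
        (event.get("name", ""), observer.get("instance", ""))
        for event in root.iter("event")
        for observer in event.findall("observer")
    ]
```

With this shape, "which classes are plugins for `OrderRepositoryInterface`?" becomes a list filter instead of a text search.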

3. Enriches chunks with framework context — if a class is registered as a plugin for OrderRepositoryInterface, the chunk gets tagged: [PLUGIN for Magento\Sales\Api\OrderRepositoryInterface]. This tag is included in the embedding text, so searching for "plugin for order save" directly matches it.
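A sketch of that enrichment step follows. `embedding_text` is a hypothetical helper; the `[PLUGIN for ...]` format comes from the description above, while the `[OBSERVER for ...]` tag is an assumed analogue for observers.

```python
def embedding_text(
    class_name: str,
    code: str,
    plugin_targets: dict[str, list[str]],
    observed_events: dict[str, list[str]],
) -> str:
    """Prefix a chunk's code with framework tags before it is embedded.

    plugin_targets maps a plugin class to the classes it intercepts;
    observed_events maps an observer class to the events it listens to.
    """
    tags = [f"[PLUGIN for {target}]" for target in plugin_targets.get(class_name, [])]
    tags += [f"[OBSERVER for {event}]" for event in observed_events.get(class_name, [])]
    return "\n".join(tags + [code])
```

Because the tag is plain text inside the embedded string, it needs no special handling at query time: a natural-language query about plugins simply lands closer to tagged chunks.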

4. Generates embeddings and stores in pgvector: each chunk becomes a 384-dimensional vector via BGE-small. This is an intentional trade-off: BGE-small is a general-purpose text embedding model, not a code-specific one. Models like jina-embeddings-v2-base-code or voyage-code-3 would likely perform better on code search. But BGE-small runs locally on CPU with zero API cost, which is critical for a zero-budget MVP. The architecture supports swapping models with one config change, so upgrading is straightforward when budget allows, although the codebase has to be re-indexed after a swap because vectors from different models are not comparable.
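To make the vector search concrete, here is what a nearest-neighbour lookup computes, written in plain Python. In the real system pgvector does this inside PostgreSQL (its `<=>` operator returns cosine distance, which is 1 minus cosine similarity); `cosine_similarity` and `top_k` are hypothetical helpers for illustration only, and real vectors would have 384 dimensions rather than 3.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k (chunk_id, similarity) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A brute-force scan like this is linear in the number of chunks; an HNSW index trades a little accuracy for much faster lookups on large indexes.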

Multi-Strategy Retrieval

Simple RAG uses one search query. We use three:

1. Code similarity: find chunks with similar code structure
2. Name matching: extract class/method names from the reviewed file, search for them directly
3. Dependency matching: extract use statements, find code that uses the same interfaces

Results are deduplicated and sorted by similarity. This catches context that a single strategy would miss — a class name search finds the exact interface implementation, while a code similarity search finds structurally similar methods.
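The merge step described above can be sketched as follows. `Hit` and `merge_hits` are hypothetical names, and keeping the highest-scoring copy of a duplicated chunk is an assumption about how ties between strategies are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    similarity: float
    strategy: str  # which search produced this hit


def merge_hits(*result_sets: list[Hit], limit: int = 5) -> list[Hit]:
    """Deduplicate hits from several strategies and sort by similarity, best first."""
    best: dict[str, Hit] = {}
    for hits in result_sets:
        for hit in hits:
            current = best.get(hit.chunk_id)
            # Keep only the highest-scoring occurrence of each chunk.
            if current is None or hit.similarity > current.similarity:
                best[hit.chunk_id] = hit
    return sorted(best.values(), key=lambda h: h.similarity, reverse=True)[:limit]
```

Recording the `strategy` on each hit makes it easy to see afterwards which search contributed the context that ended up in the prompt.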

The Results

I tested on a PHP file with 13 known issues (SQL injection, architecture violations, missing error handling, etc.) — issues I identified manually as a senior Magento developer. This hand-curated list serves as ground truth for evaluation.

Metric	Simple Mode	RAG Mode
Issues detected (recall)	54% (7/13)	69% (9/13)
Project-specific references	1	10
Input tokens	637	1,495
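The recall figures in the table follow directly from the raw counts. A small sketch of the arithmetic (`recall` is a hypothetical helper, and the counts are the ones reported above):

```python
def recall(found: int, total: int) -> float:
    """Fraction of known issues that the reviewer detected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


simple = recall(7, 13)
rag = recall(9, 13)
gain_pp = (rag - simple) * 100  # difference in percentage points

print(f"simple={simple:.0%} rag={rag:.0%} gain={gain_pp:.0f} pp")
# prints: simple=54% rag=69% gain=15 pp
```

With 13 issues in total, each issue is worth about 7.7 percentage points, so the gap corresponds to two additional detections.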

RAG mode's recall was 15 percentage points higher: two more of the 13 known issues, on a single test file. But the real difference is qualitative.

Simple mode says:

"Add type hints to constructor parameters."

RAG mode says:

"Add type hints to constructor parameters — consistent with Model\OrderProcessor.php and Observer\OrderPlaceAfter.php from your project."

RAG mode also found issues that Simple mode missed entirely — like the method always returning true regardless of outcome, or the lack of error handling around repository calls. These are the kinds of issues that only become visible when you see how the rest of the project handles similar cases.

What I Learned

1. Context relevance is everything. When I tested RAG with an unrelated module as context (a "customer hobby" module for an "order processing" review), the results were worse than simple mode. The LLM got confused by irrelevant code. When the context was from a related module (also about orders), results improved dramatically. This confirms: retrieval quality determines review quality. Garbage context in → garbage review out.

2. LLMs don't return clean JSON. Even with explicit instructions to "return ONLY valid JSON", Gemini would wrap the output in markdown fences, escape backslashes in PHP namespaces inconsistently (an unescaped \Magento\Sales contains \M and \S, which are not valid JSON escape sequences), and sometimes swap field values (putting a category in the severity field). I built a character-by-character JSON fixer and a fallback field mapping for misplaced values.

A note on this: Gemini does offer response_schema (Structured Outputs), which enforces valid JSON at the token generation level. I chose prompt-based JSON instead for one reason: provider agnosticism. The same prompt works with Gemini, Claude, and OpenAI without changes, whereas response_schema is a Gemini-specific parameter. For a production system targeting one provider, Structured Outputs would reduce parsing issues. For a multi-provider architecture, defensive parsing is the more portable approach.
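The defensive parsing described above could look roughly like this. It is a simplified sketch, not the project's fixer: `parse_llm_json`, `normalize_issue`, and the `SEVERITIES` values are assumptions, and the escape repair is a heuristic (a namespace segment that starts with `n` or `t` would still be read as a newline or tab).

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block
FENCE_RE = re.compile(rf"^{FENCE}[A-Za-z]*[ \t]*\n(.*?)\n{FENCE}$", re.DOTALL)
# A backslash followed by either a \uXXXX sequence or any single character.
ESCAPE_RE = re.compile(r"\\(u[0-9a-fA-F]{4}|.)", re.DOTALL)
VALID_SINGLE_ESCAPES = set('"\\/bfnrt')
SEVERITIES = {"critical", "high", "medium", "low"}  # assumed vocabulary


def _fix_escape(match: re.Match) -> str:
    """Keep valid JSON escapes; double the backslash of invalid ones such as \\M."""
    escaped = match.group(1)
    if len(escaped) == 5 or escaped in VALID_SINGLE_ESCAPES:
        return match.group(0)
    return "\\\\" + escaped


def parse_llm_json(raw: str):
    """Strip a markdown fence if present, then parse, repairing bad escapes if needed."""
    text = raw.strip()
    fenced = FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)  # already valid: leave it untouched
    except json.JSONDecodeError:
        return json.loads(ESCAPE_RE.sub(_fix_escape, text))


def normalize_issue(issue: dict) -> dict:
    """Swap severity and category back if the model put them in each other's field."""
    severity, category = issue.get("severity"), issue.get("category")
    if severity not in SEVERITIES and category in SEVERITIES:
        return {**issue, "severity": category, "category": severity}
    return issue
```

Parsing strictly first and repairing only on failure means well-formed responses are never altered by the heuristics.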

3. Embeddings find similar code, not bugs. Searching for "SQL injection vulnerability" returned the actual vulnerable function as the 3rd result, not the 1st. Embeddings measure text similarity, not security analysis. That's why you need both retrieval (find relevant code) AND generation (analyze it with LLM). Each alone is weak; together they're strong. Using code-specific embedding models (Voyage Code, Jina Code) instead of the general-purpose BGE-small would likely improve retrieval quality — that's a planned upgrade.

4. Free tiers are enough for building. The entire project runs on Google Gemini free tier (1,500 requests/day), local embeddings (BGE-small, no API cost), and PostgreSQL + pgvector in Docker. Total spend: $0. You don't need an API budget to build serious AI applications — but you do need architecture that lets you upgrade when budget appears.

Tech Stack

Python 3.12, FastAPI, Typer + Rich (CLI)
PostgreSQL 16 + pgvector (vector storage with HNSW index)
tree-sitter (AST-based PHP parsing)
sentence-transformers / BGE-small (local embeddings, swappable)
Google Gemini (LLM, swappable via provider abstraction)
GitHub Actions (CI)

What's Next

Code-specific embeddings — replacing BGE-small with Voyage Code or Jina Code for better retrieval
Hybrid search — combining BM25 (keyword) with embedding (semantic) search
Structured Outputs — comparing Gemini's response_schema vs prompt-based JSON
GitHub App integration — auto-review on pull requests
Multi-agent architecture — separate agents for security, performance, and architecture analysis

The project is open source: github.com/Aquarvin/mage-audit

I'm a senior PHP/Magento developer transitioning into AI Engineering. This project is both a real tool and a learning vehicle — every architectural decision, experiment, and dead end is documented in the repo. If you're hiring AI Engineers or interested in discussing RAG for code analysis — reach out on LinkedIn or Telegram.
