Mahima Thacker

Posted on May 27

RAG for Codebases Is Harder Than It Looks

#ai #rag #python #devtool

Building RepoChat, an AI tool that explains GitHub repos

I built a small AI tool called RepoChat.

The idea is simple:

Paste a GitHub repo. Ask questions. Get answers from the codebase.

Something like:

What does this project do?
How is the backend structured?
Where is authentication handled?
How do I run this locally?
Which files should I read first?

I did not want to build another chatbot. I wanted to build something useful for developers.

Because every developer has faced this problem:

You open a new repo.
There are 50 files.
The README is either missing, outdated, or too high-level.
You don’t know where to start.

So I wanted to see if RAG could help make codebase onboarding faster.

What RepoChat does

RepoChat takes a GitHub repository and turns it into something you can ask questions about.

The first version is intentionally small.

It has two main flows.

Indexing happens once for a repo:

GitHub repo
files
filtered files
chunks
OpenAI embeddings
Chroma vector DB

Querying happens every time someone asks a question:

question
embed question
search Chroma
top-K chunks
Claude
answer with sources

This split helped me think about the system more clearly.

The indexing pipeline prepares the repo.

The query pipeline uses that indexed repo to answer questions.

The basic architecture

The first version has a frontend and a backend.

The frontend is simple:

repo URL input
question input
answer section
sources section

The backend does the heavier work:

fetch repo files
filter useful files
chunk text/code
create embeddings
store chunks
retrieve relevant chunks
call the LLM

I used this kind of structure:

repochat-ai/
  apps/
    web/
    api/
  data/
    chroma/
  README.md

The stack:

Frontend: Next.js, React, Tailwind
Backend: FastAPI
Chunking: LangChain text splitters
Embeddings: OpenAI text-embedding-3-small
Vector DB: Chroma
LLM: Anthropic Claude Sonnet 4.6
Repo data: GitHub API

I used two providers for different jobs.

OpenAI handles embeddings because its embedding models are simple and reliable for this use case.

Claude handles generation because I wanted stronger reasoning and clearer explanations for codebase questions.

This felt better than forcing one provider to do everything.

Step 1: Fetching the repo

The first problem was getting the right files.

At first, I thought:

Just fetch the repo and send everything to the AI.

That sounds simple, but it breaks quickly.

Repos contain a lot of files that are not useful for understanding the project:

node_modules/
.git/
dist/
build/
lock files
generated files
images
large JSON files

So I had to filter files.

For the first version, I focused on files like:

README / readme files
.md
.py
.ts
.tsx
.js
.jsx
.json

This already made the answers better.
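A filter along these lines could look like the sketch below. It is an illustration, not RepoChat's exact code: the lock file names and the size cap are assumptions, and the extension list is the one above.

```python
from pathlib import PurePosixPath

# Assumed values for illustration; tune them per repo.
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}
ALLOWED_EXTENSIONS = {".md", ".py", ".ts", ".tsx", ".js", ".jsx", ".json"}
MAX_BYTES = 200_000  # hypothetical cap that drops very large JSON and generated files


def should_index(path: str, size_bytes: int) -> bool:
    """Return True if a repo file looks useful enough to chunk and embed."""
    p = PurePosixPath(path)
    # Skip anything that lives under a dependency, VCS, or build directory.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    if p.name in SKIP_NAMES or size_bytes > MAX_BYTES:
        return False
    # README files are kept even when they have no extension.
    if p.name.lower().startswith("readme"):
        return True
    return p.suffix.lower() in ALLOWED_EXTENSIONS
```

With these rules, apps/api/auth.py passes, while node_modules/react/index.js, package-lock.json, and a 5 MB JSON dump are all dropped before anything is embedded.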

One thing I learned here:

Good RAG starts before embeddings. It starts with choosing what data should enter the pipeline.

If you put garbage into the vector database, retrieval will return garbage too.

Step 2: Chunking code is not the same as chunking docs

This was the first part that felt harder than expected.

Most RAG tutorials use normal text:

paragraph
paragraph
paragraph

But code is different.

A code file has:

imports
functions
classes
comments
config
repeated names
small pieces that only make sense together

If chunks are too small, the model loses context.

If chunks are too large, retrieval becomes noisy.

Example:

function getUser() {
  ...
}

This function alone may not be enough.

The useful context may include:

import { db } from "./db"
import { users } from "./schema"

function getUser() {
  ...
}

So I had to think more carefully about chunk size and metadata.

For every chunk, I kept metadata such as the file path it came from.

The file path is very important.

A chunk from:

apps/api/auth.py

means something different from:

apps/web/components/Login.tsx

Even if both mention “user” or “auth”.

The file extension is not stored separately, but it is still available through the path.
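To make the idea concrete, here is a minimal standard-library sketch of chunking with overlap while keeping the path on every chunk. RepoChat itself uses LangChain text splitters, so treat this as a stand-in: the Chunk class, the line-based windows, and the default sizes are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    path: str   # file path kept as metadata, e.g. "apps/api/auth.py"
    index: int  # position of the chunk within its file


def chunk_file(path: str, source: str, max_lines: int = 40, overlap: int = 5) -> list[Chunk]:
    """Split a file into overlapping line windows, keeping the path on every chunk."""
    if not 0 <= overlap < max_lines:
        raise ValueError("overlap must be >= 0 and smaller than max_lines")
    lines = source.splitlines()
    step = max_lines - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines]
        chunks.append(Chunk(text="\n".join(window), path=path, index=len(chunks)))
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the file
    return chunks
```

The overlap is there so that a function sitting on a window boundary still appears in one piece in at least one chunk, as long as it is shorter than the overlap.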

Step 3: Embeddings and retrieval

Once the files were chunked, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in Chroma.

When a user asks a question, I embed the question too.

Then the system searches Chroma for chunks that are close to that question.

So if someone asks:

Where is authentication handled?

the system may find related chunks even if the exact word “authentication” is not used everywhere.

It can still find related words like:

auth
login
session
token
middleware
jwt
user

This is where RAG becomes useful.

But retrieval is not magic.

Sometimes it retrieves:

the README instead of the actual code
a frontend file when the backend file is more useful
a config file because it has matching words
a chunk that mentions the right term but does not answer the question

That was a good reminder:

RAG is not just “put data in vector DB and ask questions.” Retrieval quality matters a lot.
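"Close to the question" comes down to a similarity score between vectors. Chroma runs that search itself, so the sketch below only illustrates the idea. It uses cosine similarity, which is one common choice, on made-up 3-dimensional vectors; real embeddings have far more dimensions, and the paths and numbers here are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 3):
    """indexed holds (path, vector) pairs; return the k closest paths with their scores."""
    scored = [(cosine_similarity(query_vec, vec), path) for path, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(path, score) for score, path in scored[:k]]


# Toy vectors standing in for real embeddings (values are made up).
indexed = [
    ("apps/api/middleware/auth.ts", [0.9, 0.1, 0.0]),
    ("apps/web/components/Login.tsx", [0.7, 0.6, 0.1]),
    ("README.md", [0.2, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "Where is authentication handled?"
results = top_k(query, indexed, k=2)
```

In this toy example the middleware file ranks first and the login component second. The score only measures closeness, which is why a chunk can score well and still not answer the question.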

Step 4: Asking questions

The Q&A flow looks like this:

User asks a question:

embed the question
retrieve top relevant chunks
send chunks + question to Claude
generate answer
show answer with source files

I wanted answers to include sources because without sources, it is hard to trust the output.

For developer tools, source references are not optional.

If the AI says:

Auth is handled in middleware.

I want to know:

Which file?
Which function?
Where should I look?

So the answer should include something like:

Sources:

- apps/api/middleware/auth.ts
- apps/api/routes/users.ts

This makes the tool much more useful.
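One simple way to produce that list is to collect the paths of the retrieved chunks and remove duplicates while keeping retrieval order. This is a sketch, and it assumes each retrieved chunk is a dict with a "path" key, which is a hypothetical shape.

```python
def format_answer(answer: str, retrieved: list[dict]) -> str:
    """Append a de-duplicated list of source files, in retrieval order."""
    # dict.fromkeys keeps the first occurrence of each path and preserves order.
    sources = list(dict.fromkeys(chunk["path"] for chunk in retrieved))
    if not sources:
        return answer
    lines = [answer, "", "Sources:", ""] + [f"- {path}" for path in sources]
    return "\n".join(lines)
```

Because the sources come from the retrieved chunks rather than from the model's text, the list cannot contain a file that was never retrieved.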

What broke or felt messy

This was the most useful part of the build.

A few things were harder than expected.

1. Large repos are noisy

Small repos are easy.

Large repos need better filtering.

A real repo may contain:

docs
examples
tests
scripts
generated files
frontend
backend
infra

If everything is indexed equally, answers become messy.

A better version should rank files based on importance.

For example:

README.md
package.json
main entry files
routes
config files
src/
docs/

should probably matter more than random test snapshots.
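A rough version of that ranking can be a small scoring function over the path. The weights and the entry file names below are assumptions chosen for illustration, and the sketch covers only some of the examples above.

```python
from pathlib import PurePosixPath


def file_priority(path: str) -> int:
    """Return a rough importance score; higher means more important."""
    p = PurePosixPath(path)
    name = p.name.lower()
    parts = [part.lower() for part in p.parts[:-1]]
    if "__snapshots__" in parts or name.endswith(".snap"):
        return 0  # test snapshots rarely explain anything
    if name.startswith("readme"):
        return 5
    if name == "package.json":
        return 4
    # Hypothetical entry file names, plus anything under a routes/ directory.
    if p.stem.lower() in {"main", "index", "app", "server"} or "routes" in parts:
        return 3
    if "src" in parts or "docs" in parts:
        return 2
    return 1
```

The score could be used to decide indexing order, or as a tie-breaker when two chunks have similar retrieval scores.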

2. README is useful but not enough

README files are helpful for high-level questions.

But if you ask:

How does auth work?

the README is usually not enough.

You need code.

This is where code-aware retrieval becomes important.

3. File paths matter a lot

At first, I treated chunks mostly as text.

But for codebases, metadata is part of the answer.

A chunk from:

backend/routes/payment.ts

is not just text.

It tells you:

This is backend code
This is route-level logic
This likely handles payment APIs

So the file path helps both retrieval and explanation.
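One way to let the path help retrieval is to put it into the text that gets embedded, so words like "backend", "routes", and "payment" become part of the vector. This is a sketch of that technique, not necessarily how RepoChat does it; the function names and the "File:" header are assumptions.

```python
from pathlib import PurePosixPath


def path_hints(path: str) -> list[str]:
    """Break a path into its directory names plus the file name without extension."""
    p = PurePosixPath(path)
    return [*p.parts[:-1], p.stem]


def text_for_embedding(path: str, text: str) -> str:
    """Prefix a chunk with its path so the path influences the embedding too."""
    return f"File: {path}\nHints: {' '.join(path_hints(path))}\n\n{text}"
```

For backend/routes/payment.ts the hints are backend, routes, and payment, which are the same three signals a human reads from the path.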

4. The model needs strict instructions

If the model does not know something, it should say so.

For example:

I could not find authentication logic in the indexed files.

is much better than:

The app probably uses JWT authentication.

For a developer tool, guessing is dangerous.

So the prompt rule was intentionally strict:

Answer using only the provided repo context.
If the answer is not present in the context, say that clearly.
Always mention the source files used.
Do not guess implementation details.
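Those rules still have to be combined with the retrieved chunks and the question. A minimal sketch of that assembly follows; the layout, the "File:" labels, and the chunk shape are assumptions, and the actual call to Claude is left out.

```python
RULES = """Answer using only the provided repo context.
If the answer is not present in the context, say that clearly.
Always mention the source files used.
Do not guess implementation details."""


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine the rules, the retrieved chunks (labelled by path), and the question."""
    context = "\n\n".join(f"File: {c['path']}\n{c['text']}" for c in chunks)
    if not context:
        # Make an empty retrieval explicit so the model can say it found nothing.
        context = "(no chunks were retrieved)"
    return f"{RULES}\n\nRepo context:\n{context}\n\nQuestion: {question}"
```

Labelling every chunk with its path is what makes the "mention the source files" rule possible to follow.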

What I learned about RAG

Building RepoChat made RAG feel much more real to me.

Before building it, RAG sounded simple:

embed docs
retrieve docs
ask LLM

After building it, I see it more like this:

choose the right data
clean the data
chunk it properly
store useful metadata
retrieve the right chunks
control the prompt
show sources
test bad answers

The retrieval part is only one piece.

The developer experience around it matters just as much.

A developer should not care about embeddings or vector DBs.

They should only feel:

“I understand this repo faster now.”

What I would improve next

The first version is useful, but there are many things I would improve.

1. Better code parsing

Instead of splitting files only by text size, I want to split code by structure:

functions
classes
exports
API routes
components

Tools like Tree-sitter could help with this.
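To show what structural splitting means, here is a small sketch that uses Python's built-in ast module instead of Tree-sitter. It only handles Python source and only top-level functions and classes; Tree-sitter would be the way to get the same effect across languages.

```python
import ast


def split_python_by_structure(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with its name and line range."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


example = '''import os

def get_user(user_id):
    return {"id": user_id}

class Session:
    def is_valid(self):
        return True
'''
```

Running split_python_by_structure(example) gives two chunks, get_user and Session, each cut at a real boundary instead of at an arbitrary character count. Imports are not attached to the chunks here, so that part of the context problem is still open.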

2. Repo map

Before answering questions, the app could build a repo map:

frontend
backend
API routes
database
auth
config
tests

This would help the model understand the project layout better.
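A first pass at a repo map could be purely path-based. The keyword rules below are hypothetical and would need tuning per repo; a file can land in more than one area.

```python
import re
from collections import defaultdict

# Hypothetical keyword rules: area name -> path tokens that suggest it.
AREAS = {
    "frontend": {"web", "components", "pages"},
    "backend": {"api", "server", "backend"},
    "API routes": {"routes"},
    "database": {"db", "schema", "migrations", "models"},
    "auth": {"auth", "login", "session"},
    "config": {"config", "settings", "env"},
    "tests": {"test", "tests", "spec"},
}


def build_repo_map(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths under every area whose keywords appear in the path."""
    repo_map = defaultdict(list)
    for path in paths:
        # Split on "/", ".", "_" and "-" so "test_auth.py" yields "test" and "auth".
        tokens = set(re.split(r"[/._-]", path.lower()))
        matched = [area for area, words in AREAS.items() if tokens & words]
        for area in matched or ["other"]:
            repo_map[area].append(path)
    return dict(repo_map)
```

The resulting map could be added to the prompt as a short overview, so the model sees the layout before it sees individual chunks.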

Better source citations

Right now, file-level sources are useful.

But line-level sources would be better.

Example:

apps/api/auth.ts:45-72

That would make answers easier to verify.
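Line-level citations mostly need the chunker to remember where each chunk starts and ends. A minimal sketch, assuming line-based windows (the function names and window size are hypothetical):

```python
from collections.abc import Iterator


def line_windows(source: str, max_lines: int = 40) -> Iterator[tuple[int, int, str]]:
    """Yield (start_line, end_line, text) for consecutive windows; lines are 1-based."""
    lines = source.splitlines()
    for start in range(0, len(lines), max_lines):
        end = min(start + max_lines, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])


def format_citation(path: str, start_line: int, end_line: int) -> str:
    """Format a source reference in the path:start-end form shown above."""
    return f"{path}:{start_line}-{end_line}"
```

Storing start_line and end_line next to the path in the chunk metadata would be enough to print citations in this form.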

3. Evaluation questions

Every repo could have test questions like:

How do I run this project?
Where is auth handled?
Where are API routes defined?
What database does it use?

Then I can test whether RepoChat answers correctly.

This is where evals become useful.
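A very small eval can start with retrieval alone: for each question, check whether the file you expect shows up in the retrieved sources. The cases and the fake retriever below are made up for illustration, and this does not judge the quality of the final answer.

```python
from typing import Callable

# Each case pairs a question with a file expected among the sources (hypothetical repo).
EVAL_CASES = [
    ("Where is auth handled?", "apps/api/middleware/auth.ts"),
    ("Where are API routes defined?", "apps/api/routes/users.ts"),
    ("How do I run this project?", "README.md"),
]


def run_evals(retrieve: Callable[[str], list[str]], cases=EVAL_CASES) -> float:
    """Return the fraction of questions whose expected file appears in the retrieved sources."""
    hits = 0
    for question, expected_path in cases:
        if expected_path in retrieve(question):
            hits += 1
        else:
            print(f"MISS: {question!r} did not retrieve {expected_path}")
    return hits / len(cases) if cases else 0.0


def fake_retrieve(question: str) -> list[str]:
    """Stand-in for the real retrieval step; always returns the same two files."""
    return ["README.md", "apps/api/middleware/auth.ts"]
```

With the fake retriever, two of the three cases pass and the routes question is reported as a miss. Swapping in the real retrieval function turns this into a regression check for changes to chunking or filtering.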

4. MCP integration

Later, RepoChat could expose repo search as an MCP tool.

Then an agent could ask:

search_codebase("where is auth handled?")

and use RepoChat as a codebase understanding tool.
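The tool itself can be a plain function that takes a query and returns structured results. The sketch below shows only that shape: the in-memory index is made up, keyword overlap stands in for the real vector search, and registering the function with an MCP server is not shown.

```python
import json
import re

# Tiny in-memory index standing in for the real vector store (contents are made up).
INDEX = [
    {"path": "apps/api/middleware/auth.ts", "text": "verify jwt token and attach user to request"},
    {"path": "apps/api/routes/users.ts", "text": "list users and update user profile"},
    {"path": "README.md", "text": "how to install and run the project locally"},
]


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_codebase(query: str, k: int = 2) -> str:
    """Return the k best-matching files as a JSON string a tool caller can parse."""
    words = tokenize(query)
    scored = []
    for chunk in INDEX:
        score = len(words & tokenize(chunk["path"] + " " + chunk["text"]))
        if score:
            scored.append((score, chunk["path"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return json.dumps([{"path": path, "score": score} for score, path in scored[:k]])
```

Returning JSON with paths keeps the trust property from earlier: the agent gets sources it can open, not just a summary.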

Final thoughts

RepoChat started as a small demo, but it taught me a lot.

The biggest lesson:

RAG is only useful when the developer can trust the answer.

For codebases, trust comes from:

good retrieval
useful chunks
file metadata
clear sources
honest “I don’t know” answers

I still want to improve RepoChat, but even this first version made one thing clear:

AI tools for developers should not try to replace understanding.

They should help developers reach understanding faster.

That is the part I find exciting.

Code

GitHub: https://github.com/mahimathacker/repochat-ai
Live demo: https://youtu.be/kSgZSqH6iXk

I’m still improving it, especially around better code parsing, line-level citations, repo maps, evals, and MCP support.

Top comments (3)

Harjot Singh • May 31

RAG-for-codebases-is-harder is exactly right, and the root cause is that code isn't prose, so the prose-RAG playbook (chunk by tokens, embed, retrieve top-k) actively fights you. The reason: code's meaning lives in structure that chunking destroys. A function is useless without the types it takes, the imports it relies on, and the callers that constrain it, but naive chunking slices a file mid-function and embeds fragments that have lost their context. So you retrieve something textually similar to the query but semantically incomplete. The questions you list (where is auth handled, which files first) are graph questions, not similarity questions: the answer is a path through the dependency structure, which is exactly what vector search can't follow. This is why the better codebase tools lean on the AST and the call graph, chunk along structural boundaries (whole functions, with their signatures), and treat retrieval as traversal, not just nearest-neighbor. Preserve the structure the question depends on. That match-retrieval-to-the-code's-real-shape thinking is core to how I approach this in Moonshift. Did you end up chunking on AST boundaries, or did you find a reranker was enough to rescue token-based chunks?

Eugene Maiorov • May 27

Great article! I really agree with what you said about code chunking. It is so different from normal text paragraphs because functions and file paths matter a lot. A lot of people forget that if you put messy data into a vector database, you get messy answers back.

Since you mentioned wanting better code parsing and chunking for RepoChat, you should check out vectoralix.com. It's a tool built exactly for cleaning, parsing, and chunking complex data (like codebases) before it goes into a vector database. It might save you a lot of time on your next updates! Keep up the great work.

Mahima Thacker • May 29

Thank you! Completely agree, chunking code is a different problem from chunking normal text. File paths, imports, function boundaries, and repo structure matter a lot.

I’ll check out Vectoralix too. For now, I’m trying to keep RepoChat’s pipeline simple so I can understand the tradeoffs myself, but better parsing/chunking is definitely one of the next things I want to improve.