
Mahima Thacker

RAG for Codebases Is Harder Than It Looks

Building RepoChat, an AI tool that explains GitHub repos

I built a small AI tool called RepoChat.

The idea is simple:

Paste a GitHub repo. Ask questions. Get answers from the codebase.
Something like:

What does this project do?
How is the backend structured?
Where is authentication handled?
How do I run this locally?
Which files should I read first?

I did not want to build another chatbot. I wanted to build something useful for developers.

Because every developer has faced this problem:

You open a new repo.
There are 50 files.
The README is either missing, outdated, or too high-level.
You don’t know where to start.

So I wanted to see if RAG could help make codebase onboarding faster.

What RepoChat does

RepoChat takes a GitHub repository and turns it into something you can ask questions about.

The first version is intentionally small.

It has two main flows.

Indexing happens once for a repo:

  • GitHub repo
  • files
  • filtered files
  • chunks
  • OpenAI embeddings
  • Chroma vector DB

Querying happens every time someone asks a question:

  • question
  • embed question
  • search Chroma
  • top-K chunks
  • Claude
  • answer with sources

This split helped me think about the system more clearly.

The indexing pipeline prepares the repo.

The query pipeline uses that indexed repo to answer questions.

The basic architecture

The first version has a frontend and a backend.

The frontend is simple:

repo URL input
question input
answer section
sources section

The backend does the heavier work:

fetch repo files
filter useful files
chunk text/code
create embeddings
store chunks
retrieve relevant chunks
call the LLM

I used this kind of structure:

repochat-ai/
  apps/
    web/
    api/
  data/
    chroma/
  README.md

The stack:

Frontend: Next.js, React, Tailwind
Backend: FastAPI
Chunking: LangChain text splitters
Embeddings: OpenAI text-embedding-3-small
Vector DB: Chroma
LLM: Anthropic Claude Sonnet 4.6
Repo data: GitHub API

I used two providers for different jobs.

OpenAI handles embeddings because its embedding models are simple and reliable for this use case.

Claude handles generation because I wanted stronger reasoning and clearer explanations for codebase questions.

This felt better than forcing one provider to do everything.

Step 1: Fetching the repo

The first problem was getting the right files.

At first, I thought:

Just fetch the repo and send everything to the AI.

That sounds simple, but it breaks quickly.

Repos contain a lot of files that are not useful for understanding the project:

node_modules/
.git/
dist/
build/
lock files
generated files
images
large JSON files

So I had to filter files.

For the first version, I focused on files like:

README / readme files
.md
.py
.ts
.tsx
.js
.jsx
.json

This already made the answers better.
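A minimal sketch of this kind of filter, using only the Python standard library. The skip lists, the size cutoff, and the `should_index` name are assumptions made for illustration, not RepoChat's actual rules.

```python
from pathlib import PurePosixPath

# Hypothetical filter rules based on the lists above; tune them per repo.
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml"}
KEEP_EXTENSIONS = {".md", ".py", ".ts", ".tsx", ".js", ".jsx", ".json"}
MAX_BYTES = 100_000  # assumed cutoff so very large JSON files are dropped


def should_index(path: str, size_bytes: int) -> bool:
    """Return True if a repo file looks worth chunking and embedding."""
    p = PurePosixPath(path)
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    if p.name in SKIP_NAMES:
        return False
    if size_bytes > MAX_BYTES:
        return False
    # README files are kept even without a recognised extension.
    if p.name.lower().startswith("readme"):
        return True
    return p.suffix.lower() in KEEP_EXTENSIONS


files = [
    ("README.md", 2_000),
    ("apps/api/auth.py", 4_500),
    ("node_modules/react/index.js", 1_200),
    ("package-lock.json", 900_000),
    ("apps/web/public/logo.png", 30_000),
    ("data/fixtures/big.json", 5_000_000),
]
kept = [path for path, size in files if should_index(path, size)]
print(kept)
```

Only the README and the backend file survive here. Everything else is dropped before it can reach the chunker.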

One thing I learned here:

Good RAG starts before embeddings. It starts with choosing what data should enter the pipeline.

If you put garbage into the vector database, retrieval will return garbage too.

Step 2: Chunking code is not the same as chunking docs

This was the first part that felt harder than expected.

Most RAG tutorials use normal text:

paragraph
paragraph
paragraph

But code is different.

A code file has:

imports
functions
classes
comments
config
repeated names
small pieces that only make sense together

If chunks are too small, the model loses context.

If chunks are too large, retrieval becomes noisy.

Example:

function getUser() {
  ...
}

This function alone may not be enough.

The useful context may include:

import { db } from "./db"
import { users } from "./schema"

function getUser() {
  ...
}


So I had to think more carefully about chunk size and metadata.

For every chunk, I kept metadata alongside the text.

The file path is a very important part of that metadata.

A chunk from:

apps/api/auth.py

means something different from:

apps/web/components/Login.tsx

Even if both mention “user” or “auth”.

The file extension is not stored separately, but it is still available through the path.
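To make the idea of "chunk plus metadata" concrete, here is a rough sketch of a line-based splitter with overlap. It is a plain-Python stand-in, not the LangChain splitter RepoChat uses, and the `Chunk` fields, `max_lines`, and `overlap` values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    path: str         # file path, kept as metadata for retrieval and citations
    chunk_index: int  # position of the chunk within its file (hypothetical field)


def chunk_file(path: str, source: str, max_lines: int = 40, overlap: int = 5) -> list[Chunk]:
    """Split a file into overlapping line windows, tagging each with its path."""
    if overlap >= max_lines:
        raise ValueError("overlap must be smaller than max_lines")
    lines = source.splitlines()
    chunks: list[Chunk] = []
    step = max_lines - overlap
    for index, start in enumerate(range(0, len(lines), step)):
        window = lines[start:start + max_lines]
        chunks.append(Chunk(text="\n".join(window), path=path, chunk_index=index))
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the file
    return chunks


source = "\n".join(f"line {n}" for n in range(1, 101))
chunks = chunk_file("apps/api/auth.py", source)
print(len(chunks), chunks[0].path, chunks[1].text.splitlines()[0])
```

The overlap means a function that straddles a window boundary still shows up with some of its surrounding lines in at least one chunk.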

Step 3: Embeddings and retrieval

Once the files were chunked, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in Chroma.

When a user asks a question, I embed the question too.

Then the system searches Chroma for chunks that are close to that question.

So if someone asks:

Where is authentication handled?

the system may find related chunks even if the exact word “authentication” is not used everywhere.

It can still match chunks that use related terms like:

auth
login
session
token
middleware
jwt
user

This is where RAG becomes useful.
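The "close to the question" idea can be sketched without any external service. The example below uses toy 3-dimensional vectors in place of real `text-embedding-3-small` embeddings and a plain list in place of Chroma, so it only illustrates the ranking step, not the actual storage or API calls.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (path, score) pairs most similar to the query vector."""
    scored = [(path, cosine_similarity(query_vec, vec)) for path, vec in indexed]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors: pretend the dimensions mean (auth-ness, UI-ness, config-ness).
indexed = [
    ("apps/api/middleware/auth.ts", [0.9, 0.1, 0.1]),
    ("apps/web/components/Login.tsx", [0.6, 0.8, 0.0]),
    ("package.json", [0.0, 0.1, 0.9]),
]
question_vec = [1.0, 0.2, 0.0]  # stands in for the embedded question
for path, score in top_k(question_vec, indexed):
    print(f"{score:.2f}  {path}")
```

The middleware file ranks first and the login component second, even though neither vector is an exact match for the question.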

But retrieval is not magic.

Sometimes it retrieves:

the README instead of the actual code
a frontend file when the backend file is more useful
a config file because it has matching words
a chunk that mentions the right term but does not answer the question

That was a good reminder:

RAG is not just “put data in vector DB and ask questions.” Retrieval quality matters a lot.

Step 4: Asking questions

The Q&A flow looks like this:

User asks a question:

  • embed the question
  • retrieve top relevant chunks
  • send chunks + question to Claude
  • generate answer
  • show answer with source files

I wanted answers to include sources because without sources, it is hard to trust the output.

For developer tools, source references are not optional.

If the AI says:

Auth is handled in middleware.

I want to know:

Which file?
Which function?
Where should I look?

So the answer should include something like:

Sources:

- apps/api/middleware/auth.ts
- apps/api/routes/users.ts

This makes the tool much more useful.
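One simple way to produce a source list like the one above is to collect the file paths of the retrieved chunks, drop duplicates, and keep the retrieval order. The `format_sources` helper is a hypothetical name used only for this sketch.

```python
def format_sources(retrieved_paths: list[str]) -> str:
    """Build a 'Sources:' block with each file listed once, in rank order."""
    unique = list(dict.fromkeys(retrieved_paths))  # dedupe, keep first-seen order
    lines = ["Sources:", ""] + [f"- {path}" for path in unique]
    return "\n".join(lines)


retrieved = [
    "apps/api/middleware/auth.ts",
    "apps/api/routes/users.ts",
    "apps/api/middleware/auth.ts",  # second chunk from the same file
]
print(format_sources(retrieved))
```

Building the list from retrieval metadata, rather than asking the model to recall file names, keeps the cited paths tied to chunks that were actually sent as context.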

What broke or felt messy

This was the most useful part of the build.

A few things were harder than expected.

1. Large repos are noisy

Small repos are easy.

Large repos need better filtering.

A real repo may contain:

docs
examples
tests
scripts
generated files
frontend
backend
infra

If everything is indexed equally, answers become messy.

A better version should rank files based on importance.

For example:

README.md
package.json
main entry files
routes
config files
src/
docs/

should probably matter more than random test snapshots.
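A rough sketch of what such a ranking could look like. The scores, the `ENTRY_STEMS` set, and the `file_priority` name are all assumptions; a real version would need tuning against actual repos.

```python
from pathlib import PurePosixPath

ENTRY_STEMS = {"main", "index", "app", "server"}  # assumed entry-file names


def file_priority(path: str) -> int:
    """Score a file path; higher means more useful for understanding the repo."""
    p = PurePosixPath(path)
    name = p.name.lower()
    dirs = [part.lower() for part in p.parts[:-1]]
    if "__snapshots__" in dirs or name.endswith(".snap"):
        return 0
    if name.startswith("readme"):
        return 100
    if name == "package.json":
        return 90
    if p.stem.lower() in ENTRY_STEMS:
        return 80
    if "routes" in dirs:
        return 70
    if "config" in name:
        return 60
    if "src" in dirs or "docs" in dirs:
        return 50
    return 30


paths = [
    "src/utils/format.ts",
    "tests/__snapshots__/app.test.ts.snap",
    "README.md",
    "apps/api/routes/users.ts",
    "apps/api/main.py",
    "package.json",
]
ranked = sorted(paths, key=file_priority, reverse=True)
print(ranked)
```

A score like this could be used to decide what gets indexed first, or stored as chunk metadata and used to break ties at query time.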

2. README is useful but not enough

README files are helpful for high-level questions.

But if you ask:

How does auth work?

the README is usually not enough.

You need code.

This is where code-aware retrieval becomes important.

3. File paths matter a lot

At first, I treated chunks mostly as text.

But for codebases, metadata is part of the answer.

A chunk from:

backend/routes/payment.ts

is not just text.

It tells you:

This is backend code
This is route-level logic
This likely handles payment APIs

So the file path helps both retrieval and explanation.
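One possible way to let the path help both retrieval and explanation is to prefix each chunk with a short header before embedding it and before sending it to the model. This is a sketch of that idea, not a description of what RepoChat does; the `LAYER_HINTS` mapping and both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from directory names to human-readable hints.
LAYER_HINTS = {
    "backend": "backend code",
    "api": "backend code",
    "frontend": "frontend code",
    "web": "frontend code",
    "routes": "route-level logic",
    "components": "UI components",
    "middleware": "middleware",
}


def describe_path(path: str) -> list[str]:
    """Collect hints implied by the directories in a file path."""
    hints: list[str] = []
    for part in PurePosixPath(path).parts[:-1]:
        hint = LAYER_HINTS.get(part.lower())
        if hint and hint not in hints:
            hints.append(hint)
    return hints


def with_path_header(path: str, text: str) -> str:
    """Prefix chunk text with its path (and any hints) as a one-line header."""
    header = f"File: {path}"
    hints = describe_path(path)
    if hints:
        header += f" ({', '.join(hints)})"
    return f"{header}\n{text}"


print(with_path_header("backend/routes/payment.ts", "export async function charge() {}"))
```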

4. The model needs strict instructions

If the model does not know something, it should say so.

For example:

I could not find authentication logic in the indexed files.

is much better than:

The app probably uses JWT authentication.

For a developer tool, guessing is dangerous.

So the prompt rules were intentionally strict:

Answer using only the provided repo context.
If the answer is not present in the context, say that clearly.
Always mention the source files used.
Do not guess implementation details.
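As a sketch of how rules like these might be combined with the retrieved chunks, the helper below assembles one prompt string. The `build_prompt` name, the `[path]` labelling of chunks, and the overall layout are assumptions; the actual call to Claude is left out on purpose.

```python
SYSTEM_RULES = """Answer using only the provided repo context.
If the answer is not present in the context, say that clearly.
Always mention the source files used.
Do not guess implementation details."""


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Combine the rules, retrieved (path, text) chunks, and the question."""
    if chunks:
        context = "\n\n".join(f"[{path}]\n{text}" for path, text in chunks)
    else:
        context = "(no chunks were retrieved)"
    return f"{SYSTEM_RULES}\n\nRepo context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt(
    "Where is authentication handled?",
    [("apps/api/auth.py", "def verify_token(token): ...")],
)
print(prompt)
```

Labelling every chunk with its path gives the model something concrete to cite, and the explicit empty-context marker makes the "say that clearly" rule easier to follow.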

What I learned about RAG

Building RepoChat made RAG feel much more real to me.

Before building it, RAG sounded simple:

embed docs
retrieve docs
ask LLM

After building it, I see it more like this:

choose the right data
clean the data
chunk it properly
store useful metadata
retrieve the right chunks
control the prompt
show sources
test bad answers

The retrieval part is only one piece.

The developer experience around it matters just as much.

A developer should not care about embeddings or vector DBs.

They should only feel:

“I understand this repo faster now.”

What I would improve next

The first version is useful, but there are many things I would improve.

1. Better code parsing

Instead of splitting files only by text size, I want to split code by structure:

functions
classes
exports
API routes
components

Tools like Tree-sitter could help with this.
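To show what structure-based splitting means, here is a small sketch that uses Python's built-in `ast` module as a stand-in for Tree-sitter. It only handles Python files and only top-level definitions, and the chunk dictionary shape is an assumption for illustration.

```python
import ast


def split_by_definition(path: str, source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with its line range."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno  # 1-based, inclusive
            chunks.append({
                "path": path,
                "name": node.name,
                "lines": f"{start}-{end}",
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


source = '''import os

def get_user(user_id):
    return {"id": user_id}

class AuthService:
    def login(self, token):
        return token is not None
'''
for chunk in split_by_definition("apps/api/auth.py", source):
    print(f'{chunk["path"]}:{chunk["lines"]}  {chunk["name"]}')
```

Because each chunk carries its own line range, the same metadata could later feed line-level citations.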

2. Repo map

Before answering questions, the app could build a repo map:

frontend
backend
API routes
database
auth
config
tests

This would help the model understand the project layout better.
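A very rough sketch of a repo map built from file paths alone. The keyword rules in `CATEGORY_RULES` are assumptions, the first matching rule wins, and a real version would probably need to look inside files as well.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical keyword rules, checked in order; the first match wins.
CATEGORY_RULES = [
    ("tests", ("tests", "test", "__tests__")),
    ("auth", ("auth",)),
    ("database", ("db", "models", "migrations")),
    ("API routes", ("routes",)),
    ("config", ("config",)),
    ("frontend", ("web", "frontend", "components")),
    ("backend", ("api", "backend", "server")),
]


def categorize(path: str) -> str:
    """Assign a path to a category using directory names and the file stem."""
    p = PurePosixPath(path)
    parts = [part.lower() for part in p.parts]
    stem = p.stem.lower()
    for category, keywords in CATEGORY_RULES:
        if any(k in parts or k == stem for k in keywords):
            return category
    return "other"


def build_repo_map(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by category so the layout can be summarised."""
    repo_map = defaultdict(list)
    for path in sorted(paths):
        repo_map[categorize(path)].append(path)
    return dict(repo_map)


paths = [
    "apps/web/components/Login.tsx",
    "apps/api/auth.py",
    "apps/api/routes/users.py",
    "apps/api/db.py",
    "apps/api/config.py",
    "apps/api/tests/test_auth.py",
]
for category, files in build_repo_map(paths).items():
    print(category, files)
```

A map like this could be added to the prompt as a short summary so the model sees the overall layout before it reads individual chunks.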

3. Better source citations

Right now, file-level sources are useful.

But line-level sources would be better.

Example:

apps/api/auth.ts:45-72

That would make answers easier to verify.

4. Evaluation questions

Every repo could have test questions like:

How do I run this project?
Where is auth handled?
Where are API routes defined?
What database does it use?

Then I can test whether RepoChat answers correctly.

This is where evals become useful.

5. MCP integration

Later, RepoChat could expose repo search as an MCP tool.

Then an agent could ask:

search_codebase("where is auth handled?")

and use RepoChat as a codebase understanding tool.

Final thoughts

RepoChat started as a small demo, but it taught me a lot.

The biggest lesson:

RAG is only useful when the developer can trust the answer.

For codebases, trust comes from:

good retrieval
useful chunks
file metadata
clear sources
honest “I don’t know” answers

I still want to improve RepoChat, but even this first version made one thing clear:

AI tools for developers should not try to replace understanding.

They should help developers reach understanding faster.

That is the part I find exciting.

Code

GitHub: https://github.com/mahimathacker/repochat-ai
Live demo: https://youtu.be/kSgZSqH6iXk

I’m still improving it, especially around better code parsing, line-level citations, repo maps, evals, and MCP support.
