Why I Built This
Every time I join a new codebase, the first few days are the same: open the repo, stare at folders, try to figure out which service does what, read half a file, get interrupted, lose context, start over.
GitHub's built-in search is keyword-only. ChatGPT has never seen your repo. Teammates are busy. Documentation is either missing or out of date.
I wanted a tool that could answer "how does checkout work?" from the actual code — not from training data, not from docs, but from the real source files.
So I built one.
How It Works
The system is built around a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking an LLM to answer from memory, you first retrieve the most relevant code chunks, then ask the LLM to answer using only those chunks.
Ingest flow:
- Clone the GitHub repo locally
- Walk every file and split it into overlapping chunks (~500 tokens, 50-token overlap)
- Convert each chunk to a vector embedding using all-MiniLM-L6-v2 (Sentence Transformers — local, free)
- Store embeddings + metadata in ChromaDB
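The chunking step above can be sketched in a few lines. This is a simplified, word-level approximation (the real pipeline counts model tokens, not list items); the 500/50 numbers mirror the settings from the post.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into overlapping windows.

    Each chunk shares its last `overlap` tokens with the start of the
    next chunk, so context that spans a boundary appears in both.
    """
    step = size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the file
        start += step
    return chunks

# A 1000-token file becomes three windows: 0-500, 450-950, 900-1000.
chunks = chunk_tokens(list(range(1000)))
```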
Query flow:
- Embed the user's question with the same model
- ChromaDB cosine similarity search → top-5 most relevant chunks
- Inject chunks into a LangChain prompt
- LLM generates an answer with source file citations
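Step 2 above boils down to cosine similarity over the stored vectors. ChromaDB handles indexing and persistence, but the core ranking logic is simple enough to hand-roll (this is an illustrative sketch, not ChromaDB's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings: vectors 0 and 2 point roughly the same way as the query.
vecs = [(1.0, 0.1), (0.0, 1.0), (0.9, 0.2), (-1.0, 0.0)]
best = top_k((1.0, 0.0), vecs, k=2)  # -> [0, 2]
```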
The frontend is a Next.js split-pane layout — file tree on the left, chat on the right. The ingest endpoint streams progress via Server-Sent Events so the user sees live updates instead of a spinner.
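The SSE side is mostly wire formatting: each progress update is serialized as a `data: ...` frame followed by a blank line and yielded from a generator. In the real app FastAPI's `StreamingResponse` (with `media_type="text/event-stream"`) would stream that generator to the browser; the event field names below are illustrative, not the app's exact schema.

```python
import json

def sse_frame(payload: dict) -> str:
    """Serialize one Server-Sent Events frame:
    'data: <json>' terminated by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

def ingest_progress(files):
    """Yield one SSE frame per indexed file, then a done marker."""
    for i, path in enumerate(files, 1):
        yield sse_frame({"file": path, "indexed": i, "total": len(files)})
    yield sse_frame({"done": True})

frames = list(ingest_progress(["src/app.py", "src/db.py"]))
```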
```python
# LangChain RAG chain (simplified)
from langchain.chains import RetrievalQA

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,  # keep sources so answers can cite files
)
result = qa_chain({"query": question})
```
The LLM only ever sees 5 chunks — not the whole repo. That keeps answers fast and grounded.
Demo / Results
I indexed the ShopFlow ecommerce demo app — ~847 files across 12 services.
Input: "How does the checkout flow work?"
Output:
The checkout flow works as follows:
1. Cart validation — CartService checks inventory availability (src/services/cart.py:142)
2. Payment processing — PaymentService calls Stripe API with order total (src/services/payment.py:89)
3. Order creation — OrderService writes to PostgreSQL with status PENDING (src/models/order.py:67)
4. Inventory update — InventoryService decrements stock counts (src/services/inventory.py:203)
Index time: ~30 seconds. Answer latency: ~3 seconds. API cost: $0 (self-hosted on AWS EC2).
What Surprised Me
1. FastAPI path parameters break on slashes
Repo names like owner/repo contain a /, which FastAPI treats as a path separator, so the request routed to the wrong endpoint. Fix: declare the parameter as {repo_name:path}. One character. I only found it via a 404 after deploying to production.
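The difference is visible in the regexes Starlette (FastAPI's router) compiles the two declarations to: the default converter matches a single path segment, while the `:path` converter matches across slashes. A standalone sketch (the `/repos/...` route shape here is hypothetical):

```python
import re

# What {repo_name} compiles to: one segment, no slashes allowed.
default_route = re.compile(r"^/repos/(?P<repo_name>[^/]+)$")
# What {repo_name:path} compiles to: slashes pass through.
path_route = re.compile(r"^/repos/(?P<repo_name>.*)$")

url = "/repos/ThinkWithOps/ai-devops-systems-lab"
miss = default_route.match(url)  # None: the extra '/' breaks the match -> 404
hit = path_route.match(url)
```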
2. TypeScript types drift silently from backend SSE events
The backend was emitting file, file_path, and indexed_at fields in the SSE stream that the frontend TypeScript interface didn't declare. No error in local dev — only failed during next build inside Docker. A shared OpenAPI-generated type contract would have caught this at development time.
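One lightweight version of that contract is to declare the event shape once on the backend and validate emitted events against it. A sketch using a TypedDict — the field names come from the post, but `matches_contract` is a hypothetical helper; in practice you would generate the frontend types from the OpenAPI schema instead:

```python
from typing import TypedDict

class IndexEvent(TypedDict):
    """Single source of truth for the SSE event shape."""
    file: str
    file_path: str
    indexed_at: str

def matches_contract(event: dict) -> bool:
    """True when the event carries exactly the declared fields."""
    return set(event) == set(IndexEvent.__annotations__)

ok = matches_contract(
    {"file": "a.py", "file_path": "src/a.py", "indexed_at": "2024-01-01"}
)
drifted = matches_contract({"file": "a.py"})  # missing fields -> drift caught
```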
3. Docker layer caching quietly serves stale code
After pushing a fix, the Docker build was still serving the old code because the COPY . . layer was cached. Run docker-compose build --no-cache whenever a fix doesn't show up after git pull.
4. Chunk overlap matters more than chunk size
I initially had no overlap between chunks. The AI would give incomplete answers for questions that spanned function boundaries — the relevant context was split across two chunks that were never returned together. Adding 50-token overlap fixed most of these cases.
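The failure mode is easy to reproduce from chunk boundaries alone: a span of code that straddles the 500-token line is fully contained in no chunk without overlap, but fits inside a single chunk once neighbors share 50 tokens. (The positions below are illustrative.)

```python
def chunk_ranges(n_tokens, size=500, overlap=50):
    """Return the (start, end) index ranges the chunker would produce."""
    step = size - overlap
    ranges, start = [], 0
    while start < n_tokens:
        ranges.append((start, min(start + size, n_tokens)))
        if start + size >= n_tokens:
            break
        start += step
    return ranges

def fully_contained(span, ranges):
    """True if some single chunk covers the whole span."""
    return any(a <= span[0] and span[1] <= b for a, b in ranges)

func = (495, 505)  # a function straddling the first chunk boundary
split = fully_contained(func, chunk_ranges(1000, overlap=0))   # False
whole = fully_contained(func, chunk_ranges(1000, overlap=50))  # True
```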
Try It
```bash
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
cd ai-devops-systems-lab/projects/02-ai-github-repo-explainer
cp backend/.env.example backend/.env
# Add your GROQ_API_KEY to backend/.env, or leave it blank to use Ollama
docker-compose up -d
```
Open http://<your-ec2-ip>:3000, paste any public GitHub URL, and start asking questions.
🔗 GitHub: https://github.com/ThinkWithOps/ai-devops-systems-lab
📺 Full build walkthrough: https://youtu.be/a6376K9Lm00
This is Project 02 of a 30-project AI + DevOps series. Each project is a real deployed system — not a tutorial snippet.
What's the most confusing codebase you've ever had to onboard into? What would you have asked an AI first?