Vijaya Bollu
# I Built an AI That Understands Any GitHub Repo Using LangChain and ChromaDB

## Why I Built This

Every time I join a new codebase, the first few days are the same: open the repo, stare at folders, try to figure out which service does what, read half a file, get interrupted, lose context, start over.

GitHub's built-in search is keyword-only. ChatGPT has never seen your repo. Teammates are busy. Documentation is either missing or out of date.

I wanted a tool that could answer "how does checkout work?" from the actual code — not from training data, not from docs, but from the real source files.

So I built one.


## How It Works

The system is built around a RAG (Retrieval-Augmented Generation) pipeline. The idea: instead of asking an LLM to answer from memory, you first retrieve the most relevant code chunks, then ask the LLM to answer using only those chunks.

Ingest flow:

  1. Clone the GitHub repo locally
  2. Walk every file and split into overlapping chunks (~500 tokens, 50-token overlap)
  3. Convert each chunk to a vector embedding using `all-MiniLM-L6-v2` (Sentence Transformers — local and free)
  4. Store embeddings + metadata in ChromaDB
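The ingest flow above can be sketched roughly like this. It's a minimal illustration, not the project's actual code: `chunk_text` is a hypothetical whitespace-token splitter (a real pipeline would count tokens with the embedding model's tokenizer), and the calls in `embed_and_store` follow the public Sentence Transformers and ChromaDB APIs.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of ~`size` whitespace tokens."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks


def embed_and_store(chunks: list[str], repo: str) -> None:
    # Imported lazily so the chunker above works without these deps installed.
    from sentence_transformers import SentenceTransformer
    import chromadb

    model = SentenceTransformer("all-MiniLM-L6-v2")  # runs locally, no API cost
    embeddings = model.encode(chunks).tolist()

    client = chromadb.PersistentClient(path="./chroma_db")
    # Note: Chroma collection names can't contain "/", so the repo name
    # goes into metadata rather than the collection name.
    collection = client.get_or_create_collection("repo_chunks")
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=chunks,
        metadatas=[{"repo": repo, "chunk": i} for i in range(len(chunks))],
    )
```

With `size=500` and `overlap=50`, each chunk shares its last 50 tokens with the start of the next one, which is what keeps boundary-spanning code intact.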

Query flow:

  1. Embed the user's question with the same model
  2. ChromaDB cosine similarity search → top-5 most relevant chunks
  3. Inject chunks into a LangChain prompt
  4. LLM generates an answer with source file citations

The frontend is a Next.js split-pane layout — file tree on the left, chat on the right. The ingest endpoint streams progress via Server-Sent Events so the user sees live updates instead of a spinner.

```python
# LangChain RAG chain (simplified)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": question})
```

The LLM only ever sees 5 chunks — not the whole repo. That keeps answers fast and grounded.
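The SSE progress stream mentioned earlier boils down to a simple wire format: each event is a `data:` line followed by a blank line. A hedged sketch, with illustrative field names rather than the project's actual event schema:

```python
import json
from typing import Iterator


def sse_event(data: dict) -> str:
    """Format one data-only Server-Sent Events message."""
    return f"data: {json.dumps(data)}\n\n"


def ingest_progress(total_files: int) -> Iterator[str]:
    # Hypothetical progress generator; a real one would yield after each
    # file is chunked and embedded.
    for i in range(1, total_files + 1):
        yield sse_event({"indexed": i, "total": total_files})
    yield sse_event({"status": "done"})


# In FastAPI, the generator would be returned as:
#   StreamingResponse(ingest_progress(n), media_type="text/event-stream")
```

On the browser side, a plain `EventSource` subscribes to the stream and updates the UI per event, which is what replaces the spinner with live progress.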


## Demo / Results

I indexed the ShopFlow ecommerce demo app — ~847 files across 12 services.

Input: "How does the checkout flow work?"

Output:

```
The checkout flow works as follows:
1. Cart validation — CartService checks inventory availability (src/services/cart.py:142)
2. Payment processing — PaymentService calls Stripe API with order total (src/services/payment.py:89)
3. Order creation — OrderService writes to PostgreSQL with status PENDING (src/models/order.py:67)
4. Inventory update — InventoryService decrements stock counts (src/services/inventory.py:203)
```

Index time: ~30 seconds. Answer latency: ~3 seconds. API cost: $0 (self-hosted on AWS EC2).


## What Surprised Me

**1. FastAPI path parameters break on slashes**
Repo names like `owner/repo` contain a `/` — FastAPI treats it as a path separator and routes to the wrong endpoint. The fix: declare the parameter as `{repo_name:path}`. One tiny change. I found it via a 404 in production after deployment.

**2. TypeScript types drift silently from backend SSE events**
The backend was emitting `file`, `file_path`, and `indexed_at` fields in the SSE stream that the frontend TypeScript interface didn't declare. No error in local dev — it only failed during `next build` inside Docker. A shared OpenAPI-generated type contract would have caught this at development time.
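Short of full OpenAPI codegen, one lightweight mitigation is to define the event shape once on the backend and serialize only through it. The class below is hypothetical (the field names just mirror the ones from this bug), but it makes any schema change an explicit, reviewable diff:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class IngestEvent:
    # Single source of truth for the SSE payload shape. Adding a field
    # here is visible in review, instead of silently appearing on the wire.
    file: str
    file_path: str
    indexed_at: str


def serialize(event: IngestEvent) -> str:
    return json.dumps(asdict(event))
```

From a schema like this you can also emit a JSON Schema or OpenAPI fragment and generate the matching TypeScript interface, closing the drift gap entirely.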

**3. Docker layer caching hides real code**
After pushing a fix, the Docker build was still serving old code because it cached the `COPY . .` layer. Always run `docker-compose build --no-cache` when a fix isn't appearing after `git pull`.

**4. Chunk overlap matters more than chunk size**
I initially had no overlap between chunks. The AI would give incomplete answers for questions that spanned function boundaries — the relevant context was split across two chunks that were never returned together. Adding 50-token overlap fixed most of these cases.
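The failure mode is easy to reproduce with a toy splitter (a sketch, not the project's chunker): without overlap, a span that straddles a chunk boundary never appears whole in any chunk, so no single retrieved chunk can answer a question about it.

```python
def split(tokens: list[str], size: int, overlap: int = 0) -> list[list[str]]:
    """Fixed-size windows over `tokens`, optionally overlapping."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


def contains(chunks: list[list[str]], span: list[str]) -> bool:
    """True if `span` appears contiguously inside some chunk."""
    return any(" ".join(span) in " ".join(c) for c in chunks)


tokens = [f"t{i}" for i in range(10)]
boundary_span = tokens[4:7]  # straddles the boundary between tokens 4 and 5

no_overlap = split(tokens, size=5)               # [t0..t4], [t5..t9]
with_overlap = split(tokens, size=5, overlap=2)  # adjacent windows share 2 tokens
```

With no overlap, `boundary_span` is cut in half between the two windows; with a 2-token overlap, the window starting at `t3` carries it intact — the same effect the 50-token overlap has at real chunk sizes.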


## Try It

```bash
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab
cd projects/02-ai-github-repo-explainer
cp backend/.env.example backend/.env
# Add your GROQ_API_KEY or leave blank to use Ollama
docker-compose up -d
```

Open http://<your-ec2-ip>:3000, paste any public GitHub URL, and start asking questions.

🔗 GitHub: https://github.com/ThinkWithOps/ai-devops-systems-lab
📺 Full build walkthrough: https://youtu.be/a6376K9Lm00

This is Project 02 of a 30-project AI + DevOps series. Each project is a real deployed system — not a tutorial snippet.


What's the most confusing codebase you've ever had to onboard into? What would you have asked an AI first?
