I have always wanted to contribute to open source projects on GitHub. If you check out my profile, you will see that I have even tried to get into it. But once I went past the documentation changes and minor fixes, I realized that OSS contributions were HARD.
So, I decided to code a RAG project that would help me out. Of course, I could just use the inbuilt coding agents in the IDE, but where's the fun in that?
The original version of KernelMind was pretty basic.
I just wanted a way to ask questions about large repositories without manually opening forty files and mentally reconstructing execution flow.
At the time, the plan looked straightforward:
Repository -> AST Parsing -> Chunk Extraction -> Embeddings -> Vector Search -> Answer Generation
That was it. No fancy business. Just embeddings over code. But it broke immediately.
The First Hurdles
The first step was parsing. I made a basic AST parser and ran it against a deliberately small repository, storing my chunks in MongoDB for now. I wanted something predictable so debugging would be easier. I decided to use full-stack-fastapi-template.
The indexing pipeline finished and printed:
Inserted 1258 chunks.
Checked 57 files.
That made absolutely no sense. There was no way a small repository like that should explode into that many chunks. So I started tracing the parser output manually.
The first issue was trivial. I was ingesting... everything. Tests, initializers, EVERYTHING. This was a small fix ... I added a simple IGNORE_LIST that skipped the garbage files and only ingested the relevant Python files.
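Roughly, the filter ended up looking something like this - a minimal sketch, where the exact directory and file names are illustrative rather than KernelMind's real ignore list:

```python
from pathlib import Path

# Illustrative skip rules - the real IGNORE_LIST may contain different entries.
IGNORE_DIRS = {"tests", "test", "__pycache__", ".git", "migrations"}
IGNORE_FILES = {"__init__.py", "conftest.py", "setup.py"}

def iter_python_files(repo_root: str):
    """Yield only the Python files worth indexing, skipping noise."""
    for path in Path(repo_root).rglob("*.py"):
        if any(part in IGNORE_DIRS for part in path.parts):
            continue
        if path.name in IGNORE_FILES:
            continue
        yield path
```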
The second issue was slightly more confusing: it turned out that methods inside classes were being extracted twice:
- once correctly as methods
- once incorrectly as standalone functions
This meant that no chunk in my system had a concept of unique identity.
Everything was just “chunks.” And chunks had repetitive content...
Another related problem:
Originally, the parser stored function names like this:
__init__
Which is technically valid. It is also practically useless.
There could be dozens of __init__ methods across the repository.
So I introduced this (totally cool and non-ChatGPT-researched) concept - Fully Qualified Names.
Instead of:
__init__
the system generated:
matplotlib.figure.Figure.__init__
That single architectural change completely shifted the project. FQNs became the atomic elements of the data - an FQN is completely unique across the entire repo. While parsing, I only had to construct each FQN once; if another function resolved to an FQN I had already seen, it had already been parsed, so I could skip it.
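Here is a minimal sketch of how FQN construction can work with Python's ast module, assuming the module path (for example matplotlib.figure) is derived from the file's location - the real parser does more, but the core idea is just prefix building plus a seen-set:

```python
import ast

def extract_fqns(source: str, module_path: str) -> list[str]:
    """Build names like pkg.module.Class.method and skip duplicates."""
    seen, fqns = set(), []

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                fqn = f"{prefix}.{child.name}"
                if fqn not in seen:       # same FQN seen before -> already parsed, skip it
                    seen.add(fqn)
                    fqns.append(fqn)
                visit(child, fqn)         # methods automatically get the Class. prefix

    visit(ast.parse(source), module_path)
    return fqns
```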
Now that symbols had stable identities:
- imports could resolve properly
- dependencies became traceable
The repository stopped behaving like disconnected text.
It started behaving like a connected system.
The “self” Problem
One of the MOST CONFUSING bugs came from method calls.
Initially, method relationships looked like this:
"calls": ["self.get_host"]
Which looks reasonable at first glance ... except self means nothing globally.
A graph cannot reason over:
self.get_host
because it has no stable reference. So I had to build resolution logic that converted local method calls into globally addressable symbols.
Eventually:
"calls": ["src.requests.cookies.MockRequest.get_host"]
started appearing in the graph output. That was a huge leap for me - my system was no longer parsing syntax alone. It was starting to reconstruct semantic relationships.
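The resolution itself is conceptually simple once you know which class a method lives in: every self.x(...) call gets the class's FQN prepended. A rough sketch of the idea (not the exact KernelMind code):

```python
import ast

def resolve_self_calls(method_node: ast.FunctionDef, class_fqn: str) -> list[str]:
    """Turn local self.method() calls into globally addressable FQNs."""
    resolved = []
    for node in ast.walk(method_node):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            target = node.func.value
            if isinstance(target, ast.Name) and target.id == "self":
                # self.get_host -> src.requests.cookies.MockRequest.get_host
                resolved.append(f"{class_fqn}.{node.func.attr}")
    return resolved
```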
Once FQNs entered the system, something clicked for me almost immediately.
I realized I was no longer dealing with isolated chunks of text. Every function now had identity, relationships, callers, callees, imports, and dependencies. The repository was starting to look far less like a document collection and much more like a graph data structure describing execution flow.
Building The Graph
And once I saw the repository that way, a lot of the later architecture decisions suddenly started making sense.
The next obvious question became:
If functions are connected, could I retrieve them together?
That question basically led to the entire graph architecture.
Constructing Relationships
The first step was building explicit call relationships. Whenever the parser encountered a function call, I attempted to resolve it into an FQN and create a directed edge:
caller -> callee
So if:
login_user()
called:
create_access_token()
the graph stored that relationship directly.
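In data terms, every resolved call just adds a directed edge to two adjacency maps, one per direction. Something like this simplified sketch (the FQNs in the example are made up):

```python
from collections import defaultdict

def build_call_edges(resolved_calls: dict[str, list[str]]):
    """resolved_calls maps caller FQN -> list of resolved callee FQNs."""
    calls = defaultdict(set)      # forward edges: caller -> callees
    called_by = defaultdict(set)  # reverse edges: callee -> callers
    for caller, callees in resolved_calls.items():
        for callee in callees:
            calls[caller].add(callee)
            called_by[callee].add(caller)
    return calls, called_by

# build_call_edges({"app.auth.login_user": ["app.security.create_access_token"]})
```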
Initially, the graph nodes were fairly simple. Each node stored:
- the FQN
- file path
- source code
- outgoing calls
- incoming calls
Something roughly like:
class GraphNode:
    def __init__(self, fqn, file_path, source):
        self.fqn = fqn                # fully qualified name, unique across the repo
        self.file_path = file_path
        self.source = source          # raw source of the function or class
        self.calls = []               # outgoing call edges (callees)
        self.called_by = []           # incoming call edges (callers)
At first, this mainly helped with debugging. Then I realized the graph could fundamentally improve retrieval itself. Because codebases are not isolated files. They are execution systems.
Forward And Reverse Traversal
Once the graph structure stabilized, I realized traversal needed to work in both directions. Forward traversal helped answer questions like:
“What does this function eventually call?”
which was useful for reconstructing execution flow and understanding downstream behavior. Reverse traversal was equally important because it answered:
“Who depends on this logic?”
That became extremely useful for tracing middleware usage, validation chains, service dependencies, and understanding how deeply certain functionality was integrated into the repository.
I decided to implement a naive BFS - semantic search (implemented later) would identify the start node most similar to the query, and then BFS would surface the other functions (and other "chunks") related to that node.
Together, forward and reverse traversal made the graph feel much less like static metadata and much more like a navigable execution map of the repository.
Once I switched traversal to BFS, retrieval immediately started feeling more coherent.
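The traversal itself was an ordinary BFS over the node map, parameterised by direction. A sketch, assuming graph maps FQNs to the GraphNode objects shown earlier:

```python
from collections import deque

def naive_bfs_expand(graph: dict, start_fqn: str, direction: str = "calls", max_depth: int = 2):
    """Walk outward from the semantic-search hit, forward ("calls") or reverse ("called_by")."""
    visited = {start_fqn}
    frontier = deque([(start_fqn, 0)])
    related = []
    while frontier:
        fqn, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for neighbor in getattr(graph[fqn], direction):
            if neighbor in graph and neighbor not in visited:
                visited.add(neighbor)
                related.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return related
```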
Query-Aware Expansion
The next problem was the naive BFS implementation. Naive graph expansion retrieves way too much context. If you blindly expand neighbors inside a large repository, the graph explodes into noise very quickly. Especially around highly connected framework code.
So graph expansion had to become query-aware.
Instead of expanding everything equally, the system started looking at:
- symbol overlap
- semantic similarity
- auth-related terminology
- file roles
- query keywords
before deciding what to expand.
For example:
query = authentication
should prioritize:
- token middleware
- JWT validation
- auth decorators
and not:
- generic request logging
- unrelated utilities
- serialization helpers
Once I managed to code this in, the graph was no longer purely structural. It was becoming semantic.
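A stripped-down version of the scoring idea looks roughly like this - the weights, helper names, and exact signal mix are guesses for illustration, not KernelMind's actual formula:

```python
import numpy as np

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expansion_score(node, query_terms: set[str], query_vec, embed_fn) -> float:
    """Blend symbol overlap with semantic similarity before expanding a neighbor."""
    name_tokens = set(node.fqn.lower().replace(".", " ").replace("_", " ").split())
    symbol_overlap = len(name_tokens & query_terms) / max(len(query_terms), 1)
    semantic_sim = cosine(query_vec, embed_fn(node.source))
    return 0.6 * semantic_sim + 0.4 * symbol_overlap
```

A neighbor only enters the BFS frontier if its score clears a threshold, which is what keeps the expansion from drifting into unrelated code.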
The Utility Node Problem
Another issue appeared during expansion. Highly connected utility functions started dominating retrieval.
Things like:
log_info()
handle_error()
serialize_response()
showed up everywhere. The graph accidentally rewarded centrality. Which sounds mathematically elegant until your retrieval system starts implying logging is the answer to everything, simply because that function appeared 1000 times...
So I introduced penalties for high-degree nodes. Highly connected, utility-heavy functions received lower expansion priority. This was similar in spirit to how TF-IDF down-weights terms that appear everywhere, except applied to function calls.
That cleanup improved retrieval quality far more than I expected ... because now the graph stopped constantly expanding into irrelevant framework plumbing.
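The penalty can be as simple as an inverse-log factor on a node's in-degree - again a sketch of the idea rather than the exact formula:

```python
import math

def degree_penalty(fqn: str, called_by: dict) -> float:
    """Down-weight hub-like utilities, much like IDF down-weights ubiquitous terms."""
    in_degree = len(called_by.get(fqn, ()))
    return 1.0 / (1.0 + math.log1p(in_degree))

# Final expansion priority might combine both signals, e.g.:
# priority = expansion_score(node, query_terms, query_vec, embed_fn) * degree_penalty(node.fqn, called_by)
```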
Semantic Graph Expansion
This was where the architecture started becoming much more interesting. Originally, graph relationships were purely structural:
A calls B
Eventually, I started combining:
- graph relationships with semantic similarity
- symbol relevance
- query intent
so the traversal could prioritize execution paths actually related to the user’s question instead of blindly expanding every connected node.
This made a huge difference for repository reasoning.
Queries about authentication naturally began surfacing middleware chains, token validation logic, and request lifecycle flows instead of drifting into unrelated utility code and framework plumbing.
The traversal pipeline slowly evolved into something closer to:
results = initial_retrieval(query)
expanded = bfs_expand(
    results,
    query_aware=True,
    semantic_weighting=True,
    depth=2
)
Now, my retrieval architecture started feeling execution-aware.
The Biggest Realization
This entire phase fundamentally changed how I thought about retrieval systems. Originally, I assumed retrieval quality depended mostly on embeddings.
Eventually I realized:
Retrieval quality depends heavily on structure.
The graph was improving retrieval not because the model became smarter, but because the context became more coherent. The system stopped retrieving isolated functions. It started retrieving workflows.
And finally, once the graph structure stabilized:
- symbol identity existed
- traversal worked
- execution flow became traceable
- relationships became meaningful
All this time, I had been working with MongoDB, storing the "chunks" in a collection. This was excellent for debugging, but now that my repository structure had stabilized and I was confident enough in my graph architecture, I was ready to move on to embeddings and retrieval ranking properly.
Part 2 is coming up soon! Until then, you can check out my code here