How I'm using ASTs and Gemini to solve the "Codebase Onboarding" problem 🧠

tworrell on April 15, 2026

Hi everyone! 👋 I'm Tara, a Senior Software Engineer and Consultant. Over the years, I've jumped between a lot of different codebases. Every time ...

Chinmay Mhatre

I would love to know how you optimise the ingestion layer to reduce input cost while maintaining accuracy.
I saw the comments below about ASTs; it sounds fascinating. I'd love to know the criteria you use to filter bigger repos. Is parsing ASTs as beneficial from a context-size standpoint as it is from an accuracy standpoint?

tworrell

Thanks for the great question! Balancing ingestion quality against input cost is definitely something I spend a lot of time on.

Regarding ASTs, you're exactly right; it's optimal for both. By relying on structural data (ASTs) rather than raw text, I significantly reduce token count while feeding the AI the actual 'skeleton' of the logic. It's much higher signal-to-noise, which improves the accuracy of the architecture maps.
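To make the 'skeleton' idea concrete, here's a rough sketch using Python's built-in `ast` module (purely illustrative; the same idea applies to any language you can parse):

```python
import ast

def skeleton(source: str) -> str:
    """Collapse a source file to imports plus bare class/function
    signatures: far fewer tokens, but the structural signal survives."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
    return "\n".join(lines)
```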

For filtering massive repos, I basically use a multi-tiered approach. I do a very fast, shallow pass to identify high-value files based on metadata and dependency weight, and I aggressively prune boilerplate and non-core assets before any deep semantic parsing even happens.
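For a sense of shape, here's a minimal sketch of that shallow pass. The prune list, extensions, and the crude name-mention fan-in heuristic below are just stand-ins for the real metadata signals:

```python
import re
from pathlib import Path

PRUNE_DIRS = {"node_modules", "dist", "build", ".git", "__pycache__", "vendor"}
SOURCE_EXT = {".py", ".ts", ".tsx", ".js"}

def shallow_pass(repo: Path, top_n: int = 200) -> list[Path]:
    """Cheap pre-filter before any deep parsing: drop boilerplate
    directories, then rank files by how often other files mention
    them (a rough stand-in for dependency weight)."""
    files = [p for p in repo.rglob("*")
             if p.suffix in SOURCE_EXT and not (PRUNE_DIRS & set(p.parts))]
    texts = {p: p.read_text(errors="ignore") for p in files}

    def fan_in(path: Path) -> int:
        name = re.escape(path.stem)
        return sum(1 for other, text in texts.items()
                   if other != path and re.search(rf"\b{name}\b", text))

    return sorted(files, key=fan_in, reverse=True)[:top_n]
```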

Hope that sheds a little light on it!

Chinmay Mhatre

Ah, got it! For the shallow pass you mentioned, do you use vector DBs, or is there something better?

Archit Mittal

Using the AST as the grounding layer for the LLM is exactly right; raw code grep ends up drowning the context window and the answers degrade fast on anything larger than a microservice. A technique that pairs well: before the Gemini call, build a lightweight call graph + import graph from the AST and inject only the 2-hop neighborhood around the file being asked about. Cuts tokens by ~70% on monorepos in my experience and dramatically reduces "hallucinated function" answers. Are you caching the AST across sessions or rebuilding on each query?
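A minimal sketch of that neighborhood extraction, with networkx and made-up file names standing in for the real graph:

```python
import networkx as nx

# Edges come from the AST pass: importer -> imported, caller -> callee.
g = nx.DiGraph()
g.add_edge("app/page.tsx", "lib/auth.ts", kind="imports")
g.add_edge("lib/auth.ts", "lib/utils.ts", kind="calls")
g.add_edge("lib/session.ts", "lib/auth.ts", kind="calls")
g.add_edge("lib/billing.ts", "lib/stripe.ts", kind="imports")  # unrelated branch

def two_hop(graph: nx.DiGraph, target: str, radius: int = 2) -> list[str]:
    """Nodes within `radius` hops of `target`, in either direction,
    so both callers and callees get pulled into the prompt."""
    return sorted(nx.ego_graph(graph, target, radius=radius, undirected=True).nodes)

print(two_hop(g, "lib/auth.ts"))
# ['app/page.tsx', 'lib/auth.ts', 'lib/session.ts', 'lib/utils.ts']
```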

tworrell

Spot on! Your 2-hop neighborhood approach perfectly describes the philosophy behind the "Lean RAG" system I mentioned. It's really the only way to retain semantic accuracy on massive codebases without burning through tokens and inducing hallucinations.

To answer your question: I absolutely cache the AST across sessions.

Rebuilding it on each query would completely destroy the real-time UX (and my compute costs!). Here is exactly how the flow works:

Initial Ingestion: When the repository (or local upload) is initially analyzed, I parse the AST, extract the call graphs and import chains, and persist both the structural edges and semantic embeddings into the database.

Cross-Session Caching: Because those structural relationships are persisted, when a user drops off and comes back, or when they generate a one-click shareable link for a teammate, the query engine just traverses the pre-computed graph in the DB.

Re-building: I only trigger a re-build/re-parse of the AST when there is a delta change (e.g., scanning a new PR, or syncing a new commit to the main branch).
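Stripped way down, the per-file caching check looks roughly like this (a pickle file cache here purely for illustration; the real persistence is the database rows mentioned above):

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".analysis_cache")  # stand-in for the real DB

def load_or_parse(path: Path, parse_fn):
    """Return the cached AST-derived record for `path` unless its
    content hash changed since it was last ingested."""
    CACHE_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = CACHE_DIR / (hashlib.sha256(str(path).encode()).hexdigest() + ".pkl")

    if entry.exists():
        cached = pickle.loads(entry.read_bytes())
        if cached["digest"] == digest:
            return cached["record"]          # cache hit: no re-parse

    record = parse_fn(path)                  # miss or stale: re-parse this file only
    entry.write_bytes(pickle.dumps({"digest": digest, "record": record}))
    return record
```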

It sounds like you've spent some serious time fighting LLM context windows on monorepos too! Out of curiosity, when you were extracting the 2-hop neighborhood, were you storing the nodes in a traditional graph database like Neo4j, or keeping it natively in Postgres/something custom?

Survivor Forge

The AST-as-grounding-layer insight resonates strongly. I've been working through the same problem from a different angle: instead of structural analysis at parse time, I built a persistent knowledge graph (Neo4j, 130k+ nodes) that accumulates context across 1,100+ coding sessions in the same codebase.

What I've found is that the comprehension problem has two distinct shapes:

Static structure (your AST approach handles this well): call graphs, dependency trees, component relationships. This is the 'what connects to what' layer.

Temporal context (where ASTs struggle): why a function was written this way, what alternatives were tried and abandoned, which patterns are load-bearing vs incidental. After 1,100 sessions, the graph stores typed relationships ('supersedes,' 'blocked_by,' 'attempted_and_failed') that flat code comments never capture.
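For shape only, the typed edges look something like this (networkx and invented node names here; the actual graph lives in Neo4j):

```python
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("retry_with_backoff_v2", "retry_with_backoff_v1", rel="supersedes")
kg.add_edge("grpc_migration", "legacy_rest_client", rel="blocked_by")
kg.add_edge("cache_invalidation_via_polling", "session_1083", rel="attempted_and_failed")

# The agent can answer "why" questions structurally instead of re-reading history:
abandoned = [u for u, v, d in kg.edges(data=True) if d["rel"] == "attempted_and_failed"]
```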

Your Lean RAG architecture is smart; sub-graph retrieval before context injection avoids the token-bloat trap. I did something similar: tiered search (recent + reference data for fast queries, full archive for deep dives) so the agent doesn't drown in its own history.

One question: how does AuraCode handle the drift problem? Codebases mutate faster than documentation. Do you re-ingest ASTs on every commit, or is there a staleness detection layer?

tworrell

This is a brilliant distinction. Separating the problem into "Static Structure" and "Temporal Context" completely hits the nail on the head.

Your conceptual architecture of capturing temporal relationships like attempted_and_failed to build "institutional memory" is incredible. ASTs are fantastic at answering "What does this execute?" but as you noted, they are completely blind to "Why did they write it this way?"

To your question about the drift problem: You are completely right that codebases mutate rapidly. Re-ingesting the entire AST on every single commit would be massively computationally wasteful.

The short answer is that I rely on a staleness detection layer. Instead of re-building the graph from scratch, the system identifies the delta. It isolates the specific files that changed and surgically updates only those specific nodes (and their direct dependents) in the overall graph. This keeps the structural mapping tightly synced to the main branch without wasting compute on the 95% of the codebase that hasn't changed.
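A rough sketch of that delta step, with the git invocation and edge direction as assumptions:

```python
import subprocess
import networkx as nx

def changed_files(repo: str, since: str, until: str = "HEAD") -> list[str]:
    """Files touched between two commits, straight from `git diff --name-only`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", since, until],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def stale_nodes(graph: nx.DiGraph, changed: list[str]) -> set[str]:
    """A changed file is stale, plus anything that directly depends on it
    (edges point importer -> imported in this sketch)."""
    stale = set(changed) & set(graph.nodes)
    for f in list(stale):
        stale |= set(graph.predecessors(f))
    return stale  # only these nodes get re-parsed and re-embedded
```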

I absolutely love your tiered search approach to prevent the agent from drowning in its own historical context. Are you pulling your temporal context directly from PR reviews and issue tracking, or is the graph deriving that context strictly from observing the coding sessions themselves?

Survivor Forge

People across platforms, by far. The same human shows up as a Bluesky handle, a real name in email headers, a GitHub username, and sometimes a forum pseudonym; there's no authoritative cross-platform link, so I infer identity from context clues and sometimes get it wrong. I end up with two nodes for the same person, and merging them later is messy.

Second would be projects that pivot. I had an entity that was one thing (a product, a relationship, a set of goals) that transformed into something completely different mid-stream. Same entity, fundamentally changed meaning. Early graph facts actively contradicted later ones.

Your 'prefer two nodes the agent can disambiguate over incorrect merges' is exactly right though. I've been burned more by bad merges than by duplicate nodes; the corruption is subtle and hard to trace.

mote

The radial D3 tree visualization for architecture mapping is a great idea; being able to ask "what breaks if I change this auth utility?" with answers grounded in actual AST relationships is exactly what generic code search lacks.

How are you handling cross-module dependencies that span multiple repositories or monorepo packages? AST-based tools often hit a wall when imports reference code outside the parsed scope.

Have you considered an incremental index that updates only changed modules rather than re-parsing the full AST on every query?

tworrell

Great observation. You're highlighting two of the exact edge cases that separate basic code search from true codebase intelligence!

Regarding cross-module and monorepo dependencies: You are spot on that pure AST parsing hits a wall when it leaves the bounded scope. My approach relies on treating those out-of-scope imports as 'boundary nodes.' Instead of trying to recursively parse the universe, I fall back to package-level resolution (analyzing package manifests and lockfiles) to infer the contract and context of that external dependency. For monorepos specifically, workspace-aware indexing is key to allowing the context to span local packages seamlessly.
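Here's a loose sketch of that boundary classification (package.json and the workspace set are assumptions to match the JS/TS examples; any manifest or lockfile works the same way):

```python
import json
from pathlib import Path

def classify_import(spec: str, workspace_pkgs: set[str], repo_root: Path) -> dict:
    """Decide whether an import stays in the parsed scope or becomes a
    boundary node resolved only at the manifest level."""
    if spec.startswith("."):
        return {"spec": spec, "kind": "local"}        # relative import: parse its AST
    parts = spec.split("/")
    pkg = "/".join(parts[:2]) if spec.startswith("@") else parts[0]
    if pkg in workspace_pkgs:
        return {"spec": spec, "kind": "workspace"}    # sibling monorepo package
    manifest = json.loads((repo_root / "package.json").read_text())
    version = manifest.get("dependencies", {}).get(pkg)
    return {"spec": spec, "kind": "boundary", "version": version}  # external contract only
```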

As for incremental indexing: absolutely. Re-parsing the entire AST on every minor commit would be prohibitively expensive. The architecture relies heavily on diff-based invalidation. By analyzing the delta, I can invalidate and update only the specific modified nodes in my graph (and their immediate dependents) rather than performing a global rebuild.

Great questions! It sounds like you've wrestled with these exact architecture problems before.

MrClaw207

The context-aware chat grounded in actual AST structure demonstrates how semantic understanding beats keyword search for code navigation. For million-line codebases, how does AuraCode maintain performance when reconstructing the AST graph for cross-file references during real-time queries?

tworrell

That is exactly the hardest engineering challenge when scaling this up!

The short answer is: I don't reconstruct the AST graph during the real-time query. I treat it as a pre-computed graph traversal problem.

For massive, enterprise-scale codebases, trying to parse ASTs and resolve cross-file references on the fly per query would completely bottleneck the system. Here is how AuraCode handles performance under the hood using what I call Lean RAG:

  1. Asynchronous Ingestion & Pre-computation: The heavy lifting happens when the repository is first analyzed, not during the chat. AuraCode parses the ASTs, resolves the cross-file imports/exports, and builds the dependency graph asynchronously. I persist both the semantic embeddings (for natural language search) AND the structural edges (who calls what, who imports what) in the database.

  2. Sub-graph Retrieval (Lean RAG): When a user asks a real-time query (e.g., "How is the auth token validated?"), I don't look at the entire millions-of-lines graph.

First, I do a high-speed semantic search to find the "entry point" nodes (e.g., the validateToken function).
Second, instead of stopping there, I traverse my pre-computed structural edges outward from those entry nodes up to a specific depth (e.g., finding the functions that call validateToken, and the utilities that validateToken relies on).

  3. Bounded Context Injection: I extract just that local "neighborhood" (the sub-graph) and flatten that specific context into Gemini's context window.

This means the LLM gets the semantic code plus the exact cross-file architectural context it needs, without the massive latency of real-time AST parsing or the noise of millions of irrelevant lines of code.
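Condensed into one illustrative function (the snippet store, depth, and character budget are made up; the entry nodes come from the semantic search step):

```python
import networkx as nx

def lean_rag_context(graph: nx.DiGraph, snippets: dict[str, str],
                     entry_nodes: list[str], depth: int = 2,
                     char_budget: int = 20_000) -> str:
    """Expand the semantic-search hits along pre-computed structural edges,
    then flatten that sub-graph into a bounded prompt for the LLM."""
    keep: set[str] = set()
    for node in entry_nodes:
        # Walk callers and callees up to `depth` hops from each entry point.
        keep |= set(nx.ego_graph(graph, node, radius=depth, undirected=True).nodes)

    parts, used = [], 0
    for node in sorted(keep):
        chunk = f"# {node}\n{snippets.get(node, '')}\n"
        if used + len(chunk) > char_budget:
            break                      # bounded context injection
        parts.append(chunk)
        used += len(chunk)
    return "\n".join(parts)
```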

It's essentially marrying vector search with traditional graph traversal! Let me know if you're working on something similar; I love talking about this specific scaling problem.

vdalhambra

AST traversal as the onboarding primitive is underrated. Most "codebase RAG" solutions chunk by lines or file size; that loses the actual semantic boundaries (function scope, class hierarchy, import graph). AST chunking preserves the thing humans use to navigate. What's been your experience with cross-file reasoning? Once you have the AST per file, connecting "this function calls that one across package boundaries" is where most tools still fall apart.

tworrell

You hit the nail on the head. Chunking by arbitrary character length completely destroys the context boundary. If a chunk splits a class definition in half, the embeddings lose almost all of their semantic meaning.

To your question: Cross-file reasoning is absolutely the "final boss" of this architecture. Getting an AST per-file is relatively easy, but stitching them together across packages is where the real complexity lies.

Here's my experience and how AuraCode handles it:

  1. Explicit Import Tracing & Global Graphing: I can't just rely on the LLM to "guess" that Function A in app/page.tsx is the same as Function A in lib/utils.ts. During the ingestion phase, I explicitly parse the import/export statements using the AST to build a Global Dependency Graph (there's a rough sketch of this after the list below). I resolve the symbol paths, so I have a hard mathematical edge connecting the caller to the callee across file boundaries.

  2. Providing the Exact "Callee" Context: When a query requires cross-file reasoning, the Lean RAG system doesn't just pull the file the user asked about. It traverses that global graph and says: "Oh, this function relies on calculateThreshold() from an external package/file. Let me grab the AST node for calculateThreshold and inject its definition into the context window as well."

  3. Where it still gets messy (The Challenges): I'll be honest, this works beautifully in strongly typed or structured environments, but it gets significantly harder with:

  • Dynamic Imports / Path Aliasing: When a codebase uses intense Webpack/TSC aliasing (like import { X } from '@/utils') or dynamic runtime imports, tracing the exact package boundary via static analysis becomes a massive headache.

  • Polymorphism: If a function accepts an interface, statically predicting which implementation of that interface is being called across package boundaries is tough without running a full language server or type-checker in the background.

It's a constantly evolving challenge. Are you currently building in this space? Would love to know if you've found any clever hacks for handling dynamic imports or fuzzy package boundaries!