AST traversal as the onboarding primitive is underrated. Most "codebase RAG" solutions chunk by lines or file size, which loses the actual semantic boundaries (function scope, class hierarchy, import graph). AST chunking preserves the thing humans use to navigate. What's been your experience with cross-file reasoning? Once you have the AST per file, connecting "this function calls that one across package boundaries" is where most tools still fall apart.
You hit the nail on the head. Chunking by arbitrary character length completely destroys the context boundary. If a chunk splits a class definition in half, the embeddings lose almost all of their semantic meaning.
To your question: Cross-file reasoning is absolutely the "final boss" of this architecture. Getting an AST per-file is relatively easy, but stitching them together across packages is where the real complexity lies.
Here's my experience so far and how AuraCode handles it:
Explicit Import Tracing & Global Graphing: I can't just rely on the LLM to "guess" that Function A in app/page.tsx is the same as Function A in lib/utils.ts. During the ingestion phase, I parse the import/export statements from the AST to build a Global Dependency Graph. I resolve the symbol paths, so there is an explicit, deterministic edge connecting the caller to the callee across file boundaries.
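To make the graph idea concrete, here is a minimal sketch, not AuraCode's actual implementation. It assumes the AST pass has already produced one record per file with its exports and its imports, and that import paths are already resolved to real file paths. The names `FileRecord`, `ImportRecord`, and `buildDependencyGraph` are hypothetical.

```typescript
// Hypothetical shapes: what a per-file AST pass might emit.
interface ImportRecord {
  symbol: string; // imported name, e.g. "calculateThreshold"
  from: string;   // resolved path of the file that exports it (assumed resolved)
}

interface FileRecord {
  path: string;
  exports: string[];
  imports: ImportRecord[];
}

// Node ids are "path#symbol", so the same name in two files stays distinct.
function symbolId(path: string, symbol: string): string {
  return `${path}#${symbol}`;
}

// Returns, for each file, the set of exported symbols it imports from other files.
function buildDependencyGraph(files: FileRecord[]): Map<string, Set<string>> {
  // First pass: index every symbol that some file actually exports.
  const exported = new Set<string>();
  for (const file of files) {
    for (const name of file.exports) {
      exported.add(symbolId(file.path, name));
    }
  }

  // Second pass: keep an edge only when the import matches a known export.
  const edges = new Map<string, Set<string>>();
  for (const file of files) {
    const targets = new Set<string>();
    for (const imp of file.imports) {
      const id = symbolId(imp.from, imp.symbol);
      if (exported.has(id)) {
        targets.add(id);
      }
    }
    edges.set(file.path, targets);
  }
  return edges;
}
```

Imports that do not match any indexed export are silently dropped in this sketch; a real pipeline would probably want to log them, since they usually point at aliases or external packages.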
Providing the exact "Callee" context: When a query requires cross-file reasoning, the Lean RAG system doesn't just pull the file the user asked about. It traverses that global graph and says: "Oh, this function relies on calculateThreshold() from an external package/file. Let me grab the AST node for calculateThreshold and inject its definition into the context window as well."
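The traversal step could look roughly like the sketch below. It is an illustration under assumptions, not the real retrieval code: `SymbolNode`, `collectContext`, and the `maxDepth` limit are hypothetical, and each node is assumed to already carry its definition text and the ids of the symbols it calls.

```typescript
// Hypothetical node in the global graph: one symbol with its definition text.
interface SymbolNode {
  id: string;      // e.g. "lib/utils.ts#calculateThreshold"
  source: string;  // definition text extracted from the AST node
  calls: string[]; // ids of symbols this definition calls
}

// Breadth-first walk from the queried symbol, collecting definitions to
// inject into the context window. maxDepth bounds how far callees are followed.
function collectContext(
  index: Map<string, SymbolNode>,
  startId: string,
  maxDepth: number
): string[] {
  const seen = new Set<string>();
  const ordered: string[] = [];
  let frontier: string[] = [startId];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const id of frontier) {
      if (seen.has(id)) continue;   // guards against cycles and duplicates
      const node = index.get(id);
      if (!node) continue;          // unresolved callee: skip it
      seen.add(id);
      ordered.push(node.source);
      next.push(...node.calls);
    }
    frontier = next;
  }
  return ordered;
}
```

The depth limit is a design choice for keeping the context window small: depth 0 returns only the queried definition, depth 1 adds its direct callees, and so on.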
Where it still gets messy (The Challenges): I'll be honest, this works beautifully in strongly typed or structured environments, but it gets significantly harder with:
Dynamic Imports / Path Aliasing: When a codebase leans heavily on Webpack/TSC path aliasing (like import { X } from '@/utils') or dynamic runtime imports, tracing the exact package boundary via static analysis becomes a massive headache.
Polymorphism: If a function accepts an interface, statically predicting which implementation of that interface is being called across package boundaries is tough without running a full language server or type-checker in the background.
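For the aliasing half of the first problem, one partial workaround is to read the alias table and rewrite specifiers before tracing imports. The sketch below is a simplified take on tsconfig-style `paths` patterns with a single `*` wildcard. It ignores `baseUrl`, file extensions, and index files, and `resolveAlias` and `PathsConfig` are hypothetical names. It does nothing for truly dynamic imports or for the polymorphism case.

```typescript
// Alias table in the style of tsconfig "paths", e.g. { "@/*": ["src/*"] }.
type PathsConfig = Record<string, string[]>;

// Returns candidate paths for an aliased specifier, or [] if no pattern matches.
// Exact patterns win; among wildcard patterns the longest prefix wins.
function resolveAlias(specifier: string, paths: PathsConfig): string[] {
  let bestPrefix = "";
  let bestSuffix = "";
  let bestTargets: string[] | undefined;

  for (const pattern of Object.keys(paths)) {
    const targets = paths[pattern] ?? [];
    const star = pattern.indexOf("*");
    if (star === -1) {
      if (pattern === specifier) return targets; // exact alias, no wildcard
      continue;
    }
    const prefix = pattern.slice(0, star);
    const suffix = pattern.slice(star + 1);
    const fits =
      specifier.length >= prefix.length + suffix.length &&
      specifier.startsWith(prefix) &&
      specifier.endsWith(suffix);
    if (fits && (bestTargets === undefined || prefix.length > bestPrefix.length)) {
      bestPrefix = prefix;
      bestSuffix = suffix;
      bestTargets = targets;
    }
  }

  if (bestTargets === undefined) return [];
  // The text matched by "*" is substituted into each target pattern.
  const matched = specifier.slice(
    bestPrefix.length,
    specifier.length - bestSuffix.length
  );
  return bestTargets.map((t) => t.split("*").join(matched));
}
```

With an alias table of `{ "@/*": ["src/*"] }`, the specifier `'@/utils'` maps to the candidate `src/utils`, which can then go through the normal relative-import resolution.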
It's a constantly evolving challenge. Are you currently building in this space? Would love to know if you've found any clever hacks for handling dynamic imports or fuzzy package boundaries!