Amariah Kamau
How I reduced AI codebase context from 100K to 5K tokens using a graph-based RAG

Most AI coding tools lie to you about context.

They say "I understand your codebase." What they actually do is dump as many files as possible into the context window and hope the model figures it out. That works on a 3-file project. It falls apart on anything real.

Here's the problem I kept hitting: a medium-sized production codebase hits ~100K tokens when you try to feed it to an LLM. That's expensive, slow, and surprisingly lossy — models start hallucinating relationships between files that don't exist and missing the ones that do.

So I built a different approach inside Atlarix. Here's exactly how it works.

Step 1: Parse the codebase into a typed graph (Round-Trip Engineering)

When you open a project in Atlarix, it parses every file with Tree-sitter and walks the resulting AST. Instead of storing raw text, it extracts:

  • Every function, class, interface, and type
  • Import/export relationships between files
  • Call relationships between functions
  • File-level dependency edges

This gets stored as a typed node/edge graph in SQLite — what we call the Blueprint. One-time cost. Persistent across sessions.
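Here's a minimal sketch of the shape of that extraction. Atlarix's real parser is Tree-sitter based and language-agnostic; to keep this self-contained I'm using Python's stdlib `ast` module on Python files only, and the `nodes`/`edges` schema below is my own illustration, not the actual Blueprint schema.

```python
import ast
import sqlite3
from pathlib import Path

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, kind TEXT, name TEXT, file TEXT, source TEXT);
CREATE TABLE IF NOT EXISTS edges (src INTEGER, dst INTEGER, kind TEXT);
"""

def build_blueprint(project_dir: str, db_path: str = "blueprint.db") -> None:
    """One-time pass: parse every file, store typed nodes and edges in SQLite."""
    db = sqlite3.connect(db_path)
    db.executescript(SCHEMA)
    for path in Path(project_dir).rglob("*.py"):
        source = path.read_text()
        tree = ast.parse(source)
        # The file itself is a node; definitions and imports hang off it as edges.
        file_id = db.execute(
            "INSERT INTO nodes (kind, name, file, source) VALUES ('file', ?, ?, '')",
            (path.name, str(path)),
        ).lastrowid
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "function"
                def_id = db.execute(
                    "INSERT INTO nodes (kind, name, file, source) VALUES (?, ?, ?, ?)",
                    (kind, node.name, str(path), ast.get_source_segment(source, node) or ""),
                ).lastrowid
                db.execute("INSERT INTO edges VALUES (?, ?, 'defines')", (file_id, def_id))
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                # Record file -> module import edges.
                module = getattr(node, "module", None) or node.names[0].name
                mod_id = db.execute(
                    "INSERT INTO nodes (kind, name, file, source) VALUES ('module', ?, ?, '')",
                    (module, str(path)),
                ).lastrowid
                db.execute("INSERT INTO edges VALUES (?, ?, 'imports')", (file_id, mod_id))
    db.commit()
```

The point is the shape, not the details: typed nodes, typed edges, persisted once and reused across sessions.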

Step 2: Query the graph, not the files

When you ask Atlarix something — "why is this API returning null?" or "add authentication to the user route" — it doesn't re-read your files. It queries the Blueprint graph.

We use BM25 scoring to rank nodes by relevance to the query. Only the top-scoring nodes get passed to the LLM. Everything else stays in SQLite.

Result: instead of ~100K tokens, the average query uses ~5K tokens. That's a 95% reduction.
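In code terms, retrieval looks something like this: a plain Okapi BM25 scorer over the node text from the Step 1 sketch. The constants (k1 = 1.5, b = 0.75) and the top_k cutoff are textbook defaults, not Atlarix's actual tuning.

```python
import math
import re
import sqlite3

def bm25_top_nodes(db_path: str, query: str, top_k: int = 20,
                   k1: float = 1.5, b: float = 0.75):
    """Rank Blueprint nodes against a query with Okapi BM25; return the top_k rows."""
    db = sqlite3.connect(db_path)
    rows = db.execute("SELECT id, name, source FROM nodes").fetchall()
    docs = [re.findall(r"\w+", f"{name} {source}".lower()) for _, name, source in rows]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n, 1)
    terms = set(re.findall(r"\w+", query.lower()))
    df = {t: sum(1 for d in docs if t in d) for t in terms}  # document frequency
    scored = []
    for row, doc in zip(rows, docs):
        score = 0.0
        for t in terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            tf = doc.count(t)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for score, row in scored if score > 0][:top_k]
```

Only the returned rows' source text goes into the prompt; everything else stays on disk.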

Step 3: Hierarchical context for complex queries

For simple queries, BM25 node retrieval is enough. For complex ones — multi-file refactors, architecture questions — we use a three-layer hierarchy:

  • Mermaid diagram — a persistent high-level map of the whole codebase, always in context
  • BM25-scored Blueprint nodes — targeted retrieval for the specific query
  • Fast-model compression — if context hits 70% capacity, a fast model compresses the least relevant nodes before passing to the main model

This keeps context tight regardless of project size.
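A rough sketch of that assembly loop, under loud assumptions: tokens are estimated at ~4 characters each, and `compress_with_fast_model` is a placeholder for whatever fast model Atlarix actually calls. Neither is the real implementation.

```python
def assemble_context(diagram: str, ranked_nodes: list[str],
                     budget_tokens: int = 8000) -> str:
    """Layered context: diagram always included, then nodes best-first,
    with overflow compressed once usage passes 70% of the budget."""
    est = lambda text: len(text) // 4  # crude ~4 chars/token estimate (assumption)
    parts, used = [diagram], est(diagram)
    for i, node_src in enumerate(ranked_nodes):
        if used + est(node_src) > budget_tokens * 0.7:
            # Past 70% of budget: squash the remaining, least relevant nodes.
            tail = "\n".join(ranked_nodes[i:])
            parts.append(compress_with_fast_model(tail))
            break
        parts.append(node_src)
        used += est(node_src)
    return "\n\n".join(parts)

def compress_with_fast_model(text: str) -> str:
    # Stand-in: a real implementation would ask a small, fast LLM to summarize.
    return f"[compressed summary of ~{len(text) // 4} tokens of lower-relevance code]"
```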

Step 4: Provider-agnostic by design

The Blueprint RAG layer sits beneath every AI provider. Whether you're using GPT-4o, Claude, Gemini, Groq, or a local Ollama model — the same 5K token context gets served. You're not paying for the model to re-read your whole codebase on every message.
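That "sits beneath every provider" part is just an interface boundary. Here's a minimal sketch of the idea, reusing the earlier sketches; the real per-provider integrations obviously differ.

```python
from typing import Protocol

class Provider(Protocol):
    # Any model client that can turn a prompt into a completion.
    def complete(self, prompt: str) -> str: ...

def ask(provider: Provider, query: str, db_path: str = "blueprint.db") -> str:
    """Same Blueprint-derived context no matter which model answers."""
    # Assumes bm25_top_nodes and assemble_context from the sketches above are in scope.
    nodes = bm25_top_nodes(db_path, query)
    context = assemble_context("<mermaid map>", [src for _, _, src in nodes])
    return provider.complete(f"{context}\n\nUser: {query}")
```

Swapping GPT-4o for a local Ollama model then means swapping the `Provider` implementation; nothing in the RAG layer changes.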

Why this matters beyond token cost

The token reduction is the headline number. But the real win is accuracy.

When you give an LLM 100K tokens, its attention gets diluted across all of it. The file you actually care about is competing with 200 other files for attention. With 5K targeted tokens, the model is working with exactly what's relevant. Responses are more precise, edits land in the right place, and hallucinated file paths all but disappear.

What's next

Atlarix v7 added parallel agents (Research, Architect, Builder, Reviewer, Debugger) on top of this foundation. Each agent uses the same Blueprint RAG layer — so even autonomous multi-agent builds stay within tight context budgets.

We also just shipped Windows support today — so Atlarix now runs on Mac, Windows, and Linux.

If you're building something where AI context management is a bottleneck, I'd love to compare notes. Try it at atlarix.dev or ask me anything in the comments.
