Harness agentic AI to navigate enterprise code, without burning a billion tokens.
The $5,000 question
By my estimate, an average engineering task would need a 100-million-token context window and would consume about a billion tokens. At today’s prices with a frontier model, that’s roughly $5,000 for a single task.
That was my rough estimate for what it would take to work on Magento (Adobe Commerce), a typical piece of enterprise software of typical complexity, using an AI agent. Here’s how I got there.
LLMs take their input through prompts. A fresh Magento install, with its dependencies, is around 66,000 XML and PHP files. That’s roughly 305 million characters, or about 100 million tokens at around three characters per token. And that’s just the opening message of the conversation. Even the largest mainstream context windows top out at around 1 million tokens, so we’re already off by two orders of magnitude. And it gets worse: the whole context is sent again with every new prompt, so those 100 million tokens are billed on every round. Ten rounds in, you’ve spent a billion tokens.
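The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a measurement: the ratio of about 3 characters per token and the price of $5 per million input tokens are assumptions chosen to match the figures in this post, and real tokenizers and price lists will differ.

```python
# Back-of-the-envelope token budget for sending a whole codebase as a prompt.
# CHARS_PER_TOKEN and USD_PER_MILLION_TOKENS are assumptions, not measured values.

CODEBASE_CHARS = 305_000_000      # characters in a fresh install (figure from this post)
CHARS_PER_TOKEN = 3.05            # assumed average for XML/PHP source
CONTEXT_LIMIT_TOKENS = 1_000_000  # rough upper bound for current context windows
USD_PER_MILLION_TOKENS = 5.0      # assumed input price
ROUNDS = 10                       # prompts exchanged during one task


def tokens_for(chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate a token count from a character count."""
    return round(chars / chars_per_token)


def cumulative_tokens(tokens_per_round: int, rounds: int) -> int:
    """Without caching, the full context is sent again on every round."""
    return tokens_per_round * rounds


def cost_usd(tokens: int, usd_per_million: float = USD_PER_MILLION_TOKENS) -> float:
    """Convert a token count into a price."""
    return tokens / 1_000_000 * usd_per_million


codebase_tokens = tokens_for(CODEBASE_CHARS)
total_tokens = cumulative_tokens(codebase_tokens, ROUNDS)

print(f"codebase: {codebase_tokens:,} tokens")
print(f"over the context limit by: {codebase_tokens / CONTEXT_LIMIT_TOKENS:.0f}x")
print(f"{ROUNDS} rounds: {total_tokens:,} tokens, about ${cost_usd(total_tokens):,.0f}")
```

Changing any of the constants shows how sensitive the final number is: the cost scales linearly with the number of rounds and with the price per token.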
You can soften this. Caching the tokens that repeat between turns might bring the bill down to the equivalent of about 200 million tokens, but holding the context window itself still demands around 80 TB of memory for the KV cache, assuming roughly 0.8 MB per token:
100,000,000 tokens × 0.8 MB per token ≈ 80 TB of KV cache
So the requirement is real: a 100-million-token context window, 80 TB of memory. The technology isn’t there, and given the finite nature of energy, it may never get there in our lifetime.
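The memory estimate is the same kind of multiplication. In this small sketch the 0.8 MB per token figure is taken from the text as an assumed input; the real value depends on the model’s layer count, attention layout, and numeric precision.

```python
# KV cache size estimate: every token held in context keeps its attention
# keys and values in memory. MB_PER_TOKEN is the assumed figure from the text.

MB_PER_TOKEN = 0.8
MB_PER_TB = 1_000_000  # decimal units: 1 TB = 1,000,000 MB


def kv_cache_tb(context_tokens: int, mb_per_token: float = MB_PER_TOKEN) -> float:
    """Return the approximate KV cache size in terabytes."""
    return context_tokens * mb_per_token / MB_PER_TB


for tokens in (1_000_000, 100_000_000):
    print(f"{tokens:>11,} tokens -> {kv_cache_tb(tokens):g} TB")
```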
The real problem: no memory
Approach the problem less naively, though, and a different one comes up: LLMs don’t have memory. They are stateless brains. To give them memory, there are currently only two options.
The first is the obvious one: train an LLM on Magento. There are already capable open-weight models like Google’s Gemma, Qwen, and DeepSeek. The problem is the data. Source code alone isn’t enough. You’d also need the descriptions, the documentation, the reasoning behind the architecture, a full account of the codebase, and as many real customizations as possible paired with their business requirements. In practice, whoever trained such a model would have to reconstruct a decade of an agency’s Magento work: commits and their source code, the tickets that define each business requirement, even meeting transcripts to fill the gaps. That data doesn’t exist.
I can’t see inside every company, but I’m fairly sure that in most organizations the chain of decisions is never tracked or recorded. You could imagine larger investors acquiring small agencies to harvest their data, then building business-specific LLMs where code and business meet. But doing that just for Magento would likely be a poor ROI, and doing it across many platforms at once risks producing noise. So today this option effectively doesn’t exist. It’s too financially risky.
The second option, the one I took, is the safer one. If a limited budget means you can’t train your own high-quality model and can’t feed it a billion tokens, then you build the memory yourself. Something that doesn’t forget.
Building something that doesn’t forget
Pairing a reasoning system with an explicit, deterministic store of knowledge isn’t a new idea. The first AIs were symbolic AIs: they reasoned by searching over a large structure of deterministic data and rules, an approach known as symbolic reasoning. One of the most famous examples was IBM’s Deep Blue.
From what I’ve observed, the modern AI-native companies like OpenAI and Anthropic argue the road to superior intelligence runs through bigger and bigger LLMs trained on more and more data. Other players like Google seem to take a different path. Their models often feel less smart, but they connect to more things. Watch how they work and you’ll see most answers come from dynamically fetched resources: documents, emails, websites. They call this grounding: tying the model to exact sources so it gives a less creative but better-grounded response.
Grounding is crucial in engineering. Creativity, which often comes from a degree of hallucination, has its place, but in software, precision is what matters.
AI agents like Claude, Codex, and Antigravity reach for grounding by indexing a project’s directory structure, searching it with regex and keyword matching, then reading file after file by brute force, burning a lot of tokens and still missing things. It’s a form of RAG (Retrieval-Augmented Generation), but not the right one, in my opinion.
The source code is a universe
Consider what the code actually is. A class implements interfaces, extends a parent, uses traits. Its methods declare parameters and return types that point to still more classes. Packages depend on packages. It isn’t a pile of files. It’s a dense web of relationships. And when your data is already a web of nodes and connections, a graph isn’t just one way to store it; it’s the shape the data already has. So a graph database was the natural choice, and retrieving from it is what people call GraphRAG. That’s where I started.
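To make that shape concrete, here is a minimal in-memory sketch of code as a graph. It is an illustration only, not Magentic’s actual schema: the relation names (EXTENDS, IMPLEMENTS, and so on) and every class and package name are hypothetical, and a real graph database would store the same triples with indexes and a query language on top.

```python
from collections import defaultdict


class CodeGraph:
    """Typed, directed edges between code symbols (hypothetical schema)."""

    def __init__(self) -> None:
        # (source, relation) -> list of targets
        self._out = defaultdict(list)

    def add(self, source: str, relation: str, target: str) -> None:
        self._out[(source, relation)].append(target)

    def related(self, source: str, relation: str) -> list:
        """Follow one kind of edge from one node."""
        return list(self._out.get((source, relation), []))

    def ancestors(self, cls: str) -> list:
        """Walk EXTENDS edges upward, guarding against accidental cycles."""
        chain, current = [], cls
        while True:
            parents = self.related(current, "EXTENDS")
            if not parents or parents[0] in chain or parents[0] == cls:
                return chain
            current = parents[0]
            chain.append(current)


# Hypothetical symbols, one edge per relationship named in the text.
graph = CodeGraph()
graph.add("Cart", "EXTENDS", "BaseModel")
graph.add("BaseModel", "EXTENDS", "BaseObject")
graph.add("Cart", "IMPLEMENTS", "CartInterface")
graph.add("Cart", "USES_TRAIT", "LoggerTrait")
graph.add("Cart", "HAS_METHOD", "Cart::addItem")
graph.add("Cart::addItem", "ACCEPTS", "ItemInterface")
graph.add("Cart::addItem", "RETURNS", "CartInterface")
graph.add("acme/checkout", "DEPENDS_ON", "acme/framework")

print(graph.ancestors("Cart"))
print(graph.related("Cart::addItem", "ACCEPTS"))
```

Each question about the code (what does this class extend, what does this method accept) becomes a lookup along one or two edges rather than a search through file contents.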
The goal was a cross-platform desktop app. The user installs it and connects it to their AI agent as an MCP server. From then on the agent treats it as a tool: whenever it needs to browse the code, instead of blindly grepping and reading files, it queries the graph. Because the graph is a complete map of the source, it becomes a world the LLM can navigate, moving from a class to its interface, to the package that declares it, to the types its methods consume, without loading a single raw file into the prompt. And why a desktop app? Because visualizing that world was part of the vision from the start. That was the concept.
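A rough sketch of what such a tool call could look like from the agent’s side follows. It does not use any MCP SDK: the request shape, the `navigate` and `handle_tool_call` functions, and all symbol names are assumptions made up for illustration. The point is only that the agent sends a start node and a path of relation types, and gets back a handful of node names instead of file contents.

```python
import json

# Hypothetical graph: (source, relation) -> targets.
GRAPH = {
    ("Cart", "IMPLEMENTS"): ["CartInterface"],
    ("CartInterface", "DECLARED_IN"): ["acme/checkout-api"],
    ("Cart", "HAS_METHOD"): ["Cart::addItem"],
    ("Cart::addItem", "ACCEPTS"): ["ItemInterface"],
    ("Cart::addItem", "RETURNS"): ["CartInterface"],
}


def navigate(start: str, path: list[str]) -> list[str]:
    """Follow a sequence of relation types and return the nodes reached last."""
    frontier = [start]
    for relation in path:
        next_frontier = []
        for node in frontier:
            for target in GRAPH.get((node, relation), []):
                if target not in next_frontier:  # keep order, drop duplicates
                    next_frontier.append(target)
        frontier = next_frontier
    return frontier


def handle_tool_call(request_json: str) -> str:
    """Answer one tool request with a compact JSON result."""
    request = json.loads(request_json)
    nodes = navigate(request["start"], request["path"])
    return json.dumps({"start": request["start"], "path": request["path"], "nodes": nodes})


# From a class to its interface, to the package that declares it.
print(handle_tool_call('{"start": "Cart", "path": ["IMPLEMENTS", "DECLARED_IN"]}'))
# From a class to the types its methods consume.
print(handle_tool_call('{"start": "Cart", "path": ["HAS_METHOD", "ACCEPTS"]}'))
```

The response is a few dozen characters of JSON, which is the whole appeal: the agent pays for the answer, not for the files that contain it.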
Engineering hits and misses
I picked Electron, the Chromium-based desktop framework that ships with Node.js, together with React. The first wall came almost immediately: parsing PHP. There’s no good PHP parser in JavaScript or TypeScript; the mature, correct one (nikic/php-parser) is written in PHP itself. So the plan was to statically compile a PHP binary and ship it inside the app, letting real PHP parse the PHP source. For the graph database I picked a modern embedded engine, LadyBug. Embedded means it runs inside the app’s own process, with no separate server for the user to install or manage.
As I neared the end of the prototype, the problems shifted from code to distribution. Shipping a bundled binary means signing it, both the PHP executable and the app itself, and code signing is effectively mandatory on both Windows and macOS. That means paid certificates renewed annually, plus realistically three machines, one per target platform (Windows, macOS, and Linux), to test on before every release. For a proof of concept, that was too much: distribution would have cost more time than the product.
So I leaned toward a different kind of distribution: a plain PHP package that did the same thing. But could it? The dealbreaker was immediate: there’s no usable embedded graph database for PHP. None. One workaround was a split package: a Node.js CLI for the MCP server and graph connection, a PHP component for parsing, and a React UI for visualization. But at that point the “package” was really a multi-language stack pretending to be a package. It made far more sense to stop fighting it and treat the whole thing as a dockerized server-side project.
As a Docker project, the sub-packages became services, and Neo4j replaced the embedded database. Wiring it together took a full server-side system: frontend, backend, worker, PostgreSQL, a queue. My simple desktop tool was now a small distributed system, with plenty of room to grow.
Same destination, different roads
I’m not the only one who arrived at the “I need to give my AI a brain” conclusion. More and more engineers and scientists are building memory for their LLMs, including Andrej Karpathy, who keeps an “LLM wiki”: plain Markdown notes that link to one another, forming a graph he browses in Obsidian so his AI can navigate his own knowledge. It’s an elegant idea, but a notes-and-links approach like that is too basic for mapping complex code.
Proof of concept: Magentic
The proof-of-concept version of Magentic connects to any AI agent (Anthropic’s Claude, OpenAI’s Codex, Google’s Antigravity, and others) through the standard Model Context Protocol. It scans the source code and builds a world from it inside the graph database. Then, when the agent gets a task from the user, it can look into that graph and navigate the codebase efficiently, without feeding file after file into the prompt. Each natural-language question ends up as a graph result the agent understands, and Magentic’s UI can show that result as a visualized set of nodes and connections.
I named it Magentic because it currently targets Magento, MageOS, and Adobe Commerce, perfect candidates given their enterprise-grade size and complexity. But with this architecture, the door is open to extend it: GraphQL mapping, DI mapping, plugin mapping, and more, and maybe new platforms too.
There’s plenty to build to make AI workflows efficient long before we ever reach terabyte-memory laptops and hundred-million-token context windows.
Implementation of the MVP: GitHub.com/DavidBelicza/Magento-MCP-Server







