lorismascio17

Posted on Jul 1

Why vector-only RAG is weak for coding agents

#mcp #python #ai #opensource

Building Droste: a local structural + semantic code-memory engine for MCP agents

AI coding agents are getting better, but their memory layer is still often too shallow.

Most agent workflows still depend on one of two things:

blind file reads;
vector search over chunks.

Both are useful, but both miss something important in codebases: causal structure.

A function may be relevant to a bug even if it shares no words with the user’s query. A SQL function may be called through an RPC string from TypeScript or Dart. A file may matter because it is a caller, callee, handler, migration, or dependency, not because its text looks similar.

That is the reason I started building Droste.

GitHub:
https://github.com/lorismascio17/droste-memory

PyPI:
https://pypi.org/project/droste-memory/

## What Droste is

Droste is a local code-memory engine for AI coding agents.

It indexes a repository into a hybrid structural + semantic graph:

folders;
files;
symbols;
functions;
classes;
methods;
caller/callee links;
import/dependency edges;
cross-language links;
local embeddings.

Then it exposes that memory through:

a CLI;
an MCP server;
a visual graph viewer.

The goal is simple: give an agent the causal slice of code it needs, instead of forcing it to repeatedly scan files or rely only on semantic similarity.

## Why vector search alone is not enough

Vector search is good at finding code that sounds similar to a query.

But code often matters because of relationships, not wording.

For example:

a controller calls a service;
a service calls a repository;
a frontend invokes an RPC function;
an edge function touches a database table;
a migration defines something used indirectly from app code;
a test reveals behavior not obvious in the implementation.

A vector index may miss those links if the words are different.

A graph can preserve them.

That is the main design idea behind Droste: combine semantic retrieval with structural retrieval.
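A toy illustration of that idea, not Droste's actual code: the query shares words with only one function, yet the function it calls is the one that matters. Every name here (`GRAPH`, `DOCS`, `lexical_hits`, `with_neighbours`) is hypothetical, and plain word overlap stands in for embedding similarity.

```python
# Toy call graph: caller -> callees. All names are made up for illustration.
GRAPH = {
    "checkout_controller": ["apply_discount"],
    "apply_discount": ["load_coupon_row"],
    "load_coupon_row": [],
}

# Short descriptions standing in for the text an embedding model would see.
DOCS = {
    "checkout_controller": "handle checkout request for cart",
    "apply_discount": "compute price reduction",
    "load_coupon_row": "select from coupons table",
}


def lexical_hits(query: str) -> list[str]:
    """Stand-in for vector search: rank nodes by words shared with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), name) for name, text in DOCS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


def with_neighbours(seeds: list[str]) -> list[str]:
    """Add direct callers and callees of each seed, keeping first-seen order."""
    result = list(seeds)
    for seed in seeds:
        callees = GRAPH.get(seed, [])
        callers = [node for node, outs in GRAPH.items() if seed in outs]
        for node in callees + callers:
            if node not in result:
                result.append(node)
    return result


seeds = lexical_hits("checkout flow")
print(seeds)                   # similarity alone
print(with_neighbours(seeds))  # similarity plus one structural hop
```

The similarity step only finds `checkout_controller`; the structural hop is what brings in `apply_discount`, which shares no words with the query.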

## Local-first by design

Droste is local-first.

No cloud database is required. No account is required. No API key is required.

Install:

python -m pip install --upgrade droste-memory

Index a repository:

droste index .

Ask for context:

droste context "checkout flow" --budget 1500

Run it as an MCP server:

droste mcp

An AI agent can then call Droste as a local code-memory backend instead of doing blind file reads.

## MCP usage

For MCP clients, the basic configuration is:

{
  "mcpServers": {
    "droste": {
      "command": "droste",
      "args": ["mcp"]
    }
  }
}

For serious multi-repo work, Droste also supports using one database per project:

{
  "mcpServers": {
    "droste": {
      "command": "droste",
      "args": [
        "--db",
        "/absolute/path/to/droste_memory_db.json",
        "mcp"
      ]
    }
  }
}

That avoids mixing context from different repositories.

## What happens internally

Droste builds a local graph of the project.

At a high level:

It extracts symbols from source files.
It maps functions, classes, methods, files, and folders.
It builds dependency edges where possible.
It computes local embeddings.
It stores the graph in local sharded JSON files.
It retrieves context using both semantic similarity and graph relationships.
It packs the result into a token budget for an LLM.

The important part is that retrieval is not only “which chunks are similar?”

It also asks:

what calls this?
what does this call?
what file owns this symbol?
what related nodes are connected by the graph?
what cross-language links exist?

## Sharded local storage

Droste does not store the whole graph as one giant JSON file.

It uses sharded local storage, with one shard per source path. This keeps incremental saves faster and avoids rewriting the entire database after every change.

It also uses a lightweight seqlock-style consistency model so a reader does not assemble a torn snapshot while another process is writing shards.

This matters because the intended workflow is live:

the engine indexes;
the visualizer reads;
the MCP server answers agent requests;
the developer keeps coding.

## Visual graph

Droste also includes a visualizer.

The idea is to represent the codebase as a zoomable graph: project, folders, files, symbols, and causal edges.

It is not meant to replace the CLI or MCP server. It is meant to make coupling and blast radius visible.

In practice, I wanted something closer to a “code universe” than a flat file tree.

## Current status

Droste is still early, but usable.

Current features include:

Python CLI;
MCP server;
local indexing;
Tree-sitter based symbol extraction;
semantic embeddings through local models;
sharded storage;
project-root isolation;
query-aware ranking;
token-budgeted context packing;
visual graph viewer.

Latest version:

python -m pip install --upgrade droste-memory

## Example commands

droste index .
droste status
droste context "authentication flow" --budget 2000
droste zoom "SomeFunction"
droste view
droste mcp

## What I am looking for feedback on

I would especially appreciate feedback from people building or using AI coding agents.

Useful feedback would be:

Does the MCP interface make sense?
Is the install flow clean?
Does retrieval feel better than normal file search in real projects?
Are the returned context slices useful to an LLM?
What graph relationships would matter most in your stack?
What would make this more useful in day-to-day coding?

## Links

GitHub:
https://github.com/lorismascio17/droste-memory

PyPI:
https://pypi.org/project/droste-memory/

Install:

python -m pip install --upgrade droste-memory

Droste is open source and MIT licensed.

Top comments (9)

Max Quimby • Jul 2

Strongly agree on the core premise — for coding agents, relevance is a graph property, not a similarity score, and "blind file read vs vector chunk" is a false binary a lot of tools are still stuck in. The place this gets hard, in my experience, is exactly the edges you list as cross-language / RPC links. Caller→callee within one language is tractable with a decent AST/LSP pass, but the frontend-invokes-RPC-by-string and migration-defines-something-used-indirectly cases are dynamic — the edge only exists at runtime or through a string key, so pure static analysis misses it. How does Droste resolve those? Heuristic string-matching on the RPC name, or do you fall back to embeddings to bridge where the graph has a hole? The other thing I'd be curious about is merge/ranking: once you have structural neighbors and semantic hits, how do you fit them into the --budget 1500 window — fixed ratio, or does graph distance weight the cut? That budgeting step is usually where hybrid retrieval lives or dies.

lorismascio17 • Jul 2

Great questions, because these are exactly the two places where the design gets opinionated.

For cross-language edges, Droste uses deterministic string keys and never embeddings. The frontend-invokes-RPC-by-string case is handled as a static contract on the string literal itself. On the caller side, Droste extracts the key from standard idioms like .rpc('fn_name'), .functions.invoke('edge-fn'), /functions/v1/x routes, or .from('table'). It also parses FROM, JOIN, and INTO inside SQL strings embedded in any language. On the database side, statements like CREATE FUNCTION or CREATE TABLE in your migrations become first-class definitions. An edge is created only when the captured key exactly matches a definition indexed in the repo, and the edge carries the string key as its label so an agent can audit why it exists. There is also a guarded generic bridge for handler or channel name patterns: it looks for a string literal that exactly names a symbol defined in another language, and it is stopword-filtered and length-gated.
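To make the exact-match rule concrete, here is a minimal sketch of string-key linking for the `.rpc('fn_name')` idiom only. The regexes, the function `link_rpc_edges`, and the sample file names are assumptions for illustration and are not taken from the Droste codebase.

```python
import re

# Caller side: .rpc('fn_name') with a plain string literal.
RPC_CALL = re.compile(r"""\.rpc\(\s*['"]([A-Za-z_][A-Za-z0-9_]*)['"]""")
# Definition side: CREATE [OR REPLACE] FUNCTION [schema.]name in SQL migrations.
SQL_FUNC = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?FUNCTION\s+(?:\w+\.)?(\w+)", re.IGNORECASE
)


def link_rpc_edges(client_sources: dict[str, str], migrations: dict[str, str]):
    """Return (caller_file, definition_file, key) for exact key matches only."""
    definitions = {}
    for path, sql in migrations.items():
        for name in SQL_FUNC.findall(sql):
            definitions[name] = path
    edges = []
    for path, code in client_sources.items():
        for key in RPC_CALL.findall(code):
            if key in definitions:      # no fuzzy or embedding fallback
                edges.append((path, definitions[key], key))
    return edges


client = {
    "cart.ts": "await db.rpc('apply_coupon', { code })\n"
               "await db.rpc(`dyn_${name}`)",   # computed key: not captured
}
sql = {"001_init.sql": "create or replace function public.apply_coupon(code text) ..."}
print(link_rpc_edges(client, sql))
```

The template-literal call produces no edge at all, which mirrors the "miss it rather than guess it" stance on runtime-computed keys.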

Two honest limits here. First, runtime computed keys like f-strings or dynamic dispatch are invisible. That edge genuinely only exists at runtime, and I would rather miss it than guess it. Second, if your schema only lives in a dashboard and not in committed migrations, there is no target node to link to.

And to answer your question, embeddings deliberately do not patch holes in the graph. They influence seed ranking only. A similarity edge dressed up as a causal edge is the exact failure mode the whole design is trying to avoid, because an agent cannot distinguish "calls this" from "smells like this" once they share the same edge type.

As for budget merging, there is no fixed ratio, and you are totally right that this is where it lives or dies. Here is the scar tissue from building it. The pipeline works like this: a hybrid seed picks the focus, the focus renders at full fidelity, and its direct syntax-dependency callers and callees are pinned immediately after it, ahead of every secondary lexical seed. The rest are sorted by score. Every node then walks a demotion ladder from full text to a compact stub, then a one-line contract, down to a guaranteed one-line floor.
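A rough sketch of pinning plus a demotion ladder, under simplifying assumptions: three forms per node (full, stub, one line), a word count as the token estimate, and the one-line form doubling as the floor. `Node`, `cost`, and `pack` are hypothetical names, not Droste's packer.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    score: float
    full: str   # full source text
    stub: str   # compact stub
    line: str   # one-line contract, used as the floor here


def cost(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


def pack(focus: Node, neighbours: list[Node], others: list[Node], budget: int):
    """Order: focus, pinned causal neighbours, then the rest by score.
    Each node takes the richest form that fits. Pinned nodes always get
    at least their one-line form; unpinned nodes may be dropped."""
    ordered = [(focus, True)] + [(n, True) for n in neighbours]
    ordered += [(n, False) for n in sorted(others, key=lambda n: -n.score)]
    # Reserve the one-line floor for every pinned node up front.
    reserve = sum(cost(n.line) for n, pinned in ordered if pinned)
    out, used = [], 0
    for node, pinned in ordered:
        if pinned:
            reserve -= cost(node.line)  # this node now spends its own reserve
        for form in (node.full, node.stub, node.line):
            if used + cost(form) + reserve <= budget:
                out.append((node.name, form))
                used += cost(form)
                break
        else:
            if pinned:  # floor is guaranteed even if the budget is exceeded
                out.append((node.name, node.line))
                used += cost(node.line)
    return out
```

Reserving the floor before filling is what stops a high-scoring lexical match from consuming the tokens a true caller needs for even its one-line form.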

The pinning and floor exist because I measured the naive version. At a 1500-token budget, plain score sorting let keyword lookalikes crowd out the true callers, and neighbours whose compact form missed the cap were silently dropped. Self-supervised evaluation showed a neighbour recall of 0.068 and a graph lift of -0.18. My own tool was literally worse than grep at its flagship job. After pinning causal neighbours and guaranteeing the stub floor, recall rose to 0.875 and lift to +0.59 at the exact same budget. The eval script is in the repo if you want to run it on your own codebase.

Mykola Kondratiuk • Jul 10

I’d push back a bit: structural graphs work for static codebases, but dynamic dispatch and reflection break the call graph completely. Vector search degrades gracefully; graph traversal either finds something or silently misses.

Alex Shev • Jul 1

Vector-only memory is especially weak for code because the important relationships are structural, not just semantic. An agent needs to know call paths, ownership boundaries, config links, and what changed recently. Embeddings help retrieval, but they should not be the whole map.

lorismascio17 • Jul 1

Spot on, Alex! You perfectly summarized the core thesis behind Droste. Code is inherently non-linear; forcing it into a flat vector space feels like trying to read a 3D blueprint through a keyhole.

If an agent doesn't understand the actual call paths or dependency edges, it spends half its token budget just playing detective with blind grep reads.

We actually just pushed an architectural update to address exactly what you mentioned regarding call paths and context mapping. We completely overhauled the context packer with a new "Interleaving" logic. Now, even with a tight token budget (like 1500 tokens), the engine guarantees that whenever a focal node is retrieved, its immediate causal neighbors (direct callers/callees) are pinned right next to it, down-ranking purely lexical noise.

Embeddings are a great entry point to guess "intent," but the structural graph is what keeps the agent from hallucinating or losing the thread in large repositories. Thanks for the feedback!

Alex Shev • Jul 2

Yes. For code, retrieval has to preserve relationships: call paths, ownership, tests, runtime constraints, and recent changes. A vector hit without that graph can feel relevant while missing the actual dependency.

René Zander • Jul 15

The graph-versus-vector framing understates where the leverage actually is: neither should be a hard dependency of the other. Mykola's dynamic-dispatch point is exactly why. Instead of traversing the call graph and falling back to vectors, run structural retrieval and semantic retrieval as two independent ranked lists, fuse them with RRF, then pack to the token budget. A dispatch or reflection edge the static graph misses then degrades to semantic recall instead of failing hard, and a cross-language RPC string the embeddings miss still gets caught by your deterministic string-key edges. It also reframes Dipankar's scoring question: the fusion step is the scoring policy, so budget-packing becomes rank-both-fuse-truncate rather than a bespoke heuristic. I wrote up the hybrid-plus-parallel-fan-out version after pure semantic search missed 4 of 5 of my agent queries: renezander.com/blog/agentic-knowle...
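For readers unfamiliar with RRF (reciprocal rank fusion), a minimal sketch: each item scores the sum of 1 / (k + rank) across the lists it appears in, with k = 60 as the commonly used constant. The function name `rrf_fuse` and the sample rankings are illustrative only.

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Two independent rankings for the same query (made-up names).
structural = ["apply_discount", "load_coupon_row", "checkout_controller"]
semantic = ["checkout_controller", "apply_discount", "render_cart"]
print(rrf_fuse([structural, semantic]))
```

Items found by both retrievers rise to the top, while an item found by only one still appears further down instead of vanishing.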

Dipankar Sarkar • Jul 2

Agree on the premise, and I want to push on the part that gets hard after the graph exists: ranking under a token budget. Building the caller/callee edges is the easy 80%. The other 80% is that a naive BFS from the seed symbol pulls in half the repo by hop two, and now you are back to dumping context.

What made structural retrieval actually pay off for us was scoring edges by type before distance. A direct callee of the seed outranks a sibling that merely imports the same module, a test that exercises the symbol outranks a migration three hops away, and everything decays with graph distance. Then you fill the budget greedily against that score. The graph is necessary but it is the scoring policy that decides whether the slice is causal or just large.
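A small sketch of that scoring policy, with invented numbers: the edge-type weights, the decay factor, and the names `score_nodes` and `fill_budget` are assumptions for illustration rather than anyone's production values.

```python
# Hypothetical edge-type weights: higher means more causal.
EDGE_WEIGHT = {"calls": 1.0, "tests": 0.8, "imports": 0.4, "migration": 0.3}
DECAY = 0.5  # multiplier applied at every hop away from the seed


def score_nodes(graph: dict[str, list[tuple[str, str]]], seed: str, max_hops: int = 2):
    """Level-by-level walk: score = parent score * edge weight * decay."""
    scores = {seed: 1.0}
    frontier = [seed]
    for _ in range(max_hops):
        found: dict[str, float] = {}
        for node in frontier:
            for edge_type, target in graph.get(node, []):
                if target in scores:
                    continue  # already reached by a shorter path
                s = scores[node] * EDGE_WEIGHT[edge_type] * DECAY
                found[target] = max(found.get(target, 0.0), s)
        scores.update(found)
        frontier = list(found)
    return scores


def fill_budget(scores: dict[str, float], costs: dict[str, int], budget: int):
    """Greedy fill: take nodes in score order while they still fit."""
    chosen, used = [], 0
    for name in sorted(scores, key=lambda n: (-scores[n], n)):
        if used + costs[name] <= budget:
            chosen.append(name)
            used += costs[name]
    return chosen


graph = {
    "pay": [("calls", "charge_card"), ("imports", "utils"), ("tests", "test_pay")],
    "charge_card": [("migration", "001_payments")],
}
costs = {"pay": 50, "charge_card": 40, "test_pay": 30, "utils": 30, "001_payments": 20}
print(fill_budget(score_nodes(graph, "pay"), costs, budget=130))
```

With these weights the direct callee and the test fit inside the budget, while the sibling import and the two-hop migration are left out.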

Genuine question on staleness: the structural graph drifts the moment the working tree is dirty, which is exactly when an agent is mid-edit and needs it most. Does droste index reindex incrementally per changed file, or is context served against the last full pass? That gap is where these engines quietly start lying.

lorismascio17 • Jul 2

You are totally right that the graph is necessary but it is the scoring policy that decides whether the slice is causal or just large. I can co-sign that with actual data, mostly because I learned it the embarrassing way. At the default token budget, my own benchmark showed that the naive policy was literally worse than lexical search. The neighbour recall was stuck at 0.068 and the graph lift was -0.18. The edges existed in the graph, but plain score sorting let keyword lookalikes eat up the entire budget, so the true callers vanished below the cliff. After pinning the direct callers and callees of the seed ahead of secondary matches, and adding a guaranteed one-line floor per causal edge, the numbers jumped to 0.875 recall and +0.59 lift at the exact same budget. The evaluation script is self-supervised, using the AST caller and callee sets as ground truth, and it ships directly in the repo if you want to check it out.

On your specific policy, I sidestep the hop-two explosion by not walking a frontier at all. Expansion is strictly limited to the one-hop syntax-dependency neighbourhood of each hybrid-ranked seed, so a second hop only enters the context if it independently triggers as a seed. Edge-type priority does exist, though in a coarser form than yours. Syntax-dependency edges outrank regex or semantic edges, which only ever enter at a reduced score. Crucially, semantic similarity can never create an edge on its own. Your finer distinctions, like a test that exercises the symbol versus a migration three hops away, are not type-distinguished yet. I think you are completely right that a source-role prior, prioritizing core code over tests and benchmarks, is worth adding. It is definitely on the roadmap.

Staleness is a very fair challenge, so here is the honest answer. Today the context command serves the last indexed state. The mitigation is that a warm index pass is deliberately cheap. A per-file content-hash registry transplants unchanged files with their embeddings intact, so it only re-parses what actually changed. Dirty tracking uses blake2b so only the touched shards get rewritten. A warm pass on a two-thousand-node repo takes around 0.6 seconds, and persistence takes about 50 milliseconds. Because of this, the working pattern for an agent is just to re-index after an edit burst, which is cheap enough to do on every single turn.
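The content-hash idea can be sketched in a few lines with the standard library's `hashlib.blake2b`. The registry shape, the `parse` callback, and the function names below are assumptions made for the example, not Droste's internals.

```python
import hashlib


def content_hash(text: str) -> str:
    """Short blake2b digest of a file's content."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()


def incremental_pass(files: dict[str, str], registry: dict[str, str], parse):
    """Re-parse only files whose content hash changed since the last pass.
    Returns (new_registry, reparsed_paths). Deleted files drop out of the
    new registry because it is rebuilt from the current file set."""
    new_registry, reparsed = {}, []
    for path, text in files.items():
        digest = content_hash(text)
        new_registry[path] = digest
        if registry.get(path) != digest:
            parse(path, text)       # the expensive step: parsing, embedding
            reparsed.append(path)
    return new_registry, reparsed


# Cold pass parses everything; the warm pass only touches the edited file.
registry, first = incremental_pass({"a.py": "x = 1", "b.py": "y = 2"}, {}, lambda p, t: None)
registry, second = incremental_pass({"a.py": "x = 1", "b.py": "y = 3"}, registry, lambda p, t: None)
print(first, second)
```

Hashing every file is still linear in repository size, but it is far cheaper than re-parsing and re-embedding, which is why a warm pass can stay fast.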

There is already a per-file incremental path ready. It uses a polling watcher with about one second of latency that atomically splices one file's subtree into the live graph, preserving incoming cross-file edges and filtering dangling ones at query time. However, it is not wired into the MCP server by default yet. Between index passes your framing is correct, and I would rather say so than quietly lie. Wiring that file watcher into the MCP server is the next reliability milestone. For now, cross-process freshness is already mtime-guarded, meaning a reader engine will reload itself the moment another process writes to the database.
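An mtime guard of the kind mentioned above might look roughly like this. It is a single-file sketch under assumptions: the class `ReloadingReader`, its `reloads` counter, and the JSON layout are hypothetical, and a sharded store would need to check more than one file.

```python
import json
import os
import tempfile


class ReloadingReader:
    """Caches a JSON file and reloads it only when its mtime changes."""

    def __init__(self, path: str):
        self.path = path
        self.reloads = 0
        self._mtime = None
        self._data = None

    def get(self) -> dict:
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:        # another process wrote the file
            with open(self.path, encoding="utf-8") as fh:
                self._data = json.load(fh)
            self._mtime = mtime
            self.reloads += 1
        return self._data


# Demo: the second get() is served from the cache without touching the file body.
fd, db_path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(db_path, "w", encoding="utf-8") as fh:
    json.dump({"nodes": 1}, fh)
reader = ReloadingReader(db_path)
reader.get()
reader.get()
print(reader.reloads)
```

The check costs one `stat` call per query. Its weak spot is timestamp granularity: two writes within the filesystem's mtime resolution can look identical to the reader.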