DEV Community

inspiringsource

A deterministic alternative to embedding-based repo understanding

Hey everyone, I'm Avi, a CS student at FHNW in Switzerland.

I’ve been a bit frustrated with how AI coding tools handle larger codebases. Most of them rely on embeddings + prompting, which works well for fuzzy retrieval but can feel inconsistent, hard to reason about, and token-heavy.

So I wanted to try something more “boring” and predictable.

I built a small prototype called ai-context-map. It uses static analysis to build a structural graph of a repo:

  • files
  • imports / dependencies
  • some basic symbols (mostly Python for now)

The idea is to precompute a map of the repo so an AI (or even a human) doesn’t have to rediscover structure every time.

No ML, no embeddings, no API calls. Just parsing + graph stuff.
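To give a feel for the "just parsing" part: the import edges of the graph can be pulled straight from Python's standard ast module. This is only a sketch of the approach, not the actual ai-context-map implementation; extract_imports is a hypothetical helper name.

```python
import ast

def extract_imports(source: str) -> list[str]:
    """Collect the module names a Python source file imports."""
    tree = ast.parse(source)
    modules = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return modules

print(extract_imports("import os\nfrom src.services import auth"))
# ['os', 'src.services']
```

Running this over every file and treating "module X imports module Y" as a directed edge is enough to get a dependency graph with zero ML involved.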


It outputs something like a .ai/context.yaml file. A very simplified example:

entry_points:
  - path: src/main.py

core_modules:
  - src/services/auth.py

task_routes:
  api_change:
    - src/api/routes.py
    - src/services/auth.py

anchors:
  - symbol: login_user
    file: src/services/auth.py
    line: 42
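Here's roughly how a consumer (an agent, a script) might use such a map: route a task type to its precomputed file list instead of scanning the whole repo. This is a hypothetical consumer, with the dict mirroring the YAML above; files_for_task is an illustrative name, not part of the tool.

```python
# Context map as parsed from .ai/context.yaml (shape mirrors the example above).
context = {
    "entry_points": [{"path": "src/main.py"}],
    "core_modules": ["src/services/auth.py"],
    "task_routes": {
        "api_change": ["src/api/routes.py", "src/services/auth.py"],
    },
    "anchors": [{"symbol": "login_user", "file": "src/services/auth.py", "line": 42}],
}

def files_for_task(ctx: dict, task: str) -> list[str]:
    """Return the precomputed file list for a task, falling back to entry points."""
    routed = ctx.get("task_routes", {}).get(task)
    if routed:
        return routed
    return [e["path"] for e in ctx.get("entry_points", [])]

print(files_for_task(context, "api_change"))
# ['src/api/routes.py', 'src/services/auth.py']
```

The lookup is fully deterministic: the same task name always yields the same files, which is the whole point compared to a similarity search.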

What I'm trying to figure out is whether this direction even makes sense.

  • Where does a purely static / graph-based approach fall apart compared to embeddings?
  • Are there tools doing something similar already that I should look into?
  • If you work with larger repos: would something deterministic like this actually help, or is vector search + big context already “good enough”?

One thing I'm curious about:

Could something like this reduce how many files an AI needs to look at, and therefore reduce token usage?

Repo:
https://github.com/inspiringsource/ai-context-map

Would really appreciate feedback (also “this is useless” is fine)

Top comments (1)

inspiringsource

The core idea is to help AI agents find the right files first, instead of scanning the repo blindly.

In theory that could reduce how many files need to be read, which might lower token usage and make edits more reliable.

Not sure how much this actually holds up in practice.