Dhevenddra

Posted on Jun 27 • Originally published at github.com

Grounding AI coding agents with a confidence-tagged code knowledge graph

#ai #opensource #mcp #devtools

Disclosure: this is my own open-source project (forensic-deepdive, Apache-2.0). I'm sharing it here because the dev.to crowd tends to have sharp opinions on agent tooling and I want the critique.

Most "repo context" tooling for AI agents is retrieval: embed the files, fetch the chunks that look similar to the prompt, hope the model reasons over them. That's a fine baseline, but it answers "what text looks relevant," not "what breaks if I change this," "which files are load-bearing," or "who owns this and is the bus factor 1." Those are graph and git-history questions.

Here's how I built a tool that answers them, what it gets right, and, explicitly, what it doesn't yet do.

The problem: retrieval isn't grounding

The current wave of AI coding tools has largely settled on retrieval-augmented context: embed the repository, fetch the chunks most similar to the prompt, and let the model reason over them. It's useful and it's the right baseline. But it has a structural ceiling.

Similarity retrieval answers "what text looks relevant?" It does not answer the questions a competent engineer answers reflexively:

If I change this function, what actually breaks, transitively, across files?
Which 20 files out of 2,000 are load-bearing?
Who has historically owned this module, and is the bus factor 1?
Does this frontend call resolve to a backend handler, and which one?

Those are structural and historical questions. You answer them with a graph and with git history, not with cosine similarity. forensic-deepdive is an open-source (Apache-2.0) tool built around exactly that premise.

What it produces

Point it at a repository and it emits three coordinated outputs:

A persistent embedded knowledge graph (<repo>/.deepdive/graph.lbug).
An MCP server exposing 9 composite tools to any MCP-aware agent.
Five durable markdown artifacts as a human-readable projection.

It's built on tree-sitter (parsing across 9 languages), a PageRank-style repo-map for centrality, an embedded graph database, and git log for the historical layer. Extraction runs entirely locally, with zero LLM calls, no network access, and no API keys required. Cloud and semantic features are strictly opt-in.
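To make "PageRank-style centrality" concrete, here is a minimal sketch of textbook PageRank by power iteration over a file-level import graph. It is a stand-in for illustration, not the tool's actual ranking code; the function name, the damping factor, and the sample file names are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    edges: (importer, imported) pairs. Rank flows from the importer to the
    file it depends on, so heavily imported files end up with high scores.
    """
    nodes = sorted({name for edge in edges for name in edge})
    out = {name: [] for name in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {name: 1 / len(nodes) for name in nodes}
    for _ in range(iterations):
        # Files with no outgoing imports spread their rank evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {
            n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


# Hypothetical import graph: three files import core.py, and a.py also imports b.py.
ranks = pagerank([("a.py", "core.py"), ("b.py", "core.py"),
                  ("c.py", "core.py"), ("a.py", "b.py")])
load_bearing = sorted(ranks, key=ranks.get, reverse=True)
```

Sorting by score is what turns "2,000 files" into a short list of candidates worth reading first.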

The graph schema

Node types: File, Symbol, Module, Commit, Author, Endpoint, DbTable.

Edge types: DEFINES, MEMBER_OF, IMPORTS, CALLS, EXTENDS, IMPLEMENTS, TOUCHED_BY_COMMIT, AUTHORED_BY, CO_CHANGES_WITH, and the cross-boundary set: HANDLES, CALLS_ENDPOINT, ROUTES_TO, INJECTS, PERSISTS_TO.

The structural edges (CALLS/IMPORTS/EXTENDS) come from AST analysis. The historical edges (TOUCHED_BY_COMMIT/AUTHORED_BY/CO_CHANGES_WITH) come from git. The cross-boundary edges come from protocol-specific extractors. They live in one graph, so an agent can ask a structural question and a historical question in the same breath.
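As a rough illustration of why one graph matters, the sketch below answers "who has touched the files that call this symbol?" by walking a structural edge and then historical edges in one pass. The in-memory triples, the edge directions, and every name in the sample data are assumptions made for the example; the real tool stores this in an embedded graph database.

```python
from collections import defaultdict

# Hypothetical (source, edge_type, target) triples. Edge directions are assumed.
EDGES = [
    ("api.py", "DEFINES", "handle_order"),
    ("billing.py", "DEFINES", "charge"),
    ("handle_order", "CALLS", "charge"),
    ("api.py", "TOUCHED_BY_COMMIT", "c1"),
    ("api.py", "TOUCHED_BY_COMMIT", "c2"),
    ("billing.py", "TOUCHED_BY_COMMIT", "c3"),
    ("c1", "AUTHORED_BY", "alice"),
    ("c2", "AUTHORED_BY", "bob"),
    ("c3", "AUTHORED_BY", "carol"),
]


def index(edges):
    """Build outgoing and incoming adjacency maps keyed by (node, edge_type)."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for src, kind, dst in edges:
        outgoing[(src, kind)].append(dst)
        incoming[(dst, kind)].append(src)
    return outgoing, incoming


def authors_of_callers(symbol, edges=EDGES):
    """Structural hop (CALLS, DEFINES) followed by historical hops (commits, authors)."""
    outgoing, incoming = index(edges)
    authors = set()
    for caller in incoming[(symbol, "CALLS")]:
        for file in incoming[(caller, "DEFINES")]:
            for commit in outgoing[(file, "TOUCHED_BY_COMMIT")]:
                authors.update(outgoing[(commit, "AUTHORED_BY")])
    return sorted(authors)


people_to_ask = authors_of_callers("charge")
```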

The design decision that matters most: honesty

The failure mode of a code-graph tool is silent confidence. If the graph asserts a CALLS edge that's really just two same-named symbols in different files, an agent will trust it and "fix" code that never needed touching. High recall with hidden false positives is actively dangerous in an autonomous loop.

So every edge and every emitted claim carries a confidence tag:

EXTRACTED: deterministic from the AST or git log. A fact.
INFERRED: a heuristic resolved cleanly (import-graph walk, receiver-type inference, single same-name candidate). High-trust but derived.
AMBIGUOUS: multiple candidates; the resolver couldn't disambiguate, so it surfaces every candidate rather than guessing.

This shows up everywhere. HOTPATHS carries a per-row confidence-mix column, so you can tell a symbol that resolved cleanly (mostly EXTRACTED/INFERRED) from one drowning in same-name collisions (AMBIGUOUS). The tool tells you how much to trust it, per claim.
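Here is a minimal sketch of what tagging and a per-symbol confidence mix could look like. The class and function names are hypothetical, and the resolver covers only the "same-name candidate" case described above; EXTRACTED edges would be assigned by the AST and git extractors, which are not shown.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    EXTRACTED = "EXTRACTED"
    INFERRED = "INFERRED"
    AMBIGUOUS = "AMBIGUOUS"


@dataclass(frozen=True)
class Edge:
    src: str
    kind: str
    dst: str
    confidence: Confidence


def resolve_call(caller, candidates):
    """Same-name resolution: a single candidate is INFERRED; with several,
    every candidate is kept and tagged AMBIGUOUS instead of guessing one."""
    tag = Confidence.INFERRED if len(candidates) == 1 else Confidence.AMBIGUOUS
    return [Edge(caller, "CALLS", target, tag) for target in candidates]


def confidence_mix(edges):
    """Fraction of edges per tag, e.g. for a per-row column in a report."""
    counts = Counter(edge.confidence for edge in edges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag.name: counts[tag] / total for tag in Confidence}


# Hypothetical symbols: one clean resolution, one same-name collision.
edges = resolve_call("app.run", ["db.save"]) + resolve_call("app.run", ["x.load", "y.load"])
mix = confidence_mix(edges)
```

The point of the design is that the collision is preserved in the data: a consumer can filter AMBIGUOUS edges out, or show them with a warning, but never mistakes them for facts.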

One abstraction for five protocols: the Endpoint keystone

Cross-boundary tracing is where most tools stop, because each protocol looks different. forensic-deepdive routes all of them through a single Endpoint join node. Five protocols (HTTP, MCP tools, registry dispatch, gRPC, and messaging/AMQP) share that one node and a protocol-blind join. A frontend call resolves to its backend handler across the whole stack as a single ROUTES_TO edge.

The architectural consequence: adding a sixth protocol is a new key-builder plus provider/consumer extractors. It never touches the trace, emit, or serve layers. The surfacing layer is protocol-blind by design. trace(symbol) walks frontend call -> CALLS_ENDPOINT -> Endpoint -> HANDLES -> handler -> CALLS tail generically, no matter which protocol produced the edge.
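A minimal sketch of the idea, under stated assumptions: each protocol gets a key-builder that normalizes to one string, and the join matches call sites to handlers on that key alone. The key formats, function names, and sample symbols below are invented for illustration and are not the tool's actual formats.

```python
from collections import defaultdict


# Hypothetical key-builders: one per protocol, each producing a normalized key.
def http_key(method: str, path: str) -> str:
    return f"http:{method.upper()} {path.rstrip('/') or '/'}"


def grpc_key(service: str, method: str) -> str:
    return f"grpc:{service}/{method}"


def route(consumers, providers):
    """Protocol-blind join: match call sites to handlers on the endpoint key only.

    consumers: (caller_symbol, endpoint_key) pairs from consumer extractors.
    providers: (handler_symbol, endpoint_key) pairs from provider extractors.
    Returns (caller, endpoint_key, handler) triples.
    """
    handlers = defaultdict(list)
    for handler, key in providers:
        handlers[key].append(handler)
    return [
        (caller, key, handler)
        for caller, key in consumers
        for handler in handlers.get(key, [])
    ]


consumers = [
    ("fetchUsers", http_key("get", "/users/")),
    ("UsersClient.get", grpc_key("Users", "Get")),
]
providers = [
    ("list_users", http_key("GET", "/users")),
    ("UsersServicer.Get", grpc_key("Users", "Get")),
]
routes = route(consumers, providers)
```

Note that `route` never inspects which protocol a key came from. In this sketch, supporting another protocol means adding one more key-builder and feeding its pairs in; the join itself stays unchanged.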

The historical layer agents are blind to

Git archaeology is a first-class layer, not a footnote: churn, top authors with their percentage share, bus factor, co-change clusters (files that always move together), and defect proximity (proximity to bug-fix commits). In hands-on testing this was consistently the highest-trust, highest-value layer: the fastest way to learn where risk is concentrated and who to ask. A graph tells you the shape; archaeology tells you the story.
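For a feel of how little machinery these signals need, here is a sketch that computes author share, bus factor, and co-change pairs from already-parsed commits. The definitions are assumptions chosen for the example (bus factor here is the smallest number of authors covering more than half the commits), and the input would in practice come from something like `git log --name-only`; parsing is left out.

```python
from collections import Counter
from itertools import combinations


def top_authors(commits):
    """commits: (author, files) pairs. Returns (author, percent of commits)."""
    counts = Counter(author for author, _ in commits)
    total = sum(counts.values())
    return [(author, round(100 * n / total)) for author, n in counts.most_common()]


def bus_factor(commits, threshold=0.5):
    """Smallest number of authors whose commits exceed `threshold` of the total.
    This is one common definition, assumed here for illustration."""
    counts = Counter(author for author, _ in commits)
    total = sum(counts.values())
    covered = 0
    for i, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total > threshold:
            return i
    return 0


def co_changes(commits, min_count=2):
    """File pairs that appear in the same commit at least `min_count` times."""
    pairs = Counter()
    for _, files in commits:
        pairs.update(combinations(sorted(set(files)), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}


# Hypothetical history for one module.
history = [
    ("alice", ["a.py", "b.py"]),
    ("alice", ["a.py", "b.py", "c.py"]),
    ("alice", ["a.py"]),
    ("bob", ["c.py"]),
]
```

On this sample, alice holds 75% of the commits, so the bus factor is 1, and `a.py` and `b.py` form the only co-change pair that clears the threshold.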

The 9 MCP tools

impact: blast-radius BFS over CALLS edges, depth-bucketed, confidence-filterable.
context: a one-call kitchen sink (definition + callers + callees + parents/members + recent commits + dominant author + insights).
archaeology: churn, top authors, bus factor, co-change cluster, defect proximity.
flow: DFS over CALLS with cycle detection.
query: raw Cypher, or hybrid NL retrieval (BM25 + structural signal + optional offline semantic, RRF-fused).
record_insight: persist a verified learning about a symbol.
recall_insights: recall stored insights, newest-first.
visualize: bounded Mermaid diagram of a neighborhood; edge dash style encodes confidence.
trace: cross-stack feature slice across the Endpoint join node.
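The impact tool's description (a depth-bucketed, confidence-filterable BFS over CALLS edges) can be sketched in a few lines. This is an illustrative reimplementation under assumptions, not the tool's code: the function signature, the tuple layout, and the sample call graph are hypothetical.

```python
from collections import defaultdict, deque


def impact(calls, target, max_depth=3,
           allowed=("EXTRACTED", "INFERRED", "AMBIGUOUS")):
    """Blast radius of `target`: walk CALLS edges backwards (callee to caller).

    calls: (caller, callee, confidence) triples.
    Returns {depth: [callers first reached at that depth]}.
    """
    callers = defaultdict(list)
    for caller, callee, confidence in calls:
        if confidence in allowed:  # confidence filter applied before traversal
            callers[callee].append(caller)

    buckets = defaultdict(list)
    seen = {target}  # guards against cycles and duplicate visits
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                buckets[depth + 1].append(caller)
                queue.append((caller, depth + 1))
    return dict(buckets)


# Hypothetical call graph with a cycle (a -> c -> b -> a).
calls = [
    ("b", "a", "EXTRACTED"),
    ("c", "b", "INFERRED"),
    ("d", "a", "AMBIGUOUS"),
    ("a", "c", "EXTRACTED"),
]
everything = impact(calls, "a")                         # {1: ['b', 'd'], 2: ['c']}
facts_only = impact(calls, "a", allowed=("EXTRACTED",))  # {1: ['b']}
```

Comparing the two results shows what the filter buys: the second answer is smaller but contains only edges that were read directly from the source.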

Each tool description is kept under ~200 tokens so all nine fit comfortably inside an agent's per-turn metadata budget.
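Going back to the query tool for a moment: RRF (reciprocal rank fusion) is the standard way to merge ranked lists from retrievers whose scores aren't comparable. A minimal sketch follows, assuming the commonly used constant k=60; the function name and sample rankings are hypothetical and the tool's actual parameters may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
    with ranks starting at 1. Items ranked well by several lists rise to the top."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a lexical ranker and a structural ranker.
bm25_hits = ["a", "b", "c"]
structural_hits = ["b", "c", "a"]
fused = rrf_fuse([bm25_hits, structural_hits])  # ['b', 'a', 'c']
```

Because only ranks are used, a third list (such as an optional semantic ranker) can be added or dropped without recalibrating anything.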

How an agent integrates, without you wiring anything

On extract, the tool drops write-if-absent shims into the target repo: a CLAUDE.md, an AGENTS.md, a .cursor/rules file, a .continue/rules file, a Claude Code plugin manifest, and five single-intent skills (codebase-exploring, -debugging, -impact-analysis, -refactoring, -onboarding). Each skill's description encodes when to use it, and when to route to a sibling instead.

The result: a fresh agent opening the repo auto-discovers AGENT_BRIEF.md, selects the right skill for the task unprompted, and has the 9 MCP tools available. The record_insight/recall_insights pair gives it memory that outlives the context window. (Hand-edited files are never overwritten; the shims only fill gaps.)
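The write-if-absent behaviour is simple to get right with Python's exclusive-create file mode. The sketch below is an assumption about one way to do it, not the tool's actual code; the helper name and the sample content are hypothetical.

```python
from pathlib import Path


def write_if_absent(path: Path, content: str) -> bool:
    """Create `path` with `content` only if it does not already exist.

    Mode "x" makes creation atomic: if the file is already there (for example
    a hand-edited CLAUDE.md), open() raises FileExistsError and nothing is
    overwritten. Returns True when the file was written.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        with path.open("x", encoding="utf-8") as handle:
            handle.write(content)
    except FileExistsError:
        return False
    return True
```

Using mode "x" rather than checking `path.exists()` first avoids a race between the check and the write.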

What it gets right, and what it doesn't

I ran a deliberately adversarial review: a fresh agent cross-checked every MCP answer against the actual files. The honest scorecard:

High-trust: git archaeology, exact Cypher/structural queries, and the pre-generated briefs. These were accurate and verifiable.

Verify-before-trusting: impact()/context()/flow() are excellent lead generators but optimize for recall over precision. In a dynamic-dispatch language (Dart), some CALLS edges are really "references," so a blast radius is a candidate set to verify, not a final answer. v0.8 added precision passes (distinct-caller counts, AMBIGUOUS tiering for same-name collisions, honest degraded-mode flags) that directly address these findings.

Context-dependent: NL query() and trace shine on large web/backend codebases and add little on a tiny offline app. trace now self-notes when a graph has no endpoints.

The honest one-liner from that review: a fast lead-generator and an excellent git-risk lens, not an authoritative source of truth. Used as "where should I look and what's risky," it's a clear net positive. Used with verify-the-claim discipline, it pays off.

I'm equally clear on scope: v0.8 is an assisted-analysis tool. The autonomous end-to-end question (does seeding an agent with this make it resolve real issues measurably faster?) is not yet answered. A model-free localization pilot is recorded in the repo (the static seed is a weak prior); the full measurement needs a GPU and a frontier main-agent endpoint, so it's deferred to v0.9. No autonomous-execution claims are made.

Roadmap (v0.9)

The headline is an interactive CLI: a persistent deepdive session, a query REPL that holds the graph open, a Textual TUI graph browser, and a guided onboard wizard, all layered on top of the existing command-runner (which stays for CI and agents). Also planned: the end-to-end usefulness measurement and a couple of reporting-precision fixes. The full deferred ledger is in the repo.

Try it / contribute

uv tool install forensic-deepdive
forensic extract /path/to/repo

It's also listed in the MCP Registry (https://registry.modelcontextprotocol.io, io.github.Dhevenddra/forensic-deepdive) and available as a Claude Code plugin (/plugin marketplace add Dhevenddra/forensic-deepdive).

It's Apache-2.0 and open for contributions. CONTRIBUTING.md and the architectural invariants are documented. If you work on agent context, code graphs, or developer tooling, I'd value your issues, PRs, and honest critique.

Repo: https://github.com/Dhevenddra/forensic-deepdive