Requirements and code as a Neo4j ontology: reproducible token savings and multi-agent coordination

#neo4j #claudecode #devtools #ai

AI coding agents have an expensive habit: before they write a single line, they re-read source files to work out what already exists — which modules there are, what each one provides, what's tested, and what's currently being changed. On a small repo that's tolerable. Run several agents in parallel on one codebase and it becomes both a token sink and a coordination problem: two agents start the same feature, a test gets added that nobody can map to a requirement, an architectural decision made in one session is invisible to the others.

I kept hitting this running multiple Claude Code sessions against a single codebase, and ended up solving it the way you'd expect on a graph-shaped problem: model the codebase's requirements chain as an ontology in Neo4j, and let the agents query the graph instead of re-reading the source.

This post is about the data model and the queries — and why a graph is the right tool here rather than a table.

The model

The core idea is full requirements traceability, from a user story down to the unit test that verifies a routine. Every artefact is a node; the relationships carry the meaning.

(:SysUserStory)-[:REALIZED_BY]->(:SysUseCase)
(:SysUseCase)-[:REQUIRES]->(:SysFeature)
(:SysModule)-[:PROVIDES]->(:SysFeature)
(:SysModule)-[:CONTAINS_SYMBOL]->(:SysSymbol)        // :SysSymbol / :SysEndpoint -[:IMPLEMENTS]->(:SysFeature)
(:SysTest)-[:VERIFIES]->(:SysFeature)                // tier via t.testType, or (:SysTestPackage {testCategory})-[:CONTAINS_TEST]->(:SysTest)
(:SysArchDecision)-[:ADDRESSES]->(:SysArchStd)

A SysFeature isn't a ticket — it's a capability a SysModule provides. A SysUseCase isn't a description — it's a user-visible flow that realises a story. Every test carries its V-model tier — component, integration, use-case, or e2e — and a VERIFIES edge tying it to the feature it covers. So "is this feature covered at every tier?" stops being a judgement call and becomes a reachability question.
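
To make the traceability chain concrete, here is a minimal in-memory sketch in Python rather than Neo4j. The relationship names come from the model above; the node ids (US-1, UC-1, F-login and so on) and the helper names are hypothetical, invented for illustration only.

```python
# Typed edges as (source, relationship, target) triples.
# Relationship names follow the post; every id below is made-up sample data.
EDGES = [
    ("US-1", "REALIZED_BY", "UC-1"),
    ("UC-1", "REQUIRES", "F-login"),
    ("UC-1", "REQUIRES", "F-session"),
    ("M-auth", "PROVIDES", "F-login"),
    ("M-auth", "PROVIDES", "F-session"),
    ("T-1", "VERIFIES", "F-login"),
]


def targets(source, rel):
    """Nodes reached by following outgoing `rel` edges from `source`."""
    return [t for (s, r, t) in EDGES if s == source and r == rel]


def sources(target, rel):
    """Nodes that point at `target` through a `rel` edge."""
    return [s for (s, r, t) in EDGES if t == target and r == rel]


def trace_story(story_id):
    """Walk story -> use case -> feature, then collect each feature's tests."""
    trace = {}
    for use_case in targets(story_id, "REALIZED_BY"):
        for feature in targets(use_case, "REQUIRES"):
            trace[feature] = sources(feature, "VERIFIES")
    return trace


story_trace = trace_story("US-1")
```

In this sample, F-session ends up with an empty test list, which is the kind of hole the gap queries later in the post are meant to surface.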

Reachability — and its mirror, absence — is exactly what a graph answers cheaply, and it's the whole reason this lives in Neo4j rather than a table.

Why a graph, not a table

The questions you actually want to ask of a codebase's requirements are reachability and absence questions, and those are one traversal in Cypher and an awkward pile of NOT EXISTS joins in SQL.

Which features have no use case covering them?

MATCH (f:SysFeature)
WHERE NOT ( (:SysUseCase)-[:REQUIRES]->(f) )
RETURN f.id, f.name

Which architecture standards have no decision addressing them — i.e. the genuine architecture gaps?

MATCH (std:SysArchStd)
WHERE NOT ( (:SysArchDecision)-[:ADDRESSES]->(std) )
RETURN std.id, std.name

Which features are missing integration-tier coverage?

MATCH (m:SysModule)-[:PROVIDES]->(f:SysFeature)
WHERE NOT EXISTS {
  MATCH (t:SysTest)-[:VERIFIES]->(f)
  WHERE t.testType = 'integration'
     OR (:SysTestPackage {testCategory:'integration'})-[:CONTAINS_TEST]->(t)
}
// DISTINCT so a feature provided by several modules is listed once
RETURN DISTINCT f.id, f.name

A gap is just a node with no incoming edge of a given type (here, no verifying test in a given tier). That framing is what makes "is this actually tested?" a query rather than an opinion: the VERIFIES edge either exists or it doesn't. Run a one-day pass of agents over a real codebase and you can watch coverage fill in as a shape across the four V-model tiers (component → integration → use-case → e2e), not as a single misleading percentage.
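
The "gap as a missing edge" idea and the per-tier shape can be sketched outside the database as a set difference. This is only an illustration under assumptions: the tier names are the ones listed earlier in the post, while the feature ids, test ids and function names are hypothetical.

```python
# Tier names as listed in the post; all ids below are made-up sample data.
TIERS = ("component", "integration", "use-case", "e2e")

FEATURES = ["F-audit", "F-export", "F-login"]

# (test id, tier, feature id) triples standing in for VERIFIES edges.
VERIFIES = [
    ("T-1", "component", "F-login"),
    ("T-2", "integration", "F-login"),
    ("T-3", "integration", "F-export"),
    ("T-4", "e2e", "F-login"),
]


def covered_in(tier, verifies):
    """Features with at least one verifying test in the given tier."""
    return {feature for (_test, test_tier, feature) in verifies if test_tier == tier}


def gaps(features, verifies, tier):
    """Features with no incoming VERIFIES edge from a test in `tier`."""
    return sorted(set(features) - covered_in(tier, verifies))


def coverage_shape(features, verifies):
    """Per tier, how many of the features have any verifying test."""
    return {tier: len(set(features) & covered_in(tier, verifies)) for tier in TIERS}


integration_gaps = gaps(FEATURES, VERIFIES, "integration")
shape = coverage_shape(FEATURES, VERIFIES)
```

Reporting the four counts side by side, rather than averaging them, is what keeps a well-covered tier from masking an empty one.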

Feeding the graph to the agent

Here's the part that matters for the agents. Instead of opening a 2,800-line handler file to orient itself, a session runs a query and gets a compact briefing (coverage by module, open work, what's in progress) serialised to a few hundred tokens.

// Coverage-by-module briefing for one agent's scope
MATCH (m:SysModule {instance:$instance})-[:PROVIDES]->(f:SysFeature)
WHERE NOT f.status IN ['Superseded','Deprecated']
OPTIONAL MATCH (t:SysTest)-[:VERIFIES]->(f)
RETURN m.id            AS module,
       count(DISTINCT f) AS features,
       count(DISTINCT t) AS tests
ORDER BY m.id

This is GraphRAG, just pointed at a codebase's specification instead of a document corpus: the graph is the retrieval layer, and what it returns is structured, current, and small.
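
How the rows become a briefing is not shown in the post, so the following is only a guess at the shape of that step: a hypothetical formatter that takes rows keyed by the query's RETURN aliases (module, features, tests) and renders one short line per module. The sample rows and the function name are invented.

```python
def format_briefing(rows):
    """Render coverage rows as a compact plain-text table, one line per module."""
    lines = ["module | features | tests"]
    for row in rows:
        lines.append(f"{row['module']} | {row['features']} | {row['tests']}")
    return "\n".join(lines)


# Made-up rows in the shape the coverage-by-module query would return.
sample_rows = [
    {"module": "M-auth", "features": 12, "tests": 31},
    {"module": "M-export", "features": 4, "tests": 0},
]

briefing = format_briefing(sample_rows)
```

A flat text table like this stays small because it carries counts, not source; the agent can ask a follow-up query for any module that looks thin.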

I measured the effect on the open-source Formbricks repo. Closing a real defect took roughly 71% fewer input+output tokens with the graph than without: 14,512 → 4,141 actual Anthropic API tokens, or ~73% savings if you count total tokens including cache reads. Method and figures: https://www.org-edge.com/sysgraph.html. Because the repo is public, you can clone it and re-run the comparison yourself.
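
The headline percentage follows directly from the two input+output token counts quoted above. A quick check of the arithmetic (the variable names are mine; the cache-read figure is not reproduced because the post does not give its underlying counts):

```python
without_graph = 14_512  # input+output tokens without the graph (from the post)
with_graph = 4_141      # input+output tokens with the graph (from the post)

saved = without_graph - with_graph
saving_pct = 100 * saved / without_graph

print(f"{saved} tokens saved, {saving_pct:.1f}% fewer")
```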

Coordination falls out of shared state

The multi-agent win is almost a side effect. Because the graph is shared, mutable state, marking a work item in progress is a write that every other agent sees on its next query:

MATCH (e:SysEnhancement {id:$id})
SET e.status = 'in-progress', e.startedAt = $now
RETURN e

The next agent sees that item flagged in-progress in its worklog and routes to something else, so two sessions don't build the same thing. No human reconciling a dozen context files. The graph is the source of truth, and "what's left to do?" is a query.
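
One detail worth spelling out is what happens when two agents try to take the same item at nearly the same moment. The sketch below models the desired behaviour in memory: a claim succeeds only if the item is still unclaimed. It is an assumption-laden illustration, not the CLI's implementation. The "open" status, the Worklog class and its methods are all hypothetical, and how to get the same guarantee inside Neo4j depends on transaction behaviour that the post does not cover.

```python
import threading


class Worklog:
    """In-memory stand-in for the shared SysEnhancement nodes (hypothetical)."""

    def __init__(self, item_ids):
        # "open" is an assumed initial status; the post only names 'in-progress'.
        self._status = {item_id: "open" for item_id in item_ids}
        self._owner = {}
        self._lock = threading.Lock()

    def claim(self, item_id, agent):
        """Mark an item in-progress only if nobody has claimed it yet."""
        with self._lock:
            if self._status[item_id] != "open":
                return False
            self._status[item_id] = "in-progress"
            self._owner[item_id] = agent
            return True

    def next_open(self):
        """Return the first unclaimed item, or None when everything is taken."""
        with self._lock:
            for item_id, status in self._status.items():
                if status == "open":
                    return item_id
        return None


worklog = Worklog(["E-1", "E-2"])
first = worklog.claim("E-1", "agent-a")   # succeeds
second = worklog.claim("E-1", "agent-b")  # refused, already in progress
fallback = worklog.next_open()            # agent-b routes to the next item
```

The check and the write happen under one lock here, which is the property that keeps two sessions from both believing they own the item.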

Notes from running it

A few things that surprised me:

Coverage as VERIFIES edges per tier, not a single percentage, meant gaps couldn't hide. A feature with integration tests but no component tests shows the hole rather than reporting "covered" — agents reported catching gaps they'd otherwise have rationalised away ("integration tests exist, so it feels covered").
MERGE-only writes for the agents, with destructive operations kept entirely out of their reach, were non-negotiable once multiple sessions shared one graph.
The seed is the cost. Mapping an existing codebase into the initial node set is the one real setup step; everything compounds after that.
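
The post does not say how the write restriction is enforced. As one naive illustration, a wrapper could refuse any agent-submitted Cypher that contains a destructive keyword. The deny-list, the regexes and the function name below are all my assumptions, and a keyword check like this is easy to get around, so treat it as a guard rail rather than a security boundary.

```python
import re

# Hypothetical deny-list of destructive Cypher keywords.
DESTRUCTIVE = re.compile(r"\b(DETACH|DELETE|REMOVE|DROP)\b", re.IGNORECASE)
# Single- or double-quoted string literals, so ids such as 'delete-account' are ignored.
STRING_LITERAL = re.compile(r"'(?:\\.|[^'\\])*'|\"(?:\\.|[^\"\\])*\"")
# Cypher line comments.
LINE_COMMENT = re.compile(r"//[^\n]*")


def is_agent_safe(cypher):
    """True if no destructive keyword appears outside strings and comments."""
    stripped = STRING_LITERAL.sub("''", cypher)
    stripped = LINE_COMMENT.sub("", stripped)
    return DESTRUCTIVE.search(stripped) is None


allowed = is_agent_safe("MERGE (f:SysFeature {id:$id}) SET f.name = $name")
blocked = is_agent_safe("MATCH (n) DETACH DELETE n")
literal_ok = is_agent_safe("MERGE (f:SysFeature {id:'delete-account'})")
```

A check of this kind would sit in whatever layer hands queries to the database, so that a confused session fails loudly instead of removing shared state.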

Try it

The CLI that drives this is free and source-available (Neo4j Community under the hood): https://github.com/org-edge/sysedge. If you're doing anything with LLM agents on a real codebase, I'd genuinely like to hear how others are modelling this — the ontology here is opinionated and I'm sure it can be sharpened.

Built on Neo4j + a thin Python CLI. Works across Go, TypeScript, Python, Java, and C#.