DEV Community: Carl Ward

I cloned Formbricks and Documenso cold from GitHub and ran an AI spec audit. Here's what it found.

Carl Ward — Wed, 03 Jun 2026 13:17:44 +0000

I built a tool called SysEdge that models requirements, tests, and architecture standards in a Neo4j graph. It ships two AI-powered audit commands:

coverage-review --uc UC-xxx — evaluates a use case spec against 7 AS-REQ dimensions: actor completeness, authorization boundaries, exception coverage, ontology alignment, testability, scope definition, and bidirectional traceability with execution
audit-test --uc UC-xxx --file path/to/test.ts — evaluates a test file against 7 AS-TEST-UC dimensions: element inventory, role visibility, happy path, error paths, equivalence partitions, semantic correctness, and specification derivation

I ran both on two production open-source codebases, cloned cold. Here's what they found.

Formbricks — survey platform (TypeScript/Next.js)

The defect: survey response exports include PII when anonymisation is enabled.

Why it existed: running coverage-review on the export use case shows that the behaviour was never specified.

$ python3 cli/sys_graph.py coverage-review --uc UC-FBK-005

[FAIL]  AS-REQ-003  Exception Coverage
        UC describes happy path only. No exception flow for anonymisation —
        what happens when anonymisation is enabled is completely unspecified.

[FAIL]  AS-REQ-006  Scope Definition
        Out-of-scope section is missing. Data masking, PII redaction, and
        anonymization are never mentioned — no decision recorded on whether
        privacy filtering is in or out of scope.

3 FAIL · 4 WARN · 0 PASS

The requirement was never written. Code review has no diff to evaluate.

test-gaps confirmed: F-ANLYS-003 (Export responses) had zero tests at every V-model tier — no unit test, no API contract test, no Playwright UI flow.

Token measurement (actual Anthropic API counts via --output-format json, not estimates):

                   Tokens (in+out)    Cache-read tokens
Without SysEdge    14,512             1,797,494
With SysEdge       4,141              473,649
Saving             71%                74%

The without-SysEdge session loaded the full codebase via prompt cache to orient on the defect. The with-SysEdge session used graph queries. Both cache figures are disclosed — the saving holds whether you count input+output or total tokens including cache reads.
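The percentages can be rechecked from the four raw counts. This is a minimal sketch that only redoes the arithmetic on the figures quoted in this post; the variable and function names are mine and nothing here is newly measured.

```python
# Figures copied from the token table in this post.
WITHOUT = {"in_out": 14_512, "cache_read": 1_797_494}
WITH = {"in_out": 4_141, "cache_read": 473_649}


def saving_pct(before: int, after: int) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


in_out = saving_pct(WITHOUT["in_out"], WITH["in_out"])
cache = saving_pct(WITHOUT["cache_read"], WITH["cache_read"])
# "Total" here means in+out plus cache reads, summed per session.
total = saving_pct(sum(WITHOUT.values()), sum(WITH.values()))

print(f"in+out: {in_out:.1f}%  cache-read: {cache:.1f}%  total: {total:.1f}%")
```

Rounded to whole percentages, the first two values match the 71% and 74% in the table, and the combined total also rounds to 74%.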

Documenso — e-signature platform (TypeScript/Next.js)

github.com/documenso/documenso — open-source DocuSign alternative.

Setup: cloned cold, built a domain seed from the Prisma schema, started Neo4j on a fresh port.

docker run -d --name documenso-neo4j \
  -e NEO4J_AUTH=neo4j/documenso123 \
  -p 7700:7687 neo4j:5.26

python3 cli/sys_graph.py init
python3 cli/sys_graph.py seed documenso-seed.json

Time from clone to all findings: ~15 minutes. Total AI cost: ~$0.30 (haiku model).

Before we get to the findings: Documenso has 94 Playwright e2e tests and exactly 3 unit test files in the entire repo — two for webhook URL validation, one for CSS sanitization. That's the complete unit/integration test surface.

Finding 1: The Inngest job handlers that process envelope expiration have no tests

Documenso uses Inngest for background jobs. The job system lives in packages/lib/jobs/. That directory contains zero test files.

The expiration job implementation:

packages/lib/jobs/definitions/internal/
  expire-recipients-sweep.handler.ts     ← sweeps for expired recipients, fans out
  process-recipient-expired.handler.ts   ← processes each expired recipient

These handlers: query the database for pending envelopes past their expiresAt, set status to EXPIRED, invalidate signing tokens, trigger the ENVELOPE_EXPIRED webhook, and dispatch notification emails.

There is a Playwright e2e test: envelope-expiration-signing.spec.ts. It's a good test — it seeds a recipient with expiresAt in the past and verifies that clicking the signing link shows an "expired" error page. It passes and it's correct.

But it tests the UI response to an expired state. It does not test whether the Inngest job that creates that state runs correctly, handles partial failures, or fires the right webhook payload.

$ python3 cli/sys_graph.py audit-test \
  --uc UC-DOC-004 \
  --file packages/app-tests/e2e/envelopes/envelope-expiration-signing.spec.ts

[FAIL]  AS-TEST-UC-003  UC Test: Happy Path
        No test exercises the main UC success scenario: scheduled job runs,
        queries for envelopes with expiresAt in the past, sets status EXPIRED,
        invalidates signing tokens, fires ENVELOPE_EXPIRED webhook, sends
        notification emails to sender.
        → Create a test that seeds a PENDING envelope with expiresAt in the past,
          invokes the expiration job, and asserts all post-conditions.

If the sweep job silently fails, envelopes never expire. Time-limited agreements can be signed after their legal deadline.

Finding 2: The expiration specification describes success only

$ python3 cli/sys_graph.py coverage-review --uc UC-DOC-004

[FAIL]  AS-REQ-003  Exception Coverage
        No behaviour specified for:
        - Inngest job exception during status update
        - Webhook endpoint unreachable or returning 5xx
        - Email service unavailable
        - Database transaction failure mid-batch (partial token invalidation —
          some recipients expired, some not)

[FAIL]  AS-REQ-002  Authorization Boundaries
        No roles named. Can an admin manually trigger the expiration sweep?
        Can a recipient see that their envelope expired before receiving
        notification? No CASL permission string derivable.

For a legally-binding e-signature platform, the partial-failure case is material: if the database write succeeds but the webhook fails, did the envelope expire? If the sweep processes 500 of 1000 expired recipients before throwing, what state are the remaining 500 in?
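To make the partial-failure question concrete, here is a toy model of a sweep. It is not Documenso's handler: the Recipient class, the status strings, and the failure rule are all invented for illustration, and it encodes only one possible policy (isolate each recipient, record failures, keep going). The point is that whichever policy the spec picks, the resulting state should be something a test can assert on.

```python
from dataclasses import dataclass


@dataclass
class Recipient:
    id: int
    status: str = "PENDING"  # hypothetical status values for this sketch


def sweep(recipients, expire_one):
    """Expire each recipient independently and report what happened.

    A failure on one recipient is recorded instead of aborting the batch,
    so the caller can see exactly which records are still PENDING.
    """
    expired, failed = [], []
    for r in recipients:
        try:
            expire_one(r)
            r.status = "EXPIRED"
            expired.append(r.id)
        except Exception as exc:  # deliberately broad for the sketch
            failed.append((r.id, str(exc)))
    return expired, failed


def flaky_expire(r):
    # Stand-in for the real side effects (token invalidation, webhook, email).
    if r.id % 4 == 0:
        raise RuntimeError("webhook returned 503")


batch = [Recipient(i) for i in range(1, 9)]
expired, failed = sweep(batch, flaky_expire)
still_pending = [r.id for r in batch if r.status == "PENDING"]
print("expired:", expired, "still pending:", still_pending)
```

A spec could equally require all-or-nothing behaviour. The sketch only shows that the answer has to be written down before a test can check it.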

Finding 3: Void envelope atomicity is undefined

$ python3 cli/sys_graph.py coverage-review --uc UC-DOC-003

[FAIL]  AS-REQ-003  Exception Coverage
        Void UC covers the happy path (status → VOIDED, tokens invalidated,
        webhook fired) but does not specify:
        - Behaviour when envelope is already VOIDED or EXPIRED
        - Whether void is atomic (webhook fires even if DB write fails?)
        - Minimum/maximum length of void reason field
        - Whether voiding is reversible

Void appears in the tests only inside enterprise-feature-restrictions.spec.ts, which is wrapped in test.describe.skip, so it never runs.

Finding 4: The signing certificate test checks appearance, not validity

include-document-certificate.spec.ts exists and passes. It seeds a document, has a recipient sign it, and verifies that a certificate page appears in the completed PDF.

What it does not test: whether the signature hash is cryptographically valid, whether the TSA timestamp (if configured) is correctly embedded, or whether the certificate fields match the actual signing event data.

[FAIL]  AS-TEST-UC-003  UC Test: Happy Path
        Test verifies certificate appearance in PDF but does not assert
        signature validity, TSA timestamp correctness, or certificate
        field accuracy against signing event data.

The cryptographic pipeline — packages/signing/transports/local.ts, packages/signing/helpers/tsa.ts — has zero unit tests.

Finding 5: Zero specification derivation comments in existing tests

[FAIL]  AS-TEST-UC-007  Specification Derivation
        No test function or test class in the expiration suite includes a
        comment tracing it to a UC step, precondition, or acceptance criterion.
        Test names use ENVELOPE_EXPIRATION labels but none reference
        UC-DOC-004 main flow steps.

For a system where legal validity of signatures may be subject to audit, "this test verifies specification clause X" is the evidence chain that matters. None of the existing tests establish that chain.
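A rough idea of what a derivation check could look like, as a sketch rather than how audit-test works internally (that command is AI-driven; this is a plain regex pass). The convention assumed here is hypothetical: a test counts as traced if its title, or the comment block directly above it, contains a UC identifier such as UC-DOC-004. The sample test source is invented.

```python
import re

# Hypothetical convention: tests cite a use case ID like "UC-DOC-004".
UC_REF = re.compile(r"\bUC-[A-Z]+-\d+\b")
# Matches Playwright-style declarations: test('name', ...) or test.only('name', ...)
TEST_DECL = re.compile(r"""^\s*test(?:\.\w+)?\(\s*(['"`])(?P<name>.+?)\1""")


def untraced_tests(source: str) -> list[str]:
    """Return names of tests with no UC reference in the title or in the
    comment lines immediately above the declaration."""
    lines = source.splitlines()
    missing = []
    for i, line in enumerate(lines):
        m = TEST_DECL.match(line)
        if not m:
            continue
        context = [line]
        j = i - 1
        # Walk up over the comment block directly attached to this test.
        while j >= 0 and lines[j].lstrip().startswith(("//", "/*", "*")):
            context.append(lines[j])
            j -= 1
        if not any(UC_REF.search(c) for c in context):
            missing.append(m.group("name"))
    return missing


# Invented sample, not copied from the Documenso repo.
sample = '''
// UC-DOC-004 step 5: expired recipient sees the expired page
test('[ENVELOPE_EXPIRATION]: expired link shows error', async () => {});

test('[ENVELOPE_EXPIRATION]: signing still works before expiry', async () => {});
'''

print(untraced_tests(sample))
```

The first sample test is traced through its comment and the second is reported as untraced. A check like this only proves that a reference exists, not that the test actually exercises the cited step.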

Why code review can't find these

Every finding has the same structure: something was never written down.

There is no diff for a missing Inngest test. There is no diff for an unspecified exception path. There is no diff for a use case that was never written. Code review sees what changed — these gaps exist in absence.

The standard toolset produces false confidence: tests exist, they pass, and the coverage percentage is non-zero. None of those signals shows whether the job handlers are tested, whether the specification covers failure states, or whether a test can be traced to the specification.

Real-world effectiveness from 11 parallel sessions

After shipping these audit commands across our own 12-instance codebase, the instances reported back. Highlights:

audit-test forced a full-walk test → revealed a missing UI feature the API already supported. The readiness edit control didn't exist in CandidateDetailPanel — candidates had readiness set at creation only. The API supported PUT, the UI never exposed it. audit-test's "happy path" dimension requires exercising the complete UC flow, which revealed the gap. Neither code review nor manual testing would have caught it — the API worked, the UI just didn't have the button.

coverage-review found an acceptance criterion cut off mid-sentence. The stored text was literally "They can add a new OKR (objective text, key result" — truncated. Data quality gap invisible to all other review mechanisms.

audit-test AS-TEST-UC-002 (role visibility) gaps led to a real access control finding. Testing denied roles via service token revealed the service role has read:* regardless of email scope — a real architectural gap, documented as a defect.

coverage-review AS-REQ-002 (auth boundaries) failing universally → all use cases now have explicit CASL permission strings. 62 use cases updated in one session. Code review doesn't check whether specs name permission strings.

How it works

# 1. Clone any repo
git clone https://github.com/documenso/documenso ~/documenso-demo

# 2. Start a fresh Neo4j (use a free port)
docker run -d --name documenso-neo4j \
  -e NEO4J_AUTH=neo4j/documenso123 \
  -p 7700:7687 neo4j:5.26

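# 3. Build a seed JSON from the domain model, then point the CLI at that port,
#    initialise the graph, and load the seed:
export SYSGRAPH_NEO4J_URI=bolt://localhost:7700
python3 cli/sys_graph.py init
python3 cli/sys_graph.py seed documenso-seed.json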

# 4. Run the spec audit
python3 cli/sys_graph.py coverage-review --uc UC-DOC-004 --model haiku

# 5. Run the test quality audit against a real test file
python3 cli/sys_graph.py audit-test \
  --uc UC-DOC-004 \
  --file packages/app-tests/e2e/envelopes/envelope-expiration-signing.spec.ts

The AI calls use the Claude Code session tokens — no separate API key needed if you're running inside Claude Code. Haiku is fast enough for this workload: the full Documenso run was ~$0.30 total.

The full case studies

Documenso findings (with the actual commands and outputs): org-edge.com/sysedge-documenso-review.html
Formbricks + token measurement: org-edge.com/sysedge-token-savings.html
SysEdge free CLI: github.com/org-edge/sysedge

MIT + Commons Clause. Runs inside Claude Code with no API key configuration required.

Requirements and code as a Neo4j ontology: reproducible token savings and multi-agent coordination

Carl Ward — Sat, 30 May 2026 13:57:26 +0000

AI coding agents have an expensive habit: before they write a single line, they re-read source files to work out what already exists — which modules there are, what each one provides, what's tested, and what's currently being changed. On a small repo that's tolerable. Run several agents in parallel on one codebase and it becomes both a token sink and a coordination problem: two agents start the same feature, a test gets added that nobody can map to a requirement, an architectural decision made in one session is invisible to the others.

I kept hitting this running multiple Claude Code sessions against a single codebase, and ended up solving it the way you'd expect on a graph-shaped problem: model the codebase's requirements chain as an ontology in Neo4j, and let the agents query the graph instead of re-reading the source.

This post is about the data model and the queries — and why a graph is the right tool here rather than a table.

The model

The core idea is full requirements traceability, from a user story down to the unit test that verifies a routine. Every artefact is a node; the relationships carry the meaning.

(:SysUserStory)-[:REALIZED_BY]->(:SysUseCase)
(:SysUseCase)-[:REQUIRES]->(:SysFeature)
(:SysModule)-[:PROVIDES]->(:SysFeature)
(:SysModule)-[:CONTAINS_SYMBOL]->(:SysSymbol)        // :SysSymbol / :SysEndpoint -[:IMPLEMENTS]->(:SysFeature)
(:SysTest)-[:VERIFIES]->(:SysFeature)                // tier via t.testType, or (:SysTestPackage {testCategory})-[:CONTAINS_TEST]->(:SysTest)
(:SysArchDecision)-[:ADDRESSES]->(:SysArchStd)

A SysFeature isn't a ticket — it's a capability a SysModule provides. A SysUseCase isn't a description — it's a user-visible flow that realises a story. Every test carries its V-model tier — component, integration, use-case, or e2e — and a VERIFIES edge tying it to the feature it covers. So "is this feature covered at every tier?" stops being a judgement call and becomes a reachability question.

Reachability — and its mirror, absence — is exactly what a graph answers cheaply, and it's the whole reason this lives in Neo4j rather than a table.

Why a graph, not a table

The questions you actually want to ask of a codebase's requirements are reachability and absence questions, and those are one traversal in Cypher and an awkward pile of NOT EXISTS joins in SQL.

Which features have no use case covering them?

MATCH (f:SysFeature)
WHERE NOT ( (:SysUseCase)-[:REQUIRES]->(f) )
RETURN f.id, f.name

Which architecture standards have no decision addressing them — i.e. the genuine architecture gaps?

MATCH (std:SysArchStd)
WHERE NOT ( (:SysArchDecision)-[:ADDRESSES]->(std) )
RETURN std.id, std.name

Which features are missing integration-tier coverage?

MATCH (m:SysModule)-[:PROVIDES]->(f:SysFeature)
WHERE NOT EXISTS {
  MATCH (t:SysTest)-[:VERIFIES]->(f)
  WHERE t.testType = 'integration'
     OR (:SysTestPackage {testCategory:'integration'})-[:CONTAINS_TEST]->(t)
}
RETURN f.id, f.name

A gap is just a node with no incoming edge of a given type (here, no verifying test in a given tier). That framing is what makes "is this actually tested?" a query rather than an opinion: the VERIFIES edge either exists or it doesn't. Run a one-day pass of agents over a real codebase and you can watch coverage fill in as a shape across the four V-model tiers (component → integration → use-case → e2e), not as a single misleading percentage.
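The "shape, not percentage" idea can be shown without a database. The sketch below is an in-memory stand-in for the VERIFIES edges, not the SysEdge implementation. F-ANLYS-003 is the export feature mentioned in the Formbricks post; the other feature IDs, the test IDs, and the edge list are made up.

```python
TIERS = ("component", "integration", "use-case", "e2e")

# Hypothetical data standing in for (:SysTest)-[:VERIFIES]->(:SysFeature)
# edges, written as (test_id, tier, feature_id).
verifies = [
    ("T-1", "integration", "F-ANLYS-001"),
    ("T-2", "e2e", "F-ANLYS-001"),
    ("T-3", "component", "F-ANLYS-002"),
]
features = ["F-ANLYS-001", "F-ANLYS-002", "F-ANLYS-003"]


def coverage_shape(features, verifies):
    """Map each feature to the set of tiers with at least one VERIFIES edge."""
    shape = {f: set() for f in features}
    for _test, tier, feature in verifies:
        shape[feature].add(tier)
    return shape


def gaps(shape):
    """Per feature, the tiers with no incoming VERIFIES edge."""
    return {f: [t for t in TIERS if t not in tiers] for f, tiers in shape.items()}


shape = coverage_shape(features, verifies)
for feature, missing in gaps(shape).items():
    # One column per tier: "x" where a test exists, "." where it does not.
    row = " ".join("x" if t in shape[feature] else "." for t in TIERS)
    print(f"{feature}  {row}  missing: {', '.join(missing) or 'none'}")
```

Two of the three features here have at least one test, which a single percentage would report as 67% covered. The per-tier rows show that none of them is covered at every tier.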

Feeding the graph to the agent

Here's the part that matters for the agents. Instead of letting a session open a 2,800-line handler file to orient, it runs a query and gets a compact briefing — coverage by module, open work, what's in progress — serialised to a few hundred tokens.

// Coverage-by-module briefing for one agent's scope
MATCH (m:SysModule {instance:$instance})-[:PROVIDES]->(f:SysFeature)
WHERE NOT f.status IN ['Superseded','Deprecated']
OPTIONAL MATCH (t:SysTest)-[:VERIFIES]->(f)
RETURN m.id            AS module,
       count(DISTINCT f) AS features,
       count(DISTINCT t) AS tests
ORDER BY m.id

This is GraphRAG, just pointed at a codebase's specification instead of a document corpus: the graph is the retrieval layer, and what it returns is structured, current, and small.
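For a sense of how small such a briefing can be, here is a sketch of the serialisation step. It assumes rows shaped like the RETURN clause of the coverage-by-module query (module, features, tests); the module IDs and counts are invented, and the output format is my own, not what the CLI emits.

```python
# Hypothetical rows shaped like the query's RETURN clause.
rows = [
    {"module": "MOD-SURVEY", "features": 12, "tests": 31},
    {"module": "MOD-EXPORT", "features": 4, "tests": 0},
]


def briefing(rows):
    """Render one compact line per module, untested modules first."""
    ordered = sorted(rows, key=lambda r: (r["tests"] > 0, r["module"]))
    lines = ["module features tests"]
    for r in ordered:
        flag = "  <- no tests" if r["tests"] == 0 else ""
        lines.append(f'{r["module"]} {r["features"]} {r["tests"]}{flag}')
    return "\n".join(lines)


text = briefing(rows)
print(text)
# Rough size check; 4 characters per token is only a rule of thumb.
print(f"~{len(text) // 4} tokens")
```

Sorting untested modules to the top is a design choice for the agent's benefit: the most actionable lines survive even if the briefing gets truncated.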

I measured the effect on the open-source Formbricks repo. Closing a real defect took roughly 71% fewer input+output tokens with the graph than without (14,512 -> 4,141 actual Anthropic API tokens), or ~74% fewer if you count total tokens including cache reads. Method and figures: https://www.org-edge.com/sysgraph.html. Because the repo is public, you can clone it and re-run the comparison yourself.

Coordination falls out of shared state

The multi-agent win is almost a side effect. Because the graph is shared, mutable state, marking a work item in progress is a write that every other agent sees on its next query:

MATCH (e:SysEnhancement {id:$id})
SET e.status = 'in-progress', e.startedAt = $now
RETURN e

The next agent sees that item flagged in-progress in its worklog and routes to something else, so two sessions don't build the same thing. No human reconciling a dozen context files. The graph is the source of truth, and "what's left to do?" is a query.
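One detail worth sketching is what happens when two agents try to claim the same item at nearly the same moment. The model below is an in-memory stand-in for the shared graph, not how SysEdge issues the write, and the 'open' status value and the ENH-042 ID are assumptions. It shows the check-then-set shape of a claim: only the first caller that still sees the item as open wins.

```python
import threading


class Worklog:
    """In-memory stand-in for the shared graph; not the SysEdge implementation."""

    def __init__(self, items):
        self._status = {i: "open" for i in items}
        self._owner = {}
        self._lock = threading.Lock()

    def claim(self, item_id, agent):
        """Mark the item in-progress only if it is still open.

        Returns True for the agent that wins the claim, False otherwise.
        """
        with self._lock:
            if self._status.get(item_id) != "open":
                return False
            self._status[item_id] = "in-progress"
            self._owner[item_id] = agent
            return True

    def owner(self, item_id):
        return self._owner.get(item_id)


log = Worklog(["ENH-042"])
results = {}
threads = [
    threading.Thread(target=lambda a=a: results.__setitem__(a, log.claim("ENH-042", a)))
    for a in ("agent-a", "agent-b", "agent-c")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results, "owner:", log.owner("ENH-042"))
```

Whether the graph write needs an equivalent guard depends on how the CLI performs the update, which this post does not show. The sketch only illustrates the behaviour a guarded claim would have.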

Notes from running it

A few things that surprised me:

Coverage as VERIFIES edges per tier, not a single percentage, meant gaps couldn't hide. A feature with integration tests but no component tests shows the hole rather than reporting "covered" — agents reported catching gaps they'd otherwise have rationalised away ("integration tests exist, so it feels covered").
MERGE-only writes for the agents, with destructive operations kept entirely out of their reach, were non-negotiable once multiple sessions shared one graph.
The seed is the cost. Mapping an existing codebase into the initial node set is the one real setup step; everything compounds after that.

Try it

The CLI that drives this is free and source-available (Neo4j Community under the hood): https://github.com/org-edge/sysedge. If you're doing anything with LLM agents on a real codebase, I'd genuinely like to hear how others are modelling this — the ontology here is opinionated and I'm sure it can be sharpened.

Built on Neo4j + a thin Python CLI. Works across Go, TypeScript, Python, Java, and C#.