Why We Moved Beyond Vector Search for Contract QA
When we started building AgreedPro, one of the core technical questions seemed almost ordinary: how do you answer questions over contracts using modern AI retrieval systems?
At first, the answer felt obvious.
Use RAG.
Chunk the contract. Embed the chunks. Retrieve the top matches. Pass them to a language model. Let the model generate the answer.
That pipeline is familiar because, in many domains, it works. It works well on documentation, FAQs, product manuals, internal wikis, and other corpora where relevant information is typically localized. A question points to a paragraph, a section, or a small cluster of nearby passages. Retrieval is largely a similarity problem.
Contracts are different.
That difference was not immediately obvious when we were building early versions of AgreedPro. The model often produced answers that looked right. They were well-phrased, coherent, and aligned with the wording of the contract section that had been retrieved.
But once we started checking those answers closely against the full document, a recurring pattern emerged.
The model was not hallucinating in the usual sense.
It was answering from incomplete context.
That failure mode turned out to be much more important than it sounds. In legal documents, incomplete context is often the difference between a correct answer and a misleading one. That realization is what led to EngramDB.
This article is a technical deep dive into the retrieval problem we encountered while building AgreedPro, why vector-only search was mismatched to contract reasoning, and how EngramDB emerged as a hybrid graph-plus-vector retrieval system designed for multi-hop legal reasoning.
The Problem Looked Solved Until We Tried to Solve It
The initial retrieval pipeline was standard.
- Parse the contract into chunks.
- Embed each chunk.
- Embed the user query.
- Retrieve the most similar chunks.
- Ask the language model to answer from the retrieved context.
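The five steps above can be sketched end to end. To keep the snippet self-contained, the embedding here is a toy bag-of-words counter with cosine similarity; a real pipeline would call an embedding model instead.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch is self-contained;
    # a real pipeline would call an embedding model here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Termination. Either party may terminate this Agreement for Cause; "
    "the Agreement is terminated upon written notice.",
    "Confidential Information shall be protected for five years.",
    "Payment is due within thirty days of invoice.",
]
top = retrieve("Under what conditions can this agreement be terminated?", chunks)
```

The retrieved context (`top`) is then pasted into the prompt for the language model, which is exactly where the incompleteness problem described below enters.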
For a while, this looked fine.
Then we started asking realistic legal questions.
A representative example was this:
Under what conditions can this agreement be terminated?
A vector-based retriever would often return the section titled something like Termination. That seemed perfectly reasonable. The wording of the section was directly relevant to the query.
But when we inspected the full contract, the real answer was often distributed across several places:
- the termination clause itself
- a definitions section that defines a term like "Cause"
- a different section containing exceptions, limitations, or notice requirements
- one or more cross-referenced provisions that affect interpretation
The retriever found the clause that looked most relevant, but not the sections that made the clause fully interpretable.
The language model then generated an answer that sounded clean and plausible, but it had only seen one part of the reasoning chain.
This is one of the most dangerous failure modes in contract intelligence: not obviously wrong, not obviously invented, just incomplete in a way that can change the legal meaning.
The Core Mismatch: Similarity Does Not Equal Completeness
To understand why this keeps happening, it helps to be precise about what dense retrieval is optimizing for.
In a standard vector retrieval pipeline, you embed the query and the chunks, compute similarity scores, and return the top results. In simple terms, the retriever is asking:
Which passages look most semantically similar to the question?
That objective is useful, but it is not the same as the one we actually care about in contracts.
In contract QA, the real question is often closer to this:
Which sections, taken together, are necessary to answer this question correctly?
Those are not the same objective.
A clause that defines a key term may share very little wording with the query. A referenced section may be critical to the answer and still rank poorly in a pure embedding space. A parent section may scope the meaning of a child subsection without repeating the child’s language at all.
So the retrieval problem is not simply one of semantic relevance. It is a problem of reconstructing a reasoning path across document structure.
That is where vector-only retrieval begins to fail.
Why Contracts Behave More Like Graphs Than Like Flat Text
The biggest conceptual shift while building AgreedPro was realizing that contracts should not be modeled primarily as flat text.
They behave much more like structured graphs.
A contract has at least three kinds of first-class structure.
Hierarchical structure
Contracts are organized into articles, sections, subsections, schedules, appendices, and nested numbered clauses. A child clause often depends on the scope introduced by its parent. If you detach a subsection from its surrounding article, you can easily lose the context that gives it meaning.
Definitional structure
Contracts rely heavily on defined terms. Words like "Cause," "Confidential Information," "Affiliate," or "Change of Control" are often defined once and reused throughout the document. The place where a term is used is rarely the place where its meaning is established.
Referential structure
Contracts constantly point to themselves. They say things like "subject to Section 8.2," "as provided in Article III," or "except as otherwise set forth in Section 5.4." These are explicit navigational edges inside the document.
Once you start treating those signals as part of the representation, a contract naturally becomes a graph of connected sections rather than a bag of isolated chunks.
That framing is the foundation of EngramDB.
What EngramDB Is
EngramDB is a schema-aware hybrid retrieval system for structured documents, built around the kinds of retrieval failures that showed up while building AgreedPro.
At a high level, it combines two mechanisms:
- Vector retrieval to find semantically relevant starting points
- Graph traversal to expand from those starting points into structurally related sections
The key idea is not that embeddings are bad. It is that embeddings are being asked to do too much when the answer depends on multiple linked sections.
Embeddings are very good at finding where the answer might start.
They are not, by themselves, a reliable way to find everything the answer depends on.
EngramDB treats each document section as a node, called an Engram, and each structural relationship as an edge, called a Synapse. Those edges are extracted from the document itself during ingestion.
So instead of hoping the embedding model implicitly captures the document’s legal structure, the system makes that structure explicit.
Deterministic Ingestion: Structure Extraction Without LLM Calls
One of the most important design decisions in EngramDB is that ingestion is rule-based.
That decision matters for both engineering and research reasons.
From an engineering perspective, rule-based extraction is predictable, reproducible, cheaper to run, and easier to debug. If a section boundary is wrong or a reference fails to resolve, you can inspect the pattern and fix the pipeline.
From a research perspective, the hypothesis is that contracts already expose a large amount of high-quality structure for free. Section numbering, heading patterns, quotation conventions, and cross-references are not weak hints. They are central features of the document.
EngramDB takes advantage of that.
Section parsing
The first stage identifies headings and section boundaries using regular expressions and structural cues. The goal is to segment the contract into units that are legally meaningful while preserving hierarchy.
For example, the parser should understand that:
- Article IV is a parent scope
- Section 4.2 belongs under that article
- nested clauses inherit context from the enclosing section
This is essential because a subsection often cannot be interpreted correctly without its parent context.
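A minimal sketch of this stage, using illustrative heading patterns (these regexes are examples, not EngramDB's actual rules). The depth of a section follows its dotted numbering, which is what lets child clauses inherit parent scope:

```python
import re

# Illustrative heading patterns (assumed, not EngramDB's exact rules):
ARTICLE_RE = re.compile(r"^ARTICLE\s+([IVXLC]+)\b", re.IGNORECASE)
SECTION_RE = re.compile(r"^(?:Section\s+)?(\d+(?:\.\d+)*)\s", re.IGNORECASE)

def parse_headings(lines):
    """Yield (kind, number, depth) for each heading line found."""
    for line in lines:
        m = ARTICLE_RE.match(line)
        if m:
            yield ("article", m.group(1), 0)
            continue
        m = SECTION_RE.match(line)
        if m:
            number = m.group(1)
            # Depth follows the dotted numbering: 4.2 -> 1, 4.2.1 -> 2
            yield ("section", number, number.count("."))

doc = [
    "ARTICLE IV TERMINATION",
    "Section 4.1 Termination for Convenience.",
    "Section 4.2 Termination for Cause.",
    "4.2.1 Notice requirements.",
]
headings = list(parse_headings(doc))
```

The `(kind, number, depth)` tuples are enough to rebuild the parent chain: each section attaches to the most recent heading with a smaller depth.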
Definition extraction
The next stage detects defined terms using patterns such as:
- "Term" means ...
- "Term" shall mean ...
- The term "X" means ...
This allows the system to map terms back to their defining sections and to link uses of those terms elsewhere in the document.
That means a query about termination for Cause does not have to rely on embeddings alone to discover the definition of Cause.
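A sketch of the pattern matching for the first two forms (real contract drafting has many more variants, so this regex is illustrative only):

```python
import re

# Matches '"Term" means ...' and '"Term" shall mean ...';
# an illustrative pattern, not an exhaustive one.
DEFN_RE = re.compile(r'"(?P<term>[^"]+)"\s+(?:shall\s+)?means?\b', re.IGNORECASE)

text = (
    '"Cause" means a material breach of this Agreement. '
    '"Confidential Information" shall mean any non-public information.'
)
terms = [m.group("term") for m in DEFN_RE.finditer(text)]
```

Each match links the defining section (where the match occurs) to every later section whose text uses the defined term.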
Cross-reference extraction
The ingestion pipeline also extracts explicit references such as:
- Section 4.2
- Section 8.1(b)
- Article III
- Schedule A
When those references can be resolved, EngramDB creates edges between the source and target sections.
This is one of the most valuable signals in legal documents because the contract itself is telling you which sections are meant to be read together.
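A sketch of the reference patterns, again illustrative rather than exhaustive (real contracts also need ranges, "Sections 5 and 6", multi-level clause letters, and so on):

```python
import re

# Illustrative cross-reference patterns for the forms listed above.
XREF_RE = re.compile(
    r"Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?"
    r"|Article\s+[IVXLC]+"
    r"|Schedule\s+[A-Z]"
)

clause = (
    "Subject to Section 8.1(b) and except as set forth in Article III, "
    "the provisions of Schedule A apply."
)
refs = XREF_RE.findall(clause)
```

Each extracted reference is then resolved against the section index built during parsing; only resolvable references become edges.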
Graph construction
After extraction, the document is represented as a graph consisting of:
- Engrams: nodes representing sections
- Synapses: typed edges representing relationships
The primary relationship types include:
- `PARENT_OF` for hierarchy
- `DEFINES` for term-definition links
- `REFERENCES` for explicit cross-references
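A minimal in-memory sketch of this node/edge model. The names `Engram` and `Synapse` come from the article, but the fields and methods here are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Engram:
    """A node: one document section."""
    id: str
    heading: str
    text: str

@dataclass
class Synapse:
    """A typed edge between two sections."""
    source: str
    target: str
    kind: str  # "PARENT_OF", "DEFINES", or "REFERENCES"

@dataclass
class DocumentGraph:
    engrams: dict = field(default_factory=dict)
    synapses: list = field(default_factory=list)

    def add_engram(self, e: Engram) -> None:
        self.engrams[e.id] = e

    def add_synapse(self, s: Synapse) -> None:
        self.synapses.append(s)

    def neighbors(self, node_id: str) -> list:
        """Outgoing (kind, target) pairs for a node."""
        return [(s.kind, s.target) for s in self.synapses if s.source == node_id]

g = DocumentGraph()
g.add_engram(Engram("4", "Article IV", "Termination"))
g.add_engram(Engram("4.2", "Section 4.2", "Termination for Cause..."))
g.add_synapse(Synapse("4", "4.2", "PARENT_OF"))
```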
Embedding generation
Each section can also be embedded using a pluggable backend. In the current implementation, the supported backends include:
- `mock` for deterministic tests and development
- `openai` for hosted embeddings
- `local` for sentence-transformers-based local embeddings
This separation is deliberate. Structure extraction is deterministic. Embeddings are an additional signal, not the source of the graph.
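One way to express that pluggability is a structural interface. The backend names come from the article, but this `Protocol` and the mock's byte-sum hashing scheme are assumptions for illustration:

```python
from typing import Protocol

class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class MockBackend:
    """Deterministic vectors derived from byte sums, for tests.

    Deliberately meaningless semantically; the point is only that the
    same text always maps to the same vector, with no network calls.
    """
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            seed = sum(text.encode("utf-8")) + 1
            vectors.append([((seed * (i + 1)) % 97) / 96.0 for i in range(self.dim)])
        return vectors

backend: EmbeddingBackend = MockBackend()
vec = backend.embed(["Termination for Cause"])[0]
```

An `openai` or `local` backend would satisfy the same interface, so the graph and retrieval code never need to know which one is in use.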
Why DuckDB Is a Good Fit
A lot of retrieval systems get operationally heavy before the retrieval logic is even stable. Separate vector stores, graph databases, metadata stores, and orchestration layers appear early, often making iteration harder rather than easier.
EngramDB takes a simpler route.
Everything is stored in a single DuckDB file:
- section metadata
- graph edges
- embeddings
- retrieval artifacts
This has several practical advantages.
First, it keeps the system local-first and easy to reproduce. Second, it allows structured querying over nodes and edges without introducing another database layer. Third, it makes benchmarking, debugging, and inspection much easier when the system is still evolving.
For an experimental hybrid retrieval engine, that tradeoff is extremely appealing.
The Retrieval Pipeline: Vector Search for Anchors, Graph Walk for Context
The retrieval pipeline in EngramDB has two stages.
Stage 1: Anchor retrieval
Given a natural-language query, the system embeds the query and retrieves the top-k semantically similar sections.
These top-k results are treated as anchor nodes.
This is where vector search does what it is good at. It finds the parts of the document that look closest to the user’s question.
If the query is about termination, the anchor set often includes a termination clause. If the query is about confidentiality, the anchor set often includes confidentiality-related sections.
But anchors alone are not enough.
Stage 2: Graph expansion
From each anchor, EngramDB expands outward through the graph for a bounded number of hops, usually one to three.
That graph walk can recover:
- a definition linked to a term used in the anchor
- a referenced provision that adds an exception or condition
- a parent section that scopes the meaning of the anchor
- neighboring clauses that complete the obligation or permission being discussed
This is the step that turns retrieval from a semantic lookup problem into a multi-hop reasoning support system.
Vector retrieval finds entry points.
Graph traversal finds connected evidence.
That distinction is the core of the system.
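The candidate-gathering half of the two stages can be sketched as a bounded breadth-first walk over a toy edge list (the node names and edges here are invented for illustration; the real system then scores and ranks what the walk returns):

```python
from collections import deque

# Toy graph: node -> [(edge_kind, neighbor)]
EDGES = {
    "termination": [("REFERENCES", "notice"), ("DEFINES", "cause_def")],
    "notice": [("REFERENCES", "cure_period")],
    "cause_def": [],
    "cure_period": [],
}

def expand(anchors: list[str], max_hops: int = 2) -> dict:
    """Bounded breadth-first walk outward from the anchor set."""
    seen = {a: 0 for a in anchors}  # node -> hop distance from nearest anchor
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        if seen[node] >= max_hops:
            continue  # hop budget exhausted along this path
        for _, nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

# Stage 1 would produce the anchors via vector search; here we hard-code one.
candidates = expand(["termination"], max_hops=2)
```

The hop distances recorded here feed directly into the structural scoring described next.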
Scoring: Blending Semantic and Structural Relevance
Once the candidate set is assembled, the system still has to rank it.
EngramDB scores each candidate using a blend of semantic similarity and structural relevance.
In the current design, the score is computed as:
score = 0.5 * semantic_similarity + 0.5 * structural_score
The equal weighting matters conceptually. It says that structural importance is not just a small reranking feature. It is a first-class retrieval signal.
Structural score
Structural relevance is based on two main factors:
- the type of edge through which the node was reached
- the hop distance from the anchor
The score decays with distance, using a hop-decay factor of about 0.75 per hop.
Different edge types receive different weights:
- `REFERENCES` = 1.0
- `DEFINES` = 0.9
- `PARENT_OF` = 0.55
Anchors themselves are treated specially and receive a structural score of 1.0.
This ranking design captures an important practical truth. A section reached via an explicit legal reference may be more valuable than a section that merely looks semantically similar. Likewise, a definition section may deserve to outrank a loosely related paragraph because it provides the interpretation layer needed for the answer.
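Putting the numbers above together (the 0.5/0.5 blend, the 0.75 hop decay, the edge weights, and the anchor score of 1.0 are all as stated; the function shape itself is a sketch):

```python
HOP_DECAY = 0.75
EDGE_WEIGHT = {"REFERENCES": 1.0, "DEFINES": 0.9, "PARENT_OF": 0.55}

def structural_score(edge_kind: str, hops: int) -> float:
    """Anchors (hops == 0) score 1.0; traversed nodes decay per hop."""
    if hops == 0:
        return 1.0
    return EDGE_WEIGHT[edge_kind] * (HOP_DECAY ** hops)

def combined_score(semantic_similarity: float, edge_kind: str, hops: int) -> float:
    return 0.5 * semantic_similarity + 0.5 * structural_score(edge_kind, hops)

# A definition one hop away can outrank a loosely similar section two hops
# down the hierarchy, even with much lower raw similarity:
defn = combined_score(0.40, "DEFINES", hops=1)     # structural 0.675
loose = combined_score(0.70, "PARENT_OF", hops=2)  # structural ~0.309
```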
The Real Engineering Trick: Preventing Graph Nodes from Getting Ranked Out
A hybrid retrieval system can still fail even if the graph is correct.
The reason is ranking pressure.
Graph expansion often produces a large candidate set. A handful of anchor nodes can fan out into dozens of structurally related sections. But the final context budget is limited. The model may only receive ten or twelve pieces of evidence.
If you simply rank everything by the combined score, semantically strong anchors can still dominate the final set. The graph-walk successfully discovers the right nodes, but the ranking layer drops them.
This is a subtle but crucial failure mode.
The graph worked.
The retrieval result still failed.
EngramDB addresses this using reserved traversal slots.
The idea is simple:
- reserve part of the result budget for anchor nodes
- reserve another part for graph-discovered nodes
Within the graph-reserved portion, candidates are prioritized by edge-type tier and then by semantic similarity.
This prevents a large set of high-similarity local sections from completely displacing the structurally important evidence discovered through traversal.
That design choice is one of the most practically important parts of the system. Without it, many hybrid retrievers quietly collapse back into vector-only behavior at the final ranking stage.
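A sketch of reserved-slot selection. The budget split and the tier ordering below are illustrative parameters, not the system's exact configuration:

```python
# Lower tier number = higher priority within the graph-reserved slots.
EDGE_TIER = {"REFERENCES": 0, "DEFINES": 1, "PARENT_OF": 2}

def select_context(anchors, traversed, budget=10, anchor_slots=5):
    """anchors: [(node_id, similarity)]; traversed: [(node_id, similarity, edge_kind)]."""
    # Fill the anchor-reserved slots by similarity.
    chosen = [n for n, _ in sorted(anchors, key=lambda x: -x[1])[:anchor_slots]]
    # Fill the remaining slots from graph-discovered nodes,
    # ranked by edge-type tier first, then similarity.
    ranked = sorted(traversed, key=lambda x: (EDGE_TIER[x[2]], -x[1]))
    for node, _, _ in ranked:
        if len(chosen) >= budget:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen

# Hypothetical candidates: two strong anchors, three traversed nodes
# with much weaker raw similarity.
anchors = [("sec_9_term", 0.91), ("sec_9_intro", 0.88)]
traversed = [
    ("def_cause", 0.41, "DEFINES"),
    ("sec_8_2_notice", 0.35, "REFERENCES"),
    ("art_9_parent", 0.30, "PARENT_OF"),
]
ctx = select_context(anchors, traversed, budget=4, anchor_slots=2)
```

Note that the low-similarity `REFERENCES` and `DEFINES` nodes survive the cut; under a pure combined-score ranking with a tight budget, they could easily have been displaced.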
Failure Modes and What They Teach Us
A retrieval system becomes much more useful once its failure modes are explicit.
Two categories mattered most in EngramDB.
1. Traversed but ranked out
In this case, the graph expansion discovers the required section, but the final ranking stage excludes it from the top context window.
This tends to happen in dense graph neighborhoods where many candidates compete for a small number of slots.
Mitigations include:
- reserved traversal slots
- stronger edge-aware ranking
- reranking within structural tiers
2. Not reachable from anchors
A different failure happens when the initial vector anchors are poor.
This is especially common for questions that reference section numbers directly, such as:
What does Section 6 reference?
That kind of query does not necessarily produce useful semantic anchors, because the important signal is not meaning in the embedding sense. It is explicit document metadata.
A natural mitigation is metadata-aware anchor injection. If the query contains a recognizable section pattern, the corresponding section can be inserted directly into the anchor set before graph expansion starts.
This is a good reminder that in structured domains, metadata is not a workaround. It is part of the retrieval representation.
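A sketch of that injection step, assuming a `section_index` mapping from section numbers to node ids (the index shape and function are illustrative):

```python
import re

# If the query names a section explicitly, inject that section into the
# anchor set before graph expansion starts.
SECTION_QUERY_RE = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

def inject_anchors(query, vector_anchors, section_index):
    anchors = list(vector_anchors)
    for m in SECTION_QUERY_RE.finditer(query):
        node = section_index.get(m.group(1))
        if node and node not in anchors:
            anchors.insert(0, node)  # an explicit metadata match outranks similarity
    return anchors

index = {"6": "sec_6", "6.1": "sec_6_1"}
anchors = inject_anchors("What does Section 6 reference?", ["sec_12_misc"], index)
```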
Benchmark Results: Where the Hybrid Approach Actually Wins
The central hypothesis behind EngramDB is that document-native structure provides retrieval signal that vector similarity alone does not fully capture, especially for multi-hop questions.
The reported benchmark evaluates this hypothesis on contract QA using CUAD documents.
The setup described in the project materials includes:
- 183 multi-hop questions
- 35 contracts
- questions requiring retrieval of two to three structurally linked sections
The reported results for hybrid retrieval versus vector-only retrieval are substantial.
Overall recall
- Hybrid: 92.8%
- Vector-only: 68.2%

Two-hop recall

- Hybrid: 97.6%
- Vector-only: 66.8%

Three-hop recall

- Hybrid: 86.5%
- Vector-only: 70.0%
The largest gains appear in exactly the places you would expect if structure is the missing signal:
- cross-reference queries
- termination chains
- definition-linked questions
These are the scenarios where the answer depends on sections that are explicitly connected but not necessarily semantically local.
That is important because it suggests the system is not just benefiting from more retrieval. It is benefiting from the right retrieval bias.
Why This Matters Beyond Legal Tech
Although EngramDB was motivated by contract QA, the deeper lesson is broader.
A lot of retrieval research is centered on semantic representation:
- better embeddings
- better rerankers
- larger context windows
- fusion strategies across retrievers
All of those matter.
But EngramDB highlights another axis of improvement that is often underused: the internal structure of the document itself.
In domains where meaning is distributed across linked sections, structure is not auxiliary. It is part of relevance.
That makes this style of retrieval interesting beyond legal documents.
The same pattern appears in:
- compliance manuals
- regulatory filings
- financial disclosures
- technical specifications
- API documents with defined terms and references
- scientific papers with structured section dependencies
- policy frameworks and governance documents
Any domain where answers live in chains rather than paragraphs is a candidate for graph-aware retrieval.
What Building AgreedPro Made Clear
The most important lesson from this work is not that vector retrieval is wrong.
Vector retrieval is extremely effective at finding where the answer starts.
The lesson is that, in structured domains, that is often not enough.
Contract meaning is compositional. Definitions scope terms. Exceptions alter clauses. Parent sections constrain children. Cross-references relocate meaning. A retrieval system that only optimizes for semantic proximity will often return text that is relevant, but not sufficient.
EngramDB is an attempt to align the retrieval system with the way contracts actually encode meaning.
Instead of assuming the answer is a chunk, it assumes the answer is a connected set of sections.
That difference changes everything.
A Broader Principle for Retrieval System Design
If there is one general principle I would take from this work, it is this:
Retrieval should optimize for reasoning completeness, not just semantic similarity.
That principle has practical consequences.
It means you should think seriously about:
- how documents are segmented
- what structural signals are preserved during ingestion
- whether explicit links are represented as edges
- how ranking protects graph-discovered evidence
- when metadata should override embedding-only assumptions
Once you acknowledge that many real-world questions are multi-hop, it becomes much harder to justify retrieval systems that behave as if the answer were always a single best chunk.
Final Thoughts
What began as a practical problem inside AgreedPro became a deeper retrieval question.
The limitation was not that language models could not answer contract questions. The limitation was that we were often handing them incomplete evidence and asking them to reconstruct relationships the document had already made explicit.
EngramDB came out of trying to make those relationships first-class.
By combining vector search with graph traversal, by using deterministic structure extraction during ingestion, and by scoring candidates with both semantic and structural signals, the system moves closer to the real shape of contract reasoning.
Once you see contracts that way, the original failure mode becomes hard to ignore.
The answer was never in one chunk.
It was in the links between them.
The source code can be found here