Why We Moved Beyond Vector Search for Contract QA
When we started building AgreedPro, one of the core technical questions seemed almost ordinary: how do you answer questions over contracts using modern AI retrieval systems?
At first, the answer felt obvious.
Use RAG.
Chunk the contract. Embed the chunks. Retrieve the top matches. Pass them to a language model. Let the model generate the answer.
That pipeline is familiar because, in many domains, it works. It works well on documentation, FAQs, product manuals, internal wikis, and other corpora where relevant information is typically localized. A question points to a paragraph, a section, or a small cluster of nearby passages. Retrieval is largely a similarity problem.
Contracts are different.
That difference was not immediately obvious when we were building early versions of AgreedPro. The model often produced answers that looked right. They were well-phrased, coherent, and aligned with the wording of the contract section that had been retrieved.
But once we started checking those answers closely against the full document, a recurring pattern emerged.
The model was not hallucinating in the usual sense.
It was answering from incomplete context.
That failure mode turned out to be much more important than it sounds. In legal documents, incomplete context is often the difference between a correct answer and a misleading one. That realization is what led to EngramDB.
This article is a technical deep dive into the retrieval problem we encountered while building AgreedPro, why vector-only search was mismatched to contract reasoning, and how EngramDB emerged as a hybrid graph-plus-vector retrieval system designed for multi-hop legal reasoning.
The Problem Looked Solved Until We Tried to Solve It
The initial retrieval pipeline was standard.
- Parse the contract into chunks.
- Embed each chunk.
- Embed the user query.
- Retrieve the most similar chunks.
- Ask the language model to answer from the retrieved context.
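The five steps above can be sketched end to end. To keep the snippet self-contained, the embedding here is a toy bag-of-words counter with cosine similarity; a real pipeline would call an embedding model instead.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch is self-contained;
    # a real pipeline would call an embedding model here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Termination. Either party may terminate this Agreement for Cause; "
    "the Agreement is terminated upon written notice.",
    "Confidential Information shall be protected for five years.",
    "Payment is due within thirty days of invoice.",
]
top = retrieve("Under what conditions can this agreement be terminated?", chunks)
```

The retrieved context (`top`) is then pasted into the prompt for the language model, which is exactly where the incompleteness problem described below enters.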
For a while, this looked fine.
Then we started asking realistic legal questions.
A representative example was this:
Under what conditions can this agreement be terminated?
A vector-based retriever would often return the section titled something like Termination. That seemed perfectly reasonable. The wording of the section was directly relevant to the query.
But when we inspected the full contract, the real answer was often distributed across several places:
- the termination clause itself
- a definitions section that defines a term like "Cause"
- a different section containing exceptions, limitations, or notice requirements
- one or more cross-referenced provisions that affect interpretation
The retriever found the clause that looked most relevant, but not the sections that made the clause fully interpretable.
The language model then generated an answer that sounded clean and plausible, but it had only seen one part of the reasoning chain.
This is one of the most dangerous failure modes in contract intelligence: not obviously wrong, not obviously invented, just incomplete in a way that can change the legal meaning.
The Core Mismatch: Similarity Does Not Equal Completeness
To understand why this keeps happening, it helps to be precise about what dense retrieval is optimizing for.
In a standard vector retrieval pipeline, you embed the query and the chunks, compute similarity scores, and return the top results. In simple terms, the retriever is asking:
Which passages look most semantically similar to the question?
That objective is useful, but it is not the same as the one we actually care about in contracts.
In contract QA, the real question is often closer to this:
Which sections, taken together, are necessary to answer this question correctly?
Those are not the same objective.
A clause that defines a key term may share very little wording with the query. A referenced section may be critical to the answer and still rank poorly in a pure embedding space. A parent section may scope the meaning of a child subsection without repeating the child’s language at all.
So the retrieval problem is not simply one of semantic relevance. It is a problem of reconstructing a reasoning path across document structure.
That is where vector-only retrieval begins to fail.
Why Contracts Behave More Like Graphs Than Like Flat Text
The biggest conceptual shift while building AgreedPro was realizing that contracts should not be modeled primarily as flat text.
They behave much more like structured graphs.
A contract has at least three kinds of first-class structure.
Hierarchical structure
Contracts are organized into articles, sections, subsections, schedules, appendices, and nested numbered clauses. A child clause often depends on the scope introduced by its parent. If you detach a subsection from its surrounding article, you can easily lose the context that gives it meaning.
Definitional structure
Contracts rely heavily on defined terms. Words like "Cause," "Confidential Information," "Affiliate," or "Change of Control" are often defined once and reused throughout the document. The place where a term is used is rarely the place where its meaning is established.
Referential structure
Contracts constantly point to themselves. They say things like "subject to Section 8.2," "as provided in Article III," or "except as otherwise set forth in Section 5.4." These are explicit navigational edges inside the document.
Once you start treating those signals as part of the representation, a contract naturally becomes a graph of connected sections rather than a bag of isolated chunks.
That framing is the foundation of EngramDB.
What EngramDB Is
EngramDB is a schema-aware hybrid retrieval system for structured documents, built around the kinds of retrieval failures that showed up while building AgreedPro.
At a high level, it combines two mechanisms:
- Vector retrieval to find semantically relevant starting points
- Graph traversal to expand from those starting points into structurally related sections
The key idea is not that embeddings are bad. It is that embeddings are being asked to do too much when the answer depends on multiple linked sections.
Embeddings are very good at finding where the answer might start.
They are not, by themselves, a reliable way to find everything the answer depends on.
EngramDB treats each document section as a node, called an Engram, and each structural relationship as an edge, called a Synapse. Those edges are extracted from the document itself during ingestion.
So instead of hoping the embedding model implicitly captures the document’s legal structure, the system makes that structure explicit.
Deterministic Ingestion: Structure Extraction Without LLM Calls
One of the most important design decisions in EngramDB is that ingestion is rule-based.
That decision matters for both engineering and research reasons.
From an engineering perspective, rule-based extraction is predictable, reproducible, cheaper to run, and easier to debug. If a section boundary is wrong or a reference fails to resolve, you can inspect the pattern and fix the pipeline.
From a research perspective, the hypothesis is that contracts already expose a large amount of high-quality structure for free. Section numbering, heading patterns, quotation conventions, and cross-references are not weak hints. They are central features of the document.
EngramDB takes advantage of that.
Section parsing
The first stage identifies headings and section boundaries using regular expressions and structural cues. The goal is to segment the contract into units that are legally meaningful while preserving hierarchy.
For example, the parser should understand that:
- Article IV is a parent scope
- Section 4.2 belongs under that article
- nested clauses inherit context from the enclosing section
This is essential because a subsection often cannot be interpreted correctly without its parent context.
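A minimal sketch of this stage, using illustrative heading patterns (these regexes are examples, not EngramDB's actual rules). The depth of a section follows its dotted numbering, which is what lets child clauses inherit parent scope:

```python
import re

# Illustrative heading patterns (assumed, not EngramDB's exact rules):
ARTICLE_RE = re.compile(r"^ARTICLE\s+([IVXLC]+)\b", re.IGNORECASE)
SECTION_RE = re.compile(r"^(?:Section\s+)?(\d+(?:\.\d+)*)\s", re.IGNORECASE)

def parse_headings(lines):
    """Yield (kind, number, depth) for each heading line found."""
    for line in lines:
        m = ARTICLE_RE.match(line)
        if m:
            yield ("article", m.group(1), 0)
            continue
        m = SECTION_RE.match(line)
        if m:
            number = m.group(1)
            # Depth follows the dotted numbering: 4.2 -> 1, 4.2.1 -> 2
            yield ("section", number, number.count("."))

doc = [
    "ARTICLE IV TERMINATION",
    "Section 4.1 Termination for Convenience.",
    "Section 4.2 Termination for Cause.",
    "4.2.1 Notice requirements.",
]
headings = list(parse_headings(doc))
```

The `(kind, number, depth)` tuples are enough to rebuild the parent chain: each section attaches to the most recent heading with a smaller depth.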
Definition extraction
The next stage detects defined terms using patterns such as:
- "Term" means ...
- "Term" shall mean ...
- The term "X" means ...
This allows the system to map terms back to their defining sections and to link uses of those terms elsewhere in the document.
That means a query about termination for Cause does not have to rely on embeddings alone to discover the definition of Cause.
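A sketch of the pattern matching for the first two forms (real contract drafting has many more variants, so this regex is illustrative only):

```python
import re

# Matches '"Term" means ...' and '"Term" shall mean ...';
# an illustrative pattern, not an exhaustive one.
DEFN_RE = re.compile(r'"(?P<term>[^"]+)"\s+(?:shall\s+)?means?\b', re.IGNORECASE)

text = (
    '"Cause" means a material breach of this Agreement. '
    '"Confidential Information" shall mean any non-public information.'
)
terms = [m.group("term") for m in DEFN_RE.finditer(text)]
```

Each match links the defining section (where the match occurs) to every later section whose text uses the defined term.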
Cross-reference extraction
The ingestion pipeline also extracts explicit references such as:
- Section 4.2
- Section 8.1(b)
- Article III
- Schedule A
When those references can be resolved, EngramDB creates edges between the source and target sections.
This is one of the most valuable signals in legal documents because the contract itself is telling you which sections are meant to be read together.
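A sketch of the reference patterns, again illustrative rather than exhaustive (real contracts also need ranges, "Sections 5 and 6", multi-level clause letters, and so on):

```python
import re

# Illustrative cross-reference patterns for the forms listed above.
XREF_RE = re.compile(
    r"Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?"
    r"|Article\s+[IVXLC]+"
    r"|Schedule\s+[A-Z]"
)

clause = (
    "Subject to Section 8.1(b) and except as set forth in Article III, "
    "the provisions of Schedule A apply."
)
refs = XREF_RE.findall(clause)
```

Each extracted reference is then resolved against the section index built during parsing; only resolvable references become edges.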
Graph construction
After extraction, the document is represented as a graph consisting of:
- Engrams: nodes representing sections
- Synapses: typed edges representing relationships
The primary relationship types include:
- `PARENT_OF` for hierarchy
- `DEFINES` for term-definition links
- `REFERENCES` for explicit cross-references
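A minimal in-memory sketch of this node/edge model. The names `Engram` and `Synapse` come from the article, but the fields and methods here are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Engram:
    """A node: one document section."""
    id: str
    heading: str
    text: str

@dataclass
class Synapse:
    """A typed edge between two sections."""
    source: str
    target: str
    kind: str  # "PARENT_OF", "DEFINES", or "REFERENCES"

@dataclass
class DocumentGraph:
    engrams: dict = field(default_factory=dict)
    synapses: list = field(default_factory=list)

    def add_engram(self, e: Engram) -> None:
        self.engrams[e.id] = e

    def add_synapse(self, s: Synapse) -> None:
        self.synapses.append(s)

    def neighbors(self, node_id: str) -> list:
        """Outgoing (kind, target) pairs for a node."""
        return [(s.kind, s.target) for s in self.synapses if s.source == node_id]

g = DocumentGraph()
g.add_engram(Engram("4", "Article IV", "Termination"))
g.add_engram(Engram("4.2", "Section 4.2", "Termination for Cause..."))
g.add_synapse(Synapse("4", "4.2", "PARENT_OF"))
```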
Embedding generation
Each section can also be embedded using a pluggable backend. In the current implementation, the supported backends include:
- `mock` for deterministic tests and development
- `openai` for hosted embeddings
- `local` for sentence-transformers-based local embeddings
This separation is deliberate. Structure extraction is deterministic. Embeddings are an additional signal, not the source of the graph.
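One way to express that pluggability is a structural interface. The backend names come from the article, but this `Protocol` and the mock's byte-sum hashing scheme are assumptions for illustration:

```python
from typing import Protocol

class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class MockBackend:
    """Deterministic vectors derived from byte sums, for tests.

    Deliberately meaningless semantically; the point is only that the
    same text always maps to the same vector, with no network calls.
    """
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            seed = sum(text.encode("utf-8")) + 1
            vectors.append([((seed * (i + 1)) % 97) / 96.0 for i in range(self.dim)])
        return vectors

backend: EmbeddingBackend = MockBackend()
vec = backend.embed(["Termination for Cause"])[0]
```

An `openai` or `local` backend would satisfy the same interface, so the graph and retrieval code never need to know which one is in use.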
Why DuckDB Is a Good Fit
A lot of retrieval systems get operationally heavy before the retrieval logic is even stable. Separate vector stores, graph databases, metadata stores, and orchestration layers appear early, often making iteration harder rather than easier.
EngramDB takes a simpler route.
Everything is stored in a single DuckDB file:
- section metadata
- graph edges
- embeddings
- retrieval artifacts
This has several practical advantages.
First, it keeps the system local-first and easy to reproduce. Second, it allows structured querying over nodes and edges without introducing another database layer. Third, it makes benchmarking, debugging, and inspection much easier when the system is still evolving.
For an experimental hybrid retrieval engine, that tradeoff is extremely appealing.
The Retrieval Pipeline: Vector Search for Anchors, Graph Walk for Context
The retrieval pipeline in EngramDB has two stages.
Stage 1: Anchor retrieval
Given a natural-language query, the system embeds the query and retrieves the top-k semantically similar sections.
These top-k results are treated as anchor nodes.
This is where vector search does what it is good at. It finds the parts of the document that look closest to the user’s question.
If the query is about termination, the anchor set often includes a termination clause. If the query is about confidentiality, the anchor set often includes confidentiality-related sections.
But anchors alone are not enough.
Stage 2: Graph expansion
From each anchor, EngramDB expands outward through the graph for a bounded number of hops, usually one to three.
That graph walk can recover:
- a definition linked to a term used in the anchor
- a referenced provision that adds an exception or condition
- a parent section that scopes the meaning of the anchor
- neighboring clauses that complete the obligation or permission being discussed
This is the step that turns retrieval from a semantic lookup problem into a multi-hop reasoning support system.
Vector retrieval finds entry points.
Graph traversal finds connected evidence.
That distinction is the core of the system.
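The candidate-gathering half of the two stages can be sketched as a bounded breadth-first walk over a toy edge list (the node names and edges here are invented for illustration; the real system then scores and ranks what the walk returns):

```python
from collections import deque

# Toy graph: node -> [(edge_kind, neighbor)]
EDGES = {
    "termination": [("REFERENCES", "notice"), ("DEFINES", "cause_def")],
    "notice": [("REFERENCES", "cure_period")],
    "cause_def": [],
    "cure_period": [],
}

def expand(anchors: list[str], max_hops: int = 2) -> dict:
    """Bounded breadth-first walk outward from the anchor set."""
    seen = {a: 0 for a in anchors}  # node -> hop distance from nearest anchor
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        if seen[node] >= max_hops:
            continue  # hop budget exhausted along this path
        for _, nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

# Stage 1 would produce the anchors via vector search; here we hard-code one.
candidates = expand(["termination"], max_hops=2)
```

The hop distances recorded here feed directly into the structural scoring described next.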
Scoring: Blending Semantic and Structural Relevance
Once the candidate set is assembled, the system still has to rank it.
EngramDB scores each candidate using a blend of semantic similarity and structural relevance.
In the current design, the score is computed as:
score = 0.5 * semantic_similarity + 0.5 * structural_score
The equal weighting matters conceptually. It says that structural importance is not just a small reranking feature. It is a first-class retrieval signal.
Structural score
Structural relevance is based on two main factors:
- the type of edge through which the node was reached
- the hop distance from the anchor
The score decays with distance, using a hop-decay factor of about 0.75 per hop.
Different edge types receive different weights:
- `REFERENCES` = 1.0
- `DEFINES` = 0.9
- `PARENT_OF` = 0.55
Anchors themselves are treated specially and receive a structural score of 1.0.
This ranking design captures an important practical truth. A section reached via an explicit legal reference may be more valuable than a section that merely looks semantically similar. Likewise, a definition section may deserve to outrank a loosely related paragraph because it provides the interpretation layer needed for the answer.
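Putting the numbers above together (the 0.5/0.5 blend, the 0.75 hop decay, the edge weights, and the anchor score of 1.0 are all as stated; the function shape itself is a sketch):

```python
HOP_DECAY = 0.75
EDGE_WEIGHT = {"REFERENCES": 1.0, "DEFINES": 0.9, "PARENT_OF": 0.55}

def structural_score(edge_kind: str, hops: int) -> float:
    """Anchors (hops == 0) score 1.0; traversed nodes decay per hop."""
    if hops == 0:
        return 1.0
    return EDGE_WEIGHT[edge_kind] * (HOP_DECAY ** hops)

def combined_score(semantic_similarity: float, edge_kind: str, hops: int) -> float:
    return 0.5 * semantic_similarity + 0.5 * structural_score(edge_kind, hops)

# A definition one hop away can outrank a loosely similar section two hops
# down the hierarchy, even with much lower raw similarity:
defn = combined_score(0.40, "DEFINES", hops=1)     # structural 0.675
loose = combined_score(0.70, "PARENT_OF", hops=2)  # structural ~0.309
```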
The Real Engineering Trick: Preventing Graph Nodes from Getting Ranked Out
A hybrid retrieval system can still fail even if the graph is correct.
The reason is ranking pressure.
Graph expansion often produces a large candidate set. A handful of anchor nodes can fan out into dozens of structurally related sections. But the final context budget is limited. The model may only receive ten or twelve pieces of evidence.
If you simply rank everything by the combined score, semantically strong anchors can still dominate the final set. The graph-walk successfully discovers the right nodes, but the ranking layer drops them.
This is a subtle but crucial failure mode.
The graph worked.
The retrieval result still failed.
EngramDB addresses this using reserved traversal slots.
The idea is simple:
- reserve part of the result budget for anchor nodes
- reserve another part for graph-discovered nodes
Within the graph-reserved portion, candidates are prioritized by edge-type tier and then by semantic similarity.
This prevents a large set of high-similarity local sections from completely displacing the structurally important evidence discovered through traversal.
That design choice is one of the most practically important parts of the system. Without it, many hybrid retrievers quietly collapse back into vector-only behavior at the final ranking stage.
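A sketch of reserved-slot selection. The budget split and the tier ordering below are illustrative parameters, not the system's exact configuration:

```python
# Lower tier number = higher priority within the graph-reserved slots.
EDGE_TIER = {"REFERENCES": 0, "DEFINES": 1, "PARENT_OF": 2}

def select_context(anchors, traversed, budget=10, anchor_slots=5):
    """anchors: [(node_id, similarity)]; traversed: [(node_id, similarity, edge_kind)]."""
    # Fill the anchor-reserved slots by similarity.
    chosen = [n for n, _ in sorted(anchors, key=lambda x: -x[1])[:anchor_slots]]
    # Fill the remaining slots from graph-discovered nodes,
    # ranked by edge-type tier first, then similarity.
    ranked = sorted(traversed, key=lambda x: (EDGE_TIER[x[2]], -x[1]))
    for node, _, _ in ranked:
        if len(chosen) >= budget:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen

# Hypothetical candidates: two strong anchors, three traversed nodes
# with much weaker raw similarity.
anchors = [("sec_9_term", 0.91), ("sec_9_intro", 0.88)]
traversed = [
    ("def_cause", 0.41, "DEFINES"),
    ("sec_8_2_notice", 0.35, "REFERENCES"),
    ("art_9_parent", 0.30, "PARENT_OF"),
]
ctx = select_context(anchors, traversed, budget=4, anchor_slots=2)
```

Note that the low-similarity `REFERENCES` and `DEFINES` nodes survive the cut; under a pure combined-score ranking with a tight budget, they could easily have been displaced.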
Failure Modes and What They Teach Us
A retrieval system becomes much more useful once its failure modes are explicit.
Two categories mattered most in EngramDB.
1. Traversed but ranked out
In this case, the graph expansion discovers the required section, but the final ranking stage excludes it from the top context window.
This tends to happen in dense graph neighborhoods where many candidates compete for a small number of slots.
Mitigations include:
- reserved traversal slots
- stronger edge-aware ranking
- reranking within structural tiers
2. Not reachable from anchors
A different failure happens when the initial vector anchors are poor.
This is especially common for questions that reference section numbers directly, such as:
What does Section 6 reference?
That kind of query does not necessarily produce useful semantic anchors, because the important signal is not meaning in the embedding sense. It is explicit document metadata.
A natural mitigation is metadata-aware anchor injection. If the query contains a recognizable section pattern, the corresponding section can be inserted directly into the anchor set before graph expansion starts.
This is a good reminder that in structured domains, metadata is not a workaround. It is part of the retrieval representation.
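A sketch of that injection step, assuming a `section_index` mapping from section numbers to node ids (the index shape and function are illustrative):

```python
import re

# If the query names a section explicitly, inject that section into the
# anchor set before graph expansion starts.
SECTION_QUERY_RE = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

def inject_anchors(query, vector_anchors, section_index):
    anchors = list(vector_anchors)
    for m in SECTION_QUERY_RE.finditer(query):
        node = section_index.get(m.group(1))
        if node and node not in anchors:
            anchors.insert(0, node)  # an explicit metadata match outranks similarity
    return anchors

index = {"6": "sec_6", "6.1": "sec_6_1"}
anchors = inject_anchors("What does Section 6 reference?", ["sec_12_misc"], index)
```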
Benchmark Results: Where the Hybrid Approach Actually Wins
The central hypothesis behind EngramDB is that document-native structure provides retrieval signal that vector similarity alone does not fully capture, especially for multi-hop questions.
The reported benchmark evaluates this hypothesis on contract QA using CUAD documents.
The setup described in the project materials includes:
- 183 multi-hop questions
- 35 contracts
- questions requiring retrieval of two to three structurally linked sections
The reported results for hybrid retrieval versus vector-only retrieval are substantial.
Overall recall
- Hybrid: 92.8%
- Vector-only: 68.2%

Two-hop recall

- Hybrid: 97.6%
- Vector-only: 66.8%

Three-hop recall

- Hybrid: 86.5%
- Vector-only: 70.0%
The largest gains appear in exactly the places you would expect if structure is the missing signal:
- cross-reference queries
- termination chains
- definition-linked questions
These are the scenarios where the answer depends on sections that are explicitly connected but not necessarily semantically local.
That is important because it suggests the system is not just benefiting from more retrieval. It is benefiting from the right retrieval bias.
Why This Matters Beyond Legal Tech
Although EngramDB was motivated by contract QA, the deeper lesson is broader.
A lot of retrieval research is centered on semantic representation:
- better embeddings
- better rerankers
- larger context windows
- fusion strategies across retrievers
All of those matter.
But EngramDB highlights another axis of improvement that is often underused: the internal structure of the document itself.
In domains where meaning is distributed across linked sections, structure is not auxiliary. It is part of relevance.
That makes this style of retrieval interesting beyond legal documents.
The same pattern appears in:
- compliance manuals
- regulatory filings
- financial disclosures
- technical specifications
- API documents with defined terms and references
- scientific papers with structured section dependencies
- policy frameworks and governance documents
Any domain where answers live in chains rather than paragraphs is a candidate for graph-aware retrieval.
What Building AgreedPro Made Clear
The most important lesson from this work is not that vector retrieval is wrong.
Vector retrieval is extremely effective at finding where the answer starts.
The lesson is that, in structured domains, that is often not enough.
Contract meaning is compositional. Definitions scope terms. Exceptions alter clauses. Parent sections constrain children. Cross-references relocate meaning. A retrieval system that only optimizes for semantic proximity will often return text that is relevant, but not sufficient.
EngramDB is an attempt to align the retrieval system with the way contracts actually encode meaning.
Instead of assuming the answer is a chunk, it assumes the answer is a connected set of sections.
That difference changes everything.
A Broader Principle for Retrieval System Design
If there is one general principle I would take from this work, it is this:
Retrieval should optimize for reasoning completeness, not just semantic similarity.
That principle has practical consequences.
It means you should think seriously about:
- how documents are segmented
- what structural signals are preserved during ingestion
- whether explicit links are represented as edges
- how ranking protects graph-discovered evidence
- when metadata should override embedding-only assumptions
Once you acknowledge that many real-world questions are multi-hop, it becomes much harder to justify retrieval systems that behave as if the answer were always a single best chunk.
Final Thoughts
What began as a practical problem inside AgreedPro became a deeper retrieval question.
The limitation was not that language models could not answer contract questions. The limitation was that we were often handing them incomplete evidence and asking them to reconstruct relationships the document had already made explicit.
EngramDB came out of trying to make those relationships first-class.
By combining vector search with graph traversal, by using deterministic structure extraction during ingestion, and by scoring candidates with both semantic and structural signals, the system moves closer to the real shape of contract reasoning.
Once you see contracts that way, the original failure mode becomes hard to ignore.
The answer was never in one chunk.
It was in the links between them.
The source code can be found here