Improving Enterprise Multi-Hop Question Answering using Hybrid GraphReasoner

Ayub Abu zer — Mon, 29 Jun 2026 09:35:10 +0000

TL;DR

General large language models lack awareness of enterprise-specific data relationships.

I built a graph-grounded LLM reasoning system that transforms a synthetic supply-chain dataset into a knowledge graph and enables structured querying via the model. By holding GPT-4o constant across both conditions, I isolated the impact of graph-based retrieval.

On complex multi-hop reasoning tasks requiring joins across multiple entity relationships, accuracy improved from 0% (no graph) to 80% (graph-grounded system).

The problem: capable models that don't know your business

Large language models (LLMs) are excellent general reasoners, but they have no knowledge of an organization's specific, structured reality: its suppliers, contracts, products, customers, and the relationships between them.

Ask a base model "which of our customers are exposed to high-risk suppliers?" and it may either hedge or, more problematically, hallucinate a confident but incorrect answer.

The standard fix is retrieval-augmented generation (RAG): embed documents into a vector store and retrieve the most similar chunks for each question. This works well for lookup questions whose answer lives in a single paragraph. It breaks down for multi-hop questions whose answer is spread across several connected facts:

"Which customers buy products that contain a component made by a supplier whose contract is flagged high-risk?"

That answer spans four different relationships. Vector similarity has no notion of a join, so it retrieves disconnected chunks and the model is left to guess the connections, which is a primary source of hallucination in enterprise settings.

In enterprise settings, incorrect multi-hop reasoning is not just a hallucination problem; it is a financial and compliance risk.

The idea: put the relationships where they belong, in a graph

Businesses already describe their world in terms of relationships: a supplier supplies a product, a component is part of a product, a product is sold to a customer.

A knowledge graph stores those relationships as first-class connections, so a multi-hop question becomes a guided traversal instead of a lucky search.

I designed the system to transform enterprise structured data into a knowledge graph that grounds LLM reasoning through structured traversal instead of pure retrieval.

In a little more detail, the approach has three steps:

1. Model the business data as a graph (Neo4j): entities become nodes, and relationships become the edges between them.
2. At query time, convert each question into a precise graph query; the graph returns exactly the facts that matter for that question.
3. Let the model answer strictly from those facts and nothing else.
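
The three steps can be read as a small pipeline. The sketch below is a minimal illustration in Python, not the implementation behind this article: the `GroundedAnswer` type, the `answer_question` function and its three injected callables are names assumed for the example. In a real deployment, `generate_cypher` and `compose_answer` would call the LLM and `run_query` would call Neo4j.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass
class GroundedAnswer:
    """An answer together with the evidence it was built from."""
    question: str
    cypher: str       # the graph query that was run (provenance)
    rows: list[Row]   # the facts the graph returned
    text: str         # the answer shown to the user


def answer_question(
    question: str,
    schema: str,
    generate_cypher: Callable[[str, str], str],
    run_query: Callable[[str], list[Row]],
    compose_answer: Callable[[str, list[Row]], str],
) -> GroundedAnswer:
    # Turn the question into a graph query, using the graph schema as context.
    cypher = generate_cypher(question, schema)
    # The graph returns only the facts that match the query.
    rows = run_query(cypher)
    # The answer is composed from those rows and nothing else.
    text = compose_answer(question, rows)
    return GroundedAnswer(question, cypher, rows, text)
```

Returning the query and rows next to the answer text is what keeps every answer traceable: a reviewer can inspect `cypher` and `rows` without re-running anything.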

The result is an answer that is accurate, cheap (only the relevant sub-graph is sent to the model) and, crucially for enterprise adoption, explainable: you can point at the exact Cypher query and rows behind every answer.

Nothing in that loop is hard-coded to a specific domain: replacing the dataset leaves the execution engine unchanged, because the Cypher is generated from whatever schema Neo4j reports.
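
One way to picture schema-driven generation is a prompt builder that knows nothing about supply chains and only renders whatever schema it is handed. This is a hedged sketch: `describe_schema`, `build_cypher_prompt` and the prompt wording are assumptions for illustration, and how the triples are read from Neo4j is left out. The two relationship types used here are the ones visible in this article's example queries.

```python
def describe_schema(relationships: list[tuple[str, str, str]]) -> str:
    """Render (start_label, rel_type, end_label) triples as Cypher patterns."""
    return "\n".join(
        f"(:{start})-[:{rel}]->(:{end})" for start, rel, end in relationships
    )


def build_cypher_prompt(question: str, relationships: list[tuple[str, str, str]]) -> str:
    """Build the text sent to the model when asking it to write a query."""
    return (
        "Write one read-only Cypher query that answers the question.\n"
        "Use only these node labels and relationship types:\n"
        f"{describe_schema(relationships)}\n"
        f"Question: {question}\n"
        "Cypher:"
    )


# Relationship types taken from the example queries in this article.
SUPPLY_CHAIN = [
    ("Supplier", "PRODUCES", "Component"),
    ("Component", "PART_OF", "Product"),
]
prompt = build_cypher_prompt("Who produces the Lithium Cell?", SUPPLY_CHAIN)
```

Pointing the same function at another domain only means passing a different list of triples; the builder itself does not change.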

Modelling the business as a graph

To make the idea concrete (and the results verifiable) I used a deliberately generic supply-chain domain — the kind of interconnected data almost every company has in some form.
• Nodes (entities): suppliers, products, components, contracts, customers, facilities, and business terms.
• Edges (relationships): who supplies what, what a product is made of, who a product is sold to, which supplier depends on another, etc.

Figure: Knowledge-graph schema, showing the node types and the relationship types that connect them.

This is what makes the questions hard for a plain LLM: the interesting ones ("which customers are exposed to a high-risk supplier through a shared component?") trace a path across four or five of these edges. There is no single document that states the answer — it only exists as a traversal of the graph. That is precisely the structure a knowledge graph captures and flat text retrieval cannot.

Another important feature is that the business glossary lives inside the graph. Terms like "Single Point of Failure" or "Customer Exposure" are stored as their own nodes, each with a definition, an owner, and a pointer to the part of the model it governs. The organization's own vocabulary becomes queryable alongside its data, so the system answers in the language the business already uses.
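
As a rough illustration of a glossary term stored as a node, the sketch below builds a parameterised Cypher statement for one term. The `BusinessTerm` label, the property names and the `term_statement` helper are all assumptions made for this example, and the definition, owner and `governs` values are placeholders rather than the benchmark's actual glossary.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BusinessTerm:
    name: str
    definition: str
    owner: str
    governs: str  # label or relationship type the term applies to


# Parameterised Cypher: MERGE keeps the load idempotent on the term name.
UPSERT_TERM = (
    "MERGE (t:BusinessTerm {name: $name}) "
    "SET t.definition = $definition, t.owner = $owner, t.governs = $governs"
)


def term_statement(term: BusinessTerm) -> tuple[str, dict[str, str]]:
    """Return the query text and its parameters for one glossary term."""
    return UPSERT_TERM, asdict(term)


term = BusinessTerm(
    name="Single Point of Failure",
    definition="(placeholder) definition text owned by the business",
    owner="(placeholder) owning team",
    governs="Supplier",  # hypothetical pointer into the model
)
query, params = term_statement(term)
```

Because the term is an ordinary node, the same query-generation step that answers data questions can also look up what a term means and who owns it.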

From table to graph

Real enterprise data doesn't arrive as a graph. It arrives as database tables, CSVs, spreadsheets. The build phase is essentially a translation step, and it follows one simple rule:
• Entity tables become the nodes in the graph.
• Relationships between tables become the edges between them.
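
The translation rule can be sketched as two small query builders, one for entity tables and one for link tables. This is an assumption-laden illustration rather than the loader used for this article: the helper names, the `id` key column and the `src`/`dst` foreign-key columns are hypothetical. Labels and relationship types cannot be passed as Cypher parameters, so the sketch validates them before interpolating.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name: str) -> str:
    """Reject anything that is not a plain identifier before interpolating it."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def node_load_query(label: str, key: str = "id") -> str:
    """Cypher that upserts one node per row of an entity table."""
    k = _check(key)
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{_check(label)} {{{k}: row.{k}}}) "
        "SET n += row"
    )


def edge_load_query(start: str, rel: str, end: str, key: str = "id") -> str:
    """Cypher that upserts one relationship per row of a link table.

    Each row is expected to carry the two foreign keys as `src` and `dst`.
    """
    k = _check(key)
    return (
        "UNWIND $rows AS row "
        f"MATCH (a:{_check(start)} {{{k}: row.src}}) "
        f"MATCH (b:{_check(end)} {{{k}: row.dst}}) "
        f"MERGE (a)-[:{_check(rel)}]->(b)"
    )
```

For example, `node_load_query("Supplier")` would be run with the supplier table's rows bound to `$rows`, and `edge_load_query("Supplier", "PRODUCES", "Component")` with the rows of the table that links the two.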

This matters for two reasons: it mirrors how the work would actually land on a real data platform, and it means a business doesn't have to reshape its data to get started. Its existing tables are the input.

Grounding: why the answers can be trusted

The key design choice is that the model is constrained to respond only using results returned from the graph. Each answer is therefore grounded in explicit facts, allowing a human reviewer to trace and audit the source of every claim.

This approach also enables a capability that many systems lack: graceful abstention. If a query cannot be answered from the underlying data (for instance, a request for suppliers in a country where none exist), the graph simply returns no results. There is nothing to ground an answer on, and the model declines instead of inventing one. In regulated settings, "I don't have that information" is exactly the right answer, and here it happens by design rather than by luck.
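
A minimal sketch of this grounding step, under stated assumptions: the prompt wording, the `ABSTAIN` message and the function names are invented for the example, and `ask_model` stands in for whatever LLM client is used. The sketch also makes one design choice the article does not confirm, namely skipping the model call entirely when the graph returns no rows.

```python
import json
from typing import Callable, Optional

ABSTAIN = "I don't have that information."


def grounded_prompt(question: str, rows: list[dict]) -> Optional[str]:
    """Return the prompt for the answering step, or None when there is nothing to ground on."""
    if not rows:
        return None
    facts = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
    return (
        "Answer using ONLY the facts below. "
        f"If they do not answer the question, reply: {ABSTAIN}\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}"
    )


def answer(question: str, rows: list[dict], ask_model: Callable[[str], str]) -> str:
    prompt = grounded_prompt(question, rows)
    # Empty result set: abstain without calling the model at all.
    return ABSTAIN if prompt is None else ask_model(prompt)
```

Short-circuiting on an empty result set makes abstention deterministic for that case; the instruction inside the prompt covers the remaining case where rows come back but do not actually answer the question.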

Evaluation

To measure the value of the graph fairly, I ran a controlled comparison. The same model (gpt-4o) answered the same labelled question set two ways:
• Without the graph — the model on its own.
• With the graph — the same model, answering through the knowledge graph.

Holding the model constant is the whole point: the only thing that changes is graph access, so the difference is the value the graph adds — not a difference in model strength.

The metric is entity recall: of the entities the correct answer should mention, how many did the system actually get.
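
Written out, entity recall is the number of expected entities found in the answer divided by the number of expected entities. The function below is a small sketch of that calculation; the case-insensitive substring matching is an assumption of this example, and the article's final grading was done by a human rather than by string matching.

```python
def entity_recall(expected: set[str], answer: str) -> float:
    """Fraction of expected entities that appear in the answer text."""
    if not expected:
        # Unanswerable questions have no expected entities; they are graded
        # on abstention instead, so recall is left undefined here.
        raise ValueError("entity recall is undefined when no entities are expected")
    text = answer.casefold()
    found = sum(1 for entity in expected if entity.casefold() in text)
    return found / len(expected)
```

An answer that names two of four expected suppliers would score 0.5, and an answer that names none of them would score 0.0 however fluent it is.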

Figure: Same model, with vs without the knowledge graph: overall and multi-hop accuracy.

How the evaluation works

The process is deliberately simple and hard to game:

1. Generate the questions. A benchmark of 106 questions is built from the actual dataset: 82 single-hop lookups ("who produces the Lithium Cell?") and 24 multi-hop reasoning chains ("which customers are exposed to suppliers in a given country?"). They're produced from templates filled with real values, plus a set of deliberately unanswerable questions, so nothing is cherry-picked.
2. Answer each one twice: once with the graph, once without.
3. Judge the answers. A domain-aware human evaluator manually verified the correctness of each answer and abstention, rather than relying on automated string matching. Crucially, when a question genuinely can't be answered from the data, an honest "I don't know" is graded correct: refusing to fabricate is the right behaviour, not a failure.
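
Template-based generation of the question set might look roughly like the sketch below. It is illustrative only: `BenchmarkItem`, `from_template` and the `{value}` placeholder convention are assumptions, and the unanswerable example uses a made-up country name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    expected: frozenset[str]   # an empty set marks a deliberately unanswerable item
    kind: str                  # "single_hop" or "multi_hop"


def from_template(template: str, kind: str,
                  facts: dict[str, set[str]]) -> list[BenchmarkItem]:
    """Fill one template with every known value and attach the labelled answer.

    `facts` maps the value substituted into the template to the entities
    the correct answer must mention.
    """
    return [
        BenchmarkItem(template.format(value=value), frozenset(answers), kind)
        for value, answers in sorted(facts.items())
    ]


items = from_template(
    "Who produces the {value}?",
    "single_hop",
    {"Lithium Cell": {"Asia Components Co"}},
)
```

Because each item carries its expected entities from the data it was generated from, the labels do not depend on anyone's memory of what the right answer should be.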

What the numbers show

The plain model's 0% on multi-hop is the headline: with no way to perform the join, it could not correctly answer a single question whose answer is a chain of relationships. Instead, it either gave a generic essay or confidently named real-world companies that aren't part of this business at all.

Two real examples from the run make the gap concrete, including the exact query that was run against the graph to get the right answer.
A single-fact question — "Who produces the Lithium Cell?"
• With the graph (correct): "Asia Components Co produces the Lithium Cell."
• Without the graph (wrong): a list of global battery brands — CATL, LG Chem, Panasonic, Samsung SDI — none of which exist in this company's data.
The graph answer is backed by the exact query it ran (this is the provenance that makes every answer auditable):

This query is generated dynamically from the graph schema:
MATCH (s:Supplier)-[:PRODUCES]->(c:Component) WHERE toLower(c.name) CONTAINS toLower('Lithium Cell') RETURN s.name AS supplier

A multi-hop question — "Who produces the components that go into the Industrial Robot Arm?" (supplier → component → product)

• With the graph (correct): "Global Metals Ltd, Rhine Electronics GmbH, Shenzhen Microchips Inc and Andes Copper Mining" — the exact four suppliers, traced through the parts that make up the product.
• Without the graph (failed): generic robotics names (Fanuc, ABB, Siemens, Mitsubishi), plausible-sounding and entirely wrong for this business.

Here the model wrote a genuine multi-hop traversal — product back to its components, then back to the suppliers that make them — in a single query:

This query is generated dynamically from the graph schema:
MATCH (p:Product)<-[:PART_OF]-(:Component)<-[:PRODUCES]-(s:Supplier) WHERE toLower(p.name) CONTAINS toLower('Industrial Robot Arm') RETURN DISTINCT s.name AS supplier
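
To see why this is a join rather than a lookup, the same traversal can be mimicked over two in-memory dictionaries. This is an illustrative sketch only: the component names are placeholders invented for the example (the article does not list them), and `suppliers_for_product` is a hypothetical helper, not part of the system described here.

```python
# Toy edges; component-a to component-d are placeholders, not benchmark data.
PRODUCES = {            # supplier -> components it produces
    "Global Metals Ltd": {"component-a"},
    "Rhine Electronics GmbH": {"component-b"},
    "Shenzhen Microchips Inc": {"component-c"},
    "Andes Copper Mining": {"component-d"},
    "Asia Components Co": {"Lithium Cell"},
}
PART_OF = {             # component -> products it is part of
    "component-a": {"Industrial Robot Arm"},
    "component-b": {"Industrial Robot Arm"},
    "component-c": {"Industrial Robot Arm"},
    "component-d": {"Industrial Robot Arm"},
}


def suppliers_for_product(product: str) -> set[str]:
    """Two-hop join: product <- PART_OF <- component <- PRODUCES <- supplier."""
    needle = product.lower()
    # Hop 1: components that are part of a matching product.
    components = {
        component
        for component, products in PART_OF.items()
        if any(needle in p.lower() for p in products)
    }
    # Hop 2: suppliers that produce at least one of those components.
    return {
        supplier
        for supplier, made in PRODUCES.items()
        if made & components
    }
```

Returning a set plays the role of `RETURN DISTINCT`: a supplier that makes several parts of the same product still appears once. No single entry in either dictionary contains the answer, which only emerges from following both hops.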

That's the pattern across the whole benchmark: the plain model reasons fluently about the world in general but knows nothing about this organization, while the graph-grounded model answers from the company's own facts — with the query as evidence — and says "I don't know" when the data can't support an answer.
These are measured results from a synthetic benchmark, judged by a human, not projections. Because the queries are generated live, exact figures vary slightly between runs; the large, reproducible finding — graph grounding makes the model accurate on a business's own multi-hop questions — does not.

Why this matters for a business

This approach makes enterprise LLM systems practical to deploy because it improves accuracy on multi-hop queries, provides traceable answers through explicit graph queries, and enables safe failure through structured abstention. The same reasoning engine can be reused across domains by changing only the underlying data model.

Explainability you can sign off on. Every answer comes with its evidence.
That is the difference between a tool a regulated business can adopt and one it
cannot.

Honesty under uncertainty. The system abstains rather than guessing, which
turns the biggest risk of enterprise AI — confident fabrication — into a managed
behaviour.

Reusable across the business, not locked to one domain. Nothing about the
approach is specific to supply chains. To point it at a different part of the
business — finance, HR, operations, logistics — you swap in that domain's tables
and a small mapping, and the question-answering engine is untouched, because it
generates its queries from whatever data is loaded. One build, many domains:
every new area reuses the same engine instead of a fresh bespoke project.

Limitations and Practical Trade-offs

Evaluation scope

The benchmark was conducted on a curated set of 106 human-judged questions designed to test single-hop and multi-hop reasoning. While the results demonstrate strong gains in structured reasoning tasks, performance may vary across broader, real-world distributions and more diverse query types.

Maintenance and adjustability

The system requires ongoing maintenance of the knowledge graph and prompt layer. In cases where the LLM fails to retrieve or reason correctly, improvements often involve iterative adjustments such as refining prompts, improving entity linking rules, or expanding graph coverage to handle missing or ambiguous entities.

Latency trade-off

Introducing graph traversal alongside vector retrieval improves reasoning accuracy but adds query-time overhead. In practice, this creates a trade-off between response latency and multi-hop reasoning performance, particularly for deeper or more complex graph queries.

These trade-offs are typical in hybrid GraphRAG systems and reflect the balance between accuracy, control, and runtime efficiency.

Where this goes next: from proof of concept to production

A proof of concept earns the right to ask "what next?". Taking this into production
comes down to three priorities.

Make it richer. Let the system draw on documents as well as the graph, so it
can answer both relationship questions and plain lookups. The more the business
invests in its shared vocabulary, the sharper the answers become.

Keep the graph fresh automatically. Feed it from the company's existing data
pipelines so it updates as the business changes, rather than from a one-off load.
A knowledge graph is only as trustworthy as its last refresh.

Keep people in the loop. Let experienced domain experts review answers, refine
how questions are phrased to the model, and confirm what "correct" looks like. This
steadily teaches the system the business's own logic and language — turning the
hard-won judgement of experienced people into accuracy the whole organization
benefits from.

Closing thought

Knowledge graphs give a model something it fundamentally lacks: business knowledge, in the form of an explicit, queryable map of the relationships that define a business. Translating structured data into a graph and grounding the model on what that graph returns yields higher accuracy on the hard, multi-hop questions and full explainability, while staying generic enough to move from one domain to the next.
The single sentence I'd want a decision-maker to remember:
Holding the model constant, introducing a knowledge graph grounding layer increased multi-hop accuracy from 0% to 80% on a human-judged 106-question benchmark.

The accompanying proof of concept is a complete, runnable implementation with a reproducible benchmark.

DEV Community: Ayub Abu zer
