Ayub Abu zer

Posted on Jun 29

Grounding LLMs in Knowledge Graphs for Multi-Hop Enterprise Reasoning (0% to 80% Accuracy Improvement)

#rag #llm #graph #langchain

TL;DR

General large language models lack awareness of enterprise-specific data relationships.

I built a graph-grounded LLM reasoning system that transforms a synthetic supply-chain dataset into a knowledge graph and enables structured querying via the model. By holding GPT-4o constant across both conditions, I isolated the impact of graph-based retrieval.

On complex multi-hop reasoning tasks requiring joins across multiple entity relationships, accuracy improved from 0% (no graph) to 80% (graph-grounded system).

The problem: capable models that don't know your business

Large language models (LLMs) are excellent general reasoners, but they have no knowledge of an organization's specific, structured reality: its suppliers, contracts, products, customers, and the relationships between them.

Ask a base model "Which of our customers are exposed to high-risk suppliers?" and it will either hedge or, more problematically, hallucinate a confident but incorrect answer.
The standard fix is retrieval-augmented generation (RAG): embed documents into a vector store and retrieve the most similar chunks for each question. This works well for lookup questions whose answer lives in a single paragraph. It breaks down for multi-hop questions whose answer is spread across several connected facts:
"Which customers buy products that contain a component made by a supplier whose contract is flagged high-risk?"
That answer is spread across four different relationships.
Vector similarity has no notion of a join, so it retrieves disconnected chunks and the model is left to guess the connections, which is a primary source of hallucination in enterprise settings.

In enterprise settings, incorrect multi-hop reasoning is not just a hallucination problem; it is a financial and compliance risk.

The idea: put the relationships where they belong, in a graph

Businesses already describe their world in terms of relationships:

a supplier supplies a product, a component is part of a product, a product is sold to a customer.

A knowledge graph stores those relationships as first-class connections, so a multi-hop question becomes a guided traversal instead of a lucky search.

At a high level, the idea is to transform the enterprise's structured data into a knowledge graph and use it to ground the LLM: every question is answered through that graph rather than from the model's memory.

In a little more detail, the approach has three steps:

1. Model the business's data as a graph (Neo4j): entities become nodes, and relationships become the edges between them.
2. At query time, convert each question into a precise graph query, so the graph returns exactly the facts that matter for that question.
3. Let the model answer strictly from those facts and nothing else.
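
The three steps above can be sketched as one small orchestration function. This is a minimal sketch, not the accompanying implementation: `generate_cypher`, `run_query`, and `answer_from_rows` are hypothetical callables standing in for the LLM call that writes the query, the Neo4j call that executes it, and the LLM call that phrases the answer. The stand-in functions exist only so the sketch runs without a model or a database.

```python
from typing import Callable

Row = dict[str, str]


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],              # hypothetical: LLM writes the Cypher
    run_query: Callable[[str], list[Row]],              # hypothetical: runs Cypher on the graph
    answer_from_rows: Callable[[str, list[Row]], str],  # hypothetical: LLM answers from rows only
) -> dict:
    """Run one question through the query -> facts -> answer loop."""
    cypher = generate_cypher(question)
    rows = run_query(cypher)
    answer = answer_from_rows(question, rows)
    # Return the query and rows alongside the answer so a reviewer can audit it.
    return {"answer": answer, "cypher": cypher, "rows": rows}


# Stand-in implementations (assumptions, for illustration only).
def fake_generate_cypher(question: str) -> str:
    return "MATCH (s:Supplier)-[:PRODUCES]->(c:Component) RETURN s.name AS supplier"


def fake_run_query(cypher: str) -> list[Row]:
    return [{"supplier": "Asia Components Co"}]


def fake_answer_from_rows(question: str, rows: list[Row]) -> str:
    names = ", ".join(row["supplier"] for row in rows)
    return f"{names} produces the Lithium Cell."


result = answer_question(
    "Who produces the Lithium Cell?",
    fake_generate_cypher,
    fake_run_query,
    fake_answer_from_rows,
)
print(result["answer"])
```

Passing the three stages in as callables keeps the loop itself free of any domain detail, which is the property the design relies on.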

The result is an answer that is accurate, cheap (only the relevant sub-graph is sent to the model) and, crucially for enterprise adoption, explainable: you can point at the exact Cypher query and rows behind every answer.

Nothing in that loop is hard-coded to a specific domain: swapping in a different dataset leaves the execution engine unchanged, because the Cypher is generated from whatever schema Neo4j reports.
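
One way to keep query generation domain-agnostic is to assemble the prompt from the schema at run time. The sketch below assumes the schema has already been read from Neo4j into plain Python structures. The shape of the `schema` dictionary, the function name, and the prompt wording are illustrative assumptions, not the project's actual prompt.

```python
def build_cypher_prompt(schema: dict, question: str) -> str:
    """Format a graph schema and a question into a Cypher-generation prompt."""
    node_lines = [
        f"- {label}: {', '.join(props)}" for label, props in schema["nodes"].items()
    ]
    rel_lines = [
        f"- (:{start})-[:{rel}]->(:{end})" for start, rel, end in schema["relationships"]
    ]
    return "\n".join(
        [
            "Write one Cypher query that answers the question.",
            "Use only these node labels and properties:",
            *node_lines,
            "Use only these relationship patterns:",
            *rel_lines,
            f"Question: {question}",
        ]
    )


# Toy schema: labels follow the Cypher quoted in this post; the properties are assumed.
schema = {
    "nodes": {"Supplier": ["name"], "Component": ["name"], "Product": ["name"]},
    "relationships": [
        ("Supplier", "PRODUCES", "Component"),
        ("Component", "PART_OF", "Product"),
    ],
}
prompt = build_cypher_prompt(schema, "Who produces the Lithium Cell?")
print(prompt)
```

Because the function only reads whatever labels and relationships it is given, pointing it at a finance or HR schema would need no code change.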

Modelling the business as a graph

To make the idea concrete (and the results verifiable) I used a deliberately generic supply-chain domain — the kind of interconnected data almost every company has in some form.
• Nodes (entities): suppliers, products, components, contracts, customers, facilities, and business terms.
• Edges (relationships): who supplies what, what a product is made of, who a product is sold to, which supplier depends on another, etc.

Knowledge-graph schema: the node types and the relationship types that connect them

This is what makes the questions hard for a plain LLM: the interesting ones ("which customers are exposed to a high-risk supplier through a shared component?") trace a path across four or five of these edges. There is no single document that states the answer — it only exists as a traversal of the graph. That is precisely the structure a knowledge graph captures and flat text retrieval cannot.
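
To illustrate what "a traversal of the graph" means, here is a minimal in-memory sketch of a four-hop question. The entity names, the risk flag, and the `SIGNED_WITH` and `SOLD_TO` relationship names are invented for illustration and are not taken from the benchmark dataset; a real system would run the equivalent Cypher in Neo4j.

```python
from collections import defaultdict

# Toy edges as (source, relationship, target). All names are invented.
EDGES = [
    ("Contract-17", "SIGNED_WITH", "Acme Metals"),
    ("Contract-42", "SIGNED_WITH", "Safe Plastics"),
    ("Acme Metals", "PRODUCES", "Steel Bracket"),
    ("Safe Plastics", "PRODUCES", "Casing"),
    ("Steel Bracket", "PART_OF", "Robot Arm"),
    ("Casing", "PART_OF", "Desk Fan"),
    ("Robot Arm", "SOLD_TO", "Northwind"),
    ("Desk Fan", "SOLD_TO", "Contoso"),
]
HIGH_RISK_CONTRACTS = {"Contract-17"}

# Index edges by (source, relationship) so each hop is a dictionary lookup.
index = defaultdict(set)
for source, rel, target in EDGES:
    index[(source, rel)].add(target)


def follow(nodes: set, rel: str) -> set:
    """Return every node reachable from `nodes` over one `rel` edge."""
    reached = set()
    for node in nodes:
        reached |= index.get((node, rel), set())
    return reached


def exposed_customers(contracts: set) -> set:
    """Contract -> supplier -> component -> product -> customer."""
    suppliers = follow(contracts, "SIGNED_WITH")
    components = follow(suppliers, "PRODUCES")
    products = follow(components, "PART_OF")
    return follow(products, "SOLD_TO")


print(exposed_customers(HIGH_RISK_CONTRACTS))
```

Each `follow` call is one join. No single edge in `EDGES` links a contract to a customer, so the answer only appears once all four hops are chained.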

Another important feature is that the business glossary lives inside the graph. Terms like "Single Point of Failure" or "Customer Exposure" are stored as their own nodes, each with a definition, an owner, and a pointer to the part of the model it governs. The organisation's own vocabulary becomes queryable alongside its data, so the system answers in the language the business already uses.

From table to graph

Real enterprise data doesn't arrive as a graph. It arrives as database tables, CSVs, spreadsheets. The build phase is essentially a translation step, and it follows one simple rule:
• Entity tables become the nodes in the graph.
• Relationships between tables become the edges between them.

This matters for two reasons: it mirrors how the work would actually land on a real data platform, and it means a business doesn't have to reshape its data to get started. Its existing tables are the input.
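
The translation rule can be sketched in a few lines. This is a minimal illustration with hypothetical table and column names (`supplier_id`, `component_id`) and in-memory CSV text in place of real files or database tables; the loader in the proof of concept may work differently.

```python
import csv
import io

# Two toy "tables" as CSV text; the column names and IDs are hypothetical.
SUPPLIERS_CSV = "supplier_id,name\nS1,Asia Components Co\n"
COMPONENTS_CSV = "component_id,name,supplier_id\nC1,Lithium Cell,S1\n"


def read_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def table_to_nodes(rows: list[dict], label: str, key: str) -> dict:
    """Each row of an entity table becomes one node, keyed by its primary key."""
    return {row[key]: {"label": label, **row} for row in rows}


def foreign_key_to_edges(rows: list[dict], key: str, fk: str, rel: str) -> list[tuple]:
    """Each foreign-key reference becomes one edge: (referenced, rel, referencing)."""
    return [(row[fk], rel, row[key]) for row in rows if row[fk]]


suppliers = read_rows(SUPPLIERS_CSV)
components = read_rows(COMPONENTS_CSV)

nodes = {
    **table_to_nodes(suppliers, "Supplier", "supplier_id"),
    **table_to_nodes(components, "Component", "component_id"),
}
edges = foreign_key_to_edges(components, "component_id", "supplier_id", "PRODUCES")
print(nodes)
print(edges)
```

The only domain-specific input is the small mapping passed as arguments: which column is the key, which is the foreign key, and what the relationship is called.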

Grounding: why the answers can be trusted

The key design choice is that the model is constrained to respond only using results returned from the graph. Each answer is therefore grounded in explicit facts, allowing a human reviewer to trace and audit the source of every claim.

This approach also enables a capability that many systems lack: graceful abstention. If a query cannot be answered from the underlying data (for instance, a request for suppliers in a country where none exist), the graph simply returns no results. There is nothing to ground an answer on, so the model declines instead of inventing one. In regulated settings, "I don't have that information" is exactly the right answer, and here it happens by design rather than by luck.
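
A minimal sketch of how the grounding constraint and the abstention path might be wired together. The prompt wording, the abstention message, and the function names are assumptions for illustration, and `call_llm` is a hypothetical stand-in for whatever client sends the prompt to the model. This version abstains in code when the graph returns no rows; leaving that decision to the model via the prompt is an equally plausible design.

```python
from typing import Callable

ABSTAIN = "I don't have that information."


def build_grounded_prompt(question: str, rows: list[dict]) -> str:
    """Put the graph rows in the prompt and forbid outside knowledge."""
    facts = "\n".join(f"- {row}" for row in rows)
    return (
        "Answer using ONLY the facts below. "
        f"If they are insufficient, reply exactly: {ABSTAIN}\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}"
    )


def grounded_answer(
    question: str, rows: list[dict], call_llm: Callable[[str], str]
) -> str:
    # No rows means nothing to ground on, so abstain without calling the model.
    if not rows:
        return ABSTAIN
    return call_llm(build_grounded_prompt(question, rows))


# Stand-in model (an assumption) that records the prompts it receives.
seen_prompts = []


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "Asia Components Co produces the Lithium Cell."


print(grounded_answer("Which suppliers are in Atlantis?", [], fake_llm))
print(
    grounded_answer(
        "Who produces the Lithium Cell?",
        [{"supplier": "Asia Components Co"}],
        fake_llm,
    )
)
```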

Evaluation

To measure the value of the graph fairly, I ran a controlled comparison. The same model (GPT-4o) answered the same labelled question set two ways:
• Without the graph — the model on its own.
• With the graph — the same model, answering through the knowledge graph.

Holding the model constant is the whole point: the only thing that changes is graph access, so the difference is the value the graph adds — not a difference in model strength.

The metric is entity recall: of the entities the correct answer should mention, what fraction did the system's answer actually include?
Same model, with vs without the knowledge graph: overall and multi-hop accuracy
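
Entity recall as defined above can be computed along these lines. This is a minimal sketch: matching by case-insensitive substring and treating an empty expected list as full recall are both assumptions, since the exact matching rule is not spelled out here. The example answers are the ones quoted later in this post.

```python
def entity_recall(expected: list[str], answer: str) -> float:
    """Fraction of expected entities that appear in the answer text."""
    if not expected:
        return 1.0  # assumption: nothing expected counts as full recall
    text = answer.lower()
    found = sum(1 for entity in expected if entity.lower() in text)
    return found / len(expected)


expected = [
    "Global Metals Ltd",
    "Rhine Electronics GmbH",
    "Shenzhen Microchips Inc",
    "Andes Copper Mining",
]
with_graph = (
    "Global Metals Ltd, Rhine Electronics GmbH, "
    "Shenzhen Microchips Inc and Andes Copper Mining"
)
without_graph = "Fanuc, ABB, Siemens, Mitsubishi"

print(entity_recall(expected, with_graph))     # 1.0
print(entity_recall(expected, without_graph))  # 0.0
```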

How the evaluation works

The process is deliberately simple and hard to game:

1. Generate the questions. A benchmark of 106 questions is built from the dataset: 82 single-hop lookups ("who produces the Lithium Cell?") and 24 multi-hop reasoning chains ("which customers are exposed to suppliers in a given country?"). They're produced from templates filled with values from the data, plus a set of deliberately unanswerable questions, so nothing is cherry-picked.
2. Answer each one twice: once with the graph, once without.
3. Have a human judge every answer as correct or not. Crucially, when a question genuinely can't be answered from the data, an honest "I don't know" is graded correct: refusing to fabricate is the right behaviour, not a failure. Human evaluation was used instead of automated string matching so that a person confirms each answer is actually right, including whether an abstention was appropriate.
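
Once every answer has a human verdict, the scores reduce to a per-group tally. The sketch below uses made-up judgments, not the benchmark's data, and the record fields `hops`, `system`, and `correct` are hypothetical names.

```python
from collections import defaultdict

# Made-up human judgments, for illustration only.
judgments = [
    {"hops": "single", "system": "graph", "correct": True},
    {"hops": "single", "system": "no_graph", "correct": False},
    {"hops": "multi", "system": "graph", "correct": True},
    {"hops": "multi", "system": "graph", "correct": False},
    {"hops": "multi", "system": "no_graph", "correct": False},
    {"hops": "multi", "system": "no_graph", "correct": False},
]


def accuracy_by_group(records: list[dict]) -> dict:
    """Return {(system, hops): fraction of answers judged correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = (record["system"], record["hops"])
        totals[key] += 1
        correct[key] += int(record["correct"])
    return {key: correct[key] / totals[key] for key in totals}


scores = accuracy_by_group(judgments)
for (system, hops), score in sorted(scores.items()):
    print(f"{system:9s} {hops:6s} {score:.0%}")
```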

What the numbers show

Judged by a human over the 106-question benchmark, the gap is stark. With the knowledge graph, the model was correct on 83% of questions overall and 80% of multi-hop questions; without it, the same model managed just 8% and 0%:

| System (same model) | Overall accuracy | Multi-hop accuracy |
| --- | --- | --- |
| Model + knowledge graph | 83% | 80% |
| Model, no graph access | 8% | 0% |

The plain model's 0% on multi-hop is the headline: with no way to perform the join, it could not correctly answer a single question whose answer is a chain of relationships. It either gave a generic essay or confidently named real-world companies that aren't part of this business at all. Two real examples from the run make the gap concrete, including the exact query run against the graph to get the right answer.
A single-fact question — "Who produces the Lithium Cell?"
• With the graph (correct): "Asia Components Co produces the Lithium Cell."
• Without the graph (wrong): a list of global battery brands — CATL, LG Chem, Panasonic, Samsung SDI — none of which exist in this company's data.
The graph answer is backed by the exact query it ran (this is the provenance that makes every answer auditable):

Example Cypher query used for grounding:
MATCH (s:Supplier)-[:PRODUCES]->(c:Component)
WHERE toLower(c.name) CONTAINS toLower('Lithium Cell')
RETURN s.name AS supplier

A multi-hop question — "Who produces the components that go into the Industrial Robot Arm?" (supplier → component → product)

• With the graph (correct): "Global Metals Ltd, Rhine Electronics GmbH, Shenzhen Microchips Inc and Andes Copper Mining" — the exact four suppliers, traced through the parts that make up the product.
• Without the graph (failed): generic robotics names (Fanuc, ABB, Siemens, Mitsubishi), plausible-sounding and entirely wrong for this business.

Here the model wrote a genuine multi-hop traversal — product back to its components, then back to the suppliers that make them — in a single query:

Example Cypher query used for grounding:
MATCH (p:Product)<-[:PART_OF]-(:Component)<-[:PRODUCES]-(s:Supplier)
WHERE toLower(p.name) CONTAINS toLower('Industrial Robot Arm')
RETURN DISTINCT s.name AS supplier

That's the pattern across the whole benchmark: the plain model reasons fluently about the world in general but knows nothing about this organization, while the graph-grounded model answers from the company's own facts — with the query as evidence — and says "I don't know" when the data can't support an answer.
These are measured results from a synthetic benchmark, judged by a human, not projections. Because the queries are generated live, exact figures vary slightly between runs; the large, reproducible finding — graph grounding makes the model accurate on a business's own multi-hop questions — does not.

Why this matters for a business

Strip away the engineering, and four properties remain that decide whether a
business can actually trust and adopt this:

Accuracy on the questions that actually matter. The valuable questions in any
enterprise are rarely simple lookups; they are "what's connected to what, and
what happens if it breaks?" Those are exactly the multi-hop questions where this
approach wins.

Explainability you can sign off on. Every answer comes with its evidence.
That is the difference between a tool a regulated business can adopt and one it
cannot.

Honesty under uncertainty. The system abstains rather than guessing, which
turns the biggest risk of enterprise AI — confident fabrication — into a managed
behaviour.

Reusable across the business, not locked to one domain. Nothing about the
approach is specific to supply chains. To point it at a different part of the
business — finance, HR, operations, logistics — you swap in that domain's tables
and a small mapping, and the question-answering engine is untouched, because it
generates its queries from whatever data is loaded. One build, many domains:
every new area reuses the same engine instead of a fresh bespoke project.

Where this goes next: from proof of concept to production

A proof of concept earns the right to ask "what next?". Taking this into production
comes down to three priorities.

Make it richer. Let the system draw on documents as well as the graph, so it
can answer both relationship questions and plain lookups. The more the business
invests in its shared vocabulary, the sharper the answers become.

Keep the graph fresh automatically. Feed it from the company's existing data
pipelines so it updates as the business changes, rather than from a one-off load.
A knowledge graph is only as trustworthy as its last refresh.

Keep people in the loop. Let experienced domain experts review answers, refine
how questions are phrased to the model, and confirm what "correct" looks like. This
steadily teaches the system the business's own logic and language — turning the
hard-won judgement of experienced people into accuracy the whole organization
benefits from.

Closing thought

Knowledge graphs give a model something it fundamentally lacks: business knowledge, in the form of an explicit, queryable map of the relationships that define a business. By translating structured data into a graph and grounding the model on what that graph returns, the result is higher accuracy on the hard, multi-hop questions and full explainability, while staying generic enough to move from one domain to the next.
The single sentence I'd want a decision-maker to remember:
Holding the model constant, adding the knowledge graph lifted multi-hop accuracy from 0% to 80%.
This represents the measurable difference between a language model that generates plausible responses and one that is grounded in enterprise data structure.

The accompanying proof of concept is a complete, runnable implementation with a reproducible benchmark.
