The Relationship Blindspot in Vector Retrieval
Every optimization in this series so far — better chunking, reranking, query rewriting, CRAG — works within a fundamental assumption: retrieval is about finding similar text. But a whole category of questions doesn't fit that assumption:
Questions that require reasoning across multiple entities.
Consider:
"bge-large-zh-v1.5 and bge-reranker-v2-m3 both come from the same organization. What is that organization, and what role does each model play in a RAG pipeline?"
Vector search will find documents mentioning BAAI or BGE — that part works. But retrieval only returns text chunks. The LLM then has to figure out "these two models are both from BAAI" by reading disconnected paragraphs. The relationship isn't in the retrieval result; it's implicit in the text.
Or consider:
"Trace the evolution of retrieval quality evaluation from basic RAG to CRAG."
This requires placing RAG → Rerank → Self-RAG → CRAG in order along a progression chain. Vector search returns the top-4 most similar documents — not necessarily the nodes on that chain in the right sequence.
Graph RAG's insight: instead of hoping that semantic similarity happens to capture the right relationships, extract the entities and relationships explicitly, build a knowledge graph, and traverse it at query time. Finding "what connects X to Y" becomes a graph traversal problem, not a similarity problem.
Knowledge Graph Fundamentals
The atomic unit of a knowledge graph is the triple:
(head entity, relation, tail entity)
For example:
BAAI --[developed]--> bge-large-zh-v1.5
BAAI --[developed]--> bge-reranker-v2-m3
bge-large-zh-v1.5 --[used for]--> vector retrieval
bge-reranker-v2-m3 --[used for]--> reranking
Assemble these into a directed graph (a NetworkX DiGraph), and the question "what models did BAAI develop?" becomes: find the BAAI node and list its outgoing edges. No semantic similarity needed — pure structure.
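A minimal sketch of that structural lookup in NetworkX, using the four example triples above hard-coded:

import networkx as nx

# Build a tiny graph from the four example triples
KG = nx.DiGraph()
KG.add_edge("BAAI", "bge-large-zh-v1.5", relation="developed")
KG.add_edge("BAAI", "bge-reranker-v2-m3", relation="developed")
KG.add_edge("bge-large-zh-v1.5", "vector retrieval", relation="used for")
KG.add_edge("bge-reranker-v2-m3", "reranking", relation="used for")

# "What models did BAAI develop?" = list BAAI's outgoing edges
for _, tail, data in KG.out_edges("BAAI", data=True):
    print(f"BAAI --[{data['relation']}]--> {tail}")
# BAAI --[developed]--> bge-large-zh-v1.5
# BAAI --[developed]--> bge-reranker-v2-m3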
The Graph RAG Pipeline
Build phase (offline):
documents → LLM triple extraction → NetworkX directed graph
documents → embedding → ChromaDB vector index (baseline)
Query phase (online):
question
↓
LLM extracts question entities (e.g., BAAI, bge-large-zh-v1.5)
↓
Fuzzy-match entities against graph nodes (substring match)
↓
BFS 2-hop traversal: seed_nodes → neighbors → neighbors' neighbors
↓
Assemble triple context (graph traversal result + top-2 vector docs as fallback)
↓
LLM generates answer
A critical design decision: Graph RAG is not pure graph retrieval. It's a hybrid of graph traversal context + vector retrieval. Pure graph traversal has a hard "entity boundary" — if a relevant entity wasn't extracted into the graph, it's unreachable. The top-2 vector documents serve as a safety net for those gaps.
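A sketch of that hybrid assembly. This is hypothetical glue code: graph_retrieve is implemented later in this post, and vector_store stands in for the ChromaDB index from the build phase.

def hybrid_retrieve(question: str, k_vector: int = 2) -> list:
    # Graph traversal first; returns [] when no question entity matches a node
    docs = graph_retrieve(question, KG, hops=2)
    # Vector safety net: top-k semantically similar chunks cover graph gaps
    docs += vector_store.similarity_search(question, k=k_vector)
    return docs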
Implementation: Triple Extraction
The Dead End First
LangChain ships LLMGraphTransformer, purpose-built for extracting graph structure from text. I started there:
from langchain_experimental.graph_transformers import LLMGraphTransformer

# llm is the GLM-4-flash chat model; doc is one of the corpus documents
graph_transformer = LLMGraphTransformer(llm=llm)
graph_docs = graph_transformer.convert_to_graph_documents([doc])
All 12 documents failed:
Invalid JSON: invalid number at line 1 column 2
[type=json_invalid, input_value='- Node: RAG (Retrieval-A...']
LLMGraphTransformer requires the LLM to return structured JSON. GLM-4-flash was returning text-formatted lists (- Node: xxx). Pydantic validation failed every time.
Custom Prompt with a Simpler Format
The fix: stop requiring JSON. Use a pipe-delimited format instead — Entity A | relation | Entity B, one triple per line, parsed with split("|"). No JSON parsing, no Pydantic validation, high fault tolerance:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

TRIPLE_EXTRACT_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Extract entities and relations from the text below. Output a list of triples.\n"
     "Required format: one triple per line, strictly: Entity A | relation | Entity B\n"
     "Rules:\n"
     "- Entities: noun phrases, no brackets or quotes\n"
     "- Relations: verb phrases, e.g., uses, contains, proposed by, applies to, outperforms\n"
     "- Output ONLY triples, no numbers, no explanations, nothing else\n"
     "- 8-15 triples per document\n\n"
     "Example output:\n"
     "RAG | uses | vector retrieval\n"
     "RAGAS | proposed by | Es et al.\n"
     "Chroma | suitable for | local development"),
    ("human", "Text:\n{text}"),
])

# Plain-text chain: prompt -> LLM -> string (no structured-output parsing)
triple_chain = TRIPLE_EXTRACT_PROMPT | llm | StrOutputParser()

def extract_triples(text: str) -> list[tuple[str, str, str]]:
    raw = triple_chain.invoke({"text": text})
    triples = []
    for line in raw.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only clean three-part lines; drop chatter and malformed output
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples
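Because the parser keeps only clean three-part lines, stray model chatter degrades gracefully. A simulated example (the sample string below is fabricated, not real model output):

# Simulated LLM output mixing valid triples with chatter and malformed lines
sample = (
    "RAG | uses | vector retrieval\n"
    "Sure, here are the triples:\n"    # chatter: one part after split, dropped
    "Chroma | suitable for\n"          # malformed: two parts, dropped
    "RAGAS | proposed by | Es et al."
)
parsed = []
for line in sample.strip().splitlines():
    parts = [p.strip() for p in line.split("|")]
    if len(parts) == 3 and all(parts):
        parsed.append(tuple(parts))
print(parsed)
# [('RAG', 'uses', 'vector retrieval'), ('RAGAS', 'proposed by', 'Es et al.')]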
Building the NetworkX Graph
import networkx as nx

KG = nx.DiGraph()
for doc in DOCUMENTS:
    triples = extract_triples(doc.page_content)
    for head, rel, tail in triples:
        # add_node is idempotent: repeated entity names are merged into one node
        KG.add_node(head, source=doc.metadata["source"])
        KG.add_node(tail, source=doc.metadata["source"])
        KG.add_edge(head, tail, relation=rel)
12 documents → 176 nodes, 139 edges.
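A quick sanity check on the built graph (the counts are from this run; yours will vary with the LLM's extraction output):

print(KG.number_of_nodes(), KG.number_of_edges())  # 176 139 in this run
# The highest-degree nodes are usually the corpus's core concepts
hubs = sorted(KG.degree, key=lambda kv: kv[1], reverse=True)[:5]
print(hubs)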
Implementation: BFS Graph Traversal
from langchain_core.documents import Document

def graph_retrieve(question: str, graph: nx.DiGraph, hops: int = 2) -> list[Document]:
    # Step 1: Extract entities from the question (LLM call, one entity per line)
    entities = extract_entities(question)

    # Step 2: Fuzzy-match against graph nodes (bidirectional substring matching)
    seed_nodes = []
    for entity in entities:
        entity_lower = entity.lower()
        for node in graph.nodes:
            if entity_lower in node.lower() or node.lower() in entity_lower:
                seed_nodes.append(node)
    if not seed_nodes:
        return []  # No match — fall back to pure vector retrieval

    # Step 3: BFS k-hop traversal, following edges in both directions
    visited = set(seed_nodes)
    frontier = set(seed_nodes)
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            neighbors = set(graph.successors(node)) | set(graph.predecessors(node))
            next_frontier |= neighbors - visited
        visited |= next_frontier
        frontier = next_frontier

    # Step 4: Format every edge touching a visited node as a triple line
    triples = [
        f"{u} --[{data['relation']}]--> {v}"
        for u, v, data in graph.edges(data=True)
        if u in visited or v in visited
    ]
    return [Document(page_content=
        f"[Graph entities]: {', '.join(list(visited)[:20])}\n\n"
        f"[Graph relationships]:\n" + "\n".join(triples[:40])
    )]
BFS at 2 hops: from seed entities, collect direct neighbors (1 hop) and their neighbors (2 hops). For "what models did BAAI develop?", 1 hop is enough. For "where do bge-large-zh-v1.5 and bge-reranker-v2-m3 come from, and what are they each used for?", 2 hops connects BAAI → both models → their respective uses.
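Wiring it together (KG is the graph built above; the printed entities and triples depend on what extraction actually captured, so the output comments are illustrative):

docs = graph_retrieve(
    "Which org do bge-large-zh-v1.5 and bge-reranker-v2-m3 come from?",
    KG, hops=2,
)
if docs:
    print(docs[0].page_content[:200])
# [Graph entities]: BAAI, bge-large-zh-v1.5, bge-reranker-v2-m3, ...
#
# [Graph relationships]:
# BAAI --[developed]--> bge-large-zh-v1.5
# ...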
Experimental Results
Test Set Design
8 questions, two categories:
| Type | Count | Example |
|---|---|---|
| Single-hop factual | 2 | "What are the four RAGAS metrics?" |
| Multi-hop relational | 6 | "Which org do bge-large-zh-v1.5 and bge-reranker-v2-m3 come from, and what does each do?" |
The multi-hop questions are Graph RAG's home turf — and vector search's weak spot.
RAGAS Metrics
======================================================================
RAGAS Metrics Comparison (Vector RAG vs Graph RAG)
======================================================================
Metric Vector RAG Graph RAG Delta
──────────────────────────────────────────────────────────
context_recall 0.812 0.750 ↓-0.062
context_precision 0.729 0.948 ↑+0.219 ◀
faithfulness 0.865 0.883 ↑+0.018
answer_relevancy 0.536 0.465 ↓-0.071
======================================================================
context_precision +0.219 — one of the largest precision gaps between vector and non-vector approaches in this series.
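For reference, scores like these come from RAGAS's evaluate loop. A minimal sketch of that loop (the imports and column names follow the ragas 0.1.x API, which varies across versions; the dataset rows below are placeholders, not real data):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall, context_precision, faithfulness, answer_relevancy,
)

dataset = Dataset.from_dict({
    "question":     ["What are the four RAGAS metrics?"],
    "answer":       ["..."],    # answer generated by the pipeline under test
    "contexts":     [["..."]],  # retrieved docs, one list of strings per question
    "ground_truth": ["..."],    # reference answer
})
result = evaluate(dataset, metrics=[
    context_recall, context_precision, faithfulness, answer_relevancy,
])
print(result)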
Reading the Results
Why Does context_precision Jump So Much?
context_precision measures how many of the documents sent to the LLM are actually relevant.
Vector search returns the top-4 semantically similar chunks — similar doesn't mean precise. Ask "what two models did BAAI build?", and vector search pulls back every paragraph mentioning BAAI or BGE: model dimensions, benchmark rankings, usage scenarios — lots of content the question didn't ask for.
Graph traversal starts at the BAAI node and walks directly to its connected subgraph:
BAAI --[developed]--> bge-large-zh-v1.5
BAAI --[developed]--> bge-reranker-v2-m3
bge-large-zh-v1.5 --[used for]--> vector embedding
bge-reranker-v2-m3 --[used for]--> reranking
Four triples, all directly relevant. No noise. context_precision naturally approaches 1.0 for these questions.
Why Does context_recall Drop Slightly?
Graph traversal has a hard entity boundary: it can only expand from entities that were extracted into the graph. If a relevant fact exists in the text but its corresponding entity wasn't captured during triple extraction, that information is permanently unreachable by graph traversal.
Vector search has no such boundary. Semantic similarity doesn't require prior structuralization — it can surface relevant content even when the exact entity name doesn't appear in the query. That's why vector search has higher recall: broader reach, even if less precise.
This is the classic precision-recall tradeoff. Graph RAG deliberately takes the precision side: return less, but make what's returned count.
Why Does answer_relevancy Drop?
Graph traversal context is a structured triple list:
bge-large-zh-v1.5 --[developed by]--> BAAI
Vector retrieval returns full semantic paragraphs with natural language flow. When the LLM synthesizes an answer from triples, the output is technically correct but slightly less fluent than a summary of a natural paragraph, which depresses the answer_relevancy score.
This is an inherent limitation of the triple format as context. It can be partially addressed by adding a prompt instruction: "First convert these triples into a coherent paragraph, then answer the question."
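A sketch of that mitigation as an extra system instruction in the answer-generation prompt (the prompt name and wording here are illustrative, not from the repo):

ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the question using the context below.\n"
     "If the context is a list of triples (A --[rel]--> B), first convert "
     "them into a coherent paragraph, then answer in fluent prose; "
     "do not echo the triple syntax."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])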
When to Use Graph RAG
Strong fit:
- Complex relationship networks in documents: technical docs, knowledge bases, product manuals where entities have many explicit connections
- Questions that cross entity boundaries: "what connects X and Y", "what family does Z belong to", "trace the evolution of A"
- Answers distributed across documents: knowledge you need to assemble from multiple sources
Caveats:
- Triple extraction cost: every document requires an LLM call to extract triples. Build phase is significantly more expensive than a vector-only index
- Extraction quality is the ceiling: if the LLM misses a key entity during triple extraction, graph traversal can never recover it
- Limited gain for single-hop QA: if most questions are "what is X?", vector recall advantage outweighs graph precision advantage
- Answer fluency: triple-format context produces slightly stiffer answers; prompt engineering can compensate
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/16-graph-rag
Key file:
- `graph_rag.py` — Full implementation: graph build, BFS retrieval, RAGAS evaluation
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/16-graph-rag
cp .env.example .env
pip install -r requirements.txt
python graph_rag.py
Summary
This article implemented Graph RAG from scratch. Key findings:
- context_precision +0.219 is a direct result of graph traversal's "retrieve by path, not by similarity" approach — only edges directly connected to the question's entities are returned
- context_recall -0.062 is the cost of the entity boundary — what wasn't extracted into the graph is invisible to traversal, while vector search has no such hard limit
- The most important implementation decision: abandon `LLMGraphTransformer`, switch to a custom `Entity | relation | Entity` format — solves GLM-4-flash's unstable JSON output with a simpler, more robust parsing strategy
- The hybrid approach is the right call: graph traversal + vector fallback. Pure graph traversal leaves gaps; pure vector search misses relationships. The combination covers both.
A pattern that's been accumulating across this series: there is no universally dominant RAG strategy. Every approach gains on one dimension while conceding on another. Vector search has strong recall but weak precision for relational questions. Reranking improves precision but adds latency. Graph RAG achieves high precision but sacrifices recall coverage. Which one to deploy depends entirely on what your knowledge base looks like and what questions your users actually ask.