Three retrieval architectures. Same LLM. Same 7,928 queries across 45 domains. Different structure going in.
Here are the results:
| System | F1 Score | Tokens/query | Cost/query |
|---|---|---|---|
| RAG (FAISS + Claude) | 0.123 | 2,982 | ~$0.009 |
| GraphRAG (Microsoft) | 0.120 | 3,450 | ~$0.013 |
| CKG (pre-structured DAG) | 0.471 | 269 | ~$0.001 |
CKG is roughly 4x more accurate than RAG, uses 11x fewer tokens, and costs about a tenth as much per query.
## What is a CKG?
A Compact Knowledge Graph (CKG) pre-structures domain knowledge as a directed acyclic graph (DAG): concepts are nodes, dependencies are edges. The whole graph fits in a CSV file:

```
ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
```
When an agent asks "What do I need to know before Calculus?", CKG answers by traversing edges. No embeddings, no similarity search, and no hallucination, by construction: an edge either exists in the graph or it doesn't.
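The traversal above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the benchmark's implementation; the `load_ckg` and `prerequisites` helpers are names I've made up for this sketch.

```python
import csv
import io

# The example CSV from above; Dependencies is a pipe-separated list of ConceptIDs.
CSV_DATA = """ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
"""

def load_ckg(text):
    """Parse the CSV into {concept_id: (label, [dependency_ids])}."""
    graph = {}
    for row in csv.DictReader(io.StringIO(text)):
        deps = row["Dependencies"].split("|") if row["Dependencies"] else []
        graph[row["ConceptID"]] = (row["ConceptLabel"], deps)
    return graph

def prerequisites(graph, concept_id):
    """Collect all transitive dependencies by walking edges depth-first."""
    seen, stack = [], list(graph[concept_id][1])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(graph[dep][1])
    return [graph[d][0] for d in seen]

graph = load_ckg(CSV_DATA)
print(prerequisites(graph, "1"))  # Algebra and Trigonometry, the prerequisites of Calculus
```

Note that the answer is exact set membership over edges; there is no ranking step where a relevant prerequisite can fall below a similarity threshold.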
## Why RAG fails on multi-hop queries
RAG retrieves the text chunks most similar to a query. For simple lookups, this works. For multi-hop questions (prerequisites, dependency chains, drug interactions, regulatory trees) the answer gets scattered across chunks that can contradict each other.
F1 by hop depth:
| Hop depth | CKG | RAG |
|---|---|---|
| 1 | 0.374 | 0.312 |
| 2 | 0.512 | 0.298 |
| 3 | 0.631 | 0.241 |
| 4 | 0.714 | 0.198 |
| 5 | 0.772 | 0.187 |
CKG improves monotonically with depth, while RAG degrades from hop 1 onward. The deeper the question, the larger the gap.
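Why graph traversal scales with depth is easy to see in code: each additional hop is one more edge expansion, with no retrieval step that can miss. A minimal sketch over a hypothetical toy graph (the `GRAPH` dict and `deps_at_hop` helper are illustrative, not part of the benchmark):

```python
# Hypothetical toy dependency graph: concept -> direct prerequisites.
GRAPH = {
    "RealAnalysis": ["Calculus"],
    "Calculus": ["Algebra", "Trigonometry"],
    "Trigonometry": ["Geometry"],
    "Algebra": ["Arithmetic"],
    "Geometry": ["Arithmetic"],
    "Arithmetic": [],
}

def deps_at_hop(graph, start, k):
    """Return the set of concepts exactly k edges below `start`."""
    frontier = {start}
    for _ in range(k):
        # Expand every node in the frontier by one edge.
        frontier = {dep for node in frontier for dep in graph[node]}
    return frontier

# Concepts exactly three hops below RealAnalysis: Geometry (via
# Trigonometry) and Arithmetic (via Algebra).
print(deps_at_hop(GRAPH, "RealAnalysis", 3))
```

A chunk retriever, by contrast, has to hope the hop-3 concepts co-occur with the query terms in some chunk, which gets less likely at every hop.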
## Where CKG dominates by query type
| Query type | CKG F1 | RAG F1 | CKG ÷ RAG |
|---|---|---|---|
| Aggregate (T4) | 0.964 | 0.286 | 3.4x |
| Path traversal (T3) | 0.660 | 0.201 | 3.3x |
| Dependency (T2) | 0.634 | 0.078 | 8.1x |
| Cross-concept (T5) | 0.323 | 0.115 | 2.8x |
| Entity lookup (T1) | 0.207 | 0.094 | 2.2x |
The biggest win (8.1x) is on dependency queries, exactly the query type that matters in clinical, legal, financial, and regulatory domains.
## Structure is the signal, not curation effort
Track 2 of the benchmark tests this directly: I built a GLP-1/pharma domain from the ClinicalTrials.gov API in a single session, with no expert curation. It scored F1 = 0.530, higher than the 0.471 average across the 45 general domains. If a domain has knowable dependencies, it can be CKG-ified; the structure drives the accuracy, not the curation effort.
## Try it
The MCP server works in Claude Code and any MCP-compatible agent:

```shell
pip install ckg-mcp
```

Your agent gets four tools: `list_domains`, `query_ckg`, `get_prerequisites`, and `search_concepts`.
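If you want to wire the server into Claude Code yourself, a project-level `.mcp.json` entry might look like the following. This is a sketch that assumes the package installs a `ckg-mcp` console command; check the repo's README for the actual invocation.

```json
{
  "mcpServers": {
    "ckg": {
      "command": "ckg-mcp",
      "args": []
    }
  }
}
```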
- Live demo: https://huggingface.co/spaces/danyarm/ckg-demo
- Full dataset (45 domain CSVs, the 7,928-query JSONL, and results): https://huggingface.co/datasets/danyarm/ckg-benchmark
- Paper and benchmark code: https://github.com/Yarmoluk/ckg-benchmark
- One-page summary: https://github.com/Yarmoluk/ckg-benchmark/blob/main/SUMMARY.md
## Custom domains
The benchmark covers 45 general domains. For clinical, legal, financial, or regulatory domains where dependency structure is critical: graphifymd.com
All code MIT licensed. Data CC BY 4.0. Questions welcome in the comments.