Three retrieval architectures. Same LLM. Same 7,928 queries across 45 domains. Different structure going in.
Here are the results:
| System | F1 Score | Tokens/query | Cost/query |
|---|---|---|---|
| RAG (FAISS + Claude) | 0.123 | 2,982 | ~$0.009 |
| GraphRAG (Microsoft) | 0.120 | 3,450 | ~$0.013 |
| CKG (pre-structured DAG) | 0.471 | 269 | ~$0.001 |
CKG is roughly 4x more accurate than RAG, uses 11x fewer tokens, and costs about a tenth as much per query.
## What is a CKG?
A Compact Knowledge Graph (CKG) pre-structures domain knowledge as a directed acyclic graph (DAG): concepts are nodes, dependencies are edges. The whole graph fits in a CSV file:

```
ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
```
When an agent asks "What do I need to know before Calculus?", CKG answers by traversing edges. No embeddings, no similarity search, and no hallucination, by construction: an edge either exists in the graph or it doesn't.
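The traversal above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the benchmark's implementation; the `load_ckg` and `prerequisites` helpers are names I've made up for this sketch.

```python
import csv
import io

# The example CSV from above; Dependencies is a pipe-separated list of ConceptIDs.
CSV_DATA = """ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Calculus,2|3,CORE
2,Algebra,,FOUND
3,Trigonometry,,FOUND
"""

def load_ckg(text):
    """Parse the CSV into {concept_id: (label, [dependency_ids])}."""
    graph = {}
    for row in csv.DictReader(io.StringIO(text)):
        deps = row["Dependencies"].split("|") if row["Dependencies"] else []
        graph[row["ConceptID"]] = (row["ConceptLabel"], deps)
    return graph

def prerequisites(graph, concept_id):
    """Collect all transitive dependencies by walking edges depth-first."""
    seen, stack = [], list(graph[concept_id][1])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(graph[dep][1])
    return [graph[d][0] for d in seen]

graph = load_ckg(CSV_DATA)
print(prerequisites(graph, "1"))  # Algebra and Trigonometry, the prerequisites of Calculus
```

Note that the answer is exact set membership over edges; there is no ranking step where a relevant prerequisite can fall below a similarity threshold.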
## Why RAG fails on multi-hop queries
RAG retrieves the text chunks most similar to a query. For simple lookups, this works. For multi-hop questions (prerequisites, dependency chains, drug interactions, regulatory trees) the answer gets scattered across chunks that can contradict each other.
F1 by hop depth:
| Hop depth | CKG | RAG |
|---|---|---|
| 1 | 0.374 | 0.312 |
| 2 | 0.512 | 0.298 |
| 3 | 0.631 | 0.241 |
| 4 | 0.714 | 0.198 |
| 5 | 0.772 | 0.187 |
CKG improves monotonically with depth, while RAG degrades from hop 1 onward. The deeper the question, the larger the gap.
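Why graph traversal scales with depth is easy to see in code: each additional hop is one more edge expansion, with no retrieval step that can miss. A minimal sketch over a hypothetical toy graph (the `GRAPH` dict and `deps_at_hop` helper are illustrative, not part of the benchmark):

```python
# Hypothetical toy dependency graph: concept -> direct prerequisites.
GRAPH = {
    "RealAnalysis": ["Calculus"],
    "Calculus": ["Algebra", "Trigonometry"],
    "Trigonometry": ["Geometry"],
    "Algebra": ["Arithmetic"],
    "Geometry": ["Arithmetic"],
    "Arithmetic": [],
}

def deps_at_hop(graph, start, k):
    """Return the set of concepts exactly k edges below `start`."""
    frontier = {start}
    for _ in range(k):
        # Expand every node in the frontier by one edge.
        frontier = {dep for node in frontier for dep in graph[node]}
    return frontier

# Concepts exactly three hops below RealAnalysis: Geometry (via
# Trigonometry) and Arithmetic (via Algebra).
print(deps_at_hop(GRAPH, "RealAnalysis", 3))
```

A chunk retriever, by contrast, has to hope the hop-3 concepts co-occur with the query terms in some chunk, which gets less likely at every hop.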
## Where CKG dominates by query type
| Query type | CKG F1 | RAG F1 | CKG ÷ RAG |
|---|---|---|---|
| Aggregate (T4) | 0.964 | 0.286 | 3.4x |
| Path traversal (T3) | 0.660 | 0.201 | 3.3x |
| Dependency (T2) | 0.634 | 0.078 | 8.1x |
| Cross-concept (T5) | 0.323 | 0.115 | 2.8x |
| Entity lookup (T1) | 0.207 | 0.094 | 2.2x |
The biggest win (8.1x) is on dependency queries, exactly the query type that matters in clinical, legal, financial, and regulatory domains.
## Structure is the signal, not curation effort
Track 2 of the benchmark tests this directly: I built a GLP-1/pharma domain from the ClinicalTrials.gov API in a single session, with no expert curation. It scored F1 = 0.530, higher than the 0.471 average across the 45 general domains. If a domain has knowable dependencies, it can be CKG-ified; the structure drives the accuracy, not the curation effort.
## Try it
The MCP server works in Claude Code and any MCP-compatible agent:

```shell
pip install ckg-mcp
```

Your agent gets four tools: `list_domains`, `query_ckg`, `get_prerequisites`, and `search_concepts`.
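If you want to wire the server into Claude Code yourself, a project-level `.mcp.json` entry might look like the following. This is a sketch that assumes the package installs a `ckg-mcp` console command; check the repo's README for the actual invocation.

```json
{
  "mcpServers": {
    "ckg": {
      "command": "ckg-mcp",
      "args": []
    }
  }
}
```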
- Live demo: https://huggingface.co/spaces/danyarm/ckg-demo
- Full dataset (45 domain CSVs, the 7,928-query JSONL, and results): https://huggingface.co/datasets/danyarm/ckg-benchmark
- Paper and benchmark code: https://github.com/Yarmoluk/ckg-benchmark
- One-page summary: https://github.com/Yarmoluk/ckg-benchmark/blob/main/SUMMARY.md
## Custom domains
The benchmark covers 45 general domains. For clinical, legal, financial, or regulatory domains where dependency structure is critical: graphifymd.com
All code MIT licensed. Data CC BY 4.0. Questions welcome in the comments.