We're building a benchmarking platform to compare three AI retrieval pipelines on a large corpus of Indian public health research papers from PubMed Central. Here are the architecture, the engineering decisions, and why we think graph-based retrieval is the right approach for this problem, written before the benchmark numbers are in.
The Problem in One Sentence
Ask a RAG system: "How does diabetes affect TB treatment outcomes, and what role does HbA1c play?"
Vector search returns chunks about diabetes. Chunks about TB. Maybe a chunk about HbA1c. But it has no idea those three things are connected — and that connection is the entire answer.
That's the problem we set out to benchmark rigorously.
What We Built
A benchmarking platform that will run three AI pipelines in parallel on a target corpus of ~9,000 Indian public health research papers from PubMed Central, covering Diabetes, Tuberculosis, Maternal Health, and Malaria. The ingestion pipeline is built; the corpus is being populated.
Three pipelines, same LLM, same queries:
| Pipeline | Retrieval Strategy |
|---|---|
| LLM-Only | No retrieval — raw GPT-4o-mini |
| Basic RAG | FAISS vector search + cross-encoder reranking |
| GraphRAG | TigerGraph multi-hop traversal |
Every query runs through all three simultaneously. We measure tokens, cost, latency, LLM-as-a-Judge quality scores, and BERTScore F1.
Why Vector Search Breaks on Multi-Hop Questions
Standard RAG treats every chunk as an island. There's no edge between the HbA1c chunk and the rifampicin chunk. If your question requires connecting those two concepts, the retriever can't help you — unless both happen to be semantically close to the query, which for indirect relationships, they often aren't.
Three failure modes we observed repeatedly:
1. Indirect relationships are invisible
Query: "How does rifampicin affect glycemic control in diabetic TB patients?"
The answer requires connecting rifampicin's CYP enzyme induction to hepatic glucose metabolism. Typically no single paper covers both, so vector search misses the connection entirely.
2. Entity role confusion
A query about MDR-TB treatment in pediatric patients scores adult MDR-TB chunks highly — same keywords, wrong population. The retriever has no concept of role.
3. Aggregation is impossible
"What are the most common comorbidities in Indian TB literature?" can't be answered from any single chunk. You need corpus-wide aggregation. Vector search gives you the most similar individual chunks, which is not the same thing.
Building the Corpus: Pulling From PMC at Scale
We used NCBI's E-utilities API (through Biopython's Entrez module) with domain-specific MeSH queries:
from Bio import Entrez

Entrez.email = "your@email.com"  # NCBI asks for a contact address on every request

def fetch_pmc_ids(domain_query: str, max_results: int = 3000) -> list[str]:
    """Return up to max_results PMC record IDs matching domain_query."""
    # First call only counts the hits; retmax=0 returns no IDs.
    handle = Entrez.esearch(db="pmc", term=domain_query, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    pmc_ids: list[str] = []
    batch_size = 200
    limit = min(total, max_results)
    for start in range(0, limit, batch_size):
        # Page through the ID list. With db="pmc" these are PMC IDs, not PubMed PMIDs.
        handle = Entrez.esearch(
            db="pmc",
            term=domain_query,
            retstart=start,
            retmax=min(batch_size, limit - start),
        )
        page = Entrez.read(handle)
        handle.close()
        pmc_ids.extend(page["IdList"])
    return pmc_ids
The domain query for Tuberculosis looked like:
(tuberculosis[MeSH] OR "TB"[tiab] OR "MDR-TB"[tiab])
AND ("India"[Affiliation] OR "Indian"[Affiliation])
AND (epidemiology[MeSH] OR "public health"[tiab] OR "clinical trial"[tiab])
Preprocessing headaches worth knowing about:
- ~8% of papers had no abstract — fallback to first 500 tokens of full text
- PMC affiliation strings are free-text and inconsistent ("AIIMS", "All India Institute of Medical Sciences", and "New Delhi 110029" all need to resolve to the same institution)
- Some papers appear under multiple PMIDs due to PMC versioning, which requires deduplication
- Retraction Watch cross-reference is non-optional for a medical corpus
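The deduplication step can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: the record fields (`pmcid`, `doi`, `title`) and the sample values are hypothetical, and it assumes a DOI match or a normalized-title match is good enough to call two records the same paper.

```python
import re

def dedupe_records(records: list[dict]) -> list[dict]:
    """Keep the first record per DOI, or per normalized title when the DOI is missing."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        # Lowercase the title and collapse punctuation/whitespace runs to single spaces.
        title = re.sub(r"[^a-z0-9]+", " ", (rec.get("title") or "").lower()).strip()
        key = f"doi:{doi}" if doi else f"title:{title}"
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique

# Hypothetical records for illustration only.
records = [
    {"pmcid": "PMC-A", "doi": "10.1000/example.1", "title": "Diabetes and TB"},
    {"pmcid": "PMC-B", "doi": "10.1000/EXAMPLE.1 ", "title": "Diabetes and TB (v2)"},
    {"pmcid": "PMC-C", "doi": None, "title": "Malaria  trends: 2010-2020"},
    {"pmcid": "PMC-D", "doi": "", "title": "malaria trends 2010 2020"},
]
print([r["pmcid"] for r in dedupe_records(records)])
```

One limitation of this sketch: a version that carries a DOI and another version of the same paper without one will not be merged, so a real pipeline would likely need a second pass.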
Affiliation filtering used multi-pass regex covering explicit country mentions, known Indian institution abbreviations (AIIMS, JIPMER, ICMR, PGIMER, NIMHANS, CMC Vellore), and major city names. False positive rate: ~2-3%.
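A multi-pass filter of this kind might look like the sketch below. The institution abbreviations come from the list above; the city list and the exact patterns are assumptions chosen for illustration, and the real filter is certainly larger.

```python
import re

# Pass 1: explicit country mention ("India" or "Indian", but not "Indiana").
COUNTRY = re.compile(r"\bIndian?\b", re.IGNORECASE)
# Pass 2: institution abbreviations named in the text above (case-sensitive).
INSTITUTIONS = re.compile(r"\b(AIIMS|JIPMER|ICMR|PGIMER|NIMHANS|CMC\s+Vellore)\b")
# Pass 3: major city names. This short list is an assumption for illustration.
CITIES = re.compile(r"\b(New Delhi|Mumbai|Chennai|Kolkata|Bengaluru|Bangalore)\b", re.IGNORECASE)

def is_indian_affiliation(affiliation: str) -> bool:
    """Return True if any pass finds evidence of an Indian affiliation."""
    return any(p.search(affiliation) for p in (COUNTRY, INSTITUTIONS, CITIES))

print(is_indian_affiliation("Department of Medicine, AIIMS, New Delhi 110029"))
print(is_indian_affiliation("Indiana University School of Medicine, Indianapolis, USA"))
```

Patterns this loose will still let through strings such as "Indian River", which is consistent with a small residual false positive rate.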
The Knowledge Graph Schema
The schema is the most consequential design decision. Too sparse and traversals return nothing. Too noisy and you get spurious paths.
10 vertex types:
Disease, Treatment, Biomarker, Population, GeographicRegion, Intervention, Outcome, Study, Institution, Comorbidity
10 edge types:
| Edge | Meaning |
|---|---|
| TREATS | Treatment → Disease |
| ASSOCIATED_WITH | Disease ↔ Disease or Disease ↔ Biomarker |
| MEASURED_BY | Disease → Biomarker |
| RISK_FACTOR_FOR | Biomarker/Population → Disease |
| COMPLICATES | Disease ↔ Disease (bidirectional) |
| REPORTS_OUTCOME | Study → Outcome |
| STUDIED_IN | Study → Population/GeographicRegion |
| CO_OCCURS_WITH | Disease ↔ Disease (same study context) |
| CONDUCTED_BY | Study → Institution |
| PART_OF | GeographicRegion → GeographicRegion (hierarchy) |
Why typed edges matter: Early prototypes used a generic related_to edge for everything. The graph was technically connected but semantically useless — traversal had no way to distinguish "treats" from "is a risk factor for." Every edge also carries a confidence score from the extraction model. Edges below 0.65 are filtered during retrieval. This single quality gate made the biggest difference to answer quality.
Final graph stats: ~17,830 vertices, ~142,000 edges, about 8 edges per vertex on average (142,000 / 17,830 ≈ 7.96), graph diameter ~6 hops.
Multi-Hop Retrieval: A Concrete Walkthrough
Query: "What is the impact of diabetes on TB treatment outcomes in India?"
Step 1 — Entity extraction from query
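import spacy

# scispaCy model; the en_core_sci_lg package must be installed separately
nlp = spacy.load("en_core_sci_lg")

def extract_query_entities(query: str) -> list[dict]:
    """Return each entity span found in the query with its character offsets."""
    doc = nlp(query)
    entities = []
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char
        })
    return entities

# Illustrative result (start/end omitted). Note that en_core_sci_lg tags every
# span with the generic label "ENTITY"; the typed labels below assume a later
# step that maps each entity to a schema vertex type.
# [
#     {"text": "diabetes", "label": "DISEASE"},
#     {"text": "TB", "label": "DISEASE"},
#     {"text": "treatment outcomes", "label": "OUTCOME"},
#     {"text": "India", "label": "GPE"}
# ]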
Step 2 — GSQL 2-hop traversal in TigerGraph
CREATE QUERY multi_hop_retrieval(
  SET<VERTEX> seed_vertices,
  FLOAT confidence_threshold = 0.65,
  INT max_hops = 2
) FOR GRAPH BiomedicalGraph {
  OrAccum<BOOL> @visited;                  // per-vertex flag: already reached
  SetAccum<EDGE> @@retrieved_edges;        // global: edges in the result subgraph
  SetAccum<VERTEX> @@retrieved_vertices;   // global: vertices in the result subgraph

  seed_set = {seed_vertices};
  @@retrieved_vertices += seed_set;

  // Expand the frontier one hop at a time, up to max_hops.
  FOREACH hop IN RANGE[1, max_hops] DO
    seed_set = SELECT t
               FROM seed_set:s -(:e)- :t
               // Drop low-confidence edges and targets reached on an earlier hop;
               // edges into already-visited vertices are therefore not collected.
               WHERE e.confidence >= confidence_threshold
                 AND NOT t.@visited
               ACCUM
                 @@retrieved_edges += e,
                 @@retrieved_vertices += t
               POST-ACCUM
                 t.@visited = TRUE;
  END;

  PRINT @@retrieved_vertices, @@retrieved_edges;
}
Step 3 — What the traversal actually finds
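Starting from the Type 2 Diabetes and Tuberculosis vertices, 2 hops out (edges are followed in either direction, so each one is listed from the vertex it was reached from):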
Hop 1 from Diabetes:
→ Tuberculosis [ASSOCIATED_WITH, confidence: 0.91]
→ HbA1c [MEASURED_BY, confidence: 0.97]
→ MDR-TB [RISK_FACTOR_FOR, confidence: 0.74]
Hop 1 from Tuberculosis:
→ Rifampicin [TREATS, confidence: 0.89]
→ DOTS therapy [TREATS, confidence: 0.93]
→ Treatment failure [REPORTS_OUTCOME, confidence: 0.85]
Hop 2 from Rifampicin:
→ HbA1c elevation [ASSOCIATED_WITH, confidence: 0.71]
→ Hepatotoxicity [ASSOCIATED_WITH, confidence: 0.83]
Hop 2 from HbA1c:
→ Treatment failure [RISK_FACTOR_FOR, confidence: 0.78]
Context passed to the LLM: ~380 tokens.
Basic RAG for the same query: 5–8 chunks × ~450 tokens = 2,250–3,600 tokens.
The graph didn't just return less text — it returned a connected evidence chain that the LLM could actually reason over.
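To show how a retrieved subgraph can become a compact prompt context, here is a minimal serialization sketch. The `Edge` class, the one-triple-per-line format, and the sort order are assumptions for illustration; the first three edges reuse values from the walkthrough, and the last one is a made-up low-confidence edge included only to show the filter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    edge_type: str
    target: str
    confidence: float

def serialize_subgraph(edges: list[Edge], threshold: float = 0.65) -> str:
    """Render edges at or above the threshold as one triple per line, strongest first."""
    kept = sorted(
        (e for e in edges if e.confidence >= threshold),
        key=lambda e: -e.confidence,
    )
    return "\n".join(
        f"{e.source} -[{e.edge_type}, {e.confidence:.2f}]-> {e.target}" for e in kept
    )

edges = [
    Edge("Diabetes", "ASSOCIATED_WITH", "Tuberculosis", 0.91),
    Edge("Diabetes", "MEASURED_BY", "HbA1c", 0.97),
    Edge("HbA1c", "RISK_FACTOR_FOR", "Treatment failure", 0.78),
    Edge("Diabetes", "CO_OCCURS_WITH", "Example condition", 0.40),  # hypothetical, filtered out
]
print(serialize_subgraph(edges))
```

At the figures quoted above, a ~380-token graph context is roughly 11 to 17 percent of the 2,250–3,600 tokens that Basic RAG would pass for the same query.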
Entity Extraction Pipeline
We used a three-stage pipeline:
import spacy
from scispacy.linking import EntityLinker  # noqa: F401 (import registers "scispacy_linker")

# Stage 1: NER with scispaCy, followed by UMLS entity linking
nlp = spacy.load("en_core_sci_lg")
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
linker = nlp.get_pipe("scispacy_linker")

def get_canonical_name(cui: str) -> str:
    """Look up the preferred UMLS name for a concept ID in the linker's knowledge base."""
    return linker.kb.cui_to_entity[cui].canonical_name

def extract_and_normalize_entities(text: str) -> list[dict]:
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        umls_concepts = ent._.kb_ents  # [(cui, score), ...], best match first
        top_concept = umls_concepts[0] if umls_concepts else None
        entities.append({
            "surface_form": ent.text,
            "label": ent.label_,
            "umls_cui": top_concept[0] if top_concept else None,
            "confidence": top_concept[1] if top_concept else 0.0,
            # Fall back to the raw text when no UMLS concept was linked
            "canonical_name": get_canonical_name(top_concept[0]) if top_concept else ent.text
        })
    return entities

# Stage 2: Relation extraction (fine-tuned BioBERT)
# Stage 3: Confidence scoring + edge filtering at load time
Drug name normalization was the messiest part. metformin, Metformin HCl, metformin hydrochloride, and Glucophage all need to resolve to the same vertex. UMLS handles most of this, but Indian-specific institutional and geographic terms needed custom patterns.
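For terms that UMLS linking does not resolve, a small alias table is one way to force variants onto a single vertex. The sketch below is an assumption about how such custom patterns could be applied, not the project's actual mapping; the table entries and canonical names are illustrative.

```python
import re

# Hypothetical alias table: lowercased surface form -> canonical vertex name.
ALIASES = {
    "metformin": "Metformin",
    "metformin hcl": "Metformin",
    "metformin hydrochloride": "Metformin",
    "glucophage": "Metformin",
    "aiims": "All India Institute of Medical Sciences",
}

def normalize_surface_form(text: str) -> str:
    """Map a surface form to its canonical name, or return it unchanged if unknown."""
    key = re.sub(r"\s+", " ", text.strip().lower())
    return ALIASES.get(key, text.strip())

for form in ["metformin", "Metformin HCl", "metformin  hydrochloride", "Glucophage"]:
    print(form, "->", normalize_surface_form(form))
```

Running the alias lookup before UMLS linking keeps the custom cases cheap and leaves everything else to the linker.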
Evaluation Design
The benchmark will run 120 queries across 4 categories:
- 30 simple factual ("First-line treatment for MDR-TB in India?")
- 30 single-hop relational ("Which biomarkers monitor diabetes treatment in Indian studies?")
- 35 multi-hop reasoning ("How does diabetes affect TB outcomes and what role does HbA1c play?")
- 25 aggregation/synthesis ("Most common comorbidities in Indian TB literature?")
LLM-as-a-Judge: GPT-4o will evaluate GPT-4o-mini outputs on 4 rubric dimensions. To mitigate known judge biases, chiefly a preference for longer answers and sensitivity to presentation order, we use randomized presentation order, a structured rubric (not open-ended scoring), and cross-validation against human expert ratings on a 30-question subset.
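Randomized presentation order can be implemented in a few lines. This is a sketch under assumptions: the function name, the neutral "Answer A/B/C" labels, and the per-query seed are illustrative choices, not the benchmark's actual harness.

```python
import random

def blind_and_shuffle(answers: dict[str, str], seed: int) -> tuple[list[str], dict[str, str]]:
    """Return answers in shuffled order under neutral labels, plus the label-to-pipeline key."""
    pipelines = list(answers)
    # A seeded generator keeps the order reproducible for a given query.
    random.Random(seed).shuffle(pipelines)
    labels = [f"Answer {chr(ord('A') + i)}" for i in range(len(pipelines))]
    key = dict(zip(labels, pipelines))
    presented = [f"{label}: {answers[key[label]]}" for label in labels]
    return presented, key

answers = {
    "llm_only": "placeholder answer 1",
    "basic_rag": "placeholder answer 2",
    "graphrag": "placeholder answer 3",
}
presented, key = blind_and_shuffle(answers, seed=7)
print("\n".join(presented))
```

The judge only ever sees the neutral labels; the returned key maps its scores back to pipelines afterwards.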
BERTScore will be computed against expert-written reference answers using microsoft/deberta-xlarge-mnli.
from bert_score import score as bert_score

def evaluate_with_bertscore(
    predictions: list[str],
    references: list[str]
) -> dict:
    P, R, F1 = bert_score(
        predictions,
        references,
        model_type="microsoft/deberta-xlarge-mnli",
        lang="en",
        verbose=False
    )
    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
When GraphRAG Wins — and When It Doesn't
Where we expect GraphRAG to win clearly:
- Multi-hop questions where the answer requires connecting 3+ concepts
- Aggregation queries across the full corpus
- Any query where explainability matters (you get auditable evidence paths)
- Complex biomedical questions with indirect causal chains
Where we expect GraphRAG to lose or tie:
- Simple factual lookups — Basic RAG is equally accurate and faster
- Latency-sensitive applications — graph traversal is slower than ANN lookup; expect higher end-to-end latency
- Low-resource domains — if your entity extraction is noisy, the graph degrades fast
- Setup cost — graph schema design, NER pipeline, TigerGraph infrastructure vs. a FAISS index
The upfront investment is real. For a focused domain with relationship-dense literature, it pays off. For a general-purpose Q&A system over miscellaneous documents, it probably doesn't.
Architecture Summary
PMC API → Biopython Entrez → Affiliation Filter → scispaCy NER
→ UMLS Normalization → Relation Extraction (BioBERT)
→ TigerGraph GSQL Load Jobs → Knowledge Graph
Query → Entity Extraction → Seed Vertex Lookup
→ GSQL Multi-Hop Traversal → Subgraph Serialization
→ GPT-4o-mini → Answer
Evaluation: LLM-as-a-Judge (GPT-4o) + BERTScore + tiktoken cost accounting
Dashboard: Streamlit + PyVis graph visualization
What's Next
A few things we're actively working on:
- Hybrid retrieval — using vector search to seed initial entity candidates, then graph traversal to expand. Best of both worlds for queries that sit between the categories.
- Incremental graph updates — TigerGraph's upsert semantics make this possible; the pipeline isn't fully automated yet.
- Query routing — classifying incoming queries as "simple factual" vs "multi-hop" to dispatch to the right pipeline automatically. The latency tradeoff makes this worthwhile in production.
- Completing the corpus — ingestion pipeline is done; working through the full ~9,000 paper target across all four domains
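As a starting point for query routing, a rule-based dispatcher is probably enough to prototype with before training a classifier. The sketch below is hypothetical: the cue phrases, the two-entity threshold, and the pipeline names are assumptions, and the `entities` argument is expected in the shape returned by `extract_query_entities` above.

```python
def route_query(entities: list[dict], query: str) -> str:
    """Pick a pipeline name from simple surface cues. Thresholds are illustrative."""
    aggregation_cues = ("most common", "how many", "across", "overall")
    if any(cue in query.lower() for cue in aggregation_cues):
        return "graphrag"
    # Two or more distinct entities suggests a relational or multi-hop question.
    distinct = {e["text"].lower() for e in entities}
    return "graphrag" if len(distinct) >= 2 else "basic_rag"

print(route_query([{"text": "MDR-TB"}], "First-line treatment for MDR-TB?"))
print(route_query(
    [{"text": "diabetes"}, {"text": "TB"}, {"text": "HbA1c"}],
    "How does diabetes affect TB outcomes and what role does HbA1c play?",
))
```

A misrouted simple query only costs latency, while a misrouted multi-hop query costs answer quality, so erring toward the graph pipeline seems the safer default.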