Building Knowledge Graphs for Drug Discovery: A Technical Deep Dive with KRAS G12C
Knowledge graphs are the backbone of modern drug discovery. But most implementations are static, manually curated, and lag behind the literature by months. RUMI builds knowledge graphs dynamically from PubMed papers, enriches them with 15+ external databases, and uses graph metrics to surface non-obvious relationships.
Here's the technical architecture, using KRAS G12C cancer drug resistance as a real case study.
The Pipeline
Step 1: Literature Acquisition
RUMI searches PubMed using the NCBI E-utilities API with semantic query expansion. For "KRAS G12C resistance mechanisms", it generates multiple query variants:
queries = [
    '"KRAS G12C" AND resistance',
    '"KRAS G12C" AND "treatment failure"',
    '"sotorasib" AND resistance',
    '"adagrasib" AND resistance',
    '"KRAS G12C" AND "adaptive resistance"',
]
Each query returns different papers. RUMI deduplicates by PMID and filters by relevance score (cosine similarity between query and abstract embeddings).
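As a rough illustration of this step, here is a minimal sketch that builds ESearch request URLs, merges per-query results by PMID, and filters by cosine similarity. The function names, the `retmax` default, and the 0.5 threshold are assumptions made for this example, not RUMI's actual code, and the embeddings are assumed to come from whatever model the pipeline uses.

```python
import math
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(query, retmax=100):
    """Build an E-utilities ESearch URL that returns PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"

def dedupe_by_pmid(result_lists):
    """Merge per-query PMID lists, keeping the first occurrence of each PMID."""
    seen, merged = set(), []
    for pmids in result_lists:
        for pmid in pmids:
            if pmid not in seen:
                seen.add(pmid)
                merged.append(pmid)
    return merged

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_by_relevance(query_vec, abstract_vecs, threshold=0.5):
    """Keep PMIDs whose abstract embedding is close enough to the query embedding.

    abstract_vecs maps PMID -> embedding; the threshold is an illustrative assumption.
    """
    return [pmid for pmid, vec in abstract_vecs.items()
            if cosine(query_vec, vec) >= threshold]
```

Deduplicating before embedding keeps the expensive step (computing abstract embeddings) from running more than once per paper.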
Step 2: Entity Extraction
RUMI extracts domain-specific entities using a combination of:
- Dictionary matching: HGNC gene names, DrugBank drug names, MeSH disease terms
- LLM extraction: For novel entities not in dictionaries (new protein names, experimental compounds)
- Contextual disambiguation: "BRAF" could be the gene or the protein — RUMI uses context to resolve
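The dictionary-matching part can be sketched roughly as follows. The tiny in-memory dictionaries stand in for the real HGNC, DrugBank and MeSH vocabularies, and matching gene symbols case-sensitively is an assumption of this sketch (it avoids hits on ordinary words such as "met"), not a documented RUMI behavior.

```python
import re

# Tiny illustrative dictionaries; real ones would be loaded from HGNC, DrugBank and MeSH.
DICTIONARIES = {
    "Gene/Protein": {"KRAS", "MET", "AURKA", "PHB2"},
    "Drug/Compound": {"sotorasib", "adagrasib"},
    "Disease/Condition": {"NSCLC"},
}

def match_entities(text):
    """Return (surface form, entity type, start offset) for every dictionary hit."""
    hits = []
    for entity_type, terms in DICTIONARIES.items():
        # Assumption: gene symbols match case-sensitively, everything else ignores case.
        flags = 0 if entity_type == "Gene/Protein" else re.IGNORECASE
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags):
                hits.append((m.group(0), entity_type, m.start()))
    return sorted(hits, key=lambda h: h[2])

hits = match_entities("Sotorasib resistance in NSCLC often involves MET amplification.")
```

Anything the dictionaries miss would then fall through to the LLM extraction pass.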
Entity types in the drug discovery domain:
| Entity Type | Examples | Count in KRAS Analysis |
|---|---|---|
| Gene/Protein | KRAS, AURKA, PHB2, MET | 23 |
| Drug/Compound | Sotorasib, Adagrasib, BBO-8520 | 8 |
| Disease/Condition | NSCLC, CRC, pancreatic cancer | 5 |
| Pathway | MAPK, PI3K/AKT, RAS/MAPK | 6 |
| Mechanism | Amplification, mutation, methylation | 9 |
Step 3: Relationship Extraction
For each pair of co-occurring entities, RUMI extracts the relationship type:
- Activates: A promotes B
- Inhibits: A suppresses B
- Associated: A and B co-occur (direction unknown)
- Causes: A causally leads to B
- Treats: A is a therapeutic intervention for B
The LLM extracts relationship direction and confidence from the sentence context:
Input: "Sotorasib-bound KRAS accumulates and reactivates MAPK signaling through DHX9 cytoplasmic retention"
Output: {
  "subject": "KRAS G12C (sotorasib-bound)",
  "predicate": "reactivates",
  "object": "MAPK signaling",
  "mechanism": "DHX9 cytoplasmic retention",
  "confidence": 0.85
}
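Downstream code needs to check that output before it reaches the graph. The sketch below assumes the LLM returns valid JSON with the fields shown above; the `Relationship` class and both helper functions are hypothetical names for illustration. Note that a free-text predicate such as "reactivates" is not one of the five listed types, so some mapping step would be needed, and how RUMI handles that is not covered here.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The five relationship types listed above.
RELATION_TYPES = {"activates", "inhibits", "associated", "causes", "treats"}

@dataclass
class Relationship:
    subject: str
    predicate: str
    object: str
    confidence: float
    mechanism: Optional[str] = None

def parse_relationship(raw):
    """Parse one LLM response and reject structurally invalid output."""
    data = json.loads(raw)
    rel = Relationship(
        subject=data["subject"],
        predicate=data["predicate"].lower(),
        object=data["object"],
        confidence=float(data["confidence"]),
        mechanism=data.get("mechanism"),
    )
    if not 0.0 <= rel.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {rel.confidence}")
    return rel

def is_known_type(rel):
    """True if the predicate is already one of the five listed relationship types."""
    return rel.predicate in RELATION_TYPES
```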
Step 4: Graph Construction
The knowledge graph is a directed multigraph where:
- Nodes = entities (with type, attributes, source papers)
- Edges = relationships (with type, confidence, evidence papers)
- Edge weight = number of independent papers supporting the relationship
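class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}       # entity_id -> Entity
        self.edges = []       # list of Edge
        self.edge_index = {}  # (subject_id, predicate, object_id) -> Edge

    def add_relationship(self, subject, predicate, obj, paper_id, confidence):
        # Register both endpoints so every edge refers to a known node.
        self.nodes.setdefault(subject.id, subject)
        self.nodes.setdefault(obj.id, obj)
        edge_key = (subject.id, predicate, obj.id)
        if edge_key in self.edge_index:
            # Known relationship: record the extra paper as supporting evidence.
            edge = self.edge_index[edge_key]
            edge.evidence.append(paper_id)
            edge.weight += 1
        else:
            # New relationship: store it and index it for later lookups.
            edge = Edge(subject, predicate, obj, paper_id, confidence)
            self.edges.append(edge)
            self.edge_index[edge_key] = edge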
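To make the node and edge records concrete, here is a self-contained sketch of how `Entity` and `Edge` might look, with the weight derived from the set of supporting papers. The field names, the `add_evidence` helper, the placeholder paper IDs, and the choice to keep the highest confidence seen are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    id: str
    type: str  # e.g. "Gene/Protein", "Drug/Compound"

@dataclass
class Edge:
    subject: Entity
    predicate: str
    obj: Entity
    confidence: float
    evidence: set = field(default_factory=set)  # IDs of supporting papers

    @property
    def weight(self):
        """Number of independent papers supporting this relationship."""
        return len(self.evidence)

def add_evidence(index, subject, predicate, obj, paper_id, confidence):
    """Create the edge on first sight, otherwise attach another supporting paper."""
    key = (subject.id, predicate, obj.id)
    edge = index.setdefault(key, Edge(subject, predicate, obj, confidence))
    edge.evidence.add(paper_id)
    edge.confidence = max(edge.confidence, confidence)  # assumption: keep the best score
    return edge

index = {}
sotorasib = Entity("sotorasib", "Drug/Compound")
kras = Entity("KRAS G12C", "Gene/Protein")
add_evidence(index, sotorasib, "inhibits", kras, "paper-A", 0.90)
add_evidence(index, sotorasib, "inhibits", kras, "paper-B", 0.80)
edge = add_evidence(index, sotorasib, "inhibits", kras, "paper-A", 0.90)  # same paper again
print(edge.weight)  # 2
```

Storing evidence as a set means a paper that mentions the same relationship twice only counts once toward the weight.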
Step 5: External Enrichment
RUMI enriches the graph with data from 15+ APIs:
| API | What it adds |
|---|---|
| PubChem | Drug structures, targets, clinical trial status |
| UniProt | Protein functions, domains, interactions |
| PDB | 3D structures for molecular docking context |
| OpenFDA | Adverse events, approved indications |
| Semantic Scholar | Citation networks, related papers |
| KEGG | Pathway membership, metabolic context |
| Reactome | Detailed biological pathway diagrams |
| ClinicalTrials.gov | Active trials for compounds in the graph |
This enrichment adds hundreds of additional edges and attributes to the graph.
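One way to structure this step is a set of pluggable enrichers, each wrapping one API. The sketch below is an assumption about the shape of such a layer, not RUMI's implementation: `uniprot_stub` is a stand-in that returns a hard-coded accession instead of calling the network, and the `sources` bookkeeping is a hypothetical convention for tracking provenance.

```python
from typing import Callable, Dict

# An enricher takes an entity name and returns attributes to attach to its node.
Enricher = Callable[[str], Dict[str, object]]

def uniprot_stub(name: str) -> Dict[str, object]:
    """Stand-in for a UniProt lookup; a real enricher would call the UniProt REST API."""
    known = {"KRAS": {"uniprot_accession": "P01116"}}
    return known.get(name, {})

def enrich(nodes: Dict[str, dict], enrichers: Dict[str, Enricher]) -> None:
    """Run every enricher on every node and record which source supplied each attribute."""
    for name, attrs in nodes.items():
        for source, fetch in enrichers.items():
            try:
                found = fetch(name)
            except Exception:
                continue  # one failing API should not abort the whole enrichment pass
            for key, value in found.items():
                attrs[key] = value
                attrs.setdefault("sources", {})[key] = source

nodes = {"KRAS": {"type": "Gene/Protein"}, "DHX9": {"type": "Gene/Protein"}}
enrich(nodes, {"UniProt": uniprot_stub})
```

Recording the source next to each attribute makes it possible to tell literature-derived facts from database-derived ones later.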
Step 6: Graph Metrics
RUMI computes several graph metrics to identify important nodes and relationships:
Betweenness centrality: Identifies nodes that connect otherwise separate clusters. In the KRAS analysis, PI3K/AKT had high betweenness — it connects the MAPK resistance pathway to the metabolic survival pathway.
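For reference, betweenness can be computed with Brandes' algorithm. The sketch below is a plain-Python version for an unweighted, undirected view of the graph (an assumption: the real graph is a directed multigraph, so it would need collapsing first), run on a toy graph whose node names are purely illustrative.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm on an unweighted, undirected graph {node: set(neighbors)}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma) to every node.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share of paths.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Toy graph: two small clusters joined only through "bridge".
toy = {
    "left1": {"left2", "bridge"}, "left2": {"left1", "bridge"},
    "bridge": {"left1", "left2", "right1", "right2"},
    "right1": {"right2", "bridge"}, "right2": {"right1", "bridge"},
}
scores = betweenness(toy)  # "bridge" gets the only non-zero score
```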
Jaccard similarity: Measures overlap between two nodes' neighborhoods. High Jaccard between "MET amplification" and "KRAS G12C" suggests they share many co-occurring entities.
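In code, the Jaccard score is the size of the shared neighborhood divided by the size of the combined neighborhood. This sketch assumes an undirected adjacency map and leaves the two nodes themselves out of each other's neighbor sets, which is a choice made for this example.

```python
def jaccard(adj, a, b):
    """Shared neighbors of a and b divided by all neighbors of either node."""
    na, nb = adj[a] - {b}, adj[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```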
Clustering coefficient: Measures how interconnected a node's neighbors are. High clustering around "resistance mechanisms" suggests a tightly coupled biological system.
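A minimal sketch of the local clustering coefficient, assuming an undirected adjacency map of the form {node: set(neighbors)}:

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of pairs of the node's neighbors that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)
```

A value of 1.0 means every neighbor is linked to every other neighbor; 0.0 means the node sits at the center of a star.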
PageRank: Identifies the most "important" entities in the graph by counting incoming edges weighted by the importance of the source.
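PageRank can be approximated with a short power iteration. The version below is a sketch over a directed graph given as {node: [targets]}; the damping factor of 0.85 and the fixed iteration count are conventional defaults chosen for this example, not values taken from RUMI.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph given as {node: [targets]}."""
    nodes = set(out_edges) | {t for targets in out_edges.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_edges.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_edges.items():
            for t in targets:
                # Each node passes an equal share of its rank along every outgoing edge.
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank
```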
Step 7: Non-Obvious Relationship Discovery
The real value of the knowledge graph is finding relationships that no single paper explicitly states. RUMI uses:
Transitive inference: If A activates B and B inhibits C, then A may indirectly inhibit C. RUMI traces these chains and reports them with decreasing confidence at each hop.
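The sign logic behind this can be sketched as follows: treat "activates" as +1 and "inhibits" as -1, multiply the signs along a chain, and shrink the confidence at every hop. The per-hop decay of 0.8, the three-hop limit, and the edge format are assumptions for this example.

```python
SIGN = {"activates": +1, "inhibits": -1}

def infer_chains(edges, start, max_hops=3, hop_decay=0.8):
    """Follow activates/inhibits edges from `start` and compose their signs.

    edges: {source: [(predicate, target, confidence), ...]}
    Returns (target, inferred predicate, confidence, path) for chains of 2+ hops.
    """
    results = []
    stack = [(start, +1, 1.0, [start])]
    while stack:
        node, sign, conf, path = stack.pop()
        for predicate, target, edge_conf in edges.get(node, []):
            if predicate not in SIGN or target in path:
                continue  # skip unsigned relations and cycles
            new_sign = sign * SIGN[predicate]
            new_path = path + [target]
            hops = len(new_path) - 1
            # Confidence is the product of edge scores, decayed once per inferred hop.
            new_conf = conf * edge_conf * (hop_decay if hops > 1 else 1.0)
            if hops > 1:
                label = "activates" if new_sign > 0 else "inhibits"
                results.append((target, label, round(new_conf, 3), new_path))
            if hops < max_hops:
                stack.append((target, new_sign, new_conf, new_path))
    return results

chain = infer_chains({"A": [("activates", "B", 0.9)], "B": [("inhibits", "C", 0.8)]}, "A")
# chain == [("C", "inhibits", 0.576, ["A", "B", "C"])]
```

Two inhibitions in a row compose to an indirect activation, which is why the sign is tracked as a product instead of a label.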
Cluster analysis: Entities that cluster together but aren't directly connected are candidates for novel relationships. In the KRAS analysis, RUMI found that DHX9 and RAC1 appeared in the same cluster but no paper had directly connected them — leading to the DHX9-RAC1-PAK1 hypothesis.
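A simple version of this check takes the clusters as given and lists every pair inside a cluster that has no direct edge. The sketch below assumes clusters arrive as sets of entity names and the graph as an undirected adjacency map; how the clusters themselves are computed is outside its scope.

```python
from itertools import combinations

def unlinked_pairs_in_clusters(clusters, adj):
    """Return pairs that share a cluster but have no direct edge between them."""
    candidates = []
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            if b not in adj.get(a, set()) and a not in adj.get(b, set()):
                candidates.append((a, b))
    return candidates
```

Each returned pair is only a candidate: it still needs a plausible mechanism before it counts as a hypothesis.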
Gap analysis: Missing edges between important nodes suggest unstudied relationships. If two entities are both well-studied but never connected, that's either a genuine non-relationship or a gap in the literature.
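Gap analysis can be approximated by ranking nodes by degree and listing the top-ranked pairs that share no edge. Using degree as a proxy for "well-studied" and the `top_n` cutoff are both assumptions of this sketch.

```python
from itertools import combinations

def literature_gaps(adj, top_n=10):
    """Pairs among the top_n best-connected nodes that have no edge between them."""
    # Sort by descending degree, breaking ties by name so the result is deterministic.
    by_degree = sorted(adj, key=lambda v: (-len(adj[v]), v))[:top_n]
    return [(a, b) for a, b in combinations(sorted(by_degree), 2)
            if b not in adj[a] and a not in adj[b]]
```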
Results: The KRAS G12C Knowledge Graph
Final statistics from the analysis:
- 47 entities extracted across 14 papers
- 89 relationships identified
- 6 clusters corresponding to resistance mechanism families
- 3 high-confidence hypotheses generated from graph analysis
- 12 additional relationships from external database enrichment
The graph correctly identified PI3K/AKT as the convergence point for multiple resistance mechanisms — a finding that required connecting papers from 4 different research groups who never cited each other.
Limitations
- Entity extraction accuracy is ~85% — complex multi-gene names and abbreviations cause errors
- Relationship extraction is conservative: many real relationships are missed because the LLM only extracts relationships that are stated explicitly in the text
- Graph metrics are sensitive to publication bias — well-studied entities get higher centrality regardless of biological importance
- Temporal relationships (A happens before B) are not well captured
Try It
git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi
Run /discover KRAS G12C resistance mechanisms to see the full pipeline in action.
Links
- GitHub: https://github.com/subhansh-dev/Rumi
- Portfolio: https://subhanshh.vercel.app
If you work in bioinformatics, cheminformatics, or drug discovery, I'd love feedback on the graph construction approach. What metrics am I missing? What enrichment APIs should I add?
— Subhansh