subhansh

Posted on May 28

Building Knowledge Graphs for Drug Discovery: A Technical Deep Dive with KRAS G12C

#bioinformatics #python #datascience #research


Knowledge graphs underpin much of modern drug discovery, but most implementations are static, manually curated, and lag behind the literature by months. RUMI builds its knowledge graph dynamically from PubMed papers, enriches it with data from 15+ external databases, and uses graph metrics to surface non-obvious relationships.

Here's the technical architecture, using KRAS G12C cancer drug resistance as a real case study.

The Pipeline

Step 1: Literature Acquisition

RUMI searches PubMed using the NCBI E-utilities API with semantic query expansion. For "KRAS G12C resistance mechanisms", it generates multiple query variants:

queries = [
    '"KRAS G12C" AND resistance',
    '"KRAS G12C" AND "treatment failure"',
    '"sotorasib" AND resistance',
    '"adagrasib" AND resistance',
    '"KRAS G12C" AND "adaptive resistance"',
]

The queries return overlapping but distinct sets of papers. RUMI merges the results, deduplicates them by PMID, and filters them by relevance score (cosine similarity between the query embedding and each abstract embedding).
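As a rough illustration of this step, here is a minimal sketch that builds ESearch request URLs, merges PMID lists, and filters by cosine similarity. The function names, the 0.5 threshold, and the idea of passing embeddings in as plain lists are assumptions made for the example, not RUMI's actual code. No network request is sent.

```python
import math
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=100):
    """Build an ESearch URL for a PubMed query (the request itself is not sent)."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def merge_pmids(result_lists):
    """Union several PMID lists, keeping first-seen order."""
    seen, merged = set(), []
    for pmids in result_lists:
        for pmid in pmids:
            if pmid not in seen:
                seen.add(pmid)
                merged.append(pmid)
    return merged


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_relevant(query_vec, abstract_vecs, threshold=0.5):
    """Keep PMIDs whose abstract embedding is close enough to the query embedding."""
    return [pmid for pmid, vec in abstract_vecs.items()
            if cosine(query_vec, vec) >= threshold]
```

Keeping first-seen order in merge_pmids means papers returned by the earliest (most specific) query variant stay at the front of the list.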

Step 2: Entity Extraction

RUMI extracts domain-specific entities using a combination of:

Dictionary matching: HGNC gene names, DrugBank drug names, MeSH disease terms
LLM extraction: For novel entities not in dictionaries (new protein names, experimental compounds)
Contextual disambiguation: "BRAF" could be the gene or the protein — RUMI uses context to resolve

Entity types in the drug discovery domain:

Entity Type	Examples	Count in KRAS Analysis
Gene/Protein	KRAS, AURKA, PHB2, MET	23
Drug/Compound	Sotorasib, Adagrasib, BBO-8520	8
Disease/Condition	NSCLC, CRC, pancreatic cancer	5
Pathway	MAPK, PI3K/AKT, RAS/MAPK	6
Mechanism	Amplification, mutation, methylation	9
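The dictionary-matching part of entity extraction can be sketched as a whole-token lookup. The tiny dictionaries below are placeholders for real HGNC, DrugBank, and MeSH exports, and match_entities is a hypothetical name. Matching is case-insensitive here, which is deliberately naive: it would also tag the English word "met" as the gene MET, which is the kind of error the contextual disambiguation step has to catch.

```python
import re

# Toy dictionaries; a real run would load HGNC, DrugBank, and MeSH term lists.
DICTIONARIES = {
    "Gene/Protein": {"KRAS", "MET", "AURKA", "PHB2"},
    "Drug/Compound": {"sotorasib", "adagrasib"},
    "Disease/Condition": {"NSCLC", "CRC"},
}

# Flatten to a single lowercase lookup: term -> entity type.
LOOKUP = {term.lower(): etype
          for etype, terms in DICTIONARIES.items()
          for term in terms}


def match_entities(text):
    """Return (surface form, entity type) pairs in order of first appearance."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        etype = LOOKUP.get(token.lower())
        if etype and (token, etype) not in hits:
            hits.append((token, etype))
    return hits
```

Anything this lookup misses (new protein names, experimental compounds) is what would be handed to the LLM extraction step.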

Step 3: Relationship Extraction

For each pair of co-occurring entities, RUMI extracts the relationship type:

Activates: A promotes B
Inhibits: A suppresses B
Associated: A and B co-occur (direction unknown)
Causes: A causally leads to B
Treats: A is a therapeutic intervention for B

The LLM extracts relationship direction and confidence from the sentence context:

Input: "Sotorasib-bound KRAS accumulates and reactivates MAPK signaling through DHX9 cytoplasmic retention"
Output: {
  "subject": "KRAS G12C (sotorasib-bound)",
  "predicate": "reactivates",
  "object": "MAPK signaling",
  "mechanism": "DHX9 cytoplasmic retention",
  "confidence": 0.85
}
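The extracted predicate ("reactivates") is free text, while the graph uses five relationship types. One plausible way to bridge the two is a small normalization step. The synonym table, the fallback to "associated", and the 0.5 confidence cutoff below are all assumptions for illustration, not documented RUMI behavior.

```python
from dataclasses import dataclass

RELATION_TYPES = {"activates", "inhibits", "associated", "causes", "treats"}

# Hypothetical synonym table mapping free-text predicates onto the five types.
PREDICATE_SYNONYMS = {
    "reactivates": "activates",
    "promotes": "activates",
    "suppresses": "inhibits",
    "blocks": "inhibits",
}


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    object: str
    mechanism: str
    confidence: float


def normalize(raw, min_confidence=0.5):
    """Validate one extracted triple; return None if it should be dropped."""
    predicate = raw["predicate"].strip().lower()
    predicate = PREDICATE_SYNONYMS.get(predicate, predicate)
    if predicate not in RELATION_TYPES:
        predicate = "associated"  # unknown verbs fall back to the undirected type
    confidence = float(raw.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0 or confidence < min_confidence:
        return None
    return Relation(raw["subject"], predicate, raw["object"],
                    raw.get("mechanism", ""), confidence)
```

Applied to the example output above, this would yield an "activates" edge from sotorasib-bound KRAS G12C to MAPK signaling, with the DHX9 mechanism kept as an edge attribute.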

Step 4: Graph Construction

The knowledge graph is a directed multigraph where:

Nodes = entities (with type, attributes, source papers)
Edges = relationships (with type, confidence, evidence papers)
Edge weight = number of independent papers supporting the relationship

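class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}       # entity_id -> Entity
        self.edges = []       # list of Edge
        self.edge_index = {}  # (subject_id, predicate, object_id) -> Edge

    def add_relationship(self, subject, predicate, obj, paper_id, confidence):
        # Register both endpoints so every edge refers to a known node.
        self.nodes.setdefault(subject.id, subject)
        self.nodes.setdefault(obj.id, obj)
        edge_key = (subject.id, predicate, obj.id)
        if edge_key in self.edge_index:
            # Known relationship: record the new paper and bump the weight.
            edge = self.edge_index[edge_key]
            edge.evidence.append(paper_id)
            edge.weight += 1
        else:
            # Edge is assumed to start with evidence [paper_id] and weight 1.
            edge = Edge(subject, predicate, obj, paper_id, confidence)
            self.edges.append(edge)
            self.edge_index[edge_key] = edge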

Step 5: External Enrichment

RUMI enriches the graph with data from 15+ APIs:

API	What it adds
PubChem	Drug structures, targets, clinical trial status
UniProt	Protein functions, domains, interactions
PDB	3D structures for molecular docking context
OpenFDA	Adverse events, approved indications
Semantic Scholar	Citation networks, related papers
KEGG	Pathway membership, metabolic context
Reactome	Detailed biological pathway diagrams
ClinicalTrials.gov	Active trials for compounds in the graph

This enrichment adds new edges and many additional node attributes to the graph. In the KRAS analysis it contributed 12 additional relationships.
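One way to structure enrichment across many sources is a registry of per-source callables whose results are merged into node attributes with provenance. The sketch below assumes that shape. The enrich function, the stub enrichers, and their return values are hypothetical stand-ins and do not call any real API.

```python
def enrich(nodes, enrichers):
    """Merge attributes from each source into each node's attribute dict.

    nodes:     {entity_id: {attribute: value}}
    enrichers: {source_name: callable(entity_id) -> dict of new attributes}
    Returns {entity_id: {attribute: source_name}} so added values are traceable.
    """
    provenance = {}
    for entity_id, attrs in nodes.items():
        for source, fetch in enrichers.items():
            try:
                extra = fetch(entity_id)
            except Exception:
                continue  # a failing source should not abort the whole run
            for key, value in extra.items():
                if key not in attrs:  # never overwrite values already present
                    attrs[key] = value
                    provenance.setdefault(entity_id, {})[key] = source
    return provenance


# Stub enrichers standing in for real API clients (hypothetical return values).
def fake_uniprot(entity_id):
    return {"function": "GTPase"} if entity_id == "KRAS" else {}


def fake_openfda(entity_id):
    raise TimeoutError("simulated outage")
```

Recording which source supplied each attribute makes it possible to separate literature-derived edges from database-derived ones later in the analysis.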

Step 6: Graph Metrics

RUMI computes several graph metrics to identify important nodes and relationships:

Betweenness centrality: Identifies nodes that connect otherwise separate clusters. In the KRAS analysis, PI3K/AKT had high betweenness — it connects the MAPK resistance pathway to the metabolic survival pathway.
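For reference, betweenness centrality on a small graph can be computed with Brandes' algorithm. The sketch below treats the graph as unweighted and undirected, given as a dict of neighbor sets. That simplification is an assumption, since the post does not say how RUMI handles edge direction or weight for this metric.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph {node: set(neighbors)}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}   # dependency of s on each node
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}
```

On a toy graph where one node is the only link between two triangles, that bridging node gets all of the betweenness and every other node gets zero, which is the pattern described for PI3K/AKT.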

Jaccard similarity: Measures overlap between two nodes' neighborhoods. High Jaccard between "MET amplification" and "KRAS G12C" suggests they share many co-occurring entities.
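Jaccard similarity is the size of the intersection of two neighborhoods divided by the size of their union. A minimal sketch follows, assuming the graph is stored as a dict of neighbor sets. Excluding the two nodes from each other's neighborhoods is a design choice made here, not something the post specifies.

```python
def jaccard(adj, a, b):
    """Jaccard similarity of the neighborhoods of a and b in {node: set(neighbors)}."""
    na = adj[a] - {b}   # ignore the direct a-b edge, if any
    nb = adj[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```

A value near 1 means the two entities co-occur with almost the same set of other entities, and a value of 0 means they share none.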

Clustering coefficient: Measures how interconnected a node's neighbors are. High clustering around "resistance mechanisms" suggests a tightly coupled biological system.
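The local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected. A short sketch, again assuming an undirected graph stored as a dict of neighbor sets:

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))
```

A node inside a triangle scores 1.0, while the center of a star (neighbors that never connect to each other) scores 0.0.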

PageRank: Identifies the most "important" entities in the graph by counting incoming edges weighted by the importance of the source.
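PageRank can be computed by power iteration. The sketch below uses the conventional damping factor of 0.85 and a fixed iteration count, and ignores relationship types and evidence weights. Those are simplifying assumptions, since the post does not describe RUMI's exact parameters.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on {node: collection of target nodes}.

    Every target must also appear as a key. Rank held by nodes with no
    outgoing edges is spread evenly over all nodes.
    """
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            for w in targets:
                new[w] += damping * rank[v] / len(targets)
        rank = new
    return rank
```

In a three-node graph where a and b both point to c and c points back to a, node c ends up with the highest rank and b, which nothing points to, with the lowest.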

Step 7: Non-Obvious Relationship Discovery

The real value of the knowledge graph is finding relationships that no single paper explicitly states. RUMI uses:

Transitive inference: If A activates B and B inhibits C, then A may indirectly inhibit C. RUMI traces these chains and reports them with decreasing confidence at each hop.
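A sketch of how such chains might be traced: treat "activates" as +1 and "inhibits" as -1, multiply the signs along the path, and multiply the confidences with an extra decay factor for each hop beyond the first. The 0.8 decay, the three-hop limit, and the function name are assumptions for illustration only.

```python
SIGN = {"activates": 1, "inhibits": -1}


def infer_chains(edges, source, max_hops=3, decay=0.8):
    """Return indirect relationships reachable from `source`.

    edges: {node: [(predicate, target, confidence), ...]}
    Each result is (path, inferred predicate, confidence).
    """
    results = []

    def walk(node, sign, conf, path):
        for predicate, target, edge_conf in edges.get(node, []):
            if predicate not in SIGN or target in path:
                continue  # skip unsigned relations and avoid cycles
            new_sign = sign * SIGN[predicate]
            new_path = path + [target]
            hops = len(new_path) - 1
            # Direct edges keep their confidence; each further hop is discounted.
            new_conf = conf * edge_conf * (decay if hops > 1 else 1.0)
            if hops > 1:
                label = "activates" if new_sign > 0 else "inhibits"
                results.append((new_path, label, round(new_conf, 3)))
            if hops < max_hops:
                walk(target, new_sign, new_conf, new_path)

    walk(source, 1, 1.0, [source])
    return results
```

With A activating B at confidence 0.9 and B inhibiting C at 0.8, this reports that A may indirectly inhibit C with confidence 0.9 × 0.8 × 0.8 = 0.576.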

Cluster analysis: Entities that cluster together but aren't directly connected are candidates for novel relationships. In the KRAS analysis, RUMI found that DHX9 and RAC1 appeared in the same cluster but no paper had directly connected them — leading to the DHX9-RAC1-PAK1 hypothesis.

Gap analysis: Missing edges between important nodes suggest unstudied relationships. If two entities are both well-studied but never connected, that's either a genuine non-relationship or a gap in the literature.
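Both cluster analysis and gap analysis come down to listing pairs of nodes that have no direct edge, restricted either to members of one cluster or to well-studied entities. A minimal sketch under those assumptions follows. The function names, the shared-neighbor ranking, and the three-paper threshold are illustrative choices, not RUMI's documented method.

```python
from itertools import combinations


def missing_edges(adj, candidates):
    """Unconnected pairs among `candidates`, ranked by number of shared neighbors.

    adj: {node: set(neighbors)}; candidates: iterable of node names.
    """
    pairs = []
    for a, b in combinations(sorted(candidates), 2):
        if b in adj.get(a, set()) or a in adj.get(b, set()):
            continue  # already directly connected
        shared = len(adj.get(a, set()) & adj.get(b, set()))
        pairs.append((a, b, shared))
    return sorted(pairs, key=lambda p: -p[2])


def well_studied(paper_counts, min_papers=3):
    """Entities mentioned in at least `min_papers` papers."""
    return {node for node, count in paper_counts.items() if count >= min_papers}
```

For cluster analysis, pass the members of one cluster as candidates. For gap analysis, pass well_studied(paper_counts). Pairs with many shared neighbors but no direct edge are the most promising candidates to review by hand.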

Results: The KRAS G12C Knowledge Graph

Final statistics from the analysis:

47 entities extracted across 14 papers
89 relationships identified
6 clusters corresponding to resistance mechanism families
3 high-confidence hypotheses generated from graph analysis
12 additional relationships from external database enrichment

The graph correctly identified PI3K/AKT as the convergence point for multiple resistance mechanisms — a finding that required connecting papers from 4 different research groups who never cited each other.

Limitations

Entity extraction accuracy is ~85% — complex multi-gene names and abbreviations cause errors
Relationship extraction is conservative: many real relationships are missed because the LLM requires an explicit statement in the text
Graph metrics are sensitive to publication bias — well-studied entities get higher centrality regardless of biological importance
Temporal relationships (A happens before B) are not well captured

Try It

git clone https://github.com/subhansh-dev/Rumi
cd Rumi
pip install -e .
playwright install chromium
rumi

Run /discover KRAS G12C resistance mechanisms to see the full pipeline in action.
