Medical Literature Knowledge Graph

#ai #graphql #graphrag #medical

For a while now I’ve been telling anybody who’d listen that knowledge graphs are a good way to provide grounded information to language models. In recent months I’ve been slowly building a graph with a bunch of PubMed papers on cancer research, and finally I’m starting to see results worth talking about.

This conversation demonstrates:

Natural language querying of a medical knowledge graph
GraphQL API for structured entity and relationship retrieval
Semantic search capabilities across biomedical entities
Relationship traversal to discover connections between entities
Provisional vs canonical entity distinction based on standardization status
Usage tracking and confidence scores for entity mentions

What really is interesting about this is that (at least to me, and I am very much not a doctor), it sounds pretty intelligent. Nothing it’s saying is crazy or stupid.

If you want to view the graph and click on nodes and things, go to http://104.131.171.175/graph-viz/.

User: show me the first 15 biomarkers please

Assistant: I'll query for the first 15 biomarkers in the knowledge graph.

{
  biomarkers: entities(filter: { entityType: "biomarker" }, limit: 15) {
    items {
      entityId
      entityType
      name
      status
      confidence
      usageCount
      source
      canonicalUrl
    }
    total
  }
}

First 15 Biomarkers (out of 81 total)

CA-125
- Status: provisional
- Confidence: 0.45, Usage: 2
- Source: PMC12757429
JC-1 fluorescent probe
- Status: provisional
- Confidence: 0.475, Usage: 1
- Source: PMC12757429
BCR Shannon diversity
- Status: provisional
- Confidence: 0.35, Usage: 2
- Source: PMC12757429
...
DDR-IF score
- Status: provisional
- Confidence: 0.475, Usage: 21
- Source: PMC12757875
...
STING activation score
- Status: provisional
- Confidence: 0.475, Usage: 1
- Source: PMC12757875

Observations:

All 15 biomarkers shown are provisional (none have been promoted to canonical status yet)
They come from 2 medical papers (PMC12757429 and PMC12757875)
Usage counts range from 1 to 21 (DDR-IF score being the most used)
Confidence scores range from 0.35 to 0.475
Many relate to cancer research (STING pathway, TCGA cohort, CA-125, etc.)

Query 3: DDR-IF Score Investigation

User: DDR-IF score is discussed (possibly for the first time?) in this paper: https://pubmed.ncbi.nlm.nih.gov/41487593/ - curious that it has no canonical ID

Assistant: Excellent detective work! You're absolutely right. The DDR-IF score has no canonical ID because it's a novel composite biomarker introduced in that very recent paper (PMC12757875, which corresponds to PubMed ID 41487593).

Why DDR-IF score is provisional (not canonical):

Novel Research Contribution: The paper introduces the DDR-IF score as a new composite biomarker that integrates:

DNA Damage Response (DDR) alterations (HRD component)
Immune Infiltration (IF) features (TME/tumor microenvironment)
STING activation (immune signaling)

From the relationships extracted:

Purpose: Predicts PARP-ICB (PARP inhibitor + immune checkpoint blockade) response in ES-SCLC (extensive-stage small cell lung cancer)
- AUROC = 0.81 (strong predictive power)
Clinical utility: Identifies tumors most susceptible to PARP-ICB combination therapy
Cut-off: ≥0.65 defines "DDR-IF-high" status

Why no canonical ID exists:

This is cutting-edge research (2026 paper based on the creation date in properties)
Not yet incorporated into standardized biomarker databases like:
- MeSH (Medical Subject Headings)
- LOINC (Logical Observation Identifiers Names and Codes)
- NCBI Gene (for molecular markers)
May require validation in multiple studies before databases assign canonical identifiers
It's a composite score (not a single molecule/gene), which makes canonicalization more complex

The knowledge graph correctly identified it as provisional with decent usage (21 mentions across 3 papers), showing it's gaining traction but not yet standardized in biomedical ontologies.

Query 4: Understanding STING Activation

User: dumb question, what is STING activation?

Assistant: Not a dumb question at all! STING is actually pretty complex immunology. Let me search the knowledge graph for more details.

STING = Stimulator of Interferon Genes

What it does:

STING is part of your innate immune system - your body's first-line defense against threats. Think of it as an alarm system inside cells.

How it works:

Detection: When DNA appears in the wrong place inside a cell (e.g., from viruses, bacteria, or damaged cancer cells), a sensor called cGAS detects it
Activation: cGAS triggers STING protein, which gets phosphorylated (p-STING in the biomarker list)
Alarm response: Activated STING causes the cell to produce:
- Interferons (signaling proteins that alert the immune system)
- Inflammatory cytokines (recruit immune cells to attack)

Why it matters in cancer:

The double-edged sword:

Good news: Cancer cells with damaged DNA can trigger STING → immune system attacks tumor
Bad news: Chronic activation can lead to "cGAS-STING exhaustion" where the pathway stops working (like a fire alarm that gets ignored)

In the DDR-IF score context:

Tumors with DNA damage (from PARP inhibitors) → release DNA fragments → activate STING
STING activation → recruits immune cells → makes immunotherapy (checkpoint blockade) work better
The DDR-IF score measures STING activation to predict which tumors will respond to the PARP + immunotherapy combination

Clinical relevance:

"STING-desert phenotypes" = tumors with no STING activity (won't respond well to immunotherapy)
"STING agonists" = drugs being developed to artificially activate this pathway

So STING activation is basically your cell's burglar alarm going off when it detects suspicious DNA, calling in the immune system cavalry!