Rory | QIS PROTOCOL

Posted on Apr 13 • Originally published at qisprotocol.com

QIS as the Routing Layer Federated Genomics Has Been Missing: Why CanDIG Nodes Need Outcome Routing

#genomics #federated #distributedsystems #healthcare

There is a problem federated genomics solved in 2019 and a problem it has never touched.

The problem it solved: authorization. Who can query which dataset, under what governance conditions, across which jurisdictions. CanDIG — Canada's federated genomics infrastructure — built an elegant answer. GA4GH standards gave it interoperability. Passport Bearer tokens gave it cross-institutional authentication. The access layer works.

The problem it has never touched: synthesis. Once a CanDIG node runs a query and finds that a particular variant combination is associated with a treatment response in 23 patients, that insight stays at the node. It does not route anywhere. The 13 other CanDIG nodes processing similar patient cohorts have no protocol for learning what the first node discovered without re-centralizing everything that decentralization was designed to avoid.

That gap has a name. It is an outcome routing problem. And Christopher Thomas Trevethan's discovery — the Quadratic Intelligence Swarm (QIS) protocol, filed under 39 provisional patents, June 2025 — is the first architectural answer to it.

What Federated Genomics Actually Built (And What It Didn't)

CanDIG connects 14 Canadian research institutions. Each node holds genomic datasets — tumour sequencing, clinical phenotype data, variant annotations — under its own governance and data residency rules. No central repository. No raw data movement. Researchers authenticate, submit queries, and receive aggregated results.

This is a genuine architectural accomplishment. The access problem was hard. Identity federation across jurisdictions with different IRB structures, privacy law, and institutional politics is still not solved globally. CanDIG solved the Canadian version of it.

But access control is not intelligence synthesis. The CanDIG architecture answers: Can this researcher see this data?

It does not answer: What is the network of CanDIG nodes learning, in aggregate, that any individual node cannot learn alone?

The GA4GH Data Connect standard defines how to query. DRS defines how to retrieve. Neither standard defines how to route what is being discovered — the patterns, the outcome signals, the "treatment X worked in this variant context" insights — from the nodes that found them to the nodes that need them.

The Math That Quantifies the Gap

CanDIG has 14 institutional nodes. Within those 14 nodes, there are N(N-1)/2 = 91 unique synthesis opportunities. Every pair of nodes has knowledge the other doesn't. Every pair could, in principle, teach each other.

In practice, zero of those 91 pathways are active.

When a node at Princess Margaret Cancer Centre finds a somatic variant pattern correlated with immunotherapy response in NSCLC patients, that discovery exists in one node's query logs and nowhere else. The node at UHN DATA — 800 meters away — has no protocol to receive it. The node at BC Cancer has no protocol to request it.

Add the EU Genomic Data Infrastructure (GDI) partners — seven founding member states — and the number of synthesis opportunities scales to hundreds. Every one of those pathways is currently dark.

QIS turns them on. Not by moving data — by routing outcome packets.

How QIS and CanDIG Are Complementary, Not Competing

The single most important thing to understand about QIS in a CanDIG context: QIS routes outcomes, not access.

Here is the distinction made concrete.

CanDIG query path (access layer):

Researcher authenticates with GA4GH Passport
Queries distributed CanDIG nodes for cohort matching criteria
Receives aggregated counts, summary statistics, or variant frequencies
Raw data never leaves the node

QIS outcome routing path (synthesis layer):

A CanDIG node processes an internal cohort analysis
The result — a treatment outcome, a variant association, a response curve — is distilled into a QIS outcome packet (~512 bytes)
The packet receives a semantic fingerprint derived from the clinical/genomic context (variant type, phenotype cluster, treatment class)
The fingerprint maps to a deterministic routing address
The packet deposits at that address — available to any node whose problem fingerprint is semantically similar
A remote node with 12 patients processing the same variant context queries the address and synthesizes locally

At no point does a patient record leave. At no point does raw sequencing data move. The CanDIG governance layer is untouched. The GA4GH access control layer is untouched.

QIS operates at the layer above what those systems touch — it routes what was learned, not what was seen.
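To make the ~512-byte figure concrete, here is a minimal sketch of how such a packet might be serialized. The binary layout, the 112-dimension fingerprint, and the SHA-256 node hash are all assumptions chosen for illustration; none of them is specified in this post. The field names mirror the OutcomePacket fields used in the implementation section further down.

```python
import hashlib
import struct
import time

# Assumed layout (hypothetical, not a published QIS format):
#   uint64 timestamp | float32 delta | float32 confidence
#   | 32-byte node hash | fingerprint as float32 values
FINGERPRINT_DIMS = 112  # assumed size, picked so the packet stays under 512 bytes
PACKET_FORMAT = f"<Qff32s{FINGERPRINT_DIMS}f"  # "<" = little-endian, no padding


def pack_outcome(delta, confidence, node_secret, fingerprint, timestamp=None):
    """Serialize one outcome observation into a fixed-size byte string."""
    if len(fingerprint) != FINGERPRINT_DIMS:
        raise ValueError(f"fingerprint must have {FINGERPRINT_DIMS} values")
    ts = int(time.time()) if timestamp is None else timestamp
    # Hash a node-local secret so the packet carries no institutional name.
    node_hash = hashlib.sha256(node_secret).digest()
    return struct.pack(PACKET_FORMAT, ts, delta, confidence, node_hash, *fingerprint)


# Placeholder values only; 0.18 and 0.4 are not real results.
packet = pack_outcome(
    delta=0.18,
    confidence=0.4,
    node_secret=b"node-local-secret",
    fingerprint=[0.0] * FINGERPRINT_DIMS,
    timestamp=1_700_000_000,
)
print(len(packet))  # 496
```

Nothing in the layout has room for a patient identifier or a sequence read, which is the point: the packet describes an outcome and its context, not the people behind it.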

Why Federated Learning Cannot Fill This Gap

The immediate objection: doesn't federated learning handle this? Nodes train local models, aggregate gradients, improve a global model?

For genomics, federated learning hits three structural problems.

Problem 1: Cohort size requirements. Federated learning requires enough local data to compute a meaningful gradient. A node with 12 patients in a rare variant study cannot contribute to a federated learning round without producing noise that degrades the global model. That node is excluded by architecture. QIS treats a 12-patient outcome observation as a valid outcome packet — the measurement is valid regardless of cohort size.

Problem 2: Model drift across genomic contexts. Genomic contexts are highly heterogeneous. A model trained on BRCA variant pathogenicity across European ancestry populations does not transfer cleanly to a node with South Asian ancestry cohorts. Gradient averaging across heterogeneous contexts dilutes signal from every participating node. QIS routes outcomes to semantically similar contexts — a BRCA2 frameshift variant in a Punjabi ancestry cohort routes to other nodes with similar ancestry markers, not to all BRCA2 nodes indiscriminately.

Problem 3: Communication cost at N(N-1)/2 scale. Federated learning bandwidth scales linearly with model size. A genomic foundation model updating across 14 nodes × multiple rounds × gigabyte-scale gradient tensors is not a research inconvenience; it is a network infrastructure problem. QIS outcome packets are ~512 bytes. Against a 1 GB update, the communication cost per synthesis event is roughly six orders of magnitude lower (about 10^9 bytes versus about 5 × 10^2).

The Semantic Fingerprint in a Genomics Context

The mechanism that makes outcome routing work in genomics is the semantic fingerprint. Every outcome packet needs an address. In QIS, the address is derived from the content of the problem, not an assigned ID.

For a genomic outcome packet, the fingerprint might encode:

import numpy as np

# GenomicOutcome and the encode_* helpers are placeholders for this sketch;
# each encoder is assumed to return a 1-D float array.
def genomic_fingerprint(outcome: GenomicOutcome) -> np.ndarray:
    components = [
        encode_variant_class(outcome.variant_type),      # SNV/indel/CNV/SV
        encode_gene_context(outcome.gene, outcome.exon), # BRCA2 exon 11
        encode_ancestry_cluster(outcome.ancestry_pca),   # PCA-derived cluster
        encode_phenotype_hpo(outcome.phenotype_terms),   # HPO term vector
        encode_treatment_class(outcome.treatment),       # Immunotherapy class
        encode_outcome_type(outcome.outcome_measure),    # Response/survival/toxicity
    ]
    vector = np.concatenate(components)
    return vector / np.linalg.norm(vector)  # unit-length, ~512-byte vector

This fingerprint is not a patient record. It encodes the shape of the problem — what class of genomic context produced what class of outcome — without encoding any individual's data.

Two nodes processing BRCA2 frameshift variants in similar ancestry contexts with similar immunotherapy outcomes will generate similar fingerprints. Their outcome packets route to similar addresses. Each node's local synthesis engine pulls outcome packets from nodes with similar fingerprints and synthesizes locally.
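The encoders in the fingerprint function are left abstract. As a toy illustration of "similar contexts produce similar fingerprints", the sketch below uses one-hot encodings and two ancestry PCA coordinates, then compares fingerprints by cosine similarity. The vocabularies, the PCA values, and the choice of cosine similarity are assumptions for illustration, not part of the QIS specification.

```python
import math

# Hypothetical vocabularies; a real encoder would be far richer.
VARIANT_CLASSES = ["SNV", "indel", "CNV", "SV"]
TREATMENT_CLASSES = ["immunotherapy", "chemotherapy", "targeted"]
OUTCOME_TYPES = ["response", "survival", "toxicity"]


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def toy_fingerprint(variant_class, treatment_class, outcome_type, ancestry_pca):
    """Concatenate the encoded components and scale to unit length."""
    raw = (
        one_hot(variant_class, VARIANT_CLASSES)
        + one_hot(treatment_class, TREATMENT_CLASSES)
        + one_hot(outcome_type, OUTCOME_TYPES)
        + list(ancestry_pca)
    )
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def cosine_similarity(a, b):
    """Both inputs are unit length, so the dot product is the cosine."""
    return sum(x * y for x, y in zip(a, b))


# Placeholder contexts: A and B match closely, C differs on every axis.
node_a = toy_fingerprint("indel", "immunotherapy", "response", (0.30, -0.10))
node_b = toy_fingerprint("indel", "immunotherapy", "response", (0.28, -0.12))
node_c = toy_fingerprint("CNV", "chemotherapy", "toxicity", (-0.40, 0.50))

print(round(cosine_similarity(node_a, node_b), 3))  # close to 1
print(round(cosine_similarity(node_a, node_c), 3))  # close to 0
```

Nodes A and B would land near each other in fingerprint space; node C would not. No field in any of the three vectors refers to an individual patient.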

The routing mechanism can be anything that supports semantic similarity lookup: DHT-based routing (O(log N) hops per lookup, fully decentralized, no central coordinator), vector similarity search (typically sublinear with an approximate nearest-neighbor index), or any other mechanism that maps fingerprint similarity to routing addresses. The quadratic scaling comes from the loop and the semantic addressing, not the transport layer.
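One way to turn a fingerprint into a deterministic address is random-hyperplane locality-sensitive hashing (SimHash). This is a substitute technique picked for illustration, not something the QIS material names. The sketch assumes every node shares the same seed, so every node derives the same hyperplanes; the dimension and bit counts are arbitrary.

```python
import random


def make_hyperplanes(dims, n_bits, seed=0):
    """Generate n_bits random hyperplanes; a shared seed makes them identical on every node."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n_bits)]


def routing_address(fingerprint, hyperplanes):
    """One bit per hyperplane: which side of the plane the fingerprint falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, fingerprint))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two addresses."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dims=4, n_bits=16, seed=42)
fingerprint = [0.5, 0.5, 0.5, 0.5]
address = routing_address(fingerprint, planes)
print(f"{address:04x}")
```

Fingerprints separated by a small angle tend to agree on most bits, so a small Hamming distance between addresses stands in for semantic similarity. The resulting integer could serve as a DHT key or as a bucket ID in a vector store.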

What the GDI Connection Adds

CanDIG is not isolated. The GA4GH standards it implements are the same standards the EU Genomic Data Infrastructure is building on. GDI is connecting genomic biobanks across seven founding EU member states — 500,000+ genomes, multiple ancestry populations, multiple governance jurisdictions.

A QIS outcome routing layer implemented over GA4GH standards is not CanDIG-specific. It is a layer that any GA4GH-compliant node can implement — CanDIG in Canada, GDI in Europe, RCSI in Ireland, OHDSI nodes globally — without modifying existing access control infrastructure.

The semantic fingerprint speaks the same language as GA4GH phenopackets. The routing addresses are derivable from GA4GH Data Connect query schemas. The outcome packet structure is compatible with GA4GH DRS metadata.

This means the N(N-1)/2 synthesis opportunities do not stop at 14 Canadian institutions. They extend to every GA4GH-compliant node worldwide that chooses to route outcomes as well as access.

At 100 globally connected genomic nodes: 4,950 synthesis pathways. At 1,000 nodes: 499,500 pathways. At current CanDIG + GDI scale: more than 200 pathways, zero of which are active.
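The pathway counts quoted in this post all come from the same pair-counting formula, and a few lines of Python reproduce them. The 21-node case assumes each of the seven GDI founding member states counts as a single node alongside the 14 CanDIG nodes; that simplification is mine, used only to check the "more than 200" figure.

```python
def synthesis_pathways(n_nodes: int) -> int:
    """Number of unordered node pairs: N(N-1)/2."""
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    return n_nodes * (n_nodes - 1) // 2


# 14 = CanDIG; 21 = CanDIG + 7 GDI founding states (assumed one node each).
for n in (14, 21, 100, 1000):
    print(f"{n:>5} nodes -> {synthesis_pathways(n):>7} pathways")
```

This gives 91, 210, 4,950, and 499,500 pathways respectively.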

The Rare Variant Problem Is the Critical Use Case

The argument for QIS in federated genomics is strongest where federated learning is structurally unable to help: rare variant research.

Rare variants — by definition — appear in small cohorts. A CanDIG node studying a specific BRCA2 founder mutation that is enriched in a particular Canadian Indigenous population may have 8 patients. Another node studying the same variant in a Quebec founder cohort may have 11. A third node at a UK Biobank partner studying the variant in a British Columbian ancestry cohort may have 7.

No federated learning framework can meaningfully aggregate 8 + 11 + 7 = 26 patients across three nodes into a gradient that improves a model. The numbers are too small. The cohorts are too heterogeneous. The variant context is too specific.

QIS can. Each of the three nodes emits an outcome packet. The fingerprints are similar — same variant class, overlapping ancestry markers, same phenotype context. The packets route to each other. Each node's local synthesis engine integrates what the other two found without any raw data movement.

26 patients producing cross-institutional synthesis intelligence. The architecture handles what federated learning cannot.
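What "synthesizes locally" means in practice is not spelled out here. One simple possibility, sketched below, is a cohort-size-weighted pooling of a reported response rate. The cohort sizes (8, 11, 7) come from the example above; the responder counts are invented placeholders, not data, and the pooling rule itself is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortOutcome:
    cohort_size: int      # patients behind the observation
    response_rate: float  # fraction of the cohort that responded


def pooled_response_rate(outcomes):
    """Weight each node's rate by its cohort size (a fixed-effect style pooling)."""
    total = sum(o.cohort_size for o in outcomes)
    if total == 0:
        raise ValueError("no patients behind these outcomes")
    return sum(o.cohort_size * o.response_rate for o in outcomes) / total


# Responder counts (3, 5, 2) are made up purely to exercise the function.
packets = [
    CohortOutcome(cohort_size=8, response_rate=3 / 8),
    CohortOutcome(cohort_size=11, response_rate=5 / 11),
    CohortOutcome(cohort_size=7, response_rate=2 / 7),
]

print(sum(p.cohort_size for p in packets))      # 26
print(round(pooled_response_rate(packets), 3))  # 0.385
```

Each node would run this on packets it pulled, combining its own observation with the other two. Only three small summary records cross institutional boundaries.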

The Implementation Path

For a CanDIG node administrator, implementing QIS outcome routing does not require replacing anything in the existing stack. It is an additive layer.

# Existing CanDIG query execution (unchanged)
def run_cohort_query(query: GA4GHQuery) -> CohortResult:
    result = candig_node.execute(query)
    return result

# QIS outcome routing layer (additive)
def route_outcome(result: CohortResult, context: GenomicContext):
    fingerprint = genomic_fingerprint(context)
    # Map the fingerprint to its deterministic routing address once.
    # fingerprint_to_address is a placeholder for whatever mapping the router uses.
    address = fingerprint_to_address(fingerprint)

    packet = OutcomePacket(
        delta=result.outcome_signal,        # What changed
        fingerprint=fingerprint,
        timestamp=now(),
        node_id=anonymous_node_hash(),      # No PII
        confidence=result.cohort_size_weight,
    )
    router.deposit(packet, address=address)

    # Pull outcomes from similar nodes
    similar_outcomes = router.query(
        address=address,
        max_results=50,
        max_age_days=90,
    )
    return synthesize_locally(similar_outcomes)

The CanDIG access layer runs unchanged. The GA4GH passport infrastructure is untouched. The QIS layer intercepts query results at the output, distills them into outcome packets, routes them to semantically addressed buckets, and pulls back what similar nodes have found.
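The router in the snippet above is left undefined. For local experimentation, an in-memory stand-in with the same deposit/query shape is enough to exercise the loop. The class below is a hypothetical sketch: the StoredPacket fields, the exclude_node option, and the newest-first ordering are assumptions, and a real deployment would sit on a DHT or vector index rather than a dictionary.

```python
import time
from collections import defaultdict
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class StoredPacket:
    delta: float       # outcome signal (placeholder values below)
    confidence: float  # e.g. a cohort-size weight
    node_id: str       # anonymous node hash, no PII
    timestamp: float   # seconds since the epoch


class InMemoryRouter:
    """Toy bucket store: exact-match addresses, no networking."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def deposit(self, packet, address):
        self._buckets[address].append(packet)

    def query(self, address, max_results=50, max_age_days=90,
              exclude_node=None, now=None):
        now = time.time() if now is None else now
        cutoff = now - max_age_days * SECONDS_PER_DAY
        fresh = [
            p for p in self._buckets.get(address, [])
            if p.timestamp >= cutoff and p.node_id != exclude_node
        ]
        fresh.sort(key=lambda p: p.timestamp, reverse=True)  # newest first
        return fresh[:max_results]


router = InMemoryRouter()
now = 1_700_000_000.0
address = 0x2A5F  # placeholder routing address
router.deposit(StoredPacket(0.12, 0.3, "node-a", now - 1 * SECONDS_PER_DAY), address)
router.deposit(StoredPacket(0.20, 0.4, "node-b", now - 10 * SECONDS_PER_DAY), address)
router.deposit(StoredPacket(0.05, 0.2, "node-c", now - 120 * SECONDS_PER_DAY), address)

recent = router.query(address, now=now)
print([p.node_id for p in recent])  # ['node-a', 'node-b']
```

The 120-day-old packet falls outside the 90-day window and is dropped. Passing exclude_node keeps a node from re-reading its own deposit.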

Why This Matters Now

OHDSI Rotterdam runs April 18-20. The genomics informatics community — OHDSI, GA4GH, CanDIG, GDI — overlaps substantially. The question that will not be answered by anything on the conference agenda: how do we route what our nodes are learning, not just what our nodes can access?

That question has an answer. Christopher Thomas Trevethan discovered it on June 16, 2025, and filed 39 provisional patents on the architecture. The discovery is not a new transport protocol. It is not a new access control system. It is the architectural insight that intelligence scales quadratically — N(N-1)/2 — when you route pre-distilled outcome packets to semantically addressed locations instead of centralizing raw data.

Federated genomics built the access layer. QIS Protocol is the synthesis layer that makes the access layer matter.

QIS Protocol was discovered by Christopher Thomas Trevethan on June 16, 2025. 39 provisional patents filed. The protocol is free for research, academic, and humanitarian use. Commercial licensing funds global deployment to underserved communities. More at qisprotocol.com.

DEV Community