For federated learning practitioners, distributed systems architects, and researchers evaluating what sits between FL training rounds in health data infrastructure.
Two Protocols, Two Layers
TensorFlow Federated (TFF) is Google's open-source framework for machine learning on decentralized data. Since its introduction in March 2019, TFF has become the reference implementation for federated learning research — providing both a high-level FL API (tff.learning) and a low-level Federated Core for expressing custom algorithms.
The Quadratic Intelligence Swarm (QIS) protocol, discovered by Christopher Thomas Trevethan on June 16, 2025, is not a federated learning framework. It is a routing protocol for validated outcomes that operates beneath the FL layer.
These are not competing systems. They solve different problems at different layers of the distributed intelligence stack. This article describes exactly where each operates, what each cannot do, and why the gap between them matters for anyone building health data infrastructure.
What TensorFlow Federated Does
TFF implements the federated averaging paradigm introduced by McMahan et al. (AISTATS 2017). The protocol:
- Server selects clients for a training round (typically hundreds per round in Google's production system — Bonawitz et al., MLSys 2019)
- Server distributes the current global model to selected clients
- Clients train locally for E epochs on their private data
- Clients send model updates (gradient deltas or weight updates) back to the server
- Server aggregates updates via weighted averaging (FedAvg) and updates the global model
- Repeat for T communication rounds
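The round structure above can be sketched in a few lines of NumPy. This is a minimal illustration with a least-squares objective standing in for the real model; the data, learning rate, and epoch count are invented for demonstration:

```python
import numpy as np

def local_train(w, X, y, lr=0.05, epochs=5):
    """Client step: E epochs of gradient descent on local least-squares data."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, lr=0.05, epochs=5):
    """One communication round: distribute the global model, train locally,
    then average the returned models weighted by local dataset size."""
    local_models, sizes = [], []
    for X, y in clients:
        local_models.append(local_train(w_global, X, y, lr, epochs))
        sizes.append(len(y))
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(p * w_k for p, w_k in zip(weights, local_models))
```

Each round moves every selected client's full parameter vector in both directions, which is the O(d) communication cost discussed below.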
The unit of exchange is a model update — a vector of floating-point numbers with the same dimensionality as the model parameters. For a BERT-scale model (~340M parameters), each client uploads approximately 1.27 GiB per round at float32 precision. McMahan et al. demonstrated that FedAvg reduces communication by 10–100x compared to naive federated SGD by allowing multiple local epochs before synchronization. Subsequent work on quantization, sparsification, and knowledge distillation has reduced this further.
TFF provides differential privacy through tff.aggregators.DifferentiallyPrivateFactory with adaptive gradient clipping and calibrated Gaussian noise, and secure aggregation (Bonawitz et al., CCS 2017) through cryptographic masking so the server sees only the aggregate sum, never individual updates.
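The clip-and-noise recipe is straightforward to illustrate outside TFF. The sketch below is not TFF's API; it shows the central Gaussian mechanism with a fixed clip norm, whereas TFF's factory additionally adapts the clip norm across rounds (the constants here are illustrative):

```python
import numpy as np

def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip each client update to an L2 ball of radius clip_norm, sum,
    add Gaussian noise calibrated to the clip norm, and average."""
    rng = rng if rng is not None else np.random.default_rng()
    clipped = [u * min(1.0, clip_norm / max(np.linalg.norm(u), 1e-12))
               for u in updates]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(updates)
```

Because every update is clipped before summation, one client's contribution to the aggregate is bounded, which is what makes the noise scale (and hence the (ε, δ) accounting) independent of any individual's data.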
This is genuine, well-engineered infrastructure. It works at Google scale — Gboard next-word prediction has been trained with FedAvg and formal (ε, δ)-differential privacy guarantees in production (Xu et al., ACL 2023).
What TensorFlow Federated Does Not Do
These are structural boundaries of the FL paradigm, not TFF-specific limitations. They apply equally to any framework implementing federated averaging:
1. No learning between rounds
FL operates in discrete synchronous rounds. Between rounds, no information propagates across the network. A hospital that observes a critical adverse drug reaction at 2 AM contributes that signal only when the server initiates the next training round — which may be hours or days later.
The network is silent between rounds. Intelligence does not compound continuously.
2. No semantic routing
FedAvg aggregates all participating clients uniformly (or by dataset size weighting). It cannot route a specific signal to the subset of nodes holding the most relevant data. When a node managing rare EGFR-mutant non-small cell lung cancer patients contributes gradients, those gradients are averaged with all participating NSCLC nodes — including nodes managing entirely different mutation profiles. The rare signal is diluted into the global average.
There is no mechanism to say: "This outcome is relevant specifically to nodes managing patients with this mutation, this drug, this population profile."
3. No small-cohort participation
Google's production FL system selects "typically a few hundred" devices per round, initially selecting 130% of the target count to compensate for dropout (Bonawitz et al., 2019). Secure aggregation requires a minimum group size k — if fewer than k devices complete the round, the entire round is discarded.
Gradient averaging over 8 patients produces noise, not signal. The variance of a stochastic gradient estimate scales as σ²/n_k, where n_k is the local sample count. For a site with 3 rare disease patients, the gradient variance exceeds the signal magnitude. FL's convergence theory assumes large participant pools.
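The σ²/n_k scaling is easy to check empirically. A small Monte Carlo sketch, with the per-patient gradient modeled as a unit-variance normal purely for illustration:

```python
import numpy as np

def mean_gradient_variance(n_patients, sigma=1.0, trials=20_000, seed=0):
    """Empirical variance of an n-patient averaged gradient estimate.
    Theory predicts sigma**2 / n_patients."""
    rng = np.random.default_rng(seed)
    # Each trial averages n per-patient gradient samples
    means = rng.normal(0.0, sigma, size=(trials, n_patients)).mean(axis=1)
    return float(means.var())
```

At n=3 the variance is roughly 100x the variance at n=300: a site with three patients contributes an estimate whose noise dominates any plausible signal.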
Kairouz et al. (2021) acknowledge this explicitly in "Advances and Open Problems in Federated Learning" — the 210-page survey that catalogues FL's open problems including non-IID convergence, the privacy-utility tradeoff, and the simulation-to-production gap.
4. No outcome routing
FL routes model parameters. It does not route treatment outcomes, clinical decisions, or patient-level validated results. The unit of exchange is always a gradient or weight vector — never "patients with condition X on drug Y showed outcome Z with confidence interval [a, b]."
This is not a criticism. FL was designed to train models, not to route outcomes. But it means that after a federated training round completes and produces a global model, the validated outcomes that emerge from applying that model at each site have no path back into the network intelligence. They sit at each site, unused by peers, until a researcher designs the next study.
Where QIS Operates
QIS is a routing protocol for the layer that FL leaves empty — the space between training rounds, between studies, between episodic analyses.
The protocol:
Distill: After any validated analysis completes at a node (including an FL-trained model's predictions), the outcome is distilled into a compact packet (~512 bytes). The packet contains derived statistics only — an outcome delta, a confidence interval, a cohort descriptor. No raw patient data. No model parameters.
Fingerprint: The packet is addressed using standardized clinical vocabulary (SNOMED CT, RxNorm, ICD-10, MedDRA) as the semantic coordinate. A treatment outcome for SNOMED concept 44054006 (Type 2 diabetes mellitus) on RxNorm concept 860975 (metformin 500mg) has a deterministic address. Every node using the same vocabulary computes the same address independently.
Route: The packet is deposited at its semantic address. Any node managing a similar patient population queries that address and retrieves relevant outcomes from peer sites. Routing cost is O(log N) or better depending on transport — O(log N) for DHT lookup, O(1) for database index or pub/sub.
Synthesize locally: Each node aggregates incoming packets on its own infrastructure. No central aggregator. The synthesis is weighted by local context — population demographics, clinical setting, confidence intervals.
Loop: The synthesis itself becomes a new outcome that can be distilled and routed. Intelligence compounds continuously. Each cycle through the loop enriches the network.
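A sketch of the distill and fingerprint steps. Everything below is an assumption for illustration: the article specifies deterministic vocabulary-based addressing and ~512-byte statistics-only packets, but the hash scheme, field names, and packet layout are invented here:

```python
import hashlib
import json

def semantic_address(snomed_code, rxnorm_code):
    """Deterministic address from standardized vocabulary codes: any node
    hashing the same (condition, drug) pair computes the same address.
    (Hypothetical scheme; the real derivation is not specified in this text.)"""
    key = f"SNOMED:{snomed_code}|RXNORM:{rxnorm_code}".encode()
    return hashlib.sha256(key).hexdigest()

def distill_packet(snomed_code, rxnorm_code, outcome_delta, ci, cohort_n):
    """Pack derived statistics only: no raw patient data, no model weights.
    Field names are illustrative placeholders."""
    return json.dumps({
        "addr": semantic_address(snomed_code, rxnorm_code),
        "delta": outcome_delta,   # change in the outcome measure
        "ci": ci,                 # confidence interval [low, high]
        "n": cohort_n,            # cohort size descriptor
    }).encode()

# The Type 2 diabetes / metformin example from the text
pkt = distill_packet("44054006", "860975", -0.12, [-0.20, -0.04], 38)
```

Two sites that never communicate directly still derive the same address for the same condition-drug pair, which is what lets deposits and queries meet.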
The unit of exchange is an outcome packet — approximately 512 bytes, fixed-size regardless of the originating dataset volume. Compare this to FL's per-round communication of O(d) floats where d is the number of model parameters.
| Dimension | TensorFlow Federated | QIS Outcome Routing |
|---|---|---|
| Unit of exchange | Model gradients/weights (O(d) floats, d = model params) | Outcome packets (~512 bytes, fixed) |
| Communication per round/query | ~1.27 GiB per client (BERT-scale) | 512 bytes per packet |
| Communication cost scaling | O(d × N) per round | O(N log N) total, or O(N) with index/pub/sub |
| Aggregation model | Central server averages all updates | Peer-to-peer, no central aggregator |
| Minimum participants per round | Hundreds (production); k minimum for SecAgg | No minimum — any validated outcome routes |
| Privacy mechanism | DP noise + SecAgg (degrades model utility) | No raw data in packets by architecture |
| Timing | Discrete synchronous rounds | Continuous — deposits on every validated outcome |
| Semantic routing | No — all clients aggregated uniformly | Yes — packets route to semantically matched nodes |
| What compounds | Model accuracy (per round) | Network intelligence (per outcome, continuously) |
| Between-round learning | None — network is silent | Active — outcomes route continuously |
| Transport | gRPC (TFF-native) | Transport-agnostic: DHT, REST, pub/sub, database |
The Complementary Architecture
Consider a concrete deployment: a network of 48 hospitals running a federated model for drug safety monitoring.
With TFF alone:
The server initiates a training round. 48 hospitals train a classification model on their local adverse event data. Model updates flow to the server. The server aggregates. A global model is produced. The model is deployed at each site to classify incoming events.
Between training rounds: nothing. Hospital A observes a novel cardiac signal in 3 patients on Drug X. The signal sits at Hospital A until the next federated round. Hospital B, managing similar patients on the same drug, does not learn from Hospital A's observation for hours or days.
With TFF + QIS:
Same federated training round. Same model production and deployment. But now, when Hospital A's deployed model classifies the novel cardiac signal, the validated outcome is distilled into a 512-byte packet: MedDRA preferred term (cardiac toxicity) × drug code (Drug X) × outcome severity × population descriptor → semantic address.
The packet is deposited at that address. Hospital B, running continuous queries on the same drug-condition address, retrieves the packet within seconds. Hospital B's local synthesis updates its risk assessment for Drug X patients before the next FL round even begins.
The FL layer trains the model. The QIS layer routes the model's validated outcomes in real time. They operate at different layers of the stack:
Application Layer: Clinical decision support, safety alerts
Synthesis Layer: QIS Outcome Routing (continuous, between FL rounds) ← NEW
Learning Layer: TensorFlow Federated (episodic model training)
Data Layer: Local patient records (OMOP CDM, FHIR, proprietary)
Transport Layer: gRPC (FL) + DHT/REST/pub/sub (QIS)
The Communication Cost Arithmetic
FL communication cost per round for N clients with a model of d parameters:
C_FL = N × d × sizeof(float)
For N=48 hospitals, d=340M parameters (BERT-scale), float32:
C_FL = 48 × 340,000,000 × 4 bytes = 65.28 GB (60.8 GiB) per round
Even with McMahan et al.'s 10–100x compression: 0.6–6.1 GiB per round.
QIS communication cost for the same 48 hospitals, each depositing one outcome packet per validated analysis:
C_QIS = 48 × 512 bytes = 24 KiB (24,576 bytes) per cycle
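The arithmetic in a few lines of Python (binary units, same constants as above):

```python
N = 48                    # hospitals
D = 340_000_000           # BERT-scale parameter count
FLOAT32 = 4               # bytes per parameter
PACKET = 512              # bytes per outcome packet

c_fl = N * D * FLOAT32    # one FL round: every client uploads a full model
c_qis = N * PACKET        # one QIS cycle: one packet per hospital

print(f"FL round:  {c_fl / 2**30:.1f} GiB")   # 60.8 GiB
print(f"QIS cycle: {c_qis / 2**10:.1f} KiB")  # 24.0 KiB
print(f"ratio:     {c_fl // c_qis:,}x")       # 2,656,250x
```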
This is not an apples-to-apples comparison — FL moves model intelligence, QIS moves outcome intelligence. They solve different problems. But the communication cost difference — six orders of magnitude — explains why QIS can operate continuously while FL operates in discrete rounds.
QIS does not replace the FL round. It fills the silence between rounds with continuous outcome intelligence at negligible communication cost.
The Rare Disease Argument
Kairouz et al. (2021) identify statistical heterogeneity (non-IID data across clients) as an open problem. For rare diseases, this is not an edge case: it is the defining condition.
A rare disease network of 30 centers, each managing 2–15 patients:
FL: Gradient variance σ²/n_k at n_k=3 patients makes local updates unreliable. FedAvg may diverge. Secure aggregation requires minimum group sizes that many centers cannot meet. The rare disease network is architecturally excluded from FL.
QIS: Each center distills its validated treatment outcome into a 512-byte packet regardless of cohort size. A center with 2 patients contributes 2 valid outcome observations. The synthesis across 30 centers produces a distributed case series of 60+ observations — built from the protocol, not from a researcher designing a study.
FL requires statistical mass at each node to produce stable gradients. QIS requires only a validated outcome. For rare diseases, where the most informationally valuable data sits at the smallest sites, this architectural difference determines whether those sites participate or are excluded.
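One plausible local synthesis for small-cohort packets is fixed-effect inverse-variance pooling. This weighting scheme is an illustration, not something the protocol text specifies, and the numbers are invented:

```python
def pool_outcomes(packets):
    """Fixed-effect meta-analytic pooling: weight each site's outcome delta
    by the inverse of its reported variance, so every cohort, however small,
    contributes in proportion to its information content."""
    num = den = 0.0
    for delta, variance in packets:
        w = 1.0 / variance
        num += w * delta
        den += w
    return num / den, 1.0 / den   # pooled delta, pooled variance

# Three small sites (2, 3, and 5 patients): no single site is conclusive,
# but the pooled variance is smaller than any individual site's
sites = [(-0.10, 0.50), (-0.18, 0.33), (-0.12, 0.20)]
pooled_delta, pooled_var = pool_outcomes(sites)
```

The pooled result is itself a derived statistic, so it can be distilled and routed again, which is the loop the protocol describes.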
What This Means for FL Practitioners
If you are building federated learning infrastructure — whether on TFF, PySyft, NVIDIA FLARE, or a custom framework — the question QIS raises is not "should I replace FL with outcome routing?" It is:
What happens to the validated outcomes your FL-trained models generate at each site between training rounds?
If those outcomes sit unused until the next study: you have a synthesis gap. The intelligence your FL pipeline generates at 48 sites is not compounding across sites in real time.
QIS fills that gap. It does not modify the FL training loop. It does not touch model parameters. It operates on the outputs of your deployed models — the validated predictions, the classified events, the treatment outcomes — and routes them to peer sites via semantic addressing so the next FL round starts with enriched context at every node.
FL trains the model. QIS routes what the model learns. Together, they close the loop that neither closes alone.
The Discovery
Christopher Thomas Trevethan discovered the Quadratic Intelligence Swarm protocol on June 16, 2025. The breakthrough is the complete architecture, not any single component: the loop that enables real-time quadratic intelligence scaling without compute explosion. 39 provisional patents have been filed. Humanitarian licensing ensures the protocol is free forever for nonprofits, research institutions, and educational use.
For FL practitioners: the QIS protocol specification, the Yao communication complexity rebuttal, and the QIS glossary are published.
References cited: McMahan et al. (AISTATS 2017), Bonawitz et al. (MLSys 2019, CCS 2017), Kairouz et al. (Foundations and Trends in ML 2021), Li et al. (MLSys 2020), Xu et al. (ACL 2023).
This is part of an ongoing series on QIS — the Quadratic Intelligence Swarm protocol — documenting every domain where distributed outcome routing closes a synthesis gap that existing infrastructure cannot close.