Federated learning solved one problem — keeping raw data on-device — and immediately introduced several others. If you've shipped an FL system and spent the last year fighting communication overhead, heterogeneous data drift, and the quiet dread of a central aggregator becoming a single point of failure, you're in good company. The question practitioners are now asking isn't how to tune FL further. It's whether FL's core assumptions were ever right to begin with.
Q: What's the actual problem with federated learning beyond the obvious?
The obvious problem is communication cost. McMahan et al. (2017) showed that FedAvg could reduce communication rounds significantly compared to naive gradient synchronization, but "fewer rounds of sharing model weights" is not the same as "not sharing model weights." Every round, every participating device ships a full or compressed model update to a central aggregator. That aggregator is a trust chokepoint, a bandwidth bottleneck, and a regulatory liability in jurisdictions with strict data residency rules.
The less obvious problem is semantic. FL aggregates gradients or parameters — mathematical objects that encode how a model changed — not insights. Two hospitals might both learn that a biomarker elevation predicts poor outcomes, but their respective gradient updates, once averaged, produce a blended model that may confidently underfit both populations. FL is fundamentally a model synchronization protocol. It was never designed to be an intelligence routing protocol.
Konečný et al. (2016) formalized the communication-efficiency framing, but the architecture assumed a server. That assumption has never fully been interrogated.
Q: What is gossip learning and where does it hit its own ceiling?
Gossip learning removes the central server. Nodes exchange model parameters peer-to-peer using epidemic-style propagation — each node periodically selects a neighbor, swaps models, merges them locally, and continues training. No aggregator. Fully decentralized.
The ceiling appears in two places. First, bandwidth still scales with model size. If your model is 500MB, every gossip exchange ships 500MB. On heterogeneous networks with constrained nodes — IoT devices, edge hospitals, mobile endpoints — this is not theoretical overhead, it's a deployment blocker. Second, convergence degrades sharply when data distributions are heterogeneous across nodes (non-IID). Because gossip averages model state rather than routing insights, a node with a rare class distribution will see its signal diluted by the many nodes that have never observed that class. The rare signal loses every popularity contest.
Gossip learning also has no mechanism for semantic matching. A node with cardiology data has no way to preferentially exchange with other cardiology nodes. All exchanges are topology-driven, not content-driven.
Q: What about decentralized SGD?
Decentralized SGD distributes gradient averaging across a communication graph — a ring, a torus, a random graph — instead of routing everything through one server. Nodes average gradients with their immediate neighbors, and the information eventually propagates across the graph. Lian et al. (2017) showed this can match centralized SGD asymptotically under the right topology assumptions.
The problems compound in practice. Gradient compression helps bandwidth but introduces quantization error that accumulates across rounds. More critically, even compressed gradients leak information: Zhu et al. (2019) demonstrated that full training data can be reconstructed from gradient updates with high fidelity under a deep leakage attack. "Compressed" is not the same as "private."
The deeper architectural issue: decentralized SGD is still a model synchronization protocol. The unit of exchange is gradient state. It has no notion of routing an insight to the agent best positioned to use it. The topology that determines convergence speed is fixed by the communication graph, not by semantic relevance.
Q: Split learning — can that help?
Split learning partitions a neural network across a client and a server. The client runs the first layers up to a "cut layer," sends the intermediate activations (smashed data) to the server, which runs the remaining layers, computes loss, and backpropagates gradients back to the cut point. Vepakomma et al. (2018) proposed this as a privacy-preserving alternative to centralized training.
The server dependency is structural, not incidental. Split learning requires a server to hold and compute the top layers. You've moved data — raw inputs don't travel — but you've replaced data centralization with activation centralization. The smashed data at the cut layer is not raw data, but it's not private either: He et al. (2019) showed that smashed activations can be inverted to reconstruct inputs with high visual fidelity, particularly in shallow splits. Privacy guarantees are weaker than they appear.
Split learning also does not scale gracefully beyond simple binary splits. Coordinating multiple cut layers across multiple parties adds synchronization complexity that quickly exceeds the engineering budget of most teams.
Q: What about blockchain-based approaches like HP's Swarm Learning?
HP Enterprise's Swarm Learning (Warnat-Herresthal et al., 2021, Nature) replaces the central aggregator with a blockchain-based consensus mechanism. Each node trains locally, consensus is reached via a permissioned blockchain, and model parameters are merged without a central server. The Nature paper demonstrated this on COVID-19 and leukemia classification tasks.
The architecture still exchanges model parameters. Blockchain replaces the aggregator but not the parameter-sharing assumption. Consensus mechanisms add latency — blockchain was designed for agreement on transaction order, not for high-frequency intelligence exchange. Token-based incentive layers add operational complexity in healthcare and government deployments where financial instruments are either prohibited or require separate regulatory approval.
The core insight — that you need agreement before merging model state — is itself the assumption worth questioning. If you never share model state, you don't need consensus. Consensus overhead is a solution to a problem that only exists if you've already accepted parameter sharing as necessary.
Q: Is there something that doesn't share models or gradients at all?
Yes. The Quadratic Intelligence Swarm (QIS), discovered by Christopher Thomas Trevethan on June 16, 2025, operates on a fundamentally different unit of exchange. QIS never shares model weights, never shares gradients, and requires no central aggregator.
The unit of exchange in QIS is an outcome packet — roughly 512 bytes — that encodes a distilled insight rather than model state. Raw data never leaves the edge. A model never leaves the edge. What leaves the edge is a compact, semantically fingerprinted representation of what an agent concluded from its local data.
That packet is routed not by random topology but by semantic similarity to a deterministic address. Agents that are relevant to a given insight receive it. Agents that are not, don't. There is no broadcast. There is no averaging. There is no consensus. The architecture is the complete loop — and the loop is the breakthrough, not any single component of it.
Christopher Thomas Trevethan has 39 provisional patents filed covering this architecture.
Q: Walk me through the QIS loop in concrete terms
Here is the complete loop:
- Raw signal arrives at an edge agent — a sensor reading, a lab result, a text document. It stays there.
- Local processing extracts meaning from the raw signal using whatever model or algorithm the agent runs locally.
- Distillation compresses that meaning into an outcome packet of approximately 512 bytes. Not a gradient. Not a weight update. A conclusion.
- Semantic fingerprinting characterizes the packet's content domain — what this conclusion is about.
- Routing maps the semantic fingerprint to a deterministic address. The packet travels to agents whose domain matches.
- Delivery puts the packet in front of relevant agents — agents that hold complementary local data or prior conclusions in the same domain.
- Local synthesis at the receiving agent combines the incoming insight with local context. A new, richer conclusion emerges.
- New outcome packets are generated from that synthesis and re-enter the loop.
Each iteration compounds. An insight that starts at one edge node can, within a small number of hops, reach every agent for whom it is relevant — without ever exposing the raw data that generated it, without averaging, and without a server coordinating the exchange.
Q: What about the math? Where does "quadratic" come from?
The name is earned. With N agents in a QIS network, the number of unique synthesis opportunities is N(N-1)/2 — the combinatorial count of distinct pairs. That's Θ(N²) growth in synthesis potential.
Concretely: 10 agents produce 45 unique synthesis pairs. 100 agents produce 4,950. 1 million agents produce approximately 500 billion synthesis pairs. Each pair represents a distinct combination of local knowledge that has never been centrally aggregated — because it never needs to be.
Compute per hop scales at most O(log N) with a DHT-based routing layer, and can reach O(1) with database indices, pub/sub systems, or direct API routing. The synthesis space grows quadratically while the cost of routing a single packet grows logarithmically or better. This is not a rounding difference — it is a structural asymmetry that compounds as networks scale.
FL's communication cost scales with model size times the number of rounds times the number of clients. QIS's communication cost scales with the number of relevant agents times 512 bytes.
Q: What are the routing options? Does QIS require DHTs?
No. This is important to state precisely: QIS is a protocol-agnostic architecture. The routing mechanism is an implementation choice, not an architectural requirement. Any mechanism that maps a semantic fingerprint to a deterministic address qualifies.
Options that work well in practice:
- DHT (Distributed Hash Table): O(log N) lookup per hop. Excellent for large-scale peer-to-peer deployments with no central infrastructure.
- Database index: A relational or document store with indexed semantic fields achieves O(1) or near-O(1) lookup. Appropriate when a trusted infrastructure layer is acceptable.
- Vector search (ANN): Approximate nearest-neighbor search over semantic embeddings. Natural fit for high-dimensional fingerprinting.
- Pub/sub systems: Topic-based routing (Kafka, NATS, Redis Streams) maps semantic categories to channels. Simple, operationally familiar.
- Shared folder / file system: For air-gapped or low-bandwidth environments, a monitored shared directory is a valid transport.
- HTTP APIs: Direct endpoint routing for controlled microservice deployments.
DHT is one excellent option. It is not the protocol. The loop is the protocol. The routing layer plugs in.
Q: How does QIS handle rare diseases or N=1 sites — something FL famously cannot?
FL fails at rare conditions because rare conditions produce rare gradient signals. Averaging those signals with gradients from the majority — patients who don't have the condition — suppresses exactly the information that matters. A hospital with three cases of a rare pediatric syndrome contributes a gradient update that is diluted to near-zero by the aggregation of thousands of updates from sites that have never seen the condition.
QIS inverts this failure mode. A single agent with a single rare observation generates an outcome packet. That packet is fingerprinted by its semantic content — not by how many agents share the same content — and routed to any agent whose domain overlaps. A researcher at a second institution who has one case receives that packet. Synthesis happens between two N=1 sites that FL would have treated as statistical noise.
The routing is driven by relevance, not frequency. Rare signals are not averaged away. They are delivered to the agents for whom they are relevant, because relevance — not prevalence — determines the routing address.
Q: Who discovered this and what's the IP situation?
QIS — Quadratic Intelligence Swarm — was discovered by Christopher Thomas Trevethan on June 16, 2025. The framing is intentional: discovery, not invention. The N(N-1)/2 synthesis structure, the routing-by-similarity mechanic, the outcome-packet abstraction — these describe properties of how distributed intelligence can propagate that were always latent in the mathematics. Christopher Thomas Trevethan formalized them into a working architecture.
Christopher Thomas Trevethan has 39 provisional patents filed covering the architecture.
The search space for distributed intelligence is substantially larger than most practitioners currently explore. The federated learning literature frames the problem as: "how do we train a shared model without centralizing data?" That framing accepts model-sharing as necessary and optimizes within it. A different framing — "how do we route insights to agents that can use them, without ever sharing model state" — opens a different design space entirely.
Gossip learning, decentralized SGD, split learning, and blockchain-based approaches are all valuable contributions. They are also all model-synchronization protocols wearing different topological clothing. If the ceiling you've hit with FL feels architectural rather than tunable, that instinct is worth following. The question isn't which federated approach is least bad. It's whether sharing model state was ever the right primitive.
References:
- McMahan, B., Moore, E., Ramage, D., Hampson, S., & Agüera y Arcas, B. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS.
- Konečný, J., McMahan, H. B., Ramage, D., & Richtárik, P. (2016). Federated Optimization: Distributed Machine Learning for Privacy and Efficiency. arXiv:1610.02527.
- Zhu, L., Liu, Z., & Han, S. (2019). Deep Leakage from Gradients. NeurIPS.
- Warnat-Herresthal, S., et al. (2021). Swarm Learning for decentralized and confidential clinical machine learning. Nature, 594, 265–270.
- Vepakomma, P., Gupta, O., Swedish, T., & Raskar, R. (2018). Split Learning for Health: Distributed Deep Learning without Sharing Raw Patient Data. arXiv:1812.00564.
Top comments (0)