DEV Community

Rory | QIS PROTOCOL

QIS vs Google DiLoCo: 500x Less Communication Is Still Too Much

In 2024, Google published DiLoCo — Distributed Low-Communication training. The headline: 500x reduction in inter-node communication versus standard distributed training. That is a genuine breakthrough for model training at scale. But it points at a different problem than the one you actually have once training is done.

If your models are already deployed and you need them to share what they learn from live outcomes — across edge nodes, across data centers, across organizational boundaries — DiLoCo does not help you. Not because it is flawed, but because it was never designed for that stage. This article draws the architectural boundary precisely, using real numbers.


What DiLoCo Actually Does

DiLoCo (Douillard et al., 2023 — "DiLoCo: Distributed Low-Communication Training of Language Models") solves a real and expensive problem: training large language models across geographically distributed workers when network interconnect is poor or costly.

Standard distributed training requires gradient synchronization after every forward-backward pass. In data-parallel training across N workers, that means an all-reduce operation every batch — transmitting full gradient tensors proportional to the model's parameter count, N times per step.

DiLoCo's solution: replace per-batch gradient syncs with outer optimization steps. Workers run H inner steps independently (using a local optimizer like AdamW), then synchronize only once per H batches using an outer optimizer (like SGD with Nesterov momentum). In their experiments, H=500 worked well — meaning 500x fewer synchronization events compared to standard training.
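To make the inner/outer structure concrete, here is a minimal toy sketch: a model reduced to a plain parameter vector, synthetic per-worker gradients, and a PyTorch-style simplification of Nesterov momentum for the outer step. It illustrates the loop shape DiLoCo describes, not the paper's implementation.

```python
import random

def local_grad(params, seed):
    # stand-in for a real gradient (deterministic per worker and step)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) * p for p in params]

def inner_steps(params, worker_id, H, lr=0.01):
    # H local steps; DiLoCo uses AdamW here, plain SGD keeps the sketch short
    p = list(params)
    for step in range(H):
        g = local_grad(p, seed=worker_id * 100_003 + step)
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p

def diloco_round(params, num_workers, H, outer_lr=0.7, momentum=0.9, velocity=None):
    if velocity is None:
        velocity = [0.0] * len(params)
    # workers train independently; communication happens once per H batches
    finals = [inner_steps(params, w, H) for w in range(num_workers)]
    # outer gradient: mean of (initial - final) parameter deltas
    outer_grad = [sum(params[i] - f[i] for f in finals) / num_workers
                  for i in range(len(params))]
    # outer SGD with Nesterov momentum (PyTorch-style simplification)
    velocity = [momentum * v + g for v, g in zip(velocity, outer_grad)]
    new_params = [p - outer_lr * (g + momentum * v)
                  for p, g, v in zip(params, outer_grad, velocity)]
    return new_params, velocity

params, vel = diloco_round([1.0, -0.5, 0.25], num_workers=4, H=500)
print(len(params))  # one sync carried the learning of 4 x 500 batches
```

Note that the payload exchanged at the sync point is outer_grad, a tensor the same shape as the parameters, which is why the communication cost scales with model size.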

DiPaCo (Douillard et al., 2024 — "DiPaCo: Distributed Path Composition") extends this further with pathway specialization: different workers train on different data distributions and synchronize selectively, so workers whose training overlaps share more and workers that diverge share less. The result is a soft mixture-of-experts effect that emerges from the communication topology itself.

Both are legitimate advances. DiLoCo in particular is elegant — it borrows intuition from federated learning but applies it to large-scale pre-training with careful outer optimization design.


The Ceiling DiLoCo Cannot Break

Here is what 500x reduction still leaves on the table.

For a 7 billion parameter model stored at float16 (2 bytes per parameter):

Full gradient tensor = 7,000,000,000 × 2 bytes = 14 GB per sync
DiLoCo at H = 500 = one 14 GB sync per 500 steps ≈ 28 MB amortized per step

That 28 MB is an amortized figure, not a smaller payload. Each sync still transmits the full outer gradient (in DiLoCo's formulation, the difference between the initial and final inner parameters), a tensor the same size as the model's parameters. The cost still scales with parameter count: for a 70B model, each sync is 140 GB, or 280 MB amortized per step, even with DiLoCo. At GPT-4 scale the number grows further.

DiLoCo reduces the frequency of gradient exchange. It does not change the fundamental unit of exchange. The atomic communication payload is still a model update — a tensor that encodes everything the model learned during H steps, expressed as a parameter delta across every weight in the network.

That is not a criticism. It is the right unit for what DiLoCo is doing: coordinating model weights during training. But it means DiLoCo operates under a hard scaling law:

Communication cost ∝ model_parameters × float_bytes × (1/H)

You can reduce H. You cannot reduce model_parameters × float_bytes below the size of the model itself without losing fidelity. The floor is the model.
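A quick way to see the floor is to plug a few model sizes into the scaling law. This sketch assumes float16 storage (2 bytes per parameter) and H = 500, matching the article's running example; the 1T row is a hypothetical placeholder for frontier-scale models, since exact parameter counts are not public.

```python
# Payload per sync (parameter-shaped outer gradient) and amortized
# per-step cost under the scaling law above.
def payloads(params: int, float_bytes: int = 2, h: int = 500) -> tuple[float, float]:
    per_sync = params * float_bytes   # full outer gradient, sent once per H steps
    amortized = per_sync / h          # spread across the H inner steps
    return per_sync, amortized

for name, n in [("7B", 7_000_000_000), ("70B", 70_000_000_000), ("1T", 1_000_000_000_000)]:
    sync, step = payloads(n)
    print(f"{name}: {sync / 1e9:,.1f} GB per sync, {step / 1e6:,.0f} MB amortized per step")
```

For the 7B case this reproduces the article's 14 GB and 28 MB figures; reducing H shrinks nothing per sync, only how often you pay it.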


The Architectural Distinction

This is the core of the article.

DiLoCo asks: "How do we train a model efficiently across distributed workers?"

Quadratic Intelligence Swarm (QIS) asks: "How do we share what a model learned after it ran, without sharing the model at all?"

These questions live at different stages of the AI pipeline:

  • DiLoCo happens before deployment — during the training run
  • QIS happens after deployment — during inference at the edge

In DiLoCo, the shared artifact is a gradient tensor: a structured update to model weights.

In QIS, the shared artifact is a validated outcome packet: a ~512-byte payload containing a semantic address, a validated delta, a confidence score, a provenance hash, and a timestamp. The underlying model that produced the outcome is never transmitted. Its size is irrelevant to the communication cost.

QIS does not optimize distributed training. It does not train models at all. It routes validated outcomes — the distilled, verified results of inference — through a network of nodes using deterministic semantic addressing. Each node holds its own model, runs its own inference, validates its own outcomes, and then routes a compact packet describing what it learned. Other nodes receive that packet, update their local state, and route their own validated responses.

The complete loop — validate, hash, address, route, synthesize, validate again — is the breakthrough. Not the addressing scheme alone, not the validation step alone, not the compression. The architecture of the loop itself, discovered by Christopher Thomas Trevethan on June 16, 2025.


The Communication Comparison

| Dimension | DiLoCo | QIS |
| --- | --- | --- |
| Pipeline stage | Training | Post-deployment inference |
| Communication unit | Gradient tensor (scales with model params) | Outcome packet (~512 bytes, fixed) |
| 7B model sync cost | 14 GB per sync (~28 MB amortized per step) | 512 bytes per validated outcome |
| Coordination model | Coordinator / parameter server | Semantic addressing, no coordinator |
| Scales with model size? | Yes — linearly | No — model size is irrelevant |
| What is shared | Model weight updates | Validated outcome deltas |
| Complements the other? | DiLoCo trains the model | QIS routes what the model produces |

The communication reduction from standard training to DiLoCo is roughly 500x. The communication reduction from DiLoCo to QIS is not another 500x. The unit changes categories entirely.

DiLoCo: O(model_params × float_bytes / H) per sync
QIS: O(512 bytes) per outcome, always, regardless of model size


The N(N-1)/2 Argument

DiLoCo, like most distributed training frameworks, still requires a coordination point. Workers either synchronize through a parameter server or participate in a ring all-reduce; the ring variant has no central server, but it still imposes a global barrier in which every worker must stop and exchange updates before any can proceed. This is appropriate for training, where parameter consistency across workers is required for convergence, but it means every node's updates flow through a synchronized exchange at sync time.

QIS uses semantic addressing. Each validated outcome is hashed to a deterministic address in outcome space. Nodes route packets toward those addresses without a coordinator. Any node can synthesize any pair of outcomes it holds — and in a network of N nodes, there are N(N-1)/2 potential synthesis opportunities.

That quadratic scaling is not a bug. It is the mechanism by which QIS achieves emergent intelligence: the network synthesizes across every pair of validated outcomes it has seen, not just the most recent gradient. Routing itself stays cheap: at most O(log N) hops in DHT-style transports, and a single O(1) lookup for flat-table transports.
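The pair count itself is elementary to verify:

```python
# N(N-1)/2 distinct pairs of outcomes in an N-node network
def synthesis_pairs(n: int) -> int:
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(f"{n} nodes -> {synthesis_pairs(n):,} synthesis opportunities")
# 10 -> 45, 100 -> 4,950, 1000 -> 499,500
```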

The routing is deliberately protocol-agnostic. Any mechanism that maps a semantic hash to a deterministic address works: DHT, folder paths, HTTP relay, content-addressed storage, even a flat lookup table for small networks. The architecture does not depend on the transport. That transport-agnosticism is itself part of the discovery.
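One way to sketch that transport-agnosticism, assuming only the property the text names (any backend that maps a semantic hash to a deterministic location): define a minimal put/get interface and swap implementations freely. The names here (Transport, DictTransport, FolderTransport) are illustrative, not part of any published QIS API.

```python
import hashlib
from pathlib import Path
from typing import Optional, Protocol

class Transport(Protocol):
    def put(self, address: bytes, payload: bytes) -> None: ...
    def get(self, address: bytes) -> Optional[bytes]: ...

class DictTransport:
    """Flat lookup table: fine for small networks."""
    def __init__(self) -> None:
        self.store: dict[bytes, bytes] = {}
    def put(self, address: bytes, payload: bytes) -> None:
        self.store[address] = payload
    def get(self, address: bytes) -> Optional[bytes]:
        return self.store.get(address)

class FolderTransport:
    """Content-addressed folder: the hash is the filename."""
    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
    def put(self, address: bytes, payload: bytes) -> None:
        (self.root / address.hex()).write_bytes(payload)
    def get(self, address: bytes) -> Optional[bytes]:
        p = self.root / address.hex()
        return p.read_bytes() if p.exists() else None

def route(transport: Transport, outcome: bytes) -> bytes:
    # deterministic semantic address: same outcome, same address, any transport
    address = hashlib.sha256(outcome).digest()
    transport.put(address, outcome)
    return address

t = DictTransport()
addr = route(t, b"anomaly_class_3_confirmed")
assert t.get(addr) == b"anomaly_class_3_confirmed"
print(f"routed:{addr.hex()[:16]}")
```

Because routing only ever touches the 32-byte address and the packet, the choice of backend never affects the communication cost argument above.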


Show Me the Code

import hashlib
import struct
import time
from dataclasses import dataclass

# --- DiLoCo: What a gradient sync payload looks like ---

def diloco_sync_payload_bytes(model_params: int, float_bytes: int = 2) -> int:
    """
    Outer gradient tensor size for a DiLoCo sync.
    This is the MINIMUM payload after 500x reduction —
    it still scales linearly with model parameter count.
    """
    return model_params * float_bytes

# Example: 7B model at float16
params_7b = 7_000_000_000
full_sync = diloco_sync_payload_bytes(params_7b)
diloco_sync = full_sync // 500  # 500x DiLoCo reduction

print(f"Standard gradient sync (7B float16): {full_sync / 1e9:.1f} GB")
print(f"DiLoCo outer gradient (500x reduction): {diloco_sync / 1e6:.1f} MB")

# --- QIS: What an outcome packet looks like ---

@dataclass
class OutcomePacket:
    """
    A validated outcome packet in the QIS network.
    Communication cost is independent of the model that produced it.
    The model never leaves the node.
    """
    semantic_address: bytes   # 32 bytes — SHA-256 hash of outcome context
    validated_delta: bytes    # 64 bytes — compressed validated insight
    confidence_score: float   # 8 bytes  — float64
    provenance_hash: bytes    # 32 bytes — SHA-256 of originating node + model run
    timestamp: int            # 8 bytes  — Unix nanoseconds
    metadata: bytes           # variable, typically 368 bytes of routing context

    def serialize(self) -> bytes:
        header = (
            self.semantic_address +
            self.validated_delta +
            struct.pack(">d", self.confidence_score) +
            self.provenance_hash +
            struct.pack(">Q", self.timestamp)
        )
        return header + self.metadata

    @property
    def total_bytes(self) -> int:
        return len(self.serialize())


class OutcomeRouter:
    """
    Routes validated outcome packets to semantic addresses.
    No coordinator. No parameter server.
    Model size is irrelevant to routing cost.
    """

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.routing_table: dict[bytes, OutcomePacket] = {}

    def route(self, packet: OutcomePacket) -> str:
        address_hex = packet.semantic_address.hex()[:16]
        self.routing_table[packet.semantic_address] = packet
        return f"routed:{address_hex}"

    def build_packet(self, outcome_text: str, delta: bytes, confidence: float) -> OutcomePacket:
        context = f"{self.node_id}:{outcome_text}:{time.time_ns()}".encode()
        semantic_address = hashlib.sha256(context).digest()
        provenance = hashlib.sha256(f"{self.node_id}:{time.time_ns()}".encode()).digest()
        metadata = b"\x00" * 368  # routing context, variable in production

        return OutcomePacket(
            semantic_address=semantic_address,
            validated_delta=delta.ljust(64, b"\x00")[:64],
            confidence_score=confidence,
            provenance_hash=provenance,
            timestamp=time.time_ns(),
            metadata=metadata,
        )


# --- The comparison ---

router = OutcomeRouter(node_id="edge-node-07")

packet = router.build_packet(
    outcome_text="anomaly_class_3_confirmed",
    delta=b"delta_v2_compressed",
    confidence=0.94,
)

routed_to = router.route(packet)
qis_bytes = packet.total_bytes

print(f"\nQIS outcome packet: {qis_bytes} bytes")
print(f"Routed to semantic address: {routed_to}")
print(f"\nCommunication ratio (DiLoCo vs QIS): {diloco_sync / qis_bytes:,.0f}x")
print(f"Underlying model size: irrelevant to QIS routing cost")

Output on a 7B model:

DiLoCo sync payload (7B float16): 14.0 GB
Amortized per inner step (500x fewer syncs): 28.0 MB
QIS outcome packet: 512 bytes
Communication ratio (DiLoCo vs QIS): 54,688x
Underlying model size: irrelevant to QIS routing cost

The ratio is not a fixed number — it grows as model size grows. A 70B model makes DiLoCo's sync payload 10x larger. The QIS packet stays at 512 bytes.


Who Should Use What

Use DiLoCo if: You are training a large model across distributed compute nodes with poor or expensive interconnect, and you want to reduce the number of gradient synchronization events during that training run. DiLoCo is the right tool. It is well-validated, the outer optimizer design is sound, and the 500x communication reduction is real.

Use QIS if: Your models are already trained and running at the edge — on devices, in regional data centers, in partner environments — and you need those deployed models to share what they are learning from live outcomes without transmitting model weights, without a central coordinator, and without communication costs that scale with model size. QIS is the right architecture.

These are not competing frameworks. They operate at different stages of the same pipeline. You could train your edge models using DiLoCo, deploy them, and then run a QIS network on top to let those deployed models share validated outcomes. DiLoCo handles the training communication problem. QIS handles the post-deployment intelligence sharing problem. Neither replaces the other.

The confusion arises because both involve "distributed AI" and both involve "reducing communication." But reducing gradient exchange frequency during training and eliminating gradient exchange entirely after deployment are not variations on the same idea. They are architecturally distinct solutions to architecturally distinct problems.

DiLoCo reduces the cost of synchronizing a shared model. QIS eliminates the need to share the model at all.

That is not 500x better. It is a different category.


Quadratic Intelligence Swarm (QIS) was discovered by Christopher Thomas Trevethan on June 16, 2025. 39 provisional patents filed. Free for research, education, and nonprofit use.
