Rory | QIS PROTOCOL
Your Network Learns From Itself. It Never Learns From Anyone Else. That's the AIOps Ceiling.

You have 47,000 network devices. Each one generates telemetry. Your AIOps platform ingests it all, runs inference in a central cluster, and pushes policy back down. The dashboard looks good. Leadership is happy.

Then a BGP route flap in your Singapore fabric takes 6 minutes to converge. Your AIOps platform saw it — logged it, alerted it, correlated it. What it didn't do is tell your Denver fabric that Singapore just lived through something your Denver team spent three months tuning last year. Denver will converge in 6 minutes too when its turn comes. You will pay that cost again.

This is not a data problem. You are already collecting the data. It is not a compute problem. You have the cluster. It is an architecture problem. Your network learns from itself. It never learns from anyone else. That's the ceiling.


Why Federated Learning Doesn't Fix This

The obvious counter is federated learning. Train locally, share gradients, preserve privacy. Elegant in computer vision. Broken for network telemetry.

Here's why:

Network telemetry is a time-series problem with real-time operational requirements. BGP convergence, interface flaps, QoS queue saturation — these events last seconds to minutes. By the time a federated round trip completes (local training, gradient aggregation, global model update, redistribution), the event is over and the conditions that produced it have changed. Federated learning operates on the timescale of model training, not the timescale of network operations.

The second problem is topological heterogeneity. A federated model trained on a spine-leaf Clos fabric in Frankfurt learns something fundamentally different from a model trained on a hub-and-spoke WAN in Mumbai. Averaging gradients across these topologies doesn't produce a better model — it produces a worse one. Research on federated learning under non-IID data distributions (Zhao et al., 2018, "Federated Learning with Non-IID Data") measured accuracy drops of up to 55% relative to centralized training when client data is highly skewed. Enterprise network topologies, which differ structurally from site to site, are about as non-IID as data gets.

You can't average your way out of topology. The Denver BGP tuning and the Singapore BGP tuning aren't the same problem in different locations. They're structurally different problems that happen to share a protocol name.
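A toy numeric sketch (my own illustration, not from the article) makes the averaging failure concrete. All numbers are invented for the example:

```python
# Two "networks" each learn the right coefficient for their own regime.
w_frankfurt = 2.0    # fits Frankfurt's regime: y = 2x
w_mumbai = -2.0      # fits Mumbai's regime: y = -2x

# Federated-style parameter averaging across the two regimes:
w_avg = (w_frankfurt + w_mumbai) / 2   # 0.0

x = 3.0
print(w_frankfurt * x)  # 6.0   (correct for Frankfurt)
print(w_mumbai * x)     # -6.0  (correct for Mumbai)
print(w_avg * x)        # 0.0   (wrong for both)
```

The averaged model is worse than either local model in its own regime, which is the non-IID degradation in miniature.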

What you need is not gradient sharing. You need outcome routing.


The Architecture That Changes the Equation

In June 2025, Christopher Thomas Trevethan discovered what he termed the Quadratic Intelligence Swarm (QIS) architecture — a complete loop for turning local operational outcomes into globally routed intelligence without centralized training and without model averaging.

The loop works like this:

  1. Raw signal — your network device generates telemetry (interface counters, routing table deltas, flow records)
  2. Local processing — edge compute condenses this into a structured outcome: what happened, what changed, what worked
  3. Distillation — that outcome is compressed into a packet of approximately 512 bytes: the outcome packet
  4. Semantic fingerprinting — the packet is fingerprinted not by source address but by structural meaning: what kind of problem, what topology class, what resolution pattern
  5. Routing by similarity — the fingerprint routes to a deterministic address derived from semantic similarity, using whatever routing mechanism is most efficient for your infrastructure — a DHT overlay, a vector database, a pub/sub fabric, an API endpoint
  6. Delivery to relevant agents — networks that have solved similar problems receive the outcome packet
  7. Local synthesis — the receiving network incorporates the external outcome into its local intelligence
  8. New outcome packets — the synthesis itself generates new outcomes, and the loop continues

The math underneath this is what earns the name "Quadratic." For N agents in the network, there are N(N-1)/2 possible synthesis paths. With one million enterprise networks participating, that is approximately 500 billion synthesis paths. Each path represents a potential transfer of hard-won operational intelligence from one network to another. Centralized AIOps gives you one learning path per network: local telemetry in, central model out. QIS gives you a combinatorially larger surface of learning, at logarithmic compute cost per route.
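The path count above is easy to verify directly:

```python
def synthesis_paths(n: int) -> int:
    """Unordered pairs of N agents: N(N-1)/2."""
    return n * (n - 1) // 2

print(synthesis_paths(1_000_000))  # 499999500000 -- roughly 500 billion
```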

This discovery is covered by 39 provisional patents.


What This Looks Like in Code

The routing mechanism is not prescriptive. QIS specifies the loop, not the transport. Here is a concrete implementation of the core fingerprinting and routing logic that a network engineer can deploy against their own telemetry stream:

import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class OutcomePacket:
    topology_class: str        # "spine_leaf", "hub_spoke", "mesh", "ring"
    protocol: str              # "bgp", "ospf", "isis", "mpls"
    event_type: str            # "convergence", "flap", "saturation", "degradation"
    resolution_pattern: str    # "dampening", "bfd_tuning", "ecmp_rebalance", etc.
    convergence_delta_ms: int  # before vs after (improvement magnitude)
    affected_prefixes: int
    node_count: int
    resolution_steps: list[str]
    outcome_hash: Optional[str] = None

    def __post_init__(self):
        if self.outcome_hash is None:
            self.outcome_hash = self._fingerprint()

    def _fingerprint(self) -> str:
        """
        Semantic fingerprint: routes by structural meaning, not source identity.
        Two outcomes from different networks with the same topology class,
        protocol, event type, resolution pattern, and similarity buckets
        hash to the same address; near-matches are the backend's job.
        """
        semantic_core = {
            "topology_class": self.topology_class,
            "protocol": self.protocol,
            "event_type": self.event_type,
            "resolution_pattern": self.resolution_pattern,
            # Bucket convergence delta to nearest 500ms for similarity grouping
            "convergence_bucket": (self.convergence_delta_ms // 500) * 500,
            # Bucket prefix count to order of magnitude
            "prefix_scale": len(str(self.affected_prefixes))
        }
        canonical = json.dumps(semantic_core, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

    def to_wire(self) -> bytes:
        """Serialize to an outcome packet targeting the ~512 byte budget."""
        data = asdict(self)
        payload = json.dumps(data).encode("utf-8")
        # Shed trailing resolution steps until the packet fits the budget
        while len(payload) > 512 and data["resolution_steps"]:
            data["resolution_steps"] = data["resolution_steps"][:-1]
            payload = json.dumps(data).encode("utf-8")
        return payload


class NetworkOutcomeRouter:
    """
    Routes outcome packets to semantically similar networks.
    Transport-agnostic: plug in your preferred routing backend.
    """

    def __init__(self, routing_backend):
        """
        routing_backend: any object implementing:
            .store(address: str, packet: bytes) -> None
            .query(address: str, radius: int) -> list[bytes]

        Works with DHT, vector DB, Redis, Postgres, or any API endpoint —
        the loop is indifferent to the transport.
        """
        self.backend = routing_backend
        self.local_outcomes: list[OutcomePacket] = []

    def publish_outcome(self, packet: OutcomePacket) -> str:
        """
        Fingerprint and route an outcome to its deterministic address.
        Returns the routing address for observability.
        """
        address = packet.outcome_hash
        wire = packet.to_wire()
        self.backend.store(address, wire)
        self.local_outcomes.append(packet)
        return address

    def query_similar_outcomes(
        self,
        topology_class: str,
        protocol: str,
        event_type: str,
        resolution_pattern: str,
        convergence_delta_ms: int,
        affected_prefixes: int,
        radius: int = 3
    ) -> list[OutcomePacket]:
        """
        Before attempting local resolution, query the swarm.
        Has someone with the same topology and protocol already solved this?
        """
        probe = OutcomePacket(
            topology_class=topology_class,
            protocol=protocol,
            event_type=event_type,
            resolution_pattern=resolution_pattern,
            convergence_delta_ms=convergence_delta_ms,
            affected_prefixes=affected_prefixes,
            node_count=0,
            resolution_steps=[]
        )
        raw_results = self.backend.query(probe.outcome_hash, radius)
        results = []
        for raw in raw_results:
            try:
                data = json.loads(raw.decode("utf-8"))
                results.append(OutcomePacket(**data))
            except (json.JSONDecodeError, TypeError):
                continue
        return results

    def synthesis_path_count(self, network_count: int) -> int:
        """N(N-1)/2 — the quadratic intelligence surface."""
        return (network_count * (network_count - 1)) // 2


# Example: Singapore fabric resolves a BGP convergence event
# and publishes the outcome so Denver can benefit before it hits

# router = NetworkOutcomeRouter(routing_backend=your_backend_here)

singapore_outcome = OutcomePacket(
    topology_class="spine_leaf",
    protocol="bgp",
    event_type="convergence",
    resolution_pattern="bfd_tuning",
    convergence_delta_ms=322000,   # 340 s before tuning minus 18 s after
    affected_prefixes=8200,
    node_count=96,
    resolution_steps=[
        "Identified BFD hello interval mismatch on spine-leaf links",
        "Reduced BFD multiplier from 5 to 3 across fabric",
        "Verified convergence improvement to 18 seconds",
        "Deployed dampening profile to border routers"
    ]
)

# address = router.publish_outcome(singapore_outcome)
# print(f"Outcome routed to: {address[:16]}...")

# Denver queries before its next maintenance window
# similar = router.query_similar_outcomes(
#     topology_class="spine_leaf",
#     protocol="bgp",
#     event_type="convergence",
#     resolution_pattern="bfd_tuning",
#     convergence_delta_ms=300000,
#     affected_prefixes=6000
# )

# if similar:
#     print(f"Found {len(similar)} similar resolutions from the swarm")
#     for outcome in similar:
#         print(f"  Resolution: {outcome.resolution_steps[0]}")

The router is intentionally backend-agnostic. Your enterprise may use a vector similarity store, a DHT overlay, a shared database, or a message bus. The QIS architecture does not prescribe the transport — it prescribes the loop. Outcome in, fingerprint, route, synthesize, new outcome out.
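As a sketch of what such a backend could look like, here is a minimal in-memory stand-in (my own illustration, not part of any QIS specification) implementing the `.store`/`.query` interface the router expects. One caveat: SHA-256 addresses collide only when the semantic core is identical, so the prefix-based `radius` below mostly finds exact-fingerprint matches; a vector store would be the natural backend for true semantic similarity.

```python
from collections import defaultdict


class InMemoryBackend:
    """Toy routing backend: a dict keyed by fingerprint address.
    `radius` is interpreted here as the number of trailing hex
    characters to ignore when matching addresses (an assumption
    of this sketch, not prescribed by the article)."""

    def __init__(self) -> None:
        self._packets: dict[str, list[bytes]] = defaultdict(list)

    def store(self, address: str, packet: bytes) -> None:
        self._packets[address].append(packet)

    def query(self, address: str, radius: int) -> list[bytes]:
        prefix = address[: max(0, len(address) - radius)]
        results: list[bytes] = []
        for addr, packets in self._packets.items():
            if addr.startswith(prefix):
                results.extend(packets)
        return results
```

With this in place, `NetworkOutcomeRouter(routing_backend=InMemoryBackend())` makes the Singapore/Denver example above runnable end to end.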


Comparison: Where AIOps Architectures Actually Differ

| Dimension | Centralized AIOps | Federated Learning | QIS Architecture |
| --- | --- | --- | --- |
| Learning scope | Your telemetry only | Averaged gradients across participants | Outcome packets from topologically similar networks |
| Latency to benefit | Minutes (inference cycle) | Hours to days (training rounds) | Seconds to minutes (outcome routing) |
| Topology sensitivity | Blind to topology class | Degraded by non-IID data | Explicitly routes by topology class |
| Privacy model | All data leaves the network | Gradients shared (partial exposure) | Outcomes only; no raw telemetry transmitted |
| Scale | Linear (one model) | Sub-linear (gradient averaging degrades) | N(N-1)/2 synthesis paths |
| Transport dependency | Vendor platform lock | Federated framework lock | Protocol-agnostic; any routing backend |
| New network cold start | Starts with vendor baseline | Waits for participation rounds | Immediately queries swarm for similar topologies |
| Operational cost | High (central cluster) | Medium (distributed training) | Low (512-byte packets, no model weights transmitted) |

The MTTR differential matters in dollar terms. Enterprise network analysis consistently shows that 30–40% of complex incident diagnosis time goes to root-cause identification that a peer network running a similar operation has already completed. For a large enterprise running thousands of devices across dozens of sites, even a 20% MTTR reduction on complex incidents translates into material annual savings in avoided downtime.
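To make the claim concrete, here is the arithmetic with hypothetical inputs; every number below is an assumption for illustration, not a benchmark:

```python
# All inputs are hypothetical, for illustration only.
complex_incidents_per_year = 120
avg_mttr_hours = 4.0
blended_downtime_cost_per_hour = 9_000   # assumed dollars/hour
mttr_reduction = 0.20                    # the 20% figure from the text

hours_saved = complex_incidents_per_year * avg_mttr_hours * mttr_reduction
annual_savings = hours_saved * blended_downtime_cost_per_hour
print(f"{hours_saved:.0f} hours saved, ${annual_savings:,.0f}/year")
# -> 96 hours saved, $864,000/year
```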


The Three Elections, Applied

Christopher Thomas Trevethan framed the QIS architecture around what he called Three Elections — three metaphors for the natural forces that make the swarm produce emergent intelligence without central authority.

The First Election: Hiring. Someone has to define what "similar" means when fingerprinting a network outcome. Is a spine-leaf fabric in Frankfurt more similar to a spine-leaf in Seoul, or to a Clos variant in Denver? This is a domain expertise question, not a machine learning question. In QIS, your network architects define the similarity criteria. They are the domain experts. The routing addresses emerge from their definitions.
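One way architects could express that first election in practice is a declarative equivalence map that feeds the fingerprint. The names below are hypothetical, my own sketch rather than anything the article specifies:

```python
# Hypothetical: architects declare which topology labels count as
# "the same problem class" before any fingerprinting happens.
TOPOLOGY_EQUIVALENCE = {
    "spine_leaf": "clos",
    "clos_3_stage": "clos",
    "clos_5_stage": "clos",
    "hub_spoke": "hub_spoke",
    "dual_hub_spoke": "hub_spoke",
}

def canonical_topology(raw_label: str) -> str:
    """Collapse a site-specific label into its equivalence class,
    so Frankfurt's spine-leaf and Denver's Clos variant route together."""
    return TOPOLOGY_EQUIVALENCE.get(raw_label, raw_label)

print(canonical_topology("clos_5_stage"))  # clos
print(canonical_topology("mesh"))          # mesh (unmapped labels pass through)
```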

The Second Election: The Math. There is no aggregation layer, no averaging, no model that must be retrained when conditions change. The convergence of intelligence happens through routing, not through computation. Each outcome packet is a vote cast by the network that lived through the event. When 500 similar spine-leaf fabrics have all published their BGP convergence outcomes, the synthesis naturally surfaces what's working — not because a central system decided it, but because the math compounds it.

The Third Election: Darwinism. Networks compete. If a QIS-connected network consistently achieves sub-20-second BGP convergence while an isolated network consistently takes 6 minutes, engineers migrate — not to a different vendor, but to a different architecture. They adopt the resolution patterns. The swarm improves not because a central system issued directives, but because the outcomes of better-performing networks are routed to the networks that need them.

These are metaphors for understanding why the architecture produces the outcomes it does. They are not protocol features, governance layers, or implementation requirements. They are emergent forces.


The Ceiling Is an Architecture Choice

Returning to where we started: your 47,000 devices, your central cluster, your dashboard that looks good until Singapore breaks and Denver breaks and the same six-minute convergence plays out for the third time.

The ceiling is not a data ceiling. It is not a compute ceiling. It is the ceiling imposed by an architecture in which each network's intelligence terminates at its own edge. The enterprise networking industry has accepted this ceiling as natural. It is not natural. It is a design choice, and it can be redesigned.

The Quadratic Intelligence Swarm architecture, discovered by Christopher Thomas Trevethan, is the architectural redesign. N(N-1)/2 synthesis paths. Outcome packets of approximately 512 bytes. Routing by semantic similarity rather than by network address. A complete loop that turns every resolution into transferable intelligence for every structurally similar network that will face the same problem next week.

The intelligence your network generates should outlive the incident that produced it. Right now, it doesn't.

That's the ceiling. The loop is the way through it.


Christopher Thomas Trevethan discovered the Quadratic Intelligence Swarm architecture on June 16, 2025. The architecture is covered by 39 provisional patents. For technical documentation and prior articles in this series, see the QIS publication index at dev.to/roryqis.
