How I Wired Partial Reasoning, Online Replanning, and Real-Time Adaptation into a Streaming Telecom Agent
TL;DR
Telecommunications networks are the opposite of stable. Nodes fail. Latency spikes without warning. Broadcast storms chew through capacity before engineers even open their laptops. In this experimental PoC, I built a streaming decision agent that doesn't just observe the chaos — it reasons about it, plans around it, and reroutes live traffic using updated assumptions, all without halting the execution loop. The project is called TelecomRoute-AI; it uses Python with NetworkX and Rich, and in my opinion it's one of the more practically illustrative PoCs I've ever designed from scratch.
The full code for this experiment is available on GitHub:
👉 https://github.com/aniket-work/TelecomRoute-AI
Introduction
What got me thinking about this problem was a seemingly trivial question: what does an AI agent do when the world changes underneath it?
The classical view of intelligent agents — one that most beginner tutorials still teach — is sequential and static: read inputs, produce a plan, execute the plan. But from my experience working with distributed systems in experimental settings, real business environments almost never stay still. A routing table you computed three seconds ago can be catastrophically wrong three seconds later if an upstream node begins dropping packets. For me, the interesting design challenge is not just "build an agent that plans." The real challenge is: build an agent that keeps reasoning about plan validity even while executing it.
That's the kernel of what I wanted to explore here. In my opinion, this isn't an academic problem — it's engineering. And building it in Python, without any heavy ML framework, proved to be both illuminating and surprisingly doable.
For this PoC, I chose telecommunications network routing as the domain. The business problem is straightforward: in a mesh network with multiple server nodes (think: regional data centres connected through a web of edge routers), traffic must be routed from source node A to destination node F with minimum latency. The complication is that the network's health is not static. Nodes degrade. Links get saturated. In production, there are teams of engineers looking at dashboards trying to make these routing decisions. I wanted to explore what an autonomous agent would look like doing that job in real-time.
What's This Article About?
This article walks through my thought process building TelecomRoute-AI from scratch — a Python-based simulation where:
- A TelecomNetwork environment emits time-series telemetry about node latencies.
- A StreamingDecisionAgent ingests those readings tick-by-tick.
- The agent applies partial reasoning using a sliding-window statistical model over edge latencies.
- When its internal belief graph drifts significantly from observed reality, it triggers online replanning using Dijkstra's shortest-path algorithm.
- Without interrupting the "traffic stream," it adapts mid-execution by swapping to the new route.
The article covers the code in depth, walks you through the design decisions, and shows you exactly what happens when Node D starts burning.
Tech Stack
I kept the tech stack deliberately minimal for this exploration. The goal was clarity, not complexity:
- Python 3.12 — The backbone. No surprises.
- NetworkX — For building the telecom mesh graph and running Dijkstra's algorithm dynamically.
- Rich — For the real-time terminal dashboard. This library is criminally underrated for Python simulations.
- Pillow (PIL) — For generating the GIF animations shown in this article.
- collections.deque — Python's built-in sliding window, the unsung workhorse of the partial reasoning layer.
- Mermaid.js (mermaid.ink) — For all architecture diagrams.
I specifically avoided LLMs, LangChain, or any external reasoning orchestrators. In my view, the most instructive way to understand agent mechanics is to write the reasoning logic yourself, in plain Python.
Why Read It?
If you've ever worked on or thought about:
- High-velocity streaming data that changes faster than your system can process it
- Systems that need to maintain decisions while the environment shifts beneath them
- Reducing operator intervention in network or infrastructure management
- Classical graph algorithms meeting real-time decision loops
...then this PoC is for you. From my experience, understanding these patterns gives you a transferable mental model regardless of which framework you eventually build with.
Let's Design
My core design philosophy here was: the agent should have a private belief graph, not just an observation log. This is a subtle but powerful distinction.
Most simple reactive systems work like this: observe event → check rule → take action. That's fine for well-defined rule spaces. But in a dynamic mesh network, the space of possible states is enormous. A pure reactive system would need an absurdly large rule table.
Instead, I went with an internal world model approach:
- The agent maintains its own internal weighted graph (a copy of the network topology).
- It continuously updates this internal graph using streaming telemetry.
- Routing decisions are always made against the internal graph, not raw telemetry.
- When the internal graph diverges too far from what telemetry reports (partial reasoning), the agent rebuilds its optimal path (online replanning) and switches routes (mid-execution adaptation).
In other words, the agent doesn't react to individual spikes. It reasons about patterns, then adapts systematically. That is, in my opinion, the correct architectural stance for this class of problem.
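Before diving into the actual components, the belief-update stance can be sketched in a few lines of plain Python. This is my own illustrative skeleton, not code from the TelecomRoute-AI repository; the function and variable names are invented for the example:

```python
# Minimal sketch of the "update belief, then decide" stance.
# All names here are illustrative, not from the TelecomRoute-AI codebase.

def belief_loop_step(belief, observation, threshold=0.5):
    """Update a per-edge belief dict from one telemetry reading and
    report whether any belief drifted enough to warrant replanning."""
    drifted = False
    for edge, observed in observation.items():
        believed = belief[edge]
        # Relative divergence between stored belief and fresh observation
        if abs(observed - believed) / believed > threshold:
            belief[edge] = observed  # adopt the new evidence
            drifted = True
    return drifted

belief = {('A', 'B'): 10.0, ('B', 'D'): 20.0}
# A severe spike on B->D crosses the 50% divergence threshold:
assert belief_loop_step(belief, {('A', 'B'): 11.0, ('B', 'D'): 70.0}) is True
assert belief == {('A', 'B'): 10.0, ('B', 'D'): 70.0}
```

Note that the small jitter on A→B leaves that belief untouched; only the genuinely drifted edge is rewritten, which is exactly the "patterns, not spikes" behaviour described above.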
Let's Get Cooking
Let me walk through the three key components, explaining exactly what I was thinking when I wrote them.
Component 1: The Network Environment
The TelecomNetwork class simulates the physical topology and emits high-frequency telemetry. I think of this as the "real world" that the agent is trying to understand and react to.
```python
class TelecomNetwork:
    """Simulates a dynamic telecommunications network environment."""

    def __init__(self):
        self.graph = nx.Graph()
        self._initialize_topology()
        self.source = 'A'
        self.destination = 'F'
        self.current_tick = 0
        self.node_health = {node: 1.0 for node in self.graph.nodes}  # 1.0 = healthy

    def _initialize_topology(self):
        """Creates a mesh-like topology of network nodes."""
        edges = [
            ('A', 'B', 10), ('A', 'C', 15),
            ('B', 'D', 20), ('B', 'C', 12),
            ('C', 'E', 25), ('D', 'F', 15),
            ('E', 'F', 10), ('C', 'D', 18), ('D', 'E', 5)
        ]
        for u, v, w in edges:
            self.graph.add_edge(u, v, weight=w)
```
Each edge has a baseline weight representing nominal latency in milliseconds. The nodes are labelled A through F, representing data centres or edge routers in a regional mesh. I got the idea for this topology from looking at simplified backbone diagrams for multi-PoP (Point of Presence) networks.
What I find particularly important in this design is the node_health dictionary. Rather than just removing nodes from the graph when they fail (which would be too harsh and too unrealistic), I made health a gradient between 0.0 and 1.0. Partial failures — where a node is functioning but degraded — are much more common in real telecom environments than total failures.
```python
    def get_telemetry(self):
        """Emits streaming telemetry data for the agent to process."""
        self.current_tick += 1
        event = self.trigger_anomaly()
        telemetry = {}
        for u, v in self.graph.edges:
            base_w = self.graph[u][v]['weight']
            effective_health = min(self.node_health[u], self.node_health[v])
            noise = random.uniform(-2, 2)
            if effective_health < 0.5:
                latency = base_w * 5 + noise + 100  # Severe spike
            elif effective_health < 1.0:
                latency = base_w * 2 + noise + 20  # Moderate spike
            else:
                latency = base_w + noise  # Nominal
            telemetry[(u, v)] = max(1, round(latency, 2))
        return self.current_tick, telemetry, event
```
I put it this way because I wanted telemetry to be noisy — the way real network monitoring data looks. A node that's 80% healthy doesn't emit perfectly consistent readings. It shows jitter. Adding uniform random noise in the ±2ms band before layering the health-based penalties was specifically designed to test whether the agent's sliding-window approach could filter out noise from genuine degradation signals.
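To make the tiered penalty model concrete, here is a dependency-free sketch of the latency logic described above. The function name is mine, and noise defaults to zero so the outputs are deterministic; the constants mirror the get_telemetry code:

```python
def effective_latency(base_w, health_u, health_v, noise=0.0):
    """Latency model from get_telemetry: the worse endpoint's health
    decides which penalty tier an edge falls into."""
    effective_health = min(health_u, health_v)
    if effective_health < 0.5:
        return base_w * 5 + noise + 100   # severe spike
    elif effective_health < 1.0:
        return base_w * 2 + noise + 20    # moderate spike
    return base_w + noise                 # nominal

# Edge B-D has a 20ms baseline; watch the tiers kick in:
assert effective_latency(20, 1.0, 1.0) == 20    # both endpoints healthy
assert effective_latency(20, 1.0, 0.8) == 60    # one side degraded
assert effective_latency(20, 0.2, 1.0) == 200   # one side nearly down
```

The min() on endpoint health is the detail worth noticing: an edge is only as good as its sickest endpoint, which is how a single failing node poisons every link touching it.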
Component 2: The Partial Reasoning Engine
This is the heart of the agent's intelligence, and the piece I'm most satisfied with in this experiment.
```python
class StreamingDecisionAgent:
    def __init__(self, source, destination, initial_graph):
        self.source = source
        self.destination = destination
        self.graph = initial_graph.copy()
        # Sliding window for partial reasoning
        self.window_size = 5
        self.telemetry_history = {
            edge: deque(maxlen=self.window_size)
            for edge in self.graph.edges
        }
        self.current_route = None
        self._replan()  # Compute initial route
```
The choice of window_size = 5 is meaningful. I reasoned that in a network with a ~150ms tick rate, a window of 5 gives approximately 750ms of observation time before reacting. That's long enough to smooth over a single-tick noise blip, but short enough that genuine multi-tick degradation (like a broadcast storm that persists for seconds) is caught reliably. In a production system you'd tune this; for this PoC, 5 worked really well.
The deque with maxlen is elegant here — as new observations arrive, old ones are automatically evicted. No manual memory management, no growing arrays. From my experience, collections.deque is one of the most underused data structures in Python — it handles the sliding window allocation problem in a single line.
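A quick illustration of why deque(maxlen=...) fits the sliding-window job. The numbers are toy values of my own, not project output:

```python
from collections import deque

window = deque(maxlen=5)

# A single-tick blip barely moves the 5-tick average: 25ms vs a 20ms
# belief is only 25% drift, safely under the 50% threshold.
for reading in [20, 21, 19, 45, 20]:
    window.append(reading)
assert sum(window) / len(window) == 25.0

# Sustained degradation dominates the window instead, and the old
# readings are evicted automatically once maxlen is reached.
for reading in [200, 210, 205, 198, 202]:
    window.append(reading)
assert len(window) == 5
assert sum(window) / len(window) == 203.0
```

No slicing, no manual trimming: appending to a full deque silently drops the oldest element, which is the entire memory-management story for this layer.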
```python
    def ingest_telemetry(self, tick, telemetry):
        """Streaming Ingestion + Partial Reasoning + Adaptation."""
        adaptation_triggered = False
        replan_reason = ""

        # Step 1: Update sliding windows
        for edge_tuple, latency in telemetry.items():
            u, v = edge_tuple
            edge = (u, v) if (u, v) in self.graph.edges else (v, u)
            if edge in self.telemetry_history:
                self.telemetry_history[edge].append(latency)

        # Step 2: Partial Reasoning — check if beliefs diverge from observations
        graph_updated = False
        for edge, latencies in self.telemetry_history.items():
            if len(latencies) == self.window_size:
                avg_latency = sum(latencies) / len(latencies)
                current_weight = self.graph[edge[0]][edge[1]]['weight']
                # If observed diverges > 50% from stored belief
                if abs(avg_latency - current_weight) / current_weight > 0.5:
                    self.graph[edge[0]][edge[1]]['weight'] = avg_latency
                    graph_updated = True

        # Step 3: If belief model updated, trigger online replanning
        if graph_updated:
            old_route = self.current_route
            success = self._replan()
            if success and old_route != self.current_route:
                adaptation_triggered = True
                replan_reason = (
                    f"Network belief model drifted. Re-routed from "
                    f"{'->'.join(old_route)} to {'->'.join(self.current_route)}"
                )
        return adaptation_triggered, replan_reason, self.current_route
```
This single function encapsulates the full intelligence loop. I think the 50% divergence threshold is a good starting point, though I'd argue in production you'd want this to be a configurable parameter rather than a hardcoded constant. The math is simple: (observed_avg - belief) / belief. If the observed average latency on edge (B, D) has climbed from a belief of 20ms to 70ms over 5 ticks, that's a 250% divergence — obviously way over threshold. The belief graph gets updated, Dijkstra runs again, and the route swaps.
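The divergence arithmetic from that paragraph, written out as a tiny helper (my own naming, not the project's):

```python
def divergence(observed_avg, belief):
    """Relative drift between the window average and the stored belief."""
    return abs(observed_avg - belief) / belief

# The B->D example from the text: belief 20ms, observed average 70ms.
assert divergence(70, 20) == 2.5   # 250% drift, far above the 0.5 threshold
assert divergence(22, 20) == 0.1   # ordinary jitter, no replan triggered
```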
What I love about this design is that the agent is not binary in its assessment. It doesn't say "Node D is dead." It observes "latency on the B→D link is consistently 250% above what I expect." That nuance is important — it means the agent can handle partial failures, degraded links, and gradual deterioration, not just complete outages.
Component 3: The Online Replanner
```python
    def _replan(self):
        """Online Replanning: Computes the optimal path based on current belief graph."""
        try:
            new_path = nx.shortest_path(
                self.graph,
                source=self.source,
                target=self.destination,
                weight='weight'
            )
            self.current_route = new_path
            return True
        except nx.NetworkXNoPath:
            self.current_route = []
            return False
```
I designed the replanning function to be deliberately simple. In my view, the sophistication of this system lives in when replanning is triggered (the partial reasoning layer), not in the replanning algorithm itself. Dijkstra's algorithm on a small graph is essentially instantaneous — sub-millisecond. That's intentional. If replanning were computationally expensive (as it would be in a hyper-scale network), you'd need an approximation algorithm or incremental updates. For this PoC, I wanted to show the architecture cleanly, so I kept the pathfinding trivial.
The NetworkXNoPath exception handling is important. In a sufficiently degraded network, there might literally be no viable path from A to F. Rather than crashing, the agent gracefully handles this and sets current_route = [], which the UI renders as "NO PATH." From my experience, graceful degradation is non-negotiable in infrastructure-adjacent systems.
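If you want to see the no-path behaviour without spinning up NetworkX, the same semantics can be reproduced with a small heapq-based Dijkstra. This is my own dependency-free sketch, not the project's code, and it uses a directed adjacency dict for brevity (the PoC uses an undirected nx.Graph):

```python
import heapq

def shortest_path_or_empty(adj, source, target):
    """Dijkstra over an adjacency dict; returns [] when the target is
    unreachable, mirroring the agent's NO PATH handling."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            # Walk the predecessor chain back to the source
            path = [target]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(node, float('inf')):
            continue  # stale heap entry
        for nbr, w in adj.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float('inf')):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return []  # graceful degradation: no viable route

adj = {'A': {'B': 10}, 'B': {'D': 20}, 'D': {'F': 15}}
assert shortest_path_or_empty(adj, 'A', 'F') == ['A', 'B', 'D', 'F']

# Sever the D->F link and the function degrades gracefully:
adj['D'] = {}
assert shortest_path_or_empty(adj, 'A', 'F') == []
```

The point is the return contract, not the algorithm: callers always get a list, and an empty list means "no route," which keeps the execution loop alive even in a fully partitioned network.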
Component 4: The Main Execution Loop
```python
def main():
    network = TelecomNetwork()
    agent = StreamingDecisionAgent(
        source='A', destination='F', initial_graph=network.graph
    )
    with Live(refresh_per_second=4) as live:
        for tick in range(1, 80):
            _, telemetry, anomaly_event = network.get_telemetry()
            adaptation_triggered, replan_reason, current_route = \
                agent.ingest_telemetry(tick, telemetry)
            actual_cost = network.get_actual_path_cost(current_route)
            # ... render Rich table and event log
```
I structured the main loop around a tick-based time model. Each tick simulates roughly one second of real network operation. The rich.live.Live context manager is what makes the terminal feel like a live dashboard — it handles clearing and re-rendering the display without flickering. That was something I spent more time on than expected; terminal rendering is trickier than it looks.
Let's Setup
To run this yourself, the setup is minimal:
```bash
git clone https://github.com/aniket-work/TelecomRoute-AI.git
cd TelecomRoute-AI
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
The requirements.txt includes:
- networkx — graph algorithms
- rich — terminal UI
- pillow — GIF generation
- requests — for Mermaid.ink diagram downloads
Step by step project walkthrough and details can be found at: https://github.com/aniket-work/TelecomRoute-AI
Let's Run
```bash
python main.py
```
Here's what you'll observe:
- Tick 1–14: The agent routes traffic via A → B → D → F (lowest nominal latency at 45ms). Everything is nominal.
- Tick 15: The trigger_anomaly() function fires. Node D's health drops to 0.2. Latency on edges connected to D skyrockets from ~20ms to ~200ms+.
- Tick 16–20: The agent's sliding window begins filling with high readings. By tick 20, the 5-tick average on B→D diverges 250% from the stored belief. The belief graph updates. Dijkstra runs. The agent switches to A → C → E → F.
- Tick 35: Node C's health begins degrading to 0.5. Latency on C edges climbs.
- Tick 36–40: Agent detects C-path degradation. Replans again to A → B → C → E → F, which avoids the worst of both degraded nodes.
- Tick 55: Node D recovers. Health returns to 1.0.
Here's the final simulation summary table printed at completion:
```
Post-Operation Analysis
──────────────────────────────────────────────────────────────────────
Component                  Status/Result
──────────────────────────────────────────────────────────────────────
Partial Reasoning          Success - Detected anomaly from streaming telemetry
Online Replanning          Success - Dynamically recomputed Dijkstra
Mid-Execution Adaptation   Success - Swapped route in < 15 ticks
Final Destination Reached  True (Node F)
──────────────────────────────────────────────────────────────────────
```
What This Demonstrates with the Agent Communication Protocol
As the sequence diagram shows, the agent's internal communication pattern follows a tight loop: observe → reason → replan → adapt → continue. The critical design insight, in my view, is that steps 2 through 4 are all internal to the agent. From the perspective of the network environment, the agent is just a black box that occasionally changes which path it's forwarding traffic on. This separation of concerns is what makes the approach scalable.
Closing Thoughts
Looking back at this experiment, the most important thing I took away is how much expressive power a belief-graph approach provides over a purely reactive one.
Pure reactive systems are brittle — they're great for the conditions they were programmed to expect, and fail silently when conditions diverge. A belief-model approach — even a simple one like this — gives the agent the ability to hold nuanced assessments of the world, update them as evidence accumulates, and act on updated beliefs rather than raw, noisy signals.
From my experience, the pattern of partial reasoning → online replanning → mid-execution adaptation generalises extremely well beyond telecom. Think of:
- Autonomous delivery routing: A vehicle that replans around unexpected road closures without stopping.
- Financial trading agents: A strategy that updates position sizing as intra-day volatility shifts.
- Hospital resource allocation: A system that reroutes patient flow as ER capacity changes hour-by-hour.
In all these cases, the environment stream is continuous, the agent's plan must remain dynamically valid, and the cost of pausing to plan from scratch is too high.
That's the essential argument for streaming decision agents: they're not just cool architecture — they're a practical necessity for systems operating in genuinely dynamic environments.
I'll be the first to admit: this simulation has its rough edges. The 50% drift threshold is a magic number. The network topology is small. Real telecom networks have thousands of nodes and sub-second failure detection requirements. But as a PoC meant to illuminate how to wire these ideas together in clean Python code, I'm proud of what this became.
If you run it yourself and have thoughts on improvements — better drift detection, adaptive thresholds, Bellman-Ford instead of Dijkstra — I'd genuinely love to hear them.
The full code is here: 👉 https://github.com/aniket-work/TelecomRoute-AI
Disclaimer
The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.