Mustafa Siddiqui

Building Production-Grade Network Telemetry: A gRPC Journey Into the Heart of Network Monitoring

Or: How I Learned to Stop Worrying About SNMP and Love Binary Protocols


The Thing About Network Monitoring That Nobody Tells You

Six months ago, if you'd asked me how network devices share their status with monitoring systems, I would have confidently told you "SNMP, obviously" and moved on with my life, blissfully unaware that I was about to fall down a rabbit hole so deep it would fundamentally change how I think about distributed systems, concurrency, and why every major networking company is desperately trying to move away from protocols designed when the internet was basically a science experiment.

What started as curiosity about "how does Cisco actually monitor thousands of devices in real-time?" turned into building a complete gRPC-based network telemetry system that simulates the architecture used in production environments at companies like Cisco, Juniper, and every major cloud provider. This isn't a toy project - it implements the same patterns that handle billions of metrics per day in real network operations centers.

The journey taught me that modern network monitoring is essentially a massive distributed systems problem disguised as "just reading some counters," and that the gap between legacy SNMP polling and modern streaming telemetry is roughly equivalent to the difference between sending telegrams and video calling. Both technically work, but one scales to handle the demands of networks that carry half the world's internet traffic, and the other... doesn't.

Why SNMP is Dead (And Why That Matters)

Before we dive into building the future, let's talk about why the old way is fundamentally broken. SNMP (Simple Network Management Protocol) was designed in the late 1980s when networks were small, simple, and mostly static. The basic model is polling: your monitoring system asks each device "hey, what are your interface counters?" every few minutes, the device responds with a verbose ASN.1-encoded answer, and everyone pretends this scales to modern networks.

Here's what happens when you try to monitor a modern data center with SNMP:

  • Polling Overhead: With thousands of devices and millions of metrics, the act of asking for data becomes the bottleneck
  • Temporal Resolution: You can only poll so frequently before overwhelming devices with requests
  • CPU Impact: Every SNMP request interrupts the device's primary job of forwarding packets
  • Bandwidth Waste: Verbose per-request encodings and constant request/response round trips consume precious management bandwidth
  • No Real-Time: By the time you detect a problem, it's already been happening for minutes

Compare this to modern telemetry where devices proactively stream binary-encoded metrics at sub-second intervals over persistent gRPC connections. The difference is like comparing a telegraph system to a high-speed fiber optic network - technically both move information, but only one can handle the volume and speed required for modern operations.

Enter gRPC: When Google Solves Your Problems Before You Know You Have Them

gRPC represents everything SNMP isn't: binary, efficient, streaming-capable, and designed for the kind of scale that makes network engineers weep with joy. When Cisco and Juniper started implementing gNMI (gRPC Network Management Interface), they weren't just updating their protocols - they were fundamentally rethinking how network devices communicate with management systems.

The core insight is deceptively simple: instead of constantly asking devices for data, let devices stream data when it changes. Instead of parsing verbose text responses, use binary protocols that machines can process efficiently. Instead of stateless request-response patterns, maintain persistent connections that can handle bidirectional communication.

This shift enables monitoring architectures that seemed impossible with SNMP: real-time anomaly detection, sub-second alerting, and telemetry volumes measured in gigabytes per hour rather than kilobytes per minute.

Building the Future: A Production-Grade Telemetry System

The system I built implements the core patterns used in production network monitoring, but in a way that reveals the engineering decisions usually hidden behind vendor APIs and enterprise software licenses. Every major component represents a real-world challenge that network monitoring systems must solve.

The Protocol Foundation: Where Binary Meets Reality

First challenge: designing a protocol that can efficiently represent the hierarchical nature of network data while remaining extensible for future requirements. Protocol Buffers provide the foundation, but the schema design determines everything about performance and usability.

```protobuf
message InterfaceCounters {
    string interface_name = 1;
    int64 bytes_rx = 2;
    int64 bytes_tx = 3;
    int64 packets_rx = 4;
    int64 packets_tx = 5;
    int64 timestamp = 6;  // Unix milliseconds; needs 64 bits - int32 overflows
}

message SubscribeRequest {
    string interface_name = 1;  // "device:interface" format
    int32 interval_ms = 2;
    SubscriptionMode mode = 3;
}

message SubscribeResponse {
    oneof response {
        InterfaceCounters counters = 1;
        Error error = 2;
    }
    int64 response_timestamp = 3;
}
```

The oneof union in the response message is crucial - it allows the same streaming interface to handle both successful data delivery and error conditions without breaking client parsing. The timestamp fields enable precise temporal correlation across multiple data streams, essential for detecting network-wide events that manifest as correlated changes across multiple devices.
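On the client side, the oneof surfaces as a Go type switch. The sketch below uses hand-written stand-ins for the protoc-generated types (the wrapper-struct names mirror how protoc compiles the schema above; real code would import the generated package instead):

```go
package main

import "fmt"

// Hand-written stand-ins for the protoc-generated types (fields trimmed).
type InterfaceCounters struct {
	InterfaceName    string
	BytesRx, BytesTx int64
}

type Error struct {
	Code    int32
	Message string
}

// A protobuf oneof compiles to an interface plus one wrapper struct per variant.
type isSubscribeResponse_Response interface{ isSubscribeResponse_Response() }

type SubscribeResponse_Counters struct{ Counters *InterfaceCounters }
type SubscribeResponse_Error struct{ Error *Error }

func (*SubscribeResponse_Counters) isSubscribeResponse_Response() {}
func (*SubscribeResponse_Error) isSubscribeResponse_Response()    {}

type SubscribeResponse struct {
	Response          isSubscribeResponse_Response
	ResponseTimestamp int64
}

// handleResponse shows the client-side pattern: a type switch on the oneof
// cleanly separates data delivery from error delivery on the same stream.
func handleResponse(resp *SubscribeResponse) string {
	switch payload := resp.Response.(type) {
	case *SubscribeResponse_Counters:
		return fmt.Sprintf("counters %s rx=%d",
			payload.Counters.InterfaceName, payload.Counters.BytesRx)
	case *SubscribeResponse_Error:
		return fmt.Sprintf("error: %s", payload.Error.Message)
	default:
		return "empty response"
	}
}

func main() {
	ok := &SubscribeResponse{Response: &SubscribeResponse_Counters{
		Counters: &InterfaceCounters{InterfaceName: "eth0", BytesRx: 2164}}}
	bad := &SubscribeResponse{Response: &SubscribeResponse_Error{
		Error: &Error{Message: "invalid interface"}}}
	fmt.Println(handleResponse(ok))
	fmt.Println(handleResponse(bad))
}
```

Because the compiler forces you through the interface, a client can never accidentally read counters out of a message that actually carried an error.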

This isn't just academic - in production networks, being able to correlate a traffic spike on Device A with increased error rates on Device B often reveals the root cause of complex network issues that would be invisible with traditional polling-based monitoring.

Device Simulation: Making Fake Traffic Look Real

The device simulator represents one of the most interesting engineering challenges: how do you generate realistic network traffic patterns that can stress-test your monitoring system without actual network hardware?

```go
func (d *Device) UpdateCounters() error {
    d.mutex.Lock()
    defer d.mutex.Unlock()

    for _, iface := range d.interfaces {
        // Realistic traffic simulation
        bytesRx := rand.Intn(351) + 50    // 50-400 bytes per update
        bytesTx := rand.Intn(351) + 50
        packetsRx := rand.Intn(21) + 5    // 5-25 packets per update
        packetsTx := rand.Intn(21) + 5

        iface.BytesRx += int64(bytesRx)
        iface.BytesTx += int64(bytesTx)
        iface.PacketsRx += int64(packetsRx)
        iface.PacketsTx += int64(packetsTx)

        // UnixMilli must stay int64 - casting to int32 silently overflows
        iface.Timestamp = time.Now().UnixMilli()
    }
    return nil
}
```

The traffic generation parameters aren't arbitrary - they're based on realistic packet sizes and transmission rates that create believable growth patterns. Real network interfaces exhibit bursty behavior with periods of high activity followed by quieter intervals, and the simulation captures this through controlled randomness.

More importantly, the simulation runs in background goroutines that update counters every 100ms, creating the continuous data flow that streaming protocols are designed to handle. This reveals performance characteristics that would be invisible with static test data.

Concurrency Architecture: Where Things Get Interesting

The most educational aspect of building this system was discovering how complex concurrent access patterns become when you're managing multiple devices, each with multiple interfaces, all being updated by background processes while serving real-time streaming requests to multiple clients.

```go
type server struct {
    proto.UnimplementedNetworkTelemetryServer
    devices map[string]*Device
    mutex   sync.RWMutex  // High read/low write optimization
}

type Device struct {
    name       string
    interfaces map[string]*proto.InterfaceCounters
    mutex      sync.RWMutex  // Per-device locking for better concurrency
}
```

The dual-level locking strategy is essential for performance. The server-level RWMutex protects the device registry (devices are rarely added/removed), while per-device RWMutex instances protect interface counters (read frequently, updated every 100ms). This allows multiple clients to stream from different devices concurrently without lock contention.
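The access discipline behind that claim can be sketched with stripped-down stand-ins for the two types (fields reduced for brevity): hold the server lock only for the map lookup, release it, and only then touch a per-device lock, so the two levels are never held at once.

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the article's server and Device types.
type Device struct {
	mutex   sync.RWMutex
	bytesRx map[string]int64 // interface name -> counter
}

type server struct {
	mutex   sync.RWMutex
	devices map[string]*Device
}

// getDevice holds the server-level lock only for the registry lookup and
// releases it before the caller touches any per-device lock, so clients
// streaming from different devices never contend on the same mutex.
func (s *server) getDevice(name string) (*Device, bool) {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	d, ok := s.devices[name]
	return d, ok
}

// read takes only the per-device lock; the registry lock is already released.
func (d *Device) read(iface string) int64 {
	d.mutex.RLock()
	defer d.mutex.RUnlock()
	return d.bytesRx[iface]
}

func main() {
	s := &server{devices: map[string]*Device{
		"router1": {bytesRx: map[string]int64{"eth0": 2164}},
	}}
	if d, ok := s.getDevice("router1"); ok {
		fmt.Println(d.read("eth0"))
	}
}
```

Never nesting the two locks also rules out the classic deadlock where one goroutine holds the registry lock waiting for a device lock while another does the reverse.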

The memory safety pattern in counter reads deserves special attention:

```go
func (d *Device) GetCounters(interfaceName string) (*proto.InterfaceCounters, error) {
    d.mutex.RLock()
    defer d.mutex.RUnlock()

    if iface, ok := d.interfaces[interfaceName]; ok {
        // Return a copy - crucial for concurrent safety!
        return &proto.InterfaceCounters{
            InterfaceName: iface.InterfaceName,
            BytesRx:       iface.BytesRx,
            BytesTx:       iface.BytesTx,
            PacketsRx:     iface.PacketsRx,
            PacketsTx:     iface.PacketsTx,
            Timestamp:     iface.Timestamp,
        }, nil
    }
    return nil, errors.New("invalid interface name")
}
```

Returning a copy rather than a pointer to the original data prevents race conditions where the background update goroutine modifies counter values while a client is reading them. This pattern is fundamental to building concurrent systems that remain correct under load.

gRPC Streaming: The Magic of Persistent Connections

The Subscribe method implementation reveals the complexity hidden behind gRPC's simple streaming API:

```go
func (s *server) Subscribe(req *proto.SubscribeRequest, stream proto.NetworkTelemetry_SubscribeServer) error {
    // Parse device:interface format
    nameSlice := strings.Split(req.InterfaceName, ":")
    if len(nameSlice) != 2 {
        return stream.Send(&proto.SubscribeResponse{
            Response: &proto.SubscribeResponse_Error{
                Error: &proto.Error{
                    Code:    proto.ErrorCode_DOES_NOT_EXIST,
                    Message: "invalid interface name format; expected DEVICE:INTERFACE",
                },
            },
            ResponseTimestamp: time.Now().UnixMilli(),
        })
    }

    deviceName, ifaceName := nameSlice[0], nameSlice[1]

    // Resolve the device under the server-level read lock, then release it
    s.mutex.RLock()
    device, ok := s.devices[deviceName]
    s.mutex.RUnlock()
    if !ok {
        return stream.Send(&proto.SubscribeResponse{...})  // unknown-device error
    }

    interval := time.Duration(req.IntervalMs) * time.Millisecond

    switch req.Mode {
    // Stream mode: continuous updates until client disconnects
    case proto.SubscriptionMode_STREAM:
        for {
            select {
            case <-stream.Context().Done():
                return nil  // Client disconnected gracefully
            default:
                ifaceCounters, err := device.GetCounters(ifaceName)
                if err != nil {
                    // Send error and terminate stream
                    stream.Send(&proto.SubscribeResponse{...})
                    return nil
                }

                if err := stream.Send(&proto.SubscribeResponse{
                    Response: &proto.SubscribeResponse_Counters{
                        Counters: ifaceCounters,
                    },
                    ResponseTimestamp: time.Now().UnixMilli(),
                }); err != nil {
                    return err  // Network error, terminate stream
                }

                time.Sleep(interval)
            }
        }
    }
    return nil
}
```

The context cancellation handling is crucial for production systems. When clients disconnect (intentionally or due to network issues), the server must detect this and clean up resources. The stream.Context().Done() channel provides this mechanism, allowing graceful termination of streaming operations.

The error handling strategy demonstrates production-ready practices: structured error responses with specific error codes allow clients to distinguish between temporary issues (retry appropriate) and permanent failures (don't retry).
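A client-side sketch of that policy. Only `DOES_NOT_EXIST` appears in the schema shown above; `DEVICE_BUSY` and `INTERNAL` are hypothetical codes added here purely to illustrate the transient/permanent split:

```go
package main

import "fmt"

// Stand-in for the schema's ErrorCode enum. DOES_NOT_EXIST comes from the
// article's schema; the other two codes are hypothetical for illustration.
type ErrorCode int32

const (
	ErrorCode_DOES_NOT_EXIST ErrorCode = iota
	ErrorCode_DEVICE_BUSY
	ErrorCode_INTERNAL
)

// shouldRetry encodes the client policy that structured codes make possible:
// transient conditions are worth resubscribing with backoff, while permanent
// failures (a misspelled interface name, say) should fail fast.
func shouldRetry(code ErrorCode) bool {
	switch code {
	case ErrorCode_DEVICE_BUSY, ErrorCode_INTERNAL:
		return true // transient: back off and resubscribe
	case ErrorCode_DOES_NOT_EXIST:
		return false // permanent: the target will not appear on retry
	default:
		return false // unknown codes: be conservative
	}
}

func main() {
	fmt.Println(shouldRetry(ErrorCode_DOES_NOT_EXIST))
	fmt.Println(shouldRetry(ErrorCode_DEVICE_BUSY))
}
```

Without distinct codes, clients are reduced to string-matching error messages, which breaks the moment a message is reworded.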

Testing Real-World Patterns with grpcurl

One of the most satisfying moments in building this system was testing it with grpcurl and watching real streaming telemetry data flow:

```shell
# Subscribe to router1's eth0 interface in streaming mode
grpcurl -plaintext -d '{
  "interface_name": "router1:eth0",
  "interval_ms": 1000,
  "mode": "STREAM"
}' localhost:50051 NetworkTelemetry/Subscribe
```

The output is a stream of real-time counter updates (note that the protobuf JSON mapping renders int64 fields as strings):

```json
{
  "counters": {
    "interfaceName": "eth0",
    "bytesRx": "2164",
    "bytesTx": "2176",
    "packetsRx": "145",
    "packetsTx": "152",
    "timestamp": "1735123456789"
  },
  "responseTimestamp": "1735123456790"
}
```

Watching those counters increment in real-time while knowing the data is flowing through the same architectural patterns used to monitor production networks worth billions of dollars... there's something deeply satisfying about building systems that work the way the real world works.

The Performance Revolution: Why This Matters

Building this system revealed why major networking vendors invested heavily in moving from SNMP to streaming telemetry. The performance differences aren't incremental - they're transformational.

SNMP Polling Characteristics:

  • Request overhead: ~200 bytes per metric
  • Response parsing: string manipulation and conversion
  • Temporal resolution: limited by polling frequency vs. device CPU impact
  • Scalability: O(devices × metrics × polling_frequency)

gRPC Streaming Characteristics:

  • Binary encoding: ~20-30 bytes per metric
  • Zero parsing overhead: direct protobuf deserialization
  • Real-time: sub-second updates with minimal device impact
  • Scalability: O(active_streams) - connection overhead grows with subscriber count, not with per-metric request round trips
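A back-of-envelope calculation makes the trade-off concrete. Using illustrative figures (1,000 devices × 100 metrics, ~200 bytes per SNMP poll vs. ~25 bytes per streamed update), the per-sample cost drops roughly 8×, which is what makes sixty-fold finer temporal resolution affordable:

```go
package main

import "fmt"

func main() {
	const (
		devices        = 1000  // monitored devices
		metrics        = 100   // metrics per device
		snmpBytes      = 200.0 // bytes per metric per SNMP poll (request + response)
		streamBytes    = 25.0  // bytes per metric per protobuf update
		pollInterval   = 60.0  // seconds between SNMP polls
		streamInterval = 1.0   // seconds between streamed updates
	)

	snmpRate := devices * metrics * snmpBytes / pollInterval       // bytes/sec
	streamRate := devices * metrics * streamBytes / streamInterval // bytes/sec

	fmt.Printf("SNMP:      %.0f KB/s at %vs resolution\n", snmpRate/1000, pollInterval)
	fmt.Printf("Streaming: %.0f KB/s at %vs resolution\n", streamRate/1000, streamInterval)
}
```

Streaming spends more total bandwidth here, but buys 60× the resolution for well under 60× the cost - and it spends that bandwidth without the per-request CPU interrupts on the device.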

The cumulative effect enables monitoring architectures that seemed impossible with traditional approaches: real-time correlation across thousands of devices, anomaly detection with sub-second latency, and telemetry volumes that scale with network capacity rather than being limited by monitoring overhead.

What This Taught Me About Distributed Systems

Six months ago, I thought distributed systems were mostly about databases and web services. Building a network telemetry system revealed that networks themselves are distributed systems, and monitoring them requires all the same patterns: consistent data models, efficient serialization, graceful error handling, and careful attention to concurrency.

The most important insight was understanding how streaming protocols change the fundamental architecture of monitoring systems. With polling, your monitoring system is reactive - it discovers problems after they've happened. With streaming, monitoring becomes proactive - devices push data as conditions change, enabling real-time response to network events.

This architectural shift enables capabilities that transform network operations:

  • Predictive alerting: Detect trending conditions before they become problems
  • Real-time correlation: Connect events across multiple devices to identify root causes
  • Capacity planning: Understand utilization patterns with sub-second granularity
  • Automated response: React to network changes faster than human operators

The Production Reality: What Comes Next

This project implements the foundational patterns, but production network monitoring systems add layers of complexity that would each deserve their own deep-dive articles:

Data Storage & Time Series: Handling millions of metrics requires specialized time-series databases with automated retention policies and data aggregation strategies.

Multi-Tenancy & Security: Enterprise networks require authentication, authorization, and data isolation between different operational teams.

Scalability & Federation: Large networks require distributed monitoring architectures with hierarchical data aggregation and cross-site correlation.

Visualization & Alerting: Real-time dashboards and intelligent alerting systems that can identify patterns in high-volume telemetry streams.

Each of these represents the same kind of deep engineering challenge that made building the core telemetry system so educational. The difference between a working prototype and a production system isn't just polish - it's solving an entirely new category of problems that only become visible when you try to operate at scale.

Why Building This Matters (Beyond Just Learning Cool Tech)

Understanding how modern network monitoring works changes how you think about distributed systems performance and reliability. When you know that every major cloud provider depends on sub-second telemetry to detect and respond to network issues, you start to appreciate why streaming protocols and efficient serialization aren't just academic concerns - they're the foundation of internet infrastructure that billions of people depend on daily.

More practically, this knowledge translates directly to building better distributed systems of any kind. The patterns for handling high-volume streaming data, managing concurrent access to shared resources, and designing protocols that gracefully handle failure conditions apply whether you're monitoring network devices, processing financial transactions, or building real-time collaboration tools.

The Humbling Reality of Production Systems

Perhaps the most important lesson was realizing how much engineering effort goes into making complex systems look simple. When network operators view a dashboard showing real-time metrics from thousands of devices, they're seeing the result of countless engineering decisions about data formats, connection management, error handling, and resource allocation.

The clean API that allows subscribing to telemetry streams with a single gRPC call represents hundreds of lines of careful implementation work, just like my earlier journey with HTTP parsing revealed the complexity hidden behind web frameworks.

This is the paradox of good systems engineering: the better you do your job, the more invisible your work becomes. The sign of a successful network monitoring system is that operators can focus on network problems rather than monitoring problems, enabled by infrastructure they never have to think about.

Looking Forward: The Real-Time Everything Era

Building this system convinced me that streaming architectures represent the future of most monitoring and observability systems. The patterns that networking vendors pioneered for device monitoring are being adopted across the industry: application metrics, infrastructure monitoring, and business intelligence systems are all moving toward real-time streaming data models.

Understanding these patterns now means being prepared for a future where real-time data processing is the default rather than a special case. Whether you're building IoT systems, financial trading platforms, or social media applications, the ability to handle high-volume streaming data with low latency and high reliability is becoming a core requirement rather than a nice-to-have feature.

The Complete Code: Standing on the Shoulders of Giants

The complete implementation lives at mush1e/netmon-stack, and I encourage anyone interested in systems programming to explore the code. Every function represents a real-world challenge that production systems must solve, and the solutions reveal patterns that apply far beyond network monitoring.

More importantly, the project demonstrates that you can understand and implement the same technologies used by major tech companies to solve billion-dollar infrastructure problems. The gap between learning and building production-grade systems isn't as large as it seems - it just requires the patience to work through the details and the curiosity to understand why those details matter.

Still looking for opportunities to apply this obsession with understanding how systems really work to solving problems that matter. There's something deeply satisfying about building systems from first principles, even when (especially when) it reveals just how much engineering effort goes into making the complex appear simple.


Next up: Building a time-series database to store this telemetry data, because apparently I haven't suffered enough database design decisions yet. Plus real-time dashboards, because watching counters increment in a terminal is only satisfying for so long.


Built with ❤️ and probably too much ☕ while learning that the modern internet runs on protocols most people have never heard of
