At 2:13 a.m., the helpdesk got the first complaint: “The ERP app is slow again.” By 2:30 a.m., it had become a small flood. Users across two offices could log in, but every action in the application felt like someone had put the network in molasses. The monitoring dashboard was green. CPU on the core switch was fine. Interface errors were low. WAN utilization looked normal. The night shift did what most teams do under pressure: they checked the usual charts, restarted a few services, and waited for the problem to either disappear or become more obvious.
It did neither.
What finally solved the case was not another SNMP graph, not another ping sweep, and not another “have you tried rebooting it?” moment. It was packet capture and network traffic analysis. The network was technically up, but DNS retries, TCP retransmissions, and a subtle MSS/MTU mismatch on one path were stretching simple application requests into multi-second stalls. From the outside, the network looked healthy. At the packet level, it was a disaster.
That gap between “devices are up” and “users can actually work” is where most network troubleshooting goes wrong.
The problem with relying on SNMP alone
SNMP has survived for a reason. It is lightweight, ubiquitous, and useful for watching device health, link status, bandwidth, temperature, and interface errors. If a switch starts belching smoke or a link goes down, SNMP will usually tell you.
But SNMP is an instrument panel, not a flight recorder.
It tells you that a port is busy. It does not tell you that half of those packets are being retransmitted.
It tells you that a WAN circuit is at 70% utilization. It does not tell you that DNS responses are delayed by 800 ms.
It tells you the firewall is forwarding traffic. It does not tell you that a TLS handshake is failing because of a broken intermediate certificate.
It tells you the interface is clean. It does not tell you that a microburst is causing jitter for VoIP calls or that one misconfigured host is generating an ARP storm.
For NetOps, SysAdmin, and DevOps teams, this is the central limitation of metric-based monitoring: the signal is too far from the user experience. A green dashboard can coexist with a broken application, and that is why “everything looks normal” is such a dangerous sentence in production.
What packet capture gives you that dashboards cannot
Packet capture records the actual conversations moving across the network. That matters because most hard problems are not caused by a single obvious outage. They are caused by interactions: latency between two services, delayed DNS resolution, uneven retransmissions, path asymmetry, middleboxes altering behavior, and application protocols that fail only under specific timing conditions.
With packet capture, you can answer questions that metrics cannot touch:
- Did the client actually receive a response, or did it retry the request three times?
- Was the delay in the application, the network, or the server?
- Is the issue occurring before login, during authentication, or only after session establishment?
- Are packets being dropped, reordered, fragmented, or retransmitted?
- Is the problem limited to one VLAN, one site, one subnet, or one protocol?
That is why network traffic analysis is so powerful. It moves troubleshooting from guesswork to evidence.
A real workflow for network troubleshooting
The best way to think about troubleshooting is to start broad and then move down the stack.
1. Start with symptoms, not theories
Do not begin with “maybe the firewall is bad.” Begin with the user complaint and the time window.
Ask:
- Who is affected?
- What application is slow or broken?
- Is the problem constant or intermittent?
- Did it start after a change?
- Is it site-specific, user-specific, or global?
This sounds basic, but it prevents one of the most expensive habits in operations: chasing the wrong layer first.
2. Correlate metrics with packet-level evidence
SNMP, syslog, NetFlow, application logs, and synthetic monitoring all help narrow the field. But once you have a candidate time window, move to packet capture. Look for:
- TCP retransmissions
- Duplicate ACKs
- Zero-window events
- SYN retries
- DNS timeouts
- TLS handshake delays
- ICMP fragmentation-needed messages
- Unexpected reset packets
A good capture does not just tell you “something is wrong.” It shows you exactly where the conversation derails.
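As a sketch of what "isolating where the conversation derails" can look like, the snippet below flags three of the signals from the list above — retransmissions, duplicate ACKs, and zero-window events — in a simplified packet list. The record fields (`src`, `seq`, `payload`, and so on) are illustrative stand-ins for what you would export from a real capture tool, not a real capture format.

```python
def flag_anomalies(packets):
    """Return (index, reason) pairs for suspicious packets in a flow list."""
    findings = []
    seen_seq = set()   # (flow, seq) pairs already carried data: repeats = resends
    last_ack = {}      # flow -> (ack value, repeat count) for dup-ACK detection
    for i, p in enumerate(packets):
        flow = (p["src"], p["dst"])
        if p.get("payload", 0) > 0:
            key = (flow, p["seq"])
            if key in seen_seq:
                findings.append((i, "retransmission"))
            seen_seq.add(key)
        if p.get("ack") is not None and p.get("payload", 0) == 0:
            prev = last_ack.get(flow)
            if prev and prev[0] == p["ack"]:
                count = prev[1] + 1
                last_ack[flow] = (p["ack"], count)
                if count >= 2:   # third identical ACK: classic dup-ACK signal
                    findings.append((i, "duplicate ACK"))
            else:
                last_ack[flow] = (p["ack"], 0)
        if p.get("window") == 0:
            findings.append((i, "zero window"))
    return findings

pkts = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "seq": 1000, "payload": 512},
    {"src": "10.0.0.9", "dst": "10.0.0.5", "ack": 1512, "payload": 0, "window": 0},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "seq": 1000, "payload": 512},  # resent
]
print(flag_anomalies(pkts))  # [(1, 'zero window'), (2, 'retransmission')]
```

In practice a dissector does this for you; the point is that each anomaly in the list above is a concrete, checkable pattern in the packet stream, not a vague impression.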
3. Follow the path from client to server
Many teams look only at one side of the conversation. That is a mistake.
If a user reports slowness, inspect both directions:
- Client → DNS resolver
- Client → load balancer
- Load balancer → application server
- Server → database
- Return traffic through firewalls, proxies, or WAN links
A path that is clean in one direction can still fail in the other. Asymmetric routing, stateful inspection quirks, and off-path appliances can make the return path behave very differently from the forward path.
4. Compare good sessions with bad sessions
One of the fastest ways to isolate a problem is to compare a healthy request to an unhealthy one.
Look at:
- packet timing
- handshake duration
- segment sizes
- window scaling
- retransmission patterns
- protocol negotiation
The difference between a 40 ms session and a 4-second session is often visible within the first few packets.
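The comparison can be that mechanical. Given per-packet timestamps from a healthy and an unhealthy session, a few lines expose where the time went; the timestamp values below are illustrative, and the "first three packets" heuristic assumes a plain TCP handshake opens the trace.

```python
def session_stats(timestamps):
    """Handshake duration (first three packets) and the largest inter-packet gap."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "handshake_s": round(timestamps[2] - timestamps[0], 3),
        "worst_gap_s": round(max(gaps), 3),
        "total_s": round(timestamps[-1] - timestamps[0], 3),
    }

good = [0.000, 0.012, 0.013, 0.030, 0.041]   # SYN, SYN-ACK, ACK, then data
bad  = [0.000, 0.011, 0.012, 1.212, 4.215]   # same handshake, then it stalls

print(session_stats(good))
print(session_stats(bad))
```

Here both handshakes complete in about 12 ms, so the network path is not the first suspect; the unhealthy session loses its time in a three-second gap after session establishment, which points at the application or server side.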
Common failures that only packet capture reveals
Here are the kinds of issues that consistently escape traditional monitoring.
DNS latency disguised as “application slowness”
Users say the app is slow. The app team blames the database. The database team blames the network. The capture shows each request waiting on a DNS resolver that is retrying or querying a distant server. The actual application payload is fine; the lookup path is not.
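A quick way to make this argument to the app and database teams is a phase breakdown of a single request. The sketch below uses illustrative timings: the payload phases are identical in both cases, and only the resolver time changes.

```python
def request_breakdown(dns_ms, tcp_connect_ms, tls_ms, server_ms):
    """Share of total request time spent in each phase, as percentages."""
    total = dns_ms + tcp_connect_ms + tls_ms + server_ms
    return {phase: f"{ms / total:.0%}"
            for phase, ms in [("dns", dns_ms), ("connect", tcp_connect_ms),
                              ("tls", tls_ms), ("server", server_ms)]}

# Healthy lookup vs. a resolver retrying against a distant server:
print(request_breakdown(dns_ms=8,   tcp_connect_ms=12, tls_ms=25, server_ms=40))
print(request_breakdown(dns_ms=800, tcp_connect_ms=12, tls_ms=25, server_ms=40))
```

With a healthy resolver, DNS is under a tenth of the request; with an 800 ms lookup it dominates, and no amount of database tuning will help.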
TCP retransmission storms
A link that is only “a little lossy” can create serious user pain. TCP will recover, but recovery costs time. Enough retransmissions turn a responsive service into a sluggish one. SNMP may show low error counts while the capture shows the real story: lost segments, duplicate ACKs, and growing latency.
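The cost of that recovery is easy to underestimate. TCP's retransmission timer (RFC 6298) starts around one second and doubles on each consecutive loss of the same segment, so a handful of timeout-driven retransmissions adds whole seconds; the sketch below computes that worst-case backoff cost (fast retransmit, when triggered, is much cheaper).

```python
def retransmit_delay_s(losses, initial_rto=1.0):
    """Worst-case extra delay when the same segment is lost `losses` times,
    with the RTO doubling on each consecutive timeout (RFC 6298 backoff)."""
    return sum(initial_rto * (2 ** i) for i in range(losses))

for losses in range(4):
    print(f"{losses} consecutive losses -> +{retransmit_delay_s(losses):.0f} s")
```

Three consecutive timeouts on one segment cost seven extra seconds, yet they register as a handful of packets on a utilization graph.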
MTU and fragmentation issues
Path MTU problems are classic. Everything appears okay until a large packet hits a device that refuses to forward it correctly. Small requests work. Large ones stall. VPNs, tunnels, and overlay networks are frequent offenders.
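The arithmetic behind "small requests work, large ones stall" is simple. The maximum TCP segment size is the path MTU minus the IP and TCP headers, and any tunnel shrinks the usable MTU; the 50-byte overhead below is an illustrative figure, since real encapsulation overhead depends on the tunnel type and options.

```python
IP_HDR, TCP_HDR = 20, 20    # IPv4 and TCP headers, no options

def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet on a path with this MTU."""
    return mtu - IP_HDR - TCP_HDR

print(tcp_mss(1500))        # plain Ethernet path
print(tcp_mss(1500 - 50))   # same path through ~50 bytes of tunnel overhead
```

If the endpoints negotiate an MSS of 1460 but a tunnel in the middle only carries 1450-byte packets, every full-size segment either fragments or, when ICMP fragmentation-needed messages are filtered, silently disappears — which is exactly the small-works, large-stalls pattern.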
TLS handshake failures
If an app works in HTTP but fails in HTTPS, the root cause may be certificate chain problems, handshake timeouts, or middlebox interference. Packet capture can show whether the handshake completes, where it stalls, and which side tears it down.
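"Which side tears it down" is visible in the raw bytes: every TLS record starts with a content-type byte and a length, so even without decrypting anything you can see whether a handshake record was answered with more handshake traffic or with an alert. The record-type values below come from the TLS specifications (RFC 8446, section 5.1); the sample headers are illustrative.

```python
TLS_RECORD_TYPES = {20: "change_cipher_spec", 21: "alert",
                    22: "handshake", 23: "application_data"}

def classify_record(raw: bytes):
    """Decode a 5-byte TLS record header: (type, legacy version, length)."""
    content_type, major, minor = raw[0], raw[1], raw[2]
    length = int.from_bytes(raw[3:5], "big")
    return TLS_RECORD_TYPES.get(content_type, "unknown"), (major, minor), length

# A ClientHello: record type 22 (handshake), 512-byte body
print(classify_record(bytes([22, 3, 3, 2, 0])))
# An alert record (e.g. the server rejecting an unknown CA), 2-byte body
print(classify_record(bytes([21, 3, 3, 0, 2])))
```

A trace where a ClientHello is followed by an alert record instead of a ServerHello pins the failure to negotiation itself — often the broken certificate chain or middlebox interference mentioned above.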
Load balancer or proxy behavior
Sometimes the network is not the problem; the in-line device is. Proxies can inject latency, terminate sessions, rewrite headers, or alter timeouts. A packet trace makes those behaviors visible.
Microbursts and short-lived congestion
Polling every 5 minutes will miss congestion that lasts 500 milliseconds. Packet capture sees the spike, the queueing, and the retransmissions that follow. This is one of the biggest reasons why network traffic analysis is more precise than coarse monitoring.
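The averaging math shows why the spike vanishes. Take an illustrative 1 Gbit/s link running at a light 5% background load, with one 500 ms burst at full line rate inside a 5-minute polling window:

```python
link_bps = 1_000_000_000
burst_s, window_s = 0.5, 300
background_util = 0.05   # 5% steady load outside the burst

burst_bits = link_bps * burst_s                             # burst at line rate
background_bits = link_bps * background_util * (window_s - burst_s)
avg_util = (burst_bits + background_bits) / (link_bps * window_s)

print(f"5-minute average: {avg_util:.1%}")   # what the poller reports
print("during the burst: 100.0%")            # where queueing and drops happen
```

The poller reports roughly 5% utilization — indistinguishable from the baseline — while for half a second the link was saturated, queues filled, and VoIP packets jittered or dropped.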
Packet capture in modern environments
There is a myth that packet capture belongs only in a lab or on a laptop with Wireshark open. In reality, modern production environments need more disciplined visibility.
Data center networks
High-throughput servers, east-west traffic, container platforms, and service meshes create many tiny dependencies. A single slow dependency can ripple outward. Full packet capture helps you see where the chain starts to wobble.
Cloud and hybrid environments
Cloud networking introduces new layers: security groups, virtual appliances, NAT, overlays, and managed load balancers. A metric may tell you a service is reachable, but only packet capture shows whether the traffic is being modified, delayed, or dropped along the way.
Remote offices and WAN links
When a branch office complains about an app, you need to know whether the issue is the WAN, DNS, a VPN tunnel, or the endpoint itself. Packet capture can separate local LAN issues from upstream path issues very quickly.
End-user troubleshooting
For desktop support and SysAdmin teams, packet capture is often the fastest route to root cause when a single user is impacted. One trace can prove whether the problem is client-side, network-side, or server-side. That is a lot better than asking people to clear cache and pray.
How to make packet capture operational, not heroic
A lot of teams treat packet capture like a break-glass skill. They wait until the outage is painful, then scramble to capture traffic manually. That is too late and too random.
To make it operational:
- Capture continuously or on-demand at key choke points.
- Keep timestamps synchronized with NTP or PTP.
- Store enough history to cover the time between symptom and investigation.
- Index captures so you can search by host, protocol, session, or time.
- Pair captures with metadata from logs and metrics.
- Give operations teams a way to inspect traffic without needing a specialist every time.
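The indexing point above is the one that turns stored captures into answers. Even a minimal index — here an in-memory list of flow records, with field names that are illustrative stand-ins for what a capture pipeline would extract — makes "what did this host do in that window?" a one-line query:

```python
from datetime import datetime

flows = [
    {"ts": datetime(2024, 5, 1, 2, 13), "host": "10.1.1.20", "proto": "dns"},
    {"ts": datetime(2024, 5, 1, 2, 14), "host": "10.1.1.20", "proto": "tcp"},
    {"ts": datetime(2024, 5, 1, 2, 45), "host": "10.2.2.30", "proto": "tls"},
]

def search(flows, host=None, proto=None, start=None, end=None):
    """Filter flow records by host, protocol, and/or time window."""
    return [f for f in flows
            if (host is None or f["host"] == host)
            and (proto is None or f["proto"] == proto)
            and (start is None or f["ts"] >= start)
            and (end is None or f["ts"] <= end)]

# "What did 10.1.1.20 do around the 2 a.m. complaint?"
hits = search(flows, host="10.1.1.20", end=datetime(2024, 5, 1, 2, 30))
print(len(hits))   # 2
```

Real capture platforms index on disk at far larger scale, but the operational idea is the same: the search happens over metadata first, and only the matching traffic needs expert-level inspection.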
This is where many organizations hit a practical wall. Traditional tools are great for one-off analysis, but they are not built for repeated production troubleshooting across distributed teams.
A simple framework for faster root cause analysis
When a ticket arrives, use this sequence:
- Identify the affected users, app, and time window.
- Check high-level monitoring for scope and recent changes.
- Pull packet capture from the relevant path.
- Compare healthy and unhealthy sessions.
- Isolate the first abnormal packet or protocol exchange.
- Confirm the cause with logs or config changes.
- Document the pattern so the next incident is faster.
That last step matters. The goal is not just to fix one issue. The goal is to build a repeatable network troubleshooting practice.
Why this matters for NetOps, SysAdmin, and DevOps teams
NetOps teams need faster diagnosis across many links and sites.
SysAdmin teams need proof when the problem is not actually “the server.”
DevOps teams need visibility into the network layer when application performance suddenly falls apart.
All three groups benefit from the same thing: packet-level truth.
As infrastructure becomes more distributed, the old habit of relying on a few coarse metrics becomes less useful. More services, more encryption, more overlays, more proxies, and more remote users mean more places for timing, path, and protocol issues to hide. Network traffic analysis is no longer a specialist luxury. It is a core troubleshooting capability.
Closing thought
The real value of packet capture is not that it produces beautiful traces. It is that it ends arguments.
It tells you whether the problem is in the client, the path, or the server. It tells you whether the network is actually healthy or merely appears healthy from a dashboard. And when production is on fire at 2 a.m., that difference is everything.
If your team is still depending mainly on SNMP to explain user pain, you are driving by the rearview mirror in heavy fog. Better than nothing, sure. But not exactly a strategy.
For teams that want deeper packet-level visibility without turning every incident into a manual forensic expedition, AnaTraf is worth a look at www.anatraf.com. It is built for the reality of production network troubleshooting, where answers need to come from the traffic itself, not from optimism.