DEV Community

anatraf-nta

How to Diagnose Intermittent Network Failures with Packet Capture

At 2:13 a.m., the NOC dashboard still looked healthy.

CPU on the core switches was normal. Interface utilization was below 40%. SNMP polling showed no down links. The application team insisted their new deployment was fine. Yet users in two branch offices kept reporting the same maddening symptom: the ERP web app would load, spin for 20 to 40 seconds, and then fail with a timeout. Five minutes later, it might work again.

This is the kind of incident that turns routine network troubleshooting into a blame carousel. The network team sees green dashboards. The server team sees no obvious resource pressure. The developers point to successful health checks. Meanwhile, the business only knows that a critical application is unreliable.

In situations like this, network traffic analysis is what separates guesses from evidence. And when the issue is intermittent, bursty, or protocol-specific, packet capture is often the only way to get to root cause.

In this article, I will walk through a practical workflow for using packet-level visibility to diagnose hard-to-reproduce failures. The examples are aimed at NetOps, SysAdmin, and DevOps teams that need a reliable method for real-world network troubleshooting instead of another vague checklist.

Why intermittent failures are so hard to troubleshoot

The most frustrating network problems are not total outages. They are partial failures with inconsistent symptoms:

  • Some users are affected, others are not
  • The application works on retry
  • Monitoring shows acceptable averages
  • Logs reveal timeouts but not why they happened
  • The issue appears to move between network, server, and application layers

Traditional monitoring is useful, but it has blind spots.

What SNMP and flow telemetry can tell you

SNMP and flow telemetry are good at answering questions like:

  • Is an interface up or down?
  • Is bandwidth usage unusually high?
  • Is CPU or memory under pressure?
  • Are there spikes in error counters?
  • Which conversations consume the most traffic?

That helps with broad situational awareness. But it usually does not answer:

  • Did DNS resolution take 4 seconds for one request path?
  • Did TCP retransmissions explode for only one VLAN?
  • Did a TLS handshake fail because of MTU or middlebox behavior?
  • Did the server advertise a zero window?
  • Did packets arrive out of order just long enough to trigger application timeouts?

That is where packet capture becomes essential.

Start with the symptom, not the tool

When teams jump straight into Wireshark filters without defining the symptom precisely, they often drown in packets.

Start by building a symptom statement with four dimensions:

  1. Who is affected? One user, one site, one subnet, one app tier?
  2. What fails? DNS lookup, TCP connection, TLS handshake, HTTP request, database query?
  3. When does it happen? Constantly, during bursts, after deploys, at shift changes?
  4. What does success look like? Establish the baseline timing and sequence.

For example:

Users in branch subnets 10.24.16.0/24 and 10.24.17.0/24 intermittently time out when loading the ERP login page between 01:45 and 03:00 UTC. The issue affects HTTPS requests to app01 and app02 in the data center.

This is a usable starting point for network traffic analysis because it narrows the capture scope.
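If you template incident intake, the four dimensions above can be captured as a small structure. A minimal sketch in Python; the field values are just the hypothetical example from this article:

```python
from dataclasses import dataclass

@dataclass
class SymptomStatement:
    """The four dimensions of a usable symptom statement."""
    who: str       # affected users, subnets, or app tiers
    what: str      # which protocol stage fails
    when: str      # time pattern of the failure
    baseline: str  # what a healthy transaction looks like

stmt = SymptomStatement(
    who="branch subnets 10.24.16.0/24 and 10.24.17.0/24",
    what="HTTPS requests to app01/app02 time out at ERP login",
    when="intermittently between 01:45 and 03:00 UTC",
    baseline="login page renders in under 2 seconds",
)
print(stmt.who)
```

Writing the statement down this way forces the team to fill in all four fields before anyone opens a capture tool.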

Where to capture packets

One of the most important decisions in network troubleshooting is capture placement. If you capture in the wrong place, your packet data may be technically valid but operationally useless.

Common capture points include:

1. Client-side capture

Best when you need to confirm what the endpoint actually sent and received.

Useful for:

  • DNS delays
  • TLS handshake failures
  • Proxy behavior
  • Endpoint firewall interference

Limitations:

  • Hard to scale
  • Endpoint clock drift can complicate correlation
  • You may miss what changed in transit

2. Server-side capture

Best when the server team says, “We never saw the request,” or, “The client closed the session.”

Useful for:

  • Load balancer to backend issues
  • Server zero-window conditions
  • Application response delays
  • Reverse proxy behavior

Limitations:

  • You may not see upstream packet loss or WAN impairments
  • NAT or service mesh layers may change the traffic shape

3. SPAN/TAP near aggregation or core

Best when you need shared visibility across multiple clients or segments.

Useful for:

  • Comparing healthy vs unhealthy paths
  • Identifying VLAN-specific issues
  • Verifying packet loss, retries, and latency patterns at scale

Limitations:

  • SPAN oversubscription can drop mirrored traffic
  • Bad filter design creates too much data
  • East-west traffic may be missed depending on topology

4. Firewall or load balancer adjacent capture

Best when middleboxes might be involved.

Useful for:

  • Session resets
  • Asymmetric routing
  • NAT translation problems
  • TLS inspection side effects

The rule is simple: capture as close as possible to the point where behavior becomes ambiguous.

The packet capture workflow that works in production

Here is the practical workflow many teams skip.

Step 1: Capture both a failing case and a healthy case

A single pcap from a failed session is helpful. Two pcaps — one good, one bad — are much better.

Comparative analysis helps answer:

  • Which protocol stage diverged first?
  • Did response timing change before any visible error?
  • Did retransmissions begin before or after the app delay?
  • Is packet size or fragmentation different in the failure case?

Without a healthy baseline, teams often overreact to normal protocol behavior.

Step 2: Keep the capture narrow but sufficient

Do not start with “capture everything forever.” That is not observability; that is storage abuse.

A good initial capture filter might target:

  • Specific client subnets
  • The application VIP or backend servers
  • Relevant ports such as 53, 443, 1433, 3306
  • A limited time window around the incident

For example, if you were using tcpdump on a Linux sensor, your logic might be equivalent to:

  • branch client subnets
  • destination app VIP
  • DNS and HTTPS traffic

The goal is to preserve enough context to reconstruct the transaction path without ingesting unrelated noise.
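Sticking with that example, the scoping logic can be assembled into a standard BPF filter string. A sketch in Python, using the article's branch subnets plus a hypothetical VIP address; the resulting string is what you would hand to tcpdump, e.g. `tcpdump -i eth0 -s 0 -w erp-incident.pcap "<filter>"`:

```python
clients = ["10.24.16.0/24", "10.24.17.0/24"]  # branch client subnets
vip = "203.0.113.10"                          # app VIP (hypothetical)
ports = [53, 443]                             # DNS and HTTPS

# Join the pieces into a BPF expression tcpdump accepts.
client_expr = " or ".join(f"net {c}" for c in clients)
port_expr = " or ".join(f"port {p}" for p in ports)
bpf = f"({client_expr}) and host {vip} and ({port_expr})"
print(bpf)
```

Generating the filter from the symptom statement, rather than typing it ad hoc, keeps the capture scope auditable after the incident.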

Step 3: Build a timeline of the transaction

Packet analysis is easier when you stop thinking in “packets” and start thinking in “conversation stages.”

For a common web transaction, the timeline may be:

  1. DNS query and response
  2. TCP three-way handshake
  3. TLS client hello / server hello
  4. HTTP request
  5. Server processing delay
  6. HTTP response delivery
  7. TCP teardown or reset

Mark the timestamps for each stage. The root cause often reveals itself as a gap:

  • DNS took 3.8 seconds
  • SYN was retransmitted three times
  • TLS stalled after Client Hello
  • HTTP request reached server quickly, but response started 18 seconds later
  • Server response was fragmented and triggered retransmissions

That gap is where network troubleshooting becomes evidence-driven.
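Finding that gap is a matter of differencing the stage timestamps. A toy sketch with made-up timings that mirror the list above (seconds since capture start):

```python
stages = [  # (stage, start_ts, end_ts) — synthetic example data
    ("DNS response",        0.00, 3.80),   # DNS took 3.8 s
    ("TCP handshake done",  3.80, 3.92),
    ("TLS established",     3.92, 4.10),
    ("HTTP request sent",   4.10, 4.11),
    ("First response byte", 4.11, 22.11),  # 18 s server-side stall
]
gaps = {name: round(end - start, 2) for name, start, end in stages}
slowest = max(gaps, key=gaps.get)
print(slowest, gaps[slowest])
```

Here two stages stand out, and the larger one points at the server side rather than the path.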

Real failure patterns packet capture exposes

Let us look at the patterns that surface repeatedly in production.

1. TCP retransmission storms

This is one of the most common findings in intermittent slowness incidents.

Symptoms:

  • Pages partially load or stall
  • Sessions recover on retry
  • Throughput drops well below link capacity
  • SNMP may show the link as healthy

In the pcap, you may see:

  • Duplicate ACKs
  • Repeated retransmissions
  • Out-of-order segments
  • Large gaps between acknowledgments

Typical causes:

  • Congested WAN circuits
  • Microbursts on oversubscribed uplinks
  • Bad optics or duplex mismatches
  • Misbehaving QoS policies
  • An oversubscribed SPAN session dropping mirrored packets, which reads as loss in the capture itself if not validated

The key insight is that average interface utilization can look normal while a specific queue or path suffers transient drops.
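The sequence-space intuition behind spotting retransmissions can be sketched in a few lines. This is deliberately naive — Wireshark's TCP analysis heuristics also consider timing and ACK state — and the segment data is synthetic:

```python
# One direction of a TCP stream: a segment whose sequence range was
# already fully covered is counted as a retransmission.
segments = [  # (seq, payload_len) — synthetic example data
    (1000, 500), (1500, 500), (1500, 500), (2000, 500), (1500, 500),
]
highest = 0   # highest sequence byte seen so far
retrans = 0
for seq, length in segments:
    if seq + length <= highest:
        retrans += 1  # entirely below data we have already seen
    highest = max(highest, seq + length)
print(retrans)
```

In practice you would let the analyzer flag these for you; the point is that the evidence lives in sequence numbers, not in interface counters.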

2. DNS latency masquerading as application failure

Users often report “the app is down” when the real issue is name resolution delay.

In packet capture, this appears as:

  • Multiple DNS queries before connection setup
  • Long delays before the DNS response
  • Retries to secondary resolvers
  • Truncated responses forcing TCP fallback

Typical causes:

  • Resolver overload
  • Firewall inspection delay
  • EDNS/fragmentation issues
  • Split-horizon DNS misconfiguration
  • Intermittent path issues to the resolver

If the first visible slowdown occurs before the SYN packet is ever sent to the application, your app team is probably innocent.
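Measuring that pre-SYN delay comes down to pairing each DNS query with its response by transaction ID. A toy sketch with synthetic timestamps and IDs:

```python
events = [  # (timestamp, txid, kind) — synthetic example data
    (0.00, 0x1A2B, "query"),
    (0.01, 0x3C4D, "query"),
    (0.12, 0x3C4D, "response"),
    (3.85, 0x1A2B, "response"),  # the slow resolver path
]
pending = {}    # txid -> query timestamp
latencies = {}  # txid -> resolution time in seconds
for ts, txid, kind in events:
    if kind == "query":
        pending[txid] = ts
    elif txid in pending:
        latencies[txid] = round(ts - pending.pop(txid), 2)
print(latencies[0x1A2B])
```

Any query still left in `pending` at the end of the window is also interesting: it never got an answer at all.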

3. MTU and fragmentation problems during TLS

A classic pattern: small requests work, large ones fail, and only some sites are affected.

In the trace, you may see:

  • TCP handshake succeeds
  • TLS negotiation starts
  • Large packets trigger fragmentation or silent drops
  • Client retransmits, then times out

Typical causes:

  • PMTUD failure
  • ICMP blocked in transit
  • VPN overlays lowering effective MTU
  • Firewall bugs handling fragmented traffic

This category is often missed because layer-3 reachability tests still pass.
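The arithmetic behind the "large packets fail" pattern is simple: every overlay shaves bytes off the effective payload size. A sketch with typical, not universal, overhead values — verify the numbers against your actual encapsulation:

```python
link_mtu = 1500
ip_tcp_headers = 40  # IPv4 (20 B) + TCP (20 B), no options
overheads = {"plain": 0, "gre": 24, "vxlan": 50}  # typical values

# Effective MSS a sender can use without fragmenting on the overlay.
effective_mss = {name: link_mtu - oh - ip_tcp_headers
                 for name, oh in overheads.items()}
print(effective_mss["vxlan"])
```

If the failing flows are the ones whose segments exceed the overlay's effective MSS while small control traffic sails through, the pattern above fits.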

4. Zero-window and application backpressure

Sometimes the network is not the bottleneck at all, but the packets prove it.

In the pcap, you may see:

  • Server advertising TCP zero window
  • Delayed window updates
  • Client waiting on receive availability
  • No network loss, but significant response stalls

Typical causes:

  • Application thread pool exhaustion
  • Disk latency on backend systems
  • Overloaded virtual machines
  • Garbage collection pauses

This is why packet capture is so valuable even when the fault is not strictly “the network.” It gives NetOps and DevOps teams shared evidence instead of finger-pointing.
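Quantifying such a stall is straightforward once you have the advertised window over time: sum the intervals during which it sat at zero. A sketch over synthetic `(timestamp, window)` samples:

```python
samples = [  # (timestamp_s, advertised_window) — synthetic data
    (0.0, 65535), (1.0, 8192), (1.5, 0), (9.5, 0), (9.8, 32768),
]
stalled = 0.0
for (t0, w0), (t1, _) in zip(samples, samples[1:]):
    if w0 == 0:
        stalled += t1 - t0  # time spent unable to accept data
print(round(stalled, 1))
```

A multi-second zero-window total with no loss in the trace is strong evidence the receiver, not the network, paced the transaction.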

5. Middlebox resets and asymmetric routing

Some of the ugliest incidents involve firewalls, load balancers, or NAT devices altering session behavior.

In packet data, look for:

  • Unexpected RST packets
  • Sequence number anomalies
  • Flows visible in one direction but not the other
  • Session expiration earlier than app expectations

Typical causes:

  • Aggressive idle timeout policies
  • Stateful inspection under pressure
  • ECMP or routing asymmetry confusing stateful devices
  • NAT table exhaustion

When this happens, logs on the application server often mislead the team because the server only sees an abrupt disconnect, not the policy decision that caused it.

A practical analysis sequence for incident response

When you have a pcap and limited time, use a repeatable sequence.

First pass: confirm the basics

  • Are both directions visible?
  • Are timestamps trustworthy?
  • Is there obvious packet loss or retransmission?
  • Is DNS normal?
  • Did the TCP handshake complete?
  • Did TLS complete?
  • Who sent the first reset or FIN?

Second pass: compare healthy vs failing flows

Measure:

  • DNS response time
  • SYN to SYN-ACK latency
  • TLS negotiation time
  • Time to first byte
  • Total transaction duration
  • Retransmission count
  • Window size behavior
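Once both flows are measured, divergence can be found mechanically. A sketch comparing the two metric sets with a crude threshold; all numbers are illustrative:

```python
# Per-stage metrics (seconds, except retransmission count).
healthy = {"dns": 0.02, "syn_synack": 0.03, "tls": 0.08,
           "ttfb": 0.25, "retrans": 0}
failing = {"dns": 0.02, "syn_synack": 0.03, "tls": 0.09,
           "ttfb": 18.4, "retrans": 7}

# Flag a metric when the failing flow is well outside the baseline:
# more than 3x the healthy value or 0.5 above it, whichever is larger.
diverged = [m for m in healthy
            if failing[m] > max(healthy[m] * 3, healthy[m] + 0.5)]
print(diverged)
```

The first diverging stage in transaction order is the one to chase; later anomalies are often downstream effects of it.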

Third pass: map the finding to an infrastructure hypothesis

Examples:

  • Retransmissions only from one branch → WAN, access uplink, or site-specific QoS
  • TLS stalls only across VPN users → MTU, PMTUD, overlay network
  • Zero-window from app servers → host or application bottleneck
  • DNS delay across all failed sessions → resolver or firewall path

This keeps packet analysis connected to operational action.

Common mistakes teams make with packet capture

Even experienced engineers sabotage their own troubleshooting process. The biggest mistakes are predictable.

Capturing too late

If the incident is intermittent, waiting until users are already escalated can mean the useful evidence is gone. For recurring issues, keep rolling capture at strategic points with retention aligned to the environment.

Capturing only one side

One-sided capture can still help, but it often leaves ambiguity around loss, delay, and middlebox behavior. If the issue matters, capture both ends or at least both sides of the suspected boundary.

Ignoring clock alignment

If you compare endpoint and server traces with unsynchronized clocks, the timeline becomes fiction. Use NTP everywhere. This sounds basic because it is basic, and teams still get burned by it.

Treating Wireshark as the whole strategy

Wireshark is excellent for analysis, but it is not a capture architecture. Production network traffic analysis requires planning for placement, retention, indexing, and retrieval.

Failing to preserve context

A packet trace without metadata is less useful than people think. Record:

  • Incident time window
  • Affected users or subnets
  • Application components involved
  • Routing or policy changes
  • Deployment events
  • Whether the trace reflects healthy or failing behavior

Future-you will appreciate it.

What to automate in your troubleshooting pipeline

If your team handles recurring incidents, stop treating packet capture as artisanal craftsmanship.

You can standardize several elements:

  • Predefined capture points for branch, core, and application tiers
  • Runbooks for DNS, TCP, TLS, and HTTP transaction analysis
  • Baseline transaction timings for critical apps
  • Incident tags linking pcaps to change windows and alerts
  • Rolling packet capture during known risk periods such as migrations or maintenance

For DevOps and SRE-adjacent teams, the real win is correlation. Packet evidence becomes dramatically more useful when paired with:

  • Deployment timestamps
  • Load balancer logs
  • Application traces
  • NetFlow or IPFIX summaries
  • Firewall session logs
  • Infrastructure metrics

This combination reduces mean time to innocence and, more importantly, mean time to root cause.

When full packet visibility is worth it

Not every environment needs to retain every packet. But many teams underestimate how often packet-level history pays for itself.

Full packet visibility is especially useful when you have:

  • High-value business applications with intermittent complaints
  • Hybrid infrastructure with VPN, firewall, and cloud path complexity
  • Compliance or forensics requirements
  • Frequent “network is slow” escalations with no clear owner
  • Branch offices where issues are hard to reproduce live

If your current tooling tells you utilization, uptime, and broad traffic patterns but cannot explain why a transaction failed, you have an observability gap.

Final takeaway

Intermittent failures are rarely solved by intuition. They are solved by narrowing the symptom, capturing at the right point, and building a packet-level timeline of what actually happened.

That is why packet capture remains one of the most effective tools in serious network troubleshooting. It exposes the hidden layers behind timeouts, retransmissions, slow DNS, TLS stalls, and middlebox interference. And for teams doing routine network traffic analysis, it provides a common source of truth across NetOps, SysAdmin, and DevOps.

If you are hitting the limits of dashboards, polling, and flow data, it may be time to add deeper packet visibility to your stack. Platforms such as AnaTraf focus on full packet capture and protocol-level analysis so teams can investigate production issues faster without guessing. Used well, that kind of visibility does not just shorten outages — it changes how your team troubleshoots altogether.
