DEV Community

anatraf-nta

How to Diagnose Intermittent Network Failures with Packet Capture

At 2:13 a.m., the NOC dashboard still looked healthy.

CPU on the core switches was normal. Interface utilization was below 40%. SNMP polling showed no down links. The application team insisted their new deployment was fine. Yet users in two branch offices kept reporting the same maddening symptom: the ERP web app would load, spin for 20 to 40 seconds, and then fail with a timeout. Five minutes later, it might work again.

This is the kind of incident that turns routine network troubleshooting into a blame carousel. The network team sees green dashboards. The server team sees no obvious resource pressure. The developers point to successful health checks. Meanwhile, the business only knows that a critical application is unreliable.

In situations like this, network traffic analysis is what separates guesses from evidence. And when the issue is intermittent, bursty, or protocol-specific, packet capture is often the only way to get to root cause.

In this article, I will walk through a practical workflow for using packet-level visibility to diagnose hard-to-reproduce failures. The examples are aimed at NetOps, SysAdmin, and DevOps teams that need a reliable method for real-world network troubleshooting instead of another vague checklist.

Why intermittent failures are so hard to troubleshoot

The most frustrating network problems are not total outages. They are partial failures with inconsistent symptoms:

  • Some users are affected, others are not
  • The application works on retry
  • Monitoring shows acceptable averages
  • Logs reveal timeouts but not why they happened
  • The issue appears to move between network, server, and application layers

Traditional monitoring is useful, but it has blind spots.

What SNMP and flow telemetry can tell you

SNMP and flow telemetry are good at answering questions like:

  • Is an interface up or down?
  • Is bandwidth usage unusually high?
  • Is CPU or memory under pressure?
  • Are there spikes in error counters?
  • Which conversations consume the most traffic?

That helps with broad situational awareness. But it usually does not answer:

  • Did DNS resolution take 4 seconds for one request path?
  • Did TCP retransmissions explode for only one VLAN?
  • Did a TLS handshake fail because of MTU or middlebox behavior?
  • Did the server advertise a zero window?
  • Did packets arrive out of order just long enough to trigger application timeouts?

That is where packet capture becomes essential.

Start with the symptom, not the tool

When teams jump straight into Wireshark filters without defining the symptom precisely, they often drown in packets.

Start by building a symptom statement with four dimensions:

  1. Who is affected? One user, one site, one subnet, one app tier?
  2. What fails? DNS lookup, TCP connection, TLS handshake, HTTP request, database query?
  3. When does it happen? Constantly, during bursts, after deploys, at shift changes?
  4. What does success look like? Establish the baseline timing and sequence.

For example:

Users in branch subnets 10.24.16.0/24 and 10.24.17.0/24 intermittently time out when loading the ERP login page between 01:45 and 03:00 UTC. The issue affects HTTPS requests to app01 and app02 in the data center.

This is a usable starting point for network traffic analysis because it narrows the capture scope.
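If you template incident intake, the four dimensions above can be captured as a small structure. A minimal sketch in Python; the field values are just the hypothetical example from this article:

```python
from dataclasses import dataclass

@dataclass
class SymptomStatement:
    """The four dimensions of a usable symptom statement."""
    who: str       # affected users, subnets, or app tiers
    what: str      # which protocol stage fails
    when: str      # time pattern of the failure
    baseline: str  # what a healthy transaction looks like

stmt = SymptomStatement(
    who="branch subnets 10.24.16.0/24 and 10.24.17.0/24",
    what="HTTPS requests to app01/app02 time out at ERP login",
    when="intermittently between 01:45 and 03:00 UTC",
    baseline="login page renders in under 2 seconds",
)
print(stmt.who)
```

Writing the statement down this way forces the team to fill in all four fields before anyone opens a capture tool.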

Where to capture packets

One of the most important decisions in network troubleshooting is capture placement. If you capture in the wrong place, your packet data may be technically valid but operationally useless.

Common capture points include:

1. Client-side capture

Best when you need to confirm what the endpoint actually sent and received.

Useful for:

  • DNS delays
  • TLS handshake failures
  • Proxy behavior
  • Endpoint firewall interference

Limitations:

  • Hard to scale
  • Endpoint clock drift can complicate correlation
  • You may miss what changed in transit

2. Server-side capture

Best when the server team says, “We never saw the request,” or, “The client closed the session.”

Useful for:

  • Load balancer to backend issues
  • Server zero-window conditions
  • Application response delays
  • Reverse proxy behavior

Limitations:

  • You may not see upstream packet loss or WAN impairments
  • NAT or service mesh layers may change the traffic shape

3. SPAN/TAP near aggregation or core

Best when you need shared visibility across multiple clients or segments.

Useful for:

  • Comparing healthy vs unhealthy paths
  • Identifying VLAN-specific issues
  • Verifying packet loss, retries, and latency patterns at scale

Limitations:

  • SPAN oversubscription can drop mirrored traffic
  • Bad filter design creates too much data
  • East-west traffic may be missed depending on topology

4. Firewall or load balancer adjacent capture

Best when middleboxes might be involved.

Useful for:

  • Session resets
  • Asymmetric routing
  • NAT translation problems
  • TLS inspection side effects

The rule is simple: capture as close as possible to the point where behavior becomes ambiguous.

The packet capture workflow that works in production

Here is the practical workflow many teams skip.

Step 1: Capture both a failing case and a healthy case

A single pcap from a failed session is helpful. Two pcaps — one good, one bad — are much better.

Comparative analysis helps answer:

  • Which protocol stage diverged first?
  • Did response timing change before any visible error?
  • Did retransmissions begin before or after the app delay?
  • Is packet size or fragmentation different in the failure case?

Without a healthy baseline, teams often overreact to normal protocol behavior.

Step 2: Keep the capture narrow but sufficient

Do not start with “capture everything forever.” That is not observability; that is storage abuse.

A good initial capture filter might target:

  • Specific client subnets
  • The application VIP or backend servers
  • Relevant ports such as 53, 443, 1433, 3306
  • A limited time window around the incident

For example, if you were using tcpdump on a Linux sensor, your logic might be equivalent to:

  • branch client subnets
  • destination app VIP
  • DNS and HTTPS traffic

The goal is to preserve enough context to reconstruct the transaction path without ingesting unrelated noise.
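Sticking with that example, the scoping logic can be assembled into a standard BPF filter string. A sketch in Python, using the article's branch subnets plus a hypothetical VIP address; the resulting string is what you would hand to tcpdump, e.g. `tcpdump -i eth0 -s 0 -w erp-incident.pcap "<filter>"`:

```python
clients = ["10.24.16.0/24", "10.24.17.0/24"]  # branch client subnets
vip = "203.0.113.10"                          # app VIP (hypothetical)
ports = [53, 443]                             # DNS and HTTPS

# Join the pieces into a BPF expression tcpdump accepts.
client_expr = " or ".join(f"net {c}" for c in clients)
port_expr = " or ".join(f"port {p}" for p in ports)
bpf = f"({client_expr}) and host {vip} and ({port_expr})"
print(bpf)
```

Generating the filter from the symptom statement, rather than typing it ad hoc, keeps the capture scope auditable after the incident.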

Step 3: Build a timeline of the transaction

Packet analysis is easier when you stop thinking in “packets” and start thinking in “conversation stages.”

For a common web transaction, the timeline may be:

  1. DNS query and response
  2. TCP three-way handshake
  3. TLS client hello / server hello
  4. HTTP request
  5. Server processing delay
  6. HTTP response delivery
  7. TCP teardown or reset

Mark the timestamps for each stage. The root cause often reveals itself as a gap:

  • DNS took 3.8 seconds
  • SYN was retransmitted three times
  • TLS stalled after Client Hello
  • HTTP request reached server quickly, but response started 18 seconds later
  • Server response was fragmented and triggered retransmissions

That gap is where network troubleshooting becomes evidence-driven.
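Finding that gap is a matter of differencing the stage timestamps. A toy sketch with made-up timings that mirror the list above (seconds since capture start):

```python
stages = [  # (stage, start_ts, end_ts) — synthetic example data
    ("DNS response",        0.00, 3.80),   # DNS took 3.8 s
    ("TCP handshake done",  3.80, 3.92),
    ("TLS established",     3.92, 4.10),
    ("HTTP request sent",   4.10, 4.11),
    ("First response byte", 4.11, 22.11),  # 18 s server-side stall
]
gaps = {name: round(end - start, 2) for name, start, end in stages}
slowest = max(gaps, key=gaps.get)
print(slowest, gaps[slowest])
```

Here two stages stand out, and the larger one points at the server side rather than the path.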

Real failure patterns packet capture exposes

Let us look at the patterns that surface repeatedly in production.

1. TCP retransmission storms

This is one of the most common findings in intermittent slowness incidents.

Symptoms:

  • Pages partially load or stall
  • Sessions recover on retry
  • Throughput drops well below link capacity
  • SNMP may show the link as healthy

In the pcap, you may see:

  • Duplicate ACKs
  • Repeated retransmissions
  • Out-of-order segments
  • Large gaps between acknowledgments

Typical causes:

  • Congested WAN circuits
  • Microbursts on oversubscribed uplinks
  • Bad optics or duplex mismatches
  • Misbehaving QoS policies
  • An oversubscribed SPAN session dropping mirrored packets, which reads as loss in the capture itself if not validated

The key insight is that average interface utilization can look normal while a specific queue or path suffers transient drops.
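The sequence-space intuition behind spotting retransmissions can be sketched in a few lines. This is deliberately naive — Wireshark's TCP analysis heuristics also consider timing and ACK state — and the segment data is synthetic:

```python
# One direction of a TCP stream: a segment whose sequence range was
# already fully covered is counted as a retransmission.
segments = [  # (seq, payload_len) — synthetic example data
    (1000, 500), (1500, 500), (1500, 500), (2000, 500), (1500, 500),
]
highest = 0   # highest sequence byte seen so far
retrans = 0
for seq, length in segments:
    if seq + length <= highest:
        retrans += 1  # entirely below data we have already seen
    highest = max(highest, seq + length)
print(retrans)
```

In practice you would let the analyzer flag these for you; the point is that the evidence lives in sequence numbers, not in interface counters.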

2. DNS latency masquerading as application failure

Users often report “the app is down” when the real issue is name resolution delay.

In packet capture, this appears as:

  • Multiple DNS queries before connection setup
  • Long delays before the DNS response
  • Retries to secondary resolvers
  • Truncated responses forcing TCP fallback

Typical causes:

  • Resolver overload
  • Firewall inspection delay
  • EDNS/fragmentation issues
  • Split-horizon DNS misconfiguration
  • Intermittent path issues to the resolver

If the first visible slowdown occurs before the SYN packet is ever sent to the application, your app team is probably innocent.
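Measuring that pre-SYN delay comes down to pairing each DNS query with its response by transaction ID. A toy sketch with synthetic timestamps and IDs:

```python
events = [  # (timestamp, txid, kind) — synthetic example data
    (0.00, 0x1A2B, "query"),
    (0.01, 0x3C4D, "query"),
    (0.12, 0x3C4D, "response"),
    (3.85, 0x1A2B, "response"),  # the slow resolver path
]
pending = {}    # txid -> query timestamp
latencies = {}  # txid -> resolution time in seconds
for ts, txid, kind in events:
    if kind == "query":
        pending[txid] = ts
    elif txid in pending:
        latencies[txid] = round(ts - pending.pop(txid), 2)
print(latencies[0x1A2B])
```

Any query still left in `pending` at the end of the window is also interesting: it never got an answer at all.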

3. MTU and fragmentation problems during TLS

A classic pattern: small requests work, large ones fail, and only some sites are affected.

In the trace, you may see:

  • TCP handshake succeeds
  • TLS negotiation starts
  • Large packets trigger fragmentation or silent drops
  • Client retransmits, then times out

Typical causes:

  • PMTUD failure
  • ICMP blocked in transit
  • VPN overlays lowering effective MTU
  • Firewall bugs handling fragmented traffic

This category is often missed because layer-3 reachability tests still pass.
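The arithmetic behind the "large packets fail" pattern is simple: every overlay shaves bytes off the effective payload size. A sketch with typical, not universal, overhead values — verify the numbers against your actual encapsulation:

```python
link_mtu = 1500
ip_tcp_headers = 40  # IPv4 (20 B) + TCP (20 B), no options
overheads = {"plain": 0, "gre": 24, "vxlan": 50}  # typical values

# Effective MSS a sender can use without fragmenting on the overlay.
effective_mss = {name: link_mtu - oh - ip_tcp_headers
                 for name, oh in overheads.items()}
print(effective_mss["vxlan"])
```

If the failing flows are the ones whose segments exceed the overlay's effective MSS while small control traffic sails through, the pattern above fits.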

4. Zero-window and application backpressure

Sometimes the network is not the bottleneck at all, but the packets prove it.

In the pcap, you may see:

  • Server advertising TCP zero window
  • Delayed window updates
  • Client waiting on receive availability
  • No network loss, but significant response stalls

Typical causes:

  • Application thread pool exhaustion
  • Disk latency on backend systems
  • Overloaded virtual machines
  • Garbage collection pauses

This is why packet capture is so valuable even when the fault is not strictly “the network.” It gives NetOps and DevOps teams shared evidence instead of finger-pointing.
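Quantifying such a stall is straightforward once you have the advertised window over time: sum the intervals during which it sat at zero. A sketch over synthetic `(timestamp, window)` samples:

```python
samples = [  # (timestamp_s, advertised_window) — synthetic data
    (0.0, 65535), (1.0, 8192), (1.5, 0), (9.5, 0), (9.8, 32768),
]
stalled = 0.0
for (t0, w0), (t1, _) in zip(samples, samples[1:]):
    if w0 == 0:
        stalled += t1 - t0  # time spent unable to accept data
print(round(stalled, 1))
```

A multi-second zero-window total with no loss in the trace is strong evidence the receiver, not the network, paced the transaction.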

5. Middlebox resets and asymmetric routing

Some of the ugliest incidents involve firewalls, load balancers, or NAT devices altering session behavior.

In packet data, look for:

  • Unexpected RST packets
  • Sequence number anomalies
  • Flows visible in one direction but not the other
  • Session expiration earlier than app expectations

Typical causes:

  • Aggressive idle timeout policies
  • Stateful inspection under pressure
  • ECMP or routing asymmetry confusing stateful devices
  • NAT table exhaustion

When this happens, logs on the application server often mislead the team because the server only sees an abrupt disconnect, not the policy decision that caused it.

A practical analysis sequence for incident response

When you have a pcap and limited time, use a repeatable sequence.

First pass: confirm the basics

  • Are both directions visible?
  • Are timestamps trustworthy?
  • Is there obvious packet loss or retransmission?
  • Is DNS normal?
  • Did the TCP handshake complete?
  • Did TLS complete?
  • Who sent the first reset or FIN?

Second pass: compare healthy vs failing flows

Measure:

  • DNS response time
  • SYN to SYN-ACK latency
  • TLS negotiation time
  • Time to first byte
  • Total transaction duration
  • Retransmission count
  • Window size behavior
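Once both flows are measured, divergence can be found mechanically. A sketch comparing the two metric sets with a crude threshold; all numbers are illustrative:

```python
# Per-stage metrics (seconds, except retransmission count).
healthy = {"dns": 0.02, "syn_synack": 0.03, "tls": 0.08,
           "ttfb": 0.25, "retrans": 0}
failing = {"dns": 0.02, "syn_synack": 0.03, "tls": 0.09,
           "ttfb": 18.4, "retrans": 7}

# Flag a metric when the failing flow is well outside the baseline:
# more than 3x the healthy value or 0.5 above it, whichever is larger.
diverged = [m for m in healthy
            if failing[m] > max(healthy[m] * 3, healthy[m] + 0.5)]
print(diverged)
```

The first diverging stage in transaction order is the one to chase; later anomalies are often downstream effects of it.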

Third pass: map the finding to an infrastructure hypothesis

Examples:

  • Retransmissions only from one branch → WAN, access uplink, or site-specific QoS
  • TLS stalls only across VPN users → MTU, PMTUD, overlay network
  • Zero-window from app servers → host or application bottleneck
  • DNS delay across all failed sessions → resolver or firewall path

This keeps packet analysis connected to operational action.

Common mistakes teams make with packet capture

Even experienced engineers sabotage their own troubleshooting process. The biggest mistakes are predictable.

Capturing too late

If the incident is intermittent, waiting until users are already escalated can mean the useful evidence is gone. For recurring issues, keep rolling capture at strategic points with retention aligned to the environment.

Capturing only one side

One-sided capture can still help, but it often leaves ambiguity around loss, delay, and middlebox behavior. If the issue matters, capture both ends or at least both sides of the suspected boundary.

Ignoring clock alignment

If you compare endpoint and server traces with unsynchronized clocks, the timeline becomes fiction. Use NTP everywhere. This sounds basic because it is basic, and teams still get burned by it.

Treating Wireshark as the whole strategy

Wireshark is excellent for analysis, but it is not a capture architecture. Production network traffic analysis requires planning for placement, retention, indexing, and retrieval.

Failing to preserve context

A packet trace without metadata is less useful than people think. Record:

  • Incident time window
  • Affected users or subnets
  • Application components involved
  • Routing or policy changes
  • Deployment events
  • Whether the trace reflects healthy or failing behavior

Future-you will appreciate it.

What to automate in your troubleshooting pipeline

If your team handles recurring incidents, stop treating packet capture as artisanal craftsmanship.

You can standardize several elements:

  • Predefined capture points for branch, core, and application tiers
  • Runbooks for DNS, TCP, TLS, and HTTP transaction analysis
  • Baseline transaction timings for critical apps
  • Incident tags linking pcaps to change windows and alerts
  • Rolling packet capture during known risk periods such as migrations or maintenance

For DevOps and SRE-adjacent teams, the real win is correlation. Packet evidence becomes dramatically more useful when paired with:

  • Deployment timestamps
  • Load balancer logs
  • Application traces
  • NetFlow or IPFIX summaries
  • Firewall session logs
  • Infrastructure metrics

This combination reduces mean time to innocence and, more importantly, mean time to root cause.

When full packet visibility is worth it

Not every environment needs to retain every packet. But many teams underestimate how often packet-level history pays for itself.

Full packet visibility is especially useful when you have:

  • High-value business applications with intermittent complaints
  • Hybrid infrastructure with VPN, firewall, and cloud path complexity
  • Compliance or forensics requirements
  • Frequent “network is slow” escalations with no clear owner
  • Branch offices where issues are hard to reproduce live

If your current tooling tells you utilization, uptime, and broad traffic patterns but cannot explain why a transaction failed, you have an observability gap.

Final takeaway

Intermittent failures are rarely solved by intuition. They are solved by narrowing the symptom, capturing at the right point, and building a packet-level timeline of what actually happened.

That is why packet capture remains one of the most effective tools in serious network troubleshooting. It exposes the hidden layers behind timeouts, retransmissions, slow DNS, TLS stalls, and middlebox interference. And for teams doing routine network traffic analysis, it provides a common source of truth across NetOps, SysAdmin, and DevOps.

If you are hitting the limits of dashboards, polling, and flow data, it may be time to add deeper packet visibility to your stack. Platforms such as AnaTraf focus on full packet capture and protocol-level analysis so teams can investigate production issues faster without guessing. Used well, that kind of visibility does not just shorten outages — it changes how your team troubleshoots altogether.
