<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vlad Levinas</title>
    <description>The latest articles on DEV Community by Vlad Levinas (@vladlevinas).</description>
    <link>https://dev.to/vladlevinas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3620704%2F66516975-e588-42fa-8ea8-adc4f6b004f6.png</url>
      <title>DEV Community: Vlad Levinas</title>
      <link>https://dev.to/vladlevinas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vladlevinas"/>
    <language>en</language>
    <item>
      <title>Throughput vs IOPS vs Latency Beyond Storage: Network, Compute, and Cloud Performance Explained</title>
      <dc:creator>Vlad Levinas</dc:creator>
      <pubDate>Wed, 18 Mar 2026 18:34:16 +0000</pubDate>
      <link>https://dev.to/vladlevinas/throughput-vs-iops-vs-latency-beyond-storage-network-compute-and-cloud-performance-explained-1ia9</link>
      <guid>https://dev.to/vladlevinas/throughput-vs-iops-vs-latency-beyond-storage-network-compute-and-cloud-performance-explained-1ia9</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Most engineers first encounter throughput, IOPS, and latency in the context of storage. You provision an EBS volume, you see three numbers, and you move on. This is a mistake — and it compounds over years of building systems that mysteriously underperform.&lt;/p&gt;

&lt;p&gt;These three metrics are not storage concepts. They are fundamental properties of any system that processes work: disks, network interfaces, CPUs, GPUs, load balancers, container runtimes, and Kubernetes control planes. The same relationship between throughput, operations per second, and response time governs how an NVMe drive handles 4KB random reads, how an ENI processes packets, and how an etcd cluster responds to key-value writes.&lt;/p&gt;

&lt;p&gt;The reason performance tuning goes wrong is almost always the same: engineers optimize the wrong layer. They increase EBS IOPS when the real bottleneck is cross-AZ network latency. They add more CPU when the application is blocked on memory bandwidth. They scale pods horizontally when etcd is the chokepoint.&lt;/p&gt;

&lt;p&gt;This article explains these metrics across every layer that matters in modern cloud infrastructure and gives you the mental model to diagnose bottlenecks correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Core Definitions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;p&gt;Throughput is the volume of data or work completed per unit of time. It measures capacity — how much the system can move. In storage, it is MB/s. In networking, it is Gbps. In compute, it is operations per second, instructions per cycle, or requests per second.&lt;/p&gt;

&lt;p&gt;Think of throughput as the cross-sectional area of a pipe. A wider pipe moves more water per second, but says nothing about how fast any individual drop arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  IOPS (Input/Output Operations Per Second)
&lt;/h3&gt;

&lt;p&gt;IOPS counts discrete operations completed per unit of time. In storage, one operation is a single read or write request. In networking, the equivalent is packets per second (PPS). In compute, it maps to transactions per second, context switches per second, or system calls per second.&lt;/p&gt;

&lt;p&gt;The critical insight: throughput and IOPS are related by operation size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Throughput = IOPS × Operation Size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A volume doing 16,000 IOPS at 256KB block size delivers 4,000 MB/s throughput. The same volume doing 16,000 IOPS at 4KB delivers only 64 MB/s. You can hit your IOPS limit long before saturating throughput, or vice versa — and which one you hit first depends entirely on your workload pattern.&lt;/p&gt;
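&lt;p&gt;The identity is worth a few lines of Python (decimal megabytes, matching the figures above):&lt;/p&gt;

```python
def throughput_mb_s(iops: int, block_kb: int) -> float:
    """Throughput implied by an IOPS figure at a given block size (decimal MB/s)."""
    return iops * block_kb / 1000

# Same IOPS limit, 64x difference in delivered bandwidth.
large_blocks = throughput_mb_s(16_000, 256)  # 4096.0 MB/s (~4,000 in the text)
small_blocks = throughput_mb_s(16_000, 4)    # 64.0 MB/s
```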

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Latency is the time elapsed between issuing a request and receiving the response. It is measured in microseconds, milliseconds, or seconds depending on the layer.&lt;/p&gt;

&lt;p&gt;Latency is not the inverse of throughput. A system can have high throughput and high latency simultaneously — this is the fundamental behavior of pipelining. A satellite link has 600ms latency but can sustain 100 Mbps throughput because multiple requests are in flight concurrently. Conversely, a system can have low latency but low throughput if it can only handle one operation at a time.&lt;/p&gt;

&lt;p&gt;The relationship between all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Effective Throughput = Concurrency × (Operation Size / Latency)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This equation is universal. It explains why queue depth matters for disks, why TCP window size matters for networks, and why thread count matters for applications.&lt;/p&gt;
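&lt;p&gt;A minimal sketch of the relationship, showing how queue depth alone changes what the same device delivers:&lt;/p&gt;

```python
def effective_throughput_mb_s(concurrency: int, op_kb: int, latency_us: int) -> float:
    """Effective Throughput = Concurrency x (Operation Size / Latency), in decimal MB/s."""
    ops_per_s = concurrency * (1_000_000 / latency_us)  # each slot completes 1 op per latency
    return ops_per_s * op_kb / 1000

qd1 = effective_throughput_mb_s(1, 4, 200)    # 20.0 MB/s: a 5,000 IOPS ceiling at depth 1
qd32 = effective_throughput_mb_s(32, 4, 200)  # 640.0 MB/s from the same latency
```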

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Storage Example&lt;/th&gt;
&lt;th&gt;Network Example&lt;/th&gt;
&lt;th&gt;Compute Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;1,000 MB/s sequential&lt;/td&gt;
&lt;td&gt;25 Gbps link capacity&lt;/td&gt;
&lt;td&gt;3.2 GHz × IPC × cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IOPS&lt;/td&gt;
&lt;td&gt;64,000 random 4KB ops&lt;/td&gt;
&lt;td&gt;14M packets per second&lt;/td&gt;
&lt;td&gt;50,000 requests/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;200μs per I/O&lt;/td&gt;
&lt;td&gt;0.5ms RTT within AZ&lt;/td&gt;
&lt;td&gt;2ms P99 response time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Storage Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Block Size Determines Which Limit You Hit
&lt;/h3&gt;

&lt;p&gt;Every storage workload has a dominant block size. Databases doing 8KB page reads are IOPS-bound. Video transcoding doing 1MB sequential reads is throughput-bound. Misidentifying this is the single most common storage performance mistake.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Pattern&lt;/th&gt;
&lt;th&gt;Typical Block Size&lt;/th&gt;
&lt;th&gt;Bottleneck&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OLTP database&lt;/td&gt;
&lt;td&gt;8–16 KB&lt;/td&gt;
&lt;td&gt;IOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data warehouse scan&lt;/td&gt;
&lt;td&gt;256 KB – 1 MB&lt;/td&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log writes&lt;/td&gt;
&lt;td&gt;4–64 KB&lt;/td&gt;
&lt;td&gt;IOPS + mixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Object storage GET&lt;/td&gt;
&lt;td&gt;Varies (full object)&lt;/td&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;etcd (Kubernetes)&lt;/td&gt;
&lt;td&gt;4–8 KB&lt;/td&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Queue Depth
&lt;/h3&gt;

&lt;p&gt;A single-threaded application issuing synchronous I/O achieves a queue depth of 1. At 200μs latency, this caps throughput at 5,000 IOPS — regardless of what the underlying volume supports. Increasing queue depth (via async I/O, io_uring, or multiple threads) allows the drive to process operations in parallel. NVMe drives are designed for queue depths of 32–256. EBS volumes benefit from queue depths of 4–16 depending on type.&lt;/p&gt;

&lt;p&gt;This is exactly the same principle as HTTP/2 multiplexing or TCP window scaling: concurrency hides latency.&lt;/p&gt;
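&lt;p&gt;A toy demonstration of concurrency hiding latency, using &lt;code&gt;asyncio&lt;/code&gt; sleeps as stand-ins for I/O completions (timings are illustrative):&lt;/p&gt;

```python
import asyncio
import time

async def run(n_ops: int, latency_s: float, depth: int) -> float:
    """Issue n_ops simulated I/Os at a given queue depth; return wall-clock seconds."""
    sem = asyncio.Semaphore(depth)
    async def one():
        async with sem:
            await asyncio.sleep(latency_s)  # stand-in for one I/O round trip
    t0 = time.perf_counter()
    await asyncio.gather(*(one() for _ in range(n_ops)))
    return time.perf_counter() - t0

# 64 ops of 20ms each: ~1.3s at depth 1, roughly 8x faster at depth 8.
serial_s = asyncio.run(run(64, 0.02, 1))
deep_s = asyncio.run(run(64, 0.02, 8))
```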

&lt;h3&gt;
  
  
  Burst vs Sustained Performance
&lt;/h3&gt;

&lt;p&gt;AWS EBS gp3 volumes provide a baseline of 3,000 IOPS and 125 MB/s regardless of volume size; legacy gp2 volumes instead scale with size and rely on burst credits. The distinction matters enormously for capacity planning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Volume Type&lt;/th&gt;
&lt;th&gt;Baseline IOPS&lt;/th&gt;
&lt;th&gt;Max IOPS&lt;/th&gt;
&lt;th&gt;Baseline Throughput&lt;/th&gt;
&lt;th&gt;Max Throughput&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gp3&lt;/td&gt;
&lt;td&gt;3,000&lt;/td&gt;
&lt;td&gt;16,000&lt;/td&gt;
&lt;td&gt;125 MB/s&lt;/td&gt;
&lt;td&gt;1,000 MB/s&lt;/td&gt;
&lt;td&gt;sub-ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;io2 BE&lt;/td&gt;
&lt;td&gt;provisioned&lt;/td&gt;
&lt;td&gt;256,000&lt;/td&gt;
&lt;td&gt;provisioned&lt;/td&gt;
&lt;td&gt;4,000 MB/s&lt;/td&gt;
&lt;td&gt;sub-ms, consistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local NVMe&lt;/td&gt;
&lt;td&gt;N/A (no limit)&lt;/td&gt;
&lt;td&gt;1.5M+&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;7+ GB/s&lt;/td&gt;
&lt;td&gt;&amp;lt;100μs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gp2 (legacy)&lt;/td&gt;
&lt;td&gt;100–16,000&lt;/td&gt;
&lt;td&gt;16,000&lt;/td&gt;
&lt;td&gt;128–250 MB/s&lt;/td&gt;
&lt;td&gt;250 MB/s&lt;/td&gt;
&lt;td&gt;sub-ms, variable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;gp3 is fine for most workloads. io2 Block Express exists for when you need both high IOPS and predictable sub-millisecond latency — production databases where P99 consistency matters. Instance store NVMe is a different category entirely: no network hop, no EBS controller overhead, microsecond-level latency, but ephemeral.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed Storage (Longhorn, Ceph, OpenEBS)
&lt;/h3&gt;

&lt;p&gt;Running distributed storage in Kubernetes adds at least one network hop per I/O operation. Longhorn replicates data across nodes — a write to a 3-replica Longhorn volume becomes one local write plus two network-replicated writes. This fundamentally changes the performance profile:&lt;/p&gt;

&lt;p&gt;Latency increases by the network RTT plus the slowest replica's write time. IOPS drops because each logical operation becomes multiple physical operations. Throughput is bounded by network bandwidth between nodes.&lt;/p&gt;

&lt;p&gt;In a homelab K3s cluster with 1 Gbps links, a Longhorn volume with 3 replicas maxes out around 100 MB/s throughput regardless of underlying disk capability. With 10 Gbps links, you might reach 400–600 MB/s. Ceph (with RBD or CephFS) performs better at scale due to its CRUSH algorithm distributing I/O across OSDs, but the principle remains: every network hop adds latency, and replication multiplies write load.&lt;/p&gt;

&lt;p&gt;For latency-sensitive workloads (databases, etcd), use local storage or hostPath. Reserve distributed storage for workloads where data availability matters more than raw performance.&lt;/p&gt;
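&lt;p&gt;A deliberately crude model of the replication tax on writes (the disk figure is an assumption for illustration):&lt;/p&gt;

```python
def replicated_write_ceiling_mb_s(disk_mb_s: float, nic_gbps: float, replicas: int) -> float:
    """Crude upper bound: each write lands locally and crosses the NIC to (replicas - 1) peers."""
    nic_mb_s = nic_gbps * 1000 / 8
    return min(disk_mb_s, nic_mb_s / (replicas - 1))

# Assumed 500 MB/s local disk: 1 Gbps links cap writes far below it; 10 Gbps does not.
gige = replicated_write_ceiling_mb_s(500, 1, 3)      # 62.5 MB/s: NIC-bound
ten_gige = replicated_write_ceiling_mb_s(500, 10, 3) # 500 MB/s: disk-bound
```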




&lt;h2&gt;
  
  
  4. Network Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bandwidth vs Throughput
&lt;/h3&gt;

&lt;p&gt;Network bandwidth is the raw capacity of the link (e.g., 25 Gbps for a c6i.8xlarge). Network throughput is how much of that bandwidth you actually use. The gap between them is caused by protocol overhead, packet loss, latency (the bandwidth-delay product), and application behavior.&lt;/p&gt;

&lt;p&gt;A 25 Gbps link with 1ms RTT and a 64KB TCP window can only sustain about 500 Mbps, because the sender waits for acknowledgments before sending more data. This is why TCP window tuning and BBR congestion control exist.&lt;/p&gt;
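&lt;p&gt;The bandwidth-delay product arithmetic behind that claim, as a sketch:&lt;/p&gt;

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """At most one window can be unacknowledged per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

def bdp_bytes(gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return gbps * 1e9 / 8 * (rtt_ms / 1000)

capped = window_limited_mbps(64 * 1024, 1.0)  # ~524 Mbps: the "about 500 Mbps" case
needed = bdp_bytes(25, 1.0)                   # ~3.1 MB window to fill 25 Gbps at 1ms
```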

&lt;h3&gt;
  
  
  Packets Per Second (PPS)
&lt;/h3&gt;

&lt;p&gt;PPS is the network equivalent of IOPS. Every instance type has a PPS limit, and small packets exhaust it before bandwidth is saturated.&lt;/p&gt;

&lt;p&gt;Consider a c6i.xlarge with roughly 12.5 Gbps bandwidth and approximately 1.5M PPS. Sending 64-byte packets: &lt;code&gt;1,500,000 packets/s × 64 bytes × 8 bits/byte = 768 Mbps&lt;/code&gt;. You hit the PPS ceiling at under 1 Gbps despite having 12.5 Gbps available.&lt;/p&gt;
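&lt;p&gt;The same arithmetic in code, showing how packet size decides which limit binds first (using the rough PPS figure assumed above):&lt;/p&gt;

```python
def mbps_at_pps(pps: int, packet_bytes: int) -> float:
    """Bandwidth consumed when running at a given packet rate."""
    return pps * packet_bytes * 8 / 1e6

tiny = mbps_at_pps(1_500_000, 64)   # 768.0 Mbps: PPS ceiling binds long before bandwidth
mtu = mbps_at_pps(1_500_000, 1500)  # 18,000 Mbps: a 12.5 Gbps link would cap first
```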

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Packet Size&lt;/th&gt;
&lt;th&gt;PPS Needed&lt;/th&gt;
&lt;th&gt;Bandwidth Used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTTP API (small JSON)&lt;/td&gt;
&lt;td&gt;~200 bytes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video streaming&lt;/td&gt;
&lt;td&gt;~1400 bytes&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS resolution&lt;/td&gt;
&lt;td&gt;~60 bytes&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database replication&lt;/td&gt;
&lt;td&gt;~1400 bytes&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Moderate–High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why DNS servers, load balancers, and services handling many small requests per second need instance types with high PPS limits, not just high bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round-Trip Time (RTT)
&lt;/h3&gt;

&lt;p&gt;RTT is network latency — the time for a packet to travel to a destination and back.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Typical RTT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same AZ (same region)&lt;/td&gt;
&lt;td&gt;0.1–0.5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-AZ (same region)&lt;/td&gt;
&lt;td&gt;0.5–2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-region (e.g., us-east to eu-west)&lt;/td&gt;
&lt;td&gt;60–100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;To internet (varies)&lt;/td&gt;
&lt;td&gt;10–200 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cross-AZ latency is small per request but compounds under load. A service making 10 sequential cross-AZ calls per request adds 5–20ms to every response. For microservices architectures, this is the primary source of latency inflation that teams fail to account for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jitter
&lt;/h3&gt;

&lt;p&gt;Jitter is latency variance — the difference between the best and worst RTT. Consistent 2ms RTT is manageable. RTT oscillating between 0.5ms and 15ms destroys application predictability. Jitter typically comes from network congestion, noisy neighbors on shared infrastructure, or garbage collection pauses in application-level proxies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load Balancer Impact
&lt;/h3&gt;

&lt;p&gt;Every load balancer adds latency. AWS ALB adds 1–5ms per request depending on load. NLB adds ~100μs (it operates at Layer 4 with no HTTP parsing). A service mesh sidecar proxy (Envoy) adds 0.5–2ms per hop.&lt;/p&gt;

&lt;p&gt;For a request flowing through: Client → ALB → Envoy sidecar → Service → Envoy sidecar → Database, you accumulate roughly 3–10ms of infrastructure latency before your application code runs a single instruction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Networking Overhead
&lt;/h3&gt;

&lt;p&gt;Containers introduce additional network processing. Every packet traverses the host's network namespace, crosses a veth pair, passes through iptables/nftables rules (or eBPF programs in Cilium), and potentially hits a CNI overlay network.&lt;/p&gt;

&lt;p&gt;In Kubernetes, a kube-proxy iptables-based setup with 10,000 services creates approximately 40,000+ iptables rules. Each packet is evaluated against these rules linearly or via ipset. This is why IPVS mode and eBPF-based CNIs (Cilium) exist — they reduce per-packet processing overhead from O(n) to O(1) with hash-based lookups.&lt;/p&gt;

&lt;p&gt;The practical impact: on a node running 50+ services with iptables kube-proxy, PPS capacity can drop 10–20% purely from kernel networking overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Compute Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CPU Saturation
&lt;/h3&gt;

&lt;p&gt;CPU utilization at 100% does not mean your application is working at maximum efficiency. It means your CPU has no idle cycles, but those cycles may be spent on context switches, kernel overhead, or spinlocks. The distinction between user time (your code), system time (kernel calls), and iowait (waiting for I/O) matters.&lt;/p&gt;

&lt;p&gt;A machine at 90% CPU where 30% is system time has a fundamentally different bottleneck than one at 90% where 85% is user time. The first is making too many system calls or handling too many interrupts; the second is genuinely compute-bound.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU Steal
&lt;/h3&gt;

&lt;p&gt;In virtualized environments (all of EC2), CPU steal indicates the hypervisor allocated your vCPU time to another instance. Steal above 5% means your workload is being throttled by the physical host. Burstable instances (t3, t3a) are particularly affected — once CPU credits are exhausted, steal effectively caps your performance.&lt;/p&gt;

&lt;p&gt;On dedicated or metal instances, steal is zero. For latency-sensitive production workloads, this is not a trivial difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Switching
&lt;/h3&gt;

&lt;p&gt;Every context switch costs 2–10μs depending on the CPU and cache state. An application doing 50,000 context switches per second spends 100–500ms of CPU time per second just switching — up to 50% of a core. Goroutine-heavy Go applications, thread-per-request Java services, and heavily multiplexed event loops all exhibit different context switching profiles.&lt;/p&gt;

&lt;p&gt;Monitor &lt;code&gt;vmstat&lt;/code&gt; and &lt;code&gt;/proc/pid/status&lt;/code&gt; context switch counters. If voluntary context switches are high, your application is blocking on I/O. If involuntary context switches are high, you have too many runnable threads competing for CPU.&lt;/p&gt;
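&lt;p&gt;A Linux-only sketch for reading those counters from &lt;code&gt;/proc&lt;/code&gt; (returns &lt;code&gt;None&lt;/code&gt; elsewhere):&lt;/p&gt;

```python
from pathlib import Path

def context_switches(pid: str = "self"):
    """Read voluntary/nonvoluntary context-switch counters from /proc/[pid]/status."""
    status = Path(f"/proc/{pid}/status")
    if not status.exists():
        return None  # non-Linux host
    counts = {}
    for line in status.read_text().splitlines():
        if "ctxt_switches" in line:
            key, _, value = line.partition(":")
            counts[key] = int(value)
    return counts

cs = context_switches()  # e.g. {'voluntary_ctxt_switches': ..., 'nonvoluntary_ctxt_switches': ...}
```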

&lt;h3&gt;
  
  
  Memory Bandwidth
&lt;/h3&gt;

&lt;p&gt;Modern CPUs can execute billions of operations per second, but they can only fetch data from RAM at 50–80 GB/s (dual-channel DDR4/DDR5). Workloads that scan large datasets — sorting, filtering, columnar analytics, ML inference — are often memory-bandwidth bound, not compute-bound. The CPU spends cycles stalled, waiting for data from memory.&lt;/p&gt;

&lt;p&gt;This manifests as moderate CPU utilization (60–70%) that refuses to go higher regardless of thread count. &lt;code&gt;perf stat&lt;/code&gt; showing high LLC (last-level cache) miss rates confirms this.&lt;/p&gt;

&lt;h3&gt;
  
  
  NUMA Considerations
&lt;/h3&gt;

&lt;p&gt;Multi-socket servers (common in metal instances like i3en.metal, c5.metal) have Non-Uniform Memory Access architectures. Memory attached to the local socket has ~80ns access time; memory on the remote socket takes ~140ns. A thread accessing remote NUMA memory pays a 75% latency penalty on every cache miss.&lt;/p&gt;

&lt;p&gt;For databases and latency-sensitive services running on metal instances, NUMA-aware scheduling and memory binding (via numactl or cgroup cpuset) is not optional — it is the difference between consistent and inconsistent performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Workloads and Data Pipelines
&lt;/h3&gt;

&lt;p&gt;GPU-accelerated workloads introduce a new throughput/latency tradeoff: PCIe or NVLink bandwidth between CPU and GPU memory. An A100 GPU can perform 156 TFLOPS (TF32; 312 with structured sparsity), but it must be fed data fast enough. PCIe Gen4 x16 provides ~25 GB/s; NVLink provides up to 600 GB/s.&lt;/p&gt;

&lt;p&gt;If your training pipeline reads data from EBS, preprocesses on CPU, and transfers to GPU, the bottleneck chain is: EBS throughput → CPU preprocessing speed → PCIe transfer bandwidth → GPU compute. Optimizing GPU utilization is useless if the data pipeline cannot keep up. This is why SageMaker and training instances use NVMe instance storage for dataset staging, and why data loader workers run in parallel.&lt;/p&gt;
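&lt;p&gt;The bottleneck-chain logic reduces to a &lt;code&gt;min()&lt;/code&gt; over stage rates; the numbers below are assumptions for illustration:&lt;/p&gt;

```python
def bottleneck(stages_gb_s: dict):
    """A pipeline moves data at the rate of its slowest stage."""
    name = min(stages_gb_s, key=stages_gb_s.get)
    return name, stages_gb_s[name]

# Illustrative (assumed) stage rates for a training input pipeline, in GB/s.
stage, rate = bottleneck({
    "ebs_read": 1.0,
    "cpu_preprocess": 2.5,
    "pcie_gen4_x16": 25.0,
    "gpu_consume": 4.0,
})
# The GPUs starve at 1.0 GB/s no matter how fast they compute.
```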

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compute Metric&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;Watch For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU User %&lt;/td&gt;
&lt;td&gt;Time running application code&lt;/td&gt;
&lt;td&gt;&amp;gt;85% = genuinely compute-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU System %&lt;/td&gt;
&lt;td&gt;Time in kernel&lt;/td&gt;
&lt;td&gt;&amp;gt;20% = excessive syscalls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU Steal %&lt;/td&gt;
&lt;td&gt;Time stolen by hypervisor&lt;/td&gt;
&lt;td&gt;&amp;gt;5% = noisy neighbor or credit exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context switches/sec&lt;/td&gt;
&lt;td&gt;Thread scheduling overhead&lt;/td&gt;
&lt;td&gt;&amp;gt;50K/core = investigate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLC miss rate&lt;/td&gt;
&lt;td&gt;Cache misses hitting RAM&lt;/td&gt;
&lt;td&gt;&amp;gt;10% = memory-bandwidth bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPC (instructions/cycle)&lt;/td&gt;
&lt;td&gt;CPU efficiency&lt;/td&gt;
&lt;td&gt;&amp;lt;1.0 = stalled on memory/branch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Kubernetes Perspective
&lt;/h2&gt;

&lt;h3&gt;
  
  
  etcd Latency Sensitivity
&lt;/h3&gt;

&lt;p&gt;etcd is a consensus-based key-value store. Every write requires a quorum of disk fsyncs across the cluster. etcd's official recommendation is that 99th percentile disk fsync latency stays below 10ms. On EBS gp3, fsync latency during normal operation is 1–4ms but can spike to 20–50ms during EBS maintenance events or when the volume's provisioned IOPS are saturated.&lt;/p&gt;

&lt;p&gt;Symptoms of etcd performance problems: API server request latency increases, pod scheduling slows down, leader elections become unstable. The fix is straightforward — run etcd on io2 volumes or local NVMe storage. This is not premature optimization; it is a requirement for clusters above 50 nodes.&lt;/p&gt;
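&lt;p&gt;A rough way to sanity-check fsync latency on a node (not a substitute for &lt;code&gt;fio&lt;/code&gt; or etcd's own metrics):&lt;/p&gt;

```python
import os
import tempfile
import time

def fsync_p99_ms(samples: int = 50, payload: bytes = b"x" * 4096) -> float:
    """Time 4KB write+fsync pairs, roughly what an etcd WAL append performs."""
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(path)
    latencies.sort()
    return latencies[(len(latencies) * 99) // 100]  # p99-ish, in ms

p99 = fsync_p99_ms()  # compare against the 10ms recommendation
```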

&lt;h3&gt;
  
  
  Image Pulls: Throughput + IOPS Combined
&lt;/h3&gt;

&lt;p&gt;Pulling a container image is a mixed workload. Downloading layers is network-throughput bound. Extracting and writing layers to the container filesystem is storage IOPS and throughput bound. On a node with a slow disk, image pulls can take 30–60 seconds for a 500MB image, which directly impacts pod startup latency and autoscaling responsiveness.&lt;/p&gt;

&lt;p&gt;Using containerd with zstd compression, image streaming (lazy pulling via stargz/nydus), or pre-cached images on AMIs eliminates this bottleneck. In CI/CD environments, this is often the difference between a 2-minute and a 10-minute pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Mesh Overhead
&lt;/h3&gt;

&lt;p&gt;Each Envoy sidecar proxy consumes 50–100MB of memory and adds 0.5–2ms latency per hop. For a request traversing 5 services in a service mesh, that is 5–20ms of mesh overhead alone. Multiply by request volume, and sidecar proxies can consume 10–15% of cluster compute resources.&lt;/p&gt;

&lt;p&gt;The decision to deploy a full mesh (Istio, Linkerd) vs a simpler approach (ambient mesh, direct service communication) should be based on measured overhead against measured security/observability value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autoscaling Effects on Latency
&lt;/h3&gt;

&lt;p&gt;Horizontal Pod Autoscaler (HPA) reacts to metrics with a delay: metrics collection (15s default) → scaling decision (15s default) → pod creation → image pull → readiness probe → load balancer registration. The total cold-start latency is typically 30–120 seconds.&lt;/p&gt;

&lt;p&gt;During this window, existing pods handle all load. If your scaling threshold is 70% CPU, you may see P99 latency spike as existing pods saturate beyond 90% while waiting for new pods. Solutions: use KEDA with request-rate metrics for faster scaling, over-provision slightly (target 50–60%), or use Knative/serverless for workloads with extreme burst patterns.&lt;/p&gt;
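&lt;p&gt;A back-of-the-envelope delay budget (stage durations are assumptions; measure your own cluster):&lt;/p&gt;

```python
# Illustrative (assumed) stage durations in seconds for one scale-out event.
scale_out_stages = {
    "metrics_scrape": 15,
    "hpa_sync": 15,
    "scheduling": 2,
    "image_pull": 30,
    "readiness_probe": 10,
    "lb_registration": 10,
}

# Existing pods absorb all load for this entire window.
cold_start_s = sum(scale_out_stages.values())  # 82s, inside the typical 30-120s range
```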




&lt;h2&gt;
  
  
  7. Real-World Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: PostgreSQL Production Database
&lt;/h3&gt;

&lt;p&gt;A 16-core RDS instance on io2 with 50,000 provisioned IOPS and 1,000 MB/s throughput. The application reports slow queries, but RDS CloudWatch shows CPU at 40% and IOPS at 12,000. The bottleneck is not what you think.&lt;/p&gt;

&lt;p&gt;Investigation reveals: the application makes 15 sequential queries per request, each taking 2ms for database processing plus 0.8ms cross-AZ network RTT (application runs in a different AZ from the database). Total latency per request: 15 × 2.8ms = 42ms. Moving the application to the same AZ drops this to 15 × 2.1ms = 31.5ms — a 25% improvement with zero database changes.&lt;/p&gt;
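&lt;p&gt;The arithmetic as a sketch:&lt;/p&gt;

```python
def request_latency_ms(queries: int, db_ms: float, rtt_ms: float) -> float:
    """Sequential queries pay the full network round trip on every call."""
    return queries * (db_ms + rtt_ms)

cross_az = request_latency_ms(15, 2.0, 0.8)  # 42.0 ms per request
same_az = request_latency_ms(15, 2.0, 0.1)   # 31.5 ms: ~25% faster, zero DB changes
```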

&lt;p&gt;Lesson: network latency between application and database is multiplied by query count per request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: AI/ML Training Pipeline on p4d.24xlarge
&lt;/h3&gt;

&lt;p&gt;Eight A100 GPUs connected via NVLink. Training throughput is 40% below expected. GPU utilization shows frequent drops to 0%. The data pipeline cannot feed the GPUs fast enough.&lt;/p&gt;

&lt;p&gt;The dataset is stored on an EFS mount delivering 200 MB/s. Eight GPUs need approximately 3–4 GB/s of preprocessed data. Fix: stage the dataset to local NVMe instance storage (4× 1.9TB NVMe SSDs providing 14 GB/s aggregate throughput), use multiple data loader workers for CPU preprocessing, and prefetch batches asynchronously.&lt;/p&gt;

&lt;p&gt;Result: GPU utilization goes from 40% to 92%. Training time drops by 55%.&lt;/p&gt;

&lt;p&gt;Lesson: GPU throughput is gated by the slowest link in the data pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: CI/CD Pipeline on Kubernetes
&lt;/h3&gt;

&lt;p&gt;Jenkins agents running on m6i.xlarge nodes. Build times are 12 minutes, with 4 minutes spent on &lt;code&gt;docker build&lt;/code&gt;. Investigation shows: each build pulls base images (network throughput), runs &lt;code&gt;npm install&lt;/code&gt; downloading 800 packages (network IOPS — hundreds of small HTTP requests), compiles (CPU), and pushes the final image (network throughput + storage IOPS).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;npm install&lt;/code&gt; phase is slow not because of bandwidth but because of per-request latency: 800 sequential HTTPS requests to the registry, each with TLS handshake, DNS resolution, and connection setup. Fix: deploy a local npm proxy (Verdaccio), warm the cache, reduce per-request latency from 50ms to 2ms. &lt;code&gt;npm install&lt;/code&gt; drops from 90 seconds to 8 seconds.&lt;/p&gt;

&lt;p&gt;Lesson: high operation count with sequential latency dominates over raw throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 4: High-Traffic API Gateway (50,000 RPS)
&lt;/h3&gt;

&lt;p&gt;An API gateway running on c6i.4xlarge behind an NLB. At 50K RPS with ~200-byte average responses, bandwidth usage is only 80 Mbps — nowhere near the 12.5 Gbps limit. But the instance is approaching its PPS ceiling.&lt;/p&gt;

&lt;p&gt;Each request-response pair generates approximately 10–15 packets (TCP handshake, request, response, ACK, FIN). At 50K RPS, that is 500K–750K PPS. The c6i.4xlarge has roughly 1.5M PPS capacity for ENA. The system is at 50% PPS capacity, and with TLS termination and connection tracking overhead, the effective ceiling is lower.&lt;/p&gt;

&lt;p&gt;Fix: enable TCP keepalive and HTTP/2 to amortize connection setup. This drops per-request packet count from 10–15 to 2–3 for persistent connections, cutting PPS at the same RPS by roughly 70–80%.&lt;/p&gt;

&lt;p&gt;Lesson: PPS, not bandwidth, is the limiting factor for high-RPS small-payload services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 5: Log Ingestion Stack (Loki + Promtail)
&lt;/h3&gt;

&lt;p&gt;A Loki cluster ingesting 50 GB/day of logs from 200 pods. Write latency is fine, but queries over 24-hour ranges take 30+ seconds. The bottleneck: Loki stores log chunks as compressed blocks on S3. Each query retrieves hundreds of small objects. S3 GET latency is 20–50ms per request, and Loki issues them sequentially per chunk.&lt;/p&gt;

&lt;p&gt;With 500 chunks to read for a 24-hour query: 500 × 30ms = 15 seconds minimum, just on S3 GET latency. Adding compaction reduces chunk count. Deploying a Loki caching layer (memcached for chunk and index caches) drops repeat-query latency from 30s to 2s. Enabling parallel chunk fetching helps first-query performance.&lt;/p&gt;
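&lt;p&gt;A toy simulation of why parallel chunk fetching helps, with sleeps standing in for S3 GETs (scaled down so it runs quickly):&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_all_s(chunks: int, get_latency_s: float, workers: int) -> float:
    """Fetch chunks with a worker pool; return wall-clock seconds."""
    def fake_get(_):
        time.sleep(get_latency_s)  # stand-in for one object-store round trip
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_get, range(chunks)))
    return time.perf_counter() - start

# 60 chunks at 10ms each: ~0.6s one at a time, a small fraction of that with 16 workers.
sequential_s = fetch_all_s(60, 0.01, 1)
parallel_s = fetch_all_s(60, 0.01, 16)
```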

&lt;p&gt;Lesson: object storage latency is per-operation, and query performance depends on minimizing operation count.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Metrics That Senior Engineers Watch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  P95 and P99 Latency
&lt;/h3&gt;

&lt;p&gt;Average latency is almost useless. A service with 5ms average and 500ms P99 is broken for 1% of users. P99 captures the experience of real users during peak load, garbage collection pauses, and infrastructure hiccups. For public-facing services, track P99. For internal SLOs, P95 is often sufficient.&lt;/p&gt;
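&lt;p&gt;A small illustration of how an average hides the tail:&lt;/p&gt;

```python
def p99_ms(latencies: list) -> float:
    """Return the 99th-percentile value of a latency sample."""
    ranked = sorted(latencies)
    return ranked[(len(ranked) * 99) // 100]

# 990 fast requests and 10 catastrophically slow ones per thousand.
sample = [5.0] * 990 + [500.0] * 10
average = sum(sample) / len(sample)  # 9.95 ms: looks healthy
tail = p99_ms(sample)                # 500.0 ms: what 1% of users actually see
```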

&lt;h3&gt;
  
  
  Tail Latency
&lt;/h3&gt;

&lt;p&gt;Beyond P99, P99.9 and P99.99 latencies expose systemic issues: NUMA misses, TCP retransmissions, EBS latency spikes, or Go runtime GC pauses. At 10,000 RPS, P99.9 affects 10 requests per second — enough to trigger downstream timeouts and cascading failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throttling
&lt;/h3&gt;

&lt;p&gt;EBS IOPS throttling, EC2 network bandwidth throttling, API Gateway rate limiting, and CPU credit exhaustion all manifest similarly: sudden performance cliffs with no gradual degradation. Monitor &lt;code&gt;VolumeQueueLength&lt;/code&gt; for EBS, the ENA driver's &lt;code&gt;bw_in_allowance_exceeded&lt;/code&gt; and &lt;code&gt;pps_allowance_exceeded&lt;/code&gt; counters for EC2, and &lt;code&gt;CPUCreditBalance&lt;/code&gt; for burstable instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Queue Length and Concurrency
&lt;/h3&gt;

&lt;p&gt;Queue length is the universal health indicator. A disk with queue depth growing from 4 to 32 is saturated. A Kubernetes scheduler queue backing up means the control plane cannot keep pace. A load balancer with growing active connections and declining throughput indicates backend saturation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;th&gt;Red Flag Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P99 latency&lt;/td&gt;
&lt;td&gt;Worst-case user experience&lt;/td&gt;
&lt;td&gt;&amp;gt;10× median&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EBS VolumeQueueLength&lt;/td&gt;
&lt;td&gt;Disk saturation&lt;/td&gt;
&lt;td&gt;&amp;gt;4 for gp3, &amp;gt;16 for io2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU steal&lt;/td&gt;
&lt;td&gt;Hypervisor contention&lt;/td&gt;
&lt;td&gt;&amp;gt;5% sustained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network PPS&lt;/td&gt;
&lt;td&gt;Packet processing saturation&lt;/td&gt;
&lt;td&gt;&amp;gt;70% of instance limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod restart count&lt;/td&gt;
&lt;td&gt;OOM kills or liveness failures&lt;/td&gt;
&lt;td&gt;Any non-zero in prod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;etcd fsync P99&lt;/td&gt;
&lt;td&gt;Control plane health&lt;/td&gt;
&lt;td&gt;&amp;gt;10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active connections&lt;/td&gt;
&lt;td&gt;Load balancer / backend saturation&lt;/td&gt;
&lt;td&gt;Sustained growth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  9. Common DevOps Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Optimizing IOPS when network is the bottleneck.&lt;/strong&gt; An application making cross-AZ database calls has its latency dominated by network RTT, not disk I/O. Upgrading from gp3 to io2 saves 0.5ms per query, but the 1.5ms cross-AZ RTT remains unchanged. Fix the network path first — move the application into the same AZ, or use read replicas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confusing bandwidth with throughput.&lt;/strong&gt; A 25 Gbps instance can move 25 Gbps of data only under ideal conditions — large transfer windows, no packet loss, no application bottlenecks. Real-world throughput with default TCP settings, small transfers, and TLS overhead is often 5–10 Gbps on the same hardware. Tune TCP buffer sizes and window scaling before upgrading instance types.&lt;/p&gt;
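&lt;p&gt;The gap between link bandwidth and achieved throughput is often window-limited: a single TCP flow cannot exceed window / RTT. A rough sketch with illustrative numbers:&lt;/p&gt;

```python
def tcp_throughput_gbps(window_bytes: int, rtt_ms: float) -> float:
    """Single-flow TCP ceiling: window / RTT. Ignores loss, TLS, and app stalls."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e9

# A 256 KB window over a 1.5 ms cross-AZ path caps the flow well below the NIC:
single_flow = tcp_throughput_gbps(256 * 1024, 1.5)  # ~1.4 Gbps on a 25 Gbps instance

# Filling 25 Gbps at that RTT needs a window of at least bandwidth x RTT (the BDP):
bdp_bytes = int(25e9 / 8 * 1.5e-3)                  # 4,687,500 bytes, ~4.5 MB
```

&lt;p&gt;This is the arithmetic behind "tune TCP buffer sizes first": no instance-type upgrade raises a ceiling imposed by an undersized window.&lt;/p&gt;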

&lt;p&gt;&lt;strong&gt;Ignoring CPU scheduling delays.&lt;/strong&gt; Kubernetes pods with no CPU requests run as best-effort workloads. Under contention, the CFS scheduler introduces latency that does not appear as CPU utilization — it appears as increased response time. Setting CPU requests (not limits, which cause throttling) ensures CFS allocates a fair share without hard capping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using average latency for SLOs.&lt;/strong&gt; A service reporting 5ms average latency with 200ms P99 is not meeting a 50ms SLO for 1% of traffic. Averages mask bimodal distributions where most requests are fast and a minority are catastrophically slow.&lt;/p&gt;
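&lt;p&gt;A toy distribution makes the masking effect concrete (the numbers are illustrative):&lt;/p&gt;

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile; p is 0-100."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Bimodal sample: 98 fast requests, 2 catastrophically slow ones.
samples_ms = [4.0] * 98 + [200.0] * 2

avg_ms = mean(samples_ms)            # 7.92, comfortably "inside" a 50 ms SLO
p50_ms = percentile(samples_ms, 50)  # 4.0
p99_ms = percentile(samples_ms, 99)  # 200.0, 4x over the SLO for 2% of traffic
```

&lt;p&gt;The average reports a healthy service while one request in fifty blows the budget by 4×. Alert on percentiles, not means.&lt;/p&gt;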

&lt;p&gt;&lt;strong&gt;Over-provisioning compute, under-provisioning storage.&lt;/strong&gt; Teams regularly deploy 32-core instances with gp3 defaults (3,000 IOPS, 125 MB/s). A single core doing synchronous I/O at queue depth 1 can generate 5,000+ IOPS from the application layer. Even a moderately I/O-intensive application on 32 cores can exhaust gp3 limits while CPU sits at 20%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring the replication tax on distributed storage.&lt;/strong&gt; Teams deploy Ceph or Longhorn with 3× replication and expect bare-metal performance. Every write becomes three writes. Write throughput is bounded by the slowest replica node's disk and the network path to it. Budget accordingly: divide raw disk throughput by the replication factor and subtract network overhead.&lt;/p&gt;
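&lt;p&gt;That budgeting rule can be sketched as a back-of-the-envelope function. The 10% network overhead figure is an assumption for illustration, not a measured constant:&lt;/p&gt;

```python
def effective_write_mbps(raw_disk_mbps: float, replication_factor: int,
                         network_overhead: float = 0.10) -> float:
    """Rough client-write budget for replicated storage (Ceph/Longhorn-style).

    Each client write fans out into `replication_factor` replica writes, and a
    fraction of capacity is lost to protocol and network overhead (assumed 10%).
    """
    return raw_disk_mbps / replication_factor * (1 - network_overhead)

# A 500 MB/s SSD behind 3x replication budgets to roughly 150 MB/s of client writes:
budget = effective_write_mbps(500, 3)  # ~150.0 MB/s
```

&lt;p&gt;If the application needs more than that budget, the fix is faster replica disks or a lower replication factor, not application tuning.&lt;/p&gt;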




&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;Performance in cloud infrastructure is not a single-dimension problem. Every request flowing through your system traverses network links, hits CPU scheduling decisions, waits for memory access, and performs I/O operations — and each of these layers has its own throughput ceiling, IOPS limit, and latency floor.&lt;/p&gt;

&lt;p&gt;The engineer who understands this treats performance as a pipeline analysis problem. You trace the request path, measure each segment, and find the narrowest pipe. The fix for a slow API is not always more replicas — it might be moving a database to the same AZ, switching from iptables to IPVS, or adding one SSD for etcd.&lt;/p&gt;

&lt;p&gt;Three principles that consistently produce better outcomes:&lt;/p&gt;

&lt;p&gt;First, measure before optimizing. Use &lt;code&gt;perf&lt;/code&gt;, &lt;code&gt;bpftrace&lt;/code&gt;, &lt;code&gt;iostat&lt;/code&gt;, &lt;code&gt;sar&lt;/code&gt;, CloudWatch, and Prometheus to identify where time is actually spent. Assumptions about bottleneck location are wrong more often than they are right.&lt;/p&gt;

&lt;p&gt;Second, understand the relationship between throughput, IOPS, and latency at every layer. A system bottlenecked on PPS needs different remediation than one bottlenecked on bandwidth, even though both are "network problems."&lt;/p&gt;

&lt;p&gt;Third, remember that latency is additive across layers while throughput is bounded by the weakest link. A request that passes through 6 components, each adding 2ms, has 12ms of infrastructure latency before your business logic executes. Design for shallow call graphs, minimize network hops, and keep latency-sensitive components close together — physically and architecturally.&lt;/p&gt;
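&lt;p&gt;The two bounds compose differently, which is worth making concrete. Hop names and numbers below are hypothetical:&lt;/p&gt;

```python
# Per-hop characteristics of a request path: (name, added_latency_ms, max_throughput_mbps)
hops = [
    ("load balancer",        1.0, 10_000),
    ("service mesh sidecar", 2.0,  8_000),
    ("app pod",              2.0,  5_000),
    ("cross-AZ link",        1.5,  4_000),
    ("database proxy",       2.0,  6_000),
    ("database",             3.5,  2_000),
]

# Latency is additive across layers...
total_latency_ms = sum(lat for _, lat, _ in hops)   # 12.0 ms before business logic
# ...while throughput is bounded by the weakest link.
path_throughput = min(tp for _, _, tp in hops)      # 2,000 Mbps: the database
```

&lt;p&gt;Removing any one hop cuts latency; only widening the narrowest hop raises throughput.&lt;/p&gt;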

&lt;p&gt;Performance engineering is not about knowing the maximum IOPS of gp3. It is about knowing which questions to ask when something is slow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published on &lt;a href="https://doc.thedevops.dev/" rel="noopener noreferrer"&gt;doc.thedevops.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>linux</category>
      <category>network</category>
    </item>
    <item>
      <title>eBPF- The Linux Superpower That Shows What Your Dashboards Miss</title>
      <dc:creator>Vlad Levinas</dc:creator>
      <pubDate>Sun, 15 Mar 2026 17:35:02 +0000</pubDate>
      <link>https://dev.to/vladlevinas/ebpf-the-linux-superpower-that-shows-what-your-dashboards-miss-33na</link>
      <guid>https://dev.to/vladlevinas/ebpf-the-linux-superpower-that-shows-what-your-dashboards-miss-33na</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A production-oriented guide for DevOps engineers, SREs, and Kubernetes platform teams who need visibility beyond what Prometheus and Grafana can provide.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. The Incident That Changed How I Debug
&lt;/h2&gt;

&lt;p&gt;The alert came in at 11:47pm. A payment API was timing out intermittently — not failing, not crashing, just occasionally returning responses that took eight seconds instead of eighty milliseconds. P99 latency was spiking. P50 looked fine. The dashboards showed nothing obviously wrong.&lt;/p&gt;

&lt;p&gt;Prometheus showed normal CPU utilization. Memory was healthy. Pod restarts were zero. Kubernetes events were clean. The application logs were noisy but inconclusive — timeout errors that said what happened, not why. The backend team checked the database. The network team checked the load balancer. Two hours passed.&lt;/p&gt;

&lt;p&gt;Then one engineer SSH'd into the node, ran a single command, and within ninety seconds had the answer: TCP retransmits between the API pods and the database pods were spiking to 40% on one specific node. Not a database problem. Not an application problem. A network problem on a node that all the upstream metrics had completely failed to surface.&lt;/p&gt;

&lt;p&gt;The command was &lt;code&gt;sudo tcpretrans-bpfcc&lt;/code&gt;. The program it loads runs inside the Linux kernel. It requires no agent. It requires no instrumentation. It just shows you what is actually happening at the TCP layer, in real time, on the host you are standing on.&lt;/p&gt;

&lt;p&gt;That is eBPF.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why DevOps Engineers Still SSH Into Servers
&lt;/h2&gt;

&lt;p&gt;Modern observability stacks are genuinely impressive. A mature Prometheus + Grafana + Loki + Jaeger deployment can answer most questions about a production system. Kubernetes dashboards provide real-time pod and node state. APM tools give deep application-layer trace visibility. So why do experienced engineers still SSH into nodes the moment an incident gets complicated?&lt;/p&gt;

&lt;p&gt;Because every observability tool operates above the kernel. Prometheus collects metrics that applications and exporters choose to expose. Logs contain what the application chooses to write. Traces show what the instrumented code path does. These tools are powerful, but they have a shared blind spot: anything that happens in the Linux kernel that the application does not know about, or does not choose to report, is invisible to them.&lt;/p&gt;

&lt;p&gt;Consider what this leaves uncovered in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unexplained latency&lt;/strong&gt; without application errors. A service is slow but healthy by every metric. The actual cause is scheduler latency — the container is waiting for CPU time it cannot get because a noisy neighbor process on the same node is consuming its entire CPU budget. No application metric captures this. No log line mentions it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intermittent failures&lt;/strong&gt; that appear random. A connection to a downstream service occasionally fails, but the failure rate is low enough that retry logic masks it. The actual cause is silent TCP retransmits happening beneath the application's socket abstraction. The application sees delayed responses, not errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent packet drops&lt;/strong&gt; in the kernel network stack. Packets are being dropped by conntrack table exhaustion, by an iptables rule mismatch, or by a full socket receive buffer. The application sees slowness. Prometheus shows nothing. The drop is happening inside the kernel before the application ever sees it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduler delays&lt;/strong&gt; causing tail latency. P99 is high but P50 is normal. The application logic takes the same time every request. The variable is how long the process waits in the run queue before getting CPU time. This is invisible to application instrumentation and requires kernel-level observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden process activity&lt;/strong&gt;. A container is spawning unexpected subprocesses — a misconfigured health check script, a log rotation cron job, a malicious process injected into a compromised container. Nothing in the application metric stream shows this. You only see it by watching what the kernel executes.&lt;/p&gt;

&lt;p&gt;eBPF closes this gap. It is the observability layer that lives inside the kernel itself, below every abstraction, with nothing the kernel does hidden from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. What eBPF Actually Is
&lt;/h2&gt;

&lt;p&gt;eBPF stands for extended Berkeley Packet Filter, though the name is now largely historical — modern eBPF has nothing to do with packets specifically. In practical DevOps terms, eBPF is a way to run small, safe programs inside the Linux kernel without modifying kernel source code and without loading kernel modules.&lt;/p&gt;

&lt;p&gt;When you run an eBPF program, it attaches to a kernel event — a function call, a system call, a network packet arrival, a tracepoint — and executes every time that event fires. The program can observe arguments, return values, timing, process context, and network data. It stores what it finds in BPF maps, which userspace tools then read to produce output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key properties that make eBPF practical for production:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety.&lt;/strong&gt; Every eBPF program passes through a kernel verifier before loading. The verifier statically analyzes the program to ensure it cannot loop infinitely, cannot access arbitrary kernel memory, cannot crash the kernel, and terminates in bounded time. An eBPF program that fails verification is rejected. This makes eBPF categorically safer than kernel modules, which can crash the entire system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero instrumentation.&lt;/strong&gt; eBPF programs attach to existing kernel events. They require no changes to the applications being observed. A Python web service and a Go microservice and a Java application all produce the same system calls, network events, and scheduler interactions — all observable via eBPF without modifying a single line of application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low overhead.&lt;/strong&gt; Well-written eBPF programs add nanoseconds to the events they attach to. This is acceptable in production. Traditional tracing approaches — strace, tcpdump in promiscuous mode, ptrace — impose overhead that makes them unusable on production systems under load. eBPF does not.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core eBPF Concepts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;kprobes&lt;/strong&gt; attach to arbitrary kernel functions. You can attach an eBPF program to any kernel function and inspect its arguments and return value. This is how &lt;code&gt;biolatency&lt;/code&gt; measures disk I/O latency — it attaches to kernel block I/O functions and records timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;uprobes&lt;/strong&gt; attach to userspace functions in any running binary. You can attach to a specific function in a Go binary, a Python library, or a JVM without recompiling it. This enables language-level tracing without language-level instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tracepoints&lt;/strong&gt; are stable kernel instrumentation points added by kernel developers specifically for observability. They are more stable across kernel versions than kprobes because they are part of the official kernel ABI. &lt;code&gt;tcp:tcp_retransmit_skb&lt;/code&gt; is a tracepoint — this is what &lt;code&gt;tcpretrans&lt;/code&gt; uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XDP (eXpress Data Path)&lt;/strong&gt; attaches eBPF programs at the network driver level, before packets enter the full kernel network stack. This enables packet filtering, modification, and forwarding at line rate — faster than any iptables rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Socket filters&lt;/strong&gt; attach to sockets and can inspect or filter packets at the socket level. This is how classic BPF worked in tcpdump, and how modern network security tools inspect traffic per-connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BPF maps&lt;/strong&gt; are data structures shared between eBPF programs and userspace. Hash maps, arrays, ring buffers, LRU maps — eBPF programs write data here, userspace tools read it. This is the mechanism by which kernel-level observations become human-readable output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perf events and ring buffers&lt;/strong&gt; are high-throughput channels for sending per-event data from eBPF programs to userspace. Tools like &lt;code&gt;execsnoop&lt;/code&gt; use ring buffers to stream process execution events in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. eBPF Architecture End to End
&lt;/h2&gt;

&lt;p&gt;The data flow in an eBPF-based tool follows a consistent path:&lt;br&gt;
&lt;strong&gt;1. Userspace tool&lt;/strong&gt; (execsnoop, biolatency, a custom bpftrace script) initiates a BPF program load. The program is compiled to eBPF bytecode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. BPF loader&lt;/strong&gt; (libbpf, the BCC library, the bpftrace runtime) submits the bytecode to the kernel via the &lt;code&gt;bpf()&lt;/code&gt; system call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Kernel verifier&lt;/strong&gt; analyzes the bytecode: bounded loops, valid memory access, correct map usage, no unsafe operations. If the program passes, the verifier JIT-compiles it to native machine code for near-native execution performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. eBPF program&lt;/strong&gt; is attached to a hook point — a kprobe, tracepoint, XDP hook, socket filter, or perf event. It begins executing on every occurrence of that event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. BPF maps and ring buffers&lt;/strong&gt; collect the data the program records: timestamps, PIDs, process names, IP addresses, byte counts, latency histograms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Userspace reads&lt;/strong&gt; the maps or subscribes to the ring buffer, processes the data, and produces the output the engineer sees — a sorted table, a latency histogram, a live stream of events.&lt;/p&gt;

&lt;p&gt;The critical insight is step 4: the program executes inside the kernel, with direct access to kernel data structures, with zero copies and zero context switches to gather its observations. This is why eBPF provides visibility that no external agent can match.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Why eBPF Is So Powerful for DevOps
&lt;/h2&gt;

&lt;p&gt;eBPF provides access to a class of observability data that is either impossible or impractically expensive to obtain any other way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process execution:&lt;/strong&gt; Every process the kernel executes generates a kernel event. eBPF can capture the process name, arguments, parent PID, container namespace, and user ID for every execution. A compromised container that spawns a reverse shell generates a &lt;code&gt;execve&lt;/code&gt; syscall that eBPF sees immediately — before the shell produces any network traffic or log output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File system access:&lt;/strong&gt; Every file open, every file read, every file write goes through the VFS layer. eBPF can observe every file access across every process on the node, filtered by path, duration, or process. An application that is slow because it is repeatedly re-reading a large configuration file is immediately visible — even if the application produces no logs about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TCP connection lifecycle:&lt;/strong&gt; TCP connects, accepts, resets, retransmits, and drops all generate kernel events. eBPF observes all of them, per-connection, with process and container attribution. This level of network visibility has no equivalent in application-layer metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk I/O latency:&lt;/strong&gt; Block I/O requests go through the kernel block layer. eBPF attaches to the block layer and measures actual latency per device, per operation type, with microsecond precision. This reveals whether a disk is saturated, whether specific operations are slow, and whether the latency is consistent or bimodal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduler behavior:&lt;/strong&gt; The Linux CFS scheduler tracks run queue length and scheduling delays. eBPF can measure how long processes wait in the run queue before receiving CPU time — the "scheduler latency" that causes tail latency spikes on CPU-constrained nodes without any application metric surfacing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container runtime behavior:&lt;/strong&gt; Containers are Linux namespaces and cgroups. eBPF operates at the kernel level and sees through these abstractions. It can observe all container processes, all container network flows, and all container file accesses without any cooperation from the container runtime.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. 15 eBPF Commands Every DevOps Engineer Should Know
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Installation note:&lt;/strong&gt; On Ubuntu/Debian, tools are available via &lt;code&gt;sudo apt install bpfcc-tools linux-headers-$(uname -r)&lt;/code&gt;. Commands are typically available as &lt;code&gt;execsnoop-bpfcc&lt;/code&gt;, &lt;code&gt;opensnoop-bpfcc&lt;/code&gt;, etc. On some distributions the suffix is omitted. &lt;code&gt;bpftrace&lt;/code&gt; is a separate higher-level scripting tool: &lt;code&gt;sudo apt install bpftrace&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  1. &lt;code&gt;execsnoop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every process execution on the system in real time — process name, arguments, PID, parent PID, and return code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When a container is doing something unexpected, when a noisy process is spawning subprocesses, when investigating potential container escape or unexpected program execution, when debugging init systems or entrypoint scripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;execsnoop-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;execsnoop-bpfcc &lt;span class="nt"&gt;-u&lt;/span&gt; www-data
&lt;span class="nb"&gt;sudo &lt;/span&gt;execsnoop-bpfcc &lt;span class="nt"&gt;--cgroupmap&lt;/span&gt; /sys/fs/cgroup/system.slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is one execution: PCOMM (the process's command name), PID, PPID, RET (exit code), ARGS (full command line). A flood of short-lived processes in ARGS is a sign of a looping script or a misbehaving health check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; A Kubernetes pod was consuming unexpected CPU. &lt;code&gt;execsnoop&lt;/code&gt; revealed that a misconfigured liveness probe was executing a shell script every second, and that script was itself spawning three child processes to check service state — roughly 240 process executions per minute, invisible in every dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Filter by user or cgroup to scope output to specific containers. Without filtering, a busy node generates substantial output.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. &lt;code&gt;opensnoop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every file open call across all processes — filename, PID, process name, file descriptor returned, and error code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When an application is slow without explanation and disk I/O might be involved. When debugging "file not found" errors that appear intermittently. When identifying which files an application accesses during startup. When auditing file access patterns for compliance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;opensnoop-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;opensnoop-bpfcc &lt;span class="nt"&gt;-p&lt;/span&gt; 1234
&lt;span class="nb"&gt;sudo &lt;/span&gt;opensnoop-bpfcc &lt;span class="nt"&gt;-T&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line shows a file open attempt. ERR column shows errno values — ENOENT means file not found, EACCES means permission denied. A high rate of ENOENT errors for the same path indicates a misconfigured application repeatedly failing to find a file it expects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; An application was slow on every tenth request with no log output. &lt;code&gt;opensnoop&lt;/code&gt; revealed it was attempting to open &lt;code&gt;/etc/ssl/certs/ca-bundle.crt&lt;/code&gt; on every tenth request (a certificate bundle that had been removed in a base image update) and falling back to a slower secondary path after the ENOENT error.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;code&gt;tcpconnect&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every outbound TCP connection — source and destination address and port, PID, process name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When a container is making unexpected outbound connections. When debugging connection failures between services. When investigating whether an application is connecting to the right endpoints after a configuration change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpconnect-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpconnect-bpfcc &lt;span class="nt"&gt;-p&lt;/span&gt; 1234
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpconnect-bpfcc &lt;span class="nt"&gt;-P&lt;/span&gt; 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is one TCP SYN sent. The latency column (if present) shows time to connection establishment. Connections to unexpected IPs indicate misconfiguration or compromise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; A microservice was connecting to a database endpoint that had been decommissioned three weeks earlier. The application was configured via an environment variable that was not updated. &lt;code&gt;tcpconnect&lt;/code&gt; showed the actual destination IP, revealing the configuration drift immediately. Application logs only showed "connection refused" with no destination detail.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. &lt;code&gt;tcpaccept&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every inbound TCP connection accepted — client address, server port, PID, process name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When measuring actual connection rates to a service. When debugging whether a service is accepting connections or queueing them. When auditing which clients are connecting to a service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpaccept-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpaccept-bpfcc &lt;span class="nt"&gt;-P&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each accepted connection appears with the client IP. A significant gap between the connection rate visible in &lt;code&gt;tcpconnect&lt;/code&gt; on the client and &lt;code&gt;tcpaccept&lt;/code&gt; on the server indicates connection queuing or backlog overflow.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. &lt;code&gt;tcpretrans&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every TCP retransmission — source and destination address, port, TCP state, and retransmit type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When API latency is elevated without application errors. When investigating intermittent connection issues between specific service pairs. When a node has unexplained network degradation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpretrans-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpretrans-bpfcc &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each retransmit line shows the affected connection. High retransmit rates on a specific destination IP indicate network issues on that path — switch congestion, link error rate, MTU mismatch, or failing NIC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; The payment API incident from the opening. Forty percent retransmit rate on connections from pods on one specific node to the database tier. The node's NIC had a failing transceiver. Metrics showed nothing because the operating system was successfully retransmitting and eventually delivering packets — just 50-200ms later than expected.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. &lt;code&gt;tcptop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; A continuously updated table of TCP connections sorted by throughput — bytes sent and received per connection per interval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When identifying which connections are consuming the most network bandwidth. When investigating network saturation on a node. When finding bandwidth consumers during a performance degradation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcptop-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcptop-bpfcc 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Like &lt;code&gt;top&lt;/code&gt; but for TCP connections. The highest bandwidth connections are at the top. A single connection consuming 90% of observed bandwidth is immediately visible.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. &lt;code&gt;biolatency&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; A histogram of block device I/O latency — how long disk operations actually take, bucketed by microsecond ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When a database or stateful application is slow and disk is suspected. When validating SSD vs spinning disk performance. When debugging Kubernetes persistent volume latency. When investigating whether ceph/NFS/remote storage is meeting latency SLOs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;biolatency-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;biolatency-bpfcc &lt;span class="nt"&gt;-D&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;biolatency-bpfcc 5 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; The histogram shows how disk operations are distributed by latency. A bimodal distribution with a large tail indicates intermittent slow operations — the classic "mostly fast, occasionally very slow" pattern that databases experience under mixed read/write workloads. Operations in the &amp;gt;10ms bucket from an SSD indicate a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; A PostgreSQL instance had P99 query latency of 800ms during write-heavy periods. &lt;code&gt;biolatency&lt;/code&gt; showed that 95% of writes completed in under 1ms, but 2% of writes took over 50ms. The storage backend was a distributed volume system that occasionally had hot spot contention. The bimodal distribution was invisible in average disk latency metrics.&lt;/p&gt;
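&lt;p&gt;The arithmetic from that incident shows why the average hid the problem. The bucket split below approximates the incident's shape; exact values are illustrative:&lt;/p&gt;

```python
# Fraction of writes vs their latency: 95% at ~0.5 ms, 2% at ~50 ms,
# remaining 3% at ~2 ms (illustrative split).
buckets = [(0.95, 0.5), (0.02, 50.0), (0.03, 2.0)]

avg_ms = sum(frac * lat for frac, lat in buckets)
# avg_ms is about 1.5 ms: "average disk latency" looks healthy,
# while 2% of writes (every 50th) stall for 50 ms and drag query P99 up.
```

&lt;p&gt;Only a latency histogram, which is exactly what &lt;code&gt;biolatency&lt;/code&gt; produces, exposes the second mode.&lt;/p&gt;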




&lt;h3&gt;
  
  
  8. &lt;code&gt;biosnoop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Every individual block I/O operation — process, PID, device, operation type (read/write), sector, bytes, and actual latency per operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When &lt;code&gt;biolatency&lt;/code&gt; shows a latency problem and you need to know which process, which files, and which operations are slow. When debugging which Kubernetes pod is saturating a shared disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;biosnoop-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;biosnoop-bpfcc &lt;span class="nt"&gt;-Q&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is one I/O operation. Sort by latency column to find the slowest individual operations. The process column shows which process initiated the I/O, enabling attribution of disk activity to specific containers.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. &lt;code&gt;fileslower&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; File read and write operations that exceed a latency threshold — which files, which processes, and how slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When an application is experiencing file I/O latency but you need to know which specific files are slow, not just that disk I/O is slow in aggregate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;fileslower-bpfcc 10
&lt;span class="nb"&gt;sudo &lt;/span&gt;fileslower-bpfcc 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is a slow file operation exceeding the threshold (milliseconds). The FILENAME column immediately shows which paths are experiencing latency. NFS mounts frequently appear here during remote filesystem degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; An application writing audit logs to an NFS mount was experiencing 200ms file write latency during peak hours. Application metrics showed no errors because writes eventually succeeded. &lt;code&gt;fileslower&lt;/code&gt; immediately showed &lt;code&gt;/mnt/nfs/audit/*.log&lt;/code&gt; operations taking 180-250ms. The NFS server was experiencing contention — invisible in every other tool.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. &lt;code&gt;runqlat&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; A histogram of scheduler run queue latency — how long processes wait between becoming runnable and actually running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When tail latency is high but application logic time is consistent. When CPU utilization metrics look normal but applications are slow. When investigating whether CPU limits or noisy neighbors are causing scheduling delays.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;runqlat-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;runqlat-bpfcc 5 3
&lt;span class="nb"&gt;sudo &lt;/span&gt;runqlat-bpfcc &lt;span class="nt"&gt;--pidnss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; The histogram shows distribution of wait times in the run queue. Most operations should be in the &amp;lt;100µs buckets. Significant mass in the 1ms-10ms range indicates CPU contention — processes are waiting longer than they should for CPU time. This directly causes tail latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; A service had P99 latency of 50ms when the business logic took under 5ms on every request. &lt;code&gt;runqlat&lt;/code&gt; showed most requests waiting 30-45ms in the scheduler queue. The node was CPU-oversubscribed: containers with low CPU requests were consuming far more than their nominal allocation, starving other containers. CPU metrics showed 80% utilization, below the alert threshold, but distributed in a way that caused severe scheduler delays.&lt;/p&gt;
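&lt;p&gt;Before reaching for eBPF, a plain &lt;code&gt;ps&lt;/code&gt; snapshot can show which processes are consuming the CPU behind that queueing (a generic triage step, not part of the original incident):&lt;/p&gt;

```shell
# rank processes by current CPU share; heavy consumers with small
# container CPU requests are the usual oversubscription suspects
ps -eo pid,comm,%cpu --sort=-%cpu | head -15
```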




&lt;h3&gt;
  
  
  11. &lt;code&gt;runqlen&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; A histogram of CPU run queue length — how many runnable processes are waiting for each CPU at sampling time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When diagnosing whether CPU pressure is causing latency. When identifying nodes with run queue imbalances across CPUs. When comparing CPU pressure between nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;runqlen-bpfcc
&lt;span class="nb"&gt;sudo &lt;/span&gt;runqlen-bpfcc 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Run queue lengths above 2-3 per CPU indicate pressure. A consistently high run queue length means processes spend significant time waiting to run — the direct cause of the scheduler latency that &lt;code&gt;runqlat&lt;/code&gt; measures.&lt;/p&gt;
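&lt;p&gt;Without eBPF, a coarser cross-check is the load average versus the CPU count; the fourth field of &lt;code&gt;/proc/loadavg&lt;/code&gt; reports currently runnable scheduling entities over the total:&lt;/p&gt;

```shell
# coarse run-queue sanity check: load averages plus runnable/total entities
cat /proc/loadavg
# number of CPUs to compare against; sustained load well above this
# suggests the per-CPU queue depths that runqlen measures directly
nproc
```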




&lt;h3&gt;
  
  
  12. &lt;code&gt;offcputime&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Stack traces and time spent by processes blocked off CPU — waiting for I/O, sleeping on locks, waiting for network responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When a service has low CPU utilization but is still slow. When thread counts are high but CPU is not saturated. When debugging lock contention, I/O blocking, or unexpected sleeps in application code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;offcputime-bpfcc &lt;span class="nt"&gt;-p&lt;/span&gt; 1234
&lt;span class="nb"&gt;sudo &lt;/span&gt;offcputime-bpfcc &lt;span class="nt"&gt;--stack-storage-size&lt;/span&gt; 16384 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Stack traces showing where threads spend time blocked. A Java application with dozens of threads all blocked on the same lock call site is immediately visible as a stacked flame of identical call stacks.&lt;/p&gt;




&lt;h3&gt;
  
  
  13. &lt;code&gt;profile&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; CPU profiling across all processes — stack traces sampled at a configurable rate showing where CPU time is actually spent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When CPU utilization is high but unclear which code path is consuming it. When profiling containerized applications without modifying them. When doing continuous profiling on production nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;profile-bpfcc &lt;span class="nt"&gt;-F&lt;/span&gt; 99 30
&lt;span class="nb"&gt;sudo &lt;/span&gt;profile-bpfcc &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; 49 60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Stack traces with sample counts. Higher counts mean more CPU time. Can be converted to FlameGraph format for visualization. Unlike language-specific profilers, this works on any process in any language with no instrumentation.&lt;/p&gt;
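&lt;p&gt;A typical conversion pipeline (assuming &lt;code&gt;flamegraph.pl&lt;/code&gt; from the FlameGraph repository is on your PATH; it is not part of the BCC package):&lt;/p&gt;

```shell
# 30 seconds of folded stacks at 99Hz (-f folds output, -a annotates kernel frames)
sudo profile-bpfcc -af -F 99 30 > profile.folded
# render the folded stacks; flamegraph.pl must be installed separately
flamegraph.pl profile.folded > profile.svg
```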




&lt;h3&gt;
  
  
  14. &lt;code&gt;tcpdrop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; TCP packets being dropped by the kernel — why they were dropped, which connection they belonged to, and the kernel stack trace at the point of drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When connections are being reset or dropped and the cause is unclear. When investigating conntrack table exhaustion. When debugging iptables rules that are silently dropping traffic. When a service works for most clients but fails for a subset with no pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpdrop-bpfcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is a dropped packet with drop reason. &lt;code&gt;conntrack: No route to host&lt;/code&gt; indicates conntrack exhaustion. Stack traces show the exact kernel code path that made the drop decision. This is the tool that makes silent packet drops visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real incident:&lt;/strong&gt; A Kubernetes cluster's conntrack table was intermittently exhausting during traffic spikes, causing random connection drops across the entire node. Every dashboard showed healthy services because most connections succeeded. &lt;code&gt;tcpdrop&lt;/code&gt; showed hundreds of drops per second with &lt;code&gt;nf_conntrack_full&lt;/code&gt; in the stack trace — the exact cause visible in seconds.&lt;/p&gt;




&lt;h3&gt;
  
  
  15. &lt;code&gt;mountsnoop&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it shows:&lt;/strong&gt; Mount and unmount system calls — which processes are mounting or unmounting filesystems, with arguments and return codes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When investigating container storage issues. When debugging persistent volume mount failures in Kubernetes. When auditing filesystem mount activity on a node. When a pod fails to start due to volume mount errors that the kubelet logs don't fully explain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;mountsnoop-bpfcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpreting output:&lt;/strong&gt; Each line is a mount or unmount syscall with the process, target path, filesystem type, and return code. Failed mounts (non-zero return codes) immediately show which paths and filesystem types are failing and why.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Ten Production Debugging Scenarios
&lt;/h2&gt;


&lt;h3&gt;
  
  
  Scenario 1: Kubernetes Pod Restarting — Reason Unclear
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; A pod shows increasing restart count in &lt;code&gt;kubectl get pods&lt;/code&gt;. The container exits cleanly (exit code 0) with no error logs. The liveness probe succeeds before the restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; Prometheus shows restart count increasing but no error rate, no CPU spike, no memory pressure. Kubernetes events say "container restarted" without elaboration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command that revealed the truth:&lt;/strong&gt; &lt;code&gt;execsnoop&lt;/code&gt; filtered to the pod's PID namespace showed the entrypoint script spawning a subprocess that failed silently; the parent script then exited cleanly after a fixed timeout. Not a crash, but a controlled exit triggered by the subprocess failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Fix the subprocess failure (a health check script referencing a removed endpoint). Restart count drops to zero.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 2: API Latency Spikes — TCP Retransmits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; P99 latency spikes to 400ms for ten-second windows, multiple times per hour. P50 is unaffected. Application logs show no errors during spikes. Prometheus TCP metrics are absent (not instrumented).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; Application metrics show latency distribution but not cause. Network metrics are absent because the application does not expose TCP-level data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command:&lt;/strong&gt; &lt;code&gt;tcpretrans&lt;/code&gt; showed 35% retransmit rate on connections from one specific source IP (a pod on a congested node) to the database tier. All latency spikes correlated exactly with retransmit activity on this path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Migrate affected pods away from the congested node. Investigate switch port error counters on the physical switch connected to that node.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 3: Database Node with Intermittent Disk Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; PostgreSQL reports occasional slow queries during write-heavy periods. WAL sync latency metrics are inconsistent. The storage team sees no alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; Average disk latency looks normal. The slow queries are outliers that average metrics hide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF commands:&lt;/strong&gt; &lt;code&gt;biolatency&lt;/code&gt; showed a bimodal distribution with 2% of operations taking over 50ms on the NVMe device. &lt;code&gt;biosnoop&lt;/code&gt; showed these slow operations were all write operations from the &lt;code&gt;postgres&lt;/code&gt; process to the WAL device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Investigate storage backend queue depth during write bursts. Tune PostgreSQL &lt;code&gt;wal_sync_method&lt;/code&gt; to reduce synchronous write frequency.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 4: Container Making Unexpected Outbound Connections
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; Security team flags unexpected outbound traffic from a container to an external IP. Application team denies intentional connectivity to that destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; Network flow logs exist but with a five-minute aggregation delay, making connection attribution to specific processes difficult. The destination is not in any known configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command:&lt;/strong&gt; &lt;code&gt;tcpconnect&lt;/code&gt; filtered to the container's PID showed the connection being made by a background thread in a third-party library, not the main application. The library was calling home to a telemetry endpoint added in a recent dependency update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Pin the dependency version, add a network policy to block unexpected egress, review all third-party library dependencies for undisclosed network activity.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 5: NFS Latency Causing Application Slowness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; An application writing to a shared NFS volume is slow during peak hours. The application team blames the NFS server. The infrastructure team says NFS server metrics look fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; NFS server aggregate metrics look normal because most clients are fine. The performance problem is specific to one mount path on one server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF commands:&lt;/strong&gt; &lt;code&gt;fileslower&lt;/code&gt; with a 5ms threshold showed &lt;code&gt;/mnt/nfs/data/&lt;/code&gt; write operations taking 100-300ms, with no slow operations on local paths. &lt;code&gt;biolatency&lt;/code&gt; on the same host showed a normal local block I/O distribution, ruling out local disks (NFS traffic bypasses the local block layer, so it never appears there).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Identify the specific NFS export experiencing contention. Move the application's write-heavy workload to a local volume or a dedicated NFS export with reserved I/O capacity.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 6: Noisy Neighbor Spawning Unexpected Subprocesses
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; All containers on a specific node begin showing elevated P99 latency simultaneously. No recent deployments. The node's aggregate CPU metrics look normal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; The problem is CPU distribution, not CPU total. Individual container CPU metrics show each container at reasonable utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF commands:&lt;/strong&gt; &lt;code&gt;execsnoop&lt;/code&gt; showed a batch processing container executing a shell loop that spawned 200+ short-lived processes per minute. &lt;code&gt;runqlen&lt;/code&gt; showed run queue depth of 8-12 on the node's CPUs, far above normal. &lt;code&gt;runqlat&lt;/code&gt; confirmed 30-50ms scheduler latency for all other processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Add CPU limits to the batch processing container, move it to dedicated nodes, or implement CFS bandwidth throttling to prevent run queue saturation.&lt;/p&gt;
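&lt;p&gt;The triage sequence above can be reproduced on any suspect node; &lt;code&gt;timeout&lt;/code&gt; keeps the snoop bounded (tool names assume Ubuntu's &lt;code&gt;-bpfcc&lt;/code&gt; suffix):&lt;/p&gt;

```shell
sudo timeout 30 execsnoop-bpfcc -t   # -t adds timestamps for correlating process churn
sudo runqlen-bpfcc 5 1               # one 5-second sample of run queue depth
sudo runqlat-bpfcc 5 1               # one 5-second histogram of scheduler latency
```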




&lt;h3&gt;
  
  
  Scenario 7: Scheduler Latency Causing Tail Latency Under CPU Pressure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; P99 latency is five times P50 latency. The application processes requests in consistent time. CPU utilization on the node is 70% — below alert threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; CPU utilization at 70% does not indicate run queue saturation. Utilization measures time a CPU is busy, not how many processes are waiting for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command:&lt;/strong&gt; &lt;code&gt;runqlat&lt;/code&gt; showed run queue wait times predominantly in the 5-15ms range. With business logic taking 5ms, the remaining 45ms of the 50ms P99 was scheduler waiting time, accumulated across the multiple scheduling points a single request passes through. At 70% CPU utilization with many competing processes, individual processes can wait many milliseconds for a CPU slot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Reduce the number of processes competing for CPU on the node. Increase CPU requests/limits on latency-sensitive services. Consider dedicating a node pool for latency-sensitive workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 8: Silent Packet Drops in the Network Stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; A service fails for a small percentage of clients with connection reset errors. No pattern in the affected clients. Retry logic masks most failures. Alert threshold not breached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; Error rate is 0.3% — below alerting threshold. TCP-level packet drops are not visible in application metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command:&lt;/strong&gt; &lt;code&gt;tcpdrop&lt;/code&gt; showed hundreds of drops per minute with &lt;code&gt;nf_conntrack: table full, dropping packet&lt;/code&gt; in the kernel stack trace. The conntrack table was exhausting during traffic spikes, causing random connection resets that the application experienced as occasional errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Increase &lt;code&gt;/proc/sys/net/netfilter/nf_conntrack_max&lt;/code&gt;. Implement conntrack monitoring as a node-level Prometheus metric. Consider Cilium with eBPF-native connection tracking to eliminate conntrack entirely.&lt;/p&gt;
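&lt;p&gt;Conntrack headroom can be checked directly from &lt;code&gt;/proc&lt;/code&gt; (the paths exist only when the &lt;code&gt;nf_conntrack&lt;/code&gt; module is loaded; the limit value below is illustrative, not a recommendation):&lt;/p&gt;

```shell
# current entries versus the configured ceiling
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack usage: $count / $max"

# raise the ceiling at runtime; persist it in /etc/sysctl.d/ to survive reboots
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
```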




&lt;h3&gt;
  
  
  Scenario 9: File Descriptor Problems Slowing the Application
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; An application becomes progressively slower over hours, then recovers after restart. No obvious memory or CPU growth. Logs show no errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; The symptom is invisible until the file descriptor limit is approached. Standard metrics do not include open file descriptor counts for most applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF command:&lt;/strong&gt; &lt;code&gt;opensnoop&lt;/code&gt; showed the application opening new file descriptors for every request and never closing them — a file descriptor leak in a connection pool implementation. The rate of new opens was constant while the count of open fds grew continuously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Fix the connection pool to properly close file descriptors. Add &lt;code&gt;ulimit -n&lt;/code&gt; monitoring and alerting on process fd consumption.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 10: Monitoring Sidecar Causing Hidden Resource Pressure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptoms:&lt;/strong&gt; A service handles load correctly in load tests but shows elevated latency in production. Production traffic volume is lower than the load test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why dashboards were not enough:&lt;/strong&gt; The production environment includes a monitoring sidecar that runs in the same pod. The sidecar's resource consumption is not visible in application metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF commands:&lt;/strong&gt; &lt;code&gt;execsnoop&lt;/code&gt; showed the monitoring sidecar running a shell script every 15 seconds that executed twelve subprocesses for metric collection. &lt;code&gt;profile&lt;/code&gt; showed these subprocesses consuming 15% of the container's CPU budget during each collection cycle, causing CPU throttling spikes that lasted 200-300ms — exactly correlating with P99 latency events in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; Replace the shell-script-based metric collector with a native binary. Reduce collection frequency. Give the sidecar a separate CPU limit that does not compete with the application container.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. eBPF and Kubernetes
&lt;/h2&gt;


&lt;p&gt;Kubernetes layers abstractions on top of the Linux kernel: pods, services, overlay networks, persistent volumes. This abstraction is useful for operations but creates observability gaps during incidents. eBPF pierces these gaps because it operates at the kernel level, below the container runtime, below the overlay network, below the namespace abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-lived pods&lt;/strong&gt; are especially difficult in traditional observability. A pod that lives for thirty seconds may not produce enough telemetry to surface in aggregated metrics. eBPF captures everything that pod does from exec to exit, in real time, with no collection delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overlay networking&lt;/strong&gt; creates a challenge for network observability. Traffic between pods on the same node may never touch physical network infrastructure — it flows through virtual interfaces in the kernel network stack. Standard network monitoring that captures traffic at the physical NIC level sees none of this. eBPF captures it all because it operates inside the kernel, where the traffic passes regardless of the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS behavior&lt;/strong&gt; in Kubernetes clusters is a common source of latency that is difficult to observe. Every service DNS lookup goes through the cluster's kube-dns or CoreDNS. eBPF can capture every DNS query and response with timing, revealing whether DNS lookup latency is contributing to service latency — even when the application does not log DNS activity.&lt;/p&gt;
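&lt;p&gt;For lookups that go through the standard resolver functions, BCC ships &lt;code&gt;gethostlatency&lt;/code&gt;, which attaches uprobes to libc's &lt;code&gt;getaddrinfo&lt;/code&gt;/&lt;code&gt;gethostbyname&lt;/code&gt; and prints per-query latency (note that it will not see applications that speak DNS directly over sockets):&lt;/p&gt;

```shell
# per-lookup resolver latency across all processes on the node
sudo gethostlatency-bpfcc

# in a second terminal, trigger a lookup to confirm it is captured
getent hosts example.com
```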

&lt;h3&gt;
  
  
  eBPF-Native Kubernetes Tools
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cilium&lt;/strong&gt; replaces kube-proxy and implements Kubernetes networking entirely in eBPF. This eliminates iptables from the data path, provides native network policy enforcement, and enables service-level network observability without service mesh sidecars. Cilium's Hubble component provides a visual network map built on eBPF flow data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pixie&lt;/strong&gt; deploys an eBPF agent as a DaemonSet and automatically captures service-level metrics, traces, and logs for all services in the cluster — without any application instrumentation. HTTP request tracing, database query visibility, and service topology are available immediately after deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falco&lt;/strong&gt; uses eBPF to monitor syscall activity across all containers in a cluster, detecting security violations — unexpected file accesses, network connections to blacklisted IPs, process executions that violate policy — in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datadog&lt;/strong&gt; uses eBPF for its Universal Service Monitoring and Cloud Network Monitoring features, providing zero-instrumentation service-level metrics and network flow visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tetragon&lt;/strong&gt; from Cilium provides eBPF-based runtime security enforcement — not just detection, but the ability to kill processes or drop connections based on eBPF-observed behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parca&lt;/strong&gt; and &lt;strong&gt;Pyroscope&lt;/strong&gt; implement continuous profiling using eBPF's &lt;code&gt;profile&lt;/code&gt; capability, providing always-on CPU profiling across all cluster workloads with minimal overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Network Observability and Packet Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclaude.ai%2Fchat%2Febpf-packet-flow.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclaude.ai%2Fchat%2Febpf-packet-flow.svg" alt="Linux Packet Flow and eBPF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Linux network stack has multiple layers, and problems can occur at any of them. Standard network monitoring — interface counters, TCP statistics from &lt;code&gt;/proc/net/tcp&lt;/code&gt; — provides aggregate visibility. eBPF provides per-connection, per-packet visibility at any layer.&lt;/p&gt;

&lt;p&gt;When a packet arrives at a NIC, the first eBPF attachment point is XDP. A program attached at XDP can inspect the packet headers and decide to PASS it into the normal stack, DROP it immediately (at line rate, with no further kernel processing), REDIRECT it to another interface, or TX it back out the same interface. This is how high-performance load balancers, DDoS mitigation systems, and Cilium's packet processing operate.&lt;/p&gt;

&lt;p&gt;After XDP, packets enter the Traffic Control (TC) layer where another eBPF hook point exists. TC programs can inspect and modify packets with full socket buffer access — more powerful than XDP but slightly less performant. Cilium uses TC hooks for network policy enforcement and service routing.&lt;/p&gt;

&lt;p&gt;After TC, packets enter Netfilter — the iptables/nftables layer — where connection tracking and NAT occur. Conntrack table exhaustion is a common production problem: when the conntrack table fills, the kernel begins dropping new connection attempts. This is silent in most monitoring systems but immediately visible in &lt;code&gt;tcpdrop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At the socket layer, BPF socket filters can inspect and filter packets per-socket. This is how &lt;code&gt;tcpretrans&lt;/code&gt; attaches to the retransmission path — at the socket layer where retransmit decisions are made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where problems are invisible without eBPF:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Packets dropped at XDP or TC produce no application errors — the application never sees them. Standard network interface counters may not capture these drops. &lt;code&gt;tcpdrop&lt;/code&gt; attaches to the &lt;code&gt;kfree_skb&lt;/code&gt; function — the kernel function that frees a dropped packet — and captures every drop regardless of where in the stack it occurred.&lt;/p&gt;

&lt;p&gt;Retransmits are handled transparently by the kernel TCP implementation. The application sees a delayed response but no error. Without attaching to the retransmit path in the kernel, these are invisible.&lt;/p&gt;

&lt;p&gt;Conntrack table exhaustion drops new connections without error response — the client sees a timeout, not a connection refused. Without &lt;code&gt;tcpdrop&lt;/code&gt;, this is extremely difficult to diagnose.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Why eBPF Is Changing Linux Infrastructure
&lt;/h2&gt;

&lt;p&gt;eBPF is not an incremental improvement to existing Linux tooling. It represents a structural change in what is possible to observe and control in a running Linux system, with implications that are still working their way through the industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud networking&lt;/strong&gt; is increasingly eBPF-native. Packet processing is moving from userspace stacks into in-kernel eBPF programs: Google's GKE Dataplane V2 is built on Cilium's eBPF datapath, and hyperscale load balancers such as Meta's Katran process production traffic with XDP at rates traditional kernel paths cannot match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes networking&lt;/strong&gt; is converging on eBPF. kube-proxy — the component that implements Kubernetes service load balancing — is being replaced by eBPF implementations in a growing share of production environments. The iptables rules that kube-proxy generates scale poorly with cluster size and have latency characteristics that eBPF implementations avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime security&lt;/strong&gt; is shifting to eBPF. The traditional approach to container security — scanning images before deployment — is necessary but not sufficient. Attacks happen at runtime. eBPF enables runtime security tools that observe actual system behavior and enforce policy based on what processes are doing, not just what images contain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-instrumentation observability&lt;/strong&gt; is becoming viable at scale. The historical choice for production observability was: instrument everything explicitly (high accuracy, high engineering cost) or collect aggregate metrics passively (low engineering cost, low accuracy). eBPF provides a third path: kernel-level automatic observation of all service behavior, with accuracy approaching manual instrumentation and engineering cost approaching passive monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous profiling&lt;/strong&gt; powered by eBPF is changing how performance optimization is approached. Always-on profiling at one percent CPU overhead across all cluster workloads means that performance regressions surface immediately in production rather than requiring dedicated profiling sessions.&lt;/p&gt;

&lt;p&gt;eBPF's trajectory resembles containers in 2014: it is the technology that a small number of experienced engineers use to solve hard problems, that is being abstracted into tools that will eventually make its capabilities accessible to everyone, and that is quietly becoming the foundation of the next generation of Linux infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Practical Tips for Engineers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;On Ubuntu 22.04 or later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;bpfcc-tools linux-headers-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; bpftrace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Ubuntu, tools are available as &lt;code&gt;execsnoop-bpfcc&lt;/code&gt;, &lt;code&gt;opensnoop-bpfcc&lt;/code&gt;, etc. On some distributions the &lt;code&gt;-bpfcc&lt;/code&gt; suffix is absent. Verify with &lt;code&gt;ls /usr/sbin/execsnoop*&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kernel Version Matters
&lt;/h3&gt;

&lt;p&gt;eBPF capabilities have expanded significantly across kernel versions. A minimum of kernel 4.9 is required for most BCC tools. Kernel 5.8 introduced significant improvements to ring buffers and CO-RE (Compile Once Run Everywhere) that modern tools depend on. For production use, kernel 5.15 LTS or later is recommended. Check with &lt;code&gt;uname -r&lt;/code&gt;.&lt;/p&gt;
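&lt;p&gt;Beyond the version string, &lt;code&gt;bpftool&lt;/code&gt; (packaged as part of &lt;code&gt;linux-tools&lt;/code&gt; on Ubuntu) can enumerate exactly which eBPF features the running kernel supports:&lt;/p&gt;

```shell
uname -r                                # running kernel version
sudo bpftool feature probe | head -30   # supported program types, map types, helpers
```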

&lt;h3&gt;
  
  
  BCC vs bpftrace
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BCC tools&lt;/strong&gt; (the &lt;code&gt;execsnoop&lt;/code&gt;, &lt;code&gt;biolatency&lt;/code&gt;, etc. commands) are purpose-built tools for specific observability tasks. They are the right choice for incident debugging — run a command, get immediately useful output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bpftrace&lt;/strong&gt; is a high-level scripting language for writing custom eBPF programs. Use it when none of the BCC tools answer your specific question, or when you need to combine multiple observations in a custom way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# bpftrace one-liner: trace all files opened by process name "nginx"&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bpftrace &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args-&amp;gt;filename)); }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using eBPF Safely in Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Never run eBPF programs you do not understand on production systems.&lt;/strong&gt; BCC tools from the official repository are safe. Custom or third-party scripts should be reviewed before production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filtering matters.&lt;/strong&gt; Running &lt;code&gt;opensnoop&lt;/code&gt; on a busy production node without PID or path filtering will generate enormous output and some overhead. Always filter to the specific process, container, or operation type you are investigating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration limits.&lt;/strong&gt; Most BCC tools accept a duration argument. Use it to ensure tools terminate rather than running indefinitely.&lt;/p&gt;
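&lt;p&gt;Combining both habits in one invocation (PID 1234 is a placeholder for the process under investigation):&lt;/p&gt;

```shell
# only PID 1234's file opens, and stop automatically after 10 seconds
sudo opensnoop-bpfcc -p 1234 -d 10
```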

&lt;p&gt;&lt;strong&gt;Kernel verifier rejection.&lt;/strong&gt; On older kernels, some programs fail verification and will not load. This is safe — the program simply does not execute. The error message explains why.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Five Commands to Learn
&lt;/h3&gt;

&lt;p&gt;Start with these, in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;execsnoop&lt;/code&gt; — get familiar with process execution observation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;opensnoop&lt;/code&gt; — understand file access patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tcpconnect&lt;/code&gt; and &lt;code&gt;tcpretrans&lt;/code&gt; — build network-layer intuition&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;biolatency&lt;/code&gt; — learn to interpret I/O latency histograms&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runqlat&lt;/code&gt; — understand scheduler latency and its relationship to tail latency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These five commands cover the majority of production debugging scenarios where eBPF is needed.&lt;/p&gt;
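<p>As a starting exercise, each of the five can be run in a short, bounded window; the filters and durations below are illustrative, not prescriptive:</p>

<pre class="highlight shell"><code>sudo execsnoop                 # Ctrl-C to stop; watch what the node spawns
sudo opensnoop -n sshd         # file opens by processes whose name contains "sshd"
sudo tcpconnect                # every outbound TCP connect attempt, live
sudo tcpretrans                # retransmissions as they happen
sudo biolatency 10 6           # six histograms at 10s intervals (one minute), then exit
sudo runqlat 10 6              # same pattern for scheduler wait time
</code></pre>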

&lt;h3&gt;
  
  
  Common Mistakes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Running tools without filters on busy nodes.&lt;/strong&gt; This generates overwhelming output and unnecessary overhead. Always start with &lt;code&gt;-p PID&lt;/code&gt; or equivalent filtering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpreting histograms without context.&lt;/strong&gt; A tail at 100ms in &lt;code&gt;biolatency&lt;/code&gt; means very different things on a network-attached volume versus a local NVMe drive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming eBPF programs have zero overhead.&lt;/strong&gt; They have very low overhead, not zero. Profiling at high frequency (1000Hz) on a production system adds measurable load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confusing tool availability with kernel support.&lt;/strong&gt; A tool may install but fail to load if the kernel version does not support the required features.&lt;/li&gt;
&lt;/ul&gt;
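<p>A lower-impact profiling invocation, for example, keeps the default 49Hz sample rate, scopes to one PID, and bounds the duration (the PID and output path are placeholders):</p>

<pre class="highlight shell"><code># Moderate sampling rate, single process, fixed 30-second window
sudo profile -F 49 -p 1234 -f 30 &gt; stacks.folded

# -f emits folded stacks: semicolon-separated frames, a space, then a count
# (the format flamegraph.pl consumes)
</code></pre>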




&lt;h2&gt;
  
  
  12. DevOps Cheat-Sheet
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Save this section as a reference for incident debugging.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Processes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;execsnoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace all process executions&lt;/td&gt;
&lt;td&gt;Unexpected process activity · security · debug init&lt;/td&gt;
&lt;td&gt;Process name, args, PID, exit code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;opensnoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace all file opens&lt;/td&gt;
&lt;td&gt;File access patterns · ENOENT errors · startup debug&lt;/td&gt;
&lt;td&gt;Filename, process, error code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;offcputime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show where threads block&lt;/td&gt;
&lt;td&gt;Slow service with low CPU · lock contention&lt;/td&gt;
&lt;td&gt;Off-CPU stack traces with duration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU flame profiling&lt;/td&gt;
&lt;td&gt;High CPU · unknown hot path&lt;/td&gt;
&lt;td&gt;CPU stack trace samples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
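<p>As one worked example from this table: off-CPU analysis of a service that is slow while showing low CPU (a sketch; the PID is a placeholder):</p>

<pre class="highlight shell"><code># 30 seconds of off-CPU stacks for one process, folded for flame graphs
sudo offcputime -p 1234 -f 30 &gt; offcpu.folded

# Large totals next to futex or epoll frames point at lock or I/O waits
</code></pre>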

&lt;h3&gt;
  
  
  Files and I/O
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;biolatency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Block I/O latency histogram&lt;/td&gt;
&lt;td&gt;Slow database · disk investigation&lt;/td&gt;
&lt;td&gt;Latency distribution by device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;biosnoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-operation block I/O trace&lt;/td&gt;
&lt;td&gt;Which process is slow on disk&lt;/td&gt;
&lt;td&gt;Per-op latency, process, device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fileslower&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Slow file operations&lt;/td&gt;
&lt;td&gt;NFS latency · specific file debugging&lt;/td&gt;
&lt;td&gt;Slow file path, process, duration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mountsnoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mount/unmount syscalls&lt;/td&gt;
&lt;td&gt;K8s PV failures · container storage&lt;/td&gt;
&lt;td&gt;Mount path, type, return code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
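<p>For the file-level tools, the latency threshold is a positional argument in milliseconds (the values and PID here are illustrative):</p>

<pre class="highlight shell"><code># Show file read/write operations slower than 10 ms, node-wide
sudo fileslower 10

# Narrow to one process and a stricter 1 ms threshold
sudo fileslower -p 1234 1
</code></pre>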

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcpconnect&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Outbound TCP connections&lt;/td&gt;
&lt;td&gt;Unexpected egress · misconfig&lt;/td&gt;
&lt;td&gt;Destination IP, port, process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcpaccept&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inbound TCP accepted&lt;/td&gt;
&lt;td&gt;Connection rate · audit&lt;/td&gt;
&lt;td&gt;Client IP, server port, process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcpretrans&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TCP retransmissions&lt;/td&gt;
&lt;td&gt;Intermittent latency · network degradation&lt;/td&gt;
&lt;td&gt;Connection, retransmit count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcptop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TCP throughput by connection&lt;/td&gt;
&lt;td&gt;Bandwidth consumer identification&lt;/td&gt;
&lt;td&gt;Bytes in/out per connection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcpdrop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kernel TCP packet drops&lt;/td&gt;
&lt;td&gt;Silent drops · conntrack exhaustion&lt;/td&gt;
&lt;td&gt;Drop reason, kernel stack trace&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
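<p>During an incident, these combine into a quick triage sequence, roughly in order of signal value (the 30-second windows are illustrative):</p>

<pre class="highlight shell"><code>sudo timeout 30 tcpretrans     # any retransmissions at all?
sudo timeout 30 tcpdrop        # kernel-level drops, with the reason and stack
sudo tcptop                    # who is moving the bytes; refreshes each second
</code></pre>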

&lt;h3&gt;
  
  
  CPU and Scheduler
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;runqlat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduler run queue latency&lt;/td&gt;
&lt;td&gt;Tail latency · CPU contention&lt;/td&gt;
&lt;td&gt;Wait time histogram&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;runqlen&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run queue length&lt;/td&gt;
&lt;td&gt;CPU saturation diagnosis&lt;/td&gt;
&lt;td&gt;Queue depth per CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU profiling by stack&lt;/td&gt;
&lt;td&gt;Hot path identification&lt;/td&gt;
&lt;td&gt;Sampled CPU stack traces&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
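<p>The scheduler tools pair naturally: check queue depth first, then the wait-time distribution it produces (intervals are illustrative):</p>

<pre class="highlight shell"><code># Saturation check: queue depth, then the latency it causes
sudo runqlen 5 3
sudo runqlat 5 3

# Non-trivial queue depth plus a widening runqlat tail means CPU contention
</code></pre>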

&lt;h3&gt;
  
  
  Containers and Kubernetes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Best Use Case&lt;/th&gt;
&lt;th&gt;Reveals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;execsnoop --cgroupmap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container process execution&lt;/td&gt;
&lt;td&gt;Security · unexpected spawns&lt;/td&gt;
&lt;td&gt;All processes in cgroup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tcpconnect -p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container network connects&lt;/td&gt;
&lt;td&gt;Pod egress audit&lt;/td&gt;
&lt;td&gt;Pod-level TCP connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;opensnoop -p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Container file access&lt;/td&gt;
&lt;td&gt;Container FS debugging&lt;/td&gt;
&lt;td&gt;Files opened by pod PID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mountsnoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Volume mount activity&lt;/td&gt;
&lt;td&gt;PV attach failures&lt;/td&gt;
&lt;td&gt;Mount syscalls with errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
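<p>A common pattern is to resolve a container's host PID once and then reuse the PID-based tools. The sketch below assumes a containerd runtime with <code>crictl</code> installed; the container name is a placeholder:</p>

<pre class="highlight shell"><code># Host PID of the container's main process
CID=$(sudo crictl ps --name my-app -q | head -n1)
PID=$(sudo crictl inspect "$CID" | grep -m1 '"pid"' | grep -oE '[0-9]+' | head -n1)

sudo opensnoop -p "$PID"       # file access from inside the pod
sudo tcpconnect -p "$PID"      # the pod's outbound connections
</code></pre>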




&lt;h2&gt;
  
  
  13. Conclusion
&lt;/h2&gt;

&lt;p&gt;Most engineers observe Linux from the outside: through metrics that applications choose to expose, logs that code paths choose to write, and traces that instrumented services choose to record. This works for the incidents that fall within the boundaries of what existing instrumentation covers.&lt;/p&gt;

&lt;p&gt;eBPF allows you to observe Linux from inside the kernel itself.&lt;/p&gt;

&lt;p&gt;It reveals the scheduling decisions that cause tail latency before an application thread executes a single instruction. It shows the network packets that are dropped before a socket receives data. It captures the file accesses that an application makes without logging. It records every process execution, every TCP connection, every disk I/O operation — in real time, with process and container attribution, with no instrumentation required.&lt;/p&gt;

&lt;p&gt;The gap between a thirty-minute incident and a two-hour incident is often not analytical skill — it is the presence or absence of the right observability layer. Engineers who know &lt;code&gt;tcpretrans&lt;/code&gt; find network problems in minutes. Engineers who know &lt;code&gt;runqlat&lt;/code&gt; diagnose CPU contention in seconds. Engineers who know &lt;code&gt;tcpdrop&lt;/code&gt; surface silent packet drops that remain invisible to every other tool.&lt;/p&gt;

&lt;p&gt;You do not need to become an eBPF developer to benefit from it. Learning five commands — &lt;code&gt;execsnoop&lt;/code&gt;, &lt;code&gt;tcpretrans&lt;/code&gt;, &lt;code&gt;biolatency&lt;/code&gt;, &lt;code&gt;runqlat&lt;/code&gt;, &lt;code&gt;tcpdrop&lt;/code&gt; — changes how you approach production debugging. Each one reveals a class of problems that nothing else surfaces as quickly.&lt;/p&gt;

&lt;p&gt;The Linux kernel has been recording everything that happens inside it for decades. eBPF finally gives production engineers a practical way to read that record.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written for DevOps engineers, SREs, and platform teams operating production Linux and Kubernetes environments. Commands tested on Ubuntu 22.04 with kernel 5.15 and BCC tools 0.26.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
