DEV Community

Vlad Levinas

eBPF: The Linux Superpower That Shows What Your Dashboards Miss

A production-oriented guide for DevOps engineers, SREs, and Kubernetes platform teams who need visibility beyond what Prometheus and Grafana can provide.


1. The Incident That Changed How I Debug

The alert came in at 11:47pm. A payment API was timing out intermittently — not failing, not crashing, just occasionally returning responses that took eight seconds instead of eighty milliseconds. P99 latency was spiking. P50 looked fine. The dashboards showed nothing obviously wrong.

Prometheus showed normal CPU utilization. Memory was healthy. Pod restarts were zero. Kubernetes events were clean. The application logs were noisy but inconclusive — timeout errors that said what happened, not why. The backend team checked the database. The network team checked the load balancer. Two hours passed.

Then one engineer SSH'd into the node, ran a single command, and within ninety seconds had the answer: TCP retransmits between the API pods and the database pods were spiking to 40% on one specific node. Not a database problem. Not an application problem. A network problem on a node that all the upstream metrics had completely failed to surface.

The command was sudo tcpretrans-bpfcc. The eBPF program it loads runs inside the Linux kernel. It requires no agent. It requires no instrumentation. It just shows you what is actually happening at the TCP layer — in real time — on the host you are standing on.

That is eBPF.


2. Why DevOps Engineers Still SSH Into Servers

Modern observability stacks are genuinely impressive. A mature Prometheus + Grafana + Loki + Jaeger deployment can answer most questions about a production system. Kubernetes dashboards provide real-time pod and node state. APM tools give deep application-layer trace visibility. So why do experienced engineers still SSH into nodes the moment an incident gets complicated?

Because every observability tool operates above the kernel. Prometheus collects metrics that applications and exporters choose to expose. Logs contain what the application chooses to write. Traces show what the instrumented code path does. These tools are powerful, but they have a shared blind spot: anything that happens in the Linux kernel that the application does not know about, or does not choose to report, is invisible to them.

Consider what this leaves uncovered in practice:

Unexplained latency without application errors. A service is slow but healthy by every metric. The actual cause is scheduler latency — the container is waiting for CPU time it cannot get because a noisy neighbor process on the same node is consuming its entire CPU budget. No application metric captures this. No log line mentions it.

Intermittent failures that appear random. A connection to a downstream service occasionally fails, but the failure rate is low enough that retry logic masks it. The actual cause is silent TCP retransmits happening beneath the application's socket abstraction. The application sees delayed responses, not errors.

Silent packet drops in the kernel network stack. Packets are being dropped by conntrack table exhaustion, by an iptables rule mismatch, or by a full socket receive buffer. The application sees slowness. Prometheus shows nothing. The drop is happening inside the kernel before the application ever sees it.

Scheduler delays causing tail latency. P99 is high but P50 is normal. The application logic takes the same time every request. The variable is how long the process waits in the run queue before getting CPU time. This is invisible to application instrumentation and requires kernel-level observation.

Hidden process activity. A container is spawning unexpected subprocesses — a misconfigured health check script, a log rotation cron job, a malicious process injected into a compromised container. Nothing in the application metric stream shows this. You only see it by watching what the kernel executes.

eBPF closes this gap. It is the observability layer that lives inside the kernel itself, below every abstraction, where nothing is hidden from it.


3. What eBPF Actually Is

eBPF stands for extended Berkeley Packet Filter, though the name is now largely historical — modern eBPF has nothing to do with packets specifically. In practical DevOps terms, eBPF is a way to run small, safe programs inside the Linux kernel without modifying kernel source code and without loading kernel modules.

When you run an eBPF program, it attaches to a kernel event — a function call, a system call, a network packet arrival, a tracepoint — and executes every time that event fires. The program can observe arguments, return values, timing, process context, and network data. It stores what it finds in BPF maps, which userspace tools then read to produce output.

The key properties that make eBPF practical for production:

Safety. Every eBPF program passes through a kernel verifier before loading. The verifier statically analyzes the program to ensure it cannot loop infinitely, cannot access arbitrary kernel memory, cannot crash the kernel, and terminates in bounded time. An eBPF program that fails verification is rejected. This makes eBPF categorically safer than kernel modules, which can crash the entire system.

Zero instrumentation. eBPF programs attach to existing kernel events. They require no changes to the applications being observed. A Python web service and a Go microservice and a Java application all produce the same system calls, network events, and scheduler interactions — all observable via eBPF without modifying a single line of application code.

Low overhead. Well-written eBPF programs add nanoseconds to the events they attach to. This is acceptable in production. Traditional tracing approaches — strace, tcpdump in promiscuous mode, ptrace — impose overhead that makes them unusable on production systems under load. eBPF does not.

Core eBPF Concepts

kprobes attach to arbitrary kernel functions. You can attach an eBPF program to any kernel function and inspect its arguments and return value. This is how biolatency measures disk I/O latency — it attaches to kernel block I/O functions and records timestamps.

uprobes attach to userspace functions in any running binary. You can attach to a specific function in a Go binary, a Python library, or a JVM without recompiling it. This enables language-level tracing without language-level instrumentation.

tracepoints are stable kernel instrumentation points added by kernel developers specifically for observability. They are more stable across kernel versions than kprobes because they are part of the official kernel ABI. tcp:tcp_retransmit_skb is a tracepoint — this is what tcpretrans uses.

XDP (eXpress Data Path) attaches eBPF programs at the network driver level, before packets enter the full kernel network stack. This enables packet filtering, modification, and forwarding at line rate — faster than any iptables rule.

Socket filters attach to sockets and can inspect or filter packets at the socket level. This is how classic BPF worked in tcpdump, and how modern network security tools inspect traffic per-connection.

BPF maps are data structures shared between eBPF programs and userspace. Hash maps, arrays, ring buffers, LRU maps — eBPF programs write data here, userspace tools read it. This is the mechanism by which kernel-level observations become human-readable output.

Perf events and ring buffers are high-throughput channels for sending per-event data from eBPF programs to userspace. Tools like execsnoop use ring buffers to stream process execution events in real time.
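To make these concepts concrete, here is a minimal bpftrace one-liner that wires a tracepoint to a BPF map. It is a sketch, not a production tool: the @ aggregation is a BPF hash map keyed by command name, and bpftrace prints its contents when you press Ctrl-C.

```bash
# Count system calls per process name; @ is a BPF hash map read by userspace on exit
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
```

Everything described above is in play: bpftrace compiles the program to eBPF bytecode, the verifier checks it, the program attaches to the raw_syscalls:sys_enter tracepoint, and the map is read back into userspace for display.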


4. eBPF Architecture End to End

The data flow in an eBPF-based tool follows a consistent path:

![[ebpf-packet-flow.svg|1200]]
1. Userspace tool (execsnoop, biolatency, a custom bpftrace script) initiates a BPF program load. The program is compiled to eBPF bytecode.

2. BPF loader (libbpf, the BCC library, the bpftrace runtime) submits the bytecode to the kernel via the bpf() system call.

3. Kernel verifier analyzes the bytecode: bounded loops, valid memory access, correct map usage, no unsafe operations. If the program passes, the kernel JIT-compiles it to native machine code for near-native execution performance.

4. eBPF program is attached to a hook point — a kprobe, tracepoint, XDP hook, socket filter, or perf event. It begins executing on every occurrence of that event.

5. BPF maps and ring buffers collect the data the program records: timestamps, PIDs, process names, IP addresses, byte counts, latency histograms.

6. Userspace reads the maps or subscribes to the ring buffer, processes the data, and produces the output the engineer sees — a sorted table, a latency histogram, a live stream of events.

The critical insight is step 4: the program executes inside the kernel, with direct access to kernel data structures, with zero copies and zero context switches to gather its observations. This is why eBPF provides visibility that no external agent can match.
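The whole pipeline fits in one small example. This bpftrace sketch (assuming a kernel where the vfs_read symbol is probeable) times vfs_read calls with a kprobe/kretprobe pair, stores per-thread start timestamps in one map, and aggregates latencies into a histogram map:

```bash
sudo bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }                   // attach and record a timestamp in a map
kretprobe:vfs_read /@start[tid]/ {
    @read_latency_us = hist((nsecs - @start[tid]) / 1000); // aggregate latency into a histogram map
    delete(@start[tid]);
}'
```

When you interrupt the script, bpftrace reads the histogram map out of the kernel and renders it in your terminal — the userspace half of the flow doing its job.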


5. Why eBPF Is So Powerful for DevOps

eBPF provides access to a class of observability data that is either impossible or impractically expensive to obtain any other way.

Process execution: Every process the kernel executes generates a kernel event. eBPF can capture the process name, arguments, parent PID, container namespace, and user ID for every execution. A compromised container that spawns a reverse shell generates an execve syscall that eBPF sees immediately — before the shell produces any network traffic or log output.

File system access: Every file open, every file read, every file write goes through the VFS layer. eBPF can observe every file access across every process on the node, filtered by path, duration, or process. An application that is slow because it is repeatedly re-reading a large configuration file is immediately visible — even if the application produces no logs about it.

TCP connection lifecycle: TCP connects, accepts, resets, retransmits, and drops all generate kernel events. eBPF observes all of them, per-connection, with process and container attribution. This level of network visibility has no equivalent in application-layer metrics.

Disk I/O latency: Block I/O requests go through the kernel block layer. eBPF attaches to the block layer and measures actual latency per device, per operation type, with microsecond precision. This reveals whether a disk is saturated, whether specific operations are slow, and whether the latency is consistent or bimodal.

Scheduler behavior: The Linux CFS scheduler tracks run queue length and scheduling delays. eBPF can measure how long processes wait in the run queue before receiving CPU time — the "scheduler latency" that causes tail latency spikes on CPU-constrained nodes without any application metric surfacing it.

Container runtime behavior: Containers are Linux namespaces and cgroups. eBPF operates at the kernel level and sees through these abstractions. It can observe all container processes, all container network flows, and all container file accesses without any cooperation from the container runtime.
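As a sketch of the first item, process execution, a bpftrace one-liner can stream every execve on the node — roughly what execsnoop does, minus the formatting:

```bash
# Print the calling command and the binary it is executing
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'
```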


6. 15 eBPF Commands Every DevOps Engineer Should Know

Installation note: On Ubuntu/Debian, tools are available via sudo apt install bpfcc-tools linux-headers-$(uname -r). Commands are typically available as execsnoop-bpfcc, opensnoop-bpfcc, etc. On some distributions the suffix is omitted. bpftrace is a separate higher-level scripting tool: sudo apt install bpftrace.
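Before reaching for a specific tool, it is worth confirming which probes your kernel actually exposes, since tracepoint availability varies across kernel versions. bpftrace can list them by pattern:

```bash
sudo bpftrace -l 'tracepoint:tcp:*'
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_exec*'
```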


1. execsnoop

What it shows: Every process execution on the system in real time — process name, arguments, PID, parent PID, and return code.

When to use it: When a container is doing something unexpected, when a noisy process is spawning subprocesses, when investigating potential container escape or unexpected program execution, when debugging init systems or entrypoint scripts.

sudo execsnoop-bpfcc
sudo execsnoop-bpfcc -u www-data
sudo execsnoop-bpfcc --cgroupmap /sys/fs/bpf/cgroup_map

Interpreting output: Each line is one execution: PCOMM (the command name), PID, PPID, RET (the execve return value; non-zero means the exec failed), ARGS (full command line). A flood of short-lived processes in ARGS is a sign of a script looping or a health check misbehaving.

Real incident: A Kubernetes pod was consuming unexpected CPU. execsnoop revealed that a misconfigured liveness probe was executing a shell script every second, and that script was itself spawning three child processes to check service state. Hundreds of executions per minute, invisible in every dashboard.

Pro tip: Filter by user or cgroup to scope output to specific containers. Without filtering, a busy node generates substantial output.


2. opensnoop

What it shows: Every file open call across all processes — filename, PID, process name, file descriptor returned, and error code.

When to use it: When an application is slow without explanation and disk I/O might be involved. When debugging "file not found" errors that appear intermittently. When identifying which files an application accesses during startup. When auditing file access patterns for compliance.

sudo opensnoop-bpfcc
sudo opensnoop-bpfcc -p 1234
sudo opensnoop-bpfcc -T

Interpreting output: Each line shows a file open attempt. ERR column shows errno values — ENOENT means file not found, EACCES means permission denied. A high rate of ENOENT errors for the same path indicates a misconfigured application repeatedly failing to find a file it expects.

Real incident: An application was slow on every tenth request with no log output. opensnoop revealed it was attempting to open /etc/ssl/certs/ca-bundle.crt on every tenth request (a certificate bundle that had been removed in a base image update) and falling back to a slower secondary path after the ENOENT error.


3. tcpconnect

What it shows: Every outbound TCP connection — source and destination address and port, PID, process name.

When to use it: When a container is making unexpected outbound connections. When debugging connection failures between services. When investigating whether an application is connecting to the right endpoints after a configuration change.

sudo tcpconnect-bpfcc
sudo tcpconnect-bpfcc -p 1234
sudo tcpconnect-bpfcc -P 5432

Interpreting output: Each line is one TCP SYN sent. The latency column (if present) shows time to connection establishment. Connections to unexpected IPs indicate misconfiguration or compromise.

Real incident: A microservice was connecting to a database endpoint that had been decommissioned three weeks earlier. The application was configured via an environment variable that was not updated. tcpconnect showed the actual destination IP, revealing the configuration drift immediately. Application logs only showed "connection refused" with no destination detail.


4. tcpaccept

What it shows: Every inbound TCP connection accepted — client address, server port, PID, process name.

When to use it: When measuring actual connection rates to a service. When debugging whether a service is accepting connections or queueing them. When auditing which clients are connecting to a service.

sudo tcpaccept-bpfcc
sudo tcpaccept-bpfcc -P 8080

Interpreting output: Each accepted connection appears with the client IP. A significant gap between the connection rate visible in tcpconnect on the client and tcpaccept on the server indicates connection queuing or backlog overflow.


5. tcpretrans

What it shows: Every TCP retransmission — source and destination address, port, TCP state, and retransmit type.

When to use it: When API latency is elevated without application errors. When investigating intermittent connection issues between specific service pairs. When a node has unexplained network degradation.

sudo tcpretrans-bpfcc
sudo tcpretrans-bpfcc -l

Interpreting output: Each retransmit line shows the affected connection. High retransmit rates on a specific destination IP indicate network issues on that path — switch congestion, link error rate, MTU mismatch, or failing NIC.

Real incident: The payment API incident from the opening. Forty percent retransmit rate on connections from pods on one specific node to the database tier. The node's NIC had a failing transceiver. Metrics showed nothing because the operating system was successfully retransmitting and eventually delivering packets — just 50-200ms later than expected.


6. tcptop

What it shows: A continuously updated table of TCP connections sorted by throughput — bytes sent and received per connection per interval.

When to use it: When identifying which connections are consuming the most network bandwidth. When investigating network saturation on a node. When finding bandwidth consumers during a performance degradation.

sudo tcptop-bpfcc
sudo tcptop-bpfcc 5

Interpreting output: Like top but for TCP connections. The highest bandwidth connections are at the top. A single connection consuming 90% of observed bandwidth is immediately visible.


7. biolatency

What it shows: A histogram of block device I/O latency — how long disk operations actually take, bucketed by microsecond ranges.

When to use it: When a database or stateful application is slow and disk is suspected. When validating SSD vs spinning disk performance. When debugging Kubernetes persistent volume latency. When investigating whether ceph/NFS/remote storage is meeting latency SLOs.

sudo biolatency-bpfcc
sudo biolatency-bpfcc -D
sudo biolatency-bpfcc 5 3

Interpreting output: The histogram shows how disk operations are distributed by latency. A bimodal distribution with a large tail indicates intermittent slow operations — the classic "mostly fast, occasionally very slow" pattern that databases experience under mixed read/write workloads. Operations in the >10ms bucket from an SSD indicate a problem.

Real incident: A PostgreSQL instance had P99 query latency of 800ms during write-heavy periods. biolatency showed that 95% of writes completed in under 1ms, but 2% of writes took over 50ms. The storage backend was a distributed volume system that occasionally had hot spot contention. The bimodal distribution was invisible in average disk latency metrics.


8. biosnoop

What it shows: Every individual block I/O operation — process, PID, device, operation type (read/write), sector, bytes, and actual latency per operation.

When to use it: When biolatency shows a latency problem and you need to know which process, which files, and which operations are slow. When debugging which Kubernetes pod is saturating a shared disk.

sudo biosnoop-bpfcc
sudo biosnoop-bpfcc -Q

Interpreting output: Each line is one I/O operation. Sort by latency column to find the slowest individual operations. The process column shows which process initiated the I/O, enabling attribution of disk activity to specific containers.


9. fileslower

What it shows: File read and write operations that exceed a latency threshold — which files, which processes, and how slow.

When to use it: When an application is experiencing file I/O latency but you need to know which specific files are slow, not just that disk I/O is slow in aggregate.

sudo fileslower-bpfcc 10
sudo fileslower-bpfcc 1

Interpreting output: Each line is a slow file operation exceeding the threshold (milliseconds). The FILENAME column immediately shows which paths are experiencing latency. NFS mounts frequently appear here during remote filesystem degradation.

Real incident: An application writing audit logs to an NFS mount was experiencing 200ms file write latency during peak hours. Application metrics showed no errors because writes eventually succeeded. fileslower immediately showed /mnt/nfs/audit/*.log operations taking 180-250ms. The NFS server was experiencing contention — invisible in every other tool.


10. runqlat

What it shows: A histogram of scheduler run queue latency — how long processes wait between becoming runnable and actually running.

When to use it: When tail latency is high but application logic time is consistent. When CPU utilization metrics look normal but applications are slow. When investigating whether CPU limits or noisy neighbors are causing scheduling delays.

sudo runqlat-bpfcc
sudo runqlat-bpfcc 5 3
sudo runqlat-bpfcc --pidnss

Interpreting output: The histogram shows distribution of wait times in the run queue. Most operations should be in the <100µs buckets. Significant mass in the 1ms-10ms range indicates CPU contention — processes are waiting longer than they should for CPU time. This directly causes tail latency.

Real incident: A service had P99 latency of 50ms when the business logic took under 5ms on every request. runqlat showed most requests waiting 30-45ms in the scheduler queue. The node was CPU-oversubscribed: containers with low CPU requests were consuming far more than their nominal allocation, starving other containers. CPU metrics showed 80% utilization — below alert threshold, but distributed in a way that caused severe scheduler delays.


11. runqlen

What it shows: A histogram of CPU run queue length — how many runnable processes are waiting for each CPU at sampling time.

When to use it: When diagnosing whether CPU pressure is causing latency. When identifying nodes with run queue imbalances across CPUs. When comparing CPU pressure between nodes.

sudo runqlen-bpfcc
sudo runqlen-bpfcc 5

Interpreting output: Run queue lengths above 2-3 per CPU indicate pressure. A consistently high run queue length means processes spend significant time waiting to run — the direct cause of the scheduler latency that runqlat measures.


12. offcputime

What it shows: Stack traces and time spent by processes blocked off CPU — waiting for I/O, sleeping on locks, waiting for network responses.

When to use it: When a service has low CPU utilization but is still slow. When thread counts are high but CPU is not saturated. When debugging lock contention, I/O blocking, or unexpected sleeps in application code.

sudo offcputime-bpfcc -p 1234
sudo offcputime-bpfcc --stack-storage-size 16384 5

Interpreting output: Stack traces showing where threads spend time blocked. A Java application with dozens of threads all blocked on the same lock call site is immediately visible as a stacked flame of identical call stacks.


13. profile

What it shows: CPU profiling across all processes — stack traces sampled at a configurable rate showing where CPU time is actually spent.

When to use it: When CPU utilization is high but unclear which code path is consuming it. When profiling containerized applications without modifying them. When doing continuous profiling on production nodes.

sudo profile-bpfcc -F 99 30
sudo profile-bpfcc -a -F 49 60

Interpreting output: Stack traces with sample counts. Higher counts mean more CPU time. Can be converted to FlameGraph format for visualization. Unlike language-specific profilers, this works on any process in any language with no instrumentation.
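As a sketch of that conversion, assuming Brendan Gregg's FlameGraph scripts are cloned locally: profile-bpfcc's -f flag emits folded stacks that flamegraph.pl consumes directly.

```bash
# 30 seconds of 49 Hz sampling; -f folds stacks, -d separates kernel/user frames
sudo profile-bpfcc -df -F 49 30 > stacks.folded
./FlameGraph/flamegraph.pl stacks.folded > cpu.svg
```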


14. tcpdrop

What it shows: TCP packets being dropped by the kernel — why they were dropped, which connection they belonged to, and the kernel stack trace at the point of drop.

When to use it: When connections are being reset or dropped and the cause is unclear. When investigating conntrack table exhaustion. When debugging iptables rules that are silently dropping traffic. When a service works for most clients but fails for a subset with no pattern.

sudo tcpdrop-bpfcc

Interpreting output: Each line is a dropped packet with the connection details and a kernel stack trace. Stack traces containing nf_conntrack functions point to conntrack table exhaustion. The stack shows the exact kernel code path that made the drop decision. This is the tool that makes silent packet drops visible.

Real incident: A Kubernetes cluster's conntrack table was intermittently exhausting during traffic spikes, causing random connection drops across the entire node. Every dashboard showed healthy services because most connections succeeded. tcpdrop showed hundreds of drops per second with nf_conntrack_full in the stack trace — the exact cause visible in seconds.


15. mountsnoop

What it shows: Mount and unmount system calls — which processes are mounting or unmounting filesystems, with arguments and return codes.

When to use it: When investigating container storage issues. When debugging persistent volume mount failures in Kubernetes. When auditing filesystem mount activity on a node. When a pod fails to start due to volume mount errors that the kubelet logs don't fully explain.

sudo mountsnoop-bpfcc

Interpreting output: Each line is a mount or unmount syscall with the process, target path, filesystem type, and return code. Failed mounts (non-zero return codes) immediately show which paths and filesystem types are failing and why.


7. 10 Production Debugging Scenarios

![[ebpf-workflow.svg|1200]]

Scenario 1: Kubernetes Pod Restarting — Reason Unclear

Symptoms: A pod shows increasing restart count in kubectl get pods. The container exits cleanly (exit code 0) with no error logs. The liveness probe succeeds before the restart.

Why dashboards were not enough: Prometheus shows restart count increasing but no error rate, no CPU spike, no memory pressure. Kubernetes events say "container restarted" without elaboration.

eBPF command that revealed the truth: execsnoop filtered to the pod's PID namespace showed the entrypoint script spawning a subprocess that failed silently, causing the parent process to exit cleanly after a fixed timeout — not a crash, but a controlled exit triggered by the failing subprocess.

Action: Fix the subprocess failure (a health check script referencing a removed endpoint). Restart count drops to zero.


Scenario 2: API Latency Spikes — TCP Retransmits

Symptoms: P99 latency spikes to 400ms for ten-second windows, multiple times per hour. P50 is unaffected. Application logs show no errors during spikes. Prometheus TCP metrics are absent (not instrumented).

Why dashboards were not enough: Application metrics show latency distribution but not cause. Network metrics are absent because the application does not expose TCP-level data.

eBPF command: tcpretrans showed 35% retransmit rate on connections from one specific source IP (a pod on a congested node) to the database tier. All latency spikes correlated exactly with retransmit activity on this path.

Action: Migrate affected pods away from the congested node. Investigate switch port error counters on the physical switch connected to that node.


Scenario 3: Database Node with Intermittent Disk Latency

Symptoms: PostgreSQL reports occasional slow queries during write-heavy periods. WAL sync latency metrics are inconsistent. The storage team sees no alerts.

Why dashboards were not enough: Average disk latency looks normal. The slow queries are outliers that average metrics hide.

eBPF commands: biolatency showed a bimodal distribution with 2% of operations taking over 50ms on the NVMe device. biosnoop showed these slow operations were all write operations from the postgres process to the WAL device.

Action: Investigate storage backend queue depth during write bursts. Tune PostgreSQL wal_sync_method to reduce synchronous write frequency.


Scenario 4: Container Making Unexpected Outbound Connections

Symptoms: Security team flags unexpected outbound traffic from a container to an external IP. Application team denies intentional connectivity to that destination.

Why dashboards were not enough: Network flow logs exist but with a five-minute aggregation delay, making connection attribution to specific processes difficult. The destination is not in any known configuration.

eBPF command: tcpconnect filtered to the container's PID showed the connection being made by a background thread in a third-party library, not the main application. The library was calling home to a telemetry endpoint added in a recent dependency update.

Action: Pin the dependency version, add a network policy to block unexpected egress, review all third-party library dependencies for undisclosed network activity.


Scenario 5: NFS Latency Causing Application Slowness

Symptoms: An application writing to a shared NFS volume is slow during peak hours. The application team blames the NFS server. The infrastructure team says NFS server metrics look fine.

Why dashboards were not enough: NFS server aggregate metrics look normal because most clients are fine. The performance problem is specific to one mount path on one server.

eBPF commands: fileslower with a 5ms threshold showed /mnt/nfs/data/ write operations taking 100-300ms, with no slow operations on local paths. biolatency for the NFS block device showed high latency distribution.

Action: Identify the specific NFS export experiencing contention. Move the application's write-heavy workload to a local volume or a dedicated NFS export with reserved I/O capacity.


Scenario 6: Noisy Neighbor Spawning Unexpected Subprocesses

Symptoms: All containers on a specific node begin showing elevated P99 latency simultaneously. No recent deployments. The node's aggregate CPU metrics look normal.

Why dashboards were not enough: The problem is CPU distribution, not CPU total. Individual container CPU metrics show each container at reasonable utilization.

eBPF commands: execsnoop showed a batch processing container executing a shell loop that spawned 200+ short-lived processes per minute. runqlen showed run queue depth of 8-12 on the node's CPUs, far above normal. runqlat confirmed 30-50ms scheduler latency for all other processes.

Action: Add CPU limits to the batch processing container, move it to dedicated nodes, or implement CFS bandwidth throttling to prevent run queue saturation.


Scenario 7: Scheduler Latency Causing Tail Latency Under CPU Pressure

Symptoms: P99 latency is five times P50 latency. The application processes requests in consistent time. CPU utilization on the node is 70% — below alert threshold.

Why dashboards were not enough: CPU utilization at 70% does not indicate run queue saturation. Utilization measures time a CPU is busy, not how many processes are waiting for it.

eBPF command: runqlat showed run queue wait times predominantly in the 5-15ms range. A service with 5ms business logic producing P99 of 50ms: the 45ms difference is entirely scheduler waiting time. At 70% CPU utilization with many competing processes, individual processes can wait many milliseconds for a CPU slot.

Action: Reduce the number of processes competing for CPU on the node. Increase CPU requests/limits on latency-sensitive services. Consider dedicating a node pool for latency-sensitive workloads.


Scenario 8: Silent Packet Drops in the Network Stack

Symptoms: A service fails for a small percentage of clients with connection reset errors. No pattern in the affected clients. Retry logic masks most failures. Alert threshold not breached.

Why dashboards were not enough: Error rate is 0.3% — below alerting threshold. TCP-level packet drops are not visible in application metrics.

eBPF command: tcpdrop showed hundreds of drops per minute with nf_conntrack functions in the kernel stack trace — the signature of the classic nf_conntrack: table full, dropping packet condition. The conntrack table was exhausting during traffic spikes, causing random connection resets that the application experienced as occasional errors.

Action: Increase /proc/sys/net/netfilter/nf_conntrack_max. Implement conntrack monitoring as a node-level Prometheus metric. Consider Cilium with eBPF-native connection tracking to eliminate conntrack entirely.
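If tcpdrop is not available on your kernel, a rough bpftrace substitute against the skb:kfree_skb tracepoint gives similar attribution — it counts freed skbs by kernel stack, so conntrack-related drops surface as nf_conntrack frames:

```bash
sudo bpftrace -e 'tracepoint:skb:kfree_skb { @drops[kstack] = count(); }'
```

One caveat: on older kernels this tracepoint also fires for some legitimate skb frees; newer kernels attach a drop reason you can filter on.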


Scenario 9: File Descriptor Problems Slowing the Application

Symptoms: An application becomes progressively slower over hours, then recovers after restart. No obvious memory or CPU growth. Logs show no errors.

Why dashboards were not enough: The symptom is invisible until the file descriptor limit is approached. Standard metrics do not include open file descriptor counts for most applications.

eBPF command: opensnoop showed the application opening new file descriptors for every request and never closing them — a file descriptor leak in a connection pool implementation. The rate of new opens was constant while the count of open fds grew continuously.
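To illustrate that diagnosis as a hedged sketch: myapp below is a placeholder process name, and the /proc fd count is a non-eBPF cross-check that works on any Linux system:

```shell
# Trace every open() from the suspect process; a leak shows a steady
# stream of opens with no corresponding closes ("myapp" is a placeholder)
sudo opensnoop-bpfcc -p "$(pidof myapp)"

# Confirm the leak by watching the open-fd count grow over time
watch -n 5 'ls /proc/$(pidof myapp)/fd | wc -l'
```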

Action: Fix the connection pool to properly close file descriptors. Add ulimit -n monitoring and alerting on process fd consumption.


Scenario 10: Health Check Causing Hidden Resource Pressure

Symptoms: A service handles load correctly in load tests but shows elevated latency in production. Production traffic volume is lower than the load test.

Why dashboards were not enough: The production environment includes a monitoring sidecar that runs in the same pod. The sidecar's resource consumption is not visible in application metrics.

eBPF commands: execsnoop showed the monitoring sidecar running a shell script every 15 seconds that executed twelve subprocesses for metric collection. profile showed these subprocesses consuming 15% of the container's CPU budget during each collection cycle, causing CPU throttling spikes that lasted 200-300ms — exactly correlating with P99 latency events in production.
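A sketch of how such an exec storm can be quantified, bounding the capture with timeout(1); this assumes Ubuntu's -bpfcc tool naming and execsnoop's default output, where the first column after the header is the command name:

```shell
# Count process executions per command over 60 seconds to spot
# shell-script storms (NR>1 skips execsnoop's header line)
sudo timeout 60 execsnoop-bpfcc | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head
```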

Action: Replace the shell-script-based metric collector with a native binary. Reduce collection frequency. Give the sidecar a separate CPU limit that does not compete with the application container.


8. eBPF and Kubernetes

Kubernetes stacks abstractions on top of the Linux kernel: containers inside pods, pods on nodes, overlay networks between them, and namespaces isolating it all. This layering is useful for operations but creates observability gaps during incidents. eBPF pierces these gaps because it operates at the kernel level, below the container runtime, below the overlay network, below the namespace abstraction.

Short-lived pods are especially difficult in traditional observability. A pod that lives for thirty seconds may not produce enough telemetry to surface in aggregated metrics. eBPF captures everything that pod does from exec to exit, in real time, with no collection delay.

Overlay networking creates a challenge for network observability. Traffic between pods on the same node may never touch physical network infrastructure — it flows through virtual interfaces in the kernel network stack. Standard network monitoring that captures traffic at the physical NIC level sees none of this. eBPF captures it all because it operates inside the kernel, where the traffic passes regardless of the path.

DNS behavior in Kubernetes clusters is a common source of latency that is difficult to observe. Every service DNS lookup goes through the cluster's kube-dns or CoreDNS. eBPF can capture every DNS query and response with timing, revealing whether DNS lookup latency is contributing to service latency — even when the application does not log DNS activity.
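One way to see this, sketched here for glibc-based workloads (gethostlatency hooks the libc resolver functions, so musl-based containers such as Alpine will not show up):

```shell
# Per-lookup name-resolution latency, by process
sudo gethostlatency-bpfcc

# Cross-check which resolver the host (or pod) is actually configured to use
cat /etc/resolv.conf
```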

eBPF-Native Kubernetes Tools

Cilium replaces kube-proxy and implements Kubernetes networking entirely in eBPF. This eliminates iptables from the data path, provides native network policy enforcement, and enables service-level network observability without service mesh sidecars. Cilium's Hubble component provides a visual network map built on eBPF flow data.

Pixie deploys an eBPF agent as a DaemonSet and automatically captures service-level metrics, traces, and logs for all services in the cluster — without any application instrumentation. HTTP request tracing, database query visibility, and service topology are available immediately after deployment.

Falco uses eBPF to monitor syscall activity across all containers in a cluster, detecting security violations — unexpected file accesses, network connections to blacklisted IPs, process executions that violate policy — in real time.

Datadog uses eBPF for its Universal Service Monitoring and Cloud Network Monitoring features, providing zero-instrumentation service-level metrics and network flow visibility.

Tetragon from Cilium provides eBPF-based runtime security enforcement — not just detection, but the ability to kill processes or drop connections based on eBPF-observed behavior.

Parca and Pyroscope implement continuous profiling using eBPF's profile capability, providing always-on CPU profiling across all cluster workloads with minimal overhead.


9. Network Observability and Packet Flow

Linux Packet Flow and eBPF

The Linux network stack has multiple layers, and problems can occur at any of them. Standard network monitoring — interface counters, TCP statistics from /proc/net/tcp — provides aggregate visibility. eBPF provides per-connection, per-packet visibility at any layer.

When a packet arrives at a NIC, the first eBPF attachment point is XDP. A program attached at XDP can inspect the packet headers and decide to PASS it into the normal stack, DROP it immediately (at line rate, with no further kernel processing), REDIRECT it to another interface, or TX it back out the same interface. This is how high-performance load balancers, DDoS mitigation systems, and Cilium's packet processing operate.
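You can inspect what is attached at these hook points on a live node; bpftool is packaged alongside modern kernels, and eth0 below is a placeholder interface name:

```shell
# List every eBPF program currently loaded in the kernel
sudo bpftool prog list

# Check whether an XDP program is attached to an interface
# ("eth0" is a placeholder; substitute your real interface name)
ip link show dev eth0
```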

After XDP, packets enter the Traffic Control (TC) layer where another eBPF hook point exists. TC programs can inspect and modify packets with full socket buffer access — more powerful than XDP but slightly less performant. Cilium uses TC hooks for network policy enforcement and service routing.

After TC, packets enter Netfilter — the iptables/nftables layer — where connection tracking and NAT occur. Conntrack table exhaustion is a common production problem: when the conntrack table fills, the kernel begins dropping new connection attempts. This is silent in most monitoring systems but immediately visible in tcpdrop.

At the socket layer, classic BPF socket filters can inspect and filter packets per-socket; this is the mechanism behind tcpdump. tcpretrans works differently: it attaches to the kernel's tcp_retransmit_skb function (via kprobe, or the tcp:tcp_retransmit_skb tracepoint on kernels 4.15+), the exact point where retransmissions are issued.

Where problems are invisible without eBPF:

Packets dropped at XDP or TC produce no application errors — the application never sees them. Standard network interface counters may not capture these drops. tcpdrop attaches to the kfree_skb function — the kernel function that frees a dropped packet — and captures every drop regardless of where in the stack it occurred.

Retransmits are handled transparently by the kernel TCP implementation. The application sees a delayed response but no error. Without attaching to the retransmit path in the kernel, these are invisible.

Conntrack table exhaustion drops new connections without error response — the client sees a timeout, not a connection refused. Without tcpdrop, this is extremely difficult to diagnose.
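For ad-hoc drop hunting, a bpftrace one-liner on kfree_skb gives a similar signal to tcpdrop (note that newer kernels route most drops through kfree_skb_reason, so the probe name may need adjusting); the softnet counter read is a non-eBPF cross-check:

```shell
# Aggregate packet frees by kernel stack to locate where drops originate
sudo bpftrace -e 'kprobe:kfree_skb { @[kstack] = count(); }'

# Non-eBPF cross-check: per-CPU softnet stats (second column = drops
# due to a full input backlog)
cat /proc/net/softnet_stat
```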


10. Why eBPF Is Changing Linux Infrastructure

eBPF is not an incremental improvement to existing Linux tooling. It represents a structural change in what is possible to observe and control in a running Linux system, with implications that are still working their way through the industry.

Cloud networking is increasingly eBPF-native. Google's GKE Dataplane V2 is built on Cilium's eBPF datapath, Meta's Katran load balancer forwards packets with XDP, and Cloudflare mitigates DDoS attacks by dropping traffic at XDP, all processing packets in the kernel at rates that no userspace program can match.

Kubernetes networking is converging on eBPF. kube-proxy — the component that implements Kubernetes service load balancing — is being replaced by eBPF implementations in a growing number of production environments. The iptables rules that kube-proxy generates scale poorly with cluster size and have latency characteristics that eBPF implementations do not.

Runtime security is shifting to eBPF. The traditional approach to container security — scanning images before deployment — is necessary but not sufficient. Attacks happen at runtime. eBPF enables runtime security tools that observe actual system behavior and enforce policy based on what processes are doing, not just what images contain.

Zero-instrumentation observability is becoming viable at scale. The historical choice for production observability was: instrument everything explicitly (high accuracy, high engineering cost) or collect aggregate metrics passively (low engineering cost, low accuracy). eBPF provides a third path: kernel-level automatic observation of all service behavior, with accuracy approaching manual instrumentation and engineering cost approaching passive monitoring.

Continuous profiling powered by eBPF is changing how performance optimization is approached. Always-on profiling at one percent CPU overhead across all cluster workloads means that performance regressions surface immediately in production rather than requiring dedicated profiling sessions.

eBPF's trajectory resembles containers in 2014: it is the technology that a small number of experienced engineers use to solve hard problems, that is being abstracted into tools that will eventually make its capabilities accessible to everyone, and that is quietly becoming the foundation of the next generation of Linux infrastructure.


11. Practical Tips for Engineers

Getting Started

On Ubuntu 22.04 or later:

sudo apt update
sudo apt install bpfcc-tools linux-headers-$(uname -r) bpftrace

On Ubuntu, tools are available as execsnoop-bpfcc, opensnoop-bpfcc, etc. On some distributions the -bpfcc suffix is absent. Verify with ls /usr/sbin/execsnoop*.

Kernel Version Matters

eBPF capabilities have expanded significantly across kernel versions. A minimum of kernel 4.9 is required for most BCC tools. Kernel 5.8 introduced significant improvements to ring buffers and CO-RE (Compile Once Run Everywhere) that modern tools depend on. For production use, kernel 5.15 LTS or later is recommended. Check with uname -r.
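A quick pre-flight check before relying on modern tools, sketched for a typical node:

```shell
# Kernel version: 5.15+ recommended for production eBPF work
uname -r

# BTF type information must be present for CO-RE-based tools to load
ls /sys/kernel/btf/vmlinux
```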

BCC vs bpftrace

BCC tools (the execsnoop, biolatency, etc. commands) are purpose-built tools for specific observability tasks. They are the right choice for incident debugging — run a command, get immediately useful output.

bpftrace is a high-level scripting language for writing custom eBPF programs. Use it when none of the BCC tools answer your specific question, or when you need to combine multiple observations in a custom way.

# bpftrace one-liner: trace all files opened by process name "nginx"
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args->filename)); }'

Using eBPF Safely in Production

Never run eBPF programs you do not understand on production systems. BCC tools from the official repository are safe. Custom or third-party scripts should be reviewed before production use.

Filtering matters. Running opensnoop on a busy production node without PID or path filtering will generate enormous output and some overhead. Always filter to the specific process, container, or operation type you are investigating.

Duration limits. Most BCC tools accept a duration argument. Use it to ensure tools terminate rather than running indefinitely.
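For example, most BCC tools accept interval and count arguments, and timeout(1) bounds the ones that do not (tool names assume Ubuntu's -bpfcc packaging):

```shell
# biolatency: print a histogram every 10 seconds, 3 times, then exit
sudo biolatency-bpfcc 10 3

# For tools without a duration argument, bound them with timeout(1)
sudo timeout 30 tcpconnect-bpfcc
```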

Kernel verifier rejection. On older kernels, some programs fail verification and will not load. This is safe — the program simply does not execute. The error message explains why.

First Five Commands to Learn

Start with these, in this order:

  1. execsnoop — get familiar with process execution observation
  2. opensnoop — understand file access patterns
  3. tcpconnect and tcpretrans — build network-layer intuition
  4. biolatency — learn to interpret I/O latency histograms
  5. runqlat — understand scheduler latency and its relationship to tail latency

These five commands cover the majority of production debugging scenarios that eBPF is needed for.

Common Mistakes

  1. Running tools without filters on busy nodes generates overwhelming output and unnecessary overhead. Always start with -p PID or equivalent filtering.
  2. Interpreting histograms without context. A tail at 100ms in biolatency means very different things on a network-attached volume versus a local NVMe.
  3. Assuming eBPF programs have zero overhead. They have very low overhead, not zero. Profiling at high frequency (1000Hz) on a production system adds measurable load.
  4. Confusing tool availability with kernel support. The tool may install but fail to load if the kernel version does not support the required features.


12. DevOps Cheat-Sheet

Save this section as a reference for incident debugging.

Processes

| Command | Purpose | Best Use Case | Reveals |
| --- | --- | --- | --- |
| execsnoop | Trace all process executions | Unexpected process activity · security · debug init | Process name, args, PID, exit code |
| opensnoop | Trace all file opens | File access patterns · ENOENT errors · startup debug | Filename, process, error code |
| offcputime | Show where threads block | Slow service with low CPU · lock contention | Off-CPU stack traces with duration |
| profile | CPU flame profiling | High CPU · unknown hot path | CPU stack trace samples |

Files and I/O

| Command | Purpose | Best Use Case | Reveals |
| --- | --- | --- | --- |
| biolatency | Block I/O latency histogram | Slow database · disk investigation | Latency distribution by device |
| biosnoop | Per-operation block I/O trace | Which process is slow on disk | Per-op latency, process, device |
| fileslower | Slow file operations | NFS latency · specific file debugging | Slow file path, process, duration |
| mountsnoop | Mount/unmount syscalls | K8s PV failures · container storage | Mount path, type, return code |

Networking

| Command | Purpose | Best Use Case | Reveals |
| --- | --- | --- | --- |
| tcpconnect | Outbound TCP connections | Unexpected egress · misconfig | Destination IP, port, process |
| tcpaccept | Inbound TCP accepted | Connection rate · audit | Client IP, server port, process |
| tcpretrans | TCP retransmissions | Intermittent latency · network degradation | Connection, retransmit count |
| tcptop | TCP throughput by connection | Bandwidth consumer identification | Bytes in/out per connection |
| tcpdrop | Kernel TCP packet drops | Silent drops · conntrack exhaustion | Drop reason, kernel stack trace |

CPU and Scheduler

| Command | Purpose | Best Use Case | Reveals |
| --- | --- | --- | --- |
| runqlat | Scheduler run queue latency | Tail latency · CPU contention | Wait time histogram |
| runqlen | Run queue length | CPU saturation diagnosis | Queue depth per CPU |
| profile | CPU profiling by stack | Hot path identification | Sampled CPU stack traces |

Containers and Kubernetes

| Command | Purpose | Best Use Case | Reveals |
| --- | --- | --- | --- |
| execsnoop --cgroupmap | Container process execution | Security · unexpected spawns | All processes in cgroup |
| tcpconnect -p | Container network connects | Pod egress audit | Pod-level TCP connections |
| opensnoop -p | Container file access | Container FS debugging | Files opened by pod PID |
| mountsnoop | Volume mount activity | PV attach failures | Mount syscalls with errors |

13. Conclusion

Most engineers observe Linux from the outside: through metrics that applications choose to expose, logs that code paths choose to write, and traces that instrumented services choose to record. This works for the incidents that fall within the boundaries of what existing instrumentation covers.

eBPF allows you to observe Linux from inside the kernel itself.

It reveals the scheduling decisions that cause tail latency before an application thread executes a single instruction. It shows the network packets that are dropped before a socket receives data. It captures the file accesses that an application makes without logging. It records every process execution, every TCP connection, every disk I/O operation — in real time, with process and container attribution, with no instrumentation required.

The gap between a thirty-minute incident and a two-hour incident is often not analytical skill — it is the presence or absence of the right observability layer. Engineers who know tcpretrans find network problems in minutes. Engineers who know runqlat diagnose CPU contention in seconds. Engineers who know tcpdrop surface silent packet drops that remain invisible to every other tool.

You do not need to become an eBPF developer to benefit from it. Learning five commands — execsnoop, tcpretrans, biolatency, runqlat, tcpdrop — changes how you approach production debugging. Each one reveals a class of problems that nothing else surfaces as quickly.

The Linux kernel has been recording everything that happens inside it for decades. eBPF finally gives production engineers a practical way to read that record.


Written for DevOps engineers, SREs, and platform teams operating production Linux and Kubernetes environments. Commands tested on Ubuntu 22.04 with kernel 5.15 and BCC tools 0.26.
