DEV Community

Samson Tanimawo

eBPF for SREs: Observability Without Agents

The Agent Problem

Traditional monitoring means shipping an agent with every service. That agent:

  • Adds memory overhead
  • Needs to be updated
  • Gets out of date
  • Breaks with kernel upgrades
  • Needs instrumentation code

eBPF says: what if the kernel itself could emit observability data?

What eBPF Actually Is

eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without recompiling the kernel or loading kernel modules. It was originally built for packet filtering. Now it powers Cilium, Pixie, Falco, and dozens of other tools.

From an SRE perspective: you get deep visibility into syscalls, network traffic, process behavior, and filesystem operations with zero code changes to your applications.
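As a rough illustration of that zero-code-change visibility, the sketch below counts syscalls per process name across the whole node for a fixed window. It assumes bpftrace is installed and that the function is called as root. The function names and the 10-second default are hypothetical choices made for this example.

```shell
# Sketch: count syscalls per process name, then print the totals.
# Assumes bpftrace is installed and the caller is root.

# Build the bpftrace program. $1 is the sampling window in seconds (default 10).
syscall_count_prog() {
  local seconds="${1:-10}"
  printf 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:%s { exit(); }' "$seconds"
}

# Run the program. bpftrace prints every map when it exits.
count_syscalls() {
  bpftrace -e "$(syscall_count_prog "${1:-10}")"
}
```

From a root shell, `count_syscalls 5` samples for five seconds and prints one counter per process name. Nothing in the traced applications changes or restarts.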

What You Can Observe

network:
- every TCP connection (src, dst, bytes, duration)
- DNS queries and response times
- TLS handshake failures
- HTTP request/response cycles

application:
- function call latencies (uprobes)
- memory allocations
- lock contention
- GC pauses

security:
- syscall audit trails
- privilege escalations
- suspicious file access
- container escape attempts

performance:
- CPU scheduling delays
- I/O wait time per process
- disk latency histograms
- page fault patterns

All of this without modifying your application code.

A Practical Example: Detecting Slow HTTP Requests

Traditional approach: instrument your HTTP framework with OpenTelemetry, deploy a collector, ship traces.

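eBPF approach, starting with a probe on the TLS library that counts writes and bytes per process:

# Install bpftrace (Debian/Ubuntu)
sudo apt install bpftrace

# Count SSL_write calls per PID and sum the bytes written per process name.
# The libssl path varies by distro; check it with ldd on the target binary.
sudo bpftrace -e '
uprobe:/usr/lib/libssl.so:SSL_write {
  @http_writes[pid] = count();    // number of TLS writes per process ID
  @http_bytes[comm] = sum(arg2);  // arg2 is the length passed to SSL_write
}
'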

No code changes. No restarts. Real-time visibility.
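The counters above show volume rather than slowness. One hedged way to get closer to latency from the same probe point is to time each SSL_write call with a uprobe/uretprobe pair and build a histogram. This is only a sketch: time spent inside SSL_write is a rough proxy, not end-to-end request latency, and the library path and function names below are assumptions for illustration.

```shell
# Sketch: histogram of time spent inside SSL_write, per process name.
# Assumes bpftrace is installed, the caller is root, and the target
# process dynamically links the libssl found at the given path.

# $1: path to libssl (varies by distro; the default is the article's path).
ssl_write_latency_prog() {
  local lib="${1:-/usr/lib/libssl.so}"
  cat <<EOF
uprobe:${lib}:SSL_write { @start[tid] = nsecs; }
uretprobe:${lib}:SSL_write /@start[tid]/ {
  @write_us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
EOF
}

# Run until interrupted with Ctrl-C, then bpftrace prints the histograms.
trace_ssl_write_latency() {
  bpftrace -e "$(ssl_write_latency_prog "${1:-/usr/lib/libssl.so}")"
}
```

The entry probe stores a timestamp keyed by thread ID, and the return probe turns the difference into microseconds. A long tail in the `@write_us` histogram points at processes whose TLS writes are stalling, which is a reasonable place to start looking for slow responses.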

Tools Worth Knowing

1. Pixie (acquired by New Relic, now a CNCF sandbox project)

  • Auto-instruments every service in your K8s cluster
  • No code changes, no sidecars
  • Full HTTP, MySQL, Postgres, DNS tracing
  • Open source

2. Cilium

  • Network observability + security policy enforcement
  • Can replace kube-proxy
  • Hubble UI for service-to-service traffic visualization

3. Falco

  • Runtime security detection
  • "Alert if a process inside a container spawns a shell"
  • Rules are written in YAML

4. Parca

  • Continuous profiling via eBPF
  • See CPU flame graphs across your entire fleet
  • Identify the most expensive code paths

5. Tracee

  • Security-focused eBPF tracing
  • Detects privilege escalations, cryptojacking, suspicious syscalls

The Tradeoffs

Pros:

  • Zero app code changes
  • Low overhead (programs run in-kernel and can aggregate data before it reaches user space), though cost grows with how often a probe fires
  • Unified view across languages (Go, Python, Java, Rust, all seen the same way)
  • No per-service agent lifecycle to manage (most eBPF tools run one collector per node instead)

Cons:

  • Requires Linux 4.14+ (5.0+ preferred)
  • Steep learning curve for custom probes
  • Limited visibility into in-process logic (you see syscalls, not business logic)
  • eBPF verifier rejects programs for subtle reasons
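The kernel requirement in the list above is easy to check before rolling anything out. A minimal sketch, assuming the 4.14 floor quoted above is the one your tooling needs (individual tools may require newer kernels); the function name `kernel_at_least` is hypothetical:

```shell
# Sketch: check whether a kernel release string meets a minimum version.
# $1: release string as printed by `uname -r`, e.g. 5.15.0-91-generic
# $2, $3: required major and minor version
kernel_at_least() {
  local release="$1" want_major="$2" want_minor="$3"
  local major minor
  major="${release%%.*}"        # text before the first dot
  minor="${release#*.}"         # drop the "major." prefix
  minor="${minor%%[!0-9]*}"     # keep only the leading digits
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

On a node, `kernel_at_least "$(uname -r)" 4 14 && echo "kernel OK"` gives a quick yes/no that can run in a preflight script across the fleet.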

When eBPF Shines

  • Network debugging: "Why is service A slow to reach service B?"
  • Security auditing: "What containers are making unexpected syscalls?"
  • Performance profiling: "Where is the cluster CPU time actually going?"
  • Incident forensics: "Reconstruct the syscall timeline during the outage"

When eBPF Is the Wrong Tool

  • Business logic observability: you still need OpenTelemetry for spans
  • Application errors: your logs and exception tracking still matter
  • Multi-region correlation: eBPF is node-local

Use eBPF for infrastructure and network. Use OpenTelemetry for application logic. They complement each other.

Getting Started

  1. Deploy Pixie in a dev cluster (1-line install)
  2. Open the UI, watch real-time HTTP traffic
  3. Try a bpftrace one-liner to trace a specific syscall
  4. Read the Cilium + Hubble docs
  5. Replace one agent-based tool with its eBPF equivalent
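For step 3, a minimal sketch of tracing one specific syscall might look like the following. It assumes bpftrace is installed and the function is called as root; the function names, the `openat` default, and the 10-second window are illustrative choices.

```shell
# Sketch for step 3: count calls to one syscall, grouped by process name.
# Assumes bpftrace is installed and the caller is root.

# $1: syscall name without the sys_enter_ prefix (default: openat)
# $2: sampling window in seconds (default: 10)
one_syscall_prog() {
  local name="${1:-openat}" seconds="${2:-10}"
  printf 'tracepoint:syscalls:sys_enter_%s { @[comm] = count(); } interval:s:%s { exit(); }' \
    "$name" "$seconds"
}

# Run the program. bpftrace prints the per-process counts when it exits.
trace_syscall() {
  bpftrace -e "$(one_syscall_prog "${1:-openat}" "${2:-10}")"
}
```

To see which syscall names are available on a given kernel, `sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'` lists the matching tracepoints. From a root shell, `trace_syscall connect 30` would then count `connect` calls per process for 30 seconds.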

The future of observability is kernel-native. Agent-based tools will still exist, but the set of problems that only they can solve will keep shrinking.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
