<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parth Shah</title>
    <description>The latest articles on DEV Community by Parth Shah (@parth21shah).</description>
    <link>https://dev.to/parth21shah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3660472%2F6ed6885c-1765-43a3-b8d0-5586258f4c3f.jpeg</url>
      <title>DEV Community: Parth Shah</title>
      <link>https://dev.to/parth21shah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parth21shah"/>
    <language>en</language>
    <item>
      <title>10,000 eBPF Events to 1 Alert: Don’t burn the CPU</title>
      <dc:creator>Parth Shah</dc:creator>
      <pubDate>Sat, 13 Dec 2025 20:24:15 +0000</pubDate>
      <link>https://dev.to/parth21shah/10000-ebpf-events-to-1-alert-dont-burn-the-cpu-2g41</link>
      <guid>https://dev.to/parth21shah/10000-ebpf-events-to-1-alert-dont-burn-the-cpu-2g41</guid>
      <description>&lt;p&gt;eBPF lets you observe &lt;em&gt;everything&lt;/em&gt; the Linux kernel is doing.&lt;/p&gt;

&lt;p&gt;The problem: if you ship every event to user space, &lt;strong&gt;your monitoring becomes the outage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On a busy server, the kernel can generate &lt;strong&gt;millions of events per second&lt;/strong&gt;: file opens, network packets, process forks… everything.&lt;/p&gt;

&lt;p&gt;If you try to ship all of that to a database or log system, two things happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observer effect:&lt;/strong&gt; your monitoring agent eats CPU and makes latency worse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk death:&lt;/strong&gt; you fill storage with noise nobody will ever read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve seen people rack up serious log bills just by flipping &lt;code&gt;debug&lt;/code&gt; on in the wrong place.&lt;/p&gt;

&lt;p&gt;So the question becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we go from 10,000+ raw events&lt;br&gt;&lt;br&gt;
to 1 useful alert,&lt;br&gt;&lt;br&gt;
without burning the CPU?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, the answer is an architecture that looks like a &lt;strong&gt;funnel&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;You cannot send every event to user space. Crossing the boundary (kernel → user) is not free.&lt;/p&gt;

&lt;p&gt;The pattern I use is a &lt;strong&gt;3-stage funnel&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In-kernel filtering&lt;/strong&gt; — throw away as much as possible before waking any agent
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ring buffers&lt;/strong&gt; — move data efficiently when you do need to send it
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-space windowing&lt;/strong&gt; — find patterns over time and only then alert
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the core idea behind how I’m building my own agent with Rust + eBPF right now.&lt;/p&gt;

&lt;h2&gt;The architecture: a funnel, not a firehose&lt;/h2&gt;

&lt;p&gt;If you treat eBPF like a log pipeline, you’ll lose.&lt;/p&gt;

&lt;p&gt;Most of the cost is not “eBPF” itself — it’s the work you force by crossing &lt;strong&gt;kernel → user space&lt;/strong&gt; too often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wakeups / context switches&lt;/li&gt;
&lt;li&gt;per-event allocations in your agent&lt;/li&gt;
&lt;li&gt;backpressure that turns into dropped events (aka blind spots)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You want a funnel: throw away the boring stuff &lt;em&gt;early&lt;/em&gt;, and only ship the interesting tail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x6xab1aa84viffqh7qg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x6xab1aa84viffqh7qg.png" alt="Funnel: in-kernel filtering → ring buffer → user-space windowing" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Stage 1: In-kernel filtering (don’t wake the agent)&lt;/h2&gt;

&lt;p&gt;The fastest code is the code that never runs.&lt;br&gt;&lt;br&gt;
The cheapest event is the event you never send.&lt;/p&gt;
&lt;h3&gt;Example: detecting slow HTTP requests&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Naive approach&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On every request start and end, send an event to user space.&lt;/li&gt;
&lt;li&gt;Let the agent compute &lt;code&gt;(end - start)&lt;/code&gt; for every request and check if it’s &lt;code&gt;&amp;gt; 500ms&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: you send thousands of events per second just to discover that almost all of them are fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;eBPF approach&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep a small map (hash table) in kernel memory.&lt;/li&gt;
&lt;li&gt;When a request starts, store the start timestamp in the map.&lt;/li&gt;
&lt;li&gt;When the request ends, look up the start time, compute duration, and check it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logic in kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if duration &lt;code&gt;&amp;lt;= 500ms&lt;/code&gt; → delete entry, do nothing&lt;/li&gt;
&lt;li&gt;if duration &lt;code&gt;&amp;gt; 500ms&lt;/code&gt; → send &lt;strong&gt;one&lt;/strong&gt; event to user space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So 99% of “healthy” requests never cross the kernel boundary. No extra wakeup, no extra allocation in your agent, nothing.&lt;/p&gt;

&lt;p&gt;Same idea works for fork storms, short-lived jobs, etc. Do the cheap check in kernel, and only emit “interesting” cases.&lt;/p&gt;
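&lt;p&gt;In the real agent this lives in an eBPF program; here is the same start/end logic as a runnable Python sketch, where a plain dict stands in for the in-kernel BPF hash map and &lt;code&gt;emitted&lt;/code&gt; stands in for the ring-buffer submit (all names here are mine, not a real API):&lt;/p&gt;

```python
# Sketch of the Stage 1 logic. In the real agent this runs as an eBPF
# program; start_times stands in for a BPF hash map and emitted stands
# in for events submitted to the ring buffer. Names are illustrative.

SLOW_THRESHOLD_NS = 500_000_000   # 500 ms

start_times = {}   # stands in for the in-kernel BPF hash map
emitted = []       # stands in for events sent to user space

def on_request_start(request_id, now_ns):
    # request started: remember when, stay in the kernel
    start_times[request_id] = now_ns

def on_request_end(request_id, now_ns):
    started = start_times.pop(request_id, None)
    if started is None:
        return   # never saw the start; nothing to report
    duration_ns = now_ns - started
    if duration_ns > SLOW_THRESHOLD_NS:
        # only slow requests ever cross the kernel boundary
        emitted.append({"request_id": request_id, "duration_ns": duration_ns})

on_request_start(1, 0)
on_request_end(1, 100_000_000)   # fast request: nothing is emitted
on_request_start(2, 0)
on_request_end(2, 700_000_000)   # slow request: exactly one event
```

&lt;p&gt;Only the slow request ever produces an event; the healthy one is deleted from the map and never wakes the agent.&lt;/p&gt;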
&lt;h2&gt;Stage 2: Ring buffers (moving data without pain)&lt;/h2&gt;

&lt;p&gt;When we do find a “bad” event (for example: a fork-bomb pattern), we still need to ship it to user space.&lt;/p&gt;

&lt;p&gt;What we &lt;em&gt;don’t&lt;/em&gt; want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writing to files on every event&lt;/li&gt;
&lt;li&gt;sending everything over TCP sockets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are too slow and add overhead.&lt;/p&gt;

&lt;p&gt;Instead, use a &lt;strong&gt;ring buffer&lt;/strong&gt; (a perf buffer, or the newer BPF ring buffer):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it’s a shared buffer between kernel and user space&lt;/li&gt;
&lt;li&gt;kernel writes events to the head&lt;/li&gt;
&lt;li&gt;the agent reads from the tail&lt;/li&gt;
&lt;li&gt;no syscall per event, no per-event allocation on the hot path&lt;/li&gt;
&lt;/ul&gt;
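&lt;p&gt;To make the head/tail mechanics concrete, here is a toy single-producer/single-consumer ring in Python. The class is mine for illustration; a real perf/BPF ring buffer lives in memory shared between kernel and user space, but the indices work the same way:&lt;/p&gt;

```python
# Toy model of the shared ring buffer: the "kernel" writes at the head,
# the agent reads from the tail. Shows what happens when the reader
# falls behind: once the buffer is full, new events are lost.

class Ring:
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0      # next write position (kernel side)
        self.tail = 0      # next read position (agent side)
        self.dropped = 0

    def write(self, event):
        if self.head - self.tail == self.size:
            # buffer full: the reader fell behind, the event is lost
            self.dropped += 1
            return False
        self.slots[self.head % self.size] = event
        self.head += 1
        return True

    def read_batch(self, max_events):
        batch = []
        while self.tail != self.head and len(batch) != max_events:
            batch.append(self.slots[self.tail % self.size])
            self.tail += 1
        return batch

ring = Ring(4)
for i in range(6):
    ring.write(i)   # the last two writes are dropped: buffer is full
```

&lt;p&gt;After those six writes, &lt;code&gt;ring.dropped&lt;/code&gt; is 2 and a batch read returns only the first four events. That loss is exactly the problem the next section is about.&lt;/p&gt;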
&lt;h3&gt;The tricky part: falling behind&lt;/h3&gt;

&lt;p&gt;If the kernel writes faster than you read, the buffer fills up and new events get dropped on the floor. Those drops are blind spots.&lt;/p&gt;

&lt;p&gt;To reduce that risk, don’t read one event at a time and process it synchronously. My pattern looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read a &lt;strong&gt;batch&lt;/strong&gt; of events from the ring buffer&lt;/li&gt;
&lt;li&gt;push them into an internal queue/channel&lt;/li&gt;
&lt;li&gt;process them in a separate worker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep the ring buffer as empty as possible. Keep the kernel happy.&lt;/p&gt;
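&lt;p&gt;A minimal sketch of that drain-fast, process-later split in Python (my agent does this in Rust with channels; all the names below are mine). One thread only moves batches off the ring buffer into a queue; a separate worker does the slower per-event work:&lt;/p&gt;

```python
import queue
import threading
import time

# Drain thread: hot path, does nothing but move batches off the ring.
# Worker thread: cold path, where parsing/windowing/alerting happen.

event_queue = queue.Queue()
stop = threading.Event()
seen = []   # stands in for "process the event"

def drain_loop(read_batch):
    # keep the ring buffer as empty as possible
    while not stop.is_set():
        batch = read_batch()
        if batch:
            event_queue.put(batch)
        else:
            time.sleep(0.01)   # nothing to read right now

def worker_loop(handle):
    # drain the queue even after stop is requested
    while not stop.is_set() or not event_queue.empty():
        try:
            batch = event_queue.get(timeout=0.05)
        except queue.Empty:
            continue
        for event in batch:
            handle(event)

# stand-in for reading the real ring buffer: three batches, then empty
batches = [[1, 2], [3], [4, 5, 6]]
def fake_read_batch():
    return batches.pop(0) if batches else []

drainer = threading.Thread(target=drain_loop, args=(fake_read_batch,))
worker = threading.Thread(target=worker_loop, args=(seen.append,))
drainer.start()
worker.start()
time.sleep(0.3)
stop.set()
drainer.join()
worker.join()
```

&lt;p&gt;The point of the split: a slow handler only grows the user-space queue, instead of backing up the shared ring buffer and causing kernel-side drops.&lt;/p&gt;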
&lt;h2&gt;Stage 3: Windowing (from raw events to a real alert)&lt;/h2&gt;

&lt;p&gt;Even after filtering, raw events are not the alert.&lt;/p&gt;

&lt;p&gt;Example stream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“PID 100 called fork”&lt;/li&gt;
&lt;li&gt;“PID 100 called fork”&lt;/li&gt;
&lt;li&gt;“PID 100 called fork”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s not actionable. It’s just a list.&lt;/p&gt;

&lt;p&gt;To turn this into something useful, use time windows in user space.&lt;/p&gt;

&lt;p&gt;Very simplified example in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FORK = "fork"
process_stats = {}   # pid to fork count in the current window

def trigger_alert(reason, pid):
    print(reason, pid)

# on each fork event
def on_event(pid, event_type):
    if event_type == FORK:
        process_stats[pid] = process_stats.get(pid, 0) + 1

# every 1 second (the tick)
def on_tick():
    for pid, count in process_stats.items():
        if count &amp;gt; 50:
            trigger_alert("fork_bomb_suspected", pid)
    process_stats.clear()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a metric: &lt;strong&gt;forks per second per PID&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The alert becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“PID 1234 called fork 57 times in the last second.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Much more useful than staring at a wall of single fork events.&lt;/p&gt;

&lt;p&gt;Same idea applies to other patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“opened N new file descriptors in a short window”&lt;/li&gt;
&lt;li&gt;“created and exited M child processes in 2 seconds”&lt;/li&gt;
&lt;li&gt;“network connections to the same IP exploded in the last second”&lt;/li&gt;
&lt;/ul&gt;
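&lt;p&gt;All of these are the same tumbling-window trick with a different key and threshold. A small reusable sketch, here keyed by destination IP for the connection-storm case (the class name and API are mine, not from any library):&lt;/p&gt;

```python
# Generic tumbling-window counter: count events per key, and on each
# tick report every key whose count crossed the threshold, then reset.
# Class and names are illustrative, not a real library API.

class WindowCounter:
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}

    def record(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def tick(self):
        # called once per window (e.g. every second by a timer)
        hot = [(k, n) for k, n in self.counts.items() if n > self.threshold]
        self.counts = {}
        return hot

conns = WindowCounter(threshold=100)
for _ in range(150):
    conns.record("10.0.0.7")   # connection storm to one IP
conns.record("10.0.0.8")       # normal traffic, stays below threshold
```

&lt;p&gt;Swap the key for a PID and the threshold for 50 and you get the fork-bomb detector above; swap it for a file-descriptor owner and you get the fd-leak case.&lt;/p&gt;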

&lt;h2&gt;The missing piece: context&lt;/h2&gt;

&lt;p&gt;Even with good filtering and windowing, tools often fail on the “why?” question.&lt;/p&gt;

&lt;p&gt;You get: “High CPU on PID 555.”&lt;br&gt;&lt;br&gt;
You ask: “What was PID 555 actually doing?”&lt;/p&gt;

&lt;p&gt;If the process is already gone, you won’t be able to inspect it later.&lt;/p&gt;

&lt;p&gt;That’s why I try to attach context at the moment of the event:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stack trace → which function was running&lt;/li&gt;
&lt;li&gt;parent PID → who launched this&lt;/li&gt;
&lt;li&gt;container/cgroup ID → which container/pod this belongs to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I grab this data as close as possible to the event (inside the eBPF program or right after it reaches user space), and send it along with the alert.&lt;/p&gt;
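&lt;p&gt;On the user-space side, one cheap way to do this on Linux is &lt;code&gt;/proc&lt;/code&gt;: &lt;code&gt;/proc/PID/comm&lt;/code&gt; for the process name, field 4 of &lt;code&gt;/proc/PID/stat&lt;/code&gt; for the parent PID, and &lt;code&gt;/proc/PID/cgroup&lt;/code&gt; for the container/cgroup. A sketch (the &lt;code&gt;enrich&lt;/code&gt; helper is mine), which returns &lt;code&gt;None&lt;/code&gt; when the process is already gone — exactly the race that makes doing this early matter:&lt;/p&gt;

```python
import os

# Grab context for a PID right after the event arrives, while the
# process (hopefully) still exists. Uses Linux /proc; returns None if
# the process is already gone or /proc is not available.

def enrich(pid):
    try:
        with open("/proc/%d/comm" % pid) as f:
            comm = f.read().strip()
        with open("/proc/%d/stat" % pid) as f:
            # field 4 of /proc/PID/stat is the parent PID; the comm
            # field is wrapped in parentheses, so split after the ")"
            rest = f.read().rsplit(")", 1)[1].split()
            ppid = int(rest[1])
        with open("/proc/%d/cgroup" % pid) as f:
            cgroup = f.read().strip().splitlines()[-1]
        return {"comm": comm, "ppid": ppid, "cgroup": cgroup}
    except (OSError, IndexError, ValueError):
        return None   # process exited already, or not on Linux
```

&lt;p&gt;Stack traces are the exception: those you have to capture in the eBPF program itself (via a stack map), because by the time the event reaches user space the interesting frames are gone.&lt;/p&gt;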

&lt;p&gt;So the alert is no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“CPU high on PID 555”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It becomes something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“CPU high in container X, process /usr/bin/worker, function handle_batch(), parent PID 42”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now you have a chance of fixing the real issue, not just staring at numbers.&lt;/p&gt;

&lt;h2&gt;How I’m using this today&lt;/h2&gt;

&lt;p&gt;These ideas aren’t just theory for me. I’m building them into a small Rust + eBPF agent I call &lt;strong&gt;Linnix&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBPF programs handle the in-kernel filtering and write into perf ring buffers&lt;/li&gt;
&lt;li&gt;a Rust daemon reads events in batches, does the time-window logic, and raises alerts&lt;/li&gt;
&lt;li&gt;I keep a hard budget of &lt;strong&gt;&amp;lt; 1% CPU overhead&lt;/strong&gt;, so I’m forced to be careful about what leaves the kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you follow these principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;filter early (in kernel)&lt;/li&gt;
&lt;li&gt;transport fast (ring buffers)&lt;/li&gt;
&lt;li&gt;aggregate later (user-space windows)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you can watch systems doing millions of operations per second without your “observability” layer becoming the problem.&lt;/p&gt;

&lt;p&gt;Next up, I want to talk about automated remediation — how to safely act on these signals (for example, killing a runaway process) without creating a new class of outages.&lt;/p&gt;

&lt;p&gt;If you want to see the code side of this, I’m slowly open-sourcing it here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/linnix-os/linnix" rel="noopener noreferrer"&gt;https://github.com/linnix-os/linnix&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>linux</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
