<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prachi Jha</title>
    <description>The latest articles on DEV Community by Prachi Jha (@prachi_awesome_jha).</description>
    <link>https://dev.to/prachi_awesome_jha</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2734390%2F910b1e10-3045-47eb-9a96-5ef021fd1811.jpeg</url>
      <title>DEV Community: Prachi Jha</title>
      <link>https://dev.to/prachi_awesome_jha</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prachi_awesome_jha"/>
    <language>en</language>
    <item>
      <title>I Baked a Flutter App Into a Car OS. Here's What Broke and What Didn't.</title>
      <dc:creator>Prachi Jha</dc:creator>
      <pubDate>Sat, 21 Mar 2026 18:20:23 +0000</pubDate>
      <link>https://dev.to/prachi_awesome_jha/i-compiled-a-car-os-from-scratch-the-hard-part-was-one-line-pae</link>
      <guid>https://dev.to/prachi_awesome_jha/i-compiled-a-car-os-from-scratch-the-hard-part-was-one-line-pae</guid>
      <description>&lt;p&gt;&lt;em&gt;This came out of preparing for GSoC 2026 with AGL.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Automotive Grade Linux runs the infotainment systems in production Mazdas and Subarus. It's backed by most major automakers and compiles entirely from source - kernel, C library, every system tool. I spent five days building an AGL image from scratch, wrote a Flutter app, and baked it into the OS. This is what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scale of the Problem
&lt;/h2&gt;

&lt;p&gt;You can't &lt;code&gt;sudo apt install agl&lt;/code&gt;. AGL is built using Yocto, an industry-standard build system for custom embedded Linux distributions. Yocto doesn't download a pre-built OS. It compiles everything from source: the kernel, the C library, every system tool, the Flutter engine, and the app itself.&lt;/p&gt;
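&lt;p&gt;For context, the standard flow looks roughly like this (a sketch - the machine and feature flags are illustrative, not necessarily the exact ones I used):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fetch all the AGL layers with the repo tool
repo init -u https://gerrit.automotivelinux.org/gerrit/AGL/AGL-repo
repo sync

# configure a build for the QEMU x86-64 machine with the demo feature
source meta-agl/scripts/aglsetup.sh -m qemux86-64 agl-demo

# compile everything - this is the multi-hour step
bitbake agl-ivi-demo-flutter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;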

&lt;p&gt;My laptop had neither the compute nor the disk space. I spun up a GCP VM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine:&lt;/strong&gt; e2-standard-8 (8 vCPUs, 32 GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Ubuntu 22.04 LTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; 200 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First attempt failed overnight at 74%:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: The free space is running low (0.823GB left)
ERROR: No new tasks can be executed since the disk space monitor action is "STOPTASKS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yocto needs more than 200 GB. I hit a quota limit trying to expand the disk in Asia, deleted the VM, recreated it in &lt;code&gt;us-central1&lt;/code&gt; with 400 GB, and started over.&lt;/p&gt;

&lt;p&gt;Eight hours later, after 12,145 compilation tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tasks Summary: Attempted 12145 tasks of which 0 didn't need to be rerun and all succeeded.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I booted it in QEMU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Automotive Grade Linux 21.90.0 qemux86-64 ttyS0
qemux86-64 login: root
&lt;/span&gt;&lt;span class="gp"&gt;root@qemux86-64:~#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AGL 21.90.0. Codename: vimba. A virtual car computer, inside a cloud VM, in Iowa.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing the Flutter App
&lt;/h2&gt;

&lt;p&gt;The app reads &lt;code&gt;/etc/os-release&lt;/code&gt; at runtime to display the AGL version, which means the same binary shows Ubuntu values during local development and AGL values on the actual image - no build flags, no conditionals. The relevant field is &lt;code&gt;PRETTY_NAME&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_loadAglVersion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/etc/os-release'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readAsString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'PRETTY_NAME'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_aglVersion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'='&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;replaceAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi59wmygr3igxmti13me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi59wmygr3igxmti13me.png" alt="App on local machine" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Flutter app on local machine. For the image: Levi Ackerman from Attack on Titan. The sound button plays an audio clip I will not describe further.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8inxb9t0clm0oq4clvtk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8inxb9t0clm0oq4clvtk.png" alt="App on QEMU" width="800" height="1422"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Flutter app on QEMU&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Baking It In: Yocto Layers and Recipes
&lt;/h2&gt;

&lt;p&gt;Yocto builds from "layers" - folders that each contribute something to the final image. AGL ships with layers for its core system, demo apps, and Flutter engine support. To add my app, I created &lt;code&gt;meta-agl-prachi&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;meta-agl-prachi/
├── conf/
│   └── layer.conf
└── recipes-apps/
    └── agl-quiz-app/
        ├── agl-quiz-app.bb
        └── files/
            └── agl_quiz_app.desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.bb&lt;/code&gt; file (a "recipe") tells Yocto: where to fetch the source, how to build it, where to install it. Mine pointed to my GitHub repo and used &lt;code&gt;inherit flutter-app&lt;/code&gt;, a class provided by &lt;code&gt;meta-flutter&lt;/code&gt; that handles all the Flutter-specific build logic.&lt;/p&gt;
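&lt;p&gt;A minimal recipe in that style looks something like this (a sketch - the checksum and revision are placeholders, and the exact variable names depend on the &lt;code&gt;meta-flutter&lt;/code&gt; version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# agl-quiz-app.bb - illustrative sketch, not the literal recipe
SUMMARY = "Quiz app for AGL"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://LICENSE;md5=&lt;checksum&gt;"

SRC_URI = "git://github.com/PrachiJha-404/agl-flutter-app.git;protocol=https;branch=main"
SRCREV = "&lt;commit-hash&gt;"
S = "${WORKDIR}/git"

PUBSPEC_APPNAME = "agl_quiz_app"

inherit flutter-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;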

&lt;p&gt;Then I added my layer to the build and ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bitbake agl-ivi-demo-flutter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It finished in minutes. Suspiciously fast - only 5 tasks rerun. Yocto had reused cached output from the previous build and skipped my layer entirely.&lt;/p&gt;
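&lt;p&gt;Yocto's shared state (sstate) cache is aggressive about reuse, so stale results like this have to be invalidated by hand. One way (there are several) is to wipe the cached state for the one recipe and rebuild the image:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# drop cached output for just this recipe, then rebuild
bitbake -c cleansstate agl-quiz-app
bitbake agl-ivi-demo-flutter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;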




&lt;h2&gt;
  
  
  The Lockfile Problem
&lt;/h2&gt;

&lt;p&gt;I ran &lt;code&gt;bitbake agl-quiz-app&lt;/code&gt; in isolation to see what was actually failing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: agl-quiz-app-1.0-r0 do_archive_pub_cache:
flutter pub get --enforce-lockfile failed: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error named the failed command. It didn't say where that command came from or how to change it. So I followed the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find ~/AGL/master/external/meta-flutter &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.bbclass"&lt;/span&gt;
&lt;span class="c"&gt;# → flutter-app.bbclass&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;flutter-app.bbclass
&lt;span class="c"&gt;# → require conf/include/flutter-app.inc&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;flutter-app.inc
&lt;span class="c"&gt;# → require conf/include/common.inc&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;common.inc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three files deep. In &lt;code&gt;common.inc&lt;/code&gt; I found it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUBSPEC_IGNORE_LOCKFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pubspec_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pubspec.lock&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pubspec_lock&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rm -rf pubspec.lock&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ...later...
&lt;/span&gt;&lt;span class="nf"&gt;run_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;flutter pub get --enforce-lockfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;flutter pub get --enforce-lockfile&lt;/code&gt; requires the lockfile to exactly match the resolved dependencies. My lockfile had been generated with a slightly different Dart SDK version than the one on the build VM. The fix was a single line in my recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUBSPEC_IGNORE_LOCKFILE &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. After two days of debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting the Display Working
&lt;/h2&gt;

&lt;p&gt;QEMU runs headless by default. To see the AGL UI, I exposed its display over VNC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;runqemu qemux86-64 serialstdio slirp &lt;span class="nv"&gt;qemuparams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-display vnc=:1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I opened port 5901 in GCP's firewall and connected with TigerVNC. First connection: the AGL warning screen, rotated 90 degrees - AGL IVI is designed for portrait car dashboards. One more line in &lt;code&gt;weston.ini&lt;/code&gt; pointed Weston at the right backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;vnc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdbngofgbqtyl7y4v0nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdbngofgbqtyl7y4v0nu.png" alt="AGL warning screen rotated" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting my app to render required understanding AGL's Wayland setup. The compositor runs as &lt;code&gt;agl-driver&lt;/code&gt; (uid 1001), and its Wayland socket lives at &lt;code&gt;/run/user/1001/wayland-0&lt;/code&gt;, inside that user's private runtime directory. Wayland clients are expected to run as the same user as the compositor - so the clean way in is to become &lt;code&gt;agl-driver&lt;/code&gt;, not to force a connection as root.&lt;/p&gt;
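&lt;p&gt;You can see the boundary from the shell - the per-user runtime directory is mode 700, so nothing outside that user is expected to touch the socket (output illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ls -ld /run/user/1001
# drwx------ ... agl-driver agl-driver ... /run/user/1001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;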

&lt;p&gt;I found the correct environment by reading the existing service file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /usr/lib/systemd/system/flutter-ics-homescreen.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which revealed the exact variables and paths needed. With those:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;su agl-driver -s /bin/sh -c '
  # export so flutter-auto actually inherits these variables
  export WAYLAND_DISPLAY=wayland-0
  export XDG_RUNTIME_DIR=/run/user/1001/
  export LD_PRELOAD=/usr/lib/librive_text.so
  export LIBCAMERA_LOG_LEVELS="*:ERROR"
  flutter-auto -b /usr/share/flutter/agl_quiz_app/3.38.3/release --xdg-shell-app-id agl_quiz_app'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The app launched.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1ug455doxxfw8vqx3yd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1ug455doxxfw8vqx3yd.png" alt="AGL quiz app running in TigerVNC" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AGL 21.90.0 (vimba), with the version string "Automotive Grade Linux 21.90.0 (vimba)" pulled live from &lt;code&gt;/etc/os-release&lt;/code&gt; at runtime. Sound doesn't come through QEMU yet (that requires additional ALSA configuration), but everything else works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;The most useful thing I practiced here wasn't Flutter or Yocto syntax. It was following &lt;code&gt;require&lt;/code&gt; statements until I found the line actually doing the thing. &lt;code&gt;common.inc&lt;/code&gt; wasn't linked from anywhere in the docs. Three files of reading got me there. When something breaks in Yocto, the error names the failed task, and that task is a readable function somewhere in the layer files. Start there and keep reading.&lt;/p&gt;

&lt;p&gt;The structure itself is simpler than it looks: layers are folders, recipes are config files, classes are reusable logic. The surface area is large but not deep.&lt;/p&gt;

&lt;p&gt;The full code and Yocto layer are on GitHub: &lt;a href="https://github.com/PrachiJha-404/agl-flutter-app" rel="noopener noreferrer"&gt;PrachiJha-404/agl-flutter-app&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>systems</category>
      <category>opensource</category>
      <category>dart</category>
    </item>
    <item>
      <title>I Made My Code Do Nothing. It Got Slower.</title>
      <dc:creator>Prachi Jha</dc:creator>
      <pubDate>Sat, 07 Feb 2026 12:22:41 +0000</pubDate>
      <link>https://dev.to/prachi_awesome_jha/the-performance-paradox-when-doing-less-work-makes-your-code-slower-398c</link>
      <guid>https://dev.to/prachi_awesome_jha/the-performance-paradox-when-doing-less-work-makes-your-code-slower-398c</guid>
      <description>&lt;p&gt;I stripped my code down to do absolutely nothing. Just count events and move on. It got 8% slower.&lt;/p&gt;

&lt;p&gt;This isn't measurement noise. Over 30 seconds of processing 12+ million events across 5 test runs, the "optimized" version was consistently, measurably slower than the version doing expensive kernel symbol lookups and string formatting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I built an eBPF tool that monitors TCP packet drops in the Linux kernel. eBPF (Extended Berkeley Packet Filter) lets you hook into kernel functions without modifying kernel source. When a packet is dropped, my eBPF program captures the event (PID, drop reason, kernel function) and sends it to userspace through a ring buffer.&lt;/p&gt;

&lt;p&gt;During stress testing, I flooded my machine with SYN packets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;hping3 &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 80 &lt;span class="nt"&gt;--flood&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel started dropping ~400,000 packets per second. My tool reads each drop event from the ring buffer and processes it.&lt;/p&gt;

&lt;p&gt;I created four benchmark modes to find the bottleneck:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark&lt;/strong&gt; - Just count events. Zero processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy&lt;/strong&gt; - Do expensive work (symbol lookups, string formatting), then discard the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File&lt;/strong&gt; - Same expensive work, but write to a file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal&lt;/strong&gt; - Same expensive work, print to terminal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each mode ran for 30 seconds. Here's what happened:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Events Read&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12,548,707&lt;/td&gt;
&lt;td&gt;373,828/sec&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Busy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12,432,234&lt;/td&gt;
&lt;td&gt;370,316/sec&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Benchmark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11,570,132&lt;/td&gt;
&lt;td&gt;344,616/sec&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Terminal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;653,848&lt;/td&gt;
&lt;td&gt;19,353/sec&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mode doing &lt;strong&gt;zero work&lt;/strong&gt; came in third.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradox
&lt;/h2&gt;

&lt;p&gt;Benchmark Mode should have won. It's literally just incrementing a counter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Benchmark Mode - the "fast" path&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ringbuf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventsRead&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No symbol lookups. No string formatting. No I/O. Just atomic increment and repeat.&lt;/p&gt;

&lt;p&gt;But it lost by 8% to Busy Mode, which does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Busy Mode&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EventProcessor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ProcessEventBusy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;monitorEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventsRead&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Map lookup for drop reason&lt;/span&gt;
    &lt;span class="n"&gt;reasonStr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropReasons&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reasonStr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;reasonStr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"UNKNOWN(%d)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Binary search through 300k kernel symbols&lt;/span&gt;
    &lt;span class="n"&gt;symbolName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;findNearestSymbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;symbolName&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;symbolName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0x%x"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// String formatting with allocations&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] Drop | PID: %-6d | Reason: %-18s | Function: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"15:04:05"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reasonStr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;symbolName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventsPrinted&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How does doing more work make code faster?&lt;/p&gt;

&lt;h2&gt;
  
  
  Ruling Out the Obvious
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Allocation?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total allocated: 303 MB&lt;/li&gt;
&lt;li&gt;GC runs: 14&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Busy Mode:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total allocated: 2,592 MB (8x more)&lt;/li&gt;
&lt;li&gt;GC runs: 75 (5x more)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Busy Mode was doing 5x more garbage collection and still winning. The bottleneck wasn't memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29qd3xpbpkduovc0vula.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29qd3xpbpkduovc0vula.png" alt="Benchmark mode output" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rvh1ch1vlubh3euni9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rvh1ch1vlubh3euni9n.png" alt="Busy Mode output" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Smoking Gun: CPU Profiling
&lt;/h2&gt;

&lt;p&gt;I added CPU profiling with &lt;code&gt;pprof&lt;/code&gt; to both modes. The difference was stark.&lt;/p&gt;
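&lt;p&gt;The wiring for that is small - &lt;code&gt;runtime/pprof&lt;/code&gt; from Go's standard library, started before the event loop and stopped after. A minimal sketch (the real tool wraps its 30-second ring buffer loop here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// imports: "os", "log", "runtime/pprof"
f, err := os.Create("cpu.prof")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile()

// ... run the event loop for the selected mode ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;go tool pprof cpu.prof&lt;/code&gt; on the result produces the breakdowns below.&lt;/p&gt;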

&lt;h3&gt;
  
  
  Benchmark Mode (344k events/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgujkw9f1a7eaxcl4gr1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgujkw9f1a7eaxcl4gr1.jpg" alt="pprof benchmark" width="800" height="1130"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unix.EpollWait:     11.12s (76%) ← Blocking on syscalls
ringbuf.Read:       13.61s (93%) ← Total time in read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;76% of CPU time spent waiting in &lt;code&gt;epoll_wait()&lt;/code&gt;&lt;/strong&gt;, blocked while the kernel writes the next event.&lt;/p&gt;

&lt;h3&gt;
  
  
  Busy Mode (370k events/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zt566yfc2oemtnahjhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zt566yfc2oemtnahjhd.png" alt="pprof busy mode" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ProcessEventBusy:   11.27s (54%) ← Actually doing work
  ├─ fmt.Sprintf:    6.53s (31%)
  ├─ findNearestSymbol: 2.61s (13%)
  └─ mallocgc:       3.19s (15%)

unix.EpollWait:     5.57s (27%) ← Much less waiting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only 27% waiting. The rest was productive work.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Mode (373k events/sec)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecvy9oyj12sq9hk149vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecvy9oyj12sq9hk149vy.png" alt="pprof file mode" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ProcessEvent:         11.66s (56%) ← Work
  ├─ fmt.Fprintf:      4.95s (24%) ← Cheaper than Sprintf!
  ├─ findNearestSymbol: 2.68s (13%)
  └─ mallocgc:         2.64s (13%)
unix.EpollWait:        5.87s (28%) ← Same batching as Busy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;File and Busy have identical syscall overhead (28% waiting). File edges ahead because &lt;code&gt;fmt.Fprintf()&lt;/code&gt; to a buffered file is more efficient than &lt;code&gt;fmt.Sprintf()&lt;/code&gt; creating throwaway strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happens: The Syscall Tax
&lt;/h2&gt;

&lt;p&gt;The kernel generates events at ~2.5µs per event (400k/sec).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Mode&lt;/strong&gt; processes each event in ~1µs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event arrives → Process (1µs) → ringbuf.Read()
                                    ↓
                            Ring buffer empty!
                                    ↓
                    epoll_wait() blocks (context switch)
                                    ↓
                    Kernel writes next event (1.5µs)
                                    ↓
                        Wake up userspace
                                    ↓
                                 Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ~400,000 context switches per second, constant blocking.&lt;/p&gt;

&lt;p&gt;Benchmark Mode was &lt;strong&gt;too fast&lt;/strong&gt;. It kept asking for the next event before the kernel had written it, forcing the process to sleep in &lt;code&gt;epoll_wait()&lt;/code&gt; waiting for data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Busy Mode&lt;/strong&gt; processes each event in ~4µs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event arrives → Process (4µs) → ringbuf.Read()
       ↑                            ↓
       |                    3 events already waiting!
       |                            ↓
       └──────── Kernel wrote more while we were busy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; ~150,000 context switches per second, natural batching.&lt;/p&gt;

&lt;p&gt;By the time userspace calls &lt;code&gt;Read()&lt;/code&gt;, multiple events are already queued in the ring buffer. No blocking needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accidental Batching
&lt;/h2&gt;

&lt;p&gt;Busy Mode's processing time (~4µs per event) accidentally created &lt;strong&gt;natural batching&lt;/strong&gt;. While userspace was busy formatting strings and looking up symbols, the kernel queued multiple events. Each &lt;code&gt;ringbuf.Read()&lt;/code&gt; call pulled several events without blocking.&lt;/p&gt;

&lt;p&gt;Benchmark Mode outpaced its data source and spent most of its time context-switching between userspace and kernel, waiting for the next event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The paradox:&lt;/strong&gt; Removing work made the code too fast, causing it to waste time waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of Performance
&lt;/h2&gt;

&lt;p&gt;Here's what's actually cheap versus expensive in event-driven programming:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cheap:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory allocation (even 8x more)&lt;/li&gt;
&lt;li&gt;Garbage collection (even 5x more)&lt;/li&gt;
&lt;li&gt;CPU work (symbol lookups, string formatting)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Expensive:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context switches (crossing the kernel/userspace boundary)&lt;/li&gt;
&lt;li&gt;Blocking syscalls (&lt;code&gt;epoll_wait&lt;/code&gt;, &lt;code&gt;read&lt;/code&gt;, &lt;code&gt;poll&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you optimize CPU work to near-zero, you don't eliminate the cost; you just shift it to I/O overhead. If your processing loop is faster than your data source, you end up paying the syscall tax on every single event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Care?
&lt;/h2&gt;

&lt;p&gt;This pattern appears in any high-frequency event processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network packet processing&lt;/strong&gt; (eBPF, DPDK, raw sockets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial trading systems&lt;/strong&gt; (market data feeds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log aggregation&lt;/strong&gt; (reading from message queues)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics collection&lt;/strong&gt; (statsd, Prometheus exporters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Game engines&lt;/strong&gt; (input event processing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; If you process events faster than they arrive, you're probably paying unnecessary syscall overhead. Profile your code. If you see &amp;gt;50% of time in &lt;code&gt;epoll_wait&lt;/code&gt;/&lt;code&gt;poll&lt;/code&gt;/&lt;code&gt;select&lt;/code&gt;, you're thrashing on syscalls. Batch your reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Fix
&lt;/h2&gt;

&lt;p&gt;The real lesson here isn't "batch your reads better in userspace." The &lt;code&gt;cilium/ebpf&lt;/code&gt; library's &lt;code&gt;ringbuf.Read()&lt;/code&gt; is already reasonably efficient. You're still bound by the poll/epoll cycle regardless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The actual fix is to stop sending 400,000 events to userspace in the first place.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where eBPF's real power comes in: &lt;strong&gt;move the aggregation logic into the kernel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kernel: Drop packet → Send event to ring buffer
Userspace: Read event → Process → Count
Result: 400,000 context switches/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proper approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kernel: Drop packet → Increment counter in BPF map (no userspace trip!)
Userspace: Read aggregated counts once per second
Result: 1 context switch/sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using eBPF maps, I can aggregate packet drops directly in kernel memory - counting by kernel function, by IP address, by drop reason - and pull a summary to userspace periodically instead of streaming 400,000 individual events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected improvement:&lt;/strong&gt; 99.9% reduction in context switches (from 400k/sec to ~1/sec).&lt;/p&gt;

&lt;p&gt;This is future work for me. The benchmark results were educational: they taught me about syscall overhead and accidental batching, but they also revealed I was solving the problem in the wrong place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The fastest code isn't always the code doing the least work; it's the code that minimizes expensive operations.&lt;/p&gt;

&lt;p&gt;In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark Mode:&lt;/strong&gt; Optimized CPU work, paid 76% overhead in syscalls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy Mode:&lt;/strong&gt; Did 8x more allocation and 5x more GC, reduced syscall overhead to 27%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Mode:&lt;/strong&gt; Most efficient I/O primitives, same batching benefits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The real enemy wasn't symbol lookups or string formatting. It was calling the kernel 400,000 times per second instead of 150,000 times.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile before optimizing&lt;/strong&gt; - My intuition said "remove work." The profiler said "syscalls are the bottleneck."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching beats speed&lt;/strong&gt; - Reading 10 events with 1 syscall is faster than reading 10 events with 10 syscalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;There's a sweet spot&lt;/strong&gt; - Too-fast processing just means waiting for I/O.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bottleneck is rarely where you think it is.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Environment:&lt;/strong&gt; Ubuntu 24.04, Linux 6.5, Go 1.21 (variance &amp;lt;2% across 5 runs)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full code:&lt;/strong&gt; &lt;a href="https://github.com/PrachiJha-404/ebpf-tcp-monitor" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Profiling implementation:&lt;/strong&gt; &lt;a href="https://github.com/PrachiJha-404/ebpf-tcp-monitor/tree/benchmark/investigation" rel="noopener noreferrer"&gt;&lt;code&gt;benchmark/investigation&lt;/code&gt; branch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous article:&lt;/strong&gt; &lt;a href="https://dev.to/prachi_awesome_jha/my-logs-lied-how-i-used-ebpf-to-find-the-truth-3k87"&gt;My Logs Lied: How I Used eBPF to Find the Truth&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>c</category>
      <category>kernel</category>
      <category>performance</category>
    </item>
    <item>
      <title>My Logs Lied: How I Used eBPF to Find the Truth</title>
      <dc:creator>Prachi Jha</dc:creator>
      <pubDate>Sun, 01 Feb 2026 17:34:24 +0000</pubDate>
      <link>https://dev.to/prachi_awesome_jha/my-logs-lied-how-i-used-ebpf-to-find-the-truth-3k87</link>
      <guid>https://dev.to/prachi_awesome_jha/my-logs-lied-how-i-used-ebpf-to-find-the-truth-3k87</guid>
      <description>&lt;p&gt;A while back, I wrote about the time I &lt;a href="https://dev.to/prachi_awesome_jha/my-go-server-was-so-fast-it-self-ddosd-my-laptop-48i2"&gt;accidentally DDOSed my own laptop&lt;/a&gt; while load testing a Go auction server. 1000 concurrent clients generated connections so fast that the kernel's TCP listen queue overflowed, but silently, dropping packets before they ever reached my application.&lt;/p&gt;

&lt;p&gt;The bug itself had a simple fix. But the experience left me with a harder problem: I had no way to &lt;em&gt;see&lt;/em&gt; it happening in real time. Application logs showed nothing. &lt;code&gt;netstat -s&lt;/code&gt; gave me a system-wide counter: "11,053 listen queue overflows". But not which process, not when, not why.&lt;/p&gt;

&lt;p&gt;So I built a tool to see inside the kernel. This is how it works, what I learned, and one genuinely weird thing I found while benchmarking it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Debugging Network Issues
&lt;/h2&gt;

&lt;p&gt;When a distributed system behaves badly - a database replica lags, a microservice times out, a message queue backs up - the first instinct is to look at application logs. But a whole category of problems lives &lt;em&gt;below&lt;/em&gt; the application, in the kernel's networking stack, invisible to anything your code can observe directly.&lt;/p&gt;

&lt;p&gt;TCP is designed to be resilient. It retransmits. It backs off. It recovers. But when it can't recover, when the listen queue overflows, when a firewall drops a packet, when a checksum fails, it just... drops the packet. No log entry. No error propagated upward. The application sees a timeout, eventually, but the actual cause happened microseconds earlier, deep in kernel code.&lt;/p&gt;

&lt;p&gt;The tools most developers reach for don't help here.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tcpdump&lt;/code&gt; captures packets on the wire, but it's too heavyweight to run continuously in production. It also shows you what &lt;em&gt;arrived&lt;/em&gt;, not what got &lt;em&gt;dropped&lt;/em&gt;. &lt;code&gt;netstat -s&lt;/code&gt; gives you aggregate counters - "11,053 listen queue overflows" - but nothing about which process, when, or why. You're left guessing.&lt;/p&gt;

&lt;p&gt;What I needed was something that could sit right at the point where the kernel drops a packet and report back: who was affected, why did it happen, and exactly where in kernel code did the decision get made.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter eBPF
&lt;/h2&gt;

&lt;p&gt;eBPF (extended Berkeley Packet Filter) is a technology that lets you run small, sandboxed programs inside the Linux kernel without modifying kernel source code or loading kernel modules (which is a lot more painful). It's been around for years, used by tools like Cilium and Datadog, but it's surprisingly accessible for individual developers once you understand the basics.&lt;/p&gt;

&lt;p&gt;The key insight is that the Linux kernel has built-in attachment points called &lt;strong&gt;tracepoints&lt;/strong&gt; - stable, documented hooks left by kernel developers for exactly this kind of observability. For packet drops, the relevant tracepoint is &lt;code&gt;skb/kfree_skb&lt;/code&gt;. Every time the kernel frees a socket buffer (which is what happens when a packet is dropped), this tracepoint fires.&lt;/p&gt;

&lt;p&gt;So the plan was straightforward: hook &lt;code&gt;kfree_skb&lt;/code&gt;, capture the information I needed, and get it out to my Go application in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Monitor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Kernel Side
&lt;/h3&gt;

&lt;p&gt;The eBPF program itself is surprisingly small — about 30 lines of C:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"vmlinux.h"&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;bpf/bpf_helpers.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// Process context when drop occurred&lt;/span&gt;
    &lt;span class="n"&gt;u32&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// Why the kernel dropped it&lt;/span&gt;
    &lt;span class="n"&gt;u64&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// Instruction pointer — where in kernel code&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_MAP_TYPE_RINGBUF&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 64KB ring buffer&lt;/span&gt;
    &lt;span class="n"&gt;__type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="nf"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".maps"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;SEC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tracepoint/skb/kfree_skb"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;trace_tcp_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;trace_event_raw_kfree_skb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Not a real drop, bail immediately&lt;/span&gt;

    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_ringbuf_reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Ring buffer full — skip silently&lt;/span&gt;

    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bpf_ringbuf_submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here. The filter on line one (&lt;code&gt;reason &amp;lt;= 1&lt;/code&gt;) is critical - &lt;code&gt;kfree_skb&lt;/code&gt; fires for &lt;em&gt;every&lt;/em&gt; packet that gets freed, including ones that completed successfully. Reason 0 (&lt;code&gt;SKB_DROP_REASON_NOT_SPECIFIED&lt;/code&gt;) and reason 1 (&lt;code&gt;SKB_DROP_REASON_NO_REASON&lt;/code&gt;) are normal lifecycle events. The kernel freed the buffer after it was done with the packet, not because something went wrong. We only care about actual drops, so we bail out immediately for everything else. This keeps the overhead minimal even though we're hooking a very hot kernel path.&lt;/p&gt;

&lt;p&gt;The ring buffer is a 64KB circular queue shared between kernel and userspace. When we call &lt;code&gt;bpf_ringbuf_reserve&lt;/code&gt;, we claim space in it. When we call &lt;code&gt;bpf_ringbuf_submit&lt;/code&gt;, the data becomes visible to userspace. If the buffer is full because userspace isn't reading fast enough, &lt;code&gt;reserve&lt;/code&gt; returns NULL and we silently skip that event — no blocking, no spinning. The eBPF verifier enforces this: our program &lt;em&gt;must&lt;/em&gt; terminate quickly, no exceptions.&lt;/p&gt;

&lt;p&gt;Before any of this runs, the kernel's eBPF verifier statically analyzes the program. It proves there are no infinite loops, no unsafe memory accesses, no calls to unapproved functions. If verification fails, the program doesn't load. This is why eBPF is safe to run in production — you literally cannot get dangerous code past the verifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Userspace Side
&lt;/h3&gt;

&lt;p&gt;The Go side reads from the ring buffer and turns raw kernel events into something useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ringbuf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;// Blocks until an event is available&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;monitorEvent&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RawSample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;symbolName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;findNearestSymbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reasonStr&lt;/span&gt;  &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dropReasons&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bufferedWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"[%s] Drop | PID: %-6d | Reason: %-18s | Function: %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"15:04:05"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reasonStr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbolName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;unsafe.Pointer&lt;/code&gt; cast deserves explanation. The ring buffer gives us raw bytes. We know the layout matches our C struct exactly - same fields, same order, same sizes. Rather than parsing the bytes manually (slow, error-prone), we reinterpret them directly as a Go struct. Zero allocation, zero copying. It's the only &lt;code&gt;unsafe&lt;/code&gt; usage in the codebase, and it's justified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symbol resolution&lt;/strong&gt; is where the interesting work happens. The kernel gave us a raw instruction pointer, something like &lt;code&gt;0xffffffff81a2b574&lt;/code&gt;. Meaningless to a human. To translate it, we load &lt;code&gt;/proc/kallsyms&lt;/code&gt; at startup, around 200,000 kernel symbols, sorted by address. Then for each event, we do a binary search to find the function that contains our address, calculate the offset, and produce output like &lt;code&gt;tcp_v4_syn_recv_sock+0x234&lt;/code&gt;. Now you know exactly which kernel function dropped the packet.&lt;/p&gt;

&lt;p&gt;The output is written through a 256KB &lt;code&gt;bufio.Writer&lt;/code&gt;. This matters more than it might seem, and it connects to something I discovered later.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Looks Like in Practice
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_recv_sock+0x234
[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_recv_sock+0x234
[15:04:23] Drop | PID: 5678 | Reason: NETFILTER_DROP      | Function: nf_hook_slow+0x12a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line tells you: when it happened, which process was in context, why the kernel dropped it, and exactly where in kernel code the decision was made.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on PID Accuracy
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;bpf_get_current_pid_tgid()&lt;/code&gt; returns the PID of whichever process the kernel is running when the drop occurs. For &lt;code&gt;TCP_LISTEN_OVERFLOW&lt;/code&gt;, this is typically the listening process, since the drop happens in its context. But for other drop types, particularly ones that occur during interrupt handling or in kernel threads, the PID might not correspond to the actual owner of the dropped packet.&lt;/p&gt;

&lt;p&gt;This is a fundamental limitation of the approach. The kernel doesn't always know which userspace process "owns" a packet at the point it gets dropped. For debugging specific issues like my listen queue overflow, the PID is accurate and useful. For a general-purpose production monitoring tool, you'd want to validate accuracy per drop type before relying on it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beyond the Code
&lt;/h2&gt;

&lt;p&gt;This project started as a way to fix a single bug, but it ended up being a masterclass in how much complexity lives just beneath our &lt;code&gt;main()&lt;/code&gt; functions.&lt;/p&gt;

&lt;p&gt;While you might reach for a platform like Cilium or Datadog for 24/7 production observability, there is something incredibly powerful about writing 30 lines of C that can peer into the heart of the kernel. It turns the "black box" of networking into a transparent stream of events.&lt;/p&gt;

&lt;p&gt;The source for this project is open on &lt;a href="https://github.com/PrachiJha-404/ebpf-tcp-monitor" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Presented at &lt;a href="https://hasgeek.com/bengalurusystemsmeetup" rel="noopener noreferrer"&gt;Bengaluru Systems Meetup&lt;/a&gt;, January 2026. Thanks to the organizers for the welcoming "just show up and talk about what you built" energy, it made all the difference.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>systems</category>
      <category>kernel</category>
      <category>programming</category>
      <category>go</category>
    </item>
    <item>
      <title>My Go Server was so fast it self-DDoS'd my laptop</title>
      <dc:creator>Prachi Jha</dc:creator>
      <pubDate>Sun, 28 Dec 2025 11:34:10 +0000</pubDate>
      <link>https://dev.to/prachi_awesome_jha/my-go-server-was-so-fast-it-self-ddosd-my-laptop-48i2</link>
      <guid>https://dev.to/prachi_awesome_jha/my-go-server-was-so-fast-it-self-ddosd-my-laptop-48i2</guid>
      <description>&lt;p&gt;In my fourth semester, my teammate and I built an auction server using sockets in Python. It was a final project for our Computer Networks course - functional enough to pass, inefficient enough to haunt me later.&lt;/p&gt;

&lt;p&gt;Fast forward to last month: while polishing my resume, I realized I needed a project that showed I could handle high-concurrency systems. So I revisited that old auction server and rewrote it in Go, aiming for a 2-3x performance improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/PrachiJha-404/High-Throughput-Auction-Server.git" rel="noopener noreferrer"&gt;Full source code available here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead, I ended up with an 80% failure rate at 1,000 concurrent users.&lt;br&gt;
Not because my code was broken. Because it was too fast. My Go implementation was so efficient it overwhelmed the Windows TCP stack and DDoS'd my own laptop.&lt;/p&gt;

&lt;p&gt;This is that story. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Python Baseline
&lt;/h3&gt;

&lt;p&gt;The original Python version used &lt;code&gt;selectors&lt;/code&gt; for I/O multiplexing. The standard stuff: the OS would wake up our event loop when clients sent bids, we'd process them, repeat. The architecture was clean and worked fine for the project demo.&lt;/p&gt;

&lt;p&gt;However, I later realized that it had limitations. Python's Global Interpreter Lock (GIL) meant only one thread could execute bytecode at a time, no matter how many cores were available. My 18-core laptop was essentially operating on one thread.&lt;/p&gt;

&lt;p&gt;Still, for a baseline test with 100 users sending 50 bids each (5,000 total requests), Python performed decently:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcum5f9lf74x9fs3xq96.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcum5f9lf74x9fs3xq96.jpeg" alt="Python performance with 100 users" width="569" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are some solid numbers! Time to see how I went about improving this performance using Go.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Go Rewrite
&lt;/h3&gt;

&lt;p&gt;Go's concurrency model is fundamentally different from Python's. Every client connection gets its own goroutine, a lightweight thread that costs about 2KB of memory. When a goroutine blocks waiting for network I/O, Go's scheduler parks it and moves on to other work. No spinning, no polling, no wasted cycles.&lt;/p&gt;

&lt;p&gt;I made three key architectural changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Binary protocol&lt;/strong&gt;&lt;br&gt;
I replaced string parsing with fixed-size headers. The server now knows exactly how many bytes to read for each message, eliminating guesswork and partial frame errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistence: Moving from local memory to Redis&lt;/strong&gt;&lt;br&gt;
I switched from using in-memory dictionaries to Redis with Lua scripts for atomic check-then-set operations. Now, every bid survives a server crash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buffered channels&lt;/strong&gt;&lt;br&gt;
I created a 5,000-slot channel buffer between the network layer and the bid processor. This decoupled "receiving data" from "processing data," allowing the system to handle traffic spikes without blocking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I ran the same 100-user test and expected around 300k RPM.&lt;/p&gt;

&lt;p&gt;I got 480,000 RPM. 100% success rate. &lt;em&gt;With&lt;/em&gt; the Redis overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszpxd27isze462y82yhz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszpxd27isze462y82yhz.jpeg" alt="Go performance with 100 users" width="450" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python was storing everything in local memory, with no external I/O beyond the client connections. In contrast, Go was making a network round-trip to Redis for each bid and still outperformed Python by 3.4x.&lt;/p&gt;
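&lt;p&gt;For reference, the buffered-channel decoupling from point 3 looks roughly like this. &lt;code&gt;Bid&lt;/code&gt; and &lt;code&gt;runAuction&lt;/code&gt; are illustrative names, not the repo's actual API.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// Bid is a minimal stand-in for the parsed message type.
type Bid struct {
	User, Amount int
}

// runAuction drains the channel in a single processor goroutine,
// so no locks are needed around the auction state.
func runAuction(bids <-chan Bid) int {
	highest := 0
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for b := range bids { // consumes at its own pace
			if b.Amount > highest {
				highest = b.Amount
			}
		}
	}()
	wg.Wait()
	return highest
}

func main() {
	// 5,000-slot buffer: a burst of incoming bids queues here instead of
	// blocking the network readers.
	bids := make(chan Bid, 5000)
	for i := 1; i <= 100; i++ { // stand-in for the network readers
		bids <- Bid{User: i, Amount: i * 10}
	}
	close(bids)
	fmt.Println(runAuction(bids)) // prints 1000
}
```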

&lt;h3&gt;
  
  
  The Self-DDoS
&lt;/h3&gt;

&lt;p&gt;I scaled the test to 1,000 users, each sending 50 bids. That's 50,000 total requests.&lt;/p&gt;

&lt;p&gt;Python struggled but stayed functional:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufsg9qpr3ecjpq9e3iut.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufsg9qpr3ecjpq9e3iut.jpeg" alt="Python performance with 1k users" width="566" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I ran Go with the exact same parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwvflsdnl4ubbpom4ksi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwvflsdnl4ubbpom4ksi.jpeg" alt="Go performance with 1k users" width="465" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The throughput was still higher than Python's, but 80% of the requests failed. I checked my code for race conditions and panics, but I couldn't find anything obviously broken.&lt;/p&gt;

&lt;p&gt;Then I ran &lt;code&gt;netstat -s -p tcp&lt;/code&gt; to check the TCP stats, which revealed the problem: 11,053 Failed Connection Attempts and 31,984 Segments Retransmitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswswxu9gb05ak6engytw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswswxu9gb05ak6engytw.jpeg" alt="netstat results" width="493" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My server hadn't crashed. The Windows TCP stack just couldn't keep up.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Root Cause
&lt;/h3&gt;

&lt;p&gt;Here's what happened:&lt;/p&gt;

&lt;p&gt;My Go benchmarker spawned 1,000 goroutines almost simultaneously and each tried to connect to the server immediately. That meant 1,000 SYN packets hit the kernel in a single burst.&lt;/p&gt;

&lt;p&gt;The kernel has a finite "waiting room" for new connections (the listen backlog queue). When that queue overflowed, it started silently dropping SYN packets.&lt;/p&gt;

&lt;p&gt;The clients, receiving no SYN-ACK response, assumed packet loss and retransmitted. This created a feedback loop: more retransmissions led to more congestion, which caused more drops and retransmissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classic TCP Congestion Collapse&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But then, &lt;strong&gt;why did Python succeed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python "succeeded" because its GIL accidentally rate-limited the connections. It was too slow to overwhelm the Operating System.&lt;/p&gt;

&lt;p&gt;Go exposed a bottleneck that Python never even reached.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;The solution wasn't in my application code. It lay in working within the limits of the operating system.&lt;/p&gt;

&lt;p&gt;I implemented connection pacing by introducing a small delay between spawning each client goroutine.&lt;/p&gt;

&lt;p&gt;1ms pacing: Success rate jumped from 21% to 82%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s6r0dt7c8icur9r4v6j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s6r0dt7c8icur9r4v6j.jpeg" alt="Go performance with 1k users and pacing of 1 ms" width="493" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2ms pacing: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4rkt8py78bzw93ggqp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4rkt8py78bzw93ggqp.jpeg" alt="Go performance with 1k users and 2 ms pacing" width="524" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3ms pacing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1078qbt114mdbsoa5ag.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1078qbt114mdbsoa5ag.jpeg" alt="Go performance with 1k users and 3 ms pacing" width="522" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result? &lt;strong&gt;100.00%&lt;/strong&gt; Success Rate at &lt;strong&gt;650,209 RPM&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That was great, but I thought, why stop here?&lt;/p&gt;

&lt;p&gt;I experimented with increasing the delay between bids from the same user (from 10ms to 20ms), hoping to give the system more breathing room. Instead, the success rate dropped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6pgbwdnw0dk4md9metu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6pgbwdnw0dk4md9metu.jpeg" alt="Go performance with 1k users, pacing of 1 ms, time between same user sending bids increased to 20 ms" width="496" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem? Holding 1,000 sockets open while they sat idle put massive pressure on the TCP window. Windows eventually timed them out to reclaim resources.&lt;/p&gt;

&lt;p&gt;The sweet spot turned out to be 3ms pacing between connections, 10ms between bids. That's where the OS and Go runtime finally synced up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The contrast is striking: Python stored all bid data in local memory with no external database or network hops beyond the client connection. Go made a Redis round-trip for every single bid.&lt;/p&gt;

&lt;p&gt;Python peaked at ~150k RPM with dropped requests. Go sustained 650k RPM with 100% reliability and full persistence.&lt;/p&gt;

&lt;p&gt;This wasn't about Go being "faster" in some abstract sense. It was about Go's runtime being designed to saturate modern hardware until it hits the next bottleneck, which in this case was the operating system itself.&lt;/p&gt;

&lt;p&gt;Python managed 100 users, not because it was exceptionally built, but because it was too slow to hit the limits of the OS that Go found at 1,000 users.&lt;/p&gt;

&lt;p&gt;My Go server was finally fast enough to find the one limit I couldn't code my way out of: the physical capacity of the Windows TCP stack.&lt;/p&gt;

&lt;p&gt;Moving from Python to Go was more than just changing syntax. It shifted how the application approached concurrency and I/O. By utilizing Go’s M:N scheduler and runtime netpoller instead of Python’s GIL-limited model, I was able to push the system to a point where the operating system became the bottleneck.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Core(TM) Ultra 5 125H (18 cores)&lt;/li&gt;
&lt;li&gt;OS: Windows 11 Version 24H2&lt;/li&gt;
&lt;li&gt;Redis: 7.x running locally on localhost&lt;/li&gt;
&lt;li&gt;Network: All connections over loopback (127.0.0.1)&lt;/li&gt;
&lt;li&gt;Go Version: 1.25.5&lt;/li&gt;
&lt;li&gt;Python Version: 3.12.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All tests were conducted on a single machine to eliminate network variability from application-level performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/PrachiJha-404/High-Throughput-Auction-Server" rel="noopener noreferrer"&gt;View full implementation and benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw benchmark data&lt;/strong&gt;: Available in &lt;code&gt;/benchmarks&lt;/code&gt; directory&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>godev</category>
      <category>programming</category>
      <category>database</category>
      <category>systems</category>
    </item>
  </channel>
</rss>
