<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Coroot</title>
    <description>The latest articles on DEV Community by Coroot (@coroot).</description>
    <link>https://dev.to/coroot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3034639%2Fa266f65f-6e11-46f8-b8e1-62b2834906de.png</url>
      <title>DEV Community: Coroot</title>
      <link>https://dev.to/coroot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coroot"/>
    <language>en</language>
    <item>
      <title>Zero-config Golang Heap Profiling</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Thu, 30 Apr 2026 21:28:59 +0000</pubDate>
      <link>https://dev.to/coroot/zero-config-golang-heap-profiling-33fi</link>
      <guid>https://dev.to/coroot/zero-config-golang-heap-profiling-33fi</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; is an Apache 2.0 open source platform that simplifies observability with no-code configuration. The &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;Coroot node-agent&lt;/a&gt; already collects CPU profiles for any process on the node &lt;a href="https://coroot.com/blog/troubleshooting-java-applications-with-coroot" rel="noopener noreferrer"&gt;using eBPF&lt;/a&gt;, with zero integration from the application side. For Java, we &lt;a href="https://coroot.com/blog/java-profiling-with-async-profiler" rel="noopener noreferrer"&gt;dynamically inject&lt;/a&gt; async-profiler into the JVM to get memory and lock profiles. But Go processes were still a blind spot for non-CPU profiling unless the app exposed a pprof endpoint and the cluster-agent scraped it.&lt;/p&gt;

&lt;p&gt;We wanted the same zero-config experience for Go heap profiles. This post is about how we got there.&lt;/p&gt;

&lt;h1&gt;
  
  
  The runtime already profiles
&lt;/h1&gt;

&lt;p&gt;Go's runtime has a built-in memory profiler. On every allocation, the runtime samples with probability &lt;code&gt;size / MemProfileRate&lt;/code&gt; and records the call stack. The default rate is &lt;code&gt;512 * 1024&lt;/code&gt;, or about 1 sample per 512KB allocated. Samples are aggregated into a linked list of "buckets", where each bucket represents a unique (stack trace, size class) combination and accumulates four counters: total allocations, total frees, bytes allocated, bytes freed.&lt;/p&gt;

&lt;p&gt;This is what &lt;code&gt;runtime.MemProfile()&lt;/code&gt; returns and what &lt;code&gt;go tool pprof http://.../debug/pprof/heap&lt;/code&gt; renders. The overhead is negligible and it's been production-grade since forever.&lt;/p&gt;

&lt;p&gt;There's one catch. The Go linker has an optimization: if no code in the binary references &lt;code&gt;runtime.MemProfile&lt;/code&gt;, it sets an internal &lt;code&gt;disableMemoryProfiling&lt;/code&gt; flag, and the runtime sets &lt;code&gt;MemProfileRate = 0&lt;/code&gt; on init. No samples, no buckets, nothing to read. A binary that doesn't import &lt;code&gt;runtime/pprof&lt;/code&gt; or &lt;code&gt;net/http/pprof&lt;/code&gt; (directly or transitively) has no heap profile available, even though the runtime fully supports it. We'll come back to this.&lt;/p&gt;

&lt;p&gt;That linked list of buckets is what &lt;code&gt;runtime.MemProfile()&lt;/code&gt; walks when &lt;code&gt;pprof&lt;/code&gt; asks for a heap profile. It's literally the global variable &lt;code&gt;runtime.mbuckets&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// runtime/mprof.go&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mbuckets&lt;/span&gt; &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnsafePointer&lt;/span&gt; &lt;span class="c"&gt;// *bucket, memory profile buckets&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the data is already there, being collected continuously, for free. The only question is how to read it from outside the process.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading process memory from outside
&lt;/h1&gt;

&lt;p&gt;Linux exposes every process's virtual address space via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt;. With the right permissions (our node-agent already has &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt;), you can &lt;code&gt;pread()&lt;/code&gt; arbitrary addresses. It's read-only, it doesn't suspend the process, the target doesn't even know you're there.&lt;/p&gt;

&lt;p&gt;The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Find the virtual address of &lt;code&gt;runtime.mbuckets&lt;/code&gt; in the Go binary's symbol table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read the pointer value at that address from &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Walk the linked list, reading each bucket's header, stack PCs, and memRecord.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert to pprof format and upload.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Finding runtime.mbuckets without loading the symbol table
&lt;/h1&gt;

&lt;p&gt;The first gotcha: Go binaries embed their own symbol table (pclntab) for runtime use, but &lt;code&gt;runtime.mbuckets&lt;/code&gt; is not a function. It's a variable, which lives in the ELF &lt;code&gt;.symtab&lt;/code&gt; section. On a stripped binary (&lt;code&gt;go build -ldflags="-s"&lt;/code&gt;), there's no &lt;code&gt;.symtab&lt;/code&gt; and we can't find the symbol. We skip those.&lt;/p&gt;

&lt;p&gt;On an unstripped binary, &lt;code&gt;.symtab&lt;/code&gt; can be huge. For &lt;code&gt;k3s&lt;/code&gt;, it's ~11MB. Using &lt;code&gt;debug/elf.File.Symbols()&lt;/code&gt; loads all of it into memory at once. For a node-agent that profiles dozens of Go processes, that's not OK.&lt;/p&gt;

&lt;p&gt;So we wrote a streaming scan that reads one &lt;code&gt;Elf64_Sym&lt;/code&gt; entry at a time and reads only the bytes we need from the string table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;findSymbolValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ef&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;elf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sectionName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;strtab&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Link&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;symReader&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// Elf64_Sym&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nameBuf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;symReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;nameIdx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ByteOrder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ByteOrder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strtab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameIdx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Peak memory: a 24-byte buffer plus a 17-byte buffer (&lt;code&gt;len("runtime.mbuckets")+1&lt;/code&gt;), regardless of binary size.&lt;/p&gt;

&lt;p&gt;Before doing this expensive scan, we also check whether the binary is Go at all via the &lt;code&gt;.go.buildinfo&lt;/code&gt; section: one section header lookup, zero allocations.&lt;/p&gt;

&lt;h1&gt;
  
  
  The bucket layout, and two traps
&lt;/h1&gt;

&lt;p&gt;The bucket struct itself is just a 48-byte header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;       &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotInHeap&lt;/span&gt;
    &lt;span class="n"&gt;next&lt;/span&gt;    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;
    &lt;span class="n"&gt;allnext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;
    &lt;span class="n"&gt;typ&lt;/span&gt;     &lt;span class="n"&gt;bucketType&lt;/span&gt;
    &lt;span class="n"&gt;hash&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
    &lt;span class="n"&gt;nstk&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the runtime allocates extra space after it and stores two more things in the same contiguous region: the stack trace (&lt;code&gt;nstk&lt;/code&gt; program counter addresses, 8 bytes each) and a &lt;code&gt;memRecord&lt;/code&gt; struct holding the alloc/free counters.&lt;/p&gt;

&lt;p&gt;So from our point of view, each bucket is a variable-sized blob: a 48-byte header + &lt;code&gt;nstk*8&lt;/code&gt; bytes of PCs + 128 bytes of &lt;code&gt;memRecord&lt;/code&gt;. We read the header first to get &lt;code&gt;nstk&lt;/code&gt;, then the rest.&lt;/p&gt;

&lt;p&gt;Two traps we fell into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 1:&lt;/strong&gt; the first field, &lt;code&gt;_ sys.NotInHeap&lt;/code&gt;, looks like 8 bytes of padding. It's zero bytes. Sizing the header at 56 bytes gave us nicely parsed garbage: valid-looking pointers that turned out to be hash values, and typ values in the quintillions. Go 1.17 through 1.19 used a &lt;code&gt;//go:notinheap&lt;/code&gt; comment directive instead; Go 1.20 switched to the &lt;a href="https://github.com/golang/go/commit/a719a78c1b36141af68d84970695fe95263fb896" rel="noopener noreferrer"&gt;typed marker&lt;/a&gt;, but the binary layout didn't change. The real header is 48 bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 2:&lt;/strong&gt; there are two pointer fields, &lt;code&gt;next&lt;/code&gt; and &lt;code&gt;allnext&lt;/code&gt;. They are not the same list: &lt;code&gt;next&lt;/code&gt; is the hash table chain within a size class, while &lt;code&gt;allnext&lt;/code&gt; is the global list of all memProfile buckets. We want &lt;code&gt;allnext&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The delta problem
&lt;/h1&gt;

&lt;p&gt;The counters in &lt;code&gt;memRecord&lt;/code&gt; are cumulative: they grow monotonically over the lifetime of the process. If we want an allocation rate, we need to compute the delta between two collection cycles.&lt;/p&gt;

&lt;p&gt;We keep a map per PID of &lt;code&gt;bucket address -&amp;gt; previous counters&lt;/code&gt; and subtract on each cycle to get the delta. We key by bucket address rather than stack hash: the Go runtime never frees mprof buckets, so the address is a stable unique identifier, and it's a single &lt;code&gt;uint64&lt;/code&gt; instead of a variable-length string, which avoids a huge amount of allocation churn in the hot path.&lt;/p&gt;

&lt;h1&gt;
  
  
  Too many syscalls
&lt;/h1&gt;

&lt;p&gt;Early profiles showed our collector spending 30-40% of its CPU in &lt;code&gt;syscall.Pread&lt;/code&gt;. Each bucket needs at least 2 reads: one for the header (to get &lt;code&gt;nstk&lt;/code&gt;), then one for the variable-length &lt;code&gt;stk[nstk] | memRecord&lt;/code&gt; block. With 1000+ buckets per process and a dozen Go processes on a node, that's thousands of syscalls per minute.&lt;/p&gt;

&lt;p&gt;We tried a read-ahead cache: on a miss, pull 256KB centered around the requested address. The idea was that Go's &lt;code&gt;persistentalloc&lt;/code&gt; places buckets in large arenas, so consecutive buckets in the &lt;code&gt;allnext&lt;/code&gt; chain might be physically close.&lt;/p&gt;

&lt;p&gt;We instrumented jump distances between consecutive buckets for one process with 1230 buckets: 40% of jumps were &amp;gt;16MB. Buckets are scattered across the entire process address space, not clustered.&lt;/p&gt;

&lt;p&gt;A 256KB cache hits ~20% of the time: better than nothing, but about the best we can do without multi-MB buffers that would cost more than they save.&lt;/p&gt;

&lt;h1&gt;
  
  
  The linker-disabled profiling problem
&lt;/h1&gt;

&lt;p&gt;After deploying, we saw some Go processes return an empty bucket list (the &lt;code&gt;runtime.mbuckets&lt;/code&gt; pointer was &lt;code&gt;0x0&lt;/code&gt;) even though they were clearly allocating memory (tens of MB RSS, actively running).&lt;/p&gt;

&lt;p&gt;Turns out the Go linker has an optimization: if no code in the binary references &lt;code&gt;runtime.MemProfile&lt;/code&gt;, it sets a &lt;code&gt;disableMemoryProfiling&lt;/code&gt; flag, and the runtime sets &lt;code&gt;MemProfileRate = 0&lt;/code&gt; on init. No &lt;code&gt;profilealloc()&lt;/code&gt; calls, no buckets ever created.&lt;/p&gt;

&lt;p&gt;This hits any Go binary that doesn't import &lt;code&gt;runtime/pprof&lt;/code&gt; or &lt;code&gt;net/http/pprof&lt;/code&gt;, directly or transitively. In our case it was a small load generator: no pprof, no HTTP server, no dependencies that would drag pprof in. The profile endpoint the runtime would serve is dead code, so the linker dropped it.&lt;/p&gt;

&lt;p&gt;The fix: we can write to &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt; too. If we detect &lt;code&gt;MemProfileRate == 0&lt;/code&gt;, we write 524288 (the default) back to the &lt;code&gt;runtime.MemProfileRate&lt;/code&gt; address. The runtime checks this variable on every allocation, so the change takes effect immediately: no restart, no signal, nothing. Just a single atomic 8-byte write to a known address in the data segment.&lt;/p&gt;

&lt;p&gt;This is gated behind a &lt;code&gt;--go-heap-profiler=force&lt;/code&gt; flag for users who want the "always on" behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;disabled&lt;/span&gt;  &lt;span class="c"&gt;# off
&lt;/span&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;enabled&lt;/span&gt;   &lt;span class="c"&gt;# default, passive only
&lt;/span&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;force&lt;/span&gt;     &lt;span class="c"&gt;# write MemProfileRate if zero
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overhead of re-enabling profiling is whatever the Go default overhead is: ~1 sample per 512KB. For any workload where this matters, you'd want it on anyway.&lt;/p&gt;

&lt;h1&gt;
  
  
  Allocation rate metrics
&lt;/h1&gt;

&lt;p&gt;Since we already compute per-bucket alloc deltas, exposing total allocation rate as Prometheus counters is free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;container_go_alloc_bytes_total&lt;/span&gt;    &lt;span class="c"&gt;# total bytes allocated&lt;/span&gt;
&lt;span class="n"&gt;container_go_alloc_objects_total&lt;/span&gt;  &lt;span class="c"&gt;# total objects allocated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Summed across all buckets in the process. Coroot uses them to draw the allocation rate chart alongside the flamegraph.&lt;/p&gt;

&lt;h1&gt;
  
  
  Limitations
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stripped binaries are skipped.&lt;/strong&gt; No &lt;code&gt;.symtab&lt;/code&gt;, no &lt;code&gt;runtime.mbuckets&lt;/code&gt; address, nothing we can do externally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;active&lt;/code&gt; cycle updates only on GC.&lt;/strong&gt; Between GCs, new allocations go into &lt;code&gt;future[0..2]&lt;/code&gt; and we don't see them. Same limitation as &lt;code&gt;runtime.MemProfile()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go-internal struct layout.&lt;/strong&gt; If the bucket struct changes in a future Go release, we'll need to update. The layout has been stable since Go 1.17, but there's no API guarantee.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Goroutine, block, and mutex profiles are not yet exposed.&lt;/strong&gt; Block and mutex use the same infrastructure (&lt;code&gt;bbuckets, xbuckets&lt;/code&gt;), but both are disabled by default and have real overhead if enabled (checks on every mutex/channel op), so we're not force-enabling them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  In Coroot
&lt;/h1&gt;

&lt;p&gt;Profiles are already in the Coroot UI. Every memory chart has a link to the heap flamegraph for that service, so you can jump from "memory is climbing" to "here's the call stack eating it" in one click.&lt;/p&gt;

&lt;p&gt;What's new is that profiles are now plugged into RCA. If Coroot sees a service's CPU or memory go up at the same time as an issue, it pulls up the profile and compares two windows: the one during the issue, and a healthy one from just before. The flamegraph you see in the RCA is a diff, not a snapshot. Functions that got hotter pop out, the rest fade away.&lt;/p&gt;

&lt;p&gt;So now RCA can give you a different kind of answer. Instead of "p95 is up, allocations are up", you get "this function is allocating twice as much as it was before the deploy." The metric tells you something is off. The diff tells you which code is off.&lt;/p&gt;

&lt;h1&gt;
  
  
  Chaos experiments
&lt;/h1&gt;

&lt;p&gt;To see this in action, we set up a small demo and broke it on purpose. There's a &lt;code&gt;product-catalog&lt;/code&gt; service backed by Postgres, sitting behind an &lt;code&gt;api-gateway&lt;/code&gt;. We bolted a chaos middleware onto &lt;code&gt;product-catalog&lt;/code&gt; so we can flip on different kinds of bad behavior with a single API call, then we watched what showed up in Coroot.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC pressure
&lt;/h1&gt;

&lt;p&gt;For the first experiment, we flipped on the &lt;code&gt;gc_pressure&lt;/code&gt; switch. That sends every request through a function called &lt;code&gt;inefficientEnrichProducts&lt;/code&gt;, which is exactly as bad as the name suggests. For each of 30 fake products in the request, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Marshals and unmarshals the product 10 times in a row.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Builds a "search index" by lowercasing, uppercasing and title-casing every word and generating every 2 to 4 character n-grams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Builds 20 nested "related products" maps, each with three sub-maps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marshals and unmarshals the whole result one more time "for caching".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's about 2 MB of throwaway memory per request. The service still answers, but the garbage collector barely gets a break.&lt;/p&gt;

&lt;p&gt;The pain shows up one hop away. &lt;code&gt;api-gateway&lt;/code&gt; talks to &lt;code&gt;product-catalog&lt;/code&gt; on every page render, so as soon as the switch flips, its p95 latency jumps from 0.16s to 3.76s:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpqca44lsscosvrg642.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpqca44lsscosvrg642.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coroot's RCA traces the spike back to product-catalog and pulls up its CPU profile:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq09huibv95z56812b4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq09huibv95z56812b4e.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the right side of the flamegraph. There's a fat column of &lt;code&gt;runtime.gcBgMarkWorker&lt;/code&gt;, &lt;code&gt;runtime.systemstack&lt;/code&gt;, &lt;code&gt;runtime.scanobject&lt;/code&gt;, &lt;code&gt;runtime.gcDrain&lt;/code&gt;. The garbage collector is burning real CPU. That's a clear sign the runtime is under allocation pressure, but the CPU profile can't tell you which line of your code is responsible for it.&lt;/p&gt;

&lt;p&gt;The heap profile can:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxzwv0fpluxcu23b4d4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxzwv0fpluxcu23b4d4z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There it is. &lt;code&gt;main.inefficientEnrichProducts&lt;/code&gt; sits at the top of &lt;code&gt;alloc_space&lt;/code&gt;, with the JSON encoders, map growth, and &lt;code&gt;bytes.Buffer&lt;/code&gt; operations stacked underneath. That's the exact set of things the function does inside its loop. Same function the CPU profile already flagged, but now you can see directly that it's the one driving the GC.&lt;/p&gt;

&lt;p&gt;Without the heap profile, you'd see the GC running hot and the JSON encoder eating CPU, and you'd still have to guess which call site to fix. With it, the guess is gone. Cache the marshalled output, drop the redundant rounds, or both, and the alloc band and the GC band shrink together on the next collection.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory leak
&lt;/h1&gt;

&lt;p&gt;For the second experiment, we flipped the &lt;code&gt;memory_leak&lt;/code&gt; switch. Now every request calls &lt;code&gt;appendToProductCache&lt;/code&gt;, which builds a small chunk of pointer-heavy data (a product map, a search index of fifty terms, cross-references to recent entries) and appends it to a global slice. Nothing ever evicts. The cache grows about 200 KB per request, forever.&lt;/p&gt;

&lt;p&gt;The symptom is the obvious one. &lt;code&gt;product-catalog&lt;/code&gt; memory just keeps climbing. After a few minutes, both replicas are growing at over 640% per hour and on track to OOM-kill themselves.&lt;/p&gt;

&lt;p&gt;What's interesting is what RCA does next. It pulls up the heap profile for the anomaly window and compares it against a healthy window from before the leak started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qqd9ovjc68t5xzc6ovl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qqd9ovjc68t5xzc6ovl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diff narrows it down to a single function. &lt;code&gt;main.appendToProductCache&lt;/code&gt; accounts for 99.6% of the in-use memory that wasn't there before, and the full call path from the HTTP entrypoint down to it sits right above the flamegraph. There's almost nothing left to investigate.&lt;/p&gt;

&lt;p&gt;A plain heap snapshot would have shown &lt;code&gt;appendToProductCache&lt;/code&gt; near the top too, but mixed in with everything else the service legitimately allocates. The diff drops the noise and keeps only what changed, which is exactly what you want when you're chasing a leak that started somewhere in the last hour.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Heap profiles for your Go services no longer require pprof endpoints, scraping configuration, or a deploy. Coroot picks them up automatically from whatever is already running on your nodes, with no code changes, no annotations, and no restart.&lt;/p&gt;

&lt;p&gt;The payoff shows up in incidents. A memory leak comes down to one function in a diff'd flamegraph. GC pressure stops being a vague "the runtime is busy" and becomes a specific call site. And you get this code-level accuracy without needing access to the code itself, which matters for SRE and platform teams running services they didn't write. Because the profiles sit right next to the metrics and the RCA that surfaced the issue, you go from "something is wrong" to "here is what to fix" without ever leaving the page.&lt;/p&gt;

&lt;p&gt;Want to try zero-config Go heap profiling on your setup? It's completely open source: visit our &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; to get set up quickly.&lt;/p&gt;

</description>
      <category>go</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Profiling Java apps: breaking things to prove it works</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 08 Apr 2026 21:51:04 +0000</pubDate>
      <link>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</link>
      <guid>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</guid>
      <description>&lt;p&gt;Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what.&lt;/p&gt;

&lt;p&gt;We wanted memory allocation and lock contention profiling. So we decided to add &lt;a href="https://github.com/async-profiler/async-profiler" rel="noopener noreferrer"&gt;async-profiler&lt;/a&gt; support to &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;coroot-node-agent&lt;/a&gt;. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why async-profiler
&lt;/h1&gt;

&lt;p&gt;We went with async-profiler. It's a native JVMTI agent used by pretty much everyone in the Java profiling space (Pyroscope, IntelliJ, Datadog). It can be loaded into a running JVM dynamically, supports CPU, allocation, and lock contention profiling in a single session, and works in unprivileged containers with no JVM flags. It outputs JFR format, which we parse using Grafana's &lt;a href="https://github.com/grafana/jfr-parser" rel="noopener noreferrer"&gt;jfr-parser&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  How we integrated it
&lt;/h1&gt;

&lt;p&gt;The integration follows the same pattern as our &lt;a href="https://coroot.com/blog/java-tls-instrumentation-with-ebpf" rel="noopener noreferrer"&gt;Java TLS agent&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;The node agent detects Java processes by checking whether the binary name ends with &lt;code&gt;java&lt;/code&gt;, then confirms it's a HotSpot JVM by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt;. It deploys &lt;code&gt;libasyncProfiler.so&lt;/code&gt; (~600KB) into the container's filesystem at &lt;code&gt;/tmp/coroot/&lt;/code&gt; and loads the library into the JVM via the Attach API. async-profiler starts with &lt;code&gt;event=itimer,interval=10ms,alloc,lock,jfr&lt;/code&gt;, capturing CPU, allocation, and lock events in a single session.&lt;/p&gt;
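&lt;p&gt;The detection step can be sketched in a few lines of Go. This is a simplification with a hypothetical helper name, not the agent's actual code:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// isHotSpotJVM reports whether a process looks like a HotSpot JVM:
// the executable name ends in "java" and its memory mappings
// (the contents of /proc/<pid>/maps) include libjvm.so.
func isHotSpotJVM(exePath, procMaps string) bool {
	if !strings.HasSuffix(exePath, "java") {
		return false
	}
	return strings.Contains(procMaps, "libjvm.so")
}

func main() {
	maps := "7f1c2000-7f1c3000 r-xp 00000000 08:01 42 /usr/lib/jvm/java-21/lib/server/libjvm.so"
	fmt.Println(isHotSpotJVM("/usr/bin/java", maps))
}
```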

&lt;p&gt;For data collection, every 60 seconds the agent sends a stop command (async-profiler finalizes the JFR file), reads the file from the host via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root/&lt;/code&gt;, and immediately sends a start command to begin a new recording.&lt;/p&gt;

&lt;p&gt;The gap between stop and start is ~4ms. We considered using &lt;code&gt;dump&lt;/code&gt; (which doesn't stop the profiler), but JFR output requires proper chunk finalization; a dump writes incomplete metadata that parsers reject. The stop/start approach guarantees valid output every time.&lt;/p&gt;
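&lt;p&gt;The stop/read/start cycle can be expressed as a small function. The callbacks here stand in for the actual attach-protocol commands and the &lt;code&gt;/proc&lt;/code&gt; read; this is an illustrative sketch, not the agent's real collection loop:&lt;/p&gt;

```go
package main

import "fmt"

// collectCycle runs one stop/read/start cycle: stop finalizes the
// current JFR chunk, read pulls the finished file, and start begins
// a new recording immediately, keeping the unprofiled gap small.
func collectCycle(stop, start func() error, read func() ([]byte, error)) ([]byte, error) {
	if err := stop(); err != nil {
		return nil, err
	}
	data, err := read()
	if err != nil {
		return nil, err
	}
	// Restart right away; only the few ms between stop and start
	// go unprofiled.
	return data, start()
}

func main() {
	data, err := collectCycle(
		func() error { return nil },                       // stop command
		func() error { return nil },                       // start command
		func() ([]byte, error) { return []byte("JFR"), nil }, // read finalized file
	)
	fmt.Println(string(data), err)
}
```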

&lt;p&gt;Each command goes through the JVM Attach protocol. It's one command per connection; HotSpot closes the socket after each response. After the first attach (which triggers the attach listener via SIGQUIT), subsequent connections just hit the existing Unix socket. Total overhead: ~2ms per command.&lt;/p&gt;
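&lt;p&gt;On the wire, a HotSpot attach request is just the protocol version, the command name, and three arguments, each null-terminated. A sketch of building that payload (a simplified illustration of the wire format, not Coroot's code):&lt;/p&gt;

```go
package main

import "fmt"

// attachRequest builds a HotSpot Attach API request: protocol
// version "1", the command, and exactly three arguments, each
// terminated by a null byte. Missing arguments are sent empty.
func attachRequest(cmd string, args ...string) []byte {
	var buf []byte
	write := func(s string) {
		buf = append(buf, s...)
		buf = append(buf, 0)
	}
	write("1") // protocol version
	write(cmd)
	for i := 0; i < 3; i++ { // HotSpot always expects three args
		if i < len(args) {
			write(args[i])
		} else {
			write("")
		}
	}
	return buf
}

func main() {
	// "load" dynamically loads a native agent library into the JVM.
	req := attachRequest("load", "/tmp/coroot/libasyncProfiler.so", "true", "start,event=itimer")
	fmt.Printf("%q\n", req)
}
```

The response is read back from the same Unix socket before HotSpot closes it.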

&lt;p&gt;If another tool (Pyroscope Java agent, Datadog, etc.) already loaded async-profiler into the JVM, we detect it by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt; and skip that process to avoid conflicts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enabling it
&lt;/h1&gt;

&lt;p&gt;Set the &lt;code&gt;ENABLE_JAVA_ASYNC_PROFILER=true&lt;/code&gt; environment variable on the node agent. In the Coroot custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_ASYNC_PROFILER&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No JVM flags, no application restarts, no agent JARs. The node agent handles everything automatically for all HotSpot JVMs it discovers. If you haven't enabled it yet, the JVM report shows a hint with a link to the docs.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;Once enabled, Coroot adds new charts to the JVM report: allocation rate (bytes/s and objects/s) and lock contention (contentions/s and delay). Each chart has a profile button that opens the corresponding flamegraph, so you can go from "allocation rate spiked" to "this function is allocating all the memory" in one click.&lt;/p&gt;

&lt;p&gt;We also export Prometheus metrics from the profiling data. These are monotonic counters accumulated from the parsed profiles, so &lt;code&gt;rate()&lt;/code&gt; gives you allocation rate and contention rate over time. We initially tried getting allocation metrics from &lt;code&gt;hsperfdata&lt;/code&gt; (&lt;code&gt;sun.gc.tlab.alloc&lt;/code&gt;), but those are per-GC-cycle snapshots that reset every collection. The async-profiler data is the real thing.&lt;/p&gt;
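&lt;p&gt;The key property is that the exported counter only ever goes up, unlike the per-GC-cycle snapshots. A minimal sketch of the accumulation (hypothetical type, not the agent's exporter):&lt;/p&gt;

```go
package main

import "fmt"

// AllocCounter accumulates allocated bytes from successive 60-second
// profiling sessions into a monotonic counter, so PromQL rate() over
// the exported series yields bytes/s. It never resets, unlike
// hsperfdata's per-GC-cycle values.
type AllocCounter struct{ total uint64 }

func (c *AllocCounter) Observe(sessionBytes uint64) {
	c.total += sessionBytes
}

func (c *AllocCounter) Value() uint64 { return c.total }

func main() {
	var c AllocCounter
	for _, b := range []uint64{1 << 30, 2 << 30, 1 << 29} { // bytes per session
		c.Observe(b)
	}
	fmt.Println(c.Value()) // cumulative bytes; rate() computes delta/time
}
```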

&lt;h1&gt;
  
  
  Seeing it in action
&lt;/h1&gt;

&lt;p&gt;Enough theory. Let's break something and see how the profiling data helps us find the root cause.&lt;/p&gt;

&lt;p&gt;We have a demo environment with several microservices. The one we'll focus on is order-service, a Spring Boot app running on JDK 21, backed by MySQL. It handles order creation, listing, and payment processing. Normal latency is under 10ms.&lt;br&gt;
The demo has a built-in chaos controller that lets us inject failures via a REST API. We'll use two scenarios: lock contention and memory allocation pressure.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lock contention
&lt;/h1&gt;

&lt;p&gt;For this scenario, the chaos controller spawns background threads that repeatedly acquire a shared lock and hold it for 5ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startLockContention&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;availableProcessors&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InterruptedException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;interrupt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-lock-holder-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, every incoming request also tries to acquire the same lock in the request interceptor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chaosConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isLockContention&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChaosController&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Request thread blocks here while holder threads occupy the lock&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling this scenario, we can immediately see the impact on the order-service SLIs. The latency heatmap shows a clear shift: requests that used to complete in under 10ms are now taking 100ms+, with some exceeding a second:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" alt=" " width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The request rate chart confirms the degradation: you can see the latency distribution shifting from green (fast) to red (slow):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at the JVM report. The lock contention chart shows a clear spike: the lock wait time jumps from near zero to significant values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's click the profile button on the lock contention chart to open the flamegraph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" alt=" " width="800" height="917"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Lock (delay) profile in comparison mode. Red means "more time waiting for locks than before." Reading from top to bottom, we can see the Spring request processing chain, and at the bottom of the flamegraph, our &lt;code&gt;ChaosInterceptor.preHandle&lt;/code&gt; method, the one that tries to acquire the shared lock. That's the bottleneck.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd know is "requests are slow." With the lock profile, we can point at the exact monitor and the exact code paths waiting for it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory allocation pressure
&lt;/h1&gt;

&lt;p&gt;The demo also supports a GC pressure scenario. It starts a background thread that continuously allocates and discards 256 MB byte arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startGcPressure&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;megabytes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;garbage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;megabytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;garbage&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// prevent dead-code elimination&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-gc-pressure"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JVM is configured with &lt;code&gt;-Xmx512m&lt;/code&gt;, so allocating 256 MB chunks means the GC runs after almost every allocation.&lt;/p&gt;

&lt;p&gt;After enabling this scenario, the JVM report tells the story. The allocation rate chart spikes from near zero to ~3 GB/s. GC time jumps in lockstep: the young collection pauses go from occasional to constant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's click the profile button on the allocation rate chart to see what is allocating all this memory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Memory (alloc_space) profile in comparison mode. At the bottom we can see &lt;code&gt;ChaosController$$Lambda.run&lt;/code&gt; and &lt;code&gt;startGcPressure&lt;/code&gt; marked as +100%: they didn't exist in the baseline period. The top-level &lt;code&gt;Thread.run&lt;/code&gt; frames confirm this is a background thread, not request processing.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd see is GC time going up. With the allocation profile, we know exactly which code is responsible.&lt;/p&gt;

&lt;p&gt;Enable it with a single environment variable and you get flamegraphs, time-series metrics, and a direct link between "something changed" and "here's the code." It's open source; you can install it on your system from our &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>observability</category>
    </item>
    <item>
      <title>Making encrypted Java traffic observable with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:28:21 +0000</pubDate>
      <link>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</link>
      <guid>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot's&lt;/a&gt; open source node agent uses eBPF to capture network traffic at the kernel level. It hooks into syscalls like &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt;, reads the first bytes of each payload, and detects the protocol: HTTP, MySQL, PostgreSQL, Redis, Kafka, and others. This works for any language and any framework without touching application code.&lt;/p&gt;

&lt;p&gt;For encrypted traffic, we attach eBPF uprobes to TLS library functions like &lt;code&gt;SSL_write&lt;/code&gt; and &lt;code&gt;SSL_read&lt;/code&gt; in OpenSSL, &lt;code&gt;crypto/tls&lt;/code&gt; in Go, and &lt;code&gt;rustls&lt;/code&gt; in Rust. The uprobes fire before encryption or after decryption, so we see the plaintext.&lt;/p&gt;

&lt;p&gt;Java is different. And it has been a blind spot until now.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Java is special
&lt;/h1&gt;

&lt;p&gt;Java's TLS implementation (JSSE) is not a native shared library. It's Java code that runs inside the JVM. There are no exported symbols like &lt;code&gt;SSL_write&lt;/code&gt; that eBPF could attach to.&lt;/p&gt;

&lt;p&gt;So when a Java app connects to MySQL or PostgreSQL over TLS, or makes HTTPS calls, eBPF tools cannot see the plaintext. All they see at the syscall level is encrypted data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our approach
&lt;/h1&gt;

&lt;p&gt;We solved this by combining a lightweight Java agent with a tiny native library that serves as an eBPF uprobe target.&lt;/p&gt;

&lt;p&gt;We dynamically load the agent into running JVMs using the attach API (the same mechanism profilers and debuggers use). The agent hooks &lt;code&gt;SSLSocketImpl$AppOutputStream.write&lt;/code&gt; and &lt;code&gt;SSLSocketImpl$AppInputStream.read&lt;/code&gt;, the internal JSSE classes where plaintext enters and leaves the TLS layer.&lt;/p&gt;

&lt;p&gt;When the application does an SSL write, our hook copies the first 1KB of plaintext into a thread-local native buffer and calls a stub native function. We copy to native memory because the pointer is stored and read later when the underlying &lt;code&gt;write()&lt;/code&gt; syscall fires. &lt;/p&gt;

&lt;p&gt;By that time our JNI call has already returned, and Java's GC could have moved the original byte array. We considered using &lt;code&gt;GetPrimitiveArrayCritical&lt;/code&gt; to pin the array in place and avoid the copy, but it blocks all garbage collectors while held, which is worse for the application than a small memcpy.&lt;/p&gt;

&lt;p&gt;For reads, we do the same after JSSE decrypts the data.&lt;/p&gt;

&lt;p&gt;The native stub function does nothing at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;coroot_java_tls_write_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;asm&lt;/span&gt; &lt;span class="k"&gt;volatile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;:::&lt;/span&gt; &lt;span class="s"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;asm volatile&lt;/code&gt; barrier prevents the compiler from optimizing it away. We attach eBPF uprobes to this function, so every call is captured with the buffer pointer and the payload size. From there, the data goes into our existing protocol detection pipeline, and HTTP, MySQL, PostgreSQL, Redis, Kafka, and other protocols are parsed automatically.&lt;/p&gt;

&lt;p&gt;The file descriptor (which connection the data belongs to) is discovered without any Java reflection. When JSSE writes, the sequence is always: encrypt, then &lt;code&gt;write(fd, ciphertext)&lt;/code&gt; syscall. Our eBPF code stores the plaintext pointer when the stub is called, then the syscall that follows on the same thread provides the file descriptor. This is the same &lt;code&gt;ssl_pending&lt;/code&gt; mechanism we use for OpenSSL.&lt;/p&gt;
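&lt;p&gt;The correlation works because the uprobe and the syscall fire on the same thread, in order. A userspace Go sketch of that pairing logic (the type and method names are hypothetical; the real logic lives in eBPF maps keyed by thread ID):&lt;/p&gt;

```go
package main

import "fmt"

// correlator pairs plaintext captured by the stub uprobe with the
// file descriptor of the write() syscall that follows on the same
// thread, mirroring the per-TID eBPF map described above.
type correlator struct {
	pending map[int][]byte // thread ID -> plaintext awaiting its fd
}

func newCorrelator() *correlator {
	return &correlator{pending: map[int][]byte{}}
}

// onStubCall fires when the JVM calls the native stub with plaintext.
func (c *correlator) onStubCall(tid int, plaintext []byte) {
	c.pending[tid] = plaintext
}

// onWriteSyscall fires on the subsequent write(fd, ciphertext) and
// returns the plaintext now attributed to that connection's fd.
func (c *correlator) onWriteSyscall(tid, fd int) (int, []byte, bool) {
	data, ok := c.pending[tid]
	if ok {
		delete(c.pending, tid) // consume: one stub call per syscall
	}
	return fd, data, ok
}

func main() {
	c := newCorrelator()
	c.onStubCall(42, []byte("GET /orders HTTP/1.1"))
	fd, data, ok := c.onWriteSyscall(42, 7)
	fmt.Println(fd, string(data), ok)
}
```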

&lt;p&gt;The native library is compiled with &lt;code&gt;-nostdlib&lt;/code&gt;, so it has no dependencies and works in any container.&lt;/p&gt;

&lt;p&gt;The nice thing about this design is that there is no transport between the JVM and the node agent. No unix sockets, no shared memory, no protocols to maintain. The Java agent just calls a native function, and eBPF picks up the data through uprobes and existing syscall tracepoints.&lt;/p&gt;

&lt;h1&gt;
  
  
  Safety
&lt;/h1&gt;

&lt;p&gt;Our agent modifies the bytecode of two JVM internal classes to insert our hooks. That sounds scary, but all we add is a single method call before each SSL write and after each SSL read. The original code stays exactly the same. Every inserted call is wrapped in a try-catch that catches &lt;code&gt;Throwable&lt;/code&gt;, so if our code fails for any reason, the error is silently ignored and the original SSL operation runs as if we were never there.&lt;/p&gt;

&lt;p&gt;We use ASM for the bytecode transformation. ByteBuddy would make the code shorter, but the agent JAR would grow from 130KB to over 8MB. Since we deploy the JAR into every container with a running JVM, keeping it small matters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benchmark
&lt;/h1&gt;

&lt;p&gt;We compared three scenarios on the same workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No instrumentation (baseline)&lt;/li&gt;
&lt;li&gt;eBPF with our Java TLS agent&lt;/li&gt;
&lt;li&gt;OpenTelemetry Java agent with traces exported to a collector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We included the OpenTelemetry comparison because it is the most common alternative for Java observability without code changes. The OTel agent auto-instruments HTTP clients, JDBC, and other libraries by rewriting bytecode at class load time.&lt;/p&gt;

&lt;p&gt;The test uses two machines to avoid resource contention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine 1&lt;/strong&gt; (8 vCPU): Java HTTP proxy making HTTPS calls + coroot-node-agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine 2&lt;/strong&gt; (8 vCPU): Go HTTPS server (5ms delay, ~1KB response) + wrk2 load generator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each scenario ran for 15 minutes at 1,000 requests per second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" alt=" " width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The baseline (no instrumentation) uses about 370m CPU cores. With our eBPF agent, CPU increases to about 426m, a 15% increase. The eBPF agent delivers the same throughput as the baseline.&lt;/p&gt;

&lt;p&gt;With the OpenTelemetry Java agent, CPU goes up to 511m, a 38% increase, and the application could only sustain about 800 of the 1,000 target requests per second, a 20% throughput drop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" alt=" " width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Limitations
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;JVM compatibility.&lt;/strong&gt; We support HotSpot-based JVMs: OpenJDK, Oracle JDK, Amazon Corretto, Azul Zulu, Eclipse Temurin. OpenJ9 and GraalVM native images are detected and skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSLSocket only.&lt;/strong&gt; We instrument SSLSocket (blocking I/O), which covers JDBC drivers, HttpsURLConnection, and most traditional Java HTTP clients. SSLEngine (used by Netty and async HTTP clients) is not yet supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic agent loading.&lt;/strong&gt; On Java 21+ the JVM prints a warning about dynamic agent loading being deprecated. JVMs with &lt;code&gt;-XX:+DisableAttachMechanism&lt;/code&gt; or &lt;code&gt;-XX:-EnableDynamicAgentLoading&lt;/code&gt; are detected and skipped.&lt;/p&gt;

&lt;h1&gt;
  
  
  Disabled by default
&lt;/h1&gt;

&lt;p&gt;This feature must be explicitly enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;coroot-node-agent &lt;span class="nt"&gt;--enable-java-tls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;ENABLE_JAVA_TLS&lt;/span&gt;=&lt;span class="n"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the Coroot Operator on Kubernetes, add it to the Coroot CR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_TLS&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading an agent into a running JVM without the user asking for it is not something we want to do by default. The agent is safe, but we think this should be the user's choice.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;With this feature enabled, Coroot automatically detects and parses protocols inside encrypted Java connections: HTTP, MySQL, PostgreSQL, Redis, Kafka, and everything else we support. No code changes, no SDKs, no sidecars. Enable the flag and encrypted Java traffic becomes visible. If you'd like to try it and improve observability in your own system, check out the project on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Instrumenting Rust TLS with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 18 Mar 2026 19:07:58 +0000</pubDate>
      <link>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</link>
      <guid>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</guid>
      <description>&lt;p&gt;eBPF collects telemetry directly from applications and infrastructure. One of the things it does is capture L7 traffic from TLS connections without any code changes, by hooking into TLS libraries and syscalls.&lt;/p&gt;

&lt;p&gt;Works great for OpenSSL. Works for Go.&lt;/p&gt;

&lt;p&gt;Then rustls enters the picture and everything stops being obvious. With OpenSSL, everything is nicely wrapped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSL_write(ssl, plaintext)
└─ write(fd, encrypted)

SSL_read(ssl, plaintext)
└─ read(fd, encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From eBPF’s point of view this is perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hook SSL_write, stash plaintext&lt;/li&gt;
&lt;li&gt;write() fires immediately → same thread → you know the FD&lt;/li&gt;
&lt;li&gt;same idea for reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything happens inside one call. Correlation is trivial.&lt;/p&gt;

&lt;h1&gt;
  
  
  Rustls does things differently
&lt;/h1&gt;

&lt;p&gt;Rustls doesn’t own the socket and never calls read or write itself. It works on buffers, and the application (or runtime) is responsible for actually moving bytes over the network.&lt;/p&gt;

&lt;p&gt;The API reflects that separation pretty clearly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// application writes plaintext into rustls&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls produces encrypted bytes and writes them via io::Write&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.write_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// application reads encrypted bytes and feeds them into rustls&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls decrypts and updates internal state&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// application reads decrypted data&lt;/span&gt;
&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So instead of one call doing everything, you get a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;plaintext is buffered first&lt;/li&gt;
&lt;li&gt;encryption happens later&lt;/li&gt;
&lt;li&gt;syscalls happen outside of rustls&lt;/li&gt;
&lt;li&gt;decryption happens before the app reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference for eBPF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writes: syscall happens after plaintext&lt;/li&gt;
&lt;li&gt;reads: syscall happens before plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the OpenSSL-style correlation only works in one direction.&lt;/p&gt;

&lt;h1&gt;
  
  
  Writes work as usual
&lt;/h1&gt;

&lt;p&gt;On the write side, nothing fundamentally new is needed. You hook &lt;code&gt;Writer::write&lt;/code&gt;, stash the plaintext, and correlate it with the following &lt;code&gt;sendto&lt;/code&gt;. The ordering is preserved, so the same approach as OpenSSL still applies here.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reads are inverted
&lt;/h1&gt;

&lt;p&gt;The read path is where things really break.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;recvfrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encrypted_buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;   &lt;span class="c1"&gt;// happens first&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;         &lt;span class="c1"&gt;// plaintext appears here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time we see plaintext, the syscall is already gone.&lt;/p&gt;

&lt;p&gt;So the logic has to be reversed. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see plaintext → wait for syscall”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;we do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see syscall → remember it → use it later”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on &lt;em&gt;recvfrom&lt;/em&gt; → stash FD per thread&lt;/li&gt;
&lt;li&gt;on &lt;em&gt;reader.read&lt;/em&gt; → pick up that FD and attach it to plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s basically reverse correlation. Not pretty, but it matches how rustls works.&lt;/p&gt;
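&lt;p&gt;As a sketch, the read-side correlation can be modeled in plain Go (a simplified userspace model of the per-thread stash the eBPF probes maintain; the function and map names are illustrative, not Coroot’s actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// perThreadFD models the eBPF map: the recvfrom exit probe stashes the FD
// per thread, and the Reader::read exit probe consumes it.
var perThreadFD = map[int]int{}

// onRecvfrom runs at the recvfrom() exit probe: remember which FD this
// thread just read encrypted bytes from.
func onRecvfrom(tid, fd int) {
	perThreadFD[tid] = fd
}

// onReaderRead runs at the Reader::read exit probe: the plaintext is now
// visible, so attach it to the FD stashed earlier on the same thread.
func onReaderRead(tid int, plaintext []byte) (int, bool) {
	fd, ok := perThreadFD[tid]
	if ok {
		delete(perThreadFD, tid) // consume the stash
		fmt.Printf("fd=%d plaintext=%q\n", fd, plaintext)
	}
	return fd, ok
}

func main() {
	onRecvfrom(42, 7)                // the syscall happens first
	onReaderRead(42, []byte("GET ")) // the plaintext appears later
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The real implementation lives in BPF maps keyed by thread ID, but the shape of the logic is the same: remember first, attribute later.&lt;/p&gt;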

&lt;h1&gt;
  
  
  When “ret=1” doesn’t mean 1 byte
&lt;/h1&gt;

&lt;p&gt;This one took longer than expected. We reused the OpenSSL-style exit probe:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ret = PT_REGS_RC(ctx)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The probe fired, but results were weird:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ret=1
ret=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which made no sense for a read. It turns out Rust returns a &lt;code&gt;Result&lt;/code&gt; like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rax → success or error flag&lt;/li&gt;
&lt;li&gt;rdx → actual number of bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we were reading rax and treating it as a size. Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ret=1 → actually an error&lt;/li&gt;
&lt;li&gt;ret=0 → success, but the size is somewhere else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was straightforward once understood:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_RC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// success&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// actual byte count&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classic case of “everything works, but the numbers are garbage”.&lt;/p&gt;

&lt;h1&gt;
  
  
  Finding rustls in binaries
&lt;/h1&gt;

&lt;p&gt;Rust symbols are heavily mangled:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It looks messy, but it encodes the full path: &lt;code&gt;rustls::conn::Writer&lt;/code&gt; implementing &lt;code&gt;std::io::Write::write&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The tricky part is that mangling isn’t stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;different compiler versions use different schemes (legacy vs v0)&lt;/li&gt;
&lt;li&gt;optimizations and stripping can change what’s left in the binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So matching exact names is fragile.&lt;/p&gt;

&lt;p&gt;Instead, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check the ELF &lt;code&gt;.comment&lt;/code&gt; section for &lt;code&gt;rustc&lt;/code&gt; to detect that the binary was built with Rust&lt;/li&gt;
&lt;li&gt;then scan symbols for patterns like “rustls”, “Writer” + “write”, “Reader” + “read”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not perfect, but reliable enough in practice.&lt;/p&gt;
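&lt;p&gt;Under those assumptions, the detection heuristic looks roughly like this in Go (a sketch using the standard library’s &lt;code&gt;debug/elf&lt;/code&gt;; the helper names are illustrative, not the node-agent’s actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"debug/elf"
	"fmt"
	"strings"
)

// looksLikeRustlsHook matches mangled Rust symbols against the patterns we
// care about. Exact names vary across compiler versions and mangling
// schemes, so we match substrings instead of full symbol names.
func looksLikeRustlsHook(sym string) bool {
	if !strings.Contains(sym, "rustls") {
		return false
	}
	if strings.Contains(sym, "Writer") {
		return strings.Contains(sym, "write")
	}
	if strings.Contains(sym, "Reader") {
		return strings.Contains(sym, "read")
	}
	return false
}

// builtWithRust checks the ELF .comment section for a "rustc" marker.
func builtWithRust(path string) (bool, error) {
	f, err := elf.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	sec := f.Section(".comment")
	if sec == nil {
		return false, nil
	}
	data, err := sec.Data()
	if err != nil {
		return false, err
	}
	return strings.Contains(string(data), "rustc"), nil
}

func main() {
	sym := "_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE"
	fmt.Println(looksLikeRustlsHook(sym)) // true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;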

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Coroot is an open source observability tool that uses eBPF to simplify setup. Because we instrument rustls at the library level, not the frameworks, this works across most Rust clients that use rustls under the hood.&lt;/p&gt;

&lt;p&gt;That includes HTTP stacks like hyper when paired with rustls (hyper-rustls, and frameworks like axum or warp when configured with rustls), database clients like sqlx when using its rustls TLS feature, and any async Rust service using tokio-rustls.&lt;/p&gt;

&lt;p&gt;No code changes, no SDKs, no wrappers.&lt;/p&gt;

&lt;p&gt;For Rust apps using OpenSSL via native-tls or openssl, the existing OpenSSL instrumentation already works. rustls was the missing piece.&lt;/p&gt;

&lt;p&gt;Below is an example of a service talking to MySQL over TLS. Coroot shows the actual queries even though everything on the wire is encrypted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" alt=" " width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’d like to give our open source tool a try and simplify your own observability, you can check it out on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can also find this guide and other open source observability articles on &lt;a href="https://coroot.com/blog/instrumenting-rust-tls-with-ebpf/" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>opensource</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to make GPUs on Kubernetes Observable</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:15:52 +0000</pubDate>
      <link>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</link>
      <guid>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</guid>
      <description>&lt;p&gt;GPUs are everywhere powering LLM inference, model training, video processing, and more. Kubernetes is often where these workloads run. But using GPUs in Kubernetes isn’t as simple as using CPUs.&lt;/p&gt;

&lt;p&gt;You need the right setup. You need efficient scheduling. And most importantly you need visibility.&lt;/p&gt;

&lt;p&gt;This post walks through how to run GPU workloads on Kubernetes, how to virtualize them efficiently, and how to use open source tooling to monitor everything with zero instrumentation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running GPU Workloads on Kubernetes
&lt;/h1&gt;

&lt;p&gt;Running GPU workloads on Kubernetes is totally doable. But it takes a bit of setup.&lt;/p&gt;

&lt;p&gt;It starts with your nodes. Whether you’re running in the cloud or on bare metal, your cluster needs machines with physical GPUs. Most cloud providers support GPU-enabled node pools, and provisioning them is usually straightforward.&lt;/p&gt;

&lt;p&gt;Once the hardware is in place, the next step is software. For Kubernetes to schedule and run GPU workloads, it needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GPU drivers, installed on each node&lt;/li&gt;
&lt;li&gt;The NVIDIA container runtime, so containers can access the GPU&lt;/li&gt;
&lt;li&gt;The NVIDIA device plugin, so Kubernetes knows how to handle GPU resource requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install all of this manually. But it’s fiddly and error-prone. That’s where the NVIDIA GPU Operator comes in. It automates the whole setup: installing drivers, configuring the runtime, and deploying the device plugin. Once that’s done, your cluster is GPU-ready.&lt;/p&gt;

&lt;p&gt;After that, requesting a GPU is simple. Just add this to your pod spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    nvidia.com/gpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes will handle the rest: scheduling your pod onto a node with an available GPU and assigning it to the container.&lt;/p&gt;

&lt;p&gt;Of course, not every workload needs an entire GPU to itself. And that’s where GPU virtualization becomes really useful.&lt;/p&gt;

&lt;h1&gt;
  
  
  Virtualizing GPUs in Kubernetes
&lt;/h1&gt;

&lt;p&gt;By default, Kubernetes treats GPUs as exclusive resources. One pod per device. But in many real-world cases, that’s overkill. The GPU Operator supports two forms of GPU virtualization that let you safely share a GPU between workloads:&lt;/p&gt;

&lt;p&gt;Time-Slicing: Multiple containers take turns using the GPU in rapid bursts. It’s a great fit for bursty inference workloads, batch jobs, or anything that doesn’t require ultra-low latency.&lt;/p&gt;

&lt;p&gt;MIG (Multi-Instance GPU): Available on GPUs like the A100 and H100, MIG lets you partition a single physical GPU into several hardware-isolated instances. Each one behaves like its own dedicated GPU, with its own memory, cache, and compute cores.&lt;/p&gt;
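&lt;p&gt;As a rough example of what time-slicing configuration looks like, the device plugin reads a config like the one below (a sketch based on NVIDIA’s GPU Operator documentation; names like &lt;code&gt;time-slicing-config&lt;/code&gt; and the replica count are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a config like this referenced from the GPU Operator, a single device can back several pods that take turns on it.&lt;/p&gt;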

&lt;p&gt;Virtualization makes GPUs way more flexible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You stop wasting an entire GPU on a tiny batch job&lt;/li&gt;
&lt;li&gt;You get much better overall utilization&lt;/li&gt;
&lt;li&gt;You can safely share GPUs across apps without them stepping on each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you finally have a shot at balancing cost and performance. It’s a game-changer, but only if you can actually see what’s going on.&lt;/p&gt;

&lt;h1&gt;
  
  
  What observability looks like once GPUs are in play
&lt;/h1&gt;

&lt;p&gt;So, the cluster is set up, the workloads are running, and maybe you’ve even started virtualizing GPUs to get better efficiency. Now comes the tricky part – actually understanding what’s happening.&lt;/p&gt;

&lt;p&gt;From the infrastructure side, we want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many GPU-enabled nodes do we have right now?&lt;/li&gt;
&lt;li&gt;Which GPUs are actually doing work, and which are just burning budget?&lt;/li&gt;
&lt;li&gt;What’s the current GPU and memory utilization across the fleet?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And sure, if you’re in the cloud, temperature and power draw might feel like someone else’s problem. But it’s still good to know. Somewhere out there, your model is warming the planet one token at a time. Mother Nature says hi. 🌱&lt;/p&gt;

&lt;p&gt;From the application side, the questions change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which GPUs is this app actually using?&lt;/li&gt;
&lt;li&gt;How much compute and memory is it consuming?&lt;/li&gt;
&lt;li&gt;Is it sharing the GPU with something else?&lt;/li&gt;
&lt;li&gt;And if so who’s the noisy neighbor hogging all the resources?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just about curiosity. It’s about avoiding slowdowns, catching inefficiencies, and making smart scaling decisions. But here’s the catch: Kubernetes doesn’t tell you any of this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Out-of-the-box GPU observability
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; is an open source tool that uses eBPF to make any GPU-powered system observable with zero-configuration. It talks directly to the GPU using NVIDIA’s NVML library the same one behind nvidia-smi. This way you can see what’s happening on your GPUs with no guesswork.&lt;/p&gt;

&lt;p&gt;On startup, the agent looks for &lt;code&gt;libnvidia-ml.so&lt;/code&gt; in all the usual (and unusual) places, whether it’s installed by the GPU Operator, a package manager, or manually dropped in. If it finds the library, it loads it and starts gathering data.&lt;/p&gt;

&lt;p&gt;From there, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers all available GPUs on the node&lt;/li&gt;
&lt;li&gt;Collects real-time metrics: utilization, memory usage, temperature, power draw&lt;/li&gt;
&lt;li&gt;Tracks per-process usage using NVML’s process telemetry&lt;/li&gt;
&lt;li&gt;Maps each process back to its container and pod, using Coroot’s existing PID-to-container tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of “PID 12345 is using GPU 0,” you get “this container in this pod is using 78% of GPU-xxxx.”&lt;/p&gt;
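&lt;p&gt;That PID-to-container step boils down to reading the process’s cgroup path. Below is a minimal Go sketch (the regular expression and path layout reflect the common kubepods cgroup naming and are illustrative, not Coroot’s actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"fmt"
	"regexp"
)

// kubepodsRe pulls the pod UID and container ID out of a cgroup path in
// the common systemd-driver kubepods layout.
var kubepodsRe = regexp.MustCompile(
	`kubepods.*?pod([0-9a-f_-]{36}).*?([0-9a-f]{64})`)

// podFromCgroup maps a line from the process's /proc/PID/cgroup file to a
// pod UID and container ID, turning "PID 12345 is using GPU 0" into
// "this container in this pod is using GPU 0".
func podFromCgroup(cgroupPath string) (podUID, containerID string, ok bool) {
	m := kubepodsRe.FindStringSubmatch(cgroupPath)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// In a real agent this line comes from /proc/PID/cgroup.
	path := "0::/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-pod11111111_2222_3333_4444_555555555555.slice/" +
		"cri-containerd-" +
		"aaaaaaaaaaaaaaaa" + "aaaaaaaaaaaaaaaa" +
		"aaaaaaaaaaaaaaaa" + "aaaaaaaaaaaaaaaa" + ".scope"
	pod, ctr, ok := podFromCgroup(path)
	fmt.Println(pod, ctr, ok)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;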

&lt;p&gt;When it comes to virtualized GPUs, Coroot sees which containers are tied to which GPU UUIDs, even when multiple workloads are time-sharing or using MIG slices on the same physical device. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can see which apps are sharing the same GPU&lt;/li&gt;
&lt;li&gt;Understand how each one is using it&lt;/li&gt;
&lt;li&gt;Spot noisy neighbors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is automatic. Just install the agent and let Coroot do the rest.&lt;/p&gt;

&lt;p&gt;Once the Coroot agent discovers the GPUs and starts collecting data, all of it flows straight into the UI, ready to explore, with no dashboards to build or metrics to stitch together.&lt;/p&gt;

&lt;p&gt;Let’s walk through how this looks in practice.&lt;/p&gt;

&lt;h1&gt;
  
  
  Node-level GPU overview
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the node view, Coroot shows everything you’d want to know about the GPUs attached to a specific machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization over time&lt;/li&gt;
&lt;li&gt;GPU memory usage&lt;/li&gt;
&lt;li&gt;Top consumers of both compute and memory&lt;/li&gt;
&lt;li&gt;Temperature and power draw&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just “Hey, GPU usage went up.” You can see which containers are contributing to that load and whether the same GPU is being shared between apps.&lt;/p&gt;

&lt;h1&gt;
  
  
  App-level GPU breakdown
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" alt=" " width="800" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where things get real. If your app is slow, you can check GPU usage alongside CPU, memory, logs, traces, and everything else in one place. At the app level, Coroot shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much GPU time your app is using&lt;/li&gt;
&lt;li&gt;How much memory it’s burning&lt;/li&gt;
&lt;li&gt;Which containers are sharing the GPU, and how much each one is using&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;GPU workloads aren’t cheap. And they’re rarely simple. Whether you’re running a single inference service or managing dozens of shared GPUs, you need real visibility. &lt;/p&gt;

&lt;p&gt;Observability is the same: not cheap, rarely simple. But with open source tools, managing your infrastructure can be accessible to everyone, not just massive companies but also small businesses, startups, and homelab projects across the globe. &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Give it a try&lt;/a&gt; on your system, and share feedback to help make good observability available to everyone.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using AI for Troubleshooting: OpenAI vs DeepSeek</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 07 Jan 2026 18:52:22 +0000</pubDate>
      <link>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</link>
      <guid>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</guid>
      <description>&lt;p&gt;AI is now a go-to tool for everything from writing to coding. Modern LLMs are so powerful that, with the right prompt and a few adjustments, they can handle tasks almost effortlessly.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; (an open source project that simplifies observability) we’ve built AI (LLMs + ML) for root cause analysis. Pinpointing the exact root cause is tricky because there are so many possible failure scenarios. Many of our users aren’t experts in areas like databases, network performance, or specific runtimes, so we need to provide clear explanations and, ideally, guidance on how to fix the problem.&lt;/p&gt;

&lt;p&gt;In this post, I test the ability of different LLMs to diagnose a chaos experiment. Since laziness often drives progress, I skipped converting my telemetry data into a prompt and instead just uploaded a screenshot as input for the LLMs.&lt;/p&gt;

&lt;p&gt;The result shocked me.&lt;/p&gt;

&lt;h1&gt;
  
  
  OpenAI GPT-4o
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Here is my prompt; no other context was provided:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" alt=" " width="768" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the provided PostgreSQL monitoring dashboard, it appears that an &lt;code&gt;ALTER TABLE&lt;/code&gt; statement is causing performance issues, leading to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increased Query Latency:&lt;/strong&gt; The Postgres average query latency has spiked, especially for pg-cluster-2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O Load from ALTER TABLE:&lt;/strong&gt; The 'ALTER TABLE' statement (alter table articles alter column body set not null) is consuming significant I/O, as seen in the Queries by I/O time graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Locked Connections and Queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Postgres connections graph shows a sharp rise in locked connections.&lt;/li&gt;
&lt;li&gt;The Locked queries graph confirms that queries are blocked due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation.&lt;/li&gt;
&lt;li&gt;The Blocking queries graph indicates that the &lt;code&gt;ALTER TABLE&lt;/code&gt; query is holding a lock and blocking multiple other queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication Lag:&lt;/strong&gt; The Replication lag shows a spike, likely due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation causing a backlog of writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this Happening?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt; is a table-wide operation that requires a full table scan to verify that no existing row violates the new constraint. It locks the entire table and prevents writes/reads from proceeding until the operation completes. This is especially problematic in a high-load environment where frequent reads/writes occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Fix This?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the &lt;code&gt;ALTER TABLE&lt;/code&gt; Statement (if possible): If the system is in production and the lock is causing downtime, consider canceling the query:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%ALTER TABLE articles alter column body set not null%';&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart Blocked Queries:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the &lt;code&gt;ALTER TABLE&lt;/code&gt; is canceled, the system should return to normal operations. However, some queries might still be waiting, so check for any lingering locks:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since ALTER TABLE is expensive, consider these best practices:&lt;/p&gt;

&lt;p&gt;1. Convert to an Online-Friendly Approach&lt;/p&gt;

&lt;p&gt;Instead of &lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt;, use a safer approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a check constraint first (non-blocking):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Backfill missing values (if any).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate the constraint (still non-blocking):&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If no violations exist, alter the column:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;2. Use &lt;code&gt;pg_repack&lt;/code&gt; for Zero-Downtime Migrations&lt;/p&gt;

&lt;p&gt;pg_repack allows reorganizing tables without full locks:&lt;br&gt;
&lt;code&gt;pg_repack -t articles --dbname=mydatabase&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;3. Perform Schema Changes During Low-Traffic Periods&lt;/p&gt;

&lt;p&gt;If running in production, execute schema changes during maintenance windows to reduce impact.&lt;/p&gt;

&lt;p&gt;4. Partition Large Tables&lt;/p&gt;

&lt;p&gt;For frequently altered large tables, consider partitioning to isolate operations on smaller segments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ALTER TABLE operation is locking the articles table, causing significant query delays and replication lag. The best approach is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cancel the operation if causing downtime.&lt;/li&gt;
&lt;li&gt;Use an online-safe alternative (check constraint + validation).&lt;/li&gt;
&lt;li&gt;Schedule schema changes carefully.&lt;/li&gt;
&lt;li&gt;Use tools like pg_repack for minimal impact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;That’s super impressive and surprisingly accurate. What shocked me the most was the model’s ability to interpret the screenshot as input.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s exciting that an LLM can ‘understand’ our domain and generate deep summaries with actionable insights. However, relying on a single vendor didn’t seem like a robust solution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fortunately, DeepSeek has emerged as a strong alternative to OpenAI’s model, especially with the recent release of its R1 model. So, let’s repeat the experiment using DeepSeek-R1:&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  DeepSeek-R1
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data indicates a significant performance issue in pg-cluster-2, primarily caused by a long-running &lt;code&gt;ALTER TABLE&lt;/code&gt; command on the articles table. This operation is causing table locks, blocking other critical queries (e.g., &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt;), leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High query latency (10-15 seconds).&lt;/li&gt;
&lt;li&gt;A surge in active/locked connections.&lt;/li&gt;
&lt;li&gt;Query throughput drops.&lt;/li&gt;
&lt;li&gt;Replication lag in other clusters (minor).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ALTER TABLE&lt;/code&gt; statement (alter table articles alter column body set not ?) is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Holding an exclusive lock on the articles table.&lt;/li&gt;
&lt;li&gt;Blocking subsequent queries (e.g., &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Creating a chain of waiting queries, increasing latency and connection usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Immediate Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the ALTER TABLE command (if safe):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;-- Find the PID of the ALTER TABLE query  &lt;br&gt;
   SELECT pid, query FROM pg_stat_activity WHERE query LIKE '%alter column body%';  &lt;br&gt;
   -- Terminate it  &lt;br&gt;
   SELECT pg_terminate_backend(&amp;lt;PID&amp;gt;);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Caution: Ensure this won’t corrupt data (e.g., if the ALTER is part of a migration).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reschedule the DDL Operation:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run schema changes during low-traffic periods or use online DDL tools like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pg_repack&lt;/code&gt; (to rebuild tables without locks).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; (for index-related changes).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize the &lt;code&gt;ALTER TABLE&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the column is being set to NOT NULL, ensure it already contains no NULL values. Pre-run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE articles SET body = '' WHERE body IS NULL;  -- Example cleanup&lt;/code&gt;&lt;/p&gt;
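&lt;p&gt;A quick sketch of that pre-check and cleanup (assuming the &lt;code&gt;articles&lt;/code&gt; table has an &lt;code&gt;id&lt;/code&gt; primary key, as in the queries quoted above):&lt;/p&gt;

```sql
-- Count the rows that would make SET NOT NULL fail (and force a long scan)
SELECT count(*) FROM articles WHERE body IS NULL;

-- Backfill in small batches instead of one long-running UPDATE,
-- so each statement holds its row locks only briefly
UPDATE articles SET body = ''
WHERE id IN (SELECT id FROM articles WHERE body IS NULL LIMIT 1000);
```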

&lt;p&gt;&lt;strong&gt;Long-Term Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Lock Timeouts:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add a lock timeout to prevent indefinite blocking:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SET LOCAL lock_timeout = '5s';  &lt;br&gt;
   ALTER TABLE ...;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor and Tune Queries: Optimize the &lt;code&gt;SELECT id FROM articles WHERE created &amp;lt; ? …&lt;/code&gt; query with an index on created. Batch the &lt;code&gt;INSERT&lt;/code&gt; operations to reduce contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connection Pooling: Use a connection pooler (e.g., PgBouncer) to manage active connections and prevent exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replication Checks: Investigate replication lag in &lt;code&gt;pg-cluster-1&lt;/code&gt; and &lt;code&gt;pg-cluster-3&lt;/code&gt; to ensure failover readiness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
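&lt;p&gt;For the indexing suggestion above, a hedged sketch (the index name is hypothetical; note that &lt;code&gt;CONCURRENTLY&lt;/code&gt; cannot run inside a transaction block):&lt;/p&gt;

```sql
-- Speed up "SELECT id FROM articles WHERE created < ? ..." without blocking writes
CREATE INDEX CONCURRENTLY IF NOT EXISTS articles_created_idx
    ON articles (created);
```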

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The blocking &lt;code&gt;ALTER TABLE&lt;/code&gt; is the critical issue. Address it first by terminating or rescheduling the operation, then implement safeguards to avoid similar incidents. Optimize queries and use non-locking DDL tools for future schema changes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DeepSeek-R1 delivers an answer that is quite close to OpenAI’s. I’d say it is accurate enough to be a viable alternative to OpenAI’s models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Additionally, I tested other open-source models in my lab, but I was only able to run LLMs with up to 70 billion parameters. In this range, I didn’t find anything that came close to OpenAI or the larger DeepSeek models for our needs.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is AIOps finally here?
&lt;/h2&gt;

&lt;p&gt;Yes, but only if you have enough data describing your system’s behaviour. &lt;/p&gt;

&lt;p&gt;This wouldn’t have been possible with just CPU, memory, and disk usage alone. The real magic of AI in this case was its ability to understand the bigger picture, apply domain knowledge, and suggest how to fix the issue. &lt;/p&gt;

&lt;p&gt;This is where Coroot's AI-Powered Root Cause Analysis can provide a more complete picture, using eBPF + Machine Learning to catch context missing from LLMs alone. As part of our commitment to make observability simpler and accessible for &lt;em&gt;everyone&lt;/em&gt;, you can set it up for free with our open source version. &lt;a href="https://docs.coroot.com/ai/coroot-cloud" rel="noopener noreferrer"&gt;Details here.&lt;/a&gt; We hope it can help save you hours of digging through telemetry and make root cause analysis easier for your team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>deepseek</category>
      <category>devops</category>
    </item>
    <item>
      <title>Chaos testing a Postgres cluster managed by CloudNativePG</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 16 Dec 2025 16:57:20 +0000</pubDate>
      <link>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</link>
      <guid>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</guid>
      <description>&lt;p&gt;As more organizations move their databases to cloud-native environments, effectively managing and monitoring these systems becomes crucial. According to Coroot’s anonymous usage statistics, 64% of projects use PostgreSQL, making it the most popular RDBMS among our users, compared to 14% using MySQL. This is not surprising since it is also the most widely used open-source database worldwide. &lt;/p&gt;

&lt;p&gt;Kubernetes is more than a platform for running containerized applications. It also enables better management of databases by allowing automation of tasks like backups, high availability, and scaling through its operator framework. This provides a management experience similar to using a managed service like AWS RDS but without vendor lock-in and often at a lower cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cloudnative-pg/cloudnative-pg" rel="noopener noreferrer"&gt;CloudNativePg&lt;/a&gt; is an open-source operator originally created by EDB, the oldest and the biggest Postgres vendor world-wide. As other operators, CNPG helps manage PostgreSQL databases on Kubernetes, covering the entire operational lifecycle from initial deployment to ongoing maintenance. Worth to mention that this is the youngest Postgres operator on the market, but its open source traction grows rapidly and based on my observations it’s the favorite operator across Reddit users.&lt;/p&gt;

&lt;p&gt;In this post I’ll install a CNPG cluster in my lab, instrument it with Coroot Community (&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;open source&lt;/a&gt;), then generate some load and introduce some failures to ensure high availability and observability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the cluster
&lt;/h1&gt;

&lt;p&gt;Installing the CloudNativePG operator is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy a cluster, create a Kubernetes custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 30Gi   
  postgresql:
    shared_preload_libraries: [pg_stat_statements]
    parameters:
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
  managed:
    roles:
    - name: coroot
      ensure: present
      login: true
      connectionLimit: 2
      inRoles:
      - pg_monitor
      passwordSecret:
        name: pg-cluster
---
apiVersion: v1
data:
  username: ******==
  password: *********==
kind: Secret
metadata:
  name: pg-cluster
type: kubernetes.io/basic-auth

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;In this post, I’ll be using the open source Community Edition of Coroot. Here are the commands to install the Coroot Operator for Kubernetes along with all Coroot components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add coroot https://coroot.github.io/helm-charts
helm repo update coroot
helm install -n coroot --create-namespace coroot-operator coroot/coroot-operator
helm install -n coroot coroot coroot/coroot-ce

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access Coroot, I’m forwarding the Coroot UI port to my local machine. For production deployments the operator can create an Ingress.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward -n coroot service/coroot-coroot 8083:8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the UI, we can see two applications: the operator (cnpg-cloudnative-pg) and our Postgres cluster (pg-cluster). Coroot has also identified that pg-cluster is a Postgres database and suggests integrating Postgres monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes approach to monitoring databases typically involves running metric exporters as sidecar containers within database instance Pods. However, this method can be challenging for certain use cases. For example, CNPG doesn’t support running custom sidecar containers, and their &lt;a href="https://github.com/cloudnative-pg/cnpg-i" rel="noopener noreferrer"&gt;CNPG-i&lt;/a&gt; capability requires specific plugin support and is still in the experimental stage.&lt;/p&gt;

&lt;p&gt;To address these limitations, Coroot has a dedicated coroot-cluster-agent that can discover and gather metrics from databases without requiring a separate container for each database instance. To configure this integration, simply use the credentials of the database role already created for Coroot. Click on “Postgres” in the Coroot UI and then on the “Configure” button.&lt;/p&gt;

&lt;p&gt;Next, provide the credentials configured for Coroot in the cluster specification. Coroot’s cluster-agent will then collect Postgres metrics from each instance in the cluster. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It feels a bit dull without any load or issues. Let’s add an application that interacts with this database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I deployed a simple application called “app” that executes approximately 600 queries per second: 300 on the primary and 300 across both replicas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I believe that any observability solution must be tested on failures to ensure that if some problem occurs, we will be able to quickly identify the root cause. So, let’s introduce some failures.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #1: CPU noisy neighbor
&lt;/h1&gt;

&lt;p&gt;In shared infrastructures like Kubernetes clusters, applications often compete for resources. Let’s simulate a scenario with a noisy neighbor, where a CPU-intensive application runs on the same node as our database instance. The following Job will create a Pod with stress-ng on node100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-stress
spec:
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      nodeSelector:
        kubernetes.io/hostname: node100
      containers:
        - name: stress-ng
          image: debian:bullseye-slim
          command:
            - "/bin/sh"
            - "-c"
            - |
              apt-get update &amp;amp;&amp;amp; 
              apt-get install -y stress-ng &amp;amp;&amp;amp; 
              stress-ng --cpu 0 --timeout 300s
      restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, our “noisy neighbor” has affected Postgres performance. Now, let’s assume we don’t know the root cause and use Coroot to identify the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" alt=" " width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the CPU Delay chart, we can observe that pg-cluster-2 is experiencing a CPU time shortage. Why? Because node100 is overloaded. And why is that? The cpu-stress application has consumed all available CPU time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #2: Postgres Locks
&lt;/h1&gt;

&lt;p&gt;Now, let’s explore a Postgres-specific failure scenario. We’ll run a suboptimal schema migration on our articles table, which contains 10 million rows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those who aren’t deeply familiar with databases, this migration will lock the entire table to verify that all rows are not NULL. Since the table is relatively large, the migration can take some time to complete. During this period, queries from our app will be forced to wait until the lock is released.&lt;/p&gt;
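&lt;p&gt;For reference, on PostgreSQL 12+ the same migration can avoid holding an exclusive lock during the full-table scan; a sketch (the constraint name is arbitrary):&lt;/p&gt;

```sql
-- 1. Add the constraint without validating existing rows (brief lock only)
ALTER TABLE articles ADD CONSTRAINT articles_body_not_null
    CHECK (body IS NOT NULL) NOT VALID;

-- 2. Validate existing rows; this scans the table but doesn't block reads/writes
ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;

-- 3. SET NOT NULL now uses the validated constraint and skips the scan
ALTER TABLE articles ALTER COLUMN body SET NOT NULL;
ALTER TABLE articles DROP CONSTRAINT articles_body_not_null;
```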

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" alt=" " width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s interpret these charts together: The Postgres latency of pg-cluster-2 has significantly increased. Many SELECT and INSERT queries are locked by another query. Which one? The ALTER TABLE query. Why is this query taking so long to execute? Because it is performing I/O operations to verify that the body column in each row is not NULL.&lt;/p&gt;

&lt;p&gt;As you can see, having the right metrics was crucial in this scenario. For instance, simply knowing the number of Postgres locks wouldn’t help us identify the specific query holding the lock. &lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #3: primary Postgres instance failure
&lt;/h1&gt;

&lt;p&gt;Now, let’s see how CloudNativePG handles a primary instance failure. To simulate this failure, I’ll simply delete the Pod of the primary Postgres instance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete pod pg-cluster-2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(That's all for character count - to view the last experiment, &lt;a href="https://coroot.com/blog/engineering/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg/" rel="noopener noreferrer"&gt;visit our blog.&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>devops</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Monitoring a Docker Homelab with Open Source</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Mon, 15 Dec 2025 17:32:55 +0000</pubDate>
      <link>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</link>
      <guid>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog comes from Coroot contributor Arie van den Heuvel: engineer, a System and Application Management Specialist, and a valued member of our open source community. You can read more of Arie’s writing and support the resource articles he has contributed to open source &lt;a href="https://solipsistic-sysadmin.blogspot.com/" rel="noopener noreferrer"&gt;on his blog.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running a home server consisting of one or more nodes with some or all services in Docker, you may find yourself wanting to monitor your environment. Or even better, attain full observability.&lt;/p&gt;

&lt;p&gt;The frequent recommendation for this is a combination of Prometheus with Grafana. But this solution requires a lot of work to fully configure, in addition to work on one’s applications and setup for full visibility. Another possibility is to use the free tier of NewRelic, which has the advantage of remote insights on metrics and logs. Again, this requires additional work on containers or applications to have a more refined visibility of your services.&lt;/p&gt;

&lt;p&gt;For those not running Linux, an honourable mention as a solution would be &lt;a href="https://github.com/henrygd/beszel" rel="noopener noreferrer"&gt;Beszel.&lt;/a&gt; Beszel can be run as a local service or in Docker. It consists of a web front-end and an agent that can be used on multiple systems running Windows and macOS. Installation in Docker is easy. Once it’s running, Beszel provides insightful information with system metrics, Docker services, and even logs.&lt;/p&gt;

&lt;p&gt;My personal choice for monitoring a home server system is &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt;. In the following blog, I’ll detail how I used Coroot to set up observability for my homelab, which you can then adopt for your own setup.&lt;/p&gt;

&lt;h1&gt;
  
  
  Observability with Coroot
&lt;/h1&gt;

&lt;p&gt;In my current setup on a &lt;a href="https://rockylinux.org/" rel="noopener noreferrer"&gt;Rocky Linux&lt;/a&gt; 9.x system, Coroot runs alongside a &lt;a href="https://github.com/ClickHouse/ClickHouse" rel="noopener noreferrer"&gt;Clickhouse&lt;/a&gt; server that stores metrics, logs, traces and profiles, in addition to the &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;Coroot node-agent&lt;/a&gt; and &lt;a href="https://github.com/coroot/coroot-cluster-agent" rel="noopener noreferrer"&gt;Coroot cluster-agent.&lt;/a&gt; The Coroot node-agent automatically collects all service metrics and logs using eBPF, while the cluster-agent provides detailed information on databases like MySQL, Postgres or Redis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another advantage Coroot presents is the use of &lt;a href="https://coroot.com/ai" rel="noopener noreferrer"&gt;AI-powered Root Cause Analysis&lt;/a&gt;, which provides instantaneous and helpful insights for investigating incidents. With a &lt;a href="https://docs.coroot.com/ai/coroot-cloud/" rel="noopener noreferrer"&gt;Coroot Cloud account&lt;/a&gt;, you get ten helpful analyses for free each month. Even without AI, the data Coroot presents, along with standard alerts based on metric best practices, is pretty insightful and helps to make your setup even better.&lt;/p&gt;

&lt;p&gt;Coroot services run in Docker through a docker-compose file. In a normal Coroot setup Prometheus is used, but in this setup I have configured Clickhouse, &lt;a href="https://docs.coroot.com/configuration/clickhouse/" rel="noopener noreferrer"&gt;which is a supported alternative.&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Clickhouse as a Local Service
&lt;/h1&gt;

&lt;p&gt;I have Clickhouse running as a local service. This setup allows for better control and convenience when scaling down memory usage of Clickhouse, scaling down logging on disk and the database, and simplifies making changes to the data. The only downside to note is that this setup requires updating Clickhouse manually with &lt;code&gt;yum/dnf&lt;/code&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Installing Clickhouse
&lt;/h1&gt;

&lt;p&gt;Installing Clickhouse is easily achieved by adding the repo, installing Clickhouse, and making a few quick adjustments before starting it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo dnf install -y yum-utils
sudo dnf config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo dnf install -y clickhouse-server clickhouse-client

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before starting the service create file &lt;code&gt;/etc/clickhouse-server/config.d/z_log_disable.xml&lt;/code&gt; and insert the following content in the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;clickhouse&amp;gt;
&amp;lt;asynchronous_metric_log remove="1"/&amp;gt;
&amp;lt;metric_log remove="1"/&amp;gt;
&amp;lt;latency_log remove="1"/&amp;gt;
&amp;lt;query_thread_log remove="1" /&amp;gt;
&amp;lt;query_log remove="1" /&amp;gt;
&amp;lt;query_views_log remove="1" /&amp;gt;
&amp;lt;part_log remove="1"/&amp;gt;
&amp;lt;session_log remove="1"/&amp;gt;
&amp;lt;text_log remove="1" /&amp;gt;
&amp;lt;trace_log remove="1"/&amp;gt;
&amp;lt;crash_log remove="1"/&amp;gt;
&amp;lt;opentelemetry_span_log remove="1"/&amp;gt;
&amp;lt;zookeeper_log remove="1"/&amp;gt;
&amp;lt;/clickhouse&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this adjust cache sizes in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;mark_cache_size&amp;gt;268435456&amp;lt;/mark_cache_size&amp;gt;
&amp;lt;index_mark_cache_size&amp;gt;67108864&amp;lt;/index_mark_cache_size&amp;gt;
&amp;lt;uncompressed_cache_size&amp;gt;16777216&amp;lt;/uncompressed_cache_size&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust memory usage ratio in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;max_server_memory_usage_to_ram_ratio&amp;gt;0.75&amp;lt;/max_server_memory_usage_to_ram_ratio&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower the thread pool size in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- max_thread_pool_size&amp;gt;10000&amp;lt;/max_thread_pool_size&amp;gt; --&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And starting things up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo systemctl daemon-reload
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;First, check if your Linux system is using kernel 5.1 or later (although 4.2 is also supported). This installation is different from the &lt;a href="https://docs.coroot.com/?env=docker" rel="noopener noreferrer"&gt;original docker-compose file.&lt;/a&gt;&lt;/p&gt;
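&lt;p&gt;A small POSIX-shell sketch of that kernel check (the version comparison relies on GNU &lt;code&gt;sort -V&lt;/code&gt;; the 5.1 threshold follows the requirement above):&lt;/p&gt;

```shell
# Returns success if version $1 is at least version $2 (uses GNU sort -V)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Compare the running kernel (without the distro suffix) against 5.1
kernel="$(uname -r | cut -d- -f1)"
if version_ge "$kernel" 5.1; then
  echo "kernel $kernel is new enough for eBPF-based monitoring"
else
  echo "kernel $kernel is older than 5.1; check the agent requirements"
fi
```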

&lt;p&gt;Prometheus is not used in this setup, and Clickhouse runs as a local service. Another distinction is the retention of the data, which is normally set to seven days for traces, logs, profiles and metrics. Coroot also typically stores its own local cache for metrics for 30 days.&lt;/p&gt;

&lt;p&gt;In this setup, the data retention stored in Clickhouse is set to 14 days. With eighteen local and Docker services, the amount of data kept for all of this averages around 3GB on my system.&lt;br&gt;
Coroot, its node-agent, and its cluster-agent run as Docker services via a docker-compose file that you create locally. This is achieved by inserting the following content into a locally created &lt;code&gt;docker-compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
name: coroot

volumes:
  node_agent_data: {}
  cluster_agent_data: {}
  coroot_data: {}

services:
  coroot:
    restart: always
    image: ghcr.io/coroot/coroot${LICENSE_KEY:+-ee} # set 'coroot-ee' as the image if LICENSE_KEY is defined
    pull_policy: always
    user: root
    volumes:
      - coroot_data:/data
    ports:
      - 8080:8080
    command:
      - '--data-dir=/data'
      - '--bootstrap-refresh-interval=15s'
      - '--bootstrap-clickhouse-address=127.0.0.1:9000'
      - '--bootstrap-prometheus-url=http://127.0.0.1:9090'
      - '--global-prometheus-use-clickhouse'
      - '--global-prometheus-url=http://127.0.0.1:9090'
      - '--global-refresh-interval=15s'
      - '--cache-ttl=31d'
      - '--traces-ttl=21d'
      - '--logs-ttl=21d'
      - '--profiles-ttl=21d'
      - '--metrics-ttl=21d'
    environment:
      - LICENSE_KEY=${LICENSE_KEY:-}
      - GLOBAL_PROMETHEUS_USE_CLICKHOUSE
      - CLICKHOUSE_SPACE_MANAGER_USAGE_THRESHOLD=75 # Set cleanup threshold to 75%
      - CLICKHOUSE_SPACE_MANAGER_MIN_PARTITIONS=2 # Always keep at least 2 partitions
    network_mode: host

  node-agent:
    restart: always
    image: ghcr.io/coroot/coroot-node-agent
    pull_policy: always
    privileged: true
    pid: "host"
    volumes:
      - /sys/kernel/tracing:/sys/kernel/tracing
      - /sys/kernel/debug:/sys/kernel/debug
      - /sys/fs/cgroup:/host/sys/fs/cgroup
      - node_agent_data:/data
    command:
      - '--collector-endpoint=http://192.168.1.160:8080'
      - '--cgroupfs-root=/host/sys/fs/cgroup'
      - '--wal-dir=/data'

  cluster-agent:
    restart: always
    image: ghcr.io/coroot/coroot-cluster-agent
    pull_policy: always
    volumes:
      - cluster_agent_data:/data
    command:
      - '--coroot-url=http://192.168.1.160:8080'
      - '--metrics-scrape-interval=15s'
      - '--metrics-wal-dir=/data'
    depends_on:
      - coroot

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating this file and making any adjustments to your own liking and network preferences, run &lt;code&gt;docker compose up -d&lt;/code&gt; and go to your IP address on port 8080. Here you have access to Coroot and are now prompted to set admin credentials!&lt;/p&gt;

&lt;p&gt;In my setup, &lt;a href="https://github.com/containrrr/watchtower" rel="noopener noreferrer"&gt;Watchtower&lt;/a&gt; takes care of updating docker containers, which works great with Coroot.&lt;/p&gt;

&lt;p&gt;As a final sidenote: there are already some helpful hints and pointers present within Coroot for setting things up. In my case, there was information available that helped observe a Postgres database. Don’t forget to use the given commands as the admin/postgres user to make it work.&lt;/p&gt;

&lt;p&gt;Happy homelab observing! 🐧&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenTelemetry for Go: Measuring the Overhead</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:15:09 +0000</pubDate>
      <link>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</link>
      <guid>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</guid>
      <description>&lt;h1&gt;
  
  
  OpenTelemetry for Go: Measuring the Overhead
&lt;/h1&gt;

&lt;p&gt;Everything comes at a cost — and observability is no exception. When we add metrics, logging, or distributed tracing to our applications, it helps us understand what’s going on with performance and key UX metrics like success rate and latency. But what’s the cost?&lt;/p&gt;

&lt;p&gt;I’m not talking about the price of observability tools here, I mean the instrumentation overhead. If an application logs or traces everything it does, that’s bound to slow it down or at least increase resource consumption. Of course, that doesn’t mean we should give up on observability. But it does mean we should measure the overhead so we can make informed tradeoffs.&lt;/p&gt;

&lt;p&gt;In this post, I want to measure the overhead of using OpenTelemetry in a Go application. To do that, I’ll use a super simple Go HTTP server that increments a counter in Valkey (an in-memory database forked from Redis) on every request. The idea behind the benchmark is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we’ll run the app under load without any instrumentation and measure its performance and resource usage.&lt;/li&gt;
&lt;li&gt;Then, using the exact same workload, we’ll repeat the test with the OpenTelemetry SDK for Go enabled and compare the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Test setup
&lt;/h1&gt;

&lt;p&gt;For this benchmark, I’ll use four Linux nodes, each with 4 vCPUs and 8GB of RAM. One will run the application, another will host Valkey, a third will be used for the load generator, and the fourth for observability (using &lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Coroot Community Edition&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" alt=" " width="800" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I want to make sure the components involved in the test don’t interfere with each other, so I’m running them on separate nodes. This time, I’m not using Kubernetes; instead, I’ll run everything in plain Docker containers. I’m also using host network mode for all containers to avoid docker-proxy introducing any additional latency into the network path.&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at the application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "strconv"

    "github.com/go-redis/redis/extra/redisotel"
    "github.com/go-redis/redis/v8"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/trace"
)

var (
    rdb *redis.Client
)

func initTracing() {
    rdb.AddHook(redisotel.TracingHook{})
    client := otlptracehttp.NewClient()
    exporter, err := otlptrace.New(context.Background(), client)
    if err != nil {
        log.Fatal(err)
    }
    tracerProvider := trace.NewTracerProvider(trace.WithBatcher(exporter))
    otel.SetTracerProvider(tracerProvider)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
}

func handler(w http.ResponseWriter, r *http.Request) {
    cmd := rdb.Incr(r.Context(), "counter")
    if err := cmd.Err(); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    _, _ = w.Write([]byte(strconv.FormatInt(cmd.Val(), 10)))
}

func main() {
    rdb = redis.NewClient(&amp;amp;redis.Options{Addr: os.Getenv("REDIS_SERVER")})
    h := http.Handler(http.HandlerFunc(handler))
    if os.Getenv("ENABLE_OTEL") != "" {
        log.Println("enabling opentelemetry")
        initTracing()
        h = otelhttp.NewHandler(http.HandlerFunc(handler), "GET /")
    }
    log.Fatal(http.ListenAndServe(":8080", h))
} 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, the application runs without instrumentation. The OpenTelemetry SDK is initialized only if the ENABLE_OTEL environment variable is set, so runs without this variable will serve as the baseline for comparison.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running the Benchmark
&lt;/h1&gt;

&lt;p&gt;Now let’s start all the components and begin testing.&lt;/p&gt;

&lt;p&gt;First, we launch Valkey using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name valkey -d --net=host valkey/valkey

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we start the Go app and point it to the Valkey instance by IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app -e REDIS_SERVER="192.168.1.2:6379" --net=host failurepedia/redis-app:0.5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate load, I’ll use wrk2, which allows precise control over request rate. In this test, I’m setting it to 10,000 requests per second using 100 connections and 8 threads. Each run will last 20 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker run --rm --name load-generator -ti cylab/wrk2 \
   -t8 -c100 -d1200s -R10000 --u_latency http://192.168.1.3:8080/ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Let’s take a look at the results.&lt;/p&gt;

&lt;p&gt;We started by running the app without any instrumentation. This serves as our baseline for performance and resource usage. Based on metrics gathered by Coroot using eBPF, the app successfully handled 10,000 requests per second. The majority of requests were served in under 5 milliseconds: the 95th percentile (p95) latency was around 5ms, and the 99th percentile (p99) was about 10ms, with occasional spikes reaching up to 20ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" alt=" " width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage was steady at around 2 CPU cores (or 2 CPU seconds per second), and memory consumption stayed low at roughly 10 MB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So that’s our baseline. Now, let’s restart the app container with the OpenTelemetry SDK enabled and see how things change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app \
  -e REDIS_SERVER="192.168.1.2:6379" \
  -e ENABLE_OTEL=1 \
  -e OTEL_SERVICE_NAME="app" \
  -e OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://192.168.1.4:8080/v1/traces" \
  --net=host failurepedia/redis-app:0.5 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else stayed the same – the infrastructure, the workload, and the duration of the test.&lt;/p&gt;

&lt;p&gt;Now let’s break down what changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" alt=" " width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage increased from around 10 megabytes to somewhere between 15 and 18 megabytes. This additional overhead comes from the SDK and the buffers and background goroutines it uses to batch and export telemetry data. While there is a clear difference, it doesn’t look like a significant increase in absolute terms, especially for modern applications where memory budgets are typically much larger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage jumped from 2 cores to roughly 2.7 cores. That’s about a 35 percent increase. This is expected since the app is now tracing every request, preparing and exporting spans, and doing more work in the background.&lt;/p&gt;

&lt;p&gt;To understand exactly where this additional CPU usage was coming from, I used Coroot’s built-in eBPF-based CPU profiler to capture and compare profiles before and after enabling OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profiler showed that about 10 percent of total CPU time was spent in go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor, which handles span batching and export. Redis calls also got slightly more expensive — tracing added around 7 percent CPU overhead to go-redis operations. The rest of the increase came from instrumented HTTP handlers and middleware.&lt;/p&gt;

&lt;p&gt;In short, the overhead comes from OpenTelemetry’s span processing pipeline, not from the app’s core logic.&lt;/p&gt;
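&lt;p&gt;If span batching and export dominate your profiles, the batch processor can be tuned. Here is a sketch using the OpenTelemetry Go SDK’s documented options; the values shown are the SDK’s defaults, so treat them as starting points to adjust rather than recommendations:&lt;/p&gt;

```go
package tracing

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTunedTracing mirrors initTracing from the app above, but makes the
// batch span processor's knobs explicit instead of relying on defaults.
func initTunedTracing() {
	exporter, err := otlptrace.New(context.Background(), otlptracehttp.NewClient())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			// Larger batches amortize per-export overhead
			// (serialization, HTTP requests) across more spans.
			sdktrace.WithMaxExportBatchSize(512),
			// How long to wait before flushing a partially filled batch.
			sdktrace.WithBatchTimeout(5*time.Second),
			// Spans are dropped (never blocked on) once this queue fills up.
			sdktrace.WithMaxQueueSize(2048),
		),
	)
	otel.SetTracerProvider(tp)
}
```

&lt;p&gt;Raising the batch size and timeout reduces export frequency at the cost of telemetry freshness and a larger in-memory buffer.&lt;/p&gt;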

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latency also changed, though not dramatically. With OpenTelemetry enabled, more requests fell into the 5 to 10 millisecond range. The 99th percentile latency went from 10 to about 15 milliseconds. Throughput remained stable at around 10,000 requests per second. We didn’t see any errors or timeouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network traffic also increased. With tracing enabled, the app started exporting telemetry data to Coroot, which resulted in an outbound traffic volume of about 4 megabytes per second, or roughly 32 megabits per second. For high-throughput services or environments with strict network constraints, this is something to keep in mind when enabling full request-level tracing.&lt;/p&gt;
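&lt;p&gt;If that export volume is a concern, head-based sampling is the usual lever: trace only a fraction of requests instead of all of them. A minimal sketch using the SDK’s built-in samplers (the 10% ratio is an arbitrary example, not a recommendation):&lt;/p&gt;

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider records roughly 10% of root traces. ParentBased makes
// child spans follow their parent's decision, so sampled traces stay complete.
func newSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)),
		),
	)
}
```

&lt;p&gt;The SDK can also pick this up without code changes via the standard OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 environment variables.&lt;/p&gt;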

&lt;p&gt;Overall, enabling OpenTelemetry introduced a noticeable but controlled overhead. These numbers aren’t negligible, especially at scale — but they’re also not a dealbreaker. For most teams, the visibility gained through distributed tracing and the ability to troubleshoot issues faster will justify the tradeoff.&lt;/p&gt;
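&lt;p&gt;Putting the measured figures in per-request terms makes the tradeoff concrete. A quick back-of-envelope check, using only the numbers from this benchmark:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	const (
		rps        = 10000.0 // sustained request rate
		baseCPU    = 2.0     // CPU cores without instrumentation
		otelCPU    = 2.7     // CPU cores with the SDK enabled
		egressMBps = 4.0     // telemetry export traffic, MB/s
	)

	// Relative CPU increase: (2.7 - 2.0) / 2.0 = 35%.
	cpuIncrease := (otelCPU - baseCPU) / baseCPU * 100

	// Extra CPU time per request: 0.7 cores / 10,000 rps = 70 microseconds.
	cpuPerReqMicros := (otelCPU - baseCPU) / rps * 1e6

	// Telemetry egress per request: 4 MB/s / 10,000 rps = 400 bytes.
	bytesPerReq := egressMBps * 1e6 / rps

	// 4 MB/s * 8 bits/byte = 32 Mbit/s.
	egressMbitps := egressMBps * 8

	fmt.Printf("CPU increase: %.0f%%\n", cpuIncrease)
	fmt.Printf("extra CPU per request: %.0f us\n", cpuPerReqMicros)
	fmt.Printf("telemetry per request: %.0f bytes\n", bytesPerReq)
	fmt.Printf("egress: %.0f Mbit/s\n", egressMbitps)
}
```

&lt;p&gt;So every traced request costs roughly 70 microseconds of extra CPU time and about 400 bytes of egress: cheap per request, but it adds up at 10,000 requests per second.&lt;/p&gt;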

&lt;h1&gt;
  
  
  eBPF-based instrumentation
&lt;/h1&gt;

&lt;p&gt;I often hear from engineers, especially in ad tech and other high-throughput environments, that they simply can’t afford the overhead of distributed tracing. At the same time, observability is absolutely critical for them. This is exactly the kind of scenario where eBPF-based instrumentation fits well. &lt;/p&gt;

&lt;p&gt;Instead of modifying application code or adding SDKs, an agent can observe application behavior at the kernel level using eBPF. Coroot’s agent supports this approach and is capable of collecting both metrics and traces using eBPF, without requiring any changes to the application itself.&lt;/p&gt;

&lt;p&gt;However, in high-load environments like the one used in this benchmark, we generally recommend disabling eBPF-based tracing and working with metrics only. Metrics still allow us to clearly see how services interact with each other, without storing data about every single request. They’re also much more efficient in terms of storage and runtime overhead.&lt;/p&gt;

&lt;p&gt;Throughout both runs of our test, Coroot’s agent was running on each node. Here’s what its CPU usage looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node201 was running Valkey, node203 was running the app, and node204 was the load generator. As the chart shows, even under consistent load, the agent’s CPU usage stayed under 0.3 cores. That makes it lightweight enough for production use, especially when working in metrics-only mode.&lt;/p&gt;

&lt;p&gt;This approach offers a practical balance: good visibility with minimal cost.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Observability comes at a cost, but as this experiment shows, that cost depends heavily on how you choose to implement it.&lt;/p&gt;

&lt;p&gt;OpenTelemetry SDKs provide detailed traces and deep visibility, but they also introduce measurable overhead in terms of CPU, memory, and network traffic. For many teams, especially when fast incident resolution is a priority, that tradeoff is entirely justified.&lt;/p&gt;

&lt;p&gt;At the same time, eBPF-based instrumentation offers a more lightweight option. It allows you to collect meaningful metrics without modifying application code and keeps resource usage minimal, especially when tracing is disabled and only metrics are collected.&lt;/p&gt;

&lt;p&gt;The right choice depends on your goals. If you need full traceability and detailed diagnostics, SDK-based tracing is a strong option. If your priority is low overhead and broad system visibility, eBPF-based metrics might be the better fit.&lt;/p&gt;

&lt;p&gt;Observability isn’t free, but with the right approach, it can be both effective and efficient.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Coroot – eBPF-based, open source observability with actionable insights</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 09 Apr 2025 16:29:14 +0000</pubDate>
      <link>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</link>
      <guid>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</guid>
      <description>&lt;p&gt;A common open source approach to observability will begin with databases and visualizations for telemetry - Grafana, Prometheus, Jaeger. But observability doesn’t begin and end here: these tools require configuration, dashboard customization, and may not actually pinpoint the data you need to mitigate system risks.&lt;/p&gt;

&lt;p&gt;Coroot was designed to solve the problem of manual, time-consuming observability analysis: it handles the full observability journey — from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can benefit from, which is why our software is open source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Features:&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost monitoring to track and minimise your cloud expenses (AWS, GCP, Azure).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLO tracking with alerts to detect anomalies and compare them to your system’s baseline behaviour.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1-click application profiling: see the exact line of code that caused an anomaly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mapped timeframes (no more digging through Grafana to find when the incident occurred).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eBPF automatically gathers logs, metrics, traces, and profiles for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service map to grasp a complete at-a-glance picture of your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic discovery and monitoring of every application deployment in your Kubernetes cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can view Coroot’s documentation &lt;a href="https://docs.coroot.com/installation/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, visit our &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Github&lt;/a&gt;, and join our &lt;a href="https://join.slack.com/t/coroot-community/shared_invite/zt-2te9x672s-4s_Wp732cd~o2vdFLNE5AA" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; to become part of our community. We welcome any feedback and hope the tool can improve your workflow!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
