<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Coroot</title>
    <description>The latest articles on DEV Community by Coroot (@coroot).</description>
    <link>https://dev.to/coroot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3034639%2Fa266f65f-6e11-46f8-b8e1-62b2834906de.png</url>
      <title>DEV Community: Coroot</title>
      <link>https://dev.to/coroot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coroot"/>
    <language>en</language>
    <item>
      <title>Profiling Java apps: breaking things to prove it works</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 08 Apr 2026 21:51:04 +0000</pubDate>
      <link>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</link>
      <guid>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</guid>
      <description>&lt;p&gt;Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what.&lt;/p&gt;

&lt;p&gt;We wanted memory allocation and lock contention profiling. So we decided to add &lt;a href="https://github.com/async-profiler/async-profiler" rel="noopener noreferrer"&gt;async-profiler&lt;/a&gt; support to &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;coroot-node-agent&lt;/a&gt;. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why async-profiler
&lt;/h1&gt;

&lt;p&gt;We went with async-profiler. It's a native JVMTI agent used by pretty much everyone in the Java profiling space (Pyroscope, IntelliJ, Datadog). It can be loaded into a running JVM dynamically, supports CPU, allocation, and lock contention profiling in a single session, and works in unprivileged containers with no JVM flags. It outputs JFR format, which we parse using Grafana's &lt;a href="https://github.com/grafana/jfr-parser" rel="noopener noreferrer"&gt;jfr-parser&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  How we integrated it
&lt;/h1&gt;

&lt;p&gt;The integration follows the same pattern as our &lt;a href="https://coroot.com/blog/java-tls-instrumentation-with-ebpf" rel="noopener noreferrer"&gt;Java TLS agent&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;The node agent detects Java processes by checking whether the binary name ends with &lt;code&gt;java&lt;/code&gt;, then confirms it's a HotSpot JVM by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt;. It deploys &lt;code&gt;libasyncProfiler.so&lt;/code&gt; (~600KB) into the container's filesystem at &lt;code&gt;/tmp/coroot/&lt;/code&gt; and loads the library into the JVM via the Attach API. async-profiler starts with &lt;code&gt;event=itimer,interval=10ms,alloc,lock,jfr&lt;/code&gt;, capturing CPU, allocation, and lock events in a single session.&lt;/p&gt;

&lt;p&gt;For data collection, every 60 seconds the agent sends a stop command (async-profiler finalizes the JFR file), reads the file from the host via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root/&lt;/code&gt;, and immediately sends a start command to begin a new recording.&lt;/p&gt;
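
&lt;p&gt;The host-side read works because &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root&lt;/code&gt; exposes the target's mount namespace. A Go sketch of one collection cycle; the JFR file name and the command helpers are hypothetical:&lt;/p&gt;

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

// hostPath translates a path inside the target container into a path the
// node agent can open from the host, via the proc root of the process.
func hostPath(pid int, containerPath string) string {
	return filepath.Join("/proc", strconv.Itoa(pid), "root", containerPath)
}

// collectOnce runs one stop/read/start cycle. stop finalizes the JFR chunk
// so the file on disk is parseable; start immediately begins the next recording.
func collectOnce(pid int, stop, start func(pid int) error) ([]byte, error) {
	if err := stop(pid); err != nil {
		return nil, err
	}
	// "profile.jfr" is an illustrative file name, not the agent's actual one
	data, err := os.ReadFile(hostPath(pid, "/tmp/coroot/profile.jfr"))
	if err != nil {
		return nil, err
	}
	return data, start(pid)
}
```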

&lt;p&gt;The gap between stop and start is ~4ms. We considered using dump (which doesn't stop the profiler), but JFR output requires proper chunk finalization; a dump writes incomplete metadata that parsers reject. The stop/start approach guarantees valid output every time.&lt;/p&gt;

&lt;p&gt;Each command goes through the JVM Attach protocol. It's one command per connection: HotSpot closes the socket after each response. After the first attach (which triggers the attach listener via SIGQUIT), subsequent connections just hit the existing Unix socket. Total overhead: ~2ms per command.&lt;/p&gt;
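
&lt;p&gt;The wire format of that protocol is simple: the protocol version, the command name, and exactly three NUL-terminated arguments, written to the JVM's attach socket. A Go sketch of the framing, based on the publicly known HotSpot attach protocol rather than coroot's code:&lt;/p&gt;

```go
package main

import "bytes"

// attachRequest frames one command for the HotSpot attach protocol:
// protocol version "1", the command name, then exactly three arguments,
// every token terminated by a NUL byte. Unused arguments stay empty.
func attachRequest(cmd string, args ...string) []byte {
	var buf bytes.Buffer
	// pad to three arguments, then take exactly three
	tokens := append([]string{"1", cmd}, append(args, "", "", "")[:3]...)
	for _, tok := range tokens {
		buf.WriteString(tok)
		buf.WriteByte(0)
	}
	return buf.Bytes()
}
```

&lt;p&gt;Loading a native library maps to something like &lt;code&gt;attachRequest("load", path, "true", options)&lt;/code&gt;; HotSpot answers with a status line and closes the socket, which is why every command is a fresh connection.&lt;/p&gt;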

&lt;p&gt;If another tool (Pyroscope Java agent, Datadog, etc.) already loaded async-profiler into the JVM, we detect it by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt; and skip that process to avoid conflicts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enabling it
&lt;/h1&gt;

&lt;p&gt;Set the &lt;code&gt;ENABLE_JAVA_ASYNC_PROFILER=true&lt;/code&gt; environment variable on the node agent. In the Coroot custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_ASYNC_PROFILER&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No JVM flags, no application restarts, no agent JARs. The node agent handles everything automatically for all HotSpot JVMs it discovers. If you haven't enabled it yet, the JVM report shows a hint with a link to the docs.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;Once enabled, Coroot adds new charts to the JVM report: allocation rate (bytes/s and objects/s) and lock contention (contentions/s and delay). Each chart has a profile button that opens the corresponding flamegraph, so you can go from "allocation rate spiked" to "this function is allocating all the memory" in one click.&lt;/p&gt;

&lt;p&gt;We also export Prometheus metrics from the profiling data. These are monotonic counters accumulated from the parsed profiles, so &lt;code&gt;rate()&lt;/code&gt; gives you allocation rate and contention rate over time. We initially tried getting allocation metrics from hsperfdata (&lt;code&gt;sun.gc.tlab.alloc&lt;/code&gt;), but those are per-GC-cycle snapshots that reset every collection. The async-profiler data is the real thing.&lt;/p&gt;
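
&lt;p&gt;The accumulation itself is trivial: totals only ever grow, which is exactly what &lt;code&gt;rate()&lt;/code&gt; needs. A sketch with simplified sample and counter types (names are illustrative):&lt;/p&gt;

```go
package main

// sample is a simplified stand-in for one parsed JFR event.
type sample struct {
	allocBytes  uint64
	contentions uint64
}

// counters holds monotonic totals accumulated across collection cycles.
// Because the values never decrease, PromQL rate() over the exported
// series yields allocation and contention rates per second.
type counters struct {
	allocBytesTotal  uint64
	contentionsTotal uint64
}

// add folds one profile's samples into the running totals.
func (c *counters) add(samples []sample) {
	for _, s := range samples {
		c.allocBytesTotal += s.allocBytes
		c.contentionsTotal += s.contentions
	}
}
```

&lt;p&gt;A query like &lt;code&gt;rate(jvm_allocation_bytes_total[5m])&lt;/code&gt; (metric name illustrative) then gives bytes per second regardless of when each 60-second profile landed.&lt;/p&gt;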

&lt;h1&gt;
  
  
  Seeing it in action
&lt;/h1&gt;

&lt;p&gt;Enough theory. Let's break something and see how the profiling data helps us find the root cause.&lt;/p&gt;

&lt;p&gt;We have a demo environment with several microservices. The one we'll focus on is order-service, a Spring Boot app running on JDK 21, backed by MySQL. It handles order creation, listing, and payment processing. Normal latency is under 10ms.&lt;br&gt;
The demo has a built-in chaos controller that lets us inject failures via a REST API. We'll use two scenarios: lock contention and memory allocation pressure.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lock contention
&lt;/h1&gt;

&lt;p&gt;For this scenario, the chaos controller spawns background threads that repeatedly acquire a shared lock and hold it for 5ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startLockContention&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;availableProcessors&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InterruptedException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;interrupt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-lock-holder-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, every incoming request also tries to acquire the same lock in the request interceptor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chaosConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isLockContention&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChaosController&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Request thread blocks here while holder threads occupy the lock&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling this scenario, we can immediately see the impact on the order-service SLIs. The latency heatmap shows a clear shift: requests that used to complete in under 10ms now take 100ms+, with some exceeding a second:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" alt=" " width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The request rate chart confirms the degradation: you can see the latency distribution shifting from green (fast) to red (slow):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at the JVM report. The lock contention chart shows a clear spike: lock wait time jumps from near zero to significant values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's click the profile button on the lock contention chart to open the flamegraph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" alt=" " width="800" height="917"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Lock (delay) profile in comparison mode. Red means "more time waiting for locks than before." Reading from top to bottom, we can see the Spring request processing chain, and at the bottom of the flamegraph, our &lt;code&gt;ChaosInterceptor.preHandle&lt;/code&gt; method, the one that tries to acquire the shared lock. That's the bottleneck.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd know is "requests are slow." With the lock profile, we can point at the exact monitor and the exact code paths waiting for it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory allocation pressure
&lt;/h1&gt;

&lt;p&gt;The demo also supports a GC pressure scenario. It starts a background thread that continuously allocates and discards 256 MB byte arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startGcPressure&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;megabytes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;garbage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;megabytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;garbage&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// prevent dead-code elimination&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-gc-pressure"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JVM is configured with &lt;code&gt;-Xmx512m&lt;/code&gt;, so allocating 256 MB chunks means the GC runs after almost every allocation.&lt;/p&gt;

&lt;p&gt;After enabling this scenario, the JVM report tells the story. The allocation rate chart spikes from near zero to ~3 GB/s, and GC time jumps in lockstep: young collection pauses go from occasional to constant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's click the profile button on the allocation rate chart to see what is allocating all this memory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Memory (alloc_space) profile in comparison mode. At the bottom we can see &lt;code&gt;ChaosController$$Lambda.run&lt;/code&gt; and &lt;code&gt;startGcPressure&lt;/code&gt; marked as +100%: they didn't exist in the baseline period. The top-level &lt;code&gt;Thread.run&lt;/code&gt; frames confirm this is a background thread, not request processing.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd see is GC time going up. With the allocation profile, we know exactly which code is responsible.&lt;/p&gt;

&lt;p&gt;Enable it with a single environment variable and you get flamegraphs, time-series metrics, and a direct link between "something changed" and "here's the code." Coroot is open source; you can get started &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>observability</category>
    </item>
    <item>
      <title>Making encrypted Java traffic observable with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:28:21 +0000</pubDate>
      <link>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</link>
      <guid>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot's&lt;/a&gt; open source node agent uses eBPF to capture network traffic at the kernel level. It hooks into syscalls like &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt;, reads the first bytes of each payload, and detects the protocol: HTTP, MySQL, PostgreSQL, Redis, Kafka, and others. This works for any language and any framework without touching application code.&lt;/p&gt;

&lt;p&gt;For encrypted traffic, we attach eBPF uprobes to TLS library functions like &lt;code&gt;SSL_write&lt;/code&gt; and &lt;code&gt;SSL_read&lt;/code&gt; in OpenSSL, &lt;code&gt;crypto/tls&lt;/code&gt; in Go, and &lt;code&gt;rustls&lt;/code&gt; in Rust. The uprobes fire before encryption or after decryption, so we see the plaintext.&lt;/p&gt;

&lt;p&gt;Java is different. And it has been a blind spot until now.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Java is special
&lt;/h1&gt;

&lt;p&gt;Java's TLS implementation (JSSE) is not a native shared library. It's Java code that runs inside the JVM. There are no exported symbols like &lt;code&gt;SSL_write&lt;/code&gt; that eBPF could attach to.&lt;/p&gt;

&lt;p&gt;So when a Java app connects to MySQL or PostgreSQL over TLS, or makes HTTPS calls, eBPF tools cannot see the plaintext. All they see at the syscall level is encrypted data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our approach
&lt;/h1&gt;

&lt;p&gt;We solved this by combining a lightweight Java agent with a tiny native library that serves as an eBPF uprobe target.&lt;/p&gt;

&lt;p&gt;We dynamically load the agent into running JVMs using the attach API (the same mechanism profilers and debuggers use). The agent hooks &lt;code&gt;SSLSocketImpl$AppOutputStream.write&lt;/code&gt; and &lt;code&gt;SSLSocketImpl$AppInputStream.read&lt;/code&gt;, the internal JSSE classes where plaintext enters and leaves the TLS layer.&lt;/p&gt;

&lt;p&gt;When the application does an SSL write, our hook copies the first 1KB of plaintext into a thread-local native buffer and calls a stub native function. We copy to native memory because the pointer is stored and read later when the underlying &lt;code&gt;write()&lt;/code&gt; syscall fires. &lt;/p&gt;

&lt;p&gt;By that time our JNI call has already returned, and Java's GC could have moved the original byte array. We considered using &lt;code&gt;GetPrimitiveArrayCritical&lt;/code&gt; to pin the array in place and avoid the copy, but it blocks all garbage collectors while held, which is worse for the application than a small memcpy.&lt;/p&gt;

&lt;p&gt;For reads, we do the same after JSSE decrypts the data.&lt;/p&gt;

&lt;p&gt;The native stub function does nothing at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;coroot_java_tls_write_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;asm&lt;/span&gt; &lt;span class="k"&gt;volatile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;:::&lt;/span&gt; &lt;span class="s"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;asm volatile&lt;/code&gt; barrier prevents the compiler from optimizing the call away. We attach eBPF uprobes to this function, so every call is captured with the buffer pointer and the payload size. From there, the data goes into our existing protocol detection pipeline, and HTTP, MySQL, PostgreSQL, Redis, Kafka, and other protocols are parsed automatically.&lt;/p&gt;

&lt;p&gt;The file descriptor (which connection the data belongs to) is discovered without any Java reflection. When JSSE writes, the sequence is always: encrypt, then &lt;code&gt;write(fd, ciphertext)&lt;/code&gt; syscall. Our eBPF code stores the plaintext pointer when the stub is called, then the syscall that follows on the same thread provides the file descriptor. This is the same &lt;code&gt;ssl_pending&lt;/code&gt; mechanism we use for OpenSSL.&lt;/p&gt;
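
&lt;p&gt;The pairing logic is easy to picture in user-space terms: state keyed by thread id, written when the stub fires and consumed by the next syscall. A Go model of the idea (the real version lives in eBPF maps in the kernel, so this is an illustration, not the implementation):&lt;/p&gt;

```go
package main

// payload is the plaintext reference captured when the stub fires.
type payload struct {
	ptr uintptr
	n   int
}

// pendingTLS models the per-thread eBPF map: onStub stores the plaintext
// pointer, and the write() syscall that follows on the same thread supplies
// the file descriptor, completing the (fd, plaintext) pair.
type pendingTLS map[uint32]payload

// onStub records the plaintext buffer when the uprobe on the stub fires.
func (p pendingTLS) onStub(tid uint32, ptr uintptr, n int) {
	p[tid] = payload{ptr: ptr, n: n}
}

// onWrite consumes the stored payload, if any, for this thread.
func (p pendingTLS) onWrite(tid uint32) (payload, bool) {
	pl, ok := p[tid]
	if ok {
		delete(p, tid)
	}
	return pl, ok
}
```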

&lt;p&gt;The native library is compiled with &lt;code&gt;-nostdlib&lt;/code&gt;, so it has no dependencies and works in any container.&lt;/p&gt;

&lt;p&gt;The nice thing about this design is that there is no transport between the JVM and the node agent. No unix sockets, no shared memory, no protocols to maintain. The Java agent just calls a native function, and eBPF picks up the data through uprobes and existing syscall tracepoints.&lt;/p&gt;

&lt;h1&gt;
  
  
  Safety
&lt;/h1&gt;

&lt;p&gt;Our agent modifies the bytecode of two JVM internal classes to insert our hooks. That sounds scary, but all we add is a single method call before each SSL write and after each SSL read. The original code stays exactly the same. Every inserted call is wrapped in a try-catch that catches Throwable, so if our code fails for any reason, the error is silently ignored and the original SSL operation runs as if we were never there.&lt;/p&gt;

&lt;p&gt;We use ASM for the bytecode transformation. ByteBuddy would make the code shorter, but the agent JAR would grow from 130KB to over 8MB. Since we deploy the JAR into every container with a running JVM, keeping it small matters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benchmark
&lt;/h1&gt;

&lt;p&gt;We compared three scenarios on the same workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No instrumentation (baseline)&lt;/li&gt;
&lt;li&gt;eBPF with our Java TLS agent&lt;/li&gt;
&lt;li&gt;OpenTelemetry Java agent with traces exported to a collector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We included the OpenTelemetry comparison because it is the most common alternative for Java observability without code changes. The OTel agent auto-instruments HTTP clients, JDBC, and other libraries by rewriting bytecode at class load time.&lt;/p&gt;

&lt;p&gt;The test uses two machines to avoid resource contention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine 1&lt;/strong&gt; (8 vCPU): Java HTTP proxy making HTTPS calls + coroot-node-agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine 2&lt;/strong&gt; (8 vCPU): Go HTTPS server (5ms delay, ~1KB response) + wrk2 load generator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each scenario ran for 15 minutes at 1,000 requests per second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" alt=" " width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The baseline (no instrumentation) uses about 370 millicores of CPU. With our eBPF agent, CPU increases to about 426m, a 15% increase. The eBPF agent delivers the same throughput as the baseline.&lt;/p&gt;

&lt;p&gt;With the OpenTelemetry Java agent, CPU goes up to 511m, a 38% increase, and the application could only sustain about 800 of the 1,000 target requests per second, a 20% throughput drop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" alt=" " width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Limitations
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;JVM compatibility.&lt;/strong&gt; We support HotSpot-based JVMs: OpenJDK, Oracle JDK, Amazon Corretto, Azul Zulu, Eclipse Temurin. OpenJ9 and GraalVM native images are detected and skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSLSocket only.&lt;/strong&gt; We instrument SSLSocket (blocking I/O), which covers JDBC drivers, HttpsURLConnection, and most traditional Java HTTP clients. SSLEngine (used by Netty and async HTTP clients) is not yet supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic agent loading.&lt;/strong&gt; On Java 21+ the JVM prints a warning about dynamic agent loading being deprecated. JVMs with &lt;code&gt;-XX:+DisableAttachMechanism&lt;/code&gt; or &lt;code&gt;-XX:-EnableDynamicAgentLoading&lt;/code&gt; are detected and skipped.&lt;/p&gt;
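
&lt;p&gt;Detecting those flags only takes a look at the process command line. A sketch (the helper is illustrative; &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/cmdline&lt;/code&gt; separates arguments with NUL bytes):&lt;/p&gt;

```go
package main

import "strings"

// attachDisabled reports whether a JVM's command line rules out dynamic
// attach. The cmdline file is NUL-separated, so split on the zero byte.
func attachDisabled(cmdline string) bool {
	for _, arg := range strings.Split(cmdline, "\x00") {
		if arg == "-XX:+DisableAttachMechanism" || arg == "-XX:-EnableDynamicAgentLoading" {
			return true
		}
	}
	return false
}
```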

&lt;h1&gt;
  
  
  Disabled by default
&lt;/h1&gt;

&lt;p&gt;This feature must be explicitly enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;coroot-node-agent &lt;span class="nt"&gt;--enable-java-tls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;ENABLE_JAVA_TLS&lt;/span&gt;=&lt;span class="n"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the Coroot Operator on Kubernetes, add it to the Coroot CR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_TLS&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading an agent into a running JVM without the user asking for it is not something we want to do by default. The agent is safe, but we think this should be the user's choice.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;With this feature enabled, Coroot automatically detects and parses protocols inside encrypted Java connections: HTTP, MySQL, PostgreSQL, Redis, Kafka, and everything else we support. No code changes, no SDKs, no sidecars. Enable the flag and encrypted Java traffic becomes visible. If you'd like to try it and improve observability in your system, check out the project on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Instrumenting Rust TLS with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 18 Mar 2026 19:07:58 +0000</pubDate>
      <link>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</link>
      <guid>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</guid>
      <description>&lt;p&gt;eBPF collects telemetry directly from applications and infrastructure. One of the things it does is capture L7 traffic from TLS connections without any code changes, by hooking into TLS libraries and syscalls.&lt;/p&gt;

&lt;p&gt;Works great for OpenSSL. Works for Go.&lt;/p&gt;

&lt;p&gt;Then rustls enters the picture and everything stops being obvious. With OpenSSL, everything is nicely wrapped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSL_write(ssl, plaintext)
└─ write(fd, encrypted)

SSL_read(ssl, plaintext)
└─ read(fd, encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From eBPF’s point of view this is perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hook SSL_write, stash plaintext&lt;/li&gt;
&lt;li&gt;write() fires immediately → same thread → you know the FD&lt;/li&gt;
&lt;li&gt;same idea for reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything happens inside one call. Correlation is trivial.&lt;/p&gt;

&lt;h1&gt;
  
  
  Rustls does things differently
&lt;/h1&gt;

&lt;p&gt;Rustls doesn’t own the socket and never calls read or write itself. It works on buffers, and the application (or runtime) is responsible for actually moving bytes over the network.&lt;/p&gt;

&lt;p&gt;The API reflects that separation pretty clearly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// application writes plaintext into rustls&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls produces encrypted bytes and writes them via io::Write&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.write_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// application reads encrypted bytes and feeds them into rustls&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls decrypts and updates internal state&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// application reads decrypted data&lt;/span&gt;
&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So instead of one call doing everything, you get a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;plaintext is buffered first&lt;/li&gt;
&lt;li&gt;encryption happens later&lt;/li&gt;
&lt;li&gt;syscalls happen outside of rustls&lt;/li&gt;
&lt;li&gt;decryption happens before the app reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference for eBPF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writes: syscall happens after plaintext&lt;/li&gt;
&lt;li&gt;reads: syscall happens before plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the OpenSSL-style correlation only works in one direction.&lt;/p&gt;

&lt;h1&gt;
  
  
  Writes work as usual
&lt;/h1&gt;

&lt;p&gt;On the write side, nothing fundamentally new is needed. You hook Writer::write, stash the plaintext, and correlate it with the following sendto. The ordering is preserved, so the same approach as OpenSSL still applies here.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reads are inverted
&lt;/h1&gt;

&lt;p&gt;The read path is where things really break.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;recvfrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encrypted_buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;   &lt;span class="c1"&gt;// happens first&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;         &lt;span class="c1"&gt;// plaintext appears here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time we see plaintext, the syscall is already gone.&lt;/p&gt;

&lt;p&gt;So the logic has to be reversed. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see plaintext → wait for syscall”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;we do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see syscall → remember it → use it later”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on &lt;em&gt;recvfrom&lt;/em&gt; → stash FD per thread&lt;/li&gt;
&lt;li&gt;on &lt;em&gt;reader.read&lt;/em&gt; → pick up that FD and attach it to plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s basically reverse correlation. Not pretty, but it matches how rustls works.&lt;/p&gt;
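&lt;p&gt;The reverse correlation can be sketched in a few lines. This is a plain-Python model of the logic, not the agent's actual code: in the real agent this is an eBPF hash map keyed by thread ID, and all names here are illustrative:&lt;/p&gt;

```python
# Model of the read-side correlation: remember the FD per thread at
# syscall time, pick it up later when the plaintext appears.
pending_reads = {}  # thread id -> file descriptor

def on_recvfrom(tid, fd):
    # kprobe on recvfrom: encrypted bytes just arrived on this FD
    pending_reads[tid] = fd

def on_reader_read(tid):
    # uprobe on Reader::read: plaintext appears now; attach the stashed FD
    return pending_reads.pop(tid, None)

on_recvfrom(tid=42, fd=7)    # syscall seen first
fd = on_reader_read(tid=42)  # plaintext event inherits fd 7
```

&lt;p&gt;Popping the entry on use keeps a stale FD from being attached to a later, unrelated read on the same thread.&lt;/p&gt;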

&lt;h1&gt;
  
  
  When “ret=1” doesn’t mean 1 byte
&lt;/h1&gt;

&lt;p&gt;This one took longer than expected. We reused the OpenSSL-style exit probe:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ret = PT_REGS_RC(ctx)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The probe fired, but results were weird:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ret=1
ret=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which made no sense for a read. It turns out Rust returns a &lt;code&gt;Result&lt;/code&gt; split across two registers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rax → success or error flag&lt;/li&gt;
&lt;li&gt;rdx → actual number of bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we were reading rax and treating it as a size. Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ret=1 → actually an error&lt;/li&gt;
&lt;li&gt;ret=0 → success, but size is somewhere else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was straightforward once we understood this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_RC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// success&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// actual byte count&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classic case of “everything works, but the numbers are garbage”.&lt;/p&gt;

&lt;h1&gt;
  
  
  Finding rustls in binaries
&lt;/h1&gt;

&lt;p&gt;Rust symbols are heavily mangled:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It looks messy, but it encodes the full path: rustls::conn::Writer implementing std::io::Write::write.&lt;/p&gt;

&lt;p&gt;The tricky part is that mangling isn’t stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;different compiler versions use different schemes (legacy vs v0)&lt;/li&gt;
&lt;li&gt;optimizations and stripping can change what’s left in the binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So matching exact names is fragile.&lt;/p&gt;

&lt;p&gt;Instead, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check ELF .comment for rustc to detect that the binary was built with Rust&lt;/li&gt;
&lt;li&gt;then scan symbols for patterns like “rustls”, “Writer”+“write”, “Reader”+“read”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not perfect, but reliable enough in practice.&lt;/p&gt;
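&lt;p&gt;The matching itself boils down to substring checks over the symbol table. A minimal sketch of the idea, using the mangled symbol from above (the patterns mirror the ones described in this post, not the agent's exact code):&lt;/p&gt;

```python
# Fuzzy matching of mangled Rust symbols: exact demangled names are
# fragile across compiler versions, so look for stable substrings.
def is_rustls_write(sym):
    return all(p in sym for p in ("rustls", "Writer", "write"))

def is_rustls_read(sym):
    return all(p in sym for p in ("rustls", "Reader", "read"))

# The mangled symbol from the example above:
sym = "_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE"
```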

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Coroot is an open source observability tool that uses eBPF to collect telemetry without manual instrumentation. Because we instrument rustls at the library level, not the frameworks, this works across most Rust clients that use rustls under the hood.&lt;/p&gt;

&lt;p&gt;That includes HTTP stacks like hyper when paired with rustls (hyper-rustls, and frameworks like axum or warp when configured with rustls), database clients like sqlx when using its rustls TLS feature, and any async Rust service using tokio-rustls.&lt;/p&gt;

&lt;p&gt;No code changes, no SDKs, no wrappers.&lt;/p&gt;

&lt;p&gt;For Rust apps using OpenSSL via native-tls or openssl, the existing OpenSSL instrumentation already works. rustls was the missing piece.&lt;/p&gt;

&lt;p&gt;Below is an example of a service talking to MySQL over TLS. Coroot shows the actual queries even though everything on the wire is encrypted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" alt=" " width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’d like to give our open source tool a try and simplify your own observability, you can check it out on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can also find this guide and other open source observability articles on &lt;a href="https://coroot.com/blog/instrumenting-rust-tls-with-ebpf/" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>opensource</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to make GPUs on Kubernetes Observable</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:15:52 +0000</pubDate>
      <link>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</link>
      <guid>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</guid>
      <description>&lt;p&gt;GPUs are everywhere powering LLM inference, model training, video processing, and more. Kubernetes is often where these workloads run. But using GPUs in Kubernetes isn’t as simple as using CPUs.&lt;/p&gt;

&lt;p&gt;You need the right setup. You need efficient scheduling. And most importantly you need visibility.&lt;/p&gt;

&lt;p&gt;This post walks through how to run GPU workloads on Kubernetes, how to virtualize them efficiently, and how to use open source to monitor everything with zero instrumentation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running GPU Workloads on Kubernetes
&lt;/h1&gt;

&lt;p&gt;Running GPU workloads on Kubernetes is totally doable. But it takes a bit of setup.&lt;/p&gt;

&lt;p&gt;It starts with your nodes. Whether you’re running in the cloud or on bare metal, your cluster needs machines with physical GPUs. Most cloud providers support GPU-enabled node pools, and provisioning them is usually straightforward.&lt;/p&gt;

&lt;p&gt;Once the hardware is in place, the next step is software. For Kubernetes to schedule and run GPU workloads, it needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GPU drivers, installed on each node&lt;/li&gt;
&lt;li&gt;The NVIDIA container runtime, so containers can access the GPU&lt;/li&gt;
&lt;li&gt;The NVIDIA device plugin, so Kubernetes knows how to handle GPU resource requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install all of this manually. But it’s fiddly and error-prone. That’s where the NVIDIA GPU Operator comes in. It automates the whole setup: installing drivers, configuring the runtime, and deploying the device plugin. Once that’s done, your cluster is GPU-ready.&lt;/p&gt;

&lt;p&gt;After that, requesting a GPU is simple. Just add this to your pod spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    nvidia.com/gpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes will handle the rest: scheduling your pod onto a node with an available GPU and assigning it to the container.&lt;/p&gt;

&lt;p&gt;Of course, not every workload needs an entire GPU to itself. And that’s where GPU virtualization becomes really useful.&lt;/p&gt;

&lt;h1&gt;
  
  
  Virtualizing GPUs in Kubernetes
&lt;/h1&gt;

&lt;p&gt;By default, Kubernetes treats GPUs as exclusive resources. One pod per device. But in many real-world cases, that’s overkill. The GPU Operator supports two forms of GPU virtualization that let you safely share a GPU between workloads:&lt;/p&gt;

&lt;p&gt;Time-Slicing: Multiple containers take turns using the GPU in rapid bursts. It’s a great fit for bursty inference workloads, batch jobs, or anything that doesn’t require ultra-low latency.&lt;/p&gt;

&lt;p&gt;MIG (Multi-Instance GPU): Available on GPUs like the A100 and H100, MIG lets you partition a single physical GPU into several hardware-isolated instances. Each one behaves like its own dedicated GPU, with its own memory, cache, and compute cores.&lt;/p&gt;
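&lt;p&gt;With the GPU Operator, time-slicing is typically enabled through a device plugin config. A sketch of such a config, based on NVIDIA's documentation (the ConfigMap name and replica count here are illustrative, and the config still has to be referenced from the operator's ClusterPolicy):&lt;/p&gt;

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4       # each physical GPU is advertised as 4
```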

&lt;p&gt;Virtualization makes GPUs way more flexible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You stop wasting an entire GPU on a tiny batch job&lt;/li&gt;
&lt;li&gt;You get much better overall utilization&lt;/li&gt;
&lt;li&gt;You can safely share GPUs across apps without them stepping on each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you finally have a shot at balancing cost and performance. It’s a game-changer, but only if you can actually see what’s going on.&lt;/p&gt;

&lt;h1&gt;
  
  
  What observability looks like once GPUs are in play
&lt;/h1&gt;

&lt;p&gt;So, the cluster is set up, the workloads are running, and maybe you’ve even started virtualizing GPUs to get better efficiency. Now comes the tricky part – actually understanding what’s happening.&lt;/p&gt;

&lt;p&gt;From the infrastructure side, we want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many GPU-enabled nodes do we have right now?&lt;/li&gt;
&lt;li&gt;Which GPUs are actually doing work, and which are just burning budget?&lt;/li&gt;
&lt;li&gt;What’s the current GPU and memory utilization across the fleet?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And sure, if you’re in the cloud, temperature and power draw might feel like someone else’s problem. But it’s still good to know. Somewhere out there, your model is warming the planet one token at a time. Mother Nature says hi. 🌱&lt;/p&gt;

&lt;p&gt;From the application side, the questions change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which GPUs is this app actually using?&lt;/li&gt;
&lt;li&gt;How much compute and memory is it consuming?&lt;/li&gt;
&lt;li&gt;Is it sharing the GPU with something else?&lt;/li&gt;
&lt;li&gt;And if so who’s the noisy neighbor hogging all the resources?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just about curiosity. It’s about avoiding slowdowns, catching inefficiencies, and making smart scaling decisions. But here’s the catch: Kubernetes doesn’t tell you any of this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Out-of-the-box GPU observability
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; is an open source tool that uses eBPF to make any GPU-powered system observable with zero configuration. It talks directly to the GPU using NVIDIA’s NVML library, the same one behind nvidia-smi, so you can see what’s happening on your GPUs with no guesswork.&lt;/p&gt;

&lt;p&gt;On startup, the agent looks for libnvidia-ml.so in all the usual (and unusual) places, whether it’s installed by the GPU Operator, a package manager, or manually dropped in. If it finds the library, it loads it and starts gathering data.&lt;/p&gt;

&lt;p&gt;From there, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers all available GPUs on the node&lt;/li&gt;
&lt;li&gt;Collects real-time metrics: utilization, memory usage, temperature, power draw&lt;/li&gt;
&lt;li&gt;Tracks per-process usage using NVML’s process telemetry&lt;/li&gt;
&lt;li&gt;Maps each process back to its container and pod, using Coroot’s existing PID-to-container tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of “PID 12345 is using GPU 0,” you get “this container in this pod is using 78% of GPU-xxxx.”&lt;/p&gt;

&lt;p&gt;When it comes to virtualized GPUs, Coroot sees which containers are tied to which GPU UUIDs, even when multiple workloads are time-sharing or using MIG slices on the same physical device. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can see which apps are sharing the same GPU&lt;/li&gt;
&lt;li&gt;Understand how each one is using it&lt;/li&gt;
&lt;li&gt;Spot noisy neighbors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is automatic. Just install the agent and let Coroot do the rest.&lt;/p&gt;

&lt;p&gt;Once the Coroot agent discovers the GPUs and starts collecting data, all of it flows straight into the UI, ready to explore, with no dashboards to build or metrics to stitch together.&lt;/p&gt;

&lt;p&gt;Let’s walk through how this looks in practice.&lt;/p&gt;

&lt;h1&gt;
  
  
  Node-level GPU overview
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the node view, Coroot shows everything you’d want to know about the GPUs attached to a specific machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization over time&lt;/li&gt;
&lt;li&gt;GPU memory usage&lt;/li&gt;
&lt;li&gt;Top consumers of both compute and memory&lt;/li&gt;
&lt;li&gt;Temperature and power draw&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just “Hey, GPU usage went up.” You can see which containers are contributing to that load and whether the same GPU is being shared between apps.&lt;/p&gt;

&lt;h1&gt;
  
  
  App-level GPU breakdown
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" alt=" " width="800" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where things get real. If your app is slow, you can check GPU usage alongside CPU, memory, logs, traces, and everything else in one place. For each app, Coroot shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much GPU time your app is using&lt;/li&gt;
&lt;li&gt;How much memory it’s burning&lt;/li&gt;
&lt;li&gt;Which containers are sharing the GPU, and how much each one is using&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;GPU workloads aren’t cheap. And they’re rarely simple. Whether you’re running a single inference service or managing dozens of shared GPUs, you need real visibility. &lt;/p&gt;

&lt;p&gt;Observability is the same: not cheap, rarely simple. But with open source tools, managing your infra can be accessible to everyone - not just massive companies, but small businesses, startups, and homelab projects everywhere across the globe. &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Give it a try&lt;/a&gt; on your system next, and share feedback to help make good observability available for everyone.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using AI for Troubleshooting: OpenAI vs DeepSeek</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 07 Jan 2026 18:52:22 +0000</pubDate>
      <link>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</link>
      <guid>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</guid>
      <description>&lt;p&gt;AI is now a go-to tool for everything from writing to coding. Modern LLMs are so powerful that, with the right prompt and a few adjustments, they can handle tasks almost effortlessly.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; (an open source project to simplify observability) we’ve developed AI capabilities (LLMs + ML) for root cause analysis. Finding the exact root cause is tricky due to the many possible scenarios. Many of our users aren’t experts in areas like databases, network performance, or specific runtimes, so we need to provide clear explanations and, ideally, guidance on how to fix the problem.&lt;/p&gt;

&lt;p&gt;In the following blog, I decided to test the ability of different LLMs to diagnose a chaos experiment. Since laziness often drives progress, I skipped converting my telemetry data into a prompt and instead just uploaded a screenshot as input for the LLMs.&lt;/p&gt;

&lt;p&gt;The result shocked me.&lt;/p&gt;

&lt;h1&gt;
  
  
  OpenAI GPT-4o
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Here is my prompt, no other context has been provided:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" alt=" " width="768" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the provided PostgreSQL monitoring dashboard, it appears that an &lt;code&gt;ALTER TABLE&lt;/code&gt; statement is causing performance issues, leading to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increased Query Latency:&lt;/strong&gt; The Postgres average query latency has spiked, especially for pg-cluster-2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O Load from ALTER TABLE:&lt;/strong&gt; The 'ALTER TABLE' statement (alter table articles alter column body set not null) is consuming significant I/O, as seen in the Queries by I/O time graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Locked Connections and Queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Postgres connections graph shows a sharp rise in locked connections.&lt;/li&gt;
&lt;li&gt;The Locked queries graph confirms that queries are blocked due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation.&lt;/li&gt;
&lt;li&gt;The Blocking queries graph indicates that the &lt;code&gt;ALTER TABLE&lt;/code&gt; query is holding a lock and blocking multiple other queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication Lag:&lt;/strong&gt; The Replication lag shows a spike, likely due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation causing a backlog of writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this Happening?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt; is a table-wide operation that requires a full table scan to verify that no existing row violates the new constraint. It locks the entire table and prevents writes/reads from proceeding until the operation completes. This is especially problematic in a high-load environment where frequent reads/writes occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Fix This?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the &lt;code&gt;ALTER TABLE&lt;/code&gt; Statement (if possible): If the system is in production and the lock is causing downtime, consider canceling the query:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%ALTER TABLE articles alter column body set not null%';&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart Blocked Queries:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the &lt;code&gt;ALTER TABLE&lt;/code&gt; is canceled, the system should return to normal operations. However, some queries might still be waiting, so check for any lingering locks:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since ALTER TABLE is expensive, consider these best practices:&lt;/p&gt;

&lt;p&gt;1. Convert to an Online-Friendly Approach&lt;/p&gt;

&lt;p&gt;Instead of &lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt;, use a safer approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a check constraint first (non-blocking):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Backfill missing values (if any).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate the constraint (still non-blocking):&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If no violations exist, alter the column:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;2. Use &lt;code&gt;pg_repack&lt;/code&gt; for Zero-Downtime Migrations&lt;/p&gt;

&lt;p&gt;pg_repack allows reorganizing tables without full locks:&lt;br&gt;
&lt;code&gt;pg_repack -t articles --dbname=mydatabase&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;3. Perform Schema Changes During Low-Traffic Periods&lt;/p&gt;

&lt;p&gt;If running in production, execute schema changes during maintenance windows to reduce impact.&lt;/p&gt;

&lt;p&gt;4. Partition Large Tables&lt;/p&gt;

&lt;p&gt;For frequently altered large tables, consider partitioning to isolate operations on smaller segments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ALTER TABLE operation is locking the articles table, causing significant query delays and replication lag. The best approach is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cancel the operation if causing downtime.&lt;/li&gt;
&lt;li&gt;Use an online-safe alternative (check constraint + validation).&lt;/li&gt;
&lt;li&gt;Schedule schema changes carefully.&lt;/li&gt;
&lt;li&gt;Use tools like pg_repack for minimal impact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;That’s super impressive and surprisingly accurate. What shocked me the most was the model’s ability to interpret the screenshot as input.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s exciting that an LLM can ‘understand’ our domain and generate deep summaries with actionable insights. However, relying on a single vendor didn’t seem like a robust solution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fortunately, DeepSeek has emerged as a strong alternative to OpenAI’s model, especially with the recent release of its R1 model. So, let’s repeat the experiment using DeepSeek-R1:&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  DeepSeek-R1
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data indicates a significant performance issue in pg-cluster-2, primarily caused by a long-running &lt;code&gt;ALTER TABLE&lt;/code&gt; command on the articles table. This operation is causing table locks, blocking other critical queries (e.g., &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt;), leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High query latency (10-15 seconds).&lt;/li&gt;
&lt;li&gt;A surge in active/locked connections.&lt;/li&gt;
&lt;li&gt;Query throughput drops.&lt;/li&gt;
&lt;li&gt;Replication lag in other clusters (minor).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ALTER TABLE&lt;/code&gt; statement (alter table articles alter column body set not ?) is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Holding an exclusive lock on the articles table.&lt;/li&gt;
&lt;li&gt;Blocking subsequent queries (e.g., &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Creating a chain of waiting queries, increasing latency and connection usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Immediate Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the &lt;code&gt;ALTER TABLE&lt;/code&gt; command (if safe):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;-- Find the PID of the ALTER TABLE query  &lt;br&gt;
   SELECT pid, query FROM pg_stat_activity WHERE query LIKE '%alter column body%';  &lt;br&gt;
   -- Terminate it  &lt;br&gt;
   SELECT pg_terminate_backend(&amp;lt;PID&amp;gt;);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Caution: Ensure this won’t corrupt data (e.g., if the ALTER is part of a migration).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reschedule the DDL Operation:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run schema changes during low-traffic periods or use online DDL tools like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pg_repack&lt;/code&gt; (to rebuild tables without locks).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; (for index-related changes).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize the &lt;code&gt;ALTER TABLE&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the column is being set to NOT NULL, ensure it already contains no NULL values. Pre-run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE articles SET body = '' WHERE body IS NULL;  -- Example cleanup&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Lock Timeouts:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add a lock timeout to prevent indefinite blocking:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SET LOCAL lock_timeout = '5s';  &lt;br&gt;
   ALTER TABLE ...;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor and Tune Queries: Optimize the &lt;code&gt;SELECT id FROM articles WHERE created &amp;lt; ? …&lt;/code&gt; query with an index on created. Batch the &lt;code&gt;INSERT&lt;/code&gt; operations to reduce contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connection Pooling: Use a connection pooler (e.g., PgBouncer) to manage active connections and prevent exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replication Checks: Investigate replication lag in &lt;code&gt;pg-cluster-1&lt;/code&gt; and &lt;code&gt;pg-cluster-3&lt;/code&gt; to ensure failover readiness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
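
&lt;p&gt;As a sketch of those query-tuning suggestions (the table and column names follow the examples above; the index name and batched values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Build the index without blocking writes
-- (note: cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_articles_created ON articles (created);

-- Batch inserts into a single statement to reduce per-row round trips
INSERT INTO articles (body, created)
VALUES ('first', now()), ('second', now()), ('third', now());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;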

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The blocking &lt;code&gt;ALTER TABLE&lt;/code&gt; is the critical issue. Address it first by terminating or rescheduling the operation, then implement safeguards to avoid similar incidents. Optimize queries and use non-locking DDL tools for future schema changes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DeepSeek-R1 delivers an answer that is quite close to OpenAI’s. I’d say it is accurate enough to be a viable alternative to OpenAI’s models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Additionally, I tested other open-source models in my lab, but I was only able to run LLMs with up to 70 billion parameters. In this range, I didn’t find anything that came close to OpenAI or the larger DeepSeek models for our needs.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is AIOps finally here?
&lt;/h2&gt;

&lt;p&gt;Yes, but only if you have enough data describing your system’s behaviour. &lt;/p&gt;

&lt;p&gt;This wouldn’t have been possible with just CPU, memory, and disk usage alone. The real magic of AI in this case was its ability to understand the bigger picture, apply domain knowledge, and suggest how to fix the issue. &lt;/p&gt;

&lt;p&gt;This is where Coroot's AI-Powered Root Cause Analysis can provide a more complete picture, using eBPF + Machine Learning to catch context missing from LLMs alone. As part of our commitment to make observability simpler and accessible for &lt;em&gt;everyone&lt;/em&gt;, you can set it up for free with our open source version. &lt;a href="https://docs.coroot.com/ai/coroot-cloud" rel="noopener noreferrer"&gt;Details here.&lt;/a&gt; We hope it can help save you hours of digging through telemetry and make root cause analysis easier for your team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>deepseek</category>
      <category>devops</category>
    </item>
    <item>
      <title>Chaos testing a Postgres cluster managed by CloudNativePG</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 16 Dec 2025 16:57:20 +0000</pubDate>
      <link>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</link>
      <guid>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</guid>
      <description>&lt;p&gt;As more organizations move their databases to cloud-native environments, effectively managing and monitoring these systems becomes crucial. According to Coroot’s anonymous usage statistics, 64% of projects use PostgreSQL, making it the most popular RDBMS among our users, compared to 14% using MySQL. This is not surprising since it is also the most widely used open-source database worldwide. &lt;/p&gt;

&lt;p&gt;Kubernetes is more than a platform for running containerized applications. It also enables better management of databases by allowing automation of tasks like backups, high availability, and scaling through its operator framework. This provides a management experience similar to using a managed service like AWS RDS but without vendor lock-in and often at a lower cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cloudnative-pg/cloudnative-pg" rel="noopener noreferrer"&gt;CloudNativePg&lt;/a&gt; is an open-source operator originally created by EDB, the oldest and the biggest Postgres vendor world-wide. As other operators, CNPG helps manage PostgreSQL databases on Kubernetes, covering the entire operational lifecycle from initial deployment to ongoing maintenance. Worth to mention that this is the youngest Postgres operator on the market, but its open source traction grows rapidly and based on my observations it’s the favorite operator across Reddit users.&lt;/p&gt;

&lt;p&gt;In this post I’ll install a CNPG cluster in my lab, instrument it with Coroot Community (&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;open source&lt;/a&gt;), then generate some load and introduce some failures to ensure high availability and observability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the cluster
&lt;/h1&gt;

&lt;p&gt;Installing the CloudNativePG operator is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy a cluster, create a Kubernetes custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 30Gi   
  postgresql:
    shared_preload_libraries: [pg_stat_statements]
    parameters:
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
  managed:
    roles:
    - name: coroot
      ensure: present
      login: true
      connectionLimit: 2
      inRoles:
      - pg_monitor
      passwordSecret:
        name: pg-cluster
---
apiVersion: v1
data:
  username: ******==
  password: *********==
kind: Secret
metadata:
  name: pg-cluster
type: kubernetes.io/basic-auth

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;In this post, I’ll be using the open source Community Edition of Coroot. Here are the commands to install the Coroot Operator for Kubernetes along with all Coroot components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add coroot https://coroot.github.io/helm-charts
helm repo update coroot
helm install -n coroot --create-namespace coroot-operator coroot/coroot-operator
helm install -n coroot coroot coroot/coroot-ce

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access Coroot, I’m forwarding the Coroot UI port to my local machine. For production deployments the operator can create an Ingress.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward -n coroot service/coroot-coroot 8083:8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the UI, we can see two applications: the operator (cnpg-cloudnative-pg) and our Postgres cluster (pg-cluster). Coroot has also identified that pg-cluster is a Postgres database and suggests integrating Postgres monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes approach to monitoring databases typically involves running metric exporters as sidecar containers within database instance Pods. However, this method can be challenging for certain use cases. For example, CNPG doesn’t support running custom sidecar containers, and their &lt;a href="https://github.com/cloudnative-pg/cnpg-i" rel="noopener noreferrer"&gt;CNPG-i&lt;/a&gt; capability requires specific plugin support and is still in the experimental stage.&lt;/p&gt;

&lt;p&gt;To address these limitations, Coroot has a dedicated coroot-cluster-agent that can discover and gather metrics from databases without requiring a separate container for each database instance. To configure this integration, simply use the credentials of the database role already created for Coroot. Click on “Postgres” in the Coroot UI and then on the “Configure” button.&lt;/p&gt;

&lt;p&gt;Next, provide the credentials configured for Coroot in the cluster specification. Coroot’s cluster-agent will then collect Postgres metrics from each instance in the cluster. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It feels a bit dull without any load or issues. Let’s add an application that interacts with this database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I deployed a simple application called “app” that executes approximately 600 queries per second: 300 on the primary and 300 across both replicas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I believe that any observability solution must be tested against failures to ensure that when a problem occurs, we can quickly identify the root cause. So, let’s introduce some failures.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #1: CPU noisy neighbor
&lt;/h1&gt;

&lt;p&gt;In shared infrastructures like Kubernetes clusters, applications often compete for resources. Let’s simulate a scenario with a noisy neighbor, where a CPU-intensive application runs on the same node as our database instance. The following Job will create a Pod with stress-ng on node100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-stress
spec:
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      nodeSelector:
        kubernetes.io/hostname: node100
      containers:
        - name: stress-ng
          image: debian:bullseye-slim
          command:
            - "/bin/sh"
            - "-c"
            - |
              apt-get update &amp;amp;&amp;amp; 
              apt-get install -y stress-ng &amp;amp;&amp;amp; 
              stress-ng --cpu 0 --timeout 300s
      restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, our “noisy neighbor” has affected Postgres performance. Now, let’s assume we don’t know the root cause and use Coroot to identify the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" alt=" " width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the CPU Delay chart, we can observe that pg-cluster-2 is experiencing a CPU time shortage. Why? Because node100 is overloaded. And why is that? The cpu-stress application has consumed all available CPU time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #2: Postgres Locks
&lt;/h1&gt;

&lt;p&gt;Now, let’s explore a Postgres-specific failure scenario. We’ll run a suboptimal schema migration on our articles table, which contains 10 million rows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those who aren’t deeply familiar with databases, this migration will lock the entire table to verify that all rows are not NULL. Since the table is relatively large, the migration can take some time to complete. During this period, queries from our app will be forced to wait until the lock is released.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" alt=" " width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s interpret these charts together: The Postgres latency of pg-cluster-2 has significantly increased. Many SELECT and INSERT queries are locked by another query. Which one? The ALTER TABLE query. Why is this query taking so long to execute? Because it is performing I/O operations to verify that the body column in each row is not NULL.&lt;/p&gt;

&lt;p&gt;As you can see, having the right metrics was crucial in this scenario. For instance, simply knowing the number of Postgres locks wouldn’t help us identify the specific query holding the lock. &lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #3: primary Postgres instance failure
&lt;/h1&gt;

&lt;p&gt;Now, let’s see how CloudNativePG handles a primary instance failure. To simulate this failure, I’ll simply delete the Pod of the primary Postgres instance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete pod pg-cluster-2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(That’s all that fits within the character limit; to view the last experiment, &lt;a href="https://coroot.com/blog/engineering/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg/" rel="noopener noreferrer"&gt;visit our blog.&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>devops</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Monitoring a Docker Homelab with Open Source</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Mon, 15 Dec 2025 17:32:55 +0000</pubDate>
      <link>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</link>
      <guid>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog comes from Coroot contributor Arie van den Heuvel: engineer, a System and Application Management Specialist, and a valued member of our open source community. You can read more of Arie’s writing and support the resource articles he has contributed to open source &lt;a href="https://solipsistic-sysadmin.blogspot.com/" rel="noopener noreferrer"&gt;on his blog.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running a home server consisting of one or more nodes with some or all services in Docker, you may find yourself wanting to monitor your environment. Or even better, attain full observability.&lt;/p&gt;

&lt;p&gt;The frequent recommendation for this is a combination of Prometheus with Grafana. But this solution requires a lot of work to fully configure, in addition to work on one’s applications and setup for full visibility. Another possibility is to use the free tier of NewRelic, which has the advantage of remote insights on metrics and logs. Again, this requires additional work on containers or applications to have a more refined visibility of your services.&lt;/p&gt;

&lt;p&gt;For those not running Linux, an honourable mention to use as a solution would be &lt;a href="https://github.com/henrygd/beszel" rel="noopener noreferrer"&gt;Beszel.&lt;/a&gt; Beszel can be run as a local service or in docker. It consists of a web front-end and an agent that can be used on multiple systems that run Windows and MacOS. Installation is an easy job in docker. Once it’s running, Beszel provides insightful information with system metrics, docker services, and even logs.&lt;/p&gt;

&lt;p&gt;My personal choice for monitoring a home server system is &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot. &lt;/a&gt; In the following blog, I’ll detail how I used Coroot to set up observability for my homelab, which you can then adopt for your own setup.&lt;/p&gt;

&lt;h1&gt;
  
  
  Observability with Coroot
&lt;/h1&gt;

&lt;p&gt;In my current setup on a &lt;a href="https://rockylinux.org/" rel="noopener noreferrer"&gt;Rocky Linux&lt;/a&gt; 9.x system, Coroot runs on a &lt;a href="https://github.com/ClickHouse/ClickHouse" rel="noopener noreferrer"&gt;Clickhouse&lt;/a&gt; server to store metrics, logs, traces and profiles, in addition to the &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;Coroot node-agent &lt;/a&gt; and &lt;a href="https://github.com/coroot/coroot-cluster-agent" rel="noopener noreferrer"&gt;Coroot cluster-agent.&lt;/a&gt; The Coroot node-agent automatically collects all service metrics and logs using eBPF, while the cluster-agent provides detailed information on databases like MySQL, Postgres or Redis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another advantage Coroot presents is the use of &lt;a href="https://coroot.com/ai" rel="noopener noreferrer"&gt;AI-powered Root Cause Analysis&lt;/a&gt;, which provides instantaneous and helpful insights for investigating incidents. With a &lt;a href="https://docs.coroot.com/ai/coroot-cloud/" rel="noopener noreferrer"&gt;Coroot Cloud account&lt;/a&gt;, you will have ten helpful analyses for free each month. Even without AI, the data presented with Coroot with standard alerts based on best metric practices is pretty insightful and helps to make your setup even better.&lt;/p&gt;

&lt;p&gt;Coroot services run in docker through a docker-compose file. In a normal Coroot setup Prometheus is used, but in this setup I have configured Clickhouse, &lt;a href="https://docs.coroot.com/configuration/clickhouse/" rel="noopener noreferrer"&gt;which is a supported alternative.&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Clickhouse as a Local Service
&lt;/h1&gt;

&lt;p&gt;I have Clickhouse running as a local service. This setup allows for better control and convenience when scaling down Clickhouse’s memory usage, scaling down logging on disk and in the database, and it simplifies making changes to the data. The only downside to note is that this setup requires updating Clickhouse manually with &lt;code&gt;yum/dnf&lt;/code&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Installing Clickhouse
&lt;/h1&gt;

&lt;p&gt;Installing Clickhouse is easily achieved by adding the repo, installing Clickhouse, and making a few quick adjustments before starting it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo dnf install -y yum-utils
sudo dnf config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo dnf install -y clickhouse-server clickhouse-client

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before starting the service create file &lt;code&gt;/etc/clickhouse-server/config.d/z_log_disable.xml&lt;/code&gt; and insert the following content in the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;clickhouse&amp;gt;
&amp;lt;asynchronous_metric_log remove="1"/&amp;gt;
&amp;lt;metric_log remove="1"/&amp;gt;
&amp;lt;latency_log remove="1"/&amp;gt;
&amp;lt;query_thread_log remove="1" /&amp;gt;
&amp;lt;query_log remove="1" /&amp;gt;
&amp;lt;query_views_log remove="1" /&amp;gt;
&amp;lt;part_log remove="1"/&amp;gt;
&amp;lt;session_log remove="1"/&amp;gt;
&amp;lt;text_log remove="1" /&amp;gt;
&amp;lt;trace_log remove="1"/&amp;gt;
&amp;lt;crash_log remove="1"/&amp;gt;
&amp;lt;opentelemetry_span_log remove="1"/&amp;gt;
&amp;lt;zookeeper_log remove="1"/&amp;gt;
&amp;lt;/clickhouse&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this adjust cache sizes in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;mark_cache_size&amp;gt;268435456&amp;lt;/mark_cache_size&amp;gt;
&amp;lt;index_mark_cache_size&amp;gt;67108864&amp;lt;/index_mark_cache_size&amp;gt;
&amp;lt;uncompressed_cache_size&amp;gt;16777216&amp;lt;/uncompressed_cache_size&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust memory usage ratio in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;max_server_memory_usage_to_ram_ratio&amp;gt;0.75&amp;lt;/max_server_memory_usage_to_ram_ratio&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower the thread pool size in &lt;code&gt;/etc/clickhouse-server/config.xml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- max_thread_pool_size&amp;gt;10000&amp;lt;/max_thread_pool_size&amp;gt; --&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And starting things up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo systemctl daemon-reload
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;First, check that your Linux system is running kernel 5.1 or later (although 4.2 is also supported). This installation is different from the &lt;a href="https://docs.coroot.com/?env=docker" rel="noopener noreferrer"&gt;original docker-compose file.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is not used in this setup, and Clickhouse runs as a local service. Another distinction is the retention of the data, which is normally set to seven days for traces, logs, profiles and metrics. Coroot also typically stores its own local cache for metrics for 30 days.&lt;/p&gt;

&lt;p&gt;In this setup, the data retention in Clickhouse is set to 14 days. With eighteen local and docker services, the retained data averages about 3GB on my system.&lt;br&gt;
Coroot, its node-agent, and cluster-agent run as docker services with a docker-compose file that you create locally. This is achieved by inserting the following content in a locally created &lt;code&gt;docker-compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
name: coroot

volumes:
  node_agent_data: {}
  cluster_agent_data: {}
  coroot_data: {}

services:
  coroot:
    restart: always
    image: ghcr.io/coroot/coroot${LICENSE_KEY:+-ee} # set 'coroot-ee' as the image if LICENSE_KEY is defined
    pull_policy: always
    user: root
    volumes:
      - coroot_data:/data
    ports:
      - 8080:8080
    command:
      - '--data-dir=/data'
      - '--bootstrap-refresh-interval=15s'
      - '--bootstrap-clickhouse-address=127.0.0.1:9000'
      - '--bootstrap-prometheus-url=http://127.0.0.1:9090'
      - '--global-prometheus-use-clickhouse'
      - '--global-prometheus-url=http://127.0.0.1:9090'
      - '--global-refresh-interval=15s'
      - '--cache-ttl=31d'
      - '--traces-ttl=21d'
      - '--logs-ttl=21d'
      - '--profiles-ttl=21d'
      - '--metrics-ttl=21d'
    environment:
      - LICENSE_KEY=${LICENSE_KEY:-}
      - GLOBAL_PROMETHEUS_USE_CLICKHOUSE
      - CLICKHOUSE_SPACE_MANAGER_USAGE_THRESHOLD=75 # Set cleanup threshold to 75%
      - CLICKHOUSE_SPACE_MANAGER_MIN_PARTITIONS=2 # Always keep at least 2 partitions
    network_mode: host

  node-agent:
    restart: always
    image: ghcr.io/coroot/coroot-node-agent
    pull_policy: always
    privileged: true
    pid: "host"
    volumes:
      - /sys/kernel/tracing:/sys/kernel/tracing
      - /sys/kernel/debug:/sys/kernel/debug
      - /sys/fs/cgroup:/host/sys/fs/cgroup
      - node_agent_data:/data
    command:
      - '--collector-endpoint=http://192.168.1.160:8080'
      - '--cgroupfs-root=/host/sys/fs/cgroup'
      - '--wal-dir=/data'

  cluster-agent:
    restart: always
    image: ghcr.io/coroot/coroot-cluster-agent
    pull_policy: always
    volumes:
      - cluster_agent_data:/data
    command:
      - '--coroot-url=http://192.168.1.160:8080'
      - '--metrics-scrape-interval=15s'
      - '--metrics-wal-dir=/data'
    depends_on:
      - coroot


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating this file and making any adjustments to your own liking and network preferences, run &lt;code&gt;docker compose up -d&lt;/code&gt; and go to your IP address on port 8080. Here you have access to Coroot, and are now prompted to set admin credentials!&lt;/p&gt;

&lt;p&gt;In my setup, &lt;a href="https://github.com/containrrr/watchtower" rel="noopener noreferrer"&gt;Watchtower&lt;/a&gt; takes care of updating docker containers, which works great with Coroot.&lt;/p&gt;

&lt;p&gt;As a final sidenote: there are already some helpful hints and pointers present within Coroot for setting things up. In my case, there was information available that helped observe a Postgres database. Don’t forget to use the given commands as the admin/postgres user to make it work.&lt;/p&gt;

&lt;p&gt;Happy homelab observing! 🐧&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenTelemetry for Go: Measuring the Overhead</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:15:09 +0000</pubDate>
      <link>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</link>
      <guid>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</guid>
      <description>&lt;h1&gt;
  
  
  OpenTelemetry for Go: Measuring the Overhead
&lt;/h1&gt;

&lt;p&gt;Everything comes at a cost — and observability is no exception. When we add metrics, logging, or distributed tracing to our applications, it helps us understand what’s going on with performance and key UX metrics like success rate and latency. But what’s the cost?&lt;/p&gt;

&lt;p&gt;I’m not talking about the price of observability tools here, I mean the instrumentation overhead. If an application logs or traces everything it does, that’s bound to slow it down or at least increase resource consumption. Of course, that doesn’t mean we should give up on observability. But it does mean we should measure the overhead so we can make informed tradeoffs.&lt;/p&gt;

&lt;p&gt;In this post, I want to measure the overhead of using OpenTelemetry in a Go application. To do that, I’ll use a very simple Go HTTP server that, on every request, increments a counter in Valkey (a Redis fork), an in-memory database. The idea behind the benchmark is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we’ll run the app under load without any instrumentation and measure its performance and resource usage.&lt;/li&gt;
&lt;li&gt;Then, using the exact same workload, we’ll repeat the test with OpenTelemetry SDK for Go enabled and compare the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Test setup
&lt;/h1&gt;

&lt;p&gt;For this benchmark, I’ll use four Linux nodes, each with 4 vCPUs and 8GB of RAM. One will run the application, another will host Valkey, a third will be used for the load generator, and the fourth for observability (using &lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Coroot Community Edition&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" alt=" " width="800" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I want to make sure the components involved in the test don’t interfere with each other, so I’m running them on separate nodes. This time, I’m not using Kubernetes; instead, I’ll run everything in plain Docker containers. I’m also using host network mode for all containers, to avoid docker-proxy introducing any additional latency into the network path.&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at the application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "strconv"

    "github.com/go-redis/redis/extra/redisotel"
    "github.com/go-redis/redis/v8"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/trace"
)

var (
    rdb *redis.Client
)

func initTracing() {
    rdb.AddHook(redisotel.TracingHook{})
    client := otlptracehttp.NewClient()
    exporter, err := otlptrace.New(context.Background(), client)
    if err != nil {
        log.Fatal(err)
    }
    tracerProvider := trace.NewTracerProvider(trace.WithBatcher(exporter))
    otel.SetTracerProvider(tracerProvider)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
}

func handler(w http.ResponseWriter, r *http.Request) {
    cmd := rdb.Incr(r.Context(), "counter")
    if err := cmd.Err(); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    _, _ = w.Write([]byte(strconv.FormatInt(cmd.Val(), 10)))
}

func main() {
    rdb = redis.NewClient(&amp;amp;redis.Options{Addr: os.Getenv("REDIS_SERVER")})
    h := http.Handler(http.HandlerFunc(handler))
    if os.Getenv("ENABLE_OTEL") != "" {
        log.Println("enabling opentelemetry")
        initTracing()
        h = otelhttp.NewHandler(http.HandlerFunc(handler), "GET /")
    }
    log.Fatal(http.ListenAndServe(":8080", h))
} 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, the application runs without instrumentation. The OpenTelemetry SDK is initialized only if the ENABLE_OTEL environment variable is set, so runs without this variable will serve as the baseline for comparison.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running the Benchmark
&lt;/h1&gt;

&lt;p&gt;Now let’s start all the components and begin testing.&lt;/p&gt;

&lt;p&gt;First, we launch Valkey using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name valkey -d --net=host valkey/valkey

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we start the Go app and point it to the Valkey instance by IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app -e REDIS_SERVER="192.168.1.2:6379" --net=host failurepedia/redis-app:0.5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate load, I’ll use wrk2, which allows precise control over request rate. In this test, I’m setting it to 10,000 requests per second using 100 connections and 8 threads. Each run will last 20 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker run --rm --name load-generator -ti cylab/wrk2 \
   -t8 -c100 -d1200s -R10000 --u_latency http://192.168.1.3:8080/ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Let’s take a look at the results.&lt;/p&gt;

&lt;p&gt;We started by running the app without any instrumentation. This serves as our baseline for performance and resource usage. Based on metrics gathered by Coroot using eBPF, the app successfully handled 10,000 requests per second. The majority of requests were served in under 5 milliseconds: the 95th percentile (p95) latency was around 5ms, and the 99th percentile (p99) was about 10ms, with occasional spikes reaching up to 20ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" alt=" " width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage was steady at around 2 CPU cores (or 2 CPU seconds per second), and memory consumption stayed low at roughly 10 MB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So that’s our baseline. Now, let’s restart the app container with the OpenTelemetry SDK enabled and see how things change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app \
  -e REDIS_SERVER="192.168.1.2:6379" \
  -e ENABLE_OTEL=1 \
  -e OTEL_SERVICE_NAME="app" \
  -e OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://192.168.1.4:8080/v1/traces" \
  --net=host failurepedia/redis-app:0.5 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else stayed the same – the infrastructure, the workload, and the duration of the test.&lt;/p&gt;

&lt;p&gt;Now let’s break down what changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" alt=" " width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage increased from around 10 megabytes to somewhere between 15 and 18 megabytes. This additional overhead comes from the SDK and its background processes for handling telemetry data. While there is a clear difference, it doesn’t look like a significant increase in absolute terms, especially for modern applications where memory budgets are typically much larger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage jumped from 2 cores to roughly 2.7 cores. That’s about a 35 percent increase. This is expected since the app is now tracing every request, preparing and exporting spans, and doing more work in the background.&lt;/p&gt;

&lt;p&gt;To understand exactly where this additional CPU usage was coming from, I used Coroot’s built-in eBPF-based CPU profiler to capture and compare profiles before and after enabling OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profiler showed that about 10 percent of total CPU time was spent in go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor, which handles span batching and export. Redis calls also got slightly more expensive — tracing added around 7 percent CPU overhead to go-redis operations. The rest of the increase came from instrumented HTTP handlers and middleware.&lt;/p&gt;

&lt;p&gt;In short, the overhead comes from OpenTelemetry’s span processing pipeline, not from the app’s core logic.&lt;/p&gt;
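&lt;p&gt;If that pipeline cost matters for your workload, the batch span processor can be tuned when creating the tracer provider. Here’s a sketch based on the initTracing function above (it additionally needs the time import); the values are illustrative examples, not settings I tested in this benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Larger batches and a longer flush interval reduce per-span CPU and
// network overhead, at the cost of delaying span delivery and buffering
// more spans in memory.
tracerProvider := trace.NewTracerProvider(trace.WithBatcher(exporter,
    trace.WithMaxExportBatchSize(1024),     // spans per export request (SDK default: 512)
    trace.WithBatchTimeout(10*time.Second), // max wait before a flush (SDK default: 5s)
    trace.WithMaxQueueSize(4096),           // buffered spans before new ones are dropped (SDK default: 2048)
))
otel.SetTracerProvider(tracerProvider)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whether this helps depends on span volume and export latency, so it’s worth re-running the same before/after comparison whenever you change these knobs.&lt;/p&gt;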

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latency also changed, though not dramatically. With OpenTelemetry enabled, more requests fell into the 5 to 10 millisecond range. The 99th percentile latency went from 10 to about 15 milliseconds. Throughput remained stable at around 10,000 requests per second. We didn’t see any errors or timeouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network traffic also increased. With tracing enabled, the app started exporting telemetry data to Coroot, which resulted in an outbound traffic volume of about 4 megabytes per second, or roughly 32 megabits per second. For high-throughput services or environments with strict network constraints, this is something to keep in mind when enabling full request-level tracing.&lt;/p&gt;

&lt;p&gt;Overall, enabling OpenTelemetry introduced a noticeable but controlled overhead. These numbers aren’t negligible, especially at scale — but they’re also not a dealbreaker. For most teams, the visibility gained through distributed tracing and the ability to troubleshoot issues faster will justify the tradeoff.&lt;/p&gt;
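&lt;p&gt;One common way to shrink that tradeoff without giving up tracing entirely is head-based sampling: record only a fraction of traces instead of every request. A minimal sketch using the OpenTelemetry Go SDK’s built-in samplers (the 10% ratio is an arbitrary example, not something I measured in this benchmark):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sample roughly 10% of new traces. ParentBased makes child spans
// follow the parent's sampling decision, so sampled traces stay complete.
tracerProvider := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
    trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
)
otel.SetTracerProvider(tracerProvider)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The span-processing and export overhead then scales roughly with the sampling ratio, at the price of only seeing a subset of requests in your traces.&lt;/p&gt;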

&lt;h1&gt;
  
  
  eBPF-based instrumentation
&lt;/h1&gt;

&lt;p&gt;I often hear from engineers, especially in ad tech and other high-throughput environments, that they simply can’t afford the overhead of distributed tracing. At the same time, observability is absolutely critical for them. This is exactly the kind of scenario where eBPF-based instrumentation fits well. &lt;/p&gt;

&lt;p&gt;Instead of modifying application code or adding SDKs, an agent can observe application behavior at the kernel level using eBPF. Coroot’s agent supports this approach and is capable of collecting both metrics and traces using eBPF, without requiring any changes to the application itself.&lt;/p&gt;

&lt;p&gt;However, in high-load environments like the one used in this benchmark, we generally recommend disabling eBPF-based tracing and working with metrics only. Metrics still allow us to clearly see how services interact with each other, without storing data about every single request. They’re also much more efficient in terms of storage and runtime overhead.&lt;/p&gt;

&lt;p&gt;Throughout both runs of our test, Coroot’s agent was running on each node. Here’s what its CPU usage looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, node201 was running Valkey, node203 was running the app, and node204 was the load generator. As the chart shows, even under consistent load, the agent’s CPU usage stayed under 0.3 cores. That makes it lightweight enough for production use, especially when working in metrics-only mode.&lt;/p&gt;

&lt;p&gt;This approach offers a practical balance: good visibility with minimal cost.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Observability comes at a cost, but as this experiment shows, that cost depends heavily on how you choose to implement it.&lt;/p&gt;

&lt;p&gt;OpenTelemetry SDKs provide detailed traces and deep visibility, but they also introduce measurable overhead in terms of CPU, memory, and network traffic. For many teams, especially when fast incident resolution is a priority, that tradeoff is entirely justified.&lt;/p&gt;

&lt;p&gt;At the same time, eBPF-based instrumentation offers a more lightweight option. It allows you to collect meaningful metrics without modifying application code and keeps resource usage minimal, especially when tracing is disabled and only metrics are collected.&lt;/p&gt;

&lt;p&gt;The right choice depends on your goals. If you need full traceability and detailed diagnostics, SDK-based tracing is a strong option. If your priority is low overhead and broad system visibility, eBPF-based metrics might be the better fit.&lt;/p&gt;

&lt;p&gt;Observability isn’t free, but with the right approach, it can be both effective and efficient.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Coroot – eBPF-based, open source observability with actionable insights</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 09 Apr 2025 16:29:14 +0000</pubDate>
      <link>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</link>
      <guid>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</guid>
      <description>&lt;p&gt;A common open source approach to observability will begin with databases and visualizations for telemetry - Grafana, Prometheus, Jaeger. But observability doesn’t begin and end here: these tools require configuration, dashboard customization, and may not actually pinpoint the data you need to mitigate system risks.&lt;/p&gt;

&lt;p&gt;Coroot was designed to solve the problem of manual, time-consuming observability analysis: it handles the full observability journey, from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can benefit from, which is why our software is open source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Features:&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost monitoring to track and minimise your cloud expenses (AWS, GCP, Azure).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLO tracking with alerts to detect anomalies and compare them to your system’s baseline behaviour.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1-click application profiling: see the exact line of code that caused an anomaly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mapped timeframes (stop digging through Grafana to find when the incident occurred).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eBPF automatically gathers logs, metrics, traces, and profiles for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A service map to grasp a complete at-a-glance picture of your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic discovery and monitoring of every application deployment in your Kubernetes cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can view Coroot’s documentation &lt;a href="https://docs.coroot.com/installation/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, visit our &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Github&lt;/a&gt;, and join our &lt;a href="https://join.slack.com/t/coroot-community/shared_invite/zt-2te9x672s-4s_Wp732cd~o2vdFLNE5AA" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; to become part of our community. We welcome any feedback and hope the tool can improve your workflow!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
