<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Coroot</title>
    <description>The latest articles on DEV Community by Coroot (@coroot).</description>
    <link>https://dev.to/coroot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3034639%2Fa266f65f-6e11-46f8-b8e1-62b2834906de.png</url>
      <title>DEV Community: Coroot</title>
      <link>https://dev.to/coroot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coroot"/>
    <language>en</language>
    <item>
      <title>Zero-config Golang Heap Profiling</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Thu, 30 Apr 2026 21:28:59 +0000</pubDate>
      <link>https://dev.to/coroot/zero-config-golang-heap-profiling-33fi</link>
      <guid>https://dev.to/coroot/zero-config-golang-heap-profiling-33fi</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; is an Apache 2.0 open source platform that simplifies observability with no-code configuration. The &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;Coroot node-agent&lt;/a&gt; already collects CPU profiles for any process on the node &lt;a href="https://coroot.com/blog/troubleshooting-java-applications-with-coroot" rel="noopener noreferrer"&gt;using eBPF&lt;/a&gt;, with zero integration from the application side. For Java, we &lt;a href="https://coroot.com/blog/java-profiling-with-async-profiler" rel="noopener noreferrer"&gt;dynamically inject&lt;/a&gt; async-profiler into the JVM to get memory and lock profiles. But Go processes were still a blind spot for non-CPU profiling unless the app exposed a pprof endpoint and the cluster-agent scraped it.&lt;/p&gt;

&lt;p&gt;We wanted the same zero-config experience for Go heap profiles. This post is about how we got there.&lt;/p&gt;

&lt;h1&gt;
  
  
  The runtime already profiles
&lt;/h1&gt;

&lt;p&gt;Go's runtime has a built-in memory profiler. On every allocation, the runtime samples with probability &lt;code&gt;size / MemProfileRate&lt;/code&gt; and records the call stack. The default rate is &lt;code&gt;512 * 1024&lt;/code&gt;, or about 1 sample per 512KB allocated. Samples are aggregated into a linked list of "buckets", where each bucket represents a unique (stack trace, size class) combination and accumulates four counters: total allocations, total frees, bytes allocated, bytes freed.&lt;/p&gt;

&lt;p&gt;This is what &lt;code&gt;runtime.MemProfile()&lt;/code&gt; returns and what &lt;code&gt;go tool pprof http://.../debug/pprof/heap&lt;/code&gt; renders. The overhead is negligible and it's been production-grade since forever.&lt;/p&gt;

&lt;p&gt;There's one catch. The Go linker has an optimization: if no code in the binary references &lt;code&gt;runtime.MemProfile&lt;/code&gt;, it sets an internal &lt;code&gt;disableMemoryProfiling&lt;/code&gt; flag, and the runtime sets &lt;code&gt;MemProfileRate = 0&lt;/code&gt; on init. No samples, no buckets, nothing to read. A binary that doesn't import &lt;code&gt;runtime/pprof&lt;/code&gt; or &lt;code&gt;net/http/pprof&lt;/code&gt; (directly or transitively) has no heap profile available, even though the runtime fully supports it. We'll come back to this.&lt;/p&gt;

&lt;p&gt;That linked list of buckets is what &lt;code&gt;runtime.MemProfile()&lt;/code&gt; walks when &lt;code&gt;pprof&lt;/code&gt; asks for a heap profile. It's literally the global variable &lt;code&gt;runtime.mbuckets&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// runtime/mprof.go&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mbuckets&lt;/span&gt; &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnsafePointer&lt;/span&gt; &lt;span class="c"&gt;// *bucket, memory profile buckets&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the data is already there, being collected continuously, for free. The only question is how to read it from outside the process.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reading process memory from outside
&lt;/h1&gt;

&lt;p&gt;Linux exposes every process's virtual address space via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt;. With the right permissions (our node-agent already has &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt;), you can &lt;code&gt;pread()&lt;/code&gt; arbitrary addresses. It's read-only, it doesn't suspend the process, the target doesn't even know you're there.&lt;/p&gt;

&lt;p&gt;The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Find the virtual address of &lt;code&gt;runtime.mbuckets&lt;/code&gt; in the Go binary's symbol table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read the pointer value at that address from &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Walk the linked list, reading each bucket's header, stack PCs, and memRecord.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert to pprof format and upload.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Finding runtime.mbuckets without loading the symbol table
&lt;/h1&gt;

&lt;p&gt;The first gotcha: Go binaries embed their own symbol table (pclntab) for runtime use, but &lt;code&gt;runtime.mbuckets&lt;/code&gt; is not a function. It's a variable, which lives in the ELF &lt;code&gt;.symtab&lt;/code&gt; section. On a stripped binary (&lt;code&gt;go build -ldflags="-s"&lt;/code&gt;), there's no &lt;code&gt;.symtab&lt;/code&gt; and we can't find the symbol. We skip those.&lt;/p&gt;

&lt;p&gt;On an unstripped binary, &lt;code&gt;.symtab&lt;/code&gt; can be huge. For &lt;code&gt;k3s&lt;/code&gt;, it's ~11MB. Using &lt;code&gt;debug/elf.File.Symbols()&lt;/code&gt; loads all of it into memory at once. For a node-agent that profiles dozens of Go processes, that's not OK.&lt;/p&gt;

&lt;p&gt;So we wrote a streaming scan that reads one &lt;code&gt;Elf64_Sym&lt;/code&gt; entry at a time and reads only the bytes we need from the string table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;findSymbolValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ef&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;elf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sectionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Section&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sectionName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;strtab&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Link&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;symReader&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// Elf64_Sym&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nameBuf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;symReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;nameIdx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ByteOrder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ByteOrder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strtab&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameIdx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nameBuf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Peak memory: a 24-byte buffer plus a 17-byte buffer (&lt;code&gt;len("runtime.mbuckets")+1&lt;/code&gt;), regardless of binary size.&lt;/p&gt;

&lt;p&gt;Before doing this expensive scan, we also check whether the binary is Go at all via the &lt;code&gt;.go.buildinfo&lt;/code&gt; section: one section header lookup, zero allocations.&lt;/p&gt;

&lt;h1&gt;
  
  
  The bucket layout, and two traps
&lt;/h1&gt;

&lt;p&gt;The bucket struct itself is just a 48-byte header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;       &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotInHeap&lt;/span&gt;
    &lt;span class="n"&gt;next&lt;/span&gt;    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;
    &lt;span class="n"&gt;allnext&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;
    &lt;span class="n"&gt;typ&lt;/span&gt;     &lt;span class="n"&gt;bucketType&lt;/span&gt;
    &lt;span class="n"&gt;hash&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
    &lt;span class="n"&gt;nstk&lt;/span&gt;    &lt;span class="kt"&gt;uintptr&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the runtime allocates extra space after it and stores two more things in the same contiguous region: the stack trace (&lt;code&gt;nstk&lt;/code&gt; program counter addresses, 8 bytes each) and a &lt;code&gt;memRecord&lt;/code&gt; struct holding the alloc/free counters.&lt;/p&gt;

&lt;p&gt;So from our point of view, each bucket is a variable-sized blob: a 48-byte header + &lt;code&gt;nstk*8&lt;/code&gt; bytes of PCs + 128 bytes of &lt;code&gt;memRecord&lt;/code&gt;. We read the header first to get &lt;code&gt;nstk&lt;/code&gt;, then the rest.&lt;/p&gt;

&lt;p&gt;Two traps we fell into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 1:&lt;/strong&gt; the first field, &lt;code&gt;_ sys.NotInHeap&lt;/code&gt;, looks like 8 bytes of padding. It's zero bytes. Sizing the header at 56 bytes gave us nicely parsed garbage: valid-looking pointers that turned out to be hash values, and typ values in the quintillions. Go 1.17 through 1.19 used a &lt;code&gt;//go:notinheap&lt;/code&gt; comment directive instead; Go 1.20 switched to the &lt;a href="https://github.com/golang/go/commit/a719a78c1b36141af68d84970695fe95263fb896" rel="noopener noreferrer"&gt;typed marker&lt;/a&gt;, but the binary layout didn't change. The real header is 48 bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trap 2:&lt;/strong&gt; there are two pointer fields, &lt;code&gt;next&lt;/code&gt; and &lt;code&gt;allnext&lt;/code&gt;. They are not the same list: &lt;code&gt;next&lt;/code&gt; is the hash table chain within a size class, while &lt;code&gt;allnext&lt;/code&gt; is the global list of all memProfile buckets. We want &lt;code&gt;allnext&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The delta problem
&lt;/h1&gt;

&lt;p&gt;The counters in &lt;code&gt;memRecord&lt;/code&gt; are cumulative: they grow monotonically over the lifetime of the process. If we want an allocation rate, we need to compute the delta between two collection cycles.&lt;/p&gt;

&lt;p&gt;We keep a map per PID of &lt;code&gt;bucket address -&amp;gt; previous counters&lt;/code&gt; and subtract on each cycle to get the delta. We key by bucket address rather than stack hash: the Go runtime never frees mprof buckets, so the address is a stable unique identifier, and it's a single &lt;code&gt;uint64&lt;/code&gt; instead of a variable-length string, which avoids a huge amount of allocation churn in the hot path.&lt;/p&gt;

&lt;h1&gt;
  
  
  Too many syscalls
&lt;/h1&gt;

&lt;p&gt;Early profiles showed our collector spending 30-40% of its CPU in &lt;code&gt;syscall.Pread&lt;/code&gt;. Each bucket needs at least 2 reads: one for the header (to get &lt;code&gt;nstk&lt;/code&gt;), then one for the variable-length &lt;code&gt;stk[nstk] | memRecord&lt;/code&gt; block. With 1000+ buckets per process and a dozen Go processes on a node, that's thousands of syscalls per minute.&lt;/p&gt;

&lt;p&gt;We tried a read-ahead cache: on a miss, pull 256KB centered around the requested address. The idea was that Go's &lt;code&gt;persistentalloc&lt;/code&gt; places buckets in large arenas, so consecutive buckets in the &lt;code&gt;allnext&lt;/code&gt; chain might be physically close.&lt;/p&gt;

&lt;p&gt;We instrumented jump distances between consecutive buckets for one process with 1230 buckets: 40% of jumps were &amp;gt;16MB. Buckets are scattered across the entire process address space, not clustered.&lt;/p&gt;

&lt;p&gt;A 256KB cache hits ~20% of the time: better than nothing, but about the best we can do without multi-MB buffers that would cost more than they save.&lt;/p&gt;

&lt;h1&gt;
  
  
  The linker-disabled profiling problem
&lt;/h1&gt;

&lt;p&gt;After deploying, we saw some Go processes return an empty bucket list (the &lt;code&gt;runtime.mbuckets&lt;/code&gt; pointer was &lt;code&gt;0x0&lt;/code&gt;) even though they were clearly allocating memory (tens of MB RSS, actively running).&lt;/p&gt;

&lt;p&gt;Turns out the Go linker has an optimization: if no code in the binary references &lt;code&gt;runtime.MemProfile&lt;/code&gt;, it sets a &lt;code&gt;disableMemoryProfiling&lt;/code&gt; flag, and the runtime sets &lt;code&gt;MemProfileRate = 0&lt;/code&gt; on init. No &lt;code&gt;profilealloc()&lt;/code&gt; calls, no buckets ever created.&lt;/p&gt;

&lt;p&gt;This hits any Go binary that doesn't import &lt;code&gt;runtime/pprof&lt;/code&gt; or &lt;code&gt;net/http/pprof&lt;/code&gt;, directly or transitively. In our case it was a small load generator: no pprof, no HTTP server, no dependencies that would drag pprof in. The profile endpoint the runtime would serve is dead code, so the linker dropped it.&lt;/p&gt;

&lt;p&gt;The fix: we can write to &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt; too. If we detect &lt;code&gt;MemProfileRate == 0&lt;/code&gt;, we write 524288 (the default) back to the &lt;code&gt;runtime.MemProfileRate&lt;/code&gt; address. The runtime checks this variable on every allocation, so the change takes effect immediately: no restart, no signal, nothing. Just a single atomic 8-byte write to a known address in the data segment.&lt;/p&gt;

&lt;p&gt;This is gated behind a &lt;code&gt;--go-heap-profiler=force&lt;/code&gt; flag for users who want the "always on" behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;disabled&lt;/span&gt;  &lt;span class="c"&gt;# off
&lt;/span&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;enabled&lt;/span&gt;   &lt;span class="c"&gt;# default, passive only
&lt;/span&gt;--&lt;span class="n"&gt;go&lt;/span&gt;-&lt;span class="n"&gt;heap&lt;/span&gt;-&lt;span class="n"&gt;profiler&lt;/span&gt;=&lt;span class="n"&gt;force&lt;/span&gt;     &lt;span class="c"&gt;# write MemProfileRate if zero
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overhead of re-enabling profiling is whatever the Go default overhead is: ~1 sample per 512KB. For any workload where this matters, you'd want it on anyway.&lt;/p&gt;

&lt;h1&gt;
  
  
  Allocation rate metrics
&lt;/h1&gt;

&lt;p&gt;Since we already compute per-bucket alloc deltas, exposing total allocation rate as Prometheus counters is free:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;container_go_alloc_bytes_total&lt;/span&gt;    &lt;span class="c"&gt;# total bytes allocated&lt;/span&gt;
&lt;span class="n"&gt;container_go_alloc_objects_total&lt;/span&gt;  &lt;span class="c"&gt;# total objects allocated&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Summed across all buckets in the process. Coroot uses them to draw the allocation rate chart alongside the flamegraph.&lt;/p&gt;

&lt;h1&gt;
  
  
  Limitations
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stripped binaries are skipped.&lt;/strong&gt; No &lt;code&gt;.symtab&lt;/code&gt;, no &lt;code&gt;runtime.mbuckets&lt;/code&gt; address, nothing we can do externally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;active&lt;/code&gt; cycle updates only on GC.&lt;/strong&gt; Between GCs, new allocations go into &lt;code&gt;future[0..2]&lt;/code&gt; and we don't see them. Same limitation as &lt;code&gt;runtime.MemProfile()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go-internal struct layout.&lt;/strong&gt; If the bucket struct changes in a future Go release, we'll need to update. The layout has been stable since Go 1.17, but there's no API guarantee.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Goroutine, block, and mutex profiles are not yet exposed.&lt;/strong&gt; Block and mutex use the same infrastructure (&lt;code&gt;bbuckets, xbuckets&lt;/code&gt;), but both are disabled by default and have real overhead if enabled (checks on every mutex/channel op), so we're not force-enabling them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  In Coroot
&lt;/h1&gt;

&lt;p&gt;Profiles are already in the Coroot UI. Every memory chart has a link to the heap flamegraph for that service, so you can jump from "memory is climbing" to "here's the call stack eating it" in one click.&lt;/p&gt;

&lt;p&gt;What's new is that profiles are now plugged into RCA. If Coroot sees a service's CPU or memory go up at the same time as an issue, it pulls up the profile and compares two windows: the one during the issue, and a healthy one from just before. The flamegraph you see in the RCA is a diff, not a snapshot. Functions that got hotter pop out, the rest fade away.&lt;/p&gt;

&lt;p&gt;So now RCA can give you a different kind of answer. Instead of "p95 is up, allocations are up", you get "this function is allocating twice as much as it was before the deploy." The metric tells you something is off. The diff tells you which code is off.&lt;/p&gt;

&lt;h1&gt;
  
  
  Chaos experiments
&lt;/h1&gt;

&lt;p&gt;To see this in action, we set up a small demo and broke it on purpose. There's a &lt;code&gt;product-catalog&lt;/code&gt; service backed by Postgres, sitting behind an &lt;code&gt;api-gateway&lt;/code&gt;. We bolted a chaos middleware onto &lt;code&gt;product-catalog&lt;/code&gt; so we can flip on different kinds of bad behavior with a single API call, then we watched what showed up in Coroot.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC pressure
&lt;/h1&gt;

&lt;p&gt;For the first experiment, we flipped on the &lt;code&gt;gc_pressure&lt;/code&gt; switch. That sends every request through a function called &lt;code&gt;inefficientEnrichProducts&lt;/code&gt;, which is exactly as bad as the name suggests. For each of 30 fake products in the request, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Marshals and unmarshals the product 10 times in a row.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Builds a "search index" by lowercasing, uppercasing and title-casing every word and generating every 2 to 4 character n-grams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Builds 20 nested "related products" maps, each with three sub-maps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marshals and unmarshals the whole result one more time "for caching".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's about 2 MB of throwaway memory per request. The service still answers, but the garbage collector barely gets a break.&lt;/p&gt;

&lt;p&gt;The pain shows up one hop away. &lt;code&gt;api-gateway&lt;/code&gt; talks to &lt;code&gt;product-catalog&lt;/code&gt; on every page render, so as soon as the switch flips, its p95 latency jumps from 0.16s to 3.76s:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpqca44lsscosvrg642.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpqca44lsscosvrg642.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coroot's RCA traces the spike back to product-catalog and pulls up its CPU profile:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq09huibv95z56812b4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq09huibv95z56812b4e.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the right side of the flamegraph. There's a fat column of &lt;code&gt;runtime.gcBgMarkWorker&lt;/code&gt;, &lt;code&gt;runtime.systemstack&lt;/code&gt;, &lt;code&gt;runtime.scanobject&lt;/code&gt;, &lt;code&gt;runtime.gcDrain&lt;/code&gt;. The garbage collector is burning real CPU. That's a clear sign the runtime is under allocation pressure, but the CPU profile can't tell you which line of your code is responsible for it.&lt;/p&gt;

&lt;p&gt;The heap profile can:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxzwv0fpluxcu23b4d4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxzwv0fpluxcu23b4d4z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There it is. &lt;code&gt;main.inefficientEnrichProducts&lt;/code&gt; sits at the top of &lt;code&gt;alloc_space&lt;/code&gt;, with the JSON encoders, map growth, and &lt;code&gt;bytes.Buffer&lt;/code&gt; operations stacked underneath. That's the exact set of things the function does inside its loop. Same function the CPU profile already flagged, but now you can see directly that it's the one driving the GC.&lt;/p&gt;

&lt;p&gt;Without the heap profile, you'd see the GC running hot and the JSON encoder eating CPU, and you'd still have to guess which call site to fix. With it, the guess is gone. Cache the marshalled output, drop the redundant rounds, or both, and the alloc band and the GC band shrink together on the next collection.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory leak
&lt;/h1&gt;

&lt;p&gt;For the second experiment, we flipped the &lt;code&gt;memory_leak&lt;/code&gt; switch. Now every request calls &lt;code&gt;appendToProductCache&lt;/code&gt;, which builds a small chunk of pointer-heavy data (a product map, a search index of fifty terms, cross-references to recent entries) and appends it to a global slice. Nothing ever evicts. The cache grows about 200 KB per request, forever.&lt;/p&gt;

&lt;p&gt;The symptom is the obvious one. &lt;code&gt;product-catalog&lt;/code&gt; memory just keeps climbing. After a few minutes, both replicas are growing at over 640% per hour and on track to OOM-kill themselves.&lt;/p&gt;

&lt;p&gt;What's interesting is what RCA does next. It pulls up the heap profile for the anomaly window and compares it against a healthy window from before the leak started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qqd9ovjc68t5xzc6ovl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qqd9ovjc68t5xzc6ovl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diff narrows it down to a single function. &lt;code&gt;main.appendToProductCache&lt;/code&gt; accounts for 99.6% of the in-use memory that wasn't there before, and the full call path from the HTTP entrypoint down to it sits right above the flamegraph. There's almost nothing left to investigate.&lt;/p&gt;

&lt;p&gt;A plain heap snapshot would have shown &lt;code&gt;appendToProductCache&lt;/code&gt; near the top too, but mixed in with everything else the service legitimately allocates. The diff drops the noise and keeps only what changed, which is exactly what you want when you're chasing a leak that started somewhere in the last hour.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Heap profiles for your Go services no longer require pprof endpoints, scraping configuration, or a deploy. Coroot picks them up automatically from whatever is already running on your nodes, with no code changes, no annotations, and no restart.&lt;/p&gt;

&lt;p&gt;The payoff shows up in incidents. A memory leak comes down to one function in a diff'd flamegraph. GC pressure stops being a vague "the runtime is busy" and becomes a specific call site. And you get this code-level accuracy without needing access to the code itself, which matters for SRE and platform teams running services they didn't write. Because the profiles sit right next to the metrics and the RCA that surfaced the issue, you go from "something is wrong" to "here is what to fix" without ever leaving the page.&lt;/p&gt;

&lt;p&gt;Want to try zero-config Go heap profiling on your setup? It's completely open source: visit our &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; to get set up quickly.&lt;/p&gt;

</description>
      <category>go</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Profiling Java apps: breaking things to prove it works</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 08 Apr 2026 21:51:04 +0000</pubDate>
      <link>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</link>
      <guid>https://dev.to/coroot/profiling-java-apps-breaking-things-to-prove-it-works-14da</guid>
      <description>&lt;p&gt;Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what.&lt;/p&gt;

&lt;p&gt;We wanted memory allocation and lock contention profiling. So we decided to add &lt;a href="https://github.com/async-profiler/async-profiler" rel="noopener noreferrer"&gt;async-profiler&lt;/a&gt; support to &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;coroot-node-agent&lt;/a&gt;. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why async-profiler
&lt;/h1&gt;

&lt;p&gt;We went with async-profiler. It's a native JVMTI agent used by pretty much everyone in the Java profiling space (Pyroscope, IntelliJ, Datadog). It can be loaded into a running JVM dynamically, supports CPU, allocation, and lock contention profiling in a single session, and works in unprivileged containers with no JVM flags. It outputs JFR format, which we parse using Grafana's &lt;a href="https://github.com/grafana/jfr-parser" rel="noopener noreferrer"&gt;jfr-parser&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  How we integrated it
&lt;/h1&gt;

&lt;p&gt;The integration follows the same pattern as our &lt;a href="https://coroot.com/blog/java-tls-instrumentation-with-ebpf" rel="noopener noreferrer"&gt;Java TLS agent&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;The node agent detects Java processes by checking whether the binary name ends with &lt;code&gt;java&lt;/code&gt;, then confirms it's a HotSpot JVM by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt;. It deploys &lt;code&gt;libasyncProfiler.so&lt;/code&gt; (~600KB) into the container's filesystem at &lt;code&gt;/tmp/coroot/&lt;/code&gt; and loads the library into the JVM via the Attach API. async-profiler starts with &lt;code&gt;event=itimer,interval=10ms,alloc,lock,jfr&lt;/code&gt;, capturing CPU, allocation, and lock events in a single session.&lt;/p&gt;
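&lt;p&gt;The detection step can be sketched in a few lines of Go. This is a simplification with a hypothetical helper name, not the agent's actual code:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// isHotSpotJVM reports whether a process looks like a HotSpot JVM:
// the executable name ends in "java" and its memory mappings
// (the contents of /proc/<pid>/maps) include libjvm.so.
func isHotSpotJVM(exePath, procMaps string) bool {
	if !strings.HasSuffix(exePath, "java") {
		return false
	}
	return strings.Contains(procMaps, "libjvm.so")
}

func main() {
	maps := "7f1c2000-7f1c3000 r-xp 00000000 08:01 42 /usr/lib/jvm/java-21/lib/server/libjvm.so"
	fmt.Println(isHotSpotJVM("/usr/bin/java", maps))
}
```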

&lt;p&gt;For data collection, every 60 seconds the agent sends a stop command (async-profiler finalizes the JFR file), reads the file from the host via &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root/&lt;/code&gt;, and immediately sends a start command to begin a new recording.&lt;/p&gt;

&lt;p&gt;The gap between stop and start is ~4ms. We considered using &lt;code&gt;dump&lt;/code&gt; (which doesn't stop the profiler), but JFR output requires proper chunk finalization; a dump writes incomplete metadata that parsers reject. The stop/start approach guarantees valid output every time.&lt;/p&gt;
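&lt;p&gt;The stop/read/start cycle can be expressed as a small function. The callbacks here stand in for the actual attach-protocol commands and the &lt;code&gt;/proc&lt;/code&gt; read; this is an illustrative sketch, not the agent's real collection loop:&lt;/p&gt;

```go
package main

import "fmt"

// collectCycle runs one stop/read/start cycle: stop finalizes the
// current JFR chunk, read pulls the finished file, and start begins
// a new recording immediately, keeping the unprofiled gap small.
func collectCycle(stop, start func() error, read func() ([]byte, error)) ([]byte, error) {
	if err := stop(); err != nil {
		return nil, err
	}
	data, err := read()
	if err != nil {
		return nil, err
	}
	// Restart right away; only the few ms between stop and start
	// go unprofiled.
	return data, start()
}

func main() {
	data, err := collectCycle(
		func() error { return nil },                       // stop command
		func() error { return nil },                       // start command
		func() ([]byte, error) { return []byte("JFR"), nil }, // read finalized file
	)
	fmt.Println(string(data), err)
}
```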

&lt;p&gt;Each command goes through the JVM Attach protocol. It's one command per connection; HotSpot closes the socket after each response. After the first attach (which triggers the attach listener via SIGQUIT), subsequent connections just hit the existing Unix socket. Total overhead: ~2ms per command.&lt;/p&gt;
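&lt;p&gt;On the wire, a HotSpot attach request is just the protocol version, the command name, and three arguments, each null-terminated. A sketch of building that payload (a simplified illustration of the wire format, not Coroot's code):&lt;/p&gt;

```go
package main

import "fmt"

// attachRequest builds a HotSpot Attach API request: protocol
// version "1", the command, and exactly three arguments, each
// terminated by a null byte. Missing arguments are sent empty.
func attachRequest(cmd string, args ...string) []byte {
	var buf []byte
	write := func(s string) {
		buf = append(buf, s...)
		buf = append(buf, 0)
	}
	write("1") // protocol version
	write(cmd)
	for i := 0; i < 3; i++ { // HotSpot always expects three args
		if i < len(args) {
			write(args[i])
		} else {
			write("")
		}
	}
	return buf
}

func main() {
	// "load" dynamically loads a native agent library into the JVM.
	req := attachRequest("load", "/tmp/coroot/libasyncProfiler.so", "true", "start,event=itimer")
	fmt.Printf("%q\n", req)
}
```

The response is read back from the same Unix socket before HotSpot closes it.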

&lt;p&gt;If another tool (Pyroscope Java agent, Datadog, etc.) already loaded async-profiler into the JVM, we detect it by scanning &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/maps&lt;/code&gt; and skip that process to avoid conflicts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enabling it
&lt;/h1&gt;

&lt;p&gt;Set the &lt;code&gt;ENABLE_JAVA_ASYNC_PROFILER=true&lt;/code&gt; environment variable on the node agent. In the Coroot custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_ASYNC_PROFILER&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No JVM flags, no application restarts, no agent JARs. The node agent handles everything automatically for all HotSpot JVMs it discovers. If you haven't enabled it yet, the JVM report shows a hint with a link to the docs.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;Once enabled, Coroot adds new charts to the JVM report: allocation rate (bytes/s and objects/s) and lock contention (contentions/s and delay). Each chart has a profile button that opens the corresponding flamegraph, so you can go from "allocation rate spiked" to "this function is allocating all the memory" in one click.&lt;/p&gt;

&lt;p&gt;We also export Prometheus metrics from the profiling data. These are monotonic counters accumulated from the parsed profiles, so &lt;code&gt;rate()&lt;/code&gt; gives you allocation rate and contention rate over time. We initially tried getting allocation metrics from &lt;code&gt;hsperfdata&lt;/code&gt; (&lt;code&gt;sun.gc.tlab.alloc&lt;/code&gt;), but those are per-GC-cycle snapshots that reset every collection. The async-profiler data is the real thing.&lt;/p&gt;
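&lt;p&gt;The key property is that the exported counter only ever goes up, unlike the per-GC-cycle snapshots. A minimal sketch of the accumulation (hypothetical type, not the agent's exporter):&lt;/p&gt;

```go
package main

import "fmt"

// AllocCounter accumulates allocated bytes from successive 60-second
// profiling sessions into a monotonic counter, so PromQL rate() over
// the exported series yields bytes/s. It never resets, unlike
// hsperfdata's per-GC-cycle values.
type AllocCounter struct{ total uint64 }

func (c *AllocCounter) Observe(sessionBytes uint64) {
	c.total += sessionBytes
}

func (c *AllocCounter) Value() uint64 { return c.total }

func main() {
	var c AllocCounter
	for _, b := range []uint64{1 << 30, 2 << 30, 1 << 29} { // bytes per session
		c.Observe(b)
	}
	fmt.Println(c.Value()) // cumulative bytes; rate() computes delta/time
}
```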

&lt;h1&gt;
  
  
  Seeing it in action
&lt;/h1&gt;

&lt;p&gt;Enough theory. Let's break something and see how the profiling data helps us find the root cause.&lt;/p&gt;

&lt;p&gt;We have a demo environment with several microservices. The one we'll focus on is order-service, a Spring Boot app running on JDK 21, backed by MySQL. It handles order creation, listing, and payment processing. Normal latency is under 10ms.&lt;br&gt;
The demo has a built-in chaos controller that lets us inject failures via a REST API. We'll use two scenarios: lock contention and memory allocation pressure.&lt;/p&gt;
&lt;h1&gt;
  
  
  Lock contention
&lt;/h1&gt;

&lt;p&gt;For this scenario, the chaos controller spawns background threads that repeatedly acquire a shared lock and hold it for 5ms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startLockContention&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;availableProcessors&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InterruptedException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;interrupt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-lock-holder-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, every incoming request also tries to acquire the same lock in the request interceptor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chaosConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isLockContention&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChaosController&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CONTENTION_LOCK&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Request thread blocks here while holder threads occupy the lock&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling this scenario, we can immediately see the impact on the order-service SLIs. The latency heatmap shows a clear shift: requests that used to complete in under 10ms are now taking 100ms+, with some exceeding a second:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu47w26cliy0d9j2brund.png" alt=" " width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The request rate chart confirms the degradation: you can see the latency distribution shifting from green (fast) to red (slow):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpjf9wxgucrbi1qleah3.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at the JVM report. The lock contention chart shows a clear spike: the lock wait time jumps from near zero to significant values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8coord8qsamof65wqho0.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's click the profile button on the lock contention chart to open the flamegraph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe7t50u0wplkjbryejtl.png" alt=" " width="800" height="917"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Lock (delay) profile in comparison mode. Red means "more time waiting for locks than before." Reading from top to bottom, we can see the Spring request processing chain, and at the bottom of the flamegraph, our &lt;code&gt;ChaosInterceptor.preHandle&lt;/code&gt; method, the one that tries to acquire the shared lock. That's the bottleneck.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd know is "requests are slow." With the lock profile, we can point at the exact monitor and the exact code paths waiting for it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Memory allocation pressure
&lt;/h1&gt;

&lt;p&gt;The demo also supports a GC pressure scenario. It starts a background thread that continuously allocates and discards 256 MB byte arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;startGcPressure&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;megabytes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Thread&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isInterrupted&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;garbage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;megabytes&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
            &lt;span class="n"&gt;garbage&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// prevent dead-code elimination&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;},&lt;/span&gt; &lt;span class="s"&gt;"chaos-gc-pressure"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDaemon&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JVM is configured with &lt;code&gt;-Xmx512m&lt;/code&gt;, so allocating 256 MB chunks means the GC runs after almost every allocation.&lt;/p&gt;

&lt;p&gt;After enabling this scenario, the JVM report tells the story. The allocation rate chart spikes from near zero to ~3 GB/s. GC time jumps in lockstep: the young collection pauses go from occasional to constant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5e128uro59vpwhaur6l.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's click the profile button on the allocation rate chart to see what is allocating all this memory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivr103q3f1jugfa4yu.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flamegraph shows the Java Memory (alloc_space) profile in comparison mode. At the bottom we can see &lt;code&gt;ChaosController$$Lambda.run&lt;/code&gt; and &lt;code&gt;startGcPressure&lt;/code&gt; marked as +100%: they didn't exist in the baseline period. The top-level &lt;code&gt;Thread.run&lt;/code&gt; frames confirm this is a background thread, not request processing.&lt;/p&gt;

&lt;p&gt;Without profiling, all we'd see is GC time going up. With the allocation profile, we know exactly which code is responsible.&lt;/p&gt;

&lt;p&gt;Enable it with a single environment variable and you get flamegraphs, time-series metrics, and a direct link between "something changed" and "here's the code." It's open source; you can install it on your system from our &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>observability</category>
    </item>
    <item>
      <title>Making encrypted Java traffic observable with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:28:21 +0000</pubDate>
      <link>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</link>
      <guid>https://dev.to/coroot/making-encrypted-java-traffic-observable-with-ebpf-384k</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot's&lt;/a&gt; open source node agent uses eBPF to capture network traffic at the kernel level. It hooks into syscalls like &lt;code&gt;read&lt;/code&gt; and &lt;code&gt;write&lt;/code&gt;, reads the first bytes of each payload, and detects the protocol: HTTP, MySQL, PostgreSQL, Redis, Kafka, and others. This works for any language and any framework without touching application code.&lt;/p&gt;

&lt;p&gt;For encrypted traffic, we attach eBPF uprobes to TLS library functions like &lt;code&gt;SSL_write&lt;/code&gt; and &lt;code&gt;SSL_read&lt;/code&gt; in OpenSSL, &lt;code&gt;crypto/tls&lt;/code&gt; in Go, and &lt;code&gt;rustls&lt;/code&gt; in Rust. The uprobes fire before encryption or after decryption, so we see the plaintext.&lt;/p&gt;

&lt;p&gt;Java is different. And it has been a blind spot until now.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Java is special
&lt;/h1&gt;

&lt;p&gt;Java's TLS implementation (JSSE) is not a native shared library. It's Java code that runs inside the JVM. There are no exported symbols like &lt;code&gt;SSL_write&lt;/code&gt; that eBPF could attach to.&lt;/p&gt;

&lt;p&gt;So when a Java app connects to MySQL or PostgreSQL over TLS, or makes HTTPS calls, eBPF tools cannot see the plaintext. All they see at the syscall level is encrypted data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our approach
&lt;/h1&gt;

&lt;p&gt;We solved this by combining a lightweight Java agent with a tiny native library that serves as an eBPF uprobe target.&lt;/p&gt;

&lt;p&gt;We dynamically load the agent into running JVMs using the attach API (the same mechanism profilers and debuggers use). The agent hooks &lt;code&gt;SSLSocketImpl$AppOutputStream.write&lt;/code&gt; and &lt;code&gt;SSLSocketImpl$AppInputStream.read&lt;/code&gt;, the internal JSSE classes where plaintext enters and leaves the TLS layer.&lt;/p&gt;

&lt;p&gt;When the application does an SSL write, our hook copies the first 1KB of plaintext into a thread-local native buffer and calls a stub native function. We copy to native memory because the pointer is stored and read later when the underlying &lt;code&gt;write()&lt;/code&gt; syscall fires. &lt;/p&gt;

&lt;p&gt;By that time our JNI call has already returned, and Java's GC could have moved the original byte array. We considered using &lt;code&gt;GetPrimitiveArrayCritical&lt;/code&gt; to pin the array in place and avoid the copy, but it blocks all garbage collectors while held, which is worse for the application than a small memcpy.&lt;/p&gt;

&lt;p&gt;For reads, we do the same after JSSE decrypts the data.&lt;/p&gt;

&lt;p&gt;The native stub function does nothing at runtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;coroot_java_tls_write_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;asm&lt;/span&gt; &lt;span class="k"&gt;volatile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;:::&lt;/span&gt; &lt;span class="s"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;asm volatile&lt;/code&gt; barrier prevents the compiler from optimizing it away. We attach eBPF uprobes to this function, so every call is captured with the buffer pointer and the payload size. From there, the data goes into our existing protocol detection pipeline, and HTTP, MySQL, PostgreSQL, Redis, Kafka, and other protocols are parsed automatically.&lt;/p&gt;

&lt;p&gt;The file descriptor (which connection the data belongs to) is discovered without any Java reflection. When JSSE writes, the sequence is always: encrypt, then &lt;code&gt;write(fd, ciphertext)&lt;/code&gt; syscall. Our eBPF code stores the plaintext pointer when the stub is called, then the syscall that follows on the same thread provides the file descriptor. This is the same &lt;code&gt;ssl_pending&lt;/code&gt; mechanism we use for OpenSSL.&lt;/p&gt;
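&lt;p&gt;The correlation works because the uprobe and the syscall fire on the same thread, in order. A userspace Go sketch of that pairing logic (the type and method names are hypothetical; the real logic lives in eBPF maps keyed by thread ID):&lt;/p&gt;

```go
package main

import "fmt"

// correlator pairs plaintext captured by the stub uprobe with the
// file descriptor of the write() syscall that follows on the same
// thread, mirroring the per-TID eBPF map described above.
type correlator struct {
	pending map[int][]byte // thread ID -> plaintext awaiting its fd
}

func newCorrelator() *correlator {
	return &correlator{pending: map[int][]byte{}}
}

// onStubCall fires when the JVM calls the native stub with plaintext.
func (c *correlator) onStubCall(tid int, plaintext []byte) {
	c.pending[tid] = plaintext
}

// onWriteSyscall fires on the subsequent write(fd, ciphertext) and
// returns the plaintext now attributed to that connection's fd.
func (c *correlator) onWriteSyscall(tid, fd int) (int, []byte, bool) {
	data, ok := c.pending[tid]
	if ok {
		delete(c.pending, tid) // consume: one stub call per syscall
	}
	return fd, data, ok
}

func main() {
	c := newCorrelator()
	c.onStubCall(42, []byte("GET /orders HTTP/1.1"))
	fd, data, ok := c.onWriteSyscall(42, 7)
	fmt.Println(fd, string(data), ok)
}
```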

&lt;p&gt;The native library is compiled with &lt;code&gt;-nostdlib&lt;/code&gt;, so it has no dependencies and works in any container.&lt;/p&gt;

&lt;p&gt;The nice thing about this design is that there is no transport between the JVM and the node agent. No unix sockets, no shared memory, no protocols to maintain. The Java agent just calls a native function, and eBPF picks up the data through uprobes and existing syscall tracepoints.&lt;/p&gt;

&lt;h1&gt;
  
  
  Safety
&lt;/h1&gt;

&lt;p&gt;Our agent modifies the bytecode of two JVM internal classes to insert our hooks. That sounds scary, but all we add is a single method call before each SSL write and after each SSL read. The original code stays exactly the same. Every inserted call is wrapped in a try-catch that catches &lt;code&gt;Throwable&lt;/code&gt;, so if our code fails for any reason, the error is silently ignored and the original SSL operation runs as if we were never there.&lt;/p&gt;

&lt;p&gt;We use ASM for the bytecode transformation. ByteBuddy would make the code shorter, but the agent JAR would grow from 130KB to over 8MB. Since we deploy the JAR into every container with a running JVM, keeping it small matters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Benchmark
&lt;/h1&gt;

&lt;p&gt;We compared three scenarios on the same workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No instrumentation (baseline)&lt;/li&gt;
&lt;li&gt;eBPF with our Java TLS agent&lt;/li&gt;
&lt;li&gt;OpenTelemetry Java agent with traces exported to a collector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We included the OpenTelemetry comparison because it is the most common alternative for Java observability without code changes. The OTel agent auto-instruments HTTP clients, JDBC, and other libraries by rewriting bytecode at class load time.&lt;/p&gt;

&lt;p&gt;The test uses two machines to avoid resource contention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine 1&lt;/strong&gt; (8 vCPU): Java HTTP proxy making HTTPS calls + coroot-node-agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine 2&lt;/strong&gt; (8 vCPU): Go HTTPS server (5ms delay, ~1KB response) + wrk2 load generator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0u7e2li7oh1syqf0ynl.png" alt=" " width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each scenario ran for 15 minutes at 1,000 requests per second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2b2omhry9p6smz5xo8w.png" alt=" " width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The baseline (no instrumentation) uses about 370m CPU cores. With our eBPF agent, CPU increases to about 426m, a 15% increase. The eBPF agent delivers the same throughput as the baseline.&lt;/p&gt;

&lt;p&gt;With the OpenTelemetry Java agent, CPU goes up to 511m, a 38% increase, and the application could only sustain about 800 of the 1,000 target requests per second, a 20% throughput drop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F935rxpcba8dxftrh5vsu.png" alt=" " width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Limitations
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;JVM compatibility.&lt;/strong&gt; We support HotSpot-based JVMs: OpenJDK, Oracle JDK, Amazon Corretto, Azul Zulu, Eclipse Temurin. OpenJ9 and GraalVM native images are detected and skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSLSocket only.&lt;/strong&gt; We instrument SSLSocket (blocking I/O), which covers JDBC drivers, HttpsURLConnection, and most traditional Java HTTP clients. SSLEngine (used by Netty and async HTTP clients) is not yet supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic agent loading.&lt;/strong&gt; On Java 21+ the JVM prints a warning about dynamic agent loading being deprecated. JVMs with &lt;code&gt;-XX:+DisableAttachMechanism&lt;/code&gt; or &lt;code&gt;-XX:-EnableDynamicAgentLoading&lt;/code&gt; are detected and skipped.&lt;/p&gt;

&lt;h1&gt;
  
  
  Disabled by default
&lt;/h1&gt;

&lt;p&gt;This feature must be explicitly enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;coroot-node-agent &lt;span class="nt"&gt;--enable-java-tls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;ENABLE_JAVA_TLS&lt;/span&gt;=&lt;span class="n"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the Coroot Operator on Kubernetes, add it to the Coroot CR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Coroot&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coroot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAgent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENABLE_JAVA_TLS&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading an agent into a running JVM without the user asking for it is not something we want to do by default. The agent is safe, but we think this should be the user's choice.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you get
&lt;/h1&gt;

&lt;p&gt;With this feature enabled, Coroot automatically detects and parses protocols inside encrypted Java connections: HTTP, MySQL, PostgreSQL, Redis, Kafka, and everything else we support. No code changes, no SDKs, no sidecars. Enable the flag and encrypted Java traffic becomes visible. If you'd like to try it and improve observability in your own system, check out the project on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Instrumenting Rust TLS with eBPF</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 18 Mar 2026 19:07:58 +0000</pubDate>
      <link>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</link>
      <guid>https://dev.to/coroot/instrumenting-rust-tls-with-ebpf-57cf</guid>
      <description>&lt;p&gt;eBPF collects telemetry directly from applications and infrastructure. One of the things it does is capture L7 traffic from TLS connections without any code changes, by hooking into TLS libraries and syscalls.&lt;/p&gt;

&lt;p&gt;Works great for OpenSSL. Works for Go.&lt;/p&gt;

&lt;p&gt;Then rustls enters the picture and everything stops being obvious. With OpenSSL, everything is nicely wrapped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSL_write(ssl, plaintext)
└─ write(fd, encrypted)

SSL_read(ssl, plaintext)
└─ read(fd, encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From eBPF’s point of view this is perfect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hook SSL_write, stash plaintext&lt;/li&gt;
&lt;li&gt;write() fires immediately → same thread → you know the FD&lt;/li&gt;
&lt;li&gt;same idea for reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything happens inside one call. Correlation is trivial.&lt;/p&gt;

&lt;h1&gt;
  
  
  Rustls does things differently
&lt;/h1&gt;

&lt;p&gt;Rustls doesn’t own the socket and never calls read or write itself. It works on buffers, and the application (or runtime) is responsible for actually moving bytes over the network.&lt;/p&gt;

&lt;p&gt;The API reflects that separation pretty clearly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// application writes plaintext into rustls&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls produces encrypted bytes and writes them via io::Write&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.write_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// application reads encrypted bytes and feeds them into rustls&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// rustls decrypts and updates internal state&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="nf"&gt;.process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// application reads decrypted data&lt;/span&gt;
&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="nf"&gt;.read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So instead of one call doing everything, you get a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;plaintext is buffered first&lt;/li&gt;
&lt;li&gt;encryption happens later&lt;/li&gt;
&lt;li&gt;syscalls happen outside of rustls&lt;/li&gt;
&lt;li&gt;decryption happens before the app reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference for eBPF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writes: syscall happens after plaintext&lt;/li&gt;
&lt;li&gt;reads: syscall happens before plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the OpenSSL-style correlation only works in one direction.&lt;/p&gt;

&lt;h1&gt;
  
  
  Writes work as usual
&lt;/h1&gt;

&lt;p&gt;On the write side, nothing fundamentally new is needed. You hook &lt;code&gt;Writer::write&lt;/code&gt;, stash the plaintext, and correlate it with the following &lt;code&gt;sendto&lt;/code&gt;. The ordering is preserved, so the same approach as OpenSSL still applies here.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reads are inverted
&lt;/h1&gt;

&lt;p&gt;The read path is where things really break.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;recvfrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encrypted_buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;   &lt;span class="c1"&gt;// happens first&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_tls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;process_new_packets&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext_buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;         &lt;span class="c1"&gt;// plaintext appears here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time we see plaintext, the syscall is already gone.&lt;/p&gt;

&lt;p&gt;So the logic has to be reversed. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see plaintext → wait for syscall”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;we do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“see syscall → remember it → use it later”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on &lt;em&gt;recvfrom&lt;/em&gt; → stash FD per thread&lt;/li&gt;
&lt;li&gt;on &lt;em&gt;reader.read&lt;/em&gt; → pick up that FD and attach it to plaintext&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s basically reverse correlation. Not pretty, but it matches how rustls works.&lt;/p&gt;
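&lt;p&gt;As a sketch, the read-side correlation can be modeled in plain Go (a simplified userspace model of the per-thread stash the eBPF probes maintain; the function and map names are illustrative, not Coroot’s actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// perThreadFD models the eBPF map: the recvfrom exit probe stashes the FD
// per thread, and the Reader::read exit probe consumes it.
var perThreadFD = map[int]int{}

// onRecvfrom runs at the recvfrom() exit probe: remember which FD this
// thread just read encrypted bytes from.
func onRecvfrom(tid, fd int) {
	perThreadFD[tid] = fd
}

// onReaderRead runs at the Reader::read exit probe: the plaintext is now
// visible, so attach it to the FD stashed earlier on the same thread.
func onReaderRead(tid int, plaintext []byte) (int, bool) {
	fd, ok := perThreadFD[tid]
	if ok {
		delete(perThreadFD, tid) // consume the stash
		fmt.Printf("fd=%d plaintext=%q\n", fd, plaintext)
	}
	return fd, ok
}

func main() {
	onRecvfrom(42, 7)                // the syscall happens first
	onReaderRead(42, []byte("GET ")) // the plaintext appears later
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The real implementation lives in BPF maps keyed by thread ID, but the shape of the logic is the same: remember first, attribute later.&lt;/p&gt;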

&lt;h1&gt;
  
  
  When “ret=1” doesn’t mean 1 byte
&lt;/h1&gt;

&lt;p&gt;This one took longer than expected. We reused the OpenSSL-style exit probe:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ret = PT_REGS_RC(ctx)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The probe fired, but results were weird:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ret=1
ret=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which made no sense for a read. It turns out Rust returns a &lt;code&gt;Result&lt;/code&gt; like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rax → success or error flag&lt;/li&gt;
&lt;li&gt;rdx → actual number of bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we were reading rax and treating it as a size. Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ret=1 → actually an error&lt;/li&gt;
&lt;li&gt;ret=0 → success, but the size is somewhere else&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix was straightforward once understood:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PT_REGS_RC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// success&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// actual byte count&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classic case of “everything works, but the numbers are garbage”.&lt;/p&gt;

&lt;h1&gt;
  
  
  Finding rustls in binaries
&lt;/h1&gt;

&lt;p&gt;Rust symbols are heavily mangled:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It looks messy, but it encodes the full path: &lt;code&gt;rustls::conn::Writer&lt;/code&gt; implementing &lt;code&gt;std::io::Write::write&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The tricky part is that mangling isn’t stable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;different compiler versions use different schemes (legacy vs v0)&lt;/li&gt;
&lt;li&gt;optimizations and stripping can change what’s left in the binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So matching exact names is fragile.&lt;/p&gt;

&lt;p&gt;Instead, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check the ELF &lt;code&gt;.comment&lt;/code&gt; section for &lt;code&gt;rustc&lt;/code&gt; to detect that the binary was built with Rust&lt;/li&gt;
&lt;li&gt;then scan symbols for patterns like “rustls”, “Writer” + “write”, “Reader” + “read”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not perfect, but reliable enough in practice.&lt;/p&gt;
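&lt;p&gt;Under those assumptions, the detection heuristic looks roughly like this in Go (a sketch using the standard library’s &lt;code&gt;debug/elf&lt;/code&gt;; the helper names are illustrative, not the node-agent’s actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"debug/elf"
	"fmt"
	"strings"
)

// looksLikeRustlsHook matches mangled Rust symbols against the patterns we
// care about. Exact names vary across compiler versions and mangling
// schemes, so we match substrings instead of full symbol names.
func looksLikeRustlsHook(sym string) bool {
	if !strings.Contains(sym, "rustls") {
		return false
	}
	if strings.Contains(sym, "Writer") {
		return strings.Contains(sym, "write")
	}
	if strings.Contains(sym, "Reader") {
		return strings.Contains(sym, "read")
	}
	return false
}

// builtWithRust checks the ELF .comment section for a "rustc" marker.
func builtWithRust(path string) (bool, error) {
	f, err := elf.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	sec := f.Section(".comment")
	if sec == nil {
		return false, nil
	}
	data, err := sec.Data()
	if err != nil {
		return false, err
	}
	return strings.Contains(string(data), "rustc"), nil
}

func main() {
	sym := "_ZN55_$LT$rustls..conn..Writer$u20$as$u20$std..io..Write$GT$5write17h0ee1e61402b1a37cE"
	fmt.Println(looksLikeRustlsHook(sym)) // true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;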

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Coroot is an open source observability tool that uses eBPF to simplify setup. Because we instrument rustls at the library level, not the frameworks, this works across most Rust clients that use rustls under the hood.&lt;/p&gt;

&lt;p&gt;That includes HTTP stacks like hyper when paired with rustls (hyper-rustls, and frameworks like axum or warp when configured with rustls), database clients like sqlx when using its rustls TLS feature, and any async Rust service using tokio-rustls.&lt;/p&gt;

&lt;p&gt;No code changes, no SDKs, no wrappers.&lt;/p&gt;

&lt;p&gt;For Rust apps using OpenSSL via native-tls or openssl, the existing OpenSSL instrumentation already works. rustls was the missing piece.&lt;/p&gt;

&lt;p&gt;Below is an example of a service talking to MySQL over TLS. Coroot shows the actual queries even though everything on the wire is encrypted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsfb1kv57k5whu8iiv73.png" alt=" " width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’d like to give our open source tool a try and simplify your own observability, you can check it out on &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. You can also find this guide and other open source observability articles on &lt;a href="https://coroot.com/blog/instrumenting-rust-tls-with-ebpf/" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>opensource</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to make GPUs on Kubernetes Observable</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:15:52 +0000</pubDate>
      <link>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</link>
      <guid>https://dev.to/coroot/making-gpus-on-kubernetes-observable-d2d</guid>
      <description>&lt;p&gt;GPUs are everywhere powering LLM inference, model training, video processing, and more. Kubernetes is often where these workloads run. But using GPUs in Kubernetes isn’t as simple as using CPUs.&lt;/p&gt;

&lt;p&gt;You need the right setup. You need efficient scheduling. And most importantly you need visibility.&lt;/p&gt;

&lt;p&gt;This post walks through how to run GPU workloads on Kubernetes, how to virtualize them efficiently, and how to use open source tooling to monitor everything with zero instrumentation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running GPU Workloads on Kubernetes
&lt;/h1&gt;

&lt;p&gt;Running GPU workloads on Kubernetes is totally doable. But it takes a bit of setup.&lt;/p&gt;

&lt;p&gt;It starts with your nodes. Whether you’re running in the cloud or on bare metal, your cluster needs machines with physical GPUs. Most cloud providers support GPU-enabled node pools, and provisioning them is usually straightforward.&lt;/p&gt;

&lt;p&gt;Once the hardware is in place, the next step is software. For Kubernetes to schedule and run GPU workloads, it needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA GPU drivers, installed on each node&lt;/li&gt;
&lt;li&gt;The NVIDIA container runtime, so containers can access the GPU&lt;/li&gt;
&lt;li&gt;The NVIDIA device plugin, so Kubernetes knows how to handle GPU resource requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can install all of this manually. But it’s fiddly and error-prone. That’s where the NVIDIA GPU Operator comes in. It automates the whole setup: installing drivers, configuring the runtime, and deploying the device plugin. Once that’s done, your cluster is GPU-ready.&lt;/p&gt;

&lt;p&gt;After that, requesting a GPU is simple. Just add this to your pod spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  limits:
    nvidia.com/gpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes will handle the rest: scheduling your pod onto a node with an available GPU and assigning it to the container.&lt;/p&gt;

&lt;p&gt;Of course, not every workload needs an entire GPU to itself. And that’s where GPU virtualization becomes really useful.&lt;/p&gt;

&lt;h1&gt;
  
  
  Virtualizing GPUs in Kubernetes
&lt;/h1&gt;

&lt;p&gt;By default, Kubernetes treats GPUs as exclusive resources. One pod per device. But in many real-world cases, that’s overkill. The GPU Operator supports two forms of GPU virtualization that let you safely share a GPU between workloads:&lt;/p&gt;

&lt;p&gt;Time-Slicing: Multiple containers take turns using the GPU in rapid bursts. It’s a great fit for bursty inference workloads, batch jobs, or anything that doesn’t require ultra-low latency.&lt;/p&gt;

&lt;p&gt;MIG (Multi-Instance GPU): Available on GPUs like the A100 and H100, MIG lets you partition a single physical GPU into several hardware-isolated instances. Each one behaves like its own dedicated GPU, with its own memory, cache, and compute cores.&lt;/p&gt;
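&lt;p&gt;As a rough example of what time-slicing configuration looks like, the device plugin reads a config like the one below (a sketch based on NVIDIA’s GPU Operator documentation; names like &lt;code&gt;time-slicing-config&lt;/code&gt; and the replica count are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a config like this referenced from the GPU Operator, a single device can back several pods that take turns on it.&lt;/p&gt;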

&lt;p&gt;Virtualization makes GPUs way more flexible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You stop wasting an entire GPU on a tiny batch job&lt;/li&gt;
&lt;li&gt;You get much better overall utilization&lt;/li&gt;
&lt;li&gt;You can safely share GPUs across apps without them stepping on each other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you finally have a shot at balancing cost and performance. It’s a game-changer, but only if you can actually see what’s going on.&lt;/p&gt;

&lt;h1&gt;
  
  
  What observability looks like once GPUs are in play
&lt;/h1&gt;

&lt;p&gt;So, the cluster is set up, the workloads are running, and maybe you’ve even started virtualizing GPUs to get better efficiency. Now comes the tricky part – actually understanding what’s happening.&lt;/p&gt;

&lt;p&gt;From the infrastructure side, we want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many GPU-enabled nodes do we have right now?&lt;/li&gt;
&lt;li&gt;Which GPUs are actually doing work, and which are just burning budget?&lt;/li&gt;
&lt;li&gt;What’s the current GPU and memory utilization across the fleet?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And sure, if you’re in the cloud, temperature and power draw might feel like someone else’s problem. But it’s still good to know. Somewhere out there, your model is warming the planet one token at a time. Mother Nature says hi. 🌱&lt;/p&gt;

&lt;p&gt;From the application side, the questions change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which GPUs is this app actually using?&lt;/li&gt;
&lt;li&gt;How much compute and memory is it consuming?&lt;/li&gt;
&lt;li&gt;Is it sharing the GPU with something else?&lt;/li&gt;
&lt;li&gt;And if so who’s the noisy neighbor hogging all the resources?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just about curiosity. It’s about avoiding slowdowns, catching inefficiencies, and making smart scaling decisions. But here’s the catch: Kubernetes doesn’t tell you any of this.&lt;/p&gt;

&lt;h1&gt;
  
  
  Out-of-the-box GPU observability
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; is an open source tool that uses eBPF to make any GPU-powered system observable with zero-configuration. It talks directly to the GPU using NVIDIA’s NVML library the same one behind nvidia-smi. This way you can see what’s happening on your GPUs with no guesswork.&lt;/p&gt;

&lt;p&gt;On startup, the agent looks for &lt;code&gt;libnvidia-ml.so&lt;/code&gt; in all the usual (and unusual) places, whether it’s installed by the GPU Operator, a package manager, or manually dropped in. If it finds the library, it loads it and starts gathering data.&lt;/p&gt;

&lt;p&gt;From there, it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers all available GPUs on the node&lt;/li&gt;
&lt;li&gt;Collects real-time metrics: utilization, memory usage, temperature, power draw&lt;/li&gt;
&lt;li&gt;Tracks per-process usage using NVML’s process telemetry&lt;/li&gt;
&lt;li&gt;Maps each process back to its container and pod, using Coroot’s existing PID-to-container tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of “PID 12345 is using GPU 0,” you get “this container in this pod is using 78% of GPU-xxxx.”&lt;/p&gt;
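&lt;p&gt;That PID-to-container step boils down to reading the process’s cgroup path. Below is a minimal Go sketch (the regular expression and path layout reflect the common kubepods cgroup naming and are illustrative, not Coroot’s actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"fmt"
	"regexp"
)

// kubepodsRe pulls the pod UID and container ID out of a cgroup path in
// the common systemd-driver kubepods layout.
var kubepodsRe = regexp.MustCompile(
	`kubepods.*?pod([0-9a-f_-]{36}).*?([0-9a-f]{64})`)

// podFromCgroup maps a line from the process's /proc/PID/cgroup file to a
// pod UID and container ID, turning "PID 12345 is using GPU 0" into
// "this container in this pod is using GPU 0".
func podFromCgroup(cgroupPath string) (podUID, containerID string, ok bool) {
	m := kubepodsRe.FindStringSubmatch(cgroupPath)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// In a real agent this line comes from /proc/PID/cgroup.
	path := "0::/kubepods.slice/kubepods-burstable.slice/" +
		"kubepods-burstable-pod11111111_2222_3333_4444_555555555555.slice/" +
		"cri-containerd-" +
		"aaaaaaaaaaaaaaaa" + "aaaaaaaaaaaaaaaa" +
		"aaaaaaaaaaaaaaaa" + "aaaaaaaaaaaaaaaa" + ".scope"
	pod, ctr, ok := podFromCgroup(path)
	fmt.Println(pod, ctr, ok)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;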

&lt;p&gt;When it comes to virtualized GPUs, Coroot sees which containers are tied to which GPU UUIDs, even when multiple workloads are time-sharing or using MIG slices on the same physical device. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can see which apps are sharing the same GPU&lt;/li&gt;
&lt;li&gt;Understand how each one is using it&lt;/li&gt;
&lt;li&gt;Spot noisy neighbors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is automatic. Just install the agent and let Coroot do the rest.&lt;/p&gt;

&lt;p&gt;Once the Coroot agent discovers the GPUs and starts collecting data, all of it flows straight into the UI, ready to explore, with no dashboards to build or metrics to stitch together.&lt;/p&gt;

&lt;p&gt;Let’s walk through how this looks in practice.&lt;/p&gt;

&lt;h1&gt;
  
  
  Node-level GPU overview
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44kp5c7jcapx7ymn5v6x.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the node view, Coroot shows everything you’d want to know about the GPUs attached to a specific machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization over time&lt;/li&gt;
&lt;li&gt;GPU memory usage&lt;/li&gt;
&lt;li&gt;Top consumers of both compute and memory&lt;/li&gt;
&lt;li&gt;Temperature and power draw&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just “Hey, GPU usage went up.” You can see which containers are contributing to that load and whether the same GPU is being shared between apps.&lt;/p&gt;

&lt;h1&gt;
  
  
  App-level GPU breakdown
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70g1tur4arwhgeznr2o5.png" alt=" " width="800" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where things get real. If your app is slow, you can check GPU usage alongside CPU, memory, logs, traces, and everything else in one place. At the app level, Coroot shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much GPU time your app is using&lt;/li&gt;
&lt;li&gt;How much memory it’s burning&lt;/li&gt;
&lt;li&gt;Which containers are sharing the GPU, and how much each one is using&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;GPU workloads aren’t cheap. And they’re rarely simple. Whether you’re running a single inference service or managing dozens of shared GPUs, you need real visibility. &lt;/p&gt;

&lt;p&gt;Observability is the same: not cheap, rarely simple. But with open source tools, managing your infrastructure can be accessible to everyone, not just massive companies but also small businesses, startups, and homelab projects across the globe. &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Give it a try&lt;/a&gt; on your system, and share feedback to help make good observability available to everyone.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using AI for Troubleshooting: OpenAI vs DeepSeek</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 07 Jan 2026 18:52:22 +0000</pubDate>
      <link>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</link>
      <guid>https://dev.to/coroot/using-ai-for-troubleshooting-openai-vs-deepseek-343k</guid>
      <description>&lt;p&gt;AI is now a go-to tool for everything from writing to coding. Modern LLMs are so powerful that, with the right prompt and a few adjustments, they can handle tasks almost effortlessly.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt; (an open source project that simplifies observability) we’ve built AI (LLMs + ML) for root cause analysis. Pinpointing the exact root cause is tricky because there are so many possible failure scenarios. Many of our users aren’t experts in areas like databases, network performance, or specific runtimes, so we need to provide clear explanations and, ideally, guidance on how to fix the problem.&lt;/p&gt;

&lt;p&gt;In this post, I test the ability of different LLMs to diagnose a chaos experiment. Since laziness often drives progress, I skipped converting my telemetry data into a prompt and instead just uploaded a screenshot as input for the LLMs.&lt;/p&gt;

&lt;p&gt;The result shocked me.&lt;/p&gt;

&lt;h1&gt;
  
  
  OpenAI GPT-4o
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Here is my prompt; no other context was provided:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1k8yst6a2x1ogzk37kb.png" alt=" " width="768" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the provided PostgreSQL monitoring dashboard, it appears that an &lt;code&gt;ALTER TABLE&lt;/code&gt; statement is causing performance issues, leading to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increased Query Latency:&lt;/strong&gt; The Postgres average query latency has spiked, especially for pg-cluster-2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O Load from ALTER TABLE:&lt;/strong&gt; The 'ALTER TABLE' statement (alter table articles alter column body set not null) is consuming significant I/O, as seen in the Queries by I/O time graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Locked Connections and Queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Postgres connections graph shows a sharp rise in locked connections.&lt;/li&gt;
&lt;li&gt;The Locked queries graph confirms that queries are blocked due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation.&lt;/li&gt;
&lt;li&gt;The Blocking queries graph indicates that the &lt;code&gt;ALTER TABLE&lt;/code&gt; query is holding a lock and blocking multiple other queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication Lag:&lt;/strong&gt; The Replication lag shows a spike, likely due to the &lt;code&gt;ALTER TABLE&lt;/code&gt; operation causing a backlog of writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this Happening?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt; is a table-wide operation that requires a full table scan to verify that no existing row violates the new constraint. It locks the entire table and prevents writes/reads from proceeding until the operation completes. This is especially problematic in a high-load environment where frequent reads/writes occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Fix This?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the &lt;code&gt;ALTER TABLE&lt;/code&gt; Statement (if possible): If the system is in production and the lock is causing downtime, consider canceling the query:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%ALTER TABLE articles alter column body set not null%';&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart Blocked Queries:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the &lt;code&gt;ALTER TABLE&lt;/code&gt; is canceled, the system should return to normal operations. However, some queries might still be waiting, so check for any lingering locks:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since ALTER TABLE is expensive, consider these best practices:&lt;/p&gt;

&lt;p&gt;1. Convert to an Online-Friendly Approach&lt;/p&gt;

&lt;p&gt;Instead of &lt;code&gt;ALTER TABLE … SET NOT NULL&lt;/code&gt;, use a safer approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a check constraint first (non-blocking):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Backfill missing values (if any).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate the constraint (still non-blocking):&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If no violations exist, alter the column:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;2. Use &lt;code&gt;pg_repack&lt;/code&gt; for Zero-Downtime Migrations&lt;/p&gt;

&lt;p&gt;pg_repack allows reorganizing tables without full locks:&lt;br&gt;
&lt;code&gt;pg_repack -t articles --dbname=mydatabase&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;3. Perform Schema Changes During Low-Traffic Periods&lt;/p&gt;

&lt;p&gt;If running in production, execute schema changes during maintenance windows to reduce impact.&lt;/p&gt;

&lt;p&gt;4. Partition Large Tables&lt;/p&gt;

&lt;p&gt;For frequently altered large tables, consider partitioning to isolate operations on smaller segments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ALTER TABLE operation is locking the articles table, causing significant query delays and replication lag. The best approach is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cancel the operation if causing downtime.&lt;/li&gt;
&lt;li&gt;Use an online-safe alternative (check constraint + validation).&lt;/li&gt;
&lt;li&gt;Schedule schema changes carefully.&lt;/li&gt;
&lt;li&gt;Use tools like pg_repack for minimal impact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;That’s super impressive and surprisingly accurate. What shocked me the most was the model’s ability to interpret the screenshot as input.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It’s exciting that an LLM can ‘understand’ our domain and generate deep summaries with actionable insights. However, relying on a single vendor didn’t seem like a robust solution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fortunately, DeepSeek has emerged as a strong alternative to OpenAI’s model, especially with the recent release of its R1 model. So, let’s repeat the experiment using DeepSeek-R1:&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  DeepSeek-R1
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;The answer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data indicates a significant performance issue in pg-cluster-2, primarily caused by a long-running &lt;code&gt;ALTER TABLE&lt;/code&gt; command on the articles table. This operation is causing table locks, blocking other critical queries (e.g., &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt;), leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High query latency (10-15 seconds).&lt;/li&gt;
&lt;li&gt;A surge in active/locked connections.&lt;/li&gt;
&lt;li&gt;Query throughput drops.&lt;/li&gt;
&lt;li&gt;Replication lag in other clusters (minor).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ALTER TABLE&lt;/code&gt; statement (alter table articles alter column body set not ?) is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Holding an exclusive lock on the articles table.&lt;/li&gt;
&lt;li&gt;Blocking subsequent queries (e.g., &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Creating a chain of waiting queries, increasing latency and connection usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Immediate Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate the ALTER TABLE command (if safe):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;-- Find the PID of the ALTER TABLE query  &lt;br&gt;
   SELECT pid, query FROM pg_stat_activity WHERE query LIKE '%alter column body%';  &lt;br&gt;
   -- Terminate it  &lt;br&gt;
   SELECT pg_terminate_backend(&amp;lt;PID&amp;gt;);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Caution: Ensure this won’t corrupt data (e.g., if the ALTER is part of a migration).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reschedule the DDL Operation:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run schema changes during low-traffic periods or use online DDL tools like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pg_repack&lt;/code&gt; (to rebuild tables without locks).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; (for index-related changes).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize the &lt;code&gt;ALTER TABLE&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the column is being set to NOT NULL, ensure it already contains no NULL values. Pre-run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE articles SET body = '' WHERE body IS NULL;  -- Example cleanup&lt;/code&gt;&lt;/p&gt;
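&lt;p&gt;A quick sketch of that pre-check and cleanup (assuming the &lt;code&gt;articles&lt;/code&gt; table has an &lt;code&gt;id&lt;/code&gt; primary key, as in the queries quoted above):&lt;/p&gt;

```sql
-- Count the rows that would make SET NOT NULL fail (and force a long scan)
SELECT count(*) FROM articles WHERE body IS NULL;

-- Backfill in small batches instead of one long-running UPDATE,
-- so each statement holds its row locks only briefly
UPDATE articles SET body = ''
WHERE id IN (SELECT id FROM articles WHERE body IS NULL LIMIT 1000);
```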

&lt;p&gt;&lt;strong&gt;Long-Term Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Lock Timeouts:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add a lock timeout to prevent indefinite blocking:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SET LOCAL lock_timeout = '5s';  &lt;br&gt;
   ALTER TABLE ...;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor and Tune Queries: Optimize the &lt;code&gt;SELECT id FROM articles WHERE created &amp;lt; ? …&lt;/code&gt; query with an index on created. Batch the &lt;code&gt;INSERT&lt;/code&gt; operations to reduce contention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connection Pooling: Use a connection pooler (e.g., PgBouncer) to manage active connections and prevent exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replication Checks: Investigate replication lag in &lt;code&gt;pg-cluster-1&lt;/code&gt; and &lt;code&gt;pg-cluster-3&lt;/code&gt; to ensure failover readiness.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
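&lt;p&gt;For the indexing suggestion above, a hedged sketch (the index name is hypothetical; note that &lt;code&gt;CONCURRENTLY&lt;/code&gt; cannot run inside a transaction block):&lt;/p&gt;

```sql
-- Speed up "SELECT id FROM articles WHERE created < ? ..." without blocking writes
CREATE INDEX CONCURRENTLY IF NOT EXISTS articles_created_idx
    ON articles (created);
```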

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The blocking &lt;code&gt;ALTER TABLE&lt;/code&gt; is the critical issue. Address it first by terminating or rescheduling the operation, then implement safeguards to avoid similar incidents. Optimize queries and use non-locking DDL tools for future schema changes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DeepSeek-R1 delivers an answer that is quite close to OpenAI’s. I’d say it is accurate enough to be a viable alternative to OpenAI’s models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Additionally, I tested other open-source models in my lab, but I was only able to run LLMs with up to 70 billion parameters. In this range, I didn’t find anything that came close to OpenAI or the larger DeepSeek models for our needs.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is AIOps finally here?
&lt;/h2&gt;

&lt;p&gt;Yes, but only if you have enough data describing your system’s behaviour. &lt;/p&gt;

&lt;p&gt;This wouldn’t have been possible with just CPU, memory, and disk usage alone. The real magic of AI in this case was its ability to understand the bigger picture, apply domain knowledge, and suggest how to fix the issue. &lt;/p&gt;

&lt;p&gt;This is where Coroot's AI-Powered Root Cause Analysis can provide a more complete picture, using eBPF + Machine Learning to catch context missing from LLMs alone. As part of our commitment to make observability simpler and accessible for &lt;em&gt;everyone&lt;/em&gt;, you can set it up for free with our open source version. &lt;a href="https://docs.coroot.com/ai/coroot-cloud" rel="noopener noreferrer"&gt;Details here.&lt;/a&gt; We hope it can help save you hours of digging through telemetry and make root cause analysis easier for your team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>deepseek</category>
      <category>devops</category>
    </item>
    <item>
      <title>Chaos testing a Postgres cluster managed by CloudNativePG</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Tue, 16 Dec 2025 16:57:20 +0000</pubDate>
      <link>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</link>
      <guid>https://dev.to/coroot/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg-f9d</guid>
      <description>&lt;p&gt;As more organizations move their databases to cloud-native environments, effectively managing and monitoring these systems becomes crucial. According to Coroot’s anonymous usage statistics, 64% of projects use PostgreSQL, making it the most popular RDBMS among our users, compared to 14% using MySQL. This is not surprising since it is also the most widely used open-source database worldwide. &lt;/p&gt;

&lt;p&gt;Kubernetes is more than a platform for running containerized applications. It also enables better management of databases by allowing automation of tasks like backups, high availability, and scaling through its operator framework. This provides a management experience similar to using a managed service like AWS RDS but without vendor lock-in and often at a lower cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cloudnative-pg/cloudnative-pg" rel="noopener noreferrer"&gt;CloudNativePg&lt;/a&gt; is an open-source operator originally created by EDB, the oldest and the biggest Postgres vendor world-wide. As other operators, CNPG helps manage PostgreSQL databases on Kubernetes, covering the entire operational lifecycle from initial deployment to ongoing maintenance. Worth to mention that this is the youngest Postgres operator on the market, but its open source traction grows rapidly and based on my observations it’s the favorite operator across Reddit users.&lt;/p&gt;

&lt;p&gt;In this post I’ll install a CNPG cluster in my lab, instrument it with Coroot Community (&lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;open source&lt;/a&gt;), then generate some load and introduce some failures to ensure high availability and observability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the cluster
&lt;/h1&gt;

&lt;p&gt;Installing the CloudNativePG operator is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add cnpg https://cloudnative-pg.github.io/charts
helm upgrade --install cnpg cnpg/cloudnative-pg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy a cluster, create a Kubernetes custom resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 30Gi   
  postgresql:
    shared_preload_libraries: [pg_stat_statements]
    parameters:
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
  managed:
    roles:
    - name: coroot
      ensure: present
      login: true
      connectionLimit: 2
      inRoles:
      - pg_monitor
      passwordSecret:
        name: pg-cluster
---
apiVersion: v1
data:
  username: ******==
  password: *********==
kind: Secret
metadata:
  name: pg-cluster
type: kubernetes.io/basic-auth

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;In this post, I’ll be using the open source Community Edition of Coroot. Here are the commands to install the Coroot Operator for Kubernetes along with all Coroot components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add coroot https://coroot.github.io/helm-charts
helm repo update coroot
helm install -n coroot --create-namespace coroot-operator coroot/coroot-operator
helm install -n coroot coroot coroot/coroot-ce

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access Coroot, I’m forwarding the Coroot UI port to my local machine. For production deployments the operator can create an Ingress.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward -n coroot service/coroot-coroot 8083:8080&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In the UI, we can see two applications: the operator (cnpg-cloudnative-pg) and our Postgres cluster (pg-cluster). Coroot has also identified that pg-cluster is a Postgres database and suggests integrating Postgres monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14isgtg2l6ov5kios5j2.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes approach to monitoring databases typically involves running metric exporters as sidecar containers within database instance Pods. However, this method can be challenging for certain use cases. For example, CNPG doesn’t support running custom sidecar containers, and their &lt;a href="https://github.com/cloudnative-pg/cnpg-i" rel="noopener noreferrer"&gt;CNPG-i&lt;/a&gt; capability requires specific plugin support and is still in the experimental stage.&lt;/p&gt;

&lt;p&gt;To address these limitations, Coroot has a dedicated coroot-cluster-agent that can discover and gather metrics from databases without requiring a separate container for each database instance. To configure this integration, simply use the credentials of the database role already created for Coroot. Click on “Postgres” in the Coroot UI and then on the “Configure” button.&lt;/p&gt;

&lt;p&gt;Next, provide the credentials configured for Coroot in the cluster specification. Coroot’s cluster-agent will then collect Postgres metrics from each instance in the cluster. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe11fmaz7j9p64jk35kz8.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It feels a bit dull without any load or issues. Let’s add an application that interacts with this database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5przlggyxd72sal2bzaw.png" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I deployed a simple application called “app” that executes approximately 600 queries per second: 300 on the primary and 300 across both replicas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eu9vyyyeaae5uprgbdl.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I believe that any observability solution must be tested on failures to ensure that if some problem occurs, we will be able to quickly identify the root cause. So, let’s introduce some failures.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #1: CPU noisy neighbor
&lt;/h1&gt;

&lt;p&gt;In shared infrastructures like Kubernetes clusters, applications often compete for resources. Let’s simulate a scenario with a noisy neighbor, where a CPU-intensive application runs on the same node as our database instance. The following Job will create a Pod with stress-ng on node100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: cpu-stress
spec:
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      nodeSelector:
        kubernetes.io/hostname: node100
      containers:
        - name: stress-ng
          image: debian:bullseye-slim
          command:
            - "/bin/sh"
            - "-c"
            - |
              apt-get update &amp;amp;&amp;amp; 
              apt-get install -y stress-ng &amp;amp;&amp;amp; 
              stress-ng --cpu 0 --timeout 300s
      restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w1dw6arwlkc6vq5fypd.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, our “noisy neighbor” has affected Postgres performance. Now, let’s assume we don’t know the root cause and use Coroot to identify the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d3kkh3y51y3z2h7n8g9.png" alt=" " width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the CPU Delay chart, we can observe that pg-cluster-2 is experiencing a CPU time shortage. Why? Because node100 is overloaded. And why is that? The cpu-stress application has consumed all available CPU time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #2: Postgres Locks
&lt;/h1&gt;

&lt;p&gt;Now, let’s explore a Postgres-specific failure scenario. We’ll run a suboptimal schema migration on our articles table, which contains 10 million rows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE articles ALTER COLUMN body SET NOT NULL;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtzxqq0dhpbksidjp506.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those who aren’t deeply familiar with databases, this migration will lock the entire table to verify that all rows are not NULL. Since the table is relatively large, the migration can take some time to complete. During this period, queries from our app will be forced to wait until the lock is released.&lt;/p&gt;
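&lt;p&gt;For reference, on PostgreSQL 12+ the same migration can avoid holding an exclusive lock during the full-table scan; a sketch (the constraint name is arbitrary):&lt;/p&gt;

```sql
-- 1. Add the constraint without validating existing rows (brief lock only)
ALTER TABLE articles ADD CONSTRAINT articles_body_not_null
    CHECK (body IS NOT NULL) NOT VALID;

-- 2. Validate existing rows; this scans the table but doesn't block reads/writes
ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null;

-- 3. SET NOT NULL now uses the validated constraint and skips the scan
ALTER TABLE articles ALTER COLUMN body SET NOT NULL;
ALTER TABLE articles DROP CONSTRAINT articles_body_not_null;
```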

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oud1iqh4a6bxk5iqlb.png" alt=" " width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s interpret these charts together: The Postgres latency of pg-cluster-2 has significantly increased. Many SELECT and INSERT queries are locked by another query. Which one? The ALTER TABLE query. Why is this query taking so long to execute? Because it is performing I/O operations to verify that the body column in each row is not NULL.&lt;/p&gt;

&lt;p&gt;As you can see, having the right metrics was crucial in this scenario. For instance, simply knowing the number of Postgres locks wouldn’t help us identify the specific query holding the lock. &lt;/p&gt;

&lt;h1&gt;
  
  
  Failure #3: primary Postgres instance failure
&lt;/h1&gt;

&lt;p&gt;Now, let’s see how CloudNativePG handles a primary instance failure. To simulate this failure, I’ll simply delete the Pod of the primary Postgres instance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl delete pod pg-cluster-2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgxnmk2pg6xrbb3qzt6g.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(That's all for character count - to view the last experiment, &lt;a href="https://coroot.com/blog/engineering/chaos-testing-a-postgres-cluster-managed-by-cloudnativepg/" rel="noopener noreferrer"&gt;visit our blog.&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>devops</category>
      <category>postgres</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Monitoring a Docker Homelab with Open Source</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Mon, 15 Dec 2025 17:32:55 +0000</pubDate>
      <link>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</link>
      <guid>https://dev.to/coroot/monitoring-a-docker-homelab-with-open-source-3h07</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog comes from Coroot contributor Arie van den Heuvel: engineer, a System and Application Management Specialist, and a valued member of our open source community. You can read more of Arie’s writing and support the resource articles he has contributed to open source &lt;a href="https://solipsistic-sysadmin.blogspot.com/" rel="noopener noreferrer"&gt;on his blog.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnye4ffx0td42ylpaj7j3.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When running a home server consisting of one or more nodes with some or all services in Docker, you may find yourself wanting to monitor your environment. Or even better, attain full observability.&lt;/p&gt;

&lt;p&gt;The frequent recommendation for this is a combination of Prometheus with Grafana. But this solution requires a lot of work to fully configure, in addition to work on one’s applications and setup for full visibility. Another possibility is to use the free tier of NewRelic, which has the advantage of remote insights on metrics and logs. Again, this requires additional work on containers or applications to have a more refined visibility of your services.&lt;/p&gt;

&lt;p&gt;For those not running Linux, an honourable mention as a solution would be &lt;a href="https://github.com/henrygd/beszel" rel="noopener noreferrer"&gt;Beszel.&lt;/a&gt; Beszel can be run as a local service or in Docker. It consists of a web front-end and an agent that can be used on multiple systems running Windows and macOS. Installation in Docker is easy. Once it’s running, Beszel provides insightful information with system metrics, Docker services, and even logs.&lt;/p&gt;

&lt;p&gt;My personal choice for monitoring a home server system is &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Coroot&lt;/a&gt;. In the following blog, I’ll detail how I used Coroot to set up observability for my homelab, which you can then adopt for your own setup.&lt;/p&gt;

&lt;h1&gt;
  
  
  Observability with Coroot
&lt;/h1&gt;

&lt;p&gt;In my current setup on a &lt;a href="https://rockylinux.org/" rel="noopener noreferrer"&gt;Rocky Linux&lt;/a&gt; 9.x system, Coroot runs alongside a &lt;a href="https://github.com/ClickHouse/ClickHouse" rel="noopener noreferrer"&gt;Clickhouse&lt;/a&gt; server that stores metrics, logs, traces and profiles, in addition to the &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;Coroot node-agent&lt;/a&gt; and &lt;a href="https://github.com/coroot/coroot-cluster-agent" rel="noopener noreferrer"&gt;Coroot cluster-agent.&lt;/a&gt; The Coroot node-agent automatically collects all service metrics and logs using eBPF, while the cluster-agent provides detailed information on databases like MySQL, Postgres or Redis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08xhxh81nrfd1ife3xib.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another advantage Coroot presents is the use of &lt;a href="https://coroot.com/ai" rel="noopener noreferrer"&gt;AI-powered Root Cause Analysis&lt;/a&gt;, which provides instantaneous and helpful insights for investigating incidents. With a &lt;a href="https://docs.coroot.com/ai/coroot-cloud/" rel="noopener noreferrer"&gt;Coroot Cloud account&lt;/a&gt;, you get ten helpful analyses for free each month. Even without AI, the data Coroot presents, along with standard alerts based on metric best practices, is pretty insightful and helps to make your setup even better.&lt;/p&gt;

&lt;p&gt;Coroot services run in Docker through a docker-compose file. In a normal Coroot setup Prometheus is used, but in this setup I have configured Clickhouse, &lt;a href="https://docs.coroot.com/configuration/clickhouse/" rel="noopener noreferrer"&gt;which is a supported alternative.&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Clickhouse as a Local Service
&lt;/h1&gt;

&lt;p&gt;I have Clickhouse running as a local service. This setup allows for better control and convenience when scaling down memory usage of Clickhouse, scaling down logging on disk and the database, and simplifies making changes to the data. The only downside to note is that this setup requires updating Clickhouse manually with &lt;code&gt;yum/dnf&lt;/code&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Installing Clickhouse
&lt;/h1&gt;

&lt;p&gt;Installing Clickhouse is easily achieved by adding the repo, installing Clickhouse, and making a few quick adjustments before starting it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo dnf install -y yum-utils
sudo dnf config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
sudo dnf install -y clickhouse-server clickhouse-client

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before starting the service create file &lt;code&gt;/etc/clickhouse-server/config.d/z_log_disable.xml&lt;/code&gt; and insert the following content in the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;clickhouse&amp;gt;
&amp;lt;asynchronous_metric_log remove="1"/&amp;gt;
&amp;lt;metric_log remove="1"/&amp;gt;
&amp;lt;latency_log remove="1"/&amp;gt;
&amp;lt;query_thread_log remove="1" /&amp;gt;
&amp;lt;query_log remove="1" /&amp;gt;
&amp;lt;query_views_log remove="1" /&amp;gt;
&amp;lt;part_log remove="1"/&amp;gt;
&amp;lt;session_log remove="1"/&amp;gt;
&amp;lt;text_log remove="1" /&amp;gt;
&amp;lt;trace_log remove="1"/&amp;gt;
&amp;lt;crash_log remove="1"/&amp;gt;
&amp;lt;opentelemetry_span_log remove="1"/&amp;gt;
&amp;lt;zookeeper_log remove="1"/&amp;gt;
&amp;lt;/clickhouse&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this adjust cache sizes in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;mark_cache_size&amp;gt;268435456&amp;lt;/mark_cache_size&amp;gt;
&amp;lt;index_mark_cache_size&amp;gt;67108864&amp;lt;/index_mark_cache_size&amp;gt;
&amp;lt;uncompressed_cache_size&amp;gt;16777216&amp;lt;/uncompressed_cache_size&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust memory usage ratio in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;max_server_memory_usage_to_ram_ratio&amp;gt;0.75&amp;lt;/max_server_memory_usage_to_ram_ratio&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower the thread pool size in &lt;code&gt;/etc/clickhouse-server/config.xml:&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- max_thread_pool_size&amp;gt;10000&amp;lt;/max_thread_pool_size&amp;gt; --&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And starting things up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sudo systemctl daemon-reload
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Installing Coroot
&lt;/h1&gt;

&lt;p&gt;First, check if your Linux system is using kernel 5.1 or later (although 4.2 is also supported). This installation is different from the &lt;a href="https://docs.coroot.com/?env=docker" rel="noopener noreferrer"&gt;original docker-compose file.&lt;/a&gt;&lt;/p&gt;
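&lt;p&gt;A small POSIX-shell sketch of that kernel check (the version comparison relies on GNU &lt;code&gt;sort -V&lt;/code&gt;; the 5.1 threshold follows the requirement above):&lt;/p&gt;

```shell
# Returns success if version $1 is at least version $2 (uses GNU sort -V)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Compare the running kernel (without the distro suffix) against 5.1
kernel="$(uname -r | cut -d- -f1)"
if version_ge "$kernel" 5.1; then
  echo "kernel $kernel is new enough for eBPF-based monitoring"
else
  echo "kernel $kernel is older than 5.1; check the agent requirements"
fi
```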

&lt;p&gt;Prometheus is not used in this setup, and Clickhouse runs as a local service. Another distinction is the retention of the data, which is normally set to seven days for traces, logs, profiles and metrics. Coroot also typically stores its own local cache for metrics for 30 days.&lt;/p&gt;

&lt;p&gt;In this setup, the data retention stored in Clickhouse is set to 14 days. With eighteen local and Docker services, the amount of data kept for all of this averages around 3GB on my system.&lt;br&gt;
Coroot, its node-agent, and its cluster-agent run as Docker services via a docker-compose file that you create locally. This is achieved by inserting the following content into a locally created &lt;code&gt;docker-compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
name: coroot

volumes:
  node_agent_data: {}
  cluster_agent_data: {}
  coroot_data: {}

services:
  coroot:
    restart: always
    image: ghcr.io/coroot/coroot${LICENSE_KEY:+-ee} # set 'coroot-ee' as the image if LICENSE_KEY is defined
    pull_policy: always
    user: root
    volumes:
      - coroot_data:/data
    ports:
      - 8080:8080
    command:
      - '--data-dir=/data'
      - '--bootstrap-refresh-interval=15s'
      - '--bootstrap-clickhouse-address=127.0.0.1:9000'
      - '--bootstrap-prometheus-url=http://127.0.0.1:9090'
      - '--global-prometheus-use-clickhouse'
      - '--global-prometheus-url=http://127.0.0.1:9090'
      - '--global-refresh-interval=15s'
      - '--cache-ttl=31d'
      - '--traces-ttl=21d'
      - '--logs-ttl=21d'
      - '--profiles-ttl=21d'
      - '--metrics-ttl=21d'
    environment:
      - LICENSE_KEY=${LICENSE_KEY:-}
      - GLOBAL_PROMETHEUS_USE_CLICKHOUSE
      - CLICKHOUSE_SPACE_MANAGER_USAGE_THRESHOLD=75 # Set cleanup threshold to 75%
      - CLICKHOUSE_SPACE_MANAGER_MIN_PARTITIONS=2 # Always keep at least 2 partitions
    network_mode: host

  node-agent:
    restart: always
    image: ghcr.io/coroot/coroot-node-agent
    pull_policy: always
    privileged: true
    pid: "host"
    volumes:
      - /sys/kernel/tracing:/sys/kernel/tracing
      - /sys/kernel/debug:/sys/kernel/debug
      - /sys/fs/cgroup:/host/sys/fs/cgroup
      - node_agent_data:/data
    command:
      - '--collector-endpoint=http://192.168.1.160:8080'
      - '--cgroupfs-root=/host/sys/fs/cgroup'
      - '--wal-dir=/data'

  cluster-agent:
    restart: always
    image: ghcr.io/coroot/coroot-cluster-agent
    pull_policy: always
    volumes:
      - cluster_agent_data:/data
    command:
      - '--coroot-url=http://192.168.1.160:8080'
      - '--metrics-scrape-interval=15s'
      - '--metrics-wal-dir=/data'
    depends_on:
      - coroot

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating this file and making any adjustments to your own liking and network preferences, run &lt;code&gt;docker compose up -d&lt;/code&gt; and go to your IP address on port 8080. Here you have access to Coroot and are now prompted to set admin credentials!&lt;/p&gt;

&lt;p&gt;In my setup, &lt;a href="https://github.com/containrrr/watchtower" rel="noopener noreferrer"&gt;Watchtower&lt;/a&gt; takes care of updating docker containers, which works great with Coroot.&lt;/p&gt;

&lt;p&gt;As a final sidenote: there are already some helpful hints and pointers present within Coroot for setting things up. In my case, there was information available that helped observe a Postgres database. Don’t forget to use the given commands as the admin/postgres user to make it work.&lt;/p&gt;

&lt;p&gt;Happy homelab observing! 🐧&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>opensource</category>
    </item>
    <item>
      <title>OpenTelemetry for Go: Measuring the Overhead</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:15:09 +0000</pubDate>
      <link>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</link>
      <guid>https://dev.to/coroot/opentelemetry-for-go-measuring-the-overhead-dpi</guid>
      <description>&lt;h1&gt;
  
  
  OpenTelemetry for Go: Measuring the Overhead
&lt;/h1&gt;

&lt;p&gt;Everything comes at a cost — and observability is no exception. When we add metrics, logging, or distributed tracing to our applications, it helps us understand what’s going on with performance and key UX metrics like success rate and latency. But what’s the cost?&lt;/p&gt;

&lt;p&gt;I’m not talking about the price of observability tools here, I mean the instrumentation overhead. If an application logs or traces everything it does, that’s bound to slow it down or at least increase resource consumption. Of course, that doesn’t mean we should give up on observability. But it does mean we should measure the overhead so we can make informed tradeoffs.&lt;/p&gt;

&lt;p&gt;In this post, I want to measure the overhead of using OpenTelemetry in a Go application. To do that, I’ll use a super simple Go HTTP server that increments a counter in Valkey (an in-memory database forked from Redis) on every request. The idea behind the benchmark is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we’ll run the app under load without any instrumentation and measure its performance and resource usage.&lt;/li&gt;
&lt;li&gt;Then, using the exact same workload, we’ll repeat the test with the OpenTelemetry SDK for Go enabled and compare the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Test setup
&lt;/h1&gt;

&lt;p&gt;For this benchmark, I’ll use four Linux nodes, each with 4 vCPUs and 8GB of RAM. One will run the application, another will host Valkey, a third will be used for the load generator, and the fourth for observability (using &lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Coroot Community Edition&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wy41plt5uuu007krfz.png" alt=" " width="800" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I want to make sure the components involved in the test don’t interfere with each other, so I’m running them on separate nodes. This time, I’m not using Kubernetes; instead, I’ll run everything in plain Docker containers. I’m also using host network mode for all containers to avoid docker-proxy introducing any additional latency into the network path.&lt;/p&gt;

&lt;p&gt;Now, let’s take a look at the application code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "strconv"

    "github.com/go-redis/redis/extra/redisotel"
    "github.com/go-redis/redis/v8"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/trace"
)

var (
    rdb *redis.Client
)

func initTracing() {
    rdb.AddHook(redisotel.TracingHook{})
    client := otlptracehttp.NewClient()
    exporter, err := otlptrace.New(context.Background(), client)
    if err != nil {
        log.Fatal(err)
    }
    tracerProvider := trace.NewTracerProvider(trace.WithBatcher(exporter))
    otel.SetTracerProvider(tracerProvider)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
}

func handler(w http.ResponseWriter, r *http.Request) {
    cmd := rdb.Incr(r.Context(), "counter")
    if err := cmd.Err(); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    _, _ = w.Write([]byte(strconv.FormatInt(cmd.Val(), 10)))
}

func main() {
    rdb = redis.NewClient(&amp;amp;redis.Options{Addr: os.Getenv("REDIS_SERVER")})
    h := http.Handler(http.HandlerFunc(handler))
    if os.Getenv("ENABLE_OTEL") != "" {
        log.Println("enabling opentelemetry")
        initTracing()
        h = otelhttp.NewHandler(http.HandlerFunc(handler), "GET /")
    }
    log.Fatal(http.ListenAndServe(":8080", h))
} 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, the application runs without instrumentation. The OpenTelemetry SDK is initialized only if the ENABLE_OTEL environment variable is set, so runs without this variable will serve as the baseline for comparison.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running the Benchmark
&lt;/h1&gt;

&lt;p&gt;Now let’s start all the components and begin testing.&lt;/p&gt;

&lt;p&gt;First, we launch Valkey using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name valkey -d --net=host valkey/valkey

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we start the Go app and point it to the Valkey instance by IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app -e REDIS_SERVER="192.168.1.2:6379" --net=host failurepedia/redis-app:0.5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate load, I’ll use wrk2, which allows precise control over request rate. In this test, I’m setting it to 10,000 requests per second using 100 connections and 8 threads. Each run will last 20 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; docker run --rm --name load-generator -ti cylab/wrk2 \
   -t8 -c100 -d1200s -R10000 --u_latency http://192.168.1.3:8080/ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Let’s take a look at the results.&lt;/p&gt;

&lt;p&gt;We started by running the app without any instrumentation. This serves as our baseline for performance and resource usage. Based on metrics gathered by Coroot using eBPF, the app successfully handled 10,000 requests per second. The majority of requests were served in under 5 milliseconds: the 95th percentile (p95) latency was around 5ms, and the 99th percentile (p99) was about 10ms, with occasional spikes reaching up to 20ms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiyvg27rx88hqo7qaxt4.png" alt=" " width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage was steady at around 2 CPU cores (or 2 CPU seconds per second), and memory consumption stayed low at roughly 10 MB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvuy2oqwclmtbxyh0b75.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc286q5iy36e1148xcvvu.png" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So that’s our baseline. Now, let’s restart the app container with the OpenTelemetry SDK enabled and see how things change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d --name app \
  -e REDIS_SERVER="192.168.1.2:6379" \
  -e ENABLE_OTEL=1 \
  -e OTEL_SERVICE_NAME="app" \
  -e OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://192.168.1.4:8080/v1/traces" \
  --net=host failurepedia/redis-app:0.5 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else stayed the same – the infrastructure, the workload, and the duration of the test.&lt;/p&gt;

&lt;p&gt;Now let’s break down what changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vjjawlmkipq0g69t8oe.png" alt=" " width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage increased from around 10 megabytes to somewhere between 15 and 18 megabytes. This additional overhead comes from the SDK and the buffers and background goroutines it uses to batch and export telemetry data. While there is a clear difference, it doesn’t look like a significant increase in absolute terms, especially for modern applications where memory budgets are typically much larger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3sl7p9r489u7z63jz12.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage jumped from 2 cores to roughly 2.7 cores. That’s about a 35 percent increase. This is expected since the app is now tracing every request, preparing and exporting spans, and doing more work in the background.&lt;/p&gt;

&lt;p&gt;To understand exactly where this additional CPU usage was coming from, I used Coroot’s built-in eBPF-based CPU profiler to capture and compare profiles before and after enabling OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4k4f2w3f5pqw6r2v6h.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The profiler showed that about 10 percent of total CPU time was spent in go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor, which handles span batching and export. Redis calls also got slightly more expensive — tracing added around 7 percent CPU overhead to go-redis operations. The rest of the increase came from instrumented HTTP handlers and middleware.&lt;/p&gt;

&lt;p&gt;In short, the overhead comes from OpenTelemetry’s span processing pipeline, not from the app’s core logic.&lt;/p&gt;
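&lt;p&gt;If span batching and export dominate your profiles, the batch processor can be tuned. Here is a sketch using the OpenTelemetry Go SDK’s documented options; the values shown are the SDK’s defaults, so treat them as starting points to adjust rather than recommendations:&lt;/p&gt;

```go
package tracing

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTunedTracing mirrors initTracing from the app above, but makes the
// batch span processor's knobs explicit instead of relying on defaults.
func initTunedTracing() {
	exporter, err := otlptrace.New(context.Background(), otlptracehttp.NewClient())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			// Larger batches amortize per-export overhead
			// (serialization, HTTP requests) across more spans.
			sdktrace.WithMaxExportBatchSize(512),
			// How long to wait before flushing a partially filled batch.
			sdktrace.WithBatchTimeout(5*time.Second),
			// Spans are dropped (never blocked on) once this queue fills up.
			sdktrace.WithMaxQueueSize(2048),
		),
	)
	otel.SetTracerProvider(tp)
}
```

&lt;p&gt;Raising the batch size and timeout reduces export frequency at the cost of telemetry freshness and a larger in-memory buffer.&lt;/p&gt;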

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqd80y4bysd0744vd3a.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Latency also changed, though not dramatically. With OpenTelemetry enabled, more requests fell into the 5 to 10 millisecond range. The 99th percentile latency went from 10 to about 15 milliseconds. Throughput remained stable at around 10,000 requests per second. We didn’t see any errors or timeouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzt1wdyf5y5xhqhr71npo.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network traffic also increased. With tracing enabled, the app started exporting telemetry data to Coroot, which resulted in an outbound traffic volume of about 4 megabytes per second, or roughly 32 megabits per second. For high-throughput services or environments with strict network constraints, this is something to keep in mind when enabling full request-level tracing.&lt;/p&gt;
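&lt;p&gt;If that export volume is a concern, head-based sampling is the usual lever: trace only a fraction of requests instead of all of them. A minimal sketch using the SDK’s built-in samplers (the 10% ratio is an arbitrary example, not a recommendation):&lt;/p&gt;

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider records roughly 10% of root traces. ParentBased makes
// child spans follow their parent's decision, so sampled traces stay complete.
func newSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithSampler(
			sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10)),
		),
	)
}
```

&lt;p&gt;The SDK can also pick this up without code changes via the standard OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.1 environment variables.&lt;/p&gt;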

&lt;p&gt;Overall, enabling OpenTelemetry introduced a noticeable but controlled overhead. These numbers aren’t negligible, especially at scale — but they’re also not a dealbreaker. For most teams, the visibility gained through distributed tracing and the ability to troubleshoot issues faster will justify the tradeoff.&lt;/p&gt;
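&lt;p&gt;Putting the measured figures in per-request terms makes the tradeoff concrete. A quick back-of-envelope check, using only the numbers from this benchmark:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	const (
		rps        = 10000.0 // sustained request rate
		baseCPU    = 2.0     // CPU cores without instrumentation
		otelCPU    = 2.7     // CPU cores with the SDK enabled
		egressMBps = 4.0     // telemetry export traffic, MB/s
	)

	// Relative CPU increase: (2.7 - 2.0) / 2.0 = 35%.
	cpuIncrease := (otelCPU - baseCPU) / baseCPU * 100

	// Extra CPU time per request: 0.7 cores / 10,000 rps = 70 microseconds.
	cpuPerReqMicros := (otelCPU - baseCPU) / rps * 1e6

	// Telemetry egress per request: 4 MB/s / 10,000 rps = 400 bytes.
	bytesPerReq := egressMBps * 1e6 / rps

	// 4 MB/s * 8 bits/byte = 32 Mbit/s.
	egressMbitps := egressMBps * 8

	fmt.Printf("CPU increase: %.0f%%\n", cpuIncrease)
	fmt.Printf("extra CPU per request: %.0f us\n", cpuPerReqMicros)
	fmt.Printf("telemetry per request: %.0f bytes\n", bytesPerReq)
	fmt.Printf("egress: %.0f Mbit/s\n", egressMbitps)
}
```

&lt;p&gt;So every traced request costs roughly 70 microseconds of extra CPU time and about 400 bytes of egress: cheap per request, but it adds up at 10,000 requests per second.&lt;/p&gt;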

&lt;h1&gt;
  
  
  eBPF-based instrumentation
&lt;/h1&gt;

&lt;p&gt;I often hear from engineers, especially in ad tech and other high-throughput environments, that they simply can’t afford the overhead of distributed tracing. At the same time, observability is absolutely critical for them. This is exactly the kind of scenario where eBPF-based instrumentation fits well. &lt;/p&gt;

&lt;p&gt;Instead of modifying application code or adding SDKs, an agent can observe application behavior at the kernel level using eBPF. Coroot’s agent supports this approach and is capable of collecting both metrics and traces using eBPF, without requiring any changes to the application itself.&lt;/p&gt;

&lt;p&gt;However, in high-load environments like the one used in this benchmark, we generally recommend disabling eBPF-based tracing and working with metrics only. Metrics still allow us to clearly see how services interact with each other, without storing data about every single request. They’re also much more efficient in terms of storage and runtime overhead.&lt;/p&gt;

&lt;p&gt;Throughout both runs of our test, Coroot’s agent was running on each node. Here’s what its CPU usage looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbypdt6z8igg30h5khlt.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node201 was running Valkey, node203 was running the app, and node204 was the load generator. As the chart shows, even under consistent load, the agent’s CPU usage stayed under 0.3 cores. That makes it lightweight enough for production use, especially when working in metrics-only mode.&lt;/p&gt;

&lt;p&gt;This approach offers a practical balance: good visibility with minimal cost.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Observability comes at a cost, but as this experiment shows, that cost depends heavily on how you choose to implement it.&lt;/p&gt;

&lt;p&gt;OpenTelemetry SDKs provide detailed traces and deep visibility, but they also introduce measurable overhead in terms of CPU, memory, and network traffic. For many teams, especially when fast incident resolution is a priority, that tradeoff is entirely justified.&lt;/p&gt;

&lt;p&gt;At the same time, eBPF-based instrumentation offers a more lightweight option. It allows you to collect meaningful metrics without modifying application code and keeps resource usage minimal, especially when tracing is disabled and only metrics are collected.&lt;/p&gt;

&lt;p&gt;The right choice depends on your goals. If you need full traceability and detailed diagnostics, SDK-based tracing is a strong option. If your priority is low overhead and broad system visibility, eBPF-based metrics might be the better fit.&lt;/p&gt;

&lt;p&gt;Observability isn’t free, but with the right approach, it can be both effective and efficient.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>opensource</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Coroot – eBPF-based, open source observability with actionable insights</title>
      <dc:creator>Coroot</dc:creator>
      <pubDate>Wed, 09 Apr 2025 16:29:14 +0000</pubDate>
      <link>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</link>
      <guid>https://dev.to/coroot/coroot-ebpf-based-open-source-observability-with-actionable-insights-4dj1</guid>
      <description>&lt;p&gt;A common open source approach to observability will begin with databases and visualizations for telemetry - Grafana, Prometheus, Jaeger. But observability doesn’t begin and end here: these tools require configuration, dashboard customization, and may not actually pinpoint the data you need to mitigate system risks.&lt;/p&gt;

&lt;p&gt;Coroot was designed to solve the problem of manual, time-consuming observability analysis: it handles the full observability journey — from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can benefit from, which is why our software is open source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://coroot.com/" rel="noopener noreferrer"&gt;Features:&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost monitoring to track and minimise your cloud expenses (AWS, GCP, Azure).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLO tracking with alerts to detect anomalies and compare them to your system’s baseline behaviour.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1-click application profiling: see the exact line of code that caused an anomaly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mapped timeframes (no more digging through Grafana to find when the incident occurred).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eBPF automatically gathers logs, metrics, traces, and profiles for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service map to grasp a complete at-a-glance picture of your system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic discovery and monitoring of every application deployment in your Kubernetes cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can view Coroot’s documentation &lt;a href="https://docs.coroot.com/installation/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, visit our &lt;a href="https://github.com/coroot/coroot" rel="noopener noreferrer"&gt;Github&lt;/a&gt;, and join our &lt;a href="https://join.slack.com/t/coroot-community/shared_invite/zt-2te9x672s-4s_Wp732cd~o2vdFLNE5AA" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; to become part of our community. We welcome any feedback and hope the tool can improve your workflow!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
