DEV Community: Frank Rosner

Unit Testing Alertmanager Routing and Inhibition Rules

Frank Rosner — Fri, 10 Apr 2026 12:40:12 +0000

Introduction

There are three ways to find out your alertmanager routing tree is broken. You catch it during a careful review before anything goes wrong. You wake up at 3am to a page that went to the wrong team. Or an alert goes to the wrong receiver, nobody gets paged, and you find out when the customer calls. Most of us have experienced at least the second one.

Alertmanager routing trees grow incrementally. A new team gets added, a new severity tier is introduced, someone adds a continue: true flag and forgets to remove it. The config file remains valid YAML throughout. amtool check-config keeps returning clean. Nothing tells you that warning alerts for DatabaseDown are now waking up the frontend on-call instead of the backend team.

This post describes a small Go tool we built to write unit tests for alertmanager routing and inhibition rules, run them in CI, and catch these mistakes before they matter.

The Problem

Alertmanager gives you two built-in tools for validating config:

amtool check-config validates syntax and structure. It cannot tell you whether an alert reaches the right receiver.
amtool config routes test lets you interactively test a single alert against the routing tree. It is useful for manual debugging but does not support batch test files, and it has no notion of inhibition. You cannot assert that a warning is suppressed when a critical is firing.

The failure modes we care about fall into two categories:

Wrong receiver: SomeAlert with team=backend ends up in the frontend Slack channel because a route was added above the team-based route without continue: true, so matching stops at the new route and the team-based route is never evaluated.
Broken inhibition: A warning fires even though a critical is active for the same alert, flooding your incident channel with noise. Or worse: warnings are being silenced when they should not be, hiding real problems.

Both are semantic errors. The config is syntactically valid; the routing logic is just wrong. Manual testing by firing test alerts into a staging alertmanager is slow, stateful, and easy to skip.
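The wrong-receiver case can be reproduced with a toy model of first-match routing. The sketch below is not alertmanager's code: the Route type, the equality-only matchers, and the receiver names frontend-slack and backend-slack are assumptions made for illustration. It only shows how the position of a route and its continue flag decide which receivers an alert reaches.

```go
package main

import "fmt"

// Route is a simplified stand-in for an alertmanager route: equality
// matchers only, no regular expressions, grouping, or timing options.
type Route struct {
	Receiver string
	Match    map[string]string // an empty map matches every alert
	Continue bool
	Routes   []Route
}

// matches reports whether every matcher equals the alert's label value.
func (r Route) matches(labels map[string]string) bool {
	for name, want := range r.Match {
		if labels[name] != want {
			return false
		}
	}
	return true
}

// Receivers walks the tree depth-first. Siblings are tried in order, and the
// walk stops at the first matching sibling unless that sibling sets Continue.
// A node whose children all miss falls back to its own receiver.
func (r Route) Receivers(labels map[string]string) []string {
	if !r.matches(labels) {
		return nil
	}
	var out []string
	for _, child := range r.Routes {
		got := child.Receivers(labels)
		out = append(out, got...)
		if len(got) > 0 && !child.Continue {
			break
		}
	}
	if len(out) == 0 {
		out = []string{r.Receiver}
	}
	return out
}

// tree builds a root with a severity route placed above the team route.
func tree(continueOnSeverity bool) Route {
	return Route{
		Receiver: "default",
		Routes: []Route{
			{
				Receiver: "frontend-slack",
				Match:    map[string]string{"severity": "critical"},
				Continue: continueOnSeverity,
			},
			{
				Receiver: "backend-slack",
				Match:    map[string]string{"team": "backend"},
			},
		},
	}
}

func main() {
	alert := map[string]string{"alertname": "SomeAlert", "team": "backend", "severity": "critical"}
	fmt.Println(tree(false).Receivers(alert)) // [frontend-slack]
	fmt.Println(tree(true).Receivers(alert))  // [frontend-slack backend-slack]
}
```

Without the flag, the backend team never hears about its own critical alert, and nothing in the YAML looks wrong.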

The Solution: Automated Unit Tests

alertmanager-routing-tests is a Go tool that evaluates alertmanager routing and inhibition rules purely in-memory, using alertmanager's own Go libraries. No running alertmanager instance is required.

You give it an alertmanager config file and a YAML test file. It runs each test case and reports which passed and which failed:

  PASS Unmatched alert routes to default receiver
  PASS Watchdog alert routes to null receiver
  PASS Team A alert routes to team-a-slack
  FAIL "wrong receiver test"
       alert {alertname=SomeAlert}:
         expected: nonexistent-receiver
         actual:   default
=== routing tests: 3 passed, 1 failed ===

Exit code 0 means all tests passed. Exit code 1 means at least one failed, which makes it CI-friendly by default.

How It Works

The tool imports alertmanager's own Go packages directly:

import (
    amconfig "github.com/prometheus/alertmanager/config"
    "github.com/prometheus/alertmanager/dispatch"
    "github.com/prometheus/alertmanager/inhibit"
)

Routing is straightforward. The config is loaded with amconfig.LoadFile, and dispatch.NewRoute(cfg.Route, nil).Match(labelSet) returns the same list of matching routes, each carrying its receiver name, that a live alertmanager would produce for that label set.

Inhibition is more involved. Alertmanager's inhibitor is designed to work against a live alert store. The tool works around this by implementing a minimal provider.Alerts interface called fakeAlerts, which serves a fixed set of alerts from a buffered channel:

func (f *fakeAlerts) Subscribe() provider.AlertIterator {
    ch := make(chan *types.Alert, len(f.alerts))
    for _, a := range f.alerts {
        ch <- a
    }
    done := make(chan struct{})
    return provider.NewAlertIterator(ch, done, nil)
}

The inhibitor is constructed with this fake provider, its Run() goroutine is started, and after a brief pause for it to process the alert feed, Mutes(labelSet) is called for each alert to check whether it is suppressed.

The key design decision is that all alerts in a test case are fired together. This is what allows source alerts to inhibit target alerts within the same test case. An alert with severity=critical can suppress an alert with severity=warning when both are present in the same case.

Inhibition is checked first. If an alert is inhibited, receiver matching is skipped. An inhibited alert produces no notification in a real alertmanager, so asserting a receiver for it would be misleading; the test should assert inhibition explicitly.
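Why the alerts of a test case have to be evaluated together is easier to see in a toy model. The following sketch does not use alertmanager's inhibitor: InhibitRule, Labels, and Inhibited are hypothetical names, and only equality matchers are modelled. It mimics the shape of the example rule used later in this post, where a critical mutes a warning with the same alertname.

```go
package main

import "fmt"

// Labels is one alert's label set.
type Labels map[string]string

// InhibitRule is a simplified inhibit rule with equality matchers only.
type InhibitRule struct {
	Source Labels   // labels an alert needs to act as a source
	Target Labels   // labels an alert needs to be eligible for muting
	Equal  []string // label names whose values must agree on both alerts
}

// matches reports whether alert carries every label in want.
func matches(alert, want Labels) bool {
	for name, value := range want {
		if alert[name] != value {
			return false
		}
	}
	return true
}

// Inhibited reports whether alerts[i] is muted by any other alert in the
// same batch. An alert is never treated as its own source.
func Inhibited(rule InhibitRule, alerts []Labels, i int) bool {
	target := alerts[i]
	if !matches(target, rule.Target) {
		return false
	}
	for j, source := range alerts {
		if j == i || !matches(source, rule.Source) {
			continue
		}
		equal := true
		for _, name := range rule.Equal {
			if source[name] != target[name] {
				equal = false
				break
			}
		}
		if equal {
			return true
		}
	}
	return false
}

func main() {
	rule := InhibitRule{
		Source: Labels{"severity": "critical"},
		Target: Labels{"severity": "warning"},
		Equal:  []string{"alertname"},
	}
	together := []Labels{
		{"alertname": "SomeAlert", "severity": "critical"},
		{"alertname": "SomeAlert", "severity": "warning"},
	}
	fmt.Println(Inhibited(rule, together, 1))     // true: the critical is in the batch
	fmt.Println(Inhibited(rule, together[1:], 0)) // false: no source alert present
}
```

Evaluated on its own, the warning is never muted, which is why a per-alert test runner could not express inhibition assertions.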

Writing Tests

Here is a minimal alertmanager config:

global:
  resolve_timeout: 5m

inhibit_rules:
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity = "warning"']
    equal: [alertname]

route:
  receiver: default
  group_by: ['alertname']
  routes:
    - matchers:
        - alertname="Watchdog"
      receiver: "null"
    - matchers:
        - team="team-a"
      receiver: team-a-slack

receivers:
  - name: default
  - name: "null"
  - name: team-a-slack

Test files are YAML with a tests list. Each test case has a name and one or more alerts. Each alert has labels and an assertion: either expected_receivers or expected_inhibited: true:

expected_receivers: ordered list of receiver names the alert must match. Order matters because alertmanager's routing order is significant when continue: true is used.
expected_inhibited: set to true to assert the alert is suppressed. Omit (or leave false) otherwise. Do not set both on the same alert.
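The two rules above (order matters, and the assertions are mutually exclusive) could be enforced in code roughly as follows. This is a sketch under assumptions: the AlertCase struct, its yaml tags, and the helper names are made up for illustration and are not taken from the tool's source.

```go
package main

import (
	"errors"
	"fmt"
)

// AlertCase mirrors one entry under `alerts:` in the test file. The struct
// and its yaml tags are assumptions about how such a file could be decoded.
type AlertCase struct {
	Labels            map[string]string `yaml:"labels"`
	ExpectedReceivers []string          `yaml:"expected_receivers"`
	ExpectedInhibited bool              `yaml:"expected_inhibited"`
}

// Validate rejects an alert that sets both assertions or neither.
func (a AlertCase) Validate() error {
	switch {
	case a.ExpectedInhibited && len(a.ExpectedReceivers) > 0:
		return errors.New("expected_receivers and expected_inhibited are mutually exclusive")
	case !a.ExpectedInhibited && len(a.ExpectedReceivers) == 0:
		return errors.New("alert has no assertion")
	}
	return nil
}

// sameOrder reports whether actual lists exactly the expected receivers,
// in the expected order.
func sameOrder(expected, actual []string) bool {
	if len(expected) != len(actual) {
		return false
	}
	for i := range expected {
		if expected[i] != actual[i] {
			return false
		}
	}
	return true
}

func main() {
	bad := AlertCase{ExpectedReceivers: []string{"default"}, ExpectedInhibited: true}
	fmt.Println(bad.Validate())
	fmt.Println(sameOrder([]string{"a", "b"}, []string{"b", "a"})) // false: order differs
}
```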

Here are the corresponding tests. The five cases cover default routing, the null receiver, team-based routing, inhibition, and the inhibition boundary:

tests:
  # Anything not matched by a specific route falls through to the default receiver.
  - name: "Unmatched alert routes to default receiver"
    alerts:
      - labels:
          alertname: SomeAlert
        expected_receivers:
          - default

  # Watchdog is a synthetic heartbeat alert. It must not page anyone.
  - name: "Watchdog alert routes to null receiver"
    alerts:
      - labels:
          alertname: Watchdog
          severity: critical
        expected_receivers:
          - "null"

  # Team-based routing: alerts with team=team-a go to team-a-slack.
  - name: "Team A alert routes to team-a-slack"
    alerts:
      - labels:
          alertname: TeamAAlert
          team: team-a
        expected_receivers:
          - team-a-slack

  # Inhibition: a critical suppresses a warning with the same alertname.
  # Both alerts are fired together so the inhibitor can evaluate the relationship.
  - name: "critical suppresses warning with same alertname"
    alerts:
      - labels:
          alertname: SomeAlert
          severity: critical
        expected_receivers:
          - default
      - labels:
          alertname: SomeAlert
          severity: warning
        expected_inhibited: true

  # Inhibition boundary: a critical for AlertOne does NOT suppress a warning
  # for AlertTwo because the inhibit rule requires equal alertname.
  - name: "critical does NOT suppress warning with different alertname"
    alerts:
      - labels:
          alertname: AlertOne
          severity: critical
        expected_receivers:
          - default
      - labels:
          alertname: AlertTwo
          severity: warning
        expected_receivers:
          - default

The last test case is easy to miss in manual testing. The inhibition rule says "a critical suppresses a warning with the same alertname." Without a test pinning this boundary, a future change to the inhibition rule could accidentally broaden the equal list (or remove it entirely) and start silencing warnings across unrelated alerts.

Running the Tool

go run . example-alertmanager.yaml example-routing-tests.yaml

All five tests above pass against the example config. To see a failure, change an expected_receivers entry to a nonexistent receiver:

  PASS Unmatched alert routes to default receiver
  FAIL "Watchdog alert routes to null receiver"
       alert {alertname=Watchdog, severity=critical}:
         expected: nonexistent-receiver
         actual:   null
=== routing tests: 1 passed, 1 failed ===

The tool exits with code 1, which blocks CI.

Integrating with CI via Helm Charts

Many teams store their alertmanager config inside a Helm chart rather than as a standalone file. The config may be embedded as a YAML string inside a values file, or rendered into a ConfigMap or ApplicationSet at deploy time.

To test the rendered config, you need to extract it from the rendered template before passing it to the tool. Here is a Makefile target that does this end-to-end:

ROUTING_TEST_YAML := ./alertmanager-unit-tests/routing-tests.yaml

.PHONY: test
test:
    @WORKDIR=$$(mktemp -d) && \
    trap 'rm -rf "$$WORKDIR"' EXIT && \
    helm template test ./my-chart \
        -f "./alertmanager-unit-tests/values.yaml" \
        | yq 'select(.kind == "ConfigMap") | .data["alertmanager.yaml"]' \
        > "$$WORKDIR/alertmanager.yaml" && \
    cd /path/to/alertmanager-routing-tests && \
        go run . "$$WORKDIR/alertmanager.yaml" "$(CURDIR)/$(ROUTING_TEST_YAML)"

The yq expression selects the rendered alertmanager config from the template output. Adjust the selector to match your chart's structure. If the config is embedded as a YAML string inside another resource (for example, an ArgoCD ApplicationSet), you may need from_yaml to parse it before extracting the alertmanager section:

| yq 'select(.kind == "ApplicationSet") \
    | .spec.template.spec.source.helm.values \
    | from_yaml \
    | .alertmanager.config'

With this in place, make test renders the chart and runs the routing tests in a single step. No live cluster, no running alertmanager. The tests run in CI the same way they run locally.

The test YAML lives next to the chart and is reviewed in the same pull requests that change the alertmanager config. Routing changes need passing tests to merge.

Conclusion

Alertmanager routing bugs are quiet. The config is valid, deployment succeeds, and the tree looks right when you read it. You only find out something is wrong when an alert fires and the wrong team gets paged, or nobody gets paged at all, or a customer calls.

Unit tests for routing rules are not conceptually different from unit tests for application code. The logic is complex, the failure modes are silent, and the consequences are real. A test file that exercises your routing tree, including inhibition boundaries, makes routing changes reviewable and gives you a CI gate that catches regressions before they reach production.

If you manage alertmanager config, consider starting with three test cases: the default catch-all, one named receiver route, and one inhibition rule. Extend from there as your routing tree grows.

Future Work

The most natural home for this feature is amtool itself. The existing amtool config routes test command already evaluates a single alert interactively. Extending it to accept a YAML file with multiple test cases and inhibition assertions would make batch routing tests available to the entire Prometheus community without a separate tool.

A contribution along these lines would require adding inhibition support and a test runner loop to the existing command, which is straightforward work on top of what amtool already does. We are considering contributing this upstream.

Taming Prometheus Scrapes - Understanding and Analyzing Your Metrics Endpoints

Frank Rosner — Thu, 26 Feb 2026 07:43:45 +0000

Introduction: Prometheus and the Scrape Format

Prometheus is a pull-based monitoring system. Instead of having services push metrics to a central collector, Prometheus periodically fetches (or scrapes) an HTTP endpoint (typically /metrics) from each monitored target. The response is a plain-text document in the Prometheus exposition format, listing every metric the service wants to expose. A scrape looks something like this:

# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="GET",status="500"} 7
http_requests_total{method="POST",status="200"} 89

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 52428800

Each metric family starts with optional # HELP (description) and # TYPE (counter, gauge, histogram, summary, or untyped) comment lines, followed by one or more series: individual data points distinguished by their label sets. In the example above, http_requests_total has three series, one for each combination of method and status labels.

Metric Types

Prometheus defines four main metric types:

Counter: A monotonically increasing value (e.g. total requests, total errors).
Gauge: A value that can go up or down (e.g. current memory usage, queue depth).
Histogram: Samples observations into configurable buckets, plus a _sum and _count. Used for measuring things like request latencies or payload sizes. Each bucket boundary becomes its own series, labeled with le ("less than or equal").
Summary: Similar to a histogram but pre-computes quantiles client-side, labeled with quantile.

Histograms and summaries deserve a special mention when thinking about cardinality. A single histogram metric with 12 buckets (counting the implicit +Inf bucket) exposes 14 series per label combination: 12 bucket series (le=...), one _sum, and one _count. If that histogram also carries a label like endpoint with 50 distinct values, you are already looking at 14 × 50 = 700 series from one metric family.
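The arithmetic generalizes to a one-line formula. The helper below is a hypothetical illustration, where leBuckets means the number of le series that actually appear in the scrape for one label combination:

```go
package main

import "fmt"

// histogramSeries returns the number of series one histogram family exposes:
// one series per `le` bucket plus `_sum` and `_count`, for every label
// combination.
func histogramSeries(leBuckets, labelCombinations int) int {
	return (leBuckets + 2) * labelCombinations
}

func main() {
	fmt.Println(histogramSeries(12, 1))  // 14
	fmt.Println(histogramSeries(12, 50)) // 700
}
```

Adding a second label with 10 values would multiply the result by 10 again, which is why histograms are usually the first place to look when cardinality climbs.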

The Two Big Challenges: Size and Cardinality

Scrape Size

Every scrape is an HTTP response body that Prometheus has to download, parse, and ingest. For small services exposing a few dozen metrics this is trivial, but instrumented runtimes (Go, JVM, .NET) can expose hundreds of metrics by default, and larger applications may expose thousands. Scrape payloads regularly reach several megabytes, and in extreme cases tens of megabytes.

A large scrape causes two concrete problems. First, it consumes bandwidth on every scrape interval (typically 15–60 seconds). Second, and more critically, Prometheus enforces a scrape timeout: if the target takes too long to respond, the scrape is marked as failed and the data is simply not ingested. I encountered scrapes exceeding 64 MB that consistently timed out, meaning that data was never in Prometheus, making it impossible to debug the problem from inside Prometheus itself.

One common remedy is enabling gzip compression on the /metrics endpoint. Prometheus supports Accept-Encoding: gzip out of the box, and the text format compresses well. Compression can reduce transfer size by 80–90%, which helps significantly with bandwidth and timeout margins. However, gzip only addresses the transport problem. The data still has to be decompressed and parsed by Prometheus, and all those series still have to be stored and indexed. The real cost of a large scrape is not the bytes on the wire: it is the cardinality.

High Cardinality

Cardinality is the number of distinct time series a metric produces (i.e. the number of unique label value combinations). A metric with no labels has a cardinality of 1. A metric with a method label (say, 5 values) and a status label (say, 10 values) has a cardinality of up to 50.

Prometheus stores each series independently: it allocates memory for it, writes it to disk, and indexes it. High cardinality therefore translates directly into high memory usage, large on-disk storage, and slower queries. Unlike scrape size, this cost cannot be compressed away. It is structural.

Scenario 1: The High-Cardinality Label

The classic example is instrumenting a metric with a label whose value comes from an unbounded domain. Imagine tracking HTTP requests per session:

# TYPE http_requests_total counter
http_requests_total{session_id="a1b2c3"} 12
http_requests_total{session_id="x9y8z7"} 4
http_requests_total{session_id="p0q1r2"} 31
...

Each new user session creates a new series. With thousands of active sessions, Prometheus is ingesting thousands of new series every minute. Prometheus does not free a series the moment it disappears from the scrape: every series stays in the in-memory head block until a later head compaction garbage-collects it, long after the session has ended. A busy service with short sessions can create a continuously growing "cardinality debt" that eventually causes Prometheus to run out of memory.

Session IDs are an obvious case, but the same pattern appears with any high-churn identifier: request IDs, trace IDs, user IDs in a large system, or dynamically generated job names.

Scenario 2: Label Leaks from Stale Metadata

A similar problem arises when label values can become stale, and the service fails to clean up its own metric registry. A common example in Kubernetes is a service that tracks metrics per pod (perhaps a controller, a proxy, or a sidecar that monitors its neighbors):

# TYPE watched_pod_restarts_total counter
watched_pod_restarts_total{pod="web-7d4f9b-xkqzp"} 2
watched_pod_restarts_total{pod="web-7d4f9b-mnprt"} 0
watched_pod_restarts_total{pod="web-7d4f9b-tz9vw"} 5
...

When a pod is removed (due to a rolling deployment, a crash, or a scale-down), the correct behavior is to also delete the corresponding metric series from the registry. If the code forgets to do that, the series for the old pod keeps appearing in every scrape indefinitely. The service is essentially accumulating a series for every pod it has ever seen.

This is a pure instrumentation bug, and it can be surprisingly hard to notice. The service appears healthy, the scrape succeeds, and individual series look reasonable in isolation. But over time, especially in clusters with frequent deployments, the cardinality of that metric grows without bound. By the time someone notices the Prometheus memory usage climbing, hundreds or thousands of ghost series may already be present in the scrape.
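The effect of the missing cleanup can be simulated without any metrics library. The sketch below uses a toy registry; PodRestarts, Forget, and the pod naming scheme are invented for this example. In a real service using the Go client library, the equivalent of Forget would typically be a call such as DeleteLabelValues on the metric vector.

```go
package main

import (
	"fmt"
	"sync"
)

// PodRestarts is a toy per-pod counter registry, standing in for a real
// metrics vector keyed by the pod label.
type PodRestarts struct {
	mu     sync.Mutex
	series map[string]int
}

func NewPodRestarts() *PodRestarts {
	return &PodRestarts{series: map[string]int{}}
}

// Inc creates the series for pod on first use and increments it.
func (p *PodRestarts) Inc(pod string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.series[pod]++
}

// Forget drops the series of a pod that no longer exists.
func (p *PodRestarts) Forget(pod string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.series, pod)
}

// Cardinality is the number of series the next scrape would expose.
func (p *PodRestarts) Cardinality() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.series)
}

// simulate runs three rolling deployments of three pods each and returns
// the resulting cardinality, with or without cleanup of removed pods.
func simulate(cleanup bool) int {
	reg := NewPodRestarts()
	var live []string
	for gen := 0; gen < 3; gen++ {
		// A rolling deployment replaces every pod with a freshly named one.
		for _, old := range live {
			if cleanup {
				reg.Forget(old)
			}
		}
		live = live[:0]
		for i := 0; i < 3; i++ {
			pod := fmt.Sprintf("web-%d-%d", gen, i)
			reg.Inc(pod)
			live = append(live, pod)
		}
	}
	return reg.Cardinality()
}

func main() {
	fmt.Println(simulate(false)) // 9: every pod ever seen is still exposed
	fmt.Println(simulate(true))  // 3: only the live pods remain
}
```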

Ad-Hoc Analysis: The Shell Toolbox

Why Not Just Query Prometheus?

Before reaching for shell tools, it is worth asking: why analyse the raw scrape at all, rather than using PromQL inside Prometheus?

There are two situations where Prometheus itself cannot help you. The first is when the scrape never made it into Prometheus in the first place. A scrape that exceeds the configured timeout is simply dropped: no data is ingested, and there is nothing to query. Large scrapes fail exactly like this, which means the tool you would normally use to investigate the problem is blind to it.

The second situation arises when a remote write intermediary sits between the scraper and the TSDB. Tools like vmagent scrape targets and forward metrics via the remote write protocol, which carries only sample data: it discards # HELP and # TYPE metadata. Once the data is in the storage backend, you lose the ability to filter or group by metric type, or to see the human-readable descriptions that often give the clearest clue about what a metric is and why its cardinality is high.

In both cases, working directly with the raw scrape text is the only option.

Counting Series with Shell Tools

Given a saved scrape, a first instinct is to reach for standard Unix tools. Here are the kinds of questions you might try to answer and how you would approach them.

How many lines does the scrape have?

wc -l < prometheus-scrape.txt

Which metric families are present?

grep '^# TYPE' prometheus-scrape.txt

# TYPE go_gc_cycles_automatic_gc_cycles_total counter
# TYPE go_gc_cycles_forced_gc_cycles_total counter
# TYPE go_gc_cycles_total_gc_cycles_total counter
# TYPE go_gc_duration_seconds summary
# TYPE go_gc_gogc_percent gauge
# TYPE go_gc_gomemlimit_bytes gauge
# TYPE go_gc_heap_allocs_by_size_bytes histogram
...
# TYPE prometheus_http_requests_total counter
# TYPE prometheus_http_response_size_bytes histogram
# TYPE promhttp_metric_handler_requests_in_flight gauge
# TYPE promhttp_metric_handler_requests_total counter

How many series does each metric expose?

The idea is to count non-comment, non-empty lines per metric family. One approach: strip comment and blank lines, extract the metric name (the part before { or the first space), then count occurrences.

grep -v '^#' prometheus-scrape.txt \
  | grep -v '^$' \
  | sed 's/[{ ].*//' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -20

59 prometheus_http_requests_total
20 prometheus_engine_query_duration_histogram_seconds_bucket
18 prometheus_sd_kubernetes_events_total
15 prometheus_tsdb_compaction_duration_seconds_bucket
13 prometheus_tsdb_compaction_chunk_size_bytes_bucket
13 prometheus_tsdb_compaction_chunk_samples_bucket
12 prometheus_engine_query_duration_seconds
12 net_conntrack_dialer_conn_failed_total
12 go_gc_heap_frees_by_size_bytes_bucket
12 go_gc_heap_allocs_by_size_bytes_bucket
11 prometheus_tsdb_compaction_chunk_range_seconds_bucket
10 prometheus_http_request_duration_seconds_bucket
 9 prometheus_http_response_size_bytes_bucket
 8 prometheus_tsdb_sample_ooo_delta_bucket
 8 go_sched_pauses_total_other_seconds_bucket
 8 go_sched_pauses_total_gc_seconds_bucket
 8 go_sched_pauses_stopping_other_seconds_bucket
 8 go_sched_pauses_stopping_gc_seconds_bucket
 8 go_sched_latencies_seconds_bucket
 8 go_gc_pauses_seconds_bucket

This pipes the scrape through a sequence of filters: drop comment lines, drop blank lines, strip everything after the metric name, sort, count duplicates, and sort by count descending.

The Limits of This Approach

The pipeline above works for simple gauges and counters, but it breaks down for histograms and summaries. You can already see this in the output above: prometheus_engine_query_duration_histogram_seconds_bucket appears as its own entry with 20 lines, but the corresponding _sum and _count lines are counted separately and buried lower in the list. The sed pattern extracts different "names" for each suffix (_bucket, _sum, _count), so the count is fragmented across multiple rows rather than attributed to a single metric family. The true series count for that histogram is higher than any individual row suggests. Reassembling these correctly requires knowing the metric type, which means parsing the # TYPE lines and correlating them with the data lines. That turns a one-liner into a non-trivial script.

There are further edge cases: metrics with no labels, metric names that are prefixes of other metric names, and histograms where the bucket count varies across label dimensions. Shell pipelines are quick to write but fragile to maintain, and getting an accurate cardinality figure for a real-world scrape is harder than it looks.
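To give a sense of what "correlating # TYPE lines with data lines" involves, here is a minimal Go sketch. It is not how any particular tool does it: countByFamily and familyOf are hypothetical helpers, the sample scrape is made up, and the code assumes each # TYPE line precedes its samples and ignores the remaining edge cases listed above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countByFamily counts sample lines per metric family, folding the
// _bucket, _sum, and _count series of histograms and summaries into
// the family declared by the matching # TYPE line.
func countByFamily(scrape string) map[string]int {
	types := map[string]string{} // family name -> declared type
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "# TYPE ") {
			if f := strings.Fields(line); len(f) >= 4 {
				types[f[2]] = f[3]
			}
			continue
		}
		if strings.HasPrefix(line, "#") {
			continue // HELP lines and other comments
		}
		// The sample name ends at the first '{' or space.
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		counts[familyOf(name, types)]++
	}
	return counts
}

// familyOf maps a sample name to its family using the declared types.
func familyOf(name string, types map[string]string) string {
	if _, declared := types[name]; declared {
		return name
	}
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if !strings.HasSuffix(name, suffix) {
			continue
		}
		base := strings.TrimSuffix(name, suffix)
		kind := types[base]
		if kind == "histogram" || (kind == "summary" && suffix != "_bucket") {
			return base
		}
	}
	return name // untyped or unknown: count under its own name
}

const exampleScrape = `# TYPE req_seconds histogram
req_seconds_bucket{le="0.1"} 3
req_seconds_bucket{le="+Inf"} 5
req_seconds_sum 1.2
req_seconds_count 5
# TYPE req_total counter
req_total{code="200"} 4
req_total{code="500"} 1
`

func main() {
	counts := countByFamily(exampleScrape)
	fmt.Println(counts["req_seconds"], counts["req_total"]) // 4 2
}
```

Even this small version needs a type table and suffix handling, which is already well beyond what a grep and sed pipeline can express comfortably.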

Introducing scrapecli

scrapecli is a small command-line tool that reads a Prometheus scrape from stdin and prints a structured summary. It uses the official Prometheus client libraries to parse the exposition format, so it understands metric types correctly: histograms and summaries are counted as single families, not fragmented by suffix.

Installation

If you have Go installed:

go install github.com/FRosner/scrapecli@latest

Alternatively, download a pre-built binary for your platform from the releases page.

Basic Usage

Pipe any Prometheus scrape into scrapecli:

curl -s localhost:9090/metrics | scrapecli

Or analyse a saved scrape file:

cat prometheus-scrape.txt | scrapecli

Running it against the same Prometheus scrape from the previous section produces:

What the Output Tells You

Summary gives an immediate sense of the scrape's overall footprint: total size in bytes and which metrics are consuming the most series. Contrast the Top Metrics list with the output of the shell pipeline from the previous section: prometheus_engine_query_duration_histogram_seconds is correctly listed as a single family with 20 series, rather than appearing as a fragmented _bucket entry. Each entry also shows its byte contribution, making it easy to see which metrics dominate the scrape size.

Types breaks down the metric count by type. Seeing 20 histograms in a scrape is a prompt to check their bucket counts and label cardinality, since histograms multiply series quickly.

Labels shows every label name that appears across the scrape, how many distinct values it takes globally, and how many metric families use it. The handler label having 59 distinct values across 3 metrics immediately explains why prometheus_http_requests_total leads the cardinality ranking. The <none> entry counts metrics that carry no labels at all, which is useful context for understanding what fraction of the scrape is label-free.

Metrics lists every metric family with its type, series count, labels, and description. This is where you can quickly scan for unfamiliar metrics, check whether a metric's description matches your expectation, or spot a metric with an unexpectedly high cardinality.

JSON Output for Scripting

Passing -o json emits the same information as structured JSON, which is useful for feeding into other tools or automating cardinality checks in CI:

cat prometheus-scrape.txt | scrapecli -o json

{
  "summary": {
    "bytes": 79033,
    "top_cardinalities": [
      { "name": "prometheus_http_requests_total", "cardinality": 59 },
      { "name": "prometheus_engine_query_duration_histogram_seconds", "cardinality": 20 },
      ...
    ],
    "type_counts": {
      "counter": 109,
      "gauge": 90,
      "histogram": 20,
      "summary": 11
    },
    ...
  },
  "metrics": [...]
}

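You could, for example, use jq to fail a CI step if any metric in the top list exceeds a cardinality threshold. The -e flag makes jq exit with a non-zero status when the expression evaluates to false, which is what actually fails the step:

cat prometheus-scrape.txt | scrapecli -o json \
  | jq -e '[.summary.top_cardinalities[] | select(.cardinality > 100)] | length == 0'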
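If the check should live in a Go test or a small helper binary instead of a shell step, the same idea might look like the sketch below. It models only the summary.top_cardinalities fields visible in the JSON excerpt above; the report type, the offenders function, and the threshold of 50 are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report models only the fields of the JSON output that appear in the
// excerpt; everything else is ignored by the decoder.
type report struct {
	Summary struct {
		TopCardinalities []struct {
			Name        string `json:"name"`
			Cardinality int    `json:"cardinality"`
		} `json:"top_cardinalities"`
	} `json:"summary"`
}

// offenders returns the names of metrics in top_cardinalities whose
// cardinality exceeds limit.
func offenders(raw []byte, limit int) ([]string, error) {
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	var names []string
	for _, m := range r.Summary.TopCardinalities {
		if m.Cardinality > limit {
			names = append(names, m.Name)
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{"summary":{"top_cardinalities":[
		{"name":"prometheus_http_requests_total","cardinality":59},
		{"name":"prometheus_engine_query_duration_histogram_seconds","cardinality":20}]}}`)
	names, err := offenders(raw, 50)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [prometheus_http_requests_total]
}
```

A caller would fail the build when the returned slice is non-empty and print the offending names, which is more actionable than a bare true or false.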

Conclusion

High cardinality is one of the most common and costly problems in Prometheus deployments, and it tends to be discovered late, usually when memory usage is already climbing or scrapes are already failing. The root cause is usually straightforward once you can see it: an unbounded label, a metric registry that was never cleaned up, a histogram with too many dimensions. The difficulty is getting a clear view of what is actually in a scrape before things go wrong.

Shell tools can get you part of the way there, but they require careful construction and give inaccurate results for histograms and summaries. Querying Prometheus directly is not always an option, especially when the scrape is too large to ingest or when metadata has been stripped by a remote write pipeline.

scrapecli is a small focused tool for exactly this gap: give it a scrape, and it tells you the size, the cardinality leaders, the type breakdown, and the label landscape, and it gets the numbers right. If you maintain a service that exposes Prometheus metrics, it is worth keeping in your toolbox for those moments when you need to understand what your /metrics endpoint is actually producing.

The project is open source and available at github.com/FRosner/scrapecli. Have you run into oversized scrapes or runaway cardinality in your own setup? I'd love to hear about it in the comments: what caused it, how you found it, and how you fixed it.

Catching Race Conditions in Go

Frank Rosner — Wed, 25 Feb 2026 09:35:09 +0000

Introduction

Concurrency is one of Go's greatest strengths. Goroutines are cheap, channels are expressive, and the standard library is built with concurrency in mind. But with concurrency comes a classic hazard: race conditions.

A race condition occurs when two or more goroutines access the same memory concurrently, and at least one of them is writing. The result depends on the exact scheduling order, which the Go runtime does not guarantee. This means your program may work correctly nine times out of ten, only to produce corrupted data or crash unpredictably in production under load.

What makes race conditions particularly dangerous is their silence. The compiler won't warn you. Your tests may pass. The bug only surfaces under the right (wrong) timing conditions, often making it hard to reproduce and painful to debug.

Fortunately, Go ships with a built-in race detector. By simply adding the -race flag to your go test, go run, or go build commands, Go instruments your code to monitor memory accesses at runtime and report any races it observes, complete with a detailed stack trace pointing you directly to the problem.

In this post, we'll explore how the race detector works, walk through a concrete example, and look at how to make it a standard part of your development workflow.

What is the Race Detector?

Go's race detector is a dynamic analysis tool built directly into the Go toolchain. You don't need to install anything extra; it's been available since Go 1.1.

Under the hood, it is powered by ThreadSanitizer (TSan), a battle-tested runtime instrumentation library originally developed at Google and now maintained as part of the LLVM project. It is also used in C/C++ toolchains like Clang and GCC. When you compile with -race, the Go compiler inserts instrumentation around every memory read and write. At runtime, TSan tracks which goroutine last accessed each memory location and flags any unsynchronized concurrent accesses.

Because it works at runtime, the race detector can only report races that actually happen during a given execution. It won't find races in code paths that aren't exercised by your tests, but there are no false positives. That said, instrumentation comes with overhead: programs compiled with -race typically run 2–20× slower and use 5–10× more memory. This is perfectly acceptable for tests and CI, but means you generally won't ship race-enabled binaries to production.

A Simple Race Condition Example

Let's look at a concrete example. Here is a simple Counter type with an Increment method:

type Counter struct {
  value int
}

func (c *Counter) Increment() {
  c.value++
}

func (c *Counter) Value() int {
  return c.value
}

And a test that increments it 1000 times concurrently:

func TestCounter(t *testing.T) {
  c := Counter{}
  var wg sync.WaitGroup

  for i := 0; i < 1000; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      c.Increment()
    }()
  }

  wg.Wait()

  if c.Value() != 1000 {
    t.Errorf("expected 1000, got %d", c.Value())
  }
}

The problem is in Increment: c.value++ is not a single atomic operation. It compiles to a read, an increment, and a write. If two goroutines interleave those steps, one of the writes gets lost, so the final count ends up lower than 1000.
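The lost update is easier to see when one unlucky schedule is written out by hand. The following sketch is sequential on purpose: it does not contain a race itself, it only replays the order of reads and writes that two goroutines, called A and B here, could produce.

```go
package main

import "fmt"

// interleaved replays one possible schedule of two concurrent c.value++
// operations, with each step written out explicitly.
func interleaved() int {
	value := 0

	a := value // goroutine A reads 0
	b := value // goroutine B reads 0, before A has written

	value = a + 1 // A writes 1
	value = b + 1 // B also writes 1, overwriting A's update

	return value
}

func main() {
	fmt.Println(interleaved()) // 1, although two increments ran
}
```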

Running go test shows this can fail outright:

--- FAIL: TestCounter (0.00s)
    counter_test.go:23: expected 1000, got 939
FAIL

But even when the test happens to pass (perhaps on a slower machine or a lucky scheduling order), the race is still there, waiting to bite. Running with -race makes it explicit:

==================
WARNING: DATA RACE
Read at 0x00c0000a01e8 by goroutine 9:
  github.com/frosner/go-test-race.(*Counter).Increment()
      counter.go:11 +0x84
  github.com/frosner/go-test-race.TestCounter.func1()
      counter_test.go:16 +0x80

Previous write at 0x00c0000a01e8 by goroutine 7:
  github.com/frosner/go-test-race.(*Counter).Increment()
      counter.go:11 +0x98
  github.com/frosner/go-test-race.TestCounter.func1()
      counter_test.go:16 +0x80

Goroutine 9 (running) created at:
  github.com/frosner/go-test-race.TestCounter()
      counter_test.go:14 +0x74
...
==================

The output tells you exactly what happened: goroutine 9 read c.value at counter.go:11 while goroutine 7 had just written to the same address. Both goroutines were spawned at counter_test.go:14. There's no ambiguity about where to look.

Running the Race Detector

The -race flag works with the three most common Go commands:

go test -race ./...   # run tests with race detection (most common)
go run -race main.go  # run a program with race detection
go build -race        # build a race-enabled binary

For day-to-day development, go test -race ./... is the one you'll use most. The ./... pattern matches every package in the module recursively, so all of your tests run under the detector. Races in code paths the tests never exercise still go unreported. go build -race is useful when you want to run a long-lived service manually and observe it under realistic traffic, for example during load testing or manual QA. Just don't forget to swap it back out before shipping.

Controlling the race detector

The race detector's behaviour can be tuned via the GORACE environment variable, which accepts a space-separated list of options:

GORACE="halt_on_error=1 log_path=/tmp/race" go test -race ./...

The most useful options are:

Option	Default	Description
`halt_on_error`	`0`	Exit immediately on the first race instead of continuing
`log_path`	`stderr`	Write race reports to a file (e.g. `log_path=/tmp/race` produces `/tmp/race.<pid>`)
`strip_path_prefix`	`""`	Remove a path prefix from stack frames to reduce noise

In CI it's often worth setting halt_on_error=1 so that the first detected race fails the build loudly rather than letting the run continue and produce a wall of interleaved reports.

Fixing the Race Condition

There are three idiomatic ways to fix a data race in Go: a mutex, an atomic operation, or a channel. The right choice depends on the complexity of the shared state.

Option 1: sync.Mutex

A mutex is the most general solution. It works for any shared state, including structs with multiple fields that must be updated together:

import "sync"

type Counter struct {
  mu    sync.Mutex
  value int
}

func (c *Counter) Increment() {
  c.mu.Lock()
  defer c.mu.Unlock()
  c.value++
}

func (c *Counter) Value() int {
  c.mu.Lock()
  defer c.mu.Unlock()
  return c.value
}

Note that Value also needs the lock: reading shared memory concurrently with a write is itself a race.

Option 2: sync/atomic

For a single integer counter, sync/atomic is simpler and faster than a mutex:

import "sync/atomic"

type Counter struct {
  value atomic.Int64
}

func (c *Counter) Increment() {
  c.value.Add(1)
}

func (c *Counter) Value() int {
  return int(c.value.Load())
}

atomic.Int64 (introduced in Go 1.19) provides a clean, type-safe API. Use atomics when you only need to update a single value; reach for a mutex as soon as the operation involves multiple fields.

Option 3: Channels

Channels express the Go proverb "share memory by communicating". Instead of protecting shared state with a lock, you give exclusive ownership of the value to a single dedicated goroutine and communicate with it via channels:

type Counter struct {
  inc chan struct{}
  val chan int
}

func NewCounter() *Counter {
  c := &Counter{
    inc: make(chan struct{}),
    val: make(chan int),
  }
  go func() {
    value := 0
    for {
      select {
      case <-c.inc:
        value++
      case c.val <- value:
      }
    }
  }()
  return c
}

func (c *Counter) Increment() {
  c.inc <- struct{}{}
}

func (c *Counter) Value() int {
  return <-c.val
}

The value variable lives exclusively inside the goroutine, so nothing else can touch it. There is no shared memory and therefore no race. Increment sends a signal on inc, and Value receives the current count from val.

The downside is lifecycle management: the background goroutine will leak if the Counter is abandoned. A production-ready version would need a Close method or a context.Context to shut it down. For a simple counter, that's more ceremony than a mutex or atomic warrants, and that's exactly the point. But for more complex stateful objects where multiple fields must stay consistent, this pattern can be a clean and expressive alternative.

Confirming the fix

After applying the mutex fix, go test -race passes cleanly:

ok    github.com/frosner/go-test-race  1.525s

No WARNING: DATA RACE. The race detector's silence is your green light.

Practical Considerations

It's a runtime tool, so coverage matters

As mentioned earlier, the race detector can only report races it actually observes. If a racy code path isn't exercised during a test run, no warning is produced. This means the race detector is only as good as your test suite.

A test that calls Increment once in a single goroutine will never trigger a race, even if the implementation is unsafe. The race in our example was only visible because the test deliberately ran 1000 concurrent goroutines. When writing tests for concurrent code, design them to exercise real concurrency: use multiple goroutines, vary timing with runtime.Gosched() or small sleeps where appropriate, and aim for high coverage of concurrent code paths.

Don't run it in production, but do run it in CI

The 2–20× slowdown and 5–10× memory increase make -race unsuitable for production binaries. However, this overhead is almost always acceptable in a test or CI environment, where correctness matters far more than raw speed.

The ideal setup is to run go test -race ./... on every pull request. Races caught at review time are cheap to fix. Races caught in production (if they're caught at all) can mean hours of debugging a Heisenbug.

The race detector doesn't cover all concurrency bugs

The race detector specifically finds data races: unsynchronized concurrent memory accesses. It will not catch deadlocks, livelocks, or logical race conditions where synchronization exists but the logic is still wrong. For those, you still need good tests and careful code review.

Tips for Effective Use

Use -count to repeat tests

Because the race detector only catches races that actually occur, a racy test might get lucky and pass on a single run. Passing -count=N tells Go to run each test N times in the same process, increasing the chances of hitting the problematic interleaving:

go test -race -count=10 ./...

This is particularly useful for tests that involve tight timing windows. It won't guarantee discovery, but it significantly raises the odds.

Use t.Parallel() to increase concurrency

Marking tests with t.Parallel() allows them to run concurrently with each other. This increases the overall goroutine concurrency during the test run, which gives the race detector more opportunities to observe racy interactions, especially across different test cases that share package-level state:

func TestCounter(t *testing.T) {
  t.Parallel()
  // ...
}

Write tests that explicitly exercise concurrency

For any type or function that is intended to be safe for concurrent use, write a test that actually uses it concurrently. The pattern used in our TestCounter (spawning many goroutines, using a sync.WaitGroup to wait for them, then asserting on the result) is a reliable template:

func TestConcurrentAccess(t *testing.T) {
  t.Parallel()
  c := Counter{}
  var wg sync.WaitGroup

  for i := 0; i < 1000; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      c.Increment()
    }()
  }

  wg.Wait()
  if c.Value() != 1000 {
    t.Errorf("expected 1000, got %d", c.Value())
  }
}

If the type is not meant to be used concurrently, document that explicitly. Setting the right expectations for callers is just as important as protecting the types that need it.

Add it to your CI pipeline

The simplest way to ensure -race is always run is to make it part of your standard test command in CI. In GitHub Actions, for example:

- name: Test
  run: go test -race ./...

One line. No extra tooling. You'll catch races on every push before they ever reach production.

Conclusion

Race conditions are among the hardest bugs to debug: non-deterministic, often invisible under normal load, and capable of causing silent data corruption. Go's built-in race detector gives you a powerful, zero-setup tool to catch them before they reach production.

We've seen how -race instruments your code at compile time, reports unsynchronized memory accesses with precise stack traces, and leaves no room for false positives. We've also seen that fixing a detected race is usually straightforward, with sync.Mutex, sync/atomic, or channels all being idiomatic options depending on the situation.

The cost of enabling it is low: one flag, a slower test run, and nothing more. The benefit is a whole class of concurrency bugs caught automatically, on every PR, before anyone is paged at 3am. If you're not already running go test -race ./... in CI, that's the one takeaway from this post. Add it today.

Addressing the Limitations of Local Path Provisioner in Kubernetes

Frank Rosner — Sun, 28 Dec 2025 07:25:51 +0000

Temporary Storage in Kubernetes

In Kubernetes, containers are ephemeral and stateless by default, allowing for easy scaling and management. Some workloads might require storage for temporary files, however. In vanilla Kubernetes, you are presented with the following options:

Mount an emptyDir volume, which will be created on the node where the pod is scheduled. It can either be backed by the node's default disk or the node's memory (tmpfs). Some cloud providers also offer emptyDir backed by local SSDs based on the node type. However, you cannot customize the mount point on the node for emptyDir volumes, which means less flexibility.
Mount a hostPath volume, which allows you to specify a custom mount point on the node. This is not recommended for most applications due to security risks, as it allows mounting arbitrary paths on the node. Also, each pod is "responsible" for mounting the right path to avoid conflicts between pods. There is no separation.
Mount a local volume, which is backed by a statically provisioned local PV. When configured correctly, this approach avoids scheduling issues if your pod is supposed to get the same local data back when it is rescheduled. However, you have to manually create the PVs and manage them, which makes local volumes impractical for most production use cases.

While most workloads might be fine with emptyDir for storing temporary data, some applications have specific I/O requirements, such as configuring the filesystem in a certain way or choosing a certain RAID configuration for optimal performance. Think of databases or caches.

We need a way to provision local storage dynamically, securely, and without conflicts, at a specific path on the node that is backed by a fast local disk. Ideally, we want to avoid scheduling problems and enforce capacity limits. Additionally, emptyDir will be wiped if the pod gets deleted, so we cannot reuse the volume even if the node still exists. This can be inconvenient if you want to reuse the state of your application after a rolling restart, for example.

Local path provisioner provides a way to mount local storage as persistent volumes in Kubernetes dynamically. It checks many of the boxes we are looking for. Let's take a closer look.

How Does Local Path Provisioner Work?

Local path provisioner is a Go application that can be installed in your Kubernetes cluster, e.g. via Helm. Based on your configuration, it will create either hostPath or local based PVs on the node automatically.

After installing the chart in your cluster, you will have access to the local-path storage class. To utilize it, you could create a StatefulSet with the respective volume claim template:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: volume-test
spec:
  serviceName: "test"
  replicas: 2
  selector:
    matchLabels:
      app: volume-test
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
      - name: test-container
        image: busybox
        command: ['sh', '-c', 'echo "Test $(hostname)" > /data/test && sleep 3600']
        volumeMounts:
        - mountPath: /data
          name: local-storage
  volumeClaimTemplates:
  - metadata:
      name: local-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-path
      resources:
        requests:
          storage: 128Mi

After creating the StatefulSet resource, the following events will unfold:

The StatefulSet controller processes the first replica, creating the PVC and pod based on the template and setting the owner references. The pod will reference the PVC and both resources will be Pending.
The PVC control loop detects an unbound PVC, matches the storage class local-path to the provisioner rancher.io/local-path and triggers dynamic provisioning since no matching PV exists.
Local path provisioner watches PVC events via the Kubernetes API. If the storage class is configured with volumeBindingMode: WaitForFirstConsumer (default), it will defer the PV binding to pod scheduling. This is useful because there might be other constraints in your pods such as node selectors, or resource requests, which could result in unschedulable pods if the PV is bound to the wrong node.
The scheduler schedules the pod to a node and the PVC is annotated with volume.kubernetes.io/selected-node, indicating the selected node to the PVC binding controller.
The local path provisioner receives the PVC event, reads the selected-node annotation and creates a PV on the selected node using hostPath (default) in a node specific path, with a pod specific sub-path, avoiding conflicts. This is done via a helper pod that launches a container on the node to do the mounting. The PVC is then bound to the PV.
Kubelet observes that the PVC is fully bound, pulls the container image (if needed), mounts the host dir to the container and starts it.
The StatefulSet controller processes the second replica, similar to the first replica.

With the default configuration, if both replicas were to be scheduled on the same node, the layout on the node would look like this:

/opt/local-path-provisioner/
├── pvc-<uuid>_local-storage_volume-test_default_local-storage-volume-test-0/
│   └── test
└── pvc-<uuid>_local-storage_volume-test_default_local-storage-volume-test-1/
    └── test

When the PVC is deleted (and the reclaim policy is Delete), the PV will be deleted as well. The provisioner will detect this and clean up the host directory by scheduling another helper pod.

What are the Limitations of Local Path Provisioner?

While local path provisioner addresses the main shortcomings of emptyDir, local and hostPath by dynamically and securely provisioning local volumes on nodes in a conflict-free manner, it comes with a few limitations of its own.

First, if a node with a bound local-path PV gets removed from the cluster, the provisioner cannot schedule the helper pod to unmount the PV upon PVC deletion, and thus the PV remains "stuck" until manually deleted (see #215).

While orphaned PVs are a minor inconvenience, the second issue is more severe: Since local-path PVCs are tied to a node, but the lifecycle of a PVC in a StatefulSet is decoupled from the pod lifecycle, pods can become unschedulable when they are recreated, because the node selected for the PVC is no longer part of the cluster, is full, or is otherwise unsuitable for scheduling. This leads to a service outage, with pods stuck in the Pending phase until the PVC is deleted manually.

Thirdly, while Kubernetes allows you to specify storage limits for PVCs, local path provisioner does not enforce them. This can lead to overcommitting resources causing unexpected out of disk errors in the applications.

Luckily, there are different building blocks we can combine to address these issues: local ephemeral storage, filesystem quotas, generic ephemeral volumes, and a custom application I call local path cleaner. Let's dive into the details.

Local Ephemeral Storage

The concept of local ephemeral storage was introduced in 2017 (v1.7) and reached GA in 2022 (v1.25). This means you can specify storage requests (and limits) in your container specification:

containers:
- name: test-container
  image: busybox
  command: ['sh', '-c', 'echo "Test $(hostname)" > /data/test && sleep 3600']
  resources:
    requests:
      ephemeral-storage: "5Gi"
  volumeMounts:
  - mountPath: /data
    name: local-storage

The scheduler will take storage requirements into account when scheduling pods. We can use this to avoid overcommitting local storage on a node. However, local ephemeral storage and persistent volumes serve different purposes (one is ephemeral, the other persistent). Kubernetes does not track PVC volumes as ephemeral storage consumption so we cannot combine local ephemeral storage requests and local-path PVCs out of the box.
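
To make the scheduling behaviour more concrete, here is a minimal sketch of the capacity check. It assumes a deliberately simplified model: a pod fits if the sum of the ephemeral-storage requests already on the node plus its own request stays within the node's allocatable amount. The helper names parse_quantity and fits_on_node are hypothetical, the parser only handles integer quantities, and the real scheduler takes many more factors into account.

```python
# Binary and decimal suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> int:
    """Convert an integer quantity such as '5Gi' or '128Mi' to bytes (simplified)."""
    # Check two-letter suffixes first so that 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)  # plain number of bytes


def fits_on_node(allocatable: str, existing_requests: list[str], new_request: str) -> bool:
    """Return True if the new request still fits into the node's allocatable storage."""
    used = sum(parse_quantity(r) for r in existing_requests)
    return used + parse_quantity(new_request) <= parse_quantity(allocatable)
```

With this model, a node with 500Gi allocatable that already hosts two pods requesting 200Gi each can still accept a 100Gi pod, but not a 101Gi one.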

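Luckily, with a trick during node provisioning, we can still achieve what we are looking for. Let's consider GCP as an example. In our startup script, we might manually combine multiple local NVMe SSDs into a RAID0 device. Note that the doubled $$ in the script is an escape for template engines that would otherwise interpolate ${...} (Terraform's templatefile is one example); if you run it as a plain shell script, use a single $: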

# Find all SSDs
SSDs=($(readlink -f /dev/disk/by-id/google-local-nvme-ssd-*))

# Create RAID0 device
mdadm --create /dev/md0 \
  --level=0 --force \
  "--raid-devices=$${#SSDs[@]}" \
  "$${SSDs[@]}"

# Format RAID0 device
mkfs.xfs -s size=4096 /dev/md0

# Mount RAID0 device to /mnt/disks/ssd-array
mkdir -p /mnt/disks/ssd-array
mount /dev/md0 /mnt/disks/ssd-array
chmod a+w /mnt/disks/ssd-array

# Create fstab entry (to survive reboots)
raid_dev_uuid=$(blkid | grep dev/md0 | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
echo "UUID=$raid_dev_uuid /mnt/disks/ssd-array xfs defaults,nofail,noatime 0 0" |\
  tee -a /etc/fstab

# Disable NODE_LOCAL_SSDS_EPHEMERAL as we manage ephemeral storage ourselves
sed -i 's|readonly NODE_LOCAL_SSDS_EPHEMERAL=true|readonly NODE_LOCAL_SSDS_EPHEMERAL=false|' \
  "$${KUBE_HOME}/kube-env"

Kubelet tracks ephemeral storage in certain locations. By bind mounting these into our RAID0 mount, we effectively enable Kubernetes to track the capacity of our custom local storage.

mkdir -p /mnt/disks/ssd-array/lib/kubelet
mv /var/lib/kubelet/* /mnt/disks/ssd-array/lib/kubelet
mount --bind /mnt/disks/ssd-array/lib/kubelet /var/lib/kubelet

mkdir -p /mnt/disks/ssd-array/lib/containerd
mv /var/lib/containerd/* /mnt/disks/ssd-array/lib/containerd
mount --bind /mnt/disks/ssd-array/lib/containerd /var/lib/containerd

mkdir -p /mnt/disks/ssd-array/stateful_partition
mount --bind /mnt/disks/ssd-array/stateful_partition /mnt/stateful_partition

Alternatively, we could hard code the available ephemeral storage capacity in the kubelet config based on the available space on the RAID0 device. While this would allow the scheduler to take storage requests into account for your local-path PVs, tracking actual usage will not work. If you wanted it to be 500Gi, you could run:

sed -i -E 's/(ephemeral-storage:).*/\1 500Gi/' /home/kubernetes/kubelet-config.yaml

When querying the node capacity, you should see the ephemeral storage capacity reflected:

status:
  capacity:
    cpu: "16"
    ephemeral-storage: 500Gi
    memory: 128Gi
    pods: "110"

Now all we need to do is tell local path provisioner to use our custom mount point instead of the default /opt/local-path-provisioner. We can do this by customizing the ConfigMap via Helm:

nodePathMap:
  - node: DEFAULT_PATH_FOR_NON_LISTED_NODES
    paths:
      - /mnt/disks/ssd-array/

If set up correctly, this should prevent Kubernetes from overcommitting local-path PVCs on a node. I admit that this is a bit of a hacky solution with multiple drawbacks:

We have to specify the requested storage capacity in two places: In the PVC and in the pod spec.
Ephemeral storage usage tracking might be off, causing kubelet to not properly enforce ephemeral storage limits.
We are repurposing the local ephemeral storage concept, which might cause confusion in larger organizations where many teams share the same multi-tenant Kubernetes cluster.

If you wanted to avoid overcommitting without ephemeral storage requests, you could try to align CPU and memory requests with the expected storage usage. Either way, once you have the overcommitting problem under control, we can move to enforcing the storage limits.

Filesystem Quotas

By default, containers that have a local-path PVC mounted can use as much space in the volume as they want, regardless of the space they requested. This can lead to noisy neighbor issues such as unexpected out of disk errors in the applications. Note that by requested space we are referring to the storage requests of the PVC, not the ephemeral storage requests of the container.

Fortunately, filesystems such as XFS support configuring storage quotas. There is an excellent minimal example in the local path provisioner repository.

xfsPath=$(dirname "$VOL_DIR")
pvcName=$(basename "$VOL_DIR")

mkdir -p "$VOL_DIR"

type=`stat -f -c %T ${xfsPath}`
if [ ${type} == 'xfs' ]; then
    project=`cat /etc/projects | tail -n 1`
    id=`echo ${project%:*}`

    if [ ! ${project} ]; then
        id=1
    else
        id=$[${id}+1]
    fi

    echo "${id}:${VOL_DIR}" >> /etc/projects
    echo "${pvcName}:${id}" >> /etc/projid

    xfs_quota -x -c "project -s ${pvcName}"
    xfs_quota -x -c "limit -p bhard=${VOL_SIZE_BYTES} ${pvcName}" ${xfsPath}
    xfs_quota -x -c "report -pbih" ${xfsPath}
fi

The script first checks if the filesystem is XFS. If not, we exit and the PV is created without quotas. Then, it reads the project file to determine if there are any existing projects so we can pick the next project ID. Project files look like this:

1:/some/path
2:/another/path

We then increment the last project ID and create a new project for our PVC. Finally, we initialize the quota record for the project, set the limit, and print a report for debugging purposes.
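
The ID selection is the part of the script that is easiest to get wrong, so here is a small sketch of the same logic in Python. The function name next_project_id is hypothetical. Like the shell script, it assumes that project IDs are appended to /etc/projects in increasing order, so the last line carries the highest ID.

```python
def next_project_id(projects_file_content: str) -> int:
    """Return the ID for a new XFS project, given the content of /etc/projects."""
    # Ignore empty lines, e.g. a trailing newline at the end of the file.
    lines = [line for line in projects_file_content.splitlines() if line.strip()]
    if not lines:
        return 1  # no projects yet, start at 1
    # Each line has the form "<id>:<path>"; take the ID of the last entry.
    last_id = int(lines[-1].split(":", 1)[0])
    return last_id + 1
```

For the example project file above, this yields 3 as the next ID. Note that neither this sketch nor the shell script protects against two helper pods appending to the file at the same time.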

We can then pass this script via the Helm value configmap.setup. To avoid inconsistencies, it's wise to write a corresponding script for configmap.teardown that removes the quota + limits for the PVC. Note that for this approach to work, your node needs to have project quotas enabled on the mount point and your helper image needs to have xfsprogs-extra installed. We can achieve the former by modifying our init script mount and /etc/fstab contents, adding the prjquota option.

mount -o prjquota /dev/md0 /mnt/disks/ssd-array
# ...
echo "UUID=$raid_dev_uuid /mnt/disks/ssd-array xfs defaults,nofail,noatime,prjquota 0 0" |\
  sudo tee -a /etc/fstab

To achieve the latter, you can specify a custom helper pod via the Helm value configmap.helperPod that uses a container image with the required dependency installed (e.g. apk --no-cache add xfsprogs-extra). We now have a way to address overcommitting and to enforce storage limits, which allows us to safely put multiple pods with local-path PVCs on the same node. Next, let's see how we can avoid unschedulable pods.

Generic Ephemeral Volumes

Generic ephemeral volumes are similar to emptyDir in that their lifecycle is bound to the pod. However, they allow accessing arbitrary PVC storage classes via a volume claim template. We can modify our StatefulSet to use a generic ephemeral volume by moving spec.volumeClaimTemplates[0] into spec.template.spec.volumes[0].ephemeral.volumeClaimTemplate:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: volume-test
spec:
  serviceName: "test"
  replicas: 2
  selector:
    matchLabels:
      app: volume-test
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
        - name: test-container
          image: busybox
          command: ['sh', '-c', 'echo "Test $(hostname)" > /data/test && sleep 3600']
          volumeMounts:
            - mountPath: /data
              name: local-storage
      volumes:
        - name: local-storage
          ephemeral:
            volumeClaimTemplate:
              metadata:
                name: local-storage
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: local-path
                resources:
                  requests:
                    storage: 128Mi

This shifts the responsibility for creating the local-path PVC from the StatefulSet controller to the ephemeral volume controller. In turn, the owner reference of the PVC will point to the pod and no longer the StatefulSet. When the pod is deleted, the Kubernetes garbage collector will delete the PVC, and if the reclaim policy is Delete, the PV as well.
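
If you want to verify this behaviour in a cluster, two small checks are handy. The sketch below assumes you have the PVC manifest available as a plain dictionary (for example parsed from kubectl get pvc -o json); the function names are hypothetical. It relies on the documented naming scheme for generic ephemeral volumes, where the PVC name is the pod name and the volume name joined by a hyphen.

```python
def ephemeral_pvc_name(pod_name: str, volume_name: str) -> str:
    """Name of the PVC that Kubernetes creates for a generic ephemeral volume."""
    return f"{pod_name}-{volume_name}"


def is_owned_by_pod(pvc: dict, pod_name: str) -> bool:
    """Check whether a PVC manifest (as a dict) has an owner reference to the given pod."""
    owners = pvc.get("metadata", {}).get("ownerReferences") or []
    return any(o.get("kind") == "Pod" and o.get("name") == pod_name for o in owners)
```

For the first replica of our StatefulSet, ephemeral_pvc_name("volume-test-0", "local-storage") gives volume-test-0-local-storage, and is_owned_by_pod should return True for that PVC and the pod volume-test-0.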

This should prevent the situation where the new STS replica becomes unschedulable because the old PVC is still bound to a non-existing or otherwise unsuitable node. Note that this approach does not work if you want to reuse the existing PVCs, e.g. to facilitate rolling restarts / upgrades without losing temporary data (as long as the replica can get scheduled to the same node). Let's investigate a different approach you can use in case you want to reuse the PVCs, or are on an older Kubernetes version in which generic ephemeral volumes are not yet available (the feature became stable in v1.23).

Local Path Cleaner

A few years ago, I wrote a Python application that I named "local path cleaner". It fulfills the following objectives:

Clean up released local-path PVs on nodes that are no longer part of the cluster.
Clean up local-path PVCs and the corresponding pods that are unschedulable.

Cleaning Released PVs

To clean released local-path PVs, we implement the following steps:

List all released local-path PVs.
For each PV, check if the node it is bound to is part of the cluster.
If the node is not part of the cluster, delete the PV.

Let's walk through the code. We're going to use the Kubernetes Python client to interact with the API. First, we use list_persistent_volume to list all PVs. In Kubernetes, PVs are not namespaced, so we can list them all at once. The snippet follows the continue token returned by the API; in production you should also pass a page size via the limit parameter so that large clusters are actually listed in chunks.

def get_released_local_path_pvs(v1: CoreV1Api):
    result = []
    _continue = None

    while True:
        pvs: V1PersistentVolumeList = v1.list_persistent_volume(
            watch=False, 
            _continue=_continue,
        )
        for pv in pvs.items:
            storage_class = pv.spec.storage_class_name
            phase = pv.status.phase
            if storage_class == 'local-path' and phase == "Released":
                result.append(pv)
        _continue = pvs.metadata._continue
        if not _continue:
            break

    return result

Next, let's write a similar function to obtain all nodes:

def get_nodes(v1: CoreV1Api) -> list[V1Node]:
    result: list[V1Node] = []
    _continue = None

    while True:
        nodes: V1NodeList = v1.list_node(
            watch=False, 
            _continue=_continue
        )
        for node in nodes.items:
            result.append(node)
        _continue = nodes.metadata._continue
        if not _continue:
            break

    return result

Then, we can find the PVs that are bound to nodes which are no longer part of the cluster. We'll walk through all the PVs, checking the node affinity selector terms to determine the assigned node. E.g. the following PV is bound to node gke-main-data-node-c51c4677-285f:

{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "annotations": {
      "pv.kubernetes.io/provisioned-by": "cluster.local/local-path-storage-local-path-provisioner"
    },
    "name": "pvc-db-001"
  },
  "spec": {
    "capacity": {
      "storage": "2000Gi"
    },
    "hostPath": {
      "path": "/mnt/disks/ssd-array/pvc-db-app-0",
      "type": "DirectoryOrCreate"
    },
    "nodeAffinity": {
      "required": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "kubernetes.io/hostname",
                "operator": "In",
                "values": [
                  "gke-main-data-node-c51c4677-285f"
                ]
              }
            ]
          }
        ]
      }
    },
    "persistentVolumeReclaimPolicy": "Delete",
    "storageClassName": "local-path",
    "volumeMode": "Filesystem"
  }
}

And here's the Python code. We don't delete the PVs immediately but gather them first so we can implement a dry-run mode where we simply log all the PVs we would have deleted.

def find_pvs_on_missing_nodes(v1: CoreV1Api, pvs):
    nodes: list[V1Node] = get_nodes(v1)
    deletion_candidates = []
    node_names = set(map(lambda n: n.metadata.name, nodes))
    pv: V1PersistentVolume
    for pv in pvs:
        node_selector_match_expression = pv.spec.node_affinity.required.node_selector_terms[0].match_expressions[0]
        # Parentheses instead of backslashes: a trailing space after "\" is a syntax error
        if (
            node_selector_match_expression.key == 'kubernetes.io/hostname'
            and node_selector_match_expression.operator == 'In'
            and node_selector_match_expression.values[0] not in node_names
        ):
            deletion_candidates.append(pv)
    return deletion_candidates
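
The function above only looks at the first selector term and the first match expression, which is what local path provisioner produces in the example PV. As a hedged alternative, the following sketch walks all terms and expressions and tolerates PVs without node affinity. The names pv_hostnames, is_on_missing_node and fake_pv are hypothetical; fake_pv only builds a stand-in object with the same attribute names as V1PersistentVolume so the logic can be tried without a cluster.

```python
from types import SimpleNamespace


def pv_hostnames(pv) -> set[str]:
    """Collect all hostnames a PV is pinned to via kubernetes.io/hostname 'In' expressions."""
    hostnames: set[str] = set()
    affinity = getattr(pv.spec, "node_affinity", None)
    if affinity is None or affinity.required is None:
        return hostnames
    for term in affinity.required.node_selector_terms or []:
        for expr in term.match_expressions or []:
            if expr.key == "kubernetes.io/hostname" and expr.operator == "In":
                hostnames.update(expr.values or [])
    return hostnames


def is_on_missing_node(pv, node_names: set[str]) -> bool:
    """True if the PV is pinned to at least one host and none of them is in the cluster."""
    hostnames = pv_hostnames(pv)
    return bool(hostnames) and hostnames.isdisjoint(node_names)


def fake_pv(hostname: str) -> SimpleNamespace:
    """Build a minimal stand-in for V1PersistentVolume (for illustration only)."""
    expr = SimpleNamespace(key="kubernetes.io/hostname", operator="In", values=[hostname])
    term = SimpleNamespace(match_expressions=[expr])
    required = SimpleNamespace(node_selector_terms=[term])
    return SimpleNamespace(spec=SimpleNamespace(node_affinity=SimpleNamespace(required=required)))
```

A PV without any hostname affinity is never reported as a deletion candidate here, which errs on the side of keeping data.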

Finally, we can put everything together and delete the PVs:

def clean_released_pvs(v1: CoreV1Api):
    pvs = get_released_local_path_pvs(v1)
    deletion_candidates = find_pvs_on_missing_nodes(v1, pvs)
    for candidate in deletion_candidates:
        v1.delete_persistent_volume(candidate.metadata.name)
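
The dry-run mode mentioned earlier could look roughly like this. This is a sketch and not the original implementation: the function name delete_pvs, the dry_run parameter and the logger name are assumptions. Only delete_persistent_volume is an actual method of the Kubernetes Python client.

```python
import logging

logger = logging.getLogger("local-path-cleaner")


def delete_pvs(v1, deletion_candidates, dry_run: bool = True) -> list[str]:
    """Delete the given PVs, or only log them when dry_run is set.

    Returns the names of the PVs that were (or would have been) deleted.
    """
    names = []
    for candidate in deletion_candidates:
        name = candidate.metadata.name
        if dry_run:
            logger.info("Dry run: would delete PV %s", name)
        else:
            logger.info("Deleting PV %s", name)
            v1.delete_persistent_volume(name)
        names.append(name)
    return names
```

Defaulting dry_run to True means that a misconfigured deployment logs its intentions instead of deleting volumes.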

Cleaning Unschedulable Pods

When using local-path PVCs via StatefulSet instead of ephemeral volumes on the pod level, pods can become unschedulable. A common reason is that the PVC is bound to a node that has been scaled down by the cluster autoscaler, has been cordoned, or another workload has moved onto it so it does not have enough capacity.

While it is straightforward to reliably detect the first case, by comparing the node the PVC is bound to with the list of active nodes, I did not find a way to detect the other two cases. The Kubernetes events sometimes show hints about volume affinity conflicts, but this did not happen reliably in all cases.

In the end I decided to purge bound local-path PVCs of unschedulable pods aggressively, as they could always be recreated and I'd rather live with losing temporary data than dealing with a prolonged service outage. Here's the high level algorithm:

List all pending pods.
Identify unschedulable pods from the pending pods.
List all bound local-path PVCs.
For each unschedulable pod, check if it has a bound local-path PVC.
Delete the PVC and the pod, if the pod has a managing controller.
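
The "managing controller" check in the last step can be sketched as follows. The function name get_controller_ref is hypothetical; it relies on the owner_references list of the pod metadata, where the managing controller (for example a StatefulSet) is the entry with controller set to True.

```python
def get_controller_ref(pod):
    """Return the owner reference marked as controller, or None if the pod is unmanaged."""
    owner_references = pod.metadata.owner_references or []
    return next((ref for ref in owner_references if ref.controller), None)
```

Pods for which this returns None would not be recreated by anything after deletion, so the cleaner should leave them alone.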

I found that deleting not only the PVC but also the pod reduces the time to recovery, as the managing controller will immediately recreate both the PVC and the pod in that case, triggering the scheduler and subsequently the local-path provisioner to get the pod onto a new node. Let's build the code step by step:

First, we want to list all pending pods. We can use the list_pod_for_all_namespaces method with the field selector status.phase=Pending for that. Here we could also employ some namespace filtering to only consider pods in namespaces that match a given regular expression.

def get_pending_pods(v1: CoreV1Api):
    result = []
    _continue = None

    while True:
        pods = v1.list_pod_for_all_namespaces(
            watch=False, _continue=_continue,
            field_selector="status.phase=Pending"
        )
        result += pods.items
        _continue = pods.metadata._continue
        if not _continue:
            break

    return result
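
The namespace filtering mentioned above is not part of the snippet. A minimal sketch, assuming a hypothetical helper filter_pods_by_namespace that is applied to the result of get_pending_pods, could look like this:

```python
import re


def filter_pods_by_namespace(pods, namespace_pattern: str):
    """Keep only pods whose namespace fully matches the given regular expression."""
    regex = re.compile(namespace_pattern)
    return [pod for pod in pods if regex.fullmatch(pod.metadata.namespace)]
```

Using fullmatch instead of search avoids surprises where a pattern such as prod also matches a namespace called non-prod.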

Next, we keep only unschedulable pods. The information whether a pod is unschedulable is stored in status.conditions. Here's an example:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2025-01-10T13:20:15Z",
                "message": "0/3 nodes are available: 1 node(s) were unschedulable. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",
        "qosClass": "Burstable"
    }
}

Based on that, we can write a helper function get_condition that returns the condition of a given type from a pod (or None if it does not exist):

def get_condition(pod: V1Pod, condition_type):
    pod_status: V1PodStatus = pod.status
    condition: V1PodCondition
    return next(
        (
            condition
            for condition in (pod_status.conditions or [])  # conditions may be None
            if condition.type == condition_type
        ),
        None,
    )

Then we can write a function to filter unschedulable pods by checking the PodScheduled condition to be False and the reason to be Unschedulable:

def filter_unschedulable_pods(pods):
    result = []
    for pod in pods:
        pod_condition: V1PodCondition = get_condition(pod, 'PodScheduled')
        if pod_condition is not None \
            and pod_condition.status == 'False' \
            and pod_condition.reason == 'Unschedulable':
            result.append(pod)
    return result

Now that we know which pods are unschedulable, we need to keep only the ones that have bound PVCs. Unfortunately, this information is not available in the pod resource, so we need to fetch the PVCs, too. If you need the ability to filter certain namespaces, you could add that here.

def get_bound_local_path_pvcs(v1: CoreV1Api):
    result = []
    _continue = None

    while True:
        pvcs = v1.list_persistent_volume_claim_for_all_namespaces(
            watch=False,
            _continue=_continue,
        )
        for pvc in pvcs.items:
            storage_class = pvc.spec.storage_class_name
            phase = pvc.status.phase
            if storage_class == 'local-path' and phase == "Bound":
                result.append(pvc)
        _continue = pvcs.metadata._continue
        if not _continue:
            break

    return result

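Now we combine the unschedulable pods and the PVCs to find the pods that have a bound PVC. We convert the bound local-path PVC list into a dictionary keyed by namespace and PVC name, so that we can efficiently check the volumes of each unschedulable pod and match them if possible. Pods whose claims are not among the bound local-path PVCs are skipped.

def find_pods_with_pvcs(pods, pvcs):
    pvc_res = []
    pods_res = []
    # PVC names are only unique within a namespace, so key by (namespace, name)
    pvcs_by_key = {(pvc.metadata.namespace, pvc.metadata.name): pvc for pvc in pvcs}
    for pod in pods:
        for volume in pod.spec.volumes or []:
            if volume.persistent_volume_claim:
                claim_name = volume.persistent_volume_claim.claim_name
                pod_pvc = pvcs_by_key.get((pod.metadata.namespace, claim_name))
                # Skip claims that are not bound local-path PVCs
                if pod_pvc is None:
                    continue
                pods_res.append(pod)
                pvc_res.append(pod_pvc)
                break

    return pvc_res, pods_res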

That's it! Now we can combine everything together. I ended up adding a small sleep call between deleting the PVC and the pod to reduce the risk of hitting a race condition where the pod would get recreated before the PVC, causing it to become unschedulable again.

def clean_unschedulable_pod_pvc_conflicts(v1: CoreV1Api):
    pending_pods = get_pending_pods(v1)
    unschedulable_pods = filter_unschedulable_pods(pending_pods)
    pvcs = get_bound_local_path_pvcs(v1)
    pvc_deletion_candidates, pod_deletion_candidates = find_pods_with_pvcs(unschedulable_pods, pvcs)

    for candidate in pvc_deletion_candidates:
        v1.delete_namespaced_persistent_volume_claim(candidate.metadata.name, candidate.metadata.namespace)

    time.sleep(2)

    for candidate in pod_deletion_candidates:
        v1.delete_namespaced_pod(candidate.metadata.name, candidate.metadata.namespace)

This method relies on the fact that upon deletion of the PVC and the pod, some controller will recreate them. To prevent accidentally deleting pods that are not managed by a StatefulSet, we can add a filter based on the owner reference:

POD_CONTROLLERS = ['StatefulSet']

def get_pod_owner_type(pod):
    owner_references = pod.metadata.owner_references
    if not owner_references:
        return None

    for owner in owner_references:
        if owner.controller:
            return owner.kind

    return None

for pod in pods:
    if get_pod_owner_type(pod) in POD_CONTROLLERS:
        # Delete the pod and PVC
        ...

Operations

We can run this code on a schedule, either by using a Kubernetes cron job, or by having a pod running with a sleep loop. I prefer the long-running pod by using a Deployment, as we want the code to run very frequently to reduce the impact of unschedulable pods. I recommend implementing proper logging, metrics, a dry-run mode, a configurable interval, filters, and flags for the different pieces for optimal operability.
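To make the sleep loop with a configurable interval and a dry-run mode more concrete, here is a minimal sketch. The function run_cleaner and its parameters are assumptions for illustration. It expects a clean_once callable that accepts a dry_run flag, which the clean_unschedulable_pod_pvc_conflicts function shown earlier would need to be extended to support.

```python
import time


def run_cleaner(clean_once, interval_seconds=30.0, dry_run=True,
                max_iterations=None, sleep=time.sleep):
    """Call clean_once(dry_run) repeatedly, pausing interval_seconds between runs.

    max_iterations bounds the loop (None means run forever).
    Returns the number of completed runs.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        try:
            clean_once(dry_run)
        except Exception as error:
            # Keep the loop alive if a single run fails, e.g. due to an API error
            print(f"cleaner run failed: {error}")
        iterations += 1
        # Only pause if another run will follow
        if max_iterations is None or iterations < max_iterations:
            sleep(interval_seconds)
    return iterations
```

Defaulting dry_run to True is a deliberate choice in this sketch: a misconfigured deployment then logs what it would delete instead of deleting it. The injectable sleep function makes the loop easy to test without waiting.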

Since the cleaner has to interact with the Kubernetes API, it needs the following RBAC permissions, if you have RBAC enabled:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: local-path-cleaner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes", "persistentvolumeclaims", "pods"]
    verbs: ["get", "list", "delete"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]

Summary and Conclusion

In this post we explored different options to provide ephemeral or semi-persistent local storage to your Kubernetes workload. For most use cases, emptyDir volumes should be sufficient, and they are the easiest to set up and manage. If you need customizable local storage, consider using local-path PVCs managed by the generic ephemeral volume controller to avoid unschedulable pods. If the local storage needs to be semi-persistent, you can use local-path PVCs managed by a StatefulSet in combination with the local path cleaner.

In my opinion, managing state on Kubernetes has become a lot easier over the past few years. There are different controllers and mechanisms available to assist you. However, I would not call the problem solved, as there are still some use cases that are not covered by the standard Kubernetes building blocks, especially in applications with very specific I/O and operational requirements.

Have you used local path provisioner? What is your experience in terms of operability? Let me know in the comments!

If you liked this post, you can support me on ko-fi.

Investigating Error Logs Using LangGraph, LangChain and Watsonx.ai

Frank Rosner — Thu, 11 Dec 2025 10:36:32 +0000

Introduction

When dealing with production systems, observability plays a key role. It is a vital component of incident investigations, the foundation for monitoring and alerting, and incredibly useful for validation of new functionality, improvements, or bug fixes being shipped. Application logs are a big part of observability.¹ Logs can help us understand what the system was doing at any particular point in time with a high degree of granularity.

However, understanding application logs can be difficult. First, there can be many logs and finding the relevant ones is difficult. Indexing the logs and using a search engine to query them helps, but it cannot tell you which logs are related to the issue you are investigating. When I see an error in the logs that correlates with the timing of the incident, I usually ask myself a bunch of follow-up questions:

Is the error related to or possibly the root cause of the issue?
Is this error a known problem?
If it's known, has it been reported to the right team?
If it has been reported, is it being worked on or even fixed?
If it's fixed, is the fix rolled out to the environment I am investigating?
If it's rolled out, why is the error still happening? Is there a regression?
If there's a regression, has it been reported to the right team?

In the face of an incident, answering these questions can cost you valuable time. You might suggest simply postponing the investigation until the bleeding has stopped, but sometimes valuable information on how to stop the bleeding is hidden in the answers to these questions.

For example, there might be a bug ticket in progress in which someone commented a workaround you can apply. Or there might be a release candidate or hotfix release available you can potentially roll out prematurely. This is why I think it's valuable to investigate the logs deeply during incidents as well. I believe that GenAI can aid in answering these questions quickly for you during an investigation.

In this post we are going to explore how to use GenAI to investigate (error) logs. We are going to use IBM Watsonx.ai and LangGraph in Python. The remainder of the post is structured as follows: First, we will lay the technological foundation, introducing LangGraph, LangChain, and watsonx.ai. Then we will dive into the design and implementation of our solution. We will close the post by summarizing the main findings and giving an outlook for future work.

LangGraph, LangChain and Watsonx.ai

LangGraph is a graph-based orchestration framework for building stateful AI workflows, such as agents. It lets you model an AI application as a directed graph, where:

Nodes are functions (e.g., an LLM call, a tool call, or custom logic) that operate on a shared state.
Edges define how control flows from one node to the next, including conditional branches and loops.
State is an explicit, shared data structure (like a dict or TypedDict) that all nodes can read and update, making it easy to build long-running, stateful agents.

LangChain is often used as a building block within LangGraph nodes. It is a library for connecting LLMs to data and tools. By combining LangChain and LangGraph, you can build AI agents that can reason and act in cycles, adding a human in the loop as needed.

Watsonx.ai is an enterprise AI solution by IBM, which among other functionality, offers managed LLMs. We are going to combine these three tools to build an AI agent to investigate our logs. The required functionality is provided by the following Python modules: ibm-watsonx-ai, langgraph, and langchain-ibm.

To illustrate how the libraries interact, let's look at a simple example. The following code implements a basic agent that has access to a tool to get the weather in a city. We are using the create_agent (successor to create_react_agent) helper function, which creates a pre-built agent graph representing a chat and tool-calling loop maintaining the message history in the global state.

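import os

from langchain.agents import create_agent
from langchain_ibm import ChatWatsonx

llm = ChatWatsonx(
    model_id="meta-llama/llama-3-70b-instruct",
    url=os.getenv("WATSONX_URL"),
)

def get_weather(city: str) -> str:
    """Get the current weather for a given city."""
    return f"Weather in {city}: 30°C and sunny."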

agent = create_agent(llm, tools=[get_weather])

agent.invoke({"messages": [
    {"role": "user", "content": "What is the weather in Berlin?"}
]})

For more complex applications, you can build your own graph, e.g. by utilizing the Graph API. Let's build our log investigation agent next.

Log Investigation Agent

Scope

In the introduction, I shared a few questions we'd like the agent to answer. In the context of this post, let's focus on the following initial high-level functionality:

Search for work, tickets, and conversations relevant to the log. We'll search two systems in parallel to demonstrate how to parallelize work in LangGraph: Jira and GitHub. You might also want to integrate chats like Slack, incident tracking tools, post-mortems, or other relevant tools. If you have access to a company-wide search engine like Glean, you could call that instead of interacting with the different APIs directly.
Gather and include operational context, such as the pod / container name and deployed version. We can use this information to assess the relevance of the found tickets and discussions.
Investigate the relevant tickets and conversations, looking for workarounds.

I am going to wrap this into a lean Dash UI that will look like this:

Let's dive into the implementation details. First, we are going to give a general overview of the architecture and then go into the implementation details of each step.

Defining the State

In LangGraph, state is shared among all nodes and passed along the edges. If multiple nodes modify the same property in the state concurrently, you need to define a custom merge function. For our use case, we will store all intermediate results in the state, so that we can show them to the user after the graph has completed. This helps in building trust in the agent and checking its reasoning.

Here is the base definition of our Pydantic model LogInvestigationState. For now, I will just add the field that holds the raw log text. We will add more fields in the upcoming sections.

class LogInvestigationState(BaseModel):
    # Raw log text as provided by the user
    log_text: Optional[str] = None

Defining the Graph

First, let's define the high level graph architecture. The first node will inspect the provided log and derive search queries for the different systems (Jira and GitHub in our case). We don't want to search for the provided log verbatim as it might be too detailed to yield all relevant results.

After that, we'll run the search across all APIs. This can be parallelized and no LLM is needed. We'll use conditional edges with the Send functionality, which spawns nodes on demand for each of the found tickets. We'll then grade each ticket based on high-level information such as title and description, given the full provided log and operational context, to determine its relevance.

After filtering out irrelevant tickets, we'll merge the graph back together. Finally, we will gather additional context for the relevant tickets, such as comments, linked PRs, etc. and submit a final LLM call that will produce a report for the user.

We'll use wrapper functions (prefixed with node_) that pass required dependencies such as the LLM into the actual node function, as I didn't find a way to inject dependencies such as API clients or LLM clients into the nodes otherwise. Example:

model: BaseChatModel  # your ChatWatsonx instance

def node_extract_ticket_queries(state: LogInvestigationState) -> LogInvestigationState:
    return extract_ticket_queries(model, state)
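As an alternative to hand-written wrappers, dependencies can also be bound with functools.partial from the standard library. The sketch below uses stand-in functions (fake_llm and a simplified extract_ticket_queries working on a plain dict) to show the binding only. Whether your LangGraph version accepts a partial object in add_node is an assumption worth checking, and since partial objects carry no __name__, the node name has to be passed explicitly.

```python
from functools import partial


def extract_ticket_queries(llm, state):
    # Stand-in for the real node function: uses the injected dependency and the state
    return {"jql_substring": llm(state["log_text"])}


def fake_llm(text: str) -> str:
    # Hypothetical stand-in for a model call
    return text.upper()


# Bind the dependency up front; the resulting callable only needs the state
node_extract_ticket_queries = partial(extract_ticket_queries, fake_llm)

update = node_extract_ticket_queries({"log_text": "timeout reached"})
```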

Now let's look at the Python code that builds the graph. The add_node function takes two arguments: the node name and the wrapper function to execute. LangGraph passes the state (state: LogInvestigationState) as an argument. add_edge takes the source and target node names as arguments. add_conditional_edges takes the source node name and our function that returns a list of Send objects, which tell LangGraph which target nodes to spawn. It also takes an optional path map argument, which we'll pass so that the graph can be visualized properly.

def build_graph():
    # Define dependencies and wrapper functions (node_...)
    # ...

    graph = StateGraph(LogInvestigationState)

    graph.add_node(extract_ticket_queries.__name__, node_extract_ticket_queries)
    graph.add_node(search_jira_issues.__name__, node_search_jira_issues)
    graph.add_node(search_github_issues.__name__, node_search_github_issues)
    graph.add_node(grade_jira_ticket.__name__, node_grade_jira_ticket_worker)
    graph.add_node(grade_github_ticket.__name__, node_grade_github_ticket_worker)
    graph.add_node(aggregate_jira_scores.__name__, node_aggregate_jira_scores)
    graph.add_node(aggregate_github_scores.__name__, node_aggregate_github_scores)
    graph.add_node(select_relevant_jira_issues.__name__, node_select_relevant_jira_issues)
    graph.add_node(select_relevant_github_issues.__name__, node_select_relevant_github_issues)
    graph.add_node(investigate_relevant_tickets.__name__, node_investigate_relevant_tickets)

    graph.set_entry_point(extract_ticket_queries.__name__)

    graph.add_edge(extract_ticket_queries.__name__, search_jira_issues.__name__)
    graph.add_edge(extract_ticket_queries.__name__, search_github_issues.__name__)

    graph.add_conditional_edges(
        search_jira_issues.__name__,
        edge_dispatch_jira_grading,
        [grade_jira_ticket.__name__],
    )
    graph.add_conditional_edges(
        search_github_issues.__name__,
        edge_dispatch_github_grading,
        [grade_github_ticket.__name__],
    )

    graph.add_edge(grade_jira_ticket.__name__, aggregate_jira_scores.__name__)
    graph.add_edge(grade_github_ticket.__name__, aggregate_github_scores.__name__)

    graph.add_edge(aggregate_jira_scores.__name__, select_relevant_jira_issues.__name__)
    graph.add_edge(aggregate_github_scores.__name__, select_relevant_github_issues.__name__)

    graph.add_edge(select_relevant_jira_issues.__name__, investigate_relevant_tickets.__name__)
    graph.add_edge(select_relevant_github_issues.__name__, investigate_relevant_tickets.__name__)

    graph.add_edge(investigate_relevant_tickets.__name__, END)

    return graph.compile()

Before we jump into the implementation of the individual steps, let's review a visual representation of the graph. LangGraph comes with some helper functions to plot the graph:

compiled_graph.get_graph().draw_png("log_investigation_graph.png")

Note that the way we split work into separate nodes is a matter of taste to some extent. If a set of nodes are executed sequentially (and not concurrently), and there is no need for memory / checkpointing or other functionalities, we could've merged them into a single node. I decided to keep them separate though as we will be reporting agent progress to the user via the UI. This will be implemented later based on node callbacks. Next, let's look into some of the individual steps in more detail.

Extracting Ticket Queries

The first node will prompt the LLM to analyze the log text and come up with queries to search for relevant information in the different systems. In my case, this will be Jira and GitHub. We'll also ask the LLM to create a summary we can use as a title for a new ticket if we decide to create one later. First, let's extend the state to store the result of this step:

class LogInvestigationState(BaseModel):
    log_text: Optional[str] = None
    jql_substring: Optional[str] = None
    github_query_substring: Optional[str] = None
    ticket_summary: Optional[str] = None

Next, let's come up with a system prompt for the LLM that tells it what to do with the log snippet. Here's what we'll be using:

You are given an application log snippet (may include stack traces).
Extract:
1) a substring to pass to a Jira search (JQL textfields ~ "..."),
2) a substring to pass to a GitHub issues search query, and
3) a concise ticket summary suitable as a GitHub issue title.

Rules for the substrings:

Include information from the first line of the log, as it often contains the most relevant information.

Look for discriminative information that might help find this exact issue. If there is a generic exception and a specific error message, focus on the error message.

If there is an exception type, include it plus relevant pieces of the message if available. Including only the exception name without any discriminative parts from the error will yield too many results, especially for generic exceptions.

Exclude variable data (timestamps, request IDs, hostnames, memory addresses, line numbers, absolute paths).

For Jira, produce a substring to search for in textfields.

For GitHub, produce a query compatible with the GitHub Search Issues API. Prefer free text phrase(s); qualifiers like in:title or in:body are optional.

Rules for the ticket summary/title:

Keep it succinct (<= 12 words) and clear.

Focus on the core error/symptom; avoid IDs, timestamps, stack details.

Do not end with a period.

Now we can implement the extract_ticket_queries function. We are going to use structured output to extract the different fields reliably.

class TicketQueries(BaseModel):
    jql_substring: str
    github_query_substring: str
    ticket_summary: str

Note that you always need to sanitize LLM output, as there is no guarantee that the LLM follows your instructions reliably. This includes quoting characters that might cause issues downstream, as well as shortening results to stay within character limits (e.g. for the title). I omitted this part in the code below for readability.

def extract_ticket_queries(llm: BaseChatModel, state: LogInvestigationState) -> LogInvestigationState:
    log_text = state.log_text
    system_msg = """..."""

    structured_llm = llm.with_structured_output(TicketQueries)
    data: TicketQueries = structured_llm.invoke(
        [
            SystemMessage(content=system_msg),
            HumanMessage(content=log_text),
        ]
    )

    return LogInvestigationState(
        jql_substring=escape_double_quotes(data.jql_substring),
        github_query_substring=escape_double_quotes(data.github_query_substring),
        ticket_summary=data.ticket_summary,
    )
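The sanitization helpers are not shown in the post. A minimal sketch of what they might look like follows. The implementation of escape_double_quotes is an assumption (backslash escaping may or may not suit your target system), and shorten_title with its 120 character limit is a hypothetical helper for keeping titles within bounds.

```python
def escape_double_quotes(text: str) -> str:
    """Escape backslashes and double quotes so the text can sit inside a quoted query string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def shorten_title(title: str, max_length: int = 120) -> str:
    """Collapse whitespace and cut the title to max_length characters, marking the cut with '...'."""
    collapsed = " ".join(title.split())
    if len(collapsed) <= max_length:
        return collapsed
    return collapsed[: max_length - 3].rstrip() + "..."
```

Escaping backslashes before quotes matters: doing it the other way round would double the backslashes that were just inserted.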

Let's look at an example log text and the output of the LLM. Given the following log text:

ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.

It might produce the following output:

TicketQueries(
    jql_substring="Failed to brainwash tenant Timeout of 5s reached",
    github_query_substring="Failed to brainwash tenant Timeout of 5s reached",
    ticket_summary="BrainWasher.java: Failed to brainwash tenant - Timeout of 5s reached",
)

Ticket Search

The code for searching tickets (or other relevant pieces of information such as post mortem documents, chat conversations, etc.) is highly dependent on the respective system / API you are integrating with. For the sake of simplicity, I am going to show only the code for searching Jira tickets using the jira Python package. The code for searching GitHub issues is very similar.

First, let's extend the state to store the results of the search. We'll need custom types to represent the tickets. While the client libraries come with some prebuilt types such as Issue, they are not consistent in the fields they have and probably not serializable (at least not in a way LangGraph can handle). Therefore, we are going to define our own types, such as JiraTicket, which maps the Jira issue data model to our internal data model.

class JiraTicket(BaseModel):
    key: str
    url: Optional[str] = None
    summary: Optional[str] = None
    description: Optional[str] = None
    status: Optional[str] = None

    @classmethod
    def from_jira_issue(cls, issue: jira.resources.Issue) -> "JiraTicket":
        return cls(
            key=issue.key,
            url=issue.permalink(),
            summary=issue.fields.summary,
            description=issue.fields.description,
            status=issue.fields.status.name,
        )

Next, let's extend the LogInvestigationState to store the final queries (so we can show those to the user later on), as well as the results of the search, which are candidates for being relevant tickets.

class LogInvestigationState(BaseModel):
    log_text: Optional[str] = None
    jql_substring: Optional[str] = None
    jql: Optional[str] = None
    jira_candidates: Optional[Dict[str, JiraTicket]] = None
    github_query_substring: Optional[str] = None
    github_query: Optional[str] = None
    github_candidates: Optional[Dict[str, GitHubTicket]] = None
    ticket_summary: Optional[str] = None

Now we can implement the search_jira_issues function. We are going to use a custom JQL query that searches for the provided substring in the text fields of the ticket. We are also going to limit the search to a specific set of projects. You might want to adjust this to your needs, ideally making it configurable via environment variables or similar.

def search_jira_issues(
    state: LogInvestigationState,
    client: JIRA = None,
) -> LogInvestigationState:
    textfield_substring = state.jql_substring
    predicates = [
        f"project in (BW, UI)",
        f'textfields ~ "{textfield_substring}"'
    ]

    jql = " AND ".join(predicates) + " ORDER BY created DESC"

    fields = "summary,description,status"
    issues = client.search_issues(jql_str=jql, maxResults=50, fields=fields)

    results: Dict[str, JiraTicket] = {
        ticket.key: ticket
        for ticket in (JiraTicket.from_jira_issue(issue) for issue in issues)
    }

    return LogInvestigationState(jira_candidates=results, jql=jql)
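Making the project list configurable could be sketched like this. The environment variable name JIRA_PROJECTS and both helper functions are assumptions for illustration. The JQL itself has the same shape as the query built in search_jira_issues.

```python
import os
import re
from typing import List


def projects_from_env(default: str = "BW,UI") -> List[str]:
    """Read a comma-separated project list from the (hypothetical) JIRA_PROJECTS variable."""
    raw = os.getenv("JIRA_PROJECTS", default)
    return [project.strip() for project in raw.split(",") if project.strip()]


def build_jql(textfield_substring: str, projects: List[str]) -> str:
    """Assemble the JQL used for the ticket search with a configurable project list."""
    for project in projects:
        # Reject anything that is not a plain project key to avoid injecting JQL
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", project):
            raise ValueError(f"invalid project key: {project!r}")
    predicates = [
        f"project in ({', '.join(projects)})",
        f'textfields ~ "{textfield_substring}"',
    ]
    return " AND ".join(predicates) + " ORDER BY created DESC"
```

The substring is expected to be escaped already, as done in extract_ticket_queries, so build_jql only validates the project keys.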

With the search results stored in the state, we can investigate each ticket in detail, determining its relevance to the full log line so that we can focus our final investigation only on relevant tickets.

Relevance Scoring

In order to parallelize execution, we will use conditional edges spawning one worker node for each candidate ticket. First, let's define the states we need. We will need a class to hold the relevance result, which will be used for the structured LLM output, too.

class TicketRelevance(BaseModel):
    relevance_score: int = Field(description=
        "Relevance score from 0-100, where 0 is completely irrelevant " +
        "and 100 is highly relevant")
    reasoning: str = Field(description=
        "Brief explanation of why this score was assigned")

Then, we'll extend the shared state to store these results for each ticket in a dictionary from ticket key to relevance.

class LogInvestigationState(BaseModel):
    # ... other keys defined previously ...
    jira_candidates: Optional[Dict[str, JiraTicket]] = None
    jira_ticket_relevance: Annotated[Optional[Dict[str, TicketRelevance]], merge_dicts] = None
    github_candidates: Optional[Dict[str, GitHubTicket]] = None
    github_ticket_relevance: Annotated[Optional[Dict[str, TicketRelevance]], merge_dicts] = None

We are using an Annotated type to tell LangGraph how to merge the state when multiple nodes modify the same property. In this case, we want to keep all entries in the dictionary, so we'll use a custom merge_dicts function. This is important because we are going to process all tickets concurrently, updating the shared state with the results.

def merge_dicts(left: Optional[Dict], right: Optional[Dict]) -> Dict:
    result = {}
    if left:
        result.update(left)
    if right:
        result.update(right)
    return result
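To see what the reducer does with the worker results, the following sketch folds a few partial updates by hand, the way a reducer is applied to the existing value and each incoming update. The ticket keys and the plain integer scores are made up for illustration, and merge_dicts is repeated so that the example runs on its own.

```python
from typing import Dict, Optional


def merge_dicts(left: Optional[Dict], right: Optional[Dict]) -> Dict:
    # Same reducer as above, repeated so the example is self-contained
    result = {}
    if left:
        result.update(left)
    if right:
        result.update(right)
    return result


# Each grading worker returns a partial update with exactly one ticket key
worker_updates = [
    {"BW-1": 85},
    {"BW-2": 10},
    {"BW-3": 60},
]

# Start from the initial None value and apply the updates one by one
merged = None
for update in worker_updates:
    merged = merge_dicts(merged, update)
```

Because every worker writes a distinct key, the order in which the updates arrive does not matter. If two updates shared a key, the later one would win.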

To spawn the grading worker nodes conditionally, we'll use the Send API. The conditional edges invoke a dispatcher function that returns a list of Send objects, each representing a worker node to spawn for the respective ticket. The worker needs to know the log text as well as the ticket information. Here we can also add additional context such as the component name that emitted the log, the deployed version, cluster information or other relevant metadata.

def dispatch_jira_grading(state: LogInvestigationState) -> List[Send]:
    jira_candidates = state.jira_candidates
    log_text = state.log_text

    return [
        Send(grade_jira_ticket.__name__, JiraTicketGradeState(log_text=log_text, ticket=ticket))
        for ticket in (jira_candidates or {}).values()
    ]

The actual grading will be performed by grade_jira_ticket, which invokes our LLM again. We will use a system prompt (TICKET_RELEVANCE_SYSTEM_PROMPT) to tell the model what to do.

Score the ticket from 0-100 based on how likely it is to be related to the log text:

0-19: Completely unrelated

20-39: Possibly related but very weak connection

40-59: Moderately related, shares some keywords or concepts

60-79: Likely related, similar error patterns or context

80-100: Highly relevant, very similar or identical issue

When evaluating relevance, consider:

Error/Log Message Similarity: Does the ticket describe the same or similar error message, exception type, or log pattern?

Entity Matching: Does the ticket mention the same specific entities (e.g., names, UUIDs, file paths, service names, configuration keys)?

Context Similarity: Does the ticket describe similar circumstances, components, or operations?

Provide 2-3 sentences in the reasoning field explaining:

What similarities or differences you found between the log and the ticket

Whether specific entities or error patterns match

Why you assigned this particular relevance score

We'll prefix the system prompt with information on the source (e.g. that this is a Jira ticket) and provide the LLM with a user message that contains the ticket information.

def grade_jira_ticket(llm: BaseChatModel, state: JiraTicketGradeState) -> JiraTicketGradeState:
    ticket = state.ticket
    log_text = state.log_text

    system_msg = f"""You are evaluating the relevance of a Jira ticket to a log error or issue.
{TICKET_RELEVANCE_SYSTEM_PROMPT}"""

    user_msg = f"""Log text:
{log_text}

Jira ticket:
Key: {ticket.key}
Summary: {ticket.summary}
Description: {ticket.description}
"""

    structured_llm = llm.with_structured_output(TicketRelevance)
    result: TicketRelevance = structured_llm.invoke([
        SystemMessage(content=system_msg), 
        HumanMessage(content=user_msg)
    ])

    return JiraTicketGradeState(jira_ticket_relevance={ticket.key: result})

Let's look at an example again. Given the following ticket BW-17993, the model might have the following response.

Relevance Score: 100
Reasoning: The log text and Jira ticket describe the same error message, exception type, and log pattern, with the same specific entity (tenant UUID) and similar circumstances (brainwash component exceeding washing timeout). The ticket and log text are almost identical, indicating a high relevance score.

After grading each ticket, we'll aggregate the results back into the main state in the aggregate_jira_scores and aggregate_github_scores nodes. This follows the map-reduce pattern as designed by LangGraph when using Send.

Relevance Filtering

The relevance filtering is a trivial step, where we simply go through all tickets and select those with a relevance score above a certain threshold. The selected tickets are stored in new fields relevant_jira_issues and relevant_github_issues in the state for further investigation.

class LogInvestigationState(BaseModel):
    # ... other keys defined previously ...
    relevant_jira_issues: Optional[List[JiraTicket]] = None
    relevant_github_issues: Optional[List[GitHubTicket]] = None

I am not going to show the code here, as it is rather trivial. Now that we have a set of relevant tickets identified, the final step is to analyze all of them in the context of the log line to produce a final report.
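Although the filtering code is skipped here, a minimal sketch of the selection step might look like the following. The threshold of 60, the inclusive comparison, and the sorting by score are assumptions, and the dataclasses are simplified stand-ins for JiraTicket and TicketRelevance.

```python
from dataclasses import dataclass
from typing import Dict, List

RELEVANCE_THRESHOLD = 60  # hypothetical cut-off, tune it for your data


@dataclass
class Ticket:  # stand-in for JiraTicket
    key: str
    summary: str


@dataclass
class Relevance:  # stand-in for TicketRelevance
    relevance_score: int
    reasoning: str


def select_relevant(
    candidates: Dict[str, Ticket],
    relevance: Dict[str, Relevance],
    threshold: int = RELEVANCE_THRESHOLD,
) -> List[Ticket]:
    """Return candidates scoring at or above the threshold, most relevant first."""
    scored = [
        (relevance[key].relevance_score, ticket)
        for key, ticket in candidates.items()
        if key in relevance and relevance[key].relevance_score >= threshold
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ticket for _, ticket in scored]
```

Candidates without a score are dropped rather than treated as relevant, so a failed grading call cannot smuggle an ungraded ticket into the final analysis.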

Final Analysis

First, let's extend the state to store the final analysis result:

class LogInvestigationState(BaseModel):
    # ... other keys defined previously ...
    analysis_result: Optional[str] = None

Then, let's come up with a system prompt. For each relevant ticket, we will fetch some additional information before submitting it to the LLM. This includes things like comments, linked PRs, and so on. Here is the system prompt we are going to use and the node function investigate_relevant_tickets implementation.

The user is investigating an issue that has been logged by their application.
They will provide the log text and a list of tickets. Each ticket has a key, a title, a body, and a list of comments (optional).
Some tickets might reference other tickets or pull requests.

Your task is to investigate the relevant tickets and provide a summary of the findings:

Is there any workaround available for the given issue? If there is no workaround mentioned, please state that clearly. Workarounds include, but are not limited to:

Configuration changes such as system properties, environment variables, etc.

Workload changes. If the issue can be prevented by changing the usage pattern, e.g. sending fewer or smaller requests.

Are there any relevant pull requests (PRs) that may contain a fix for the issue? Are they merged?

Focus only on insights that are relevant to the issue at hand that the SRE / production engineer is facing. Discussions that are unrelated to the operational context should be ignored.

Your answer will be embedded into an existing markdown page. If you want to structure your response, please use a simple flat hierarchy with level 5 headings (#####).

def investigate_relevant_tickets(
    llm: BaseChatModel,
    state: LogInvestigationState,
    jira_client: JIRA,
    github_client: Github,
) -> LogInvestigationState:
    system_prompt = f"""..."""

    user_prompt = f"""Log text:

{state.log_text}

Relevant tickets:
"""
    most_relevant_jira_issues = state.relevant_jira_issues or []
    most_relevant_github_issues = state.relevant_github_issues or []

    if len(most_relevant_jira_issues) == 0 and len(most_relevant_github_issues) == 0:
        # We don't want the LLM to hallucinate any wrong information, so we'd rather skip the analysis
        # without any context.
        return LogInvestigationState()

    user_prompt = _add_relevant_jira_issue_context(user_prompt, most_relevant_jira_issues, jira_client)
    user_prompt = _add_relevant_github_issue_context(user_prompt, most_relevant_github_issues, github_client)

    result: BaseMessage = llm.invoke([SystemMessage(content=system_prompt), HumanMessage(content=user_prompt)])
    return LogInvestigationState(analysis_result=result.content)

The helper functions _add_relevant_jira_issue_context and _add_relevant_github_issue_context are fetching additional context. This could be moved to different nodes, too, e.g. after the relevance filtering step. In our example, this is what the final user prompt might look like:

Log text:
ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.

Relevant tickets:

## BW-17993

### Title

Brainwash component exceeds washing timeout

### Body

The brainwash component recently failed with the following error:
ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.

### Comments

Frank Rosner wrote on 2025-10-09T09:01:00.000+0000: We were able to stop the bleeding by increasing the brainwashing timeout to 10s.
-Dbrainwasher.timeout=10s
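
To make the prompt layout above concrete, here is a minimal sketch of how one ticket could be rendered into that markdown section. The Ticket and TicketComment types and the format_ticket_context function are hypothetical names for illustration; the actual helpers in the post fetch their data from Jira and GitHub and are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class TicketComment:
    author: str
    created: str  # timestamp string, passed through unchanged
    body: str


@dataclass
class Ticket:
    key: str
    title: str
    body: str
    comments: list[TicketComment] = field(default_factory=list)


def format_ticket_context(ticket: Ticket) -> str:
    """Render one ticket as a markdown section for the user prompt (hypothetical helper)."""
    lines = [
        f"## {ticket.key}",
        "",
        "### Title",
        "",
        ticket.title,
        "",
        "### Body",
        "",
        ticket.body,
        "",
        "### Comments",
        "",
    ]
    # One line per comment, in the "<author> wrote on <timestamp>: <text>" shape.
    for comment in ticket.comments:
        lines.append(f"{comment.author} wrote on {comment.created}: {comment.body}")
    return "\n".join(lines) + "\n"
```

Appending the result of this function to user_prompt for every relevant ticket would yield a prompt shaped like the example.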

And that's it! If your tickets contain enough information, the LLM should be able to provide a useful analysis and suggest increasing the timeout. Before we package this into the UI, let's add some progress reporting. This will help the user understand what is happening under the hood.

Progress Reporting

In Dash, we will implement the graph execution as a background callback. Dash callbacks are Python functions that the frontend can invoke on certain triggers, such as the user pressing a button. A background callback will be executed asynchronously, and the client can poll for updates. They can also be cancelled.

By passing the progress key to the @callback decorator we can specify which Dash component properties to update when the callback makes progress. Dash will then inject a set_progress function into the callback function that we can use to update the progress, e.g. by setting the value of a progress bar. In our case, we are going to use an individual indicator for each relevant node, which shows whether the node is pending, started, succeeded, or failed. We can use Bootstrap icons and a Dash Bootstrap spinner component to represent those states, for example:

INDICATOR_PENDING = html.I(className="bi bi-circle me-2")
INDICATOR_IN_PROGRESS = html.Span(dbc.Spinner(size="sm"), className="me-2")
INDICATOR_COMPLETED = html.I(className="bi bi-check-circle-fill text-success me-2")
INDICATOR_FAILED = html.I(className="bi bi-exclamation-circle-fill text-danger me-2")
INDICATOR_SKIPPED = html.I(className="bi bi-fast-forward-circle me-2")

How will we know which state each node is in? We could use state streaming, but I decided to go for a custom decorator that we can apply to each node function that will invoke a callback with the node name and its current phase. This also adds error handling capabilities which make the graph more resilient. We'll also add an optional flag that indicates that the node is not critical to the overall execution and can fail without aborting the entire graph. In that case we will not raise the exception but instead return an empty state.

def callback_node(
    callback: Callable[[str, NodePhase, Any], None],
    node_name: str,
    optional: bool = False,
):
    def _decorator(fn: Callable):
        @wraps(fn)
        def _wrapped(*args, **kwargs):
            bound = inspect.signature(fn).bind_partial(*args, **kwargs)
            state = bound.arguments["state"]

            def _safe_invoke(phase: NodePhase, value: Any):
                try:
                    callback(node_name, phase, value)
                except Exception:
                    logger.exception("Callback failed for node=%s phase=%s", node_name, getattr(phase, "value", phase))

            # Before execution
            _safe_invoke(NodePhase.STARTED, state)

            try:
                result = fn(*args, **kwargs)
            except Exception as e:
                # On failure, report FAILURE with the original state
                _safe_invoke(NodePhase.FAILURE, state)
                if optional:
                    logger.warning("Optional node %s failed; returning empty state: %s", node_name, str(e))
                    return LogInvestigationState()
                raise

            # On success, report SUCCESS with the result
            _safe_invoke(NodePhase.SUCCESS, result)
            return result

        return _wrapped

    return _decorator

NodePhase is a custom enum type that represents the different phases a node can be in:

class NodePhase(Enum):
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"

We can now use that decorator to wrap our node functions. The callback can tell which node is reporting from the node name passed to the decorator.

# The caller has to pass a callback that will be used to call `set_progress` in the Dash callback
node_callback: Callable[[str, NodePhase, LogInvestigationState], None] = lambda x, y, z: None

@callback_node(node_callback, investigate_relevant_tickets.__name__, optional=True)
def node_investigate_relevant_tickets(state: LogInvestigationState) -> LogInvestigationState:
    with get_jira_client() as jira:
        with get_github_client() as gh:
            return investigate_relevant_tickets(model, state, jira, gh)
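
The node_callback above is a no-op placeholder. As a rough sketch of what a real callback could do, the class below keeps one indicator per node and pushes the full list to a set_progress-style function on every update. ProgressTracker and the plain-text indicator labels are hypothetical stand-ins for the Dash components, and passing a tuple with one value per progress output is my assumption about how set_progress would be called here.

```python
from enum import Enum
from typing import Any, Callable


class NodePhase(Enum):  # mirrors the enum from the post
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"


# Plain-text stand-ins for the Dash indicator components (hypothetical labels).
PENDING, IN_PROGRESS, COMPLETED, FAILED = "pending", "in_progress", "completed", "failed"

PHASE_TO_INDICATOR = {
    NodePhase.STARTED: IN_PROGRESS,
    NodePhase.SUCCESS: COMPLETED,
    NodePhase.FAILURE: FAILED,
}


class ProgressTracker:
    """Tracks one indicator per node and pushes all indicators on every update."""

    def __init__(self, node_names: list[str], set_progress: Callable[[tuple], None]):
        self._order = list(node_names)
        self._indicators = {name: PENDING for name in node_names}
        self._set_progress = set_progress

    def __call__(self, node_name: str, phase: NodePhase, state: Any) -> None:
        if node_name not in self._indicators:
            return  # ignore nodes that have no indicator in the UI
        self._indicators[node_name] = PHASE_TO_INDICATOR[phase]
        # Emit the indicators in a fixed order, one entry per node.
        self._set_progress(tuple(self._indicators[name] for name in self._order))
```

An instance of such a tracker has the same call shape as node_callback, so it could be passed to callback_node in its place.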

This works great for regular nodes. Dynamically generated nodes from the Send API are a bit more tricky. If no nodes are created, the callback will not be invoked. To address this, we can create custom decorators for the dispatch functions and for the worker nodes, which work together: the dispatching callback marks the workers as started, and each worker node reports its own success or failure. I am not going to go into detail here, but feel free to leave a comment if you are curious about the implementation.

Summary and Conclusion

In this post we have seen how to utilize LangGraph and Watsonx.ai to build a custom log investigation agent. We created a Dash UI to present the investigation progress and results to the user.

The solution we created is already very useful, but of course there is always room for improvement. A few ideas come to mind:

Use a more refined search technique (e.g. vector search) to find relevant tickets and conversations
Add more sources of information
Add further investigation steps, such as investigating code changes, merged PR dates, releases, etc. to reliably identify whether a fix has been rolled out and to detect new regressions
Allow users to provide additional context

For me, this was a fun exercise and my first time working with LangGraph and LangChain. The LangChain toolstack has quite a lot of capabilities I have not explored yet, though. Have you used any of the LangChain tools before? What was your experience? Let me know in the comments.

If you liked this post, you can support me on ko-fi.

According to the Sawmills Observability Report 2025, logs are the main source of spend in observability, too.

libmalloc, jemalloc, tcmalloc, mimalloc - Exploring Different Memory Allocators

Frank Rosner — Mon, 01 Dec 2025 09:29:16 +0000

What Are Memory Allocators?

Applications need memory to store data and code during runtime. Memory can be allocated statically (fixed size at compile time), or dynamically (at runtime). Dynamic memory allocation is crucial when the memory size needed varies during program execution, which is the case for most modern applications.

The stack and the heap are two key memory regions during program execution. Stack allocation is used for function calls and local variables and it happens automatically. The stack allocation lifecycle is tied to the lifecycle of the function. The heap is used for dynamic memory allocation. In non-garbage-collected languages like C++, the programmer is responsible for managing the heap, while in garbage-collected languages like Java, the heap is managed automatically.

In many cases, heap allocation and de-allocation are implemented via a memory allocator, which implements functions like malloc and free. Those generic functions are part of the C standard library, and are implemented by libc, which, on Linux, is glibc by default for most installations. On MacOS, libmalloc is the default implementation.

In Writing My Own Dynamic Memory Management I attempted to write a very simple allocator for my own operating system based on a doubly linked list. The following animation shows how the available heap is managed by the allocator:

Efficient memory allocation is a complex problem, especially on modern computer architectures. Modern allocators combine advanced data structures and algorithms to achieve high performance in concurrent environments.

Especially in performance critical applications, such as databases, web servers, and game engines, the choice of memory allocator can have a significant impact on performance. I wanted to learn more about the different allocators available. In this blog post we are going to compare a few well-known allocators on MacOS:

libmalloc - The default allocator on MacOS, developed by Apple.
jemalloc - Created by Jason Evans originally for FreeBSD to address fragmentation and scaling issues, jemalloc is a scalable allocator widely adopted in performance-critical applications including Firefox and Facebook.
tcmalloc - Developed by Google as part of the Google Performance Tools to enhance multithreaded allocation speed and reduce lock contention using thread-local or core-local caches.
mimalloc - Developed by Microsoft Research as a modern general-purpose allocator, focusing on locality and reducing contention with innovations like page-local free lists and free list sharding for performance gains.
hoard - Designed by Emery Berger and his team at the University of Massachusetts to reduce memory fragmentation and contention in multithreaded systems by partitioning heaps per thread, introduced in the early 2000s as a research-driven allocator.

Allocator Architectures

All modern memory allocators share common architectural concepts to manage dynamic memory efficiently and safely. Allocation requests can come in different sizes, ranging from a few bytes to megabytes or even gigabytes. Allocators need to be equipped with strategies to handle different allocation sizes with minimal overhead and fragmentation. Commonly this is achieved by using some form of segmentation based on the requested size.

Allocators need to track the state of allocated and free memory. This is often done by using data structures that keep track of the state of each memory block. Metadata can be tracked externally, in a separate data structure, or internally within the block, or a combination of both.

In multithreaded environments, concurrency control is necessary to ensure safety when allocating and deallocating memory. Synchronization negatively impacts performance, however, so modern allocators use various techniques to minimize synchronization overhead, e.g. by using thread-local data structures and even entire heap regions. Of course, these techniques come with additional memory overhead.

Benchmarks

What to Compare?

While the interface looks simple, the implementations of those allocators differ significantly. Different allocators have different performance characteristics, and are better suited for different workloads and computer architectures. When comparing allocators, there are several key performance indicators (KPIs) to consider:

Throughput (ops/sec)
Latency (sec/op)
Memory usage (overhead and fragmentation)
Tooling and Usability (debugging, profiling, leak checking, ...)
Maintenance and Security (CVEs, security hardening, etc.)

The workload (allocation size, frequency, number of threads, etc.) impacts these KPIs, so it is important to benchmark your specific workload. While there are also platform restrictions that might limit your choice (e.g. libmalloc is only available on Apple operating systems), I will ignore the platform limitations in the comparison.

Benchmarking Setup

I'm running the benchmarks using Google Benchmark v1.9.4 on my Nov 2023 MacBook Pro (M3) with MacOS 15.6.1, compiled with Apple clang-1700.0.13.5. You can find the source code on GitHub.

While libmalloc is the default allocator on MacOS and part of libSystem, the other allocators are going to be installed via brew. Note that tcmalloc is part of gperftools, and libhoard is a custom tap (brew tap emeryberger/hoard). Here are the versions I am using:

# otool -L build/malloc-post-benchmark-libmalloc
/usr/lib/libSystem.B.dylib (current version 1351.0.0)

# brew info jemalloc | grep Cellar
/opt/homebrew/Cellar/jemalloc/5.3.0

# brew info gperftools | grep Cellar
/opt/homebrew/Cellar/gperftools/2.17.2

# brew info mimalloc | grep Cellar
/opt/homebrew/Cellar/mimalloc/3.1.5

# brew info emeryberger/hoard/libhoard | grep Cellar
/opt/homebrew/Cellar/libhoard/HEAD-5a7073f

I am using CMake to build the benchmark binaries for each allocator. The gist of the CMakeLists.txt is:

set(MALLOC_IMPLEMENTATIONS jemalloc mimalloc hoard tcmalloc)
foreach(MALLOC ${MALLOC_IMPLEMENTATIONS})
    find_library(${MALLOC}_LIBRARY ${MALLOC})
    set(EXE_NAME "${PROJECT_NAME}-benchmark-${MALLOC}")
    add_executable(${EXE_NAME} src/main.cpp)
    target_link_libraries(${EXE_NAME} PRIVATE benchmark::benchmark pthread)
    target_link_libraries(${EXE_NAME} PRIVATE ${${MALLOC}_LIBRARY})
endforeach()

Note that we cannot actively "reset" the allocator between each benchmark run. To avoid interactions between runs, we'll use a bash script to run the individual benchmarks in a loop. Thanks to the --benchmark_filter command line option and the way Google Benchmark builds benchmark names, we can loop over different parameters for a given benchmark, restarting the binary after each run.

run_allocation_throughput_benchmark() {
  local size=$1
  local threads=$2
  echo "Running allocation throughput benchmark for ${MALLOC} with ${size} size, ${threads} threads"
  ${executable} --benchmark_filter="BM_AllocationThroughput/${size}/iterations:1000/threads:${threads}" \
    --benchmark_out="results/${MALLOC}_AllocationThroughput_${size}_${threads}.json" \
    > /dev/null
}

executable_prefix="./build/malloc-post-benchmark-"

for executable in ${executable_prefix}*; do
  MALLOC="${executable#./build/malloc-post-benchmark-}"
  for threads in 1 2 4 8; do
    for size in {1..22}; do
      run_allocation_throughput_benchmark $((2**size)) ${threads}
    done
  done
done

We are storing the results in JSON files, which we combine, analyze and visualize using matplotlib in Python. Here's the structure of a benchmark result file:

{
  "context": {
    "date": "2025-11-26T14:00:48+01:00",
    "host_name": "MyMacBook",
    "executable": "./build/malloc-post-benchmark-hoard",
    "num_cpus": 12,
    "mhz_per_cpu": 24,
    "cpu_scaling_enabled": false,
    "caches": [
      {
        "type": "Data",
        "level": 1,
        "size": 65536,
        "num_sharing": 0
      },
      {
        "type": "Instruction",
        "level": 1,
        "size": 131072,
        "num_sharing": 0
      },
      {
        "type": "Unified",
        "level": 2,
        "size": 4194304,
        "num_sharing": 1
      }
    ],
    "load_avg": [2.55762,2.97754,3.83203],
    "library_version": "v1.9.4",
    "library_build_type": "debug",
    "json_schema_version": 1
  },
  "benchmarks": [
    {
      "name": "BM_AllocationThroughput/2/iterations:1000/threads:1",
      "family_index": 0,
      "per_family_instance_index": 0,
      "run_name": "BM_AllocationThroughput/2/iterations:1000/threads:1",
      "run_type": "iteration",
      "repetitions": 1,
      "repetition_index": 0,
      "threads": 1,
      "iterations": 1000,
      "real_time": 5.1424708217382431e+04,
      "cpu_time": 5.1425000000000051e+04,
      "time_unit": "ns",
      "items_per_second": 1.9445794846864346e+07
    }
  ]
}
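
Since the script writes one JSON file per allocator, size, and thread count, combining them mostly means parsing the benchmark name back into its parameters. Here is a minimal sketch of that step, assuming the name layout seen in the file above. The helper names parse_benchmark_name and flatten_result are hypothetical and not taken from the original analysis code.

```python
import json
import re

# Name layout as seen in the result file above:
# BM_AllocationThroughput/<size>/iterations:<n>/threads:<t>
NAME_PATTERN = re.compile(
    r"^(?P<benchmark>\w+)/(?P<size>\d+)"
    r"/iterations:(?P<iterations>\d+)/threads:(?P<threads>\d+)$"
)


def parse_benchmark_name(name: str) -> dict:
    """Split a benchmark name into its parameters."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected benchmark name: {name}")
    fields = match.groupdict()
    return {
        "benchmark": fields["benchmark"],
        "size": int(fields["size"]),
        "iterations": int(fields["iterations"]),
        "threads": int(fields["threads"]),
    }


def flatten_result(result_json: str, allocator: str) -> list[dict]:
    """Turn one result file into flat rows that are easy to combine and plot."""
    document = json.loads(result_json)
    rows = []
    for entry in document["benchmarks"]:
        row = parse_benchmark_name(entry["name"])
        row["allocator"] = allocator
        row["items_per_second"] = entry["items_per_second"]
        rows.append(row)
    return rows
```

The rows from all files can then be concatenated into one list (or a data frame) and grouped by allocator, size, and thread count for plotting.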

Now with the setup in place, let's look into the different KPIs in greater detail.

Throughput

To measure throughput, we will design a benchmark that, within each iteration, allocates memory of a given size for a fixed number of pointers (1000), then frees and reallocates memory for 1000 of these pointers at random, and finally frees all pointers. This yields a total of 2000 memory allocations and 2000 frees per iteration. For the throughput counter SetItemsProcessed we treat two malloc plus two free calls as one "item".

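#include <benchmark/benchmark.h>

#include <cstdlib>
#include <functional>
#include <random>
#include <thread>
#include <vector>

static void BM_AllocationThroughput(benchmark::State& state) {
    size_t sz = size_t(state.range(0));  // allocation size in bytes
    size_t n = 1000;                     // number of pointers kept alive

    std::vector<void*> ptrs(n);

    for (auto _ : state) {
        // Seed the RNG from the thread id so each thread picks its own slots.
        std::mt19937 rng(std::hash<std::thread::id>{}(std::this_thread::get_id()));
        std::uniform_int_distribution<int> dist(0, n - 1);

        // Phase 1: allocate n blocks.
        for (size_t i = 0; i < n; ++i) {
            ptrs[i] = malloc(sz);
            if (!ptrs[i]) state.SkipWithError("malloc failed");
        }
        benchmark::DoNotOptimize(ptrs);

        // Phase 2: free and reallocate n randomly chosen slots.
        for (size_t i = 0; i < n; ++i) {
            int j = dist(rng);
            free(ptrs[j]);
            ptrs[j] = malloc(sz);
        }

        // Phase 3: free all blocks.
        for (size_t i = 0; i < n; ++i) {
            free(ptrs[i]);
        }
    }

    // One "item" covers two malloc and two free calls (2n of each per iteration).
    state.SetItemsProcessed(state.iterations() * n);
}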
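
To connect the item counter to the numbers in a result file, here is a small worked example. It assumes that the reported cpu_time is the time per iteration and that items_per_second is the item count divided by that time, which is what the sample result shown earlier suggests; the function name is hypothetical.

```python
ITEMS_PER_ITERATION = 1000  # n in BM_AllocationThroughput
MALLOCS_PER_ITEM = 2        # initial allocation plus one reallocation, on average
FREES_PER_ITEM = 2          # one random free plus the final free, on average


def items_per_second(time_per_iteration_ns: float, items: int = ITEMS_PER_ITERATION) -> float:
    """Items processed per second, given the time of one iteration in nanoseconds."""
    return items / (time_per_iteration_ns * 1e-9)


# cpu_time from the sample result file (hoard, 2-byte allocations, 1 thread).
reported_cpu_time_ns = 5.1425000000000051e+04

rate = items_per_second(reported_cpu_time_ns)
malloc_calls_per_second = rate * MALLOCS_PER_ITEM
free_calls_per_second = rate * FREES_PER_ITEM

print(f"{rate:.3e} items/s, {malloc_calls_per_second:.3e} malloc calls/s")
```

Under these assumptions the computed rate agrees with the items_per_second value in the sample file (about 1.94e7), which corresponds to roughly 3.9e7 malloc calls per second.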

We can then run the benchmark for different allocation sizes and different number of threads:

BENCHMARK(BM_AllocationThroughput)
    ->RangeMultiplier(2)
    ->Range(1 << 1, 1 << 25)
    ->Iterations(1000)
    ->Threads(1)
    ->Threads(2)
    ->Threads(4)
    ->Threads(8);

We expect the allocation throughput per thread to decrease with increased parallelism due to the increased synchronization overhead. When plotting the throughput (average "items processed" per second per thread) for allocation sizes of 1KB, we can see that the throughput decreases across the board:

We can also see that hoard has the highest throughput, more than 2x of what mimalloc achieves. This is only one data point, however, as we were looking at 1KB allocations. Let's look at the throughput for different allocation sizes and different number of threads:

As you can see, the different allocators have vastly different throughput characteristics across the different workloads. While both hoard and mimalloc perform very well for small allocations, their throughput decreases rapidly for allocations > 1KB. tcmalloc takes the lead for allocations > 1KB and maintains a steady throughput up to 32KB (2¹⁵ bytes). jemalloc has the lowest throughput for smaller allocation sizes, but maintains a decent throughput especially with increased parallelism compared to mimalloc, hoard, and libmalloc. For very large allocations, only tcmalloc and jemalloc remain competitive, with tcmalloc maintaining 50x the throughput of libmalloc at 4MB allocations.

The sharp drop in throughput for larger allocations in the different allocators can be explained by the way they handle them internally. tcmalloc for example handles small allocations within the per-CPU caches in the front-end, while larger allocations have to go through the central free list, increasing lock contention (see architecture diagram below, taken from the tcmalloc design documentation). The thresholds depend on the page size and can be viewed in the size class definitions.

Next, let's take a look at the latency of the different allocators.

Latency

For the latency benchmark, I was interested in the latency of the malloc call. To measure that, I used manual timing, measuring only the time spent in the malloc call:

static void BM_AllocationLatency(benchmark::State& state) {
    size_t sz = size_t(state.range(0));

    for (auto _ : state) {
        state.PauseTiming();

        auto start = std::chrono::high_resolution_clock::now();
        void* ptr = malloc(sz);
        auto end = std::chrono::high_resolution_clock::now();

        if (!ptr) state.SkipWithError("malloc failed");

        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        state.SetIterationTime(elapsed.count());

        free(ptr);
        state.ResumeTiming();
    }
}

Since the latency difference between small and large allocations spans orders of magnitude, I will plot small (<= 1KB) and larger (> 1KB) results separately:

We can see that all allocators perform small allocations within 20-30ns, except for tcmalloc when combining a larger number of threads with very small (<= 64B) allocation sizes.

For larger allocations in a single-threaded environment, all allocators perform reasonably well. Only mimalloc starts to experience a significant latency increase for allocations > 64KB. When multiple threads come into play, hoard experiences a significant latency increase for allocations between 2KB and 32KB and tcmalloc also starts to see a significant latency increase for allocations > 262KB.

Memory Usage

When it comes to memory usage, we mainly care about two types of overhead:

The allocation overhead when the allocation size is not aligned with the internal page size
The bookkeeping / synchronization overhead (can be per thread, per core, per pointer)

First, let's investigate the allocation overhead. We can use the malloc_size function that is part of the malloc interface on MacOS to determine the actual size of the allocation:

void* ptr = malloc(sz);
size_t actual = malloc_size(ptr);
size_t overhead = actual - sz;
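
The overhead percentages quoted in this section relate the wasted bytes to the requested size. A minimal sketch of that calculation follows; the 16-byte block for a 1-byte request is an assumption that matches the 1500% figure quoted for libmalloc, not a measured value.

```python
def overhead_percent(requested: int, actual: int) -> float:
    """Wasted bytes relative to the requested size, in percent."""
    if requested <= 0:
        raise ValueError("requested size must be positive")
    return (actual - requested) / requested * 100.0


# Assumed block sizes for illustration only.
print(overhead_percent(1, 16))   # 1500.0
print(overhead_percent(16, 32))  # 100.0
print(overhead_percent(64, 64))  # 0.0
```

Because the overhead is relative, the same 15 wasted bytes that look dramatic for a 1-byte request become negligible for larger allocations.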

As expected, for tiny allocations, the overhead is very high (up to 1500% when allocating 1B with libmalloc). If your application is very memory constrained, tcmalloc has a "small-but-slow" mode that can be used when memory footprint minimization is more important than performance.

Starting from 16B, all allocators reach a reasonable overhead <= 100%. Let's look at the overhead for larger allocations:

As you can see the allocation overhead varies a lot between implementations. While jemalloc, mimalloc, and tcmalloc manage to keep overhead below 30% for most allocation sizes, libmalloc and hoard have a much higher overhead. hoard continuously reaches 100% overhead for allocations <= 32KB. libmalloc has peaks at 33B, ~1KB, and 32KB+1B.

In practice, to reduce waste, you should aim for allocations that are powers of two or at least aligned with the page size. On MacOS, you can use malloc_good_size to get the closest size that will not waste space, but it is not available on Linux. You can also prefer fewer, larger, long-lived buffers over many tiny allocations.

The graphs also reveal information about the internal thresholds related to sizes. E.g. on libmalloc, the zone thresholds align with the peaks in the graph:

TINY zone handles allocations up to 1008 bytes (~ 2¹⁰ bytes).
SMALL zone handles allocations from above TINY up to 32 KB (2¹⁵ bytes).
MEDIUM zone handles allocations from above SMALL up to 8 MB (2²³ bytes).
LARGE zone handles allocations beyond the MEDIUM threshold.
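
The thresholds listed above can be turned into a small lookup to see why the overhead peaks sit right behind each boundary. This is only a sketch using the numbers from the list; the real libmalloc thresholds may differ between OS versions and hardware, and the function name is hypothetical.

```python
# Upper bounds in bytes, taken from the list above.
TINY_MAX = 1008
SMALL_MAX = 32 * 1024          # 2**15
MEDIUM_MAX = 8 * 1024 * 1024   # 2**23


def libmalloc_zone(size: int) -> str:
    """Return the zone that would serve a request of the given size."""
    if size <= TINY_MAX:
        return "TINY"
    if size <= SMALL_MAX:
        return "SMALL"
    if size <= MEDIUM_MAX:
        return "MEDIUM"
    return "LARGE"


for size in (33, 1008, 1009, 32 * 1024, 32 * 1024 + 1):
    print(size, libmalloc_zone(size))
```

A request of 32KB+1B is the first one that no longer fits into SMALL, which lines up with one of the libmalloc peaks mentioned earlier.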

In addition to the overhead per allocation, there is also bookkeeping overhead. Since measuring this accurately is more involved, I decided to write a simple program and measure the resident set size (RSS) of the process over time. On MacOS, we can use the Mach API (mach/mach.h), specifically the task_info function.

size_t get_rss_bytes() {
    struct mach_task_basic_info info;
    mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
    if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
                  (task_info_t)&info, &count) == KERN_SUCCESS) {
        return info.resident_size;
    }
    return 0;
}

Next, we design a "worker" function that we can launch in a thread. It will spin until an external atomic flag signals it to stop. First it will fill a vector of pointers with allocated memory, filled with 0s. Once all pointers are filled, it randomly selects a pointer to free and reallocate, simulating a workload. Before stopping, it frees all pointers.

std::atomic<bool> should_stop(false);

void worker_thread(size_t num_pointers, size_t pointer_size) {
    std::vector<void*> pointers;
    pointers.reserve(num_pointers);

    while (!should_stop.load()) {
        if (pointers.size() < num_pointers) {
            void* ptr = malloc(pointer_size);
            if (ptr != nullptr) {
                memset(ptr, 0, pointer_size);
                pointers.push_back(ptr);
            }
        } else {
            size_t idx = rand() % pointers.size();
            free(pointers[idx]);
            void* ptr = malloc(pointer_size);
            if (ptr != nullptr) {
                memset(ptr, 0, pointer_size);
                pointers[idx] = ptr;
            }
        }
    }

    for (void* ptr : pointers) {
        free(ptr);
    }
}

In the main method, we can submit this function to a given number of threads, passing a given number of pointers and pointer size. We are going to capture the RSS size in the main thread before launching the threads, every 1 second while the threads are running, and after stopping all threads.

The following graph plots the RSS size of the program over time for different allocators, running for 1 second with 1000 pointers and an allocation size of 1KB per pointer in 1 thread:

First, we can see that all allocators except libmalloc reach a stable RSS size immediately after the first measurement. The increase from the starting size to the stable size corresponds to the total size of allocated memory (1000 * 1KB = 1MB). You can also note that jemalloc is the only allocator dropping back to the starting usage after stopping the threads.
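
The expected growth is simple arithmetic over the worker parameters. Here is a small sketch of it; the function names are hypothetical, and reading the positional arguments of the RSS program as threads, pointers, bytes per pointer, and seconds is my assumption based on the command lines shown later in the post.

```python
def expected_payload_bytes(threads: int, pointers_per_thread: int, bytes_per_pointer: int) -> int:
    """Bytes the workers keep allocated once every pointer slot is filled."""
    return threads * pointers_per_thread * bytes_per_pointer


def to_mib(num_bytes: int) -> float:
    return num_bytes / (1024 * 1024)


single = expected_payload_bytes(1, 1000, 1024)
four = expected_payload_bytes(4, 1000, 1024)
print(f"1 thread:  {single} B = {to_mib(single):.2f} MiB")
print(f"4 threads: {four} B = {to_mib(four):.2f} MiB")
```

The single-threaded case gives the roughly 1MB mentioned above. The four-thread case gives about 3.91 MiB, which is in line with the 3.91MB that the pprof output further down attributes to malloc for a run with 4 threads.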

The starting memory usage differs significantly between allocators. libmalloc and mimalloc both consume less than 4MB, while hoard and jemalloc consume ~9MB and tcmalloc is the most hungry one with ~13MB. However, we can see that libmalloc reaches a stable size of ~13MB after 5 seconds as well.

Next, let's investigate the RSS usage for an increasing allocation size. When looking at the stable RSS size (1 second before the end) for each allocator, using 1KB allocations with a varying number of pointers and 4 threads, we can see that libmalloc indeed behaves differently than all the other allocators.

While the RSS size for all other allocators increases linearly with the allocated memory, libmalloc appears to consume much more memory than is being allocated. A similar issue has been observed with the glibc default allocator on Linux and RocksDB, where the RSS was 3x higher compared to jemalloc (see Battle of the Mallocators for more details).

Tooling and Usability

For most applications, using the default allocator with the default settings is good enough. For some applications, you might want to switch to a different allocator. However, there are also very specialized applications that either have very specific performance requirements or are memory constrained. For those, the default settings might not be the best. Additionally, you might want to debug your allocation workload, e.g. by collecting and inspecting allocation statistics. Let's look into the tooling and configuration options for each allocator.

`libmalloc`

libmalloc allows some debugging configuration via environment variables, but there is no programmatic API or compile-time options, and little to no documented tuning options.

The tooling for libmalloc is tightly coupled to the MacOS tooling. It supports features such as mapping allocation addresses to call stacks when MallocStackLogging is enabled, heap integrity checking via MallocCheckHeapStart, and a few other options.

`jemalloc`

jemalloc has three configuration mechanisms:

Environment variables. Via MALLOC_CONF, you can control nearly every aspect of the allocator. Example: MALLOC_CONF="prof:true,lg_prof_sample:19,prof_prefix:jeprof.out,narenas:4,dirty_decay_ms:5000" will enable heap profiling, sample every 512KB (2¹⁹B), write the profile to jeprof.out, use 4 arenas, and decay dirty pages after 5 seconds.
Programmatic API. You can use the mallctl interface to change the settings at runtime without restarting.
Compile-time options. Configuration can be baked in via --with-malloc-conf or the malloc_conf global variable.
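
The MALLOC_CONF value is a comma-separated list of key:value pairs, and options prefixed with lg_ are base-2 logarithms. The sketch below splits such a string and converts lg_prof_sample into bytes. It is an illustration with hypothetical helper names, not jemalloc's own parser, and it ignores any edge cases of the real format.

```python
def parse_malloc_conf(conf: str) -> dict[str, str]:
    """Split a MALLOC_CONF string into a key -> value mapping."""
    options = {}
    for pair in conf.split(","):
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, separator, value = pair.partition(":")
        if not separator:
            raise ValueError(f"expected key:value, got {pair!r}")
        options[key.strip()] = value.strip()
    return options


def sample_interval_bytes(options: dict[str, str]) -> int:
    """lg_prof_sample is a base-2 logarithm of the average sampling interval in bytes."""
    return 2 ** int(options["lg_prof_sample"])


conf = "prof:true,lg_prof_sample:19,prof_prefix:jeprof.out,narenas:4,dirty_decay_ms:5000"
options = parse_malloc_conf(conf)
print(sample_interval_bytes(options))            # 524288, i.e. 512 KiB
print(int(options["dirty_decay_ms"]) / 1000.0)   # 5.0 seconds
```

A helper like this can be handy for sanity-checking a configuration string before exporting it.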

In terms of debugging and profiling, jemalloc has a wide variety of features. The main tool is jeprof, which analyzes heap dumps and generates flame graphs. By enabling the prof_leak option, allocations without matching free calls are reported.

MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_interval:5" \
  build/malloc-post-rss-jemalloc 4 1000 1024 10
jeprof --pdf build/malloc-post-rss-jemalloc jeprof.out.0

Note that in order to enable the profiling hooks, jemalloc needs to be configured with --enable-prof, which is not the case when installing it via homebrew. I was not able to compile it from source within a reasonable time frame, so I am not able to show the results here. But the features and presentation are very similar to what tcmalloc has to offer.

`tcmalloc`

tcmalloc also has three configuration mechanisms:

Environment variables. Via different TCMALLOC_* variables, you can configure things like release rates, size thresholds, limiting the heap size, etc.
Programmatic API. You can use the MallocExtension APIs to query and modify allocation parameters at runtime. Example: MallocExtension::instance()->SetNumericProperty("tcmalloc.max_per_cpu_cache_size", 16777216);
Compile-time options. Fundamental behaviour can be selected via preprocessor flags like -DTCMALLOC_INTERNAL_SMALL_BUT_SLOW to turn on the small but slow allocator.

In terms of tooling, tcmalloc is part of the Google Performance Tools (gperftools). The main relevant tool is pprof, which is similar to jeprof and used for analyzing heap profiles.

HEAPPROFILE=malloc-post-tcmalloc.hprof \
  build/malloc-post-rss-tcmalloc 4 1000 1024 1

pprof -http=localhost:8080 build/malloc-post-rss-tcmalloc \
  malloc-post-tcmalloc.hprof.0001.heap

It has multiple output formats, including a comprehensive Web UI that can show flame graphs, call graphs, or a cumulative view (top):

Flat    Flat%   Sum%    Cum     Cum%    Name
3.91MB  99.22%  99.22%  3.91MB  99.22%  [libsystem_malloc.dylib]
0.03MB  00.78%  99.99%  0.03MB  00.78%  std::__1::__libcpp_operator_new[abi:ne190102]
0       00.00%  99.99%  3.94MB  99.89%  worker_thread
0       00.00%  99.99%  0.03MB  00.78%  std::__1::vector::reserve
0       00.00%  99.99%  0.03MB  00.78%  std::__1::allocator::allocate[abi:ne190102]
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__thread_proxy[abi:ne190102]
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__thread_execute[abi:ne190102]
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__split_buffer::__split_buffer
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__libcpp_allocate[abi:ne190102]
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__invoke[abi:ne190102]
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__allocate_at_least[abi:ne190102]
0       00.00%  99.99%  3.94MB  99.89%  [libsystem_pthread.dylib]

The graph view shows the allocation size (1KB), too:

tcmalloc also supports summary statistics using MALLOCSTATS=1:

MALLOCSTATS=1 build/malloc-post-rss-tcmalloc 4 1000 1024 1

------------------------------------------------
MALLOC:          20480 (    0.0 MiB) Bytes in use by application
MALLOC: +      3801088 (    3.6 MiB) Bytes in page heap freelist
MALLOC: +       372600 (    0.4 MiB) Bytes in central cache freelist
MALLOC: +      1048576 (    1.0 MiB) Bytes in transfer cache freelist
MALLOC: +          136 (    0.0 MiB) Bytes in thread cache freelists
MALLOC: +      2621504 (    2.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =      7864384 (    7.5 MiB) Actual memory used (physical + swap)
MALLOC: +            0 (    0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =      7864384 (    7.5 MiB) Virtual address space used
MALLOC:
MALLOC:            141              Spans in use
MALLOC:              1              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
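
The summary is additive: the lines up to the first "=" row sum to the "Actual memory used" figure. Here is a minimal sketch that parses the byte counts and checks this. The regular expression is my own assumption about the line layout, derived only from the output above, and the function names are hypothetical.

```python
import re

# Matches lines such as "MALLOC: +      3801088 (    3.6 MiB) Bytes in page heap freelist".
STATS_LINE = re.compile(
    r"^MALLOC:\s*(?P<sign>[+=])?\s*(?P<bytes>\d+)\s+\(\s*[\d.]+ MiB\)\s+(?P<label>.+)$"
)


def parse_malloc_stats(text: str) -> list[tuple[str, int, str]]:
    """Extract (sign, bytes, label) from every line that carries a MiB figure."""
    rows = []
    for line in text.splitlines():
        match = STATS_LINE.match(line.strip())
        if match:
            sign = match.group("sign") or ""
            rows.append((sign, int(match.group("bytes")), match.group("label").strip()))
    return rows


def physical_total_matches(rows: list[tuple[str, int, str]]) -> bool:
    """Sum all rows before the first '=' row and compare with that row's value."""
    running = 0
    for sign, value, _label in rows:
        if sign == "=":
            return running == value
        running += value
    raise ValueError("no total line found")
```

For the output above, the six components (20480 + 3801088 + 372600 + 1048576 + 136 + 2621504) add up to 7864384 bytes, so the check passes. Only 20480 bytes are in use by the application at that point; the rest is held in freelists and metadata.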

`mimalloc`

Similar to libmalloc, mimalloc supports some basic environment variable configuration. In contrast to libmalloc, it does support some performance related configuration.

It also has a mi_option_set programmatic API but the options are less fine-grained than jemalloc or tcmalloc, reflecting the philosophy of sensible defaults over exhaustive tunability.

Tooling support for mimalloc is more limited than for jemalloc and tcmalloc. It does not include a built-in sampling profiler, so you have to rely on external tools such as Valgrind. You can set MIMALLOC_SHOW_STATS=1 to get some basic statistics, though.

MIMALLOC_SHOW_STATS=1 build/malloc-post-rss-mimalloc 4 1000 1024 1

heap stats:     peak       total     current       block      total#   
  reserved:     1.0 GiB     1.0 GiB     1.0 GiB                          
 committed:    11.1 MiB    11.5 MiB    11.1 MiB                          
     reset:     0      
    purged:     0      
   touched:     0           0           0                                ok
     pages:    68          68           0                                ok
-abandoned:    13.6 Ki     17.3 Mi      0                                ok
 -reclaima:     0      
 -reclaimf:    17.3 Mi 
-reabandon:     0      
    -waits:     0      
 -extended:     0      
   -retire:     0      
    arenas:     1      
 -rollback:     0      
     mmaps:    17      
   commits:     0      
    resets:     0      
    purges:     0      
   guarded:     0      
   threads:     4           4           0                                ok
  searches:     1.0 avg
numa nodes:     1
   elapsed:     2.011 s
   process: user: 4.010 s, system: 0.030 s, faults: 94, rss: 6.7 MiB, commit: 11.1 MiB

`hoard`

hoard does not appear to have any configuration options or specific profiling tools.

Maintenance and Security

Apple's libmalloc has sophisticated security features such as kalloc_type and xzone malloc. While it has the highest CVE count¹, I believe this is due to extensive security research on Apple platforms.

jemalloc and tcmalloc prioritize performance over security hardening, with minimal built-in protections. They have a handful of historical CVEs (jemalloc², tcmalloc³) reported, which are all patched in recent versions.

mimalloc offers the most comprehensive configurable security mode with guard pages, encrypted free lists, and randomization at ~10% performance cost. It has no core CVEs reported, only one minor advisory for the Rust crate.⁴

hoard has the weakest security posture, with multiple documented overflow vulnerabilities and no hardening features.

Summary and Conclusion

Based on the findings and my limited experience with using the allocators when developing this post, I came up with the following comparison table. I will leave out hoard, because I don't think it's a practical choice for any real world application.

	`libmalloc`	`tcmalloc`	`jemalloc`	`mimalloc`
Throughput	⭐⭐☆☆☆	⭐⭐⭐⭐☆	⭐⭐⭐☆☆	⭐⭐☆☆☆
Latency	⭐⭐⭐⭐⭐	⭐⭐☆☆☆	⭐⭐⭐⭐☆	⭐☆☆☆☆
Memory Overhead	⭐☆☆☆☆	⭐⭐⭐⭐⭐	⭐⭐⭐⭐☆	⭐⭐⭐⭐☆
Tooling and Usability	⭐⭐⭐☆☆	⭐⭐⭐⭐⭐	⭐⭐⭐⭐☆	⭐⭐⭐☆☆
Maintenance and Security	⭐⭐⭐⭐⭐	⭐⭐⭐⭐☆	⭐⭐⭐☆☆	⭐⭐⭐⭐⭐
Overall	16/25 ⭐	20/25 ⭐	18/25 ⭐	15/25 ⭐

On all Apple operating systems, libmalloc is the default choice. The main focus is on security, with decent performance for most workloads. Given its high memory overhead, however, it might not be a good fit for performance-critical applications with high allocation rates.

tcmalloc and jemalloc have the most stable performance characteristics across different allocation sizes and numbers of threads. The throughput remains reasonably high even at large allocation sizes, while latency remains within acceptable boundaries, with jemalloc winning in latency and tcmalloc winning in throughput for very large allocations. Among the ones I tested, those would be my preferred choice for applications with high performance requirements. They also have extensive configuration options that allow workload-specific tuning. Note, however, that jemalloc appears to be somewhat dead as of 2025, so I'd rather go with tcmalloc for any new project.

mimalloc works well for smaller allocations, but suffers in both latency and throughput for larger allocations. The advanced security features might be a unique selling point for some users though.

While hoard shines in certain areas, it does not appear to be a good choice, as it does not have steady performance characteristics across different allocation sizes and number of threads. It has severe security flaws and is not actively maintained.

Did you ever swap out the default allocator in your application? What was your experience? Let me know in the comments below!

If you liked this post, you can support me on ko-fi.

CVE-2015-5889, CVE-2018-4433, CVE-2023-32428 ↩
CVE-2007-6754, CVE-2006-7252 ↩
CVE-2005-4895 ↩
RUSTSEC-2022-0094 ↩

Comparing OpenBLAS and Accelerate on Apple Silicon for BLAS Routines

Frank Rosner — Tue, 18 Nov 2025 14:58:12 +0000

Motivation

Many real world applications such as machine learning, scientific computing, data compression, computer graphics and video processing require linear algebra operations. Tensors (mostly vectors, which are 1-dimensional tensors, and matrices, which are 2-dimensional tensors) are the primary data structures used to represent data in these applications.

Writing software is hard. Writing correct, performant, secure, reliable, etc., software is even harder. This is why most linear algebra operations are expressed in terms of Basic Linear Algebra Subprograms (BLAS). Common BLAS routines are vector addition, scalar multiplication, dot products, linear combinations, matrix-vector multiplication, and matrix-matrix multiplication.

Similarly to the saying "Don't roll your own cryptography", we should also heed the advice "Don't roll your own BLAS". BLAS libraries are often tuned for the specific SIMD (Single Instruction, Multiple Data) instructions of the underlying hardware. This can lead to performance improvements of orders of magnitude.

In this post we want to explore how to utilize BLAS libraries on Apple Silicon using C++, and compare the performance of OpenBLAS and Apple's Accelerate framework.

A Brief History of SIMD

SIMD is a type of parallel computing first conceptually classified in Flynn's taxonomy¹ in the 1960s. In the 1990s, as desktop processors became more powerful, SIMD instructions were introduced to improve multimedia and gaming performance. Intel's MMX, released in 1996, was the first widely deployed SIMD on desktop CPUs, followed by more advanced instruction sets like SSE, AVX, and AVX-512.

SIMD became standard in most modern CPUs, accelerating tasks such as digital image processing, audio processing, and gaming graphics by executing the same operation on multiple data points at once. Thus, SIMD evolved from early experimental supercomputers to an integral part of everyday computing.

SIMD on Apple Silicon

Apple Silicon processors, including the M1, M2, and M3 series, primarily support the ARM NEON instruction set, which is a 128-bit SIMD architecture that is part of the ARMv8-A ISA². NEON provides a wide range of integer and floating-point vector operations suitable for parallel processing on vectors of data.

In addition to NEON, Apple Silicon also features proprietary Apple Matrix Coprocessor (AMX)³ instructions. These instructions are specialized for high-performance computing tasks involving matrix operations and are unique to Apple Silicon. They are not part of the official ARM architecture and are currently undocumented publicly⁴, but they add significant computational acceleration beyond NEON for certain workloads.

The best way to utilize AMX instructions is through the Accelerate framework, which provides a high-level API for performing BLAS operations.

OpenBLAS vs Accelerate

Benchmark Setup

There are several CPU-based BLAS libraries, some of them developed by hardware manufacturers, such as Intel's MKL, AMD's BLIS and Apple's Accelerate. OpenBLAS is an open source library that has the broadest coverage of supported hardware and can be a solid default choice.

I was curious what the difference would be in terms of performance when comparing OpenBLAS and Accelerate on Apple Silicon. Since OpenBLAS does not support Apple's AMX instructions, relying on NEON instructions only, I expect it to be slower for most use cases.

I built four benchmarks around some common double precision BLAS routines that are part of the C BLAS interface:

cblas_daxpy – a · X + Y is a level 1 routine that scales vector X by scalar a and adds it to vector Y. This routine is used in linear combination of vectors, common in iterative algorithms like conjugate gradient (CG) or generalized minimum residual (GMRES) for solving linear systems.
cblas_ddot - X · Y is a level 1 routine that computes the dot product of vectors X and Y. Dot products are widely used in machine learning, e.g. to compute similarities between vectors, and in physics simulations.
cblas_dgemv - a · A · X + b · Y is a level 2 routine that handles matrix-vector multiplication. a, b are scalars, X, Y are vectors, and A is a matrix. This routine is common in iterative algorithms, in machine learning (linear regression), and in signal processing.
cblas_dgemm - a · A · B + b · C is a level 3 routine that handles matrix-matrix multiplication. a, b are scalars, and A, B, C are matrices. This routine is common in machine learning and graphics processing, for example.
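To pin down what these routines compute, here is a minimal sketch of naive reference loops in Go. It is not how a tuned BLAS library implements them, and the function names axpy, dot and gemv are made up for this illustration. All functions assume that the slice lengths match; dgemm is left out for brevity, as it follows the same pattern with one more loop.

```go
package main

import "fmt"

// axpy computes y <- a*x + y in place (the operation behind cblas_daxpy).
// Assumes len(x) == len(y).
func axpy(a float64, x, y []float64) {
	for i := range x {
		y[i] += a * x[i]
	}
}

// dot returns the dot product of x and y (the operation behind cblas_ddot).
// Assumes len(x) == len(y).
func dot(x, y []float64) float64 {
	sum := 0.0
	for i := range x {
		sum += x[i] * y[i]
	}
	return sum
}

// gemv computes y <- alpha*A*x + beta*y in place for a matrix A given as a
// slice of rows (the operation behind cblas_dgemv).
func gemv(alpha float64, A [][]float64, x []float64, beta float64, y []float64) {
	for i, row := range A {
		y[i] = alpha*dot(row, x) + beta*y[i]
	}
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{4, 5, 6}
	fmt.Println(dot(x, y)) // 1*4 + 2*5 + 3*6 = 32

	axpy(2, x, y)
	fmt.Println(y) // [6 9 12]

	A := [][]float64{{1, 0, 0}, {0, 1, 0}}
	out := []float64{1, 1}
	gemv(1, A, x, 10, out)
	fmt.Println(out) // [11 12]
}
```

The level of a routine corresponds to the depth of the loop nest: level 1 routines need a single loop over the vector, level 2 routines a loop per matrix row around a dot product, and level 3 routines a third loop on top of that.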

The benchmarks use Google's benchmark library to measure performance. You can find the full code on GitHub. The following listing outlines the benchmark for cblas_ddot defined as a macro taking the name of the BLAS backend as an argument:

#define BENCHMARK_DDOT(backend_name) \
static void BM_Ddot_##backend_name(benchmark::State& state) { \
    size_t n = state.range(0); \
    std::vector<double> x(n, 1.0); \
    std::vector<double> y(n, 2.0); \
    for (auto _ : state) { \
        benchmark::DoNotOptimize(blas_ddot(x, y)); \
    } \
    state.SetItemsProcessed(state.iterations() * n); \
} \
BENCHMARK(BM_Ddot_##backend_name)->RangeMultiplier(2)->Range(1<<1, 1<<22);

We'll obtain the desired vector size n from the benchmark state. We'll initialize two vectors of the same size x and y. The for (auto _ : state) loop runs the function for the desired number of iterations. benchmark::DoNotOptimize is used to prevent the compiler from optimizing away the function call because the result is unused. We'll record the user metric number of items processed as the number of iterations times the vector size.

We'll register the function as a benchmark using the BENCHMARK macro, defining the range of vector sizes to test, e.g. from 2¹ to 2²² with a multiplier of 2. We can then generate the benchmarks by calling the macro with the desired backend name:

#ifdef USE_ACCELERATE
BENCHMARK_DDOT(Accelerate)
BENCHMARK_DGEMM(Accelerate)
BENCHMARK_DGEMV(Accelerate)
BENCHMARK_DAXPY(Accelerate)
#endif

We'll repeat the same for OpenBLAS. In our CMakeLists.txt file we can then conditionally compile the two versions, or simply compile both versions at once:

# ...
option(BUILD_ACCELERATE "Build binary with Apple Accelerate framework (macOS only)" ON)
option(BUILD_OPENBLAS "Build binary with OpenBLAS" ON)
# ...
if(BUILD_ACCELERATE AND APPLE)
    add_executable(cpp-simd-post-accelerate ${SOURCES})
    target_link_libraries(cpp-simd-post-accelerate PRIVATE benchmark::benchmark benchmark::benchmark_main)
    target_link_libraries(cpp-simd-post-accelerate PRIVATE "-framework Accelerate")
    target_compile_definitions(cpp-simd-post-accelerate PRIVATE ACCELERATE_NEW_LAPACK USE_ACCELERATE)
    message(STATUS "Building cpp-simd-post-accelerate with Apple Accelerate framework")
endif()
# ...

When checking the resulting binaries, we can indeed see that they link to the respective libraries (assuming you installed them correctly on your system first):

$ otool -fahl build/cpp-simd-post-openblas | grep openblas -B5 -A5

Load command 14
          cmd LC_LOAD_DYLIB
      cmdsize 80
         name /opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (offset 24)
   time stamp 2 Thu Jan  1 01:00:02 1970
      current version 0.0.0
compatibility version 0.0.0

$ otool -fahl build/cpp-simd-post-accelerate | grep Accelerate -B5 -A5

Load command 14
          cmd LC_LOAD_DYLIB
      cmdsize 96
         name /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (offset 24)
   time stamp 2 Thu Jan  1 01:00:02 1970
      current version 4.0.0
compatibility version 1.0.0

Please note that OpenBLAS was using the neoversen1 core on my Apple M3 Pro chip.

OPENBLAS_VERBOSE=2 ./build/cpp-simd-post-openblas
Core: neoversen1

Generating the Results

Now that we have our benchmarks compiled, let's run them and analyze the results. We can use the --benchmark_out command line argument to specify a JSON output file that will hold the results.

./build/cpp-simd-post-openblas --benchmark_out="openblas.json"
./build/cpp-simd-post-accelerate --benchmark_out="accelerate.json"

Aside from some metadata about the run, the JSON file contains a list of benchmark results of the following form:

{
  "name": "BM_Ddot_Accelerate/2",
  "family_index": 0,
  "per_family_instance_index": 0,
  "run_name": "BM_Ddot_Accelerate/2",
  "run_type": "iteration",
  "repetitions": 1,
  "repetition_index": 0,
  "threads": 1,
  "iterations": 76769538,
  "real_time": 9.1229957239181410e+00,
  "cpu_time": 9.1226027698642671e+00,
  "time_unit": "ns",
  "items_per_second": 2.1923567762994435e+08
}

The name field encodes the benchmark name (Ddot), the library (Accelerate), and the input size (2). For our analysis we will look at the user metric items_per_second (larger is better), which represents the number of input doubles we were able to process per second. I wrote a Python script that collects those benchmark results, parses the name field and plots the results. Let's take a look at them.
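The plotting script itself is written in Python and not shown here. As a rough sketch of the parsing step, assuming every benchmark follows the BM_<Routine>_<Backend>/<size> naming pattern from the listing above, the name can be split as follows. The helper names parseName and itemsPerSecond are hypothetical. The second helper recomputes the throughput from the per-iteration CPU time to show how the metric relates to the other fields.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseName splits a name such as "BM_Ddot_Accelerate/2" into
// routine ("Ddot"), backend ("Accelerate") and input size (2).
func parseName(name string) (routine, backend string, size int, err error) {
	base, sizeStr, ok := strings.Cut(name, "/")
	if !ok {
		return "", "", 0, fmt.Errorf("missing size in %q", name)
	}
	parts := strings.Split(base, "_")
	if len(parts) != 3 || parts[0] != "BM" {
		return "", "", 0, fmt.Errorf("unexpected benchmark name %q", name)
	}
	size, err = strconv.Atoi(sizeStr)
	if err != nil {
		return "", "", 0, fmt.Errorf("invalid size in %q: %w", name, err)
	}
	return parts[1], parts[2], size, nil
}

// itemsPerSecond derives the throughput from the input size and the
// per-iteration CPU time in nanoseconds.
func itemsPerSecond(size int, cpuTimeNs float64) float64 {
	return float64(size) / (cpuTimeNs * 1e-9)
}

func main() {
	routine, backend, size, err := parseName("BM_Ddot_Accelerate/2")
	if err != nil {
		panic(err)
	}
	fmt.Println(routine, backend, size) // Ddot Accelerate 2

	// cpu_time taken from the JSON record above; the result is roughly
	// 2.19e+08, in line with the reported items_per_second.
	fmt.Printf("%.4e\n", itemsPerSecond(size, 9.1226027698642671))
}
```

Two doubles processed in about 9.12 ns of CPU time per iteration works out to roughly 219 million items per second, which is the value reported in the record.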

Analyzing the Results

Let's start with cblas_daxpy, which scales a vector and adds it to another vector. We'll benchmark vector sizes from 2 to 4,194,304. While most applications rely on smaller vectors up to 2¹² = 4096 elements, larger vectors are not unheard of.

We can see that for very small vector sizes, the performance difference is insignificant, and OpenBLAS even outperforms Accelerate. However, starting with input size 512, Accelerate surpasses OpenBLAS and remains consistently better by a factor of up to 6x, especially because OpenBLAS seems to slow down significantly starting from 2¹⁴ input size. What's interesting is that OpenBLAS picks up the pace again and surpasses Accelerate for input sizes larger than 2¹⁸.

Now let's look at the cblas_ddot results. We'll use the same input sizes to benchmark.

Unsurprisingly, the results are very similar, with the performance being comparable for very small vector sizes, Accelerate outperforming OpenBLAS for medium to large sizes, and eventually OpenBLAS catching up for very large sizes. OpenBLAS also shows a dip in larger sizes, which looks interesting. We will investigate that later.

Next, let's dive into the matrix operations. For matrices, we'll use smaller input sizes (up to 2¹³), as the number of elements will be the squared input size. Typical input sizes in real world applications range from smaller matrices up to very large ones. Note that many matrix operations, especially for graphics processing and generative AI are offloaded to GPUs. Let's start with cblas_dgemv.

Similarly to the vector results, small sizes show little difference in performance. However, Accelerate starts to outperform OpenBLAS quite early. Even when both implementations have their peak performance (2¹⁰), Accelerate outperforms OpenBLAS. However, with growing matrix sizes, OpenBLAS takes the upper hand.

Let's check out cblas_dgemm next.

The dgemm results are the most consistent ones across all four experiments. Accelerate outperforms OpenBLAS for most input sizes up until 2¹³, from where OpenBLAS takes the lead. Interestingly, OpenBLAS does not show the dip in performance for larger input sizes as in the vector experiments. In dgemm, OpenBLAS also does not appear to have reached its peak performance in the given input range.

Looking at these results, we can note a few observations:

For very small input sizes, the performance difference is insignificant. I believe this is due to the fact that with small inputs, memory access and function call overhead dominate performance.
For medium to large input sizes, Accelerate outperforms OpenBLAS. This is expected given that Accelerate can take advantage of Apple's AMX instructions and the AMX coprocessor, while OpenBLAS relies on NEON instructions only.
For very large input sizes, OpenBLAS outperforms Accelerate. I did not expect this, and I am not entirely sure what the reason behind the difference is. I suspect that either Accelerate is not optimized for very large input sizes, as those are not typical in the applications that run on consumer devices such as Macs, iPhones, etc., or it has to do with size limitations of the AMX coprocessor.
OpenBLAS shows a dip in performance for medium to large input sizes, which is especially visible in daxpy and ddot. I think that this might be caused by OpenBLAS dynamically choosing the number of threads based on the vector size. Let's investigate this a bit further.

Our benchmark run did not specify any parallelism and left the choice to the library. Based on the ARM64 kernel that is used for the ddot experiment, it seems that the number of threads is indeed chosen based on the input size, but also based on the OpenBLAS core used.

static inline int get_dot_optimal_nthreads(BLASLONG n) {
  int ncpu = num_cpu_avail(1);

#if defined(NEOVERSEV1) && !defined(COMPLEX) && !defined(BFLOAT16)
  return get_dot_optimal_nthreads_neoversev1(n, ncpu);
#elif defined(DYNAMIC_ARCH) && !defined(COMPLEX) && !defined(BFLOAT16)
  if (strcmp(gotoblas_corename(), "neoversev1") == 0) {
    return get_dot_optimal_nthreads_neoversev1(n, ncpu);
  }
#endif

  // Default case
  if (n <= 10000L)
    return 1;
  else
    return num_cpu_avail(1);
}
#endif

Since my OpenBLAS uses the neoversen1 core rather than neoversev1, the default case applies: it should choose one thread for vectors of up to 10,000 elements, and all available CPUs for larger vectors. This threshold (red vertical line) aligns with the change in the performance trend.
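To make the threshold concrete, the following sketch mirrors only the default branch of the C function above and applies it to the power-of-two sizes used in the benchmark. The function name defaultDotThreads and the core count of 12 are placeholders for illustration, not values taken from the measurements.

```go
package main

import "fmt"

// defaultDotThreads mirrors the default branch of get_dot_optimal_nthreads:
// one thread for up to 10000 elements, otherwise all available CPUs.
func defaultDotThreads(n, ncpu int) int {
	if n <= 10000 {
		return 1
	}
	return ncpu
}

func main() {
	const ncpu = 12 // assumption: placeholder core count
	for exp := 12; exp <= 15; exp++ {
		n := 1 << exp
		fmt.Printf("2^%d = %d elements -> %d thread(s)\n", exp, n, defaultDotThreads(n, ncpu))
	}
}
```

With power-of-two sizes, 2¹³ = 8192 is the last size that runs single-threaded and 2¹⁴ = 16384 is the first one that uses all cores, which is where the OpenBLAS curve changes its trend.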

Summary and Conclusion

In this post we wrote and executed benchmarks for four common BLAS routines using two different BLAS implementations: OpenBLAS and Apple's Accelerate framework. We used Google's C++ benchmark library to measure performance and plotted the results using Python. We found the following results:

For very small input sizes, results of both libraries are comparable. This is likely due to the fact that with small inputs, memory access and function call overhead dominate performance.
Accelerate outperforms OpenBLAS for medium to large input sizes. This is expected given that Accelerate can take advantage of Apple's AMX instructions and the AMX coprocessor, while OpenBLAS relies on NEON instructions only. If there are specialized, state-of-the-art instructions available on your platform, you should use a BLAS library that takes advantage of them.
While the BLAS interfaces are the same, the available configuration options, e.g. the parallelism, can vary between implementations and have a significant impact on the performance.
OpenBLAS appears to outperform Accelerate for very large input sizes. There is no silver bullet.

Even when you rely on well-known BLAS libraries instead of rolling your own, performance can differ by orders of magnitude between implementations. If your application requires maximum performance, you should always benchmark the different options and choose the one that performs best for your use case on the hardware you run in production.

Note that during my benchmarks I only looked at throughput. Latency is another relevant metric to consider, especially for real-time applications. Latency can be significantly higher when using coprocessors or processing units farther away from the CPU, such as GPUs.

Have you compared different BLAS or scientific computing libraries before? Did you run into any unexplained performance differences? I'd love to hear about your experiences in the comments below!

If you liked this post, you can support me on ko-fi.

Flynn, Michael J. (December 1966). "Very high-speed computing systems" (PDF). Proceedings of the IEEE. 54 (12): 1901–1909. ↩
The ARMv8-A ISA (Instruction Set Architecture) is a 64-bit architecture developed by ARM Holdings, supporting both 32-bit (AArch32) and 64-bit (AArch64) execution states. It includes three main instruction sets: A32 and T32 for 32-bit processing, and A64 for 64-bit processing. An ISA is an abstract model that defines how software controls a processor, specifying the set of machine-level instructions the CPU can execute, along with how they are encoded and how they interact with registers and memory. ↩
Apple's AMX is not to be confused with Intel's AMX. ↩
You can find some user-written documentation and header files on GitHub. There's also The Elusive Apple Matrix Coprocessor blog post by Meeko Labs which is worth reading. ↩

Generating Application Specific Go Documentation Using Go AST and Antora

Frank Rosner — Tue, 04 Nov 2025 12:48:06 +0000

Motivation

In one of my projects I was working on an in-house Go application to manage complex task pipelines on Kubernetes. On a high level, a task consists of one or multiple steps that are executed in order. Engineers would contribute tasks in the form of new Go files. Different teams collaborate on those tasks. Engineers running the tasks need to understand how to configure, debug and operate them correctly. Therefore, having good documentation is crucial.

The biggest challenge with documentation, however, is to keep it up-to-date. This becomes more challenging the further away the documentation is kept from the source code. In the ideal scenario, the documentation is close to the source code, e.g. in the form of JavaDoc, Python Docstrings, or GoDoc.

The downside of these tools is that the content they generate is often tightly coupled to the ecosystem of the language. Additionally, their customizability is often limited. Typically, they generate some HTML page based on the text written in comments. Some of them will provide additional context, such as links to other packages, types, or functions. Take the JavaDoc of jvector's GraphIndex class as an example:

It shows information such as super- and implementing classes / interfaces, nested classes and implemented methods. You can browse other classes within the same package as well. This contextual information is useful, but it lacks real-world application context. Its main focus is documenting the API to other developers, and not documenting the behaviour of an application.

To make my point clearer, let's look at the type definitions for our tasks in Go. I'm going to leave out irrelevant detail here. In reality, the code is more complex. Every task has a defined TaskLogic, which consists of one or multiple TaskSteps.

type TaskLogic interface {
    GetSteps() []TaskStep
}

We support multiple types of tasks, which are mapped from their API name to the respective task logic. For the scope of this blog post, let's assume that there are two tasks: DataProcessingTask and ReminderTask.

var TaskLogics = map[api.TaskType]TaskLogic{
    api.DataProcessingTask:  tasks.DataProcessingTaskLogic{},
    api.ReminderTask:        tasks.ReminderTaskLogic{},
}

Let's look into DataProcessingTaskLogic and the related steps in detail. Note that steps can be reused across tasks.

/*
DataProcessingTaskLogic processes the provided data
in a secure and performant manner. It uses sophisticated
algorithms for maximum efficiency, while ensuring
the highest level of security through quantum-resistant encryption.
 */
type DataProcessingTaskLogic struct{}

/*
StartupStep performs the necessary startup actions,
such as initializing the database connection 
and loading the configuration.
 */
type StartupStep struct{}

/*
RunStep performs the actual data processing.
 */
type RunStep struct{}

/*
CleanupStep performs the necessary cleanup actions,
such as closing the database connection and releasing resources.
 */
type CleanupStep struct{}

func (t DataProcessingTaskLogic) GetSteps() []ot.TaskStep {
    return []ot.TaskStep{
        StartupStep{},
        RunStep{},
        CleanupStep{},
    }
}

Now imagine an engineer who gets alerted on a malfunctioning data processing task. The error says that the task failed in the first step. How would they know what the steps are, and what to do to debug them?

We could write that information down in a runbook, but if someone updates the logic, e.g. by adding a new step, the documentation in the runbook would quickly become outdated. Ideally, you'd want the documentation about the task logic, including the performed steps, to be generated directly from the source code. Here's how it could look:

We use Antora to generate internal documentation for our services across different repositories, where code is written in different languages. Antora is a static site generator that allows you to write documentation in AsciiDoc and then generate a website from it. It is very customizable and allows you to combine documentation from different sources, e.g. different repositories into a single, cohesive, versioned documentation site.

In the remainder of this post I want to share how I built a documentation generator that combines GoDoc with application specific logic to generate documentation that can be integrated into an Antora site. On a high level, this will entail:

Generate Markdown files for each task type, outlining the task logic and steps. I chose Markdown because it is a bit easier to write and more widely supported, but you could generate AsciiDoc directly.
Convert Markdown to AsciiDoc (skip this step if you decided to generate AsciiDoc directly)
Integrate the AsciiDoc files into the Antora resources and build the site (skip this step if you don't need a static site)

Generating Markdown Documentation

We are going to use mage as our build tool. The directory structure of mage-related files looks as follows:

.
├── magefile.go
└── mage
    └── docs
        ├── lib.go
        └── task_docs.go

Generating the task documentation will be triggered using mage docs <output-directory>, thanks to the following function inside magefile.go:

// magefile.go
func Docs(outputDir string) error {
    return docs.GenerateDocs(outputDir)
}

To implement the GenerateDocs function, we'll need to define some types to represent the documentation, i.e.:

A mechanism to differentiate logic and steps, storing the struct name and package path (TypeIdent with TaskLogicType and TaskStepType)
TaskDocs to store the documentation for a task, including the documentation for the task logic and a list of the steps
StepDoc to store the documentation for a step

We store the StepDoc separately, because steps can be reused in multiple tasks. The reference will be done based on the TaskStepType. Here's the definition of those types:

// docs/task_docs.go
type TypeIdent struct {
    Name    string
    PkgPath string
}

type TaskLogicType TypeIdent
type TaskStepType TypeIdent

type TaskDocs struct {
    TaskLogicDoc *string
    FileName     string
    StructName   string
    TaskSteps    []TaskStepType
}

type StepDoc struct {
    StructName       string
    FileName         string
    ShortDescription *string
    LongDescription  *string
}
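The reason for having two named types over the same struct is that the compiler then keeps logic and step identifiers apart, while an explicit conversion from TypeIdent remains cheap. A minimal sketch of that idea, where the package path and the helper isKnownStep are made up for illustration:

```go
package main

import "fmt"

type TypeIdent struct {
	Name    string
	PkgPath string
}

type TaskLogicType TypeIdent
type TaskStepType TypeIdent

// isKnownStep reports whether ident is registered as a step.
// TypeIdent only contains comparable fields, so the derived types
// can be used as map keys directly.
func isKnownStep(steps map[TaskStepType]string, ident TypeIdent) bool {
	_, ok := steps[TaskStepType(ident)]
	return ok
}

func main() {
	// Hypothetical package path, for illustration only.
	ident := TypeIdent{Name: "StartupStep", PkgPath: "example.com/app/tasks"}

	steps := map[TaskStepType]string{
		TaskStepType(ident): "performs the necessary startup actions",
	}
	fmt.Println(isKnownStep(steps, ident)) // true

	// The same name in a different package is a different key.
	other := TypeIdent{Name: "StartupStep", PkgPath: "example.com/app/other"}
	fmt.Println(isKnownStep(steps, other)) // false

	// steps[TaskLogicType(ident)] would not compile: the key types differ,
	// which prevents accidentally looking up a task logic in the step map.
}
```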

Then we'll need to implement the GenerateDocs function. This function will extract the task and step documentation from the source code comments, generate the Markdown files, and write them to the output directory.

// docs/lib.go
func GenerateDocs(outputDir string) error {
    taskDocs, stepDocs := ExtractTaskAndStepDocs("./")
    markdownDocs := GenerateMarkdownDocs(taskDocs, stepDocs)
    if len(markdownDocs) == 0 {
        return fmt.Errorf("no task documentation generated, maybe the generator did not find the source files")
    }

    err := WriteMarkdownDocsToFiles(markdownDocs, outputDir)
    if err != nil {
        return fmt.Errorf("error writing task documentation to files: %w", err)
    }
    fmt.Println("Task documentation generated successfully to", outputDir)
    return nil
}

Let's dive into ExtractTaskAndStepDocs first. Since steps can be reused across tasks, we generate task and step docs separately, merging them later.
The goal of ExtractTaskAndStepDocs is to walk through all .go files in the project and extract the documentation from the comments.

func ExtractTaskAndStepDocs(rootPrefix string) (map[api.TaskType]TaskDocs, map[TaskStepType]StepDoc) {
    taskLogicLookup, taskStepLookup, taskSteps := buildTaskTypeLookups(controllers.TaskLogics)
    taskDocs := make(map[api.TaskType]TaskDocs)
    stepDocs := make(map[TaskStepType]StepDoc)

    err := filepath.WalkDir(rootPrefix, func(path string, d os.DirEntry, err error) error {
        if err != nil {
            return err
        }

        if !d.IsDir() && strings.HasSuffix(d.Name(), ".go") {
            processFile(path, taskLogicLookup, taskStepLookup, taskDocs, taskSteps, stepDocs, rootPrefix)
        }

        return nil
    })

    if err != nil {
        log.Fatalf("Error walking through files: %v", err)
    }

    return taskDocs, stepDocs
}

The helper function buildTaskTypeLookups is listed at the end of the blog post. The core idea is to build lookup maps from the TaskLogics that are available in the project. Those lookups will be used when processing the individual files to figure out if the types that are declared in the files are the ones used in the task logic (and thus should be included in the documentation). Let's look at the processFile function next.

func processFile(
    filePath string,
    taskLogicLookup map[TaskLogicType]api.TaskType,
    taskStepLookup map[TaskStepType]api.TaskType,
    taskDocs map[api.TaskType]TaskDocs,
    taskSteps map[api.TaskType][]TaskStepType,
    stepDocs map[TaskStepType]StepDoc,
    rootPrefix string,
) {
    fset := token.NewFileSet()

    node, err := parser.ParseFile(fset, filePath, nil, parser.ParseComments)
    if err != nil {
        log.Printf("Failed to parse file %s: %v", filePath, err)
        return
    }

    ast.Inspect(node, func(n ast.Node) bool {
        if genDecl, ok := n.(*ast.GenDecl); ok && genDecl.Tok == token.TYPE {
            processDeclaration(genDecl, filePath, taskLogicLookup, taskStepLookup, taskDocs, taskSteps, stepDocs, rootPrefix)
        }
        return true
    })
}

The processFile function uses the Go AST to parse the Go files and extract the documentation from the comments. The ast.Inspect function is used to traverse the AST and find all type declarations. We process each declaration in processDeclaration.

func processDeclaration(
    genDecl *ast.GenDecl,
    filePath string,
    taskLogicLookup map[TaskLogicType]api.TaskType,
    taskStepLookup map[TaskStepType]api.TaskType,
    allTaskDocs map[api.TaskType]TaskDocs,
    allTaskSteps map[api.TaskType][]TaskStepType,
    stepDocs map[TaskStepType]StepDoc,
    rootPrefix string,
) {
    for _, spec := range genDecl.Specs {
        if typeSpec, ok := spec.(*ast.TypeSpec); ok {
            if _, ok := typeSpec.Type.(*ast.StructType); ok {
                structName := typeSpec.Name.Name
                typeIdent := TypeIdent{structName, filePathToPackagePath(filePath, rootPrefix, packagePrefix)}
                if taskName, exists := taskLogicLookup[TaskLogicType(typeIdent)]; exists {
                    processTaskLogic(genDecl, filePath, structName, taskName, allTaskDocs, allTaskSteps[taskName])
                }
                if _, exists := taskStepLookup[TaskStepType(typeIdent)]; exists {
                    processTaskStep(genDecl, filePath, structName, stepDocs, rootPrefix)
                }
            }
        }
    }
}

The processDeclaration function checks if the type is a struct and if it is a task logic or step. If it is a task logic, we process it in processTaskLogic. If it is a step, we process it in processTaskStep.

func processTaskLogic(genDecl *ast.GenDecl, filePath, structName string, taskName api.TaskType, taskDocs map[api.TaskType]TaskDocs, taskSteps []TaskStepType) {
    taskDocs[taskName] = TaskDocs{
        TaskLogicDoc: extractComment(genDecl),
        FileName:     filePath,
        StructName:   structName,
        TaskSteps:    taskSteps,
    }
}

func processTaskStep(
    genDecl *ast.GenDecl,
    filePath, structName string,
    stepDocs map[TaskStepType]StepDoc,
    rootPrefix string,
) {
    comment := extractComment(genDecl)
    stepDocs[TaskStepType{structName, filePathToPackagePath(filePath, rootPrefix, packagePrefix)}] = StepDoc{
        StructName:       structName,
        FileName:         filePath,
        ShortDescription: extractShortDescription(comment),
        LongDescription:  comment,
    }
}

Both functions are rather trivial. They use a few additional helper functions extractComment, extractShortDescription and filePathToPackagePath to extract the documentation from the comments in a structured form. The comments are shown in the detail view of the respective task or step. The short description corresponds to the first sentence of the comment and will be used in the list of steps for a task. The file path is used to link to the source code on GitHub.

func extractComment(genDecl *ast.GenDecl) *string {
    if genDecl.Doc != nil {
        trimmedDoc := strings.TrimSpace(genDecl.Doc.Text())
        return &trimmedDoc
    }
    return nil
}

func extractShortDescription(comment *string) *string {
    if comment == nil {
        return nil
    }
    paragraphs := strings.SplitN(*comment, "\n\n", 2)
    replacedFirstParagraph := strings.ReplaceAll(paragraphs[0], "\n", " ")
    shortDescription := strings.TrimSpace(replacedFirstParagraph)
    return &shortDescription
}

func filePathToPackagePath(filePath, rootPrefix, packagePrefix string) string {
    relativePath := strings.TrimPrefix(filePath, rootPrefix)
    dirPath := filepath.Dir(relativePath)
    packagePath := filepath.Join(packagePrefix, dirPath)
    packagePath = filepath.ToSlash(packagePath)
    return packagePath
}

Now that we have the task and step documentation, we can generate the Markdown files in GenerateMarkdownDocs. Each task gets its own markdown file.
We could also generate individual files for each step, but I'll leave this exercise to the reader if needed. For the sake of simplicity we'll include the long description of the steps right into the task documentation.

func GenerateMarkdownDocs(taskDocs map[api.TaskType]TaskDocs, stepDocs map[TaskStepType]StepDoc) map[api.TaskType]string {
    markdownDocs := make(map[api.TaskType]string)

    for taskType, docs := range taskDocs {
        var sb strings.Builder

        sb.WriteString(fmt.Sprintf("# %s\n\n", taskType))

        sb.WriteString("## Description\n\n")
        if docs.TaskLogicDoc != nil {
            sb.WriteString(fmt.Sprintf("%s\n", *docs.TaskLogicDoc))
        } else {
            sb.WriteString("No description provided.\n")
        }

        sb.WriteString(fmt.Sprintf("\nYou can find the [source code](%s/%s) on GitHub.\n\n", githubPrefix, docs.FileName))

        sb.WriteString("## Steps\n\n")
        if len(docs.TaskSteps) > 0 {
            for i, step := range docs.TaskSteps {
                sb.WriteString(fmt.Sprintf("### %d. %s\n\n", i+1, step.Name))
                stepDoc, exists := stepDocs[step]
                if exists && stepDoc.LongDescription != nil {
                    sb.WriteString(fmt.Sprintf("%s\n\n", *stepDoc.LongDescription))
                }
            }
        } else {
            sb.WriteString("No steps defined.\n")
        }

        markdownDocs[taskType] = sb.String()
    }

    return markdownDocs
}

We are not writing the markdown files directly here, but first building them into a map in order to make the code more testable. The last step will be to write the files for each task to the output directory. This is where WriteMarkdownDocsToFiles comes into play.

func WriteMarkdownDocsToFiles(markdownDocs map[api.TaskType]string, outputDir string) error {
    for taskType, content := range markdownDocs {
        fileName := fmt.Sprintf("%s.md", taskType)
        filePath := filepath.Join(outputDir, fileName)

        err := os.WriteFile(filePath, []byte(content), 0644)
        if err != nil {
            return fmt.Errorf("failed to write file %s: %w", filePath, err)
        } else {
            fmt.Printf("Wrote file %s\n", filePath)
        }
    }

    return nil
}

That's it on the Go side. Let's look into converting the Markdown files to AsciiDoc and integrating them into the Antora site. If you decided to generate AsciiDoc directly, you can skip this step. If you don't need to add the markdown files to a static site, you can also stop reading now.

Convert Markdown to AsciiDoc

To convert Markdown to AsciiDoc we can use pandoc. The following script will walk through the markdown files in the provided directory, convert each file to AsciiDoc, and also write a line to the navigation file which will be included as a partial in the Antora site.

#!/bin/bash
# integrate_go_docs.sh

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <md_dir> <pages_dir>"
  exit 1
fi

if ! command -v pandoc &> /dev/null
then
  echo "Error: pandoc is not installed. Please install pandoc to continue."
  exit 1
fi

md_dir=$1
pages_dir=$2

tasks_dir="components/operator/tasks"
nav_file="$pages_dir/../partials/task-operator-nav.adoc"

echo "// This file has been generated on $(date)" > "$nav_file"

# Convert task docs
for md_file in "$md_dir"/*.md
do
  adoc_dir="$pages_dir/$tasks_dir"
  adoc_file=$(basename "${md_file%.md}.adoc")
  adoc_path="$adoc_dir/$adoc_file"
  echo "Integrating $md_file"
  # -s to create a standalone document, including the title (=)
  # --shift-heading-level-by -1 to convert the markdown h1 (#) to asciidoc title (=)
  # See https://github.com/jgm/pandoc/issues/5615
  #
  # --wrap=none to avoid wrapping lines, causing long headlines to be broken
  # See https://github.com/jgm/pandoc/issues/3277#issuecomment-264706794
  pandoc -s -f markdown --shift-heading-level-by "-1" --wrap=none -t asciidoc -o "$adoc_path" "$md_file"
  echo "* xref:$tasks_dir/$adoc_file[]" >> "$nav_file"
done

With the navigation partial and the individual task AsciiDocs in place, we can build the Antora site. I'm not going to cover the details of how to use Antora here. Please refer to the official documentation for that.

Integrate into Antora Site

We're building the site using Antora, which uses the npm ecosystem. We'll need scripts to launch integrate_go_docs.sh and to run antora with the provided playbook. Here's the package.json.

{
  "name": "go-antora-docs",
  "scripts": {
    "generate": "antora --stacktrace --fetch --clean playbooks/generate.yml",
    "go:pandoc": "./integrate_go_docs.sh ../docs/generated ./modules/ROOT/pages"
  },
  "repository": {
    "type": "git",
    "url": "git+https://github.com/yourorg/yourrepo.git"
  },
  "dependencies": {
    "@antora/cli": "^3.1.7",
    "@antora/lunr-extension": "^1.0.0-alpha.8",
    "@antora/site-generator-default": "^3.1.8",
    "@redocly/cli": "^1.25.11",
    "antora": "^3.1.8",
    "http-server": "^14.1.1"
  }
}

We are going to use GitHub Actions to run the whole pipeline:

jobs:
  generate-go-docs:
    runs-on: ubuntu-24.04
    env:
      GOPATH: /home/runner/go
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.24.2'
      - name: Install Mage
        uses: magefile/mage-action@v3
        with:
          install-only: true
          version: "v1.15.0"
      - name: Generate Docs
        run: mage docs docs/generated
      - name: Upload Docs
        uses: actions/upload-artifact@v4
        with:
          name: go-docs
          path: docs/generated
  compile-docs:
    runs-on: ubuntu-24.04
    needs:
      - generate-go-docs
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - name: Install Dependencies
        working-directory: site
        run: npm ci
      - name: Install Pandoc
        working-directory: site
        run: |
          sudo apt-get update
          sudo apt-get install -y pandoc
      - name: Download Go Docs
        uses: actions/download-artifact@v4
        with:
          name: go-docs
          path: docs/generated
      - name: Generate Pandoc Pages
        working-directory: site
        run: npm run "go:pandoc"
      - name: Generate Antora Page
        working-directory: site
        run: npm run generate
      - name: Copy index HTML
        working-directory: site
        run: cp index.html playbooks/build/site
      - name: Upload antora
        uses: actions/upload-artifact@v4
        with:
          name: antora
          path: site/playbooks/build/site

And that's it. The GitHub Actions workflow runs mage docs docs/generated and uploads the resulting Markdown as an artifact. The next job downloads it, runs npm run "go:pandoc" followed by npm run generate, copies index.html into the build directory, and uploads the resulting site as an artifact.

Conclusion

In this post we've seen how we can generate documentation from Go code and integrate it into an Antora site. The approach can be adapted to other programming languages and documentation formats. The advantage of this approach is that the documentation is easier to keep up-to-date, and it can be enriched with application specific information, tailored towards not only API documentation, but documenting the behaviour of your application.

If you liked this post, you can support me on ko-fi.

func buildTaskTypeLookups(original map[api.TaskType]task.TaskLogic) (map[TaskLogicType]api.TaskType, map[TaskStepType]api.TaskType, map[api.TaskType][]TaskStepType) {
    invertedLogics := make(map[TaskLogicType]api.TaskType)
    invertedSteps := make(map[TaskStepType]api.TaskType)
    steps := make(map[api.TaskType][]TaskStepType)

    for apiName, logic := range original {
        logicType := reflect.TypeOf(logic)
        invertedLogics[TaskLogicType{logicType.Name(), logicType.PkgPath()}] = apiName

        logicSteps := logic.GetSteps()
        for _, step := range logicSteps {
            stepType := reflect.TypeOf(step)
            stepTypeSpec := TaskStepType{stepType.Name(), stepType.PkgPath()}
            invertedSteps[stepTypeSpec] = apiName
            steps[apiName] = append(steps[apiName], stepTypeSpec)
        }
    }

    return invertedLogics, invertedSteps, steps
}

Build Your Own Food Tracker with OpenAI Platform

Frank Rosner — Tue, 03 Jun 2025 14:15:57 +0000

The Benefits of Food Tracking

Whether your goal is to lose weight, gain muscle, or simply maintain a balanced diet, understanding the value of the food you consume is essential. By keeping a record of what we eat, we gain insights into our nutritional intake, allowing us to make informed decisions about our diet. Here are some of the key benefits of food tracking:

Awareness and Accountability: Tracking helps us understand what we're consuming daily, shedding light on eating habits we might not even be aware of.
Nutritional Balance: It enables us to monitor macronutrients like protein, fats, and carbohydrates, ensuring a balanced diet that aligns with personal goals such as weight loss, muscle gain, or overall well-being.
Portion Control: With a clearer picture of portion sizes, food tracking helps prevent overeating and supports mindful eating practices.
Customizable Goals: By recording meals, we can set specific goals, such as reducing sugar intake, increasing fiber consumption, or staying within a calorie limit.
Long-term Insights: Over time, food tracking can reveal patterns, helping to identify triggers for overeating, nutrient deficiencies, or correlations between diet and mood.

In this blog post, I want to share how easy it is to build your own food tracker using a GenAI platform like OpenAI. The tool analyzes images of meals from my Google Photos library and provides a nutritional breakdown using AI. You can find the source code over on GitHub.

Architecture

The architecture consists of three main components that work together to analyze your food photos: your (Python) application, the Google Photos API, and the OpenAI API.

The Python application uses the Google Photos Library API to fetch your food photos. It requires:

OAuth 2.0 authentication for secure access to your photos
Search capabilities to find photos tagged as "food" from a specific date
The ability to download the actual image content for analysis

It then uses OpenAI's GPT-4 Vision API to analyze these images. The analysis entails:

Naming the dish
Estimating nutritional content (calories, protein, carbs, fat, fiber, etc.)
Assessing the health score and processing degree, and breaking down the meal components

Implementation

Google Photos API Authentication

Let's jump into the code. To access your Google Photos library, you need some initial setup. Note that the API I am using to access all my Google Photos has been deprecated and was removed on March 31, 2025. From now on, apps can only access photos that they have created or that the user picks.

To generate credentials your app can use, run the following code:

from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
import os
import pickle

SCOPES = ['https://www.googleapis.com/auth/photoslibrary.readonly']

CREDENTIALS_FILE = os.getenv("GOOGLE_CREDENTIALS_FILE", ".secrets/client_secret.json")
AUTH_PORT = int(os.getenv("GOOGLE_AUTH_PORT", "8080"))
TOKEN_PICKLE_FILE = os.getenv("GOOGLE_TOKEN_PICKLE_FILE", ".secrets/token.pickle")

flow = InstalledAppFlow.from_client_secrets_file(CREDENTIALS_FILE, SCOPES)
creds = flow.run_local_server(port=AUTH_PORT, access_type='offline', include_granted_scopes='true')
with open(TOKEN_PICKLE_FILE, 'wb') as token:
    pickle.dump(creds, token)

This will open a browser window where you can authenticate your app and store the credentials in a pickle file. The pickle module is used to serialize and deserialize Python objects. Our app can then unpickle the credentials and use them to access the Google Photos API.

def google_authenticate():
    creds = None
    if os.path.exists(TOKEN_PICKLE_FILE):
        with open(TOKEN_PICKLE_FILE, 'rb') as token:
            creds = pickle.load(token)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            raise ValueError("Invalid credentials. Please run google_photos_init.py to authenticate.")

    return creds

This approach is only suitable for prototypes like this one. In a production scenario you would use a more secure way to store and manage user credentials.

Searching Food Photos

Now that we have valid credentials to access the Google Photos API, let's use them to search for food photos in our library. First, we create a photos API object:

from googleapiclient import discovery

creds = google_authenticate()
photos_api = discovery.build("photoslibrary", "v1", credentials=creds, static_discovery=False)

Then, we write a function that uses the photos API to search for photos that match a specific search term and date filter:

def google_search_photos(api, search_term=None, date_filter=None):
    filters = {}
    if date_filter:
        filters["dateFilter"] = {"dates": [date_filter]}
    if search_term:
        filters["contentFilter"] = {"includedContentCategories": [search_term]}

    search_body = {
        "pageSize": 50,
        "filters": filters,
    }

    results = api.mediaItems().search(body=search_body).execute()
    return results.get('mediaItems', [])

The search term we are using is "food". We are using a content category filter for this. The date filter will make sure we only get the recent photos, e.g. from yesterday. The call could look like this:

photos = google_search_photos(photos_api, search_term="food", date_filter={"day": 1, "month": 1, "year": 2023})
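
To request yesterday's photos instead of a hard-coded date, the date filter can be derived from a date object. The following is a minimal sketch using only the standard library; the helper names to_date_filter and yesterday_filter are hypothetical and not part of the original code.

```python
from datetime import date, timedelta
from typing import Optional


def to_date_filter(day: date) -> dict:
    """Convert a date into the day/month/year dict used as date_filter above."""
    return {"day": day.day, "month": day.month, "year": day.year}


def yesterday_filter(today: Optional[date] = None) -> dict:
    """Build the date filter for the day before `today` (defaults to the current date)."""
    today = today or date.today()
    return to_date_filter(today - timedelta(days=1))


print(to_date_filter(date(2023, 1, 1)))    # {'day': 1, 'month': 1, 'year': 2023}
print(yesterday_filter(date(2025, 3, 1)))  # {'day': 28, 'month': 2, 'year': 2025}
```

The result can be passed directly as the date_filter argument, for example google_search_photos(photos_api, search_term="food", date_filter=yesterday_filter()).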

Now let's look at an example photo returned by the API:

{
  "id": "ANNz-AhcvptH<redacted>",
  "productUrl": "https://photos.google.com/lr/photo/<redacted>",
  "baseUrl": "https://lh3.googleusercontent.com/lr/<redacted>",
  "mimeType": "image/jpeg",
  "mediaMetadata": {
    "creationTime": "2025-03-18T10:36:20.790Z",
    "width": "4080",
    "height": "3072",
    "photo": {
      "cameraMake": "Google",
      "cameraModel": "Pixel 6",
      "focalLength": 6.81,
      "apertureFNumber": 1.85,
      "isoEquivalent": 103,
      "exposureTime": "0.005034999s"
    }
  },
  "filename": "PXL_20250318_103620790.jpg"
}

Next, let's look into analyzing the image using the OpenAI API.

Photo Download and Analysis

First, let's define some helper functions to work with the image data. We need to be able to download an image and encode it for the OpenAI API.

import requests
import base64

def download_image(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"Failed to download image. Status code: {response.status_code}")

def encode_image(image):
    return base64.b64encode(image).decode('utf-8')

Next, let's define the data model for the food analysis that we can use as structured output for the OpenAI API. Structured outputs allow you to obtain machine-readable output from the OpenAI API. They also implicitly tell the AI what information you are expecting. We define FoodAnalysis for a single dish and FoodAnalysisResponse for a list of dishes so we can pass multiple images at once with a common prompt.

from pydantic import BaseModel

class FoodAnalysis(BaseModel):
    readable_name: str
    protein_g: int
    fat_g: int
    carbohydrate_g: int
    fibre_g: int
    total_mass_g: int
    total_kcal: int
    total_health_score: int
    processing_degree: str
    components: list[str]

class FoodAnalysisResponse(BaseModel):
    foods: list[FoodAnalysis]

Finally, we can write the function that uses the OpenAI API to analyze the images. We are giving a system prompt to prime the LLM on the task:

You are a nutrition and health expert. You are helping a user understand the nutritional value of their food to help them eat healthier. For each image, estimate the protein, fat, fibre, and carbohydrate content in grams, the total mass in grams, the total calories, the total health score (1-10, 10 being super healthy, 1 being heart-attack-unhealthy), the processing degree ('low', 'medium', 'high'), and the components that are in the dish. If you are unsure, please provide an estimate. Only refuse the query if the image does not contain any food. Please also provide a readable name of the dish. If this looks like a well-known dish, you can use that name. Otherwise, you can describe it in a few words that are helpful to understand the dish.

The actual images will be uploaded as a list of user messages with image URL attachments containing the base64 encoded image. If the image is accessible via a URL directly, you don't need to download and encode it.

import openai
from openai import OpenAI

def analyze_images(images):
    client = OpenAI()

    image_messages = []
    for image in images:
        encoded_image = encode_image(image)
        image_messages.append({
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encoded_image}"
                }
            }]
        })

    try:
        response = client.beta.chat.completions.parse(
            messages=[
                {
                    "role": "system",
                    "content": "You are a nutrition and health expert. [...]"
                },
                *image_messages,
            ],
            model="gpt-4o-mini",
            response_format=FoodAnalysisResponse,
        )

        food_analysis = response.choices[0].message
        if food_analysis.parsed:
            return food_analysis.parsed
        elif food_analysis.refusal:
            print(food_analysis.refusal)
            return None
    except openai.LengthFinishReasonError as e:
        # Raised when the response was cut off because the token limit was hit.
        print("Too many tokens: ", e)
        return None
    except Exception as e:
        print(e)
        return None

Let's put it all together, downloading all images and analyzing them. Here's the code with an example output:

images = []
for image in photos:
    images.append(download_image(image['baseUrl']))
analysis = analyze_images(images)

foods:
- carbohydrate_g: 15
  components:
  - arugula
  - roasted tomatoes
  - parmesan cheese
  - lemon
  - olive oil
  - seasonings
  fat_g: 10
  fibre_g: 5
  processing_degree: low
  protein_g: 8
  readable_name: Arugula Salad with Roasted Tomatoes and Cheese
  total_health_score: 8
  total_kcal: 150
  total_mass_g: 200

Assessing Accuracy

To assess the accuracy of the solution I developed a small validation script that analyzes given images and compares the results to expected values. Take a banana, for example:

{
  "foods": [
    {
      "readable_name": "Banana",
      "protein_g": 1,
      "fat_g": 0,
      "fibre_g": 2,
      "carbohydrate_g": 19,
      "total_mass_g": 71,
      "total_kcal": 72,
      "total_health_score": 8,
      "processing_degree": "low",
      "components": [
        "banana"
      ]
    }
  ]
}

The validation script outputs the differences between the actual and expected values. The main challenge turns out to be estimating the mass of the dish.

{
  "carbohydrate_g": {
    "actual": 27,
    "expected": 19,
    "difference": 8
  },
  "total_mass_g": {
    "actual": 118,
    "expected": 71,
    "difference": 47
  },
  "fibre_g": {
    "actual": 3,
    "expected": 2,
    "difference": 1
  },
  "total_kcal": {
    "actual": 105,
    "expected": 72,
    "difference": 33
  },
  "total_health_score": {
    "actual": 9,
    "expected": 8,
    "difference": 1
  }
}
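
The validation script itself is not shown in this post. As a rough sketch of the idea, a field-by-field comparison that reports only the numeric fields that differ could look like the following. The function name diff_analysis is hypothetical, and the input values are taken from the banana example above.

```python
def diff_analysis(actual: dict, expected: dict) -> dict:
    """Report numeric fields whose values differ between two analyses."""
    differences = {}
    for field, expected_value in expected.items():
        actual_value = actual.get(field)
        # Skip non-numeric fields such as the name or the component list.
        if not isinstance(expected_value, (int, float)):
            continue
        if not isinstance(actual_value, (int, float)):
            continue
        if actual_value != expected_value:
            differences[field] = {
                "actual": actual_value,
                "expected": expected_value,
                "difference": actual_value - expected_value,
            }
    return differences


expected_banana = {
    "readable_name": "Banana", "protein_g": 1, "fat_g": 0, "fibre_g": 2,
    "carbohydrate_g": 19, "total_mass_g": 71, "total_kcal": 72,
    "total_health_score": 8,
}
# Values the model returned for the same photo, per the diff shown above.
actual_banana = {
    "readable_name": "Banana", "protein_g": 1, "fat_g": 0, "fibre_g": 3,
    "carbohydrate_g": 27, "total_mass_g": 118, "total_kcal": 105,
    "total_health_score": 9,
}

banana_diff = diff_analysis(actual_banana, expected_banana)
print(banana_diff["total_mass_g"])  # {'actual': 118, 'expected': 71, 'difference': 47}
```

Fields that match, such as protein_g and fat_g here, are left out of the result, which keeps the report focused on the deviations.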

Here are some more examples (differences shown as predicted / actual / % difference):

Name	Mass (g)	KCal	Carbohydrates (g)	Fat (g)	Protein (g)	Health Score
Banana	118 / 71 / +66%	105 / 72 / +46%	27 / 19 / +42%	0 / 0 / 0%	1 / 1 / 0%	8 / 8 / 0%
Pear	178 / 192 / -7%	102 / 109 / -6%	28 / 28 / 0%	0 / 0 / 0%	0 / 0 / 0%	9 / 10 / -10%
Dates	100 / 100 / 0%	277 / 296 / -6%	75 / 65 / +15%	0 / 0.5 / -100%	1 / 2 / -50%	8 / 8 / 0%
Yoghurt	100 / 152 / -34%	61 / 106 / -42%	6 / 5 / +20%	4 / 6 / -33%	3 / 7 / -57%	8 / 8 / 0%
Pudding	100 / 150 / -33%	180 / 165 / +9%	30 / 25 / +20%	7 / 5 / +40%	3 / 4 / -25%	5 / 3 / +66%
Salad	200 / 300 / -33%	90 / 232 / -61%	10 / 6 / +67%	7 / 17 / -59%	5 / 14 / -64%	9 / 10 / -10%
Dumplings	200 / 350 / -43%	320 / 595 / -46%	40 / 70 / -43%	8 / 23 / -65%	14 / 17 / -18%	7 / 8 / -13%

We can see that the estimates are not always accurate: the estimated mass alone deviates by up to 66%. A big challenge appears to be estimating the mass of the dish, as well as seeing hidden ingredients in layered dishes.
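
For reference, the percentages in the table are relative deviations of the predicted value from the actual one. A minimal sketch of that calculation, assuming the usual (predicted - actual) / actual definition, is shown below. The function name percent_difference is hypothetical, and the rounding in the table may differ slightly from Python's round.

```python
from typing import Optional


def percent_difference(predicted: float, reference: float) -> Optional[float]:
    """Relative deviation of `predicted` from `reference`, in percent."""
    if reference == 0:
        # The relative deviation is undefined for a zero reference,
        # unless both values are zero, in which case there is no deviation.
        return 0.0 if predicted == 0 else None
    return (predicted - reference) / reference * 100


print(round(percent_difference(118, 71)))   # banana mass: 66
print(round(percent_difference(178, 192)))  # pear mass: -7
```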

The health score is pretty accurate, and on average, the deviation of the calorie intake appears to be acceptable. If the goal is to support healthy eating habits, I believe the agent is more than useful.

Interestingly, when analyzing photos of packaged food, the tool is able to accurately extract the nutritional information from the packaging.

Cost

Analyzing 1 image requires ~20k tokens. As of April 2025, when using gpt-4o-mini ($0.15 / 1 million tokens), this costs $0.003. Assuming you analyze 10 photos a day, this would cost less than $1 per month.
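
The arithmetic behind this estimate can be written down as a short sketch. It uses the figures quoted above and assumes a 30-day month and a single flat price for all tokens. The constant and function names are made up for illustration.

```python
TOKENS_PER_IMAGE = 20_000        # rough figure from the text
PRICE_PER_MILLION_TOKENS = 0.15  # USD for gpt-4o-mini, as quoted in the text
IMAGES_PER_DAY = 10
DAYS_PER_MONTH = 30              # assumption, to get a round monthly figure


def cost_per_image(tokens: int = TOKENS_PER_IMAGE,
                   price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost in USD of analyzing a single image."""
    return tokens / 1_000_000 * price_per_million


def monthly_cost(images_per_day: int = IMAGES_PER_DAY,
                 days: int = DAYS_PER_MONTH) -> float:
    """Cost in USD of analyzing `images_per_day` images for `days` days."""
    return cost_per_image() * images_per_day * days


print(f"per image: ${cost_per_image():.4f}")  # per image: $0.0030
print(f"per month: ${monthly_cost():.2f}")    # per month: $0.90
```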

Summary

In this blog post, I have shown how easy it is to build your own food tracker using a GenAI platform like OpenAI. The tool analyzes images of meals from your Google Photos library and provides a nutritional breakdown using AI.

While the results are not perfect, I feel like the added value is pretty high if you are not willing to count all your calories and keep track of everything you eat manually.

If you liked this post, you can support me on ko-fi.

Frank - How are you so productive?

Frank Rosner — Thu, 23 Jan 2025 12:22:09 +0000

Introduction

Friends and colleagues often ask me: "Frank - How are you so productive?". While I don't have a silver bullet, I developed a mindset and adopted a set of tools and techniques that help me to be productive as a software engineer. In this post, I will share some of these strategies with you.

What is Productivity?

In economics, productivity measures the ratio of outputs to inputs. In software engineering, I consider output to be the value created by my work and inputs to be the money spent, which includes my work time, as well as any fees for the tools I am using to produce the value. In my opinion, it is important to consider the value of your work as output, not pull requests merged or tickets closed.

When we talk about productivity, we often view it as a combination of effectiveness and efficiency.

Increasing effectiveness is about increasing output while keeping the input constant. In the context of output being value produced, it is often paraphrased as "doing the right things".
Increasing efficiency is about decreasing input while keeping the output constant. It is often paraphrased as "doing things right".

This distinction is useful because it can help identify potential for improvement. Consider being very fast in coding up something that is not needed. This is high efficiency but low effectiveness. On the other hand, consider writing some very valuable feature in a programming language you have no experience in, which is slowing you down significantly. While being highly effective, the efficiency will be low.

Prioritization

Being effective means "doing the right thing". But how do you identify what the right thing is? You will need to identify the needs. This can be done by talking to customers and stakeholders, reading customer feedback, or analyzing business metrics. Once you have identified the relevant goals / epics / features / tasks, you can prioritize them.

When prioritizing with productivity in mind, we cannot just consider the impact (output). We also need to consider the required effort (input). A useful tool for prioritization is the impact-effort-matrix.

You place your tasks on the plane between the effort and impact axes. To maximize productivity, focus on quick wins first. Then plan to tackle the major projects, adding fillers here and there. Avoid waste.
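
To make the four quadrants concrete, here is a small sketch that sorts tasks the way the matrix suggests. It assumes impact and effort are scored from 1 to 10 with 5 as the dividing line, and that the quadrants map as quick win (high impact, low effort), major project (high impact, high effort), filler (low impact, low effort), and waste (low impact, high effort). The scale, the threshold, and the task names are all made up for illustration.

```python
PRIORITY = ["quick win", "major project", "filler", "waste"]


def quadrant(impact: int, effort: int, threshold: int = 5) -> str:
    """Classify a task by its impact and effort scores."""
    high_impact = impact > threshold
    high_effort = effort > threshold
    if high_impact and not high_effort:
        return "quick win"
    if high_impact and high_effort:
        return "major project"
    if not high_impact and not high_effort:
        return "filler"
    return "waste"


def prioritize(tasks: list) -> list:
    """Order (name, impact, effort) tuples by quadrant priority."""
    return sorted(tasks, key=lambda task: PRIORITY.index(quadrant(task[1], task[2])))


tasks = [
    ("rewrite legacy module", 3, 9),
    ("add missing alert", 8, 2),
    ("migrate database", 9, 8),
    ("fix typo in docs", 2, 1),
]
for name, impact, effort in prioritize(tasks):
    print(f"{quadrant(impact, effort):14} {name}")
```

In practice the scores come from conversations with stakeholders rather than from a formula, so the code only captures the ordering, not the judgment behind the numbers.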

Executing and Building Momentum

By prioritizing high-impact, low-effort tasks, you have a good foundation for productivity within your system (company, department, team). When looking at time and effort spent by individual contributors, e.g. yourself, however, there is a lot of potential for improvement, too. Having the "perfect plan" is not enough if you are not able to execute efficiently.

Everyone has the same amount of time available to them. For the sake of this argument, let's assume you are spending 8 hours per day at work. However, you will not be able to work the entire 8 hours on high-impact tasks. There are meetings, operational tasks, overhead, and distractions. And even when you are working on your task, your efficiency can depend on your mood, your energy level, and your ability to focus.

Personally, my biggest factor for long term productivity is building and maintaining momentum. I like to use a snowball analogy to reason about my momentum at work. In physics, momentum is defined as the mass of an object multiplied by its velocity.

When you start a new job, or move to a new role, you start off as a small snowball on the top of a hill. The mass of the snowball corresponds to the knowledge and skills you accumulate. The velocity corresponds to the rate at which you are able to complete tasks like coding, reviewing PRs, and so on.

When you start pushing for the first time, it feels hard to gain momentum as the ball is very small, and gets stuck easily on small branches or stones. As you keep going, the ball gains size and speed.

In order to build and maintain your momentum over time, you have several responsibilities:

Increase your snowball's mass. Learn new programming languages, frameworks, tools, and technologies. Get familiar with new code bases, and understand the business domain you are working in. Improve your tooling and workflows. Build lasting relationships with your co-workers, stakeholders and customers. All of this will allow you to overcome obstacles more easily.
Plan ahead to avoid hitting major obstacles like trees that risk stopping your snowball, or even making parts of it fall off. This means identifying potential blockers and risks early, and either avoiding them or mitigating them before you reach them.
Keep pushing your snowball. This means making progress on your tasks, delivering value. You need to find the right amount of pushing, as pushing too hard will make it hard to stop and navigate around obstacles or changing priorities.

How do you balance these activities throughout your work day / week? If you focus only on pushing, without planning, you'll easily run into obstacles. If you are not working on gaining mass, you'll not be able to overcome growing obstacles and take on bigger challenges. If you are only working on gaining mass but not pushing, you are not delivering value.

Over the course of my career, applying the core principles of Agile Software Development has worked well. Additionally, I discovered and honed some techniques and tools that help me build and maintain momentum every day. The following sections will do a quick recap of the agile principles and then dive into the tools and techniques I use.

Agile Software Development Principles

I am a big fan of the principles behind the Manifesto for Agile Software Development. And I'm not talking about "Agile Methodologies" like Scrum, but the core principles. To summarize the most important ones for me:

Delivering value to the customer continuously
Welcoming changing requirements
Simplicity - maximize the amount of work not done
Continuous attention to technical excellence and good design
Regular reflection and adaptation

How do I apply these in my daily work? I only consider something done, if the work is usable in production. I break down tasks into the smallest possible pieces, aiming to finish pull requests within a day to get early feedback and iterate quickly. If priorities change, this allows me to switch to another task without leaving half-finished work.

This ensures that my work is focused on adding value, and I can adapt to changing requirements quickly. To ensure simplicity, I apply the YAGNI (you ain't gonna need it) principle. I only implement what is needed now, and avoid over-engineering. I design explicitly to avoid losing momentum when having design discussions after the code is written. When revisiting code, I always try to improve it, paying back technical debt continuously.

I also regularly reflect on my work, and try to improve my workflows and tools. We will talk more about that in the Kaizen section below. If you want to know more about Agile Software Development, check out my post Explain Agile Like I'm a Sports Student.

Next, let's dive into some concrete tools and techniques that you can try out yourself.

Tools and Techniques

Kaizen

Kaizen is a Japanese term that means "continuous improvement". It is a philosophy that focuses on making small, incremental changes to processes, workflows, and tools. It was popularized as part of the Toyota Way. The core ideas are:

Improvement is a never-ending process. Make small, consistent changes to achieve significant, long-term results.
Empower everyone to identify inefficiencies and suggest solutions.
Focus on the process, not the people. Improve processes systematically.
Eliminate waste. Activities that do not add value to the customer or organization need to be removed.
Measure and reflect. Use metrics to track progress, experiment with changes, and reflect on the results.

I apply Kaizen in the teams I work with, but also on a personal level. At the end of each day, I spend 5-10 minutes to reflect on the activities I performed that day, and the impact they had on the customer or my organization. I block 30-60 minutes every week to improve my workflows / tools.

Examples are:

Automate creation of daily or weekly messages / reports I compiled manually before.
Improve my coding efficiency by learning or reviewing keyboard shortcuts, IDE features or plugins.
Cancel meetings that have a low return on time invested (ROTI). Consider reading the meeting summary / minutes instead.
Add a new folder and rule in my inbox to funnel low-priority messages that I can look at on a weekly basis.
Archive some old Slack channels that are not relevant anymore.

Zero-Inbox

The zero-inbox technique aims at managing your inboxes (email, Slack) effectively by keeping the number of unread messages at (or close to) 0. The goal is to reduce the cognitive burden of a cluttered inbox, and to ensure that you are not missing important messages. The core ideas are:

Process every message, don't just "check" it. Apply the 4 D's: Delete, Delegate, Defer, Do.
- Delete the message (mostly applies to emails) if it is irrelevant, spam, or unnecessary.
- Delegate the task if it belongs to someone else. Forward it immediately.
- Defer the message if it requires your action but cannot be handled immediately. Schedule it for a later time. Most email clients have this functionality, and for Slack channels, I use the "remind me about this" feature.
- Do the task immediately if it takes less than two minutes to complete.
Use folders, labels, channels to categorize messages. Use automated filters / rules to organize the incoming messages automatically before you process them. I personally have different folders in my email account, based on the projects and the type of message, e.g. pull requests, ticket updates.
Don't use email as a To-Do list. Move larger, actionable items to a dedicated task management tool.
Block message time. Don't check emails or Slack continuously throughout the day, but use dedicated time slots, ideally when you're least productive, e.g. after lunch or in the afternoon.
Unsubscribe and filter. If you receive newsletters / updates that are not relevant, unsubscribe. If you cannot unsubscribe, add an automated filter to delete the messages before they reach your inbox.
Archive aggressively. I personally don't archive emails, but I use a filter to show only unread messages in my inbox. I archive orphaned temporary Slack channels aggressively.
Keep it simple and consistent. Whatever system you use, it needs to be easy enough for you to apply on a daily basis.

To-Do Lists

I tried using digital To-Do lists, but they didn't work out for me. They often got outdated, or some tasks got stuck on there forever. I switched to using a notebook, which lies in front of me on my desk. I use a simple system:

Every day, I write down the date and the tasks I want to complete that day, in order of priority. I either do it in the evening of the previous day, or as the first thing in the morning.
I check off tasks as I complete them. Whenever I engage in an activity, e.g. looking at a Slack message or an incoming PR review request, I review my list to remind myself what the most important task is. That helps me to get back on track, and focus on the most important bits first.
It is okay to add new tasks to the list as the day goes on. It is okay to not finish all the tasks. However, in the spirit of Kaizen, I will review these occasions at the end of the day, coming up with a plan to avoid them in the future.

Let's take a look at an example list:

Imagine that while working on the blog post, a colleague pings you that you need to send in a report by today. You add it to the top of the list, and start working on it immediately. At the end of the day, you did not manage to check your emails.

When closing your day, you try to understand why the report came as a surprise. There are different possible reasons, such as:

The report needed to be finished today, you knew about it, but forgot when you planned your day. In that case, you might want to adjust your process to include a reminder of due tasks one day before the due date.
The report needed to be finished today, but you didn't know about it because your colleague forgot to tell you. In that case, communicate clearly how much time you need in advance. Consider using a shared task management tool, where your colleague can assign you to certain tasks, that will notify you about this.
The report didn't have to be finished today. It wasn't going to be sent until the end of next week anyway. In that case, make sure to challenge the priorities of urgent ad-hoc tasks in the future.

Time Boxing and Time Blocking

I use time boxing and time blocking daily. Time blocking helps me in planning my day and week, ensuring I make room for important, short- and long-term activities. Time boxing helps me to avoid getting stuck or lost in details.

Here are some activities I block time for:

Reviewing pull requests (daily)
Writing code (daily)
Reading and answering messages (daily)
1on1 / team meetings (weekly / bi-weekly)
Education, learning, personal development (weekly)
Workout (daily)

I sometimes use my calendar to block the time, or I write the times next to the items on my To-Do list. I apply time boxing within each block, but also on a broader scale. For example, when I time-box my PR reviews to 60 minutes, I will stop after 60 minutes even if I did not review all PRs. The remaining ones will get a higher priority the next day.

Time boxing also helps me to manage unknown unknowns better. When starting a bigger task, I often kick it off with a proof of concept (POC), time-boxed to a few hours. If I am not able to complete the POC in that time, I change the estimation of the effort needed for the task, placing it on another spot in the impact-effort matrix and reprioritizing it accordingly. If it is still top priority, I'll extend the time box, but if there are other quick wins available, I might switch to them for now.

Time blocking helps me keep a balance of the different activities needed to build and maintain momentum, while time boxing emphasizes progress over perfection.

Focus and Pomodoro

Our brains are capable of solving complex problems, but struggle to deal with distractions and context switching. I personally have found that my productivity over 8 hours is higher if I dedicate 4 hours to deep work, with no distractions, and 4 hours to shallow work, where I can handle interruptions, compared to 8 hours of mixed work.

To successfully enter deep work, I need to have the right environment. I use headphones with music on, and my desk needs to have a little bit of clutter on it, but not too much. If there's too much, I have to clean it first. I might also turn off messaging programs / notifications.

While deep work is incredibly effective, it is also exhausting. My ability to focus drops rapidly after ~45-60 minutes, and even after 30 minutes the focus starts to take a toll on my body. My head starts to hurt, and my muscles start to tense up.

To help maintain a balance between work and recovery, I often use the Pomodoro technique™. The core idea is to work for 25 minutes, then take a 5-minute break. After 4 Pomodoros, take a longer break of 15-30 minutes.

Pomodoro has another positive effect on my work. It forces me to split my work into smaller chunks, which can be completed within a single slot. E.g. if I'm writing code, my goal is to have it compile after each slot. Ideally, I'll also be able to commit the changes. When writing a blog post, I aim to complete a section within a slot.

Pareto Principle (aka 80/20)

The idea behind the Pareto Principle is that 80% of the consequences come from 20% of the causes. When applied to work, it means that 80% of the value comes from 20% of the work. We can make use of that principle to maximize productivity by focusing on the 20% of the work that brings the most value.

What does that mean in practice for me?

Use GenAI extensively. I use GenAI to generate code, which is oftentimes not great, but if it works, it'll do as a first iteration. I use GenAI to generate automated tests as well. I would rather have ugly, tested code than beautiful, refined, high-performance code without any tests. Don't aim for 100% code coverage in the beginning, but focus on the key aspects.
Don't over-engineer. First, make it work, later make it right (if it has proven its value).
Refactor constantly. Whenever I touch code I look for opportunities to improve it. That mechanism ensures that throw-away code is not over-engineered, but relevant code converges to a high quality.
Reduce toil progressively. When you do something once, it's worth writing it down in some note or ticket. When you do it on a regular basis, but not very often, create a runbook. When the runbook becomes long and you run it more often, script it up. When you run the script often, automate it.

Of course, you need to keep in mind that the remaining 20% of results, taking up 4 times as much of your time as the first 80% to finish, is debt you are paying interest on. So you should choose wisely how much debt you can take on, how much interest you are willing to pay, and invest time in paying back debt regularly.

So why does applying this technique save time? If you end up doing all the work eventually, what's the point? The point is that the work you are doing is almost never going to solve the problem 100%. It may be because you did not understand the problem entirely. Or maybe the problem changes over time. Or maybe some other, better solution comes into the mix later down the line. By focusing on the 20% that brings the most value, you are able to deliver value faster, and you are able to pivot more easily.

Gemba Walking

Gemba walking is yet another technique coming from the Toyota Way. Gemba is a Japanese term that means "the real place". In the context of software development, it means going to the place where the work is done, e.g. the team's workspace, the code repository, the CI/CD pipeline, the incident channels, the production environment. This is relevant for me as a tech lead to avoid the "ivory tower" syndrome, and to stay connected to the work my colleagues are doing.

Gemba walking helps me to identify inefficiencies, bottlenecks, and blockers early. It also helps me to understand the context of the work better, and to build relationships with my colleagues. I try to do Gemba walks every day, blocking ~15 minutes for it. The key ideas are:

Go where value is created. In SRE, I call this "the trenches".
Observe, don't judge. Ask questions and listen to the problems your colleagues are facing. Read between the lines, be curious.
Engage with colleagues.
Focus on processes, not people.

Escalation

Escalation involves raising an issue to higher authority or expertise levels when it cannot be resolved at the current level. Escalating quickly is important to ensure efficient resolution of issues by ensuring they are handled by the people with the right expertise, authority, or resources.

Escalation is also important to avoid getting stuck / blocked on a task. While it might seem like "complaining" to some, it is in the interest of the company and your customers to resolve issues quickly.

I often combine escalation with time-boxing. If I do not manage to complete a task in the estimated time, I can escalate it to the respective parties, e.g. my boss, or an expert in the field.

Summary and Conclusion

In this post we explored various strategies to increase productivity as a software engineer, while balancing effectiveness and efficiency. We highlighted the significance of prioritization using the impact-effort matrix to focus on high-impact, low-effort tasks.

We saw how to build and maintain momentum, peeking into a productivity toolset, including Agile principles to ensure value delivery and adaptability, Kaizen for continuous improvement, zero-inbox techniques for managing messages, the use of to-do lists for daily task management, time boxing and time blocking for effective time management, the Pomodoro technique for maintaining focus, the Pareto Principle for maximizing value, Gemba walking to stay connected with the team's work, and the importance of quick escalation to resolve issues efficiently.

What are your favorite productivity tools and techniques? How do you balance effectiveness and efficiency in your work? Please share your thoughts in the comments!

If you liked this post, you can support me on ko-fi.

Cover photo by kris on Unsplash
To-Do list photo by Thomas Bormans on Unsplash
Calendar photo by Eric Rothermel on Unsplash
Kanban photo by Parabol | The Agile Meeting Tool on Unsplash
Frustrated person photo by ahmad gunnaivi on Unsplash
Camera lens photo by Paul Skorupskas on Unsplash
Laptop photo by Lukas Blazek on Unsplash
Car factory photo by Michael Satterfield on Unsplash

Visualizing the Apache Cassandra Token Ring with Plotly

Frank Rosner — Fri, 27 Sep 2024 18:13:09 +0000

Cassandra's Partitioning Mechanism

Apache Cassandra is a powerful, distributed NoSQL database designed to handle large amounts of data across many servers while providing linear horizontal scalability, high availability with flexible consistency guarantees, as well as fault tolerance.

One of the core mechanisms behind Cassandra's scalability is the data partitioning based on consistent hashing. In a typical hashing scenario, a hash function takes an input (e.g., the primary key of a row) and maps it to a fixed output range. In a distributed database, each node could be responsible for a subset of that range. However, if nodes are added or removed, the entire data distribution could change, causing large-scale data movement between nodes.

Consistent hashing solves this problem by mapping both data and nodes onto the same hash ring (a conceptual circle). Here’s how it works:

The Hash Ring. Imagine a circle where hash values are placed in a clockwise manner, ranging from 0 to the maximum value of the hash function. Database nodes are placed on this circle based on their own hash value. Data is also hashed and placed on the circle.
Assigning Data to Nodes. Data is assigned to the first node clockwise from its position on the ring. This node becomes responsible for that piece of data.
Node Addition/Removal. When a node is added, only the data between the new node and its predecessor in the ring is reassigned. When a node is removed, its data is reassigned to the next node clockwise.

We call each position on the ring a token. When a row needs to be stored, its primary key is hashed to calculate its token. The token is then used to determine which node stores the data by walking the ring until the next token owned by a node is found. This process ensures that data is evenly distributed across all nodes, avoiding hot spots and ensuring the cluster remains balanced as nodes are added or removed. This algorithm effectively partitions the token ring into ranges, with each range assigned to a node.
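The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the ring, the node names, and the helper owner_of are made up for the example, and the tokens are tiny integers instead of real hash values.

```python
from bisect import bisect_left

# Hypothetical three-node ring with tiny tokens for readability: token -> node.
ring = {2: "node-a", 6: "node-b", 10: "node-c"}


def owner_of(token: int, ring: dict[int, str]) -> str:
    """Return the node owning the first token at or after `token` (clockwise)."""
    sorted_tokens = sorted(ring)
    idx = bisect_left(sorted_tokens, token)
    # Past the last token we wrap around to the first token on the ring.
    return ring[sorted_tokens[idx % len(sorted_tokens)]]


print(owner_of(3, ring))   # node-b
print(owner_of(11, ring))  # node-a (wraps around)

# Adding a node only moves the data between it and its predecessor.
bigger_ring = {**ring, 8: "node-d"}
print(owner_of(7, bigger_ring))  # node-d (was node-c)
print(owner_of(9, bigger_ring))  # node-c (unchanged)
```

Note how adding node-d at token 8 only changes the owner of tokens between 6 and 8. Everything else stays where it was.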

To make this algorithm more resilient and flexible, Cassandra uses a concept called virtual nodes (vnodes). Instead of assigning a single large token range to each node, Cassandra divides the token space into many smaller ranges and assigns multiple vnodes to each physical node. This allows the cluster to be more evenly balanced, especially when nodes are added or removed, since the system can redistribute small token ranges across the remaining nodes, avoiding load imbalances.

It also allows combining heterogeneous hardware in a single cluster, as you can adjust the number of vnodes based on the available resources on each physical node.
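To get a feeling for the effect of the vnode count, here is a small simulation sketch. It assumes tokens are placed uniformly at random, which is a simplification and not necessarily how your cluster allocates tokens. The function ownership and the node names are hypothetical.

```python
import random

RING_SIZE = 2**64  # number of distinct 64-bit tokens


def ownership(num_tokens_by_node: dict[str, int], seed: int = 42) -> dict[str, float]:
    """Return the fraction of the ring each node owns with randomly placed vnodes."""
    rng = random.Random(seed)
    owner = {}
    for node, num_tokens in num_tokens_by_node.items():
        for _ in range(num_tokens):
            owner[rng.randrange(-(2**63), 2**63)] = node
    ordered = sorted(owner)
    width_by_node = {node: 0 for node in num_tokens_by_node}
    for i, end in enumerate(ordered):
        # ordered[i - 1] wraps to the last token for i == 0; the modulo
        # turns the resulting negative difference into the wrapped width.
        width = (end - ordered[i - 1]) % RING_SIZE or RING_SIZE
        width_by_node[owner[end]] += width
    return {node: width / RING_SIZE for node, width in width_by_node.items()}


shares = ownership({"large-node": 256, "small-node": 128})
print(shares)
```

With twice as many vnodes, the larger node should end up owning roughly twice as much of the ring. Try lowering both numbers to see how the shares become less predictable with fewer vnodes.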

Why Visualize the Token Ring?

Cassandra's distributed nature, combined with its use of consistent hashing and vnodes, makes it an efficient and scalable database. However, one of the challenges that arises when operating a Cassandra cluster is understanding how the data is distributed across nodes. Although Cassandra ensures that tokens are distributed evenly across the cluster, certain situations - such as manual node additions, removals, misconfiguration, or hardware failures - can lead to unbalanced token distribution.

For Cassandra operators and users, having insight into the token distribution can be a useful tool to debug issues with the database. Token imbalances can lead to unequal data distribution, resulting in hotspots where certain nodes handle significantly more traffic or store more data than others. This can cause performance degradation, uneven resource usage, and even outages.

While Cassandra offers command-line tools to check token ranges and node responsibilities (such as nodetool ring), these tools output the data in a raw, tabular format that can be difficult to interpret.

Fetching Token Ranges

Cassandra uses Murmur3, a hashing function that generates 64-bit tokens in the range of $[-2^{63}, 2^{63} - 1]$. Each vnode in the cluster is assigned a token it is responsible for. The token marks the end of a range, and the previous token defines the start. To visualize the token ring, we first need to calculate the token ranges, by:

Gathering all tokens from all nodes
Sorting the tokens
Pairing each token with the previous one to compute the range for each node
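If you only have the raw tokens per node, the three steps above could look like the following sketch. The function name compute_token_ranges and the sample tokens are made up for illustration, and the result uses the same dictionary shape as the plotting function further down.

```python
def compute_token_ranges(
    tokens_by_node: dict[str, list[int]],
) -> dict[str, list[tuple[int, int]]]:
    # 1. Gather all tokens from all nodes, remembering the owner of each.
    owner = {token: node for node, tokens in tokens_by_node.items() for token in tokens}
    # 2. Sort the tokens.
    ordered = sorted(owner)
    # 3. Pair each token with the previous one. For the first token,
    #    index -1 wraps around to the last token on the ring.
    ranges = {node: [] for node in tokens_by_node}
    for i, end in enumerate(ordered):
        start = ordered[i - 1]
        ranges[owner[end]].append((start, end))
    return ranges


# Hypothetical three-node cluster with two vnodes per node.
token_ranges = compute_token_ranges(
    {"node-a": [2, 8], "node-b": [4, 10], "node-c": [6, 12]}
)
print(token_ranges["node-a"])  # [(12, 2), (6, 8)]
```

The range (12, 2) is the one that wraps around the end of the ring.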

The Cassandra drivers have access to the token metadata, which includes both the raw tokens and the calculated ranges:

Metadata metadata = session.getMetadata();
TokenMap tokenMap = metadata.getTokenMap().get();

Set<TokenRange> ring = tokenMap.getTokenRanges();
// Returns [Murmur3TokenRange(Murmur3Token(12), Murmur3Token(2)),
//          Murmur3TokenRange(Murmur3Token(2), Murmur3Token(4)),
//          Murmur3TokenRange(Murmur3Token(4), Murmur3Token(6)),
//          Murmur3TokenRange(Murmur3Token(6), Murmur3Token(8)),
//          Murmur3TokenRange(Murmur3Token(8), Murmur3Token(10)),
//          Murmur3TokenRange(Murmur3Token(10), Murmur3Token(12))]

Visualizing Token Ranges

Now that we’ve calculated the token ranges, we can proceed to visualize them. To represent the Cassandra token ring, we use Plotly’s polar plot feature, which is perfect for this kind of circular visualization.

Here’s the function that creates the visualization:

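import math

import plotly.express
from plotly import graph_objects

# Size of the token space (2^64 distinct 64-bit tokens)
max_token = math.pow(2, 64)

def plot_token_ranges(token_ranges: dict[str, list[tuple[int, int]]]):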
    fig = graph_objects.Figure()
    nodes = list(token_ranges.keys())
    colors = plotly.express.colors.sample_colorscale("Rainbow", len(nodes))
    color_map = {node: color for node, color in zip(nodes, colors)}

    for node, ranges in token_ranges.items():
        v_node_idx = 0
        for start, end in ranges:
            if end < start:
                range_width = abs(end + max_token - start)
            else:
                range_width = abs(end - start)
            # Theta needs to be in the middle of the range, because the
            # polar bar gets drawn from theta by width / 2 in both directions
            theta = (start + range_width / 2) * 360 / max_token
            fig.add_trace(
                graph_objects.Barpolar(
                    r=[1],
                    theta=[theta],
                    width=range_width * 360 / max_token,
                    customdata=[[start, end]],
                    hovertemplate="[%{customdata[0]}, %{customdata[1]}]",
                    name=node,
                    legendgroup=node,
                    marker_color=color_map[node],
                    showlegend=(v_node_idx == 0),
                )
            )
            v_node_idx += 1

    fig.show()

First we pass a token ranges dictionary, where each key is a node, and the value is a list of token ranges for that node (one per vnode).
Then we assign each node a unique color from a predefined color scale for easy identification in the visualization.
For each node, we loop through its token ranges and calculate:
- The width of each token range (range_width).
- The angle (theta) where the range should be displayed on the circular plot.
Each token range is represented as a polar bar. The hover text displays the start and end of each range, and we ensure the legend appears only once per node.
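To make the width and theta calculation more tangible, here is the same arithmetic as a standalone sketch with two worked examples. The helper bar_geometry is hypothetical, and it uses an integer ring size instead of math.pow to keep the numbers exact.

```python
RING_SIZE = 2**64  # same value as max_token in the plotting function


def bar_geometry(start: int, end: int) -> tuple[float, float]:
    """Return (theta, width) in degrees for the token range from start to end."""
    if end < start:
        # The range wraps around the end of the ring.
        range_width = end + RING_SIZE - start
    else:
        range_width = end - start
    theta = (start + range_width / 2) * 360 / RING_SIZE
    return theta, range_width * 360 / RING_SIZE


# A quarter of the ring, starting at token 0.
print(bar_geometry(0, 2**62))       # (45.0, 90.0)
# Half of the ring, wrapping around from positive to negative tokens.
print(bar_geometry(2**62, -2**62))  # (180.0, 180.0)
```

In the second example the bar is centered at 180 degrees and extends 90 degrees in both directions, covering the wrap-around point of the ring.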

The following screenshot shows the output of the function for a simple three-node cluster.

Taking Replication Into Account

Most datasets stored in Cassandra are configured with a replication factor (RF) > 1, which means that each row is replicated to multiple nodes. Replication increases data availability and fault tolerance at the cost of storage and increased coordination between nodes.

How exactly the data is replicated depends on the configured replication strategy. If your cluster spans multiple racks (e.g. availability zones), you should use the NetworkTopologyStrategy. This strategy determines the additional replicas by walking the ring clockwise until it finds a node in a different rack, repeating the process until it reaches the desired number of replicas.
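The clockwise walk can be sketched as follows. This is a deliberately simplified illustration for a single datacenter, assuming the replication factor does not exceed the number of racks. The real strategy handles more cases. The ring, the node names, and replicas_for are made up for the example.

```python
from bisect import bisect_left

# Hypothetical single-datacenter ring: token -> (node, rack).
ring = {
    2: ("node-a", "rack-1"),
    4: ("node-b", "rack-1"),
    6: ("node-c", "rack-2"),
    8: ("node-d", "rack-2"),
    10: ("node-e", "rack-3"),
    12: ("node-f", "rack-3"),
}
sorted_tokens = sorted(ring)


def replicas_for(token: int, rf: int) -> list[str]:
    """Walk clockwise from `token`, taking the first node found in each new rack."""
    start = bisect_left(sorted_tokens, token)
    replicas, seen_racks = [], set()
    for step in range(len(sorted_tokens)):
        node, rack = ring[sorted_tokens[(start + step) % len(sorted_tokens)]]
        if rack not in seen_racks:
            seen_racks.add(rack)
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


print(replicas_for(3, rf=3))   # ['node-b', 'node-c', 'node-e']
print(replicas_for(11, rf=2))  # ['node-f', 'node-a']
```

For token 3, node-d is skipped because rack-2 already holds a replica on node-c.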

What does that mean for our token ring visualization? When looking at token ownership, i.e., which nodes own a given token, we would have to add the rack dimension to the plot, and the viewer would have to mentally perform the clockwise "walk" to determine the replicas.

An alternative approach is possible if the replication factor is equal to the number of racks. In that case, the algorithm can be simplified by calculating a separate token ring for each rack. We then only consider the nodes in each rack to calculate the ranges and can use the algorithm with RF = 1 to determine the node that owns the token in each rack. To visualize the ownership we can either plot the ring for each rack, or even overlay the different plots.
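A minimal sketch of the per-rack calculation, assuming a replication factor equal to the number of racks, could look like this. The function per_rack_ranges, the node names, and the rack assignment are hypothetical.

```python
def per_rack_ranges(
    tokens_by_node: dict[str, list[int]], rack_of: dict[str, str]
) -> dict[str, dict[str, list[tuple[int, int]]]]:
    """Compute a separate token ring per rack: rack -> node -> list of (start, end)."""
    # Group the tokens by rack, remembering the owner of each token.
    owner_by_rack: dict[str, dict[int, str]] = {}
    for node, tokens in tokens_by_node.items():
        for token in tokens:
            owner_by_rack.setdefault(rack_of[node], {})[token] = node
    result = {}
    for rack, owner in owner_by_rack.items():
        # Within a rack, pair each token with the previous token of the same rack.
        ordered = sorted(owner)
        ranges: dict[str, list[tuple[int, int]]] = {}
        for i, end in enumerate(ordered):
            ranges.setdefault(owner[end], []).append((ordered[i - 1], end))
        result[rack] = ranges
    return result


rings = per_rack_ranges(
    {"node-a": [2], "node-b": [4], "node-c": [6],
     "node-d": [8], "node-e": [10], "node-f": [12]},
    {"node-a": "rack-1", "node-b": "rack-1", "node-c": "rack-2",
     "node-d": "rack-2", "node-e": "rack-3", "node-f": "rack-3"},
)
print(rings["rack-2"])  # {'node-c': [(8, 6)], 'node-d': [(6, 8)]}
```

Each inner dictionary has the shape expected by plot_token_ranges, so every rack can be plotted as its own ring.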

The following animation shows the per-rack token ranges for a six-node cluster hosted across three racks with a replication factor of three:

Overlaying SSTable Ranges

Cassandra's storage engine is based on Log-Structured Merge (LSM) Trees. Data is written to a memtable in memory, and when the memtable reaches a certain size, it is flushed to disk as an SSTable. SSTables are immutable and are periodically merged into larger SSTables to reduce the number of files on disk.

The data in the SSTable files is sorted by the partition key. By computing the tokens of the partitions within the file, we can derive a token range based on the minimum and maximum token it contains. This allows us to overlay the token ranges of the nodes with an SSTable range.

To add the SSTable overlay, we extend the code, adding two additional parameters sstable_min_token and sstable_max_token to the function. We then add a new barpolar trace for the SSTable:

if sstable_max_token < sstable_min_token:
    sstable_range_width = abs(sstable_max_token + max_token - sstable_min_token)
else:
    sstable_range_width = abs(sstable_max_token - sstable_min_token)
sstable_theta = (sstable_min_token + sstable_range_width / 2) * 360 / max_token
fig.add_trace(
    graph_objects.Barpolar(
        r=[0.5],
        theta=[sstable_theta],
        width=sstable_range_width * 360 / max_token,
        customdata=[[sstable_min_token, sstable_max_token]],
        hovertemplate="[%{customdata[0]}, %{customdata[1]}]",
        name=sstable_name,
        legendgroup=sstable_name,
        marker_color="grey",
    )
)

When calculating the width of the SSTable, we need to take into account that the max token might be smaller than the min token, which effectively means that the range wraps around the ring. We choose r=[0.5] to place the SSTable overlay closer to the center of the plot, and color it grey to distinguish it from the token ranges.

The following screenshot shows our updated graph with an SSTable file overlapping with one vnode of the first node in the first rack.

Conclusion

Visualizing the Cassandra token ring with Plotly was a fun exercise to understand how tokens are distributed across nodes in a cluster, how replication works, and how SSTables fit into the mix. Looking into the token distribution helps you identify potential token imbalances.

If you do not want to worry about token ranges, vnodes, replication and SSTable files, and just want to benefit from the scalability and fault tolerance of Cassandra, consider using a managed service such as DataStax AstraDB.

If you liked this post, you can support me on ko-fi.

Books That Helped Me Become a Tech Lead

Frank Rosner — Wed, 20 Dec 2023 19:41:58 +0000

Why Books?

When developing my skills, I like to use a combination of conference talks, video tutorials, books, papers, blog posts, learning-by-doing, and teaching/blogging. Books are a great way to learn from the mistakes other people have made, to be inspired by their successes, and to experience their accomplishments second hand.

In this blog post I want to share my favorite books that helped me the most in my journey from being a senior software engineer to becoming a tech lead. They helped me to broaden and deepen my understanding about software engineering, software architecture, and building and running a software business. They taught me to challenge and shape my behaviour and habits. Some of them deeply affected my personal and professional life.

It goes without saying that reading those books will not automatically get you promoted or land you a new tech lead role. Of course, you still need to get your own experiences, make your own mistakes, and have that little bit of luck required. It is also important to constantly sharpen your technical knowledge and skills based on the specific domain you are working in. The books on this list do not focus on specific technologies, but rather on general principles and concepts that are applicable to any technology stack and business.

For each book on the list, I will include a brief summary that can help you judge whether the book is relevant to you. To give it a personal touch, I will also include the most valuable lesson I learned from the book. This is not necessarily the main message of the book, nor the only important one, but rather the one that resonated with me the most.

The List

Design It!

Design It!: From Programmer to Software Architect by Michael Keeling is a comprehensive guide aimed at software developers who aspire to transition into the role of a software architect. The book provides a pragmatic and accessible approach to software architecture, emphasizing the importance of design in creating effective software systems. Keeling covers a wide range of topics, from foundational principles of software architecture to practical techniques for designing scalable and maintainable systems.

Throughout the book, Keeling advocates for a hands-on, iterative approach to software design, encouraging readers to think critically about the architectural choices they make. He introduces various architectural styles and patterns, and discusses how to evaluate trade-offs and make decisions that align with the goals and constraints of a project. The book is filled with real-world examples, exercises, and practical tips, making it a valuable resource for those looking to develop their skills in software architecture and design.

The most valuable lesson I learned from the book: There is no such thing as "no design". "No design" often means multiple, implicit designs, in the heads of your engineers, that are not aligned with each other. Design explicitly, collaboratively, iteratively, and document the design in a written form!

Release It!

Release It!: Design and Deploy Production-Ready Software by Michael Nygard is a critical guide for software developers and architects focused on the challenges of creating software that performs reliably in production environments. The book delves into the complexities of designing, deploying, and maintaining software that can withstand the rigors of real-world operations. Nygard emphasizes the importance of considering production realities from the beginning of the design process, advocating for a mindset shift from merely writing code to delivering a resilient, scalable, and maintainable system.

Nygard provides insights into the various pitfalls that software systems encounter in production, such as network issues, unpredictable load patterns, and hardware failures. He introduces concepts like stability patterns and antipatterns, illustrating how to build systems that can gracefully handle failure and remain robust under stress. The book is enriched with real-life stories and case studies that demonstrate the catastrophic consequences of poor system design in production settings. "Release It!" is a valuable resource for software professionals seeking to ensure their systems are not just functional, but also resilient and reliable in the face of real-world challenges.

The most valuable lesson I learned from the book: Every software engineer should build their software with production in mind. Software in production is what runs your business, impacts your customers, and determines success or failure.

Site Reliability Engineering

Site Reliability Engineering: How Google Runs Production Systems authored by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, is an insightful exploration into the practices and principles that Google employs to manage its large-scale, highly reliable systems. The book introduces the concept of Site Reliability Engineering (SRE), a discipline that blends aspects of software engineering with IT operations, focusing on creating scalable and reliable software systems.

The authors, all experienced practitioners in SRE at Google, share their expertise on how to build, deploy, monitor, and maintain systems that are robust and resilient. They delve into the specific strategies and techniques Google uses, such as setting service level objectives (SLOs), managing change effectively, and balancing the need for release velocity with service reliability. The book covers a range of topics from organizational aspects of SRE teams to technical practices like incident management and post-mortem culture. The book offers a rare glimpse into the inner workings of one of the world's most proficient engineering organizations and is a valuable resource for anyone involved in the operation, maintenance, and scaling of large systems.

The most valuable lesson I learned from the book: There are no perfect systems. By explicitly defining and measuring SLOs and error budgets, you can make informed decisions about the trade-offs between reliability and velocity.

Change Your Questions, Change Your Life

Change Your Questions, Change Your Life: 12 Powerful Tools for Leadership, Coaching, and Life by Marilee Adams explores the profound impact that the questions we ask can have on our lives and careers. Adams introduces the concept of "Question Thinking," a method of transforming thinking, action, and results through deliberate and mindful questioning. The book emphasizes how the types of questions we ask ourselves, ranging from limiting, judgmental "Judger" questions to more open, constructive "Learner" questions, can significantly influence our outlook and outcomes.

Adams illustrates her ideas through a compelling narrative, following the story of an individual struggling with life's challenges and learning to apply the principles of Question Thinking. This approach offers practical tools and techniques for individuals to improve their communication, decision-making, and problem-solving skills. By fostering a Learner mindset and asking better, more empowering questions, readers are guided towards more positive and productive personal and professional relationships. The book is particularly valuable for leaders, coaches, and anyone looking to enhance their ability to connect with others and navigate complex situations more effectively.

The most valuable lesson I learned from the book: I realized how often I am in the "Judger" mindset. Being more mindful of that and consciously choosing to shift to a "Learner" mindset has become almost a superpower for solving the challenges I face.

Thinking, Fast and Slow

Thinking, Fast and Slow by Daniel Kahneman is a groundbreaking exploration of psychology and economics, delving into how we think and make decisions. Kahneman introduces two distinct modes of thinking that dominate our mental processes: "System 1" (fast, intuitive, and emotional) and "System 2" (slower, more deliberate, and more logical). Throughout the book, Kahneman explores the impact of these two systems on our judgment, decision-making, and the way we perceive the world around us.

The book is a comprehensive journey through various cognitive biases and heuristics that influence our everyday thinking. Kahneman demonstrates how our intuitive System 1, which often serves us well, can also lead to profound errors and biases. He also explores the capabilities and limitations of System 2, emphasizing how it can be influenced and overruled by the quick judgments of System 1. The book is a synthesis of decades of research, providing deep insights into the complexities of human thought and behavior. It's an essential read for anyone interested in understanding the mental processes that underlie our choices and actions in both personal and professional contexts.

The most valuable lesson I learned from the book: I learned that both modes are valuable, but also have their drawbacks. I learned to be more aware of the biases and heuristics that influence my thinking, and to consciously choose when to rely on System 1 and when to engage System 2.

Atomic Habits

Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones by James Clear is a transformative guide that delves into the science of habits and how small changes can lead to remarkable results. The author presents a comprehensive framework for understanding how habits form and offers practical strategies for cultivating good habits and breaking bad ones. The core philosophy of the book is that minor improvements, or "atomic habits", can accumulate into significant, life-altering outcomes over time.

Clear emphasizes the importance of systems over goals, arguing that focusing on the processes and systems that lead to a goal is more effective than fixating on the goal itself. He introduces the Four Laws of Behavior Change, a set of simple, actionable principles to guide habit formation: make cues obvious, cravings attractive, responses easy, and rewards satisfying. Through a combination of scientific research, personal stories, and real-world examples, Clear illustrates how these principles can be applied to various aspects of life, from fitness and financial management to productivity and personal growth. "Atomic Habits" offers an accessible and compelling blueprint for building habits that stick and is valuable for anyone looking to make positive, lasting changes in their life.

The most valuable lesson I learned from the book: Many small changes to my daily routine, each of which affects my productivity only slightly, add up to a huge impact when combined.

Conscious Business

Conscious Business: How to Build Value Through Values by Fred Kofman is a thought-provoking book that explores the intersection of personal integrity and professional success. The author presents the idea that the key to creating a successful and sustainable business lies in conscious management practices, where personal values and ethical principles are at the forefront of decision-making processes. The book argues that success in business is not just about financial gain but also about achieving personal and professional fulfillment.

Kofman discusses various aspects of conscious business, including accountability, responsibility, emotional intelligence, communication skills, and the ability to resolve conflicts constructively. He emphasizes the importance of leaders who can inspire trust, cultivate a culture of openness and honesty, and lead with empathy. Through real-world examples, practical advice, and exercises, Kofman guides readers on how to develop these skills and apply them in their professional lives.

The most valuable lesson I learned from the book: The concept of unconditional response-ability. I now constantly remind myself that I have the power and responsibility to choose my responses to any situation, regardless of the circumstances. "Response-ability" is a play on the words "response" and "ability," highlighting the ability to respond consciously and proactively.

First, Break All The Rules

First, Break All the Rules: What the World's Greatest Managers Do Differently by Marcus Buckingham and Curt Coffman presents a radical approach to management based on research conducted by the Gallup Organization. This book challenges conventional wisdom about leadership and management, proposing that the most effective managers often defy standard practices.

The core message of the book is that great managers don't follow a single mold or adhere strictly to traditional management principles. Instead, they break the rules by focusing on their employees' individual strengths rather than trying to correct their weaknesses. The authors argue that this approach leads to higher engagement, productivity, and overall job satisfaction.

Buckingham and Coffman identify key insights and strategies that set apart the world's best managers. These include the importance of selecting talent over simply filling positions, defining the right outcomes rather than dictating the right steps, focusing on strengths rather than obsessing over weaknesses, and finding the right fit for employees rather than simply promoting them to the next rung on the ladder.

The most valuable lesson I learned from the book: The importance of focusing on strengths rather than weaknesses. I learned to accept my weaknesses as such and to use tools and strategies to compensate for them rather than trying to "fix" them. I invest my time and energy in developing my strengths instead, and I try to do the same for the people I lead.

Honorable Mentions

There are many more books that I found valuable on my journey from senior software engineer to tech lead. They are more focussed on specific technologies, which is why I did not include them in the main list. Nevertheless, I want to mention them here, as they might be relevant to you depending on the field/industry you are working in.

Database Internals by Alex Petrov. The best book on databases I have ever read. It covers all the fundamentals of databases in a very accessible way. It is a must-read for anyone working with databases.
Designing Data-Intensive Applications by Martin Kleppmann. A comprehensive guide to building data-intensive applications. It covers a wide range of topics, from databases and data processing to distributed systems and stream processing.
Oracle JRockit: The Definitive Guide by Marcus Hirt and Marcus Lagergren. A great resource for anyone interested in JVM internals.
The Linux Programming Interface by Michael Kerrisk. A very detailed book about Linux that covers a wide range of topics, from basic system calls to advanced topics like process groups, signals, and sockets.

Final Thoughts

While books are a great tool to learn, they are not a substitute for first-hand experience. You still need to make your own mistakes and learn from them. It also helps to talk about the books you read with others, to get their perspective and to challenge your own views. Maybe you can join a book club, or read the book together with a colleague or friend.

I hope this list will help you on your professional journey. If there is a book that inspired you and that you think should be on this list, please let me know in the comments below.
