<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frank Rosner</title>
    <description>The latest articles on DEV Community by Frank Rosner (@frosnerd).</description>
    <link>https://dev.to/frosnerd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F36307%2Fd20fd294-675f-436e-b246-3ee6ae0232fb.jpg</url>
      <title>DEV Community: Frank Rosner</title>
      <link>https://dev.to/frosnerd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frosnerd"/>
    <language>en</language>
    <item>
      <title>Unit Testing Alertmanager Routing and Inhibition Rules</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:40:12 +0000</pubDate>
      <link>https://dev.to/frosnerd/unit-testing-alertmanager-routing-and-inhibition-rules-1hj4</link>
      <guid>https://dev.to/frosnerd/unit-testing-alertmanager-routing-and-inhibition-rules-1hj4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There are three ways to find out your alertmanager routing tree is broken. You catch it during a careful review before anything goes wrong. You wake up at 3am to a page that went to the wrong team. Or an alert goes to the wrong receiver, nobody gets paged, and you find out when the customer calls. Most of us have experienced at least the second one.&lt;/p&gt;

&lt;p&gt;Alertmanager routing trees grow incrementally. A new team gets added, a new severity tier is introduced, someone adds a &lt;code&gt;continue: true&lt;/code&gt; flag and forgets to remove it. The config file remains valid YAML throughout. &lt;code&gt;amtool check-config&lt;/code&gt; keeps returning clean. Nothing tells you that warning alerts for &lt;code&gt;DatabaseDown&lt;/code&gt; are now waking up the frontend on-call instead of the backend team.&lt;/p&gt;

&lt;p&gt;This post describes a small Go tool we built to write unit tests for alertmanager routing and inhibition rules, run them in CI, and catch these mistakes before they matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Alertmanager gives you two built-in tools for validating config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;amtool check-config&lt;/code&gt;&lt;/strong&gt; validates syntax and structure. It cannot tell you whether an alert reaches the right receiver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;amtool config routes test&lt;/code&gt;&lt;/strong&gt; lets you interactively test a single alert against the routing tree. It is useful for manual debugging but does not support batch test files, and it has no notion of inhibition. You cannot assert that a warning is suppressed when a critical is firing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failure modes we care about fall into two categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Wrong receiver&lt;/strong&gt;: &lt;code&gt;SomeAlert&lt;/code&gt; with &lt;code&gt;team=backend&lt;/code&gt; ends up in the frontend Slack channel because a route was added above the team-based route without &lt;code&gt;continue: false&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broken inhibition&lt;/strong&gt;: A warning fires even though a critical is active for the same alert, flooding your incident channel with noise. Or worse: warnings are being silenced when they should not be, hiding real problems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both are semantic errors. The config is syntactically valid; the routing logic is just wrong. Manual testing by firing test alerts into a staging alertmanager is slow, stateful, and easy to skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Automated Unit Tests
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;alertmanager-routing-tests&lt;/code&gt; is a Go tool that evaluates alertmanager routing and inhibition rules purely in-memory, using alertmanager's own Go libraries. No running alertmanager instance is required.&lt;/p&gt;

&lt;p&gt;You give it an alertmanager config file and a YAML test file. It runs each test case and reports which passed and which failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  PASS Unmatched alert routes to default receiver
  PASS Watchdog alert routes to null receiver
  PASS Team A alert routes to team-a-slack
  FAIL "wrong receiver test"
       alert {alertname=SomeAlert}:
         expected: nonexistent-receiver
         actual:   default
=== routing tests: 3 passed, 1 failed ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 0 means all tests passed. Exit code 1 means at least one failed, which makes it CI-friendly by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The tool imports alertmanager's own Go packages directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;amconfig&lt;/span&gt; &lt;span class="s"&gt;"github.com/prometheus/alertmanager/config"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/prometheus/alertmanager/dispatch"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/prometheus/alertmanager/inhibit"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Routing&lt;/strong&gt; is straightforward. The config is loaded with &lt;code&gt;amconfig.LoadFile&lt;/code&gt;, and &lt;code&gt;dispatch.NewRoute(cfg.Route, nil).Match(labelSet)&lt;/code&gt; returns the same receiver list that a live alertmanager would produce for that label set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inhibition&lt;/strong&gt; is more involved. Alertmanager's inhibitor is designed to work against a live alert store. The tool works around this by implementing a minimal &lt;code&gt;provider.Alerts&lt;/code&gt; interface called &lt;code&gt;fakeAlerts&lt;/code&gt;, which serves a fixed set of alerts from a buffered channel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fakeAlerts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Subscribe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AlertIterator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Alert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alerts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alerts&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewAlertIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The inhibitor is constructed with this fake provider, its &lt;code&gt;Run()&lt;/code&gt; goroutine is started, and after a brief pause for it to process the alert feed, &lt;code&gt;Mutes(labelSet)&lt;/code&gt; is called for each alert to check whether it is suppressed.&lt;/p&gt;

&lt;p&gt;The key design decision is that all alerts in a test case are fired together. This is what allows source alerts to inhibit target alerts within the same test case. An alert with &lt;code&gt;severity=critical&lt;/code&gt; can suppress an alert with &lt;code&gt;severity=warning&lt;/code&gt; when both are present in the same case.&lt;/p&gt;

&lt;p&gt;Inhibition is checked first. If an alert is inhibited, receiver matching is skipped. Matching an inhibited alert to receivers is undefined behavior in a real alertmanager, so the test should assert inhibition explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing Tests
&lt;/h2&gt;

&lt;p&gt;Here is a minimal alertmanager config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resolve_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;

&lt;span class="na"&gt;inhibit_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"critical"'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;target_matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"warning"'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;alertname="Watchdog"&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;team="team-a"&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a-slack&lt;/span&gt;

&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a-slack&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test files are YAML with a &lt;code&gt;tests&lt;/code&gt; list. Each test case has a name and one or more alerts. Each alert has labels and an assertion: either &lt;code&gt;expected_receivers&lt;/code&gt; or &lt;code&gt;expected_inhibited: true&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;expected_receivers&lt;/code&gt;&lt;/strong&gt;: ordered list of receiver names the alert must match. Order matters because alertmanager's routing order is significant when &lt;code&gt;continue: true&lt;/code&gt; is used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;expected_inhibited&lt;/code&gt;&lt;/strong&gt;: set to &lt;code&gt;true&lt;/code&gt; to assert the alert is suppressed. Omit (or leave false) otherwise. Do not set both on the same alert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are the corresponding tests that exercise all four behaviors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Anything not matched by a specific route falls through to the default receiver.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unmatched&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;routes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;receiver"&lt;/span&gt;
    &lt;span class="na"&gt;alerts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SomeAlert&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;

  &lt;span class="c1"&gt;# Watchdog is a synthetic heartbeat alert. It must not page anyone.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Watchdog&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;routes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;null&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;receiver"&lt;/span&gt;
    &lt;span class="na"&gt;alerts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Watchdog&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null"&lt;/span&gt;

  &lt;span class="c1"&gt;# Team-based routing: alerts with team=team-a go to team-a-slack.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Team&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;alert&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;routes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;team-a-slack"&lt;/span&gt;
    &lt;span class="na"&gt;alerts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TeamAAlert&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;team-a-slack&lt;/span&gt;

  &lt;span class="c1"&gt;# Inhibition: a critical suppresses a warning with the same alertname.&lt;/span&gt;
  &lt;span class="c1"&gt;# Both alerts are fired together so the inhibitor can evaluate the relationship.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suppresses&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;same&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;alertname"&lt;/span&gt;
    &lt;span class="na"&gt;alerts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SomeAlert&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SomeAlert&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;expected_inhibited&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Inhibition boundary: a critical for AlertOne does NOT suppress a warning&lt;/span&gt;
  &lt;span class="c1"&gt;# for AlertTwo because the inhibit rule requires equal alertname.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;NOT&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suppress&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;warning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;different&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;alertname"&lt;/span&gt;
    &lt;span class="na"&gt;alerts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AlertOne&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;alertname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AlertTwo&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;expected_receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last test case is easy to miss in manual testing. The inhibition rule says "a critical suppresses a warning with the same &lt;code&gt;alertname&lt;/code&gt;." Without a test pinning this boundary, a future change to the inhibition rule could accidentally broaden the &lt;code&gt;equal&lt;/code&gt; list (or remove it entirely) and start silencing warnings across unrelated alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Tool
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go run &lt;span class="nb"&gt;.&lt;/span&gt; example-alertmanager.yaml example-routing-tests.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All five tests above pass against the example config. To see a failure, change an &lt;code&gt;expected_receivers&lt;/code&gt; entry to a nonexistent receiver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  PASS Unmatched alert routes to default receiver
  FAIL "Watchdog alert routes to null receiver"
       alert {alertname=Watchdog, severity=critical}:
         expected: nonexistent-receiver
         actual:   null
=== routing tests: 1 passed, 1 failed ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool exits with code 1, which blocks CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating with CI via Helm Charts
&lt;/h2&gt;

&lt;p&gt;Many teams store their alertmanager config inside a Helm chart rather than as a standalone file. The config may be embedded as a YAML string inside a values file, or rendered into a ConfigMap or ApplicationSet at deploy time.&lt;/p&gt;

&lt;p&gt;To test the rendered config, you need to extract it from the rendered template before passing it to the tool. Here is a Makefile target that does this end-to-end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nv"&gt;ROUTING_TEST_YAML&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; ./alertmanager-unit-tests/routing-tests.yaml

&lt;span class="nl"&gt;.PHONY&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;test&lt;/span&gt;
&lt;span class="nl"&gt;test&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nv"&gt;WORKDIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;$$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    helm template &lt;span class="nb"&gt;test&lt;/span&gt; ./my-chart &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"./alertmanager-unit-tests/values.yaml"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        | yq &lt;span class="s1"&gt;'select(.kind == "ConfigMap") | .data["alertmanager.yaml"]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;WORKDIR/alertmanager.yaml"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/alertmanager-routing-tests &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        go run &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;WORKDIR/alertmanager.yaml"&lt;/span&gt; &lt;span class="p"&gt;$(&lt;/span&gt;CURDIR&lt;span class="p"&gt;)&lt;/span&gt;/&lt;span class="p"&gt;$(&lt;/span&gt;ROUTING_TEST_YAML&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;WORKDIR"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;yq&lt;/code&gt; expression selects the rendered alertmanager config from the template output. Adjust the selector to match your chart's structure. If the config is embedded as a YAML string inside another resource (for example, an ArgoCD ApplicationSet), you may need &lt;code&gt;from_yaml&lt;/code&gt; to parse it before extracting the alertmanager section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;| yq &lt;span class="s1"&gt;'select(.kind == "ApplicationSet") \
    | .spec.template.spec.source.helm.values \
    | from_yaml \
    | .alertmanager.config'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place, &lt;code&gt;make test&lt;/code&gt; renders the chart and runs the routing tests in a single step. No live cluster, no running alertmanager. The tests run in CI the same way they run locally.&lt;/p&gt;

&lt;p&gt;The test YAML lives next to the chart and is reviewed in the same pull requests that change the alertmanager config. Routing changes need passing tests to merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Alertmanager routing bugs are quiet. The config is valid, deployment succeeds, and the tree looks right when you read it. You only find out something is wrong when an alert fires and the wrong team gets paged, or nobody gets paged at all, or a customer calls.&lt;/p&gt;

&lt;p&gt;Unit tests for routing rules are not conceptually different from unit tests for application code. The logic is complex, the failure modes are silent, and the consequences are real. A test file that exercises your routing tree, including inhibition boundaries, makes routing changes reviewable and gives you a CI gate that catches regressions before they reach production.&lt;/p&gt;

&lt;p&gt;If you manage alertmanager config, consider starting with three test cases: the default catch-all, one named receiver route, and one inhibition rule. Extend from there as your routing tree grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;The most natural home for this feature is &lt;code&gt;amtool&lt;/code&gt; itself. The existing &lt;code&gt;amtool config routes test&lt;/code&gt; command already evaluates a single alert interactively. Extending it to accept a YAML file with multiple test cases and inhibition assertions would make batch routing tests available to the entire Prometheus community without a separate tool.&lt;/p&gt;

&lt;p&gt;A contribution along these lines would require adding inhibition support and a test runner loop to the existing command, which is straightforward work on top of what &lt;code&gt;amtool&lt;/code&gt; already does. We are considering contributing this upstream.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>prometheus</category>
      <category>alertmanager</category>
    </item>
    <item>
      <title>Taming Prometheus Scrapes - Understanding and Analyzing Your Metrics Endpoints</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Thu, 26 Feb 2026 07:43:45 +0000</pubDate>
      <link>https://dev.to/frosnerd/taming-prometheus-scrapes-understanding-and-analyzing-your-metrics-endpoints-ik5</link>
      <guid>https://dev.to/frosnerd/taming-prometheus-scrapes-understanding-and-analyzing-your-metrics-endpoints-ik5</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Prometheus and the Scrape Format
&lt;/h2&gt;

&lt;p&gt;Prometheus is a pull-based monitoring system. Instead of having services push metrics to a central collector, Prometheus periodically fetches (or &lt;em&gt;scrapes&lt;/em&gt;) an HTTP endpoint (typically &lt;code&gt;/metrics&lt;/code&gt;) from each monitored target. The response is a plain-text document in the &lt;a href="https://prometheus.io/docs/instrumenting/exposition_formats/" rel="noopener noreferrer"&gt;Prometheus exposition format&lt;/a&gt;, listing every metric the service wants to expose. A scrape looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="GET",status="500"} 7
http_requests_total{method="POST",status="200"} 89

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 52428800
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each metric family starts with optional &lt;code&gt;# HELP&lt;/code&gt; (description) and &lt;code&gt;# TYPE&lt;/code&gt; (counter, gauge, histogram, summary, or untyped) comment lines, followed by one or more &lt;em&gt;series&lt;/em&gt;: individual data points distinguished by their &lt;em&gt;label sets&lt;/em&gt;. In the example above, &lt;code&gt;http_requests_total&lt;/code&gt; has three series, one for each combination of &lt;code&gt;method&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metric Types
&lt;/h3&gt;

&lt;p&gt;Prometheus defines four main metric types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Counter&lt;/strong&gt;: A monotonically increasing value (e.g. total requests, total errors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gauge&lt;/strong&gt;: A value that can go up or down (e.g. current memory usage, queue depth).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Histogram&lt;/strong&gt;: Samples observations into configurable buckets, plus a &lt;code&gt;_sum&lt;/code&gt; and &lt;code&gt;_count&lt;/code&gt;. Used for measuring things like request latencies or payload sizes. Each bucket boundary becomes its own series, labeled with &lt;code&gt;le&lt;/code&gt; ("less than or equal").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt;: Similar to a histogram but pre-computes quantiles client-side, labeled with &lt;code&gt;quantile&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Histograms and summaries deserve a special mention when thinking about cardinality. A single histogram metric with 12 configured buckets exposes 14 series per label combination: 12 bucket series (&lt;code&gt;le=...&lt;/code&gt;), one &lt;code&gt;_sum&lt;/code&gt;, and one &lt;code&gt;_count&lt;/code&gt;. If that histogram also carries a label like &lt;code&gt;endpoint&lt;/code&gt; with 50 distinct values, you are already looking at 700 series from one metric family.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Big Challenges: Size and Cardinality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scrape Size
&lt;/h3&gt;

&lt;p&gt;Every scrape is an HTTP response body that Prometheus has to download, parse, and ingest. For small services exposing a few dozen metrics this is trivial, but instrumented runtimes (Go, JVM, .NET) can expose hundreds of metrics by default, and larger applications may expose thousands. Scrape payloads regularly reach several megabytes, and in extreme cases tens of megabytes.&lt;/p&gt;

&lt;p&gt;A large scrape causes two concrete problems. First, it consumes bandwidth on every scrape interval (typically 15–60 seconds). Second, and more critically, Prometheus enforces a scrape timeout: if the target takes too long to respond, the scrape is marked as failed and the data is simply not ingested. I encountered scrapes exceeding 64 MB that consistently timed out, meaning that data was &lt;em&gt;never&lt;/em&gt; in Prometheus, making it impossible to debug the problem from inside Prometheus itself.&lt;/p&gt;

&lt;p&gt;One common remedy is enabling gzip compression on the &lt;code&gt;/metrics&lt;/code&gt; endpoint. Prometheus supports &lt;code&gt;Accept-Encoding: gzip&lt;/code&gt; out of the box, and the text format compresses well. Compression can reduce transfer size by 80–90%, which helps significantly with bandwidth and timeout margins. However, gzip only addresses the &lt;em&gt;transport&lt;/em&gt; problem. The data still has to be decompressed and parsed by Prometheus, and all those series still have to be stored and indexed. The real cost of a large scrape is not the bytes on the wire: it is the &lt;em&gt;cardinality&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  High Cardinality
&lt;/h3&gt;

&lt;p&gt;Cardinality is the number of distinct time series a metric produces (i.e. the number of unique label value combinations). A metric with no labels has a cardinality of 1. A metric with a &lt;code&gt;method&lt;/code&gt; label (say, 5 values) and a &lt;code&gt;status&lt;/code&gt; label (say, 10 values) has a cardinality of up to 50.&lt;/p&gt;

&lt;p&gt;Prometheus stores each series independently: it allocates memory for it, writes it to disk, and indexes it. High cardinality therefore translates directly into high memory usage, large on-disk storage, and slower queries. Unlike scrape size, this cost cannot be compressed away. It is structural.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 1: The High-Cardinality Label
&lt;/h4&gt;

&lt;p&gt;The classic example is instrumenting a metric with a label whose value comes from an unbounded domain. Imagine tracking HTTP requests per session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TYPE http_requests_total counter
http_requests_total{session_id="a1b2c3"} 12
http_requests_total{session_id="x9y8z7"} 4
http_requests_total{session_id="p0q1r2"} 31
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each new user session creates a new series. With thousands of active sessions, Prometheus is ingesting thousands of new series every minute. Because counters are monotonically increasing, Prometheus keeps these series in memory until they are explicitly garbage-collected, which by default only happens after a series has not been seen for 5 minutes. A busy service with short sessions can create a continuously growing "cardinality debt" that eventually causes Prometheus to run out of memory.&lt;/p&gt;

&lt;p&gt;Session IDs are an obvious case, but the same pattern appears with any high-churn identifier: request IDs, trace IDs, user IDs in a large system, or dynamically generated job names.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scenario 2: Label Leaks from Stale Metadata
&lt;/h4&gt;

&lt;p&gt;A similar problem arises when label values can become stale, and the service fails to clean up its own metric registry. A common example in Kubernetes is a service that tracks metrics per pod (perhaps a controller, a proxy, or a sidecar that monitors its neighbors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TYPE watched_pod_restarts_total counter
watched_pod_restarts_total{pod="web-7d4f9b-xkqzp"} 2
watched_pod_restarts_total{pod="web-7d4f9b-mnprt"} 0
watched_pod_restarts_total{pod="web-7d4f9b-tz9vw"} 5
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a pod is removed (due to a rolling deployment, a crash, or a scale-down), the correct behavior is to also delete the corresponding metric series from the registry. If the code forgets to do that, the series for the old pod keeps appearing in every scrape indefinitely. The service is essentially accumulating a series for every pod it has ever seen.&lt;/p&gt;

&lt;p&gt;This is a pure instrumentation bug, and it can be surprisingly hard to notice. The service appears healthy, the scrape succeeds, and individual series look reasonable in isolation. But over time, especially in clusters with frequent deployments, the cardinality of that metric grows without bound. By the time someone notices the Prometheus memory usage climbing, hundreds or thousands of ghost series may already be present in the scrape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ad-Hoc Analysis: The Shell Toolbox
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Not Just Query Prometheus?
&lt;/h3&gt;

&lt;p&gt;Before reaching for shell tools, it is worth asking: why analyse the raw scrape at all, rather than using PromQL inside Prometheus?&lt;/p&gt;

&lt;p&gt;There are two situations where Prometheus itself cannot help you. The first is when the scrape never made it into Prometheus in the first place. A scrape that exceeds the configured timeout is simply dropped: no data is ingested, and there is nothing to query. Large scrapes fail exactly like this, which means the tool you would normally use to investigate the problem is blind to it.&lt;/p&gt;

&lt;p&gt;The second situation arises when a remote write intermediary sits between the scraper and the TSDB. Tools like &lt;a href="https://docs.victoriametrics.com/vmagent/" rel="noopener noreferrer"&gt;vmagent&lt;/a&gt; scrape targets and forward metrics via the remote write protocol, which carries only sample data: it discards &lt;code&gt;# HELP&lt;/code&gt; and &lt;code&gt;# TYPE&lt;/code&gt; metadata. Once the data is in the storage backend, you lose the ability to filter or group by metric type, or to see the human-readable descriptions that often give the clearest clue about what a metric is and why its cardinality is high.&lt;/p&gt;

&lt;p&gt;In both cases, working directly with the raw scrape text is the only option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Counting Series with Shell Tools
&lt;/h3&gt;

&lt;p&gt;Given a saved scrape, a first instinct is to reach for standard Unix tools. Here are the kinds of questions you might try to answer and how you would approach them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many lines does the scrape have?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &amp;lt; prometheus-scrape.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1069
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Which metric families are present?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'^# TYPE'&lt;/span&gt; prometheus-scrape.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TYPE go_gc_cycles_automatic_gc_cycles_total counter
# TYPE go_gc_cycles_forced_gc_cycles_total counter
# TYPE go_gc_cycles_total_gc_cycles_total counter
# TYPE go_gc_duration_seconds summary
# TYPE go_gc_gogc_percent gauge
# TYPE go_gc_gomemlimit_bytes gauge
# TYPE go_gc_heap_allocs_by_size_bytes histogram
...
# TYPE prometheus_http_requests_total counter
# TYPE prometheus_http_response_size_bytes histogram
# TYPE promhttp_metric_handler_requests_in_flight gauge
# TYPE promhttp_metric_handler_requests_total counter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How many series does each metric expose?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The idea is to count non-comment, non-empty lines per metric family. One approach: strip comment and blank lines, extract the metric name (the part before &lt;code&gt;{&lt;/code&gt; or the first space), then count occurrences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s1"&gt;'^#'&lt;/span&gt; prometheus-scrape.txt &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s1"&gt;'^$'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/[{ ].*//'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;59 prometheus_http_requests_total
20 prometheus_engine_query_duration_histogram_seconds_bucket
18 prometheus_sd_kubernetes_events_total
15 prometheus_tsdb_compaction_duration_seconds_bucket
13 prometheus_tsdb_compaction_chunk_size_bytes_bucket
13 prometheus_tsdb_compaction_chunk_samples_bucket
12 prometheus_engine_query_duration_seconds
12 net_conntrack_dialer_conn_failed_total
12 go_gc_heap_frees_by_size_bytes_bucket
12 go_gc_heap_allocs_by_size_bytes_bucket
11 prometheus_tsdb_compaction_chunk_range_seconds_bucket
10 prometheus_http_request_duration_seconds_bucket
 9 prometheus_http_response_size_bytes_bucket
 8 prometheus_tsdb_sample_ooo_delta_bucket
 8 go_sched_pauses_total_other_seconds_bucket
 8 go_sched_pauses_total_gc_seconds_bucket
 8 go_sched_pauses_stopping_other_seconds_bucket
 8 go_sched_pauses_stopping_gc_seconds_bucket
 8 go_sched_latencies_seconds_bucket
 8 go_gc_pauses_seconds_bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipes the scrape through a sequence of filters: drop comment lines, drop blank lines, strip everything after the metric name, sort, count duplicates, and sort by count descending.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Limits of This Approach
&lt;/h3&gt;

&lt;p&gt;The pipeline above works for simple gauges and counters, but it breaks down for histograms and summaries. You can already see this in the output above: &lt;code&gt;prometheus_engine_query_duration_histogram_seconds_bucket&lt;/code&gt; appears as its own entry with 20 lines, but the corresponding &lt;code&gt;_sum&lt;/code&gt; and &lt;code&gt;_count&lt;/code&gt; lines are counted separately and buried lower in the list. The &lt;code&gt;sed&lt;/code&gt; pattern extracts different "names" for each suffix (&lt;code&gt;_bucket&lt;/code&gt;, &lt;code&gt;_sum&lt;/code&gt;, &lt;code&gt;_count&lt;/code&gt;), so the count is fragmented across multiple rows rather than attributed to a single metric family. The true series count for that histogram is higher than any individual row suggests. Reassembling these correctly requires knowing the metric type, which means parsing the &lt;code&gt;# TYPE&lt;/code&gt; lines and correlating them with the data lines. That turns a one-liner into a non-trivial script.&lt;/p&gt;

&lt;p&gt;There are further edge cases: metrics with no labels, metric names that are prefixes of other metric names, and histograms where the bucket count varies across label dimensions. Shell pipelines are quick to write but fragile to maintain, and getting an accurate cardinality figure for a real-world scrape is harder than it looks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing scrapecli
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/FRosner/scrapecli" rel="noopener noreferrer"&gt;scrapecli&lt;/a&gt; is a small command-line tool that reads a Prometheus scrape from stdin and prints a structured summary. It uses the official Prometheus client libraries to parse the exposition format, so it understands metric types correctly: histograms and summaries are counted as single families, not fragmented by suffix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;If you have Go installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/FRosner/scrapecli@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, download a pre-built binary for your platform from the &lt;a href="https://github.com/FRosner/scrapecli/releases" rel="noopener noreferrer"&gt;releases page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Usage
&lt;/h3&gt;

&lt;p&gt;Pipe any Prometheus scrape into scrapecli:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9090/metrics | scrapecli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or analyse a saved scrape file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prometheus-scrape.txt | scrapecli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running it against the same Prometheus scrape from the previous section produces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7r62zsp4dafhehg9zdv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7r62zsp4dafhehg9zdv.png" width="800" height="1373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Output Tells You
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt; gives an immediate sense of the scrape's overall footprint: total size on disk and which metrics are consuming the most series. Contrast the Top Metrics list with the awk output from the previous section: &lt;code&gt;prometheus_engine_query_duration_histogram_seconds&lt;/code&gt; is correctly listed as a single family with 20 series, rather than appearing as a fragmented &lt;code&gt;_bucket&lt;/code&gt; entry. Each entry also shows its byte contribution, making it easy to see which metrics dominate the scrape size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types&lt;/strong&gt; breaks down the metric count by type. Seeing 20 histograms in a scrape is a prompt to check their bucket counts and label cardinality, since histograms multiply series quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labels&lt;/strong&gt; shows every label name that appears across the scrape, how many distinct values it takes globally, and how many metric families use it. The &lt;code&gt;handler&lt;/code&gt; label having 59 distinct values across 3 metrics immediately explains why &lt;code&gt;prometheus_http_requests_total&lt;/code&gt; leads the cardinality ranking. The &lt;code&gt;&amp;lt;none&amp;gt;&lt;/code&gt; entry counts metrics that carry no labels at all, which is useful context for understanding what fraction of the scrape is label-free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt; lists every metric family with its type, series count, labels, and description. This is where you can quickly scan for unfamiliar metrics, check whether a metric's description matches your expectation, or spot a metric with an unexpectedly high cardinality.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON Output for Scripting
&lt;/h3&gt;

&lt;p&gt;Passing &lt;code&gt;-o json&lt;/code&gt; emits the same information as structured JSON, which is useful for feeding into other tools or automating cardinality checks in CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prometheus-scrape.txt | scrapecli &lt;span class="nt"&gt;-o&lt;/span&gt; json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79033&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"top_cardinalities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prometheus_http_requests_total"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cardinality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prometheus_engine_query_duration_histogram_seconds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cardinality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type_counts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"counter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;109&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gauge"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"histogram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You could, for example, use &lt;code&gt;jq&lt;/code&gt; to fail a CI step if any single metric exceeds a cardinality threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;prometheus-scrape.txt | scrapecli &lt;span class="nt"&gt;-o&lt;/span&gt; json &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'[.summary.top_cardinalities[] | select(.cardinality &amp;gt; 100)] | length &amp;gt; 0'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;High cardinality is one of the most common and costly problems in Prometheus deployments, and it tends to be discovered late, usually when memory usage is already climbing or scrapes are already failing. The root cause is usually straightforward once you can see it: an unbounded label, a metric registry that was never cleaned up, a histogram with too many dimensions. The difficulty is getting a clear view of what is actually in a scrape before things go wrong.&lt;/p&gt;

&lt;p&gt;Shell tools can get you part of the way there, but they require careful construction and give inaccurate results for histograms and summaries. Querying Prometheus directly is not always an option, especially when the scrape is too large to ingest or when metadata has been stripped by a remote write pipeline.&lt;/p&gt;

&lt;p&gt;scrapecli is a small focused tool for exactly this gap: give it a scrape, and it tells you the size, the cardinality leaders, the type breakdown, and the label landscape, and it gets the numbers right. If you maintain a service that exposes Prometheus metrics, it is worth keeping in your toolbox for those moments when you need to understand what your &lt;code&gt;/metrics&lt;/code&gt; endpoint is actually producing.&lt;/p&gt;

&lt;p&gt;The project is open source and available at &lt;a href="https://github.com/FRosner/scrapecli" rel="noopener noreferrer"&gt;github.com/FRosner/scrapecli&lt;/a&gt;. Have you run into oversized scrapes or runaway cardinality in your own setup? I'd love to hear about it in the comments: what caused it, how you found it, and how you fixed it.&lt;/p&gt;

</description>
      <category>go</category>
      <category>prometheus</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Catching Race Conditions in Go</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Wed, 25 Feb 2026 09:35:09 +0000</pubDate>
      <link>https://dev.to/frosnerd/catching-race-conditions-in-go-307l</link>
      <guid>https://dev.to/frosnerd/catching-race-conditions-in-go-307l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Concurrency is one of Go's greatest strengths. Goroutines are cheap, channels are expressive, and the standard library is built with concurrency in mind. But with concurrency comes a classic hazard: race conditions.&lt;/p&gt;

&lt;p&gt;A race condition occurs when two or more goroutines access the same memory concurrently, and at least one of them is writing. The result depends on the exact scheduling order, which the Go runtime does not guarantee. This means your program may work correctly nine times out of ten, only to produce corrupted data or crash unpredictably in production under load.&lt;/p&gt;

&lt;p&gt;What makes race conditions particularly dangerous is their silence. The compiler won't warn you. Your tests may pass. The bug only surfaces under the right (wrong) timing conditions, often making it hard to reproduce and painful to debug.&lt;/p&gt;

&lt;p&gt;Fortunately, Go ships with a built-in race detector. By simply adding the &lt;code&gt;-race&lt;/code&gt; flag to your &lt;code&gt;go test&lt;/code&gt;, &lt;code&gt;go run&lt;/code&gt;, or &lt;code&gt;go build&lt;/code&gt; commands, Go instruments your code to monitor memory accesses at runtime and report any races it observes, complete with a detailed stack trace pointing you directly to the problem.&lt;/p&gt;

&lt;p&gt;In this post, we'll explore how the race detector works, walk through a concrete example, and look at how to make it a standard part of your development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Race Detector?
&lt;/h2&gt;

&lt;p&gt;Go's race detector is a dynamic analysis tool built directly into the Go toolchain. You don't need to install anything extra; it's been available since Go 1.1.&lt;/p&gt;

&lt;p&gt;Under the hood, it is powered by &lt;a href="https://clang.llvm.org/docs/ThreadSanitizer.html" rel="noopener noreferrer"&gt;ThreadSanitizer (TSan)&lt;/a&gt;, a battle-tested runtime instrumentation library originally developed at Google and now maintained as part of the LLVM project. It is also used in C/C++ toolchains like Clang and GCC. When you compile with &lt;code&gt;-race&lt;/code&gt;, the Go compiler inserts instrumentation around every memory read and write. At runtime, TSan tracks which goroutine last accessed each memory location and flags any unsynchronized concurrent accesses.&lt;/p&gt;

&lt;p&gt;Because it works at runtime, the race detector can only report races that actually happen during a given execution. It won't find races in code paths that aren't exercised by your tests, but there are no false positives. That said, instrumentation comes with overhead: programs compiled with &lt;code&gt;-race&lt;/code&gt; typically run 2–20× slower and use 5–10× more memory. This is perfectly acceptable for tests and CI, but means you generally won't ship race-enabled binaries to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Race Condition Example
&lt;/h2&gt;

&lt;p&gt;Let's look at a concrete example. Here is a simple &lt;code&gt;Counter&lt;/code&gt; type with an &lt;code&gt;Increment&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a test that increments it 1000 times concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"expected 1000, got %d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is in &lt;code&gt;Increment&lt;/code&gt;: &lt;code&gt;c.value++&lt;/code&gt; is not a single atomic operation. It compiles to a read, an increment, and a write. If two goroutines interleave those steps, one of the writes gets lost, so the final count ends up lower than 1000.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;go test&lt;/code&gt; shows this can fail outright:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- FAIL: TestCounter (0.00s)
    counter_test.go:23: expected 1000, got 939
FAIL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But even when the test happens to pass (perhaps on a slower machine or a lucky scheduling order), the race is still there, waiting to bite. Running with &lt;code&gt;-race&lt;/code&gt; makes it explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;==================
WARNING: DATA RACE
Read at 0x00c0000a01e8 by goroutine 9:
  github.com/frosner/go-test-race.(*Counter).Increment()
      counter.go:11 +0x84
  github.com/frosner/go-test-race.TestCounter.func1()
      counter_test.go:16 +0x80

Previous write at 0x00c0000a01e8 by goroutine 7:
  github.com/frosner/go-test-race.(*Counter).Increment()
      counter.go:11 +0x98
  github.com/frosner/go-test-race.TestCounter.func1()
      counter_test.go:16 +0x80

Goroutine 9 (running) created at:
  github.com/frosner/go-test-race.TestCounter()
      counter_test.go:14 +0x74
...
==================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output tells you exactly what happened: goroutine 9 read &lt;code&gt;c.value&lt;/code&gt; at &lt;code&gt;counter.go:11&lt;/code&gt; while goroutine 7 had just written to the same address. Both goroutines were spawned at &lt;code&gt;counter_test.go:14&lt;/code&gt;. There's no ambiguity about where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Race Detector
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;-race&lt;/code&gt; flag works with the three most common Go commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-race&lt;/span&gt; ./...   &lt;span class="c"&gt;# run tests with race detection (most common)&lt;/span&gt;
go run &lt;span class="nt"&gt;-race&lt;/span&gt; main.go  &lt;span class="c"&gt;# run a program with race detection&lt;/span&gt;
go build &lt;span class="nt"&gt;-race&lt;/span&gt;        &lt;span class="c"&gt;# build a race-enabled binary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For day-to-day development, &lt;code&gt;go test -race ./...&lt;/code&gt; is the one you'll use most. The &lt;code&gt;./...&lt;/code&gt; pattern runs all packages in the module recursively, so no race in any package goes undetected. &lt;code&gt;go build -race&lt;/code&gt; is useful when you want to run a long-lived service manually and observe it under realistic traffic, for example during load testing or manual QA. Just don't forget to swap it back out before shipping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controlling the race detector
&lt;/h3&gt;

&lt;p&gt;The race detector's behaviour can be tuned via the &lt;code&gt;GORACE&lt;/code&gt; environment variable, which accepts a space-separated list of options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GORACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"halt_on_error=1 log_path=/tmp/race"&lt;/span&gt; go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-race&lt;/span&gt; ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most useful options are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;halt_on_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exit immediately on the first race instead of continuing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;log_path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;stderr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write race reports to a file (e.g. &lt;code&gt;log_path=/tmp/race&lt;/code&gt; produces &lt;code&gt;/tmp/race.&amp;lt;pid&amp;gt;&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;strip_path_prefix&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;""&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove a path prefix from stack frames to reduce noise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In CI it's often worth setting &lt;code&gt;halt_on_error=1&lt;/code&gt; so that the first detected race fails the build loudly rather than letting the run continue and produce a wall of interleaved reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing the Race Condition
&lt;/h2&gt;

&lt;p&gt;There are three idiomatic ways to fix a data race in Go: a mutex, an atomic operation, or a channel. The right choice depends on the complexity of the shared state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: sync.Mutex
&lt;/h3&gt;

&lt;p&gt;A mutex is the most general solution. It works for any shared state, including structs with multiple fields that must be updated together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"sync"&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;mu&lt;/span&gt;    &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;Value&lt;/code&gt; also needs the lock: reading shared memory concurrently with a write is itself a race.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: sync/atomic
&lt;/h3&gt;

&lt;p&gt;For a single integer counter, &lt;code&gt;sync/atomic&lt;/code&gt; is simpler and faster than a mutex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"sync/atomic"&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;atomic.Int64&lt;/code&gt; (introduced in Go 1.19) provides a clean, type-safe API. Use atomics when you only need to update a single value; reach for a mutex as soon as the operation involves multiple fields.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Channels
&lt;/h3&gt;

&lt;p&gt;Channels express the Go proverb &lt;em&gt;"share memory by communicating"&lt;/em&gt;. Instead of protecting shared state with a lock, you give exclusive ownership of the value to a single dedicated goroutine and communicate with it via channels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;inc&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewCounter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}),&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}()&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inc&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;value&lt;/code&gt; variable lives exclusively inside the goroutine, so nothing else can touch it. There is no shared memory and therefore no race. &lt;code&gt;Increment&lt;/code&gt; sends a signal on &lt;code&gt;inc&lt;/code&gt;, and &lt;code&gt;Value&lt;/code&gt; receives the current count from &lt;code&gt;val&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The downside is lifecycle management: the background goroutine will leak if the &lt;code&gt;Counter&lt;/code&gt; is abandoned. A production-ready version would need a &lt;code&gt;Close&lt;/code&gt; method or a &lt;code&gt;context.Context&lt;/code&gt; to shut it down. For a simple counter, that's more ceremony than a mutex or atomic warrants, and that's exactly the point. But for more complex stateful objects where multiple fields must stay consistent, this pattern can be a clean and expressive alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confirming the fix
&lt;/h3&gt;

&lt;p&gt;After applying the mutex fix, &lt;code&gt;go test -race&lt;/code&gt; passes cleanly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ok    github.com/frosner/go-test-race  1.525s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;WARNING: DATA RACE&lt;/code&gt;. The race detector's silence is your green light.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  It's a runtime tool, so coverage matters
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, the race detector can only report races it actually observes. If a racy code path isn't exercised during a test run, no warning is produced. This means the race detector is only as good as your test suite.&lt;/p&gt;

&lt;p&gt;A test that calls &lt;code&gt;Increment&lt;/code&gt; once in a single goroutine will never trigger a race, even if the implementation is unsafe. The race in our example was only visible because the test deliberately ran 1000 concurrent goroutines. When writing tests for concurrent code, design them to exercise real concurrency: use multiple goroutines, vary timing with &lt;code&gt;runtime.Gosched()&lt;/code&gt; or small sleeps where appropriate, and aim for high coverage of concurrent code paths.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't run it in production, but do run it in CI
&lt;/h3&gt;

&lt;p&gt;The 2–20× slowdown and 5–10× memory increase make &lt;code&gt;-race&lt;/code&gt; unsuitable for production binaries. However, this overhead is almost always acceptable in a test or CI environment, where correctness matters far more than raw speed.&lt;/p&gt;

&lt;p&gt;The ideal setup is to run &lt;code&gt;go test -race ./...&lt;/code&gt; on every pull request. Races caught at review time are cheap to fix. Races caught in production (if they're caught at all) can mean hours of debugging a Heisenbug.&lt;/p&gt;

&lt;h3&gt;
  
  
  The race detector doesn't cover all concurrency bugs
&lt;/h3&gt;

&lt;p&gt;The race detector specifically finds data races: unsynchronized concurrent memory accesses. It will not catch deadlocks, livelocks, or logical race conditions where synchronization exists but the logic is still wrong. For those, you still need good tests and careful code review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips for Effective Use
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use -count to repeat tests
&lt;/h3&gt;

&lt;p&gt;Because the race detector only catches races that actually occur, a racy test might get lucky and pass on a single run. Passing &lt;code&gt;-count=N&lt;/code&gt; tells Go to run each test N times in the same process, increasing the chances of hitting the problematic interleaving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-race&lt;/span&gt; &lt;span class="nt"&gt;-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is particularly useful for tests that involve tight timing windows. It won't guarantee discovery, but it significantly raises the odds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use t.Parallel() to increase concurrency
&lt;/h3&gt;

&lt;p&gt;Marking tests with &lt;code&gt;t.Parallel()&lt;/code&gt; allows them to run concurrently with each other. This increases the overall goroutine concurrency during the test run, which gives the race detector more opportunities to observe racy interactions, especially across different test cases that share package-level state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Write tests that explicitly exercise concurrency
&lt;/h3&gt;

&lt;p&gt;For any type or function that is intended to be safe for concurrent use, write a test that actually uses it concurrently. The pattern used in our &lt;code&gt;TestCounter&lt;/code&gt; (spawning many goroutines, using a &lt;code&gt;sync.WaitGroup&lt;/code&gt; to wait for them, then asserting on the result) is a reliable template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestConcurrentAccess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"expected 1000, got %d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the type is not meant to be used concurrently, document that explicitly. It's just as important to set the right expectations as it is to protect the ones that need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add it to your CI pipeline
&lt;/h3&gt;

&lt;p&gt;The simplest way to ensure &lt;code&gt;-race&lt;/code&gt; is always run is to make it part of your standard test command in CI. In GitHub Actions, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go test -race ./...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. No extra tooling. You'll catch races on every push before they ever reach production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Race conditions are among the hardest bugs to debug: non-deterministic, often invisible under normal load, and capable of causing silent data corruption. Go's built-in race detector gives you a powerful, zero-setup tool to catch them before they reach production.&lt;/p&gt;

&lt;p&gt;We've seen how &lt;code&gt;-race&lt;/code&gt; instruments your code at compile time, reports unsynchronized memory accesses with precise stack traces, and leaves no room for false positives. We've also seen that fixing a detected race is usually straightforward, with &lt;code&gt;sync.Mutex&lt;/code&gt;, &lt;code&gt;sync/atomic&lt;/code&gt;, or channels all being idiomatic options depending on the situation.&lt;/p&gt;

&lt;p&gt;The cost of enabling it is low: one flag, a slower test run, and nothing more. The benefit is a whole class of concurrency bugs caught automatically, on every PR, before anyone is paged at 3am. If you're not already running &lt;code&gt;go test -race ./...&lt;/code&gt; in CI, that's the one takeaway from this post. Add it today.&lt;/p&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>testing</category>
      <category>ci</category>
    </item>
    <item>
      <title>Addressing the Limitations of Local Path Provisioner in Kubernetes</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Sun, 28 Dec 2025 07:25:51 +0000</pubDate>
      <link>https://dev.to/frosnerd/addressing-the-limitations-of-local-path-provisioner-in-kubernetes-3g12</link>
      <guid>https://dev.to/frosnerd/addressing-the-limitations-of-local-path-provisioner-in-kubernetes-3g12</guid>
      <description>&lt;h2&gt;
  
  
  Temporary Storage in Kubernetes
&lt;/h2&gt;

&lt;p&gt;In Kubernetes, containers are ephemeral and stateless by default, allowing for easy scaling and management. Some workloads might require storage for temporary files, however. In vanilla Kubernetes, you are presented with the following options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mount an &lt;code&gt;emptyDir&lt;/code&gt; volume, which will be created on the node where the pod is scheduled. It can either be backed by the node's default disk or the node's memory (&lt;code&gt;tmpfs&lt;/code&gt;). Some cloud providers also offer &lt;code&gt;emptyDir&lt;/code&gt; backed by local SSDs based on the node type. However, you cannot customize the mount point on the node for &lt;code&gt;emptyDir&lt;/code&gt; volumes, which means less flexibility.&lt;/li&gt;
&lt;li&gt;Mount a &lt;code&gt;hostPath&lt;/code&gt; volume which allows you to specify a custom mount point on the node. This is not recommended for most application due to security risks as it allows mounting arbitrary paths on the node. Also, each pod is "responsible" for mounting the right path to avoid conflicts between pods. There is no separation.&lt;/li&gt;
&lt;li&gt;Mount a &lt;code&gt;local&lt;/code&gt; volume, which is backed by a statically provisioned local PV. When configured correctly, this approach avoids scheduling issues if your pod is supposed to get the same local data back when it is rescheduled. However, you have to manually create the PVs and manage them, which makes &lt;code&gt;local&lt;/code&gt; volumes impractical for most production use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While most workloads might be fine with &lt;code&gt;emptyDir&lt;/code&gt; for storing temporary data, some applications have specific I/O requirements, such as configuring the filesystem in a certain way or choosing a certain RAID configuration for optimal performance. Think of databases or caches.&lt;/p&gt;

&lt;p&gt;We need a way to dynamically provision local storage, securely, conflict-free, mounted to the specific path on the node that is mounted to a fast local disk. Ideally, we want to avoid scheduling problems and enforce capacity limits. Additionally, &lt;code&gt;emptyDir&lt;/code&gt; will be wiped if the pod gets deleted, so we cannot reuse the volume even if the node still exists. This can be inconvenient if you want to reuse the state of your application after a rolling restart, for example.&lt;/p&gt;

&lt;p&gt;Local path provisioner provides a way to mount local storage as persistent volumes in Kubernetes dynamically. It checks many of the boxes we are looking for. Let's take a closer look.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does Local Path Provisioner Work?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/rancher/local-path-provisioner" rel="noopener noreferrer"&gt;Local path provisioner&lt;/a&gt; is a Go application that can be installed in your Kubernetes cluster, e.g. via Helm. Based on your configuration, it will create either &lt;code&gt;hostPath&lt;/code&gt; or &lt;code&gt;local&lt;/code&gt; based PVs on the node automatically.&lt;/p&gt;

&lt;p&gt;After installing the chart in your cluster, you will have access to the &lt;code&gt;local-path&lt;/code&gt; storage class. To utilize it, you could create a &lt;code&gt;StatefulSet&lt;/code&gt; with the respective volume claim template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test"&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-container&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;busybox&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sh'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$(hostname)"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/data/test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3600'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReadWriteOnce"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-path&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the &lt;code&gt;StatefulSet&lt;/code&gt; resource, the following events will unfold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;StatefulSet&lt;/code&gt; controller processes the first replica, creating the PVC and pod based on the template and setting the owner references. The pod will reference the PVC and both resources will be &lt;code&gt;Pending&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The PVC control loop detects an unbound PVC, matches the storage class &lt;code&gt;local-path&lt;/code&gt; to the provisioner &lt;code&gt;rancher.io/local-path&lt;/code&gt; and triggers dynamic provisioning since no matching PV exists.&lt;/li&gt;
&lt;li&gt;Local path provisioner watches PVC events via the Kubernetes API. If the storage class is configured with &lt;code&gt;volumeBindingMode: WaitForFirstConsumer&lt;/code&gt; (default), it will defer the PV binding to pod scheduling. This is useful because there might be other constraints in your pods such as node selectors, or resource requests, which could result in unschedulable pods if the PV is bound to the wrong node.&lt;/li&gt;
&lt;li&gt;The scheduler schedules the pod to a node and the PVC is annotated with &lt;code&gt;volume.kubernetes.io/selected-node&lt;/code&gt;, indicating the selected node to the PVC binding controller.&lt;/li&gt;
&lt;li&gt;The local path provisioner receives the PVC event, reads the &lt;code&gt;selected-node&lt;/code&gt; annotation and creates a PV on the selected node using &lt;code&gt;hostPath&lt;/code&gt; (default) in a node specific path, with a pod specific sub-path, avoiding conflicts. This is done via a helper pod that launches a container on the node to do the mounting. The PVC is then bound to the PV.&lt;/li&gt;
&lt;li&gt;Kubelet observes that the PVC is fully bound, pulls the container image (if needed), mounts the host dir to the container and starts it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;StatefulSet&lt;/code&gt; controller processes the second replica, similar to the first replica.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the default configuration, if both replicas were to be scheduled on the same node, the layout on the node would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/opt/local-path-provisioner/
├── pvc-&amp;lt;uuid&amp;gt;_local-storage_volume-test_default_local-storage-volume-test-0/
│   └── test
└── pvc-&amp;lt;uuid&amp;gt;_local-storage_volume-test_default_local-storage-volume-test-1/
    └── test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the PVC is deleted (and the reclaim policy is &lt;code&gt;Delete&lt;/code&gt;), the PV will be deleted as well. The provisioner will detect this and clean up the host directory by scheduling another helper pod.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the Limitations of Local Path Provisioner?
&lt;/h2&gt;

&lt;p&gt;While local path provisioner addresses the main shortcomings of &lt;code&gt;emptyDir&lt;/code&gt;, &lt;code&gt;local&lt;/code&gt; and &lt;code&gt;hostPath&lt;/code&gt; by dynamically and securely provisioning local volumes on nodes in a conflict-free manner, it comes with a few limitations of its own.&lt;/p&gt;

&lt;p&gt;First, if a node with a bound &lt;code&gt;local-path&lt;/code&gt; PV gets removed from the cluster, the provisioner cannot schedule the helper pod to unmount the PV upon PVC deletion, and thus the PV remains "stuck" until manually deleted (see &lt;a href="https://github.com/rancher/local-path-provisioner/issues/215" rel="noopener noreferrer"&gt;#215&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;While orphaned PVs are a minor inconvenience, the second issue is more severe: Since &lt;code&gt;local-path&lt;/code&gt; PVCs are tied to a node, but the lifecycle of a PVC in a &lt;code&gt;StatefulSet&lt;/code&gt; is decoupled from the pod lifecycle, pods can become unschedulable if recreated, because node selected for the PVC is not part of the cluster anymore, full, or otherwise unsuitable for scheduling. This leads to service outage, with pods stuck in &lt;code&gt;Pending&lt;/code&gt; phase until the PVC is deleted manually.&lt;/p&gt;

&lt;p&gt;Thirdly, while Kubernetes allows you to specify storage limits for PVCs, local path provisioner does not enforce them. This can lead to overcommitting resources causing unexpected out of disk errors in the applications.&lt;/p&gt;

&lt;p&gt;Luckily, there are different buildings blocks we can combine to address these issues: &lt;a href="https://kubernetes.io/docs/concepts/storage/ephemeral-storage/" rel="noopener noreferrer"&gt;Local ephemeral storage&lt;/a&gt;, &lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/storage_administration_guide/xfsquota" rel="noopener noreferrer"&gt;filesystem quotas&lt;/a&gt;, &lt;a href="https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes" rel="noopener noreferrer"&gt;generic ephemeral volumes&lt;/a&gt;, and a custom application I call local path cleaner. Let's dive into the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Ephemeral Storage
&lt;/h2&gt;

&lt;p&gt;The concept of local ephemeral storage was introduced in 2017 (v1.7) and &lt;a href="https://kubernetes.io/blog/2022/09/19/local-storage-capacity-isolation-ga/" rel="noopener noreferrer"&gt;reached GA&lt;/a&gt; in 2022 (v1.25). This means you can specify storage requests (and limits) in your container specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-container&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;busybox&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sh'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$(hostname)"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/data/test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3600'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ephemeral-storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5Gi"&lt;/span&gt;
  &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scheduler will take storage requirements into account when scheduling pods. We can use this to avoid overcommitting local storage on a node. However, local ephemeral storage and persistent volumes serve different purposes (one is ephemeral, the other persistent). Kubernetes does not track PVC volumes as ephemeral storage consumption so we cannot combine local ephemeral storage requests and &lt;code&gt;local-path&lt;/code&gt; PVCs out of the box.&lt;/p&gt;

&lt;p&gt;Luckily, with a trick during node provisioning, we can still achieve what we are looking for. Let's consider GCP as an example. In our startup script, we might manually mount multiple local NVMe SSD in a RAID0 device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find all SSDs&lt;/span&gt;
&lt;span class="nv"&gt;SSDs&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;readlink&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /dev/disk/by-id/google-local-nvme-ssd-&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create RAID0 device&lt;/span&gt;
mdadm &lt;span class="nt"&gt;--create&lt;/span&gt; /dev/md0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"--raid-devices=&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;{#SSDs[@]}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;{SSDs[@]}"&lt;/span&gt;

&lt;span class="c"&gt;# Format RAID0 device&lt;/span&gt;
mkfs.xfs &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4096 /dev/md0

&lt;span class="c"&gt;# Mount RAID0 device to /mnt/disks/ssd-array&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disks/ssd-array
mount /dev/md0 /mnt/disks/ssd-array
&lt;span class="nb"&gt;chmod &lt;/span&gt;a+w /mnt/disks/ssd-array

&lt;span class="c"&gt;# Create fstab entry (to survive reboots)&lt;/span&gt;
&lt;span class="nv"&gt;raid_dev_uuid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;blkid | &lt;span class="nb"&gt;grep &lt;/span&gt;dev/md0 | egrep &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s1"&gt;'[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"UUID=&lt;/span&gt;&lt;span class="nv"&gt;$raid_dev_uuid&lt;/span&gt;&lt;span class="s2"&gt; /mnt/disks/ssd-array xfs defaults,nofail,noatime 0 0"&lt;/span&gt; |&lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/fstab

&lt;span class="c"&gt;# Disable NODE_LOCAL_SSDS_EPHEMERAL as we manage ephemeral storage ourselves&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s|readonly NODE_LOCAL_SSDS_EPHEMERAL=true|readonly NODE_LOCAL_SSDS_EPHEMERAL=false|'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="s2"&gt;{KUBE_HOME}/kube-env"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubelet tracks ephemeral storage in certain locations. By bind mounting these into our RAID0 mount, we effectively enable Kubernetes to track the capacity of our custom local storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disks/ssd-array/lib/kubelet
&lt;span class="nb"&gt;mv&lt;/span&gt; /var/lib/kubelet/&lt;span class="k"&gt;*&lt;/span&gt; /mnt/disks/ssd-array/lib/kubelet
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /mnt/disks/ssd-array/lib/kubelet /var/lib/kubelet

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disks/ssd-array/lib/containerd
&lt;span class="nb"&gt;mv&lt;/span&gt; /var/lib/containerd/&lt;span class="k"&gt;*&lt;/span&gt; /mnt/disks/ssd-array/lib/containerd
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /mnt/disks/ssd-array/lib/containerd /var/lib/containerd

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/disks/ssd-array/stateful_partition
mount &lt;span class="nt"&gt;--bind&lt;/span&gt; /mnt/disks/ssd-array/stateful_partition /mnt/stateful_partition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, we could hard code the available ephemeral storage capacity in the kubelet config based on the available space on the RAID0 device. While this would allow the scheduler to take storage requests into account for your &lt;code&gt;local-path&lt;/code&gt; PVs, tracking actual usage will not work. If you wanted it to be 500Gi, you could run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s/(ephemeral-storage:).*/\1 500Gi/'&lt;/span&gt; /home/kubernetes/kubelet-config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When querying the node capacity, you should see the ephemeral storage capacity reflected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16"&lt;/span&gt;
    &lt;span class="na"&gt;ephemeral-storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500Gi&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Gi&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;110"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all we need to do is tell local path provisioner to use our custom mount point instead of the default &lt;code&gt;/opt/local-path-provisioner&lt;/code&gt;. We can do this by customizing the &lt;code&gt;ConfigMap&lt;/code&gt; via Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodePathMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DEFAULT_PATH_FOR_NON_LISTED_NODES&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/disks/ssd-array/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If setup correctly, this should prevent Kubernetes from overcommitting &lt;code&gt;local-path&lt;/code&gt; PVCs on a node. I admit that this is a bit of a hacky solution with multiple drawbacks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have to specify the requested storage capacity in two places: In the PVC and in the pod spec.&lt;/li&gt;
&lt;li&gt;Ephemeral storage usage tracking might be off, causing kubelet to not properly enforce ephemeral storage limits.&lt;/li&gt;
&lt;li&gt;We are repurposing the local ephemeral storage concept, which might cause confusion in larger organizations where many teams share the same multi-tenant Kubernetes cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you wanted to avoid overcommitting without ephemeral storage requests, you could try to align CPU and memory requests with the expected storage usage. Either way, once you have the overcommitting problem under control, we can move to enforcing the storage limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Filesystem Quotas
&lt;/h2&gt;

&lt;p&gt;By default, containers that have a &lt;code&gt;local-path&lt;/code&gt; PVC mounted, can use as much space in the volume as they want, independently of the space they requested. This can lead to noisy neighbor issues such as unexpected out of disk errors in the applications. Note that by requested space we are referring to the storage requests of the PVC, not the ephemeral storage requests of the container.&lt;/p&gt;

&lt;p&gt;Fortunately, filesystems such as XFS support configuring storage quotas. There is an excellent &lt;a href="https://github.com/rancher/local-path-provisioner/blob/964c10d96f3098b3d8c0efc9db7e8ad253097ec9/examples/quota/setup#L1-L29" rel="noopener noreferrer"&gt;minimal example&lt;/a&gt; in the local path provisioner repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;xfsPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VOL_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;pvcName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VOL_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$VOL_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;stat&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; %T &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;xfsPath&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;'xfs'&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/projects | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 1&lt;span class="sb"&gt;`&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="p"&gt;%&lt;/span&gt;:&lt;span class="p"&gt;*&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
    &lt;span class="k"&gt;else
        &lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$[&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;+1]
    &lt;span class="k"&gt;fi

    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VOL_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/projects
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;pvcName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/projid

    xfs_quota &lt;span class="nt"&gt;-x&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"project -s &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;pvcName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    xfs_quota &lt;span class="nt"&gt;-x&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"limit -p bhard=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VOL_SIZE_BYTES&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;pvcName&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;xfsPath&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
    xfs_quota &lt;span class="nt"&gt;-x&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"report -pbih"&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;xfsPath&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script first checks if the filesystem is XFS. If not, we exit and the PV is created without quotas. Then, it reads the project file to determine if there are any existing projects so we can pick the next project ID. Project files look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1:/some/path
2:/another/path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then increment the last project ID and create a new project for our PVC. Finally, we initialize the quota record for the project, set the limit, and print a report for debugging purpose. &lt;/p&gt;

&lt;p&gt;We can then pass this script via the Helm value &lt;code&gt;configmap.setup&lt;/code&gt;. To avoid inconsistencies, it's wise to write a corresponding script for &lt;code&gt;configmap.teardown&lt;/code&gt; that removes the quota + limits for the PVC. Note that for this approach to work, your node needs to have project quotas enabled on the mount point and your helper image needs to have &lt;code&gt;xfsprogs-extra&lt;/code&gt; installed. We can achieve the former by modifying our init script &lt;code&gt;mount&lt;/code&gt; and &lt;code&gt;/etc/fstab&lt;/code&gt; contents, adding the &lt;code&gt;prjquota&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount &lt;span class="nt"&gt;-o&lt;/span&gt; prjquota /dev/md0 /mnt/disks/ssd-array
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"UUID=&lt;/span&gt;&lt;span class="nv"&gt;$raid_dev_uuid&lt;/span&gt;&lt;span class="s2"&gt; /mnt/disks/ssd-array xfs defaults,nofail,noatime,prjquota 0 0"&lt;/span&gt; |&lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/fstab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To achieve the latter, you can specify a custom helper pod, which uses a container image with the required dependency installed (&lt;code&gt;apk --no-cache add xfsprogs-extra&lt;/code&gt;, e.g.) via the Helm values &lt;code&gt;configmap.helperPod&lt;/code&gt;. We now have a way to address overcommitting and enforcing storage limits, which allows us to safely put multiple pods with &lt;code&gt;local-path&lt;/code&gt; PVCs on the same node. Next, let's see how can we avoid unschedulable pods. &lt;/p&gt;

&lt;h2&gt;
  
  
  Generic Ephemeral Volumes
&lt;/h2&gt;

&lt;p&gt;Generic ephemeral volumes are similar to &lt;code&gt;emptyDir&lt;/code&gt; in that their lifecycle is bound to the pod. However, they allow accessing arbitrary PVC storage classes via a volume claim template. We can modify our &lt;code&gt;StatefulSet&lt;/code&gt; to use a generic ephemeral volume by moving the &lt;code&gt;spec.volumeClaimTemplate[0]&lt;/code&gt; into &lt;code&gt;spec.template.spec.volumes[0].ephemeral.volumeClaimTemplate&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test"&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-test&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-container&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;busybox&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sh'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-c'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$(hostname)"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/data/test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3600'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
          &lt;span class="na"&gt;ephemeral&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;volumeClaimTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-storage&lt;/span&gt;
              &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReadWriteOnce"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
                &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-path&lt;/span&gt;
                &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shifts the responsibility for creating the &lt;code&gt;local-path&lt;/code&gt; PVC from the &lt;code&gt;StatefulSet&lt;/code&gt; controller to the ephemeral volume controller. In turn, the owner reference of the PVC will point to the pod and no longer the &lt;code&gt;StatefulSet&lt;/code&gt;. When the pod is deleted, the Kubernetes garbage collector will delete the PVC, and if the reclaim policy is &lt;code&gt;Delete&lt;/code&gt;, the PV as well.&lt;/p&gt;

&lt;p&gt;This should prevent the situation where the new STS replica becomes unschedulable because the old PVC is still bound to a non-existing or otherwise unsuitable node. Note that this approach does not work if you want to reuse the existing PVCs, e.g. to facilitate rolling restarts / upgrades without losing temporary data (as long as the replica can get scheduled to the same node). Let's investigate a different approach you can use in case you want to reuse the PVCs, or are on an older Kubernetes version (&amp;lt; 1.23) that does not support generic ephemeral volumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Path Cleaner
&lt;/h2&gt;

&lt;p&gt;A few years ago, I wrote a Python application that I named "local path cleaner". It fulfills the following objectives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean up released local-path PVs on nodes that are no longer part of the cluster.&lt;/li&gt;
&lt;li&gt;Clean up local-path PVCs and the corresponding pods that are unschedulable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cleaning Released PVs
&lt;/h3&gt;

&lt;p&gt;To clean released local-path PVs, we implement the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all released local-path PVs.&lt;/li&gt;
&lt;li&gt;For each PV, check if the node it is bound to is part of the cluster.&lt;/li&gt;
&lt;li&gt;If the node is not part of the cluster, delete the PV.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's walk through the code. We're going to use the Kubernetes Python client to interact with the API. First, we use &lt;code&gt;list_persistent_volume&lt;/code&gt; to list all PVs. In Kubernetes, PVs are not namespaced, so we can list them all at once. Note that you should specify a page size and handle pagination accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_released_local_path_pvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1PersistentVolumeList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_persistent_volume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;watch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;storage_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_class_name&lt;/span&gt;
            &lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;storage_class&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local-path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Released&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's write a similar function to obtain all nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_nodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;V1Node&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;V1Node&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1NodeList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;watch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can find the PVs that are bound to nodes which are no longer part of the cluster. We'll walk through all the PVs, checking the node affinity selector terms to determine the assigned node. E.g. the following PV is bound to node &lt;code&gt;gke-main-data-node-c51c4677-285f&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PersistentVolume"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"annotations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pv.kubernetes.io/provisioned-by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cluster.local/local-path-storage-local-path-provisioner"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pvc-db-001"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"capacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2000Gi"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hostPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mnt/disks/ssd-array/pvc-db-app-0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DirectoryOrCreate"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nodeAffinity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nodeSelectorTerms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"matchExpressions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kubernetes.io/hostname"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"In"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="s2"&gt;"gke-main-data-node-c51c4677-285f"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"persistentVolumeReclaimPolicy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Delete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storageClassName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local-path"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"volumeMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Filesystem"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the Python code. We don't delete the PVs immediately but gather them first so we can implement a dry-run mode where we simply log all the PVs we would have deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_pvs_on_missing_nodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;V1Node&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_nodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;deletion_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;node_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1PersistentVolume&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;node_selector_match_expression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_affinity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node_selector_terms&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;match_expressions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node_selector_match_expression&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \ 
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;node_selector_match_expression&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;operator&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;In&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;node_selector_match_expression&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node_names&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;deletion_candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deletion_candidates&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we can put everything together and delete the PVs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_released_pvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pvs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_released_local_path_pvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;deletion_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_pvs_on_missing_nodes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pvs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;deletion_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_persistent_volume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cleaning Unschedulable Pods
&lt;/h3&gt;

&lt;p&gt;When using &lt;code&gt;local-path&lt;/code&gt; PVCs via &lt;code&gt;StatefulSet&lt;/code&gt; instead of ephemeral volumes on the pod level, pods can become unschedulable. A common reason is that the PVC is bound to a node that has been scaled down by the cluster autoscaler, has been cordoned, or another workload has moved onto it so it does not have enough capacity.&lt;/p&gt;

&lt;p&gt;While it is straightforward to reliably detect the first case, by comparing the node the PVC is bound to with the list of active nodes, I did not find a way to detect the other two cases. The Kubernetes events sometimes show hints about volume affinity conflicts, but this did not happen reliably in all cases.&lt;/p&gt;

&lt;p&gt;In the end I decided to purge bound local-path PVCs of unschedulable pods aggressively, as they could always be recreated and I'd rather live with losing temporary data than dealing with a prolonged service outage. Here's the high level algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all pending pods.&lt;/li&gt;
&lt;li&gt;Identify unschedulable pods from the pending pods.&lt;/li&gt;
&lt;li&gt;List all bound local-path PVCs.&lt;/li&gt;
&lt;li&gt;For each unschedulable pod, check if it has a bound local-path PVC.&lt;/li&gt;
&lt;li&gt;Delete the PVC and the pod, if the pod has a managing controller.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I found that deleting not only the PVC but also the pod reduces the time to recovery, as the managing controller will immediately recreate both the PVC and the pod in that case, triggering the scheduler and subsequently the local-path provisioner to get the pod onto a new node. Let's build the code step by step:&lt;/p&gt;

&lt;p&gt;First, we want to list all pending pods. We can use the &lt;code&gt;list_pod_for_all_namespaces&lt;/code&gt; method with the field selector &lt;code&gt;status.phase=Pending&lt;/code&gt; for that. Here we could also employ some namespace filtering to only consider pods in namespaces that match a given regular expression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pending_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_pod_for_all_namespaces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;watch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;field_selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status.phase=Pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;
        &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we keep only unschedulable pods. The information whether a pod is unschedulable is stored in &lt;code&gt;status.conditions&lt;/code&gt;. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lastProbeTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"lastTransitionTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-10T13:20:15Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0/3 nodes are available: 1 node(s) were unschedulable. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unschedulable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"False"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PodScheduled"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pending"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"qosClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Burstable"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on that we can write a helper function &lt;code&gt;get_condition&lt;/code&gt; that allows to get the condition of a given type from a pod (or &lt;code&gt;None&lt;/code&gt; if it does not exist):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1Pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;condition_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pod_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1PodStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1PodCondition&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;condition&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pod_status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conditions&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;condition_type&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can write a function to filter unschedulable pods by checking the &lt;code&gt;PodScheduled&lt;/code&gt; condition to be &lt;code&gt;False&lt;/code&gt; and the reason to be &lt;code&gt;Unschedulable&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_unschedulable_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pod_condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;V1PodCondition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PodScheduled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pod_condition&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; \
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;pod_condition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;False&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;pod_condition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unschedulable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we know which pods are unschedulable, we need to keep only the ones that have bound PVCs. Unfortunately, this information is not available in the pod resource, so we need to fetch the PVCs, too. If you need the ability to filter certain namespaces, you could add that here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_bound_local_path_pvcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pvcs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_persistent_volume_claim_for_all_namespaces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;watch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pvc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pvcs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;storage_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_class_name&lt;/span&gt;
            &lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;storage_class&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;local-path&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pvc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_continue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvcs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;_continue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we combine the unschedulable pods and the PVCs to find the pods that have a bound PVC. We convert the bound local-path PVC list into a dictionary by PVC name to efficiently check each of the volumes of each unschedulable pod and match them if possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_pods_with_pvcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pvcs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pvc_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;pods_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;pvcs_by_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pvc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pvc&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pvc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pvcs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persistent_volume_claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;pod_pvc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pvcs_by_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persistent_volume_claim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claim_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;pods_res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;pvc_res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod_pvc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pvc_res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pods_res&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Now we can combine everything together. I ended up adding a small &lt;code&gt;sleep&lt;/code&gt; call between deleting the PVC and the pod to reduce the risk of hitting a race condition where the pod would get recreated before the PVC, causing it to become unschedulable again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_unschedulable_pod_pvc_conflicts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pending_pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_pending_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unschedulable_pods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;filter_unschedulable_pods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pending_pods&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pvcs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_bound_local_path_pvcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pvc_deletion_candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_deletion_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_pods_with_pvcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unschedulable_pods&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pvcs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pvc_deletion_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_namespaced_persistent_volume_claim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pod_deletion_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_namespaced_pod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method relies on the fact that upon deletion of the PVC and the pod, some controller will recreate them. To prevent accidentally deleting pods that are not managed by a &lt;code&gt;StatefulSet&lt;/code&gt;, we can add a filter based on the owner reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;POD_CONTROLLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;StatefulSet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_owner_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;owner_references&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;owner_references&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;owner_references&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;owner_references&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pod&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;get_pod_owner_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;POD_CONTROLLERS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Delete the pod and PVC
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operations
&lt;/h3&gt;

&lt;p&gt;We can run this code on a schedule, either by using a Kubernetes cron job, or by having a pod running with a sleep loop. I prefer the long-running pod by using a &lt;code&gt;Deployment&lt;/code&gt;, as we want the code to run very frequently to reduce the impact of unschedulable pods. I recommend implementing proper logging, metrics, a dry-run mode, a configurable interval, filters, and flags for the different pieces for optimal operability.&lt;/p&gt;

&lt;p&gt;Since the cleaner has to interact with the Kubernetes API, it needs the following RBAC permissions, if you have RBAC enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-path-cleaner&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persistenvolumes"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persistentvolumeclaims"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodes"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary and Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we explored different options to provide ephemeral or semi-persistent local storage to your Kubernetes workload. For most use cases, &lt;code&gt;emptyDir&lt;/code&gt; volumes should be sufficient and those are the easiest to set up and manage. If you need customizable local storage, consider using &lt;code&gt;local-path&lt;/code&gt; PVCs managed by the generic ephemeral volume controller to avoid unschedulable pods. If the local storage needs to be semi-persistent, you can use &lt;code&gt;local-path&lt;/code&gt; PVCs managed by a &lt;code&gt;StatefulSet&lt;/code&gt; in combination with local path cleaner.&lt;/p&gt;

&lt;p&gt;In my opinion, managing state on Kubernetes has become a lot easier over the past few years. There are different controllers and mechanisms available to assist you. However, I would not call the problem solved, as there are still some use cases that are not covered by the standard Kubernetes buildings blocks, especially in applications with very specific I/O and operational requirements.&lt;/p&gt;

&lt;p&gt;Have you used local path provisioner? What is your experience in terms of operability? Let me know in the comments!&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Investigating Error Logs Using LangGraph, LangChain and Watsonx.ai</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Thu, 11 Dec 2025 10:36:32 +0000</pubDate>
      <link>https://dev.to/frosnerd/investigating-error-logs-using-langgraph-langchain-and-watsonxai-2o6j</link>
      <guid>https://dev.to/frosnerd/investigating-error-logs-using-langgraph-langchain-and-watsonxai-2o6j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When dealing with production systems, observability plays a key role. It is a vital component of incident investigations, the foundation for monitoring and alerting, and incredibly useful for validation of new functionality, improvements, or bug fixes being shipped. Application logs are a big part of observability.&lt;sup id="fnref1"&gt;1&lt;/sup&gt; Logs can help us understand what the system was doing at any particular point in time with a high degree of granularity.&lt;/p&gt;

&lt;p&gt;However, understanding application logs can be difficult. First, there can be many logs and finding the relevant ones is difficult. Indexing the logs and using a search engine to query them helps, but it cannot tell you which logs are related to the issue you are investigating. When I see an error in the logs that correlates with the timing of the incident, I usually ask myself a bunch of follow-up questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the error related to or possibly the root cause of the issue?&lt;/li&gt;
&lt;li&gt;Is this error a known problem?&lt;/li&gt;
&lt;li&gt;If it's known, has it been reported to the right team?&lt;/li&gt;
&lt;li&gt;If it has been reported, is it being worked on or even fixed?&lt;/li&gt;
&lt;li&gt;If it's fixed, is the fix rolled out to the environment I am investigating?&lt;/li&gt;
&lt;li&gt;If it's rolled out, why is the error still happening? Is there a regression?&lt;/li&gt;
&lt;li&gt;If there's a regression, has it been reported to the right team?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the face of an incident, answering these questions can make you lose valuable time. You might suggest to simply postpone the investigation until the bleeding is stopped, but sometimes valuable information on how to stop the bleeding is hidden in the answers to these questions.&lt;/p&gt;

&lt;p&gt;For example, there might be a bug ticket in progress in which someone commented a workaround you can apply. Or there might be a release candidate or hotfix release available you can potentially roll out prematurely. This is why I think it's valuable to investigate the logs deeply during incidents as well. I believe that GenAI can aid in answering these questions quickly for you during an investigation.&lt;/p&gt;

&lt;p&gt;In this post we are going to explore how to use GenAI to investigate (error) logs. We are going to use IBM Watsonx.ai and LangGraph in Python. The remainder of the post is structured as follows: First, we will lay the technological foundation, introducing LangGraph, LangChain, and watsonx.ai. Then we will dive into the design and implementation of our solution. We will close the post by summarizing the main findings and giving an outlook for future work.&lt;/p&gt;

&lt;h2&gt;
  
  
  LangGraph, LangChain and Watsonx.ai
&lt;/h2&gt;

&lt;p&gt;LangGraph is a graph-based orchestration framework for building stateful AI workflows, such as agents. It lets you model an AI application as a directed graph, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes are functions (e.g., an LLM call, a tool call, or custom logic) that operate on a shared state.&lt;/li&gt;
&lt;li&gt;Edges define how control flows from one node to the next, including conditional branches and loops.&lt;/li&gt;
&lt;li&gt;State is an explicit, shared data structure (like a &lt;code&gt;dict&lt;/code&gt; or &lt;code&gt;TypedDict&lt;/code&gt;) that all nodes can read and update, making it easy to build long-running, stateful agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangChain is often used as a building block within LangGraph nodes. It is a library for connecting LLMs to data and tools. By combining LangChain and LangGraph, you can build AI agents that can reason and act in cycles, adding a human in the loop as needed.&lt;/p&gt;

&lt;p&gt;Watsonx.ai is an enterprise AI solution by IBM, which among other functionality, offers managed LLMs. We are going to combine these three tools to build an AI agent to investigate our logs. The required functionality is provided by the following Python modules: &lt;a href="https://pypi.org/project/ibm-watsonx-ai/" rel="noopener noreferrer"&gt;&lt;code&gt;ibm-watsonx-ai&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://pypi.org/project/langgraph/" rel="noopener noreferrer"&gt;&lt;code&gt;langgraph&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://pypi.org/project/langchain-ibm/" rel="noopener noreferrer"&gt;&lt;code&gt;langchain-ibm&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To illustrate how the libraries interact, let's look at a simple example. The following code implements a basic agent that has access to a tool to get the weather in a city. We are using the &lt;code&gt;create_agent&lt;/code&gt; (successor to &lt;code&gt;create_react_agent&lt;/code&gt;) helper function, which creates a pre-built agent graph representing a chat and tool-calling loop maintaining the message history in the global state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatWatsonx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/llama-3-70b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WATSONX_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weather in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: 30°C and sunny.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the weather in Berlin?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more complex applications, you can build your own graph, e.g. by utilizing the &lt;a href="https://docs.langchain.com/oss/python/langgraph/graph-api" rel="noopener noreferrer"&gt;Graph API&lt;/a&gt;. Let's build our log investigation agent next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Investigation Agent
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scope
&lt;/h3&gt;

&lt;p&gt;In the introduction, I shared a few questions we'd like the agent to answer. In the context of this post, let's focus on the following initial high-level functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search relevant work / tickets / conversations to the log. We'll search two systems in parallel to demonstrate how to parallelize work in LangGraph: Jira and GitHub. You might also want to integrate chats like Slack, incident tracking tools, post-mortems, or other relevant tools. If you have access to some company-wide search engine like Glean, you could call that instead of interacting with the different APIs directly.&lt;/li&gt;
&lt;li&gt;Gather and include operational context, such as the pod / container name and deployed version. We can use this information to assess the relevance of the found tickets and discussions.&lt;/li&gt;
&lt;li&gt;Investigate the relevant tickets and conversations, looking for workarounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am going to wrap this into a lean &lt;a href="https://dash.plotly.com/" rel="noopener noreferrer"&gt;Dash&lt;/a&gt; UI that will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvl7on3f21qqj9upceho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvl7on3f21qqj9upceho.png" alt="log investigation UI" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's dive into the implementation details. First, we are going to give a general overview of the architecture and then go into the implementation details of each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining the State
&lt;/h3&gt;

&lt;p&gt;In LangGraph, state is shared among all nodes and passed along the edges. If multiple nodes modify the same property in the state concurrently, you need to define a custom merge function. For our use case, we will store all intermediate results in the state, so that we can show it to the user after the graph completed. This helps in building trust in the agent and checking its reasoning.&lt;/p&gt;

&lt;p&gt;Here is the base definition of our &lt;a href="https://docs.pydantic.dev/latest/" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; model &lt;code&gt;LogInvestigationState&lt;/code&gt;. For now, I will just add the field that holds the raw log text. We will add more fields in the upcoming sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Raw log text as provided by the user
&lt;/span&gt;    &lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Defining the Graph
&lt;/h3&gt;

&lt;p&gt;First, let's define the high level graph architecture. The first node will inspect the provided log and derive search queries for the different systems (Jira and GitHub in our case). We don't want to search for the provided log verbatim as it might be too detailed to yield all relevant results.&lt;/p&gt;

&lt;p&gt;After that, we'll run the search across all APIs. This can be parallelized and there is no LLM needed. We'll use conditional edges with the &lt;a href="https://docs.langchain.com/oss/python/langgraph/graph-api#send" rel="noopener noreferrer"&gt;&lt;code&gt;Send&lt;/code&gt;&lt;/a&gt; functionality, which spawns nodes on-demand for each of the found tickets. We'll then grade each of the ticket based on the high level information such as title and description given the full provided log and operational context to determine its relevancy.&lt;/p&gt;

&lt;p&gt;After filtering out irrelevant tickets, we'll merge the graph back together. Finally, we will gather additional context for the relevant tickets, such as comments, linked PRs, etc. and submit a final LLM call that will produce a report for the user.&lt;/p&gt;

&lt;p&gt;We'll use wrapper functions (prefixed with &lt;code&gt;node_&lt;/code&gt;) that pass required dependencies such as the LLM into the actual node function, as I didn't find a way to inject dependencies such as API clients or LLM clients into the nodes otherwise. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseChatModel&lt;/span&gt;  &lt;span class="c1"&gt;# your ChatWatsonx instance
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node_extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's look at the Python code that builds the graph. The &lt;code&gt;add_node&lt;/code&gt; function takes two arguments: the node name and the wrapper function to execute. LangGraph passes the state (&lt;code&gt;state: LogInvestigationState&lt;/code&gt;) as an argument. &lt;code&gt;add_edge&lt;/code&gt; takes the source and target node name as arguments. &lt;code&gt;add_conditional_edges&lt;/code&gt; takes the source node name, our function that returns a list of &lt;code&gt;Send&lt;/code&gt; objects, telling which target nodes to spawn. It also takes an optional path map argument, which we'll pass so that we can visualize the graph properly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_graph&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Define dependencies and wrapper functions (node_...)
&lt;/span&gt;    &lt;span class="c1"&gt;# ...
&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_search_jira_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_search_github_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade_jira_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_grade_jira_ticket_worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade_github_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_grade_github_ticket_worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_jira_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_aggregate_jira_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_github_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_aggregate_github_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;select_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_select_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;select_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_select_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;edge_dispatch_jira_grading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;grade_jira_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;edge_dispatch_github_grading&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;grade_github_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade_jira_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate_jira_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade_github_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate_github_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_jira_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;select_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_github_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;select_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;select_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;select_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we jump into the implementation of the individual steps, let's review a visual representation of the graph. LangGraph comes with some helper functions to plot the graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;compiled_graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_graph&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;draw_png&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log_investigation_graph.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf0nv1snn7opqfxokpn2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf0nv1snn7opqfxokpn2.png" alt="log investigation graph" width="800" height="919"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the way we split work into separate nodes is a matter of taste to some extent. If a set of nodes are executed sequentially (and not concurrently), and there is no need for memory / checkpointing or other functionalities, we could've merged them into a single node. I decided to keep them separate though as we will be reporting agent progress to the user via the UI. This will be implemented later based on node callbacks. Next, let's look into some of the individual steps in more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Ticket Queries
&lt;/h3&gt;

&lt;p&gt;The first node will prompt the LLM to analyze the log text and come up with queries to search for relevant information in the different systems. In my case, this will be Jira and GitHub. We'll also ask the LLM to create a summary we can use as a title for a new ticket if we decide to create one later. First, let's extend the state to store the result of this step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's come up with a system prompt for the LLM that tells it what to do with the log snippet. Here's what we'll be using:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are given an application log snippet (may include stack traces).&lt;br&gt;
Extract:&lt;br&gt;
1) a substring to pass to a Jira search (JQL textfields ~ "..."),&lt;br&gt;
2) a substring to pass to a GitHub issues search query, and&lt;br&gt;
3) a concise ticket summary suitable as a GitHub issue title.&lt;/p&gt;

&lt;p&gt;Rules for the substrings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Include information from the first line of the log, as it often contains the most relevant information.&lt;/li&gt;
&lt;li&gt;Look for discriminative information that might help find this exact issue. If there is a generic exception and a specific error message, focus on the error message.&lt;/li&gt;
&lt;li&gt;If there is an exception type, include it plus relevant pieces of the message if available. Including only the exception name without any discriminative parts from the error will yield too many results, especially for generic exceptions.&lt;/li&gt;
&lt;li&gt;Exclude variable data (timestamps, request IDs, hostnames, memory addresses, line numbers, absolute paths).&lt;/li&gt;
&lt;li&gt;For Jira, produce a substring to search for in textfields.&lt;/li&gt;
&lt;li&gt;For GitHub, produce a query compatible with the GitHub Search Issues API. Prefer free text phrase(s); qualifiers like in:title or in:body are optional.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rules for the ticket summary/title:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep it succinct (&amp;lt;= 12 words) and clear.&lt;/li&gt;
&lt;li&gt;Focus on the core error/symptom; avoid IDs, timestamps, stack details.&lt;/li&gt;
&lt;li&gt;Do not end with a period.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now we can implement the &lt;code&gt;extract_ticket_queries&lt;/code&gt; function. We are going to use structured output to extract the different fields reliably.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TicketQueries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that you always need to sanitize LLM output, as there is no guarantee that the LLM follows your instructions reliably. This includes quoting characters that might cause issues downstream, as well as shortening results to stay within character limits (e.g. for the title). I omitted this part in the code below for readability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_ticket_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseChatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;
    &lt;span class="n"&gt;system_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;structured_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_structured_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TicketQueries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TicketQueries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structured_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_msg&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;escape_double_quotes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;escape_double_quotes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at an example log text and the output of the LLM. Given the following log text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It might produce the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;TicketQueries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to brainwash tenant Timeout of 5s reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to brainwash tenant Timeout of 5s reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BrainWasher.java: Failed to brainwash tenant - Timeout of 5s reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ticket Search
&lt;/h3&gt;

&lt;p&gt;The code for searching tickets (or other relevant pieces of information such as post mortem documents, chat conversations, etc.) is highly dependent on the respective system / API you are integrating with. For the sake of simplicity, I am going to show only the code for searching Jira tickets using the &lt;a href="https://pypi.org/project/jira/" rel="noopener noreferrer"&gt;&lt;code&gt;jira&lt;/code&gt;&lt;/a&gt; Python package. The code for searching GitHub issues is very similar.&lt;/p&gt;

&lt;p&gt;First, let's extend the state to store the results of the search. We'll need custom types to represent the tickets. While the client libraries come with some prebuilt types such as &lt;a href="https://jira.readthedocs.io/api.html#jira.resources.Issue" rel="noopener noreferrer"&gt;&lt;code&gt;Issue&lt;/code&gt;&lt;/a&gt;, they are not consistent in the fields they have and probably not serializable (at least not in a way LangGraph can handle). Therefore, we are going to define my own types, such as &lt;code&gt;JiraTicket&lt;/code&gt;, which maps the Jira issue data model to our internal data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;from_jira_issue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;jira&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Issue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;permalink&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's extend the &lt;code&gt;LogInvestigationState&lt;/code&gt; to store the final queries (so we can show those to the user later on), as well as the results of the search, which are candidates for being relevant tickets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;jql_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;jira_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_query_substring&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GitHubTicket&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;ticket_summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can implement the &lt;code&gt;search_jira_issues&lt;/code&gt; function. We are going to use a custom JQL query that searches for the provided substring in the text fields of the ticket. We are also going to limit the search to a specific set of projects. You might want to adjust this to your needs, ideally making it configurable via environment variables or similar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_jira_issues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JIRA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;textfield_substring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jql_substring&lt;/span&gt;
    &lt;span class="n"&gt;predicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project in (BW, UI)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;textfields ~ &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;textfield_substring&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;jql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; AND &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; ORDER BY created DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary,description,status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_issues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jql_str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxResults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_jira_issue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jira_candidates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the search results stored in the state, we can investigate each ticket in detail, determining its relevance to the full log line so that we can focus our final investigation only on relevant tickets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevance Scoring
&lt;/h3&gt;

&lt;p&gt;In order to parallelize execution, we will use conditional edges spawning one worker node for each candidate ticket. First, let's define the states we need. We will need a class to hold the relevance result, which will be used for the structured LLM output, too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TicketRelevance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;relevance_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevance score from 0-100, where 0 is completely irrelevant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and 100 is highly relevant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief explanation of why this score was assigned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we'll extend the shared state to store these results for each ticket in a dictionary from ticket key to relevance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other keys defined previously ...
&lt;/span&gt;    &lt;span class="n"&gt;jira_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;jira_ticket_relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TicketRelevance&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;merge_dicts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GitHubTicket&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;github_ticket_relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TicketRelevance&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;merge_dicts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using an &lt;code&gt;Annotated&lt;/code&gt; type to tell LangGraph how to merge the state when multiple nodes modify the same property. In this case, we want to keep all entries in the dictionary, so we'll use a custom &lt;code&gt;merge_dicts&lt;/code&gt; function. This is important because we are going to process all tickets concurrently, updating the shared state with the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge_dicts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;left&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;right&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To spawn the grading worker nodes conditionally, we'll use the &lt;code&gt;Send&lt;/code&gt; API. The conditional edges invoke a dispatcher function that returns a list of &lt;code&gt;Send&lt;/code&gt; objects, each representing a worker node to spawn for the respective ticket. The worker needs to know the log text as well as the ticket information. Here we can also add additional context such as the component name that emitted the log, the deployed version, cluster information or other relevant metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch_jira_grading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;jira_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jira_candidates&lt;/span&gt;
    &lt;span class="n"&gt;log_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade_jira_ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;JiraTicketGradeState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jira_candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual grading will be performed by &lt;code&gt;grace_jira_ticket&lt;/code&gt; which invokes our LLM again. We will use a system prompt (&lt;code&gt;TICKET_RELEVANCE_SYSTEM_PROMPT&lt;/code&gt;) to instruct the model what to do.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Score the ticket from 0-100 based on how likely it is to be related to the log text:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0-19: Completely unrelated&lt;/li&gt;
&lt;li&gt;20-39: Possibly related but very weak connection&lt;/li&gt;
&lt;li&gt;40-59: Moderately related, shares some keywords or concepts&lt;/li&gt;
&lt;li&gt;60-79: Likely related, similar error patterns or context&lt;/li&gt;
&lt;li&gt;80-100: Highly relevant, very similar or identical issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When evaluating relevance, consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Error/Log Message Similarity&lt;/strong&gt;: Does the ticket describe the same or similar error message, exception type, or log pattern?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Matching&lt;/strong&gt;: Does the ticket mention the same specific entities (e.g., names, UUIDs, file paths, service names, configuration keys)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Similarity&lt;/strong&gt;: Does the ticket describe similar circumstances, components, or operations?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Provide 2-3 sentences in the reasoning field explaining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What similarities or differences you found between the log and the ticket&lt;/li&gt;
&lt;li&gt;Whether specific entities or error patterns match&lt;/li&gt;
&lt;li&gt;Why you assigned this particular relevance score&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;We'll prefix the system prompt with information on the source (e.g. that this is a Jira ticket) and provide&lt;br&gt;
 the LLM with a user message that contains the ticket information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grade_jira_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseChatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JiraTicketGradeState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;JiraTicketGradeState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;
    &lt;span class="n"&gt;log_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;

    &lt;span class="n"&gt;system_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are evaluating the relevance of a Jira ticket to a log error or issue.
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TICKET_RELEVANCE_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;user_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log text:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Jira ticket:
Key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Description: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;structured_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_structured_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TicketRelevance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TicketRelevance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structured_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_msg&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JiraTicketGradeState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jira_ticket_relevance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at an example again. Given the following ticket BW-17993, the model might have the following response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hy25ieluxhhb9x949qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hy25ieluxhhb9x949qh.png" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relevance Score&lt;/strong&gt;: 100&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: The log text and Jira ticket describe the same error message, exception type, and log pattern, with the same specific entity (tenant UUID) and similar circumstances (brainwash component exceeding washing timeout). The ticket and log text are almost identical, indicating a high relevance score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After grading each ticket, we'll aggregate the results back into the main state in the &lt;code&gt;aggregate_jira_scores&lt;/code&gt; and &lt;code&gt;aggregate_github_scores&lt;/code&gt; nodes. This follows the map-reduce pattern as designed by LangGraph when using &lt;code&gt;Send&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevance Filtering
&lt;/h3&gt;

&lt;p&gt;The relevance filtering is a trivial step, where we simply go through all tickets and select those with a relevance score above a certain threshold. The selected tickets are stored in new fields &lt;code&gt;relevant_jira_issues&lt;/code&gt; and &lt;code&gt;relevant_github_issues&lt;/code&gt; in the state for further investigation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other keys defined previously ...
&lt;/span&gt;    &lt;span class="n"&gt;relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;JiraTicket&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GitHubIssue&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am not going to show the code here, as it is rather trivial. Now that we have a set of relevant tickets identified, the final step is to analyze all of them in the context of the log line to produce a final report.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Analysis
&lt;/h3&gt;

&lt;p&gt;First, let's extend the state to store the final analysis result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... other keys defined previously ...
&lt;/span&gt;    &lt;span class="n"&gt;analysis_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let's come up with a system prompt. For each relevant ticket, we will fetch some additional information before submitting it to the LLM. This includes things like comments, linked PRs, and so on. Here is the system prompt we are going to use and the node function &lt;code&gt;investigate_relevant_tickets&lt;/code&gt; implementation. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The user is investigating an issue that has been logged by their application.&lt;br&gt;
They will provide the log text and a list of tickets. Each ticket has a key, a title, a body, and a list of comments (optional).&lt;br&gt;
Some tickets might reference other tickets or pull requests.&lt;/p&gt;

&lt;p&gt;Your task is to investigate the relevant tickets and provide a summary of the findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there any workaround available for the given issue? If there is no workaround mentioned, please state that clearly.
Workarounds include, but are not limited to:

&lt;ul&gt;
&lt;li&gt;Configuration changes such as system properties, environment variables, etc.&lt;/li&gt;
&lt;li&gt;Workload changes. If the issue can be prevented by changing the usage pattern, e.g. sending fewer or smaller requests.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Are there any relevant pull requests (PRs) that may contain a fix for the issue? Are they merged?&lt;/li&gt;
&lt;li&gt;Focus only on insights that are relevant to the issue at hand that the SRE / production engineer is facing. Discussions that are unrelated to the operational context should be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As your answer will be embedded into an existing markdown page. If you want to structure your response, please use a simple flat hierarchy with level 5 headings (#####).&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseChatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;jira_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JIRA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Github&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Log text:

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Relevant tickets:
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;most_relevant_jira_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevant_jira_issues&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;most_relevant_github_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevant_github_issues&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# We don't want the LLM to hallucinate any wrong information, so we'd rather skip the analysis
&lt;/span&gt;        &lt;span class="c1"&gt;# without any context.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_add_relevant_jira_issue_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;most_relevant_jira_issues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jira_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_add_relevant_github_issue_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;most_relevant_github_issues&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The helper functions &lt;code&gt;_add_relevant_jira_issue_context&lt;/code&gt; and &lt;code&gt;_add_relevant_github_issue_context&lt;/code&gt; are fetching additional context. This could be moved to different nodes, too, e.g. after the relevance filtering step. In our example, this is what the final user prompt might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Log text:
ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.

Relevant tickets:

## BW-17993

### Title

Brainwash component exceeds washing timeout

### Body

The brainwash component recently failed with the following error:
ERROR [nioEventLoopGroup-5-21] 2025-10-09 06:38:16,740 BrainWasher.java:84 - Failed to brainwash tenant 41d51c90-0c8e-4db6-b0c8-d143007210f0. Timeout of 5s reached.

### Comments

Frank Rosner wrote on 2025-10-09T09:01:00.000+0000: We were able to stop the bleeding by increasing the brainwashing timeout to 10s.
-Dbrainwasher.timeout=10s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! If your tickets contain enough information, the LLM should be able to provide a useful analysis and suggest to increase the timeout. Before we package this into the UI, let's add some progress reporting. This will help the user to understand what is happening under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Progress Reporting
&lt;/h2&gt;

&lt;p&gt;In Dash, we will implement the graph execution as a &lt;a href="https://dash.plotly.com/background-callbacks" rel="noopener noreferrer"&gt;background callback&lt;/a&gt;. Dash callbacks are Python functions that the frontend can invoke on certain triggers, such as the user pressing a button. A background callback will be executed asynchronously, and the client can poll for updates. They can also be cancelled.&lt;/p&gt;

&lt;p&gt;By passing the &lt;code&gt;progress&lt;/code&gt; key to the &lt;code&gt;@callback&lt;/code&gt; decorator we can specify which Dash component properties to update when the callback makes progress. Dash will then inject a &lt;code&gt;set_progress&lt;/code&gt; function into the callback function that we can use to update the progress, e.g. by setting the value of a progress bar. In our case, we are going to use individual indicators for each relevant node, that can either represent the node being pending, started, succeeded, or failed. We can use bootstrap icons and a &lt;a href="https://www.dash-bootstrap-components.com/docs/components/spinner/" rel="noopener noreferrer"&gt;Dash Bootstrap spinner&lt;/a&gt; component to represent those states, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;INDICATOR_PENDING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bi bi-circle me-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;INDICATOR_IN_PROGRESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Spinner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;me-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;INDICATOR_COMPLETED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bi bi-check-circle-fill text-success me-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;INDICATOR_FAILED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bi bi-exclamation-circle-fill text-danger me-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;INDICATOR_SKIPPED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;I&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;className&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bi bi-fast-forward-circle me-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How will we know which state each node is in? We could use &lt;a href="https://docs.langchain.com/oss/python/langgraph/streaming" rel="noopener noreferrer"&gt;state streaming&lt;/a&gt;, but I decided to go for a custom decorator that we can apply to each node function that will invoke a callback with the node name and its current phase. This also adds error handling capabilities which make the graph more resilient. We'll also add an &lt;code&gt;optional&lt;/code&gt; flag that indicates that the node is not critical to the overall execution and can fail without aborting the entire graph. In that case we will not raise the exception but instead return an empty state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;callback_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_wrapped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;bind_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Callback failed for node=%s phase=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;# Before execution
&lt;/span&gt;            &lt;span class="nf"&gt;_safe_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STARTED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# On failure, report FAILURE with the original state
&lt;/span&gt;                &lt;span class="nf"&gt;_safe_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FAILURE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Optional node %s failed; returning empty state: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;

            &lt;span class="c1"&gt;# On success, report SUCCESS with the result
&lt;/span&gt;            &lt;span class="nf"&gt;_safe_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_wrapped&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_decorator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NodePhase&lt;/code&gt; is a custom enum type that represents the different phases a node can be in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;STARTED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STARTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;SUCCESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;FAILURE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAILURE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now use that decorator to wrap our node functions. It can map the node based on the node name passed to the decorator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The caller has to pass a callback that will be used to call `set_progress` in the Dash callback
&lt;/span&gt;&lt;span class="n"&gt;node_callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NodePhase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="nd"&gt;@callback_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node_investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LogInvestigationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_jira_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;jira&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;get_github_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;investigate_relevant_tickets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jira&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works great for regular nodes. Dynamically generated nodes from the &lt;code&gt;Send&lt;/code&gt; API are a bit more tricky. If no nodes are created, the callback will not be invoked. To address this, we can create a custom decorator for the dispatch functions and for the worker nodes which will work together. The dispatching callback marks the worker as started and have each of the worker nodes report their success or failure. I am not going to go into detail here, but feel free to leave a comment if you are curious about the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary and Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we have seen how to utilize LangGraph and Watsonx.ai to build a custom log investigation agent. We created a Dash UI to present the investigation progress and results to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvl7on3f21qqj9upceho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvl7on3f21qqj9upceho.png" alt="log investigation UI" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution we created is already very useful, but of course there is always room for improvement. A few ideas come to mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use more refined search technique (e.g. vector search) to find relevant tickets and conversations&lt;/li&gt;
&lt;li&gt;Add more sources of information&lt;/li&gt;
&lt;li&gt;Add further investigation steps, such as investigating code changes, merged PR dates, releases, etc. to reliably identify whether a fix is rolled out and new regressions.&lt;/li&gt;
&lt;li&gt;Allow users to provide additional context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, this was a fun exercise and my first time working with LangGraph and LangChain. The LangChain toolstack has quite a lot of capabilities I have not explored, yet, though. Have you used any of the LangChain tools before? What was your experience? Let me know in the comments.&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;According to the &lt;a href="https://www.sawmills.ai/observability-report-2025" rel="noopener noreferrer"&gt;Sawmills Observability Report 2025&lt;/a&gt;, logs are the main course of spend in observability, too. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>langchain</category>
      <category>showdev</category>
      <category>genai</category>
      <category>python</category>
    </item>
    <item>
      <title>libmalloc, jemalloc, tcmalloc, mimalloc - Exploring Different Memory Allocators</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Mon, 01 Dec 2025 09:29:16 +0000</pubDate>
      <link>https://dev.to/frosnerd/libmalloc-jemalloc-tcmalloc-mimalloc-exploring-different-memory-allocators-4lp3</link>
      <guid>https://dev.to/frosnerd/libmalloc-jemalloc-tcmalloc-mimalloc-exploring-different-memory-allocators-4lp3</guid>
      <description>&lt;h2&gt;
  
  
  What Are Memory Allocators?
&lt;/h2&gt;

&lt;p&gt;Applications need memory to store data and code during runtime. Memory can be allocated statically (fixed size at compile time), or dynamically (at runtime). Dynamic memory allocation is crucial when the memory size needed varies during program execution, which is the case for most modern applications.&lt;/p&gt;

&lt;p&gt;The stack and the heap are two key memory regions during program execution. Stack allocation is used for function calls and local variables and it happens automatically. The stack allocation lifecycle is tied to the lifecycle of the function. The heap is used for dynamic memory allocation. In non-garbage-collected languages like C++, the programmer is responsible for managing the heap, while in garbage-collected languages like Java, the heap is managed automatically.&lt;/p&gt;

&lt;p&gt;In many cases, heap allocation and de-allocation is implemented via a memory allocator, which implements functions like &lt;code&gt;malloc&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt;. Those generic functions are part of the C standard library, and are implemented by &lt;code&gt;libc&lt;/code&gt;, which on Linux, is &lt;code&gt;glibc&lt;/code&gt; by default for most installations. On MacOS, &lt;code&gt;libmalloc&lt;/code&gt; is the default implementation.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/frosnerd/writing-my-own-dynamic-memory-management-361g"&gt;Writing My Own Dynamic Memory Management&lt;/a&gt; I attempted to write a very simple allocator for my own operating system based on a doubly linked list. The following animation shows how the available heap is managed by the allocator:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/CVXtq77b4Xc"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Efficient memory allocation is a complex problem, especially on modern computer architectures. Modern allocators combine advanced data structures and algorithms to achieve high performance in concurrent environments.&lt;/p&gt;

&lt;p&gt;Especially in performance critical applications, such as databases, web servers, and game engines, the choice of memory allocator can have a significant impact on performance. I wanted to learn more about the different allocators available. In this blog post we are going to compare a few well-known allocators on MacOS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/apple-opensource/libmalloc" rel="noopener noreferrer"&gt;&lt;code&gt;libmalloc&lt;/code&gt;&lt;/a&gt; - The default allocator on MacOS, developed by Apple.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/jemalloc/jemalloc" rel="noopener noreferrer"&gt;&lt;code&gt;jemalloc&lt;/code&gt;&lt;/a&gt; - Created by Jason Evans originally for FreeBSD to address fragmentation and scaling issues, jemalloc is a scalable allocator widely adopted in performance-critical applications including Firefox and Facebook.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/google/tcmalloc" rel="noopener noreferrer"&gt;&lt;code&gt;tcmalloc&lt;/code&gt;&lt;/a&gt; - Developed by Google as part of the &lt;a href="https://github.com/gperftools/gperftools" rel="noopener noreferrer"&gt;Google Performance Tools&lt;/a&gt; to enhance multithreaded allocation speed and reduce lock contention using thread-local or core-local caches.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/microsoft/mimalloc" rel="noopener noreferrer"&gt;&lt;code&gt;mimalloc&lt;/code&gt;&lt;/a&gt; - Developed by Microsoft Research as a modern general-purpose allocator, focusing on locality and reducing contention with innovations like page-local free lists and free list sharding for performance gains.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/emeryberger/Hoard" rel="noopener noreferrer"&gt;&lt;code&gt;hoard&lt;/code&gt;&lt;/a&gt; - Designed by Emery Berger and his team at the University of Massachusetts to reduce memory fragmentation and contention in multithreaded systems by partitioning heaps per thread, introduced in the early 2000s as a research-driven allocator.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Allocator Architectures
&lt;/h2&gt;

&lt;p&gt;All modern memory allocators share common architectural concepts to manage dynamic memory efficiently and safely. Allocation requests can come in different sizes, ranging from a few bytes to megabytes or even gigabytes. Allocators need to be equipped with strategies to handle different allocation sizes with minimal overhead and fragmentation. Commonly this is achieved by using some form of segmentation based on the requested size.&lt;/p&gt;

&lt;p&gt;Allocators need to track the state of allocated and free memory. This is often done by using data structures that keeps track of the state of each memory block. Metadata can be tracked externally, in a separate data structure, or internally within the block, or a combination of both.&lt;/p&gt;

&lt;p&gt;In multithreaded environments, concurrency control is necessary to ensure safety when allocating and deallocating memory. Synchronization negatively impacts performance, however, so modern allocators use various techniques to minimize synchronization overhead, e.g. by using thread-local data structures and even entire heap regions. Of course, these techniques come with additional memory overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What to Compare?
&lt;/h3&gt;

&lt;p&gt;While the interface looks simple, the implementations of those allocators differ significantly. Different allocators have different performance characteristics, and are better suited for different workloads and computer architectures. When comparing allocators, there are several key performance indicators (KPIs) to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; (ops/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; - (sec/op)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory usage&lt;/strong&gt; - (overhead and fragmentation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling and Usability&lt;/strong&gt; (debugging, profiling, leak checking, ...)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance and Security&lt;/strong&gt; (CVEs, security hardening, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workload (allocation size, frequency, number of threads, etc.) impacts these KPIs, so it is important to benchmark your specific workload. While there are also platform restrictions that might limit your choice (e.g. &lt;code&gt;libmalloc&lt;/code&gt; is only available on Apple operating systems), I will ignore the platform limitations in the comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarking Setup
&lt;/h3&gt;

&lt;p&gt;I'm running the benchmarks using Google Benchmark &lt;code&gt;v1.9.4&lt;/code&gt; on my Nov 2023 MacBook Pro (M3) with MacOS &lt;code&gt;15.6.1&lt;/code&gt;, compiled with Apple &lt;code&gt;clang-1700.0.13.5&lt;/code&gt;. You can find the &lt;a href="https://github.com/FRosner/malloc-post" rel="noopener noreferrer"&gt;source code&lt;/a&gt; on GitHub. &lt;/p&gt;

&lt;p&gt;While &lt;code&gt;libmalloc&lt;/code&gt; is the default allocator on MacOS and part of &lt;code&gt;libSystem&lt;/code&gt;, the other allocators are going to be installed via &lt;code&gt;brew&lt;/code&gt;. Note that &lt;code&gt;tcmalloc&lt;/code&gt; is part of &lt;code&gt;gperftools&lt;/code&gt;, and &lt;code&gt;libhoard&lt;/code&gt; is a custom tap (&lt;code&gt;brew tap emeryberger/hoard&lt;/code&gt;). Here are the versions I am using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# otool -L build/malloc-post-benchmark-libmalloc&lt;/span&gt;
/usr/lib/libSystem.B.dylib &lt;span class="o"&gt;(&lt;/span&gt;current version 1351.0.0&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# brew info jemalloc | grep Cellar&lt;/span&gt;
/opt/homebrew/Cellar/jemalloc/5.3.0

&lt;span class="c"&gt;# brew info gperftools | grep Cellar&lt;/span&gt;
/opt/homebrew/Cellar/gperftools/2.17.2

&lt;span class="c"&gt;# brew info mimalloc | grep Cellar&lt;/span&gt;
/opt/homebrew/Cellar/mimalloc/3.1.5

&lt;span class="c"&gt;# brew info emeryberger/hoard/libhoard | grep Cellar&lt;/span&gt;
/opt/homebrew/Cellar/libhoard/HEAD-5a7073f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am using CMake to build the benchmark binaries for each allocator. The gist of the CMakeLists.txt is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cmake"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;MALLOC_IMPLEMENTATIONS jemalloc mimalloc hoard tcmalloc&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;foreach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;MALLOC &lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC_IMPLEMENTATIONS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;find_library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;_LIBRARY &lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;EXE_NAME &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PROJECT_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-benchmark-&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;add_executable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EXE_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; src/main.cpp&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;target_link_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EXE_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; PRIVATE benchmark::benchmark pthread&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;target_link_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;EXE_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; PRIVATE &lt;span class="si"&gt;${${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="nv"&gt;_LIBRARY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;endforeach&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we cannot actively "reset" the allocator between each benchmark run. To avoid interactions between runs, we'll use a bash script to run the individual benchmarks in a loop. Thanks to the &lt;code&gt;--benchmark_filter&lt;/code&gt; command line option and the way Google Benchmark builds benchmark names, we can loop over different parameters for a given benchmark, restarting the binary after each run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;run_allocation_throughput_benchmark&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Running allocation throughput benchmark for &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; with &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; size, &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;threads&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; threads"&lt;/span&gt;
  &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;executable&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--benchmark_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"BM_AllocationThroughput/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/iterations:1000/threads:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;threads&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--benchmark_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"results/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_AllocationThroughput_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;threads&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nv"&gt;executable_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./build/malloc-post-benchmark-"&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;executable &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;executable_prefix&lt;/span&gt;&lt;span class="k"&gt;}*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;MALLOC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;#./build/malloc-post-benchmark-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;threads &lt;span class="k"&gt;in &lt;/span&gt;1 2 4 8&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    for &lt;/span&gt;size &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..22&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
      &lt;/span&gt;run_allocation_throughput_benchmark &lt;span class="k"&gt;$((&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;size&lt;span class="k"&gt;))&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;threads&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;done
  done
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are storing the results in JSON files, which we combine, analyze and visualize using &lt;a href="https://matplotlib.org/" rel="noopener noreferrer"&gt;matplotlib&lt;/a&gt; in Python. Here's the structure of a benchmark result file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-11-26T14:00:48+01:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"host_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyMacBook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"executable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./build/malloc-post-benchmark-hoard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"num_cpus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mhz_per_cpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpu_scaling_enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"caches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"num_sharing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Instruction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;131072&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"num_sharing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unified"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4194304&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"num_sharing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"load_avg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;2.55762&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.97754&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;3.83203&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"library_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1.9.4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"library_build_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"json_schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"benchmarks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BM_AllocationThroughput/2/iterations:1000/threads:1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"family_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"per_family_instance_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"run_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BM_AllocationThroughput/2/iterations:1000/threads:1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"run_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"repetitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"repetition_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threads"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iterations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"real_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5.1424708217382431e+04&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cpu_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5.1425000000000051e+04&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"time_unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ns"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"items_per_second"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.9445794846864346e+07&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now with the setup in place, let's look into the different KPIs in greater detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;p&gt;To measure throughput, we will design a benchmark that within each iteration, allocates memory of a given size for a fixed number of pointers (1000), then frees and reallocates memory for 1000 of these pointers at random, and finally frees all pointers. This yields a total of 2000 memory allocations and frees per iteration. For the throughput counter &lt;code&gt;SetItemsProcessed&lt;/code&gt; we treat two &lt;code&gt;malloc&lt;/code&gt; plus two &lt;code&gt;free&lt;/code&gt; calls as one "item".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;BM_AllocationThroughput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;mt19937&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kr"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{}(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;this_thread&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;get_id&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uniform_int_distribution&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SkipWithError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"malloc failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DoNotOptimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
            &lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetItemsProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then run the benchmark for different allocation sizes and different number of threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;BENCHMARK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BM_AllocationThroughput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;RangeMultiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Iterations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Threads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We expect the allocation throughput per thread to decrease with increased parallelism due to the increased synchronization overhead. When plotting the throughput (average "items processed" per second per thread) for allocation sizes of 1KB, we can see that the throughput decreases across the board:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_throughput_per_thread_%281kb%29_implementation_threads_items_per_second_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_throughput_per_thread_%281kb%29_implementation_threads_items_per_second_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see that &lt;code&gt;hoard&lt;/code&gt; has the highest throughput, more than 2x of what &lt;code&gt;mimalloc&lt;/code&gt; achieves. This is only one data point, however, as we were looking at 1KB allocations. Let's look at the throughput for different allocation sizes and different number of threads:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F365pl8wy4zbbkeynfgtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F365pl8wy4zbbkeynfgtq.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the different allocators have vastly different throughput characteristics across the different workloads. While both &lt;code&gt;hoard&lt;/code&gt; and &lt;code&gt;mimalloc&lt;/code&gt; perform very well for small allocations, their throughput decreases rapidly for allocations &amp;gt; 1KB. &lt;code&gt;tcmalloc&lt;/code&gt; takes the lead for allocations &amp;gt; 1KB and maintains a steady throughput up to 32KB (2&lt;sup&gt;15&lt;/sup&gt; bytes). &lt;code&gt;jemalloc&lt;/code&gt; has the lowest throughput for smaller allocation sizes, but maintains a decent throughput especially with increased parallelism compared to &lt;code&gt;mimalloc&lt;/code&gt;, &lt;code&gt;hoard&lt;/code&gt;, and &lt;code&gt;libmalloc&lt;/code&gt;. In very large allocations, only &lt;code&gt;tcmalloc&lt;/code&gt; and &lt;code&gt;jemalloc&lt;/code&gt; remain competitive, with &lt;code&gt;tcmalloc&lt;/code&gt; maintaining 50x of the throughput of &lt;code&gt;libmalloc&lt;/code&gt; at 4MB allocations.&lt;/p&gt;

&lt;p&gt;The sharp drop in throughput for larger allocations in the different allocators can be explained by the way they handle them internally. &lt;code&gt;tcmalloc&lt;/code&gt; for example handles small allocations within the per-CPU caches in the front-end, while larger allocations have to go through the central free list, increasing lock contention (see architecture diagram below, taken from the &lt;code&gt;tcmalloc&lt;/code&gt; &lt;a href="https://google.github.io/tcmalloc/design.html" rel="noopener noreferrer"&gt;design documentation&lt;/a&gt;). The thresholds depend on the page size and can be viewed in the &lt;a href="https://github.com/google/tcmalloc/blob/master/tcmalloc/size_classes.cc" rel="noopener noreferrer"&gt;size class definitions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff19yo4kls2ccfkwazeov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff19yo4kls2ccfkwazeov.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, let's take a look at the latency of the different allocators.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;For the latency benchmark, I was interested in the latency of the &lt;code&gt;malloc&lt;/code&gt; call. To measure that, I used manual timing, measuring only the time spent in the &lt;code&gt;malloc&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;BM_AllocationLatency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PauseTiming&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;chrono&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;high_resolution_clock&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;chrono&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;high_resolution_clock&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SkipWithError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"malloc failed"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;chrono&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;duration_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;chrono&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetIterationTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

        &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResumeTiming&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the latency difference between small and large allocations is in orders of magnitude, I will plot small (&amp;lt;= 1KB) and larger (&amp;gt; 1KB) results separately:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_latency_%28small%29_implementation_size_real_time_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_latency_%28small%29_implementation_size_real_time_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that all allocators perform small allocations within 20-30ns, except for &lt;code&gt;tcmalloc&lt;/code&gt; in the face of a larger amount of threads and very small (&amp;lt;= 64B) allocation sizes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_latency_%28large%29_implementation_size_real_time_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_latency_%28large%29_implementation_size_real_time_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For larger allocations in a single-threaded environment, all allocators perform reasonably well. Only &lt;code&gt;mimalloc&lt;/code&gt; starts to experience a significant latency increase for allocations &amp;gt; 64KB. When multiple threads come into play, hoard experiences a significant latency increase for allocations between 2KB and 32KB and &lt;code&gt;tcmalloc&lt;/code&gt; also starts to see a significant latency increase for allocations &amp;gt; 262KB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Usage
&lt;/h3&gt;

&lt;p&gt;When it comes to memory usage, we mainly care about two types of overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The allocation overhead when the allocation size is not aligned with the internal page size&lt;/li&gt;
&lt;li&gt;The bookkeeping / synchronization overhead (can be per thread, per core, per pointer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let's investigate the allocation overhead. We can use the &lt;a href="https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/malloc_size.3.html" rel="noopener noreferrer"&gt;&lt;code&gt;malloc_size&lt;/code&gt;&lt;/a&gt; function that is part of the &lt;code&gt;malloc&lt;/code&gt; interface on MacOS to determine the actual size of the allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;overhead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sz&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, for tiny allocations, the overhead is very high (up to 1500% when allocating 1B with &lt;code&gt;libmalloc&lt;/code&gt;). If your application is very memory constrained, &lt;code&gt;tcmalloc&lt;/code&gt; has a "small-but-slow" mode that can be used when memory footprint minimization is more important than performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_overhead_%28tiny%29_implementation_size_overhead_percent_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_overhead_%28tiny%29_implementation_size_overhead_percent_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting from 16B, all allocators reach a reasonable overhead &amp;lt;= 100%. Let's look at the overhead for larger allocations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_overhead_%28regular%29_implementation_size_overhead_percent_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fallocation_overhead_%28regular%29_implementation_size_overhead_percent_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see the allocation overhead varies a lot between implementations. While &lt;code&gt;jemalloc&lt;/code&gt;, &lt;code&gt;mimalloc&lt;/code&gt;, and &lt;code&gt;tcmalloc&lt;/code&gt; manage to keep overhead below 30% for most allocation sizes, &lt;code&gt;libmalloc&lt;/code&gt; and &lt;code&gt;hoard&lt;/code&gt; have a much higher overhead. &lt;code&gt;hoard&lt;/code&gt; continuously reaches 100% overhead for allocations &amp;lt;= 32KB. &lt;code&gt;libmalloc&lt;/code&gt; has peaks at 33B, ~1KB, and 32KB+1B.&lt;/p&gt;

&lt;p&gt;In practice, to reduce waste, you should aim for allocations that are powers of two or at least aligned with the page size. On MacOS, you can use &lt;code&gt;malloc_good_size&lt;/code&gt; to get the closest size that will not waste space, but it is not available on Linux. You can also prefer fewer, larger, long-lived buffers over many tiny allocations.&lt;/p&gt;

&lt;p&gt;The graphs also reveal information about the internal thresholds related to sizes. E.g. on &lt;code&gt;libmalloc&lt;/code&gt;, the &lt;a href="https://github.com/apple-oss-distributions/libmalloc/blob/d876784c79e2869ff1cce519f46905c49117f9a6/src/thresholds.h" rel="noopener noreferrer"&gt;zone thresholds&lt;/a&gt; align with the peaks in the graph:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TINY zone handles allocations up to 1008 bytes (~ 2&lt;sup&gt;10&lt;/sup&gt; bytes).&lt;/li&gt;
&lt;li&gt;SMALL zone handles allocations from above TINY up to 32 KB (2&lt;sup&gt;15&lt;/sup&gt; bytes).&lt;/li&gt;
&lt;li&gt;MEDIUM zone handles allocations from above SMALL up to 8 MB (2&lt;sup&gt;23&lt;/sup&gt; bytes).&lt;/li&gt;
&lt;li&gt;LARGE zone handles allocations beyond the MEDIUM threshold.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to the overhead per allocation, there is also bookkeeping overhead. Since measuring this accurately is more involved, I decided to write a simple program and measure the resident set size (RSS) of the process over time. On MacOS, we can use the Mach API (&lt;code&gt;mach/mach.h&lt;/code&gt;), specifically the &lt;a href="https://developer.apple.com/documentation/kernel/1537934-task_info" rel="noopener noreferrer"&gt;&lt;code&gt;task_info&lt;/code&gt;&lt;/a&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="nf"&gt;get_rss_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;mach_task_basic_info&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;mach_msg_type_number_t&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MACH_TASK_BASIC_INFO_COUNT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach_task_self&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;MACH_TASK_BASIC_INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_info_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;KERN_SUCCESS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resident_size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we design a "worker" function that we can launch in a thread. It will spin until an external atomic flag indicates it to stop. First it will fill a vector of pointers with allocated memory, filled with 0s. Once all pointers are filled, it randomly selects a pointer to free and reallocate, simulating a workload. Before stopping, it frees all pointers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;should_stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;worker_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;num_pointers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;pointer_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_pointers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;should_stop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;num_pointers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pointer_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;memset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointer_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push_back&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
            &lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pointer_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;memset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointer_size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pointers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the main method, we can submit this function to a given number of threads, passing a given number of pointers and pointer size. We are going to capture the RSS size in the main thread before launching the threads, every 1 second while the threads are running, and after stopping all threads.&lt;/p&gt;

&lt;p&gt;The following graph plots the RSS size of the program over time for different allocators, running for 1 seconds with 1000 pointers and an allocation size of 1KB per pointer in 1 thread:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Frss_usage_over_time_%281_thread%2C_1000_x_1kb_allocations%29_implementation_seconds_rss_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Frss_usage_over_time_%281_thread%2C_1000_x_1kb_allocations%29_implementation_seconds_rss_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we can see that all allocators except &lt;code&gt;libmalloc&lt;/code&gt; reach a stable RSS size immediately after the first measurement. The increase from the starting size to the stable size corresponds to the total size of allocated memory (1000 * 1KB = 1MB). You can also note that &lt;code&gt;jemalloc&lt;/code&gt; is the only allocator dropping back to the starting usage after stopping the threads.&lt;/p&gt;

&lt;p&gt;The starting memory usage differs significantly between allocators. &lt;code&gt;libmalloc&lt;/code&gt; and &lt;code&gt;mimalloc&lt;/code&gt; both consume less than 4MB, while &lt;code&gt;hoard&lt;/code&gt; and &lt;code&gt;jemalloc&lt;/code&gt; consume ~9MB and &lt;code&gt;tcmalloc&lt;/code&gt; is the most hungry one with ~13MB. However, we can see that &lt;code&gt;libmalloc&lt;/code&gt; reaches a stable size of ~13MB after 5 seconds as well.&lt;/p&gt;

&lt;p&gt;Next, let's investigate the RSS usage for an increasing allocation size. When looking at the stable RSS size (1 second before the end) for each allocator, using 1KB allocations with a varying number of pointers and 4 threads, we can see that &lt;code&gt;libmalloc&lt;/code&gt; indeed behaves differently than all the other allocators.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fmax_rss_usage_%284_threads%2C_1kb_allocations%29_implementation_pointers_rss_results.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FFRosner%2Fmalloc-post%2Fblob%2Fmain%2Fplots%2Fmax_rss_usage_%284_threads%2C_1kb_allocations%29_implementation_pointers_rss_results.png%3Fraw%3Dtrue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the RSS size for all other allocators increases linearly with the allocated memory, &lt;code&gt;libmalloc&lt;/code&gt; appears to consume much more memory than being allocated. A similar issue has been observed with the &lt;code&gt;glibc&lt;/code&gt; default allocator on Linux and RocksDB, where the RSS was 3x higher compared to &lt;code&gt;jemalloc&lt;/code&gt; (see &lt;a href="https://smalldatum.blogspot.com/2025/04/battle-of-mallocators.html" rel="noopener noreferrer"&gt;Battle of the Mallocators&lt;/a&gt; for more details).&lt;/p&gt;

&lt;h3&gt;
  
  
  Tooling and Usability
&lt;/h3&gt;

&lt;p&gt;For most applications, using the default allocator with the default settings is good enough. For some applications, you might want to switch to a different allocator. However, there are also very specialized applications, that either have very specific performance requirements, or are memory constrained. For those, the default settings might not be the best. Additionally, you might want to debug your allocation workload, e.g. by collecting and inspecting allocation statistics. Let's look into the tooling and configuration options for each allocator.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;libmalloc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;libmalloc&lt;/code&gt; allows some debugging configuration via environment variables, but there is no programmatic API or compile-time options, and little to no documented tuning options.&lt;/p&gt;

&lt;p&gt;The tooling for &lt;code&gt;libmalloc&lt;/code&gt; is tightly coupled to the MacOS tooling. It supports features such as mapping allocation addresses to call stack when &lt;code&gt;MallocStackLogging&lt;/code&gt; is enabled, heap integrity checking via &lt;code&gt;MallocCheckHeapStart&lt;/code&gt;, and a few other options.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;jemalloc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;jemalloc&lt;/code&gt; has three configuration mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Environment variables. Via &lt;code&gt;MALLOC_CONF&lt;/code&gt;, you can control nearly every aspec of the allocator. Example: &lt;code&gt;MALLOC_CONF="prof:true,lg_prof_sample:19,prof_prefix:jeprof.out,narenas:4,dirty_decay_ms:5000"&lt;/code&gt; will enable heap profiling, sample every 512KB (2&lt;sup&gt;19&lt;/sup&gt;B), write the profile to &lt;code&gt;jeprof.out&lt;/code&gt;, use 4 arenas, and decay dirty pages after 5 seconds.&lt;/li&gt;
&lt;li&gt;Programmatic API. You can use the &lt;code&gt;mallctl&lt;/code&gt; interface to change the settings at runtime without restarting.&lt;/li&gt;
&lt;li&gt;Compile-time options. Configuration can be baked via &lt;code&gt;--with-malloc-conf&lt;/code&gt; or the &lt;code&gt;malloc_conf&lt;/code&gt; global variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In terms of debugging and profiling, &lt;code&gt;jemalloc&lt;/code&gt; has a wide variety of features. The main tool is &lt;code&gt;jeprof&lt;/code&gt;, that analyzes heap dumps and generates flame graphs. By enabling the &lt;code&gt;prof_leak&lt;/code&gt; option, allocations without matching &lt;code&gt;free&lt;/code&gt; calls are reported.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MALLOC_CONF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"prof:true,prof_prefix:jeprof.out,lg_prof_interval:5"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  build/malloc-post-rss-jemalloc 4 1000 1024 10
jeprof &lt;span class="nt"&gt;--pdf&lt;/span&gt; build/malloc-post-rss-jemalloc jeprof.out.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that in order to enable the profiling hooks, &lt;code&gt;jemalloc&lt;/code&gt; needs to be configured with &lt;a href="https://developer.mantidproject.org/ProfilingWithJemalloc.html" rel="noopener noreferrer"&gt;&lt;code&gt;--enable-prof&lt;/code&gt;&lt;/a&gt;, which is not the case when installing it via homebrew. I was not able to compile it from source within a reasonable time frame, so I am not able to show the results here. But the features and presentation are very similar to what &lt;code&gt;tcmalloc&lt;/code&gt; has to offer.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;tcmalloc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;tcmalloc&lt;/code&gt; also has three configuration mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Environment variables. Via different &lt;code&gt;TCMALLOC_*&lt;/code&gt; variables, you can configure things like release rates, size thresholds, limiting the heap size, etc.&lt;/li&gt;
&lt;li&gt;Programmatic API. You can use the &lt;code&gt;MallocExtension&lt;/code&gt; APIs to query and modify allocation parameters at runtime. Example: &lt;code&gt;MallocExtension::instance()-&amp;gt;SetNumericProperty("tcmalloc.max_per_cpu_cache_size", 16777216);&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Compile-time options. Fundamental behaviour can be selected via preprocessor flags like &lt;code&gt;-DTCMALLOC_INTERNAL_SMALL_BUT_SLOW&lt;/code&gt; to turn on the small but slow allocator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In terms of tooling, &lt;code&gt;tcmalloc&lt;/code&gt; is part of the Google Performance Tools (&lt;code&gt;gperftools&lt;/code&gt;). The main relevant tool is &lt;code&gt;pprof&lt;/code&gt;, which is similar to &lt;code&gt;jeprof&lt;/code&gt; and used for analyzing heap profiles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HEAPPROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;malloc-post-tcmalloc.hprof &lt;span class="se"&gt;\&lt;/span&gt;
  build/malloc-post-rss-tcmalloc 4 1000 1024 1

pprof &lt;span class="nt"&gt;-http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost:8080 build/malloc-post-rss-tcmalloc &lt;span class="se"&gt;\&lt;/span&gt;
  malloc-post-tcmalloc.hprof.0001.heap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It has multiple output formats, including a comprehensive Web UI, that can show flame graphs, call graphs, or a cumulated view (top):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flat    Flat%   Sum%    Cum Cum%    Name
3.91MB  99.22%  99.22%  3.91MB  99.22%  [libsystem_malloc.dylib]    
0.03MB  00.78%  99.99%  0.03MB  00.78%  std::__1::__libcpp_operator_new[abi:ne190102]   
0       00.00%  99.99%  3.94MB  99.89%  worker_thread   
0       00.00%  99.99%  0.03MB  00.78%  std::__1::vector::reserve   
0       00.00%  99.99%  0.03MB  00.78%  std::__1::allocator::allocate[abi:ne190102] 
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__thread_proxy[abi:ne190102]  
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__thread_execute[abi:ne190102]    
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__split_buffer::__split_buffer    
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__libcpp_allocate[abi:ne190102]   
0       00.00%  99.99%  3.94MB  99.89%  std::__1::__invoke[abi:ne190102]    
0       00.00%  99.99%  0.03MB  00.78%  std::__1::__allocate_at_least[abi:ne190102] 
0       00.00%  99.99%  3.94MB  99.89%  [libsystem_pthread.dylib]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph view shows the allocation size (1KB), too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvs0fminsnsi5htfavp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvs0fminsnsi5htfavp2.png" alt="pprof graph view"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tcmalloc&lt;/code&gt; also supports summary statistics using &lt;code&gt;MALLOCSTATS=1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MALLOCSTATS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 build/malloc-post-rss-tcmalloc 4 1000 1024 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;------------------------------------------------
MALLOC:          20480 (    0.0 MiB) Bytes in use by application
MALLOC: +      3801088 (    3.6 MiB) Bytes in page heap freelist
MALLOC: +       372600 (    0.4 MiB) Bytes in central cache freelist
MALLOC: +      1048576 (    1.0 MiB) Bytes in transfer cache freelist
MALLOC: +          136 (    0.0 MiB) Bytes in thread cache freelists
MALLOC: +      2621504 (    2.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =      7864384 (    7.5 MiB) Actual memory used (physical + swap)
MALLOC: +            0 (    0.0 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =      7864384 (    7.5 MiB) Virtual address space used
MALLOC:
MALLOC:            141              Spans in use
MALLOC:              1              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;mimalloc&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Similar to &lt;code&gt;libmalloc&lt;/code&gt;, &lt;code&gt;mimalloc&lt;/code&gt; supports some basic environment variable configuration. In contrast to &lt;code&gt;libmalloc&lt;/code&gt;, it does support some performance related configuration. &lt;/p&gt;

&lt;p&gt;It also has a &lt;code&gt;mi_option_set&lt;/code&gt; programmatic API but the options are less fine-grained than &lt;code&gt;jemalloc&lt;/code&gt; or &lt;code&gt;tcmalloc&lt;/code&gt;, reflecting the philosophy of sensible defaults over exhaustive tunability.&lt;/p&gt;

&lt;p&gt;Tooling support for &lt;code&gt;mimalloc&lt;/code&gt; is smaller compared to &lt;code&gt;jemalloc&lt;/code&gt; and &lt;code&gt;tcmalloc&lt;/code&gt;. It does not include a built-in sampling profiler, so you have to rely on external tools such as Valgrind. You can pass &lt;code&gt;MIMALLOC_SHOW_STATS&lt;/code&gt; to get some basic statistics though.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MIMALLOC_SHOW_STATS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 build/malloc-post-rss-mimalloc 4 1000 1024 1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;heap stats:     peak       total     current       block      total#   
  reserved:     1.0 GiB     1.0 GiB     1.0 GiB                          
 committed:    11.1 MiB    11.5 MiB    11.1 MiB                          
     reset:     0      
    purged:     0      
   touched:     0           0           0                                ok
     pages:    68          68           0                                ok
-abandoned:    13.6 Ki     17.3 Mi      0                                ok
 -reclaima:     0      
 -reclaimf:    17.3 Mi 
-reabandon:     0      
    -waits:     0      
 -extended:     0      
   -retire:     0      
    arenas:     1      
 -rollback:     0      
     mmaps:    17      
   commits:     0      
    resets:     0      
    purges:     0      
   guarded:     0      
   threads:     4           4           0                                ok
  searches:     1.0 avg
numa nodes:     1
   elapsed:     2.011 s
   process: user: 4.010 s, system: 0.030 s, faults: 94, rss: 6.7 MiB, commit: 11.1 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;code&gt;hoard&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;hoard&lt;/code&gt; does not appear to have any configuration options or specific profiling tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance and Security
&lt;/h3&gt;

&lt;p&gt;Apple's &lt;code&gt;libmalloc&lt;/code&gt; has sophisticated security features such as kalloc_type and xzone malloc. While it has the highest CVE count&lt;sup id="fnref1"&gt;1&lt;/sup&gt;, I believe this is due to extensive security research on Apple platforms.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;jemalloc&lt;/code&gt; and &lt;code&gt;tcmalloc&lt;/code&gt; prioritize performance over security hardening, with minimal built-in protections. They have a handful of historical CVEs (&lt;code&gt;jemalloc&lt;/code&gt;&lt;sup id="fnref2"&gt;2&lt;/sup&gt;, &lt;code&gt;tcmalloc&lt;/code&gt;&lt;sup id="fnref3"&gt;3&lt;/sup&gt;) reported, which are all patched in recent versions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mimalloc&lt;/code&gt; offers the most comprehensive configurable security mode with guard pages, encrypted free lists, and randomization at ~10% performance cost. It has no core CVEs reported, only one minor advisory for the rust crate.&lt;sup id="fnref4"&gt;4&lt;/sup&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;hoard&lt;/code&gt; has the weakest security posture with documented overflow vulnerabilities such as multiple overflow vulnerabilities and no hardening features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary and Conclusion
&lt;/h2&gt;

&lt;p&gt;Based on the findings and my limited experience with using the allocators when developing this post, I came up with the following comparison table. I will leave out &lt;code&gt;hoard&lt;/code&gt;, because I don't think it's a practical choice for any real world application.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;libmalloc&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;tcmalloc&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;jemalloc&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;mimalloc&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐☆☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐☆☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐☆☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;td&gt;⭐☆☆☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐☆☆☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tooling and Usability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance and Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐☆☆&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall&lt;/td&gt;
&lt;td&gt;16/25 ⭐&lt;/td&gt;
&lt;td&gt;20/25 ⭐&lt;/td&gt;
&lt;td&gt;18/25 ⭐&lt;/td&gt;
&lt;td&gt;15/25 ⭐&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On all Apple operating systems, &lt;code&gt;libmalloc&lt;/code&gt; is the default choice. The main focus is on security, with decent performance for most workloads. Given the high memory overhead however, it might not be a good fit for performance critical applications with high allocation rates.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tcmalloc&lt;/code&gt; and &lt;code&gt;jemalloc&lt;/code&gt; have the most stable performance characteristics across different allocation sizes and number of threads. The throughput remains reasonably high even at large allocation sizes, while latency remains within acceptable boundaries, with &lt;code&gt;jemalloc&lt;/code&gt; winning in latency and &lt;code&gt;tcmalloc&lt;/code&gt; winning in throughput for very large allocations. Among the ones I tested, those would be my preferred choice for applications with high performance requirements. They also have extensive configuration options that allows workload specific tuning. Note, however, that &lt;code&gt;jemalloc&lt;/code&gt; appears to &lt;a href="https://jasone.github.io/2025/06/12/jemalloc-postmortem/" rel="noopener noreferrer"&gt;somewhat dead&lt;/a&gt; as of 2025, so I'd rather go with &lt;code&gt;tcmalloc&lt;/code&gt; for any new project.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mimalloc&lt;/code&gt; works well for smaller allocations, but suffers in both latency and throughput for larger allocations. The advanced security features might be a unique selling point for some users though.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;hoard&lt;/code&gt; shines in certain areas, it does not appear to be a good choice, as it does not have steady performance characteristics across different allocation sizes and number of threads. It has severe security flaws and is not actively maintained.&lt;/p&gt;

&lt;p&gt;Did you ever swap out the default allocator in your application? What was your experience? Let me know in the comments below!&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.cve.org/CVERecord?id=CVE-2015-5889" rel="noopener noreferrer"&gt;CVE-2015-5889&lt;/a&gt;, &lt;a href="https://www.cve.org/CVERecord?id=CVE-2018-4433" rel="noopener noreferrer"&gt;CVE-2018-4433&lt;/a&gt;, &lt;a href="https://www.cve.org/CVERecord?id=CVE-2023-32428" rel="noopener noreferrer"&gt;CVE-2023-32428&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.cve.org/CVERecord?id=CVE-2007-6754" rel="noopener noreferrer"&gt;CVE-2007-6754&lt;/a&gt;, &lt;a href="https://www.cve.org/CVERecord?id=CVE-2006-7252" rel="noopener noreferrer"&gt;CVE-2006-7252&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://www.cve.org/CVERecord?id=CVE-2005-4895" rel="noopener noreferrer"&gt;CVE-2005-4895&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://rustsec.org/advisories/RUSTSEC-2022-0094.html" rel="noopener noreferrer"&gt;RUSTSEC-2022-0094&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>cpp</category>
      <category>benchmarking</category>
      <category>memory</category>
      <category>performance</category>
    </item>
    <item>
      <title>Comparing OpenBLAS and Accelerate on Apple Silicon for BLAS Routines</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Tue, 18 Nov 2025 14:58:12 +0000</pubDate>
      <link>https://dev.to/frosnerd/comparing-openblas-and-accelerate-on-apple-silicon-for-blas-routines-2pb9</link>
      <guid>https://dev.to/frosnerd/comparing-openblas-and-accelerate-on-apple-silicon-for-blas-routines-2pb9</guid>
      <description>&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;Many real world applications such as machine learning, scientific computing, data compression, computer graphics and video processing require linear algebra operations. Tensors (mostly vectors, which are 1-dimensional tensors, and matrices, which are 2-dimensional tensors) are the primary data structures used to represent data in these applications.&lt;/p&gt;

&lt;p&gt;Writing software is hard. Writing correct, performant, secure, reliable, etc., software is even harder. This is why most linear algebra operations are expressed in terms of Basic Linear Algebra Subprograms (BLAS). Common BLAS routines are vector addition, scalar multiplication, dot products, linear combinations, matrix-vector multiplication, and matrix-matrix multiplication.&lt;/p&gt;

&lt;p&gt;Similarly to the saying "Don't roll your own cryptography", we should also heed the advice "Don't roll your own BLAS". BLAS libraries are often tuned for the specific SIMD (Single Instruction, Multiple Data) instructions of the underlying hardware. This can lead to performance improvements in orders of magnitude.&lt;/p&gt;

&lt;p&gt;In this post we want to explore how to utilize BLAS libraries on Apple Silicon using C++, and compare the performance of OpenBLAS and Apple's Accelerate framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Brief History of SIMD
&lt;/h2&gt;

&lt;p&gt;SIMD is a type of parallel computing first conceptually classified in Flynn's taxonomy&lt;sup id="fnref1"&gt;1&lt;/sup&gt; in the 1960s. In the 1990s, as desktop processors became more powerful, SIMD instructions were introduced to improve multimedia and gaming performance. Intel's MMX, released in 1996, was the first widely deployed SIMD on desktop CPUs, followed by more advanced instruction sets like SSE, AVX, and AVX-512.&lt;/p&gt;

&lt;p&gt;SIMD became standard in most modern CPUs, accelerating tasks such as digital image processing, audio processing, and gaming graphics by executing the same operation on multiple data points at once. Thus, SIMD evolved from early experimental supercomputers to an integral part of everyday computing.&lt;/p&gt;

&lt;h2&gt;
  
  
  SIMD on Apple Silicon
&lt;/h2&gt;

&lt;p&gt;Apple Silicon processors, including the M1, M2, and M3 series, primarily support the &lt;a href="https://developer.arm.com/Architectures/Neon" rel="noopener noreferrer"&gt;ARM NEON&lt;/a&gt; instruction set, which is a 128-bit SIMD architecture part of the ARMv8-A ISA&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. NEON provides a wide range of integer and floating-point vector operations suitable for parallel processing on vectors of data.&lt;/p&gt;

&lt;p&gt;In addition to NEON, Apple Silicon also features proprietary Apple Matrix Coprocessor (AMX)&lt;sup id="fnref3"&gt;3&lt;/sup&gt; instructions. These instructions are specialized for high-performance computing tasks involving matrix operations and are unique to Apple Silicon. They are not part of the official ARM architecture and are currently undocumented publicly&lt;sup id="fnref4"&gt;4&lt;/sup&gt;, but they add significant computational acceleration beyond NEON for certain workloads.&lt;/p&gt;

&lt;p&gt;The best way to utilize AMX instructions is through the Accelerate framework, which provides a high-level API for performing BLAS operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenBLAS vs Accelerate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benchmark Setup
&lt;/h3&gt;

&lt;p&gt;There are several CPU-based BLAS libraries, some of them developed by hardware manifacturers, such as &lt;a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html" rel="noopener noreferrer"&gt;Intel's MKL&lt;/a&gt;, &lt;a href="https://www.amd.com/de/developer/aocl/blis.html" rel="noopener noreferrer"&gt;AMD's BLIS&lt;/a&gt; and &lt;a href="https://developer.apple.com/documentation/accelerate" rel="noopener noreferrer"&gt;Apple's Accelerate&lt;/a&gt;. &lt;a href="https://github.com/OpenMathLib/OpenBLAS" rel="noopener noreferrer"&gt;OpenBLAS&lt;/a&gt; is an open source library that has the broadest coverage of supported hardware and can be a solid default choice.&lt;/p&gt;

&lt;p&gt;I was curious what the difference would be in terms of performance when comparing OpenBLAS and Accelerate on Apple Silicon. Since OpenBLAS does not support Apple's AMX instructions, relying on NEON instructions only, I expect it to be slower for most use cases.&lt;/p&gt;

&lt;p&gt;I built four benchmarks around some common double precision BLAS routines that are part of the C BLAS interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cblas_daxpy&lt;/code&gt; – &lt;em&gt;a · X + Y&lt;/em&gt; is a level 1 routine that scales vector &lt;em&gt;X&lt;/em&gt; by scalar &lt;em&gt;a&lt;/em&gt; and adds it to vector &lt;em&gt;Y&lt;/em&gt;. This routine is used in linear combination of vectors, common in iterative algorithms like conjugate gradient (CG) or generalized minimum residual (GMRES) for solving linear systems.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cblas_ddot&lt;/code&gt; - &lt;em&gt;X · Y&lt;/em&gt; is a level 1 routine that computes the dot product of vectors &lt;em&gt;X&lt;/em&gt; and &lt;em&gt;Y&lt;/em&gt;. Dot products are widely used in machine learning, e.g. to compute similarities between vectors or physics simulations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cblas_dgemv&lt;/code&gt; - &lt;em&gt;a · &lt;strong&gt;A&lt;/strong&gt; · X + b · X&lt;/em&gt; is a level 2 routine that handles matrix-vector multiplication. &lt;em&gt;a&lt;/em&gt;, &lt;em&gt;b&lt;/em&gt; are scalars, &lt;em&gt;X&lt;/em&gt;, &lt;em&gt;Y&lt;/em&gt; are vectors, and &lt;em&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/em&gt; is a matrix. This routine is also common in iterative algorithms, but also in machine learning (linear regression), and signal processing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cblas_dgemm&lt;/code&gt; - &lt;em&gt;a · &lt;strong&gt;A&lt;/strong&gt; · &lt;strong&gt;B&lt;/strong&gt; + b · &lt;strong&gt;C&lt;/strong&gt;&lt;/em&gt; is a level 3 routine that handles matrix-matrix multiplication. &lt;em&gt;a&lt;/em&gt;, &lt;em&gt;b&lt;/em&gt; are scalars, and &lt;em&gt;&lt;strong&gt;A&lt;/strong&gt;&lt;/em&gt;, &lt;em&gt;&lt;strong&gt;B&lt;/strong&gt;&lt;/em&gt;, &lt;em&gt;&lt;strong&gt;C&lt;/strong&gt;&lt;/em&gt; are matrices. This routine is common in machine learning and graphics processing, for example.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmarks use Google's &lt;a href="https://github.com/google/benchmark" rel="noopener noreferrer"&gt;benchmark&lt;/a&gt; library to measure performance. You can find the full code on &lt;a href="https://github.com/frosnerd/cpp-simd-post" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The following listing outlines the benchmark for &lt;code&gt;cblas_ddot&lt;/code&gt; defined as a macro taking the name of the BLAS backend as an argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define BENCHMARK_DDOT(backend_name) \
static void BM_Ddot_##backend_name(benchmark::State&amp;amp; state) { \
    size_t n = state.range(0); \
    std::vector&amp;lt;double&amp;gt; x(n, 1.0); \
    std::vector&amp;lt;double&amp;gt; y(n, 2.0); \
    for (auto _ : state) { \
        benchmark::DoNotOptimize(blas_ddot(x, y)); \
    } \
    state.SetItemsProcessed(state.iterations() * n); \
} \
BENCHMARK(BM_Ddot_##backend_name)-&amp;gt;RangeMultiplier(2)-&amp;gt;Range(1&amp;lt;&amp;lt;1, 1&amp;lt;&amp;lt;22);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll obtain the desired vector size &lt;code&gt;n&lt;/code&gt; from the benchmark state. We'll initialize two vectors of the same size &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;. The &lt;code&gt;for (auto _ : state)&lt;/code&gt; loop runs the function for the desired number of iterations. &lt;code&gt;benchmark::DoNotOptimize&lt;/code&gt; is used to prevent the compiler from optimizing away the function call because the result is unused. We'll record the user metric number of items processed as the number of iterations times the vector size.&lt;/p&gt;

&lt;p&gt;We'll register the function as a benchmark using the &lt;code&gt;BENCHMARK&lt;/code&gt; macro, defining the range of vector sizes to test, e.g. from 2&lt;sup&gt;2&lt;/sup&gt; to 2&lt;sup&gt;22&lt;/sup&gt; with a multiplier of 2. We can then generate the benchmarks by calling the macro with the desired backend name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#ifdef USE_ACCELERATE
&lt;/span&gt;&lt;span class="n"&gt;BENCHMARK_DDOT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Accelerate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BENCHMARK_DGEMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Accelerate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BENCHMARK_DGEMV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Accelerate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BENCHMARK_DAXPY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Accelerate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll repeat the same for OpenBLAS. In our &lt;code&gt;CMakeLists.txt&lt;/code&gt; file we can then conditionally compile the two versions, or simply compile both versions at once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cmake"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="nb"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;BUILD_ACCELERATE &lt;span class="s2"&gt;"Build binary with Apple Accelerate framework (macOS only)"&lt;/span&gt; ON&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;BUILD_OPENBLAS &lt;span class="s2"&gt;"Build binary with OpenBLAS"&lt;/span&gt; ON&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="nb"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;BUILD_ACCELERATE AND APPLE&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;add_executable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;cpp-simd-post-accelerate &lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SOURCES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;target_link_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;cpp-simd-post-accelerate PRIVATE benchmark::benchmark benchmark::benchmark_main&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;target_link_libraries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;cpp-simd-post-accelerate PRIVATE &lt;span class="s2"&gt;"-framework Accelerate"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;target_compile_definitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;cpp-simd-post-accelerate PRIVATE ACCELERATE_NEW_LAPACK USE_ACCELERATE&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;STATUS &lt;span class="s2"&gt;"Building cpp-simd-post-accelerate with Apple Accelerate framework"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;endif&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When checking the resulting binaries, we can indeed see that they link to the respective libraries (assuming you installed them correctly on your system first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ otool -fahl build/cpp-simd-post-openblas | grep openblas -B5 -A5

Load command 14
          cmd LC_LOAD_DYLIB
      cmdsize 80
         name /opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (offset 24)
   time stamp 2 Thu Jan  1 01:00:02 1970
      current version 0.0.0
compatibility version 0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ otool -fahl build/cpp-simd-post-accelerate | grep Accelerate -B5 -A5

Load command 14
          cmd LC_LOAD_DYLIB
      cmdsize 96
         name /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (offset 24)
   time stamp 2 Thu Jan  1 01:00:02 1970
      current version 4.0.0
compatibility version 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that OpenBLAS was using the &lt;code&gt;neoversen1&lt;/code&gt; core on my Apple M3 Pro chip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENBLAS_VERBOSE=2 ./build/cpp-simd-post-openblas
Core: neoversen1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generating the Results
&lt;/h3&gt;

&lt;p&gt;Now that we have our benchmarks compiled, let's run them and analyze the results. We can use the &lt;code&gt;--benchmark_out&lt;/code&gt; command line argument to specify a JSON output file that will hold the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/cpp-simd-post-openblas &lt;span class="nt"&gt;--benchmark_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"openblas.json"&lt;/span&gt;
./build/cpp-simd-post-accelerate &lt;span class="nt"&gt;--benchmark_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"accelerate.json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aside from some metadata about the run, the JSON file contains a list of benchmark results of the following form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BM_Ddot_Accelerate/2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"family_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"per_family_instance_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"run_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BM_Ddot_Accelerate/2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"run_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"repetitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"repetition_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threads"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iterations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;76769538&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"real_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.1229957239181410e+00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cpu_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.1226027698642671e+00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"time_unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ns"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"items_per_second"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.1923567762994435e+08&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; field encodes the benchmark name (&lt;code&gt;Ddot&lt;/code&gt;), the library (&lt;code&gt;Accelerate&lt;/code&gt;), and the input size (&lt;code&gt;2&lt;/code&gt;). For our analysis we will look at the user metric &lt;code&gt;items_per_second&lt;/code&gt; (larger is better), which represents the number of input doubles we were able to process per second. I wrote a Python script that collects those benchmark results, parses the &lt;code&gt;name&lt;/code&gt; field and plots the results. Let's take a look at them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyzing the Results
&lt;/h3&gt;

&lt;p&gt;Let's start with &lt;code&gt;cblas_daxpy&lt;/code&gt;, which scales a vector and adds it to another vector. We'll benchmark vector sizes from 2 to 4,194,304. While most applications rely on smaller vectors up to 2&lt;sup&gt;12&lt;/sup&gt; =  4096  elements, larger vectors are not unheard of.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn4ypnz8yi5hv6pq09hc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn4ypnz8yi5hv6pq09hc.png" alt="daxpy benchmark results" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that in very small vector sizes, the performance difference is insignificant, and OpenBLAS even outperforms Accelerate. However, starting with input size 512, Accelerate surpasses OpenBLAS and remains consistently better by a factor of up to 6x, especially because OpenBLAS seems to slow down significantly starting from 2&lt;sup&gt;14&lt;/sup&gt; input size. What's interesting is that OpenBLAS picks up the pace again and surpasses Accelerate for input sizes larger than 2&lt;sup&gt;18&lt;/sup&gt;. &lt;/p&gt;

&lt;p&gt;Now let's look at the &lt;code&gt;cblas_ddot&lt;/code&gt; results. We'll use the same input sizes to benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrb02axyd04mqpzadivl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrb02axyd04mqpzadivl.png" alt="ddot benchmark results" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unsurprisingly, the results are very similar, with the performance being comparable for very small vector sizes, Accelerate outperforming OpenBLAS for medium to large sizes, and eventually OpenBLAS catching up for very large sizes. OpenBLAS also shows a dip in larger sizes, which looks interesting. We will investigate that later.&lt;/p&gt;

&lt;p&gt;Next, let's dive into the matrix operations. For matrices, we'll use smaller input sizes (up to 2&lt;sup&gt;13&lt;/sup&gt;), as the number of elements will be the squared input size. Typical input sizes in real world applications range from smaller matrices up to very large ones. Note that many matrix operations, especially for graphics processing and generative AI are offloaded to GPUs. Let's start with &lt;code&gt;cblas_dgemv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5j972ba5o6i4dy28hy0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5j972ba5o6i4dy28hy0.png" alt="dgemv benchmark results" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly to the vector results, small sizes show little difference in performance. However, Accelerate starts to outperform OpenBLAS quite early. Even when both implementations have their peak performance (2&lt;sup&gt;10&lt;/sup&gt;), Accelerate outperforms OpenBLAS. However, with growing matrix sizes, OpenBLAS takes the upper hand.&lt;/p&gt;

&lt;p&gt;Let's check out &lt;code&gt;cblas_dgemm&lt;/code&gt; next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkijbe9iviaig6o9g1n4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkijbe9iviaig6o9g1n4s.png" alt="dgemm benchmark results" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dgemm&lt;/code&gt; results are the most consistent ones across all four experiments. Accelerate outperforms OpenBLAS for most input sizes up until 2&lt;sup&gt;13&lt;/sup&gt;, from where OpenBLAS takes the lead. Interestingly, OpenBLAS does not show the dip in performance for larger input sizes as in the vector experiments. In &lt;code&gt;dgemm&lt;/code&gt;, OpenBLAS also does not appear to have reached its peak performance in the given input range.&lt;/p&gt;

&lt;p&gt;Looking at these results, we can note a few observations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For very small input sizes, the performance difference is insignificant. I believe this is due to the fact that with small inputs, memory access and function call overhead dominate performance.&lt;/li&gt;
&lt;li&gt;For medium to large input sizes, Accelerate outperforms OpenBLAS. This is expected given that Accelerate can take advantage of Apple's AMX instructions and the AMX coprocessor, while OpenBLAS relies on NEON instructions only.&lt;/li&gt;
&lt;li&gt;For very large input sizes, OpenBLAS outperforms Accelerate. I did not expect this, and I am not entirely sure what the reason behind the difference is. I am suspecting that either Accelerate is not optimized for very large input sizes, as those are not typical in the applications that run on consumer devices such as Macs, iPhones, etc., or it has to do with size limitations of the AMX coprocessor.&lt;/li&gt;
&lt;li&gt;OpenBLAS shows a dip in performance for medium to large input sizes, which is especially visible in &lt;code&gt;daxpy&lt;/code&gt; and &lt;code&gt;ddot&lt;/code&gt;. I think that this might be caused by OpenBLAS dynamically choosing the number of threads based on the vector size. Let's investigate this a bit further.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our benchmark run did not specify any parallelism and left the choise to the library. Based on the &lt;a href="https://github.com/OpenMathLib/OpenBLAS/blob/f6df9bebbb4259aa61ab5634c0f1269fb152cc0e/kernel/arm64/dot.c#L85-L102" rel="noopener noreferrer"&gt;ARM64 kernel&lt;/a&gt; that is used for the &lt;code&gt;ddot&lt;/code&gt; experiment, it seems that the number of threads is indeed chosen based on the input size, but also based on the OpenBLAS core used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;get_dot_optimal_nthreads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BLASLONG&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ncpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_cpu_avail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cp"&gt;#if defined(NEOVERSEV1) &amp;amp;&amp;amp; !defined(COMPLEX) &amp;amp;&amp;amp; !defined(BFLOAT16)
&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;get_dot_optimal_nthreads_neoversev1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ncpu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#elif defined(DYNAMIC_ARCH) &amp;amp;&amp;amp; !defined(COMPLEX) &amp;amp;&amp;amp; !defined(BFLOAT16)
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gotoblas_corename&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"neoversev1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;get_dot_optimal_nthreads_neoversev1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ncpu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;
  &lt;span class="c1"&gt;// Default case&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;10000L&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;num_cpu_avail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since my OpenBLAS uses &lt;code&gt;neoversen1&lt;/code&gt;, it should choose one thread for vectors &amp;lt;= 10k, and all available CPUs for vectors &amp;gt; 10k. We can see that this threshold (red vertical line) aligns with the change in the performance trend:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq10ektung4q5rmcei9su.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq10ektung4q5rmcei9su.png" alt="neoversen1 thresholds" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary and Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we wrote and executed benchmarks for four common BLAS routines using two different BLAS implementations: OpenBLAS and Apple's Accelerate framework. We used Google's C++ benchmark library to measure performance and plotted the results using Python. We found the following results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For very small input sizes, results of both libraries are comparable. This is likely due to the fact that with small inputs, memory access and function call overhead dominate performance. &lt;/li&gt;
&lt;li&gt;Accelerate outperforms OpenBLAS for medium to large input sizes. This is expected given that Accelerate can take advantage of Apple's AMX instructions and the AMX coprocessor, while OpenBLAS relies on NEON instructions only. If there are specialized, state-of-the-art instructions available on your platform, you should use a BLAS library that takes advantage of them.&lt;/li&gt;
&lt;li&gt;While the BLAS interfaces are the same, the available configuration options, e.g. the parallelism, can vary between implementations and have a significant impact on the performance.&lt;/li&gt;
&lt;li&gt;OpenBLAS appears to outperform Accelerate for very large input sizes. There is no silver bullet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even when not rolling your own BLAS, relying on well-known BLAS libraries, there can be performance differences between the different implementations in orders of magnitude. If your application requires maximum performance, you should always benchmark the different options and choose the one that performs best for your use case on the hardware you run in production.&lt;/p&gt;

&lt;p&gt;Note that during my benchmarks I only looked at throughput. Latency is another relevant metric to consider, especially for real-time applications. Latency can be significantly higher when using coprocessors or processing units farther away from the CPU, such as GPUs.&lt;/p&gt;

&lt;p&gt;Have you compared different BLAS or scientific computing libraries before? Did you run into any unexplained performance differences? I'd love to hear about your experiences in the comments below!&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Flynn, Michael J. (December 1966). "Very high-speed computing systems" (PDF). Proceedings of the IEEE. 54 (12): 1901–1909. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;The ARMv8-A ISA (Instruction Set Architecture) is a 64-bit architecture developed by ARM Holdings, supporting both 32-bit (AArch32) and 64-bit (AArch64) execution states. It includes three main instruction sets: A32 and T32 for 32-bit processing, and A64 for 64-bit processing. An ISA is an abstract model that defines how software controls a processor, specifying the set of machine-level instructions the CPU can execute, along with how they are encoded and how they interact with registers and memory. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Apple's AMX is not to be confused with Intel's AMX. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;You can find some &lt;a href="https://github.com/corsix/amx" rel="noopener noreferrer"&gt;user-written documentation and header files&lt;/a&gt; on GitHub. There's also &lt;a href="https://research.meekolab.com/the-elusive-apple-matrix-coprocessor-amx" rel="noopener noreferrer"&gt;The Elusive Apple Matrix Coprocessor&lt;/a&gt; blog post by Meeko Labs which is worth reading. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>cpp</category>
      <category>simd</category>
      <category>matrix</category>
      <category>cpu</category>
    </item>
    <item>
      <title>Generating Application Specific Go Documentation Using Go AST and Antora</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Tue, 04 Nov 2025 12:48:06 +0000</pubDate>
      <link>https://dev.to/frosnerd/generating-application-specific-go-documentation-using-go-ast-and-antora-aho</link>
      <guid>https://dev.to/frosnerd/generating-application-specific-go-documentation-using-go-ast-and-antora-aho</guid>
      <description>&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;In one of my projects I was working on some in-house Go application to manage complex task pipelines on Kubernetes. On a high level, a task consists of one or multiple steps that are executed in order. Engineers would contribute tasks in the form of new Go files. Different teams collaborate on those tasks. Engineers running the tasks need to understand how to configure, debug and operate them correctly. Therefore, having good documentation is crucial.&lt;/p&gt;

&lt;p&gt;The biggest challenge with documentation, however, is to keep it up-to-date. This becomes more challenging the further away the documentation is kept from the source code. In the ideal scenario, the documentation is close to the source code, e.g. in the form of &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html" rel="noopener noreferrer"&gt;JavaDoc&lt;/a&gt;, &lt;a href="https://peps.python.org/pep-0257/" rel="noopener noreferrer"&gt;Python Docstrings&lt;/a&gt;, or &lt;a href="https://go.dev/blog/godoc" rel="noopener noreferrer"&gt;GoDoc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The downside of these tools is that the content they generate is often tightly coupled to the ecosystem of the language. Additionally, their customizability is often limited. Typically, they generate some HTML page based on the text written in comments. Some of them will provide additional context, such as links to other packages, types, or functions. Take the JavaDoc of jvector's &lt;a href="https://javadoc.io/doc/io.github.jbellis/jvector/latest/io/github/jbellis/jvector/graph/GraphIndex.html" rel="noopener noreferrer"&gt;&lt;code&gt;GraphIndex&lt;/code&gt;&lt;/a&gt; class as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv4aoyg95j4vbjjbylcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flv4aoyg95j4vbjjbylcd.png" alt="Interface GraphIndex javadoc" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It shows information such as super- and implementing classes / interfaces, nested classes and implemented methods. You can browse other classes within the same package as well. This contextual information is useful, but it lacks real-world application context. Its main focus is documenting the API to other developers, and not documenting the behaviour of an application.&lt;/p&gt;

&lt;p&gt;To make my point clearer, let's look at the type definitions for our tasks in Go. I'm going to leave out irrelevant detail here. In reality, the code is more complex. Every task has defined &lt;code&gt;TaskLogic&lt;/code&gt;, which consists of one or multiple &lt;code&gt;TaskStep&lt;/code&gt;s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TaskLogic&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;GetSteps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;TaskStep&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We support multiple types of tasks, which are mapped from their API name to the respective task logic. For the scope of this blog post, let's assume that there are two tasks: &lt;code&gt;DataProcessingTask&lt;/code&gt; and &lt;code&gt;ReminderTask&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;TaskLogics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskLogic&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataProcessingTask&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataProcessingTaskLogic&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReminderTask&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReminderTaskLogic&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's let's look into &lt;code&gt;DataProcessingTaskLogic&lt;/code&gt; and the related steps in detail. Note that steps can be reused across tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;/*
DataProcessingTaskLogic processes the provided data
in a secure and performant manner. It uses sophisticated
algorithms for maximum efficiency, while ensuring
the highest level of security through quantum-resistant encryption.
 */&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;DataProcessingTaskLogic&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c"&gt;/*
StartupStep performs the necessary startup actions,
such as initializing the database connection 
and loading the configuration.
 */&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;StartupStep&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c"&gt;/*
RunStep performs the actual data processing.
 */&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RunStep&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c"&gt;/*
CleanupStep performs the necessary cleanup actions,
such as closing the database connection and releasing resources.
 */&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;CleanupStep&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;DataProcessingTaskLogic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GetSteps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskStep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskStep&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StartupStep&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="n"&gt;RunStep&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="n"&gt;CleanupStep&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine an engineer who gets alerted on a malfunctioning data processing task. The error says that the task failed in the first step. How would they know what the steps are, and what to do to debug them?&lt;/p&gt;

&lt;p&gt;We could write that information down in a runbook but if someone updates the logic, e.g. adding a new step, the documentation in the runbook would quickly become outdated. Ideally, you'd want the documentation about the task logic, including the performed steps, to be generated directly from the source code. Here's how it could look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxcmnfgzg9dy6j2b5lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsxcmnfgzg9dy6j2b5lu.png" alt="generated docs with task description and step docs" width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://antora.org/" rel="noopener noreferrer"&gt;Antora&lt;/a&gt; to generate internal documentation for our services across different repositories, where code is written in different languages. Antora is a static site generator that allows you to write documentation in AsciiDoc and then generate a website from it. It is very customizable and allows you to combine documentation from different sources, e.g. different repositories into a single, cohesive, versioned documentation site.&lt;/p&gt;

&lt;p&gt;In the remainder of this post I want to share how I built a documentation generator that combines GoDoc with application specific logic to generate documentation that can be integrated into an Antora site. On a high level, this will entail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate Markdown files for each task type, outlining the task logic and steps. I chose Markdown because it is a bit easier to write and more widely supported, but you could generate AsciiDoc directly.&lt;/li&gt;
&lt;li&gt;Convert Markdown to AsciiDoc (skip this step if you decided to generate AsciiDoc directly)&lt;/li&gt;
&lt;li&gt;Integrate the AsciiDoc files into the Antora resources and build the site (skip this step if you don't need a static site) &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Generating Markdown Documentation
&lt;/h2&gt;

&lt;p&gt;We are going to use &lt;a href="https://magefile.org/" rel="noopener noreferrer"&gt;mage&lt;/a&gt; as our build tool. The directory structure of mage-related files looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── magefile.go
└── mage
    └── docs
        ├── lib.go
        └── task_docs.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generating the task documentation will be triggered using &lt;code&gt;mage docs &amp;lt;output-directory&amp;gt;&lt;/code&gt;, thanks to the following function inside &lt;code&gt;magefile.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// magefile.go&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputDir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenerateDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To implement the &lt;code&gt;GenerateDocs&lt;/code&gt; function, we'll need to define some types to represent the documentation, i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A mechanism to differentiate logic and steps, storing the struct name and package path (&lt;code&gt;TypeIdent&lt;/code&gt; with &lt;code&gt;TaskLogicType&lt;/code&gt; and &lt;code&gt;TaskStepType&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TaskDocs&lt;/code&gt; to store the documentation for a task, including the documentation for the task logic and a list of the steps&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;StepDoc&lt;/code&gt; to store the documentation for a step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We store the &lt;code&gt;StepDoc&lt;/code&gt; separately, because steps can be reused in multiple tasks. The reference will be done based on the &lt;code&gt;TaskStepType&lt;/code&gt;. Here's the definition of those types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// docs/task_docs.go&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TypeIdent&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;PkgPath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TaskLogicType&lt;/span&gt; &lt;span class="n"&gt;TypeIdent&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TaskStepType&lt;/span&gt; &lt;span class="n"&gt;TypeIdent&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TaskDocs&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TaskLogicDoc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;FileName&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;StructName&lt;/span&gt;   &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;TaskSteps&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;StepDoc&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StructName&lt;/span&gt;       &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;FileName&lt;/span&gt;         &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;ShortDescription&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;LongDescription&lt;/span&gt;  &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we'll need to implement the &lt;code&gt;GenerateDocs&lt;/code&gt; function. This function will extract the task and step documentation from the source code comments, generate the Markdown files, and write them to the output directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// docs/lib.go&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GenerateDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputDir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ExtractTaskAndStepDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;markdownDocs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;GenerateMarkdownDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdownDocs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No task documentation generated. Maybe the generator did not find the source files?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;WriteMarkdownDocsToFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdownDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error writing task documentation to files: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Task documentation generated successfully to"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's dive into &lt;code&gt;ExtractTaskAndStepDocs&lt;/code&gt; first. Since steps can be reused across tasks, we generate task and step docs separately, merging them later.&lt;br&gt;
The goal of &lt;code&gt;ExtractTaskAndStepDocs&lt;/code&gt; is to walk through all &lt;code&gt;.go&lt;/code&gt; files in the project and extract the documentation from the comments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ExtractTaskAndStepDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootPrefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskStepLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskSteps&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;buildTaskTypeLookups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;controllers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskLogics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;taskDocs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WalkDir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DirEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsDir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasSuffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;".go"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;processFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskStepLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskSteps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error walking through files: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The helper function &lt;code&gt;buildTaskTypeLookups&lt;/code&gt; is listed at the end of the blog post. The core idea is to build lookup maps from the &lt;code&gt;TaskLogics&lt;/code&gt; that are available in the project. Those lookups will be used when processing the individual files to figure out if the types that are declared in the files are the ones used in the task logic (and thus should be included in the documentation). Let's look at the &lt;code&gt;processFile&lt;/code&gt; function next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;filePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskStepLookup&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskSteps&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rootPrefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fset&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewFileSet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseComments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to parse file %s: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inspect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenDecl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tok&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TYPE&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;processDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskStepLookup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskSteps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;processFile&lt;/code&gt; function uses the &lt;a href="https://go.dev/ref/spec#Notation" rel="noopener noreferrer"&gt;Go AST&lt;/a&gt; to parse the Go files and extract the documentation from the comments. The &lt;code&gt;ast.Inspect&lt;/code&gt; function is used to traverse the AST and find all type declarations. We process each declaration in &lt;code&gt;processDeclaration&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;genDecl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;taskStepLookup&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allTaskDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allTaskSteps&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rootPrefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Specs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;typeSpec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeSpec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;typeSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;structName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;typeSpec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;
                &lt;span class="n"&gt;typeIdent&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;TypeIdent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePathToPackagePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;packagePrefix&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;taskName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;taskLogicLookup&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeIdent&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;processTaskLogic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allTaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allTaskSteps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;taskName&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;taskStepLookup&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeIdent&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;processTaskStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;processDeclaration&lt;/code&gt; function checks if the type is a struct and if it is a task logic or step. If it is a task logic, we process it in &lt;code&gt;processTaskLogic&lt;/code&gt;. If it is a step, we process it in &lt;code&gt;processTaskStep&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processTaskLogic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskName&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskSteps&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;taskDocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;taskName&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;TaskLogicDoc&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extractComment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;FileName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;StructName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;TaskSteps&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;taskSteps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processTaskStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;genDecl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenDecl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rootPrefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;extractComment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePathToPackagePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;packagePrefix&lt;/span&gt;&lt;span class="p"&gt;)}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StructName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;structName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;FileName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;         &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ShortDescription&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extractShortDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;LongDescription&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both functions are rather trivial. They use a few additional helper functions &lt;code&gt;extractComment&lt;/code&gt;, &lt;code&gt;extractShortDescription&lt;/code&gt; and &lt;code&gt;filePathToPackagePath&lt;/code&gt; to extract the documentation from the comments in a structured form. The comments are shown in the detail view of the respective task or step. The short description corresponds to the first sentence of the comment and will be used in the list of steps for a task. The file path is used to link to the source code on GitHub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;extractComment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenDecl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Doc&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;trimmedDoc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genDecl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Doc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;trimmedDoc&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;extractShortDescription&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SplitN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;replacedFirstParagraph&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReplaceAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;shortDescription&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replacedFirstParagraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;shortDescription&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;filePathToPackagePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;packagePrefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;relativePath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootPrefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dirPath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relativePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;packagePath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packagePrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dirPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;packagePath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToSlash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packagePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;packagePath&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the task and step documentation, we can generate the Markdown files in &lt;code&gt;GenerateMarkdownDocs&lt;/code&gt;. Each task gets its own markdown file.&lt;br&gt;
We could also generate individual files for each step, but I'll leave this exercise to the reader if needed. For the sake of simplicity we'll include the long description of the steps right into the task documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GenerateMarkdownDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;TaskDocs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;StepDoc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;markdownDocs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;taskDocs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sb&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Builder&lt;/span&gt;

        &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"# %s&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"## Description&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskLogicDoc&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskLogicDoc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No description provided.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;You can find the [source code](%s/%s) on GitHub.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;githubPrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FileName&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"## Steps&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskSteps&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskSteps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskSteps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"### %d. %s&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;stepDoc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;stepDocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;stepDoc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LongDescription&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;stepDoc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LongDescription&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"No steps defined.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;markdownDocs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;markdownDocs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are not writing the markdown files directly here, but first building them into a map in order to make the code more testable. The last step will be to write the files for each task to the output directory. This is where &lt;code&gt;WriteMarkdownDocsToFiles&lt;/code&gt; comes into play.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;WriteMarkdownDocsToFiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdownDocs&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputDir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;markdownDocs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fileName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s.md"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;filePath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="m"&gt;0644&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to write file %s: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Wrote file %s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it on the Go side. Let's look into converting the Markdown files to AsciiDoc and integrating them into the Antora site. If you decided to generate AsciiDoc directly, you can skip this step. If you don't need to add the markdown files to a static site, you can also stop reading now. &lt;/p&gt;

&lt;h2&gt;
  
  
  Convert Markdown to AsciiDoc
&lt;/h2&gt;

&lt;p&gt;To convert Markdown to AsciiDoc we can use &lt;code&gt;pandoc&lt;/code&gt;. The following script will walk through the markdown files in the provided directory, convert each file to AsciiDoc, and also write a line to the navigation file which will be included as a &lt;a href="https://docs.antora.org/antora/latest/page/include-a-partial/" rel="noopener noreferrer"&gt;partial&lt;/a&gt; in the Antora site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# integrate_go_docs.sh&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"$#"&lt;/span&gt; &lt;span class="nt"&gt;-ne&lt;/span&gt; 2 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; &amp;lt;md_dir&amp;gt; &amp;lt;pages_dir&amp;gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; pandoc &amp;amp;&amp;gt; /dev/null
&lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Error: pandoc is not installed. Please install pandoc to continue."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;md_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;span class="nv"&gt;pages_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

&lt;span class="nv"&gt;tasks_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"components/operator/tasks"&lt;/span&gt;
&lt;span class="nv"&gt;nav_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pages_dir&lt;/span&gt;&lt;span class="s2"&gt;/../partials/task-operator-nav.adoc"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"// This file has been generated on &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$nav_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Convert task docs&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;md_file &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$md_dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.md
&lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;adoc_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pages_dir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$tasks_dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nv"&gt;adoc_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;md_file&lt;/span&gt;&lt;span class="p"&gt;%.md&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.adoc"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;adoc_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$adoc_dir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$adoc_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Integrating &lt;/span&gt;&lt;span class="nv"&gt;$md_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="c"&gt;# -s to create a standalone document, including the title (=)&lt;/span&gt;
  &lt;span class="c"&gt;# --shift-heading-level-by -1 to convert the markdown h1 (#) to asciidoc title (=)&lt;/span&gt;
  &lt;span class="c"&gt;# See https://github.com/jgm/pandoc/issues/5615&lt;/span&gt;
  &lt;span class="c"&gt;#&lt;/span&gt;
  &lt;span class="c"&gt;# --wrap=none to avoid wrapping lines, causing long headlines to be broken&lt;/span&gt;
  &lt;span class="c"&gt;# See https://github.com/jgm/pandoc/issues/3277#issuecomment-264706794&lt;/span&gt;
  pandoc &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; markdown &lt;span class="nt"&gt;--shift-heading-level-by&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt; &lt;span class="nt"&gt;--wrap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;none &lt;span class="nt"&gt;-t&lt;/span&gt; asciidoc &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$adoc_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$md_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"* xref:&lt;/span&gt;&lt;span class="nv"&gt;$tasks_dir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$adoc_file&lt;/span&gt;&lt;span class="s2"&gt;[]"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$nav_file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the navigation partial and the individual task AsciiDocs in place, we can build the Antora site. I'm not going to go cover the details of how to use Antora here. Please refer to the &lt;a href="https://docs.antora.org/antora/latest/install-and-run-quickstart/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrate into Antora Site
&lt;/h2&gt;

&lt;p&gt;We're building the site using Antora, which uses the &lt;code&gt;npm&lt;/code&gt; ecosystem. We'll need scripts to launch &lt;code&gt;integrate_go_docs.sh&lt;/code&gt; and to run antora with the provided playbook. Here's the &lt;code&gt;package.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"go-antora-docs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scripts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"generate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"antora --stacktrace --fetch --clean playbooks/generate.yml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"go:pandoc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./integrate_go_docs.sh ../docs/generated ./modules/ROOT/pages"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"repository"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"git"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"git+https://github.com/yourorg/yourrepo.git"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@antora/cli"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^3.1.7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@antora/lunr-extension"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.0.0-alpha.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@antora/site-generator-default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^3.1.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@redocly/cli"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.25.11"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"antora"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^3.1.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"http-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^14.1.1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are going to use GitHub actions to run the whole pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generate-go-docs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-24.04&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;GOPATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/runner/go&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Go&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-go@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;go-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.24.2'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Mage&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;magefile/mage-action@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;install-only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1.15.0"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Docs&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mage docs docs/generated&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload  Docs&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go-docs&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docs/generated&lt;/span&gt;
  &lt;span class="na"&gt;compile-docs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-24.04&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;go-docs&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Dependencies&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Pandoc&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sudo apt-get update&lt;/span&gt;
          &lt;span class="s"&gt;sudo apt-get install -y pandoc&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Download Operator Docs&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/download-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;operator-docs&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes/operator/docs/generated&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Operator Pandoc Page&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run "operator:pandoc"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Antora Page&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run generate&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Copy index HTML&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cp index.html playbooks/build/site&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload antora&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;antora&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;site/playbooks/build/site&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it. The GitHub action will run &lt;code&gt;mage docs docs/generated&lt;/code&gt;, upload the resulting markdown, then pass it to the next job which downloads it, running &lt;code&gt;npm run "operator:pandoc"&lt;/code&gt; and then &lt;code&gt;npm run generate&lt;/code&gt;, finally moving the &lt;code&gt;index.html&lt;/code&gt; into the build directory and uploading the resulting artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we've seen how we can generate documentation from Go code and integrate it into an Antora site. The approach can be adapted to other programming languages and documentation formats. The advantage of this approach is that the documentation is easier to keep up-to-date, and it can be enriched with application specific information, tailored towards not only API documentation, but documenting the behaviour of your application.&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildTaskTypeLookups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskLogic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;invertedLogics&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;invertedSteps&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;apiName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logicType&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;reflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;invertedLogics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskLogicType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;logicType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;logicType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PkgPath&lt;/span&gt;&lt;span class="p"&gt;()}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apiName&lt;/span&gt;

        &lt;span class="n"&gt;logicSteps&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSteps&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;logicSteps&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;stepType&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;reflect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;stepTypeSpec&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;TaskStepType&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stepType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;stepType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PkgPath&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
            &lt;span class="n"&gt;invertedSteps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stepTypeSpec&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apiName&lt;/span&gt;
            &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;apiName&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;apiName&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;stepTypeSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;invertedLogics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invertedSteps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>go</category>
      <category>documentation</category>
      <category>showdev</category>
      <category>html</category>
    </item>
    <item>
      <title>Build Your Own Food Tracker with OpenAI Platform</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Tue, 03 Jun 2025 14:15:57 +0000</pubDate>
      <link>https://dev.to/frosnerd/build-your-own-food-tracker-with-openai-platform-55n8</link>
      <guid>https://dev.to/frosnerd/build-your-own-food-tracker-with-openai-platform-55n8</guid>
      <description>&lt;h2&gt;
  
  
  The Benefits of Food Tracking
&lt;/h2&gt;

&lt;p&gt;Whether your goal is to lose weight, gain muscle, or simply maintain a balanced diet, understanding the value of the food you consume is essential. By keeping a record of what we eat, we gain insights into our nutritional intake, allowing us to make informed decisions about our diet. Here are some of the key benefits of food tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Awareness and Accountability&lt;/strong&gt;: Tracking helps us understand what we're consuming daily, shedding light on eating habits we might not even be aware of.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nutritional Balance&lt;/strong&gt;: It enables us to monitor macronutrients like protein, fats, and carbohydrates, ensuring a balanced diet that aligns with personal goals such as weight loss, muscle gain, or overall well-being.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portion Control&lt;/strong&gt;: With a clearer picture of portion sizes, food tracking helps prevent overeating and supports mindful eating practices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable Goals&lt;/strong&gt;: By recording meals, we can set specific goals, such as reducing sugar intake, increasing fiber consumption, or staying within a calorie limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term Insights&lt;/strong&gt;: Over time, food tracking can reveal patterns, helping to identify triggers for overeating, nutrient deficiencies, or correlations between diet and mood.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog post, I want to share how easy it is to build your own food tracker using a GenAI platform like OpenAI. The tool analyzes images of meals from my Google Photos library and provides a nutritional breakdown using AI. You can find the &lt;a href="https://github.com/FRosner/photos-calory-tracker" rel="noopener noreferrer"&gt;source code&lt;/a&gt; over on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture consists of three main components that work together to analyze your food photos: Your (Python) application, the Google Photos API, and the OpenAI API. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgkqu31c2vjbws531rg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgkqu31c2vjbws531rg5.png" alt="architecture" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Python application uses the Google Photos Library API to fetch your food photos. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth 2.0 authentication for secure access to your photos&lt;/li&gt;
&lt;li&gt;Search capabilities to find photos tagged as "food" from a specific date&lt;/li&gt;
&lt;li&gt;The ability to download the actual image content for analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It then uses the OpenAI's GPT-4 Vision API to analyze these images. This entails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name the dish&lt;/li&gt;
&lt;li&gt;Estimate nutritional content (calories, protein, carbs, fat, fiber, etc.)&lt;/li&gt;
&lt;li&gt;Assess the health score and processing degree and break down meal components&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Google Photos API Authentication
&lt;/h3&gt;

&lt;p&gt;Let's jump into the code. To access your Google photos, you need some initial setup. Note that the API I am using to access all my Google Photos has been deprecated and was removed on March 31, 2025. From now on, apps can only access Photos that they have created or that the user picks.&lt;/p&gt;

&lt;p&gt;To generate credentials your app can use, run the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google_auth_oauthlib.flow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InstalledAppFlow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;

&lt;span class="n"&gt;SCOPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.googleapis.com/auth/photoslibrary.readonly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;CREDENTIALS_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CREDENTIALS_FILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.secrets/client_secret.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;AUTH_PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_AUTH_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;TOKEN_PICKLE_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_TOKEN_PICKLE_FILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.secrets/token.pickle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InstalledAppFlow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_client_secrets_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CREDENTIALS_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SCOPES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_local_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AUTH_PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;offline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_granted_scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOKEN_PICKLE_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will open a browser window where you can authenticate your app and store the credentials in a pickle file. The pickle module is used to serialize and deserialize Python objects. Our app can then unpickle the credentials and use them to access the Google Photos API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;google_authenticate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOKEN_PICKLE_FILE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOKEN_PICKLE_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expired&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refresh_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid credentials. Please run google_photos_init.py to authenticate.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is only working for prototypes like these. In a production scenario you would use a more secure way to store and manage user credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching Food Photos
&lt;/h3&gt;

&lt;p&gt;Now that we have valid credentials to access the Google Photos API, let's use them to search for food photos in our library. First, we create a photos API object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;googleapiclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;discovery&lt;/span&gt;

&lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;google_authenticate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;photos_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discovery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;photoslibrary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;static_discovery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we write a function that uses the photos API to search for photos that match a specific search term and date filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;google_search_photos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;filters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;date_filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dateFilter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;date_filter&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contentFilter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includedContentCategories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;search_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pageSize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mediaItems&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;search_body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mediaItems&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The search term we are using is "food". We are using a &lt;a href="https://developers.google.com/photos/library/guides/apply-filters#content-categories" rel="noopener noreferrer"&gt;content category filter&lt;/a&gt; for this. The &lt;a href="https://developers.google.com/photos/library/guides/apply-filters#dates-date-ranges" rel="noopener noreferrer"&gt;date filter&lt;/a&gt; will make sure we only get the recent photos, e.g. from yesterday. The call could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;photos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;google_search_photos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;photos_api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_term&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's look at an example photo returned by the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ANNz-AhcvptH&amp;lt;redacted&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"productUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://photos.google.com/lr/photo/&amp;lt;redacted&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://lh3.googleusercontent.com/lr/&amp;lt;redacted&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mimeType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"image/jpeg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mediaMetadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"creationTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-03-18T10:36:20.790Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3072"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"photo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cameraMake"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Google"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cameraModel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pixel 6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"focalLength"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;6.81&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apertureFNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"isoEquivalent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;103&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"exposureTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.005034999s"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PXL_20250318_103620790.jpg"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's look into analyzing the image using the OpenAI API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Photo Download and Analysis
&lt;/h3&gt;

&lt;p&gt;First, let's define some helper functions to work with the image data. We need to be able to download an image and encode it for the OpenAI API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to download image. Status code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let's define the data model for the food analysis that we can use as &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;structured output&lt;/a&gt; for the OpenAI API. Structured outputs allow you to obtain machine-readable output from the OpenAI API. It also implicitly tells the AI what information you are expecting. We define &lt;code&gt;FoodAnalysis&lt;/code&gt; for a single dish and &lt;code&gt;FoodAnalysisResponse&lt;/code&gt; for a list of dishes so we can pass multiples images at once with a common prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FoodAnalysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;readable_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;protein_g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;fat_g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;carbohydrate_g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;fibre_g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_mass_g&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_kcal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_health_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;processing_degree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;components&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FoodAnalysisResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;foods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FoodAnalysis&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we can write the function that uses the OpenAI API to analyze the images. We are giving a system prompt to prime the LLM on the task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are a nutrition and health expert. You are helping a user understand the nutritional value of their food to help them eat healthier. For each image, estimate the protein, fat, fibre, and carbohydrate content in grams, the total mass in grams, the total calories, the total health score (1-10, 10 being super healthy, 1 being heart-attack-unhealthy), the processing degree ('low', 'medium', 'high'), and the components that are in the dish. If you are unsure, please provide an estimate. Only refuse the query if the image does not contain any food. Please also provide a readable name of the dish. If this looks like a well-known dish, you can use that name. Otherwise, you can describe it in a few words that are helpful to understand the dish.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual images will be uploaded as a list of user messages with image URL attachments containing the base64 encoded image. If the image is accessible via a URL directly, you don't need to download and encode it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;image_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;encoded_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;image_messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;encoded_image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a nutrition and health expert. [...]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;image_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FoodAnalysisResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;food_analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;food_analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;food_analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;food_analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refusal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;food_analysis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refusal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LengthFinishReasonError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Too many tokens: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's put it all together, downloading all images and analyzing them. Here's the code with an example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;photos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;download_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;baseUrl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;analyze_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;foods:
- carbohydrate_g: 15
  components:
  - arugula
  - roasted tomatoes
  - parmesan cheese
  - lemon
  - olive oil
  - seasonings
  fat_g: 10
  fibre_g: 5
  processing_degree: low
  protein_g: 8
  readable_name: Arugula Salad with Roasted Tomatoes and Cheese
  total_health_score: 8
  total_kcal: 150
  total_mass_g: 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Assessing Accuracy
&lt;/h2&gt;

&lt;p&gt;To assess the accuracy of the solution I developed a small &lt;a href="https://github.com/FRosner/photos-calory-tracker/blob/main/validate.py" rel="noopener noreferrer"&gt;validation script&lt;/a&gt; that analyzes given images and compares the results to expected values. Take a banana, for example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi9dk7a7ouhgp5vx9n70.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi9dk7a7ouhgp5vx9n70.jpg" alt="Banana" width="800" height="602"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"foods"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"readable_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Banana"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"protein_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fat_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fibre_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"carbohydrate_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"total_mass_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"total_kcal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"total_health_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"processing_degree"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"components"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"banana"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The validation script outputs the differences between the actual and expected values. The main challenge turns out to be estimating the mass of the dish.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"carbohydrate_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_mass_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;47&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fibre_g"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_kcal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;105&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_health_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difference"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are some more examples (differences shown as predicted / actual / % difference):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Photo&lt;/th&gt;
&lt;th&gt;Mass (g)&lt;/th&gt;
&lt;th&gt;KCal&lt;/th&gt;
&lt;th&gt;Carbohydrates (g)&lt;/th&gt;
&lt;th&gt;Fat (g)&lt;/th&gt;
&lt;th&gt;Protein (g)&lt;/th&gt;
&lt;th&gt;Health Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Banana&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi9dk7a7ouhgp5vx9n70.jpg" alt="Banana" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;118 / 71 / +66%&lt;/td&gt;
&lt;td&gt;105 / 72 / +46%&lt;/td&gt;
&lt;td&gt;27 / 19 / +42%&lt;/td&gt;
&lt;td&gt;0 / 0 / 0%&lt;/td&gt;
&lt;td&gt;1 / 1 / 0%&lt;/td&gt;
&lt;td&gt;8 / 8 / 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pear&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3trdm9nvy7yic3lus6t2.jpg" alt="Pear" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;178 / 192 / -7%&lt;/td&gt;
&lt;td&gt;102 / 109 / -6%&lt;/td&gt;
&lt;td&gt;28 / 28 / 0%&lt;/td&gt;
&lt;td&gt;0 / 0 / 0%&lt;/td&gt;
&lt;td&gt;0 / 0 / 0%&lt;/td&gt;
&lt;td&gt;9 / 10 / -10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dates&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafwly3sz2t6l3q8xawud.jpg" alt="Dates" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;100 / 100 / 0%&lt;/td&gt;
&lt;td&gt;277 / 296 / -6%&lt;/td&gt;
&lt;td&gt;75 / 65 / +15%&lt;/td&gt;
&lt;td&gt;0 / 0.5 / -100%&lt;/td&gt;
&lt;td&gt;1 / 2 / -50%&lt;/td&gt;
&lt;td&gt;8 / 8 / 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yoghurt&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9ftc2221fy0fcrvov1m.jpg" alt="Yoghurt" width="800" height="644"&gt;&lt;/td&gt;
&lt;td&gt;100 / 152 / -34%&lt;/td&gt;
&lt;td&gt;61 / 106 / -42%&lt;/td&gt;
&lt;td&gt;6 / 5 / +20%&lt;/td&gt;
&lt;td&gt;4 / 6 / -33%&lt;/td&gt;
&lt;td&gt;3 / 7 / -57%&lt;/td&gt;
&lt;td&gt;8 / 8 / 0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pudding&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8f73d44m6or8ayitpbyl.jpg" alt="Unpackaged pudding" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;100 / 150 / -33%&lt;/td&gt;
&lt;td&gt;180 / 165 / +9%&lt;/td&gt;
&lt;td&gt;30 / 25 / +20%&lt;/td&gt;
&lt;td&gt;7 / 5 / +40%&lt;/td&gt;
&lt;td&gt;3 / 4 / -25%&lt;/td&gt;
&lt;td&gt;5 / 3 / +66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Salad&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5d8op16i32jp1jf3qb8.jpg" alt="Salad" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;200 / 300 / -33%&lt;/td&gt;
&lt;td&gt;90 / 232 / -61%&lt;/td&gt;
&lt;td&gt;10 / 6 / +67%&lt;/td&gt;
&lt;td&gt;7 / 17 / -59%&lt;/td&gt;
&lt;td&gt;5 / 14 / -64%&lt;/td&gt;
&lt;td&gt;9 / 10 / -10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dumplings&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw7zwxkydoqv73plivwj.jpg" alt="Dumplings" width="800" height="602"&gt;&lt;/td&gt;
&lt;td&gt;200 / 350 / -43%&lt;/td&gt;
&lt;td&gt;320 / 595 / -46%&lt;/td&gt;
&lt;td&gt;40 / 70 / -43%&lt;/td&gt;
&lt;td&gt;8 / 23 / -65%&lt;/td&gt;
&lt;td&gt;14 / 17 / -18%&lt;/td&gt;
&lt;td&gt;7 / 8 / -13%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the estimates are not always accurate, with deviations of up to 66%. A big challenge appears to be estimating the mass of the dish, as well as seeing hidden ingredients in layered dishes. &lt;/p&gt;

&lt;p&gt;The health score is pretty accurate, and on average, the deviation of the calorie intake appears to be acceptable. If the goal is to support healthy eating habits, I believe the agent is more than useful.&lt;/p&gt;

&lt;p&gt;Interestingly, when analyzing photos of packaged food, the tool is able to accurately extract the nutritional information from the packaging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih0tv79yj572gqchdovm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fih0tv79yj572gqchdovm.jpg" alt="Packaged pudding" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;p&gt;Analyzing 1 image requires ~20k tokens. As of April 2025, when using &lt;code&gt;gpt-4o-mini&lt;/code&gt; ($0.15 / 1 million tokens), this costs $0.003. Assuming you analyze 10 photos a day, this would cost less than $1 per month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post, I have shown how easy it is to build your own food tracker using a GenAI platform like OpenAI. The tool analyzes images of meals from your Google Photos library and provides a nutritional breakdown using AI.&lt;/p&gt;

&lt;p&gt;While the results are not perfect, I feel like the added value is pretty high if you are not willing to count all your calories and keep track of everything you eat manually.&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>python</category>
      <category>showdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>Frank - How are you so productive?</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Thu, 23 Jan 2025 12:22:09 +0000</pubDate>
      <link>https://dev.to/frosnerd/frank-how-are-you-so-productive-2l56</link>
      <guid>https://dev.to/frosnerd/frank-how-are-you-so-productive-2l56</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Friends and colleagues often ask me: "Frank - How are you so productive?". While I don't have a silver bullet, I developed a mindset and adopted a set of tools and techniques that help me to be productive as a software engineer. In this post, I will share some of these strategies with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Productivity?
&lt;/h2&gt;

&lt;p&gt;In economics, productivity measures the ratio of outputs and inputs. In software engineering, I consider output to be the value created by my work and inputs to be the money spent, which includes my work time, as well as any fees for the tools I am using to produce the value. In my opinion, it is important to consider value of your work as output, not pull requests merged or tickets closed.&lt;/p&gt;

&lt;p&gt;When we talk about productivity, we often view it as a combination of effectiveness and efficiency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increasing effectiveness is about increasing output while keeping the input constant. In the context of output being value produced, it is often paraphrased as "doing the right things".&lt;/li&gt;
&lt;li&gt;Increasing efficiency is about decreasing input while keeping the output constant. It is often paraphrased as "doing things right".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distinction is useful because it can help identify potential for improvement. Consider being very fast in coding up something that is not needed. This is high efficiency but low effectiveness. On the other hand, consider writing some very valuable feature in a programming language you have no experience in, which is slowing you down significantly. While being highly effective, the efficiency will be low.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prioritization
&lt;/h2&gt;

&lt;p&gt;Being effective means "doing the right thing". But how do you identify what the right thing is? You will need to identify the needs. This can be done by talking to customers and stakeholders, reading customer feedback, or analyzing business metrics. Once you have identified the relevant goals / epics / features / tasks, you can prioritize them.&lt;/p&gt;

&lt;p&gt;When prioritizing with productivity in mind, we cannot just consider the impact (output). We also need to consider the required effort (input). A useful tool for prioritization is the impact-effort-matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo30m8i3w2fs1b8epn121.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo30m8i3w2fs1b8epn121.png" alt="import effort matrix" width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You place your tasks on the plane between the effort and impact axes. To maximize productivity, focus on quick-wins first. Then plan to tackle the major projects, adding fillers here and there. Avoid waste.&lt;/p&gt;

&lt;h2&gt;
  
  
  Executing and Building Momentum
&lt;/h2&gt;

&lt;p&gt;By prioritizing high-impact, low-effort tasks, you have a good foundation for productivity within your system (company, department, team). When looking at time and effort spent by individual contributors, e.g. yourself, however, there is a lot of potential for improvement, too. Having the "perfect plan" is not enough if you are not able to execute efficiently.&lt;/p&gt;

&lt;p&gt;Everyone has the same amount of time available to them. For the sake of this argument, let's assume you are spending 8 hours per day at work. However, you will not be able to work the entire 8 hours on high-impact tasks. There are meetings, operational tasks, overhead, and distractions. And even when you are working on your task, your efficiency can depend on your mood, your energy level, and your ability to focus.&lt;/p&gt;

&lt;p&gt;Personally, my biggest factor for long term productivity is building and maintaining momentum. I like to use a snowball analogy to reason about my momentum at work. In physics, momentum is defined as the mass of an object multiplied by its velocity.&lt;/p&gt;

&lt;p&gt;When you start a new job, or move to a new role, you start off as a small snowball on the top of a hill. The mass of the snowball corresponds to the knowledge and skills you accumulate. The velocity corresponds to the rate at which you are able to complete tasks like coding, reviewing PRs, and so on.&lt;/p&gt;

&lt;p&gt;When you start pushing for the first time, it feels hard to gain momentum as the ball is very small, and gets stuck easily on small branches or stones. As you keep going, the ball gains size and speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29ria77me3l61mp69a6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29ria77me3l61mp69a6f.png" alt="snowball going downhill" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to build and maintain your momentum over time, you have several responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increase&lt;/strong&gt; your snowball's &lt;strong&gt;mass&lt;/strong&gt;. Learn new programming languages, frameworks, tools, and technologies. Get familiar with new code bases, and understand the business domain you are working in. Improve your tooling and workflows. Build lasting relationships with your co-workers, stakeholders and customers. All of this will allow you to overcome obstacles more easily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan&lt;/strong&gt; ahead to avoid hitting major obstacles like trees that risk stopping your snowball, or even making parts of it fall off. This means identifying potential blockers and risks early, and either avoiding them or mitigating them before you reach them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep pushing&lt;/strong&gt; your snowball. This means making progress on your tasks, delivering value. You need to find the right amount of pushing, as pushing too hard will make it hard to stop and navigate around obstacles or changing priorities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How do you balance these activities throughout your work day / week? If you focus only on pushing, without planning, you'll easily run into obstacles. If you are not working on gaining mass, you'll not be able to overcome growing obstacles and take on bigger challenges. If you are only working on gaining mass but not pushing, you are not delivering value.&lt;/p&gt;

&lt;p&gt;Over the course of my career, applying the core principles of Agile Software Development has worked well. Additionally, I discovered and honed some techniques and tools that help me build and maintain momentum every day. The following sections will do a quick recap of the agile principles and then dive into the tools and techniques I use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agile Software Development Principles
&lt;/h2&gt;

&lt;p&gt;I am a big fan of the principles behind the &lt;a href="https://agilemanifesto.org/principles.html" rel="noopener noreferrer"&gt;Manifesto for Agile Software Development&lt;/a&gt;. And I'm not talking about "Agile Methodologies" like Scrum, but the core principles. To summarize the most important ones for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivering value to the customer continuously&lt;/li&gt;
&lt;li&gt;Welcoming changing requirements&lt;/li&gt;
&lt;li&gt;Simplicity - maximize the amount of work not done&lt;/li&gt;
&lt;li&gt;Continuous attention to technical excellence and good design&lt;/li&gt;
&lt;li&gt;Regular reflection and adaptation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How do I apply these in my daily work? I only consider something done, if the work is usable in production. I break down tasks into the smallest possible pieces, aiming to finish pull requests within a day to get early feedback and iterate quickly. If priorities change, this allows me to switch to another task without leaving half-finished work.&lt;/p&gt;

&lt;p&gt;This ensures that my work is focused on adding value, and I can adapt to changing requirements quickly. To ensure simplicity, I apply the YAGNI (you ain't gonna need it) principle. I only implement what is needed now, and avoid over-engineering. I design explicitly to avoid losing momentum when having design discussions after the code is written. When revisiting code, I always try to improve it, paying back technical dept continuously.&lt;/p&gt;

&lt;p&gt;I also regularly reflect on my work, and try to improve my workflows and tools. We will talk more about that in the Kaizen section below. If you want to know more about Agile Software Development, checkout my post &lt;a href="https://dev.to/frosnerd/explain-agile-like-im-a-sports-student-3m8l"&gt;Explain Agile Like I'm a Sports Student&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, let's dive into some concrete tools and techniques that you can try out yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kaizen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xm4pefep7fg828862ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xm4pefep7fg828862ze.png" alt="Kaizen" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kaizen is a Japanese term that means "continuous improvement". It is a philosophy that focuses on making small, incremental changes to processes, workflows, and tools. It was popularized as part of the &lt;a href="https://en.wikipedia.org/wiki/The_Toyota_Way" rel="noopener noreferrer"&gt;Toyota Way&lt;/a&gt;. The core ideas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improvement is a never-ending process. Make small, consistent changes to achieve significant, long-term results.&lt;/li&gt;
&lt;li&gt;Empower everyone to identify inefficiencies and suggest solutions.&lt;/li&gt;
&lt;li&gt;Focus on the process, not the people. Improve processes systematically.&lt;/li&gt;
&lt;li&gt;Eliminate waste. Activities that do not add value to the customer or organization need to be removed.&lt;/li&gt;
&lt;li&gt;Measure and reflect. Use metrics to track progress, experiment with changes, and reflect on the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I apply Kaizen in the teams I work with, but also on a personal level. At the end of each day, I spend 5-10 minutes to reflect on the activities I performed that day, and the impact they had on the customer or my organization. I block 30-60 minutes every week to improve my workflows / tools.&lt;/p&gt;

&lt;p&gt;Examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate creation of daily or weekly messages / reports I compiled manually before.&lt;/li&gt;
&lt;li&gt;Improve my coding efficiency by learning or reviewing keyboard shortcuts, IDE features or plugins.&lt;/li&gt;
&lt;li&gt;Cancel meetings that have a low return of time invested (ROTI). Consider reading the meeting summary / minutes instead.&lt;/li&gt;
&lt;li&gt;Add new folder and rule in my inbox to funnel some low-priority messages that I can look at on a weekly basis.&lt;/li&gt;
&lt;li&gt;Archive some old Slack channels that are not relevant anymore.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Zero-Inbox
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6viorqjyp6acrszouwwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6viorqjyp6acrszouwwf.png" alt="overflowing mailbox" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The zero-inbox technique aims at managing your inboxes (email, slack) effectively by keeping the number of unread messages at (or close to) 0. The goal is to reduce cognitive burden of a cluttered inbox, and to ensure that you are not missing important messages. The core ideas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process every message, don't just "check" it. Apply the 4 D's: Delete, Delegate, Defer, Do.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delete&lt;/strong&gt; the message (mostly applies to emails) if it is irrelevant, spam, or unnecessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delegate&lt;/strong&gt; the task if it belongs to someone else. Forward it immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defer&lt;/strong&gt; the message if it requires your action but cannot be handled immediately. Schedule it for a later time. Most email clients have this functionality, and for Slack channels, I use the "remind me about this" feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do&lt;/strong&gt; the task immediately if it takes less than two minutes to complete.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Use folders, labels, channels to categorize messages. Use automated filters / rules to organize the incoming messages automatically before you process them. I personally have different folders in my email account, based on the projects and the type of message, e.g. pull requests, ticket updates.&lt;/li&gt;

&lt;li&gt;Don't use email as a To-Do list. Move larger, actionable items to a dedicated task management tool.&lt;/li&gt;

&lt;li&gt;Block message time. Don't check emails or Slack continuously throughout the day, but use dedicated time slots, ideally when you're least productive, e.g. after lunch or in the afternoon.&lt;/li&gt;

&lt;li&gt;Unsubscribe and filter. If you receive newsletters / updates that are not relevant, unsubscribe. If you cannot unsubscribe, add an automated filter to delete the messages before they reach your inbox.&lt;/li&gt;

&lt;li&gt;Archive aggressively. I personally don't archive emails, but I use a filter to show only unread messages in my inbox. I archive orphaned temporary Slack channels aggressively.&lt;/li&gt;

&lt;li&gt;Keep it simple and consistent. Whatever system you use, it needs to be easy enough for you to apply on a daily basis.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  To-Do Lists
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0utich8a7bpt5qngbtx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0utich8a7bpt5qngbtx.jpg" alt="todo list" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I tried using digital To-Do lists, but they didn't work out for me. They often got outdated, or some tasks got stuck on there forever. I switched to using a notebook, which lies in front of me on my desk. I use a simple system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every day, I write down the date and the tasks I want to complete that day, in order of priority. I either do it in the evening of the previous day, or as the first thing in the morning.&lt;/li&gt;
&lt;li&gt;I check off tasks as I complete them. Whenever I engage in an activity, e.g. looking at a Slack message or an incoming PR review request, I review my list to remind myself what the most important task is. That helps me to get back on track, and focus on the most important bits first.&lt;/li&gt;
&lt;li&gt;It is okay to add new tasks to the list as the day goes on. It is okay to not finish all the tasks. However, in the spirit of Kaizen, I will review these occasions at the end of the day, coming up with a plan to avoid them in the future.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take a look at an example list:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2tn5l16jdn173i0kdf8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2tn5l16jdn173i0kdf8.jpg" alt="my to do list" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine that while working on the blog post, a colleague pings you that you need to send in a report by today. You add it to the top of the list, and start working on it immediately. At the end of the day, you did not manage to check your emails.&lt;/p&gt;

&lt;p&gt;When closing your day, you attempt to understand how it happened that the report showed up surprisingly. There are different possible reasons, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The report needed to be finished today, you knew about it, but forgot when you planned your day. In that case, you might want to adjust your process to include a reminder of due tasks one day before the due date.&lt;/li&gt;
&lt;li&gt;The report needed to be finished today, but you didn't know about it because your colleague forgot to tell you. In that case, communicate clearly how much time you need in advance. Consider using a shared task management tool, where your colleague can assign you to certain tasks, that will notify you about this.&lt;/li&gt;
&lt;li&gt;The report didn't have to be finished today. It wasn't going to be sent by end of next week anyway. In that case, make sure to challenge the priorities of urgent ad-hoc tasks in the future.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Time Boxing and Time Blocking
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtfdfhib6gyn53u3pnf2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtfdfhib6gyn53u3pnf2.jpg" alt="Calendar" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I use time boxing and time blocking daily. Time blocking helps me in planning my day and week, ensuring I make room for important, short- and long-term activities. Time boxing helps me to avoid getting stuck or lost in details.&lt;/p&gt;

&lt;p&gt;Here are some activities I block time for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviewing pull requests (daily)&lt;/li&gt;
&lt;li&gt;Writing code (daily)&lt;/li&gt;
&lt;li&gt;Reading and answering messages (daily)&lt;/li&gt;
&lt;li&gt;1on1 / team meetings (weekly / bi-weekly)&lt;/li&gt;
&lt;li&gt;Education, learning, personal development (weekly)&lt;/li&gt;
&lt;li&gt;Workout (daily)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I sometimes use my calendar to block the time, or I'm writing the times next to the items on my To-Do list. I apply time-boxing within each block, but also on a broader scale. For example, when I timebox my PR reviews to 60 minutes, I will stop after 60 minutes even if I did not review all PRs. The remaining ones will get a higher priority the next day.&lt;/p&gt;

&lt;p&gt;Time boxing also helps me to manage unknown unknowns better. When starting a bigger task, I often kick it off with a proof of concept (POC), time-boxed to a few hours. If I am not able to complete the POC in that time, I change the estimation of the effort needed for the task, placing it on another spot in the impact-effort-matrix, reprioritizing it accordingly. If it is still top priority, I'll extend the time box but if there are other quick-wins available, I might switch to them for now.&lt;/p&gt;

&lt;p&gt;Time blocking helps me keep a balance of the different activities needed to build and maintain momentum, while time boxing emphasizes progress over perfection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Focus and Pomodoro
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc04zgqlmioday8fgk3fs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc04zgqlmioday8fgk3fs.jpg" alt="camera lens focusing" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our brains are capable to solve complex problems, but struggle to deal with distractions and context switching. I personally have found that my productivity over 8 hours is higher if I dedicate 4 hours to deep work, with no distractions, and 4 hours of shallow work, where I can handle interruptions, compared to 8 hours of mixed work.&lt;/p&gt;

&lt;p&gt;To successfully enter deep work, I need to have the right environment. I use headphones with music on, and my desk needs to have a little bit of clutter on it, but not too much. If there's too much, I have to clean it first. I might also turn off messaging programs / notifications.&lt;/p&gt;

&lt;p&gt;While deep work is incredibly effective, it is also exhausting. My ability to focus drops rapidly after ~45-60 minutes, but after 30 minutes the focus starts to take a toll on my body. My head starts to hurt, and my muscles start to tense up.&lt;/p&gt;

&lt;p&gt;To help maintain a balance between work and recovery, I often use the &lt;a href="https://www.pomodorotechnique.com/" rel="noopener noreferrer"&gt;Pomodoro technique™&lt;/a&gt;. The core idea is to work for 25 minutes, then take a 5-minute break. After 4 Pomodoros, take a longer break of 15-30 minutes. &lt;/p&gt;

&lt;p&gt;Pomodoro has another positive effect on my work. It forces me to split my work in smaller chunks, which can be completed within a single slot. E.g. if I'm writing code, my goal is to have it compile after each slot. Ideally, I'll also be able to commit the changes. When writing a blog post, I aim to complete a section within a slot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pareto Principle (aka 80/20)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2px7rq217tqacmn1onj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2px7rq217tqacmn1onj.jpg" alt="laptop with diagrams and charts" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea behind the Pareto Principle is that 80% of the consequences come from 20% of the causes. When applied to work, it means that 80% of the value comes from 20% of the work. We can make use of that principle to maximize productivity by focusing on the 20% of the work that brings the most value.&lt;/p&gt;

&lt;p&gt;What does that mean in practice for me?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use GenAI extensively. I use GenAI to generate code, which is often times not great, but if it works, it'll do as a first iteration. I use GenAI to generate automated tests as well. I would rather have ugly, tested code, than beautiful, refined, high performance code without any tests. Don't aim for 100% code coverage in the beginning, but focus on the key aspects.&lt;/li&gt;
&lt;li&gt;Don't over-engineer. First, make it work, later make it right (if it has proven its value).&lt;/li&gt;
&lt;li&gt;Refactor constantly. Whenever I touch code I look for opportunities to improve it. That mechanism ensures that throw-away code is not over-engineered, but relevant code converges to a high quality.&lt;/li&gt;
&lt;li&gt;Reduce toil progressively. When you do something once, it's worth writing it down in some note or ticket. When you do it on a regular basis, but not very often, create a runbook. When the runbook becomes long and you run it more often, script it up. When you run the script often, automate it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you need to keep in mind that the remaining 20% of results, taking up 4 times as much of your time as the first 80% to finish, is dept you are paying interest on. So you should choose wisely how much dept you can take on, how much interest you are willing to pay, and invest time in paying back dept regularly.&lt;/p&gt;

&lt;p&gt;So why is applying this technique saving time? If you end up doing all the work eventually, what's the point? The point is the work you are doing is almost never going to solve the problem 100%. It may be because you did not understand the problem entirely. Or maybe the problem changes over time. Maybe or some other, better solution comes into the mix later down the line. By focusing on the 20% that brings the most value, you are able to deliver value faster, and you are able to pivot more easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemba Walking
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz7fkqn4h3g7prog6ds5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz7fkqn4h3g7prog6ds5.jpg" alt="car factory assembly line" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemba walking is yet another technique coming from the Toyota Way. Gemba is a Japanese term that means "the real place". In the context of software development, it means going to the place where the work is done, e.g. the team's workspace, the code repository, the CI/CD pipeline, the incident channels, the production environment. This is relevant for me as a tech lead to avoid the "ivory tower" syndrome, and to stay connected to the work my colleagues are doing.&lt;/p&gt;

&lt;p&gt;Gemba walking helps me to identify inefficiencies, bottlenecks, and blockers early. It also helps me to understand the context of the work better, and to build relationships with my colleagues. I try to do Gemba walks every day, blocking ~15 minutes for it. The key ideas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go where value is created. In SRE, I call this "the trenches".&lt;/li&gt;
&lt;li&gt;Observe, don't judge. Ask questions and listen to the problems your colleagues are facing. Read between the lines, be curious.&lt;/li&gt;
&lt;li&gt;Engage with colleagues.&lt;/li&gt;
&lt;li&gt;Focus on processes, not people.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Escalation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm7kezvs161vcyfy4wo7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm7kezvs161vcyfy4wo7.jpg" alt="frustrated person" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Escalation involves raising an issue to higher authority or expertise levels when it cannot be resolved at the current level. Escalating quickly is important to ensure efficient resolution of issues by ensuring they are handled by the people with the right expertise, authority, or resources.&lt;/p&gt;

&lt;p&gt;Escalation is also important to avoid getting stuck / blocked on a task. While it might seem like "complaining" to some, it is in the interest of the company and your customers to resolve issues quickly.&lt;/p&gt;

&lt;p&gt;I often combine escalation with time-boxing. If I do not manage to complete a task in the estimated time, I can escalate it to the respective parties, e.g. my boss, or an expert in the field.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary and Conclusion
&lt;/h1&gt;

&lt;p&gt;In this post we explored various strategies to increase productivity as a software engineer, while balancing effectiveness and efficiency. We highlighted the significance of prioritization using the impact-effort matrix to focus on high-impact, low-effort tasks. &lt;/p&gt;

&lt;p&gt;We saw how to build and maintain momentum, peeking into a productivity toolset, including Agile principles to ensure value delivery and adaptability, Kaizen for continuous improvement, zero-inbox techniques for managing messages, the use of to-do lists for daily task management, time boxing and time blocking for effective time management, the Pomodoro technique for maintaining focus, the Pareto Principle for maximizing value, Gemba walking to stay connected with the team's work, and the importance of quick escalation to resolve issues efficiently.&lt;/p&gt;

&lt;p&gt;What are your favorite productivity tools and techniques? How do you balance effectiveness and efficiency in your work? Please share your thoughts in the comments!&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;Cover photo by &lt;a href="https://unsplash.com/@leafybirdy?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;kris&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/selective-focus-photography-of-productivity-printed-book-n9u9ZEoH2yM?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;To-Do list photo by &lt;a href="https://unsplash.com/@thomasbormans?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Thomas Bormans&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-notepad-with-a-pen-on-top-of-it-pcpsVsyFp_s?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Calendar photo by &lt;a href="https://unsplash.com/@erothermel?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Eric Rothermel&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/white-printer-paperr-FoKO4DpXamQ?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kanban photo by &lt;a href="https://unsplash.com/@parabol?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Parabol | The Agile Meeting Tool&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-woman-writing-on-a-wall-with-sticky-notes-FLsPtPiE4zU?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Frustrated person photo by &lt;a href="https://unsplash.com/@gunaivi?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;ahmad gunnaivi&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/man-in-bluee-ssweater-OupUvbC_TEY?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Camera lense photo by &lt;a href="https://unsplash.com/@pawelskor?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Paul Skorupskas&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/person-holding-camera-lens-7KLa-xLbSXA?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Laptop photo by &lt;a href="https://unsplash.com/@goumbik?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Lukas Blazek&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/turned-on-black-and-grey-laptop-computer-mcSDtbWXUZU?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Car factory photo by &lt;a href="https://unsplash.com/@satterfieldgroup?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Michael Satterfield&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/a-row-of-cars-parked-in-a-parking-lot-vnTzjMW9OoM?utm_content=creditCopyText&amp;amp;utm_medium=referral&amp;amp;utm_source=unsplash" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>productivity</category>
      <category>learning</category>
      <category>career</category>
    </item>
    <item>
      <title>Visualizing the Apache Cassandra Token Ring with Plotly</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Fri, 27 Sep 2024 18:13:09 +0000</pubDate>
      <link>https://dev.to/frosnerd/visualizing-the-apache-cassandra-token-ring-with-plotly-4pmh</link>
      <guid>https://dev.to/frosnerd/visualizing-the-apache-cassandra-token-ring-with-plotly-4pmh</guid>
      <description>&lt;h2&gt;
  
  
  Cassandra's Partitioning Mechanism
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cassandra.apache.org/_/index.html" rel="noopener noreferrer"&gt;Apache Cassandra&lt;/a&gt; is a powerful, distributed NoSQL database designed to handle large amounts of data across many servers while providing linear horizontal scalability, high availability with flexible consistency guarantees, as well as fault tolerance.&lt;/p&gt;

&lt;p&gt;One of the core mechanisms behind Cassandra's scalability is the data partitioning based on &lt;strong&gt;consistent hashing&lt;/strong&gt;. In a typical hashing scenario, a hash function takes an input (e.g., the primary key of a row) and maps it to a fixed output range. In a distributed database, each node could be responsible for a subset of that range. However, if nodes are added or removed, the entire data distribution could change, causing large-scale data movement between nodes.&lt;/p&gt;

&lt;p&gt;Consistent hashing solves this problem by mapping both data and nodes onto the same hash ring (a conceptual circle). Here’s how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Hash Ring.&lt;/strong&gt; Imagine a circle where hash values are placed in a clockwise manner, ranging from 0 to the maximum value of the hash function. Database nodes are placed on this circle based on their own hash value. Data is also hashed and placed on the circle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assigning Data to Nodes.&lt;/strong&gt; Data is assigned to the first node clockwise from its position on the ring. This node becomes responsible for that piece of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Addition/Removal.&lt;/strong&gt; When a node is added, only the data between the new node and its predecessor in the ring is reassigned. When a node is removed, its data is reassigned to the next node clockwise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We call each position on the ring a token. When a row needs to be stored, its primary key is hashed to calculate its token. The token is then used to determine which node stores the data by walking the ring until the next token owned by a node is found. This process ensures that data is evenly distributed across all nodes, avoiding hot spots and ensuring the cluster remains balanced as nodes are added or removed. This algorithm effectively partitions the token ring into ranges, with each range assigned to a node.&lt;/p&gt;

&lt;p&gt;To make this algorithm more resilient and flexible, Cassandra uses a concept called &lt;strong&gt;virtual nodes&lt;/strong&gt; (vnodes). Instead of assigning a single large token range to each node, Cassandra divides the token space into many smaller ranges and assigns multiple vnodes to each physical node. This allows the cluster to be more evenly balanced, especially when nodes are added or removed, since the system can redistribute small token ranges across the remaining nodes, avoiding load imbalances. &lt;/p&gt;

&lt;p&gt;It also allows to combine heterogeneous hardware in a single cluster, as you can adjust the number of vnodes based on the available resources on your physical node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Visualize the Token Ring?
&lt;/h2&gt;

&lt;p&gt;Cassandra's distributed nature, combined with its use of consistent hashing and vnodes, makes it an efficient and scalable database. However, one of the challenges that arises when operating a Cassandra cluster is understanding how the data is distributed across nodes. Although Cassandra ensures that tokens are distributed evenly across the cluster, certain situations - such as manual node additions, removals, misconfiguration, or hardware failures - can lead to unbalanced token distribution.&lt;/p&gt;

&lt;p&gt;For Cassandra operators and users, having insight into the token distribution can be a useful tool to debug issues with the database. Token imbalances can lead to unequal data distribution, resulting in hotspots where certain nodes handle significantly more traffic or store more data than others. This can cause performance degradation, uneven resource usage, and even outages.&lt;/p&gt;

&lt;p&gt;While Cassandra offers command-line tools to check token ranges and node responsibilities (such as &lt;code&gt;nodetool ring&lt;/code&gt;), these tools output the data in a raw, tabular format that can be difficult to interpret.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fetching Token Ranges
&lt;/h2&gt;

&lt;p&gt;Cassandra uses &lt;a href="https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/dht/Murmur3Partitioner.java" rel="noopener noreferrer"&gt;Murmur3&lt;/a&gt;, a hashing function that generates 64-bit tokens in the range of 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;[−263,263−1][-2^{63},2^{63} - 1] &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;[&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;63&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;63&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mclose"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Each vnode in the cluster assigned a token it is responsible for. The token marks the end of a range, and the previous token defines the start. To visualize the token ring, we first need to calculate the token ranges, by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gathering all tokens from all nodes&lt;/li&gt;
&lt;li&gt;Sorting the tokens&lt;/li&gt;
&lt;li&gt;Pairing each token with the previous one to compute the range for each node&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/metadata/token/index.html" rel="noopener noreferrer"&gt;Cassandra drivers&lt;/a&gt; have access to the token metadata, which includes both the raw tokens and the calculated ranges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Metadata&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMetadata&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;TokenMap&lt;/span&gt; &lt;span class="n"&gt;tokenMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTokenMap&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TokenRange&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenMap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTokenRanges&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Returns [Murmur3TokenRange(Murmur3Token(12), Murmur3Token(2)),&lt;/span&gt;
&lt;span class="c1"&gt;//          Murmur3TokenRange(Murmur3Token(2), Murmur3Token(4)),&lt;/span&gt;
&lt;span class="c1"&gt;//          Murmur3TokenRange(Murmur3Token(4), Murmur3Token(6)),&lt;/span&gt;
&lt;span class="c1"&gt;//          Murmur3TokenRange(Murmur3Token(6), Murmur3Token(8)),&lt;/span&gt;
&lt;span class="c1"&gt;//          Murmur3TokenRange(Murmur3Token(8), Murmur3Token(10)),&lt;/span&gt;
&lt;span class="c1"&gt;//          Murmur3TokenRange(Murmur3Token(10), Murmur3Token(12))]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Visualizing Token Ranges
&lt;/h2&gt;

&lt;p&gt;Now that we’ve calculated the token ranges, we can proceed to visualize them. To represent the Cassandra token ring, we use Plotly’s polar plot feature, which is perfect for this kind of circular visualization.&lt;/p&gt;

&lt;p&gt;Here’s the function that creates the visualization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;max_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_token_ranges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ranges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)]]):&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph_objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Figure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_ranges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;colors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plotly&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;express&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample_colorscale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rainbow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;color_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ranges&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;token_ranges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;v_node_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ranges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;range_width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;range_width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# Theta needs to be in the middle of the range, because
&lt;/span&gt;            &lt;span class="c1"&gt;# the polar bar gets drawn from theta width/2 in both directions
&lt;/span&gt;            &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;range_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt;
            &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;graph_objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Barpolar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;range_width&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;customdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
                    &lt;span class="n"&gt;hovertemplate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[%{customdata[0]}, %{customdata[1]}]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;legendgroup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;marker_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;color_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;showlegend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_node_idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;v_node_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;First we pass a token ranges dictionary, where each key is a node, and the value is a list of token ranges for that node (one per vnode).&lt;/li&gt;
&lt;li&gt;Then we assign each node a unique color from a predefined color scale for easy identification in the visualization.&lt;/li&gt;
&lt;li&gt;For each node, we loop through its token ranges and calculate:

&lt;ul&gt;
&lt;li&gt;The width of each token range (&lt;code&gt;range_width&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The angle (&lt;code&gt;theta&lt;/code&gt;) where the range should be displayed on the circular plot.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Each token range is represented as a polar bar. The hover text displays the start and end of each range, and we ensure the legend appears only once per node.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The following screenshot shows the output of the function for a simple three-node cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feloewrn3bc86wt8n6tc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feloewrn3bc86wt8n6tc2.png" alt="Example cluster with 3 nodes, 8 vnodes per node" width="800" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Taking Replication Into Account
&lt;/h2&gt;

&lt;p&gt;Most datasets stored in Cassandra are configured with a replication factor (RF) &amp;gt; 1, which means that each row is replicated to multiple nodes. Replication increases data availability and fault tolerance at the cost of storage and increased coordination between nodes.&lt;/p&gt;

&lt;p&gt;How exactly the data is replicated depends on the configured replication strategy. If your cluster spans multiple racks (e.g. availability zones), you should use the &lt;a href="https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archDataDistributeReplication.html" rel="noopener noreferrer"&gt;&lt;code&gt;NetworkTopologyStrategy&lt;/code&gt;&lt;/a&gt;. This strategy determines the additional replicas by walking the ring clockwise until it finds a node in a different rack, repeating the process until it reaches the desired number of replicas.&lt;/p&gt;

&lt;p&gt;What does that mean for our token ring visualization? When looking at token ownership, i.e., which nodes own a given token, we would have to add the rack dimension to the plot, and the viewer would have to mentally perform the clockwise "walk" to determine the replicas. &lt;/p&gt;

&lt;p&gt;An alternative approach is possible if the replication factor is equal to the number of racks. In that case, the algorithm can be simplified by calculating a separate token ring for each rack. We then only consider the nodes in each rack to calculate the ranges and can use the algorithm with RF = 1 to determine the node that owns the token in each rack. To visualize the ownership we can either plot the ring for each rack, or even overlay the different plots.&lt;/p&gt;

&lt;p&gt;The following animation shows the per-rack token ranges for a six node cluster hosted across three racks and a replication factor of three:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84hczfr8qiak7z7sxxh1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84hczfr8qiak7z7sxxh1.gif" alt="ring per rack with tabs" width="600" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overlaying SSTable Ranges
&lt;/h2&gt;

&lt;p&gt;Cassandra's storage engine is based on &lt;a href="https://www.cs.umb.edu/~poneil/lsmtree.pdf" rel="noopener noreferrer"&gt;Log-Structured Merge (LSM) Trees&lt;/a&gt;. Data is written to a memtable in memory, and when the memtable reaches a certain size, it is flushed to disk as an SSTable. SSTables are immutable and are periodically merged into larger SSTables to reduce the number of files on disk.&lt;/p&gt;

&lt;p&gt;The data in the SSTable files is sorted by the partition key. By computing the tokens of the partitions within the file, we can derive a token range based on the minimum and maximum token it contains. This allows us to overlay the token ranges of the nodes with an SSTable range.&lt;/p&gt;

&lt;p&gt;To add the SSTable overlay, we extend the code, adding two additional parameters &lt;code&gt;sstable_min_token&lt;/code&gt; and &lt;code&gt;sstable_max_token&lt;/code&gt; to the function. We then add a new barpolar trace for the SSTable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sstable_max_token&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;sstable_min_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sstable_range_width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sstable_max_token&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sstable_min_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sstable_range_width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sstable_max_token&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sstable_min_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sstable_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sstable_min_token&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sstable_range_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;graph_objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Barpolar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sstable_theta&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sstable_range_width&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;customdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;sstable_min_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sstable_max_token&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;hovertemplate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[%{customdata[0]}, %{customdata[1]}]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sstable_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;legendgroup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sstable_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;marker_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When calculating the width of the SSTable, we need to take into account that the max token might be smaller than the min token, which effectively means that the range wraps around the ring. We choose &lt;code&gt;r=[0.5]&lt;/code&gt; to place the SSTable overlay closer to the center of the plot, and color it grey to distinguish it from the token ranges.&lt;/p&gt;

&lt;p&gt;The following screenshot shows our updated graph with an SSTable file overlapping with one vnode of the first node in the first rack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h3os5cd27iz5mf2zdpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h3os5cd27iz5mf2zdpm.png" alt="SSTable token range overlay" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Visualizing the Cassandra token ring with Plotly was a fun exercise to understand how tokens are distributed across nodes in a cluster, how replication works, and how SSTables fit into the mix. Looking into the token distribution helps you identify potential token imbalances.&lt;/p&gt;

&lt;p&gt;If you do not want to worry about token ranges, vnodes, replication and SSTable files, and just want to benefit from the scalability and fault tolerance of Cassandra, consider using a managed service such as &lt;a href="https://www.datastax.com/products/datastax-astra?utm_medium=social_organic&amp;amp;utm_source=twitter&amp;amp;utm_campaign=frank_rosner" rel="noopener noreferrer"&gt;DataStax AstraDB&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>plotly</category>
      <category>cassandra</category>
      <category>database</category>
    </item>
    <item>
      <title>Books That Helped Me Become a Tech Lead</title>
      <dc:creator>Frank Rosner</dc:creator>
      <pubDate>Wed, 20 Dec 2023 19:41:58 +0000</pubDate>
      <link>https://dev.to/frosnerd/books-that-helped-me-become-a-tech-lead-3831</link>
      <guid>https://dev.to/frosnerd/books-that-helped-me-become-a-tech-lead-3831</guid>
      <description>&lt;h2&gt;
  
  
  Why Books?
&lt;/h2&gt;

&lt;p&gt;When developing my skills, I like to use a combination of conference talks, video tutorials, books, papers, blog posts, learning-by-doing, and teaching/blogging. Books are a great way to learn from the mistakes other people have made, to be inspired by their successes, and to experience their accomplishments second hand.&lt;/p&gt;

&lt;p&gt;In this blog post I want to share my favorite books that helped me the most in my journey from being a senior software engineer to becoming a tech lead. They helped me to broaden and deepen my understanding about software engineering, software architecture, and building and running a software business. They taught me to challenge and shape my behaviour and habits. Some of them deeply affected my personal and professional life.&lt;/p&gt;

&lt;p&gt;It goes without saying that reading those books will not automatically get you promoted or land you a new tech lead role. Of course, you still need to get your own experiences, make your own mistakes, and have that little bit of luck required. It is also important to constantly sharpen your technical knowledge and skills based on the specific domain you are working in. The books on this list do not focus on specific technologies, but rather on general principles and concepts that are applicable to any technology stack and business.&lt;/p&gt;

&lt;p&gt;For each book on the list, I will include a brief summary that can help you judge whether the book is relevant to you. To give it a personal touch, I will also include the most valuable lesson I learned from the book. This is not necessarily the main message of the book, nor the only important one, but rather the one that resonated with me the most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The List
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design It!
&lt;/h3&gt;

&lt;p&gt;&lt;a href="(https://www.oreilly.com/library/view/design-it/9781680502923/)"&gt;&lt;em&gt;Design It!: From Programmer to Software Architect&lt;/em&gt;&lt;/a&gt; by Michael Keeling is a comprehensive guide aimed at software developers who aspire to transition into the role of a software architect. The book provides a pragmatic and accessible approach to software architecture, emphasizing the importance of design in creating effective software systems. Keeling covers a wide range of topics, from foundational principles of software architecture to practical techniques for designing scalable and maintainable systems.&lt;/p&gt;

&lt;p&gt;Throughout the book, Keeling advocates for a hands-on, iterative approach to software design, encouraging readers to think critically about the architectural choices they make. He introduces various architectural styles and patterns, and discusses how to evaluate trade-offs and make decisions that align with the goals and constraints of a project. The book is filled with real-world examples, exercises, and practical tips, making it a valuable resource for those looking to develop their skills in software architecture and design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; There is no such thing as "no design". "No design" often means multiple, implicit designs, in the heads of your engineers, that are not aligned with each other. Design explicitly, collaboratively, iteratively, and document the design in a written form!&lt;/p&gt;

&lt;h3&gt;
  
  
  Release It!
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/library/view/release-it/9781680500264/" rel="noopener noreferrer"&gt;&lt;em&gt;Release It!: Design and Deploy Production-Ready Software&lt;/em&gt;&lt;/a&gt; by Michael Nygard is a critical guide for software developers and architects focused on the challenges of creating software that performs reliably in production environments. The book delves into the complexities of designing, deploying, and maintaining software that can withstand the rigors of real-world operations. Nygard emphasizes the importance of considering production realities from the beginning of the design process, advocating for a mindset shift from merely writing code to delivering a resilient, scalable, and maintainable system.&lt;/p&gt;

&lt;p&gt;Nygard provides insights into the various pitfalls that software systems encounter in production, such as network issues, unpredictable load patterns, and hardware failures. He introduces concepts like stability patterns and antipatterns, illustrating how to build systems that can gracefully handle failure and remain robust under stress. The book is enriched with real-life stories and case studies that demonstrate the catastrophic consequences of poor system design in production settings. "Release It!" is a valuable resource for software professionals seeking to ensure their systems are not just functional, but also resilient and reliable in the face of real-world challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; Every software engineer should build their software with production in mind. Software in production is what runs your business, impacts your customers, and determines success or failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Site Reliability Engineering
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/" rel="noopener noreferrer"&gt;&lt;em&gt;Site Reliability Engineering: How Google Runs Production Systems&lt;/em&gt;&lt;/a&gt; authored by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, is an insightful exploration into the practices and principles that Google employs to manage its large-scale, highly reliable systems. The book introduces the concept of Site Reliability Engineering (SRE), a discipline that blends aspects of software engineering with IT operations, focusing on creating scalable and reliable software systems.&lt;/p&gt;

&lt;p&gt;The authors, all experienced practitioners in SRE at Google, share their expertise on how to build, deploy, monitor, and maintain systems that are robust and resilient. They delve into the specific strategies and techniques Google uses, such as setting service level objectives (SLOs), managing change effectively, and balancing the need for release velocity with service reliability. The book covers a range of topics from organizational aspects of SRE teams to technical practices like incident management and post-mortem culture. The book offers a rare glimpse into the inner workings of one of the world's most proficient engineering organizations and is a valuable resource for anyone involved in the operation, maintenance, and scaling of large systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; There are no perfect systems. By explicitly defining and measuring SLOs and error budgets, you can make informed decisions about the trade-offs between reliability and velocity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Your Questions, Change Your Life
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.goodreads.com/en/book/show/6665149" rel="noopener noreferrer"&gt;&lt;em&gt;Change Your Questions, Change Your Life: 12 Powerful Tools for Leadership, Coaching, and Life&lt;/em&gt;&lt;/a&gt; by Marilee Adams explores the profound impact that the questions we ask can have on our lives and careers. Adams introduces the concept of "Question Thinking," a method of transforming thinking, action, and results through deliberate and mindful questioning. The book emphasizes how the types of questions we ask ourselves, ranging from limiting, judgmental "Judger" questions to more open, constructive "Learner" questions, can significantly influence our outlook and outcomes.&lt;/p&gt;

&lt;p&gt;Adams illustrates her ideas through a compelling narrative, following the story of an individual struggling with life's challenges and learning to apply the principles of Question Thinking. This approach offers practical tools and techniques for individuals to improve their communication, decision-making, and problem-solving skills. By fostering a Learner mindset and asking better, more empowering questions, readers are guided towards more positive and productive personal and professional relationships. The book is particularly valuable for leaders, coaches, and anyone looking to enhance their ability to connect with others and navigate complex situations more effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; I realized how often I am in the "Judger" mindset. Being more mindful about that, and consciously choosing to shift to a "Learner" mindset became almost like a super-power for me to solve any challenge I am facing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Thinking, Fast and Slow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.amazon.de/-/en/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321" rel="noopener noreferrer"&gt;&lt;em&gt;Thinking, Fast and Slow&lt;/em&gt;&lt;/a&gt; by Daniel Kahneman is a groundbreaking exploration of psychology and economics, delving into how we think and make decisions. Kahneman introduces two distinct modes of thinking that dominate our mental processes: "System 1" (fast, intuitive, and emotional) and "System 2" (slower, more deliberate, and more logical). Throughout the book, Kahneman explores the impact of these two systems on our judgment, decision-making, and the way we perceive the world around us.&lt;/p&gt;

&lt;p&gt;The book is a comprehensive journey through various cognitive biases and heuristics that influence our everyday thinking. Kahneman demonstrates how our intuitive System 1, which often serves us well, can also lead to profound errors and biases. He also explores the capabilities and limitations of System 2, emphasizing how it can be influenced and overruled by the quick judgments of System 1. The book is a synthesis of decades of research, providing deep insights into the complexities of human thought and behavior. It's an essential read for anyone interested in understanding the mental processes that underlie our choices and actions in both personal and professional contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; I learned that both modes are valuable, but also have their drawbacks. I learned to be more aware of the biases and heuristics that influence my thinking, and to consciously choose when to rely on System 1 and when to engage System 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Atomic Habits
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://jamesclear.com/atomic-habits" rel="noopener noreferrer"&gt;&lt;em&gt;Atomic Habits: An Easy &amp;amp; Proven Way to Build Good Habits &amp;amp; Break Bad Ones&lt;/em&gt;&lt;/a&gt; by James Clear is a transformative guide that delves into the science of habits and how small changes can lead to remarkable results. The author presents a comprehensive framework for understanding how habits form and offers practical strategies for cultivating good habits and breaking bad ones. The core philosophy of the book is that minor improvements, or "atomic habits", can accumulate into significant, life-altering outcomes over time.&lt;/p&gt;

&lt;p&gt;Clear emphasizes the importance of systems over goals, arguing that focusing on the processes and systems that lead to a goal is more effective than fixating on the goal itself. He introduces the Four Laws of Behavior Change – a set of simple, actionable principles to guide habit formation. These include making cues obvious, cravings attractive, responses easy, and rewards satisfying. Through a combination of scientific research, personal stories, and real-world examples, Clear illustrates how these principles can be applied to various aspects of life, from fitness and financial management to productivity and personal growth. "Atomic Habits" offers an accessible and compelling blueprint for building habits that stick and is valuable for anyone looking to make positive, lasting changes in their life.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; By making many small changes to my daily routine, that individually only affect my productivity by a small amount, all of those habits combined make a huge impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conscious Business
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.amazon.de/-/en/Fred-Kofman/dp/1622032020" rel="noopener noreferrer"&gt;&lt;em&gt;Conscious Business: How to Build Value Through Values&lt;/em&gt;&lt;/a&gt; by Fred Kofman is a thought-provoking book that explores the intersection of personal integrity and professional success. The author presents the idea that the key to creating a successful and sustainable business lies in conscious management practices, where personal values and ethical principles are at the forefront of decision-making processes. The book argues that success in business is not just about financial gain but also about achieving personal and professional fulfillment.&lt;/p&gt;

&lt;p&gt;Kofman discusses various aspects of conscious business, including accountability, responsibility, emotional intelligence, communication skills, and the ability to resolve conflicts constructively. He emphasizes the importance of leaders who can inspire trust, cultivate a culture of openness and honesty, and lead with empathy. Through real-world examples, practical advice, and exercises, Kofman guides readers on how to develop these skills and apply them in their professional lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; The concept of unconditional response-ability. I now constantly remind myself that I have the power and responsibility to choose my responses to any situation, regardless of the circumstances. "Response-ability" is a play on the words "response" and "ability," highlighting the ability to respond consciously and proactively.&lt;/p&gt;

&lt;h3&gt;
  
  
  First, Break All The Rules
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://store.gallup.com/p/en-us/10286/first-break-all-the-rules" rel="noopener noreferrer"&gt;&lt;em&gt;First, Break All the Rules: What the World's Greatest Managers Do Differently&lt;/em&gt;&lt;/a&gt; by Marcus Buckingham and Curt Coffman presents a radical approach to management based on research conducted by the Gallup Organization. This book challenges conventional wisdom about leadership and management, proposing that the most effective managers often defy standard practices.&lt;/p&gt;

&lt;p&gt;The core message of the book is that great managers don't follow a single mold or adhere strictly to traditional management principles. Instead, they break the rules by focusing on their employees' individual strengths rather than trying to correct their weaknesses. The authors argue that this approach leads to higher engagement, productivity, and overall job satisfaction.&lt;/p&gt;

&lt;p&gt;Buckingham and Coffman identify key insights and strategies that set apart the world's best managers. These include the importance of selecting talent over simply filling positions, defining the right outcomes rather than dictating the right steps, focusing on strengths rather than obsessing over weaknesses, and finding the right fit for employees rather than simply promoting them to the next rung on the ladder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most valuable lesson I learned from the book:&lt;/strong&gt; The importance of focusing on strengths rather than weaknesses. I learned to accept my weaknesses as such, and use tools and strategies to compensate for them, rather than trying to "fix" them. Instead, I invest my time and energy into developing my strengths, and I try to do the same for the people I lead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honorable Mentions
&lt;/h2&gt;

&lt;p&gt;There are many more books that I found valuable on my journey from senior software engineer to tech lead. They are more focussed on specific technologies, which is why I did not include them in the main list. Nevertheless, I want to mention them here, as they might be relevant to you depending on the field/industry you are working in.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.databass.dev/" rel="noopener noreferrer"&gt;&lt;em&gt;Database Internals&lt;/em&gt;&lt;/a&gt; by Alex Petrov. The &lt;em&gt;best&lt;/em&gt; book on databases I have ever read. It covers all the fundamentals of databases in a very accessible way. It is a must-read for anyone working with databases.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.de/-/en/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321" rel="noopener noreferrer"&gt;&lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt;&lt;/a&gt; by Martin Kleppmann. A comprehensive guide to building data-intensive applications. It covers a wide range of topics, from databases and data processing to distributed systems and stream processing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.packtpub.com/product/oracle-jrockit-the-definitive-guide/9781847198068" rel="noopener noreferrer"&gt;&lt;em&gt;Oracle JRockit: The Definitive Guide&lt;/em&gt;&lt;/a&gt; by Marcus Hirt and Marcus Lagergren. A great resource for anyone interested in JVM internals.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://man7.org/tlpi/" rel="noopener noreferrer"&gt;&lt;em&gt;The Linux Programming Interface&lt;/em&gt;&lt;/a&gt; by Michael Kerrisk. A very detailed book about Linux, that covers a wide range of topics, from basic system calls to advanced topics like process groups, signals, and sockets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;While books are a great tool to learn, they are not a substitute for first-hand experience. You still need to make your own mistakes and learn from them. It also helps to talk about the books you read with others, to get their perspective and to challenge your own views. Maybe you can join a book club, or read the book together with a colleague or friend.&lt;/p&gt;

&lt;p&gt;I hope this list will help you on your professional journey. If there is a book that inspired you and that you think should be on this list, please let me know in the comments below.&lt;/p&gt;




&lt;p&gt;If you liked this post, you can &lt;a href="https://ko-fi.com/frosnerd" rel="noopener noreferrer"&gt;support me on ko-fi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>books</category>
      <category>swe</category>
      <category>sre</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
