<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dylan Dumont</title>
    <description>The latest articles on DEV Community by Dylan Dumont (@dylan_dumont_266378d98367).</description>
    <link>https://dev.to/dylan_dumont_266378d98367</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853448%2F34f28cc2-c576-4b86-8a09-73e1aeb86ed4.png</url>
      <title>DEV Community: Dylan Dumont</title>
      <link>https://dev.to/dylan_dumont_266378d98367</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dylan_dumont_266378d98367"/>
    <language>en</language>
    <item>
      <title>WebSockets vs Server-Sent Events: Choosing the Right Real-Time Protocol</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Sun, 03 May 2026 12:37:49 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/websockets-vs-server-sent-events-choosing-the-right-real-time-protocol-3chp</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/websockets-vs-server-sent-events-choosing-the-right-real-time-protocol-3chp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Selecting between bidirectional connections and unidirectional streams isn't just a technology preference; it's a fundamental trade-off in system topology and resource cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are designing a notification endpoint for a microservice mesh. This service ingests telemetry data from distributed sensors and pushes critical alerts to frontend clients. The architecture must support millions of concurrent connections without exhausting server memory. We evaluate the trade-off between bidirectional WebSockets and unidirectional Server-Sent Events to determine which fits this use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Define Directionality
&lt;/h2&gt;

&lt;p&gt;The first decision involves data flow. WebSockets provide full-duplex, two-way communication, while Server-Sent Events (SSE) are strictly server-to-client. If your frontend only needs to receive updates, SSE is the lighter option; if the client must also send commands over the same connection, WebSockets are the right tool.&lt;/p&gt;

&lt;p&gt;Consider this Go implementation defining the endpoint behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// SSE Handler&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;HandleSSE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"text/event-stream"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cache-Control"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no-cache"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Connection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"keep-alive"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"tick"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sensor-1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rationale: SSE is plain HTTP, so the stream starts without the Upgrade handshake WebSockets require, reducing setup latency and letting broadcast endpoints pass through standard HTTP infrastructure such as proxies and load balancers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Manage Connection Lifecycle
&lt;/h2&gt;

&lt;p&gt;Real-time protocols require keeping connections alive. An SSE client reconnects automatically and stops only when the server answers with HTTP 204; WebSockets define ping/pong control frames, but you must drive them yourself. In both cases you need a heartbeat mechanism to distinguish idle clients from disconnected ones.&lt;/p&gt;

&lt;p&gt;Use a Goroutine to manage the active state for long-lived sessions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TrackSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PONG&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rationale: the periodic write does double duty. It keeps intermediaries from closing the connection as idle, and a failed write tells the server the peer is gone so it can release the session's resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Implement Backpressure Handling
&lt;/h2&gt;

&lt;p&gt;If the server produces data faster than the network can deliver it, unsent bytes pile up in server-side buffers until something gives. SSE has no application-level flow control of its own, so you must monitor the outbound queue depth before pushing more events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queueSize&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;maxBufferLimit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;stopSending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusTooManyRequests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rationale: Dropping the connection early signals the client to retry later, preventing memory exhaustion on the server process.&lt;/p&gt;
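&lt;p&gt;One way to bound the outbound queue on the server side is a buffered channel that evicts the oldest event when full, so slow consumers see fresh data instead of an ever-growing backlog. A sketch (the capacity and drop-oldest policy are assumptions, not from the article):&lt;/p&gt;

```go
package main

import "fmt"

// offer enqueues an event on a bounded channel, evicting the oldest
// event when the buffer is full so slow consumers see fresh data.
func offer(queue chan string, event string) {
	select {
	case queue <- event: // room available
	default:
		<-queue        // evict oldest
		queue <- event // enqueue newest
	}
}

func main() {
	queue := make(chan string, 2)
	offer(queue, "a")
	offer(queue, "b")
	offer(queue, "c") // full: "a" is evicted
	fmt.Println(<-queue, <-queue)
}
```

&lt;p&gt;Dropping oldest suits telemetry, where only the latest reading matters; for alert streams you would drop the connection instead, as above, so the client can resync.&lt;/p&gt;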

&lt;h2&gt;
  
  
  Step 4 — Plan for Client Reconnection
&lt;/h2&gt;

&lt;p&gt;Clients lose network stability. SSE supports automatic resumption: the browser's EventSource reconnects after a drop on its own and sends the ID of the last event it received in the &lt;code&gt;Last-Event-ID&lt;/code&gt; header. WebSockets have no built-in equivalent, so reconnection and state restoration must be implemented by hand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OnConnect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lastEventID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Last-Event-ID"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lastEventID&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resyncStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastEventID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rationale: The &lt;code&gt;Last-Event-ID&lt;/code&gt; header allows the client to resume the stream exactly where it left off without losing event order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Directionality&lt;/strong&gt; dictates protocol choice; use SSE for push-only scenarios to save memory and reduce handshake overhead.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Browser Support&lt;/strong&gt; is solid for both in modern browsers; SSE historically needed a polyfill for Internet Explorer, while WebSockets can require fallbacks behind restrictive proxies.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bandwidth&lt;/strong&gt; and operational costs are often lower with SSE because events ride a single long-lived HTTP response, which benefits from standard compression and HTTP/2 multiplexing rather than a separate bidirectional protocol.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;State Management&lt;/strong&gt; is simpler with SSE since the server does not need to track the client's outgoing state.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Load Balancing&lt;/strong&gt; is friendlier for SSE because it requires no session stickiness, whereas WebSockets often require sticky sessions or stateless middleware configuration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Client Control&lt;/strong&gt; is lost with SSE, but this is acceptable when the client passively consumes data rather than actively querying it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Explore WebSocket sub-protocols (negotiated via &lt;code&gt;Sec-WebSocket-Protocol&lt;/code&gt;) and use the encrypted &lt;code&gt;wss://&lt;/code&gt; scheme in sensitive environments.&lt;/li&gt;
&lt;li&gt;  Review your load balancer configuration to ensure it handles persistent connections correctly.&lt;/li&gt;
&lt;li&gt;  Design a fallback mechanism that switches to WebSockets if the SSE connection proves unreliable in a given environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — Understands the fundamental trade-offs between streaming, batching, and persistence that apply here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — Teaches you to choose the right abstraction level for system interfaces and communication boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>distributed</category>
      <category>systems</category>
      <category>networking</category>
      <category>backend</category>
    </item>
    <item>
      <title>Async Runtime Internals: How tokio Schedules Your Futures</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Thu, 30 Apr 2026 12:39:25 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/async-runtime-internals-how-tokio-schedules-your-futures-3kn8</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/async-runtime-internals-how-tokio-schedules-your-futures-3kn8</guid>
<description>&lt;blockquote&gt;
&lt;p&gt;Async concurrency isn't about avoiding locks; it is about understanding the precise moment a thread yields and how the runtime recovers execution flow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are dissecting the inner workings of the Tokio runtime to understand how a &lt;code&gt;Future&lt;/code&gt; transitions from a pending state to execution. This scope focuses on the event loop's polling mechanism, the handling of I/O readiness, and the implications for task ownership. We will not cover the standard library implementation or the &lt;code&gt;tokio::join!&lt;/code&gt; macro. We are focusing on the lifecycle of a detached task submitted to a multi-threaded worker pool. This guide clarifies how your application avoids blocking the main event loop and keeps resources available for concurrent operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Submitting a Future
&lt;/h2&gt;

&lt;p&gt;When you call &lt;code&gt;tokio::spawn&lt;/code&gt;, you hand the future to the runtime's scheduler rather than executing it inline. The future consumes no CPU cycles at this moment; its code runs only once a worker thread polls it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This code only runs when polled by the runtime&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Task started"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This specific choice matters because it decouples the creation of the task from the execution context, allowing the application to manage thread counts independently of code logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — The Ready Queue
&lt;/h2&gt;

&lt;p&gt;The runtime maintains a queue of runnable tasks. When a resource a future is waiting on becomes ready, such as a socket becoming readable, the reactor invokes the task's &lt;code&gt;Waker&lt;/code&gt;, which re-enqueues the task so a worker thread can poll it again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Pending Queue -&amp;gt; Poll Future -&amp;gt; Complete -&amp;gt; Wake -&amp;gt; Re-inserted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mechanism ensures that a future waiting on I/O does not hold a thread: it returns &lt;code&gt;Poll::Pending&lt;/code&gt; and the thread is freed for other work. When the reactor observes the I/O completion, it fires the waker and the future re-enters the queue to be polled on a subsequent tick. This design enables high concurrency without thread proliferation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Event Loop Dispatch
&lt;/h2&gt;

&lt;p&gt;The event loop runs continuously, polling registered resources for readiness. It checks for I/O events from the operating system to determine if a socket is ready for reading or writing. If an I/O event is available, the runtime dispatches the associated future to the current thread. If the future is not ready, the runtime waits for the next I/O event or a timer event.&lt;/p&gt;

&lt;p&gt;The runtime uses an internal reactor (&lt;code&gt;mio&lt;/code&gt;) to register file descriptors, and the reactor reports ready events when the OS signals activity. This abstracts kernel-level file descriptor management away from application code. Note that a task is polled only when its waker fires, never on a blind schedule; Tokio additionally applies a per-tick polling budget so one busy task cannot starve the others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Context Switching
&lt;/h2&gt;

&lt;p&gt;A context switch occurs when a task awaits I/O and yields execution back to the scheduler. The task's state is stored in the future itself, so the worker thread simply picks up the next pending task from the queue. This is cheap because the runtime reuses threads from a pool rather than spawning new ones; by default the pool size equals the number of logical cores on the machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Tokio thread pool configuration&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new_multi_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.worker_threads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Adjust based on CPU cores&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration ensures that the runtime scales its thread pool appropriately for the available hardware resources. If a task blocks, the runtime continues to process other tasks on the same thread, maintaining responsiveness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Understanding the lifecycle of a future means grasping how the runtime polls tasks without blocking. When a future awaits I/O, it yields, and the runtime re-polls it once the waker fires on completion. Tasks are therefore driven by readiness rather than periodic scanning, which is what lets the runtime achieve high throughput across a small pool of threads.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;You should review how hand-written &lt;code&gt;Future&lt;/code&gt; implementations drive &lt;code&gt;poll&lt;/code&gt; compared with built-in types. Study the &lt;code&gt;tokio::io&lt;/code&gt; module to understand how readiness checks are performed. Consider how errors returned from polled futures fit into your error handling strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;Refer to &lt;em&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/em&gt; for insights into concurrency models. Read &lt;em&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/em&gt; to understand abstraction costs. Consult &lt;em&gt;&lt;a href="https://amzn.to/4sPlPDL" rel="noopener noreferrer"&gt;Learn Rust in a Month of Lunches (MacLeod)&lt;/a&gt;&lt;/em&gt; for syntax specifics. Review &lt;em&gt;&lt;a href="https://amzn.to/41FQGXh" rel="noopener noreferrer"&gt;Cracking the Coding Interview (McDowell)&lt;/a&gt;&lt;/em&gt; for algorithmic patterns in async code. These resources provide context for asynchronous programming.&lt;/p&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>concurrency</category>
      <category>systems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Service Mesh Fundamentals: What a Sidecar Proxy Actually Does</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Wed, 29 Apr 2026 12:40:21 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/service-mesh-fundamentals-what-a-sidecar-proxy-actually-does-4g34</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/service-mesh-fundamentals-what-a-sidecar-proxy-actually-does-4g34</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Sidecar proxies decouple infrastructure concerns from business logic by intercepting traffic at the container boundary without modifying the application source code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are focusing on the sidecar proxy pattern specifically. This involves understanding how a proxy shares a network namespace with a service and intercepts TCP traffic before it reaches the application. The scope is the data plane, not the control plane orchestration. We will demonstrate how a proxy sits alongside a container to handle routing, encryption, and observability. This pattern is essential for modern distributed systems where business teams do not want to maintain infrastructure logic inside their core repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Container Networking Co-location
&lt;/h2&gt;

&lt;p&gt;The sidecar must live in the same network namespace to share the same IP address. In Kubernetes, containers in a single pod share a network namespace automatically, so no special configuration is needed. The application sends requests to &lt;code&gt;localhost:port&lt;/code&gt;, and the sidecar accepts these connections on the same loopback interface. This keeps traffic on loopback instead of unintentionally exiting to the cluster network. The Rust example demonstrates how a listener binds to the socket the application expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// sidecar_proxy.rs&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;AsyncReadExt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AsyncWriteExt&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1:8081"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;incoming&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="nf"&gt;.accept&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Forward traffic to upstream or app logic&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sidecar listening on 127.0.0.1:8081"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Dockerfile builds the container environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; rust:1.70-slim&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; sidecar_proxy.rs .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8081&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["./target/release/sidecar_proxy"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Traffic Hijacking and Proxying
&lt;/h2&gt;

&lt;p&gt;The sidecar must intercept traffic addressed to the application. Two processes cannot ordinarily bind the same port, so interception happens at the network layer: &lt;code&gt;iptables&lt;/code&gt; REDIRECT rules rewrite inbound packets to the proxy's port while the application listens on a separate local port. The proxy can then inject middleware logic such as logging, authentication, or rate limiting before forwarding. The Rust code forwards an intercepted connection to the application port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;forward_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TcpStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SocketAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="nf"&gt;.read_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;SocketAddr&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;127&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="c1"&gt;// App port&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;to_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;TcpStream&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;to_stream&lt;/span&gt;&lt;span class="nf"&gt;.write_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This logic replaces direct socket calls in the app with calls to the proxy loop. The proxy becomes the single point of truth for ingress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Metadata and Service Discovery
&lt;/h2&gt;

&lt;p&gt;The proxy needs to know where to route traffic. In a service mesh, the proxy registers metadata with a control plane to learn cluster topology. This metadata includes service names, mesh ID, and upstream endpoints. Without this, the proxy cannot perform routing. The Go example shows struct definitions for this metadata injection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// sidecar_metadata.go&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Metadata&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ServiceName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;MeshID&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Upstreams&lt;/span&gt;   &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GetTarget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"api.example.com"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"10.0.0.5:9090"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The control plane pushes this to the sidecar via gRPC or HTTP. This allows the sidecar to dynamically update routing tables without reloading the binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — mTLS and Policy Enforcement
&lt;/h2&gt;

&lt;p&gt;Security is handled by the proxy, not the app. The sidecar terminates mutual TLS (mTLS) and validates client certificates, and it enforces policies such as denying traffic from untrusted peers. If the app managed mTLS itself, it would have to track certificate rotation every time a pod is rescheduled, and a missed rotation means an outage. The sidecar handles rotation transparently. A configuration file defines the required authentication mode per workload and port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# security_policy.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security.istio.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PeerAuthentication&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mtls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRICT&lt;/span&gt;
  &lt;span class="na"&gt;portLevelMtls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRICT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sidecar owns the certificate store and rotates it without the app's involvement. The app only needs to accept plaintext connections from localhost, while the proxy terminates and verifies mTLS on its behalf.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 — Observability and Metrics
&lt;/h2&gt;

&lt;p&gt;The proxy exposes metrics like request latency and errors. The application does not need to instrument every endpoint. The proxy aggregates this data to provide cluster-wide visibility. Prometheus queries the sidecar endpoint to build dashboards. This separation reduces the application footprint. The sidecar runs a metrics server on a specific port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;hyper_util&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;rt&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TokioExecutor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;127&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;9090&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;service&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;make_service_fn&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Metrics logic here&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nn"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="nf"&gt;.into_make_service&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows operations teams to monitor system health without touching application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Decoupling:&lt;/strong&gt; Infrastructure logic is isolated from business logic.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Egress Interception:&lt;/strong&gt; Outbound calls are handled by the proxy, preventing data leaks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metadata Plane:&lt;/strong&gt; Dynamic configuration updates via control plane integration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Policy Isolation:&lt;/strong&gt; Security policies are defined centrally and enforced by the proxy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The next step is implementing an xDS gRPC client for service discovery. Advanced topics include using eBPF to bypass the sidecar for performance. Finally, integrate this into a Kubernetes environment using admission controllers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/em&gt;: For distributed system data models and replication patterns.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;&lt;a href="https://amzn.to/4c2jE8D" rel="noopener noreferrer"&gt;Computer Systems: A Programmer's Perspective (Bryant &amp;amp; O'Hallaron)&lt;/a&gt;&lt;/em&gt;: For networking stack details and socket programming.&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/em&gt;: For managing complexity in large systems.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributed</category>
      <category>systems</category>
      <category>networking</category>
      <category>backend</category>
    </item>
    <item>
      <title>Bulkhead vs Circuit Breaker: Choosing the Right Fault Isolation Strategy</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Mon, 27 Apr 2026 12:37:49 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/bulkhead-vs-circuit-breaker-choosing-the-right-fault-isolation-strategy-3e4p</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/bulkhead-vs-circuit-breaker-choosing-the-right-fault-isolation-strategy-3e4p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Stop your entire system from collapsing because one microservice is choking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are designing a distributed system where dependencies inevitably fail. The goal is to contain that failure to prevent cascading outages. This article contrasts the Circuit Breaker pattern, which stops retrying failed operations, with the Bulkhead pattern, which limits resource consumption per subsystem. We will implement these strategies in Go, leveraging concurrency primitives that reflect production reality. We distinguish between failing fast and resource isolation to determine the correct architectural tradeoff for your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Visualizing Cascading Failure
&lt;/h2&gt;

&lt;p&gt;Cascading failure occurs when one service's overload consumes system-wide resources like threads or bandwidth. Understanding the flow is the first defense.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Healthy Service]
      |
      v
[Overloaded Service] --&amp;gt; [System Threads]
      |                    ^
      |---------------------|
      |                    |
   [Cascade to DB]     [Thread Pool Starvation]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without isolation, a spike in load to a dependency drains the pool, causing healthy paths to starve. This visualizes the critical need to prevent a single failure point from consuming a shared resource like a thread pool. The choice matters because preventing resource starvation is distinct from preventing logic errors from propagating.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Implementing a Circuit Breaker
&lt;/h2&gt;

&lt;p&gt;A Circuit Breaker detects repeated failures and opens the circuit to bypass the failing downstream service. In Go, we simulate this with state tracking and timeouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;CircuitBreaker&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;failureThreshold&lt;/span&gt;   &lt;span class="kt"&gt;uint&lt;/span&gt;
    &lt;span class="n"&gt;resetTimeout&lt;/span&gt;       &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;              &lt;span class="n"&gt;State&lt;/span&gt; &lt;span class="c"&gt;// CLOSED, OPEN, HALF_OPEN&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;StateOpen&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastTrip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resetTimeout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"circuit is open"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failureThreshold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StateOpen&lt;/span&gt;
            &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lastTrip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implementation tracks state directly rather than pulling in a library so the core logic stays visible. Writing the struct yourself makes the reset-timeout and threshold mechanics explicit. That matters because off-the-shelf resilience libraries often hide the internal timing logic you need to tune.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Conceptualizing the Bulkhead
&lt;/h2&gt;

&lt;p&gt;A Bulkhead pattern limits resource access per service using logical barriers, like thread pools or semaphores. It does not stop failures; it stops resource exhaustion from affecting unrelated paths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Main Pool]          [Pool A]          [Pool B]
      |                |                |
  Service 1       Service 2      Service 3 (External DB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This diagram shows logical isolation where a spike in &lt;code&gt;Service 3&lt;/code&gt; cannot exhaust the &lt;code&gt;Main Pool&lt;/code&gt;. Allocating separate execution pools or connection limits for specific dependencies is the core concept here. It matters because circuit breakers protect against errors, but bulkheads protect against resource exhaustion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Implementing Bulkhead Isolation
&lt;/h2&gt;

&lt;p&gt;In Go, we use a semaphore-based pool to limit concurrency per service group. We define specific limits per dependency group rather than a global limit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Bulkhead&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;maxConcurrentPerGroup&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Bulkhead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Acquire token from the specific semaphore&lt;/span&gt;
    &lt;span class="c"&gt;// If limit reached, block or return error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Bulkhead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ExecuteWithBulkhead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;acquireToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;releaseToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use a map to track distinct semaphore limits keyed by the service identifier. Defining distinct limits per service rather than a global limit allows partial system failure. This choice matters because a global thread pool is insufficient for modern microservice architectures where dependencies vary in cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 — Combining Both Strategies
&lt;/h2&gt;

&lt;p&gt;Production systems often require both patterns for different layers. You might use Circuit Breakers for external API calls and Bulkheads for internal worker pools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Request]
   |
   v
[Bulkhead Pool] --&amp;gt; [Circuit Breaker] --&amp;gt; [External API]
   |
   v
[Bulkhead Pool] --&amp;gt; [Circuit Breaker] --&amp;gt; [Internal DB]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying both ensures that resource contention and error propagation are handled independently. The Bulkhead enforces connection and concurrency limits, while the Circuit Breaker enforces timeout and failure limits. Under high load, neither failure mode can amplify the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Circuit Breakers&lt;/strong&gt; protect against repeated logic errors by stopping retries after a threshold.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bulkheads&lt;/strong&gt; protect against resource starvation by limiting concurrent execution per group.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Combine strategies:&lt;/strong&gt; Use bulkheads for internal resource limits and breakers for external network dependencies.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Monitor metrics:&lt;/strong&gt; Track both failure rates and semaphore wait times to validate effectiveness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Implementing these patterns is only the beginning. Your next priority should be observability to detect threshold breaches before they impact users. Consider implementing metrics for failure rates and semaphore wait times to validate effectiveness. Finally, explore retry logic that is safe for idempotent operations to complement the breakers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — Explains distributed system resilience and failure modes comprehensively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — Discusses tradeoffs between modularity and complexity relevant to bulkheads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>systems</category>
      <category>patterns</category>
      <category>distributed</category>
    </item>
    <item>
      <title>LSM Trees vs B-Trees: How Storage Engines Choose Their Data Structure</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Sun, 26 Apr 2026 12:39:33 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/lsm-trees-vs-b-trees-how-storage-engines-choose-their-data-structure-10l9</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/lsm-trees-vs-b-trees-how-storage-engines-choose-their-data-structure-10l9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"Choosing between LSM Trees and B-Trees dictates the throughput ceiling of your write-heavy or read-heavy workload."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are analyzing the fundamental trade-offs between two dominant key-value storage paradigms. The goal is not to declare one superior, but to understand the architectural implications of each. This comparison focuses on write amplification, read latency, and disk seek patterns. We will examine how these structures handle concurrent writes and sequential reads, providing a decision framework for engineering teams selecting a persistence layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — B-Tree Random Access Optimization
&lt;/h2&gt;

&lt;p&gt;B-Trees enforce a balanced height and sorted order, ensuring that insertion, deletion, and lookup operations run in O(log n) time. Maintaining balance requires frequent random writes to disk whenever a node splits. This structure minimizes read latency because any key is accessed in a predictable number of disk seeks. However, the overhead of splitting nodes can throttle throughput under write-heavy load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;BTreeNode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;NodeRef&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structure minimizes the number of seeks per read, but node splits cause random write I/O and write amplification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — LSM Memtable Buffering
&lt;/h2&gt;

&lt;p&gt;LSM Trees separate mutable memory from immutable storage to optimize write performance. Incoming writes go into an in-memory sorted structure called a Memtable. Once the Memtable reaches a size threshold, it flushes to disk as an immutable Sorted String Table (SSTable). This buffering lets the system absorb very high write rates, with durability provided by a sequential write-ahead log rather than random writes to the main data files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Memtable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BTreeMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Memtable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.entries&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.max_size&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.sst&lt;/span&gt;&lt;span class="nf"&gt;.write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.entries&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.entries&lt;/span&gt;&lt;span class="nf"&gt;.clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This buffers writes in memory before flushing to disk, drastically improving throughput.&lt;/p&gt;
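&lt;p&gt;A self-contained sketch of this buffering, with an in-memory &lt;code&gt;Vec&lt;/code&gt; standing in for the on-disk SSTable file, might look like this:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// Sketch only: flushed "SSTables" are sorted Vecs rather than disk files.
struct Memtable {
    entries: BTreeMap<String, String>,
    max_size: usize,
    sstables: Vec<Vec<(String, String)>>,
}

impl Memtable {
    fn new(max_size: usize) -> Self {
        Memtable { entries: BTreeMap::new(), max_size, sstables: Vec::new() }
    }

    fn put(&mut self, key: String, value: String) {
        self.entries.insert(key, value);
        if self.entries.len() >= self.max_size {
            // Flush: drain the sorted map into an immutable sorted run.
            let run: Vec<_> = std::mem::take(&mut self.entries).into_iter().collect();
            self.sstables.push(run);
        }
    }
}

fn main() {
    let mut mt = Memtable::new(2);
    for (k, v) in [("b", "2"), ("a", "1"), ("c", "3")] {
        mt.put(k.to_string(), v.to_string());
    }
    // The first two writes flushed as one sorted run; "c" is still buffered.
    println!("runs={} buffered={}", mt.sstables.len(), mt.entries.len());
}
```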

&lt;h2&gt;
  
  
  Step 3 — Compaction Lifecycle
&lt;/h2&gt;

&lt;p&gt;The disk eventually contains multiple SSTables with overlapping keys. A compaction process merges these sorted files into larger, more compact files. This process is critical for space reclamation and read efficiency. It involves scanning multiple sorted files, removing duplicates, and writing a new file. Over time, this reduces file fragmentation and ensures that sequential reads hit contiguous blocks of data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memtable -&amp;gt; SSTable1
Memtable -&amp;gt; SSTable2
Compaction: SSTable1 + SSTable2 -&amp;gt; New SSTable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over time, smaller files merge into larger ones to optimize sequential read performance.&lt;/p&gt;
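&lt;p&gt;Compaction itself is a k-way merge of sorted runs. A two-way sketch is enough to show the idea; the rule that the newer run wins on duplicate keys is an assumption of this example:&lt;/p&gt;

```rust
// Merge two sorted runs; entries from `newer` shadow duplicates in `older`.
fn compact(older: &[(u32, &str)], newer: &[(u32, &str)]) -> Vec<(u32, String)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < older.len() || j < newer.len() {
        match (older.get(i), newer.get(j)) {
            (Some(&(ko, vo)), Some(&(kn, vn))) => {
                if ko < kn {
                    out.push((ko, vo.to_string())); i += 1;
                } else if kn < ko {
                    out.push((kn, vn.to_string())); j += 1;
                } else {
                    // Duplicate key: keep the newer value, drop the older one.
                    out.push((kn, vn.to_string())); i += 1; j += 1;
                }
            }
            (Some(&(ko, vo)), None) => { out.push((ko, vo.to_string())); i += 1; }
            (None, Some(&(kn, vn))) => { out.push((kn, vn.to_string())); j += 1; }
            (None, None) => unreachable!(),
        }
    }
    out
}

fn main() {
    let merged = compact(&[(1, "old"), (3, "x")], &[(1, "new"), (2, "y")]);
    println!("{:?}", merged); // key 1 keeps the newer value
}
```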

&lt;h2&gt;
  
  
  Step 4 — Handling Read Amplification Costs
&lt;/h2&gt;

&lt;p&gt;Read operations in an LSM Tree are more complex than in a B-Tree. When searching for a key, the engine checks the Memtable first. If the key isn't found there, it scans the SSTables from newest to oldest. While LSM Trees are optimized for writes, reads can suffer from increased latency due to this multi-level lookup; that is the price paid for write throughput.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.memtable&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.or_else&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.find_sstable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds latency per read but allows massive write concurrency without locking.&lt;/p&gt;
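&lt;p&gt;The lookup order described above (Memtable first, then runs from newest to oldest) can be sketched with sorted vectors standing in for SSTables:&lt;/p&gt;

```rust
// Sketch: check the memtable, then scan sorted runs from newest to oldest.
fn get(memtable: &[(u32, &str)], sstables: &[Vec<(u32, &str)>], key: u32) -> Option<String> {
    if let Some(&(_, v)) = memtable.iter().find(|&&(k, _)| k == key) {
        return Some(v.to_string());
    }
    // Newest run first, so the most recently flushed value wins.
    for run in sstables.iter().rev() {
        if let Ok(i) = run.binary_search_by_key(&key, |&(k, _)| k) {
            return Some(run[i].1.to_string());
        }
    }
    None
}

fn main() {
    let memtable = vec![(1, "mem")];
    let sstables = vec![vec![(2, "old")], vec![(2, "newer")]]; // oldest .. newest
    println!("{:?} {:?}", get(&memtable, &sstables, 1), get(&memtable, &sstables, 2));
}
```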

&lt;h2&gt;
  
  
  Step 5 — Engine Selection Matrix
&lt;/h2&gt;

&lt;p&gt;The decision to use one structure over the other depends on the workload. If writes dominate, even with random keys, LSM Trees absorb them in the Memtable and flush sequentially. If fast point lookups are vital, B-Trees keep read amplification low. In practice, LSM-based engines such as RocksDB suit key-value stores with massive write rates, while B-Tree engines such as InnoDB suit relational databases built around point lookups.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Write Load&lt;/strong&gt;: Choose LSM Trees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Read Load&lt;/strong&gt;: Choose B-Trees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Random Writes&lt;/strong&gt;: Favor LSM Trees; the Memtable batches them into sequential flushes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential Writes&lt;/strong&gt;: Favor LSM Trees; appends map directly onto flushing and compaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HDD Storage&lt;/strong&gt;: Favor LSM Trees; sequential I/O avoids costly seeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSD Storage&lt;/strong&gt;: Either works; weigh LSM write amplification against B-Tree random-read cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Write Amplification&lt;/strong&gt; is the primary cost of LSM Trees: compaction rewrites data, multiplying physical writes. &lt;strong&gt;Read Latency&lt;/strong&gt; increases due to multi-level lookups through the Memtable and SSTables. &lt;strong&gt;Sequential I/O&lt;/strong&gt; is heavily favored by LSM Trees for compaction and flushing. &lt;strong&gt;Memory Footprint&lt;/strong&gt; is higher for LSM Trees due to Memtable buffering. &lt;strong&gt;Failure Domain&lt;/strong&gt; risk grows with Memtable size: a crash loses buffered writes unless a write-ahead log protects them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Future discussions will cover cloud storage patterns like S3 object stores and embedded storage engines like RocksDB. We will also explore how to implement custom SSTable merging strategies in Rust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — explains the LSM Tree internals and write amplification concepts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — discusses managing complexity when designing distributed storage layers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article is part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>systems</category>
      <category>backend</category>
    </item>
    <item>
      <title>Change Data Capture: Streaming Database Changes to Downstream Systems</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Sat, 25 Apr 2026 12:41:07 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/change-data-capture-streaming-database-changes-to-downstream-systems-flg</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/change-data-capture-streaming-database-changes-to-downstream-systems-flg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Manual polling is an anti-pattern; stream the truth directly from the source.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are constructing a robust Change Data Capture (CDC) pipeline using Go. This system watches the Write-Ahead Log (WAL) of a PostgreSQL instance, captures row-level changes, transforms the payload into a standard domain model, and emits it to a downstream consumer.&lt;/p&gt;

&lt;p&gt;The architecture relies on logical replication rather than physical log scanning to handle schema changes gracefully. The system must survive connection resets and deliver each change at least once, with offset tracking and idempotent consumers providing effectively-once processing downstream.&lt;/p&gt;

&lt;p&gt;The high-level data flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------+     +----------------+     +----------------+
|  PostgreSQL    |----&amp;gt;|   CDC Reader   |----&amp;gt;|  Downstream    |
|   WAL Stream   |     |   (Go Service) |     |   System       |
+----------------+     +----------------+     +----------------+
         ^                    |                    ^
         |                    v                    |
   +-----+------------+        |        +----------+-------+
   |  Failover /      |        |        |  Offset Storage  |
   |  Offset Replay   |        |        |  (Kafka/Table)   |
   +------------------+        |        +-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1 — Establish Replication Slot
&lt;/h2&gt;

&lt;p&gt;The first step is initializing a logical replication slot on the source database. Without a slot, the server is free to recycle WAL segments before the consumer has read them, which means a restart can silently skip changes. We establish the connection using the binary protocol to prepare for WAL decoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pgxpool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;slotName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"cdc_reader_slot"&lt;/span&gt;
&lt;span class="n"&gt;slot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`SELECT pg_create_logical_replication_slot($1, 'pgoutput')`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slotName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;// Retrieve latest WAL location to start streaming&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the CDC reader captures every single change, including schema evolution signals, rather than relying on implicit polling intervals which introduce latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Start WAL Stream
&lt;/h2&gt;

&lt;p&gt;Once the slot is created, we subscribe to the stream from a specific LSN position via the streaming replication protocol (&lt;code&gt;START_REPLICATION&lt;/code&gt;). In production this loop is non-blocking; we read a buffer of events rather than waiting indefinitely for a single row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;walStream&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pgconn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartLogicalReplication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pglogical&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSlot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slotName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;walStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pglogical&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReplicationMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;processChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern decouples the ingestion rate from the database commit speed, allowing the downstream system to buffer and process data at its own capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Decode Change Payloads
&lt;/h2&gt;

&lt;p&gt;Raw WAL messages are binary and contain specific flags for insertion, updates, and deletions. We decode these payloads into a generic event struct. We strip the original SQL identity and extract only the necessary columns for the consumer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChangeEvent&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TableName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Type&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="c"&gt;// INSERT, UPDATE, DELETE&lt;/span&gt;
    &lt;span class="n"&gt;RowData&lt;/span&gt;   &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processChange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChangeEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Decode binary payload based on replication protocol&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ChangeEvent&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TableName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"UPDATE"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Logic to parse row data into map&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choosing binary parsing over simple string replacement ensures we handle complex data types like arrays and JSON correctly without triggering database errors.&lt;/p&gt;
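&lt;p&gt;To illustrate the shape of binary decoding without reproducing the real pgoutput protocol, here is a toy wire format invented for this sketch: a one-byte operation tag, then a big-endian length-prefixed table name:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// decode parses a toy wire format (NOT the real pgoutput protocol):
// one byte for the operation tag, then a uint32 length-prefixed table name.
func decode(data []byte) (op byte, table string, err error) {
	buf := bytes.NewReader(data)
	if err = binary.Read(buf, binary.BigEndian, &op); err != nil {
		return 0, "", err
	}
	var n uint32
	if err = binary.Read(buf, binary.BigEndian, &n); err != nil {
		return 0, "", err
	}
	name := make([]byte, n)
	if _, err = io.ReadFull(buf, name); err != nil {
		return 0, "", err
	}
	return op, string(name), nil
}

func main() {
	// 'U' for update, then len("users") = 5, then the name bytes.
	payload := []byte{'U', 0, 0, 0, 5, 'u', 's', 'e', 'r', 's'}
	op, table, err := decode(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%c %s\n", op, table)
}
```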

&lt;h2&gt;
  
  
  Step 4 — Transform and Filter
&lt;/h2&gt;

&lt;p&gt;We must often enrich raw data before emitting it. For example, we might join with a reference table to convert a foreign key &lt;code&gt;user_id&lt;/code&gt; into an &lt;code&gt;email&lt;/code&gt;. We also implement a filter to drop events from tables irrelevant to the target consumer, saving bandwidth and compute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;transformEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChangeEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChangeEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Add metadata like timestamp and source&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"postgres_main"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"DELETE"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RowData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RowData&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step adds context that makes the event self-describing, reducing the need for the consumer to maintain state about which table a record came from.&lt;/p&gt;
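&lt;p&gt;The filtering half can be a plain allow-list check in front of the transformer. A minimal sketch, where the &lt;code&gt;allowed&lt;/code&gt; set and the trimmed-down local event shape are assumptions of this example:&lt;/p&gt;

```go
package main

import "fmt"

// Trimmed-down local copy of the event shape for a self-contained sketch.
type ChangeEvent struct {
	TableName string
	Type      string
}

// filterEvents drops events from tables the consumer does not care about.
func filterEvents(events []ChangeEvent, allowed map[string]bool) []ChangeEvent {
	var out []ChangeEvent
	for _, e := range events {
		if allowed[e.TableName] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []ChangeEvent{
		{TableName: "users", Type: "UPDATE"},
		{TableName: "audit_log", Type: "INSERT"},
	}
	kept := filterEvents(events, map[string]bool{"users": true})
	fmt.Println(len(kept), kept[0].TableName)
}
```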

&lt;h2&gt;
  
  
  Step 5 — Emit with Backpressure
&lt;/h2&gt;

&lt;p&gt;Sending the event is the final step. If the downstream service is slow, we must buffer. We use a channel or a queue (like &lt;code&gt;kafka-producer&lt;/code&gt; or &lt;code&gt;grpc-stream&lt;/code&gt;) to emit data. We handle backpressure by pausing ingestion if the send channel fills up, preventing the reader from consuming too much memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChangeEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Simulate sending to downstream consumer&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;downstream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// Sent&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// Apply backpressure or drop&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Handling backpressure gracefully keeps the CDC reader's memory bounded when the consumer is temporarily slow or down, instead of buffering events without limit.&lt;/p&gt;
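&lt;p&gt;The blocking behaviour can be demonstrated with nothing more than a bounded channel. In this sketch (the buffer size and timeout are arbitrary), once the buffer fills, each further send times out instead of growing memory:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// tryEmit sends n events into a bounded channel with no consumer attached;
// once the buffer is full, each further send times out, standing in for
// "pause WAL ingestion or drop".
func tryEmit(n, buffer int) (sent, dropped int) {
	downstream := make(chan int, buffer)
	for i := 0; i < n; i++ {
		select {
		case downstream <- i:
			sent++
		case <-time.After(10 * time.Millisecond):
			dropped++ // apply backpressure: pause ingestion or drop
		}
	}
	return sent, dropped
}

func main() {
	sent, dropped := tryEmit(5, 2)
	fmt.Println(sent, dropped)
}
```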

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;WAL Logs&lt;/strong&gt;: Physical or logical WAL streams provide an ordered, transaction-complete view of database changes that polling cannot match.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Replication Slots&lt;/strong&gt;: They stop the server from recycling WAL segments until the reader confirms them, ensuring no change is lost across restarts.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Binary Parsing&lt;/strong&gt;: Decoding binary formats ensures correct handling of complex data types and schema metadata without SQL ambiguity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backpressure&lt;/strong&gt;: Always respect consumer speed; buffering prevents system overload and memory spikes during traffic bursts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Read &lt;em&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/em&gt; (Ch. 5) to understand replication and its consistency models, including eventual consistency.&lt;/li&gt;
&lt;li&gt;  Review &lt;em&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/em&gt; (Ch. 2) to manage complexity when adding transformation logic to data pipelines.&lt;/li&gt;
&lt;li&gt;  Implement offset storage in your system to allow for manual recovery or checkpointing after restarts.&lt;/li&gt;
&lt;/ul&gt;
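&lt;p&gt;As a starting point for that last item, a file-based checkpoint is the simplest stand-in for a Kafka topic or offset table (the &lt;code&gt;checkpoint.lsn&lt;/code&gt; file name is hypothetical):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// saveOffset persists the last confirmed WAL position so a restart can resume.
func saveOffset(path string, lsn uint64) error {
	return os.WriteFile(path, []byte(strconv.FormatUint(lsn, 10)), 0o644)
}

// loadOffset reads the checkpoint; a missing file means "no checkpoint yet".
func loadOffset(path string) (uint64, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return 0, nil // start from the slot's confirmed position instead
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(string(data), 10, 64)
}

func main() {
	path := filepath.Join(os.TempDir(), "checkpoint.lsn")
	if err := saveOffset(path, 123456); err != nil {
		panic(err)
	}
	lsn, _ := loadOffset(path)
	fmt.Println(lsn)
}
```

A production system would fsync the file or write the offset transactionally next to the processed data; this sketch only shows the resume contract.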

&lt;p&gt;This guide focuses on practical patterns for reliable data pipelines. Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>systems</category>
      <category>backend</category>
    </item>
    <item>
      <title>OpenTelemetry in Rust: Instrumenting a Service From Scratch</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Fri, 24 Apr 2026 12:39:51 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/opentelemetry-in-rust-instrumenting-a-service-from-scratch-5c56</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/opentelemetry-in-rust-instrumenting-a-service-from-scratch-5c56</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Modern distributed systems require structured observability without sacrificing developer velocity or introducing technical debt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are constructing a high-throughput Rust API service that automatically traces HTTP requests from the first incoming header to the final response. The goal is to demonstrate how to integrate OpenTelemetry (OTLP) into a production-grade Rust stack using minimal boilerplate while maintaining async context. We will avoid external managed SDKs in favor of the standard &lt;code&gt;opentelemetry&lt;/code&gt; crates, ensuring full control over the telemetry pipeline. This approach applies to any backend service written in Rust, whether it runs on Kubernetes or local infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Initialize the Global Provider
&lt;/h2&gt;

&lt;p&gt;Before recording any telemetry data, you must configure the global &lt;code&gt;Provider&lt;/code&gt; to handle context propagation and resource attributes. This step ensures that the OpenTelemetry SDK manages the lifecycle of the trace pipeline without manual resource cleanup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;propagation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TraceContextPropagator&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;opentelemetry_sdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Create a provider using a default exporter or OTLP&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.with_simple_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;opentelemetry_sdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nn"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;global&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration step establishes the global telemetry state. By initializing the provider early in the application lifecycle, you guarantee that all subsequent code uses the same tracer implementation. Otherwise, tasks that start before initialization record spans against the default no-op provider, and that trace data is silently dropped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Configure the OTLP Pipeline
&lt;/h2&gt;

&lt;p&gt;The OpenTelemetry Collector expects data in a specific protocol format, usually OTLP over HTTP or gRPC. You define this endpoint in the exporter configuration to ensure data reaches your monitoring backend securely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;OtlpExporter&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.with_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://localhost:4317"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenTelemetry Collector acts as the intermediary between your application and monitoring tools. Configuring the endpoint correctly prevents data loss during high load or network instability. The exporter handles batching logic internally, so you do not need to manage buffer sizes manually unless throughput optimization is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Handle Context Propagation
&lt;/h2&gt;

&lt;p&gt;When a request hits your service, the incoming &lt;code&gt;traceparent&lt;/code&gt; header must be extracted and attached to the current async runtime context. Without this, you cannot correlate requests across microservices or handle retries correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;propagation&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Propagator&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// In Axum middleware or handlers:&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Propagator&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Context propagation is critical for distributed systems. It allows the runtime to identify which request triggered the execution context automatically. If you skip this step, every retry or callback within the service will spawn a new, uncorrelated trace tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Instrument Handler Logic
&lt;/h2&gt;

&lt;p&gt;You attach a span to the handler function so that all internal calls made within that scope are automatically included in the trace. This creates a clear boundary between business logic and infrastructure noise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;handle_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;global&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my-service"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="nf"&gt;.span_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"process_request"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// ... logic&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Span lifecycle management ensures that the trace object is dropped correctly when the async task completes. Rust's &lt;code&gt;Drop&lt;/code&gt; trait handles this automatically. Keeping the instrumentation code close to business logic reduces the risk of missing steps in complex flows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 — Record Errors and Metrics
&lt;/h2&gt;

&lt;p&gt;Finally, you ensure that panics or HTTP errors are recorded as distinct spans with error status. This allows your backend monitoring to alert on failure rates instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="nf"&gt;.set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;opentelemetry&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;KeyValue&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Error handling is a distinct telemetry concern. Marking spans with error status distinguishes an application-level failure, such as a 500 response, from a transport fault or a process crash. You should also attach status codes to spans so downstream services can understand request outcomes without inspecting raw logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracing Spans&lt;/strong&gt; — Encapsulate logic boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Propagation&lt;/strong&gt; — Correlate distributed calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OTLP Export&lt;/strong&gt; — Standardize data shipping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK Lifecycle&lt;/strong&gt; — Ensure resource management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Status&lt;/strong&gt; — Track failure events.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Visualize traces in Tempo.&lt;/li&gt;
&lt;li&gt;Add metric aggregation.&lt;/li&gt;
&lt;li&gt;Configure batching.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — Explains the underlying systems architecture needed to support observability pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — Discusses why complexity grows without structured boundaries like traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4sPlPDL" rel="noopener noreferrer"&gt;Learn Rust in a Month of Lunches (MacLeod)&lt;/a&gt;&lt;/strong&gt; — Essential for understanding Rust async lifetimes used in OTel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Patterns
&lt;/h2&gt;

&lt;p&gt;This guide is part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series, focusing on scalable backend services in Rust.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>observability</category>
      <category>backend</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Raft Consensus Algorithm: Leader Election and Log Replication Explained</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Thu, 23 Apr 2026 12:37:45 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/the-raft-consensus-algorithm-leader-election-and-log-replication-explained-2j7b</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/the-raft-consensus-algorithm-leader-election-and-log-replication-explained-2j7b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Raft solves the hardest problem in distributed systems: keeping replicas synchronized while nodes fail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are dissecting the Raft consensus protocol to understand how a cluster maintains a single source of truth. Unlike Paxos, Raft is designed to be understandable and easier to implement correctly. Our scope is not building a complete key-value store, but modeling the core state machine of a Raft node. We will focus on the three node roles, the heartbeat mechanism, and the safety properties that prevent split-brain scenarios. We will use Go for examples because its interface and struct definitions closely mimic the RPC patterns found in production Raft libraries like etcd.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Defining Node Roles
&lt;/h2&gt;

&lt;p&gt;Raft nodes operate in a finite state machine. A node transitions between &lt;code&gt;Follower&lt;/code&gt;, &lt;code&gt;Candidate&lt;/code&gt;, and &lt;code&gt;Leader&lt;/code&gt;. The leader manages log replication, while followers maintain state consistency. This separation ensures that only one node writes to the log at any term, preventing conflicts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;NodeState&lt;/span&gt; &lt;span class="kt"&gt;uint8&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;StateFollower&lt;/span&gt; &lt;span class="n"&gt;NodeState&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;iota&lt;/span&gt;
    &lt;span class="n"&gt;StateCandidate&lt;/span&gt;
    &lt;span class="n"&gt;StateLeader&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RaftNode&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;State&lt;/span&gt;    &lt;span class="n"&gt;NodeState&lt;/span&gt;
    &lt;span class="n"&gt;Term&lt;/span&gt;     &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;VoteFor&lt;/span&gt;  &lt;span class="n"&gt;NodeID&lt;/span&gt;
    &lt;span class="n"&gt;LastLogIndex&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go’s typed constants keep the three states explicit and type-safe without external libraries, though the legal transitions between them must still be enforced in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Conducting Leader Elections
&lt;/h2&gt;

&lt;p&gt;When a follower stops receiving heartbeats, it starts an election. It increments its term, sets itself to &lt;code&gt;Candidate&lt;/code&gt;, and broadcasts a &lt;code&gt;RequestVote&lt;/code&gt; RPC to all other nodes. A node grants its vote only if the candidate's log is at least as up-to-date as its own. This prevents a node with stale data from becoming leader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RaftNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RequestVote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Term&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;StateLeader&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Simplified logic: vote if log matches&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This logic enforces the requirement that a leader must have the most up-to-date logs, ensuring safety.&lt;/p&gt;
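&lt;p&gt;The "at least as up-to-date" comparison from the Raft paper (§5.4.1) can be written as a small pure function over the term and index of each log's last entry; this is a sketch with names of my own choosing:&lt;/p&gt;

```go
package main

import "fmt"

// logUpToDate reports whether a candidate's log (candTerm, candIndex
// describe its last entry) is at least as up-to-date as the voter's
// log, per the Raft election restriction: the higher last term wins;
// on equal terms, the longer log wins.
func logUpToDate(candTerm, candIndex, voterTerm, voterIndex int) bool {
	if candTerm != voterTerm {
		return candTerm > voterTerm
	}
	return candIndex >= voterIndex
}

func main() {
	// Higher last term wins even against a longer log.
	fmt.Println(logUpToDate(3, 5, 2, 9))
	// Same term: the shorter log loses.
	fmt.Println(logUpToDate(2, 4, 2, 9))
}
```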

&lt;h2&gt;
  
  
  Step 3 — Log Replication via RPC
&lt;/h2&gt;

&lt;p&gt;The leader persists commands by appending them to its log. It then replicates each entry to followers using &lt;code&gt;AppendEntries&lt;/code&gt;. Followers append the entry and acknowledge success; once a majority has acknowledged it, the leader advances its commit index and the entry is considered committed and safe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RaftNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;AppendEntries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leaderTerm&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;leaderTerm&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Term&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the follower-side &lt;code&gt;AppendEntries&lt;/code&gt; handler: the leader proposes a change, and each follower stores it locally before acknowledging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Commit Safety and Stability
&lt;/h2&gt;

&lt;p&gt;A log entry is considered committed once a majority of nodes store it. The leader sends commit indices down to followers during heartbeats. Crucially, a committed entry can never be overwritten: the election restriction guarantees that any future leader already holds it in its log. This keeps the state machine consistent across the cluster even through node failures and recoveries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RaftNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ApplyCommit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommittedIndex&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CommittedIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that only committed entries are executed by the state machine, preserving durability guarantees.&lt;/p&gt;
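&lt;p&gt;The majority threshold behind commitment is simple arithmetic; a quick sketch:&lt;/p&gt;

```go
package main

import "fmt"

// quorum returns the number of acknowledgements needed before an
// entry counts as committed: a strict majority of the cluster.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Println(n, "nodes need", quorum(n), "acks")
	}
}
```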

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;State Machine:&lt;/strong&gt; Raft nodes transition between Follower, Candidate, and Leader based on election timeouts and received RPCs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Election Safety:&lt;/strong&gt; Leaders must hold the most recent logs, preventing nodes with stale data from winning elections.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Log Consistency:&lt;/strong&gt; Followers only accept entries from a current leader, ensuring global consistency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Durability:&lt;/strong&gt; Entries are durable once acknowledged by a majority before being applied to the state machine.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Safety First:&lt;/strong&gt; Raft prioritizes data correctness over availability during partitions to prevent data loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Consider optimizing the heartbeat interval and election timeout to balance consistency and latency. Next, explore how to handle split-brain scenarios where multiple nodes elect themselves leader simultaneously. You should also investigate how Raft handles snapshot compression to manage large log sizes efficiently. Finally, compare Raft with other consensus algorithms to understand trade-offs in your specific architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — the definitive guide to understanding data persistence, replication, and partitioning strategies in distributed systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>distributed</category>
      <category>systems</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
    <item>
      <title>HTTP/2 Multiplexing: Why One Connection Is Enough</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:42:41 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/http2-multiplexing-why-one-connection-is-enough-ib</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/http2-multiplexing-why-one-connection-is-enough-ib</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In the age of high-concurrency systems, opening a new TCP connection for every request is performance suicide.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are designing a backend service that handles thousands of concurrent requests without triggering connection timeouts. The goal is to eliminate the latency overhead of TCP handshakes for every single interaction. By leveraging HTTP/2 features, we can serve multiple requests simultaneously over a single persistent connection. This article explains how to configure a client to utilize multiplexing effectively and the architectural benefits this brings to a distributed backend system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Reusing the TCP Handshake
&lt;/h2&gt;

&lt;p&gt;The biggest bottleneck in network programming is establishing a connection. A three-way handshake adds round-trip latency that shouldn't be paid for every small request. HTTP/1.1 uses persistent connections, but HTTP/2 takes this further by allowing parallel requests on that one connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;MaxIdleConnsPerHost&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IdleConnTimeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="m"&gt;90&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ForceAttemptHTTP2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TLSHandshakeTimeout&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Go, setting &lt;code&gt;ForceAttemptHTTP2: true&lt;/code&gt; matters once you customize the transport: without it, a client with a custom TLS config or dialer falls back to HTTP/1.1, which achieves parallelism by opening multiple TCP connections per host. By configuring &lt;code&gt;MaxIdleConnsPerHost&lt;/code&gt;, we ensure the client keeps idle connections alive, amortizing the TCP handshake cost across hundreds of requests. This reduces latency significantly, especially for cold starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Sending Multiple Requests Per Connection
&lt;/h2&gt;

&lt;p&gt;HTTP/2 introduces a critical improvement over HTTP/1.1: multiplexing. This allows multiple requests and responses to share the same underlying TCP connection. Unlike HTTP/1.1, where requests are serialized, HTTP/2 allows a client to fire &lt;code&gt;GET /user?id=1&lt;/code&gt; and &lt;code&gt;GET /user?id=2&lt;/code&gt; in parallel on the same wire.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;req1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"https://api.example.com/user/1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;req2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"https://api.example.com/user/2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err1&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err2&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single &lt;code&gt;http.Client&lt;/code&gt; instance serves both requests over one connection. As written, the two calls run back-to-back; issue them from separate goroutines and the transport interleaves their frames in the same TCP stream. This eliminates HTTP-level head-of-line blocking, so a slow server response won't stall other requests on the connection (TCP-level blocking remains, which is what HTTP/3 addresses).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Compressing Overhead with HPACK
&lt;/h2&gt;

&lt;p&gt;HTTP headers can be verbose. Repeating &lt;code&gt;Host&lt;/code&gt;, &lt;code&gt;User-Agent&lt;/code&gt;, or &lt;code&gt;Authorization&lt;/code&gt; headers for every request wastes bandwidth and CPU. HTTP/2 uses the HPACK compression format to encode headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// HPACK is handled transparently inside the transport layer.&lt;/span&gt;
&lt;span class="c"&gt;// You don't usually need to configure it manually in standard Go clients.&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By enabling HTTP/2 (&lt;code&gt;ForceAttemptHTTP2: true&lt;/code&gt;), the client automatically applies HPACK. This drastically reduces the bytes written over the wire without altering the payload. For high-throughput services, this means less network chatter and lower CPU overhead for serialization, especially on constrained networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Managing Stream State and Errors
&lt;/h2&gt;

&lt;p&gt;HTTP/2 introduces new error handling semantics compared to HTTP/1.1. If a request fails, the error applies to the specific stream, not the entire connection: a &lt;code&gt;RST_STREAM&lt;/code&gt; frame resets one stream while the others continue. Connection-level failures, such as a TLS certificate error or a &lt;code&gt;GOAWAY&lt;/code&gt; frame, tear down the whole connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Handling a stream-level error vs connection-level error.&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusCode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusServiceUnavailable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Stream error: retry this request or fail gracefully.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We must ensure our client handles these scenarios gracefully. Using timeouts and retry logic is essential. If the server sends a &lt;code&gt;GOAWAY&lt;/code&gt; frame, the client should release resources and stop sending new requests to that connection, preventing data loss during graceful shutdowns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;TCP Persistence&lt;/strong&gt; — Establishing TCP is expensive; reuse connections to amortize the handshake cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Parallel Streams&lt;/strong&gt; — Sending multiple requests simultaneously eliminates queuing latency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Header Compression&lt;/strong&gt; — Reduces bandwidth usage without changing application logic.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Flow Control&lt;/strong&gt; — Per-stream windows let receivers apply backpressure so senders cannot overwhelm them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Now that you understand the mechanics, the next step is integrating this into your service. You might consider moving to HTTP/3 (QUIC) which provides better resilience against packet loss, or exploring gRPC which sits on top of HTTP/2 for even lower latency. You should also look into connection pooling strategies for database drivers which work similarly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;Here are resources that deepen your understanding of systems and performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — Understanding the trade-offs of different transport protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4c2jE8D" rel="noopener noreferrer"&gt;Computer Systems: A Programmer's Perspective (Bryant &amp;amp; O'Hallaron)&lt;/a&gt;&lt;/strong&gt; — The best book for understanding low-level networking primitives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4mekBiY" rel="noopener noreferrer"&gt;Python Crash Course (Matthes)&lt;/a&gt;&lt;/strong&gt; — Good for scripting rapid tests of your network logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/41FQGXh" rel="noopener noreferrer"&gt;Cracking the Coding Interview (McDowell)&lt;/a&gt;&lt;/strong&gt; — Essential for algorithmic complexity in concurrent systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — Understanding the trade-offs between simplicity and performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/3O0yNPF" rel="noopener noreferrer"&gt;AI Engineering (Chip Huyen)&lt;/a&gt;&lt;/strong&gt; — Practical applications of high-speed data ingestion for models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4sPlPDL" rel="noopener noreferrer"&gt;Learn Rust in a Month of Lunches (MacLeod)&lt;/a&gt;&lt;/strong&gt; — An alternative language for building high-performance clients.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By optimizing your client for multiplexing, you gain the resilience to handle load spikes and the efficiency to reduce latency. The next time you open a new connection for a small request, remember that HTTP/2 provides the tools to do better. Happy coding.&lt;/p&gt;

</description>
      <category>distributed</category>
      <category>systems</category>
      <category>networking</category>
      <category>backend</category>
    </item>
    <item>
      <title>The RED Method: Request Rate, Errors, and Duration as Your Core SLIs</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Sun, 19 Apr 2026 12:39:38 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/the-red-method-request-rate-errors-and-duration-as-your-core-slis-4jk</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/the-red-method-request-rate-errors-and-duration-as-your-core-slis-4jk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"Noise drowns out signal; focus on the three metrics that actually indicate system health."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are instrumenting a Go-based HTTP handler to expose the three RED metrics (request Rate, Errors, and Duration) required to calculate Service Level Indicators (SLIs). This scope excludes internal tracing spans and database metrics, focusing strictly on the surface API gateway to ensure consistency across a distributed backend. The goal is to replace legacy monitoring scripts with a structured metrics export that feeds directly into a Prometheus stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Instrument the Middleware
&lt;/h2&gt;

&lt;p&gt;The first step is intercepting incoming requests before they reach the application logic. You need a middleware function that wraps the handler and captures the timing start point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RequestInfo&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Start&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;RequestMetricsMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;reqInfo&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;RequestInfo&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

        &lt;span class="c"&gt;// Wrap the original handler logic here&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// Extract duration&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reqInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation ensures the application logic remains clean while observability concerns are handled at the infrastructure boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Aggregate Request Counts
&lt;/h2&gt;

&lt;p&gt;Counters track the total volume of requests. You should maintain separate counters for 4xx errors and 5xx errors to distinguish client failures from server failures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;totalRequests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CounterOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"api_total_requests_total"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Total number of API requests."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;error5xx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCounter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CounterOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"api_errors_5xx_total"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Help&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Server-side errors."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Counters are essential for calculating Request Rate per second, which helps determine capacity planning thresholds.&lt;/p&gt;
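&lt;p&gt;The rate itself is computed at query time from successive counter samples rather than stored. A minimal sketch of that derivation, which is roughly what a PromQL &lt;code&gt;rate()&lt;/code&gt; query does minus counter-reset handling:&lt;/p&gt;

```go
package main

import "fmt"

// perSecondRate derives a request rate from two cumulative counter samples
// taken intervalSeconds apart, which is how a scraper turns monotonically
// increasing counters into a requests-per-second figure.
func perSecondRate(previous, current float64, intervalSeconds float64) float64 {
	return (current - previous) / intervalSeconds
}

func main() {
	// Counter went from 1200 to 1500 requests across a 15s scrape interval.
	fmt.Println(perSecondRate(1200, 1500, 15)) // 20
}
```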

&lt;h2&gt;
  
  
  Step 3 — Classify Error Labels
&lt;/h2&gt;

&lt;p&gt;Do not just count errors; label them. Use status codes (2xx, 4xx, 5xx) as labels to allow you to query specific failure modes later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;recordError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;error5xx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c"&gt;// Record 4xx in a similar gauge or counter with a label&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This specificity allows you to distinguish between a rate-limiting issue (429) and a database crash (500) during incident response.&lt;/p&gt;
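&lt;p&gt;Assuming a counter vector keyed by a hypothetical &lt;code&gt;status_class&lt;/code&gt; label, a small classification helper keeps label cardinality bounded (one time series per class rather than one per exact status code):&lt;/p&gt;

```go
package main

import "fmt"

// classifyStatus maps an HTTP status code to the coarse label value recorded
// on the counter, so the metric carries "4xx"/"5xx" classes instead of a
// separate series for every exact code.
func classifyStatus(status int) string {
	switch {
	case status >= 500:
		return "5xx"
	case status >= 400:
		return "4xx"
	case status >= 300:
		return "3xx"
	default:
		return "2xx"
	}
}

func main() {
	fmt.Println(classifyStatus(429)) // "4xx" — rate limiting
	fmt.Println(classifyStatus(503)) // "5xx" — server failure
}
```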

&lt;h2&gt;
  
  
  Step 4 — Measure Latency Histograms
&lt;/h2&gt;

&lt;p&gt;Duration needs more than an average. A histogram, from which percentiles (p50, p95, p99) can be computed, is required to understand the tail latency that impacts user experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reqInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;apiDurationHistogram&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seconds&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Histograms preserve the full latency distribution, so a flood of fast requests cannot hide a slow tail the way a single average can.&lt;/p&gt;
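&lt;p&gt;To make the percentile claim concrete, here is a simplified sketch of how a quantile estimate falls out of cumulative bucket counts; it returns the bucket upper bound without the interpolation a real &lt;code&gt;histogram_quantile&lt;/code&gt; query performs:&lt;/p&gt;

```go
package main

import "fmt"

// bucket holds a histogram bucket upper bound (in seconds) and the
// cumulative count of observations at or below that bound.
type bucket struct {
	upperBound float64
	count      int
}

// quantileUpperBound returns the upper bound of the first bucket that
// contains the q-th quantile of all observations.
func quantileUpperBound(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	threshold := q * float64(total)
	for _, b := range buckets {
		if float64(b.count) >= threshold {
			return b.upperBound
		}
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// 100 requests: 90 completed under 100ms, 96 under 250ms, all under 1s.
	buckets := []bucket{{0.1, 90}, {0.25, 96}, {1.0, 100}}
	fmt.Println(quantileUpperBound(0.95, buckets)) // 0.25 — p95 lands in the 250ms bucket
}
```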

&lt;h2&gt;
  
  
  Step 5 — Export Metrics via HTTP Endpoint
&lt;/h2&gt;

&lt;p&gt;The final step is exposing these values so a collector like Prometheus can scrape them, typically every 15 seconds. Ensure that serving the metrics endpoint does not block application request handling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;startServer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mux&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServeMux&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/metrics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard HTTP endpoints provide the necessary protocol compliance for cloud-native observability stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Request Rate&lt;/strong&gt; provides visibility into traffic volume and helps identify capacity saturation points in real-time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Errors&lt;/strong&gt; must be labeled by status code to allow engineers to differentiate between client and server failures.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Duration&lt;/strong&gt; histograms are superior to averages because they reveal the tail latency that causes actual user complaints.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Instrumentation&lt;/strong&gt; should happen at the edge, ensuring that metrics reflect the contract presented to the client, not internal implementation details.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;SLOs&lt;/strong&gt; derived from these RED metrics drive meaningful alerts rather than noise from every internal dependency failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Next, define Service Level Objectives (SLOs) based on the 99.9th percentile of the Duration histogram. You should calculate error budgets to determine how much failure is acceptable before slowing down feature deployment. Finally, implement alerting rules that trigger when the 5xx error rate stays above your threshold for a sustained window, such as one minute.&lt;/p&gt;
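&lt;p&gt;The error-budget arithmetic behind that decision is simple enough to sketch, assuming a 30-day rolling window:&lt;/p&gt;

```go
package main

import "fmt"

// errorBudgetMinutes converts an availability SLO into the minutes of
// allowed downtime or failure over a window of the given number of days.
func errorBudgetMinutes(slo float64, windowDays float64) float64 {
	return (1 - slo) * windowDays * 24 * 60
}

func main() {
	// A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
	fmt.Printf("%.1f\n", errorBudgetMinutes(0.999, 30))
}
```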

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt; — Essential for understanding how to structure systems to handle the data flow that metrics represent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt; — Relevant for managing the complexity trade-offs when instrumenting every layer of a backend system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>observability</category>
      <category>backend</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a Job Queue in Rust: Persistent Tasks With Retry Logic</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:40:11 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/building-a-job-queue-in-rust-persistent-tasks-with-retry-logic-5n9</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/building-a-job-queue-in-rust-persistent-tasks-with-retry-logic-5n9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"Transient failures are inevitable; durable execution requires state to survive the crash."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are constructing a resilient worker service in Rust that processes background tasks from a persistent queue. This example prioritizes data durability over peak throughput, ensuring that failed jobs are never lost but eventually succeed or move to a dead letter queue. We will use async Rust with SQL for storage, demonstrating how to structure state transitions that survive application restarts. The focus is on architectural correctness over raw performance, building a foundation for long-running background processing systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Define the Job State Machine
&lt;/h2&gt;

&lt;p&gt;The worker must track a job's lifecycle without relying on volatile memory alone. We start by defining an enum that explicitly tracks every state transition, ensuring the logic is exhaustive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;JobStatus&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Pending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Succeeded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Failed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DeadLetter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This choice matters because explicit states prevent silent state drifts that often plague long-running daemon processes. By forcing the developer to handle every case, we reduce the chance of forgetting to update a database column after a panic.&lt;/p&gt;
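&lt;p&gt;This exhaustiveness can be made concrete with a transition function. The sketch below, with a hypothetical &lt;code&gt;next_status&lt;/code&gt; helper, is one way to express it; adding a new enum variant forces the match to be revisited at compile time rather than drifting silently in production:&lt;/p&gt;

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobStatus {
    Pending,
    Running,
    Succeeded,
    Failed,
    DeadLetter,
}

// next_status computes the state a job moves to after an execution attempt.
// The match is exhaustive: forgetting a case is a compile error.
pub fn next_status(current: JobStatus, succeeded: bool, retries_left: bool) -> JobStatus {
    match (current, succeeded, retries_left) {
        (JobStatus::Running, true, _) => JobStatus::Succeeded,
        (JobStatus::Running, false, true) => JobStatus::Failed, // eligible for retry
        (JobStatus::Running, false, false) => JobStatus::DeadLetter,
        (other, _, _) => other, // attempts do not change non-running states
    }
}

fn main() {
    assert_eq!(next_status(JobStatus::Running, false, false), JobStatus::DeadLetter);
    println!("transitions ok");
}
```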

&lt;h2&gt;
  
  
  Step 2 — Persist Job State in Storage
&lt;/h2&gt;

&lt;p&gt;A transient failure of the application worker must not result in data loss. We model the job table to include columns for status, retry count, and last attempt timestamp, creating a source of truth that survives restarts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[derive(sqlx::FromRow)]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Uuid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;last_attempted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Storing metadata here allows us to query for pending work and resume processing from exactly where the application died. We use UUIDs for the ID so any worker can generate identifiers without coordinating with a central sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Implement Exponential Backoff Logic
&lt;/h2&gt;

&lt;p&gt;When a job fails, we must wait before retrying to prevent database overload. We generate a delay based on the current retry count, using a &lt;code&gt;tokio::time::sleep&lt;/code&gt; to enforce a pause before the next attempt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;calculate_delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Start with 1 second delay and double it with each retry&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;base_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;max_duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;raw_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_duration&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;capped_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_delay&lt;/span&gt;&lt;span class="nf"&gt;.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_duration&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Add jitter to prevent thundering herd issues&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_millis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;random&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capped_delay&lt;/span&gt;&lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using exponential backoff instead of a fixed delay ensures that transient network issues resolve without overwhelming the system resources. The jitter component is critical for preventing multiple workers from retrying at the exact same second, which can cause spikes in database load.&lt;/p&gt;
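&lt;p&gt;To see the schedule this produces, here is the deterministic core with the jitter term omitted; the shift cap of 5 is an assumption chosen because 2^5 = 32 seconds already exceeds the 30-second ceiling:&lt;/p&gt;

```rust
use std::time::Duration;

// backoff_base reproduces the exponential schedule without jitter:
// 1s, 2s, 4s, 8s, 16s, then capped at 30s for every later retry.
fn backoff_base(retry_count: u32) -> Duration {
    let max = Duration::from_secs(30);
    let exp = retry_count.min(5); // 2^5 = 32s already exceeds the cap
    (Duration::from_secs(1) * 2u32.pow(exp)).min(max)
}

fn main() {
    let delays: Vec<u64> = (0..7).map(|n| backoff_base(n).as_secs()).collect();
    println!("{:?}", delays); // [1, 2, 4, 8, 16, 30, 30]
}
```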

&lt;h2&gt;
  
  
  Step 4 — Handle Permanent Failures in a DLQ
&lt;/h2&gt;

&lt;p&gt;A job should not be retried infinitely if the error is irrecoverable. If the retry count exceeds a threshold, we transition the state to &lt;code&gt;DeadLetter&lt;/code&gt; to prevent an infinite loop and allow operators to manually inspect or discard the job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;should_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="py"&gt;.retry_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_RETRIES&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Mark as DeadLetter&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation isolates error handling from success paths, adhering to the principle of separation of concerns. The &lt;code&gt;DeadLetter&lt;/code&gt; state acts as a final repository for problematic jobs, ensuring the system doesn't block on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;Building a durable job queue requires treating state as an external truth source rather than application memory. By defining a strict state machine and persisting it in a relational database, we ensure that no work is ever lost even if the worker process crashes. The retry logic with exponential backoff protects system health, while the dead letter queue allows for manual intervention on permanent failures. This pattern scales well for any background processing system that values correctness over speed. The separation of concerns—logic for success, logic for retry, logic for failure—ensures that the code remains maintainable and the architecture remains robust against transient failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next
&lt;/h2&gt;

&lt;p&gt;To expand on this pattern, consider adding concurrency controls to process jobs in parallel without overloading the database write locks. Investigate how &lt;code&gt;postgres&lt;/code&gt; connection pooling interacts with long-running transactions when processing large payloads. Finally, review the logging strategies for tracking job lifecycle events in a distributed system context to ensure observability aligns with operational expectations. You might also consider implementing a metrics pipeline to track average processing times per job type.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Books
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt;: Covers the tradeoffs between durability and availability that inform our database schema choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt;: The chapter on coupling applies to how we separate the retry logic from the processing logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://rust-lang.github.io/async-book/" rel="noopener noreferrer"&gt;Rust Async Book&lt;/a&gt; for deeper async patterns.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/launchbadge/sqlx" rel="noopener noreferrer"&gt;SQL Alchemy for Rust&lt;/a&gt; for database interaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part of the &lt;strong&gt;Architecture Patterns&lt;/strong&gt; series.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>systems</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
    <item>
      <title>Log-Structured Merge Trees: The Data Structure That Powers Modern Databases</title>
      <dc:creator>Dylan Dumont</dc:creator>
      <pubDate>Thu, 16 Apr 2026 12:43:31 +0000</pubDate>
      <link>https://dev.to/dylan_dumont_266378d98367/log-structured-merge-trees-the-data-structure-that-powers-modern-databases-ck4</link>
      <guid>https://dev.to/dylan_dumont_266378d98367/log-structured-merge-trees-the-data-structure-that-powers-modern-databases-ck4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;LSM trees optimize write performance by buffering changes in memory before flushing to disk sequentially.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;We are implementing a simplified LSM tree architecture to understand the mechanics behind high-throughput databases like Cassandra and RocksDB. This scope focuses on the core trade-off between write speed and storage durability. We will explore how write-heavy workloads are decoupled from read-heavy operations by leveraging sequential disk access rather than random seeking. This pattern is essential for modern backend systems handling massive logging or ingestion streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — In-Memory Write Buffer
&lt;/h2&gt;

&lt;p&gt;The core innovation of LSM trees is the in-memory buffer called a memtable. Incoming writes are appended to this vector rather than hitting the disk immediately. This drastically reduces the number of expensive seek operations required to update the dataset. In a production context, this allows the system to absorb burst traffic by queuing updates until the buffer capacity is reached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MemTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;capacity_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;MemTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;MemTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;capacity_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.entries&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rust vectors provide contiguous memory allocation, which aligns perfectly with how operating systems handle sequential writes to storage media.&lt;/p&gt;
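&lt;p&gt;One caveat: the &lt;code&gt;Vec&lt;/code&gt; above preserves insertion order, while an SSTable is a &lt;em&gt;Sorted&lt;/em&gt; String Table. A variant using &lt;code&gt;BTreeMap&lt;/code&gt; (standing in here for the concurrent skip lists production engines typically use) keeps keys ordered on insert, so the flush in the next step needs no sort pass:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// A memtable that keeps keys sorted as they arrive, so flushing can stream
// entries to disk already in SSTable order.
struct SortedMemTable {
    entries: BTreeMap<String, String>,
    capacity_limit: usize,
}

impl SortedMemTable {
    fn new(capacity: usize) -> Self {
        SortedMemTable { entries: BTreeMap::new(), capacity_limit: capacity }
    }

    fn append(&mut self, key: String, value: String) {
        // Later writes to the same key overwrite earlier ones in place.
        self.entries.insert(key, value);
    }

    fn is_full(&self) -> bool {
        self.entries.len() >= self.capacity_limit
    }
}

fn main() {
    let mut table = SortedMemTable::new(2);
    table.append("zebra".into(), "1".into());
    table.append("apple".into(), "2".into());
    let keys: Vec<&String> = table.entries.keys().collect();
    println!("{:?} full={}", keys, table.is_full()); // ["apple", "zebra"] full=true
}
```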

&lt;h2&gt;
  
  
  Step 2 — Flushing to SSTables
&lt;/h2&gt;

&lt;p&gt;Once the memtable fills its capacity limit, the buffer is frozen and flushed to disk as a Sorted String Table (SSTable). The flush is asynchronous; production engines pair it with a write-ahead log so that writes still buffered in memory are not lost if the process crashes before the flush completes. The file is written sequentially, which minimizes the random-write amplification that plagues traditional B-Trees under heavy update loads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush_memtable_to_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;MemTable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Simulate serialization to SSTable format&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_vec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="py"&gt;.entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// In reality, this would be written to disk with compression&lt;/span&gt;
    &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sst-000001"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Immutable files allow multiple versions to coexist without locking, enabling readers to see consistent snapshots of the database state.&lt;/p&gt;
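&lt;p&gt;A sketch of what that snapshot property buys on the read path, assuming a hypothetical in-memory layout where each generation is one flushed SSTable and index 0 is the newest:&lt;/p&gt;

```rust
// Sketch: with immutable SSTables, a read scans generations newest-first;
// the most recent write for a key wins, and no file ever needs a lock.
// The layout is hypothetical: generations[0] is the newest flushed table.
fn read_key(generations: &[Vec<(String, String)>], key: &str) -> Option<String> {
    for table in generations {
        // A real engine would consult the bloom filter and index first.
        if let Some((_, v)) = table.iter().find(|(k, _)| k.as_str() == key) {
            return Some(v.clone());
        }
    }
    None // key not present in any generation
}

fn main() {
    let newer = vec![("user:1".to_string(), "v2".to_string())];
    let older = vec![
        ("user:1".to_string(), "v1".to_string()),
        ("user:2".to_string(), "x".to_string()),
    ];
    // The newer generation shadows the older value for user:1.
    assert_eq!(read_key(&[newer, older], "user:1"), Some("v2".to_string()));
}
```

&lt;p&gt;Because no table is ever modified in place, a reader that holds references to a set of generations sees a consistent snapshot for its whole scan.&lt;/p&gt;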

&lt;h2&gt;
  
  
  Step 3 — Indexing for Random Reads
&lt;/h2&gt;

&lt;p&gt;Storing all data in memtables would be inefficient for reads, so each SSTable carries auxiliary structures to locate keys quickly: a bloom filter rules out keys that are definitely absent, and a sparse index narrows a lookup to the block that may contain the key, so we never need to load the full file into memory. These checks are fast and require minimal disk I/O, keeping reads efficient for large datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;HashSet&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;check_bloom_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sst_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check bit array to determine existence probability&lt;/span&gt;
    &lt;span class="n"&gt;sst_key&lt;/span&gt;&lt;span class="nf"&gt;.contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"valid"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A bloom filter may return a false positive (reporting that an absent key might exist) but never a false negative, so a key that was actually written is never missed.&lt;/p&gt;
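&lt;p&gt;The &lt;code&gt;check_bloom_filter&lt;/code&gt; stub above only simulates the check. A minimal real bloom filter sketch (hypothetical sizing, with two hash functions built from the standard library's &lt;code&gt;DefaultHasher&lt;/code&gt;) looks like this:&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal bloom filter sketch (hypothetical sizing): k = 2 hash functions over
// a fixed bit array. Real filters derive the array size and hash count from
// the expected key count and a target false-positive rate.
struct Bloom {
    bits: Vec<bool>,
}

impl Bloom {
    fn new(m: usize) -> Self {
        Bloom { bits: vec![false; m] }
    }

    // Derive two bit positions from the key by salting the second hash.
    fn positions(&self, key: &str) -> [usize; 2] {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (key, 0x9e3779b97f4a7c15u64).hash(&mut h2);
        [
            (h1.finish() as usize) % self.bits.len(),
            (h2.finish() as usize) % self.bits.len(),
        ]
    }

    fn insert(&mut self, key: &str) {
        for i in self.positions(key) {
            self.bits[i] = true;
        }
    }

    // May say "maybe" for an absent key (false positive), but never says
    // "no" for a key that was inserted (no false negatives).
    fn might_contain(&self, key: &str) -> bool {
        self.positions(key).iter().all(|&i| self.bits[i])
    }
}

fn main() {
    let mut filter = Bloom::new(64);
    filter.insert("user:1");
    assert!(filter.might_contain("user:1"));
}
```

&lt;p&gt;A negative answer lets the read path skip the SSTable entirely; only a "maybe" costs a disk read.&lt;/p&gt;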

&lt;h2&gt;
  
  
  Step 4 — Asynchronous Compaction
&lt;/h2&gt;

&lt;p&gt;Over time the number of SSTable files grows, and superseded versions of keys accumulate across them. A background compaction process merges these files, discarding duplicate values and deleted entries (tombstones), so storage usage stays proportional to the live data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;compact_sstables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sst_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Simulate merging files into a new sorted file&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sst_files&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"merged-{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;merged&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This process runs on a separate thread to ensure that write throughput is not degraded by background cleanup tasks.&lt;/p&gt;
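&lt;p&gt;The &lt;code&gt;compact_sstables&lt;/code&gt; stub above only renames files. A sketch of the actual merge logic, assuming each run is a sorted list of key/value pairs and a &lt;code&gt;None&lt;/code&gt; value marks a tombstone (both assumptions for this sketch), could look like:&lt;/p&gt;

```rust
use std::collections::BTreeMap;

// Compaction sketch: merge an older and a newer sorted run. For duplicate keys
// the newer value wins; a None value is a tombstone and removes the key.
// The (key, Option<value>) run format is an assumption for this sketch.
fn compact(
    older: &[(String, Option<String>)],
    newer: &[(String, Option<String>)],
) -> Vec<(String, String)> {
    let mut merged: BTreeMap<String, Option<String>> = BTreeMap::new();
    // Later inserts overwrite earlier ones, so newer entries shadow older ones.
    for (k, v) in older.iter().chain(newer.iter()) {
        merged.insert(k.clone(), v.clone());
    }
    merged
        .into_iter()
        .filter_map(|(k, v)| v.map(|val| (k, val))) // drop tombstoned keys
        .collect()
}

fn main() {
    let older = vec![
        ("a".to_string(), Some("1".to_string())),
        ("b".to_string(), Some("2".to_string())),
    ];
    let newer = vec![
        ("a".to_string(), Some("3".to_string())),
        ("b".to_string(), None), // tombstone: delete "b"
    ];
    assert_eq!(compact(&older, &newer), vec![("a".to_string(), "3".to_string())]);
}
```

&lt;p&gt;A production compactor would stream a k-way merge over the sorted files instead of materializing a map, but the shadowing and tombstone rules are the same.&lt;/p&gt;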

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Amplification:&lt;/strong&gt; LSM trees reduce write amplification by grouping updates and writing them sequentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential I/O:&lt;/strong&gt; By avoiding random seeks, the system utilizes the high bandwidth of modern storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durability:&lt;/strong&gt; Flushed data is persisted in immutable SSTables on disk; entries still in the memtable are typically covered by a write-ahead log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency:&lt;/strong&gt; The write buffer allows high concurrency without contention on the storage device.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Explore immutable snapshots for point-in-time recovery.&lt;/li&gt;
&lt;li&gt;Investigate how memtables serve concurrent reads and how read amplification is kept in check.&lt;/li&gt;
&lt;li&gt;Consider the trade-offs when implementing bloom filters in low-memory environments.&lt;/li&gt;
&lt;li&gt;Compare LSM tree implementations across database engines such as RocksDB, LevelDB, and Apache Cassandra.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4saY8oe" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications (Kleppmann)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4c2jE8D" rel="noopener noreferrer"&gt;Computer Systems: A Programmer's Perspective (Bryant &amp;amp; O'Hallaron)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4mekBiY" rel="noopener noreferrer"&gt;Python Crash Course (Matthes)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/41FQGXh" rel="noopener noreferrer"&gt;Cracking the Coding Interview (McDowell)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4m8wG9e" rel="noopener noreferrer"&gt;A Philosophy of Software Design (Ousterhout)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/3O0yNPF" rel="noopener noreferrer"&gt;AI Engineering (Chip Huyen)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://amzn.to/4sPlPDL" rel="noopener noreferrer"&gt;Learn Rust in a Month of Lunches (MacLeod)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributed</category>
      <category>systems</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
