<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prajwal Mahajan</title>
    <description>The latest articles on DEV Community by Prajwal Mahajan (@prajwalmahajan101).</description>
    <link>https://dev.to/prajwalmahajan101</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976522%2F5933f324-01bd-4c69-bcd4-63e8cff079ab.png</url>
      <title>DEV Community: Prajwal Mahajan</title>
      <link>https://dev.to/prajwalmahajan101</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prajwalmahajan101"/>
    <language>en</language>
    <item>
      <title>Building toykv: a from-scratch persistent KV in Go, and why I took the opposite call from toymq three times</title>
      <dc:creator>Prajwal Mahajan</dc:creator>
      <pubDate>Wed, 17 Jun 2026 16:39:53 +0000</pubDate>
      <link>https://dev.to/prajwalmahajan101/building-toykv-a-from-scratch-persistent-kv-in-go-and-why-i-took-the-opposite-call-from-toymq-5862</link>
      <guid>https://dev.to/prajwalmahajan101/building-toykv-a-from-scratch-persistent-kv-in-go-and-why-i-took-the-opposite-call-from-toymq-5862</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A retrospective on &lt;a href="https://github.com/prajwalmahajan101/toykv" rel="noopener noreferrer"&gt;&lt;code&gt;toykv&lt;/code&gt;&lt;/a&gt; — a single-node, in-memory key-value store with append-only-file persistence, RESP2-compatible on the wire, three binaries, and a v1.0.0 I shipped after about four weeks of work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR — and the rule I want you to remember
&lt;/h2&gt;

&lt;p&gt;Some rules generalise across systems by inversion. &lt;a href="https://github.com/prajwalmahajan101/toymq" rel="noopener noreferrer"&gt;&lt;code&gt;toymq&lt;/code&gt;&lt;/a&gt; taught me &lt;strong&gt;zero values are not sentinels&lt;/strong&gt;. &lt;a href="https://github.com/prajwalmahajan101/toykv" rel="noopener noreferrer"&gt;&lt;code&gt;toykv&lt;/code&gt;&lt;/a&gt; taught me &lt;strong&gt;durations are not deadlines&lt;/strong&gt;. Same shape of bug, opposite axis — and along the way I ended up answering three load-bearing questions in the exact opposite direction from &lt;code&gt;toymq&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I shipped &lt;code&gt;toymq&lt;/code&gt; with &lt;strong&gt;no version byte&lt;/strong&gt; on its WAL. &lt;code&gt;toykv&lt;/code&gt;'s AOF starts with one.&lt;/li&gt;
&lt;li&gt;I gave &lt;code&gt;toymq&lt;/code&gt; its &lt;strong&gt;own wire protocol&lt;/strong&gt; and a separate WAL record frame. In &lt;code&gt;toykv&lt;/code&gt; I write the same RESP arrays I speak on the wire straight into the AOF.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;toymq&lt;/code&gt; had its own &lt;strong&gt;delivery semantics&lt;/strong&gt; to defend. For &lt;code&gt;toykv&lt;/code&gt; I chose RESP2 specifically so &lt;code&gt;redis-cli&lt;/code&gt; and &lt;code&gt;go-redis/v9&lt;/code&gt; become free third-party test harnesses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The proof those opposite answers were right is the &lt;strong&gt;crash matrix&lt;/strong&gt; — nine rows, each pinned by a test file that exists today, each test owned by the part of the system that introduced the risk. Every row is cheap because of one of those three decisions.&lt;/p&gt;

&lt;p&gt;The hero rule of &lt;code&gt;toykv&lt;/code&gt; — sibling of &lt;code&gt;toymq&lt;/code&gt;'s &lt;code&gt;lastAcked uint64 == 0&lt;/code&gt; zero-value-sentinel bug, reached from the other direction — is one line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Absolute deadlines, never relative durations.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Accept &lt;code&gt;EX 5&lt;/code&gt; on the wire; write &lt;code&gt;PEXPIREAT 1718659200000&lt;/code&gt; to disk.&lt;/p&gt;

&lt;p&gt;The numbers: ~5k LOC of Go, 6 ADRs (I budgeted four), 15 journal entries, three binaries (&lt;code&gt;toykv&lt;/code&gt;, &lt;code&gt;toykv-cli&lt;/code&gt;, &lt;code&gt;toykv-tui&lt;/code&gt;), v1.0.0 tagged on 2026-06-17.&lt;/p&gt;

&lt;h2&gt;
  
  
  What toykv is, and isn't
&lt;/h2&gt;

&lt;p&gt;A learning artifact. Single-node, in-memory, no auth, no TLS, strings only. 18 commands from the Redis surface — &lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;SET&lt;/code&gt; (with &lt;code&gt;NX&lt;/code&gt;/&lt;code&gt;XX&lt;/code&gt;/&lt;code&gt;EX&lt;/code&gt;/&lt;code&gt;PX&lt;/code&gt;/&lt;code&gt;EXAT&lt;/code&gt;/&lt;code&gt;PXAT&lt;/code&gt;), &lt;code&gt;DEL&lt;/code&gt;, &lt;code&gt;INCR&lt;/code&gt;/&lt;code&gt;DECR&lt;/code&gt;, &lt;code&gt;KEYS&lt;/code&gt;, &lt;code&gt;FLUSHDB&lt;/code&gt;, &lt;code&gt;EXPIRE&lt;/code&gt;/&lt;code&gt;TTL&lt;/code&gt;/&lt;code&gt;PEXPIRE&lt;/code&gt;/&lt;code&gt;PEXPIREAT&lt;/code&gt;/&lt;code&gt;PERSIST&lt;/code&gt;, &lt;code&gt;BGREWRITEAOF&lt;/code&gt;, and a handful of metadata commands. Under &lt;code&gt;appendfsync=always&lt;/code&gt; (the default), throughput tops out in the low thousands of SETs per second on commodity NVMe — a per-write &lt;code&gt;fsync&lt;/code&gt; is the durability commit point and the cost is honest. If you need a real KV store, reach for Redis or Valkey. If you want to &lt;em&gt;understand&lt;/em&gt; what "durable" means in one, build one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The crash matrix
&lt;/h2&gt;

&lt;p&gt;I lay it out the way I read it: every row of fault surface has exactly one test file, and the test lives next to the layer that introduced the risk. Composed-fault tests sit on their own at the bottom because composing faults is a different job from proving any one of them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fault surface&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Test file&lt;/th&gt;
&lt;th&gt;Invariant proven&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AOF append + replay&lt;/td&gt;
&lt;td&gt;acked SET / DEL lost on SIGKILL under &lt;code&gt;fsync=always&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/server/aof_crash_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every record acknowledged before the kill replays on restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partial-tail handling&lt;/td&gt;
&lt;td&gt;crash mid-record corrupts replay state&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/aof/replayer_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Replay rejects a torn record; offset reported, server refuses to serve until truncated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTL lock-upgrade race&lt;/td&gt;
&lt;td&gt;sweeper drops an unexpired key under load&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/store/sweeper_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;N writers × 1 Hz sweeper → zero spurious &lt;code&gt;(nil)&lt;/code&gt; for unexpired keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTL crash round-trip&lt;/td&gt;
&lt;td&gt;TTL records lost across SIGKILL + restart&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/server/aof_ttl_crash_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;v2 replay accepts v1 records; expiry decodes round-trip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BGREWRITEAOF&lt;/code&gt; during writes&lt;/td&gt;
&lt;td&gt;rewrite races a concurrent SET; tail loses data&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/aof/rewriter_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Side-buffer captures all live appends; rewrite + tail merge with no loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash during rewrite&lt;/td&gt;
&lt;td&gt;partial &lt;code&gt;.aof.tmp&lt;/code&gt; left, no canonical AOF survives&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/server/aof_rewrite_crash_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exactly one of &lt;code&gt;{old, new}&lt;/code&gt; present after restart; replay consistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shipped binary protocol drift&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;redis-cli&lt;/code&gt; / &lt;code&gt;go-redis&lt;/code&gt; regression invisible to in-process tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;test/e2e/protocol_*_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Byte-compat for every shipped command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BGREWRITEAOF&lt;/code&gt; + restart&lt;/td&gt;
&lt;td&gt;restart after compaction loses data&lt;/td&gt;
&lt;td&gt;&lt;code&gt;test/e2e/rewrite_restart_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mixed-workload state survives rewrite + restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Composed faults (kill + pause + rewrite + writes)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;failure modes only visible when multiple faults overlap&lt;/td&gt;
&lt;td&gt;&lt;code&gt;test/chaos/invariants_test.go&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Acked-SET survival, monotonic INCR, no panic across soak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The protocol-compatibility layer (rows 7–8) deliberately introduces &lt;em&gt;zero&lt;/em&gt; new crash invariants. It verifies that the bytes leaving the socket match the spec third-party clients expect; everything below the socket has already been proven elsewhere. The chaos suite (row 9) is the only layer that &lt;em&gt;composes&lt;/em&gt; faults. Used as the primary durability proof, composed-fault tests are slow and flaky; used as a release-confidence soak after each component is independently proven, they are exactly what you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The thesis I want you to take away:&lt;/strong&gt; every row is cheap because of one of the three decisions below. Where a crash matrix becomes expensive — where the test file is 800 lines, or two rows have to share state — that's usually the symptom of an architecture decision made elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBMMVsiU3RvcmFnZTxici8%2BQU9GIGFwcGVuZCArIHJlcGxheTxici8%2BcGFydGlhbC10YWlsIl0gLS0%2BIEwyWyJUVEw8YnIvPmxvY2stdXBncmFkZSByYWNlPGJyLz5jcmFzaCByb3VuZC10cmlwIl0KICAgIEwyIC0tPiBMM1siTGl2ZSByZXdyaXRlPGJyLz5kdWFsLXdyaXRlPGJyLz5jcmFzaCBkdXJpbmcgcmV3cml0ZSJdCiAgICBMMyAtLT4gTDRbIlByb3RvY29sIGNvbXBhdDxici8%2BYnl0ZS1jb21wYXQ8YnIvPnJld3JpdGUgKyByZXN0YXJ0Il0KICAgIEw0IC0tPiBMNVsiQ29tcG9zZWQgY2hhb3M8YnIvPmtpbGwgKyBwYXVzZSArIHJld3JpdGUiXQogICAgc3R5bGUgTDEgZmlsbDojZGJlYWZlLHN0cm9rZTojMWQ0ZWQ4CiAgICBzdHlsZSBMMiBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMxZDRlZDgKICAgIHN0eWxlIEwzIGZpbGw6I2RiZWFmZSxzdHJva2U6IzFkNGVkOAogICAgc3R5bGUgTDQgZmlsbDojZmVmM2M3LHN0cm9rZTojZDk3NzA2CiAgICBzdHlsZSBMNSBmaWxsOiNkY2ZjZTcsc3Ryb2tlOiMxNmEzNGE%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBMMVsiU3RvcmFnZTxici8%2BQU9GIGFwcGVuZCArIHJlcGxheTxici8%2BcGFydGlhbC10YWlsIl0gLS0%2BIEwyWyJUVEw8YnIvPmxvY2stdXBncmFkZSByYWNlPGJyLz5jcmFzaCByb3VuZC10cmlwIl0KICAgIEwyIC0tPiBMM1siTGl2ZSByZXdyaXRlPGJyLz5kdWFsLXdyaXRlPGJyLz5jcmFzaCBkdXJpbmcgcmV3cml0ZSJdCiAgICBMMyAtLT4gTDRbIlByb3RvY29sIGNvbXBhdDxici8%2BYnl0ZS1jb21wYXQ8YnIvPnJld3JpdGUgKyByZXN0YXJ0Il0KICAgIEw0IC0tPiBMNVsiQ29tcG9zZWQgY2hhb3M8YnIvPmtpbGwgKyBwYXVzZSArIHJld3JpdGUiXQogICAgc3R5bGUgTDEgZmlsbDojZGJlYWZlLHN0cm9rZTojMWQ0ZWQ4CiAgICBzdHlsZSBMMiBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMxZDRlZDgKICAgIHN0eWxlIEwzIGZpbGw6I2RiZWFmZSxzdHJva2U6IzFkNGVkOAogICAgc3R5bGUgTDQgZmlsbDojZmVmM2M3LHN0cm9rZTojZDk3NzA2CiAgICBzdHlsZSBMNSBmaWxsOiNkY2ZjZTcsc3Ryb2tlOiMxNmEzNGE%3Ftype%3Dpng" alt="Crash matrix layer flow: storage → TTL → live rewrite → protocol compatibility → composed chaos"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first three layers introduce new fault surface, each proven in isolation by a dedicated crash test. The protocol layer verifies the shipped binary speaks the spec; it adds no new crash invariants. The chaos layer composes everything underneath into one soak. That shape is the proof the architecture is paying its rent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inversion 1 — a version byte you'll thank yourself for, and absolute deadlines
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;toymq&lt;/code&gt;'s WAL has no version byte. The argument I wrote in that retrospective is good: defending the wrong byte forever is worse than the one-line migration of adding one later.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;toykv&lt;/code&gt;'s AOF starts with eight bytes: &lt;code&gt;"TOYKV\x00\x00"&lt;/code&gt; plus a one-byte version. The reasoning is also good, for a different shape of problem: &lt;strong&gt;the AOF outlives the binary that wrote it.&lt;/strong&gt; If I run &lt;code&gt;toykv&lt;/code&gt; on a tempdir on Monday and the same dir with a newer binary on Friday, the Friday binary has to know what assumption the Monday bytes were committed under. There is no person to ask. The byte is the answer.&lt;/p&gt;

&lt;p&gt;I made the call before I wrote a line of replay code, and wrote it up as an ADR alongside the first AOF commit. It paid for itself inside four weeks: I needed a new on-disk shape for TTL, and the version byte let v2 binaries accept v1 files transparently. The replay loop is the same eight-line dispatch table; v2 just knows about a &lt;code&gt;PEXPIREAT&lt;/code&gt; token v1 didn't.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; the cheaper migration is the one your future self can opt into, not the one your replay code is forced into. A version byte you don't use is a no-op. A version byte you didn't write is a migration script.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Absolute deadlines, never relative durations
&lt;/h3&gt;

&lt;p&gt;The companion ADR made the second call: every TTL command I accept on the wire rewrites to absolute &lt;code&gt;PEXPIREAT &amp;lt;unix-ms&amp;gt;&lt;/code&gt; before it touches the disk. A wire &lt;code&gt;SET k v EX 5&lt;/code&gt; becomes an AOF &lt;code&gt;SET k v&lt;/code&gt; + &lt;code&gt;PEXPIREAT k &amp;lt;now-ms + 5000&amp;gt;&lt;/code&gt;. The wire stays ergonomic; the disk stays unambiguous.&lt;/p&gt;
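
&lt;p&gt;A minimal sketch of that wire-to-disk rewrite, under stated assumptions: &lt;code&gt;canonicaliseSet&lt;/code&gt; and its record shape are hypothetical names for illustration, not the functions in the toykv repository. It only shows the arithmetic: relative options are resolved against a single "now" in milliseconds, absolute options pass through.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// canonicaliseSet turns a wire-level SET with an optional expiry into the
// records a log could store: a plain SET plus an absolute PEXPIREAT.
// The name and the [][]string record shape are illustrative assumptions.
func canonicaliseSet(key, value, opt string, n, nowMs int64) [][]string {
	records := [][]string{{"SET", key, value}}
	var deadlineMs int64
	switch opt {
	case "EX": // relative seconds
		deadlineMs = nowMs + n*1000
	case "PX": // relative milliseconds
		deadlineMs = nowMs + n
	case "EXAT": // absolute seconds
		deadlineMs = n * 1000
	case "PXAT": // absolute milliseconds
		deadlineMs = n
	default: // no expiry option: nothing to canonicalise
		return records
	}
	expire := []string{"PEXPIREAT", key, strconv.FormatInt(deadlineMs, 10)}
	return append(records, expire)
}

func main() {
	// Fixed clock so the output is reproducible.
	fmt.Println(canonicaliseSet("k", "v", "EX", 5, 1718659195000))
	// [[SET k v] [PEXPIREAT k 1718659200000]]

	// In a server the reference point would be read once per command.
	fmt.Println(canonicaliseSet("k", "v", "PX", 1500, time.Now().UnixMilli()))
}
```

&lt;p&gt;Reading the clock exactly once per command is the design point worth copying: the in-memory deadline and the on-disk deadline then come from the same number.&lt;/p&gt;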

&lt;p&gt;Imagine the alternative. Store &lt;code&gt;EX 5&lt;/code&gt; verbatim in the AOF. Crash. Restart twelve hours later. Replay reads &lt;code&gt;EX 5&lt;/code&gt; and applies it against &lt;code&gt;time.Now()&lt;/code&gt; — every TTL in the file just got extended by twelve hours. Worse, the bug is silent: keys don't disappear, they linger. The test that catches this is a multi-day fixture, because the only way to notice is to wait past the original expiry. &lt;strong&gt;That's the toykv-shaped sibling of toymq's &lt;code&gt;lastAcked == 0&lt;/code&gt; sentinel collision&lt;/strong&gt; — a value that looks valid until the system restarts in a state it didn't anticipate. Same shape of bug, different axis.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; absolute deadlines, never relative durations. The wire can speak in human-friendly relative time; the disk has to speak in moments. A duration without a reference point isn't a deadline — it's a wish.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the rule I earned. Everything else in the post is the architecture I built to make the rule cheap to enforce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBXSVJFWyJXaXJlIOKAlCBlcmdvbm9taWMsIHJlbGF0aXZlIl0KICAgICAgICBXMVsiU0VUIGsgdiBFWCA1Il0KICAgICAgICBXMlsiRVhQSVJFIGsgMzAiXQogICAgICAgIFczWyJQRVhQSVJFIGsgMTUwMCJdCiAgICBlbmQKICAgIHN1YmdyYXBoIENBTk9OWyJDYW5vbmljYWxpc2VyIOKAlCBjb21wdXRlcyBub3coKSArIM6UIl0KICAgICAgICBDMVsibm93X21zID0gdGltZS5Ob3coKS5Vbml4TWlsbGkoKSJdCiAgICBlbmQKICAgIHN1YmdyYXBoIERJU0tbIkFPRiDigJQgYWJzb2x1dGUsIGRlY2lzaW9uLWZyZWUiXQogICAgICAgIEQxQVsiU0VUIGsgdiJdCiAgICAgICAgRDFCWyJQRVhQSVJFQVQgayAxNzE4NjU5MjAwMDAwIl0KICAgICAgICBEMlsiUEVYUElSRUFUIGsgMTcxODY1OTIyNTAwMCJdCiAgICAgICAgRDNbIlBFWFBJUkVBVCBrIDE3MTg2NTkyMjE1MDAiXQogICAgZW5kCiAgICBXMSAtLT4gQzEgLS0%2BIEQxQQogICAgQzEgLS0%2BIEQxQgogICAgVzIgLS0%2BIEMxIC0tPiBEMgogICAgVzMgLS0%2BIEMxIC0tPiBEMwogICAgc3R5bGUgV0lSRSBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMxZDRlZDgKICAgIHN0eWxlIENBTk9OIGZpbGw6I2ZlZjNjNyxzdHJva2U6I2Q5NzcwNgogICAgc3R5bGUgRElTSyBmaWxsOiNkY2ZjZTcsc3Ryb2tlOiMxNmEzNGE%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBXSVJFWyJXaXJlIOKAlCBlcmdvbm9taWMsIHJlbGF0aXZlIl0KICAgICAgICBXMVsiU0VUIGsgdiBFWCA1Il0KICAgICAgICBXMlsiRVhQSVJFIGsgMzAiXQogICAgICAgIFczWyJQRVhQSVJFIGsgMTUwMCJdCiAgICBlbmQKICAgIHN1YmdyYXBoIENBTk9OWyJDYW5vbmljYWxpc2VyIOKAlCBjb21wdXRlcyBub3coKSArIM6UIl0KICAgICAgICBDMVsibm93X21zID0gdGltZS5Ob3coKS5Vbml4TWlsbGkoKSJdCiAgICBlbmQKICAgIHN1YmdyYXBoIERJU0tbIkFPRiDigJQgYWJzb2x1dGUsIGRlY2lzaW9uLWZyZWUiXQogICAgICAgIEQxQVsiU0VUIGsgdiJdCiAgICAgICAgRDFCWyJQRVhQSVJFQVQgayAxNzE4NjU5MjAwMDAwIl0KICAgICAgICBEMlsiUEVYUElSRUFUIGsgMTcxODY1OTIyNTAwMCJdCiAgICAgICAgRDNbIlBFWFBJUkVBVCBrIDE3MTg2NTkyMjE1MDAiXQogICAgZW5kCiAgICBXMSAtLT4gQzEgLS0%2BIEQxQQogICAgQzEgLS0%2BIEQxQgogICAgVzIgLS0%2BIEMxIC0tPiBEMgogICAgVzMgLS0%2BIEMxIC0tPiBEMwogICAgc3R5bGUgV0lSRSBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMxZDRlZDgKICAgIHN0eWxlIENBTk9OIGZpbGw6I2ZlZjNjNyxzdHJva2U6I2Q5NzcwNgogICAgc3R5bGUgRElTSyBmaWxsOiNkY2ZjZTcsc3Ryb2tlOiMxNmEzNGE%3Ftype%3Dpng" alt="Wire-to-disk TTL canonicalisation: relative durations on the wire collapse through now_ms to absolute PEXPIREAT records on disk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Replay reads &lt;code&gt;PEXPIREAT k 1718659200000&lt;/code&gt;. The math works regardless of when replay runs: a year later, the answer is "this key was already expired"; a millisecond after the original write, "this key has 4.999 seconds left." There is no clock to argue with.&lt;/p&gt;
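
&lt;p&gt;The replay-side arithmetic is a single subtraction. A small sketch, assuming a hypothetical helper &lt;code&gt;remainingMs&lt;/code&gt; (not a toykv function) and assuming replay may simply skip keys whose deadline has passed:&lt;/p&gt;

```go
package main

import "fmt"

// remainingMs reports how long a key with an absolute deadline has left at
// replay time. ok is false once the deadline has passed, in which case a
// replayer could drop the key instead of loading it.
func remainingMs(deadlineMs, nowMs int64) (left int64, ok bool) {
	left = deadlineMs - nowMs
	if left > 0 {
		return left, true
	}
	return 0, false
}

func main() {
	const deadline = 1718659200000
	const yearMs = 365 * 24 * 3600 * 1000

	fmt.Println(remainingMs(deadline, deadline-4999))   // 4999 true
	fmt.Println(remainingMs(deadline, deadline+yearMs)) // 0 false
}
```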

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBTIGFzIFNlcnZlci5OZXcKICAgIHBhcnRpY2lwYW50IFIgYXMgYW9mLlJlcGxheWVyCiAgICBwYXJ0aWNpcGFudCBEIGFzIFNlcnZlci5kaXNwYXRjaAogICAgcGFydGljaXBhbnQgU1QgYXMgU3RvcmUKCiAgICBTLT4%2BUjogT3BlbihkaXIpCiAgICBSLT4%2BUjogcmVhZCA4LWJ5dGUgaGVhZGVyCiAgICBOb3RlIG92ZXIgUjogIlRPWUtWXDBcMCIgKyB2ZXJzaW9uIGJ5dGUKICAgIGFsdCBoZWFkZXIgbWlzc2luZwogICAgICAgIFItLT4%2BUzogaW8uRU9GIChmcmVzaCBkaXIsIG5vIEFPRiB5ZXQpCiAgICBlbHNlIHZlcnNpb24gdW5rbm93bgogICAgICAgIFItLT4%2BUzogRXJyVmVyc2lvblVuc3VwcG9ydGVkCiAgICBlbmQKICAgIGxvb3AgbmV4dCByZWNvcmQKICAgICAgICBSLT4%2BUjogcGFyc2UgUkVTUCBhcnJheQogICAgICAgIFItPj5EOiByZXBsYXlBcHBseShhcmd2KQogICAgICAgIE5vdGUgb3ZlciBEOiBzYW1lIGRpc3BhdGNoIGFzIGxpdmU8YnIvPmFvZiA9PSBuaWwsIG5vIHJlLWFwcGVuZAogICAgICAgIEQtPj5TVDogU0VUIC8gREVMIC8gU0VURVggLyBQRVhQSVJFQVQgLi4uCiAgICBlbmQKICAgIFItLT4%2BUzogYnl0ZXMsIHJlY29yZHMsIGR1cmF0aW9u%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBTIGFzIFNlcnZlci5OZXcKICAgIHBhcnRpY2lwYW50IFIgYXMgYW9mLlJlcGxheWVyCiAgICBwYXJ0aWNpcGFudCBEIGFzIFNlcnZlci5kaXNwYXRjaAogICAgcGFydGljaXBhbnQgU1QgYXMgU3RvcmUKCiAgICBTLT4%2BUjogT3BlbihkaXIpCiAgICBSLT4%2BUjogcmVhZCA4LWJ5dGUgaGVhZGVyCiAgICBOb3RlIG92ZXIgUjogIlRPWUtWXDBcMCIgKyB2ZXJzaW9uIGJ5dGUKICAgIGFsdCBoZWFkZXIgbWlzc2luZwogICAgICAgIFItLT4%2BUzogaW8uRU9GIChmcmVzaCBkaXIsIG5vIEFPRiB5ZXQpCiAgICBlbHNlIHZlcnNpb24gdW5rbm93bgogICAgICAgIFItLT4%2BUzogRXJyVmVyc2lvblVuc3VwcG9ydGVkCiAgICBlbmQKICAgIGxvb3AgbmV4dCByZWNvcmQKICAgICAgICBSLT4%2BUjogcGFyc2UgUkVTUCBhcnJheQogICAgICAgIFItPj5EOiByZXBsYXlBcHBseShhcmd2KQogICAgICAgIE5vdGUgb3ZlciBEOiBzYW1lIGRpc3BhdGNoIGFzIGxpdmU8YnIvPmFvZiA9PSBuaWwsIG5vIHJlLWFwcGVuZAogICAgICAgIEQtPj5TVDogU0VUIC8gREVMIC8gU0VURVggLyBQRVhQSVJFQVQgLi4uCiAgICBlbmQKICAgIFItLT4%2BUzogYnl0ZXMsIHJlY29yZHMsIGR1cmF0aW9u%3Ftype%3Dpng" alt="AOF replay header + version dispatch: 8-byte magic + version, then loop over RESP records, dispatching through the live server table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The AOF on disk
&lt;/h3&gt;

&lt;p&gt;The file format is deliberately boring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/
└── appendonly.aof    ← header + RESP-encoded record stream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One file. No manifest, no index, no segment rotation in v1. The header is eight bytes; everything after is a stream of RESP arrays. The recovery scan walks the file from offset zero on &lt;code&gt;Open&lt;/code&gt;: linear in the size of the file, but correct by construction. A manifest would be a second consistency problem on top of the first.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;magic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bytes&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Literal &lt;code&gt;"TOYKV\x00\x00"&lt;/code&gt; — fails fast on a wrong file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;version&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;0x01&lt;/code&gt; (strings only) or &lt;code&gt;0x02&lt;/code&gt; (adds &lt;code&gt;PEXPIREAT&lt;/code&gt; for TTL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;record_*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RESP&lt;/td&gt;
&lt;td&gt;variable&lt;/td&gt;
&lt;td&gt;Same &lt;code&gt;*&amp;lt;n&amp;gt;\r\n$&amp;lt;len&amp;gt;\r\n…&lt;/code&gt; frame the wire speaks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
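
&lt;p&gt;To make the header layout concrete, here is a hedged sketch of writing and validating those eight bytes. The helper names (&lt;code&gt;writeHeader&lt;/code&gt;, &lt;code&gt;readHeader&lt;/code&gt;, &lt;code&gt;roundTrip&lt;/code&gt;) and the two error values are assumptions for illustration; only the magic string, the one-byte version and the two known version numbers come from the table above.&lt;/p&gt;

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

var magic = []byte("TOYKV\x00\x00") // 7 bytes, followed by a 1-byte version

const maxVersion = 0x02 // newest format this sketch understands

// Hypothetical error values; the real project may name them differently.
var (
	errBadMagic    = errors.New("aof: bad magic")
	errUnsupported = errors.New("aof: unsupported version")
)

// writeHeader emits magic plus version as one 8-byte write.
func writeHeader(w io.Writer, version byte) error {
	_, err := w.Write(append(append([]byte{}, magic...), version))
	return err
}

// readHeader consumes 8 bytes and returns the format version.
// io.EOF means the file was empty; io.ErrUnexpectedEOF means it was short.
func readHeader(r io.Reader) (byte, error) {
	hdr := make([]byte, 8)
	if _, err := io.ReadFull(r, hdr); err != nil {
		return 0, err
	}
	if !bytes.Equal(hdr[:7], magic) {
		return 0, errBadMagic
	}
	v := hdr[7]
	if v == 0 || v > maxVersion {
		return 0, errUnsupported
	}
	return v, nil
}

// roundTrip writes a header into memory and reads it straight back.
func roundTrip(version byte) (byte, error) {
	buf := new(bytes.Buffer)
	if err := writeHeader(buf, version); err != nil {
		return 0, err
	}
	return readHeader(buf)
}

func main() {
	v, err := roundTrip(0x01)
	fmt.Println(v, err == nil) // 1 true

	_, err = roundTrip(0x03)
	fmt.Println(errors.Is(err, errUnsupported)) // true

	_, err = readHeader(bytes.NewReader([]byte("REDIS0011")))
	fmt.Println(errors.Is(err, errBadMagic)) // true
}
```

&lt;p&gt;A reader that accepts every version up to its own maximum is what lets a newer binary open an older file without a migration step.&lt;/p&gt;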

&lt;p&gt;There is no per-record CRC. RESP framing already catches truncated bulks and mismatched array counts, though not a flipped bit inside a bulk payload; adding a CRC for a learning artefact is belt-and-braces I deliberately deferred. The first real-world corruption that slips past RESP-level parsing is the moment to add one, not before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inversion 2 — write the bytes you'd send
&lt;/h2&gt;

&lt;p&gt;The journal entry that wrote itself, &lt;code&gt;docs/journal/02-aof.md&lt;/code&gt;, has this line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The whole AOF format is RESP arrays on disk. &lt;code&gt;Append&lt;/code&gt; calls the existing &lt;code&gt;resp.Writer&lt;/code&gt; against a buffered file handle; &lt;code&gt;Replay&lt;/code&gt; calls the existing &lt;code&gt;resp.Reader&lt;/code&gt;. One codec, two consumers."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the architectural claim. Everything I did downstream of it is mechanical. The replay path is &lt;strong&gt;nine lines of Go&lt;/strong&gt;: skip the header, read a RESP array, dispatch it, repeat to EOF. During replay &lt;code&gt;s.aof == nil&lt;/code&gt;, so the &lt;code&gt;appendIfLive&lt;/code&gt; call inside each mutating handler is a silent no-op. &lt;strong&gt;The same handlers serve real traffic and crash recovery.&lt;/strong&gt;&lt;/p&gt;
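
&lt;p&gt;The loop described above can be sketched roughly as follows. This is not the toykv source: the tiny RESP decoder here stands in for the project's &lt;code&gt;resp.Reader&lt;/code&gt;, it only understands arrays of bulk strings, and &lt;code&gt;replay&lt;/code&gt;, &lt;code&gt;readArray&lt;/code&gt;, &lt;code&gt;readCount&lt;/code&gt; and &lt;code&gt;replayString&lt;/code&gt; are hypothetical names. It assumes the 8-byte header has already been consumed.&lt;/p&gt;

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readCount parses a header line such as "*3" or "$5" with the given prefix.
func readCount(r *bufio.Reader, prefix string) (uint64, error) {
	line, err := r.ReadString('\n')
	if err != nil {
		return 0, err
	}
	if !strings.HasPrefix(line, prefix) || !strings.HasSuffix(line, "\r\n") {
		return 0, errors.New("resp: malformed header, want " + prefix)
	}
	return strconv.ParseUint(strings.TrimSuffix(line[1:], "\r\n"), 10, 20)
}

// readArray decodes one RESP array of bulk strings.
func readArray(r *bufio.Reader) ([]string, error) {
	n, err := readCount(r, "*")
	if err != nil {
		return nil, err
	}
	argv := make([]string, 0, n)
	for i := uint64(0); i != n; i++ {
		size, err := readCount(r, "$")
		if err != nil {
			return nil, err
		}
		buf := make([]byte, size+2) // payload plus trailing CRLF
		if _, err := io.ReadFull(r, buf); err != nil {
			return nil, err
		}
		if string(buf[size:]) != "\r\n" {
			return nil, errors.New("resp: bulk not terminated by CRLF")
		}
		argv = append(argv, string(buf[:size]))
	}
	return argv, nil
}

// replay applies every record in r and returns how many were applied.
func replay(r io.Reader, apply func(argv []string)) (int, error) {
	br := bufio.NewReader(r)
	for records := 0; ; records++ {
		if _, err := br.Peek(1); err == io.EOF {
			return records, nil // clean end of file at a record boundary
		}
		argv, err := readArray(br)
		if err == io.EOF {
			err = io.ErrUnexpectedEOF // record started but never finished
		}
		if err != nil {
			return records, err
		}
		apply(argv)
	}
}

// replayString replays an in-memory log and returns the applied commands.
func replayString(body string) (int, string, error) {
	var applied []string
	n, err := replay(strings.NewReader(body), func(argv []string) {
		applied = append(applied, strings.Join(argv, " "))
	})
	return n, strings.Join(applied, "; "), err
}

func main() {
	body := "*3\r\n$3\r\nSET\r\n$1\r\na\r\n$1\r\n1\r\n" +
		"*3\r\n$3\r\nSET\r\n$1\r\nb\r\n$1\r\n2\r\n" +
		"*2\r\n$3\r\nDEL\r\n$1\r\na\r\n"
	n, log, err := replayString(body)
	fmt.Println(n, log, err == nil) // 3 SET a 1; SET b 2; DEL a true

	// Chop three bytes off the tail to simulate a crash mid-record.
	n, _, err = replayString(body[:len(body)-3])
	fmt.Println(n, err) // 2 unexpected EOF
}
```

&lt;p&gt;The second call shows the partial-tail row of the crash matrix in miniature: the two complete records apply, and the torn third one surfaces as an error rather than as silent state.&lt;/p&gt;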

&lt;p&gt;The AOF-replay invariant — &lt;em&gt;every record acknowledged before the kill replays on restart&lt;/em&gt; — collapses to a single ordering question: &lt;strong&gt;does the &lt;code&gt;+OK&lt;/code&gt; cross the network before or after &lt;code&gt;file.Sync()&lt;/code&gt; returns?&lt;/strong&gt; Under &lt;code&gt;appendfsync=always&lt;/code&gt;, the answer is &lt;em&gt;after&lt;/em&gt;. The proof is operational: the crash test (&lt;code&gt;internal/server/aof_crash_test.go&lt;/code&gt;) writes ~90 SETs against a self-re-exec child server, sends &lt;code&gt;SIGKILL&lt;/code&gt; mid-stream, restarts, and checks every &lt;code&gt;+OK&lt;/code&gt; against the post-replay state. &lt;code&gt;-count=10&lt;/code&gt; after the first pass found zero flakes.&lt;/p&gt;
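
&lt;p&gt;The ordering question can be shown in a few lines. A sketch, assuming a hypothetical &lt;code&gt;appendAndAck&lt;/code&gt; helper and a throwaway temp directory; the real handler does more, but the sequence write, then &lt;code&gt;Sync&lt;/code&gt;, then reply is the part that matters:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// appendAndAck writes one already-encoded record to the log, waits for
// fsync, and only then writes the reply to the client.
func appendAndAck(aof *os.File, client io.Writer, record []byte) error {
	if _, err := aof.Write(record); err != nil {
		return err // nothing acknowledged, nothing promised
	}
	if err := aof.Sync(); err != nil { // durability commit point
		return err
	}
	_, err := client.Write([]byte("+OK\r\n"))
	return err
}

// demo runs one acknowledged SET against a temporary log file and returns
// the reply the client saw plus the file size afterwards.
func demo() (string, int64, error) {
	dir, err := os.MkdirTemp("", "toykv-sketch-")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	aof, err := os.OpenFile(filepath.Join(dir, "appendonly.aof"),
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return "", 0, err
	}
	defer aof.Close()

	client := new(bytes.Buffer) // stands in for the network connection
	record := []byte("*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n")
	if err := appendAndAck(aof, client, record); err != nil {
		return "", 0, err
	}
	info, err := aof.Stat()
	if err != nil {
		return "", 0, err
	}
	return client.String(), info.Size(), nil
}

func main() {
	reply, size, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q %d\n", reply, size) // "+OK\r\n" 27
}
```

&lt;p&gt;If either the write or the sync fails, the client never sees &lt;code&gt;+OK&lt;/code&gt;, so there is no acknowledged record that replay could fail to find.&lt;/p&gt;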

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgSCBhcyBTZXJ2ZXIuaGFuZGxlcgogICAgcGFydGljaXBhbnQgU1QgYXMgU3RvcmUKICAgIHBhcnRpY2lwYW50IFcgYXMgYW9mLldyaXRlcgogICAgcGFydGljaXBhbnQgRlMgYXMgS2VybmVsL0ZTCgogICAgQy0%2BPkg6IFNFVCBrIHYKICAgIEgtPj5TVDogc3RvcmUuU2V0KGssIHYpCiAgICBTVC0tPj5IOiBvawogICAgSC0%2BPlc6IEFwcGVuZChTRVQsIGssIHYpCiAgICBXLT4%2BVzogZW5jb2RlIFJFU1AgYXJyYXkKICAgIFctPj5GUzogZmlsZS5Xcml0ZShieXRlcykKICAgIE5vdGUgb3ZlciBXLEZTOiBwYWdlIGNhY2hlIG9ubHk8YnIvPnBvd2VyIGxvc3MgaGVyZSBsb3NlcyB0aGVtCiAgICBXLT4%2BRlM6IGZpbGUuU3luYygpICAtLSBmc3luYygyKQogICAgTm90ZSBvdmVyIFcsRlM6IGR1cmFiaWxpdHkgY29tbWl0IHBvaW50PGJyLz5ieXRlcyBvbiBzdGFibGUgc3RvcmFnZQogICAgRlMtLT4%2BVzogb2sKICAgIFctLT4%2BSDogbmlsCiAgICBILS0%2BPkM6ICtPSw%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgSCBhcyBTZXJ2ZXIuaGFuZGxlcgogICAgcGFydGljaXBhbnQgU1QgYXMgU3RvcmUKICAgIHBhcnRpY2lwYW50IFcgYXMgYW9mLldyaXRlcgogICAgcGFydGljaXBhbnQgRlMgYXMgS2VybmVsL0ZTCgogICAgQy0%2BPkg6IFNFVCBrIHYKICAgIEgtPj5TVDogc3RvcmUuU2V0KGssIHYpCiAgICBTVC0tPj5IOiBvawogICAgSC0%2BPlc6IEFwcGVuZChTRVQsIGssIHYpCiAgICBXLT4%2BVzogZW5jb2RlIFJFU1AgYXJyYXkKICAgIFctPj5GUzogZmlsZS5Xcml0ZShieXRlcykKICAgIE5vdGUgb3ZlciBXLEZTOiBwYWdlIGNhY2hlIG9ubHk8YnIvPnBvd2VyIGxvc3MgaGVyZSBsb3NlcyB0aGVtCiAgICBXLT4%2BRlM6IGZpbGUuU3luYygpICAtLSBmc3luYygyKQogICAgTm90ZSBvdmVyIFcsRlM6IGR1cmFiaWxpdHkgY29tbWl0IHBvaW50PGJyLz5ieXRlcyBvbiBzdGFibGUgc3RvcmFnZQogICAgRlMtLT4%2BVzogb2sKICAgIFctLT4%2BSDogbmlsCiAgICBILS0%2BPkM6ICtPSw%3Ftype%3Dpng" alt="Durability commit point: handler writes RESP into the AOF, calls fsync, only then returns 
+OK to the client"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I drew this diagram in the same shape as the WAL diagram in the &lt;code&gt;toymq&lt;/code&gt; retrospective on purpose. A reader who has seen both will notice that &lt;strong&gt;the durability ordering is identical&lt;/strong&gt; — that's the thing the two projects agree on. The codec choice is what's different, and the codec choice is what made the replay path nine lines instead of a hundred.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; write the bytes you'd send. One codec is cheaper than two consistent codecs, and the second codec is where the corruption story you don't want lives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Inversion 3 — RESP because the harness is free
&lt;/h2&gt;

&lt;p&gt;For &lt;code&gt;toymq&lt;/code&gt; I invented a protocol. The defence is sound: the protocol is the wire, the wire is the contract, an invented wire is the moment a project decides it can defend its own boundaries.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;toykv&lt;/code&gt; I chose RESP2. &lt;strong&gt;RESP2 is a testing strategy disguised as a feature.&lt;/strong&gt; The reason is the protocol-compatibility layer of the crash matrix: integration tests against the shipped binary using third-party clients, with &lt;code&gt;redis-cli&lt;/code&gt; for byte-level conformance and &lt;code&gt;go-redis/v9&lt;/code&gt; for the API a real Go consumer would see. That layer shipped without a line of client code I wrote. The clients come for free, written by people who have never heard of &lt;code&gt;toykv&lt;/code&gt;. They don't care about the implementation; they care about the wire. That's exactly the property I wanted.&lt;/p&gt;

&lt;p&gt;The crash-matrix tie-in is the load-bearing one: &lt;strong&gt;the protocol layer owns zero new crash invariants.&lt;/strong&gt; It owns &lt;em&gt;protocol&lt;/em&gt; invariants — TTL &lt;code&gt;-2/-1&lt;/code&gt; sentinels, &lt;code&gt;EXPIRE&lt;/code&gt; against a missing key returning &lt;code&gt;0&lt;/code&gt;, the byte-exact framing of &lt;code&gt;*3\r\n$3\r\nSET\r\n…&lt;/code&gt; — but every command was already covered by a crash test in the layer that introduced it. The protocol layer is the validation that the bytes leaving the socket match the spec the third-party clients expect; everything below the socket has already been proven.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; protocol compatibility is a testing strategy disguised as a feature. Two independent implementations of the wire shake out the spec. No single-author project has that property by default — you have to inherit it deliberately.&lt;/p&gt;
&lt;/blockquote&gt;
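
&lt;p&gt;For readers who have not looked at RESP2 bytes before, here is a small illustrative encoder for the two kinds of protocol invariant mentioned above: the command framing and the &lt;code&gt;TTL&lt;/code&gt; sentinels. &lt;code&gt;encodeCommand&lt;/code&gt; and &lt;code&gt;ttlReply&lt;/code&gt; are hypothetical helpers written for this post, not toykv's codec.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeCommand renders argv as a RESP2 array of bulk strings, which is
// the frame clients such as redis-cli send for every command.
func encodeCommand(argv ...string) string {
	var b strings.Builder
	b.WriteString("*" + strconv.Itoa(len(argv)) + "\r\n")
	for _, a := range argv {
		b.WriteString("$" + strconv.Itoa(len(a)) + "\r\n" + a + "\r\n")
	}
	return b.String()
}

// ttlReply renders the RESP2 integer reply for TTL:
// -2 when the key does not exist, -1 when it exists without an expiry.
func ttlReply(exists, hasExpiry bool, seconds int64) string {
	switch {
	case !exists:
		return ":-2\r\n"
	case !hasExpiry:
		return ":-1\r\n"
	default:
		return ":" + strconv.FormatInt(seconds, 10) + "\r\n"
	}
}

func main() {
	fmt.Printf("%q\n", encodeCommand("SET", "k", "v"))
	// "*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n"
	fmt.Printf("%q\n", ttlReply(false, false, 0)) // ":-2\r\n"
	fmt.Printf("%q\n", ttlReply(true, false, 0))  // ":-1\r\n"
	fmt.Printf("%q\n", ttlReply(true, true, 5))   // ":5\r\n"
}
```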

&lt;h2&gt;
  
  
  Honourable mention 1 — the dual-write &lt;code&gt;bufio&lt;/code&gt; bug
&lt;/h2&gt;

&lt;p&gt;The interesting bug of the project was &lt;em&gt;not&lt;/em&gt; the rename-plus-dir-fsync-plus-fd-swap dance for &lt;code&gt;BGREWRITEAOF&lt;/code&gt; that I'd been bracing for. The invariants I'd written up in the rewrite ADR worked the first time. What broke was the dual-write &lt;code&gt;Append&lt;/code&gt; path inside the rewriter: &lt;code&gt;sideBuf.Len() == 0&lt;/code&gt; after what looked like a successful write.&lt;/p&gt;

&lt;p&gt;Root cause: &lt;code&gt;resp.Writer&lt;/code&gt; wraps its target in a &lt;code&gt;bufio.Writer&lt;/code&gt;. The live path constructs &lt;code&gt;resp.NewWriter(outerBufio)&lt;/code&gt;, and &lt;code&gt;bufio.NewWriter(outerBufio)&lt;/code&gt; is smart enough to short-circuit — it returns the existing &lt;code&gt;*bufio.Writer&lt;/code&gt; rather than wrapping it again. One buffer, one flush. When &lt;code&gt;Rewriter.BeginRewrite&lt;/code&gt; constructed a fresh mirror writer against a &lt;code&gt;*bytes.Buffer&lt;/code&gt;, the argument was no longer a &lt;code&gt;*bufio.Writer&lt;/code&gt;. So &lt;code&gt;bufio.NewWriter(&amp;amp;scratch)&lt;/code&gt; created a brand new inner bufio layer. &lt;strong&gt;Two buffers, two flush points.&lt;/strong&gt; Records were "written" but pinned in an inner bufio nothing else was going to drain before &lt;code&gt;DrainAndSwap&lt;/code&gt; ran. Two-line fix in &lt;a href="https://github.com/prajwalmahajan101/toykv/commit/736bf63" rel="noopener noreferrer"&gt;&lt;code&gt;736bf63&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; any write path with two sinks has two flush points; both must be ordered against the swap. A bufio layer is invisible: the same line of Go behaves differently depending on the dynamic type of its argument. Never construct a fresh wrapper of a swappable type against a non-bufio target without immediately scheduling its flush.&lt;/p&gt;
&lt;/blockquote&gt;
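
&lt;p&gt;The behaviour is easy to reproduce with the standard library alone. The sketch below is a reduction, not the toykv code: &lt;code&gt;sameWriter&lt;/code&gt; and &lt;code&gt;sideLens&lt;/code&gt; are hypothetical helper names, and a &lt;code&gt;bytes.Buffer&lt;/code&gt; stands in for both the file and the side buffer.&lt;/p&gt;

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// sameWriter reports whether wrapping an existing *bufio.Writer hands back
// that same writer. bufio.NewWriter does this when the existing buffer is
// already at least the default size, so there is one buffer and one flush.
func sameWriter() bool {
	outer := bufio.NewWriter(new(bytes.Buffer))
	return bufio.NewWriter(outer) == outer
}

// sideLens writes one record through a fresh bufio layer over a
// *bytes.Buffer and reports the buffer length before and after Flush.
func sideLens() (before, after int) {
	side := new(bytes.Buffer)
	mirror := bufio.NewWriter(side) // not a *bufio.Writer: new inner layer
	mirror.WriteString("*1\r\n$4\r\nPING\r\n")
	before = side.Len() // bytes are still parked inside mirror
	mirror.Flush()
	return before, side.Len()
}

func main() {
	fmt.Println(sameWriter()) // true
	fmt.Println(sideLens())   // 0 14
}
```

&lt;p&gt;The &lt;code&gt;0&lt;/code&gt; is the same symptom as the empty side buffer: the write succeeded, but only into a layer nobody was going to drain.&lt;/p&gt;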

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBIIGFzIFNlcnZlci5oYW5kbGVyCiAgICBwYXJ0aWNpcGFudCBXIGFzIGFvZi5Xcml0ZXIKICAgIHBhcnRpY2lwYW50IFIgYXMgYW9mLlJld3JpdGVyCiAgICBwYXJ0aWNpcGFudCBUTVAgYXMgLmFvZi50bXAKICAgIHBhcnRpY2lwYW50IEFPRiBhcyAuYW9mIChjYW5vbmljYWwpCiAgICBwYXJ0aWNpcGFudCBGUyBhcyBLZXJuZWwvRlMKCiAgICBOb3RlIG92ZXIgVyxSOiBTbmFwc2hvdCArIGR1YWwtd3JpdGUKICAgIEgtPj5SOiBCR1JFV1JJVEVBT0YKICAgIFItPj5XOiBCZWdpblJld3JpdGUoc2lkZUJ1ZikKICAgIE5vdGUgb3ZlciBXOiBsaXZlIEFwcGVuZHMgbm93IGZhbiBvdXQ6PGJyLz5jYW5vbmljYWwgKG91dGVyQnVmaW8pICsgc2lkZSAoaW5uZXJCdWZpbykKICAgIFItPj5SOiBTdG9yZS5TbmFwc2hvdCgpIOKAlCByZW5kZXIgY2Fub25pY2FsIFJFU1AKICAgIFItPj5UTVA6IHdyaXRlIGhlYWRlciArIHNuYXBzaG90IGJ5dGVzCiAgICBILT4%2BVzogQXBwZW5kIFNFVCBrIHYgKGNvbmN1cnJlbnQgd2l0aCBzbmFwc2hvdCkKICAgIFctPj5BT0Y6IHdyaXRlIHRvIGNhbm9uaWNhbCAoZHVyYWJsZSBvbiBmc3luYykKICAgIFctPj5XOiB3cml0ZSB0byBzaWRlQnVmIHZpYSBpbm5lckJ1ZmlvPGJyLz4qKm11c3QgRmx1c2goKSDigJQgdGhlIGR1YWwtd3JpdGUgYnVnKioKCiAgICBOb3RlIG92ZXIgVyxGUzogRHJhaW4gKyBzd2FwCiAgICBSLT4%2BVzogRHJhaW5BbmRTd2FwKCkKICAgIFctPj5XOiBpbm5lckJ1ZmlvLkZsdXNoKCkKICAgIFctPj5UTVA6IGFwcGVuZCBzaWRlQnVmIGJ5dGVzCiAgICBSLT4%2BRlM6IHJlbmFtZSguYW9mLnRtcCwgLmFvZikKICAgIFItPj5GUzogZnN5bmMoZGlyKQogICAgTm90ZSBvdmVyIEFPRjogaW52YXJpYW50OiBleGFjdGx5IG9uZSBvZjxici8%2Be29sZCAuYW9mLCBuZXcgLmFvZn0gcHJlc2VudCBhdCBhbGwgdGltZXM%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBIIGFzIFNlcnZlci5oYW5kbGVyCiAgICBwYXJ0aWNpcGFudCBXIGFzIGFvZi5Xcml0ZXIKICAgIHBhcnRpY2lwYW50IFIgYXMgYW9mLlJld3JpdGVyCiAgICBwYXJ0aWNpcGFudCBUTVAgYXMgLmFvZi50bXAKICAgIHBhcnRpY2lwYW50IEFPRiBhcyAuYW9mIChjYW5vbmljYWwpCiAgICBwYXJ0aWNpcGFudCBGUyBhcyBLZXJuZWwvRlMKCiAgICBOb3RlIG92ZXIgVyxSOiBTbmFwc2hvdCArIGR1YWwtd3JpdGUKICAgIEgtPj5SOiBCR1JFV1JJVEVBT0YKICAgIFItPj5XOiBCZWdpblJld3JpdGUoc2lkZUJ1ZikKICAgIE5vdGUgb3ZlciBXOiBsaXZlIEFwcGVuZHMgbm93IGZhbiBvdXQ6PGJyLz5jYW5vbmljYWwgKG91dGVyQnVmaW8pICsgc2lkZSAoaW5uZXJCdWZpbykKICAgIFItPj5SOiBTdG9yZS5TbmFwc2hvdCgpIOKAlCByZW5kZXIgY2Fub25pY2FsIFJFU1AKICAgIFItPj5UTVA6IHdyaXRlIGhlYWRlciArIHNuYXBzaG90IGJ5dGVzCiAgICBILT4%2BVzogQXBwZW5kIFNFVCBrIHYgKGNvbmN1cnJlbnQgd2l0aCBzbmFwc2hvdCkKICAgIFctPj5BT0Y6IHdyaXRlIHRvIGNhbm9uaWNhbCAoZHVyYWJsZSBvbiBmc3luYykKICAgIFctPj5XOiB3cml0ZSB0byBzaWRlQnVmIHZpYSBpbm5lckJ1ZmlvPGJyLz4qKm11c3QgRmx1c2goKSDigJQgdGhlIGR1YWwtd3JpdGUgYnVnKioKCiAgICBOb3RlIG92ZXIgVyxGUzogRHJhaW4gKyBzd2FwCiAgICBSLT4%2BVzogRHJhaW5BbmRTd2FwKCkKICAgIFctPj5XOiBpbm5lckJ1ZmlvLkZsdXNoKCkKICAgIFctPj5UTVA6IGFwcGVuZCBzaWRlQnVmIGJ5dGVzCiAgICBSLT4%2BRlM6IHJlbmFtZSguYW9mLnRtcCwgLmFvZikKICAgIFItPj5GUzogZnN5bmMoZGlyKQogICAgTm90ZSBvdmVyIEFPRjogaW52YXJpYW50OiBleGFjdGx5IG9uZSBvZjxici8%2Be29sZCAuYW9mLCBuZXcgLmFvZn0gcHJlc2VudCBhdCBhbGwgdGltZXM%3Ftype%3Dpng" alt="BGREWRITEAOF dual-write + atomic rename: live Appends fan out to canonical + side buffer during snapshot, then DrainAndSwap before rename"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The invariant the rewrite ADR commits to is "the canonical file is durable and consistent at every instant until the rename." A silently-empty side buffer would have meant the rename swapped in a file missing the appends issued during the snapshot. The replay test was the green-light proof that both sides of the dual-write are actually dual-writing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honourable mention 2 — the test was wrong, not the code
&lt;/h2&gt;

&lt;p&gt;The TTL sweeper is a 1 Hz goroutine that samples 20 keys under a read lock, then upgrades to a write lock for the eviction pass — &lt;code&gt;sync.RWMutex&lt;/code&gt; doesn't support upgrade-in-place, so the upgrade is &lt;em&gt;release-and-reacquire-and-double-check&lt;/em&gt;. The race test (&lt;code&gt;internal/store/sweeper_test.go&lt;/code&gt;, commit &lt;a href="https://github.com/prajwalmahajan101/toykv/commit/bab6033" rel="noopener noreferrer"&gt;&lt;code&gt;bab6033&lt;/code&gt;&lt;/a&gt;) failed on its first stress run with four violations in 468k reads.&lt;/p&gt;

&lt;p&gt;The bug was &lt;strong&gt;in the test, not the store.&lt;/strong&gt; The test captured &lt;code&gt;t0 := time.Now()&lt;/code&gt; before &lt;code&gt;Get&lt;/code&gt;, but Get's internal time check happens later, sometimes much later if the scheduler preempts. A Get that correctly returned &lt;code&gt;(nil)&lt;/code&gt; because the entry's TTL had elapsed by the time Get &lt;em&gt;internally&lt;/em&gt; checked would be flagged as "spurious nil for unexpired key." The fix forced the memory model onto paper: I now capture &lt;code&gt;tAfter&lt;/code&gt; after Get (an upper bound on the internal check) and &lt;code&gt;atomic.Load(lastExpireAt)&lt;/code&gt; before Get (which synchronizes with the writer's &lt;code&gt;atomic.Store&lt;/code&gt;). A nil result with &lt;code&gt;tAfter &amp;lt; loadedExpireAt&lt;/code&gt; is then a real violation: the key was still unexpired at the latest moment Get could have checked, so no false positives are possible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; if a sweeper and a writer can race, the test has to schedule them, not hope. A test whose pass/fail depends on the scheduler's whim finds its own bugs, not your code's.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  ADRs as crystallization — and the ADR I almost didn't write
&lt;/h2&gt;

&lt;p&gt;I budgeted four ADRs for v1.0.0 and shipped six. The interesting one is the TUI ADR.&lt;/p&gt;

&lt;p&gt;The TUI ADR (Bubble Tea with an injectable &lt;code&gt;Doer&lt;/code&gt;) is where the budget actively pushed against the code. The TUI was the first part of the project that took a third-party dep beyond &lt;code&gt;go-redis/v9&lt;/code&gt; in tests, and the choice — &lt;code&gt;bubbletea&lt;/code&gt; vs &lt;code&gt;tview&lt;/code&gt; vs hand-rolled — was non-trivial. I knew it was non-trivial because every two days, in a different context, I caught myself re-deriving the same argument for the same choice. That's the signal: when the same reasoning costs me a second time, I didn't decide; I waited.&lt;/p&gt;

&lt;p&gt;I had two options. Force the choice into one of the existing four ADRs and write a worse ADR. Or break the budget and write the TUI ADR on its own. I broke the budget. The post-mortem is that &lt;strong&gt;the budget itself was the wrong abstraction.&lt;/strong&gt; I had treated ADR count as a proxy for noise, when the actual signal was ADR &lt;em&gt;quality&lt;/em&gt; — and a forced-fit ADR is noisier than two clean ones.&lt;/p&gt;

&lt;p&gt;The discipline, identical to the one I wrote in the &lt;code&gt;toymq&lt;/code&gt; post: write the ADR the moment the decision is forced by code, never before. ADRs written ahead of code force the code to fit the ADR. ADRs written &lt;em&gt;at&lt;/em&gt; crystallization record what the code already decided. The reason that discipline matters is that the TUI ADR was a decision I'd already made three times by the day I wrote it down — I just hadn't admitted I'd made it. The ADR is the admission.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; budget the quality bar, not the count. The cost of an ADR I write is one afternoon. The cost of an ADR I didn't write is a future me re-deriving the same argument in three different files. Those costs aren't comparable, and counting ADRs treats them as if they are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Three regrets from the critical path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chaos composition should have shipped with the AOF, not at the end.&lt;/strong&gt; The &lt;code&gt;test/chaos/&lt;/code&gt; soak (kill + pause + rewrite + writes overlapped) felt expensive enough to defer. It wasn't. Two-thirds of the harness reused the self-re-exec pattern I already had for the AOF crash test. Running a composition soak from the start would have caught the dual-write bug at the point the rewriter landed, because the rewriter would have been driven by a scaffolded integration test instead of an isolated unit test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Store.Snapshot()&lt;/code&gt; should have shipped with the AOF, not landed as a refactor PR when the rewriter needed it.&lt;/strong&gt; Snapshot is a contract boundary — "give me the current state, expired entries already evicted, in a form I can serialize." I deferred it because the AOF didn't &lt;em&gt;use&lt;/em&gt; it yet. The refactor PR that extracted it later was harder to review than the BGREWRITEAOF work itself, because it touched nine files mechanically. Shipping the boundary up front would have cost one extra journal entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ADR budget number was the wrong abstraction.&lt;/strong&gt; See above. Budget the bar, not the count.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;toykv-go&lt;/code&gt; — the SDK extraction. The strategy I committed to is: hold the SDK split until the wire format is locked, then extract &lt;code&gt;internal/client&lt;/code&gt;, &lt;code&gt;respfmt&lt;/code&gt;, and &lt;code&gt;cmdparse&lt;/code&gt; into a sibling repo. v1.0.0 of &lt;code&gt;toykv-go&lt;/code&gt; tagged the same day as &lt;code&gt;toykv&lt;/code&gt; v1.0.0.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;toymessenger&lt;/code&gt; — the consumer. The reason &lt;code&gt;toykv&lt;/code&gt; and &lt;code&gt;toymq&lt;/code&gt; exist as separate projects is that there is a third project, an E2E-encrypted chat, that uses both: KV for session state and presence, MQ for the message log. &lt;code&gt;toymessenger&lt;/code&gt; is where the two projects' opposite decisions get tested against a single user.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;toykv v2&lt;/code&gt; only if real use justifies it. The roadmap lists candidate features (lists/sets/hashes, AUTH + TLS, RDB snapshots, observability with &lt;code&gt;INFO&lt;/code&gt; + &lt;code&gt;/metrics&lt;/code&gt;, &lt;code&gt;SCAN&lt;/code&gt;, &lt;code&gt;RENAME&lt;/code&gt;), but the rule is the same one Redis uses: don't add a feature until somebody is asking for it and willing to live with the trade-off. A single-node KV that survives a v1 cycle without a v2 ask is a single-node KV that earned the cut.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one-line version
&lt;/h2&gt;

&lt;p&gt;Write the bytes you'd send. Stamp them with a version. The next person to read them is you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://github.com/prajwalmahajan101/toykv" rel="noopener noreferrer"&gt;&lt;code&gt;prajwalmahajan101/toykv&lt;/code&gt;&lt;/a&gt;. Deeper docs: &lt;a href="https://github.com/prajwalmahajan101/toykv/blob/main/docs/HLD.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/HLD.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toykv/blob/main/docs/LLD.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/LLD.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toykv/blob/main/docs/TESTING.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/TESTING.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toykv/blob/main/docs/BENCHMARKS.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/BENCHMARKS.md&lt;/code&gt;&lt;/a&gt;. ADRs: &lt;a href="https://github.com/prajwalmahajan101/toykv/tree/main/docs/adr" rel="noopener noreferrer"&gt;&lt;code&gt;docs/adr/&lt;/code&gt;&lt;/a&gt;. Release notes: &lt;a href="https://github.com/prajwalmahajan101/toykv/blob/main/docs/release-notes/v1.0.0.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/release-notes/v1.0.0.md&lt;/code&gt;&lt;/a&gt;. Corrections welcome — open a discussion or file an issue.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Companion post: &lt;a href="https://dev.to/prajwalmahajan101/building-toymq-a-from-scratch-persistent-message-broker-in-go-ob7"&gt;Building toymq&lt;/a&gt; — the sibling project this one is in dialogue with.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building resilience-kit: A Python Resilience Kernel Forged in Production</title>
      <dc:creator>Prajwal Mahajan</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:18:08 +0000</pubDate>
      <link>https://dev.to/prajwalmahajan101/building-resilience-kit-a-python-resilience-kernel-forged-in-production-5973</link>
      <guid>https://dev.to/prajwalmahajan101/building-resilience-kit-a-python-resilience-kernel-forged-in-production-5973</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;resilience-kit&lt;/code&gt; is a framework-agnostic, async-first resilience kernel for Python.&lt;/strong&gt; Circuit breakers, retries with jitter, token-bucket throttles, SSRF guards with DNS pinning, fire-and-forget audit logs, field-level encryption. Pluggable backends (memory, Redis/Valkey, &lt;code&gt;pybreaker&lt;/code&gt;). Thin adapters for FastAPI and Django that wire the primitives into each framework's lifecycle and contain zero business logic. One core, 11 ADRs, two adapters.&lt;/p&gt;

&lt;p&gt;The interesting part isn't the primitives; most of them are well-trodden ground. The interesting part is &lt;strong&gt;the release rule&lt;/strong&gt;: v0.1.0 could ship only after I had upgraded two unrelated starter repos to it and scored &lt;strong&gt;≥ 8/10&lt;/strong&gt; on each migration. The reports got filed. FastAPI scored 8/10. Django scored 9/10. The gate held. Anything below 8 would have blocked the cut.&lt;/p&gt;

&lt;p&gt;That rule — &lt;em&gt;"the kit ships iff the kit survives its own migration"&lt;/em&gt; — is what shaped the public surface. Every helper in v0.1.0 (&lt;code&gt;bind_to&lt;/code&gt;, &lt;code&gt;from_exception&lt;/code&gt;, &lt;code&gt;legacy_env_alias&lt;/code&gt;, &lt;code&gt;verify_envelope_contract&lt;/code&gt;) exists because the first dogfooding round flagged its absence as a blocker. Every doc gap got patched because the migration reports cited the line number it should have been on. This post is how the kit got built, the five axioms underneath it, the gate that gated it, and the bugs it found in itself along the way.&lt;/p&gt;

&lt;p&gt;95 source modules, 52 test modules across unit/contract/integration suites, 11 ADRs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Axioms
&lt;/h2&gt;

&lt;p&gt;I didn't draft these up front. I extracted them after the fact by reading the ADRs and asking &lt;em&gt;what did I refuse to compromise on?&lt;/em&gt; They're the load-bearing constraints; everything else follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. One package, many extras. Not a forest of micro-packages.
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pip install resilience-kit&lt;/code&gt; gives you the core. &lt;code&gt;pip install resilience-kit[redis,pybreaker,fastapi]&lt;/code&gt; adds the backends and adapter. &lt;strong&gt;No micro-libraries until the surface justifies it.&lt;/strong&gt; One repo, one CHANGELOG, one tag. Modularity is enforced by import discipline (&lt;code&gt;import-linter&lt;/code&gt; with layered contracts) and by entry-point provider discovery, not by package boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBDWyJZb3VyIGFwcGxpY2F0aW9uIl0KICAgICAgICBBUFBbRmFzdEFQSSAvIERqYW5nbyBzZXJ2aWNlXQogICAgZW5kCiAgICBzdWJncmFwaCBBWyJyZXNpbGllbmNlX2tpdC5hZGFwdGVycy4qIl0KICAgICAgICBGQVthZGFwdGVycy5mYXN0YXBpXQogICAgICAgIERKW2FkYXB0ZXJzLmRqYW5nb10KICAgIGVuZAogICAgc3ViZ3JhcGggRVsiT3B0aW9uYWwgZXh0cmFzPGJyLz4oZGlzY292ZXJlZCB2aWEgZW50cnktcG9pbnRzKSJdCiAgICAgICAgUkVEW3JlZGlzIGJhY2tlbmRdCiAgICAgICAgUEJbcHlicmVha2VyIGJhY2tlbmRdCiAgICAgICAgUEdbcG9zdGdyZXMgYXVkaXRdCiAgICBlbmQKICAgIHN1YmdyYXBoIEtbInJlc2lsaWVuY2Vfa2l0IGNvcmUiXQogICAgICAgIFBST1RPW1Byb3RvY29sczxici8%2BY2FjaGUgwrcgYnJlYWtlciDCtyB0aHJvdHRsZSDCtyBhdWRpdF0KICAgICAgICBQUklNW1ByaW1pdGl2ZXM8YnIvPnJldHJ5IMK3IHJlZ2lzdHJ5IMK3IHJlY292ZXJ5IMK3IGh0dHBfY2xpZW50XQogICAgICAgIFBSSU0gLS0%2BIFBST1RPCiAgICBlbmQKICAgIEFQUCAtLT4gRkEKICAgIEFQUCAtLT4gREoKICAgIEZBIC0tPiBQUklNCiAgICBESiAtLT4gUFJJTQogICAgUkVEIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIFBCIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIFBHIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIHN0eWxlIEsgZmlsbDojZGJlYWZlLHN0cm9rZTojMWQ0ZWQ4CiAgICBzdHlsZSBBIGZpbGw6I2ZlZjNjNyxzdHJva2U6I2Q5NzcwNgogICAgc3R5bGUgRSBmaWxsOiNmM2U4ZmYsc3Ryb2tlOiM3YzNhZWQKICAgIHN0eWxlIEMgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCg%3D%3D%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBDWyJZb3VyIGFwcGxpY2F0aW9uIl0KICAgICAgICBBUFBbRmFzdEFQSSAvIERqYW5nbyBzZXJ2aWNlXQogICAgZW5kCiAgICBzdWJncmFwaCBBWyJyZXNpbGllbmNlX2tpdC5hZGFwdGVycy4qIl0KICAgICAgICBGQVthZGFwdGVycy5mYXN0YXBpXQogICAgICAgIERKW2FkYXB0ZXJzLmRqYW5nb10KICAgIGVuZAogICAgc3ViZ3JhcGggRVsiT3B0aW9uYWwgZXh0cmFzPGJyLz4oZGlzY292ZXJlZCB2aWEgZW50cnktcG9pbnRzKSJdCiAgICAgICAgUkVEW3JlZGlzIGJhY2tlbmRdCiAgICAgICAgUEJbcHlicmVha2VyIGJhY2tlbmRdCiAgICAgICAgUEdbcG9zdGdyZXMgYXVkaXRdCiAgICBlbmQKICAgIHN1YmdyYXBoIEtbInJlc2lsaWVuY2Vfa2l0IGNvcmUiXQogICAgICAgIFBST1RPW1Byb3RvY29sczxici8%2BY2FjaGUgwrcgYnJlYWtlciDCtyB0aHJvdHRsZSDCtyBhdWRpdF0KICAgICAgICBQUklNW1ByaW1pdGl2ZXM8YnIvPnJldHJ5IMK3IHJlZ2lzdHJ5IMK3IHJlY292ZXJ5IMK3IGh0dHBfY2xpZW50XQogICAgICAgIFBSSU0gLS0%2BIFBST1RPCiAgICBlbmQKICAgIEFQUCAtLT4gRkEKICAgIEFQUCAtLT4gREoKICAgIEZBIC0tPiBQUklNCiAgICBESiAtLT4gUFJJTQogICAgUkVEIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIFBCIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIFBHIC0uLT58aW1wbGVtZW50c3wgUFJPVE8KICAgIHN0eWxlIEsgZmlsbDojZGJlYWZlLHN0cm9rZTojMWQ0ZWQ4CiAgICBzdHlsZSBBIGZpbGw6I2ZlZjNjNyxzdHJva2U6I2Q5NzcwNgogICAgc3R5bGUgRSBmaWxsOiNmM2U4ZmYsc3Ryb2tlOiM3YzNhZWQKICAgIHN0eWxlIEMgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCg%3D%3D%3Ftype%3Dpng" alt="Layered package: your app → adapters → core; extras implement the core Protocols via entry-points" width="1283" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Arrows are hard imports (&lt;code&gt;from resilience_kit.X import Y&lt;/code&gt;). Dotted lines are entry-point conformance — no import required, just a Protocol-shaped class registered in the extra's &lt;code&gt;pyproject.toml&lt;/code&gt; under &lt;code&gt;[project.entry-points."resilience_kit.cache"]&lt;/code&gt; (and siblings). The kit's import graph is acyclic by import-linter contract; consumer apps depend only on the adapter layer.&lt;/p&gt;

&lt;p&gt;The bet: a single distributable up to ~10k LOC is simpler to release, version, and depend on than a federation. If the surface ever grows past that, splitting into namespace packages later is mechanical and doesn't break the public API.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Protocols, not ABCs.
&lt;/h3&gt;

&lt;p&gt;Every swappable interface — cache, breaker, throttle, audit backend, metrics sink — is a &lt;code&gt;typing.Protocol&lt;/code&gt;, not an &lt;code&gt;abc.ABC&lt;/code&gt;. Third-party backends import &lt;em&gt;zero&lt;/em&gt; kit packages at definition time. They expose a class whose shape matches; &lt;code&gt;mypy --strict&lt;/code&gt; verifies the structural conformance.&lt;/p&gt;
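
&lt;p&gt;A minimal sketch of what structural conformance looks like in practice. The &lt;code&gt;CacheLike&lt;/code&gt; Protocol and the &lt;code&gt;DictCache&lt;/code&gt; backend are hypothetical names invented for illustration, not the kit's actual interfaces; the point is only that the backend never inherits from the thing it conforms to.&lt;/p&gt;

```python
import asyncio
from typing import Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class CacheLike(Protocol):
    """Hypothetical cache shape, defined on the consuming side."""

    async def get(self, key: str) -> Optional[bytes]: ...

    async def set(self, key: str, value: bytes) -> None: ...


class DictCache:
    """A backend with no base class: it only has to match the shape."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    async def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    async def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


async def round_trip(cache: CacheLike) -> Optional[bytes]:
    # Annotated against the Protocol; a type checker verifies DictCache fits.
    await cache.set("greeting", b"hello")
    return await cache.get("greeting")


value = asyncio.run(round_trip(DictCache()))
conforms = isinstance(DictCache(), CacheLike)  # runtime check: method names only
```

&lt;p&gt;In a real third-party package &lt;code&gt;DictCache&lt;/code&gt; would live in its own module and never see the Protocol. The &lt;code&gt;isinstance&lt;/code&gt; check only confirms the method names exist; signature-level conformance is what a static checker such as &lt;code&gt;mypy --strict&lt;/code&gt; adds.&lt;/p&gt;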

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; your code should conform to a shape, not inherit from a library. This is true dependency inversion: the kit doesn't even appear in the import graph of the thing implementing its protocol. At runtime, entry points are one link in the precedence chain that picks a backend: explicit &amp;gt; env &amp;gt; entry-point &amp;gt; default.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Outer breaker, inner retry. Composition order is correctness.
&lt;/h3&gt;

&lt;p&gt;If you compose a circuit breaker and a retry, the breaker must wrap the retry. If you accidentally invert the stack — retry around breaker — the retry loop will hammer an &lt;code&gt;OPEN&lt;/code&gt; breaker, defeating the point of the breaker. This isn't a style preference; it's a correctness property.&lt;/p&gt;

&lt;p&gt;I learned this the hard way early on. A caller passed &lt;code&gt;retry_on=(Exception,)&lt;/code&gt; to the retry decorator. The decorator happily caught &lt;code&gt;ServiceUnavailableError&lt;/code&gt; — the exception the breaker raises to say "stop calling me." The fix (commit &lt;code&gt;f6cebdd&lt;/code&gt;) made the dangerous composition fail safe by stripping &lt;code&gt;ServiceUnavailableError&lt;/code&gt; from the retry's catch list, no matter what the caller passes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_filter_retry_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop ServiceUnavailableError from a retry_on tuple.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;issubclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ServiceUnavailableError&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; the dangerous composition must be the one that fails safe. An inverted stack now fails instantly, because the retry loop refuses to catch the one error a breaker is designed to throw.&lt;/p&gt;
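
&lt;p&gt;One subtlety: filtering the tuple removes &lt;code&gt;ServiceUnavailableError&lt;/code&gt; and its subclasses, but a broad base class such as &lt;code&gt;Exception&lt;/code&gt; stays in the tuple and would still match a breaker rejection through ordinary exception inheritance. The post does not show how the kit's retry loop handles that case, so the sketch below is an assumption rather than the kit's code: a hypothetical &lt;code&gt;retry&lt;/code&gt; helper that re-raises the breaker's error before it consults &lt;code&gt;retry_on&lt;/code&gt;.&lt;/p&gt;

```python
import asyncio


class ServiceUnavailableError(Exception):
    """Stand-in for the error an OPEN breaker raises."""


async def retry(fn, retry_on, attempts=3):
    """Hypothetical retry loop that never retries a breaker rejection."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except ServiceUnavailableError:
            raise  # checked first, so a broad retry_on cannot swallow it
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error


calls = []


async def rejected_by_breaker():
    calls.append("call")
    raise ServiceUnavailableError("breaker is OPEN")


def run_once():
    try:
        asyncio.run(retry(rejected_by_breaker, retry_on=(Exception,)))
    except ServiceUnavailableError:
        return "failed fast"
    return "no error"


outcome = run_once()
attempts_made = len(calls)
```

&lt;p&gt;Even with &lt;code&gt;retry_on=(Exception,)&lt;/code&gt;, the wrapped call runs once and the rejection propagates immediately, because Python evaluates &lt;code&gt;except&lt;/code&gt; clauses in order.&lt;/p&gt;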

&lt;h3&gt;
  
  
  4. Async-first with explicit sync bridges. Never trust auto-bridging.
&lt;/h3&gt;

&lt;p&gt;Every primitive's async API is primary; sync wrappers exist only where the primitive is idiomatically sync (Django middleware, DRF throttles, management commands). Crucially, the kit &lt;strong&gt;does not use &lt;code&gt;asgiref.sync.async_to_sync&lt;/code&gt;&lt;/strong&gt;. That function caches a thread-local event loop to "avoid overhead" — which means it's a hidden global namespace, and any other library that also caches a thread-local loop will collide with it.&lt;/p&gt;

&lt;p&gt;Instead, the kit owns both crossings explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-lived background work&lt;/strong&gt; (recovery monitor, audit dispatcher): the Django adapter spawns a single daemon thread with its own private event loop. The loop drives the monitor and owns the audit queue. &lt;code&gt;atexit&lt;/code&gt; drains the queue on the &lt;em&gt;same&lt;/em&gt; loop before closing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-lived sync→async calls&lt;/strong&gt; (DRF throttles checking Redis): a per-call &lt;code&gt;asyncio.run()&lt;/code&gt;. Sub-millisecond overhead. No shared state. No collision risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; a bridge between two worlds must own its own crossing. Don't borrow infrastructure from either side.&lt;/p&gt;
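
&lt;p&gt;The two crossings can be sketched with the standard library alone. &lt;code&gt;BackgroundLoop&lt;/code&gt;, &lt;code&gt;check_throttle&lt;/code&gt;, and &lt;code&gt;allow_request&lt;/code&gt; are hypothetical names, the Redis call is replaced by a no-op await, and the &lt;code&gt;atexit&lt;/code&gt; drain is left out; treat this as an illustration of the pattern, not the adapter's implementation.&lt;/p&gt;

```python
import asyncio
import threading


class BackgroundLoop:
    """Hypothetical helper: one daemon thread that owns a private event loop."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro):
        # Schedule work on the private loop from any sync thread.
        # Returns a concurrent.futures.Future.
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def close(self) -> None:
        # Stop the loop from outside its thread, wait for it, then release it.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def check_throttle(key: str) -> bool:
    await asyncio.sleep(0)  # stands in for a Redis round trip
    return True


def allow_request(key: str) -> bool:
    # Short-lived crossing: a fresh loop per call, no shared state.
    return asyncio.run(check_throttle(key))


bg = BackgroundLoop()
long_lived = bg.submit(check_throttle("audit")).result(timeout=5)
short_lived = allow_request("user:1")
bg.close()
```

&lt;p&gt;The per-call &lt;code&gt;asyncio.run()&lt;/code&gt; raises &lt;code&gt;RuntimeError&lt;/code&gt; if the calling thread already has a running loop, which is why it suits sync-only call sites.&lt;/p&gt;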

&lt;h3&gt;
  
  
  5. Adapters are dumb wiring. If your adapter has business logic, your primitive is wrong.
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;resilience_kit.adapters.fastapi&lt;/code&gt; and &lt;code&gt;resilience_kit.adapters.django&lt;/code&gt; exist solely to wire kit primitives into each framework's lifecycle: ASGI middleware, FastAPI dependencies, Django &lt;code&gt;AppConfig.ready()&lt;/code&gt;, DRF throttle classes, management commands. They translate framework-shaped concepts (a Django &lt;code&gt;MIDDLEWARE&lt;/code&gt; tuple entry, a FastAPI &lt;code&gt;Depends()&lt;/code&gt;) into kit calls. Nothing else.&lt;/p&gt;

&lt;p&gt;If an adapter file ever grows past ~300 LOC, the primitive itself is wrong, not the adapter. This is enforced socially (PR review) and architecturally (import-linter prevents adapters from importing each other or from owning state).&lt;/p&gt;

&lt;p&gt;The closing line of this post is the punchier statement of this axiom.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dogfooding Gate
&lt;/h2&gt;

&lt;p&gt;The most consequential decision in this project wasn't a code decision. It was a process decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"v0.1.0 is a clean cut iff both boilerplate reports score ≥ 8/10."&lt;/strong&gt;&lt;br&gt;
— &lt;code&gt;docs/RELEASE-PLAN.md&lt;/code&gt; §4 verification&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I write FastAPI and Django services. Before the kit existed, both starter repos owned the same resilience layer twice — circuit breakers, retries, throttles, SSRF guards, audit logs — copy-pasted and lightly diverged. The kit is the deduplication. But deduplication is easy to talk about and hard to validate; "the kit can replace the boilerplate code" is a claim, not a proof.&lt;/p&gt;

&lt;p&gt;So I made it a release gate. Cut a release candidate. Upgrade both boilerplates against it. &lt;strong&gt;File a structured report on each migration&lt;/strong&gt; — blockers, helpers used, missing surface, pain points, doc gaps, ROADMAP suggestions — and a 1–10 outcome score. If either score is &amp;lt; 8, the kit isn't done. Iterate. Re-test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBW0N1dCByZWxlYXNlIGNhbmRpZGF0ZV0gLS0%2BIEJbVXBncmFkZSBmYXN0YXBpX2JvaWxlcnBsYXRlXQogICAgQSAtLT4gRFtVcGdyYWRlIGRqYW5nb19ib2lsZXJwbGF0ZV0KICAgIEIgLS0%2BIENbRmlsZSBGYXN0QVBJIHJlcG9ydDxici8%2Bc2NvcmUgMeKAkzEwXQogICAgRCAtLT4gRVtGaWxlIERqYW5nbyByZXBvcnQ8YnIvPnNjb3JlIDHigJMxMF0KICAgIEMgLS0%2BIEZ7Qm90aCDiiaUgOC8xMD99CiAgICBFIC0tPiBGCiAgICBGIC0tIHllcyAtLT4gR1tDdXQgcmVsZWFzZTxici8%2BdGFnICsgUHlQSV0KICAgIEYgLS0gbm8gLS0%2BIEhbSXRlcmF0ZSBvbiBraXQ8YnIvPmZpeCBibG9ja2Vyc10KICAgIEggLS0%2BIEEKICAgIEcgLS0%2BIElbRmluZGluZ3MgZmVlZDxici8%2BbmV4dCBwYXRjaCArIG1pbm9yXQogICAgc3R5bGUgRiBmaWxsOiNmZWYzYzcsc3Ryb2tlOiNkOTc3MDYKICAgIHN0eWxlIEcgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCiAgICBzdHlsZSBIIGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNgo%3D%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBW0N1dCByZWxlYXNlIGNhbmRpZGF0ZV0gLS0%2BIEJbVXBncmFkZSBmYXN0YXBpX2JvaWxlcnBsYXRlXQogICAgQSAtLT4gRFtVcGdyYWRlIGRqYW5nb19ib2lsZXJwbGF0ZV0KICAgIEIgLS0%2BIENbRmlsZSBGYXN0QVBJIHJlcG9ydDxici8%2Bc2NvcmUgMeKAkzEwXQogICAgRCAtLT4gRVtGaWxlIERqYW5nbyByZXBvcnQ8YnIvPnNjb3JlIDHigJMxMF0KICAgIEMgLS0%2BIEZ7Qm90aCDiiaUgOC8xMD99CiAgICBFIC0tPiBGCiAgICBGIC0tIHllcyAtLT4gR1tDdXQgcmVsZWFzZTxici8%2BdGFnICsgUHlQSV0KICAgIEYgLS0gbm8gLS0%2BIEhbSXRlcmF0ZSBvbiBraXQ8YnIvPmZpeCBibG9ja2Vyc10KICAgIEggLS0%2BIEEKICAgIEcgLS0%2BIElbRmluZGluZ3MgZmVlZDxici8%2BbmV4dCBwYXRjaCArIG1pbm9yXQogICAgc3R5bGUgRiBmaWxsOiNmZWYzYzcsc3Ryb2tlOiNkOTc3MDYKICAgIHN0eWxlIEcgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCiAgICBzdHlsZSBIIGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNgo%3D%3Ftype%3Dpng" alt="Dogfooding cycle: cut RC → migrate both boilerplates → file reports → both ≥8/10 ships, otherwise iterate" width="673" 
height="777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That cycle ran from &lt;code&gt;0.1.0rc1&lt;/code&gt; → &lt;code&gt;0.1.0&lt;/code&gt;. Both reports passed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repo&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/fastapi_boilerplate" rel="noopener noreferrer"&gt;&lt;code&gt;fastapi_boilerplate&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8 / 10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2026-06-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/django_boilerplate" rel="noopener noreferrer"&gt;&lt;code&gt;django_boilerplate&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9 / 10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2026-06-11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Both reports applied all four primary helpers&lt;/strong&gt; that shipped in v0.1.0: &lt;code&gt;bind_to(target)&lt;/code&gt; for ContextVar mirroring, &lt;code&gt;from_exception(exc)&lt;/code&gt; for the kit-shape → adopter-envelope projection, &lt;code&gt;legacy_env_alias()&lt;/code&gt; for pre-kit env-var translation, and &lt;code&gt;verify_envelope_contract()&lt;/code&gt; for an adopter-side test that the bridge maps every kit exception cleanly. The fact that &lt;em&gt;both&lt;/em&gt; reports used the same four — first try, no fallbacks — was the convergent signal that the helper surface was right.&lt;/p&gt;

&lt;p&gt;What's interesting isn't that the gate passed. It's &lt;em&gt;what the gate produced&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Migration Reports Found
&lt;/h2&gt;

&lt;p&gt;The reports were structured by design — same template, same sections — so the findings could be aligned and compared. Four findings showed up in both:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Cross-cutting finding&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The public exception bridge lives at &lt;code&gt;resilience_kit.adapters._envelope.from_exception&lt;/code&gt; — the leading underscore reads as private. Re-export it without the underscore, or rename the module.&lt;/td&gt;
&lt;td&gt;v0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;verify_envelope_contract&lt;/code&gt; raises a flat &lt;code&gt;AssertionError&lt;/code&gt;. Pytest only surfaces the first failure; programmatic CI dashboards can't introspect a structured result. Return an &lt;code&gt;EnvelopeContractResult&lt;/code&gt; (list of &lt;code&gt;(exc_class, ok, reason)&lt;/code&gt;).&lt;/td&gt;
&lt;td&gt;v0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;legacy_env_alias(aliases=...)&lt;/code&gt; &lt;em&gt;replaces&lt;/em&gt; the default table. Adopters who want to extend with project-specific aliases must copy the default dict + add. Add &lt;code&gt;extra_aliases=&lt;/code&gt; that merges, or document the copy-pattern explicitly.&lt;/td&gt;
&lt;td&gt;v0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The rc1 → v0.1.0 migration guide had four documented gaps: several recipes needed Django-specific snippets, ordering was load-bearing in one place, the projection shape was wrong for an envelope that requires &lt;code&gt;code&lt;/code&gt;, and a &lt;code&gt;request_id=None&lt;/code&gt; patch needed an explicit "top up from your own ContextVar" note.&lt;/td&gt;
&lt;td&gt;v0.1.0 doc patch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two of those (X1, X2) are surface-quality nits — the kit works, but the ergonomics chafed. Two (X3, X4) are real correctness traps — silent override of the default alias table, and migration recipes that worked for FastAPI but not for DRF. Both reports independently flagged all four. That kind of convergence is exactly what dogfooding is for: a single user can rationalize around a sharp edge; two users hitting the same edge means the edge is the bug.&lt;/p&gt;

&lt;p&gt;The single-repo findings were sharper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI-specific:&lt;/strong&gt; &lt;code&gt;from_exception(envelope_cls=...)&lt;/code&gt; projects each of the kit's exception details into a &lt;code&gt;{field, message}&lt;/code&gt; item. The FastAPI boilerplate's &lt;code&gt;ErrorDetail&lt;/code&gt; schema also requires &lt;code&gt;code&lt;/code&gt;, so the projection raised a &lt;code&gt;ValidationError&lt;/code&gt; at handler time. Cost to diagnose: ~10 minutes. Workaround: drop &lt;code&gt;envelope_cls&lt;/code&gt; from the call and translate manually in the handler (six lines). Tracked as a v0.1.1 patch (add &lt;code&gt;code=exc.error_code&lt;/code&gt; to the projected items, or add a per-entry callback hook).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Django-specific:&lt;/strong&gt; the migration guide's &lt;code&gt;verify_envelope_contract&lt;/code&gt; example called &lt;code&gt;handler(...).body&lt;/code&gt; — that's the FastAPI shape. DRF stores the body at &lt;code&gt;.data&lt;/code&gt;, not &lt;code&gt;.body&lt;/code&gt;. Doc-only fix; it landed in the same PR as the report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smoke proof:&lt;/strong&gt; the Django report didn't just say the bridge worked. It quoted the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/accounts/me/  (anonymous, 401)

X-Request-Id: 1e43827d5faa4fcfa3551acfcf9caa27
{"request_id":"1e43827d5faa4fcfa3551acfcf9caa27", …}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same UUID in the response header &lt;em&gt;and&lt;/em&gt; in the JSON envelope's &lt;code&gt;request_id&lt;/code&gt; field. rc1 returned &lt;code&gt;"request_id":null&lt;/code&gt;. v0.1.0 returned the bound value. With the bridge in place (&lt;code&gt;bind_to(request_id_ctx)&lt;/code&gt; in a thin &lt;code&gt;BindRequestIdMiddleware&lt;/code&gt; slotted immediately after the kit's &lt;code&gt;RequestIdMiddleware&lt;/code&gt;), the ID flowed kit → bridge → DRF handler → response. End to end. That's the kind of proof migration reports should ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convergent v0.2 wishlist.&lt;/strong&gt; Both reports independently asked for the same five things: &lt;code&gt;DjangoSettingsSource&lt;/code&gt; (read &lt;code&gt;settings.RESILIENCE&lt;/code&gt; instead of forcing env vars), &lt;code&gt;resilience_kit.utils.*&lt;/code&gt; (five small modules every boilerplate re-implements: &lt;code&gt;log_sanitization&lt;/code&gt;, &lt;code&gt;network&lt;/code&gt;, &lt;code&gt;timing&lt;/code&gt;, &lt;code&gt;function_logger&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;), a &lt;code&gt;GlobalThrottle&lt;/code&gt; (process-wide cap; nginx covers it in prod but laptop dev wants the in-process belt), a free-function &lt;code&gt;metrics.record_*&lt;/code&gt; shim over &lt;code&gt;MetricsSink&lt;/code&gt; to keep adopters' bounded-cardinality guards in place, and multi-alias Redis topology so cache, throttle, and breaker can point at separate instances. That convergence is the v0.2 design spec, half-written for me.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bugs the Kit Found in Itself
&lt;/h2&gt;

&lt;p&gt;The dogfooding gate is the systemic story. The bugs are the craft story.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;f6cebdd&lt;/code&gt; — outer breaker, inner retry
&lt;/h3&gt;

&lt;p&gt;The bug behind axiom 3. A caller passed &lt;code&gt;retry_on=(Exception,)&lt;/code&gt; to the retry decorator. The decorator caught &lt;code&gt;ServiceUnavailableError&lt;/code&gt; — exactly the wrong exception. The fix is the &lt;code&gt;_filter_retry_on&lt;/code&gt; snippet above. The rule it earned is in axiom 3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBCQURbIuKdjCBJbnZlcnRlZCBzdGFjayDigJQgcmV0cnkgd3JhcHMgYnJlYWtlciJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgQlIxW0NhbGxlcl0gLS0%2BIEJSMltyZXRyeSBkZWNvcmF0b3JdCiAgICAgICAgQlIyIC0tPiBCUjNbY2lyY3VpdCBicmVha2VyXQogICAgICAgIEJSMyAtLT4gQlI0W291dGJvdW5kIGNhbGxdCiAgICAgICAgQlI0IC0uLT58ZmFpbHN8IEJSMwogICAgICAgIEJSMyAtLi0%2BfE9QRU48YnIvPnJhaXNlczxici8%2BU2VydmljZVVuYXZhaWxhYmxlfCBCUjIKICAgICAgICBCUjIgLS4tPnxjYXRjaGVzIGl0PGJyLz5hbmQgcmV0cmllc3wgQlIzCiAgICAgICAgQlIyIC0uLT58aGFtbWVycyB0aGU8YnIvPk9QRU4gYnJlYWtlcnwgQlIzCiAgICBlbmQKICAgIHN1YmdyYXBoIEdPT0RbIuKchSBDb3JyZWN0IHN0YWNrIOKAlCBicmVha2VyIHdyYXBzIHJldHJ5Il0KICAgICAgICBkaXJlY3Rpb24gVEIKICAgICAgICBHUjFbQ2FsbGVyXSAtLT4gR1IyW2NpcmN1aXQgYnJlYWtlcl0KICAgICAgICBHUjIgLS0%2BIEdSM1tyZXRyeSBkZWNvcmF0b3JdCiAgICAgICAgR1IzIC0tPiBHUjRbb3V0Ym91bmQgY2FsbF0KICAgICAgICBHUjQgLS4tPnx0cmFuc2llbnQ8YnIvPmZhaWx1cmV8IEdSMwogICAgICAgIEdSMyAtLi0%2BfHJldHJ5IHdpdGg8YnIvPmppdHRlcnwgR1I0CiAgICAgICAgR1IzIC0uLT58Z2l2ZXMgdXA8YnIvPmFmdGVyIE4gYXR0ZW1wdHN8IEdSMgogICAgICAgIEdSMiAtLi0%2BfHRyaXBzIE9QRU48YnIvPm9uIHRocmVzaG9sZHwgR1IxCiAgICBlbmQKICAgIHN0eWxlIEJBRCBmaWxsOiNmZWUyZTIsc3Ryb2tlOiNkYzI2MjYKICAgIHN0eWxlIEdPT0QgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCg%3D%3D%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBCQURbIuKdjCBJbnZlcnRlZCBzdGFjayDigJQgcmV0cnkgd3JhcHMgYnJlYWtlciJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgQlIxW0NhbGxlcl0gLS0%2BIEJSMltyZXRyeSBkZWNvcmF0b3JdCiAgICAgICAgQlIyIC0tPiBCUjNbY2lyY3VpdCBicmVha2VyXQogICAgICAgIEJSMyAtLT4gQlI0W291dGJvdW5kIGNhbGxdCiAgICAgICAgQlI0IC0uLT58ZmFpbHN8IEJSMwogICAgICAgIEJSMyAtLi0%2BfE9QRU48YnIvPnJhaXNlczxici8%2BU2VydmljZVVuYXZhaWxhYmxlfCBCUjIKICAgICAgICBCUjIgLS4tPnxjYXRjaGVzIGl0PGJyLz5hbmQgcmV0cmllc3wgQlIzCiAgICAgICAgQlIyIC0uLT58aGFtbWVycyB0aGU8YnIvPk9QRU4gYnJlYWtlcnwgQlIzCiAgICBlbmQKICAgIHN1YmdyYXBoIEdPT0RbIuKchSBDb3JyZWN0IHN0YWNrIOKAlCBicmVha2VyIHdyYXBzIHJldHJ5Il0KICAgICAgICBkaXJlY3Rpb24gVEIKICAgICAgICBHUjFbQ2FsbGVyXSAtLT4gR1IyW2NpcmN1aXQgYnJlYWtlcl0KICAgICAgICBHUjIgLS0%2BIEdSM1tyZXRyeSBkZWNvcmF0b3JdCiAgICAgICAgR1IzIC0tPiBHUjRbb3V0Ym91bmQgY2FsbF0KICAgICAgICBHUjQgLS4tPnx0cmFuc2llbnQ8YnIvPmZhaWx1cmV8IEdSMwogICAgICAgIEdSMyAtLi0%2BfHJldHJ5IHdpdGg8YnIvPmppdHRlcnwgR1I0CiAgICAgICAgR1IzIC0uLT58Z2l2ZXMgdXA8YnIvPmFmdGVyIE4gYXR0ZW1wdHN8IEdSMgogICAgICAgIEdSMiAtLi0%2BfHRyaXBzIE9QRU48YnIvPm9uIHRocmVzaG9sZHwgR1IxCiAgICBlbmQKICAgIHN0eWxlIEJBRCBmaWxsOiNmZWUyZTIsc3Ryb2tlOiNkYzI2MjYKICAgIHN0eWxlIEdPT0QgZmlsbDojZGNmY2U3LHN0cm9rZTojMTZhMzRhCg%3D%3D%3Ftype%3Dpng" alt="Inverted vs correct stack: retry-wraps-breaker hammers the OPEN state; breaker-wraps-retry contains transient failures and trips cleanly" width="852" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The inverted stack is the one a fresh reader writes first — composition order looks like a style preference until the breaker is &lt;code&gt;OPEN&lt;/code&gt;. The &lt;code&gt;_filter_retry_on&lt;/code&gt; invariant makes the wrong order fail safely instead of catastrophically: the retry decorator refuses to swallow the breaker's "stop" signal, even when the caller explicitly tells it to.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;bc499da&lt;/code&gt; — &lt;code&gt;asyncio.Event&lt;/code&gt; rebinding on every restart
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;RecoveryMonitor&lt;/code&gt; held an &lt;code&gt;asyncio.Event&lt;/code&gt; created in &lt;code&gt;__init__&lt;/code&gt; and reused across every &lt;code&gt;start()&lt;/code&gt;/&lt;code&gt;stop()&lt;/code&gt; cycle. On Python 3.10+, &lt;code&gt;asyncio.Event&lt;/code&gt; doesn't bind to a loop at construction; it binds lazily, the first time a coroutine actually has to block in &lt;code&gt;.wait()&lt;/code&gt; (a bare &lt;code&gt;.set()&lt;/code&gt; does not bind it). So the first start captured the loop. The second start, on a fresh loop, raised:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: &amp;lt;asyncio.locks.Event object at …&amp;gt; is bound to a different event loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It surfaced in pytest-asyncio first — two FastAPI integration tests in one session, per-function loop scope. The second test's lifespan opened a fresh loop and called &lt;code&gt;monitor.start()&lt;/code&gt;; the Event still held the first test's closed loop. Same shape of bug as Django's &lt;code&gt;runserver&lt;/code&gt; autoreload, Gunicorn graceful restart, any ASGI lifespan that opens twice. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In src/resilience_kit/recovery.py, RecoveryMonitor.start()
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_task&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="c1"&gt;# Reassign rather than .clear() so the Event binds to the
&lt;/span&gt;        &lt;span class="c1"&gt;# *current* event loop. asyncio.Event lazily binds to
&lt;/span&gt;        &lt;span class="c1"&gt;# whichever loop first calls .wait()/.set() on it; reusing
&lt;/span&gt;        &lt;span class="c1"&gt;# the prior Event across loops (test harness, restarted
&lt;/span&gt;        &lt;span class="c1"&gt;# server) raises ``RuntimeError: bound to a different event
&lt;/span&gt;        &lt;span class="c1"&gt;# loop``. A fresh Event has no binding yet.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_stopping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resilience_kit.recovery_monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RecoveryMonitor started.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six lines changed. The lesson generalizes: &lt;code&gt;asyncio.Event&lt;/code&gt;, &lt;code&gt;asyncio.Queue&lt;/code&gt;, &lt;code&gt;asyncio.Lock&lt;/code&gt; and several siblings capture their loop binding on first use, and &lt;code&gt;.clear()&lt;/code&gt; doesn't unbind. &lt;strong&gt;Loop-bound state is loop-scoped, not loop-local.&lt;/strong&gt; Construct fresh on every lifecycle restart; don't cache.&lt;/p&gt;
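
&lt;p&gt;The failure is easy to reproduce outside the kit. The sketch below is not the kit's test; it assumes Python 3.10 or newer, and the helper names (&lt;code&gt;block_briefly&lt;/code&gt;, &lt;code&gt;reuse_across_loops&lt;/code&gt;, &lt;code&gt;fresh_per_loop&lt;/code&gt;) are made up for illustration. It uses two back-to-back &lt;code&gt;asyncio.run()&lt;/code&gt; calls as a stand-in for two lifespans:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio


async def block_briefly(event):
    # Blocking in .wait() is what binds the Event to the running loop.
    try:
        await asyncio.wait_for(event.wait(), timeout=0.01)
    except asyncio.TimeoutError:
        pass


def reuse_across_loops():
    """Cache one Event across two loops; return the error message, if any."""
    event = asyncio.Event()
    asyncio.run(block_briefly(event))  # first loop: binds, then closes
    try:
        asyncio.run(block_briefly(event))  # second loop, same Event
    except RuntimeError as exc:
        return str(exc)
    return None


def fresh_per_loop():
    """Construct a new Event per loop; both runs complete normally."""
    for _ in range(2):
        asyncio.run(block_briefly(asyncio.Event()))
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a supported interpreter, &lt;code&gt;reuse_across_loops()&lt;/code&gt; should return a message ending in "is bound to a different event loop", while &lt;code&gt;fresh_per_loop()&lt;/code&gt; returns normally. That is the whole argument for reassigning in &lt;code&gt;start()&lt;/code&gt;.&lt;/p&gt;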

&lt;h3&gt;
  
  
  &lt;code&gt;9f3d846&lt;/code&gt; — &lt;code&gt;pybreaker&lt;/code&gt; wrapper swallowed the tripping-call exception
&lt;/h3&gt;

&lt;p&gt;The kit wraps &lt;code&gt;pybreaker.CircuitBreaker&lt;/code&gt; as one of three breaker backends. On the call that trips the breaker (the one that crosses the failure threshold), &lt;code&gt;pybreaker&lt;/code&gt; raises its own &lt;code&gt;CircuitBreakerError&lt;/code&gt; to signal "I just opened" — discarding the underlying exception that &lt;em&gt;caused&lt;/em&gt; the trip. The caller saw "breaker open" without knowing &lt;em&gt;why&lt;/em&gt; it opened. Forensics suffered.&lt;/p&gt;

&lt;p&gt;The fix: the wrapper captures the original exception in a small state object before letting &lt;code&gt;pybreaker&lt;/code&gt; raise, then re-raises the original on the tripping call. The breaker still opens; the caller still sees the real failure cause. &lt;strong&gt;Don't let third-party wrappers eat your forensic signal.&lt;/strong&gt;&lt;/p&gt;
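
&lt;p&gt;A rough sketch of that capture-and-re-raise shape, under loud assumptions: &lt;code&gt;TinyBreaker&lt;/code&gt; and &lt;code&gt;BreakerOpenError&lt;/code&gt; are hypothetical stand-ins that mimic the behaviour described above, not &lt;code&gt;pybreaker&lt;/code&gt;'s classes, and &lt;code&gt;call_preserving_cause&lt;/code&gt; is an illustrative name rather than the kit's wrapper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class BreakerOpenError(Exception):
    """Stand-in for a third-party 'circuit is open' error."""


class TinyBreaker:
    """Hypothetical breaker that replaces the tripping call's exception."""

    def __init__(self, fail_max):
        self.fail_max = fail_max
        self.failures = 0
        self.is_open = False

    def call(self, func):
        if self.is_open:
            raise BreakerOpenError("circuit open")
        try:
            return func()
        except Exception:
            self.failures += 1
            if self.failures == self.fail_max:
                self.is_open = True
                # The underlying cause is discarded here.
                raise BreakerOpenError("threshold reached") from None
            raise


class _CallState:
    """Small per-call holder for the exception the wrapped function raised."""

    def __init__(self):
        self.original = None


def call_preserving_cause(breaker, func):
    state = _CallState()

    def recording():
        try:
            return func()
        except Exception as exc:
            state.original = exc  # remember the real failure
            raise

    try:
        return breaker.call(recording)
    except BreakerOpenError:
        # If func ran and failed on this call, this was the tripping call:
        # the breaker is open now, but the caller gets the real cause.
        if state.original is not None:
            raise state.original from None
        raise  # already open before the call: nothing to recover
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;fail_max=2&lt;/code&gt; and a function that always raises &lt;code&gt;ConnectionError&lt;/code&gt;, the first two calls surface &lt;code&gt;ConnectionError&lt;/code&gt; (the second one also trips the breaker) and only the third surfaces &lt;code&gt;BreakerOpenError&lt;/code&gt;.&lt;/p&gt;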

&lt;h3&gt;
  
  
  Honourable mentions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aaa5f41&lt;/code&gt; — ASGI header bytes vs str. The request_id middleware did &lt;code&gt;.lower()&lt;/code&gt; on &lt;code&gt;b"X-Request-Id"&lt;/code&gt;. Worked on Starlette's str-typed headers, broke on raw ASGI bytes. Boundary type discipline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1c8f834&lt;/code&gt; — settings cache leaking the test source between tests, masking what &lt;code&gt;legacy_env_alias&lt;/code&gt; actually resolved. The reset helper now restores the &lt;em&gt;default&lt;/em&gt; &lt;code&gt;EnvSettingsSource&lt;/code&gt;, not the last test's override.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;e588e8c&lt;/code&gt; — CodeQL flagged a dict whose keys were unioned &lt;code&gt;str | bytes&lt;/code&gt;. The fix was three characters. The lesson: take CodeQL findings seriously even when "it works"; the static analyser sees the type-confusion path the runtime tests don't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real fixes from the kit's git history. None of them were dramatic. All of them earned a small piece of the design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Dangerous Parts First
&lt;/h2&gt;

&lt;p&gt;Two areas in the kit are trivial to get wrong and catastrophic when they fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSRF guard + DNS rebinding TOCTOU
&lt;/h3&gt;

&lt;p&gt;An SSRF guard that just validates a resolved IP is vulnerable to a Time-of-Check-to-Time-of-Use (TOCTOU) race. The attacker's DNS returns a safe public IP at check time; the SSRF guard approves; the underlying HTTP transport does its own DNS lookup; the attacker's DNS returns &lt;code&gt;127.0.0.1&lt;/code&gt; this time; the request lands on localhost.&lt;/p&gt;

&lt;p&gt;The fix is to make the check and the connect share state. &lt;code&gt;AsyncAPIClient&lt;/code&gt; resolves the host, validates the IP against the SSRF allow-list, and pins the validated IP into a &lt;code&gt;contextvars.ContextVar&lt;/code&gt;. A custom &lt;code&gt;httpx&lt;/code&gt; transport reads the pinned IP and connects directly — skipping the second DNS lookup entirely while preserving the original &lt;code&gt;Host&lt;/code&gt; header and TLS SNI hostname so cert verification still passes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgUksgYXMgUmVzaWxpZW5jZSBLaXQKICAgIHBhcnRpY2lwYW50IEROUSBhcyBETlMgUmVzb2x2ZXIKICAgIHBhcnRpY2lwYW50IEhUVFBZIGFzIEhUVFBZIFRyYW5zcG9ydAogICAgcGFydGljaXBhbnQgUyBhcyBUYXJnZXQgU2VydmVyCgogICAgQy0%2BPlJLOiBodHRweF9jbGllbnQuZ2V0KCJodHRwczovL2V4YW1wbGUuY29tIikKICAgIFJLLT4%2BRE5ROiBSZXNvbHZlICJleGFtcGxlLmNvbSIKICAgIEROUS0tPj5SSzogMS4yLjMuNAogICAgUkstPj5SSzogU1NSRkd1YXJkLnZhbGlkYXRlKDEuMi4zLjQpCiAgICBhbHQgU1NSRiBjaGVjayBmYWlscwogICAgICAgIFJLLS0%2BPkM6IFJhaXNlIFNTUkZHdWFyZEVycm9yCiAgICBlbmQKICAgIFJLLT4%2BUks6IFBpbiAxLjIuMy40IHRvIENvbnRleHRWYXIKICAgIFJLLT4%2BSFRUUFk6IGdldCguLi4pCiAgICBIVFRQWS0%2BPlJLOiBSZWFkIHBpbm5lZCBJUCBmcm9tIENvbnRleHRWYXIKICAgIEhUVFBZLT4%2BUzogQ29ubmVjdCB0byAxLjIuMy40IChza2lwcyBETlMpCiAgICBTLS0%2BPkhUVFBZOiBSZXNwb25zZQogICAgSFRUUFktLT4%2BUks6IFJlc3BvbnNlCiAgICBSSy0tPj5DOiBSZXNwb25zZQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgUksgYXMgUmVzaWxpZW5jZSBLaXQKICAgIHBhcnRpY2lwYW50IEROUSBhcyBETlMgUmVzb2x2ZXIKICAgIHBhcnRpY2lwYW50IEhUVFBZIGFzIEhUVFBZIFRyYW5zcG9ydAogICAgcGFydGljaXBhbnQgUyBhcyBUYXJnZXQgU2VydmVyCgogICAgQy0%2BPlJLOiBodHRweF9jbGllbnQuZ2V0KCJodHRwczovL2V4YW1wbGUuY29tIikKICAgIFJLLT4%2BRE5ROiBSZXNvbHZlICJleGFtcGxlLmNvbSIKICAgIEROUS0tPj5SSzogMS4yLjMuNAogICAgUkstPj5SSzogU1NSRkd1YXJkLnZhbGlkYXRlKDEuMi4zLjQpCiAgICBhbHQgU1NSRiBjaGVjayBmYWlscwogICAgICAgIFJLLS0%2BPkM6IFJhaXNlIFNTUkZHdWFyZEVycm9yCiAgICBlbmQKICAgIFJLLT4%2BUks6IFBpbiAxLjIuMy40IHRvIENvbnRleHRWYXIKICAgIFJLLT4%2BSFRUUFk6IGdldCguLi4pCiAgICBIVFRQWS0%2BPlJLOiBSZWFkIHBpbm5lZCBJUCBmcm9tIENvbnRleHRWYXIKICAgIEhUVFBZLT4%2BUzogQ29ubmVjdCB0byAxLjIuMy40IChza2lwcyBETlMpCiAgICBTLS0%2BPkhUVFBZOiBSZXNwb25zZQogICAgSFRUUFktLT4%2BUks6IFJlc3BvbnNlCiAgICBSSy0tPj5DOiBSZXNwb25zZQ%3D%3D" alt="SSRF Guard and DNS Pinning Flow" width="1269" height="832"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The check and the connect now share a single piece of state, so there is no second lookup for the attacker to win. The TOCTOU window is closed.&lt;/p&gt;
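
&lt;p&gt;The check-then-pin half of that flow can be sketched in a few lines. Everything named here is an assumption for illustration: &lt;code&gt;_pinned_ip&lt;/code&gt;, &lt;code&gt;SSRFBlocked&lt;/code&gt;, &lt;code&gt;resolve_validate_pin&lt;/code&gt; and &lt;code&gt;pinned_connect_ip&lt;/code&gt; are not the kit's names, the resolver is injected so no real DNS is involved, and the policy is simplified to "globally routable addresses only" instead of the kit's allow-list. The custom &lt;code&gt;httpx&lt;/code&gt; transport is left out entirely.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import contextvars
import ipaddress

# Hypothetical name: the post only says the validated IP lives in a ContextVar.
_pinned_ip = contextvars.ContextVar("pinned_ip", default=None)


class SSRFBlocked(Exception):
    """Raised when a host resolves to an address the guard refuses."""


def resolve_validate_pin(host, resolve):
    """Resolve once, validate, then pin. `resolve` maps a hostname to an IP string."""
    ip = ipaddress.ip_address(resolve(host))
    # Stand-in policy: accept only globally routable addresses.
    if not ip.is_global:
        raise SSRFBlocked(f"{host} resolved to non-public address {ip}")
    _pinned_ip.set(str(ip))
    return str(ip)


def pinned_connect_ip():
    """What a pinned transport would dial: the validated IP, no second lookup."""
    ip = _pinned_ip.get()
    if ip is None:
        raise RuntimeError("no validated IP pinned for this request")
    return ip


class RebindingResolver:
    """Test double: a public IP on the first lookup, loopback afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self, host):
        self.calls += 1
        return "93.184.216.34" if self.calls == 1 else "127.0.0.1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;resolve_validate_pin("example.com", RebindingResolver())&lt;/code&gt; and then &lt;code&gt;pinned_connect_ip()&lt;/code&gt; consults the resolver exactly once, so the loopback answer it would give on a second lookup never reaches the connect step.&lt;/p&gt;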

&lt;h3&gt;
  
  
  Fire-and-forget audit dispatch
&lt;/h3&gt;

&lt;p&gt;Audit logging must never bring down the application. The pre-kit boilerplates wrote audit rows synchronously: if the database hiccupped, the request failed. The kit replaces that with a decoupled, fire-and-forget pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An in-memory, bounded &lt;code&gt;asyncio.Queue&lt;/code&gt; (default 10,000 events) acts as a buffer.&lt;/li&gt;
&lt;li&gt;A background worker flushes batches to the configured audit backend (&lt;code&gt;postgres&lt;/code&gt;, &lt;code&gt;stdlib_logging&lt;/code&gt;, or a custom entry-point backend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the database is down, the dispatcher degrades gracefully.&lt;/strong&gt; It retries with backoff; if it ultimately fails, it logs the batch to stderr and increments a &lt;code&gt;dropped&lt;/code&gt; metric. The app survives.&lt;/li&gt;
&lt;/ul&gt;
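
&lt;p&gt;As a minimal sketch of that pipeline (not the kit's dispatcher): the class name, constructor parameters and &lt;code&gt;dropped&lt;/code&gt; attribute below are illustrative, &lt;code&gt;write_batch&lt;/code&gt; stands in for whichever audit backend is configured, and dropping on a full queue is an assumption about what "bounded" implies rather than something the list above states.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import sys


class AuditDispatcher:
    """Illustrative only: bounded buffer, batched flush, drop instead of fail."""

    def __init__(self, write_batch, maxsize=10_000, batch_size=100,
                 attempts=3, base_delay=0.05):
        self._write_batch = write_batch  # async callable taking a list of events
        self._queue = asyncio.Queue(maxsize=maxsize)
        self._batch_size = batch_size
        self._attempts = attempts
        self._base_delay = base_delay
        self.dropped = 0  # stand-in for the "dropped" metric

    def emit(self, event):
        # Called from the request path: never blocks, never raises.
        try:
            self._queue.put_nowait(event)
        except asyncio.QueueFull:
            self.dropped += 1

    async def flush_once(self):
        # Drain up to one batch from the buffer.
        batch = []
        while not self._queue.empty() and len(batch) != self._batch_size:
            batch.append(self._queue.get_nowait())
        if not batch:
            return
        for attempt in range(self._attempts):
            try:
                await self._write_batch(batch)
                return
            except Exception:
                if attempt + 1 != self._attempts:
                    # Exponential backoff between attempts.
                    await asyncio.sleep(self._base_delay * 2 ** attempt)
        # Out of attempts: keep the evidence on stderr, count the loss, move on.
        print(f"audit batch dropped: {batch!r}", file=sys.stderr)
        self.dropped += len(batch)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A background task would call &lt;code&gt;flush_once()&lt;/code&gt; in a loop. The point of the shape is that &lt;code&gt;emit()&lt;/code&gt; has no failure path the request handler can see: a backend outage costs audit events and a metric increment, not requests.&lt;/p&gt;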

&lt;p&gt;The audit dispatcher is also one of the loop-owned resources from axiom 4 — the Django adapter spawns it inside its daemon thread's private loop and drains it via &lt;code&gt;atexit&lt;/code&gt; on the same loop before closing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Verification: Testing What Unit Tests Can't See
&lt;/h2&gt;

&lt;p&gt;The test suite splits three ways. Unit tests verify logic in isolation. Contract tests verify the same logic works against every backend. Integration tests verify the whole stack survives chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contract tests: the same assertion, every backend.&lt;/strong&gt; The kit defines multiple backends: memory, Redis, and &lt;code&gt;pybreaker&lt;/code&gt; for circuit breakers. The same test suite runs against all of them, parametrized by &lt;code&gt;pytest.mark.parametrize("backend", …)&lt;/code&gt;. A Lua script in &lt;code&gt;redis_impl.py&lt;/code&gt; that diverges from the in-memory contract fails the same assertion that passes against &lt;code&gt;memory_impl.py&lt;/code&gt;. The contract suite catches the divergence the first time it runs, not after the backend has shipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSRF TOCTOU validation.&lt;/strong&gt; The integration test (&lt;code&gt;tests/integration/test_dns_rebinding.py&lt;/code&gt;) mocks &lt;code&gt;socket.getaddrinfo&lt;/code&gt; with a &lt;code&gt;_RebindingResolver&lt;/code&gt; that returns a safe public IP on the first call and a private IP on the second. The SSRF guard validates the first IP and pins it. The connection should use only that pinned IP — no second resolution. The test asserts the underlying transport's request URL carries the pinned IP, not the attacker's late-arriving private one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery monitor validation.&lt;/strong&gt; The integration test (&lt;code&gt;tests/integration/test_recovery_monitor.py&lt;/code&gt;) spins up a real Redis with &lt;code&gt;testcontainers&lt;/code&gt;, calls &lt;code&gt;docker_client.api.pause(container_id)&lt;/code&gt; to freeze TCP without killing the connection, verifies that throttles degrade to in-memory and flag degradation in metrics, then unpauses and verifies the monitor restores the Redis-backed providers within the 5-second exit-gate window. This test is what caught the &lt;code&gt;bc499da&lt;/code&gt; bug — the second invocation of the monitor across pytest's per-function loops blew up exactly as described above.&lt;/p&gt;

&lt;p&gt;The kit ships with 52 test modules across unit, contract, and integration suites.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Four regrets from the critical path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaos and failure injection should come before v0.2.&lt;/strong&gt; I built the core under happy-path assumptions, then spent the adapter phase hardening it. Testcontainers, container-pause cycles, Redis-down scenarios — these should have been mandatory from day one, not post-release validation. Every failure mode I caught during adapter work would have reshaped a primitive's design if I'd seen it earlier.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI matrix (Python 3.11, 3.12, 3.13) should be commit-one, not late hardening.&lt;/strong&gt; I assumed 3.11 compatibility, then discovered version-specific differences in asyncio behaviour and stdlib deprecations during release prep — exactly the work a GitHub matrix would have done on every PR for free. Fixes were mostly mechanical; the cost was the discovery cycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contract test suite should precede any backend implementation.&lt;/strong&gt; I wrote backends (memory, Redis, &lt;code&gt;pybreaker&lt;/code&gt;) and then wrote tests to validate them. Tests written &lt;em&gt;after&lt;/em&gt; backends encode the backends' assumptions as the spec, not the spec's expectations as the test. Contract tests written first force you to commit to the &lt;em&gt;behaviour&lt;/em&gt; before you commit to the &lt;em&gt;implementation&lt;/em&gt; — and they catch backend-specific divergence at the boundary rather than during integration. I got there eventually; getting there first would have been cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;mypy --strict&lt;/code&gt; should be enforced from commit-one, not applied retroactively.&lt;/strong&gt; I shipped the first half of the project without strict type hints, then made it a release gate. Retrofitting types to async code with complex control flow is error-prone, and the rewrite surfaced structural mismatches between &lt;code&gt;Protocol&lt;/code&gt; declarations and the backends meant to satisfy them. Strict types from the start force the design right the first time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are design regrets, not dogfooding bugs. The dogfooding bugs fed v0.1.1 and v0.2. The design regrets are about how I'd budget time on the next from-scratch infrastructure project.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The convergent wishlist from both migration reports is the v0.2 plan, half-written for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DjangoSettingsSource&lt;/code&gt;&lt;/strong&gt; — read &lt;code&gt;settings.RESILIENCE&lt;/code&gt; directly instead of forcing env-only config. Both reports rank this P1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;resilience_kit.utils.*&lt;/code&gt;&lt;/strong&gt; — five small modules (&lt;code&gt;log_sanitization&lt;/code&gt;, &lt;code&gt;network&lt;/code&gt;, &lt;code&gt;timing&lt;/code&gt;, &lt;code&gt;function_logger&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;) every boilerplate re-implements. Promoting them into the kit saves ~900 LOC per consumer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MetricsSink&lt;/code&gt; cardinality contract&lt;/strong&gt; — bound the labels at &lt;code&gt;record_*&lt;/code&gt; time, not after a Prometheus cardinality explosion in prod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GlobalThrottle&lt;/code&gt;&lt;/strong&gt; (Valkey-Lua) — process-wide cap. Nginx covers it for fleet deployments; the in-process belt is for laptop dev and single-pod deploys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-alias Redis topology&lt;/strong&gt; — &lt;code&gt;RESILIENCE_REDIS_URLS__&amp;lt;alias&amp;gt;&lt;/code&gt; so cache, throttle, and breaker can point at separate instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The v0.2 exit gate is the same rule, raised: both boilerplates re-test against a v0.2 pre-release and score &lt;strong&gt;≥ 8.5 / 10&lt;/strong&gt; — half a point higher than 0.1.0.&lt;/p&gt;

&lt;p&gt;Full roadmap: &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/ROADMAP.md" rel="noopener noreferrer"&gt;&lt;code&gt;ROADMAP.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I started by trying to delete duplication. I ended up with five axioms, a dogfooding release rule, and a set of bug-earned invariants that are reusable far beyond this kit. The duplication was the symptom. The axioms are the fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Notes &amp;amp; references
&lt;/h2&gt;

&lt;p&gt;Each axiom and dangerous-parts decision has a one-page ADR in the repo. If you want the long-form reasoning — alternatives considered, consequences, usage notes — these are the load-bearing ones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic in this post&lt;/th&gt;
&lt;th&gt;ADR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Protocols, not ABCs&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0001-protocol-not-abc.md" rel="noopener noreferrer"&gt;&lt;code&gt;0001-protocol-not-abc&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hand-rolled retry over &lt;code&gt;tenacity&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0002-handrolled-retry-not-tenacity.md" rel="noopener noreferrer"&gt;&lt;code&gt;0002-handrolled-retry-not-tenacity&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One package, many extras&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0003-single-package-with-extras.md" rel="noopener noreferrer"&gt;&lt;code&gt;0003-single-package-with-extras&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entry-point backends + precedence chain&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0004-entry-points-for-third-party-backends.md" rel="noopener noreferrer"&gt;&lt;code&gt;0004-entry-points-for-third-party-backends&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0009-entry-point-precedence-chain.md" rel="noopener noreferrer"&gt;&lt;code&gt;0009-entry-point-precedence-chain&lt;/code&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fire-and-forget audit dispatch&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0005-fire-and-forget-audit.md" rel="noopener noreferrer"&gt;&lt;code&gt;0005-fire-and-forget-audit&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outer breaker, inner retry&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0006-outer-breaker-inner-retry.md" rel="noopener noreferrer"&gt;&lt;code&gt;0006-outer-breaker-inner-retry&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNS pin via &lt;code&gt;ContextVar&lt;/code&gt; (TOCTOU)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0007-dns-pin-via-contextvar.md" rel="noopener noreferrer"&gt;&lt;code&gt;0007-dns-pin-via-contextvar&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fernet env-guarded keys&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0008-fernet-env-guard.md" rel="noopener noreferrer"&gt;&lt;code&gt;0008-fernet-env-guard&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastAPI adapter shape&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0010-fastapi-adapter-shape.md" rel="noopener noreferrer"&gt;&lt;code&gt;0010-fastapi-adapter-shape&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Django sync/async bridge&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/0011-django-sync-async-bridge.md" rel="noopener noreferrer"&gt;&lt;code&gt;0011-django-sync-async-bridge&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Migration reports: &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/m8b-upgrade-reports/" rel="noopener noreferrer"&gt;&lt;code&gt;docs/m8b-upgrade-reports/&lt;/code&gt;&lt;/a&gt; — &lt;code&gt;SUMMARY.md&lt;/code&gt;, &lt;code&gt;fastapi_boilerplate.md&lt;/code&gt;, &lt;code&gt;django_boilerplate.md&lt;/code&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;The adapter is not the kit. The kit is what survives without the adapter.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Source Code:&lt;/em&gt; &lt;a href="https://github.com/prajwalmahajan101/resilience-kit" rel="noopener noreferrer"&gt;resilience-kit GitHub Repository&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Documentation:&lt;/em&gt; &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/PRD.md" rel="noopener noreferrer"&gt;Design Doc (PRD)&lt;/a&gt; | &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/LLD.md" rel="noopener noreferrer"&gt;Architecture (LLD)&lt;/a&gt; | &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/adr/" rel="noopener noreferrer"&gt;ADR Index&lt;/a&gt; | &lt;a href="https://github.com/prajwalmahajan101/resilience-kit/blob/main/docs/m8b-upgrade-reports/" rel="noopener noreferrer"&gt;Migration Reports&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>architecture</category>
      <category>fastapi</category>
      <category>django</category>
    </item>
    <item>
      <title>Building toymq: a from-scratch persistent message broker in Go</title>
      <dc:creator>Prajwal Mahajan</dc:creator>
      <pubDate>Tue, 09 Jun 2026 20:04:45 +0000</pubDate>
      <link>https://dev.to/prajwalmahajan101/building-toymq-a-from-scratch-persistent-message-broker-in-go-ob7</link>
      <guid>https://dev.to/prajwalmahajan101/building-toymq-a-from-scratch-persistent-message-broker-in-go-ob7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A retrospective on &lt;a href="https://github.com/prajwalmahajan101/toymq" rel="noopener noreferrer"&gt;&lt;code&gt;toymq&lt;/code&gt;&lt;/a&gt; — a single-node persistent message broker in Go, recreated by hand to understand one of the smallest functional units in a distributed system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR — and the bug you should remember
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build the dangerous parts first.&lt;/strong&gt; Build the thing that, if it's wrong, forces a rewrite of everything above it. Then build the thing above it.&lt;/p&gt;

&lt;p&gt;The worst bug of the project took six commits to surface and an integration test to catch: a &lt;code&gt;uint64&lt;/code&gt; consumer offset where &lt;code&gt;0&lt;/code&gt; meant &lt;em&gt;both&lt;/em&gt; "message id 0 was acked" &lt;em&gt;and&lt;/em&gt; "never acked." Unit tests never restarted the broker between Ack and Subscribe, so the collision stayed invisible. &lt;strong&gt;Zero values are not sentinels.&lt;/strong&gt; That's the lesson the post earns.&lt;/p&gt;

&lt;p&gt;The numbers: ~10k lines of Go, 17 ADRs, four binaries, 90.3% test coverage, a chaos harness that pushed 14,000 messages through three &lt;code&gt;SIGKILL&lt;/code&gt;/restart cycles with zero acked-message loss. v1.0 in four days; v1.3 plus post-release hardening in another two.&lt;/p&gt;

&lt;p&gt;The order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WAL → wire protocol → broker → session → server → cmd wiring
   → integration tests → chaos → post-v1.0 hardening
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of this post is the order, not the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What toymq is, and isn't
&lt;/h2&gt;

&lt;p&gt;A learning artifact. Single node. No replication. No auth. No TLS. Throughput tops out at a few hundred messages per second because every publish does a per-message &lt;code&gt;fsync&lt;/code&gt;. If you need a real broker, reach for NATS or RabbitMQ. If you want to &lt;em&gt;understand&lt;/em&gt; what a broker is — what &lt;code&gt;durable&lt;/code&gt; actually means, what &lt;code&gt;at-least-once&lt;/code&gt; costs, what crash recovery requires — build one. That's what this post is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "build the bottom first" is non-obvious
&lt;/h2&gt;

&lt;p&gt;Every tutorial does the opposite. You build a server, accept connections, parse a request, stub a handler, add real handlers, add persistence last. By the time you reach durability, you've shipped a wire protocol that assumes you don't have one, a handler API that assumes nothing crashes, and a test suite that mocks the part of the system that matters.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;toymq&lt;/code&gt; inverts that. The first commit isn't a server — it's a framed record format and a CRC check. Day 1 is "what is on disk after a crash" and nothing else.&lt;/p&gt;

&lt;p&gt;This works for a selfish reason: &lt;strong&gt;the bottom of the stack, if I get it wrong, forces a rewrite of everything above it.&lt;/strong&gt; A storage bug means the broker is wrong. A protocol bug means every client &lt;em&gt;and&lt;/em&gt; the broker are wrong. Risk shrinks as you go up. So you spend the budget for being wrong at the bottom, while the budget is highest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durability: what should "on disk" actually mean?
&lt;/h2&gt;

&lt;p&gt;The storage piece of any persistent system is the &lt;strong&gt;write-ahead log&lt;/strong&gt; — a file you append records to, fsync, and never go back to mutate. On crash, you scan it from the start and rebuild your state.&lt;/p&gt;

&lt;p&gt;The one rule for the WAL: &lt;strong&gt;the on-disk format is a contract.&lt;/strong&gt; Anything that lands on disk has to survive forever, because the recovery scan will assume it. Changing the format later means writing a migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  On-disk layout
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/
└── topics/
    ├── orders/
    │   ├── segment.log     ← append-only WAL, one record per PUB
    │   └── offsets.json    ← per-consumer state, written by debouncer
    └── events/
        ├── segment.log
        └── offsets.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One directory per topic. Two files per topic. No metadata file, no index, no manifest. The recovery scan walks every segment from offset zero on &lt;code&gt;Open&lt;/code&gt;. The scan costs O(bytes on disk) but is correct by construction: the WAL is its own truth, and a manifest would be a second consistency problem on top of the first.&lt;/p&gt;

&lt;h3&gt;
  
  
  The record frame
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;length&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 bytes&lt;/td&gt;
&lt;td&gt;Bytes that follow, excluding self&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;msg_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8 bytes&lt;/td&gt;
&lt;td&gt;Monotonic per topic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ts_ns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8 bytes&lt;/td&gt;
&lt;td&gt;Append timestamp, ns since epoch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;key_len&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 bytes&lt;/td&gt;
&lt;td&gt;0 if the message has no key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;key&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bytes&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;key_len&lt;/code&gt; bytes&lt;/td&gt;
&lt;td&gt;Dedupe key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload_len&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 bytes&lt;/td&gt;
&lt;td&gt;Payload length in bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bytes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;payload_len&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Message body&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CRC32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;u32&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 bytes&lt;/td&gt;
&lt;td&gt;Checksum over all preceding bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deliberately boring. The interesting part is what's &lt;em&gt;not&lt;/em&gt; there: &lt;strong&gt;no version byte.&lt;/strong&gt; Adding a version byte later is a one-line migration; defending the wrong version byte forever is not. Don't add knobs the format doesn't yet need.&lt;/p&gt;
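
&lt;p&gt;A minimal sketch of how the frame in the table could be encoded and checked. The byte order (big-endian) and the CRC polynomial (IEEE) are assumptions, since the post does not state them, and &lt;code&gt;encodeRecord&lt;/code&gt; and &lt;code&gt;verifyRecord&lt;/code&gt; are hypothetical names rather than toymq's API. The checksum covers every preceding byte, length prefix included, which is one reading of the last table row.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Assumptions, not taken from toymq: big-endian integers and the IEEE CRC-32 polynomial.
var order = binary.BigEndian

var (
	errShortFrame = errors.New("short or mis-sized frame")
	errBadCRC     = errors.New("crc mismatch")
)

// fixedBody is the size of the fixed-width fields after the length prefix:
// msg_id(8) + ts_ns(8) + key_len(2) + payload_len(4) + CRC32(4).
const fixedBody = 8 + 8 + 2 + 4 + 4

// encodeRecord lays the fields out in table order. The caller is expected to
// have bounded len(key) to fit a u16 and len(payload) to fit a u32.
func encodeRecord(msgID, tsNs uint64, key, payload []byte) []byte {
	body := fixedBody + len(key) + len(payload)
	buf := make([]byte, 0, 4+body)
	buf = order.AppendUint32(buf, uint32(body)) // length: bytes that follow, excluding self
	buf = order.AppendUint64(buf, msgID)
	buf = order.AppendUint64(buf, tsNs)
	buf = order.AppendUint16(buf, uint16(len(key)))
	buf = append(buf, key...)
	buf = order.AppendUint32(buf, uint32(len(payload)))
	buf = append(buf, payload...)
	buf = order.AppendUint32(buf, crc32.ChecksumIEEE(buf)) // checksum over all preceding bytes
	return buf
}

// verifyRecord checks the length prefix and the trailing checksum of one frame.
func verifyRecord(frame []byte) error {
	if 4 > len(frame) {
		return errShortFrame
	}
	body := int(order.Uint32(frame[:4]))
	if fixedBody > body || len(frame) != 4+body {
		return errShortFrame
	}
	crcAt := len(frame) - 4
	if order.Uint32(frame[crcAt:]) != crc32.ChecksumIEEE(frame[:crcAt]) {
		return errBadCRC
	}
	return nil
}

func main() {
	frame := encodeRecord(0, 1750000000000000000, []byte("k1"), []byte("hello"))
	fmt.Println("frame bytes:", len(frame), "verify:", verifyRecord(frame))

	frame[len(frame)-5] ^= 0xFF // flip the last payload byte, as a corrupted write might
	fmt.Println("after corruption:", verifyRecord(frame))
}
```

&lt;p&gt;A recovery scan built on this would stop at the first frame that fails either check and treat everything before it as the valid log.&lt;/p&gt;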

&lt;h3&gt;
  
  
  &lt;code&gt;fsync&lt;/code&gt; is the durability commit point
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Log.Append&lt;/code&gt; is the only function in toymq that calls &lt;code&gt;fsync&lt;/code&gt;. Every &lt;code&gt;PUB&lt;/code&gt; goes through it. Every &lt;code&gt;OK&lt;/code&gt; the broker emits is, by construction, preceded by a successful &lt;code&gt;fsync&lt;/code&gt; of the corresponding record:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBUIGFzIFRvcGljCiAgICBwYXJ0aWNpcGFudCBMIGFzIFdBTCBMb2cKICAgIHBhcnRpY2lwYW50IEZTIGFzIEtlcm5lbC9GaWxlc3lzdGVtCgogICAgVC0-Pkw6IEFwcGVuZChyZWNvcmQpCiAgICBMLT4-TDogZW5jb2RlIChDUkMgKyBsZW4gKyBib2R5KQogICAgTC0-Pkw6IGFjcXVpcmUgcHViTXUKICAgIEwtPj5GUzogZmlsZS5Xcml0ZShieXRlcykKICAgIE5vdGUgb3ZlciBMLEZTOiBCeXRlcyBpbiBwYWdlIGNhY2hlLjxici8-UG93ZXIgbG9zcyBoZXJlIGxvc2VzIHRoZW0uCiAgICBMLT4-RlM6IGZpbGUuU3luYygpIOKAlCBmc3luYygyKQogICAgTm90ZSBvdmVyIEwsRlM6IEJ5dGVzIG9uIHN0YWJsZSBzdG9yYWdlLjxici8-RHVyYWJpbGl0eSBjb21taXQgcG9pbnQuCiAgICBGUy0tPj5MOiBvawogICAgTC0-Pkw6IGNvbW1pdHRlZE9mZnNldC5TdG9yZShuZXdPZmZzZXQpCiAgICBMLT4-TDogY29uZC5Ccm9hZGNhc3QoKQogICAgTC0-Pkw6IHJlbGVhc2UgcHViTXUKICAgIEwtLT4-VDogbXNnSUQsIG5ld09mZnNldA%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBUIGFzIFRvcGljCiAgICBwYXJ0aWNpcGFudCBMIGFzIFdBTCBMb2cKICAgIHBhcnRpY2lwYW50IEZTIGFzIEtlcm5lbC9GaWxlc3lzdGVtCgogICAgVC0-Pkw6IEFwcGVuZChyZWNvcmQpCiAgICBMLT4-TDogZW5jb2RlIChDUkMgKyBsZW4gKyBib2R5KQogICAgTC0-Pkw6IGFjcXVpcmUgcHViTXUKICAgIEwtPj5GUzogZmlsZS5Xcml0ZShieXRlcykKICAgIE5vdGUgb3ZlciBMLEZTOiBCeXRlcyBpbiBwYWdlIGNhY2hlLjxici8-UG93ZXIgbG9zcyBoZXJlIGxvc2VzIHRoZW0uCiAgICBMLT4-RlM6IGZpbGUuU3luYygpIOKAlCBmc3luYygyKQogICAgTm90ZSBvdmVyIEwsRlM6IEJ5dGVzIG9uIHN0YWJsZSBzdG9yYWdlLjxici8-RHVyYWJpbGl0eSBjb21taXQgcG9pbnQuCiAgICBGUy0tPj5MOiBvawogICAgTC0-Pkw6IGNvbW1pdHRlZE9mZnNldC5TdG9yZShuZXdPZmZzZXQpCiAgICBMLT4-TDogY29uZC5Ccm9hZGNhc3QoKQogICAgTC0-Pkw6IHJlbGVhc2UgcHViTXUKICAgIEwtLT4-VDogbXNnSUQsIG5ld09mZnNldA%3Ftype%3Dpng" alt="WAL Append sequence: write → fsync → committedOffset → release" width="665" 
height="907"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The committed-offset update happens &lt;em&gt;after&lt;/em&gt; fsync. Any reader that sees &lt;code&gt;committedOffset == X&lt;/code&gt; is guaranteed the bytes through &lt;code&gt;X&lt;/code&gt; are on stable storage. Cost: ~1–2 ms p99 on commodity NVMe. The correctness budget gets spent here, not on throughput tricks I can't defend. A broker that loses an acked message is not a broker.&lt;/p&gt;
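
&lt;p&gt;The ordering in the diagram (write, then &lt;code&gt;fsync&lt;/code&gt;, then publish the committed offset) can be sketched as below. The &lt;code&gt;Log&lt;/code&gt; type, its field names, and &lt;code&gt;OpenLog&lt;/code&gt; are illustrative assumptions rather than toymq's actual code. The sketch also skips two things a real log needs: a recovery scan on open, and truncating a partial write after an error.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
)

// Log is a hypothetical append-only file with a published commit point.
type Log struct {
	mu        sync.Mutex   // serialises appenders
	f         *os.File     // opened in append mode
	committed atomic.Int64 // bytes known to be on stable storage
}

// OpenLog trusts the existing file size as the commit point. A real
// implementation would scan and validate the file first.
func OpenLog(path string) (*Log, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	st, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	l := new(Log)
	l.f = f
	l.committed.Store(st.Size())
	return l, nil
}

// Append returns only after the bytes have been fsynced. Readers that load
// committed never observe an offset whose bytes might still be in page cache.
func (l *Log) Append(frame []byte) (int64, error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	if _, err := l.f.Write(frame); err != nil {
		return 0, err
	}
	if err := l.f.Sync(); err != nil { // durability commit point
		return 0, err
	}
	off := l.committed.Load() + int64(len(frame))
	l.committed.Store(off) // published strictly after Sync succeeds
	return off, nil
}

// demo appends two frames to a log in a temporary directory.
func demo() (int64, error) {
	dir, err := os.MkdirTemp("", "wal-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	l, err := OpenLog(filepath.Join(dir, "segment.log"))
	if err != nil {
		return 0, err
	}
	defer l.f.Close()

	if _, err := l.Append([]byte("first")); err != nil {
		return 0, err
	}
	return l.Append([]byte("second!"))
}

func main() {
	off, err := demo()
	fmt.Println("committed offset:", off, "err:", err)
}
```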

&lt;h3&gt;
  
  
  How this compares to Kafka
&lt;/h3&gt;

&lt;p&gt;Kafka does almost none of this. Its durability story is page cache + replication: trust the OS to flush eventually, trust the replicas to catch up. It pushes hundreds of thousands of messages per second per broker as a result. toymq has neither replicas nor the throughput budget to assume the page cache wins, so it trusts &lt;code&gt;fsync&lt;/code&gt; directly. &lt;strong&gt;Different problem, different answer — and the gap between those answers is exactly what hand-rolling teaches you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first nasty bug landed here, too: a &lt;code&gt;make([]byte, payloadLen)&lt;/code&gt; that ran &lt;em&gt;before&lt;/em&gt; checking &lt;code&gt;payloadLen &amp;lt;= maxPayload&lt;/code&gt;. A malicious client could OOM the broker with a single packet claiming a 100 GB payload. The fix is one line; the rule is general — &lt;strong&gt;validate before allocating, always.&lt;/strong&gt;&lt;/p&gt;
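
&lt;p&gt;A short sketch of the validate-before-allocating rule. &lt;code&gt;readPayload&lt;/code&gt;, its parameters, and the error value are hypothetical; the only point is that the bound check sits above the &lt;code&gt;make&lt;/code&gt; call, so a hostile length costs nothing.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

var errPayloadTooLarge = errors.New("payload exceeds limit")

// readPayload reads exactly claimed bytes from r, but only after the claimed
// length has been checked against the limit. No memory is allocated for a
// length the broker would refuse anyway.
func readPayload(r io.Reader, claimed, maxPayload int64) ([]byte, error) {
	if 0 > claimed || claimed > maxPayload {
		return nil, errPayloadTooLarge
	}
	buf := make([]byte, claimed)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// body wraps a string as a stream, standing in for a client connection.
func body(s string) io.Reader { return strings.NewReader(s) }

func main() {
	const limit = 1024 * 1024

	p, err := readPayload(body("hello"), 5, limit)
	fmt.Printf("%q %v\n", p, err)

	// A header claiming 100 GiB is rejected before any allocation happens.
	_, err = readPayload(body("hello"), 100*1024*1024*1024, limit)
	fmt.Println(err)
}
```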

&lt;h2&gt;
  
  
  The wire protocol: agreeing on what happened
&lt;/h2&gt;

&lt;p&gt;Once the storage layer holds, the next question is how clients and broker talk. A protocol is the contract that lets the two ends disagree about &lt;em&gt;when&lt;/em&gt; something happened (network is unreliable) without disagreeing about &lt;em&gt;whether&lt;/em&gt; it happened.&lt;/p&gt;

&lt;p&gt;toymq's protocol fits in a paragraph: a command word, a few arguments, a length-prefixed payload. The full &lt;code&gt;PUB&lt;/code&gt; happy path, end to end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgU1IgYXMgU2Vzc2lvbiBSCiAgICBwYXJ0aWNpcGFudCBTSCBhcyBTZXNzaW9uIEgKICAgIHBhcnRpY2lwYW50IFNXIGFzIFNlc3Npb24gVwogICAgcGFydGljaXBhbnQgQiBhcyBCcm9rZXIKICAgIHBhcnRpY2lwYW50IFQgYXMgVG9waWMKICAgIHBhcnRpY2lwYW50IFcgYXMgV0FMIExvZwoKICAgIEMtPj5TUjogUFVCIG9yZGVycyBrMSA1XG5oZWxsb1xuCiAgICBTUi0-PlNIOiBQdWJDb21tYW5ke1RvcGljLCBLZXksIFBheWxvYWR9CiAgICBTSC0-PkI6IFB1Ymxpc2godG9waWMsIGtleSwgcGF5bG9hZCkKICAgIEItPj5UOiBQdWJsaXNoKGtleSwgcGF5bG9hZCkKICAgIFQtPj5UOiBEZWR1cGUuTG9va3VwKGtleSkg4oaSIG1pc3MKICAgIFQtPj5XOiBBcHBlbmQocmVjb3JkKQogICAgTm90ZSBvdmVyIFc6IGZzeW5jKCkg4oCUIGR1cmFiaWxpdHkgY29tbWl0CiAgICBXLS0-PlQ6IG1zZ0lELCBvZmZzZXQKICAgIFQtPj5UOiBEZWR1cGUuSW5zZXJ0KGtleSwgbXNnSUQpCiAgICBULS0-PkI6IG1zZ0lECiAgICBCLS0-PlNIOiBtc2dJRAogICAgU0gtPj5TVzogV3JpdGVPSyhtc2dJRCkKICAgIFNXLS0-PkM6IE9LIDBcbg%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBDIGFzIENsaWVudAogICAgcGFydGljaXBhbnQgU1IgYXMgU2Vzc2lvbiBSCiAgICBwYXJ0aWNpcGFudCBTSCBhcyBTZXNzaW9uIEgKICAgIHBhcnRpY2lwYW50IFNXIGFzIFNlc3Npb24gVwogICAgcGFydGljaXBhbnQgQiBhcyBCcm9rZXIKICAgIHBhcnRpY2lwYW50IFQgYXMgVG9waWMKICAgIHBhcnRpY2lwYW50IFcgYXMgV0FMIExvZwoKICAgIEMtPj5TUjogUFVCIG9yZGVycyBrMSA1XG5oZWxsb1xuCiAgICBTUi0-PlNIOiBQdWJDb21tYW5ke1RvcGljLCBLZXksIFBheWxvYWR9CiAgICBTSC0-PkI6IFB1Ymxpc2godG9waWMsIGtleSwgcGF5bG9hZCkKICAgIEItPj5UOiBQdWJsaXNoKGtleSwgcGF5bG9hZCkKICAgIFQtPj5UOiBEZWR1cGUuTG9va3VwKGtleSkg4oaSIG1pc3MKICAgIFQtPj5XOiBBcHBlbmQocmVjb3JkKQogICAgTm90ZSBvdmVyIFc6IGZzeW5jKCkg4oCUIGR1cmFiaWxpdHkgY29tbWl0CiAgICBXLS0-PlQ6IG1zZ0lELCBvZmZzZXQKICAgIFQtPj5UOiBEZWR1cGUuSW5zZXJ0KGtleSwgbXNnSUQpCiAgICBULS0-PkI6IG1zZ0lECiAgICBCLS0-PlNIOiBtc2dJRAogICAgU0gtPj5TVzogV3JpdGVPSyhtc2dJRCkKICAgIFNXLS0-PkM6IE9LIDBcbg%3Ftype%3Dpng" alt="PUB happy path: Client → Reader → Handler → Broker → Topic → WAL fsync → OK" width="1627" height="826"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;OK&lt;/code&gt; only crosses the network &lt;em&gt;after&lt;/em&gt; &lt;code&gt;fsync&lt;/code&gt; returns. That's the durability promise made visible: the client never sees an &lt;code&gt;OK&lt;/code&gt; for a message that isn't on disk.&lt;/p&gt;

&lt;p&gt;Two design choices shape everything above this layer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A sealed &lt;code&gt;Command&lt;/code&gt; type.&lt;/strong&gt; The parser dispatches on an interface with an unexported marker method. You cannot add a new command type without editing the file the parser lives in. The compiler keeps the protocol and its handler in sync — if you add a command and forget to handle it, the build breaks. Free safety.&lt;/p&gt;
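
&lt;p&gt;A sketch of what a sealed command type can look like. The type names (&lt;code&gt;Pub&lt;/code&gt;, &lt;code&gt;Sub&lt;/code&gt;, &lt;code&gt;Ack&lt;/code&gt;) and the &lt;code&gt;dispatch&lt;/code&gt; function are assumptions for illustration. One caveat: the Go compiler does not check a type switch for exhaustiveness, so this sketch keeps a &lt;code&gt;default&lt;/code&gt; branch as a runtime backstop. A build-time guarantee would have to come from how the real handler is wired, which is not shown here.&lt;/p&gt;

```go
package main

import "fmt"

// Command is sealed: isCommand is unexported, so only types declared in this
// package can satisfy the interface.
type Command interface{ isCommand() }

type Pub struct {
	Topic, Key string
	Payload    []byte
}
type Sub struct{ Topic string }
type Ack struct{ MsgID uint64 }

func (Pub) isCommand() {}
func (Sub) isCommand() {}
func (Ack) isCommand() {}

// dispatch is the single place that maps a parsed command to an action.
func dispatch(c Command) (string, error) {
	switch cmd := c.(type) {
	case Pub:
		return fmt.Sprintf("publish %d bytes to %s", len(cmd.Payload), cmd.Topic), nil
	case Sub:
		return "subscribe to " + cmd.Topic, nil
	case Ack:
		return fmt.Sprintf("ack %d", cmd.MsgID), nil
	default:
		// Reached only if a new command type was added without a case above.
		return "", fmt.Errorf("unhandled command %T", c)
	}
}

func main() {
	cmds := []Command{
		Pub{Topic: "orders", Key: "k1", Payload: []byte("hello")},
		Sub{Topic: "orders"},
		Ack{MsgID: 0},
	}
	for _, c := range cmds {
		fmt.Println(dispatch(c))
	}
}
```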

&lt;p&gt;&lt;strong&gt;Clean EOF vs torn header.&lt;/strong&gt; The parser distinguishes a stream closed cleanly &lt;em&gt;between&lt;/em&gt; commands (propagate &lt;code&gt;io.EOF&lt;/code&gt;) from a stream closed &lt;em&gt;mid-line&lt;/em&gt; (return &lt;code&gt;ErrBadFraming&lt;/code&gt;). That distinction recurs everywhere downstream: the session loop uses it to tell "client gracefully disconnected" from "client crashed and dropped." Most tutorials skip this and pay for it forever.&lt;/p&gt;
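
&lt;p&gt;The EOF distinction can be sketched with &lt;code&gt;bufio.Reader.ReadString&lt;/code&gt;, which returns whatever bytes it read alongside the error. The function names and the newline-terminated command lines are assumptions for the sketch; only &lt;code&gt;ErrBadFraming&lt;/code&gt; and &lt;code&gt;io.EOF&lt;/code&gt; come from the text above.&lt;/p&gt;

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

var ErrBadFraming = errors.New("bad framing: stream ended mid-line")

// readLine returns one newline-terminated line. An EOF with no bytes read is
// a clean close; an EOF after some bytes means the line was torn.
func readLine(r *bufio.Reader) (string, error) {
	line, err := r.ReadString('\n')
	switch {
	case err == nil:
		return strings.TrimSuffix(line, "\n"), nil
	case errors.Is(err, io.EOF):
		if line == "" {
			return "", io.EOF // closed cleanly between commands
		}
		return "", ErrBadFraming // bytes arrived, the terminator never did
	default:
		return "", err
	}
}

// classify drains a stream and reports how it ended, the way a session loop
// might decide between a graceful disconnect and a dropped client.
func classify(stream string) string {
	r := bufio.NewReader(strings.NewReader(stream))
	for {
		_, err := readLine(r)
		switch {
		case err == nil:
			continue
		case errors.Is(err, io.EOF):
			return "clean disconnect"
		case errors.Is(err, ErrBadFraming):
			return "torn line"
		default:
			return "io error"
		}
	}
}

func main() {
	fmt.Println(classify("ACK 7\nACK 8\n"))
	fmt.Println(classify("ACK 7\nAC"))
}
```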

&lt;h2&gt;
  
  
  Brokering, in two parts: structure vs semantics
&lt;/h2&gt;

&lt;p&gt;A broker has to do two things: route messages from producers to consumers, and guarantee something about the delivery (&lt;code&gt;at-least-once&lt;/code&gt;, in toymq's case). I split this into two branches and the split mattered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure&lt;/strong&gt; came first: topics (auto-created on first publish), a WAL per topic, an LRU dedupe index keyed by &lt;code&gt;(producer-id, msg-id)&lt;/code&gt;, in-memory consumer state. Acceptance bar: "you can publish a message and consume it; if the broker restarts, recovery works." No visibility timeouts, no NACKs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantics&lt;/strong&gt; came second. Every message a consumer has been told about lives in one of three states:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc3RhdGVEaWFncmFtLXYyCiAgICBbKl0gLS0-IFBlbmRpbmcgOiBXQUwuQXBwZW5kIHByb2R1Y2VzIE1zZ0lECiAgICBQZW5kaW5nIC0tPiBJbmZsaWdodCA6IHJ1bkRlbGl2ZXJ5IHNlbmRzIE1TR1xuKFNlbnRBdCArIEF0dGVtcHRzKyspCiAgICBJbmZsaWdodCAtLT4gQWNrZWQgOiBDbGllbnQgQUNLCiAgICBJbmZsaWdodCAtLT4gUGVuZGluZyA6IHZpc2liaWxpdHkgdGltZW91dCBmaXJlc1xuKG5vdyAtIFNlbnRBdCA-IFZpc2liaWxpdHkpCiAgICBJbmZsaWdodCAtLT4gUGVuZGluZyA6IENsaWVudCBOQUNLCiAgICBBY2tlZCAtLT4gWypd%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc3RhdGVEaWFncmFtLXYyCiAgICBbKl0gLS0-IFBlbmRpbmcgOiBXQUwuQXBwZW5kIHByb2R1Y2VzIE1zZ0lECiAgICBQZW5kaW5nIC0tPiBJbmZsaWdodCA6IHJ1bkRlbGl2ZXJ5IHNlbmRzIE1TR1xuKFNlbnRBdCArIEF0dGVtcHRzKyspCiAgICBJbmZsaWdodCAtLT4gQWNrZWQgOiBDbGllbnQgQUNLCiAgICBJbmZsaWdodCAtLT4gUGVuZGluZyA6IHZpc2liaWxpdHkgdGltZW91dCBmaXJlc1xuKG5vdyAtIFNlbnRBdCA-IFZpc2liaWxpdHkpCiAgICBJbmZsaWdodCAtLT4gUGVuZGluZyA6IENsaWVudCBOQUNLCiAgICBBY2tlZCAtLT4gWypd%3Ftype%3Dpng" alt="Visibility-timeout state machine: Pending ↔ Inflight → Acked" width="540" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A message can bounce between &lt;code&gt;Pending&lt;/code&gt; and &lt;code&gt;Inflight&lt;/code&gt; many times before reaching &lt;code&gt;Acked&lt;/code&gt;; that is the at-least-once contract. The chaos suite's no-loss invariant verifies that every &lt;code&gt;MsgID&lt;/code&gt; the producer received an &lt;code&gt;OK&lt;/code&gt; for reaches at least one consumer's seen set. Duplicate deliveries are allowed; loss is not.&lt;/p&gt;

&lt;p&gt;The redelivery ticker is the load-bearing piece, and it forced the rule that became the hot-path discipline for the entire broker:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never send while holding the inflight lock.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflightMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflightMu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="c"&gt;// safe — no lock held&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A blocking send on a channel, while holding a lock the redelivery scan also wants, is deadlock bait. Take the snapshot under the lock, release it, &lt;em&gt;then&lt;/em&gt; send. The symmetric rule on the ticker side: the scan &lt;strong&gt;does&lt;/strong&gt; hold the lock for the full pass. Releasing it per entry lets a concurrent &lt;code&gt;ACK&lt;/code&gt; delete an entry mid-iteration, and an unsynchronized map write during iteration is a data race that can crash the process. &lt;code&gt;-race&lt;/code&gt; finds the second one in five seconds; code review catches the first.&lt;/p&gt;

&lt;p&gt;I tried to do structure + semantics in one branch first. I rolled it back at the second self-merge conflict. &lt;strong&gt;The two-merge rule:&lt;/strong&gt; if a branch conflicts against itself twice before it's done, the branch is too big. Split it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency: serving many connections without corruption
&lt;/h2&gt;

&lt;p&gt;The broker is one process; clients are many. The design question is &lt;em&gt;how many goroutines per connection, and who owns what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;toymq's answer: &lt;strong&gt;three goroutines per session, one channel of truth between them.&lt;/strong&gt; A fourth — the broker's per-subscription delivery worker — lives on the broker side but writes into this session's outbound channel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBOZXRbKCJuZXQuQ29ubjxici8-KGNsaWVudCBzb2NrZXQpIildCgogICAgc3ViZ3JhcGggU2Vzc2lvblsiT25lIFNlc3Npb24iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIFJbIlNlc3Npb24gUjxici8-cmVhZHMgYnl0ZXMsIHBhcnNlcyBDb21tYW5kIl0KICAgICAgICBIWyJTZXNzaW9uIEg8YnIvPnJ1bnMgYnJva2VyIGNhbGxzLCBidWlsZHMgcmVzcG9uc2UiXQogICAgICAgIFdbIlNlc3Npb24gVzxici8-YnVmZmVycyArIGZsdXNoZXMgbmV0LkNvbm4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnJva2VyWyJCcm9rZXIgc2lkZSJdCiAgICAgICAgRFsicnVuRGVsaXZlcnk8YnIvPihvbmUgcGVyIHN1YnNjcmlwdGlvbik8YnIvPldBTCBSZWFkZXIg4oaSIE1TRyBmcmFtZXMiXQogICAgZW5kCgogICAgTmV0IC0tPnxieXRlcyBpbnwgUgogICAgUiAtLT58ImluYm91bmQgY2hhbiBDb21tYW5kInwgSAogICAgSCAtLT58Im91dGJvdW5kIGNoYW4gcmVzcG9uc2UifCBXCiAgICBEIC0tPnwib3V0Ym91bmQgY2hhbiByZXNwb25zZSJ8IFcKICAgIFcgLS0-fGJ5dGVzIG91dHwgTmV0%3Ftype%3Dpng" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBOZXRbKCJuZXQuQ29ubjxici8-KGNsaWVudCBzb2NrZXQpIildCgogICAgc3ViZ3JhcGggU2Vzc2lvblsiT25lIFNlc3Npb24iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIFJbIlNlc3Npb24gUjxici8-cmVhZHMgYnl0ZXMsIHBhcnNlcyBDb21tYW5kIl0KICAgICAgICBIWyJTZXNzaW9uIEg8YnIvPnJ1bnMgYnJva2VyIGNhbGxzLCBidWlsZHMgcmVzcG9uc2UiXQogICAgICAgIFdbIlNlc3Npb24gVzxici8-YnVmZmVycyArIGZsdXNoZXMgbmV0LkNvbm4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnJva2VyWyJCcm9rZXIgc2lkZSJdCiAgICAgICAgRFsicnVuRGVsaXZlcnk8YnIvPihvbmUgcGVyIHN1YnNjcmlwdGlvbik8YnIvPldBTCBSZWFkZXIg4oaSIE1TRyBmcmFtZXMiXQogICAgZW5kCgogICAgTmV0IC0tPnxieXRlcyBpbnwgUgogICAgUiAtLT58ImluYm91bmQgY2hhbiBDb21tYW5kInwgSAogICAgSCAtLT58Im91dGJvdW5kIGNoYW4gcmVzcG9uc2UifCBXCiAgICBEIC0tPnwib3V0Ym91bmQgY2hhbiByZXNwb25zZSJ8IFcKICAgIFcgLS0-fGJ5dGVzIG91dHwgTmV0%3Ftype%3Dpng" alt="Per-session goroutines: Reader → Handler → Writer, with broker-side 
runDelivery feeding outbound channel" width="888" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session R&lt;/strong&gt; owns the socket's &lt;code&gt;bufio.Reader&lt;/code&gt;. Nothing else reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session W&lt;/strong&gt; owns the socket's &lt;code&gt;bufio.Writer&lt;/code&gt;. Nothing else writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session H&lt;/strong&gt; dispatches parsed commands to the broker and feeds responses into &lt;code&gt;outbound&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;runDelivery&lt;/code&gt;&lt;/strong&gt; is the broker's delivery goroutine for &lt;em&gt;this&lt;/em&gt; consumer; it also writes into the session's &lt;code&gt;outbound&lt;/code&gt; channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;outbound&lt;/code&gt; is the only place where two goroutines (H and &lt;code&gt;runDelivery&lt;/code&gt;) could race to write the socket, and the channel itself serializes them, so no lock is needed. &lt;code&gt;bufio.Writer&lt;/code&gt; is not safe for concurrent use; rather than lock around it, you funnel every write through a channel to its single owner. Closing the socket, draining &lt;code&gt;outbound&lt;/code&gt;, and joining all goroutines becomes one operation: cancel a context, wait on a WaitGroup.&lt;/p&gt;

&lt;p&gt;The rule that came out of this is sharp and reusable: &lt;strong&gt;every &lt;code&gt;go&lt;/code&gt;-spawned goroutine must be accounted for in exactly one &lt;code&gt;WaitGroup&lt;/code&gt; or &lt;code&gt;doneCh&lt;/code&gt;.&lt;/strong&gt; A "WaitGroup race" in the listener — &lt;code&gt;wg.Wait&lt;/code&gt; returning before a goroutine that called &lt;code&gt;wg.Add&lt;/code&gt; had a chance to register — was the bug that taught me to write the rule down explicitly.&lt;/p&gt;
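
&lt;p&gt;A small sketch of the accounting rule, using a hypothetical &lt;code&gt;runWorkers&lt;/code&gt; helper. The detail that matters is where &lt;code&gt;wg.Add&lt;/code&gt; sits: in the spawning goroutine, before the &lt;code&gt;go&lt;/code&gt; statement, so &lt;code&gt;wg.Wait&lt;/code&gt; can never run ahead of a worker that has not registered yet.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers starts n goroutines and returns how many finished. Every
// goroutine is accounted for in exactly one WaitGroup.
func runWorkers(n int) int64 {
	var wg sync.WaitGroup
	var finished atomic.Int64

	for i := 0; n > i; i++ {
		wg.Add(1) // registered by the spawner, before the goroutine exists
		go func() {
			defer wg.Done()
			finished.Add(1)
		}()
	}

	// Calling wg.Add inside the goroutine instead would let Wait return
	// while some workers had not yet registered.
	wg.Wait()
	return finished.Load()
}

func main() {
	fmt.Println("workers finished:", runWorkers(8))
}
```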

&lt;p&gt;The TCP accept loop adds two subtleties most servers get wrong: an &lt;code&gt;EMFILE&lt;/code&gt;-aware exponential backoff (a burst of connections can exhaust file descriptors and tight-loop the broker), and an explicit &lt;code&gt;listener.Close()&lt;/code&gt; on context cancel (without it, &lt;code&gt;Accept()&lt;/code&gt; blocks forever and shutdown hangs).&lt;/p&gt;
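
&lt;p&gt;Both subtleties fit in one loop. The sketch below is an assumption about shape, not toymq's code: &lt;code&gt;serve&lt;/code&gt;, the backoff bounds, and the use of &lt;code&gt;context.AfterFunc&lt;/code&gt; to close the listener are all illustrative choices.&lt;/p&gt;

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync"
	"syscall"
	"time"
)

const (
	minBackoff = 5 * time.Millisecond
	maxBackoff = time.Second
)

// serve accepts connections until ctx is cancelled. It returns nil on a
// cancelled context and the underlying error on anything unrecoverable.
func serve(ctx context.Context, ln net.Listener, handle func(net.Conn)) error {
	// Closing the listener is what unblocks Accept when ctx is cancelled.
	stop := context.AfterFunc(ctx, func() { ln.Close() })
	defer stop()

	var wg sync.WaitGroup
	defer wg.Wait() // join every connection goroutine before returning

	backoff := minBackoff
	for {
		conn, err := ln.Accept()
		if err == nil {
			backoff = minBackoff
			wg.Add(1)
			go func() {
				defer wg.Done()
				defer conn.Close()
				handle(conn)
			}()
			continue
		}
		if ctx.Err() != nil {
			return nil // shutdown requested; the Accept error is expected
		}
		if errors.Is(err, syscall.EMFILE) || errors.Is(err, syscall.ENFILE) {
			// Out of file descriptors: sleep instead of spinning on Accept.
			time.Sleep(backoff)
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		return err
	}
}

// demo starts a listener on a kernel-assigned port and cancels shortly after.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(20*time.Millisecond, cancel)
	return serve(ctx, ln, func(net.Conn) {})
}

func main() {
	fmt.Println("serve returned:", demo())
}
```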

&lt;p&gt;&lt;code&gt;cmd/toymq&lt;/code&gt; keeps &lt;code&gt;main.go&lt;/code&gt; thin: a separate &lt;code&gt;internal/config&lt;/code&gt; package owns flag validation, a testable &lt;code&gt;run(ctx, args, stdout, stderr) int&lt;/code&gt; lets tests drive shutdown via context cancellation, and &lt;code&gt;signal.NotifyContext&lt;/code&gt; wires &lt;code&gt;SIGTERM&lt;/code&gt;/&lt;code&gt;SIGINT&lt;/code&gt; into the context in one line. The smoke test boots &lt;code&gt;run&lt;/code&gt;, sends a fake &lt;code&gt;SIGTERM&lt;/code&gt;, and asserts a clean exit. It catches "shutdown leaks a goroutine" the moment it happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing what unit tests can't see
&lt;/h2&gt;

&lt;p&gt;By this point every package had unit tests. Coverage was 78%. &lt;code&gt;-race&lt;/code&gt; clean. But the most interesting bugs in any networked system live in the &lt;em&gt;seams&lt;/em&gt; — what happens when the broker sends a 300-byte &lt;code&gt;MSG&lt;/code&gt;, the client receives bytes 1..200, and then disconnects?&lt;/p&gt;

&lt;p&gt;I built in-process integration tests next: a real broker + server on a kernel-assigned TCP port, a stripped-down test client that queues incoming &lt;code&gt;MSG&lt;/code&gt;s so asynchronous deliveries don't deadlock the response-reading. Six scenarios: round-trip ACK, 1000-message restart, visibility-timeout redelivery, NACK redelivery, dedupe, subscribe takeover. Coverage went from 78% to 90.3%.&lt;/p&gt;

&lt;p&gt;The suite caught the worst bug of the project on its first run. The test:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Publish message &lt;code&gt;id=0&lt;/code&gt;. Subscribe. Receive MSG. ACK.&lt;/li&gt;
&lt;li&gt;Restart the broker.&lt;/li&gt;
&lt;li&gt;Subscribe again. &lt;strong&gt;Expect no replay.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It failed. Message &lt;code&gt;id=0&lt;/code&gt; replayed. With &lt;code&gt;id=1&lt;/code&gt; as the first message, no replay. The bug was specifically the zero msg-id.&lt;/p&gt;

&lt;p&gt;Root cause: consumer state on disk stored a single &lt;code&gt;lastAcked uint64&lt;/code&gt;. Recovery treated &lt;code&gt;lastAcked == 0&lt;/code&gt; as "never acked" (the Go zero value) instead of "acked msg id 0." On the first restart after acking the very first message, the broker couldn't tell the difference. The fix adds a &lt;code&gt;hasAcked bool&lt;/code&gt; and persists both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ConsumerState&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;LastAcked&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;
    &lt;span class="n"&gt;HasAcked&lt;/span&gt;  &lt;span class="kt"&gt;bool&lt;/span&gt;   &lt;span class="c"&gt;// ← the field that should have been there from day 1&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Go-language lesson: &lt;strong&gt;zero values are not sentinels.&lt;/strong&gt; They collide with valid data the moment your domain is non-empty. The systems lesson is sharper: &lt;strong&gt;the bug existed for six commits before anyone noticed.&lt;/strong&gt; Unit tests never restarted the broker between Ack and Subscribe, so nothing exercised the recovery path with a persisted zero. By the time integration tests ran, the broken state had been sealed in code-review history six commits earlier.&lt;/p&gt;
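
&lt;p&gt;To make the collision concrete, here is a sketch of a recovery decision using the two-field state. &lt;code&gt;resumeFrom&lt;/code&gt; and &lt;code&gt;roundTrip&lt;/code&gt; are hypothetical helpers, and the JSON encoding simply stands in for whatever &lt;code&gt;offsets.json&lt;/code&gt; actually contains.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ConsumerState struct {
	LastAcked uint64
	HasAcked  bool
}

// resumeFrom returns the first msg-id to deliver after a restart. Without
// HasAcked, a fresh consumer and one that acked msg-id 0 look identical.
func resumeFrom(s ConsumerState) uint64 {
	if !s.HasAcked {
		return 0 // nothing acked yet: start from the first message
	}
	return s.LastAcked + 1
}

// roundTrip persists and reloads the state, as a broker restart would.
func roundTrip(s ConsumerState) (ConsumerState, error) {
	data, err := json.Marshal(s)
	if err != nil {
		return ConsumerState{}, err
	}
	out := new(ConsumerState)
	if err := json.Unmarshal(data, out); err != nil {
		return ConsumerState{}, err
	}
	return *out, nil
}

func main() {
	fresh, _ := roundTrip(ConsumerState{})
	ackedZero, _ := roundTrip(ConsumerState{LastAcked: 0, HasAcked: true})

	fmt.Println("fresh consumer resumes at:", resumeFrom(fresh))
	fmt.Println("consumer that acked id 0 resumes at:", resumeFrom(ackedZero))
}
```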

&lt;p&gt;Then I added &lt;strong&gt;chaos&lt;/strong&gt;: a supervisor that &lt;code&gt;SIGKILL&lt;/code&gt;s the broker on schedule and restarts it; a producer that keeps publishing through restarts; a consumer that records every msg-id it ever sees. Invariant: for every msg-id the producer got an &lt;code&gt;OK&lt;/code&gt; for, the consumer must receive it at least once, eventually. A 90-second smoke pushes 14,000 messages through three &lt;code&gt;SIGKILL&lt;/code&gt; cycles with zero acked loss. That's not a proof of correctness — it's evidence the WAL + fsync + offset design holds up under realistic crash patterns.&lt;/p&gt;
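
&lt;p&gt;The invariant itself is a small set comparison. A sketch, with &lt;code&gt;lostMessages&lt;/code&gt; as a hypothetical name and the producer and consumer records reduced to a slice and a map:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// lostMessages returns every msg-id the producer received an OK for that the
// consumer never saw. seen maps msg-id to delivery count, so duplicates are
// tolerated and only a count of zero is a violation.
func lostMessages(okIDs []uint64, seen map[uint64]int) []uint64 {
	var lost []uint64
	for _, id := range okIDs {
		if seen[id] == 0 {
			lost = append(lost, id)
		}
	}
	sort.Slice(lost, func(i, j int) bool { return lost[j] > lost[i] })
	return lost
}

func main() {
	okIDs := []uint64{0, 1, 2, 3}

	// msg-id 1 was redelivered after a restart; that is allowed.
	healthy := map[uint64]int{0: 1, 1: 3, 2: 1, 3: 1}
	fmt.Println("lost:", lostMessages(okIDs, healthy))

	// msg-id 2 got an OK but was never delivered; the run fails.
	broken := map[uint64]int{0: 1, 1: 3, 3: 1}
	fmt.Println("lost:", lostMessages(okIDs, broken))
}
```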

&lt;p&gt;Chaos also found a data race in the chaos &lt;em&gt;test itself&lt;/em&gt; — a &lt;code&gt;bytes.Buffer&lt;/code&gt; capturing stderr being written by the supervisor and read by the assertion. The broker was correct; the test had the race. &lt;strong&gt;&lt;code&gt;-race&lt;/code&gt; is for everything you ship, including tests.&lt;/strong&gt; A race in test infrastructure can mask real bugs or manufacture phantom ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  After v1.0: the part where the work gets easier
&lt;/h2&gt;

&lt;p&gt;Once the bottom half held, the top half was craft, not danger. A Go client library (&lt;code&gt;pkg/client&lt;/code&gt;) on top of the protocol, with a single read goroutine that demuxes incoming frames by type — separate channels per call would have raced two reads on the socket, a deadlock recipe. A CLI (&lt;code&gt;toymqctl&lt;/code&gt;) and a latency harness (&lt;code&gt;toymq-bench&lt;/code&gt;). A Bubble Tea TUI (&lt;code&gt;toymq-tui&lt;/code&gt;) with one sharp lesson worth keeping: &lt;strong&gt;boolean state for "is X active?" is a smell.&lt;/strong&gt; A &lt;code&gt;bool&lt;/code&gt; for "is a modal open?" broke the day the second modal shipped on top of the first. A modal &lt;em&gt;stack&lt;/em&gt; fixed it. Lists, stacks, enums — anything but a boolean — almost always model the actual shape.&lt;/p&gt;

&lt;p&gt;Then Prometheus metrics and OpenTelemetry tracing, with the design rule: &lt;strong&gt;observability does not change behavior.&lt;/strong&gt; No locks, no goroutines, no allocations on the hot path. The counters returned by &lt;code&gt;prometheus.CounterVec.WithLabelValues&lt;/code&gt; are looked up once and cached; &lt;code&gt;otel.Tracer().Start&lt;/code&gt; is a no-op when no provider is attached.&lt;/p&gt;

&lt;p&gt;Finally, CI hardening after &lt;code&gt;main&lt;/code&gt; went red twice in a row with &lt;code&gt;gofmt&lt;/code&gt; drift. Two recurrences is the line where you stop fixing instances and start fixing the system: a &lt;code&gt;Makefile&lt;/code&gt;, an opt-in pre-commit hook, &lt;code&gt;golangci-lint&lt;/code&gt; with a conservative ruleset, and a Go 1.25 + 1.26 CI matrix. The linter's first run flagged 31 things; seven were real problems, including two typos (&lt;code&gt;commited&lt;/code&gt;, &lt;code&gt;exhuastion&lt;/code&gt;), three dead helpers, and one &lt;code&gt;cap&lt;/code&gt;-shadowing parameter. A linter that flags 31 things on its first run is doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  ADRs as crystallization, not ceremony
&lt;/h2&gt;

&lt;p&gt;Seventeen ADRs sounds like ceremony. It wasn't.&lt;/p&gt;

&lt;p&gt;The rule: &lt;strong&gt;write the ADR the moment the decision is forced by code, never before.&lt;/strong&gt; Not when I had a hunch. When I had to type "we are doing it this way" into a function, &lt;em&gt;then&lt;/em&gt; the ADR went next to it.&lt;/p&gt;

&lt;p&gt;ADRs written before code force the code to fit the ADR, which means defending decisions made when you knew the least. ADRs written &lt;em&gt;at&lt;/em&gt; crystallization record what the code already decided. If the code later changes its mind, you supersede the old ADR — you don't try to retroactively justify it. The point is that the ADR is always honest about &lt;em&gt;when&lt;/em&gt; the decision was made.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run chaos one branch earlier.&lt;/strong&gt; Chaos surfaced problems that the integration tests had missed. It was the right tool, deployed one branch too late.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI matrix from day one.&lt;/strong&gt; A Go 1.25 + 1.26 matrix takes 30 minutes to set up and prevents an entire class of "works on my machine" failure. It should have been part of the first &lt;code&gt;chore(ci)&lt;/code&gt; commit, not a retrofit at the end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wire trace context through the protocol from v1.3.&lt;/strong&gt; Spans today are root spans inside the broker. Cross-process correlation needs a &lt;code&gt;TRACEPARENT&lt;/code&gt; line in the wire format. Wire-format changes are exactly what you regret deferring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat zero values as a code smell from day one.&lt;/strong&gt; The &lt;code&gt;lastAcked&lt;/code&gt; bug was the most embarrassing of the project. Whenever the answer is "use 0 / &lt;code&gt;""&lt;/code&gt; / &lt;code&gt;nil&lt;/code&gt; / &lt;code&gt;-1&lt;/code&gt; to mean absent," the right answer is &lt;code&gt;(value, present bool)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
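
&lt;p&gt;The last point is easier to see in code. What follows is a minimal sketch, not the actual &lt;code&gt;toymq&lt;/code&gt; implementation: &lt;code&gt;ackState&lt;/code&gt; and &lt;code&gt;nextToDeliver&lt;/code&gt; are hypothetical names, and it assumes offset 0 is a valid offset, which is exactly the case where a bare zero value cannot also mean "nothing acked yet".&lt;/p&gt;

```go
package main

import "fmt"

// ackState is a hypothetical tracker for the last acknowledged offset.
// Offset 0 is assumed to be valid, so presence is stored separately
// instead of overloading 0 as "absent".
type ackState struct {
	offset  uint64
	present bool
}

// Ack records offset as the most recently acknowledged one.
func (s *ackState) Ack(offset uint64) {
	s.offset = offset
	s.present = true
}

// LastAcked returns the last acknowledged offset and whether any
// acknowledgement has been recorded at all.
func (s ackState) LastAcked() (uint64, bool) {
	return s.offset, s.present
}

// nextToDeliver returns the first offset that still needs delivery.
// With nothing acked yet, delivery starts at offset 0.
func nextToDeliver(s ackState) uint64 {
	last, ok := s.LastAcked()
	if !ok {
		return 0
	}
	return last + 1
}

func main() {
	var s ackState
	fmt.Println(nextToDeliver(s)) // 0: nothing acked, start at the beginning

	s.Ack(0)
	fmt.Println(nextToDeliver(s)) // 1: offset 0 was acked, move past it
}
```

&lt;p&gt;With a bare &lt;code&gt;uint64&lt;/code&gt;, both calls in &lt;code&gt;main&lt;/code&gt; would see the same zero and could not tell "never acked" from "acked offset 0". The extra &lt;code&gt;bool&lt;/code&gt; makes the caller handle the absent case explicitly.&lt;/p&gt;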

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Persisting the dedupe LRU to disk without losing the boring-by-design property of the WAL.&lt;/strong&gt; Right now the LRU is in memory only — survives one broker lifetime, lost on &lt;code&gt;SIGKILL&lt;/code&gt;. The fix mirrors the atomic-swap pattern used for offsets; the interesting question is what happens on a torn write of the LRU snapshot itself, since unlike the WAL there's no CRC-framed record stream to scan. That's the next post.&lt;/p&gt;

&lt;p&gt;After that, the series moves to &lt;strong&gt;tinykv&lt;/strong&gt; (a Redis-subset KV store in Go) and &lt;strong&gt;tinyraft&lt;/strong&gt; (a 3-node Raft consensus cluster). Same risk-first sequencing. Same ADRs-as-crystallization. The arc ends where you compose them: replicated state machine semantics on top of a real consensus log.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one-line version
&lt;/h2&gt;

&lt;p&gt;If a tutorial tells you to build the server first, you've already started in the wrong place. &lt;strong&gt;The storage is the project.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://github.com/prajwalmahajan101/toymq" rel="noopener noreferrer"&gt;&lt;code&gt;prajwalmahajan101/toymq&lt;/code&gt;&lt;/a&gt;. Deeper docs (with every diagram in this post and more): &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/docs/ARCHITECTURE.md" rel="noopener noreferrer"&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/docs/PERSISTENCE.md" rel="noopener noreferrer"&gt;&lt;code&gt;PERSISTENCE.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/docs/CONCURRENCY.md" rel="noopener noreferrer"&gt;&lt;code&gt;CONCURRENCY.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/docs/REDELIVERY.md" rel="noopener noreferrer"&gt;&lt;code&gt;REDELIVERY.md&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/docs/FLOWS.md" rel="noopener noreferrer"&gt;&lt;code&gt;FLOWS.md&lt;/code&gt;&lt;/a&gt;. ADR index: &lt;a href="https://github.com/prajwalmahajan101/toymq/tree/main/docs/adr" rel="noopener noreferrer"&gt;&lt;code&gt;docs/adr/&lt;/code&gt;&lt;/a&gt;. Open ideas: &lt;a href="https://github.com/prajwalmahajan101/toymq/blob/main/IDEA.md" rel="noopener noreferrer"&gt;&lt;code&gt;IDEA.md&lt;/code&gt;&lt;/a&gt;. Corrections welcome — open a discussion or file an issue.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
