<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harrison Guo</title>
    <description>The latest articles on DEV Community by Harrison Guo (@harrison_guo_e01b4c8793a0).</description>
    <link>https://dev.to/harrison_guo_e01b4c8793a0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809272%2Ff7da2c77-d1e2-4b04-8cf4-11c5f274f605.png</url>
      <title>DEV Community: Harrison Guo</title>
      <link>https://dev.to/harrison_guo_e01b4c8793a0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harrison_guo_e01b4c8793a0"/>
    <language>en</language>
    <item>
      <title>Why Go Handles Millions of Connections: User-Space Context Switching, Explained</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 14 Apr 2026 21:43:03 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</guid>
      <description>&lt;p&gt;Somewhere around 40,000 concurrent connections, your Java service falls over. Not from CPU, not from network — from memory, because every connection is a thread and every thread wants its own megabyte of stack. By the time you've finished Googling whether this is a &lt;code&gt;-Xss&lt;/code&gt; problem or a &lt;code&gt;ulimit&lt;/code&gt; problem, Ops has already bumped the box to 64 GB and you've pushed the wall back another 20,000 connections. Linear in RAM. It never ends.&lt;/p&gt;

&lt;p&gt;A Go service on half that box can hold 200,000 connections without noticing. People assume it's because Go is faster. It isn't. Per-request, Go and Java are roughly the same — sometimes Java wins. What Go does differently is more fundamental: &lt;strong&gt;it stops asking the kernel to help.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — High concurrency isn't about raw CPU. It's about how cheaply you can hold an idle connection open. Go's 2KB goroutine stacks and user-space M:N scheduler push the marginal cost of a connection close to zero. The kernel only gets involved when there's real I/O to do. This is the same principle HFT engines chase with DPDK and io_uring — Go just hands it to you for free.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Mental Model
&lt;/h2&gt;

&lt;p&gt;Most engineers I talk to think "threads are expensive because threading is hard." That's not wrong, but it misses the more mechanical reason.&lt;/p&gt;

&lt;p&gt;Every time a traditional language (Java pre-Loom, C# pre-async everywhere, classic Python) parks a thread waiting for I/O, it pays two concrete costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack memory&lt;/strong&gt;: Default JVM thread stack is 1 MB. 40,000 threads = 40 GB of stack, most of which is unused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-switch cost&lt;/strong&gt;: When the OS swaps the thread, it traps into the kernel, saves the full register set, swaps page tables if there's an address-space change, flushes TLB entries, and walks the scheduler's runqueue. Measured on modern x86, that's &lt;strong&gt;1–5 microseconds per switch&lt;/strong&gt;, plus the less visible cost of instruction-cache pollution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Multiply that by tens of thousands of waiters and you're paying the kernel a rent that has nothing to do with your actual workload.&lt;/p&gt;
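&lt;p&gt;You can feel that rent from inside Go itself. The sketch below (mine, not from any benchmark suite) bounces a token between two goroutines twice: once with each goroutine pinned to its own OS thread via &lt;code&gt;runtime.LockOSThread&lt;/code&gt;, so every handoff has to wake a parked thread through the kernel, and once unpinned, where the handoff stays in the runtime. Absolute numbers vary by machine and OS; the gap between the two is the point.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pingPong bounces a token between two goroutines n times and returns
// the average one-way handoff latency. With lockThreads set, each
// goroutine is pinned to its own OS thread, so every handoff must wake
// a parked thread through the kernel scheduler.
func pingPong(n int, lockThreads bool) time.Duration {
	ping, pong := make(chan struct{}), make(chan struct{})
	done := make(chan time.Duration)

	go func() {
		if lockThreads {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		for i := 0; i < n; i++ {
			<-ping
			pong <- struct{}{}
		}
	}()

	go func() {
		if lockThreads {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		start := time.Now()
		for i := 0; i < n; i++ {
			ping <- struct{}{}
			<-pong
		}
		done <- time.Since(start) / time.Duration(2*n)
	}()

	return <-done
}

func main() {
	const n = 100_000
	fmt.Printf("goroutine handoff: %v\n", pingPong(n, false))
	fmt.Printf("OS-thread handoff: %v\n", pingPong(n, true))
}
```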

&lt;h2&gt;
  
  
  What Go Does Instead
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" alt="Java: one 1 MB-stack thread per connection, scheduled by the kernel — versus Go: 2 KB goroutines multiplexed by the user-space runtime onto a small pool of OS threads" width="1545" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go's concurrency is built on an &lt;strong&gt;M:N scheduler&lt;/strong&gt;. You have many goroutines (N) multiplexed onto a small number of OS threads (M, typically &lt;code&gt;GOMAXPROCS&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Here's the part that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A goroutine starts with a &lt;strong&gt;2 KB stack&lt;/strong&gt;, not a megabyte. Growth is copy-and-resize in user space, triggered by the function prologue when it detects a near-overflow.&lt;/li&gt;
&lt;li&gt;Switching between goroutines happens &lt;strong&gt;entirely in the Go runtime&lt;/strong&gt;. No syscall. No TLB flush. No register-set save-and-restore at OS cost. Roughly a couple hundred nanoseconds in microbenchmarks — an order of magnitude cheaper than an OS-level context switch. The exact number moves around with workload, scheduler contention, and Go version; what's stable is the order of magnitude.&lt;/li&gt;
&lt;li&gt;When a goroutine blocks on network I/O, the runtime parks it and flips the underlying OS thread to run a different goroutine. The goroutine's state lives in Go's own scheduler, not in a kernel wait queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the actual answer to "why Go scales to millions of connections": &lt;strong&gt;the runtime refuses to hand idle work back to the kernel&lt;/strong&gt;. The kernel still does the real I/O — Go uses &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on BSD, IOCP on Windows — but it only involves the kernel when there's &lt;em&gt;actual&lt;/em&gt; work, not when a goroutine is just sitting around.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Small Benchmark That Tells the Whole Story
&lt;/h2&gt;

&lt;p&gt;Here's a stripped-down Go program that spins up N goroutines, each blocked on a channel read, and prints the total RSS once they're all parked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"runtime"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
    &lt;span class="s"&gt;"syscall"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sscanf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="c"&gt;// park forever&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Let the runtime settle&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rusage&lt;/span&gt;
    &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getrusage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUSAGE_SELF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"goroutines=%d  rss=%d KB  (%.1f KB/goroutine)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my laptop (M1, Go 1.22, macOS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goroutines=10000    rss=28672 KB   (2.9 KB/goroutine)
goroutines=100000   rss=263168 KB  (2.6 KB/goroutine)
goroutines=1000000  rss=2600960 KB (2.6 KB/goroutine)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
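&lt;p&gt;One portability caveat on the program above: &lt;code&gt;Rusage.Maxrss&lt;/code&gt; is reported in kilobytes on Linux but in bytes on macOS, so the &lt;code&gt;KB&lt;/code&gt; label only holds on Linux. A variant using the runtime's own stack accounting sidesteps the OS entirely — a sketch, noting that &lt;code&gt;MemStats.StackSys&lt;/code&gt; measures reserved stack spans rather than resident memory:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackPerGoroutine parks n goroutines and reports the runtime's own
// stack accounting per goroutine, in KiB, via MemStats.StackSys —
// portable, unlike Maxrss (kilobytes on Linux, bytes on macOS).
func stackPerGoroutine(n int) float64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	var started sync.WaitGroup
	started.Add(n)
	block := make(chan struct{}) // never closed: goroutines stay parked

	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-block
		}()
	}
	started.Wait() // every goroutine has been scheduled at least once

	runtime.ReadMemStats(&after)
	return float64(after.StackSys-before.StackSys) / float64(n) / 1024
}

func main() {
	fmt.Printf("stack ≈ %.1f KiB/goroutine\n", stackPerGoroutine(100_000))
}
```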



&lt;p&gt;&lt;strong&gt;2.6 KB per parked goroutine&lt;/strong&gt;, flat, all the way to a million. That's the story. Not 1 MB. Not 256 KB. Two and a half KB.&lt;/p&gt;

&lt;p&gt;Try the equivalent program with &lt;code&gt;new Thread(() -&amp;gt; ...).start()&lt;/code&gt; in Java and you will run out of memory well before 100,000. The comparison isn't even close, and it isn't about execution speed — it's about what an idle waiter costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel in Finance: Same Problem, Opposite Extreme
&lt;/h2&gt;

&lt;p&gt;The part that made this click for me is noticing where else this principle shows up. High-frequency trading engines and exchange colocation boxes have the same bottleneck — kernel context switches are expensive — and they solve it the other way: &lt;strong&gt;skip the kernel entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DPDK&lt;/strong&gt; gives userspace direct access to the NIC. Packets bypass the kernel network stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel-bypass sockets&lt;/strong&gt; (Solarflare Onload, AWS Nitro enhanced networking) push the TCP/IP stack into userspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io_uring&lt;/strong&gt; on modern Linux brings the same idea to general-purpose code — a shared memory ring buffer between app and kernel, batched, with minimal syscalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDMA&lt;/strong&gt; lets network cards write directly into another machine's memory. No kernel on either end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tools, same target: &lt;strong&gt;syscalls and context switches are expensive; keep them off the hot path&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Go arrives at the same destination with a completely different route. Instead of bypassing the kernel, it hides the kernel behind a user-space scheduler and only calls in when absolutely necessary. HFT says "the kernel is slow, route around it." Go says "the kernel is slow, so we'll handle most of the state ourselves and only ring the kernel's doorbell when we have real work." The principle is identical.&lt;/p&gt;

&lt;p&gt;Once you see this pattern, you start seeing it everywhere. V8 Isolates. Erlang processes. Rust async runtimes. The details differ but the bet is the same: &lt;strong&gt;keep concurrency cheap by keeping it out of the kernel&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Go Actually Breaks Under Load
&lt;/h2&gt;

&lt;p&gt;None of this means Go scales forever. When I've seen Go services crack at scale, it's usually not the runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File descriptors&lt;/strong&gt;: Default &lt;code&gt;ulimit -n&lt;/code&gt; is 1024 on most systems. You'll hit this before you stress the scheduler. Push it to 1M if you're actually building a long-poll service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral ports&lt;/strong&gt;: If your service fans out to a downstream with lots of short-lived outbound connections, the 28K-ish default ephemeral port range bites before anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conntrack tables&lt;/strong&gt;: Linux's &lt;code&gt;nf_conntrack_max&lt;/code&gt; default is laughably small for a real service. Tune it or turn it off on high-throughput paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC pressure from allocation-heavy handlers&lt;/strong&gt;: The scheduler is cheap; the garbage collector is not. &lt;code&gt;sync.Pool&lt;/code&gt; reuse, stack-allocated buffers, and careful escape analysis still matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The load balancer&lt;/strong&gt;: Your L4/L7 LB probably caps out before Go does.&lt;/li&gt;
&lt;/ul&gt;
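&lt;p&gt;On the GC-pressure point, the standard mitigation is recycling per-request scratch space through &lt;code&gt;sync.Pool&lt;/code&gt;. A minimal sketch — the handler here is made up for illustration:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles per-request scratch buffers so an allocation-heavy
// handler stops feeding the garbage collector a fresh buffer per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle is a made-up handler standing in for any hot-path function
// that needs temporary buffer space.
func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // pooled buffers keep capacity; length must be cleared
	defer bufPool.Put(buf)

	buf.WriteString("processed: ")
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("order-42")) // buffer comes from, and returns to, the pool
}
```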

&lt;p&gt;I've watched a Go service sit happily at 400K connections on a single pod while the upstream Envoy bled under its own CPU budget. The Go process was the calm one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Isn't a Speed Contest
&lt;/h2&gt;

&lt;p&gt;It's a cost-of-idleness contest.&lt;/p&gt;

&lt;p&gt;If you're building anything with long-lived connections — streaming APIs, WebSocket fan-out, server-sent events, message brokers, pub/sub gateways, anything with more connections than cores — the question isn't "is my language fast?" It's "&lt;strong&gt;how much does one idle waiter cost me?&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;Go's answer is 2.6 KB and 200 nanoseconds. That's why it scales.&lt;/p&gt;

&lt;p&gt;If you come from a world where "high concurrency" means "we bought a bigger box," Go can feel like cheating. It isn't. It's just a careful, decade-old design decision that says: the kernel is a system call you should make as rarely as possible, and when you must, do it in bulk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://golang.org/src/runtime/HACKING.md" rel="noopener noreferrer"&gt;The Go Scheduler: Design Principles (Dmitry Vyukov)&lt;/a&gt; — runtime internals from a core contributor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime/proc.go&lt;/code&gt; in the Go source tree — the actual M/P/G logic, shorter and more readable than you'd expect&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://people.freebsd.org/~jlemon/papers/kqueue.pdf" rel="noopener noreferrer"&gt;Dragonfly BSD's &lt;code&gt;kqueue&lt;/code&gt; paper&lt;/a&gt; — where &lt;code&gt;epoll&lt;/code&gt; got many of its ideas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kernel.dk/io_uring.pdf" rel="noopener noreferrer"&gt;io_uring introduction (Jens Axboe)&lt;/a&gt; — the modern-kernel answer to the same problem Go solved in user space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to understand why the decade-old Go scheduler still holds up, read &lt;code&gt;runtime/proc.go&lt;/code&gt; once. The comments alone are worth an afternoon.&lt;/p&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:40 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article replaces and deepens our earlier analysis, &lt;a href="https://harrisonsec.com/blog/claude-code-memory-simpler-than-you-think/" rel="noopener noreferrer"&gt;Claude Code's Memory Is Simpler Than You Think&lt;/a&gt;. The original focused on limitations. This one focuses on &lt;strong&gt;why&lt;/strong&gt; — the first-principles tradeoffs behind every design choice.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Principle: Only Record What Cannot Be Derived
&lt;/h2&gt;

&lt;p&gt;This single constraint governs every decision in Claude Code's memory system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't save code patterns — read the current code. Don't save git history — run &lt;code&gt;git log&lt;/code&gt;. Don't save file paths — glob the project. Don't save past bug fixes — they're in commits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't about saving storage. It's about &lt;strong&gt;preventing memory drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a memory says "auth module lives in &lt;code&gt;src/auth/&lt;/code&gt;", one refactor makes that memory a lie. But the model doesn't know it's a lie — it trusts specific references by default. A stale memory is worse than no memory at all, because the model acts on it with confidence.&lt;/p&gt;

&lt;p&gt;Code is self-describing. The source of truth is always the current state of the project, not a snapshot from three weeks ago. Memory should store &lt;strong&gt;meta-information&lt;/strong&gt; — who the user is, what they prefer, what decisions were made and why — not facts that the codebase already expresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Types, Closed Taxonomy
&lt;/h2&gt;

&lt;p&gt;Claude Code enforces exactly four memory types. Not tags. Not categories. Four types with hard boundaries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What to Store&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;user&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity, preferences, expertise&lt;/td&gt;
&lt;td&gt;"Data scientist, focused on observability"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Behavioral corrections AND confirmations&lt;/td&gt;
&lt;td&gt;"Don't summarize after code changes — user reads diffs"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decisions, deadlines, stakeholder context&lt;/td&gt;
&lt;td&gt;"Merge freeze after 2026-03-05 for mobile release"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;reference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pointers to external systems&lt;/td&gt;
&lt;td&gt;"Pipeline bugs tracked in Linear INGEST project"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why closed taxonomy beats open tagging:&lt;/strong&gt; Free-form tags cause label explosion. A model tagging memories freely might produce "coding-style", "code-style", "style-preference", "formatting" — four labels for the same concept. Closed taxonomy forces an explicit semantic choice. Each type has different storage structure (feedback requires &lt;code&gt;Why&lt;/code&gt; + &lt;code&gt;How to apply&lt;/code&gt; fields) and different retrieval behavior. The constraint buys clarity.&lt;/p&gt;
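&lt;p&gt;In code, a closed taxonomy is just an enum plus validation. A sketch of the idea — the field names here are illustrative, not Claude Code's actual schema:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// MemoryType is a closed set: adding a label means editing this file,
// not inventing a tag at runtime.
type MemoryType string

const (
	TypeUser      MemoryType = "user"
	TypeFeedback  MemoryType = "feedback"
	TypeProject   MemoryType = "project"
	TypeReference MemoryType = "reference"
)

// Memory is a hypothetical record shape for illustration.
type Memory struct {
	Type        MemoryType
	Description string
	Why         string // required for feedback
	HowToApply  string // required for feedback
}

// Validate rejects open-ended tags and enforces per-type fields.
func (m Memory) Validate() error {
	switch m.Type {
	case TypeUser, TypeProject, TypeReference:
		return nil
	case TypeFeedback:
		if m.Why == "" || m.HowToApply == "" {
			return errors.New("feedback memories need Why and How-to-apply")
		}
		return nil
	default:
		return fmt.Errorf("unknown memory type %q", m.Type)
	}
}

func main() {
	bad := Memory{Type: "coding-style", Description: "tabs not spaces"}
	fmt.Println(bad.Validate()) // rejected: "coding-style" is not in the taxonomy
}
```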

&lt;h3&gt;
  
  
  Why Positive Feedback Matters More Than Corrections
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;feedback&lt;/code&gt; type stores both failures AND successes. The source code explains why:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine the user says "this code style is great, keep doing this." If you don't save that, next session the model might "improve" the style — moving away from what the user explicitly liked. Positive feedback &lt;strong&gt;anchors&lt;/strong&gt; the model to known-good patterns. Without anchors, corrections alone push the model toward progressively safer (blander) output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Type: Relative Dates Kill You
&lt;/h3&gt;

&lt;p&gt;When a user says "merge freeze after Thursday", the memory must store "merge freeze after 2026-03-05." A memory read three weeks later has no idea what "Thursday" meant. This seems obvious, but it's an explicit rule in the source code because models default to storing user language verbatim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sonnet Side-Query Instead of Vector Embeddings
&lt;/h2&gt;

&lt;p&gt;This is the design choice that draws the most criticism. Claude Code uses a live LLM call (Sonnet) to pick relevant memories instead of vector similarity search. Here's the actual tradeoff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" alt="flowchart LR" width="1704" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet reads descriptions (not full content), evaluates semantic relevance, and returns up to 5 filenames. The call costs ~250ms and 256 output tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this over vector embeddings:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Sonnet Side-Query&lt;/th&gt;
&lt;th&gt;Vector Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic depth&lt;/td&gt;
&lt;td&gt;Full language understanding — "deployment" matches "CI/CD"&lt;/td&gt;
&lt;td&gt;Cosine similarity — good but shallow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Zero — one API call&lt;/td&gt;
&lt;td&gt;Requires embedding model + vector store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transparency&lt;/td&gt;
&lt;td&gt;Can inspect WHY a memory was selected&lt;/td&gt;
&lt;td&gt;Opaque similarity scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per query&lt;/td&gt;
&lt;td&gt;~250ms + 256 tokens (shared prompt cache)&lt;/td&gt;
&lt;td&gt;Embedding call + search latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Degrades past ~200 files&lt;/td&gt;
&lt;td&gt;Scales to millions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff is deliberate: for a &lt;strong&gt;session-based CLI tool&lt;/strong&gt; where users typically have 20-100 memories, Sonnet's semantic understanding beats vector search's scale. The 250ms latency is hidden entirely through &lt;strong&gt;async prefetch&lt;/strong&gt; — the search runs in parallel while the model generates its response. For the user, memory recall is "free."&lt;/p&gt;
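&lt;p&gt;The async-prefetch trick is plain Go-style concurrency: start the recall immediately, block only when the result is needed. A sketch with a stubbed search — &lt;code&gt;searchMemories&lt;/code&gt; here is a stand-in for the Sonnet call, not a real API:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// searchMemories stands in for the ~250ms side-query.
func searchMemories(query string) []string {
	time.Sleep(100 * time.Millisecond) // simulate selector latency
	return []string{"merge-freeze.md"}
}

// prefetchRecall starts the search immediately, runs generate in the
// foreground, and blocks only when the result is actually needed —
// so recall latency hides behind generation latency.
func prefetchRecall(query string, generate func()) []string {
	recalled := make(chan []string, 1)
	go func() { recalled <- searchMemories(query) }() // prefetch
	generate()        // response generation overlaps the side-query
	return <-recalled // usually already resolved: recall feels "free"
}

func main() {
	start := time.Now()
	memories := prefetchRecall("can I merge on Friday?", func() {
		time.Sleep(100 * time.Millisecond) // pretend to stream a response
	})
	fmt.Printf("recalled %v in %v (overlapped, not serialized)\n",
		memories, time.Since(start).Round(time.Millisecond))
}
```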

&lt;h3&gt;
  
  
  The 5-File Cap: Constraint as Design
&lt;/h3&gt;

&lt;p&gt;Why limit to 5 memories when a user might have 200?&lt;/p&gt;

&lt;p&gt;This is not a technical limitation. It's a &lt;strong&gt;behavioral nudge&lt;/strong&gt;. If the system scaled to inject 50 memories, users would never clean up stale ones. The 5-file cap pushes users to write better descriptions (so the right memories get selected) and consolidate outdated entries (so slots aren't wasted on stale info).&lt;/p&gt;

&lt;p&gt;Design principle: &lt;strong&gt;constraints that change user behavior beat constraints that scale infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background Extraction: The Invisible Agent
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't just save memories when you say &lt;code&gt;/remember&lt;/code&gt;. After every conversation turn where the main agent stops (no more tool calls), a &lt;strong&gt;forked background agent&lt;/strong&gt; runs to extract memorable information.&lt;/p&gt;

&lt;p&gt;Key design details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutual exclusion&lt;/strong&gt;: If the main agent already wrote a memory in this turn, the extractor skips. No duplicate memories from the same conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trailing runs&lt;/strong&gt;: If extraction is still running when the next turn ends, the new request queues as &lt;code&gt;pendingContext&lt;/code&gt;. When the current extraction finishes, it picks up the pending work. No concurrent writes to the memory directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-turn hard deadline&lt;/strong&gt;: The extractor gets at most 5 tool-call turns. Efficiency over completeness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal permissions&lt;/strong&gt;: Read/Grep/Glob unlimited. Write &lt;strong&gt;only&lt;/strong&gt; to the memory directory. Cannot modify project files, execute code, or call external services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared prompt cache&lt;/strong&gt;: The forked agent reuses the parent's cached system prompt — near-zero additional token overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The execution strategy is prescribed in the prompt: "Turn 1: parallel reads of all existing memories. Turn 2: parallel writes of new memories." Two turns for the common case. The 5-turn budget handles edge cases.&lt;/p&gt;
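&lt;p&gt;The mutual-exclusion and trailing-run rules amount to a tiny state machine. A minimal sketch, assuming &lt;code&gt;pendingContext&lt;/code&gt; is a single slot that a newer turn overwrites (the class and method names are hypothetical):&lt;/p&gt;

```typescript
// Hypothetical sketch of the background extractor's scheduling rules:
// one extraction at a time, with a single pendingContext slot.
class ExtractorQueue {
  private running = false;
  private pendingContext: string | null = null;
  readonly runs: string[] = []; // completed extractions, for illustration

  // Called when the main agent stops making tool calls for this turn.
  onTurnEnd(context: string): void {
    if (this.running) {
      this.pendingContext = context; // queue; never spawn a second worker
      return;
    }
    this.running = true;
    this.runs.push(context); // stand-in for forking the extraction agent
  }

  // Called when the in-flight extraction finishes.
  onExtractionDone(): void {
    const next = this.pendingContext;
    this.pendingContext = null;
    if (next !== null) {
      this.runs.push(next); // trailing run picks up the queued work
    } else {
      this.running = false;
    }
  }
}
```

&lt;p&gt;The invariant that matters: at no point do two extractions — and therefore two writers to the memory directory — run concurrently.&lt;/p&gt;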

&lt;h2&gt;
  
  
  Trust but Verify: The Eval That Proved It
&lt;/h2&gt;

&lt;p&gt;The most impactful section in Claude Code's memory prompt is &lt;code&gt;TRUSTING_RECALL_SECTION&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A memory that names a specific function, file, or flag is a claim that it existed &lt;em&gt;when the memory was written&lt;/em&gt;. It may have been renamed, removed, or never merged."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rule: before acting on a memory that references a file path, verify the file exists (Glob). Before trusting a function name, confirm it's still there (Grep).&lt;/p&gt;

&lt;p&gt;This section's value was proven empirically: &lt;strong&gt;without it, eval pass rate was 0/2. With it, 3/3.&lt;/strong&gt; Models default to trusting specific references in memory. They'll confidently say "as stored in memory, the auth module is at &lt;code&gt;src/auth/&lt;/code&gt;" — even when that path was renamed weeks ago. The verification requirement breaks this default behavior.&lt;/p&gt;
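&lt;p&gt;Outside the agent loop, the same rule is easy to express with plain filesystem checks standing in for Glob and Grep. A hedged sketch (the claim shape and helper are hypothetical, not the actual prompt machinery):&lt;/p&gt;

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical verify-before-trust gate: a memory naming a path or symbol
// is a claim about the past, and must be re-checked against the present.
interface MemoryClaim {
  filePath?: string;   // e.g. a path the memory asserts exists
  symbolName?: string; // e.g. a function the memory asserts is defined there
}

function verifyClaim(claim: MemoryClaim): boolean {
  if (claim.filePath !== undefined) {
    if (!existsSync(claim.filePath)) return false; // stand-in for Glob
    if (claim.symbolName !== undefined) {
      const source = readFileSync(claim.filePath, "utf8");
      return source.includes(claim.symbolName);    // stand-in for Grep
    }
  }
  return true; // a claim with no concrete references has nothing to verify
}
```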

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" alt="Three Architectures, Three Tradeoffs" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architectures, Three Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; a ranking. I'm using OpenClaw and Hermes as contrasts because they represent the two obvious alternative bets: scale and autonomy. Claude Code, OpenClaw, and Hermes Agent made different choices for different deployment models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Markdown files (flat)&lt;/td&gt;
&lt;td&gt;MD + SQLite (FTS + vector)&lt;/td&gt;
&lt;td&gt;SQLite + FTS + MEMORY.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet side-query (semantic)&lt;/td&gt;
&lt;td&gt;Embedding cosine + FTS fusion&lt;/td&gt;
&lt;td&gt;Full-text search + structured queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero (filesystem only)&lt;/td&gt;
&lt;td&gt;SQLite + embedding model&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full (plain text, human-readable)&lt;/td&gt;
&lt;td&gt;Partial (vector scores opaque)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (static after write)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Self-evolving (auto-generates skills)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Session-based, stateless between sessions&lt;/td&gt;
&lt;td&gt;Persistent, cross-session&lt;/td&gt;
&lt;td&gt;Persistent, self-improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200 files by design&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Claude Code's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Optimize for zero infrastructure and full transparency. Accept a scale ceiling.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For a CLI tool that runs on a developer's laptop, requiring SQLite or an embedding service is friction. Plain Markdown files are human-readable, git-trackable, and editable with any text editor. The 200-file ceiling is intentional — if you need more, you should be consolidating, not scaling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Teams with hundreds of shared memories. Long-running projects where memory accumulation outpaces cleanup. Multi-user scenarios where memory needs to be queried across team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept infrastructure overhead for persistent cross-session scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw stores memories in SQLite with both full-text search and vector embeddings. This enables fuzzy semantic matching across thousands of memories, weighted fusion of multiple retrieval signals, and persistent state that survives across sessions indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Setup complexity. Users must configure embedding models. Vector similarity scores are opaque — when the wrong memory is recalled, debugging why is harder than inspecting a Sonnet side-query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes Agent's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept complexity for a self-evolving learning loop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hermes doesn't just store memories — it generates &lt;strong&gt;skills&lt;/strong&gt; from completed tasks. After a complex task (5+ tool calls), the agent distills the entire process into a structured skill document. Next time it encounters a similar task, it loads the skill instead of solving from scratch. Skills self-iterate: if the agent finds a better approach during execution, it updates the skill automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Skill quality is unverified. A bad skill propagated through the learning loop compounds errors. The self-evolving mechanism needs guardrails that don't exist yet — there's no eval framework for auto-generated skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Right Choice Depends on Your Deployment Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session-based, single user, zero setup → Claude Code's approach
Persistent, multi-user, cross-session  → OpenClaw's approach  
Autonomous, self-improving, research    → Hermes's approach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no universal "best." The first-principles question is: &lt;strong&gt;what are you optimizing for — simplicity, scale, or autonomy?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Teaches About Agent Design
&lt;/h2&gt;

&lt;p&gt;Three principles that transfer beyond memory systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constraints that change user behavior &amp;gt; constraints that scale infrastructure.&lt;/strong&gt; The 5-file cap is more effective than unlimited vector search, because it forces better memory hygiene. Don't build capacity for a mess — design incentives for cleanliness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eval data beats intuition for prompt engineering.&lt;/strong&gt; The trust-verification section wasn't added because someone thought it was a good idea. It was added because evals went from 0/2 to 3/3. If you can't measure it, you're guessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use the model's own reasoning for retrieval when latency allows.&lt;/strong&gt; Sonnet's recognition that "deployment" relates to "CI/CD" is something no keyword match or embedding similarity can reliably replicate. When your retrieval budget allows a model call, the quality ceiling is higher than any static index.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See also: &lt;a href="https://harrisonsec.com/blog/claude-code-codex-plugin-two-brains/" rel="noopener noreferrer"&gt;Claude Code + Codex: Two Brains&lt;/a&gt; for how dual-AI workflows complement the memory system.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>memory</category>
      <category>agents</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 3: The 5-Level Compression Pipeline Behind 1M Tokens</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:22 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory Tradeoffs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Context Engineering Is the Real Moat
&lt;/h2&gt;

&lt;p&gt;Every AI agent has the same fundamental constraint: a fixed-size context window. Claude's is now up to 1M tokens. That sounds massive — until you realize a real coding session can easily generate multiples of that. Dozens of file reads, hundreds of tool calls, thousands of lines of output.&lt;/p&gt;

&lt;p&gt;The model's decision quality depends entirely on what it sees. Get that selection wrong, and it forgets which files it just edited, re-reads content it already saw, or contradicts its own earlier decisions.&lt;/p&gt;

&lt;p&gt;Think of the context window as an office desk. Limited surface area. You need the most important documents within arm's reach, everything else filed in drawers — retrievable, but not cluttering your workspace.&lt;/p&gt;

&lt;p&gt;Claude Code's context engineering is that filing system. And it's far more sophisticated than most people expect. In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we covered the 4-stage compression overview as part of the loop's survival mechanism. Here, we zoom into the internal engineering — revealing a 5th level most sessions never trigger, a dual-path algorithm that adapts to cache state, and a security blind spot in the summarizer.&lt;/p&gt;

&lt;p&gt;The compression pipeline alone lives in &lt;code&gt;src/services/compact/&lt;/code&gt; — over 3,960 lines of TypeScript across 5 files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-Level Compression Pipeline
&lt;/h2&gt;

&lt;p&gt;The design philosophy is &lt;strong&gt;progressive compression&lt;/strong&gt;: cheapest first, heaviest last. Each level is more expensive than the previous one — consuming more compute or discarding more context detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" alt="The 5-Level Compression Pipeline" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" alt="flowchart TD" width="276" height="950"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most conversations never reach Level 5. That's the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Tool Result Budget (Zero Cost)
&lt;/h3&gt;

&lt;p&gt;Problem: A single &lt;code&gt;FileReadTool&lt;/code&gt; call on a 10,000-line file dumps the entire thing into context. A &lt;code&gt;BashTool&lt;/code&gt; running &lt;code&gt;find&lt;/code&gt; returns thousands of paths.&lt;/p&gt;

&lt;p&gt;Solution: When a tool result exceeds 50,000 characters (&lt;code&gt;DEFAULT_MAX_RESULT_SIZE_CHARS&lt;/code&gt;), Claude Code doesn't truncate it — it &lt;strong&gt;persists the full output to disk&lt;/strong&gt; and keeps only a 2KB preview in context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;persisted-output&amp;gt;&lt;/span&gt;
Output too large (2.3 MB). Full output saved to:
/tmp/.claude/session-xxx/tool-results/toolu_abc123.txt

Preview (first 2.0 KB):
[first 2000 bytes of content]
...
&lt;span class="nt"&gt;&amp;lt;/persisted-output&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why persist instead of truncate? Truncation means permanent loss. If the model later needs line 500 of that output — maybe that's where the bug is — it can use the &lt;code&gt;Read&lt;/code&gt; tool to access the full file from disk. The 2KB preview gives enough context to decide whether that's necessary.&lt;/p&gt;
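&lt;p&gt;The pattern is easy to replicate in your own agent. A minimal sketch of persist-don't-truncate (paths and wrapper text are illustrative; only the 50K/2KB thresholds come from the description above):&lt;/p&gt;

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Hypothetical sketch: oversized tool output is written to disk whole,
// and only a short preview plus the file path stays in context.
const MAX_RESULT_CHARS = 50_000; // cf. DEFAULT_MAX_RESULT_SIZE_CHARS
const PREVIEW_CHARS = 2_000;

function budgetToolResult(toolUseId: string, output: string): string {
  if (output.length <= MAX_RESULT_CHARS) return output; // small: pass through
  const dir = join(tmpdir(), "tool-results");
  mkdirSync(dir, { recursive: true });
  const file = join(dir, toolUseId + ".txt");
  writeFileSync(file, output); // the full output survives, retrievable later
  return [
    "<persisted-output>",
    "Output too large (" + output.length + " chars). Full output saved to:",
    file,
    "",
    "Preview (first 2.0 KB):",
    output.slice(0, PREVIEW_CHARS),
    "...",
    "</persisted-output>",
  ].join("\n");
}
```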

&lt;h3&gt;
  
  
  Level 2 — History Snip
&lt;/h3&gt;

&lt;p&gt;Think of History Snip as garbage collection for stale conversation scaffolding. If the session contains repetitive assistant wrappers, redundant bookkeeping, or older spans that no longer affect the next decision, this layer can cut them before heavier compression starts.&lt;/p&gt;

&lt;p&gt;Its real importance is accounting correctness. It feeds &lt;code&gt;snipTokensFreed&lt;/code&gt; into the autocompact threshold calculation. Without that correction, the last assistant message's &lt;code&gt;usage&lt;/code&gt; data still reflects the pre-snip context size, so autocompact can fire even after tokens were already freed.&lt;/p&gt;
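&lt;p&gt;The correction is a single subtraction, but skipping it produces false triggers. A hedged sketch (the threshold constant is illustrative, taken from the ~87% autocompact figure later in this article):&lt;/p&gt;

```typescript
// Hypothetical sketch of snip-aware threshold accounting: the last usage
// report predates the snip, so freed tokens must be subtracted first.
const AUTOCOMPACT_THRESHOLD = 0.87; // illustrative ~87% trigger point

function shouldAutocompact(
  lastReportedTokens: number, // from the last assistant message's usage
  snipTokensFreed: number,    // tokens released by History Snip since then
  contextLimit: number,
): boolean {
  const effectiveTokens = lastReportedTokens - snipTokensFreed;
  return effectiveTokens / contextLimit >= AUTOCOMPACT_THRESHOLD;
}
```

&lt;p&gt;Without the subtraction, a session at 90% that just snipped 15% of its tokens would still autocompact, destroying context for nothing.&lt;/p&gt;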

&lt;h3&gt;
  
  
  Level 3 — Microcompact (The Dual-Path Design)
&lt;/h3&gt;

&lt;p&gt;This is where it gets clever. Microcompact cleans up old tool results that are no longer useful — that file you read 30 minutes ago is probably irrelevant now, but it's still eating thousands of tokens.&lt;/p&gt;

&lt;p&gt;The twist: &lt;strong&gt;Microcompact has two completely different code paths&lt;/strong&gt;, selected based on cache state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path A — Cache Cold (Time-Based)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user was away long enough for the prompt cache to expire (default 5-minute TTL), the cache is already dead. Rebuilding is inevitable. So Microcompact goes ahead and &lt;strong&gt;directly modifies message content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// microCompact.ts — cold path&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[Old tool result content cleared]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, brutal, effective: keep only the N most recent compactable tool results and replace everything else with a placeholder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B — Cache Hot (Cache-Editing)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user is actively chatting and the prompt cache is warm — holding 100K+ tokens of cached prefix — directly modifying messages would &lt;strong&gt;invalidate the entire cache&lt;/strong&gt;. That's a massive cost hit.&lt;/p&gt;

&lt;p&gt;Instead, the hot path uses an API-level mechanism called &lt;code&gt;cache_edits&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag tool result blocks with &lt;code&gt;cache_reference: tool_use_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Construct &lt;code&gt;cache_edits&lt;/code&gt; blocks telling the server to delete those references in-place&lt;/li&gt;
&lt;li&gt;Server-side deletion preserves cache warmth — no client re-upload needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The messages themselves are returned &lt;strong&gt;unchanged&lt;/strong&gt;. The edit happens at the API layer, invisible to the local conversation state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Time-Based (Cold)&lt;/th&gt;
&lt;th&gt;Cache-Edit (Hot)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time gap exceeds threshold&lt;/td&gt;
&lt;td&gt;Tool count exceeds threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct message modification&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cache_edits&lt;/code&gt; API blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache rebuilds anyway&lt;/td&gt;
&lt;td&gt;Preserves 100K+ cached prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero (edits piggyback on next request)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two paths are mutually exclusive. Time-based takes priority — if the cache is already cold, using &lt;code&gt;cache_edits&lt;/code&gt; is pointless.&lt;/p&gt;
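&lt;p&gt;The selection logic reduces to an ordered check. A minimal sketch (names and the threshold parameter are hypothetical; only the 5-minute TTL and the check order come from the description above):&lt;/p&gt;

```typescript
// Hypothetical sketch of Microcompact's mutually exclusive path selection.
// Cold is checked first: cache_edits buys nothing once the cache expired.
const CACHE_TTL_MS = 5 * 60 * 1000; // default 5-minute prompt-cache TTL

type MicrocompactPath = "time-based" | "cache-edit" | "none";

function choosePath(
  msSinceLastRequest: number,
  compactableToolResults: number,
  toolCountThreshold: number,
): MicrocompactPath {
  if (msSinceLastRequest >= CACHE_TTL_MS) return "time-based"; // cache cold
  if (compactableToolResults >= toolCountThreshold) return "cache-edit";
  return "none";
}
```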

&lt;h3&gt;
  
  
  Level 4 — Context Collapse (Non-Destructive)
&lt;/h3&gt;

&lt;p&gt;Think of this as a database &lt;strong&gt;View&lt;/strong&gt; — the underlying table (message array) stays unchanged, but queries (API requests) see a filtered, summarized projection.&lt;/p&gt;

&lt;p&gt;Context Collapse triggers at ~90% utilization. Unlike autocompact, it's &lt;strong&gt;reversible&lt;/strong&gt; — original messages are never deleted, and the collapse can be rolled back if needed. The summaries live in a separate collapse store, and &lt;code&gt;projectView()&lt;/code&gt; overlays them onto the original messages at query time.&lt;/p&gt;

&lt;p&gt;Critical interaction: when Context Collapse is active, &lt;strong&gt;Autocompact is suppressed&lt;/strong&gt;. Both compete for the same token space — autocompact at ~87%, collapse at ~90% — and autocompact would destroy the fine-grained context that collapse is trying to preserve.&lt;/p&gt;
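&lt;p&gt;The View analogy maps directly onto code. A hedged sketch of projection-based folding (the real &lt;code&gt;projectView()&lt;/code&gt; signature isn't public; this only illustrates the overlay idea):&lt;/p&gt;

```typescript
// Hypothetical sketch: originals are never mutated. The collapse store is
// overlaid at query time, so rollback is simply "stop overlaying".
interface Message {
  id: string;
  content: string;
}

function projectView(
  messages: Message[],
  collapseStore: Map<string, string>, // message id -> collapsed summary
): Message[] {
  return messages.map((m) => {
    const summary = collapseStore.get(m.id);
    return summary === undefined ? m : { ...m, content: summary };
  });
}
```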

&lt;h3&gt;
  
  
  Level 5 — Autocompact (The Last Resort)
&lt;/h3&gt;

&lt;p&gt;When everything else fails to keep tokens under control, the system forks a child agent to summarize the entire conversation. This is expensive and irreversible.&lt;/p&gt;

&lt;p&gt;The compression prompt uses a two-phase &lt;strong&gt;Chain-of-Thought Scratchpad&lt;/strong&gt; technique:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — the model walks through every message chronologically: user intent, approaches taken, key decisions, filenames, code snippets, errors, fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — a structured summary with 9 standardized sections (Primary Request, Key Technical Concepts, Files and Code, Errors and Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work, Optional Next Step)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical design: &lt;code&gt;formatCompactSummary()&lt;/code&gt; &lt;strong&gt;strips the &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; and keeps only the &lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt;. Chain-of-thought reasoning improves summary quality dramatically, but the reasoning itself would waste tokens if kept in context. Discard the work, keep the conclusion.&lt;/p&gt;
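&lt;p&gt;"Discard the work, keep the conclusion" is a two-line transform. A hedged sketch (the real &lt;code&gt;formatCompactSummary()&lt;/code&gt; is more involved; this shows only the core idea):&lt;/p&gt;

```typescript
// Hypothetical sketch: drop the chain-of-thought analysis block and keep
// only the structured summary that follows it.
function stripAnalysisBlock(raw: string): string {
  const match = raw.match(/<summary>([\s\S]*?)<\/summary>/);
  return match === null ? raw.trim() : match[1].trim();
}
```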

&lt;p&gt;&lt;strong&gt;Post-Compression Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autocompact's biggest risk: the model "forgets" files it just edited. The system automatically runs &lt;code&gt;runPostCompactCleanup()&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restore last 5 recently-read files (≤5K tokens each)&lt;/li&gt;
&lt;li&gt;Restore all activated skills (≤25K tokens total)&lt;/li&gt;
&lt;li&gt;Re-announce deferred tools, agent lists, MCP directives&lt;/li&gt;
&lt;li&gt;Reset Context Collapse state&lt;/li&gt;
&lt;li&gt;Restore Plan mode state if active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this recovery step, the model would start re-reading files it just edited — or worse, make contradictory changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Circuit Breaker Story&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On March 10, 2026, Anthropic's telemetry showed 1,279 sessions with 50+ consecutive autocompact failures. The worst session hit 3,272 consecutive failures. Globally, this wasted approximately 250,000 API calls per day.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we mentioned the circuit breaker as a single boolean (&lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt;). Here's the production story behind it.&lt;/p&gt;

&lt;p&gt;The fix came down to a single constant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 3 consecutive failures, stop trying. The context is irrecoverably over-limit — burning more API calls won't help. This is a textbook circuit breaker: detect a failure loop, break it early, fail gracefully.&lt;/p&gt;
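&lt;p&gt;The full pattern is barely longer than the constant. A minimal sketch of the breaker (class and method names hypothetical):&lt;/p&gt;

```typescript
// Hypothetical sketch of the autocompact circuit breaker: count consecutive
// failures, open after the limit, reset on any success.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

class CompactBreaker {
  private consecutiveFailures = 0;

  shouldAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  record(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }
}
```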

&lt;p&gt;Three adjacent systems make this pipeline viable in production: accurate token estimation, prompt-cache boundaries, and the summarizer's security assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Estimation Without API Calls
&lt;/h2&gt;

&lt;p&gt;Most agents estimate context size by counting tokens on the client. This typically has 30%+ error — enough to trigger compression too early or too late.&lt;/p&gt;

&lt;p&gt;Claude Code uses a smarter approach. Think of it as a morning weigh-in: you step on the scale at 75kg, then eat lunch. You don't need the scale again — estimating 75.5kg is good enough.&lt;/p&gt;

&lt;p&gt;The "scale" is the &lt;code&gt;usage&lt;/code&gt; data returned by every API response — server-side precise token counts. The "lunch" is the few messages added since then.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;tokenCountWithEstimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Find the most recent message with server-reported usage&lt;/span&gt;
  &lt;span class="c1"&gt;// Use that as the anchor point&lt;/span&gt;
  &lt;span class="c1"&gt;// Estimate only the delta (new messages since anchor)&lt;/span&gt;
  &lt;span class="c1"&gt;// Result: &amp;lt;5% error vs 30%+ from pure client estimation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the need for tokenizer API calls while maintaining accuracy that's good enough for compression timing decisions.&lt;/p&gt;
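&lt;p&gt;The anchor-plus-delta idea in runnable form (the chars/4 heuristic and the message shape are assumptions for illustration, not the actual implementation):&lt;/p&gt;

```typescript
// Hypothetical sketch of anchor-plus-delta token estimation: trust the last
// server-reported usage and only estimate the messages added since then.
interface TrackedMessage {
  text: string;
  reportedTokens?: number; // present if the server returned usage for it
}

function estimateContextTokens(messages: TrackedMessage[]): number {
  let anchorIndex = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].reportedTokens !== undefined) {
      anchorIndex = i; // most recent precise count wins
      break;
    }
  }
  const anchor = anchorIndex >= 0 ? messages[anchorIndex].reportedTokens ?? 0 : 0;
  let delta = 0;
  for (let i = anchorIndex + 1; i < messages.length; i++) {
    delta += Math.ceil(messages[i].text.length / 4); // rough chars/4 heuristic
  }
  return anchor + delta;
}
```

&lt;p&gt;The estimation error is confined to the short un-reported tail, which is why the approach stays accurate without ever calling a tokenizer.&lt;/p&gt;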

&lt;h2&gt;
  
  
  The Prompt Cache Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's system prompt can be 50-100K tokens. Without caching, every API call would re-process this from scratch.&lt;/p&gt;

&lt;p&gt;The key innovation: &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; — a sentinel string that splits the system prompt into static and dynamic halves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before the boundary&lt;/strong&gt;: core instructions, tool descriptions, security rules — identical for ALL users globally → cached with &lt;code&gt;scope: 'global'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After the boundary&lt;/strong&gt;: MCP tool instructions, output preferences, language settings — varies per user → not cached globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means millions of Claude Code users &lt;strong&gt;share the same cached system prompt prefix&lt;/strong&gt;. One cache hit saves compute for everyone. But change one byte before the boundary, and the global cache breaks for all users.&lt;/p&gt;

&lt;p&gt;To protect this, Claude Code implements &lt;strong&gt;sticky-on latching&lt;/strong&gt; for beta headers: once a header is sent in a session, it persists for all subsequent requests — even if the feature flag is turned off mid-session. Flexibility sacrificed for cache stability.&lt;/p&gt;
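&lt;p&gt;The boundary mechanism itself is a string split. A hedged sketch (the sentinel value here is a placeholder; the real one isn't public):&lt;/p&gt;

```typescript
// Hypothetical sketch of the static/dynamic system-prompt split.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "---DYNAMIC-BOUNDARY---"; // placeholder value

function splitSystemPrompt(prompt: string) {
  const i = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (i < 0) return { staticPart: prompt, dynamicPart: "" };
  return {
    // Identical for all users, so cacheable with scope: 'global'.
    staticPart: prompt.slice(0, i),
    // Per-user: MCP instructions, output preferences, language settings.
    dynamicPart: prompt.slice(i + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```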

&lt;h2&gt;
  
  
  The Security Blind Spot
&lt;/h2&gt;

&lt;p&gt;Here's something the compression pipeline gets wrong: &lt;strong&gt;it treats all content equally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The autocompact summarizer processes user instructions and tool results through the same pipeline. If an attacker plants malicious instructions inside a project file — and the model reads that file — those instructions survive compression. They become part of the summary, indistinguishable from legitimate context.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; scratchpad that makes summaries so good also faithfully preserves injected instructions. There's no classification step that distinguishes "user said this" from "this was in a file the model read."&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;truncateHeadForPTLRetry()&lt;/code&gt; reveals another edge: when the conversation is so long that the compression request itself triggers a Prompt-Too-Long error, the system recursively drops the oldest turns to make the compression fit. An attacker could craft inputs that survive this truncation — instructions placed strategically in the middle of conversations, not at the edges.&lt;/p&gt;
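&lt;p&gt;The retry mechanism is worth seeing concretely, because it explains why middle placement survives. A hedged sketch of head truncation (the real &lt;code&gt;truncateHeadForPTLRetry()&lt;/code&gt; signature isn't public):&lt;/p&gt;

```typescript
// Hypothetical sketch: on a Prompt-Too-Long error, drop the oldest turns
// until the compression request fits. Content in the middle or tail of the
// conversation survives every iteration of this loop.
function truncateHead(messages: string[], fits: (m: string[]) => boolean): string[] {
  let kept = messages;
  while (kept.length > 1 && !fits(kept)) {
    kept = kept.slice(1); // always sacrifice the head, never the middle
  }
  return kept;
}
```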

&lt;h2&gt;
  
  
  Three Designs Worth Stealing
&lt;/h2&gt;

&lt;p&gt;If you're building your own agent, these patterns transfer directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Progressive compression (cheapest first)&lt;/strong&gt; — Don't jump to expensive summarization. Try zero-cost approaches first. Most sessions will never need the heavy option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache-aware dual paths&lt;/strong&gt; — Let infrastructure state drive algorithm selection. When cache is cold, optimize for simplicity. When cache is hot, optimize for preservation. Same goal, different strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit breakers on automated recovery&lt;/strong&gt; — Never let a fix become a new failure mode. If compression fails 3 times, it will fail a 4th time. Stop. The 250K API calls wasted per day before this fix landed are a cautionary tale for any self-healing system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory — First-Principles Tradeoffs in Agent Persistence&lt;/a&gt; — why Anthropic chose Markdown files over vector databases, and when that's the wrong call.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>contextengineering</category>
      <category>agents</category>
      <category>compression</category>
    </item>
    <item>
      <title>Claude Code + Codex Plugin: Two AI Brains, One Terminal</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:47:24 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</guid>
      <description>&lt;p&gt;You're debugging a gnarly race condition. Claude Code has been going at it for 10 minutes — reading files, forming theories, running tests. Then it hits a wall. Same hypothesis, same failed fix, third attempt.&lt;/p&gt;

&lt;p&gt;What if you could call in a second brain — a completely different model with fresh eyes — without leaving your terminal?&lt;/p&gt;

&lt;p&gt;That's what the &lt;strong&gt;Codex plugin for Claude Code&lt;/strong&gt; does. It puts OpenAI's Codex (powered by GPT-5.4) inside your Claude Code session as a callable rescue agent. Two models. Two reasoning styles. One shared codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is It, Exactly?
&lt;/h2&gt;

&lt;p&gt;The Codex plugin is a &lt;strong&gt;Claude Code plugin&lt;/strong&gt; — not a standalone tool. It lives inside your Claude Code session and gives you slash commands to dispatch tasks to OpenAI's Codex CLI.&lt;/p&gt;

&lt;p&gt;Think of it as a second engineer sitting next to you. Claude (Opus) is your primary — it has the full conversation context, knows your project, runs your tools. Codex is your specialist — you hand it a focused task, it works in a sandboxed environment, and returns results.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;they don't compete. They complement.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude sees the big picture. It orchestrates, reads files, runs tools, manages state.&lt;/li&gt;
&lt;li&gt;Codex gets a sharp, scoped task. It reasons deeply on that one problem and comes back with an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup: 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Install the Codex CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Authenticate
&lt;/h3&gt;

&lt;p&gt;Inside Claude Code, type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!codex login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a browser for OpenAI authentication. Once done, your token is stored locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Verify
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will check that the Codex CLI is installed, authenticated, and ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" alt="Codex setup — ready, authenticated, review gate available" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Commands
&lt;/h2&gt;

&lt;p&gt;The plugin adds 7 slash commands to Claude Code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check installation and auth status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:rescue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hand a task to Codex (the main one you'll use)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run a Codex code review on your local git changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:adversarial-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, but Codex actively challenges your design choices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check running/recent Codex jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Get the output of a finished background job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:cancel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kill an active background Codex job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Rescue Workflow: When Claude Gets Stuck
&lt;/h2&gt;

&lt;p&gt;This is where the plugin shines. Claude Code will &lt;strong&gt;proactively&lt;/strong&gt; spawn the Codex rescue agent when it detects it's stuck — same hypothesis loop, repeated failures, or a task that needs a second implementation pass.&lt;/p&gt;

&lt;p&gt;You can also trigger it manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue fix the race condition in src/worker.ts — tests pass locally but fail in CI under parallel execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude takes your request and shapes it into a structured prompt optimized for GPT-5.4&lt;/li&gt;
&lt;li&gt;The plugin invokes &lt;code&gt;codex-companion.mjs task&lt;/code&gt; with that prompt&lt;/li&gt;
&lt;li&gt;Codex works in the shared repository — reading files, reasoning, writing code&lt;/li&gt;
&lt;li&gt;Results come back into your Claude Code session&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" alt="Codex rescue in action — dispatching task to GPT-5.4 via codex-companion" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Foreground vs Background
&lt;/h3&gt;

&lt;p&gt;Small, focused rescues run in the foreground — you wait and get the result immediately.&lt;/p&gt;

&lt;p&gt;Big, multi-step investigations can run in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue --background investigate why the build is 3x slower since the last merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on it later with &lt;code&gt;/codex:status&lt;/code&gt; and grab results with &lt;code&gt;/codex:result&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Review: A Second Opinion That Actually Pushes Back
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends your local git diff to Codex for review. It checks against your working tree or branch changes.&lt;/p&gt;

&lt;p&gt;But the real power is the adversarial review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:adversarial-review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't "looks good to me." Codex will actively challenge your implementation approach, question design decisions, and flag things a polite reviewer wouldn't mention. It's the code review you &lt;em&gt;need&lt;/em&gt;, not the one you &lt;em&gt;want&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" alt="Codex review — checking git working tree for code review" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which Brain
&lt;/h2&gt;

&lt;p&gt;After a month of daily use, here's my mental model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Let Claude (Opus) Handle:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; — multi-file changes, refactors across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-heavy tasks&lt;/strong&gt; — "fix this bug" when you've been discussing it for 20 messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-heavy workflows&lt;/strong&gt; — file reads, grep, test runs, build commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation continuity&lt;/strong&gt; — anything that builds on prior context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Call in Codex (GPT-5.4) For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fresh eyes&lt;/strong&gt; — when Claude is circling the same hypothesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep single-problem reasoning&lt;/strong&gt; — "why does this specific test fail under these exact conditions"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial review&lt;/strong&gt; — challenge assumptions Claude might share with you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel investigation&lt;/strong&gt; — background a research task while Claude keeps working&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Pattern That Works Best
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Claude does the initial investigation — reads files, forms a theory&lt;/li&gt;
&lt;li&gt;If the theory doesn't pan out in 2-3 attempts, &lt;strong&gt;rescue to Codex&lt;/strong&gt; with the full context of what was tried&lt;/li&gt;
&lt;li&gt;Codex returns a diagnosis or fix&lt;/li&gt;
&lt;li&gt;Claude applies it in context, runs tests, iterates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two models. Two reasoning paths. Converging on the same answer faster than either alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Prompt Shaping
&lt;/h2&gt;

&lt;p&gt;The plugin includes a &lt;code&gt;gpt-5-4-prompting&lt;/code&gt; skill that automatically structures your rescue requests into Codex-optimized prompts using XML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;task&amp;gt;&lt;/code&gt; — the concrete job&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;verification_loop&amp;gt;&lt;/code&gt; — how to confirm the fix works&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;grounding_rules&amp;gt;&lt;/code&gt; — stay anchored to evidence, not guesses&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;action_safety&amp;gt;&lt;/code&gt; — don't refactor unrelated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to write these yourself. Claude does it automatically when it hands off to Codex. But knowing they exist explains why Codex rescue results are usually sharper than raw Codex CLI usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: The Review Gate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup --enable-review-gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, every &lt;code&gt;git commit&lt;/code&gt; in the repo triggers an automatic Codex review before the commit completes. It's a pre-commit hook powered by a second AI brain.&lt;/p&gt;

&lt;p&gt;This is aggressive — I only enable it on critical branches or before releases. But when you want zero-trust code quality, it's unmatched.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The Codex plugin doesn't replace Claude Code. It makes Claude Code &lt;strong&gt;anti-fragile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every AI agent has blind spots — reasoning loops it can't escape, patterns it over-fits to, assumptions it shares with its user. A second model with a different training distribution breaks those loops.&lt;/p&gt;

&lt;p&gt;The dual-brain setup isn't about which model is "better." It's about &lt;strong&gt;coverage&lt;/strong&gt;. Two independent reasoning paths catch more bugs than one brilliant path run twice.&lt;/p&gt;

&lt;p&gt;If you're using Claude Code daily, install the Codex plugin. It's 3 minutes of setup and it will save you hours of "why is Claude stuck on this?"&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Claude Code Architecture Deep Dive&lt;/a&gt; series. Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;The 1,421-Line While Loop That Runs Everything&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codex</category>
      <category>agents</category>
      <category>openai</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:24:18 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; covered the surface-level discoveries. Now we go deeper.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Heart of Claude Code
&lt;/h2&gt;

&lt;p&gt;Every AI coding agent — Claude Code, Cursor, Copilot — runs some version of the same loop: send context to an LLM, get back text and tool calls, execute tools, feed results back, repeat. We called this &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;LLM talks, program walks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But Claude Code's implementation of this loop is anything but simple. It lives in &lt;code&gt;query.ts&lt;/code&gt;, a 1,729-line async generator. The &lt;code&gt;while(true)&lt;/code&gt; starts at line 307 and ends at line 1728 — a single loop body spanning 1,421 lines of production code.&lt;/p&gt;

&lt;p&gt;This is not a toy. This is the engine that processes every keystroke, every tool call, every error recovery, every context compression decision for millions of users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — line 307&lt;/span&gt;
&lt;span class="c1"&gt;// eslint-disable-next-line no-constant-condition&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toolUseContext&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="c1"&gt;// ... 1,421 lines of state machine logic ...&lt;/span&gt;
    &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// while (true)  — line 1728&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why a State Machine, Not Recursion
&lt;/h2&gt;

&lt;p&gt;Early versions of Claude Code used recursion — the query function called itself. But recursion has a fatal flaw: in long conversations with hundreds of tool calls, the call stack grows until it explodes.&lt;/p&gt;

&lt;p&gt;The current design uses &lt;code&gt;while(true)&lt;/code&gt; with a &lt;code&gt;state&lt;/code&gt; object that carries context between iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — lines 207-215 (State type, partial)&lt;/span&gt;
&lt;span class="nx"&gt;autoCompactTracking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AutoCompactTrackingState&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;maxOutputTokensRecoveryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;hasAttemptedReactiveCompact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;       &lt;span class="c1"&gt;// circuit breaker for 413 recovery&lt;/span&gt;
&lt;span class="nx"&gt;stopHookActive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;turnCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="c1"&gt;// why we continued&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;continue&lt;/code&gt; statement is a state transition. There are &lt;strong&gt;9 distinct &lt;code&gt;continue&lt;/code&gt; points&lt;/strong&gt; in the code (among them lines 950, 1115, 1165, 1220, 1251, 1305, 1316, and 1340), each representing a different reason to run another turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next tool call needed&lt;/li&gt;
&lt;li&gt;Reactive compact triggered after 413&lt;/li&gt;
&lt;li&gt;Max output tokens recovery&lt;/li&gt;
&lt;li&gt;Stop hook interrupted&lt;/li&gt;
&lt;li&gt;Token budget continuation&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;
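&lt;p&gt;The continue-as-state-transition pattern can be sketched in miniature; the type and function names below are stand-ins, not the real &lt;code&gt;query.ts&lt;/code&gt; identifiers:&lt;/p&gt;

```typescript
// Illustrative sketch of the continue-as-state-transition pattern.
// Names are stand-ins, not the real query.ts identifiers.
type LoopState = {
  turnCount: number;
  hasAttemptedReactiveCompact: boolean; // circuit breaker for 413 recovery
  transition: { reason: string } | undefined; // why we continued
};

type TurnResult = "tool_call" | "prompt_too_long" | "done";

// Decide whether to run another turn; null means fall out of the loop.
function nextState(state: LoopState, result: TurnResult): LoopState | null {
  if (result === "tool_call") {
    return { ...state, turnCount: state.turnCount + 1,
             transition: { reason: "next tool call needed" } };
  }
  if (result === "prompt_too_long") {
    // Only one reactive compact attempt per turn, or a genuinely
    // oversized conversation would loop forever.
    if (state.hasAttemptedReactiveCompact) return null;
    return { ...state, hasAttemptedReactiveCompact: true,
             transition: { reason: "reactive compact after 413" } };
  }
  return null;
}

// The while(true) shape: state carries context between iterations
// instead of growing the call stack the way recursion would.
function runLoop(results: TurnResult[]): string[] {
  let state: LoopState = {
    turnCount: 0, hasAttemptedReactiveCompact: false, transition: undefined,
  };
  const reasons: string[] = [];
  for (const result of results) {
    const next = nextState(state, result);
    if (next === null) break;
    state = next;
    if (state.transition) reasons.push(state.transition.reason);
  }
  return reasons;
}
```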

&lt;h2&gt;
  
  
  The Loop at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" alt="flowchart TD" width="342" height="1198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  10 Steps Per Iteration
&lt;/h2&gt;

&lt;p&gt;Each time the loop runs, it does these 10 things in order. Every step has real source code behind it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Context Compression (4 stages)
&lt;/h3&gt;

&lt;p&gt;Before calling the API, the system tries to fit everything into the context window. Four compression mechanisms fire in priority order (imports at lines 12-16, 115-116):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snip Compact&lt;/strong&gt; — trims overly long individual messages in history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micro Compact&lt;/strong&gt; — finer-grained editing based on &lt;code&gt;tool_use_id&lt;/code&gt;, cache-friendly (line 370: "microcompact operates purely by tool_use_id")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Collapse&lt;/strong&gt; — folds inactive context regions into summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Compact&lt;/strong&gt; — when total tokens approach the threshold, triggers full compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are not mutually exclusive — they run in priority order:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" alt="flowchart LR" width="1552" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system tries lightweight options first. If snip + micro bring tokens under the limit, the heavy compressors never run.&lt;/p&gt;
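&lt;p&gt;The priority ordering is easy to express as code. A minimal sketch, with invented reduction ratios (the real stages edit message content, not raw token counts):&lt;/p&gt;

```typescript
// Sketch of the lightweight-first compression pipeline. Stage names
// mirror the article; the reduction ratios are invented for illustration.
type Stage = { name: string; run: (tokens: number) => number };

const stages: Stage[] = [
  { name: "snip",     run: t => Math.floor(t * 0.9) }, // trim long messages
  { name: "micro",    run: t => Math.floor(t * 0.8) }, // tool_use_id edits
  { name: "collapse", run: t => Math.floor(t * 0.5) }, // fold inactive regions
  { name: "auto",     run: t => Math.floor(t * 0.2) }, // full compression
];

// Each stage runs only while we are still over the limit, so if snip and
// micro get under the threshold the heavy compressors never fire.
function compress(tokens: number, limit: number): { tokens: number; ran: string[] } {
  const ran: string[] = [];
  for (const stage of stages) {
    if (tokens > limit) {
      tokens = stage.run(tokens);
      ran.push(stage.name);
    }
  }
  return { tokens, ran };
}
```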

&lt;h3&gt;
  
  
  Step 2: Token Budget Check
&lt;/h3&gt;

&lt;p&gt;If a token budget is active (&lt;code&gt;feature('TOKEN_BUDGET')&lt;/code&gt;, line 280), the system checks whether to continue. Users can specify targets like "+500k", and the system tracks cumulative output tokens per turn, injecting nudge messages near the goal to keep the model working.&lt;/p&gt;
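&lt;p&gt;A rough sketch of what that check might look like. The &lt;code&gt;+500k&lt;/code&gt; target format comes from the source; the parsing rules and the nudge threshold are assumptions:&lt;/p&gt;

```typescript
// Hypothetical sketch of the token budget check. The "+500k" target
// format is from the article; parsing rules and the 90% nudge threshold
// are invented for illustration.
function parseBudget(spec: string): number {
  const m = spec.match(/^\+(\d+)([km])$/i);
  if (m === null) throw new Error(`unrecognized budget spec: ${spec}`);
  const unit = m[2].toLowerCase() === "k" ? 1_000 : 1_000_000;
  return Number(m[1]) * unit;
}

// Called once before the model runs and once after: near the goal we
// inject a "keep going" nudge; past it we stop.
function budgetAction(usedTokens: number, target: number): "continue" | "nudge" | "stop" {
  if (usedTokens >= target) return "stop";
  if (usedTokens >= target * 0.9) return "nudge";
  return "continue";
}
```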

&lt;h3&gt;
  
  
  Step 3: Call Model API
&lt;/h3&gt;

&lt;p&gt;Line 659 — the actual API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for await (const message of deps.callModel({
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a streaming call. The response arrives token by token, and the system processes it incrementally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Streaming Tool Execution
&lt;/h3&gt;

&lt;p&gt;This is a critical optimization. Traditional agents wait for the model to finish generating all output, then execute tools. Claude Code uses &lt;code&gt;StreamingToolExecutor&lt;/code&gt; (imported at line 96):&lt;/p&gt;

&lt;p&gt;When the model is still generating its second tool call, the first one is already running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Agent (sequential):
┌─────────────────────────┐┌───┐┌───┐┌───┐┌───┐┌───┐
│  LLM generates 5 calls  ││ T1││ T2││ T3││ T4││ T5│  ← 30s total
└─────────────────────────┘└───┘└───┘└───┘└───┘└───┘

Claude Code (streaming):
┌─────────────────────────┐
│  LLM generates 5 calls  │
├──┬──┬──┬──┬─────────────┘
│T1│T2│T3│T4│T5│                                       ← 18s total
└──┴──┴──┴──┴──┘
↑ tools start while LLM is still generating
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a turn with 5 tool calls, traditional waits 30 seconds. Streaming finishes in 18 — a &lt;strong&gt;40% speedup&lt;/strong&gt; from architecture alone, not model improvements.&lt;/p&gt;

&lt;p&gt;Lines 554-555 reveal an interesting detail: &lt;code&gt;stop_reason === 'tool_use'&lt;/code&gt; is unreliable — "it's not always set correctly." The system detects tool calls by watching for &lt;code&gt;tool_use&lt;/code&gt; blocks during streaming instead.&lt;/p&gt;
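&lt;p&gt;The overlap is the whole trick, and it can be demonstrated with a toy stream. Nothing here is the real &lt;code&gt;StreamingToolExecutor&lt;/code&gt;; it is a minimal sketch of the idea:&lt;/p&gt;

```typescript
// Toy demonstration of streaming tool execution: start each tool the
// moment its tool_use block arrives, instead of waiting for the model
// to finish generating. Not the real StreamingToolExecutor API.
type ToolCall = { id: string; durationMs: number };

const sleep = (ms: number) => new Promise(res => setTimeout(res, ms));

async function runTool(call: ToolCall) {
  await sleep(call.durationMs);
  return call.id;
}

// Simulated model stream: one tool_use block every 20 ms.
async function* generateCalls(calls: ToolCall[]) {
  for (const call of calls) {
    await sleep(20);
    yield call;
  }
}

// Streaming executor: tools run in parallel with generation, and the
// results still come back in call order via Promise.all.
async function streamingExecute(calls: ToolCall[]) {
  const running: unknown[] = [];
  for await (const call of generateCalls(calls)) {
    running.push(runTool(call)); // started before the stream is done
  }
  return Promise.all(running);
}
```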

&lt;h3&gt;
  
  
  Step 5: Error Recovery
&lt;/h3&gt;

&lt;p&gt;If the prompt is too long, the system first tries a context collapse drain; if that fails, reactive compact (lines 15-16). If the API returns 413 (prompt too long), it triggers emergency compression and retries.&lt;/p&gt;

&lt;p&gt;But there's a circuit breaker: &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; (line 209, initialized &lt;code&gt;false&lt;/code&gt; at line 275) ensures each turn only attempts reactive compact once. Without this, a genuinely oversized conversation would loop forever.&lt;/p&gt;

&lt;p&gt;The system also handles model degradation — if the primary model fails, it can fall back to a different model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Stop Hooks
&lt;/h3&gt;

&lt;p&gt;After the model stops outputting, the system runs registered stop hooks. These can inspect the output and decide whether to let the model continue. This is where external governance plugs in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Token Budget Check (Again)
&lt;/h3&gt;

&lt;p&gt;Yes, checked twice — once before calling the model (should we even start?) and once after (did we exceed the budget?). The second check decides whether to inject a "keep going" nudge or stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Tool Execution
&lt;/h3&gt;

&lt;p&gt;If the response contains &lt;code&gt;tool_use&lt;/code&gt; blocks, execute them. Two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;runTools()&lt;/code&gt; (from &lt;code&gt;toolOrchestration.ts&lt;/code&gt;, line 98) — batch execution&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;StreamingToolExecutor&lt;/code&gt; (line 96) — streaming execution, gated by &lt;code&gt;config.gates.streamingToolExecution&lt;/code&gt; (line 561)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each tool call goes through the 14-step execution pipeline in &lt;code&gt;toolExecution.ts&lt;/code&gt; (1,745 lines) — validation, permission checks, hooks, actual execution, analytics. That's a story for &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-tool-pipeline/" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Attachment Injection
&lt;/h3&gt;

&lt;p&gt;After tools finish, the system injects additional context before the next turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory attachments&lt;/strong&gt; — relevant memories from the &lt;code&gt;memdir/&lt;/code&gt; system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill discovery&lt;/strong&gt; — matching skills based on the current task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queued commands&lt;/strong&gt; — any commands that were waiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens after tool execution but before the next API call, ensuring the model has fresh context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Assemble and Loop
&lt;/h3&gt;

&lt;p&gt;Build the new message list from all the pieces — original conversation, tool results, attachments, system reminders — and go back to step 1.&lt;/p&gt;
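&lt;p&gt;In sketch form, with simplified message shapes standing in for the real conversation structures:&lt;/p&gt;

```typescript
// Sketch of the assembly step. Message shapes are simplified stand-ins
// for the real conversation structures.
type Msg = { role: "user" | "assistant" | "system"; content: string };

function assemble(history: Msg[], toolResults: string[], attachments: string[]): Msg[] {
  const next = history.slice(); // original conversation, untouched
  for (const r of toolResults) {
    next.push({ role: "user", content: `[tool_result] ${r}` });
  }
  for (const a of attachments) {
    // memory, discovered skills, queued commands from step 9
    next.push({ role: "system", content: `[attachment] ${a}` });
  }
  return next; // this becomes the input to step 1 of the next iteration
}
```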

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;p&gt;Most open-source AI agents implement the loop as 50 lines of pseudocode: call model, parse tool calls, execute, repeat. Claude Code's 1,421-line version exists because production reality is messy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context doesn't fit.&lt;/strong&gt; A real coding session easily hits 200K tokens. Without the 4-stage compression pipeline, the agent dies on every long conversation. Most agents just truncate and lose context. Claude Code compresses intelligently — lightweight first, heavy only when needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models fail.&lt;/strong&gt; APIs return 413, connections drop, rate limits hit. The 9 continue points aren't over-engineering — they're the minimum number of recovery paths needed for reliable operation. The &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; circuit breaker is the kind of detail that separates a demo from a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed matters more than strict execution order.&lt;/strong&gt; Streaming tool execution — starting the first tool while the model is still generating the third — is a user experience decision backed by architecture. Traditional agents feel slow because they are: they serialize everything. Claude Code parallelizes at the loop level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens cost money.&lt;/strong&gt; The &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; marker in &lt;code&gt;prompts.ts&lt;/code&gt; (914 lines) splits the system prompt into static (cacheable) and dynamic sections. If two requests share the same static prefix byte-for-byte, the API caches it. Source comment: "don't modify content before the boundary, or you'll destroy the cache." This is prompt cache economics — saving Anthropic real compute costs at scale.&lt;/p&gt;
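&lt;p&gt;The mechanics are simple to sketch. This is an illustration of the split, not the actual &lt;code&gt;prompts.ts&lt;/code&gt; code; only the marker name comes from the source:&lt;/p&gt;

```typescript
// Illustration of the static/dynamic prompt split. The marker name is
// from the article; everything else is invented for demonstration.
const BOUNDARY = "SYSTEM_PROMPT_DYNAMIC_BOUNDARY";

function splitPrompt(prompt: string): { staticPart: string; dynamicPart: string } {
  const i = prompt.indexOf(BOUNDARY);
  if (i === -1) return { staticPart: prompt, dynamicPart: "" };
  return {
    staticPart: prompt.slice(0, i),                 // cacheable: never modify
    dynamicPart: prompt.slice(i + BOUNDARY.length), // per-request content
  };
}

// Two requests hit the same cache entry only when the static prefix
// matches byte-for-byte.
function shareCacheEntry(a: string, b: string): boolean {
  return splitPrompt(a).staticPart === splitPrompt(b).staticPart;
}
```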

&lt;h2&gt;
  
  
  The Behavioral Constitution
&lt;/h2&gt;

&lt;p&gt;Buried inside the prompt assembly, &lt;code&gt;getSimpleDoingTasksSection()&lt;/code&gt; may be the most valuable function in the entire codebase. It encodes hard-won rules about what the model should NOT do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't add features the user didn't ask for&lt;/li&gt;
&lt;li&gt;Don't over-abstract — three duplicate lines beat a premature abstraction&lt;/li&gt;
&lt;li&gt;Don't add comments to code you didn't change&lt;/li&gt;
&lt;li&gt;Don't add unnecessary error handling&lt;/li&gt;
&lt;li&gt;Read code before modifying it&lt;/li&gt;
&lt;li&gt;If a method fails, diagnose before retrying&lt;/li&gt;
&lt;li&gt;Report honestly — don't say you ran something you didn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone who has used Claude Code recognizes these rules. I've personally watched the system refuse to add "helpful" abstractions and stick to minimal changes. That's not the model being disciplined — it's the prompt constraining the model. The takeaway: &lt;strong&gt;don't trust model self-discipline. Codify the behavior.&lt;/strong&gt;&lt;/p&gt;
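&lt;p&gt;Codifying behavior can be as unglamorous as a function that returns a rules block. The function name is from the article; this body paraphrases the list above rather than quoting the real prompt text:&lt;/p&gt;

```typescript
// Sketch of "codify the behavior": the rules live in the program,
// not in the model's goodwill.
const DOING_TASKS_RULES = [
  "Don't add features the user didn't ask for.",
  "Don't over-abstract; three duplicate lines beat a premature abstraction.",
  "Don't add comments to code you didn't change.",
  "Read code before modifying it.",
  "Report honestly; never claim you ran something you didn't.",
];

function getSimpleDoingTasksSection(): string {
  return ["# Doing tasks", ...DOING_TASKS_RULES.map((r) => `- ${r}`)].join("\n");
}
```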

&lt;h2&gt;
  
  
  How Other Agents Compare
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Typical OSS Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Loop complexity&lt;/td&gt;
&lt;td&gt;1,421 lines, 9 continue points&lt;/td&gt;
&lt;td&gt;Unknown (closed source)&lt;/td&gt;
&lt;td&gt;~50-200 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compression&lt;/td&gt;
&lt;td&gt;4-stage pipeline + reactive 413 recovery&lt;/td&gt;
&lt;td&gt;Tab-level context pruning&lt;/td&gt;
&lt;td&gt;Truncate or fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool execution&lt;/td&gt;
&lt;td&gt;Streaming (parallel with generation)&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error recovery&lt;/td&gt;
&lt;td&gt;Circuit breakers, model fallback, emergency compact&lt;/td&gt;
&lt;td&gt;Basic retry&lt;/td&gt;
&lt;td&gt;Crash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching&lt;/td&gt;
&lt;td&gt;Static/dynamic boundary, section registry&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap between Claude Code and most open-source agents is not model quality — it's the program layer. The model is the same Opus or Sonnet for everyone. What makes Claude Code feel different is 1,421 lines of careful engineering around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The query loop is where "LLM talks, program walks" becomes concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM outputs text and tool call JSON. That's it.&lt;/li&gt;
&lt;li&gt;The program handles compression, budget tracking, error recovery, streaming, permissions, memory injection, and 14-step tool validation.&lt;/li&gt;
&lt;li&gt;The 1,421 lines are not the model being smart. They're the program being careful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building an AI agent and your main loop is under 100 lines, you're not handling the cases that matter. Production is not about the happy path. It's about what happens when context overflows, the API returns 413, the user's conversation hits 500 turns, and three tools need to run while the model is still thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Part 3 — The 14-Step Tool Execution Pipeline (coming soon) — what happens between "model says call this tool" and the tool actually running.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1 — 5 Hidden Features Found in 510K Lines&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agents</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Claude Code Source Leaked: 5 Hidden Features Found in 510K Lines of Code</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:07 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</guid>
      <description>&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;Anthropic shipped Claude Code v2.1.88 to npm with a 60MB source map still attached. That single file contained 1,906 source files and 510,000 lines of fully readable TypeScript. No minification. No obfuscation. Just the raw codebase, sitting in a public registry for anyone to download.&lt;/p&gt;

&lt;p&gt;Within hours, backup repositories appeared on GitHub. One of them — &lt;a href="https://github.com/instructkr/claude-code" rel="noopener noreferrer"&gt;instructkr/claude-code&lt;/a&gt; — racked up 20,000+ stars almost instantly. Anthropic pulled the package, but the code was already mirrored everywhere. The cat was out of the bag, and it had opinions about AI safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Hidden Features Found in the Source
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Buddy Pet System
&lt;/h3&gt;

&lt;p&gt;Deep in &lt;code&gt;buddy/types.ts&lt;/code&gt;, there is a complete virtual pet system. Eighteen species, five rarity tiers, shiny variants, hats, custom eyes, and stat blocks. This was clearly planned as an April Fools easter egg.&lt;/p&gt;

&lt;p&gt;The species list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SPECIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;duck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;goose&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dragon&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;octopus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;owl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;penguin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;turtle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ghost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axolotl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;capybara&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cactus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;robot&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rabbit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mushroom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chonk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rarity weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RARITY_WEIGHTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;common&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 60%&lt;/span&gt;
  &lt;span class="na"&gt;uncommon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 25%&lt;/span&gt;
  &lt;span class="na"&gt;rare&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 10%&lt;/span&gt;
  &lt;span class="na"&gt;epic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;//  4%&lt;/span&gt;
  &lt;span class="na"&gt;legendary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;//  1%&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
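&lt;p&gt;Weights like these usually turn into a draw by walking the cumulative distribution. A sketch of the idea, not the leaked implementation:&lt;/p&gt;

```typescript
// Weighted rarity draw: accumulate the weights and return the first
// tier whose cumulative bucket contains the roll.
const RARITY_WEIGHTS: Record<string, number> = {
  common: 60, uncommon: 25, rare: 10, epic: 4, legendary: 1,
};

function pickRarity(roll: number): string {
  // roll is in [0, 100)
  let acc = 0;
  for (const [rarity, weight] of Object.entries(RARITY_WEIGHTS)) {
    acc += weight;
    if (roll < acc) return rarity;
  }
  return "common"; // unreachable while the weights sum to 100
}
```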



&lt;p&gt;Each buddy gets a hat, eyes, and stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tophat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;propeller&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;halo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wizard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beanie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tinyduck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Eye&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;·&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;✦&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;×&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;◉&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;°&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Stat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DEBUGGING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PATIENCE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CHAOS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;WISDOM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SNARK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your buddy is generated deterministically from &lt;code&gt;hash(userId)&lt;/code&gt;. Every account gets a unique pet. There is also a &lt;code&gt;shiny&lt;/code&gt; boolean variant — presumably the rare version you brag about in team Slack.&lt;/p&gt;
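&lt;p&gt;Deterministic generation from a user id is easy to sketch. The species list is from the leak; the hash function (FNV-1a here) and the shiny odds are my assumptions:&lt;/p&gt;

```typescript
// Deterministic buddy from a user id: same id, same pet, every time.
const SPECIES = [
  "duck", "goose", "blob", "cat", "dragon", "octopus",
  "owl", "penguin", "turtle", "snail", "ghost", "axolotl",
  "capybara", "cactus", "robot", "rabbit", "mushroom", "chonk",
];

function fnv1a(s: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept in uint32 range
  }
  return h;
}

function buddyFor(userId: string) {
  const h = fnv1a(userId);
  return {
    species: SPECIES[h % SPECIES.length],
    shiny: h % 128 === 0, // rare variant; the odds are a guess
  };
}
```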

&lt;p&gt;This was 100% an April 1st drop. The leak killed the surprise.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Undercover Mode
&lt;/h3&gt;

&lt;p&gt;This one raised eyebrows. In &lt;code&gt;utils/undercover.ts&lt;/code&gt;, there is a mode that makes Claude pretend to be a human developer.&lt;/p&gt;

&lt;p&gt;When active, undercover mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes &lt;strong&gt;all&lt;/strong&gt; Anthropic traces from commits and pull requests&lt;/li&gt;
&lt;li&gt;Strips &lt;code&gt;Co-Authored-By&lt;/code&gt; headers&lt;/li&gt;
&lt;li&gt;Removes any mention of "Claude Code" or model names&lt;/li&gt;
&lt;li&gt;Writes commit messages "as a human developer would"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// UNDERCOVER MODE — CRITICAL&lt;/span&gt;
&lt;span class="c1"&gt;// You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository.&lt;/span&gt;
&lt;span class="c1"&gt;// Your commit messages, PR titles, and PR bodies MUST NOT contain&lt;/span&gt;
&lt;span class="c1"&gt;// ANY Anthropic-internal information. Do not blow your cover.&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// NEVER include in commit messages or PR descriptions:&lt;/span&gt;
&lt;span class="c1"&gt;// - Internal model codenames (animal names like Capybara, Tengu, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)&lt;/span&gt;
&lt;span class="c1"&gt;// - The phrase "Claude Code" or any mention that you are an AI&lt;/span&gt;
&lt;span class="c1"&gt;// - Co-Authored-By lines or any other attribution&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// Write commit messages as a human developer would.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It auto-activates for public and open-source repositories. You can force it on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_UNDERCOVER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no force-OFF override. The safety default is undercover-on for public repos.&lt;/p&gt;

&lt;p&gt;The kicker: this feature is gated to internal Anthropic employees only (&lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;). Anthropic's own developers use Claude Code on open-source projects, and this mode ensures their contributions look human-written. Draw your own conclusions about what that means for AI attribution norms.&lt;/p&gt;
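&lt;p&gt;The attribution-stripping half of this is mundane string filtering. A rough sketch of the idea, not the actual implementation:&lt;/p&gt;

```typescript
// Strip Co-Authored-By trailers from a commit message.
function stripAttribution(msg: string): string {
  return msg
    .split("\n")
    .filter((line) => !/^co-authored-by:/i.test(line.trim()))
    .join("\n")
    .replace(/\n+$/, ""); // drop trailing blank lines left behind
}
```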

&lt;h3&gt;
  
  
  3. Kairos — Permanent Memory
&lt;/h3&gt;

&lt;p&gt;Behind the feature flag &lt;code&gt;KAIROS&lt;/code&gt; in &lt;code&gt;main.tsx&lt;/code&gt; and the &lt;code&gt;memdir/&lt;/code&gt; directory, there is a persistent memory system that survives across sessions.&lt;/p&gt;

&lt;p&gt;This is not the &lt;code&gt;.claude/&lt;/code&gt; project memory you already know. Kairos is a four-stage memory consolidation pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orient&lt;/strong&gt; — scan context, identify what matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collect&lt;/strong&gt; — gather facts, decisions, patterns from the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consolidate&lt;/strong&gt; — merge new memories with existing long-term store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prune&lt;/strong&gt; — discard stale or low-value memories&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system runs automatically when you are not actively using Claude Code. It tracks memory age, performs periodic scans, and supports team memory paths — meaning shared memory across a team's Claude Code instances.&lt;/p&gt;
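&lt;p&gt;The consolidate and prune stages can be sketched structurally. The stage names are from the source; the &lt;code&gt;Memory&lt;/code&gt; shape, signatures, and thresholds are invented for illustration:&lt;/p&gt;

```typescript
interface Memory { fact: string; lastTouched: number; value: number }

// Consolidate: merge session memories into the long-term store,
// refreshing duplicates rather than storing them twice.
function consolidate(store: Memory[], fresh: Memory[], now: number): Memory[] {
  const byFact = new Map<string, Memory>();
  for (const m of store) byFact.set(m.fact, m);
  for (const m of fresh) {
    const prev = byFact.get(m.fact);
    byFact.set(m.fact, {
      fact: m.fact,
      lastTouched: now,
      value: prev ? Math.max(prev.value, m.value) : m.value,
    });
  }
  return Array.from(byFact.values());
}

// Prune: discard stale, low-value memories.
function prune(store: Memory[], now: number, maxAgeMs: number): Memory[] {
  return store.filter((m) => m.value >= 0.5 || now - m.lastTouched < maxAgeMs);
}
```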

&lt;p&gt;This turns Claude Code from a stateless tool into a persistent assistant that learns your codebase, your patterns, and your preferences over time. It is the most architecturally significant hidden feature in the leak.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ultraplan — Deep Task Planning
&lt;/h3&gt;

&lt;p&gt;The feature flag &lt;code&gt;ULTRAPLAN&lt;/code&gt; in &lt;code&gt;commands.ts&lt;/code&gt; enables a deep planning mode that can run for up to 30 minutes on a single task. It uses remote agent execution — meaning the heavy thinking happens server-side, not in your terminal.&lt;/p&gt;

&lt;p&gt;Ultraplan is listed under &lt;code&gt;INTERNAL_ONLY_COMMANDS&lt;/code&gt;. Anthropic's engineers apparently have access to a planning mode that goes far beyond what ships to paying customers. This is the kind of feature that separates "AI autocomplete" from "AI architect."&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Multi-Agent, Voice, and Daemon Modes
&lt;/h3&gt;

&lt;p&gt;The source reveals several execution modes that are not publicly documented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator mode&lt;/strong&gt; — orchestrates multiple Claude instances running in parallel, each working on a subtask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice mode&lt;/strong&gt; (&lt;code&gt;VOICE_MODE&lt;/code&gt; flag) — voice input/output for Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridge mode&lt;/strong&gt; (&lt;code&gt;BRIDGE_MODE&lt;/code&gt;) — remote control of a Claude Code instance from another process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daemon mode&lt;/strong&gt; (&lt;code&gt;DAEMON&lt;/code&gt;) — runs Claude Code as a background process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDS inbox&lt;/strong&gt; (&lt;code&gt;UDS_INBOX&lt;/code&gt;) — Unix domain socket for inter-process communication between Claude instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these paint a picture of Claude Code evolving from a single-user CLI into a multi-agent orchestration platform. The daemon + UDS architecture means Claude Code instances can message each other, coordinate work, and run without a terminal attached.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;The entire Claude Code engine lives in &lt;code&gt;queryLoop()&lt;/code&gt; at &lt;code&gt;query.ts&lt;/code&gt; line 241. At line 307, there is a &lt;code&gt;while(true)&lt;/code&gt; loop that drives everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;callModel()&lt;/code&gt; sends the conversation to the LLM&lt;/li&gt;
&lt;li&gt;The LLM returns text and &lt;code&gt;tool_use&lt;/code&gt; JSON blocks&lt;/li&gt;
&lt;li&gt;The program parses each &lt;code&gt;tool_use&lt;/code&gt;, checks permissions, executes the tool&lt;/li&gt;
&lt;li&gt;Results feed back into the conversation&lt;/li&gt;
&lt;li&gt;Loop continues until the LLM stops requesting tools&lt;/li&gt;
&lt;/ol&gt;
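&lt;p&gt;Stripped to its skeleton, and with the permission checks elided, the loop reads something like this (names such as &lt;code&gt;runTool&lt;/code&gt; are placeholders, not the real signatures):&lt;/p&gt;

```typescript
interface ToolUse { name: string; args: unknown }
interface Turn { text: string; toolUses: ToolUse[] }

function queryLoop(
  callModel: (history: string[]) => Turn,
  runTool: (t: ToolUse) => string,
  history: string[],
): string {
  for (;;) {
    const turn = callModel(history);                  // steps 1-2
    if (turn.toolUses.length === 0) return turn.text; // step 5: no tools, done
    for (const use of turn.toolUses) {
      history.push(runTool(use));                     // steps 3-4: execute, feed back
    }
  }
}
```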

&lt;p&gt;This is the "LLM talks, program walks" pattern I wrote about &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;previously&lt;/a&gt;. The LLM decides what to do. The program decides whether to allow it, then does it. Seeing it confirmed in 510K lines of production code is satisfying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's permission system is the most carefully engineered part of the codebase. Every tool call passes through six layers, implemented in &lt;code&gt;useCanUseTool.tsx&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Config allowlist&lt;/strong&gt; — checks project and user configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-mode classifier&lt;/strong&gt; — determines if the tool is safe for autonomous execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator gate&lt;/strong&gt; — validates against the orchestration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm worker gate&lt;/strong&gt; — checks permissions for sub-agent execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bash classifier&lt;/strong&gt; — analyzes shell commands for safety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive user prompt&lt;/strong&gt; — final human confirmation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;External commands run in a sandbox. This is defense-in-depth done right. The irony is that the company that built this careful permission model forgot to strip a source map from their npm package.&lt;/p&gt;
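&lt;p&gt;A layered gate chain like this is usually a short-circuiting fold: ask each gate in order, and the first one with an opinion wins. A structural sketch only; the gate names paraphrase the list above and none of this is the real API:&lt;/p&gt;

```typescript
type Verdict = "allow" | "deny" | "ask-user";
type Gate = (tool: string, args: string) => Verdict | null; // null = no opinion

function decide(gates: Gate[], tool: string, args: string): Verdict {
  for (const gate of gates) {
    const verdict = gate(tool, args);
    if (verdict !== null) return verdict; // first gate with an opinion wins
  }
  return "ask-user"; // fall through to the human
}

// Toy stand-in for the bash classifier: deny obviously destructive commands.
const bashClassifier: Gate = (tool, args) => {
  if (tool !== "bash") return null;
  return /rm\s+-rf\s+\//.test(args) ? "deny" : null;
};
```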

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;The moat for AI coding tools is not the CLI. It is the model. Anyone can read this source code and understand the architecture, but nobody can replicate Sonnet or Opus. The &lt;code&gt;queryLoop()&lt;/code&gt; pattern is elegant but simple — the magic is in what &lt;code&gt;callModel()&lt;/code&gt; returns. That said, the product roadmap is now public. Competitors know about Kairos, Ultraplan, multi-agent coordination, and voice mode. That is real strategic damage.&lt;/p&gt;

&lt;p&gt;For a company that positions itself as the responsible AI lab — the one that takes safety seriously — shipping a fully readable source map to a public registry is a notable operational security failure. The six-layer permission system in the code is impressive. The process that let a 60MB source map slip through CI/CD is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the Deep Dive
&lt;/h2&gt;

&lt;p&gt;I broke down the full AI agent architecture — the same query loop that Claude Code uses — in a 15-minute video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For background on the "LLM talks, program walks" pattern: &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;Read: The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Coming next: a deep dive into Claude Code's 6-layer permission system and the Kairos memory architecture — with full code walkthroughs. Subscribe to catch it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>anthropic</category>
      <category>agents</category>
      <category>security</category>
    </item>
    <item>
      <title>The AI Stack Explained: LLM Talks, Program Walks</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 30 Mar 2026 04:14:15 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/the-ai-stack-explained-llm-talks-program-walks-3p8a</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/the-ai-stack-explained-llm-talks-program-walks-3p8a</guid>
      <description>&lt;p&gt;LLM. Token. Context. Prompt. Function Calling. MCP. Agent. Skill.&lt;/p&gt;

&lt;p&gt;You've spent months trying to understand these concepts. Here's something that might surprise you: &lt;strong&gt;they're all the same thing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An LLM can only do one thing — output text. It can't browse the web. It can't query a database. It can't control your computer. The program around it does all of that. The program reads the text the LLM outputs, takes action on its behalf, and feeds the result back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM talks, program walks.&lt;/strong&gt; That's the entire AI stack in four words.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Every AI capability — from chatbots to autonomous agents — is built on one loop: the LLM outputs text, the program reads it and acts, the result feeds back. Understanding this loop makes every AI concept transparent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Layer 1: The LLM — A Genius That Can Only Play Word Chain
&lt;/h2&gt;

&lt;p&gt;At its core, a large language model is a word prediction machine.&lt;/p&gt;

&lt;p&gt;You give it "The capital of France is" — it predicts "Paris." Then it appends "Paris" to the input and predicts again. Comma. "Which." "Is." On and on — until it outputs a stop token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumf4cdjx113vi9twg66v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumf4cdjx113vi9twg66v.webp" alt="The LLM is a word prediction machine — input text in, predict next word out" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No thinking. No understanding. No consciousness. &lt;strong&gt;Just one thing: given the text so far, predict the next word.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But the model's internals are pure matrix math — it only understands numbers. So there's a translator: the &lt;strong&gt;Tokenizer&lt;/strong&gt;. It chops text into small chunks called Tokens, maps each to a number, feeds them to the model, and converts the output back to text.&lt;/p&gt;

&lt;p&gt;A Token ≠ a word. "helpful" → "help" + "ful" (2 tokens). "unbelievable" → "un" + "believ" + "able" (3 tokens).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens are the atoms of the LLM world.&lt;/strong&gt; Everything goes in as tokens, everything comes out as tokens.&lt;/p&gt;
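&lt;p&gt;Real tokenizers use byte-pair encoding over a learned vocabulary, but the text-to-numbers round trip can be illustrated with a toy greedy matcher over a hand-written vocab:&lt;/p&gt;

```typescript
// Toy tokenizer: greedily match vocab pieces, emit their ids.
// Only shows the shape of the idea, not real BPE.
const VOCAB = ["help", "ful", "un", "believ", "able"];

function tokenize(word: string): number[] {
  const ids: number[] = [];
  let rest = word;
  while (rest.length > 0) {
    const piece = VOCAB.find((p) => rest.startsWith(p));
    if (!piece) throw new Error(`no token for: ${rest}`);
    ids.push(VOCAB.indexOf(piece));
    rest = rest.slice(piece.length);
  }
  return ids;
}
```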

&lt;p&gt;The LLM can play word chain. But it has a fatal flaw.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Context — A Genius with No Memory
&lt;/h2&gt;

&lt;p&gt;The LLM has no memory. This isn't a metaphor — it's literally a math function. Input in, output out, done. Next call? Knows nothing.&lt;/p&gt;

&lt;p&gt;So why does it seem like it remembers your earlier messages?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because every time you send a message, the program behind the scenes stitches your entire conversation history together and sends it all at once.&lt;/strong&gt; The LLM doesn't "remember." It re-reads everything from scratch. Every single time.&lt;/p&gt;
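&lt;p&gt;Mechanically, the program just keeps an array and resends all of it on every turn. A bare-bones sketch, with &lt;code&gt;callModel&lt;/code&gt; standing in for the real API call:&lt;/p&gt;

```typescript
// "Re-reads everything from scratch": the program keeps the history
// and resends the whole thing on every turn.
function chat(callModel: (msgs: string[]) => string) {
  const history: string[] = [];
  return (userMessage: string): string => {
    history.push(`user: ${userMessage}`);
    const reply = callModel(history); // the FULL history, every single time
    history.push(`assistant: ${reply}`);
    return reply;
  };
}
```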

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf7qma8qkbniyx4xyn6w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf7qma8qkbniyx4xyn6w.webp" alt="Context = everything on the LLM's desk: chat history, system instructions, your question, tool list" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bundle is called &lt;strong&gt;Context&lt;/strong&gt; — everything the LLM can see at once. Think of it as a desk. Today's largest models fit about 1 million tokens on that desk (~750,000 words, most of the seven-book Harry Potter series).&lt;/p&gt;

&lt;p&gt;But even with a big desk, dumping an entire thousand-page manual onto it is impractical. The fix? &lt;strong&gt;Only put the relevant pages on the desk.&lt;/strong&gt; Search ahead of time, find the matching chunks, feed only those.&lt;/p&gt;

&lt;p&gt;That's &lt;strong&gt;RAG&lt;/strong&gt; — Retrieval-Augmented Generation. Don't dump everything. Pick what matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Prompt — What You Say to the LLM
&lt;/h2&gt;

&lt;p&gt;Don't overthink "Prompt." A prompt is just what you say to the LLM. Every message you type is a prompt.&lt;/p&gt;

&lt;p&gt;But there are two kinds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwhr5qtnmgw0x7zh95a2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwhr5qtnmgw0x7zh95a2.webp" alt="Two kinds of prompt: User Prompt (what to do now) and System Prompt (who you are)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Prompt&lt;/strong&gt; — what you type. "Write me a sorting algorithm in Python."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Prompt&lt;/strong&gt; — rules the developer sets behind the scenes. "You are a senior Python engineer. Keep answers concise." You never see this, but the LLM reads it every time.&lt;/p&gt;

&lt;p&gt;Both get packed into Context. User Prompt = what to do now. System Prompt = who you are and what rules to follow.&lt;/p&gt;

&lt;p&gt;The LLM can now predict words, see history, and follow instructions. But it's still just outputting text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What comes next is the most important part.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Function Calling — Where Everything Begins
&lt;/h2&gt;

&lt;p&gt;Let's come back to the fundamental fact:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An LLM can only output text.&lt;/strong&gt; It can't browse the internet. It can't check the weather. It can't call any API.&lt;/p&gt;

&lt;p&gt;So how does it "check the weather"? &lt;strong&gt;It doesn't. The program does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBZb3UKICAgIHBhcnRpY2lwYW50IFByb2dyYW0KICAgIHBhcnRpY2lwYW50IExMTQogICAgcGFydGljaXBhbnQgQVBJIGFzIFdlYXRoZXIgQVBJCgogICAgWW91LT4-UHJvZ3JhbTogIldoYXQncyB0aGUgd2VhdGhlciBpbiBUb2t5bz8iCiAgICBQcm9ncmFtLT4-TExNOiBbeW91ciBxdWVzdGlvbiArIHRvb2wgY2F0YWxvZ10KICAgIExMTS0-PlByb2dyYW06IHsidG9vbCI6ICJnZXRfd2VhdGhlciIsICJhcmdzIjogeyJjaXR5IjogIlRva3lvIn19CiAgICBOb3RlIG92ZXIgTExNOiBMTE0ncyBqb2IgaXMgZG9uZS4gSXQganVzdCBvdXRwdXQgSlNPTiB0ZXh0LgogICAgUHJvZ3JhbS0-PkFQSTogR0VUIC93ZWF0aGVyP2NpdHk9VG9reW8KICAgIEFQSS0-PlByb2dyYW06IHsiY29uZGl0aW9uIjogIkNsb3VkeSIsICJ0ZW1wIjogIjE4wrBDIn0KICAgIFByb2dyYW0tPj5MTE06IFtvcmlnaW5hbCBxdWVzdGlvbiArIHRvb2wgcmVzdWx0XQogICAgTExNLT4-UHJvZ3JhbTogIkl0J3MgY3VycmVudGx5IGNsb3VkeSBpbiBUb2t5bywgYXJvdW5kIDE4wrBDLiIKICAgIFByb2dyYW0tPj5Zb3U6ICJJdCdzIGN1cnJlbnRseSBjbG91ZHkgaW4gVG9reW8sIGFyb3VuZCAxOMKwQy4i" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBZb3UKICAgIHBhcnRpY2lwYW50IFByb2dyYW0KICAgIHBhcnRpY2lwYW50IExMTQogICAgcGFydGljaXBhbnQgQVBJIGFzIFdlYXRoZXIgQVBJCgogICAgWW91LT4-UHJvZ3JhbTogIldoYXQncyB0aGUgd2VhdGhlciBpbiBUb2t5bz8iCiAgICBQcm9ncmFtLT4-TExNOiBbeW91ciBxdWVzdGlvbiArIHRvb2wgY2F0YWxvZ10KICAgIExMTS0-PlByb2dyYW06IHsidG9vbCI6ICJnZXRfd2VhdGhlciIsICJhcmdzIjogeyJjaXR5IjogIlRva3lvIn19CiAgICBOb3RlIG92ZXIgTExNOiBMTE0ncyBqb2IgaXMgZG9uZS4gSXQganVzdCBvdXRwdXQgSlNPTiB0ZXh0LgogICAgUHJvZ3JhbS0-PkFQSTogR0VUIC93ZWF0aGVyP2NpdHk9VG9reW8KICAgIEFQSS0-PlByb2dyYW06IHsiY29uZGl0aW9uIjogIkNsb3VkeSIsICJ0ZW1wIjogIjE4wrBDIn0KICAgIFByb2dyYW0tPj5MTE06IFtvcmlnaW5hbCBxdWVzdGlvbiArIHRvb2wgcmVzdWx0XQogICAgTExNLT4-UHJvZ3JhbTogIkl0J3MgY3VycmVudGx5IGNsb3VkeSBpbiBUb2t5bywgYXJvdW5kIDE4wrBDLiIKICAgIFByb2dyYW0tPj5Zb3U6ICJJdCdzIGN1cnJlbnRseSBjbG91ZHkgaW4gVG9reW8sIGFyb3VuZCAxOMKwQy4i" alt="sequenceDiagram" width="1210" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM did not call anything. It just output a JSON string. The program parsed that JSON, the program called the API, the program got the result, and the program fed it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's all Function Calling is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll sum it up in four words: &lt;strong&gt;LLM talks, program walks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM only talks — "I want to check the weather." The program walks — it actually goes and checks. Everything that comes next is built on this loop.&lt;/p&gt;
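&lt;p&gt;The loop is small enough to sketch in a few lines. Here is a minimal Python version; the tool name, the canned &lt;code&gt;get_weather&lt;/code&gt; function, and the exact JSON shape are illustrative, not any particular vendor's API:&lt;/p&gt;

```python
import json

# The "tool catalog" the program owns. The LLM never touches these
# functions; it only ever sees their names and descriptions as text.
def get_weather(city):
    # Stand-in for a real HTTP call to a weather API.
    return {"city": city, "condition": "Cloudy", "temp": "18°C"}

TOOLS = {"get_weather": get_weather}

# Pretend this string came back from the LLM. It is just text.
llm_output = '{"tool": "get_weather", "args": {"city": "Tokyo"}}'

# The program, not the LLM, parses the text and performs the action.
call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["args"])

# The result goes back to the LLM as plain text for the final answer.
print(json.dumps(result))
```

&lt;p&gt;Strip away the branding and every Function Calling implementation is some variant of this parse-and-dispatch step.&lt;/p&gt;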




&lt;h2&gt;
  
  
  Layer 5: MCP — The Tool Catalog
&lt;/h2&gt;

&lt;p&gt;We've got "LLM talks, program walks." But there's a practical problem: &lt;strong&gt;how does the program know what tools are available?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're a new employee with dozens of internal systems. Nobody gives you a tool directory. &lt;strong&gt;MCP is that directory — in a standard format.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vgx83nrliy09kdvw4wu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vgx83nrliy09kdvw4wu.webp" alt="MCP Server: catalog + execution — Program asks, MCP returns tools, Program sends request, MCP executes" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An MCP Server provides two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt; — "What tools do you have?" → returns each tool's name, description, parameters, and return format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; — "Call get_weather with Tokyo" → runs it, returns the result&lt;/li&gt;
&lt;/ol&gt;
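&lt;p&gt;In code, those two responsibilities are just two methods. A toy, MCP-flavored sketch in Python (the real protocol is JSON-RPC over stdio or HTTP; the class and method names here are invented for illustration):&lt;/p&gt;

```python
class ToyMCPServer:
    """Not the real MCP wire protocol, just its two responsibilities."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, fn):
        self._tools[name] = {"description": description, "fn": fn}

    # 1. Catalog: "What tools do you have?"
    def list_tools(self):
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    # 2. Execution: "Call get_weather with Tokyo"
    def call_tool(self, name, **args):
        return self._tools[name]["fn"](**args)


server = ToyMCPServer()
server.register("get_weather", "Current weather for a city",
                lambda city: {"city": city, "condition": "Cloudy"})

catalog = server.list_tools()                      # shown to the LLM as text
result = server.call_tool("get_weather", city="Tokyo")   # run by the program
```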

&lt;p&gt;Before MCP, every platform had its own way of connecting tools. Build for ChatGPT, rewrite for Claude, rewrite for Gemini. Same tool, three times.&lt;/p&gt;

&lt;p&gt;MCP unified this: &lt;strong&gt;build once, run everywhere.&lt;/strong&gt; Think USB-C — one cable works for everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 6: Agent — The "Talks &amp;amp; Walks" Loop, on Repeat
&lt;/h2&gt;

&lt;p&gt;In Function Calling, the LLM talked once and the program walked once. One round trip. But real problems aren't that simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's the weather here? If it's raining, find me a nearby umbrella shop."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's multiple steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs9mqui7ri71126nnmld.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs9mqui7ri71126nnmld.webp" alt="The agent loop: LLM talks → program walks → feedback → repeat until done" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM says "I need the location tool" → program executes → returns coordinates&lt;/li&gt;
&lt;li&gt;LLM says "Check weather at these coordinates" → program executes → returns "rainy"&lt;/li&gt;
&lt;li&gt;LLM says "Search nearby umbrella shops" → program executes → returns results&lt;/li&gt;
&lt;li&gt;LLM combines everything → outputs the final answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step is the same loop: &lt;strong&gt;talks → walks → feedback → talks again → walks again.&lt;/strong&gt;&lt;/p&gt;
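&lt;p&gt;The whole loop fits in a dozen lines of Python. Here &lt;code&gt;fake_llm&lt;/code&gt; is a scripted stand-in for a real model call and the tools are canned, but the control flow is the actual shape of an agent:&lt;/p&gt;

```python
import json

def fake_llm(history):
    """Stand-in for a real LLM call: picks the next step from history."""
    steps = [
        '{"tool": "get_location", "args": {}}',
        '{"tool": "get_weather", "args": {"coords": "35.6,139.6"}}',
        '{"tool": "search_shops", "args": {"query": "umbrella"}}',
        '{"answer": "It is raining; the nearest umbrella shop is 200m away."}',
    ]
    done = len([h for h in history if h[0] == "tool_result"])
    return steps[done]

TOOLS = {
    "get_location": lambda: "35.6,139.6",
    "get_weather": lambda coords: "rainy",
    "search_shops": lambda query: ["Shop A (200m)"],
}

history = [("user", "Weather here? If raining, find an umbrella shop.")]
while True:
    msg = json.loads(fake_llm(history))            # LLM talks (text only)
    if "answer" in msg:
        answer = msg["answer"]                     # done: final text
        break
    result = TOOLS[msg["tool"]](**msg["args"])     # program walks
    history.append(("tool_result", result))        # feedback, then loop

print(answer)
```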

&lt;p&gt;A system that can plan autonomously, execute across multiple steps, and loop until completion — that's an &lt;strong&gt;Agent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code, Cursor, and GitHub Copilot all call themselves agents. Under the hood, they're running this same loop.&lt;/p&gt;

&lt;p&gt;But here's the key insight: getting the location, checking the weather, searching for shops — the &lt;strong&gt;program does all of that&lt;/strong&gt;. None of it requires intelligence. The LLM's only job? &lt;strong&gt;Deciding what to do next.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An "intelligent agent" is actually assembled from parts that require zero intelligence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 7: Skill — Pre-Written Rules
&lt;/h2&gt;

&lt;p&gt;The agent can plan on its own. But it doesn't know your rules.&lt;/p&gt;

&lt;p&gt;Your team has a deployment checklist — pass all tests, verify env variables, confirm rollback plan, notify on-call. You want the agent to follow this every time. Are you going to type all that out every deploy?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Skill is those rules written into a document, stored in a fixed location.&lt;/strong&gt; It's literally a Markdown file — name, description, steps, rules, format, examples.&lt;/p&gt;

&lt;p&gt;Let's be honest: a Skill is just a prompt that lives in a different place and has a fancier name. But Skills do have one clever design idea — &lt;strong&gt;progressive disclosure:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprnfwfw3n3vcorsl59de.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprnfwfw3n3vcorsl59de.webp" alt="Progressive disclosure: scan catalog → load instructions → follow citations" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1:&lt;/strong&gt; Scan names and descriptions (table of contents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2:&lt;/strong&gt; Load full instructions when matched (open the chapter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3:&lt;/strong&gt; Load referenced docs/scripts only when needed (check the footnotes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a tradeoff between token cost and information completeness. &lt;strong&gt;Just enough is optimal.&lt;/strong&gt;&lt;/p&gt;
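&lt;p&gt;A sketch of what a skill loader does, in Python. The skills here are inline dicts rather than Markdown files on disk, and the matching is a crude keyword check; the point is only the three levels of laziness:&lt;/p&gt;

```python
# Level 1 metadata is always in context; heavier fields load lazily.
SKILLS = [
    {
        "name": "deploy-checklist",
        "description": "Rules for deploying a service",
        "instructions": "1. Pass all tests\n2. Verify env vars\n"
                        "3. Confirm rollback plan\n4. Notify on-call",
        "references": ["runbooks/rollback.md"],  # Level 3: only if cited
    },
    {
        "name": "code-review",
        "description": "House style for reviewing PRs",
        "instructions": "...",
        "references": [],
    },
]

def build_context(task):
    # Level 1: scan only names + descriptions (the table of contents).
    toc = [(s["name"], s["description"]) for s in SKILLS]
    loaded = []
    for s in SKILLS:
        # Level 2: load full instructions only when the skill matches.
        if s["name"].split("-")[0] in task.lower():
            loaded.append(s["instructions"])
            # Level 3 (s["references"]) would be read only when needed.
    return toc, loaded

toc, instructions = build_context("deploy the payments service")
```

&lt;p&gt;Every skill costs a line or two of tokens up front; the full checklist costs tokens only on the turn that actually deploys something.&lt;/p&gt;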




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;Let's zoom out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1lkp2o71smnfndnk2le.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1lkp2o71smnfndnk2le.webp" alt="The full AI stack — 7 layers from LLM at the base to Skill at the top" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIFNbU2tpbGxdIC0tPnxwcmUtd3JpdHRlbiBydWxlc3wgQQogICAgQVtBZ2VudF0gLS0-fGxvb3Agb24gcmVwZWF0fCBNCiAgICBNW01DUF0gLS0-fHRvb2wgY2F0YWxvZ3wgRkMKICAgIEZDW0Z1bmN0aW9uIENhbGxpbmddIC0tPnx0ZXh0IOKGkiBhY3Rpb258IFAKICAgIFBbUHJvbXB0XSAtLT58aW5zdHJ1Y3Rpb25zfCBDCiAgICBDW0NvbnRleHRdIC0tPnxldmVyeXRoaW5nIHZpc2libGV8IFQKICAgIFRbVG9rZW5dIC0tPnxhdG9taWMgdW5pdHN8IEwKICAgIExbTExNXSAtLT58b3V0cHV0cyB0ZXh0fCBGQwoKICAgIHN0eWxlIEwgZmlsbDojMzMzLHN0cm9rZTojNjY2LGNvbG9yOiNmZmYKICAgIHN0eWxlIFQgZmlsbDojMzMzLHN0cm9rZTojNzc3LGNvbG9yOiNmZmYKICAgIHN0eWxlIEMgZmlsbDojMzMzLHN0cm9rZTojODg4LGNvbG9yOiNmZmYKICAgIHN0eWxlIFAgZmlsbDojMzMzLHN0cm9rZTojOTk5LGNvbG9yOiNmZmYKICAgIHN0eWxlIEZDIGZpbGw6IzFhM2E1YyxzdHJva2U6IzRkYTZmZixjb2xvcjojZmZmCiAgICBzdHlsZSBNIGZpbGw6IzNkMjIwMCxzdHJva2U6I2ZmOWY0Myxjb2xvcjojZmZmCiAgICBzdHlsZSBBIGZpbGw6IzAwMzMzMyxzdHJva2U6IzAwZDJkMyxjb2xvcjojZmZmCiAgICBzdHlsZSBTIGZpbGw6IzJkMTA0NSxzdHJva2U6IzliNTliNixjb2xvcjojZmZm" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIFNbU2tpbGxdIC0tPnxwcmUtd3JpdHRlbiBydWxlc3wgQQogICAgQVtBZ2VudF0gLS0-fGxvb3Agb24gcmVwZWF0fCBNCiAgICBNW01DUF0gLS0-fHRvb2wgY2F0YWxvZ3wgRkMKICAgIEZDW0Z1bmN0aW9uIENhbGxpbmddIC0tPnx0ZXh0IOKGkiBhY3Rpb258IFAKICAgIFBbUHJvbXB0XSAtLT58aW5zdHJ1Y3Rpb25zfCBDCiAgICBDW0NvbnRleHRdIC0tPnxldmVyeXRoaW5nIHZpc2libGV8IFQKICAgIFRbVG9rZW5dIC0tPnxhdG9taWMgdW5pdHN8IEwKICAgIExbTExNXSAtLT58b3V0cHV0cyB0ZXh0fCBGQwoKICAgIHN0eWxlIEwgZmlsbDojMzMzLHN0cm9rZTojNjY2LGNvbG9yOiNmZmYKICAgIHN0eWxlIFQgZmlsbDojMzMzLHN0cm9rZTojNzc3LGNvbG9yOiNmZmYKICAgIHN0eWxlIEMgZmlsbDojMzMzLHN0cm9rZTojODg4LGNvbG9yOiNmZmYKICAgIHN0eWxlIFAgZmlsbDojMzMzLHN0cm9rZTojOTk5LGNvbG9yOiNmZmYKICAgIHN0eWxlIEZDIGZpbGw6IzFhM2E1YyxzdHJva2U6IzRkYTZmZixjb2xvcjojZmZmCiAgICBzdHlsZSBNIGZpbGw6IzNkMjIwMCxzdHJva2U6I2ZmOWY0Myxjb2xvcjojZmZmCiAgICBzdHlsZSBBIGZpbGw6IzAwMzMzMyxzdHJva2U6IzAwZDJkMyxjb2xvcjojZmZmCiAgICBzdHlsZSBTIGZpbGw6IzJkMTA0NSxzdHJva2U6IzliNTliNixjb2xvcjojZmZm" alt="graph TD" width="253" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Layer by layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Function Calling&lt;/strong&gt; — the program turns text into action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP&lt;/strong&gt; — provides the tool catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; — lets the loop run multiple rounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill&lt;/strong&gt; — pre-written rules that guide the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; — picks relevant info for the desk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — stitches history back in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;None of these capabilities belong to the LLM itself.&lt;/strong&gt; They're all granted by external programs.&lt;/p&gt;

&lt;p&gt;The LLM's sole contribution? &lt;strong&gt;Outputting the right text at the right time.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Questions That Cut Through Any Buzzword
&lt;/h2&gt;

&lt;p&gt;Next time someone throws a new concept at you — Multi-Agent, Agentic RAG, Orchestration Framework — you only need two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;① What text did the LLM output?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;② Who read that text and turned it into an actual action?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer those two questions, and any concept becomes transparent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM talks, program walks. That loop is how the entire AI world runs.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;See Function Calling happen live in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/harrison001/llm-talks-program-walks.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llm-talks-program-walks
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
python mouth_speaks_hand_acts.py &lt;span class="s2"&gt;"What's the weather in Tokyo?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal labels every step — "This is just TEXT" when the LLM outputs JSON, and "The PROGRAM did this" when the program executes the function. &lt;a href="https://github.com/harrison001/llm-talks-program-walks" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why Your "Fail-Fast" Strategy is Killing Your Distributed System (and How to Fix It)</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 21 Mar 2026 06:34:59 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/why-your-fail-fast-strategy-is-killing-your-distributed-system-and-how-to-fix-it-elg</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/why-your-fail-fast-strategy-is-killing-your-distributed-system-and-how-to-fix-it-elg</guid>
      <description>&lt;p&gt;It's 2 AM. PagerDuty fires. Redis master is down. Your application, trained to fail fast, dutifully fails — every single request, all at once. By the time Sentinel promotes a new master 12 seconds later, you've already generated 40,000 errors and three escalation calls. The system recovered on its own. Your application didn't let it.&lt;/p&gt;

&lt;p&gt;This is the story of how "good engineering" can turn a 12-second infrastructure event into a 12-minute outage — and how to design boundaries that prevent it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — During infrastructure failovers (Redis, Kafka, etcd), blind fail-fast amplifies instability. Bounded retry — centralized, time-boxed, invisible to business logic — absorbs the 10–15 second recovery window without leaking infrastructure noise to users. Resilience is not a library. It is a contract between layers.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Question
&lt;/h2&gt;

&lt;p&gt;When your session storage — Redis, Memcached, or any stateful dependency — becomes temporarily unavailable, you face a fundamental architectural choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you fail fast? Or should you retry?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all learned fail-fast as gospel. And it is — until it isn't. During transient infrastructure events like leader elections, blind fail-fast propagates instability instead of containing it. The response you choose determines whether the incident resolves itself in 12 seconds or snowballs into a 12-minute outage with three bridge calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Happens During Failover
&lt;/h2&gt;

&lt;p&gt;To understand why fail-fast can backfire, look at the mechanics of a Redis Sentinel failover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10–12s&lt;/td&gt;
&lt;td&gt;Sentinel quorum detects master is down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Election&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1–2s&lt;/td&gt;
&lt;td&gt;Sentinels agree on a new master&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Promotion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;td&gt;Replica promoted, clients notified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reconnection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1–3s&lt;/td&gt;
&lt;td&gt;Clients re-establish connections&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: these phases overlap. Total failover typically completes in 12–15 seconds, not the sum of individual phases. Reconnection time also depends heavily on your client library — a Sentinel-aware client with topology refresh (e.g., Lettuce, go-redis with Sentinel support) reconnects in under a second, while a naive connection pool can take 30s+.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During this window, your application sees TCP dial timeouts and connection resets. Nothing is broken. No data is lost. The system is doing exactly what it was designed to do — electing a new leader. Your application just needs to not panic for 12 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Blind Fail-Fast Is Dangerous
&lt;/h2&gt;

&lt;p&gt;If your application fails immediately on the first connection timeout during this window, four things happen in rapid succession:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Instability Amplification
&lt;/h3&gt;

&lt;p&gt;A 12-second infrastructure blip becomes a user-visible outage. Every request during the failover window returns an error, even though the system would have recovered on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Infrastructure Semantics Leak Upward
&lt;/h3&gt;

&lt;p&gt;Your business layer now exposes raw infrastructure details — "Redis connection refused" — to clients that have no idea what Redis is or why it matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Uncontrolled Client Retries
&lt;/h3&gt;

&lt;p&gt;Clients receiving errors start retrying independently. If you have 1,000 concurrent users and each failed request is attempted 3 times, you just turned 1,000 QPS into 3,000 QPS — hitting an infrastructure layer that's already struggling to stabilize.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Retry Storms
&lt;/h3&gt;

&lt;p&gt;This is the catastrophic outcome. Unbounded retries create cascading load amplification. CPU spikes prevent recovery. The system enters an instability feedback loop where the act of trying to recover keeps the system down. I've seen retry storms take down entire regions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your timeout config was technically correct. Your system was functionally down. That's not a timeout problem — that's a design problem."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the distinction that actually matters in production: &lt;strong&gt;the failure TYPE must determine your recovery strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Infrastructure-Level&lt;/th&gt;
&lt;th&gt;Business-Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network jitter, leader election, connection reset, &lt;code&gt;READONLY&lt;/code&gt; replica response&lt;/td&gt;
&lt;td&gt;Validation error, permission denial, domain rule violation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transient — will resolve on its own&lt;/td&gt;
&lt;td&gt;Permanent — retrying won't help&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strategy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;ABSORB&lt;/strong&gt; — retry within bounds&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL FAST&lt;/strong&gt; — return error immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Treating a leader election timeout the same as a schema validation error is an architectural mistake. One will resolve in seconds; the other will never succeed no matter how many times you retry.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Boundary Model
&lt;/h2&gt;

&lt;p&gt;This is the architectural pattern that makes everything work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIHN1YmdyYXBoIENsaWVudF9MYXllciBbQ2xpZW50IExheWVyXQogICAgICAgIENbUmV0cmllcyBvbmx5IHdoZW4gc2lnbmFsZWQgcmV0cnlhYmxlXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnVzaW5lc3NfTGF5ZXIgW0J1c2luZXNzIExheWVyXQogICAgICAgIEJbUHJlc2VydmVzIHNlbWFudGljIGludGVncml0eTxici8-RkFJTC1GQVNUIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5mcmFzdHJ1Y3R1cmVfQm91bmRhcnkgW0luZnJhc3RydWN0dXJlIEJvdW5kYXJ5XQogICAgICAgIElbQWJzb3JicyB0cmFuc2llbnQgaW5zdGFiaWxpdHk8YnIvPlJFVFJZIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggRGVwZW5kZW5jeSBbRGVwZW5kZW5jeV0KICAgICAgICBEW1JlZGlzIC8gTkFUUyAvIEthZmthIC8gREJdCiAgICBlbmQKCiAgICBDIC0tPiBCCiAgICBCIC0tICJCdXNpbmVzcyBFcnJvcnM6IEZhaWwgSW1tZWRpYXRlbHkiIC0tPiBDCiAgICBCIC0tPiBJCiAgICBJIC0tICJCb3VuZGVkIFJldHJ5OiBBYnNvcmJzIDEwLTE1cyBub2lzZSIgLS0-IEIKICAgIEkgLS0-IEQ%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIHN1YmdyYXBoIENsaWVudF9MYXllciBbQ2xpZW50IExheWVyXQogICAgICAgIENbUmV0cmllcyBvbmx5IHdoZW4gc2lnbmFsZWQgcmV0cnlhYmxlXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnVzaW5lc3NfTGF5ZXIgW0J1c2luZXNzIExheWVyXQogICAgICAgIEJbUHJlc2VydmVzIHNlbWFudGljIGludGVncml0eTxici8-RkFJTC1GQVNUIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5mcmFzdHJ1Y3R1cmVfQm91bmRhcnkgW0luZnJhc3RydWN0dXJlIEJvdW5kYXJ5XQogICAgICAgIElbQWJzb3JicyB0cmFuc2llbnQgaW5zdGFiaWxpdHk8YnIvPlJFVFJZIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggRGVwZW5kZW5jeSBbRGVwZW5kZW5jeV0KICAgICAgICBEW1JlZGlzIC8gTkFUUyAvIEthZmthIC8gREJdCiAgICBlbmQKCiAgICBDIC0tPiBCCiAgICBCIC0tICJCdXNpbmVzcyBFcnJvcnM6IEZhaWwgSW1tZWRpYXRlbHkiIC0tPiBDCiAgICBCIC0tPiBJCiAgICBJIC0tICJCb3VuZGVkIFJldHJ5OiBBYnNvcmJzIDEwLTE1cyBub2lzZSIgLS0-IEIKICAgIEkgLS0-IEQ%3D" alt="graph TD" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The retry boundary sits in the &lt;strong&gt;infrastructure client wrapper&lt;/strong&gt; — the thin layer between your business code and the dependency client. Not in HTTP middleware, not in individual service handlers, not in a sidecar. In the client wrapper itself.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because if retry logic exists at multiple layers, you get retry amplification. I've seen teams with retry in the HTTP handler, the service layer, AND the Redis client — producing 3 × 3 × 3 = 27 attempts per original request. That's not resilience. That's a DDoS against your own infrastructure.&lt;/p&gt;
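&lt;p&gt;The amplification math is worth making concrete. A two-line check of how per-layer attempt counts multiply (the three-layer, three-attempt setup is the hypothetical one from the paragraph above), in Python for brevity:&lt;/p&gt;

```python
from math import prod

# 3 attempts in the HTTP handler x 3 in the service layer x 3 in the client
attempts_per_layer = [3, 3, 3]
total_attempts = prod(attempts_per_layer)  # per original request

print(total_attempts)  # 27 — layers multiply, they never add
```

&lt;p&gt;Add a fourth retrying layer and you're at 81. This is why the policy must live in exactly one place.&lt;/p&gt;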

&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry belongs at the infrastructure boundary&lt;/strong&gt; — one place, one policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business logic must remain fail-fast&lt;/strong&gt; — semantic errors should never be retried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By the time an error reaches the client, it has been vetted and classified.&lt;/strong&gt; We are designing for predictability.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bounded Retry: Implementation
&lt;/h2&gt;

&lt;p&gt;If we're going to retry, we must do it with discipline. Four pillars:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Centralized
&lt;/h3&gt;

&lt;p&gt;Retry logic lives in one place — the infrastructure client wrapper. Not in individual handlers, not in middleware, not in the business layer. One retry boundary per dependency, one policy, one set of metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Time-Bounded
&lt;/h3&gt;

&lt;p&gt;We define a &lt;strong&gt;retry budget&lt;/strong&gt; — for example, 15 seconds. Why 15? Because it encapsulates the 10–12 second Sentinel detection window plus a margin for stabilization and reconnection. Time-based budgets are superior to pure attempt counts because they normalize across different failure modes — a retry that takes 5s per attempt behaves very differently from one that takes 100ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Attempt-Limited with Jitter
&lt;/h3&gt;

&lt;p&gt;Maximum 2–3 retry attempts within the budget window, with exponential backoff and &lt;strong&gt;jitter&lt;/strong&gt;. Without jitter, synchronized retries from multiple application instances create a thundering herd — everyone hits the new master at exactly the same moment.&lt;/p&gt;
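&lt;p&gt;A sketch of that schedule, in Python for brevity. The 500 ms base matches the Go snippet later in this section; the jitter draws uniformly from half the backoff, so two instances never sleep for identical durations:&lt;/p&gt;

```python
import random

def backoff_schedule(max_attempts, base=0.5, seed=None):
    """Exponential backoff with jitter: each sleep lands in
    [backoff, 1.5 * backoff) so instances desynchronize."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts):
        backoff = base * (2 ** attempt)       # 0.5s, 1s, 2s, ...
        jitter = rng.uniform(0, backoff / 2)  # spreads the herd out
        delays.append(backoff + jitter)
    return delays

# Two app instances with different seeds no longer retry in lockstep.
a = backoff_schedule(3, seed=1)
b = backoff_schedule(3, seed=2)
```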

&lt;h3&gt;
  
  
  4. Invisible to Business Logic
&lt;/h3&gt;

&lt;p&gt;If the retry succeeds within the budget, the business layer never knew there was a problem. If it fails, the business layer receives a clean, classified error — not a raw TCP stack trace that means nothing to anyone above the infrastructure layer.&lt;/p&gt;

&lt;p&gt;Here's what this looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Bounded retry wrapper — lives in the infrastructure client layer&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;withBoundedRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxAttempts&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="c"&gt;// success — business layer never knew&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isRetryable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalizeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// permanent failure — fail fast&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Exponential backoff with jitter&lt;/span&gt;
        &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;
        &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int63n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalizeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// budget exhausted — fail deterministically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│          Retry Budget: 15 seconds           │
│                                             │
│  Attempt 1  →  timeout (5s)  →  backoff     │
│  Attempt 2  →  timeout (5s)  →  backoff     │
│  Attempt 3  →  success                      │
│                                             │
│  Total elapsed: ~11s                        │
│  Application impact: ZERO                   │
│                                             │
│  ─── OR ───                                 │
│                                             │
│  Budget exhausted → FAIL DETERMINISTICALLY  │
│  Clean, classified error to business layer  │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Retry is not infinite. Retry is time-boxed. Once the budget is exhausted, we fail deterministically."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Error Normalization
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They retry everything — or nothing. The retry decision must be driven by error classification:&lt;/p&gt;
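&lt;p&gt;This classification is exactly what the &lt;code&gt;isRetryable&lt;/code&gt; / &lt;code&gt;normalizeError&lt;/code&gt; helpers in the Go snippet above encode. A sketch of the same logic, in Python for brevity — the patterns are substrings of real Redis error strings, and the normalized codes follow gRPC naming by convention:&lt;/p&gt;

```python
# (pattern, normalized code, retryable?) for raw infrastructure errors.
CLASSIFICATION = [
    ("dial timeout",     "UNAVAILABLE",        True),   # may recover
    ("connection reset", "UNAVAILABLE",        True),   # transient network
    ("readonly",         "UNAVAILABLE",        True),   # failover in progress
    ("oom",              "RESOURCE_EXHAUSTED", False),  # backpressure
    ("wrongtype",        "INVALID_ARGUMENT",   False),  # schema bug
    ("noperm",           "PERMISSION_DENIED",  False),  # auth failure
]

def normalize(raw_error):
    for pattern, code, retryable in CLASSIFICATION:
        if pattern in raw_error.lower():
            return code, retryable
    return "UNKNOWN", False  # unknown errors are NOT retried by default

def is_retryable(raw_error):
    return normalize(raw_error)[1]
```

&lt;p&gt;Note the default: anything unrecognized fails fast. Retrying only what you have positively classified as transient is what keeps the retry boundary from becoming a retry storm.&lt;/p&gt;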

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Raw Error&lt;/th&gt;
&lt;th&gt;Normalized To&lt;/th&gt;
&lt;th&gt;Retryable?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TCP dial timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Connection not established, may recover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Connection reset&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Transient network disruption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;READONLY&lt;/code&gt; (replica)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Sentinel failover in progress — replica not yet promoted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Leader election in progress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Raft/consensus transition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OOM command not allowed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RESOURCE_EXHAUSTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Backpressure — retrying makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WRONGTYPE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INVALID_ARGUMENT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Schema error — will never succeed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;NOPERM&lt;/code&gt; / &lt;code&gt;Permission denied&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PERMISSION_DENIED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Auth failure — will never succeed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NOT_FOUND&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NOT_FOUND&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Semantic absence — retry won't create the resource&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;READONLY&lt;/code&gt; case deserves special attention. During Sentinel failover, a replica that hasn't been promoted yet responds with &lt;code&gt;READONLY&lt;/code&gt; to write commands. If your retry layer treats this as a permanent error, your circuit breaker trips, clients get errors, and a 12-second failover becomes a 5-minute outage while someone manually resets the breaker. Classify &lt;code&gt;READONLY&lt;/code&gt; as &lt;code&gt;UNAVAILABLE&lt;/code&gt; — it will resolve when the new master is promoted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule is simple:&lt;/strong&gt; you cannot leak internal implementation details up the stack. Your retry layer must inspect and reclassify errors — not just map them 1:1. Error semantics must align across every layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Relationship with Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;Bounded retry is the &lt;strong&gt;inner loop&lt;/strong&gt; — it handles transient failures within a known recovery window. But what if the dependency is truly down, not just transitioning?&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;circuit breakers&lt;/strong&gt; serve as the &lt;strong&gt;outer loop&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFJlcSgoUmVxdWVzdCkpIC0tPiBDQntDaXJjdWl0IEJyZWFrZXI8YnIvPidPdXRlciBMb29wJ30KICAgIENCIC0tICJIZWFsdGh5IiAtLT4gQlJbQm91bmRlZCBSZXRyeTxici8-J0lubmVyIExvb3AnXQogICAgQlIgLS0-IERlcFsoRGVwZW5kZW5jeSldCgogICAgQ0IgLS0gIk9wZW46IEZhaWx1cmUgUmF0ZSBIaWdoIiAtLT4gRkZbRmFzdCBGYWlsXQogICAgQlIgLS0gIkJ1ZGdldCBFeGhhdXN0ZWQiIC0tPiBFcnJbTm9ybWFsaXplZCBFcnJvcl0KCiAgICBzdHlsZSBDQiBmaWxsOiNmOWYsc3Ryb2tlOiMzMzMsc3Ryb2tlLXdpZHRoOjJweAogICAgc3R5bGUgQlIgZmlsbDojYmJmLHN0cm9rZTojMzMzLHN0cm9rZS13aWR0aDoycHg%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFJlcSgoUmVxdWVzdCkpIC0tPiBDQntDaXJjdWl0IEJyZWFrZXI8YnIvPidPdXRlciBMb29wJ30KICAgIENCIC0tICJIZWFsdGh5IiAtLT4gQlJbQm91bmRlZCBSZXRyeTxici8-J0lubmVyIExvb3AnXQogICAgQlIgLS0-IERlcFsoRGVwZW5kZW5jeSldCgogICAgQ0IgLS0gIk9wZW46IEZhaWx1cmUgUmF0ZSBIaWdoIiAtLT4gRkZbRmFzdCBGYWlsXQogICAgQlIgLS0gIkJ1ZGdldCBFeGhhdXN0ZWQiIC0tPiBFcnJbTm9ybWFsaXplZCBFcnJvcl0KCiAgICBzdHlsZSBDQiBmaWxsOiNmOWYsc3Ryb2tlOiMzMzMsc3Ryb2tlLXdpZHRoOjJweAogICAgc3R5bGUgQlIgZmlsbDojYmJmLHN0cm9rZTojMzMzLHN0cm9rZS13aWR0aDoycHg%3D" alt="graph LR" width="1075" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounded retry&lt;/strong&gt; absorbs transient events (leader election, network jitter) — seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker&lt;/strong&gt; protects against sustained outages (dependency truly dead) — minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a circuit breaker, sustained failures chew through retry budgets on every request, wasting resources. Without bounded retry, every transient blip trips the circuit breaker unnecessarily. They are complementary, not redundant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Instrument the Boundary
&lt;/h2&gt;

&lt;p&gt;A production retry boundary must emit metrics. Without them, you're flying blind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_attempt_total&lt;/code&gt;&lt;/strong&gt; — how often retries fire (by dependency, by error type)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_budget_exhausted_total&lt;/code&gt;&lt;/strong&gt; — how often the full budget is consumed without success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_success_on_attempt&lt;/code&gt;&lt;/strong&gt; — which attempt number succeeds (histogram)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;error_classification&lt;/code&gt;&lt;/strong&gt; — distribution of retryable vs non-retryable errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key alert:&lt;/strong&gt; if retry budget exhaustion rate exceeds ~5%, either your budget is too tight or your dependency is degraded beyond transient. This is the signal that distinguishes a leader election from a real outage — and it's the signal that should trigger your circuit breaker.&lt;/p&gt;
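&lt;p&gt;A stdlib sketch of the boundary's counters; in production these would be exported through a metrics library such as Prometheus, but the exhaustion-rate signal behind the ~5% alert is the same (names follow the list above):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RetryMetrics is a stdlib stand-in for the counters a real setup would
// export via a metrics library. Field comments name the metrics above.
type RetryMetrics struct {
	attempts  atomic.Int64 // retry_attempt_total
	exhausted atomic.Int64 // retry_budget_exhausted_total
	calls     atomic.Int64 // total operations through the boundary
}

// RecordCall is invoked once per operation after the retry loop finishes.
func (m *RetryMetrics) RecordCall(attemptsUsed int, budgetExhausted bool) {
	m.calls.Add(1)
	m.attempts.Add(int64(attemptsUsed))
	if budgetExhausted {
		m.exhausted.Add(1)
	}
}

// ExhaustionRate is the signal the ~5% alert fires on.
func (m *RetryMetrics) ExhaustionRate() float64 {
	c := m.calls.Load()
	if c == 0 {
		return 0
	}
	return float64(m.exhausted.Load()) / float64(c)
}

func main() {
	var m RetryMetrics
	for i := 0; i < 95; i++ {
		m.RecordCall(1, false) // first attempt succeeded
	}
	for i := 0; i < 5; i++ {
		m.RecordCall(3, true) // full budget burned without success
	}
	fmt.Printf("exhaustion rate: %.0f%%\n", m.ExhaustionRate()*100)
	fmt.Println("alert:", m.ExhaustionRate() > 0.05)
}
```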




&lt;h2&gt;
  
  
  Beyond Redis: A Universal Pattern
&lt;/h2&gt;

&lt;p&gt;If this looks Redis-specific, zoom out. The bounded retry pattern applies to any stateful dependency with leader election:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis Sentinel&lt;/strong&gt; — master failover with quorum detection, 10–15s window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NATS JetStream&lt;/strong&gt; — stream leader election in the Raft group, typically 2–5s with default election timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd / Consul&lt;/strong&gt; — Raft leader election, ~1–2s with default settings, but watch streams may buffer longer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; — partition leader election via controller, typically 5–15s depending on &lt;code&gt;replica.lag.time.max.ms&lt;/code&gt; and ISR size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CockroachDB / TiKV&lt;/strong&gt; — range leader election, similar Raft mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mechanics are the same everywhere: a detection window, a brief period of unavailability, and then recovery. Design your retry budget to absorb that window. Calibrate the budget to the specific system — 15s for Redis Sentinel, 5s for NATS, 20s for Kafka.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cross-Layer Contract
&lt;/h2&gt;

&lt;p&gt;Resilience is not a library you &lt;code&gt;import&lt;/code&gt;. It is a &lt;strong&gt;contract between layers&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Absorbs transient instability via bounded retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remains fail-fast for semantic integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retries only when signaled retryable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When failure is bounded and classified, the system becomes &lt;strong&gt;predictable&lt;/strong&gt;. And predictability is the foundation of operational confidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resilience Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Retry Budget:&lt;/strong&gt; Is my retry window matched to the dependency's failover time (e.g., 15s for Redis)?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Jitter:&lt;/strong&gt; Do my retries have randomized sleep to avoid the "Thundering Herd"?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Error Classification:&lt;/strong&gt; Does my code distinguish between &lt;code&gt;READONLY&lt;/code&gt; (retryable) and &lt;code&gt;PERMISSION_DENIED&lt;/code&gt; (not retryable)?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Centralization:&lt;/strong&gt; Is my retry logic in the client wrapper, not leaked across handlers?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Observability:&lt;/strong&gt; Do I have an alert if "Retry Budget Exhausted" exceeds 5%?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fail fast — but not during transient infrastructure events.&lt;/strong&gt; A leader election is not a business error. Don't treat it like one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry must be bounded.&lt;/strong&gt; Time-boxed, attempt-limited, with jitter. No open-ended retry loops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry must be centralized.&lt;/strong&gt; One retry boundary per dependency, at the infrastructure layer. Retry in multiple layers = retry amplification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failure semantics must be normalized.&lt;/strong&gt; Retryable vs non-retryable must be explicit. Watch for &lt;code&gt;READONLY&lt;/code&gt; — the most common Sentinel failover gotcha.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resilience requires cross-layer alignment.&lt;/strong&gt; Bounded retry (inner loop) + circuit breaker (outer loop) + observability = production-grade resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should distributed systems always fail fast?
&lt;/h3&gt;

&lt;p&gt;No. Fail fast for &lt;strong&gt;business-level errors&lt;/strong&gt; (validation, permission, domain rules), but use &lt;strong&gt;bounded retry&lt;/strong&gt; for transient infrastructure failures like leader election and temporary network instability.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a reasonable retry budget for Redis Sentinel failover?
&lt;/h3&gt;

&lt;p&gt;In many production setups, &lt;strong&gt;12–15 seconds&lt;/strong&gt; is a practical starting point because it usually covers Sentinel detection, promotion, and client reconnection. Calibrate with your own failover timings and SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  If the service already retries, should the client also retry?
&lt;/h3&gt;

&lt;p&gt;Only when explicitly signaled retryable. Blind retries at both layers often create retry amplification and can trigger a retry storm.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is bounded retry different from a circuit breaker?
&lt;/h3&gt;

&lt;p&gt;Bounded retry handles short transient windows (inner loop). Circuit breaker handles sustained dependency failure and stops repeated expensive attempts (outer loop).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not use a Service Mesh (Istio) for retries?
&lt;/h3&gt;

&lt;p&gt;A service mesh can retry at the network layer, but the application has better semantic awareness: only the app knows whether a specific error is safe to retry, based on the operation's idempotency.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I NOT use Bounded Retry?
&lt;/h3&gt;

&lt;p&gt;Avoid it for non-idempotent operations unless you have robust request-ID deduplication in place. And for business errors (the 4xx class), always fail fast.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://harrisonsec.com/blog/rust-vs-c-assembly-performance-safety-analysis/" rel="noopener noreferrer"&gt;Rust vs C Assembly: Complete Performance and Safety Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://harrisonsec.com/blog/legacy-compatibility-lab-full-stack/" rel="noopener noreferrer"&gt;Legacy Compatibility Lab: My Full Stack for Reviving Dead Software&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Distributed systems are not about avoiding failure.&lt;br&gt;
They are about &lt;strong&gt;designing boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If retry is everywhere, the system becomes unpredictable.&lt;br&gt;
If retry is nowhere, transient instability leaks upward.&lt;/p&gt;

&lt;p&gt;The goal is not infinite retry.&lt;br&gt;
&lt;strong&gt;The goal is bounded retry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That boundary is what keeps systems stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resilience is not a library. It is a contract between layers.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Based on a talk I gave on failure boundary design in distributed systems.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;harrisonsec.com&lt;/a&gt;. Listen to the &lt;a href="https://harrisonsec.com/audio/why-failing-fast-triggers-cascading-failures.m4a" rel="noopener noreferrer"&gt;deep dive audio&lt;/a&gt; for a detailed walkthrough.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>redis</category>
      <category>systemdesign</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
