<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harrison Guo</title>
    <description>The latest articles on DEV Community by Harrison Guo (@harrison_guo_e01b4c8793a0).</description>
    <link>https://dev.to/harrison_guo_e01b4c8793a0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809272%2Ff7da2c77-d1e2-4b04-8cf4-11c5f274f605.png</url>
      <title>DEV Community: Harrison Guo</title>
      <link>https://dev.to/harrison_guo_e01b4c8793a0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harrison_guo_e01b4c8793a0"/>
    <language>en</language>
    <item>
      <title>Why Go Handles Millions of Connections: User-Space Context Switching, Explained</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 14 Apr 2026 21:43:03 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</guid>
      <description>&lt;p&gt;Somewhere around 40,000 concurrent connections, your Java service falls over. Not from CPU, not from network — from memory, because every connection is a thread and every thread wants its own megabyte of stack. By the time you've finished Googling whether this is a &lt;code&gt;-Xss&lt;/code&gt; problem or a &lt;code&gt;ulimit&lt;/code&gt; problem, Ops has already bumped the box to 64 GB and you've pushed the wall back another 20,000 connections. Linear in RAM. It never ends.&lt;/p&gt;

&lt;p&gt;A Go service on half that box can hold 200,000 connections without noticing. People assume it's because Go is faster. It isn't. Per-request, Go and Java are roughly the same — sometimes Java wins. What Go does differently is more fundamental: &lt;strong&gt;it stops asking the kernel to help.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — High concurrency isn't about raw CPU. It's about how cheaply you can hold an idle connection open. Go's 2KB goroutine stacks and user-space M:N scheduler push the marginal cost of a connection close to zero. The kernel only gets involved when there's real I/O to do. This is the same principle HFT engines chase with DPDK and io_uring — Go just hands it to you for free.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Mental Model
&lt;/h2&gt;

&lt;p&gt;Most engineers I talk to think "threads are expensive because threading is hard." That's not wrong, but it misses the more mechanical reason.&lt;/p&gt;

&lt;p&gt;Every time a traditional language (Java pre-Loom, C# pre-async everywhere, classic Python) parks a thread waiting for I/O, it pays two concrete costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack memory&lt;/strong&gt;: Default JVM thread stack is 1 MB. 40,000 threads = 40 GB of stack, most of which is unused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-switch cost&lt;/strong&gt;: When the OS swaps the thread, it traps into the kernel, saves the full register set, swaps page tables if there's an address-space change, flushes TLB entries, and walks the scheduler's runqueue. Measured on modern x86, that's &lt;strong&gt;1–5 microseconds per switch&lt;/strong&gt;, plus the less visible cost of instruction-cache pollution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Multiply that by tens of thousands of waiters and you're paying the kernel a rent that has nothing to do with your actual workload.&lt;/p&gt;
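&lt;p&gt;You can feel that rent from inside Go itself. The sketch below (mine, not from any benchmark suite) bounces a token between two goroutines twice: once with each goroutine pinned to its own OS thread via &lt;code&gt;runtime.LockOSThread&lt;/code&gt;, so every handoff has to wake a parked thread through the kernel, and once unpinned, where the handoff stays in the runtime. Absolute numbers vary by machine and OS; the gap between the two is the point.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pingPong bounces a token between two goroutines n times and returns
// the average one-way handoff latency. With lockThreads set, each
// goroutine is pinned to its own OS thread, so every handoff must wake
// a parked thread through the kernel scheduler.
func pingPong(n int, lockThreads bool) time.Duration {
	ping, pong := make(chan struct{}), make(chan struct{})
	done := make(chan time.Duration)

	go func() {
		if lockThreads {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		for i := 0; i < n; i++ {
			<-ping
			pong <- struct{}{}
		}
	}()

	go func() {
		if lockThreads {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		start := time.Now()
		for i := 0; i < n; i++ {
			ping <- struct{}{}
			<-pong
		}
		done <- time.Since(start) / time.Duration(2*n)
	}()

	return <-done
}

func main() {
	const n = 100_000
	fmt.Printf("goroutine handoff: %v\n", pingPong(n, false))
	fmt.Printf("OS-thread handoff: %v\n", pingPong(n, true))
}
```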

&lt;h2&gt;
  
  
  What Go Does Instead
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" alt="Java: one 1 MB-stack thread per connection, scheduled by the kernel — versus Go: 2 KB goroutines multiplexed by the user-space runtime onto a small pool of OS threads" width="1545" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go's concurrency is built on an &lt;strong&gt;M:N scheduler&lt;/strong&gt;. You have many goroutines (N) multiplexed onto a small number of OS threads (M, typically &lt;code&gt;GOMAXPROCS&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Here's the part that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A goroutine starts with a &lt;strong&gt;2 KB stack&lt;/strong&gt;, not a megabyte. Growth is copy-and-resize in user space, triggered by the function prologue when it detects a near-overflow.&lt;/li&gt;
&lt;li&gt;Switching between goroutines happens &lt;strong&gt;entirely in the Go runtime&lt;/strong&gt;. No syscall. No TLB flush. No register-set save-and-restore at OS cost. Roughly a couple hundred nanoseconds in microbenchmarks — an order of magnitude cheaper than an OS-level context switch. The exact number moves around with workload, scheduler contention, and Go version; what's stable is the order of magnitude.&lt;/li&gt;
&lt;li&gt;When a goroutine blocks on network I/O, the runtime parks it and flips the underlying OS thread to run a different goroutine. The goroutine's state lives in Go's own scheduler, not in a kernel wait queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the actual answer to "why Go scales to millions of connections": &lt;strong&gt;the runtime refuses to hand idle work back to the kernel&lt;/strong&gt;. The kernel still does the real I/O — Go uses &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on BSD, IOCP on Windows — but it only involves the kernel when there's &lt;em&gt;actual&lt;/em&gt; work, not when a goroutine is just sitting around.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Small Benchmark That Tells the Whole Story
&lt;/h2&gt;

&lt;p&gt;Here's a stripped-down Go program that spins up N goroutines, each blocked on a channel read, and prints the total RSS once they're all parked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"runtime"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
    &lt;span class="s"&gt;"syscall"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sscanf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="c"&gt;// park forever&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Let the runtime settle&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rusage&lt;/span&gt;
    &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getrusage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUSAGE_SELF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"goroutines=%d  rss=%d KB  (%.1f KB/goroutine)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my laptop (M1, Go 1.22, macOS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goroutines=10000    rss=28672 KB   (2.9 KB/goroutine)
goroutines=100000   rss=263168 KB  (2.6 KB/goroutine)
goroutines=1000000  rss=2600960 KB (2.6 KB/goroutine)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
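&lt;p&gt;One portability caveat on the program above: &lt;code&gt;Rusage.Maxrss&lt;/code&gt; is reported in kilobytes on Linux but in bytes on macOS, so the &lt;code&gt;KB&lt;/code&gt; label only holds on Linux. A variant using the runtime's own stack accounting sidesteps the OS entirely — a sketch, noting that &lt;code&gt;MemStats.StackSys&lt;/code&gt; measures reserved stack spans rather than resident memory:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackPerGoroutine parks n goroutines and reports the runtime's own
// stack accounting per goroutine, in KiB, via MemStats.StackSys —
// portable, unlike Maxrss (kilobytes on Linux, bytes on macOS).
func stackPerGoroutine(n int) float64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	var started sync.WaitGroup
	started.Add(n)
	block := make(chan struct{}) // never closed: goroutines stay parked

	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-block
		}()
	}
	started.Wait() // every goroutine has been scheduled at least once

	runtime.ReadMemStats(&after)
	return float64(after.StackSys-before.StackSys) / float64(n) / 1024
}

func main() {
	fmt.Printf("stack ≈ %.1f KiB/goroutine\n", stackPerGoroutine(100_000))
}
```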



&lt;p&gt;&lt;strong&gt;2.6 KB per parked goroutine&lt;/strong&gt;, flat, all the way to a million. That's the story. Not 1 MB. Not 256 KB. Two and a half KB.&lt;/p&gt;

&lt;p&gt;Try the equivalent program with &lt;code&gt;new Thread(() -&amp;gt; ...).start()&lt;/code&gt; in Java and you will run out of memory well before 100,000. The comparison isn't even close, and it isn't about execution speed — it's about what an idle waiter costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel in Finance: Same Problem, Opposite Extreme
&lt;/h2&gt;

&lt;p&gt;The part that made this click for me is noticing where else this principle shows up. High-frequency trading engines and exchange colocation boxes have the same bottleneck — kernel context switches are expensive — and they solve it the other way: &lt;strong&gt;skip the kernel entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DPDK&lt;/strong&gt; gives userspace direct access to the NIC. Packets bypass the kernel network stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel-bypass sockets&lt;/strong&gt; (Solarflare Onload, AWS Nitro enhanced networking) push the TCP/IP stack into userspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io_uring&lt;/strong&gt; on modern Linux brings the same idea to general-purpose code — a shared memory ring buffer between app and kernel, batched, with minimal syscalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDMA&lt;/strong&gt; lets network cards write directly into another machine's memory. No kernel on either end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tools, same target: &lt;strong&gt;syscalls and context switches are expensive; keep them off the hot path&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Go arrives at the same destination with a completely different route. Instead of bypassing the kernel, it hides the kernel behind a user-space scheduler and only calls in when absolutely necessary. HFT says "the kernel is slow, route around it." Go says "the kernel is slow, so we'll handle most of the state ourselves and only ring the kernel's doorbell when we have real work." The principle is identical.&lt;/p&gt;

&lt;p&gt;Once you see this pattern, you start seeing it everywhere. V8 Isolates. Erlang processes. Rust async runtimes. The details differ but the bet is the same: &lt;strong&gt;keep concurrency cheap by keeping it out of the kernel&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Go Actually Breaks Under Load
&lt;/h2&gt;

&lt;p&gt;None of this means Go scales forever. When I've seen Go services crack at scale, it's usually not the runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File descriptors&lt;/strong&gt;: Default &lt;code&gt;ulimit -n&lt;/code&gt; is 1024 on most systems. You'll hit this before you stress the scheduler. Push it to 1M if you're actually building a long-poll service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral ports&lt;/strong&gt;: If your service fans out to a downstream with lots of short-lived outbound connections, the 28K-ish default ephemeral port range bites before anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conntrack tables&lt;/strong&gt;: Linux's &lt;code&gt;nf_conntrack_max&lt;/code&gt; default is laughably small for a real service. Tune it or turn it off on high-throughput paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC pressure from allocation-heavy handlers&lt;/strong&gt;: The scheduler is cheap; the garbage collector is not. &lt;code&gt;sync.Pool&lt;/code&gt; reuse, stack-allocated buffers, and careful escape analysis still matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The load balancer&lt;/strong&gt;: Your L4/L7 LB probably caps out before Go does.&lt;/li&gt;
&lt;/ul&gt;
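&lt;p&gt;On the GC-pressure point, the standard mitigation is recycling per-request scratch space through &lt;code&gt;sync.Pool&lt;/code&gt;. A minimal sketch — the handler here is made up for illustration:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles per-request scratch buffers so an allocation-heavy
// handler stops feeding the garbage collector a fresh buffer per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle is a made-up handler standing in for any hot-path function
// that needs temporary buffer space.
func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // pooled buffers keep capacity; length must be cleared
	defer bufPool.Put(buf)

	buf.WriteString("processed: ")
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("order-42")) // buffer comes from, and returns to, the pool
}
```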

&lt;p&gt;I've watched a Go service sit happily at 400K connections on a single pod while the upstream Envoy bled under its own CPU budget. The Go process was the calm one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Isn't a Speed Contest
&lt;/h2&gt;

&lt;p&gt;It's a cost-of-idleness contest.&lt;/p&gt;

&lt;p&gt;If you're building anything with long-lived connections — streaming APIs, WebSocket fan-out, server-sent events, message brokers, pub/sub gateways, anything with more connections than cores — the question isn't "is my language fast?" It's "&lt;strong&gt;how much does one idle waiter cost me?&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;Go's answer is 2.6 KB and 200 nanoseconds. That's why it scales.&lt;/p&gt;

&lt;p&gt;If you come from a world where "high concurrency" means "we bought a bigger box," Go can feel like cheating. It isn't. It's just a careful, decade-old design decision that says: the kernel is a system call you should make as rarely as possible, and when you must, do it in bulk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://golang.org/src/runtime/HACKING.md" rel="noopener noreferrer"&gt;The Go Scheduler: Design Principles (Dmitry Vyukov)&lt;/a&gt; — runtime internals from a core contributor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime/proc.go&lt;/code&gt; in the Go source tree — the actual M/P/G logic, shorter and more readable than you'd expect&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://people.freebsd.org/~jlemon/papers/kqueue.pdf" rel="noopener noreferrer"&gt;Dragonfly BSD's &lt;code&gt;kqueue&lt;/code&gt; paper&lt;/a&gt; — where &lt;code&gt;epoll&lt;/code&gt; got many of its ideas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kernel.dk/io_uring.pdf" rel="noopener noreferrer"&gt;io_uring introduction (Jens Axboe)&lt;/a&gt; — the modern-kernel answer to the same problem Go solved in user space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to understand why the decade-old Go scheduler still holds up, read &lt;code&gt;runtime/proc.go&lt;/code&gt; once. The comments alone are worth an afternoon.&lt;/p&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:40 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article replaces and deepens our earlier analysis, &lt;a href="https://harrisonsec.com/blog/claude-code-memory-simpler-than-you-think/" rel="noopener noreferrer"&gt;Claude Code's Memory Is Simpler Than You Think&lt;/a&gt;. The original focused on limitations. This one focuses on &lt;strong&gt;why&lt;/strong&gt; — the first-principles tradeoffs behind every design choice.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Principle: Only Record What Cannot Be Derived
&lt;/h2&gt;

&lt;p&gt;This single constraint governs every decision in Claude Code's memory system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't save code patterns — read the current code. Don't save git history — run &lt;code&gt;git log&lt;/code&gt;. Don't save file paths — glob the project. Don't save past bug fixes — they're in commits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't about saving storage. It's about &lt;strong&gt;preventing memory drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a memory says "auth module lives in &lt;code&gt;src/auth/&lt;/code&gt;", one refactor makes that memory a lie. But the model doesn't know it's a lie — it trusts specific references by default. A stale memory is worse than no memory at all, because the model acts on it with confidence.&lt;/p&gt;

&lt;p&gt;Code is self-describing. The source of truth is always the current state of the project, not a snapshot from three weeks ago. Memory should store &lt;strong&gt;meta-information&lt;/strong&gt; — who the user is, what they prefer, what decisions were made and why — not facts that the codebase already expresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Types, Closed Taxonomy
&lt;/h2&gt;

&lt;p&gt;Claude Code enforces exactly four memory types. Not tags. Not categories. Four types with hard boundaries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What to Store&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;user&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity, preferences, expertise&lt;/td&gt;
&lt;td&gt;"Data scientist, focused on observability"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Behavioral corrections AND confirmations&lt;/td&gt;
&lt;td&gt;"Don't summarize after code changes — user reads diffs"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decisions, deadlines, stakeholder context&lt;/td&gt;
&lt;td&gt;"Merge freeze after 2026-03-05 for mobile release"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;reference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pointers to external systems&lt;/td&gt;
&lt;td&gt;"Pipeline bugs tracked in Linear INGEST project"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why closed taxonomy beats open tagging:&lt;/strong&gt; Free-form tags cause label explosion. A model tagging memories freely might produce "coding-style", "code-style", "style-preference", "formatting" — four labels for the same concept. Closed taxonomy forces an explicit semantic choice. Each type has different storage structure (feedback requires &lt;code&gt;Why&lt;/code&gt; + &lt;code&gt;How to apply&lt;/code&gt; fields) and different retrieval behavior. The constraint buys clarity.&lt;/p&gt;
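&lt;p&gt;In code, a closed taxonomy is just an enum plus validation. A sketch of the idea — the field names here are illustrative, not Claude Code's actual schema:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// MemoryType is a closed set: adding a label means editing this file,
// not inventing a tag at runtime.
type MemoryType string

const (
	TypeUser      MemoryType = "user"
	TypeFeedback  MemoryType = "feedback"
	TypeProject   MemoryType = "project"
	TypeReference MemoryType = "reference"
)

// Memory is a hypothetical record shape for illustration.
type Memory struct {
	Type        MemoryType
	Description string
	Why         string // required for feedback
	HowToApply  string // required for feedback
}

// Validate rejects open-ended tags and enforces per-type fields.
func (m Memory) Validate() error {
	switch m.Type {
	case TypeUser, TypeProject, TypeReference:
		return nil
	case TypeFeedback:
		if m.Why == "" || m.HowToApply == "" {
			return errors.New("feedback memories need Why and How-to-apply")
		}
		return nil
	default:
		return fmt.Errorf("unknown memory type %q", m.Type)
	}
}

func main() {
	bad := Memory{Type: "coding-style", Description: "tabs not spaces"}
	fmt.Println(bad.Validate()) // rejected: "coding-style" is not in the taxonomy
}
```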

&lt;h3&gt;
  
  
  Why Positive Feedback Matters More Than Corrections
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;feedback&lt;/code&gt; type stores both failures AND successes. The source code explains why:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine the user says "this code style is great, keep doing this." If you don't save that, next session the model might "improve" the style — moving away from what the user explicitly liked. Positive feedback &lt;strong&gt;anchors&lt;/strong&gt; the model to known-good patterns. Without anchors, corrections alone push the model toward progressively safer (blander) output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Type: Relative Dates Kill You
&lt;/h3&gt;

&lt;p&gt;When a user says "merge freeze after Thursday", the memory must store "merge freeze after 2026-03-05." A memory read three weeks later has no idea what "Thursday" meant. This seems obvious, but it's an explicit rule in the source code because models default to storing user language verbatim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sonnet Side-Query Instead of Vector Embeddings
&lt;/h2&gt;

&lt;p&gt;This is the design choice that draws the most criticism. Claude Code uses a live LLM call (Sonnet) to pick relevant memories instead of vector similarity search. Here's the actual tradeoff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" alt="flowchart LR" width="1704" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet reads descriptions (not full content), evaluates semantic relevance, and returns up to 5 filenames. The call costs ~250ms and 256 output tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this over vector embeddings:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Sonnet Side-Query&lt;/th&gt;
&lt;th&gt;Vector Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic depth&lt;/td&gt;
&lt;td&gt;Full language understanding — "deployment" matches "CI/CD"&lt;/td&gt;
&lt;td&gt;Cosine similarity — good but shallow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Zero — one API call&lt;/td&gt;
&lt;td&gt;Requires embedding model + vector store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transparency&lt;/td&gt;
&lt;td&gt;Can inspect WHY a memory was selected&lt;/td&gt;
&lt;td&gt;Opaque similarity scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per query&lt;/td&gt;
&lt;td&gt;~250ms + 256 tokens (shared prompt cache)&lt;/td&gt;
&lt;td&gt;Embedding call + search latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Degrades past ~200 files&lt;/td&gt;
&lt;td&gt;Scales to millions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff is deliberate: for a &lt;strong&gt;session-based CLI tool&lt;/strong&gt; where users typically have 20-100 memories, Sonnet's semantic understanding beats vector search's scale. The 250ms latency is hidden entirely through &lt;strong&gt;async prefetch&lt;/strong&gt; — the search runs in parallel while the model generates its response. For the user, memory recall is "free."&lt;/p&gt;
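&lt;p&gt;The async-prefetch trick is plain Go-style concurrency: start the recall immediately, block only when the result is needed. A sketch with a stubbed search — &lt;code&gt;searchMemories&lt;/code&gt; here is a stand-in for the Sonnet call, not a real API:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// searchMemories stands in for the ~250ms side-query.
func searchMemories(query string) []string {
	time.Sleep(100 * time.Millisecond) // simulate selector latency
	return []string{"merge-freeze.md"}
}

// prefetchRecall starts the search immediately, runs generate in the
// foreground, and blocks only when the result is actually needed —
// so recall latency hides behind generation latency.
func prefetchRecall(query string, generate func()) []string {
	recalled := make(chan []string, 1)
	go func() { recalled <- searchMemories(query) }() // prefetch
	generate()        // response generation overlaps the side-query
	return <-recalled // usually already resolved: recall feels "free"
}

func main() {
	start := time.Now()
	memories := prefetchRecall("can I merge on Friday?", func() {
		time.Sleep(100 * time.Millisecond) // pretend to stream a response
	})
	fmt.Printf("recalled %v in %v (overlapped, not serialized)\n",
		memories, time.Since(start).Round(time.Millisecond))
}
```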

&lt;h3&gt;
  
  
  The 5-File Cap: Constraint as Design
&lt;/h3&gt;

&lt;p&gt;Why limit to 5 memories when a user might have 200?&lt;/p&gt;

&lt;p&gt;This is not a technical limitation. It's a &lt;strong&gt;behavioral nudge&lt;/strong&gt;. If the system scaled to inject 50 memories, users would never clean up stale ones. The 5-file cap pushes users to write better descriptions (so the right memories get selected) and consolidate outdated entries (so slots aren't wasted on stale info).&lt;/p&gt;

&lt;p&gt;Design principle: &lt;strong&gt;constraints that change user behavior beat constraints that scale infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background Extraction: The Invisible Agent
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't just save memories when you say &lt;code&gt;/remember&lt;/code&gt;. After every conversation turn where the main agent stops (no more tool calls), a &lt;strong&gt;forked background agent&lt;/strong&gt; runs to extract memorable information.&lt;/p&gt;

&lt;p&gt;Key design details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutual exclusion&lt;/strong&gt;: If the main agent already wrote a memory in this turn, the extractor skips. No duplicate memories from the same conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trailing runs&lt;/strong&gt;: If extraction is still running when the next turn ends, the new request queues as &lt;code&gt;pendingContext&lt;/code&gt;. When the current extraction finishes, it picks up the pending work. No concurrent writes to the memory directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-turn hard deadline&lt;/strong&gt;: The extractor gets at most 5 tool-call turns. Efficiency over completeness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal permissions&lt;/strong&gt;: Read/Grep/Glob unlimited. Write &lt;strong&gt;only&lt;/strong&gt; to the memory directory. Cannot modify project files, execute code, or call external services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared prompt cache&lt;/strong&gt;: The forked agent reuses the parent's cached system prompt — near-zero additional token overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The execution strategy is prescribed in the prompt: "Turn 1: parallel reads of all existing memories. Turn 2: parallel writes of new memories." Two turns for the common case. The 5-turn budget handles edge cases.&lt;/p&gt;
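&lt;p&gt;The mutual-exclusion and trailing-run rules amount to a tiny state machine. A minimal sketch, assuming &lt;code&gt;pendingContext&lt;/code&gt; is a single slot that a newer turn overwrites (the class and method names are hypothetical):&lt;/p&gt;

```typescript
// Hypothetical sketch of the background extractor's scheduling rules:
// one extraction at a time, with a single pendingContext slot.
class ExtractorQueue {
  private running = false;
  private pendingContext: string | null = null;
  readonly runs: string[] = []; // completed extractions, for illustration

  // Called when the main agent stops making tool calls for this turn.
  onTurnEnd(context: string): void {
    if (this.running) {
      this.pendingContext = context; // queue; never spawn a second worker
      return;
    }
    this.running = true;
    this.runs.push(context); // stand-in for forking the extraction agent
  }

  // Called when the in-flight extraction finishes.
  onExtractionDone(): void {
    const next = this.pendingContext;
    this.pendingContext = null;
    if (next !== null) {
      this.runs.push(next); // trailing run picks up the queued work
    } else {
      this.running = false;
    }
  }
}
```

&lt;p&gt;The invariant that matters: at no point do two extractions — and therefore two writers to the memory directory — run concurrently.&lt;/p&gt;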

&lt;h2&gt;
  
  
  Trust but Verify: The Eval That Proved It
&lt;/h2&gt;

&lt;p&gt;The most impactful section in Claude Code's memory prompt is &lt;code&gt;TRUSTING_RECALL_SECTION&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A memory that names a specific function, file, or flag is a claim that it existed &lt;em&gt;when the memory was written&lt;/em&gt;. It may have been renamed, removed, or never merged."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rule: before acting on a memory that references a file path, verify the file exists (Glob). Before trusting a function name, confirm it's still there (Grep).&lt;/p&gt;

&lt;p&gt;This section's value was proven empirically: &lt;strong&gt;without it, eval pass rate was 0/2. With it, 3/3.&lt;/strong&gt; Models default to trusting specific references in memory. They'll confidently say "as stored in memory, the auth module is at &lt;code&gt;src/auth/&lt;/code&gt;" — even when that path was renamed weeks ago. The verification requirement breaks this default behavior.&lt;/p&gt;
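&lt;p&gt;Outside the agent loop, the same rule is easy to express with plain filesystem checks standing in for Glob and Grep. A hedged sketch (the claim shape and helper are hypothetical, not the actual prompt machinery):&lt;/p&gt;

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical verify-before-trust gate: a memory naming a path or symbol
// is a claim about the past, and must be re-checked against the present.
interface MemoryClaim {
  filePath?: string;   // e.g. a path the memory asserts exists
  symbolName?: string; // e.g. a function the memory asserts is defined there
}

function verifyClaim(claim: MemoryClaim): boolean {
  if (claim.filePath !== undefined) {
    if (!existsSync(claim.filePath)) return false; // stand-in for Glob
    if (claim.symbolName !== undefined) {
      const source = readFileSync(claim.filePath, "utf8");
      return source.includes(claim.symbolName);    // stand-in for Grep
    }
  }
  return true; // a claim with no concrete references has nothing to verify
}
```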

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" alt="Three Architectures, Three Tradeoffs" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architectures, Three Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; a ranking. I'm using OpenClaw and Hermes as contrasts because they represent the two obvious alternative bets: scale and autonomy. Claude Code, OpenClaw, and Hermes Agent made different choices for different deployment models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Markdown files (flat)&lt;/td&gt;
&lt;td&gt;MD + SQLite (FTS + vector)&lt;/td&gt;
&lt;td&gt;SQLite + FTS + MEMORY.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet side-query (semantic)&lt;/td&gt;
&lt;td&gt;Embedding cosine + FTS fusion&lt;/td&gt;
&lt;td&gt;Full-text search + structured queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero (filesystem only)&lt;/td&gt;
&lt;td&gt;SQLite + embedding model&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full (plain text, human-readable)&lt;/td&gt;
&lt;td&gt;Partial (vector scores opaque)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (static after write)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Self-evolving (auto-generates skills)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Session-based, stateless between sessions&lt;/td&gt;
&lt;td&gt;Persistent, cross-session&lt;/td&gt;
&lt;td&gt;Persistent, self-improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200 files by design&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Claude Code's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Optimize for zero infrastructure and full transparency. Accept a scale ceiling.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For a CLI tool that runs on a developer's laptop, requiring SQLite or an embedding service is friction. Plain Markdown files are human-readable, git-trackable, and editable with any text editor. The 200-file ceiling is intentional — if you need more, you should be consolidating, not scaling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Teams with hundreds of shared memories. Long-running projects where memory accumulation outpaces cleanup. Multi-user scenarios where memory needs to be queried across team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept infrastructure overhead for persistent cross-session scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw stores memories in SQLite with both full-text search and vector embeddings. This enables fuzzy semantic matching across thousands of memories, weighted fusion of multiple retrieval signals, and persistent state that survives across sessions indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Setup complexity. Users must configure embedding models. Vector similarity scores are opaque — when the wrong memory is recalled, debugging why is harder than inspecting a Sonnet side-query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes Agent's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept complexity for a self-evolving learning loop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hermes doesn't just store memories — it generates &lt;strong&gt;skills&lt;/strong&gt; from completed tasks. After a complex task (5+ tool calls), the agent distills the entire process into a structured skill document. Next time it encounters a similar task, it loads the skill instead of solving from scratch. Skills self-iterate: if the agent finds a better approach during execution, it updates the skill automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Skill quality is unverified. A bad skill propagated through the learning loop compounds errors. The self-evolving mechanism needs guardrails that don't exist yet — there's no eval framework for auto-generated skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Right Choice Depends on Your Deployment Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session-based, single user, zero setup → Claude Code's approach
Persistent, multi-user, cross-session  → OpenClaw's approach  
Autonomous, self-improving, research    → Hermes's approach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no universal "best." The first-principles question is: &lt;strong&gt;what are you optimizing for — simplicity, scale, or autonomy?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Teaches About Agent Design
&lt;/h2&gt;

&lt;p&gt;Three principles that transfer beyond memory systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constraints that change user behavior &amp;gt; constraints that scale infrastructure.&lt;/strong&gt; The 5-file cap is more effective than unlimited vector search, because it forces better memory hygiene. Don't build capacity for a mess — design incentives for cleanliness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eval data beats intuition for prompt engineering.&lt;/strong&gt; The trust-verification section wasn't added because someone thought it was a good idea. It was added because evals went from 0/2 to 3/3. If you can't measure it, you're guessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use the model's own reasoning for retrieval when latency allows.&lt;/strong&gt; Sonnet's recognition that "deployment" relates to "CI/CD" is something no keyword match or embedding similarity can reliably replicate. When your retrieval budget allows a model call, the quality ceiling is higher than any static index.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See also: &lt;a href="https://harrisonsec.com/blog/claude-code-codex-plugin-two-brains/" rel="noopener noreferrer"&gt;Claude Code + Codex: Two Brains&lt;/a&gt; for how dual-AI workflows complement the memory system.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>memory</category>
      <category>agents</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 3: The 5-Level Compression Pipeline Behind 1M Tokens</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:22 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory Tradeoffs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Context Engineering Is the Real Moat
&lt;/h2&gt;

&lt;p&gt;Every AI agent has the same fundamental constraint: a fixed-size context window. Claude's is now up to 1M tokens. That sounds massive — until you realize a real coding session can easily generate multiples of that. Dozens of file reads, hundreds of tool calls, thousands of lines of output.&lt;/p&gt;

&lt;p&gt;The model's decision quality depends entirely on what it sees. Get that selection wrong, and it forgets which files it just edited, re-reads content it already saw, or contradicts its own earlier decisions.&lt;/p&gt;

&lt;p&gt;Think of the context window as an office desk. Limited surface area. You need the most important documents within arm's reach, everything else filed in drawers — retrievable, but not cluttering your workspace.&lt;/p&gt;

&lt;p&gt;Claude Code's context engineering is that filing system. And it's far more sophisticated than most people expect. In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we covered the 4-stage compression overview as part of the loop's survival mechanism. Here, we zoom into the internal engineering — revealing a 5th level most sessions never trigger, a dual-path algorithm that adapts to cache state, and a security blind spot in the summarizer.&lt;/p&gt;

&lt;p&gt;The compression pipeline alone lives in &lt;code&gt;src/services/compact/&lt;/code&gt; — over 3,960 lines of TypeScript across 5 files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-Level Compression Pipeline
&lt;/h2&gt;

&lt;p&gt;The design philosophy is &lt;strong&gt;progressive compression&lt;/strong&gt;: cheapest first, heaviest last. Each level is more expensive than the previous one — consuming more compute or discarding more context detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" alt="The 5-Level Compression Pipeline" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" alt="flowchart TD" width="276" height="950"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most conversations never reach Level 5. That's the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Tool Result Budget (Zero Cost)
&lt;/h3&gt;

&lt;p&gt;Problem: A single &lt;code&gt;FileReadTool&lt;/code&gt; call on a 10,000-line file dumps the entire thing into context. A &lt;code&gt;BashTool&lt;/code&gt; running &lt;code&gt;find&lt;/code&gt; returns thousands of paths.&lt;/p&gt;

&lt;p&gt;Solution: When a tool result exceeds 50,000 characters (&lt;code&gt;DEFAULT_MAX_RESULT_SIZE_CHARS&lt;/code&gt;), Claude Code doesn't truncate it — it &lt;strong&gt;persists the full output to disk&lt;/strong&gt; and keeps only a 2KB preview in context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;persisted-output&amp;gt;&lt;/span&gt;
Output too large (2.3 MB). Full output saved to:
/tmp/.claude/session-xxx/tool-results/toolu_abc123.txt

Preview (first 2.0 KB):
[first 2000 bytes of content]
...
&lt;span class="nt"&gt;&amp;lt;/persisted-output&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why persist instead of truncate? Truncation means permanent loss. If the model later needs line 500 of that output — maybe that's where the bug is — it can use the &lt;code&gt;Read&lt;/code&gt; tool to access the full file from disk. The 2KB preview gives enough context to decide whether that's necessary.&lt;/p&gt;
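&lt;p&gt;The pattern is easy to replicate in your own agent. A minimal sketch of persist-don't-truncate (paths and wrapper text are illustrative; only the 50K/2KB thresholds come from the description above):&lt;/p&gt;

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Hypothetical sketch: oversized tool output is written to disk whole,
// and only a short preview plus the file path stays in context.
const MAX_RESULT_CHARS = 50_000; // cf. DEFAULT_MAX_RESULT_SIZE_CHARS
const PREVIEW_CHARS = 2_000;

function budgetToolResult(toolUseId: string, output: string): string {
  if (output.length <= MAX_RESULT_CHARS) return output; // small: pass through
  const dir = join(tmpdir(), "tool-results");
  mkdirSync(dir, { recursive: true });
  const file = join(dir, toolUseId + ".txt");
  writeFileSync(file, output); // the full output survives, retrievable later
  return [
    "<persisted-output>",
    "Output too large (" + output.length + " chars). Full output saved to:",
    file,
    "",
    "Preview (first 2.0 KB):",
    output.slice(0, PREVIEW_CHARS),
    "...",
    "</persisted-output>",
  ].join("\n");
}
```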

&lt;h3&gt;
  
  
  Level 2 — History Snip
&lt;/h3&gt;

&lt;p&gt;Think of History Snip as garbage collection for stale conversation scaffolding. If the session contains repetitive assistant wrappers, redundant bookkeeping, or older spans that no longer affect the next decision, this layer can cut them before heavier compression starts.&lt;/p&gt;

&lt;p&gt;Its real importance is accounting correctness. It feeds &lt;code&gt;snipTokensFreed&lt;/code&gt; into the autocompact threshold calculation. Without that correction, the last assistant message's &lt;code&gt;usage&lt;/code&gt; data still reflects the pre-snip context size, so autocompact can fire even after tokens were already freed.&lt;/p&gt;
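&lt;p&gt;The correction is a single subtraction, but skipping it produces false triggers. A hedged sketch (the threshold constant is illustrative, taken from the ~87% autocompact figure later in this article):&lt;/p&gt;

```typescript
// Hypothetical sketch of snip-aware threshold accounting: the last usage
// report predates the snip, so freed tokens must be subtracted first.
const AUTOCOMPACT_THRESHOLD = 0.87; // illustrative ~87% trigger point

function shouldAutocompact(
  lastReportedTokens: number, // from the last assistant message's usage
  snipTokensFreed: number,    // tokens released by History Snip since then
  contextLimit: number,
): boolean {
  const effectiveTokens = lastReportedTokens - snipTokensFreed;
  return effectiveTokens / contextLimit >= AUTOCOMPACT_THRESHOLD;
}
```

&lt;p&gt;Without the subtraction, a session at 90% that just snipped 15% of its tokens would still autocompact, destroying context for nothing.&lt;/p&gt;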

&lt;h3&gt;
  
  
  Level 3 — Microcompact (The Dual-Path Design)
&lt;/h3&gt;

&lt;p&gt;This is where it gets clever. Microcompact cleans up old tool results that are no longer useful — that file you read 30 minutes ago is probably irrelevant now, but it's still eating thousands of tokens.&lt;/p&gt;

&lt;p&gt;The twist: &lt;strong&gt;Microcompact has two completely different code paths&lt;/strong&gt;, selected based on cache state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path A — Cache Cold (Time-Based)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user was away long enough for the prompt cache to expire (default 5-minute TTL), the cache is already dead. Rebuilding is inevitable. So Microcompact goes ahead and &lt;strong&gt;directly modifies message content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// microCompact.ts — cold path&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[Old tool result content cleared]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, brutal, effective: keep only the N most recent compactable tool results and replace everything else with a placeholder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B — Cache Hot (Cache-Editing)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user is actively chatting and the prompt cache is warm — holding 100K+ tokens of cached prefix — directly modifying messages would &lt;strong&gt;invalidate the entire cache&lt;/strong&gt;. That's a massive cost hit.&lt;/p&gt;

&lt;p&gt;Instead, the hot path uses an API-level mechanism called &lt;code&gt;cache_edits&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag tool result blocks with &lt;code&gt;cache_reference: tool_use_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Construct &lt;code&gt;cache_edits&lt;/code&gt; blocks telling the server to delete those references in-place&lt;/li&gt;
&lt;li&gt;Server-side deletion preserves cache warmth — no client re-upload needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The messages themselves are returned &lt;strong&gt;unchanged&lt;/strong&gt;. The edit happens at the API layer, invisible to the local conversation state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Time-Based (Cold)&lt;/th&gt;
&lt;th&gt;Cache-Edit (Hot)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time gap exceeds threshold&lt;/td&gt;
&lt;td&gt;Tool count exceeds threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct message modification&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cache_edits&lt;/code&gt; API blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache rebuilds anyway&lt;/td&gt;
&lt;td&gt;Preserves 100K+ cached prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero (edits piggyback on next request)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two paths are mutually exclusive. Time-based takes priority — if the cache is already cold, using &lt;code&gt;cache_edits&lt;/code&gt; is pointless.&lt;/p&gt;
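&lt;p&gt;The selection logic reduces to an ordered check. A minimal sketch (names and the threshold parameter are hypothetical; only the 5-minute TTL and the check order come from the description above):&lt;/p&gt;

```typescript
// Hypothetical sketch of Microcompact's mutually exclusive path selection.
// Cold is checked first: cache_edits buys nothing once the cache expired.
const CACHE_TTL_MS = 5 * 60 * 1000; // default 5-minute prompt-cache TTL

type MicrocompactPath = "time-based" | "cache-edit" | "none";

function choosePath(
  msSinceLastRequest: number,
  compactableToolResults: number,
  toolCountThreshold: number,
): MicrocompactPath {
  if (msSinceLastRequest >= CACHE_TTL_MS) return "time-based"; // cache cold
  if (compactableToolResults >= toolCountThreshold) return "cache-edit";
  return "none";
}
```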

&lt;h3&gt;
  
  
  Level 4 — Context Collapse (Non-Destructive)
&lt;/h3&gt;

&lt;p&gt;Think of this as a database &lt;strong&gt;View&lt;/strong&gt; — the underlying table (message array) stays unchanged, but queries (API requests) see a filtered, summarized projection.&lt;/p&gt;

&lt;p&gt;Context Collapse triggers at ~90% utilization. Unlike autocompact, it's &lt;strong&gt;reversible&lt;/strong&gt; — original messages are never deleted, and the collapse can be rolled back if needed. The summaries live in a separate collapse store, and &lt;code&gt;projectView()&lt;/code&gt; overlays them onto the original messages at query time.&lt;/p&gt;

&lt;p&gt;Critical interaction: when Context Collapse is active, &lt;strong&gt;Autocompact is suppressed&lt;/strong&gt;. Both compete for the same token space — autocompact at ~87%, collapse at ~90% — and autocompact would destroy the fine-grained context that collapse is trying to preserve.&lt;/p&gt;
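&lt;p&gt;The View analogy maps directly onto code. A hedged sketch of projection-based folding (the real &lt;code&gt;projectView()&lt;/code&gt; signature isn't public; this only illustrates the overlay idea):&lt;/p&gt;

```typescript
// Hypothetical sketch: originals are never mutated. The collapse store is
// overlaid at query time, so rollback is simply "stop overlaying".
interface Message {
  id: string;
  content: string;
}

function projectView(
  messages: Message[],
  collapseStore: Map<string, string>, // message id -> collapsed summary
): Message[] {
  return messages.map((m) => {
    const summary = collapseStore.get(m.id);
    return summary === undefined ? m : { ...m, content: summary };
  });
}
```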

&lt;h3&gt;
  
  
  Level 5 — Autocompact (The Last Resort)
&lt;/h3&gt;

&lt;p&gt;When everything else fails to keep tokens under control, the system forks a child agent to summarize the entire conversation. This is expensive and irreversible.&lt;/p&gt;

&lt;p&gt;The compression prompt uses a two-phase &lt;strong&gt;Chain-of-Thought Scratchpad&lt;/strong&gt; technique:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — the model walks through every message chronologically: user intent, approaches taken, key decisions, filenames, code snippets, errors, fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — a structured summary with 9 standardized sections (Primary Request, Key Technical Concepts, Files and Code, Errors and Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work, Optional Next Step)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical design: &lt;code&gt;formatCompactSummary()&lt;/code&gt; &lt;strong&gt;strips the &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; and keeps only the &lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt;. Chain-of-thought reasoning improves summary quality dramatically, but the reasoning itself would waste tokens if kept in context. Discard the work, keep the conclusion.&lt;/p&gt;
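&lt;p&gt;"Discard the work, keep the conclusion" is a two-line transform. A hedged sketch (the real &lt;code&gt;formatCompactSummary()&lt;/code&gt; is more involved; this shows only the core idea):&lt;/p&gt;

```typescript
// Hypothetical sketch: drop the chain-of-thought analysis block and keep
// only the structured summary that follows it.
function stripAnalysisBlock(raw: string): string {
  const match = raw.match(/<summary>([\s\S]*?)<\/summary>/);
  return match === null ? raw.trim() : match[1].trim();
}
```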

&lt;p&gt;&lt;strong&gt;Post-Compression Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autocompact's biggest risk: the model "forgets" files it just edited. The system automatically runs &lt;code&gt;runPostCompactCleanup()&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restore last 5 recently-read files (≤5K tokens each)&lt;/li&gt;
&lt;li&gt;Restore all activated skills (≤25K tokens total)&lt;/li&gt;
&lt;li&gt;Re-announce deferred tools, agent lists, MCP directives&lt;/li&gt;
&lt;li&gt;Reset Context Collapse state&lt;/li&gt;
&lt;li&gt;Restore Plan mode state if active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this recovery step, the model would start re-reading files it just edited — or worse, make contradictory changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Circuit Breaker Story&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On March 10, 2026, Anthropic's telemetry showed 1,279 sessions with 50+ consecutive autocompact failures. The worst session hit 3,272 consecutive failures. Globally, this wasted approximately 250,000 API calls per day.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we mentioned the circuit breaker as a single boolean (&lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt;). Here's the production story behind it.&lt;/p&gt;

&lt;p&gt;The fix came down to a single constant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 3 consecutive failures, stop trying. The context is irrecoverably over-limit — burning more API calls won't help. This is a textbook circuit breaker: detect a failure loop, break it early, fail gracefully.&lt;/p&gt;
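&lt;p&gt;The full pattern is barely longer than the constant. A minimal sketch of the breaker (class and method names hypothetical):&lt;/p&gt;

```typescript
// Hypothetical sketch of the autocompact circuit breaker: count consecutive
// failures, open after the limit, reset on any success.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

class CompactBreaker {
  private consecutiveFailures = 0;

  shouldAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }

  record(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }
}
```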

&lt;p&gt;Three adjacent systems make this pipeline viable in production: accurate token estimation, prompt-cache boundaries, and the summarizer's security assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Estimation Without API Calls
&lt;/h2&gt;

&lt;p&gt;Most agents estimate context size by counting tokens on the client. This typically has 30%+ error — enough to trigger compression too early or too late.&lt;/p&gt;

&lt;p&gt;Claude Code uses a smarter approach. Think of it as a morning weigh-in: you step on the scale at 75kg, then eat lunch. You don't need the scale again — estimating 75.5kg is good enough.&lt;/p&gt;

&lt;p&gt;The "scale" is the &lt;code&gt;usage&lt;/code&gt; data returned by every API response — server-side precise token counts. The "lunch" is the few messages added since then.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;tokenCountWithEstimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Find the most recent message with server-reported usage&lt;/span&gt;
  &lt;span class="c1"&gt;// Use that as the anchor point&lt;/span&gt;
  &lt;span class="c1"&gt;// Estimate only the delta (new messages since anchor)&lt;/span&gt;
  &lt;span class="c1"&gt;// Result: &amp;lt;5% error vs 30%+ from pure client estimation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the need for tokenizer API calls while maintaining accuracy that's good enough for compression timing decisions.&lt;/p&gt;
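&lt;p&gt;The anchor-plus-delta idea in runnable form (the chars/4 heuristic and the message shape are assumptions for illustration, not the actual implementation):&lt;/p&gt;

```typescript
// Hypothetical sketch of anchor-plus-delta token estimation: trust the last
// server-reported usage and only estimate the messages added since then.
interface TrackedMessage {
  text: string;
  reportedTokens?: number; // present if the server returned usage for it
}

function estimateContextTokens(messages: TrackedMessage[]): number {
  let anchorIndex = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].reportedTokens !== undefined) {
      anchorIndex = i; // most recent precise count wins
      break;
    }
  }
  const anchor = anchorIndex >= 0 ? messages[anchorIndex].reportedTokens ?? 0 : 0;
  let delta = 0;
  for (let i = anchorIndex + 1; i < messages.length; i++) {
    delta += Math.ceil(messages[i].text.length / 4); // rough chars/4 heuristic
  }
  return anchor + delta;
}
```

&lt;p&gt;The estimation error is confined to the short un-reported tail, which is why the approach stays accurate without ever calling a tokenizer.&lt;/p&gt;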

&lt;h2&gt;
  
  
  The Prompt Cache Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's system prompt can be 50-100K tokens. Without caching, every API call would re-process this from scratch.&lt;/p&gt;

&lt;p&gt;The key innovation: &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; — a sentinel string that splits the system prompt into static and dynamic halves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before the boundary&lt;/strong&gt;: core instructions, tool descriptions, security rules — identical for ALL users globally → cached with &lt;code&gt;scope: 'global'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After the boundary&lt;/strong&gt;: MCP tool instructions, output preferences, language settings — varies per user → not cached globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means millions of Claude Code users &lt;strong&gt;share the same cached system prompt prefix&lt;/strong&gt;. One cache hit saves compute for everyone. But change one byte before the boundary, and the global cache breaks for all users.&lt;/p&gt;

&lt;p&gt;To protect this, Claude Code implements &lt;strong&gt;sticky-on latching&lt;/strong&gt; for beta headers: once a header is sent in a session, it persists for all subsequent requests — even if the feature flag is turned off mid-session. Flexibility sacrificed for cache stability.&lt;/p&gt;
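&lt;p&gt;The boundary mechanism itself is a string split. A hedged sketch (the sentinel value here is a placeholder; the real one isn't public):&lt;/p&gt;

```typescript
// Hypothetical sketch of the static/dynamic system-prompt split.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "---DYNAMIC-BOUNDARY---"; // placeholder value

function splitSystemPrompt(prompt: string) {
  const i = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (i < 0) return { staticPart: prompt, dynamicPart: "" };
  return {
    // Identical for all users, so cacheable with scope: 'global'.
    staticPart: prompt.slice(0, i),
    // Per-user: MCP instructions, output preferences, language settings.
    dynamicPart: prompt.slice(i + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
  };
}
```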

&lt;h2&gt;
  
  
  The Security Blind Spot
&lt;/h2&gt;

&lt;p&gt;Here's something the compression pipeline gets wrong: &lt;strong&gt;it treats all content equally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The autocompact summarizer processes user instructions and tool results through the same pipeline. If an attacker plants malicious instructions inside a project file — and the model reads that file — those instructions survive compression. They become part of the summary, indistinguishable from legitimate context.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; scratchpad that makes summaries so good also faithfully preserves injected instructions. There's no classification step that distinguishes "user said this" from "this was in a file the model read."&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;truncateHeadForPTLRetry()&lt;/code&gt; reveals another edge: when the conversation is so long that the compression request itself triggers a Prompt-Too-Long error, the system recursively drops the oldest turns to make the compression fit. An attacker could craft inputs that survive this truncation — instructions placed strategically in the middle of conversations, not at the edges.&lt;/p&gt;
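&lt;p&gt;The retry mechanism is worth seeing concretely, because it explains why middle placement survives. A hedged sketch of head truncation (the real &lt;code&gt;truncateHeadForPTLRetry()&lt;/code&gt; signature isn't public):&lt;/p&gt;

```typescript
// Hypothetical sketch: on a Prompt-Too-Long error, drop the oldest turns
// until the compression request fits. Content in the middle or tail of the
// conversation survives every iteration of this loop.
function truncateHead(messages: string[], fits: (m: string[]) => boolean): string[] {
  let kept = messages;
  while (kept.length > 1 && !fits(kept)) {
    kept = kept.slice(1); // always sacrifice the head, never the middle
  }
  return kept;
}
```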

&lt;h2&gt;
  
  
  Three Designs Worth Stealing
&lt;/h2&gt;

&lt;p&gt;If you're building your own agent, these patterns transfer directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Progressive compression (cheapest first)&lt;/strong&gt; — Don't jump to expensive summarization. Try zero-cost approaches first. Most sessions will never need the heavy option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache-aware dual paths&lt;/strong&gt; — Let infrastructure state drive algorithm selection. When cache is cold, optimize for simplicity. When cache is hot, optimize for preservation. Same goal, different strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit breakers on automated recovery&lt;/strong&gt; — Never let a fix become a new failure mode. If compression fails 3 times, it will fail a 4th time. Stop. The 250K API calls wasted per day before this fix landed are a cautionary tale for any self-healing system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory — First-Principles Tradeoffs in Agent Persistence&lt;/a&gt; — why Anthropic chose Markdown files over vector databases, and when that's the wrong call.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>contextengineering</category>
      <category>agents</category>
      <category>compression</category>
    </item>
    <item>
      <title>Claude Code + Codex Plugin: Two AI Brains, One Terminal</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:47:24 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</guid>
      <description>&lt;p&gt;You're debugging a gnarly race condition. Claude Code has been going at it for 10 minutes — reading files, forming theories, running tests. Then it hits a wall. Same hypothesis, same failed fix, third attempt.&lt;/p&gt;

&lt;p&gt;What if you could call in a second brain — a completely different model with fresh eyes — without leaving your terminal?&lt;/p&gt;

&lt;p&gt;That's what the &lt;strong&gt;Codex plugin for Claude Code&lt;/strong&gt; does. It puts OpenAI's Codex (powered by GPT-5.4) inside your Claude Code session as a callable rescue agent. Two models. Two reasoning styles. One shared codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is It, Exactly?
&lt;/h2&gt;

&lt;p&gt;The Codex plugin is a &lt;strong&gt;Claude Code plugin&lt;/strong&gt; — not a standalone tool. It lives inside your Claude Code session and gives you slash commands to dispatch tasks to OpenAI's Codex CLI.&lt;/p&gt;

&lt;p&gt;Think of it as a second engineer sitting next to you. Claude (Opus) is your primary — it has the full conversation context, knows your project, runs your tools. Codex is your specialist — you hand it a focused task, it works in a sandboxed environment, and returns results.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;they don't compete. They complement.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude sees the big picture. It orchestrates, reads files, runs tools, manages state.&lt;/li&gt;
&lt;li&gt;Codex gets a sharp, scoped task. It reasons deeply on that one problem and comes back with an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup: 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Install the Codex CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Authenticate
&lt;/h3&gt;

&lt;p&gt;Inside Claude Code, type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!codex login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a browser for OpenAI authentication. Once done, your token is stored locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Verify
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will check that the Codex CLI is installed, authenticated, and ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" alt="Codex setup — ready, authenticated, review gate available" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Commands
&lt;/h2&gt;

&lt;p&gt;The plugin adds 7 slash commands to Claude Code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check installation and auth status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:rescue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hand a task to Codex (the main one you'll use)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run a Codex code review on your local git changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:adversarial-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, but Codex actively challenges your design choices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check running/recent Codex jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Get the output of a finished background job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:cancel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kill an active background Codex job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Rescue Workflow: When Claude Gets Stuck
&lt;/h2&gt;

&lt;p&gt;This is where the plugin shines. Claude Code will &lt;strong&gt;proactively&lt;/strong&gt; spawn the Codex rescue agent when it detects it's stuck — same hypothesis loop, repeated failures, or a task that needs a second implementation pass.&lt;/p&gt;

&lt;p&gt;You can also trigger it manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue fix the race condition in src/worker.ts — tests pass locally but fail in CI under parallel execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude takes your request and shapes it into a structured prompt optimized for GPT-5.4&lt;/li&gt;
&lt;li&gt;The plugin invokes &lt;code&gt;codex-companion.mjs task&lt;/code&gt; with that prompt&lt;/li&gt;
&lt;li&gt;Codex works in the shared repository — reading files, reasoning, writing code&lt;/li&gt;
&lt;li&gt;Results come back into your Claude Code session&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" alt="Codex rescue in action — dispatching task to GPT-5.4 via codex-companion" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Foreground vs Background
&lt;/h3&gt;

&lt;p&gt;Small, focused rescues run in the foreground — you wait and get the result immediately.&lt;/p&gt;

&lt;p&gt;Big, multi-step investigations can run in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue --background investigate why the build is 3x slower since the last merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on it later with &lt;code&gt;/codex:status&lt;/code&gt; and grab results with &lt;code&gt;/codex:result&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Review: A Second Opinion That Actually Pushes Back
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends your local git diff to Codex for review. It checks against your working tree or branch changes.&lt;/p&gt;

&lt;p&gt;But the real power is the adversarial review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:adversarial-review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't "looks good to me." Codex will actively challenge your implementation approach, question design decisions, and flag things a polite reviewer wouldn't mention. It's the code review you &lt;em&gt;need&lt;/em&gt;, not the one you &lt;em&gt;want&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" alt="Codex review — checking git working tree for code review" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which Brain
&lt;/h2&gt;

&lt;p&gt;After a month of daily use, here's my mental model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Let Claude (Opus) Handle:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; — multi-file changes, refactors across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-heavy tasks&lt;/strong&gt; — "fix this bug" when you've been discussing it for 20 messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-heavy workflows&lt;/strong&gt; — file reads, grep, test runs, build commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation continuity&lt;/strong&gt; — anything that builds on prior context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Call in Codex (GPT-5.4) For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fresh eyes&lt;/strong&gt; — when Claude is circling the same hypothesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep single-problem reasoning&lt;/strong&gt; — "why does this specific test fail under these exact conditions"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial review&lt;/strong&gt; — challenge assumptions Claude might share with you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel investigation&lt;/strong&gt; — background a research task while Claude keeps working&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Pattern That Works Best
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Claude does the initial investigation — reads files, forms a theory&lt;/li&gt;
&lt;li&gt;If the theory doesn't pan out in 2-3 attempts, &lt;strong&gt;rescue to Codex&lt;/strong&gt; with the full context of what was tried&lt;/li&gt;
&lt;li&gt;Codex returns a diagnosis or fix&lt;/li&gt;
&lt;li&gt;Claude applies it in context, runs tests, iterates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two models. Two reasoning paths. Converging on the same answer faster than either alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Prompt Shaping
&lt;/h2&gt;

&lt;p&gt;The plugin includes a &lt;code&gt;gpt-5-4-prompting&lt;/code&gt; skill that automatically structures your rescue requests into Codex-optimized prompts using XML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;task&amp;gt;&lt;/code&gt; — the concrete job&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;verification_loop&amp;gt;&lt;/code&gt; — how to confirm the fix works&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;grounding_rules&amp;gt;&lt;/code&gt; — stay anchored to evidence, not guesses&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;action_safety&amp;gt;&lt;/code&gt; — don't refactor unrelated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to write these yourself. Claude does it automatically when it hands off to Codex. But knowing they exist explains why Codex rescue results are usually sharper than raw Codex CLI usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: The Review Gate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup --enable-review-gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, every &lt;code&gt;git commit&lt;/code&gt; in the repo triggers an automatic Codex review before the commit completes. It's a pre-commit hook powered by a second AI brain.&lt;/p&gt;

&lt;p&gt;This is aggressive — I only enable it on critical branches or before releases. But when you want zero-trust code quality, it's unmatched.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The Codex plugin doesn't replace Claude Code. It makes Claude Code &lt;strong&gt;anti-fragile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every AI agent has blind spots — reasoning loops it can't escape, patterns it over-fits to, assumptions it shares with its user. A second model with a different training distribution breaks those loops.&lt;/p&gt;

&lt;p&gt;The dual-brain setup isn't about which model is "better." It's about &lt;strong&gt;coverage&lt;/strong&gt;. Two independent reasoning paths catch more bugs than one brilliant path run twice.&lt;/p&gt;

&lt;p&gt;If you're using Claude Code daily, install the Codex plugin. It's 3 minutes of setup and it will save you hours of "why is Claude stuck on this?"&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Claude Code Architecture Deep Dive&lt;/a&gt; series. Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;The 1,421-Line While Loop That Runs Everything&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codex</category>
      <category>agents</category>
      <category>openai</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:24:18 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; covered the surface-level discoveries. Now we go deeper.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Heart of Claude Code
&lt;/h2&gt;

&lt;p&gt;Every AI coding agent — Claude Code, Cursor, Copilot — runs some version of the same loop: send context to an LLM, get back text and tool calls, execute tools, feed results back, repeat. We called this &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;LLM talks, program walks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But Claude Code's implementation of this loop is anything but simple. It lives in &lt;code&gt;query.ts&lt;/code&gt;, a 1,729-line async generator. The &lt;code&gt;while(true)&lt;/code&gt; starts at line 307 and ends at line 1728 — a single loop body spanning 1,421 lines of production code.&lt;/p&gt;

&lt;p&gt;This is not a toy. This is the engine that processes every keystroke, every tool call, every error recovery, every context compression decision for millions of users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — line 307&lt;/span&gt;
&lt;span class="c1"&gt;// eslint-disable-next-line no-constant-condition&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toolUseContext&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="c1"&gt;// ... 1,421 lines of state machine logic ...&lt;/span&gt;
    &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// while (true)  — line 1728&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why a State Machine, Not Recursion
&lt;/h2&gt;

&lt;p&gt;Early versions of Claude Code used recursion — the query function called itself. But recursion has a fatal flaw: in long conversations with hundreds of tool calls, the call stack grows until it explodes.&lt;/p&gt;

&lt;p&gt;The current design uses &lt;code&gt;while(true)&lt;/code&gt; with a &lt;code&gt;state&lt;/code&gt; object that carries context between iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — lines 207-215 (State type, partial)&lt;/span&gt;
&lt;span class="nx"&gt;autoCompactTracking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AutoCompactTrackingState&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;maxOutputTokensRecoveryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;hasAttemptedReactiveCompact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;       &lt;span class="c1"&gt;// circuit breaker for 413 recovery&lt;/span&gt;
&lt;span class="nx"&gt;stopHookActive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;turnCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="c1"&gt;// why we continued&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;continue&lt;/code&gt; statement is a state transition. There are &lt;strong&gt;9 distinct &lt;code&gt;continue&lt;/code&gt; points&lt;/strong&gt; in the code (among them lines 950, 1115, 1165, 1220, 1251, 1305, 1316, and 1340), each representing a different reason to run another turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next tool call needed&lt;/li&gt;
&lt;li&gt;Reactive compact triggered after 413&lt;/li&gt;
&lt;li&gt;Max output tokens recovery&lt;/li&gt;
&lt;li&gt;Stop hook interrupted&lt;/li&gt;
&lt;li&gt;Token budget continuation&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;
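&lt;p&gt;The continue-as-state-transition pattern can be sketched in miniature; the type and function names below are stand-ins, not the real &lt;code&gt;query.ts&lt;/code&gt; identifiers:&lt;/p&gt;

```typescript
// Illustrative sketch of the continue-as-state-transition pattern.
// Names are stand-ins, not the real query.ts identifiers.
type LoopState = {
  turnCount: number;
  hasAttemptedReactiveCompact: boolean; // circuit breaker for 413 recovery
  transition: { reason: string } | undefined; // why we continued
};

type TurnResult = "tool_call" | "prompt_too_long" | "done";

// Decide whether to run another turn; null means fall out of the loop.
function nextState(state: LoopState, result: TurnResult): LoopState | null {
  if (result === "tool_call") {
    return { ...state, turnCount: state.turnCount + 1,
             transition: { reason: "next tool call needed" } };
  }
  if (result === "prompt_too_long") {
    // Only one reactive compact attempt per turn, or a genuinely
    // oversized conversation would loop forever.
    if (state.hasAttemptedReactiveCompact) return null;
    return { ...state, hasAttemptedReactiveCompact: true,
             transition: { reason: "reactive compact after 413" } };
  }
  return null;
}

// The while(true) shape: state carries context between iterations
// instead of growing the call stack the way recursion would.
function runLoop(results: TurnResult[]): string[] {
  let state: LoopState = {
    turnCount: 0, hasAttemptedReactiveCompact: false, transition: undefined,
  };
  const reasons: string[] = [];
  for (const result of results) {
    const next = nextState(state, result);
    if (next === null) break;
    state = next;
    if (state.transition) reasons.push(state.transition.reason);
  }
  return reasons;
}
```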

&lt;h2&gt;
  
  
  The Loop at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" alt="flowchart TD" width="342" height="1198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  10 Steps Per Iteration
&lt;/h2&gt;

&lt;p&gt;Each time the loop runs, it does these 10 things in order. Every step has real source code behind it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Context Compression (4 stages)
&lt;/h3&gt;

&lt;p&gt;Before calling the API, the system tries to fit everything into the context window. Four compression mechanisms fire in priority order (imports at lines 12-16, 115-116):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snip Compact&lt;/strong&gt; — trims overly long individual messages in history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micro Compact&lt;/strong&gt; — finer-grained editing based on &lt;code&gt;tool_use_id&lt;/code&gt;, cache-friendly (line 370: "microcompact operates purely by tool_use_id")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Collapse&lt;/strong&gt; — folds inactive context regions into summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Compact&lt;/strong&gt; — when total tokens approach the threshold, triggers full compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are not mutually exclusive — they run in priority order:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" alt="flowchart LR" width="1552" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system tries lightweight options first. If snip + micro bring tokens under the limit, the heavy compressors never run.&lt;/p&gt;
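&lt;p&gt;The priority ordering is easy to express as code. A minimal sketch, with invented reduction ratios (the real stages edit message content, not raw token counts):&lt;/p&gt;

```typescript
// Sketch of the lightweight-first compression pipeline. Stage names
// mirror the article; the reduction ratios are invented for illustration.
type Stage = { name: string; run: (tokens: number) => number };

const stages: Stage[] = [
  { name: "snip",     run: t => Math.floor(t * 0.9) }, // trim long messages
  { name: "micro",    run: t => Math.floor(t * 0.8) }, // tool_use_id edits
  { name: "collapse", run: t => Math.floor(t * 0.5) }, // fold inactive regions
  { name: "auto",     run: t => Math.floor(t * 0.2) }, // full compression
];

// Each stage runs only while we are still over the limit, so if snip and
// micro get under the threshold the heavy compressors never fire.
function compress(tokens: number, limit: number): { tokens: number; ran: string[] } {
  const ran: string[] = [];
  for (const stage of stages) {
    if (tokens > limit) {
      tokens = stage.run(tokens);
      ran.push(stage.name);
    }
  }
  return { tokens, ran };
}
```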

&lt;h3&gt;
  
  
  Step 2: Token Budget Check
&lt;/h3&gt;

&lt;p&gt;If a token budget is active (&lt;code&gt;feature('TOKEN_BUDGET')&lt;/code&gt;, line 280), the system checks whether to continue. Users can specify targets like "+500k", and the system tracks cumulative output tokens per turn, injecting nudge messages near the goal to keep the model working.&lt;/p&gt;
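&lt;p&gt;A rough sketch of what that check might look like. The &lt;code&gt;+500k&lt;/code&gt; target format comes from the source; the parsing rules and the nudge threshold are assumptions:&lt;/p&gt;

```typescript
// Hypothetical sketch of the token budget check. The "+500k" target
// format is from the article; parsing rules and the 90% nudge threshold
// are invented for illustration.
function parseBudget(spec: string): number {
  const m = spec.match(/^\+(\d+)([km])$/i);
  if (m === null) throw new Error(`unrecognized budget spec: ${spec}`);
  const unit = m[2].toLowerCase() === "k" ? 1_000 : 1_000_000;
  return Number(m[1]) * unit;
}

// Called once before the model runs and once after: near the goal we
// inject a "keep going" nudge; past it we stop.
function budgetAction(usedTokens: number, target: number): "continue" | "nudge" | "stop" {
  if (usedTokens >= target) return "stop";
  if (usedTokens >= target * 0.9) return "nudge";
  return "continue";
}
```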

&lt;h3&gt;
  
  
  Step 3: Call Model API
&lt;/h3&gt;

&lt;p&gt;Line 659 — the actual API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for await (const message of deps.callModel({
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a streaming call. The response arrives token by token, and the system processes it incrementally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Streaming Tool Execution
&lt;/h3&gt;

&lt;p&gt;This is a critical optimization. Traditional agents wait for the model to finish generating all output, then execute tools. Claude Code uses &lt;code&gt;StreamingToolExecutor&lt;/code&gt; (imported at line 96):&lt;/p&gt;

&lt;p&gt;When the model is still generating its second tool call, the first one is already running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Agent (sequential):
┌─────────────────────────┐┌───┐┌───┐┌───┐┌───┐┌───┐
│  LLM generates 5 calls  ││ T1││ T2││ T3││ T4││ T5│  ← 30s total
└─────────────────────────┘└───┘└───┘└───┘└───┘└───┘

Claude Code (streaming):
┌─────────────────────────┐
│  LLM generates 5 calls  │
├──┬──┬──┬──┬─────────────┘
│T1│T2│T3│T4│T5│                                       ← 18s total
└──┴──┴──┴──┴──┘
↑ tools start while LLM is still generating
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a turn with 5 tool calls, traditional waits 30 seconds. Streaming finishes in 18 — a &lt;strong&gt;40% speedup&lt;/strong&gt; from architecture alone, not model improvements.&lt;/p&gt;

&lt;p&gt;Lines 554-555 reveal an interesting detail: &lt;code&gt;stop_reason === 'tool_use'&lt;/code&gt; is unreliable — "it's not always set correctly." The system detects tool calls by watching for &lt;code&gt;tool_use&lt;/code&gt; blocks during streaming instead.&lt;/p&gt;
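&lt;p&gt;The overlap is the whole trick, and it can be demonstrated with a toy stream. Nothing here is the real &lt;code&gt;StreamingToolExecutor&lt;/code&gt;; it is a minimal sketch of the idea:&lt;/p&gt;

```typescript
// Toy demonstration of streaming tool execution: start each tool the
// moment its tool_use block arrives, instead of waiting for the model
// to finish generating. Not the real StreamingToolExecutor API.
type ToolCall = { id: string; durationMs: number };

const sleep = (ms: number) => new Promise(res => setTimeout(res, ms));

async function runTool(call: ToolCall) {
  await sleep(call.durationMs);
  return call.id;
}

// Simulated model stream: one tool_use block every 20 ms.
async function* generateCalls(calls: ToolCall[]) {
  for (const call of calls) {
    await sleep(20);
    yield call;
  }
}

// Streaming executor: tools run in parallel with generation, and the
// results still come back in call order via Promise.all.
async function streamingExecute(calls: ToolCall[]) {
  const running: unknown[] = [];
  for await (const call of generateCalls(calls)) {
    running.push(runTool(call)); // started before the stream is done
  }
  return Promise.all(running);
}
```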

&lt;h3&gt;
  
  
  Step 5: Error Recovery
&lt;/h3&gt;

&lt;p&gt;If the prompt is too long, the system first tries a context collapse drain; if that fails, reactive compact (lines 15-16). If the API returns 413 (prompt too long), it triggers emergency compression and retries.&lt;/p&gt;

&lt;p&gt;But there's a circuit breaker: &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; (line 209, initialized &lt;code&gt;false&lt;/code&gt; at line 275) ensures each turn only attempts reactive compact once. Without this, a genuinely oversized conversation would loop forever.&lt;/p&gt;

&lt;p&gt;The system also handles model degradation — if the primary model fails, it can fall back to a different model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Stop Hooks
&lt;/h3&gt;

&lt;p&gt;After the model stops outputting, the system runs registered stop hooks. These can inspect the output and decide whether to let the model continue. This is where external governance plugs in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Token Budget Check (Again)
&lt;/h3&gt;

&lt;p&gt;Yes, checked twice — once before calling the model (should we even start?) and once after (did we exceed the budget?). The second check decides whether to inject a "keep going" nudge or stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Tool Execution
&lt;/h3&gt;

&lt;p&gt;If the response contains &lt;code&gt;tool_use&lt;/code&gt; blocks, execute them. Two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;runTools()&lt;/code&gt; (from &lt;code&gt;toolOrchestration.ts&lt;/code&gt;, line 98) — batch execution&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;StreamingToolExecutor&lt;/code&gt; (line 96) — streaming execution, gated by &lt;code&gt;config.gates.streamingToolExecution&lt;/code&gt; (line 561)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each tool call goes through the 14-step execution pipeline in &lt;code&gt;toolExecution.ts&lt;/code&gt; (1,745 lines) — validation, permission checks, hooks, actual execution, analytics. That's a story for &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-tool-pipeline/" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Attachment Injection
&lt;/h3&gt;

&lt;p&gt;After tools finish, the system injects additional context before the next turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory attachments&lt;/strong&gt; — relevant memories from the &lt;code&gt;memdir/&lt;/code&gt; system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill discovery&lt;/strong&gt; — matching skills based on the current task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queued commands&lt;/strong&gt; — any commands that were waiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens after tool execution but before the next API call, ensuring the model has fresh context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Assemble and Loop
&lt;/h3&gt;

&lt;p&gt;Build the new message list from all the pieces — original conversation, tool results, attachments, system reminders — and go back to step 1.&lt;/p&gt;
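&lt;p&gt;In sketch form, with simplified message shapes standing in for the real conversation structures:&lt;/p&gt;

```typescript
// Sketch of the assembly step. Message shapes are simplified stand-ins
// for the real conversation structures.
type Msg = { role: "user" | "assistant" | "system"; content: string };

function assemble(history: Msg[], toolResults: string[], attachments: string[]): Msg[] {
  const next = history.slice(); // original conversation, untouched
  for (const r of toolResults) {
    next.push({ role: "user", content: `[tool_result] ${r}` });
  }
  for (const a of attachments) {
    // memory, discovered skills, queued commands from step 9
    next.push({ role: "system", content: `[attachment] ${a}` });
  }
  return next; // this becomes the input to step 1 of the next iteration
}
```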

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;p&gt;Most open-source AI agents implement the loop as 50 lines of pseudocode: call model, parse tool calls, execute, repeat. Claude Code's 1,421-line version exists because production reality is messy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context doesn't fit.&lt;/strong&gt; A real coding session easily hits 200K tokens. Without the 4-stage compression pipeline, the agent dies on every long conversation. Most agents just truncate and lose context. Claude Code compresses intelligently — lightweight first, heavy only when needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models fail.&lt;/strong&gt; APIs return 413, connections drop, rate limits hit. The 9 continue points aren't over-engineering — they're the minimum number of recovery paths needed for reliable operation. The &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; circuit breaker is the kind of detail that separates a demo from a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed matters more than strict execution order.&lt;/strong&gt; Streaming tool execution — starting the first tool while the model is still generating the third — is a user experience decision backed by architecture. Traditional agents feel slow because they are: they serialize everything. Claude Code parallelizes at the loop level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens cost money.&lt;/strong&gt; The &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; marker in &lt;code&gt;prompts.ts&lt;/code&gt; (914 lines) splits the system prompt into static (cacheable) and dynamic sections. If two requests share the same static prefix byte-for-byte, the API caches it. Source comment: "don't modify content before the boundary, or you'll destroy the cache." This is prompt cache economics — saving Anthropic real compute costs at scale.&lt;/p&gt;
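&lt;p&gt;The mechanics are simple to sketch. This is an illustration of the split, not the actual &lt;code&gt;prompts.ts&lt;/code&gt; code; only the marker name comes from the source:&lt;/p&gt;

```typescript
// Illustration of the static/dynamic prompt split. The marker name is
// from the article; everything else is invented for demonstration.
const BOUNDARY = "SYSTEM_PROMPT_DYNAMIC_BOUNDARY";

function splitPrompt(prompt: string): { staticPart: string; dynamicPart: string } {
  const i = prompt.indexOf(BOUNDARY);
  if (i === -1) return { staticPart: prompt, dynamicPart: "" };
  return {
    staticPart: prompt.slice(0, i),                 // cacheable: never modify
    dynamicPart: prompt.slice(i + BOUNDARY.length), // per-request content
  };
}

// Two requests hit the same cache entry only when the static prefix
// matches byte-for-byte.
function shareCacheEntry(a: string, b: string): boolean {
  return splitPrompt(a).staticPart === splitPrompt(b).staticPart;
}
```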

&lt;h2&gt;
  
  
  The Behavioral Constitution
&lt;/h2&gt;

&lt;p&gt;Buried inside the prompt assembly, &lt;code&gt;getSimpleDoingTasksSection()&lt;/code&gt; may be the most valuable function in the entire codebase. It encodes hard-won rules about what the model should NOT do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't add features the user didn't ask for&lt;/li&gt;
&lt;li&gt;Don't over-abstract — three duplicate lines beat a premature abstraction&lt;/li&gt;
&lt;li&gt;Don't add comments to code you didn't change&lt;/li&gt;
&lt;li&gt;Don't add unnecessary error handling&lt;/li&gt;
&lt;li&gt;Read code before modifying it&lt;/li&gt;
&lt;li&gt;If a method fails, diagnose before retrying&lt;/li&gt;
&lt;li&gt;Report honestly — don't say you ran something you didn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone who has used Claude Code recognizes these rules. I've personally watched the system refuse to add "helpful" abstractions and stick to minimal changes. That's not the model being disciplined — it's the prompt constraining the model. The takeaway: &lt;strong&gt;don't trust model self-discipline. Codify the behavior.&lt;/strong&gt;&lt;/p&gt;
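&lt;p&gt;Codifying behavior can be as unglamorous as a function that returns a rules block. The function name is from the article; this body paraphrases the list above rather than quoting the real prompt text:&lt;/p&gt;

```typescript
// Sketch of "codify the behavior": the rules live in the program,
// not in the model's goodwill.
const DOING_TASKS_RULES = [
  "Don't add features the user didn't ask for.",
  "Don't over-abstract; three duplicate lines beat a premature abstraction.",
  "Don't add comments to code you didn't change.",
  "Read code before modifying it.",
  "Report honestly; never claim you ran something you didn't.",
];

function getSimpleDoingTasksSection(): string {
  return ["# Doing tasks", ...DOING_TASKS_RULES.map((r) => `- ${r}`)].join("\n");
}
```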

&lt;h2&gt;
  
  
  How Other Agents Compare
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Typical OSS Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Loop complexity&lt;/td&gt;
&lt;td&gt;1,421 lines, 9 continue points&lt;/td&gt;
&lt;td&gt;Unknown (closed source)&lt;/td&gt;
&lt;td&gt;~50-200 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compression&lt;/td&gt;
&lt;td&gt;4-stage pipeline + reactive 413 recovery&lt;/td&gt;
&lt;td&gt;Tab-level context pruning&lt;/td&gt;
&lt;td&gt;Truncate or fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool execution&lt;/td&gt;
&lt;td&gt;Streaming (parallel with generation)&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error recovery&lt;/td&gt;
&lt;td&gt;Circuit breakers, model fallback, emergency compact&lt;/td&gt;
&lt;td&gt;Basic retry&lt;/td&gt;
&lt;td&gt;Crash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching&lt;/td&gt;
&lt;td&gt;Static/dynamic boundary, section registry&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap between Claude Code and most open-source agents is not model quality — it's the program layer. The model is the same Opus or Sonnet for everyone. What makes Claude Code feel different is 1,421 lines of careful engineering around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The query loop is where "LLM talks, program walks" becomes concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM outputs text and tool call JSON. That's it.&lt;/li&gt;
&lt;li&gt;The program handles compression, budget tracking, error recovery, streaming, permissions, memory injection, and 14-step tool validation.&lt;/li&gt;
&lt;li&gt;The 1,421 lines are not the model being smart. They're the program being careful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building an AI agent and your main loop is under 100 lines, you're not handling the cases that matter. Production is not about the happy path. It's about what happens when context overflows, the API returns 413, the user's conversation hits 500 turns, and three tools need to run while the model is still thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Part 3 — The 14-Step Tool Execution Pipeline (coming soon) — what happens between "model says call this tool" and the tool actually running.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1 — 5 Hidden Features Found in 510K Lines&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agents</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Claude Code Source Leaked: 5 Hidden Features Found in 510K Lines of Code</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:07 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</guid>
      <description>&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;Anthropic shipped Claude Code v2.1.88 to npm with a 60MB source map still attached. That single file contained 1,906 source files and 510,000 lines of fully readable TypeScript. No minification. No obfuscation. Just the raw codebase, sitting in a public registry for anyone to download.&lt;/p&gt;

&lt;p&gt;Within hours, backup repositories appeared on GitHub. One of them — &lt;a href="https://github.com/instructkr/claude-code" rel="noopener noreferrer"&gt;instructkr/claude-code&lt;/a&gt; — racked up 20,000+ stars almost instantly. Anthropic pulled the package, but the code was already mirrored everywhere. The cat was out of the bag, and it had opinions about AI safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Hidden Features Found in the Source
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Buddy Pet System
&lt;/h3&gt;

&lt;p&gt;Deep in &lt;code&gt;buddy/types.ts&lt;/code&gt;, there is a complete virtual pet system. Eighteen species, five rarity tiers, shiny variants, hats, custom eyes, and stat blocks. This was clearly planned as an April Fools easter egg.&lt;/p&gt;

&lt;p&gt;The species list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SPECIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;duck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;goose&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dragon&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;octopus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;owl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;penguin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;turtle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ghost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axolotl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;capybara&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cactus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;robot&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rabbit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mushroom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chonk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rarity weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RARITY_WEIGHTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;common&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 60%&lt;/span&gt;
  &lt;span class="na"&gt;uncommon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 25%&lt;/span&gt;
  &lt;span class="na"&gt;rare&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 10%&lt;/span&gt;
  &lt;span class="na"&gt;epic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;//  4%&lt;/span&gt;
  &lt;span class="na"&gt;legendary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;//  1%&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
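&lt;p&gt;Weights like these usually turn into a draw by walking the cumulative distribution. A sketch of the idea, not the leaked implementation:&lt;/p&gt;

```typescript
// Weighted rarity draw: accumulate the weights and return the first
// tier whose cumulative bucket contains the roll.
const RARITY_WEIGHTS: Record<string, number> = {
  common: 60, uncommon: 25, rare: 10, epic: 4, legendary: 1,
};

function pickRarity(roll: number): string {
  // roll is in [0, 100)
  let acc = 0;
  for (const [rarity, weight] of Object.entries(RARITY_WEIGHTS)) {
    acc += weight;
    if (roll < acc) return rarity;
  }
  return "common"; // unreachable while the weights sum to 100
}
```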



&lt;p&gt;Each buddy gets a hat, eyes, and stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tophat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;propeller&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;halo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wizard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beanie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tinyduck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Eye&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;·&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;✦&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;×&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;◉&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;°&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Stat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DEBUGGING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PATIENCE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CHAOS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;WISDOM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SNARK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your buddy is generated deterministically from &lt;code&gt;hash(userId)&lt;/code&gt;. Every account gets a unique pet. There is also a &lt;code&gt;shiny&lt;/code&gt; boolean variant — presumably the rare version you brag about in team Slack.&lt;/p&gt;
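&lt;p&gt;Deterministic generation from a user id is easy to sketch. The species list is from the leak; the hash function (FNV-1a here) and the shiny odds are my assumptions:&lt;/p&gt;

```typescript
// Deterministic buddy from a user id: same id, same pet, every time.
const SPECIES = [
  "duck", "goose", "blob", "cat", "dragon", "octopus",
  "owl", "penguin", "turtle", "snail", "ghost", "axolotl",
  "capybara", "cactus", "robot", "rabbit", "mushroom", "chonk",
];

function fnv1a(s: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // FNV prime, kept in uint32 range
  }
  return h;
}

function buddyFor(userId: string) {
  const h = fnv1a(userId);
  return {
    species: SPECIES[h % SPECIES.length],
    shiny: h % 128 === 0, // rare variant; the odds are a guess
  };
}
```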

&lt;p&gt;This was 100% an April 1st drop. The leak killed the surprise.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Undercover Mode
&lt;/h3&gt;

&lt;p&gt;This one raised eyebrows. In &lt;code&gt;utils/undercover.ts&lt;/code&gt;, there is a mode that makes Claude pretend to be a human developer.&lt;/p&gt;

&lt;p&gt;When active, undercover mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes &lt;strong&gt;all&lt;/strong&gt; Anthropic traces from commits and pull requests&lt;/li&gt;
&lt;li&gt;Strips &lt;code&gt;Co-Authored-By&lt;/code&gt; headers&lt;/li&gt;
&lt;li&gt;Removes any mention of "Claude Code" or model names&lt;/li&gt;
&lt;li&gt;Writes commit messages "as a human developer would"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// UNDERCOVER MODE — CRITICAL&lt;/span&gt;
&lt;span class="c1"&gt;// You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository.&lt;/span&gt;
&lt;span class="c1"&gt;// Your commit messages, PR titles, and PR bodies MUST NOT contain&lt;/span&gt;
&lt;span class="c1"&gt;// ANY Anthropic-internal information. Do not blow your cover.&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// NEVER include in commit messages or PR descriptions:&lt;/span&gt;
&lt;span class="c1"&gt;// - Internal model codenames (animal names like Capybara, Tengu, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)&lt;/span&gt;
&lt;span class="c1"&gt;// - The phrase "Claude Code" or any mention that you are an AI&lt;/span&gt;
&lt;span class="c1"&gt;// - Co-Authored-By lines or any other attribution&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// Write commit messages as a human developer would.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It auto-activates for public and open-source repositories. You can force it on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_UNDERCOVER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no force-OFF override. The safety default is undercover-on for public repos.&lt;/p&gt;

&lt;p&gt;The kicker: this feature is gated to internal Anthropic employees only (&lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;). Anthropic's own developers use Claude Code on open-source projects, and this mode ensures their contributions look human-written. Draw your own conclusions about what that means for AI attribution norms.&lt;/p&gt;
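&lt;p&gt;The attribution-stripping half of this is mundane string filtering. A rough sketch of the idea, not the actual implementation:&lt;/p&gt;

```typescript
// Strip Co-Authored-By trailers from a commit message.
function stripAttribution(msg: string): string {
  return msg
    .split("\n")
    .filter((line) => !/^co-authored-by:/i.test(line.trim()))
    .join("\n")
    .replace(/\n+$/, ""); // drop trailing blank lines left behind
}
```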

&lt;h3&gt;
  
  
  3. Kairos — Permanent Memory
&lt;/h3&gt;

&lt;p&gt;Behind the feature flag &lt;code&gt;KAIROS&lt;/code&gt; in &lt;code&gt;main.tsx&lt;/code&gt; and the &lt;code&gt;memdir/&lt;/code&gt; directory, there is a persistent memory system that survives across sessions.&lt;/p&gt;

&lt;p&gt;This is not the &lt;code&gt;.claude/&lt;/code&gt; project memory you already know. Kairos is a four-stage memory consolidation pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orient&lt;/strong&gt; — scan context, identify what matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collect&lt;/strong&gt; — gather facts, decisions, patterns from the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consolidate&lt;/strong&gt; — merge new memories with existing long-term store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prune&lt;/strong&gt; — discard stale or low-value memories&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system runs automatically when you are not actively using Claude Code. It tracks memory age, performs periodic scans, and supports team memory paths — meaning shared memory across a team's Claude Code instances.&lt;/p&gt;
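&lt;p&gt;The consolidate and prune stages can be sketched structurally. The stage names are from the source; the &lt;code&gt;Memory&lt;/code&gt; shape, signatures, and thresholds are invented for illustration:&lt;/p&gt;

```typescript
interface Memory { fact: string; lastTouched: number; value: number }

// Consolidate: merge session memories into the long-term store,
// refreshing duplicates rather than storing them twice.
function consolidate(store: Memory[], fresh: Memory[], now: number): Memory[] {
  const byFact = new Map<string, Memory>();
  for (const m of store) byFact.set(m.fact, m);
  for (const m of fresh) {
    const prev = byFact.get(m.fact);
    byFact.set(m.fact, {
      fact: m.fact,
      lastTouched: now,
      value: prev ? Math.max(prev.value, m.value) : m.value,
    });
  }
  return Array.from(byFact.values());
}

// Prune: discard stale, low-value memories.
function prune(store: Memory[], now: number, maxAgeMs: number): Memory[] {
  return store.filter((m) => m.value >= 0.5 || now - m.lastTouched < maxAgeMs);
}
```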

&lt;p&gt;This turns Claude Code from a stateless tool into a persistent assistant that learns your codebase, your patterns, and your preferences over time. It is the most architecturally significant hidden feature in the leak.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ultraplan — Deep Task Planning
&lt;/h3&gt;

&lt;p&gt;The feature flag &lt;code&gt;ULTRAPLAN&lt;/code&gt; in &lt;code&gt;commands.ts&lt;/code&gt; enables a deep planning mode that can run for up to 30 minutes on a single task. It uses remote agent execution — meaning the heavy thinking happens server-side, not in your terminal.&lt;/p&gt;

&lt;p&gt;Ultraplan is listed under &lt;code&gt;INTERNAL_ONLY_COMMANDS&lt;/code&gt;. Anthropic's engineers apparently have access to a planning mode that goes far beyond what ships to paying customers. This is the kind of feature that separates "AI autocomplete" from "AI architect."&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Multi-Agent, Voice, and Daemon Modes
&lt;/h3&gt;

&lt;p&gt;The source reveals several execution modes that are not publicly documented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator mode&lt;/strong&gt; — orchestrates multiple Claude instances running in parallel, each working on a subtask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice mode&lt;/strong&gt; (&lt;code&gt;VOICE_MODE&lt;/code&gt; flag) — voice input/output for Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridge mode&lt;/strong&gt; (&lt;code&gt;BRIDGE_MODE&lt;/code&gt;) — remote control of a Claude Code instance from another process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daemon mode&lt;/strong&gt; (&lt;code&gt;DAEMON&lt;/code&gt;) — runs Claude Code as a background process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDS inbox&lt;/strong&gt; (&lt;code&gt;UDS_INBOX&lt;/code&gt;) — Unix domain socket for inter-process communication between Claude instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these paint a picture of Claude Code evolving from a single-user CLI into a multi-agent orchestration platform. The daemon + UDS architecture means Claude Code instances can message each other, coordinate work, and run without a terminal attached.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;The entire Claude Code engine lives in &lt;code&gt;queryLoop()&lt;/code&gt; at &lt;code&gt;query.ts&lt;/code&gt; line 241. At line 307, there is a &lt;code&gt;while(true)&lt;/code&gt; loop that drives everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;callModel()&lt;/code&gt; sends the conversation to the LLM&lt;/li&gt;
&lt;li&gt;The LLM returns text and &lt;code&gt;tool_use&lt;/code&gt; JSON blocks&lt;/li&gt;
&lt;li&gt;The program parses each &lt;code&gt;tool_use&lt;/code&gt;, checks permissions, executes the tool&lt;/li&gt;
&lt;li&gt;Results feed back into the conversation&lt;/li&gt;
&lt;li&gt;Loop continues until the LLM stops requesting tools&lt;/li&gt;
&lt;/ol&gt;
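&lt;p&gt;Stripped to its skeleton, and with the permission checks elided, the loop reads something like this (names such as &lt;code&gt;runTool&lt;/code&gt; are placeholders, not the real signatures):&lt;/p&gt;

```typescript
interface ToolUse { name: string; args: unknown }
interface Turn { text: string; toolUses: ToolUse[] }

function queryLoop(
  callModel: (history: string[]) => Turn,
  runTool: (t: ToolUse) => string,
  history: string[],
): string {
  for (;;) {
    const turn = callModel(history);                  // steps 1-2
    if (turn.toolUses.length === 0) return turn.text; // step 5: no tools, done
    for (const use of turn.toolUses) {
      history.push(runTool(use));                     // steps 3-4: execute, feed back
    }
  }
}
```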

&lt;p&gt;This is the "LLM talks, program walks" pattern I wrote about &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;previously&lt;/a&gt;. The LLM decides what to do. The program decides whether to allow it, then does it. Seeing it confirmed in 510K lines of production code is satisfying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's permission system is the most carefully engineered part of the codebase. Every tool call passes through six layers, implemented in &lt;code&gt;useCanUseTool.tsx&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Config allowlist&lt;/strong&gt; — checks project and user configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-mode classifier&lt;/strong&gt; — determines if the tool is safe for autonomous execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator gate&lt;/strong&gt; — validates against the orchestration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm worker gate&lt;/strong&gt; — checks permissions for sub-agent execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bash classifier&lt;/strong&gt; — analyzes shell commands for safety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive user prompt&lt;/strong&gt; — final human confirmation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;External commands run in a sandbox. This is defense-in-depth done right. The irony is that the company that built this careful permission model forgot to strip a source map from their npm package.&lt;/p&gt;
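&lt;p&gt;A layered gate chain like this is usually a short-circuiting fold: ask each gate in order, and the first one with an opinion wins. A structural sketch only; the gate names paraphrase the list above and none of this is the real API:&lt;/p&gt;

```typescript
type Verdict = "allow" | "deny" | "ask-user";
type Gate = (tool: string, args: string) => Verdict | null; // null = no opinion

function decide(gates: Gate[], tool: string, args: string): Verdict {
  for (const gate of gates) {
    const verdict = gate(tool, args);
    if (verdict !== null) return verdict; // first gate with an opinion wins
  }
  return "ask-user"; // fall through to the human
}

// Toy stand-in for the bash classifier: deny obviously destructive commands.
const bashClassifier: Gate = (tool, args) => {
  if (tool !== "bash") return null;
  return /rm\s+-rf\s+\//.test(args) ? "deny" : null;
};
```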

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;The moat for AI coding tools is not the CLI. It is the model. Anyone can read this source code and understand the architecture, but nobody can replicate Sonnet or Opus. The &lt;code&gt;queryLoop()&lt;/code&gt; pattern is elegant but simple — the magic is in what &lt;code&gt;callModel()&lt;/code&gt; returns. That said, the product roadmap is now public. Competitors know about Kairos, Ultraplan, multi-agent coordination, and voice mode. That is real strategic damage.&lt;/p&gt;

&lt;p&gt;For a company that positions itself as the responsible AI lab — the one that takes safety seriously — shipping a fully readable source map to a public registry is a notable operational security failure. The six-layer permission system in the code is impressive. The process that let a 60MB source map slip through CI/CD is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the Deep Dive
&lt;/h2&gt;

&lt;p&gt;I broke down the full AI agent architecture — the same query loop that Claude Code uses — in a 15-minute video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For background on the "LLM talks, program walks" pattern: &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;Read: The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Coming next: a deep dive into Claude Code's 6-layer permission system and the Kairos memory architecture — with full code walkthroughs. Subscribe to catch it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>anthropic</category>
      <category>agents</category>
      <category>security</category>
    </item>
    <item>
      <title>The AI Stack Explained: LLM Talks, Program Walks</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 30 Mar 2026 04:14:15 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/the-ai-stack-explained-llm-talks-program-walks-3p8a</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/the-ai-stack-explained-llm-talks-program-walks-3p8a</guid>
      <description>&lt;p&gt;LLM. Token. Context. Prompt. Function Calling. MCP. Agent. Skill.&lt;/p&gt;

&lt;p&gt;You've spent months trying to understand these concepts. Here's something that might surprise you: &lt;strong&gt;they're all the same thing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An LLM can only do one thing — output text. It can't browse the web. It can't query a database. It can't control your computer. The program around it does all of that. The program reads the text the LLM outputs, takes action on its behalf, and feeds the result back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM talks, program walks.&lt;/strong&gt; That's the entire AI stack in four words.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Every AI capability — from chatbots to autonomous agents — is built on one loop: the LLM outputs text, the program reads it and acts, the result feeds back. Understanding this loop makes every AI concept transparent.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Layer 1: The LLM — A Genius That Can Only Play Word Chain
&lt;/h2&gt;

&lt;p&gt;At its core, a large language model is a word prediction machine.&lt;/p&gt;

&lt;p&gt;You give it "The capital of France is" — it predicts "Paris." Then it appends "Paris" to the input and predicts again. Comma. "Which." "Is." On and on — until it outputs a stop token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumf4cdjx113vi9twg66v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumf4cdjx113vi9twg66v.webp" alt="The LLM is a word prediction machine — input text in, predict next word out" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No thinking. No understanding. No consciousness. &lt;strong&gt;Just one thing: given the text so far, predict the next word.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But the model's internals are pure matrix math — it only understands numbers. So there's a translator: the &lt;strong&gt;Tokenizer&lt;/strong&gt;. It chops text into small chunks called Tokens, maps each to a number, feeds them to the model, and converts the output back to text.&lt;/p&gt;

&lt;p&gt;A Token ≠ a word. "helpful" → "help" + "ful" (2 tokens). "unbelievable" → "un" + "believ" + "able" (3 tokens).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens are the atoms of the LLM world.&lt;/strong&gt; Everything goes in as tokens, everything comes out as tokens.&lt;/p&gt;
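&lt;p&gt;Real tokenizers use byte-pair encoding over a learned vocabulary, but the text-to-numbers round trip can be illustrated with a toy greedy matcher over a hand-written vocab:&lt;/p&gt;

```typescript
// Toy tokenizer: greedily match vocab pieces, emit their ids.
// Only shows the shape of the idea, not real BPE.
const VOCAB = ["help", "ful", "un", "believ", "able"];

function tokenize(word: string): number[] {
  const ids: number[] = [];
  let rest = word;
  while (rest.length > 0) {
    const piece = VOCAB.find((p) => rest.startsWith(p));
    if (!piece) throw new Error(`no token for: ${rest}`);
    ids.push(VOCAB.indexOf(piece));
    rest = rest.slice(piece.length);
  }
  return ids;
}
```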

&lt;p&gt;The LLM can play word chain. But it has a fatal flaw.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Context — A Genius with No Memory
&lt;/h2&gt;

&lt;p&gt;The LLM has no memory. This isn't a metaphor — it's literally a math function. Input in, output out, done. Next call? Knows nothing.&lt;/p&gt;

&lt;p&gt;So why does it seem like it remembers your earlier messages?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because every time you send a message, the program behind the scenes stitches your entire conversation history together and sends it all at once.&lt;/strong&gt; The LLM doesn't "remember." It re-reads everything from scratch. Every single time.&lt;/p&gt;
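&lt;p&gt;Mechanically, the program just keeps an array and resends all of it on every turn. A bare-bones sketch, with &lt;code&gt;callModel&lt;/code&gt; standing in for the real API call:&lt;/p&gt;

```typescript
// "Re-reads everything from scratch": the program keeps the history
// and resends the whole thing on every turn.
function chat(callModel: (msgs: string[]) => string) {
  const history: string[] = [];
  return (userMessage: string): string => {
    history.push(`user: ${userMessage}`);
    const reply = callModel(history); // the FULL history, every single time
    history.push(`assistant: ${reply}`);
    return reply;
  };
}
```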

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf7qma8qkbniyx4xyn6w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyf7qma8qkbniyx4xyn6w.webp" alt="Context = everything on the LLM's desk: chat history, system instructions, your question, tool list" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bundle is called &lt;strong&gt;Context&lt;/strong&gt; — everything the LLM can see at once. Think of it as a desk. Today's largest models fit about 1 million tokens on that desk (~750,000 words, most of the seven-book Harry Potter series).&lt;/p&gt;

&lt;p&gt;But even with a big desk, dumping an entire thousand-page manual onto it is impractical. The fix? &lt;strong&gt;Only put the relevant pages on the desk.&lt;/strong&gt; Search ahead of time, find the matching chunks, feed only those.&lt;/p&gt;

&lt;p&gt;That's &lt;strong&gt;RAG&lt;/strong&gt; — Retrieval-Augmented Generation. Don't dump everything. Pick what matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Prompt — What You Say to the LLM
&lt;/h2&gt;

&lt;p&gt;Don't overthink "Prompt." A prompt is just what you say to the LLM. Every message you type is a prompt.&lt;/p&gt;

&lt;p&gt;But there are two kinds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwhr5qtnmgw0x7zh95a2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwhr5qtnmgw0x7zh95a2.webp" alt="Two kinds of prompt: User Prompt (what to do now) and System Prompt (who you are)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Prompt&lt;/strong&gt; — what you type. "Write me a sorting algorithm in Python."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Prompt&lt;/strong&gt; — rules the developer sets behind the scenes. "You are a senior Python engineer. Keep answers concise." You never see this, but the LLM reads it every time.&lt;/p&gt;

&lt;p&gt;Both get packed into Context. User Prompt = what to do now. System Prompt = who you are and what rules to follow.&lt;/p&gt;

&lt;p&gt;The LLM can now predict words, see history, and follow instructions. But it's still just outputting text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What comes next is the most important part.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Function Calling — Where Everything Begins
&lt;/h2&gt;

&lt;p&gt;Let's come back to the fundamental fact:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An LLM can only output text.&lt;/strong&gt; It can't browse the internet. It can't check the weather. It can't call any API.&lt;/p&gt;

&lt;p&gt;So how does it "check the weather"? &lt;strong&gt;It doesn't. The program does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBZb3UKICAgIHBhcnRpY2lwYW50IFByb2dyYW0KICAgIHBhcnRpY2lwYW50IExMTQogICAgcGFydGljaXBhbnQgQVBJIGFzIFdlYXRoZXIgQVBJCgogICAgWW91LT4-UHJvZ3JhbTogIldoYXQncyB0aGUgd2VhdGhlciBpbiBUb2t5bz8iCiAgICBQcm9ncmFtLT4-TExNOiBbeW91ciBxdWVzdGlvbiArIHRvb2wgY2F0YWxvZ10KICAgIExMTS0-PlByb2dyYW06IHsidG9vbCI6ICJnZXRfd2VhdGhlciIsICJhcmdzIjogeyJjaXR5IjogIlRva3lvIn19CiAgICBOb3RlIG92ZXIgTExNOiBMTE0ncyBqb2IgaXMgZG9uZS4gSXQganVzdCBvdXRwdXQgSlNPTiB0ZXh0LgogICAgUHJvZ3JhbS0-PkFQSTogR0VUIC93ZWF0aGVyP2NpdHk9VG9reW8KICAgIEFQSS0-PlByb2dyYW06IHsiY29uZGl0aW9uIjogIkNsb3VkeSIsICJ0ZW1wIjogIjE4wrBDIn0KICAgIFByb2dyYW0tPj5MTE06IFtvcmlnaW5hbCBxdWVzdGlvbiArIHRvb2wgcmVzdWx0XQogICAgTExNLT4-UHJvZ3JhbTogIkl0J3MgY3VycmVudGx5IGNsb3VkeSBpbiBUb2t5bywgYXJvdW5kIDE4wrBDLiIKICAgIFByb2dyYW0tPj5Zb3U6ICJJdCdzIGN1cnJlbnRseSBjbG91ZHkgaW4gVG9reW8sIGFyb3VuZCAxOMKwQy4i" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBZb3UKICAgIHBhcnRpY2lwYW50IFByb2dyYW0KICAgIHBhcnRpY2lwYW50IExMTQogICAgcGFydGljaXBhbnQgQVBJIGFzIFdlYXRoZXIgQVBJCgogICAgWW91LT4-UHJvZ3JhbTogIldoYXQncyB0aGUgd2VhdGhlciBpbiBUb2t5bz8iCiAgICBQcm9ncmFtLT4-TExNOiBbeW91ciBxdWVzdGlvbiArIHRvb2wgY2F0YWxvZ10KICAgIExMTS0-PlByb2dyYW06IHsidG9vbCI6ICJnZXRfd2VhdGhlciIsICJhcmdzIjogeyJjaXR5IjogIlRva3lvIn19CiAgICBOb3RlIG92ZXIgTExNOiBMTE0ncyBqb2IgaXMgZG9uZS4gSXQganVzdCBvdXRwdXQgSlNPTiB0ZXh0LgogICAgUHJvZ3JhbS0-PkFQSTogR0VUIC93ZWF0aGVyP2NpdHk9VG9reW8KICAgIEFQSS0-PlByb2dyYW06IHsiY29uZGl0aW9uIjogIkNsb3VkeSIsICJ0ZW1wIjogIjE4wrBDIn0KICAgIFByb2dyYW0tPj5MTE06IFtvcmlnaW5hbCBxdWVzdGlvbiArIHRvb2wgcmVzdWx0XQogICAgTExNLT4-UHJvZ3JhbTogIkl0J3MgY3VycmVudGx5IGNsb3VkeSBpbiBUb2t5bywgYXJvdW5kIDE4wrBDLiIKICAgIFByb2dyYW0tPj5Zb3U6ICJJdCdzIGN1cnJlbnRseSBjbG91ZHkgaW4gVG9reW8sIGFyb3VuZCAxOMKwQy4i" alt="sequenceDiagram" width="1210" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM did not call anything. It just output a JSON string. The program parsed that JSON, the program called the API, the program got the result, and the program fed it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's all Function Calling is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'll sum it up in four words: &lt;strong&gt;LLM talks, program walks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM only talks — "I want to check the weather." The program walks — it actually goes and checks. Everything that comes next is built on this loop.&lt;/p&gt;
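&lt;p&gt;The loop is small enough to sketch in a few lines. Here is a minimal Python version; the tool name, the canned &lt;code&gt;get_weather&lt;/code&gt; function, and the exact JSON shape are illustrative, not any particular vendor's API:&lt;/p&gt;

```python
import json

# The "tool catalog" the program owns. The LLM never touches these
# functions; it only ever sees their names and descriptions as text.
def get_weather(city):
    # Stand-in for a real HTTP call to a weather API.
    return {"city": city, "condition": "Cloudy", "temp": "18°C"}

TOOLS = {"get_weather": get_weather}

# Pretend this string came back from the LLM. It is just text.
llm_output = '{"tool": "get_weather", "args": {"city": "Tokyo"}}'

# The program, not the LLM, parses the text and performs the action.
call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["args"])

# The result goes back to the LLM as plain text for the final answer.
print(json.dumps(result))
```

&lt;p&gt;Strip away the branding and every Function Calling implementation is some variant of this parse-and-dispatch step.&lt;/p&gt;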




&lt;h2&gt;
  
  
  Layer 5: MCP — The Tool Catalog
&lt;/h2&gt;

&lt;p&gt;We've got "LLM talks, program walks." But there's a practical problem: &lt;strong&gt;how does the program know what tools are available?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're a new employee with dozens of internal systems. Nobody gives you a tool directory. &lt;strong&gt;MCP is that directory — in a standard format.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vgx83nrliy09kdvw4wu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vgx83nrliy09kdvw4wu.webp" alt="MCP Server: catalog + execution — Program asks, MCP returns tools, Program sends request, MCP executes" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An MCP Server provides two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt; — "What tools do you have?" → returns each tool's name, description, parameters, and return format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt; — "Call get_weather with Tokyo" → runs it, returns the result&lt;/li&gt;
&lt;/ol&gt;
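&lt;p&gt;In code, those two responsibilities are just two methods. A toy, MCP-flavored sketch in Python (the real protocol is JSON-RPC over stdio or HTTP; the class and method names here are invented for illustration):&lt;/p&gt;

```python
class ToyMCPServer:
    """Not the real MCP wire protocol, just its two responsibilities."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, fn):
        self._tools[name] = {"description": description, "fn": fn}

    # 1. Catalog: "What tools do you have?"
    def list_tools(self):
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    # 2. Execution: "Call get_weather with Tokyo"
    def call_tool(self, name, **args):
        return self._tools[name]["fn"](**args)


server = ToyMCPServer()
server.register("get_weather", "Current weather for a city",
                lambda city: {"city": city, "condition": "Cloudy"})

catalog = server.list_tools()                      # shown to the LLM as text
result = server.call_tool("get_weather", city="Tokyo")   # run by the program
```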

&lt;p&gt;Before MCP, every platform had its own way of connecting tools. Build for ChatGPT, rewrite for Claude, rewrite for Gemini. Same tool, three times.&lt;/p&gt;

&lt;p&gt;MCP unified this: &lt;strong&gt;build once, run everywhere.&lt;/strong&gt; Think USB-C — one cable works for everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 6: Agent — The "Talks &amp;amp; Walks" Loop, on Repeat
&lt;/h2&gt;

&lt;p&gt;In Function Calling, the LLM talked once and the program walked once. One round trip. But real problems aren't that simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's the weather here? If it's raining, find me a nearby umbrella shop."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's multiple steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs9mqui7ri71126nnmld.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs9mqui7ri71126nnmld.webp" alt="The agent loop: LLM talks → program walks → feedback → repeat until done" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM says "I need the location tool" → program executes → returns coordinates&lt;/li&gt;
&lt;li&gt;LLM says "Check weather at these coordinates" → program executes → returns "rainy"&lt;/li&gt;
&lt;li&gt;LLM says "Search nearby umbrella shops" → program executes → returns results&lt;/li&gt;
&lt;li&gt;LLM combines everything → outputs the final answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step is the same loop: &lt;strong&gt;talks → walks → feedback → talks again → walks again.&lt;/strong&gt;&lt;/p&gt;
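&lt;p&gt;The whole loop fits in a dozen lines of Python. Here &lt;code&gt;fake_llm&lt;/code&gt; is a scripted stand-in for a real model call and the tools are canned, but the control flow is the actual shape of an agent:&lt;/p&gt;

```python
import json

def fake_llm(history):
    """Stand-in for a real LLM call: picks the next step from history."""
    steps = [
        '{"tool": "get_location", "args": {}}',
        '{"tool": "get_weather", "args": {"coords": "35.6,139.6"}}',
        '{"tool": "search_shops", "args": {"query": "umbrella"}}',
        '{"answer": "It is raining; the nearest umbrella shop is 200m away."}',
    ]
    done = len([h for h in history if h[0] == "tool_result"])
    return steps[done]

TOOLS = {
    "get_location": lambda: "35.6,139.6",
    "get_weather": lambda coords: "rainy",
    "search_shops": lambda query: ["Shop A (200m)"],
}

history = [("user", "Weather here? If raining, find an umbrella shop.")]
while True:
    msg = json.loads(fake_llm(history))            # LLM talks (text only)
    if "answer" in msg:
        answer = msg["answer"]                     # done: final text
        break
    result = TOOLS[msg["tool"]](**msg["args"])     # program walks
    history.append(("tool_result", result))        # feedback, then loop

print(answer)
```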

&lt;p&gt;A system that can plan autonomously, execute across multiple steps, and loop until completion — that's an &lt;strong&gt;Agent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code, Cursor, and GitHub Copilot all call themselves agents. Under the hood, they're running this same loop.&lt;/p&gt;

&lt;p&gt;But here's the key insight: getting the location, checking the weather, searching for shops — the &lt;strong&gt;program does all of that&lt;/strong&gt;. None of it requires intelligence. The LLM's only job? &lt;strong&gt;Deciding what to do next.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An "intelligent agent" is actually assembled from parts that require zero intelligence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 7: Skill — Pre-Written Rules
&lt;/h2&gt;

&lt;p&gt;The agent can plan on its own. But it doesn't know your rules.&lt;/p&gt;

&lt;p&gt;Your team has a deployment checklist — pass all tests, verify env variables, confirm rollback plan, notify on-call. You want the agent to follow this every time. Are you going to type all that out every deploy?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Skill is those rules written into a document, stored in a fixed location.&lt;/strong&gt; It's literally a Markdown file — name, description, steps, rules, format, examples.&lt;/p&gt;

&lt;p&gt;Let's be honest: a Skill is just a prompt that lives in a different place and has a fancier name. But Skills do have one clever design idea — &lt;strong&gt;progressive disclosure:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprnfwfw3n3vcorsl59de.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprnfwfw3n3vcorsl59de.webp" alt="Progressive disclosure: scan catalog → load instructions → follow citations" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1:&lt;/strong&gt; Scan names and descriptions (table of contents)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2:&lt;/strong&gt; Load full instructions when matched (open the chapter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3:&lt;/strong&gt; Load referenced docs/scripts only when needed (check the footnotes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a tradeoff between token cost and information completeness. &lt;strong&gt;Just enough is optimal.&lt;/strong&gt;&lt;/p&gt;
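&lt;p&gt;A sketch of what a skill loader does, in Python. The skills here are inline dicts rather than Markdown files on disk, and the matching is a crude keyword check; the point is only the three levels of laziness:&lt;/p&gt;

```python
# Level 1 metadata is always in context; heavier fields load lazily.
SKILLS = [
    {
        "name": "deploy-checklist",
        "description": "Rules for deploying a service",
        "instructions": "1. Pass all tests\n2. Verify env vars\n"
                        "3. Confirm rollback plan\n4. Notify on-call",
        "references": ["runbooks/rollback.md"],  # Level 3: only if cited
    },
    {
        "name": "code-review",
        "description": "House style for reviewing PRs",
        "instructions": "...",
        "references": [],
    },
]

def build_context(task):
    # Level 1: scan only names + descriptions (the table of contents).
    toc = [(s["name"], s["description"]) for s in SKILLS]
    loaded = []
    for s in SKILLS:
        # Level 2: load full instructions only when the skill matches.
        if s["name"].split("-")[0] in task.lower():
            loaded.append(s["instructions"])
            # Level 3 (s["references"]) would be read only when needed.
    return toc, loaded

toc, instructions = build_context("deploy the payments service")
```

&lt;p&gt;Every skill costs a line or two of tokens up front; the full checklist costs tokens only on the turn that actually deploys something.&lt;/p&gt;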




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;Let's zoom out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1lkp2o71smnfndnk2le.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1lkp2o71smnfndnk2le.webp" alt="The full AI stack — 7 layers from LLM at the base to Skill at the top" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIFNbU2tpbGxdIC0tPnxwcmUtd3JpdHRlbiBydWxlc3wgQQogICAgQVtBZ2VudF0gLS0-fGxvb3Agb24gcmVwZWF0fCBNCiAgICBNW01DUF0gLS0-fHRvb2wgY2F0YWxvZ3wgRkMKICAgIEZDW0Z1bmN0aW9uIENhbGxpbmddIC0tPnx0ZXh0IOKGkiBhY3Rpb258IFAKICAgIFBbUHJvbXB0XSAtLT58aW5zdHJ1Y3Rpb25zfCBDCiAgICBDW0NvbnRleHRdIC0tPnxldmVyeXRoaW5nIHZpc2libGV8IFQKICAgIFRbVG9rZW5dIC0tPnxhdG9taWMgdW5pdHN8IEwKICAgIExbTExNXSAtLT58b3V0cHV0cyB0ZXh0fCBGQwoKICAgIHN0eWxlIEwgZmlsbDojMzMzLHN0cm9rZTojNjY2LGNvbG9yOiNmZmYKICAgIHN0eWxlIFQgZmlsbDojMzMzLHN0cm9rZTojNzc3LGNvbG9yOiNmZmYKICAgIHN0eWxlIEMgZmlsbDojMzMzLHN0cm9rZTojODg4LGNvbG9yOiNmZmYKICAgIHN0eWxlIFAgZmlsbDojMzMzLHN0cm9rZTojOTk5LGNvbG9yOiNmZmYKICAgIHN0eWxlIEZDIGZpbGw6IzFhM2E1YyxzdHJva2U6IzRkYTZmZixjb2xvcjojZmZmCiAgICBzdHlsZSBNIGZpbGw6IzNkMjIwMCxzdHJva2U6I2ZmOWY0Myxjb2xvcjojZmZmCiAgICBzdHlsZSBBIGZpbGw6IzAwMzMzMyxzdHJva2U6IzAwZDJkMyxjb2xvcjojZmZmCiAgICBzdHlsZSBTIGZpbGw6IzJkMTA0NSxzdHJva2U6IzliNTliNixjb2xvcjojZmZm" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIFNbU2tpbGxdIC0tPnxwcmUtd3JpdHRlbiBydWxlc3wgQQogICAgQVtBZ2VudF0gLS0-fGxvb3Agb24gcmVwZWF0fCBNCiAgICBNW01DUF0gLS0-fHRvb2wgY2F0YWxvZ3wgRkMKICAgIEZDW0Z1bmN0aW9uIENhbGxpbmddIC0tPnx0ZXh0IOKGkiBhY3Rpb258IFAKICAgIFBbUHJvbXB0XSAtLT58aW5zdHJ1Y3Rpb25zfCBDCiAgICBDW0NvbnRleHRdIC0tPnxldmVyeXRoaW5nIHZpc2libGV8IFQKICAgIFRbVG9rZW5dIC0tPnxhdG9taWMgdW5pdHN8IEwKICAgIExbTExNXSAtLT58b3V0cHV0cyB0ZXh0fCBGQwoKICAgIHN0eWxlIEwgZmlsbDojMzMzLHN0cm9rZTojNjY2LGNvbG9yOiNmZmYKICAgIHN0eWxlIFQgZmlsbDojMzMzLHN0cm9rZTojNzc3LGNvbG9yOiNmZmYKICAgIHN0eWxlIEMgZmlsbDojMzMzLHN0cm9rZTojODg4LGNvbG9yOiNmZmYKICAgIHN0eWxlIFAgZmlsbDojMzMzLHN0cm9rZTojOTk5LGNvbG9yOiNmZmYKICAgIHN0eWxlIEZDIGZpbGw6IzFhM2E1YyxzdHJva2U6IzRkYTZmZixjb2xvcjojZmZmCiAgICBzdHlsZSBNIGZpbGw6IzNkMjIwMCxzdHJva2U6I2ZmOWY0Myxjb2xvcjojZmZmCiAgICBzdHlsZSBBIGZpbGw6IzAwMzMzMyxzdHJva2U6IzAwZDJkMyxjb2xvcjojZmZmCiAgICBzdHlsZSBTIGZpbGw6IzJkMTA0NSxzdHJva2U6IzliNTliNixjb2xvcjojZmZm" alt="graph TD" width="253" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Layer by layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Function Calling&lt;/strong&gt; — the program turns text into action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP&lt;/strong&gt; — provides the tool catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt; — lets the loop run multiple rounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill&lt;/strong&gt; — pre-written rules that guide the LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG&lt;/strong&gt; — picks relevant info for the desk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; — stitches history back in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;None of these capabilities belong to the LLM itself.&lt;/strong&gt; They're all granted by external programs.&lt;/p&gt;

&lt;p&gt;The LLM's sole contribution? &lt;strong&gt;Outputting the right text at the right time.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Questions That Cut Through Any Buzzword
&lt;/h2&gt;

&lt;p&gt;Next time someone throws a new concept at you — Multi-Agent, Agentic RAG, Orchestration Framework — you only need two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;① What text did the LLM output?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;② Who read that text and turned it into an actual action?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer those two questions, and any concept becomes transparent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM talks, program walks. That loop is how the entire AI world runs.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;See Function Calling happen live in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/harrison001/llm-talks-program-walks.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llm-talks-program-walks
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
python mouth_speaks_hand_acts.py &lt;span class="s2"&gt;"What's the weather in Tokyo?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal labels every step — "This is just TEXT" when the LLM outputs JSON, and "The PROGRAM did this" when the program executes the function. &lt;a href="https://github.com/harrison001/llm-talks-program-walks" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why Your "Fail-Fast" Strategy is Killing Your Distributed System (and How to Fix It)</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 21 Mar 2026 06:34:59 +0000</pubDate>
      <link>https://dev.to/harrison_guo_e01b4c8793a0/why-your-fail-fast-strategy-is-killing-your-distributed-system-and-how-to-fix-it-elg</link>
      <guid>https://dev.to/harrison_guo_e01b4c8793a0/why-your-fail-fast-strategy-is-killing-your-distributed-system-and-how-to-fix-it-elg</guid>
      <description>&lt;p&gt;It's 2 AM. PagerDuty fires. Redis master is down. Your application, trained to fail fast, dutifully fails — every single request, all at once. By the time Sentinel promotes a new master 12 seconds later, you've already generated 40,000 errors and three escalation calls. The system recovered on its own. Your application didn't let it.&lt;/p&gt;

&lt;p&gt;This is the story of how "good engineering" can turn a 12-second infrastructure event into a 12-minute outage — and how to design boundaries that prevent it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — During infrastructure failovers (Redis, Kafka, etcd), blind fail-fast amplifies instability. Bounded retry — centralized, time-boxed, invisible to business logic — absorbs the 10–15 second recovery window without leaking infrastructure noise to users. Resilience is not a library. It is a contract between layers.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Question
&lt;/h2&gt;

&lt;p&gt;When your session storage — Redis, Memcached, or any stateful dependency — becomes temporarily unavailable, you face a fundamental architectural choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you fail fast? Or should you retry?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We all learned fail-fast as gospel. And it is — until it isn't. During transient infrastructure events like leader elections, blind fail-fast propagates instability instead of containing it. The response you choose determines whether the incident resolves itself in 12 seconds or snowballs into a 12-minute outage with three bridge calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Happens During Failover
&lt;/h2&gt;

&lt;p&gt;To understand why fail-fast can backfire, look at the mechanics of a Redis Sentinel failover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10–12s&lt;/td&gt;
&lt;td&gt;Sentinel quorum detects master is down&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Election&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1–2s&lt;/td&gt;
&lt;td&gt;Sentinels agree on a new master&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Promotion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;td&gt;Replica promoted, clients notified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reconnection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1–3s&lt;/td&gt;
&lt;td&gt;Clients re-establish connections&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: these phases overlap. Total failover typically completes in 12–15 seconds, not the sum of individual phases. Reconnection time also depends heavily on your client library — a Sentinel-aware client with topology refresh (e.g., Lettuce, go-redis with Sentinel support) reconnects in under a second, while a naive connection pool can take 30s+.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During this window, your application sees TCP dial timeouts and connection resets. Nothing is broken. No data is lost. The system is doing exactly what it was designed to do — electing a new leader. Your application just needs to not panic for 12 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Blind Fail-Fast Is Dangerous
&lt;/h2&gt;

&lt;p&gt;If your application fails immediately on the first connection timeout during this window, four things happen in rapid succession:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Instability Amplification
&lt;/h3&gt;

&lt;p&gt;A 12-second infrastructure blip becomes a user-visible outage. Every request during the failover window returns an error, even though the system would have recovered on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Infrastructure Semantics Leak Upward
&lt;/h3&gt;

&lt;p&gt;Your business layer now exposes raw infrastructure details — "Redis connection refused" — to clients that have no idea what Redis is or why it matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Uncontrolled Client Retries
&lt;/h3&gt;

&lt;p&gt;Clients receiving errors start retrying independently. If you have 1,000 concurrent users and each failed request is attempted 3 times, you just turned 1,000 QPS into 3,000 QPS — hitting an infrastructure layer that's already struggling to stabilize.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Retry Storms
&lt;/h3&gt;

&lt;p&gt;This is the catastrophic outcome. Unbounded retries create cascading load amplification. CPU spikes prevent recovery. The system enters an instability feedback loop where the act of trying to recover keeps the system down. I've seen retry storms take down entire regions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your timeout config was technically correct. Your system was functionally down. That's not a timeout problem — that's a design problem."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the distinction that actually matters in production: &lt;strong&gt;the failure TYPE must determine your recovery strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Infrastructure-Level&lt;/th&gt;
&lt;th&gt;Business-Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network jitter, leader election, connection reset, &lt;code&gt;READONLY&lt;/code&gt; replica response&lt;/td&gt;
&lt;td&gt;Validation error, permission denial, domain rule violation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transient — will resolve on its own&lt;/td&gt;
&lt;td&gt;Permanent — retrying won't help&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strategy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;ABSORB&lt;/strong&gt; — retry within bounds&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;FAIL FAST&lt;/strong&gt; — return error immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Treating a leader election timeout the same as a schema validation error is an architectural mistake. One will resolve in seconds; the other will never succeed no matter how many times you retry.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure Boundary Model
&lt;/h2&gt;

&lt;p&gt;This is the architectural pattern that makes everything work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIHN1YmdyYXBoIENsaWVudF9MYXllciBbQ2xpZW50IExheWVyXQogICAgICAgIENbUmV0cmllcyBvbmx5IHdoZW4gc2lnbmFsZWQgcmV0cnlhYmxlXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnVzaW5lc3NfTGF5ZXIgW0J1c2luZXNzIExheWVyXQogICAgICAgIEJbUHJlc2VydmVzIHNlbWFudGljIGludGVncml0eTxici8-RkFJTC1GQVNUIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5mcmFzdHJ1Y3R1cmVfQm91bmRhcnkgW0luZnJhc3RydWN0dXJlIEJvdW5kYXJ5XQogICAgICAgIElbQWJzb3JicyB0cmFuc2llbnQgaW5zdGFiaWxpdHk8YnIvPlJFVFJZIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggRGVwZW5kZW5jeSBbRGVwZW5kZW5jeV0KICAgICAgICBEW1JlZGlzIC8gTkFUUyAvIEthZmthIC8gREJdCiAgICBlbmQKCiAgICBDIC0tPiBCCiAgICBCIC0tICJCdXNpbmVzcyBFcnJvcnM6IEZhaWwgSW1tZWRpYXRlbHkiIC0tPiBDCiAgICBCIC0tPiBJCiAgICBJIC0tICJCb3VuZGVkIFJldHJ5OiBBYnNvcmJzIDEwLTE1cyBub2lzZSIgLS0-IEIKICAgIEkgLS0-IEQ%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggVEQKICAgIHN1YmdyYXBoIENsaWVudF9MYXllciBbQ2xpZW50IExheWVyXQogICAgICAgIENbUmV0cmllcyBvbmx5IHdoZW4gc2lnbmFsZWQgcmV0cnlhYmxlXQogICAgZW5kCgogICAgc3ViZ3JhcGggQnVzaW5lc3NfTGF5ZXIgW0J1c2luZXNzIExheWVyXQogICAgICAgIEJbUHJlc2VydmVzIHNlbWFudGljIGludGVncml0eTxici8-RkFJTC1GQVNUIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5mcmFzdHJ1Y3R1cmVfQm91bmRhcnkgW0luZnJhc3RydWN0dXJlIEJvdW5kYXJ5XQogICAgICAgIElbQWJzb3JicyB0cmFuc2llbnQgaW5zdGFiaWxpdHk8YnIvPlJFVFJZIEJPVU5EQVJZXQogICAgZW5kCgogICAgc3ViZ3JhcGggRGVwZW5kZW5jeSBbRGVwZW5kZW5jeV0KICAgICAgICBEW1JlZGlzIC8gTkFUUyAvIEthZmthIC8gREJdCiAgICBlbmQKCiAgICBDIC0tPiBCCiAgICBCIC0tICJCdXNpbmVzcyBFcnJvcnM6IEZhaWwgSW1tZWRpYXRlbHkiIC0tPiBDCiAgICBCIC0tPiBJCiAgICBJIC0tICJCb3VuZGVkIFJldHJ5OiBBYnNvcmJzIDEwLTE1cyBub2lzZSIgLS0-IEIKICAgIEkgLS0-IEQ%3D" alt="graph TD" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The retry boundary sits in the &lt;strong&gt;infrastructure client wrapper&lt;/strong&gt; — the thin layer between your business code and the dependency client. Not in HTTP middleware, not in individual service handlers, not in a sidecar. In the client wrapper itself.&lt;/p&gt;

&lt;p&gt;Why does this matter? Because if retry logic exists at multiple layers, you get retry amplification. I've seen teams with retry in the HTTP handler, the service layer, AND the Redis client — producing 3 × 3 × 3 = 27 attempts per original request. That's not resilience. That's a DDoS against your own infrastructure.&lt;/p&gt;
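&lt;p&gt;The amplification math is worth making concrete. A two-line check of how per-layer attempt counts multiply (the three-layer, three-attempt setup is the hypothetical one from the paragraph above), in Python for brevity:&lt;/p&gt;

```python
from math import prod

# 3 attempts in the HTTP handler x 3 in the service layer x 3 in the client
attempts_per_layer = [3, 3, 3]
total_attempts = prod(attempts_per_layer)  # per original request

print(total_attempts)  # 27 — layers multiply, they never add
```

&lt;p&gt;Add a fourth retrying layer and you're at 81. This is why the policy must live in exactly one place.&lt;/p&gt;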

&lt;p&gt;&lt;strong&gt;Key principles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry belongs at the infrastructure boundary&lt;/strong&gt; — one place, one policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business logic must remain fail-fast&lt;/strong&gt; — semantic errors should never be retried.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;By the time an error reaches the client, it has been vetted and classified.&lt;/strong&gt; We are designing for predictability.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bounded Retry: Implementation
&lt;/h2&gt;

&lt;p&gt;If we're going to retry, we must do it with discipline. Four pillars:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Centralized
&lt;/h3&gt;

&lt;p&gt;Retry logic lives in one place — the infrastructure client wrapper. Not in individual handlers, not in middleware, not in the business layer. One retry boundary per dependency, one policy, one set of metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Time-Bounded
&lt;/h3&gt;

&lt;p&gt;We define a &lt;strong&gt;retry budget&lt;/strong&gt; — for example, 15 seconds. Why 15? Because it encapsulates the 10–12 second Sentinel detection window plus a margin for stabilization and reconnection. Time-based budgets are superior to pure attempt counts because they normalize across different failure modes — a retry that takes 5s per attempt behaves very differently from one that takes 100ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Attempt-Limited with Jitter
&lt;/h3&gt;

&lt;p&gt;Maximum 2–3 retry attempts within the budget window, with exponential backoff and &lt;strong&gt;jitter&lt;/strong&gt;. Without jitter, synchronized retries from multiple application instances create a thundering herd — everyone hits the new master at exactly the same moment.&lt;/p&gt;
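&lt;p&gt;A sketch of that schedule, in Python for brevity. The 500 ms base matches the Go snippet later in this section; the jitter draws uniformly from half the backoff, so two instances never sleep for identical durations:&lt;/p&gt;

```python
import random

def backoff_schedule(max_attempts, base=0.5, seed=None):
    """Exponential backoff with jitter: each sleep lands in
    [backoff, 1.5 * backoff) so instances desynchronize."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts):
        backoff = base * (2 ** attempt)       # 0.5s, 1s, 2s, ...
        jitter = rng.uniform(0, backoff / 2)  # spreads the herd out
        delays.append(backoff + jitter)
    return delays

# Two app instances with different seeds no longer retry in lockstep.
a = backoff_schedule(3, seed=1)
b = backoff_schedule(3, seed=2)
```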

&lt;h3&gt;
  
  
  4. Invisible to Business Logic
&lt;/h3&gt;

&lt;p&gt;If the retry succeeds within the budget, the business layer never knew there was a problem. If it fails, the business layer receives a clean, classified error — not a raw TCP stack trace that means nothing to anyone above the infrastructure layer.&lt;/p&gt;

&lt;p&gt;Here's what this looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Bounded retry wrapper — lives in the infrastructure client layer&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;withBoundedRetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxAttempts&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="c"&gt;// success — business layer never knew&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isRetryable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalizeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// permanent failure — fail fast&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Exponential backoff with jitter&lt;/span&gt;
        &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;
        &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int63n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;normalizeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// budget exhausted — fail deterministically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│          Retry Budget: 15 seconds           │
│                                             │
│  Attempt 1  →  timeout (5s)  →  backoff     │
│  Attempt 2  →  timeout (5s)  →  backoff     │
│  Attempt 3  →  success                      │
│                                             │
│  Total elapsed: ~11s                        │
│  Application impact: ZERO                   │
│                                             │
│  ─── OR ───                                 │
│                                             │
│  Budget exhausted → FAIL DETERMINISTICALLY  │
│  Clean, classified error to business layer  │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Retry is not infinite. Retry is time-boxed. Once the budget is exhausted, we fail deterministically."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Error Normalization
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They retry everything — or nothing. The retry decision must be driven by error classification:&lt;/p&gt;
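&lt;p&gt;This classification is exactly what the &lt;code&gt;isRetryable&lt;/code&gt; / &lt;code&gt;normalizeError&lt;/code&gt; helpers in the Go snippet above encode. A sketch of the same logic, in Python for brevity — the patterns are substrings of real Redis error strings, and the normalized codes follow gRPC naming by convention:&lt;/p&gt;

```python
# (pattern, normalized code, retryable?) for raw infrastructure errors.
CLASSIFICATION = [
    ("dial timeout",     "UNAVAILABLE",        True),   # may recover
    ("connection reset", "UNAVAILABLE",        True),   # transient network
    ("readonly",         "UNAVAILABLE",        True),   # failover in progress
    ("oom",              "RESOURCE_EXHAUSTED", False),  # backpressure
    ("wrongtype",        "INVALID_ARGUMENT",   False),  # schema bug
    ("noperm",           "PERMISSION_DENIED",  False),  # auth failure
]

def normalize(raw_error):
    for pattern, code, retryable in CLASSIFICATION:
        if pattern in raw_error.lower():
            return code, retryable
    return "UNKNOWN", False  # unknown errors are NOT retried by default

def is_retryable(raw_error):
    return normalize(raw_error)[1]
```

&lt;p&gt;Note the default: anything unrecognized fails fast. Retrying only what you have positively classified as transient is what keeps the retry boundary from becoming a retry storm.&lt;/p&gt;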

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Raw Error&lt;/th&gt;
&lt;th&gt;Normalized To&lt;/th&gt;
&lt;th&gt;Retryable?&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TCP dial timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Connection not established, may recover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Connection reset&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Transient network disruption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;READONLY&lt;/code&gt; (replica)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Sentinel failover in progress — replica not yet promoted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Leader election in progress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNAVAILABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Raft/consensus transition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OOM command not allowed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RESOURCE_EXHAUSTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Backpressure — retrying makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WRONGTYPE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INVALID_ARGUMENT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Schema error — will never succeed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;NOPERM&lt;/code&gt; / &lt;code&gt;Permission denied&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PERMISSION_DENIED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Auth failure — will never succeed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NOT_FOUND&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NOT_FOUND&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Semantic absence — retry won't create the resource&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;READONLY&lt;/code&gt; case deserves special attention. During Sentinel failover, a replica that hasn't been promoted yet responds with &lt;code&gt;READONLY&lt;/code&gt; to write commands. If your retry layer treats this as a permanent error, your circuit breaker trips, clients get errors, and a 12-second failover becomes a 5-minute outage while someone manually resets the breaker. Classify &lt;code&gt;READONLY&lt;/code&gt; as &lt;code&gt;UNAVAILABLE&lt;/code&gt; — it will resolve when the new master is promoted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule is simple:&lt;/strong&gt; you cannot leak internal implementation details up the stack. Your retry layer must inspect and reclassify errors — not just map them 1:1. Error semantics must align across every layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Relationship with Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;Bounded retry is the &lt;strong&gt;inner loop&lt;/strong&gt; — it handles transient failures within a known recovery window. But what if the dependency is truly down, not just transitioning?&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;circuit breakers&lt;/strong&gt; serve as the &lt;strong&gt;outer loop&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFJlcSgoUmVxdWVzdCkpIC0tPiBDQntDaXJjdWl0IEJyZWFrZXI8YnIvPidPdXRlciBMb29wJ30KICAgIENCIC0tICJIZWFsdGh5IiAtLT4gQlJbQm91bmRlZCBSZXRyeTxici8-J0lubmVyIExvb3AnXQogICAgQlIgLS0-IERlcFsoRGVwZW5kZW5jeSldCgogICAgQ0IgLS0gIk9wZW46IEZhaWx1cmUgUmF0ZSBIaWdoIiAtLT4gRkZbRmFzdCBGYWlsXQogICAgQlIgLS0gIkJ1ZGdldCBFeGhhdXN0ZWQiIC0tPiBFcnJbTm9ybWFsaXplZCBFcnJvcl0KCiAgICBzdHlsZSBDQiBmaWxsOiNmOWYsc3Ryb2tlOiMzMzMsc3Ryb2tlLXdpZHRoOjJweAogICAgc3R5bGUgQlIgZmlsbDojYmJmLHN0cm9rZTojMzMzLHN0cm9rZS13aWR0aDoycHg%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZ3JhcGggTFIKICAgIFJlcSgoUmVxdWVzdCkpIC0tPiBDQntDaXJjdWl0IEJyZWFrZXI8YnIvPidPdXRlciBMb29wJ30KICAgIENCIC0tICJIZWFsdGh5IiAtLT4gQlJbQm91bmRlZCBSZXRyeTxici8-J0lubmVyIExvb3AnXQogICAgQlIgLS0-IERlcFsoRGVwZW5kZW5jeSldCgogICAgQ0IgLS0gIk9wZW46IEZhaWx1cmUgUmF0ZSBIaWdoIiAtLT4gRkZbRmFzdCBGYWlsXQogICAgQlIgLS0gIkJ1ZGdldCBFeGhhdXN0ZWQiIC0tPiBFcnJbTm9ybWFsaXplZCBFcnJvcl0KCiAgICBzdHlsZSBDQiBmaWxsOiNmOWYsc3Ryb2tlOiMzMzMsc3Ryb2tlLXdpZHRoOjJweAogICAgc3R5bGUgQlIgZmlsbDojYmJmLHN0cm9rZTojMzMzLHN0cm9rZS13aWR0aDoycHg%3D" alt="graph LR" width="1075" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounded retry&lt;/strong&gt; absorbs transient events (leader election, network jitter) — seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker&lt;/strong&gt; protects against sustained outages (dependency truly dead) — minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a circuit breaker, sustained failures chew through retry budgets on every request, wasting resources. Without bounded retry, every transient blip trips the circuit breaker unnecessarily. They are complementary, not redundant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Instrument the Boundary
&lt;/h2&gt;

&lt;p&gt;A production retry boundary must emit metrics. Without them, you're flying blind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_attempt_total&lt;/code&gt;&lt;/strong&gt; — how often retries fire (by dependency, by error type)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_budget_exhausted_total&lt;/code&gt;&lt;/strong&gt; — how often the full budget is consumed without success&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;retry_success_on_attempt&lt;/code&gt;&lt;/strong&gt; — which attempt number succeeds (histogram)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;error_classification&lt;/code&gt;&lt;/strong&gt; — distribution of retryable vs non-retryable errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key alert:&lt;/strong&gt; if retry budget exhaustion rate exceeds ~5%, either your budget is too tight or your dependency is degraded beyond transient. This is the signal that distinguishes a leader election from a real outage — and it's the signal that should trigger your circuit breaker.&lt;/p&gt;
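&lt;p&gt;A stdlib sketch of the boundary's counters; in production these would be exported through a metrics library such as Prometheus, but the exhaustion-rate signal behind the ~5% alert is the same (names follow the list above):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RetryMetrics is a stdlib stand-in for the counters a real setup would
// export via a metrics library. Field comments name the metrics above.
type RetryMetrics struct {
	attempts  atomic.Int64 // retry_attempt_total
	exhausted atomic.Int64 // retry_budget_exhausted_total
	calls     atomic.Int64 // total operations through the boundary
}

// RecordCall is invoked once per operation after the retry loop finishes.
func (m *RetryMetrics) RecordCall(attemptsUsed int, budgetExhausted bool) {
	m.calls.Add(1)
	m.attempts.Add(int64(attemptsUsed))
	if budgetExhausted {
		m.exhausted.Add(1)
	}
}

// ExhaustionRate is the signal the ~5% alert fires on.
func (m *RetryMetrics) ExhaustionRate() float64 {
	c := m.calls.Load()
	if c == 0 {
		return 0
	}
	return float64(m.exhausted.Load()) / float64(c)
}

func main() {
	var m RetryMetrics
	for i := 0; i < 95; i++ {
		m.RecordCall(1, false) // first attempt succeeded
	}
	for i := 0; i < 5; i++ {
		m.RecordCall(3, true) // full budget burned without success
	}
	fmt.Printf("exhaustion rate: %.0f%%\n", m.ExhaustionRate()*100)
	fmt.Println("alert:", m.ExhaustionRate() > 0.05)
}
```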




&lt;h2&gt;
  
  
  Beyond Redis: A Universal Pattern
&lt;/h2&gt;

&lt;p&gt;If this looks Redis-specific, zoom out. The bounded retry pattern applies to any stateful dependency with leader election:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis Sentinel&lt;/strong&gt; — master failover with quorum detection, 10–15s window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NATS JetStream&lt;/strong&gt; — stream leader election in the Raft group, typically 2–5s with default election timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd / Consul&lt;/strong&gt; — Raft leader election, ~1–2s with default settings, but watch streams may buffer longer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; — partition leader election via controller, typically 5–15s depending on &lt;code&gt;replica.lag.time.max.ms&lt;/code&gt; and ISR size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CockroachDB / TiKV&lt;/strong&gt; — range leader election, similar Raft mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mechanics are the same everywhere: a detection window, a brief period of unavailability, and then recovery. Design your retry budget to absorb that window. Calibrate the budget to the specific system — 15s for Redis Sentinel, 5s for NATS, 20s for Kafka.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cross-Layer Contract
&lt;/h2&gt;

&lt;p&gt;Resilience is not a library you &lt;code&gt;import&lt;/code&gt;. It is a &lt;strong&gt;contract between layers&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Absorbs transient instability via bounded retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remains fail-fast for semantic integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retries only when signaled retryable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When failure is bounded and classified, the system becomes &lt;strong&gt;predictable&lt;/strong&gt;. And predictability is the foundation of operational confidence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resilience Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Retry Budget:&lt;/strong&gt; Is my retry window matched to the dependency's failover time (e.g., 15s for Redis)?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Jitter:&lt;/strong&gt; Do my retries have randomized sleep to avoid the "Thundering Herd"?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Error Classification:&lt;/strong&gt; Does my code distinguish between &lt;code&gt;READONLY&lt;/code&gt; (retryable) and &lt;code&gt;PERMISSION_DENIED&lt;/code&gt; (not retryable)?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Centralization:&lt;/strong&gt; Is my retry logic in the client wrapper, not leaked across handlers?&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Observability:&lt;/strong&gt; Do I have an alert if "Retry Budget Exhausted" exceeds 5%?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fail fast — but not during transient infrastructure events.&lt;/strong&gt; A leader election is not a business error. Don't treat it like one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry must be bounded.&lt;/strong&gt; Time-boxed, attempt-limited, with jitter. No open-ended retry loops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry must be centralized.&lt;/strong&gt; One retry boundary per dependency, at the infrastructure layer. Retry in multiple layers = retry amplification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failure semantics must be normalized.&lt;/strong&gt; Retryable vs non-retryable must be explicit. Watch for &lt;code&gt;READONLY&lt;/code&gt; — the most common Sentinel failover gotcha.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resilience requires cross-layer alignment.&lt;/strong&gt; Bounded retry (inner loop) + circuit breaker (outer loop) + observability = production-grade resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should distributed systems always fail fast?
&lt;/h3&gt;

&lt;p&gt;No. Fail fast for &lt;strong&gt;business-level errors&lt;/strong&gt; (validation, permission, domain rules), but use &lt;strong&gt;bounded retry&lt;/strong&gt; for transient infrastructure failures like leader election and temporary network instability.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a reasonable retry budget for Redis Sentinel failover?
&lt;/h3&gt;

&lt;p&gt;In many production setups, &lt;strong&gt;12–15 seconds&lt;/strong&gt; is a practical starting point because it usually covers Sentinel detection, promotion, and client reconnection. Calibrate with your own failover timings and SLOs.&lt;/p&gt;

&lt;h3&gt;
  
  
  If the service already retries, should the client also retry?
&lt;/h3&gt;

&lt;p&gt;Only when explicitly signaled retryable. Blind retries at both layers often create retry amplification and can trigger a retry storm.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is bounded retry different from a circuit breaker?
&lt;/h3&gt;

&lt;p&gt;Bounded retry handles short transient windows (inner loop). Circuit breaker handles sustained dependency failure and stops repeated expensive attempts (outer loop).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not use a Service Mesh (Istio) for retries?
&lt;/h3&gt;

&lt;p&gt;A service mesh can retry at the network layer, but the application has better semantic awareness: only the app knows whether a specific error is safe to retry, based on the operation's idempotency.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I NOT use Bounded Retry?
&lt;/h3&gt;

&lt;p&gt;Avoid it for non-idempotent operations unless you have robust request-ID deduplication in place. And for business errors (the 4xx class), always fail fast.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://harrisonsec.com/blog/rust-vs-c-assembly-performance-safety-analysis/" rel="noopener noreferrer"&gt;Rust vs C Assembly: Complete Performance and Safety Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://harrisonsec.com/blog/legacy-compatibility-lab-full-stack/" rel="noopener noreferrer"&gt;Legacy Compatibility Lab: My Full Stack for Reviving Dead Software&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Distributed systems are not about avoiding failure.&lt;br&gt;
They are about &lt;strong&gt;designing boundaries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If retry is everywhere, the system becomes unpredictable.&lt;br&gt;
If retry is nowhere, transient instability leaks upward.&lt;/p&gt;

&lt;p&gt;The goal is not infinite retry.&lt;br&gt;
&lt;strong&gt;The goal is bounded retry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That boundary is what keeps systems stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resilience is not a library. It is a contract between layers.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Based on a talk I gave on failure boundary design in distributed systems.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;harrisonsec.com&lt;/a&gt;. Listen to the &lt;a href="https://harrisonsec.com/audio/why-failing-fast-triggers-cascading-failures.m4a" rel="noopener noreferrer"&gt;deep dive audio&lt;/a&gt; for a detailed walkthrough.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>redis</category>
      <category>systemdesign</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
