Truong Phung
πŸ—οΈ Building Agents Like Claude Code β€” A Source-Derived Blueprint πŸ“˜

A comprehensive synthesis of the claude-code-from-source project (and companion site claude-code-from-source.com) — distilled into core principles, techniques, and actionable guidelines for builders who want to ship a coding agent of comparable quality.

The source repo is an 18-chapter educational reverse-engineering of Claude Code derived from npm source maps. No proprietary code is reproduced — only architectural pseudocode and design rationale. This guide does the same.


Table of Contents

  0. 💡 TL;DR — the whole agent in one mental picture
  1. 🎯 What you are actually building
  2. 🧱 The six core abstractions
  3. 📦 State: two tiers, one source of truth
  4. ⚙️ The agent loop: AsyncGenerator as control plane
  5. 🔧 Tools: self-describing, fail-closed, parameterized
  6. ⚡ Concurrency and speculative execution
  7. 🔒 Permissions: modes, rules, and bubbling
  8. 🗜️ Context engineering: the 4-layer compression pipeline
  9. 🌐 The API layer: prompt caching as architecture
  10. 🤖 Sub-agents and fork agents
  11. 🕸️ Multi-agent coordination patterns
  12. 🧠 Memory: file-based persistence + LLM recall
  13. 🔌 Skills, hooks, plugins — extensibility surface
  14. 🔗 MCP: the universal external-tool protocol
  15. 🚀 Bootstrap, startup, and rendering performance
  16. 📋 The 10 foundational patterns (cheat sheet)
  17. 🗺️ Build-your-own: a 14-step roadmap
  18. ⚠️ Anti-patterns and pitfalls
  19. 📖 Glossary

0. 💡 TL;DR — the whole agent in one mental picture

Before the details, hold this picture in your head. Everything else is elaboration.

┌─────────────────────────────────────────────────────────────┐
│  query() — async generator, the only place control flows    │
│                                                             │
│   while not done:                                           │
│     state    = compress(state)              # 4 layers      │
│     response = await stream(model, state)                   │
│     yield response.messages                 # to UI         │
│     if no tool_calls:  return  completed                    │
│     batches  = partition(response.tool_calls)               │
│     for batch in batches:                                   │
│       results = run(batch)                  # parallel-safe │
│       yield results.messages                                │
│       state += results                                      │
└─────────────────────────────────────────────────────────────┘
        ▲                ▲                ▲             ▲
        │                │                │             │
   Memory (files)    Tools (self-      Hooks (27       Sub-agents
   loaded into       describing,       lifecycle       (recursive
   system prompt     fail-closed,      events)         query() with
   at session        partitioned by                    isolated state)
   start             safety per-call)

Five rules carry 80% of the design:

  1. 🔄 The loop is an async generator. Backpressure, cancellation, and typed terminal states fall out for free.
  2. 📝 Every tool is self-describing (schema, permissions, concurrency safety). The loop never special-cases tools.
  3. 🛡️ Safety is per invocation, not per tool type. Bash("ls") ≠ Bash("rm -rf").
  4. 💾 Prompt cache is architecture, not optimization. Static-then-dynamic boundary, sticky flags, byte-identical fork prefixes.
  5. 📁 Memory is files. A small LLM picks which to load. No database, no embeddings. Trust through transparency.

If you only build those five things well, you have ~80% of Claude Code. The rest is layering and polish.


1. 🎯 What you are actually building

A production coding agent is not a chat loop with tool calls bolted on. It is a streaming, cancellable, recursive state machine that has to:

  • Survive token-budget exhaustion mid-task without losing the user's work.
  • Run dozens of tools per turn safely, often in parallel, sometimes speculatively.
  • Spawn child agents that cost ~10% of a normal call thanks to prompt cache reuse.
  • Persist semantic knowledge across sessions without a database.
  • Allow third parties to extend it (skills, hooks, MCP) without crashing the host.
  • Boot in under 300 ms and stream the first token in well under a second.

If your design omits any of these, you will hit a wall later. Build for them on day one — most of them are cheap when planned, expensive when retrofitted.

The closing principle of the source book: push complexity to the boundaries. Protocol translation, state reconciliation, external tool invocation, permission checking — these belong at the edges. The interior (loop, memory, tool composition) stays clean and exhaustively typed.


2. 🧱 The six core abstractions

Every part of Claude Code reduces to one of these. Implement them as first-class modules, not as helpers attached to a god object.

| # | Abstraction | Responsibility | Approx LoC in CC |
|---|---|---|---|
| 1 | Query Loop | Async generator that streams model output, runs tools, appends results, decides when to stop. Returns a typed Terminal discriminated union (10 reasons). | ~1,700 |
| 2 | Tool System | Self-describing tools with schema, permissions, concurrency, rendering. Batched into concurrent/serial groups. Speculative execution during streaming. | — |
| 3 | Tasks | Background units following the `pending → running → {completed \| failed \| killed}` state machine. | — |
| 4 | State | Two layers: a mutable singleton STATE (~80 fields, infrastructure) + a 34-line reactive store (UI: messages, approvals, progress). | — |
| 5 | Memory | File-tier persistence (CLAUDE.md, ~/.claude/MEMORY.md, team symlinks). LLM picks relevant memories at session start. | — |
| 6 | Hooks | Lifecycle interceptors at 27 events, in 4 forms: shell command, single-shot prompt, agent loop, HTTP webhook. | — |

Why this carving

  • The Query Loop is the only place control flow lives. Tools, hooks, sub-agents — they all yield through it.
  • State is split because infrastructure mutates rarely but reads constantly; UI is the opposite. One subscription model can't serve both.
  • Memory is its own primitive (not a tool) because it is read on every system-prompt build, before any tool can run.
  • Hooks are first-class because the permission system itself runs partially as PreToolUse hooks. They are not an afterthought.

3. 📦 State: two tiers, one source of truth

State design is where most agent codebases collapse. Claude Code splits it into two tiers with strict layering:

| Tier | What it holds | Mutability | Reachable from |
|---|---|---|---|
| Bootstrap state (STATE) | ~80 fields: originalCwd, sessionId, model overrides, cost accumulators, telemetry handles, prompt-cache allowlists | Mutable through ~100 typed setters | Everywhere — DAG leaf, depends on nothing but Node.js stdlib |
| AppState (reactive store) | Messages, input mode, tool approvals, progress indicators, todos | Immutable snapshots; updater functions only | Inside React components |

Why split them

  • Availability: session ID, telemetry, and cost trackers must exist before React mounts. A reactive store cannot serve them.
  • Access pattern: bootstrap state is read constantly, mutated rarely, with no subscribers. AppState is read by render subscribers on every change. One subscription model can't serve both.
  • Dependency direction: bootstrap depends on nothing → AppState imports bootstrap → React imports AppState. Enforce this with a lint rule. Cycles will sneak in otherwise.

The reactive store in 34 lines

function makeStore(initial, onTransition) {
  let current = initial
  const subs = new Set()
  return {
    read:      () => current,
    update:    (fn) => {
      const next = fn(current)
      if (Object.is(next, current)) return        // skip noop
      const prev = current; current = next
      onTransition?.(prev, next)                  // side effects FIRST
      subs.forEach(cb => cb())                    // then UI
    },
    subscribe: (cb) => { subs.add(cb); return () => subs.delete(cb) },
  }
}

Three deliberate choices:

  • Updater-only mutations. No set(value) API. Stale-closure bugs vanish.
  • Object.is guard. Identical references skip re-renders and side effects.
  • onTransition fires before listeners. Side effects (e.g. persist to disk, notify remote session) complete before the UI flips.

The sticky latch pattern (write-once flags)

A pattern worth memorizing — it applies any time a value influences a server-side cache key:

type Latch = boolean | null   // null = "not yet evaluated"

function shouldSendBetaHeader(featureCurrentlyActive: boolean): boolean {
  const latched = getAfkLatch()
  if (latched === true) return true            // already on — keep sending
  if (featureCurrentlyActive) {
    setAfkLatch(true)                           // first activation — latch
    return true
  }
  return false                                  // never activated
}

The three-state type self-documents intent: null says "we haven't decided yet." Once true, never returns to false. Five such latches in Claude Code prevent mid-session feature toggles from busting 50–70K tokens of cached prompt.

Centralizing side effects on diffs

A real production bug: permission mode was synced to the remote session by only 2 of its 8+ mutation paths, and eventually one drifted. The fix was a single onChangeAppState(prev, next) callback that detects field changes structurally — every mutation path is automatically covered. Side effects scale much more slowly than mutation sites; centralize on diffs, not events.
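
The diff-centralization idea can be sketched in a few lines. The names onChangeAppState and the field set here are illustrative, not Claude Code's actual shape; reference comparison suffices because snapshots are immutable:

```typescript
// Hypothetical sketch: one central diff callback covers every mutation path.
type AppState = { permissionMode: string; todos: string[] };

const sideEffectLog: string[] = [];

// Called on every transition, regardless of which setter ran.
function onChangeAppState(prev: AppState, next: AppState): void {
  if (prev.permissionMode !== next.permissionMode) {
    // e.g. sync to the remote session -- stubbed as a log entry here
    sideEffectLog.push(`sync-permission:${next.permissionMode}`);
  }
  if (prev.todos !== next.todos) {
    // Immutable snapshots: a new array reference means the field changed.
    sideEffectLog.push("persist-todos");
  }
}

// Every mutation path funnels through update(); none needs to remember the sync.
function update(state: AppState, fn: (s: AppState) => AppState): AppState {
  const next = fn(state);
  if (!Object.is(next, state)) onChangeAppState(state, next);
  return next;
}
```

No individual setter carries sync logic; adding a ninth mutation path cannot reintroduce the drift.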

Cost tracking (a concrete example)

Every API response runs through addToTotalSessionCost:

  • Accumulates per-model usage in bootstrap state.
  • Reports to OpenTelemetry.
  • Recursively processes nested model calls (sub-agents, recall queries).
  • Persists to project config on process exit.
  • Restores on next session only if the persisted session ID matches.

Histograms use reservoir sampling (Algorithm R) with 1,024 entries to compute p50/p95/p99. Averages hide tail latency, and tail latency is what users feel.
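
Algorithm R itself is short. A sketch, with class and method names of my own choosing rather than Claude Code's:

```typescript
// Reservoir sampling (Algorithm R): keep a fixed-size uniform sample of a
// stream, then read percentiles off the sorted sample.
class Reservoir {
  private samples: number[] = [];
  private seen = 0;
  constructor(private capacity = 1024) {}

  add(value: number): void {
    this.seen++;
    if (this.samples.length < this.capacity) {
      this.samples.push(value);
    } else {
      // Replace a random slot with probability capacity/seen, keeping every
      // observation equally likely to survive.
      const j = Math.floor(Math.random() * this.seen);
      if (j < this.capacity) this.samples[j] = value;
    }
  }

  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    if (sorted.length === 0) return 0;
    const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
    return sorted[idx];
  }
}
```

Memory stays constant no matter how many requests a session makes, which is the point of using a reservoir instead of storing every latency.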

Actionable: even in v0, instrument cost and latency. You cannot decide what to optimize from feel.


4. βš™οΈ The agent loop: AsyncGenerator as control plane

The loop is an async function* — not a while loop with callbacks, not an event emitter, not an RxJS pipeline. There are three concrete reasons to choose generators:

  1. Backpressure for free. A generator yields only when the consumer calls .next(). The REPL pulls via for await, naturally pausing if the UI can't render fast enough.
  2. Typed terminal states. The generator's return is a discriminated union of why execution stopped: completed, max_turns, error, aborted_streaming, aborted_tools, prompt_too_long, image_error, model_error, stop_hook_prevented, hook_stopped, blocking_limit. The compiler enforces exhaustive handling.
  3. Composability. Inner generators delegate via yield*. No callback nesting, no promise plumbing.
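
A minimal sketch of the typed-terminal idea, showing only a few of the stop reasons listed above:

```typescript
// Discriminated union of stop reasons (abbreviated). The `never` check in the
// default branch makes forgetting a new kind a compile error, not a runtime bug.
type Terminal =
  | { kind: "completed" }
  | { kind: "max_turns"; turns: number }
  | { kind: "model_error"; error: string }
  | { kind: "aborted_tools" };

function describe(t: Terminal): string {
  switch (t.kind) {
    case "completed":     return "done";
    case "max_turns":     return `stopped after ${t.turns} turns`;
    case "model_error":   return `model failed: ${t.error}`;
    case "aborted_tools": return "user aborted during tools";
    default: {
      // Exhaustiveness check: unreachable while every kind is handled above.
      const _never: never = t;
      return _never;
    }
  }
}
```
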

Loop skeleton

async function* query(initialState):
  state = initialState
  while true:
    state = compress(state)              // 4-layer pipeline (§8)
    response = await callModel(state)    // streaming
    yield* response.messages             // surface to UI

    if response.error and recoverable:
      state = recover(state, error)
      continue
    if response.error and not recoverable:
      return { kind: 'model_error', error }
    if not response.toolCalls:
      if stopHookBlocks(state):
        state = applyHookFeedback(state)
        continue
      return { kind: 'completed' }

    batches = partitionToolCalls(response.toolCalls)
    for batch in batches:
      results = await executeBatch(batch, state)
      yield* results.messages
      state = appendToolResults(state, results)

    // re-enter with new state

Continue states (don't return, just continue)

collapse_drain_retry, reactive_compact_retry, max_output_tokens_escalate, max_output_tokens_recovery, stop_hook_blocking, token_budget_continuation, next_turn. Naming each one is what makes the loop testable β€” every test asserts which transition fired.

Error recovery is a ladder, not a fallback

Order matters. From least to most aggressive:

| Trigger | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| prompt_too_long (413) | drain staged collapse summaries | reactive compact | surface to user |
| max_output_tokens | escalate cap 8K → 64K | multi-turn recovery (≤3 attempts) | surface |
| media_size_error | reactive compact | — | surface |

Guards prevent infinite loops: hasAttemptedReactiveCompact one-shot flags, hard caps on recovery attempts, circuit breakers. Never run stop hooks on an error response — that creates "error → hook blocks → retry → error" spirals.

Cancellation

Aborts can hit during streaming or during tool execution. In both cases, the executor must drain remaining requests by emitting synthetic tool_result blocks for queued/running tools. The Anthropic API rejects an assistant message containing a tool_use block without a matching tool_result. signal.reason distinguishes hard aborts from "submit interrupts" (a new user message), so you skip redundant interruption stubs in the latter case.

Actionable: every tool_use your agent emits must have a paired tool_result in message history before the next API call. Make this an invariant your loop enforces, not a hope.
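
A sketch of enforcing that invariant. The block shapes are simplified stand-ins for the Anthropic content-block types, and the stub text is illustrative:

```typescript
// Before the next API call: every tool_use in the last assistant message must
// have a matching tool_result, real or synthetic.
type Block =
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string; content: string };

function drainPending(assistantBlocks: Block[], results: Block[]): Block[] {
  const answered = new Set(
    results
      .filter((b): b is Extract<Block, { type: "tool_result" }> => b.type === "tool_result")
      .map((b) => b.tool_use_id)
  );
  const synthetic: Block[] = [];
  for (const b of assistantBlocks) {
    if (b.type === "tool_use" && !answered.has(b.id)) {
      // Aborted/queued tool: emit a stub so the API accepts the history.
      synthetic.push({ type: "tool_result", tool_use_id: b.id, content: "Interrupted by user" });
    }
  }
  return [...results, ...synthetic];
}
```

Run this on every abort path; the loop can then assert pairing before each call rather than hoping for it.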


5. 🔧 Tools: self-describing, fail-closed, parameterized

Interface

A tool is parameterized by three types: Input, Output, and Progress. The Input doubles as a Zod schema and the JSON Schema given to the model.

The full Tool interface in Claude Code has ~45 members. Five are critical:

  1. call(input, ctx) — runs the work.
  2. inputSchema — Zod schema (validated, plus auto-generated JSON Schema).
  3. isConcurrencySafe(parsedInput) — per invocation, not per type.
  4. checkPermissions(parsedInput, ctx) — returns allow | deny | ask | passthrough with optional updatedInput.
  5. validateInput(parsedInput, ctx) — semantic checks beyond schema (e.g. reject no-op edits).

The buildTool() factory pattern (fail-closed)

Never construct a tool literal directly. Wrap it in a factory that fills in dangerous defaults conservatively:

const SAFE_DEFAULTS = {
  isEnabled:         () => true,
  isConcurrencySafe: () => false,  // serial unless proven otherwise
  isReadOnly:        () => false,  // assume writes
  isDestructive:     () => false,
  checkPermissions:  (input) => ({ behavior: 'allow', updatedInput: input }),
}

If a tool author forgets isConcurrencySafe, they get serial execution — slow, but never corrupting. The opposite default would silently produce race conditions.
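
A hedged sketch of such a factory; the field names are illustrative, not Claude Code's actual ~45-member interface:

```typescript
// Fail-closed buildTool(): unspecified safety fields get the conservative value.
type ToolSpec = {
  name: string;
  call: (input: unknown) => unknown;
  isConcurrencySafe?: (input: unknown) => boolean;
  isReadOnly?: (input: unknown) => boolean;
};

type Tool = Required<ToolSpec>;

function buildTool(spec: ToolSpec): Tool {
  return {
    // Conservative defaults: serial and write-assuming unless stated.
    isConcurrencySafe: () => false,
    isReadOnly: () => false,
    ...spec, // author-provided fields override the safe defaults
  } as Tool;
}
```
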

Tool result shape

type ToolResult<T> = {
  data: T
  newMessages?: Message[]                              // e.g. AgentTool injects sub-agent transcript
  contextModifier?: (ctx: ToolUseContext) => ToolUseContext  // e.g. EnterPlanMode
}

Context modifiers only apply to serial tools. Concurrent tools queue modifiers until the batch completes — otherwise data dependencies and shared state become race-condition territory.

The 14-step execution pipeline (checkPermissionsAndCallTool())

This is the choreography every tool call goes through. Implement it as a single function that returns a ToolResult or ToolError. Skipping any of these steps will hurt later.

| # | Step | Why it matters |
|---|---|---|
| 1 | Tool lookup (with alias map) | Old transcripts may reference renamed tools |
| 2 | Abort check | Don't waste compute on cancelled queued calls |
| 3 | Zod validation | Catch type errors; hint to call ToolSearch for deferred tools |
| 4 | Semantic validation | E.g. reject no-op edits, block sleep if a Monitor tool exists |
| 5 | Speculative classifier start | Fire auto-mode permission classifier in parallel for Bash |
| 6 | Input backfill | Expand ~/foo → absolute paths for hooks/permissions but keep originals for transcript stability |
| 7 | PreToolUse hooks | Hooks decide / modify / block |
| 8 | Permission resolution | Rule match → tool method → mode default → prompt → classifier |
| 9 | Permission denied path | Build error, fire PermissionDenied hook |
| 10 | Execute call() | The actual work |
| 11 | Result budgeting | Persist oversized output to disk; replace with preview |
| 12 | PostToolUse hooks | Modify MCP output, possibly block continuation |
| 13 | Append newMessages | Sub-agent transcripts, system reminders |
| 14 | Error classification | Telemetry, OTel events |

Result budgeting

Per-tool size caps prevent runaway output:

| Tool | maxResultSizeChars | Rationale |
|---|---|---|
| Bash | 30,000 | Most useful output fits |
| Edit | 100,000 | Diffs need room |
| Grep | 100,000 | Search results accumulate |
| Read | ∞ | Self-bounded by token limit; persisting would create circular Read loops |

Above the cap, the system writes the full content to a <persisted-output> file and returns a preview pointing to it. An aggregate ContentReplacementState tracks per-conversation budgets so multiple near-cap results cannot blow context together.
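
The budgeting step can be sketched like this; the preview length and file path here are hypothetical, not Claude Code's actual values:

```typescript
// Oversized tool output is swapped for a preview plus a pointer to the full
// content on disk; small output passes through inline.
type BudgetedResult =
  | { kind: "inline"; content: string }
  | { kind: "persisted"; preview: string; path: string };

function budgetResult(toolName: string, content: string, cap: number): BudgetedResult {
  if (content.length <= cap) return { kind: "inline", content };
  return {
    kind: "persisted",
    preview: content.slice(0, 500) + "\n[truncated: full output persisted to disk]",
    path: `/tmp/persisted-output/${toolName}-${Date.now()}.txt`, // hypothetical location
  };
}
```

The aggregate per-conversation budget described above would sit on top of this, tracking the sum of inline results.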

Deferred loading

Tools marked shouldDefer: true send only { name, description, defer_loading: true } to the API. The model has to call ToolSearch to load full schemas. Three benefits:

  • Smaller initial prompt.
  • Adding/removing a deferred tool changes the prompt by a few tokens, not hundreds — prompt cache stays warm.
  • Less tool-soup confusion for the model.

Tool registry assembly order matters

final = sort(builtins, alpha) ++ sort(mcpTools, alpha)

Sort within each partition, then concatenate. A flat sort across all tools would interleave MCP tools into built-in positions, busting cache breakpoints whenever MCP servers are added/removed.
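
The partition-then-sort rule is a one-liner; a sketch with assumed names:

```typescript
// Built-ins keep stable positions no matter which MCP tools come and go,
// so the cached prompt prefix covering built-ins survives MCP churn.
function assembleRegistry(builtins: string[], mcpTools: string[]): string[] {
  const byName = (a: string, b: string) => a.localeCompare(b);
  return [...[...builtins].sort(byName), ...[...mcpTools].sort(byName)];
}
```
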


6. ⚡ Concurrency and speculative execution

The core insight

Safety is determined per-invocation, not per-tool-type. Bash("ls -la") is concurrency-safe. Bash("rm -rf build/") is not. Same tool. Different inputs. Different verdict.

The partition algorithm

partitionToolCalls(calls):
  batches = []
  current = { kind: 'concurrent', tools: [] }
  for call in calls:
    tool = lookup(call.name)
    parsed = tool.inputSchema.safeParse(call.input)
    safe = parsed.success and tool.isConcurrencySafe(parsed.data)
    if safe and current.kind == 'concurrent':
      current.tools.push(call)
    else if safe:
      batches.push(current); current = { kind: 'concurrent', tools: [call] }
    else:
      if current.tools: batches.push(current)
      batches.push({ kind: 'serial', tools: [call] })
      current = { kind: 'concurrent', tools: [] }
  if current.tools: batches.push(current)
  return batches

Example: [Read, Read, Grep, Edit, Read] → [concurrent[Read, Read, Grep], serial[Edit], concurrent[Read]].

Parsing failure → serial. Safety-check exception → serial. Always fail closed.

Speculative streaming execution

The StreamingToolExecutor watches the model stream. The moment a tool_use block is fully parsed (often seconds before the response finishes), it starts that tool β€” provided admission rules allow.

Admission rule: a tool can start executing iff no tool is currently running, or both the new tool and all currently-running tools are concurrency-safe.
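
The admission rule reads directly as a predicate (a sketch, with assumed field names):

```typescript
// A candidate may start iff nothing is running, or the candidate and every
// currently-running tool are all concurrency-safe.
type RunningTool = { name: string; concurrencySafe: boolean };

function canStart(candidateSafe: boolean, running: RunningTool[]): boolean {
  if (running.length === 0) return true; // nothing running: anything may start
  return candidateSafe && running.every((t) => t.concurrencySafe);
}
```
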

Sequential timeline: stream 2.5s + 3 serial tools = 3.1s
Speculative: stream 2.5s overlapped with tools 1–2; total 2.6s

Tool states: Queued → Executing → Completed → Yielded. Yield in submission order, not completion order — even if c.ts finishes before a.ts, the conversation history must remain a, b, c.

Error cascade policy

  • Bash errors cascade within a batch. Shell commands form implicit pipelines; running cp after a failing mkdir is pointless.
  • Read/Grep errors isolate. One file read failure has no bearing on a sibling grep.

Cancelled siblings get synthetic results: "Cancelled: parallel tool call Bash(mkdir build) errored".

Interrupt behavior

Each tool declares interruptBehavior(): 'cancel' | 'block'. The executor treats an executing batch as interruptible only when all tools in it support cancel. A single block tool blocks user Esc for the whole batch.


7. 🔒 Permissions: modes, rules, and bubbling

Seven modes (most → least permissive)

| Mode | Behavior |
|---|---|
| bypassPermissions | No checks (testing only) |
| dontAsk | Auto-deny prompts (background agents — never block on user input) |
| auto | Lightweight LLM classifier evaluates each call against transcript |
| acceptEdits | File edits auto-allowed; other mutations prompt |
| default | Standard interactive — user approves each action |
| plan | Read-only; all writes denied |
| bubble | Sub-agent escalates the decision to its parent |

Sub-agents default to bubble. Background agents default to dontAsk (they can't block on a prompt that has no UI).

Resolution chain

1. Hook decision?         → final
2. allowedRules / deniedRules / askRules match?  → final
3. tool.checkPermissions()  → allow | deny | ask | passthrough
4. Mode default
5. (interactive only) prompt user
6. (auto only) classifier

Rules

Three pieces: source (tracks provenance), ruleBehavior (allow/deny/ask), ruleValue (with optional content patterns).

  • Bash(git *) — Bash commands starting with git
  • Edit(/src/**) — file edits restricted to /src
  • Fetch(domain:example.com) — HTTP fetches limited to that domain

For Bash, parse the command via a real bash AST parser (parseForSecurity()), split on && || ; |, and classify each subcommand. If the parser fails, return fail-safe behavior — assume any command it can't parse is unsafe.
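
A hedged sketch of the fail-closed shape only: a toy regex splitter stands in for the real bash AST parser, and anything it cannot confidently handle falls through to ask:

```typescript
// Toy classifier: split on shell connectors, require every subcommand to match
// an allowed prefix. Constructs a naive splitter cannot handle (subshells,
// substitution, redirection) are treated as unsafe -- fail closed.
function classifyCommand(cmd: string, allowedPrefixes: string[]): "allow" | "ask" {
  if (/[$`()<>]/.test(cmd)) return "ask"; // defeats naive splitting: bail out
  const subcommands = cmd.split(/&&|\|\||;|\|/).map((s) => s.trim());
  const everyAllowed = subcommands.every((sub) =>
    allowedPrefixes.some((p) => sub === p || sub.startsWith(p + " "))
  );
  return everyAllowed ? "allow" : "ask";
}
```

A production version needs a real parser; the point of the sketch is the default direction, not the parsing.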


8. πŸ—œοΈ Context engineering: the 4-layer compression pipeline

Run before every API call, in this strict order:

| Layer | What it does | Cost |
|---|---|---|
| 0. Tool result budget | Enforce per-message size caps; exempt tools without finite maxResultSizeChars | Trivial |
| 1. Snip compact | Physically remove old messages; emit UI boundary marker; report tokens freed | Cheap |
| 2. Microcompact | Drop tool results by tool_use_id once unneeded; cache edits via deferred boundary messages | Cheap |
| 3. Context collapse | Replace conversation spans with summaries (granular) | Medium |
| 4. Auto-compact | Fork an entire Claude conversation to summarize history; circuit-break after 3 consecutive failures | Heavy |

Why ordering matters: if collapse alone gets tokens below the auto-compact threshold, auto-compact never runs — so you keep fine-grained recent history.

Budget thresholds

  • Auto-compact triggers at effectiveContextWindow − 13,000 tokens.
  • Hard blocking limit at effectiveContextWindow − 3,000.
  • The 10K-token gap between them is where reactive compact runs if the proactive pass failed.

Token counting blends authoritative API usage numbers with rough estimates for messages added since the last response — biased conservative so compaction fires slightly early.

Actionable: instrument both estimated and authoritative token counts, log the delta. When the delta drifts, your estimator is broken and your safety margins are wrong.


9. 🌐 The API layer: prompt caching as architecture

Prompt caching is not an optimization. It is an architectural constraint. Every design decision either preserves cache hits or busts them.

Multi-provider abstraction

A single getAnthropicClient() factory dispatches to one of:

  • Direct API (key or OAuth)
  • AWS Bedrock
  • Google Vertex AI
  • Azure Foundry

Provider chosen at boot from env vars + config. Stored in bootstrap state; never re-checked. SDKs dynamically imported (don't load Bedrock if you're on direct API).

A buildFetch wrapper injects an x-client-request-id UUID header on every request, so you can correlate client-side timeouts with server-side logs.

Cache scopes

Scope Where TTL
Global Static prompt prefix shared across all users Long
1-hour Eligible users' extended cache 60 min
Ephemeral (default) Per-session ~5 min

The system prompt has a literal === DYNAMIC BOUNDARY === marker:

  • Above (cacheScope: global): identity, system rules, task guidance, tool usage instructions, tone/style.
  • Below (per-session): session guidance, CLAUDE.md, env info, language, MCP instructions (uncached, marked dangerous), output style.

Rule: every runtime if above the boundary doubles the cache key space. 3 conditionals = 8 prefixes. 5 = 32. Compile-time feature flags are fine; runtime checks must live below the boundary.

Global scope is disabled when MCP tools are present — user-specific tool definitions would fragment the global cache into millions of unique prefixes.

Sticky latches

Five session-scoped boolean flags that, once set, cannot be unset for the rest of the session. They control beta/feature headers. Reason: "mid-session toggles don't change the server-side cache key" — flipping a flag would bust 50–70K tokens of cached context.

Pattern: Once(value) — a setter that throws or no-ops on second call. Use this for any cache-influencing config.
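
A sketch of Once as a write-once cell. This variant no-ops rather than throws, and latches whatever value is written first:

```typescript
// Write-once cell: the first set() wins; later writes are silently ignored.
function once<T>(initial: T | null = null) {
  let value = initial;
  return {
    get: () => value,
    set: (v: T) => {
      if (value !== null) return value; // already latched -- no-op
      value = v;
      return value;
    },
  };
}
```

Any config that feeds a server-side cache key goes behind a cell like this, so a mid-session toggle physically cannot change the request prefix.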

Output token slot reservation

Production p99 output = 4,911 tokens. Default SDK reservation = 32K–64K. Over-reservation = 8–16×.

Strategy: cap default max_tokens at 8K. On the rare truncation (<1% of requests), retry with 64K. Recovers 12–28% of the context window for free.
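
The cap-then-escalate strategy in sketch form; ModelCall is a stand-in for the real streaming client, with the caps following the text:

```typescript
// First attempt with a tight cap covering ~p99 of real outputs; retry wide
// only on the rare truncation.
type ModelCall = (maxTokens: number) => Promise<{ text: string; truncated: boolean }>;

async function callWithEscalation(call: ModelCall): Promise<string> {
  const first = await call(8_000);   // tight cap: frees context-window slots
  if (!first.truncated) return first.text;
  const retry = await call(64_000);  // rare escalation path (<1% of requests)
  return retry.text;
}
```
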

Streaming: skip the SDK helper

The SDK's BetaMessageStream calls partialParse() on every input_json_delta — repeatedly re-parsing growing JSON from scratch (O(n²)). Use raw Stream<BetaRawMessageStreamEvent> and accumulate tool-input strings yourself.

Watchdog and fallback

  • Idle watchdog: setTimeout(90s) reset on every chunk. At 45s, warn. At 90s, abort and retry non-streaming.
  • Non-streaming fallback activates when streaming dies mid-response (network, stall, truncation, proxies returning 200 with non-SSE bodies).
  • Disable fallback when streaming tool execution is active — duplicate tool runs would corrupt state.

10. 🤖 Sub-agents and fork agents

Single-agent capability has a hard ceiling. The fix is recursive: spawn child agents that are the same loop with isolated state.

AgentTool input schema (dynamic)

| Field | Purpose |
|---|---|
| description | 3–5 word task summary |
| prompt | Full instructions |
| subagent_type | Specialization key (optional) |
| model | Override (haiku/sonnet/opus) |
| run_in_background | Async execution |
| name | For team addressability |
| isolation | worktree (filesystem clone) or remote |

Critical pattern: feature-gate the schema itself. "The model never sees fields it cannot use." Don't tell the model "don't use name here" — remove name from the schema in this context. The model cannot misuse what it cannot see.

Output (discriminated union)

  • Sync: { status: 'completed', prompt, ...result }
  • Async: { status: 'async_launched', agentId, outputFile } — outputFile is a filesystem path that fills in when the bg agent completes; parents poll independently of process state.

The 15-step lifecycle (runAgent())

  1. Model resolution — caller override > agent definition > parent model > default. Read-only agents default to Haiku.
  2. Agent ID — agent-<hex>. Override path supports resuming a backgrounded agent.
  3. Context preparation — fork agents clone parent history (after filterIncompleteToolCalls()); fresh agents start empty.
  4. CLAUDE.md stripping — read-only agents (Explore, Plan) omit project instructions. Saves ~10.2% of fleet cache_creation tokens.
  5. Permission isolation — per-agent getAppState() overlay. Permissive parent modes (bypass, acceptEdits) always win.
  6. Tool resolution — fork agents reuse parent's exact array byte-for-byte; normal agents apply allow/deny lists. General-purpose agents cannot spawn sub-agents (prevents exponential fan-out).
  7. System prompt — fork agents inherit pre-rendered bytes; normal agents call agentDef.getSystemPrompt(ctx).
  8. Abort controller — sync agents share parent's controller (Esc kills both). Async agents get an independent one (survive parent abort).
  9. Hook registration — agent-id-scoped, auto-cleanup on termination.
  10. Skill preloading — declared in frontmatter, loaded concurrently to mask latency, prepended as a user message.
  11. MCP initialization — inline servers (cleaned on termination) or shared configs (memoized, persistent). Must complete before context creation so tools are in the pool when snapshotted.
  12. Context creation — createSubagentContext() makes isolation decisions:

     | Aspect | Sync | Async |
     |---|---|---|
     | setAppState | shared | isolated |
     | setAppStateForTasks | shared | shared |
     | readFileState | own cache | own cache |
     | abortController | parent's | independent |
  13. Cache-safe params callback — for bg agents; lets the summarization service fork the conversation with cache-identical prefix.

  14. Query loop — same query() function. Yields back to caller, records to sidechain JSONL transcript, forwards metrics.

  15. Cleanup (finally) — MCP cleanup, hook clear, agent tracking, file cache, message GC, kill orphan shell tasks, remove agent's todos.

Fork agents: cache-driven subprocess design

The point of a fork is a byte-identical request prefix to the parent, so children pay ~10% of the normal input-token cost.

Three mechanisms make this work:

  1. System prompt threading — pass the parent's already-rendered bytes via override.systemPrompt. Don't regenerate; feature flags or the session date may have changed.
  2. Exact tool passthrough — useExactTools: true. No filtering, no reordering, no re-serialization. Even forbidden tools (like AgentTool itself) stay in the array — runtime guards prevent misuse.
  3. Placeholder tool results — buildForkedMessages() clones the parent's last assistant message. For each tool_use, it inserts a constant placeholder string "Fork started -- processing in background". Same string for every child → same bytes.

Resulting structure: [...shared_history, assistant(all_tool_uses), user(placeholders..., directive)].

Only the final directive differs across children. With a 48,500-token shared prefix and 5 children, savings exceed 90% on input tokens for children 2–5.
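
The placeholder mechanism can be sketched like this; Msg is a simplified message shape, with buildForkedMessages following the text's name:

```typescript
// Every child sees identical bytes up to the final directive, so only the
// first child pays full input-token price; the rest hit the prompt cache.
type Msg = { role: "assistant" | "user"; content: string[] };

function buildForkedMessages(shared: Msg[], lastAssistant: Msg, directive: string): Msg[] {
  // One constant placeholder per tool_use in the cloned assistant message.
  const placeholders = lastAssistant.content.map(
    () => "Fork started -- processing in background"
  );
  return [
    ...shared,
    lastAssistant,
    { role: "user", content: [...placeholders, directive] }, // only this differs
  ];
}
```
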

When fork is disabled

  • Coordinator mode — coordinators have a structured-delegation prompt children would inappropriately inherit.
  • Non-interactive — fork uses permissionMode: 'bubble', which needs a user-facing prompt.
  • Explicit subagent_type — the user picked Explore/Plan/etc., so fork yields.

Recursive fork prevention (defense in depth)

  1. Primary: child's context.options.querySource = 'agent:builtin:fork'. AgentTool checks this before allowing fork.
  2. Fallback: scan message history for the boilerplate XML tag if querySource was lost in transit.

Six built-in agent archetypes

| Archetype | Model | Tools | Notable |
|---|---|---|---|
| General-purpose | Default | All except Agent | Workhorse |
| Explore | Haiku | Read-only | Omits CLAUDE.md, one-shot prompt (saves 135 chars/invocation) |
| Plan | Inherit | Read-only | 4-step process, must end with "Critical Files" list |
| Verification | Inherit | Read-only, async | System prompt explicitly anti-rationalization; requires adversarial probe |
| Claude Code Guide | Haiku | dontAsk mode | Doc fetcher; system prompt injects user's configured skills/agents/MCP |
| Statusline Setup | Sonnet | Read + Edit only | Narrowly-scoped specialist |

Frontmatter format for user-defined agents

```yaml
---
description: "When to use this"
tools: [Read, Bash]
disallowedTools: [FileWrite]
model: haiku
permissionMode: dontAsk
maxTurns: 50
skills: [my-skill]
mcpServers: [slack, {my-server: {command: node, args: [./server.js]}}]
hooks:
  PreToolUse:
    - command: "echo validating"
---

# System prompt body in markdown...
```

Trust hierarchy (least to most trusted): user agents < plugin agents < policy agents < built-in. User-agent hooks/MCP are silently skipped under strictPluginOnlyCustomization — graceful degradation, not an error.


11. 🕸️ Multi-agent coordination patterns

Three distinct shapes:

A. Simple background delegation

Fire-and-forget. Tests, searches, lints. No coordination protocol.

B. Coordinator mode

Hierarchical manager-worker. The coordinator gets only three tools: Agent (spawn), SendMessage (talk), TaskStop (kill). That's it. By design.

"The coordinator's job is to think, plan, decompose, and synthesize. Workers do the work."

Critical principle: never delegate understanding. Coordinators must give workers exact file paths, exact line numbers, exact change descriptions — not "based on the research, fix the bug."

Workflow phases:

  1. Research — multiple workers explore in parallel
  2. Synthesis — coordinator (not workers) integrates findings
  3. Implementation — workers receive precise instructions
  4. Verification — workers validate

C. Swarm teams

Peer-to-peer. Same process, isolated via AsyncLocalStorage, file-based mailboxes. Each message has metadata (sender, timestamp, color for UI).

Three interruption levels:

  • Abort current work — cancel the turn, keep operating
  • Shutdown request — cooperative graceful wind-down
  • Kill — hard abort via controller

Task state machine (universal)

All background work — bash, sub-agents, remote sessions, teammates, dreams — flows through one state model:

```
pending → running → { completed | failed | killed }
```

Seven task types with single-char visual prefixes: local_bash (b), local_agent (a), remote_agent (r), in_process_teammate (t), local_workflow (w), monitor_mcp (m), dream (d).
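The state model above can be sketched as a discriminated transition table (a minimal illustration; the `TaskStatus` and `advance` names are mine, not the source's):

```typescript
// Universal task state machine: pending → running → { completed | failed | killed }.
type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed';

const TRANSITIONS: Record<TaskStatus, TaskStatus[]> = {
  pending: ['running', 'killed'],              // a task may be killed before it starts
  running: ['completed', 'failed', 'killed'],
  completed: [],                               // terminal
  failed: [],                                  // terminal
  killed: [],                                  // terminal
};

function advance(current: TaskStatus, next: TaskStatus): TaskStatus {
  if (!TRANSITIONS[current].includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}
```

Because every task type flows through the same five states, the UI, kill switches, and cleanup logic only ever need to handle this one model.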

SendMessage dispatch order

  1. Bridge (bridge:<session-id>) — cross-machine via Remote Control relays
  2. UDS (uds:<socket-path>) — local IPC via Unix Domain Sockets
  3. In-process — agent IDs / names of running agents
  4. Team mailbox — file-based queue

Killer feature: transparent agent resumption. Sending a message to a "dead" agent automatically resurrects it from its disk transcript. The conversation simply continues.

Command queue invariant

Messages are delivered between tool rounds, never mid-execution. The agent finishes the current turn, then receives new info. No race conditions, no corrupted state. Make this a hard rule — it's the cheapest way to get correctness in multi-agent comms.
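A minimal sketch of that invariant, assuming a hypothetical CommandQueue that the loop drains only at round boundaries:

```typescript
// Inbound messages queue up at any time, but are delivered to the agent
// only between tool rounds — never while a tool is executing.
class CommandQueue<T> {
  private pending: T[] = [];

  enqueue(msg: T): void {
    this.pending.push(msg); // safe to call at any time, even mid-tool-run
  }

  // Called by the agent loop *between* tool rounds.
  drain(): T[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

async function runRound(
  queue: CommandQueue<string>,
  tools: Array<() => Promise<void>>,
): Promise<string[]> {
  for (const tool of tools) await tool(); // no delivery happens in here
  return queue.drain();                   // new info arrives only now
}
```

Messages that arrive while a tool is running simply wait; the agent sees a consistent world at every decision point.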

Pattern selection

| Scenario | Pattern |
|---|---|
| Single bg task | Delegation |
| Multi-file refactor with research phase | Coordinator |
| Long-running collaborative dev | Swarm |

Operational guardrail

A 50-message memory cap on in-process teammates exists because a real production incident reached 36.8 GB across 292 agents. Plan for unbounded fan-out from day one or it will hurt you.
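A bounded mailbox is only a few lines; the cap value mirrors the guardrail above, everything else here is illustrative:

```typescript
// Teammate mailbox with a hard cap: oldest messages are evicted first,
// so memory stays bounded no matter how chatty the swarm gets.
class BoundedMailbox<T> {
  private messages: T[] = [];
  constructor(private readonly cap = 50) {} // mirrors the 50-message cap above

  push(msg: T): void {
    this.messages.push(msg);
    if (this.messages.length > this.cap) {
      this.messages.splice(0, this.messages.length - this.cap); // evict oldest
    }
  }

  size(): number {
    return this.messages.length;
  }
}
```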


12. 🧠 Memory: file-based persistence + LLM recall

Why files, not a database

  • Transparency — users open .md files and see exactly what the agent remembers. Trust through observability, not capability.
  • Modification time is a built-in epistemological signal: "when was this observation recorded?"
  • Zero infrastructure — no schema migrations, no indexes, no backups.

Layout

```
~/.claude/projects/<sanitized-git-root>/memory/
  MEMORY.md                        # always loaded; index only; ≤200 lines, ≤25 KB
  user_role.md                     # one memory per file
  feedback_testing.md
  project_migration_q2.md
  team/                            # shared via symlink
  logs/YYYY/MM/YYYY-MM-DD.md       # KAIROS append-only mode
```

Four-type taxonomy

| Type | Purpose |
|---|---|
| user | Role, expertise, preferences |
| feedback | Corrections + validated approaches (lead with the rule, then Why: and How to apply: lines) |
| project | Active work context with absolute dates (always convert "Thursday" → 2026-03-05) |
| reference | Pointers to external systems (Linear, Slack channels) |

Derivability test: if git log / git blame / the code itself can answer it, don't memorize it. No code patterns, no architecture, no debug fix recipes.

Frontmatter contract

```yaml
---
name: <title>
description: <one-line summary used by recall LLM>
type: user | feedback | project | reference
---

<body — for feedback/project, structure as: rule → **Why:** → **How to apply:**>
```

The description field carries the most weight — it's the LLM-recall index.

Two-tier retrieval

  • Tier 1 (always loaded): MEMORY.md index (~3,000 tokens for ~150 entries). Lines after 200 are truncated.
  • Tier 2 (on-demand): an async Sonnet side-query gets the manifest (type, name, date, description), the user's current query, and recent tool history. Returns up to 5 filenames as structured JSON. Validated against the file list to catch hallucination.

This trades a few hundred ms of latency for semantic precision that keyword matching cannot achieve — especially for negation ("do NOT use mocks").
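The hallucination guard at the end of Tier 2 can be sketched like this (names and shapes are mine; the source only specifies that returned filenames are validated against the file list):

```typescript
// Validate the recall model's JSON output: keep only filenames that
// actually exist in the manifest, capped at 5 — anything else is
// treated as hallucination and dropped.
interface MemoryEntry {
  name: string;
  type: string;
  description: string;
}

function validateRecall(rawJson: string, manifest: MemoryEntry[], max = 5): string[] {
  let picked: unknown;
  try {
    picked = JSON.parse(rawJson);
  } catch {
    return []; // malformed model output → recall nothing, fail closed
  }
  if (!Array.isArray(picked)) return [];
  const known = new Set(manifest.map((m) => m.name));
  return picked
    .filter((f): f is string => typeof f === 'string' && known.has(f))
    .slice(0, max);
}
```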

Staleness policy

Don't expire. Annotate. Today/yesterday → no caveat. Older → human-readable warning ("This memory is 47 days old — code claims may be outdated"). Models reason better about "47 days ago" than ISO timestamps.
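A sketch of the annotation rule, assuming modification time is available (the threshold and wording mirror the text; the function name is mine):

```typescript
// Annotate memory age instead of expiring: fresh memories get no caveat,
// older ones get a human-readable warning the model can reason about.
function stalenessCaveat(modifiedAt: Date, now: Date): string | null {
  const days = Math.floor((now.getTime() - modifiedAt.getTime()) / 86_400_000);
  if (days <= 1) return null; // today/yesterday: no caveat
  return `This memory is ${days} days old — code claims may be outdated`;
}
```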

Write path (two-step)

  1. Write <type>_<topic>.md with frontmatter + body.
  2. Add a one-line pointer to MEMORY.md: - [Title](file.md) — one-line hook.

A background extraction agent runs at loop completion to catch memories the main agent missed.

KAIROS continuous mode

For long-lived sessions, replace two-step writes with append-only daily logs in logs/YYYY/MM/. A separate consolidation pass (after 24h or 5+ modified sessions) merges logs into structured memories.

Security (team paths)

Three-layer validation, all fail-closed:

  1. Input sanitization (null bytes, traversal sequences, Unicode attacks)
  2. String-level path validation with trailing-separator checks
  3. Symlink resolution against the deepest existing ancestor

No partial-success fallbacks. Reject early, reject completely.


13. 🔌 Skills, hooks, plugins — extensibility surface

Skills: two-phase loading

The killer pattern. 50 skills shouldn't cost 50 docs of system-prompt tokens at startup.

  • Phase 1 (startup): parse YAML frontmatter only — name, description, when_to_use. Inject into the system prompt as a directory.
  • Phase 2 (invocation): load full markdown body, substitute $ARGUMENTS and ${CLAUDE_SESSION_ID}, execute inline shell commands, prepend as a user message.

You pay the token cost only when the skill actually runs.
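Phase 1 can be sketched with a deliberately tiny frontmatter scanner (flat `key: value` pairs only — an assumption for illustration; a real implementation would use a YAML parser):

```typescript
// Phase 1 of two-phase skill loading: read only the frontmatter fields
// needed for the system-prompt directory. The body stays on disk until
// the skill is actually invoked.
interface SkillStub {
  name: string;
  description: string;
  when_to_use: string;
}

function parseFrontmatterOnly(md: string): SkillStub | null {
  const m = md.match(/^---\n([\s\S]*?)\n---/);
  if (!m) return null;
  const fields: Record<string, string> = {};
  for (const line of m[1].split('\n')) {
    const i = line.indexOf(':');
    if (i > 0) fields[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return {
    name: fields.name ?? '',
    description: fields.description ?? '',
    when_to_use: fields.when_to_use ?? '',
  };
}
```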

Skill source priority (highest → lowest)

  1. Managed (policy / enterprise)
  2. User (~/.claude/skills/)
  3. Project (.claude/skills/)
  4. --add-dir flag
  5. Legacy commands
  6. Bundled
  7. MCP (remote, untrusted)

Hard security boundary: MCP skills never execute inline shell commands. External MCP servers are content-only. No exceptions.

Frontmatter controls

```yaml
name: my-skill
description: ...
when_to_use: ...
disable-model-invocation: false   # block autonomous use
context: fork                     # run as sub-agent with own token budget
paths: ["src/**/*.ts"]            # conditional activation
hooks:
  PreToolUse: [...]
```

Hooks: 27 events, 6 types

User-configurable:

  • Command — spawn shell process, read stdout/exit code
  • Prompt — lightweight LLM call
  • Agent — multi-turn loop (max 50 turns)
  • HTTP — POST to remote policy server

Internal:

  • Callback — programmatically registered
  • Function — session-scoped TypeScript

Top 5 lifecycle points to know:

| Hook | Fires | Can do |
|---|---|---|
| PreToolUse | Before tool execution | Block / modify / approve / inject context |
| PostToolUse | After successful execution | Inject feedback, replace MCP output |
| Stop | Before Claude concludes | Force continuation (verification loops) |
| SessionStart | Session begin | Cannot block |
| UserPromptSubmit | User submits | Block (input validation) |

Other events span tool lifecycle (PostToolUseFailure, PermissionDenied, PermissionRequest), session (SessionEnd, Setup), subagents (SubagentStart, SubagentStop), compaction (PreCompact, PostCompact), notifications, configuration, file watching, task tracking — 27 in total.

Snapshot security model

captureHooksConfigSnapshot() freezes hook config at startup. If malicious code modifies .claude/settings.json mid-session, the snapshot prevents the change from taking effect. Only the /hooks command or the file watcher can update the live config.

Policy cascade: enterprise hooks cannot be disabled by users; allowManagedHooksOnly restricts to policy-approved hooks.

Exit code semantics (command hooks)

| Code | Meaning |
|---|---|
| 0 | Success |
| 2 | Blocking error (deliberately uncommon, to prevent accidental enforcement) |
| other | Non-blocking warning |
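The mapping is a three-way branch (the HookOutcome type is mine, for illustration):

```typescript
// Command-hook exit-code semantics: only exit code 2 blocks; any other
// non-zero code is a non-blocking warning.
type HookOutcome = 'success' | 'blocking-error' | 'warning';

function interpretExitCode(code: number): HookOutcome {
  if (code === 0) return 'success';
  if (code === 2) return 'blocking-error'; // the only code that blocks
  return 'warning';                        // everything else is non-blocking
}
```

Making the blocking code unusual (2 rather than "any non-zero") means a hook script that merely crashes cannot accidentally veto the agent.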

Skill ↔ hook integration

When a skill is invoked, its frontmatter hooks register as session-scoped. The skill directory becomes CLAUDE_PLUGIN_ROOT for those hook commands. once: true removes the hook after first execution. For sub-agents, Stop hooks auto-convert to SubagentStop to fire at the correct lifecycle point.


14. 🔗 MCP: the universal external-tool protocol

Skills and hooks extend the agent in-process. MCP (Model Context Protocol) is the standard way third parties extend it out-of-process — across servers, vendors, and trust boundaries. If you want a tool ecosystem you don't control, this is the layer that makes it possible.

Eight transports, three deployment shapes

| Shape | Transport | Use |
|---|---|---|
| Local process | stdio (default) | Subprocess; JSON-RPC over stdin/stdout; no auth |
| Remote server | http | Streamable HTTP; POST + optional SSE |
| Remote server | sse | Legacy (pre-2025) |
| Remote server | ws | WebSocket bidirectional |
| Remote server | claudeai-proxy | Routed via Claude.ai infrastructure |
| In-process | sdk | Control messages over stdin/stdout |
| In-process | InProcessTransport | Direct function calls via queueMicrotask() (63 lines) |
| IDE | sse-ide, ws-ide | Runtime-specific |

Recommendation: start with stdio for local tools. Move to http only when you need remote. Use InProcessTransport for tools you control end-to-end — it eliminates subprocess overhead.

Tool wrapping (4 stages)

External MCP tools must merge into the same Tool interface as built-ins. Four transformations:

  1. Name normalization → mcp__{server}__{tool}. Invalid characters become underscores. Must match ^[a-zA-Z0-9_-]{1,64}$.
  2. Description truncation at 2,048 chars. (Real-world: OpenAPI servers were dumping 15–60 KB descriptions.)
  3. Schema passthrough. Pass MCP input schemas straight through; do not transform.
  4. Annotation mapping. readOnlyHint: true → enables concurrent execution. destructiveHint: true → triggers stricter permission checks.

After wrapping, MCP tools are indistinguishable from built-ins at the loop level. The same 14-step execution pipeline runs.
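Stages 1 and 2 are mechanical; a hedged sketch (the function names are mine, the regex and limits come from the text):

```typescript
// Stage 1: namespace the tool as mcp__{server}__{tool}, replace invalid
// characters, and enforce the 64-char name limit.
function wrapMcpToolName(server: string, tool: string): string {
  const name = `mcp__${server}__${tool}`.replace(/[^a-zA-Z0-9_-]/g, '_');
  return name.slice(0, 64); // must match ^[a-zA-Z0-9_-]{1,64}$
}

// Stage 2: cap descriptions so a verbose server can't eat the prompt budget.
function truncateDescription(desc: string, max = 2048): string {
  return desc.length <= max ? desc : desc.slice(0, max);
}
```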

Configuration scopes (7 sources, content-deduplicated)

| Scope | Source | Trust |
|---|---|---|
| local | .mcp.json in project | User approval required |
| user | ~/.claude.json | User-managed |
| project | Project-level | Shared |
| enterprise | Org-managed | Pre-approved |
| managed | Plugin-provided | Auto-discovered |
| claudeai | Web interface | Pre-authorized |
| dynamic | SDK injection | Programmatic |

Servers with matching command/args (or URLs) are deduplicated by content, not by name. Two configs naming the same binary differently still merge.

OAuth (RFC 9728 + RFC 8414)

Discovery chain when a server returns 401:

  1. Probe /.well-known/oauth-protected-resource for authorization-server metadata.
  2. Fall back to RFC 8414 discovery against the MCP server itself.
  3. Use configured authServerMetadataUrl as escape hatch.

Cross-App Access (XAA) enables federated token exchange via identity providers. Real-world spec violations are common β€” normalizeOAuthErrorBody() rewrites Slack's "200 with error body" responses to a proper HTTP 400. Plan for spec drift on day one.

Server lifecycle

  • States: connected, failed, needs-auth (15-min TTL cache), pending, disabled.
  • Spawn batching: local in batches of 3, remote in batches of 20 — protects against file-descriptor exhaustion.
  • Session-expiry detection: Streamable HTTP returns 404 + JSON-RPC code -32001 → reconnect + single retry.

Timeout layers

| Layer | Duration | Why |
|---|---|---|
| Connection | 30 s | Unreachable / slow servers |
| Per-request | 60 s | Fresh AbortSignal per request |
| Tool call | ~27.8 h | Legitimate long-running operations |
| Auth | 30 s | Unreachable OAuth servers |

Trap: if you reuse a single AbortSignal across requests it expires during idle periods. wrapFetchWithTimeout() creates a fresh signal per request. Memorize this.
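A sketch of the fix, with an illustrative wrapper name (the source's wrapFetchWithTimeout wraps fetch specifically; this generalizes the same per-request-signal idea):

```typescript
// Every call gets its own AbortController, so idle time between requests
// can never consume a shared signal's timeout budget.
function wrapWithTimeout<A extends unknown[], R>(
  fn: (signal: AbortSignal, ...args: A) => Promise<R>,
  timeoutMs: number,
) {
  return async (...args: A): Promise<R> => {
    const controller = new AbortController(); // fresh per request
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fn(controller.signal, ...args);
    } finally {
      clearTimeout(timer); // don't leak timers on success or failure
    }
  };
}
```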

Critical security rule

MCP skills never execute inline shell commands. External servers are content-only. Every other extension surface (user skills, project skills) can run shell; MCP cannot. This is the single most important MCP rule and the one you will be tempted to break.

InProcessTransport in 63 lines

Two key mechanics:

  • send() delivers via queueMicrotask() — prevents stack-depth blow-ups on synchronous request/response cycles.
  • close() cascades to the peer transport — no half-open connection states.

If you are wrapping an internal service as an MCP server, this is your reference. Don't subprocess what you can call directly.


15. 🚀 Bootstrap, startup, and rendering performance

The 5-phase pipeline (target: < 300 ms)

| Phase | File | What happens |
|---|---|---|
| 0. Fast-path dispatch | cli.tsx | Inspect args. --version / --help → dynamic-import only that handler, exit. Don't load React, telemetry, MCP. |
| 1. Module-level I/O | main.tsx | Side-effect-fire MDM (security policy) + keychain subprocesses during import evaluation. ~138 ms of module loading runs in parallel with subprocess I/O. |
| 2. Parse and trust | init.ts | Parse args, load config. Enforce a trust-boundary dialog. Before: only safe ops (TLS, themes, telemetry). After: env vars and git commands. |
| 3. Setup | setup.ts | Register everything in parallel: commands, agents, hooks, plugins, MCP. Hook config snapshot frozen here. |
| 4. Launch | replLauncher.ts | Seven entry paths converge: REPL, print, SDK, resume, continue, pipe, headless. All call the same query() loop. |

Other startup techniques

  • API preconnection — fire a HEAD request to the Anthropic API during init. The TCP+TLS handshake (100–200 ms) overlaps with setup. The connection is warm by the time the user submits.
  • Dynamic import for heavy libs — OpenTelemetry, provider SDKs, React for non-REPL paths.
  • 50+ profiling checkpoints sampled at 100% of internal users / 0.5% of external. Without instrumentation you can't tell what to optimize.

Search performance (270K+ paths)

Three layers:

  1. Bitmap pre-filter — assign each path a 26-bit mask of the lowercase letters it contains. Reject a path with one integer comparison: (charBits[i] & needleBitmap) !== needleBitmap. Rejects 10–90% at 4 bytes/entry.
  2. Score-bound rejection — skip paths that can't beat the current top score before expensive scoring.
  3. Async indexing with partial queryability — yield every ~4 ms. Search begins within 5–10 ms of index availability.
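Layer 1 is small enough to sketch in full (function names are mine; the bitmap logic follows the description above):

```typescript
// 26-bit letter mask per path: a path can only match a query if it contains
// every letter of the query, so one integer AND rejects most candidates
// before any expensive fuzzy scoring.
function letterBits(s: string): number {
  let bits = 0;
  for (const ch of s.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97; // 'a' = 0
    if (i >= 0 && i < 26) bits |= 1 << i;
  }
  return bits;
}

function couldMatch(pathBits: number, needleBits: number): boolean {
  return (pathBits & needleBits) === needleBits; // one integer comparison
}
```

The filter is conservative: it never rejects a true match (a path containing all the query's letters always passes), it only prunes guaranteed misses.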

Rendering: patterns that transfer beyond the terminal

Claude Code forks Ink because stock Ink allocates one JS object per cell per frame — at 200×120 that's 24,000 GC'd objects every 16 ms. Whatever you're rendering, the lessons transfer:

  • Double-buffer + atomic write. Two persistent Frame objects; render into the back, swap pointers (no allocation), write the diff in one syscall wrapped in BSU/ESU (Begin/End Synchronized Update). No tearing.
  • Cell-level diffing with damage rectangles. Compute the bounding box of writes; diff only inside it. ~6× reduction in compare work for localized updates.
  • Three interning pools (chars, styles, hyperlinks) → integer IDs everywhere. Style transitions become a single pre-cached string lookup. Pools generationally reset every 5 min.
  • Frame throttling. 60 fps focused, 30 fps blurred (throttle(deferredRender, FRAME_INTERVAL_MS)). Scroll events get a tighter 4 ms schedule.
  • Pack related data. Two Int32 words per cell beats scattered objects — better cache behavior, faster compare, fewer allocations.
  • Lazy expensive work. Syntax highlighting via React Suspense — code shows unstyled first, colors paint moments later.
  • Separate hot paths from React. Direct DOM mutation + microtask scheduling for scroll. React handles the final paint, where it's already efficient.

The thesis: performance is not making operations fast; it is eliminating operations entirely.


16. 📋 The 10 foundational patterns (cheat sheet)

| # | Pattern | Why it matters |
|---|---|---|
| 1 | AsyncGenerator-based loops | Natural backpressure, clean cancellation via .return(), typed terminal states |
| 2 | Speculative tool execution | Run safe read-only tools while the model is still streaming → noticeable latency cut |
| 3 | Concurrent-safe batching | Partition by per-invocation safety; serial isolates side effects |
| 4 | Fork agents for cache sharing | Byte-identical prefixes ⇒ ~95% input-token savings on children |
| 5 | 4-layer context compression | snip → microcompact → collapse → autocompact, in that order |
| 6 | File-based memory + LLM recall | Beats embeddings for negation and intent-aware retrieval; zero infra |
| 7 | Two-phase skill loading | Frontmatter at startup, body on invocation |
| 8 | Sticky latches | Cache-influencing flags become write-once for the session |
| 9 | Slot reservation | 8K default output, 64K on demand — recovers 12–28% of context |
| 10 | Hook config snapshots | Freeze at boot; defense against mid-session injection from a malicious repo |

17. 🗺️ Build-your-own: a 14-step roadmap

A pragmatic order to implement these in. Each step compiles and runs on its own.

  1. Tool interface + factory. Define Tool<I, O, P>, buildTool() with safe defaults, and a ToolResult type. Ship one tool: Read. Test the Zod-based JSON Schema generation.
  2. Query loop v0. Async generator. No tools, no compression, just stream the model and yield messages. Return a Terminal discriminated union.
  3. Tool execution path. Add the 14-step pipeline as one function. Wire the loop to call it on tool_use blocks. Always pair tool_use with a tool_result, even on error.
  4. Permission modes + rules. Implement default, acceptEdits, plan, bypassPermissions. Add the resolution chain. Skip auto (LLM classifier) for now.
  5. Concurrency partition + executor. partitionToolCalls() + a serial/concurrent executor. Add isConcurrencySafe() to every tool. Yield results in submission order.
  6. Hook system v0. Two events: PreToolUse, PostToolUse. Command hooks only (shell process, exit codes). Capture a snapshot at startup.
  7. State split. Mutable singleton STATE for infra (cwd, model, session id). Tiny reactive store for UI (messages, approvals).
  8. Multi-provider client factory. Direct API first. Stub the others. buildFetch wrapper for client-request-id header.
  9. Prompt caching architecture. System-prompt boundary marker. Static prefix (cache scope: global if no MCP). Dynamic suffix per-session. Implement one sticky latch as proof.
  10. Compression v1: snip + microcompact. Skip collapse and autocompact for now. Wire the budget thresholds.
  11. Streaming tool executor. Watch the streaming SSE. Start safe tools when their tool_use is fully parsed. Buffer to preserve submission order.
  12. AgentTool + sub-agent lifecycle. Re-enter query() with isolated context. Implement the cleanup finally block. Skip fork agents.
  13. Memory. File layout, frontmatter contract, two-tier retrieval (index + LLM recall side-query). Four types only.
  14. Skills (two-phase) + slash commands. Frontmatter at startup; body at invocation; $ARGUMENTS substitution. Add EXTRA_DIRS resolution order.

Save for later (don't build until step 14 lands): fork agents, swarm teams, remote tasks, KAIROS continuous mode, auto-mode permission classifier, MCP transport layer, terminal renderer optimization, bitmap search index.


18. ⚠️ Anti-patterns and pitfalls

Loop / control flow

  • ❌ Callbacks or event emitters for the agent loop. You'll re-invent backpressure poorly. Use async function*.
  • ❌ A single error terminal state. You lose information. Encode 10+ specific reasons in a discriminated union.
  • ❌ Stop hooks on error responses. Creates error → hook blocks → retry → error infinite loops. Skip them.
  • ❌ Forgetting to pair tool_use with tool_result on abort. API will reject the next message. Drain queued tools with synthetic results on every cancellation path.

Tools

  • ❌ A constructor literal instead of a factory. Defaults will be unsafe. Always go through buildTool().
  • ❌ Per-tool-type concurrency safety. Bash is sometimes safe, sometimes not. Pass parsed input.
  • ❌ Concatenating built-ins and MCP tools then sorting flat. Cache breakpoint dies. Sort within partition, then concat.
  • ❌ Returning huge raw output. Cap with maxResultSizeChars. Persist to disk + return preview.
  • ❌ Using the SDK's BetaMessageStream. O(n²) JSON re-parsing. Read raw stream events.

Permissions

  • ❌ Scattering if mode === ... checks throughout tool code. Centralize in modes + the resolution chain.
  • ❌ Trusting a partial bash parse. If parseForSecurity() fails, treat the command as unsafe.
  • ❌ Sub-agent default = default mode. It needs a UI to prompt; bg agents have none. Default to bubble (sync) or dontAsk (async).

Caching / API

  • ❌ Runtime conditionals in the static prompt prefix. Each one doubles cache key space. Move below the boundary.
  • ❌ Mid-session feature toggles that change request headers. Use sticky latches.
  • ❌ Reserving 64K output tokens by default. That over-reserves 8–16×. Cap at 8K, escalate on demand.
  • ❌ Regenerating the system prompt for fork children. Feature flags or session date may have moved. Pass parent's bytes.
  • ❌ Filtering tools per child agent in fork mode. Different array → different cache key. Use useExactTools: true and runtime guards.

Memory

  • ❌ Storing what git log can answer. Code patterns, fix recipes, who-changed-what. Useless duplication that goes stale.
  • ❌ Embedding-only retrieval. Misses negation ("do NOT mock the DB"). Use LLM recall over a manifest.
  • ❌ Hard expiration. Annotate with age; let the model decide. Stale memories are still data.
  • ❌ Letting MEMORY.md grow past 200 lines. Truncated silently. Treat the index as a budget.

Multi-agent

  • ❌ Coordinators with the full tool set. They'll do the work themselves. Restrict to Agent, SendMessage, TaskStop.
  • ❌ Workers asked to "based on the research, implement X." They re-derive context, miss specifics, hallucinate paths. Synthesis is the coordinator's job.
  • ❌ Mid-tool-execution message delivery. Race conditions. Queue at tool-round boundaries.
  • ❌ Unbounded teammate state. 36.8 GB / 292 agents was a real production incident. Cap message history.
  • ❌ General-purpose agents that can spawn Agent. Exponential fan-out. Block recursive spawning at the schema level.

Bootstrap / hooks

  • ❌ Loading the world for --version. Fast-path dispatch first, full bootstrap second.
  • ❌ Hook config that updates live mid-session. Lets a malicious repo redefine permissions after trust dialog. Snapshot at startup; update only via explicit user channel.
  • ❌ Treating MCP skills like local skills. They are content-only. Never execute their inline shell commands.

🎯 Closing thought

The deepest principle in the source book is repeated at every layer: push complexity to the boundaries. Permission resolution, protocol translation, state reconciliation, tool I/O — these are the messy edges. Concentrate the mess there. Keep the loop, the tool composition, the memory recall, and the streaming logic clean and exhaustively typed.

If you remember nothing else: most of this system is generators yielding strongly-typed events through a series of small modules, with a few critical caches and a few critical safety doors. Build it in that order.


19. 📖 Glossary

Quick reference for the jargon used throughout this guide.

| Term | Meaning |
|---|---|
| AsyncGenerator | A JS function declared `async function*`. Yields values lazily, pausing at each yield until the consumer calls `.next()`. Provides backpressure and clean cancellation. |
| Backpressure | The producer pauses when the consumer can't keep up. Generators give it for free; event emitters do not. |
| Cache breakpoint | The byte position in the prompt where the prompt cache stops matching. Move volatile content after the breakpoint to maximize hit rate. |
| Concurrency-safe | A tool invocation that can run in parallel with others without observable side effects. Determined per-input, not per-tool-type. |
| Context window | The token budget for a single API call (prompt + output). When you exceed it the API rejects the request. |
| Discriminated union | A type made of variants tagged by a literal field (`{ kind: 'completed' } \| { kind: 'error' }`). |
| Fork agent | A sub-agent that inherits the parent's byte-identical prompt prefix to maximize prompt-cache hits (~95% input-token discount on children 2…N). |
| Frontmatter | The YAML block at the top of a .md file (between two `---` lines). Used for skill/agent/memory metadata. |
| Hook | A user/plugin/policy interceptor at one of 27 lifecycle events. Can block, modify, or inject. |
| MCP | Model Context Protocol — the JSON-RPC standard for connecting external tool servers to an agent. Eight transports. |
| Microcompact | Layer 2 of context compression. Removes tool results by tool_use_id when no longer needed. |
| Prompt cache | Anthropic's server-side cache of prompt prefixes. ~90% discount on cached input tokens. The entire architecture revolves around preserving hits. |
| Reservoir sampling | Algorithm R. Maintain a fixed-size random sample of an unbounded stream. Used here for latency histograms (1,024 entries → accurate p50/p95/p99). |
| Slot reservation | The max_tokens value sent to the API. Default cap 8K, escalate to 64K on truncation (<1% of requests). Reclaims 12–28% of context. |
| Speculative execution | Starting tools while the model is still streaming, before the assistant message completes. Saves hundreds of ms when read-only tools dominate. |
| Sticky latch | A write-once boolean (`null` until first set, then latched for the session) for cache-influencing flags, so request shape never changes mid-session. |
| Sub-agent | A child agent spawned via AgentTool. A new query() generator with isolated message history. Sync (parent waits) or async (background). |
| Synthetic tool result | A fabricated tool_result block emitted on cancellation so the API doesn't see a tool_use without a matching result. |
| Terminal state | The discriminated-union value the agent loop returns (vs. yields). Encodes why execution stopped — 10 distinct reasons. |
| tool_use / tool_result | Anthropic API blocks. Every tool_use in an assistant message must be paired with a tool_result in the next user message. The single most common bug source. |
| Two-phase skill loading | Frontmatter loaded into the system prompt at startup; full body loaded only on invocation. Lets you ship 50+ skills cheaply. |

Sources

  • Repo: https://github.com/alejandrobalderas/claude-code-from-source (raw chapter markdown — primary source)
  • Companion site: https://claude-code-from-source.com
  • Chapters analyzed: 1 (Architecture), 2 (Bootstrap), 3 (State), 4 (API Layer), 5 (Agent Loop), 6 (Tools), 7 (Concurrency), 8 (Sub-Agents), 9 (Fork Agents), 10 (Coordination), 11 (Memory), 12 (Extensibility), 13 (Terminal UI), 15 (MCP), 17 (Performance), 18 (Epilogue).

The source repo is purely educational and contains no source code from Claude Code — only original pseudocode derived from npm source maps. This guide follows the same convention.


If you found this helpful, let me know by leaving a 👍 or a comment — and if you think this post could help someone, feel free to share it! Thank you very much! 😃
