DEV Community

Harrison Guo

Posted on • Originally published at harrisonsec.com

Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything

This is Part 2 of our Claude Code Architecture Deep Dive series. Part 1: 5 Hidden Features covered the surface-level discoveries. Now we go deeper.

The Heart of Claude Code

Every AI coding agent — Claude Code, Cursor, Copilot — runs some version of the same loop: send context to an LLM, get back text and tool calls, execute tools, feed results back, repeat. We called this "LLM talks, program walks."

But Claude Code's implementation of this loop is anything but simple. It lives in query.ts, a 1,729-line async generator. The while(true) starts at line 307 and ends at line 1728 — a single loop body spanning 1,421 lines of production code.

This is not a toy. This is the engine that processes every keystroke, every tool call, every error recovery, every context compression decision for millions of users.

// query.ts — line 307
// eslint-disable-next-line no-constant-condition
while (true) {
    let { toolUseContext } = state
    const { ... } = state
    // ... 1,421 lines of state machine logic ...
    state = next
} // while (true)  — line 1728

Why a State Machine, Not Recursion

Early versions of Claude Code used recursion — the query function called itself. But recursion has a fatal flaw: in long conversations with hundreds of tool calls, the call stack grows until it explodes.

The current design uses while(true) with a state object that carries context between iterations:

// query.ts — lines 207-215 (State type, partial)
autoCompactTracking: AutoCompactTrackingState | undefined
maxOutputTokensRecoveryCount: number
hasAttemptedReactiveCompact: boolean       // circuit breaker for 413 recovery
stopHookActive: boolean | undefined
turnCount: number
transition: { reason: string } | undefined // why we continued
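The recursion-to-loop refactor is easy to sketch. Here is a minimal illustration of the shape — the names (`AgentState`, `runTurn`, `runAgent`) are invented for the demo, not taken from query.ts:

```typescript
// Hypothetical sketch of the state-machine shape. Real turns would call
// the model and run tools; here a turn just advances a counter.

interface AgentState {
  turnCount: number;
  hasAttemptedReactiveCompact: boolean;
  transition?: { reason: string };
}

// One model turn; returns the next state, or null when the agent is done.
function runTurn(state: AgentState): AgentState | null {
  if (state.turnCount >= 3) return null; // stands in for "no more tool calls"
  return {
    turnCount: state.turnCount + 1,
    hasAttemptedReactiveCompact: false, // per-turn flags reset on hand-off
    transition: { reason: "next-tool-call" },
  };
}

// The loop: constant stack depth no matter how many turns run.
function runAgent(): number {
  let state: AgentState = { turnCount: 0, hasAttemptedReactiveCompact: false };
  while (true) {
    const next = runTurn(state);
    if (next === null) break; // no reason left to continue: exit
    state = next; // each `continue` in query.ts is this state hand-off
  }
  return state.turnCount;
}
```

A recursive version of `runAgent` would add one stack frame per turn; the loop version stays at constant depth whether the conversation has 3 turns or 500.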

Each continue statement is a state transition. There are 9 distinct continue points in the code (including lines 950, 1115, 1165, 1220, 1251, 1305, 1316, and 1340), each representing a different reason to run another turn:

  • Next tool call needed
  • Reactive compact triggered after 413
  • Max output tokens recovery
  • Stop hook interrupted
  • Token budget continuation
  • And more

The Loop at a Glance

(Flowchart: the loop's 10 steps, described below, with each continue point feeding back to step 1.)

10 Steps Per Iteration

Each time the loop runs, it does these 10 things in order. Every step has real source code behind it.

Step 1: Context Compression (4 stages)

Before calling the API, the system tries to fit everything into the context window. Four compression mechanisms fire in priority order (imports at lines 12-16, 115-116):

  1. Snip Compact — trims overly long individual messages in history
  2. Micro Compact — finer-grained editing based on tool_use_id, cache-friendly (line 370: "microcompact operates purely by tool_use_id")
  3. Context Collapse — folds inactive context regions into summaries
  4. Auto Compact — when total tokens approach the threshold, triggers full compression

These are not mutually exclusive — they run in priority order:

Snip Compact → Micro Compact → Context Collapse → Auto Compact

The system tries lightweight options first. If snip + micro bring tokens under the limit, the heavy compressors never run.
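That priority ordering can be sketched as a short pipeline. The stage names follow the article; the reduction factors, the 100-token limit, and the function shapes are all invented for illustration:

```typescript
// Illustrative priority-ordered compression pipeline. Lightweight
// stages run first; heavy ones only fire if the context still doesn't fit.

type Compressor = { name: string; run: (tokens: number) => number };

const TOKEN_LIMIT = 100; // invented threshold for the demo

const stages: Compressor[] = [
  { name: "snip",     run: (t) => Math.round(t * 0.9)  }, // trim long messages
  { name: "micro",    run: (t) => Math.round(t * 0.85) }, // per-tool_use_id edits
  { name: "collapse", run: (t) => Math.round(t * 0.6)  }, // fold inactive regions
  { name: "auto",     run: (t) => Math.round(t * 0.3)  }, // full compaction
];

// Run stages in order; stop as soon as the context fits.
function compress(tokens: number): { tokens: number; used: string[] } {
  const used: string[] = [];
  for (const stage of stages) {
    if (tokens <= TOKEN_LIMIT) break; // heavy compressors never run
    tokens = stage.run(tokens);
    used.push(stage.name);
  }
  return { tokens, used };
}
```

With this shape, a slightly oversized context only pays for snip and micro; only a badly oversized one reaches auto compact.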

Step 2: Token Budget Check

If a token budget is active (feature('TOKEN_BUDGET'), line 280), the system checks whether to continue. Users can specify targets like "+500k", and the system tracks cumulative output tokens per turn, injecting nudge messages near the goal to keep the model working.
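A minimal sketch of what tracking a "+500k"-style target could look like — the parsing rule, the 90% nudge threshold, and the class shape are assumptions, not the real `TOKEN_BUDGET` code:

```typescript
// Hypothetical budget tracker for targets like "+500k".

function parseBudget(spec: string): number {
  const m = /^\+(\d+)(k?)$/i.exec(spec.trim());
  if (!m) throw new Error(`bad budget spec: ${spec}`);
  return Number(m[1]) * (m[2] ? 1_000 : 1);
}

class TokenBudget {
  private used = 0;
  constructor(private readonly target: number) {}

  addTurn(outputTokens: number) {
    this.used += outputTokens; // cumulative output tokens across turns
  }

  // Nudge the model when within 10% of the goal; stop once past it.
  decision(): "continue" | "nudge" | "stop" {
    if (this.used >= this.target) return "stop";
    if (this.used >= this.target * 0.9) return "nudge";
    return "continue";
  }
}
```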

Step 3: Call Model API

Line 659 — the actual API call:

for await (const message of deps.callModel({

This is a streaming call. The response arrives token by token, and the system processes it incrementally.

Step 4: Streaming Tool Execution

This is a critical optimization. Traditional agents wait for the model to finish generating all output, then execute tools. Claude Code uses StreamingToolExecutor (imported at line 96):

When the model is still generating its second tool call, the first one is already running:

Traditional Agent (sequential):
┌─────────────────────────┐┌───┐┌───┐┌───┐┌───┐┌───┐
│  LLM generates 5 calls  ││ T1││ T2││ T3││ T4││ T5│  ← 30s total
└─────────────────────────┘└───┘└───┘└───┘└───┘└───┘

Claude Code (streaming):
┌─────────────────────────┐
│  LLM generates 5 calls  │
├──┬──┬──┬──┬─────────────┘
│T1│T2│T3│T4│T5│                                       ← 18s total
└──┴──┴──┴──┴──┘
↑ tools start while LLM is still generating

In a turn with 5 tool calls, traditional waits 30 seconds. Streaming finishes in 18 — a 40% speedup from architecture alone, not model improvements.

Lines 554-555 reveal an interesting detail: stop_reason === 'tool_use' is unreliable — "it's not always set correctly." The system detects tool calls by watching for tool_use blocks during streaming instead.
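The overlap is easy to demonstrate: launch each tool the moment its tool_use block arrives, without awaiting it inline. Everything here — the event shapes, `modelStream`, `runTool` — is a stand-in for the demo, not the real StreamingToolExecutor:

```typescript
// Sketch: start each tool as soon as its tool_use block arrives,
// instead of waiting for the stream to finish.

type StreamEvent =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string };

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Fake model stream: two tool calls, with generation time between them.
async function* modelStream(): AsyncGenerator<StreamEvent> {
  yield { type: "tool_use", id: "t1", name: "Read" };
  await sleep(20); // model is still generating...
  yield { type: "tool_use", id: "t2", name: "Grep" };
}

async function runTool(id: string, log: string[]): Promise<void> {
  log.push(`start:${id}`);
  await sleep(5);
  log.push(`done:${id}`);
}

// Detect tool calls by watching for tool_use blocks (stop_reason is
// unreliable), and launch each without blocking the stream.
async function streamingExecute(log: string[]): Promise<void> {
  const running: Promise<void>[] = [];
  for await (const ev of modelStream()) {
    if (ev.type === "tool_use") running.push(runTool(ev.id, log));
  }
  log.push("stream-ended");
  await Promise.all(running);
}
```

In this toy timeline, the first tool finishes while the model is still "generating" the second — exactly the overlap the diagram above shows.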

Step 5: Error Recovery

If the prompt is too long, the system first tries a context collapse drain; if that fails, it falls back to reactive compact (lines 15-16). If the API returns 413 (prompt too long), it triggers emergency compression and retries.

But there's a circuit breaker: hasAttemptedReactiveCompact (line 209, initialized false at line 275) ensures each turn only attempts reactive compact once. Without this, a genuinely oversized conversation would loop forever.

The system also handles model degradation — if the primary model fails, it can fall back to a different model.
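The 413 path with its circuit breaker can be sketched like this — `PromptTooLongError`, `callModel`, and `emergencyCompact` are stand-ins, and only the breaker logic mirrors the article's description:

```typescript
// Hypothetical 413 recovery with a once-per-turn circuit breaker.

class PromptTooLongError extends Error {} // stands in for an HTTP 413

function recoverTurn(
  state: { hasAttemptedReactiveCompact: boolean },
  callModel: () => string,
  emergencyCompact: () => void,
): string {
  try {
    return callModel();
  } catch (err) {
    if (!(err instanceof PromptTooLongError)) throw err;
    if (state.hasAttemptedReactiveCompact) {
      // Breaker open: a compact already ran this turn, so the
      // conversation is genuinely oversized. Fail instead of looping.
      throw new Error("context cannot be compressed further");
    }
    state.hasAttemptedReactiveCompact = true;
    emergencyCompact();
    return callModel(); // exactly one retry after compaction
  }
}
```

Without the flag, a conversation that stays oversized after compaction would retry forever; with it, the failure surfaces after one attempt.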

Step 6: Stop Hooks

After the model stops outputting, the system runs registered stop hooks. These can inspect the output and decide whether to let the model continue. This is where external governance plugs in.
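A stop hook boils down to "inspect the final output, vote on continuing." The interface below is an assumed shape for illustration, not the actual hook API:

```typescript
// Illustrative stop-hook interface.

type StopHookResult = { continue: boolean; reason?: string };
type StopHook = (finalOutput: string) => StopHookResult;

// Run hooks in order; the first one that asks to continue wins.
function runStopHooks(hooks: StopHook[], output: string): StopHookResult {
  for (const hook of hooks) {
    const res = hook(output);
    if (res.continue) return res;
  }
  return { continue: false };
}

// Example governance hook: don't let the model stop on an unfinished TODO.
const todoHook: StopHook = (out) =>
  out.includes("TODO")
    ? { continue: true, reason: "unfinished TODO" }
    : { continue: false };
```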

Step 7: Token Budget Check (Again)

Yes, checked twice — once before calling the model (should we even start?) and once after (did we exceed the budget?). The second check decides whether to inject a "keep going" nudge or stop.

Step 8: Tool Execution

If the response contains tool_use blocks, execute them. Two paths:

  • runTools() (from toolOrchestration.ts, line 98) — batch execution
  • StreamingToolExecutor (line 96) — streaming execution, gated by config.gates.streamingToolExecution (line 561)

Each tool call goes through the 14-step execution pipeline in toolExecution.ts (1,745 lines) — validation, permission checks, hooks, actual execution, analytics. That's a story for Part 3.
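The choice between the two paths is a simple feature gate. A minimal sketch, assuming a simplified config shape (the `config.gates.streamingToolExecution` flag is from the source; the surrounding types are invented):

```typescript
// Sketch of the gate between batch and streaming execution.

interface Config {
  gates: { streamingToolExecution: boolean };
}

function pickExecutor(config: Config): "streaming" | "batch" {
  // Streaming overlaps tool runs with generation; batch waits for the
  // full response, which is simpler and easier to reason about.
  return config.gates.streamingToolExecution ? "streaming" : "batch";
}
```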

Step 9: Attachment Injection

After tools finish, the system injects additional context before the next turn:

  • Memory attachments — relevant memories from the memdir/ system
  • Skill discovery — matching skills based on the current task
  • Queued commands — any commands that were waiting

This happens after tool execution but before the next API call, ensuring the model has fresh context.
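A toy version of that injection step — the matching logic and data are invented stand-ins; real memory retrieval and skill discovery are far richer:

```typescript
// Hypothetical attachment collection between tool results and the next call.

type Attachment = {
  source: "memory" | "skill" | "queued-command";
  text: string;
};

function collectAttachments(task: string): Attachment[] {
  const out: Attachment[] = [];
  // Memory: pull anything whose key appears in the current task.
  const memories: Record<string, string> = {
    lint: "Project uses eslint flat config",
  };
  for (const [key, text] of Object.entries(memories)) {
    if (task.includes(key)) out.push({ source: "memory", text });
  }
  // Skills: also matched against the task (crude keyword stand-in).
  if (task.includes("refactor")) {
    out.push({ source: "skill", text: "refactoring skill" });
  }
  return out;
}
```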

Step 10: Assemble and Loop

Build the new message list from all the pieces — original conversation, tool results, attachments, system reminders — and go back to step 1.
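The assembly step can be sketched with a simplified message shape (the real message structure carries typed content blocks, not plain strings):

```typescript
// Sketch of step 10: fold tool results and injected attachments back
// into the message list before looping.

type Message = { role: "user" | "assistant"; content: string };

function assembleNextTurn(
  history: Message[],
  toolResults: string[],
  attachments: string[],
): Message[] {
  const injected = [
    ...toolResults.map((r) => `[tool_result] ${r}`),
    ...attachments.map((a) => `[attachment] ${a}`),
  ].join("\n");
  // Tool results and attachments go back to the model as user-role input.
  return [...history, { role: "user", content: injected }];
}
```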

Why This Architecture Matters

Most open-source AI agents implement the loop as 50 lines of pseudocode: call model, parse tool calls, execute, repeat. Claude Code's 1,421-line version exists because production reality is messy:

Context doesn't fit. A real coding session easily hits 200K tokens. Without the 4-stage compression pipeline, the agent dies on every long conversation. Most agents just truncate and lose context. Claude Code compresses intelligently — lightweight first, heavy only when needed.

Models fail. APIs return 413, connections drop, rate limits hit. The 9 continue points aren't over-engineering — they're the minimum number of recovery paths needed for reliable operation. The hasAttemptedReactiveCompact circuit breaker is the kind of detail that separates a demo from a product.

Speed matters more than correctness of execution order. Streaming tool execution — starting the first tool while the model is still generating the third — is a user experience decision backed by architecture. Traditional agents feel slow because they are: they serialize everything. Claude Code parallelizes at the loop level.

Tokens cost money. The SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker in prompts.ts (914 lines) splits the system prompt into static (cacheable) and dynamic sections. If two requests share the same static prefix byte-for-byte, the API caches it. Source comment: "don't modify content before the boundary, or you'll destroy the cache." This is prompt cache economics — saving Anthropic real compute costs at scale.
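The split is worth sketching, because the cache rule is purely about byte-stable prefixes. The boundary constant's name comes from the article; the assembly functions are invented:

```typescript
// Sketch of the static/dynamic prompt split for cache hits.

const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<!-- dynamic-boundary -->";

function assemblePrompt(
  staticSections: string[],
  dynamicSections: string[],
): string {
  // Everything before the boundary must be byte-for-byte stable across
  // requests, or the provider-side prompt cache misses.
  return [
    staticSections.join("\n"),
    SYSTEM_PROMPT_DYNAMIC_BOUNDARY,
    dynamicSections.join("\n"),
  ].join("\n");
}

function cacheablePrefix(prompt: string): string {
  return prompt.slice(0, prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY));
}
```

Two requests that differ only after the boundary share an identical cacheable prefix, which is the whole trick: per-request data (cwd, date, attachments) goes below the line, invariant instructions above it.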

The Behavioral Constitution

Buried inside the prompt assembly, getSimpleDoingTasksSection() may be the most valuable function in the entire codebase. It encodes hard-won rules about what the model should NOT do:

  • Don't add features the user didn't ask for
  • Don't over-abstract — three duplicate lines beat a premature abstraction
  • Don't add comments to code you didn't change
  • Don't add unnecessary error handling
  • Read code before modifying it
  • If a method fails, diagnose before retrying
  • Report honestly — don't say you ran something you didn't

Anyone who has used Claude Code recognizes these rules. I've personally watched the system refuse to add "helpful" abstractions and stick to minimal changes. That's not the model being disciplined — it's the prompt constraining the model. The takeaway: don't trust model self-discipline. Codify the behavior.

How Other Agents Compare

| Aspect | Claude Code | Cursor | Typical OSS Agent |
| --- | --- | --- | --- |
| Loop complexity | 1,421 lines, 9 continue points | Unknown (closed source) | ~50-200 lines |
| Compression | 4-stage pipeline + reactive 413 recovery | Tab-level context pruning | Truncate or fail |
| Tool execution | Streaming (parallel with generation) | Sequential | Sequential |
| Error recovery | Circuit breakers, model fallback, emergency compact | Basic retry | Crash |
| Prompt caching | Static/dynamic boundary, section registry | Unknown | None |

The gap between Claude Code and most open-source agents is not model quality — it's the program layer. The model is the same Opus or Sonnet for everyone. What makes Claude Code feel different is 1,421 lines of careful engineering around it.

The Bottom Line

The query loop is where "LLM talks, program walks" becomes concrete:

  • The LLM outputs text and tool call JSON. That's it.
  • The program handles compression, budget tracking, error recovery, streaming, permissions, memory injection, and 14-step tool validation.
  • The 1,421 lines are not the model being smart. They're the program being careful.

If you're building an AI agent and your main loop is under 100 lines, you're not handling the cases that matter. Production is not about the happy path. It's about what happens when context overflows, the API returns 413, the user's conversation hits 500 turns, and three tools need to run while the model is still thinking.


Next: Part 3 — The 14-Step Tool Execution Pipeline (coming soon) — what happens between "model says call this tool" and the tool actually running.

Previous: Part 1 — 5 Hidden Features Found in 510K Lines

Video: The AI Stack Explained — LLM Talks, Program Walks
