I Read 2.5 Million Lines of AI Agent Source Code — Here Are the 4 Patterns Every Project Shares
Over the past few months, I tore apart 10 open-source AI agent projects — line by line. Not README skimming. Not "I cloned the repo and grepped for interesting stuff." I read the actual code: the agent loops, the memory systems, the extension mechanisms, the deployment configs. 2.5 million lines across Dify, Claude Code, Goose, Hermes, DeerFlow, Pi Mono, Lightpanda, MiroFish, oh-my-claudecode, and Guardrails AI.
I published the full teardowns in awesome-ai-anatomy. But this post isn't about individual projects. It's about something that only becomes visible after you've read all 10: the same architectural patterns keep showing up, independently, across projects built by different teams in different languages.
Four patterns. Let me walk you through them with actual code.
Pattern 1: Memory = Pointers, Not Content
Every project stores memory. None of them store it the way you'd expect.
The naive approach is to dump the full conversation history into the context window. Nobody does this. What they actually do is store references — pointers to knowledge — and inject a compressed snapshot into the system prompt.
Claude Code stores memory as flat rules in a .claude/ directory. These aren't conversation logs. They're user-written instructions like "always use TypeScript" or "never modify the auth module." The model gets these as static rules at the start of each session. No history, no dynamic updates. Dead simple, and it works because Claude Code treats memory as configuration, not as recall.
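The memory-as-configuration idea is simple enough to sketch. Here is a minimal version of the pattern, with my own file paths and function names (Claude Code's actual loader is not shown in this post):

```python
from pathlib import Path

def load_memory_rules(root: Path) -> str:
    """Collect user-written rule files and render them as one static
    system-prompt block. Rules are configuration, not conversation
    history: read once at session start, never mutated mid-session."""
    rules = []
    for rule_file in sorted(root.glob("*.md")):
        text = rule_file.read_text().strip()
        if text:
            rules.append(f"# {rule_file.name}\n{text}")
    return "\n\n".join(rules)

# At session start the block is injected exactly once, e.g.:
# system_prompt = BASE_PROMPT + "\n\n" + load_memory_rules(Path(".claude"))
```

Because the block is assembled before the first model call and never touched again, there is nothing to invalidate: the "memory" is just part of the prompt.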
Hermes Agent takes this further with a frozen snapshot pattern. At session start, it reads MEMORY.md and USER.md, serializes them into the system prompt, then freezes the snapshot. Even if the agent updates memory during the session (via tool calls that write to disk), the system prompt doesn't change until next session:
# From builtin_memory_provider.py
def system_prompt_block(self) -> str:
    """Uses the frozen snapshot captured at load time.

    This ensures the system prompt stays stable throughout a session
    (preserving the prompt cache), even though the live entries
    may change via tool calls."""
Why freeze? Prompt caching. If you have a 4,000-word MEMORY.md and your provider charges for prompt tokens, recompiling the system prompt on every memory write burns money. Hermes freezes the snapshot at session start and defers updates to the next session. Memory writes hit disk immediately but don't affect the current prompt. You trade freshness for cost efficiency.
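The freeze-at-load trade can be shown in a few lines. This is a sketch under my own naming, not Hermes's actual classes: the snapshot string is serialized once, and later writes go to disk without touching it:

```python
import json
from pathlib import Path

class FrozenMemory:
    """Sketch of the freeze-at-load pattern. The prompt block is
    captured once per session; later memory writes hit disk but do
    not invalidate the cached system prompt."""

    def __init__(self, path: Path):
        self.path = path
        self.live = json.loads(path.read_text()) if path.exists() else {}
        # Serialized exactly once; this string never changes this session.
        self._snapshot = json.dumps(self.live, indent=2)

    def system_prompt_block(self) -> str:
        return self._snapshot          # stable, so the prompt cache stays warm

    def remember(self, key: str, value: str) -> None:
        self.live[key] = value         # visible from the next session on
        self.path.write_text(json.dumps(self.live, indent=2))
```

Calling `remember()` changes the file immediately, but `system_prompt_block()` keeps returning the session-start snapshot, which is exactly the freshness-for-cost trade described above.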
DeerFlow goes the most sophisticated route — structured memory with confidence scores:
{
  "facts": [
    {"id": "...", "content": "User prefers Python over JS", "confidence": 0.9},
    {"id": "...", "content": "Team uses PostgreSQL", "confidence": 0.75}
  ],
  "history": {
    "recentMonths": {"summary": "..."},
    "earlierContext": {"summary": "..."},
    "longTermBackground": {"summary": "..."}
  }
}
Three time horizons for history. Per-fact confidence scores. LLM-extracted, debounced, and written asynchronously. This is the most ambitious memory architecture in the group — and also the most fragile (single JSON file, no file locking, no concurrent write safety).
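Debounced, asynchronous persistence is the part worth illustrating, because it is also where the fragility comes from. A minimal sketch (my own class and names, not DeerFlow's): rapid updates coalesce into one delayed disk write, so a crash inside the debounce window loses the latest facts, and two processes doing this against the same JSON file would race:

```python
import json
import threading
from pathlib import Path

class DebouncedMemoryWriter:
    """Sketch of debounced memory persistence. Each new fact restarts
    a short timer; the file is written once per burst of updates.
    There is deliberately no file locking here, mirroring the
    single-JSON-file fragility described above."""

    def __init__(self, path: Path, delay: float = 0.2):
        self.path, self.delay = path, delay
        self.state: dict = {"facts": []}
        self._timer: threading.Timer | None = None

    def add_fact(self, content: str, confidence: float) -> None:
        self.state["facts"].append(
            {"content": content, "confidence": confidence})
        if self._timer:
            self._timer.cancel()       # restart the debounce window
        self._timer = threading.Timer(self.delay, self._flush)
        self._timer.start()

    def _flush(self) -> None:
        self.path.write_text(json.dumps(self.state, indent=2))
```

The debounce keeps LLM-extracted fact bursts from hammering the disk, at the cost of a window where memory exists only in RAM.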
The pattern across all three: memory is never the raw conversation. It's always a compressed, structured pointer to what matters. The model doesn't remember — it reads a cheat sheet at the start of each session.
Pattern 2: MCP Is the New API
The Model Context Protocol is eating the tool integration layer. Out of 10 projects, 7 either use MCP directly, ship an MCP server, or have MCP on their immediate roadmap. This isn't hype — it's convergence.
Goose is the purest example. Block Inc built the entire extension system on MCP. Not as an add-on. As the foundation. Every tool in Goose — file editing, shell execution, code analysis, even the todo list — is an MCP extension:
// From crates/goose/src/agents/extension.rs
pub enum ExtensionConfig {
    Sse { ... },              // Legacy SSE (deprecated)
    Stdio { cmd, args, ... }, // Child process via stdin/stdout
    Builtin { name, ... },    // In-process MCP server
    Platform { name, ... },   // In-process with agent context
    StreamableHttp { uri },   // Remote MCP via HTTP
    Frontend { tools, ... },  // UI-provided tools (desktop only)
    InlinePython { code },    // Python code run via uvx
}
Seven configuration variants (six current, plus the deprecated SSE transport), all sharing the same McpClientTrait interface. The agent loop doesn't care whether a tool lives inside the binary or runs as a separate process across the network. The dispatch code path is identical. This is what gives Goose its modularity — you can swap any capability without touching the core agent.
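The trait-behind-the-loop idea translates to any language. A sketch in Python using a structural protocol (names are illustrative, loosely inspired by McpClientTrait, not Goose's actual API): the dispatcher sees one interface, so in-process and child-process tools go through the identical code path:

```python
from typing import Protocol

class McpClient(Protocol):
    """One interface for every tool source; the agent loop sees
    nothing else."""
    def list_tools(self) -> list[str]: ...
    def call_tool(self, name: str, args: dict) -> str: ...

class BuiltinClient:
    """Stand-in for an in-process tool server."""
    def list_tools(self) -> list[str]:
        return ["todo"]
    def call_tool(self, name: str, args: dict) -> str:
        return f"builtin:{name} ran in-process"

class StdioClient:
    """Stand-in for a tool server spawned as a child process."""
    def list_tools(self) -> list[str]:
        return ["shell"]
    def call_tool(self, name: str, args: dict) -> str:
        return f"stdio:{name} ran in a child process"

def dispatch(clients: list[McpClient], tool: str, args: dict) -> str:
    # One code path: the loop never asks *where* a tool lives.
    for client in clients:
        if tool in client.list_tools():
            return client.call_tool(tool, args)
    raise KeyError(tool)
```

Swapping `BuiltinClient` for a networked client changes nothing in `dispatch`, which is the modularity claim in miniature.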
Dify takes a different angle. Its plugin daemon runs as a separate process, communicating with the main API server. The plugin system isn't technically MCP yet, but the architecture is heading there — isolated execution, protocol-based communication, hot-swappable capabilities. At 136K stars, when Dify fully adopts MCP, the ecosystem implications are significant.
Lightpanda ships an MCP server mode alongside its CDP implementation. You can talk to the browser via Chrome DevTools Protocol or via MCP. One binary, two protocols. This is the pattern I expect to see everywhere: existing tools adding MCP as a second interface, not replacing what they have but offering a new way in.
The holdouts are interesting too. Claude Code still uses an internal tool registry via buildTool(). Hermes has its own tool system. Both work, but they require tools to be built specifically for that agent. MCP tools work with any MCP-compatible agent. The network effects are obvious, and I think the holdouts will adopt MCP within the next 12 months.
Pattern 3: Extension Bus > Monolith
Every agent framework starts as a monolith. The ones that survive refactor into a bus.
The evidence is in the god files. Claude Code's query.ts: 1,729 lines. Hermes's run_agent.py: 9,000+ lines. Pi Mono's agent-session.ts: 1,500+ lines. Goose's extension_manager.rs: 2,300 lines. The agent loop is a gravitational well — context management, tool dispatch, error handling, state tracking, and permission checks all want to live close to the main loop. And they do, until the file becomes unmaintainable.
Only two projects have found structural solutions.
Goose goes all-in on the extension bus. The agent itself is a thin dispatcher. It owns the prompt, manages the conversation, and calls the LLM. Everything else — every tool, every capability — lives in an extension. The developer extension that provides shell, edit, write, and tree tools? It's technically just another MCP client that happens to run in-process. You could rip it out and replace it with an external service and the agent loop wouldn't notice.
DeerFlow uses a middleware chain. Every message passes through 16 middlewares in strict order:
ThreadDataMiddleware → UploadsMiddleware → SandboxMiddleware →
SandboxAuditMiddleware → DanglingToolCallMiddleware →
LLMErrorHandlingMiddleware → ToolErrorHandlingMiddleware →
SummarizationMiddleware → TodoMiddleware → TokenUsageMiddleware →
TitleMiddleware → MemoryMiddleware → ViewImageMiddleware →
LoopDetectionMiddleware → SubagentLimitMiddleware →
ClarificationMiddleware
Each middleware handles exactly one concern. LoopDetectionMiddleware doesn't also try to do rate limiting. SandboxMiddleware doesn't try to manage thread state. Clean separation. The cost is ordering constraints — ClarificationMiddleware must be last, SummarizationMiddleware must run before MemoryMiddleware — but those are manageable.
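The mechanics of such a chain fit in a few lines. A sketch with two toy middlewares of my own (not DeerFlow's implementation): each one handles a single concern and delegates to the next, and the list order is part of the contract:

```python
from typing import Callable

Handler = Callable[[dict], dict]

def chain(middlewares: list[Callable[[dict, Handler], dict]],
          terminal: Handler) -> Handler:
    """Fold middlewares right-to-left around a terminal handler,
    so the first in the list runs first."""
    handler = terminal
    for mw in reversed(middlewares):
        nxt = handler
        handler = (lambda msg, mw=mw, nxt=nxt: mw(msg, nxt))
    return handler

def loop_detection(msg: dict, nxt: Handler) -> dict:
    if msg.get("repeats", 0) > 3:      # one concern: catch loops
        return {"error": "loop detected"}
    return nxt(msg)

def token_usage(msg: dict, nxt: Handler) -> dict:
    out = nxt(msg)
    out["tokens"] = len(msg.get("text", "").split())  # one concern: accounting
    return out

pipeline = chain([loop_detection, token_usage],
                 terminal=lambda msg: {"reply": "ok"})
```

Note the ordering constraint showing up already: if `loop_detection` ran after `token_usage`, a looping message would still be billed for tokens before being rejected.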
Pi Mono takes a different approach: seven standalone npm packages in a monorepo. The dependency graph is strict. The TUI library (pi-tui) has zero dependency on the AI layer (pi-ai). The agent core (pi-agent-core) is 3K lines. The coding agent (pi-coding-agent) is 69K lines but it's the consumer, not the core. You can build a completely different product on top of the same AI layer — the Slack bot (pi-mom) does exactly this.
The pattern: projects that survive past 100K lines of code are the ones that extract the extension mechanism early. The ones that don't end up with a 9,000-line god file that nobody wants to touch.
Pattern 4: The Harness Matters More Than the Model
This was the finding that surprised me most. After reading 2.5M lines of code, the thing that differentiates these projects isn't which LLM they use. It's everything around the LLM.
Consider what happens before and after every model call:
Before the call: context compression. Claude Code uses a 4-layer cascade — surgical deletion (lossless) → cache-level editing → structured archival → full compression (lossy). Hermes uses a 5-step algorithm that protects the head and tail of the conversation while summarizing the middle. Goose runs background tool-pair summarization concurrently while the agent processes the current turn. These are complex, carefully ordered systems, and the quality of the compression directly determines whether the agent remembers what it was doing 30 turns ago.
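The head/tail-protected shape of these compressors is easy to sketch. This is a minimal toy (a stub stands in where a real system would call an LLM to summarize; the thresholds are mine): the opening messages carry the task framing and the trailing ones carry working state, so only the middle collapses:

```python
def compress_history(messages: list[str], head: int = 2,
                     tail: int = 4, budget: int = 10) -> list[str]:
    """Keep the first `head` and last `tail` messages verbatim;
    replace everything in between with a single summary message
    once the history exceeds `budget`."""
    if len(messages) <= budget:
        return messages
    middle = messages[head:len(messages) - tail]
    summary = f"[summary of {len(middle)} earlier messages]"  # LLM stub
    return messages[:head] + [summary] + messages[-tail:]
```

A real implementation layers lossless steps (deleting dead tool output) before this lossy one, but the invariant is the same: the ends of the conversation are sacred, the middle is negotiable.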
After the call: tool inspection. Goose runs every tool call through a 5-inspector pipeline before execution:
fn create_tool_inspection_manager(...) -> ToolInspectionManager {
    let mut manager = ToolInspectionManager::new();
    manager.add_inspector(Box::new(SecurityInspector::new()));
    manager.add_inspector(Box::new(EgressInspector::new()));
    manager.add_inspector(Box::new(AdversaryInspector::new(provider)));
    manager.add_inspector(Box::new(PermissionInspector::new(...)));
    manager.add_inspector(Box::new(RepetitionInspector::new(None)));
    manager
}
Security → Egress → Adversary (LLM-based review) → Permission → Repetition. The adversary inspector calls the LLM itself to review suspicious tool calls. The repetition inspector catches infinite loops. This is defense in depth. Nobody else in the group does this — most projects bolt on permission checks or skip them entirely.
Around the call: streaming tool execution. Claude Code doesn't wait for the model to finish speaking before starting tool execution. Read-only tools run in parallel while the stream is still flowing. Write tools get exclusive locks. It's a reader-writer lock pattern that makes the agent feel fast even when it's doing the same work.
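A minimal version of that reader-writer discipline can be written with asyncio primitives. This is my own sketch of the general pattern, not Claude Code's code: read-only tools run concurrently, while a write tool waits for in-flight readers to drain and excludes other writers:

```python
import asyncio

class ToolScheduler:
    """Reader-writer scheduling for tool calls: reads overlap,
    writes are exclusive against both reads and other writes."""

    def __init__(self):
        self._write_lock = asyncio.Lock()
        self._readers = 0
        self._no_readers = asyncio.Event()
        self._no_readers.set()

    async def run_read(self, coro):
        async with self._write_lock:       # a pending writer blocks new readers
            self._readers += 1
            self._no_readers.clear()
        try:
            return await coro
        finally:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.set()

    async def run_write(self, coro):
        async with self._write_lock:       # exclusive vs. other writers
            await self._no_readers.wait()  # drain in-flight readers first
            return await coro
```

In an agent loop, `run_read` would wrap tools like Read or Grep dispatched mid-stream, and `run_write` would wrap Edit or Bash, which is how parallel reads stay safe next to exclusive writes.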
None of this is model intelligence. It's engineering around the model. The harness — context management, tool safety, streaming execution, loop detection, cost tracking — is where the actual differentiation happens. You could swap the underlying LLM in most of these projects, and the agent would still behave roughly the same. You could not swap the harness.
Bonus: The Wildest Discoveries
Some things I found that don't fit into patterns but are too good not to mention:
Claude Code ships 18 virtual pet species. Hidden in the source code is a full tamagotchi system — virtual pets that the coding agent can apparently raise. 18 species. In production. In a tool that people run with sudo. I have questions.
Pi Mono's "stealth mode" impersonates Claude Code. The code renames Pi's tools to match Claude Code's exact casing before sending requests to Anthropic — Read, Write, Edit, Bash, Grep, Glob — to piggyback on whatever preferential treatment Anthropic gives its own tool. The author even maintains a public history tracker for Claude Code's prompts at cchistory.mariozechner.at. That's competitive intelligence on another level.
MiroFish's "collective intelligence" is LLMs playing pretend. 50K stars. The name promises collective intelligence. The actual implementation: extract entities from a document, give each entity an LLM persona, throw them into a simulated social network (using the OASIS engine from camel-ai), have them interact for N rounds, then compile the interaction logs into a "prediction report." There's no swarm algorithm, no evolutionary computation, no particle optimization. It's LLM role-playing on a fake Twitter. The report quality depends entirely on what the LLM already knows.
What This Means If You're Building an Agent
Four patterns. Every project rediscovers them independently:
Don't store conversations, store pointers. Freeze your memory snapshot. Compress aggressively. The model doesn't need perfect recall — it needs a good cheat sheet.
Build on MCP. The network effects are real. Every tool you build as an MCP server works with every MCP client. The holdouts will convert.
Extract your extension bus early. If your agent loop is over 2,000 lines, you've waited too long. Pull tools into extensions. Use middleware. Split your monorepo into packages with strict dependency boundaries.
Invest in the harness, not the model. Context compression, tool inspection, streaming execution, loop detection — that's where your actual product lives. The model is replaceable; the engineering around it is not.
The full teardowns — all 10 projects, architecture diagrams, code examples, comparisons — are at awesome-ai-anatomy. We publish a new one every week.
If you're building AI agents for a living, you should know how the best ones actually work.
Follow @NeuZhou for teardown threads. Join the Discord to discuss architecture decisions.