I found 18 virtual pet species hidden inside Claude Code. That's not in the docs. I found it by reading every file.
Over the past month, I read the source code of 15 AI agent projects - not the docs, not the READMEs, the actual code. Claude Code, OpenHands, Goose, Cline, Codex CLI, Dify, DeerFlow, Pi Mono, Hermes Agent, MemPalace, Lightpanda, MiroFish, oh-my-claudecode, oh-my-codex, and Guardrails AI. About 3.8 million lines of TypeScript, Rust, Python, and Zig.
I wasn't trying to rank them. I was trying to understand how production agents actually work - the architecture decisions that happen below the surface, the patterns nobody documents, and the trade-offs every team makes differently.
Here's what surprised me.
1. Context management is the real product - everything else is plumbing
The gap between projects that invest in context management and projects that don't is enormous. It's the single biggest predictor of whether an agent works for a 2-hour coding session or falls apart after 20 minutes.
OpenHands has 10 composable condenser strategies - including one where the agent can ask to be condensed (CondensationRequestTool), and another that uses probabilistic forgetting. You can chain them into pipelines. Claude Code runs a 4-layer cascade that starts lossless (just snipping history markers) and progressively gets more aggressive - structured archival, then full compression. The design principle is "lossless before lossy, local before global." It's elegant.
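The "lossless before lossy, local before global" cascade is easy to sketch. Here's a minimal Python version of the idea - the layer names and `Msg` record are my illustration, not Claude Code's actual implementation, and a real layer 3 would call an LLM summarizer:

```python
from dataclasses import dataclass

# Hypothetical message record; names are illustrative, not Claude Code's API.
@dataclass
class Msg:
    role: str
    text: str
    is_marker: bool = False  # e.g. a "history snipped here" bookkeeping marker

def estimate_tokens(msgs):
    return sum(len(m.text) // 4 for m in msgs)  # rough 4-chars-per-token heuristic

def condense(msgs, budget):
    """Apply strategies lossless-first, escalating only while over budget."""
    # Layer 1 (lossless): drop bookkeeping markers that carry no content.
    msgs = [m for m in msgs if not m.is_marker]
    if estimate_tokens(msgs) <= budget:
        return msgs
    # Layer 2 (local, lossy): archive the oldest half behind a one-line stub.
    cut = len(msgs) // 2
    stub = Msg("system", f"[{cut} earlier messages archived]")
    msgs = [stub] + msgs[cut:]
    if estimate_tokens(msgs) <= budget:
        return msgs
    # Layer 3 (global, lossy): a real agent would LLM-summarize here.
    return [Msg("system", "[conversation summarized]")] + msgs[-2:]
```

The point of the ordering: cheap, reversible operations run first, and expensive destructive ones only trigger when the budget forces them.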
Then there are projects - good ones, with tens of thousands of stars - where the entire context strategy is "hope the conversation doesn't get too long."
MemPalace took a different but clever approach: a 4-layer memory stack where identity context (~100 tokens) loads instantly, essential facts (~500 tokens) come next, and deep retrieval only kicks in on demand. Most sessions never need the full history. That means session startup stays under 900 tokens while other agents are loading entire conversation histories.
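The layered loading pattern fits in a dozen lines. This is my sketch of the shape, not MemPalace's code - the class and method names are assumptions:

```python
# Illustrative layered-memory sketch (names assumed, not MemPalace's API).
class LayeredMemory:
    def __init__(self, identity, facts, archive):
        self.identity = identity   # ~100 tokens, always loaded
        self.facts = facts         # ~500 tokens, loaded next
        self.archive = archive     # full history, searched only on demand

    def startup_context(self):
        # Cheap path: every session starts with identity + facts only.
        return self.identity + "\n" + self.facts

    def retrieve(self, query):
        # Expensive path: only invoked when the cheap layers can't answer.
        return [entry for entry in self.archive if query.lower() in entry.lower()]
```

The startup cost is fixed no matter how large the archive grows - that's the whole trick.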
If I were allocating engineering effort on a new agent, I'd spend 40% of it here.
2. Every agent grows a God Object. Every single one.
10 out of 12 projects with an agent loop have a single massive file at their center. Cline's Task class is 3,756 lines. Codex CLI's codex.rs is 7,786 lines. Hermes Agent's run_agent.py is over 9,000 lines - what I'd gently call "load-bearing spaghetti" at 26K stars.
This isn't bad engineering. It's gravity. An agent loop needs access to context, tools, permissions, and UI on every single turn. Without a structural pattern that forces separation, everything accumulates in one place.
Only two projects avoided it. DeerFlow uses a 14-middleware chain where each middleware handles exactly one concern (logging, safety, cost tracking, loop detection). The core stays thin while capabilities stack around it. Goose pushes all capabilities to MCP extensions, keeping the agent loop deliberately minimal.
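A middleware chain like DeerFlow's can be sketched with plain function wrappers - the middleware names below are mine, and DeerFlow's real implementation sits on LangGraph, but the composition principle is the same:

```python
# Minimal middleware-chain sketch; each middleware handles exactly one concern.
def logging_mw(handler):
    def wrapped(request, log):
        log.append(f"handling: {request}")
        return handler(request, log)
    return wrapped

def safety_mw(handler):
    def wrapped(request, log):
        if "rm -rf" in request:   # toy safety rule for illustration
            return "blocked"
        return handler(request, log)
    return wrapped

def core(request, log):
    # The agent loop itself stays thin; capabilities stack around it.
    return f"ran: {request}"

def build_chain(core, middlewares):
    handler = core
    for mw in reversed(middlewares):  # first listed runs outermost
        handler = mw(handler)
    return handler
```

Adding a fourteenth concern means writing one more wrapper, not touching the core - that's what keeps the God Object from forming.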
The lesson: if your architecture doesn't actively resist the God Object, you will grow one. It's not a question of discipline - it's physics.
3. Security is a 10,000-line commitment
The three projects with real security - Codex CLI, OpenHands, and Goose - each invested 10,000+ lines of security-specific code.
Codex CLI went deepest: OS-native sandboxing with Seatbelt on macOS, Landlock on Linux, and RestrictedToken on Windows. That's 17K lines just for the sandbox, plus a full MITM proxy for network filtering. OpenHands uses Docker container isolation plus a triple-analyzer stack (GraySwan + Invariant + LLM risk scoring). Goose built a 5-inspector pipeline that chains pattern matching, ML analysis, and LLM review.
Most projects? A permission dialog that users auto-approve after the third prompt. Cline even has a "YOLO mode" that disables all permission checks - which is at least honest about what most approval-based security becomes in practice.
There's no middle ground here. Either you commit to multi-layer security or you effectively have none. The minimum viable setup: a sandbox for code execution, human approval for writes, and some form of loop detection. Everything beyond that is differentiation - but those three are table stakes.
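Of the three table stakes, loop detection is the cheapest to add. One common approach - my sketch, not any of these projects' actual code - is flagging N identical consecutive tool calls:

```python
from collections import deque

# One possible loop detector: flag N identical consecutive tool calls.
# A sketch of the "table stakes" idea, not any project's implementation.
class LoopDetector:
    def __init__(self, window=3):
        self.window = window
        self.recent = deque(maxlen=window)

    def record(self, tool_name, args):
        """Record a tool call; returns True if the last `window` calls were identical."""
        self.recent.append((tool_name, tuple(sorted(args.items()))))
        return len(self.recent) == self.window and len(set(self.recent)) == 1
```

Real implementations also catch alternating A-B-A-B loops and semantic near-duplicates, but even this naive version stops the most common failure mode: an agent retrying the same failing call forever.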
4. MCP is winning the extension protocol war
The projects that adopted MCP (Model Context Protocol) early got the largest extension ecosystems with the least code.
Goose went all-in - MCP is their extension system. All six of their extension types speak it. Adding a new OpenAI-compatible provider is a 10-line JSON file. No Rust code, no compilation, no PR needed. Cline, Codex CLI, and MemPalace all adopted it too, and immediately gained access to a shared ecosystem of tools.
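For a sense of what that looks like: a declarative provider file is roughly this shape (illustrative, not Goose's exact schema - the field names here are my assumptions):

```json
{
  "name": "my-local-llm",
  "display_name": "My Local LLM",
  "api_format": "openai",
  "base_url": "http://localhost:8080/v1",
  "models": ["llama-3.1-8b-instruct"]
}
```

An endpoint, a format, a model list - that's the entire registration surface for an OpenAI-compatible provider.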
Projects that built proprietary extension protocols - Dify's plugin daemon, Guardrails AI's Validator Hub - created deeper integration but smaller ecosystems. It's the classic platform tradeoff: MCP gives you breadth; proprietary gives you depth.
MCP is doing for agent tools what HTTP did for web services. If you're building an agent extension system today, you need a strong reason not to adopt it.
5. Markdown is the new config file
This one snuck up on me. Across multiple projects, markdown files have become the primary way to configure agent behavior:
- Claude Code has `CLAUDE.md` for project rules
- Codex CLI uses `AGENTS.md` for workspace configuration and contribution guidelines
- Cline uses `.clinerules/` directories
- oh-my-claudecode defines agent roles as `.md` files - drop a markdown file, get a new agent
It works because LLMs can just read it. No parsing, no schema validation, no YAML indentation errors. Developers write natural language instructions, the agent reads them directly. It's human-readable, version-controllable, and LLM-parseable all at once.
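For illustration, a hypothetical `CLAUDE.md` might read like this - the convention is just prose, so any instructions the model can read will work:

```markdown
# Project rules

- Run the test suite before claiming a task is done.
- Never edit files under `vendor/`.
- Prefer small, focused commits with descriptive messages.
```

No schema, no parser - the "config format" is whatever the model can understand.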
This is the package.json moment for AI agents: a convention that's so obvious in hindsight that every project will adopt it. Hermes Agent takes it further - its self-improving skills system means the agent itself creates and edits markdown files to learn new capabilities. Skills as versioned, discoverable, agent-editable markdown files is "learning from experience" implemented as file management, not neural network fine-tuning.
6. Building from scratch produces better architecture than using frameworks
The top 4 projects by code quality - Codex CLI, Claude Code, Goose, and Lightpanda - all built from scratch. The from-scratch average quality is noticeably higher than the framework-dependent average.
Now, correlation isn't causation. These are well-resourced teams (Anthropic, OpenAI, Block). But there's a structural reason too: framework dependency creates a ceiling. DeerFlow is coupled to LangGraph - when you debug agent failures, you debug through LangGraph's internals, not your own code. MiroFish is coupled to OASIS - the core engine isn't theirs. When the framework makes a breaking change, you eat it.
The exception is interesting: Dify extracted their graph execution engine into graphon, a standalone PyPI package. They used a framework to bootstrap, then extracted the core into their own abstraction once they understood the domain well enough. That's the move - if you start with a framework, plan your extraction strategy from day one. The worst outcome is being two years into LangGraph when they ship a breaking change.
7. The filesystem is becoming the extension registry
Some of the best DX patterns I found are shockingly simple:
- Goose: Drop a JSON file in `providers/declarative/`, get a new LLM provider
- Cline: Drop a shell script in `.cline/hooks/`, get a lifecycle hook
- oh-my-codex: Drop an `.mjs` file in `.omx/hooks/`, get an event plugin
- oh-my-claudecode: Drop a markdown file in `agents/`, get a new agent role
No API calls. No package managers. No marketplace account. Just files in directories. Convention-over-configuration applied to agent extensibility - and it has the lowest friction of any extension model I've seen.
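The loader side of this pattern is equally simple. A sketch of a filesystem-as-registry loader - directory layout and schema are illustrative assumptions, not any project's actual code:

```python
import json
from pathlib import Path

# Convention-over-configuration loader sketch: every JSON file in a
# directory becomes a registered provider. Dropping in a file IS the API.
def load_providers(directory):
    providers = {}
    for path in sorted(Path(directory).glob("*.json")):
        spec = json.loads(path.read_text())
        providers[spec["name"]] = spec  # filename drop-in == registration
    return providers
```

Discovery is a glob, registration is a dict insert, and removal is `rm` - the filesystem does all the bookkeeping.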
The projects that support multiple entry points (config file, shell script, full code contribution) have the healthiest contributor pools. Projects that only accept code-level PRs in Rust have fewer contributors. Stars don't tell the full story either - I saw projects with 50K stars and zero test files, and projects with 37K stars and foundation governance with real extension registries. Forks and PRs measure community health; stars measure marketing.
What I'd steal if I were building an agent today
After reading all 15 codebases, here's what I'd take:
MemPalace's layered memory loading. Identity context (~100 tokens) loads first, essential facts (~500 tokens) next, deep retrieval only on demand. Most sessions never need the full history. Session startup stays under 900 tokens.
DeerFlow's middleware chain. 14 composable middlewares, ~500 lines to implement, prevents a 5,000-line monolith. It's the highest-leverage architectural decision you can make.
Goose's declarative provider pattern. Adding a provider shouldn't require code changes. A JSON file with an endpoint URL and model name is enough.
Claude Code's context cascade principle. Lossless before lossy. Local before global. Start by snipping history markers, escalate to summarization only when you have to.
Pi Mono's lazy provider loading. Dynamic import + promise caching. Unused providers cost zero at startup. Game-engine texture streaming applied to LLM providers - simple and effective.
Codex CLI's queue-pair architecture. If you need multiple frontends (CLI, desktop, MCP server), typed `Submission`/`Event` channels decouple the core from the interface. Overengineering for one frontend, essential for three.
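A minimal sketch of the queue-pair idea in Python - the type names follow the article's description, not Codex CLI's actual Rust types:

```python
import queue
from dataclasses import dataclass

# Queue-pair sketch: the core consumes typed Submissions and emits Events,
# so any frontend (CLI, desktop, MCP server) just talks to two queues.
@dataclass
class Submission:
    id: int
    op: str

@dataclass
class Event:
    id: int
    payload: str

def run_core(sub_q, event_q):
    """The agent core: reads Submissions, writes Events, knows no frontend."""
    while True:
        sub = sub_q.get()
        if sub is None:        # shutdown sentinel
            break
        event_q.put(Event(sub.id, f"done: {sub.op}"))
```

Each frontend owns its own pair of queues; the core never imports UI code, and adding a third frontend means wiring two queues, not refactoring the loop.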
The full teardowns
Every claim above has file paths and line numbers behind it. I published the complete source-level analyses - architecture diagrams, security evaluations, "stuff worth stealing" sections, and verification logs - in the awesome-ai-anatomy repo.
15 projects dissected. Updated weekly. If you build AI agents, you should know how the best ones actually work.
What patterns have you noticed in agent codebases? I'm always looking for the next project to tear down - drop suggestions in the comments.