neuzhou
I read the source code of 11 AI agents. Most of them are a mess.

I've spent the last few months reading the source code of 11 AI coding agents, line by line. Not the README. Not the docs. The actual implementations -- grep, wc -l, reading every module until the architecture clicks.

Reading a codebase is not the same as maintaining it at 3am. These are observations from the outside. But some of what I found was hard to unsee.

The 5 findings that kept me up at night

1. Claude Code ships 18 virtual pet species in production

Not a joke. Anthropic's flagship coding agent -- the one people run with sudo on their machines -- contains a full tamagotchi system. 18 species of virtual pets, hidden in the TypeScript source. A virtual pet system. In a coding agent. That has access to your filesystem.

I'm not saying it's a backdoor. I'm saying: if they shipped this without anyone noticing, what else is in there?

2. Pi Mono has a "stealth mode" that impersonates Claude Code

Pi Mono (32K stars on GitHub) has a feature called stealth mode. What it does: it fakes Claude Code's tool names when making API calls. The goal is to dodge rate limits by pretending to be a different product.

This isn't buried in some fork. It's in the main codebase. The tool names are spoofed to look like Claude Code's tool ecosystem, giving Pi Mono preferential treatment from API providers that whitelist Anthropic's tooling.

One Anthropic detection update, and every Pi Mono user gets rate-limited or key-flagged. Great strategy.

3. MiroFish: 50K stars, zero collective intelligence

MiroFish markets itself as a "collective intelligence" platform. 50K GitHub stars. Sounds like something real.

It's not.

The "collective intelligence" is LLMs role-playing as humans on a simulated social network powered by the OASIS engine from camel-ai. There are no real humans. There is no real collective. It's language models pretending to be people, posting on a fake social network, and the output gets called "collective intelligence."

The codebase is 39K lines. No input validation. No sandbox. The core capability is borrowed entirely from OASIS -- MiroFish doesn't even own its main feature. The builtins.open monkey-patch for Windows compatibility tells you everything about the level of engineering rigor.

4. Lightpanda built an entire browser in Zig for AI agents

This one is actually good.

Lightpanda wrote a headless browser from scratch in Zig. Not a wrapper around Chrome. Not Puppeteer with extra steps. A browser. From scratch. 91K lines of Zig + Rust FFI. The rendering pipeline is libcurl -> html5ever -> custom Zig DOM -> V8 -> CDP.

Their benchmarks show it running 9x faster than headless Chrome on typical AI agent workloads. The bitcast dispatch trick they use lets Zig act like a language with vtables -- a systems programming technique I hadn't seen before. Comptime metaprogramming pushed to its useful limit.

Single binary. No container. Just works.

5. Every single project has a God Object

I counted. Every one. The worst offender: Hermes Agent's run_agent.py at 9,000+ lines. One file. Agent loop, tool dispatch, context management, provider calls, error handling, cron scheduling, memory ops -- all crammed in.

Here's the full list:

| Project | God file | Lines |
| --- | --- | --- |
| Hermes Agent | run_agent.py | 9,000+ |
| Lightpanda | Page.zig | 3,660 |
| Claude Code | query.ts | 1,729 |
| Pi Mono | agent-session.ts | 1,500+ |
| MiroFish | report_agent.py | 1,400+ |
| Guardrails AI | guard.py | 1,076 |

The while-loop pattern makes this almost inevitable. Your agent loop starts at 200 lines, then someone adds error recovery, then streaming, then tool dispatch, then context management, and suddenly you're reviewing a 9,000-line PR because nobody wanted to do the refactor.

DeerFlow is the counter-example: 16 middleware files, ~200 lines each, one concern per file. Clean. Testable. Composable. But DeerFlow has its own problems (more on that below).
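To make the contrast concrete, here is a minimal sketch of what a middleware-chain agent core can look like. The names and shapes are illustrative, not DeerFlow's actual API -- the point is that each concern lives in its own small function and the chain composes them:

```python
from typing import Any, Callable, Dict

# A handler transforms agent state; a middleware wraps a handler,
# does its one concern, and delegates onward. One concern per file.
State = Dict[str, Any]
Handler = Callable[[State], State]

def logging_middleware(state: State, nxt: Handler) -> State:
    state.setdefault("trace", []).append("log")
    return nxt(state)

def summarize_middleware(state: State, nxt: Handler) -> State:
    state.setdefault("trace", []).append("summarize")
    return nxt(state)

def build_chain(middlewares, core: Handler) -> Handler:
    """Fold the middleware list around the core, outermost first."""
    handler = core
    for mw in reversed(middlewares):
        handler = (lambda m, n: lambda s: m(s, n))(mw, handler)
    return handler

chain = build_chain(
    [logging_middleware, summarize_middleware],
    lambda s: {**s, "done": True},  # the "core" agent step
)
result = chain({})
```

Adding a new concern (loop detection, cost tracking, retries) is a new ~200-line file and one entry in the list, instead of another branch inside a 9,000-line loop.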

Patterns: what actually works

After reading all 11 codebases, some patterns stand out.

The while-loop wins

4 out of 11 projects use a simple while(true) loop as their agent core: Claude Code, Goose, Pi Mono, Hermes Agent. The agent loop is sequential -- model speaks, tools execute, model speaks again. A while-loop expresses this naturally.

Dify uses a graph-based DAG engine (the enterprise choice). DeerFlow uses a middleware chain (best extensibility-to-complexity ratio). oh-my-claudecode uses a phase-based pipeline (plan -> exec -> verify -> fix). But the while-loop projects ship faster and are easier to debug.

The cost is the God Object problem above. Pick your poison.
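The whole while-loop pattern fits in a few lines, which is exactly why four of these projects chose it. A minimal sketch (the message and reply shapes are my assumptions, not any specific project's API):

```python
def run_agent(model, tools, prompt, max_turns=50):
    """Minimal while-loop agent core: model speaks, tools execute, repeat.

    `model(messages)` is assumed to return either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):  # a bounded loop instead of while(true)
        reply = model(messages)
        if "answer" in reply:   # model is done talking
            return reply["answer"]
        # Tool dispatch: look up the tool, run it, feed the result back.
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("max turns exceeded")
```

Everything that makes real agents big -- error recovery, streaming, context compaction -- gets bolted onto this loop, which is how the god files happen.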

Context management is where the gap shows

Everyone talks about model choice and prompt engineering. How you manage the context window is where the gap actually shows.

Claude Code has a 4-layer cascade. Layer 1: surgical deletion of low-value messages. Layer 2: cache-level hiding. Layer 3: structured archival. Layer 4: full LLM compression. Lossless operations first, lossy operations only when necessary. This is well-engineered.
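The cascade's control flow is simple once you see it: try each layer in order, and stop the moment the context fits. A sketch of that ordering principle (the layer functions here are stand-ins, not Claude Code's implementation):

```python
def compact_context(messages, budget, token_count, layers):
    """Apply compaction layers in order -- lossless first, lossy last --
    stopping as soon as the context fits the budget.

    `layers` would be something like:
    [delete_low_value, hide_cached, archive_structured, llm_compress]
    """
    for layer in layers:
        if token_count(messages) <= budget:
            break  # cheapest sufficient layer wins; never over-compress
        messages = layer(messages)
    return messages
```

The insight worth stealing is the ordering, not the layers themselves: you never pay for an LLM compression call when a surgical deletion would have been enough.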

Hermes Agent has a 5-step compression pipeline with head/tail protection and a structured summary template (Goal/Progress/Decisions/Files/Next). Plus a neat trick: freezing MEMORY.md at session start so the system prompt stays stable, preserving the provider's prompt cache. Nobody else does this.
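The frozen-memory trick is worth spelling out, because it exploits how provider-side prompt caching works: the cache keys on a stable prefix, so if the system prompt's bytes change mid-session, the cache is cold again. A sketch of the idea (illustrative, not Hermes Agent's code):

```python
import hashlib

class Session:
    """Freeze the memory file's contents at session start so the system
    prompt's bytes never change mid-session, keeping the provider's
    prompt cache warm even if MEMORY.md changes on disk."""

    def __init__(self, memory_text: str):
        self._frozen_memory = memory_text  # snapshot; never re-read
        self.system_prompt = f"MEMORY:\n{self._frozen_memory}"
        self.prompt_hash = hashlib.sha256(
            self.system_prompt.encode()
        ).hexdigest()

    def system(self) -> str:
        # Always serve the frozen snapshot -- a stable prefix is what
        # makes the provider's prompt cache hit on every turn.
        return self.system_prompt
```

New memories still get written to disk; they just don't enter the prompt until the next session starts.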

Goose proactively compresses at 80% capacity with concurrent background summarization of tool call/result pairs.

MiroFish has no context management at all. DeerFlow has a single summarization middleware with no progressive degradation. Claude Code has four layers of it. That's the gap.

Nobody has solved cost budgets

This is the single biggest gap across all 11 projects.

DeerFlow tracks tokens but sets no spending limits. Hermes tracks memory usage but has no dollar ceiling. oh-my-claudecode runs 19 agents across 3 model tiers with zero cost controls. Goose has a 1000-turn max but no dollar cap.

Only Dify has execution limits (500 steps, 1200 seconds) set at the infrastructure level. Every other project trusts the model to know when to stop, which is the one thing models are reliably bad at.

Your first $300 runaway session at 3am will fix this real quick.
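A dollar ceiling is maybe thirty lines, enforced in the loop rather than trusted to the model. A sketch, with illustrative per-million-token prices (plug in your provider's real rates):

```python
class CostBudget:
    """Hard dollar ceiling on a session. Charge after every model call;
    raise before the next one can run away."""

    def __init__(self, max_usd: float,
                 in_price_per_mtok: float = 3.0,
                 out_price_per_mtok: float = 15.0):
        self.max_usd = max_usd
        self.spent = 0.0
        self._in = in_price_per_mtok / 1_000_000
        self._out = out_price_per_mtok / 1_000_000

    def charge(self, in_tokens: int, out_tokens: int) -> None:
        self.spent += in_tokens * self._in + out_tokens * self._out
        if self.spent > self.max_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent:.2f} > ${self.max_usd:.2f}"
            )
```

Call `charge()` with the usage numbers every provider already returns, and the $300 session becomes a $10 exception.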

Security is an afterthought (with one exception)

I graded all 11 projects on security across 7 dimensions: input validation, sandbox/isolation, auth/RBAC, prompt injection defense, data exfiltration prevention, tool execution safety, and memory/state protection.

Goose is the clear leader. Its 5-inspector pipeline (Security, Egress, Adversary, Permission, Repetition) runs before every tool call. Each inspector returns Allow, RequireApproval, or Deny. The AdversaryInspector calls the LLM itself to review suspicious calls. Plus a 31-key env var blocklist that prevents DLL injection and library preloading through extension configs. Nobody else comes close.
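The structure of that pipeline is the reusable part: run every inspector before the tool executes, and let the most restrictive verdict win. A sketch of the pattern (the inspector logic here is an illustrative stand-in, not Goose's Rust code):

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = 0
    REQUIRE_APPROVAL = 1
    DENY = 2

def inspect_tool_call(call, inspectors) -> Verdict:
    """Run all inspectors on a pending tool call; the most restrictive
    verdict wins. DENY short-circuits the rest."""
    worst = Verdict.ALLOW
    for inspector in inspectors:
        verdict = inspector(call)
        if verdict.value > worst.value:
            worst = verdict
        if worst is Verdict.DENY:
            break
    return worst

# Two toy inspectors in the spirit of Security and Egress.
def security_inspector(call):
    return (Verdict.DENY if "rm -rf" in call.get("command", "")
            else Verdict.ALLOW)

def egress_inspector(call):
    return (Verdict.REQUIRE_APPROVAL if "http" in call.get("command", "")
            else Verdict.ALLOW)
```

Each inspector stays tiny and testable, and adding a new check (repetition, permissions, an LLM-backed adversary review) is just another entry in the list.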

OpenAI's Codex CLI deserves mention too -- queue-pair architecture with a Guardian AI approval gate and full 3-OS sandboxing (macOS Seatbelt, Linux Landlock, Docker fallback).

DeerFlow has no authentication, no RBAC, no rate limiting. The security section of their docs literally says "improper deployment may introduce security risks." Deploy it on a public IP and anyone can execute arbitrary code on your machine.

The ratings

| # | Project | Stars | Overall | Why |
| --- | --- | --- | --- | --- |
| 1 | Claude Code | 109K | A- | Best context management, virtual pets notwithstanding. Anthropic-locked. |
| 2 | Dify | 136K | B+ | Enterprise-grade. 7+ containers and 400+ env vars to prove it. |
| 3 | Goose | 37K | A- | Best security by far. 30+ providers. MCP-first. Clean Rust. |
| 4 | Codex CLI | 27K | A | Solid sandboxing, Guardian AI approval gate. 3-OS sandbox coverage. |
| 5 | DeerFlow | 58K | B- | Good middleware architecture. Security is a README paragraph. |
| 6 | Pi Mono | 32K | B | Clever extension system. Stealth mode is a liability. |
| 7 | Hermes Agent | 26K | B- | Best memory recall (FTS5). 9K-line god file holds it back. |
| 8 | oh-my-claudecode | 24K | B | 19-agent team is ambitious. One Anthropic update breaks everything. |
| 9 | Lightpanda | 27K | A- | Not an agent, but the best-engineered browser in this group. |
| 10 | Guardrails AI | 6.6K | B+ | Focused scope done well. Hub supply chain is the risk. |
| 11 | MiroFish | 50K | C | 50K stars built on marketing. Core tech is borrowed. No security. |

What I'd steal if I were building an agent today

  • Context cascade from Claude Code (lossless before lossy)
  • Middleware architecture from DeerFlow (one concern per file)
  • 5-inspector security pipeline from Goose
  • Frozen memory snapshots from Hermes Agent
  • Functional tool composition from Claude Code / Pi Mono
  • Loop detection from DeerFlow (hash-based, warn at 3, kill at 5)

And I'd add cost budgets on day one. Because nobody else did, and they all should have.
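The loop detection item is small enough to show in full. A sketch following the thresholds described above (hash each tool call, warn at 3 identical calls, kill at 5); the code is mine, not DeerFlow's:

```python
import hashlib

class LoopDetector:
    """Detect an agent stuck repeating the same tool call.
    Warn at the 3rd identical call, kill the run at the 5th."""

    WARN_AT = 3
    KILL_AT = 5

    def __init__(self):
        self.counts: dict[str, int] = {}

    def check(self, tool: str, args: str) -> str:
        # Hash the (tool, args) pair so the key is cheap to store
        # and identical calls collide deterministically.
        key = hashlib.sha256(f"{tool}:{args}".encode()).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        n = self.counts[key]
        if n >= self.KILL_AT:
            return "kill"
        if n >= self.WARN_AT:
            return "warn"
        return "ok"
```

Wire `check()` into the tool-dispatch step of the agent loop and a runaway retry storm dies after five iterations instead of five hundred.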



Want the full teardowns?

Each project gets its own deep-dive with architecture diagrams, code references, and security analysis. All open source.

github.com/NeuZhou/awesome-ai-anatomy - 11 teardowns and counting. Star it if you want updates when new agents get dissected.

Currently working on: Cursor, Aider, and OpenHands.
