neuzhou
I read the source code of 11 AI agents. Most of them are a mess.

I've spent the last few months reading the source code of 11 AI coding agents, line by line. Not the README. Not the docs. The actual implementations -- grep, wc -l, reading every module until the architecture clicks.

Reading a codebase is not the same as maintaining it at 3am. These are observations from the outside. But some of what I found was hard to unsee.

The 5 findings that kept me up at night

1. Claude Code ships 18 virtual pet species in production

Not a joke. Anthropic's flagship coding agent -- the one people run with sudo on their machines -- contains a full tamagotchi system. 18 species of virtual pets, hidden in the TypeScript source. A virtual pet system. In a coding agent. That has access to your filesystem.

I'm not saying it's a backdoor. I'm saying: if they shipped this without anyone noticing, what else is in there?

2. Pi Mono has a "stealth mode" that impersonates Claude Code

Pi Mono (32K stars on GitHub) has a feature called stealth mode. What it does: it fakes Claude Code's tool names when making API calls. The goal is to dodge rate limits by pretending to be a different product.

This isn't buried in some fork. It's in the main codebase. The tool names are spoofed to look like Claude Code's tool ecosystem, giving Pi Mono preferential treatment from API providers that whitelist Anthropic's tooling.

One Anthropic detection update, and every Pi Mono user gets rate-limited or key-flagged. Great strategy.

3. MiroFish: 50K stars, zero collective intelligence

MiroFish markets itself as a "collective intelligence" platform. 50K GitHub stars. Sounds like something real.

It's not.

The "collective intelligence" is LLMs role-playing as humans on a simulated social network powered by the OASIS engine from camel-ai. There are no real humans. There is no real collective. It's language models pretending to be people, posting on a fake social network, and the output gets called "collective intelligence."

The codebase is 39K lines. No input validation. No sandbox. The core capability is borrowed entirely from OASIS -- MiroFish doesn't even own its main feature. The builtins.open monkey-patch for Windows compatibility tells you everything about the level of engineering rigor.

4. Lightpanda built an entire browser in Zig for AI agents

This one is actually good.

Lightpanda wrote a headless browser from scratch in Zig. Not a wrapper around Chrome. Not Puppeteer with extra steps. A browser. From scratch. 91K lines of Zig + Rust FFI. The rendering pipeline is libcurl -> html5ever -> custom Zig DOM -> V8 -> CDP.

Their benchmarks show it running 9x faster than headless Chrome on typical AI agent workloads. The bitcast dispatch trick they use lets Zig act like a language with vtables -- a systems programming technique I hadn't seen before. Comptime metaprogramming pushed to its useful limit.

Single binary. No container. Just works.

5. Every single project has a God Object

I counted. Every one. The worst offender: Hermes Agent's run_agent.py at 9,000+ lines. One file. Agent loop, tool dispatch, context management, provider calls, error handling, cron scheduling, memory ops -- all crammed in.

Here's the full list:

| Project | God file | Lines |
| --- | --- | --- |
| Hermes Agent | run_agent.py | 9,000+ |
| Lightpanda | Page.zig | 3,660 |
| Claude Code | query.ts | 1,729 |
| Pi Mono | agent-session.ts | 1,500+ |
| MiroFish | report_agent.py | 1,400+ |
| Guardrails AI | guard.py | 1,076 |

The while-loop pattern makes this almost inevitable. Your agent loop starts at 200 lines, then someone adds error recovery, then streaming, then tool dispatch, then context management, and suddenly you're reviewing a 9,000-line PR because nobody wanted to do the refactor.

DeerFlow is the counter-example: 16 middleware files, ~200 lines each, one concern per file. Clean. Testable. Composable. But DeerFlow has its own problems (more on that below).
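To make the contrast concrete, here is a minimal sketch of what a middleware-chain agent core can look like. The names and shapes are illustrative, not DeerFlow's actual API -- the point is that each concern lives in its own small function and the chain composes them:

```python
from typing import Any, Callable, Dict

# A handler transforms agent state; a middleware wraps a handler,
# does its one concern, and delegates onward. One concern per file.
State = Dict[str, Any]
Handler = Callable[[State], State]

def logging_middleware(state: State, nxt: Handler) -> State:
    state.setdefault("trace", []).append("log")
    return nxt(state)

def summarize_middleware(state: State, nxt: Handler) -> State:
    state.setdefault("trace", []).append("summarize")
    return nxt(state)

def build_chain(middlewares, core: Handler) -> Handler:
    """Fold the middleware list around the core, outermost first."""
    handler = core
    for mw in reversed(middlewares):
        handler = (lambda m, n: lambda s: m(s, n))(mw, handler)
    return handler

chain = build_chain(
    [logging_middleware, summarize_middleware],
    lambda s: {**s, "done": True},  # the "core" agent step
)
result = chain({})
```

Adding a new concern (loop detection, cost tracking, retries) is a new ~200-line file and one entry in the list, instead of another branch inside a 9,000-line loop.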

Patterns: what actually works

After reading all 11 codebases, some patterns stand out.

The while-loop wins

4 out of 11 projects use a simple while(true) loop as their agent core: Claude Code, Goose, Pi Mono, Hermes Agent. The agent loop is sequential -- model speaks, tools execute, model speaks again. A while-loop expresses this naturally.

Dify uses a graph-based DAG engine (the enterprise choice). DeerFlow uses a middleware chain (best extensibility-to-complexity ratio). oh-my-claudecode uses a phase-based pipeline (plan -> exec -> verify -> fix). But the while-loop projects ship faster and are easier to debug.

The cost is the God Object problem above. Pick your poison.
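The whole while-loop pattern fits in a few lines, which is exactly why four of these projects chose it. A minimal sketch (the message and reply shapes are my assumptions, not any specific project's API):

```python
def run_agent(model, tools, prompt, max_turns=50):
    """Minimal while-loop agent core: model speaks, tools execute, repeat.

    `model(messages)` is assumed to return either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):  # a bounded loop instead of while(true)
        reply = model(messages)
        if "answer" in reply:   # model is done talking
            return reply["answer"]
        # Tool dispatch: look up the tool, run it, feed the result back.
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("max turns exceeded")
```

Everything that makes real agents big -- error recovery, streaming, context compaction -- gets bolted onto this loop, which is how the god files happen.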

Context management is where the gap shows

Everyone talks about model choice and prompt engineering. How you manage the context window is where the gap actually shows.

Claude Code has a 4-layer cascade. Layer 1: surgical deletion of low-value messages. Layer 2: cache-level hiding. Layer 3: structured archival. Layer 4: full LLM compression. Lossless operations first, lossy operations only when necessary. This is well-engineered.
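The cascade's control flow is simple once you see it: try each layer in order, and stop the moment the context fits. A sketch of that ordering principle (the layer functions here are stand-ins, not Claude Code's implementation):

```python
def compact_context(messages, budget, token_count, layers):
    """Apply compaction layers in order -- lossless first, lossy last --
    stopping as soon as the context fits the budget.

    `layers` would be something like:
    [delete_low_value, hide_cached, archive_structured, llm_compress]
    """
    for layer in layers:
        if token_count(messages) <= budget:
            break  # cheapest sufficient layer wins; never over-compress
        messages = layer(messages)
    return messages
```

The insight worth stealing is the ordering, not the layers themselves: you never pay for an LLM compression call when a surgical deletion would have been enough.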

Hermes Agent has a 5-step compression pipeline with head/tail protection and a structured summary template (Goal/Progress/Decisions/Files/Next). Plus a neat trick: freezing MEMORY.md at session start so the system prompt stays stable, preserving the provider's prompt cache. Nobody else does this.
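The frozen-memory trick is worth spelling out, because it exploits how provider-side prompt caching works: the cache keys on a stable prefix, so if the system prompt's bytes change mid-session, the cache is cold again. A sketch of the idea (illustrative, not Hermes Agent's code):

```python
import hashlib

class Session:
    """Freeze the memory file's contents at session start so the system
    prompt's bytes never change mid-session, keeping the provider's
    prompt cache warm even if MEMORY.md changes on disk."""

    def __init__(self, memory_text: str):
        self._frozen_memory = memory_text  # snapshot; never re-read
        self.system_prompt = f"MEMORY:\n{self._frozen_memory}"
        self.prompt_hash = hashlib.sha256(
            self.system_prompt.encode()
        ).hexdigest()

    def system(self) -> str:
        # Always serve the frozen snapshot -- a stable prefix is what
        # makes the provider's prompt cache hit on every turn.
        return self.system_prompt
```

New memories still get written to disk; they just don't enter the prompt until the next session starts.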

Goose proactively compresses at 80% capacity with concurrent background summarization of tool call/result pairs.

MiroFish has no context management at all. DeerFlow has a single summarization middleware with no progressive degradation. Claude Code has four layers of it. That's the gap.

Nobody has solved cost budgets

This is the single biggest gap across all 11 projects.

DeerFlow tracks tokens but sets no spending limits. Hermes tracks memory usage but has no dollar ceiling. oh-my-claudecode runs 19 agents across 3 model tiers with zero cost controls. Goose has a 1000-turn max but no dollar cap.

Only Dify has execution limits (500 steps, 1200 seconds) set at the infrastructure level. Every other project trusts the model to know when to stop, which is the one thing models are reliably bad at.

Your first $300 runaway session at 3am will fix this real quick.
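A dollar ceiling is maybe thirty lines, enforced in the loop rather than trusted to the model. A sketch, with illustrative per-million-token prices (plug in your provider's real rates):

```python
class CostBudget:
    """Hard dollar ceiling on a session. Charge after every model call;
    raise before the next one can run away."""

    def __init__(self, max_usd: float,
                 in_price_per_mtok: float = 3.0,
                 out_price_per_mtok: float = 15.0):
        self.max_usd = max_usd
        self.spent = 0.0
        self._in = in_price_per_mtok / 1_000_000
        self._out = out_price_per_mtok / 1_000_000

    def charge(self, in_tokens: int, out_tokens: int) -> None:
        self.spent += in_tokens * self._in + out_tokens * self._out
        if self.spent > self.max_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent:.2f} > ${self.max_usd:.2f}"
            )
```

Call `charge()` with the usage numbers every provider already returns, and the $300 session becomes a $10 exception.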

Security is an afterthought (with one exception)

I graded all 11 projects on security across 7 dimensions: input validation, sandbox/isolation, auth/RBAC, prompt injection defense, data exfiltration prevention, tool execution safety, and memory/state protection.

Goose is the clear leader. Its 5-inspector pipeline (Security, Egress, Adversary, Permission, Repetition) runs before every tool call. Each inspector returns Allow, RequireApproval, or Deny. The AdversaryInspector calls the LLM itself to review suspicious calls. Plus a 31-key env var blocklist that prevents DLL injection and library preloading through extension configs. Nobody else comes close.
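The structure of that pipeline is the reusable part: run every inspector before the tool executes, and let the most restrictive verdict win. A sketch of the pattern (the inspector logic here is an illustrative stand-in, not Goose's Rust code):

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = 0
    REQUIRE_APPROVAL = 1
    DENY = 2

def inspect_tool_call(call, inspectors) -> Verdict:
    """Run all inspectors on a pending tool call; the most restrictive
    verdict wins. DENY short-circuits the rest."""
    worst = Verdict.ALLOW
    for inspector in inspectors:
        verdict = inspector(call)
        if verdict.value > worst.value:
            worst = verdict
        if worst is Verdict.DENY:
            break
    return worst

# Two toy inspectors in the spirit of Security and Egress.
def security_inspector(call):
    return (Verdict.DENY if "rm -rf" in call.get("command", "")
            else Verdict.ALLOW)

def egress_inspector(call):
    return (Verdict.REQUIRE_APPROVAL if "http" in call.get("command", "")
            else Verdict.ALLOW)
```

Each inspector stays tiny and testable, and adding a new check (repetition, permissions, an LLM-backed adversary review) is just another entry in the list.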

OpenAI's Codex CLI deserves mention too -- queue-pair architecture with a Guardian AI approval gate and full 3-OS sandboxing (macOS Seatbelt, Linux Landlock, Docker fallback).

DeerFlow has no authentication, no RBAC, no rate limiting. The security section of their docs literally says "improper deployment may introduce security risks." Deploy it on a public IP and anyone can execute arbitrary code on your machine.

The ratings

| # | Project | Stars | Overall | Why |
| --- | --- | --- | --- | --- |
| 1 | Claude Code | 109K | A- | Best context management, virtual pets notwithstanding. Anthropic-locked. |
| 2 | Dify | 136K | B+ | Enterprise-grade. 7+ containers and 400+ env vars to prove it. |
| 3 | Goose | 37K | A- | Best security by far. 30+ providers. MCP-first. Clean Rust. |
| 4 | Codex CLI | 27K | A | Solid sandboxing, Guardian AI approval gate. 3-OS sandbox coverage. |
| 5 | DeerFlow | 58K | B- | Good middleware architecture. Security is a README paragraph. |
| 6 | Pi Mono | 32K | B | Clever extension system. Stealth mode is a liability. |
| 7 | Hermes Agent | 26K | B- | Best memory recall (FTS5). 9K-line god file holds it back. |
| 8 | oh-my-claudecode | 24K | B | 19-agent team is ambitious. One Anthropic update breaks everything. |
| 9 | Lightpanda | 27K | A- | Not an agent, but the best-engineered browser in this group. |
| 10 | Guardrails AI | 6.6K | B+ | Focused scope done well. Hub supply chain is the risk. |
| 11 | MiroFish | 50K | C | 50K stars built on marketing. Core tech is borrowed. No security. |

What I'd steal if I were building an agent today

  • Context cascade from Claude Code (lossless before lossy)
  • Middleware architecture from DeerFlow (one concern per file)
  • 5-inspector security pipeline from Goose
  • Frozen memory snapshots from Hermes Agent
  • Functional tool composition from Claude Code / Pi Mono
  • Loop detection from DeerFlow (hash-based, warn at 3, kill at 5)

And I'd add cost budgets on day one. Because nobody else did, and they all should have.
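The loop detection item is small enough to show in full. A sketch following the thresholds described above (hash each tool call, warn at 3 identical calls, kill at 5); the code is mine, not DeerFlow's:

```python
import hashlib

class LoopDetector:
    """Detect an agent stuck repeating the same tool call.
    Warn at the 3rd identical call, kill the run at the 5th."""

    WARN_AT = 3
    KILL_AT = 5

    def __init__(self):
        self.counts: dict[str, int] = {}

    def check(self, tool: str, args: str) -> str:
        # Hash the (tool, args) pair so the key is cheap to store
        # and identical calls collide deterministically.
        key = hashlib.sha256(f"{tool}:{args}".encode()).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        n = self.counts[key]
        if n >= self.KILL_AT:
            return "kill"
        if n >= self.WARN_AT:
            return "warn"
        return "ok"
```

Wire `check()` into the tool-dispatch step of the agent loop and a runaway retry storm dies after five iterations instead of five hundred.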



Want the full teardowns?

Each project gets its own deep-dive with architecture diagrams, code references, and security analysis. All open source.

github.com/NeuZhou/awesome-ai-anatomy - 11 teardowns and counting. Star it if you want updates when new agents get dissected.

Currently working on: Cursor, Aider, and OpenHands.
