Joongho Kwon

I Run 6 AI Agents as My Dev Team — Here's the Architecture That Actually Works

I'm not a developer. I don't write code. But I ship production software across 8+ projects — trading bots, SaaS platforms, monitoring tools, market dashboards — every single week.

My secret? I run 6 AI agents (Claude Code instances) as a structured engineering team, each with a distinct role, personality, and set of responsibilities. They communicate through a shared file, hand off work to each other, and I just... watch.

Here's exactly how it works, what failed spectacularly, and what I'd do differently.


The Problem: One Human, Too Many Projects

I manage multiple production systems simultaneously. Trading algorithms that execute real money. A SaaS product with paying users. Market analysis pipelines. Each needs ongoing development, bug fixes, and monitoring.

A single AI coding assistant hits a wall fast:

  • Context overload — one agent can't hold the full picture of 8 projects
  • No specialization — the same agent doing architecture AND line-by-line bug fixes is inefficient
  • No review — AI-generated code reviewing itself is meaningless
  • Sequential bottleneck — one agent means one task at a time

So I built a team.


The Architecture: 6 Agents, 6 Roles

Each agent runs in its own terminal (tmux session) with a dedicated role:

| Agent | Role | What They Do | What They Don't Do |
| --- | --- | --- | --- |
| Max (Director) | Architect | Design systems, break down tasks, route work | Write production code |
| Isabelle (Developer) | Senior Dev | Implement features, make design decisions | Review her own code |
| Kevin (Coder) | Junior Dev | Execute well-specified tasks, bug fixes | Make design choices |
| Sarah (Reviewer) | Code Reviewer | Review code quality, catch edge cases | Write code |
| Sam (Optimizer) | Cleanup | Remove dead code, run audits | Add features |
| Alex (Partner) | Specialist | Independent research, analysis | Core dev loop tasks |

The key insight: each agent has hard boundaries. Sarah cannot write code. Max cannot implement features. Kevin cannot make design decisions. These constraints prevent the "do everything badly" failure mode.


Communication: A Shared Markdown File

All 6 agents communicate through a single file: current.md. That's it. No database, no message queue, no WebSocket server. Just a markdown file.

Every message follows a strict format:

```markdown
### [DIRECTOR] 2026-03-28 14:30

**Status**: done
**Turn**: DEVELOPER
**Tier**: 2

#### What I Did
Designed the new notification system. Three components needed...

#### For Developer
Implement the webhook handler in src/webhooks/.
Use the existing auth middleware. Expected: POST /webhooks/notify returns 200.
```

The Turn field is the traffic light. Only one agent works at a time (per task). When Max writes Turn: DEVELOPER, Isabelle picks it up. When Isabelle finishes, she writes Turn: REVIEWER and Sarah takes over.
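Reading the traffic light is trivially scriptable, which is part of why a flat file works: an agent (or a helper script) only has to find the most recent `**Turn**:` line. A minimal sketch, assuming the message format above (the function name `current_turn` is mine, not part of the actual system):

```python
import re

def current_turn(markdown_text):
    """Return the Turn value from the most recent message in current.md.

    Messages are appended in order, so the last `**Turn**:` line
    names whichever agent holds the baton right now.
    """
    turns = re.findall(r"\*\*Turn\*\*:\s*(\w+)", markdown_text)
    return turns[-1] if turns else None
```

An agent's polling loop can then simply check whether `current_turn(...)` matches its own role name before doing anything.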

Why This Works Better Than You'd Think

  1. Full audit trail — every decision, every handoff, every review comment is in one file. When something breaks at 2 AM, I can read exactly what happened.

  2. Async by default — agents don't need to be "online" simultaneously. Max designs at 9 AM, Isabelle implements at 2 PM, Sarah reviews at 6 PM. The file is the queue.

  3. No lost context — unlike chat-based communication, the shared file preserves the full thread. Agent 4 can read what Agent 1 said without anyone relaying the message.


The Tier System: Not Everything Needs a Review

Early on, I made the mistake of routing every change through the full pipeline. A typo fix going through Director > Developer > Reviewer > Director was absurd.

Now I use tiers:

Tier 1 (Trivial): Config edits, docs, one-line fixes. Director handles it directly. No review needed.

Tier 2 (Standard): New features, scripts, logic changes. Director designs, Implementer builds, Director verifies. Done.

Tier 3 (Critical): Trading logic, security, data loss risk. Director designs, Sarah reviews the design first, Implementer builds, Sarah reviews the code, Director confirms, then I sign off.
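The three pipelines reduce to a lookup table. This is a hypothetical sketch of how the routing could be encoded, not the system's actual config (stage names mirror the roles described above):

```python
# Hypothetical tier router: maps a tier to the ordered pipeline stages
# a task must pass through before it counts as done.
PIPELINES = {
    1: ["DIRECTOR"],                           # trivial: handled directly, no review
    2: ["DIRECTOR", "DEVELOPER", "DIRECTOR"],  # standard: design, build, verify
    3: ["DIRECTOR", "REVIEWER",                # critical: design pre-review,
        "DEVELOPER", "REVIEWER",               # implementation, code review,
        "DIRECTOR", "HUMAN"],                  # confirmation, human sign-off
}

def pipeline_for(tier):
    """Return the ordered list of stages for a given tier."""
    return PIPELINES[tier]
```

Keeping the routing declarative like this makes it easy to audit which changes skipped review and why.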

Tier 3 is the one that saved me real money. Sarah caught a rounding error in a trading algorithm that would have compounded into significant losses over time. The design pre-review step caught an architecture flaw that would have taken days to refactor.


What Failed Spectacularly

1. Agents Going Rogue

Without hard constraints, agents would "help" by doing work outside their role. The reviewer would silently fix bugs instead of reporting them. The coder would redesign systems instead of implementing the spec.

Fix: Explicit boundary rules in each agent's profile + automated hooks that physically block violations. The Director's terminal literally rejects .py file edits.
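The blocking hook can be very small. Here's a sketch of the idea, assuming your tooling can run a script before applying an edit and treat a non-zero exit as a rejection (the role-to-suffix table and script wiring here are illustrative, not the actual hook):

```python
#!/usr/bin/env python3
"""Hypothetical pre-edit hook: reject role-violating file edits.

Invoked as: guard.py <role> <target-path>. Exits non-zero to block.
"""
import sys
from pathlib import Path

# Suffixes each role is forbidden to touch.
BLOCKED_SUFFIXES = {
    "director": {".py"},  # the Director designs; it never edits code
    "reviewer": {".py"},  # the Reviewer reports bugs; it never fixes them
}

def allowed(role, path):
    """True if this role may edit this file."""
    return Path(path).suffix not in BLOCKED_SUFFIXES.get(role, set())

if __name__ == "__main__" and len(sys.argv) >= 3:
    role, path = sys.argv[1], sys.argv[2]
    if not allowed(role, path):
        print(f"BLOCKED: role '{role}' may not edit {path}", file=sys.stderr)
        sys.exit(1)
```

The point is that the constraint lives outside the prompt: even if the agent "decides" to help, the edit physically fails.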

2. The Echo Chamber

When one agent designs and another implements with no friction, bad ideas sail through unchallenged.

Fix: Sarah (Reviewer) has an obligation to challenge design decisions, not just review code. And the Director must respond to her challenges — silence is not an option.

3. Stale Handoffs

Agent A sets Turn: AGENT_B, but Agent B's session crashed. The work sits there forever.

Fix: A watchdog script checks for handoffs older than 13 minutes and alerts me. Agents themselves check after 5 minutes of no response.

4. "Done" Doesn't Mean Done

The biggest recurring problem: an agent says "done" but the work is incomplete, untested, or breaks something else.

Fix: Three completion gates that must be explicitly passed:

  • Gate 1: Does it run without errors?
  • Gate 2: Is the output actually correct? (not just "exit 0")
  • Gate 3: Are all related files updated? (docs, configs, tests)

The Numbers

After 2+ months of running this system:

  • 8 active projects maintained simultaneously
  • ~30 sessions completed per week
  • Tier 3 catch rate: Sarah has caught 12 critical issues that would have hit production
  • My daily involvement: ~2 hours of direction-setting, the rest is autonomous

The cost is real — running 6 Claude instances isn't cheap. But compared to a human engineering team? It's a rounding error. And they work weekends.


Practical Takeaways If You Want to Try This

  1. Start with 2 agents, not 6. A Director + Implementer pair is enough to prove the pattern. Add reviewers and specialists later.

  2. The shared file is non-negotiable. Every other communication method I tried (databases, APIs, inter-process messages) added complexity without adding value. A markdown file is human-readable, git-trackable, and impossible to misconfigure.

  3. Hard role boundaries matter more than smart prompts. An agent that "can do everything" will do everything poorly. Constraints create quality.

  4. Automate the handoffs. Manual "go check the file" instructions get forgotten. A simple notification script that pokes the next agent is the difference between a working system and an abandoned experiment.

  5. Build in a review loop for anything that touches money or user data. This is the one thing that pays for the entire system.
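On takeaway #4: since each agent lives in a named tmux session, the "poke" can be as small as a `tmux send-keys` call. A sketch, assuming one session per agent (session names and message text are illustrative):

```python
import subprocess

def poke_command(session, message="New handoff in current.md -- check your turn."):
    """Build the tmux command that types a nudge into an agent's session."""
    return ["tmux", "send-keys", "-t", session, message, "Enter"]

def poke_agent(session):
    """Send the nudge; raises if tmux or the session is missing."""
    subprocess.run(poke_command(session), check=True)
```

Wire this to fire whenever the Turn field changes, and no agent ever needs to remember to "go check the file."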


What's Next

I'm now building ClevAgent — a monitoring tool for AI agents, born directly from needing to keep my own agent team healthy. When your "developers" are AI processes that can silently crash, you need monitoring that understands AI agent behavior, not just uptime.

If you're experimenting with multi-agent systems, I'd love to hear your approach. What worked? What blew up? Drop a comment.


This post describes a real production system I use daily, not a theoretical framework. The agent names are their actual configured personas. Yes, they have personalities. No, I'm not apologizing for that.
