DEV Community

Joongho Kwon

Posted on • Edited on

I Run 6 AI Agents as My Dev Team — Here's the Architecture That Actually Works

I'm not a developer. I don't write code. But I ship production software across 8+ projects — trading bots, SaaS platforms, monitoring tools, market dashboards — every single week.

My secret? I run 6 AI agents (Claude Code instances) as a structured engineering team, each with a distinct role, personality, and set of responsibilities. They communicate through a shared file, hand off work to each other, and I just... watch.

Here's exactly how it works, what failed spectacularly, and what I'd do differently.


The Problem: One Human, Too Many Projects

I manage multiple production systems simultaneously. Trading algorithms that execute real money. A SaaS product with paying users. Market analysis pipelines. Each needs ongoing development, bug fixes, and monitoring.

A single AI coding assistant hits a wall fast:

  • Context overload — one agent can't hold the full picture of 8 projects
  • No specialization — the same agent doing architecture AND line-by-line bug fixes is inefficient
  • No review — AI-generated code reviewing itself is meaningless
  • Sequential bottleneck — one agent means one task at a time

So I built a team.


The Architecture: 6 Agents, 6 Roles

Each agent runs in its own terminal (tmux session) with a dedicated role:

| Agent | Role | What They Do | What They Don't Do |
|---|---|---|---|
| Max (Director) | Architect | Design systems, break down tasks, route work | Write production code |
| Isabelle (Developer) | Senior Dev | Implement features, make design decisions | Review her own code |
| Kevin (Coder) | Junior Dev | Execute well-specified tasks, bug fixes | Make design choices |
| Sarah (Reviewer) | Code Reviewer | Review code quality, catch edge cases | Write code |
| Sam (Optimizer) | Cleanup | Remove dead code, run audits | Add features |
| Alex (Partner) | Specialist | Independent research, analysis | Core dev loop tasks |

The key insight: each agent has hard boundaries. Sarah cannot write code. Max cannot implement features. Kevin cannot make design decisions. These constraints prevent the "do everything badly" failure mode.


Communication: A Shared Markdown File

All 6 agents communicate through a single file: current.md. That's it. No database, no message queue, no WebSocket server. Just a markdown file.

Every message follows a strict format:

```markdown
### [DIRECTOR] 2026-03-28 14:30

**Status**: done
**Turn**: DEVELOPER
**Tier**: 2

#### What I Did
Designed the new notification system. Three components needed...

#### For Developer
Implement the webhook handler in src/webhooks/.
Use the existing auth middleware. Expected: POST /webhooks/notify returns 200.
```

The Turn field is the traffic light. Only one agent works at a time (per task). When Max writes Turn: DEVELOPER, Isabelle picks it up. When Isabelle finishes, she writes Turn: REVIEWER and Sarah takes over.

Why This Works Better Than You'd Think

  1. Full audit trail — every decision, every handoff, every review comment is in one file. When something breaks at 2 AM, I can read exactly what happened.

  2. Async by default — agents don't need to be "online" simultaneously. Max designs at 9 AM, Isabelle implements at 2 PM, Sarah reviews at 6 PM. The file is the queue.

  3. No lost context — unlike chat-based communication, the shared file preserves the full thread. Agent 4 can read what Agent 1 said without anyone relaying the message.


The Tier System: Not Everything Needs a Review

Early on, I made the mistake of routing every change through the full pipeline. A typo fix going through Director > Developer > Reviewer > Director was absurd.

Now I use tiers:

Tier 1 (Trivial): Config edits, docs, one-line fixes. Director handles it directly. No review needed.

Tier 2 (Standard): New features, scripts, logic changes. Director designs, Implementer builds, Director verifies. Done.

Tier 3 (Critical): Trading logic, security, data loss risk. Director designs, Sarah reviews the design first, Implementer builds, Sarah reviews the code, Director confirms, then I sign off.

Tier 3 is the one that saved me real money. Sarah caught a rounding error in a trading algorithm that would have compounded into significant losses over time. The design pre-review step caught an architecture flaw that would have taken days to refactor.


What Failed Spectacularly

1. Agents Going Rogue

Without hard constraints, agents would "help" by doing work outside their role. The reviewer would silently fix bugs instead of reporting them. The coder would redesign systems instead of implementing the spec.

Fix: Explicit boundary rules in each agent's profile + automated hooks that physically block violations. The Director's terminal literally rejects .py file edits.
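
A structural block could be as simple as a pre-edit check the agent runner consults before writing any file. The rule table below is invented for illustration; the post only confirms that the Director's terminal rejects `.py` edits:

```python
# Hypothetical role-boundary hook: deny file writes that fall outside a role.
from fnmatch import fnmatch

FORBIDDEN = {
    "DIRECTOR": ["*.py"],          # architect never edits production code
    "REVIEWER": ["*.py", "*.js"],  # reviewer reports bugs, never fixes them
}

def may_edit(role: str, path: str) -> bool:
    """Return False if any forbidden pattern for this role matches the path."""
    return not any(fnmatch(path, pattern) for pattern in FORBIDDEN.get(role, []))

print(may_edit("DIRECTOR", "bot.py"))  # False: physically blocked
```

The point is that the constraint lives in code, not in the prompt, so a "helpful" agent can't talk itself out of it.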

2. The Echo Chamber

When one agent designs and another implements with no friction, bad ideas sail through unchallenged.

Fix: Sarah (Reviewer) has an obligation to challenge design decisions, not just review code. And the Director must respond to her challenges — silence is not an option.

3. Stale Handoffs

Agent A sets Turn: AGENT_B, but Agent B's session crashed. The work sits there forever.

Fix: A watchdog script checks for handoffs older than 13 minutes and alerts me. Agents themselves check after 5 minutes of no response.
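
A staleness check like this only needs the timestamp from the latest message header. This is a minimal sketch assuming the `YYYY-MM-DD HH:MM` format shown earlier; the author's actual watchdog is not published:

```python
# Minimal watchdog sketch: flag a handoff nobody has answered within the timeout.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=13)

def is_stale(last_handoff: str, now: str) -> bool:
    """True if the most recent handoff timestamp is older than the timeout."""
    fmt = "%Y-%m-%d %H:%M"
    age = datetime.strptime(now, fmt) - datetime.strptime(last_handoff, fmt)
    return age > STALE_AFTER

print(is_stale("2026-03-28 14:30", "2026-03-28 14:50"))  # True: 20 min > 13
```

Run it from cron every few minutes and pipe any `True` into whatever alerting you already have; the value is the alert, not the script.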

4. "Done" Doesn't Mean Done

The biggest recurring problem: an agent says "done" but the work is incomplete, untested, or breaks something else.

Fix: Three completion gates that must be explicitly passed:

  • Gate 1: Does it run without errors?
  • Gate 2: Is the output actually correct? (not just "exit 0")
  • Gate 3: Are all related files updated? (docs, configs, tests)
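
One way to make the gates explicit is to force the agent to fill in a checklist object before it may write `Status: done`. The field names here are my assumption, mapped one-to-one onto the three gates:

```python
# Sketch: "done" is only done when all three gates are explicitly passed.
from dataclasses import dataclass

@dataclass
class CompletionGates:
    runs_without_errors: bool    # Gate 1: it executes
    output_verified: bool        # Gate 2: output checked, not just exit 0
    related_files_updated: bool  # Gate 3: docs, configs, tests touched

    def passed(self) -> bool:
        return all((self.runs_without_errors,
                    self.output_verified,
                    self.related_files_updated))

print(CompletionGates(True, True, False).passed())  # False: Gate 3 failed
```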

The Numbers

After 2+ months of running this system:

  • 8 active projects maintained simultaneously
  • ~30 sessions completed per week
  • Tier 3 catch rate: Sarah has caught 12 critical issues that would have hit production
  • My daily involvement: ~2 hours of direction-setting, the rest is autonomous

The cost is real — running 6 Claude instances isn't cheap. But compared to a human engineering team? It's a rounding error. And they work weekends.


Practical Takeaways If You Want to Try This

  1. Start with 2 agents, not 6. A Director + Implementer pair is enough to prove the pattern. Add reviewers and specialists later.

  2. The shared file is non-negotiable. Every other communication method I tried (databases, APIs, inter-process messages) added complexity without adding value. A markdown file is human-readable, git-trackable, and impossible to misconfigure.

  3. Hard role boundaries matter more than smart prompts. An agent that "can do everything" will do everything poorly. Constraints create quality.

  4. Automate the handoffs. Manual "go check the file" instructions get forgotten. A simple notification script that pokes the next agent is the difference between a working system and an abandoned experiment.

  5. Build in a review loop for anything that touches money or user data. This is the one thing that pays for the entire system.
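
Since each agent lives in its own tmux session, the "poke" from takeaway 4 can be a one-liner that types a reminder into the next agent's pane. The session names below are invented for illustration:

```python
# Hypothetical handoff poke for agents running in named tmux sessions.
SESSIONS = {"DEVELOPER": "agent-isabelle", "REVIEWER": "agent-sarah"}

def poke_command(next_role: str) -> list[str]:
    """Build the tmux command that types a reminder into the next agent's pane."""
    return ["tmux", "send-keys", "-t", SESSIONS[next_role],
            "Check current.md: it's your turn.", "Enter"]

# To actually send it: subprocess.run(poke_command("DEVELOPER"), check=True)
```

Trigger it from whatever writes the `Turn:` line, and the handoff never depends on an agent remembering to poll.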


What's Next

I'm now building ClevAgent — a monitoring tool for AI agents, born directly from needing to keep my own agent team healthy. When your "developers" are AI processes that can silently crash, you need monitoring that understands AI agent behavior, not just uptime.

If you're experimenting with multi-agent systems, I'd love to hear your approach. What worked? What blew up? Drop a comment.


This post describes a real production system I use daily, not a theoretical framework. The agent names are their actual configured personas. Yes, they have personalities. No, I'm not apologizing for that.

Top comments (8)

Henry Godnick

You nailed the cost observation — "running 6 Claude instances isn't cheap." Running that many sessions simultaneously, the token burn rate must be intense, especially with the back-and-forth handoff pattern generating a lot of context.

Curious if you've found a good way to monitor the actual token consumption per agent/session? I've been using a macOS menu bar tool called TokenBar (tokenbar.site) that shows real-time token usage across providers. Helped me realize which of my agents were burning the most tokens — turns out it was always the reviewer doing deep context reads, not the coder actually generating code.

For your setup with 6 agents, having per-session visibility seems like it would tie nicely into your tier system — you could set token budgets for Tier 1 vs Tier 3 tasks.

arun rajkumar

The Tier 3 catch rate is the buried lede here. Sarah catching 12 critical issues is the whole argument for structured multi-agent review over a single "do everything" agent.

We've been running a similar separation at Atoa for our payments infra — distinct agents for design review vs. code generation vs. test validation. The echo chamber failure you described hit us early: agents validating each other's assumptions because they shared the same training priors. Hard role boundaries fixed more of that than any prompt engineering did.

The watchdog for stale handoffs is the kind of operational detail that only shows up after you've been burned by it. Most writeups skip this entirely.

Kalpaka

The role boundaries are the most important part here. When you give an agent hard constraints (Sarah cannot write code), you're solving something organizations get wrong with humans too: confusing capability with role.

The rogue agent problem is telling. A reviewer silently fixing bugs is what happens on human teams with fuzzy boundaries. Making the constraint structural (terminal-level blocking) rather than cultural (a prompt saying "please don't") is the right instinct.

Worth watching: if Sarah and Max are both Claude instances, the mandatory challenge may converge on similar blind spots. In human teams, diverse reviewers matter not for more eyes but for eyes that see differently.

Apex Stack

The tier system resonates a lot. I run a similar multi-agent setup for a financial data platform — about 10 scheduled agents handling everything from daily ETL jobs to SEO auditing across 85K+ pages in 12 languages. The biggest lesson I learned mirrors your "Done doesn't mean done" problem.

My equivalent of your three completion gates was adding a dedicated product agent that checks the live site against search console data every morning. It catches things the build agents miss — like pages that deploy fine but get rejected by Google because the content is too thin or the meta descriptions are under 100 characters.

Curious about your shared markdown file approach. Do you ever hit race conditions when two agents try to write to current.md at the same time? I moved to per-agent log files feeding into a weekly review agent specifically to avoid that.

Botánica Andina

The breakdown of 'Context overload' and 'No specialization' for a single agent really hits home. It's like trying to build a complex organism with just one type of cell – having specialized AI agents, much like distinct organs in a body, clearly solves that bottleneck!

Andre Cytryn

the stale handoff watchdog is the piece most people skip when they write about multi-agent systems. a 13-minute timeout with an alert is exactly the kind of operational detail that separates a proof-of-concept from something you can actually run overnight. curious how you handle the case where an agent silently produces garbage output vs crashing entirely — the watchdog catches the crash, but the "done but wrong" case seems harder to detect automatically.

klement Gunndu

The 13-minute watchdog for stale handoffs is a detail most multi-agent writeups skip entirely. Hard role boundaries over smart prompts is a lesson that took us a while to learn too.
