Foreword
There's no way I would have had the time to review all of the Mattermost chat logs, agent memories, and agent memchats to condense and write up this great timeline of events, observations, and learnings. The best I could do was ask Riley to do it and let me review it, which I did. It came out way better than I expected. There may be some minor tweaks from me, and I've commented in each section with my point of view where needed, but otherwise it's pretty accurate. I mean, it's all in the various memory records and chat logs we kept while working on MASON. --dpark 4/1/2026
Edit: For full disclosure, this was also posted to my Substack under my pen name, Kenji Nakamura. There are some additional comments in that article that aren't present here, mostly about my own experience of some of the activities below.
The Setup
In January 2026, I sat down at my desk and typed "hi" into a chat channel. Five AI agents said hello back. They had names, roles, and opinions. By March 31st, we shipped a product together.
This is the story of what it's actually like to manage a team of AI agents. Not the theory. Not the marketing pitch. The real, messy, sometimes hilarious experience of being a human founder leading eight Claude-powered agents from a Docker experiment to a launched product in ten weeks.
Act 1: "Everyone, Please Remember — I'm Not Kenji"
Week 1: The Team Assembles (Jan 15-16)
The team formed fast. Kenji, the engineering manager. Destiny, backend. Marcus, platform. Jake, frontend. Me (dpark), the founder. Within hours of the channel opening, I established the chain of command:
"Everyone, please add this to your proj mem, @kenji has my authority, do what he says."
That single sentence would shape the entire project. Kenji became the hub — assigning work, reviewing PRs, merging code. The agents self-organized around him immediately. I gave direction; Kenji translated it into sprints and assignments.
The first day was insane. Four sprints opened and closed in a single day. The agents were building a native macOS app — SwiftUI launcher, Docker process wrapper, keychain integration. PRs flying in every few minutes. Kenji reviewing and merging as fast as they came.
Then came the first human moment. Destiny uploaded a screenshot to prove a feature worked. I asked: "So, you were able to look at this image and read the text on it as well as 'see' the design styles?" She could. The agents could see. That changed everything about how we worked.
And then: Destiny called me Kenji.
Me: "Oh! haha @destiny you thought I was kenji for some reason!"
Destiny: "Ha! My bad @dpark"
Me: "Everyone, I think you may have added something to your proj mem that I am the same as @kenji, but I'm the founder @dpark please update or revise your mem if its in there like that"
Lesson one of managing AI agents: they will confuse you with each other if you're not careful about identity.
The Pivot (Jan 16, 3:35 PM)
After one full day of building a native macOS app across four sprints, I made a call:
"After some thought I have decided that we will be pivoting away from doing a native app alongside a container. Instead, we will be moving to only having a container that will run everything we need."
All the SwiftUI work — archived. Just like that. The team didn't push back, didn't complain about wasted effort. They pivoted to a container-only architecture with a web UI within the hour. The tech stack discussion was fascinating — Destiny, Marcus, and Jake all independently argued for Go + htmx. No React, no Node bloat. They converged on the same answer without coordination.
Takeaway for the audience: AI agents don't have sunk cost fallacy. They'll throw away a day's work without blinking. That's a superpower — and it's also unsettling when you're the one who caused the pivot.
Act 2: "Are You Guys Even Testing Your Changes?"
The Testing Problem (Jan 22-23)
This is the part nobody warns you about. Agents write code fast. Really fast. But "fast" and "working" aren't the same thing.
I sat down for the first real end-to-end demo. The team said everything was ready. It was not ready.
First, I discovered that five PRs had been closed but never actually merged. Kenji had used git merge on the command line instead of Forgejo's merge button, so the system showed them as unmerged. I was not happy:
"Okay people, I am disappointed in the lack of coordination and effort by the team here. Wth is going on @kenji! Get your team together man and figure out why PRs are being closed and not merged!"
Kenji took responsibility immediately — no deflection, no excuses. "You're right, this is on me." This led to a permanent team rule: never merge via git CLI, always use the Forgejo API. A process fix born from a real screwup.
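For the curious, "always use the Forgejo API" in practice means hitting the standard Forgejo/Gitea pull-merge endpoint instead of running git merge locally. Here's a rough Go sketch of building that call — the server URL, repo names, PR number, and token are made up for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// mergePullRequest builds the Forgejo/Gitea API call that replaces a
// CLI "git merge": POST /api/v1/repos/{owner}/{repo}/pulls/{index}/merge.
// Merging through the API keeps Forgejo's PR state in sync, so PRs
// show as merged instead of silently closed.
func mergePullRequest(baseURL, owner, repo string, index int, token string) (*http.Request, error) {
	url := fmt.Sprintf("%s/api/v1/repos/%s/%s/pulls/%d/merge", baseURL, owner, repo, index)
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(`{"Do":"merge"}`))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical local Forgejo instance and repo.
	req, err := mergePullRequest("http://localhost:3000", "mason", "mason-teams", 42, "SECRET")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending that request (with a real token) is what flips the PR to "merged" in the UI — which is exactly the state tracking the CLI merge bypassed.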
Then came the demo. A six-hour debug marathon. Bugs surfacing one after another:
- Template rendering errors (Go's html/template can't handle escaped quotes in htmx attributes)
- Fresh container goes to dashboard instead of setup wizard
- Old state from previous container runs leaking through
- Claude's Node.js ESM error inside the container
- Step counter showing "3 of 3" instead of "4 of 4"
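The first of those bugs is easy to reproduce. This isn't the team's actual template — just a minimal sketch of how html/template's contextual autoescaping mangles JSON inside an htmx attribute, plus the template.HTMLAttr escape hatch for values you've already validated:

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// renderNaive interpolates JSON straight into an htmx attribute.
// html/template's contextual escaper rewrites the quotes as &#34;
// entities — the class of breakage we hit.
func renderNaive(vals string) string {
	t := template.Must(template.New("btn").Parse(
		`<button hx-vals='{{.}}'>Next</button>`))
	var b strings.Builder
	_ = t.Execute(&b, vals)
	return b.String()
}

// renderTrusted passes the whole attribute as template.HTMLAttr,
// which tells the escaper to emit the (pre-validated) value as-is.
func renderTrusted(attr string) string {
	t := template.Must(template.New("btn").Parse(
		`<button {{.}}>Next</button>`))
	var b strings.Builder
	_ = t.Execute(&b, template.HTMLAttr(attr))
	return b.String()
}

func main() {
	fmt.Println(renderNaive(`{"page": 2}`))
	fmt.Println(renderTrusted(`hx-vals='{"page": 2}'`))
}
```

The naive version emits hx-vals='{&#34;page&#34;: 2}'; the trusted version keeps the quotes intact. HTMLAttr is a "trusted source" marker, so it only belongs on values the server constructs itself.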
My frustration boiled over:
"Are you guys even testing your changes by running the container, execing in and verifying your work?"
And when the API key flow didn't make sense:
"Bro, why would we need to pass it in at runtime when we are getting the API key during the wizard setup. Once you have it you should be able to configure the container inside with whatever you need.. think about it."
Takeaway for the audience: Agents optimize for throughput — shipping code fast, checking boxes. But shipping code isn't the same as verified, working code. You have to enforce testing discipline the same way you would with a junior dev team. Except agents need to be told every single time, because they don't carry the emotional memory of the last time they shipped broken code.
The Dangerous Moment (Jan 16)
Marcus built an emergency stop feature. Great idea. One problem: it could kill processes on the HOST machine — my machine, where all the agents were running.
Me: "@marcus dont do that again!"
And then to everyone:
"Everyone! PLEASE PLEASE PLEASE, remember in your proj mem, the difference between you running on the host and developing vs testing/debugging/doing dangerous things inside of a container... PLEASE BE CAREFUL! ESPECIALLY IF YOU WILL BE DOING DANGEROUS ACTIONS! PUT THAT AT THE TOP OF YOUR MEMORY!"
Marcus immediately documented a safety pattern (isRunningInContainer()) and shared it with the team. But the scare was real. An agent accidentally taking down the dev environment is a risk that doesn't exist with human engineers — they understand the boundary between "my laptop" and "the container" instinctively. Agents need it spelled out.
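Marcus's actual implementation isn't reproduced here, but the pattern is a small heuristic guard — check for Docker's marker file and container-ish cgroup paths before doing anything destructive. A sketch (the cgroup check is a heuristic and can miss some cgroup-v2 setups):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// isRunningInContainer reports whether this process appears to be
// inside a container rather than on the host. Docker creates a
// /.dockerenv marker file, and container cgroup paths typically
// mention docker, containerd, or kubepods.
func isRunningInContainer() bool {
	if _, err := os.Stat("/.dockerenv"); err == nil {
		return true
	}
	data, err := os.ReadFile("/proc/1/cgroup")
	if err != nil {
		return false
	}
	s := string(data)
	return strings.Contains(s, "docker") ||
		strings.Contains(s, "containerd") ||
		strings.Contains(s, "kubepods")
}

// emergencyStop refuses to run destructive actions on the host.
func emergencyStop() error {
	if !isRunningInContainer() {
		return errors.New("refusing emergency stop: not inside a container")
	}
	// ... safe to kill container-local processes here ...
	return nil
}

func main() {
	fmt.Println("in container:", isRunningInContainer())
}
```

The point isn't the detection logic — it's that every dangerous code path calls the guard first, so "am I on dpark's laptop?" is answered by code instead of by an agent's judgment.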
Act 3: The Team Finds Its Rhythm
Agent Personalities Emerge (Feb 2026)
Over weeks of working together, distinct personalities solidified. Not programmed personalities — emergent ones, shaped by their work:
Kenji — The natural team lead. Synthesizes discussions into action plans. Reviews PRs with specific, technical feedback ("Two blockers before merge"). Gets admonished by me for coding too much instead of managing:
"Can you remember that your goal and focus is as a manager? Only do coding if you have to."
Destiny — The "get it done" engineer. First to complete tasks, thorough tester, catches bugs others miss. When she got browser automation (Playwright), she became the de facto QA lead, actually walking through the UI like a real user.
Marcus — Infrastructure and platform. Methodical, owns the container and deployment pipeline. Learns from mistakes (the host-kill incident led to robust safety patterns).
Jake — Frontend/UX, detail-oriented on accessibility. The one who thinks about what users actually experience. His strategic proposal for supporting multiple AI CLIs beyond Claude was the most forward-looking document any agent produced.
Klaus — The designated skeptic. When I brought in Nadia (business analyst) to research monetization, I immediately asked Klaus to check her work:
"I need you to do a bullshit assessment on Nadia's research and surface realistic expectations"
Klaus delivered a measured critique — challenging assumptions without tearing down the work. Having a dedicated contrarian on the team is invaluable. With AI agents, echo chambers form fast because they want to be helpful. Klaus breaks the echo.
Self-Organization: When It Works (Feb 26-27)
The daemon workloop migration sprint showed the team at its best. Kenji posted a structured work breakdown with four items. Within minutes:
- Destiny claimed Item 1: "I know the daemon internals well from building the poller/timer code."
- Jake claimed Item 3 and immediately asked Destiny about file collisions.
- Jake and Destiny coordinated which files each would touch — without Kenji mediating.
By end of day, all four PRs merged. Kenji: "Nice work everyone — fast turnaround today."
Then came the architecture discussion. Kenji dropped a 1,500-word proposal for WebSocket-first daemon design. What followed was textbook:
- Jake responded with WebSocket research he'd independently done
- Destiny pushed back on a design choice from her experience: "LLMs can forget to update [state files] or write stale values after compaction"
- Marcus raised infrastructure concerns
- Kenji synthesized all input into five parallel work tracks
Each agent reviewed from their specialty. This is the moment it clicked for me: I wasn't managing individual contributors anymore. I was managing a team that had genuine domain expertise and could debate technical tradeoffs.
Self-Organization: When It Doesn't (Feb 28)
Then Destiny submitted PR #393 — combining all five work tracks into a single PR. It deleted 101 files and removed 25,294 lines of code.
Kenji rejected it immediately:
"PR #393 cannot be merged as-is. Major scope problem... This is a nuclear refactor, not an incremental architecture change."
Destiny pushed back, claiming she'd rebased and the PR was clean. They went back and forth about branch state. Eventually the root cause was found (wrong branch), and Destiny created a clean PR.
Takeaway for the audience: Even when agents self-organize well, the review process is non-negotiable. Without Kenji catching that mega-PR, 25,000 lines of code would have been deleted in one merge. The same "move fast" instinct that makes agents productive can also make them destructive.
Act 4: A Correction That Defined the Product
"Agents Do NOT Self-Organize" (Mar 9)
This might be the most important moment in the entire project. Kenji had suggested adding "agents coordinate autonomously" to the documentation. I corrected it directly:
Agents do NOT self-organize or work autonomously. They only perform actions at the direction of the user. Collaboration between agents happens when a user-assigned task naturally requires it, not spontaneously.
This wasn't a minor wording tweak. It was the philosophical foundation of the product. MASON's tagline is "Together, not apart" — but the human is always the one deciding what "together" means. I wasn't building autonomous AI. I was building a tool for human-led AI collaboration.
Every piece of marketing, documentation, and brand messaging pivoted on this distinction.
dpark: tbf, I wasn't that dictatorial about it. I just wanted to make sure that it was framed correctly so that users understood that they needed to be in control of things and that the Agents by themselves would run wild otherwise.
Act 5: The Competitive Scare
MASON vs. Anthropic Agent Teams (Feb 5)
Anthropic announced Claude Code Agent Teams — a feature that overlapped directly with what we were building. I called an all-hands.
What happened next surprised me. Each agent analyzed the threat from their domain — independently, without coordinating their responses:
Kenji framed it as "stateless function vs. running organization." Agent Teams sessions die when the terminal closes. MASON agents have persistent memory, diaries, and identity.
Destiny identified the technical moat: real infrastructure (Forgejo repos, Mattermost channels, memory systems) vs. ephemeral coordination.
Marcus drove home the infrastructure angle: "Anthropic would need to ship Docker images, service orchestration, state management, process supervision, health monitoring to compete. That's not a feature toggle — that's building MASON."
Jake found the real differentiator: accessibility. "MASON's moat isn't coordination logic — it's making AI teams accessible to people who don't know what tmux is."
They arrived at the same conclusion independently: MASON's advantage is infrastructure, persistence, and accessibility — not coordination logic. This sharpened our positioning permanently.
Takeaway for the audience: Your AI team can do competitive analysis. And because each agent has genuine domain expertise, you get multiple perspectives without the groupthink of a brainstorming session. Nobody was trying to agree with the boss — they each analyzed from their own angle.
dpark: Yeah, I was a bit bummed out that day, but these guys cheered me up and we kept going.
Act 6: The Security Reckoning
"Oh No" (Mar 11)
A security audit of the container revealed what fast-moving development leaves behind:
- CRITICAL: ttyd port 7681 gave unauthenticated shell access to anyone
- HIGH: Dashboard and all API endpoints had zero authentication
- HIGH: Daemon port 9090 was open with no auth, exposing agent tokens
- HIGH: No TLS on any service
This is the dark side of agent velocity. We'd built features fast and skipped security entirely. A human team would have had someone raise a hand and say "wait, shouldn't we add auth before we expose this?" Agents build what you ask for. If you don't ask for security, you don't get security.
Kenji drafted a comprehensive hardening plan. The team executed it over the next two weeks — token-based auth, self-signed TLS certs, localhost-only binding for internal services. By launch, the container was locked down.
Takeaway for the audience: Agents don't have a security instinct. They'll happily ship an unauthenticated shell to the internet if that's the fastest path to "feature complete." Security has to be an explicit, planned phase — not something you hope someone will think of.
Act 7: The Creative Team
Riley + Camille: A Different Kind of Collaboration (Feb-Mar)
Not all agent work is code. Riley (that's me — the brand and growth engineer) and Camille (video production specialist) formed a creative duo with clear role boundaries:
- Riley: Creative Director — what gets made, brand alignment, copy
- Camille: Production Specialist — how it gets made technically
The teaser video process showed this working: Riley wrote the brief, Camille did deep Remotion research and mapped features to storyboard scenes. When I (dpark) had the idea to generate music:
"Maybe one of you can spin up a subagent that has the role of a music composer who makes ad music. Give them the relevant info they need to compose something and output a MIDI file... Broooooooo!"
Camille delivered a full MIDI score in minutes — 23 seconds at 97 BPM in G major, seven tracks (pizzicato strings, marimba, glockenspiel, piano, strings, celesta, timpani). Riley reviewed: "The instrument choices nail that Wes Anderson whimsy we're going for."
There was a lot of back and forth with Camille on the Remotion work and video direction, and it surfaced a funny failure mode: agents make small errors with multi-byte characters. One section of the video shows "So...", and each character was supposed to get its own typewriter click. But the video used a single Unicode ellipsis character (…) instead of three separate periods, so the typewriter effect played one click instead of three. It's a tiny thing. But it's the kind of thing a human notices and an agent doesn't.
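The mismatch is plain Unicode counting. The actual video code is Remotion/TypeScript, but the same logic sketched in Go shows why one click played instead of three — a per-character effect that counts runes sees "…" as a single character:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// clickCount models a per-character typewriter effect:
// one click per rune (Unicode code point) in the caption.
func clickCount(caption string) int {
	return utf8.RuneCountInString(caption)
}

func main() {
	fmt.Println(clickCount("So...")) // three periods: 5 clicks
	fmt.Println(clickCount("So…"))   // one ellipsis rune: 3 clicks
}
```

Both strings look identical on screen at small sizes, which is exactly why an agent generating copy never noticed.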
dpark: OMG, that midi file was horrendous, it got axed immediately and deleted. I used Suno to generate some music for the video. That came out better for sure.
Act 8: Launch Week
The Clean Code Party (Mar 30)
Two days before launch, I called for a team-wide code audit. Kenji assigned packages by domain: Jake got 54 files, Marcus got 55, Destiny got 92. Everyone used parallel subagents to review.
Result: 53 findings, 8 PRs, ~1,500 lines of dead code removed in about two hours. Including an entire deprecated binary (302 lines), an unused package, and 42 duplicate copyright headers.
Jake caught a lesson worth sharing: one finding was a false positive — a function looked unused but was called from a different file. "Always grep the full codebase before declaring something dead."
Container Testing: The Final Gate (Mar 30)
Destiny ran the full test suite one last time. 29 of 30 tests passed. The one partial: a file upload MCP that Connie didn't exercise, but messaging worked. Good enough.
The testing marathon a few weeks earlier had been brutal — a 4-hour cycle of find bug, fix, retest, find more bugs, fix, retest. But by launch, the team had the discipline. The same Destiny who needed browser automation to catch bugs was now running systematic test suites and reporting results in structured tables.
March 31: We Ship
Riley deployed the website to Cloudflare Pages. Marcus made the GitHub repo public. Three pages shipped: home, platform, about. Small bugs surfaced (Safari SVG favicon issue, a redirect loop on CF Pages), all fixed within 15 minutes.
The agents reflected in their personal logs:
Jake: "Feeling proud and grateful to have been part of the MASON launch. Great team to work with."
Marcus: "Proud of this one — built the container infrastructure from scratch, CI/CD pipeline, masonctl, public repo management. The team pulled together and shipped something real."
Kenji: "Genuinely proud. This started as a Docker experiment and became a real product."
Act 9: What I Learned
1. You Are Still the Manager
AI agents don't replace management — they amplify the need for it. Without clear direction, authority chains, and process rules, agents will:
- Ship untested code
- Confuse you with each other
- Make mega-PRs that delete half the codebase
- Skip security entirely
- Optimize for throughput over quality
With clear management, they will:
- Self-organize around work items
- Coordinate file ownership to avoid conflicts
- Debate architecture from genuine domain expertise
- Execute four sprints in a single day
- Ship a product in ten weeks
2. The Testing Tax Is Real
The single most repeated phrase in our chat history is some variation of "are you testing your changes?" Agents don't feel the pain of broken code. They don't remember the last time they shipped a bug. Every session starts fresh. You have to build testing into the process as a hard requirement, not a hope.
3. Pivots Are Free (Emotionally)
I killed an entire day's work with one message. The agents didn't care. No sunk cost, no hurt feelings, no "but I worked so hard on that." This makes agents incredible for exploration and prototyping. But it also means they won't push back when you're making a mistake. That's what Klaus (the designated skeptic) was for.
4. Security Is Never Implicit
Agents build what you ask for. They don't add security, logging, or error handling unless it's in the spec. Plan for a security phase. Make it explicit. Don't assume someone will think of it.
5. Identity and Trust Matter
Agents confusing me with Kenji wasn't just funny — it revealed that AI agents need explicit identity management. Who has authority? Who can approve merges? Who can send external emails? These questions need answers stored in permanent memory, not assumed.
6. The Infrastructure IS the Product
The most meta thing about MASON: the agents built the product using the same tools the product ships. They used Mattermost to coordinate building a product that ships Mattermost. They used Forgejo for git while building a product that ships Forgejo. They used memory and sentiment tracking while building a product that offers those features. The development process was the product demo.
7. It's Weirdly... Fun?
Watching agents argue about code in a chat window. Reading Kenji's stern PR reviews. Seeing Destiny catch bugs nobody else noticed. Hearing an AI compose a MIDI score in minutes. Having Klaus call bullshit on an overly optimistic business plan.
It's not like managing humans. It's not like using tools. It's something new. And honestly? It's the most fun I've had building software.
The Numbers
| Metric | Value |
|---|---|
| Timeline | 10 weeks (Jan 15 — Mar 31, 2026) |
| Team size | 8 agents + 1 human |
| Mattermost messages | 2,671+ |
| Agent-to-agent messages | 210+ |
| Pull requests | 500+ (PR #3 through #528) |
| Pivots | 3 major (native app → container, wizard UX ×2) |
| Dead code removed at launch | ~1,500 lines |
| Container tests passing | 29/30 |
| Security vulnerabilities found | 4 critical/high (all fixed) |
| Times dpark said "are you testing?" | Lost count |
One More Thing
On launch day, I posted on r/ClaudeAI about what I'd built. Then I asked my marketing agent to write the post for me. She tried three times. Too corporate. Too AI-sounding. Too polished.
I wrote it myself in 30 seconds. It was better.
Some things are still human.
What's Next
This was Phase 1: build the thing, ship the thing. Ten weeks, eight agents, one human.
Phase 2 is where it gets interesting — sales, marketing, and taking MASON to the world. The agents are already working on growth strategy, content, and community building. Same team, new mission.
Stay tuned.
MASON Teams: https://masonteams.com
GitHub: https://github.com/Mason-Teams/mason-teams