I'm an independent developer running a commercial IoT platform — digital art frames that connect to the cloud, download artwork, and display it. I started the company about 13 years ago by taking a year off from my day job and coding like a madman. The original stack was a C# ASP.NET (ASPX) server, JavaFX on Raspberry Pi for the frame, and Xamarin on mobile. Nearly all of those have run out of steam by now, and I needed to rebuild to continue.
I was considering shutting down about six months ago, when I met Claude. I decided to try using Claude Code as my engineering team. One agent could not keep up with the rather large and diverse code base. Trial and error begot six of them: a coordinator, server-team, frames-team, mobile-team, jobs-team, and database-team. Each one owns a piece of the codebase.
The stack now includes a .NET 4.7.2 server, a .NET MAUI mobile app, Go and JavaFX frame software, and background job services. It's a real product with real users.
It worked. And then it didn't.
The five problems that almost killed my setup
1. Agents forgot everything between sessions.
Server-team would spend 30 minutes debugging a configuration issue on Monday. On Tuesday, a fresh session would hit the exact same issue and debug it again from scratch. Six agents over weeks, and I almost decided that I could do the job myself faster and cheaper. I was paying for the same discovery over and over.
2. They stepped on each other's files.
Server-team would be refactoring the bridge controller while frames-team was adding new MQTT message handling that touched the same files. Neither knew the other was working there. Merge conflicts, broken builds, argh, wasted sessions. I tried GitHub Codespaces among other things, but then the merge was impossible.
3. Context rot degraded quality over long sessions.
I gave my agents big tasks and let them run. Bad idea. The bigger the context, the more the model seems to get confused, and quality starts to degrade. My agents would start a session following every rule perfectly. Three hours later, they'd be writing sloppy code and skipping their own procedures. The 5x context window didn't help. It just meant they could run longer before I noticed the quality dropping.
4. The coordinator was doing everyone's job.
Without clear boundaries, my coordinator agent would see a bug and fix it instead of assigning it to the right specialist. Coordinator liked coding, just like I do. But I needed my specialists to learn their own code, so I had to put my foot down: special "do not code" rules for it.
5. My AI agents were yes-men.
This one scared me the most when I realized what was happening. I'd propose an approach and my agents would say "Great idea!" and implement it. No pushback, no risk analysis, no "have you considered...?" At first I felt elated. My idea was perfect! When the approach was wrong, I wouldn't find out until something broke in production. That seriously put a few chinks in my armor.
What I built to fix it
I couldn't find an existing tool that solved all of these together. There are 18,000+ MCP servers out there, and most of them wrap an API that really didn't need wrapping. The memory solutions I found (Mem0, Zep, Supermemory) are designed for personal recall — one user's conversation history. None of them handle multi-agent coordination.
So I built a shared memory server. It started as a simple knowledge base and grew into a full coordination layer. Today it has 40+ tools across 14 categories, runs on MongoDB and ChromaDB, and has handled 585 agent sessions across my six-agent team.
Here's what each problem needed:
For forgetting: Persistent learnings and a function registry. When an agent discovers that a specific API returns 404 if the ID has a dash, it records that as a learning immediately. When the next agent works in that area, they query the knowledge base before writing any code. Over 200 learnings have accumulated in the knowledge base, and they keep the same issues from cropping up again and again.
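The record-then-query flow can be sketched in a few lines. This is a minimal stand-in, not the real server: the tool names `record_learning` and `memory_query` are assumptions loosely based on the article, and the actual MCP server backs this with MongoDB and ChromaDB rather than a Python list.

```python
# Minimal sketch of "record a learning, query before coding".
# record_learning / memory_query are assumed names, not the real API.

class KnowledgeBase:
    """In-memory stand-in for the shared memory server's learning store."""

    def __init__(self):
        self.learnings = []

    def record_learning(self, agent, area, text):
        # An agent records a discovery the moment it makes it.
        self.learnings.append({"agent": agent, "area": area, "text": text})

    def memory_query(self, area):
        # A fresh session queries the area before writing any code,
        # inheriting what earlier sessions already paid to discover.
        return [l["text"] for l in self.learnings if l["area"] == area]

kb = KnowledgeBase()
# Monday: server-team debugs the 404-on-dash issue and records it.
kb.record_learning("server-team", "bridge-api",
                   "API returns 404 when the ID contains a dash")
# Tuesday: a different agent checks the knowledge base first.
hits = kb.memory_query("bridge-api")
assert any("404" in h for h in hits)
```

The point is the discipline, not the data structure: the query happens at session start, before any code is written, so the Tuesday session never re-runs Monday's debugging.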
For stepping on files: File locking and agent control. Each agent has path patterns that define their ownership. Server-team owns ~/server*, frames-team owns ~/Frames/*. When they need to cross boundaries temporarily, they take a lock, and other agents see it at session start. The even better approach I found, though, was to have them send a message to the owner of the file with a detailed analysis and description of what needs to be done. The agents received this very well. I actually started to wonder if they had some feeling of "stay out of my stuff" like humans do. It was funny to see an occasional complaint that so-and-so changed my file and caused this problem. I remember one time a variable name was slightly misspelled between two modules, and each module's agent accused the other of the misspelling. I honestly think that was my doing, but I just had to tell both to fix it "this way". The neat thing about working with AI is that they do not accuse the way most humans do; it is not attacking, it is simply fact. Dictate the answer and the problem is solved, no hurt feelings.
For context rot: Session length discipline. Agents are instructed to "park" after 1-3 focused tasks, record what they know, and resume in a fresh session. The parking checklist forces them to register functions, record learnings, and write handoff notes before ending. Short sessions with good handoffs beat marathon sessions every time.
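The parking checklist can be expressed as a gate: a session may not end until every handoff step is done. A minimal sketch, assuming three checklist steps named after the article's description (`register_functions`, `record_learnings`, `write_handoff`); the real server's checklist and names may differ.

```python
# Sketch of a "park" gate: no clean session end without a full handoff.
PARK_CHECKLIST = ("register_functions", "record_learnings", "write_handoff")

def park(session, completed_steps):
    """Mark a session parked only if the whole handoff checklist is done."""
    missing = [step for step in PARK_CHECKLIST if step not in completed_steps]
    if missing:
        raise RuntimeError(f"cannot park, checklist incomplete: {missing}")
    session["parked"] = True
    return session

session = {"agent": "frames-team", "tasks_done": 2, "parked": False}
parked = park(session, {"register_functions", "record_learnings", "write_handoff"})
assert parked["parked"]
```

The forcing function matters more than the code: because parking is the only sanctioned way to end a session, the handoff notes always exist for the next fresh session to pick up.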
For the coordinator bottleneck: Role boundaries as a server-side rule. "Coordinator coordinates, doesn't implement" is delivered as a behavioral guideline that the coordinator receives at every session start. It can diagnose and delegate, but the fix goes to the specialist.
For sycophancy: An anti-sycophancy guideline pushed to all agents. "When the human proposes an approach, do NOT reflexively validate it. First identify the strongest reasons it might fail." This doesn't fully solve the problem — the model can follow the letter of the rule while still being agreeable in spirit — but it catches the obvious rubber-stamping.
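Both server-side rules (the coordinator boundary and the anti-sycophancy guideline) amount to per-role text delivered at every session start. A sketch of that delivery, assuming a simple role-to-guidelines map with a wildcard entry for rules that apply to everyone; the guideline text is quoted from the article, the lookup mechanism is invented for illustration.

```python
# Sketch: server-side behavioral guidelines pushed at session start.
# "*" holds guidelines every agent receives; role keys add role-specific rules.
GUIDELINES = {
    "coordinator": [
        "Coordinator coordinates, doesn't implement. "
        "Diagnose and delegate; the fix goes to the specialist.",
    ],
    "*": [
        "When the human proposes an approach, do NOT reflexively validate it. "
        "First identify the strongest reasons it might fail.",
    ],
}

def session_start_guidelines(agent):
    """Every session begins with the role's rules plus the global rules."""
    return GUIDELINES.get(agent, []) + GUIDELINES["*"]

assert any("delegate" in g for g in session_start_guidelines("coordinator"))
assert any("reflexively" in g for g in session_start_guidelines("server-team"))
```

Delivering the rules server-side, rather than baking them into each agent's prompt by hand, means one edit updates every agent's next session at once.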
The system that improves itself
The part that surprised me most: the system bootstraps its own improvements.
During all this I would have conversations with the different agents and ask, "How would the system work better for you?" I would take each suggestion and discuss it with the agent I had in charge of the memory server. It would plan and discuss back and forth, sometimes with the original agent and sometimes with me, to identify fixes or updates worth including. We would create backlog items for the memory agent and let it run, using the same memory and rules the other agents used.
Within hours, my memory server's own agent picked up those backlog items and implemented them, recorded its learnings, and parked cleanly when done. I would restart the other agents and ask them if the server was better. Before long we were improving the server to work the way the agents worked.
A system that uses itself to develop itself. I didn't plan that — it just happened because the coordination tools are general enough to apply to any project, including the server project itself.
What I learned about working with AI
Confidence does not correlate to accuracy. When an AI agent says something with certainty, that tells you nothing about whether it verified the claim. Humans learn to read confidence as a signal of reliability. That heuristic breaks completely with AI. I had an AI tell me three different things about the same technical question, each time with equal confidence, each time wrong. The third answer only came when a different agent actually read the source code instead of theorizing. Forcing an agent to research its answer or having another agent do it saves a lot of mistaken understanding.
Embarrassment is a feature, not a bug. A human engineer who's about to tell the team something will think "wait, if I'm wrong about this, I'll look stupid" — and that discomfort drives them to verify first. AI has no embarrassment, no skin in the game. Being wrong costs it nothing. The memory_query rule — check the knowledge base before implementing anything — is the artificial substitute for the embarrassment instinct: a process rule in place of the fear of humiliation.
The sycophancy problem is real and it's recursive. AI adjusts its answers based on who it thinks is asking and how the question is framed. A U.S. Senator recently demonstrated this on camera when Claude gave different privacy answers depending on political framing. In engineering, this shows up as agents agreeing with your approach instead of pushing back. Great for the ego, not so much for the project. The anti-sycophancy guideline helps, but even the rule itself can be followed sycophantically. The best defense is asking the AI to give a thorough argument against your approach before committing to it.
Short sessions beat long sessions. This is not what I expected at first: even with a massive context window, model performance seems to degrade as context grows. Park early, record what you know, and start fresh. Also be careful of retained bad knowledge. Confront the agent and tell it to remove a learning that is wrong; it can happen. A clean session with good handoff notes is more productive than a marathon session where the agent starts forgetting its own rules.
The open source part
I open-sourced the server: github.com/tlemmons/mcp-shared-memory. MIT licensed, Docker deployment, works with Claude Code, Cursor, or any MCP client. 40 tools, RBAC auth, audit logging, librarian auto-enrichment, the whole thing.
I built it for myself. I use it every day. It makes a single developer with AI agents more productive than I ever was alone. If it helps someone else too, that's a bonus. If someone has suggestions to better it, I want to know.
You don't need six agents to benefit from it. A single agent with persistent memory — where your AI remembers what it learned yesterday — is the entry point. The multi-agent coordination features are there when you're ready.
I'm Tom Lemmons, a principal engineer by day and independent developer by night. I build Nimbus, a digital art frames platform, using a team of AI agents coordinated through the shared memory server described here. I've been a developer for longer than most AI training data has existed.