It started with a YouTube tab open. I was watching the Pragmatic Engineer interview with Kent Beck, and something he said resonated with me:
Test-driven development (TDD) is a “superpower” when working with AI agents. AI agents can (and do!) introduce regressions, and an easy way to catch those regressions is to have unit tests for the codebase.
I see this time and time again in my own use of AI coding agents. At times they are extremely helpful and get the task done as asked; other times they go off the rails and end up breaking everything. The only tool I know of that helps keep them on track is TDD, the "superpower" Kent Beck talked about in his interview.
Watching the interview and noodling on how to keep agents on track led me to create aiswarm, a tool designed to manage multiple AI agents, each with a specific persona and task, to automate software development. With GitHub Copilot as my pair programmer, I wanted to see if I could build a tool that would help me wrangle these agents, keep them focused, and maybe make my coding life a little easier.
LOSING THE THREAD: CONTEXT WINDOWS AND WANDERING AGENTS
If you’ve spent any time with AI agents, you know the pain: the longer the conversation fills the context window, the more likely it is that instructions from the start are slowly forgotten. I’ve seen it with every model I’ve tried, even Claude Sonnet 4, arguably the best model for coding right now. At first things are great: the agent writes a failing test, makes it pass, and I suggest refactorings or do them myself. Life is great. But as the coding session drags on, the agent starts to forget, and I have to constantly remind it to follow instructions. A common reminder is:
"Please remember to always follow TDD as instructed in the copilot_instructions.md file!"
It's maddening at times! I knew that if I had a tool that could provide guard rails for the AI agent, I might have better success getting it to stick with TDD. I also knew many people had started experimenting with the idea of AI agent swarms. Could I marry those two things together?
THE PYTHON MCP FIASCO
Fueled by too much nitro cold brew and the optimistic belief that I could get it done in a couple of hours, I started with an idea. My plan was to use GitHub Copilot to help me build an MCP tool that could call my CLI agent of choice, Gemini CLI, for code review. I had created a pre-commit git hook that called Gemini CLI to review code before each commit. The problem was that it was far too slow when the diff was large, and even small diffs took a while because a new Gemini CLI instance had to initialize on every commit. As a result, it added at least several seconds of lag each time I committed code.
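For anyone curious, here is a minimal sketch of that kind of hook, not my exact script. It assumes the Gemini CLI is on PATH and accepts a one-shot prompt via a -p/--prompt flag; check your installed version for the actual flags.

```python
#!/usr/bin/env python3
# Sketch of a pre-commit hook (e.g. saved as .git/hooks/pre-commit and made
# executable). The Gemini CLI invocation and its -p flag are assumptions;
# adjust to whatever your installed version actually supports.
import subprocess
import sys

# Grab the staged diff that is about to be committed.
diff = subprocess.run(
    ["git", "diff", "--cached"],
    capture_output=True, text=True, check=True,
).stdout

if not diff.strip():
    sys.exit(0)  # nothing staged, nothing to review

prompt = (
    "You are a code reviewer. Review the following diff and point out bugs, "
    "missing tests, or TDD violations:\n\n" + diff
)

# Every invocation spins up a fresh Gemini CLI process, which is exactly why
# this approach added several seconds of lag to each commit.
review = subprocess.run(["gemini", "-p", prompt], capture_output=True, text=True)
print(review.stdout)

# A stricter hook could parse the review and exit non-zero to block the commit.
sys.exit(0)
```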
I wanted a way to run the code review on demand instead of on every git commit. The first experiment: could I turn this into an MCP tool using Python? I thought, "This should be pretty easy to get working in a couple of hours... I could even have the tool call the agent to write a failing test, make the test pass, then refactor!" If I wrapped the CLI in an MCP tool, I could call all of these commands natively in my GitHub Copilot agent chat. This sounded like a great idea, and of course GitHub Copilot agreed as well.
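The shape of that MCP tool would have been roughly the sketch below. It uses the FastMCP helper as described in the MCP Python SDK's quickstart; treat the package layout and the Gemini CLI flag as assumptions rather than a record of the exact code Copilot and I wrote that day.

```python
# Rough sketch of the MCP tool idea: expose a "review_diff" tool that shells
# out to the Gemini CLI. Package layout follows the MCP Python SDK quickstart.
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gemini-review")


@mcp.tool()
def review_diff(diff: str) -> str:
    """Ask the Gemini CLI to review a diff and return its feedback."""
    prompt = "Review this diff for bugs and TDD violations:\n\n" + diff
    # This subprocess call is roughly where things fell apart for me: capturing
    # the child process's stdout/stderr from inside the MCP server never worked.
    result = subprocess.run(["gemini", "-p", prompt], capture_output=True, text=True)
    return result.stdout or result.stderr


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```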
The issues started to pile up. GitHub Copilot, knowing even less than I did about MCP tools and how they work, agreed with the plan and started to code. Up until I actually called the agent CLI, things were going great. Then it was time to integrate with the Gemini CLI, and we kept running into an issue: I could not redirect stdout/stderr. I had GitHub Copilot create a simple standalone CLI that called Gemini and redirected stdout/stderr, and that worked, so it was clear something in the MCP tool was blocking stdout/stderr. (In hindsight, an MCP server running over the stdio transport uses stdout for its own protocol messages, which likely did not help.) GitHub Copilot spent hours trying different things to fix the issue, but nothing worked.
SIMPLICITY WINS
By the end of the day, I was tired and frustrated. I was frustrated with my AI agent. I was frustrated that I seemed to have wasted an entire day trying to get something to work that had no hope of working in the first place. I was frustrated that, as usual, the AI just agreed with my ideas and went along with them. I posted my frustrations to LinkedIn that day:
My vibe coding experiment failed yesterday. With everyone experimenting with multiple AI agents, I thought I could vibe code some automation script in Python that would make my life easier spawning agents. At first, things were going fine; then the agent started to make assumptions on what I wanted, which didn't match mine. Then, it forgot to correctly update our local pip package while testing and got stuck in a loop trying to fix a non-existent bug. I ended up with a very simple Python script but wasted so much time trying to get the AI to fix all its mistakes. These are the times I know I have a strong future ahead of me as a software engineer. How can anyone say AI agents can operate with no human involvement or supervision? Communicating intent to other humans is hard; getting an LLM to "understand" us is even harder.
Michael Larson - LinkedIn
Before I went to bed that night, I wanted to try one more thing. What if, instead of trying to get the MCP tool working, I just launched a Gemini CLI session seeded with a starting prompt file? Within 15 minutes I had a working prototype in Python!
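That prototype really was as small as it sounds. Something along these lines, with the file name made up and the interactive-prompt flag treated as an assumption about the Gemini CLI rather than gospel:

```python
# The whole "working prototype" boiled down to something like this: read a
# persona prompt from a file and hand it to a freshly spawned Gemini CLI
# session. Check `gemini --help` for the flags your version actually takes.
import pathlib
import subprocess

prompt = pathlib.Path("personas/implementer.md").read_text()

# Launch an interactive Gemini CLI session seeded with the persona prompt.
subprocess.run(["gemini", "-i", prompt])
```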
CONTINUING THE PIVOT
The next day I was more motivated: I had a working idea and was no longer blocked by the MCP tooling issues that had plagued me the day before. I also decided to switch to C#, since it is far more familiar to me and I knew I could build in C# faster than I could in Python, so I deleted all of the code from the day before and started over.
This time the agent understood me, and within a couple of hours I had a working prototype. I was able to get an agent to implement a very small coding task, have that code reviewed, fix any errors found, and then move on to the next task. It was nice to see agents working in a more confined context instead of going off the rails again.
THE RESULTS
With this setup, the agents stuck to their persona prompts and instructions and, crucially, to TDD. Sometimes the code wasn’t perfect, but mistakes were easy to catch, either through my own review or my code review agent. My goal seemed accomplished: the agent stayed on task and stuck to TDD when writing new code!
THE NOT-SO-GREAT PARTS
Of course, not everything is perfect. The process is still quite manual. For example, while a planning agent can drop an instruction file for a task, the aiswarm tool currently only feeds in the persona context when an agent starts. This means I have to manually tell the agent, "Your instructions are in this file..." to get it working. It knows its role but has no idea what the immediate task is. A key improvement would be to pass in both the persona and the task instructions, allowing the agent to start working without user intervention. Another issue: since I do not start Gemini CLI in YOLO mode, it asks for confirmation on almost every change. Since the agents work in a git worktree, I would rather let them proceed and see what happens.
When an agent finishes its task in a separate git worktree, there’s no automated way to merge that work back into the main branch. Furthermore, there's no mechanism for a potential planning agent to know when other agents have completed their tasks and can be closed. Sometimes Copilot tries to take over tasks I’d rather delegate. I’m thinking about writing a copilot_instructions.md to clarify the workflow and keep everyone (and every agent) on track.
A CUSTOMIZABLE, AUTOMATED WORKFLOW
My vision for aiswarm is to create a fully automated yet highly customizable development workflow. The default process would look something like this: a planner agent breaks down a feature into tasks, then launches an implementor agent to complete each task using strict TDD. Once the implementor is finished, a code-reviewing agent automatically inspects the work for quality and adherence to standards. Finally, the system prompts me for a final look. Once I give the green light, the planner agent merges the code from the git worktree back into the main branch.
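To make that loop concrete, here is a purely illustrative sketch of the default workflow, written in Python for brevity even though aiswarm itself is C#; every function and persona name below is a hypothetical stand-in, not the tool's actual API.

```python
# Illustrative sketch of the envisioned planner -> implementer -> reviewer ->
# human approval -> merge loop. All names here are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    worktree: str  # path to the git worktree the task runs in


def plan(feature: str) -> list[Task]:
    # Stand-in for the planner agent breaking a feature into tasks.
    return [
        Task(description=f"{feature}: step {i}", worktree=f".worktrees/task-{i}")
        for i in range(1, 3)
    ]


def run_agent(persona: str, task: Task) -> str:
    # Stand-in for launching a Gemini CLI agent with a persona + task prompt.
    return f"[{persona}] finished: {task.description}"


def run_feature(feature: str) -> None:
    for task in plan(feature):
        print(run_agent("implementer (strict TDD)", task))
        print(run_agent("code reviewer", task))
        # Final human approval before the planner merges the worktree back.
        if input(f"Merge {task.worktree} into main? [y/N] ").lower() == "y":
            print(run_agent("planner (merge worktree)", task))


if __name__ == "__main__":
    run_feature("add an init command")
```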
Beyond this default path, I want to build in flexibility. There should be an interactive mode for more hands-on collaboration with the agents. I also envision a "human-led" mode, where I write the initial code and then call in a specialized code-review agent for feedback, like having an on-demand code reviewer.
The ultimate goal is to allow users to define their own workflows, mixing and matching agents and approval steps to fit their personal tastes and project needs. It’s about creating a system that gives the agents guard rails but leaves humans free to shape the process however they like.
I also want an init command that creates a local configuration folder with templates and instructions on how to use the tool.
CONCLUSION
Building the aiswarm project was fun, even though at times I felt like tearing my hair out when GitHub Copilot did unexpected things. I love being able to experiment and try different ideas. The day I spent chasing my first idea felt like a failure at first, but that failure taught me what not to build and gave me the pivot I needed to build the tool I have today. As Kent Beck says, keep experimenting with these tools!
Please check out my tool here on GitHub and tell me what you think!