I've been building a multi-agent system called phleet since January 2026. It's a personal project - completely separate from my day job - that runs on a Mac Studio in my apartment in Bishkek. I use it for my own projects: code reviews, infrastructure monitoring, and running a news aggregation pipeline end-to-end. This is a walkthrough of how it's built and what I've learned.
Why I built this
I wanted AI agents that stick around. Not a chat window I close at the end of the day, but processes that persist, coordinate with each other, and remember what happened last week.
The existing frameworks I looked at were either too abstract (agent frameworks that need you to define everything in Python decorators) or too simple (wrapper scripts around API calls). I wanted something closer to how I'd architect a distributed system in production - containers, message queues, durable workflows - but with AI processes instead of microservices.
I'm a .NET developer by background. I've spent years building messaging platforms with RabbitMQ, Redis, and Elasticsearch. So I built phleet in .NET 10, because that's where I think fastest, and I could reuse patterns I already trusted in production.
The result is a system where each agent is a Docker container running a persistent AI process - Claude CLI or OpenAI's Codex - coordinated through RabbitMQ messaging and Temporal workflows. An orchestrator manages the fleet from a MySQL database, and a React dashboard lets me see what everyone is doing.
What this looks like in practice
I run about ten agents on my personal projects. Before I get into the internals, here's what they actually do - because the architecture only matters if it works.
Automated operations. Health checks run twice a day - one agent SSHes into my servers, checks Docker container status, verifies API endpoints, reviews logs for errors, and posts a summary. Memory backups run hourly. Prometheus metrics get digested into a daily report with charts.
Code reviews. When a developer agent opens a PR, a consensus review workflow fans out to three or four reviewer agents in parallel. Each reviews independently, then a synthesizer agent reconciles their feedback. If they unanimously approve, the PR moves to my merge gate. If they disagree, their reasoning gets fed back to the developer for another round.
News pipeline (fuddy-duddy.org). This is the best example of agents running a full production system. Fuddy-duddy crawls a dozen Kyrgyz news sources, generates AI summaries, clusters related stories by semantic similarity, and posts trending topics to Telegram and social media. The fleet agents handle its entire operational lifecycle - the twice-daily health checks verify every API endpoint, deploy verification workflows run after each push to main, and its Prometheus metrics feed the daily digest. When something breaks at 3am, the ops agent catches it in the next health check and posts the issue to our group chat. No human in the loop for routine operations.
The dashboard. A React SPA shows real-time agent status (connected via WebSocket), active workflows with signal buttons for approvals, task history, container logs, and a visual workflow definition editor. I can provision new agents, change their tools, update their instructions, and trigger workflow runs - all from the browser.
Writer.com recently framed this shift as the Agent Development Lifecycle - the idea that the SDLC transforms when AI agents participate in every phase, not just code generation. That maps exactly to what I see daily: agents don't just write code - they review it, deploy it, monitor it, and feed operational learnings back into the next cycle. The development lifecycle becomes a loop where agents are both the builders and the operators.
Two things surprised me most. First, the feedback loop: an agent discovers a deployment gotcha and persists it to shared memory. Next time any agent encounters the same situation, it finds the learning and handles it correctly. The fleet accumulates institutional knowledge without anyone explicitly maintaining a wiki. Second, the AI is the most reliable component in the stack - it's the infrastructure around it that breaks. More on that later.
Architecture overview
Here's how the pieces fit together:
Three RabbitMQ exchanges handle different communication patterns (a declaration sketch follows the list):
- fleet.tasks (topic) - point-to-point task delegation. Each agent has its own queue.
- fleet.relay (fanout) - group chat broadcast. Every agent hears everything.
- fleet.orchestrator (topic) - heartbeats and registration from running containers.
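In RabbitMQ.Client terms, the topology would be declared roughly like this. The queue and routing-key names are illustrative - I'm inferring them from the naming above:

using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "rabbitmq" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Point-to-point task delegation: each agent binds its own queue.
channel.ExchangeDeclare("fleet.tasks", ExchangeType.Topic, durable: true);
channel.QueueDeclare("fleet.tasks.acto", durable: true, exclusive: false, autoDelete: false);
channel.QueueBind("fleet.tasks.acto", "fleet.tasks", routingKey: "agent.acto.#");

// Group chat broadcast: fanout ignores routing keys, every agent gets a copy.
channel.ExchangeDeclare("fleet.relay", ExchangeType.Fanout, durable: true);

// Heartbeats and registration flowing back to the orchestrator.
channel.ExchangeDeclare("fleet.orchestrator", ExchangeType.Topic, durable: true);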
The orchestrator consumes heartbeats, maintains a live registry, and exposes it over WebSocket so the dashboard updates in real time.
The agent lifecycle
Every agent starts as a row in MySQL. The database stores everything: which model to use, which tools to allow, which MCP servers to connect to, which Telegram users can talk to it, and what role instructions to load.
When I provision an agent, the orchestrator reads the DB and generates config files into a .generated/ directory:
- appsettings.json - model, provider, memory limit, behavior flags
- .mcp.json - MCP server endpoints and allowed tools
- settings.json - Claude CLI permission allowlists
- roles/{role}/system.md - the agent's personality and instructions
These files get bind-mounted read-only into a Docker container. The container's entrypoint script seeds authentication (OAuth tokens for Claude or Codex), configures git with a GitHub App token, and then starts the .NET agent process.
The agent process does three things on startup:
- Connects to RabbitMQ and starts publishing heartbeats every 30 seconds
- Starts a Telegram bot listener (or just a send-only client)
- Spawns a persistent AI process
That third part is the interesting bit. The Claude CLI runs as a long-lived child process - claude -p with flags for NDJSON streaming, model selection, permission mode, and an MCP config file pointing at the agent's tool servers.
Messages go in via stdin as NDJSON, responses stream back on stdout. The process stays alive between tasks - no session replay, no re-sending the system prompt. A background reader continuously consumes stdout events into a channel, and each new task drains any stale events before starting its turn.
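A condensed sketch of that loop. The CLI flags here are illustrative, but the shape is just a child process with redirected pipes and a channel in between:

using System.Diagnostics;
using System.Threading.Channels;

var psi = new ProcessStartInfo("claude")
{
    // Illustrative flags; the real invocation also sets the model,
    // permission mode, and the MCP config path.
    ArgumentList = { "-p", "--input-format", "stream-json", "--output-format", "stream-json" },
    RedirectStandardInput = true,
    RedirectStandardOutput = true,
};
var process = Process.Start(psi)!;

// Background reader: one NDJSON event per stdout line, pushed into a channel.
var events = Channel.CreateUnbounded<string>();
_ = Task.Run(async () =>
{
    string? line;
    while ((line = await process.StandardOutput.ReadLineAsync()) != null)
        await events.Writer.WriteAsync(line);
    events.Writer.Complete(); // process exited - restart logic takes over
});

// Each new task: drain stale events from the previous turn, then send.
while (events.Reader.TryRead(out _)) { }
await process.StandardInput.WriteLineAsync("""{"type":"user","message":"..."}""");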
If the process crashes mid-response, the executor detects the exit and restarts it with --resume, which tells the CLI to reload its last conversation state and continue from where it stopped. The agent process tracks the last known session ID to make this seamless - from the user's perspective, the task just takes a bit longer. If the process hits the max-turns limit instead of crashing, the upstream Temporal workflow detects the incomplete response and retries the delegation with the partial output included as context, so the agent can pick up the thread.
MCP as the integration layer
Agents don't have built-in capabilities beyond running an AI process and routing messages. Everything else - searching memory, sending Telegram messages, starting workflows, browsing the web - happens through MCP servers. Each is a separate process: Fleet.Memory for semantic search, Fleet.Telegram for messaging, Fleet.Temporal for workflow orchestration, and Playwright for browser automation.
The interesting part is the access control. The agent's .mcp.json declares which servers it connects to, and the orchestrator DB controls which specific tools each agent can call within those servers. So my developer agent can write to memory but a read-only reviewer can't - even though both connect to the same MCP server. Adding a new tool to an agent is a database row change, not a code change.
This keeps the agent process thin and the capability matrix auditable. When I add a new integration, I build it as an MCP server and assign it to the relevant agents. No agent code changes, no redeploy.
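Conceptually, the gate is just a per-agent allowlist checked before a tool call is forwarded - something like this (the type and method names are hypothetical, not phleet's actual code):

// Hypothetical per-agent tool gate, loaded from the orchestrator DB.
public sealed class ToolPolicy
{
    private readonly HashSet<string> _allowed;

    public ToolPolicy(IEnumerable<string> allowedTools) =>
        _allowed = new HashSet<string>(allowedTools, StringComparer.OrdinalIgnoreCase);

    // Checked before any tool invocation is forwarded to the MCP server.
    public bool IsAllowed(string toolName) => _allowed.Contains(toolName);
}

// A read-only reviewer connects to the same memory server, minus the writes:
// new ToolPolicy(new[] { "memory_search" }).IsAllowed("memory_store") // false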
The multi-provider executor is the same idea applied to the AI process itself. There's an IAgentExecutor interface with two implementations - ClaudeExecutor wraps the Claude CLI, CodexExecutor wraps OpenAI's Codex SDK through a Node.js bridge. Both use NDJSON streaming, both get the same system prompt from PromptBuilder. Switching an agent's provider is a single DB field change and a reprovision.
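My guess at the shape of that interface, based on the description above - the real one in phleet may differ:

public interface IAgentExecutor
{
    // Spawn (or respawn) the underlying AI process.
    Task StartAsync(CancellationToken ct);

    // Send one task in; stream NDJSON events back until the turn completes.
    IAsyncEnumerable<string> ExecuteTurnAsync(string instruction, CancellationToken ct);

    // Last known session ID, used for --resume-style crash recovery.
    string? LastSessionId { get; }
}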
Workflow orchestration with Temporal
When a task involves multiple agents or needs human approval gates, it runs as a Temporal workflow. The core building block is DelegateToAgentActivity (see the sketch after this list):
- Publish a directive to the agent's RabbitMQ queue
- Wait for the response (with 30-second heartbeats to Temporal)
- If the agent doesn't respond within 5 minutes, re-publish the directive
- If the agent hits context limits, auto-retry with a continuation prompt
- On timeout, notify me and throw an exception
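A condensed sketch of that activity using the Temporalio .NET SDK - the messaging helpers are placeholders for the RabbitMQ plumbing, and the real version notifies me before throwing on a final timeout:

using Temporalio.Activities;

public class AgentActivities
{
    [Activity]
    public async Task<string> DelegateToAgent(string agent, string directive)
    {
        await PublishDirectiveAsync(agent, directive);
        var deadline = DateTime.UtcNow + TimeSpan.FromMinutes(5);

        while (true)
        {
            var response = await TryReceiveResponseAsync(agent, TimeSpan.FromSeconds(30));
            ActivityExecutionContext.Current.Heartbeat(); // keep Temporal informed
            if (response is not null) return response;

            if (DateTime.UtcNow > deadline)
            {
                await PublishDirectiveAsync(agent, directive); // re-publish and keep waiting
                deadline = DateTime.UtcNow + TimeSpan.FromMinutes(5);
            }
        }
    }

    // Placeholders for the actual RabbitMQ publish/consume code.
    private Task PublishDirectiveAsync(string agent, string directive) => Task.CompletedTask;
    private Task<string?> TryReceiveResponseAsync(string agent, TimeSpan wait) =>
        Task.FromResult<string?>(null);
}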
Every directive gets tagged with the workflow context:
[fleet-wf:PrImplementationWorkflow:abc123]
Your task: implement issue #42 in repo X...
This tag lets agents identify the message as a legitimate workflow delegation, not a spoofed instruction.
The most complex workflow is PR implementation. It chains six phases: create a branch and implement the issue → run a multi-agent consensus review → if reviewers request changes, feed their feedback back to the developer agent and loop → present the PR to me for merge approval → merge → trigger documentation updates. Each phase is a combination of agent delegations and signal gates.
The Universal Workflow Engine
Early on, I was writing each workflow in C# - compile, deploy, restart the Temporal worker. For a system that changes daily, that's too much friction.
So I built UWE: a [Workflow(Dynamic = true)] handler that catches any workflow type without a static C# implementation. It loads a JSON step tree from the database and interprets it at runtime.
It's essentially a flow chart stored as JSON in a database - each node is a step type (delegate to an agent, wait for a signal, branch on a result), and the engine walks the tree at runtime. There are 16 step types:
{
  "type": "sequence",
  "steps": [
    {
      "type": "delegate",
      "agent": "{{input.TargetAgent}}",
      "instruction": "Do the thing",
      "output_var": "result"
    },
    {
      "type": "branch",
      "expression": "{{vars.result}}",
      "cases": {
        "APPROVED": { "type": "break" },
        "REJECTED": { "type": "noop" }
      }
    }
  ]
}
Template expressions resolve against three scopes - {{input.*}} for workflow arguments, {{vars.*}} for step outputs, and {{config.*}} for system config. Filters like | extract: 'ISSUE_NUMBER: (\d+)' and | default: 'fallback' handle the messy reality of parsing agent responses.
Temporal requires workflow code to be deterministic - the same inputs must produce the same execution path on replay. UWE handles this by loading the step tree exactly once via a LoadWorkflowDefinition activity at the start of execution. Temporal records the result in its event history, so on replay the tree comes from history rather than hitting the database again. All the non-deterministic work (agent delegation, HTTP requests, signal waits) happens inside activities, which Temporal replays from recorded results. The step interpreter itself is pure control flow - branching, looping, variable assignment - all deterministic.
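In Temporalio .NET terms, that pattern looks roughly like this. StepNode, WorkflowDefinition, and UweActivities are illustrative stand-ins for whatever phleet actually uses:

using Temporalio.Activities;
using Temporalio.Converters;
using Temporalio.Workflows;

public record StepNode(string Type);
public record WorkflowDefinition(StepNode Root);

public class UweActivities
{
    [Activity]
    public Task<WorkflowDefinition> LoadWorkflowDefinition(string workflowType) =>
        throw new NotImplementedException("DB lookup elided");
}

[Workflow(Dynamic = true)]
public class UniversalWorkflow
{
    [WorkflowRun]
    public async Task<string> RunAsync(IRawValue[] args)
    {
        var workflowType = Workflow.Info.WorkflowType;

        // Loaded exactly once; on replay the result comes from event history.
        var definition = await Workflow.ExecuteActivityAsync(
            (UweActivities a) => a.LoadWorkflowDefinition(workflowType),
            new() { StartToCloseTimeout = TimeSpan.FromSeconds(30) });

        return await ExecuteStepAsync(definition.Root);
    }

    // Pure control flow: sequence, delegate, branch, loop, signal-wait...
    // All side effects route through activities, so this stays replay-safe.
    private Task<string> ExecuteStepAsync(StepNode step) =>
        throw new NotImplementedException("16-step-type interpreter elided");
}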
Adding a new workflow is now a database insert. No compilation, no deploy, no worker restart. Most of the operational workflows - health checks, deploy verification, metrics digests, news pipeline monitoring - are UWE definitions. Only a handful of complex workflows with intricate control flow remain as compiled C#.
Memory system
Agents share a semantic memory backed by Qdrant. Each memory is a markdown file with YAML frontmatter:
---
id: 3ac5da5b-...
agent: acto
title: "API-level health check runbook"
type: learning
project: fleet
tags: [runbook, health]
---
Steps to check fuddyduddy health without SSH:
1. curl https://fuddy-duddy.org/api/summaries...
Search is hybrid - dense vectors (all-MiniLM-L6-v2 running in-process via ONNX Runtime) combined with sparse BM25 keyword matching, fused with Reciprocal Rank Fusion. The embedding model has zero external dependencies - no separate embedding service to manage.
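Reciprocal Rank Fusion itself is tiny - the standard formula sums 1/(k + rank) across the result lists. A sketch with the commonly used k = 60 (I don't know what constant phleet actually uses):

// Fuse two rankings (doc IDs ordered best-first) into one score per doc.
static Dictionary<string, double> FuseRrf(
    IReadOnlyList<string> denseRanking,   // from Qdrant vector search
    IReadOnlyList<string> sparseRanking,  // from BM25 keyword matching
    int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in new[] { denseRanking, sparseRanking })
        for (var rank = 0; rank < ranking.Count; rank++)
            scores[ranking[rank]] =
                scores.GetValueOrDefault(ranking[rank]) + 1.0 / (k + rank + 1);
    return scores; // order by descending score for the final fused list
}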
Agents search memory at the start of every task and persist learnings when they discover something worth remembering. But memory writes are gated - when an agent wants to store something, it starts a MemoryStoreRequestWorkflow that routes the proposed memory through a review loop. The reviewing agent checks for duplicates, evaluates quality, and either approves, revises, or rejects. This keeps the shared knowledge base clean.
Each agent also has a MEMORY.md index file that gets loaded into its system prompt - a curated table of contents pointing to the most relevant memories for its role. This gives agents a persistent "what do I know about" awareness without stuffing their entire memory into the context window.
The backing files are stored on disk and synced to a private GitHub repo hourly. An automated Temporal schedule commits and pushes any changes, so the knowledge base is version-controlled and backed up.
What's hard and what I'd do differently
Context windows are expensive. Each agent's system prompt - role instructions, project context, memory index, MCP tool descriptions - can easily consume 20-30k tokens before the user says anything. With multiple agents running, that's real money. I've optimized with scoped memory loading and conditional project contexts, but it's still the biggest cost driver.
"Can do" versus "reliably does." An agent can write a PR, run tests, and respond to review feedback. But getting it to do all of that correctly every time, across different repos and edge cases, takes months of instruction refinement. The gap between a demo and a reliable workflow is mostly prompt engineering and error handling - not glamorous work, but it's where the real value compounds.
Infrastructure overhead. Running this on a Mac Studio means I own the hardware costs, but I also own the operational burden. OAuth tokens expire mid-task and the agent process silently fails - I built a centralized token refresh workflow that runs every 30 minutes and broadcasts fresh credentials to all containers. RabbitMQ connections recover automatically after a broker restart (exponential backoff from 2 seconds to 60), but consumer channels sometimes don't re-subscribe cleanly - I've had to restart agent containers to get message flow going again despite the connection being back up. Docker containers occasionally OOM when an agent spawns too many subprocesses - I added memory limits per container but still get surprised. It's a real distributed system with real operational surface area.
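The backoff itself is the easy part - a few lines along these lines (a generic sketch, not phleet's actual reconnect code); it's the half-recovered consumer channels that are hard to detect:

// Exponential backoff from 2s doubling up to a 60s cap, as described above.
static async Task ReconnectWithBackoffAsync(Func<Task> connect, CancellationToken ct)
{
    var delay = TimeSpan.FromSeconds(2);
    while (true)
    {
        try { await connect(); return; }
        catch (Exception)
        {
            await Task.Delay(delay, ct);
            delay = TimeSpan.FromSeconds(Math.Min(delay.TotalSeconds * 2, 60));
        }
    }
}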
Temporal is the right choice, but it has a learning curve. Durable execution, automatic retries, and full history replay are genuinely valuable for multi-agent coordination. But Temporal's determinism requirements mean you can't just write normal async code - every side effect needs to be wrapped in an activity. The UWE engine abstracts most of this away for simple workflows, but complex ones still need compiled C# with careful attention to replay safety.
Agent reliability is an infrastructure problem, not an AI problem. The hardest bugs aren't about the AI giving wrong answers - they're about RabbitMQ connections dropping, OAuth tokens expiring mid-task, Docker containers running out of memory, and heartbeat messages arriving out of order. The AI is actually the most reliable component in the stack. Everything around it needs the same production-hardening you'd give any distributed system.
If I built this again from scratch, I'd keep the core architecture - containers, message queues, durable workflows, MCP integration. But I'd invest earlier in observability (structured logging, distributed tracing) and I'd be more aggressive about keeping agent system prompts small. The simplest agents are the most reliable ones.
The source code is at github.com/anurmatov/phleet. The Mac Studio server setup that runs the whole fleet - Colima, Docker networking, WireGuard VPN, the hardware itself - is documented separately at github.com/anurmatov/mac-studio-server.
It's a real system that I depend on for my personal projects, not a weekend experiment I'll forget about. If you build something with it, or have questions about the architecture, I'd like to hear about it.
Next: One Workflow, Three Jobs - How We Built a Reusable AI Review System - the consensus mechanism for catching AI mistakes.
Co-authored with Acto - my AI co-CTO and one of the agents described in this post.


