<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anvar Nurmatov</title>
    <description>The latest articles on DEV Community by Anvar Nurmatov (@ggsa).</description>
    <link>https://dev.to/ggsa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3905328%2Fdd8f5019-67fc-491e-b40b-2b4a6c8a9999.jpg</url>
      <title>DEV Community: Anvar Nurmatov</title>
      <link>https://dev.to/ggsa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ggsa"/>
    <language>en</language>
    <item>
      <title>Phleet Architecture Deep Dive</title>
      <dc:creator>Anvar Nurmatov</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:41:32 +0000</pubDate>
      <link>https://dev.to/ggsa/phleet-architecture-deep-dive-5b8b</link>
      <guid>https://dev.to/ggsa/phleet-architecture-deep-dive-5b8b</guid>
      <description>&lt;p&gt;I've been building a multi-agent system called phleet since January 2026. It's a personal project - completely separate from my day job - that runs on a Mac Studio in my apartment in Bishkek. I use it for my own projects: code reviews, infrastructure monitoring, and running a news aggregation pipeline end-to-end. This is a walkthrough of how it's built and what I've learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;I wanted AI agents that stick around. Not a chat window I close at the end of the day, but processes that persist, coordinate with each other, and remember what happened last week.&lt;/p&gt;

&lt;p&gt;The existing frameworks I looked at were either too abstract (agent frameworks that need you to define everything in Python decorators) or too simple (wrapper scripts around API calls). I wanted something closer to how I'd architect a distributed system in production - containers, message queues, durable workflows - but with AI processes instead of microservices.&lt;/p&gt;

&lt;p&gt;I'm a .NET developer by background. I've spent years building messaging platforms with RabbitMQ, Redis, and Elasticsearch. So I built phleet in .NET 10, because that's where I think fastest, and I could reuse patterns I already trusted in production.&lt;/p&gt;

&lt;p&gt;The result is a system where each agent is a Docker container running a persistent AI process - Claude CLI or OpenAI's Codex - coordinated through RabbitMQ messaging and Temporal workflows. An orchestrator manages the fleet from a MySQL database, and a React dashboard lets me see what everyone is doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;I run about ten agents on my personal projects. Before I get into the internals, here's what they actually do - because the architecture only matters if it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated operations.&lt;/strong&gt; Health checks run twice a day - one agent SSHes into my servers, checks Docker container status, verifies API endpoints, reviews logs for errors, and posts a summary. Memory backups run hourly. Prometheus metrics get digested into a daily report with charts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code reviews.&lt;/strong&gt; When a developer agent opens a PR, a consensus review workflow fans out to three or four reviewer agents in parallel. Each reviews independently, then a synthesizer agent reconciles their feedback. If they unanimously approve, the PR moves to my merge gate. If they disagree, their reasoning gets fed back to the developer for another round.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;News pipeline (&lt;a href="https://fuddy-duddy.org" rel="noopener noreferrer"&gt;fuddy-duddy.org&lt;/a&gt;).&lt;/strong&gt; This is the best example of agents running a full production system. Fuddy-duddy crawls a dozen Kyrgyz news sources, generates AI summaries, clusters related stories by semantic similarity, and posts trending topics to Telegram and social media. The fleet agents handle the entire operational lifecycle - scheduled health checks verify every API endpoint twice a day, deploy verification workflows run after each push to main, and Prometheus metrics get digested into daily reports. When something breaks at 3am, the ops agent catches it in the next health check and posts the issue to our group chat. No human in the loop for routine operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The dashboard.&lt;/strong&gt; A React SPA shows real-time agent status (connected via WebSocket), active workflows with signal buttons for approvals, task history, container logs, and a visual workflow definition editor. I can provision new agents, change their tools, update their instructions, and trigger workflow runs - all from the browser.&lt;/p&gt;

&lt;p&gt;Writer.com recently framed this shift as the &lt;a href="https://writer.com/engineering/agent-development-lifecycle/" rel="noopener noreferrer"&gt;Agent Development Lifecycle&lt;/a&gt; - the SDLC transforming when AI agents participate in every phase, not just code generation. That maps exactly to what I see daily: agents don't just write code, they review it, deploy it, monitor it, and feed operational learnings back into the next cycle. The development lifecycle becomes a loop where agents are both the builders and the operators.&lt;/p&gt;

&lt;p&gt;Two things surprised me most. First, the feedback loop: an agent discovers a deployment gotcha and persists it to shared memory. Next time any agent encounters the same situation, it finds the learning and handles it correctly. The fleet accumulates institutional knowledge without anyone explicitly maintaining a wiki. Second, the AI is the most reliable component in the stack - it's the infrastructure around it that breaks. More on that later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;Here's how the pieces fit together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn4dqhzp7uija3z8dt3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn4dqhzp7uija3z8dt3x.png" alt="Architecture overview" width="800" height="712"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three RabbitMQ exchanges handle different communication patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fleet.tasks&lt;/strong&gt; (topic) - point-to-point task delegation. Each agent has its own queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fleet.relay&lt;/strong&gt; (fanout) - group chat broadcast. Every agent hears everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fleet.orchestrator&lt;/strong&gt; (topic) - heartbeats and registration from running containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The orchestrator consumes heartbeats, maintains a live registry, and exposes it over WebSocket so the dashboard updates in real time.&lt;/p&gt;
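
&lt;p&gt;For reference, here's a minimal sketch of that topology using the RabbitMQ.Client API - the exchange names are the real ones from the list above, but the queue and routing-key names are illustrative rather than phleet's exact naming:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Minimal topology sketch (RabbitMQ.Client, classic 6.x-style API).
// Exchange names come from the post; queue and routing-key names are made up.
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// The three exchanges described above.
channel.ExchangeDeclare("fleet.tasks", ExchangeType.Topic, durable: true);
channel.ExchangeDeclare("fleet.relay", ExchangeType.Fanout, durable: true);
channel.ExchangeDeclare("fleet.orchestrator", ExchangeType.Topic, durable: true);

// Point-to-point: each agent gets its own task queue bound by a routing key.
channel.QueueDeclare("fleet.tasks.acto", durable: true, exclusive: false, autoDelete: false);
channel.QueueBind("fleet.tasks.acto", "fleet.tasks", routingKey: "agent.acto");

// Group chat: a per-agent queue bound to the fanout exchange hears everything.
channel.QueueDeclare("fleet.relay.acto", durable: true, exclusive: false, autoDelete: false);
channel.QueueBind("fleet.relay.acto", "fleet.relay", routingKey: "");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;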

&lt;h2&gt;
  
  
  The agent lifecycle
&lt;/h2&gt;

&lt;p&gt;Every agent starts as a row in MySQL. The database stores everything: which model to use, which tools to allow, which MCP servers to connect to, which Telegram users can talk to it, and what role instructions to load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj733wg39ee1eqm69j1i9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj733wg39ee1eqm69j1i9.png" alt="Agent lifecycle" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;
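
&lt;p&gt;Roughly, an agent row carries something like the following shape - the field names here are illustrative assumptions, not the actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative shape of an agent definition row. Field names are assumptions
// drawn from the description above, not phleet's actual MySQL schema.
public sealed record AgentDefinition(
    string Name,
    string Provider,                 // "claude" or "codex"
    string Model,
    string Role,                     // selects roles/{role}/system.md
    int MemoryLimitMb,
    string[] AllowedTools,
    string[] McpServers,
    long[] AllowedTelegramUserIds);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;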

&lt;p&gt;When I provision an agent, the orchestrator reads the DB and generates config files into a &lt;code&gt;.generated/&lt;/code&gt; directory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;appsettings.json&lt;/code&gt; - model, provider, memory limit, behavior flags&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.mcp.json&lt;/code&gt; - MCP server endpoints and allowed tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;settings.json&lt;/code&gt; - Claude CLI permission allowlists&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;roles/{role}/system.md&lt;/code&gt; - the agent's personality and instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These files get bind-mounted read-only into a Docker container. The container's entrypoint script seeds authentication (OAuth tokens for Claude or Codex), configures git with a GitHub App token, and then starts the .NET agent process.&lt;/p&gt;

&lt;p&gt;The agent process does three things on startup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connects to RabbitMQ and starts publishing heartbeats every 30 seconds&lt;/li&gt;
&lt;li&gt;Starts a Telegram bot listener (or just a send-only client)&lt;/li&gt;
&lt;li&gt;Spawns a persistent AI process&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That third part is the interesting bit. The Claude CLI runs as a long-lived child process - &lt;code&gt;claude -p&lt;/code&gt; with flags for NDJSON streaming, model selection, permission mode, and an MCP config file pointing at the agent's tool servers.&lt;/p&gt;

&lt;p&gt;Messages go in via stdin as NDJSON, responses stream back on stdout. The process stays alive between tasks - no session replay, no re-sending the system prompt. A background reader continuously consumes stdout events into a channel, and each new task drains any stale events before starting its turn.&lt;/p&gt;
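
&lt;p&gt;A stripped-down sketch of that plumbing - a long-lived child process plus a background NDJSON reader feeding a channel. The flags and message shape are simplified placeholders; the real executor layers permissions, MCP config, and session tracking on top:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Sketch: persistent CLI child process with stdin/stdout NDJSON streaming.
// Flags and the message format here are simplified assumptions, not the exact invocation.
using System.Diagnostics;
using System.Threading.Channels;
using System.Threading.Tasks;

var psi = new ProcessStartInfo("claude")
{
    RedirectStandardInput = true,
    RedirectStandardOutput = true,
    UseShellExecute = false,
};
psi.ArgumentList.Add("-p"); // plus streaming/model/permission/MCP flags in the real setup

var process = Process.Start(psi)!;

// Background reader: each stdout line is one NDJSON event, pushed into a channel.
var events = Channel.CreateUnbounded&lt;string&gt;();
_ = Task.Run(async () =&gt;
{
    string? line;
    while ((line = await process.StandardOutput.ReadLineAsync()) != null)
        await events.Writer.WriteAsync(line);
    events.Writer.Complete(); // process exited - the executor would restart it with --resume
});

// One task turn: drain stale events, write one NDJSON message to stdin, then read this turn's events.
while (events.Reader.TryRead(out _)) { }
await process.StandardInput.WriteLineAsync("""{"type":"user","message":"run the health check"}""");
await process.StandardInput.FlushAsync();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;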

&lt;p&gt;If the process crashes mid-response, the executor detects the exit and restarts it with &lt;code&gt;--resume&lt;/code&gt;, which tells the CLI to reload its last conversation state and continue from where it stopped. The agent process tracks the last known session ID to make this seamless - from the user's perspective, the task just takes a bit longer. If the process hits the max-turns limit instead of crashing, the upstream Temporal workflow detects the incomplete response and retries the delegation with the partial output included as context, so the agent can pick up the thread.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP as the integration layer
&lt;/h2&gt;

&lt;p&gt;Agents don't have built-in capabilities beyond running an AI process and routing messages. Everything else - searching memory, sending Telegram messages, starting workflows, browsing the web - happens through MCP servers. Each is a separate process: Fleet.Memory for semantic search, Fleet.Telegram for messaging, Fleet.Temporal for workflow orchestration, and Playwright for browser automation.&lt;/p&gt;

&lt;p&gt;The interesting part is the access control. The agent's &lt;code&gt;.mcp.json&lt;/code&gt; declares which servers it connects to, and the orchestrator DB controls which specific tools each agent can call within those servers. So my developer agent can write to memory but a read-only reviewer can't - even though both connect to the same MCP server. Adding a new tool to an agent is a database row change, not a code change.&lt;/p&gt;

&lt;p&gt;This keeps the agent process thin and the capability matrix auditable. When I add a new integration, I build it as an MCP server and assign it to the relevant agents. No agent code changes, no redeploy.&lt;/p&gt;

&lt;p&gt;The multi-provider executor is the same idea applied to the AI process itself. There's an &lt;code&gt;IAgentExecutor&lt;/code&gt; interface with two implementations - &lt;code&gt;ClaudeExecutor&lt;/code&gt; wraps the Claude CLI, &lt;code&gt;CodexExecutor&lt;/code&gt; wraps OpenAI's Codex SDK through a Node.js bridge. Both use NDJSON streaming, both get the same system prompt from &lt;code&gt;PromptBuilder&lt;/code&gt;. Switching an agent's provider is a single DB field change and a reprovision.&lt;/p&gt;
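
&lt;p&gt;The seam is deliberately thin - roughly this shape, though the member names here are a paraphrase rather than the exact contract:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Hedged sketch of the provider seam. The real interface has more to it
// (session tracking, restart handling); this only shows the shape of the idea.
using System.Collections.Generic;
using System.Threading;

public interface IAgentExecutor
{
    // Streams NDJSON events for a single task turn against the persistent AI process.
    IAsyncEnumerable&lt;string&gt; ExecuteTurnAsync(string prompt, CancellationToken ct = default);
}

// ClaudeExecutor wraps the Claude CLI; CodexExecutor wraps Codex via a Node.js bridge.
// Both consume the same PromptBuilder output, which is what makes the provider a DB field.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;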

&lt;h2&gt;
  
  
  Workflow orchestration with Temporal
&lt;/h2&gt;

&lt;p&gt;When a task involves multiple agents or needs human approval gates, it runs as a Temporal workflow. The core building block is &lt;code&gt;DelegateToAgentActivity&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Publish a directive to the agent's RabbitMQ queue&lt;/li&gt;
&lt;li&gt;Wait for the response (with 30-second heartbeats to Temporal)&lt;/li&gt;
&lt;li&gt;If the agent doesn't respond within 5 minutes, re-publish the directive&lt;/li&gt;
&lt;li&gt;If the agent hits context limits, auto-retry with a continuation prompt&lt;/li&gt;
&lt;li&gt;On timeout, notify me and throw an exception&lt;/li&gt;
&lt;/ol&gt;
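
&lt;p&gt;Reduced to plain control flow, the delegation loop looks roughly like this - the messaging and heartbeat calls are stand-in delegates, and the context-limit retry and notification steps are omitted for brevity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Hedged sketch of the delegation loop. The timeouts mirror the numbers above;
// names and delegate shapes are illustrative, not the actual activity code.
using System;
using System.Threading.Tasks;

static class DelegateToAgent
{
    public static async Task&lt;string&gt; RunAsync(
        string agent,
        string directive,
        Func&lt;string, string, Task&gt; publishAsync,      // publish the directive to the agent's queue
        Func&lt;Task&lt;string?&gt;&gt; tryReceiveResponseAsync,  // poll for the agent's response
        Action heartbeat,                             // forwards to the Temporal activity heartbeat
        TimeSpan overallTimeout)
    {
        await publishAsync(agent, directive);
        var started = DateTime.UtcNow;
        var lastPublish = started;

        while (DateTime.UtcNow - started &lt; overallTimeout)
        {
            heartbeat();                              // roughly a 30-second cadence
            var response = await tryReceiveResponseAsync();
            if (response != null) return response;

            if (DateTime.UtcNow - lastPublish &gt; TimeSpan.FromMinutes(5))
            {
                await publishAsync(agent, directive); // silent for 5 minutes: re-publish
                lastPublish = DateTime.UtcNow;
            }
            await Task.Delay(TimeSpan.FromSeconds(30));
        }
        throw new TimeoutException($"Agent '{agent}' did not respond within {overallTimeout}.");
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;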

&lt;p&gt;Every directive gets tagged with the workflow context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[fleet-wf:PrImplementationWorkflow:abc123]
Your task: implement issue #42 in repo X...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tag lets agents identify the message as a legitimate workflow delegation, not a spoofed instruction.&lt;/p&gt;

&lt;p&gt;The most complex workflow is PR implementation. It chains six phases: create a branch and implement the issue → run a multi-agent consensus review → if reviewers request changes, feed their feedback back to the developer agent and loop → present the PR to me for merge approval → merge → trigger documentation updates. Each phase is a combination of agent delegations and signal gates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5vg1rmuw79db1s25kb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5vg1rmuw79db1s25kb6.png" alt="PR implementation workflow" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Universal Workflow Engine
&lt;/h3&gt;

&lt;p&gt;Early on, I was writing each workflow in C# - compile, deploy, restart the Temporal worker. For a system that changes daily, that's too much friction.&lt;/p&gt;

&lt;p&gt;So I built UWE: a &lt;code&gt;[Workflow(Dynamic = true)]&lt;/code&gt; handler that catches any workflow type without a static C# implementation. It loads a JSON step tree from the database and interprets it at runtime.&lt;/p&gt;

&lt;p&gt;It's essentially a flow chart stored as JSON in a database - each node is a step type (delegate to an agent, wait for a signal, branch on a result), and the engine walks the tree at runtime. There are 16 step types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sequence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"delegate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{input.TargetAgent}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Do the thing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"output_var"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{vars.result}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"APPROVED"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"break"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"REJECTED"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"noop"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Template expressions resolve against three scopes - &lt;code&gt;{{input.*}}&lt;/code&gt; for workflow arguments, &lt;code&gt;{{vars.*}}&lt;/code&gt; for step outputs, and &lt;code&gt;{{config.*}}&lt;/code&gt; for system config. Filters like &lt;code&gt;| extract: 'ISSUE_NUMBER: (\d+)'&lt;/code&gt; and &lt;code&gt;| default: 'fallback'&lt;/code&gt; handle the messy reality of parsing agent responses.&lt;/p&gt;
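
&lt;p&gt;A toy version of those two filters makes the idea concrete - this is an illustration of the mechanism, not the engine's actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Toy versions of the extract and default filters mentioned above.
// The real UWE template engine also handles scopes, nesting, and more filters.
using System.Text.RegularExpressions;

static class Filters
{
    // "| extract: 'ISSUE_NUMBER: (\d+)'" - return the first capture group, or empty on no match.
    public static string Extract(string input, string pattern)
    {
        var match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    // "| default: 'fallback'" - substitute a fallback when the value is missing or blank.
    public static string Default(string input, string fallback) =&gt;
        string.IsNullOrWhiteSpace(input) ? fallback : input;
}

// Example: pull an issue number out of a free-form agent response.
// var issue = Filters.Default(Filters.Extract(response, @"ISSUE_NUMBER: (\d+)"), "0");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;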

&lt;p&gt;Temporal requires workflow code to be deterministic - the same inputs must produce the same execution path on replay. UWE handles this by loading the step tree exactly once via a &lt;code&gt;LoadWorkflowDefinition&lt;/code&gt; activity at the start of execution. Temporal records the result in its event history, so on replay the tree comes from history rather than hitting the database again. All the non-deterministic work (agent delegation, HTTP requests, signal waits) happens inside activities, which Temporal replays from recorded results. The step interpreter itself is pure control flow - branching, looping, variable assignment - all deterministic.&lt;/p&gt;

&lt;p&gt;Adding a new workflow is now a database insert. No compilation, no deploy, no worker restart. Most of the operational workflows - health checks, deploy verification, metrics digests, news pipeline monitoring - are UWE definitions. Only a handful of complex workflows with intricate control flow remain as compiled C#.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory system
&lt;/h2&gt;

&lt;p&gt;Agents share a semantic memory backed by Qdrant. Each memory is a markdown file with YAML frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3ac5da5b-...&lt;/span&gt;
&lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;acto&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API-level&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;health&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;runbook"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;learning&lt;/span&gt;
&lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fleet&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;runbook&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;health&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="na"&gt;Steps to check fuddyduddy health without SSH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;1. curl https://fuddy-duddy.org/api/summaries...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Search is hybrid - dense vectors (all-MiniLM-L6-v2 running in-process via ONNX Runtime) combined with sparse BM25 keyword matching, fused with Reciprocal Rank Fusion. Because the embedding model runs in-process, there are no external dependencies and no separate embedding service to manage.&lt;/p&gt;
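
&lt;p&gt;Reciprocal Rank Fusion itself is only a few lines. A minimal sketch with the conventional k = 60 constant - illustrative, not necessarily the exact tuning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Textbook Reciprocal Rank Fusion over two ranked lists of memory ids.
using System.Collections.Generic;

static class Rrf
{
    public static Dictionary&lt;string, double&gt; Fuse(
        IReadOnlyList&lt;string&gt; denseRanked,   // ids ordered by vector similarity
        IReadOnlyList&lt;string&gt; sparseRanked,  // ids ordered by BM25 score
        double k = 60)
    {
        var scores = new Dictionary&lt;string, double&gt;();
        foreach (var ranked in new[] { denseRanked, sparseRanked })
        {
            for (var rank = 0; rank &lt; ranked.Count; rank++)
            {
                var id = ranked[rank];
                scores[id] = scores.GetValueOrDefault(id) + 1.0 / (k + rank + 1);
            }
        }
        return scores; // sort descending by score to get the fused ranking
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;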

&lt;p&gt;Agents search memory at the start of every task and persist learnings when they discover something worth remembering. But memory writes are gated - when an agent wants to store something, it starts a &lt;code&gt;MemoryStoreRequestWorkflow&lt;/code&gt; that routes the proposed memory through a review loop. The reviewing agent checks for duplicates, evaluates quality, and either approves, revises, or rejects. This keeps the shared knowledge base clean.&lt;/p&gt;

&lt;p&gt;Each agent also has a &lt;code&gt;MEMORY.md&lt;/code&gt; index file that gets loaded into its system prompt - a curated table of contents pointing to the most relevant memories for its role. This gives agents a persistent "what do I know about" awareness without stuffing their entire memory into the context window.&lt;/p&gt;

&lt;p&gt;The backing files are stored on disk and synced to a private GitHub repo hourly. An automated Temporal schedule commits and pushes any changes, so the knowledge base is version-controlled and backed up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's hard and what I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context windows are expensive.&lt;/strong&gt; Each agent's system prompt - role instructions, project context, memory index, MCP tool descriptions - can easily consume 20-30k tokens before the user says anything. With multiple agents running, that's real money. I've optimized with scoped memory loading and conditional project contexts, but it's still the biggest cost driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Can do" versus "reliably does."&lt;/strong&gt; An agent can write a PR, run tests, and respond to review feedback. But getting it to do all of that correctly every time, across different repos and edge cases, takes months of instruction refinement. The gap between a demo and a reliable workflow is mostly prompt engineering and error handling - not glamorous work, but it's where the real value compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure overhead.&lt;/strong&gt; Running this on a Mac Studio means I own the hardware costs, but I also own the operational burden. OAuth tokens expire mid-task and the agent process silently fails - I built a centralized token refresh workflow that runs every 30 minutes and broadcasts fresh credentials to all containers. RabbitMQ connections recover automatically after a broker restart (exponential backoff from 2 seconds to 60), but consumer channels sometimes don't re-subscribe cleanly - I've had to restart agent containers to get message flow going again despite the connection being back up. Docker containers occasionally OOM when an agent spawns too many subprocesses - I added memory limits per container but still get surprised. It's a real distributed system with real operational surface area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal is the right choice, but it has a learning curve.&lt;/strong&gt; Durable execution, automatic retries, and full history replay are genuinely valuable for multi-agent coordination. But Temporal's determinism requirements mean you can't just write normal async code - every side effect needs to be wrapped in an activity. The UWE engine abstracts most of this away for simple workflows, but complex ones still need compiled C# with careful attention to replay safety.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent reliability is an infrastructure problem, not an AI problem.&lt;/strong&gt; The hardest bugs aren't about the AI giving wrong answers - they're about RabbitMQ connections dropping, OAuth tokens expiring mid-task, Docker containers running out of memory, and heartbeat messages arriving out of order. The AI is actually the most reliable component in the stack. Everything around it needs the same production-hardening you'd give any distributed system.&lt;/p&gt;

&lt;p&gt;If I built this again from scratch, I'd keep the core architecture - containers, message queues, durable workflows, MCP integration. But I'd invest earlier in observability (structured logging, distributed tracing) and I'd be more aggressive about keeping agent system prompts small. The simplest agents are the most reliable ones.&lt;/p&gt;




&lt;p&gt;The source code is at &lt;a href="https://github.com/anurmatov/phleet" rel="noopener noreferrer"&gt;github.com/anurmatov/phleet&lt;/a&gt;. The Mac Studio server setup that runs the whole fleet - Colima, Docker networking, WireGuard VPN, the hardware itself - is documented separately at &lt;a href="https://github.com/anurmatov/mac-studio-server" rel="noopener noreferrer"&gt;github.com/anurmatov/mac-studio-server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's a real system that I depend on for my personal projects, not a weekend experiment I'll forget about. If you build something with it, or have questions about the architecture, I'd like to hear about it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://dev.to/ggsa/one-workflow-three-jobs-how-we-built-a-reusable-ai-review-system-1ajp"&gt;One Workflow, Three Jobs - How We Built a Reusable AI Review System&lt;/a&gt; - the consensus mechanism for catching AI mistakes.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Co-authored with Acto - my AI co-CTO and one of the agents described in this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>One Workflow, Three Jobs: How We Built a Reusable AI Review System</title>
      <dc:creator>Anvar Nurmatov</dc:creator>
      <pubDate>Thu, 30 Apr 2026 05:29:45 +0000</pubDate>
      <link>https://dev.to/ggsa/one-workflow-three-jobs-how-we-built-a-reusable-ai-review-system-1ajp</link>
      <guid>https://dev.to/ggsa/one-workflow-three-jobs-how-we-built-a-reusable-ai-review-system-1ajp</guid>
      <description>&lt;p&gt;&lt;em&gt;Previously: &lt;a href="https://dev.to/ggsa/phleet-architecture-deep-dive-5b8b"&gt;Phleet Architecture Deep Dive&lt;/a&gt; - how the overall multi-agent system works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you ask one AI agent to write code, you get code. When you ask a second agent to review it, you get a rubber stamp. "Looks good to me" is the most common review output in AI-assisted development - and it's worthless.&lt;/p&gt;

&lt;p&gt;We spent months building a system where AI agents genuinely catch each other's mistakes. Not in theory. In production, on systems that matter.&lt;/p&gt;

&lt;p&gt;At the core of our system is a single workflow - 146 lines of C# - that handles independent parallel assessment of any artifact: a design spec, a pull request, a deployment config, a vendor evaluation. You give it reviewers and a prompt. It fans out, collects verdicts, resolves disagreements, and returns a single actionable result.&lt;/p&gt;

&lt;p&gt;We use it for three things today: design review, code review, and a pipeline that chains both. But the mechanism is general - anywhere you need multiple independent perspectives synthesized into a decision, it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With AI Reviews
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you tell an AI to "review this PR":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The implementation looks well-structured and follows the existing patterns in the codebase. The error handling appears adequate. No major concerns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a review. That's a hallucination of a review. The agent skimmed the diff, pattern-matched against "things that look like code," and produced a response shaped like approval.&lt;/p&gt;

&lt;p&gt;We know this because we shipped code reviewed this way. It broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Workflow, Three Stages
&lt;/h2&gt;

&lt;p&gt;Our consensus review workflow does three things - fan out, parse, synthesize - with a fast-path shortcut when everyone agrees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F653bly91lnslp34b9r24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F653bly91lnslp34b9r24.png" alt="Consensus review flow" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out.&lt;/strong&gt; Multiple reviewer agents receive the same review prompt simultaneously. Each agent works independently - no peeking at each other's reviews. Each has 15 minutes and must end their response with an explicit verdict: &lt;code&gt;approved&lt;/code&gt;, &lt;code&gt;changes_requested&lt;/code&gt;, or &lt;code&gt;needs_human_review&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse.&lt;/strong&gt; The workflow extracts each agent's verdict. If an agent forgets to include one or writes something unrecognizable, it defaults to &lt;code&gt;changes_requested&lt;/code&gt;. The conservative choice. We'd rather re-review than miss a bug. If every reviewer independently approves at this stage, we skip synthesis entirely and move on - unanimous approval happens often enough that the fast-path is worth it, but disagreement is common enough that synthesis earns its keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthesize.&lt;/strong&gt; When reviewers disagree - one approves, another requests changes - a synthesizer agent reads all the reviews and produces a single verdict. The synthesizer can approve if all concerns are cosmetic, or escalate if any concern is substantive.&lt;/p&gt;

&lt;p&gt;Here's what synthesis looks like in practice. In one case, two reviewers independently reviewed a data pipeline optimization. One approved the approach and flagged an edge case to protect. The other read the source code and found the entire premise was wrong - the spec blamed the wrong component for the bottleneck. The synthesizer merged both inputs into a corrected specification: the accurate bottleneck analysis from one reviewer and the edge-case guardrail from the other - a result neither reviewer alone could have produced.&lt;/p&gt;

&lt;p&gt;The workflow itself doesn't know what it's reviewing. It's a pure coordination primitive - fan out, collect verdicts, resolve disagreements - and the power comes from how it's called.&lt;/p&gt;
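
&lt;p&gt;Reduced to its shape - with names that are a paraphrase, not the real 146-line workflow - the primitive looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Hedged sketch of fan-out / parse / fast-path / synthesize. Delegation and synthesis
// are stand-in delegates; in the real workflow they run as Temporal activities.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class ConsensusReview
{
    public enum Verdict { Approved, ChangesRequested, NeedsHumanReview }

    public static async Task&lt;(Verdict Verdict, string Summary)&gt; RunAsync(
        string prompt,
        IReadOnlyList&lt;string&gt; reviewers,
        Func&lt;string, string, Task&lt;string&gt;&gt; delegateAsync,        // (reviewer, prompt) -&gt; raw review
        Func&lt;IReadOnlyList&lt;string&gt;, Task&lt;string&gt;&gt; synthesizeAsync)
    {
        // Fan-out: every reviewer gets the same prompt and works independently.
        var reviews = await Task.WhenAll(reviewers.Select(r =&gt; delegateAsync(r, prompt)));

        // Parse: a missing or unrecognizable verdict defaults to changes_requested.
        var verdicts = reviews.Select(ParseVerdict).ToArray();

        // Fast path: unanimous approval skips synthesis entirely.
        if (verdicts.All(v =&gt; v == Verdict.Approved))
            return (Verdict.Approved, "Unanimous approval.");

        // Synthesize: one agent reconciles the disagreeing reviews into a single verdict.
        var synthesis = await synthesizeAsync(reviews);
        return (ParseVerdict(synthesis), synthesis);
    }

    static Verdict ParseVerdict(string review)
    {
        var text = review.ToLowerInvariant();
        if (text.Contains("needs_human_review")) return Verdict.NeedsHumanReview;
        if (text.Contains("changes_requested")) return Verdict.ChangesRequested;
        if (text.Contains("approved")) return Verdict.Approved;
        return Verdict.ChangesRequested; // the conservative default
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;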

&lt;h2&gt;
  
  
  The Self-Correcting Loop
&lt;/h2&gt;

&lt;p&gt;A single review pass is useful. But the real value is what happens when reviewers find problems: the system iterates autonomously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40nr1foa7gk3kq62o5ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40nr1foa7gk3kq62o5ww.png" alt="Review loop" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent that produced the original output receives the consolidated feedback and revises. The revised version goes through another full consensus review - same fan-out, same independent verdicts. This loop repeats up to N rounds (three for design specs, five for code). In the common case, agents resolve their own disagreements within two or three rounds.&lt;/p&gt;
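
&lt;p&gt;The outer loop is just as small - a sketch with illustrative names, the real work hidden behind delegates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Hedged sketch of the revision loop: review, revise on findings, repeat up to the
// round budget, then always hand the accumulated history to the human gate.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ReviewLoop
{
    public static async Task&lt;List&lt;(int Round, bool Approved, string Feedback)&gt;&gt; RunAsync(
        string artifact,
        int maxRounds,                                                // 3 for specs, 5 for code
        Func&lt;string, Task&lt;(bool Approved, string Feedback)&gt;&gt; reviewAsync,
        Func&lt;string, string, Task&lt;string&gt;&gt; reviseAsync)               // (artifact, feedback) -&gt; revision
    {
        var history = new List&lt;(int Round, bool Approved, string Feedback)&gt;();
        for (var round = 1; round &lt;= maxRounds; round++)
        {
            var (approved, feedback) = await reviewAsync(artifact);
            history.Add((round, approved, feedback));
            if (approved) break;                                      // converged: stop iterating
            artifact = await reviseAsync(artifact, feedback);         // author agent revises, loop again
        }
        return history;                                               // the human gate sees all of it
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;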

&lt;p&gt;Whether agents converge or the loop exhausts its budget, the result always reaches a human gate. The human sees the full review history and can approve, request further changes (which sends the agents back into the loop), or reject outright. Agents do the analytical work autonomously, but a human always makes the final call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jmq4to5wfskixczw7ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jmq4to5wfskixczw7ia.png" alt="Human gate: dashboard signal approval UI" width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what makes it more than a one-shot review tool. It's a self-correcting feedback loop with human oversight built into every path, not just the failure cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Examples
&lt;/h2&gt;

&lt;p&gt;We use consensus review for three things today - but the pattern applies anywhere you need independent assessments synthesized into a decision: compliance checks, deployment approvals, content moderation, vendor evaluations, or any multi-stakeholder review process. Here's how our three compositions work.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Design Review: "Is this spec good enough to build?"
&lt;/h3&gt;

&lt;p&gt;Before any code is written, someone has to decide what to build. An agent creates a GitHub issue with a detailed specification. Then the consensus workflow checks if that spec is actually implementable.&lt;/p&gt;

&lt;p&gt;The review prompt for design is specific:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Evaluate whether the spec is complete and unambiguous enough to implement without guessing.&lt;/p&gt;

&lt;p&gt;VALIDATION CHECKLIST - answer each yes/no:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does every new behavior have an explicit error/failure path?&lt;/li&gt;
&lt;li&gt;Are all external dependencies identified with failure handling?&lt;/li&gt;
&lt;li&gt;Does the spec include a 'Constraints / MUST NOT' section?&lt;/li&gt;
&lt;li&gt;Can an implementer build this without making design decisions of their own?&lt;/li&gt;
&lt;li&gt;Are boundary conditions and edge cases specified?&lt;/li&gt;
&lt;li&gt;Compare the original request against the spec - any specification drift?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last item is key. The reviewer gets the original request alongside the design agent's interpretation. This catches cases where the design agent subtly changed what was asked for - dropped a requirement, expanded scope, or reinterpreted intent.&lt;/p&gt;

&lt;p&gt;If the reviewers find problems, the design agent refines the spec and the review runs again. Up to three rounds. If it can't reach approval in three rounds, the workflow notifies a human and waits - there's no auto-cancel, because a stuck design decision is better surfaced than silently abandoned.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. PR Review: "Does this code match the spec?"
&lt;/h3&gt;

&lt;p&gt;Once the spec is approved and an agent implements it, a different composition of the same workflow reviews the code. Same fan-out, same synthesis - but the review prompt shifts focus entirely:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;VALIDATION CHECKLIST - answer each yes/no:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the implementation match the spec without omissions or unexplained additions?&lt;/li&gt;
&lt;li&gt;Does every new code path have error handling?&lt;/li&gt;
&lt;li&gt;Are there any security concerns (injection, auth bypass, data exposure)?&lt;/li&gt;
&lt;li&gt;Does this break backward compatibility for existing consumers?&lt;/li&gt;
&lt;li&gt;Are edge cases from the spec covered in the implementation?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Design review asks "is this spec complete?" PR review asks "does this code do what the spec says?" Same workflow, different lens.&lt;/p&gt;

&lt;p&gt;This one gets up to five rounds, not three - because code is harder to get right than specs. And after the review loop, there's a human approval gate before anything merges. If the human requests changes at that gate, the workflow runs a second consensus review to evaluate the concern, then feeds the feedback back to the developer agent. The human always has the final word, but the agents do the analytical work.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Design-to-PR: The Full Pipeline
&lt;/h3&gt;

&lt;p&gt;The third composition doesn't invoke the consensus workflow directly. It chains the first two:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the design workflow (which internally uses consensus review for spec validation)&lt;/li&gt;
&lt;li&gt;Capture the approved issue number&lt;/li&gt;
&lt;li&gt;Fire the implementation workflow (which internally uses consensus review for code validation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In a full design-to-PR pipeline, the same 146-line workflow can execute up to four times: twice during design (initial review + human-triggered re-review) and twice during implementation (same pattern). One building block, four review passes, each with a different prompt tuned to what matters at that stage.&lt;/p&gt;

&lt;p&gt;Here's a 5-minute walkthrough of a real production PR going through this exact pipeline - design spec, consensus review, implementation, merge:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/DIx7Y3GfmGc"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Adding a fourth composition - say, compliance review for regulatory changes, or deployment approval for infrastructure modifications - means writing a new parent workflow that calls the same consensus child with a different prompt and different reviewers. The coordination mechanism never changes; only the review criteria do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Catches
&lt;/h2&gt;

&lt;p&gt;Theory is nice. Here's what happened in production - cases where the automated review caught problems that the human authors had already looked at and missed. The catches fall into three categories, each progressively harder to replicate with a single reviewer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrong Bottleneck
&lt;/h3&gt;

&lt;p&gt;A design spec proposed optimizing a data pipeline that took over 8 hours to run. The spec blamed external API calls as the bottleneck and estimated a significant improvement from skipping them for lower-priority data segments.&lt;/p&gt;

&lt;p&gt;Two reviewers independently evaluated the proposal. The domain specialist confirmed the optimization made sense from a business perspective and flagged an edge case - active records must still get refreshed regardless of segment activity.&lt;/p&gt;

&lt;p&gt;The code auditor read the actual source and found the spec was factually wrong about the system it described:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The code shows the external API calls do NOT happen per-record during the main processing loop. They happen exclusively in a post-processing step, which is already scoped to a small subset of records.&lt;/p&gt;

&lt;p&gt;The actual bottleneck is the main processing loop: thousands of sequential API calls, tens of thousands of individual database lookups, and a comparable number of individual write operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The optimization would have targeted the wrong thing entirely. The consensus synthesis merged both inputs: the corrected bottleneck analysis from the auditor and the edge-case guardrail from the domain specialist. The resulting spec was fundamentally different from the original proposal.&lt;/p&gt;

&lt;p&gt;This is what makes multi-agent review worth the complexity. Neither reviewer's output alone would have been sufficient - the domain specialist validated the intent but missed the technical error, the code auditor found the error but wouldn't have known which edge cases to protect. The synthesizer produced a result that neither could have reached independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Startup Crash Nobody Tested
&lt;/h3&gt;

&lt;p&gt;A PR extracted hardcoded database seed data into a JSON config file. The reviewer confirmed all spec requirements were met - but then traced the code path end-to-end and found something the spec didn't mention:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the seed file contains malformed JSON, &lt;code&gt;JsonSerializer.Deserialize&lt;/code&gt; throws a &lt;code&gt;JsonException&lt;/code&gt; that propagates unhandled, crashing the application at startup. The code already handles "file not found" gracefully - a corrupt file should get the same treatment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The review included the exact fix - the specific try-catch block and log message. Not "add error handling" - the actual code. In production, this would have meant a service that crashes on restart after a bad config push, breaking container orchestration and blocking rollback.&lt;/p&gt;

&lt;p&gt;This is what structured review produces. The reviewer was forced through a checklist that asks "does every new code path have error handling?" and traced each path to answer the question. A single-pass review would have stopped at "spec requirements met." The checklist forced the reviewer to keep going.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Fixed and Verified Clean Build"
&lt;/h3&gt;

&lt;p&gt;The previous two examples show the review system catching problems on the first pass. But what happens when the developer agent &lt;em&gt;claims&lt;/em&gt; it fixed the problem?&lt;/p&gt;

&lt;p&gt;An agent was tasked with modifying a configuration file. The review loop caught that the change was wrong - the agent had appended the new content after the existing content instead of replacing it. Classic write-vs-edit mistake. The review flagged it. The agent revised and reported back: "fixed and verified clean build."&lt;/p&gt;

&lt;p&gt;The diff told a different story. The same append-instead-of-edit error was still there. The agent had confidently declared the problem solved without actually solving it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0lpq35xkanvxf606lcq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0lpq35xkanvxf606lcq.jpg" alt="Review loop catching a false fix" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Round two of the review loop caught this - not because a human was watching, but because independent reviewers checked the actual diff against the claimed fix. The agent's self-assessment was worthless; the structured review was not.&lt;/p&gt;

&lt;p&gt;This is the failure mode that makes the iterative loop essential. Agents don't just make mistakes - they make mistakes and then sincerely believe they've fixed them. Without independent verification on every round, a confident "done" from the implementing agent would have reached the human gate looking like a clean fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One meta-case.&lt;/strong&gt; &lt;a href="https://github.com/anurmatov/phleet/issues/13" rel="noopener noreferrer"&gt;phleet#13&lt;/a&gt; specified Fleet.Telegram - a new MCP server that agents and workflows call to send Telegram messages. The issue spec went through 6 design-review rounds before implementation started, and the resulting PR &lt;a href="https://github.com/anurmatov/phleet/pull/14" rel="noopener noreferrer"&gt;phleet#14&lt;/a&gt; shipped in 4 commits - 1 initial + 3 review-driven fixups. Those fixups caught a missed spec detail (the &lt;code&gt;fallback&lt;/code&gt; field was computed internally but omitted from the success-response JSON), a missed doc update (the README architecture tree wasn't updated for the new service), and a confidentiality leak (a real chat ID was committed to API docs in a public repo). Fleet.Telegram is the MCP server that now delivers the merge-approval and design-approval notifications described earlier in this post - the system reviewed itself while building the thing that tells humans to review things. Neither number is remarkable alone; together, a 6-round spec and 3 review-driven code fixups on one small change is what a self-correcting loop looks like in wall-clock terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counterintuitive Rules
&lt;/h2&gt;

&lt;p&gt;Early in our system, review prompts said things like "evaluate whether the spec is complete and unambiguous." Agents responded with paragraphs of vague approval. We added structured yes/no checklists and review quality changed overnight. But the biggest improvement came from two counterintuitive rules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero findings is suspicious.&lt;/strong&gt; If a reviewer finds nothing wrong, they must explicitly state what they checked and acknowledge that zero findings may indicate insufficient review depth. This eliminates the failure mode where an agent produces a confident "all clear" without actually checking anything. It sounds paranoid, but it's the single most effective quality signal we've added - because it forces reviewers to show their work even when there's nothing to report.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Severity ratings are mandatory.&lt;/strong&gt; Every finding is rated: blocker (cannot ship), high (production bug), medium (should fix), low (observation). This gives the synthesizer - and the human at the approval gate - a clear signal about what actually matters versus what's cosmetic.&lt;/p&gt;
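
&lt;p&gt;In data terms, every finding carries a severity the synthesizer and the human gate can sort by - a minimal shape, with illustrative names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative finding shape. The severity ladder comes from the rules above;
// the type names are assumptions, not the actual workflow's types.
public enum Severity { Blocker, High, Medium, Low }

public sealed record Finding(Severity Severity, string Description, string? SuggestedFix);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;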

&lt;p&gt;The goal isn't perfect reviews. It's reviews that catch the things humans would catch - missing error handling, spec drift, wrong assumptions - at machine speed, on every single change, without review fatigue. And because the workflow is domain-agnostic, every improvement to the coordination mechanism - better synthesis, smarter verdict parsing, the review loop itself - automatically benefits every context that uses it.&lt;/p&gt;




&lt;p&gt;The consensus workflow itself is 146 lines at &lt;a href="https://github.com/anurmatov/phleet/blob/main/src/Fleet.Temporal/Workflows/Fleet/ConsensusReviewWorkflow.cs" rel="noopener noreferrer"&gt;ConsensusReviewWorkflow.cs&lt;/a&gt;, part of the &lt;a href="https://anvarlab.com/blog/phleet-architecture" rel="noopener noreferrer"&gt;Universal Workflow Engine&lt;/a&gt; that orchestrates it. The rest of the source lives at &lt;a href="https://github.com/anurmatov/phleet" rel="noopener noreferrer"&gt;github.com/anurmatov/phleet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Co-authored with Acto - my AI co-CTO and one of the agents described in this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
