Mykola Kondratiuk

Posted on Mar 28

I Run 10+ AI Agents Daily - Here's What Nobody Tells You About Orchestration

#ai #vibecoding #webdev #productivity

Somewhere around agent number seven I realized I had a management problem.

Not a technical one. The agents worked fine individually. Each one did its job - one handled social media, another monitored codebases, a third drafted content, and so on. The problem was that nobody was watching the watchers.

The meta-orchestration problem

When you run one AI agent, you manage it directly. Check its output, fix its prompts, adjust its schedule. Simple.

When you run ten, you need something to manage the managers. And that something turns out to be... mostly project management. The same boring PM skills I've been using for fifteen years.

Here's what I mean. Agent A posts a comment. Agent B is supposed to track whether anyone replies. Agent C is supposed to draft a follow-up based on the reply context. Sounds great in theory. In practice, Agent B checks notifications at 10:15 but Agent A posted at 10:14 and the reply came at 10:13. Agent C never fires because Agent B never saw the reply.

The coordination layer is harder than any individual agent.

What actually breaks

It's never the AI model. Claude is smart enough. GPT is smart enough. The failure mode is always one of three things:

State drift. Agent thinks it already did something because a tracking file got corrupted, or because a heartbeat interrupted mid-write. Now it skips the task forever. You don't notice for three days.

Context loss. Agent A has context that Agent B needs but there's no clean way to pass it. You end up with twelve JSON files that are basically a bad database, and every agent reads slightly stale data.

Timing collisions. Two agents try to update the same file within seconds of each other. One wins, one loses. The loser's work just... disappears. No error. No log. Just gone.

The PM framework I accidentally built

After enough of these failures I started treating my agent fleet like a team. Sounds obvious in retrospect but it took me a while.

Daily standups (automated). Each agent writes a heartbeat log at the end of every run. Not just "done" but what it did, what it skipped, what failed. I grep these every morning.

Sprint planning (for real). Each agent has a schedule.json that gets generated the night before. Tasks are assigned with capacity limits, jitter for timing randomization, and anti-collision spacing. Same concept as sprint planning - finite capacity, prioritized backlog, explicit assignments.

Retrospectives (weekly insights). Each agent updates an INSIGHTS.md with what worked and what didn't. I review these on Sundays. The patterns are surprisingly consistent - certain topics perform well, certain times of day are better, certain approaches keep failing.

Deduplication as a first-class concern. Every agent maintains tracking files. Before doing anything, it checks "did I already do this?" This is the single most important thing. Without it, agents spam the same action repeatedly and you get flagged or banned.

The uncomfortable realization

The agents don't need better AI. They need better project management.

Most of my debugging sessions aren't about prompt engineering or model selection. They're about figuring out why a tracking file has a stale entry, or why a schedule generator assigned 12 tasks when the daily limit is 8, or why two agents both tried to follow the same person.

It's ops work. It's coordination work. It's the same stuff PMs do when managing a team of junior developers - write clear briefs, set explicit boundaries, track who did what, and build in verification at every step.

What I'd tell someone starting this

Don't start with the orchestration framework. Start with one agent. Make it work reliably for a week. Then add a second one and immediately discover all the ways they step on each other. Fix those. Then maybe add a third.

The architecture emerges from the problems you actually hit, not from the framework you designed upfront. Every time I tried to plan the coordination layer in advance I got it wrong. Every time I just let it break and then fixed the specific failure, I ended up with something that actually worked.

Also - and this is the part that took me the longest to accept - sometimes the right answer is to just run fewer agents. Not every task needs automation. The ones that do need it tend to make that pretty obvious.

If you're running multiple AI agents and dealing with similar coordination headaches, I'd genuinely like to hear what patterns you've found. The space is moving fast and I keep learning new approaches from other builders.

Top comments (30)

Apex Stack • Mar 28

This hits close to home. I run a similar fleet — site auditor, content publisher, community engagement, dashboard monitoring, outreach prospector — all on schedules. The "nobody is watching the watchers" problem is real.

The three failure modes you listed are spot on. State drift has burned me the most. One of my agents tracks what it's already processed via markdown files, and twice now a partial write has caused it to skip an entire batch silently. The fix was the same as yours — explicit heartbeat logs + dedup checks before every action.

The PM framing is what clicked for me too. I basically treat my agent configs like sprint tickets now: each one has a clear scope, a defined output, and a "definition of done" that I can grep for in logs. The weekly review where you read the INSIGHTS is the part most people skip, but it's where you catch the slow drift — like an agent that's been commenting on the same 3 tags for two weeks because its discovery logic got stale.

One pattern I'd add: anti-collision spacing matters more than people think. I stagger my agents by 30-60 minutes minimum. When two of them hit the same platform within seconds of each other, weird things happen — rate limits, duplicate actions, even account flags.

Mykola Kondratiuk • Mar 28

the silent batch skip is the worst kind of failure - nothing errors, the agent just quietly does less than it should. i hit the same thing. explicit heartbeat state solves it but you have to decide what "done" actually means in a way that survives partial writes, which is harder than it sounds when the state file is being appended by multiple steps. how are you handling rollback when a heartbeat fails mid-run?

Apex Stack • Mar 29

100% agree on the silent skip being the worst failure mode. You think everything ran fine until you notice gaps in the output three days later.

For rollback on mid-run heartbeat failures, I've landed on a "write-ahead log" approach — each agent writes its intended actions to a log file before executing, then marks them complete after. If the heartbeat dies, the next run reads the incomplete log and can either retry or skip those specific steps. Not a full database WAL, just a simple JSON file with step IDs and status.

The partial write problem is real though. For state files that get appended by multiple steps, I treat the whole run as a transaction — write to a temp file, then atomic rename on success. If the heartbeat dies mid-run, the temp file gets cleaned up and the original state is untouched.

It's not bulletproof but it catches 90% of the "agent silently did half its job" cases. The remaining 10% I catch in a weekly reconciliation check that diffs expected vs actual outputs.

Mykola Kondratiuk • Mar 29

the write-ahead log idea is really smart honestly. I went with something similar but more informal - basically a "pending" vs "done" status field in a JSON task file. same idea though, if the agent crashes mid-run the next heartbeat sees status=pending and retries.

the tricky part I ran into: deciding what’s safe to retry vs what might double-execute. reactions, follows - fine to retry. comments or messages - need idempotency checks first. ended up adding a dedupe file for anything that talks to external APIs.

curious if your WAL approach handles that distinction or if you just accept some retries might duplicate?

Apex Stack • Mar 29

The idempotency distinction you're making is exactly the right one. I categorize my agent actions into three tiers: read-only (scrape data, check metrics), append-only (log to files, add to trackers), and mutating (post comments, publish content, send emails). The first two are safe to retry. The third needs a dedupe check before execution — usually just a hash of the action + target stored in a simple JSON file, like you described.

For external API calls specifically, I keep a recent-actions log with timestamps. If the same action appears within a 24-hour window, it skips. Not perfect, but it catches the 90% case where a crashed agent restarts and tries to re-post the same comment or re-publish the same article.

Mykola Kondratiuk • Mar 29

the three-tier framing is cleaner than what I had. I was doing it implicitly but never named it that way.

one thing I added on top: a "blast radius" flag. even within mutating actions some are recoverable (delete a comment, undo a follow) and some aren’t (sent email, published post). the irreversible ones get an extra confirmation step before the WAL entry gets marked ready-to-execute.

maybe overkill for most setups but I’ve been burned enough times by agents moving too fast.

Apex Stack • Mar 29

The "blast radius" flag is a great addition — I hadn't formalized that distinction but you're right, it matters a lot. I have a similar setup where my deploy pipeline won't sync to production unless the build produces at least 1,000 HTML pages. That's essentially a blast radius check: "if this action could wipe production, add a gate."

The WAL pattern for agent actions is clever. I've been doing something cruder — each agent writes to its own log file, and a review process reads them all. But a proper WAL with pending/done states and idempotency checks would catch the edge cases mine misses, especially around retries on external API calls.

Definitely not overkill. The agents that cause the most damage are the ones that look like they succeeded.

Mykola Kondratiuk • Mar 29

the blast radius framing is really sharp - naming it explicitly makes it much easier to build the gate logic around. your HTML page count check is a great concrete version of that.

my WAL is still pretty crude honestly - each agent just has a status field in its task JSON and I scan them. works but doesn’t scale well past ~15 agents without tooling to aggregate. been thinking about a central log that all agents write to, but then you get contention issues.

what’s your rollback story when the WAL shows an incomplete action? retry or skip?

Apex Stack • Mar 30

Depends on the action tier. For read-only and append-only actions, always retry — worst case you get a duplicate log entry. For mutating actions (posting, publishing, sending), I default to skip + flag for manual review.

The decision tree is basically: can I verify the external state? If yes (check if the comment already exists, check if the article was published), retry with a pre-check. If no (sent an email — no way to un-send), skip and log it as "needs human eyes."

For the central log contention issue you mentioned — I sidestepped it by giving each agent its own log file and having a separate lightweight reconciler that merges them on a schedule. No contention, and the merge is just a sorted concat. Scales better than a shared write target.

Mykola Kondratiuk • Mar 30

the "can I verify external state" framing is exactly right. pre-check before retry on mutating actions is the pattern that actually holds up. skip + flag feels like giving up but honestly it is the safer default when state is ambiguous - retrying blind on a publish or send is how you end up with duplicates that are painful to clean up.

Apex Stack • Mar 30

Exactly — skip + flag is underrated because it feels like giving up, but duplicates from blind retries are so much worse to debug. Especially with publishing actions where you can't easily undo (social media posts, email sends, etc.).

The pre-check pattern has saved me multiple times. For cross-posting articles, I check if a post with the same title already exists before creating. For file operations, I verify the output doesn't already contain the expected data. It's a few extra API calls but the peace of mind is worth it.

The action tier concept is a good mental model. I basically use the same split: read-only = retry aggressively, append-only = check-then-retry, mutating = skip-and-flag. Wish more agent frameworks made this explicit instead of having one global retry policy.

Mykola Kondratiuk • Mar 30

the title dedup check for cross-posting is exactly the pattern - a few extra calls is always cheaper than untangling a duplicate publish. social and email are the worst because there's no rollback, just damage control.

Scarab Systems • May 27 • Edited

This is a really useful breakdown. The part about orchestration failures not always being “model stupidity” but state drift, stale handoffs, and silent task loss feels especially important.

I’m working on a diagnostic suite around the repo-side version of this problem. A lot of agent tooling focuses on giving agents more autonomy, better prompts, or better memory, but once agents are actually working inside a codebase, the repository needs its own supervision layer too.

The repair I’m focused on is making agent work inspectable after every run: what changed, what was allowed to change, whether the diff expanded beyond the goal, whether unrelated files were touched, whether verification actually ran, and whether the repo still matches its own baseline/truth.

One of the biggest failure modes I’m trying to catch is repo entropy after agent work. The agent may complete the task, but leave behind bloated files, blurred responsibilities, dead scaffolding, duplicate helpers, cosmetic modularity, or structural drift that makes the next run harder to reason about.

The suite is meant to catch the moment when “the agent completed the task” and “the repo is actually in a trustworthy state” are no longer the same thing. It gives the developer a diagnostic record of drift, scope creep, missing verification, file bloat, entropy, and repo-state mismatch so they can correct the work before it becomes hidden technical debt.

That’s where I think repo-local diagnostics get interesting: not as another agent, but as a stable layer around agent work — keeping changes scoped, auditable, reversible, and aligned with the project’s actual state.

The agent needs context, but the repo needs accountability.

Mykola Kondratiuk • May 29

state drift and stale handoffs are harder to catch than model errors because they don't fail loudly. most post-mortems I've read blame the model when the actual break was a handoff that lost context between steps. the diagnostic problem is you need to reconstruct what state the agent thought it was in vs what was actually there. curious what signals your diagnostic suite is using for that.

Scarab Systems • May 29 • Edited

Aaah Yes! My first instinct was very similar: reconstruct what state the agent believed it was in versus what state actually existed. But the more I've worked on this problem, the more I've found myself shifting attention away from the agent and toward the repo.

Not because agent state isn't important, but because it's often transient. Different agent, different session, different context window, different memory layer.

The repo is the thing that survives.

So I've become increasingly interested in signals around declared intent, ownership boundaries, verification evidence, preserved reasoning, workflow compliance, and whether the resulting changes remain coherent with the repo's established baseline.

In a sense, I'm less interested in whether the agent's internal model of reality was correct and more interested in whether the repo can demonstrate that its own truth remained intact after the work was done.

Mykola Kondratiuk • May 30

makes sense to anchor to the repo if that’s the closest thing to ground truth - what led you there specifically? i hit similar issues where the agent was reasoning from a cached context the repo had long since diverged from

Scarab Systems • May 30

Mostly from watching the same failure repeat in different forms.

At first I was looking at the agent/session as the thing to diagnose, but the problem was that the agent’s state kept changing: different context window, different cached memory, different run, different assumptions. Meanwhile the repo was still sitting there carrying the consequences.

So the shift for me was realizing that the repo is where drift becomes durable.

The agent may be reasoning from stale context, but the repo can show what actually changed, what boundaries were crossed, whether verification happened, whether prior assumptions still hold, and whether the work still matches the project’s declared baseline.

That’s why I started thinking less in terms of “what did the agent think?” and more in terms of “what can the repo prove after the agent is done?”

Mykola Kondratiuk • May 30

that repo-as-ground-truth framing clicked for me too. the agent state is just a view at a point in time, the repo is the thing that actually accumulates. i started treating the agent almost like a stateless function over the repo rather than a thing with its own memory.

Scarab Systems • May 30 • Edited

precisely! expecting the agent to maintain context on it's own dilutes the true advantage...providing the repo owned guardrails and checks at the beginning and throughout allows the agent to do what it does best! code

at that stage you can then truly create the repo you want because you aren't constantly repairing the work...

Mykola Kondratiuk • May 30

guardrails-at-start is useful, but it breaks when the repo itself is inconsistent — stale READMEs, deprecated patterns still in codebase, conflicting configs. the agent's ground truth is only as clean as the repo. the real leverage is continuous hygiene, not just upfront constraints.

Scarab Systems • May 30

Yes — exactly. Upfront constraints only work if the repo they point to is still coherent.

That’s why I think the baseline itself has to be treated as something active, not static. If the README is stale, patterns are deprecated, configs conflict, or prior fixes created hidden inconsistency, then the agent is still being grounded against bad truth.

So the missing layer isn’t just “give the agent rules at the start.” It’s ongoing repo hygiene: inspect the current state, refresh the baseline, catch where the repo has drifted from its own stated rules, and repair that before the next agent run inherits the mess as truth.

That’s the part that feels most important to me: the repo can be the anchor, but only if it is continuously kept trustworthy.

Mykola Kondratiuk • May 30

yeah, even actively-maintained repos have intentional mess - mid-migration configs, known debt, patterns in transition. agent cant tell stale-by-neglect from stale-by-design without some annotation layer. maintenance cadence alone doesnt close that gap.

Neilos • Mar 28

Hey Mykola — you gave me a lot of sharp questions when I was early in building ttal, so thought you'd want to see where things landed. A lot of problems you describe here are exactly what pushed the design.

Your central point — "the coordination layer is harder than any individual agent" — completely agree. Our contexts are different though. Your agents handle social media, content, monitoring. Ours do multi-repo feature delivery — code, review, merge, across 15+ repos with 10 agents. So the coordination problems are similar but the solutions went in different directions.

The key idea that unlocked scaling for us: split agents into two planes.

Manager agents are persistent. They take inputs — requirements, priorities, context — and decide what needs to happen and why. They never write code.

Worker agents are ephemeral. They produce outputs — plans, code, PRs. Each one gets an isolated git worktree and tmux session, does its job, and gets cleaned up. They never make architectural decisions.

Every output goes through a team review before merging. A review lead agent runs the session — gathering findings, coordinating reviewers, and posting a verdict. For PRs it's a code review lead; for plans it's a plan review lead. Only after the review passes can the pipeline advance. No code lands without that quality gate.

That boundary is what let us get past the "seven agent" wall you describe. And it naturally solves the failure modes you identified:

State drift — monotonic tags on tasks. Pipeline stages only move forward: +coded → +reviewing → +lgtm → merged. No tag is ever removed. If an agent crashes, state is still correct — just resume from the last tag.

Context loss — per-task auto-breathe. Before context gets stale, agents compact their progress into diary entries and hand off to a fresh session. Continuity is maintained through structured memory (diary + flicknote). Session forking (JSONL copy) gives zero-loss parallel work when needed.

Timing collisions — two layers. Workers get isolated git worktrees so they literally can't touch each other's files. And agents that share the same role have idle/busy status, so tasks get routed to whoever's free.

Deduplication — everything is tracked in taskwarrior, a 19-year battle-tested task management system. The task either has the tag or it doesn't.

Your quote — "the architecture emerges from the problems you actually hit" — is exactly how ttal was built. Nothing was designed upfront. Every feature exists because something broke. It took about two months to shape it into a complete toolkit — now it's something anyone can use to manage 10+ repos with Claude Code (Codex support is on the roadmap).

The PM-practice approach (standups, sprints, retros) is interesting — we automated that layer into the pipeline system. ttal go drives every transition, one command for the entire lifecycle.

I wrote more about the multi-repo setup here if you're curious: How I Manage 15+ Repos with Claude Code (Without Losing My Mind)

Mykola Kondratiuk • Mar 28

glad you circled back - genuinely curious where ttal landed on the coordination problem. did you end up with a central orchestrator or let agents signal each other more loosely? i keep hitting the same wall: central control is predictable but brittle, mesh is flexible but debugging is a nightmare when something silently fails mid-chain.

Neilos • Mar 28 • Edited

both — mesh for managers, hierarchy for workers. mesh is great for sharing context and coordination. hierarchy is great for execution and keeping control.

managers talk to each other through [ttal send] — share info, delegate work; workers only talk upward via [ttal alert] to their spawners when something blocks them. Best of both layers.

Mykola Kondratiuk • Mar 28

that layering makes sense. the mesh-at-manager / hierarchy-at-worker split solves the thing i keep running into - managers need situational awareness but workers just need a clean contract and an escalation path. the [ttal alert] upward-only pattern is interesting too, keeps the signal-to-noise ratio sane.

Kyle Carriedo • May 16

Strong agreement on the three-tier model (read-only / append-only / mutating) — the part that bit us when we wrote a scheduler around Claude Code is that the "mutating" tier needs both an idempotency key and a per-resource lock. Idempotency alone keeps retries safe; a lock keeps two concurrent agents from racing on the same file. We ended up writing a Rust process coordinator (file-based locks, O_CREAT|O_EXCL acquire, compare-and-delete release, lock renewal heartbeat that signals abort if the lock is stolen) sitting underneath the claude CLI. Curious whether you've found a clean way to handle the lock half declaratively, or if you keep it in app code.

Mykola Kondratiuk • May 16

the two failure modes are worth naming separately — idempotency covers 'I ran this twice,' a lock covers 'two agents ran this simultaneously.' we hit the concurrent collision first in our own scheduler and it was the messier bug. went with postgres advisory locks. what did you land on?

Nimrod Kramer • Mar 30

this resonated hard. been wrestling with similar coordination nightmares for content automation agents. the "nobody is watching the watchers" line hits perfectly. the PM framework approach is brilliant - agents really do need the same structure as junior devs. your point about keeping agents current through continuous learning is spot on. daily.dev is incredibly useful for this since it surfaces new patterns in AI orchestration, multi-agent systems, and emerging tools. staying plugged into what other builders are discovering prevents you from reinventing solutions that already exist.

Mykola Kondratiuk • Mar 30

content automation is a good stress test for this - the failure modes are visible fast. daily.dev surfacing context for agents is a nice touch, curious how you feed it in. static context file or something more dynamic?

View full discussion (30 comments)