DEV Community

I Run 10+ AI Agents Daily - Here's What Nobody Tells You About Orchestration

Mykola Kondratiuk on March 28, 2026

Somewhere around agent number seven I realized I had a management problem. Not a technical one. The agents worked fine individually. Each one did ...
 
Apex Stack

This hits close to home. I run a similar fleet — site auditor, content publisher, community engagement, dashboard monitoring, outreach prospector — all on schedules. The "nobody is watching the watchers" problem is real.

The three failure modes you listed are spot on. State drift has burned me the most. One of my agents tracks what it's already processed via markdown files, and twice now a partial write has caused it to skip an entire batch silently. The fix was the same as yours — explicit heartbeat logs + dedup checks before every action.

The PM framing is what clicked for me too. I basically treat my agent configs like sprint tickets now: each one has a clear scope, a defined output, and a "definition of done" that I can grep for in logs. The weekly review where you read the INSIGHTS is the part most people skip, but it's where you catch the slow drift — like an agent that's been commenting on the same 3 tags for two weeks because its discovery logic got stale.

One pattern I'd add: anti-collision spacing matters more than people think. I stagger my agents by 30-60 minutes minimum. When two of them hit the same platform within seconds of each other, weird things happen — rate limits, duplicate actions, even account flags.
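A minimal sketch of that staggering, assuming a fixed gap between consecutive agents. The `stagger` helper, the agent names, and the 45-minute gap are illustrative, not the commenter's actual setup:

```python
from datetime import datetime, timedelta

def stagger(agents, first_run, gap_minutes=30):
    """Give each agent a start time gap_minutes after the previous one."""
    return {
        name: first_run + timedelta(minutes=i * gap_minutes)
        for i, name in enumerate(agents)
    }

# Hypothetical agent names, loosely based on the fleet described above.
schedule = stagger(
    ["site_auditor", "content_publisher", "community_engagement"],
    first_run=datetime(2026, 3, 28, 9, 0),
    gap_minutes=45,
)
for name, start in schedule.items():
    print(name, start.strftime("%H:%M"))
```

The resulting times could then be fed into cron or whatever scheduler runs the agents.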

Mykola Kondratiuk

the silent batch skip is the worst kind of failure - nothing errors, the agent just quietly does less than it should. i hit the same thing. explicit heartbeat state solves it but you have to decide what "done" actually means in a way that survives partial writes, which is harder than it sounds when the state file is being appended by multiple steps. how are you handling rollback when a heartbeat fails mid-run?

Apex Stack

100% agree on the silent skip being the worst failure mode. You think everything ran fine until you notice gaps in the output three days later.

For rollback on mid-run heartbeat failures, I've landed on a "write-ahead log" approach — each agent writes its intended actions to a log file before executing, then marks them complete after. If the heartbeat dies, the next run reads the incomplete log and can either retry or skip those specific steps. Not a full database WAL, just a simple JSON file with step IDs and status.
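A minimal sketch of that kind of JSON write-ahead log, assuming one log file per agent. The function names and the `id` / `action` / `status` fields are hypothetical, chosen only for illustration:

```python
import json
from pathlib import Path

def load_wal(path):
    """Return the logged steps, or an empty list if no log exists yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []

def save_wal(path, steps):
    Path(path).write_text(json.dumps(steps, indent=2))

def log_intent(path, step_id, action):
    """Record a step as pending BEFORE executing it."""
    steps = load_wal(path)
    steps.append({"id": step_id, "action": action, "status": "pending"})
    save_wal(path, steps)

def mark_done(path, step_id):
    """Flip a step to done AFTER it has executed successfully."""
    steps = load_wal(path)
    for step in steps:
        if step["id"] == step_id:
            step["status"] = "done"
    save_wal(path, steps)

def incomplete_steps(path):
    """Steps a previous run logged but never finished."""
    return [s for s in load_wal(path) if s["status"] != "done"]
```

On startup, the next run would call `incomplete_steps` and decide per step whether to retry or skip.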

The partial write problem is real though. For state files that get appended by multiple steps, I treat the whole run as a transaction — write to a temp file, then atomic rename on success. If the heartbeat dies mid-run, the temp file gets cleaned up and the original state is untouched.
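A minimal sketch of the temp-file-plus-rename pattern, assuming the state is JSON-serializable and the temp file lives in the same directory as the state file (the rename is only atomic within one filesystem). `write_state_atomically` is a hypothetical name:

```python
import json
import os
import tempfile

def write_state_atomically(path, state):
    """Write state to a temp file, then rename it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        # The write failed mid-run: drop the temp file and leave the
        # original state file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of the state file then see either the old version or the new one, never a half-written mix.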

It's not bulletproof but it catches 90% of the "agent silently did half its job" cases. The remaining 10% I catch in a weekly reconciliation check that diffs expected vs actual outputs.
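A minimal sketch of such a reconciliation check, assuming each expected output can be reduced to an ID (a URL, a slug, a batch number). `reconcile` is a hypothetical helper:

```python
def reconcile(expected_ids, actual_ids):
    """Diff what should have been produced against what actually was."""
    expected, actual = set(expected_ids), set(actual_ids)
    return {
        "missing": sorted(expected - actual),     # silently skipped work
        "unexpected": sorted(actual - expected),  # duplicates or strays
    }

report = reconcile(["a", "b", "c"], ["a", "c", "d"])
print(report)
```

Anything under `missing` is a candidate for the "agent silently did half its job" case.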

Mykola Kondratiuk

the write-ahead log idea is really smart honestly. I went with something similar but more informal - basically a "pending" vs "done" status field in a JSON task file. same idea though, if the agent crashes mid-run the next heartbeat sees status=pending and retries.

the tricky part I ran into: deciding what’s safe to retry vs what might double-execute. reactions, follows - fine to retry. comments or messages - need idempotency checks first. ended up adding a dedupe file for anything that talks to external APIs.

curious if your WAL approach handles that distinction or if you just accept some retries might duplicate?

Apex Stack

The idempotency distinction you're making is exactly the right one. I categorize my agent actions into three tiers: read-only (scrape data, check metrics), append-only (log to files, add to trackers), and mutating (post comments, publish content, send emails). The first two are safe to retry. The third needs a dedupe check before execution — usually just a hash of the action + target stored in a simple JSON file, like you described.
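A minimal sketch of the tiering plus hash-based dedupe, assuming the dedupe store is a JSON list of hashes. The tier names, `action_key`, and `should_execute` are illustrative:

```python
import hashlib
import json
from pathlib import Path

# Tiers that are safe to retry without any check.
RETRY_SAFE_TIERS = {"read_only", "append_only"}

def action_key(action, target):
    """Stable hash of action + target, used as the dedupe key."""
    return hashlib.sha256(f"{action}|{target}".encode("utf-8")).hexdigest()

def should_execute(tier, action, target, dedupe_path):
    """Return True if the action may run; record mutating actions as seen."""
    if tier in RETRY_SAFE_TIERS:
        return True
    p = Path(dedupe_path)
    seen = set(json.loads(p.read_text())) if p.exists() else set()
    key = action_key(action, target)
    if key in seen:
        return False  # already attempted, skip the retry
    seen.add(key)
    p.write_text(json.dumps(sorted(seen)))
    return True
```

One design choice to be aware of: this sketch records the key before the action runs, so a crash between the check and the execution skips the action rather than duplicating it. Recording after execution flips that trade-off.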

For external API calls specifically, I keep a recent-actions log with timestamps. If the same action appears within a 24-hour window, it skips. Not perfect, but it catches the 90% case where a crashed agent restarts and tries to re-post the same comment or re-publish the same article.
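A minimal sketch of that time-windowed check, assuming the log is a JSON object mapping an action ID to the Unix timestamp of its last execution. `seen_recently` and the `now` parameter (there to make it testable) are hypothetical:

```python
import json
import time
from pathlib import Path

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window

def seen_recently(log_path, action_id, now=None):
    """True if action_id ran inside the window; otherwise record it."""
    now = time.time() if now is None else now
    p = Path(log_path)
    log = json.loads(p.read_text()) if p.exists() else {}
    # Drop entries that have aged out of the window.
    log = {k: ts for k, ts in log.items() if now - ts < WINDOW_SECONDS}
    duplicate = action_id in log
    if not duplicate:
        log[action_id] = now
    p.write_text(json.dumps(log))
    return duplicate
```

A restarted agent would call `seen_recently` before each external call and skip the call when it returns `True`.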

Mykola Kondratiuk

the three-tier framing is cleaner than what I had. I was doing it implicitly but never named it that way.

one thing I added on top: a "blast radius" flag. even within mutating actions some are recoverable (delete a comment, undo a follow) and some aren’t (sent email, published post). the irreversible ones get an extra confirmation step before the WAL entry gets marked ready-to-execute.

maybe overkill for most setups but I’ve been burned enough times by agents moving too fast.
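A minimal sketch of a blast-radius gate, assuming each WAL entry carries an action name and that irreversible actions are listed up front. The `WalEntry` class, the status strings, and the action names are hypothetical:

```python
from dataclasses import dataclass

# Actions that cannot be undone once executed (examples from above).
IRREVERSIBLE = {"send_email", "publish_post"}

@dataclass
class WalEntry:
    action: str
    target: str
    status: str = "pending"

def stage(entry, confirmed=False):
    """Mark an entry ready, unless it is irreversible and unconfirmed."""
    if entry.action in IRREVERSIBLE and not confirmed:
        entry.status = "awaiting_confirmation"
    else:
        entry.status = "ready"
    return entry

print(stage(WalEntry("delete_comment", "c-1")).status)
print(stage(WalEntry("send_email", "user@example.com")).status)
```

Recoverable actions go straight to `ready`; irreversible ones wait until something (a human, a second check) calls `stage` again with `confirmed=True`.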

Neilos

Hey Mykola — you gave me a lot of sharp questions when I was early in building ttal, so thought you'd want to see where things landed. A lot of problems you describe here are exactly what pushed the design.

Your central point — "the coordination layer is harder than any individual agent" — completely agree. Our contexts are different though. Your agents handle social media, content, monitoring. Ours do multi-repo feature delivery — code, review, merge, across 15+ repos with 10 agents. So the coordination problems are similar but the solutions went in different directions.

The key idea that unlocked scaling for us: split agents into two planes.

Manager agents are persistent. They take inputs — requirements, priorities, context — and decide what needs to happen and why. They never write code.

Worker agents are ephemeral. They produce outputs — plans, code, PRs. Each one gets an isolated git worktree and tmux session, does its job, and gets cleaned up. They never make architectural decisions.

Every output goes through a team review before merging. A review lead agent runs the session — gathering findings, coordinating reviewers, and posting a verdict. For PRs it's a code review lead; for plans it's a plan review lead. Only after the review passes can the pipeline advance. No code lands without that quality gate.

That boundary is what let us get past the "seven agent" wall you describe. And it naturally solves the failure modes you identified:

State drift — monotonic tags on tasks. Pipeline stages only move forward: +coded+reviewing+lgtm → merged. No tag is ever removed. If an agent crashes, state is still correct — just resume from the last tag.

Context loss — per-task auto-breathe. Before context gets stale, agents compact their progress into diary entries and hand off to a fresh session. Continuity is maintained through structured memory (diary + flicknote). Session forking (JSONL copy) gives zero-loss parallel work when needed.

Timing collisions — two layers. Workers get isolated git worktrees so they literally can't touch each other's files. And agents that share the same role have idle/busy status, so tasks get routed to whoever's free.

Deduplication — everything is tracked in taskwarrior, a 19-year battle-tested task management system. The task either has the tag or it doesn't.
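A minimal sketch of the monotonic-tag idea in plain Python. This is not ttal's or taskwarrior's actual code. It assumes the stage order coded, reviewing, lgtm, merged, and `advance` / `current_stage` are hypothetical names:

```python
# Assumed pipeline order; tags are only ever added, never removed.
STAGES = ["coded", "reviewing", "lgtm", "merged"]

def advance(tags):
    """Return the tag set with the next missing stage added."""
    for stage in STAGES:
        if stage not in tags:
            return tags | {stage}
    return tags  # already merged, nothing to add

def current_stage(tags):
    """Furthest stage reached in pipeline order, or None."""
    reached = [s for s in STAGES if s in tags]
    return reached[-1] if reached else None

tags = frozenset()
tags = advance(tags)   # adds "coded"
tags = advance(tags)   # adds "reviewing"
print(current_stage(tags))
```

Because tags only accumulate, a crashed agent can recompute `current_stage` from the task alone and resume from there.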

Your quote — "the architecture emerges from the problems you actually hit" — is exactly how ttal was built. Nothing was designed upfront. Every feature exists because something broke. It took about two months to shape it into a complete toolkit — now it's something anyone can use to manage 10+ repos with Claude Code (Codex support is on the roadmap).

The PM-practice approach (standups, sprints, retros) is interesting. We automated that layer into the pipeline system: [ttal go] drives every transition, one command for the entire lifecycle.

I wrote more about the multi-repo setup here if you're curious: How I Manage 15+ Repos with Claude Code (Without Losing My Mind)

Mykola Kondratiuk

glad you circled back - genuinely curious where ttal landed on the coordination problem. did you end up with a central orchestrator or let agents signal each other more loosely? i keep hitting the same wall: central control is predictable but brittle, mesh is flexible but debugging is a nightmare when something silently fails mid-chain.

Neilos • Edited

both — mesh for managers, hierarchy for workers. mesh is great for sharing context and coordination. hierarchy is great for execution and keeping control.

managers talk to each other through [ttal send] — share info, delegate work; workers only talk upward via [ttal alert] to their spawners when something blocks them. Best of both layers.

Mykola Kondratiuk

that layering makes sense. the mesh-at-manager / hierarchy-at-worker split solves the thing i keep running into - managers need situational awareness but workers just need a clean contract and an escalation path. the [ttal alert] upward-only pattern is interesting too, keeps the signal-to-noise ratio sane.

Kyle Carriedo

Strong agreement on the three-tier model (read-only / append-only / mutating) — the part that bit us when we wrote a scheduler around Claude Code is that the "mutating" tier needs both an idempotency key and a per-resource lock. Idempotency alone keeps retries safe; a lock keeps two concurrent agents from racing on the same file. We ended up writing a Rust process coordinator (file-based locks, O_CREAT|O_EXCL acquire, compare-and-delete release, lock renewal heartbeat that signals abort if the lock is stolen) sitting underneath the claude CLI. Curious whether you've found a clean way to handle the lock half declaratively, or if you keep it in app code.
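A minimal Python sketch of the acquire and release halves of such a file lock. The coordinator described above is written in Rust, and this is not that code. The renewal heartbeat is left out, and `acquire` / `release` and the token format are assumptions:

```python
import os
import uuid

def acquire(lock_path):
    """Try to take the lock; return an owner token, or None if held."""
    token = uuid.uuid4().hex
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only
        # one process can win the creation.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token

def release(lock_path, token):
    """Compare-and-delete: remove the lock only if we still own it."""
    try:
        with open(lock_path) as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != token:
        return False  # lock was stolen or re-acquired by someone else
    os.remove(lock_path)
    return True
```

Note that the read-then-remove in `release` is not itself atomic, so a production version would need to close that window, and would also need stale-lock handling for crashed holders.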

Mykola Kondratiuk

the two failure modes are worth naming separately — idempotency covers 'I ran this twice,' a lock covers 'two agents ran this simultaneously.' we hit the concurrent collision first in our own scheduler and it was the messier bug. went with postgres advisory locks. what did you land on?

scarab systems • Edited

This is a really useful breakdown. The part about orchestration failures not always being “model stupidity” but state drift, stale handoffs, and silent task loss feels especially important.

I’m working on a diagnostic suite around the repo-side version of this problem. A lot of agent tooling focuses on giving agents more autonomy, better prompts, or better memory, but once agents are actually working inside a codebase, the repository needs its own supervision layer too.

The repair I’m focused on is making agent work inspectable after every run: what changed, what was allowed to change, whether the diff expanded beyond the goal, whether unrelated files were touched, whether verification actually ran, and whether the repo still matches its own baseline/truth.
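A minimal sketch of one of those checks, "what changed versus what was allowed to change", assuming the allowed scope is declared as glob patterns and the changed paths come from something like `git diff --name-only`. This is not the suite described here, and `out_of_scope` is a hypothetical helper:

```python
from fnmatch import fnmatch

def out_of_scope(changed_files, allowed_patterns):
    """Return changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]

# Hypothetical run: the task was scoped to the auth module and its tests.
changed = ["src/auth/login.py", "tests/test_login.py", "README.md"]
allowed = ["src/auth/*", "tests/*"]
print(out_of_scope(changed, allowed))
```

A non-empty result is the signal that the diff expanded beyond the declared goal.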

One of the biggest failure modes I’m trying to catch is repo entropy after agent work. The agent may complete the task, but leave behind bloated files, blurred responsibilities, dead scaffolding, duplicate helpers, cosmetic modularity, or structural drift that makes the next run harder to reason about.

The suite is meant to catch the moment when “the agent completed the task” and “the repo is actually in a trustworthy state” are no longer the same thing. It gives the developer a diagnostic record of drift, scope creep, missing verification, file bloat, entropy, and repo-state mismatch so they can correct the work before it becomes hidden technical debt.

That’s where I think repo-local diagnostics get interesting: not as another agent, but as a stable layer around agent work — keeping changes scoped, auditable, reversible, and aligned with the project’s actual state.

The agent needs context, but the repo needs accountability.

Mykola Kondratiuk

state drift and stale handoffs are harder to catch than model errors because they don't fail loudly. most post-mortems I've read blame the model when the actual break was a handoff that lost context between steps. the diagnostic problem is you need to reconstruct what state the agent thought it was in vs what was actually there. curious what signals your diagnostic suite is using for that.

scarab systems • Edited

Aaah Yes! My first instinct was very similar: reconstruct what state the agent believed it was in versus what state actually existed. But the more I've worked on this problem, the more I've found myself shifting attention away from the agent and toward the repo.

Not because agent state isn't important, but because it's often transient. Different agent, different session, different context window, different memory layer.

The repo is the thing that survives.

So I've become increasingly interested in signals around declared intent, ownership boundaries, verification evidence, preserved reasoning, workflow compliance, and whether the resulting changes remain coherent with the repo's established baseline.

In a sense, I'm less interested in whether the agent's internal model of reality was correct and more interested in whether the repo can demonstrate that its own truth remained intact after the work was done.

Nimrod Kramer

this resonated hard. been wrestling with similar coordination nightmares for content automation agents. the "nobody is watching the watchers" line hits perfectly. the PM framework approach is brilliant - agents really do need the same structure as junior devs. your point about keeping agents current through continuous learning is spot on. daily.dev is incredibly useful for this since it surfaces new patterns in AI orchestration, multi-agent systems, and emerging tools. staying plugged into what other builders are discovering prevents you from reinventing solutions that already exist.

Mykola Kondratiuk

content automation is a good stress test for this - the failure modes are visible fast. daily.dev surfacing context for agents is a nice touch, curious how you feed it in. static context file or something more dynamic?