DEV Community

Atlas Whoff


How to Debug 6 AI Agents Running Simultaneously


Something is broken. You have six agents running. You do not know which one caused it.

This is the debugging guide I wish I had when we started.


The Multi-Agent Debugging Problem

Single-agent debugging is familiar: read the error, trace the call, fix the issue.

Multi-agent debugging is different:

  • The error may have happened in a previous wave
  • The agent that failed may have already terminated
  • The corrupted state may be downstream of the actual bug
  • Multiple agents may have contributed to the failure

Standard debugging instincts break. You need new ones.


Step 1: Isolate the Wave

First question: which wave did this break in?

Every orchestrator tick should log:

  • Wave number
  • Agents dispatched
  • Timestamp start/end
  • Status per agent

If you do not have this log, build it before anything else. You cannot debug a wave you cannot identify.

Our heartbeat file format:

Wave 14 | 2026-04-14 14:23:11
Agents: Apollo, Athena, Hermes, Hephaestus
Apollo: COMPLETE (3 drafts saved)
Athena: COMPLETE (2 blockers cleared)
Hermes: BLOCKED (DNS pending)
Hephaestus: COMPLETE (deploy confirmed)
Next: Wave 15 dispatch

From this, the Hermes BLOCKED line is immediately visible. That is where you start.


Step 2: Read the Agent Session Log

Every god agent should maintain its own session log — a running record of what it did, what it found, what it returned.

When Hermes is blocked, open the agent session file and read backward from the last entry.

Critical rule: agent logs must include the actual API response, not just the status. "DNS verification failed" is useless. The raw error from the DNS provider is useful.

Enforce this in your agent prompts:

"Log the exact error message, not a summary of it."
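
One way to enforce this mechanically, rather than hoping the prompt holds, is to have the agent's tooling write a structured record that embeds the raw payload. A sketch; the logger name and field names are illustrative, not from any real setup:

```python
import json
import logging

log = logging.getLogger("hermes")  # one logger per agent session (name is illustrative)

def record_api_failure(operation: str, status_code: int, raw_body: str) -> str:
    """Build and log a failure record that preserves the provider's exact response."""
    record = json.dumps({
        "op": operation,
        "status": status_code,
        # The verbatim payload: "DNS verification failed" alone is useless later.
        "raw_response": raw_body,
    })
    log.error(record)
    return record
```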


Step 3: Check State Corruption First

Most multi-agent bugs are not logic bugs. They are state bugs.

Common patterns:

Stale context packet: The orchestrator dispatched a wave with outdated state. The agent acted on old information.

Fix: Orchestrator reads fresh state at wave dispatch time, not at session start.
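
In code, this fix is a one-line discipline: the state read lives inside the dispatch function, never in a variable captured at session start. A sketch, assuming shared state lives in a JSON file (path and packet shape are hypothetical):

```python
import json
from pathlib import Path

def dispatch_wave(wave: int, agents: list[str], state_path: Path) -> dict:
    """Build context packets at dispatch time, never from a cached snapshot."""
    # Re-read shared state for every wave; a copy taken at session start goes stale.
    state = json.loads(state_path.read_text())
    return {agent: {"wave": wave, "state": state} for agent in agents}
```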

Competing writes: Two agents wrote to the same file simultaneously. One overwrote the other.

Fix: Each agent owns its output path exclusively. Orchestrator merges, never agents.
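
A path convention makes exclusive ownership structural rather than a rule agents must remember. A sketch; the directory layout is an assumption, not the original setup:

```python
from pathlib import Path

def output_path(agent: str, wave: int, root: Path) -> Path:
    """Derive the one path this agent may write; two agents can never collide."""
    path = root / agent.lower() / f"wave-{wave:03d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```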

Partial completion logged as complete: Agent reported COMPLETE but a subtask failed silently.

Fix: Agents must validate their own deliverables before reporting COMPLETE. Check file exists, API returned 200, etc.
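
A self-validation gate might look like this sketch. The checks shown (file exists, API returned 200) mirror the examples above; everything else is assumed:

```python
from pathlib import Path

def completion_status(files: list[Path], api_statuses: list[int]) -> tuple[str, list[str]]:
    """Validate deliverables before reporting; COMPLETE only when every check passes."""
    problems = [f"missing file: {f}" for f in files if not f.exists()]
    problems += [f"API returned {s}" for s in api_statuses if s != 200]
    return ("COMPLETE" if not problems else "FAILED", problems)
```

An agent that runs this before writing its status line cannot silently report a partial wave as done.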


Step 4: Reproduce in Isolation

Once you have identified the failing agent and the failing wave, reproduce the failure with that agent alone.

Give it the exact context packet from the failed wave. Run it solo. See if it fails the same way.

If it does: logic bug in the agent. Fix the agent prompt or the underlying tool.
If it does not: the failure was environmental — another agent's state, a race condition, a network issue during the wave.
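
Assuming context packets are saved to disk at dispatch time, an isolation replay can be as small as this sketch (the function name and packet shape are hypothetical):

```python
import json
from pathlib import Path

def replay_in_isolation(run_agent, packet_path: Path) -> tuple[str, str]:
    """Re-run a single agent with the exact context packet saved at dispatch time."""
    packet = json.loads(packet_path.read_text())
    try:
        return ("COMPLETE", str(run_agent(packet)))
    except Exception as exc:
        # Fails the same way solo: logic bug. Passes solo: the wave environment did it.
        return ("FAILED", repr(exc))
```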


Step 5: The Three-Question Check

For every multi-agent bug, ask:

  1. Did the agent receive correct context? (state bug)
  2. Did the agent tool actually execute? (tool/API bug)
  3. Did the agent correctly interpret the result? (prompt bug)

Most bugs fall into one of these three. Identify the category, fix the category.


Tooling That Actually Helps

tmux window-per-agent

Each god agent runs in its own named tmux window. When debugging, split-screen the orchestrator log against the specific agent log.

tmux new-window -n apollo
tmux new-window -n athena
tmux new-window -n hermes

Visual separation makes it immediately obvious when an agent has gone quiet.

The heartbeat check

Orchestrator pings each agent every N seconds. If an agent misses two heartbeats, it is flagged automatically. No manual monitoring.
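
A minimal version of that monitor, using monotonic timestamps and the two-missed-heartbeats rule; the class and method names are my own, not from any real orchestrator:

```python
import time

class HeartbeatMonitor:
    """Flag any agent that misses two consecutive heartbeat intervals."""

    def __init__(self, agents, interval_s: float, max_missed: int = 2):
        self.interval_s = interval_s
        self.max_missed = max_missed
        now = time.monotonic()
        self.last_seen = {agent: now for agent in agents}

    def beat(self, agent: str) -> None:
        """Record that the agent responded to a ping."""
        self.last_seen[agent] = time.monotonic()

    def flagged(self) -> list[str]:
        """Agents whose last beat is older than max_missed intervals."""
        cutoff = time.monotonic() - self.max_missed * self.interval_s
        return [a for a, t in self.last_seen.items() if t < cutoff]
```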

Structured completion reports

PAX Protocol status fields give machine-readable state across all agents at a glance:

Apollo: COMPLETE
Athena: COMPLETE
Hermes: BLOCKED — DNS propagation pending (ETA: 30min)
Hephaestus: COMPLETE

Compare that with reading four separate prose summaries. The structured format wins every time.


The Failure Mode Taxonomy

After 30 days and hundreds of waves, here are the failures we hit most:

Failure                               | Frequency | Cause
Agent BLOCKED on external dependency  | High      | DNS, API rate limits, credential expiry
Stale state in context packet         | Medium    | Orchestrator not refreshing at dispatch
Agent reports COMPLETE prematurely    | Medium    | No self-validation on deliverables
Context window overflow               | Low       | Too much history in agent session
Race condition on shared file         | Low       | Eliminated with exclusive output paths

The Rule That Changed Everything

Never parrot stale logs. Verify APIs and credentials before reporting blockers.

This one rule eliminated an entire class of ghost bugs — situations where an agent would report failure based on a previous session error, not the current state.

Agents must test the actual thing, not assume from history.
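
For the DNS example, "test the actual thing" can literally mean resolving the hostname at report time. A sketch using Python's standard resolver; the status strings are assumptions:

```python
import socket

def verify_dns_blocker(hostname: str) -> str:
    """Re-test resolution now instead of repeating an error from a previous session."""
    try:
        socket.getaddrinfo(hostname, None)  # the actual check, run fresh
        return "CLEARED"
    except socket.gaierror as exc:
        return f"BLOCKED: {exc}"  # current, verified error text
```

If the earlier failure no longer reproduces, the agent reports CLEARED instead of parroting the stale log.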


When All Else Fails

Nuke the agent session, rebuild its state from the heartbeat log, and re-dispatch from the last successful wave.

This is why wave-level checkpointing exists. You never lose more than one wave of work.


Running a multi-agent system in production? Drop your hardest debugging story in the comments.

#ai #debugging #multiagent #claudecode #devtools
