chintanonweb

Debugging Multi-Agent Systems in TypeScript: From Flat Logs to Execution Trees

AI agents are easy to demo when they follow a clean path: receive a task, call a tool, produce an answer, and finish successfully.

They become much harder to reason about when multiple agents run together.

In a real system, agents may plan, call tools, retry failures, make decisions from stale state, run in parallel, or touch the same resource from different paths. When something breaks, flat logs usually tell us what happened, but they rarely show why it happened.

That is the debugging gap I wanted to explore.

So I built a small TypeScript-based multi-agent incident-response simulator. The goal was simple: simulate a production incident where multiple agents diagnose and remediate infrastructure problems. The system had a diagnostic agent, database agent, network agent, scaling agent, and coordinator agent.

On paper, the design looked reasonable.

The DiagnosticAgent analyzed the incoming incident. The DatabaseAgent handled database-related issues. The NetworkAgent managed load balancer or routing problems. The ScalingAgent handled capacity decisions. The CoordinatorAgent orchestrated everything and was responsible for avoiding conflicting actions.

The architecture looked clean until the agents started working at the same time.

The Problem With Flat Logs

In the first version, the simulator emitted logs like this:

```
[2:47:23] DiagnosticAgent: High DB latency detected
[2:47:24] DatabaseAgent: Initiating replica scale-up
[2:47:25] DiagnosticAgent: Connection pool exhaustion detected
[2:47:26] DatabaseAgent: Taking node-3 offline for maintenance
[2:47:27] ScalingAgent: Database performance degraded, scaling up
[2:47:28] NetworkAgent: Detected backend failures, restarting load balancer
[2:47:29] CoordinatorAgent: Conflict detected
[2:47:32] ERROR: Cluster quorum lost
```

These logs were useful, but only up to a point.

They showed that the database agent scaled replicas. They showed that another agent also tried to scale. They showed that a node was taken offline. They showed that the coordinator noticed a conflict.

But they did not clearly answer the important questions:

Which agent made a decision from stale state?

Did the coordinator run before or after the conflicting tool calls?

Were the database and scaling agents truly running in parallel?

Which exact tool call caused the final failure?

Was the problem an LLM decision, a tool execution issue, or a coordination issue?

This is where normal logging started to feel too flat. The system behavior was no longer a simple list of events. It was a tree of decisions, tool calls, retries, and parallel branches.
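
To make the contrast concrete, here is a minimal sketch of how a flat list of events could be folded into a tree if each event carried a reference to its parent. The `TraceEvent` and `TraceNode` shapes are assumptions made for illustration; they are not agent-inspect's trace format.

```typescript
// Hypothetical event shape for illustration; not agent-inspect's trace format.
interface TraceEvent {
  id: string;
  parentId: string | null;
  name: string;
}

interface TraceNode {
  name: string;
  children: TraceNode[];
}

// Fold a flat list of events into a forest using the parentId links.
function buildTree(events: TraceEvent[]): TraceNode[] {
  const nodes = new Map<string, TraceNode>();
  for (const e of events) {
    nodes.set(e.id, { name: e.name, children: [] });
  }
  const roots: TraceNode[] = [];
  for (const e of events) {
    const node = nodes.get(e.id)!;
    const parent = e.parentId === null ? undefined : nodes.get(e.parentId);
    if (parent) parent.children.push(node);
    else roots.push(node); // no known parent: treat as a root
  }
  return roots;
}

// Render a node and its descendants as indented lines.
function render(node: TraceNode, depth = 0): string[] {
  const lines = [`${"  ".repeat(depth)}${node.name}`];
  for (const child of node.children) {
    lines.push(...render(child, depth + 1));
  }
  return lines;
}

const events: TraceEvent[] = [
  { id: "1", parentId: null, name: "incident-response-coordinator" },
  { id: "2", parentId: "1", name: "execute-remediation" },
  { id: "3", parentId: "2", name: "database-remediation" },
  { id: "4", parentId: "2", name: "scaling-remediation" },
];

for (const root of buildTree(events)) {
  console.log(render(root).join("\n"));
}
```

The two remediation entries end up as siblings under the same parent, which is exactly the relationship a timestamp-ordered log cannot express.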

That is when I tried agent-inspect.

Adding Local Execution Tracing

agent-inspect is a local-first execution tree debugger for TypeScript and Node.js AI agents. Instead of sending traces to a hosted dashboard, it writes local traces that can be inspected from the terminal.

That local-first model is important during development. I did not want to set up a full observability platform just to understand one local agent run. I wanted something closer to a structured debugging layer between console.log and production-grade observability.

The first step was to wrap the coordinator flow.

```typescript
import { inspectRun, step } from "agent-inspect";

async function handleIncident(incident: Incident) {
  return inspectRun(
    "incident-response-coordinator",
    async () => {
      const diagnosis = await step("diagnose-incident", async () => {
        return diagnosticAgent.analyze(incident);
      });

      const actions = await step("execute-remediation", async () => {
        return Promise.all([
          step.tool("database-remediation", () =>
            databaseAgent.handleIssue(diagnosis.dbIssues)
          ),
          step.tool("network-remediation", () =>
            networkAgent.handleIssue(diagnosis.networkIssues)
          ),
          step.tool("scaling-remediation", () =>
            scalingAgent.handleIssue(diagnosis.scalingIssues)
          ),
        ]);
      });

      return step("resolve-conflicts", async () => {
        return resolveConflicts(actions);
      });
    },
    {
      traceDir: "./.agent-inspect",
    }
  );
}
```

The code did not need a full rewrite. The main change was adding meaningful boundaries around the work.

The outer inspectRun represented one agent run. The normal step calls represented logical phases. The step.tool calls marked operations that touched external systems or simulated infrastructure.
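
For readers curious how nested wrappers like these can keep track of parents across `Promise.all`, the sketch below shows one general technique using Node's `AsyncLocalStorage`. This is not agent-inspect's implementation, which I have not inspected; `traced`, `tracedRun`, and `Span` are hypothetical names used only to illustrate the idea.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical span record; agent-inspect's real internals may differ.
interface Span {
  name: string;
  status: "running" | "ok" | "error";
  children: Span[];
}

// Holds the span that is "current" for each async call chain.
const currentSpan = new AsyncLocalStorage<Span>();

// Run fn inside a new span attached to whatever span is current.
async function traced<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, status: "running", children: [] };
  currentSpan.getStore()?.children.push(span);
  try {
    const result = await currentSpan.run(span, fn);
    span.status = "ok";
    return result;
  } catch (err) {
    span.status = "error";
    throw err;
  }
}

// Start a root span and return the finished tree.
async function tracedRun(name: string, fn: () => Promise<unknown>): Promise<Span> {
  const root: Span = { name, status: "running", children: [] };
  try {
    await currentSpan.run(root, fn);
    root.status = "ok";
  } catch {
    root.status = "error";
  }
  return root;
}

async function demo(): Promise<Span> {
  return tracedRun("coordinator", async () => {
    await traced("diagnose", async () => "db latency");
    // Both branches start inside the "remediate" span, so both attach to it.
    await traced("remediate", () =>
      Promise.all([
        traced("database", async () => "scaled"),
        traced("scaling", async () => "scaled"),
      ])
    );
  });
}
```

Because the current span travels with the async context rather than a global variable, parallel branches cannot overwrite each other's parent.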

Then I instrumented the database agent.

```typescript
import { step } from "agent-inspect";

class DatabaseAgent {
  async handleIssue(issues: DbIssue[]) {
    return step("database-agent-execution", async () => {
      // Read the current cluster state before deciding anything.
      const dbState = await step.tool("check-db-state", async () => {
        return this.getClusterState();
      });

      // Ask the model for a remediation decision based on that snapshot.
      const decision = await step.llm("decide-db-action", async () => {
        return this.llm.chat({
          messages: [
            {
              role: "user",
              content: JSON.stringify({
                task: "Decide the safest database remediation action",
                issues,
                dbState,
              }),
            },
          ],
        });
      });

      if (decision.action === "scale-up") {
        return step.tool("scale-database", async () => {
          return this.scaleUpReplicas(decision.targetCount);
        });
      }

      if (decision.action === "restart-node") {
        return step.tool("restart-node", async () => {
          return this.restartNode(decision.nodeId);
        });
      }

      // Fall through when the model selects neither action.
      return {
        action: "no-op",
        reason: "No safe database action selected",
      };
    });
  }
}
```

The important part is not just the tracing. It is the naming.

A trace is only useful if the steps describe the system in the same language engineers use during debugging. check-db-state, decide-db-action, scale-database, and restart-node are much more useful than generic messages like running task or tool call started.

Inspecting the Failed Run

After running the simulator, I listed the local traces:

```shell
npx agent-inspect list --dir ./.agent-inspect
```

Then I inspected the failed run:

```shell
npx agent-inspect view <run-id> --dir ./.agent-inspect
```

The execution tree made the issue much easier to understand:

```
incident-response-coordinator                              [47.2s] ✗
├─ diagnose-incident                                       [3.1s] ✓
├─ execute-remediation                                     [41.8s] ✗
│  ├─ database-remediation                                 [23.2s] ✓
│  │  └─ database-agent-execution                          [23.1s] ✓
│  │     ├─ check-db-state                                 [0.4s] ✓
│  │     ├─ decide-db-action                               [2.1s] ✓
│  │     ├─ scale-database                                 [18.3s] ✓
│  │     ├─ check-db-state                                 [0.3s] ✓
│  │     ├─ decide-db-action                               [1.9s] ✓
│  │     └─ restart-node                                   [0.3s] ✓
│  ├─ network-remediation                                  [5.2s] ✓
│  └─ scaling-remediation                                  [41.7s] ✗
│     └─ scaling-agent-execution                           [41.6s] ✗
│        ├─ check-scaling-state                            [0.3s] ✓
│        ├─ decide-scaling-action                          [2.2s] ✓
│        └─ scale-database                                 [39.1s] ✗
│           └─ Error: Operation timeout - cluster in inconsistent state
└─ resolve-conflicts                                       [not reached]
```

This view showed the problem more clearly than the logs.

The database agent checked the cluster state, decided to scale up, and started a scaling operation. It then checked the state again and decided to restart a node. In a parallel branch, the scaling agent had also detected database pressure and started its own scaling operation against the same cluster.

Both agents were acting on the same resource. Both believed their action was valid. The coordinator was supposed to resolve conflicts, but the trace showed that resolve-conflicts was never reached because the failure happened inside the parallel remediation step.

That was the real bug.

It was not simply a bad prompt. It was not only a database operation failure. It was a coordination bug caused by parallel agents acting on the same resource without a proper resource-level guard.
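
The same failure mode can be reproduced in a few lines. The sketch below is a deliberately tiny stand-in, not the simulator's code: `cluster`, `scaleAgent`, and the replica limit are all hypothetical. Each agent takes a snapshot, pauses as if waiting on a model, then acts on the snapshot rather than on the current state.

```typescript
// Hypothetical in-memory cluster; stands in for the simulated infrastructure.
const cluster = { replicas: 3, maxReplicas: 5 };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Each agent reads state, "thinks" for a while, then acts on what it read.
async function scaleAgent(name: string, add: number): Promise<string> {
  const observed = cluster.replicas; // snapshot taken before deciding
  await sleep(10); // stands in for the LLM decision
  if (observed + add > cluster.maxReplicas) {
    return `${name}: skipped`;
  }
  cluster.replicas += add; // acts on a possibly stale snapshot
  return `${name}: scaled from ${observed}`;
}

// Run both agents in parallel and report the resulting replica count.
async function reproduce(): Promise<number> {
  await Promise.all([
    scaleAgent("DatabaseAgent", 2),
    scaleAgent("ScalingAgent", 2),
  ]);
  return cluster.replicas;
}
```

Both agents observe 3 replicas, both pass their own limit check, and the cluster ends at 7 replicas despite a limit of 5. Neither decision is wrong in isolation; the bug only exists in the combination.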

Fixing the Coordination Model

Once the execution tree made the failure visible, the fix became much more direct.

The first change was to add a state refresh guard. If the database cluster already had an operation in progress, the agent should wait for stable state before making another decision.

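```typescript
class DatabaseAgent {
  // Written as a method so that `this` refers to the agent instance.
  async handleIssue(issues: DbIssue[]) {
    return step("database-agent-execution", async () => {
      const dbState = await step.tool("check-db-state", async () => {
        return this.getClusterState();
      });

      // If another operation is in flight, wait and then re-evaluate
      // from a fresh snapshot instead of deciding on stale state.
      if (dbState.hasInProgressOperations) {
        return step("wait-for-stability", async () => {
          await this.waitForStableState();
          return this.handleIssue(issues);
        });
      }

      return this.decideAndExecute(issues, dbState);
    });
  }

  // ...other members omitted
}
```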
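
The post does not show how `waitForStableState` works, and the guard above retries without an upper bound. One possible shape, sketched here under the assumption that stability can be polled through the same state check, is a bounded polling loop. The `getState` parameter and the option names are hypothetical.

```typescript
// Hypothetical state shape, mirroring the hasInProgressOperations flag above.
interface ClusterState {
  hasInProgressOperations: boolean;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll until the cluster reports no in-progress operations, or give up.
async function waitForStableState(
  getState: () => Promise<ClusterState>,
  opts = { intervalMs: 500, maxAttempts: 20 }
): Promise<ClusterState> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const state = await getState();
    if (!state.hasInProgressOperations) return state;
    await sleep(opts.intervalMs);
  }
  throw new Error(`cluster not stable after ${opts.maxAttempts} checks`);
}
```

Failing loudly after a fixed number of checks keeps a stuck cluster from turning into an agent that waits forever, and the thrown error would show up as a failed step in the trace.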

The second change was to protect critical operations with a lock.

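```typescript
class DatabaseAgent {
  // Written as a method so that `this` refers to the agent instance.
  async scaleUpReplicas(targetCount: number) {
    return step.tool("scale-database", async () => {
      const lock = await this.acquireLock("database-scaling", 60_000);

      try {
        // `await` matters here: without it, `finally` would release the
        // lock before the scale-up has actually finished.
        return await this.performScaleUp(targetCount);
      } finally {
        await lock.release();
      }
    });
  }

  // ...other members omitted
}
```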
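
`acquireLock` is not shown in the post. As a rough illustration of what a keyed lock with a timeout might look like inside a single Node.js process, here is a minimal sketch. The queueing approach and the `withLock` helper are my assumptions; a system spanning multiple processes would need a shared store instead of an in-memory `Map`.

```typescript
// Hypothetical in-process lock, for illustration only.
interface Lock {
  release(): void;
}

// For each key, a promise that resolves when the last queued holder is done.
const tails = new Map<string, Promise<void>>();

// Wait for earlier holders of `key`, then hand out a lock.
// Rejects if the lock is not obtained within timeoutMs.
async function acquireLock(key: string, timeoutMs: number): Promise<Lock> {
  const previous = tails.get(key) ?? Promise.resolve();
  let release!: () => void;
  const held = new Promise<void>((resolve) => {
    release = resolve;
  });
  // Later callers queue behind both the previous holder and this one.
  tails.set(key, previous.then(() => held));

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`lock "${key}" timed out`)),
      timeoutMs
    );
  });
  try {
    await Promise.race([previous, timeout]);
  } catch (err) {
    release(); // give up our place so the queue keeps moving
    throw err;
  } finally {
    clearTimeout(timer);
  }
  return { release };
}

// Convenience wrapper: hold the lock for the full duration of fn.
async function withLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const lock = await acquireLock(key, 60_000);
  try {
    return await fn(); // await so the lock is held until fn settles
  } finally {
    lock.release();
  }
}
```

With this in place, two agents calling `withLock("database-scaling", ...)` at the same moment run one after the other instead of interleaving. The map is never pruned here, which is acceptable for a sketch but not for a long-running service.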

The third change was at the coordinator level. If multiple agents wanted to touch the same resource, the coordinator should not blindly run them in parallel.

```typescript
const actions = await step("execute-remediation-sequenced", async () => {
  const targets = identifyResourceTargets(diagnosis);

  if (targets.database.length > 0) {
    const dbActions = await step.tool("database-remediation", () =>
      databaseAgent.handleIssue(diagnosis.dbIssues)
    );

    const networkActions = await step.tool("network-remediation", () =>
      networkAgent.handleIssue(diagnosis.networkIssues)
    );

    return {
      dbActions,
      networkActions,
    };
  }

  return Promise.all([
    step.tool("network-remediation", () =>
      networkAgent.handleIssue(diagnosis.networkIssues)
    ),
    step.tool("scaling-remediation", () =>
      scalingAgent.handleIssue(diagnosis.scalingIssues)
    ),
  ]);
});
```
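
A more general version of the same rule is to group planned actions by the resource they touch, run each group sequentially, and let different resources proceed in parallel. The sketch below is one way to express that; `PlannedAction`, its `resource` field, and `runByResource` are hypothetical and do not come from the simulator.

```typescript
// Hypothetical action shape; the post does not show identifyResourceTargets.
interface PlannedAction<T> {
  agent: string;
  resource: string;
  run: () => Promise<T>;
}

// Run same-resource actions sequentially and different resources in parallel.
// Results come back grouped by resource, not in the original input order.
async function runByResource<T>(actions: PlannedAction<T>[]): Promise<T[]> {
  const groups = new Map<string, PlannedAction<T>[]>();
  for (const action of actions) {
    const group = groups.get(action.resource) ?? [];
    group.push(action);
    groups.set(action.resource, group);
  }

  const perResource = await Promise.all(
    Array.from(groups.values()).map(async (group) => {
      const results: T[] = [];
      for (const action of group) {
        results.push(await action.run()); // one at a time per resource
      }
      return results;
    })
  );

  return ([] as T[]).concat(...perResource);
}
```

Compared with serializing everything whenever the database is involved, this keeps unrelated work such as load balancer changes running concurrently while still preventing two agents from scaling the same cluster at once.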

After the fix, the trace looked different:

```
incident-response-coordinator                              [15.3s] ✓
├─ diagnose-incident                                       [2.8s] ✓
├─ execute-remediation-sequenced                           [11.2s] ✓
│  └─ database-remediation                                 [8.4s] ✓
│     └─ database-agent-execution                          [8.3s] ✓
│        ├─ check-db-state                                 [0.3s] ✓
│        ├─ acquire-lock                                   [0.1s] ✓
│        ├─ decide-db-action                               [1.9s] ✓
│        ├─ scale-database                                 [5.8s] ✓
│        └─ release-lock                                   [0.1s] ✓
└─ resolve-conflicts                                       [1.3s] ✓
```

This is the kind of output I want during agent development.

Not just “something failed,” but where it failed. Not just “the tool timed out,” but what sequence caused the timeout. Not just “agents ran in parallel,” but which branches actually overlapped.
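
Whether two branches actually overlapped is a question about start and end times, not durations alone. Assuming each span records both timestamps, a check like the following could answer it. `TimedSpan` and its field names are hypothetical, and the sample numbers are invented for illustration rather than taken from the simulator.

```typescript
// Hypothetical span timing record; field names are assumptions.
interface TimedSpan {
  name: string;
  startMs: number;
  endMs: number;
}

// Two spans overlap when each starts before the other ends.
function overlaps(a: TimedSpan, b: TimedSpan): boolean {
  return a.startMs < b.endMs && b.startMs < a.endMs;
}

// Return every pair of sibling spans whose time ranges intersect.
function overlappingPairs(spans: TimedSpan[]): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < spans.length; i++) {
    for (let j = i + 1; j < spans.length; j++) {
      if (overlaps(spans[i], spans[j])) {
        pairs.push([spans[i].name, spans[j].name]);
      }
    }
  }
  return pairs;
}

// Illustrative timings only.
const siblings: TimedSpan[] = [
  { name: "database-remediation", startMs: 0, endMs: 200 },
  { name: "network-remediation", startMs: 0, endMs: 50 },
  { name: "scaling-remediation", startMs: 100, endMs: 400 },
  { name: "resolve-conflicts", startMs: 400, endMs: 450 },
];

console.log(overlappingPairs(siblings));
```

For these sample spans, the database branch overlaps both the network and scaling branches, while `resolve-conflicts` overlaps nothing because it starts exactly when the last branch ends.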

Why This Matters for AI Agent Engineering

As agent systems become more common, debugging needs to move beyond raw logs.

A single-agent workflow can often be debugged with a few log statements. But multi-agent systems introduce coordination problems. A bug may not live inside one function. It may live between two valid decisions that become unsafe when executed together.

That is why execution trees are useful.

They show the structure of the run. They show parent-child relationships. They separate normal logic from tool calls and LLM calls. They make retries, skipped steps, failed branches, and slow operations easier to reason about.

This also changes how we think about observability.

Production observability platforms are still important. Tools like LangSmith, Langfuse, OpenTelemetry-based pipelines, and APM platforms solve important team and production problems. But during local development, I often want something lighter. I want to run the agent, inspect the trace, make a change, and compare the result.

That is the space where a local-first tool like agent-inspect fits naturally.

It is not trying to replace production monitoring. It is closer to a developer workflow tool for understanding agent behavior before it reaches production.

Practical Lessons From the Project

The first lesson is that flat logs hide structure. In a multi-agent workflow, order alone is not enough. You need to know which step belonged to which agent, which steps were siblings, and which operation blocked or failed.

The second lesson is that not every agent bug is an LLM bug. In this simulator, the expensive failure came from tool coordination and stale state, not from a slow model call. Without tracing, it would have been easy to spend time tuning prompts while ignoring the actual failure path.

The third lesson is that instrumentation can become living documentation. A well-named step() call describes the architecture. A new engineer reading the trace can understand the runtime behavior faster than they could by reading scattered logs.

The fourth lesson is that local-first debugging is still valuable. Not every debugging session needs a dashboard, collector, account, or cloud upload. Sometimes the fastest path is a local trace file and a terminal command.

Final Thoughts

The more I build with AI agents, the more I feel that debugging is becoming an architecture problem.

It is not enough to know that an agent produced the wrong answer. We need to know what it planned, which tools it called, which state it observed, which branches ran in parallel, where retries happened, and what changed between two runs.

For TypeScript and Node.js teams building agentic systems, agent-inspect is a useful tool to explore that workflow. It gives you a lightweight way to turn agent runs into readable execution trees without committing to a hosted observability setup on day one.

For my multi-agent incident-response simulator, the biggest value was simple: it turned a confusing wall of logs into a system I could reason about.

And that is usually the first step toward making agent systems reliable.

npm package: https://www.npmjs.com/package/agent-inspect

GitHub repo: https://github.com/chintandb/incident-response-coordinator

Top comments (7)

Raaj G

This is a very practical way to explain why multi-agent debugging needs more than flat logs. I especially liked the point that the failure was not simply a “bad prompt” or a tool timeout, but a coordination issue between valid agent decisions running in parallel.

The execution tree example makes the problem much easier to reason about: you can clearly see which agent checked state, which tool call happened, where the branches overlapped, and why resolve-conflicts was never reached.

Also agree with the naming point. Well-named steps like check-db-state, decide-db-action, and scale-database make traces feel like living documentation, not just debugging output.

This is exactly the kind of local-first workflow TypeScript agent builders need before moving into heavier production observability setups.

Raju Dandigam

Thanks a lot for writing this up @chintanonweb and for taking the time to show the problem through a real multi-agent incident-response flow.

As the builder of agent-inspect, this is exactly the kind of debugging gap I was hoping to make easier to reason about. Flat logs are useful, but once agents start running in parallel, calling tools, retrying, and touching shared resources, the real issue often lives between the steps — not inside one single log line.

I really liked how you highlighted that the failure was not simply a bad prompt or a tool timeout. The important part was the coordination bug: multiple agents made individually valid decisions, but together they created an unsafe sequence before conflict resolution could run.

That is why execution trees matter. They make parent-child relationships, parallel branches, LLM calls, tool calls, failed steps, and skipped steps visible in one place. Also, your point about naming is important. A trace becomes much more useful when steps like check-db-state, decide-db-action, and scale-database read like the architecture itself.

This kind of local-first debugging workflow is where I believe TypeScript agent development needs more tooling: lightweight enough for daily development, but structured enough to expose the actual runtime behavior behind agent decisions.

Really appreciate the thoughtful example and practical walkthrough.

Andy Stewart • Edited

So relatable! Multi-agent coordination often falls into pure chaos; flat logs only show a superficial sequence, failing completely to untangle concurrent conflicts and stale states.

Building a local 'execution tree' with TypeScript to rigidly structure and deterministically track complex LLM decisions and tool chains is hard-core, system-level architectural thinking. Taming AI's inherent indeterminism with absolute engineering discipline—brilliant!

Theo Valmis

The flat log problem you're describing is the distributed systems debugging problem all over again, plus a layer of model inference making the causal chain harder to trace. In microservices, the fix was distributed tracing — propagating trace IDs through the call chain so you could reconstruct what happened across service boundaries. Multi-agent systems need something equivalent, but the execution graph isn't just service calls; it also includes model decisions.

The cluster quorum failure in your example is the classic conflict pattern: two agents acting on the same resource from different diagnostic paths, each making a locally valid decision that's globally destructive. The coordinator detected the conflict after the fact. What you'd actually want is the execution tree to surface the shared resource dependency before both agents act — which means the tracing layer needs to capture not just what happened, but what state each agent was reasoning from when it chose to act.

Raju Dandigam

Thanks again for writing this. The biggest shift I’ve seen while building agent-inspect is that agent debugging stops being event debugging and starts becoming workflow debugging. The difficult part is rarely “did the API call fail?” — it is reconstructing the branching decisions, retries, context transitions, and coordination path that led to the final behavior. I especially appreciate the local-first framing because many developers do not want to deploy an entire observability platform just to understand one failing run during development. Really thoughtful explanation of the execution-tree model.

Glendel Joubert Fyne Acosta

Thanks for sharing! Execution trees are definitely the right direction for Multi-Agent debugging.

Flat logs hide the most important part of the system: responsibility flow.

In Multi-Agent Systems (MAS), I think every node in the execution tree should answer:

  • Which agent acted?
  • What context did it receive?
  • What tool/action did it request?
  • Was the action allowed?
  • What actually executed?
  • What evidence came back?
  • What did the agent claim afterward?

That last distinction matters a lot. In Agent Systems, the model's summary and the runtime's evidence can diverge.

Execution trees become much more valuable when they show both: the reasoning path and the proof path.

Harjot Singh

Flat logs -> execution trees is exactly the right upgrade, and it's the single biggest quality-of-life win in multi-agent debugging. With one agent a linear log is fine; with many agents handing off and spawning sub-tasks, a flat log is unreadable - you can't tell which agent did what, in what order, or which parent call a failure belongs to. A tree (parent -> child spans, who called whom, inputs/outputs at each node) turns "something broke somewhere" into "agent C failed on this input, called by B." That structure is the difference between debuggable and a black box.

The thing I'd add as the natural next step: once you have the tree, attach cost and verification status to each node - then the same view that debugs failures also shows you where tokens go and which step's output wasn't validated. Execution tree as the unified observability primitive. That's how I lean on it in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - per-node events (cost, status, gate result) on the tree are how a multi-agent build stays both debuggable and ~$3 flat. Excellent, practical post - this is the infra people underbuild until they're drowning in flat logs. Are you attaching per-node cost/timing to the tree yet, or is it currently focused on the call structure? The cost overlay is a killer addition.