AI agents are easy to demo when they follow a clean path: receive a task, call a tool, produce an answer, and finish successfully.
They become much harder to reason about when multiple agents run together.
In a real system, agents may plan, call tools, retry failures, make decisions from stale state, run in parallel, or touch the same resource from different paths. When something breaks, flat logs usually tell us what happened, but they rarely show why it happened.
That is the debugging gap I wanted to explore.
So I built a small TypeScript-based multi-agent incident-response simulator. The goal was simple: simulate a production incident where multiple agents diagnose and remediate infrastructure problems. The system had a diagnostic agent, database agent, network agent, scaling agent, and coordinator agent.
On paper, the design looked reasonable.
The DiagnosticAgent analyzed the incoming incident. The DatabaseAgent handled database-related issues. The NetworkAgent managed load balancer or routing problems. The ScalingAgent handled capacity decisions. The CoordinatorAgent orchestrated everything and was responsible for avoiding conflicting actions.
The architecture looked clean until the agents started working at the same time.
The Problem With Flat Logs
In the first version, the simulator emitted logs like this:
```
[2:47:23] DiagnosticAgent: High DB latency detected
[2:47:24] DatabaseAgent: Initiating replica scale-up
[2:47:25] DiagnosticAgent: Connection pool exhaustion detected
[2:47:26] DatabaseAgent: Taking node-3 offline for maintenance
[2:47:27] ScalingAgent: Database performance degraded, scaling up
[2:47:28] NetworkAgent: Detected backend failures, restarting load balancer
[2:47:29] CoordinatorAgent: Conflict detected
[2:47:32] ERROR: Cluster quorum lost
```
These logs were useful, but only up to a point.
They showed that the database agent scaled replicas. They showed that another agent also tried to scale. They showed that a node was taken offline. They showed that the coordinator noticed a conflict.
But they did not clearly answer the important questions:
Which agent made a decision from stale state?
Did the coordinator run before or after the conflicting tool calls?
Were the database and scaling agents truly running in parallel?
Which exact tool call caused the final failure?
Was the problem an LLM decision, a tool execution issue, or a coordination issue?
This is where normal logging started to feel too flat. The system behavior was no longer a simple list of events. It was a tree of decisions, tool calls, retries, and parallel branches.
That is when I tried agent-inspect.
Adding Local Execution Tracing
agent-inspect is a local-first execution tree debugger for TypeScript and Node.js AI agents. Instead of sending traces to a hosted dashboard, it writes local traces that can be inspected from the terminal.
That local-first model is important during development. I did not want to set up a full observability platform just to understand one local agent run. I wanted something closer to a structured debugging layer between console.log and production-grade observability.
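Setup is a single npm install. Since it is a development-time tool, a dev dependency makes sense:

```bash
npm install --save-dev agent-inspect
```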
The first step was to wrap the coordinator flow.
```typescript
import { inspectRun, step } from "agent-inspect";

async function handleIncident(incident: Incident) {
  return inspectRun(
    "incident-response-coordinator",
    async () => {
      const diagnosis = await step("diagnose-incident", async () => {
        return diagnosticAgent.analyze(incident);
      });

      const actions = await step("execute-remediation", async () => {
        return Promise.all([
          step.tool("database-remediation", () =>
            databaseAgent.handleIssue(diagnosis.dbIssues)
          ),
          step.tool("network-remediation", () =>
            networkAgent.handleIssue(diagnosis.networkIssues)
          ),
          step.tool("scaling-remediation", () =>
            scalingAgent.handleIssue(diagnosis.scalingIssues)
          ),
        ]);
      });

      return step("resolve-conflicts", async () => {
        return resolveConflicts(actions);
      });
    },
    {
      traceDir: "./.agent-inspect",
    }
  );
}
```
The code did not need a full rewrite. The main change was adding meaningful boundaries around the work.
The outer inspectRun represented one agent run. The normal step calls represented logical phases. The step.tool calls marked operations that touched external systems or simulated infrastructure.
Then I instrumented the database agent.
```typescript
class DatabaseAgent {
  async handleIssue(issues: DbIssue[]) {
    return step("database-agent-execution", async () => {
      const dbState = await step.tool("check-db-state", async () => {
        return this.getClusterState();
      });

      const decision = await step.llm("decide-db-action", async () => {
        return this.llm.chat({
          messages: [
            {
              role: "user",
              content: JSON.stringify({
                task: "Decide the safest database remediation action",
                issues,
                dbState,
              }),
            },
          ],
        });
      });

      if (decision.action === "scale-up") {
        return step.tool("scale-database", async () => {
          return this.scaleUpReplicas(decision.targetCount);
        });
      }

      if (decision.action === "restart-node") {
        return step.tool("restart-node", async () => {
          return this.restartNode(decision.nodeId);
        });
      }

      return {
        action: "no-op",
        reason: "No safe database action selected",
      };
    });
  }
}
```
The important part is not just the tracing. It is the naming.
A trace is only useful if the steps describe the system in the same language engineers use during debugging. check-db-state, decide-db-action, scale-database, and restart-node are much more useful than generic messages like running task or tool call started.
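A tiny, self-contained illustration of the difference (the cluster-state call here is a stand-in; both calls record the same work, but only one reads well in a trace):

```typescript
import { step } from "agent-inspect";

// Hypothetical stand-in for the real cluster-state call.
const getClusterState = async () => ({ healthy: true });

export async function example() {
  // Generic name: the trace can only say that "a tool ran".
  await step.tool("tool call", () => getClusterState());

  // Descriptive name: the trace speaks the language of the incident.
  await step.tool("check-db-state", () => getClusterState());
}
```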
Inspecting the Failed Run
After running the simulator, I listed the local traces:
```bash
npx agent-inspect list --dir ./.agent-inspect
```
Then I inspected the failed run:
```bash
npx agent-inspect view <run-id> --dir ./.agent-inspect
```
The execution tree made the issue much easier to understand:
```
incident-response-coordinator [47.2s] ✗
├─ diagnose-incident [3.1s] ✓
├─ execute-remediation [41.8s] ✗
│  ├─ database-remediation [23.2s] ✓
│  │  └─ database-agent-execution [23.1s] ✓
│  │     ├─ check-db-state [0.4s] ✓
│  │     ├─ decide-db-action [2.1s] ✓
│  │     ├─ scale-database [18.3s] ✓
│  │     ├─ check-db-state [0.3s] ✓
│  │     ├─ decide-db-action [1.9s] ✓
│  │     └─ restart-node [0.3s] ✓
│  ├─ network-remediation [5.2s] ✓
│  └─ scaling-remediation [41.7s] ✗
│     └─ scaling-agent-execution [41.6s] ✗
│        ├─ check-scaling-state [0.3s] ✓
│        ├─ decide-scaling-action [2.2s] ✓
│        └─ scale-database [39.1s] ✗
│           └─ Error: Operation timeout - cluster in inconsistent state
└─ resolve-conflicts [not reached]
```
This view showed the problem more clearly than the logs.
The database agent checked the state, decided to scale up, and started a database scaling operation. Then it checked state again and decided to restart a node. At the same time, the scaling agent also detected database pressure and started another scaling operation.
Both agents were acting on the same resource. Both believed their action was valid. The coordinator was supposed to resolve conflicts, but the trace showed that resolve-conflicts was never reached because the failure happened inside the parallel remediation step.
That was the real bug.
It was not simply a bad prompt. It was not only a database operation failure. It was a coordination bug caused by parallel agents acting on the same resource without a proper resource-level guard.
Fixing the Coordination Model
Once the execution tree made the failure visible, the fix became much more direct.
The first change was to add a state refresh guard. If the database cluster already had an operation in progress, the agent should wait for a stable state before making another decision.
```typescript
// Inside DatabaseAgent
async handleIssue(issues: DbIssue[]) {
  return step("database-agent-execution", async () => {
    const dbState = await step.tool("check-db-state", async () => {
      return this.getClusterState();
    });

    // Another operation is already mutating the cluster: wait for a
    // stable state, then re-enter with fresh state instead of acting
    // on a snapshot that is about to be invalidated.
    if (dbState.hasInProgressOperations) {
      return step("wait-for-stability", async () => {
        await this.waitForStableState();
        return this.handleIssue(issues);
      });
    }

    return this.decideAndExecute(issues, dbState);
  });
}
```
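waitForStableState itself is not shown in the simulator code above. A minimal sketch, assuming the same cluster-state shape, is a bounded poll with a deadline (the helper and its parameters here are hypothetical):

```typescript
// Hypothetical helper: poll the cluster until no operation is in
// progress, giving up after a deadline instead of waiting forever.
async function waitForStableState(
  getClusterState: () => Promise<{ hasInProgressOperations: boolean }>,
  timeoutMs = 60_000,
  pollIntervalMs = 1_000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const state = await getClusterState();
    if (!state.hasInProgressOperations) return;
    await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
  }
  throw new Error("Cluster did not reach a stable state in time");
}
```

Bounding the wait matters: an agent that waits forever for stability just converts a coordination bug into a hang.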
The second change was to protect critical operations with a lock.
```typescript
// Inside DatabaseAgent
async scaleUpReplicas(targetCount: number) {
  return step.tool("scale-database", async () => {
    // Hold a resource-level lock for the whole operation so no other
    // agent can mutate the cluster concurrently.
    const lock = await this.acquireLock("database-scaling", 60_000);
    try {
      return this.performScaleUp(targetCount);
    } finally {
      await lock.release();
    }
  });
}
```
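acquireLock is also elided above. For a single-process simulator, a small in-memory mutex with a timeout is enough; a real deployment would want a distributed lock service instead. A sketch under that assumption:

```typescript
// Hypothetical in-memory lock for a single-process simulator.
// Each lock name maps to a promise chain: acquiring waits for the
// previous holder, and release() lets the next waiter proceed.
const lockChains = new Map<string, Promise<void>>();

async function acquireLock(name: string, timeoutMs: number) {
  const previous = lockChains.get(name) ?? Promise.resolve();

  let release!: () => void;
  const held = new Promise<void>((resolve) => (release = resolve));
  lockChains.set(name, previous.then(() => held));

  let timer!: ReturnType<typeof setTimeout>;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      release(); // free our slot so later waiters are not stuck behind us
      reject(new Error(`Timed out acquiring lock "${name}"`));
    }, timeoutMs);
  });

  // Wait for the previous holder to release, or give up after timeoutMs.
  await Promise.race([previous, timedOut]);
  clearTimeout(timer);
  return { release: async () => release() };
}
```

With that shape, the finally block above always releases, even when performScaleUp throws.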
The third change was at the coordinator level. If multiple agents wanted to touch the same resource, the coordinator should not blindly run them in parallel.
```typescript
const actions = await step("execute-remediation-sequenced", async () => {
  const targets = identifyResourceTargets(diagnosis);

  // If the database is a remediation target, run database and network
  // remediation sequentially instead of in parallel.
  if (targets.database.length > 0) {
    const dbActions = await step.tool("database-remediation", () =>
      databaseAgent.handleIssue(diagnosis.dbIssues)
    );
    const networkActions = await step.tool("network-remediation", () =>
      networkAgent.handleIssue(diagnosis.networkIssues)
    );
    return {
      dbActions,
      networkActions,
    };
  }

  // Otherwise the targets are independent and parallelism is safe.
  return Promise.all([
    step.tool("network-remediation", () =>
      networkAgent.handleIssue(diagnosis.networkIssues)
    ),
    step.tool("scaling-remediation", () =>
      scalingAgent.handleIssue(diagnosis.scalingIssues)
    ),
  ]);
});
```
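identifyResourceTargets is glue I have not shown. Conceptually it just groups the planned work by the resource it touches; a minimal sketch, where the issue types are placeholders:

```typescript
// Hypothetical glue for the sketch above. NetworkIssue and ScalingIssue
// mirror the simulator's DbIssue; the exact shapes do not matter here.
type DbIssue = { kind: string };
type NetworkIssue = { kind: string };
type ScalingIssue = { kind: string };

interface Diagnosis {
  dbIssues: DbIssue[];
  networkIssues: NetworkIssue[];
  scalingIssues: ScalingIssue[];
}

function identifyResourceTargets(diagnosis: Diagnosis) {
  return {
    // Scaling work also lands on the database cluster, which is
    // exactly the overlap that caused the original failure.
    database: [...diagnosis.dbIssues, ...diagnosis.scalingIssues],
    network: [...diagnosis.networkIssues],
  };
}
```

Grouping scaling work under the database target is what lets the coordinator see, before anything runs, that two agents were about to touch the same cluster.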
After the fix, the trace looked different:
```
incident-response-coordinator [15.3s] ✓
├─ diagnose-incident [2.8s] ✓
├─ execute-remediation-sequenced [11.2s] ✓
│  └─ database-remediation [8.4s] ✓
│     └─ database-agent-execution [8.3s] ✓
│        ├─ check-db-state [0.3s] ✓
│        ├─ acquire-lock [0.1s] ✓
│        ├─ decide-db-action [1.9s] ✓
│        ├─ scale-database [5.8s] ✓
│        └─ release-lock [0.1s] ✓
└─ resolve-conflicts [1.3s] ✓
```
This is the kind of output I want during agent development.
Not just “something failed,” but where it failed. Not just “the tool timed out,” but what sequence caused the timeout. Not just “agents ran in parallel,” but which branches actually overlapped.
Why This Matters for AI Agent Engineering
As agent systems become more common, debugging needs to move beyond raw logs.
A single-agent workflow can often be debugged with a few log statements. But multi-agent systems introduce coordination problems. A bug may not live inside one function. It may live between two valid decisions that become unsafe when executed together.
That is why execution trees are useful.
They show the structure of the run. They show parent-child relationships. They separate normal logic from tool calls and LLM calls. They make retries, skipped steps, failed branches, and slow operations easier to reason about.
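I do not know agent-inspect's exact trace schema, but conceptually each node in an execution tree only needs a handful of fields to support that kind of reasoning. A sketch, not the library's actual types:

```typescript
// Conceptual execution-tree node: enough structure to answer
// "what ran, under what parent, for how long, and did it fail".
interface TraceNode {
  name: string;                        // e.g. "check-db-state"
  kind: "run" | "step" | "tool" | "llm";
  startedAt: number;                   // epoch ms; overlap between
  endedAt: number;                     //   siblings reveals parallelism
  status: "ok" | "error" | "not-reached";
  error?: string;
  children: TraceNode[];
}
```

The startedAt/endedAt pair on sibling nodes is what answers the earlier question of whether two branches truly overlapped.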
This also changes how we think about observability.
Production observability platforms are still important. Tools like LangSmith, Langfuse, OpenTelemetry-based pipelines, and APM platforms solve important team and production problems. But during local development, I often want something lighter. I want to run the agent, inspect the trace, make a change, and compare the result.
That is the space where a local-first tool like agent-inspect fits naturally.
It is not trying to replace production monitoring. It is closer to a developer workflow tool for understanding agent behavior before it reaches production.
Practical Lessons From the Project
The first lesson is that flat logs hide structure. In a multi-agent workflow, order alone is not enough. You need to know which step belonged to which agent, which steps were siblings, and which operation blocked or failed.
The second lesson is that not every agent bug is an LLM bug. In this simulator, the expensive failure came from tool coordination and stale state, not from a slow model call. Without tracing, it would have been easy to spend time tuning prompts while ignoring the actual failure path.
The third lesson is that instrumentation can become living documentation. A well-named step() call describes the architecture. When a new engineer reads the trace, they can understand the runtime behavior faster than reading scattered logs.
The fourth lesson is that local-first debugging is still valuable. Not every debugging session needs a dashboard, collector, account, or cloud upload. Sometimes the fastest path is a local trace file and a terminal command.
Final Thoughts
The more I build with AI agents, the more I feel that debugging is becoming an architecture problem.
It is not enough to know that an agent produced the wrong answer. We need to know what it planned, which tools it called, which state it observed, which branches ran in parallel, where retries happened, and what changed between two runs.
For TypeScript and Node.js teams building agentic systems, agent-inspect is a useful tool to explore that workflow. It gives you a lightweight way to turn agent runs into readable execution trees without committing to a hosted observability setup on day one.
For my multi-agent incident-response simulator, the biggest value was simple: it turned a confusing wall of logs into a system I could reason about.
And that is usually the first step toward making agent systems reliable.
npm package: https://www.npmjs.com/package/agent-inspect
GitHub repo: https://github.com/rajudandigam/agent-inspect