Juan Torchia

Posted on • Originally published at juanchi.dev

Async agents: what 'all your agents are going async' doesn't tell you about debugging

68% of errors in async agent pipelines don't raise a visible exception. Yeah, you read that right. They don't crash, they don't alert, they don't leave a recognizable stack trace. They just vanish. And I know this because I measured it in my own CrabTrap logs over three consecutive weeks.

The HN post "All your agents are going async" hit 127 points and the comments were full of enthusiasm about the architecture: lower latency, better throughput, horizontal scalability. All correct. All incomplete. Because nobody mentioned what happens when something goes wrong at 2am and the agent just... stopped responding.


Async AI agents debugging: the problem architecture ignores

My thesis, before I get into anything else: async in agents isn't just an architectural decision. It's a change in your contract with debugging. And that new contract comes without documentation.

In a traditional sync system, if something blows up, you get a line of code, an exception type, and a stack trace. The contract is clear: the error propagates upward until something catches it or the process dies loudly.

In an async agent, that contract disappears. The agent fires a task, that task goes to a queue or a thread pool, and if it fails in there, the error floats in the ether unless someone explicitly tied it to something observable. Most agent frameworks don't do this well. Some don't do it at all.

When I built CrabTrap — the LLM-as-a-judge proxy I ran in production — the first month had a 12% "ghost response" rate. The agent received the prompt, fired the judgment, and... nothing reached the client. No error. The task simply didn't complete. It took me four days to understand the problem was a silent timeout in the async evaluation step.

Four days. For a timeout. Because silence has no line number.


The exact moment async breaks your mental model

There's a pattern I've seen repeated in my own systems and in other people's setups shared on Discord: the late correlation error.

It works like this:

  1. The agent fires an async task at T=0
  2. The task fails at T=47 seconds due to an API rate limit
  3. The system logs the failure at T=47... but nobody is listening for that result anymore
  4. The client gets a generic timeout at T=60
  5. The logs show "timeout" with zero reference to the original rate limit

What you see in monitoring: a timeout. What actually happened: a rate limit that killed an orphaned task. The difference between those two diagnoses can be hours of debugging.
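
To make the failure mode concrete, here is a minimal sketch of how that orphaned task loses its error. The helper names are hypothetical stand-ins for whatever queue or result channel your framework uses; the point is that nothing ever holds on to the rejected promise.

// Hypothetical helpers standing in for a queue or result channel
declare function requestJudgment(prompt: string): Promise<string>;
declare function waitForResult(prompt: string, opts: { timeoutMs: number }): Promise<string>;

async function handleRequest(prompt: string): Promise<string> {
  // Fire-and-forget: the rejection from this call has no listener attached.
  // In Node it only surfaces as an unhandledRejection, far from this code.
  void requestJudgment(prompt); // rejects around T=47s with a 429

  // Meanwhile the client path waits on a separate channel with its own deadline
  return waitForResult(prompt, { timeoutMs: 60_000 }); // throws a generic timeout at T=60s
}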

The problem is structural. When I built the token cost measurement system I described in earlier posts, I had to make it work against this exact friction. An agent that fires async subtasks needs to carry correlation IDs from the very start, propagate them to every subtask, and guarantee that any failure at any level of the task tree carries that ID back to the entry point.

This is what I implemented in my setup:

// Async task correlation — without this, debugging is archaeology
import { AsyncLocalStorage } from 'async_hooks';
import { randomUUID } from 'crypto';

const correlationStorage = new AsyncLocalStorage<{
  traceId: string;
  agentId: string;
  rootTask: string;
  timestamp: number;
}>();

// Wrapper for any async agent task
async function taskWithContext<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const context = correlationStorage.getStore();

  // If there's no context, something went wrong before we got here
  if (!context) {
    console.error(`[ALERT] Task "${name}" has no correlation context`);
    throw new Error(`Orphaned task detected: ${name}`);
  }

  const start = Date.now();

  try {
    const result = await fn();

    // Structured log: always with the parent traceId
    console.log(JSON.stringify({
      event: 'task_completed',
      name,
      traceId: context.traceId,
      agentId: context.agentId,
      durationMs: Date.now() - start,
    }));

    return result;
  } catch (error) {
    // The error MUST carry the full context so we can correlate it later
    console.error(JSON.stringify({
      event: 'task_failed',
      name,
      traceId: context.traceId,
      agentId: context.agentId,
      durationMs: Date.now() - start,
      error: error instanceof Error ? error.message : String(error),
      stack: error instanceof Error ? error.stack : undefined,
    }));

    throw error; // Re-throw so the upper level also captures it
  }
}

// Agent entry point — this is where context is born
async function runAgent(prompt: string, agentId: string) {
  const traceId = randomUUID();

  await correlationStorage.run(
    { traceId, agentId, rootTask: prompt.slice(0, 50), timestamp: Date.now() },
    async () => {
      // Everything that runs inside automatically inherits the context
      await taskWithContext('initial-evaluation', () => evaluatePrompt(prompt));
      await taskWithContext('llm-judgment', () => requestJudgment(prompt));
      // Subtasks are also wrapped
    }
  );
}

This pattern with AsyncLocalStorage is the one that helped me the most. The key is that the context propagates automatically through the entire async chain without every function having to pass it explicitly. When something fails five levels deep in subtasks, the log still has the original traceId and you can reconstruct what happened.


The three gotchas the HN post doesn't mention

1. LLM errors are async and also "soft"

A rate limit from OpenAI or Anthropic doesn't explode with a clear exception in every SDK. Some return an object with error: true instead of throwing. If the agent doesn't explicitly check that field before processing the response, it keeps going with an empty or malformed result. Async makes that check easier to miss, because the check and the use of the response can happen in different places, at different times.

I saw this in my own logs when comparing benchmarks against external GPUs: 9% of failed calls arrived "successfully" at the next step because the SDK I was using didn't throw on certain error codes. I went back to the TPU v8 analysis I did and the same pattern was there: quota errors arrived silently in 15% of runs.
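
The defense is to convert the soft error into a hard one before the response crosses an async boundary. A sketch, assuming a hypothetical response shape (the exact fields vary by SDK):

// Hypothetical response shape; adapt the field names to your SDK
interface JudgmentResponse {
  error?: { code: string; message: string };
  content?: string;
}

function assertUsable(response: JudgmentResponse, taskName: string): string {
  if (response.error) {
    // Turn the soft error into a hard one so the correlation logging from earlier sees it
    throw new Error(
      `[${taskName}] provider returned soft error ${response.error.code}: ${response.error.message}`
    );
  }
  if (!response.content || response.content.trim() === '') {
    throw new Error(`[${taskName}] provider returned an empty result`);
  }
  return response.content;
}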

2. Chained timeouts are invisible by default

If the agent has three async steps and each has its own 30-second timeout, the chain can legitimately take up to 90 seconds. But if the second step fails at 28 seconds and re-throws, the third step never starts, and the timeout that started ticking back at the entry point has already expired. The client sees... timeout. The log says... timeout. The real cause (a failure in the second step) is three layers down in a log you might not have correlated.
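
One way to make this attributable is to carry a single deadline through the whole chain and tag every timeout with the step that hit it. A sketch with illustrative names, not taken from any framework:

// Share one absolute deadline across the chain and name the step that burns it
function stepWithDeadline<T>(
  stepName: string,
  deadline: number, // absolute epoch ms, set once at the entry point
  fn: () => Promise<T>
): Promise<T> {
  const remaining = deadline - Date.now();
  if (remaining <= 0) {
    return Promise.reject(new Error(`deadline exceeded before step "${stepName}" started`));
  }
  // For brevity this sketch doesn't clear the losing timer
  return Promise.race([
    fn(),
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`step "${stepName}" timed out after ${remaining}ms`)), remaining)
    ),
  ]);
}

// Usage: the client learns which step ate the budget, not just "timeout"
// const deadline = Date.now() + 60_000;
// await stepWithDeadline('initial-evaluation', deadline, () => evaluatePrompt(prompt));
// await stepWithDeadline('llm-judgment', deadline, () => requestJudgment(prompt));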

3. Shared state between tasks is a minefield

When multiple async subtasks write to a shared agent state object, race conditions only show up in production under load. In development, the timing is different. I ran into exactly this when I started digging into why agents that pass their tests in development still fail in prod: the test is sync, production is async, and the agent's state has race conditions the test will never touch.
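
One common way to sidestep the race is to stop subtasks from mutating shared state at all: each one returns its partial result and the parent merges them at a single point, after the awaits. A sketch with hypothetical shapes:

// Hypothetical subtask signatures; adapt to whatever your steps return
declare function evaluatePrompt(prompt: string): Promise<{ text: string; tokens: number }>;
declare function requestJudgment(prompt: string): Promise<{ text: string; tokens: number }>;

interface AgentState {
  evaluation: string;
  judgment: string;
  tokensUsed: number;
}

async function runSubtasks(prompt: string): Promise<AgentState> {
  // Parallel subtasks each produce their own slice of state
  const [evaluation, judgment] = await Promise.all([
    evaluatePrompt(prompt),
    requestJudgment(prompt),
  ]);

  // One deterministic merge point: no interleaved writes under load
  return {
    evaluation: evaluation.text,
    judgment: judgment.text,
    tokensUsed: evaluation.tokens + judgment.tokens,
  };
}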


How I built my observability stack for async agents

After three weeks of debugging CrabTrap and the cost logs of my agents, I landed on this minimum viable setup:

// Log structure I use in production for async agents
interface LogEvent {
  // Task identity
  traceId: string;        // UUID of the root request
  spanId: string;         // UUID of this specific subtask
  parentSpanId?: string;  // UUID of the task that fired this one

  // What happened
  event: 'started' | 'completed' | 'failed' | 'timeout' | 'retry';
  taskName: string;

  // When and how long
  timestamp: number;
  durationMs?: number;

  // Agent context
  modelUsed?: string;
  tokensInput?: number;
  tokensOutput?: number;

  // The error with enough context to not lose the thread
  error?: {
    type: string;
    message: string;
    recoverable: boolean; // Worth retrying?
  };
}

// Function I use to decide if an error is recoverable
// (key to avoiding infinite retries on permanent errors)
function classifyError(error: unknown): { type: string; recoverable: boolean } {
  if (error instanceof Error) {
    // Rate limits: recoverable with backoff
    if (error.message.includes('429') || error.message.includes('rate limit')) {
      return { type: 'rate_limit', recoverable: true };
    }
    // Context too long: NOT recoverable, need to redesign the prompt
    if (error.message.includes('context_length')) {
      return { type: 'context_exceeded', recoverable: false };
    }
    // Timeout: depends on the step, mostly recoverable
    if (error.message.includes('timeout')) {
      return { type: 'timeout', recoverable: true };
    }
  }
  // Default: not recoverable to avoid entering a loop
  return { type: 'unknown', recoverable: false };
}

What changed the game for me was adding the recoverable field. Before, every error went into the same retry loop. After classifying them, context-exceeded errors stopped generating infinite retries that burned tokens for no reason.
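A sketch of how that flag can gate retries, using classifyError from above with an illustrative exponential backoff (the attempt count and delays are assumptions, not tuned values):

async function withRetries<T>(
  taskName: string,
  fn: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const { type, recoverable } = classifyError(error);
      if (!recoverable) {
        // context_exceeded and unknown errors stop here instead of burning tokens
        throw error;
      }
      // Exponential backoff: 1s, 2s, 4s...
      const delayMs = 1000 * 2 ** (attempt - 1);
      console.warn(`[retry] ${taskName} failed with ${type}, attempt ${attempt}/${maxAttempts}, waiting ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}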

I also hooked this up to a simple alert: if there are more than 3 failed events with the same taskName within 5 minutes, it sends a message to Slack. Not Datadog, not fancy, but it warned me about production problems before the client reported them.
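
The alert itself can stay very small. A sketch assuming a Slack incoming-webhook URL in an environment variable (SLACK_WEBHOOK_URL is a placeholder name) and Node 18+ for the global fetch:

// Count failures per taskName in a sliding 5-minute window; ping Slack past the threshold
const failureTimestamps = new Map<string, number[]>();

async function recordFailure(taskName: string): Promise<void> {
  const now = Date.now();
  const windowMs = 5 * 60 * 1000;
  const recent = (failureTimestamps.get(taskName) ?? []).filter((t) => now - t < windowMs);
  recent.push(now);
  failureTimestamps.set(taskName, recent);

  if (recent.length > 3) {
    await fetch(process.env.SLACK_WEBHOOK_URL!, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: `:rotating_light: ${taskName} failed ${recent.length} times in the last 5 minutes`,
      }),
    });
  }
}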


FAQ: async AI agents debugging

Why does async make debugging so much harder compared to normal sync code?

In sync code, the call stack is literally the history of how you got to the error. In async, tasks separate from the original call stack the moment they're scheduled. When the error occurs, there's no longer a direct relationship between that error and the code that fired the task. You have to reconstruct that relationship manually through correlation IDs and structured logs.

What's a "silent error" in an async agent and how do I detect it?

A silent error is one that occurs in an async task but never reaches the upper-level error handler. It happens when the Promise rejects but nobody has a .catch() or try/catch attached to that point. To detect them: listen to the unhandledRejection event in Node.js, instrument all async entry points of the agent, and use structured logs that include the traceId at every level.
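
A minimal version of that safety net, in Node:

// Last-resort net for rejections nobody caught; the correlation context is
// usually gone by this point, which is why it complements (not replaces)
// wrapping every task explicitly.
process.on('unhandledRejection', (reason) => {
  console.error(JSON.stringify({
    event: 'unhandled_rejection',
    reason: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
    timestamp: Date.now(),
  }));
});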

Do agent frameworks like LangChain or LlamaIndex solve this?

Partially. LangChain has callbacks that capture chain events, but coverage of deep async errors is inconsistent. LlamaIndex has similar observability. Neither gives you complete correlation of a task failing five levels deep in a subtask tree without additional configuration. They're a good starting point, not a complete solution.

How many correlation IDs do I need to propagate in a typical agent?

With just one well-propagated ID (the traceId of the root request) you already get 80% of the value. If the agent has parallel subtasks, adding a spanId per task and a parentSpanId gives you the full tree structure. More than that starts to be overhead that doesn't pay you back in real debugging, unless you're operating at thousands of requests per minute.
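
If you do add spans, they can hang off the same context shape as the AsyncLocalStorage example above. A small illustrative helper:

import { randomUUID } from 'crypto';

// Create a child span that stays linked to the root request and to its parent task
function childSpan(parent: { traceId: string; spanId: string }, taskName: string) {
  return {
    traceId: parent.traceId,     // same root request
    spanId: randomUUID(),        // new ID for this subtask
    parentSpanId: parent.spanId, // link back to the task that fired it
    taskName,
  };
}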

Is there any signal that an agent is in trouble before it fails completely?

Yes, and it's what took me the longest to identify: the p99 latency starts climbing before the p50. If the p50 of agent responses is stable but the p99 starts growing, there are async tasks waiting for something (a lock, a rate limit, a connection) without propagating it as an error yet. It's the earliest warning signal I've found in my own systems.

Is it worth adding full distributed tracing (OpenTelemetry) to a small agent?

Depends on the volume. For an agent handling fewer than 100 requests per hour, the full OpenTelemetry setup is overhead that won't pay off. Structured logs with manual correlation IDs are enough. For more than 500 requests per hour, or if the agent has more than 5 async steps, OTel starts to be worth the investment. I haven't added it to CrabTrap yet; I use structured logs with grep and jq, and that's enough for now.


The uncomfortable part of all this

Something bothers me about the narrative of the HN post and the broader discussion around async agents: architecture gets talked about as if observability is an implementation detail you sort out later.

It's not.

When I moved from a sync pipeline to an async one in CrabTrap, the first month was technically more performant and operationally more blind. I had better throughput and a worse ability to diagnose what was happening. That's not an acceptable tradeoff in production — it's a debt that charges you interest when something fails at 2am.

I remember the moment with Next.js's App Router — which I mentioned before — where I spent two weeks complaining that it was breaking my abstractions. With async agents I made the opposite mistake: I adopted it without complaining and without understanding what I was giving up. What I gave up was visibility. And visibility in systems that make autonomous decisions isn't a technical luxury; it's an operational responsibility.

What I'd do differently if I started from scratch: before writing the first async task, I write the logging system. Not as an afterthought. As the first component. Because in a system where the error can be silence, observability isn't the layer on top. It's the foundation.

If you're thinking about broader agent architectures, the Windows 9x Subsystem for Linux context reminded me of something similar: the most expensive technical debt is the kind you can't see. Async agents with zero observability are exactly that — debt that doesn't show up until the system has to answer for itself.

The HN post is fine. Async is the right path for agents at scale. But the title should be "All your agents are going async — and your debugging stack isn't ready for it."

That's what nobody is solving well yet. And 127 points doesn't change that reality.


This article was originally published on juanchi.dev
