DEV Community

AdmilsonCossa

AI agents do not fail in one place

They fail across concurrency, retries, timeouts, queues, tools, streams, and provider calls.

That is why the JavaScript ecosystem has huge demand for separate async primitives:

| Package | Weekly downloads |
| --- | --- |
| p-limit | ~204M |
| p-map | ~53M |
| p-timeout | ~36M |
| p-retry | ~38M |
| async-retry | ~24M |
| p-queue | ~23M |
| bottleneck | ~10M |

These libraries are good.

But the production pain is deeper:

You want concurrency → add p-limit

You want retries → add p-retry

You want timeouts → add p-timeout

You want queues → add p-queue

You want rate limits → add bottleneck

Now each primitive owns a different part of the lifecycle.

None of them coordinates cancellation with the others.
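
To make that gap concrete, here is a minimal sketch built from hand-rolled stand-ins rather than the packages listed above. The helper names `slowToolCall` and `naiveTimeout` are hypothetical. It shows the general failure mode: a timeout built on `Promise.race` rejects the caller, but the work it raced against never hears about it and runs to completion.

```javascript
// Hypothetical stand-in for a slow tool or provider call.
// It records when it finishes so we can see whether it kept running.
const log = [];
function slowToolCall(ms) {
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push("tool call finished");
      resolve("result");
    }, ms);
  });
}

// A timeout that only races the promise. It rejects the caller,
// but it has no way to stop the work it was racing against.
function naiveTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

async function demo() {
  try {
    await naiveTimeout(slowToolCall(50), 10);
  } catch (err) {
    log.push(err.message);
  }
  // Wait long enough for the abandoned call to finish on its own.
  await new Promise((resolve) => setTimeout(resolve, 80));
  return log;
}
```

Calling `demo()` resolves to `["timed out", "tool call finished"]`. The caller gave up after 10 ms, yet the call still ran to the end, which for a paid provider means it would still bill.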

When an AI agent fails mid‑flight:

Who stops the retry?

Who clears the timeout?

Who drains the queue?

Who cleans up the tool call?

Who prevents the losing provider from continuing to bill?
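
One way to answer those questions with platform primitives alone is to route everything through a single `AbortSignal`. The sketch below is an assumption about what such ownership could look like, not WorkIt's implementation. `retryWithTimeout`, its options, and `cancellableSleep` are hypothetical names, and the task is assumed to cooperate by stopping when its signal aborts.

```javascript
// Retries `task` up to `retries` extra times with a per-attempt timeout,
// all governed by one parent AbortSignal. `task` receives a signal and is
// expected to stop its own work (fetch, tool call) when that signal aborts.
async function retryWithTimeout(task, { retries, timeoutMs, signal }) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    // The owner cancelled: stop retrying immediately.
    if (signal.aborted) throw signal.reason;

    // Each attempt gets a child controller linked to the parent signal.
    const attemptController = new AbortController();
    const onParentAbort = () => attemptController.abort(signal.reason);
    signal.addEventListener("abort", onParentAbort, { once: true });
    const timer = setTimeout(
      () => attemptController.abort(new Error("attempt timed out")),
      timeoutMs
    );

    try {
      return await task(attemptController.signal, attempt);
    } catch (err) {
      lastError = err;
    } finally {
      // Cleanup lives in one place: timer and listener are always released.
      clearTimeout(timer);
      signal.removeEventListener("abort", onParentAbort);
    }
  }
  throw lastError;
}

// Hypothetical cooperative task: resolves after `ms` unless aborted first.
function cancellableSleep(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal.aborted) return reject(signal.reason);
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal.reason);
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("done");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

With this shape, aborting the parent controller stops the current attempt and prevents every later one, and the `finally` block is the only place that clears timers. The trade-off is that a task which ignores its signal cannot be stopped this way.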


WorkIt explores a different model:

work(items)
  .inParallel(8)
  .withRetry(3)
  .withTimeout("5s")
  .do(fn)

One scope. One owner. One cancellation path. One cleanup model.

Concurrency, retry, timeout – under the same ownership tree.
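
For comparison, the same single-owner idea can be sketched for concurrency using only the standard `AbortController`. This is not WorkIt's API or implementation. `mapWithOwner` is a hypothetical helper that runs at most `limit` tasks at once and cancels every sibling as soon as one fails.

```javascript
// Runs fn(item, signal) for every item with at most `limit` in flight.
// One AbortController owns the whole batch: the first failure aborts
// in-flight siblings and prevents queued items from starting.
async function mapWithOwner(items, limit, fn) {
  const owner = new AbortController();
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length && !owner.signal.aborted) {
      const index = next++;
      try {
        results[index] = await fn(items[index], owner.signal);
      } catch (err) {
        // The first error becomes the abort reason; later calls are no-ops.
        owner.abort(err);
        throw err;
      }
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  // Wait for every worker to wind down before reporting the outcome.
  await Promise.allSettled(workers);
  if (owner.signal.aborted) throw owner.signal.reason;
  return results;
}
```

Because every task receives the same signal, there is exactly one place where "stop" is decided. Layering retries or timeouts on top would mean deriving child signals from `owner.signal` rather than creating unrelated controllers.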

👉 Read the full article:

Concurrency, Retry, and Timeout Under One Owner

npm install @workit/core

Top comments (2)

Raju Dandigam

This is a great systems-level way to describe agent failure. In real agent workflows, the bug is rarely isolated to the model call — it often crosses retries, timeouts, queue state, provider billing, and cleanup. I like the “one owner, one cancellation path” framing because it makes failure handling explicit instead of scattered across helper libraries. This same ownership model would be valuable for tracing agent runs end to end.

AdmilsonCossa

Appreciate this, @raju_dandigam. The tracing angle is exactly where my head's been too, and I think it falls out of the ownership model almost for free. The scope tree basically becomes the trace tree.

Parent/child relationships are already there, and the cancellation path tells you where execution was abandoned vs. where it actually failed, which is something flat logs usually lose. Concrete case: if a user disconnects from an agent session, WorkIt propagates cancellation across the whole scope instead of leaving retries or provider calls alive with no owner.
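
A rough sketch of that idea, using a hypothetical `Scope` class built on the standard `AbortController` (this is illustrative only, not WorkIt's internals):

```javascript
// A minimal scope: owns an AbortController, knows its children, and records
// how it ended. Cancelling a parent cancels every descendant.
class Scope {
  constructor(name, parent = null) {
    this.name = name;
    this.children = [];
    this.status = "running"; // "running" | "ok" | "failed" | "cancelled"
    this.controller = new AbortController();
    if (parent) {
      parent.children.push(this);
      if (parent.signal.aborted) this.cancel();
      else parent.signal.addEventListener("abort", () => this.cancel(), { once: true });
    }
  }

  get signal() {
    return this.controller.signal;
  }

  cancel() {
    // Only work that was still running counts as abandoned.
    if (this.status === "running") this.status = "cancelled";
    this.controller.abort();
  }

  // Runs fn under this scope and records whether it succeeded or threw.
  async run(fn) {
    try {
      const value = await fn(this.signal);
      if (this.status === "running") this.status = "ok";
      return value;
    } catch (err) {
      // An error that arrives after cancellation is abandonment, not failure.
      if (this.status === "running") this.status = "failed";
      throw err;
    }
  }

  // The ownership tree doubles as a trace tree.
  trace(depth = 0) {
    const line = `${"  ".repeat(depth)}${this.name}: ${this.status}`;
    return [line, ...this.children.flatMap((child) => child.trace(depth + 1))];
  }
}
```

Calling `cancel()` on a session scope marks every still-running descendant as `cancelled`, while a child that had already thrown keeps its `failed` status, so `trace()` separates abandoned work from real errors.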

Curious if you've seen the inverse in production — retries or background work outliving the request that spawned them. Feels like a surprisingly common source of hidden cost and weird system behavior.