They fail across concurrency, retries, timeouts, queues, tools, streams, and provider calls.
That is why the JavaScript ecosystem has huge demand for separate async primitives:
| Package | Weekly Downloads |
|---|---|
p-limit |
~204M |
p-map |
~53M |
p-timeout |
~36M |
p-retry |
~38M |
async-retry |
~24M |
p-queue |
~23M |
bottleneck |
~10M |
These libraries are good.
But the production pain is deeper:
You want concurrency → add p-limit
You want retries → add p-retry
You want timeouts → add p-timeout
You want queues → add p-queue
You want rate limits → add bottleneck
Now each primitive owns a different part of the lifecycle.
None coordinate cancellation together.
When an AI agent fails mid‑flight:
Who stops the retry?
Who clears the timeout?
Who drains the queue?
Who cleans up the tool call?
Who prevents the losing provider from continuing to bill?
WorkIt explores a different model:
work(items)
.inParallel(8)
.withRetry(3)
.withTimeout("5s")
.do(fn)
One scope. One owner. One cancellation path. One cleanup model.
Concurrency, retry, timeout – under the same ownership tree.
👉 Read the full article:
Concurrency, Retry, and Timeout Under One Owner
npm install @workit/core
Top comments (2)
This is a great systems-level way to describe agent failure. In real agent workflows, the bug is rarely isolated to the model call — it often crosses retries, timeouts, queue state, provider billing, and cleanup. I like the “one owner, one cancellation path” framing because it makes failure handling explicit instead of scattered across helper libraries. This same ownership model would be valuable for tracing agent runs end to end.
Appreciate this @raju_dandigam The tracing angle is exactly where my head's been too, and I think it falls out of the ownership model almost for free. The scope tree basically becomes the trace tree.
Parent/child relationships are already there, and the cancellation path tells you where execution was abandoned vs. where it actually failed — which is something flat logs usually lose. Concrete case: if a user disconnects from an agent session, Workit propagates cancellation across the whole scope instead of leaving retries or provider calls alive with no owner.
Curious if you've seen the inverse in production — retries or background work outliving the request that spawned them. Feels like a surprisingly common source of hidden cost and weird system behavior.