Egeo Minotti

I built a saga workflow engine for Bun (no Temporal, no Redis, no Kafka)

TL;DR

I built a workflow engine inside bunqueue with first-class saga compensation, retries with exponential backoff, branching, parallel steps, loops, sub-workflows, and human-in-the-loop signals. It runs on Bun + SQLite. Zero external infrastructure. The DSL is plain TypeScript and reads like pseudocode. Source: bunqueue.dev/guide/workflow.

The problem nobody wants to deal with

Distributed transactions are the most underestimated problem in modern backend development.

An order comes in. You need to validate it, reserve inventory, charge the card, notify the warehouse, send a confirmation email, update analytics. Six services, six failure points.

What happens if the charge succeeds but the warehouse is offline? You just took money from a customer for a product that will never ship.

The academic answer is two-phase commit. The pragmatic answer, around since Garcia-Molina and Salem's 1987 paper, is the Saga Pattern: split the transaction into steps, each with its own compensation, and if anything goes wrong, run the rollbacks in reverse order.

It's elegant. It's also a nightmare to implement by hand.
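
For contrast, here's roughly what that rollback bookkeeping looks like when you hand-roll it — a minimal sketch with stubbed-out step functions, not bunqueue code:

```typescript
// A hand-rolled saga: every completed step must register its undo, and a
// later failure must replay those undos in reverse. This bookkeeping is
// exactly what a workflow engine absorbs for you.
type Compensation = () => Promise<void>;

async function manualCheckout(
  reserve: () => Promise<void>,
  charge: () => Promise<void>,
  dispatch: () => Promise<void>,
  release: Compensation,   // undo for reserve
  refund: Compensation,    // undo for charge
): Promise<'ok' | 'rolled-back'> {
  const undo: Compensation[] = [];
  try {
    await reserve();
    undo.push(release);
    await charge();
    undo.push(refund);
    await dispatch(); // last side effect: nothing registered after it
    return 'ok';
  } catch {
    // run compensations in reverse order of completion
    for (const c of undo.reverse()) await c();
    return 'rolled-back';
  }
}
```

And this is the trivial three-step case — add branching, retries, or a process crash mid-rollback, and the hand-rolled version falls apart.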

Why I built another workflow engine

I evaluated the existing players:

  • Temporal: powerful but operationally heavy. You're running a cluster.
  • Inngest: cloud-only by default. Vendor lock-in.
  • Trigger.dev: nice DX, but again, infrastructure to run.
  • AWS Step Functions: locked to AWS, JSON-based DSL, no local dev story.

I wanted sagas without any of that. Just TypeScript, a SQLite file, and a fluent DSL that reads like pseudocode. So I built one inside bunqueue, the job queue I already maintain for Bun.

The whole thing is ~1500 lines and lives at src/client/workflow/. Each step is a regular bunqueue job on an internal __wf:steps queue. Persistence is the same SQLite store. State survives restarts, crashes, deploys.

The smallest possible example

import { Workflow, Engine } from 'bunqueue/workflow';

const flow = new Workflow('greet')
  .step('hello', async (ctx) => {
    return { message: `Hi, ${(ctx.input as { name: string }).name}` };
  });

const engine = new Engine({ embedded: true });
engine.register(flow);

const run = await engine.start('greet', { name: 'Ada' });

That's the entire surface area for a one-step workflow. embedded: true means no separate server — bunqueue spins up an in-process SQLite store and runs steps as background jobs.

Compensation: the part that actually matters

Here's where the saga pattern earns its keep. You attach a compensate handler to any step, and if a later step fails, the engine walks back through everything that completed and runs the compensations in reverse.

const checkout = new Workflow('checkout')
  .step('reserve-inventory',
    async (ctx) => {
      const items = (ctx.input as { items: Item[] }).items;
      return { reservationId: await stock.reserve(items) };
    },
    {
      compensate: async (ctx) => {
        const id = (ctx.steps['reserve-inventory'] as { reservationId: string }).reservationId;
        await stock.release(id);
      },
    },
  )
  .step('charge-card',
    async (ctx) => {
      const amount = (ctx.input as { amount: number }).amount;
      return { txId: await payments.charge(amount) };
    },
    {
      retry: 3,
      compensate: async (ctx) => {
        const txId = (ctx.steps['charge-card'] as { txId: string }).txId;
        await payments.refund(txId);
      },
    },
  )
  .step('dispatch-warehouse', async (ctx) => {
    return await warehouse.dispatch(ctx.steps['reserve-inventory']);
  })
  .step('send-receipt', async (ctx) => {
    return await email.send((ctx.input as { email: string }).email);
  });

If dispatch-warehouse throws after charge-card succeeded, the engine automatically:

  1. Refunds the card (compensate for charge-card)
  2. Releases the inventory (compensate for reserve-inventory)

In that order. You write zero rollback orchestration code. No try/catch pyramids, no hand-rolled state machines, no boolean flags scattered across the database to remember what you've already done.

The execution state during compensation is 'compensating', so you can hook into it via the event emitter and surface it to your observability stack.

Retries with exponential backoff

Every step accepts a retry count. The engine retries with min(500ms * 2^attempt + jitter, 30s). The jitter prevents thundering herds when you have 10k workflows hitting the same flaky downstream.

.step('flaky-api-call', handler, { retry: 5, timeout: 10_000 })

Failed attempts emit a step:retry event with the attempt number, so you can wire this into Datadog or whatever you use without a polling loop.
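
The backoff formula can be sketched directly — the base, factor, and cap come from the text above, while the jitter range here is my assumption (the engine's exact jitter is internal):

```typescript
// Sketch of the documented retry delay: min(500ms * 2^attempt + jitter, 30s).
// Jitter defaults to a random value; pass it explicitly for deterministic tests.
function backoffMs(attempt: number, jitter = Math.random() * 250): number {
  const BASE = 500;     // 500ms starting delay
  const CAP = 30_000;   // never wait longer than 30s
  return Math.min(BASE * 2 ** attempt + jitter, CAP);
}
```

So attempts wait roughly 500ms, 1s, 2s, 4s, ... until the 30s ceiling, with jitter spreading out retries from concurrent workflows.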

Branching

A branch routes execution based on runtime context. Only the matching path runs; the others are skipped entirely. Steps after the branch always execute.

new Workflow('kyc')
  .step('assess-risk', async (ctx) => ({ tier: classifyRisk(ctx.input) }))
  .branch((ctx) => (ctx.steps['assess-risk'] as { tier: string }).tier)
    .path('low',    (w) => w.step('auto-approve', autoApprove))
    .path('medium', (w) => w.step('request-docs', requestDocs))
    .path('high',   (w) => w.step('manual-review', manualReview))
  .step('finalize', finalize);

Parallel steps

Parallel groups use Promise.allSettled under the hood. If any step in the group fails, the whole group fails and compensation kicks in.

.parallel((w) => w
  .step('notify-warehouse', notifyWarehouse)
  .step('send-email', sendEmail)
  .step('update-analytics', updateAnalytics)
)

Loops and iteration

Three flavors, all with a maxIterations safety limit (default 100):

// doUntil — run, then check
.doUntil(
  (ctx) => (ctx.steps['poll'] as { ready: boolean }).ready,
  (w) => w.step('poll', pollStatus),
  { maxIterations: 20 },
)

// doWhile — check, then run
.doWhile(
  (ctx, i) => i < 5,
  (w) => w.step('process-batch', processBatch),
)

// forEach — iterate over a dynamic list
.forEach(
  (ctx) => (ctx.input as { items: Item[] }).items,
  'process-item',
  async (ctx) => {
    const item = ctx.steps.__item as Item;
    const idx = ctx.steps.__index as number;
    return await process(item, idx);
  },
)

forEach results are stored under indexed names (process-item:0, process-item:1, ...) so you can fan back in afterward.
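
A small helper — hypothetical, not part of bunqueue's API — shows one way to fan those indexed results back in from ctx.steps:

```typescript
// Collects `name:0`, `name:1`, ... from a steps record into an ordered array,
// stopping at the first missing index.
function collectIndexed<T>(steps: Record<string, unknown>, name: string): T[] {
  const out: T[] = [];
  for (let i = 0; `${name}:${i}` in steps; i++) {
    out.push(steps[`${name}:${i}`] as T);
  }
  return out;
}

// In a step after the forEach:
//   const results = collectIndexed<ItemResult>(ctx.steps, 'process-item');
```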

Sub-workflows

Workflows can call other workflows. Parent pauses, child runs, results nest under ctx.steps['sub:<child-name>']. Failures propagate upward and trigger parent compensation.

.subWorkflow('payment', (ctx) => ({ amount: (ctx.input as { total: number }).total }))
.step('after-payment', async (ctx) => {
  const result = ctx.steps['sub:payment'];
  // ...
});

This is how you keep individual workflows small and composable instead of building one monstrous flow.

Human-in-the-loop

This is my favorite feature.

Call .waitFor('event-name') and execution pauses. State becomes 'waiting'. The workflow sits there for hours or days until an external system calls engine.signal(executionId, 'event-name', payload). The signal payload becomes available in ctx.signals['event-name'].

const deploy = new Workflow('deploy')
  .step('build', buildArtifacts)
  .step('deploy-staging', deployStaging)
  .waitFor('approval', { timeout: 48 * 60 * 60 * 1000 }) // 48h
  .step('deploy-prod', async (ctx) => {
    const approval = ctx.signals['approval'] as { approver: string };
    return await deployProd({ approver: approval.approver });
  });

// Later, from your approval UI:
await engine.signal(runId, 'approval', { approver: 'alice@corp.io' });

If nobody approves within 48 hours, the signal times out, the workflow fails, and compensation runs. Perfect for approval gates, manual reviews, KYC compliance, production deploys requiring human sign-off.

You don't need a separate scheduler. You don't need to poll. You don't need to manage timer state in Postgres. The workflow is genuinely asleep until either the signal arrives or the timeout fires.

Schema validation (bring your own)

Steps accept inputSchema and outputSchema. Anything with a .parse() method works — Zod, ArkType, Valibot, your own custom validator. Duck typing means I don't ship a hard dep on any particular library.

import { z } from 'zod';

const InputSchema = z.object({ orderId: z.string(), amount: z.number().positive() });

.step('validate', handler, { inputSchema: InputSchema })

If validation throws, the step fails like any other step — retries, compensation, and event emission all kick in.
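
To make the duck typing concrete, here's a minimal hand-rolled schema — just an object with a throwing .parse() method — that would satisfy the same contract, no library required:

```typescript
// Any object exposing parse(value): T works as a schema. This one validates
// a positive numeric amount and throws on anything else.
const PositiveAmount = {
  parse(value: unknown): { amount: number } {
    const v = value as { amount?: unknown };
    if (typeof v?.amount !== 'number' || v.amount <= 0) {
      throw new Error('amount must be a positive number');
    }
    return { amount: v.amount };
  },
};

// Usage, same as with Zod:
// .step('charge', handler, { inputSchema: PositiveAmount })
```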

Observability

The engine emits typed events for everything that matters:

  • workflow:started, workflow:completed, workflow:failed, workflow:waiting, workflow:compensating
  • step:started, step:completed, step:failed, step:retry
  • signal:received, signal:timeout

You can subscribe globally via EngineOptions.onEvent or per-execution:

const unsub = engine.subscribe(run.id, (event) => {
  if (event.type === 'step:retry') {
    metrics.increment('workflow.step.retry', { step: event.stepName });
  }
});

For history management there's engine.cleanup({ maxAge }) and engine.archive({ maxAge }) — same idea, one deletes, the other moves to cold storage.

A complete real-world example

Here's an e-commerce checkout end-to-end. Validation, parallel notifications, conditional fulfillment based on customer tier, and full saga compensation on failure:

import { Workflow, Engine } from 'bunqueue/workflow';

const checkout = new Workflow('checkout')
  .step('validate-cart', async (ctx) => {
    const cart = (ctx.input as { cart: Cart }).cart;
    if (cart.items.length === 0) throw new Error('Empty cart');
    return { total: cart.items.reduce((s, i) => s + i.price, 0) };
  })
  .step('reserve-inventory',
    async (ctx) => ({ resId: await stock.reserve(ctx.input) }),
    { compensate: async (ctx) => stock.release((ctx.steps['reserve-inventory'] as any).resId) },
  )
  .step('charge-card',
    async (ctx) => ({ txId: await payments.charge((ctx.steps['validate-cart'] as any).total) }),
    {
      retry: 3,
      compensate: async (ctx) => payments.refund((ctx.steps['charge-card'] as any).txId),
    },
  )
  .branch((ctx) => (ctx.input as { tier: string }).tier)
    .path('vip', (w) => w.step('priority-ship', priorityShip))
    .path('standard', (w) => w.step('standard-ship', standardShip))
  .parallel((w) => w
    .step('notify-warehouse', notifyWarehouse)
    .step('send-receipt', sendReceipt)
    .step('update-analytics', updateAnalytics)
  )
  .step('finalize', async () => ({ status: 'complete' }));

const engine = new Engine({
  embedded: true,
  dataPath: './data/checkout.db',
  onEvent: (e) => logger.info(e),
});

engine.register(checkout);
const run = await engine.start('checkout', { cart, tier: 'vip' });

If anything in priority-ship, the parallel block, or finalize fails, the engine refunds the card and releases the inventory automatically. You wrote one compensate per side-effecting step — that's it.

How it compares

| Feature | bunqueue | Temporal | Inngest | Trigger.dev | Step Functions |
|---|---|---|---|---|---|
| Saga compensation | First-class | Manual | Limited | Limited | Manual |
| Infrastructure | None (SQLite) | Cluster | Cloud | Cloud/self-host | AWS |
| Local dev | Same as prod | Heavy | Limited | OK | None |
| Language | TypeScript | Multi (heavy SDKs) | TypeScript | TypeScript | JSON DSL |
| Vendor lock-in | None | None | Yes (cloud) | Partial | Total |
| Human-in-the-loop | Built-in | Built-in | Built-in | Built-in | Built-in |

bunqueue isn't trying to replace Temporal at planet-scale. If you're running a million workflows per minute across regions, use Temporal. If you're building a SaaS, an internal tool, or anything that fits comfortably on one or two boxes, bunqueue gets you 90% of the value with 1% of the operational cost.

Getting started

bun add bunqueue

import { Workflow, Engine } from 'bunqueue/workflow';

const flow = new Workflow('hello')
  .step('greet', async (ctx) => ({ msg: `Hi, ${(ctx.input as { name: string }).name}` }));

const engine = new Engine({ embedded: true });
engine.register(flow);
const run = await engine.start('hello', { name: 'world' });

Full guide: bunqueue.dev/guide/workflow

GitHub, npm, and the rest are linked from the docs.

Closing thoughts

The Saga Pattern has been around since 1987. The reason it doesn't get used more is that the implementation tax is brutal — you end up rebuilding half a workflow engine from scratch every time. Putting it inside a tool you already use, with a TypeScript DSL and zero infrastructure, removes that tax.

If you've ever implemented sagas by hand, you know how much plumbing it takes. If you've ever evaluated Temporal and bounced off the operational overhead, this is for you.

Feedback and contributions welcome. Especially curious to hear from people running checkout flows, KYC pipelines, or CI/CD orchestration — those are the use cases I designed it around.
