Muhammad Arslan


A Durable Execution Graph Engine for Node.js

Links: npm @hazeljs/flow · npm @hazeljs/flow-runtime · GitHub (monorepo) · Example repo (hazeljs-flow-example) · HazelJS


Why We Built It

Modern applications increasingly rely on workflows: multi-step processes that span services, require human input, and must survive restarts. Think order fulfillment, fraud review, document approval, onboarding sequences, or AI agent orchestration. These aren't simple request-response exchanges; they're stateful, long-running, and often asynchronous.

The problem: most Node.js frameworks treat each request as stateless. When you need to "pause" a workflow and resume it later—after a webhook, a manual approval, or a retry—you're on your own. You end up hand-rolling state machines, polling loops, and ad-hoc persistence. That leads to brittle code, lost state, duplicate charges, and debugging nightmares.

@hazeljs/flow was created to give Node.js developers a workflow OS kernel: a durable, auditable, resumable execution graph engine that handles the hard parts—persistence, retries, timeouts, idempotency, and concurrency—so you can focus on business logic.


Storage: In-Memory by Default, Optional Database

You can run without any database. The engine uses in-memory storage by default: zero config, no DATABASE_URL, no migrations. Ideal for local development, tests, demos, or lightweight deployments.

When you need durable persistence (crash recovery, multi-process execution, an audit trail in Postgres), install @prisma/client and use the Prisma adapter from @hazeljs/flow/prisma by passing { storage: createPrismaStorage(prisma) } to the FlowEngine constructor. Run the flow schema migrations and you get the same API with full persistence and Postgres advisory locks.


What Problems Does It Solve?

1. Durability and Crash Recovery (with Prisma storage)

When using the optional Prisma storage, every run's state is persisted to Postgres. If your process crashes mid-flow, the runtime can pick up RUNNING flows on restart and continue them. No manual recovery scripts, no "start from scratch" UX. With in-memory storage, runs survive only for the lifetime of the process—use it when that's acceptable.

2. Wait-and-Resume (Human-in-the-Loop)

Many workflows need to pause for external input: a manager's approval, a payment confirmation, a customer response. Most systems force you to poll, use webhooks with custom state lookup, or build a queue. @hazeljs/flow has a first-class WAIT state. A node returns { status: 'wait', reason: 'awaiting_approval' }, and the run is persisted (or held in memory). When the approval arrives, you call resumeRun(runId, payload) and execution continues. No polling, no custom state tables.
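
To make the wait-and-resume cycle concrete, here is a minimal, self-contained model of it. It is a sketch, not the engine's implementation: the Run record, the 'ok' result shape, the WAITING and COMPLETED status names, and the helper functions are all assumptions made for illustration. Only the { status: 'wait', reason } shape comes from the text above.

```typescript
// Illustrative model only: these types and helpers are hypothetical,
// not exports of @hazeljs/flow.
type NodeResult =
  | { status: 'ok'; output: unknown }     // assumed success shape
  | { status: 'wait'; reason: string };   // shape quoted in the article

interface Run {
  id: string;
  status: 'RUNNING' | 'WAITING' | 'COMPLETED';
  waitReason: string | null;
  approval: { approved: boolean } | null; // payload supplied on resume
}

// A node that pauses until an approval payload has been attached to the run.
function approvalNode(run: Run): NodeResult {
  if (run.approval === null) {
    return { status: 'wait', reason: 'awaiting_approval' };
  }
  return { status: 'ok', output: run.approval };
}

// Apply a node result: park the run on 'wait', finish it on 'ok'.
function applyResult(run: Run, result: NodeResult): Run {
  if (result.status === 'wait') {
    return { ...run, status: 'WAITING', waitReason: result.reason };
  }
  return { ...run, status: 'COMPLETED', waitReason: null };
}

// Resuming attaches the external payload and makes the run runnable again.
function resume(run: Run, approval: { approved: boolean }): Run {
  if (run.status !== 'WAITING') {
    throw new Error(`run ${run.id} is not waiting`);
  }
  return { ...run, status: 'RUNNING', waitReason: null, approval };
}

const started: Run = { id: 'run-1', status: 'RUNNING', waitReason: null, approval: null };
const parked = applyResult(started, approvalNode(started));   // node asks to wait
const resumed = resume(parked, { approved: true });           // approval arrives
const finished = applyResult(resumed, approvalNode(resumed)); // node now completes
```

The point of the model is that the waiting state is plain data. Nothing has to stay in memory or keep polling between the wait and the resume, which is why the run can be persisted in between.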

3. Idempotency and Duplicate Prevention

Charging a card, sending a notification, or updating inventory—these must not run twice. @hazeljs/flow supports idempotency keys per node. If a node has already run with the same key (e.g. order:ORD-123:charge), the engine reuses the cached output instead of re-executing. Critical for payments and external APIs.
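
The idea behind idempotency keys can be sketched in a few lines. The runOnce helper and its in-memory Map below are hypothetical stand-ins for whatever the engine stores internally; only the key format (order:ORD-123:charge) is taken from the text above.

```typescript
// Hypothetical idempotency cache keyed by idempotency key.
// Not the @hazeljs/flow implementation; a sketch of the technique.
const resultsByKey = new Map<string, Promise<unknown>>();

// Run `handler` at most once per key; later calls reuse the first result.
function runOnce<T>(key: string, handler: () => Promise<T>): Promise<T> {
  const cached = resultsByKey.get(key);
  if (cached !== undefined) {
    return cached as Promise<T>;
  }
  const pending = handler().catch((err: unknown) => {
    // Failed executions are not cached, so a retry can run the handler again.
    resultsByKey.delete(key);
    throw err;
  });
  resultsByKey.set(key, pending);
  return pending;
}

// Example side effect we must not repeat.
let chargeCount = 0;
async function chargeCard(): Promise<{ chargeId: string }> {
  chargeCount += 1;
  return { chargeId: `ch_${chargeCount}` };
}
```

Calling runOnce('order:ORD-123:charge', chargeCard) twice runs chargeCard once and returns the same output both times. Caching the promise rather than the resolved value also covers two calls that overlap in time within one process.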

4. Retries and Backoff

Transient failures (network blips, rate limits) are common. @hazeljs/flow lets you attach a retry policy to any node: maxAttempts, backoff: 'fixed' | 'exponential', baseDelayMs, maxDelayMs. The engine retries automatically and emits a NODE_FAILED event for each failed attempt, so you get observability without writing retry loops.
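
As a rough illustration of how those four fields interact, the sketch below computes the delay before each retry. The field names come from the policy described above; the formula (doubling per attempt, capped at maxDelayMs) is an assumption about typical exponential backoff, not a statement of what the engine does internally.

```typescript
// Field names follow the article; the delay formula is an assumption.
interface RetryPolicy {
  maxAttempts: number;
  backoff: 'fixed' | 'exponential';
  baseDelayMs: number;
  maxDelayMs: number;
}

// Delay to wait before retry number `retry` (1 = first retry).
function delayBeforeRetry(policy: RetryPolicy, retry: number): number {
  const raw =
    policy.backoff === 'fixed'
      ? policy.baseDelayMs
      : policy.baseDelayMs * Math.pow(2, retry - 1);
  return Math.min(raw, policy.maxDelayMs);
}

const policy: RetryPolicy = {
  maxAttempts: 5,
  backoff: 'exponential',
  baseDelayMs: 200,
  maxDelayMs: 1000,
};

// Delays before retries 1..4: [200, 400, 800, 1000]
const delays = [1, 2, 3, 4].map((retry) => delayBeforeRetry(policy, retry));
```

The cap matters: without maxDelayMs the fourth retry in this example would wait 1600 ms and later ones would keep doubling.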

5. Timeouts

A stuck node can block a run forever. @hazeljs/flow supports per-node timeouts. If a handler exceeds timeoutMs, it's treated as a timeout error (retryable if you have a retry policy). No more orphaned runs.
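
A per-node timeout is commonly built by racing the handler against a timer. The withTimeout helper and NodeTimeoutError class below are hypothetical names used for this sketch; the article does not show how the engine implements timeoutMs.

```typescript
// Hypothetical error type for this sketch.
class NodeTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`node handler exceeded ${timeoutMs} ms`);
    this.name = 'NodeTimeoutError';
  }
}

// Reject with NodeTimeoutError if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new NodeTimeoutError(timeoutMs)), timeoutMs);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Note that racing does not cancel the underlying work: a handler that timed out may still finish later. That is one reason timeouts pair well with idempotency keys on nodes that have side effects.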

6. Branching and Conditional Logic

Workflows often branch: "if risk score < 30, approve; else if < 70, review; else reject." @hazeljs/flow supports conditional edges with a when(ctx) predicate. Edges are evaluated by priority; if multiple match at the same priority, the engine fails deterministically with AMBIGUOUS_EDGE—no silent wrong-path bugs.
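
The edge-selection rule can be modelled as a small pure function. This is a sketch under two assumptions that the article does not confirm: a larger priority number wins, and an edge without a when predicate always matches. The Edge type and selectNext are hypothetical names.

```typescript
// Hypothetical edge model; not the @hazeljs/flow types.
interface Edge<C> {
  to: string;
  priority: number;            // assumption: larger number = evaluated first
  when?: (ctx: C) => boolean;  // assumption: missing predicate always matches
}

// Pick the next node id, or null if no edge matches.
// Throws AMBIGUOUS_EDGE when several matching edges share the top priority.
function selectNext<C>(edges: Edge<C>[], ctx: C): string | null {
  const matching = edges.filter((edge) => (edge.when ? edge.when(ctx) : true));
  const top = Math.max(...matching.map((edge) => edge.priority));
  const [winner, ...ties] = matching.filter((edge) => edge.priority === top);
  if (winner === undefined) {
    return null;
  }
  if (ties.length > 0) {
    throw new Error('AMBIGUOUS_EDGE');
  }
  return winner.to;
}

interface RiskContext {
  score: number;
}

// The risk-score example from the text: < 30 approve, < 70 review, else reject.
const riskEdges: Edge<RiskContext>[] = [
  { to: 'approve', priority: 3, when: (ctx) => ctx.score < 30 },
  { to: 'review', priority: 2, when: (ctx) => ctx.score < 70 },
  { to: 'reject', priority: 1 },
];
```

With these edges a score of 10 routes to approve, 50 to review, and 90 to reject. Giving approve and review the same priority would make a score of 10 throw AMBIGUOUS_EDGE instead of silently picking one.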

7. Concurrency Safety

Multiple workers might tick the same run. With Prisma storage, @hazeljs/flow uses Postgres advisory locks per run: only one process can execute a run at a time; others get LOCK_BUSY and can retry. With in-memory storage, an in-process per-run lock prevents concurrent ticks in the same Node process. No race conditions, no duplicate side effects.
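
For the in-memory case, a per-run lock can be as simple as a set of run ids. The withRunLock helper below is a hypothetical sketch of that idea; it only protects a single Node process, and the Postgres advisory-lock path used with Prisma storage is not shown here.

```typescript
// Hypothetical in-process lock registry: one entry per run being ticked.
const lockedRuns = new Set<string>();

// Run `fn` while holding the lock for `runId`.
// Fails fast with LOCK_BUSY instead of queueing, so the caller can retry later.
async function withRunLock<T>(runId: string, fn: () => Promise<T>): Promise<T> {
  if (lockedRuns.has(runId)) {
    throw new Error('LOCK_BUSY');
  }
  lockedRuns.add(runId);
  try {
    return await fn();
  } finally {
    // Always release, even if the tick throws.
    lockedRuns.delete(runId);
  }
}
```

Because the check and the add happen synchronously before the first await, two ticks started back to back in the same process cannot both acquire the lock.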

8. Audit Trail

Every run has a timeline of events: RUN_STARTED, NODE_STARTED, NODE_FINISHED, NODE_FAILED, RUN_WAITING, RUN_COMPLETED, RUN_ABORTED. You can replay what happened, debug failures, and satisfy compliance requirements.
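
Once a timeline is available as data, simple questions about a run become short queries. In the sketch below the event type names are the ones listed above, while the FlowEvent record shape (nodeId, at) and the sample timeline are assumptions made for illustration.

```typescript
// Event names from the article; the record shape is hypothetical.
type FlowEventType =
  | 'RUN_STARTED'
  | 'NODE_STARTED'
  | 'NODE_FINISHED'
  | 'NODE_FAILED'
  | 'RUN_WAITING'
  | 'RUN_COMPLETED'
  | 'RUN_ABORTED';

interface FlowEvent {
  type: FlowEventType;
  nodeId: string | null; // null for run-level events
  at: string;            // ISO 8601 timestamp
}

// Count failed attempts per node, e.g. to spot flaky integrations.
function failedAttemptsByNode(timeline: FlowEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of timeline) {
    if (event.type === 'NODE_FAILED' && event.nodeId !== null) {
      counts.set(event.nodeId, (counts.get(event.nodeId) ?? 0) + 1);
    }
  }
  return counts;
}

// Made-up sample data: 'charge' fails twice, then succeeds.
const sampleTimeline: FlowEvent[] = [
  { type: 'RUN_STARTED', nodeId: null, at: '2024-01-01T10:00:00Z' },
  { type: 'NODE_STARTED', nodeId: 'charge', at: '2024-01-01T10:00:01Z' },
  { type: 'NODE_FAILED', nodeId: 'charge', at: '2024-01-01T10:00:02Z' },
  { type: 'NODE_FAILED', nodeId: 'charge', at: '2024-01-01T10:00:04Z' },
  { type: 'NODE_FINISHED', nodeId: 'charge', at: '2024-01-01T10:00:07Z' },
  { type: 'RUN_COMPLETED', nodeId: null, at: '2024-01-01T10:00:08Z' },
];

const failures = failedAttemptsByNode(sampleTimeline);
```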


Real-World Scenarios It Solves

Scenario | How flow + flow-runtime help
Order & fulfillment pipelines | One flow per order: validate → reserve stock → charge → ship → notify. Wait nodes for payment webhooks or warehouse callbacks; resume when events arrive. Durable state and idempotency prevent double charges and lost orders.
Approval workflows | Expense, PTO, procurement: run starts → wait for approver → resume with payload. No polling or custom state tables. Flow-runtime exposes POST /v1/runs/:runId/resume so your UI or approval service just calls the API.
Fraud & risk checks | Per transaction: score → branch (approve / review / reject) → optional manual review. Conditional edges and audit timeline support compliance and dispute resolution.
Multi-step integrations (ETL, sync) | Fetch → transform → write to DB/warehouse → notify on failure. Retries and timeouts for flaky APIs; run as a separate service so other systems trigger runs via HTTP.
Document & case workflows | Insurance, claims, onboarding: intake → validation → wait for documents → decision → payout. Long-lived runs with waits; state in Postgres so restarts don’t lose progress; timeline for auditors and support.
SaaS automation | Let customers define or use prebuilt flows; flow-runtime as the backend that runs them. Multi-tenant via tenantId; one shared service, horizontal scaling.
Internal ops & support | Onboard customer → provision resources → send email → create ticket. Or: alert → triage → assign → wait for resolution → close. One place to see status and history; easy to add steps and retries.

Benefits

Benefit | Description
Zero config by default | In-memory storage out of the box. No database or env vars required to run flows.
Framework-agnostic | No dependency on Hazel core. Use it with Express, Fastify, NestJS, or plain Node.
Decorator-first API | Define flows with @Flow, @Entry, @Node, @Edge; familiar to NestJS/Hazel developers.
Optional persistence | Add Postgres when you need it: install @prisma/client, use createPrismaStorage(prisma) from @hazeljs/flow/prisma. Schema and migrations live in the package.
Type-safe | Full TypeScript support: FlowContext, NodeResult, typed handlers.
Testable | Run flows in-process with in-memory storage (or a test DB). No need to spin up a runtime server for unit tests.
Optional runtime | Use FlowEngine directly in your app, or deploy @hazeljs/flow-runtime as a standalone HTTP service (HazelApp). Invoke it programmatically with runFlowRuntime({ flows, port, databaseUrl?, services }) so apps don’t reimplement the server.

Competitive Advantages

vs. Temporal / Cadence

  • Simpler: No separate worker process, no activity/workflow split. Nodes are just async functions.
  • Lighter: Optional Postgres. Start with in-memory; add a single DB when you need durability. No Elasticsearch, Cassandra, or separate Temporal server.
  • Faster to adopt: Define a flow in one file, register it, and run. No SDK concepts to learn.
  • Good for: Teams that want workflow durability without the operational complexity of Temporal.

vs. BullMQ / Inngest / Trigger.dev

  • Stateful graphs: BullMQ is job queues; @hazeljs/flow is execution graphs with branching and wait. You model the flow, not just jobs.
  • Built-in wait/resume: No need to "schedule a follow-up job" for human approval. First-class WAIT state.
  • Audit trail: Every transition is recorded (in memory or DB). BullMQ gives you job history; @hazeljs/flow gives you a run timeline.

vs. Custom State Machines (XState, etc.)

  • Persistence optional: XState is in-memory. @hazeljs/flow can run in-memory or persist to Postgres when you need crash recovery and multi-process safety.
  • Runtime included: You get an HTTP API (@hazeljs/flow-runtime) and recovery logic. With XState, you build that yourself.
  • Idempotency and retries: Built into the engine, not something you wire up per transition.

vs. AWS Step Functions / Google Workflows

  • Self-hosted: No vendor lock-in. Run on your own infra, your own (optional) Postgres.
  • No cold starts: No Lambda limits. Your nodes run in your process or a long-lived runtime.
  • Simpler pricing: You pay for compute (and Postgres if you use it), not per state transition.

Architecture at a Glance

Option A: In-process, in-memory (no DB)
┌─────────────────────────────────────────────────────────────┐
│                     Your Application                        │
│  ┌─────────────┐    ┌────────────────────────────────────┐  │
│  │ FlowEngine  │───▶│ createMemoryStorage()              │  │
│  │ (default)   │    │ (runs, events, idempotency         │  │
│  │             │    │  in memory)                        │  │
│  └─────────────┘    └────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Option B: In-process, with Postgres
┌─────────────────────────────────────────────────────────────┐
│                     Your Application                        │
│  ┌─────────────┐    ┌────────────────────────────────────┐  │
│  │ FlowEngine  │───▶│ createPrismaStorage(prisma)        │  │
│  │             │    │ from @hazeljs/flow/prisma          │  │
│  └─────────────┘    └────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                           Postgres

Option C: Standalone HTTP service (@hazeljs/flow-runtime)
┌─────────────────────────────────────────────────────────────┐
│              @hazeljs/flow-runtime (HazelApp)               │
│  POST /v1/runs/start  │  GET /v1/runs/:id  │  POST .../tick │
│  POST /v1/runs/:id/resume  │  GET /v1/runs/:id/timeline     │
└─────────────────────────────────────────────────────────────┘
                              │
              FlowEngine + (in-memory or Prisma)

Use the runtime by running its process (node dist/main.js) or invoke it from your app with runFlowRuntime({ port, databaseUrl?, flows, services })—no need to reimplement the HTTP API.
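
If you call the standalone runtime from another service, the routes listed in Option C map onto small request builders. The paths below are the ones shown in this article; the base URL, the JSON body shape for resume, and the helper names are assumptions, so check the package README for the actual request contract.

```typescript
// Hypothetical request descriptor for calling the runtime's HTTP API.
interface RuntimeRequest {
  method: 'GET' | 'POST';
  url: string;
  body: string | null;
}

// POST /v1/runs/:runId/resume (body shape is an assumption)
function resumeRequest(baseUrl: string, runId: string, payload: unknown): RuntimeRequest {
  return {
    method: 'POST',
    url: `${baseUrl}/v1/runs/${encodeURIComponent(runId)}/resume`,
    body: JSON.stringify(payload),
  };
}

// GET /v1/runs/:id/timeline
function timelineRequest(baseUrl: string, runId: string): RuntimeRequest {
  return {
    method: 'GET',
    url: `${baseUrl}/v1/runs/${encodeURIComponent(runId)}/timeline`,
    body: null,
  };
}

const resumeCall = resumeRequest('http://localhost:3000', 'run-42', { approved: true });
// resumeCall.url is 'http://localhost:3000/v1/runs/run-42/resume'
```

A descriptor like this can be handed to any HTTP client, for example the global fetch in recent Node versions with a Content-Type: application/json header on the POST. Keeping URL construction separate from sending makes the calls easy to unit test without a running server.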

Example: The hazeljs-flow-example repo registers order-processing, approval, fraud-detection, and other flows and starts the server with runFlowRuntime(...). You can run it locally (npm run run:runtime or npm run run:direct) or browse the source for copy-paste patterns.


When to Use It

Good fit:

  • Order processing, fulfillment pipelines
  • Approval workflows (expense, PTO, procurement)
  • Fraud detection and review queues
  • Onboarding sequences (signup → verify → onboard)
  • AI agent orchestration (plan → execute → wait for human → continue)
  • Document processing with retries and branching
  • Multi-step integrations and ETL with retries and audit

Less ideal:

  • Simple one-off background jobs (use BullMQ or a cron)
  • Real-time streaming (use WebSockets or SSE)
  • High-throughput event sourcing (use Kafka + CQRS)
  • Distributed sagas across many services (consider Temporal)

Getting Started

1. Install (no database required for in-memory)

pnpm add @hazeljs/flow

2. Define and run a flow (in-memory)

import { FlowEngine, Flow, Entry, Node, Edge, buildFlowDefinition } from '@hazeljs/flow';
import type { FlowContext, NodeResult } from '@hazeljs/flow';

@Flow('order-flow', '1.0.0')
class OrderFlow {
  // Entry node; its outgoing edge leads to the 'charge' node.
  @Entry()
  @Node('validate')
  @Edge('charge')
  async validate(ctx: FlowContext): Promise<NodeResult> { ... }

  // Last node in this flow: it declares no outgoing edge.
  @Node('charge')
  async charge(ctx: FlowContext): Promise<NodeResult> { ... }
}

const engine = new FlowEngine();  // uses in-memory storage
await engine.registerDefinition(buildFlowDefinition(OrderFlow));

// `order` is your own input payload, defined elsewhere.
const { runId } = await engine.startRun({ flowId: 'order-flow', version: '1.0.0', input: order });

// Tick the run until it is no longer RUNNING.
let run = await engine.getRun(runId);
while (run?.status === 'RUNNING') {
  run = await engine.tick(runId);
}

3. (Optional) Add Postgres for durability

pnpm add @prisma/client
# Run migrations from the flow package (see package README)

Then construct the engine with the Prisma-backed storage:

import { FlowEngine } from '@hazeljs/flow';
import { createPrismaStorage, createFlowPrismaClient } from '@hazeljs/flow/prisma';

const prisma = createFlowPrismaClient(process.env.DATABASE_URL);
const engine = new FlowEngine({ storage: createPrismaStorage(prisma) });
// same registerDefinition, startRun, tick...

4. (Optional) Run as an HTTP service

Use @hazeljs/flow-runtime: run its built-in process with default demo flows, or invoke it from your app with your own flows:

import { runFlowRuntime } from '@hazeljs/flow-runtime';
import { myFlow } from './flows';

await runFlowRuntime({
  port: 3000,
  databaseUrl: process.env.DATABASE_URL,  // optional; in-memory if omitted
  flows: [myFlow],
  services: { logger, slack },
});

Summary

@hazeljs/flow gives Node.js developers a durable, auditable, resumable workflow engine without the complexity of Temporal or the limitations of simple job queues. It solves real problems: crash recovery (with optional Postgres), wait-and-resume, idempotency, retries, timeouts, branching, and concurrency—with a decorator-based API and in-memory storage by default, so you can run with zero config and add persistence when you need it. Use FlowEngine in your app or deploy @hazeljs/flow-runtime (HazelApp) as a standalone service; you can also invoke the runtime programmatically with runFlowRuntime({ flows, ... }) so your app stays thin and the package owns the server.

Built for developers who need workflows that don't break.



Top comments (3)

arun rajkumar

The category framing is honest — the wait-and-resume + audit-timeline + framework-enforced-idempotency stack is a real win, especially for teams that haven't yet built the homegrown version. Two observations from running the same use case (payment flows: consent → reservation → authorised → settled → reconciled) without an engine, in case it's useful for the comparison readers will inevitably do. First, the in-memory-by-default story is the right onboarding path — every team I've seen evaluate Temporal bounces off the operational complexity (separate workers, separate DB, separate observability surface) before they get to the actual win. Hazel's "in-memory works on day one, add Postgres when you need durability" is a much friendlier ramp than "set up a separate worker process and cluster before you can run a hello-world flow."

Second, the boundary I'd push on: the homegrown shape that competes with this is a state column in Postgres + optimistic locking on the transition (UPDATE ... WHERE state = $previous) + idempotency keys + outbox + BullMQ for retries. For a fintech with three or four core long-running flows that already have to live in the application database for audit reasons, that combination gives most of what @hazeljs/flow gives, with the trade-off being you write the wait-state primitive yourself (we model it as state = consent_pending and a webhook handler that drives the transition). The threshold question — when do you graduate from the homegrown pattern to a workflow engine? — is the conversation most teams in this category never explicitly have. Three signals I'd watch for: (a) more than ~5 distinct long-running flow shapes; (b) the audit timeline becoming a customer-facing surface, not just an internal admin view; (c) needing first-class workflow versioning across schema changes. Curious where in the spectrum you'd put HazelJS for teams already running the homegrown shape — is the upgrade path the same code, or is it a rewrite?

Muhammad Arslan

Really thoughtful breakdown — and I think the distinction you’re making between “workflow engine” and “well-disciplined application state machine” is exactly the right framing.

I completely agree that for a lot of fintech teams, the Postgres state column + optimistic transition locking + idempotency + outbox + queue pattern gets surprisingly far. Especially when the workflows are tightly coupled to the core transactional model anyway, keeping the source of truth in the application DB is often the simplest operational choice.

The place we’re aiming HazelJS is less “replace every well-built homegrown flow system” and more the point where those patterns start repeating across multiple domains and teams. Once you have several long-running flow shapes, retry semantics duplicated across services, customer-facing timelines, and migration/versioning concerns, the orchestration layer itself starts becoming product infrastructure rather than implementation detail.

Your versioning point is especially important — that’s one of the biggest pain areas we’ve seen with hand-rolled state machines once workflows live long enough that schema/process evolution becomes constant.

On the migration question: the goal is definitely incremental adoption rather than rewrite. A lot of teams already have durable state + queues + webhook-driven transitions, so HazelJS should be able to sit on top of or alongside that model instead of forcing a “move everything into a separate orchestration world” rewrite from day one.

And agreed on the Temporal comparison too — Temporal is incredibly powerful, but operationally it asks teams to commit to the workflow model very early. We’re trying to make the entry point much lighter while still preserving the durability/auditability benefits once teams grow into them.

Muhammad Arslan

Happy to help with advice on migrating to HazelJS or upgrading an existing system.
