cypher682

Posted on Jun 29

Building an Async Job Platform with BullMQ, Socket.io, and Webhooks

#node #typescript #backend #devops

NodeFlow is an asynchronous job orchestration and webhook delivery API built with Node.js, TypeScript, BullMQ, Socket.io, Prisma, PostgreSQL, Redis, Docker, and GitHub Actions.

The goal was to build a multi-process distributed backend service, not a CRUD API with a background task bolted on: a system that forces real decisions about process separation, inter-process communication, failure handling, and delivery guarantees.

This is a writeup of what was built, what decisions were made, and why.

Architecture

NodeFlow runs as two separate processes that share state through Redis and PostgreSQL:

Client
  │
  ▼
Express API (port 4000)
  │── POST /v1/jobs      → validates, writes Job to DB, enqueues to BullMQ
  │── GET  /v1/jobs/:id  → reads job state from PostgreSQL
  │── Socket.io server   → pushes real-time status events to connected clients
  │── Bull Board         → /admin/queues visual dashboard
  │── Swagger UI         → /docs
  │
  └──► Redis (BullMQ queues: job-processing, webhook-dispatch, file-processing)
         │
         ▼
    BullMQ Worker Process
         │── picks up job from queue
         │── updates DB state (RUNNING → SUCCEEDED / FAILED)
         │── emits job:status event via Redis pub/sub → forwarded to Socket.io clients
         │── dispatches webhook delivery
         │── writes structured logs to PostgreSQL
         ▼
    PostgreSQL — durable state: jobs, logs, webhooks, deliveries, files, API keys

The API and Worker do not communicate directly. No shared memory, no imports of each other. Redis is the message bus. PostgreSQL is the source of truth.

This is the same shape as a Kubernetes deployment: two separate Deployments, one managed Redis, one managed database.

The Job Lifecycle

A client POSTs a job:

// POST /v1/jobs
{
  "type": "file.metadata.extract",
  "payload": { "fileId": "abc-123" },
  "priority": 5,
  "maxAttempts": 3
}

The API:

1. Validates the request with Zod.
2. Creates a Job record in PostgreSQL with status QUEUED.
3. Enqueues a BullMQ job to the job-processing queue.
4. Returns 202 Accepted immediately with the job ID.

The Worker:

1. Picks up the job from Redis.
2. Updates DB to RUNNING, emits job:status event.
3. Dispatches the handler for the registered job type.
4. On success: updates DB to SUCCEEDED, writes structured logs, emits another job:status, dispatches webhook.
5. On failure: if attempts < maxAttempts, BullMQ reschedules with exponential backoff. If exhausted: marks FAILED, emits final job:status.
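
The lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not NodeFlow's actual code: the status names come from the steps above, while the transition table and the helper names (canTransition, outcomeAfterFailure) are assumptions made for illustration. How the stored status behaves between retries is not modeled here.

```typescript
// Hypothetical model of the job lifecycle. Status names are from the article;
// the transition table and function names are illustrative assumptions.
type JobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED";

const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  QUEUED: ["RUNNING"],
  RUNNING: ["SUCCEEDED", "FAILED"],
  SUCCEEDED: [], // terminal
  FAILED: [], // terminal
};

// Guard used before writing a new status to the database.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].includes(to);
}

type FailureOutcome = "RETRY" | "FAILED";

// attemptsMade includes the attempt that just failed.
function outcomeAfterFailure(attemptsMade: number, maxAttempts: number): FailureOutcome {
  return attemptsMade < maxAttempts ? "RETRY" : "FAILED";
}
```

With maxAttempts set to 3, as in the request example, the first two failures lead to a retry and the third marks the job FAILED.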

The client does not need to poll. It connects via WebSocket, subscribes to the job room, and receives state transitions as they happen.

Real-Time Updates: Cross-Process Socket.io

The hardest design question: how does the Worker push WebSocket events to clients connected to the API?

The WebSocket connection is held by the API process. The Worker is a completely separate Node.js process with no access to the API's in-memory Socket.io server. Connecting them directly would create tight coupling that breaks horizontal scaling.

The solution is @socket.io/redis-adapter on the API side and @socket.io/redis-emitter on the Worker side:

// API process: src/socket/index.ts
import { createAdapter } from "@socket.io/redis-adapter";

// Socket.io uses Redis as the pub/sub backbone
const pubClient = createRedisClient();
const subClient = pubClient.duplicate();
io.adapter(createAdapter(pubClient, subClient));

// Worker process: src/socket/emitter.ts
import { Emitter } from "@socket.io/redis-emitter";

// Emits directly into Redis. No Socket.io server needed.
const emitter = new Emitter(createRedisClient());

// Publishes a job:status event to every client in the job's room
export const emitJobUpdate = (jobId: string, status: string, result?: unknown) => {
  emitter.to(`job:${jobId}`).emit("job:status", { jobId, status, result });
};

The Worker writes to a Redis channel. The API subscribes to that channel and forwards to the correct WebSocket room. Neither process has a reference to the other. This is also how Socket.io works with multiple pods — the Redis adapter makes all instances appear as one.
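
To make the routing concrete, here is a toy model of the same idea with no Redis and no Socket.io involved. An in-memory bus stands in for Redis pub/sub and plain callbacks stand in for WebSocket clients. Every name in it (InMemoryBus, ApiNode, emitJobStatus) is hypothetical and none of it is the Socket.io API; it only shows why the Worker needs no reference to any API instance.

```typescript
// Toy model only: InMemoryBus stands in for Redis pub/sub, callbacks stand in
// for WebSocket clients. These are not Socket.io classes.
type BusMessage = { room: string; event: string; data: unknown };
type Listener = (event: string, data: unknown) => void;

class InMemoryBus {
  private subscribers: Array<(msg: BusMessage) => void> = [];
  subscribe(fn: (msg: BusMessage) => void): void {
    this.subscribers.push(fn);
  }
  publish(msg: BusMessage): void {
    this.subscribers.forEach((fn) => fn(msg));
  }
}

// "API process": holds client connections grouped by room and forwards
// every bus message to the clients in the matching room.
class ApiNode {
  private rooms = new Map<string, Set<Listener>>();
  constructor(bus: InMemoryBus) {
    bus.subscribe((msg) => {
      this.rooms.get(msg.room)?.forEach((client) => client(msg.event, msg.data));
    });
  }
  join(room: string, client: Listener): void {
    if (!this.rooms.has(room)) this.rooms.set(room, new Set());
    this.rooms.get(room)!.add(client);
  }
}

// "Worker process": knows only the bus, never an ApiNode.
const emitJobStatus = (bus: InMemoryBus, jobId: string, status: string): void => {
  bus.publish({ room: `job:${jobId}`, event: "job:status", data: { jobId, status } });
};
```

Adding a second ApiNode on the same bus changes nothing for the Worker, which is the property that lets the real API scale horizontally.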

Webhooks: HMAC Signatures, Exponential Backoff, Circuit Breaking

Webhook delivery is harder than it looks. The target might be down, slow, or returning errors. Naive delivery with no retry strategy loses events. Unlimited retries with no backoff hammers a broken endpoint.

NodeFlow uses three layers:

HMAC-SHA256 Signatures

Every payload is signed with the webhook's secret key:

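import { createHmac } from "node:crypto";

// Sign the serialized payload; the same string must be sent as the request body.
const signature = createHmac("sha256", webhook.secret)
  .update(JSON.stringify(payload))
  .digest("hex");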

// Sent as:
// X-Nodeflow-Signature: sha256=<hex>
// X-Nodeflow-Event: job.completed
// X-Nodeflow-Delivery-Id: <uuid>

The receiver verifies the signature to confirm the payload came from NodeFlow and was not tampered with.
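
On the receiving side, verification could look roughly like the sketch below. It assumes the receiver has the raw request body as a string and the value of the X-Nodeflow-Signature header; the function name verifySignature is hypothetical. The comparison uses timingSafeEqual so that the check does not leak how many leading characters matched.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Receiver-side check (hypothetical helper). rawBody must be the exact string
// that was received, not a re-serialized object, or the digest will differ.
function verifySignature(rawBody: string, signatureHeader: string, secret: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

A receiver would typically reject the request with a 401 when this returns false, before parsing the body.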

Exponential Backoff

Failed deliveries are retried by BullMQ with increasing delays:

// attempts = failed attempts so far: 2s → 4s → 8s → 16s, for 5 attempts in total
const nextRetryAt = new Date(Date.now() + 2 ** attempts * 1000);

After 5 failures the delivery is marked FAILED permanently.
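
The full schedule is easier to see as a function. This is a sketch under the assumption that the counter passed in is the number of attempts that have already failed; the names nextRetryDelayMs, BASE_DELAY_MS, and MAX_ATTEMPTS are illustrative, not taken from the NodeFlow source.

```typescript
const BASE_DELAY_MS = 1000;
const MAX_ATTEMPTS = 5;

// Delay before the next attempt, given how many attempts have already failed.
// Returns null once the attempt budget is spent and the delivery should be
// marked FAILED instead of retried.
function nextRetryDelayMs(failedAttempts: number): number | null {
  if (failedAttempts >= MAX_ATTEMPTS) return null;
  return 2 ** failedAttempts * BASE_DELAY_MS;
}

// Delays after the 1st..5th failure.
const schedule = [1, 2, 3, 4, 5].map(nextRetryDelayMs);
// schedule is [2000, 4000, 8000, 16000, null]
```

Five attempts therefore span about 30 seconds of waiting in total before the delivery is given up.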

Redis Circuit Breaker

Even with backoff, if an endpoint is completely dead you are still firing requests into the void:

// Redis key: circuit_breaker:failures:<url>
// Threshold: 5 consecutive failures → circuit opens for 5 minutes

if (await circuitBreaker.isCircuitOpen(webhook.url)) {
  // Skip HTTP attempt entirely. Mark delivery FAILED immediately.
  return;
}

After 5 consecutive failures to a URL, no further deliveries are attempted for 5 minutes. On circuit reset, the next attempt goes through. Success clears the failure counter.

This is the same pattern used in production service meshes — implemented at the application layer in Redis without a sidecar.
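
A minimal in-memory sketch of the breaker logic follows. Maps stand in for the Redis keys, and the clock is injectable so the 5-minute window can be tested. The thresholds come from the description above; the class shape, the recordFailure and recordSuccess names, and the choice to reset the failure counter when the circuit closes are assumptions.

```typescript
// In-memory stand-in for the Redis-backed breaker (assumed shape, not the real one).
const FAILURE_THRESHOLD = 5;
const OPEN_DURATION_MS = 5 * 60 * 1000;

class CircuitBreaker {
  private failures = new Map<string, number>();
  private openUntil = new Map<string, number>();

  constructor(private now: () => number = Date.now) {}

  isCircuitOpen(url: string): boolean {
    const until = this.openUntil.get(url);
    if (until === undefined) return false;
    if (this.now() >= until) {
      // Open window elapsed: close the circuit and let the next attempt through.
      this.openUntil.delete(url);
      this.failures.delete(url);
      return false;
    }
    return true;
  }

  recordFailure(url: string): void {
    const count = (this.failures.get(url) ?? 0) + 1;
    this.failures.set(url, count);
    if (count >= FAILURE_THRESHOLD) {
      this.openUntil.set(url, this.now() + OPEN_DURATION_MS);
    }
  }

  // Any success clears the consecutive-failure counter.
  recordSuccess(url: string): void {
    this.failures.delete(url);
    this.openUntil.delete(url);
  }
}
```

In a Redis version, the counter and the open flag would more likely be keys with expiry times, so the reset happens through TTLs rather than explicit deletes.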

Idempotency: Safe Request Replay

Networks fail. Clients retry. Without idempotency, a retry creates a duplicate job, a duplicate file upload, a duplicate notification.

The fix is an Idempotency-Key header:

POST /v1/jobs
Idempotency-Key: my-unique-request-id-001

The middleware:

1. Hashes the key with the userId.
2. Checks PostgreSQL for a stored response for (key, userId).
3. If found: returns the stored response body and status — the handler never runs.
4. If not found: runs the handler, then caches the response asynchronously.

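// Intercept res.json() to capture and cache the response
const originalJson = res.json.bind(res);
res.json = (body: unknown): Response => {
  const result = originalJson(body);
  // Cache everything except 5xx, so server errors can be retried with the same key
  if (res.statusCode >= 200 && res.statusCode < 500) {
    prisma.idempotencyKey
      .create({ data: { key, userId, responseStatus: res.statusCode, responseBody: body } })
      .catch((err) => logger.error(err)); // fire-and-forget: never delays the response
  }
  return result;
};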

Same key, same response, zero side effects on replay. Stripe's API uses the same Idempotency-Key header approach to make payment requests safe to retry.
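
Stripped of Express and Prisma, the replay logic reduces to a lookup before the handler and a store after it. The sketch below uses a Map in place of the PostgreSQL table; withIdempotency and StoredResponse are hypothetical names, and the synchronous handler is a simplification.

```typescript
// In-memory sketch of the replay logic; a Map stands in for the PostgreSQL table.
type StoredResponse = { status: number; body: unknown };

const store = new Map<string, StoredResponse>();

function withIdempotency(
  userId: string,
  key: string,
  handler: () => StoredResponse,
): StoredResponse {
  // Scope the key to the user so two users can reuse the same key safely.
  const cacheKey = `${userId}:${key}`;
  const cached = store.get(cacheKey);
  if (cached) return cached; // replay: the handler never runs

  const response = handler();
  // Same rule as the middleware: cache everything except 5xx responses.
  if (response.status >= 200 && response.status < 500) {
    store.set(cacheKey, response);
  }
  return response;
}
```

Because 5xx responses are not stored, a client that retries after a server error gets a fresh run of the handler rather than a cached failure.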

File Upload Pipeline

Upload → Store → Return 202 → Process in background.

POST /v1/files (multipart/form-data)
  │── Multer buffers file in memory
  │── StorageProvider.upload() writes file to storage
  │── Prisma creates File record (status: UPLOADED)
  │── BullMQ enqueues file-processing job
  └── Returns 202 Accepted

file-processing worker:
  │── Sets status → PROCESSING
  │── Extracts metadata (mime type, dimensions, page count)
  │── Sets status → READY
  └── Dispatches file.processed webhook

Storage is abstracted behind an interface:

export interface StorageProvider {
  upload(key: string, buffer: Buffer, mimeType: string): Promise<string>;
  download(key: string): Promise<Buffer>;
  delete(key: string): Promise<void>;
}

LocalStorageProvider writes to disk today. Switching to S3: set STORAGE_PROVIDER=s3, implement S3StorageProvider. Nothing else changes.
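
As a sketch of how that swap might be wired, the example below pairs the interface with a factory keyed on a provider name. InMemoryStorageProvider, createStorageProvider, and the "memory" value are hypothetical and exist only so the example runs without a disk or an S3 bucket; the interface is repeated from above to keep the block self-contained.

```typescript
// Interface repeated from the article so the sketch is self-contained.
interface StorageProvider {
  upload(key: string, buffer: Buffer, mimeType: string): Promise<string>;
  download(key: string): Promise<Buffer>;
  delete(key: string): Promise<void>;
}

// Hypothetical in-memory provider, handy for tests.
class InMemoryStorageProvider implements StorageProvider {
  private files = new Map<string, Buffer>();

  async upload(key: string, buffer: Buffer, _mimeType: string): Promise<string> {
    this.files.set(key, buffer);
    return key;
  }

  async download(key: string): Promise<Buffer> {
    const file = this.files.get(key);
    if (!file) throw new Error(`File not found: ${key}`);
    return file;
  }

  async delete(key: string): Promise<void> {
    this.files.delete(key);
  }
}

// Hypothetical factory. In the real service the name would come from
// process.env.STORAGE_PROVIDER, and "local" and "s3" would be handled here.
function createStorageProvider(name: string = "memory"): StorageProvider {
  switch (name) {
    case "memory":
      return new InMemoryStorageProvider();
    default:
      throw new Error(`Unknown STORAGE_PROVIDER: ${name}`);
  }
}
```

Callers depend only on StorageProvider, so adding a new backend means one new class and one new case in the factory.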

Cross-Cutting: Auth, Rate Limiting, Versioning

API Key Authentication
Keys are stored only as SHA-256 hashes — never plaintext. On each request:

import { createHash } from "node:crypto";

const keyHash = createHash("sha256").update(key).digest("hex");
const apiKey = await prisma.apiKey.findUnique({ where: { keyHash } });
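
The issuing side is the mirror image: generate a random key, hand the plaintext to the user once, and persist only the hash. The sketch below assumes that flow. issueApiKey, authenticate, the nf_ prefix, and the Map standing in for the apiKey table are all hypothetical.

```typescript
import { createHash, randomBytes } from "node:crypto";

const hashKey = (key: string): string =>
  createHash("sha256").update(key).digest("hex");

// Hypothetical issuance flow: the plaintext is returned once, only the hash is kept.
function issueApiKey(): { plaintext: string; keyHash: string } {
  const plaintext = `nf_${randomBytes(32).toString("hex")}`;
  return { plaintext, keyHash: hashKey(plaintext) };
}

// A Map stands in for the apiKey table and its unique keyHash column.
const apiKeys = new Map<string, { userId: string }>();

function authenticate(presentedKey: string): { userId: string } | undefined {
  return apiKeys.get(hashKey(presentedKey));
}
```

A plain SHA-256 is a reasonable fit here because the keys are long random strings rather than user-chosen passwords, so there is nothing for a dictionary attack to guess.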

Redis Rate Limiting
Limits are tracked per user/IP. Each request increments a Redis counter with a 60-second TTL, which makes this a fixed-window counter rather than a true sliding window. Above 100 req/min:

HTTP 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
Retry-After: <seconds>
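
A counter with a TTL behaves as a fixed window that starts at the first request. The sketch below models that with a Map instead of Redis and an injectable clock, and returns the values needed for the headers above. FixedWindowLimiter, hit, and Decision are hypothetical names.

```typescript
// In-memory stand-in for the Redis counter (an increment plus a 60-second expiry).
const LIMIT = 100;
const WINDOW_MS = 60 * 1000;

type Decision = { allowed: boolean; remaining: number; retryAfterSec: number };

class FixedWindowLimiter {
  private windows = new Map<string, { count: number; expiresAt: number }>();

  constructor(private now: () => number = Date.now) {}

  hit(id: string): Decision {
    const t = this.now();
    let w = this.windows.get(id);
    if (!w || t >= w.expiresAt) {
      // The first request after expiry starts a new window, like a key with a TTL.
      w = { count: 0, expiresAt: t + WINDOW_MS };
      this.windows.set(id, w);
    }
    w.count += 1;
    const allowed = w.count <= LIMIT;
    return {
      allowed,
      remaining: Math.max(0, LIMIT - w.count),
      retryAfterSec: allowed ? 0 : Math.ceil((w.expiresAt - t) / 1000),
    };
  }
}
```

When allowed is false, the middleware would respond with 429 and copy retryAfterSec into the Retry-After header.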

API Versioning
Every response includes X-API-Version: v1 and X-Deprecated: false. Set by middleware — no individual route handler thinks about it.
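
Such a middleware can be only a few lines. The sketch below uses a structural type for the response so it runs without Express installed; versionHeaders and HeaderSetter are hypothetical names, and the header values are the ones listed above.

```typescript
// Structural type so the sketch has no Express dependency. Express's response
// object provides setHeader through Node's http.ServerResponse.
type HeaderSetter = { setHeader(name: string, value: string): unknown };

const API_VERSION = "v1";

// Shaped like an Express (req, res, next) middleware.
function versionHeaders(_req: unknown, res: HeaderSetter, next: () => void): void {
  res.setHeader("X-API-Version", API_VERSION);
  res.setHeader("X-Deprecated", "false");
  next();
}
```

Registered once ahead of the routers, it stamps every response without any handler being aware of it.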

Testing and CI

31 tests across 7 suites. Prisma and Redis are mocked, so tests run without live infrastructure:

PASS tests/workers.test.ts       — handler registry, job state machine
PASS tests/webhooks.test.ts      — webhook CRUD routes
PASS tests/cross-cutting.test.ts — auth, rate limiting, idempotency, versioning
PASS tests/files.test.ts         — upload, download, list, delete
PASS tests/socket.test.ts        — WebSocket server lifecycle
PASS tests/jobs.test.ts          — job CRUD and cancellation
PASS tests/health.test.ts        — liveness probe

Test Suites: 7 passed
Tests:       31 passed

CI pipeline on every push to main:

GitHub Actions
  ├── ESLint
  ├── TypeScript compile (tsc)
  ├── Jest tests (with Postgres + Redis service containers)
  ├── Docker build — API image
  ├── Docker build — Worker image
  └── Trivy vulnerability scan (both images)

Repo

cypher682/nodeflow-api
