Yevhen Salitrynskyi

How I built a reliable webhook queue in Rust (retries, idempotency, DLQ, schedules, workflows, real-time)

Webhooks are deceptively hard to run in production. If you’ve shipped them at scale, you’ve probably hit at least one of these:

  • A customer says “we never got the webhook,” but you can’t prove what happened.
  • Retries amplify outages (your retries + their retries = thundering herd).
  • You implement idempotency inconsistently and pay for it later.
  • Failures overwrite context, and the payload that caused the issue is gone.
  • You end up building “just enough queue + retry logic” in every service.

After repeating that loop too many times, I built Spooled: an open-source webhook queue and background job infrastructure written in Rust, designed around reliability and operational visibility.

What I wanted (non‑negotiables)

  • Reliable delivery: retries with backoff and clear terminal states
  • Idempotency: safe replays without duplicate side effects
  • Dead-letter queue (DLQ): keep failed jobs + error context; retry/purge when ready
  • Bulk operations: enqueue jobs in batches and manage failures at scale
  • Cron schedules: recurring jobs with timezone support
  • Workflows: job dependencies (DAG-style execution)
  • Real-time visibility: live job/queue updates (SSE + WebSocket)
  • Dual API: REST (:8080) + gRPC (:50051) for high-throughput workers

The core design

At a high level:

  1. The API accepts jobs (webhooks are just another job type).
  2. Jobs are stored durably in PostgreSQL with explicit state transitions.
  3. Workers claim jobs using DB-backed concurrency patterns (e.g., FOR UPDATE SKIP LOCKED) so multiple workers can scale safely (see the claim sketch below).
  4. Every important transition can be observed in real time via SSE/WebSocket, so the dashboard doesn’t lie.
  5. When retries are exhausted, jobs land in a DLQ with enough context to debug and recover.

This gives you the two properties that matter most for webhooks:

  • Durability: jobs survive process restarts and deploys
  • Traceability: it’s easy to answer “what happened to job X?”
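
To make step 3 concrete, here's a minimal sketch of the claim pattern using sqlx. It's illustrative only: the jobs table, column names, and connection string are assumptions, not Spooled's actual schema or code.

// Claim one pending job; SKIP LOCKED lets concurrent workers pick
// different rows instead of blocking on each other.
use sqlx::{postgres::PgPoolOptions, Row};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/spooled_demo") // illustrative DSN
        .await?;

    let claimed = sqlx::query(
        r#"
        UPDATE jobs SET status = 'running', started_at = now()
        WHERE id = (
            SELECT id FROM jobs
            WHERE status = 'pending'
            ORDER BY created_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id, payload
        "#,
    )
    .fetch_optional(&pool)
    .await?;

    if let Some(row) = claimed {
        let id: i64 = row.get("id");
        println!("claimed job {id}");
        // ... run the job, then mark it 'succeeded' or schedule a retry ...
    }
    Ok(())
}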

Retries that don’t cause incidents

Retries are necessary, but “retry immediately forever” is how you take systems down.

Spooled uses a retry model with backoff and terminal outcomes:

  • transient failures get retried with increasing delays
  • persistent failures end in DLQ instead of looping
  • operators can re-run jobs safely (especially with idempotency keys)
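
To illustrate that model (these are not Spooled's exact parameters), here's a small sketch of exponential backoff with a cap, jitter, and a terminal attempt limit:

use std::time::Duration;

// Exponential growth, a cap, and jitter so synchronized retries don't
// stampede the downstream service at the same instant.
fn retry_delay(attempt: u32) -> Option<Duration> {
    const MAX_ATTEMPTS: u32 = 8; // illustrative limit
    if attempt >= MAX_ATTEMPTS {
        return None; // terminal: the job goes to the DLQ instead of looping
    }
    let base = Duration::from_secs(2u64.pow(attempt)); // 1s, 2s, 4s, ...
    let capped = base.min(Duration::from_secs(300));   // never wait more than 5 minutes
    let jitter = 0.5 + rand::random::<f64>() * 0.5;    // 50-100% of the capped delay
    Some(capped.mul_f64(jitter))
}

fn main() {
    for attempt in 0..9 {
        match retry_delay(attempt) {
            Some(d) => println!("attempt {attempt}: retry in {d:?}"),
            None => println!("attempt {attempt}: retries exhausted, send to DLQ"),
        }
    }
}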

Idempotency: making retries safe

A retry system is only “reliable” if it’s safe to replay work.

Spooled supports an idempotency_key so you can prevent duplicates when external systems retry the same event (Stripe, GitHub, payment providers, etc.). With idempotency keys, you can aim for exactly-once effects on top of at-least-once processing.
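
One common way to enforce this at the storage layer (a sketch, not Spooled's internals) is a unique constraint on the key plus ON CONFLICT DO NOTHING, so replaying the same event is a no-op:

// Assumes an illustrative jobs table with a UNIQUE index on idempotency_key.
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/spooled_demo")
        .await?;

    let result = sqlx::query(
        r#"
        INSERT INTO jobs (queue, payload, idempotency_key)
        VALUES ($1, $2, $3)
        ON CONFLICT (idempotency_key) DO NOTHING
        "#,
    )
    .bind("webhooks")
    .bind(r#"{"event":"invoice.paid","id":"evt_123"}"#)
    .bind("stripe:evt_123") // e.g. the provider's event id
    .execute(&pool)
    .await?;

    if result.rows_affected() == 0 {
        println!("duplicate event, already enqueued");
    }
    Ok(())
}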

DLQ: failures you can actually debug

A DLQ shouldn’t be a graveyard; it should be a debugging tool.

Spooled’s DLQ keeps failed jobs so you can inspect payload + error context, then retry (or purge) once the underlying issue is fixed.
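
A sketch of the "keep the context" idea (table and column names are illustrative, not Spooled's schema): when retries run out, record the payload, attempt count, and last error somewhere you can query later.

use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/spooled_demo")
        .await?;

    let job_id: i64 = 42;
    let last_error = "HTTP 503 from https://example.com/webhook";

    // Copy the exhausted job into the DLQ together with its error context.
    sqlx::query(
        r#"
        INSERT INTO dead_letter_jobs (job_id, queue, payload, attempts, last_error, failed_at)
        SELECT id, queue, payload, attempts, $2, now()
        FROM jobs
        WHERE id = $1
        "#,
    )
    .bind(job_id)
    .bind(last_error)
    .execute(&pool)
    .await?;

    // Retrying later is the reverse: copy the row back as a fresh 'pending' job.
    Ok(())
}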

Workflows: dependencies without a heavyweight orchestrator

Many real systems need “do A, then B, then C,” or “run B only after A succeeds.”

Spooled supports job dependencies and workflow/DAG execution so jobs run in the correct order without bolting on a separate orchestration platform for simple cases.
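
The core of DAG-style execution is a readiness check: a job may start only when every job it depends on has succeeded. A minimal sketch with an illustrative job_dependencies(child_id, parent_id) table (again, not Spooled's actual schema):

use sqlx::{postgres::PgPoolOptions, Row};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/spooled_demo")
        .await?;

    // A waiting job is unblocked once no parent is still unfinished.
    let ready = sqlx::query(
        r#"
        SELECT j.id FROM jobs j
        WHERE j.status = 'waiting'
          AND NOT EXISTS (
              SELECT 1
              FROM job_dependencies d
              JOIN jobs parent ON parent.id = d.parent_id
              WHERE d.child_id = j.id
                AND parent.status <> 'succeeded'
          )
        "#,
    )
    .fetch_all(&pool)
    .await?;

    for row in ready {
        let id: i64 = row.get("id");
        println!("job {id} is unblocked and can be enqueued");
    }
    Ok(())
}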

Real-time streaming: dashboards that don’t lie

Polling-based dashboards often go stale at the exact moment you need them.

Spooled exposes SSE streams (system-wide, per-queue, and per-job) and WebSocket updates, so you can watch job and queue state change live.
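
For flavor, here's a minimal sketch of the server side of that idea with axum: every job state change is published to a broadcast channel and fanned out to SSE clients. The route, payload format, and channel setup are assumptions for illustration, not Spooled's actual API.

use axum::{
    extract::State,
    response::sse::{Event, KeepAlive, Sse},
    routing::get,
    Router,
};
use std::convert::Infallible;
use tokio::sync::broadcast;
use tokio_stream::{wrappers::BroadcastStream, Stream, StreamExt};

// Each SSE client subscribes to the same broadcast channel; workers publish
// a line of JSON whenever a job changes state.
async fn job_events(
    State(tx): State<broadcast::Sender<String>>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let stream = BroadcastStream::new(tx.subscribe())
        .filter_map(|msg| msg.ok()) // drop lagged/errored receives
        .map(|update| Ok(Event::default().event("job_update").data(update)));
    Sse::new(stream).keep_alive(KeepAlive::default())
}

#[tokio::main]
async fn main() {
    let (tx, _) = broadcast::channel::<String>(256);
    let app = Router::new()
        .route("/events", get(job_events))
        .with_state(tx);
    // Workers would call tx.send(json_update) whenever a job changes state.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}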

Why Rust?

Rust is a great fit for infrastructure that must run continuously:

  • strong reliability and safety properties
  • high performance under concurrency
  • simple ops via a single binary release artifact

Quick start (self-hosted)

Spooled is self-hosted. The recommended way to run it is Docker Compose:

# Pull the multi-arch image (amd64 + arm64)
docker pull ghcr.io/spooled-cloud/spooled-backend:latest

# Download the production compose file
curl -O https://raw.githubusercontent.com/Spooled-Cloud/spooled-backend/main/docker-compose.prod.yml

# Create a minimal .env with secure secrets
POSTGRES_PASSWORD="$(openssl rand -base64 16)"
JWT_SECRET="$(openssl rand -base64 32)"

cat > .env << EOF
POSTGRES_PASSWORD=$POSTGRES_PASSWORD
JWT_SECRET=$JWT_SECRET
RUST_ENV=production
JSON_LOGS=true
EOF

# Start services
docker compose -f docker-compose.prod.yml up -d

# Verify
curl http://localhost:8080/health

PostgreSQL is required. Redis is optional (used for pub/sub and caching when enabled).

Links

  • GitHub: https://github.com/Spooled-Cloud/spooled-backend

What I’d love feedback on

If you’ve built webhook systems or background job infrastructure, I’d love to hear:

  • What failure modes hurt you most in production?
  • What’s missing from existing queues that you wish existed?
  • What would make you switch to a self-hosted job/webhook system?

Thanks for reading. Feedback and issues are welcome.
