Webhooks are deceptively hard to run in production. If you’ve shipped them at scale, you’ve probably hit at least one of these:
- A customer says “we never got the webhook,” but you can’t prove what happened.
- Retries amplify outages (your retries + their retries = thundering herd).
- You implement idempotency inconsistently and pay for it later.
- Failures overwrite context, and the payload that caused the issue is gone.
- You end up building “just enough queue + retry logic” in every service.
After repeating that loop too many times, I built Spooled: an open-source webhook queue and background job infrastructure built in Rust, designed around reliability and operational visibility.
## What I wanted (non‑negotiables)
- Reliable delivery: retries with backoff and clear terminal states
- Idempotency: safe replays without duplicate side effects
- Dead-letter queue (DLQ): keep failed jobs + error context; retry/purge when ready
- Bulk operations: enqueue jobs in batches and manage failures at scale
- Cron schedules: recurring jobs with timezone support
- Workflows: job dependencies (DAG-style execution)
- Real-time visibility: live job/queue updates (SSE + WebSocket)
- Dual API: REST (:8080) + gRPC (:50051) for high-throughput workers
## The core design
At a high level:
- The API accepts jobs (webhooks are just another job type).
- Jobs are stored durably in PostgreSQL with explicit state transitions.
- Workers claim jobs using DB-backed concurrency patterns (e.g., `FOR UPDATE SKIP LOCKED`) so multiple workers can scale safely.
- Every important transition can be observed in real time via SSE/WebSocket, so the dashboard doesn’t lie.
- When retries are exhausted, jobs land in a DLQ with enough context to debug and recover.
This gives you the two properties that matter most for webhooks:
- Durability: jobs survive process restarts and deploys
- Traceability: it’s easy to answer “what happened to job X?”
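The "explicit state transitions" idea can be sketched as a pure function over a small state enum. The state names and transitions below are illustrative assumptions, not Spooled's actual schema:

```rust
// Minimal sketch of an explicit job lifecycle: pending -> running ->
// succeeded / retrying / dead_lettered. State names are hypothetical.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,
    Running,
    Succeeded,
    Retrying,     // transient failure, will be claimed again
    DeadLettered, // retries exhausted, parked in the DLQ
}

// Returns true if `from -> to` is a legal transition. Encoding this as a
// pure function makes every transition auditable and easy to test.
fn can_transition(from: JobState, to: JobState) -> bool {
    use JobState::*;
    matches!(
        (from, to),
        (Pending, Running)
            | (Running, Succeeded)
            | (Running, Retrying)
            | (Running, DeadLettered)
            | (Retrying, Running)     // a worker claims the job again
            | (DeadLettered, Pending) // operator re-enqueues from the DLQ
    )
}

fn main() {
    assert!(can_transition(JobState::Pending, JobState::Running));
    assert!(!can_transition(JobState::Succeeded, JobState::Running));
    println!("transition table holds");
}
```

Keeping the transition rules in one place (rather than scattered across handlers) is what makes "what happened to job X?" answerable from the job's state history.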
## Retries that don’t cause incidents
Retries are necessary, but “retry immediately forever” is how you take systems down.
Spooled uses a retry model with backoff and terminal outcomes:
- transient failures get retried with increasing delays
- persistent failures end in DLQ instead of looping
- operators can re-run jobs safely (especially with idempotency keys)
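As a sketch of the "increasing delays" model: delay doubles per attempt up to a cap. The base and cap values here are assumptions, not Spooled's actual schedule, and production code would add jitter so retries from many jobs don't synchronize:

```rust
// Exponential backoff sketch: base * 2^attempt, capped.
// BASE_SECS / MAX_SECS are illustrative values, not Spooled's defaults.

const BASE_SECS: u64 = 5;
const MAX_SECS: u64 = 3600;

fn retry_delay_secs(attempt: u32) -> u64 {
    // Saturate the shift and multiply so large attempt counts can't overflow.
    BASE_SECS
        .saturating_mul(1u64.checked_shl(attempt).unwrap_or(u64::MAX))
        .min(MAX_SECS)
}

fn main() {
    for attempt in 0..5 {
        println!("attempt {attempt}: retry in {}s", retry_delay_secs(attempt));
    }
}
```

The cap matters as much as the doubling: without it, a job that fails for a day ends up with retry gaps so large that recovery takes hours after the outage ends.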
## Idempotency: making retries safe
A retry system is only “reliable” if it’s safe to replay work.
Spooled supports an idempotency_key so you can prevent duplicates when external systems retry the same event (Stripe, GitHub, payment providers, etc.). With idempotency keys, you can aim for exactly-once effects on top of at-least-once processing.
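A minimal sketch of the dedup semantics an `idempotency_key` enables, with hypothetical types (this is not Spooled's API): the first delivery with a given key runs the side effect; replays return the stored result instead.

```rust
use std::collections::HashMap;

// Illustrative processor: exactly-once effects on top of at-least-once delivery.
struct Processor {
    seen: HashMap<String, String>, // idempotency_key -> stored result
    side_effects: u32,             // counts how often the effect actually ran
}

impl Processor {
    fn new() -> Self {
        Processor { seen: HashMap::new(), side_effects: 0 }
    }

    fn handle(&mut self, idempotency_key: &str, payload: &str) -> String {
        if let Some(result) = self.seen.get(idempotency_key) {
            return result.clone(); // replay: no duplicate side effect
        }
        self.side_effects += 1; // the "real" work happens once per key
        let result = format!("processed:{payload}");
        self.seen.insert(idempotency_key.to_string(), result.clone());
        result
    }
}

fn main() {
    let mut p = Processor::new();
    let a = p.handle("evt_123", "payment.succeeded");
    let b = p.handle("evt_123", "payment.succeeded"); // provider retry
    assert_eq!(a, b);
    assert_eq!(p.side_effects, 1); // effect ran exactly once
    println!("replay returned cached result: {b}");
}
```

In a durable system the `seen` map lives in the database (typically a unique constraint on the key) rather than in memory, so dedup survives restarts.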
## DLQ: failures you can actually debug
A DLQ shouldn’t be a graveyard; it should be a debugging tool.
Spooled’s DLQ keeps failed jobs so you can inspect payload + error context, then retry (or purge) once the underlying issue is fixed.
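A sketch of the keep-context-then-retry flow, with illustrative field names (not Spooled's actual schema): the exhausted job is parked with its payload and last error, then handed back intact when an operator retries it.

```rust
// Hypothetical DLQ model: park failed jobs with enough context to debug.
#[derive(Debug, Clone)]
struct DeadLetter {
    job_id: u64,
    payload: String,    // the exact payload that caused the failure
    last_error: String, // error context, not just "failed"
    attempts: u32,
}

struct Dlq {
    entries: Vec<DeadLetter>,
}

impl Dlq {
    fn new() -> Self {
        Dlq { entries: Vec::new() }
    }

    fn park(&mut self, entry: DeadLetter) {
        self.entries.push(entry);
    }

    // Retry: remove from the DLQ and hand back the job to re-enqueue.
    fn retry(&mut self, job_id: u64) -> Option<DeadLetter> {
        let idx = self.entries.iter().position(|e| e.job_id == job_id)?;
        Some(self.entries.remove(idx))
    }

    // Purge: drop everything once the entries are no longer needed.
    fn purge(&mut self) {
        self.entries.clear();
    }
}

fn main() {
    let mut dlq = Dlq::new();
    dlq.park(DeadLetter {
        job_id: 42,
        payload: "{\"event\":\"invoice.paid\"}".to_string(),
        last_error: "upstream returned 503".to_string(),
        attempts: 5,
    });
    // After fixing the upstream issue, re-enqueue the original payload.
    let job = dlq.retry(42).expect("job parked in DLQ");
    println!("re-enqueueing job {} ({}): {}", job.job_id, job.last_error, job.payload);
}
```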
## Workflows: dependencies without a heavyweight orchestrator
Many real systems need “do A, then B, then C,” or “run B only after A succeeds.”
Spooled supports job dependencies and workflow/DAG execution so jobs run in the correct order without bolting on a separate orchestration platform for simple cases.
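The ordering idea behind DAG execution can be illustrated with Kahn's algorithm: given edges "A must run before B", produce a valid execution order, or detect a cycle. This is only the ordering sketch, not Spooled's scheduler, and the job names are made up:

```rust
use std::collections::{HashMap, VecDeque};

// Kahn's algorithm sketch: topologically order jobs by their dependencies.
// Returns None if the dependency graph contains a cycle.
fn execution_order(jobs: &[&str], deps: &[(&str, &str)]) -> Option<Vec<String>> {
    let mut indegree: HashMap<&str, usize> = jobs.iter().map(|j| (*j, 0)).collect();
    let mut downstream: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(before, after) in deps {
        *indegree.get_mut(after)? += 1;
        downstream.entry(before).or_default().push(after);
    }
    // Start with jobs that have no unmet dependencies.
    let mut ready: VecDeque<&str> =
        jobs.iter().copied().filter(|j| indegree[j] == 0).collect();
    let mut order = Vec::new();
    while let Some(job) = ready.pop_front() {
        order.push(job.to_string());
        // Finishing `job` unblocks its downstream dependents.
        for &next in downstream.get(job).into_iter().flatten() {
            let d = indegree.get_mut(next)?;
            *d -= 1;
            if *d == 0 {
                ready.push_back(next);
            }
        }
    }
    (order.len() == jobs.len()).then_some(order) // None => cycle detected
}

fn main() {
    let order = execution_order(
        &["ingest", "transform", "notify"],
        &[("ingest", "transform"), ("transform", "notify")],
    );
    println!("{order:?}");
}
```

The cycle check is the important part operationally: a workflow with a circular dependency should fail validation at enqueue time, not hang forever at run time.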
## Real-time streaming: dashboards that don’t lie
Polling-based dashboards often go stale at the exact moment you need them.
Spooled exposes SSE streams (system-wide, per-queue, and per-job) and WebSocket updates, so you can watch job and queue state change live.
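To show what consuming such a stream involves, here is a minimal SSE frame parser: frames are separated by a blank line, and `event:`/`data:` fields carry the payload. The `job.updated` event name in the example is hypothetical, not a documented Spooled event type:

```rust
// Minimal SSE parser sketch: split a raw stream into (event, data) pairs.
// Real clients handle multi-line data, comments, ids, and reconnection.
fn parse_sse(stream: &str) -> Vec<(String, String)> {
    let mut events = Vec::new();
    // SSE frames are separated by a blank line.
    for frame in stream.split("\n\n").filter(|f| !f.trim().is_empty()) {
        let mut event = String::from("message"); // SSE default event type
        let mut data = String::new();
        for line in frame.lines() {
            if let Some(rest) = line.strip_prefix("event: ") {
                event = rest.to_string();
            } else if let Some(rest) = line.strip_prefix("data: ") {
                data.push_str(rest);
            }
        }
        events.push((event, data));
    }
    events
}

fn main() {
    let raw = "event: job.updated\ndata: {\"id\":1,\"state\":\"succeeded\"}\n\n";
    for (event, data) in parse_sse(raw) {
        println!("{event}: {data}");
    }
}
```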
## Why Rust?
Rust is a great fit for infrastructure that must run continuously:
- strong reliability and safety properties
- high performance under concurrency
- simple ops via a single binary release artifact
## Quick start (self-hosted)
Spooled is self-hosted. The recommended way to run it is Docker Compose:
```bash
# Pull the multi-arch image (amd64 + arm64)
docker pull ghcr.io/spooled-cloud/spooled-backend:latest

# Download the production compose file
curl -O https://raw.githubusercontent.com/Spooled-Cloud/spooled-backend/main/docker-compose.prod.yml

# Create a minimal .env with secure secrets
POSTGRES_PASSWORD="$(openssl rand -base64 16)"
JWT_SECRET="$(openssl rand -base64 32)"
cat > .env << EOF
POSTGRES_PASSWORD=$POSTGRES_PASSWORD
JWT_SECRET=$JWT_SECRET
RUST_ENV=production
JSON_LOGS=true
EOF

# Start services
docker compose -f docker-compose.prod.yml up -d

# Verify
curl http://localhost:8080/health
```
PostgreSQL is required. Redis is optional (used for pub/sub and caching when enabled).
## Links
- GitHub: https://github.com/Spooled-Cloud/spooled-backend
- Docs: https://spooled.cloud/docs
- Live demo (SpriteForge): https://example.spooled.cloud
## What I’d love feedback on
If you’ve built webhook systems or background job infrastructure, I’d love to hear:
- What failure modes hurt you most in production?
- What’s missing from existing queues that you wish existed?
- What would make you switch to a self-hosted job/webhook system?
Thanks for reading. Feedback and issues are welcome.