Rahul Bhati

Delay Queues: Do It Later, Reliably

What is a delay queue?

A delay queue holds a message or job until a specified time in the future, then makes it available for processing. It's the simplest way to schedule "do this later" without blocking workers or rolling your own cron logic.

Why use a delay queue?

  • Retries with backoff — retry failed tasks after a delay.
  • Scheduled tasks — run at specific times (e.g., reminder emails).
  • Rate smoothing — spread out work to avoid thundering herds.
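
As a rough illustration of the first and third bullets, here is a minimal sketch of computing a retry delay with exponential backoff plus jitter. The function name, its parameters, and the default values are all hypothetical; the randomness is injectable only so the example is easy to check.

```javascript
// Hypothetical helper: delay (in seconds) before retry number `attempt` (1-based).
// baseSeconds, capSeconds and the injected `random` are illustrative parameters.
function backoffDelaySeconds(attempt, { baseSeconds = 2, capSeconds = 900, random = Math.random } = {}) {
  // Exponential growth: base, 2*base, 4*base, ... capped at capSeconds.
  const exponential = Math.min(capSeconds, baseSeconds * 2 ** (attempt - 1));
  // "Full jitter": pick uniformly in [0, exponential) so retries don't line up.
  return random() * exponential;
}

// With random fixed at 1 the result is the un-jittered ceiling for each attempt.
const ceilings = [1, 2, 3, 4].map((n) => backoffDelaySeconds(n, { random: () => 1 }));
console.log(ceilings); // [ 2, 4, 8, 16 ]
```

The value returned here is what you would pass as the delay when re-enqueueing the failed job.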

When to choose what

1. Broker-based Delay Queues

Built-in support for delayed messages in brokers.

Cloud-managed

  • AWS SQS (DelaySeconds, per-message up to 15 min)
  • Azure Service Bus (scheduled messages)
  • Google Cloud Tasks (scheduled delivery)

Open-source

  • Apache Pulsar (delayed messages)
  • RabbitMQ (delayed-message plugin)

Best for:

  • Static/simple delays
  • Operational simplicity over flexibility
  • Delay ranges: seconds → hours (SQS: 15 min/message; Service Bus/Pulsar: much longer)

Tradeoffs:

  • Limited introspection: hard to list/search/cancel/reschedule after enqueue
  • Cloud-managed = zero ops but vendor lock-in
  • Open-source = portable but you run the infra
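
One common way around the 15-minute SQS cap is to store the real due time in the message and re-send it in hops until it is due. The sketch below only builds the parameters for a SendMessage call (QueueUrl, MessageBody and DelaySeconds are real SendMessage fields); the `runAt` field inside the body and both function names are assumptions made for illustration.

```javascript
const SQS_MAX_DELAY_SECONDS = 900; // SQS caps DelaySeconds at 15 minutes

// Build SendMessage parameters for a job due at `runAtMs` (epoch milliseconds).
// The `runAt` field inside the body is a hypothetical convention, not an SQS feature.
function buildDelayedSend(queueUrl, job, runAtMs, nowMs = Date.now()) {
  const remainingSeconds = Math.max(0, Math.ceil((runAtMs - nowMs) / 1000));
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ ...job, runAt: runAtMs }),
    DelaySeconds: Math.min(remainingSeconds, SQS_MAX_DELAY_SECONDS),
  };
}

// A consumer would call this on receipt: true means "run now", false means
// "send it again with buildDelayedSend and delete this copy".
function isDue(messageBody, nowMs = Date.now()) {
  return JSON.parse(messageBody).runAt <= nowMs;
}
```

Each hop is another delivery, so this pattern leans even harder on idempotent handlers.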

2. Redis ZSET Scheduler

Use a Redis sorted set with score = execution timestamp. Fast lookups; easy cancel/reschedule.

Best for:

  • High-throughput scheduling (10K+ jobs/sec)
  • Cancellable delays & "list upcoming jobs" APIs
  • Dynamic rescheduling (snooze/postpone)
  • Sub-second granularity

Implementation:

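# Schedule a job (score = execution time as a Unix timestamp)
ZADD delays <execution_timestamp> <job_id>

# Fetch due jobs (batch of 100)
ZRANGEBYSCORE delays -inf <now> LIMIT 0 100

# Claim each job: ZREM returns 1 only for the worker that actually removed it,
# so a worker should process a job only if its ZREM returned 1
ZREM delays <job_id>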

Tradeoffs:

  • Single point of failure unless Redis is clustered or replicated.
  • In-memory volatility: Data can be lost on crash if AOF/RDB not configured properly.
  • External Execution Required: Redis only tracks when to run jobs — execution must be handled by polling workers.
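
Because execution is left to polling workers, each worker needs a safe way to claim jobs when several poll at once. A minimal sketch of one poll cycle follows, assuming an ioredis-style client whose methods mirror the Redis commands; the function name and its parameters are hypothetical.

```javascript
// `redis` is assumed to be an ioredis-style client whose methods mirror the
// Redis commands (zrangebyscore, zrem). The key name "delays" matches the text.
async function claimDueJobs(redis, nowMs, limit = 100, key = 'delays') {
  const due = await redis.zrangebyscore(key, '-inf', nowMs, 'LIMIT', 0, limit);
  const claimed = [];
  for (const jobId of due) {
    // ZREM returns the number of members removed: 1 means this worker won the
    // race, 0 means another worker (or a cancel) already removed the job.
    if ((await redis.zrem(key, jobId)) === 1) claimed.push(jobId);
  }
  return claimed;
}
```

Note that a job claimed this way is gone from Redis, so a worker crash after the claim loses it unless the job is also recorded somewhere else.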

3. Postgres + SKIP LOCKED

Store jobs in your main database. Poll with FOR UPDATE SKIP LOCKED.

Best for:

  • Atomic job scheduling + business logic ("cancel order and schedule refund" in one DB transaction)
  • Teams already using Postgres — no new infrastructure.
  • Strong consistency requirements (note that with several concurrent workers, SKIP LOCKED does not guarantee strict run_at ordering).
  • Moderate throughput workloads (<10K jobs/sec).

Implementation:

BEGIN;  -- row locks are held only until the transaction ends
SELECT * FROM jobs
WHERE run_at <= NOW() AND status = 'pending'
ORDER BY run_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
-- ...process the batch and update each row's status, then:
COMMIT;

Why SKIP LOCKED?
Multiple workers can poll concurrently without blocking on each other's locks: each one grabs a batch and skips any rows another worker has already locked.

Tradeoffs:

  • Polling interval creates a latency floor (a 5s poll interval means up to 5s of extra delay).
  • Index pressure on high churn.
  • Not for sub-second precision.
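
If holding a transaction open while jobs run is a concern, one alternative is to claim a batch in a single statement and process it afterwards. This is a sketch under assumptions: the `id` column, the 'running' status, and the function name are hypothetical, and `db` is assumed to expose `query(text, values)` resolving to `{ rows }`, the shape used by node-postgres.

```javascript
// Atomic "claim a batch" statement: the inner SELECT locks due rows that no other
// worker holds, and the outer UPDATE marks them so they aren't picked up again.
// Column names (id, status, run_at) and the 'running' status are assumptions.
const CLAIM_SQL = `
  UPDATE jobs SET status = 'running'
  WHERE id IN (
    SELECT id FROM jobs
    WHERE run_at <= NOW() AND status = 'pending'
    ORDER BY run_at
    LIMIT $1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING *`;

// `db` is assumed to expose query(text, values) resolving to { rows },
// the shape used by node-postgres pools and clients.
async function claimBatch(db, batchSize = 100) {
  const { rows } = await db.query(CLAIM_SQL, [batchSize]);
  return rows;
}
```

The tradeoff is that a worker crash leaves rows stuck in 'running', so this variant usually needs a timeout sweep that resets stale rows to 'pending'.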

4. DynamoDB TTL

AWS-native approach using DynamoDB's Time To Live (TTL) feature combined with DynamoDB Streams.

Best for:

  • Serverless architectures on AWS.
  • Long delays (hours to days).
  • Variable/unpredictable load patterns

Implementation:

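// Assumes AWS SDK for JavaScript v3; the document client accepts plain JS values
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Schedule item with TTL (TTL must be enabled on the table for the `ttl` attribute)
await docClient.send(new PutCommand({
  TableName: 'delayed_jobs',
  Item: {
    job_id: 'order-123',
    ttl: Math.floor(Date.now() / 1000) + delay_seconds, // epoch seconds
    payload: { /* job data */ }
  }
}));

// Lambda triggered by DynamoDB Streams on deletion (REMOVE records)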

Tradeoffs:

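  • TTL deletion is a best-effort background process with no timing guarantee: AWS documentation says expired items are typically deleted within a few days of expiry (older docs said 48 hours), though in practice it is often minutes.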
  • Not suitable for precise timing needs.
  • AWS lock-in.
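
On the consuming side, the Lambda has to tell TTL expirations apart from ordinary deletes. Here is a minimal sketch of that filter, assuming the stream's view type includes old images; `expiredJobs`, `handler`, and `runJob` are hypothetical names.

```javascript
// Returns the job items that TTL expired, given a DynamoDB Streams Lambda event.
// Assumes the stream view type includes old images (OLD_IMAGE or NEW_AND_OLD_IMAGES).
function expiredJobs(event) {
  return event.Records
    .filter((r) =>
      r.eventName === 'REMOVE' &&
      // TTL deletions are performed by the DynamoDB service principal;
      // manual deletes do not carry this userIdentity.
      r.userIdentity?.type === 'Service' &&
      r.userIdentity?.principalId === 'dynamodb.amazonaws.com')
    .map((r) => r.dynamodb.OldImage);
}

// Hypothetical Lambda entry point; runJob is a placeholder for real work.
async function handler(event, runJob = async (image) => console.log(image)) {
  for (const image of expiredJobs(event)) await runJob(image);
}
```

Stream images arrive in DynamoDB's attribute-value format (for example `{ job_id: { S: 'order-123' } }`), so the job handler needs to unwrap them before use.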

5. Timer Wheels / Bucketed Timers

An in-memory data structure optimized for millions of delays per second.

Examples:

  • Netty’s HashedWheelTimer
  • Kafka broker request purgatory (hierarchical timing wheels)

Best for:

  • Session timeouts (30-sec idle = disconnect)
  • Rate limiter token bucket refills
  • Game server tick-based events
  • Short delays (<1 hour), acceptable precision ~100ms

Why is it fast?

  • O(1) insertions using a circular wheel of buckets
  • Batched execution of timers in the same bucket
  • Trades precision for throughput.

Tradeoffs:

  • In-memory only (not durable): pending timers are lost on restart or crash.
  • Only suitable for ephemeral delays where loss is acceptable.
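
To make the circular-wheel idea concrete, here is a minimal single-level hashed timer wheel. It is a simplified sketch, not how Netty or Kafka implement theirs: the class and method names are hypothetical, and the caller drives `tick()` manually instead of a real clock.

```javascript
// Minimal single-level hashed timer wheel. One tick = one bucket; timers that
// need more than one lap carry a `rounds` counter. Names are illustrative.
class TimerWheel {
  constructor(slots = 8) {
    this.buckets = Array.from({ length: slots }, () => []);
    this.cursor = 0;
  }

  // Schedule `callback` to fire `ticks` ticks from now (ticks >= 1). O(1).
  schedule(ticks, callback) {
    const slots = this.buckets.length;
    const slot = (this.cursor + ticks) % slots;
    const rounds = Math.floor((ticks - 1) / slots); // full laps to wait first
    this.buckets[slot].push({ rounds, callback });
  }

  // Advance one tick and fire every timer due in the bucket under the cursor.
  tick() {
    this.cursor = (this.cursor + 1) % this.buckets.length;
    const due = [];
    const later = [];
    for (const timer of this.buckets[this.cursor]) {
      if (timer.rounds === 0) due.push(timer);
      else { timer.rounds -= 1; later.push(timer); }
    }
    this.buckets[this.cursor] = later;
    for (const timer of due) timer.callback();
  }
}

const wheel = new TimerWheel(8);
const fired = [];
wheel.schedule(3, () => fired.push('a'));
wheel.schedule(10, () => fired.push('b'));
for (let i = 0; i < 10; i++) wheel.tick();
console.log(fired); // [ 'a', 'b' ]
```

Precision is bounded by the tick length: a wheel ticking every 100ms cannot fire a timer more accurately than that, which is the precision-for-throughput trade described above.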

6. Workflow Engines

Full orchestration engines with built-in delay scheduling.

Examples:

  • Temporal
  • Camunda

Best for:

  • Multi-step processes with delays between steps (e.g., abandoned cart: 1h idle → reminder after 2h → reminder with incentive after 24h → close).
  • Human approval steps with timeouts
  • Microservice orchestration with retries and delays
  • Audit trails and visual workflow editors

Tradeoffs:

  • Heavy infrastructure and operational complexity.
  • Overkill for simple delay needs, but perfect when delays are part of larger state workflows.
  • Steep learning curve
  • Resource-intensive (separate database, workers, UI)

Delivery Semantics & Idempotency

Most delay queue implementations provide at-least-once delivery guarantees. This means jobs may be delivered multiple times due to:

  • Worker crashes before acknowledging
  • Network timeouts during acknowledgment
  • Duplicate scheduling

Design for Idempotency
Your job handlers should be safe to retry. Use unique job IDs, database constraints, or distributed locks to ensure processing a job multiple times has no adverse effects.
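
A minimal sketch of the unique-job-ID approach follows. The wrapper name and the in-memory Set are illustrative assumptions; a real system would need a durable store, such as a table with a unique constraint on the job ID.

```javascript
// Wraps a handler so each job ID takes effect at most once. `seen` is an
// in-memory Set for illustration only; it does not survive restarts and is
// not shared between workers.
function makeIdempotent(handler, seen = new Set()) {
  return async (job) => {
    if (seen.has(job.id)) return 'duplicate-skipped';
    await handler(job);
    // Recorded only after success, so a failed attempt can still be retried.
    seen.add(job.id);
    return 'processed';
  };
}
```

This check-then-act version does not stop two workers from processing the same ID at the same moment; that case is what database constraints or locks are for.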

Exactly-once Delivery
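Kafka (with transactions) and Pulsar (with deduplication) can provide stronger guarantees, but at the cost of complexity, and those guarantees stop at the broker: side effects such as emails or external API calls still need idempotent handlers.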

Dead Letter Queues (DLQs)

  • Most delay queue systems support DLQs to isolate jobs that permanently fail after exhausting retries
  • When a job fails repeatedly (e.g., 3-5 attempts with exponential backoff), it's moved to the DLQ for manual inspection
  • Prevents poison messages from blocking healthy jobs
  • DLQ growth indicates systematic issues like bad code deploys, external service outages, or malformed data
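
The retry-then-DLQ decision can be sketched as a small pure function. The function name, the `attempts` field, the default limits, and the returned shapes are all assumptions made for illustration.

```javascript
// Decide what to do with a failed job. maxAttempts and the returned shapes are
// illustrative; `job.attempts` counts how many times the job has failed so far.
function onFailure(job, error, { maxAttempts = 5, baseSeconds = 2 } = {}) {
  const attempts = job.attempts + 1;
  if (attempts >= maxAttempts) {
    // Retries exhausted: park the job with enough context to debug it later.
    return { action: 'dead-letter', job: { ...job, attempts, lastError: String(error) } };
  }
  // Otherwise retry after an exponentially growing delay: 2s, 4s, 8s, ...
  return {
    action: 'retry',
    delaySeconds: baseSeconds * 2 ** (attempts - 1),
    job: { ...job, attempts },
  };
}
```

A caller would re-enqueue the job with `delaySeconds` on 'retry' and write it to the DLQ on 'dead-letter'.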

Observability

Regardless of the delay queue you choose, implement monitoring and alerting for:

  • Queue depth: Number of pending jobs (alert on unexpected growth)
  • Lag: Time between scheduled and actual execution
  • Throughput: Jobs processed per second
  • Failure rates: Percentage of jobs that fail and require retries
  • Age of oldest job: Detect stuck/stale jobs
  • Dead letter queue (DLQ) depth: Jobs that exceeded retry limits
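
As a rough sketch of how a few of these numbers could be derived from job records: the record shape (`runAt`, optional `startedAt`, `status`, with times in epoch milliseconds) and the function name are assumptions, and a real system would more likely emit these to a metrics backend than compute them in one pass.

```javascript
// Compute queue health numbers from job records. The record shape
// ({ runAt, startedAt?, status }) with epoch-millisecond times is an assumption.
function queueMetrics(jobs, nowMs = Date.now()) {
  const pending = jobs.filter((j) => j.status === 'pending');
  const lags = jobs
    .filter((j) => j.startedAt !== undefined)
    .map((j) => j.startedAt - j.runAt);
  const failed = jobs.filter((j) => j.status === 'failed').length;
  return {
    depth: pending.length,
    // How overdue the most overdue pending job is (0 if nothing is overdue).
    oldestOverdueMs: Math.max(0, ...pending.map((j) => nowMs - j.runAt)),
    // Worst observed gap between scheduled and actual start time.
    maxLagMs: lags.length ? Math.max(...lags) : 0,
    failureRate: jobs.length ? failed / jobs.length : 0,
  };
}
```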

Quick Decision Matrix

Your constraint | Go with
--- | ---
Fully managed (AWS) | SQS
Fully managed (GCP) | Cloud Tasks
Fully managed (Azure) | Service Bus
Fully managed, avoid vendor lock-in | Managed Pulsar
Need speed + easy cancellation/inspect | Redis ZSET
Need transactional consistency | Postgres + SKIP LOCKED
Serverless AWS + long delays | DynamoDB TTL
Short-lived delays, loss acceptable | Timer wheels
100K+ delays/sec, ephemeral OK | Timer wheels
Multi-step workflows with state | Temporal / Camunda

Final Thoughts

Choose based on your requirements: cloud-managed for simplicity, Redis/Postgres for speed and control, workflow engines for complex orchestration. The best delay queue fits your existing stack and operational maturity.

Don't build your own delay queue. Use an existing solution unless you have very specific requirements that nothing else can satisfy.
