DEV Community

Wesley Conde


Error state machines: Designing dead-letter queues (DLQ) and jittered retries

Transient vs. permanent errors

When a consumer fails to process an event, the worst response is to retry it immediately and indefinitely. If the error is transient (e.g., a temporary network blip), rapid retries can overwhelm the downstream service that is already struggling. If the error is permanent (e.g., a corrupted payload or a schema validation failure), repeating the operation wastes CPU cycles and stalls the messages queued behind it.
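The distinction can be made explicit in code. The following is a minimal sketch, not taken from onkai-unified-bus: the names `Permanent`, `IsRetryable`, `ErrTimeout`, and `ErrSchema` are hypothetical, and the sketch assumes that any error not explicitly marked as permanent is treated as transient.

```go
package main

import "errors"

// Hypothetical example errors; a real consumer would see errors
// coming from its own dependencies.
var (
	ErrTimeout = errors.New("downstream timeout")
	ErrSchema  = errors.New("schema validation failed")
)

// permanentError marks a failure that retrying cannot fix.
type permanentError struct{ cause error }

func (e *permanentError) Error() string { return "permanent: " + e.cause.Error() }
func (e *permanentError) Unwrap() error { return e.cause }

// Permanent wraps err so that IsRetryable reports false for it.
func Permanent(err error) error {
	if err == nil {
		return nil
	}
	return &permanentError{cause: err}
}

// IsRetryable reports whether a failed event is worth retrying.
// Assumption: anything not marked permanent is considered transient.
func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	var p *permanentError
	return !errors.As(err, &p)
}
```

A handler would return `Permanent(ErrSchema)` for a payload it can never parse, and a plain error such as `ErrTimeout` for a failure that may clear up on its own.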

To handle both cases, onkai-unified-bus implements an error state machine that combines exponential backoff with jittered retries and a Dead-Letter Queue (DLQ).

Exponential backoff with jitter

To prevent dozens of disconnected consumers from retrying or reconnecting at the exact same millisecond (the "thundering herd" effect), we inject random delay variance (Jitter) combined with an exponentially increasing interval between retries.

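import (
    "math"
    "math/rand"
    "time"
)

// CalculateBackoff returns the delay to wait before retry number attempt (0-based).
func CalculateBackoff(attempt int, baseDelay, maxDelay time.Duration) time.Duration {
    if attempt < 0 {
        attempt = 0
    }
    // Exponential calculation: base * 2^attempt, capped while still a float64
    // so that large attempt values cannot overflow time.Duration.
    backoff := math.Min(float64(baseDelay)*math.Pow(2, float64(attempt)), float64(maxDelay))
    // Add +/- 10% jitter (random noise) to spread out retry load.
    // The jitter is applied after the cap, so the result can exceed maxDelay by up to 10%.
    jitter := rand.Float64()*0.2 - 0.1
    return time.Duration(backoff * (1.0 + jitter))
}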
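To show how such a delay might be used, here is a hedged sketch of a retry loop. It is not the onkai-unified-bus implementation: `Retry`, `backoff`, and `demoRetry` are hypothetical names, `backoff` is a compact copy of the calculation above so the example stands alone, and the use of `context.Context` for cancellation is an assumption.

```go
package main

import (
	"context"
	"errors"
	"math"
	"math/rand"
	"time"
)

// backoff mirrors the calculation above: base * 2^attempt, capped, +/- 10% jitter.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := math.Min(float64(base)*math.Pow(2, float64(attempt)), float64(max))
	return time.Duration(d * (0.9 + 0.2*rand.Float64()))
}

// Retry calls op up to maxAttempts times, waiting a jittered backoff between
// failed attempts. It returns the number of attempts made and the last error.
func Retry(ctx context.Context, maxAttempts int, base, max time.Duration, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt + 1, nil
		}
		if attempt == maxAttempts-1 {
			break // no point in waiting after the final attempt
		}
		timer := time.NewTimer(backoff(attempt, base, max))
		select {
		case <-ctx.Done(): // caller gave up: stop waiting and report why
			timer.Stop()
			return attempt + 1, ctx.Err()
		case <-timer.C:
		}
	}
	return maxAttempts, err
}

// demoRetry runs an operation that fails twice before succeeding.
func demoRetry() (int, error) {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary network blip")
		}
		return nil
	}
	return Retry(context.Background(), 5, time.Millisecond, 10*time.Millisecond, flaky)
}
```

Waiting on a timer inside a `select` rather than calling `time.Sleep` lets a shutting-down consumer abandon the wait as soon as its context is cancelled.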

Isolation via dead-letter queue (DLQ)

If an event fails after reaching the maximum number of retry attempts (e.g., 5 attempts), it is classified as a permanent logical failure. To prevent stalling the rest of the queue, the event bus intercepts the offending message, attaches the error's stack trace to the message headers, and forwards it to a designated isolation queue called the Dead-Letter Queue (DLQ).

From the DLQ, faulty messages can be audited manually or processed through administrative tools once the underlying business logic bug is fixed.
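The hand-off to the DLQ could look roughly like the sketch below. Everything in it is illustrative: `Message`, `DeadLetterQueue`, `Consume`, `ErrBadPayload`, and the `x-dlq-*` header names are hypothetical, the queue is an in-memory slice, and the error message stands in for the stack trace because plain Go errors do not carry one.

```go
package main

import (
	"errors"
	"strconv"
)

// ErrBadPayload is a hypothetical example of a failure that never recovers.
var ErrBadPayload = errors.New("invalid payload")

// Message is a hypothetical envelope; the real bus types may differ.
type Message struct {
	ID      string
	Payload []byte
	Headers map[string]string
}

// DeadLetterQueue is an in-memory stand-in for a real isolation queue.
type DeadLetterQueue struct {
	Messages []Message
}

// Publish stores a copy of msg annotated with failure metadata.
func (q *DeadLetterQueue) Publish(msg Message, cause error, attempts int) {
	headers := make(map[string]string, len(msg.Headers)+2)
	for k, v := range msg.Headers {
		headers[k] = v // copy so the caller's map is not mutated
	}
	headers["x-dlq-error"] = cause.Error()
	headers["x-dlq-attempts"] = strconv.Itoa(attempts)
	msg.Headers = headers
	q.Messages = append(q.Messages, msg)
}

// Consume runs handler up to maxAttempts times. If every attempt fails, the
// message is moved to the DLQ so the rest of the queue can keep flowing.
// It returns true when the message was handled successfully.
func Consume(msg Message, maxAttempts int, handler func(Message) error, dlq *DeadLetterQueue) bool {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = handler(msg); err == nil {
			return true
		}
		// A real consumer would wait for a jittered backoff here.
	}
	dlq.Publish(msg, err, maxAttempts)
	return false
}
```

Recording the attempt count and the last error alongside the original payload gives whoever audits the DLQ enough context to decide whether a message can be replayed once the bug is fixed.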

Technical terms demystified

  • Dead-Letter Queue (DLQ): A secondary queue dedicated to holding messages that couldn't be delivered or processed successfully by consumers.
  • Thundering Herd Effect: A situation where many concurrent processes wake up at the same time to handle a single event, overwhelming system resources.
  • Jitter: The deliberate introduction of small, random time variations to prevent processes from synchronizing their retries.

link: https://github.com/wesleyskap/onkai-unified-bus
