Failures are a normal and expected part of distributed systems.
If your application processes data asynchronously—for example, by consuming messages from Kafka or running background jobs—failures will happen regularly. A service might crash, a downstream dependency might become unavailable, or the data itself might fail validation.
If these failures are not handled properly, your system can lose data, create duplicates, or leave you with no visibility into what went wrong.
This article explains a simple and practical approach to building a recovery service that helps you handle failures in a reliable and scalable way.
What problem are we solving?
Before going deeper, let us clarify a few terms.
- An event or message is simply a unit of work your system processes, such as a Kafka message.
- Asynchronous processing means this work is handled in the background, not immediately in a user request.
- A failure occurs when this processing does not complete successfully.

In many systems, failures are handled by retrying immediately. However, this approach is not sufficient in real-world systems.
Why simple retries are not enough
In-memory retries are useful, but they fall short in several ways:
- If the service crashes, all in-progress retries are lost.
- Some failures should be retried later, not immediately.
- There is no persistent record of what failed and how many times it was retried.
- There is no clear final state, which can lead to endless retry loops.
To address these issues, failures must be persisted and handled asynchronously.
What is a recovery service?
A recovery service is a background component responsible for handling failed work.
It performs three main functions:
- It stores failed tasks in a database.
- It retries them later in a controlled manner.
- It tracks the final outcome of each task.
Instead of retrying immediately, the system records the failure and processes it later using dedicated worker processes.
How failures are stored
Each failure is stored as a recovery task in a database table.
A recovery task typically contains the following information.
Failure context
This includes details about what failed, such as the event type and the payload being processed.
Lifecycle status
This represents the outcome of the task and can be one of the following:
- FAILED, meaning it is waiting to be retried
- RESOLVED, meaning it was successfully recovered
- PERMANENT_FAILURE, meaning it will not be retried again
Retry metadata
This includes how many times the task has been retried and when it should be retried next.
Lock information
This indicates which worker is currently processing the task and when the lock was acquired.
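Putting these fields together, the table might look like the following. This is a PostgreSQL sketch; the column names and types are illustrative, not a prescribed schema:

```sql
CREATE TABLE recovery_task (
    id            BIGSERIAL   PRIMARY KEY,
    event_type    TEXT        NOT NULL,  -- failure context: what kind of work failed
    payload       JSONB       NOT NULL,  -- the data that was being processed
    error_message TEXT,                  -- last error, useful for debugging
    status        TEXT        NOT NULL DEFAULT 'FAILED',  -- FAILED | RESOLVED | PERMANENT_FAILURE
    retry_count   INT         NOT NULL DEFAULT 0,
    next_retry_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    locked_by     TEXT,                  -- worker currently processing the task
    locked_at     TIMESTAMPTZ,           -- when the lock was acquired
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Note that the lifecycle columns (status, retry metadata) and the lock columns are deliberately separate, which mirrors the design principle discussed below.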
Important design principle: status and lock are different
The lifecycle status and the execution lock represent different concepts.
- The status describes the business outcome of the task.
- The lock indicates which worker is currently processing the task.

Keeping these two concepts separate helps avoid race conditions and keeps the design clear.
How multiple workers process tasks safely
To scale the system, multiple worker instances can run in parallel. The challenge is to ensure that the same task is not processed more than once at the same time.
This is solved using database row locking.
Workers use a query like the following:
SELECT *
FROM recovery_task
WHERE status = 'FAILED'
AND next_retry_at <= NOW()
ORDER BY next_retry_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
This query locks the selected rows and ensures that other workers skip them.
As a result, each worker processes a different set of tasks, and no central coordination mechanism is required.
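In application code, a worker's claim step can combine the selection and the lock update in one statement. The following Python sketch assumes PostgreSQL and a DB-API connection (for example from psycopg); the column names are illustrative and the connection handling is simplified:

```python
# Claim a batch of due tasks and mark them as locked, all in one statement.
# The inner SELECT uses FOR UPDATE SKIP LOCKED so concurrent workers never
# claim the same rows; the outer UPDATE records who claimed them.
CLAIM_SQL = """
    UPDATE recovery_task
    SET locked_by = %(worker_id)s, locked_at = NOW()
    WHERE id IN (
        SELECT id FROM recovery_task
        WHERE status = 'FAILED' AND next_retry_at <= NOW()
        ORDER BY next_retry_at
        LIMIT %(batch_size)s
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, event_type, payload;
"""

def claim_batch(conn, worker_id: str, batch_size: int = 10):
    """Atomically claim up to batch_size tasks for this worker."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"worker_id": worker_id, "batch_size": batch_size})
        rows = cur.fetchall()
    conn.commit()
    return rows
```

Because the claim is a single transaction, a worker either owns a row or never sees it; there is no window in which two workers believe they own the same task.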
Worker lifecycle
Each recovery worker follows a simple loop.
First, it fetches a batch of failed tasks.
Then, it processes each task using the appropriate recovery logic.
Finally, it updates the task based on the outcome.
If the processing succeeds, the task is marked as RESOLVED.
If it fails but can be retried, the system schedules the next retry.
If the maximum number of retries is reached, the task is marked as PERMANENT_FAILURE.
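The outcome handling above can be sketched as pure logic, independent of the database. The retry limit of 5 and the exponential backoff starting at 60 seconds are illustrative choices, not requirements of the pattern:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_RETRIES = 5          # illustrative limit
BASE_DELAY_SECONDS = 60  # illustrative backoff base

@dataclass
class Outcome:
    status: str                            # RESOLVED | FAILED | PERMANENT_FAILURE
    next_retry_at: Optional[datetime] = None

def decide_outcome(succeeded: bool, retry_count: int,
                   now: Optional[datetime] = None) -> Outcome:
    """Map a processing result onto the task's next state."""
    now = now or datetime.now(timezone.utc)
    if succeeded:
        return Outcome("RESOLVED")
    if retry_count + 1 >= MAX_RETRIES:
        return Outcome("PERMANENT_FAILURE")
    # Exponential backoff: 60s after the first failure, then 120s, 240s, ...
    delay = BASE_DELAY_SECONDS * (2 ** retry_count)
    return Outcome("FAILED", next_retry_at=now + timedelta(seconds=delay))
```

For example, `decide_outcome(False, 0)` schedules the next attempt one minute later, while the fifth failure produces PERMANENT_FAILURE and ends the retry loop.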
Keeping the system generic
The recovery system should not contain business-specific logic.
Instead, responsibilities should be clearly separated.
- The worker is responsible for coordination, retry handling, and state updates.
- The handler is responsible for the actual recovery logic. For example, a handler might republish a Kafka message or re-trigger a failed operation.
This separation makes the recovery system reusable across different parts of the application.
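One way to express this separation is a small handler registry keyed by event type; the worker looks up the handler and never knows what recovery actually involves. The event type and function names below are illustrative:

```python
from typing import Callable, Dict

# A handler receives the failed task's payload and raises on failure.
Handler = Callable[[dict], None]

HANDLERS: Dict[str, Handler] = {}

def register(event_type: str):
    """Decorator that registers a recovery handler for an event type."""
    def wrap(fn: Handler) -> Handler:
        HANDLERS[event_type] = fn
        return fn
    return wrap

@register("order.created")
def recover_order_created(payload: dict) -> None:
    # In a real system this might republish a Kafka message
    # or re-trigger the failed operation.
    print(f"republishing order {payload['order_id']}")

def process(event_type: str, payload: dict) -> None:
    """The worker only coordinates; the handler does the recovery."""
    HANDLERS[event_type](payload)
```

Adding a new kind of recoverable work then means registering one new handler; the worker loop, locking, and retry logic stay untouched.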
Why this works without leader election
Some systems use leader election to ensure that only one instance performs certain tasks.
In this design, leader election is not required because the work is divided at the database row level.
Each worker processes different rows, and the database ensures that no two workers process the same task at the same time.
This approach allows the system to scale horizontally without introducing additional coordination complexity.
Safety guarantees
This design provides several important guarantees.
- It prevents duplicate processing through database locks.
- It allows recovery from worker crashes through lock expiration.
- It provides full visibility into failures and retries.
- It ensures that each task reaches a clear final state.
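Crash recovery relies on a periodic sweep that releases locks older than some timeout, so that tasks claimed by a dead worker become claimable again. A PostgreSQL sketch, with an illustrative 10-minute timeout:

```sql
UPDATE recovery_task
SET locked_by = NULL,
    locked_at = NULL
WHERE locked_by IS NOT NULL
  AND locked_at < NOW() - INTERVAL '10 minutes';
```

The timeout should be comfortably longer than the slowest legitimate task, otherwise a healthy worker's task could be handed to a second worker mid-flight.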
Practical tips
- All updates, such as claiming tasks and updating results, should be performed within transactions.
- Recovery handlers should be idempotent so that repeated execution does not cause issues.
- Useful debugging information, such as error messages and request identifiers, should be stored.
- Retry limits should be clearly defined.
- Metrics should be added to monitor system behavior.
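As an example of the idempotency tip, a handler can record which task ids it has already completed and skip repeats. In this sketch an in-memory set stands in for a persistent deduplication store:

```python
processed_ids: set = set()  # stands in for a persistent dedup store

def recover(task_id: int, payload: dict) -> bool:
    """Idempotent recovery: running twice for the same task changes nothing.

    Returns True if work was performed, False if it was already done.
    """
    if task_id in processed_ids:
        return False
    # ... perform the actual recovery work here ...
    processed_ids.add(task_id)
    return True
```

With this guard, a task that is retried after a crash, or delivered twice by mistake, has its side effects applied exactly once.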
Final thoughts
Failures are unavoidable in distributed systems, especially when processing data asynchronously.
Instead of relying only on retries, introducing a dedicated recovery service allows you to handle failures in a controlled and reliable way.
This approach improves system reliability, scalability, and observability, and it forms an essential part of building production-ready systems.
