The six stages of evolution from silent failures to at-least-once delivery with RabbitMQ and Celery
Background tasks feel like magic until something breaks.
Imagine you have an endpoint that saves data in the DB, but the welcome email never sends because of a network blip.
Achieving true reliability requires moving beyond ‘fire-and-forget’ and adopting patterns like the Outbox. Here are the six stages of evolution from ‘fire-and-forget’ to a bullet-proof system.
Prerequisites
Examples in this article use Django, Celery and RabbitMQ. However, this approach applies to various other stacks. Familiarity with task queues is helpful but not required.
The only question that matters
When you publish a task, how do you know if it actually worked?
There are three different guarantees that you can have:
- At-most-once delivery — Messages are published once and may be lost if anything fails along the way.
- At-least-once delivery — Messages are never silently dropped, but the same message may be published more than once.
- Exactly-once delivery — Every message is published exactly once; this is impossible to guarantee in practice across a network, so systems emulate it with at-least-once delivery plus idempotent consumers.
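To make the at-least-once trade-off concrete, here is a toy simulation (all names are hypothetical, not a real broker API): the broker receives the message, but the acknowledgement is lost, so the publisher retries and the broker ends up holding a duplicate.

```python
def flaky_broker(seen):
    """Hypothetical broker: delivers the message, but loses the first ack."""
    calls = {"count": 0}

    def send(message):
        seen.append(message)          # the broker did receive it...
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("ack lost in transit")

    return send

def publish_at_least_once(send, message, attempts=3):
    # Keep retrying until the broker acknowledges the publish.
    for _ in range(attempts):
        try:
            send(message)
            return True
        except ConnectionError:
            continue
    return False

seen = []
ok = publish_at_least_once(flaky_broker(seen), "welcome_email:42")
# The publisher sees success, but the broker now holds the message twice.
```

This is exactly why at-least-once delivery forces idempotent consumers, which we return to in Stage 5.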
Publishers
A publisher has a single goal: to publish a message. When you want to emit an event or trigger a task, you publish a message to a queue. The key question to ask when working with publishers is: “Did my message get published?”
The Six Stages
Stage 1 — Best Effort
No guarantees, things fail silently.
Here’s the typical fire-and-forget pattern:
```python
def my_command(my_model):
    my_model.foo = "bar"
    my_model.save()
    my_task.delay(my_model.id)
```
In practice, commands rarely live in isolation:
```python
def my_command(my_model):
    my_model.foo = "bar"
    my_model.save()
    my_task.delay(my_model.id)

def parent_command():
    model = create_model()
    my_command(model)
    send_welcome_email(model)
```
Now we have multiple failure points and no rollback, which is much riskier and harder to recover from after a failure.
There are several failure modes:
- The database can go down while saving the model, preventing the task from being published.
- Publishing the task can hit a network failure or a broker crash, leaving the database record out of sync with the queue.
- The broker may accept the task, yet the message may still be silently lost.
We are firmly in at-most-once delivery land, where we can’t be confident that tasks will get published.
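The second failure mode above is the nastiest one, so here is a minimal sketch of it (FakeDB and publish are stand-ins, not real ORM or Celery calls): the DB write succeeds, the publish fails, and we are left with a committed row and no task.

```python
class FakeDB:
    """Stand-in for the ORM: saves always succeed in this sketch."""
    def __init__(self):
        self.rows = {}

    def save(self, pk, foo):
        self.rows[pk] = foo

def publish(task_id):
    # Stand-in for my_task.delay(): the network blips.
    raise ConnectionError("network blip while publishing")

def my_command(db, pk):
    db.save(pk, "bar")   # step 1: DB write succeeds
    publish(pk)          # step 2: publish fails after the write

db = FakeDB()
try:
    my_command(db, 1)
except ConnectionError:
    pass

# The row is committed, but no task was ever queued: out of sync.
```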
Stage 2 — Transactional Boundary
In Stage 1 there was no transactional boundary, so failures left data out of sync. Let’s wrap the whole command in one:
```python
from django.db import transaction

def my_command(my_model):
    my_model.foo = "bar"
    my_model.save()
    my_task.delay(my_model.id)

def parent_command():
    with transaction.atomic():
        model = create_model()
        my_command(model)
        send_welcome_email(model)
```
The transaction addresses several failure modes:
- If the database goes down — the transaction is rolled back.
- If publishing the task fails — the exception propagates and the transaction is rolled back.
We still have a gap where a message can be silently lost in the broker, and we will address this later.
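The rollback behaviour can be demonstrated without Django at all. In this sketch, sqlite3’s connection context manager plays the role of transaction.atomic() (commit on success, rollback on exception), and publish is a stand-in for a failing task dispatch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (id INTEGER PRIMARY KEY, foo TEXT)")
conn.commit()

def publish(task_id):
    raise ConnectionError("broker unreachable")

try:
    # Commit on success, rollback on exception — like transaction.atomic().
    with conn:
        conn.execute("INSERT INTO my_model (id, foo) VALUES (1, 'bar')")
        publish(1)   # raises -> the INSERT above is rolled back
except ConnectionError:
    pass

rows = conn.execute("SELECT * FROM my_model").fetchall()
# rows == []: no orphaned record survives the failed publish
```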
Stage 3 — Publishing before commit
The task is queued the moment delay() is called, not when the transaction commits. If anything goes wrong after that point, the database rolls back, but the task is already in flight, running against data that no longer exists in the state it expected.
For instance, my_task depends on the data within my_model. If send_welcome_email() fails, the transaction rolls back the foo column change on my_model, so by the time the task runs, my_model.foo is no longer equal to bar.
To address this, we call delay_on_commit() instead. This way the task is published only after the transaction commits successfully!
```python
from django.db import transaction

def my_command(my_model):
    my_model.foo = "bar"
    my_model.save()
    my_task.delay_on_commit(my_model.id)

def parent_command():
    with transaction.atomic():
        model = create_model()
        my_command(model)
        send_welcome_email(model)
```
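delay_on_commit() (added in Celery 5.4 for the Django integration) is a thin wrapper over Django’s transaction.on_commit(); on older Celery versions you can write transaction.on_commit(lambda: my_task.delay(my_model.id)) yourself. The mechanism is easy to model in plain Python — this toy FakeAtomic (a hypothetical name, not Django code) queues callbacks and runs them only if the block “commits”:

```python
class FakeAtomic:
    """Toy model of transaction.atomic() + transaction.on_commit():
    callbacks registered inside the block run only if it commits."""

    def __init__(self):
        self.callbacks = []

    def on_commit(self, func):
        self.callbacks.append(func)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:              # "commit": run queued publishes
            for func in self.callbacks:
                func()
        self.callbacks.clear()            # "rollback": drop them silently
        return False                      # never swallow the exception

published = []
with FakeAtomic() as tx:
    tx.on_commit(lambda: published.append("my_task:1"))
    # nothing is published while the transaction is still open

failed = []
try:
    with FakeAtomic() as tx:
        tx.on_commit(lambda: failed.append("my_task:2"))
        raise ValueError("send_welcome_email blew up")
except ValueError:
    pass
```

The first block “commits”, so its task fires; the second rolls back and its task is never published.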
Now that our code is getting more robust, it’s time to address silent message loss.
Stage 4 — Tasks are still going missing
We use transactions, we publish messages after the transaction commits, but tasks are still going missing!
You are not going mad; it’s an issue with the system, not you — we are missing publisher confirms.
```python
# set within settings file
BROKER_TRANSPORT_OPTIONS = {'confirm_publish': True}
```
With publisher confirms, we know the broker has actually accepted the message and taken responsibility for it. Without them, the broker can simply accept your request and then fail silently.
When a publish fails, an exception is raised, so the failure is no longer silent.
We are now getting closer to at-least-once delivery, as long as those failures are handled.
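“Handled” usually means retried. A minimal retry-with-backoff wrapper might look like this (publish_with_retry and confirming_send are illustrative names, not Celery API):

```python
import time

def publish_with_retry(send, message, max_retries=3, base_delay=0.1):
    """Retry a confirmed publish; with confirms on, failures raise
    instead of disappearing, so we can actually act on them."""
    for attempt in range(max_retries + 1):
        try:
            return send(message)
        except ConnectionError:
            if attempt == max_retries:
                raise                                  # give up loudly
            time.sleep(base_delay * 2 ** attempt)      # exponential backoff

calls = {"n": 0}

def confirming_send(message):
    # Hypothetical broker call: fails twice, then confirms.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("confirm not received")
    return "ack"

result = publish_with_retry(confirming_send, "welcome_email:42", base_delay=0)
```

Celery can also do this for you at call time: apply_async() accepts retry=True and a retry_policy dict (max_retries, interval_start, interval_step, interval_max).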
However, there are still issues with this approach. The user’s connection can die, and so can the broker. A request might end up blocked on a publish while the data is already in the database… a sweeper task might come to mind, and that is definitely an option. But I want to suggest a different approach next.
Stage 5 — Outbox
Rather than publishing messages directly to the queue, what if we stored an intent to publish instead?
The idea is simple: we already have an atomic transaction, so why not persist the task in a table for later execution? We are in an async world anyway, so an extra minute of wait time should be fine!
```python
def my_command(my_model):
    my_model.foo = "bar"
    my_model.save()
    # writing to a DB instead of scheduling a task
    schedule_task_execution('path.to.my_task', my_model.id)

def parent_command():
    with transaction.atomic():
        model = create_model()
        my_command(model)
        send_welcome_email(model)
```
By doing this, we move intent into the database. The intent will be captured safely due to atomic transactions. That intent can later be dispatched to the message broker and retried in case of failures, without degrading user experience. Now we get at-least-once delivery.
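A full outbox implementation is beyond this article, but the core shape fits in a few lines. This sketch uses sqlite3 standing in for the application database; schedule_task_execution and relay_outbox are illustrative names under the assumption of a single-process sweeper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE outbox (
        id         INTEGER PRIMARY KEY,
        task_path  TEXT NOT NULL,
        arg        TEXT NOT NULL,
        dispatched INTEGER NOT NULL DEFAULT 0
    )
""")
conn.commit()

def schedule_task_execution(conn, task_path, arg):
    # Runs inside the caller's transaction, so the intent commits
    # (or rolls back) together with the business data.
    conn.execute(
        "INSERT INTO outbox (task_path, arg) VALUES (?, ?)",
        (task_path, str(arg)),
    )

def relay_outbox(conn, publish):
    # Periodic sweeper: publish pending intents, mark them only after
    # a successful publish. A crash between the two steps means the
    # intent is re-published next run — at-least-once delivery.
    pending = conn.execute(
        "SELECT id, task_path, arg FROM outbox WHERE dispatched = 0"
    ).fetchall()
    for row_id, task_path, arg in pending:
        publish(task_path, arg)            # may raise -> retried next run
        with conn:
            conn.execute(
                "UPDATE outbox SET dispatched = 1 WHERE id = ?", (row_id,)
            )

sent = []
with conn:   # the "business transaction"
    schedule_task_execution(conn, "path.to.my_task", 42)
relay_outbox(conn, lambda path, arg: sent.append((path, arg)))
```

Note the ordering inside the relay: publish first, mark second. Reversing it would give you back at-most-once delivery, since a crash after marking would lose the message.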
However, the complexity hasn’t disappeared; we have simply shifted it to make the rest of the system more reliable. With at-least-once delivery, we must now make sure that duplicates are handled correctly and that consumers process each task idempotently.
Stage 6 — Clusters
We now have a mostly reliable publishing system, until the server burns to the ground. This is where a cluster of messaging servers comes in. Many cloud providers handle this for us in the background, but when you operate the broker yourself, it’s worth being mindful of.
Right now, we are publishing a message to a single server. That server confirms that the message was written to disk and will be handled. With a cluster, the message is also replicated to the other servers in the cluster.
The replication process is critical because, depending on the method, messages can still be lost! I will give one example of what can happen with RabbitMQ.
In RabbitMQ, classic queues replicate messages after they are published, so there is a window in which a message has been confirmed but not yet replicated when the server burns down. RabbitMQ also offers another queue type called ‘quorum’, which uses the Raft consensus algorithm for replication: a message is persisted on a majority of the nodes in the cluster before the broker sends the publisher confirm. You trade some speed for greater reliability.
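The majority requirement is simple arithmetic, sketched here as a back-of-envelope helper (the function names are mine, not RabbitMQ API):

```python
def quorum_size(cluster_size):
    # Raft commits a write once a strict majority of nodes has it.
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size):
    # Nodes the cluster can lose while still reaching a quorum.
    return cluster_size - quorum_size(cluster_size)

summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Note that 4 nodes tolerate only 1 failure, the same as 3 nodes, which is why odd cluster sizes are the usual recommendation.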
Over the six stages, we went from at-most-once to at-least-once delivery — building a robust process that we can put our trust in.
Did my message get published?
As a project scales, so does the chance of something going wrong. With few users, the chance of the message broker going down mid-publish is rather small, but that changes with scale: uncommon edge cases become common enough to matter. Through that journey, you can build reliability incrementally.
In this article, we covered just one aspect of the process. Once a message is in the queue, there is the question of how to consume it, which touches on observability, monitoring and idempotency.
The journey doesn’t have to happen all at once — what stage is your current system at?
If this was useful, a reaction or boost helps more people find it — and I’m always happy to discuss in the comments.
References and further reading:
- https://docs.celeryq.dev/en/main/userguide/tasks.html#database-transactions
- https://www.rabbitmq.com/docs/confirms
- https://docs.celeryq.dev/en/main/getting-started/backends-and-brokers/rabbitmq.html#using-quorum-queues
- https://microservices.io/patterns/data/transactional-outbox.html
- https://www.rabbitmq.com/docs/quorum-queues#quorum-requirements