The gap between task queues and durable execution, and what fills it in 2026
Most Python applications reach for Celery when they need background processing, and it is a reasonable default: mature, well-documented, and with an operational model familiar to almost every backend engineer in the ecosystem.
The problem surfaces later, usually in production, usually at an inconvenient time.
You have a three-step workflow: charge a card, reserve inventory, send a confirmation. The worker crashes after step one completes but before step two begins. Celery retries the task. Step one runs again. The card gets charged twice.
This is not a Celery bug. It is the correct behavior given what Celery guarantees, which is at-least-once delivery of individual tasks. It was never designed to guarantee crash-safe execution of multi-step sequences. That is a different problem, and it requires a different tool.
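The double-charge failure mode is easy to reproduce in miniature. This toy sketch (plain Python, no Celery) shows why retrying a whole multi-step task re-runs steps that already completed:

```python
# Toy reproduction of the at-least-once hazard: a retry re-runs the whole
# task, including the charge. Plain Python, not Celery's actual API.
charges = []

def process_order(crash_after_charge=False):
    charges.append(f"ch_{len(charges)}")    # step 1: charge the card
    if crash_after_charge:
        raise RuntimeError("worker died")   # crash before step 2
    # step 2: reserve inventory ... (elided)
    # step 3: send confirmation ... (elided)

try:
    process_order(crash_after_charge=True)
except RuntimeError:
    pass            # the queue sees a failed task and retries it whole
process_order()     # retry: step 1 runs again

assert len(charges) == 2   # the card was charged twice
```

The queue cannot do better on its own: it sees one opaque task that failed, not three steps of which one succeeded.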
What crash-safe execution actually means
A durable workflow engine does two things a task queue does not.
First, it checkpoints step outputs atomically. Before a step is marked complete, its result is written to persistent storage in the same transaction. The step either finishes and its result is stored, or neither happens. There is no state where a step completed but its output was lost.
Second, it replays from the exact failure point. When a crashed worker restarts, it reads the checkpoint store, sees which steps already completed, loads their outputs, and continues from the step that failed. Steps that already ran do not run again.
```
Normal run:   step 1 [OK]   → step 2 [OK]    → step 3 [OK]
Crash:        step 1 [OK]   → step 2 [CRASH]
Auto-resume:  step 1 [SKIP] → step 2 [OK]    → step 3 [OK]
```
Step 1 is skipped entirely on resume. Its output, the charge ID from Stripe, was already written to the database. The card is never charged twice.
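The checkpoint-and-replay loop can be sketched in a few lines of plain Python, with a dict standing in for the durable store. The names here are illustrative, not Gravtory's API:

```python
# Minimal sketch of checkpoint-and-replay. A dict stands in for the
# durable checkpoint store; in a real engine this write would be a
# database transaction committed atomically with the step's completion.
calls = {"charge": 0, "reserve": 0, "notify": 0}

def charge():
    calls["charge"] += 1
    return "ch_123"          # e.g. a payment provider's charge ID

def reserve():
    calls["reserve"] += 1
    return "res_456"

def notify():
    calls["notify"] += 1
    return "sent"

STEPS = [("charge", charge), ("reserve", reserve), ("notify", notify)]

def run(store, crash_before=None):
    """Run steps in order; skip any step whose result is checkpointed."""
    for name, fn in STEPS:
        if name in store:            # already completed: load, don't re-run
            continue
        if name == crash_before:     # simulate a worker crash
            raise RuntimeError("worker crashed")
        store[name] = fn()           # checkpoint the result

store = {}
try:
    run(store, crash_before="reserve")   # crashes after charge completes
except RuntimeError:
    pass
run(store)                               # resume: charge is skipped

assert calls["charge"] == 1              # charged exactly once
```

The skip-if-checkpointed check at the top of the loop is the entire mechanism; everything else a real engine adds is about making that checkpoint write atomic and durable.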
The infrastructure problem
The tool that has provided these guarantees reliably and at scale is Temporal. Its event-sourcing model is rigorous and the guarantees are strong. But running Temporal means deploying a server cluster, a history service, a matching service, a frontend service, a database, and optionally Elasticsearch for visibility. For many teams, particularly smaller ones or those earlier in a product's lifecycle, that is a legitimate blocker.
The gap this creates is real. You need more than Celery but cannot justify a Temporal deployment. The options in that space have historically been limited.
Gravtory
Gravtory is a Python library that provides crash-safe durable execution backed by your existing database. PostgreSQL, MySQL, SQLite, MongoDB, and Redis are all supported. There is no additional server process, no broker, no new infrastructure to operate.
```shell
pip install gravtory[postgres]
```
Workflows are Python classes. Steps are decorated methods. The engine handles persistence, scheduling, and replay.
```python
from gravtory import Gravtory, workflow, step

grav = Gravtory("postgresql://localhost/mydb")

@grav.workflow(id="order-{order_id}")
class OrderWorkflow:
    @step(1)
    async def charge_card(self, order_id: str) -> dict:
        return await stripe.charge(order_id)

    @step(2, depends_on=1)
    async def reserve_inventory(self, order_id: str) -> dict:
        return await inventory.reserve(order_id)

    @step(3, depends_on=2)
    async def send_notification(self, order_id: str) -> None:
        await email.send(order_id)
```
Beyond the core checkpointing, it covers several patterns that production systems commonly need.
Saga compensation runs automatically when a downstream step fails. You annotate each step with a compensating function and the engine handles rollback durably:
```python
@step(1, compensate="refund")
async def debit(self, amount: Decimal) -> dict:
    return await bank.debit(self.source, amount)
```
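The rollback mechanics behind saga compensation can be sketched independently of any library: when a later step fails, run each completed step's compensator in reverse order. This is a toy in-memory version, not Gravtory's implementation:

```python
# Toy saga: on failure, compensate completed steps most-recent-first.
# A real engine would persist the compensation progress so that a crash
# mid-rollback resumes the rollback, too.
log = []

steps = [
    # (name, action, compensator)
    ("debit",  lambda: log.append("debit"),  lambda: log.append("refund")),
    ("credit", lambda: log.append("credit"), None),
]

def run_saga(fail_at=None):
    completed = []
    for name, action, compensate in steps:
        if name == fail_at:
            # roll back completed steps in reverse order
            for _, _, comp in reversed(completed):
                if comp:
                    comp()
            raise RuntimeError(f"{name} failed, saga rolled back")
        action()
        completed.append((name, action, compensate))

try:
    run_saga(fail_at="credit")
except RuntimeError:
    pass

assert log == ["debit", "refund"]   # the debit was compensated
```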
Signals let external systems send typed messages into a running workflow, which enables human-in-the-loop approval flows without holding a worker slot during the wait.
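One way to model a wait like that without holding a worker is to persist a wait marker and return, then resume once a signal has supplied the value. A toy sketch of the idea, not Gravtory's actual signal API:

```python
# Toy model of a signal wait: the workflow parks by persisting a
# "waiting" marker and returning, freeing the worker slot. An external
# system later writes the signal value, and a resume picks it up.
state = {}   # stand-in for the workflow's durable state row

def run_until_signal():
    if "approval" not in state:
        state["status"] = "waiting_for_approval"
        return None                  # worker slot is released here
    return "shipped" if state["approval"] else "cancelled"

assert run_until_signal() is None         # parked, no worker held
state["approval"] = True                  # external system sends the signal
assert run_until_signal() == "shipped"    # resumed with the value
```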
Parallel fan-out is checkpointed at the individual item level. A crash mid-batch resumes with only the incomplete items. Completed items are not reprocessed.
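Item-level checkpointing is a small idea with a large payoff on resume. A minimal sketch, with a set of completed item IDs standing in for the durable record (illustrative, not Gravtory's API):

```python
# Sketch of item-level checkpointing for a fan-out batch: each completed
# item is recorded, so a resume after a crash skips finished items.
processed = []

def handle(item):
    processed.append(item)      # the per-item side effect

def fan_out(items, done, crash_at=None):
    for item in items:
        if item in done:        # checkpointed: skip on resume
            continue
        if item == crash_at:
            raise RuntimeError("crash mid-batch")
        handle(item)
        done.add(item)          # checkpoint this item

items = ["a", "b", "c", "d"]
done = set()
try:
    fan_out(items, done, crash_at="c")   # crashes after a and b
except RuntimeError:
    pass
fan_out(items, done)                     # resume: only c and d run

assert processed == ["a", "b", "c", "d"]   # no item handled twice
```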
Type safety runs throughout: step inputs and outputs are Pydantic models, and the library ships with full type annotations and a py.typed marker.
Testing runs in-memory with no database required, and supports crash simulation:
```python
runner.simulate_crash_after(step=1)
result = await runner.run(OrderWorkflow, order_id="test_123")
result = await runner.resume("order-test_123")
assert result.steps[1].was_replayed
```
The ability to assert that a step was not re-executed after a simulated crash is the kind of test that is nearly impossible to write against a real broker without significant scaffolding.
When to use it
If your application already runs PostgreSQL and you need crash-safe multi-step execution, saga compensation, or long-running stateful workflows, the infrastructure cost of adding Gravtory is effectively zero. You are adding a schema migration and a library dependency.
If you need the strongest possible guarantees and have the operational capacity for a full Temporal deployment, Temporal is still the more rigorous option. The event-sourcing model provides properties that database checkpointing does not match in every scenario.
If you are currently using Celery for workflows where re-execution of individual steps has side effects, and you have been working around that limitation with idempotency keys and manual deduplication logic, Gravtory directly addresses the underlying problem rather than working around it.
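For comparison, this is roughly what that manual workaround looks like: every side-effecting step guarded by its own idempotency key and dedup lookup. Names here are illustrative:

```python
# The manual workaround: guard each side-effecting call with an
# idempotency key checked against a dedup table. It works, but every
# step in every workflow has to carry this boilerplate.
dedup = {}          # stand-in for a unique-keyed database table
charge_calls = []

def charge_card(order_id):
    key = f"charge:{order_id}"      # idempotency key for this step
    if key in dedup:
        return dedup[key]           # already charged: return stored result
    charge_calls.append(order_id)   # the actual side effect
    dedup[key] = f"ch_{order_id}"   # record the result under the key
    return dedup[key]

first = charge_card("ord_42")
second = charge_card("ord_42")      # retry after a crash: deduplicated

assert first == second
assert len(charge_calls) == 1      # charged once despite the retry
```

A workflow engine moves this bookkeeping out of application code and into the execution layer, which is the substance of the difference.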