wintrover

Posted on Jan 8 • Originally published at wintrover.github.io

From Celery/Redis to Temporal: A Journey Toward Idempotency and Reliable Workflows

#temporal #celery #redis #idempotency

When handling asynchronous tasks in distributed systems, the combination of Celery and Redis is often the go-to choice. I also chose Celery for the initial design of my KYC (Know Your Customer) orchestrator due to its familiarity. However, as the service grew in complexity, I hit a massive wall: guaranteeing idempotency and managing complex states.

In this post, I want to share my technical journey of why I moved away from Celery to Temporal and how I ensured idempotency during that process.

1. Limitations of Celery/Redis: Why Change Was Necessary?

Difficulties in Idempotency Management

While Celery is excellent for "Fire and Forget" tasks, there's a high risk of duplicate execution during retries caused by network failures or worker downs. Especially for face recognition tasks that consume significant GPU resources, duplicate execution was critical in terms of both cost and performance.

Fragmentation of State

The KYC process follows this sequence:

User uploads an ID card image.
User uploads a selfie video.
Compare face similarity once both files exist.

In a Celery environment, since I didn't know when images and videos would be uploaded, I needed complex logic to query the DB every time or store intermediate states in Redis. The logic to check "Are all files collected?" was scattered across multiple places, making maintenance difficult.

2. Introducing Temporal: A Paradigm Shift in Orchestration

Temporal is not just a message queue; it's a Stateful Workflows engine.

Workflow Logic Must Be "Deterministic"

Since Temporal workflow code is based on the premise of "Replay," it must always produce the same sequence of workflow API calls for the same input and history. Therefore, you should not directly perform "external-world-dependent operations" like network I/O, file I/O, system time (e.g., DateTime.now), randomness, or threading within a workflow. These side effects should be pushed to activities, while the workflow focuses solely on orchestration.

Official Docs: https://docs.temporal.io/develop/python/core-application#workflow-logic-requirements

Workflow-Centric Design

The first thing that changed after introducing Temporal was the visibility of business logic. FaceSimilarityWorkflow now gracefully waits until files are ready.

# Core logic of FaceSimilarityWorkflow
@workflow.run
async def run(self, data: SimilarityData) -> SimilarityResult:
    # Wait up to 1 hour until both image and video are collected
    await workflow.wait_condition(
        lambda: any(f["type"] == "image" for f in self._files)
        and any(f["type"] == "video" for f in self._files),
        timeout=timedelta(hours=1),
    )

    # Execute GPU activity once all files are ready
    result = await workflow.execute_activity(
        check_face_similarity_activity,
        activity_data,
        retry_policy=RetryPolicy(maximum_attempts=3)
    )
    return result

This code uses workflow.wait_condition to suspend the workflow until the condition is met without blocking the event loop. In Celery, this would have required complex polling or webhook logic.

3. Idempotency Strategy: Building a Double Defense

Even with Temporal, idempotency at the activity level remains crucial. I established a double defense strategy as follows.

Step 1: Temporal's Basic Guarantee

Temporal records the progress of a workflow as event history. Therefore, even if a worker restarts, it resumes exactly from the last successful point.

Step 2: Explicit Checks within Activities

Since Temporal activities follow an "at-least-once" execution model, an activity might be retried if a worker crashes after successfully performing it but before notifying the server. Thus, official documentation strongly recommends making activities idempotent.

Official Docs: https://docs.temporal.io/develop/python/error-handling#make-activities-idempotent

In practice, I use the following two together:

For external system calls, pass an idempotency key combined from the workflow execution and activity identifiers.
Internally, use unique keys (or check for existing results) in the DB to prevent duplicate storage/processing.

@activity.defn
async def check_face_similarity_activity(data: SimilarityData) -> SimilarityResult:
    info = activity.info()
    idempotency_key = f"{info.workflow_run_id}-{info.activity_id}"
    session_id = data["session_id"]

    with get_db_context() as db:
        existing = (
            db.query(FaceSimilarity)
            .filter(FaceSimilarity.idempotency_key == idempotency_key)
            .first()
        )
        if existing:
            return SimilarityResult(success=True, message="Already processed.")

    # Perform actual GPU-intensive work...

4. Results: What Has Changed?

Comparison Item	Celery/Redis Based	Temporal Based
State Management	Manual storage in DB/Redis	Automatically managed by engine
Retry Strategy	Manual exponential backoff	Declarative Retry Policy
Visibility	Must dig through logs	Check history in Temporal UI
Idempotency	Very difficult to guarantee	Structurally achievable

Conclusion

The transition from Celery to Temporal was not just about changing tools; it was about changing how I define business processes in code. Especially in financial/authentication systems where idempotency is paramount, Temporal provided irreplaceable stability.

If you are losing sleep over complex asynchronous logic and idempotency issues, I strongly recommend migrating to Temporal.

DEV Community