Durable Functions vs. Apache Airflow

Chris Gillum ・8 min read

Recently I've been looking at Apache Airflow since I've noticed it getting a lot of attention from Python developers and cloud providers for supporting "workflow" scenarios. For context, I'm the creator of Durable Functions on Azure and we'll soon be announcing the general availability (GA) of Durable Functions for Python. Both Airflow and Durable Functions support building workflows in Python, so I thought it would be worth doing some investigation to understand the differences between the two technologies. This blog post is my attempt at doing this comparison and I hope folks find it somewhat useful.

In the end, what I've learned is that Durable Functions and Apache Airflow are trying to solve different problems using different approaches, in spite of the fact that they both support implementing "workflows" in Python. The main differences are around the types of workflows supported and the role that Python plays in the authoring process. I go into more details on these differences in this post.

Orchestrators vs DAGs

The most important technical difference I've found is the programming model. Durable Functions allows you to build workflows by writing orchestrator functions. Orchestrator functions describe how actions are executed and the order in which actions are executed. To illustrate, the following is a simple sequential orchestration in Python that calls three tasks known as "activities", t1, t2, and t3, in sequence using Python's generator syntax.

import azure.functions as func
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    x = yield context.call_activity("t1", None)
    y = yield context.call_activity("t2", x)
    z = yield context.call_activity("t3", y)
    return z

main = df.Orchestrator.create(orchestrator_function)

Each yield expression causes the orchestrator function to wait for the scheduled task to complete and then save the result to a local variable for later use. When yielded in this way, the orchestrator function can be unloaded from memory and the progress of the workflow is persisted.
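To build some intuition for how this works, here is a minimal, framework-free sketch of how a runtime can replay a generator-based workflow by feeding saved results back in. Note that `orchestrator`, `replay`, and the tuple-based activity requests are all illustrative; this is not the Durable Functions API.

```python
# Illustrative sketch only -- not the Durable Functions API.
# The orchestrator yields (activity_name, input) requests, and the
# runtime feeds previously saved results back in on each replay.

def orchestrator():
    x = yield ("t1", None)   # request activity t1
    y = yield ("t2", x)      # pass t1's result to t2
    z = yield ("t3", y)      # pass t2's result to t3
    return z

def replay(history):
    """Re-run the generator, feeding saved activity results back in."""
    gen = orchestrator()
    request = gen.send(None)          # run to the first yield
    for saved_result in history:
        try:
            request = gen.send(saved_result)
        except StopIteration as done:
            return done.value         # orchestration finished
    return request                    # next activity to schedule
```

Because the orchestrator's state lives entirely in local variables that are rebuilt deterministically on each replay, the generator can be discarded from memory at any `yield` point and reconstructed later from the saved history alone.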

An orchestration can schedule many different types of actions, including activity functions and sub-orchestrations; it can also wait for external events, make HTTP calls, and sleep using durable timers. Orchestrator functions can also interact with durable actor-like objects known as entity functions. All of this is done using normal procedural coding constructs, which means you can use programming language features like conditionals, loops, function calls, and exception handling via try/except/finally (for implementing compensation logic). For developers, this model is very natural and scales nicely to even highly complex workflows. Reliability and distributed execution are handled for you by the underlying framework.

The Apache Airflow programming model is very different in that it uses a more declarative syntax to define a DAG (directed acyclic graph) using Python. To illustrate, let's assume again that we have three tasks defined, t1, t2, and t3. You could implement a similar sequential workflow as above using the following code in Airflow:

# Imports added for completeness. Note that t1, t2, and t3 on the
# right-hand side below are assumed to be plain Python callables
# defined earlier in the file; each name is then rebound to its operator.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG('hello_world', description='Sequential DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2020, 11, 28), catchup=False)

t1 = PythonOperator(task_id='t1', dag=dag, python_callable=t1)
t2 = PythonOperator(task_id='t2', dag=dag, python_callable=t2)
t3 = PythonOperator(task_id='t3', dag=dag, python_callable=t3)

t1 >> t2 >> t3

It's important to note that Airflow Python scripts are really just configuration files that specify the DAG's structure as code. Unlike in a normal Python program, you are not able to pass dynamic inputs, inspect outputs, or steer execution with conditionals, loops, error handling, or other features of the language. It's therefore best to think of Airflow DAG authoring as a workflow DSL that happens to use Python for configuration.
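A small illustration of this point (no Airflow imports; the names here are made up): any Python control flow in a DAG file runs exactly once, at parse time, so a loop doesn't branch while the workflow runs. It just stamps out a fixed set of tasks and dependencies up front.

```python
# Illustrative only: control flow in a DAG file executes at parse time.
# A loop here doesn't make run-time decisions; it produces a static
# list of task ids and a fixed dependency graph before any task runs.
task_ids = [f"step_{i}" for i in range(3)]

# The resulting edges are fully determined before execution begins:
edges = list(zip(task_ids, task_ids[1:]))
```

Once the scheduler has parsed the file, the graph is frozen; no task's output can change which tasks exist or how they are wired together.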

To illustrate this point further, consider a common "approval workflow" use-case. In this scenario, someone submits a purchase order that needs to be approved by a manager. The workflow waits for the approval and moves on to the processing step immediately after it is received. However, if no approval is received within 72 hours (maybe the approver is on vacation), an escalation task is scheduled to help resolve the pending approval. Using Durable Functions, we can implement this workflow using code like the following:

import azure.durable_functions as df
from datetime import timedelta 


def orchestrator_function(context: df.DurableOrchestrationContext):
    yield context.call_activity("RequestApproval", None)

    # create a timer task that expires 72 hours from now
    due_time = context.current_utc_datetime + timedelta(hours=72)
    timeout_task = context.create_timer(due_time)

    # create a task that completes when an "Approval" event is received
    approval_task = context.wait_for_external_event("Approval")

    # context.task_any() waits until any one task completes and returns it
    winning_task = yield context.task_any([approval_task, timeout_task])

    if approval_task == winning_task:
        timeout_task.cancel()
        yield context.call_activity("Process", approval_task.result)
    else:
        yield context.call_activity("Escalate", None)


main = df.Orchestrator.create(orchestrator_function)

As you can see, tasks are scheduled dynamically, have inputs and outputs, and the outputs can be used to decide the next step to take in the workflow. Airflow DAGs, on the other hand, are more optimized for defining static data pipelines that don't necessarily require passing data around.

There are important tradeoffs between these two workflow authoring models. In the "imperative code" model of Durable Functions where task scheduling is dynamic, you can express a much larger set of possible workflows. However, you have to be careful to ensure your orchestrator code is deterministic and doesn't violate the orchestrator code constraints. The "declarative code" model of Airflow is much more static and constrained, but those constraints make it easier to build tools that analyze the DAGs and do things like create visualizations (more on this in subsequent sections).

Activities vs. Operators

Another important distinction is between activities and operators. Both represent workflow tasks that can be scheduled. In Durable Functions, everything is a function: the orchestrators as well as the activities being orchestrated. However, it's up to you to write the business logic for each of your activity functions. As mentioned earlier, Durable Functions offers several low-level primitive task types, like timers, external event handlers, and HTTP actions, but doesn't offer an extensive library of "pre-cooked" tasks in its current version.
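To make this concrete, an activity function is just an ordinary function whose body you supply. A hypothetical implementation of the `t2` activity from the first example might look like the following (the actual transformation is made up for illustration; the framework persists the return value on the activity's behalf):

```python
# Hypothetical body for the "t2" activity: a plain Python function
# that receives whatever input the orchestrator passed in and
# returns a result for the next step to consume.
def main(x: str) -> str:
    return x.strip().upper()
```

There is no declarative wrapper or operator class involved; whatever business logic the function contains is the task.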

Operators in Apache Airflow are more like traditional workflow actions in the sense that there is a preexisting library of operators that you choose from. The most basic is PythonOperator, which allows you to run a Python function as a task. In addition, a broad set of operators exist for interacting with external systems, like databases, HTTP endpoints, cloud data services like S3, etc. You can find the full list of supported operators here. You can also leverage other operators developed by the OSS community, or create your own via Airflow plugins. Because of this library of existing operators, it's possible that many Airflow-authored workflows won't involve any custom code at all.

Event-driven vs. CRON

Apache Airflow DAGs are primarily triggered on predefined CRON schedules. More info here. It's also possible to do one-off runs using the CLI. The Airflow CRON model is especially useful if you need to do backfilling or do "catchup" executions as part of your data processing pipeline.

Durable Functions, on the other hand, relies on the event-driven mechanisms of Azure Functions for triggering orchestrations. For example, you would create a client function that starts one or more orchestration instances with dynamic inputs when an HTTP request is received, when a queue message arrives, or based on an event from one of many other supported trigger types, including CRON-based timer trigger events.

Do-it-yourself vs. built-in management tools

I suspect one of the biggest draws to Apache Airflow is the UI tools that allow you to easily manage and inspect your DAGs. You can see the full list of Airflow UI screenshots here. Some especially useful views, in my opinion, are the graph view that visualizes your DAG and shows its progress in real-time, and the Gantt chart that shows you how long each step in your workflow has taken. As I mentioned earlier, the static nature of DAGs makes it easy to build visual tooling for management and monitoring purposes. You can find a great video showing off some of these features here.

Durable Functions exposes a set of management APIs over HTTP and language-specific SDK APIs. However, there are no built-in tools for doing this kind of visualization (though some 3rd party tools are available). In terms of monitoring, Durable Functions emits detailed telemetry into Application Insights, which is quite powerful but has its own learning curve for creating alerts, visualizations, etc. More information on diagnostics and monitoring for Durable Functions can be found here.

Serverless vs. VMs

If at this point you believe that both Durable Functions and Apache Airflow could potentially meet your needs, then another important difference to consider is the range of supported hosting environments. Durable Functions has a significant advantage in that it can run in a completely serverless environment (via Azure Functions). This means there are no servers or VMs to set up, no databases to configure, scale-out is automatic and elastic, and you only pay when work is being done. This often translates into pennies per month or less when you have very light workloads.

On the other hand, Apache Airflow has managed offerings available in other clouds, including Astronomer.io, Google Cloud Composer, and Amazon Managed Workflows for Apache Airflow (MWAA). These managed offerings provide automatic scaling and management of infrastructure but you are still left with per-hour VM billing models. At the current time of writing, the cheapest configurations on these clouds will cost between $100 and $400 USD per month, regardless of how many workflows you execute.

Conclusion

The most important conclusion I arrived at when examining Apache Airflow is that it is designed and optimized specifically for static data processing pipelines. If you need to build ETL pipelines with a known set of steps and dependencies, then Airflow might be a great option because of the various built-in tools and the broad array of plugins created by the Airflow community.

However, if you're looking for a platform that supports a broader range of workflow primitives and scenarios, and if you need the flexibility that code-first orchestrations provide, then something like Durable Functions may be more suitable.
