Last year, when our team was selecting a data platform, my boss said flatly: “Airflow is too heavy. The operational cost is too high. Find a lighter alternative.”
To be honest, I was a bit overwhelmed at the time. Airflow is indeed heavy. There are a lot of Python dependencies, and the Celery Executor also requires Redis or RabbitMQ. Once the scale grows a bit, you basically need to use Kubernetes.
But our data team only has a few people. Asking them to maintain crontab scripts? That would be going backwards.
Later, after browsing GitHub, I found DolphinScheduler in the Apache Incubator. It has 14.1K stars, is under the Apache 2.0 license, and was open-sourced by a Chinese company (Analysys). Now it has graduated and become a top-level Apache project.
After trying it out, I found that this thing really has something special.
Low-Code Drag-and-Drop, You Can Get Things Done Without Writing YAML
Everyone knows how Airflow defines DAGs: workflows are written as Python code. That is flexible, but opaque to data analysts.
DolphinScheduler directly provides you with a visual drag-and-drop interface. You can configure task dependencies just by clicking and dragging with your mouse.
It supports more than 30 task types: Shell, SQL, Spark, Flink, HTTP, DataX, Python… basically covering all common tasks in big data scenarios.
Want to run a Hive SQL? Drag a SQL node, configure the data source and script, connect upstream dependencies, done. No need to write a single line of Python, and no need to deal with BashOperator or SparkSubmitOperator.
This is much more friendly to non-developer roles. Data analysts can configure workflows themselves, without coming to you every day asking you to write DAGs.
Decentralized High Availability, No Dependence on ZooKeeper
Everyone knows Airflow’s architecture: the Scheduler is a single point of failure. Multi-Scheduler HA was added later, but it still relies on database row locks to keep tasks from being scheduled twice.
DolphinScheduler was designed with decentralization from the very beginning. The architecture is very clear, with five core components:
- API Server: the entry point for frontend interaction, including workflow configuration and user permission management
- Master Server: DAG parsing and task distribution; multiple Masters can be deployed, and each can work independently
- Worker Server: task execution nodes that receive tasks from Master and return results
- Alert Server: alert notifications, supporting email, DingTalk, WeCom, Feishu, and more
- Registry: registry center responsible for service discovery and distributed locks, supporting three options: JDBC, ZooKeeper, and Etcd
Let’s focus on the Master’s decentralized design.
There is no master-slave relationship between multiple Masters. After starting, each Master registers itself to the Registry, and then competes for tasks using a slot partitioning algorithm.
How is the partitioning done? It uses modulo on ID:
Command ID % total number of Masters = the slot of the current Master
For example, if you have 3 Masters, and the Command ID is 1001, then it will be assigned to slot 2 (1001 % 3 = 2, slots start from 0).
If one Master goes down, its slot will be taken over by other Masters, and tasks will not be lost.
This design is much simpler than Airflow’s Scheduler HA. It does not require complex leader election logic, and Masters can scale horizontally at any time.
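That slot logic can be sketched in a few lines of Python (an illustrative model only, not DolphinScheduler’s actual Java implementation):

```python
def assign_slot(command_id: int, num_masters: int) -> int:
    """Map a command to a Master slot by modulo partitioning."""
    return command_id % num_masters

# 3 Masters alive: command 1001 lands in slot 2.
print(assign_slot(1001, 3))  # 2

# One Master dies, 2 remain: the same command re-hashes to slot 1,
# so a surviving Master picks it up instead of it being lost.
print(assign_slot(1001, 2))  # 1
```

The important property is that the mapping is a pure function of the command ID and the live Master count, so failover needs no leader election, only the Registry’s membership list.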
Use JDBC as Registry, Say Goodbye to ZooKeeper Dependency
In the past, building a distributed scheduling system meant you couldn’t avoid ZooKeeper. Airflow took a different route and coordinates its schedulers through database locks instead, but that approach still has performance bottlenecks.
DolphinScheduler supports three types of registries: JDBC, ZooKeeper, and Etcd.
The official recommendation is to use JDBC. You can directly reuse your business database (MySQL or PostgreSQL), without deploying additional ZK or Etcd clusters.
For small and medium-sized teams, maintaining one less component means reducing cost and improving efficiency.
Of course, if you already have a ZK cluster, or have extremely high performance requirements (tens of thousands of concurrent scheduling tasks), you can still choose ZK or Etcd.
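As a rough illustration, switching to the JDBC registry is just a small change in the server’s `application.yaml`. The keys below follow the 3.x documentation layout; verify them against the docs for your exact version:

```yaml
# Illustrative JDBC registry configuration (check keys for your version)
registry:
  type: jdbc
  term-refresh-interval: 2s   # how often each node renews its heartbeat row
  term-expire-times: 3        # missed heartbeats before a node is considered dead
```

Since the heartbeat state lives in the same MySQL/PostgreSQL instance as the metadata, there is no extra cluster to monitor.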
Task Dispatch Mechanism: Active Push Instead of Pull
Airflow’s Celery Executor is a typical task queue model. The Scheduler puts tasks into a Redis queue, and Workers pull them themselves.
This approach is flexible, but when the queue gets backlogged, it becomes troublesome.
DolphinScheduler uses active push. After the Master parses the DAG, it directly pushes tasks to Workers via Netty RPC.
Workers do not need to poll. The Master tells them exactly what to do.
During task allocation, load balancing is performed. By default, it uses dynamic weighted round-robin, considering CPU, memory, and thread pool usage of Workers, and assigning tasks to nodes with lower load.
If a Worker is about to be overloaded, the Master will automatically schedule tasks to other nodes.
The advantage of this push mechanism is low scheduling latency. The Master can grasp Worker status in real time, and tasks will not sit in the queue for dozens of seconds waiting to be consumed.
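As a rough model of that dispatch policy, here is a smooth weighted round-robin in Python, where each Worker’s weight is derived from its spare CPU, memory, and thread-pool capacity. All names and the weight formula are illustrative; the real logic lives in the Master’s Java code:

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin over workers, weighted by spare capacity."""

    def __init__(self, workers):
        # workers: {name: {"cpu": usage, "memory": usage, "threads": usage}},
        # each usage in [0, 1].
        self.workers = workers
        self.current = {name: 0 for name in workers}

    def weight(self, name):
        w = self.workers[name]
        # Spare capacity across the three axes, scaled to an integer weight.
        spare = 3 - (w["cpu"] + w["memory"] + w["threads"])
        return max(1, round(spare * 10))

    def next(self):
        # Classic smooth WRR: bump every running score by its static weight,
        # pick the max, then subtract the total so picks stay interleaved.
        total = 0
        for name in self.workers:
            self.current[name] += self.weight(name)
            total += self.weight(name)
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

lb = WeightedRoundRobin({
    "worker-1": {"cpu": 0.2, "memory": 0.3, "threads": 0.1},
    "worker-2": {"cpu": 0.9, "memory": 0.8, "threads": 0.7},
})
print([lb.next() for _ in range(5)])  # worker-1 appears far more often
```

Over a full cycle, each Worker receives tasks in exact proportion to its weight, so a loaded node naturally gets fewer tasks without ever being starved completely.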
Plugin-Based Architecture, Replace Anything You Want
DolphinScheduler’s plugin system is quite thorough:
- Task plugins: more than 30 built-in task types, and you can write your own plugins
- Alert plugins: email, DingTalk, WeCom, Feishu, Telegram; if not enough, implement the Alert Plugin interface yourself
- Data source plugins: MySQL, PostgreSQL, Hive, Spark SQL, ClickHouse… supporting hundreds of data sources
- Storage plugins: task logs and resource files can be stored locally, on HDFS, S3, or OSS
Want to switch an alert channel? Write a plugin, package it into a JAR, drop it in, restart the service—done.
No need to modify source code, and maintenance cost is low.
Flexible Deployment, One-Click Experience with Docker
The project officially supports four deployment methods:
- Standalone: single-machine mode, for development and testing, can run with one command
- Cluster: cluster mode, standard for production, manually deploy each component
- Docker: start a complete environment with one click, suitable for quick experience
- Kubernetes: deploy with Helm Chart, preferred for cloud-native teams
If you want to try quickly, just use Docker Compose:
```shell
docker-compose -f docker/docker-compose.yaml up -d
```
After the containers start, open your browser at:
http://localhost:12345/dolphinscheduler
Default account: admin / dolphinscheduler123
Drag a Shell task and try it—you can run a workflow in a few minutes.
For production deployment, it is recommended to have at least 3 Masters plus several Workers. Use MySQL master-slave or PostgreSQL for the database, and choose JDBC as the registry.
Highlights of Version 3.4.0
The 3.4.0 version released at the end of last year mainly optimized several points:
- Task priority queue: high-priority tasks can jump the queue instead of waiting
- Dynamic resource allocation: Workers can dynamically adjust thread pool size based on task type
- Workflow version management: DAG changes automatically save history versions, supporting one-click rollback
- Enhanced lineage analysis: visualization of upstream and downstream dependencies of data tables
The most practical one is the task priority queue. Previously, when inserting urgent tasks, you had to manually pause other tasks to free resources. Now you just assign a high priority label, and the scheduler will handle it automatically.
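The behavior is essentially a priority queue with FIFO tie-breaking, which a few lines of Python can illustrate (a conceptual sketch, not DolphinScheduler’s internal queue):

```python
import heapq
import itertools

# Min-heap plus a counter: higher priority pops first, and the counter
# preserves submission order among tasks with equal priority.
counter = itertools.count()
queue = []

def submit(task, priority=0):
    # heapq is a min-heap, so negate the priority to pop "high" first.
    heapq.heappush(queue, (-priority, next(counter), task))

def next_task():
    return heapq.heappop(queue)[2]

submit("nightly_etl")
submit("daily_report")
submit("urgent_backfill", priority=10)

print(next_task())  # urgent_backfill jumps the queue
```

The remaining tasks then come out in submission order, which is exactly the “jump the queue without pausing anything” behavior described above.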
What Kind of Teams Is It Suitable For?
Having listed so many advantages, it’s only fair to discuss where it actually fits.
Suitable teams for DolphinScheduler:
- Data teams with fewer than 10 people and limited operational resources
- Tasks mainly based on offline batch processing, such as ETL, data synchronization, reporting scheduling
- Need for a low-code platform so that analysts and business users can configure workflows
- Already using MySQL/PostgreSQL and do not want to deploy ZooKeeper
Not very suitable scenarios:
- Mainly real-time streaming tasks (although Flink is supported, scheduling granularity is still batch-oriented)
- Heavy reliance on Python ecosystem with highly customized workflow logic (Airflow is more flexible)
- Extremely large task volume with tens of thousands of concurrent scheduling tasks
Final Thoughts
Overall, DolphinScheduler’s positioning is a user-friendly, stable, and lightweight data scheduling platform.
It doesn’t have as many fancy features as Airflow, but all the core capabilities are there, and the maintenance cost is much lower.
After our team migrated from Airflow to DolphinScheduler, the cluster size was reduced from 5 nodes to 3 nodes, and operational manpower was cut by half.
Now data analysts can configure workflows themselves, and no longer need to urge me every day to write DAGs.
There is no absolute good or bad scheduling tool. The one that fits your team is the best.
If you are also looking for an alternative to Airflow, you might want to try DolphinScheduler—it might be exactly what you need.
