In the continuous evolution of data platforms, a common yet subtle trap is that teams unconsciously allow the scheduling system to take on responsibilities that do not belong to it: writing complex business logic in the scheduling layer, controlling computation parameters, or even attempting to centrally manage execution details across different computing engines.
In the short term this may seem to improve efficiency, but in the long run such a design makes the system tightly coupled and difficult to maintain, and it loses stability as scale increases.
Therefore, before discussing specific practices, we must first clarify one thing: the boundary between the scheduling system and data engines.
## Responsibilities and Boundaries Between the Scheduler and Data Engines
To understand how the entire system operates, it is helpful to remember a very core principle: the scheduling system is only responsible for “when to run” and “dependency relationships,” while “how to compute” must be left to execution engines such as Spark, Flink, or SeaTunnel.
In other words, DolphinScheduler is the orchestrator of workflows, not the executor of computation.
From an engineering perspective, this division of responsibilities can be clearly expressed in the following table:
| Component | Core Responsibility |
|---|---|
| DolphinScheduler | DAG orchestration, task scheduling, dependency management, failure retry |
| Spark | Offline batch processing |
| Flink | Real-time stream processing |
| SeaTunnel | Data integration (batch / streaming / CDC) |
In actual development, the place where this boundary is most easily broken is often the Shell task.
Many people are accustomed to writing complex branching logic in a single node, for example, deciding which Spark job to execute based on the date:
```shell
if [ "$day" == "2026-04-01" ]; then
  spark-submit job_a.py
else
  spark-submit job_b.py
fi
```
Although this approach “works,” it brings three problems: the logic is hidden inside the script and invisible to the DAG; dependency relationships are no longer explicit, which undermines the scheduler’s visualization; and the cost of maintenance and troubleshooting grows significantly over time.
A more reasonable approach is to explicitly model the branching logic in the workflow and control the execution path through conditional nodes, so that the entire process is visible and controllable in the UI.
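As an illustrative sketch (the node names are hypothetical), the branch above could be split into two single-purpose Shell tasks, with the date comparison expressed in the workflow itself, for example via a Switch node whose branch condition tests `${day}`:

```shell
# Node "run_job_a" (Switch condition ${day} == "2026-04-01" routes here):
spark-submit job_a.py

# Node "run_job_b" (the Switch node's default branch):
spark-submit job_b.py
```

Each task body stays trivially simple, and the branching decision becomes visible in the DAG instead of being buried in a script.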
## Differences in Scheduling Between Batch, Streaming, and CDC
With the boundaries clear, looking at how different types of tasks are scheduled reveals that they are essentially three completely different models, not simple variations of the same scheduling logic.
First is batch processing, which is the type of scenario that best fits the traditional scheduling model, such as T+1 tasks in a data warehouse or aggregation computations running hourly.
Such tasks have clear time windows and well-defined upstream and downstream dependencies, making them very suitable to be expressed through DAGs.
In practice, they are usually split into layers such as ODS, DWD, and DWS, with each layer corresponding to one or more independent tasks, and driven by parameters (such as ${biz_date}).
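As a minimal sketch of how such a date parameter might be derived (the variable name and date arithmetic are illustrative; in DolphinScheduler the value is usually injected by the scheduler through a built-in time parameter rather than computed by hand in the script):

```shell
# Compute yesterday's date as the T+1 business date (GNU date).
# In practice ${biz_date} would arrive as a global parameter from the
# scheduler, e.g. via a built-in time expression, not from this script.
biz_date=$(date -d "yesterday" +%Y-%m-%d)
echo "processing partition dt=${biz_date}"
```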
For example, a typical Spark submission method is as follows:
```shell
spark-submit \
  --class com.example.ETLJob \
  --master yarn \
  --deploy-mode cluster \
  etl-job.jar \
  --date ${biz_date}
```
In this process, the responsibility of the scheduling system is to connect task relationships, control execution order, and handle failure retries, rather than diving into the specific computation logic.
In contrast to batch processing, streaming tasks are fundamentally “continuously running,” rather than “periodically triggered.”
If a scheduling system is used to start a Flink job every few minutes, it is essentially solving the problem in the wrong way.
A well-designed streaming task should rely on Flink’s own state management and checkpoint mechanism to run continuously, while DolphinScheduler plays more of a “guardian” role, responsible for initial startup, status detection, and exception recovery, rather than frequent intervention.
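A hedged sketch of what such a “guardian” task could look like (the job name is an assumption, and the real cluster query is shown only in a comment; the stub below fakes the cluster response so the logic is self-contained):

```shell
# Scheduled "guardian" check: fail the task when the streaming job is
# absent, so DolphinScheduler's failure alerting/retry can react.
# JOB_NAME and the yarn command are illustrative assumptions.
JOB_NAME="realtime-metrics"

list_running_jobs() {
  # Stub for illustration; in production replace with something like:
  #   yarn application -list -appStates RUNNING
  echo "application_0001  realtime-metrics  RUNNING"
}

job_is_running() {
  list_running_jobs | grep -q "$JOB_NAME"
}

if job_is_running; then
  echo "flink job '$JOB_NAME' is healthy"
else
  echo "flink job '$JOB_NAME' not found; failing to trigger recovery" >&2
  exit 1
fi
```

Run on a modest interval, this keeps the scheduler in a supervisory role: it detects and reacts, but never restarts the job every few minutes as a matter of course.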
Looking further at CDC scenarios, it is essentially also a type of streaming processing, but more oriented toward data integration, which is exactly a typical application scenario of SeaTunnel.
Through SeaTunnel, it is very convenient to implement real-time synchronization from databases to message queues, for example, from MySQL to Kafka:
```hocon
env {
  execution.parallelism = 2
}

source {
  MySQL-CDC {
    hostname = "localhost"
    port = 3306
    username = "root"
    password = "123456"
    database-names = ["test_db"]
    table-names = ["test_db.user"]
  }
}

sink {
  Kafka {
    topic = "user_cdc"
    bootstrap.servers = "localhost:9092"
  }
}
```
The corresponding startup command is as follows:
```shell
./bin/seatunnel.sh \
  --config config/mysql_cdc.conf \
  -e local
```
At the scheduling level, the principle of CDC is consistent with streaming processing: start once, run continuously, and ensure stability through status detection mechanisms, rather than repeatedly triggering through periodic scheduling.
From this perspective, the core difference between batch processing, streaming processing, and CDC actually lies in whether it needs to be repeatedly scheduled.
## Why the Scheduling System Should Not Intrude into the Execution Engine
As the system gradually scales, a deeper question will emerge: why do we repeatedly emphasize that the scheduling system should remain “restrained”?
The reason is that once the scheduling system begins to intrude into the responsibility scope of the execution engine, the controllability of the entire architecture will rapidly decline.
For example, directly writing Spark resource parameters in the scheduling script:
```shell
spark-submit \
  --executor-memory 8G \
  --conf spark.sql.shuffle.partitions=500 \
  job.sql
```
The problem with this approach is that it hardcodes execution-layer configurations into the scheduling layer, making parameter management scattered and difficult to unify.
Once resource configurations need to be adjusted, the scheduling task must be modified, or even the workflow must be redeployed.
A more reasonable approach is to place these parameters in the Spark configuration center or manage them within the job itself, allowing DolphinScheduler to only be responsible for triggering execution:
```shell
spark-submit job.sql
```
This decoupling approach can significantly improve system maintainability, allowing each layer to focus on its own responsibilities.
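One concrete way to push those settings down (a sketch; the file name and location are illustrative) is to keep them in a properties file owned and versioned with the Spark job, and point spark-submit at it via its standard `--properties-file` option:

```shell
# conf/spark-job.properties (lives with the job, not the scheduler):
#   spark.executor.memory          8g
#   spark.sql.shuffle.partitions   500

spark-submit --properties-file conf/spark-job.properties job.sql
```

Tuning resource parameters then becomes a change on the Spark side only; the scheduling task and the workflow definition stay untouched.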
From an overall architectural perspective, a mature data platform can usually be abstracted into a three-layer structure: the top layer is the scheduling layer represented by DolphinScheduler, responsible for workflow orchestration; the middle layer is the execution layer represented by Spark, Flink, and SeaTunnel, responsible for specific computation and data processing; and the bottom layer is the resource layer such as YARN or Kubernetes, responsible for resource allocation and isolation.
Only when the boundaries of these three layers are clear can the system maintain stability as complexity increases.
## A Practical Architecture Example Integrating SeaTunnel
In real production environments, this layered thinking is usually reflected in complete data pipelines.
For example, SeaTunnel can be used to implement CDC from MySQL to Kafka to synchronize real-time data; then Flink performs real-time computation to produce online metrics; at the same time, the data is landed into storage systems, and then Spark completes offline data warehouse processing.
In this process, DolphinScheduler is responsible for unified orchestration of these tasks, including starting CDC, monitoring streaming tasks, and scheduling offline computations.
From a process perspective, it can be abstracted into a clear data link: data enters from the source, goes through SeaTunnel into the real-time channel, is processed by Flink to serve online systems, is simultaneously written into storage, and then processed by Spark for layered transformation, while DolphinScheduler always acts as the “central hub,” coordinating execution order and dependency relationships across all stages.
## Summary: Let the System Return to “Each Doing Its Own Job”
Returning to the original question, the design principle of the entire system can actually be summarized in one sentence: DolphinScheduler is the “brain,” while Spark, Flink, and SeaTunnel are the “muscles.”
The scheduling system is responsible for decision-making and orchestration, while the execution engines are responsible for specific computation and processing.
In practical implementation, it can be further summarized into three simple but very critical principles: first, all process logic must be reflected in the DAG, rather than hidden in scripts; second, all computation logic must be pushed down into the execution engines to avoid expansion of the scheduling layer; third, streaming processing and CDC tasks must be designed based on “long-running” operation, rather than being scheduled repeatedly in a batch-processing manner.
When these three points are strictly followed, the data platform can evolve from “just able to run” to “stable, scalable, and governable,” which is also a key step from engineering to systematic architecture.
Previous articles:
- Part 1 | Scheduling Systems Are More Than Just “Timers”
- Part 2 | The Core Abstraction Model of Apache DolphinScheduler
- Part 4 | The State Machine: The Real Soul of Scheduling Systems
- Part 7 | Where Scheduling Systems Really Break and the Hidden Bottlenecks Beyond CPU and Scale
Next: From Scheduling to DataOps: DolphinScheduler as the Control Plane