This article is Part 2 of the series Deep Dive into Apache DolphinScheduler: From Scheduling Principles to DataOps in Practice.
From the perspective of source code and scheduling models, it analyzes DolphinScheduler’s core abstractions, focusing on the responsibility boundaries between Workflow, TaskDefinition, and instance-level objects.
Using DAG illustrations, it explains how the scheduler drives complex orchestration through dependency evaluation.
A Scheduler Is More Than Just a “Timer”
After working with DolphinScheduler for some time, many users start asking the same question:
Why does the system contain so many objects at the same time — workflow definitions, workflow instances, task definitions, task instances?
Isn’t this over-engineered?
If you look at the source code and how the scheduler actually runs, the answer is exactly the opposite:
These abstractions are intentionally separated to contain complexity.
Workflow: A DAG Blueprint That Never “Runs”
In DolphinScheduler’s design, a Workflow (mapped to ProcessDefinition in the source code) is defined from the very beginning as a purely static structure.
It describes only a limited set of information:
which tasks exist in the process, how they depend on each other, and whether there are conditional branches or sub-workflows.
Together, these form a DAG — but this DAG never executes by itself.
From a code perspective, a Workflow is closer to a structured configuration object than to a scheduling entity.
In the database, it records no execution results:
no success or failure status, no start time, no end time.
It has no awareness of what happens during any specific run.
Behind this lies a crucial design principle:
Structure and execution must be completely separated; otherwise, state will pollute the definition.
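To make this separation concrete, here is a minimal sketch of what a definition-side object could look like under this model; the field set is illustrative, not the exact ProcessDefinition from the code base:

public class ProcessDefinitionSketch {   // illustrative stand-in for ProcessDefinition
    private Long id;
    private String name;
    private String dagStructure;   // tasks and dependency edges, stored as pure structure
    private String globalParams;   // default parameters handed to every run
    // Note: no state, no startTime, no endTime; a definition records nothing about runs
}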
What the DAG Actually Solves in DolphinScheduler
In DolphinScheduler, the DAG has a single, very focused responsibility:
determining whether a task is eligible to be scheduled at a given moment.
It does not care how a task is executed, nor about the business meaning of execution results.
It only evaluates whether dependencies are satisfied.
📌 (Figure: a DAG with a typical multi-parent dependency structure, where one node depends on several upstream tasks.)
Whether a node can be scheduled does not depend on execution order, but on whether all upstream dependencies have completed.
This is the core runtime logic by which DolphinScheduler dynamically evaluates the DAG.
In the source code, the DAG is parsed into an in-memory structure when a process instance starts, and this structure drives all subsequent scheduling decisions.
When the state of a TaskInstance changes, the scheduler does not simply “move forward”.
Instead, it re-evaluates the DAG to determine which nodes have now been unlocked.
This is why DolphinScheduler naturally supports parallel execution, conditional branches, and failure blocking.
These capabilities are not hard-coded features — they are natural outcomes of DAG reasoning.
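As a rough, hypothetical sketch of this re-evaluation loop (none of the class or method names below come from the real code base):

import java.util.*;

// Hypothetical sketch: an in-memory DAG that is re-evaluated whenever a
// task instance changes state, to find nodes whose upstream tasks are all done.
public class DagRuntimeSketch {
    private final Map<String, Set<String>> upstream;           // node -> its parent nodes (empty set for roots)
    private final Map<String, String> state = new HashMap<>(); // node -> "WAITING" / "RUNNING" / "SUCCESS" / "FAILURE"

    public DagRuntimeSketch(Map<String, Set<String>> upstream) {
        this.upstream = upstream;
        upstream.keySet().forEach(node -> state.put(node, "WAITING"));
    }

    // A node is eligible only if it has not started yet and every upstream node succeeded.
    private boolean eligible(String node) {
        return "WAITING".equals(state.get(node))
                && upstream.get(node).stream().allMatch(p -> "SUCCESS".equals(state.get(p)));
    }

    // Called on every task-instance state change: the scheduler does not simply
    // "move forward"; it re-checks the whole DAG for newly unlocked nodes.
    public List<String> onStateChange(String node, String newState) {
        state.put(node, newState);
        List<String> unlocked = new ArrayList<>();
        for (String candidate : upstream.keySet()) {
            if (eligible(candidate)) {
                unlocked.add(candidate);
            }
        }
        return unlocked;
    }
}

In this shape, parallel execution, conditional skips, and failure blocking come out of how eligible() interprets upstream states rather than from special-cased control flow.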
TaskDefinition: The Execution Template of a Task
If a Workflow is the blueprint of a process, then a TaskDefinition is the template for a single task.
In the source code, TaskDefinition stores everything about how a task should be executed, including:
- Task type (Shell, SQL, Spark, Flink, etc.)
- Parameters and script content
- Failure strategy, timeout configuration, resource settings
One point is absolutely critical:
TaskDefinition is completely stateless.
You will never see fields like “execution success” in a TaskDefinition, because semantically, execution results do not belong to a definition.
This is very clear in the code (illustrative example):
public class TaskDefinition {
    private Long id;
    private String name;
    private TaskType taskType;
    private String taskParams;
    private int timeout;
    private int failRetryTimes;
    // Note: no execution state fields here
}
The sole responsibility of TaskDefinition is to describe how to run, not how it ran.
Process Definition vs. Process Instance: The Real Boundary
Understanding DolphinScheduler requires a clear distinction between definitions and instances.
When a Workflow is actually triggered, the system creates a full set of runtime objects based on the Workflow and TaskDefinitions:
- ProcessInstance
- TaskInstance
A ProcessInstance represents:
“This specific execution of a workflow.”
A TaskInstance represents:
“This specific execution of a task.”
All the states you see in the UI — running, failed, retried, logs — exist entirely at the instance level, not at the definition level.
From the source code, the boundary is explicit:
public class ProcessInstance {
    private Long id;
    private Long processDefinitionId;
    private ExecutionStatus state;
    private Date startTime;
    private Date endTime;
}

public class TaskInstance {
    private Long id;
    private Long taskDefinitionId;
    private ExecutionStatus state;
    private int retryTimes;
    private Date startTime;
}
Definitions are reusable; instances are ephemeral.
This separation is fundamental to long-term stability in a scheduling system.
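To show where that boundary sits at trigger time, here is a hedged sketch of the moment a definition becomes an instance; the stand-in classes below are simplified, not the scheduler's actual factory code:

import java.util.*;

// Hypothetical sketch of what happens when a workflow is triggered:
// fresh instance objects are created that reference their definitions,
// and all mutable state lives on those instances.
public class TriggerSketch {

    static class ProcessInstanceRow {                  // simplified stand-in for ProcessInstance
        long processDefinitionId;
        String state;
        Date startTime;
        List<TaskInstanceRow> tasks = new ArrayList<>();
    }

    static class TaskInstanceRow {                     // simplified stand-in for TaskInstance
        long taskDefinitionId;
        String state;
        int retryTimes;
    }

    static ProcessInstanceRow trigger(long processDefinitionId, List<Long> taskDefinitionIds) {
        ProcessInstanceRow run = new ProcessInstanceRow();
        run.processDefinitionId = processDefinitionId; // points back at the reusable blueprint
        run.state = "RUNNING";
        run.startTime = new Date();

        for (long taskDefinitionId : taskDefinitionIds) {
            TaskInstanceRow task = new TaskInstanceRow();
            task.taskDefinitionId = taskDefinitionId;  // points back at the reusable template
            task.state = "WAITING";
            task.retryTimes = 0;                       // retries mutate only this object
            run.tasks.add(task);
        }
        return run;                                    // the definition objects are untouched
    }
}

Re-running or deleting such an instance touches only instance objects, which is why the same Workflow can run many times in parallel without interference.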
How These Abstractions Enable Complex Orchestration
As task counts grow, workflows nest, and failures become routine, systems without clear abstraction boundaries quickly lose control.
DolphinScheduler’s model separation enables several critical capabilities:
- The same Workflow can run multiple instances concurrently without interference
- Retries affect only TaskInstances, never polluting definitions
- DAG evaluation and task execution are fully decoupled
- Scheduling logic revolves around state transitions, not business logic
From this perspective, DolphinScheduler is not “managing tasks” —
it is managing the evolution of state and dependencies.
Summary
If you treat DolphinScheduler as a “more powerful Cron”, these models may look overly complex.
But viewed from a system and source-code perspective, they form a highly disciplined, deeply engineered design.
In the next article, we’ll continue along this model and explore:
how the scheduler operates around state transitions — and how failures are absorbed rather than amplified.

