DEV Community

Chen Debra
Chen Debra

Posted on

Deep Dive into Apache DolphinScheduler Database Schema

This article provides an in-depth look at the database schema used by Apache DolphinScheduler, which primarily persists workflow definitions, execution states, scheduling information, and system metadata. It supports a wide range of databases including MySQL, PostgreSQL, and H2. The schema definitions are located in the dolphinscheduler-dao/src/main/resources/sql directory.

Schema Architecture

The DolphinScheduler database schema is organized into seven major functional groups:

Group Purpose Key Tables
Workflow Management Stores workflow and task definitions with version control t_ds_workflow_definition, t_ds_task_definition, t_ds_workflow_task_relation
Execution State Tracks runtime instances and their states t_ds_workflow_instance, t_ds_task_instance, t_ds_command
Scheduling Manages cron-based schedules via Quartz t_ds_schedules, QRTZ_* tables
Resource Management Metadata for data sources, files, and UDFs t_ds_datasource, t_ds_resources, t_ds_udfs
Administration Users, tenants, projects, and permissions t_ds_user, t_ds_tenant, t_ds_project
Alerts Alert configurations and history t_ds_alert, t_ds_alertgroup
Service Registry JDBC-based coordination (alternative to ZooKeeper) t_ds_jdbc_registry_* tables

Workflow and Task Definition Model

Separation of Definition and Instance

DolphinScheduler strictly separates definitions (templates) from instances (executions), enabling version control, concurrent executions, and audit tracking.

Key Design Principles:

  • Code-based Identifier: Workflows and tasks use a bigint code as a stable cross-version identifier.
  • Composite Key: Definitions use a composite natural key (code, version).
  • Immutable Versions: Each version is immutable; modifications create a new version.
  • Instance Reference: Instances reference a specific version of the definition.

Core Tables Reference

Workflow Definition Table

t_ds_workflow_definition

Primary table for workflow templates.

Column Type Description
id int Auto-increment primary key
code bigint Unique workflow identifier (stable across versions)
version int Version number (default 1)
name varchar(255) Workflow name
project_code bigint Associated project
release_state tinyint 0 = offline, 1 = online
global_params text Global parameters in JSON format
execution_type tinyint 0 = parallel, 1 = serial wait, 2 = serial discard, 3 = serial priority
timeout int Timeout in minutes
user_id int Creator user ID

Indexes:

  • UNIQUE KEY workflow_unique (name, project_code)
  • UNIQUE KEY uniq_workflow_definition_code (code)
  • KEY idx_project_code (project_code)

t_ds_workflow_definition_log

Stores audit logs for all workflow definition versions.
Mirrors t_ds_workflow_definition structure, with additional columns: operator, operate_time. Primary key: (code, version).

t_ds_task_definition

Reusable task template within workflows.

Column Type Description
code bigint Unique task identifier
version int Version number
task_type varchar(50) Shell, SQL, Python, Spark, etc.
task_params longtext Task configuration in JSON
worker_group varchar(255) Target worker group
fail_retry_times int Retry attempts on failure
fail_retry_interval int Retry interval (minutes)
timeout int Task timeout (minutes)
cpu_quota int CPU limit (-1 = unlimited)
memory_max int Memory limit (MB, -1 = unlimited)

t_ds_workflow_task_relation

Defines DAG structure by specifying edges between tasks.

Column Type Description
workflow_definition_code bigint Parent workflow
workflow_definition_version int Workflow version
pre_task_code bigint Predecessor task (0 for root)
post_task_code bigint Successor task
condition_type tinyint 0 = none, 1 = conditional, 2 = delay
condition_params text Condition configuration in JSON

Note: pre_task_code = 0 indicates a root node (no predecessor).

Execution State Tables

t_ds_workflow_instance

Runtime execution record of workflows.

Column Type Description
id int Primary key
workflow_definition_code bigint References definition
workflow_definition_version int Locked version for this run
state tinyint 0=SUBMITTED,1=RUNNING,2=PAUSING_PREP,3=PAUSED,4=STOPPING_PREP,5=STOPPED,6=FAILURE,7=SUCCESS,8=NEED_FAULT_TOLERANCE,9=TERMINATED,10=WAITING,11=WAITING_DEP
state_history text State transition log
start_time datetime Execution start
end_time datetime Execution end
command_type tinyint 0=START,1=START_CURRENT,2=RECOVER,3=RECOVER_PAUSE,4=START_FAILURE,5=SUPPLEMENT,6=SCHEDULE,7=RE_RUN,8=PAUSE,9=STOP,10=RESUME_WAIT
host varchar(135) Main server executing workflow
executor_id int User triggering execution
tenant_code varchar(64) Tenant for resource isolation
next_workflow_instance_id int For serial execution mode

Indexes:

  • KEY workflow_instance_index (workflow_definition_code, id)
  • KEY start_time_index (start_time, end_time)

t_ds_task_instance

Runtime record for a single task.

Column Type Description
id int Primary key
task_code bigint References task definition
task_definition_version int Locked version
workflow_instance_id int Parent workflow instance
state tinyint Same as workflow_instance
submit_time datetime Queued time
start_time datetime Actual start time
end_time datetime End time
host varchar(135) Worker host executing task
execute_path varchar(200) Working directory on worker
log_path text Log file path
retry_times int Current retry count
var_pool text Variables for downstream tasks

Index: KEY idx_task_instance_code_version (task_code, task_definition_version)

Command Model and Workflow Execution

Command Queue

The t_ds_command table implements a queue-based execution model, where commands trigger workflow instances.

t_ds_command Structure

Column Type Description
command_type tinyint 0=START,1=START_CURRENT,2=RECOVER,3=RECOVER_PAUSE,4=START_FAILURE,5=SUPPLEMENT,6=SCHEDULE,7=RE_RUN,8=PAUSE,9=STOP
workflow_definition_code bigint Target workflow
workflow_instance_id int For resume/re-run operations
workflow_instance_priority int 0=highest,1=high,2=medium,3=low,4=lowest
command_param text Execution parameters in JSON
worker_group varchar(255) Target worker group
tenant_code varchar(64) Tenant executing command
dry_run tinyint 0=normal,1=dry run (no actual execution)

Process Flow:

  1. Commands are inserted via API, scheduler, or retry logic into t_ds_command.
  2. MasterSchedulerThread continuously scans the table (sorted by priority and id).
  3. Master creates t_ds_workflow_instance records.
  4. DAG is analyzed, and t_ds_task_instance records are created for ready tasks.
  5. Successfully processed commands are removed; failed commands move to t_ds_error_command.

Version Control System

Code-based Version Model

DolphinScheduler supports:

  • Concurrent execution of different versions.
  • Safe updates without affecting running instances.
  • Full change audit tracking.

Version Management Rules

  • Current version table: Only the “current” version exists in t_ds_workflow_definition and t_ds_task_definition.
  • Log tables: All versions are saved in *_log tables with UNIQUE KEY (code, version).
  • Online state: Only one version per code has release_state = 1 (online).
  • Instance lock: Workflow instances lock to a specific version upon creation.
  • Immutable versions: Once referenced by an instance, logs are immutable.

Scheduling Architecture

Quartz Integration

DolphinScheduler integrates Quartz for cron-based scheduling. The schema includes standard Quartz tables and a mapping table.

t_ds_schedules

Column Type Description
workflow_definition_code bigint Target workflow (unique)
start_time datetime Schedule start time
end_time datetime Schedule end time
timezone_id varchar(40) Timezone for cron expression
crontab varchar(255) Cron expression
release_state int 0=offline,1=online
failure_strategy int Behavior on failure
workflow_instance_priority int Default instance priority

Quartz Table Notes:

  • QRTZ_TRIGGERS.NEXT_FIRE_TIME: Indexed for efficient scanning.
  • QRTZ_CRON_TRIGGERS.CRON_EXPRESSION: Parsed cron definition.
  • QRTZ_SCHEDULER_STATE: Tracks Quartz scheduler instance.

Resource and Configuration Tables

Data Source Management

t_ds_datasource

Stores database connection configs for SQL tasks.

Column Type Description
name varchar(64) Data source name
type tinyint Database type (MySQL, PostgreSQL, Hive, etc.)
connection_params text JSON configuration (host, port, db, credentials)
user_id int Owner user

Constraint: UNIQUE KEY (name, type) prevents duplicate sources.

File Resources

t_ds_resources (Deprecated)

Note: Deprecated; resource metadata is migrating to a separate backend.

Column Type Description
full_name varchar(128) Full path including tenant
type int File type (file/UDF)
size bigint File size (bytes)
is_directory boolean Directory flag
pid int Parent directory ID

Multi-Tenant and Administration

Project, User, and Tenant Hierarchy

Key Admin Tables

t_ds_tenant

Column Type Description
tenant_code varchar(64) Unique tenant identifier
queue_id int Default YARN queue for tasks
description varchar(255) Tenant description

Default tenant: System creates id=-1, tenant_code='default'.

t_ds_user

Column Type Description
user_name varchar(64) Login username (unique)
user_password varchar(64) Hashed password
user_type tinyint 0=normal,1=admin
tenant_id int Associated tenant (default -1)
email varchar(64) Email address
state tinyint 0=disabled,1=enabled

t_ds_project

Column Type Description
code bigint Unique project code
name varchar(255) Project name (unique)
user_id int Creator/owner
description varchar(255) Project description

JDBC Registry

For deployments without ZooKeeper, DolphinScheduler provides a JDBC-based registry for service coordination.

t_ds_jdbc_registry_data

Stores registry entries similar to ZooKeeper nodes.

Column Type Description
data_key varchar(256) Path-like key (unique)
data_value text Serialized data
data_type varchar(64) EPHEMERAL or PERSISTENT
client_id bigint Associated client
last_update_time timestamp Last modification

t_ds_jdbc_registry_lock

Implements distributed locks.

Column Type Description
lock_key varchar(256) Lock identifier (unique)
lock_owner varchar(256) Client holding lock (ip_processId)
client_id bigint Associated client

t_ds_jdbc_registry_client_heartbeat

Tracks active clients to clean ephemeral data.

Column Type Description
id bigint Client ID (primary key)
client_name varchar(256) Client identifier
last_heartbeat_time bigint Last heartbeat timestamp
connection_config text Connection metadata

Cleanup: Expired heartbeats automatically remove temporary registry entries and locks.

Alert System

Alert Tables

t_ds_alert

Alerts generated on workflow/task failure or completion.

Column Type Description
title varchar(512) Alert title
sign char(40) SHA1 hash for deduplication
content text Alert message body
alert_status tinyint 0=waiting,1=success,2=failure
warning_type tinyint 1=workflow success,2=workflow/task failure
workflow_instance_id int Source workflow instance
alertgroup_id int Target alert group

Index: KEY idx_sign (sign) for deduplication.

t_ds_alertgroup

Alert channel group.

Column Type Description
group_name varchar(255) Unique group name
alert_instance_ids varchar(255) Comma-separated plugin instance IDs
description varchar(255) Group description

Indexes and Query Optimization

Key Indexes

Schema includes indexes for common query patterns:

  • Workflow and Task Lookup:
  - Workflow instances by definition:
  `KEY workflow_instance_index (workflow_definition_code, id)`
  - Task instances by definition:
  `KEY idx_task_instance_code_version (task_code, task_definition_version)`
  - Time range queries for monitoring:
  `KEY start_time_index (start_time, end_time)`
Enter fullscreen mode Exit fullscreen mode
  • Command Processing:
Priority-based command scan:
`KEY priority_id_index (workflow_instance_priority, id)`
Enter fullscreen mode Exit fullscreen mode
  • DAG Relation Queries:
  - Forward and backward DAG traversal:
  `KEY idx_pre_task_code_version (pre_task_code, pre_task_version)`
  `KEY idx_post_task_code_version (post_task_code, post_task_version)`
  `KEY idx_code (project_code, workflow_definition_code)`
Enter fullscreen mode Exit fullscreen mode

Unique Constraints

Database-level enforcement of key business rules:

Table Constraint Purpose
t_ds_workflow_definition UNIQUE (name, project_code) No duplicate workflow names per project
t_ds_workflow_definition UNIQUE (code) Global workflow identifier
t_ds_workflow_definition_log UNIQUE (code, version) One record per version
t_ds_datasource UNIQUE (name, type) No duplicate data sources per type
t_ds_schedules UNIQUE (workflow_definition_code) One schedule per workflow

Schema Evolution and Upgrades

DolphinScheduler maintains upgrade scripts in dolphinscheduler-dao/src/main/resources/sql/upgrade for cross-version schema migration.

Recent Schema Changes

Version 3.3.0

  • Renamed tables/columns from “process” to “workflow”.
  • Removed data quality tables (t_ds_dq_*).
  • Added JDBC registry to replace ZooKeeper.
  • Removed cache-related columns from task tables.

Version 3.2.0

  • Added execution_type to workflow definition (parallel/serial mode).
  • Added next_workflow_instance_id for serial execution chain.
  • Added tenant_code to command and instance tables.
  • Created t_ds_project_parameter and t_ds_project_preference.

Database Interaction Patterns

Service Layer Access

Database access is abstracted via DAO layer in dolphinscheduler-dao.
Key Service Classes:

  • ProcessService: CRUD for workflow/task definitions and instances.
  • CommandService: Command queue management.
  • ProjectService: Project and permission management.
  • ResourcesService: Resource metadata operations.

Transaction Management

Most operations use Spring @Transactional for atomicity:

  • Create workflow instance and task instances atomically.
  • Consume commands and generate instances.
  • Synchronize version updates with log tables.

Connection Pooling

System uses HikariCP configured in

application.yaml:

  • Default pool size: 50 connections.
  • Connection timeout: 30 seconds.
  • Idle timeout: 600 seconds.

Top comments (0)