DEV Community

Chen Debra


In-Depth Look at the Apache DolphinScheduler Storage System

The Storage System in Apache DolphinScheduler provides a unified interface for storing and retrieving files across various storage backends. It enables resource management for workflows and tasks, allowing users to upload files such as scripts, JAR files, configuration files, and other artifacts that can be used in task execution. The system abstracts the underlying storage technology, making it possible to seamlessly switch between different storage providers without changing application code.

Architecture Overview

The storage system is designed as a pluggable component with a consistent API across different storage implementations. This architecture allows DolphinScheduler to work with multiple storage backends while maintaining a unified interface for resource operations.
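The idea of one API over several backends can be sketched as follows. This is an illustration only: `StorageOperator`, `LocalStorage`, `InMemoryStorage`, and `stage_script` are hypothetical names chosen for this sketch, not DolphinScheduler's Java classes, and the in-memory class merely stands in for a remote backend such as HDFS or S3.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageOperator(ABC):
    """Hypothetical unified interface (not DolphinScheduler's actual API)."""

    @abstractmethod
    def upload(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def download(self, name: str) -> bytes: ...

    @abstractmethod
    def exists(self, name: str) -> bool: ...


class LocalStorage(StorageOperator):
    """Stores each resource as a file under a base directory."""

    def __init__(self, base_path: str) -> None:
        self.base = Path(base_path)
        self.base.mkdir(parents=True, exist_ok=True)

    def upload(self, name: str, data: bytes) -> None:
        (self.base / name).write_bytes(data)

    def download(self, name: str) -> bytes:
        return (self.base / name).read_bytes()

    def exists(self, name: str) -> bool:
        return (self.base / name).is_file()


class InMemoryStorage(StorageOperator):
    """Stand-in for a remote backend; keeps resources in a dict."""

    def __init__(self) -> None:
        self.blobs = {}

    def upload(self, name: str, data: bytes) -> None:
        self.blobs[name] = data

    def download(self, name: str) -> bytes:
        return self.blobs[name]

    def exists(self, name: str) -> bool:
        return name in self.blobs


def stage_script(storage: StorageOperator, name: str, body: str) -> bytes:
    """Caller code depends only on the interface, never on a backend."""
    storage.upload(name, body.encode("utf-8"))
    return storage.download(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(stage_script(LocalStorage(tmp), "job.sh", "echo hi"))
    print(stage_script(InMemoryStorage(), "job.sh", "echo hi"))
```

Because `stage_script` only sees the abstract interface, swapping the backend is a matter of constructing a different operator.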


Sources:

  • dolphinscheduler-storage-plugin/dolphinscheduler-storage-api/src/main/java/org/apache/dolphinscheduler/plugin/storage/api/StorageType.java 22-62
  • dolphinscheduler-common/src/main/resources/common.properties 24-33

Supported Storage Types

DolphinScheduler supports the following storage backends:

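  • LOCAL: local file system
  • HDFS: Hadoop Distributed File System
  • S3: Amazon S3 or S3-compatible storage
  • OSS: Alibaba Cloud OSS
  • GCS: Google Cloud Storage
  • ABS: Azure Blob Storage
  • OBS: Huawei Cloud OBS
  • COS: Tencent Cloud COS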

Sources:

  • dolphinscheduler-storage-plugin/dolphinscheduler-storage-api/src/main/java/org/apache/dolphinscheduler/plugin/storage/api/StorageType.java 22-36
  • docs/docs/en/guide/resource/configuration.md 1-7

Plugin Architecture

The storage functionality is implemented using a plugin architecture that allows for easy extension and maintenance.
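One common way to build such a plugin mechanism is a registry that maps a storage type name to a factory. The sketch below illustrates that general pattern in Python; `register`, `create_operator`, `LocalPlugin`, and `HdfsPlugin` are hypothetical names, and the real project wires its plugins through Java modules rather than this code.

```python
from typing import Callable, Dict

# Hypothetical registry: storage type name -> factory taking the config dict.
_FACTORIES: Dict[str, Callable[[dict], object]] = {}


def register(storage_type: str):
    """Decorator that registers a plugin class under a storage type name."""
    def wrap(factory):
        _FACTORIES[storage_type.upper()] = factory
        return factory
    return wrap


@register("LOCAL")
class LocalPlugin:
    def __init__(self, conf: dict) -> None:
        self.base_path = conf.get(
            "resource.storage.upload.base.path", "/tmp/dolphinscheduler"
        )


@register("HDFS")
class HdfsPlugin:
    def __init__(self, conf: dict) -> None:
        self.default_fs = conf["resource.hdfs.fs.defaultFS"]


def create_operator(conf: dict):
    """Look up the plugin for the configured type and instantiate it."""
    storage_type = conf.get("resource.storage.type", "LOCAL").upper()
    try:
        factory = _FACTORIES[storage_type]
    except KeyError:
        raise ValueError(
            f"no storage plugin registered for {storage_type!r}"
        ) from None
    return factory(conf)
```

Adding a backend in this pattern means adding one registered class; the lookup code does not change.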


Sources:

  • dolphinscheduler-storage-plugin/pom.xml 30-39
  • dolphinscheduler-storage-plugin/dolphinscheduler-storage-all/pom.xml 29-54

Configuration

The storage system is configured through the common.properties file. Different storage backends require different configuration parameters.

Basic Configuration

# Storage type: LOCAL, HDFS, S3, OSS, GCS, ABS, OBS, COS
resource.storage.type=LOCAL

# Base path for resource storage
resource.storage.upload.base.path=/tmp/dolphinscheduler

Sources:

  • dolphinscheduler-common/src/main/resources/common.properties 24-27

Configuration Flow

When DolphinScheduler starts, it loads the storage configuration from common.properties, reads resource.storage.type, and initializes the storage operator for that type.
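The load-then-select step can be approximated in a few lines. The sketch below is a simplified stand-in for that flow, not the logic in PropertyUtils: `parse_properties` and `resolve_storage_type` are hypothetical helpers, the parser handles only plain `key=value` lines and `#` comments (the full Java properties format allows more), and falling back to LOCAL when the key is absent is an assumption based on LOCAL being the default.

```python
# Storage types listed in the configuration comment above.
KNOWN_TYPES = {"LOCAL", "HDFS", "S3", "OSS", "GCS", "ABS", "OBS", "COS"}


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def resolve_storage_type(props: dict) -> str:
    """Return the configured storage type, validated against KNOWN_TYPES."""
    storage_type = props.get("resource.storage.type", "LOCAL").upper()
    if storage_type not in KNOWN_TYPES:
        raise ValueError(f"unsupported storage type: {storage_type!r}")
    return storage_type


if __name__ == "__main__":
    sample = "resource.storage.type=LOCAL\n"
    print(resolve_storage_type(parse_properties(sample)))
```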

Sources:

  • dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/PropertyUtils.java 49-60

Storage Type-Specific Configuration

Local Storage

The Local Storage option stores files on the local file system of the machine where DolphinScheduler is running. This is the default configuration.

resource.storage.type=LOCAL
resource.storage.upload.base.path=/tmp/dolphinscheduler

Note: When using LOCAL storage type with multiple DolphinScheduler nodes, each node has its own local file system. This means resources uploaded on one node are not automatically available on other nodes unless you use a shared file system.

Sources:

  • dolphinscheduler-common/src/main/resources/common.properties 24-27
  • docs/docs/en/guide/resource/configuration.md 10-28

HDFS Storage

For HDFS storage, additional configuration is required:

resource.storage.type=HDFS
resource.hdfs.fs.defaultFS=hdfs://namenode:8020
resource.hdfs.root.user=hdfs

If the HDFS cluster uses Kerberos authentication, the Kerberos settings in common.properties must be configured as well.
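A small pre-flight check can catch a missing HDFS setting before the services start. The sketch below is an assumption-laden illustration: `REQUIRED_KEYS` and `missing_keys` are hypothetical, and the list of required keys is taken only from the HDFS example above, not from an authoritative schema.

```python
# Hypothetical mapping of storage type -> keys that must be non-empty.
# Only the HDFS keys shown in the example above are included.
REQUIRED_KEYS = {
    "HDFS": ["resource.hdfs.fs.defaultFS", "resource.hdfs.root.user"],
}


def missing_keys(props: dict) -> list:
    """Return required keys that are absent or empty for the configured type."""
    storage_type = props.get("resource.storage.type", "LOCAL").upper()
    required = REQUIRED_KEYS.get(storage_type, [])
    return [key for key in required if not props.get(key)]


if __name__ == "__main__":
    incomplete = {
        "resource.storage.type": "HDFS",
        "resource.hdfs.fs.defaultFS": "hdfs://namenode:8020",
    }
    print(missing_keys(incomplete))
```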

Sources:

  • dolphinscheduler-common/src/main/resources/common.properties 97-115

S3 Storage

For Amazon S3 or S3-compatible storage:

resource.storage.type=S3

AWS connection parameters are specified in the aws.yaml file:

aws:
    s3:
        credentials.provider.type: AWSStaticCredentialsProvider
        access.key.id: <access.key.id>
        access.key.secret: <access.key.secret>
        region: <region>
        bucket.name: <bucket.name>
        endpoint: <endpoint>
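The angle-bracket values in this file are placeholders that must be replaced with real settings. As a hedged convenience, the sketch below scans the file text for values still left in `<...>` form; `unfilled_keys` is a hypothetical helper that uses a line-based regular expression rather than a YAML parser, so it only suits flat `key: value` lines like the ones shown.

```python
import re

# Matches lines such as "    region: <region>" and captures the key.
PLACEHOLDER = re.compile(r"^\s*([\w.]+):\s*<[^<>]+>\s*$")


def unfilled_keys(yaml_text: str) -> list:
    """Return keys whose value is still an angle-bracket placeholder."""
    keys = []
    for line in yaml_text.splitlines():
        match = PLACEHOLDER.match(line)
        if match:
            keys.append(match.group(1))
    return keys


if __name__ == "__main__":
    text = "aws:\n    s3:\n        region: <region>\n        endpoint: https://example.invalid\n"
    print(unfilled_keys(text))
```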

Sources:

  • docs/docs/en/guide/resource/configuration.md 29-53

Other Cloud Storage

DolphinScheduler also supports storage on Alibaba Cloud OSS, Huawei Cloud OBS, Tencent Cloud COS, Google Cloud Storage, and Azure Blob Storage, each with its own configuration parameters.

Sources:

  • docs/docs/en/guide/resource/configuration.md 54-127

Database Schema for Resources

Besides storing the files themselves, DolphinScheduler maintains metadata about resources in its database. The relevant tables include:

  • Resource metadata tables (storing information about resources like name, path, owner, etc.)
  • Resource-user relation tables (defining access permissions)
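To make the split between metadata and permissions concrete, here is a deliberately simplified model using SQLite. The table and column names (`resource`, `resource_user`, `owner_id`, and so on) are hypothetical and do not reproduce the schema in dolphinscheduler_mysql.sql; the access rule (owner or explicitly granted user) is likewise an assumption for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE resource (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            path TEXT NOT NULL,
            owner_id INTEGER NOT NULL
        );
        CREATE TABLE resource_user (
            resource_id INTEGER NOT NULL REFERENCES resource(id),
            user_id INTEGER NOT NULL
        );
        """
    )
    return conn


def can_access(conn: sqlite3.Connection, user_id: int, path: str) -> bool:
    """True if the user owns the resource or has been granted access to it."""
    row = conn.execute(
        """
        SELECT 1
        FROM resource r
        LEFT JOIN resource_user ru ON ru.resource_id = r.id
        WHERE r.path = ? AND (r.owner_id = ? OR ru.user_id = ?)
        LIMIT 1
        """,
        (path, user_id, user_id),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    db = make_db()
    db.execute("INSERT INTO resource VALUES (1, 'job.sh', '/scripts/job.sh', 1)")
    db.execute("INSERT INTO resource_user VALUES (1, 2)")
    print(can_access(db, 2, "/scripts/job.sh"))
```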

Sources:

  • dolphinscheduler-dao/src/main/resources/sql/dolphinscheduler_mysql.sql 729-748

Inter-Component Integration

The Storage System integrates with other DolphinScheduler components. The API server handles resource uploads and the Worker server fetches resources when tasks run, so both depend on the same storage configuration.

Usage Considerations

When selecting a storage type, consider the following:

  1. Single-node vs. Multi-node: For a single-node deployment, LOCAL storage is sufficient. For multi-node deployments, consider HDFS or cloud storage.
  2. Performance: Local storage typically offers the best performance but lacks distributed capabilities. HDFS provides good performance for on-premises deployments, while cloud storage options are suitable for cloud-based deployments.
  3. Reliability: Cloud storage providers typically offer high durability and availability. For on-premises deployments, HDFS with proper replication provides reliable storage.
  4. Integration: If you're already using a particular cloud provider or have an existing Hadoop cluster, it may be simplest to use the corresponding storage option.
  5. Cost: Different storage options have different cost structures. Cloud storage typically charges for storage volume, requests, and data transfer.

Sources:

  • docs/docs/en/guide/resource/configuration.md 13-26
  • docs/docs/zh/guide/resource/configuration.md 12-25

Configuration Best Practices

  1. Consistent Configuration: Ensure the storage configuration is identical across all DolphinScheduler nodes (API server and Worker server).

  2. Permissions: The user running DolphinScheduler must have appropriate permissions to access the configured storage.

  3. Shared Storage: In a distributed deployment, use a shared storage solution (HDFS, S3, etc.) rather than LOCAL storage to ensure all nodes can access the same resources.

  4. Security: For cloud storage, use appropriate security measures such as IAM roles or access keys with minimal required permissions.

  5. Backup: Implement a backup strategy for your resource storage, especially for critical resources.
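The first practice, identical configuration on every node, is easy to verify mechanically. The sketch below assumes each node's common.properties has already been collected and parsed into a dict; `storage_config_diff` is a hypothetical helper, and treating every key that starts with `resource.` as storage-related is an assumption based on the examples in this article.

```python
def storage_config_diff(node_props: dict) -> dict:
    """Return {key: {node: value}} for storage keys whose values differ.

    node_props maps a node name to its parsed properties dict.
    A key missing on a node is reported with the value None.
    """
    keys = set()
    for props in node_props.values():
        keys.update(k for k in props if k.startswith("resource."))

    diff = {}
    for key in sorted(keys):
        values = {node: props.get(key) for node, props in node_props.items()}
        if len(set(values.values())) > 1:
            diff[key] = values
    return diff


if __name__ == "__main__":
    nodes = {
        "api": {"resource.storage.type": "HDFS"},
        "worker-1": {"resource.storage.type": "LOCAL"},
    }
    for key, values in storage_config_diff(nodes).items():
        print(key, values)
```

An empty result means the storage settings agree across all nodes that were checked.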

Sources:

  • docs/docs/en/guide/resource/configuration.md 95-97
  • docs/docs/zh/guide/resource/configuration.md 87-91
