The Storage System in Apache DolphinScheduler provides a unified interface for storing and retrieving files across various storage backends. It enables resource management for workflows and tasks, allowing users to upload files such as scripts, JAR files, configuration files, and other artifacts that can be used in task execution. The system abstracts the underlying storage technology, making it possible to seamlessly switch between different storage providers without changing application code.
Architecture Overview
The storage system is designed as a pluggable component with a consistent API across different storage implementations. This architecture allows DolphinScheduler to work with multiple storage backends while maintaining a unified interface for resource operations.
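To make the idea of a consistent API concrete, here is a minimal sketch of a backend-agnostic storage interface. The class and method names (StorageOperator, LocalStorage, InMemoryStorage, stage_script) are hypothetical and chosen for illustration; they are not DolphinScheduler's actual Java API.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageOperator(ABC):
    """Hypothetical unified interface; not DolphinScheduler's real API."""

    @abstractmethod
    def upload(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def download(self, name: str) -> bytes: ...

    @abstractmethod
    def exists(self, name: str) -> bool: ...


class LocalStorage(StorageOperator):
    """Stores resources as files under a base directory."""

    def __init__(self, base_path: str) -> None:
        self.base = Path(base_path)
        self.base.mkdir(parents=True, exist_ok=True)

    def upload(self, name: str, data: bytes) -> None:
        (self.base / name).write_bytes(data)

    def download(self, name: str) -> bytes:
        return (self.base / name).read_bytes()

    def exists(self, name: str) -> bool:
        return (self.base / name).is_file()


class InMemoryStorage(StorageOperator):
    """Stand-in for a remote backend; keeps resources in a dict."""

    def __init__(self) -> None:
        self.objects = {}

    def upload(self, name: str, data: bytes) -> None:
        self.objects[name] = data

    def download(self, name: str) -> bytes:
        return self.objects[name]

    def exists(self, name: str) -> bool:
        return name in self.objects


def stage_script(storage: StorageOperator) -> bytes:
    """Caller code depends only on the interface, not on the backend."""
    storage.upload("hello.sh", b"echo hello\n")
    return storage.download("hello.sh")
```

Because stage_script only touches the abstract interface, swapping LocalStorage for another backend requires no change to the calling code, which is the property the architecture aims for.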
Sources:
- dolphinscheduler-storage-plugin/dolphinscheduler-storage-api/src/main/java/org/apache/dolphinscheduler/plugin/storage/api/StorageType.java 22-62
- dolphinscheduler-common/src/main/resources/common.properties 24-33
Supported Storage Types
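DolphinScheduler supports the following storage backends, selected through resource.storage.type: LOCAL (local file system), HDFS, S3 (Amazon S3 or S3-compatible storage), OSS (Alibaba Cloud), GCS (Google Cloud Storage), ABS (Azure Blob Storage), OBS (Huawei Cloud), and COS (Tencent Cloud).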
Sources:
- dolphinscheduler-storage-plugin/dolphinscheduler-storage-api/src/main/java/org/apache/dolphinscheduler/plugin/storage/api/StorageType.java 22-36
- docs/docs/en/guide/resource/configuration.md 1-7
Plugin Architecture
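The storage functionality is implemented as a set of plugins under the dolphinscheduler-storage-plugin module. The dolphinscheduler-storage-api module defines the shared interface, and dolphinscheduler-storage-all bundles the backend implementations, so a new backend can be added without touching the code that calls it.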
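One common way to structure such a plugin system is a registry that maps a storage type name to a factory. The sketch below illustrates that pattern only; the register decorator, the factory functions, and the returned dictionaries are hypothetical and do not reflect how DolphinScheduler discovers its plugins.

```python
from typing import Callable, Dict

# Registry mapping a storage type name to a factory (illustrative names).
_FACTORIES: Dict[str, Callable[[dict], dict]] = {}


def register(storage_type: str):
    """Decorator a plugin module could use to announce its factory."""
    def wrap(factory):
        _FACTORIES[storage_type.upper()] = factory
        return factory
    return wrap


@register("LOCAL")
def local_factory(conf: dict) -> dict:
    # A real factory would return a storage client; a dict keeps this short.
    return {"backend": "local",
            "base": conf.get("resource.storage.upload.base.path")}


@register("HDFS")
def hdfs_factory(conf: dict) -> dict:
    return {"backend": "hdfs", "fs": conf.get("resource.hdfs.fs.defaultFS")}


def create_storage(conf: dict) -> dict:
    """Look up the factory for the configured type and invoke it."""
    storage_type = conf.get("resource.storage.type", "LOCAL").upper()
    try:
        factory = _FACTORIES[storage_type]
    except KeyError:
        raise ValueError(
            f"no storage plugin registered for {storage_type!r}") from None
    return factory(conf)
```

Adding a backend then means adding one decorated factory; create_storage itself stays unchanged.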
Sources:
- dolphinscheduler-storage-plugin/pom.xml 30-39
- dolphinscheduler-storage-plugin/dolphinscheduler-storage-all/pom.xml 29-54
Configuration
The storage system is configured through the common.properties file. Different storage backends require different configuration parameters.
Basic Configuration
# Storage type: LOCAL, HDFS, S3, OSS, GCS, ABS, OBS, COS
resource.storage.type=LOCAL
# Base path for resource storage
resource.storage.upload.base.path=/tmp/dolphinscheduler
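For readers who want to inspect these settings from a script, the following is a minimal sketch of a reader for simple key=value lines. It is an assumption-laden simplification: it handles only "#" comments and "=" separators, not the full Java properties syntax (line continuations, ":" separators, escapes), and parse_properties is a hypothetical helper name.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


SAMPLE = """\
# Storage type: LOCAL, HDFS, S3, OSS, GCS, ABS, OBS, COS
resource.storage.type=LOCAL
# Base path for resource storage
resource.storage.upload.base.path=/tmp/dolphinscheduler
"""

props = parse_properties(SAMPLE)
print(props["resource.storage.type"])
```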
Sources:
- dolphinscheduler-common/src/main/resources/common.properties 24-27
Configuration Flow
When DolphinScheduler starts, it loads the storage configuration and initializes the storage operator that matches the configured storage type.
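The startup step can be pictured as: read the type, validate it against the supported set, and carry the result forward as a typed configuration object. The sketch below assumes the properties have already been loaded into a dict. StorageConfig and load_storage_config are hypothetical names, and the fallback values (LOCAL and /tmp/dolphinscheduler) are taken from the sample configuration in this article rather than verified against the code.

```python
from dataclasses import dataclass
from enum import Enum


class StorageType(Enum):
    LOCAL = "LOCAL"
    HDFS = "HDFS"
    S3 = "S3"
    OSS = "OSS"
    GCS = "GCS"
    ABS = "ABS"
    OBS = "OBS"
    COS = "COS"


@dataclass(frozen=True)
class StorageConfig:
    storage_type: StorageType
    base_path: str


def load_storage_config(props: dict) -> StorageConfig:
    """Validate the configured type and return an immutable config."""
    raw = props.get("resource.storage.type", "LOCAL").strip().upper()
    try:
        storage_type = StorageType(raw)
    except ValueError:
        raise ValueError(
            f"unsupported resource.storage.type: {raw!r}") from None
    # Assumed default, copied from the sample configuration.
    base_path = props.get("resource.storage.upload.base.path",
                          "/tmp/dolphinscheduler")
    return StorageConfig(storage_type, base_path)
```

Failing fast on an unknown type at startup is a design choice in this sketch: a typo in the properties file surfaces immediately instead of at the first upload.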
Sources:
- dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/PropertyUtils.java 49-60
Storage Type-Specific Configuration
Local Storage
The Local Storage option stores files on the local file system of the machine where DolphinScheduler is running. This is the default configuration.
resource.storage.type=LOCAL
resource.storage.upload.base.path=/tmp/dolphinscheduler
Note: When using LOCAL storage type with multiple DolphinScheduler nodes, each node has its own local file system. This means resources uploaded on one node are not automatically available on other nodes unless you use a shared file system.
Sources:
- dolphinscheduler-common/src/main/resources/common.properties 24-27
- docs/docs/en/guide/resource/configuration.md 10-28
HDFS Storage
For HDFS storage, additional configuration is required:
resource.storage.type=HDFS
resource.hdfs.fs.defaultFS=hdfs://namenode:8020
resource.hdfs.root.user=hdfs
If HDFS with Kerberos authentication is used, additional Kerberos configuration is required.
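A quick sanity check of the two HDFS properties can catch typos before a restart. This is a minimal sketch, assuming the properties are available as a dict and that the file system URI is expected to use the hdfs:// scheme; check_hdfs_config is a hypothetical helper and it does not cover the Kerberos settings.

```python
from urllib.parse import urlparse

REQUIRED_HDFS_KEYS = ("resource.hdfs.fs.defaultFS", "resource.hdfs.root.user")


def check_hdfs_config(props: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    problems = [f"missing {key}" for key in REQUIRED_HDFS_KEYS
                if not props.get(key)]
    default_fs = props.get("resource.hdfs.fs.defaultFS", "")
    if default_fs:
        parsed = urlparse(default_fs)
        # Assumption: only hdfs:// URIs with a host part are acceptable here.
        if parsed.scheme != "hdfs" or not parsed.hostname:
            problems.append(
                f"defaultFS is not an hdfs:// URI with a host: {default_fs}")
    return problems
```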
Sources:
- dolphinscheduler-common/src/main/resources/common.properties 97-115
S3 Storage
For Amazon S3 or S3-compatible storage:
resource.storage.type=S3
AWS connection parameters are specified in the aws.yaml file:
aws:
  s3:
    credentials.provider.type: AWSStaticCredentialsProvider
    access.key.id: <access.key.id>
    access.key.secret: <access.key.secret>
    region: <region>
    bucket.name: <bucket.name>
    endpoint: <endpoint>
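The angle-bracket values in this template are placeholders that must be replaced with real settings. As a small illustration, the sketch below flags any value still left as a placeholder. It assumes the aws.s3 section has already been loaded into a dict (for example with a YAML library); unfilled_keys is a hypothetical helper.

```python
import re

# Matches a value that is still a template placeholder such as "<region>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")

s3_settings = {
    "credentials.provider.type": "AWSStaticCredentialsProvider",
    "access.key.id": "<access.key.id>",
    "access.key.secret": "<access.key.secret>",
    "region": "<region>",
    "bucket.name": "<bucket.name>",
    "endpoint": "<endpoint>",
}


def unfilled_keys(settings: dict) -> list:
    """Return the sorted keys whose values are still placeholders."""
    return sorted(key for key, value in settings.items()
                  if PLACEHOLDER.match(str(value)))
```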
Sources:
- docs/docs/en/guide/resource/configuration.md 29-53
Other Cloud Storage
DolphinScheduler also supports storage on Alibaba Cloud OSS, Huawei Cloud OBS, Tencent Cloud COS, Google Cloud Storage, and Azure Blob Storage, each with its own configuration parameters.
Sources:
- docs/docs/en/guide/resource/configuration.md 54-127
Database Schema for Resources
In addition to the actual file storage, DolphinScheduler also maintains metadata about resources in its database. The relevant tables include:
- Resource metadata tables (storing information about resources like name, path, owner, etc.)
- Resource-user relation tables (defining access permissions)
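The split between resource metadata and user relations can be illustrated with a small in-memory SQLite sketch. The table names, columns, and the rule that an owner can always access their own resource are hypothetical simplifications for illustration; they are not the actual DolphinScheduler schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (            -- hypothetical, simplified metadata table
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    path      TEXT NOT NULL,
    owner_id  INTEGER NOT NULL
);
CREATE TABLE resource_user (       -- hypothetical permission relation
    resource_id INTEGER NOT NULL REFERENCES resource(id),
    user_id     INTEGER NOT NULL,
    PRIMARY KEY (resource_id, user_id)
);
""")
conn.execute(
    "INSERT INTO resource VALUES (1, 'etl.sh', '/tmp/dolphinscheduler/etl.sh', 10)")
conn.execute("INSERT INTO resource_user VALUES (1, 20)")


def can_access(user_id: int, resource_id: int) -> bool:
    """Assumed rule: owners and explicitly granted users may access."""
    row = conn.execute(
        """SELECT 1 FROM resource r
           LEFT JOIN resource_user ru
             ON ru.resource_id = r.id AND ru.user_id = ?
           WHERE r.id = ? AND (r.owner_id = ? OR ru.user_id IS NOT NULL)""",
        (user_id, resource_id, user_id),
    ).fetchone()
    return row is not None
```

The file contents stay in the storage backend; only the name, path, owner, and grants live in the database, so permission checks never need to touch the backend.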
Sources:
- dolphinscheduler-dao/src/main/resources/sql/dolphinscheduler_mysql.sql 729-748
Inter-Component Integration
The Storage System integrates with other DolphinScheduler components, notably the API server and the Worker server, which need to share the same storage configuration.
Usage Considerations
When selecting a storage type, consider the following:
- Single-node vs. Multi-node: For a single-node deployment, LOCAL storage is sufficient. For multi-node deployments, consider HDFS or cloud storage.
- Performance: Local storage typically offers the best performance but lacks distributed capabilities. HDFS provides good performance for on-premises deployments, while cloud storage options are suitable for cloud-based deployments.
- Reliability: Cloud storage providers typically offer high durability and availability. For on-premises deployments, HDFS with proper replication provides reliable storage.
- Integration: If you're already using a particular cloud provider or have an existing Hadoop cluster, it may be simplest to use the corresponding storage option.
- Cost: Different storage options have different cost structures. Cloud storage typically charges for storage volume, requests, and data transfer.
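These considerations can be condensed into a rough decision helper. The sketch below is a hypothetical heuristic, not an official recommendation: suggest_storage and its parameters are invented for illustration, it ignores performance and cost entirely, and the choice to prefer an existing cloud provider over an existing Hadoop cluster is arbitrary.

```python
# Provider-to-type mapping follows the storage types named in this article.
CLOUD_STORAGE_TYPES = {
    "aws": "S3",
    "alibaba": "OSS",
    "google": "GCS",
    "azure": "ABS",
    "huawei": "OBS",
    "tencent": "COS",
}


def suggest_storage(node_count: int, has_hadoop: bool = False,
                    cloud: str = "") -> str:
    """Return a storage type name for a deployment (rough heuristic)."""
    if cloud:
        try:
            return CLOUD_STORAGE_TYPES[cloud.lower()]
        except KeyError:
            raise ValueError(f"unknown cloud provider: {cloud!r}") from None
    if has_hadoop:
        return "HDFS"
    if node_count <= 1:
        return "LOCAL"
    # Multi-node without existing infrastructure: LOCAL is only safe on a
    # shared file system, so point to a distributed backend instead.
    return "HDFS"
```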
Sources:
- docs/docs/en/guide/resource/configuration.md 13-26
- docs/docs/zh/guide/resource/configuration.md 12-25
Configuration Best Practices
- Consistent Configuration: Ensure the storage configuration is identical across all DolphinScheduler nodes (API server and Worker server).
- Permissions: The user running DolphinScheduler must have appropriate permissions to access the configured storage.
- Shared Storage: In a distributed deployment, use a shared storage solution (HDFS, S3, etc.) rather than LOCAL storage to ensure all nodes can access the same resources.
- Security: For cloud storage, use appropriate security measures such as IAM roles or access keys with minimal required permissions.
- Backup: Implement a backup strategy for your resource storage, especially for critical resources.
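The consistency recommendation lends itself to an automated check. As a minimal sketch, assuming each node's properties have already been collected into a dict keyed by node name, the hypothetical helper below reports every resource.* property whose value differs between nodes.

```python
def storage_config_diff(nodes: dict) -> dict:
    """Map each mismatched resource.* property to its per-node values.

    nodes has the shape {node_name: {property: value}}. A property missing
    on a node is reported with the value None.
    """
    keys = {key for props in nodes.values() for key in props
            if key.startswith("resource.")}
    diff = {}
    for key in sorted(keys):
        values = {name: props.get(key) for name, props in nodes.items()}
        if len(set(values.values())) > 1:
            diff[key] = values
    return diff


nodes = {
    "api": {"resource.storage.type": "S3",
            "resource.storage.upload.base.path": "/ds"},
    "worker-1": {"resource.storage.type": "LOCAL",
                 "resource.storage.upload.base.path": "/ds"},
}
print(storage_config_diff(nodes))
```

An empty result means the compared nodes agree on every storage property that any of them defines.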
Sources:
- docs/docs/en/guide/resource/configuration.md 95-97
- docs/docs/zh/guide/resource/configuration.md 87-91