Why Do You Need DolphinScheduler?
3-Minute Quick Deployment (Beginner-Friendly)
Environment Preparation
Minimum requirements (for development):
- JDK 8+
- MySQL 5.7+
- ZooKeeper 3.8+
One-Click Docker Startup (Recommended to Avoid Pitfalls)
```bash
# Note: if MySQL runs on the host (outside the container), replace localhost
# with an address the container can reach (e.g. host.docker.internal).
docker run -d --name dolphinscheduler \
  -e DATABASE_TYPE=mysql \
  -e SPRING_DATASOURCE_URL="jdbc:mysql://localhost:3306/ds?useUnicode=true&characterEncoding=UTF-8" \
  -e SPRING_DATASOURCE_USERNAME=root \
  -e SPRING_DATASOURCE_PASSWORD=<your-mysql-password> \
  -p 12345:12345 \
  apache/dolphinscheduler:3.2.0
```
Core Concepts
| Term | Real-World Analogy | Technical Definition |
|---|---|---|
| Workflow | Factory production line | A DAG (directed acyclic graph) of tasks |
| Task | A single production step | An execution unit such as Shell, SQL, Spark, etc. |
| Instance | A daily production batch | One concrete run of a workflow |
| Alert | Factory PA system | Configuration of failure-notification channels |
Step-by-Step: Create Your First Workflow (With Code)
Scenario: Daily User Behavior Analysis
Step 1: Log in to the Console
http://localhost:12345/dolphinscheduler (Default account: admin / dolphinscheduler123)
Step 2: Create a Workflow
Step 3: Configure a Shell Task (Key Code)
```bash
#!/bin/bash
# Example of parameter injection:
# ${sys_date}, ${begin_date}, ${end_date} are workflow parameters
# (comments must not follow a trailing backslash, or the line continuation breaks)
spark-submit \
  --master yarn \
  --name "behavior_analysis_${sys_date}" \
  /opt/jobs/user_analysis.py "${begin_date}" "${end_date}"
```
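When testing the same job outside the scheduler, the date window DolphinScheduler injects can be approximated locally. A minimal sketch assuming GNU `date` (the `begin_date`/`end_date` names simply mirror the workflow parameters above):

```shell
#!/bin/bash
# Locally approximate the ${begin_date}/${end_date} parameters the scheduler injects
# (assumes GNU date; the T+1 window covers yesterday's behaviour logs)
begin_date=$(date -d "yesterday" +%Y-%m-%d)
end_date=$(date -d "today" +%Y-%m-%d)
echo "analysis window: ${begin_date} -> ${end_date}"
```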
Step 4: Set a Scheduling Policy
```
0 0 2 * * ? *  # Executes daily at 2 AM (Quartz-style cron, including a seconds field)
```
Unlock Advanced Features (Beginner-Friendly)
- Parameter Passing (Cross-Task Value Transfer)
```bash
# In an upstream Shell task, publish an OUT parameter for downstream tasks:
echo '${setValue(uv_count=10000)}'
# Downstream tasks can then reference the value as ${uv_count}
```
- Automatic Retry on Failure
```yaml
# Part of the workflow definition
retry_times: 3            # Retry up to 3 times
task_retry_interval: 300  # Retry every 5 minutes (value in seconds)
```
- Conditional Branching (Dynamic Routing)
```bash
# Determine whether it is a weekend (${week} is a workflow parameter
# holding the ISO weekday: 1 = Monday ... 7 = Sunday)
if [ "${week}" -gt 5 ]; then
  echo "skip weekend processing"
  exit 0
fi
```
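If a `${week}` parameter is not configured in your workflow, the same check can be derived from the system clock (`date +%u` yields the ISO weekday on GNU and BSD `date`):

```shell
#!/bin/bash
# Stand-in for the ${week} workflow parameter: ISO weekday from the system clock
week=$(date +%u)   # 1 = Monday ... 7 = Sunday
if [ "${week}" -gt 5 ]; then
  echo "skip weekend processing"
  exit 0
fi
echo "weekday ${week}: run the pipeline"
```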
Pitfall Prevention Tips (From Real-World Experience)
- Resource misconfiguration: Spark tasks run out of memory → adjust the limits in `conf/worker.properties`:

```properties
worker.worker.task.resource.limit=true
worker.worker.task.memory.max=8g  # Adjust based on your cluster
```

- Timezone trap: scheduled tasks fire 8 hours late → set the timezone in `common.properties`:

```properties
spring.jackson.time-zone=GMT+8
```
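The 8-hour gap is easy to reproduce: render one fixed timestamp in UTC and in GMT+8 (assumes GNU `date` and an installed tzdata):

```shell
#!/bin/bash
# Same instant, two timezones: this is the "delayed by 8 hours" symptom
ts=1700000000                                   # fixed Unix timestamp for reproducibility
TZ=UTC date -d "@${ts}" '+%H:%M'                # prints 22:13
TZ=Asia/Shanghai date -d "@${ts}" '+%H:%M'      # prints 06:13 (8 hours ahead)
```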
Efficiency Comparison (Convincing Metrics)
| Metric | Crontab | Airflow | DolphinScheduler |
|---|---|---|---|
| Visualization | ❌ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| High-availability deployment | ❌ | ⭐⭐ | ⭐⭐⭐⭐ |
| Big data ecosystem integration | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Ease of learning (more stars = easier) | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
Final Thoughts
Apache DolphinScheduler is rapidly becoming a de facto standard for big data workflow scheduling. Its cloud-native architecture and user-friendly interface help developers escape the pain of managing complex task flows by hand. For beginners, this guide is a starting point from which to explore more advanced features such as cross-cluster task dispatching and Kubernetes integration.