In production environments, performance issues in a scheduling platform are never caused by a single bottleneck. Instead, they arise from the combined effects of scheduling decisions, task execution, metadata storage, and coordination mechanisms. Taking Apache DolphinScheduler as an example, focusing on just one component, such as the Master or Worker, often leads to misidentifying the root cause.
This article is based on real-world production experience. It systematically breaks down performance bottlenecks in a scheduling platform and provides practical, actionable optimization strategies.
1. From the overall architecture, where exactly are the bottlenecks?
The core workflow of DolphinScheduler can be abstracted as:
Scheduling → Execution → Storage → Coordination
Any layer can become a bottleneck, but the most common issues are concentrated in four areas:
- Insufficient scheduling throughput on the Master
- Mismatch between Worker execution capacity and workload
- Excessive pressure on the database (MySQL/PostgreSQL)
- Latency or instability in ZooKeeper (coordination layer)
2. The Master bottleneck is not CPU, but the “scheduling model”
Many assume the Master’s CPU is the issue. In practice, the real bottleneck is the combination of the scheduling model and database I/O.
2.1 Scheduling mechanism
The Master’s core loop looks like this:
// MasterSchedulerService.java (simplified)
while (true) {
    // Poll the database for process instances that are ready to run
    List<ProcessInstance> instances = processService.findNeedScheduleProcessInstances();
    for (ProcessInstance instance : instances) {
        submitProcessInstance(instance);
    }
    // (the real loop also sleeps between iterations to throttle polling)
}
This is a polling + database-driven model. The key limitation is that scheduling capacity is directly tied to database throughput.
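A back-of-the-envelope bound follows directly from this model: a Master can schedule at most one scan batch per polling interval, no matter how idle its CPU is. The numbers below are illustrative, not DolphinScheduler defaults:

```java
public class PollingThroughput {
    // Upper bound of a polling scheduler: batchSize / pollIntervalSeconds.
    // CPU does not appear in the formula — the database round trip does.
    static double maxInstancesPerSecond(int batchSize, double pollIntervalSeconds) {
        return batchSize / pollIntervalSeconds;
    }

    public static void main(String[] args) {
        // e.g. scanning 100 instances per pass, once per second
        System.out.println(maxInstancesPerSecond(100, 1.0)); // 100.0
    }
}
```

This is why the symptoms in 2.2 (low Master CPU, high database QPS, delayed tasks) appear together.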
2.2 Typical symptoms
High scheduling latency:
Tasks are ready but delayed by tens of seconds before execution, while Master CPU usage remains low and database QPS is high.
Low throughput:
The system may only schedule a few hundred tasks per minute, and adding more Masters yields limited improvement.
2.3 Optimization strategies
Reduce database scanning pressure
Typical SQL:
SELECT * FROM t_ds_process_instance
WHERE state = 'READY'
LIMIT 100;
Optimization:
CREATE INDEX idx_state_priority_time
ON t_ds_process_instance(state, priority, create_time);
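The composite index pays off most when the scan query filters and orders on the same columns. A sketch of the shape the query would need (the actual production SQL may differ):

```sql
-- Shaped to use idx_state_priority_time: filter on the leading column,
-- then order by the remaining index columns to avoid a filesort.
SELECT *
FROM t_ds_process_instance
WHERE state = 'READY'
ORDER BY priority, create_time
LIMIT 100;
```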
Additional measures include limiting scan batch sizes and tuning scheduling intervals to avoid excessive polling.
Increase scheduling concurrency
Key configuration:
master:
  exec-threads: 100
  dispatch-task-number: 50
Practical guidelines:
- exec-threads should be approximately 2 to 4 times the number of CPU cores.
- dispatch-task-number should not be too large, to avoid overwhelming Workers.
Scale out Masters
DolphinScheduler supports multiple Masters, but scaling is not linear due to shared database bottlenecks and ZooKeeper coordination overhead.
3. More Workers is not always better
Adding more Workers blindly can overload the database and worsen queuing.
3.1 Worker configuration
worker:
  exec-threads: 50
Workers act as both execution units and resource isolation boundaries.
3.2 Estimation formula
Worker count ≈ Total concurrent tasks / Per-Worker concurrency
Per-Worker concurrency ≈ CPU cores × (2 to 4)
3.3 Example
For 1,000 concurrent tasks and 16-core Workers:
Per Worker ≈ 32 to 64 concurrent tasks
Taking ~50 as a midpoint of that range: Required Workers ≈ 1000 / 50 = 20
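The two formulas above can be captured in a few lines (the `× 2–4` factor and ceiling division are the only moving parts):

```java
public class WorkerSizing {
    // Per-Worker concurrency ≈ CPU cores × factor (factor in the 2–4 range)
    static int perWorkerConcurrency(int cores, int factor) {
        return cores * factor;
    }

    // Worker count ≈ total concurrent tasks / per-Worker concurrency, rounded up
    static int workerCount(int totalConcurrentTasks, int perWorker) {
        return (totalConcurrentTasks + perWorker - 1) / perWorker;
    }

    public static void main(String[] args) {
        System.out.println(perWorkerConcurrency(16, 2)); // 32
        System.out.println(perWorkerConcurrency(16, 4)); // 64
        System.out.println(workerCount(1000, 50));       // 20
    }
}
```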
3.4 Task type matters more
Short tasks (<5 seconds):
Scheduling overhead exceeds execution time, making the Master the bottleneck.
Long tasks (>10 minutes):
Workers become resource bottlenecks due to long occupation time.
4. Different strategies for short and long tasks
4.1 Short tasks optimization
Typical scenarios include SQL queries and API calls.
Batching example:
-- Before: multiple small queries
SELECT * FROM table WHERE id = 1;
SELECT * FROM table WHERE id = 2;
-- After: batch query
SELECT * FROM table WHERE id IN (1,2,3,...);
Other strategies include reducing DAG granularity and moving loops into scripts.
4.2 Long tasks optimization
Typical scenarios include Spark or Flink jobs.
The bottleneck lies in resource systems rather than the scheduler.
Strategies:
Bind workloads to YARN queues or Kubernetes namespaces and enforce concurrency limits.
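For the Kubernetes case, a namespace-level ResourceQuota is one way to enforce such a limit. The namespace name and numbers below are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: etl-quota
  namespace: etl-jobs        # hypothetical namespace for long-running jobs
spec:
  hard:
    pods: "20"               # caps concurrent job pods in this namespace
    requests.cpu: "64"
    requests.memory: 256Gi
```

YARN achieves the same effect with capacity or fair-scheduler queues bound to the workload.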
5. The database bottleneck is the most underestimated
Around 80% of production performance issues ultimately relate to the database.
5.1 Common problems
- Slow queries
- Row-level lock contention
- Connection pool exhaustion
5.2 Typical SQL
UPDATE t_ds_task_instance
SET state = 'RUNNING'
WHERE id = ?;
Frequent updates to the same rows lead to lock contention and reduced throughput.
5.3 Optimization strategies
Read-write separation
Masters handle writes, while APIs and queries use read replicas.
Reduce update frequency
Inefficient pattern: every heartbeat rewrites the same unchanged state:
RUNNING → RUNNING → RUNNING
Optimization:
Reduce the heartbeat frequency so unchanged states are written less often.
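A sketch of what this looks like in configuration. The key name and defaults vary across DolphinScheduler versions, so check the application.yaml of your release:

```yaml
worker:
  heartbeat-interval: 30s   # hypothetical: raised from a shorter default to cut state writes
```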
Batch updates
// Batch update task states
updateBatch(taskInstances);
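A minimal JDBC sketch of what such a batch update could look like. The `partition` helper and the chunk size of 500 are illustrative choices, not DolphinScheduler's actual implementation:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchStateUpdate {

    /** Split a list into fixed-size chunks so a single batch never grows unbounded. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }

    /** One round trip per chunk of ids instead of one UPDATE per task instance. */
    static void markRunning(Connection conn, List<Integer> taskIds) throws SQLException {
        String sql = "UPDATE t_ds_task_instance SET state = 'RUNNING' WHERE id = ?";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (List<Integer> chunk : partition(taskIds, 500)) {
                for (int id : chunk) {
                    ps.setInt(1, id);
                    ps.addBatch();
                }
                ps.executeBatch(); // sends the whole chunk in one round trip
            }
        }
        conn.commit();
    }
}
```

Batching both reduces network round trips and shortens the window in which row locks are held.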
6. ZooKeeper as a hidden bottleneck
ZooKeeper is responsible for coordination, including Master election, Worker registration, and heartbeat management.
6.1 Common symptoms
- Scheduling jitter under high load
- Workers falsely marked as dead
- Frequent Master failovers
6.2 Root causes
- Improper session timeout settings
- Too many nodes and connections
- Network instability
6.3 Optimization
Example configuration:
tickTime=2000
initLimit=10
syncLimit=5
Recommendations:
- Increase the session timeout to at least 20 seconds to tolerate transient failures. (ZooKeeper clamps the negotiated session timeout to between 2× and 20× tickTime, so with tickTime=2000 any value from 4 to 40 seconds is accepted.)
- Deploy ZooKeeper on dedicated nodes to avoid resource contention with Masters, Workers, or the database.
7. A real-world optimization case
Background
- Daily tasks: 200,000
- DAGs: 30,000
- Masters: 2
- Workers: 30
Issues
- Scheduling latency exceeded 1 minute during peak hours
- Database CPU usage reached 90 percent
Optimization process
Step 1: Database indexing
Result: latency reduced by 40 percent
Step 2: Reduce short tasks
Result: DAG count reduced by 30 percent
Step 3: Adjust Master parameters
exec-threads: 50 → 120
Result: throughput doubled
Final results
- Scheduling latency reduced from 60 seconds to 8 seconds
- Database CPU usage reduced from 90 percent to 50 percent
- Overall throughput improved by 2 to 3 times
8. Summary: the essence of scheduling performance optimization
The core insight is that performance is a balance of:
Scheduling capacity × Execution capacity × Storage capacity × Coordination capability
Optimization must be holistic:
- The Master controls the scheduling rhythm
- Workers provide execution capacity
- The database defines system limits
- ZooKeeper ensures coordination stability
Ultimately:
The limit of a scheduling system is not how many tasks it can dispatch, but how long the database can sustain the load.
Previous articles:
- Part 1 | Scheduling Systems Are More Than Just “Timers”
- Part 2 | The Core Abstraction Model of Apache DolphinScheduler
- Part 4 | The State Machine: The Real Soul of Scheduling Systems
Next: The boundaries between DolphinScheduler and Flink, Spark, and SeaTunnel
