DEV Community

Chen Debra

The Choice of a Chemical Giant: How Wison Engineering Ignites a New Spark in Data Integration with DolphinScheduler

In the wave of digital transformation, enterprises face the challenge of explosive data growth. Integrating and managing data from various sources has become the key to enhancing competitiveness. This article briefly introduces Wison Engineering (China)'s multi-source data integration practice solution based on Apache DolphinScheduler, covering background, pain point analysis, scheduling strategy design, key technical solutions, and operations management experience.

Speaker Introduction

Cheng Guo, Digital Development Manager at Wison Engineering (China), is a senior architecture expert with over 15 years of experience in enterprise application architecture design and development. He is proficient in full-stack development technologies and has successfully designed and implemented multiple enterprise-level core systems in the energy, chemical, and supply chain logistics sectors.

Background

Digital transformation is a necessary path for enterprises to adapt to the digital economy. In this process, data becomes the core asset of enterprises, and efficient data management and utilization are crucial for decision-making, operations, and innovation.

Challenges and Opportunities in Project Management

Wison Engineering (China) is an engineering company specializing in providing engineering services and solutions for industries such as oil, chemicals, and natural gas. Their projects include engineering design, project management, construction management, engineering consulting, and technology development. In the wave of digital transformation, this traditional engineering company has also embarked on the path of digitalization.

In Wison Engineering (China), project management plays a bridging and connecting role in digital transformation. Effective project management helps identify and solve problems that arise during the transformation and seize the opportunities it creates. The challenges are unavoidable, but they come paired with opportunities.

Challenges:

  1. Cross-Cultural Communication and Collaboration: International projects involve diverse cultural backgrounds, making communication and team collaboration more complex, requiring overcoming language barriers and cultural differences.
  2. Standardization and Interoperability: Standardized data formats and system interoperability are crucial to support international project collaboration.
  3. Regulatory Compliance: Implementing projects in different countries requires adherence to multiple regulations, and digital tools can help track and meet these requirements.

Opportunities:

  1. Efficiency and Cost Savings: Digital tools automate routine tasks, improve efficiency, reduce errors, and save costs.
  2. Optimized Decision-Making: Big data analytics and AI technologies provide deep insights, helping project managers make more accurate decisions and reduce risks.
  3. Market Expansion: Entering new markets brings more business opportunities, especially in emerging economies with significant infrastructure development needs.
  4. Innovation and Competitiveness: Adopting the latest technologies innovates project management methods, enhances market competitiveness, and attracts more international clients.

Background of Data Platform Construction

Building a robust data platform is the foundation of digital transformation. The platform needs to support data integration, storage, processing, and analysis to meet the enterprise's data demands. Currently, enterprises face several challenges in digital transformation:

Key Challenges:

  1. Enhanced Digital Application Levels: Clients demand more data-driven and visualized project process management.
  2. High Standards: To meet high client standards, enterprises need to improve competitiveness and benchmark against world-class peers.
  3. Rapid Deployment of Digital Systems: Enhancing collaboration efficiency and project management levels is an urgent need.

Major Problems:

  1. Outdated IT Infrastructure: Existing systems rely on legacy client software such as IE, which is being phased out.
  2. Insufficient Data Digitization: Lacks systems to manage progress, costs, contracts, and changes based on data.
  3. Decentralized Data: Data scattered across personal devices and Excel files, making centralized management difficult.
  4. Inadequate Sharing and Control: Budget overruns and lack of real-time control lead to information asymmetry and progress management challenges.
  5. Obvious Security Risks: Data security and confidentiality are hard to control.
  6. Difficulties in Retrospective Statistics: Inability to classify, aggregate, and analyze cost and budget data at the company level.
  7. Challenges in Reusing Experience: Current systems hinder data accumulation and system improvement, making it hard to reuse experience.

These issues indicate that the existing information systems can no longer meet the enterprise's development needs and require reform and upgrades to adapt to the digital era.

Pain Points in Data Integration

In the process of multi-source data integration, Wison Engineering (China) encountered several pain points, which can be summarized as follows:

Data Source Complexity

  1. System Heterogeneity:

    • Different professions use different business systems.
    • Data formats are inconsistent:

      - Structured data is stored in databases such as MySQL, Oracle, and SQL Server.
      - Semi-structured data includes Excel files and JSON logs.
      - Unstructured data includes blueprints, documents, and photos.

  2. Data Quality Issues:

    • Incomplete data:

      - Field-collected data may be missing.
      - Manual data entry may be delayed.
      - System integrations may be disconnected.

    • Inconsistent data:

      - Data standards vary across systems.
      - Duplicate data has not been cleaned.
      - Historical data has not been governed.

Real-Time Requirements

Construction site data, cost progress data, and completion progress data need to be updated in real-time to ensure the accuracy and timeliness of project management.

Scalability Challenges

  • Project cycles are long, spanning 3-5 years from design to construction, which requires the data integration system to have good scalability.
  • Diverse analysis dimensions, such as carbon emission analysis, require the system to handle and analyze various types of data.

Platform Selection Considerations

Given the complexity of its data integration requirements, Wison Engineering (China) set a high bar for selecting a big data scheduling system and approached the selection cautiously, weighing both technical architecture and operational costs.

Comparison of Selection

In the selection process, Wison Engineering (China) compared and analyzed different data orchestration tools, including open-source and commercial software, focusing on scheduling platforms, data integration, and data warehouse aspects:

Apache DolphinScheduler, with its open-source and free nature, active community, ease of maintenance, and resource isolation, can meet the complex requirements of enterprises in data integration and scheduling, performing excellently in multi-project needs and engineering scenarios. Apache SeaTunnel, on the other hand, provides simple and intuitive configuration and batch-stream integration capabilities, meeting various data integration needs. In the data warehouse area, Doris offers high-performance query capabilities, while MySQL reduces learning costs due to its mature ecosystem.

Key Explanation of Apache DolphinScheduler as an Open-Source Distributed Scheduling System

Apache DolphinScheduler can provide the following support for Wison Engineering (China)'s EPC project data integration:

  • Unified Data Source Management: By integrating different data sources, data format unification and standardization are achieved, simplifying data management processes.
  • Improved Data Quality: Through automated data cleaning and governance processes, data integrity and consistency are ensured.
  • Meeting Real-Time Requirements: DolphinScheduler's real-time scheduling capabilities ensure the real-time update of construction site data, cost progress data, and completion progress data.
  • Supporting Project Scalability: The system's design allows for flexible expansion throughout the project lifecycle, supporting long-term project data management and analysis needs.
  • Multi-Dimensional Analysis: DolphinScheduler supports complex data analysis tasks, such as carbon emission analysis, helping project teams assess the project's impact from multiple angles.

Through DolphinScheduler, EPC projects can more effectively manage data, improve project management efficiency, and enhance decision-making quality and project success rates.

Multi-Source Data Integration Solution Design

Based on Apache DolphinScheduler and Apache SeaTunnel, Wison Engineering (China) designed its multi-source data integration solution, which covers the scheduling strategy design, key technical solutions, and the overall architecture.

Scheduling Strategy Design

  • Architecture

The diagram below illustrates the scheduling strategy for data synchronization tasks. Through a scheduled root task node, the system automates the execution of the various synchronization tasks, keeping data timely and accurate. This strategy improves data management efficiency, reduces manual intervention, and ensures data consistency and reliability. With a scheduling platform such as DolphinScheduler, these tasks can be managed and executed effectively, raising the level of automated operations and maintenance.

  • Business Domain and Data Characteristics

In the design of scheduling strategies, it is necessary to consider the entire lifecycle of engineering projects and core business processes to ensure the efficiency and real-time nature of data flow. Classifying data characteristics helps determine the priority and processing strategies of data, thereby optimizing the scheduling strategy. The distinction between real-time data, near-real-time data, and batch data, as well as the classification of critical, important, and general data, are key factors in designing effective scheduling strategies. Additionally, classifying data based on volume can help allocate resources and processing capabilities reasonably, ensuring the performance and efficiency of the scheduling system.

Through such classification and analysis, clear guidance can be provided for the scheduling strategies of engineering projects, ensuring that projects are completed on time, within budget, and to the required quality.
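As a minimal illustration of this classification, the sketch below maps a task's latency class and criticality to a sync interval and priority tier. The tier values, field names, and promotion rule are assumptions for demonstration only, not Wison's actual configuration:

```python
from dataclasses import dataclass

# Illustrative tiers only: the actual intervals and priorities used in
# the deployment described here are not published, so these are assumptions.
TIERS = {
    "real-time":      {"interval_min": 1,    "priority": "HIGH"},
    "near-real-time": {"interval_min": 15,   "priority": "MEDIUM"},
    "batch":          {"interval_min": 1440, "priority": "LOW"},
}

@dataclass
class SyncTask:
    name: str
    latency_class: str  # "real-time" | "near-real-time" | "batch"
    criticality: str    # "critical" | "important" | "general"

def schedule_for(task: SyncTask) -> dict:
    """Derive a sync interval and priority from the task's data characteristics."""
    tier = dict(TIERS[task.latency_class])
    # Critical data is promoted to the highest priority regardless of tier.
    if task.criticality == "critical":
        tier["priority"] = "HIGH"
    return tier

print(schedule_for(SyncTask("site-progress-sync", "real-time", "critical")))
# {'interval_min': 1, 'priority': 'HIGH'}
print(schedule_for(SyncTask("monthly-cost-report", "batch", "important")))
# {'interval_min': 1440, 'priority': 'LOW'}
```

Rules like these can then be translated into cron expressions and task priorities on the scheduling platform itself.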

  • Project Lifecycle Characteristics

In each stage of the project lifecycle, the characteristics and synchronization requirements of data have their specificities. The design of scheduling strategies needs to ensure the accuracy, timeliness, and consistency of data based on these characteristics. Key considerations include data management, transmission performance, version control, data integrity, and issue tracking, all of which are important factors in ensuring the smooth progress of the project. Through such scheduling strategy design, the various stages of the project can be effectively supported, improving the efficiency and quality of project management.

  • Core Business Process Analysis

In the design of scheduling strategies, different core business processes need to consider the characteristics of data flow, key data items, and synchronization strategies. These factors together determine the real-time nature, accuracy, and consistency of data, thereby affecting the efficiency of business processes and the quality of decision-making.

Through DolphinScheduler, the data flow in these core business processes can be effectively managed and scheduled. The scheduling capabilities of DolphinScheduler can ensure that key data items are updated according to the predetermined synchronization strategies, whether it is real-time, hourly, daily, or scheduled calculations, meeting the needs of different business processes. In this way, enterprises can ensure the timeliness and accuracy of data, supporting the smooth operation of business and the formulation of decisions.

  • Data Flow Analysis

In the design of scheduling strategies, data flow analysis is a crucial part, involving cross-departmental data sharing, multi-system data integration, and different business requirements for real-time nature. To meet these needs, corresponding implementation strategies need to be formulated:

  1. Cross-departmental data collaboration: By establishing data sharing mechanisms and real-time push functions, ensure the real-time sharing and consistency of data, while managing data permissions through permission filtering rules.
  2. Multi-system data integration: Unify data exchange standards and establish data mapping relationships to handle the complexity of interfaces and diversity of data formats between different systems, optimizing transmission mechanisms to meet performance requirements.
  3. Differences in real-time requirements: Implement hierarchical scheduling strategies, allocate resources reasonably according to the real-time requirements of the business, achieve load balancing, ensure the high real-time nature of core business, and allow appropriate delays for statistical analysis and historical data queries.

Through such a scheduling strategy design, data flow can be effectively managed and scheduled, improving the efficiency of business processes and the quality of decision-making. DolphinScheduler, as a scheduling platform, can support the implementation of these strategies, helping enterprises achieve efficient data flow and management.

  • Core of Task Design

In the design of scheduling strategies, the core elements of task design are to ensure the efficient execution of tasks and the reasonable allocation of resources. Real-time task processing focuses on the timely collection of construction site data, which is crucial for the real-time nature and accuracy of project management. Batch task design focuses on regular data statistical analysis, such as cost monthly reports and progress analysis reports. These tasks do not require real-time processing but have high requirements for data accuracy and the complexity of processing logic.

Task scheduling strategies aim to optimize execution time and resource allocation: run tasks during periods of low system load, divide large tasks into smaller pieces, and set the number of concurrent tasks according to the server's capacity. Setting task priorities ensures that key tasks are processed first, so the critical reports in project management are generated on time.

DolphinScheduler, as a scheduling platform, can effectively support the implementation of these scheduling strategies, helping enterprises achieve automated scheduling and management of tasks, and improving the efficiency and quality of project management.
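The "divide large tasks into smaller pieces" idea can be sketched as follows; the chunk size and date range are arbitrary examples, not values from the actual deployment:

```python
from datetime import date, timedelta

def split_into_chunks(start: date, end: date, days: int = 30):
    """Split a long sync window into short batches so each piece runs
    quickly and can be retried independently if it fails."""
    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks

# A one-year cost-history backfill becomes 13 short batch tasks, which a
# scheduler can then run with a bounded concurrency level and priorities.
chunks = split_into_chunks(date(2023, 1, 1), date(2023, 12, 31))
print(len(chunks))   # 13
print(chunks[0])     # (datetime.date(2023, 1, 1), datetime.date(2023, 1, 30))
```

Each chunk maps naturally onto one scheduler task instance, so a failed month can be re-run without repeating the whole backfill.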

  • Technical Implementation

Wison Engineering (China) uses DolphinScheduler to manage and monitor workflows, allowing users to view the status of workflows, scheduling information, and perform various operations.

Taking project management tasks as an example, in specific implementation, the permissions of personnel in different projects vary, so a unified project management platform is needed to manage and distribute personnel and position information to various subsystems to execute corresponding tasks. Data can be collected through HTTP, scheduled using DolphinScheduler, stored in Doris, and then distributed to various business systems.
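A minimal sketch of that collection step is shown below. The payload shape, field names, and target table are hypothetical, invented for illustration; Doris speaks the MySQL protocol, so the commented insert could be issued with any standard MySQL client from a scheduled DolphinScheduler task.

```python
def to_rows(payload: dict) -> list:
    """Flatten a hypothetical personnel API response into
    (project_id, user_id, role) rows ready for loading into Doris."""
    rows = []
    for person in payload.get("personnel", []):
        for role in person.get("roles", []):
            rows.append((payload["project_id"], person["user_id"], role))
    return rows

sample = {
    "project_id": "EPC-001",
    "personnel": [
        {"user_id": "u1", "roles": ["designer", "reviewer"]},
        {"user_id": "u2", "roles": ["project_manager"]},
    ],
}
print(to_rows(sample))
# [('EPC-001', 'u1', 'designer'), ('EPC-001', 'u1', 'reviewer'),
#  ('EPC-001', 'u2', 'project_manager')]

# Inside the scheduled task, the rows could then be written to Doris over
# its MySQL-compatible protocol (the table name here is an assumption):
#   cursor.executemany(
#       "INSERT INTO ods_project_roles (project_id, user_id, role) "
#       "VALUES (%s, %s, %s)", to_rows(sample))
```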

Key Technical Solutions

Using DolphinScheduler as a scheduling platform and SeaTunnel as a data integration engine, this combination can effectively handle data synchronization, transformation, and loading requirements in projects, synchronizing data to the data warehouse on a schedule, and integrating and analyzing data from multiple business systems.
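As an illustration of the division of labor, a SeaTunnel batch job that copies a table from a business database into the warehouse is configured roughly as below. This is a sketch only: the hosts, credentials, and table names are placeholders, not the actual environment.

```
env {
  job.mode = "BATCH"
  parallelism = 2
}

source {
  Jdbc {
    url = "jdbc:mysql://erp-db:3306/erp"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "***"
    query = "SELECT id, project_id, amount, updated_at FROM cost_items"
  }
}

sink {
  Jdbc {
    # Doris exposes a MySQL-compatible endpoint on the FE query port
    url = "jdbc:mysql://doris-fe:9030/dw"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "etl_user"
    password = "***"
    query = "INSERT INTO ods_cost_items (id, project_id, amount, updated_at) VALUES (?, ?, ?, ?)"
  }
}
```

DolphinScheduler can then trigger this job on schedule, for example through its SeaTunnel task type or a plain shell task, which is how the two components split responsibilities in the solution above.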

Overall Architecture

This solution is a comprehensive distributed system design, covering every layer from client interaction to service governance and from data storage to business logic:

  • A unified service gateway routes requests efficiently and enforces security controls, with Nginx providing load balancing for high availability.
  • Nacos serves as the service registration and discovery center, simplifying communication between services.
  • Business services are containerized on Docker hosts, separating basic services such as authentication and authorization (Keycloak), API management (YApi), and message notifications from application services such as delivery planning and the engineering homepage, improving flexibility and maintainability.
  • SkyWalking provides distributed link tracing for service monitoring, ensuring service performance.
  • The data storage layer uses Redis, Kafka, MySQL, and OSS file services to handle caching, message queues, structured data, and unstructured data respectively.
  • Doris serves as the data service layer, providing fast data analysis capabilities, while DolphinScheduler is responsible for data scheduling tasks, ensuring the automation of data processing workflows.

The overall architecture demonstrates obvious advantages in terms of high modularity, service independence, scalability, and ease of maintenance, supporting complex business needs and large-scale user access.

Summary and Outlook

In operations and maintenance management, the application of AI can significantly improve the level of automation, optimize the design process, predict and manage project risks, and enhance the intelligence of factory operations and maintenance. However, the successful application of AI depends on high-quality data. In EPC project management, collecting and organizing data is a challenge that requires effective strategies to overcome. Intelligent factory operations and maintenance is another important field for AI applications, where AI can leverage its advantages in anomaly detection and fault diagnosis.

DolphinScheduler, as a scheduling platform, can play an important role in these areas by automating scheduling and optimizing task execution, supporting AI model training and prediction tasks, thereby improving the efficiency and effectiveness of operations and maintenance management.

Thank you all for your attention to the multi-source data integration practice solution based on DolphinScheduler. It is hoped that this article can provide valuable insights and guidance to readers, helping enterprises better achieve data integration and scheduling.
