Dmall is a global provider of intelligent retail solutions, supporting the digital transformation of over 430 clients. With rapid business expansion, real-time data synchronization, resource efficiency, and development flexibility have become the three key challenges we must overcome.
Four Stages of Dmall's Data Platform Evolution
Dmall's data platform has undergone four major transformations, always focusing on "faster, more efficient, and more stable."
In building the data platform, we initially used AWS EMR to quickly establish cloud-based big data capabilities, then moved to self-built Hadoop clusters in our own IDC, combining open-source cores with self-developed integration, scheduling, and development components and turning heavy assets into reusable lightweight services. As the business demanded lower costs and greater elasticity, the team rebuilt the foundation on storage-compute separation and containerization, introducing Apache SeaTunnel for real-time data lake integration. Finally, with Apache Iceberg and Paimon as unified storage formats, we arrived at an integrated lakehouse architecture that provides a stable, low-cost data foundation for AI, completing the transition from adopting the cloud to building on it, and from offline to real-time.
Storage-Compute Separation Architecture
Dmall UniData's (Data IDE) storage-compute separation architecture uses Kubernetes as the elastic foundation, with Spark, Flink, and StarRocks scaling on demand. Iceberg + JuiceFS unifies lake storage, Hive Metastore manages cross-cloud metadata, and Ranger provides fine-grained access control. This architecture is vendor-neutral and fully controllable across the entire tech stack.
The business benefits are clear: TCO reduced by 40-75%, resource scaling in seconds, a single IDE framework covering integration, scheduling, modeling, querying, and data services for fast delivery with fewer resources, and consistent security across multiple clouds.
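To make the storage layer concrete, here is a minimal, hypothetical sketch of how a Spark session can be pointed at an Iceberg catalog whose warehouse sits on JuiceFS and whose metadata lives in the shared Hive Metastore. The catalog name `lake`, the JuiceFS volume name, and all endpoints are placeholders, not Dmall's actual configuration.

```java
import org.apache.spark.sql.SparkSession;

public class LakehouseSessionExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("iceberg-on-juicefs-demo")
                // Iceberg catalog registered against the shared Hive Metastore
                .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.lake.type", "hive")
                .config("spark.sql.catalog.lake.uri", "thrift://metastore-host:9083")
                // Warehouse directory on a JuiceFS volume (requires the JuiceFS Hadoop SDK on the classpath)
                .config("spark.sql.catalog.lake.warehouse", "jfs://dmall-lake/warehouse")
                .config("spark.hadoop.fs.jfs.impl", "io.juicefs.JuiceFileSystem")
                .config("spark.hadoop.juicefs.meta", "redis://redis-host:6379/1")
                .getOrCreate();

        // Query an ODS table through the catalog; Ranger policies would apply at the metastore/SQL layer.
        spark.sql("SELECT count(*) FROM lake.ods.orders").show();
    }
}
```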
I. Pain Points of the Old Architecture
Before introducing Apache SeaTunnel, Dmall's data platform offered self-service data synchronization for more than a dozen storage systems such as MySQL, Hive, and Elasticsearch. Each data source was connected through self-developed Spark jobs, customized on demand, and supported batch processing only.
For data import, the platform unified ODS data into the data lake, using Apache Iceberg as the lakehouse table format, with data available to downstream consumers at hourly latency, ensuring high data reuse and quality.
Previously, we relied on Spark's self-developed synchronization tools, which were stable but suffered from issues like “slow startup, high resource usage, and difficult scalability.”
“It's not that Spark is bad, but it’s too heavy.”
Against the backdrop of cost reduction and efficiency improvement, we re-evaluated the original data integration architecture. While Spark's batch jobs were mature, they were overkill for medium-sized data synchronization tasks. Slow startup, high resource consumption, and long development cycles became bottlenecks for the team's efficiency. More importantly, with the growing demand for real-time business needs, Spark's batch processing model was becoming unsustainable.
Dimension | Old Spark Solution | Business Impact |
---|---|---|
High resource usage | 2 cores / 8 GB per job at startup, ~40 s of idle start-up time | Unfriendly to small and medium-scale data synchronization |
High development cost | No abstracted Source/Sink; every pipeline required full-stack development | Higher development and maintenance costs, lower delivery efficiency |
No real-time sync support | Batch-only model; incremental sync still hand-written in Java/Flink | Could not keep pace with growing real-time synchronization demand |
Limited data sources | New sources required custom development | Hard to quickly meet the needs of private-cloud deployments and increasingly diverse data sources |
That was until we encountered Apache SeaTunnel, and everything started to change.
II. Why SeaTunnel?
“We’re not choosing a tool; we’re choosing the foundation for the next five years of data integration.”
Facing diverse data sources, real-time needs, and resource optimization pressures, we needed a “batch-stream unified, lightweight, efficient, and easily scalable” integration platform. SeaTunnel, with its open-source nature, multi-engine support, rich connectors, and active community, became our final choice. It not only solved Spark’s “heavy” issue but also laid the foundation for lakehouse integration and real-time analytics in the future.
- Engine Neutrality: Built-in Zeta, compatible with Spark/Flink, automatically switching based on data volume.
- 200+ connectors: Plugin-based; new data sources require only JSON configuration, no Java code.
- Batch and stream unified: One configuration supports full, incremental, and CDC synchronization (see the minimal sketch after this list).
- Active community: GitHub 8.8k stars, 30+ PR merges weekly, with 5 patches we contributed merged within 7 days.
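As a flavor of what "one configuration" looks like in practice, below is a minimal, illustrative SeaTunnel job definition that streams MySQL changes into Paimon; switching `job.mode` to `BATCH` with a plain `Jdbc` source turns the same shape of config into a full load. All hosts, credentials, paths, and table names are placeholders.

```hocon
env {
  parallelism = 2
  job.mode = "STREAMING"   # the same config shape is used for BATCH full loads
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://mysql-host:3306/order_db"
    username = "sync_user"
    password = "******"
    table-names = ["order_db.t_order"]
  }
}

sink {
  Paimon {
    warehouse = "jfs://dmall-lake/paimon"
    database = "ods"
    table = "t_order"
  }
}
```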
III. New Platform Architecture: Making SeaTunnel "Enterprise-Grade"
“Open-source doesn’t just mean using it as-is, but standing on the shoulders of giants to continue building.”
While SeaTunnel is powerful, truly applying it in enterprise scenarios required an "outer shell": unified management, scheduling, permissions, rate limiting, monitoring, and more. We built a visual, configurable, and extensible data integration platform around SeaTunnel, turning it from an open-source tool into the "core engine" of Dmall's data platform.
3.1 Global Architecture
Using Apache SeaTunnel as the foundation, the platform exposes a unified REST API that external systems such as the Web UI, Merchant Exchange, and MCP services can call with one click. A built-in connector template center lets new storage types be published in minutes by filling in parameters, with no coding required. The scheduling layer supports mainstream orchestrators such as Apache DolphinScheduler and Airflow. The engine layer routes jobs across Zeta, Flink, and Spark based on data volume, so small jobs run as lightweight fast tasks and large jobs run with distributed parallelism. Deployment is fully cloud-native, supporting Kubernetes, YARN, and Standalone modes, which makes private-cloud delivery straightforward and keeps the platform "template-as-a-service, engine-switchable, and deployment-unbound."
3.2 Data Integration Features
- Data Source Registration: One-time entry of address, account, and password, with sensitive fields encrypted. Public data sources like Hive are visible to all tenants.
- Connector Templates: Add connectors by configuration and define SeaTunnel config generation rules, controlling task interface Source and Sink display.
- Offline Tasks: Run batch tasks on the Zeta or Spark engine, describe synchronization tasks as DAG diagrams, and support wildcard variable injection (see the example after this list).
- Real-Time Tasks: Run stream tasks supporting Zeta and Flink engines, storing checkpoints via S3 protocol for CDC incremental synchronization.
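To illustrate the offline-task flow with variable injection, here is a hypothetical batch job config; the scheduler substitutes `${dt}` at submission time (for example via `seatunnel.sh --config job.conf -i dt=2025-06-01`). Connection details and table names are placeholders, not Dmall's real configuration.

```hocon
env {
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = "jdbc:mysql://mysql-host:3306/order_db"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "sync_user"
    password = "******"
    query = "SELECT * FROM t_order WHERE create_time >= '${dt}'"
  }
}

sink {
  Hive {
    table_name = "ods.t_order"
    metastore_uri = "thrift://metastore-host:9083"
  }
}
```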
Integration Features
- Access Application: Users submit synchronization table requests for approval to ensure data quality.
- Database Table Management: Sync by database, avoiding excessive sync paths; unified management guarantees data quality and supports merging tables.
- Base Pulling (initial load): Automatic table creation and initialization via batch tasks; large tables are split as needed and data gaps are backfilled by condition.
- Data Synchronization: Submit synchronization tasks to the cluster via REST API; supports rate-limiting and tagging features to guarantee important syncs; CDC incremental writing to multiple lakehouses.
IV. Secondary Development: Let SeaTunnel Speak "Dmall Dialect"
“No matter how excellent the open-source project, it still can't understand your business 'dialect'.”
SeaTunnel’s plugin mechanism is flexible, but it still requires us to “modify the code” to meet Dmall's custom requirements such as DDH message formats, sharding and merging tables, and dynamic partitioning. Fortunately, SeaTunnel's modular design makes secondary development efficient and controllable. Below are some key modules we've modified, each directly addressing a business pain point.
4.1 Custom DDH-Format CDC
Dmall has developed DDH to collect MySQL binlogs and push them to Kafka using Protobuf. We implemented the following:
- Custom `KafkaDeserializationSchema`:
  - Parse Protobuf messages into `SeaTunnelRow`;
  - DDL messages directly construct a `CatalogTable`, so columns are added in Paimon automatically;
  - DML messages are marked with their "before/after" images, enabling partial-column updates in downstream StarRocks.
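For a sense of what that deserialization looks like, here is a simplified, hypothetical sketch (not Dmall's production code): it implements SeaTunnel's `DeserializationSchema` to turn a DDH Protobuf payload into a `SeaTunnelRow` with the right row kind. The `DdhRecord` class and its accessors stand in for the generated Protobuf classes; DDL handling is omitted.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.seatunnel.api.serialization.DeserializationSchema;
import org.apache.seatunnel.api.table.type.RowKind;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;

public class DdhDeserializationSchema implements DeserializationSchema<SeaTunnelRow> {

    private final SeaTunnelRowType rowType;

    public DdhDeserializationSchema(SeaTunnelRowType rowType) {
        this.rowType = rowType;
    }

    @Override
    public SeaTunnelRow deserialize(byte[] message) throws IOException {
        DdhRecord record = DdhRecord.parseFrom(message);   // Protobuf-generated parser (hypothetical class)

        // Map the "after" image onto the expected schema, leaving missing columns null.
        Map<String, String> after = record.getAfterMap();
        Object[] fields = new Object[rowType.getTotalFields()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = after.get(rowType.getFieldName(i));
        }

        // Translate DDH's DML type into SeaTunnel's row kind so downstream sinks
        // (e.g. StarRocks) can apply partial-column updates or deletes.
        SeaTunnelRow row = new SeaTunnelRow(fields);
        String op = record.getOpType();
        if ("UPDATE".equals(op)) {
            row.setRowKind(RowKind.UPDATE_AFTER);
        } else if ("DELETE".equals(op)) {
            row.setRowKind(RowKind.DELETE);
        } else {
            row.setRowKind(RowKind.INSERT);
        }
        return row;
    }

    @Override
    public SeaTunnelDataType<SeaTunnelRow> getProducedType() {
        return rowType;
    }
}
```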
4.2 Router Transform: Multi-Table Merging and Dynamic Partitioning
- Scenario: Merge 1,200 sharded tables `t_order_00` … `t_order_1199` into a single Paimon table `dwd_order`.
- Implementation (a reduced Java sketch follows this list):
  - Use the regular expression `t_order_(\d+)` to map each shard to the target table;
  - Select the benchmark schema (the shard with the most fields); fields missing in other shards are filled with `NULL`;
  - Generate a new unique key from `$table_name + $pk` to avoid primary-key conflicts;
  - Extract the partition field `dt` from the string column `create_time`, automatically recognizing both `yyyy-MM-dd` and `yyyyMMdd` formats.
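Below is a reduced, illustrative sketch of those routing rules as plain Java. The real Router Transform is implemented as a SeaTunnel transform plugin; the class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ShardRouter {

    private static final Pattern SHARD = Pattern.compile("t_order_(\\d+)");

    /** Map a physical shard such as t_order_0007 onto the merged target table dwd_order. */
    public static String targetTable(String sourceTable) {
        Matcher m = SHARD.matcher(sourceTable);
        return m.matches() ? "dwd_order" : sourceTable;
    }

    /** Build a collision-free unique key by prefixing the original primary key with the shard name. */
    public static String mergedKey(String sourceTable, String primaryKey) {
        return sourceTable + "_" + primaryKey;
    }

    /** Derive the partition value dt from create_time, accepting yyyy-MM-dd... or yyyyMMdd... prefixes. */
    public static String partitionDt(String createTime) {
        if (createTime.length() >= 10 && createTime.charAt(4) == '-') {
            return createTime.substring(0, 10).replace("-", "");   // "2025-06-01 12:00:00" -> "20250601"
        }
        return createTime.substring(0, 8);                          // "20250601120000"     -> "20250601"
    }
}
```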
4.3 Hive-Sink Support for Overwrite
The community version only supports `append`. Based on PR #7843, we modified SeaTunnel to support overwrite:

- Before submitting the task, call `FileSystem.listStatus()` to resolve the old paths for the target partition values;
- After the new data is written, atomically delete the old paths, achieving an "idempotent re-run."
This improvement has been contributed back to the community and is expected to be released in version 2.3.14.
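A simplified sketch of the overwrite idea (not the exact change in PR #7843): resolve the existing partition paths before the job writes, then remove them once the new data has been committed, so re-running the task stays idempotent. The class, method names, and example path are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionOverwriteHelper {

    /** Collect the files currently making up one partition, e.g. .../dt=2025-06-01. */
    public static FileStatus[] listOldFiles(Configuration conf, String partitionPath) throws IOException {
        Path path = new Path(partitionPath);
        FileSystem fs = path.getFileSystem(conf);
        return fs.exists(path) ? fs.listStatus(path) : new FileStatus[0];
    }

    /** After the new files have been committed, delete the previously listed ones. */
    public static void deleteOldFiles(FileStatus[] oldFiles, Configuration conf) throws IOException {
        for (FileStatus status : oldFiles) {
            FileSystem fs = status.getPath().getFileSystem(conf);
            fs.delete(status.getPath(), false);
        }
    }
}
```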
4.4 Other Patches
- JuiceFS Connector: Supports mount-point caching, improving listing performance by 5x.
- Kafka 2.x Independent Module: Resolves protocol conflicts between the 0.10 and 2.x clients.
- Upgrade to JDK 11: Reduces garbage-collection time in the Zeta engine by 40%.
- New JSON and date UDFs: Added `json_extract_array` / `json_merge` and the date UDF `date_shift()`, all merged into the main branch.
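As an example of how such UDFs plug into the Zeta SQL transform, below is a hedged sketch of a `date_shift()`-style function implementing SeaTunnel's `ZetaUDF` SPI. The semantics shown (shift a `yyyy-MM-dd` string by N days) are assumed for illustration and need not match the internal version exactly.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;

import org.apache.seatunnel.api.table.type.BasicType;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.transform.sql.zeta.ZetaUDF;

public class DateShiftUDF implements ZetaUDF {

    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    @Override
    public String functionName() {
        return "DATE_SHIFT";        // usable as DATE_SHIFT(col, n) in the SQL transform
    }

    @Override
    public SeaTunnelDataType<?> resultType(List<SeaTunnelDataType<?>> argsType) {
        return BasicType.STRING_TYPE;
    }

    @Override
    public Object evaluate(List<Object> args) {
        // args[0]: date string, args[1]: number of days to shift (may be negative)
        if (args.get(0) == null || args.get(1) == null) {
            return null;
        }
        LocalDate date = LocalDate.parse(args.get(0).toString(), FMT);
        long days = Long.parseLong(args.get(1).toString());
        return date.plusDays(days).format(FMT);
    }
}
```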
V. Pitfalls: Our Real-World Challenges
“Every pitfall is a necessary step toward stability.”
No matter how mature an open-source project is, pitfalls are inevitable when deploying it in real business scenarios. During our use of SeaTunnel, we ran into version conflicts, asynchronous operations, and consumption delays. Below are some typical "pits" we encountered and how we resolved them.
Problem | Phenomenon | Root Cause | Solution |
---|---|---|---|
S3 access failure | Spark 3.3.4 conflicts with SeaTunnel's default Hadoop 3.1.4 | Two versions of the AWS SDK on the classpath | Exclude Spark's `hadoop-client`; use SeaTunnel's uber jar |
StarRocks ALTER blocked | Writes fail with "column not found" | ALTER in StarRocks is asynchronous; the client keeps writing and fails | Poll the `SHOW ALTER TABLE` job state in the sink and resume writing once it reaches FINISHED (see the sketch below) |
Slow Kafka consumption | Only 3k messages per second | The polling thread sleeps 100 ms when a poll returns empty | Contributed PR #7821 adding a "no sleep on empty poll" mode, raising throughput to 120k/s |
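For the StarRocks ALTER workaround in the table above, here is a rough, illustrative sketch of the polling idea using plain JDBC. The `SHOW ALTER TABLE COLUMN` statement and its `State`/`TableName` columns follow StarRocks' documented output but may differ across versions; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AlterStateWaiter {

    /** Block until the latest schema-change job on db.table is FINISHED, or the timeout elapses. */
    public static void awaitAlterFinished(String jdbcUrl, String user, String pwd,
                                          String db, String table, long timeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd);
             Statement stmt = conn.createStatement()) {
            String sql = "SHOW ALTER TABLE COLUMN FROM " + db
                    + " WHERE TableName = '" + table + "' ORDER BY CreateTime DESC LIMIT 1";
            while (System.currentTimeMillis() < deadline) {
                try (ResultSet rs = stmt.executeQuery(sql)) {
                    // No pending job, or the latest one has finished: safe to resume writing.
                    if (!rs.next() || "FINISHED".equalsIgnoreCase(rs.getString("State"))) {
                        return;
                    }
                }
                Thread.sleep(1_000);
            }
            throw new IllegalStateException("ALTER on " + db + "." + table + " did not finish in time");
        }
    }
}
```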
VI. Summary of Benefits: Delivering in Three Months
“Technical value must ultimately be demonstrated with numbers.”
In less than three months of using Apache SeaTunnel, we completed the migration of three merchant production environments. Not only did it “run faster,” but it also “ran cheaper.”
With support for Oracle, cloud storage, Paimon, and StarRocks, we covered all source-side needs, and real-time synchronization is no longer dependent on hand-written Flink code. The template-based, "zero-code" connector integration reduced the development time from several weeks to just 3 days. Resource consumption dropped to only 1/3 of the original Spark solution, with the same data volume running lighter and faster.
With a new UI and on-demand data source permissions, merchant IT teams can now configure tasks and monitor data flows, reducing delivery costs and improving user experience—fulfilling the three key goals of cost reduction, flexibility, and stability.
VII. Next Steps: Lakehouse + AI Dual-Drive
“Data integration is not the end, but the beginning of intelligent analysis.”
Apache SeaTunnel helped us solve the problems of fast and cost-effective data transfer. Next, we need to solve the challenges of accurate and intelligent data transfer. As technologies like Paimon, StarRocks, and LLM mature, we are building a "real-time lakehouse + AI-driven" data platform, enabling data not only to be visible but also to be usable with precision.
In the future, Dmall will write “real-time” and “intelligent” into the next line of code for its data platform:
- Lakehouse Upgrade: Fully integrate Paimon + StarRocks, reducing ODS data lake latency from hours to minutes, providing merchants with near-real-time data.
- AI Ready: Use MCP services to call LLM to auto-generate synchronization configurations, and introduce vectorized execution engines to create pipelines directly consumable by AI training, enabling "zero-code, intelligent" data integration.
- Community Interaction: Track SeaTunnel's main version updates, introduce performance optimizations, and contribute internal improvements as PRs to the community, forming a closed loop of “use-improve-open-source” and continuously amplifying the technical dividend.
VIII. A Message to My Peers
“If you're also struggling with the 'heavy' and 'slow' data synchronization, give SeaTunnel a sprint’s worth of time.”
In just 3 months, we reduced data integration costs to 1/3, improved real-time performance from hourly to minute-level, and compressed development cycles from weeks to days.
SeaTunnel is not a silver bullet, but it is light, fast, and open enough. As long as you're willing to get hands-on, it can become the "new engine" for your data platform.