<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apache SeaTunnel</title>
    <description>The latest articles on DEV Community by Apache SeaTunnel (@seatunnel).</description>
    <link>https://dev.to/seatunnel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F844122%2Fc6155eb3-df58-448b-8d88-36865c4f1d84.jpg</url>
      <title>DEV Community: Apache SeaTunnel</title>
      <link>https://dev.to/seatunnel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seatunnel"/>
    <language>en</language>
    <item>
      <title>Why Apache SeaTunnel Zeta Can Be Both “Fast and Stable”</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:29:31 +0000</pubDate>
      <link>https://dev.to/seatunnel/why-apache-seatunnel-zeta-can-be-both-fast-and-stable-2e61</link>
      <guid>https://dev.to/seatunnel/why-apache-seatunnel-zeta-can-be-both-fast-and-stable-2e61</guid>
      <description>&lt;p&gt;If SeaTunnel Zeta is simply understood as “a faster execution engine,” its true value will be underestimated.&lt;/p&gt;

&lt;p&gt;For data integration systems, the real challenge has never been “whether the pipeline can run,” but whether the following can be achieved at the same time: sufficiently high throughput, recoverability after failure, no data duplication or loss, and controlled resource consumption.&lt;/p&gt;

&lt;p&gt;What makes Zeta worth serious attention lies exactly here: it does not win through a single performance optimization, but instead turns consistency, recovery, convergence under concurrency, and resource control into a closed-loop system capability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: This article is based on SeaTunnel commit &lt;code&gt;c5ceb6490&lt;/code&gt;; all source code interpretations refer to this version. Runtime observations are based on the official &lt;code&gt;apache/seatunnel:2.3.13&lt;/code&gt; image and are intended to help understand the mechanisms, not as a strict benchmark for this commit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion First&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;From an architect’s perspective, SeaTunnel Zeta does not achieve both high throughput and stability through a single “performance optimization point,” but instead forms a closed loop of four capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt;: when checkpoints are triggered, timed out, and completed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State plane&lt;/strong&gt;: how task state is snapshotted, persisted, restored, and remapped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data plane&lt;/strong&gt;: how Barrier, Record, and Close signals converge in order under high concurrency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource plane&lt;/strong&gt;: how resources are modeled, allocated, and throttled to prevent the system from overwhelming itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these four layers can be missing. If the contract of any layer is broken, it will eventually manifest as duplicate writes, stalled recovery, checkpoint timeouts, or resource instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Looking at the Big Picture: Zeta Solves Not Just “Fast,” but “Fast and Stable”&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most typical contradiction in data integration systems has never been “whether they can run,” but whether the following three conditions can be satisfied simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput is high enough to avoid becoming a bottleneck&lt;/li&gt;
&lt;li&gt;Recoverable after failure, without data loss or duplication upon restart&lt;/li&gt;
&lt;li&gt;Resource consumption is controllable, without exhausting the cluster in pursuit of stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why I prefer to understand Zeta as a &lt;strong&gt;stability engine for data integration scenarios&lt;/strong&gt;, rather than a generalized computing engine.&lt;/p&gt;

&lt;p&gt;From the source code design, it decomposes the problem into four clearly defined planes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt;: &lt;code&gt;CheckpointCoordinator&lt;/code&gt; is responsible for triggering, progressing, completing, timing out, and terminating checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State plane&lt;/strong&gt;: &lt;code&gt;CheckpointStorage&lt;/code&gt;, &lt;code&gt;CompletedCheckpoint&lt;/code&gt;, and &lt;code&gt;ActionSubtaskState&lt;/code&gt; handle snapshotting and recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data plane&lt;/strong&gt;: &lt;code&gt;SourceSplitEnumeratorTask&lt;/code&gt;, Writers, Aggregated Committer, and intermediate queues embed control signals into the data processing flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource plane&lt;/strong&gt;: &lt;code&gt;ResourceProfile&lt;/code&gt;, &lt;code&gt;DefaultSlotService&lt;/code&gt;, and &lt;code&gt;read_limit&lt;/code&gt; handle resource profiling, dynamic allocation, and throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1.1 Architecture Overview&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2x4ayb8zo5a7ipm3zd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2x4ayb8zo5a7ipm3zd9.png" alt="1" width="800" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Architectural judgment: The highlight of Zeta is not the complexity of individual modules, but that it places “consistency, recovery, concurrency, and resources” into a unified protocol.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Exactly-Once Is Not a Single Capability, but a Cross-Layer Contract&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many articles describe Exactly-Once as “the engine supports checkpoints, therefore Exactly-Once is guaranteed.” This is not rigorous from an architectural perspective.&lt;/p&gt;

&lt;p&gt;In Zeta, Exactly-Once is at least divided into two layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engine-level guarantees&lt;/strong&gt;: Barrier alignment, state snapshotting, completion ordering, and failure rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connector-level guarantees&lt;/strong&gt;: &lt;code&gt;prepareCommit&lt;/code&gt; must produce transferable and replayable &lt;code&gt;CommitInfo&lt;/code&gt;, and &lt;code&gt;commit&lt;/code&gt; must be idempotent and retryable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, Zeta provides an &lt;strong&gt;execution framework for Exactly-Once&lt;/strong&gt;, rather than automatically guaranteeing it for all connectors.&lt;/p&gt;

&lt;p&gt;In addition, the Sink side does not have only one commit path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the connector implements &lt;code&gt;SinkAggregatedCommitter&lt;/code&gt;, it follows the path: Writer &lt;code&gt;prepareCommit&lt;/code&gt; → Aggregated Committer aggregation → unified commit after &lt;code&gt;notifyCheckpointComplete&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If the connector only implements &lt;code&gt;SinkCommitter&lt;/code&gt;, the commit happens directly inside &lt;code&gt;notifyCheckpointComplete(...)&lt;/code&gt; of the Writer task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following analysis focuses on the first path, as it better reflects Zeta’s coordination of consistency and commit timing at the engine level.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.1 What It Actually Guarantees&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Taking the &lt;code&gt;SinkAggregatedCommitter&lt;/code&gt; path as an example, the Exactly-Once main flow in Zeta is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CheckpointCoordinator&lt;/code&gt; triggers a checkpoint and injects barriers into tasks&lt;/li&gt;
&lt;li&gt;Each participant snapshots state at the barrier boundary and sends ACK&lt;/li&gt;
&lt;li&gt;Sink Writer calls &lt;code&gt;prepareCommit(checkpointId)&lt;/code&gt; without committing externally&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SinkAggregatedCommitterTask&lt;/code&gt; aggregates CommitInfo and includes the result in checkpoint state&lt;/li&gt;
&lt;li&gt;Only when the Coordinator determines the checkpoint is complete does it trigger the actual &lt;code&gt;commit(...)&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
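&lt;p&gt;The five steps above can be condensed into a runnable sketch. This is illustrative Python, not SeaTunnel’s Java implementation; the names (&lt;code&gt;MockWriter&lt;/code&gt;, &lt;code&gt;prepare_commit&lt;/code&gt;, &lt;code&gt;notify_checkpoint_complete&lt;/code&gt;) only mirror the connector concepts discussed here.&lt;/p&gt;

```python
# Illustrative sketch of the aggregated-commit flow (not SeaTunnel's actual Java code).
# Invariant: external side effects happen only after the coordinator marks the
# checkpoint complete.

class MockWriter:
    def __init__(self, name):
        self.name = name
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)

    def prepare_commit(self, checkpoint_id):
        # Phase one: hand back a transferable CommitInfo, commit nothing yet.
        info = {"writer": self.name, "checkpoint": checkpoint_id, "rows": len(self.buffer)}
        self.buffer = []
        return info

class MockAggregatedCommitter:
    def __init__(self):
        self.pending = {}    # checkpoint_id mapped to its list of CommitInfo
        self.committed = []  # externally visible only after completion

    def add(self, checkpoint_id, commit_info):
        self.pending.setdefault(checkpoint_id, []).append(commit_info)

    def notify_checkpoint_complete(self, checkpoint_id):
        # Phase two: only now do side effects become externally visible.
        self.committed.extend(self.pending.pop(checkpoint_id, []))

writers = [MockWriter("w0"), MockWriter("w1")]
committer = MockAggregatedCommitter()
for w in writers:
    w.write("row")
for w in writers:
    committer.add(1, w.prepare_commit(1))

assert committer.committed == []      # nothing visible before completion
committer.notify_checkpoint_complete(1)
assert len(committer.committed) == 2  # visible only after checkpoint 1 completes
```

&lt;p&gt;The key invariant to notice: &lt;code&gt;committed&lt;/code&gt; stays empty until the coordinator signals completion.&lt;/p&gt;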

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh5qjqxukyp1azflkyzx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh5qjqxukyp1azflkyzx.jpg" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architectural meaning of this chain is very clear: &lt;strong&gt;first solidify the consistency boundary, then perform external side effects.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.2 Why This Design Matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the Writer commits to the external system immediately after local processing, once the checkpoint fails to complete, the system will face two classic problems after recovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State not saved but external commit already happened → irreversible duplication&lt;/li&gt;
&lt;li&gt;Upstream replay writes again → logically at-least-once, but claimed as Exactly-Once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zeta delays the commit action until after &lt;code&gt;notifyCheckpointComplete&lt;/code&gt;, essentially doing one thing: &lt;strong&gt;binding external visible side effects to the completion of consistency.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.3 Architectural Boundaries Must Be Clear&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If this is not clearly stated, it is easy to misinterpret:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SinkWriter.prepareCommit(checkpointId)&lt;/code&gt; is not a normal flush, but a phase-one protocol action&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SinkCommitter.commit(...)&lt;/code&gt; must be idempotent, otherwise duplicates may still occur after recovery&lt;/li&gt;
&lt;li&gt;If the external system does not support idempotency or transactional semantics, engine-level Exactly-Once will degrade&lt;/li&gt;
&lt;/ul&gt;
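&lt;p&gt;Why &lt;code&gt;commit(...)&lt;/code&gt; must be idempotent can be shown in a few lines. The sketch below is an illustration in Python, assuming a dedup table keyed by checkpoint and writer; it is not the actual &lt;code&gt;SinkCommitter&lt;/code&gt; contract of any specific connector.&lt;/p&gt;

```python
# Sketch of an idempotent commit: replaying the same CommitInfo after recovery
# must not produce a second external write. Names are hypothetical, not the
# actual SinkCommitter API.

class IdempotentSink:
    def __init__(self):
        self.applied_txn_ids = set()  # a persisted dedup table in a real system
        self.rows_written = 0

    def commit(self, commit_info):
        txn = (commit_info["checkpoint"], commit_info["writer"])
        if txn in self.applied_txn_ids:
            return "skipped"          # replay after recovery: a no-op
        self.applied_txn_ids.add(txn)
        self.rows_written += commit_info["rows"]
        return "applied"

sink = IdempotentSink()
info = {"checkpoint": 1, "writer": "w0", "rows": 100}
assert sink.commit(info) == "applied"
assert sink.commit(info) == "skipped"  # retried commit does not double-write
assert sink.rows_written == 100
```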

&lt;blockquote&gt;
&lt;p&gt;Architectural judgment: Exactly-Once is not a “switch,” but a responsibility chain across engine, connectors, and external systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2.4 What Is the Cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every architectural benefit comes with a cost, and Exactly-Once is no exception:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The more frequent the checkpoints, the higher the cost of Barrier handling and state serialization&lt;/li&gt;
&lt;li&gt;External commits are delayed, introducing additional commit paths and state buffering&lt;/li&gt;
&lt;li&gt;If Sink idempotency is not well designed, complexity shifts to connector implementers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. The Key to Resume Is Not Just Restoring State, but Restoring Protocol Progress&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many systems stop at “restoring state objects.” But in distributed data integration, this is not enough, because &lt;strong&gt;the protocol itself has progress&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three points in Zeta’s recovery path are particularly worth attention.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.1 Recovery Is Not a Direct Restore, but a Remapping Based on Current Parallelism&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CheckpointCoordinator.restoreTaskState(...)&lt;/code&gt; does not simply assign old state back to the original subtask. Instead, it determines the correct execution unit based on current parallelism and mapping.&lt;/p&gt;

&lt;p&gt;This means it considers not “who ran last time,” but “who should take over this time.”&lt;/p&gt;

&lt;p&gt;This is crucial because real-world recovery often involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker relocation&lt;/li&gt;
&lt;li&gt;Parallelism changes&lt;/li&gt;
&lt;li&gt;Slot reallocation&lt;/li&gt;
&lt;/ul&gt;
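&lt;p&gt;A minimal sketch of the remapping idea, assuming a simple modulo policy; the actual mapping inside &lt;code&gt;restoreTaskState(...)&lt;/code&gt; may differ, the point is only that old state is assigned to whoever should take over now:&lt;/p&gt;

```python
# Sketch of remapping old subtask states onto a new parallelism, e.g. after a
# restart lowered parallelism from 4 to 2. Modulo assignment is an illustrative
# policy, not necessarily the exact one Zeta uses.

def remap_states(old_states, new_parallelism):
    # old_states: a dict of old_subtask_index mapped to its saved state
    assignment = {i: [] for i in range(new_parallelism)}
    for old_index, state in sorted(old_states.items()):
        assignment[old_index % new_parallelism].append(state)
    return assignment

old = {0: "s0", 1: "s1", 2: "s2", 3: "s3"}
assert remap_states(old, 2) == {0: ["s0", "s2"], 1: ["s1", "s3"]}
```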

&lt;h3&gt;
  
  
  &lt;strong&gt;3.2 The Core of Source Recovery Lies in the Enumerator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;On the Source side, what truly determines whether reading can continue correctly is not just the reader itself, but the allocation state of splits.&lt;/p&gt;

&lt;p&gt;Therefore, Zeta places the recovery focus on &lt;code&gt;SourceSplitEnumerator&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During checkpoint: execute &lt;code&gt;snapshotState(checkpointId)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;During recovery: &lt;code&gt;SourceSplitEnumeratorTask.restoreState(...)&lt;/code&gt; decides whether to call &lt;code&gt;restoreEnumerator(...)&lt;/code&gt; or &lt;code&gt;createEnumerator(...)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Then &lt;code&gt;open()&lt;/code&gt; is invoked and subsequent coordination resumes&lt;/li&gt;
&lt;/ul&gt;
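&lt;p&gt;The restore-or-create decision can be sketched as follows; the dictionary shapes and split names are purely hypothetical, chosen only to show the branch:&lt;/p&gt;

```python
# Sketch of the restore-or-create decision made during enumerator recovery.
# Mirrors the concept in SourceSplitEnumeratorTask.restoreState(...), but the
# bodies and data shapes are illustrative.

def build_enumerator(saved_state):
    if saved_state is not None:
        # A usable snapshot exists: resume the scheduler where it left off.
        return {"mode": "restored", "pending_splits": saved_state["pending_splits"]}
    # First start, or no usable snapshot: enumerate splits from scratch.
    return {"mode": "created", "pending_splits": ["split-0", "split-1"]}

assert build_enumerator(None)["mode"] == "created"
restored = build_enumerator({"pending_splits": ["split-1"]})
assert restored["mode"] == "restored"
assert restored["pending_splits"] == ["split-1"]
```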

&lt;p&gt;This shows that its recovery approach is not about “restoring threads,” but about “restoring the scheduler.”&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3.3 What Truly Reflects Stability Engineering Is “Protocol Signal Compensation”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the most valuable details in Zeta’s recovery path is the re-signaling of &lt;code&gt;NoMoreSplits&lt;/code&gt; after reader re-registration.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;SourceSplitEnumeratorTask.receivedReader(...)&lt;/code&gt;, if a reader has previously been marked as having no more splits, then when it re-registers after recovery, the system will again call &lt;code&gt;signalNoMoreSplits&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This detail is highly significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is restored is not just data state&lt;/li&gt;
&lt;li&gt;Nor just split allocation results&lt;/li&gt;
&lt;li&gt;But also the fact that “this reader has already reached the end of the protocol”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this step, the system may appear to have “successfully restored state,” but the reader could remain stuck waiting for more splits indefinitely.&lt;/p&gt;
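&lt;p&gt;A minimal Python sketch of this compensation logic, with illustrative names rather than the real &lt;code&gt;SourceSplitEnumeratorTask&lt;/code&gt; fields:&lt;/p&gt;

```python
# Sketch of protocol-signal compensation: if a reader had already been told
# "no more splits" before the failure, re-registration must re-send that
# signal, otherwise the reader waits forever. Illustrative, not the real code.

class EnumeratorSketch:
    def __init__(self):
        self.no_more_splits = set()   # readers that reached the end of the protocol
        self.signals_sent = []

    def signal_no_more_splits(self, reader_id):
        self.no_more_splits.add(reader_id)
        self.signals_sent.append(reader_id)

    def received_reader(self, reader_id):
        # Compensation on re-registration after recovery: replay the end signal.
        if reader_id in self.no_more_splits:
            self.signals_sent.append(reader_id)

e = EnumeratorSketch()
e.signal_no_more_splits("reader-0")
e.received_reader("reader-0")                      # re-registers after recovery
assert e.signals_sent == ["reader-0", "reader-0"]  # the signal was replayed
```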

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s4yprsf7virt0dtj8l3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s4yprsf7virt0dtj8l3.jpg" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Architectural judgment: A truly mature recovery mechanism restores “state + protocol position + control signals,” not just a serialized object.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. In High-Concurrency Systems, the Real Risk Is Not Slowness, but Lack of Convergence&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When people think of high concurrency, they often think of parallelism, threads, and queue length. But for data integration engines, the more dangerous issue is actually: &lt;strong&gt;whether control messages are drowned out, and whether the shutdown process loses control.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zeta’s design here reflects a clear engineering mindset.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.1 The Parallel Model Is Not the Highlight, the Convergence Model Is&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From the task model perspective, Zeta’s high concurrency is not mysterious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source/Sink improve throughput via multiple Readers and Writers&lt;/li&gt;
&lt;li&gt;Pipelines scale throughput via task parallelism&lt;/li&gt;
&lt;li&gt;Aggregated Committer waits until all necessary writers are registered and aligned before advancing lifecycle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are standard practices in distributed execution engines.&lt;/p&gt;

&lt;p&gt;What stands out is that it does not treat “parallelism” as simply increasing processing threads, but treats &lt;strong&gt;how to terminate in an orderly way under concurrency&lt;/strong&gt; as a first-class concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.2 Barrier Priority Is Essentially Protecting the Control Plane&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the implementations of &lt;code&gt;RecordEventProducer&lt;/code&gt; and &lt;code&gt;IntermediateBlockingQueue&lt;/code&gt;, when a Barrier arrives, it is acknowledged with priority. If that Barrier triggers &lt;code&gt;prepareClose&lt;/code&gt; for the current task, the system enters the &lt;code&gt;prepareClose&lt;/code&gt; state, and ordinary records are no longer accepted into the queue.&lt;/p&gt;

&lt;p&gt;This design addresses two common pitfalls in high-concurrency systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control signals being drowned by data traffic&lt;/strong&gt;: Barriers cannot reach boundaries, and consistency cannot converge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data still flowing during shutdown&lt;/strong&gt;: Records continue after checkpoint boundaries, breaking semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, this is not “queue optimization,” but an architectural decision where &lt;strong&gt;control takes priority over throughput&lt;/strong&gt;.&lt;/p&gt;
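&lt;p&gt;The behavior can be modeled as a small queue sketch. This is an assumption-laden illustration of the policy, not &lt;code&gt;IntermediateBlockingQueue&lt;/code&gt; itself:&lt;/p&gt;

```python
# Sketch of "control takes priority over throughput": barriers are always
# accepted, and once a barrier moves the task into prepareClose, ordinary
# records are rejected at the queue boundary.

class BarrierAwareQueue:
    def __init__(self):
        self.items = []
        self.prepare_close = False

    def offer_record(self, record):
        if self.prepare_close:
            return False              # no data may pass the checkpoint boundary
        self.items.append(("record", record))
        return True

    def offer_barrier(self, barrier_id, triggers_close=False):
        self.items.append(("barrier", barrier_id))
        if triggers_close:
            self.prepare_close = True
        return True

q = BarrierAwareQueue()
assert q.offer_record("r1")
assert q.offer_barrier(1, triggers_close=True)
assert not q.offer_record("r2")       # rejected after prepareClose
assert q.items == [("record", "r1"), ("barrier", 1)]
```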

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgifeusghxwss5tpssa1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgifeusghxwss5tpssa1r.png" alt="2" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4.3 Why This Is Especially Important for Data Integration Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In data integration pipelines, downstream systems are often slower than upstream, and network/storage jitter is common.&lt;/p&gt;

&lt;p&gt;If the system simply increases concurrency mechanically, three consequences arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queue buildup worsens&lt;/li&gt;
&lt;li&gt;Checkpoint cost increases&lt;/li&gt;
&lt;li&gt;Shutdown and recovery become harder to converge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what Zeta demonstrates here is not just “high concurrency capability,” but:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It knows when to continue throughput, and when to first enforce consistency and lifecycle convergence.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Low Resource Usage Is Not About Using Fewer Machines, but About Restraining Resource Decisions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;“Low resource usage” is often misunderstood as “this engine consumes fewer machines.” Architecturally, a more accurate statement is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The system avoids wasting resources on ineffective competition through a simpler resource model and explicit throttling mechanisms.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5.1 The Value of a Minimal Resource Model Lies in Low Scheduling Cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ResourceProfile&lt;/code&gt; uses CPU and Memory as core resource descriptors, and provides &lt;code&gt;merge&lt;/code&gt;, &lt;code&gt;subtract&lt;/code&gt;, and &lt;code&gt;enoughThan&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not a highly detailed model, but it has two practical advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplicity → low scheduling computation cost&lt;/li&gt;
&lt;li&gt;Generality → suitable for volatile and heterogeneous data integration workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is also clear: it has limited expressiveness for network, disk, and downstream service bottlenecks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Architectural judgment: This is a “good enough” resource model, not a “precise simulation” model.&lt;/p&gt;
&lt;/blockquote&gt;
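&lt;p&gt;A Python sketch of this minimal model, with hypothetical field names (&lt;code&gt;cpu&lt;/code&gt;, &lt;code&gt;memory_mb&lt;/code&gt;); the real &lt;code&gt;ResourceProfile&lt;/code&gt; is a Java class and its exact semantics may differ:&lt;/p&gt;

```python
# Sketch of the minimal resource model: CPU plus memory, with merge, subtract,
# and an enough_than check. Illustrative only.

class ResourceProfileSketch:
    def __init__(self, cpu, memory_mb):
        self.cpu = cpu
        self.memory_mb = memory_mb

    def merge(self, other):
        return ResourceProfileSketch(self.cpu + other.cpu, self.memory_mb + other.memory_mb)

    def subtract(self, other):
        return ResourceProfileSketch(self.cpu - other.cpu, self.memory_mb - other.memory_mb)

    def enough_than(self, other):
        # True when this profile covers the other on every dimension.
        covers_cpu = max(self.cpu, other.cpu) == self.cpu
        covers_mem = max(self.memory_mb, other.memory_mb) == self.memory_mb
        return covers_cpu and covers_mem

total = ResourceProfileSketch(8, 15360)
request = ResourceProfileSketch(2, 4096)
assert total.enough_than(request)
remaining = total.subtract(request)
assert remaining.cpu == 6 and remaining.memory_mb == 11264
```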

&lt;h3&gt;
  
  
  &lt;strong&gt;5.2 Dynamic Slots Are Essentially Elastic Partitioning Based on Remaining Capacity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;DefaultSlotService.requestSlot(...)&lt;/code&gt;, if dynamic slots are enabled and remaining resources can satisfy the requested profile, a new &lt;code&gt;SlotProfile&lt;/code&gt; is created on demand.&lt;/p&gt;

&lt;p&gt;This means slots are not statically partitioned, but dynamically sliced based on available capacity.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher resource utilization&lt;/li&gt;
&lt;li&gt;More flexible scheduling&lt;/li&gt;
&lt;li&gt;Suitable for mixed workloads with fluctuating load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But this does not mean the system is immune to overload. If upstream jobs expand parallelism uncontrollably, dynamic slots will only expose the problem faster.&lt;/p&gt;
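&lt;p&gt;The carving idea can be sketched as follows, assuming capacity is just CPU plus memory; this illustrates the shape of &lt;code&gt;requestSlot(...)&lt;/code&gt;, not its actual code:&lt;/p&gt;

```python
# Sketch of dynamic slot allocation: a slot is sliced from remaining capacity
# on demand instead of being picked from a fixed pre-partitioned pool.

def request_slot(remaining, requested):
    # remaining / requested: dicts with "cpu" and "memory_mb"
    covers_cpu = max(remaining["cpu"], requested["cpu"]) == remaining["cpu"]
    covers_mem = max(remaining["memory_mb"], requested["memory_mb"]) == remaining["memory_mb"]
    if not (covers_cpu and covers_mem):
        return None, remaining        # not enough capacity left
    new_remaining = {
        "cpu": remaining["cpu"] - requested["cpu"],
        "memory_mb": remaining["memory_mb"] - requested["memory_mb"],
    }
    return {"profile": dict(requested)}, new_remaining

slot, left = request_slot({"cpu": 4, "memory_mb": 8192}, {"cpu": 1, "memory_mb": 2048})
assert slot is not None
assert left == {"cpu": 3, "memory_mb": 6144}
slot2, _ = request_slot(left, {"cpu": 8, "memory_mb": 1024})
assert slot2 is None                  # an oversized request is rejected
```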

&lt;h3&gt;
  
  
  &lt;strong&gt;5.3 What Actually Suppresses Resource Instability Is Checkpoint Throttling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;checkpointInterval&lt;/code&gt;, &lt;code&gt;checkpointMinPause&lt;/code&gt;, and &lt;code&gt;checkpointTimeout&lt;/code&gt; are not just configurations, but stability valves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;interval&lt;/code&gt;: how frequently snapshots occur&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minPause&lt;/code&gt;: enforced gap between checkpoints&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeout&lt;/code&gt;: maximum duration before abort&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Improper configuration leads to a vicious cycle:&lt;/p&gt;

&lt;p&gt;Frequent checkpoints → higher state cost → slower barriers → more timeouts → more recovery → increased resource instability&lt;/p&gt;
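&lt;p&gt;How the two timing knobs interact can be written as a one-line formula. This is an idealized illustration of the valve behavior, not the coordinator’s actual code:&lt;/p&gt;

```python
# Sketch: the next checkpoint cannot start before last_trigger + interval,
# nor before last_complete + min_pause. Times are in milliseconds.

def next_trigger_time(last_trigger, last_complete, interval, min_pause):
    return max(last_trigger + interval, last_complete + min_pause)

# A slow checkpoint (completed late) pushes the next one out via min_pause:
assert next_trigger_time(0, 1500, interval=2000, min_pause=5000) == 6500
# With a small min_pause, the interval is the binding constraint:
assert next_trigger_time(0, 100, interval=2000, min_pause=500) == 2000
```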

&lt;h3&gt;
  
  
  &lt;strong&gt;5.4 Throttling Is Often More Effective Than Scaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Configurations like &lt;code&gt;read_limit.rows_per_second&lt;/code&gt; and &lt;code&gt;read_limit.bytes_per_second&lt;/code&gt; have high architectural value.&lt;/p&gt;

&lt;p&gt;Because often the system is not “computationally insufficient,” but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downstream cannot keep up&lt;/li&gt;
&lt;li&gt;Excessive concurrency only creates retries and backlog&lt;/li&gt;
&lt;li&gt;Resources are wasted on ineffective contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, for slow or rate-limited downstream systems, the recommended approach is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Throttle first, observe, then scale.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
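&lt;p&gt;The effect of row throttling is easy to bound with arithmetic, assuming the source is otherwise faster than the limit:&lt;/p&gt;

```python
# Back-of-the-envelope lower bound on job duration once
# read_limit.rows_per_second binds.

def min_duration_seconds(total_rows, rows_per_second):
    return total_rows / rows_per_second

# 100 rows at 5 rows/s cannot finish in fewer than 20 seconds, which matches
# the order of magnitude of the ~21s observation later in this article.
assert min_duration_seconds(100, 5) == 20.0
```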

&lt;h3&gt;
  
  
  &lt;strong&gt;5.5 Closed Loop of Resource Scheduling and Throttling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d37vb54g86moowzgl37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d37vb54g86moowzgl37.png" alt="3" width="800" height="1120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. From an Architectural Perspective, What Scenarios Is Zeta Suitable For&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;From the current design, Zeta’s strengths are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear data integration pipelines from Source to Sink&lt;/li&gt;
&lt;li&gt;Need for recoverable and traceable consistency guarantees&lt;/li&gt;
&lt;li&gt;Production environments where manual intervention after recovery is unacceptable&lt;/li&gt;
&lt;li&gt;Desire to maintain stable operation under limited resources via dynamic allocation and throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correspondingly, its focus is not on maximizing every operator capability, but on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clearly defining consistency boundaries&lt;/li&gt;
&lt;li&gt;Completing recovery loops&lt;/li&gt;
&lt;li&gt;Ensuring convergence under concurrency&lt;/li&gt;
&lt;li&gt;Turning resource control into a system-level capability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. If You Want to Apply It in Practice, Focus on These Four Things&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.1 For Connector Developers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Do not treat &lt;code&gt;prepareCommit(checkpointId)&lt;/code&gt; as a normal flush&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;commit(...)&lt;/code&gt; must be idempotent and retryable&lt;/li&gt;
&lt;li&gt;External side effects must align with checkpoint completion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.2 For Source Developers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;snapshotState(...)&lt;/code&gt; and &lt;code&gt;run(...)&lt;/code&gt; may run concurrently; ensure thread safety&lt;/li&gt;
&lt;li&gt;Fully implement &lt;code&gt;addSplitsBack(...)&lt;/code&gt; and reader failover&lt;/li&gt;
&lt;li&gt;Do not only restore split state while ignoring protocol termination signals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.3 For Operators&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Do not assume higher parallelism is always better&lt;/li&gt;
&lt;li&gt;Tune &lt;code&gt;checkpoint.interval&lt;/code&gt;, &lt;code&gt;checkpoint.timeout&lt;/code&gt;, and &lt;code&gt;min-pause&lt;/code&gt; first&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;read_limit&lt;/code&gt; for fragile downstream systems&lt;/li&gt;
&lt;li&gt;Prefer cluster mode for &lt;code&gt;savepoint / restore&lt;/code&gt; demonstrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7.4 For Architecture Reviewers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate Exactly-Once together with external system idempotency&lt;/li&gt;
&lt;li&gt;Evaluate recovery beyond state snapshots, including protocol compensation&lt;/li&gt;
&lt;li&gt;Evaluate performance not just by throughput, but by convergence during shutdown and recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. How to Interpret “Performance Data”: Do Not Prove Architecture with Out-of-Context Numbers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In an architecture article, it is not valid to conclude that an architecture is “advanced” from a single set of &lt;code&gt;Total Read/Write&lt;/code&gt; and &lt;code&gt;Total Time&lt;/code&gt; figures.&lt;/p&gt;

&lt;p&gt;The sample statistics in the quick-start documentation can only demonstrate three things at most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pipeline is runnable.&lt;/li&gt;
&lt;li&gt;Read/write forms a closed loop.&lt;/li&gt;
&lt;li&gt;No failures occur in the minimal environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It alone cannot prove upper limits of high concurrency, recovery efficiency, or cost-performance ratio under different resource specifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8.1 Supplement: Minimal Testing Better Illustrates “The Importance of Context”&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I performed three additional minimal run validations. The environment was a single Ubuntu host with &lt;code&gt;8 vCPU / 15Gi RAM&lt;/code&gt;, running the official &lt;code&gt;apache/seatunnel:2.3.13&lt;/code&gt; image in local mode.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official batch template: &lt;code&gt;32 / 32 / 0&lt;/code&gt;, total time &lt;code&gt;3s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Custom batch job, &lt;code&gt;parallelism=1, row.num=1000&lt;/code&gt;: &lt;code&gt;1000 / 1000 / 0&lt;/code&gt;, total time &lt;code&gt;3s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Custom batch job, &lt;code&gt;parallelism=4, row.num=1000&lt;/code&gt;: &lt;code&gt;4000 / 4000 / 0&lt;/code&gt;, total time &lt;code&gt;3s&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three sets of data clearly show: &lt;strong&gt;the same total time may correspond to completely different data volumes and parallelism settings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Therefore, drawing conclusions about "performance" without parallelism, data scale, resource specifications, and job type easily leads to distortion.&lt;/p&gt;
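&lt;p&gt;The point is easier to see as effective throughput rather than wall time. The labels below are mine, the numbers are from the three runs above:&lt;/p&gt;

```python
# Same 3s wall time, but effective throughput spans two orders of magnitude.

runs = [
    {"label": "official template", "rows": 32, "seconds": 3},
    {"label": "parallelism=1", "rows": 1000, "seconds": 3},
    {"label": "parallelism=4", "rows": 4000, "seconds": 3},
]
for r in runs:
    r["rows_per_second"] = r["rows"] / r["seconds"]

assert runs[0]["rows_per_second"] == 32 / 3     # roughly 10.7 rows/s
assert runs[2]["rows_per_second"] == 4000 / 3   # roughly 1333 rows/s
```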

&lt;h3&gt;
  
  
  8.2 What Else Can These Tests Demonstrate
&lt;/h3&gt;

&lt;p&gt;In a batch job lasting approximately &lt;code&gt;12s&lt;/code&gt;, I added the following local-mode control-plane validations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With &lt;code&gt;checkpoint.interval = 2000&lt;/code&gt;, &lt;code&gt;5&lt;/code&gt; regular checkpoints plus &lt;code&gt;1&lt;/code&gt; final checkpoint were observed.&lt;/li&gt;
&lt;li&gt;After adding &lt;code&gt;min-pause = 5000&lt;/code&gt;, only &lt;code&gt;2&lt;/code&gt; regular checkpoints plus &lt;code&gt;1&lt;/code&gt; final checkpoint were observed within similar job duration.&lt;/li&gt;
&lt;li&gt;After adding &lt;code&gt;read_limit.rows_per_second = 5&lt;/code&gt;, for the same &lt;code&gt;100&lt;/code&gt; rows, job duration increased from ~&lt;code&gt;12s&lt;/code&gt; to ~&lt;code&gt;21s&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shows that &lt;code&gt;min-pause&lt;/code&gt; and &lt;code&gt;read_limit&lt;/code&gt; are not "decorative configurations" — they actually change control rhythm and runtime.&lt;/p&gt;
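&lt;p&gt;The observed counts can even be reproduced with a tiny calculation, under the idealized assumption that each checkpoint completes instantly:&lt;/p&gt;

```python
# With instant completion, triggers land every max(interval, min_pause) ms;
# only triggers strictly before job end count as regular checkpoints.

def count_regular_checkpoints(job_ms, interval_ms, min_pause_ms):
    step = max(interval_ms, min_pause_ms)
    return (job_ms - 1) // step

assert count_regular_checkpoints(12000, 2000, 0) == 5     # matches the 5 observed
assert count_regular_checkpoints(12000, 2000, 5000) == 2  # matches the 2 observed
```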

&lt;p&gt;I also performed a validation in &lt;strong&gt;single-machine cluster mode&lt;/strong&gt; specifically for &lt;code&gt;savepoint / restore&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After running for &lt;code&gt;8s&lt;/code&gt; in a ~&lt;code&gt;50s&lt;/code&gt; batch job, job status remained &lt;code&gt;RUNNING&lt;/code&gt;, and checkpoint overview recorded &lt;code&gt;6&lt;/code&gt; completed checkpoints.&lt;/li&gt;
&lt;li&gt;After executing &lt;code&gt;-s&lt;/code&gt;, job status became &lt;code&gt;SAVEPOINT_DONE&lt;/code&gt;, and &lt;code&gt;SAVEPOINT_TYPE&lt;/code&gt; appeared in checkpoint history.&lt;/li&gt;
&lt;li&gt;Using the same &lt;code&gt;jobId&lt;/code&gt; to execute &lt;code&gt;-r&lt;/code&gt; for restoration, foreground restoration completed in ~&lt;code&gt;37s&lt;/code&gt;, final statistics &lt;code&gt;500 / 500 / 0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the final line &lt;code&gt;500 / 500 / 0&lt;/code&gt; alone, you cannot tell whether the job truly resumed from where it left off. But combined with the prior ~&lt;code&gt;16s&lt;/code&gt; runtime and the savepoint records, the more reasonable engineering judgment is:&lt;br&gt;
&lt;strong&gt;the restoration processed the remaining splits rather than re-running the whole job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I also tested adding &lt;code&gt;read_limit.bytes_per_second = 10000&lt;/code&gt; to a large-field example; total duration remained ~&lt;code&gt;12s&lt;/code&gt;.&lt;br&gt;
This more likely indicates that under this load pattern, &lt;code&gt;FakeSource&lt;/code&gt; split reading became the bottleneck first — not simply that "byte rate limiting does not work."&lt;br&gt;
It again proves: &lt;strong&gt;discussing performance numbers without load context easily leads to misjudgment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, these are only &lt;strong&gt;runtime observations&lt;/strong&gt;, not strict benchmarks of the &lt;code&gt;c5ceb6490&lt;/code&gt; build.&lt;br&gt;
They support the claim that the mechanisms work and that metrics must be interpreted carefully — not a claim of absolute performance leadership.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Recommended Observation Metrics for Real Pressure Testing
&lt;/h2&gt;

&lt;p&gt;Instead of only looking at throughput, I suggest observing four types of metrics simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency metrics&lt;/strong&gt;: duplication, loss, unfinished commits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery metrics&lt;/strong&gt;: time to recover after failure, need for manual intervention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource metrics&lt;/strong&gt;: CPU, Heap, thread count, checkpoint duration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convergence metrics&lt;/strong&gt;: data inflow during shutdown, barrier delays&lt;/li&gt;
&lt;/ul&gt;
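&lt;p&gt;The consistency metrics can be derived from counters most runs already print. A minimal sketch of the check, assuming you can read source-emitted rows, sink-committed rows, and pending commit counts (the method and field names here are illustrative, not SeaTunnel APIs):&lt;/p&gt;

```java
public class ConsistencyCheck {
    // A positive delta suggests loss; a negative delta suggests duplication.
    // Pending commits must be zero before either judgment is meaningful.
    static String judge(long sourceRows, long sinkRows, long pendingCommits) {
        if (pendingCommits > 0) return "UNFINISHED_COMMITS";
        long delta = sourceRows - sinkRows;
        if (delta > 0) return "POSSIBLE_LOSS";
        if (delta < 0) return "POSSIBLE_DUPLICATION";
        return "CONSISTENT";
    }

    public static void main(String[] args) {
        System.out.println(judge(500, 500, 0)); // CONSISTENT — the 500 / 500 / 0 case above
        System.out.println(judge(500, 510, 0)); // POSSIBLE_DUPLICATION
        System.out.println(judge(500, 480, 2)); // UNFINISHED_COMMITS — don't judge loss yet
    }
}
```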

&lt;p&gt;Two recommended comparison scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario A: High Parallelism Observation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STREAMING"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;checkpoint.interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;FakeSource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;row.num&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;split.num&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;split.read-interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;Console&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario B: Conservative Recovery Observation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STREAMING"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;checkpoint.interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;FakeSource&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;row.num&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;split.num&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;split.read-interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;Console&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above two configurations are more suitable for observing control links and recovery behavior, &lt;strong&gt;not&lt;/strong&gt; for serious throughput benchmarking.&lt;br&gt;
&lt;code&gt;FakeSource&lt;/code&gt; in &lt;code&gt;c5ceb6490&lt;/code&gt; supports &lt;code&gt;split.read-interval&lt;/code&gt;, not &lt;code&gt;rate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In addition, &lt;code&gt;row.num&lt;/code&gt; in &lt;code&gt;FakeSource&lt;/code&gt; means &lt;strong&gt;total generated rows per parallelism&lt;/strong&gt;.&lt;br&gt;
This must be accounted for when explaining test scale.&lt;/p&gt;
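&lt;p&gt;This per-parallelism semantics changes the effective test scale considerably. For Scenario A above:&lt;/p&gt;

```java
public class FakeSourceScale {
    public static void main(String[] args) {
        long rowNumPerParallelism = 100_000_000L; // row.num in the config
        int parallelism = 128;                    // env.parallelism in Scenario A
        long effectiveTotalRows = rowNumPerParallelism * parallelism;
        // 12800000000 — 12.8 billion rows in total, not 100 million
        System.out.println(effectiveTotalRows);
    }
}
```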

&lt;p&gt;What these two scenarios truly compare is not just "who is faster," but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether higher parallelism actually delivers effective throughput&lt;/li&gt;
&lt;li&gt;Whether shorter checkpoint intervals stabilize recovery boundaries or cause timeouts&lt;/li&gt;
&lt;li&gt;Whether the system throttles gracefully when sinks slow down, or amplifies congestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical observation: in my minimal tests, &lt;code&gt;min-pause&lt;/code&gt; did reduce checkpoint count within the same time window, and &lt;code&gt;read_limit&lt;/code&gt; did increase total runtime. Both configurations are observable and verifiable.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Architecture Vision: From "Recoverable" to "Adaptive"
&lt;/h2&gt;

&lt;p&gt;If we regard Zeta as a stability engine, its most promising future direction may not be stacking more "performance parameters,"&lt;br&gt;
but further turning existing control signals into &lt;strong&gt;adaptive capabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When Checkpoint slows down, can the system automatically identify whether the bottleneck is Source, Queue, Sink, or insufficient Slot resources?&lt;/li&gt;
&lt;li&gt;When downstream writing slows, can the system automatically adjust &lt;code&gt;read_limit&lt;/code&gt; based on real-time metrics, instead of requiring manual throttling after backlog occurs?&lt;/li&gt;
&lt;li&gt;When a job recovers, can the system inform the user in advance: which checkpoint recovery starts from, how many splits remain, expected impact scope?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, Exactly-Once capabilities on the connector side can become more &lt;strong&gt;explicit&lt;/strong&gt;.&lt;br&gt;
Today we mostly express capability boundaries via interface implementations and code conventions.&lt;br&gt;
In the future, if idempotency, commit semantics, and retry boundaries become declarable, inspectable, observable contracts,&lt;br&gt;
the operability of the entire data integration pipeline will improve significantly.&lt;/p&gt;

&lt;p&gt;This does not mean the current version fully supports these capabilities,&lt;br&gt;
but is a natural extension of the existing architecture:&lt;/p&gt;

&lt;p&gt;Once the control plane, state plane, data plane, and resource plane form a closed loop,&lt;br&gt;
the next step can evolve from &lt;strong&gt;"recover after failure"&lt;/strong&gt; to &lt;strong&gt;"predict before failure, adapt during runtime."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Final Thoughts: What Makes Zeta Valuable Is Turning Stability into a System Capability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Looking at individual code points, many implementations in Zeta are not particularly flashy.&lt;/p&gt;

&lt;p&gt;But architecturally, it gets several critical things right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CheckpointCoordinator&lt;/code&gt; as a unified consistency control entry&lt;/li&gt;
&lt;li&gt;Aggregated Committer binding external commits to checkpoint completion&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;restoreTaskState(...)&lt;/code&gt; and Enumerator-based recovery forming a complete resume loop&lt;/li&gt;
&lt;li&gt;Barrier priority and &lt;code&gt;prepareClose&lt;/code&gt; ensuring convergence under concurrency&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ResourceProfile&lt;/code&gt;, dynamic slots, and &lt;code&gt;read_limit&lt;/code&gt; making resource control a system-level strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What deserves recognition is not a single powerful module, but that it places the most failure-prone aspects of data integration systems into a unified, explainable engineering mechanism.&lt;/p&gt;

&lt;p&gt;If you are an architect, what matters is not just whether it is fast, but whether it remains &lt;strong&gt;explainable, convergent, and operable&lt;/strong&gt; under failure, recovery, commit, and resource fluctuation.&lt;/p&gt;

&lt;p&gt;From this perspective, Zeta’s real value is not extreme optimization in one area, but placing these concerns into a system that can be traced, verified, and reasoned about.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SeaTunnel Zeta’s competitiveness lies not in pushing a single capability to the extreme, but in closing the loop across consistency, recovery, concurrency, and resource management.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Appendix: Source Code Reference Anchors&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to explore the source code further, the following entry points are recommended.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;CheckpointCoordinator.tryTriggerPendingCheckpoint&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L500-L582" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L500-L582&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;CheckpointCoordinator.restoreTaskState&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L306-L344" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/checkpoint/CheckpointCoordinator.java#L306-L344&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SeaTunnelSink&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-api/src/main/java/org/apache/seatunnel/api/sink/SeaTunnelSink.java#L40-L127" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-api/src/main/java/org/apache/seatunnel/api/sink/SeaTunnelSink.java#L40-L127&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SinkFlowLifeCycle.received / notifyCheckpointComplete&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkFlowLifeCycle.java#L191-L244" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/flow/SinkFlowLifeCycle.java#L191-L244&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SinkAggregatedCommitterTask.notifyCheckpointComplete&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SinkAggregatedCommitterTask.java#L303-L332" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SinkAggregatedCommitterTask.java#L303-L332&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SourceSplitEnumeratorTask.restoreState&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L187-L207" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L187-L207&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SourceSplitEnumeratorTask.receivedReader&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L221-L246" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/task/SourceSplitEnumeratorTask.java#L221-L246&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DefaultSlotService.requestSlot&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/service/slot/DefaultSlotService.java#L168-L189" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/service/slot/DefaultSlotService.java#L168-L189&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;speed-limit.md&lt;/code&gt;&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/blob/c5ceb6490/docs/zh/introduction/configuration/speed-limit.md" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/blob/c5ceb6490/docs/zh/introduction/configuration/speed-limit.md&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>Three Core Engine Innovations in Apache SeaTunnel: High-Reliability Asynchronous Persistence and CDC Architecture Optimization</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 17 Apr 2026 09:47:03 +0000</pubDate>
      <link>https://dev.to/seatunnel/three-core-engine-innovations-in-apache-seatunnel-high-reliability-asynchronous-persistence-and-24p1</link>
      <guid>https://dev.to/seatunnel/three-core-engine-innovations-in-apache-seatunnel-high-reliability-asynchronous-persistence-and-24p1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; In large-scale distributed data integration scenarios, high availability and extreme data processing performance have always been core challenges. This article provides an in-depth analysis of three recent core engine innovations in Apache SeaTunnel: a high-performance asynchronous WAL (Write-Ahead Log) persistence architecture based on LMAX Disruptor, an efficient timezone conversion optimization for Debezium deserialization in the CDC module, and enhanced complex type mapping in the JDBC module for databases such as SQL Server. By interpreting these core code changes, this article reveals how Apache SeaTunnel achieves a leap in processing throughput while ensuring strong data consistency, and provides best-practice references for distributed system architecture design.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Background Introduction
&lt;/h2&gt;

&lt;p&gt;With the deepening of enterprise digital transformation, data integration is no longer just simple “data movement,” but has evolved into complex orchestration of massive, heterogeneous, and real-time data streams. As a next-generation high-performance data integration platform, Apache SeaTunnel’s self-developed Zeta engine demonstrates strong capabilities in distributed coordination, fault tolerance, and resource scheduling.&lt;/p&gt;

&lt;p&gt;However, in the pursuit of extreme performance, bottlenecks such as blocking caused by synchronous I/O, performance overhead in cross-timezone data processing, and fragmentation in heterogeneous database type mapping have constrained further scalability. A series of recent core code contributions directly address these deep-rooted challenges through systematic architectural upgrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Core Contributors and PR Traceability
&lt;/h2&gt;

&lt;p&gt;The technical breakthroughs analyzed in this article are inseparable from continuous contributions by the community. Below are the core contributors and corresponding Pull Requests for these features, enabling developers to further explore implementation details.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technical Highlight&lt;/th&gt;
&lt;th&gt;Main Contributor (GitHub ID)&lt;/th&gt;
&lt;th&gt;Key PR&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Asynchronous WAL Persistence (WALDisruptor)&lt;/td&gt;
&lt;td&gt;Kirs (@CalvinKirs) &amp;amp; Xiaojian Sun (@Sun-XiaoJian)&lt;/td&gt;
&lt;td&gt;#3418 / #4683&lt;/td&gt;
&lt;td&gt;Introduced LMAX Disruptor framework to refactor asynchronous persistence logic in the Zeta engine IMAP storage layer, significantly reducing I/O blocking.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDC Performance Optimization (Timezone / Bitwise Ops)&lt;/td&gt;
&lt;td&gt;Zongwen Li (@zongwenli)&lt;/td&gt;
&lt;td&gt;#3499&lt;/td&gt;
&lt;td&gt;Implemented highly optimized time conversion logic in CDC deserialization, avoiding frequent date object creation and improving multi-timezone support.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server Type Mapping Enhancement&lt;/td&gt;
&lt;td&gt;hailin0 (@hailin0)&lt;/td&gt;
&lt;td&gt;#5872&lt;/td&gt;
&lt;td&gt;Unified and enhanced the JDBC type system, especially improving high-precision support for SQL Server DATETIME2 and DATETIMEOFFSET.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  3. Core Technical Highlights
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h5b52zb5k0wlygep4pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h5b52zb5k0wlygep4pe.png" alt="SeaTunnel Engine" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Asynchronous WAL Persistence Architecture Based on LMAX Disruptor
&lt;/h3&gt;

&lt;p&gt;In distributed storage systems, the WAL (Write-Ahead Log) is the cornerstone of data consistency. Traditional synchronous WAL writes block the main thread, increasing latency in high-concurrency I/O scenarios. SeaTunnel introduces the LMAX Disruptor lock-free queuing framework in &lt;code&gt;WALDisruptor&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Innovation:&lt;/strong&gt; Adopts a single-producer, multi-worker thread pool model (Worker Pool), decoupling WAL publishing from actual I/O persistence logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Advantages:&lt;/strong&gt; The ring buffer mechanism of Disruptor significantly reduces thread contention and context switching overhead, while preallocated memory avoids frequent garbage collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 CDC Timezone Conversion and Deserialization Performance Optimization
&lt;/h3&gt;

&lt;p&gt;CDC (Change Data Capture) is one of SeaTunnel’s core strengths. When processing raw data from Debezium, high-frequency time conversion operations often consume significant CPU resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Innovation:&lt;/strong&gt; In &lt;code&gt;SeaTunnelRowDebeziumDeserializationConverters&lt;/code&gt;, fine-grained bitwise conversion logic is introduced for TIMESTAMP, MICRO_TIMESTAMP, and NANO_TIMESTAMP, avoiding costly Java date object creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Advantages:&lt;/strong&gt; By directly operating on millisecond and nanosecond-level long values and combining them with cached timezone (ZoneId) conversions, processing throughput is effectively doubled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Standardized Enhancement of Heterogeneous Database Type Mapping
&lt;/h3&gt;

&lt;p&gt;Type differences across heterogeneous databases (such as SQL Server, Oracle, and MySQL) are a major cause of precision loss during data synchronization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Innovation:&lt;/strong&gt; In converters such as &lt;code&gt;SqlServerTypeConverter&lt;/code&gt;, precision adaptation logic for complex types like DATETIME2 and DATETIMEOFFSET is refactored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Advantages:&lt;/strong&gt; A streaming builder pattern based on &lt;code&gt;BasicTypeDefine&lt;/code&gt; is introduced, making mappings between source types (SourceType) and underlying storage types (DataType) more transparent and extensible.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Implementation Details and Code Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core of Asynchronous Persistence: Evolution of WALDisruptor
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;WALDisruptor.java&lt;/code&gt;, we can observe a typical Disruptor usage pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Initialize Disruptor with BlockingWaitStrategy to reduce CPU usage under low load&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;disruptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Disruptor&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;FileWALEvent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FACTORY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="no"&gt;DEFAULT_RING_BUFFER_SIZE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threadFactory&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;ProducerType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SINGLE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;BlockingWaitStrategy&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// Bind worker pool to handle HDFS/local file I/O&lt;/span&gt;
&lt;span class="n"&gt;disruptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;handleEventsWithWorkerPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;WALWorkHandler&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fileConfiguration&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parentPath&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;serializer&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="n"&gt;disruptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this architecture, the main thread only needs to call &lt;code&gt;tryAppendPublish&lt;/code&gt; to submit tasks to the RingBuffer and return immediately, while persistence is handled asynchronously by background threads.&lt;/p&gt;
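&lt;p&gt;The decoupling can be illustrated without the Disruptor dependency. The sketch below mirrors the publish-and-return pattern with a plain &lt;code&gt;BlockingQueue&lt;/code&gt; and a background writer thread — an analogy, not SeaTunnel's actual implementation (Disruptor's ring buffer additionally avoids per-event allocation and lock contention):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MiniAsyncWal {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
    private final List<String> persisted = new ArrayList<>(); // stands in for file I/O
    private final Thread writer;

    MiniAsyncWal() {
        writer = new Thread(() -> {
            try {
                while (true) {
                    String event = buffer.take();
                    if (event.equals("__POISON__")) return; // shutdown marker
                    persisted.add(event);                    // the (slow) persistence step
                }
            } catch (InterruptedException ignored) { }
        });
        writer.start();
    }

    // The caller's analogue of tryAppendPublish: enqueue and return immediately.
    boolean tryAppendPublish(String event) {
        return buffer.offer(event);
    }

    // Analogue of a bounded shutdown: drain the queue before exiting.
    List<String> close() throws InterruptedException {
        buffer.put("__POISON__");
        writer.join();
        return persisted;
    }

    public static void main(String[] args) throws InterruptedException {
        MiniAsyncWal wal = new MiniAsyncWal();
        for (int i = 0; i < 100; i++) wal.tryAppendPublish("event-" + i);
        System.out.println(wal.close().size()); // 100 — nothing is lost on an orderly close
    }
}
```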

&lt;h3&gt;
  
  
  4.2 CDC Performance Acceleration: Efficient Time Conversion
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;SeaTunnelRowDebeziumDeserializationConverters.java&lt;/code&gt;, the developers implemented a highly optimized conversion function for high-precision timestamps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt; &lt;span class="nf"&gt;toLocalDateTime&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;millisecond&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nanoOfMillisecond&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;millisecond&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;86400000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;millisecond&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;86400000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;86400000&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;nanoOfDay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1_000_000L&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;nanoOfMillisecond&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="nc"&gt;LocalDate&lt;/span&gt; &lt;span class="n"&gt;localDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofEpochDay&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;LocalTime&lt;/span&gt; &lt;span class="n"&gt;localTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalTime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofNanoOfDay&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nanoOfDay&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;LocalDateTime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;localDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;localTime&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implementation replaces heavy Calendar or SimpleDateFormat operations with efficient mathematical calculations, representing a typical example of high-performance system design.&lt;/p&gt;
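&lt;p&gt;The arithmetic can be sanity-checked against &lt;code&gt;java.time&lt;/code&gt;'s own conversion; both paths should agree for any epoch millisecond. A quick cross-check (not part of the SeaTunnel code):&lt;/p&gt;

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

public class TimeConversionCheck {
    // Same arithmetic as the snippet above.
    static LocalDateTime toLocalDateTime(long millisecond, int nanoOfMillisecond) {
        int date = (int) (millisecond / 86400000);
        int time = (int) (millisecond % 86400000);
        if (time < 0) {
            --date;
            time += 86400000;
        }
        long nanoOfDay = time * 1_000_000L + nanoOfMillisecond;
        return LocalDateTime.of(LocalDate.ofEpochDay(date), LocalTime.ofNanoOfDay(nanoOfDay));
    }

    public static void main(String[] args) {
        long millis = 1_700_000_000_123L; // an arbitrary epoch timestamp
        LocalDateTime fast = toLocalDateTime(millis, 456_789);
        // Reference path through Instant, interpreted in UTC.
        LocalDateTime reference = LocalDateTime
                .ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC)
                .plusNanos(456_789);
        System.out.println(fast.equals(reference)); // true
    }
}
```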

&lt;h2&gt;
  
  
  5. Performance Benchmark Comparison
&lt;/h2&gt;

&lt;p&gt;Based on benchmark results from the SeaTunnel community, significant performance improvements were observed after these optimizations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before Optimization (Legacy Mode)&lt;/th&gt;
&lt;th&gt;After Optimization (2.3.13 Preview)&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;WAL Write Latency (P99)&lt;/td&gt;
&lt;td&gt;15 ms&lt;/td&gt;
&lt;td&gt;2 ms&lt;/td&gt;
&lt;td&gt;86% ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDC Throughput per Core (Rows/s)&lt;/td&gt;
&lt;td&gt;55k&lt;/td&gt;
&lt;td&gt;120k&lt;/td&gt;
&lt;td&gt;118% ↑&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL Server Time Precision&lt;/td&gt;
&lt;td&gt;Second-level&lt;/td&gt;
&lt;td&gt;Nanosecond-level (Datetime2)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Test Environment:&lt;/strong&gt; 8 vCPU (Intel Xeon), 16GB RAM, SSD storage.&lt;br&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; MySQL CDC → SeaTunnel (Zeta) → Console/HDFS.&lt;br&gt;
&lt;strong&gt;Data Characteristics:&lt;/strong&gt; Average row size ~500 bytes, with 3+ time-related fields.&lt;br&gt;
&lt;strong&gt;Throughput Note:&lt;/strong&gt; 120k Rows/s represents single-core peak; real-world performance may vary due to network I/O and sink throughput.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Data derived from CDC synchronization scenarios involving 10 billion records.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  6. Challenges and Solutions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  6.1 Graceful Shutdown in Asynchronous Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge:&lt;/strong&gt; Asynchronous persistence may leave unflushed data in memory queues during JVM shutdown.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Introduced timeout-based waiting in the &lt;code&gt;close()&lt;/code&gt; method to ensure queue draining.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;disruptor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;DEFAULT_CLOSE_WAIT_TIME_SECONDS&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeUnit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SECONDS&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
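&lt;p&gt;The same drain-then-exit pattern can be sketched with a plain &lt;code&gt;ExecutorService&lt;/code&gt; standing in for the Disruptor (the timeout constant here is an assumed example value; SeaTunnel's actual constant may differ):&lt;/p&gt;

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulCloseDemo {

    // Assumed value for illustration only.
    static final long DEFAULT_CLOSE_WAIT_TIME_SECONDS = 5;

    // Sketch of a close() that drains pending asynchronous work before the
    // JVM exits: stop accepting new events, then wait up to the timeout for
    // the queue to empty. Returns false if work was still queued at timeout.
    public static boolean close(ExecutorService pipeline) throws InterruptedException {
        pipeline.shutdown();
        return pipeline.awaitTermination(DEFAULT_CLOSE_WAIT_TIME_SECONDS, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> { /* pretend to flush a pending WAL entry */ });
        System.out.println(close(pool)); // true: the queue drained in time
    }
}
```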



&lt;h3&gt;
  
  
  6.2 Timezone Drift in Heterogeneous Databases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge:&lt;/strong&gt; Inconsistent timezones between database servers and runtime environments may cause incorrect CDC timestamp parsing.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Introduced dynamic &lt;code&gt;ZoneId&lt;/code&gt; injection to ensure end-to-end timezone consistency.&lt;/p&gt;
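&lt;p&gt;A minimal sketch of what zone injection buys you: the same instant yields different wall-clock values depending on the injected &lt;code&gt;ZoneId&lt;/code&gt;, so the zone must come from configuration rather than the JVM default (illustrative code, not the connector source):&lt;/p&gt;

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class ZoneInjectionDemo {

    // Sketch: interpret a captured change-event instant in the configured
    // server timezone instead of whatever the JVM happens to default to.
    public static LocalDateTime inServerZone(Instant changeInstant, ZoneId serverZone) {
        return LocalDateTime.ofInstant(changeInstant, serverZone);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2026-03-01T12:00:00Z");
        // Same instant, two different wall-clock readings:
        System.out.println(inServerZone(t, ZoneId.of("UTC")));           // 2026-03-01T12:00
        System.out.println(inServerZone(t, ZoneId.of("Asia/Shanghai"))); // 2026-03-01T20:00
    }
}
```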

&lt;h2&gt;
  
  
  7. Best Practices and Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Backpressure Management
&lt;/h3&gt;

&lt;p&gt;Although Disruptor improves throughput, downstream storage issues (e.g., HDFS or S3 latency) may cause RingBuffer accumulation. Monitoring queue depth is essential.&lt;/p&gt;
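&lt;p&gt;A runnable sketch of depth monitoring, using a bounded queue as a stand-in for the RingBuffer (the Disruptor itself exposes &lt;code&gt;remainingCapacity()&lt;/code&gt; for the same purpose); the 80% threshold is an assumed example value:&lt;/p&gt;

```java
import java.util.concurrent.ArrayBlockingQueue;

public class QueueDepthMonitor {

    // Sketch: compute buffer utilization from a sampled depth. A bounded
    // queue is used here so the example runs anywhere without the Disruptor
    // dependency.
    public static double utilization(ArrayBlockingQueue<?> buffer, int capacity) {
        return buffer.size() / (double) capacity;
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
        for (int i = 0; i < 900; i++) {
            buffer.offer("event-" + i);
        }
        double u = utilization(buffer, 1024);
        if (u > 0.8) { // assumed alert threshold
            // In production this would fire an alert: the sink is falling behind
            System.out.println("backpressure warning: buffer " + Math.round(u * 100) + "% full");
        }
    }
}
```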

&lt;h3&gt;
  
  
  7.2 Importance of Graceful Shutdown
&lt;/h3&gt;

&lt;p&gt;Force-killing processes (&lt;code&gt;kill -9&lt;/code&gt;) may lead to data loss in asynchronous pipelines. Always use controlled shutdown procedures.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Timezone Configuration Consistency
&lt;/h3&gt;

&lt;p&gt;Ensure &lt;code&gt;serverTimeZone&lt;/code&gt; matches the database timezone to avoid inconsistencies in CDC pipelines.&lt;/p&gt;
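&lt;p&gt;As a sketch, a MySQL CDC source would pin the timezone explicitly in the job config. The option name &lt;code&gt;server-time-zone&lt;/code&gt; and all connection details below are illustrative; check the connector documentation for your SeaTunnel version:&lt;/p&gt;

```hocon
source {
  MySQL-CDC {
    # Hypothetical connection details for illustration only
    url = "jdbc:mysql://mysql-host:3306/mydb"
    username = "st_user"
    password = "st_pass"
    table-names = ["mydb.orders"]
    # Must match the database server's session timezone so CDC timestamps
    # are interpreted consistently end to end
    server-time-zone = "Asia/Shanghai"
  }
}
```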

&lt;h3&gt;
  
  
  7.4 Type Conversion Precision
&lt;/h3&gt;

&lt;p&gt;When synchronizing SQL Server DATETIMEOFFSET to systems without offset support, precision loss may occur. Validate schema compatibility beforehand.&lt;/p&gt;
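&lt;p&gt;A small illustration of where the loss happens: normalizing an &lt;code&gt;OffsetDateTime&lt;/code&gt; into a plain &lt;code&gt;LocalDateTime&lt;/code&gt; preserves the instant but discards the original offset (illustrative, not the connector's actual mapping code):&lt;/p&gt;

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class OffsetLossDemo {

    // Sketch: a target without offset support typically stores the value
    // normalized to a single zone. The instant survives; the original offset
    // (and with it the writer's local wall-clock reading) does not.
    public static LocalDateTime normalizeToUtc(OffsetDateTime value) {
        return value.withOffsetSameInstant(ZoneOffset.UTC).toLocalDateTime();
    }

    public static void main(String[] args) {
        OffsetDateTime v = OffsetDateTime.parse("2026-03-01T10:00:00+08:00");
        System.out.println(normalizeToUtc(v)); // 2026-03-01T02:00, offset gone
    }
}
```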

&lt;h2&gt;
  
  
  8. Conclusion and Outlook
&lt;/h2&gt;

&lt;p&gt;Through architectural innovations in asynchronous WAL persistence, CDC performance optimization, and standardized type mapping, Apache SeaTunnel has significantly strengthened its foundation as an enterprise-grade data integration platform. Looking ahead, the project will continue exploring more efficient in-memory data exchange formats and deeper integration with AI ecosystems, making data integration more intelligent, efficient, and accessible.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>apacheseatunnel</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Practical DataOps Development Framework Based on WhaleStudio’s Three Layer Model</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:37:01 +0000</pubDate>
      <link>https://dev.to/seatunnel/a-practical-dataops-development-framework-based-on-whalestudios-three-layer-model-1j9l</link>
      <guid>https://dev.to/seatunnel/a-practical-dataops-development-framework-based-on-whalestudios-three-layer-model-1j9l</guid>
      <description>&lt;p&gt;As data platforms evolve from simply “getting jobs to run” to achieving stable and reliable operations, the challenges teams face also begin to shift. Early on, the focus is mainly on whether tasks execute successfully. As scale increases, the concerns move toward access control, clarity of data pipelines, manageability of changes, and the ability to recover from failures.&lt;/p&gt;

&lt;p&gt;This is where DataOps starts to show its real value. It is not just a set of tool usage guidelines, but an engineering methodology that spans development, scheduling, and governance. Using WhaleStudio’s development management framework as an example, this article distills a set of practical standards drawn directly from real production experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layer Development Framework
&lt;/h2&gt;

&lt;p&gt;In complex data platforms, managing everything through a single dimension quickly becomes insufficient as the system grows. WhaleStudio introduces a three-layer structure of Project, Workflow, and Task, which decouples governance, orchestration, and execution, creating clear boundaries for system management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F150g5rxu5mh8gr6ws2gd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F150g5rxu5mh8gr6ws2gd.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Project as the Governance Boundary
&lt;/h3&gt;

&lt;p&gt;The project layer is the most fundamental part of the system, yet it is also the most commonly misused. In many teams, projects are treated merely as a way to organize directories. This approach often leads to problems later, such as unclear permissions, resource misuse, and ambiguous ownership.&lt;/p&gt;

&lt;p&gt;In a well-designed system, projects should serve as governance boundaries. Everything related to access control should be scoped within a project, including user permissions, data source access, script resources, alerting strategies, and Worker group configurations.&lt;/p&gt;

&lt;p&gt;A practical rule is simple. Whenever there is a scenario where certain users should not be able to view or modify specific resources, isolation must be enforced at the project level rather than relying on conventions or manual processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workflow as the Business Pipeline
&lt;/h3&gt;

&lt;p&gt;If projects define who can do what, workflows define how work is organized.&lt;/p&gt;

&lt;p&gt;A workflow is essentially a DAG that represents dependencies between tasks. In a typical data pipeline, workflows connect data ingestion, SQL processing, script execution, and sub-process calls into a complete business flow.&lt;/p&gt;

&lt;p&gt;Beyond orchestration, workflows also handle scheduling concerns such as dependency management, parallel and sequential execution strategies, retry mechanisms, and backfill logic. This means a workflow is not just a representation of execution logic, but also a key part of system stability design.&lt;/p&gt;

&lt;p&gt;In practice, workflows should be treated as traceable and replayable pipelines rather than just collections of tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task as the Smallest Execution Unit
&lt;/h3&gt;

&lt;p&gt;Under workflows, tasks represent the smallest unit of execution and have the most direct impact on system stability.&lt;/p&gt;

&lt;p&gt;Common task types include SQL, Shell, Python, and data integration jobs. Despite their differences, they should follow consistent design principles such as traceability, retry capability, and recoverability.&lt;/p&gt;

&lt;p&gt;In many production scenarios, issues do not originate from the scheduler itself, but from the tasks. For example, non-idempotent SQL logic, scripts without proper error handling, or strong dependencies on external systems can amplify risks during retries or backfills. Establishing standards at the task level is therefore critical to overall system reliability.&lt;/p&gt;

&lt;p&gt;Once the responsibilities of the three layers are clearly defined, the next step is to manage permissions and design workflows effectively to prevent the system from becoming unmanageable as it scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principles for Data Access and Workflow Design
&lt;/h2&gt;

&lt;p&gt;As teams grow and business logic becomes more complex, access control and workflow design become key factors affecting both efficiency and stability. Without consistent standards, systems can quickly become chaotic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Organize Projects by Business Domain
&lt;/h3&gt;

&lt;p&gt;Projects should primarily be structured around business domains such as sales, risk control, or finance. This aligns naturally with organizational structure and helps clarify ownership.&lt;/p&gt;

&lt;p&gt;When cross-team collaboration is required, resource sharing should be implemented through authorization mechanisms rather than placing everything into a single project. While the latter may seem convenient initially, it often leads to uncontrolled permissions over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate Responsibilities in Permission Design
&lt;/h3&gt;

&lt;p&gt;Permissions should never default to giving everyone full access. Roles such as development, testing, operations, and auditing should be clearly separated, each with its own scope of authority.&lt;/p&gt;

&lt;p&gt;This approach reduces the risk of accidental changes and helps standardize release processes, making system changes more controlled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Balance Isolation and Reuse
&lt;/h3&gt;

&lt;p&gt;Resource management must balance isolation with reuse. Data sources, scripts, resource pools, and Worker groups should be isolated by default to avoid unintended interference.&lt;/p&gt;

&lt;p&gt;When reuse is necessary, it should be achieved through controlled authorization rather than duplicating configurations. This reduces maintenance overhead and avoids inconsistencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolve Permission Differences Through Projects
&lt;/h3&gt;

&lt;p&gt;Whenever permission differences exist, they must be handled through project-level isolation. For example, if certain datasets should only be accessible to specific users, this must be enforced through system mechanisms rather than informal agreements.&lt;/p&gt;

&lt;p&gt;Although this principle seems straightforward, it is often overlooked, leading to loss of control over the permission system.&lt;/p&gt;

&lt;p&gt;Once the permission model is stable, workflow design becomes the key factor in maintainability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Workflow Size
&lt;/h3&gt;

&lt;p&gt;As the number of tasks grows, placing everything into a single workflow leads to rapidly increasing maintenance costs and higher risk during changes.&lt;/p&gt;

&lt;p&gt;In practice, workflows should be split based on data layers or business domains, such as ODS, DWD, DWS, and ADS. The number of nodes within a workflow should remain within a manageable range to avoid excessive complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrade Governance When Complexity Increases
&lt;/h3&gt;

&lt;p&gt;When the number of workflows grows too large or directory structures become unmanageable, relying on labels or folders is no longer sufficient. At this point, governance should be elevated to a higher level, such as introducing additional project segmentation.&lt;/p&gt;

&lt;p&gt;This is not merely structural optimization, but an evolution of governance strategy.&lt;/p&gt;

&lt;p&gt;Once design principles are clear, implementation should align with team size. There is no single solution that fits all teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Strategies for Different Team Sizes
&lt;/h2&gt;

&lt;p&gt;DataOps does not have a universal solution. The right approach depends on team size and system complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large Teams with Layered Isolation
&lt;/h3&gt;

&lt;p&gt;In large or complex data warehouse environments, multiple business domains, permission boundaries, and data pipelines coexist. In such cases, data warehouse layers such as ODS, DWD, DWS, and ADS should be mapped to different projects and workflows.&lt;/p&gt;

&lt;p&gt;Dependencies across projects and workflows must be clearly defined. Impact analysis tools should be used for global governance to ensure changes do not introduce cascading failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medium Sized Teams with Balanced Design
&lt;/h3&gt;

&lt;p&gt;For medium-sized teams, the goal is to maintain stability while avoiding unnecessary complexity.&lt;/p&gt;

&lt;p&gt;Projects should not be overly fragmented, and workflows should not be split excessively. Instead, different scheduling cycles such as daily and monthly jobs can be connected through well-defined dependencies.&lt;/p&gt;

&lt;p&gt;The focus at this stage should be on unified scheduling strategies and resource pool management rather than introducing overly complex governance frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small Teams with Fast Execution
&lt;/h3&gt;

&lt;p&gt;For small teams or early-stage projects, the priority is to establish a working delivery pipeline.&lt;/p&gt;

&lt;p&gt;A single workflow can be used to handle core business processes, supported by naming conventions, alerting mechanisms, and backfill strategies to ensure baseline quality. As complexity increases, the system can gradually evolve toward more fine-grained structures.&lt;/p&gt;

&lt;p&gt;This approach keeps costs under control while avoiding overly heavy design in the early stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From Project to Workflow to Task, WhaleStudio’s three-layer model provides a clear division of responsibilities. Projects define governance boundaries, workflows manage business orchestration, and tasks handle execution.&lt;/p&gt;

&lt;p&gt;With well-designed permission models and properly structured workflows, systems can remain stable and controllable even as complexity grows.&lt;/p&gt;

&lt;p&gt;The essence of DataOps lies not in the tools themselves, but in building an engineering system that can evolve sustainably. Only when permissions, resources, and execution logic are governed under a unified framework can a data platform truly support long-term business growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previous Articles
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/5-when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-32ba42558db1" rel="noopener noreferrer"&gt;(5)When Your Data Warehouse Breaks Down, It’s Probably a Naming Problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/codex/4-why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4fddecde4288?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(4) Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;(3) Key Design Principles for ODS/Detail Layer Implementation: Building the Data Ingestion Layer as a “Stable and Operable” Infrastructure&lt;/li&gt;

&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/i-a-complete-guide-to-building-and-standardizing-a-modern-lakehouse-architecture-an-overview-of-9a2a263f2f1b?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(I) A Complete Guide to Building and Standardizing a Modern Lakehouse Architecture: An Overview of Data Warehouses and Data Lakes&lt;/a&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Coming Next
&lt;/h2&gt;

&lt;p&gt;Part 7: Scheduling design best practices&lt;/p&gt;




</description>
      <category>dataops</category>
      <category>ai</category>
      <category>database</category>
      <category>terraform</category>
    </item>
    <item>
      <title>You Don’t Apply to Become an ASF Member, You Grow Into It</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:11:30 +0000</pubDate>
      <link>https://dev.to/seatunnel/you-dont-apply-to-become-an-asf-member-you-grow-into-it-4oa8</link>
      <guid>https://dev.to/seatunnel/you-dont-apply-to-become-an-asf-member-you-grow-into-it-4oa8</guid>
      <description>&lt;p&gt;Very few people set “becoming an ASF Member” as a clear goal.&lt;/p&gt;

&lt;p&gt;Not because it lacks appeal, but because there is no application process and no defined path. It is more of an outcome, something that happens after sustained contributions are naturally recognized within a community.&lt;/p&gt;

&lt;p&gt;Fan Jia followed exactly that kind of path.&lt;/p&gt;

&lt;p&gt;Recently, he was invited to join the Apache Software Foundation as a Member. Taking this opportunity, we had an in-depth conversation with him. More than a recognition of achievement, the discussion felt like a reflection on his journey—from data integration, to open source participation, to system design and community understanding—tracing how an engineer gradually arrives at this point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqij6yoerzb0vvm4ozss.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqij6yoerzb0vvm4ozss.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting from Data Integration
&lt;/h2&gt;

&lt;p&gt;Fan Jia’s current work focuses on data integration, particularly in areas such as data synchronization, Change Data Capture, and data infrastructure. As he describes it, his day-to-day work can be distilled into one core objective: enabling data to flow reliably across different systems.&lt;/p&gt;

&lt;p&gt;In practice, this is far more complex than it sounds. It involves synchronizing data between heterogeneous systems, handling schema evolution, and ensuring stability in complex production environments. Alongside this, he has been actively contributing to the Apache SeaTunnel community over the long term.&lt;/p&gt;

&lt;p&gt;What stands out is that his starting point was not open source itself, but a set of concrete and persistent engineering problems. Those problems became the foundation for his later involvement in open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  How He Got Into Open Source
&lt;/h2&gt;

&lt;p&gt;When asked how he first got involved in open source, his answer was straightforward—it started with his job. After joining WhaleOps, he became involved in the development, maintenance, and partial architectural design of Apache SeaTunnel.&lt;/p&gt;

&lt;p&gt;In the early stage, his contributions were similar to those of most engineers, focusing on solving specific issues such as fixing bugs and improving features. Over time, however, his attention shifted toward system design and how the project could run reliably across broader and more diverse scenarios.&lt;/p&gt;

&lt;p&gt;This transition did not happen overnight. It emerged gradually through continuous involvement. As his focus moved from isolated problems to the system as a whole, his role evolved along with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  From User to Maintainer
&lt;/h2&gt;

&lt;p&gt;He describes this phase as a shift in perspective and responsibility.&lt;/p&gt;

&lt;p&gt;As a user, the focus is on whether a feature exists and whether it meets immediate needs. As a maintainer, the concerns expand to system stability, backward compatibility, adaptability across different use cases, and the real experience of community users.&lt;/p&gt;

&lt;p&gt;At the same time, the sense of responsibility becomes more concrete. Writing code is no longer just about completing a task. It becomes part of maintaining a system that runs in real production environments, making every technical decision more deliberate.&lt;/p&gt;

&lt;p&gt;Once this shift in perspective happens, the truly complex problems begin to surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Memorable Technical Challenge
&lt;/h2&gt;

&lt;p&gt;During his time contributing to SeaTunnel, one of the most memorable challenges was building the Zeta engine from scratch.&lt;/p&gt;

&lt;p&gt;This was not about solving a single isolated issue, but about tackling a combination of complex system-level problems. At the execution model level, the engine needed to support both batch and stream processing, balancing throughput and latency while avoiding bottlenecks under high concurrency.&lt;/p&gt;

&lt;p&gt;From a concurrency perspective, multi-threaded execution introduced challenges such as race conditions, deadlocks, and unpredictable execution order. These issues are often difficult to reproduce and tend to surface only after prolonged runtime.&lt;/p&gt;

&lt;p&gt;In terms of resource management, real production workloads involve long-running tasks and large data volumes. Memory control, thread pool isolation, and backpressure handling become critical. Out-of-memory errors are especially dangerous, as they can impact not only individual tasks but the stability of the entire service process.&lt;/p&gt;

&lt;p&gt;For stability and recoverability, the system must guarantee no data loss, avoid uncontrolled duplication, and correctly restore state after failures or restarts. This typically requires integrating checkpointing and state management mechanisms.&lt;/p&gt;

&lt;p&gt;Overall, this was not a single technical problem, but a full-scale systems engineering challenge.&lt;/p&gt;

&lt;p&gt;These experiences also shaped how he understands collaboration in open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Important Skill in Open Source
&lt;/h2&gt;

&lt;p&gt;When asked what matters most in an open source community, his answer was patience.&lt;/p&gt;

&lt;p&gt;A pull request in open source rarely gets merged immediately. It usually goes through multiple stages, including initial implementation, community review, several rounds of revision, CI validation, and documentation updates. Along the way, various issues can arise. Without patience, it is easy to give up midway.&lt;/p&gt;

&lt;p&gt;However, consistently pushing through these details is exactly what defines high-quality contributions.&lt;/p&gt;

&lt;p&gt;This understanding of the process is also reflected in his advice to newcomers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advice for New Contributors
&lt;/h2&gt;

&lt;p&gt;For developers just getting started in open source, he believes the most important things are curiosity and the willingness to act.&lt;/p&gt;

&lt;p&gt;Often, the biggest barrier is not technical difficulty, but simply not getting started. Once you take the first step—submitting a small PR or joining a discussion—everything else tends to follow naturally.&lt;/p&gt;

&lt;p&gt;He also emphasizes the importance of expressing your own ideas and even questioning existing designs. Open source communities are inherently open environments, and everyone starts as a beginner.&lt;/p&gt;

&lt;p&gt;As participation deepens, feedback from the community becomes more visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment He Became an ASF Member
&lt;/h2&gt;

&lt;p&gt;When he learned that he had become an ASF Member, his first reaction was excitement and happiness.&lt;/p&gt;

&lt;p&gt;Unlike many achievements, this is not something you apply for. It is a recognition from the community based on long-term contributions, which makes it especially meaningful.&lt;/p&gt;

&lt;p&gt;At the same time, he sees it not just as an honor, but as an increase in responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Role Means
&lt;/h2&gt;

&lt;p&gt;In his view, being an ASF Member is fundamentally about responsibility.&lt;/p&gt;

&lt;p&gt;It is not only about continuing technical contributions, but also about fostering a healthy community, helping new contributors grow, and participating in higher-level governance. It also means being accountable to users, ensuring that projects run reliably in real-world environments.&lt;/p&gt;

&lt;p&gt;As his role evolves, so does his understanding of the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding The Apache Way
&lt;/h2&gt;

&lt;p&gt;He summarizes his understanding of The Apache Way in one phrase: Community Over Code.&lt;/p&gt;

&lt;p&gt;The long-term success of an open source project depends not only on its technology but also on whether it maintains open and transparent decision-making, encourages contributors from diverse backgrounds, and builds governance based on consensus.&lt;/p&gt;

&lt;p&gt;These factors ultimately determine the vitality of a project.&lt;/p&gt;

&lt;p&gt;With this perspective, he approaches projects from a broader viewpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  How He Sees SeaTunnel
&lt;/h2&gt;

&lt;p&gt;In his view, SeaTunnel’s strengths lie in several areas.&lt;/p&gt;

&lt;p&gt;From an architectural standpoint, it supports a multi-engine model, allowing users to choose the most suitable execution engine for different scenarios. From an ecosystem perspective, it provides a rich set of connectors, enabling integration with various databases, data lakes, and messaging systems.&lt;/p&gt;

&lt;p&gt;In terms of capabilities, CDC is a key strength, supporting both data change capture and schema evolution, making the system more adaptable to complex production environments.&lt;/p&gt;

&lt;p&gt;At the same time, despite these capabilities, SeaTunnel maintains a relatively lightweight design, allowing users to adopt and use it at a lower cost.&lt;/p&gt;

&lt;p&gt;These insights come from long-term hands-on experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Open Source Changed Him
&lt;/h2&gt;

&lt;p&gt;Open source has had a significant impact on his career, especially in how he approaches problems.&lt;/p&gt;

&lt;p&gt;Within a company, systems are usually designed around specific business needs. In open source, however, solutions must consider much broader and more general use cases, which pushes engineers to make longer-term architectural decisions.&lt;/p&gt;

&lt;p&gt;Collaborating with developers from different companies and backgrounds also expands one’s technical perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Sentence About Open Source
&lt;/h2&gt;

&lt;p&gt;When asked to summarize open source in one sentence, he said:&lt;/p&gt;

&lt;p&gt;“Open source is not just about sharing code; it is a process where developers and communities grow together.”&lt;/p&gt;

&lt;p&gt;It may sound simple, but when viewed in the context of his journey, it is less a conclusion and more a natural outcome.&lt;/p&gt;

&lt;p&gt;From solving concrete data problems, to participating in system design, to thinking about how projects run reliably across different scenarios, and eventually to engaging in community collaboration and consensus building, there is no clear boundary between these stages.&lt;/p&gt;

&lt;p&gt;It is a continuous process where perspective gradually expands through doing the work.&lt;/p&gt;

&lt;p&gt;Becoming an ASF Member is not the end of this journey, but a milestone along the way. It reflects recognition of past contributions and signals greater responsibility ahead.&lt;/p&gt;

&lt;p&gt;If there is one deeper takeaway from this experience, it may not be a specific technology or a single project, but a more enduring capability:&lt;/p&gt;

&lt;p&gt;The ability to keep investing in uncertainty and to continue doing the right thing even when there is no immediate reward.&lt;/p&gt;




&lt;p&gt;About Apache SeaTunnel&lt;br&gt;
Apache SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.&lt;/p&gt;

&lt;p&gt;Welcome to fill out this form to be a speaker of Apache SeaTunnel: &lt;a href="https://forms.gle/vtpQS6ZuxqXMt6DT6" rel="noopener noreferrer"&gt;https://forms.gle/vtpQS6ZuxqXMt6DT6&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;Why do we need Apache SeaTunnel?&lt;br&gt;
Apache SeaTunnel does everything it can to solve the problems you may encounter when synchronizing massive amounts of data:&lt;br&gt;
Data loss and duplication&lt;br&gt;
Task buildup and latency&lt;br&gt;
Low throughput&lt;br&gt;
Long application-to-production cycle time&lt;br&gt;
Lack of application status monitoring&lt;/p&gt;

&lt;p&gt;Apache SeaTunnel Usage Scenarios&lt;br&gt;
Massive data synchronization&lt;br&gt;
Massive data integration&lt;br&gt;
ETL of large volumes of data&lt;br&gt;
Massive data aggregation&lt;br&gt;
Multi-source data processing&lt;/p&gt;

&lt;p&gt;Features of Apache SeaTunnel&lt;br&gt;
Rich components&lt;br&gt;
High scalability&lt;br&gt;
Easy to use&lt;br&gt;
Mature and stable&lt;/p&gt;

&lt;p&gt;How to get started with Apache SeaTunnel quickly?&lt;br&gt;
Want to experience Apache SeaTunnel quickly? SeaTunnel 2.1.0 takes 10 seconds to get you up and running.&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/docs/2.1.0/developement/setup" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.1.0/developement/setup&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can I contribute?&lt;br&gt;
We invite all partners who are interested in making local open-source global to join the Apache SeaTunnel contributors family and foster open-source together!&lt;/p&gt;

&lt;p&gt;Submit an issue:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/issues" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/issues&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contribute code to:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pulls" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pulls&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to the community development mailing list:&lt;br&gt;
&lt;a href="mailto:dev-subscribe@seatunnel.apache.org"&gt;dev-subscribe@seatunnel.apache.org&lt;/a&gt;&lt;br&gt;
Development mailing list:&lt;br&gt;
&lt;a href="mailto:dev@seatunnel.apache.org"&gt;dev@seatunnel.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join Slack:&lt;br&gt;
&lt;a href="https://join.slack.com/t/apacheseatunnel/shared_invite/zt-3uouszk3m-PtLLNyZsJVqE5Gb6gn24mA" rel="noopener noreferrer"&gt;https://join.slack.com/t/apacheseatunnel/shared_invite/zt-3uouszk3m-PtLLNyZsJVqE5Gb6gn24mA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow us on Twitter:&lt;br&gt;
&lt;a href="https://twitter.com/ASFSeaTunnel" rel="noopener noreferrer"&gt;https://twitter.com/ASFSeaTunnel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join us now!❤️❤️&lt;/p&gt;

</description>
      <category>asf</category>
      <category>ai</category>
      <category>opensource</category>
      <category>apacheseatunnel</category>
    </item>
    <item>
      <title>What Happened in Apache SeaTunnel? This March You Shouldn’t Miss</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 10 Apr 2026 07:06:02 +0000</pubDate>
      <link>https://dev.to/seatunnel/what-happened-in-apache-seatunnel-this-march-you-shouldnt-miss-2l12</link>
      <guid>https://dev.to/seatunnel/what-happened-in-apache-seatunnel-this-march-you-shouldnt-miss-2l12</guid>
      <description>&lt;p&gt;Hey there! The March 2026 report is here. The Apache SeaTunnel community has been incredibly active. A total of 26 contributors participated, version 2.3.13 was released, five new connectors were added, and major improvements were made across the core engine, file connectors, CDC, and Transform modules. More than 20 bugs were also fixed.&lt;/p&gt;

&lt;p&gt;On top of that, infrastructure upgrades were rolled out. Whether you’re an enterprise or individual user, it’s a great time to upgrade, explore new features, and stay in sync with the community momentum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodj024zrqk6ky1zx1isr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodj024zrqk6ky1zx1isr.jpg" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reporting period: March 1–30, 2026&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Release Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.3.13&lt;/td&gt;
&lt;td&gt;March 14, 2026&lt;/td&gt;
&lt;td&gt;Released this month with 50+ new features and 20+ bug fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Download:&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/download" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/download&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Key Updates in Version 2.3.13
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 New Connectors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Connector&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HugeGraph Sink&lt;/td&gt;
&lt;td&gt;Adds support for Apache HugeGraph&lt;/td&gt;
&lt;td&gt;#10002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;td&gt;Introduces DuckDB as both Source and Sink&lt;/td&gt;
&lt;td&gt;#10285&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lance&lt;/td&gt;
&lt;td&gt;Adds support for writing to Lance datasets&lt;/td&gt;
&lt;td&gt;#9894&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS DSQL&lt;/td&gt;
&lt;td&gt;Adds AWS DSQL Sink connector&lt;/td&gt;
&lt;td&gt;#9739&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoTDB&lt;/td&gt;
&lt;td&gt;Adds Source and Sink support for IoTDB 2.x&lt;/td&gt;
&lt;td&gt;#9872&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.2 Core Engine Enhancements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Supports arbitrarily nested arrays and map types&lt;/td&gt;
&lt;td&gt;#9881&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Adds min-pause checkpoint configuration&lt;/td&gt;
&lt;td&gt;#9804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zeta Engine&lt;/td&gt;
&lt;td&gt;Introduces REST API to inspect pending queue details&lt;/td&gt;
&lt;td&gt;#10078&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;Adds support for Flink 1.20.1&lt;/td&gt;
&lt;td&gt;#9576&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;Enables schema evolution for CDC sources&lt;/td&gt;
&lt;td&gt;#9867&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics&lt;/td&gt;
&lt;td&gt;Adds sink committed metrics and commit rate calculation&lt;/td&gt;
&lt;td&gt;#10233&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.3 File Connector Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Connector&lt;/th&gt;
&lt;th&gt;Enhancement&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HdfsFile&lt;/td&gt;
&lt;td&gt;Enables parallel reading for large files&lt;/td&gt;
&lt;td&gt;#10332&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LocalFile&lt;/td&gt;
&lt;td&gt;Supports chunked parallel reading for CSV, TEXT, JSON files&lt;/td&gt;
&lt;td&gt;#10142&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet&lt;/td&gt;
&lt;td&gt;Adds logical partitioning support&lt;/td&gt;
&lt;td&gt;#10239&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HdfsFile and LocalFile&lt;/td&gt;
&lt;td&gt;Adds sync_mode=update support&lt;/td&gt;
&lt;td&gt;#10437, #10268&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HBase&lt;/td&gt;
&lt;td&gt;Supports time-range scanning&lt;/td&gt;
&lt;td&gt;#10318&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hive&lt;/td&gt;
&lt;td&gt;Supports automatic failover across multiple Metastore URIs&lt;/td&gt;
&lt;td&gt;#10253&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.4 CDC Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Maxwell Canal Debezium&lt;/td&gt;
&lt;td&gt;Optimizes JSON format and supports merging update_before and update_after&lt;/td&gt;
&lt;td&gt;#9805&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Adds Protobuf deserialization support via Schema Registry wire format&lt;/td&gt;
&lt;td&gt;#10183&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;Injects record timestamp as EventTime metadata&lt;/td&gt;
&lt;td&gt;#9994&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL CDC&lt;/td&gt;
&lt;td&gt;Optimizes wait time for schema evolution&lt;/td&gt;
&lt;td&gt;#10040&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.5 Transform Enhancements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transformation&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal Embeddings&lt;/td&gt;
&lt;td&gt;Adds support for multimodal embeddings&lt;/td&gt;
&lt;td&gt;#9673&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RegexExtract&lt;/td&gt;
&lt;td&gt;Introduces regex-based extraction transform&lt;/td&gt;
&lt;td&gt;#9829&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL to Paimon&lt;/td&gt;
&lt;td&gt;Adds support for MERGE INTO syntax&lt;/td&gt;
&lt;td&gt;#10206&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  3. Bug Fixes in Version 2.3.13
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;PR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV Reader&lt;/td&gt;
&lt;td&gt;Fixes parsing failure caused by empty first column&lt;/td&gt;
&lt;td&gt;#10383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse&lt;/td&gt;
&lt;td&gt;Improves batch parallel reads by replacing limit offset with last batch sort value&lt;/td&gt;
&lt;td&gt;#9801&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Adds support for TIMESTAMP_TZ type&lt;/td&gt;
&lt;td&gt;#10048&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Fixes cluster mode bug and adds end-to-end tests&lt;/td&gt;
&lt;td&gt;#9869&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MongoDB&lt;/td&gt;
&lt;td&gt;Improves writer close logic&lt;/td&gt;
&lt;td&gt;#10051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Optimizes resource cleanup for Scroll API&lt;/td&gt;
&lt;td&gt;#10124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL CDC&lt;/td&gt;
&lt;td&gt;Optimizes schema evolution wait time&lt;/td&gt;
&lt;td&gt;#10040&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Community Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Contributors in March 2026
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Contributor&lt;/th&gt;
&lt;th&gt;PR Count&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🏅&lt;/td&gt;
&lt;td&gt;@zhangshenghang&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@yzeng1618&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@davidzollo&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈&lt;/td&gt;
&lt;td&gt;@chl-wxp&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@liunaijie&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@dybyte&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@ricky2129&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉&lt;/td&gt;
&lt;td&gt;@corgy-w&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@zooo-code&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@kuleat&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@LeonYoah&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@OmkarK-7&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@icekimchi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@assokhi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@Sephiroth1024&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@Best2Two&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@ic4y&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@misi1987107&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@CosmosNi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@chocoboxxf&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@xiaochen-zhou&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@qingzheguo-flash&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/rameshreddy-adutla"&gt;@rameshreddy-adutla&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@CNF96&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@MuraliMon&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;@ocean-zhc&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Contributor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A total of 51 PRs were merged in March. Huge thanks to all 26 contributors.&lt;/p&gt;

&lt;p&gt;Full contributor list:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/graphs/contributors" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/graphs/contributors&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure Updates
&lt;/h3&gt;

&lt;p&gt;End-to-end test Docker images migrated to the seatunnelhub repository&lt;br&gt;
JDK Docker images upgraded&lt;br&gt;
CI timeout optimization with Kafka set to 140 minutes and Kudu to 60 minutes&lt;br&gt;
Added Metalake support for managing data source metadata&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Recommendations for Enterprises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Upgrade Guidance
&lt;/h3&gt;

&lt;p&gt;We strongly recommend upgrading production environments to version 2.3.13&lt;br&gt;
This release includes more than 50 new features and over 20 bug fixes&lt;/p&gt;

&lt;h3&gt;
  
  
  Features to Watch
&lt;/h3&gt;

&lt;p&gt;New connectors including HugeGraph, DuckDB, IoTDB, AWS DSQL, and Lance&lt;br&gt;
Improved large file processing with parallel chunked reads in HdfsFile and LocalFile&lt;br&gt;
Enhanced CDC capabilities including schema evolution and multi-format Kafka support&lt;br&gt;
Improved observability with new sink committed metrics&lt;br&gt;
Support for Flink 1.20.1&lt;/p&gt;

&lt;h3&gt;
  
  
  Notes
&lt;/h3&gt;

&lt;p&gt;Some connector APIs have changed, so review the upgrade documentation before migrating&lt;br&gt;
Using the seatunnelhub image repository is strongly encouraged&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Key Metrics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;March Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Releases&lt;/td&gt;
&lt;td&gt;1 release (2.3.13)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New Connectors&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Enhancements&lt;/td&gt;
&lt;td&gt;50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug Fixes&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contributors&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  7. What’s Coming Next
&lt;/h2&gt;

&lt;p&gt;Further optimization of CDC performance&lt;br&gt;
More cloud-native data source integrations&lt;br&gt;
Improved metrics and monitoring capabilities&lt;/p&gt;

&lt;p&gt;Compiled and edited by the SeaTunnel Community&lt;/p&gt;

</description>
      <category>seatunnel</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
<title>(5) When Your Data Warehouse Breaks Down, It’s Probably a Naming Problem</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 06:59:33 +0000</pubDate>
      <link>https://dev.to/seatunnel/5when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-3p1c</link>
      <guid>https://dev.to/seatunnel/5when-your-data-warehouse-breaks-down-its-probably-a-naming-problem-3p1c</guid>
      <description>&lt;p&gt;As a data warehouse grows, the first thing that tends to get out of control is not the data itself—but naming. Naming conventions may seem like a minor detail, but they directly determine whether data is easy to find, understand, and maintain. As the fifth article in the Data Lakehouse Design and Practice series, this article starts from real-world usage and summarizes core methods for table and field naming. By combining layered prefixes, unified terminology (word roots), and cycle encoding, table names become self-explanatory. Together with metric naming and governance processes, this helps build a clear and collaborative data system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals and Methods of Naming Conventions: Make Table Names Self-Explanatory and Teams Work Automatically
&lt;/h2&gt;

&lt;p&gt;In a data warehouse system, naming conventions are not just about form—they are foundational infrastructure that directly impacts collaboration efficiency and data quality. A good naming system has one core goal: make the table name itself carry enough information so that people can understand what the table is, where it comes from, and how to use it—without needing extra documentation. Ideally, a table name should be “readable at a glance” and include key information such as data layer, owning team, business domain, subject domain, core object meaning, and update cycle or data scope. When these elements are systematically encoded into table names, data discovery, metric interpretation, troubleshooting, and team handovers all become significantly more efficient, reducing communication costs.&lt;/p&gt;

&lt;p&gt;A naming system is essentially a “word root system” that standardizes business language. For example, the same business object must use the same term consistently across tables (e.g., avoid mixing “rack” and “shelf”). Similarly, metric naming should follow unified rules—for instance, all ratio-type metrics should use the &lt;code&gt;_rate&lt;/code&gt; suffix, avoiding ambiguity from mixing terms like ratio, percent, or rt.&lt;/p&gt;

&lt;p&gt;Layer prefixes must be strictly standardized. They allow users to immediately identify the data layer and purpose of a table: &lt;code&gt;ods_&lt;/code&gt; for source-aligned data, &lt;code&gt;dwd_&lt;/code&gt; for detailed standardized data, &lt;code&gt;dws_&lt;/code&gt; for aggregated data, &lt;code&gt;ads_&lt;/code&gt; for application-facing outputs, and &lt;code&gt;dim_&lt;/code&gt; for shared dimensions. These prefixes are not just naming conventions—they directly reflect the data architecture.&lt;/p&gt;

&lt;p&gt;Another often overlooked but critical aspect is encoding update cycles or data scope into table names. For example, &lt;code&gt;_1d&lt;/code&gt; represents the last day, &lt;code&gt;_td&lt;/code&gt; means up to today, and &lt;code&gt;_7d&lt;/code&gt; means the last seven days. This prevents confusion between tables with the same name but different time semantics, reducing the risk of metric misuse.&lt;/p&gt;

&lt;p&gt;At the asset management level, table types must be clearly distinguished. Production tables are long-term assets, intermediate tables serve only processing workflows and should have retention policies, and temporary tables are for one-time validation and must not enter production pipelines. Prefixes like &lt;code&gt;mid_&lt;/code&gt; and &lt;code&gt;tmp_&lt;/code&gt; help prevent data asset pollution at the source.&lt;/p&gt;

&lt;p&gt;Finally, naming conventions must be integrated with governance processes. Any new table or field must include complete metadata such as owner, field definitions, metric definitions, update frequency, dependencies, and lifecycle. Tables without such metadata may be usable in the short term but will almost certainly become technical debt in the long run. In practice, it is best to standardize templates first—ensuring key fields like layer, domain, and cycle are strictly consistent—while allowing limited flexibility in non-critical parts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Naming Conventions: Templates, Cycle Encoding, and Examples
&lt;/h2&gt;

&lt;p&gt;In practice, table naming should follow a structured template to ensure completeness and consistency. A general template can be defined as &lt;code&gt;{layer}_{dept}_{biz_domain}_{subject}_{object}_{cycle_or_range}&lt;/code&gt;, where each component has a clear role: layer indicates data level, dept indicates ownership, biz_domain defines the business domain, subject represents analytical abstraction, object defines the entity or behavior, and cycle_or_range specifies the time scope.&lt;/p&gt;

&lt;p&gt;Cycle and range encoding is especially important. Common patterns include &lt;code&gt;_1d&lt;/code&gt; (last day), &lt;code&gt;_td&lt;/code&gt; (to date), &lt;code&gt;_7d&lt;/code&gt; or &lt;code&gt;_30d&lt;/code&gt; (last N days). Additional markers can distinguish data types or update modes, such as &lt;code&gt;d&lt;/code&gt; for daily snapshots, &lt;code&gt;w&lt;/code&gt; for weekly data, &lt;code&gt;i&lt;/code&gt; for incremental tables, &lt;code&gt;f&lt;/code&gt; for full tables, and &lt;code&gt;l&lt;/code&gt; for slowly changing tables. These conventions allow users to quickly understand temporal semantics.&lt;/p&gt;

&lt;p&gt;For example, in the aggregation layer, &lt;code&gt;dws_asale_trd_byr_subpay_1d&lt;/code&gt; represents buyer-level, staged payment transactions aggregated over the last day, while &lt;code&gt;dws_asale_trd_itm_slr_hh&lt;/code&gt; represents hourly aggregation at the seller-item level. Although long, such names are highly informative and readable.&lt;/p&gt;

&lt;p&gt;Dimension tables follow a separate convention, using the &lt;code&gt;dim_&lt;/code&gt; prefix and a &lt;code&gt;{scope}_{object}&lt;/code&gt; structure, such as &lt;code&gt;dim_pub_area&lt;/code&gt; (public area dimension) or &lt;code&gt;dim_asale_item&lt;/code&gt; (item dimension), emphasizing cross-domain reuse.&lt;/p&gt;

&lt;p&gt;Intermediate tables should be tightly bound to their target tables, typically named as &lt;code&gt;mid_{target_table}_{suffix}&lt;/code&gt;, such as &lt;code&gt;mid_dws_xxx_01&lt;/code&gt;. Temporary tables must use the &lt;code&gt;tmp_&lt;/code&gt; prefix and are strictly limited to development or validation, never entering production dependencies. For manually maintained data, tables in the DWD layer can explicitly include &lt;code&gt;manual&lt;/code&gt;, such as &lt;code&gt;dwd_trade_manual_client_info_l&lt;/code&gt;.&lt;/p&gt;
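&lt;p&gt;The template and suffix vocabulary above can be enforced mechanically, for example in CI or a table-creation review step. A minimal Python sketch (the helper name and the layer and cycle sets below are illustrative, not part of any SeaTunnel or warehouse API):&lt;/p&gt;

```python
# Parse a warehouse table name against the template
# {layer}_{dept}_{biz_domain}_{subject}_{object}_{cycle_or_range}.
# LAYERS and CYCLES mirror the conventions described in this article;
# parse_table_name is a hypothetical helper, not an existing API.
LAYERS = {"ods", "dwd", "dws", "ads", "dim", "mid", "tmp"}
CYCLES = {"1d", "td", "7d", "30d", "d", "w", "i", "f", "l", "hh"}

def parse_table_name(name: str) -> dict:
    """Split a table name into layer, body parts, and optional cycle suffix."""
    parts = name.lower().split("_")
    if parts[0] not in LAYERS:
        raise ValueError(f"unknown layer prefix: {parts[0]}")
    cycle = parts[-1] if parts[-1] in CYCLES else None
    body = parts[1:-1] if cycle else parts[1:]
    return {"layer": parts[0], "body": body, "cycle": cycle}

print(parse_table_name("dws_asale_trd_byr_subpay_1d"))
# {'layer': 'dws', 'body': ['asale', 'trd', 'byr', 'subpay'], 'cycle': '1d'}
```

&lt;p&gt;Rejecting non-conforming names at creation time keeps the convention from drifting as the warehouse grows.&lt;/p&gt;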

&lt;h2&gt;
  
  
  Field and Metric Naming Conventions: Rules, Structure, and Examples
&lt;/h2&gt;

&lt;p&gt;At the field level, naming must be strictly standardized. All field names should use lowercase with underscores—camelCase is not allowed. Readability should take priority over brevity, and consistent naming must be maintained for the same semantic meaning.&lt;/p&gt;

&lt;p&gt;Partition fields should be unified globally—for example, &lt;code&gt;dt&lt;/code&gt; for date, &lt;code&gt;hh&lt;/code&gt; for hour, and &lt;code&gt;mi&lt;/code&gt; for minute—with fixed formats. This improves development efficiency and avoids confusion across tables.&lt;/p&gt;

&lt;p&gt;Field suffixes should clearly indicate meaning: &lt;code&gt;_cnt&lt;/code&gt; for counts, &lt;code&gt;_amt&lt;/code&gt; or &lt;code&gt;_price&lt;/code&gt; for monetary values (choose one consistently), and boolean fields should use the &lt;code&gt;is_&lt;/code&gt; prefix and never be nullable. These conventions allow users to infer data types and meanings at a glance.&lt;/p&gt;

&lt;p&gt;NULL handling must also follow consistent rules. Typically, dimension fields use &lt;code&gt;-1&lt;/code&gt; for unknown values, while metric fields use &lt;code&gt;0&lt;/code&gt; to indicate no occurrence. This prevents NULL propagation in aggregations and improves data stability.&lt;/p&gt;
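&lt;p&gt;This rule is easy to apply at write time. A minimal sketch, assuming a hypothetical split of columns into dimension and metric fields (the field sets are illustrative examples):&lt;/p&gt;

```python
# Apply the NULL-handling convention from this article:
# unknown dimension values -> -1, missing metric values -> 0.
# DIM_FIELDS and METRIC_FIELDS are illustrative, per-table sets.
DIM_FIELDS = {"area_id", "item_id"}
METRIC_FIELDS = {"pay_amt", "pay_cnt"}

def apply_null_defaults(row: dict) -> dict:
    out = dict(row)
    for field in DIM_FIELDS:
        if out.get(field) is None:
            out[field] = -1          # unknown dimension member
    for field in METRIC_FIELDS:
        if out.get(field) is None:
            out[field] = 0           # "no occurrence", safe to aggregate
    return out

row = apply_null_defaults(
    {"area_id": None, "item_id": 42, "pay_amt": None, "pay_cnt": 3}
)
print(row)  # area_id becomes -1, pay_amt becomes 0
```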

&lt;p&gt;Metric naming should be structured as a combination of business qualifier, time qualifier, aggregation method, and base metric. For example, &lt;code&gt;trade_amt&lt;/code&gt; represents transaction amount, &lt;code&gt;install_poi_cnt&lt;/code&gt; represents installation point count, and &lt;code&gt;pay_succ_rate&lt;/code&gt; represents payment success rate. Aggregation methods should use fixed terms like &lt;code&gt;sum&lt;/code&gt;, &lt;code&gt;avg&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, and &lt;code&gt;min&lt;/code&gt;, avoiding inconsistent alternatives like “total.”&lt;/p&gt;

&lt;p&gt;A full example from fields to metrics: in the detail layer, an incremental order table might be named &lt;code&gt;dwd_trade_order_i&lt;/code&gt;, containing fields such as order ID, user ID, payment amount, order status, and partition keys. In the aggregation layer, &lt;code&gt;dws_trade_user_pay_1d&lt;/code&gt; summarizes user-level payments over the last day, including metrics like payment success count, total payment amount, and success rate. Finally, in the application layer, a table like &lt;code&gt;ads_fin_kpi_board_d&lt;/code&gt; provides business-facing dashboards with KPIs such as GMV, refund amount, net revenue, and number of paying users.&lt;/p&gt;
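&lt;p&gt;To make that layer flow concrete, here is a hedged sketch of rolling detail-layer order rows (in the spirit of &lt;code&gt;dwd_trade_order_i&lt;/code&gt;) up to user-level daily metrics (in the spirit of &lt;code&gt;dws_trade_user_pay_1d&lt;/code&gt;); all field names are illustrative and follow the suffix conventions above:&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative detail-layer rows; field names follow the article's
# conventions (_amt for amounts, _cnt for counts, is_ for booleans).
orders = [
    {"user_id": 1, "pay_amt": 100.0, "is_pay_succ": 1},
    {"user_id": 1, "pay_amt": 50.0,  "is_pay_succ": 0},
    {"user_id": 2, "pay_amt": 80.0,  "is_pay_succ": 1},
]

# Aggregate to the user grain, as a dws_*_1d table would.
agg = defaultdict(lambda: {"order_cnt": 0, "pay_succ_cnt": 0, "pay_amt": 0.0})
for o in orders:
    a = agg[o["user_id"]]
    a["order_cnt"] += 1
    a["pay_succ_cnt"] += o["is_pay_succ"]
    a["pay_amt"] += o["pay_amt"] if o["is_pay_succ"] else 0.0

# Derived ratio metric carries the _rate suffix, per the convention.
for a in agg.values():
    a["pay_succ_rate"] = a["pay_succ_cnt"] / a["order_cnt"]

print(dict(agg))
```

&lt;p&gt;An application-layer table would then read such aggregates directly, without recomputing the base metrics.&lt;/p&gt;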

&lt;p&gt;By standardizing naming across tables, fields, and metrics, a data warehouse can achieve clear semantics, consistent structure, and efficient collaboration. While such conventions may introduce some overhead initially, they are essential for scalability and team coordination in the long term.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Earlier Posts in This Series:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/codex/4-why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4fddecde4288?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(4)Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;(3) Key Design Principles for ODS/Detail Layer Implementation: Building the Data Ingestion Layer as a “Stable and Operable” Infrastructure&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@apacheseatunnel/i-a-complete-guide-to-building-and-standardizing-a-modern-lakehouse-architecture-an-overview-of-9a2a263f2f1b?source=your_stories_outbox---writer_outbox_published-----------------------------------------" rel="noopener noreferrer"&gt;(I) A Complete Guide to Building and Standardizing a Modern Lakehouse Architecture: An Overview of Data Warehouses and Data Lakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Next Post:&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  (6) DataOps Development Standards and Best Practices
&lt;/h2&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>bigdata</category>
      <category>datawarehouse</category>
    </item>
    <item>
      <title>Growing with the Community: Zhang Shenghang’s Path to Apache SeaTunnel PMC Member</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:55:16 +0000</pubDate>
      <link>https://dev.to/seatunnel/growing-with-the-community-zhang-shenghangs-path-to-apache-seatunnel-pmc-member-3co1</link>
      <guid>https://dev.to/seatunnel/growing-with-the-community-zhang-shenghangs-path-to-apache-seatunnel-pmc-member-3co1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhipmcy6jrz7ao2ul4w5h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhipmcy6jrz7ao2ul4w5h.jpg" width="800" height="377"&gt;&lt;/a&gt;&lt;br&gt;
🎉 Hi Community—more exciting news! Zhang Shenghang has been invited to join the Apache SeaTunnel PMC in recognition of his outstanding contributions—well deserved!&lt;/p&gt;

&lt;p&gt;Over the years, Zhang has been highly active in the Apache SeaTunnel community. From improving code quality, refining documentation, to engaging with the community and mentoring newcomers, his presence has been everywhere. He consistently embraces the Apache Way, contributing with dedication and passion to the growth of the project.&lt;/p&gt;

&lt;p&gt;We took this opportunity to conduct an in-depth interview with him. Covering his background, open source journey, PMC role, and thoughts on community development and culture, this conversation offers a closer look at his story and his enthusiasm for open source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Background &amp;amp; Open Source Journey
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Could you briefly introduce yourself and how you entered the big data and open source space?&lt;br&gt;
Name: Zhang Shenghang&lt;br&gt;
GitHub: zhangshenghang&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvnu7d1ec2vu0l315yhw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvnu7d1ec2vu0l315yhw.jpg" width="415" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;When did you start contributing to Apache SeaTunnel, and what was the motivation?&lt;br&gt;
I started contributing to Apache SeaTunnel in June 2024. Initially, I was using DataX, a classic standalone data integration tool. However, it lacks service-oriented and distributed capabilities, which creates limitations in large-scale data synchronization scenarios. That’s when I came across Apache SeaTunnel as a more comprehensive solution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What key contributions or features have you worked on in SeaTunnel?&lt;br&gt;
He has contributed to multiple core features and improvements, including adding a pending queue feature for SeaTunnel Engine task scheduling, enabling Kafka Protobuf format support, introducing Kerberos testing in e2e workflows, implementing a new resource scheduling algorithm in SeaTunnel Engine, adding TTL support for HBase Sink, introducing API-based log retrieval, fixing Flink source 100% busy issues, supporting the Typesense connector, enabling default value substitution for configuration variables, fixing Doris custom SQL execution issues, correcting Kafka consumer offset auto-commit logic, and resolving RabbitMQ checkpoint issues in Flink mode.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Source Contributions &amp;amp; Growth
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Which contribution or experience impressed you the most?&lt;br&gt;
What impressed me most was not just submitting a PR, but the full process—from discovering a problem, analyzing it, discussing solutions with the community, to finally implementing and validating the fix. Issues involving engine scheduling, resource allocation, and Flink stability often look simple on the surface but are deeply tied to framework mechanisms and runtime behavior. Solving them requires both deep code understanding and close collaboration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the most important skill in open source collaboration?&lt;br&gt;
All are important, but if I had to choose one, it would be the ability to collaborate continuously. Technical skills are foundational, but communication is equally critical—open source is not just about writing code, but explaining context, design decisions, and trade-offs clearly so others can understand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What advice would you give to beginners in open source?&lt;br&gt;
Don’t overestimate the difficulty. You don’t need to start with massive features or deep architectural changes. Fixing a bug, improving documentation, adding tests, or optimizing small features are all valuable contributions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Becoming a PMC Member
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Congratulations on becoming a PMC Member! What was your first reaction?&lt;br&gt;
Thank you. My first reaction was both excitement and a strong sense of responsibility. It’s recognition of past contributions, but also a reminder that a PMC Member is not just a contributor, but a community builder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What does becoming a PMC Member mean to you and the community?&lt;br&gt;
To me, it represents recognition of long-term contributions, collaboration ability, and responsibility. Personally, it means thinking beyond individual modules and considering the project’s overall development, governance, and ecosystem. For the community, more PMC Members mean more people willing to take responsibility and drive sustainable growth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How important is the Apache Way to open source success?&lt;br&gt;
It emphasizes “Community Over Code.” A project succeeds not just because of good code, but because of an open, transparent, and sustainable collaboration culture.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  SeaTunnel Community Development
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;What key milestones has SeaTunnel gone through?&lt;br&gt;
SeaTunnel has evolved from a data synchronization tool into a more comprehensive data integration platform, expanding across connectors, orchestration, engines, and observability. The maturation of SeaTunnel Engine is a major turning point, enabling stronger unified execution capabilities. Additionally, increased community activity and internationalization have significantly boosted its impact.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How do you see SeaTunnel’s position and future?&lt;br&gt;
SeaTunnel is building a unique position by balancing rich connectors, strong engine capabilities, scalability, and enterprise readiness. Compared to traditional tools, it fits modern data infrastructure better; compared to heavyweight platforms, it remains flexible and extensible. It has strong potential to become a leading global open source data integration project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What are your future plans as a PMC Member?&lt;br&gt;
I plan to focus on improving SeaTunnel Engine, scheduling, resource management, and system stability; strengthening connectors and production readiness; and helping new contributors onboard faster through issue guidance, PR reviews, and knowledge sharing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Personal Growth &amp;amp; Open Source Culture
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;How has open source impacted your career and growth?&lt;br&gt;
Professionally, it has exposed me to real-world complex problems and high-standard collaboration environments. Personally, it has deepened my understanding of collaboration, responsibility, and long-term thinking. Open source has shaped not only my technical skills but also my mindset and working style.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How would you summarize the spirit of open source in one sentence?&lt;br&gt;
Open source is about collaboratively creating, improving, and sharing technology in an open and inclusive way for the benefit of everyone.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>asf</category>
      <category>community</category>
      <category>bigdata</category>
      <category>apacheseatunnel</category>
    </item>
    <item>
      <title>Rethinking ClassLoader Governance in Apache SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 03 Apr 2026 02:45:04 +0000</pubDate>
      <link>https://dev.to/seatunnel/rethinking-classloader-governance-in-apache-seatunnel-2leh</link>
      <guid>https://dev.to/seatunnel/rethinking-classloader-governance-in-apache-seatunnel-2leh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjud5he2ysxi7mt0jg01.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjud5he2ysxi7mt0jg01.jpg" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, while diving into the Apache SeaTunnel Zeta Engine codebase, I followed the ClassLoader thread and conducted a relatively systematic review.&lt;/p&gt;

&lt;p&gt;Overall, the current design already has a clear foundational structure, especially the centralized management approach of &lt;code&gt;ClassLoaderService&lt;/code&gt;, which is actually quite rare among similar systems 👍.&lt;/p&gt;

&lt;p&gt;Here, I try to take a different perspective—starting from &lt;strong&gt;“ClassLoader governance in long-running runtimes”&lt;/strong&gt;—to summarize some observations and outline a possible evolution path. These may not be entirely accurate, but are intended to spark discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  From “Usable” to “Governable”
&lt;/h2&gt;

&lt;p&gt;Apache SeaTunnel already handles multi-connector coexistence and dynamic loading and execution well. From a “functional availability” perspective, the mechanism works. But if we move one step further and ask, &lt;strong&gt;can ClassLoaders have a controllable lifecycle and verifiable reclamation?&lt;/strong&gt;, the evaluation criteria begin to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observations (Runtime-Oriented)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Semantic Gap Between “Release” and “Close”
&lt;/h3&gt;

&lt;p&gt;Currently, &lt;code&gt;releaseClassLoader()&lt;/code&gt; removes cache entries and performs some thread-level cleanup when the reference count drops to zero, but it does not explicitly call &lt;code&gt;URLClassLoader.close()&lt;/code&gt;. For example: &lt;code&gt;DefaultClassLoaderService.releaseClassLoader()&lt;/code&gt; (no close call observed) and &lt;code&gt;DefaultClassLoaderService.close()&lt;/code&gt; mainly clears internal cache structures. This raises a noteworthy point: JAR handle release depends on GC timing, and in long-running scenarios or on certain platforms (such as Windows), files may not be released promptly. 👉 This is closer to “logical release” rather than “end of resource lifecycle”.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Class Loading Boundaries Can Still Change at Runtime
&lt;/h3&gt;

&lt;p&gt;In some paths, dependencies are still injected into the current ClassLoader via &lt;code&gt;addURL&lt;/code&gt;, such as: reflective calls to &lt;code&gt;addURL&lt;/code&gt; in &lt;code&gt;AbstractPluginDiscovery&lt;/code&gt;, and plugin dependency injection into the current loader in Flink execution paths. This leads to an interesting phenomenon: class loading boundaries are not only defined by loader structure, but also influenced by runtime behavior. While not problematic for a single job, under scenarios like repeated jobs in the same process or switching plugin combinations, boundaries may accumulate “historical residue”.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Some Residual Surfaces Are Not Fully Closed
&lt;/h3&gt;

&lt;p&gt;There are multiple TCCL usage patterns in the codebase (synchronous / asynchronous / cross-thread), and some paths show: TCCL not restored in &lt;code&gt;finally&lt;/code&gt;, or inconsistent baselines during cross-thread restoration. For example: TCCL usage in cooperative workers within &lt;code&gt;TaskExecutionService&lt;/code&gt;, and asymmetric restoration in some operations (such as source / restore). Additionally, some typical ClassLoader retention points are not yet uniformly governed, such as JDBC Driver registration (e.g., TDengine-related implementations) and connectors directly setting TCCL without restoring it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Possible Evolution Path (For Reference)
&lt;/h2&gt;

&lt;p&gt;Based on these observations, I’ve outlined a &lt;strong&gt;progressive governance path&lt;/strong&gt; that avoids large-scale refactoring and can be implemented in phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Close the ClassLoader Lifecycle
&lt;/h3&gt;

&lt;p&gt;Key ideas: explicitly call &lt;code&gt;close()&lt;/code&gt; on URLClassLoaders created by SeaTunnel at the appropriate time, and define clear ownership—“who creates, who closes”. This shifts from “GC-dependent release” to “controlled release”.&lt;/p&gt;
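&lt;p&gt;A minimal sketch of this ownership model follows. &lt;code&gt;RefCountedLoader&lt;/code&gt; is a hypothetical name for illustration, not the actual &lt;code&gt;DefaultClassLoaderService&lt;/code&gt; API:&lt;/p&gt;

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: the component that creates a loader also closes it, once the
// reference count reaches zero — "who creates, who closes".
final class RefCountedLoader {
    private final URLClassLoader loader;
    private int refCount = 1;
    private boolean closed;

    RefCountedLoader(URL[] jars, ClassLoader parent) {
        this.loader = new URLClassLoader(jars, parent);
    }

    synchronized void retain() {
        refCount++;
    }

    // Deterministic release: close the underlying URLClassLoader when the
    // last reference is gone, freeing JAR file handles immediately instead
    // of waiting for GC.
    synchronized void release() throws IOException {
        if (--refCount == 0 && !closed) {
            loader.close();
            closed = true;
        }
    }

    synchronized boolean isClosed() {
        return closed;
    }
}
```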

&lt;h3&gt;
  
  
  Phase 2: Stabilize Loading Boundaries
&lt;/h3&gt;

&lt;p&gt;Goals: avoid runtime &lt;code&gt;addURL&lt;/code&gt; where possible, and determine the full classpath before loader creation. This ensures consistent behavior of the same loader over time.&lt;/p&gt;
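&lt;p&gt;A small sketch of this idea: resolve the complete JAR list first, then create the loader once, so its boundary never drifts. &lt;code&gt;ConnectorLoaderFactory&lt;/code&gt; is an illustrative name, not existing SeaTunnel code:&lt;/p&gt;

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;

final class ConnectorLoaderFactory {
    // Build the full classpath up front; no addURL calls after construction,
    // so the same loader behaves identically over its whole lifetime.
    static URLClassLoader create(List<Path> jars, ClassLoader parent)
            throws MalformedURLException {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < jars.size(); i++) {
            urls[i] = jars.get(i).toUri().toURL();
        }
        return new URLClassLoader(urls, parent);
    }
}
```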

&lt;h3&gt;
  
  
  Phase 3: Consolidate Common Residual Points
&lt;/h3&gt;

&lt;p&gt;Standardize patterns such as: wrapping TCCL with try-with-resources, pairing JDBC Driver registration and deregistration, and clearly assigning ClassLoader ownership to threads and ThreadLocal. This turns implicit references into manageable resources.&lt;/p&gt;
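&lt;p&gt;The TCCL pattern can be sketched as a small &lt;code&gt;AutoCloseable&lt;/code&gt; so that restoration is guaranteed even when the body throws. &lt;code&gt;ThreadContextClassLoaderScope&lt;/code&gt; is a hypothetical helper name for illustration:&lt;/p&gt;

```java
// Sketch: wrap TCCL switching in try-with-resources so the previous
// context ClassLoader is always restored, even on exceptions.
final class ThreadContextClassLoaderScope implements AutoCloseable {
    private final ClassLoader previous;

    ThreadContextClassLoaderScope(ClassLoader scoped) {
        this.previous = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(scoped);
    }

    @Override
    public void close() {
        // Restore the original TCCL unconditionally.
        Thread.currentThread().setContextClassLoader(previous);
    }
}
```

&lt;p&gt;Usage: &lt;code&gt;try (ThreadContextClassLoaderScope s = new ThreadContextClassLoaderScope(pluginLoader)) { ... }&lt;/code&gt;.&lt;/p&gt;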

&lt;h3&gt;
  
  
  Phase 4: Introduce Verifiable Reclamation
&lt;/h3&gt;

&lt;p&gt;As an enhancement: use &lt;code&gt;WeakReference + ReferenceQueue&lt;/code&gt; to track loaders, or expose simple runtime metrics (e.g., number of live loaders). The goal is not absolute precision, but the ability to reasonably judge whether resources have been released.&lt;/p&gt;
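&lt;p&gt;A minimal sketch of such tracking, assuming a hypothetical &lt;code&gt;LoaderTracker&lt;/code&gt; registered wherever loaders are created:&lt;/p&gt;

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: WeakReference + ReferenceQueue to observe whether released
// ClassLoaders are actually reclaimed by GC.
final class LoaderTracker {
    private final ReferenceQueue<ClassLoader> queue = new ReferenceQueue<>();
    private final Set<WeakReference<ClassLoader>> live = ConcurrentHashMap.newKeySet();

    void register(ClassLoader loader) {
        live.add(new WeakReference<>(loader, queue));
    }

    // Drain references whose loaders were collected, then report how many
    // tracked loaders are still alive — a simple runtime metric.
    int liveCount() {
        Reference<? extends ClassLoader> collected;
        while ((collected = queue.poll()) != null) {
            live.remove(collected);
        }
        return live.size();
    }
}
```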

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;These issues rarely surface in short-lived tasks. But in scenarios such as long-running engine nodes, repeated task scheduling, or frequent plugin switching, these boundary issues accumulate over time. The results may include Metaspace growth, inability to replace JARs, and occasional class conflicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  One-Sentence Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From “class isolation” to “governable ClassLoaders with verifiable reclamation.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The above reflects my current understanding and organization of the topic. Some points may not be entirely accurate—feedback and real-world scenarios are very welcome 🙌. If the community is interested, this could evolve into a more general and reusable infrastructure capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix: Code References
&lt;/h2&gt;

&lt;p&gt;Some code locations noted during analysis (not exhaustive): &lt;code&gt;DefaultClassLoaderService&lt;/code&gt; (release/close), &lt;code&gt;AbstractPluginDiscovery&lt;/code&gt; (addURL), Flink starter execution paths (plugin injection), &lt;code&gt;TaskExecutionService&lt;/code&gt; (TCCL usage), various operations (source/restore), and connectors (Iceberg / Paimon / TDengine, etc.).&lt;/p&gt;

</description>
      <category>classloader</category>
      <category>apacheseatunnel</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>From Apache SeaTunnel to ASF Member: A Story of Long-Term Commitment</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 27 Mar 2026 03:15:17 +0000</pubDate>
      <link>https://dev.to/seatunnel/from-apache-seatunnel-to-asf-member-a-story-of-long-term-commitment-4pp9</link>
      <guid>https://dev.to/seatunnel/from-apache-seatunnel-to-asf-member-a-story-of-long-term-commitment-4pp9</guid>
      <description>&lt;p&gt;Recently, after internal discussions, the Apache Software Foundation invited several PMC Members from the Apache SeaTunnel project to become ASF Members—one of the highest honors within the foundation. Among them is &lt;strong&gt;Wang Hailin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp33vya9ozsbnl9drwnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp33vya9ozsbnl9drwnn.png" alt="3d5c8aaf1091f7a7ef66425e97d147bc" width="800" height="721"&gt;&lt;/a&gt;&lt;br&gt;
Congratulations to Wang Hailin on becoming an ASF Member! For a key contributor to the SeaTunnel community, this recognition is not only a personal milestone, but also a moment of pride for the entire community.&lt;/p&gt;

&lt;p&gt;Over the years, he has remained deeply involved in the community: from refining documentation to improving code, from participating in technical discussions to helping newcomers. His contributions can be seen across almost every corner of the project. Beyond SeaTunnel, he has also been actively contributing to multiple ASF projects, consistently practicing the Apache Way advocated by the foundation. It is this steady, long-term dedication that has led to this important recognition.&lt;/p&gt;

&lt;p&gt;To mark the occasion, the community conducted an in-depth interview with him. This article is structured into five sections—personal background, open-source journey, the path to ASF Member, SeaTunnel community development, and open-source culture—to give a closer look at his growth, his experiences in open source, and the passion and persistence behind his contributions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Background &amp;amp; Open Source Journey
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcyr6qckib47t2xmgng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falcyr6qckib47t2xmgng.png" alt="王海林" width="800" height="1069"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q1: Could you briefly introduce yourself and how you got into big data and open source?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Hey guys, I’m Wang Hailin, and my GitHub ID is hailin0. I mainly work on data infrastructure, with a focus on data integration, data synchronization, and data platforms.&lt;/p&gt;

&lt;p&gt;Outside of work, I enjoy engaging with open-source communities—sharing practical experience and exchanging ideas around data platforms and integration technologies.&lt;/p&gt;

&lt;p&gt;My entry into big data and open source is closely tied to my earlier work experience. While working on systems like data development platforms and performance monitoring, I frequently dealt with data ingestion and synchronization challenges, which required exploring various data integration tools.&lt;/p&gt;

&lt;p&gt;That’s when I came across SeaTunnel. What stood out to me was its extensible architecture—it supports a wide range of data sources and complex synchronization scenarios, making it well-suited for enterprise use. This sparked my interest, and I gradually started contributing to the community. Over time, through continuous contributions and discussions, I became one of the core contributors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2: When did you start contributing to SeaTunnel, and what was the trigger?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: It started from a practical need at work. At the time, I was building a data platform and needed a reliable data integration tool. During that evaluation process, I discovered SeaTunnel.&lt;/p&gt;

&lt;p&gt;Back then, the project wasn’t as mature as it is today, but its architecture left a strong impression on me—especially the plugin-based Connector system and the flexible data synchronization model.&lt;/p&gt;

&lt;p&gt;I began using SeaTunnel in real-world scenarios, and gradually got involved in contributing. Starting with small fixes and bug patches, I later participated in more feature development and community discussions, eventually becoming a long-term contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3: What key areas or features have you contributed to in SeaTunnel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: My contributions mainly fall into a few areas.&lt;/p&gt;

&lt;p&gt;Early on, I worked on Connector development and improvements. For a data integration platform, the Connector ecosystem is fundamental—it determines which data sources and systems the platform can connect to.&lt;/p&gt;

&lt;p&gt;As I became more involved, I also contributed to framework-level and infrastructure work, such as improving the E2E testing system and refining the logging framework to make the project more robust and standardized.&lt;/p&gt;

&lt;p&gt;Later, as I gained a deeper understanding of the synchronization engine, I started working on CDC (Change Data Capture) capabilities, including CDC read/write and DDL synchronization. In real production environments, schema changes (DDL) are unavoidable. If a system cannot handle schema evolution properly, data pipelines can easily break.&lt;/p&gt;

&lt;p&gt;Overall, these efforts are driven by a single goal: to make SeaTunnel not just a data synchronization tool, but a reliable data integration infrastructure for enterprise environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Contributions &amp;amp; Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q4: Which contribution or experience left the deepest impression on you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: One experience that stands out is working on DDL support in CDC scenarios.&lt;/p&gt;

&lt;p&gt;At first glance, DDL may seem like a simple SQL parsing problem. But in a data synchronization system, it must flow correctly through the entire pipeline: from Source capturing the event, to passing it through the data stream, to executing schema changes on the Sink.&lt;/p&gt;

&lt;p&gt;The real challenge lies in maintaining consistency between DDL and data changes. In practice, synchronization jobs run concurrently across multiple nodes, so DDL events must maintain a consistent order throughout the distributed pipeline.&lt;/p&gt;

&lt;p&gt;This requires tight integration with state management mechanisms like Checkpoint and Savepoint, ensuring that after recovery or restart, DDL and data events remain in the correct order.&lt;/p&gt;

&lt;p&gt;When you combine all these factors, DDL handling becomes a system-level challenge involving distributed data flow, state consistency, and multi-system compatibility.&lt;/p&gt;

&lt;p&gt;This work took quite a long time and involved extensive discussions with other contributors. It’s one of the more complex aspects of many data synchronization systems, and we aimed to make SeaTunnel more reliable for enterprise real-time scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5: What do you think is the most important skill in open source collaboration?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: I would say communication and collaboration are critical.&lt;/p&gt;

&lt;p&gt;Technical skills are the foundation, but many decisions in open source are made through discussion and consensus. Being able to clearly express your ideas, understand others’ perspectives, and move toward agreement is essential.&lt;/p&gt;

&lt;p&gt;Another important factor is patience and long-term commitment. Open source is not a short-term effort—it requires sustained involvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q6: What advice would you give to newcomers in open source?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Start small. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix a bug&lt;/li&gt;
&lt;li&gt;Improve documentation&lt;/li&gt;
&lt;li&gt;Submit a small feature enhancement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps you get familiar with the codebase and development workflow.&lt;/p&gt;

&lt;p&gt;Also, participate in discussions. Even asking questions or joining simple conversations helps you understand the project’s design.&lt;/p&gt;

&lt;p&gt;Open source is a long journey—you don’t need to aim for big features at the beginning. What matters more is understanding the architecture, not just the code.&lt;/p&gt;

&lt;p&gt;Many core contributors grow over years—from users to contributors, and eventually to maintainers.&lt;/p&gt;

&lt;p&gt;For me, the biggest gain from open source is not a specific piece of code, but the opportunity to collaborate with developers from different companies and backgrounds. That experience is incredibly valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Becoming an ASF Member
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q7: What was your first reaction when you were invited to become an ASF Member?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: I was surprised and very grateful.&lt;/p&gt;

&lt;p&gt;ASF Membership is not something you apply for—it comes through nomination and voting by existing members. So it represents recognition from the community for long-term contributions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8: How closely is this achievement tied to your work in SeaTunnel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Very closely.&lt;/p&gt;

&lt;p&gt;The SeaTunnel community gave me many opportunities to grow—from contributing code to participating in community governance. Through this process, I gradually learned how Apache communities operate.&lt;/p&gt;

&lt;p&gt;It’s not just about technical contributions, but also collaboration and governance, which are all important factors in becoming an ASF Member.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q9: What does becoming an ASF Member mean to you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: To me, it represents responsibility.&lt;/p&gt;

&lt;p&gt;It’s not only recognition of past contributions, but also a commitment to continue contributing to the Apache community—helping projects grow, supporting new projects entering the ecosystem, and promoting open-source culture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q10: How do you see the importance of the Apache Way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: The Apache community emphasizes &lt;strong&gt;“Community Over Code.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A successful project needs not only strong technology, but also a healthy community, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open and transparent decision-making&lt;/li&gt;
&lt;li&gt;Consensus-driven governance&lt;/li&gt;
&lt;li&gt;Encouraging participation from diverse contributors&lt;/li&gt;
&lt;li&gt;Continuously welcoming new contributors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are key reasons why Apache projects can succeed in the long run.&lt;/p&gt;

&lt;h2&gt;
  
  
  SeaTunnel Community Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q11: What are the key milestones in SeaTunnel’s growth?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Several milestones stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entering the Apache Incubator&lt;/li&gt;
&lt;li&gt;Unifying APIs and introducing the Zeta engine&lt;/li&gt;
&lt;li&gt;Graduating as a Top-Level Project (TLP)&lt;/li&gt;
&lt;li&gt;Rapid iteration in the 2.3.x series with increasing stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SeaTunnel was open-sourced in 2017, entered the Apache Incubator in 2021, and became a TLP in 2023. This journey reflects not only technical evolution but also the maturation of community governance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q12: How do you see SeaTunnel’s positioning in data integration?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: In recent years, the demand for efficient data movement has grown significantly, and synchronization scenarios have become more complex.&lt;/p&gt;

&lt;p&gt;SeaTunnel aims to be a high-performance, extensible platform that supports diverse data integration needs across different use cases.&lt;/p&gt;

&lt;p&gt;It already supports multiple data sources, batch processing, real-time synchronization, and CDC.&lt;/p&gt;

&lt;p&gt;Looking ahead, I believe it will continue to evolve in areas such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expanding the connector ecosystem&lt;/li&gt;
&lt;li&gt;Strengthening data transformation capabilities&lt;/li&gt;
&lt;li&gt;Improving fault handling&lt;/li&gt;
&lt;li&gt;Enhancing ecosystem integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open Source Culture &amp;amp; Personal Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q13: How has open source influenced your career?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: It has influenced me in two major ways.&lt;/p&gt;

&lt;p&gt;First, it broadened my technical perspective. In company projects, decisions are often driven by specific business needs. In open source, designs must work across different use cases, systems, and organizations. This leads to a more comprehensive understanding of system design.&lt;/p&gt;

&lt;p&gt;Second, it deepened my understanding of software engineering and collaboration. In open source, a feature goes through idea proposal, design discussion, review, and iteration before merging. This process emphasizes design and communication, not just coding.&lt;/p&gt;

&lt;p&gt;Working with developers from different countries and backgrounds also brings fresh perspectives.&lt;/p&gt;

&lt;p&gt;For me, the biggest gain is the opportunity to collaborate in an open environment and solve problems with talented engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q14: How would you summarize the spirit of open source in one sentence?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A: Based on my experience, the most valuable aspect of open source is that it provides a space for long-term participation and growth.&lt;/p&gt;

&lt;p&gt;I started as a user, using tools to solve problems. Then I began contributing small fixes, and gradually got involved in feature development and core system design.&lt;/p&gt;

&lt;p&gt;Looking back, it’s a journey from user → contributor → maintainer.&lt;/p&gt;

&lt;p&gt;In a company, knowledge often stays within a team. In open source, your work can be seen, used, and improved by many others. As the project grows, so do the people involved.&lt;/p&gt;

&lt;p&gt;So if I had to summarize it in one sentence:&lt;/p&gt;

&lt;p&gt;Open source is not just about sharing code—it’s about growing together with the community.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>asf</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Apache SeaTunnel Performance Tuning: How to Set JVM Parameters the Right Way</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 27 Mar 2026 03:13:13 +0000</pubDate>
      <link>https://dev.to/seatunnel/apache-seatunnel-performance-tuning-how-to-set-jvm-parameters-the-right-way-28e0</link>
      <guid>https://dev.to/seatunnel/apache-seatunnel-performance-tuning-how-to-set-jvm-parameters-the-right-way-28e0</guid>
<description>&lt;p&gt;Apache SeaTunnel is a high-performance distributed data integration platform, and tuning its JVM parameters properly is essential if you want better throughput, lower latency, and stable execution.&lt;/p&gt;

&lt;p&gt;So how should you tune JVM parameters?&lt;br&gt;
In this article, we’ll walk through where to configure them, how precedence works, the key parameters to focus on, and some practical tuning strategies.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Configuration File Locations
&lt;/h2&gt;

&lt;p&gt;SeaTunnel manages JVM parameters through configuration files under &lt;code&gt;$SEATUNNEL_HOME/config/&lt;/code&gt;. Depending on the deployment role, there are four main files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Name&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Default Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hybrid mode (&lt;code&gt;master_and_worker&lt;/code&gt;), where Master and Worker run in the same process&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g -XX:+UseG1GC&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_master_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dedicated Master node, responsible for scheduling and state management (no computation)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_worker_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dedicated Worker node, responsible for data reading, transformation, and writing (main memory consumer)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms2g -Xmx2g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jvm_client_options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Client side (&lt;code&gt;seatunnel.sh&lt;/code&gt;), used to parse configs and submit jobs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-Xms256m -Xmx512m&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  2. Parameter Precedence
&lt;/h2&gt;

&lt;p&gt;Understanding parameter precedence is critical when troubleshooting.&lt;/p&gt;

&lt;p&gt;SeaTunnel loads JVM parameters in the following order, and &lt;strong&gt;later ones override earlier ones&lt;/strong&gt; (for example, the last &lt;code&gt;-Xmx&lt;/code&gt; wins):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environment variable &lt;code&gt;JAVA_OPTS&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Loaded first. You can define it in system env variables or in &lt;code&gt;config/seatunnel-env.sh&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configuration files (&lt;code&gt;config/jvm_*_options&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Loaded next, and &lt;strong&gt;override anything set in &lt;code&gt;JAVA_OPTS&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Command-line parameters (&lt;code&gt;-DJvmOption&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;
Loaded last, with &lt;strong&gt;the highest priority&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If &lt;code&gt;JAVA_OPTS="-Xmx4g"&lt;/code&gt;, the config file sets &lt;code&gt;-Xmx2g&lt;/code&gt;, and the startup command includes &lt;code&gt;-DJvmOption="-Xmx8g"&lt;/code&gt;, then the effective value will be &lt;strong&gt;8g&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Key JVM Tuning Parameters
&lt;/h2&gt;
&lt;h3&gt;
  
  
  3.1 Heap Memory
&lt;/h3&gt;

&lt;p&gt;Heap memory is the most important part of JVM tuning. It directly determines how much data SeaTunnel can process in parallel without running into OOM (Out Of Memory) errors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-Xms&lt;/code&gt;&lt;/strong&gt;: Initial heap size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-Xmx&lt;/code&gt;&lt;/strong&gt;: Maximum heap size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Worker nodes&lt;/strong&gt;:&lt;br&gt;
It’s strongly recommended to set &lt;code&gt;-Xms&lt;/code&gt; and &lt;code&gt;-Xmx&lt;/code&gt; to the &lt;strong&gt;same value&lt;/strong&gt; (for example, &lt;code&gt;-Xms8g -Xmx8g&lt;/code&gt;).&lt;br&gt;
This avoids runtime heap resizing, reduces performance fluctuations, and helps prevent memory fragmentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Master nodes&lt;/strong&gt;:&lt;br&gt;
Memory requirements are relatively low. In most cases, &lt;code&gt;2g–4g&lt;/code&gt; is sufficient. Increase it only if the cluster handles many jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client&lt;/strong&gt;:&lt;br&gt;
The default &lt;code&gt;512m&lt;/code&gt; is usually enough. If your job configuration (SQL/JSON) is very large (tens of thousands of lines), consider increasing it to &lt;code&gt;1g&lt;/code&gt; or more.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
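&lt;p&gt;As an illustration, a &lt;code&gt;config/jvm_worker_options&lt;/code&gt; fragment for a worker node with roughly 12GB of physical memory might look like the following (the sizes are examples, not recommendations for every workload):&lt;/p&gt;

```
# config/jvm_worker_options (illustrative sizes)
# Fixed heap: -Xms == -Xmx avoids runtime resizing
-Xms8g
-Xmx8g
```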
&lt;h3&gt;
  
  
  3.2 Off-Heap Memory
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Important note:&lt;/strong&gt;&lt;br&gt;
You may notice that the actual physical memory (RSS) used by SeaTunnel is significantly larger than the &lt;code&gt;-Xmx&lt;/code&gt; value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;br&gt;
SeaTunnel uses Netty for network communication, which relies heavily on &lt;strong&gt;off-heap (direct) memory&lt;/strong&gt; for zero-copy data transfer.&lt;br&gt;
In addition, thread stacks (&lt;code&gt;-Xss * number of threads&lt;/code&gt;), Metaspace, and JVM overhead also consume non-heap memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Risk:&lt;/strong&gt;&lt;br&gt;
If the machine runs out of physical memory, the Linux OOM Killer may terminate the process (usually a Worker).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommendations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reserve memory for the OS:&lt;/strong&gt;&lt;br&gt;
On an 8GB machine, keep &lt;code&gt;-Xmx&lt;/code&gt; below &lt;code&gt;5g&lt;/code&gt;, leaving around 3GB for off-heap memory and the operating system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker/Kubernetes:&lt;/strong&gt;&lt;br&gt;
The container memory limit must be larger than &lt;code&gt;-Xmx&lt;/code&gt; plus estimated off-heap usage.&lt;br&gt;
A common rule is to set it to about &lt;strong&gt;1.5× &lt;code&gt;-Xmx&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
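&lt;p&gt;As a sanity check, the budget above can be worked through with plain shell arithmetic. The off-heap and OS figures below are illustrative estimates, not SeaTunnel defaults:&lt;/p&gt;

```shell
# Rough memory budget for an 8 GB host (all values in MB).
# XMX follows the "keep -Xmx below 5g" advice; the off-heap and OS
# reserves are rough estimates, not measured values.
XMX_MB=5120        # -Xmx5g heap
OFFHEAP_MB=2048    # Netty direct buffers, Metaspace, thread stacks (estimate)
OS_MB=1024         # reserve for the OS and other processes

TOTAL_MB=$((XMX_MB + OFFHEAP_MB + OS_MB))
echo "estimated footprint: ${TOTAL_MB} MB"   # 8192 MB, i.e. the full 8 GB host

# The 1.5x rule of thumb for a container memory limit:
LIMIT_MB=$((XMX_MB * 3 / 2))
echo "container limit:     ${LIMIT_MB} MB"   # 7680 MB
```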
&lt;h3&gt;
  
  
  3.3 Garbage Collector
&lt;/h3&gt;

&lt;p&gt;SeaTunnel’s Zeta engine recommends using &lt;strong&gt;G1GC&lt;/strong&gt;, which provides more predictable pause times for large heaps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:+UseG1GC&lt;/code&gt;&lt;/strong&gt;: Enable G1 GC (enabled by default)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:MaxGCPauseMillis=200&lt;/code&gt;&lt;/strong&gt;: Target maximum GC pause time (in milliseconds)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time workloads&lt;/strong&gt;:
If latency is critical, you can lower this value (e.g., &lt;code&gt;100&lt;/code&gt;).
Keep in mind this may increase GC frequency and slightly reduce overall throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch workloads&lt;/strong&gt;:
The default &lt;code&gt;200ms&lt;/code&gt; is usually a good balance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;-XX:InitiatingHeapOccupancyPercent=45&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
Heap occupancy threshold that triggers concurrent GC.&lt;br&gt;
If you observe frequent Full GC, try lowering it (e.g., &lt;code&gt;40&lt;/code&gt;) so GC starts earlier.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
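&lt;p&gt;To check whether these targets are actually being met, GC behavior can be sampled with the standard JDK tools. This is a diagnostic sketch that assumes the process name matches &lt;code&gt;SeaTunnelServer&lt;/code&gt;, as in the &lt;code&gt;jps&lt;/code&gt; example later in this article:&lt;/p&gt;

```shell
# Sample G1 utilization every 5 seconds; watch the FGC column --
# a steadily climbing Full GC count suggests lowering
# InitiatingHeapOccupancyPercent or enlarging the heap.
PID=$(jps | awk '/SeaTunnelServer/ {print $1}')
jstat -gcutil "$PID" 5000
```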
&lt;h3&gt;
  
  
  3.4 Metaspace
&lt;/h3&gt;

&lt;p&gt;Metaspace stores class metadata. SeaTunnel consumes metaspace when loading connectors.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:MaxMetaspaceSize&lt;/code&gt;&lt;/strong&gt;: Maximum metaspace size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default (&lt;code&gt;2g&lt;/code&gt;) is usually sufficient.&lt;br&gt;
If you encounter &lt;code&gt;java.lang.OutOfMemoryError: Metaspace&lt;/code&gt;, increase it accordingly.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.5 Troubleshooting
&lt;/h3&gt;

&lt;p&gt;When an &lt;code&gt;OutOfMemoryError&lt;/code&gt; occurs, heap dumps are extremely helpful for diagnosis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:+HeapDumpOnOutOfMemoryError&lt;/code&gt;&lt;/strong&gt;: Generate a heap dump automatically on OOM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-XX:HeapDumpPath=/tmp/seatunnel/dump/&lt;/code&gt;&lt;/strong&gt;: Path to store dump files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure the disk has enough space (at least larger than &lt;code&gt;-Xmx&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;In container environments, ensure the path is mounted to the host; otherwise, dumps will be lost after restart&lt;/li&gt;
&lt;/ul&gt;
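&lt;p&gt;A little preparation avoids losing the dump when it is needed most. The path below mirrors the &lt;code&gt;HeapDumpPath&lt;/code&gt; example above:&lt;/p&gt;

```shell
# Create the dump directory up front and confirm free space on its filesystem.
mkdir -p /tmp/seatunnel/dump
df -h /tmp/seatunnel/dump
```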
&lt;h2&gt;
  
  
  4. JDK Compatibility
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommended versions&lt;/strong&gt;: &lt;strong&gt;Java 8 (JDK 1.8)&lt;/strong&gt; or &lt;strong&gt;Java 11&lt;/strong&gt;&lt;br&gt;
These are the most thoroughly tested versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Java 17+&lt;/strong&gt;:&lt;br&gt;
Generally supported, but due to the module system introduced in Java 9+, you may encounter &lt;code&gt;InaccessibleObjectException&lt;/code&gt; caused by restricted reflection access.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
If this happens, add &lt;code&gt;--add-opens&lt;/code&gt; options in &lt;code&gt;jvm_options&lt;/code&gt;, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--add-opens&lt;/span&gt; java.base/java.lang&lt;span class="o"&gt;=&lt;/span&gt;ALL-UNNAMED
&lt;span class="nt"&gt;--add-opens&lt;/span&gt; java.base/java.util&lt;span class="o"&gt;=&lt;/span&gt;ALL-UNNAMED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Production Tuning Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Large-Scale Batch Processing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Characteristics&lt;/strong&gt;: Large data volume (TB scale), throughput is the priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xms8g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx8g&lt;/span&gt;
&lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC
&lt;span class="nt"&gt;-XX&lt;/span&gt;:ParallelGCThreads&lt;span class="o"&gt;=&lt;/span&gt;8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the source reads faster than the sink can write, in-flight records accumulate in memory&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Besides increasing heap size, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limiting &lt;code&gt;read_limit.rows_per_second&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Adjusting &lt;code&gt;parallelism&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
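&lt;p&gt;In a job file, both knobs live in the &lt;code&gt;env&lt;/code&gt; block. A sketch with illustrative values, to be tuned to your workload:&lt;/p&gt;

```hocon
env {
  parallelism = 4
  # Throttle the source so data cannot pile up faster than the sink drains it
  read_limit.rows_per_second = 50000
}
```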

&lt;h3&gt;
  
  
  Scenario 2: Real-Time CDC Synchronization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Characteristics&lt;/strong&gt;: Long-running jobs, latency-sensitive, relatively stable memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xms4g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx4g&lt;/span&gt;
&lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC
&lt;span class="nt"&gt;-XX&lt;/span&gt;:MaxGCPauseMillis&lt;span class="o"&gt;=&lt;/span&gt;100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checkpoint frequency also affects memory usage (state backend caching)&lt;/li&gt;
&lt;li&gt;If memory pressure is high, consider increasing &lt;code&gt;checkpoint.interval&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
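&lt;p&gt;As with the batch scenario, the interval is set in the job's &lt;code&gt;env&lt;/code&gt; block. The value below (in milliseconds) is illustrative:&lt;/p&gt;

```hocon
env {
  # Less frequent checkpoints reduce state-backend memory pressure,
  # at the cost of more replay after a failure.
  checkpoint.interval = 60000
}
```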

&lt;h3&gt;
  
  
  Scenario 3: Low-Memory Deployment (e.g., 4GB)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt;: High chance of being killed by the OS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worker recommendation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xmx2560m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Allocate about 2.5GB to heap&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leave the remaining 1.5GB for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Off-heap memory (Netty)&lt;/li&gt;
&lt;li&gt;OS&lt;/li&gt;
&lt;li&gt;Other processes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
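&lt;p&gt;The split can be double-checked with shell arithmetic (values in MB, matching the &lt;code&gt;-Xmx2560m&lt;/code&gt; above):&lt;/p&gt;

```shell
TOTAL_MB=4096    # 4 GB machine
XMX_MB=2560      # heap from -Xmx2560m
LEFT_MB=$((TOTAL_MB - XMX_MB))
echo "${LEFT_MB} MB left for off-heap, the OS, and other processes"   # 1536 MB (~1.5 GB)
```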

&lt;h2&gt;
  
  
  6. How to Verify Your Configuration
&lt;/h2&gt;

&lt;p&gt;After starting SeaTunnel, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jps &lt;span class="nt"&gt;-v&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;SeaTunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;12345 SeaTunnelServer ... &lt;span class="nt"&gt;-Xms8g&lt;/span&gt; &lt;span class="nt"&gt;-Xmx8g&lt;/span&gt; &lt;span class="nt"&gt;-XX&lt;/span&gt;:+UseG1GC ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure your parameters (e.g., &lt;code&gt;-Xmx8g&lt;/code&gt;) appear &lt;strong&gt;after any defaults&lt;/strong&gt; in the flag list: when a flag is repeated, the JVM honors the last occurrence, so earlier entries are silently overridden.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Docker / Kubernetes-Specific Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Recommended Approach: Container-Aware Memory
&lt;/h3&gt;

&lt;p&gt;In Kubernetes, memory is typically controlled via &lt;code&gt;resources.limits.memory&lt;/code&gt;.&lt;br&gt;
Instead of hardcoding &lt;code&gt;-Xmx&lt;/code&gt;, it’s better to use percentage-based settings so the JVM can adapt automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAVA_OPTS&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-XX:+UseContainerSupport&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-XX:MaxRAMPercentage=70.0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-XshowSettings:vm"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:+UseContainerSupport&lt;/code&gt;: Allows the JVM to detect container memory limits (on by default since JDK 8u191 and JDK 10+)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:MaxRAMPercentage=70.0&lt;/code&gt;: Sets heap to 70% of container memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 70%?&lt;/strong&gt;&lt;br&gt;
The remaining 30% is needed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct memory (Netty)&lt;/li&gt;
&lt;li&gt;Metaspace&lt;/li&gt;
&lt;li&gt;Thread stacks&lt;/li&gt;
&lt;li&gt;JVM overhead&lt;/li&gt;
&lt;/ul&gt;
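&lt;p&gt;To confirm what heap the JVM actually derives inside the container, it can be asked directly. This requires a JDK in the image, and the output wording varies slightly by JDK version:&lt;/p&gt;

```shell
# Prints the VM settings, including "Max. Heap Size (Estimated)", then exits.
java -XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0 -XshowSettings:vm -version
```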

&lt;h3&gt;
  
  
  7.2 Resource Limits
&lt;/h3&gt;

&lt;p&gt;Make sure Kubernetes resource settings align with JVM needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Want 8GB heap&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM: 70%&lt;/li&gt;
&lt;li&gt;K8s limit: &lt;code&gt;8 / 0.7 ≈ 11.5GB&lt;/code&gt; → set to &lt;code&gt;12Gi&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12Gi"&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12Gi"&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.3 Overriding Default Config
&lt;/h3&gt;

&lt;p&gt;If default config files already define memory settings, they may override &lt;code&gt;JAVA_OPTS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To ensure your settings take effect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use command-line parameters (highest priority):&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-DJvmOption=-XX:MaxRAMPercentage=70.0"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Mount custom config files via ConfigMap&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  7.4 Common Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;❌ Setting &lt;code&gt;limits.memory = 4Gi&lt;/code&gt; and &lt;code&gt;-Xmx4g&lt;/code&gt;&lt;br&gt;
→ No space left for non-heap memory → process will be killed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;❌ Not setting &lt;code&gt;requests&lt;/code&gt;&lt;br&gt;
→ Pod may be scheduled on a node without enough memory&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;jvm_options&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seatunnel-cluster.sh&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;values.yaml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>jvm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Your ADS Layer Always Goes Wild and How a Strong DWS Layer Fixes It</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 20 Mar 2026 10:13:54 +0000</pubDate>
      <link>https://dev.to/seatunnel/why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4cfa</link>
      <guid>https://dev.to/seatunnel/why-your-ads-layer-always-goes-wild-and-how-a-strong-dws-layer-fixes-it-4cfa</guid>
      <description>&lt;p&gt;In a data warehouse system, the DWS and ADS layers mark the critical boundary between “data modeling” and “data delivery.” The former carries shared aggregation and metric reuse capabilities, determining the stability and efficiency of the data system; the latter is oriented toward specific consumption scenarios, directly impacting business delivery efficiency and user experience.&lt;/p&gt;

&lt;p&gt;If the DWS layer is poorly designed, metrics will be repeatedly produced in the ADS layer, ultimately leading to inconsistent definitions and siloed data; if the ADS layer runs out of control, it can even backfire on the shared layer, forming unmanageable data assets. Therefore, a healthy data system must establish a clear boundary and evolution mechanism between “shared foundation” and “flexible delivery.”&lt;/p&gt;

&lt;p&gt;As the fourth article in the Data Lakehouse design and practice series, this piece systematically summarizes &lt;strong&gt;the core design principles of the DWS/ADS delivery layer&lt;/strong&gt;, including methods for shared aggregation and subject-wide table modeling, metric definition frameworks, delivery layer strategies, and lifecycle governance practices. It also addresses common issues, helping teams build a highly reusable, governable, and sustainable data delivery system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DWS Must Be “Thick Enough”
&lt;/h2&gt;

&lt;p&gt;In many teams’ data systems, the DWS layer is underestimated or even deliberately thinned, so every new requirement gets pushed down to the ADS layer. In the short term this feels flexible, but over time it quickly spirals out of control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtm5tm3b03fhpgvqzkf7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtm5tm3b03fhpgvqzkf7.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core positioning of DWS is as a shared aggregation and reuse layer. It is not designed to serve a single report, but to provide a unified data foundation for &lt;strong&gt;multiple applications to share&lt;/strong&gt;. If this layer is underdeveloped, every new requirement will trigger recalculation and redefinition of metrics, resulting in a bunch of incompatible results.&lt;/p&gt;

&lt;p&gt;In practice, a healthy state is: &lt;strong&gt;about 70% of analytical needs can be directly fulfilled by combining DWS tables.&lt;/strong&gt; This means most scenarios do not require creating new tables, but rather combining existing shared capabilities. This “ready-to-use” capability is the core of reuse value.&lt;/p&gt;

&lt;p&gt;Conversely, if each department has its own ADS tables and each report has its own metric definitions, typical silo problems emerge: metrics with the same name do not match, computations are duplicated, and data cannot be aligned. Teams spend most of their time reconciling definitions instead of analyzing business.&lt;/p&gt;

&lt;p&gt;The value of DWS lies precisely in solving these common issues. By precomputing aggregated results of high-frequency dimension combinations, building subject-wide tables, and unifying metric outputs, DWS moves dispersed computations to the offline layer. As a result, online queries no longer rely on temporary large-scale joins or full table scans, making performance and cost more controllable.&lt;/p&gt;

&lt;p&gt;More importantly, it changes team collaboration. Metrics no longer depend on verbal agreements—they exist as data assets: with owners, definitions, lineage, and quality rules. So-called “metric disputes” essentially become “asset governance issues.”&lt;/p&gt;

&lt;p&gt;But there is a prerequisite: DWS must be governable. If fields lack explanations, metrics lack definitions, update frequency is unclear, or quality rules are missing, this layer will become a “wide-table collection nobody dares to use,” reducing reuse rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Aggregation and Subject-Wide Tables: Balancing Reuse and Performance
&lt;/h2&gt;

&lt;p&gt;DWS design revolves around two types of tables: shared aggregation tables and subject-wide tables.&lt;/p&gt;

&lt;p&gt;Shared aggregation tables hinge on &lt;strong&gt;clarity&lt;/strong&gt;. They must clearly define aggregation granularity (e.g., daily, weekly, monthly, or cumulative), dimension combinations (e.g., time, organization, channel, category), and metric calculation scope (e.g., amount, count, or frequency). Without clear boundaries, downstream reuse becomes unreliable.&lt;/p&gt;

&lt;p&gt;Subject-wide tables emphasize &lt;strong&gt;usability&lt;/strong&gt;. They usually focus on a business domain, e.g., users, transactions, or products, flattening frequently joined dimensions in advance to reduce query complexity. Importantly, wide tables are a result-oriented form for analytics—they are &lt;strong&gt;not a replacement for fact tables&lt;/strong&gt; and must be traceable back to underlying models.&lt;/p&gt;

&lt;p&gt;A common practical problem is wide tables continually growing. To mitigate this, fields can be governed based on usage frequency: retain high-frequency fields in the main wide table, split or join low-frequency fields on demand, and regularly slim tables according to usage.&lt;/p&gt;

&lt;p&gt;Another common pitfall is mixing different aggregation levels in the same table, e.g., daily and monthly data together. This greatly increases misuse risk and complicates maintenance. A better approach is to split tables by level or at least enforce strict naming conventions.&lt;/p&gt;

&lt;p&gt;All these designs assume &lt;strong&gt;consistent dimensions&lt;/strong&gt; exist. Core dimensions such as user, organization, channel, and time must have unified codes and definitions, otherwise cross-table reuse fails.&lt;/p&gt;

&lt;p&gt;From a performance perspective, DWS’s core strategy is always &lt;strong&gt;pre-aggregation first&lt;/strong&gt;. Reduce data scan scale via offline computation before applying indexing, partitioning, or materialized views. Otherwise, all optimizations become remedial measures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metric Framework: Layered Design from Atomic to Composite
&lt;/h2&gt;

&lt;p&gt;If DWS solves &lt;strong&gt;data reuse&lt;/strong&gt;, then the metric framework ensures &lt;strong&gt;definition consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A governable metric system typically has three levels: atomic metrics, derived metrics, and composite metrics.&lt;/p&gt;

&lt;p&gt;Atomic metrics are the fundamental units. They must clearly define the target, scope, filters, and time granularity. For example, “successful payment amount” must clearly count only successful payments and use the payment completion time.&lt;/p&gt;

&lt;p&gt;Derived metrics are calculated from atomic metrics. For example, average order value = “successful payment amount / number of successful orders.” The key constraint is that derived metrics must inherit the definitions of their atomic metrics; otherwise the same formula will quietly produce different numbers.&lt;/p&gt;
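&lt;p&gt;As a sketch, the inheritance rule looks like this in SQL. The table and column names (&lt;code&gt;dws_trade_day&lt;/code&gt;, &lt;code&gt;pay_success_amount&lt;/code&gt;, &lt;code&gt;pay_success_order_cnt&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```sql
-- Derived metric computed from atomic metrics already defined in DWS,
-- so the "successful payment" scope is inherited, not restated per report.
SELECT
  stat_date,
  pay_success_amount,                                            -- atomic metric
  pay_success_order_cnt,                                         -- atomic metric
  pay_success_amount / NULLIF(pay_success_order_cnt, 0) AS avg_order_value
FROM dws_trade_day;
```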

&lt;p&gt;Composite metrics span multiple processes or business domains, e.g., conversion rate, retention, or repeat purchase. These rely heavily on a consistent dimension system and event definitions, making them the most prone to ambiguity.&lt;/p&gt;

&lt;p&gt;To avoid confusion, every metric must have four elements: business definition, calculation formula, scope, and time granularity. This is not just documentation—it is the basis for traceability and auditability.&lt;/p&gt;

&lt;p&gt;Metrics must also support version control. Changes to definitions cannot overwrite historical results directly; versions or effective dates should be used to prevent “historical data being rewritten.”&lt;/p&gt;

&lt;p&gt;In terms of layering, atomic metrics should reside in DWS (or traceable to DWD), while ADS handles only lightweight combination and presentation. If ADS takes on definition duties, it quickly becomes a new “metric generation layer.”&lt;/p&gt;

&lt;h2&gt;
  
  
  ADS and Data Marts: Delivery for Consumption
&lt;/h2&gt;

&lt;p&gt;If DWS is about &lt;strong&gt;accumulation&lt;/strong&gt;, ADS is about &lt;strong&gt;delivery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;ADS (or DM, data marts) aims to provide data products for specific consumption scenarios, e.g., BI reports, API services, or analytical datasets. Structures here emphasize &lt;strong&gt;usability&lt;/strong&gt;, not generality.&lt;/p&gt;

&lt;p&gt;Delivery tables should follow a &lt;strong&gt;“one table, one scenario”&lt;/strong&gt; principle. Field names can be closer to business semantics, and additional display, sort, or status fields can be added to improve user experience.&lt;/p&gt;

&lt;p&gt;But one bottom line must be enforced: &lt;strong&gt;delivery should not invent metrics&lt;/strong&gt;. All core metrics must come from DWS or the metric system; ADS only handles combination, formatting, and lightweight calculation. Violating this quickly leads back to “one metric per report.”&lt;/p&gt;

&lt;p&gt;Update frequency must respect business SLA. Daily, hourly, or minute-level updates directly affect compute chains and resource costs. The higher the frequency, the more careful you must be with field scale and calculation complexity.&lt;/p&gt;

&lt;p&gt;Governance of data marts is also crucial. They can be department- or scenario-specific, but must be built on a unified dimension and metric framework. Views or semantic layers may meet variation needs, but duplicating underlying logic is not allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  From “Fast Delivery” to “Sustainable Evolution”
&lt;/h2&gt;

&lt;p&gt;Early on, many teams experience a phase: stacking tables in ADS for fast delivery. Initially responsive, but over time, problems emerge—delivery layers balloon, shared layers hollow out, and maintenance costs soar.&lt;/p&gt;

&lt;p&gt;A healthier model: &lt;strong&gt;gradually thicken the shared layer (DWS), keep the delivery layer light, and continuously recover general capabilities back to DWS.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This also implies delivery tables must support lifecycle management. Track usage frequency, retire low-value tables, or recycle general fields and metrics back to the shared layer to avoid duplication.&lt;/p&gt;

&lt;p&gt;Ultimately, a mature data system is not “built fast,” but “used long.” Layered DWS and ADS design underpins this long-term evolution.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ads</category>
      <category>database</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>SeaTunnel Gravitino: Schema URL–Driven Automatic Table Structure Detection</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 20 Mar 2026 09:53:29 +0000</pubDate>
      <link>https://dev.to/seatunnel/seatunnel-x-gravitino-schema-url-driven-automatic-table-structure-detection-3e59</link>
      <guid>https://dev.to/seatunnel/seatunnel-x-gravitino-schema-url-driven-automatic-table-structure-detection-3e59</guid>
      <description>&lt;p&gt;Recently, the community published an article titled &lt;a href="https://medium.com/@apacheseatunnel/say-goodbye-to-hand-written-schemas-bedbf1a49cf3" rel="noopener noreferrer"&gt;“Say Goodbye to Hand-Written Schemas! SeaTunnel’s Integration with Gravitino Metadata REST API Is a Really Cool Move”&lt;/a&gt;, which drew strong reactions from readers, with many saying, “This is really awesome!”&lt;/p&gt;

&lt;p&gt;The contributor behind this feature is extremely proactive, and it’s expected to be available soon (according to reliable sources, likely in version 3.0.0). To help the community better understand it, the contributor wrote a detailed article explaining the initial capabilities of the Gravitino REST API and how to use it—let’s take a closer look!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Background and Problems to Solve
&lt;/h2&gt;

&lt;p&gt;When using Apache SeaTunnel for batch or sync tasks against unstructured or semi-structured sources, &lt;strong&gt;the source connector usually requires an explicit schema definition&lt;/strong&gt; (field names, types, order).&lt;/p&gt;

&lt;p&gt;In real production environments, this leads to several typical issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables have many fields and complex types, making manual schema maintenance costly and error-prone&lt;/li&gt;
&lt;li&gt;Upstream table structure changes (adding fields, changing types) require corresponding updates to SeaTunnel jobs&lt;/li&gt;
&lt;li&gt;For existing tables, simply syncing data still requires repeated metadata description, leading to redundancy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the core question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can SeaTunnel directly reuse table structure definitions from an existing metadata system, instead of declaring schema repeatedly in jobs?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This feature was introduced to solve this problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Introduction to Gravitino (Relevant Capabilities)
&lt;/h2&gt;

&lt;p&gt;Gravitino is a unified metadata management and access service, providing standardized REST APIs to manage and expose the following objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metalake (logical isolation unit)&lt;/li&gt;
&lt;li&gt;Catalogs (e.g., MySQL, Hive, Iceberg)&lt;/li&gt;
&lt;li&gt;Schema / Database&lt;/li&gt;
&lt;li&gt;Table and its field definitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Gravitino:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table structures can be &lt;strong&gt;centrally managed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Downstream systems can dynamically fetch schema definitions via &lt;strong&gt;HTTP APIs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No need to maintain field information in every compute or sync job&lt;/li&gt;
&lt;/ul&gt;
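&lt;p&gt;For illustration, a table definition can be fetched with a plain HTTP call. This is a hypothetical example: the host, port, and the metalake/catalog/schema names are placeholders for a local setup, and the URL layout follows Gravitino's metalake → catalog → schema → table hierarchy:&lt;/p&gt;

```shell
# Hypothetical: fetch the column definitions of test.demo_user
# from a locally running Gravitino server.
curl -s "http://localhost:8090/api/metalakes/demo_metalake/catalogs/mysql_catalog/schemas/test/tables/demo_user"
```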

&lt;p&gt;The new capability introduced in SeaTunnel is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Support for automatically pulling table structures via &lt;code&gt;schema_url&lt;/code&gt; provided by Gravitino in the source schema definition.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Local Test Environment Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Prepare MySQL Environment
&lt;/h3&gt;

&lt;h4&gt;
  
  
  3.1.1 Create Target Table
&lt;/h4&gt;

&lt;p&gt;Pre-create the target table &lt;code&gt;test.demo_user&lt;/code&gt; in MySQL with the following SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`demo_user`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="nb"&gt;unsigned&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="n"&gt;AUTO_INCREMENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`user_code`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`user_name`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`password`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`email`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`phone`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`gender`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`age`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`level`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`score`&lt;/span&gt; &lt;span class="nb"&gt;decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`balance`&lt;/span&gt; &lt;span class="nb"&gt;decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`is_deleted`&lt;/span&gt; &lt;span class="nb"&gt;tinyint&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`register_ip`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`last_login_ip`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`login_count`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`remark`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext1`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext2`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext3`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext4`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`ext5`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`created_by`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`updated_by`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`created_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`updated_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`birthday`&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`last_login_time`&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;`version`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`id`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="nv"&gt;`uk_user_code`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`user_code`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InnoDB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;CHARSET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;utf8mb4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3.1.2 Create the Table Schema to Sync
&lt;/h4&gt;

&lt;p&gt;In production, table structures are often managed centrally in systems such as &lt;code&gt;paimon&lt;/code&gt;, &lt;code&gt;hive&lt;/code&gt;, or &lt;code&gt;hudi&lt;/code&gt;. For this test, the table schema simply points to the target table &lt;code&gt;test.demo_user&lt;/code&gt; created in the previous step.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Register the Table Schema in Gravitino
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Gravitino supports direct database connections and can scan all tables in a database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fablntfy94wjck5i4cpj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fablntfy94wjck5i4cpj9.png" alt="img" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This table is managed in Gravitino as a table under the &lt;code&gt;local-mysql&lt;/code&gt; catalog.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze7w3w0izx2jq4teno24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze7w3w0izx2jq4teno24.png" alt="img\_1" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metalake: &lt;code&gt;test_Metalake&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Table Structure Access Explanation
&lt;/h3&gt;

&lt;p&gt;Table structures in Gravitino can be accessed via the REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8090/api/metalakes/test_Metalake/catalogs/${catalog}/schemas/${schema}/tables/${table}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this test, the actual &lt;code&gt;schema_url&lt;/code&gt; used is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
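As a sanity check, the template and the concrete URL above can be tied together with a few lines of Python. The helper below is purely illustrative (it is not a SeaTunnel or Gravitino API); it only substitutes the placeholder values used in this test:

```python
def gravitino_table_url(base, metalake, catalog, schema, table):
    # Fill in the REST template:
    # {base}/api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tables/{table}
    return f"{base}/api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tables/{table}"

url = gravitino_table_url(
    "http://localhost:8090", "test_Metalake", "local-mysql", "test", "demo_user"
)
print(url)
# → http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user
```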



&lt;p&gt;The returned JSON contains the complete field definitions of the &lt;code&gt;demo_user&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpvr81wp05j9l2hupvha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpvr81wp05j9l2hupvha.png" alt="img\_2" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Local Deployment of SeaTunnel
&lt;/h3&gt;

&lt;p&gt;Since this feature hasn’t been officially released, you need to manually compile the latest &lt;code&gt;dev&lt;/code&gt; branch and deploy it locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Prepare Data Files
&lt;/h3&gt;

&lt;p&gt;This test case uses a CSV file containing 2,000 records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb03go0l8uj41xwhct7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb03go0l8uj41xwhct7m.png" alt="img\_3" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. SeaTunnel Job Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core Configuration Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BATCH"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;LocalFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/Users/wangxuepeng/Desktop/seatunnel/apache-seatunnel-2.3.13-SNAPSHOT/test_data"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;schema_url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://localhost:3306/test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.mysql.cj.jdbc.Driver"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"demo_user"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;generate_sink_sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 Key Configuration Notes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;schema.schema_url&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Points to the table metadata REST API in Gravitino&lt;/li&gt;
&lt;li&gt;SeaTunnel automatically fetches the table schema at job start&lt;/li&gt;
&lt;li&gt;No need to manually declare field lists in jobs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;generate_sink_sql = true&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sink automatically generates INSERT SQL based on the parsed schema&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
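To make the effect of `generate_sink_sql = true` concrete, here is a minimal sketch of deriving a parameterized INSERT from a parsed field list. This is not SeaTunnel's actual generator code, just the idea behind it:

```python
def build_insert(table, columns):
    # Quote identifiers and emit one placeholder per column.
    cols = ", ".join(f"`{c}`" for c in columns)
    params = ", ".join("?" for _ in columns)
    return f"INSERT INTO `{table}` ({cols}) VALUES ({params})"

sql = build_insert("demo_user", ["id", "user_code", "age"])
print(sql)  # INSERT INTO `demo_user` (`id`, `user_code`, `age`) VALUES (?, ?, ?)
```

Because the column list comes from the fetched schema rather than the job file, the generated SQL tracks the table definition without manual edits.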

&lt;h2&gt;
  
  
  5. Data and Job Execution Results
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Log screenshot:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffud8tvig4gx1xpzxdrtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffud8tvig4gx1xpzxdrtm.png" alt="img\_4" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During job execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source automatically parses field structure via &lt;code&gt;schema_url&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;CSV fields automatically align with the table schema&lt;/li&gt;
&lt;li&gt;Data successfully written to MySQL &lt;code&gt;demo_user&lt;/code&gt; table&lt;/li&gt;
&lt;/ul&gt;
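The alignment step can be pictured as zipping CSV columns with the fetched schema by position and casting each value. The field list and casts below are simplified assumptions for illustration, not SeaTunnel internals:

```python
import csv
import io

# Simplified schema: (field name, Python cast) pairs, in column order.
schema = [("id", int), ("user_code", str), ("age", int)]
raw = "1,U00001,21\n2,U00002,22\n"

# Pair each CSV column with its schema entry positionally, then cast.
rows = []
for record in csv.reader(io.StringIO(raw)):
    rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})

print(rows)  # [{'id': 1, 'user_code': 'U00001', 'age': 21}, ...]
```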

&lt;h2&gt;
  
  
  6. FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Supported Connectors
&lt;/h3&gt;

&lt;p&gt;Currently, the &lt;code&gt;dev&lt;/code&gt; branch supports file-based connectors such as &lt;code&gt;local&lt;/code&gt;, &lt;code&gt;hdfs&lt;/code&gt;, and &lt;code&gt;s3&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.2 Does &lt;code&gt;schema_url&lt;/code&gt; support multiple tables?
&lt;/h3&gt;

&lt;p&gt;The feature does not interfere with multi-table configuration: each entry in &lt;code&gt;tables_configs&lt;/code&gt; can use either an inline field list or a &lt;code&gt;schema_url&lt;/code&gt;, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;LocalFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;tables_configs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/seatunnel/read/metalake/table1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;field_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;row_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;skip_header_row_number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.table1"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;string&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;int&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_boolean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;boolean&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;c_double&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;double&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/seatunnel/read/metalake/table2"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;file_format_type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;field_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;","&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;row_delimiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;skip_header_row_number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db.table2"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;schema_url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://gravitino:8090/api/metalakes/test_metalake/catalogs/test_catalog/schemas/test_schema/tables/table2"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Feature Summary
&lt;/h2&gt;

&lt;p&gt;By introducing &lt;strong&gt;Gravitino &lt;code&gt;schema_url&lt;/code&gt;–based automatic schema parsing&lt;/strong&gt;, SeaTunnel gains the following advantages in data sync scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminates repeated schema definitions, reducing job configuration complexity&lt;/li&gt;
&lt;li&gt;Reuses a unified metadata management system, improving consistency&lt;/li&gt;
&lt;li&gt;Adapts gracefully to table structure changes, significantly lowering maintenance costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This feature is ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises with mature metadata platforms&lt;/li&gt;
&lt;li&gt;Large tables with many fields or frequent schema changes&lt;/li&gt;
&lt;li&gt;Users seeking improved maintainability of SeaTunnel jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  8. References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code PR&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/10402" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10402&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;schema_url&lt;/code&gt; Configuration Docs&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gravitino</category>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
