<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apache SeaTunnel</title>
    <description>The latest articles on DEV Community by Apache SeaTunnel (@seatunnel).</description>
    <link>https://dev.to/seatunnel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F844122%2Fc6155eb3-df58-448b-8d88-36865c4f1d84.jpg</url>
      <title>DEV Community: Apache SeaTunnel</title>
      <link>https://dev.to/seatunnel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seatunnel"/>
    <language>en</language>
    <item>
      <title>How Basic Auth Works in Apache SeaTunnel Zeta Engine — Fix 401 Unauthorized API Errors</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 26 Jun 2026 09:57:49 +0000</pubDate>
      <link>https://dev.to/seatunnel/how-basic-auth-works-in-apache-seatunnel-zeta-engine-fix-401-unauthorized-api-errors-8gb</link>
      <guid>https://dev.to/seatunnel/how-basic-auth-works-in-apache-seatunnel-zeta-engine-fix-401-unauthorized-api-errors-8gb</guid>
      <description>&lt;p&gt;Recently while reviewing the REST API authentication logic of Apache SeaTunnel’s Zeta Engine, I ran into a very common issue:&lt;br&gt;
The Zeta Engine fully started and its REST service listened on the designated port normally, yet accessing endpoints like &lt;code&gt;/overview&lt;/code&gt;, &lt;code&gt;/running-jobs&lt;/code&gt; and &lt;code&gt;/job-info&lt;/code&gt; kept returning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 401 Unauthorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you encounter this error for the first time, you may easily assume the service failed to launch, you entered the wrong port number, or the API path is incorrect.&lt;/p&gt;

&lt;p&gt;In reality, this error is usually caused by the Basic Auth configuration of SeaTunnel Zeta Engine.&lt;br&gt;
Once Basic Auth is enabled on Zeta Engine, clients can no longer send plain requests to the REST APIs as before; every request must carry valid credentials in its &lt;code&gt;Authorization&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;Starting from this 401 error case, this article breaks down how Basic Auth operates inside SeaTunnel Zeta Engine, plus the correct client-side connection methods.&lt;/p&gt;
&lt;h2&gt;1. The Symptom: Receiving 401 When Calling Zeta REST APIs&lt;/h2&gt;

&lt;p&gt;Suppose we send a direct request to the Zeta Engine REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8080/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Basic Auth is disabled, this call returns the engine’s overview information without issues.&lt;/p&gt;

&lt;p&gt;But after enabling Basic Auth in the config, requests missing authentication headers will trigger this response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="SeaTunnel Web UI"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indicates the request successfully reached Zeta Engine, yet got intercepted by the authentication filter before hitting the actual REST Servlet.&lt;/p&gt;

&lt;p&gt;To simplify the root cause chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Zeta Engine running normally
REST API endpoint fully functional
Client request missing Authorization header
Request intercepted by BasicAuthFilter → returns 401 error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic Auth itself is a straightforward mechanism. The client adds an &lt;code&gt;Authorization&lt;/code&gt; header whose value is the word &lt;code&gt;Basic&lt;/code&gt;, a space, and the Base64-encoded credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Basic base64(username:password)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, with username &lt;code&gt;admin&lt;/code&gt; and password &lt;code&gt;admin&lt;/code&gt;, clients first combine the credentials into the string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;admin:admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encoding this string with Base64 gives &lt;code&gt;YWRtaW46YWRtaW4=&lt;/code&gt;; the client places that value after the &lt;code&gt;Basic&lt;/code&gt; prefix in the &lt;code&gt;Authorization&lt;/code&gt; header.&lt;/p&gt;
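
&lt;p&gt;The encoding step can be sketched in a few lines of plain Java. This is a minimal illustration, not SeaTunnel code: the class &lt;code&gt;BasicAuthHeader&lt;/code&gt; and its &lt;code&gt;build&lt;/code&gt; method are hypothetical names, and the sketch assumes only &lt;code&gt;java.util.Base64&lt;/code&gt; from the JDK.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of SeaTunnel.
final class BasicAuthHeader {

    // Builds the value of the Authorization header for Basic Auth.
    static String build(String username, String password) {
        // Step 1: join the credentials with a colon.
        String pair = username + ":" + password;
        // Step 2: Base64-encode the UTF-8 bytes of the pair.
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        // Step 3: prepend the scheme name and a single space.
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Prints: Authorization: Basic YWRtaW46YWRtaW4=
        System.out.println("Authorization: " + build("admin", "admin"));
    }
}
```

&lt;p&gt;Base64 is an encoding, not encryption, so anyone who sees the header can recover the credentials.&lt;/p&gt;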

&lt;h2&gt;2. Source Code Deep Dive: How BasicAuthFilter Intercepts Requests&lt;/h2&gt;

&lt;p&gt;All core Basic Auth logic of SeaTunnel Zeta Engine resides in the &lt;code&gt;BasicAuthFilter&lt;/code&gt; class.&lt;br&gt;
This class implements the standard Java Servlet Filter interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BasicAuthFilter&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Filter&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;HttpConfig&lt;/span&gt; &lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;BasicAuthFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpConfig&lt;/span&gt; &lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;httpConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Servlet Filter sits in front of the target Servlet: every request mapped to the filter passes through it first.&lt;br&gt;
This means the Basic Auth validation logic is centralized in a universal filter, rather than hardcoded into individual API endpoints.&lt;/p&gt;

&lt;p&gt;The core business logic lives in the &lt;code&gt;doFilter&lt;/code&gt; method.&lt;br&gt;
First, it checks whether Basic Auth has been toggled on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEnableBasicAuth&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;doFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This check decides everything that follows.&lt;br&gt;
If Basic Auth is disabled, the filter hands the request straight to the next element in the chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;doFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short: REST APIs require zero authentication when &lt;code&gt;enable-basic-auth&lt;/code&gt; is turned off.&lt;/p&gt;

&lt;p&gt;When Basic Auth is enabled, the code proceeds to extract values from the HTTP request header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HttpServletRequest&lt;/span&gt; &lt;span class="n"&gt;httpRequest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpServletRequest&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="nc"&gt;HttpServletResponse&lt;/span&gt; &lt;span class="n"&gt;httpResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpServletResponse&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;authHeader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpRequest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, it verifies the header exists and follows the standard "Basic " prefix format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;authHeader&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;authHeader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;startsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Basic "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;base64Credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;authHeader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;substring&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Basic "&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Base64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;decodeBase64&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64Credentials&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;StandardCharsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;UTF_8&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This block executes four sequential steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check that the &lt;code&gt;Authorization&lt;/code&gt; header extracted in the previous snippet is present and begins with "Basic "&lt;/li&gt;
&lt;li&gt;Strip the leading "Basic " prefix&lt;/li&gt;
&lt;li&gt;Base64-decode the remaining string content&lt;/li&gt;
&lt;li&gt;Split the decoded string on the colon symbol to separate username and password&lt;/li&gt;
&lt;/ol&gt;
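
&lt;p&gt;The parsing steps can be tried outside the engine with a small stand-alone sketch. It is an approximation under stated assumptions: &lt;code&gt;BasicAuthParser&lt;/code&gt; and &lt;code&gt;parse&lt;/code&gt; are hypothetical names, and the sketch uses the JDK's &lt;code&gt;java.util.Base64&lt;/code&gt; instead of the &lt;code&gt;Base64.decodeBase64&lt;/code&gt; call shown in the excerpt, so it needs no extra dependency. Returning &lt;code&gt;null&lt;/code&gt; for malformed input is a choice made for this sketch only.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical stand-alone sketch; not the SeaTunnel implementation.
final class BasicAuthParser {

    private static final String PREFIX = "Basic ";

    // Returns {username, password}, or null when the header is missing or malformed.
    static String[] parse(String authHeader) {
        // Step 1: the header must exist and start with the "Basic " prefix.
        if (authHeader == null || !authHeader.startsWith(PREFIX)) {
            return null;
        }
        // Step 2: strip the prefix.
        String base64Credentials = authHeader.substring(PREFIX.length());
        // Step 3: Base64-decode the remainder.
        byte[] decoded;
        try {
            decoded = Base64.getDecoder().decode(base64Credentials);
        } catch (IllegalArgumentException e) {
            return null; // not valid Base64
        }
        String credentials = new String(decoded, StandardCharsets.UTF_8);
        // Step 4: split on the first colon only, so later colons stay in the password.
        String[] values = credentials.split(":", 2);
        return values.length == 2 ? values : null;
    }

    public static void main(String[] args) {
        String[] parsed = parse("Basic YWRtaW46YWRtaW4=");
        // Prints: admin / admin
        System.out.println(parsed[0] + " / " + parsed[1]);
    }
}
```

&lt;p&gt;The limit of 2 in &lt;code&gt;split&lt;/code&gt; matters: a password that itself contains a colon is kept whole instead of being cut at the second colon.&lt;/p&gt;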

&lt;p&gt;Take this header as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Basic YWRtaW46YWRtaW4=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Decoding the Base64 segment yields the raw credential pair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;admin:admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



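&lt;p&gt;The two halves of &lt;code&gt;values&lt;/code&gt; are the username and the password (the excerpts omit the assignment to the &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; variables). The filter then compares them against the values defined in the engine configuration:&lt;br&gt;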
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBasicAuthUsername&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpConfig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBasicAuthPassword&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;doFilter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requests matching the configured username and password pass through the filter successfully.&lt;/p&gt;

&lt;p&gt;If the header is missing, malformed, or contains mismatched credentials, the filter returns a 401 error response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;httpResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"WWW-Authenticate"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Basic realm=\"SeaTunnel Web UI\""&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;httpResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sendError&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpServletResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SC_UNAUTHORIZED&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unauthorized"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full execution flow of BasicAuthFilter can be summarized as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client request arrives at Zeta REST service
        ↓
Check if Basic Auth is enabled in configuration
        ↓
Disabled: Forward request directly to target API
        ↓
Enabled: Extract the Authorization header from HTTP request
        ↓
Parse and Base64-decode credentials to retrieve username + password
        ↓
Compare parsed credentials against configured username &amp;amp; password
        ↓
Credentials match: Forward request to target API
Credentials mismatch / invalid header: Return 401 Unauthorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
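
&lt;p&gt;The whole flow can also be modelled as one pure function, which is convenient for experimenting without a Servlet container. This is a hedged sketch rather than the actual filter: &lt;code&gt;BasicAuthDecision&lt;/code&gt;, &lt;code&gt;allow&lt;/code&gt; and all parameter names are hypothetical, and the boolean result stands in for "forward the request" (true) versus "return 401" (false).&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical, servlet-free model of the decision flow; names are illustrative.
final class BasicAuthDecision {

    // Returns true when the request may proceed, false when the answer would be 401.
    static boolean allow(boolean basicAuthEnabled, String authHeader,
                         String configuredUser, String configuredPassword) {
        if (!basicAuthEnabled) {
            return true; // disabled: forward directly to the target API
        }
        if (authHeader == null || !authHeader.startsWith("Basic ")) {
            return false; // missing header or wrong scheme
        }
        String credentials;
        try {
            byte[] decoded = Base64.getDecoder()
                    .decode(authHeader.substring("Basic ".length()));
            credentials = new String(decoded, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return false; // payload is not valid Base64
        }
        String[] values = credentials.split(":", 2);
        if (values.length != 2) {
            return false; // no colon, so no username/password pair
        }
        if (!values[0].equals(configuredUser)) {
            return false; // username mismatch
        }
        return values[1].equals(configuredPassword);
    }

    public static void main(String[] args) {
        System.out.println(allow(false, null, "admin", "admin"));                     // true
        System.out.println(allow(true, null, "admin", "admin"));                      // false
        System.out.println(allow(true, "Basic YWRtaW46YWRtaW4=", "admin", "admin"));  // true
    }
}
```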



&lt;h2&gt;3. Configuration Breakdown: How enable-basic-auth, username and password Take Effect&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhw0rkvp67up52djmaq2u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhw0rkvp67up52djmaq2u.jpg" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

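&lt;p&gt;Three configuration items govern Basic Auth: &lt;code&gt;enable-basic-auth&lt;/code&gt;, &lt;code&gt;basic-auth-username&lt;/code&gt; and &lt;code&gt;basic-auth-password&lt;/code&gt;.&lt;br&gt;
By default, they are set as follows:&lt;br&gt;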
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enable-basic-auth = false
basic-auth-username = admin
basic-auth-password = admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A key detail to note:&lt;br&gt;
Even though &lt;code&gt;basic-auth-username&lt;/code&gt; and &lt;code&gt;basic-auth-password&lt;/code&gt; default to &lt;code&gt;admin&lt;/code&gt;, these two fields do nothing unless &lt;code&gt;enable-basic-auth&lt;/code&gt; is activated.&lt;/p&gt;

&lt;p&gt;The single switch controlling authentication enforcement is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enable-basic-auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When set to &lt;code&gt;false&lt;/code&gt;, all Zeta Engine REST APIs operate without authentication requirements.&lt;br&gt;
When set to &lt;code&gt;true&lt;/code&gt;, every request routed through BasicAuthFilter must carry valid authentication credentials.&lt;/p&gt;

&lt;p&gt;A complete sample configuration block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;seatunnel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;engine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;http&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;enable-http&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;enable-basic-auth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;basic-auth-username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;basic-auth-password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After enabling Basic Auth, plain unauthenticated requests will fail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;401 Unauthorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two valid approaches to send authenticated requests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Short-form curl with built-in credential flag
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; admin:admin http://localhost:8080/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Explicitly inject the Authorization header
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic YWRtaW46YWRtaW4="&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8080/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The string &lt;code&gt;YWRtaW46YWRtaW4=&lt;/code&gt; is the Base64 encoding of the credential pair &lt;code&gt;admin:admin&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;4. Client Implementation: Connect via the Authorization Header&lt;/h2&gt;

&lt;p&gt;Once you grasp the server-side authentication workflow, the client-side requirements become clear:&lt;br&gt;
If Zeta Engine has Basic Auth enabled, every REST API request sent by the client must include this header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Basic base64(username:password)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Java client implementations, you can leverage Spring’s built-in &lt;code&gt;HttpHeaders#setBasicAuth&lt;/code&gt; to auto-generate the header without manual Base64 encoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setBasicAuth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"admin"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"admin"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StandardCharsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;UTF_8&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;ResponseEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://localhost:8080/overview"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For reusable code, wrap the authentication logic into a dedicated helper method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;applyBasicAuth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setBasicAuth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;StandardCharsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;UTF_8&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



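&lt;p&gt;Invoke this utility method before every REST API call so that authentication is attached consistently. The helper leaves the headers untouched when either value is blank, so the same call path also works when no credentials are configured:&lt;br&gt;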
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAccept&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singletonList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MediaType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;APPLICATION_JSON&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="n"&gt;applyBasicAuth&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting headers work for GET endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://localhost:8080/running-jobs"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GET&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same headers also work for POST submission endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://localhost:8080/submit-job"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;POST&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;configText&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core takeaway for client integration with Basic Auth-enabled Zeta Engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No changes to REST API endpoints themselves&lt;/li&gt;
&lt;li&gt;No adjustments to API request paths&lt;/li&gt;
&lt;li&gt;Only mandatory addition: the Authorization request header&lt;/li&gt;
&lt;/ol&gt;
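
&lt;p&gt;For clients that do not use Spring, the same header can be attached with the JDK alone. The sketch below is an illustration under assumptions: &lt;code&gt;ZetaRequestFactory&lt;/code&gt; and &lt;code&gt;authenticatedGet&lt;/code&gt; are hypothetical names, it relies on the &lt;code&gt;java.net.http&lt;/code&gt; types available from JDK 11, and it only builds and inspects the request instead of sending it.&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client-side sketch using the JDK 11+ HTTP client types.
final class ZetaRequestFactory {

    // Builds a GET request for the given URL with the Basic Auth header attached.
    static HttpRequest authenticatedGet(String url, String username, String password) {
        String pair = username + ":" + password;
        String token = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .header("Authorization", "Basic " + token)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request =
                authenticatedGet("http://localhost:8080/overview", "admin", "admin");
        // Only inspects the request; sending it needs an HttpClient and a running engine.
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("Authorization").orElse("(none)"));
    }
}
```

&lt;p&gt;The path and method are unchanged from an unauthenticated call; the only difference is the extra header, which matches the three points above.&lt;/p&gt;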

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;The Basic Auth implementation inside SeaTunnel Zeta Engine is lightweight, yet it frequently triggers confusing 401 errors for first-time users.&lt;/p&gt;

&lt;p&gt;Core takeaways condensed into six key points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;enable-basic-auth&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt;; REST APIs require no authentication when disabled&lt;/li&gt;
&lt;li&gt;Once Basic Auth is toggled on, Zeta Engine intercepts all requests via the &lt;code&gt;BasicAuthFilter&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Clients must attach the &lt;code&gt;Authorization: Basic [encoded-string]&lt;/code&gt; header to every request&lt;/li&gt;
&lt;li&gt;The encoded segment is the Base64 output of the &lt;code&gt;username:password&lt;/code&gt; credential pair&lt;/li&gt;
&lt;li&gt;Default values for &lt;code&gt;basic-auth-username&lt;/code&gt; and &lt;code&gt;basic-auth-password&lt;/code&gt; are both &lt;code&gt;admin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;SeaTunnel Web delivers a visual interface to simplify authentication setup and engine connection workflows&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you encounter a &lt;code&gt;401 Unauthorized&lt;/code&gt; response, don't jump to the conclusion that the Zeta Engine failed to start or that your port or path is wrong.&lt;br&gt;
Verify these three points first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Whether &lt;code&gt;enable-basic-auth&lt;/code&gt; is enabled in your engine configuration&lt;/li&gt;
&lt;li&gt;Whether your client request carries a valid Authorization header&lt;/li&gt;
&lt;li&gt;Whether the username and password match the values defined in the engine config&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this authentication workflow fully understood, troubleshooting and integrating with SeaTunnel Zeta Engine’s REST APIs becomes far more straightforward.&lt;/p&gt;

</description>
      <category>seatunnel</category>
      <category>dataengineering</category>
      <category>api</category>
      <category>ai</category>
    </item>
    <item>
      <title>Debugging a Distributed Job Stuck in CANCELING in Apache SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 26 Jun 2026 09:41:39 +0000</pubDate>
      <link>https://dev.to/seatunnel/debugging-a-distributed-job-stuck-in-canceling-in-apache-seatunnel-5ap5</link>
      <guid>https://dev.to/seatunnel/debugging-a-distributed-job-stuck-in-canceling-in-apache-seatunnel-5ap5</guid>
      <description>&lt;p&gt;Recently, I worked on an issue in Apache SeaTunnel where a job could sometimes stay in the &lt;code&gt;CANCELING&lt;/code&gt; state forever after a user requested cancellation.&lt;/p&gt;

&lt;p&gt;At first, I thought this would be a simple bug in the cancel logic. Maybe there was a deadlock, an infinite retry loop, or a missing state transition somewhere.&lt;/p&gt;

&lt;p&gt;But after tracing the code more carefully, I realized the issue was more subtle. It was related to master-worker communication, master failover, job state recovery timing, and how one exception was handled during task status notification.&lt;/p&gt;

&lt;p&gt;This post is a short troubleshooting note about what I found and why I added a separate Force Stop mechanism instead of directly rewriting the existing Cancel logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;SeaTunnel’s Zeta engine runs jobs across a cluster.&lt;/p&gt;

&lt;p&gt;The master node manages job-level state, and worker nodes execute task groups. When a user cancels a job, the master needs to notify the worker that is running the task. After the task finishes or is canceled, the worker reports its final task state back to the master.&lt;/p&gt;

&lt;p&gt;A simplified flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User requests cancel
        ↓
Master sends cancel request to worker
        ↓
Worker cancels or finishes the task
        ↓
Worker reports final task state to master
        ↓
Master updates the job state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks simple, but in a distributed system, each step can race with node failure, membership changes, or master recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Symptom
&lt;/h2&gt;

&lt;p&gt;The symptom was that a job stayed in &lt;code&gt;CANCELING&lt;/code&gt; indefinitely.&lt;/p&gt;

&lt;p&gt;The user had already requested cancellation, but the job never moved to a final state such as &lt;code&gt;CANCELED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first, I mainly focused on the master-to-worker cancel request path.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Suspicion: The Cancel Request Path
&lt;/h2&gt;

&lt;p&gt;On the master side, SeaTunnel sends a &lt;code&gt;CancelTaskOperation&lt;/code&gt; to the worker node.&lt;/p&gt;

&lt;p&gt;One important part of the logic is that it checks whether the current execution address still exists in the cluster membership before sending the cancel operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (!taskFuture.isDone()
        &amp;amp;&amp;amp; nodeEngine
                .getClusterService()
                .getMember(executionAddress = getCurrentExecutionAddress())
            != null) {
    try {
        nodeEngine
                .getOperationService()
                .createInvocationBuilder(
                        Constant.SEATUNNEL_SERVICE_NAME,
                        new CancelTaskOperation(taskGroupLocation),
                        executionAddress)
                .invoke()
                .get();
        return;
    } catch (Exception e) {
        Thread.sleep(2000);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looked suspicious to me.&lt;/p&gt;

&lt;p&gt;If the worker node temporarily disappears from the cluster view, for example because of a heartbeat issue, the loop may exit without sending the cancel request at all.&lt;/p&gt;

&lt;p&gt;So my first thought was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Maybe the master believes it has handled cancellation, but the worker never received the cancel request.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This path is still worth paying attention to. However, I later realized that it was not enough to fully explain why the job could remain in &lt;code&gt;CANCELING&lt;/code&gt; forever.&lt;/p&gt;

&lt;p&gt;Even if the cancel request is missed, the task may eventually finish and report its final state to the master.&lt;/p&gt;

&lt;p&gt;So I started looking at the opposite direction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens when the worker reports its final task state back to the master?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The More Important Path: Worker-to-Master Notification
&lt;/h2&gt;

&lt;p&gt;When a task reaches a final state, the worker calls &lt;code&gt;notifyTaskStatusToMaster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The method is designed to retry until the notification succeeds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (isRunning &amp;amp;&amp;amp; !notifyStateSuccess) {
    InvocationFuture&amp;lt;Object&amp;gt; invoke =
            nodeEngine
                    .getOperationService()
                    .createInvocationBuilder(
                            SeaTunnelServer.SERVICE_NAME,
                            new NotifyTaskStatusOperation(
                                    taskGroupLocation, taskExecutionState),
                            nodeEngine.getMasterAddress())
                    .invoke();
    try {
        invoke.get();
        notifyStateSuccess = true;
    } catch (JobNotFoundException e) {
        logger.warning("send notify task status failed because can't find job", e);
        notifyStateSuccess = true;
    } catch (ExecutionException e) {
        if (e.getCause() instanceof JobNotFoundException) {
            logger.warning("send notify task status failed because can't find job", e);
            notifyStateSuccess = true;
        } else {
            Thread.sleep(sleepTime);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retry logic itself looked reasonable at first.&lt;/p&gt;

&lt;p&gt;But the &lt;code&gt;JobNotFoundException&lt;/code&gt; handling was important.&lt;/p&gt;

&lt;p&gt;If the worker receives &lt;code&gt;JobNotFoundException&lt;/code&gt;, it sets &lt;code&gt;notifyStateSuccess = true&lt;/code&gt; and stops retrying.&lt;/p&gt;

&lt;p&gt;In many cases, this behavior makes sense. If the job no longer exists on the master, it may mean the job has already finished and was removed from the running job map.&lt;/p&gt;

&lt;p&gt;But during master failover, this assumption can become dangerous.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fy17ui702t2quph71gxrr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fy17ui702t2quph71gxrr.jpg" width="720" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where JobNotFoundException Comes From
&lt;/h2&gt;

&lt;p&gt;On the master side, the task state update checks &lt;code&gt;runningJobMasterMap&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public void updateTaskExecutionState(TaskExecutionState taskExecutionState) {
    TaskGroupLocation taskGroupLocation = taskExecutionState.getTaskGroupLocation();
    JobMaster runningJobMaster = runningJobMasterMap.get(taskGroupLocation.getJobId());
    if (runningJobMaster == null) {
        throw new JobNotFoundException(
                String.format("Job %s not running", taskGroupLocation.getJobId()));
    }
    runningJobMaster.updateTaskExecutionState(taskExecutionState);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Normally, &lt;code&gt;runningJobMaster == null&lt;/code&gt; means the job is not running anymore.&lt;/p&gt;

&lt;p&gt;However, there is another possible timing window.&lt;/p&gt;

&lt;p&gt;During master failover, the new master may not have fully restored &lt;code&gt;runningJobMasterMap&lt;/code&gt; yet. If a worker sends its final task status during that window, the new master can fail to find the corresponding &lt;code&gt;JobMaster&lt;/code&gt; and throw &lt;code&gt;JobNotFoundException&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then the worker treats this exception as success and stops retrying.&lt;/p&gt;

&lt;p&gt;The problematic sequence can look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A job is being canceled.&lt;/li&gt;
&lt;li&gt;A master switch happens.&lt;/li&gt;
&lt;li&gt;A worker finishes its task and sends the final task state.&lt;/li&gt;
&lt;li&gt;The new master has not fully restored &lt;code&gt;runningJobMasterMap&lt;/code&gt; yet.&lt;/li&gt;
&lt;li&gt;The master throws &lt;code&gt;JobNotFoundException&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The worker treats this as success and stops retrying.&lt;/li&gt;
&lt;li&gt;The master never receives the final task state after recovery.&lt;/li&gt;
&lt;li&gt;The job remains in &lt;code&gt;CANCELING&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This was the key point for me.&lt;/p&gt;

&lt;p&gt;The issue was not just that the cancel RPC might fail. The more important problem was that the worker’s final status notification could be lost during master recovery.&lt;/p&gt;
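
&lt;p&gt;The lost-notification sequence can be reproduced in a small toy model. This is only a sketch: every class and method name below is hypothetical, and the real SeaTunnel classes are far more involved. It shows a worker that reports once, receives a "job not found" reply from a master that has not finished recovery, and stops retrying.&lt;/p&gt;

```java
// Toy model of the race described above. All names are hypothetical; this is not SeaTunnel code.
final class LostNotificationSketch {

    // Stand-in for the real JobNotFoundException.
    static final class JobNotFound extends RuntimeException {
        JobNotFound(String message) {
            super(message);
        }
    }

    // Stand-in for the new master: it only knows the job once recovery has completed.
    static final class ToyMaster {
        boolean jobRestored = false;
        boolean finalStateReceived = false;

        void updateTaskExecutionState() {
            if (!jobRestored) {
                throw new JobNotFound("Job not running");
            }
            finalStateReceived = true;
        }
    }

    // Worker-side loop: a JobNotFound reply is treated as a successful notification.
    // Returns the number of attempts the worker made before it stopped.
    static int notifyMaster(ToyMaster master) {
        boolean notifyStateSuccess = false;
        int attempts = 0;
        while (!notifyStateSuccess) {
            attempts++;
            try {
                master.updateTaskExecutionState();
                notifyStateSuccess = true;
            } catch (JobNotFound e) {
                notifyStateSuccess = true; // stops retrying although nothing was delivered
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();  // failover just happened, job not restored yet
        int attempts = notifyMaster(master); // worker reports its final state
        master.jobRestored = true;           // recovery completes afterwards
        System.out.println("attempts=" + attempts
                + ", finalStateReceived=" + master.finalStateReceived);
        // prints: attempts=1, finalStateReceived=false
    }
}
```

&lt;p&gt;After recovery the toy master knows the job again, but nobody will ever resend the final state, which is the stuck condition in miniature.&lt;/p&gt;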

&lt;h2&gt;
  
  
  Why I Added Force Stop Instead of Rewriting Cancel
&lt;/h2&gt;

&lt;p&gt;After finding this path, I considered whether the existing Cancel logic should be changed directly.&lt;/p&gt;

&lt;p&gt;There were several possible directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;change how &lt;code&gt;JobNotFoundException&lt;/code&gt; is handled,&lt;/li&gt;
&lt;li&gt;wait until &lt;code&gt;runningJobMasterMap&lt;/code&gt; recovery is completed,&lt;/li&gt;
&lt;li&gt;store more running job state in distributed storage,&lt;/li&gt;
&lt;li&gt;or redesign the cancellation state transition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the issue was intermittent and timing-dependent. Also, the normal Cancel path is a sensitive part of the execution lifecycle. A direct change there could alter existing behavior or add performance overhead.&lt;/p&gt;

&lt;p&gt;So I chose a more practical approach first.&lt;/p&gt;

&lt;p&gt;Instead of changing the meaning of normal Cancel, I added a separate Force Stop mechanism.&lt;/p&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If graceful cancellation cannot make progress, operators need an explicit way to finalize the job state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Cancel vs Force Stop
&lt;/h2&gt;

&lt;p&gt;I tried to keep the difference between Cancel and Force Stop clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cancel
&lt;/h3&gt;

&lt;p&gt;Cancel is for graceful termination.&lt;/p&gt;

&lt;p&gt;It asks the running task to stop and depends on the normal task lifecycle and worker notification path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Force Stop
&lt;/h3&gt;

&lt;p&gt;Force Stop is for operational recovery.&lt;/p&gt;

&lt;p&gt;It should not depend on whether the remote worker can still respond correctly. The master finalizes the job state and cleans up based on that decision.&lt;/p&gt;

&lt;p&gt;In short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cancel     = try to stop the job gracefully
Force Stop = finalize the job when Cancel cannot make progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Force Stop is not meant to replace Cancel. It is a fallback path for stuck cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;This issue taught me that a stuck job state is not always caused by the code that directly updates that state.&lt;/p&gt;

&lt;p&gt;In this case, the cancel request path looked suspicious at first. But the more important problem was on the worker-to-master notification path.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;JobNotFoundException&lt;/code&gt; looked like a reasonable terminal condition in normal cases, but during master failover it could also mean:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The new master has not recovered the job yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those two meanings are very different.&lt;/p&gt;

&lt;p&gt;That small difference can decide whether the worker should stop retrying or keep trying.&lt;/p&gt;

&lt;p&gt;For me, this was a good reminder that in distributed systems, exception handling is part of the state machine. It is not just error handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The stuck &lt;code&gt;CANCELING&lt;/code&gt; issue was not simply a failed cancel request.&lt;/p&gt;

&lt;p&gt;A worker could finish its task and try to notify the master, but if this happened during master failover before the new master fully restored its running job state, the master could throw &lt;code&gt;JobNotFoundException&lt;/code&gt;. Because the worker treated this exception as a successful terminal condition, it stopped retrying. As a result, the master could miss the final task state and keep the job in &lt;code&gt;CANCELING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Force Stop was added as a practical recovery mechanism for this kind of situation.&lt;/p&gt;

&lt;p&gt;It does not replace normal Cancel. Instead, it gives operators a way to finalize a job when graceful cancellation cannot make progress.&lt;/p&gt;

&lt;p&gt;The biggest lesson for me was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In distributed systems, the hard part is not only sending a request. The hard part is deciding what the system should believe when the request races with failure, recovery, or a delayed state.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>apacheseatunnel</category>
      <category>programming</category>
      <category>developer</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Tutorial: Incremental MySQL-to-Doris Synchronization with Apache SeaTunnel and Apache DolphinScheduler</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 18 Jun 2026 07:30:14 +0000</pubDate>
      <link>https://dev.to/seatunnel/tutorial-incremental-mysql-to-doris-synchronization-with-apache-seatunnel-and-apache-415a</link>
      <guid>https://dev.to/seatunnel/tutorial-incremental-mysql-to-doris-synchronization-with-apache-seatunnel-and-apache-415a</guid>
      <description>&lt;p&gt;Data synchronization is one of the most common requirements in enterprise data platform development. As business volume continues to grow, full data synchronization can place increasing pressure on source databases while consuming substantial computing and storage resources. As a result, incremental synchronization has become the preferred approach in most production environments.&lt;/p&gt;

&lt;p&gt;In this demo, we will combine Apache SeaTunnel and Apache DolphinScheduler to implement a typical offline incremental synchronization scenario. DolphinScheduler retrieves the synchronization checkpoint from the target system and passes it to SeaTunnel as a runtime parameter, enabling incremental data synchronization from MySQL to Apache Doris.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tzz4foestm5zcd4ndw1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tzz4foestm5zcd4ndw1.jpg" width="799" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article is based on an actual demonstration and provides a complete walkthrough of the environment setup, SeaTunnel configuration, and DolphinScheduler workflow configuration process.&lt;/p&gt;

&lt;p&gt;For the full demo, please refer to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/ObUaVOuoDC8?si=xNTPvsMTo6ALhi5b" rel="noopener noreferrer"&gt;https://youtu.be/ObUaVOuoDC8?si=xNTPvsMTo6ALhi5b&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Environment Setup
&lt;/h1&gt;

&lt;p&gt;This demonstration uses the following components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apache SeaTunnel&lt;/td&gt;
&lt;td&gt;2.3.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache DolphinScheduler&lt;/td&gt;
&lt;td&gt;3.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;8.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Doris&lt;/td&gt;
&lt;td&gt;2.x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL serves as the source database.&lt;/li&gt;
&lt;li&gt;Doris serves as the target database.&lt;/li&gt;
&lt;li&gt;SeaTunnel is responsible for data synchronization.&lt;/li&gt;
&lt;li&gt;DolphinScheduler handles workflow orchestration and scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  2. Preparing Test Data
&lt;/h1&gt;

&lt;p&gt;Before configuring the synchronization task, we first prepare sample business data.&lt;/p&gt;

&lt;p&gt;In this demonstration, a database named &lt;code&gt;shopping&lt;/code&gt; is used as the sample database, and an &lt;code&gt;orders&lt;/code&gt; table is created.&lt;/p&gt;

&lt;p&gt;The orders table contains an auto-incrementing primary key column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This field will later be used as the incremental synchronization checkpoint.&lt;/p&gt;

&lt;p&gt;To verify synchronization results, a batch of sample records is inserted into the table. Approximately 300 order records are generated using a script.&lt;/p&gt;

&lt;p&gt;The following information is then inspected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current total number of orders&lt;/li&gt;
&lt;li&gt;Current maximum order ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These values will serve as references when configuring incremental synchronization logic later.&lt;/p&gt;

&lt;p&gt;It is worth noting that &lt;code&gt;order_id&lt;/code&gt; is used only for demonstration purposes. In real-world production scenarios, timestamp fields such as &lt;code&gt;update_time&lt;/code&gt; or &lt;code&gt;create_time&lt;/code&gt; are often used as incremental synchronization conditions.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Incremental Synchronization Design
&lt;/h1&gt;

&lt;p&gt;Before configuring SeaTunnel, let's first understand the overall synchronization strategy.&lt;/p&gt;

&lt;p&gt;The core idea is to use data that has already been synchronized into Doris to determine the current synchronization progress.&lt;/p&gt;

&lt;p&gt;The workflow operates as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query the current maximum order ID in Doris.&lt;/li&gt;
&lt;li&gt;Use this value as the synchronization checkpoint.&lt;/li&gt;
&lt;li&gt;SeaTunnel reads records from MySQL whose order IDs are greater than this checkpoint.&lt;/li&gt;
&lt;li&gt;Newly added records are written into Doris.&lt;/li&gt;
&lt;li&gt;During the next execution, synchronization resumes from the latest checkpoint.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if the current maximum order ID in Doris is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SeaTunnel will execute the following condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that only newly inserted records are processed during each run, preventing duplicate synchronization of existing data.&lt;/p&gt;

&lt;p&gt;As emphasized during the demonstration, the incremental field does not necessarily have to be a primary key. Any field that can accurately identify newly added or modified data can be used.&lt;/p&gt;
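
&lt;p&gt;The checkpoint loop described in this section can be modeled in memory as a rough sketch. Two arrays stand in for the MySQL and Doris tables, and all names are hypothetical. No database or SeaTunnel API is involved.&lt;/p&gt;

```java
import java.util.Arrays;

// In-memory model of the checkpoint loop. Hypothetical names; no real database is involved.
final class IncrementalSyncSketch {

    // Highest order_id already present in the target; an empty target yields 0.
    static long checkpoint(long[] targetIds) {
        return Arrays.stream(targetIds).max().orElse(0L);
    }

    // Mirrors "WHERE order_id > checkpoint" on the source side.
    static long[] newRows(long[] sourceIds, long checkpoint) {
        return Arrays.stream(sourceIds).filter(id -> id > checkpoint).toArray();
    }

    public static void main(String[] args) {
        long[] source = {1, 2, 3, 4, 5}; // stands in for the MySQL orders table
        long[] target = {1, 2, 3};       // stands in for the Doris orders table
        long[] delta = newRows(source, checkpoint(target));
        System.out.println(Arrays.toString(delta)); // prints [4, 5]
    }
}
```

&lt;p&gt;Running the same two steps again after the delta has been written returns an empty array, which is why repeated executions do not duplicate data.&lt;/p&gt;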

&lt;h1&gt;
  
  
  4. Configuring the SeaTunnel Job
&lt;/h1&gt;

&lt;p&gt;After defining the synchronization strategy, we can start configuring the SeaTunnel job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure the JDBC Source
&lt;/h2&gt;

&lt;p&gt;Since the source data resides in MySQL, the JDBC Source is used to read the data.&lt;/p&gt;

&lt;p&gt;The core query is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important part is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;${order_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This value is not hardcoded. Instead, it will be dynamically supplied by DolphinScheduler.&lt;/p&gt;

&lt;p&gt;When the workflow runs, SeaTunnel automatically replaces this variable with the actual synchronization checkpoint, enabling incremental extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Parallelism
&lt;/h2&gt;

&lt;p&gt;The demonstration also configures task parallelism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Increasing parallelism can significantly improve synchronization performance.&lt;/p&gt;

&lt;p&gt;In production environments, the appropriate value should be determined based on available server resources and database workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Partitioned Reads
&lt;/h2&gt;

&lt;p&gt;To improve performance when reading large tables, partitioned reading is also introduced.&lt;/p&gt;

&lt;p&gt;The partition column is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;partition_column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order_id"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with the &lt;code&gt;partition_num&lt;/code&gt; parameter, the dataset is divided into multiple partitions that can be processed in parallel.&lt;/p&gt;

&lt;p&gt;This approach can greatly improve synchronization efficiency for large-scale datasets.&lt;/p&gt;
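
&lt;p&gt;To make the idea concrete, here is one plausible way a numeric key range could be cut into equal slices. It is an illustration only, with hypothetical names: the connector's actual splitting logic may differ, and the sketch assumes the number of slices is no larger than the number of ids in the range.&lt;/p&gt;

```java
// Illustration only: evenly slices [lower, upper] into `parts` inclusive ranges.
final class RangeSplitSketch {

    // Returns `parts` rows of {start, end}; the last range absorbs any remainder.
    static long[][] split(long lower, long upper, int parts) {
        long[][] ranges = new long[parts][2];
        long step = (upper - lower + 1) / parts;
        long start = lower;
        for (int i = 0; i != parts; i++) {
            long end = (i == parts - 1) ? upper : start + step - 1;
            ranges[i][0] = start;
            ranges[i][1] = end;
            start = end + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: order_id values 301..700 read as 4 slices.
        for (long[] r : split(301, 700, 4)) {
            System.out.println("order_id BETWEEN " + r[0] + " AND " + r[1]);
        }
    }
}
```

&lt;p&gt;Each slice can then be read by its own reader, so no single query has to scan the whole table.&lt;/p&gt;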

&lt;h2&gt;
  
  
  Configure Fetch Size
&lt;/h2&gt;

&lt;p&gt;Within the JDBC Connector, &lt;code&gt;fetch_size&lt;/code&gt; can be used to control the number of records retrieved from the database per fetch operation.&lt;/p&gt;

&lt;p&gt;Proper configuration of this parameter can reduce database round trips and improve overall read performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  5. Configuring the Doris Sink
&lt;/h1&gt;

&lt;p&gt;After completing the Source configuration, the next step is to configure the Doris Sink.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Table Creation
&lt;/h2&gt;

&lt;p&gt;The demonstration first introduces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="l"&gt;create_schema&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This parameter enables automatic creation of target tables.&lt;/p&gt;

&lt;p&gt;By leveraging automatic table creation, users can significantly reduce the effort required to manually maintain Doris table schemas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Write Mode
&lt;/h2&gt;

&lt;p&gt;Since this example uses incremental synchronization, append mode is selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;APPEND_DATA&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;APPEND_DATA&lt;/code&gt; is used because each synchronization run only processes newly added records and does not need to overwrite historical data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enable Two-Phase Commit
&lt;/h2&gt;

&lt;p&gt;To ensure data consistency, the demonstration also introduces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;enable_2pc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enabling this option activates the two-phase commit mechanism, providing more reliable data writes.&lt;/p&gt;

&lt;p&gt;It also helps guarantee Exactly-Once semantics during data synchronization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Optimization Parameters
&lt;/h2&gt;

&lt;p&gt;Several performance-related parameters are also discussed, including:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="l"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="l"&gt;buffer_size&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These parameters primarily control batch write behavior and can significantly improve Doris ingestion performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  6. Configuring the DolphinScheduler Runtime Environment
&lt;/h1&gt;

&lt;p&gt;After completing the SeaTunnel configuration, the next step is to set up DolphinScheduler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Tenant
&lt;/h2&gt;

&lt;p&gt;First, navigate to the &lt;strong&gt;Security Center&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Open the Tenant Management page and create a new tenant.&lt;/p&gt;

&lt;p&gt;The demonstration specifically emphasizes that all tasks in DolphinScheduler are ultimately executed under a tenant identity. Therefore, tenant configuration is an essential step in preparing the runtime environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a User and Associate It with the Tenant
&lt;/h2&gt;

&lt;p&gt;Next, navigate to the User Management page.&lt;/p&gt;

&lt;p&gt;Create a user and associate it with the tenant created in the previous step.&lt;/p&gt;

&lt;p&gt;Once configured, the user will have permission to execute tasks under the corresponding tenant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an Environment
&lt;/h2&gt;

&lt;p&gt;Next, open the Environment Management page.&lt;/p&gt;

&lt;p&gt;Create a runtime environment for SeaTunnel.&lt;/p&gt;

&lt;p&gt;Configure the following environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SEATUNNEL_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/soft/seatunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration tells DolphinScheduler where SeaTunnel is installed.&lt;/p&gt;

&lt;p&gt;When a workflow executes a SeaTunnel task, DolphinScheduler uses this path to locate the corresponding execution scripts.&lt;/p&gt;

&lt;p&gt;The demonstration highlights that this configuration is mandatory and should not be skipped.&lt;/p&gt;

&lt;h1&gt;
  
  
  7. Creating the Project and Workflow
&lt;/h1&gt;

&lt;p&gt;After completing the environment setup, create a new project.&lt;/p&gt;

&lt;p&gt;In this demonstration, a project named:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shopping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is created.&lt;/p&gt;

&lt;p&gt;After entering the project, create a new workflow.&lt;/p&gt;

&lt;p&gt;The workflow contains two core nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SQL Task&lt;/li&gt;
&lt;li&gt;SeaTunnel Task&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The SQL Task is responsible for retrieving the synchronization checkpoint, while the SeaTunnel Task performs the actual data synchronization.&lt;/p&gt;

&lt;h1&gt;
  
  
  8. Configuring the SQL Task to Retrieve the Synchronization Checkpoint
&lt;/h1&gt;

&lt;p&gt;This is the most critical step in the entire solution.&lt;/p&gt;

&lt;p&gt;First, create an SQL Task and select the Doris datasource.&lt;/p&gt;

&lt;p&gt;The purpose of this task is to determine how far synchronization has progressed by querying the latest synchronized order ID.&lt;/p&gt;

&lt;p&gt;The SQL statement is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;IFNULL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demonstration specifically explains why the following statement should not be used directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason is that during the initial synchronization, the Doris table may still be empty.&lt;/p&gt;

&lt;p&gt;In this case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAX(order_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NULL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



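&lt;p&gt;If NULL is passed directly to the downstream SeaTunnel task, the generated query condition may be invalid. A comparison such as &lt;code&gt;order_id &amp;gt; NULL&lt;/code&gt;, for example, never evaluates to true, so no rows would be selected.&lt;/p&gt;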

&lt;p&gt;Therefore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IFNULL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is used to convert NULL values into 0.&lt;/p&gt;

&lt;p&gt;This ensures that the initial synchronization starts correctly from the very first record.&lt;/p&gt;
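
&lt;p&gt;The NULL behavior can be reproduced with a minimal Python sketch. It assumes an in-memory SQLite database as a stand-in for Doris (SQLite also returns NULL for &lt;code&gt;MAX&lt;/code&gt; over an empty table and supports &lt;code&gt;IFNULL&lt;/code&gt;); the one-column table is illustrative and not the schema used in the demonstration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Assumption: in-memory SQLite stands in for the Doris target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")

# On an empty table, MAX() yields NULL, which Python receives as None.
raw_max = conn.execute("SELECT MAX(order_id) FROM orders").fetchone()[0]

# IFNULL replaces the NULL with 0, giving a usable starting checkpoint.
safe_max = conn.execute(
    "SELECT IFNULL(MAX(order_id), 0) AS order_id FROM orders"
).fetchone()[0]

print(raw_max, safe_max)  # prints: None 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the fallback in place, the first run always hands a concrete number to the downstream task, regardless of whether the target table has been loaded before.&lt;/p&gt;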

&lt;h2&gt;
  
  
  Configure an OUT Parameter
&lt;/h2&gt;

&lt;p&gt;The query result must be passed to downstream tasks.&lt;/p&gt;

&lt;p&gt;To accomplish this, create a custom parameter within the SQL Task.&lt;/p&gt;

&lt;p&gt;Select the parameter type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the parameter name to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL query result will then be stored as a workflow variable.&lt;/p&gt;

&lt;p&gt;The SeaTunnel task can subsequently reference this variable directly.&lt;/p&gt;

&lt;h1&gt;
  
  
  9. Incremental Synchronization Workflow Logic
&lt;/h1&gt;

&lt;p&gt;Once the SQL Task is configured, the entire incremental synchronization pipeline is established.&lt;/p&gt;

&lt;p&gt;When the workflow runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The SQL Task queries the current maximum order_id in Doris.&lt;/li&gt;
&lt;li&gt;The result is stored as a workflow variable.&lt;/li&gt;
&lt;li&gt;SeaTunnel uses &lt;code&gt;${order_id}&lt;/code&gt; as the query condition.&lt;/li&gt;
&lt;li&gt;Newly added records are extracted from MySQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through this approach, offline incremental synchronization based on a business primary key can be implemented efficiently and reliably.&lt;/p&gt;
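
&lt;p&gt;The four steps above can be sketched in Python, assuming SQLite stand-ins: &lt;code&gt;source&lt;/code&gt; plays the role of MySQL and &lt;code&gt;target&lt;/code&gt; the role of Doris. The function names, the &lt;code&gt;amount&lt;/code&gt; column, and the sample rows are hypothetical. In the real workflow, DolphinScheduler runs the SQL Task and SeaTunnel performs the copy, so this only illustrates the checkpoint logic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3


def make_db():
    # Hypothetical two-column orders table, used for both source and target.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    return db


def read_checkpoint(target):
    # Steps 1 and 2: highest order_id already in the target, or 0 when it is empty.
    row = target.execute("SELECT IFNULL(MAX(order_id), 0) FROM orders").fetchone()
    return row[0]


def sync_once(source, target):
    # Steps 3 and 4: copy only the source rows beyond the checkpoint.
    checkpoint = read_checkpoint(target)
    rows = source.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ? ORDER BY order_id",
        (checkpoint,),
    ).fetchall()
    target.executemany("INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows)
    target.commit()
    return len(rows)


source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0), (3, 5.25)]
)

first_run = sync_once(source, target)   # initial load: target is empty, checkpoint is 0
source.execute("INSERT INTO orders VALUES (4, 12.0)")
second_run = sync_once(source, target)  # only the newly added row is copied
third_run = sync_once(source, target)   # nothing new, so nothing is copied

print(first_run, second_run, third_run)  # prints: 3 1 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the checkpoint is read from the target on every run, a rerun after a successful load copies nothing. The sketch relies on &lt;code&gt;order_id&lt;/code&gt; only ever increasing, which is the same assumption the workflow makes.&lt;/p&gt;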

&lt;h1&gt;
  
  
  10. Conclusion
&lt;/h1&gt;

&lt;p&gt;This example demonstrates how to implement offline incremental synchronization by combining Apache DolphinScheduler and Apache SeaTunnel.&lt;/p&gt;

&lt;p&gt;SeaTunnel handles data extraction and loading, while DolphinScheduler manages synchronization checkpoint retrieval, parameter passing, workflow orchestration, and scheduling.&lt;/p&gt;

&lt;p&gt;The key idea behind this solution is querying the maximum order_id from the target Doris table through an SQL Task and passing the result to SeaTunnel through an OUT parameter. SeaTunnel then uses the checkpoint to perform incremental extraction from MySQL.&lt;/p&gt;

&lt;p&gt;For data warehouse construction, ODS synchronization, and recurring offline synchronization scenarios, this solution offers several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple implementation&lt;/li&gt;
&lt;li&gt;Easy maintenance&lt;/li&gt;
&lt;li&gt;Strong extensibility&lt;/li&gt;
&lt;li&gt;Production-ready incremental synchronization capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, it provides a practical reference architecture for enterprise data platform development.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>sql</category>
      <category>doris</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What It Takes to Become an Apache SeaTunnel Committer: Doyeon Kim's Story</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 18 Jun 2026 02:52:39 +0000</pubDate>
      <link>https://dev.to/seatunnel/what-it-takes-to-become-an-apache-seatunnel-committer-doyeon-kims-story-i6g</link>
      <guid>https://dev.to/seatunnel/what-it-takes-to-become-an-apache-seatunnel-committer-doyeon-kims-story-i6g</guid>
      <description>&lt;p&gt;Do you remember the young contributor from South Korea whom we featured last year?&lt;/p&gt;

&lt;p&gt;In our previous interview, &lt;em&gt;&lt;a href="https://medium.com/dev-genius/from-newcomer-to-power-contributor-south-koreas-doyeon-kim-shines-in-apache-seatunnel-in-just-six-4b931f69281b" rel="noopener noreferrer"&gt;From Newcomer to Power Contributor: South Korea’s Doyeon Kim Shines in Apache SeaTunnel in Just Six Months&lt;/a&gt;,&lt;/em&gt; we shared the story of a passionate newcomer who had already become one of the most active contributors in the Apache SeaTunnel community within just a few months of joining.&lt;/p&gt;

&lt;p&gt;Fast forward about a year, and we're excited to share some wonderful news: &lt;strong&gt;Doyeon Kim has officially been invited to become an Apache SeaTunnel Committer!&lt;/strong&gt; 🎉👏&lt;/p&gt;

&lt;p&gt;What's even more inspiring is that Doyeon is still a university student.&lt;/p&gt;

&lt;p&gt;Through her exceptional curiosity, relentless commitment to learning, and consistently high-quality contributions to both the project and the community, she has earned the recognition and trust of contributors worldwide. Her growth over the past year has been remarkable, and her promotion to Committer reflects the impact she has had on the Apache SeaTunnel ecosystem.&lt;/p&gt;

&lt;p&gt;To celebrate this milestone, we sat down with Doyeon for another in-depth conversation.&lt;/p&gt;

&lt;p&gt;How has her perspective evolved since our first interview? What challenges did she overcome along the way? And what lessons has she learned on her journey from student contributor to Apache SeaTunnel Committer?&lt;/p&gt;

&lt;p&gt;Join us as we look back on her inspiring open-source journey and discover the experiences that helped shape the next chapter of her growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk9xbag4as21g47dmn2ut.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk9xbag4as21g47dmn2ut.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My name is Doyeon Kim, and my GitHub ID is dybyte.&lt;/p&gt;

&lt;p&gt;I am currently a university student. My main technical interests are backend development, data integration, distributed systems, and Apache SeaTunnel.&lt;/p&gt;

&lt;p&gt;I am especially interested in how data integration systems work internally, including connectors, checkpointing, failure recovery, and engine state management.&lt;/p&gt;

&lt;p&gt;Outside of development, I enjoy playing games, especially Overwatch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Interview Transcript
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;How long have you been involved in open source? What attracts you to open source?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I have been involved in open source for about one year.&lt;/p&gt;

&lt;p&gt;What attracted me to open source was the opportunity to work together with many people on the same project. I found it very interesting and enjoyable that contributors from different backgrounds can discuss ideas, review code, and improve the project together.&lt;/p&gt;

&lt;p&gt;I also like that open source communities are active and open. Through open source, I can learn not only from the code itself, but also from discussions, reviews, and the way experienced contributors think about problems.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;When did you start contributing to the Apache SeaTunnel community, and what motivated you to get involved?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started contributing to Apache SeaTunnel in May 2025.&lt;/p&gt;

&lt;p&gt;At first, I was mainly studying backend development and wanted to gain experience by contributing to a real open source project. When I found SeaTunnel, I felt that the community was active, welcoming, and open to new contributors, so I decided to start contributing.&lt;/p&gt;

&lt;p&gt;As I continued participating, I encountered many interesting problems related to connectors, engine behavior, checkpointing, and reliability. Through these experiences, my technical interests gradually expanded from backend development to data integration systems and distributed systems.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Now that you have been elected as a SeaTunnel Committer, could you summarize your contributions to the community, including both code and non-code contributions? Please describe specific solutions or initiatives as much as possible.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;My contributions to Apache SeaTunnel include code contributions, larger ongoing proposals, pull request reviews, technical discussions, documentation improvements, and community activities.&lt;/p&gt;

&lt;p&gt;On the code side, I have mainly contributed to the Zeta engine, connectors, Transform-V2, tests, and documentation.&lt;/p&gt;

&lt;p&gt;In the Zeta engine area, I worked on improving job state handling, metrics handling, REST API stability, and failure-related edge cases. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #9833: [Improve][Zeta] Improve job metrics handling with partitioning
support
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PR #9926: [Improve][Zeta] Filter tasks and pipelines by state
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PR #10132: [Fix][Zeta] Fix unnecessary job state update
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PR #10315: [Fix][Zeta] Fix memory leak when cancelling pending job
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PR #10456: [Fix][Zeta] Fix NPE when querying pending job info
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes helped me better understand SeaTunnel’s engine behavior, job lifecycle, metrics handling, and failure recovery paths.&lt;/p&gt;

&lt;p&gt;I have also contributed to connector and Transform-V2-related improvements. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #10222: [Fix][Connector-V2] Use upload session for insert
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PR #10263: [Fix][Transform-V2] Fix multiTable SQL transform&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PR #10319: [Fix][Connector-V2] Fix partitioning column selection logic&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PR #10361: [Feature][Transform-V2] Add FieldEncrypt transform for encrypting selected fields&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PR #10603: [Feature][Transform-V2] Support AES_GCM algorithm in FieldEncrypt&lt;br&gt;
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These contributions involved fixing connector and transform behavior, improving correctness, and adding new transform capabilities.&lt;/p&gt;

&lt;p&gt;In addition to merged contributions, I have also been working on larger ongoing proposals and feature work. These PRs are still open, but I have already addressed the main review feedback, and they are currently waiting for further review or merge. I do not consider them completed features yet, but they represent areas where I have been investing deeper effort.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #10485: [Feature][Connector] Add BigQuery Sink Connector
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This PR introduces a BigQuery Sink connector based on the BigQuery Storage Write API. Through this work, I learned a lot about connector design, checkpoint integration, write semantics, and failure recovery.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #10399: [Feature][Zeta] Add compaction support to IMAP external
storage
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This PR explores compaction support for IMap external storage. It helped me think more deeply about storage design, recovery behavior, and how to reduce long-term maintenance costs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #10812: [Feature][Zeta] Decouple Hazelcast IMap via StateStore
Abstraction
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This PR proposes a StateStore abstraction layer in the Zeta engine. The goal is to reduce direct dependency on Hazelcast IMap and make engine state management easier to maintain and evolve. It represents one of the deeper areas I have been working on recently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR #10133: [Feature][Zeta] Report non-terminal job states
&lt;a href="https://github.com/apache/seatunnel/pull/" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This PR is related to improving job state reporting and observability in the Zeta engine.&lt;/p&gt;

&lt;p&gt;I have also helped improve test stability and CI reliability through several smaller fixes. Although these changes are not the main focus of my contributions, I believe they are still important for maintaining contributor productivity and community trust.&lt;/p&gt;

&lt;p&gt;On the non-code side, I have reviewed pull requests from other contributors and participated in technical discussions. My reviews have covered areas such as connectors, Zeta engine behavior, configuration validation, tests, documentation, and maintainability.&lt;/p&gt;

&lt;p&gt;During reviews, I try to focus not only on whether the code works, but also on whether the change is safe, understandable, and maintainable for the project in the long term.&lt;/p&gt;

&lt;p&gt;I have also shared my experience through technical writing and SeaTunnel meetup participation. I hope these activities can help more people understand SeaTunnel and become interested in contributing to the community.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;After being involved in the SeaTunnel project and community for quite some time, you likely have a deep understanding of both. In your opinion, what differentiates SeaTunnel from other competing products/projects? What are its strengths and weaknesses? What aspects of the SeaTunnel community motivate you to stay actively involved?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my opinion, SeaTunnel is especially useful when users need distributed data integration and need to connect many different sources and sinks.&lt;/p&gt;

&lt;p&gt;One of SeaTunnel’s strengths is that it provides many connectors while also supporting distributed execution. This makes it useful for ETL and ELT scenarios where users need to move, synchronize, or transform data across different systems.&lt;/p&gt;

&lt;p&gt;Another interesting point is that SeaTunnel has its own engine, Zeta, while also supporting engines such as Flink and Spark. This gives users more flexibility depending on their use case.&lt;/p&gt;

&lt;p&gt;At the same time, I think SeaTunnel is more focused on data integration than on being a full-featured stream processing engine. For very complex stream processing logic, such as advanced aggregations, windows, or event-time processing, engines like Flink or Spark may still be more suitable depending on the scenario.&lt;/p&gt;

&lt;p&gt;The value of SeaTunnel may be especially clear in scenarios where users need to integrate many systems or process data at a larger scale, so it is important to evaluate it based on each company’s requirements.&lt;/p&gt;

&lt;p&gt;What motivates me to stay involved is that SeaTunnel still has many meaningful problems to solve. The project is active, the community is open, and I feel that individual contributors can still have a real impact on the project.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Have you ever performed re-development or custom enhancements to address SeaTunnel’s shortcomings? If so, have these improvements been contributed back to the community? Could you introduce the solution/design approach?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since I am currently a student and not working in a company, I do not have experience customizing SeaTunnel for a company’s internal production use case.&lt;/p&gt;

&lt;p&gt;However, through community contributions, I have worked on improvements that address limitations or potential reliability issues in SeaTunnel.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;Has your company used SeaTunnel? If yes, what are the use cases or scenarios? If not, would you consider recommending it to your company? What would be your reasons for recommending it?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since I am currently a university student, I do not have a company use case to share.&lt;/p&gt;

&lt;p&gt;However, if I were in a situation where a company needed a data integration platform, I would consider recommending SeaTunnel depending on the requirements.&lt;/p&gt;

&lt;p&gt;For example, if the company needs to connect many different data sources and sinks and needs distributed execution for ETL or ELT workloads, SeaTunnel could be a good option.&lt;/p&gt;

&lt;p&gt;Of course, I would also consider the trade-offs carefully. The decision should depend on the company’s data volume, reliability requirements, operational cost, and whether SeaTunnel’s connector ecosystem matches the company’s needs.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;&lt;strong&gt;What kind of support or growth opportunities do you hope continued participation in the SeaTunnel community can provide for your personal development?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through continued participation in the SeaTunnel community, I hope to grow both technically and personally.&lt;/p&gt;

&lt;p&gt;Technically, I want to deepen my understanding of data integration systems, distributed systems, checkpointing, failure recovery, connector design, and engine architecture.&lt;/p&gt;

&lt;p&gt;I also hope to improve my communication skills. By discussing technical topics with community members, reviewing pull requests, and receiving feedback from experienced contributors, I have been able to learn a lot.&lt;/p&gt;

&lt;p&gt;In addition, I want to improve my English communication skills. For example, participating in events such as ApacheCon, listening to technical talks, and giving presentations would be very valuable experiences for me.&lt;/p&gt;

&lt;p&gt;Overall, the SeaTunnel community gives me opportunities to grow not only as a developer but also as a person who can communicate, collaborate, and contribute in an international open-source community.&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;&lt;strong&gt;What is your understanding of the Committer role within the community? What responsibilities and impact should a Committer have?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my understanding, a Committer can contribute to the community in many ways, not only by writing code.&lt;/p&gt;

&lt;p&gt;A Committer can review pull requests, help maintain code quality, guide contributors, and help new contributors become more familiar with the project.&lt;/p&gt;

&lt;p&gt;I also think a Committer should regularly think about what actions can have a positive impact on the community. This includes giving constructive feedback, helping discussions move forward, and making decisions that are good for the long-term health of the project.&lt;/p&gt;

&lt;p&gt;So I believe the Committer role is not just a permission to merge code. It is also a responsibility to help the project and the community grow in a healthy direction.&lt;/p&gt;

&lt;ol start="9"&gt;
&lt;li&gt;&lt;strong&gt;Now that you have been elected as a Committer, do you have any thoughts you would like to share with the community, or suggestions for the future development of the project?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I am very grateful to have been elected as an Apache SeaTunnel Committer.&lt;/p&gt;

&lt;p&gt;While participating in the community, I have gained many valuable experiences. I am happy that my passion and contributions could be helpful to the community.&lt;/p&gt;

&lt;p&gt;I would like to thank everyone who reviewed my pull requests, answered my questions, and discussed technical topics with me. Their feedback and support helped me continue contributing to the project.&lt;/p&gt;

&lt;p&gt;There were some difficult moments, but I think all of those experiences became part of my growth.&lt;/p&gt;

&lt;p&gt;At the moment, I do not have a specific proposal for the future direction of the project. However, if I find areas that could be improved, I will continue to share my thoughts through GitHub issues, pull requests, and community discussions.&lt;/p&gt;

&lt;ol start="10"&gt;
&lt;li&gt;&lt;strong&gt;What are your personal plans for contributing to and promoting the further development of the community and project in the near future?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the near future, I would like to continue helping contributors so that they can contribute to SeaTunnel more easily.&lt;/p&gt;

&lt;p&gt;One way I can do this is through pull request reviews. I want to continue reviewing PRs, giving feedback, and helping contributors understand the project better.&lt;/p&gt;

&lt;p&gt;I am also interested in helping more people in the Korean open-source community learn about SeaTunnel and contribute to the project. I think this could help both new contributors and the SeaTunnel community.&lt;/p&gt;

&lt;p&gt;In addition, I plan to continue sharing my experiences through platforms such as LinkedIn and Medium. By writing about SeaTunnel, my contribution experience, and technical problems I have worked on, I hope more people can become interested in the project.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Goodbye Data, Hello AI: My Biggest Takeaway from Snowflake Summit 2026</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 11 Jun 2026 10:39:40 +0000</pubDate>
      <link>https://dev.to/seatunnel/goodbye-data-hello-ai-my-biggest-takeaway-from-snowflake-summit-2026-1hkl</link>
      <guid>https://dev.to/seatunnel/goodbye-data-hello-ai-my-biggest-takeaway-from-snowflake-summit-2026-1hkl</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;By William Guo, CEO of WhaleOps &amp;amp; Snowflake Ambassador&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I would like to thank Snowflake for inviting me to attend Snowflake Summit as a Snowflake Ambassador.&lt;/p&gt;

&lt;p&gt;This Summit had a much greater impact on me than I had expected.&lt;/p&gt;

&lt;p&gt;As many of you know, I have spent my career in the data industry. I started at Teradata, then moved to IBM. Later, I was responsible for big data initiatives at enterprises such as Lenovo, CICC, and Wanda Group. After that, I became a Member of the Apache Software Foundation, and today I am the CEO of WhaleOps Open Source. Because of this background, I have always paid close attention to developments across the data industry.&lt;/p&gt;

&lt;p&gt;Before coming to the Summit, I originally thought Snowflake would launch a number of enterprise AI products or add AI-related capabilities on top of its existing data warehouse and data platform offerings.&lt;/p&gt;

&lt;p&gt;For many years, people's understanding of Snowflake has been quite clear: it is a cloud data warehouse company and a representative of the Data Cloud era. Its core strengths revolve around data storage, compute, performance, security, governance, sharing, and elastic scalability.&lt;/p&gt;

&lt;p&gt;However, after spending two days at the Summit, my impression changed completely.&lt;/p&gt;

&lt;p&gt;The biggest takeaway I had from this year's Snowflake Summit was not that Snowflake had released some new data platform features. Rather, it was that the company is aggressively reconstructing its product positioning.&lt;/p&gt;

&lt;p&gt;In my view, Snowflake is no longer satisfied with being defined as a Data Warehouse company. Nor does it simply want to become an AI Data Cloud. Instead, it aims to transform itself into an enterprise AI + Data platform, and perhaps even the foundation of the Agentic Enterprise, putting itself on a path that increasingly overlaps with companies like Anthropic.&lt;/p&gt;

&lt;p&gt;If I had to summarize my personal impression of this Snowflake Summit in one sentence, it would be:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Goodbye Data, Hello AI.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, "Goodbye Data" does not mean data is becoming less important.&lt;/p&gt;

&lt;p&gt;On the contrary, data has become even more important.&lt;/p&gt;

&lt;p&gt;What has changed is the way data platforms are expressed and understood.&lt;/p&gt;

&lt;p&gt;In the past, when we talked about data platforms, we talked about how data should be stored, processed, shared, governed, and optimized for cost efficiency.&lt;/p&gt;

&lt;p&gt;Today, Snowflake is talking about how AI can understand enterprise data, how Agents can use enterprise data, how business users can gain insights directly through natural language, and how enterprises can enable AI to execute tasks within secure and governed boundaries.&lt;/p&gt;

&lt;p&gt;Snowflake Product VP &lt;strong&gt;Christian Kleinerman&lt;/strong&gt; made a statement during the Platform Keynote that perfectly captures this shift:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jf4tpfyof747efidndy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jf4tpfyof747efidndy.png" alt="image.png" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Your AI-native enterprise starts here.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If this sentence had appeared at a typical AI conference, it might have sounded like a standard marketing slogan.&lt;/p&gt;

&lt;p&gt;But in the context of Snowflake Summit, it carries a very different meaning.&lt;/p&gt;

&lt;p&gt;That is because Snowflake is not an AI-native company. Historically, it has been a data infrastructure company. When a company with that background begins reorganizing its entire product portfolio around AI, it signals that AI is no longer an add-on feature but a force that reshapes the enterprise itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 Snowflake's Transformation: From Data Warehouse to AI Platform
&lt;/h2&gt;

&lt;p&gt;In the past, when I thought about Snowflake, the first things that came to mind were data warehousing, cloud-native architecture, elastic computing, the separation of storage and compute, data sharing, and unified governance.&lt;/p&gt;

&lt;p&gt;What Snowflake solved were several long-standing problems in traditional data platforms: fragmented data, limited scalability, complex performance tuning, inconsistent governance, and high collaboration costs.&lt;/p&gt;

&lt;p&gt;The central narrative of this year's Summit was clearly different.&lt;/p&gt;

&lt;p&gt;Snowflake still talks about All Data, All Workloads, and All Users. It still talks about structured, semi-structured, and unstructured data. It still talks about Iceberg, OpenFlow, Streaming, Zero Copy, and Horizon Catalog.&lt;/p&gt;

&lt;p&gt;However, these capabilities are no longer being positioned simply as components of a better data platform. Instead, they are being framed as the foundation for a new goal: enabling enterprise AI and Agents to operate on a unified data platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Christian Kleinerman&lt;/strong&gt; also made another highly important statement during the Platform Keynote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We need a unified architecture, both AI and data."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;This statement can almost be regarded as the strategic core of this year's Snowflake Summit. It is not saying, "We support AI too." Rather, it is saying that enterprises should not build a separate AI platform outside of their data platform.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because if the AI platform and the data platform are separated, many of the same problems we experienced during the data era will reappear: new silos, new permission systems, new governance gaps, new cost black holes, and new security risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We spent more than a decade eliminating data silos. If we build AI on an entirely separate stack today, we are essentially creating AI silos all over again.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So Snowflake's answer is clear: AI and Data must be unified. Data, compute, semantics, governance, security, applications, and Agents should all form a closed loop within a single platform.&lt;/p&gt;

&lt;p&gt;Viewed from this perspective, Snowflake's Summit slogan, &lt;strong&gt;&lt;em&gt;Make AI Real for Business&lt;/em&gt;&lt;/strong&gt;, is fundamentally about turning Data into the context, fuel, and execution foundation for AI.&lt;/p&gt;

&lt;p&gt;In the past, data platforms were built for people. People wrote SQL, viewed dashboards, configured jobs, and performed analyses.&lt;/p&gt;

&lt;p&gt;In the future, data platforms will increasingly be built for Agents. Agents will understand business questions, invoke data capabilities, generate analytical workflows, propose actions, and even participate directly in business processes.&lt;/p&gt;

&lt;p&gt;This is what truly struck me at this Summit.&lt;/p&gt;

&lt;p&gt;Snowflake is not simply adding an AI assistant on top of an existing Data Warehouse. It is using data to rebuild a new AI-native foundation for the Agentic Enterprise, which is also where OpenAI and Anthropic will ultimately compete.&lt;/p&gt;

&lt;p&gt;That is why I believe Snowflake's transformation is far more aggressive than I originally imagined.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 CoCo, CoWork, and Desktop: Snowflake Is "Paying Tribute" to Anthropic—And Revealing a Bigger Ambition
&lt;/h2&gt;

&lt;p&gt;If the first layer of change is strategic positioning, then the second layer is the product portfolio itself.&lt;/p&gt;

&lt;p&gt;At this year's Snowflake Summit, what impressed me most was not a traditional database feature or a performance improvement metric. Instead, it was the launch of an entire collection of AI Agent-centric products and components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoCo&lt;/li&gt;
&lt;li&gt;CoWork&lt;/li&gt;
&lt;li&gt;Desktop&lt;/li&gt;
&lt;li&gt;Skill Catalog&lt;/li&gt;
&lt;li&gt;VS Code Extension&lt;/li&gt;
&lt;li&gt;Excel Add-in&lt;/li&gt;
&lt;li&gt;MCP&lt;/li&gt;
&lt;li&gt;ACP&lt;/li&gt;
&lt;li&gt;Cloud Agents&lt;/li&gt;
&lt;li&gt;Agent Teams&lt;/li&gt;
&lt;li&gt;Automated Agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When viewed together, they send a very clear signal:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake is reorganizing its product strategy in the same way an AI-native company would.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In fact, I would even say it is "paying tribute" to Anthropic.&lt;/p&gt;

&lt;p&gt;Why do I say that?&lt;/p&gt;

&lt;p&gt;Because AI-native companies such as Anthropic are no longer just building chatbots. They are building complete AI work systems, including Claude, Claude Code, Desktop, MCP, Artifacts, Skills, Computer Use, enterprise context, and security boundaries.&lt;/p&gt;

&lt;p&gt;What they truly want to own is not merely a conversational interface, but the primary interface through which humans collaborate with software in the future.&lt;/p&gt;

&lt;p&gt;The CoCo, CoWork, Desktop, Skill Catalog, and MCP/ACP announcements from Snowflake have remarkably strong parallels.&lt;/p&gt;

&lt;p&gt;CoCo feels like Claude Code for the enterprise.&lt;/p&gt;

&lt;p&gt;CoWork resembles an AI workspace for business users.&lt;/p&gt;

&lt;p&gt;CoCo Desktop extends Snowflake's AI capabilities beyond the web console and into users' everyday work environments.&lt;/p&gt;

&lt;p&gt;Skill Catalog packages Snowflake platform capabilities into discoverable, composable, and reusable skills that Agents can invoke.&lt;/p&gt;

&lt;p&gt;So when I heard these announcements at the event, my first reaction was not:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Snowflake has released a few more AI features."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead, it was:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake wants to repackage the data platform as a complete Enterprise AI Agent Operating System and enter the same strategic battleground occupied by OpenAI Enterprise and Anthropic Enterprise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snowflake officially announced that Cortex Code has been renamed Snowflake CoCo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9zocxw1k1xgqyogdp61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9zocxw1k1xgqyogdp61.png" alt="image.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"From here on, no more Cortex Code. It is officially Snowflake CoCo."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This statement is worth paying attention to.&lt;/p&gt;

&lt;p&gt;The name Cortex Code still carried the feeling of being a coding assistant.&lt;/p&gt;

&lt;p&gt;CoCo, on the other hand, feels much more like a standalone AI product.&lt;/p&gt;

&lt;p&gt;Behind this rebranding is a larger ambition.&lt;/p&gt;

&lt;p&gt;Snowflake does not want CoCo to be merely an assistant that helps users write SQL, generate code, or explain syntax. It wants CoCo to become the AI operating interface for the entire Snowflake platform.&lt;/p&gt;

&lt;p&gt;Christian also mentioned during the keynote that over the past several months, CoCo has evolved beyond the CLI and Snowsight experiences and expanded into MCP, ACP, SDKs, Agent Teams, Cloud Agents, automation capabilities, and Skill Catalog.&lt;/p&gt;

&lt;p&gt;Among these, Skill Catalog is especially important. It enables users to share, discover, and reuse Skills. In essence, it is modularizing Snowflake platform capabilities and turning them into reusable tools for Agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is extremely important.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Snowflake also explicitly announced upcoming Excel add-ins, VS Code extensions, and Marketplace partner integrations for CoCo.&lt;/p&gt;

&lt;p&gt;During discussions at the event, many of us felt the Excel integration was particularly powerful because Excel remains the most familiar data workspace for business users.&lt;/p&gt;

&lt;p&gt;VS Code, meanwhile, remains the most familiar workspace for developers.&lt;/p&gt;

&lt;p&gt;Rather than forcing everyone into Snowsight, Snowflake is bringing CoCo directly into the environments where people already work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flic4mm7knqh37ji1ybv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flic4mm7knqh37ji1ybv6.png" alt="image.png" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is also one of the most important principles behind AI-native products:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do not force users to move into your interface. Bring your Agent into the user's workflow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Therefore, the significance of CoCo is not that Snowflake now has its own Copilot.&lt;/p&gt;

&lt;p&gt;The significance is that Snowflake is moving away from a traditional platform UI and toward an &lt;strong&gt;Agent Everywhere&lt;/strong&gt; strategy.&lt;/p&gt;

&lt;p&gt;Beyond CoCo, Snowflake also placed a major spotlight on CoWork at this Summit.&lt;/p&gt;

&lt;p&gt;To be honest, when I first heard about CoWork, I was a bit puzzled. If Anthropic were launching CoWork, I could easily understand it, because Agents naturally require enterprise-grade collaboration. But from the perspective of a traditional data platform, CoWork did not seem like the kind of product Snowflake would be expected to release. CoCo helping data engineers write SQL, fix pipelines, and build applications makes perfect sense. OpenFlow, Streaming, Iceberg, and Horizon Catalog are also clear enhancements to the data platform. But what does CoWork have to do with a data warehouse?&lt;/p&gt;

&lt;p&gt;After listening to the presentation, I gradually understood it. CoWork reveals Snowflake’s ambitions even more clearly. It is designed for business users, with the vision of enabling CEOs, sales teams, operations teams, marketers, and other business professionals to interact directly with enterprise data and gain insights as if they had their own personal Jarvis. Samsung shared a use case illustrating this idea: CoCo serves as the AI operating interface for data engineers and developers, while CoWork serves as the AI workspace for business users. Snowflake is not just trying to serve data teams; it wants to become part of the daily workflow of every business user across the enterprise.&lt;/p&gt;

&lt;p&gt;At that point, I finally understood CoWork’s role. CoCo is reshaping back-end data engineering, while CoWork is reshaping front-end business decision-making. Together, they enable Snowflake to evolve from a data platform into an enterprise AI work platform. CoWork may seem far removed from the traditional Snowflake, but in reality, it is perhaps the closest thing to Snowflake’s future.&lt;/p&gt;

&lt;p&gt;Building Agentic Enterprise Infrastructure—this is Snowflake’s true ambition. It also explains why I no longer see Snowflake as a traditional data company.&lt;/p&gt;

&lt;p&gt;When traditional data companies launch products, they typically talk about performance improvements, cost reductions, additional connectors, or stronger governance capabilities.&lt;/p&gt;

&lt;p&gt;Snowflake’s announcements this time felt much more like those of an AI company. It talked about Agents, Skills, Desktop, CoWork, natural language, business users, context, and security boundaries. In other words, Snowflake is repositioning itself from a Data Warehouse company into an Enterprise AI Platform company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This should serve as a wake-up call for every data software company.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If even Snowflake has realized that the entry point to future data platforms will shift from SQL, BI, notebooks, and pipelines toward Agents, Skills, Context, and Workflows, then companies like ours that focus on ETL, DataOps, Data Ingestion, and Orchestration must also rethink what our products should look like.&lt;/p&gt;

&lt;p&gt;This is not about a single product. It is a blueprint for how Snowflake is reorganizing its entire product portfolio for the AI era. CoCo, CoWork, Desktop, Skill Catalog, and MCP/ACP together reveal Snowflake’s new ambition: &lt;strong&gt;not just to manage data, but to become the entry point for enterprise AI.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3 AI Is Bringing Every Software Company Back to the Same Starting Line
&lt;/h2&gt;

&lt;p&gt;The second major impression I took away from this Summit is that we are all part of the same ecosystem, and AI is bringing every software company back to the same starting line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There was one moment that left a particularly deep impression on me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg1pafdthb1a5r5qy8ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg1pafdthb1a5r5qy8ut.png" alt="image.png" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Snowflake announced the Agentic Control Plane, or ACP. &lt;strong&gt;At that moment, I was genuinely shocked&lt;/strong&gt;, because just last month we launched our own ACP product. &lt;strong&gt;Wait a minute—isn’t that a direct collision? If a giant like Snowflake is entering this space, am I finished!?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faiajlgs9dtbyskuc2yrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faiajlgs9dtbyskuc2yrx.png" alt="image.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I listened more carefully, I realized the two products are not exactly the same. Snowflake’s ACP focuses more on Snowflake-native data modeling, Text-to-SQL, semantic layers, and enabling Agents to understand and interact with Snowflake data. What we focus on is ETL, orchestration, pipelines, data synchronization, job scheduling, and the execution and governance of heterogeneous data systems. In fact, I quickly added the words “Data Engineering” in front of our product name before calling it an Agent Control Plane.&lt;/p&gt;

&lt;p&gt;But the key point is not whether the two products are identical. The important thing is what this coincidence reveals: everyone is moving toward the same destination.&lt;/p&gt;

&lt;p&gt;That destination is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future software systems must become systems that Agents can understand, invoke, orchestrate, and govern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the past, the differences between software companies came from many factors: brand recognition, customer base, sales channels, engineering scale, ecosystem strength, delivery capability, and product maturity. Large enterprises had their advantages, while startups faced their own challenges. But with the arrival of AI, something fascinating has happened: every software product now needs to be rebuilt for AI.&lt;/p&gt;

&lt;p&gt;In the past, software interaction looked like this: people opened interfaces, clicked buttons, filled out forms, wrote SQL, reviewed logs, and handled exceptions.&lt;/p&gt;

&lt;p&gt;In the future, software interaction may look like this: people define objectives; Agents understand context, invoke tools, generate plans, execute tasks, and return results. Humans increasingly take on the roles of validation, supervision, judgment, decision-making, and correction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One of my strongest impressions from the Summit was this: when it comes to AI, every software company has been brought back to a new starting line.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because in the AI era, every software product must be rebuilt from the ground up.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is why we and Snowflake ended up launching similar categories of products at almost the same time. In the past, such a thing would have been difficult to imagine. Startups rarely released the same kind of products in parallel with large enterprises because the major players usually leveraged their vast resources to build everything first.&lt;/p&gt;

&lt;p&gt;Today, however, this creates an enormous opportunity for startups.&lt;/p&gt;

&lt;p&gt;In the past, competing directly with large enterprises on resources, branding, or customer scale was extremely difficult. But when AI begins to redefine software, large enterprises also carry historical baggage: legacy systems, legacy customers, legacy architectures, and legacy organizational processes. Startups, if they move quickly enough, can design products from day one with an Agent-native mindset.&lt;/p&gt;

&lt;p&gt;That is what encouraged me most about attending Snowflake Summit. The kinds of products Snowflake announced this month are remarkably similar to some of the directions we ourselves explored last month. The scale, scenarios, and depth may differ, but the overlap suggests that our reading of where the industry is heading closely matches Snowflake's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the AI era, opportunities do not belong only to large enterprises. They also belong to entrepreneurs who can quickly recognize change and are willing to reinvent their products.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4 Looking Back at Ourselves Through Snowflake: How Do We Say “Goodbye Data, Hello AI”?
&lt;/h2&gt;

&lt;p&gt;The biggest question Snowflake Summit left me with was not what Snowflake will become, but what we ourselves should become.&lt;/p&gt;

&lt;p&gt;If Snowflake is already embracing &lt;strong&gt;Make AI Real for Business&lt;/strong&gt;, then how should we, in turn, embrace Goodbye Data, Hello AI?&lt;/p&gt;

&lt;p&gt;For years, we have worked in DataOps, ETL, Data Ingestion, Orchestration, and Pipelines. At their core, these disciplines are about managing the flow of data. We help customers move data from one system to another, schedule tasks based on dependencies, monitor failed jobs, stabilize data pipelines, and connect heterogeneous systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All of those things remain important. But once the AI era arrives, software itself is no longer the end goal.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the past, we dealt with structured data, semi-structured data, files, logs, tables, fields, tasks, and workflows. In the future, we will also need to handle Knowledge, Context, Semantics, Business Rules, Lineage, Execution Memory, and Agent Actions.&lt;/p&gt;

&lt;p&gt;When data is no longer just rows and columns inside tables, and no longer merely moving from a source to a destination, it becomes something far more important. Data becomes the context through which AI understands a business. It becomes the foundation upon which Agents take action. It becomes the fuel that powers enterprise automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake’s answer is to evolve from a Data Warehouse into an AI Data Platform.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our answer is to evolve from a DataOps tool into a Data Engineering Harness for the AI era.&lt;/p&gt;

&lt;p&gt;When people use Claude Code or Codex today, they are primarily working in Java or Python development environments. But the environment of a Data Engineer is fundamentally different. The business semantics are more complex, and the workflows are more complicated.&lt;/p&gt;

&lt;p&gt;Snowflake’s CoCo is essentially a Data Warehouse Agent. But orchestration and data ingestion are not Snowflake’s core strengths. What data engineers truly need is a Data Engineering Harness that spans systems, databases, schedulers, and environments. In practice, this manifests as an Agentic Data Control Plane designed specifically for data engineers.&lt;/p&gt;

&lt;p&gt;That may very well be the opportunity for WhaleOps.&lt;/p&gt;

&lt;p&gt;There was a statement from Thomson Reuters at Snowflake Summit that left a deep impression on me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“They can’t be wrong.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The quote referred to professionals in legal, tax, and audit industries, where errors are simply unacceptable. The same principle applies to data engineering. That is why a Data Engineering Harness is inherently more complex than harnesses in many other domains.&lt;/p&gt;

&lt;p&gt;Enterprise AI is not a toy.&lt;/p&gt;

&lt;p&gt;The data tasks generated by Agents cannot merely &lt;em&gt;look&lt;/em&gt; correct—they must actually be correct.&lt;/p&gt;

&lt;p&gt;The analyses produced by Agents cannot simply sound convincing—they must be grounded in trustworthy data.&lt;/p&gt;

&lt;p&gt;The data workflows executed by Agents cannot merely be automated—they must be governable, auditable, and reversible.&lt;/p&gt;

&lt;p&gt;That is why I believe this represents our opportunity in the AI era—not simply to become smarter, but to become more trustworthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 My Final Prediction: Snowflake Is Competing for the AI Entry Point, and If It Succeeds, Its Stock Could Rise Far Beyond 2×
&lt;/h2&gt;

&lt;p&gt;Looking back on this Snowflake Summit, my biggest takeaway was not any single product announcement, but a much larger signal about the software industry:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI is reshaping the entry points, forms, and value propositions of all software.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake is competing for the AI entry point. That is why it has set its sights on Anthropic as a competitor and is evolving from a Data Warehouse into an AI Data Platform.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the future world of Data + AI, who will control the entry point—the owners of data, or the owners of AI?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My own view is that data is difficult to move, while AI platforms are relatively easy to switch.&lt;/p&gt;

&lt;p&gt;As many people know, Snowflake’s stock price has roughly doubled over the past month. Personally, I believe that if Snowflake’s AI-entry-point strategy succeeds, its future upside will be far greater than the 2× gain we have seen over the past month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vvzdg7bnpxrjipbi4u6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vvzdg7bnpxrjipbi4u6.png" alt="image.png" width="799" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rethink what is possible,&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The future of Snowflake AI looks incredibly promising.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(The views shared above are solely my personal opinions, do not represent any official position, and should not be considered investment advice.)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>data</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Apache SeaTunnel Monthly Engineering Digest (May 2026): 87 PRs, Stronger Connectors, Smarter Recovery</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 11 Jun 2026 10:22:25 +0000</pubDate>
      <link>https://dev.to/seatunnel/apache-seatunnel-monthly-engineering-digest-may-2026-87-prs-stronger-connectors-smarter-43kd</link>
      <guid>https://dev.to/seatunnel/apache-seatunnel-monthly-engineering-digest-may-2026-87-prs-stronger-connectors-smarter-43kd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmcgu6z4a1dg7lcudg3g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmcgu6z4a1dg7lcudg3g.jpg" width="799" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hello SeaTunnel community!&lt;/p&gt;

&lt;p&gt;The Apache SeaTunnel May 2026 Monthly Report has finally arrived.&lt;/p&gt;

&lt;p&gt;According to community statistics, a total of 87 pull requests were merged into the &lt;code&gt;apache/seatunnel&lt;/code&gt; repository during May 2026. This month’s primary focus was on continuously improving Connector-V2, closing functional gaps, and making connectors truly production-ready. Significant efforts were also invested in the Zeta Engine, with enhancements across high availability, fault recovery, observability, and testing. At the same time, the community strengthened CI security and regression testing to ensure efficient and reliable development on the main branch.&lt;/p&gt;

&lt;p&gt;This report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A comprehensive review of all merged PRs, covering new features, performance improvements, bug fixes, and architectural enhancements (including the complete PR list).&lt;/li&gt;
&lt;li&gt;In-depth analysis of key technical changes, their implementation details, and impact scope (with patch-level code snippets).&lt;/li&gt;
&lt;li&gt;Reproducible methodologies for performance and stability validation (without fabricated benchmark results).&lt;/li&gt;
&lt;li&gt;Insights into project evolution trends and future technical directions.&lt;/li&gt;
&lt;li&gt;A complete list of all contributors who submitted merged PRs in May 2026, including GitHub usernames, contribution categories, and rankings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Overall Snapshot
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Four-Dimensional Statistics (87 PRs)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6apzpvshev4qo232hwh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6apzpvshev4qo232hwh.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data clearly shows that May was a month focused on making existing capabilities robust and production-ready. A large portion of the merged work addressed real-world operational challenges, including HA, recovery mechanisms, edge cases, resource and memory risks, and observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Module Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;seatunnel-connectors-v2: 32&lt;/li&gt;
&lt;li&gt;seatunnel-engine: 21&lt;/li&gt;
&lt;li&gt;seatunnel-connectors-v2/connector-cdc: 8&lt;/li&gt;
&lt;li&gt;seatunnel-e2e: 6&lt;/li&gt;
&lt;li&gt;docs: 3&lt;/li&gt;
&lt;li&gt;other: 17&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Feature Enhancements and Engineering Evolution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Connector-V2: HTTP Source Now Supports Binary Downloads (#10956)
&lt;/h3&gt;

&lt;p&gt;This is one of the most user-visible connector enhancements introduced this month. HTTP Source has evolved from simply fetching JSON and text payloads to supporting full file and binary downloads.&lt;/p&gt;

&lt;p&gt;The PR patch summary confirms the scope of the change (15 files changed, +758/-17):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added &lt;code&gt;format = "binary"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Added &lt;code&gt;binary_chunk_size&lt;/code&gt; (default: 10 MB) for large-file chunking&lt;/li&gt;
&lt;li&gt;Introduced a fixed output schema: &lt;code&gt;(data: bytes, relativePath: string, partIndex: long)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Limited to BATCH mode&lt;/li&gt;
&lt;li&gt;Added comprehensive unit tests, E2E tests, and documentation in both Chinese and English&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Official example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BATCH"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;Http&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://example.com/files/report.pdf"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binary"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;binary_chunk_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10485760&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;fields&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;bytes&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;relativePath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;string&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;partIndex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;long&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;LocalFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/tmp/download"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;file_format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binary"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Impact and upgrade recommendations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing jobs remain unaffected as long as &lt;code&gt;format=text&lt;/code&gt; continues to be used.&lt;/li&gt;
&lt;li&gt;After upgrading, users should validate that downstream transforms and sinks properly handle &lt;code&gt;bytes&lt;/code&gt; fields.&lt;/li&gt;
&lt;li&gt;For large files, &lt;code&gt;binary_chunk_size&lt;/code&gt; reduces peak memory usage from file-size scale to chunk-size scale, significantly lowering OOM risk. This can be validated through RSS and GC metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Engine/Zeta: Progressive Validation for Dry-Run Mode (#10763)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;[Feature][Zeta] Implement proper dry-run mode with progressive validation layer0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Engineering value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration errors are detected before job submission or startup rather than during runtime.&lt;/li&gt;
&lt;li&gt;Provides a foundation for future progressive validation layers, including connector parameter validation, schema constraint verification, and permission/resource prechecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.3 Ecosystem Expansion and Tooling Improvements
&lt;/h3&gt;

&lt;p&gt;This month also saw progress in control-plane capabilities, observability, edge collection, and developer tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#10878: STIP-24 Phase 1 EdgeSocket ingress (Edge Collection MVP)&lt;/li&gt;
&lt;li&gt;#10491: Data lineage and performance tracing&lt;/li&gt;
&lt;li&gt;#10184: Python SDK Client for SeaTunnel REST API&lt;/li&gt;
&lt;li&gt;Multi-table and schema evolution support for RabbitMQ, Cassandra, SQL Server CDC, Postgres CDC, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Performance Optimization and Resource Risk Mitigation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Kudu: Dependency Upgrade Resolves Flink 1.15+ Compatibility Risks (#10974)
&lt;/h3&gt;

&lt;p&gt;This PR primarily improves compatibility and runtime reliability, although it was categorized as a performance improvement.&lt;/p&gt;

&lt;p&gt;Implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upgraded &lt;code&gt;kudu-client&lt;/code&gt; to reduce classpath conflicts with Flink 1.15+.&lt;/li&gt;
&lt;li&gt;Removed explicit dependencies on Kudu’s shaded Guava libraries to avoid runtime issues caused by internal shaded API changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suggested reproducible validation:&lt;/p&gt;

&lt;p&gt;Run identical Kudu jobs on Flink 1.15+ and compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job startup success rate&lt;/li&gt;
&lt;li&gt;Startup latency percentiles (P50/P95)&lt;/li&gt;
&lt;li&gt;Stability during 30-minute execution windows&lt;/li&gt;
&lt;li&gt;Presence of slot allocation or release loops&lt;/li&gt;
&lt;/ul&gt;
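&lt;p&gt;The latency percentiles in the list above can be computed with a few lines of Python. This is a minimal sketch using the nearest-rank method; the helper name &lt;code&gt;percentile&lt;/code&gt; and the sample latencies are made up for illustration and are not measurements from the PR.&lt;/p&gt;

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample covering pct percent of the data."""
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical startup latencies (seconds) from ten runs of the same job.
startup_seconds = [4.1, 3.9, 4.4, 12.0, 4.0, 4.2, 3.8, 4.3, 4.1, 4.5]

p50 = percentile(startup_seconds, 50)
p95 = percentile(startup_seconds, 95)
print("P50:", p50, "P95:", p95)
```

&lt;p&gt;Reporting P95 next to P50 exposes the occasional slow start that a median alone would hide.&lt;/p&gt;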

&lt;h3&gt;
  
  
  3.2 Kafka: Reduce Default Cache Queue Size from 1024 to 2 to Prevent OOM (#10954)
&lt;/h3&gt;

&lt;p&gt;The PR title explicitly states:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Reduce default reader_cache_queue_size from 1024 to 2 to prevent OOM&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Quantitative interpretation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queue depth reduced by 512x (1024 → 2).&lt;/li&gt;
&lt;li&gt;For workloads with large cached records (such as deserialized rows), this significantly reduces peak memory usage and GC pressure.&lt;/li&gt;
&lt;/ul&gt;
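&lt;p&gt;A rough back-of-the-envelope estimate shows why the queue depth matters. The sketch below assumes a full queue of equally sized entries and an average entry size of 8 MiB; both are illustrative assumptions, since the real footprint depends on what each queue entry holds in a given job.&lt;/p&gt;

```python
MIB = 1024 * 1024


def peak_queue_bytes(queue_size, avg_entry_bytes):
    """Upper bound on bytes held by a full queue of equally sized entries."""
    return queue_size * avg_entry_bytes


# Assumption for illustration only: each cached entry averages 8 MiB.
old_peak = peak_queue_bytes(1024, 8 * MIB)
new_peak = peak_queue_bytes(2, 8 * MIB)

print(old_peak // MIB, "MiB before,", new_peak // MIB, "MiB after")
```

&lt;p&gt;Under these assumptions the worst-case buffer drops from 8192 MiB to 16 MiB per queue, the same 512x ratio as the queue depth itself.&lt;/p&gt;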

&lt;p&gt;Recommended metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak RSS memory&lt;/li&gt;
&lt;li&gt;GC frequency&lt;/li&gt;
&lt;li&gt;Young/Old GC duration&lt;/li&gt;
&lt;li&gt;OOM occurrence rate&lt;/li&gt;
&lt;li&gt;Throughput (rows/sec)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Bug Fixes and Architectural Improvements That Defined May
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Zeta: Master Failover Could Permanently Stall Jobs (#10836)
&lt;/h3&gt;

&lt;p&gt;The PR title already highlights the affected scenarios:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;affects BATCH / bounded source / job shutdown phase&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Potential production impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs that should terminate may remain indefinitely stuck.&lt;/li&gt;
&lt;li&gt;Increased risk of resource leakage.&lt;/li&gt;
&lt;li&gt;Potential duplicate scheduling.&lt;/li&gt;
&lt;li&gt;Higher risk of inconsistent downstream results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended validation:&lt;/p&gt;

&lt;p&gt;After simulating master failover or node failure, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs complete the shutdown phase correctly.&lt;/li&gt;
&lt;li&gt;Final job states remain consistent.&lt;/li&gt;
&lt;li&gt;State cleanup does not suffer from race conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.2 Security Fix: Path Traversal Vulnerability in Log REST API (#10628)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;[Fix][Zeta] Fix path traversal vulnerability in log file REST API endpoints&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Operational recommendations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply access controls at gateway or ingress layers.&lt;/li&gt;
&lt;li&gt;Restrict log API access to authorized users only.&lt;/li&gt;
&lt;li&gt;Validate the fix using security scans and path traversal attempts such as &lt;code&gt;../&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
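&lt;p&gt;For readers writing their own traversal tests, the check that such a fix has to enforce can be sketched as follows. This is one common approach (resolve the requested path, then confirm it stays under the log directory) and is not necessarily how the PR implements it; the function name &lt;code&gt;is_within&lt;/code&gt; and the directory &lt;code&gt;/var/log/seatunnel&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import os


def is_within(base_dir, requested):
    """Return True only if the requested path resolves inside base_dir."""
    base = os.path.realpath(base_dir)
    # Joining and resolving collapses "../" segments and follows symlinks.
    target = os.path.realpath(os.path.join(base, requested))
    return os.path.commonpath([base, target]) == base


# Hypothetical log directory used only for this example.
LOG_DIR = "/var/log/seatunnel"

print(is_within(LOG_DIR, "job-1.log"))
print(is_within(LOG_DIR, "../../etc/passwd"))
```

&lt;p&gt;Comparing whole path components, rather than string prefixes, also rejects sibling directories whose names merely start with the same characters.&lt;/p&gt;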

&lt;h3&gt;
  
  
  4.3 Recovery and Race Condition Improvements (#10687 / #10842 / #10877)
&lt;/h3&gt;

&lt;p&gt;Collectively, these fixes significantly strengthened Zeta’s recovery workflows, particularly around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State cleanup after node failures&lt;/li&gt;
&lt;li&gt;Restore process consistency&lt;/li&gt;
&lt;li&gt;Reuse of recovery metadata&lt;/li&gt;
&lt;li&gt;Elimination of race conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Deep Dive into Key Technical Changes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 HTTP Binary Download Support (#10956): Design, Semantics, and Impact
&lt;/h3&gt;

&lt;p&gt;Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;format=binary&lt;/code&gt; treats the HTTP response body as raw bytes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;binary_chunk_size&lt;/code&gt; controls chunking for large files.&lt;/li&gt;
&lt;li&gt;Fixed output schemas simplify downstream sink implementations.&lt;/li&gt;
&lt;li&gt;Batch-only mode avoids ambiguity with streaming download semantics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;User value:&lt;/p&gt;

&lt;p&gt;HTTP Source can now fetch both API data and files, reducing pipeline complexity.&lt;/p&gt;

&lt;p&gt;Risks and recommendations:&lt;/p&gt;

&lt;p&gt;After chunking, downstream sinks must correctly reconstruct files using the combination of &lt;code&gt;relativePath&lt;/code&gt; and &lt;code&gt;partIndex&lt;/code&gt;. The official binary file sink or sinks supporting byte-stream writes are strongly recommended.&lt;/p&gt;
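&lt;p&gt;The reassembly requirement can be sketched in a few lines of Python. This is an illustration only, not the logic of any SeaTunnel sink: it assumes each row arrives as a &lt;code&gt;(data, relativePath, partIndex)&lt;/code&gt; tuple matching the schema fields, that all chunks of a file fit in memory, and the function name &lt;code&gt;reassemble&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def reassemble(rows):
    """Group chunks by relativePath and join them in partIndex order."""
    parts = defaultdict(list)
    for data, relative_path, part_index in rows:
        parts[relative_path].append((part_index, data))
    files = {}
    for path, chunks in parts.items():
        ordered = sorted(chunks, key=lambda item: item[0])
        files[path] = b"".join(chunk for _, chunk in ordered)
    return files


# Sample rows arriving out of order (made-up data for illustration).
rows = [
    (b"lo", "a.bin", 1),
    (b"hel", "a.bin", 0),
    (b"x", "b.bin", 0),
]
files = reassemble(rows)
print(files)
```

&lt;p&gt;The sketch only relies on the relative order of &lt;code&gt;partIndex&lt;/code&gt; values, so it makes no assumption about where the numbering starts.&lt;/p&gt;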

&lt;h3&gt;
  
  
  5.2 SQLite Upsert Syntax Fix (#10880): Why &lt;code&gt;EXCLUDED&lt;/code&gt; Matters
&lt;/h3&gt;

&lt;p&gt;SQLite's UPSERT syntax requires conflicting updates to reference the incoming row as &lt;code&gt;EXCLUDED.&amp;lt;column&amp;gt;&lt;/code&gt; rather than through the MySQL-style &lt;code&gt;VALUES(&amp;lt;column&amp;gt;)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The PR updates the generated SQL as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- `col`=VALUES(`col`)
&lt;/span&gt;&lt;span class="gi"&gt;+ `col`=EXCLUDED.`col`
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unit tests were added to ensure correctness and prevent regressions.&lt;/p&gt;
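&lt;p&gt;The &lt;code&gt;EXCLUDED&lt;/code&gt; form is easy to try with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The sketch below uses an in-memory database and a hypothetical table &lt;code&gt;t&lt;/code&gt;; it illustrates the SQL syntax only, not the statement SeaTunnel generates, and it needs a bundled SQLite of 3.24.0 or newer, the release that introduced UPSERT.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col TEXT)")

# EXCLUDED refers to the row that the INSERT attempted to write.
upsert = (
    "INSERT INTO t (id, col) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET col = EXCLUDED.col"
)
conn.execute(upsert, (1, "first"))
conn.execute(upsert, (1, "second"))  # same key, so the row is updated

rows = conn.execute("SELECT id, col FROM t").fetchall()
print(rows)
```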

&lt;h3&gt;
  
  
  5.3 Milvus Exception Reference Fix (#10975): A Small Change with Large Operational Value
&lt;/h3&gt;

&lt;p&gt;A single-line change delivers a significant troubleshooting improvement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- loadStateResponse.getException()
&lt;/span&gt;&lt;span class="gi"&gt;+ queryResultsR.getException()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the fix, error messages accurately reflect the actual failure source, reducing debugging effort and avoiding misleading diagnostics.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 Zeta Flaky Test Fix (#10891): Eliminating Timing Dependencies
&lt;/h3&gt;

&lt;p&gt;The patch replaces real job submissions and asynchronous waiting with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock servers&lt;/li&gt;
&lt;li&gt;Mock JobHistoryService&lt;/li&gt;
&lt;li&gt;Constructed pendingJobDAGInfo objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This removes non-deterministic timing dependencies and significantly improves CI stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Reproducible Framework for Performance and Stability Validation
&lt;/h2&gt;

&lt;p&gt;Many of May’s PRs focused on compatibility improvements, stability enhancements, and safer default configurations rather than benchmark-oriented optimizations.&lt;/p&gt;

&lt;p&gt;To ensure objective validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default parameter changes: measure RSS, GC, OOM rate, and throughput.&lt;/li&gt;
&lt;li&gt;Chunking strategies: measure peak memory, end-to-end latency, and output consistency.&lt;/li&gt;
&lt;li&gt;Dependency upgrades: measure startup success rate, startup latency percentiles, and long-term stability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended validation workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fix dataset size and parallelism.&lt;/li&gt;
&lt;li&gt;Fix hardware and network environments.&lt;/li&gt;
&lt;li&gt;Define consistent metrics.&lt;/li&gt;
&lt;li&gt;Change one variable at a time: compare only the version before the change with the version after it.&lt;/li&gt;
&lt;li&gt;Preserve all logs and metric outputs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Thanks to All Contributors
&lt;/h2&gt;

&lt;p&gt;Apache SeaTunnel’s continuous growth and evolution would not be possible without every community member who contributes code, reviews, documentation, and ideas.&lt;/p&gt;

&lt;p&gt;Thank you all for your dedication and generosity. Every line of code helps move the project forward and strengthens the ecosystem.&lt;/p&gt;

&lt;p&gt;The following contributors made merged contributions during May 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;GitHub Username&lt;/th&gt;
&lt;th&gt;Merged PRs&lt;/th&gt;
&lt;th&gt;Major Contribution Categories&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;nzw921rx&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;Features ×2; Performance ×1; Bug Fixes ×6; Architecture ×9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;zhangshenghang&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Features ×0; Performance ×0; Bug Fixes ×8; Architecture ×4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;DanielLeens&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Features ×0; Performance ×0; Bug Fixes ×1; Architecture ×7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;CosmosNi&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Features ×1; Performance ×0; Bug Fixes ×6; Architecture ×1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;davidzollo&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Features ×1; Performance ×1; Bug Fixes ×1; Architecture ×2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;JeremyXin&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Features ×0; Performance ×0; Bug Fixes ×0; Architecture ×4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;CloverDew&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Features ×1; Performance ×0; Bug Fixes ×2; Architecture ×0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;zhiliang-wu&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Features ×1; Performance ×0; Bug Fixes ×0; Architecture ×1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;yzeng1618&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Features ×1; Performance ×0; Bug Fixes ×1; Architecture ×0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;QuakeWang&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Features ×0; Performance ×0; Bug Fixes ×2; Architecture ×0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(The ranking is based on the number of merged PRs during May. Contribution categories are aggregated using the four-quadrant classification adopted in this report.)&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>software</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Save Your Seat: How HOPEGOO Built a Unified Multimodal Data Platform with Apache SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:41:17 +0000</pubDate>
      <link>https://dev.to/seatunnel/save-your-seat-how-hopegoo-built-a-unified-multimodal-data-platform-with-apache-seatunnel-27cb</link>
      <guid>https://dev.to/seatunnel/save-your-seat-how-hopegoo-built-a-unified-multimodal-data-platform-with-apache-seatunnel-27cb</guid>
      <description>&lt;p&gt;As generative AI continues to evolve, enterprises are placing new demands on their data infrastructure. Beyond traditional structured data, multimodal data—including text, images, audio, and other content types—is growing rapidly. As data types become more diverse and data pipelines increasingly complex, building a unified, efficient, and easily governed data pipeline platform has become a key challenge for many organizations.&lt;/p&gt;

&lt;p&gt;At the upcoming Apache SeaTunnel June Meetup, we are excited to welcome Xiaocheng Zhou, Data Development Engineer at HOPEGOO and Apache SeaTunnel Committer, who will present a session titled &lt;strong&gt;"Architecture and Practices for a Unified Multimodal Data Pipeline Platform."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drawing from HOPEGOO’s real-world experience, this talk will explore how the company leveraged Apache SeaTunnel to consolidate data ingestion channels, build a unified batch-and-stream data platform, and gain valuable insights throughout the platform modernization journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Featured Speaker
&lt;/h2&gt;

&lt;p&gt;Xiaocheng Zhou currently works at HOPEGOO, where he focuses on data platform development and operations. He is also an Apache SeaTunnel Committer and has been actively contributing to the SeaTunnel community and its ongoing evolution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4fgaoafun91hszvz1qc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4fgaoafun91hszvz1qc.jpg" alt="Xiaocheng Zhou" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background and Session Overview
&lt;/h2&gt;

&lt;p&gt;As business scale continued to expand, HOPEGOO gradually accumulated multiple data synchronization systems, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offline data synchronization services&lt;/li&gt;
&lt;li&gt;Real-time lake ingestion pipelines&lt;/li&gt;
&lt;li&gt;Legacy Sqoop-based synchronization jobs&lt;/li&gt;
&lt;li&gt;An early-generation SeaTunnel platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these systems effectively supported business requirements at different stages of growth, the coexistence of multiple synchronization platforms eventually introduced several challenges as data volumes and use cases continued to grow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fragmented data ingestion entry points&lt;/li&gt;
&lt;li&gt;Increased maintenance costs caused by multiple technology stacks&lt;/li&gt;
&lt;li&gt;Difficulty establishing unified data governance&lt;/li&gt;
&lt;li&gt;Reduced efficiency when onboarding new business scenarios&lt;/li&gt;
&lt;li&gt;Challenges in standardizing platform capabilities across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the same time, the rise of generative AI applications has significantly increased demand for multimodal data processing, creating new challenges for traditional data integration architectures.&lt;/p&gt;

&lt;p&gt;Against this backdrop, HOPEGOO began exploring the construction of a unified data pipeline platform and ultimately selected Apache SeaTunnel as its core data integration engine, driving the consolidation and modernization of its data synchronization ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event Information
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Topic: Architecture and Practices for a Unified Multimodal Data Pipeline Platform&lt;/li&gt;
&lt;li&gt;Date &amp;amp; Time: June 23, 2026, 14:00–15:00 (UTC+8)&lt;/li&gt;
&lt;li&gt;Live Streaming: &lt;a href="https://meeting.tencent.com/dm/VABnzAOyh8Yx" rel="noopener noreferrer"&gt;https://meeting.tencent.com/dm/VABnzAOyh8Yx&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Community Giveaways
&lt;/h2&gt;

&lt;p&gt;As with our previous events, we have prepared exclusive Apache SeaTunnel community gifts for online attendees. Join the live session for a chance to win exciting prizes and community swag!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60i5e8jzm492gakweahq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60i5e8jzm492gakweahq.png" alt=" " width="350" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reserve Your Spot Today
&lt;/h2&gt;

&lt;p&gt;From managing multiple independent synchronization systems to building a unified data pipeline platform, and ultimately preparing data infrastructure for the multimodal AI era, HOPEGOO's journey offers valuable lessons for organizations facing similar challenges.&lt;/p&gt;

&lt;p&gt;If you're interested in data integration platforms, unified batch-and-stream architectures, lakehouse implementations, or enterprise adoption stories of Apache SeaTunnel, we invite you to &lt;strong&gt;reserve your spot now!&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://meeting.tencent.com/dm/VABnzAOyh8Yx" rel="noopener noreferrer"&gt;https://meeting.tencent.com/dm/VABnzAOyh8Yx&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Join this Meetup to connect with community contributors and industry practitioners!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>apacheseatunnel</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Demo: Full Data Synchronization from MySQL CDC to PostgreSQL with Apache SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:24:06 +0000</pubDate>
      <link>https://dev.to/seatunnel/demo-full-data-synchronization-from-mysql-cdc-to-postgresql-with-apache-seatunnel-2gfb</link>
      <guid>https://dev.to/seatunnel/demo-full-data-synchronization-from-mysql-cdc-to-postgresql-with-apache-seatunnel-2gfb</guid>
      <description>&lt;p&gt;This article provides a detailed walkthrough on using &lt;strong&gt;Apache SeaTunnel 2.3.9&lt;/strong&gt; to perform &lt;strong&gt;full data synchronization from MySQL CDC to PostgreSQL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the complete demonstration, please refer to the video:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo: Synchronizing Data from MySQL CDC to PostgreSQL with Apache SeaTunnel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/rpzm1AkdIqw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Without further ado, let's dive into the MySQL-to-PostgreSQL synchronization scenario.&lt;/p&gt;

&lt;h1&gt;
  
  
  Version Requirements
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;MySQL 8.3&lt;/li&gt;
&lt;li&gt;PostgreSQL 13.2&lt;/li&gt;
&lt;li&gt;Apache SeaTunnel 2.3.9&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All configuration files used in this article are available by replying with the keyword &lt;strong&gt;"Demo 01"&lt;/strong&gt; on our WeChat official account.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prerequisites
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Verify Version Information
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check version information&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Enable Replication
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check replication-related configurations&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;variable_name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="s1"&gt;'log_bin'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;'binlog_format'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;'binlog_row_image'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;'gtid_mode'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;'enforce_gtid_consistency'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MySQL CDC synchronization works by reading the MySQL &lt;strong&gt;binlog&lt;/strong&gt;. SeaTunnel cluster nodes connect to MySQL as replicas within its replication architecture.&lt;/p&gt;

&lt;p&gt;Therefore, before configuring CDC synchronization, you must verify that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binary logging (binlog) is enabled.&lt;/li&gt;
&lt;li&gt;GTID-based replication is enabled (&lt;code&gt;gtid_mode&lt;/code&gt; and &lt;code&gt;enforce_gtid_consistency&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: For MySQL 8.0 and later versions, binlog is enabled by default. However, replication-related settings still need to be configured manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable replication (execute sequentially)&lt;/span&gt;
&lt;span class="c1"&gt;-- SET GLOBAL gtid_mode=OFF;&lt;/span&gt;
&lt;span class="c1"&gt;-- SET GLOBAL enforce_gtid_consistency=OFF;&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;gtid_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OFF_PERMISSIVE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;gtid_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ON_PERMISSIVE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;enforce_gtid_consistency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="n"&gt;gtid_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
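&lt;p&gt;Once the settings are applied, the output of the &lt;code&gt;SHOW VARIABLES&lt;/code&gt; query can be checked mechanically. The Python sketch below compares a dictionary of variable values against expected ones. The expected values for &lt;code&gt;binlog_format&lt;/code&gt; and &lt;code&gt;binlog_row_image&lt;/code&gt; are assumptions based on common MySQL CDC requirements and should be confirmed against the connector documentation for your version; the function name and the sample values are hypothetical.&lt;/p&gt;

```python
# Expected values are assumptions; confirm them in the connector documentation.
EXPECTED = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
    "gtid_mode": "ON",
    "enforce_gtid_consistency": "ON",
}


def find_mismatches(variables):
    """Return {name: (expected, actual)} for every setting that differs."""
    mismatches = {}
    for name, expected in EXPECTED.items():
        actual = str(variables.get(name, "MISSING")).upper()
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Sample values as SHOW VARIABLES might return them (not real output).
sample = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
    "gtid_mode": "OFF",
    "enforce_gtid_consistency": "OFF",
}
print(find_mismatches(sample))
```

&lt;p&gt;An empty result means every checked variable already has the expected value.&lt;/p&gt;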



&lt;h2&gt;
  
  
  Configure User Permissions
&lt;/h2&gt;

&lt;p&gt;The CDC user must have replication privileges.&lt;/p&gt;

&lt;p&gt;The two essential permissions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REPLICATION SLAVE&lt;/li&gt;
&lt;li&gt;REPLICATION CLIENT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After granting permissions, refresh privileges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'test'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'%'&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;RELOAD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;REPLICATION&lt;/span&gt; &lt;span class="n"&gt;SLAVE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;REPLICATION&lt;/span&gt; &lt;span class="n"&gt;CLIENT&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'test'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  SeaTunnel Cluster Configuration
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Cluster Logging
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Independent Log Files for Each Job
&lt;/h3&gt;

&lt;p&gt;Configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log4j2.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, SeaTunnel writes the logs of all jobs to a single shared log file.&lt;/p&gt;

&lt;p&gt;However, in production environments, job management is typically performed on a per-job basis. Therefore, it is recommended to configure independent log files for each job.&lt;/p&gt;

&lt;p&gt;This approach provides several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier monitoring&lt;/li&gt;
&lt;li&gt;Faster troubleshooting&lt;/li&gt;
&lt;li&gt;Better operational visibility&lt;/li&gt;
&lt;li&gt;More efficient job-level management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modify the configuration as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;############################ log output to file #############################
# rootLogger.appenderRef.file.ref = fileAppender
&lt;/span&gt;
&lt;span class="c"&gt;# Output logs into independent files for each job
&lt;/span&gt;&lt;span class="py"&gt;rootLogger.appenderRef.file.ref&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;routingAppender&lt;/span&gt;

&lt;span class="c"&gt;############################ log output to file #############################
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Client Configuration
&lt;/h2&gt;

&lt;p&gt;In production environments, SeaTunnel is typically installed under the &lt;code&gt;/opt&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;It is recommended to point &lt;code&gt;SEATUNNEL_HOME&lt;/code&gt; to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/opt/seatunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If multiple versions are installed or the installation path differs, it is recommended to create a symbolic link so that client and server environments remain consistent.&lt;/p&gt;

&lt;p&gt;This helps avoid classpath-related issues and missing dependency errors.&lt;/p&gt;

&lt;p&gt;SeaTunnel's submission scripts reference client-side environment variables, including classpaths and installation directories, using absolute paths.&lt;/p&gt;

&lt;p&gt;If client and server configurations differ, job submission may fail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create symbolic link&lt;/span&gt;
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /opt/apache-seatunnel-2.3.9 /opt/seatunnel

&lt;span class="c"&gt;# Configure environment variable&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SEATUNNEL_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/seatunnel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Environment Variable Configuration
&lt;/h2&gt;

&lt;p&gt;For Linux servers, it is recommended to configure environment variables in the standard system-wide way, by placing a script under:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/etc/profile.d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export SEATUNNEL_HOME=/opt/seatunnel'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/profile.d/seatunnel.sh

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH=$SEATUNNEL_HOME/bin:$PATH'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/profile.d/seatunnel.sh

&lt;span class="nb"&gt;source&lt;/span&gt; /etc/profile.d/seatunnel.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Job Configuration
&lt;/h2&gt;

&lt;p&gt;The following examples do not cover every available option. Instead, they focus on commonly used production configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STREAMING"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DEMO"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;checkpoint.interval&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;checkpoint.timeout&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;job.retry.times&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.retry.interval.seconds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's start with the &lt;code&gt;env&lt;/code&gt; section.&lt;/p&gt;

&lt;p&gt;Because CDC synchronization is a streaming workload, the job mode must be configured as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STREAMING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, configure the job name.&lt;/p&gt;

&lt;p&gt;In production environments, it is recommended to use meaningful naming conventions based on database names or table names. This makes job identification and management much easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallelism
&lt;/h3&gt;

&lt;p&gt;The example uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;parallelism = 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The optimal parallelism value depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster size&lt;/li&gt;
&lt;li&gt;Available resources&lt;/li&gt;
&lt;li&gt;Database performance characteristics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Checkpoints
&lt;/h3&gt;

&lt;p&gt;The checkpoint interval is given in milliseconds, so &lt;code&gt;checkpoint.interval = 30000&lt;/code&gt; corresponds to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;30 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If lower recovery latency is required, the interval can be reduced to 10 seconds or even lower.&lt;/p&gt;

&lt;p&gt;The checkpoint timeout is set to the same value, &lt;code&gt;checkpoint.timeout = 30000&lt;/code&gt;, which is also:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;30 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a checkpoint does not complete within this window, it times out and the job is treated as failed.&lt;/p&gt;
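
&lt;p&gt;As a minimal sketch of the lower-latency variant mentioned above, the interval could be shortened to 10 seconds while the timeout stays at 30 seconds. The value &lt;code&gt;10000&lt;/code&gt; is an illustrative assumption, not a tuned recommendation; shorter intervals mean more frequent checkpoint work, so the right value depends on your workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;env {
  job.mode = "STREAMING"

  # Checkpoint every 10 seconds (illustrative value, in milliseconds)
  checkpoint.interval = 10000
  # Unchanged from the example: each checkpoint may take up to 30 seconds
  checkpoint.timeout = 30000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;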

&lt;h3&gt;
  
  
  Retry Policy
&lt;/h3&gt;

&lt;p&gt;The example configures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry attempts: 3&lt;/li&gt;
&lt;li&gt;Retry interval: 3 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These values can be adjusted according to production requirements.&lt;/p&gt;

&lt;h1&gt;
  
  
  MySQL CDC Source Configuration
&lt;/h1&gt;

&lt;p&gt;The MySQL CDC source configuration is one of the most important parts of the entire synchronization task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;MySQL-CDC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;base-url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://192.168.8.101:3306/test?serverTimezone=Asia/Shanghai"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;database-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# table-names = ["test.test_001","test.test_002"]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;table-pattern&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.test_.*"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;table-names-config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"test.test_002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"primaryKeys"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;startup.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"initial"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;snapshot.split.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8096"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;snapshot.fetch.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1024"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;server-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6500-8500"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;connect.timeout.ms&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;connect.max-retries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;connection.pool.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;exactly_once&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;schema-changes.enabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important recommendation is to explicitly specify the timezone in the MySQL JDBC URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serverTimezone=Asia/Shanghai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps prevent timezone-related inconsistencies when extracting &lt;code&gt;DATETIME&lt;/code&gt; and &lt;code&gt;TIMESTAMP&lt;/code&gt; fields.&lt;/p&gt;

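&lt;p&gt;The &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; should belong to an account that holds MySQL replication privileges (typically &lt;code&gt;REPLICATION SLAVE&lt;/code&gt; and &lt;code&gt;REPLICATION CLIENT&lt;/code&gt;) so that it can read binlogs, plus &lt;code&gt;SELECT&lt;/code&gt; on every table involved in synchronization.&lt;/p&gt;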

&lt;p&gt;In most production environments, each synchronization task is typically configured for a specific database. Therefore, it is recommended to explicitly specify the database being synchronized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;database-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Table Selection
&lt;/h2&gt;

&lt;p&gt;SeaTunnel provides two approaches for selecting source tables:&lt;/p&gt;

&lt;h3&gt;
  
  
  Explicit Table List
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;table-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"test.test_001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"test.test_002"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Regular Expression Matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;table-pattern&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;.test_.*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For large-scale initial synchronization scenarios involving many tables—or even an entire database—regular expressions are often the preferred option.&lt;/p&gt;

&lt;p&gt;When using regular expressions, both the database name and table name must be included.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test\\.test_.*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



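&lt;p&gt;The escaped dot (&lt;code&gt;\\.&lt;/code&gt;) matches the literal separator between the database name and table name instead of "any character". The backslash is doubled because the pattern sits inside a quoted HOCON string, where a single backslash is itself an escape character.&lt;/p&gt;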

&lt;p&gt;The pattern above matches all tables in the &lt;code&gt;test&lt;/code&gt; database whose names begin with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is especially useful for large-scale database synchronization projects.&lt;/p&gt;
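
&lt;p&gt;For the whole-database case mentioned above, a pattern along the following lines could be used. This is a sketch that assumes the same &lt;code&gt;test&lt;/code&gt; database as the example; the pattern itself is not taken from the original configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;database-names = ["test"]

# "test" + a literal dot + any table name: every table in the test database
table-pattern = "test\\..*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A pattern this broad also picks up tables created for temporary or internal use, so a narrower prefix such as &lt;code&gt;test_&lt;/code&gt; is usually the safer starting point.&lt;/p&gt;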

&lt;h2&gt;
  
  
  Table-Level Custom Configuration
&lt;/h2&gt;

&lt;p&gt;Additional table-specific settings can also be defined.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;table-names-config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"test.test_002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"primaryKeys"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose &lt;code&gt;test_002&lt;/code&gt; does not have a physical primary key.&lt;/p&gt;

&lt;p&gt;In that case, a logical primary key can be specified manually to support synchronization and downstream processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Startup Mode
&lt;/h2&gt;

&lt;p&gt;One of the most important options is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;startup.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"initial"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is also the most common production configuration.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;initial&lt;/code&gt; mode performs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Historical full synchronization&lt;/li&gt;
&lt;li&gt;Continuous CDC synchronization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full Load + Incremental CDC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This approach is widely used when onboarding existing databases into real-time synchronization pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snapshot Configuration
&lt;/h2&gt;

&lt;p&gt;The following options control snapshot processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;snapshot.split.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8096"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;snapshot.fetch.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1024"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The values shown here are the defaults, and they work well in most cases.&lt;/p&gt;

&lt;p&gt;For larger clusters or more powerful servers, these values can be adjusted to improve throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Server ID Configuration
&lt;/h2&gt;

&lt;p&gt;Another critical parameter is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;server-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6500-8500"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When SeaTunnel consumes MySQL binlogs, it behaves like a MySQL replica node.&lt;/p&gt;

&lt;p&gt;MySQL replication requires every replica to have a unique &lt;code&gt;server-id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If not specified, a default value will be used.&lt;/p&gt;

&lt;p&gt;However, the official recommendation is to configure a dedicated server-id range.&lt;/p&gt;

&lt;p&gt;An important requirement is that the server-id range must be larger than the configured parallelism, because each parallel reader connects to MySQL with its own server ID. Otherwise, synchronization tasks may fail during startup.&lt;/p&gt;
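
&lt;p&gt;The sizing rule can be checked with simple arithmetic. The example range &lt;code&gt;6500-8500&lt;/code&gt; spans 8500 - 6500 + 1 = 2001 IDs, far more than &lt;code&gt;parallelism = 3&lt;/code&gt; needs. The fragment below is a sketch with a much tighter, hypothetical range that still satisfies the rule; all other source options are omitted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;env {
  parallelism = 3
}

source {
  MySQL-CDC {
    # 6500-6509 spans 10 server IDs, which is larger than parallelism = 3
    # (hypothetical range; pick IDs not used by any other replica or CDC job)
    server-id = "6500-6509"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;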

&lt;h2&gt;
  
  
  Connection Settings
&lt;/h2&gt;

&lt;p&gt;The following parameters control connectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;connect.timeout.ms&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;connect.max-retries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;connection.pool.size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For large datasets or slower networks, increasing timeout values may be beneficial.&lt;/p&gt;

&lt;p&gt;Similarly, connection pool size can be increased for high-throughput synchronization workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exactly-Once Semantics
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;exactly_once&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most CDC analytics scenarios, strict transactional consistency is not required.&lt;/p&gt;

&lt;p&gt;Therefore, disabling Exactly-Once semantics is often recommended because it can significantly improve synchronization performance.&lt;/p&gt;

&lt;p&gt;If the business requires strong consistency guarantees, this option can be enabled, at the cost of additional overhead and potentially lower throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema Evolution
&lt;/h2&gt;

&lt;p&gt;Another highly recommended feature is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;schema-changes.enabled&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schema Evolution allows SeaTunnel to automatically adapt to changes in source table structures.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add Column&lt;/li&gt;
&lt;li&gt;Drop Column&lt;/li&gt;
&lt;li&gt;Rename Column&lt;/li&gt;
&lt;li&gt;Modify Column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces the need to manually modify synchronization jobs whenever schema changes occur.&lt;/p&gt;

&lt;p&gt;However, it also introduces certain considerations.&lt;/p&gt;

&lt;p&gt;For example, if a downstream application depends on a column name that is automatically renamed, related SQL statements may fail.&lt;/p&gt;

&lt;p&gt;Therefore, users should balance automation and downstream compatibility according to their own requirements.&lt;/p&gt;

&lt;p&gt;According to the official documentation, Schema Evolution currently supports only the four column-level operations listed above (add, drop, rename, and modify column). Other DDL operations, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CREATE TABLE&lt;/li&gt;
&lt;li&gt;DROP TABLE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;cannot currently be captured and propagated automatically.&lt;/p&gt;

&lt;p&gt;Despite these limitations, Schema Evolution remains an extremely valuable capability and is highly recommended for most production environments.&lt;/p&gt;

&lt;h1&gt;
  
  
  PostgreSQL Sink Configuration
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="k"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:postgresql://192.168.8.101:5432/test"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.postgresql.Driver"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;generate_sink_sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"test"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;database_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;schema_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CREATE_SCHEMA_WHEN_NOT_EXIST"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;data_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"APPEND_DATA"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# enable_upsert = false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Sink configuration writes synchronized data into PostgreSQL.&lt;/p&gt;

&lt;p&gt;In addition to the connection URL, JDBC driver, username, and password, one particularly useful feature is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;generate_sink_sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, SeaTunnel automatically generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CREATE TABLE statements&lt;/li&gt;
&lt;li&gt;INSERT statements&lt;/li&gt;
&lt;li&gt;DELETE statements&lt;/li&gt;
&lt;li&gt;UPDATE statements based on primary keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This greatly simplifies synchronization configuration and eliminates the need for complex manual SQL development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding PostgreSQL Schema Hierarchy
&lt;/h2&gt;

&lt;p&gt;When synchronizing data to PostgreSQL, it is important to understand the difference between MySQL and PostgreSQL object hierarchies.&lt;/p&gt;

&lt;p&gt;MySQL contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database → Table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database → Schema → Table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



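&lt;p&gt;Therefore, the sink has to say both which PostgreSQL database and which schema receive the tables. In the example, &lt;code&gt;database&lt;/code&gt; selects the target database, and the part of the &lt;code&gt;table&lt;/code&gt; value before the dot selects the schema, so the MySQL database name ends up as the PostgreSQL schema name.&lt;/p&gt;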

&lt;p&gt;It is also recommended that the PostgreSQL user have table creation permissions if automatic table creation is enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Placeholder Support
&lt;/h2&gt;

&lt;p&gt;SeaTunnel provides a powerful placeholder mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;database_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feature is especially useful when synchronizing many tables.&lt;/p&gt;

&lt;p&gt;Instead of manually defining every target table name, placeholders automatically generate target table mappings based on source metadata.&lt;/p&gt;

&lt;p&gt;This significantly reduces maintenance effort in large-scale synchronization projects.&lt;/p&gt;
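
&lt;p&gt;To make the mapping concrete, the sketch below annotates how the placeholders would resolve for the two demo tables, based on the configuration shown earlier. The commented-out variant that writes everything into a fixed &lt;code&gt;public&lt;/code&gt; schema is an assumption for illustration and should be checked against the JDBC sink documentation for your version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;# MySQL source table → PostgreSQL target (schema.table) in database "test"
#   test.test_001    → test.test_001
#   test.test_002    → test.test_002
database = "test"
table = "${database_name}.${table_name}"

# Hypothetical alternative: keep the table names but use one fixed schema
# table = "public.${table_name}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;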

&lt;h2&gt;
  
  
  Save Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema Save Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;schema_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CREATE_SCHEMA_WHEN_NOT_EXIST"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This option is extremely useful for whole-database synchronization.&lt;/p&gt;

&lt;p&gt;It automatically creates target schemas and tables when they do not already exist.&lt;/p&gt;

&lt;p&gt;As a result, users can avoid many manual setup steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Save Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;data_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"APPEND_DATA"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;APPEND_DATA prevents existing synchronized data from being overwritten.&lt;/p&gt;

&lt;p&gt;This makes it one of the safest and most commonly used modes in production environments.&lt;/p&gt;

&lt;p&gt;Other save modes are also available and can be selected according to business requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Upsert Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="l"&gt;enable_upsert&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can guarantee that source-side data will never contain duplicate primary keys, disabling upsert operations may significantly improve synchronization performance.&lt;/p&gt;

&lt;p&gt;However, if duplicate records are possible, it is recommended to keep Upsert enabled.&lt;/p&gt;

&lt;p&gt;SeaTunnel can then perform primary-key-based updates automatically.&lt;/p&gt;

&lt;p&gt;Refer to the official documentation for detailed parameter descriptions and supported behaviors.&lt;/p&gt;
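
&lt;p&gt;For reference, disabling upsert amounts to uncommenting the &lt;code&gt;enable_upsert&lt;/code&gt; line from the sink example. The fragment below is a sketch that omits the connection options and is only appropriate under the assumption stated above, namely that primary keys never repeat on the source side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;sink {
  jdbc {
    generate_sink_sql = true

    # Skip the upsert path; only safe when source primary keys never repeat
    enable_upsert = false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;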

&lt;h1&gt;
  
  
  Job Submission and Monitoring
&lt;/h1&gt;

&lt;p&gt;After completing the configuration file, the synchronization job can be submitted using SeaTunnel's command-line tools.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/start-seatunnel.sh &lt;span class="nt"&gt;--config&lt;/span&gt; /path/to/config.yaml &lt;span class="nt"&gt;--async&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameter description:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--config&lt;/code&gt;: Specifies the configuration file path.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--async&lt;/code&gt;: Submits the job asynchronously. After submission, the command-line process exits immediately while the job continues running in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After submission, the job can be monitored through the SeaTunnel UI.&lt;/p&gt;

&lt;p&gt;Starting from version 2.3.9, SeaTunnel provides an intuitive web interface that allows users to view:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job status&lt;/li&gt;
&lt;li&gt;Execution logs&lt;/li&gt;
&lt;li&gt;Throughput statistics&lt;/li&gt;
&lt;li&gt;Data processing metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Synchronization Demonstration
&lt;/h1&gt;

&lt;p&gt;In this demo, two tables were created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test_001
test_002
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After inserting sample records into MySQL, SeaTunnel successfully synchronized the data into PostgreSQL.&lt;/p&gt;

&lt;p&gt;The demonstration also covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INSERT operations&lt;/li&gt;
&lt;li&gt;DELETE operations&lt;/li&gt;
&lt;li&gt;UPDATE operations&lt;/li&gt;
&lt;li&gt;Schema changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SeaTunnel successfully captures and synchronizes all these changes to PostgreSQL in real time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Key Takeaways
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Automatic Schema Synchronization
&lt;/h2&gt;

&lt;p&gt;SeaTunnel supports automatic schema synchronization.&lt;/p&gt;

&lt;p&gt;When the schema of a source MySQL table changes, the corresponding PostgreSQL table structure can be updated automatically as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Consistency
&lt;/h2&gt;

&lt;p&gt;SeaTunnel ensures data consistency throughout the synchronization process.&lt;/p&gt;

&lt;p&gt;All INSERT, DELETE, and UPDATE operations are accurately replicated to the target database, providing a reliable foundation for real-time analytics and data integration scenarios.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>postgres</category>
      <category>apacheseatunnel</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Meet Apache SeaTunnel's Newest Committer: 60+ PRs, Connector Innovation, and the Future of AI Data Integration</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 05 Jun 2026 09:51:05 +0000</pubDate>
      <link>https://dev.to/seatunnel/meet-apache-seatunnels-newest-committer-60-prs-connector-innovation-and-the-future-of-ai-data-1gbn</link>
      <guid>https://dev.to/seatunnel/meet-apache-seatunnels-newest-committer-60-prs-connector-innovation-and-the-future-of-ai-data-1gbn</guid>
      <description>&lt;p&gt;Hey Community! Exciting news has arrived from the Apache SeaTunnel open-source community. Zeng Yi, a Big Data Engineer at China Telecom Cloud Technology, has been invited to join the ranks of Apache SeaTunnel Committers, bringing new energy and momentum to the project.&lt;/p&gt;

&lt;p&gt;Within the Apache SeaTunnel community, Zeng Yi has already distinguished himself through outstanding technical expertise and strong engineering capabilities. His election as an Apache SeaTunnel Committer is a well-deserved recognition of his dedication to open source. He has actively contributed to a wide range of community initiatives, delivering high-quality code in areas such as Connector enhancements and engine adaptation. Beyond code contributions, he has also leveraged his extensive experience to improve documentation, participate in technical discussions, and help fellow developers solve problems, continuously driving the project's growth through practical action.&lt;/p&gt;

&lt;p&gt;From his early days exploring the open-source world to becoming a key contributor to a top-tier Apache project, Zeng Yi's journey is filled with valuable experiences and unique insights. How did he find his direction in open source? What lessons can he share with aspiring contributors?&lt;/p&gt;

&lt;p&gt;Let's dive into this in-depth community interview and hear his story firsthand!&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Profile
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb2o0tfmkdle9dizf9nz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb2o0tfmkdle9dizf9nz.jpg" width="799" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interview Transcript
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. How long have you been involved in open source, and what attracts you to it?
&lt;/h3&gt;

&lt;p&gt;If we take my first official PR submission to an Apache project as the starting point, I have been contributing to open source for about one year.&lt;/p&gt;

&lt;p&gt;What attracts me most is the opportunity to collaborate with many outstanding developers in the community. During every PR review, reviewers provide feedback on naming conventions, edge cases, compatibility, test coverage, and many other aspects. These are valuable experiences that are not always available in such a concentrated form in day-to-day work.&lt;/p&gt;

&lt;p&gt;Open source also allows individual contributions to create broader value. Fixing a problem within a company may only benefit a team or a specific project. However, when the same fix is merged into the community's main branch, it can benefit a much larger user base. That sense of impact is highly rewarding.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. When did you start contributing to SeaTunnel, and what motivated you to get involved?
&lt;/h3&gt;

&lt;p&gt;I started contributing to SeaTunnel in April 2025.&lt;/p&gt;

&lt;p&gt;On April 23, I submitted my first PR (#9213), and on May 16, my first contribution was merged through PR #9305, officially making me a Contributor.&lt;/p&gt;

&lt;p&gt;The motivation came directly from real-world challenges at work. At the time, we were building a data integration platform based on Apache SeaTunnel and Flink CDC, using Flink as the unified execution engine.&lt;/p&gt;

&lt;p&gt;While supporting customer synchronization workloads, we encountered issues such as Oracle BLOB fields losing their original format after being read, and Doris Sink lacking flexible table name case handling. After implementing fixes internally, I realized these problems were quite common, so I organized the solutions into PRs and contributed them back to the community.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. As a newly invited SeaTunnel Committer, could you summarize your contributions to the community, both technical and non-technical?
&lt;/h3&gt;

&lt;p&gt;Since joining the Apache SeaTunnel community, I have continuously contributed in areas including Connector enhancements, engine adaptation, synchronization stability, documentation improvements, and community collaboration.&lt;/p&gt;

&lt;p&gt;To date, I have had 60 PRs merged into the apache/seatunnel repository, covering modules such as connectors-v2, Zeta, Flink, CDC, Transform-v2, E2E, and Docs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Code Contributions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;File Connector Enhancements&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I systematically improved large-file processing and continuous file discovery capabilities for HDFS.&lt;/p&gt;

&lt;p&gt;To address the limitation of "one file equals one split," which restricted parallelism for large files, I implemented large-file split reading support for the HDFS File Source and introduced configurations such as &lt;code&gt;enable_file_split&lt;/code&gt; and &lt;code&gt;file_split_size&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For text, CSV, and JSON files, split processing respects line boundaries to prevent record corruption. For Parquet files, logical splitting is implemented based on RowGroups.&lt;/p&gt;
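
&lt;p&gt;The line-boundary rule described above can be illustrated with a small Python sketch. This is not SeaTunnel's implementation; the function name &lt;code&gt;plan_line_aligned_splits&lt;/code&gt; and its parameters are hypothetical, and the whole file is assumed to fit in memory as bytes. Each split targets the requested size and is then extended to the next newline so that no record is cut in half.&lt;/p&gt;

```python
def plan_line_aligned_splits(data, split_size):
    """Return (start, end) byte ranges that never cut a line in half.

    Hypothetical helper for illustration only, not SeaTunnel code.
    """
    split_size = max(1, split_size)  # guard against zero or negative sizes
    splits = []
    start = 0
    total = len(data)
    while start != total:
        # Nominal end of this split, capped at the end of the data.
        end = min(start + split_size, total)
        if end != total:
            # Extend the split to just past the next newline, if there is one.
            newline_at = data.find(b"\n", end - 1)
            end = total if newline_at == -1 else newline_at + 1
        splits.append((start, end))
        start = end
    return splits


sample = b"id,name\n1,alice\n2,bob\n3,carol\n"
for begin, finish in plan_line_aligned_splits(sample, 10):
    print(begin, finish, sample[begin:finish])
```

&lt;p&gt;Every range in the result ends on a newline (or at the end of the data), so each split can be parsed independently by a parallel reader.&lt;/p&gt;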

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Continuous File Discovery&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I added continuous discovery capabilities for FTP, SFTP, Local, and HDFS file sources.&lt;/p&gt;

&lt;p&gt;This allows running jobs to continuously detect newly created or updated files, with support for the &lt;code&gt;scan_interval&lt;/code&gt; configuration, making it suitable for periodic file drops and near real-time file synchronization scenarios.&lt;/p&gt;
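
&lt;p&gt;As a rough illustration of the idea, the Python sketch below compares each periodic directory scan with the state remembered from the previous scan. It is only a model under simple assumptions: the function name &lt;code&gt;discover_changes&lt;/code&gt; is hypothetical, files are identified by path, and a change is detected through a differing modification time. The actual connector logic may work differently.&lt;/p&gt;

```python
def discover_changes(seen, listing):
    """Return paths that are new or modified since the previous scan.

    seen    -- dict mapping path to last known modification time (updated in place)
    listing -- dict mapping path to current modification time from one scan
    Hypothetical helper for illustration only, not SeaTunnel code.
    """
    changed = []
    for path, mtime in sorted(listing.items()):
        if seen.get(path) != mtime:
            changed.append(path)
            seen[path] = mtime  # remember the version that was just picked up
    return changed


state = {}
print(discover_changes(state, {"a.csv": 100, "b.csv": 100}))
print(discover_changes(state, {"a.csv": 100, "b.csv": 150, "c.csv": 160}))
```

&lt;p&gt;In a running job, a scan like this would be repeated once per &lt;code&gt;scan_interval&lt;/code&gt;, and only the returned paths would be handed to readers.&lt;/p&gt;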

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connector Enhancements and Bug Fixes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I contributed enhancements and fixes across Hive, JDBC, CDC, Iceberg, Doris, Kafka, and other connectors.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hive Sink support for SchemaSaveMode and DataSaveMode&lt;/li&gt;
&lt;li&gt;Automatic table creation&lt;/li&gt;
&lt;li&gt;Schema management&lt;/li&gt;
&lt;li&gt;Partition field support&lt;/li&gt;
&lt;li&gt;Multiple storage format support&lt;/li&gt;
&lt;li&gt;JDBC regex-based multi-table reading&lt;/li&gt;
&lt;li&gt;PostgreSQL TIMESTAMP_TZ support&lt;/li&gt;
&lt;li&gt;Iceberg time type fixes&lt;/li&gt;
&lt;li&gt;Doris case sensitivity compatibility improvements&lt;/li&gt;
&lt;li&gt;Kafka checkpoint offset recovery fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flink Version Adaptation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I participated in SeaTunnel's support for Flink 1.20.1.&lt;/p&gt;

&lt;p&gt;This involved introducing a dedicated translation layer for Flink 1.20, replacing fragile reflection-based implementations with Flink's official Sink2 API, and adding starters, build configurations, and E2E testing infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stability and Engineering Quality Improvements&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I continuously worked on fixing unstable CI test cases, improving E2E test coverage, and addressing issues related to checkpoint recovery, transaction commits, duplicate XA XIDs, empty directory reading, CDC snapshot splits, Kafka offset restoration, and Zeta REST APIs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Non-Code Contributions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Continuously improving SeaTunnel user and developer documentation, including connector documentation, parameter descriptions, usage examples, E2E configuration guides, onboarding materials, and architecture documents.&lt;/li&gt;
&lt;li&gt;Actively participating in PR reviews and technical discussions.&lt;/li&gt;
&lt;li&gt;Adjusting implementation approaches, adding tests, and improving documentation based on reviewer feedback, helping drive PRs from design to merge.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. After contributing to SeaTunnel for some time, what do you see as its advantages and shortcomings compared to other solutions? What keeps you engaged in the community?
&lt;/h3&gt;

&lt;p&gt;One of my strongest impressions is that SeaTunnel is highly aligned with real enterprise data integration needs.&lt;/p&gt;

&lt;p&gt;Rather than focusing on a single data source, execution engine, or synchronization method, SeaTunnel aims to solve a broad range of heterogeneous data integration challenges.&lt;/p&gt;

&lt;p&gt;Compared with competing solutions, one of SeaTunnel's most significant advantages is its rich Connector ecosystem.&lt;/p&gt;

&lt;p&gt;It provides comprehensive coverage for commonly used systems such as JDBC, CDC, Hive, Doris, Kafka, Iceberg, HDFS, FTP/SFTP, and local file systems.&lt;/p&gt;

&lt;p&gt;In enterprise environments, data integration rarely involves a single pipeline. Instead, organizations often require database-to-warehouse, database-to-lakehouse, file-to-Hive/Doris, CDC-to-Kafka, and many other combinations. SeaTunnel offers a relatively unified development and configuration experience across these scenarios.&lt;/p&gt;

&lt;p&gt;Another major advantage is multi-engine support.&lt;/p&gt;

&lt;p&gt;Different organizations have different technology stacks. Some workloads are better suited for Flink, while others may choose SeaTunnel Zeta. SeaTunnel does not lock users into a specific execution engine, which is highly valuable for enterprise adoption.&lt;/p&gt;

&lt;p&gt;As for areas for improvement, I believe CDC capabilities can be further strengthened.&lt;/p&gt;

&lt;p&gt;By learning from projects such as Flink CDC, SeaTunnel could continue improving schema change support across multiple data sources, consistency guarantees for different sinks, and recovery stability after failures.&lt;/p&gt;

&lt;p&gt;Documentation and best practices can also be improved, especially around production deployment, troubleshooting, and performance tuning.&lt;/p&gt;

&lt;p&gt;In addition, emerging areas such as AI data integration, unstructured data processing, and vector databases present exciting opportunities for future exploration.&lt;/p&gt;

&lt;p&gt;What keeps me engaged is the community's high level of activity.&lt;/p&gt;

&lt;p&gt;Issues, PRs, and discussions typically receive timely feedback, which makes contributions feel meaningful rather than isolated efforts.&lt;/p&gt;

&lt;p&gt;I can clearly see the value of my contributions. Many problems originate from real business scenarios, and once resolved and contributed back, other users can immediately benefit.&lt;/p&gt;

&lt;p&gt;Another important factor is the quality of community reviews.&lt;/p&gt;

&lt;p&gt;Reviewers do not simply check whether code runs. They evaluate solution generality, edge cases, test completeness, documentation quality, and long-term maintainability.&lt;/p&gt;

&lt;p&gt;Although this often requires multiple iterations, it significantly improves both the solution and my own engineering skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Have you developed custom solutions to address SeaTunnel's limitations? Have these been contributed back to the community?
&lt;/h3&gt;

&lt;p&gt;Yes.&lt;/p&gt;

&lt;p&gt;In fact, my initial involvement with SeaTunnel began because I encountered issues while using it.&lt;/p&gt;

&lt;p&gt;At the time, we were building a data integration platform based on SeaTunnel and Flink CDC. During production synchronization tasks, we discovered problems related to Oracle BLOB field handling and Doris Sink table name case sensitivity.&lt;/p&gt;

&lt;p&gt;While these might seem like minor details, they can directly affect synchronization results in production environments.&lt;/p&gt;

&lt;p&gt;I first validated solutions internally and then submitted them to the community as PRs.&lt;/p&gt;

&lt;p&gt;Later, much of my work focused on File Connectors.&lt;/p&gt;

&lt;p&gt;For example, HDFS originally mapped one file to one split, which limited parallelism when processing very large files.&lt;/p&gt;

&lt;p&gt;I introduced large-file split reading support with configurable split behavior and split sizes.&lt;/p&gt;

&lt;p&gt;This required more than simply splitting files by byte offsets. Text, CSV, and JSON files must preserve record boundaries, while Parquet files are more naturally divided by RowGroups.&lt;/p&gt;

&lt;p&gt;I also implemented continuous discovery for FTP, SFTP, Local, and HDFS file sources.&lt;/p&gt;

&lt;p&gt;Many file synchronization scenarios involve periodic file arrivals rather than a fixed set of files prepared upfront, making continuous discovery essential.&lt;/p&gt;

&lt;p&gt;Additionally, I participated in Flink 1.20.1 compatibility work because our product needed to standardize on a newer Flink version.&lt;/p&gt;

&lt;p&gt;This included translation layers, starters, build configurations, and E2E testing to ensure SeaTunnel worked properly on Flink 1.20.1.&lt;/p&gt;

&lt;p&gt;Most of these enhancements have already been contributed back to the community.&lt;/p&gt;

&lt;p&gt;My philosophy is simple:&lt;/p&gt;

&lt;p&gt;If a problem is fixed only internally, the team must maintain that patch indefinitely.&lt;/p&gt;

&lt;p&gt;If the problem is common, contributing it back to the community creates greater value and allows the solution to benefit from community review and validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Does your company use SeaTunnel in production? What are the use cases? If not, would you recommend it, and why?
&lt;/h3&gt;

&lt;p&gt;Yes, our company uses SeaTunnel in real production environments.&lt;/p&gt;

&lt;p&gt;For customer-facing data integration scenarios, we have built a data integration platform based on open-source Apache SeaTunnel and Flink CDC, with Flink serving as the unified underlying execution engine.&lt;/p&gt;

&lt;p&gt;Currently, we support a wide range of data sources, including various databases, data warehouses, data lakes, Kafka, HTTP, and many others.&lt;/p&gt;

&lt;p&gt;Typical target systems include Hive, Doris, and Iceberg, which are used to support customer requirements such as data lake ingestion, data warehouse loading, and both real-time and batch synchronization.&lt;/p&gt;

&lt;p&gt;From practical experience, SeaTunnel is well suited to serve as the foundation of a data integration platform.&lt;/p&gt;

&lt;p&gt;On one hand, its broad Connector coverage allows it to adapt to different customer environments and heterogeneous data sources.&lt;/p&gt;

&lt;p&gt;On the other hand, its configuration model and extensibility are relatively clear and straightforward, making it suitable for productization and enterprise-level packaging.&lt;/p&gt;

&lt;p&gt;Of course, when delivering solutions to customers, we still perform additional adaptation and validation based on specific business scenarios, including complex data type compatibility, task stability, failure recovery, performance characteristics, and write capabilities for different target systems.&lt;/p&gt;

&lt;p&gt;Overall, SeaTunnel provides significant value in heterogeneous data synchronization and lakehouse integration scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. What kind of support do you hope the SeaTunnel community can provide for your personal growth?
&lt;/h3&gt;

&lt;p&gt;I hope that through participating in the SeaTunnel community, I can continue improving my engineering capabilities, open-source collaboration skills, and technical perspective.&lt;/p&gt;

&lt;p&gt;SeaTunnel involves many production-grade challenges, including data source integration, type conversion, task partitioning, fault tolerance and recovery, checkpointing, and multi-engine compatibility. Working on these areas greatly helps deepen my understanding of large-scale data systems.&lt;/p&gt;

&lt;p&gt;At the same time, open-source collaboration has encouraged me to think beyond solving immediate business problems and to consider factors such as generality, compatibility, documentation quality, and test completeness.&lt;/p&gt;

&lt;p&gt;Going forward, I would like to participate more actively in code reviews and community discussions, helping other Contributors improve their solutions and implementations.&lt;/p&gt;

&lt;p&gt;In addition, if the community continues exploring areas such as AI, Agents, unstructured data, and vector databases, I would be very interested in participating and gaining hands-on experience in these emerging domains.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. What is your understanding of the Committer role? What responsibilities should a Committer have within the community?
&lt;/h3&gt;

&lt;p&gt;In my view, being a Committer is not simply about having merge permissions.&lt;/p&gt;

&lt;p&gt;More importantly, a Committer is responsible for maintaining project quality and supporting the long-term growth of the community.&lt;/p&gt;

&lt;p&gt;First, Committers should continue contributing.&lt;/p&gt;

&lt;p&gt;Obtaining Committer status should not be the end of one's involvement. Instead, Committers should continue identifying problems, solving issues, and improving the areas where they have expertise.&lt;/p&gt;

&lt;p&gt;Second, Committers should take code reviews seriously.&lt;/p&gt;

&lt;p&gt;A good review is not merely about checking formatting or verifying that the code compiles successfully.&lt;/p&gt;

&lt;p&gt;More importantly, reviewers should evaluate whether a solution is well designed, sufficiently general, capable of handling edge cases, compatible with existing functionality, and supported by adequate testing and documentation.&lt;/p&gt;

&lt;p&gt;In many cases, a pull request becomes significantly clearer and more reliable after going through the review process.&lt;/p&gt;

&lt;p&gt;Third, Committers should help new Contributors integrate into the community.&lt;/p&gt;

&lt;p&gt;Many first-time contributors may not be familiar with project architecture, contribution guidelines, testing requirements, or communication processes.&lt;/p&gt;

&lt;p&gt;Receiving timely and friendly feedback can greatly increase their confidence and encourage them to continue contributing.&lt;/p&gt;

&lt;p&gt;For a community to grow sustainably, it cannot rely solely on a small number of contributors. It must continuously attract and nurture new participants.&lt;/p&gt;

&lt;p&gt;Finally, Committers should also participate in strategic planning and project direction.&lt;/p&gt;

&lt;p&gt;For example, determining which Connectors need priority improvements, which stability issues should be addressed first, and which documentation or testing gaps need attention all require collaboration between community members and real-world users.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. How do you feel about becoming an Apache Software Foundation Committer? Do you have any message for the community or suggestions for the project's future development?
&lt;/h3&gt;

&lt;p&gt;I am truly grateful to the community for recognizing my previous contributions and inviting me to become a SeaTunnel Committer.&lt;/p&gt;

&lt;p&gt;For me, this is both an encouragement and a responsibility.&lt;/p&gt;

&lt;p&gt;I originally became involved with SeaTunnel because of practical problems encountered in my daily work.&lt;/p&gt;

&lt;p&gt;As I became more engaged, I realized that many business challenges are not isolated cases. When solutions are contributed back to the community, they can help many other users facing similar issues. That is one of the reasons I find open source so meaningful.&lt;/p&gt;

&lt;p&gt;Throughout this journey, I am also deeply thankful to all the Reviewers and Contributors who have helped me along the way.&lt;/p&gt;

&lt;p&gt;Many of my PRs went through multiple rounds of discussion and revision. While the process could sometimes be repetitive, the final solutions were always more complete and robust, and I learned a tremendous amount from those experiences.&lt;/p&gt;

&lt;p&gt;Looking ahead, I hope SeaTunnel will continue strengthening the production-grade capabilities of its core Connectors, including CDC, JDBC, File, Hive, Doris, and Iceberg.&lt;/p&gt;

&lt;p&gt;At the same time, I believe the project should continue investing in stability, observability, documentation, and best practices.&lt;/p&gt;

&lt;p&gt;In addition, I plan to continue contributing to AI-related initiatives and practical implementations, including unstructured data processing, vector databases, LLM data pipelines, and Agent automation scenarios.&lt;/p&gt;

&lt;p&gt;I hope to further explore how SeaTunnel's data integration capabilities can support these emerging technologies.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. What are your plans for helping drive the project forward in the near future?
&lt;/h3&gt;

&lt;p&gt;Over the coming period, I plan to continue contributing to Connector development and production stability improvements.&lt;/p&gt;

&lt;p&gt;This includes enhancing commonly used Connectors, fixing issues, and improving E2E test coverage.&lt;/p&gt;

&lt;p&gt;At the same time, I will place particular focus on AI-powered data integration.&lt;/p&gt;

&lt;p&gt;Recently, the community has been discussing Knowledge Sync and Retrieval-Augmented Generation (RAG) capabilities.&lt;/p&gt;

&lt;p&gt;The goal is to enable SeaTunnel to take on responsibilities related to enterprise knowledge synchronization and indexing, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document discovery&lt;/li&gt;
&lt;li&gt;Document parsing&lt;/li&gt;
&lt;li&gt;Content chunking and segmentation&lt;/li&gt;
&lt;li&gt;Embedding generation&lt;/li&gt;
&lt;li&gt;Writing data into vector databases such as Milvus and Qdrant&lt;/li&gt;
&lt;li&gt;Lifecycle management for document updates, deletions, and unchanged-content detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Personally, I hope to participate in both the design and implementation of these capabilities.&lt;/p&gt;

&lt;p&gt;By combining SeaTunnel's existing data integration strengths with AI and RAG scenarios, I believe we can unlock new possibilities for enterprise knowledge bases, unstructured data synchronization, and vector search data preparation workflows.&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>database</category>
    </item>
    <item>
      <title>From Batch to Real-Time: A Practical Guide to MySQL CDC in SeaTunnel</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Fri, 05 Jun 2026 08:08:42 +0000</pubDate>
      <link>https://dev.to/seatunnel/from-batch-to-real-time-a-practical-guide-to-mysql-cdc-in-seatunnel-270</link>
      <guid>https://dev.to/seatunnel/from-batch-to-real-time-a-practical-guide-to-mysql-cdc-in-seatunnel-270</guid>
      <description>&lt;h2&gt;
  
  
  Real-Time Incremental Data Capture: Change Data Capture (CDC)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Official MySQL-CDC Documentation:&lt;br&gt;
&lt;a href="https://seatunnel.apache.org/docs/2.3.3/connector-v2/source/MySQL-CDC/" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.3.3/connector-v2/source/MySQL-CDC/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A single SeaTunnel CDC job can monitor &lt;strong&gt;multiple tables simultaneously&lt;/strong&gt; and synchronize them in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You must use &lt;strong&gt;MySQL JDBC Driver 8.0.33 or later&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download:
&lt;a href="https://downloads.mysql.com/archives/c-j/" rel="noopener noreferrer"&gt;https://downloads.mysql.com/archives/c-j/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flom6htqcunzeek5z62sm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flom6htqcunzeek5z62sm.jpg" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. How Many Databases Does SeaTunnel Support for CDC?
&lt;/h3&gt;

&lt;p&gt;Documentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seatunnel.apache.org/docs/2.3.12/connector-v2/source" rel="noopener noreferrer"&gt;https://seatunnel.apache.org/docs/2.3.12/connector-v2/source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Official CDC Source Connectors Supported by SeaTunnel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MongoDB CDC Source Connector&lt;/li&gt;
&lt;li&gt;MySQL CDC Source Connector&lt;/li&gt;
&lt;li&gt;OpenGauss CDC Source Connector&lt;/li&gt;
&lt;li&gt;Oracle CDC Source Connector&lt;/li&gt;
&lt;li&gt;PostgreSQL CDC Source Connector&lt;/li&gt;
&lt;li&gt;SQL Server CDC Source Connector&lt;/li&gt;
&lt;li&gt;TiDB CDC Source Connector&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. CDC Jobs Do Not Stop After SeaTunnel Starts
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;standard CDC (Change Data Capture) job is designed to run continuously and indefinitely as a streaming service.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Core Difference
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Offline / Batch Processing Jobs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;job.mode = "BATCH"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Execute a full or incremental SQL query once.&lt;/li&gt;
&lt;li&gt;The job automatically finishes after processing all existing data.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;CDC / Streaming Jobs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;job.mode = "STREAMING"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;After startup, the job will:

&lt;ol&gt;
&lt;li&gt;Optionally perform an &lt;strong&gt;initial full snapshot&lt;/strong&gt; (when &lt;code&gt;startup.mode = "initial"&lt;/code&gt; is configured).&lt;/li&gt;
&lt;li&gt;Switch to continuously monitoring the MySQL Binlog stream.&lt;/li&gt;
&lt;li&gt;Immediately capture, process, and write any new database changes (INSERT, UPDATE, DELETE).&lt;/li&gt;
&lt;li&gt;Continue monitoring indefinitely until the job is manually stopped or fails due to an error.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  How to Stop a CDC Job
&lt;/h4&gt;

&lt;p&gt;In a SeaTunnel terminal session, you can usually stop a CDC job gracefully using:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Ctrl + C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production environments, jobs are typically stopped through schedulers, orchestration systems, or management platforms.&lt;/p&gt;

&lt;p&gt;In simple terms, a CDC job behaves like a &lt;strong&gt;persistent subscription service&lt;/strong&gt; that subscribes to database change logs and processes events in real time whenever they occur.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from a traditional batch job that runs once and exits.&lt;/p&gt;

&lt;h4&gt;
  
  
  💡 Configuration Recommendations and Common Pitfalls
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;About &lt;code&gt;startup.mode&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For first-time deployments, it is &lt;strong&gt;strongly recommended&lt;/strong&gt; to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;initial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A complete snapshot of current data is captured first.&lt;/li&gt;
&lt;li&gt;Incremental synchronization seamlessly follows afterward.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you configure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;all historical data will be skipped and only changes occurring after startup will be captured.&lt;/p&gt;
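
&lt;p&gt;The difference between the two modes can be modeled with a short Python sketch. It is a simplified illustration, not connector code: the function name &lt;code&gt;emitted_records&lt;/code&gt; is hypothetical, rows are plain dictionaries, and only the two modes discussed here are handled.&lt;/p&gt;

```python
def emitted_records(startup_mode, existing_rows, later_changes):
    """Model what a CDC job emits for a given startup mode.

    existing_rows -- rows already in the table when the job starts
    later_changes -- (row_kind, row) events that happen after startup
    Hypothetical helper for illustration only, not SeaTunnel code.
    """
    records = []
    if startup_mode == "initial":
        # Snapshot phase: every existing row is read once as an insert.
        records.extend(("+I", row) for row in existing_rows)
    elif startup_mode != "latest":
        raise ValueError("mode not covered by this sketch: " + startup_mode)
    # Incremental phase: changes made after startup are captured in both modes.
    records.extend(later_changes)
    return records


history = [{"id": 1}]
changes = [("+I", {"id": 2})]
print(emitted_records("initial", history, changes))
print(emitted_records("latest", history, changes))
```

&lt;p&gt;With &lt;code&gt;initial&lt;/code&gt;, the row that existed before startup reaches the target; with &lt;code&gt;latest&lt;/code&gt;, only the change made afterward does.&lt;/p&gt;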

&lt;ol start="2"&gt;
&lt;li&gt;About &lt;code&gt;server-id&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

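&lt;p&gt;In production environments, always configure an explicit &lt;code&gt;server-id&lt;/code&gt; that is unique among all clients reading the Binlog of the same MySQL server.&lt;/p&gt;

&lt;p&gt;Avoid relying on random values: if two Binlog clients (replicas or CDC jobs) end up with the same ID, the conflict can cause instability and unexpected behavior.&lt;/p&gt;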

&lt;h3&gt;
  
  
  3. CDC Mode vs JDBC Mode
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MySQL CDC (Streaming)&lt;/th&gt;
&lt;th&gt;JDBC Batch Processing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Job Mode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;STREAMING&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BATCH&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Source&lt;/td&gt;
&lt;td&gt;Database Binlog&lt;/td&gt;
&lt;td&gt;SQL Query Result Set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Content&lt;/td&gt;
&lt;td&gt;Change Event Stream (includes operation type, before/after images, metadata)&lt;/td&gt;
&lt;td&gt;Static Data Snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core Configuration&lt;/td&gt;
&lt;td&gt;&lt;code&gt;table-names&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;query&lt;/code&gt; or &lt;code&gt;table + sql&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synchronization Type&lt;/td&gt;
&lt;td&gt;Real-time incremental (optionally full snapshot first)&lt;/td&gt;
&lt;td&gt;One-time full or incremental batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job Lifecycle&lt;/td&gt;
&lt;td&gt;Runs continuously until manually stopped&lt;/td&gt;
&lt;td&gt;Automatically ends after reading all data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical Use Cases&lt;/td&gt;
&lt;td&gt;Real-time warehouses, analytics, disaster recovery&lt;/td&gt;
&lt;td&gt;T+1 reports, migrations, historical backfill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source Database Load&lt;/td&gt;
&lt;td&gt;Low; continuous Binlog reading&lt;/td&gt;
&lt;td&gt;Query-intensive batch execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Consistency&lt;/td&gt;
&lt;td&gt;Exactly-once reading at the source; end-to-end guarantees also depend on the sink&lt;/td&gt;
&lt;td&gt;Depends on query conditions; duplicates or omissions may occur&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Architecture Comparison
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqkpfzh1oodigdj9fy1o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqkpfzh1oodigdj9fy1o.jpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qxzye4tojvxh207y6hf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qxzye4tojvxh207y6hf.jpg" width="800" height="1202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under CDC mode, you &lt;strong&gt;cannot use a custom SQL query&lt;/strong&gt; in the source connector like you would with JDBC.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is not supported in the CDC Source.&lt;/p&gt;

&lt;p&gt;If filtering or transformation is required, it must be implemented in the &lt;strong&gt;Transform&lt;/strong&gt; stage.&lt;/p&gt;
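
&lt;p&gt;As a minimal sketch, the &lt;code&gt;age &amp;gt; 20&lt;/code&gt; condition could be expressed in a &lt;code&gt;Sql&lt;/code&gt; transform instead. The dataset names &lt;code&gt;cdc_src&lt;/code&gt; and &lt;code&gt;cdc_filtered&lt;/code&gt; are placeholders, and the &lt;code&gt;plugin_input&lt;/code&gt; / &lt;code&gt;plugin_output&lt;/code&gt; keys assume a recent release (older releases call them &lt;code&gt;source_table_name&lt;/code&gt; / &lt;code&gt;result_table_name&lt;/code&gt;), so check the documentation for your version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source {
  MySQL-CDC {
    # ... connection settings as usual
    table-names = ["cs1.t_8_100w"]
    plugin_output = "cdc_src"        # placeholder name for the change stream
  }
}

transform {
  Sql {
    plugin_input = "cdc_src"
    plugin_output = "cdc_filtered"   # placeholder name consumed by the sink
    # The condition is applied to each change event after it is read from the Binlog
    query = "select * from cdc_src where age &amp;gt; 20"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the condition is evaluated per event, a row whose &lt;code&gt;age&lt;/code&gt; later drops below the threshold may leave a stale copy in the target. Verify this behavior against your own data before relying on it.&lt;/p&gt;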

&lt;h4&gt;
  
  
  Why CDC Does Not Support Custom Queries
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Different Data Sources&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;JDBC Batch&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads the results of user-defined SQL statements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MySQL CDC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads database Binlog events.&lt;/li&gt;
&lt;li&gt;Internally uses Debezium.&lt;/li&gt;
&lt;li&gt;Behaves like a MySQL replica that continuously receives row-level changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Different Data Structures&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;JDBC Batch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Returns ordinary row data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MySQL CDC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Produces structured change events containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changed row values&lt;/li&gt;
&lt;li&gt;Operation types (&lt;code&gt;op&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Metadata&lt;/li&gt;
&lt;li&gt;Before image&lt;/li&gt;
&lt;li&gt;After image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+I  Insert
-U  Before Update
+U  After Update
-D  Delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A traditional SQL query cannot generate this event structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. How to Implement Filtering and Transformation in CDC Mode
&lt;/h3&gt;

&lt;p&gt;Although filtering cannot be performed in the Source, several alternatives are available.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;SeaTunnel Transform Section&lt;/td&gt;
&lt;td&gt;Recommended approach. Supports filtering, projection, renaming, and SQL processing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sink Processing&lt;/td&gt;
&lt;td&gt;Sink Configuration&lt;/td&gt;
&lt;td&gt;Some sinks support limited filtering or mapping capabilities.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Table Synchronization Layer&lt;/td&gt;
&lt;td&gt;Sync selected columns to achieve indirect column filtering; requires advance schema changes and is typically used for long-term sync scenarios.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
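
&lt;p&gt;To make the Transform row more concrete, here is a hedged sketch that chains two built-in transforms: &lt;code&gt;Filter&lt;/code&gt; for column projection and &lt;code&gt;FilterRowKind&lt;/code&gt; for dropping events by operation type. The dataset names are placeholders, the column names are illustrative, and it assumes the CDC source declares &lt;code&gt;plugin_output = "cdc_src"&lt;/code&gt;. Option names follow the transform documentation as I understand it and may differ between releases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transform {

  # Column projection: pass only the listed fields downstream
  Filter {
    plugin_input = "cdc_src"
    plugin_output = "cdc_projected"
    include_fields = [id, age, phone_number]
  }

  # Event filtering: discard DELETE events so target rows are never removed
  FilterRowKind {
    plugin_input = "cdc_projected"
    plugin_output = "cdc_no_delete"
    exclude_kinds = ["DELETE"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sink would then read from &lt;code&gt;cdc_no_delete&lt;/code&gt;. Keeping projection and event filtering in separate steps makes each one easy to remove or reorder later.&lt;/p&gt;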

&lt;h3&gt;
  
  
  5. CDC Can Monitor Multiple Tables Simultaneously
&lt;/h3&gt;

&lt;p&gt;When multiple tables need to be synchronized using a single CDC job, SeaTunnel provides a simple configuration model.&lt;/p&gt;

&lt;p&gt;The key idea is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One Source → Many Tables&lt;/li&gt;
&lt;li&gt;Automatic Sink Routing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Configuring Multi-Table CDC
&lt;/h4&gt;

&lt;p&gt;Simply list all tables inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table-names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;MySQL-CDC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;base-url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://192.168.1.107:51382/cs1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zysoft"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;database-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cs1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;table-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"cs1.t_8_100w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"cs1.order_table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"cs1.user_profile"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;startup.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"initial"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;server-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5400&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;server-time-zone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Asia/Shanghai"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Sink Configuration: Automatic Routing
&lt;/h4&gt;

&lt;p&gt;SeaTunnel JDBC Sink supports automatic table routing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Core Configuration
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sink {
  jdbc {

    url = "jdbc:mysql://192.168.1.107:51382/cs2"

    driver = "com.mysql.cj.jdbc.Driver"

    user = "root"

    password = "zysoft"

    generate_sink_sql = true

    database = "cs2"    # Target database

    # Key technique: dynamically map source table names using built-in variables
    table = "${table_name}"

    # table = "prefix_${table_name}"                  # Add a prefix if needed

    # table = "${database_name}_${table_name}_suffix" # Use database name and suffix

    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"

    data_save_mode = "APPEND_DATA"

    batch_size = 5000

    # ... Other connection and performance tuning parameters remain unchanged

  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Built-In Variables
&lt;/h4&gt;

&lt;p&gt;SeaTunnel automatically provides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;${table_name}
${database_name}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cs1.t_8_100w
   -&amp;gt;
cs2.t_8_100w

cs1.order_table
   -&amp;gt;
cs2.order_table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a natural 1:1 table mapping.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Different Routing Strategies for Different Tables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If different tables require different write policies, configure multiple Sink definitions.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sink {

  # Sink 1: Dedicated to processing the t_8_100w table
  jdbc {

    # Explicitly define the target table
    table = "t_8_100w_target"

    # Route only data from the specified source table to this sink
    source_table_name = "t_8_100w"

    # ... Other configurations

  }

  # Sink 2: Dedicated to processing the order_table table
  # Different write strategies (such as data_save_mode) can be applied
  jdbc {

    table = "order_table_target"

    source_table_name = "order_table"

    # Example: Use a truncate-and-reload strategy for this table
    data_save_mode = "DROP_DATA"

    # ... Other configurations

  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;Notes and Best Practices&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Target Tables
Ensure compatible target tables already exist, or enable &lt;code&gt;schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Performance Isolation
All monitored tables share the same CDC stream.
If one table generates massive changes, it may increase latency for others.
For critical workloads, create separate CDC jobs.&lt;/li&gt;
&lt;li&gt;Initial Snapshot
When using &lt;code&gt;startup.mode = "initial"&lt;/code&gt;, all monitored tables are snapshotted during startup.
Ensure sufficient database resources are available.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;
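
&lt;p&gt;The performance-isolation advice can be sketched as two separate job files. The file names are hypothetical, and the &lt;code&gt;env&lt;/code&gt; and &lt;code&gt;sink&lt;/code&gt; sections are omitted. The point is only that each job monitors its own tables and uses its own &lt;code&gt;server-id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# job-orders.conf (hypothetical file): the high-volume table gets its own CDC stream
source {
  MySQL-CDC {
    # ... connection settings
    table-names = ["cs1.order_table"]
    server-id = 5400
  }
}

# job-others.conf (hypothetical file): the remaining tables share a second stream
source {
  MySQL-CDC {
    # ... connection settings
    table-names = ["cs1.t_8_100w", "cs1.user_profile"]
    server-id = 5401    # must differ from the first job and from every MySQL server ID
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each job opens its own Binlog connection, so a burst of changes on &lt;code&gt;order_table&lt;/code&gt; should no longer delay the other tables, at the cost of one extra replication client on the MySQL server.&lt;/p&gt;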

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;For most multi-table CDC scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List multiple tables in &lt;code&gt;table-names&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configure:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table = "${table_name}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;in the Sink&lt;/p&gt;

&lt;p&gt;SeaTunnel will automatically route data to matching target tables.&lt;/p&gt;

&lt;p&gt;If special processing is required for specific tables, use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transform SQL&lt;/li&gt;
&lt;li&gt;Filters&lt;/li&gt;
&lt;li&gt;Conditional routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to customize behavior.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Routing and Filters
If you want data to flow selectively into different sinks instead of being copied to every sink, you must configure a &lt;strong&gt;filter&lt;/strong&gt; before each sink or use more advanced mechanisms such as side outputs (typically implemented through conditional expressions in the configuration file).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;💡 Configuration Recommendations for Multi-Table CDC Scenarios&lt;/p&gt;

&lt;p&gt;Based on the questions discussed earlier, a typical configuration for synchronizing multiple CDC tables into different target tables is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;MySQL-CDC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cs1.t_order"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs1.t_user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs1.t_log"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# ... other configurations&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# Order table -&amp;gt; Order Archive Database&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_order_archive"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Use filter to synchronize only t_order data&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;source_table_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_order"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# User table -&amp;gt; User Analytics Database&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_user_analysis"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;source_table_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_user"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;data_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DROP_DATA"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;# Clear existing data before writing to this table&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# Log table -&amp;gt; Log Center (example writes to HDFS)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;hdfs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/data/lake/log/&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;source_table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/dt=&lt;/span&gt;&lt;span class="si"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;now&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;date&lt;/span&gt;&lt;span class="err"&gt;='&lt;/span&gt;&lt;span class="nv"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="err"&gt;')&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;source_table_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_log"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Steps to Implement MySQL CDC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67seujuy2b1z8kfd7edb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67seujuy2b1z8kfd7edb.jpg" width="800" height="723"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Verify That MySQL Binlog Is Enabled
&lt;/h3&gt;

&lt;p&gt;Use the following commands to check whether Binlog is enabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The simplest way to check&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'log_bin'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Show the current binlog file name and position (MySQL 8.4+ uses SHOW BINARY LOG STATUS instead)&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;MASTER&lt;/span&gt; &lt;span class="n"&gt;STATUS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify binlog format&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'binlog_format'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify row image mode&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'binlog_row_image'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncnbdoa7r50usy4pnqzq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncnbdoa7r50usy4pnqzq.jpg" width="423" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Binlog If It Is Disabled
&lt;/h3&gt;

&lt;p&gt;If Binlog is not enabled, modify the MySQL configuration file and restart MySQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[mysqld]&lt;/span&gt;

&lt;span class="py"&gt;server-id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;123&lt;/span&gt;
&lt;span class="c"&gt;# Configure a unique server ID
&lt;/span&gt;
&lt;span class="py"&gt;log_bin&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/var/lib/mysql/mysql-bin&lt;/span&gt;
&lt;span class="c"&gt;# Enable Binlog and specify storage path
&lt;/span&gt;
&lt;span class="py"&gt;binlog_format&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;ROW&lt;/span&gt;
&lt;span class="c"&gt;# Must be configured as ROW mode
&lt;/span&gt;
&lt;span class="py"&gt;binlog_row_image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;FULL&lt;/span&gt;
&lt;span class="c"&gt;# Must be configured as FULL
&lt;/span&gt;
&lt;span class="py"&gt;expire_logs_days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="c"&gt;# Binlog retention period
# At least 2 days is recommended
# Deprecated since MySQL 8.0; prefer binlog_expire_logs_seconds on newer versions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Common MySQL CDC Parameters
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zh5qceb9ynot5759f93.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zh5qceb9ynot5759f93.jpg" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-Table MySQL CDC Demo
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create Target Table
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- demo7-1-mysql-cdc2mysql-qxzh-st-107.conf&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;cs2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;t_8_100w_imp_st_qxzh_cdc_demo7_1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;

  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Primary Key'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;user_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'User Name'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;sex&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Gender: Male/Female'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;decimal_f&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Large Decimal Value'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;phone_number&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Phone Number'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Converted Age'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;create_time&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Creation Time'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;LONGTEXT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Large Text'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Default Value for Empty Address: Unknown'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Execute the Job
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# demo7-1-mysql-cdc2mysql-qxzh-st-107.conf&lt;/span&gt;

sh /data/tools/seatunnel/seatunnel-2.3.12/bin/seatunnel.sh &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--config&lt;/span&gt; /data/tools/seatunnel/myconf/demo7-1-mysql-cdc2mysql-qxzh-st-107.conf &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Complete Configuration File
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="c1"&gt;# demo7-1-mysql-cdc2mysql-qxzh-st-107.conf&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# Parallelism (number of threads)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;execution.parallelism&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# Job mode:&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# BATCH = Batch Processing&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# STREAMING = Streaming Processing (required for CDC)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;job.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STREAMING"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;MySQL-CDC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;base-url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://ip:port/cs1"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;username&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zysoft"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# query is invalid in CDC mode&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Source database&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;database-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cs1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Monitored tables&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Table names must include database names&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;table-names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cs1.t_8_100w"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Startup mode:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# initial = full snapshot + incremental changes&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# latest  = incremental changes only&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;startup.mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;initial&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Startup timestamp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Required when startup.mode=timestamp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# startup.timestamp&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Very important!&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# CDC client unique ID&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Example: 5400 or range 5400-6408&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Must not conflict with existing MySQL server IDs&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;server-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5400&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Stop mode&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# stop.mode&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Required when stop.mode=timestamp&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# stop.timestamp&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Database session time zone&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;server-time-zone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Asia/Shanghai"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c1"&gt;# CDC transformations must be implemented in Transform&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;transform&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 1. Field Mapping&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Alternatively, use the FieldMapper plugin&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;FieldMapper&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;field_mapper&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;id&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;user_name&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;sex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;sex&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;decimal_f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;decimal_f&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;phone_number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;phone_number&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;age&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;age&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;create_time&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;create_time&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;description&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;description&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 2. Phone number masking&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# 13812341234 -&amp;gt; 138****1234&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 3. Age conversion&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# String -&amp;gt; Integer&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 4. Gender conversion&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# 1 -&amp;gt; Male&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# 2 -&amp;gt; Female&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 5. Data filtering&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Keep only records where age &amp;gt; 25&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="c1"&gt;# 6. Address default value&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# Empty address -&amp;gt; "Unknown"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nl"&gt;sink&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;jdbc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="k"&gt;url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:mysql://ip:port/cs2"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.mysql.cj.jdbc.Driver"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;password&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zysoft"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Automatically generate insert SQL&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Can also create tables automatically&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;generate_sink_sql&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Required when generate_sink_sql=true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;cs&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t_8_100w_imp_st_qxzh_cdc_demo7_1"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Fail if schema does not exist&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Commonly used:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# CREATE_SCHEMA_WHEN_NOT_EXIST&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;schema_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR_WHEN_SCHEMA_NOT_EXIST"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# APPEND_DATA&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Keep existing data and append new records&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;data_save_mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"APPEND_DATA"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# DROP_DATA&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# Clear table before loading&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c1"&gt;# data_save_mode = "DROP_DATA"&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;batch_size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Retry attempts&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;max_retries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="c1"&gt;# Connection timeout&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;connection_check_timeout_sec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;properties&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;useUnicode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;characterEncoding&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"utf8"&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;serverTimezone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Asia/Shanghai"&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# Enable batch rewrite&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;rewriteBatchedStatements&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# Enable compression&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;useCompression&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# Disable server-side prepared statements&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;useServerPrepStmts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
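
&lt;p&gt;Steps 2 through 6 in the &lt;code&gt;transform&lt;/code&gt; block above are listed only as comments. As a rough sketch, they could be expressed with SeaTunnel’s &lt;code&gt;Sql&lt;/code&gt; transform along the following lines. This is an illustration under assumptions, not the configuration used in this job: the table names &lt;code&gt;mapped&lt;/code&gt; and &lt;code&gt;cleaned&lt;/code&gt; are hypothetical, the &lt;code&gt;address&lt;/code&gt; column is assumed to exist upstream (it does not appear in the field mapping above), &lt;code&gt;sex&lt;/code&gt; is assumed to be numeric, and the &lt;code&gt;'Unknown'&lt;/code&gt; fallback for gender is an assumption. In older releases the &lt;code&gt;plugin_input&lt;/code&gt; / &lt;code&gt;plugin_output&lt;/code&gt; options are named &lt;code&gt;source_table_name&lt;/code&gt; / &lt;code&gt;result_table_name&lt;/code&gt;, so check the option names and supported SQL functions against the documentation for your version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transform {
  Sql {
    # Hypothetical table names: "mapped" would have to be declared as the
    # output of the FieldMapper step, and "cleaned" as the input of the sink.
    plugin_input = "mapped"
    plugin_output = "cleaned"

    # 2. Phone masking:   13812341234 -&amp;gt; 138****1234 (first 3 + last 4 digits)
    # 3. Age conversion:  String -&amp;gt; Integer
    # 4. Gender:          1 -&amp;gt; Male, 2 -&amp;gt; Female (fallback value is an assumption)
    # 5. Filtering:       keep only records where age &amp;gt; 25
    # 6. Address default: NULL or empty -&amp;gt; "Unknown" (column assumed to exist)
    query = """
      SELECT
        id,
        user_name,
        CASE WHEN sex = 1 THEN 'Male'
             WHEN sex = 2 THEN 'Female'
             ELSE 'Unknown' END AS sex,
        decimal_f,
        CONCAT(LEFT(phone_number, 3), '****', RIGHT(phone_number, 4)) AS phone_number,
        CAST(age AS INT) AS age,
        create_time,
        description,
        CASE WHEN address IS NULL OR address = '' THEN 'Unknown'
             ELSE address END AS address
      FROM mapped
      WHERE CAST(age AS INT) &amp;gt; 25
    """
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping all five rules in a single query makes the row-level logic easy to review in one place. Splitting them into separate transform steps would also work if each rule needs to be toggled independently.&lt;/p&gt;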



&lt;ul&gt;
&lt;li&gt;Execution Result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After startup, the CDC job first performs the initial snapshot synchronization.&lt;/p&gt;

&lt;p&gt;In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total records read: &lt;strong&gt;1,000,009&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Total records written: &lt;strong&gt;1,000,009&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Job status: &lt;strong&gt;RUNNING&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Checkpoints continue to be generated successfully.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This indicates that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The initial full snapshot has completed successfully.&lt;/li&gt;
&lt;li&gt;The read and write record counts match, which indicates that source and target data are consistent.&lt;/li&gt;
&lt;li&gt;Checkpointing is functioning correctly.&lt;/li&gt;
&lt;li&gt;The CDC job has entered continuous Binlog monitoring mode.&lt;/li&gt;
&lt;li&gt;The streaming pipeline remains active and ready to capture new changes in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Result Demonstration
&lt;/h3&gt;

&lt;p&gt;Once the initial synchronization is complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INSERT operations are captured immediately.&lt;/li&gt;
&lt;li&gt;UPDATE operations are synchronized in real time.&lt;/li&gt;
&lt;li&gt;DELETE operations are propagated automatically.&lt;/li&gt;
&lt;li&gt;The CDC job continues running until manually stopped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, the pipeline behaves like a long-running subscription service, continuously consuming MySQL Binlog events and delivering them to downstream systems in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenwrite-whaleops.oss-cn-zhangjiakou.aliyuncs.com%2F2026%2F06%2F04%2Fdong-tu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopenwrite-whaleops.oss-cn-zhangjiakou.aliyuncs.com%2F2026%2F06%2F04%2Fdong-tu.gif" alt="动图" width="8" height="6"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apacheseatunnel</category>
      <category>mysql</category>
      <category>datascience</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>The Next Decade of Data Engineering: From Modern Data Stack to Data Engineering Harness</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 28 May 2026 09:44:46 +0000</pubDate>
      <link>https://dev.to/seatunnel/the-next-decade-of-data-engineering-from-modern-data-stack-to-data-engineering-harness-4cjo</link>
      <guid>https://dev.to/seatunnel/the-next-decade-of-data-engineering-from-modern-data-stack-to-data-engineering-harness-4cjo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c0il2r39wcvt8pkzz3n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c0il2r39wcvt8pkzz3n.jpg" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the past decade, the core evolution of data engineering has been the deconstruction and reconstruction of traditional data warehouse architectures through the Modern Data Stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We separated data ingestion from databases, forming the Data Ingestion layer, using tools like Fivetran, Airbyte, and Apache SeaTunnel to solve ELT / CDC / Reverse ETL problems;&lt;/li&gt;
&lt;li&gt;We separated compute from storage, forming cloud data warehouse and lakehouse systems such as Snowflake, Databricks, Iceberg, and Hive;&lt;/li&gt;
&lt;li&gt;We separated orchestration from scripts, leading to orchestration systems like Apache Airflow and Apache DolphinScheduler;&lt;/li&gt;
&lt;li&gt;SQL development, data modeling, lineage, data quality, BI, and AI analytics were further split into independent tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture was undoubtedly progress. It moved data engineering away from the primitive era of “a bunch of scripts + Crontab” toward cloud-native infrastructure, elastic computing, engineering governance, and open ecosystems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The greatest contribution of the Modern Data Stack was “decoupling,” and its biggest side effect was also “decoupling.”&lt;/strong&gt;&lt;br&gt;
Tools became more powerful, but data engineers were forced to switch between more systems than ever before: datasources in one place, synchronization configs in another, DAGs somewhere else, logs elsewhere, SQL stored in Git, and Snowflake / Iceberg / cloud warehouse execution results living in yet another environment.&lt;/p&gt;

&lt;p&gt;As a result, many data engineers spend less time on data modeling, business understanding, metric definitions, architecture design, and cost optimization — and far more time configuring datasources, setting field mappings, dragging DAG nodes, modifying SQL, checking logs, and rerunning tasks. This is the hidden pain created by the Modern Data Stack: &lt;strong&gt;data engineers became trapped inside tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The emergence of engineering-focused AI systems like Codex and Claude Code is now changing the entire software engineering workflow. &lt;strong&gt;But how can data engineers truly achieve Vibe Coding? That is exactly the direction I’ve been exploring, and the core topic of this article.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe future data engineering will no longer revolve around “humans operating tools.” Instead, it will evolve into: &lt;strong&gt;Codex + Data Engineering Skills &amp;amp; Harness + Data Engineering SaaS + Cloud Data Warehouse Infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the past, the Modern Data Stack assumed that humans were the operational center: humans understood tools, clicked interfaces, connected workflows, and handled context switching. But in the AI and Agentic development era, data engineering should no longer mean “humans operating a pile of tools.” Instead, humans define objectives, Codex/Claude Code decompose and implement solutions automatically, the &lt;strong&gt;Data Engineering Skill &amp;amp; Harness&lt;/strong&gt; layer provides engineering boundaries and translates them into cloud SaaS systems, Snowflake / Iceberg / cloud warehouses provide scalable compute, orchestration and synchronization engines ensure runtime stability, and humans become responsible for reviewing, governing, and making final decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once Codex and Claude Code deeply participate in data engineering, perhaps data engineers can finally be freed from the “Dirty Work” created by the Modern Data Stack, allowing data engineering to return from “tool operation” back to “engineering creation.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe this organizational transformation is inevitable in the AI and Agent era.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Problem with the Modern Data Stack: The Issue Is Not Weak Tools — It’s That Humans Spend Too Much Time Managing Complexity
&lt;/h2&gt;

&lt;p&gt;Today’s data platforms are already extremely capable. Datasource management, batch synchronization, real-time CDC, SQL development, workflow orchestration, runtime logs, alerting, auditing, and lineage analysis are all widely available. But the more features platforms add, the more complex they become. Menus multiply, configurations grow deeper, and processes become longer.&lt;/p&gt;

&lt;p&gt;Data engineers are no longer mastering tools — they are adapting themselves to tools. The once-popular &lt;strong&gt;Modern Data Stack essentially forced engineers to learn endless tools under the glamorous label of “Data Stack,” while in reality engineers became slaves to tools.&lt;/strong&gt; Engineers should control tools, not endlessly relearn fragmented ecosystems.&lt;/p&gt;

&lt;p&gt;Even a seemingly simple MySQL-to-Snowflake synchronization task may involve source schemas, target database/schema/warehouse/role settings, field type conversion, synchronization strategies, workflow dependencies, failure logs, downstream SQL, and reporting definitions. Even with the best visual tools, it still requires multiple drag-and-drop operations and configuration steps.&lt;/p&gt;
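
&lt;p&gt;To make that list concrete, here is a minimal sketch of what such a job might look like as a SeaTunnel configuration using the generic &lt;code&gt;Jdbc&lt;/code&gt; connector. Everything specific in it is a placeholder chosen for illustration: hosts, the Snowflake account, warehouse, database, schema, role, table and column names, and credentials. It covers only the connection settings for a single table. Type conversion, synchronization strategy, workflow dependencies, and downstream SQL still have to be handled elsewhere.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env {
  job.mode = "BATCH"
}

source {
  Jdbc {
    # Placeholder MySQL connection and query
    url = "jdbc:mysql://MYSQL_HOST:3306/shop"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "reader"
    password = "CHANGE_ME"
    query = "SELECT id, name, created_at FROM orders"
  }
}

sink {
  Jdbc {
    # Warehouse, database, schema, and role are passed as Snowflake JDBC URL
    # parameters; all values here are placeholders.
    url = "jdbc:snowflake://ACCOUNT.snowflakecomputing.com/?warehouse=LOAD_WH&amp;amp;db=ANALYTICS&amp;amp;schema=RAW&amp;amp;role=LOADER"
    driver = "net.snowflake.client.jdbc.SnowflakeDriver"
    user = "loader"
    password = "CHANGE_ME"
    # Explicit insert statement; target table and columns are assumed to exist
    query = "INSERT INTO ORDERS (ID, NAME, CREATED_AT) VALUES (?, ?, ?)"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;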

&lt;p&gt;The real burden is not that any single technical challenge is difficult. The real burden is excessive context switching. Datasources live in one system, task configurations in another, scheduling elsewhere, logs elsewhere, SQL in Git or local files, and Snowflake execution results in cloud environments.&lt;/p&gt;

&lt;p&gt;In the past, there was no better way, so humans had to do everything manually.&lt;/p&gt;

&lt;p&gt;But once engineering AI systems like Codex and Claude Code emerged, many of these decisions could be handled by large language models. Small, repetitive actions could be decomposed, invoked, executed, and corrected automatically based on feedback. That made the &lt;strong&gt;Data Engineering Harness&lt;/strong&gt; possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Data Engineering Harness is not simply another data platform. It is a data engineering capability framework designed specifically for AI systems and engineering agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It encapsulates datasource management, synchronization, CDC, SQL development, orchestration, log diagnostics, permission auditing, observability, cost governance, and human takeover mechanisms into engineering capabilities that Codex/Claude Code can invoke, humans can review, and enterprises can govern.&lt;/p&gt;

&lt;p&gt;In other words, the Harness is not solving the question: “Can AI write SQL?” It is solving questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After AI writes SQL, can it run safely?&lt;/li&gt;
&lt;li&gt;After AI creates tasks, can they be audited and tracked?&lt;/li&gt;
&lt;li&gt;After AI invokes Snowflake, can permissions and costs be controlled?&lt;/li&gt;
&lt;li&gt;After AI generates workflows, can humans understand, confirm, and take over?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, the value of a Data Engineering Harness is not replacing data engineers, nor simply replacing data platforms. It upgrades data engineering from “humans manually operating tools” into “humans define goals, Codex executes tasks, platforms provide boundaries, and enterprises accumulate engineering know-how.”&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Why Not Let Codex Directly Write Scripts?
&lt;/h2&gt;

&lt;p&gt;Many people ask: if Codex can write SQL, Python, and invoke command lines, why do we still need a &lt;strong&gt;Data Engineering Harness?&lt;/strong&gt; Why not simply let it connect directly to MySQL and Snowflake and generate scripts automatically?&lt;/p&gt;

&lt;p&gt;This may work in personal experiments, but it fails in enterprise data engineering.&lt;/p&gt;

&lt;p&gt;Enterprise data engineering is not simply “making a script run.” Production-grade systems require manageability, auditability, operations, collaboration, and governance. At minimum, enterprises must answer questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do we restrict Codex/Claude Code behavior across development and production environments to avoid catastrophic actions?&lt;/li&gt;
&lt;li&gt;How can runtime failures be interpreted and corrected automatically by AI?&lt;/li&gt;
&lt;li&gt;How can other people, agents, or tools understand the generated engineering workflows?&lt;/li&gt;
&lt;li&gt;Can failed tasks recover automatically through retries, checkpoint resume, or reruns?&lt;/li&gt;
&lt;li&gt;Will table modifications affect downstream systems?&lt;/li&gt;
&lt;li&gt;Can DAG dependencies be visualized?&lt;/li&gt;
&lt;li&gt;Can synchronization, ETL, and Data Mapping processes be visually represented?&lt;/li&gt;
&lt;li&gt;Who audits incidents when problems occur?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If AI generates temporary scripts every time, we simply replace “humans writing scripts” with “AI generating scripts.” Short-term productivity improves, but long-term technical debt explodes: inconsistent styles, unclear permissions, nonstandard logs, uncontrolled failures, and untraceable operations.&lt;/p&gt;

&lt;p&gt;Eventually, data engineering falls back into the “Shell + Crontab era.”&lt;/p&gt;

&lt;p&gt;That is why the future of enterprise AI data engineering is not about letting Codex run freely. It is about giving Codex clear engineering boundaries.&lt;/p&gt;

&lt;p&gt;That is the true meaning of the &lt;strong&gt;Data Engineering Harness&lt;/strong&gt;, and also the reason I designed the WhaleStudio Harness Suite. Harness does not restrict Codex or Claude Code — it makes them observable, manageable, and production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Data Engineering Harness Design Philosophy
&lt;/h2&gt;

&lt;p&gt;Future Data Engineering Harness systems will no longer be traditional human-centered development platforms. They will become Harness &amp;amp; Skill suites designed specifically for Codex, Claude Code, and Agentic development.&lt;/p&gt;

&lt;p&gt;Take WhaleStudio Harness Suite as an example. Previously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache DolphinScheduler solved orchestration problems;&lt;/li&gt;
&lt;li&gt;Apache SeaTunnel solved multi-datasource synchronization and CDC problems;&lt;/li&gt;
&lt;li&gt;WhaleStudio integrated these capabilities into an all-in-one enterprise platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in the era of large models and Codex/Claude Code, providing GUI interfaces for humans is no longer sufficient.&lt;/p&gt;

&lt;p&gt;Future systems must simultaneously allow humans to review and take over, while enabling Codex/Claude Code to invoke, debug, and receive feedback through CLI interfaces and engineering contexts.&lt;/p&gt;

&lt;p&gt;This means WhaleStudio must reorganize the core capabilities of DolphinScheduler and SeaTunnel — including orchestration, synchronization, CDC, SQL tasks, runtime execution, diagnostics, auditing, and observability — into an engineering capability layer that agents can invoke and debug, engineers can rapidly review, and enterprises can govern.&lt;/p&gt;

&lt;p&gt;This is not about adding an “AI button” or chatbot onto old platforms. It is about redesigning software interaction models around agents as primary users.&lt;/p&gt;

&lt;p&gt;From underlying engines to development feedback systems, every layer must become understandable, callable, observable, and controllable by AI systems.&lt;/p&gt;

&lt;p&gt;Future data engineering platforms will not simply be feature collections. They will become containers for enterprise data engineering know-how.&lt;/p&gt;

&lt;p&gt;Scheduling strategies, synchronization experience, SQL migration expertise, Snowflake/cloud warehouse cost optimization strategies, release workflows, and exception handling rules should all become part of Harness Memory and Skills. Codex/Claude Code should invoke not raw APIs, but proven enterprise engineering capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. UI Will Not Disappear — It Will Become an Observability &amp;amp; Fine-Tuning Interface
&lt;/h2&gt;

&lt;p&gt;Some people believe AI will make enterprise software UI irrelevant.&lt;/p&gt;

&lt;p&gt;I disagree.&lt;/p&gt;

&lt;p&gt;UI will not disappear, but its role will change. Previously, UI was the operational entry point: humans created datasources, configured tasks, dragged DAGs, scheduled workflows, and inspected logs.&lt;/p&gt;

&lt;p&gt;In the future, many actions will be completed by Codex/Claude Code. But humans must still clearly understand what the agent created, which datasources were used, which Snowflake schemas were modified, which SQL changed, whether DAG dependencies are valid, why tasks failed, whether downstream systems are impacted, and whether human takeover is needed. Teams also need to collaborate on that work.&lt;/p&gt;

&lt;p&gt;Nobody wants to read another person’s AI prompt history just to understand an engineering workflow. This creates demand for &lt;strong&gt;Observability + Fine-Tuning Interfaces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Future UI systems will no longer focus on step-by-step manual operations. Instead, they will help humans review, fine-tune, and build trust in AI-generated engineering workflows.&lt;/p&gt;

&lt;p&gt;UI should visualize execution plans, SQL diffs, DAG dependencies, runtime states, failure logs, and cost risks.&lt;/p&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLI is for Codex execution.&lt;/li&gt;
&lt;li&gt;GUI is for human review.&lt;/li&gt;
&lt;li&gt;Harness connects both worlds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best future UI may not even be static pages. It may dynamically generate review interfaces around specific engineering actions: SQL migration diffs, synchronization confirmation, DAG risk analysis, cost estimation, and deployment approvals.&lt;/p&gt;

&lt;p&gt;UI becomes the trust layer between humans and AI-generated engineering systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Future Data Engineers: From Tool Operators to Engineering Commanders
&lt;/h2&gt;

&lt;p&gt;Data engineers will not disappear. But they will diverge into two categories.&lt;/p&gt;

&lt;p&gt;One group will remain tool operators: configuring platforms, editing SQL, checking logs, and manually dragging DAGs. These skills still matter, but they will increasingly be automated by agents.&lt;/p&gt;

&lt;p&gt;The other group will move upward: understanding business goals, designing data models, governing cloud warehouse costs, understanding orchestration/synchronization/CDC relationships, and encoding team experience into Harness systems.&lt;/p&gt;

&lt;p&gt;Future elite data engineers may not be the people who know the most tools. They will be the people who best organize engineering capabilities.&lt;/p&gt;

&lt;p&gt;They will know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what can be automated;&lt;/li&gt;
&lt;li&gt;what requires human confirmation;&lt;/li&gt;
&lt;li&gt;what should become Harness rules;&lt;/li&gt;
&lt;li&gt;and what should remain human judgment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the past, data engineers revolved around tools. In the future, tools, Codex/Claude Code, and cloud capabilities will revolve around engineering goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Future of Data Engineering Is Not Humanless — Humans Finally Move to a Higher Level
&lt;/h2&gt;

&lt;p&gt;In the future, engineers who only know how to manually operate Modern Data Stack tools may become obsolete, just like developers who only know how to manually write Java code.&lt;/p&gt;

&lt;p&gt;But engineers who understand business, data engineering, cloud warehouses, AI workflows, and Harness systems will become increasingly valuable.&lt;/p&gt;

&lt;p&gt;And this is not some distant vision.&lt;/p&gt;

&lt;p&gt;In one of my experimental demos, I built an entire MySQL-to-Snowflake ETL pipeline, including automatically generated SQL orchestration, in just 10 minutes using Codex and WhaleStudio Harness.&lt;/p&gt;

&lt;p&gt;Through CLI-based capabilities, the system automatically identified datasources, created synchronization tasks, generated visual DAGs, executed workflows, inspected logs, converted SQL into Snowflake-compatible pipelines, debugged runtime failures, and corrected issues automatically.&lt;/p&gt;

&lt;p&gt;Through this demo, you can experience how future data engineers may work.&lt;/p&gt;

&lt;p&gt;The next decade of data engineering will not be about adding more tools. It will be about AI deeply integrating into tools, understanding goals, respecting boundaries, and operating under human review. And that is what Data Engineering Harness truly means.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>dataengineeringharness</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building Metadata Capabilities in Apache SeaTunnel: A Committer’s Journey</title>
      <dc:creator>Apache SeaTunnel</dc:creator>
      <pubDate>Thu, 28 May 2026 09:34:59 +0000</pubDate>
      <link>https://dev.to/seatunnel/building-metadata-capabilities-in-apache-seatunnel-a-committers-journey-o5l</link>
      <guid>https://dev.to/seatunnel/building-metadata-capabilities-in-apache-seatunnel-a-committers-journey-o5l</guid>
      <description>&lt;p&gt;Recently, Apache SeaTunnel welcomed several talented and highly motivated new Committers, and Wang Xuepeng is one of them.&lt;/p&gt;

&lt;p&gt;As a long-time contributor, Wang Xuepeng’s promotion to Committer was no coincidence. Over the years, he has quietly contributed a tremendous amount to the community, and everyone has witnessed his dedication. From first stepping into the open-source world to becoming a Committer of an Apache top-level project, he has accumulated plenty of stories and valuable insights along the way.&lt;/p&gt;

&lt;p&gt;What inspired his journey? What experiences and lessons does he want to share with the community? Let’s take a closer look at this exclusive interview with him!&lt;/p&gt;

&lt;h2&gt;
  
  
  Personal Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmdqzqq00kywshihhd2d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmdqzqq00kywshihhd2d.jpg" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interview Transcript
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;How long have you been involved in open source? What attracts you to open source?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I started getting involved in open source in 2023. What attracts me most is the sense of achievement when the code I write is actually used in the industry.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;When did you start contributing to SeaTunnel? What was the trigger?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I joined WhaleOps in 2023, which was also when I first started engaging with open source.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Now that you’ve been elected as a SeaTunnel Committer, could you summarize your contributions to the community, including both code and non-code contributions?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most of my major feature PRs have focused on building SeaTunnel’s metadata capabilities.&lt;/p&gt;

&lt;p&gt;When writing job configurations for SeaTunnel, users often have to enter datasource connection information by hand. For file-based tasks, they also have to define field mappings manually. To address these issues, I designed an SPI called &lt;code&gt;MetadataProvider&lt;/code&gt;.&lt;/p&gt;

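&lt;p&gt;The interface mainly exposes two methods: one that resolves a datasource’s connection properties from a connector identifier and a metadata datasource ID, and one that looks up a table schema from a metadata table ID:&lt;/p&gt;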

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Map&amp;lt;String, Object&amp;gt; datasourceMap(String connectorIdentifier, String metaDataDatasourceId);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Optional&amp;lt;TableSchema&amp;gt; tableSchema(String metaDataTableId);&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some users in the community have mentioned that they keep datasource usernames and passwords in Nacos with read-only access permissions. In scenarios like this, users can implement a custom metadata provider on top of their own metadata center to better protect sensitive connection information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community Contribution Summary&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PR Link&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/seatunnel/pull/5663" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/5663&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Added &lt;code&gt;save_mode&lt;/code&gt; functionality to SeaTunnel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/seatunnel/pull/10402" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10402&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Integrated Gravitino with SeaTunnel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/seatunnel/pull/10586" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10586&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Designed the metadata SPI interface for SeaTunnel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/seatunnel/pull/10657" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10657&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Enhanced the metadata SPI interface for SeaTunnel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/seatunnel/pull/10838" rel="noopener noreferrer"&gt;https://github.com/apache/seatunnel/pull/10838&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Added dynamic metadata functionality based on the metadata SPI interface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;After contributing to SeaTunnel for so long, you must have developed a deep understanding of both the project and the community. Compared with competing products, what do you think are SeaTunnel’s strengths and weaknesses? What keeps you actively involved in the community?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One major advantage of SeaTunnel is the flexibility of its engine choices. Teams already familiar with Flink or Spark can adopt it with a very low learning curve. For lightweight data synchronization scenarios, the Zeta engine is an even better choice.&lt;/p&gt;

&lt;p&gt;As for weaknesses, I think the web platform still has a lot of room for improvement.&lt;/p&gt;

&lt;p&gt;What attracts me most to the SeaTunnel community is the opportunity to discuss implementation solutions with talented contributors from different technical fields. It helps me improve my own skills while broadening my perspective.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Have you ever done any secondary development based on SeaTunnel’s shortcomings? Have you contributed those improvements back to the community? Could you briefly introduce your solution?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Yes, I’ve done secondary development for SeaTunnel. Most of the time, when I run into a bug while using it, I first fix it in our company repository and then submit the same fix back to the open-source community.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;What kind of support do you hope the SeaTunnel community can provide for your personal growth in the future?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As long as people actively participate in community discussions, whether by creating issues, submitting PRs, or reviewing PRs, they will definitely improve their technical abilities.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;&lt;strong&gt;What does the Committer role mean to you? What responsibilities should a Committer take on within the community?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I believe a Committer should first ensure code quality. Second, Committers should help guide the community in a positive direction, for example by mentoring newcomers on how to submit PRs.&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;&lt;strong&gt;Now that you’ve become a Committer, what would you like to say to the community? Do you have any suggestions for the project’s future development?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First of all, I’m very happy to have become a Committer. It means becoming part of the Apache Software Foundation, which is truly valuable to me, both as an identity and as an experience. I also want to thank all the community members who guided and helped me along the way.&lt;/p&gt;

&lt;ol start="9"&gt;
&lt;li&gt;&lt;strong&gt;What are your future plans in the community to further promote the project’s development?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will continue contributing in the metadata area, and in the future, I plan to expand further into data lineage capabilities.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
