In real-time data warehouses, data lake ingestion, and distributed data replication, CDC (Change Data Capture) has become a standard capability of modern data pipelines. Whether you are building a real-time warehouse on StarRocks, Doris, or ClickHouse, writing real-time change events into Iceberg, Paimon, or Hudi, or synchronizing data across databases, CDC is the core foundation.
The examples below use MySQL CDC.
There is one common CDC question that is often overlooked:
Why must a MySQL CDC UPDATE event output two records — one BEFORE and one AFTER?
Why can’t it output only the final new value?
If we only have AFTER, can’t the system still sync correctly?
At first glance, it seems possible. But once you dive into consistency, idempotency, replayability, primary key handling, data lake merge semantics, and distributed out-of-order recovery, you discover this truth:
Splitting UPDATE into BEFORE + AFTER is not a “format choice” — it is fundamental to the correctness semantics of CDC.
This article analyzes, from MySQL Binlog internals to CDC parsing frameworks (using SeaTunnel as the main example) and finally to data lake update mechanisms, why UPDATE must contain both BEFORE and AFTER images — and what real production issues arise if BEFORE is missing.
Structure of this article:
- The real structure of UPDATE in MySQL Binlog
- Why CDC cannot correctly represent UPDATE with only one record
- Why AFTER alone cannot guarantee consistency in distributed environments
- SeaTunnel’s CDC architecture explained
- How data lakes and warehouses consume BEFORE
- Real production case studies
- Final takeaways
Whether you work in CDC, data platforms, real-time pipelines, data warehousing, or database systems, this article will help you understand CDC semantics at a deeper engineering level.
1. UPDATE in MySQL Binlog Is Not “a Single Record”
Many assume that MySQL writes UPDATE into the binlog as a single record. This is a common misunderstanding.
Let’s look at MySQL ROW-based binlog format.
When you execute:
update t set price = 200 where id = 1;
The binlog does not write:
id=1, price=200
Instead, it writes a structure like:
update_rows_event {
before_image: {id:1, price:100}
after_image: {id:1, price:200}
}
In other words, MySQL internally treats UPDATE as:
A pair: old values (before) and new values (after).
Why does MySQL do this? The reason is simple:
An UPDATE is a transition from an old state to a new state. Only BEFORE + AFTER can fully describe this transition.
Without both images, you cannot correctly replay, restore, roll back, or verify a transaction from a database perspective.
In short:
MySQL Binlog’s structure fundamentally determines that CDC must produce two records.
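As a sketch, the event pair can be modeled like this (simplified Python, not MySQL's actual on-disk format):

```python
from dataclasses import dataclass

@dataclass
class UpdateRowsEvent:
    """Simplified model of a ROW-format binlog UPDATE event."""
    before_image: dict  # full row values before the UPDATE
    after_image: dict   # full row values after the UPDATE

# The statement `update t set price = 200 where id = 1;` is logged as:
event = UpdateRowsEvent(
    before_image={"id": 1, "price": 100},
    after_image={"id": 1, "price": 200},
)
```

Every scenario in the next section follows from the fact that both images travel together in this one event.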
2. If CDC Only Outputs AFTER, Many Core Scenarios Break Down
Many junior engineers instinctively think:
“Isn’t sending just the latest value enough?”
Let’s illustrate with six typical production scenarios where AFTER-only breaks the entire pipeline.
Scenario 1: Cannot detect whether an actual update happened (duplicate writes, invalid merges)
If you run:
update t set price = 200 where id=1;
But the original price was already 200.
If CDC only sends AFTER:
{id=1, price=200}
Downstream systems cannot determine:
- Whether the update actually changed data
- Whether the data lake needs a write
- Whether to trigger a merge
- Whether metrics should be recomputed
For example, Iceberg merge operations are expensive. Triggering unnecessary merges wastes compute resources.
In financial scenarios involving trades, balances, and risk indicators, distinguishing real updates is critical.
This requires BEFORE.
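With both images available, a no-op update is trivial to detect. A minimal sketch of the check a downstream consumer can perform:

```python
def is_real_change(before: dict, after: dict) -> bool:
    """An UPDATE only changed data if some column value differs."""
    return before != after

# price was already 200: no lake write, merge, or metric refresh needed
assert is_real_change({"id": 1, "price": 200}, {"id": 1, "price": 200}) is False
# price moved from 100 to 200: this is a real update
assert is_real_change({"id": 1, "price": 100}, {"id": 1, "price": 200}) is True
```

Without the BEFORE image, both cases look identical to the consumer.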
Scenario 2: Primary key updates cannot be processed (cross-database sync breaks)
Example:
update user set id=2 where id=1;
If only AFTER is sent:
{id=2}
Downstream systems cannot know the old primary key value and therefore cannot delete the old record.
This leads to:
- Failure to remove id=1
- Duplicate records
- Unique key violations
This is one of the most common disasters in cross-database real-time replication.
Only BEFORE provides the old primary key.
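A sketch of how a sink keyed by `id` handles a primary key change, using a plain dict as a stand-in for the downstream store:

```python
def apply_update(store: dict, before: dict, after: dict, key: str = "id") -> None:
    """Apply a CDC UPDATE to a keyed store.

    The BEFORE image tells us which row to remove, which is essential
    when the primary key itself changed.
    """
    store.pop(before[key], None)   # delete the row under the OLD key
    store[after[key]] = after      # write the row under the NEW key

store = {1: {"id": 1, "name": "alice"}}
apply_update(store, before={"id": 1, "name": "alice"},
             after={"id": 2, "name": "alice"})
# old key 1 is gone; no duplicate or unique-key violation downstream
```

With AFTER only, the `store.pop` step has no key to act on, and the stale row under `id=1` lingers forever.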
Scenario 3: Without a primary key, the system cannot locate the correct row
Given duplicate rows:
| name | score |
|------|-------|
| A    | 100   |
| A    | 200   |
Executing:
update t set score=300 where name='A';
If CDC only sends AFTER:
A, 300
A, 300
Downstream cannot know:
Which 300 comes from 100?
Which 300 comes from 200?
Without BEFORE, it is impossible to map updates correctly.
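With a full BEFORE image, the consumer can still identify the exact physical row each event applies to. A sketch using a plain list as the row store:

```python
def apply_update_no_pk(rows: list, before: dict, after: dict) -> None:
    """Without a primary key, the full BEFORE image is the only way
    to select the exact row being rewritten."""
    rows.remove(before)   # removes the first row equal to the BEFORE image
    rows.append(after)

rows = [{"name": "A", "score": 100}, {"name": "A", "score": 200}]
# The CDC event for the first row carries its own BEFORE image:
apply_update_no_pk(rows, {"name": "A", "score": 100}, {"name": "A", "score": 300})
```

With two bare `A, 300` AFTER records, `rows.remove` would have nothing to distinguish the duplicates by.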
Scenario 4: Exactly-Once guarantees break (idempotency failure)
CDC systems regularly resend events due to:
- Distributed recovery
- Network retries
- Checkpoint replay
- Consumer restarts
With AFTER-only, you cannot distinguish duplicates from real updates.
This breaks the fundamental idempotency guarantees of streaming systems.
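One common defense is compare-and-swap semantics: apply the event only if the current row still equals the BEFORE image, so a replayed duplicate becomes a harmless no-op. A minimal sketch, assuming `id` is the key:

```python
def apply_idempotent(state: dict, before: dict, after: dict,
                     key: str = "id") -> bool:
    """Compare-and-swap apply: the transition fires only if the row
    still matches the BEFORE image, making replays safe."""
    if state.get(before[key]) != before:
        return False               # already applied, skip the duplicate
    del state[before[key]]
    state[after[key]] = after
    return True

state = {1: {"id": 1, "price": 100}}
event = ({"id": 1, "price": 100}, {"id": 1, "price": 200})
assert apply_idempotent(state, *event) is True    # first delivery applies
assert apply_idempotent(state, *event) is False   # replay is skipped
```

The guard condition is literally a comparison against BEFORE; with AFTER-only events there is nothing to compare against.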
Scenario 5: When binlog events arrive out of order, AFTER-only cannot restore correct state
MySQL multi-threaded replication, or parallel downstream consumers, can deliver events out of order.
Example:
Thread 1: 100 -> 120
Thread 2: 120 -> 200
If events arrive out of order:
AFTER=200 arrives first
AFTER=120 arrives later
Without BEFORE, you cannot determine whether 120 should override 200.
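The BEFORE value lets a consumer re-chain events into their true order, because each event's BEFORE must equal the previous event's AFTER. A small sketch over scalar values:

```python
def order_by_chain(events, start):
    """Reorder (before, after) pairs by chaining each AFTER value to
    the next event's BEFORE value. Impossible with AFTER-only events."""
    by_before = {b: (b, a) for b, a in events}
    ordered, cur = [], start
    while cur in by_before:
        b, a = by_before.pop(cur)
        ordered.append((b, a))
        cur = a
    return ordered

events = [(120, 200), (100, 120)]   # arrived out of order
assert order_by_chain(events, start=100) == [(100, 120), (120, 200)]
```

Given only the AFTER values 200 and 120, both orderings are equally plausible and the final state is a coin flip.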
Scenario 6: Data lakes (Paimon, Iceberg, Hudi) require BEFORE to perform deletes
Data lake updates typically work as:
DELETE old_row
INSERT new_row
Example:
DELETE WHERE id=1 AND price=100
INSERT {id:1, price:200}
DELETE must match exactly the BEFORE image.
Therefore:
Missing BEFORE → UPDATE cannot execute correctly → Data becomes inconsistent.
In financial scenarios, this directly affects regulatory compliance.
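A sketch of this DELETE + INSERT expansion, using a plain list of dicts as a stand-in for a data file:

```python
def apply_update_to_lake(rows: list, before: dict, after: dict) -> None:
    """Apply one CDC UPDATE as DELETE(before) + INSERT(after).

    The DELETE predicate is built from the BEFORE image; with only
    the AFTER image there is nothing to match the old row against.
    """
    # DELETE WHERE id=1 AND price=100
    rows[:] = [row for row in rows if row != before]
    # INSERT {id:1, price:200}
    rows.append(after)

table = [{"id": 1, "price": 100}, {"id": 2, "price": 50}]
apply_update_to_lake(table, {"id": 1, "price": 100}, {"id": 1, "price": 200})
```

Real lake formats express the delete differently (Iceberg uses delete files, Paimon uses LSM-style merges), but all of them need the old row's values or key, which only BEFORE carries.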
3. Why SeaTunnel Must Fully Support BEFORE & AFTER
SeaTunnel’s CDC model is built on Debezium’s log parsing and uses four RowKinds:
- INSERT
- DELETE
- UPDATE_BEFORE
- UPDATE_AFTER
Thus, MySQL-CDC Source in SeaTunnel outputs:
- UPDATE_BEFORE
- UPDATE_AFTER
These two events are tightly coupled, ensuring:
- Replayability in distributed environments
- Recoverability
- Order preservation
- Compatibility with data lake merge semantics
SeaTunnel’s processing flow:
MySQL Binlog (ROW)
|
UpdateRowsEvent (before, after)
|
SeaTunnel MySQL-CDC Parser
|
--------------------------------
| |
UPDATE_BEFORE UPDATE_AFTER
(old row) (new row)
The sink then decides the operation based on RowKind:
UPDATE_BEFORE → Delete
UPDATE_AFTER → Insert
This behavior is consistent across OLAP databases (Doris, StarRocks), data lakes (Iceberg, Paimon), and messaging systems (Kafka).
Without BEFORE, the whole pipeline fails.
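A minimal Python sketch of this split and the sink-side mapping (SeaTunnel itself implements this in Java; the enum here only mirrors the four RowKind names):

```python
from enum import Enum

class RowKind(Enum):
    INSERT = "INSERT"
    DELETE = "DELETE"
    UPDATE_BEFORE = "UPDATE_BEFORE"
    UPDATE_AFTER = "UPDATE_AFTER"

def split_update(before: dict, after: dict) -> list:
    """One binlog UPDATE fans out into a tightly coupled pair of rows."""
    return [(RowKind.UPDATE_BEFORE, before), (RowKind.UPDATE_AFTER, after)]

def sink_operation(kind: RowKind) -> str:
    """The sink chooses the physical operation from the RowKind alone."""
    return {
        RowKind.INSERT: "INSERT",
        RowKind.DELETE: "DELETE",
        RowKind.UPDATE_BEFORE: "DELETE",
        RowKind.UPDATE_AFTER: "INSERT",
    }[kind]
```

This is why the behavior is uniform across Doris, StarRocks, Iceberg, Paimon, and Kafka sinks: each one only needs the RowKind plus the row to decide what to do.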
4. Why Data Lakes Depend Heavily on BEFORE
Iceberg, Paimon, and Hudi support ACID semantics, but UPDATE in these systems is inherently a composite operation:
- Locate the old data file
- Delete the old record
- Write the new record into a new file
Diagram:
UPDATE event
|
--------------------------------------
| |
DELETE old_row INSERT new_row
The lake cannot infer DELETE keys from AFTER alone.
Therefore:
Missing BEFORE → UPDATE cannot be applied → Inconsistent data.
This is especially critical in financial sectors such as trading, balance tracking, and position management.
5. Production Case Studies
Two real examples where missing BEFORE caused real damage.
Case 1: Duplicate customer records due to primary key update
A gaming company built its own CDC system and captured only AFTER.
When users updated their phone numbers (part of a composite primary key), the downstream system produced duplicate customer records.
Reason:
The downstream store could not know the old primary key.
Case 2: Iceberg merge failures
A financial institution ingested data into Iceberg using AFTER-only CDC.
DELETE operations could not match old records, leading to large amounts of inconsistent data.
They later rebuilt the entire pipeline with BEFORE included.
6. Theoretical Foundation: The Mathematical Model of CDC UPDATE
A transaction T transitions a row from state A to state B.
To represent this in CDC, the system must output:
(A, B)
If only B is output, CDC loses the ability to:
- Detect if T actually happened
- Ensure idempotency
- Restore the correct execution sequence
- Support distributed replay
- Perform delta calculations
- Validate correctness
CDC is designed to convert database transactions into an event stream that can reconstruct state downstream.
Without BEFORE, the stream is incomplete.
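The model can be exercised directly: a sketch that replays (A, B) pairs, validating each transition against the current state before advancing:

```python
def replay(pairs, initial):
    """Replay (A, B) transition pairs, reconstructing every state and
    verifying each event's A against the current value."""
    state, history = initial, [initial]
    for a, b in pairs:
        assert state == a, "stream is corrupt or out of order"
        state = b
        history.append(state)
    return history

# Two chained UPDATEs: 100 -> 120 -> 200
assert replay([(100, 120), (120, 200)], initial=100) == [100, 120, 200]
```

A B-only stream can at best overwrite state blindly; it can never assert that each step actually followed from the previous one.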
7. Final Takeaway: Two Records for UPDATE Is an Engineering Necessity
Why must UPDATE produce two records?
Because CDC must support:
Replayability
Recovery
State reconstruction
Out-of-order tolerance
Idempotency
Validation
Data lake merge semantics
Primary key updates
All of these require:
UPDATE = BEFORE + AFTER
A single AFTER can never represent a complete update.
From MySQL Binlog to Debezium and modern data integration tools like Apache SeaTunnel, the entire industry aligns on this logic — not by choice, but by necessity.