In real-time data warehouses, data lake ingestion, and distributed data replication, CDC (Change Data Capture) has become a standard capability of modern data pipelines. Whether you are building a real-time warehouse on StarRocks, Doris, or ClickHouse, writing real-time change events into Iceberg, Paimon, or Hudi, or synchronizing data across databases, CDC is the core foundation.
The examples below use MySQL CDC.
There is one common CDC question that is often overlooked:
Why must a MySQL CDC UPDATE event output two records — one BEFORE and one AFTER?
Why can’t it output only the final new value?
If we only have AFTER, can’t the system still sync correctly?
At first glance, it seems possible. But once you dive into consistency, idempotency, replayability, primary key handling, data lake merge semantics, and distributed out-of-order recovery, you discover this truth:
Splitting UPDATE into BEFORE + AFTER is not a “format choice” — it is fundamental to the correctness semantics of CDC.
This article analyzes, from MySQL Binlog internals to CDC parsing frameworks (using SeaTunnel as the main example) and finally to data lake update mechanisms, why UPDATE must contain both BEFORE and AFTER images — and what real production issues arise if BEFORE is missing.
Structure of this article:
- The real structure of UPDATE in MySQL Binlog
- Why CDC cannot correctly represent UPDATE with only one record
- Why AFTER alone cannot guarantee consistency in distributed environments
- SeaTunnel’s CDC architecture explained
- How data lakes and warehouses consume BEFORE
- Real production case studies
- Final takeaways
Whether you work in CDC, data platforms, real-time pipelines, data warehousing, or database systems, this article will help you understand CDC semantics at a deeper engineering level.
1. UPDATE in MySQL Binlog Is Not “a Single Record”
Many assume that MySQL writes UPDATE into the binlog as a single record. This is a common misunderstanding.
Let’s look at MySQL ROW-based binlog format.
When you execute:
update t set price = 200 where id = 1;
The binlog does not write:
id=1, price=200
Instead, it writes a structure like:
update_rows_event {
before_image: {id:1, price:100}
after_image: {id:1, price:200}
}
In other words, MySQL internally treats UPDATE as:
A pair: old values (before) and new values (after).
Why does MySQL do this? The reason is simple:
An UPDATE is a transition from an old state to a new state. Only BEFORE + AFTER can fully describe this transition.
Without both images, you cannot correctly replay, restore, roll back, or verify a transaction from a database perspective.
In short:
MySQL Binlog’s structure fundamentally determines that CDC must produce two records.
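As a sketch, the event pair can be modeled like this (simplified Python, not MySQL's actual on-disk format):

```python
from dataclasses import dataclass

@dataclass
class UpdateRowsEvent:
    """Simplified model of a ROW-format binlog UPDATE event."""
    before_image: dict  # full row values before the UPDATE
    after_image: dict   # full row values after the UPDATE

# The statement `update t set price = 200 where id = 1;` is logged as:
event = UpdateRowsEvent(
    before_image={"id": 1, "price": 100},
    after_image={"id": 1, "price": 200},
)
```

Every scenario in the next section follows from the fact that both images travel together in this one event.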
2. If CDC Only Outputs AFTER, Many Core Scenarios Break Down
Many junior engineers instinctively think:
“Isn’t sending just the latest value enough?”
Let’s illustrate with six typical production scenarios where AFTER-only breaks the entire pipeline.
Scenario 1: Cannot detect whether an actual update happened (duplicate writes, invalid merges)
If you run:
update t set price = 200 where id=1;
But the original price was already 200.
If CDC only sends AFTER:
{id=1, price=200}
Downstream systems cannot determine:
- Whether the update actually changed data
- Whether the data lake needs a write
- Whether to trigger a merge
- Whether metrics should be recomputed
For example, Iceberg merge operations are expensive. Triggering unnecessary merges wastes compute resources.
In financial scenarios involving trades, balances, and risk indicators, distinguishing real updates is critical.
This requires BEFORE.
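With both images available, a no-op update is trivial to detect. A minimal sketch of the check a downstream consumer can perform:

```python
def is_real_change(before: dict, after: dict) -> bool:
    """An UPDATE only changed data if some column value differs."""
    return before != after

# price was already 200: no lake write, merge, or metric refresh needed
assert is_real_change({"id": 1, "price": 200}, {"id": 1, "price": 200}) is False
# price moved from 100 to 200: this is a real update
assert is_real_change({"id": 1, "price": 100}, {"id": 1, "price": 200}) is True
```

Without the BEFORE image, both cases look identical to the consumer.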
Scenario 2: Primary key updates cannot be processed (cross-database sync breaks)
Example:
update user set id=2 where id=1;
If only AFTER is sent:
{id=2}
Downstream systems cannot know the old primary key value and therefore cannot delete the old record.
This leads to:
- Failure to remove id=1
- Duplicate records
- Unique key violations
This is one of the most common disasters in cross-database real-time replication.
Only BEFORE provides the old primary key.
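A sketch of how a sink keyed by `id` handles a primary key change, using a plain dict as a stand-in for the downstream store:

```python
def apply_update(store: dict, before: dict, after: dict, key: str = "id") -> None:
    """Apply a CDC UPDATE to a keyed store.

    The BEFORE image tells us which row to remove, which is essential
    when the primary key itself changed.
    """
    store.pop(before[key], None)   # delete the row under the OLD key
    store[after[key]] = after      # write the row under the NEW key

store = {1: {"id": 1, "name": "alice"}}
apply_update(store, before={"id": 1, "name": "alice"},
             after={"id": 2, "name": "alice"})
# old key 1 is gone; no duplicate or unique-key violation downstream
```

With AFTER only, the `store.pop` step has no key to act on, and the stale row under `id=1` lingers forever.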
Scenario 3: Without a primary key, the system cannot locate the correct row
Given duplicate rows:
| name | score |
|------|-------|
| A    | 100   |
| A    | 200   |
Executing:
update t set score=300 where name='A';
If CDC only sends AFTER:
A, 300
A, 300
Downstream cannot know:
Which 300 comes from 100?
Which 300 comes from 200?
Without BEFORE, it is impossible to map updates correctly.
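With a full BEFORE image, the consumer can still identify the exact physical row each event applies to. A sketch using a plain list as the row store:

```python
def apply_update_no_pk(rows: list, before: dict, after: dict) -> None:
    """Without a primary key, the full BEFORE image is the only way
    to select the exact row being rewritten."""
    rows.remove(before)   # removes the first row equal to the BEFORE image
    rows.append(after)

rows = [{"name": "A", "score": 100}, {"name": "A", "score": 200}]
# The CDC event for the first row carries its own BEFORE image:
apply_update_no_pk(rows, {"name": "A", "score": 100}, {"name": "A", "score": 300})
```

With two bare `A, 300` AFTER records, `rows.remove` would have nothing to distinguish the duplicates by.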
Scenario 4: Exactly-Once guarantees break (idempotency failure)
CDC systems regularly resend events due to:
- Distributed recovery
- Network retries
- Checkpoint replay
- Consumer restarts
With AFTER-only, you cannot distinguish duplicates from real updates.
This breaks the fundamental idempotency guarantees of streaming systems.
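One common defense is compare-and-swap semantics: apply the event only if the current row still equals the BEFORE image, so a replayed duplicate becomes a harmless no-op. A minimal sketch, assuming `id` is the key:

```python
def apply_idempotent(state: dict, before: dict, after: dict,
                     key: str = "id") -> bool:
    """Compare-and-swap apply: the transition fires only if the row
    still matches the BEFORE image, making replays safe."""
    if state.get(before[key]) != before:
        return False               # already applied, skip the duplicate
    del state[before[key]]
    state[after[key]] = after
    return True

state = {1: {"id": 1, "price": 100}}
event = ({"id": 1, "price": 100}, {"id": 1, "price": 200})
assert apply_idempotent(state, *event) is True    # first delivery applies
assert apply_idempotent(state, *event) is False   # replay is skipped
```

The guard condition is literally a comparison against BEFORE; with AFTER-only events there is nothing to compare against.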
Scenario 5: When binlog events arrive out of order, AFTER-only cannot restore correct state
MySQL multi-threaded replication, or parallel downstream consumers, can deliver events out of order.
Example:
Thread 1: 100 -> 120
Thread 2: 120 -> 200
If events arrive out of order:
AFTER=200 arrives first
AFTER=120 arrives later
Without BEFORE, you cannot determine whether 120 should override 200.
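The BEFORE value lets a consumer re-chain events into their true order, because each event's BEFORE must equal the previous event's AFTER. A small sketch over scalar values:

```python
def order_by_chain(events, start):
    """Reorder (before, after) pairs by chaining each AFTER value to
    the next event's BEFORE value. Impossible with AFTER-only events."""
    by_before = {b: (b, a) for b, a in events}
    ordered, cur = [], start
    while cur in by_before:
        b, a = by_before.pop(cur)
        ordered.append((b, a))
        cur = a
    return ordered

events = [(120, 200), (100, 120)]   # arrived out of order
assert order_by_chain(events, start=100) == [(100, 120), (120, 200)]
```

Given only the AFTER values 200 and 120, both orderings are equally plausible and the final state is a coin flip.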
Scenario 6: Data lakes (Paimon, Iceberg, Hudi) require BEFORE to perform deletes
Data lake updates typically work as:
DELETE old_row
INSERT new_row
Example:
DELETE WHERE id=1 AND price=100
INSERT {id:1, price:200}
DELETE must match exactly the BEFORE image.
Therefore:
Missing BEFORE → UPDATE cannot execute correctly → Data becomes inconsistent.
In financial scenarios, this directly affects regulatory compliance.
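A sketch of this DELETE + INSERT expansion, using a plain list of dicts as a stand-in for a data file:

```python
def apply_update_to_lake(rows: list, before: dict, after: dict) -> None:
    """Apply one CDC UPDATE as DELETE(before) + INSERT(after).

    The DELETE predicate is built from the BEFORE image; with only
    the AFTER image there is nothing to match the old row against.
    """
    # DELETE WHERE id=1 AND price=100
    rows[:] = [row for row in rows if row != before]
    # INSERT {id:1, price:200}
    rows.append(after)

table = [{"id": 1, "price": 100}, {"id": 2, "price": 50}]
apply_update_to_lake(table, {"id": 1, "price": 100}, {"id": 1, "price": 200})
```

Real lake formats express the delete differently (Iceberg uses delete files, Paimon uses LSM-style merges), but all of them need the old row's values or key, which only BEFORE carries.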
3. Why SeaTunnel Must Fully Support BEFORE & AFTER
SeaTunnel’s CDC model is built on Debezium’s log parsing and uses four RowKinds:
- INSERT
- DELETE
- UPDATE_BEFORE
- UPDATE_AFTER
Thus, MySQL-CDC Source in SeaTunnel outputs:
- UPDATE_BEFORE
- UPDATE_AFTER
These two events are tightly coupled, ensuring:
- Replayability in distributed environments
- Recoverability
- Order preservation
- Compatibility with data lake merge semantics
SeaTunnel’s processing flow:
MySQL Binlog (ROW)
|
UpdateRowsEvent (before, after)
|
SeaTunnel MySQL-CDC Parser
|
--------------------------------
| |
UPDATE_BEFORE UPDATE_AFTER
(old row) (new row)
The sink then decides the operation based on RowKind:
UPDATE_BEFORE → Delete
UPDATE_AFTER → Insert
This behavior is consistent across OLAP databases (Doris, StarRocks), data lakes (Iceberg, Paimon), and messaging systems (Kafka).
Without BEFORE, the whole pipeline fails.
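A minimal Python sketch of this split and the sink-side mapping (SeaTunnel itself implements this in Java; the enum here only mirrors the four RowKind names):

```python
from enum import Enum

class RowKind(Enum):
    INSERT = "INSERT"
    DELETE = "DELETE"
    UPDATE_BEFORE = "UPDATE_BEFORE"
    UPDATE_AFTER = "UPDATE_AFTER"

def split_update(before: dict, after: dict) -> list:
    """One binlog UPDATE fans out into a tightly coupled pair of rows."""
    return [(RowKind.UPDATE_BEFORE, before), (RowKind.UPDATE_AFTER, after)]

def sink_operation(kind: RowKind) -> str:
    """The sink chooses the physical operation from the RowKind alone."""
    return {
        RowKind.INSERT: "INSERT",
        RowKind.DELETE: "DELETE",
        RowKind.UPDATE_BEFORE: "DELETE",
        RowKind.UPDATE_AFTER: "INSERT",
    }[kind]
```

This is why the behavior is uniform across Doris, StarRocks, Iceberg, Paimon, and Kafka sinks: each one only needs the RowKind plus the row to decide what to do.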
4. Why Data Lakes Depend Heavily on BEFORE
Iceberg, Paimon, and Hudi support ACID semantics, but UPDATE in these systems is inherently a composite operation:
- Locate the old data file
- Delete the old record
- Write the new record into a new file
Diagram:
UPDATE event
|
--------------------------------------
| |
DELETE old_row INSERT new_row
The lake cannot infer DELETE keys from AFTER alone.
Therefore:
Missing BEFORE → UPDATE cannot be applied → Inconsistent data.
This is especially critical in financial sectors such as trading, balance tracking, and position management.
5. Production Case Studies
Two real examples where missing BEFORE caused real damage.
Case 1: Duplicate customer records due to primary key update
A gaming company built its own CDC system and captured only AFTER.
When users updated their phone numbers (part of a composite primary key), the downstream system produced duplicate customer records.
Reason:
The downstream store could not know the old primary key.
Case 2: Iceberg merge failures
A financial institution ingested data into Iceberg using AFTER-only CDC.
DELETE operations could not match old records, leading to large amounts of inconsistent data.
They later rebuilt the entire pipeline with BEFORE included.
6. Theoretical Foundation: The Mathematical Model of CDC UPDATE
A transaction T transitions a row from state A to state B.
To represent this in CDC, the system must output:
(A, B)
If only B is output, CDC loses the ability to:
- Detect if T actually happened
- Ensure idempotency
- Restore the correct execution sequence
- Support distributed replay
- Perform delta calculations
- Validate correctness
CDC is designed to convert database transactions into an event stream that can reconstruct state downstream.
Without BEFORE, the stream is incomplete.
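The model can be exercised directly: a sketch that replays (A, B) pairs, validating each transition against the current state before advancing:

```python
def replay(pairs, initial):
    """Replay (A, B) transition pairs, reconstructing every state and
    verifying each event's A against the current value."""
    state, history = initial, [initial]
    for a, b in pairs:
        assert state == a, "stream is corrupt or out of order"
        state = b
        history.append(state)
    return history

# Two chained UPDATEs: 100 -> 120 -> 200
assert replay([(100, 120), (120, 200)], initial=100) == [100, 120, 200]
```

A B-only stream can at best overwrite state blindly; it can never assert that each step actually followed from the previous one.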
7. Final Takeaway: Two Records for UPDATE Is an Engineering Necessity
Why must UPDATE produce two records?
Because CDC must support:
Replayability
Recovery
State reconstruction
Out-of-order tolerance
Idempotency
Validation
Data lake merge semantics
Primary key updates
All of these require:
UPDATE = BEFORE + AFTER
A single AFTER can never represent a complete update.
From MySQL Binlog to Debezium and modern data integration tools like Apache SeaTunnel, the entire industry aligns on this logic — not by choice, but by necessity.