DEV Community

Apache SeaTunnel

OSPP Project Outcome: Supporting Flink Engine CDC Source Schema Evolution

As the OSPP program draws to a close, after more than half a year of dedicated development, contributors to the Apache SeaTunnel project have achieved fruitful results.

Today, let’s focus on the project that implemented CDC source schema evolution support on the Flink engine of Apache SeaTunnel.

From the project’s initial concept to its gradual development and final completion, we’ll explore the full journey behind this achievement.

Next, through an interview, let’s step into the open-source world of this developer from the University of Science and Technology Beijing and see how she balanced demanding coursework with successfully completing this project!

Personal Introduction

  • Project Mentor: Lucifer Tyrant
  • Name: Dong Jiaxin
  • School & Major: University of Science and Technology Beijing – Big Data Management and Application
  • GitHub ID: 147227543
  • Research Interests: Big data platform development; previously worked on data platform development at Kuaishou and Meituan
  • Hobbies: Reading technical documentation, experimenting with new technology stacks, and reading novels


Project Title

  • Schema Evolution Support for CDC Sources on Flink Engine

Project Background

In real-time data synchronization scenarios, source-table schema changes — such as adding columns or modifying column types — are common.
Apache SeaTunnel already supports CDC schema evolution on its native engine, but this feature has not yet been implemented on the Flink engine.

As a result, when using Flink for CDC synchronization, users must restart jobs whenever a schema change occurs, which severely impacts data synchronization continuity and stability.

Implementation Approach

My implementation inspiration mainly came from the Flink CDC project.

After studying Flink CDC’s schema-evolution design and combining it with SeaTunnel’s architecture, I designed a schema-evolution solution tailored for the Flink engine.

(Sequence diagram of the schema-change coordination flow)

The core architecture design includes the following key components:

  1. SchemaCoordinator
  • Role: The central coordinator of the entire solution, responsible for global schema-change state management and synchronization
  • Implementation details:

    • Maintains a schemaChangeStates mapping table to record each table’s schema-change status
    • Tracks schema versions for each table through schemaVersions
    • Uses a ReentrantLock to ensure thread safety for concurrent schema-change requests
    • Maintains a pendingRequests queue to manage pending CompletableFutures awaiting schema changes
  2. SchemaOperator
  • Role: A specialized operator inserted between the CDC Source and Sink to intercept and process schema-change events
  • Implementation details:

    • Detects SchemaChangeEvents in processElement()
    • Handles schema-change flow via processSchemaChangeEvent()
    • Maintains currentSchemaChangeFuture to support schema-change cancellation and rollback
    • Uses lastProcessedEventTime to prevent duplicate processing of old events
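The two components above can be sketched roughly as follows. This is a minimal illustration, not the actual SeaTunnel implementation: the field names (`schemaChangeStates`, `schemaVersions`, `pendingRequests`) come from the description above, while the method signatures, the `State` enum, and the `String` response type are assumptions made for the sketch.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the coordinator described above (illustrative only).
public class SchemaCoordinatorSketch {
    enum State { PENDING, APPLIED }

    private final Map<String, State> schemaChangeStates = new ConcurrentHashMap<>();
    private final Map<String, Integer> schemaVersions = new ConcurrentHashMap<>();
    private final Map<String, CompletableFuture<String>> pendingRequests = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    /** Registers the change state first, then hands back a future the
     *  SchemaOperator waits on until the sink reports a successful flush. */
    public CompletableFuture<String> requestSchemaChange(String tableId) {
        lock.lock();
        try {
            schemaChangeStates.put(tableId, State.PENDING);
            CompletableFuture<String> future = new CompletableFuture<>();
            pendingRequests.put(tableId, future);
            return future;
        } finally {
            lock.unlock();
        }
    }

    /** Called by the sink side after it has flushed all pre-change data. */
    public void notifyFlushSuccess(String tableId) {
        lock.lock();
        try {
            if (schemaChangeStates.get(tableId) == null) {
                // Corresponds to the "No schema change state found" warning:
                // the flush notification outran the state registration.
                return;
            }
            schemaChangeStates.put(tableId, State.APPLIED);
            schemaVersions.merge(tableId, 1, Integer::sum);
            CompletableFuture<String> future = pendingRequests.remove(tableId);
            if (future != null) {
                future.complete("OK");
            }
        } finally {
            lock.unlock();
        }
    }

    public int versionOf(String tableId) {
        return schemaVersions.getOrDefault(tableId, 0);
    }
}
```

The key invariant is that `requestSchemaChange` registers the table’s state before any flush notification can arrive, which is exactly the ordering issue discussed in the next section.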

Key Problem and Solution

During development, I encountered a tricky issue:
When processing schema-change events in processElement(), the job would hang — no further data was processed, only continuous checkpointing.

After analyzing logs, I found the root cause:

2025-08-17 12:33:36,597 INFO  FlinkSinkWriter - FlinkSinkWriter handled FlushEvent for table: .schema_test.products
2025-08-17 12:33:36,597 INFO  SchemaOperator - FlushEvent sent to downstream for table: .schema_test.products
2025-08-17 12:33:36,597 INFO  SchemaCoordinator - Processing schema change for table: .schema_test.products
2025-08-17 12:33:36,598 WARN  SchemaCoordinator - No schema change state found for table: .schema_test.products

As the logs show, a FlushEvent was sent downstream first.

After the FlinkSinkWriter processed it and tried to notify the SchemaCoordinator, the SchemaCoordinator hadn’t yet initialized the schema-change state (because the coordinator code hadn’t executed yet).

This caused the notification to fail.

As a result, the schemaChangeFuture.get() call in SchemaOperator kept waiting until timeout (60 seconds).
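The hang can be reproduced in isolation with a plain `CompletableFuture`. This toy example (not SeaTunnel code; the class and method names are made up) shows that a timed `get()` simply blocks until the deadline when nothing ever completes the future:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration of why the job appeared to hang: a CompletableFuture that
// nobody completes blocks get() until the timeout fires.
public class HangIllustration {
    public static boolean timesOut(long millis) {
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        try {
            // In the real operator this was a 60-second wait per schema-change event.
            neverCompleted.get(millis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true; // the flush notification never arrived
        } catch (Exception e) {
            return false;
        }
    }
}
```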

After observing this, I swapped the execution order: instead of sending the FlushEvent first and then requesting the SchemaCoordinator, the operator now requests the SchemaCoordinator first and then sends the FlushEvent, as shown below:

CompletableFuture<SchemaResponse> schemaChangeFuture =
        schemaCoordinator.requestSchemaChange(
                tableId, jobId, schemaChangeEvent.getChangeAfter(), 1);
currentSchemaChangeFuture.set(schemaChangeFuture);
sendFlushEventToDownstream(schemaChangeEvent);  // send after coordinator request

This ensured that the SchemaCoordinator created its state before the FlushEvent was sent.

Thus, when the downstream finished processing the event, the state already existed, allowing successful notification and completion of the CompletableFuture.

The processSchemaChangeEvent method then resumed, continuing normal execution.

Project Outcomes

  1. Problems Solved
  • Implemented real-time schema-evolution capability on the Flink engine
  • Users no longer need to restart jobs when source schemas change during CDC synchronization
  • Provided a full schema-change coordination mechanism ensuring synchronization among operators
  2. User Benefits
  • Improved business continuity — schema changes no longer require downtime
  • Reduced O&M costs — less manual intervention and fewer restarts
  • Guaranteed data consistency — the FlushEvent ensures data consistency before and after a schema change
  • Flexible engine choice — users can freely choose between the Flink or SeaTunnel engines while retaining schema-evolution capability
  3. Technical Contributions
  • Added a global SchemaCoordinator component
  • Introduced a new FlushEvent type and handling mechanism
  • Implemented full schema-evolution adaptation in the Flink translation layer
  4. Future Improvements
  • Multi-parallelism support: design a flush-coordination mechanism for parallel scenarios using parallelism-aware counters and finer-grained state management
  • State persistence: consider converting the SchemaCoordinator into a Flink operator or leveraging Flink’s BroadcastState so its state participates in checkpoints
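The parallelism-aware counter mentioned above could look roughly like the following hypothetical sketch, where a schema change completes only after every sink subtask has acknowledged its flush. None of these names are actual SeaTunnel APIs; this is only a sketch of the idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical flush barrier for parallel sinks: with sink parallelism N,
// the schema change may proceed only after N flush acknowledgements.
public class FlushBarrierSketch {
    private final int parallelism;
    private final Map<String, AtomicInteger> pendingAcks = new ConcurrentHashMap<>();

    public FlushBarrierSketch(int parallelism) {
        this.parallelism = parallelism;
    }

    /** Arm the barrier when a schema change starts for a table. */
    public void startSchemaChange(String tableId) {
        pendingAcks.put(tableId, new AtomicInteger(parallelism));
    }

    /** Returns true only when the last subtask has acknowledged its flush. */
    public boolean ackFlush(String tableId) {
        AtomicInteger remaining = pendingAcks.get(tableId);
        return remaining != null && remaining.decrementAndGet() == 0;
    }
}
```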

To better understand students’ experiences during the Summer of Open Source, Apache SeaTunnel conducted a brief interview.

Here’s the transcript:

Q: Among so many projects, why did you choose Apache SeaTunnel?

A: Several reasons.
First, the project’s technical direction matches my background well.

During an internship at a startup, we used SeaTunnel for data integration and warehouse construction.

I also frequently use Flink to build data pipelines and real-time lineage systems, so I’m passionate about data integration and streaming synchronization.

SeaTunnel, as a new-generation data-integration platform with a modern tech stack and clear architecture, is a perfect project to learn deeply and contribute to.

Moreover, the SeaTunnel community is extremely welcoming and active.

Responses are quick, and for someone like me participating in open source for the first time, it’s very friendly.

The CDC schema-evolution feature solves a real-world pain point, and seeing my code truly help users gives me a strong sense of accomplishment.

Q: How does the Apache SeaTunnel project relate to your academic studies?

A: Quite closely. Our big-data processing courses cover frameworks like Flink and StarRocks, both of which are widely used with SeaTunnel.

In my sophomore year, I also worked with StreamPark for Spark micro-batch jobs, which made me familiar with data integration concepts.

By joining SeaTunnel, I could apply classroom theory in real projects and deepen my understanding.

Q: How has participating in this project influenced your studies and future plans?

A: I’ve gained a lot. To understand CDC implementation, I studied Flink CDC’s source code in depth, strengthening my understanding of Flink’s runtime, distributed coordination, and asynchronous programming.

Under my mentor’s guidance, I also learned open-source collaboration practices — code style, PR process, testing — which laid a solid foundation for future contributions.

Most importantly, this experience confirmed my long-term interest in data infrastructure; I plan to continue developing in this area.

Q: What was your biggest challenge during the project, and how did you overcome it?

A: The toughest issue was the job hang in processElement, which only kept checkpointing without processing data.

Architecturally, integrating new functionality gracefully into the existing system was also more complex than expected.

To resolve these, I repeatedly debugged, researched documentation, and sought advice from my mentor and community members.

Their feedback was incredibly helpful and guided me through the problem.

Q: How long have you been involved in open source? Do you like it? What has it changed for you?

A: This is my first formal open-source experience. Though I wrote internal code during internships, this was my first time submitting PRs publicly.

I really enjoy open source — the openness, collaboration, and shared learning environment where everyone contributes to a common goal.
It’s meaningful and rewarding.

Q: Have you used Apache SeaTunnel or other data-integration tools before?

A: Yes. During internships, I used SeaTunnel mainly for syncing between different data sources — for example, Kafka to Hive or Kafka to StarRocks.

I’ve also worked with Flink CDC for streaming source integration.

Compared with others, SeaTunnel supports three execution engines, covering both batch and streaming use cases.

If I choose a data-integration tool in the future, SeaTunnel will be my first choice — it’s feature-rich and easy to configure.

Q: What was your first impression of the SeaTunnel community, and what do you hope to gain here?

A: My first impression was great — friendly community, responsive mentors, and very careful code reviews. I hope to keep meeting like-minded contributors, make valuable contributions, and continue improving my technical skills.

Q: Will you continue contributing to Apache SeaTunnel?

A: Definitely. I plan to further optimize my current feature to better ensure exactly-once semantics for SeaTunnel on Flink, and I hope to join more interesting topics in the future.
