Over the past two weeks, we’ve conducted brief interviews with several outstanding student developers from the Summer of Open Source program to learn about their development experiences and insights.
Today, we’re sharing the full project report for one of the most exciting contributions — Metalake support in Apache SeaTunnel — to help the community better understand its technical design and latest progress.
I. Project Background
Currently, in Apache SeaTunnel’s task configuration, sensitive information such as database usernames and passwords is hard-coded into task scripts. This approach introduces several problems:
- Security Risks: Sensitive information is exposed in scripts, making data source credentials vulnerable to leaks.
- Maintenance Overhead: When data source configurations change, users must manually update all related task scripts, which is inefficient and error-prone.
To address these issues, this project introduces Metalake integration to centralize data source configuration management.
Through a data source ID mapping mechanism, users can easily update and manage connection information. The goal is to support the Apache Gravitino metadata catalog and reserve interfaces for future integration with other third-party metadata services.
Example REST API for retrieving Gravitino catalog info:
https://gravitino.apache.org/docs/0.9.0-incubating/api/rest/load-catalog
Project repository:
https://github.com/apache/seatunnel
Main implementation objectives:
- Adapt Metalake configuration loading: Load Metalake-related configuration from seatunnel-env when a task starts.
- Refactor source and sink configuration logic: Add sourceId for querying Metalake and replacing configuration placeholders dynamically.
- Plugin-based Metalake support integrated with Apache Gravitino: Define a unified Metalake interface, enable Gravitino support, and keep the design easily extensible to future metadata catalogs.
II. Solution Overview
1. Metalake Configuration Adaptation
Goal: Load Metalake configuration during task startup.
Method: Define Metalake settings in seatunnel-env.sh or directly in the task configuration file.
Example in seatunnel-env.sh:
METALAKE_ENABLED=true
METALAKE_TYPE=gravitino
METALAKE_URL=http://localhost:8090/api/metalakes/metalake_name/catalogs/
Or within a task configuration:
env {
  metalake_enabled = true
  metalake_type = "gravitino"
  metalake_url = "http://localhost:8090/api/metalakes/metalake_name/catalogs/"
}
If the configuration exists in the task file, it’s automatically loaded.
If defined in seatunnel-env.sh, it can be accessed via System.getenv() at runtime, provided the variables are exported to the SeaTunnel process.
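The two loading paths can be combined into one lookup. The following is a minimal sketch, not the project's actual code: the MetalakeSettings class is hypothetical, and the precedence it applies (task file first, environment second) is an assumption, since the report does not state which source wins when both are set.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical holder for the three Metalake settings shown above. */
final class MetalakeSettings {
    final boolean enabled;
    final String type;
    final String url;

    MetalakeSettings(boolean enabled, String type, String url) {
        this.enabled = enabled;
        this.type = type;
        this.url = url;
    }

    /**
     * Reads each setting from the task's env block first and falls back to an
     * environment lookup. Pass System::getenv as envLookup outside of tests.
     */
    static MetalakeSettings resolve(Map<String, String> taskEnv,
                                    Function<String, String> envLookup) {
        // Boolean.parseBoolean(null) is false, so Metalake stays off by default.
        boolean enabled = Boolean.parseBoolean(lookup("metalake_enabled", taskEnv, envLookup));
        return new MetalakeSettings(
                enabled,
                lookup("metalake_type", taskEnv, envLookup),
                lookup("metalake_url", taskEnv, envLookup));
    }

    private static String lookup(String key, Map<String, String> taskEnv,
                                 Function<String, String> envLookup) {
        String value = taskEnv.get(key);
        if (value == null) {
            // Assumed convention: metalake_url in the task file maps to METALAKE_URL.
            value = envLookup.apply(key.toUpperCase(Locale.ROOT));
        }
        return value;
    }
}
```

Taking the environment lookup as a parameter keeps the sketch testable without touching the real process environment.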
2. Refactoring Source/Sink Configuration
2.1 Add sourceId to Source/Sink
Goal: Identify data sources in Metalake.
Example:
source {
  type = "mysql"
  sourceId = "mysql_datasource_001"
  url = "jdbc:mysql://localhost:3306/db"
  ...
}
2.2 Support Placeholder Replacement
Goal: Dynamically fetch credentials and replace placeholders using Metalake.
Method:
- Detect metalakeEnabled and sourceId during configuration parsing.
- Query Metalake via REST API and replace placeholders like ${username} or ${password}.
Steps:
- Define placeholder format ${key}.
- Replace placeholders in configuration automatically.
Code example:
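A minimal sketch of the replacement step, assuming the properties fetched from Metalake arrive as a flat string map. The class name PlaceholderResolver and its methods are hypothetical and are not taken from the SeaTunnel codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that substitutes ${key} placeholders in option values. */
final class PlaceholderResolver {
    // Matches ${key}; group 1 captures the key between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${key} in value with properties.get(key). */
    static String resolve(String value, Map<String, String> properties) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = properties.get(matcher.group(1));
            // Unknown keys are left as-is so a missing property stays visible.
            String text = replacement != null ? replacement : matcher.group();
            // quoteReplacement keeps '$' and '\' in passwords from being
            // interpreted as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Applies resolve() to every value of a source or sink option map. */
    static Map<String, String> resolveAll(Map<String, String> options,
                                          Map<String, String> properties) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> option : options.entrySet()) {
            resolved.put(option.getKey(), resolve(option.getValue(), properties));
        }
        return resolved;
    }
}
```

Leaving unknown placeholders untouched is a design choice of this sketch. A stricter variant could throw instead, so that a misspelled key fails at parse time rather than at connection time.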
3. Plugin-Based Metalake and Gravitino Integration
3.1 Define Metalake Interface
Create a MetalakeClient interface providing methods for data source lookup.
3.2 Implement Apache Gravitino Client
Implement GravitinoClient based on the interface:
- Use HTTP client to request Gravitino REST API.
- Parse and map data source info to SeaTunnel configuration placeholders.
Code example:
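A minimal sketch of the interface and a Gravitino-backed implementation. Several things here are assumptions: the method name getProperties, the injected httpGet function that stands in for a real HTTP client, and the idea that the catalog response carries a flat properties object with string values. The regex-based extraction only keeps the example dependency-free; it does not handle escaped quotes or nested objects, and a real implementation would use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical interface: looks up connection properties by data source ID. */
interface MetalakeClient {
    Map<String, String> getProperties(String sourceId);
}

/** Sketch of a Gravitino-backed client; the HTTP call is injected. */
final class GravitinoClient implements MetalakeClient {
    // Assumed response shape: ... "properties": { "key": "value", ... } ...
    private static final Pattern PROPERTIES =
            Pattern.compile("\"properties\"\\s*:\\s*\\{([^{}]*)\\}");
    private static final Pattern ENTRY =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    private final String catalogsUrl;
    private final Function<String, String> httpGet;

    GravitinoClient(String catalogsUrl, Function<String, String> httpGet) {
        // Normalize so that metalake_url works with or without a trailing slash.
        this.catalogsUrl = catalogsUrl.endsWith("/") ? catalogsUrl : catalogsUrl + "/";
        this.httpGet = httpGet;
    }

    @Override
    public Map<String, String> getProperties(String sourceId) {
        // sourceId is treated as the catalog name appended to metalake_url.
        String body = httpGet.apply(catalogsUrl + sourceId);
        Map<String, String> result = new LinkedHashMap<>();
        Matcher properties = PROPERTIES.matcher(body);
        if (properties.find()) {
            Matcher entry = ENTRY.matcher(properties.group(1));
            while (entry.find()) {
                result.put(entry.group(1), entry.group(2));
            }
        }
        return result;
    }
}
```

Injecting the HTTP call as a function means the parsing and URL handling can be exercised with a canned response, without a running Gravitino server.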
3.3 Extensible Plugin Design
Add a factory mechanism to select client types dynamically (e.g., Gravitino, UnityCatalog, or DataHub).
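One way to sketch such a factory is a registry that maps the metalake_type value to a constructor function. The MetalakeClientFactory class and its register/create methods are hypothetical, and MetalakeClient is repeated here only to keep the example self-contained. A real plugin mechanism might instead discover implementations through java.util.ServiceLoader.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical interface from section 3.1, repeated for self-containment. */
interface MetalakeClient {
    Map<String, String> getProperties(String sourceId);
}

/** Hypothetical factory: selects a client implementation by metalake_type. */
final class MetalakeClientFactory {
    // Maps a lower-cased type name to a function that builds a client from a URL.
    private static final Map<String, Function<String, MetalakeClient>> REGISTRY =
            new ConcurrentHashMap<>();

    /** Registers a constructor for a type such as "gravitino". */
    static void register(String type, Function<String, MetalakeClient> constructor) {
        REGISTRY.put(type.toLowerCase(Locale.ROOT), constructor);
    }

    /** Creates a client for the given type, or fails fast on unknown types. */
    static MetalakeClient create(String type, String url) {
        Function<String, MetalakeClient> constructor =
                REGISTRY.get(type.toLowerCase(Locale.ROOT));
        if (constructor == null) {
            throw new IllegalArgumentException("Unsupported metalake type: " + type);
        }
        return constructor.apply(url);
    }
}
```

With this shape, supporting another catalog only requires a new MetalakeClient implementation and one register call; the code that resolves placeholders does not change.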
3.4 Backward Compatibility
Ensure existing tasks are unaffected:
- metalakeEnabled defaults to false.
- Only triggers Metalake logic when explicitly enabled and sourceId is provided.
Code example:
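A minimal sketch of the guard, assuming source or sink options are available as a string map and that the Metalake lookup is passed in as a function. The MetalakeGuard class is hypothetical. The point it illustrates is that the legacy path returns the configuration untouched and never calls Metalake.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical guard that keeps existing tasks on the legacy path. */
final class MetalakeGuard {
    static Map<String, String> apply(boolean metalakeEnabled,
                                     Map<String, String> options,
                                     Function<String, Map<String, String>> lookup) {
        String sourceId = options.get("sourceId");
        if (!metalakeEnabled || sourceId == null || sourceId.isEmpty()) {
            // Legacy path: no lookup, no substitution, same map instance.
            return options;
        }
        Map<String, String> properties = lookup.apply(sourceId);
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> option : options.entrySet()) {
            String value = option.getValue();
            for (Map.Entry<String, String> property : properties.entrySet()) {
                // String.replace is literal, so special characters are safe.
                value = value.replace("${" + property.getKey() + "}", property.getValue());
            }
            resolved.put(option.getKey(), value);
        }
        return resolved;
    }
}
```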
III. Project Timeline
Timeframe: July 1, 2025 – September 30, 2025
Below is the detailed implementation plan and milestones for this project.
| Phase | Time | Tasks | Milestones |
|---|---|---|---|
| Preparation Phase | July 1 – July 7, 2025 | Finalize technical solution details; set up development environment; complete seatunnel-env.sh configuration file format design | Technical solution confirmed and development environment prepared |
| Development Phase 1: Metalake Configuration Adaptation | July 8 – July 20, 2025 | Implement configuration read and load functions; integrate configuration loading into task context; test configuration load functionality | Metalake configuration and loading functions completed and passed unit testing |
| Development Phase 2: Source/Sink Refactoring | July 21 – August 5, 2025 | Add sourceId to source and sink configuration; implement field mapping logic; test data source replacement logic | Source/Sink configuration refactoring completed and passed integration testing |
| Development Phase 3: Plugin Support and Gravitino Integration | August 6 – August 31, 2025 | Define MetalakeClient interface; implement Gravitino client integration; support plugin method; verify backward compatibility | Gravitino integration and plugin support completed, extensibility verified |
| Testing & Optimization Phase | September 1 – September 15, 2025 | Conduct comprehensive functional testing; fix bugs and optimize code; compile project documentation | All functional testing completed; final code and documentation submitted |
| Summary & Submission Phase | September 16 – September 30, 2025 | Summarize project deliverables; submit code to Apache SeaTunnel community; prepare project report | Project officially completed and accepted |
IV. Project Progress
Completed Work
All core features have been developed, tested, and merged into the main repository.
Challenges and Solutions
While coding, most challenges were minor thanks to the guidance from mentor liugddx.
The main difficulty was the lengthy test suite: SeaTunnel’s integration tests are extensive and sometimes unstable due to network factors, requiring multiple retries.
This process tested my patience and attention to detail.
Test Case Design
A sample task configuration was created using Metalake-based MySQL as the source and Assert as the sink to validate correctness.
Integration tests were built and successfully passed in GitHub CI.
Future Work
Future improvements include extending support for more Metalake types beyond Apache Gravitino, enabling wider metadata interoperability.







