Apache SeaTunnel

Final Project Report 2 | Apache SeaTunnel Adds Metalake Support

Over the past two weeks, we’ve conducted brief interviews with several outstanding student developers from the Summer of Open Source program to learn about their development experiences and insights.
Today, we’re sharing the full project report for one of the most exciting contributions, Metalake support in Apache SeaTunnel, to help the community better understand its technical design and latest progress.

I. Project Background

Currently, in Apache SeaTunnel’s task configuration, sensitive information such as database usernames and passwords is hard-coded into task scripts. This approach introduces several problems:

  1. Security Risks: Sensitive information is exposed in scripts, making data source credentials vulnerable to leaks.
  2. Maintenance Overhead: When data source configurations change, users must manually update all related task scripts, which is inefficient and error-prone.

To address these issues, this project introduces Metalake integration to centralize data source configuration management.
Through a data source ID mapping mechanism, users can easily update and manage connection information. The goal is to support the Apache Gravitino metadata catalog and reserve interfaces for future integration with other third-party metadata services.

Example REST API for retrieving Gravitino catalog info:
https://gravitino.apache.org/docs/0.9.0-incubating/api/rest/load-catalog

Project repository:
https://github.com/apache/seatunnel

Main implementation objectives:

  1. Adapt Metalake configuration loading: Load Metalake-related configuration from seatunnel-env when a task starts.
  2. Refactor source and sink configuration logic: Add sourceId for querying Metalake and replacing configuration placeholders dynamically.
  3. Plugin-based Metalake support integrated with Apache Gravitino: Define a unified Metalake interface, enable Gravitino support, and keep the design easily extensible to future metadata catalogs.

II. Solution Overview

1. Metalake Configuration Adaptation

Goal: Load Metalake configuration during task startup.
Method: Define Metalake settings in seatunnel-env.sh or directly in the task configuration file.

Example in seatunnel-env.sh:

METALAKE_ENABLED=true
METALAKE_TYPE=gravitino
METALAKE_URL=http://localhost:8090/api/metalakes/metalake_name/catalogs/

Or within a task configuration:

env {
  metalake_enabled = true
  metalake_type = "gravitino"
  metalake_url = "http://localhost:8090/api/metalakes/metalake_name/catalogs/"
}

If the configuration exists in the task file, it’s automatically loaded.
If defined in seatunnel-env.sh, it can be accessed via System.getenv() at runtime.
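
The lookup described above can be sketched in Java as follows. This is a minimal sketch, not SeaTunnel's actual loader: the class MetalakeSettings, its method names, and the rule that the task file takes precedence over seatunnel-env.sh are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not a SeaTunnel class): resolves one Metalake setting. */
final class MetalakeSettings {
    private MetalakeSettings() {}

    /**
     * Looks up a key such as "metalake_url" in the task-level env block first,
     * then falls back to the upper-cased environment variable (METALAKE_URL).
     * The precedence order is an assumption of this sketch.
     */
    static Optional<String> resolve(
            String key, Map<String, String> taskEnv, Map<String, String> processEnv) {
        String fromTask = taskEnv.get(key);
        if (fromTask != null && !fromTask.isEmpty()) {
            return Optional.of(fromTask);
        }
        String fromProcess = processEnv.get(key.toUpperCase(Locale.ROOT));
        if (fromProcess != null && !fromProcess.isEmpty()) {
            return Optional.of(fromProcess);
        }
        return Optional.empty();
    }

    /** Metalake stays disabled unless one of the two sources sets it to "true". */
    static boolean isEnabled(Map<String, String> taskEnv, Map<String, String> processEnv) {
        return resolve("metalake_enabled", taskEnv, processEnv)
                .map(Boolean::parseBoolean)
                .orElse(false);
    }
}
```

At runtime the second map would typically be System.getenv(), for example MetalakeSettings.resolve("metalake_url", taskEnv, System.getenv()). Passing it in as a parameter keeps the lookup easy to unit test.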


2. Refactoring Source/Sink Configuration

2.1 Add sourceId to Source/Sink

Goal: Identify data sources in Metalake.
Example:

source {
  type = "mysql"
  sourceId = "mysql_datasource_001"
  url = "jdbc:mysql://localhost:3306/db"
  ...
}

2.2 Support Placeholder Replacement

Goal: Dynamically fetch credentials and replace placeholders using Metalake.
Method:

  • Detect metalakeEnabled and sourceId during configuration parsing.
  • Query Metalake via REST API and replace placeholders like ${username} or ${password}.

Steps:

  1. Define placeholder format ${key}.
  2. Fetch data source info from Gravitino via REST API.
  3. Replace placeholders in configuration automatically.
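
The replacement step can be illustrated with a small Java sketch. It assumes the properties fetched from Metalake are already available as a map. The class name PlaceholderResolver and the choice to leave unknown placeholders untouched are assumptions of this example, not the project's confirmed behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: substitutes ${key} placeholders with values from a map. */
final class PlaceholderResolver {
    // Matches ${key} and captures the key name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private PlaceholderResolver() {}

    static String resolve(String value, Map<String, String> properties) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String replacement = properties.get(matcher.group(1));
            // Unknown keys are left as-is so that a missing value is easy to spot.
            String text = replacement != null ? replacement : matcher.group();
            // quoteReplacement keeps '$' and '\' in passwords from being
            // interpreted as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, resolving the option value "${username}" against a map that contains username = root yields "root".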

3. Plugin-Based Metalake and Gravitino Integration

3.1 Define Metalake Interface

Create a MetalakeClient interface providing methods for data source lookup.
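
One possible shape for such an interface is sketched below. The method names type() and getDataSourceProperties() are hypothetical, as is the in-memory implementation, which exists only to make the sketch runnable without a metadata service.

```java
import java.util.Map;
import java.util.Optional;

/** Sketch of the interface; the method names are assumptions, not the merged API. */
interface MetalakeClient {
    /** Identifier used to select this client, e.g. "gravitino". */
    String type();

    /** Connection properties stored for the given sourceId, or empty if unknown. */
    Optional<Map<String, String>> getDataSourceProperties(String sourceId);
}

/** In-memory implementation, intended only for local experiments and tests. */
final class InMemoryMetalakeClient implements MetalakeClient {
    private final Map<String, Map<String, String>> sources;

    InMemoryMetalakeClient(Map<String, Map<String, String>> sources) {
        this.sources = sources;
    }

    @Override
    public String type() {
        return "in-memory";
    }

    @Override
    public Optional<Map<String, String>> getDataSourceProperties(String sourceId) {
        return Optional.ofNullable(sources.get(sourceId));
    }
}
```

Returning Optional rather than null makes the "unknown sourceId" case explicit for callers that perform placeholder replacement.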

3.2 Implement Apache Gravitino Client

Implement GravitinoClient based on the interface:

  • Use HTTP client to request Gravitino REST API.
  • Parse and map data source info to SeaTunnel configuration placeholders.
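
The two steps above could look roughly like the following sketch, which uses the JDK's java.net.http API. Several details are assumptions rather than confirmed project behavior: that sourceId is appended to metalake_url as the catalog name, that sourceId is URL-safe, the Accept header value, and the catalog property keys jdbc-url, jdbc-user, and jdbc-password. Verify them against your Gravitino version. The class GravitinoRequests is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating the request and the property mapping. */
final class GravitinoRequests {
    private GravitinoRequests() {}

    /** Builds GET {metalake_url}/{sourceId}; assumes sourceId needs no URL encoding. */
    static HttpRequest loadCatalog(String metalakeUrl, String sourceId) {
        String base = metalakeUrl.endsWith("/") ? metalakeUrl : metalakeUrl + "/";
        return HttpRequest.newBuilder(URI.create(base + sourceId))
                // Assumed media type; check the Gravitino REST documentation.
                .header("Accept", "application/vnd.gravitino.v1+json")
                .GET()
                .build();
    }

    /** Maps assumed Gravitino JDBC catalog keys to SeaTunnel placeholder names. */
    static Map<String, String> toPlaceholders(Map<String, String> catalogProperties) {
        Map<String, String> keyMapping = Map.of(
                "jdbc-url", "url",
                "jdbc-user", "username",
                "jdbc-password", "password");
        Map<String, String> placeholders = new HashMap<>();
        keyMapping.forEach((gravitinoKey, placeholderKey) -> {
            String value = catalogProperties.get(gravitinoKey);
            if (value != null) {
                placeholders.put(placeholderKey, value);
            }
        });
        return placeholders;
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Parsing the JSON response body into a property map is left to whichever JSON library the project already uses.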

3.3 Extensible Plugin Design

Add a factory mechanism to select client types dynamically (e.g., Gravitino, UnityCatalog, or DataHub).
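
A registry-based factory is one way to realize this. The sketch below is illustrative only: MetalakeClientFactory and its register/create methods are hypothetical names, and the interface is reduced to a single method so the block compiles on its own. A real plugin system might instead discover implementations through java.util.ServiceLoader.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Minimal stand-in for the client interface, repeated so this block is self-contained. */
interface MetalakeClient {
    String type();
}

/** Hypothetical factory: maps a metalake_type value to a client constructor. */
final class MetalakeClientFactory {
    private static final Map<String, Function<String, MetalakeClient>> REGISTRY =
            new ConcurrentHashMap<>();

    private MetalakeClientFactory() {}

    /** Registers a constructor that receives the configured metalake_url. */
    static void register(String type, Function<String, MetalakeClient> constructor) {
        REGISTRY.put(type.toLowerCase(Locale.ROOT), constructor);
    }

    /** Creates a client for the given type, matched case-insensitively. */
    static MetalakeClient create(String type, String url) {
        Function<String, MetalakeClient> constructor =
                REGISTRY.get(type.toLowerCase(Locale.ROOT));
        if (constructor == null) {
            throw new IllegalArgumentException("Unsupported metalake_type: " + type);
        }
        return constructor.apply(url);
    }
}
```

With this layout, supporting another catalog only requires one additional register call, and unsupported types fail fast with a clear message.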

3.4 Backward Compatibility

Ensure existing tasks are unaffected:

  • metalakeEnabled defaults to false.
  • Only triggers Metalake logic when explicitly enabled and sourceId is provided.
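
The two conditions above amount to a single guard, sketched here in Java. MetalakeGuard is a hypothetical name, and representing the plugin options as a plain string map is a simplification for illustration.

```java
import java.util.Map;

/** Hypothetical guard: decides whether Metalake lookup should run for a plugin. */
final class MetalakeGuard {
    private MetalakeGuard() {}

    static boolean shouldResolve(boolean metalakeEnabled, Map<String, String> pluginOptions) {
        String sourceId = pluginOptions.get("sourceId");
        // Existing tasks have neither the flag nor a sourceId, so they skip this path.
        return metalakeEnabled && sourceId != null && !sourceId.trim().isEmpty();
    }
}
```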

III. Project Timeline

Timeframe: July 1, 2025 – September 30, 2025

Below is the detailed implementation plan and milestones for this project.

Preparation Phase (July 1 – July 7, 2025)
Tasks:
  • Finalize technical solution details
  • Set up development environment
  • Complete seatunnel-env.sh configuration file format design
Milestone: Technical solution confirmed and development environment prepared

Development Phase 1: Metalake Configuration Adaptation (July 8 – July 20, 2025)
Tasks:
  • Implement configuration read and load functions
  • Integrate configuration loading into task context
  • Test configuration load functionality
Milestone: Metalake configuration and loading functions completed and passed unit testing

Development Phase 2: Source/Sink Refactoring (July 21 – August 5, 2025)
Tasks:
  • Add sourceId to source and sink configuration
  • Implement field mapping logic
  • Test data source replacement logic
Milestone: Source/Sink configuration refactoring completed and passed integration testing

Development Phase 3: Plugin Support and Gravitino Integration (August 6 – August 31, 2025)
Tasks:
  • Define MetalakeClient interface
  • Implement Gravitino client integration
  • Support the plugin mechanism
  • Verify backward compatibility
Milestone: Gravitino integration and plugin support completed, extensibility verified

Testing & Optimization Phase (September 1 – September 15, 2025)
Tasks:
  • Conduct comprehensive functional testing
  • Fix bugs and optimize code
  • Compile project documentation
Milestone: All functional testing completed; final code and documentation submitted

Summary & Submission Phase (September 16 – September 30, 2025)
Tasks:
  • Summarize project deliverables
  • Submit code to Apache SeaTunnel community
  • Prepare project report
Milestone: Project officially completed and accepted

IV. Project Progress

Completed Work

All core features have been developed, tested, and merged into the main repository.

Challenges and Solutions

Most coding challenges were minor, thanks to guidance from mentor liugddx.
The main difficulty was the lengthy test suite: SeaTunnel’s integration tests are extensive and sometimes fail intermittently because of network conditions, so CI runs often had to be retried several times.
This process tested my patience and attention to detail.

Test Case Design

A sample task configuration was created using Metalake-based MySQL as the source and Assert as the sink to validate correctness.
Integration tests were built and successfully passed in GitHub CI.

Future Work

Future improvements include extending support for more Metalake types beyond Apache Gravitino, enabling wider metadata interoperability.
