Apache SeaTunnel

Posted on Mar 20

SeaTunnel Gravitino: Schema URL–Driven Automatic Table Structure Detection

#gravitino #apacheseatunnel #opensource #datascience

Recently, the community published an article titled “Say Goodbye to Hand-Written Schemas! SeaTunnel’s Integration with Gravitino Metadata REST API Is a Really Cool Move”, which drew strong reactions from readers, with many saying, “This is really awesome!”

The contributor behind this feature has been very active, and the feature is expected to be available soon (according to reliable sources, likely in version 3.0.0). To help the community better understand it, the contributor wrote a detailed article explaining the initial capabilities of the Gravitino REST API integration and how to use it. Let's take a closer look!

1. Background and Problems to Solve

When Apache SeaTunnel runs batch or sync tasks against an unstructured or semi-structured source, the source usually requires an explicit schema definition (field names, types, and order).

In real production environments, this leads to several typical issues:

Tables have many fields and complex types, making manual schema maintenance costly and error-prone
Upstream table structure changes (adding fields, changing types) require corresponding updates to SeaTunnel jobs
Even when a table already exists, syncing its data still requires describing its metadata again in the job, which is redundant

Thus, the core question is:

Can SeaTunnel directly reuse table structure definitions from an existing metadata system, instead of declaring schema repeatedly in jobs?

This feature was introduced to solve this problem.

2. Introduction to Gravitino (Relevant Capabilities)

Gravitino is a unified metadata management and access service, providing standardized REST APIs to manage and expose the following objects:

Metalake (logical isolation unit)
Catalogs (e.g., MySQL, Hive, Iceberg)
Schema / Database
Table and its field definitions

With Gravitino:

Table structures can be centrally managed
Downstream systems can dynamically fetch schema definitions via HTTP APIs
No need to maintain field information in every compute or sync job

The new capability introduced in SeaTunnel is:

Support for automatically pulling a table structure from Gravitino through a schema_url set in the source's schema definition.

3. Local Test Environment Setup

3.1 Prepare MySQL Environment

3.1.1 Create Target Table

Pre-create the target table test.demo_user in MySQL with the following SQL:

CREATE TABLE `demo_user` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `user_code` varchar(32) NOT NULL,
  `user_name` varchar(64) DEFAULT NULL,
  `password` varchar(128) DEFAULT NULL,
  `email` varchar(128) DEFAULT NULL,
  `phone` varchar(20) DEFAULT NULL,
  `gender` tinyint DEFAULT NULL,
  `age` int DEFAULT NULL,
  `status` tinyint DEFAULT NULL,
  `level` int DEFAULT NULL,
  `score` decimal(10,2) DEFAULT NULL,
  `balance` decimal(12,2) DEFAULT NULL,
  `is_deleted` tinyint DEFAULT NULL,
  `register_ip` varchar(45) DEFAULT NULL,
  `last_login_ip` varchar(45) DEFAULT NULL,
  `login_count` int DEFAULT NULL,
  `remark` varchar(255) DEFAULT NULL,
  `ext1` varchar(100) DEFAULT NULL,
  `ext2` varchar(100) DEFAULT NULL,
  `ext3` varchar(100) DEFAULT NULL,
  `ext4` varchar(100) DEFAULT NULL,
  `ext5` varchar(100) DEFAULT NULL,
  `created_by` varchar(64) DEFAULT NULL,
  `updated_by` varchar(64) DEFAULT NULL,
  `created_time` datetime DEFAULT NULL,
  `updated_time` datetime DEFAULT NULL,
  `birthday` date DEFAULT NULL,
  `last_login_time` datetime DEFAULT NULL,
  `version` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_code` (`user_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

3.1.2 Create the Table Schema to Sync

In practice, table structures might be managed centrally in components such as Paimon, Hive, or Hudi. For this test, the schema to sync simply points to the target table test.demo_user created in the previous step.

3.2 Register the Table Schema in Gravitino

Gravitino can connect directly to a database and scan all of its tables.

The demo_user table is therefore managed in Gravitino as a table under the local-mysql catalog.

$img\_1$

Metalake: test_Metalake

3.3 Table Structure Access Explanation

Table structures in Gravitino can be accessed via the REST API:

http://localhost:8090/api/metalakes/test_Metalake/catalogs/${catalog}/schemas/${schema}/tables/${table}

In this test, the actual schema_url used is:

http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user

The returned JSON contains the complete field definitions of the demo_user table.

$img\_2$
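As a rough illustration of the URL pattern above, the following Python sketch assembles the schema_url from its components and pulls the column list out of a response. The helper names (build_schema_url, fetch_table, extract_columns) are hypothetical, and the response shape (columns under "table" then "columns", each with "name" and "type") as well as the type strings in the sample payload are assumptions that should be checked against the JSON your Gravitino server actually returns.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def build_schema_url(base_url, metalake, catalog, schema, table):
    """Assemble the table endpoint from its path components."""
    parts = [
        "api", "metalakes", metalake,
        "catalogs", catalog,
        "schemas", schema,
        "tables", table,
    ]
    # Percent-encode each component so unusual names cannot break the path.
    return base_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_table(url, timeout=10):
    """GET the table metadata; needs a reachable Gravitino server."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def extract_columns(payload):
    """Return (name, type) pairs in declared order.

    Assumption: columns are listed under payload["table"]["columns"].
    """
    return [(c["name"], c["type"]) for c in payload["table"]["columns"]]


# Hypothetical, abbreviated payload so the sketch runs without a server.
SAMPLE = {
    "table": {
        "name": "demo_user",
        "columns": [
            {"name": "id", "type": "long", "nullable": False},
            {"name": "user_code", "type": "varchar(32)", "nullable": False},
        ],
    }
}

schema_url = build_schema_url(
    "http://localhost:8090", "test_Metalake", "local-mysql", "test", "demo_user"
)
columns = extract_columns(SAMPLE)
print(schema_url)
print(columns)
```

With a running server, `extract_columns(fetch_table(schema_url))` would replace the sample payload.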

3.4 Local Deployment of SeaTunnel

Because this feature has not been officially released yet, you need to compile the latest dev branch yourself and deploy it locally.

3.5 Prepare Data Files

This test case uses a CSV file containing 2,000 records.

$img\_3$
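If you need a comparable test file, a sketch like the one below can generate 2,000 synthetic rows in the column order of demo_user. Everything about the file layout here is an assumption rather than a description of the file used in this test: comma-delimited, no header row, made-up values, and dates written as yyyy-MM-dd with datetimes as yyyy-MM-dd HH:mm:ss.

```python
import csv
import io

# Column order copied from the CREATE TABLE statement in section 3.1.1.
COLUMNS = [
    "id", "user_code", "user_name", "password", "email",
    "phone", "gender", "age", "status", "level",
    "score", "balance", "is_deleted", "register_ip", "last_login_ip",
    "login_count", "remark", "ext1", "ext2", "ext3", "ext4", "ext5",
    "created_by", "updated_by", "created_time", "updated_time",
    "birthday", "last_login_time", "version",
]


def make_row(i):
    """Build one synthetic record; values are placeholders, not real data."""
    return [
        i, f"U{i:06d}", f"user_{i}", "secret", f"user_{i}@example.com",
        f"1380000{i:04d}", i % 2, 18 + i % 50, 1, i % 5,
        f"{i % 100}.50", f"{i * 10}.00", 0, "127.0.0.1", "127.0.0.1",
        i % 20, "", "", "", "", "", "",
        "system", "system", "2024-01-01 00:00:00", "2024-01-01 00:00:00",
        "1990-01-01", "2024-01-01 00:00:00", 1,
    ]


def write_csv(rows, stream):
    """Write rows without a header line, one record per line."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


rows = [make_row(i) for i in range(1, 2001)]
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()

# To produce a real file instead of an in-memory string:
# with open("demo_user.csv", "w", newline="") as f:
#     write_csv(rows, f)
```

Empty strings are used for the nullable remark and ext columns; how the reader treats them depends on the connector's CSV settings.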

4. SeaTunnel Job Configuration

4.1 Core Configuration Example

env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  LocalFile {
    path = "/Users/wangxuepeng/Desktop/seatunnel/apache-seatunnel-2.3.13-SNAPSHOT/test_data"
    file_format_type = "csv"
    schema {
      schema_url = "http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"
    }
  }
}
sink {
  jdbc {
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    username = "root"
    password = "123456"
    database = "test"
    table = "demo_user"
    generate_sink_sql = true
  }
}

4.2 Key Configuration Notes

schema.schema_url
- Points to the table metadata REST API in Gravitino
- SeaTunnel automatically fetches the table schema at job start
- No need to manually declare field lists in jobs
generate_sink_sql = true
- Sink automatically generates INSERT SQL based on the parsed schema
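To make the idea behind generate_sink_sql concrete, here is a minimal Python sketch of how a parameterized INSERT statement can be derived from a list of column names. It is only an illustration of the concept, not SeaTunnel's implementation; the function name build_insert_sql and the backtick quoting are assumptions chosen for MySQL.

```python
def build_insert_sql(database, table, columns):
    """Build a parameterized INSERT for the given columns (MySQL-style quoting)."""
    if not columns:
        raise ValueError("at least one column is required")
    quoted = ", ".join(f"`{name}`" for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO `{database}`.`{table}` ({quoted}) VALUES ({placeholders})"


sql = build_insert_sql("test", "demo_user", ["id", "user_code", "user_name"])
print(sql)
```

Because the column list comes from the resolved schema, the statement follows the table definition without any field names being written into the job file.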

5. Data and Job Execution Results

Log screenshot:

$img\_4$

During job execution:

Source automatically parses field structure via schema_url
CSV fields automatically align with the table schema
Data successfully written to MySQL demo_user table
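The alignment step can be pictured as matching CSV values to schema fields by position and converting each value to the declared type. The sketch below shows that idea in Python under stated assumptions: the type labels ("int", "decimal", and so on), the treatment of empty strings as NULL, and the name align_row are all hypothetical and do not describe SeaTunnel's internal logic.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical type labels mapped to converters.
CONVERTERS = {
    "int": int,
    "decimal": Decimal,
    "string": str,
    "date": date.fromisoformat,
    "datetime": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}


def align_row(values, schema):
    """Pair raw CSV values with (name, kind) schema entries by position."""
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(values)}")
    row = {}
    for raw, (name, kind) in zip(values, schema):
        # Assumption: an empty CSV cell stands for NULL.
        row[name] = None if raw == "" else CONVERTERS[kind](raw)
    return row


schema = [
    ("id", "int"),
    ("user_code", "string"),
    ("score", "decimal"),
    ("remark", "string"),
    ("birthday", "date"),
]
record = align_row(["1", "U000001", "9.50", "", "1990-01-01"], schema)
print(record)
```

A row with the wrong number of fields raises an error here, which is one reason the CSV column order has to match the table schema.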

6. FAQ

6.1 Supported Connectors

Currently, the dev branch supports schema_url for file-type connectors, including local, HDFS, and S3, among others.

6.2 Does `schema_url` support multiple tables?

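Yes. The feature does not affect multi-table functionality, and the two can be combined. In the example below, table1 declares its fields inline while table2 pulls its schema through schema_url: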

source {
  LocalFile {
    tables_configs = [
      {
        path = "/seatunnel/read/metalake/table1"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table1"
          fields {
            c_string = string
            c_int = int
            c_boolean = boolean
            c_double = double
          }
        }
      },
      {
        path = "/seatunnel/read/metalake/table2"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table2"
          schema_url = "http://gravitino:8090/api/metalakes/test_metalake/catalogs/test_catalog/schemas/test_schema/tables/table2"
        }
      }
    ]
  }
}
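One way to read the configuration above is that each entry in tables_configs decides for itself where its schema comes from. The Python sketch below mirrors the two entries as dictionaries and classifies each one. It is illustrative only: the function schema_source is hypothetical, and because the article does not say what happens when both fields and schema_url are present, the sketch simply rejects that case.

```python
def schema_source(table_config):
    """Report whether a table entry uses inline fields or a remote schema_url."""
    schema = table_config.get("schema", {})
    has_url = "schema_url" in schema
    has_fields = "fields" in schema
    if has_url and has_fields:
        # Precedence is not stated in the article, so this sketch refuses to guess.
        raise ValueError("both fields and schema_url are set")
    if has_url:
        return "schema_url"
    if has_fields:
        return "fields"
    raise ValueError("no schema definition found")


tables_configs = [
    {
        "path": "/seatunnel/read/metalake/table1",
        "schema": {
            "table": "db.table1",
            "fields": {"c_string": "string", "c_int": "int"},
        },
    },
    {
        "path": "/seatunnel/read/metalake/table2",
        "schema": {
            "table": "db.table2",
            "schema_url": "http://gravitino:8090/api/metalakes/test_metalake"
                          "/catalogs/test_catalog/schemas/test_schema/tables/table2",
        },
    },
]

sources = [(cfg["schema"]["table"], schema_source(cfg)) for cfg in tables_configs]
print(sources)
```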

7. Feature Summary

By introducing Gravitino schema_url–based automatic schema parsing, SeaTunnel gains the following advantages in data sync scenarios:

Eliminates repeated schema definitions, reducing job configuration complexity
Reuses a unified metadata management system, improving consistency
Lets jobs pick up table structure changes without manual schema edits, significantly lowering maintenance costs

This feature is ideal for:

Enterprises with mature metadata platforms
Large tables with many fields or frequent schema changes
Users seeking improved maintainability of SeaTunnel jobs

8. References

Code PR:
https://github.com/apache/seatunnel/pull/10402
schema_url Configuration Docs:
https://seatunnel.apache.org/zh-CN/docs/introduction/concepts/schema-feature#schema_url
