Apache SeaTunnel

SeaTunnel and Gravitino: Schema URL-Driven Automatic Table Structure Detection

Recently, the community published an article titled “Say Goodbye to Hand-Written Schemas! SeaTunnel’s Integration with Gravitino Metadata REST API Is a Really Cool Move”, which drew strong reactions from readers, with many saying, “This is really awesome!”

The contributor behind this feature has been very active, and the feature is expected to ship soon (according to reliable sources, likely in version 3.0.0). To help the community understand it better, the contributor wrote this detailed walkthrough of the initial Gravitino REST API capabilities and how to use them.

1. Background and Problems to Solve

When Apache SeaTunnel runs batch or sync jobs against an unstructured or semi-structured source, the source connector usually needs an explicit schema definition (field names, types, and order).

In real production environments, this leads to several typical issues:

  • Tables have many fields and complex types, making manual schema maintenance costly and error-prone
  • Upstream table structure changes (adding fields, changing types) require corresponding updates to SeaTunnel jobs
  • Even when the table already exists, a plain data sync still requires describing its metadata again in the job, which is redundant

Thus, the core question is:

Can SeaTunnel directly reuse table structure definitions from an existing metadata system, instead of declaring schema repeatedly in jobs?

This feature was introduced to solve this problem.

2. Introduction to Gravitino (Relevant Capabilities)

Gravitino is a unified metadata management and access service, providing standardized REST APIs to manage and expose the following objects:

  • Metalake (logical isolation unit)
  • Catalogs (e.g., MySQL, Hive, Iceberg)
  • Schema / Database
  • Table and its field definitions

With Gravitino:

  • Table structures can be centrally managed
  • Downstream systems can dynamically fetch schema definitions via HTTP APIs
  • No need to maintain field information in every compute or sync job

The new capability introduced in SeaTunnel is:

Support for a schema_url option in the source schema definition. It points to a table endpoint exposed by Gravitino, and SeaTunnel pulls the table structure from it automatically.

3. Local Test Environment Setup

3.1 Prepare MySQL Environment

3.1.1 Create Target Table

Pre-create the target table test.demo_user in MySQL with the following SQL:

CREATE TABLE `demo_user` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `user_code` varchar(32) NOT NULL,
  `user_name` varchar(64) DEFAULT NULL,
  `password` varchar(128) DEFAULT NULL,
  `email` varchar(128) DEFAULT NULL,
  `phone` varchar(20) DEFAULT NULL,
  `gender` tinyint DEFAULT NULL,
  `age` int DEFAULT NULL,
  `status` tinyint DEFAULT NULL,
  `level` int DEFAULT NULL,
  `score` decimal(10,2) DEFAULT NULL,
  `balance` decimal(12,2) DEFAULT NULL,
  `is_deleted` tinyint DEFAULT NULL,
  `register_ip` varchar(45) DEFAULT NULL,
  `last_login_ip` varchar(45) DEFAULT NULL,
  `login_count` int DEFAULT NULL,
  `remark` varchar(255) DEFAULT NULL,
  `ext1` varchar(100) DEFAULT NULL,
  `ext2` varchar(100) DEFAULT NULL,
  `ext3` varchar(100) DEFAULT NULL,
  `ext4` varchar(100) DEFAULT NULL,
  `ext5` varchar(100) DEFAULT NULL,
  `created_by` varchar(64) DEFAULT NULL,
  `updated_by` varchar(64) DEFAULT NULL,
  `created_time` datetime DEFAULT NULL,
  `updated_time` datetime DEFAULT NULL,
  `birthday` date DEFAULT NULL,
  `last_login_time` datetime DEFAULT NULL,
  `version` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_code` (`user_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

3.1.2 Create the Table Schema to Sync

In practice, table structures may be managed centrally in systems such as Paimon, Hive, or Hudi. For this test, the schema simply reuses the target table test.demo_user created in the previous step.

3.2 Register the Table Schema in Gravitino

Gravitino can connect directly to a database and scan all of its tables.

The demo_user table is managed in Gravitino under the local-mysql catalog, in the metalake test_Metalake.

3.3 Table Structure Access Explanation

Table structures in Gravitino can be accessed via the REST API:

http://localhost:8090/api/metalakes/test_Metalake/catalogs/${catalog}/schemas/${schema}/tables/${table}

In this test, the actual schema_url used is:

http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user

The returned JSON contains the complete field definitions of the demo_user table.
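
As a rough illustration of the URL pattern above, the following Python sketch assembles a schema_url from its parts and pulls column names out of a response-shaped dictionary. The helper names (build_schema_url, column_names) and the sample payload are hypothetical, and the assumed response layout (a "table" object holding a "columns" list) should be checked against the JSON your Gravitino server actually returns.

```python
from urllib.parse import quote


def build_schema_url(base_url, metalake, catalog, schema, table):
    """Assemble a Gravitino table URL following the pattern shown above."""
    parts = [
        "api",
        "metalakes", metalake,
        "catalogs", catalog,
        "schemas", schema,
        "tables", table,
    ]
    # Percent-encode each segment so unusual names cannot break the path.
    path = "/".join(quote(part, safe="") for part in parts)
    return base_url.rstrip("/") + "/" + path


def column_names(payload):
    """Return column names from a response-shaped dict.

    Assumption: columns sit under payload["table"]["columns"], each with a
    "name" key. Verify this against the real response before relying on it.
    """
    return [col["name"] for col in payload.get("table", {}).get("columns", [])]


# Hypothetical, heavily trimmed payload used only to exercise column_names().
sample_payload = {
    "table": {
        "name": "demo_user",
        "columns": [
            {"name": "id", "type": "long"},
            {"name": "user_code", "type": "varchar(32)"},
        ],
    }
}

schema_url = build_schema_url(
    "http://localhost:8090", "test_Metalake", "local-mysql", "test", "demo_user"
)
print(schema_url)
print(column_names(sample_payload))
```

With the values from this test, build_schema_url reproduces the schema_url listed above, which makes it easy to generate URLs for other tables in the same catalog.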

3.4 Local Deployment of SeaTunnel

Since this feature hasn’t been officially released, you need to manually compile the latest dev branch and deploy it locally.

3.5 Prepare Data Files

This test case uses a CSV file containing 2,000 records.

4. SeaTunnel Job Configuration

4.1 Core Configuration Example

env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  LocalFile {
    path = "/Users/wangxuepeng/Desktop/seatunnel/apache-seatunnel-2.3.13-SNAPSHOT/test_data"
    file_format_type = "csv"
    schema {
      schema_url = "http://localhost:8090/api/metalakes/test_Metalake/catalogs/local-mysql/schemas/test/tables/demo_user"
    }
  }
}
sink {
  jdbc {
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    username = "root"
    password = "123456"
    database = "test"
    table = "demo_user"
    generate_sink_sql = true
  }
}

4.2 Key Configuration Notes

  • schema.schema_url

    • Points to the table metadata REST API in Gravitino
    • SeaTunnel automatically fetches the table schema at job start
    • No need to manually declare field lists in jobs
  • generate_sink_sql = true

    • Sink automatically generates INSERT SQL based on the parsed schema
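
To make the effect of generate_sink_sql more concrete, here is a minimal Python sketch of how a parameterized INSERT statement could be derived from a list of column names. It is not SeaTunnel's implementation. The function name build_insert_sql and the backtick quoting are assumptions chosen for MySQL.

```python
def build_insert_sql(table, columns):
    """Build a parameterized MySQL-style INSERT for the given columns."""
    if not columns:
        raise ValueError("at least one column is required")

    def quote_ident(name):
        # Backtick-quote and escape embedded backticks (MySQL convention).
        return "`" + name.replace("`", "``") + "`"

    column_list = ", ".join(quote_ident(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {quote_ident(table)} ({column_list}) VALUES ({placeholders})"


# Shortened column list; the real demo_user table has many more fields.
insert_sql = build_insert_sql("demo_user", ["id", "user_code", "user_name"])
print(insert_sql)
```

The point of the sketch is that once the field list is known from the schema, the statement follows mechanically, so nothing about the columns has to be written into the job by hand.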

5. Data and Job Execution Results

[Screenshot: job execution log]

During job execution:

  • Source automatically parses field structure via schema_url
  • CSV fields automatically align with the table schema
  • Data successfully written to MySQL demo_user table
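
The alignment step can be pictured as pairing each CSV value with the schema field at the same position. The Python sketch below shows that idea only. align_rows is a hypothetical helper, the three-column schema is a shortened stand-in for demo_user, and the type conversion a real reader would also perform is omitted.

```python
import csv
import io


def align_rows(csv_text, field_names, delimiter=","):
    """Pair each CSV value with the schema field at the same position."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for line_no, values in enumerate(reader, start=1):
        # A row whose width differs from the schema cannot be aligned safely.
        if len(values) != len(field_names):
            raise ValueError(
                f"line {line_no}: expected {len(field_names)} values, got {len(values)}"
            )
        rows.append(dict(zip(field_names, values)))
    return rows


fields = ["id", "user_code", "user_name"]
sample = "1,U0001,alice\n2,U0002,bob\n"
aligned = align_rows(sample, fields)
print(aligned[0])
```

Because the pairing is positional, the column order in the CSV file has to match the field order of the table schema.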

6. FAQ

6.1 Supported Connectors

Currently, the dev branch supports schema_url in file-type connectors, including local file, HDFS, and S3.

6.2 Does schema_url support multiple tables?

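Yes. schema_url does not interfere with multi-table reads, and the two can be combined. In the following example, the first table declares its fields inline while the second fetches them via schema_url: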

source {
  LocalFile {
    tables_configs = [
      {
        path = "/seatunnel/read/metalake/table1"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table1"
          fields {
            c_string = string
            c_int = int
            c_boolean = boolean
            c_double = double
          }
        }
      },
      {
        path = "/seatunnel/read/metalake/table2"
        file_format_type = "csv"
        field_delimiter = ","
        row_delimiter = "\n"
        skip_header_row_number = 1
        schema {
          table = "db.table2"
          schema_url = "http://gravitino:8090/api/metalakes/test_metalake/catalogs/test_catalog/schemas/test_schema/tables/table2"
        }
      }
    ]
  }
}

7. Feature Summary

By introducing Gravitino schema_url–based automatic schema parsing, SeaTunnel gains the following advantages in data sync scenarios:

  • Eliminates repeated schema definitions, reducing job configuration complexity
  • Reuses a unified metadata management system, improving consistency
  • Jobs adapt more easily to table structure changes, significantly lowering maintenance costs

This feature is ideal for:

  • Enterprises with mature metadata platforms
  • Large tables with many fields or frequent schema changes
  • Users seeking improved maintainability of SeaTunnel jobs
