DEV Community

Apache SeaTunnel

Say Goodbye to Hand-Written Schemas! SeaTunnel’s Integration with Gravitino Metadata REST API Is a Really Cool Move

Every time you configure a non-relational database in Apache SeaTunnel and find yourself staring at hundreds of lines of hand-written field mappings, doesn’t it feel exhausting?
One wrong field definition and the job fails. This kind of “manual labor” really should come to an end.

Recently, Issue #10339 in the Apache SeaTunnel community finally took aim at this long-standing pain point: since we already have a powerful metadata service like Apache Gravitino, why not let it synchronize Schemas automatically?

Once this proposal was raised, it quickly sparked active discussion in the community. Core maintainers have already added it to the annual Roadmap. The current discussion is very pragmatic: everyone is focusing on how Apache SeaTunnel can automatically fetch the latest metadata at job submission time, so users can completely say goodbye to the primitive life of “hand-writing database schemas”.

🫱 Issue link: https://github.com/apache/seatunnel/issues/10339

Issue Overview

Let’s first look at why the author raised this Issue and the initial core design concept behind it. 🔽

This PR implements integration between Apache Gravitino and SeaTunnel, using Gravitino as an external metadata service for non-relational connectors.
By automatically retrieving table structures and metadata via Gravitino’s REST API, SeaTunnel users no longer need to manually define long and complex Schema mappings in connector configurations.

Background

Currently, many non-relational connectors in Apache SeaTunnel (such as Elasticsearch, vector databases, and data lake engines) require users to explicitly define complete column Schemas in job configurations. This leads to several problems:

  • Cumbersome and error-prone configuration: Field mappings are lengthy and highly susceptible to human error.
  • Schema redundancy: Large amounts of duplicate Schema definitions exist across different jobs.
  • Risk of data inconsistency: Schema drift can easily occur between the actual storage layer and SeaTunnel configuration files.

Changes Introduced

This PR adds Gravitino-based Catalog and Schema resolvers, enabling SeaTunnel to:

  • Query table definitions from Gravitino via REST API.
  • Automatically retrieve column names, data types, and related attributes.
  • Construct SeaTunnel’s internal Schema directly from Gravitino metadata.
  • Remove the mandatory requirement to manually define schema { fields { ... } } for supported connectors.

After implementation, users only need to specify the Gravitino Catalog and related table references in their job configurations.

Key Advantages

  • Zero manual mapping: Automatic Schema alignment for non-relational data sources.
  • Single source of truth: Ensures table structures remain highly consistent with the centralized metadata repository.
  • Improved reliability: Significantly increases configuration accuracy and reduces long-term maintenance costs.
  • Support for complex types: Unified metadata simplifies handling of nested structures, JSON, vectors, and other advanced data types.

Execution Scope

All Gravitino-based Schema parsing and validation is performed on the SeaTunnel Engine client side (i.e., before job submission).
This design ensures that:

  • Invalid or incompatible Schemas can be detected during the job pre-check phase.
  • Runtime tasks receive only validated and standardized Schemas, reducing the probability of execution failures.

Impact

This update greatly simplifies job setup for non-relational connectors.
Beyond improved usability, it also lays a technical foundation for the entire SeaTunnel ecosystem in terms of unified schema management, schema evolution, and advanced data type support.

Core Design Idea

For semi-structured and unstructured data sources such as FTP, S3, Elasticsearch, and MongoDB, the proposal has SeaTunnel resolve Schemas automatically via the Gravitino REST API.

Note that this is not meant to replace existing explicit configurations. It is an optional new mechanism that stays fully backward-compatible: existing jobs with inline Schemas keep working unchanged.

The Schema resolution priority is as follows:

1. Explicit Configuration (Inline Schema) Always Takes Priority

As long as a schema block is defined in the connector configuration, SeaTunnel must ignore Gravitino and use the explicitly defined Schema.

FtpFile {
  path = "/tmp/seatunnel/sink/text"
  # ... other basic configurations ...

  # If defined here, Gravitino will not be queried
  schema = {
    name = string
    age  = int
  }
}

2. Global Gravitino Configuration via env (Recommended Mode)

SeaTunnel has integrated Gravitino Metalake at the engine level.
Once enabled globally in env, all non-relational data sources can reference Schemas directly by name.

env {
  metalake_enabled = true
  metalake_type    = "gravitino"
  metalake_url     = "http://localhost:8090/api/metalakes/metalake_name/catalogs/"
}

2.1 Reference using schema_path

FtpFile {
  # ... basic configuration ...
  schema_path = "catalog_name.ykw.test_table"
}
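To make the schema_path reference concrete, here is a minimal Python sketch of how a dotted path could be combined with metalake_url into a table URL. The function name build_table_url is hypothetical, and the /schemas/ and /tables/ segments are an assumption based on the shape of the schema_url examples in this post; the actual assembly rules are still to be settled in the design document.

```python
from urllib.parse import quote


def build_table_url(metalake_url: str, schema_path: str) -> str:
    """Turn a catalogs base URL plus 'catalog.schema.table' into a table URL.

    Hypothetical helper: the URL layout below is an assumption, not
    SeaTunnel's confirmed behavior.
    """
    parts = schema_path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {schema_path!r}")
    # Percent-encode each segment so odd characters cannot break the path.
    catalog, schema, table = (quote(part, safe="") for part in parts)
    base = metalake_url.rstrip("/")
    # Assumed layout: <catalogs base>/<catalog>/schemas/<schema>/tables/<table>
    return f"{base}/{catalog}/schemas/{schema}/tables/{table}"
```

With the values used in this post, the call build_table_url("http://localhost:8090/api/metalakes/metalake_name/catalogs/", "catalog_name.ykw.test_table") would produce a URL ending in catalogs/catalog_name/schemas/ykw/tables/test_table.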

2.2 Reference using schema_url

FtpFile {
  # ... basic configuration ...
  schema_url = "http://localhost:8090/api/metalakes/laowang_test/.../tables/all_type"
}

3. Fallback Logic: Read from OS Environment Variables

If Gravitino is not defined in the job’s env block, SeaTunnel will attempt to read the following configuration from operating system environment variables:

  • metalake_enabled
  • metalake_type
  • metalake_url

The behavior is identical to the env configuration described in Section 2.
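The fallback can be pictured with a short Python sketch. It assumes the OS environment variables carry exactly the same names as the env keys and that the fallback applies to the block as a whole (OS variables are consulted only when the job env defines none of the three keys). The function name load_metalake_settings and the returned dictionary shape are hypothetical.

```python
import os
from typing import Mapping, Optional

METALAKE_KEYS = ("metalake_enabled", "metalake_type", "metalake_url")


def load_metalake_settings(
    job_env: Mapping[str, object],
    os_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Read metalake settings from the job env block, else from the OS."""
    if os_env is None:
        os_env = os.environ
    # Assumption: the job env wins as soon as it defines any metalake key.
    if any(key in job_env for key in METALAKE_KEYS):
        source = job_env
    else:
        source = os_env
    # OS variables are always strings, so normalize the flag to a bool.
    enabled = str(source.get("metalake_enabled", "false")).strip().lower() == "true"
    return {
        "enabled": enabled,
        "type": source.get("metalake_type"),
        "url": source.get("metalake_url"),
    }
```

Passing os_env explicitly keeps the function easy to test; in a real client it would simply default to the process environment.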

4. Connector-Level Gravitino Configuration

If no global metadata center is configured, Gravitino can also be defined directly within a specific connector.

4.1 Use schema_url directly

FtpFile {
  # ... basic configuration ...
  metalake_type = "gravitino"
  schema_url = "http://localhost:8090/api/.../tables/all_type"
}

4.2 Combine metalake_url with schema_path

FtpFile {
  # ... basic configuration ...
  metalake_type = "gravitino"
  metalake_url  = "http://localhost:8090/api/metalakes/metalake_name/catalogs/"
  schema_path   = "catalog_name.ykw.test_table"
}
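Putting Sections 1 to 4 together, the priority order can be sketched as a single decision function in Python. This is only an illustration of the rules described above, not SeaTunnel code: choose_schema_source is a hypothetical name, and two details are assumptions the post does not settle, namely that schema_url is checked before schema_path, and that a global metalake_url is preferred over a connector-level one.

```python
from typing import Mapping, Optional, Tuple


def choose_schema_source(
    connector: Mapping[str, object],
    global_metalake_url: Optional[str] = None,
) -> Tuple[str, object]:
    """Decide where a connector's Schema comes from."""
    # 1. An inline schema always wins; Gravitino is not queried at all.
    if "schema" in connector:
        return ("inline", connector["schema"])
    # 2. A full table URL needs no further assembly (ordering is assumed).
    if "schema_url" in connector:
        return ("schema_url", connector["schema_url"])
    # 3. A dotted path needs a base URL: global first, then connector-level.
    if "schema_path" in connector:
        base = global_metalake_url or connector.get("metalake_url")
        if base is None:
            raise ValueError("schema_path given but no metalake_url configured")
        return ("schema_path", (base, connector["schema_path"]))
    raise ValueError("no schema, schema_url or schema_path configured")
```

For example, a connector that defines both schema and schema_path resolves to ("inline", ...), which matches the rule that explicit configuration always takes priority.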

5. Detector Resolution (Find Detector)

The system automatically matches and loads the corresponding REST API HTTP detector based on metalake_type.
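One common way to implement this kind of lookup is a small registry keyed by metalake_type. The Python sketch below shows the idea only; register_detector, find_detector, and gravitino_detector are hypothetical names, and the real detector may need request headers or authentication that are not shown here.

```python
import urllib.request
from typing import Callable, Dict

Detector = Callable[[str], str]
_DETECTORS: Dict[str, Detector] = {}


def register_detector(metalake_type: str) -> Callable[[Detector], Detector]:
    """Decorator that registers a detector under a metalake_type key."""
    def decorator(fn: Detector) -> Detector:
        _DETECTORS[metalake_type.lower()] = fn
        return fn
    return decorator


def find_detector(metalake_type: str) -> Detector:
    """Return the detector registered for metalake_type, or fail early."""
    try:
        return _DETECTORS[metalake_type.lower()]
    except KeyError:
        supported = ", ".join(sorted(_DETECTORS)) or "none"
        raise ValueError(
            f"unsupported metalake_type {metalake_type!r} (supported: {supported})"
        ) from None


@register_detector("gravitino")
def gravitino_detector(url: str) -> str:
    """Fetch the raw response body for a table URL (plain GET, assumed)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Failing on an unknown metalake_type at lookup time fits the client-side pre-check described earlier: a misconfigured job is rejected before it is submitted.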

6. Mapping and CatalogTable Construction

The detector calls the assembled URL to retrieve the response body, then passes it to the Mapper for type matching, ultimately completing the construction of the CatalogTable.
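The mapping step can be sketched in Python as follows. Both the response shape (a table object holding a columns list with name and type fields) and the type table are assumptions made for illustration; the real Mapper would cover far more types, including nested and parameterized ones, and would build a CatalogTable rather than a plain list. TYPE_MAP and map_columns are hypothetical names.

```python
import json
from typing import List, Tuple

# Hypothetical subset: metadata-service type name -> SeaTunnel schema type.
TYPE_MAP = {
    "string": "string",
    "boolean": "boolean",
    "integer": "int",
    "long": "bigint",
    "double": "double",
}


def map_columns(response_body: str) -> List[Tuple[str, str]]:
    """Convert a table response body into (column name, SeaTunnel type) pairs."""
    table = json.loads(response_body)["table"]
    columns = []
    for column in table["columns"]:
        source_type = column["type"]
        # Reject anything unmapped so the error surfaces before submission.
        if not isinstance(source_type, str) or source_type not in TYPE_MAP:
            raise ValueError(
                f"column {column['name']!r}: unsupported type {source_type!r}"
            )
        columns.append((column["name"], TYPE_MAP[source_type]))
    return columns
```

Fed a body describing a string column name and an integer column age, this yields the same two fields as the inline schema example in Section 1.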

7. Workflow Diagram

[Figure: workflow diagram of the Schema resolution process]

Issue Progress

At present, Apache SeaTunnel core contributors have given positive feedback on this proposal and added it to the Apache SeaTunnel Roadmap.

Apache SeaTunnel PMC Members raised several questions, such as which architectural layer this integration belongs to, how multi-engine compatibility should be handled, and the accuracy of type conversion. According to community design standards, they requested the proposer to submit a formal Design Document.

The proposer’s response was constructive. By emphasizing “client-side preprocessing” and an “abstract Catalog interface”, the proposer addressed community concerns about system coupling and runtime stability.

The next step now rests with the original Issue author: the community is waiting for the formal Design Document to be submitted.

It’s easy to see that once this solution is implemented, writing jobs may only require one or two lines of configuration. The design draft is still being refined, and community feedback is extremely valuable. After all, whether this feature is truly usable is something frontline developers understand best.

So go check it out on GitHub—your suggestion might just shape the next release! 🔽
https://github.com/apache/seatunnel/issues/10339
