<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yue @ Datastrato (Admin)</title>
    <description>The latest articles on DEV Community by Yue @ Datastrato (Admin) (@yueguo).</description>
    <link>https://dev.to/yueguo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3376613%2F5a55f02d-15be-46d0-9f9c-7bd8e3ad6ec2.jpg</url>
      <title>DEV Community: Yue @ Datastrato (Admin)</title>
      <link>https://dev.to/yueguo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yueguo"/>
    <language>en</language>
    <item>
      <title>Using Gravitino with Apache Flink for Streaming</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:10:14 +0000</pubDate>
      <link>https://dev.to/gravitino/using-gravitino-with-apache-flink-for-streaming-25n9</link>
      <guid>https://dev.to/gravitino/using-gravitino-with-apache-flink-for-streaming-25n9</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/fanng-1-2081a7330/" rel="noopener noreferrer"&gt;xiaojing&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-03-11&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to use Apache Gravitino with Apache Flink to build a simple streaming pipeline. You will create a Hive catalog and a Paimon catalog in Gravitino, define a Kafka-backed &lt;strong&gt;generic table&lt;/strong&gt; in the Hive catalog, and then use Flink SQL (through the Gravitino Flink connector) to read Kafka data and write it into a Paimon table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure the Gravitino Flink connector&lt;/strong&gt; in Flink&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create Hive and Paimon catalogs&lt;/strong&gt; in Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define a Kafka generic table&lt;/strong&gt; in the Hive catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream data from Kafka to Paimon&lt;/strong&gt; using Flink SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4iplzi4tm46m46w7fsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4iplzi4tm46m46w7fsr.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS&lt;/li&gt;
&lt;li&gt;JDK 17 or later (required by the Gravitino server)&lt;/li&gt;
&lt;li&gt;Apache Flink 1.18 (recommended for the Gravitino Flink connector)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server v1.2.0 or later (this tutorial requires features introduced after v1.1.0; see &lt;a href="//../02-setup-guide/README.md"&gt;02-setup-guide/README.md&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Hive Metastore (for the Hive catalog)&lt;/li&gt;
&lt;li&gt;Apache Kafka broker (for the Kafka source table)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suggested Versions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Paimon connector JAR that matches your Flink version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java and Flink installations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/flink &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step-by-Step Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Set environment variables
&lt;/h3&gt;

&lt;p&gt;These values are used throughout the tutorial. Adjust them for your environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"default_metalake"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HIVE_METASTORE_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"thrift://localhost:9083"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PAIMON_WAREHOUSE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"file:///tmp/paimon-warehouse"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KAFKA_BROKERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"localhost:9092"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
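
&lt;p&gt;Before moving on, you can confirm the Gravitino server is reachable at that URI; the &lt;code&gt;/api/version&lt;/code&gt; endpoint returns the running server version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sS "${GRAVITINO_URI}/api/version"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;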



&lt;h3&gt;
  
  
  Step 2: Create Hive and Paimon catalogs in Gravitino
&lt;/h3&gt;

&lt;p&gt;Create a Hive catalog and a Paimon catalog using the Gravitino REST API.&lt;br&gt;
If you need to pass Hive-specific configs (for example &lt;code&gt;hive-conf-dir&lt;/code&gt;), set them as catalog properties with the &lt;code&gt;flink.bypass.&lt;/code&gt; prefix (for example &lt;code&gt;flink.bypass.hive-conf-dir&lt;/code&gt;); these properties are forwarded to the Flink Hive connector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create Hive catalog&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "hive_catalog",
    "type": "relational",
    "comment": "Hive catalog for Flink streaming",
    "provider": "hive",
    "properties": {
      "metastore.uris": "'&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HIVE_METASTORE_URI&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;'"
    }
  }'&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/api/metalakes/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/catalogs

&lt;span class="c"&gt;# Create Paimon catalog (filesystem backend)&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "paimon_catalog",
    "type": "relational",
    "comment": "Paimon catalog for Flink streaming",
    "provider": "lakehouse-paimon",
    "properties": {
      "catalog-backend": "filesystem",
      "warehouse": "'&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PAIMON_WAREHOUSE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;'"
    }
  }'&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/api/metalakes/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
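
&lt;p&gt;To confirm both catalogs were created, list the catalogs in the metalake; the response should include &lt;code&gt;hive_catalog&lt;/code&gt; and &lt;code&gt;paimon_catalog&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sS -H "Accept: application/vnd.gravitino.v1+json" \
  "${GRAVITINO_URI}/api/metalakes/${METALAKE_NAME}/catalogs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;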



&lt;h3&gt;
  
  
  Step 3: Install required JARs in Flink
&lt;/h3&gt;

&lt;p&gt;Place the following JARs in &lt;code&gt;FLINK_HOME/lib&lt;/code&gt; so Flink SQL can load them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gravitino-flink-connector-runtime-1.18_2.12-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;paimon-flink-1.18-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flink-sql-connector-kafka-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Hive dependencies required by Flink HiveCatalog (same as Flink-Hive integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: The Kafka SQL connector is not included in the Flink binary distribution and must be added separately.&lt;/p&gt;
&lt;/blockquote&gt;
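
&lt;p&gt;As a concrete sketch (the file names and &lt;code&gt;&amp;lt;version&amp;gt;&lt;/code&gt; placeholders below are examples; use the JARs that match your Flink and Gravitino versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# copy the connector JARs into Flink's classpath
cp gravitino-flink-connector-runtime-1.18_2.12-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/
cp paimon-flink-1.18-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/
cp flink-sql-connector-kafka-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/

# confirm they are in place
ls ${FLINK_HOME}/lib | grep -E 'gravitino|paimon|kafka'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;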

&lt;h3&gt;
  
  
  Step 4: Configure Flink to use the Gravitino catalog store
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;FLINK_HOME/conf/flink-conf.yaml&lt;/code&gt; and add (replace with your values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;table.catalog-store.kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="na"&gt;table.catalog-store.gravitino.gravitino.metalake&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${METALAKE_NAME}&lt;/span&gt;
&lt;span class="na"&gt;table.catalog-store.gravitino.gravitino.uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GRAVITINO_URI}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Flink if it is running, then make sure a Flink cluster is reachable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/start-cluster.sh
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://localhost:8081/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;curl&lt;/code&gt; fails with connection refused, the &lt;code&gt;INSERT INTO ... SELECT ...&lt;/code&gt; statement in Step 7 will also fail, because the SQL client cannot submit jobs to the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Create a Kafka generic table in the Hive catalog
&lt;/h3&gt;

&lt;p&gt;Flink's &lt;code&gt;HiveCatalog&lt;/code&gt; supports both Hive-compatible tables and &lt;strong&gt;generic tables&lt;/strong&gt;. A table is generic by default in &lt;code&gt;HiveCatalog&lt;/code&gt; unless you explicitly set &lt;code&gt;'connector' = 'hive'&lt;/code&gt; or use Hive dialect. Here we create a Kafka &lt;strong&gt;generic&lt;/strong&gt; table so the metadata is stored in Hive Metastore, while the data is read from Kafka by Flink. If you want a Hive-compatible table, use Hive dialect or set &lt;code&gt;'connector' = 'hive'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Start the Flink SQL client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/sql-client.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the SQL client, run the following statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Use the Hive catalog managed by Gravitino&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;hive_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Kafka source table stored as a generic table in Hive catalog&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;kafka_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_LTZ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;METADATA&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;WATERMARK&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'5'&lt;/span&gt; &lt;span class="k"&gt;SECOND&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'connector'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'kafka'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'topic'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user_behavior'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'properties.bootstrap.servers'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'${KAFKA_BROKERS}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- replace with your Kafka brokers&lt;/span&gt;
  &lt;span class="s1"&gt;'properties.group.id'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gravitino-flink-demo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'scan.startup.mode'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'earliest-offset'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'format'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'json.ignore-parse-errors'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes about generic tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiveCatalog supports Hive-compatible tables and generic tables. Hive-compatible tables are stored in a Hive-compatible way and can be queried from Hive.&lt;/li&gt;
&lt;li&gt;Generic tables are Flink-specific. Hive can see the metadata in Hive Metastore but typically cannot interpret it, so querying them from Hive results in undefined behavior.&lt;/li&gt;
&lt;li&gt;To create a Hive-compatible table while using the default dialect, set &lt;code&gt;'connector' = 'hive'&lt;/code&gt;; with the Hive dialect, the &lt;code&gt;connector&lt;/code&gt; property is not required.&lt;/li&gt;
&lt;li&gt;In Gravitino, generic table schema and partition keys are stored in &lt;code&gt;flink.*&lt;/code&gt; properties in Hive Metastore. If &lt;code&gt;connector=hive&lt;/code&gt;, the table is treated as a Hive-compatible table with a native Hive schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Create a Paimon sink table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;paimon_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;paimon_user_behavior&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_LTZ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Stream data from Kafka to Paimon
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="s1"&gt;'execution.checkpointing.interval'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'10 s'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;paimon_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paimon_user_behavior&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;behavior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Kafka is receiving data in &lt;code&gt;user_behavior&lt;/code&gt;, Flink will continuously write it to the Paimon table.&lt;br&gt;
For streaming writes to Paimon, periodic checkpoints are required for commits.&lt;/p&gt;
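
&lt;p&gt;To check that rows are actually landing in Paimon, you can run a bounded read from a separate Flink SQL client session, switching that session to batch mode so the query terminates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET 'execution.runtime-mode' = 'batch';

SELECT * FROM paimon_catalog.streaming_db.paimon_user_behavior LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;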
&lt;h2&gt;
  
  
  Code Examples
&lt;/h2&gt;

&lt;p&gt;Sample Kafka messages (JSON lines):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"buy"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
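
&lt;p&gt;One way to publish these messages is Kafka's console producer (a sketch assuming a local Kafka installation; &lt;code&gt;KAFKA_HOME&lt;/code&gt; is a placeholder for your Kafka distribution directory). Enter one JSON object per line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;${KAFKA_HOME}/bin/kafka-console-producer.sh \
  --bootstrap-server "${KAFKA_BROKERS}" \
  --topic user_behavior
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;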



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs not visible in Flink&lt;/strong&gt;: Verify &lt;code&gt;table.catalog-store.*&lt;/code&gt; settings in &lt;code&gt;flink-conf.yaml&lt;/code&gt; and that the Gravitino server is reachable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClassNotFoundException&lt;/strong&gt;: Ensure the Gravitino connector, Kafka connector, and Paimon JARs are present in &lt;code&gt;FLINK_HOME/lib&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;java.net.ConnectException: Connection refused&lt;/strong&gt; when running &lt;code&gt;INSERT INTO&lt;/code&gt;: Flink SQL client cannot reach JobManager REST endpoint (default &lt;code&gt;localhost:8081&lt;/code&gt;). Start cluster with &lt;code&gt;${FLINK_HOME}/bin/start-cluster.sh&lt;/code&gt; and verify &lt;code&gt;curl http://localhost:8081/overview&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job is RUNNING but no new rows in Paimon&lt;/strong&gt;: Ensure checkpoints are enabled in streaming mode (for example &lt;code&gt;SET 'execution.checkpointing.interval' = '10 s';&lt;/code&gt;) and check checkpoint progress in Flink Web UI or &lt;code&gt;/jobs/&amp;lt;job-id&amp;gt;/checkpoints&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job is RUNNING but expected records are skipped after rerun&lt;/strong&gt;: Kafka offsets are tracked by &lt;code&gt;properties.group.id&lt;/code&gt;. Use a new group id (for example &lt;code&gt;gravitino-flink-demo-v2&lt;/code&gt;) when you want a fresh replay behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Use fully qualified names like &lt;code&gt;hive_catalog.streaming_db.kafka_events&lt;/code&gt; and &lt;code&gt;paimon_catalog.streaming_db.paimon_user_behavior&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Flink streaming tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Flink streaming environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Flink connector for unified catalog access&lt;/li&gt;
&lt;li&gt;Hive and Paimon catalogs registered in Gravitino and accessible from Flink SQL&lt;/li&gt;
&lt;li&gt;A working streaming pipeline that reads from Kafka and writes to Paimon&lt;/li&gt;
&lt;li&gt;An understanding of generic tables versus Hive-compatible tables in HiveCatalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Flink environment is now ready to leverage Gravitino for unified metadata management across your streaming data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/latest/flink-connector/flink-catalog-paimon" rel="noopener noreferrer"&gt;Gravitino Flink Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/" rel="noopener noreferrer"&gt;Apache Flink SQL&lt;/a&gt; for more query patterns&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://paimon.apache.org/docs/master/flink/quick-start/" rel="noopener noreferrer"&gt;Apache Paimon with Flink&lt;/a&gt; for Paimon-specific features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore Iceberg catalogs with Gravitino in &lt;a href="//../03-iceberg-catalog/README.md"&gt;03-iceberg-catalog/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Try query federation with Trino in &lt;a href="//../06-trino-query/README.md"&gt;06-trino-query/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving; this article targets Gravitino server v1.2.0 or later (see Prerequisites). If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>apacheflink</category>
      <category>apachegravitino</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino Iceberg REST Catalog Access Control Deployment Guide</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Sun, 15 Feb 2026 20:05:36 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-iceberg-rest-catalog-access-control-deployment-guide-4ck</link>
      <guid>https://dev.to/gravitino/apache-gravitino-iceberg-rest-catalog-access-control-deployment-guide-4ck</guid>
      <description>&lt;h2&gt;
  
  
  1. Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Product Introduction
&lt;/h3&gt;

&lt;p&gt;Apache Gravitino IRC (Iceberg REST Catalog) is a Gravitino-based implementation of the Iceberg REST catalog that provides unified Iceberg table management. Starting from v1.1.0, Gravitino IRC supports access control for Iceberg tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ Table operation authorization&lt;/li&gt;
&lt;li&gt;✅ Multi-tenancy support&lt;/li&gt;
&lt;li&gt;✅ RESTful API interface&lt;/li&gt;
&lt;li&gt;✅ Seamless integration with Spark&lt;/li&gt;
&lt;li&gt;✅ Role-based access control (RBAC)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.3 Current Status
&lt;/h3&gt;

&lt;p&gt;Gravitino IRC currently supports table-level operation authorization; more access control features are planned for future releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. System Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu3zu6913tjmoumksy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu3zu6913tjmoumksy1.png" alt="Architecture Diagram" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Component Description
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino Server&lt;/strong&gt;: Core metadata service, primarily managing table permission information in this scenario; port 8090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg REST Service&lt;/strong&gt;: Iceberg REST catalog service that connects to Gravitino Server via API to retrieve permission information; port 9002&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL&lt;/strong&gt;: Metadata storage for both Gravitino and IRC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage&lt;/strong&gt;: Data file storage&lt;/li&gt;
&lt;/ul&gt;
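
&lt;p&gt;After deployment, a quick reachability check against the two service ports looks like this (a sketch assuming both services run on localhost; the Iceberg REST service exposes the standard Iceberg &lt;code&gt;/v1/config&lt;/code&gt; endpoint under the &lt;code&gt;/iceberg&lt;/code&gt; prefix):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Gravitino Server (port 8090)
curl -sS http://localhost:8090/api/version

# Iceberg REST Service (port 9002)
curl -sS http://localhost:9002/iceberg/v1/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;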

&lt;h2&gt;
  
  
  3. Environment Requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 System Requirements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;4 cores&lt;/td&gt;
&lt;td&gt;8 cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;100GB&lt;/td&gt;
&lt;td&gt;500GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Gigabit&lt;/td&gt;
&lt;td&gt;10 Gigabit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3.2 Software Dependencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;JDK 17+&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;5.7+&lt;/td&gt;
&lt;td&gt;Metadata storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;3.4+&lt;/td&gt;
&lt;td&gt;Optional, client&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core Configuration File
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;gravitino.conf&lt;/code&gt; configuration file in &lt;code&gt;GRAVITINO_HOME/conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================
# Gravitino Service Basic Configuration
# ============================================
&lt;/span&gt;
&lt;span class="c"&gt;# Service shutdown timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.shutdown.timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3000&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Web Server Configuration
# ============================================
&lt;/span&gt;
&lt;span class="c"&gt;# Web server host address
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="c"&gt;# HTTP port
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8090&lt;/span&gt;
&lt;span class="c"&gt;# Minimum threads
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.minThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;24&lt;/span&gt;
&lt;span class="c"&gt;# Maximum threads
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.maxThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;200&lt;/span&gt;
&lt;span class="c"&gt;# Stop timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.stopTimeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;30000&lt;/span&gt;
&lt;span class="c"&gt;# Idle timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.idleTimeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;30000&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Entity Store Configuration (MySQL)
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;relational&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;JDBCBackend&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:mysql://192.168.194.152:3306/gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;com.mysql.cj.jdbc.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Cache Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.maxEntries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.expireTimeInMs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3600000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.enableWeigher&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.implementation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;caffeine&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Authorization Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.impl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.apache.gravitino.server.authorization.jcasbin.JcasbinAuthorizer&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.serviceAdmins&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin # Admin account for creating metalake&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authenticators&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;simple&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Iceberg REST Service Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.classpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest-server/libs,iceberg-rest-server/conf&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;9002&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.catalog-config-provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;dynamic-config-provider&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8090/&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-metalake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;my_metalake  # Metalake name to use&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-simple.user-name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;rest-catalog # User for IRC service to fetch catalog info&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.default-catalog-name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;catalog_iceberg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Deployment Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Database Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to scripts directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;distribution/package/scripts

&lt;span class="c"&gt;# Execute SQL based on database type&lt;/span&gt;
&lt;span class="c"&gt;# MySQL example&lt;/span&gt;
mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;host&amp;gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &amp;lt;user&amp;gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; &amp;lt;database&amp;gt; &amp;lt; xxx.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Download Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download MySQL driver&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$GRAVITINO_HOME&lt;/span&gt;
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar libs/
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar catalogs/lakehouse-iceberg/libs
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar iceberg-rest-server/libs

&lt;span class="c"&gt;# Copy bundle jar files&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; bundles/aws-bundle/build/libs/&lt;span class="k"&gt;*&lt;/span&gt;.jar distribution/package/catalogs/lakehouse-iceberg/libs
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; bundles/aws-bundle/build/libs/&lt;span class="k"&gt;*&lt;/span&gt;.jar distribution/package/iceberg-rest-server/libs

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.9.2/iceberg-aws-bundle-1.9.2.jar
&lt;span class="nb"&gt;cp &lt;/span&gt;iceberg-aws-bundle-1.9.2.jar distribution/package/iceberg-rest-server/libs
&lt;span class="nb"&gt;cp &lt;/span&gt;iceberg-aws-bundle-1.9.2.jar distribution/package/catalogs/lakehouse-iceberg/libs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.3 Start Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Gravitino service&lt;/span&gt;
/bin/bash bin/gravitino.sh start

&lt;span class="c"&gt;# Check service status&lt;/span&gt;
/bin/bash bin/gravitino.sh status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
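
&lt;p&gt;Once the service is started, you can also probe the REST API directly to confirm the web server is reachable. This assumes the default host and port from &lt;code&gt;gravitino.conf&lt;/code&gt;; the version endpoint is a convenient read-only check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Query the server version as a lightweight health check&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8090/api/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;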



&lt;h3&gt;
  
  
  5.4 Create Metalake
&lt;/h3&gt;

&lt;p&gt;If you haven't created a metalake yet, use the API (or web UI) to create one named &lt;code&gt;my_metalake&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create Metalake with admin privileges&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "my_metalake",
    "comment": "",
    "properties": {}
  }'&lt;/span&gt; http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
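
&lt;p&gt;The &lt;code&gt;Authorization&lt;/code&gt; header above is plain HTTP Basic auth: the &lt;code&gt;user:password&lt;/code&gt; pair is base64-encoded inline. With the placeholder credentials used in this guide, the encoded value can be produced directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Encode placeholder credentials for the Authorization header&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;
&lt;span class="c"&gt;# Output: YWRtaW46cGFzc3dvcmQ=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;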



&lt;h3&gt;
  
  
  5.5 Create Iceberg Catalog
&lt;/h3&gt;

&lt;p&gt;Register an Iceberg catalog in Gravitino; it must use the same backend (such as HMS or JDBC) as the running Iceberg REST service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_iceberg",
    "type": "RELATIONAL",
    "provider": "lakehouse-iceberg",
    "comment": "Iceberg catalog",
    "properties": {
      "uri": "jdbc:mysql://mysql-host:3306/iceberg_db",
      "catalog-backend": "jdbc",
      "warehouse": "s3://bucket/iceberg/warehouse/",
      "jdbc-user": "mysql_user",
      "jdbc-password": "mysql_password",
      "jdbc-driver": "com.mysql.cj.jdbc.Driver",
      "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
      "s3-secret-access-key": "your_secret_key",
      "s3-access-key-id": "your_access_key",
      "s3-region": "ap-southeast-1",
      "authentication.type": "simple",
      "credential-providers": "s3-token",
      "s3-endpoint": "http://s3.ap-southeast-1.amazonaws.com",
      "jdbc-initialize": "true",
      "s3-role-arn": "arn:aws:iam::730335553010:role/sts_s3_access_role"
    }
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Access Control Management
&lt;/h2&gt;

&lt;p&gt;Next, we will use Gravitino's RBAC permission model to configure access control for the Iceberg Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Permission Model
&lt;/h3&gt;

&lt;p&gt;Gravitino provides the following privileges related to catalog/schema/table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Privilege Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Applicable Objects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;USE_CATALOG&lt;/td&gt;
&lt;td&gt;Permission to use catalog&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USE_SCHEMA&lt;/td&gt;
&lt;td&gt;Permission to use schema&lt;/td&gt;
&lt;td&gt;Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SELECT_TABLE&lt;/td&gt;
&lt;td&gt;Permission to query table&lt;/td&gt;
&lt;td&gt;Table, Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MODIFY_TABLE&lt;/td&gt;
&lt;td&gt;Permission to modify table&lt;/td&gt;
&lt;td&gt;Table, Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE_TABLE&lt;/td&gt;
&lt;td&gt;Permission to create table&lt;/td&gt;
&lt;td&gt;Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE_SCHEMA&lt;/td&gt;
&lt;td&gt;Permission to create schema&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  6.2 Create Roles and Permissions
&lt;/h3&gt;

&lt;p&gt;Create a role named "data_reader" with various privileges on catalog, schema, and table. Please adjust the catalog, schema, and table names accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create schema&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "schema1",
  "comment": "comment",
  "properties": {
    "key1": "value1"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/catalogs/catalog_iceberg/schemas

&lt;span class="c"&gt;# Create role&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "data_reader",
    "properties": {"description": "data read"},
    "securableObjects": [
      {
        "fullName": "catalog_iceberg.schema1",
        "type": "SCHEMA",
        "privileges": [
          {"name": "CREATE_TABLE", "condition": "ALLOW"},
          {"name": "USE_SCHEMA", "condition": "ALLOW"}
        ]
      },
      {
        "fullName": "catalog_iceberg",
        "type": "CATALOG",
        "privileges": [{"name": "USE_CATALOG", "condition": "ALLOW"}]
      }
    ]
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a role for the Iceberg REST server user so that it can fetch catalog information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_reader",
    "properties": {"description": "load catalog infos"},
    "securableObjects": [
      {
        "fullName": "my_metalake",
        "type": "METALAKE",
        "privileges": [{"name": "USE_CATALOG", "condition": "ALLOW"}]
      }
    ]
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Create Users and Grant Permissions
&lt;/h3&gt;

&lt;p&gt;Create a user (for example, &lt;code&gt;spark_user&lt;/code&gt;) in Gravitino and grant it the &lt;code&gt;data_reader&lt;/code&gt; role created above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "spark_user"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/users

&lt;span class="c"&gt;# Grant permissions to user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; PUT &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"roleNames": ["data_reader"]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/permissions/users/spark_user/grant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the &lt;code&gt;rest-catalog&lt;/code&gt; user in Gravitino and grant it the &lt;code&gt;catalog_reader&lt;/code&gt; role so it can load catalogs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "rest-catalog"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/users

&lt;span class="c"&gt;# Grant permissions to user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; PUT &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"roleNames": ["catalog_reader"]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/permissions/users/rest-catalog/grant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Spark Integration
&lt;/h2&gt;

&lt;p&gt;After configuring permissions in Gravitino, you can test and verify on the client side.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Spark Configuration
&lt;/h3&gt;

&lt;p&gt;Using Spark as an example, configure the username on the client and point the Iceberg REST catalog URI at the IRC service address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-sql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jars&lt;/span&gt; &lt;span class="s2"&gt;"/path/to/iceberg-aws-bundle-1.9.2.jar,/path/to/iceberg-spark-runtime-3.4_2.12-1.9.2.jar"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.extensions&lt;span class="o"&gt;=&lt;/span&gt;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest&lt;span class="o"&gt;=&lt;/span&gt;org.apache.iceberg.spark.SparkCatalog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.type&lt;span class="o"&gt;=&lt;/span&gt;rest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.uri&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9002/iceberg/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest..X-Iceberg-Access-Delegation&lt;span class="o"&gt;=&lt;/span&gt;vended-credentials &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.type&lt;span class="o"&gt;=&lt;/span&gt;basic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.basic.username&lt;span class="o"&gt;=&lt;/span&gt;spark_user &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.basic.password&lt;span class="o"&gt;=&lt;/span&gt;user_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.2 Usage Examples
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Show available tables&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Through this guide, you have learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complete Deployment Process&lt;/strong&gt; - End-to-end guidance from environment preparation and database initialization through dependency downloads to service startup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Control System&lt;/strong&gt; - Understanding Gravitino's RBAC permission model, learning to create roles, assign permissions, and manage users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world Application Scenarios&lt;/strong&gt; - Learning how to use IRC access control in production through Spark integration examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Configuration Points&lt;/strong&gt; - Mastering key configuration parameters for Gravitino Server and Iceberg REST Service&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This solution provides enterprise-grade access control for your data lake, implementing fine-grained, table-level permission management without sacrificing usability.&lt;/p&gt;

</description>
      <category>datacatalog</category>
      <category>icebergrest</category>
      <category>metadata</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using Apache Gravitino with Trino for Query Federation</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 12 Feb 2026 00:29:32 +0000</pubDate>
      <link>https://dev.to/gravitino/using-apache-gravitino-with-trino-for-query-federation-4doi</link>
      <guid>https://dev.to/gravitino/using-apache-gravitino-with-trino-for-query-federation-4doi</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/hui-yu-503300394" rel="noopener noreferrer"&gt;Yu hui&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-02-11&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to integrate Apache Gravitino with Trino to enable query federation across multiple data sources through a unified metadata layer. By the end of this guide, you'll be able to configure Trino to automatically load catalogs from Gravitino and run cross-catalog queries seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connect Trino to Gravitino&lt;/strong&gt; to enable automatic loading of Gravitino-managed catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create catalogs from Trino SQL&lt;/strong&gt; including Iceberg and MySQL examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute federated queries&lt;/strong&gt; that join data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate catalog discovery&lt;/strong&gt; and inspect catalogs using Trino SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trino is a distributed SQL query engine designed for fast analytic queries against data of any size. In modern data architectures, organizations often need to query data across multiple heterogeneous systems (like MySQL, PostgreSQL, Iceberg, Hive) without moving or copying data. This is where query federation becomes essential.&lt;/p&gt;

&lt;p&gt;Apache Gravitino simplifies this by acting as a unified metadata control plane. By using the Gravitino Trino Connector, you can access multiple data sources through a single catalog interface in Trino, with automatic catalog discovery and centralized metadata management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified catalog access&lt;/strong&gt;: Query MySQL, Iceberg, Hive, and other sources through a single interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic catalog discovery&lt;/strong&gt;: Catalogs created in Gravitino are automatically available in Trino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero data movement&lt;/strong&gt;: Join across heterogeneous systems without copying data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized management&lt;/strong&gt;: Create and update catalogs in one place, reflected everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr320dbnwrkeg1eokse1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr320dbnwrkeg1eokse1a.png" alt="Gravitino Trino Query Federation Architecture" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and running (see &lt;code&gt;02-setup-guide/README.md&lt;/code&gt; in the repository)&lt;/li&gt;
&lt;li&gt;Trino server (coordinator + workers, or single-node for testing)&lt;/li&gt;
&lt;li&gt;Trino version 435, or a version compatible with your Gravitino Trino connector release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL or PostgreSQL for JDBC federation examples&lt;/li&gt;
&lt;li&gt;Hive Metastore for Iceberg catalog backend&lt;/li&gt;
&lt;li&gt;Object storage (S3/GCS/Azure) for cloud-based table storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Ensure your Gravitino server is configured to use &lt;code&gt;simple&lt;/code&gt; authentication mode. The Gravitino Trino connector currently connects as an anonymous user and does not propagate user authentication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How the Integration Works
&lt;/h3&gt;

&lt;p&gt;The Gravitino Trino Connector enables Trino to dynamically load catalogs from Gravitino:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The connector is configured as a Trino catalog named &lt;code&gt;gravitino&lt;/code&gt; via &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You create additional catalogs (like &lt;code&gt;iceberg_test&lt;/code&gt; and &lt;code&gt;mysql_test&lt;/code&gt;) through Gravitino stored procedures or REST APIs&lt;/li&gt;
&lt;li&gt;Trino automatically syncs Gravitino-managed catalogs every 10 seconds (configurable via &lt;code&gt;gravitino.metadata.refresh-interval-seconds&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;You query federated data using standard &lt;code&gt;catalog.schema.table&lt;/code&gt; naming&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Values Used in This Tutorial
&lt;/h3&gt;

&lt;p&gt;Replace these values with your environment settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino URI&lt;/strong&gt;: &lt;code&gt;http://gravitino-server:8090&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metalake&lt;/strong&gt;: &lt;code&gt;trino_metalake&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg HMS URI&lt;/strong&gt;: &lt;code&gt;thrift://hive-host:9083&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg warehouse&lt;/strong&gt;: &lt;code&gt;hdfs://namenode:9000/user/iceberg/warehouse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL JDBC URL&lt;/strong&gt;: &lt;code&gt;jdbc:mysql://mysql-host:3306?useSSL=false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL credentials&lt;/strong&gt;: &lt;code&gt;trino&lt;/code&gt; / &lt;code&gt;ds123&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Configure Gravitino Trino Connector
&lt;/h3&gt;

&lt;p&gt;The Gravitino Trino Connector must be installed on all Trino nodes (coordinator and workers).&lt;/p&gt;

&lt;h4&gt;
  
  
  Install the Connector Plugin
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Download the connector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Download the Gravitino Trino connector from the &lt;a href="https://gravitino.apache.org/downloads" rel="noopener noreferrer"&gt;Apache Gravitino download page&lt;/a&gt; or &lt;a href="https://gravitino.apache.org/docs/1.1.0/how-to-build" rel="noopener noreferrer"&gt;build from source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install on all Trino nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Extract the connector and copy it to Trino's plugin directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the connector&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; gravitino-trino-connector-&amp;lt;version&amp;gt;.tar.gz

&lt;span class="c"&gt;# Copy to Trino plugin directory (on coordinator and all workers)&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; gravitino-trino-connector-&amp;lt;version&amp;gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TRINO_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/plugin/gravitino
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2c5bq294joim60l9iu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2c5bq294joim60l9iu3.png" alt="Gravitino Trino Connector Plugin Directory" width="732" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Enable Dynamic Catalog Management
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Configure Trino for dynamic catalogs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;${TRINO_HOME}/etc/config.properties&lt;/code&gt; on the coordinator node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;catalog.management&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;dynamic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure the Gravitino Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On each Trino node, create the Gravitino catalog configuration file, pointing to your Gravitino server and metalake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;connector.name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.uri&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://gravitino-server:8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.metalake&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;trino_metalake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The metalake specified in &lt;code&gt;gravitino.metalake&lt;/code&gt; must already exist in Gravitino. If not, create it via the Web UI or REST API:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"trino_metalake","comment":"Metalake for Trino federation","properties":{}}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://gravitino-server:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Restart Trino&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After creating the configuration file on each node, restart Trino to load the connector.&lt;/p&gt;
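&lt;p&gt;With the standard Trino tarball layout, a restart can be done with the bundled launcher script; adjust this for your service manager if Trino runs under systemd or in a container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run on the coordinator and every worker node
${TRINO_HOME}/bin/launcher restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;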

&lt;h4&gt;
  
  
  Verify Installation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Check that the &lt;code&gt;gravitino&lt;/code&gt; catalog is loaded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;gravitino&lt;/code&gt; in the catalog list, confirming successful installation.&lt;/p&gt;
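&lt;p&gt;If you are not already in a SQL session, the same check can be run non-interactively with the Trino CLI (the coordinator address below is an assumption; substitute your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trino --server http://trino-coordinator:8080 --execute "SHOW CATALOGS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;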

&lt;h3&gt;
  
  
  Step 2: Create Catalogs from Trino SQL
&lt;/h3&gt;

&lt;p&gt;Once the &lt;code&gt;gravitino&lt;/code&gt; catalog is configured, you can create additional catalogs using stored procedures in &lt;code&gt;gravitino.system&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create an Iceberg Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Example using Hive Metastore backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg_test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'lakehouse-iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'uri'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'catalog-backend'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'warehouse'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'thrift://hive-host:9083'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hdfs://namenode:9000/user/iceberg/warehouse'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: For S3 or other cloud storage, you may need to pass additional properties using the &lt;code&gt;trino.bypass.&lt;/code&gt; prefix for Trino-specific settings; see &lt;a href="https://gravitino.apache.org/docs/1.1.0/trino-connector/catalog-iceberg" rel="noopener noreferrer"&gt;Apache Gravitino Trino connector - Iceberg catalog&lt;/a&gt; for details.&lt;/p&gt;
&lt;/blockquote&gt;
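&lt;p&gt;For example, an S3-backed Iceberg catalog might pass storage credentials through with &lt;code&gt;trino.bypass.&lt;/code&gt;-prefixed properties. The property keys below are illustrative only; confirm the exact keys for your Trino version in the linked documentation before using them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CALL gravitino.system.create_catalog(
    'iceberg_s3',
    'lakehouse-iceberg',
    MAP(
        ARRAY['uri', 'catalog-backend', 'warehouse',
              'trino.bypass.hive.s3.aws-access-key', 'trino.bypass.hive.s3.aws-secret-key'],
        ARRAY['thrift://hive-host:9083', 'hive', 's3://my-bucket/iceberg/warehouse',
              '&amp;lt;access-key&amp;gt;', '&amp;lt;secret-key&amp;gt;']
    )
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;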

&lt;h4&gt;
  
  
  Create a MySQL Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Example JDBC catalog for MySQL&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'mysql_test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'jdbc-mysql'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'jdbc-url'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-password'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-driver'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'jdbc:mysql://mysql-host:3306?useSSL=false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'trino'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ds123'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'com.mysql.cj.jdbc.Driver'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: To ignore "already exists" errors, use named arguments with &lt;code&gt;ignore_exist =&amp;gt; true&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
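&lt;p&gt;A sketch of the same MySQL catalog creation using named arguments (argument names follow the Gravitino stored-procedure signature; verify them against your connector release):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CALL gravitino.system.create_catalog(
    "catalog" =&amp;gt; 'mysql_test',
    "provider" =&amp;gt; 'jdbc-mysql',
    "properties" =&amp;gt; MAP(
        ARRAY['jdbc-url', 'jdbc-user', 'jdbc-password', 'jdbc-driver'],
        ARRAY['jdbc:mysql://mysql-host:3306?useSSL=false', 'trino', 'ds123', 'com.mysql.cj.jdbc.Driver']
    ),
    "ignore_exist" =&amp;gt; true
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;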

&lt;h4&gt;
  
  
  Verify Catalog Creation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Inspect Gravitino catalogs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   name       |     provider      | properties
--------------+-------------------+-------------------------------
 iceberg_test | lakehouse-iceberg | {...}
 mysql_test   | jdbc-mysql        | {...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Validate Catalog Discovery
&lt;/h3&gt;

&lt;p&gt;After creating catalogs in Gravitino, verify they are visible in Trino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Confirm Catalog Visibility
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Check available catalogs and schemas&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Trino syncs catalogs from Gravitino according to the configured refresh interval (10 seconds by default). If catalogs don't appear immediately, wait for the next refresh cycle and retry.&lt;/p&gt;
&lt;/blockquote&gt;
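&lt;p&gt;The refresh interval can be tuned in &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt; alongside the settings from Step 1, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;# Sync catalogs from Gravitino every 5 seconds instead of the default 10
gravitino.metadata.refresh-interval-seconds=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;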

&lt;h3&gt;
  
  
  Step 4: Prepare Sample Data
&lt;/h3&gt;

&lt;p&gt;Create sample schemas and tables to demonstrate federation capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create MySQL Dimension Table
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Set up a users dimension table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create users table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify data&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create Iceberg Fact Table
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Set up an events fact table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create events table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'click'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 10:01:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify data&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Execute Federated Queries
&lt;/h3&gt;

&lt;p&gt;These examples demonstrate the core value of query federation: joining data across heterogeneous sources in a single query.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 1: Cross-Catalog JOIN
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Join dimension and fact tables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 2: Aggregation Across Catalogs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Count events by user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 00:00:00'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 3: Semi-Join Filter
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Filter fact table by dimension membership&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 4: LEFT JOIN with Unmatched Rows
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Keep all events, even without matching users&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Understanding Federation Mechanics
&lt;/h3&gt;

&lt;p&gt;In federated queries, understanding where work happens is crucial for optimization:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How queries execute:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connector-level reads&lt;/strong&gt;: Each connector (Iceberg, MySQL) reads from its respective source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trino-level joins&lt;/strong&gt;: Trino combines results from multiple sources in memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pushdown optimization&lt;/strong&gt;: Some filters and predicates may be pushed to source systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query optimization tips:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Filter early&lt;/strong&gt;: Apply partition/time filters on large tables to reduce data scanned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Align join keys&lt;/strong&gt;: Use consistent data types across sources (e.g., &lt;code&gt;BIGINT&lt;/code&gt; for IDs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small dimension pattern&lt;/strong&gt;: Join large fact tables with small dimension tables for efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review query plans&lt;/strong&gt;: Use &lt;code&gt;EXPLAIN&lt;/code&gt; to understand execution strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analyze query execution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Clean Up Resources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Remove sample data and catalogs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drop tables&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Drop schemas&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Drop catalogs&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mysql_test'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iceberg_test'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connector installation issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog not found&lt;/strong&gt;: Ensure the Gravitino connector plugin is installed on all Trino nodes and &lt;code&gt;gravitino.properties&lt;/code&gt; exists in &lt;code&gt;etc/catalog/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic catalog not working&lt;/strong&gt;: Verify &lt;code&gt;catalog.management=dynamic&lt;/code&gt; is set in &lt;code&gt;etc/config.properties&lt;/code&gt; on the coordinator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cannot connect to Gravitino&lt;/strong&gt;: Check that Gravitino server is running and &lt;code&gt;gravitino.uri&lt;/code&gt; is correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metalake not found&lt;/strong&gt;: Ensure the metalake specified in &lt;code&gt;gravitino.metalake&lt;/code&gt; exists in Gravitino&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Catalog sync issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs not appearing&lt;/strong&gt;: Wait for the sync interval (default 10 seconds) or adjust &lt;code&gt;gravitino.metadata.refresh-interval-seconds&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale catalog information&lt;/strong&gt;: Restart Trino or wait for the next sync cycle&lt;/li&gt;
&lt;/ul&gt;
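The sync interval is a connector property, set in the Trino catalog properties file. As a rough sketch of what `etc/catalog/gravitino.properties` might look like with a shorter interval (the `connector.name=gravitino` line and the URI/metalake values are assumptions; match them to your deployment):

```properties
connector.name=gravitino
gravitino.uri=http://localhost:8090
gravitino.metalake=default_metalake
# Shorten the catalog sync interval (default 10 seconds)
gravitino.metadata.refresh-interval-seconds=5
```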

&lt;p&gt;&lt;strong&gt;Query execution issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Verify the fully qualified table name format: &lt;code&gt;catalog.schema.table&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission denied (Gravitino metadata)&lt;/strong&gt;: Verify that the Trino identity (or the mapped Gravitino user, or the &lt;code&gt;anonymous&lt;/code&gt; user if the connector is configured to run anonymously) has the required catalog/schema/table privileges in Gravitino for the objects being queried&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission denied (underlying data source)&lt;/strong&gt;: If Gravitino privileges are correct but the error persists, check the credentials and permissions for the underlying data sources (for example MySQL user/password, Hive/HDFS/S3 ACLs) configured in the corresponding Gravitino catalog&lt;/li&gt;
&lt;/ul&gt;
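When debugging connection and metalake issues, it can help to query the Gravitino REST API directly, outside of Trino. A minimal sketch using only the Python standard library (the `/api/metalakes/&lt;name&gt;` endpoint matches the REST calls used elsewhere in this series; the default URI and metalake name are assumptions to adjust):

```python
import json
from urllib.request import urlopen


def metalake_endpoint(uri: str, metalake: str) -> str:
    """Build the REST URL that returns a metalake's definition."""
    return f"{uri.rstrip('/')}/api/metalakes/{metalake}"


def check_metalake(uri: str = "http://localhost:8090",
                   metalake: str = "default_metalake") -> dict:
    """Fetch the metalake; an HTTP 404 here is the same condition Trino
    reports as 'Metalake not found'."""
    with urlopen(metalake_endpoint(uri, metalake)) as resp:
        return json.load(resp)
```

If `check_metalake()` raises a connection error, fix `gravitino.uri` first; if it returns 404, create the metalake before retrying the Trino queries.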

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Trino query federation tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Trino environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Trino Connector for automatic catalog discovery&lt;/li&gt;
&lt;li&gt;Multiple registered catalogs (Iceberg and MySQL) accessible from Trino&lt;/li&gt;
&lt;li&gt;Working federated queries that join data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;Understanding of query optimization patterns and federation mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Trino environment is now ready to leverage Gravitino for unified metadata management and cross-system query federation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/1.1.0/trino-connector/index" rel="noopener noreferrer"&gt;Gravitino Trino Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://trino.io/docs/current/overview/concepts.html" rel="noopener noreferrer"&gt;Trino Query Federation&lt;/a&gt; for deeper understanding&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://trino.io/docs/current/admin/tuning.html" rel="noopener noreferrer"&gt;Trino Performance Tuning&lt;/a&gt; for optimization strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="//../07-flink-streaming/README.md"&gt;Using Gravitino with Flink&lt;/a&gt; for streaming processing&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is based on version 1.1.0, the latest at the time of writing. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gravitino101</category>
      <category>trino</category>
      <category>metadata</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Using Gravitino with Apache Spark for ETL</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 05 Feb 2026 23:24:45 +0000</pubDate>
      <link>https://dev.to/gravitino/using-gravitino-with-apache-spark-for-etl-538c</link>
      <guid>https://dev.to/gravitino/using-gravitino-with-apache-spark-for-etl-538c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73miqycef1bsajepzgwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73miqycef1bsajepzgwy.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/minghuang-li?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAFa5V_cBwwdNNbf1bI0scSPywRS9WQBQ1Yk&amp;amp;lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BDWCB9kUFQ7yeAZ7UUVFT3A%3D%3D" rel="noopener noreferrer"&gt;Minghuang Li&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-31&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to use Apache Gravitino with Apache Spark for ETL (Extract, Transform, Load) operations. By the end of this guide, you'll be able to build data pipelines that seamlessly access multiple heterogeneous data sources through a unified catalog interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure Gravitino Spark Connector&lt;/strong&gt; to enable unified access to multiple data sources in Spark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register multiple catalogs&lt;/strong&gt; including MySQL and Iceberg in Gravitino for federated access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build an ETL pipeline&lt;/strong&gt; that extracts data from MySQL, transforms it, and loads it into Iceberg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute federated queries&lt;/strong&gt; across different data sources using Spark SQL and PySpark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Spark is one of the most popular unified analytics engines for large-scale data processing. In a typical ETL pipeline, Spark often needs to interact with multiple heterogeneous data sources (like MySQL, HDFS, S3, Hive, Iceberg). Managing connectivity, credentials, and schema information for these diverse sources can be complex and error-prone.&lt;/p&gt;

&lt;p&gt;Apache Gravitino simplifies this by acting as a unified metadata lake. By using the Gravitino Spark Connector, you can access multiple data sources through a single catalog interface in Spark, without having to manually configure each source's connection details in your Spark jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified catalog&lt;/strong&gt;: Access Hive, Iceberg, MySQL, PostgreSQL, and other sources under a unified namespace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized metadata&lt;/strong&gt;: Metadata is managed centrally in Gravitino, and changes are reflected immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified configuration&lt;/strong&gt;: Configure the Gravitino connector once, and access all managed catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated querying&lt;/strong&gt;: Easily join data across different sources (e.g., join MySQL data with an Iceberg table)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;li&gt;Apache Spark 3.3, 3.4, or 3.5 installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and running (see &lt;a href="//../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;MySQL instance for testing JDBC catalog functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDFS or S3 for Iceberg data storage in production environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java and Spark installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/spark-submit &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiylgtvw69d7ojmyu95nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiylgtvw69d7ojmyu95nu.png" alt="Gravitino Spark Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Download Gravitino Spark Connector
&lt;/h3&gt;

&lt;p&gt;You need the Gravitino Spark Connector jar file to enable Spark integration with Gravitino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Obtain the connector
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Download from Maven Central Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For Spark 3.5, download the connector from:&lt;br&gt;
&lt;a href="https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-spark-connector-runtime-3.5" rel="noopener noreferrer"&gt;&lt;code&gt;gravitino-spark-connector-runtime-3.5&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For JDBC sources (MySQL, PostgreSQL), you also need the specific JDBC driver jar (e.g., &lt;code&gt;mysql-connector-j&lt;/code&gt; for MySQL) in your classpath.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Configure Spark Session
&lt;/h3&gt;

&lt;p&gt;To use Gravitino with Spark, you need to configure the Gravitino Spark plugin (&lt;code&gt;GravitinoSparkPlugin&lt;/code&gt;) in your Spark session.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure Spark SQL with Gravitino
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Start Spark SQL with the Gravitino connector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set the location of your Gravitino server&lt;/span&gt;
&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090"&lt;/span&gt;
&lt;span class="c"&gt;# The metalake you want to access&lt;/span&gt;
&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"default_metalake"&lt;/span&gt;

spark-sql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--packages&lt;/span&gt; org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.plugins&lt;span class="o"&gt;=&lt;/span&gt;org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.metalake&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$METALAKE_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.uri&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GRAVITINO_URI&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.enableIcebergSupport&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;1.1.0&lt;/code&gt; with the actual version you are using&lt;/li&gt;
&lt;li&gt;Ensure the Spark connector version matches your Spark version&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spark.sql.gravitino.enableIcebergSupport=true&lt;/code&gt; to enable Iceberg catalog support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Prepare Metadata in Gravitino
&lt;/h3&gt;

&lt;p&gt;Before running ETL jobs, you need to register the catalogs for your data sources in Gravitino. You can do this via the Gravitino REST API or Web UI.&lt;/p&gt;

&lt;h4&gt;
  
  
  Register MySQL Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create a MySQL catalog in Gravitino&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "mysql_catalog",
  "type": "relational",
  "provider": "jdbc-mysql",
  "properties": {
    "jdbc-url": "jdbc:mysql://localhost:3306",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/default_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Register Iceberg Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create an Iceberg catalog in Gravitino&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "iceberg_catalog",
  "type": "relational",
  "provider": "lakehouse-iceberg",
  "properties": {
    "warehouse": "file:///tmp/iceberg-warehouse",
    "catalog-backend": "jdbc",
    "uri": "jdbc:mysql://localhost:3306/iceberg_metadata",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-initialize": "true"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/default_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This example uses a local file system for Iceberg data storage. For production environments, consider using HDFS or S3. For more detailed Iceberg catalog configuration options, see &lt;a href="//../03-iceberg-catalog/README.md"&gt;&lt;code&gt;03-iceberg-catalog/README.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
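If you would rather script these registrations than run raw `curl`, the same POSTs can be issued from Python. A sketch using only the standard library, with the endpoint and placeholder values taken from the curl examples above (nothing here is Gravitino-specific client code, just plain HTTP):

```python
import json
from urllib.request import urlopen, Request

GRAVITINO = "http://localhost:8090"      # Gravitino server, as above
METALAKE = "default_metalake"            # target metalake, as above


def catalog_payload(name: str, provider: str, properties: dict) -> bytes:
    """Serialize a catalog-creation request body (mirrors the curl -d payloads)."""
    return json.dumps({
        "name": name,
        "type": "relational",
        "provider": provider,
        "properties": properties,
    }).encode("utf-8")


def create_catalog(name: str, provider: str, properties: dict) -> dict:
    """POST the catalog definition to Gravitino and return the parsed response."""
    url = f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs"
    req = Request(url, data=catalog_payload(name, provider, properties),
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req) as resp:
        return json.load(resp)


# Example: the MySQL catalog from the curl command above
# create_catalog("mysql_catalog", "jdbc-mysql", {
#     "jdbc-url": "jdbc:mysql://localhost:3306",
#     "jdbc-user": "root",
#     "jdbc-password": "password",
#     "jdbc-driver": "com.mysql.cj.jdbc.Driver",
# })
```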

&lt;h3&gt;
  
  
  Step 4: Build an ETL Pipeline from MySQL to Iceberg
&lt;/h3&gt;

&lt;p&gt;In this scenario, we will extract user data from a MySQL database, perform some transformations, and load it into an Apache Iceberg table for analytical queries, all managed through Gravitino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify Catalogs in Spark
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Start your Spark SQL session&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the configuration from Step 2 to start your Spark SQL session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Verify catalog visibility&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Due to Spark catalog manager limitations, SHOW CATALOGS only displays 'spark_catalog' initially&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Use a Gravitino-managed catalog to make it visible&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Now both catalogs are visible in the output&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;SHOW CATALOGS&lt;/code&gt; command initially only displays the Spark default catalog (&lt;code&gt;spark_catalog&lt;/code&gt;). After explicitly using a Gravitino-managed catalog with the &lt;code&gt;USE&lt;/code&gt; command, that catalog becomes visible in subsequent &lt;code&gt;SHOW CATALOGS&lt;/code&gt; output.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Prepare Sample Data in MySQL
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Create a sample database and table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Switch to MySQL catalog&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a sample database&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a users table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Insert sample data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-20 14:30:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inactive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-10 09:15:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Diana'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'diana@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-04-05 16:45:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Eve'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'eve@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inactive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-05-12 11:20:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify the data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Extract Data from MySQL
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Verify data extraction&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Read data from MySQL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Transform and Load Data to Iceberg
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Create an Iceberg table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Switch to Iceberg catalog&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Execute ETL query&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ETL Query: Insert into Iceberg from MySQL with transformation&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;created_at&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: For JDBC catalogs (like MySQL), the &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;TRUNCATE&lt;/code&gt; operations are NOT supported; only &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Verify ETL Results
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Query the target Iceberg table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PySpark Example
&lt;/h2&gt;

&lt;p&gt;If you prefer using Python, the same pipeline can be expressed with the PySpark DataFrame API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure PySpark Session
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Create a PySpark session with Gravitino connector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GravitinoSparkETL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.jars.packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.enableIcebergSupport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
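&lt;p&gt;The &lt;code&gt;spark.jars.packages&lt;/code&gt; coordinates must match your Spark and Scala versions. As a sketch (the helper name is hypothetical, not part of any API), the coordinate string used above can be assembled like this:&lt;/p&gt;

```python
# Hypothetical helper: build the spark.jars.packages string used in the
# session config above from the Spark/Scala/Gravitino versions.
def connector_packages(spark_version="3.5", scala_version="2.12", gravitino_version="1.1.0"):
    gravitino = (
        "org.apache.gravitino:gravitino-spark-connector-runtime-"
        f"{spark_version}_{scala_version}:{gravitino_version}"
    )
    mysql = "mysql:mysql-connector-java:8.0.33"
    iceberg = f"org.apache.iceberg:iceberg-spark-runtime-{spark_version}_{scala_version}:1.10.1"
    # Comma-separated, as expected by spark.jars.packages
    return ",".join([gravitino, mysql, iceberg])
```

&lt;p&gt;A mismatched Spark/Scala suffix in these coordinates is a common cause of the &lt;code&gt;ClassNotFoundException&lt;/code&gt; covered in Troubleshooting.&lt;/p&gt;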



&lt;h3&gt;
  
  
  Execute ETL Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Read, transform, and write data using DataFrame API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from MySQL
&lt;/span&gt;&lt;span class="n"&gt;mysql_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mysql_catalog.users_db.users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform
&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id as user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lower(username) as username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lower(email) as email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Iceberg
&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg_catalog.analytics.active_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL Job Completed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connector and classpath issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassNotFoundException: org.apache.gravitino.spark.connector.GravitinoCatalog&lt;/strong&gt;: The Gravitino Spark Connector JAR is missing from the classpath. Ensure you added the correct package with &lt;code&gt;--packages&lt;/code&gt; or placed the JAR in &lt;code&gt;$SPARK_HOME/jars&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing JDBC Driver&lt;/strong&gt;: When connecting to JDBC sources (MySQL/PostgreSQL) via Gravitino, Spark still needs the JDBC driver JARs in its classpath. Add the MySQL/PostgreSQL JDBC driver packages to your Spark startup command (e.g., &lt;code&gt;--packages mysql:mysql-connector-java:8.0.33&lt;/code&gt;) or place the JAR in the &lt;code&gt;jars/&lt;/code&gt; folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection refused to Gravitino Server&lt;/strong&gt;: Spark cannot reach the Gravitino server. Check that the Gravitino server is running and that the &lt;code&gt;spark.sql.gravitino.uri&lt;/code&gt; setting is correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog not found&lt;/strong&gt;: Ensure the catalogs are properly registered in Gravitino and the metalake name is correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query execution issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UPDATE/DELETE not supported on JDBC catalogs&lt;/strong&gt;: For JDBC catalogs (like MySQL), only &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; operations are supported through Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Verify the fully qualified table name format: &lt;code&gt;catalog.schema.table&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
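&lt;p&gt;As a quick sanity check for the last point, a small helper (hypothetical, for illustration only) can validate the &lt;code&gt;catalog.schema.table&lt;/code&gt; format before a query is submitted:&lt;/p&gt;

```python
# Hypothetical check: a fully qualified name must have exactly three
# non-empty, dot-separated parts: catalog.schema.table
def is_fully_qualified(name):
    parts = name.split(".")
    return len(parts) == 3 and all(parts)
```

&lt;p&gt;For example, &lt;code&gt;is_fully_qualified("mysql_catalog.users_db.users")&lt;/code&gt; is true, while a bare table name or a name with an empty segment is rejected.&lt;/p&gt;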

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Spark ETL tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Spark environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Spark Connector for unified catalog access&lt;/li&gt;
&lt;li&gt;Multiple registered catalogs (MySQL and Iceberg) in Gravitino&lt;/li&gt;
&lt;li&gt;A working ETL pipeline that extracts, transforms, and loads data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;An understanding of federated query capabilities and PySpark integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Spark environment is now ready to leverage Gravitino for unified metadata management across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/1.1.0/spark-connector/spark-catalog-iceberg" rel="noopener noreferrer"&gt;Gravitino Spark Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html" rel="noopener noreferrer"&gt;Spark SQL Guide&lt;/a&gt; for more query patterns&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://iceberg.apache.org/docs/latest/spark-configuration/" rel="noopener noreferrer"&gt;Apache Iceberg Spark Integration&lt;/a&gt; for Iceberg-specific features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="../06-trino-query/README.md"&gt;Using Gravitino with Trino&lt;/a&gt; for federated querying&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is evolving rapidly, and this article is based on the latest version, 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gravitino101</category>
      <category>apachespark</category>
      <category>etl</category>
      <category>metadata</category>
    </item>
    <item>
      <title>Configuring Gravitino Lance REST Service</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Sat, 31 Jan 2026 07:37:34 +0000</pubDate>
      <link>https://dev.to/gravitino/configuring-gravitino-lance-rest-service-3e2b</link>
      <guid>https://dev.to/gravitino/configuring-gravitino-lance-rest-service-3e2b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsdc9lp4p1gl07izb8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsdc9lp4p1gl07izb8z.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/yuqi1129/" rel="noopener noreferrer"&gt;Qi Yu&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-23&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to configure and use the Gravitino Lance REST service. By the end of this guide, you'll have a fully functional Lance REST service that enables Lance clients to interact with Gravitino through HTTP APIs.&lt;/p&gt;

&lt;p&gt;The Gravitino Lance REST service provides a RESTful interface for managing Lance datasets, implementing the standard Lance REST API. It acts as a centralized catalog service that allows Lance clients (like Spark and Ray) to discover and access Lance datasets managed by Gravitino.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key concepts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lance REST catalog&lt;/strong&gt;: A standard HTTP API for Lance dataset operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino Lance REST service&lt;/strong&gt;: Implements the Lance REST API and integrates with Gravitino's metadata system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata&lt;/strong&gt;: Stores Lance dataset metadata in Gravitino, enabling centralized governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The REST endpoint base path is &lt;code&gt;http://&amp;lt;host&amp;gt;:&amp;lt;port&amp;gt;/lance/&lt;/code&gt;.&lt;/p&gt;
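&lt;p&gt;For example, a small helper (hypothetical, not part of any client library) can compose endpoint URLs from this base path:&lt;/p&gt;

```python
# Hypothetical helper: compose Lance REST endpoint URLs from the host,
# port, and the /lance/ base path described above.
def lance_endpoint(host, port, path):
    path = path.lstrip("/")
    return f"http://{host}:{port}/lance/{path}"

# e.g. the namespace listing endpoint used later in this tutorial:
list_url = lance_endpoint("localhost", 9101, "v1/namespace/$/list")
```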

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe89gymgd8d81nsq505s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe89gymgd8d81nsq505s.png" alt="gravitino-lance-rest-architecture" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;Python environment (3.10+) for running PySpark or Ray clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and configured (see &lt;a href="../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark with Lance runtime JARs for client verification (recommended for testing)&lt;/li&gt;
&lt;li&gt;Ray framework for distributed Lance data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Python installation and install required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;pyspark&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.5.0 lance-ray&lt;span class="o"&gt;==&lt;/span&gt;0.1.0 lance-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Start a Gravitino server with Lance REST service
&lt;/h3&gt;

&lt;p&gt;Use this approach if you want the Lance REST service embedded in a full Gravitino server (with Web UI, unified REST APIs, etc.).&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Lance REST as auxiliary service
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Install Gravitino server distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Follow the previous tutorial &lt;a href="../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt; to download or build the Gravitino server package.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Enable Lance REST as an auxiliary service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modify &lt;code&gt;conf/gravitino.conf&lt;/code&gt; to enable the &lt;code&gt;lance-rest&lt;/code&gt; service and configure it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Lance REST service
&lt;/span&gt;&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;lance-rest&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;9101&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.namespace-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.gravitino-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.gravitino-metalake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;lance_metalake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;lance_metalake&lt;/code&gt; metalake must exist in Gravitino before you access the Lance REST service. If it doesn't, you can create it via the Gravitino REST API or Web UI after starting the server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. Start the Gravitino server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create the Metalake (if not exists)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"lance_metalake","comment":"comment"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Check server logs (optional)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/gravitino-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Verify the Lance REST endpoint and create a catalog namespace
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Test the service endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can verify that the service is running with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET http://localhost:9101/lance/v1/namespace/&lt;span class="nv"&gt;$/&lt;/span&gt;list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, you should see a JSON response with namespace information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a catalog namespace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a catalog namespace (e.g., &lt;code&gt;lance_catalog&lt;/code&gt;) that will hold your Lance schemas and tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:9101/lance/v1/namespace/lance_catalog/create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "id": ["lance_catalog"],
    "mode": "exist_ok"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, it returns the namespace information.&lt;/p&gt;
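&lt;p&gt;The same request can be built from Python using only the standard library; the function below is a hypothetical sketch that mirrors the &lt;code&gt;curl&lt;/code&gt; call above (it constructs the request but does not send it):&lt;/p&gt;

```python
import json
import urllib.request

# Build (but do not send) the same create-namespace POST request as the
# curl example above; pass the resulting Request to urllib.request.urlopen
# against a running Lance REST service.
def create_namespace_request(base_uri, name, mode="exist_ok"):
    body = json.dumps({"id": [name], "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{base_uri}/v1/namespace/{name}/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_namespace_request("http://localhost:9101/lance", "lance_catalog")
```

&lt;p&gt;Send it with &lt;code&gt;urllib.request.urlopen(req)&lt;/code&gt; once the service is up.&lt;/p&gt;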

&lt;h3&gt;
  
  
  Step 3: Connect with Spark
&lt;/h3&gt;

&lt;p&gt;Configure your PySpark session to use the Lance REST catalog.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Spark with Lance REST catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install pyspark: &lt;code&gt;pip install pyspark==3.5.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Download the &lt;code&gt;lance-spark&lt;/code&gt; bundle jar matching your Spark version (e.g., &lt;code&gt;lance-spark-bundle-3.5_2.12-0.0.15.jar&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execute sample operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the following Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set path to your lance-spark bundle
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYSPARK_SUBMIT_ARGS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--conf &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--conf &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--master local[1] pyspark-shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_rest_demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.lancedb.lance.spark.LanceNamespaceSparkCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.impl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9101/lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.parent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.defaultCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a schema and table
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE IF NOT EXISTS demo_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE demo_schema.test_table (id INT, value STRING)
    USING lance
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/lance_catalog/demo_schema/test_table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert and query data
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO demo_schema.test_table VALUES (1, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM demo_schema.test_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Connect with Ray
&lt;/h3&gt;

&lt;p&gt;You can also access the data created by Spark from Ray via the Lance Ray integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Ray with Lance REST catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install required packages: &lt;code&gt;pip install lance-ray==0.1.0 lance-namespace&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execute sample operations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lance_namespace&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lance_ray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;read_lance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_lance&lt;/span&gt;

&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to Lance REST
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9101/lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Read the table created by Spark
# Note: Table ID is [catalog, schema, table]
&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_lance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Row count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Perform filtering operation
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Filtered row count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service connectivity issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service fails to start&lt;/strong&gt;: Check &lt;code&gt;logs/gravitino-server.log&lt;/code&gt; for startup errors and configuration issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection refused&lt;/strong&gt;: Verify &lt;code&gt;gravitino.lance-rest.httpPort&lt;/code&gt; (default 9101) is open and accessible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;curl&lt;/code&gt; returns 404&lt;/strong&gt;: Confirm the Lance REST base path is &lt;code&gt;/lance&lt;/code&gt; and the port matches configuration&lt;/li&gt;
&lt;/ul&gt;
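&lt;p&gt;To tell these failure modes apart quickly, a minimal probe using only the Python standard library can help. The URL below is an assumption built from the default port 9101 and the &lt;code&gt;/lance&lt;/code&gt; base path described above; adjust it to your configuration:&lt;/p&gt;

```python
import urllib.request
import urllib.error

def probe(url: str) -> str:
    """Return a short status string for an HTTP endpoint probe."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as e:
        # A 404 here usually means the base path is wrong
        return f"HTTP {e.code}"
    except (urllib.error.URLError, OSError) as e:
        # "Connection refused" usually means the port is wrong or the service is down
        return f"unreachable: {e.reason if hasattr(e, 'reason') else e}"

# Assumed default endpoint; adjust host, port, and base path to your setup
print(probe("http://localhost:9101/lance"))
```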

&lt;p&gt;&lt;strong&gt;Client connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark ClassNotFoundException&lt;/strong&gt;: Ensure the &lt;code&gt;lance-spark-bundle&lt;/code&gt; jar is correctly referenced in &lt;code&gt;PYSPARK_SUBMIT_ARGS&lt;/code&gt; or &lt;code&gt;--jars&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace not found&lt;/strong&gt;: Remember to create the parent catalog namespace (e.g., &lt;code&gt;lance_catalog&lt;/code&gt;) before creating schemas or tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ray connection errors&lt;/strong&gt;: Verify &lt;code&gt;lance-ray&lt;/code&gt; and &lt;code&gt;lance-namespace&lt;/code&gt; packages are installed and the REST endpoint is accessible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configuration issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metalake not found&lt;/strong&gt;: Ensure the metalake specified in &lt;code&gt;gravitino.lance-rest.gravitino-metalake&lt;/code&gt; exists in Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission errors&lt;/strong&gt;: Check that the Gravitino server has proper access to the configured storage locations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Lance REST service configuration tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Lance REST service with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Lance REST endpoint running on port 9101&lt;/li&gt;
&lt;li&gt;A catalog namespace configured for organizing Lance datasets&lt;/li&gt;
&lt;li&gt;Verified client connectivity through Apache Spark and Ray&lt;/li&gt;
&lt;li&gt;Understanding of Lance dataset operations across different compute engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Gravitino Lance REST service is ready to serve Lance clients across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Check the &lt;a href="https://github.com/apache/gravitino/blob/main/docs/lance-rest-integration.md" rel="noopener noreferrer"&gt;Lance REST Integration Guide&lt;/a&gt; for compatibility matrices and advanced configuration&lt;/li&gt;
&lt;li&gt;Learn more about &lt;a href="https://lance.org/" rel="noopener noreferrer"&gt;Lance format&lt;/a&gt; and its capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading &lt;a href="../05-spark-etl/README.md"&gt;Spark ETL&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>dataengineering</category>
      <category>lance</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 20 Jan 2026 19:28:31 +0000</pubDate>
      <link>https://dev.to/gravitino/unified-data-access-with-daft-and-apache-gravitino-simplifying-multi-cloud-data-management-47nn</link>
      <guid>https://dev.to/gravitino/unified-data-access-with-daft-and-apache-gravitino-simplifying-multi-cloud-data-management-47nn</guid>
      <description>&lt;h1&gt;
  
  
  Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management
&lt;/h1&gt;

&lt;p&gt;The modern data landscape is increasingly distributed across multiple cloud providers and storage systems. Organizations often find themselves managing data across AWS S3, Google Cloud Storage, Azure Blob Storage, and on-premises systems, each with its own access patterns, credentials, and metadata management challenges. This fragmentation complicates data discovery and access control and adds operational overhead.&lt;/p&gt;

&lt;p&gt;Today, we're excited to introduce the integration between &lt;strong&gt;Daft&lt;/strong&gt; and &lt;strong&gt;Apache Gravitino&lt;/strong&gt;, bringing unified catalog management and seamless multi-cloud data access to the Daft ecosystem. This integration focuses on &lt;strong&gt;fileset catalog support&lt;/strong&gt;, enabling you to access distributed datasets through a single, unified interface while maintaining security and performance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This integration is available in Daft v0.7.2 and later versions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Apache Gravitino?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt; is an open-source data catalog that provides unified metadata management for various data sources and storage systems. It acts as a central hub for organizing and accessing data across different platforms, offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata Management&lt;/strong&gt;: Single source of truth for data across multiple storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Cloud Support&lt;/strong&gt;: Native integration with AWS S3 and local storage, with more cloud providers coming soon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Integration&lt;/strong&gt;: Centralized credential management and access control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Abstraction&lt;/strong&gt;: Support for both table catalogs (Iceberg, Hudi, Hive, JDBC) and fileset catalogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmebr52mj4ltyy615n94d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmebr52mj4ltyy615n94d.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Daft?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.daft.ai" rel="noopener noreferrer"&gt;Daft&lt;/a&gt; is a distributed query engine built for the Python ecosystem, designed to handle large-scale data processing with ease and efficiency. Daft brings the power of distributed computing to data scientists and engineers through a familiar DataFrame API, offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Processing&lt;/strong&gt;: Scale computations across multiple cores and machines seamlessly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Optimize query execution through intelligent query planning and predicate pushdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Support&lt;/strong&gt;: Native support for Parquet, JSON, CSV, Images, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-Native&lt;/strong&gt;: Built-in integrations with AWS S3, Google Cloud Storage, Azure Blob Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python-First&lt;/strong&gt;: Intuitive DataFrame API that feels natural to Python developers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimized&lt;/strong&gt;: Rust-powered execution engine for maximum performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional big data tools that require complex cluster management, Daft provides a simple &lt;code&gt;pip install&lt;/code&gt; experience while delivering enterprise-grade performance. Whether you're processing terabytes of data locally or across cloud infrastructure, Daft's intelligent execution engine automatically optimizes your workloads for speed and efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figmwonflwj1lx3mpi0us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figmwonflwj1lx3mpi0us.png" alt="Daft Architecture" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of Fileset Catalogs
&lt;/h2&gt;

&lt;p&gt;While table catalogs manage structured data with schemas, &lt;strong&gt;fileset catalogs&lt;/strong&gt; provide a flexible way to organize and access collections of files across different storage systems. This is particularly valuable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lakes&lt;/strong&gt;: Managing raw data files, logs, and unstructured datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Data&lt;/strong&gt;: Handling Parquet, JSON, CSV, and other file formats in a unified way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Storage&lt;/strong&gt;: Accessing data across different storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Datasets&lt;/strong&gt;: Working with datasets that don't fit traditional table structures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing GVFS
&lt;/h2&gt;

&lt;p&gt;The Daft + Gravitino integration introduces a new URL scheme: &lt;code&gt;gvfs://&lt;/code&gt; (Gravitino Virtual File System). This provides a unified way to access files managed by Gravitino filesets, regardless of their underlying storage location (S3, ADLS, GCS, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  URL Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gvfs://fileset/catalog/schema/fileset/path/to/file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;catalog&lt;/code&gt;: The Gravitino catalog name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema&lt;/code&gt;: The schema within the catalog&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fileset&lt;/code&gt;: The specific fileset name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;path/to/file&lt;/code&gt;: The file path within the fileset (optional)&lt;/li&gt;
&lt;/ul&gt;
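&lt;p&gt;To make the mapping concrete, here is a small standard-library sketch, not part of Daft's API, that splits a &lt;code&gt;gvfs://&lt;/code&gt; URL into those four components:&lt;/p&gt;

```python
from urllib.parse import urlparse

def parse_gvfs_url(url: str) -> dict:
    """Split a gvfs:// fileset URL into catalog, schema, fileset, and file path.

    Illustrative helper only; Daft resolves these URLs internally.
    """
    parsed = urlparse(url)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a gvfs fileset URL: {url}")
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) >= 3:
        catalog, schema, fileset = parts[:3]
    else:
        raise ValueError(f"expected catalog/schema/fileset in: {url}")
    path = parts[3] if len(parts) >= 4 else ""
    return {"catalog": catalog, "schema": schema, "fileset": fileset, "path": path}

print(parse_gvfs_url(
    "gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet"
))
# → {'catalog': 's3_catalog', 'schema': 'analytics', 'fileset': 'user_events', 'path': '2024/01/events.parquet'}
```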

&lt;h3&gt;
  
  
  Example URLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Access a specific file
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Access all files in a fileset
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/ml_data/training_set/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Access partitioned data
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/year=2024/month=01/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;The Daft + Gravitino integration requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: 3.10 or later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pip&lt;/strong&gt;: 21.0 or later (recommended: latest version)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daft&lt;/strong&gt;: v0.7.2 or later&lt;/li&gt;
&lt;/ul&gt;
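&lt;p&gt;If you prefer to verify these prerequisites from code, a small standard-library check along these lines works (a naive sketch: pre-release version suffixes are ignored):&lt;/p&gt;

```python
import sys
from importlib.metadata import PackageNotFoundError, version

def version_at_least(installed: str, minimum: str) -> bool:
    """Naive dotted-integer version comparison; pre-release suffixes are dropped."""
    key = lambda v: [int(p) for p in v.split(".") if p.isdigit()]
    return key(installed) >= key(minimum)

def meets_minimum(package: str, minimum: str) -> bool:
    """True if `package` is installed at or above `minimum`."""
    try:
        return version_at_least(version(package), minimum)
    except PackageNotFoundError:
        return False

print("Python 3.10+:", sys.version_info >= (3, 10))
print("daft 0.7.2+:", meets_minimum("daft", "0.7.2"))
```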

&lt;p&gt;Make sure you have the correct versions installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation and Setup
&lt;/h3&gt;

&lt;p&gt;First, ensure you have a running Apache Gravitino server, then create a fileset catalog, a schema, and at least one fileset entity whose storage location points to S3. Refer to Gravitino's online documentation for details.&lt;/p&gt;

&lt;p&gt;Second, ensure you have installed Daft v0.7.2 or later with support for Gravitino:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"daft&amp;gt;=0.7.2"&lt;/span&gt; requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GravitinoConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Configure Gravitino connection
&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create IOConfig with Gravitino settings
&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading Data from Filesets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read a specific file from a Gravitino fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/events.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read all Parquet files in a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List files in a fileset
&lt;/span&gt;&lt;span class="n"&gt;files_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_glob_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Usage Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Working with Multiple File Formats
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read JSON files from a fileset
&lt;/span&gt;&lt;span class="n"&gt;json_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/**/*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read CSV files with custom options
&lt;/span&gt;&lt;span class="n"&gt;csv_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/exports/daily_reports/**/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing Data to Filesets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write Parquet files to a Gravitino fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;view&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/processed_events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write CSV files to a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/exports/daily_reports/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write JSON files to a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Programmatic Fileset Discovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.gravitino&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GravitinoClient&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Gravitino client
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Discover available catalogs and filesets
&lt;/span&gt;&lt;span class="n"&gt;catalogs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_catalogs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available catalogs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalogs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load fileset metadata
&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_fileset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_catalog.analytics.user_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Storage location: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileset_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Properties: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileset_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the fileset with Daft
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_io_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security and Credential Management
&lt;/h2&gt;

&lt;p&gt;One of the key benefits of the Gravitino integration is centralized credential management. Instead of managing separate credentials for each storage system, Gravitino handles authentication and authorization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Gravitino manages credentials for underlying storage
# No need to configure separate S3 credentials in Daft
&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oauth2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-oauth-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# All storage access is handled through Gravitino's security layer
&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access data with unified security
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/sensitive_data/financial/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;p&gt;The Gravitino integration is designed for optimal performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Daft's lazy execution works seamlessly with gvfs:// URLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predicate Pushdown&lt;/strong&gt;: Filters are pushed down to the storage layer when possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Processing&lt;/strong&gt;: Multi-threaded I/O operations across different storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Gravitino metadata is cached to reduce lookup overhead
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Efficient filtered reads with predicate pushdown
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/events/daily/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pushed down to storage
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Efficient columnar filtering
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Column pruning
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Current Limitations and Future Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current Status
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Read Operations&lt;/strong&gt;: Full support for reading files from Gravitino filesets&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Write Operations&lt;/strong&gt;: Support for writing Parquet, CSV, and JSON files to filesets&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Multiple Formats&lt;/strong&gt;: Support for Parquet, JSON, CSV, and other formats&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;S3 Storage&lt;/strong&gt;: Full support for S3-backed filesets (including S3-compatible storage like MinIO)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Local Storage&lt;/strong&gt;: Support for local file:// storage&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Security Integration&lt;/strong&gt;: Centralized credential management through Gravitino&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Credential Vending&lt;/strong&gt;: Generate temporary credentials for clients to enhance security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More Cloud Storage Backends&lt;/strong&gt;: Support for GCS and Azure Blob Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table Catalog Integration&lt;/strong&gt;: Support for reading and writing Iceberg and Lance tables through their catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Security&lt;/strong&gt;: Fine-grained access control and audit logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimizations&lt;/strong&gt;: Enhanced caching and metadata management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started Today
&lt;/h2&gt;

&lt;p&gt;Ready to try the Daft + Gravitino integration? Here's how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up Gravitino&lt;/strong&gt;: Follow the &lt;a href="https://gravitino.apache.org/docs/getting-started/index" rel="noopener noreferrer"&gt;Gravitino quickstart guide&lt;/a&gt; to set up your Gravitino server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Daft v0.7.2 or later with Gravitino support&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"daft&amp;gt;=0.7.2"&lt;/span&gt; requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
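Before moving on, you can sanity-check that the installed Daft version meets the minimum. A minimal stdlib sketch of the comparison (the `meets_minimum` helper is illustrative, not part of Daft's API):

```python
# Compare a dotted version string against the minimum required Daft version.
def meets_minimum(version: str, minimum: tuple = (0, 7, 2)) -> bool:
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= minimum

print(meets_minimum("0.7.2"))   # True
print(meets_minimum("0.6.9"))   # False
```

In practice you would pass `daft.__version__` to the helper after importing Daft.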



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configure your first fileset&lt;/strong&gt;: Create a fileset in Gravitino pointing to your data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start querying&lt;/strong&gt;: Use gvfs:// URLs to access your data with Daft&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The integration between Daft and Apache Gravitino represents a significant step forward in simplifying distributed data access. By combining Daft's powerful distributed query engine with Gravitino's unified catalog management, data teams can focus on extracting insights rather than managing infrastructure complexity.&lt;/p&gt;

&lt;p&gt;Whether you're building analytics pipelines, managing data lakes, or simply looking to simplify your data access patterns, the Daft + Gravitino integration provides the tools you need to succeed in today's distributed data landscape.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to learn more? Check out the &lt;a href="https://docs.daft.ai" rel="noopener noreferrer"&gt;Daft documentation&lt;/a&gt; and &lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache Gravitino project&lt;/a&gt; for detailed guides and examples.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>daft</category>
      <category>apachegravitino</category>
      <category>metadata</category>
      <category>datacatalog</category>
    </item>
    <item>
      <title>Setting up Apache Gravitino from Scratch</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 20 Jan 2026 08:31:19 +0000</pubDate>
      <link>https://dev.to/gravitino/setting-up-apache-gravitino-from-scratch-56o8</link>
      <guid>https://dev.to/gravitino/setting-up-apache-gravitino-from-scratch-56o8</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: Danhua Wang&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-12&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to install and configure Apache Gravitino from scratch. By the end of this guide, you'll have a fully functional Gravitino server running with your chosen storage backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install Apache Gravitino&lt;/strong&gt; from source or pre-built binaries and configure the basic server setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure storage backends&lt;/strong&gt; including H2 for development and MySQL/PostgreSQL for production environments
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure the Gravitino server&lt;/strong&gt;, including web server, cache, and access control settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify the installation&lt;/strong&gt; by testing the server endpoints and Web UI to ensure everything is working correctly&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Production Environment&lt;/strong&gt;: 4 CPU cores, 16GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Development Environment&lt;/strong&gt;: 2 CPU cores, 8GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Java Development Kit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL or PostgreSQL server installed and properly configured, if you choose either as your storage backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
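Gravitino requires JDK 17 or higher. If you want to check this from a script, here is a hedged sketch that parses the banner line printed by `java -version` (the sample strings are illustrative):

```python
import re

# Extract the major Java version from a `java -version` banner line.
def java_major_version(banner: str) -> int:
    m = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if not m:
        raise ValueError("unrecognized version banner")
    major = int(m.group(1))
    # Legacy 1.x scheme: 'java version "1.8.0_392"' means Java 8.
    if major == 1 and m.group(2):
        major = int(m.group(2))
    return major

print(java_major_version('openjdk version "17.0.9" 2023-10-17'))  # 17
```

A result below 17 means you need to upgrade your JDK before continuing.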



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Obtain Gravitino Binary
&lt;/h3&gt;

&lt;p&gt;You have two options for obtaining Apache Gravitino: downloading a pre-built release or building from source.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: Download Pre-built Release
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Download the latest release&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;a href="https://github.com/apache/gravitino/releases" rel="noopener noreferrer"&gt;Apache Gravitino GitHub Releases&lt;/a&gt; page and download the latest release tarball.&lt;br&gt;
For example, to download version 1.1.0, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://github.com/apache/gravitino/releases/download/v1.1.0/gravitino-1.1.0-bin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Extract the package&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; gravitino-1.1.0-bin.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;gravitino-1.1.0-bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Option 2: Build from Source
&lt;/h4&gt;

&lt;p&gt;If you prefer to build from source or need the latest development features, see &lt;a href="https://gravitino.apache.org/docs/1.1.0/how-to-build" rel="noopener noreferrer"&gt;How to Build Gravitino&lt;/a&gt; for detailed build instructions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understanding the Package Structure
&lt;/h4&gt;

&lt;p&gt;After obtaining the binary, familiarize yourself with the directory layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gravitino-&amp;lt;version&amp;gt;-bin/
├── bin/                                    # Launch scripts
│   ├── gravitino.sh                        # Main server launcher
│   ├── gravitino-iceberg-rest-server.sh    # Iceberg REST server launcher
│   └── gravitino-lance-rest-server.sh      # Lance REST server launcher
├── conf/                                   # Configuration files
│   ├── gravitino.conf                      # Main server configuration
│   ├── gravitino-iceberg-rest-server.conf  # Iceberg REST configuration
│   ├── gravitino-lance-rest-server.conf    # Lance REST configuration
│   ├── gravitino-env.sh                    # Environment variables
│   └── log4j2.properties                   # Logging configuration
├── catalogs/                               # Catalog-specific configurations
├── libs/                                   # Server dependencies
├── iceberg-rest-server/                    # Iceberg REST server package
├── lance-rest-server/                      # Lance REST server package
├── logs/                                   # Log files (created at runtime)
├── data/                                   # Default data storage
├── scripts/                                # Database initialization scripts
└── web/                                    # Frontend package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Plan Your Storage Backend
&lt;/h3&gt;

&lt;p&gt;Choose the appropriate storage backend for your deployment scenario.&lt;/p&gt;

&lt;h4&gt;
  
  
  Development/Testing: H2 Database (Default)
&lt;/h4&gt;

&lt;p&gt;For development and testing environments, H2 provides a quick setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: Embedded database, no external dependencies, works out-of-the-box
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Not suitable for production, no data consistency guarantees
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;: No additional setup required
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Production: MySQL
&lt;/h4&gt;

&lt;p&gt;For production environments, MySQL is the recommended choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install and configure MySQL server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create database and user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'gravitino'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'%'&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'your_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'gravitino'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize database schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;mysql_ip_address&amp;gt; &lt;span class="nt"&gt;-u&lt;/span&gt; gravitino &lt;span class="nt"&gt;-D&lt;/span&gt; gravitino &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt; scripts/mysql/schema-1.1.0-mysql.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Production: PostgreSQL
&lt;/h4&gt;

&lt;p&gt;As an alternative production option:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install and configure PostgreSQL server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create database and user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'your_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize database schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;postgres_ip_address&amp;gt; &lt;span class="nt"&gt;-U&lt;/span&gt; gravitino &lt;span class="nt"&gt;-d&lt;/span&gt; gravitino &lt;span class="nt"&gt;-f&lt;/span&gt; scripts/postgresql/schema-1.1.0-postgresql.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure Gravitino Server
&lt;/h3&gt;

&lt;p&gt;Configure the main server settings in the &lt;code&gt;conf/gravitino.conf&lt;/code&gt; file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Basic Server Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Configure HTTP server settings&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# HTTP Server Configuration
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.minThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;24&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.maxThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure storage backend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For H2 (Development):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Storage Backend Configuration
&lt;/span&gt;&lt;span class="py"&gt;gravitino.entity.store&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;relational&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;JDBCBackend&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:h2&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.h2.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For MySQL (Production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure for MySQL
&lt;/span&gt;&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:mysql://&amp;lt;mysql_ip_address&amp;gt;:3306/gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;com.mysql.cj.jdbc.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your_password&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Optional Performance Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Enable caching for better performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Caching provides significant performance improvements, particularly for authorization operations and metadata lookups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authorization Performance&lt;/strong&gt;: Dramatically reduces latency for permission checks by caching user roles, privileges, and access control decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Retrieval&lt;/strong&gt;: Speeds up frequent catalog, schema, and table metadata queries by avoiding repeated database lookups
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable caching for better performance
&lt;/span&gt;&lt;span class="py"&gt;gravitino.cache.enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.implementation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;caffeine&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.maxEntries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.expireTimeInMs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3600000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Optional Access Control Configuration
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Configure authorization
&lt;/h5&gt;

&lt;p&gt;Gravitino includes built-in metadata authorization that you can enable with the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable access control
&lt;/span&gt;&lt;span class="py"&gt;gravitino.authorization.enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.serviceAdmins&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin,gravitino&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;gravitino.authorization.serviceAdmins&lt;/code&gt; defines service administrators who are responsible for creating metalakes.&lt;br&gt;&lt;br&gt;
When a service admin creates a metalake, they automatically become the owner. As the owner, they have full control over the metalake, including the ability to drop it. Ownership can be transferred to another user if needed.&lt;/p&gt;

&lt;p&gt;For comprehensive access control documentation, see &lt;a href="https://gravitino.apache.org/docs/1.1.0/security/access-control" rel="noopener noreferrer"&gt;Access Control&lt;/a&gt;.&lt;/p&gt;
&lt;h5&gt;
  
  
  Configure authentication
&lt;/h5&gt;

&lt;p&gt;Apache Gravitino supports three authentication mechanisms: simple, OAuth, and Kerberos. Upon successful authentication, user identities from any of these methods are mapped directly to authorization principals, which govern access control decisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default Behavior&lt;/strong&gt;: If authentication is not explicitly configured, Gravitino defaults to &lt;code&gt;simple&lt;/code&gt; authentication mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Login Method&lt;/strong&gt;: Use the service administrators specified in the &lt;code&gt;gravitino.authorization.serviceAdmins&lt;/code&gt; configuration to log in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See &lt;a href="https://gravitino.apache.org/docs/1.1.0/security/how-to-authenticate" rel="noopener noreferrer"&gt;How to authenticate&lt;/a&gt; for detailed authentication setup.&lt;/p&gt;
&lt;h4&gt;
  
  
  Environment Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Configure environment variables in &lt;code&gt;conf/gravitino-env.sh&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# JVM Memory Settings&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_MEM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-Xms4g -Xmx4g -XX:MaxMetspaceSize=1g"&lt;/span&gt;

&lt;span class="c"&gt;# Debug Options (uncomment for debugging)&lt;/span&gt;
&lt;span class="c"&gt;# export GRAVITINO_DEBUG_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000 -Dlog4j2.debug=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://gravitino.apache.org/docs/1.1.0/gravitino-server-config" rel="noopener noreferrer"&gt;Apache Gravitino server configurations&lt;/a&gt; for detailed server configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Optional REST services for enhanced functionality
&lt;/h3&gt;

&lt;p&gt;You can enable the Iceberg REST or Lance REST services either as auxiliary services when starting the Gravitino server, or run them as standalone services. We’ve prepared detailed guides for them in subsequent articles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Iceberg REST/Lance REST as auxiliary service
&lt;/span&gt;&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest,lance-rest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Start and Verify Installation
&lt;/h3&gt;

&lt;p&gt;Launch the Gravitino server and verify the installation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Start Gravitino Server
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Start the server in daemon mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Check server status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. View server logs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/gravitino-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Verify Installation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Check server health&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, the response looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"code":0,"version":{"version":"1.1.0","compileDate":"12/12/2025 12:38:33","gitCommit":"5a6b5ae772d50aff98878ae3659fba3598a9027f"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
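The same health check can be scripted. This sketch parses the response body shown above using only the standard library (actually sending the request requires the server from this tutorial running on localhost:8090):

```python
import json

# Pull the server version out of a /api/version response body.
def parse_version(body: str) -> str:
    data = json.loads(body)
    if data.get("code") != 0:
        raise RuntimeError(f"unexpected response code: {data.get('code')}")
    return data["version"]["version"]

sample = ('{"code":0,"version":{"version":"1.1.0",'
          '"compileDate":"12/12/2025 12:38:33",'
          '"gitCommit":"5a6b5ae772d50aff98878ae3659fba3598a9027f"}}')
print(parse_version(sample))  # 1.1.0
```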



&lt;p&gt;&lt;strong&gt;2. Access Web UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open your browser and navigate to &lt;code&gt;http://localhost:8090&lt;/code&gt; to access the Gravitino Web UI.  &lt;/p&gt;

&lt;p&gt;The default login page when using simple authentication mode (with access control enabled):  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukt8rzzsrkdw7k249thl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukt8rzzsrkdw7k249thl.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If access control is disabled, you are taken directly to the metalake management page:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlc4j53x6pg956h2g6pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlc4j53x6pg956h2g6pe.png" alt=" " width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Verify auxiliary services (if enabled)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Iceberg REST service&lt;/span&gt;
curl http://localhost:9001/iceberg/v1/config

&lt;span class="c"&gt;# Check Lance REST service&lt;/span&gt;
curl http://localhost:9101/lance/v1/namespace/%24/list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create Sample Metadata
&lt;/h4&gt;

&lt;p&gt;Test your installation by creating sample metadata objects.&lt;/p&gt;

&lt;h5&gt;
  
  
  Create your first metalake
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "my_metalake", "comment": "My first metalake"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you have enabled access control, you need to add the Authorization header to the command (using username 'admin' and password '123'):&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:123'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "my_metalake", "comment": "My first metalake"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;
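The same `Authorization` header value can be produced in Python when scripting against the API (the username and password here are the tutorial's example credentials):

```python
import base64

# Build an HTTP Basic Authorization header value from credentials.
def basic_auth(user: str, password: str) -> str:
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth("admin", "123"))  # Basic YWRtaW46MTIz
```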

&lt;h5&gt;
  
  
  Create a sample catalog
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This example creates a Hive catalog. Before proceeding, ensure you have a running Hive cluster with Hive Metastore service accessible. If you don't have a Hive cluster, you can use a different catalog type (such as MySQL catalog).&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_hive",
    "type": "relational",
    "provider": "hive",
    "comment": "My Hive catalog",
    "properties": {
      "metastore.uris": "thrift://&amp;lt;hive_metastore_host&amp;gt;:&amp;lt;port&amp;gt;"
    }
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
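&lt;p&gt;The same catalog-creation body can be assembled programmatically. A minimal sketch, assuming you substitute the placeholder metastore address with your real one; reuse the same &lt;code&gt;Accept&lt;/code&gt;/&lt;code&gt;Content-Type&lt;/code&gt; (and, if authentication is enabled, &lt;code&gt;Authorization&lt;/code&gt;) headers shown for the metalake request.&lt;/p&gt;

```python
import json

# The catalog-creation payload from the curl example, as a Python dict.
# Sketch only: METASTORE_URI is a placeholder for your Hive Metastore
# address, and the request itself is not sent here.
METASTORE_URI = "thrift://HIVE_METASTORE_HOST:PORT"  # placeholder, substitute real values

catalog = {
    "name": "catalog_hive",
    "type": "relational",
    "provider": "hive",
    "comment": "My Hive catalog",
    "properties": {"metastore.uris": METASTORE_URI},
}

body = json.dumps(catalog)
url = "http://localhost:8090/api/metalakes/my_metalake/catalogs"
```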



&lt;h5&gt;
  
  
  Manage catalogs on web GUI
&lt;/h5&gt;

&lt;p&gt;Create a catalog:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhmh79hwmlcbfwze7lp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhmh79hwmlcbfwze7lp0.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View/Manage all catalogs:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F173zoikwbe79w21b598l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F173zoikwbe79w21b598l.png" alt="Gravitino catalogs Page" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Apache Gravitino setup tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Apache Gravitino installation with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured metadata server running on port 8090&lt;/li&gt;
&lt;li&gt;A storage backend configured for your environment
&lt;/li&gt;
&lt;li&gt;Optional auxiliary REST services for Iceberg and Lance integration&lt;/li&gt;
&lt;li&gt;Sample metadata objects to verify functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Apache Gravitino server is ready to manage metadata across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading [Iceberg Catalog]&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>metadata</category>
      <category>agenticai</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino Introduction</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Fri, 16 Jan 2026 23:05:12 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-introduction-ce7</link>
      <guid>https://dev.to/gravitino/apache-gravitino-introduction-ce7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ta9xwucijs1ql00h6xi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ta9xwucijs1ql00h6xi.jpeg" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/shi-shao-feng-b1867917/" rel="noopener noreferrer"&gt;shaofeng shi&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2025-12-29&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;In the era of big data, enterprises often need to manage metadata from multi-cloud, multi-domain, and heterogeneous data sources, such as Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, GCS, etc. Additionally, with the extensive application of AI model training and inference, massive amounts of multimodal data and model metadata also require a unified management solution. Traditional approaches involve managing metadata separately for each data source, which not only increases operational complexity but also easily creates data silos. &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt;, as a high-performance, geographically distributed federated metadata lake, provides us with a unified solution for managing multi-source metadata.&lt;/p&gt;

&lt;p&gt;Gravitino was originally initiated and founded by Datastrato Inc., open-sourced in 2023, donated to the Apache Incubator in 2024, and graduated from the Apache Incubator in May 2025 to become an Apache Top Level Project. It has been deployed in production environments at companies like Xiaomi, Tencent, Zhihu, Uber, and Pinterest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Gravitino?
&lt;/h2&gt;

&lt;p&gt;Apache Gravitino is a high-performance, geographically distributed, federated metadata lake management system that provides users with a unified data and AI asset management platform. It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata Management&lt;/strong&gt;: Provide unified metadata models and APIs for different types of data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Metadata Management&lt;/strong&gt;: Directly manage underlying systems, with changes reflected in real-time to source systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Engine Support&lt;/strong&gt;: Support multiple query engines such as Trino, Spark, Flink, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographically Distributed Deployment&lt;/strong&gt;: Support cross-region, cross-cloud deployment architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Asset Management&lt;/strong&gt;: Manage not only data assets but also AI/ML model metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core concepts include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metalake&lt;/strong&gt;: Container/tenant for metadata, typically one organization corresponds to one metalake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt;: Collection of metadata from specific metadata sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Second-level namespace, corresponding to the schema concept in databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table&lt;/strong&gt;: Bottom-level object representing specific data tables&lt;/li&gt;
&lt;/ul&gt;
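&lt;p&gt;The four concepts above form a strict hierarchy, so any table is addressable by a four-part name. A minimal illustrative sketch (not the SDK API; the schema and table names are hypothetical):&lt;/p&gt;

```python
from dataclasses import dataclass

# Gravitino's metadata hierarchy: metalake -> catalog -> schema -> table.
# A fully qualified name is simply the four levels joined with dots.
@dataclass(frozen=True)
class NameIdentifier:
    metalake: str
    catalog: str
    schema: str
    table: str

    def qualified_name(self):
        return ".".join([self.metalake, self.catalog, self.schema, self.table])

ident = NameIdentifier("my_metalake", "catalog_hive", "sales", "orders")
print(ident.qualified_name())  # my_metalake.catalog_hive.sales.orders
```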

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hg0z37dfbt0mo2isys8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hg0z37dfbt0mo2isys8.png" alt="Gravitino Overall Architecture" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Gravitino Core Features Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unified Metadata Management
&lt;/h3&gt;

&lt;p&gt;Gravitino provides a unified metadata management layer that supports integration with multiple data sources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Data Source Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relational Databases&lt;/strong&gt;: MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big Data Storage&lt;/strong&gt;: Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Queues&lt;/strong&gt;: Apache Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Systems&lt;/strong&gt;: HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI/ML Data Formats&lt;/strong&gt;: Lance (columnar data format designed specifically for AI/ML workloads)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REST API Services
&lt;/h3&gt;

&lt;p&gt;Gravitino provides rich REST API services that support standardized access to different data formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gravitino Core REST API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete metadata management RESTful API interface&lt;/li&gt;
&lt;li&gt;Support for CRUD operations on all metadata objects including Metalake, Catalog, Schema, Table, etc.&lt;/li&gt;
&lt;li&gt;Complete API for user, group, role, and permission management&lt;/li&gt;
&lt;li&gt;API interfaces for advanced features like tags, policies, models, etc.&lt;/li&gt;
&lt;li&gt;Support for multiple authentication methods (Simple, OAuth2, Kerberos)&lt;/li&gt;
&lt;/ul&gt;
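&lt;p&gt;These CRUD endpoints nest naturally under the metalake path used in this series' curl examples. A sketch of the URL layout (base URL illustrative; consult the official REST API reference for the authoritative paths):&lt;/p&gt;

```python
# Sketch of how Gravitino's REST resources nest under /api/metalakes.
# The base URL is illustrative; paths follow the curl examples in this series.
BASE = "http://localhost:8090/api"

def metalake_url(metalake):
    return f"{BASE}/metalakes/{metalake}"

def catalog_url(metalake, catalog):
    return f"{metalake_url(metalake)}/catalogs/{catalog}"

def schema_url(metalake, catalog, schema):
    return f"{catalog_url(metalake, catalog)}/schemas/{schema}"

def table_url(metalake, catalog, schema, table):
    return f"{schema_url(metalake, catalog, schema)}/tables/{table}"

print(table_url("my_metalake", "catalog_hive", "sales", "orders"))
```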

&lt;p&gt;&lt;strong&gt;Iceberg REST Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complies with Apache Iceberg REST API specification&lt;/li&gt;
&lt;li&gt;Supports multiple backend storage options (Hive, JDBC, custom backends)&lt;/li&gt;
&lt;li&gt;Provides complete table management and query capabilities&lt;/li&gt;
&lt;li&gt;Supports multiple storage systems (S3, HDFS, GCS, Azure, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lance REST Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implements Lance REST API specification&lt;/li&gt;
&lt;li&gt;Optimized specifically for AI/ML workloads&lt;/li&gt;
&lt;li&gt;Supports efficient vector data storage and retrieval&lt;/li&gt;
&lt;li&gt;Provides namespace and table management functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-time Metadata Retrieval and Modification
&lt;/h3&gt;

&lt;p&gt;Gravitino adopts a direct metadata management mode to keep metadata accurate and consistent in real time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Synchronization&lt;/strong&gt;: Changes to metadata are immediately reflected in underlying data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional Synchronization&lt;/strong&gt;: Supports metadata synchronization from Gravitino to data sources and from data sources to Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Support&lt;/strong&gt;: Ensures atomicity and consistency of metadata operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Management&lt;/strong&gt;: Supports metadata version control and historical tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unified Access Control
&lt;/h3&gt;

&lt;p&gt;Gravitino implements unified permission management across multiple data sources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt;: Supports flexible permission management for users, groups, and roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership Model&lt;/strong&gt;: Each metadata object has a clear owner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission Inheritance&lt;/strong&gt;: Supports hierarchical permission inheritance mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Control&lt;/strong&gt;: Multi-level permission control from Metalake to specific tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supported Permission Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User and group management permissions&lt;/li&gt;
&lt;li&gt;Catalog and schema creation permissions&lt;/li&gt;
&lt;li&gt;Read/write permissions for tables, topics, filesets&lt;/li&gt;
&lt;li&gt;Model registration and version management permissions&lt;/li&gt;
&lt;li&gt;Tag and policy application permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unified Data Lineage
&lt;/h3&gt;

&lt;p&gt;Based on OpenLineage standards, Gravitino provides complete data lineage tracking capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Lineage Collection&lt;/strong&gt;: Automatically collect data lineage information through Spark plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Identifiers&lt;/strong&gt;: Convert identifiers from different data sources to Gravitino unified identifiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Data Source Support&lt;/strong&gt;: Support lineage tracking for various data sources including Hive, Iceberg, JDBC, file systems, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability and Scalability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deployment Modes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-node Deployment&lt;/strong&gt;: Suitable for development and testing environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Deployment&lt;/strong&gt;: Supports high availability and load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Deployment&lt;/strong&gt;: Supports containerized deployment and auto-scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Support&lt;/strong&gt;: Provides official Docker images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage Backends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports multiple metadata storage backends (MySQL, PostgreSQL, etc.)&lt;/li&gt;
&lt;li&gt;Supports distributed storage systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Authentication Methods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple authentication (username/password)&lt;/li&gt;
&lt;li&gt;OAuth2 authentication&lt;/li&gt;
&lt;li&gt;Kerberos authentication (for Hive backends)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Credential Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports cloud storage credential vending (S3, GCS, Azure, etc.)&lt;/li&gt;
&lt;li&gt;Dynamic credential refresh&lt;/li&gt;
&lt;li&gt;Secure credential passing mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Apache Gravitino Integration Capabilities
&lt;/h2&gt;

&lt;p&gt;Gravitino deeply integrates with mainstream compute engines and data processing frameworks, providing users with a unified data access experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute Engine Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamless integration through Gravitino Spark Connector&lt;/li&gt;
&lt;li&gt;Supports Spark SQL and DataFrame API&lt;/li&gt;
&lt;li&gt;Automatic data lineage collection and tracking&lt;/li&gt;
&lt;li&gt;Supports unified access to multiple data sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trino&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration through Gravitino Trino Connector service&lt;/li&gt;
&lt;li&gt;Supports federated queries across data sources&lt;/li&gt;
&lt;li&gt;High-performance analytical query capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration through Gravitino Flink Connector service&lt;/li&gt;
&lt;li&gt;Supports stream-batch unified data processing&lt;/li&gt;
&lt;li&gt;Real-time data processing and analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Python Ecosystem Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PyIceberg&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports Iceberg table access in Python environments&lt;/li&gt;
&lt;li&gt;Integrates with Gravitino Iceberg REST service&lt;/li&gt;
&lt;li&gt;Supports data science and machine learning workflows&lt;/li&gt;
&lt;li&gt;Provides Pandas-compatible data interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Daft&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modern distributed data processing framework&lt;/li&gt;
&lt;li&gt;Optimized specifically for AI/ML workloads&lt;/li&gt;
&lt;li&gt;Supports multimodal data processing&lt;/li&gt;
&lt;li&gt;Integrates with Gravitino metadata management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud-Native Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports Kubernetes native deployment&lt;/li&gt;
&lt;li&gt;Provides Helm Charts and Operators&lt;/li&gt;
&lt;li&gt;Supports auto-scaling and fault recovery&lt;/li&gt;
&lt;li&gt;Integrates with cloud-native monitoring and logging systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  APIs and SDKs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;REST API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete RESTful API interface&lt;/li&gt;
&lt;li&gt;Supports all metadata management operations&lt;/li&gt;
&lt;li&gt;Standardized HTTP interface&lt;/li&gt;
&lt;li&gt;Supports multiple authentication methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Java SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native Java client library&lt;/li&gt;
&lt;li&gt;Type-safe API interface&lt;/li&gt;
&lt;li&gt;Supports connection pooling and retry mechanisms&lt;/li&gt;
&lt;li&gt;Complete exception handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python client library&lt;/li&gt;
&lt;li&gt;Supports asynchronous operations&lt;/li&gt;
&lt;li&gt;Integrates with Jupyter Notebook&lt;/li&gt;
&lt;li&gt;Supports data science workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These integration capabilities allow Gravitino to slot seamlessly into existing data infrastructure, giving users a unified and efficient data management experience. Subsequent articles will cover each of Gravitino's capabilities in detail, along with configuration and usage for every integration component. Stay tuned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading [Setup Guide]&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino — 2025 Summary</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Wed, 07 Jan 2026 00:14:52 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-2025-summary-nf8</link>
      <guid>https://dev.to/gravitino/apache-gravitino-2025-summary-nf8</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;2025 was a landmark year for Apache Gravitino. The project not only graduated as a Top-Level Project (TLP) but also reached its first major stable release, version 1.0.0. Throughout the year, the community focused heavily on "Contextual Engineering" and "AI-native" metadata management, introducing groundbreaking features like the Model Context Protocol (MCP) server, the Lance REST service, and a metadata-driven action system. This article summarizes the milestones and achievements of Apache Gravitino in 2025.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Timeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apache Gravitino officially &lt;strong&gt;graduated as an Apache Top-Level Project on June 3, 2025&lt;/strong&gt;, marking a significant maturity milestone.&lt;/p&gt;

&lt;p&gt;In 2025, the community released several key versions, including the major 1.0.0 release and significant feature updates in 0.8.0-incubating, 0.9.0-incubating, and 1.1.0.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2025.01.24: Version 0.8.0-incubating released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Focused on strengthening AI support with the introduction of the &lt;strong&gt;Model Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Introduced credential vending for Filesets and new connectors for Flink (Iceberg/Paimon).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.05.07: Version 0.9.0-incubating released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced data governance with a new &lt;strong&gt;Data Lineage interface&lt;/strong&gt; (OpenLineage compliant).&lt;/li&gt;
&lt;li&gt;Added gcli script for better CLI experience and improved security with privilege refinements.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.09.24: Version 1.0.0 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The first stable major release, themed "From Metadata Management to Contextual Engineering."&lt;/li&gt;
&lt;li&gt;Introduced the &lt;strong&gt;Metadata-driven Action System&lt;/strong&gt; (including Statistics, Policies, and Jobs).&lt;/li&gt;
&lt;li&gt;Launched the &lt;strong&gt;MCP (Model Context Protocol) Server&lt;/strong&gt;, enabling AI Agents/LLMs to interact directly with metadata.&lt;/li&gt;
&lt;li&gt;Implemented unified Role-Based Access Control (RBAC) across catalogs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.11.20: Version 1.0.1 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A stability release featuring smarter job templates and improved Python client support.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.12.19: Version 1.1.0 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Added the &lt;strong&gt;Lance REST service&lt;/strong&gt; to support vector data for AI workloads.&lt;/li&gt;
&lt;li&gt;Introduced a Generic Lakehouse Catalog and support for Hive 3 and multi-cluster HDFS filesets.&lt;/li&gt;
&lt;li&gt;Hardened security for the Iceberg REST service.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features &amp;amp; Improvements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2025, Gravitino evolved from a unified catalog to an active metadata control plane. Key technical achievements include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI &amp;amp; LLM Integration&lt;/strong&gt;: The project positioned itself as an AI-native catalog by introducing the &lt;strong&gt;Model Catalog&lt;/strong&gt; for managing ML models and the &lt;strong&gt;MCP Server&lt;/strong&gt; to connect AI agents with data context. The addition of the &lt;strong&gt;Lance REST service&lt;/strong&gt; in v1.1.0 further solidified support for vector datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-Driven Actions&lt;/strong&gt;: A new framework allowing users to define policies (e.g., TTL, compaction) and execute jobs based on metadata, moving beyond passive metadata storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Governance &amp;amp; Security&lt;/strong&gt;: Full implementation of &lt;strong&gt;RBAC&lt;/strong&gt;, credential vending for secure data access (S3/GCS/ADLS), and a unified authentication flow for Iceberg REST services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem Expansion&lt;/strong&gt;: Broadened support with new connectors (Generic Lakehouse, Hive 3, Flink, Paimon) and enhancements to the &lt;strong&gt;GVFS (Gravitino Virtual File System)&lt;/strong&gt; for unified file management.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Community&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Apache Gravitino community saw explosive growth in 2025, evolving from an incubator project into a Top-Level Project (TLP) backed by a rapidly expanding global ecosystem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top-Level Graduation&lt;/strong&gt;: On &lt;strong&gt;June 3, 2025&lt;/strong&gt;, the project officially graduated to an Apache Top-Level Project, a major milestone marking its maturity in community health, vendor-neutral governance, and production readiness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Growth (Year-over-Year)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement&lt;/strong&gt;: GitHub stars increased by over &lt;strong&gt;130%&lt;/strong&gt;, ending the year above &lt;strong&gt;2,600&lt;/strong&gt;. Forks grew by approximately &lt;strong&gt;150%&lt;/strong&gt;, reflecting a surge in community-led integrations and local developments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributor Base&lt;/strong&gt;: The active developer community expanded by nearly &lt;strong&gt;100%&lt;/strong&gt;. Recent major releases, such as version 1.1.0, featured contributions from &lt;strong&gt;40+ unique developers&lt;/strong&gt; representing a wide variety of global organizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development Velocity&lt;/strong&gt;: Development pace accelerated significantly, with code commits reaching a lifetime total of over &lt;strong&gt;3,300 commits&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Graduation Committer Growth&lt;/strong&gt;: On July 7, 2025, Chenxi Pan was added as a committer; on December 15, 2025, Junda Yang and Yangyang Zhong were added as committers.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Global Presence&lt;/strong&gt;: The project established itself as the standard for federated metadata through featured presentations at &lt;strong&gt;Community Over Code (NA &amp;amp; Asia)&lt;/strong&gt; and &lt;strong&gt;QCon Shanghai&lt;/strong&gt;, gathering critical production feedback from global data engineering teams to shape the future roadmap.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Industry Trends in Metadata Management (2026)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Breaking Lakehouse Silos&lt;/strong&gt;: As organizations adopt multiple "open" table formats, the risk of "format lock-in" has replaced "vendor lock-in." The trend is toward &lt;strong&gt;Universal Lakehouse&lt;/strong&gt; architectures that provide a single entry point for fragmented data silos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Multimodal AI Explosion&lt;/strong&gt;: AI workloads are moving beyond tabular data to include massive volumes of unstructured assets (images, video, audio). Traditional data stacks are being replaced by &lt;strong&gt;AI-Native Multimodal Stacks&lt;/strong&gt; that can process complex data types with the same governance as SQL tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergence of Data Agents&lt;/strong&gt;: AI Agents are becoming the primary consumers of data. These agents require "Context Engineering"—a way to use metadata as an external brain to discover, understand, and act upon data autonomously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalating AI Security Risks&lt;/strong&gt;: The high-speed nature of AI interactions makes traditional static security (RBAC) obsolete. The industry is moving toward &lt;strong&gt;Identity-Centric Zero Trust&lt;/strong&gt; and &lt;strong&gt;Fine-Grained ABAC&lt;/strong&gt; to prevent data leakage and ensure model safety.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Future Work&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Universal Lakehouse &amp;amp; Format Interoperability&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To solve the data silo problem, Gravitino is expanding its reach to provide a unified management layer for the modern Lakehouse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Support&lt;/strong&gt;: We will provide first-class support for &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, &lt;strong&gt;Hudi&lt;/strong&gt;, and &lt;strong&gt;Paimon&lt;/strong&gt;. By acting as a "Catalog of Catalogs," Gravitino allows users to manage multiple formats through a single interface, significantly reducing vendor lock-in and simplifying cross-format governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Multimodal Data Stack for the AI Era&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Gravitino is evolving to empower a new generation of AI-native data stacks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem Integration&lt;/strong&gt;: We will focus on deep integration with AI-centric engines like &lt;strong&gt;Daft&lt;/strong&gt;, &lt;strong&gt;Ray&lt;/strong&gt;, and &lt;strong&gt;Lance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empowering New Scenarios&lt;/strong&gt;: By providing a unified metadata layer for these engines, Gravitino allows users to "reuse" existing data governance capabilities—like auditing and access control—for modern multimodal scenarios, giving the new AI data stack enterprise-grade maturity from day one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Data Agent Orchestration (Metadata as the "Brain")&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Gravitino will serve as the cognitive foundation for autonomous &lt;strong&gt;Data Agents&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server &amp;amp; Action System&lt;/strong&gt;: Leveraging the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; and our &lt;strong&gt;Metadata Action System&lt;/strong&gt;, we are exploring scenario-based capabilities for Data Agents. This allows an AI agent to not only "see" the data but also "act" on it—such as performing a schema update or triggering a compaction job—using metadata as its reasoning context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Advanced Security: KMS &amp;amp; ABAC&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;As security threats become more sophisticated in the AI era, Gravitino is implementing more granular and automated security controls.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ABAC (Attribute-Based Access Control)&lt;/strong&gt;: We will implement an ABAC engine to enable fine-grained permissions. This allows access decisions to be made based on dynamic tags (e.g., Sensitivity=High) and environmental context rather than just static roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KMS &amp;amp; Credential Management&lt;/strong&gt;: To protect data at rest and in transit, we are integrating with &lt;strong&gt;Key Management Services (KMS)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;2026 will be a defining year for AI-native data infrastructure, and the Gravitino community is just getting started. Whether you’re exploring federated lakehouse architectures, multimodal AI data stacks, or data agents in production, we welcome you to build and evolve Apache Gravitino together with us❤️.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://gravitino.apache.org/blog/2025-summary/" rel="noopener noreferrer"&gt;https://gravitino.apache.org/blog/2025-summary/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Gravitino 1.1.0: A Major Step Toward Unified Metadata for the AI-Native Lakehouse</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 16 Dec 2025 17:32:19 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-110-a-major-step-toward-unified-metadata-for-the-ai-native-lakehouse-2h03</link>
      <guid>https://dev.to/gravitino/apache-gravitino-110-a-major-step-toward-unified-metadata-for-the-ai-native-lakehouse-2h03</guid>
      <description>&lt;p&gt;Apache Gravitino 1.1.0 is now available, bringing powerful new capabilities that make it easier for organizations to unify metadata, govern heterogeneous data platforms, and support emerging AI workloads.&lt;/p&gt;

&lt;p&gt;As enterprises adopt multimodal data, multiple engines, and mixed table formats, metadata fragmentation becomes a critical barrier. Gravitino 1.1.0 directly tackles this challenge with key upgrades across catalogs, security, cluster management, and developer experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;https://github.com/apache/gravitino/releases/tag/v1.1.0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s new in 1.1.0&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Built for the Future of AI Data&lt;/strong&gt;&lt;br&gt;
A new Lance REST service brings governed, high-performance vector access to AI pipelines, inference workloads, and data applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Stronger Security and Metadata Governance&lt;/strong&gt;&lt;br&gt;
Fine-grained authorization now covers tags, jobs, statistics, and policies, while the Iceberg REST service receives major security hardening for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Legacy-to-Lakehouse Modernization, Simplified&lt;/strong&gt;&lt;br&gt;
Hive 3 catalog support allows organizations to bring existing Hive metastores under centralized governance without data migration or risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Multi-Cluster Operations for Real-World Deployments&lt;/strong&gt;&lt;br&gt;
Support for multiple HDFS clusters gives large-scale teams the flexibility needed for DR, isolation, and multi-region architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Faster, More Stable, More Observable&lt;/strong&gt;&lt;br&gt;
From caching to metrics to connector improvements, the entire system sees meaningful boosts in performance, reliability, and usability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb5wr0pg23h5h9e592hd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb5wr0pg23h5h9e592hd.png" alt="The Unified Metadata Layer for the AI-Native Data Stack" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why it matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Organizations building AI-native architectures need a metadata system that spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple engines (Spark, Trino, Flink, Daft, etc.)&lt;/li&gt;
&lt;li&gt;multiple formats (Iceberg, Hudi, Lance)&lt;/li&gt;
&lt;li&gt;multiple clouds and clusters&lt;/li&gt;
&lt;li&gt;batch, streaming, and vector workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI workloads diversify, metadata must unify. Gravitino 1.1.0 brings the interoperability and governance needed to make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Community Acknowledgement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This release reflects the work of dozens of contributors across issues, PRs, tests, design contributions, and reviews. The Gravitino community continues to grow, and we are grateful for everyone who made this release possible.&lt;/p&gt;

&lt;p&gt;Read the full release notes for details on all features and fixes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;https://github.com/apache/gravitino/releases/tag/v1.1.0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>apacheiceberg</category>
      <category>apachegravitino</category>
      <category>datacatalog</category>
    </item>
    <item>
      <title>If You’re Not All-in on Databricks: Why Metadata Freedom Matters</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Wed, 26 Nov 2025 22:25:37 +0000</pubDate>
      <link>https://dev.to/datastrato/if-youre-not-all-in-on-databricks-why-metadata-freedom-matters-1d38</link>
      <guid>https://dev.to/datastrato/if-youre-not-all-in-on-databricks-why-metadata-freedom-matters-1d38</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr35o726wek1ppc2wt0q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr35o726wek1ppc2wt0q.jpeg" alt=" " width="800" height="512"&gt;&lt;/a&gt;Stop and consider your data architecture right now. You are likely grappling with these challenges:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. Are you facing vendor lock-in and prohibitive costs?&lt;/em&gt;&lt;br&gt;
&lt;em&gt;2. Is fragmented metadata wasting your engineering resources?&lt;/em&gt;&lt;br&gt;
&lt;em&gt;3. Are your BI systems failing your AI strategy?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These three questions directly point to the core friction points stemming from metadata constraints, which are crippling modern data teams:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor Lock-in Risk:&lt;/strong&gt;&lt;br&gt;
Platform-bound catalogs, particularly Databricks Unity Catalog (UC), tie governance and security tightly to a specific computing environment. This limits the freedom to choose best-of-breed engines (Trino, Flink, Ray, and others) and results in prohibitive migration costs and annually increasing spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragmentation &amp;amp; Operational Complexity:&lt;/strong&gt;&lt;br&gt;
Modern teams operate in multi-cloud, multi-engine environments. They lack a unified interface to manage metadata across different clouds (S3, Azure Blob, on-premises, etc.) and varying processing engines, which severely hinders operational efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal Data Silos:&lt;/strong&gt; &lt;br&gt;
Traditional, table-centric metadata systems fail to natively support AI/LLM workloads, leading to silos for critical Vector Embeddings, Streaming Topics (Kafka), and Model Registries. The lack of unified metadata breaks end-to-end AI lineage and reproducibility.&lt;/p&gt;

&lt;p&gt;The reality is, the modern data stack is facing a “fragmentation crisis.” Your metadata, which should be the bridge for unified governance, has instead become the primary casualty of this fragmentation.&lt;/p&gt;

&lt;p&gt;We acknowledge that platform-specific solutions like Databricks Unity Catalog (UC) deliver a smooth experience within their own ecosystem. However, for organizations operating across multiple clouds, engines, and formats, that tight integration quickly transforms into a constraint rather than a strength. Traditional systems (HMS, AWS Glue) struggle to keep up, while tightly coupled solutions like UC introduce “soft lock-in.” Here, the core metadata, the true ownership layer of the data, becomes inseparable from a single commercial platform. Once metadata is trapped, everything upstream and downstream hardens around that dependency.&lt;/p&gt;

&lt;p&gt;This challenge is amplified dramatically in the AI-Native Era. LLM and agent-based workflows absolutely depend on the ability to discover, understand, trace, and govern all data assets. Without unified metadata, AI pipelines lose transparency, governance becomes brittle, and trust in AI outcomes erodes.&lt;/p&gt;

&lt;p&gt;Therefore, the truly essential requirement is a vendor-neutral metadata layer that can unify your entire data ecosystem.&lt;/p&gt;

&lt;p&gt;This layer must abstract metadata from compute, storage, and individual vendors; provide consistent semantics across all engines; and support the multimodal assets required for AI.&lt;/p&gt;

&lt;p&gt;In short, you need Metadata Freedom. This freedom is the foundation for long-term data agility and AI readiness. It ensures the true “deed” to your data remains in your hands, not locked behind the boundaries of any single cloud, engine, or commercial platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Architecture of Freedom: Why Gravitino Breaks the Mold&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Databricks has become one of the most influential players in the modern data stack, and Unity Catalog brings strong cohesion within the Databricks ecosystem.&lt;/p&gt;

&lt;p&gt;However, no enterprise is 100% Databricks-only. Most operate across Snowflake, BigQuery, Redshift, Trino, Iceberg, MySQL, S3, and more. Even a powerful platform-native catalog cannot serve as the unified source of metadata truth for such a heterogeneous world.&lt;/p&gt;

&lt;p&gt;To bridge this gap and finally achieve cross-platform metadata freedom, we introduce Apache Gravitino™.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Metadata freedom is not a feature; it is the foundation on which future-proof data architectures are built.”&lt;/em&gt;&lt;br&gt;
 — JP Du, PMC member of Apache Gravitino™ project, CEO &amp;amp; Co-founder of Datastrato.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino™&lt;/a&gt; is not merely a replacement for existing catalogs; it is a foundational architectural shift designed to achieve true Metadata Freedom. We realize this vision by adhering to two core principles: Radical Decoupling and Federated Unification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczfepycy3fivwv6xim75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczfepycy3fivwv6xim75.png" alt="Apache™ Gravitino Architecture" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Radical Decoupling: Metadata as an Independent Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Rather than inheriting constraints from compute or storage platforms, Gravitino treats metadata as an independent, universal control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute-agnostic:&lt;/strong&gt; Works seamlessly with Trino, Spark, Flink, PyTorch, Ray, and others, without imposing a preferred engine.&lt;br&gt;
&lt;strong&gt;Storage-agnostic:&lt;/strong&gt; A connector-based architecture supports S3, GCS, Azure Blob, HDFS, and on-prem object stores.&lt;br&gt;
&lt;strong&gt;Vendor-neutral:&lt;/strong&gt; As described earlier, Gravitino’s decoupled design breaks metadata free from compute, storage, and vendor boundaries, aligning with open standards instead of proprietary roadmaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Federated Unification: The Catalog of Catalogs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To solve the multi-engine, multi-cloud “fragmentation crisis” introduced earlier, Gravitino redefines metadata management through its “Federated Metadata Lake” positioning. This architecture unifies the ecosystem without compromising the autonomy of underlying systems.&lt;/p&gt;

&lt;p&gt;Gravitino addresses fragmentation with a federated architecture that unifies metadata across engines and clouds without forcing standardization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI-native multimodal metadata: Supports tables, unstructured files, Kafka topics, vector embeddings, and models as first-class assets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Federated control: Gravitino’s “Catalog of Catalogs” integrates with HMS, the Apache Iceberg™ REST Catalog, MySQL, PostgreSQL, and object stores, unifying governance without replacing existing systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
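&lt;p&gt;To make the federation concrete: Gravitino manages catalogs under a metalake through its REST API. The sketch below builds a payload for registering an existing Hive Metastore as a federated catalog; the field names and endpoint path follow the Gravitino REST API as best we recall, and the catalog name and metastore address are placeholders, so verify against the current documentation before use.&lt;/p&gt;

```python
import json

# Hypothetical payload for bringing an existing Hive Metastore under a
# Gravitino metalake as a federated catalog (names and URI are placeholders).
payload = {
    "name": "legacy_hive",
    "type": "RELATIONAL",
    "provider": "hive",
    "comment": "Existing HMS brought under Gravitino governance",
    "properties": {
        # Address of the Hive Metastore being federated (placeholder value).
        "metastore.uris": "thrift://hms.example.com:9083",
    },
}

# The catalog would then be created with an HTTP POST, roughly:
#   POST /api/metalakes/{metalake}/catalogs
body = json.dumps(payload)
print(body)
```

&lt;p&gt;Because the underlying metastore keeps serving its existing clients, the registration is additive: Gravitino layers discovery and governance on top rather than migrating the metadata away.&lt;/p&gt;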

&lt;p&gt;This model harmonizes metadata across legacy systems, cloud warehouses, lakehouses, and AI platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apache Gravitino™ vs. Unity Catalog: Solving the Fragmentation Between All Catalogs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Gravitino is not simply a “better Unity Catalog.” It solves the fragmentation between all catalogs. While UC excels within its platform boundaries, Gravitino is designed to function as the superset control plane for the entire heterogeneous stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3itfi8529kvpblvaa0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3itfi8529kvpblvaa0.jpeg" alt="Apache Gravitino™ vs. Unity Catalog (OSS) Comparison" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Community Over Code: Building the Future Through Open Collaboration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For us, the creators and initial contributors to Apache Gravitino™, Metadata Freedom is more than just code. We believe the future of enterprise data architectures cannot be tied to a single corporate roadmap; it must be built by the people who use it. This focus on developer interaction and shared ownership is why we chose the Apache path.&lt;/p&gt;

&lt;p&gt;Apache Gravitino™ is governed by the Apache Software Foundation (ASF). This isn’t a mere badge; it’s a structural guarantee of neutrality. We cherish the open, democratic nature of the ASF model, where every major decision, feature, and release is subject to a community-wide consensus and voting process. This rigorous governance ensures that Gravitino’s evolution is driven purely by technical merit and user needs, making it a project that truly belongs to everyone.&lt;/p&gt;

&lt;p&gt;In addition to advancing the Apache Gravitino™ project itself, we actively contribute to the broader open-source ecosystem, submitting code to upstream and downstream projects such as &lt;strong&gt;&lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg™&lt;/a&gt;, &lt;a href="https://www.lance.com/" rel="noopener noreferrer"&gt;Lance&lt;/a&gt;, &lt;a href="https://www.daft.ai/" rel="noopener noreferrer"&gt;Daft&lt;/a&gt;, &lt;a href="https://openlineage.io/" rel="noopener noreferrer"&gt;OpenLineage&lt;/a&gt;, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt;&lt;/strong&gt; and others.&lt;/p&gt;

&lt;p&gt;This commitment to open, multi-company governance directly fuels Gravitino’s rapid momentum. Compared with proprietary-led open source alternatives such as OSS Unity Catalog, our community activity metrics, including lines of code, contributors, commits, and issues, have recently measured at over 5x. This level of activity shows that our neutral, democratic approach is exactly what the industry demands, assuring contributors and adopters of the project’s long-term health and vitality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk0a236dzf33nhmp4li9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk0a236dzf33nhmp4li9.jpeg" alt="Data for AI’s open source afterparty" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fully embracing the Apache spirit of &lt;strong&gt;“Community over code”&lt;/strong&gt;, we have established the “Data for AI” community. This focused hub convenes developers globally to exchange knowledge and collectively tackle the practical industry challenges unique to modern data infrastructure. To accelerate this collective understanding, we regularly host technical events, inviting leading data experts from cutting-edge companies such as AWS, OpenAI, NVIDIA, Uber, Pinterest, and Roku to share their latest best practices and trends. By fostering this interaction and insight sharing, we ensure Gravitino evolves in lockstep with the most urgent needs of the industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Databricks Is Part of the Future — But Not the Whole Future.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is not a Databricks critique.&lt;/p&gt;

&lt;p&gt;They are a key pillar of the ecosystem, and many workloads fit beautifully within their platform. But the modern enterprise will always be a polyglot environment.&lt;/p&gt;

&lt;p&gt;We firmly believe that enterprises welcome more flexible, open, and diverse technological solutions. No organization wants to be forced into a corner, locked in by a single supplier.&lt;/p&gt;

&lt;p&gt;If your data strategy is not 100% committed to Databricks or any other single vendor’s ecosystem, or if you are implementing a multi-cloud, multi-engine strategy, relying on platform-specific catalogs is fundamentally insufficient.&lt;/p&gt;

&lt;p&gt;The path forward requires decoupling control from computing.&lt;/p&gt;

&lt;p&gt;Metadata Freedom provides the agility, interoperability, and ultimate safeguard against vendor lock-in.&lt;/p&gt;

&lt;p&gt;AI-Native and multimodal workloads demand an open, federated metadata layer to unify tables, files, streams, and vectors.&lt;/p&gt;

&lt;p&gt;Apache Gravitino™ demonstrates what the next-generation metadata architecture looks like. Through radical decoupling and a community-driven, federated approach, it returns the sovereignty of metadata to the user.&lt;/p&gt;

&lt;p&gt;The future belongs to open, federated, and community-driven metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Metadata Freedom. This is what Apache Gravitino™ stands for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why wait for tomorrow? Join the &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino™&lt;/a&gt; community today and start your journey to Metadata Freedom now.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>apachegravitino</category>
      <category>unitycatalog</category>
      <category>metadata</category>
    </item>
  </channel>
</rss>
