<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yue @ Datastrato (Admin)</title>
    <description>The latest articles on DEV Community by Yue @ Datastrato (Admin) (@yueguo).</description>
    <link>https://dev.to/yueguo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3376613%2F5a55f02d-15be-46d0-9f9c-7bd8e3ad6ec2.jpg</url>
      <title>DEV Community: Yue @ Datastrato (Admin)</title>
      <link>https://dev.to/yueguo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yueguo"/>
    <language>en</language>
    <item>
      <title>Using Gravitino with Apache Flink for Streaming</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 12 Mar 2026 05:10:14 +0000</pubDate>
      <link>https://dev.to/gravitino/using-gravitino-with-apache-flink-for-streaming-25n9</link>
      <guid>https://dev.to/gravitino/using-gravitino-with-apache-flink-for-streaming-25n9</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/fanng-1-2081a7330/" rel="noopener noreferrer"&gt;xiaojing&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-03-11&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to use Apache Gravitino with Apache Flink to build a simple streaming pipeline. You will create a Hive catalog and a Paimon catalog in Gravitino, define a Kafka-backed &lt;strong&gt;generic table&lt;/strong&gt; in the Hive catalog, and then use Flink SQL (through the Gravitino Flink connector) to read Kafka data and write it into a Paimon table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure the Gravitino Flink connector&lt;/strong&gt; in Flink&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create Hive and Paimon catalogs&lt;/strong&gt; in Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define a Kafka generic table&lt;/strong&gt; in the Hive catalog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream data from Kafka to Paimon&lt;/strong&gt; using Flink SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4iplzi4tm46m46w7fsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4iplzi4tm46m46w7fsr.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS&lt;/li&gt;
&lt;li&gt;JDK 17 or later (required by the Gravitino server)&lt;/li&gt;
&lt;li&gt;Apache Flink 1.18 (recommended for the Gravitino Flink connector)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server v1.2.0 or later (this tutorial requires features introduced after v1.1.0; see &lt;a href="//../02-setup-guide/README.md"&gt;02-setup-guide/README.md&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Hive Metastore (for the Hive catalog)&lt;/li&gt;
&lt;li&gt;Apache Kafka broker (for the Kafka source table)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Suggested Versions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Paimon connector JAR that matches your Flink version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java and Flink installations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/flink &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step-by-Step Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Set environment variables
&lt;/h3&gt;

&lt;p&gt;These values are used throughout the tutorial. Adjust them for your environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"default_metalake"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HIVE_METASTORE_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"thrift://localhost:9083"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PAIMON_WAREHOUSE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"file:///tmp/paimon-warehouse"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KAFKA_BROKERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"localhost:9092"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
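
&lt;p&gt;Before moving on, you can confirm the Gravitino server is reachable at that URI; the &lt;code&gt;/api/version&lt;/code&gt; endpoint returns the running server version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sS "${GRAVITINO_URI}/api/version"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;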



&lt;h3&gt;
  
  
  Step 2: Create Hive and Paimon catalogs in Gravitino
&lt;/h3&gt;

&lt;p&gt;Create a Hive catalog and a Paimon catalog using the Gravitino REST API.&lt;br&gt;
If you need to pass Hive-specific configs (for example &lt;code&gt;hive-conf-dir&lt;/code&gt;), set them as catalog properties with the &lt;code&gt;flink.bypass.&lt;/code&gt; prefix (for example &lt;code&gt;flink.bypass.hive-conf-dir&lt;/code&gt;); these properties are forwarded to the Flink Hive connector.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create Hive catalog&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "hive_catalog",
    "type": "relational",
    "comment": "Hive catalog for Flink streaming",
    "provider": "hive",
    "properties": {
      "metastore.uris": "'&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HIVE_METASTORE_URI&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;'"
    }
  }'&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/api/metalakes/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/catalogs

&lt;span class="c"&gt;# Create Paimon catalog (filesystem backend)&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "paimon_catalog",
    "type": "relational",
    "comment": "Paimon catalog for Flink streaming",
    "provider": "lakehouse-paimon",
    "properties": {
      "catalog-backend": "filesystem",
      "warehouse": "'&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PAIMON_WAREHOUSE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s1"&gt;'"
    }
  }'&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/api/metalakes/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
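
&lt;p&gt;To confirm both catalogs were created, list the catalogs in the metalake; the response should include &lt;code&gt;hive_catalog&lt;/code&gt; and &lt;code&gt;paimon_catalog&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -sS -H "Accept: application/vnd.gravitino.v1+json" \
  "${GRAVITINO_URI}/api/metalakes/${METALAKE_NAME}/catalogs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;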



&lt;h3&gt;
  
  
  Step 3: Install required JARs in Flink
&lt;/h3&gt;

&lt;p&gt;Place the following JARs in &lt;code&gt;FLINK_HOME/lib&lt;/code&gt; so Flink SQL can load them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gravitino-flink-connector-runtime-1.18_2.12-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;paimon-flink-1.18-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;flink-sql-connector-kafka-&amp;lt;version&amp;gt;.jar&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Hive dependencies required by Flink HiveCatalog (same as Flink-Hive integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: The Kafka SQL connector is not included in the Flink binary distribution and must be added separately.&lt;/p&gt;
&lt;/blockquote&gt;
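
&lt;p&gt;As a concrete sketch (the file names and &lt;code&gt;&amp;lt;version&amp;gt;&lt;/code&gt; placeholders below are examples; use the JARs that match your Flink and Gravitino versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# copy the connector JARs into Flink's classpath
cp gravitino-flink-connector-runtime-1.18_2.12-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/
cp paimon-flink-1.18-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/
cp flink-sql-connector-kafka-&amp;lt;version&amp;gt;.jar ${FLINK_HOME}/lib/

# confirm they are in place
ls ${FLINK_HOME}/lib | grep -E 'gravitino|paimon|kafka'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;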

&lt;h3&gt;
  
  
  Step 4: Configure Flink to use the Gravitino catalog store
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;FLINK_HOME/conf/flink-conf.yaml&lt;/code&gt; and add (replace with your values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;table.catalog-store.kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="na"&gt;table.catalog-store.gravitino.gravitino.metalake&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${METALAKE_NAME}&lt;/span&gt;
&lt;span class="na"&gt;table.catalog-store.gravitino.gravitino.uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${GRAVITINO_URI}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Flink if it is running, then make sure a Flink cluster is reachable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/start-cluster.sh
curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://localhost:8081/overview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;curl&lt;/code&gt; fails with connection refused, the &lt;code&gt;INSERT INTO ... SELECT ...&lt;/code&gt; statement in Step 7 will also fail, because the SQL client cannot submit jobs to the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Create a Kafka generic table in the Hive catalog
&lt;/h3&gt;

&lt;p&gt;Flink's &lt;code&gt;HiveCatalog&lt;/code&gt; supports both Hive-compatible tables and &lt;strong&gt;generic tables&lt;/strong&gt;. A table is generic by default in &lt;code&gt;HiveCatalog&lt;/code&gt; unless you explicitly set &lt;code&gt;'connector' = 'hive'&lt;/code&gt; or use Hive dialect. Here we create a Kafka &lt;strong&gt;generic&lt;/strong&gt; table so the metadata is stored in Hive Metastore, while the data is read from Kafka by Flink. If you want a Hive-compatible table, use Hive dialect or set &lt;code&gt;'connector' = 'hive'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Start the Flink SQL client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FLINK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/sql-client.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the SQL client, run the following statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Use the Hive catalog managed by Gravitino&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;hive_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Kafka source table stored as a generic table in Hive catalog&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;kafka_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_LTZ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;METADATA&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'timestamp'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;WATERMARK&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'5'&lt;/span&gt; &lt;span class="k"&gt;SECOND&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'connector'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'kafka'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'topic'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user_behavior'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'properties.bootstrap.servers'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'${KAFKA_BROKERS}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- replace with your Kafka brokers&lt;/span&gt;
  &lt;span class="s1"&gt;'properties.group.id'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gravitino-flink-demo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'scan.startup.mode'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'earliest-offset'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'format'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'json.ignore-parse-errors'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes about generic tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HiveCatalog supports Hive-compatible tables and generic tables. Hive-compatible tables are stored in a Hive-compatible way and can be queried from Hive.&lt;/li&gt;
&lt;li&gt;Generic tables are Flink-specific. Hive can see the metadata in Hive Metastore but typically cannot interpret it, so querying them from Hive results in undefined behavior.&lt;/li&gt;
&lt;li&gt;To create a Hive-compatible table while using the default dialect, set &lt;code&gt;'connector' = 'hive'&lt;/code&gt;; with the Hive dialect, the &lt;code&gt;connector&lt;/code&gt; property is not required.&lt;/li&gt;
&lt;li&gt;In Gravitino, generic table schema and partition keys are stored in &lt;code&gt;flink.*&lt;/code&gt; properties in Hive Metastore. If &lt;code&gt;connector=hive&lt;/code&gt;, the table is treated as a Hive-compatible table with a native Hive schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Create a Paimon sink table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;paimon_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;paimon_user_behavior&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_LTZ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Stream data from Kafka to Paimon
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="s1"&gt;'execution.checkpointing.interval'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'10 s'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;paimon_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paimon_user_behavior&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;behavior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hive_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streaming_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Kafka is receiving data in &lt;code&gt;user_behavior&lt;/code&gt;, Flink will continuously write it to the Paimon table.&lt;br&gt;
For streaming writes to Paimon, periodic checkpoints are required for commits.&lt;/p&gt;
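
&lt;p&gt;To check that rows are actually landing in Paimon, you can run a bounded read from a separate Flink SQL client session, switching that session to batch mode so the query terminates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET 'execution.runtime-mode' = 'batch';

SELECT * FROM paimon_catalog.streaming_db.paimon_user_behavior LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;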
&lt;h2&gt;
  
  
  Code Examples
&lt;/h2&gt;

&lt;p&gt;Sample Kafka messages (JSON lines):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"buy"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
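
&lt;p&gt;One way to publish these messages is Kafka's console producer (a sketch assuming a local Kafka installation; &lt;code&gt;KAFKA_HOME&lt;/code&gt; is a placeholder for your Kafka distribution directory). Enter one JSON object per line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;${KAFKA_HOME}/bin/kafka-console-producer.sh \
  --bootstrap-server "${KAFKA_BROKERS}" \
  --topic user_behavior
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;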



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs not visible in Flink&lt;/strong&gt;: Verify &lt;code&gt;table.catalog-store.*&lt;/code&gt; settings in &lt;code&gt;flink-conf.yaml&lt;/code&gt; and that the Gravitino server is reachable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClassNotFoundException&lt;/strong&gt;: Ensure the Gravitino connector, Kafka connector, and Paimon JARs are present in &lt;code&gt;FLINK_HOME/lib&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;java.net.ConnectException: Connection refused&lt;/strong&gt; when running &lt;code&gt;INSERT INTO&lt;/code&gt;: Flink SQL client cannot reach JobManager REST endpoint (default &lt;code&gt;localhost:8081&lt;/code&gt;). Start cluster with &lt;code&gt;${FLINK_HOME}/bin/start-cluster.sh&lt;/code&gt; and verify &lt;code&gt;curl http://localhost:8081/overview&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job is RUNNING but no new rows in Paimon&lt;/strong&gt;: Ensure checkpoints are enabled in streaming mode (for example &lt;code&gt;SET 'execution.checkpointing.interval' = '10 s';&lt;/code&gt;) and check checkpoint progress in Flink Web UI or &lt;code&gt;/jobs/&amp;lt;job-id&amp;gt;/checkpoints&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job is RUNNING but expected records are skipped after rerun&lt;/strong&gt;: Kafka offsets are tracked by &lt;code&gt;properties.group.id&lt;/code&gt;. Use a new group id (for example &lt;code&gt;gravitino-flink-demo-v2&lt;/code&gt;) when you want a fresh replay behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Use fully qualified names like &lt;code&gt;hive_catalog.streaming_db.kafka_events&lt;/code&gt; and &lt;code&gt;paimon_catalog.streaming_db.paimon_user_behavior&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Flink streaming tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Flink streaming environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Flink connector for unified catalog access&lt;/li&gt;
&lt;li&gt;Hive and Paimon catalogs registered in Gravitino and accessible from Flink SQL&lt;/li&gt;
&lt;li&gt;A working streaming pipeline that reads from Kafka and writes to Paimon&lt;/li&gt;
&lt;li&gt;An understanding of generic tables versus Hive-compatible tables in HiveCatalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Flink environment is now ready to leverage Gravitino for unified metadata management across your streaming data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/latest/flink-connector/flink-catalog-paimon" rel="noopener noreferrer"&gt;Gravitino Flink Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/overview/" rel="noopener noreferrer"&gt;Apache Flink SQL&lt;/a&gt; for more query patterns&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://paimon.apache.org/docs/master/flink/quick-start/" rel="noopener noreferrer"&gt;Apache Paimon with Flink&lt;/a&gt; for Paimon-specific features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore Iceberg catalogs with Gravitino in &lt;a href="//../03-iceberg-catalog/README.md"&gt;03-iceberg-catalog/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Try query federation with Trino in &lt;a href="//../06-trino-query/README.md"&gt;06-trino-query/README.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving; this article targets Gravitino server v1.2.0 or later (see Prerequisites). If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>apacheflink</category>
      <category>apachegravitino</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino Iceberg REST Catalog Access Control Deployment Guide</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Sun, 15 Feb 2026 20:05:36 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-iceberg-rest-catalog-access-control-deployment-guide-4ck</link>
      <guid>https://dev.to/gravitino/apache-gravitino-iceberg-rest-catalog-access-control-deployment-guide-4ck</guid>
      <description>&lt;h2&gt;
  
  
  1. Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Product Introduction
&lt;/h3&gt;

&lt;p&gt;Apache Gravitino IRC (Iceberg REST Catalog) is a Gravitino-based implementation of the Iceberg REST catalog that provides unified Iceberg table management. Starting from v1.1.0, Gravitino IRC supports access control for Iceberg tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Key Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ Table operation authorization&lt;/li&gt;
&lt;li&gt;✅ Multi-tenancy support&lt;/li&gt;
&lt;li&gt;✅ RESTful API interface&lt;/li&gt;
&lt;li&gt;✅ Seamless integration with Spark&lt;/li&gt;
&lt;li&gt;✅ Role-based access control (RBAC)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.3 Current Status
&lt;/h3&gt;

&lt;p&gt;Gravitino IRC currently supports table-level operation authorization; more access control features are planned for future releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. System Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu3zu6913tjmoumksy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpu3zu6913tjmoumksy1.png" alt="Architecture Diagram" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Component Description
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino Server&lt;/strong&gt;: Core metadata service, primarily managing table permission information in this scenario; port 8090&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg REST Service&lt;/strong&gt;: Iceberg REST catalog service that connects to Gravitino Server via API to retrieve permission information; port 9002&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL&lt;/strong&gt;: Metadata storage for both Gravitino and IRC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Storage&lt;/strong&gt;: Data file storage&lt;/li&gt;
&lt;/ul&gt;
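
&lt;p&gt;After deployment, a quick reachability check against the two service ports looks like this (a sketch assuming both services run on localhost; the Iceberg REST service exposes the standard Iceberg &lt;code&gt;/v1/config&lt;/code&gt; endpoint under the &lt;code&gt;/iceberg&lt;/code&gt; prefix):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Gravitino Server (port 8090)
curl -sS http://localhost:8090/api/version

# Iceberg REST Service (port 9002)
curl -sS http://localhost:9002/iceberg/v1/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;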

&lt;h2&gt;
  
  
  3. Environment Requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 System Requirements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;4 cores&lt;/td&gt;
&lt;td&gt;8 cores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;8GB&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;100GB&lt;/td&gt;
&lt;td&gt;500GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Gigabit&lt;/td&gt;
&lt;td&gt;10 Gigabit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  3.2 Software Dependencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;JDK 17+&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;5.7+&lt;/td&gt;
&lt;td&gt;Metadata storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;3.4+&lt;/td&gt;
&lt;td&gt;Optional, client&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  4. Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core Configuration File
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;gravitino.conf&lt;/code&gt; configuration file in &lt;code&gt;GRAVITINO_HOME/conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# ============================================
# Gravitino Service Basic Configuration
# ============================================
&lt;/span&gt;
&lt;span class="c"&gt;# Service shutdown timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.shutdown.timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3000&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Web Server Configuration
# ============================================
&lt;/span&gt;
&lt;span class="c"&gt;# Web server host address
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="c"&gt;# HTTP port
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8090&lt;/span&gt;
&lt;span class="c"&gt;# Minimum threads
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.minThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;24&lt;/span&gt;
&lt;span class="c"&gt;# Maximum threads
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.maxThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;200&lt;/span&gt;
&lt;span class="c"&gt;# Stop timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.stopTimeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;30000&lt;/span&gt;
&lt;span class="c"&gt;# Idle timeout
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.idleTimeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;30000&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Entity Store Configuration (MySQL)
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;relational&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;JDBCBackend&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:mysql://192.168.194.152:3306/gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;com.mysql.cj.jdbc.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Cache Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.maxEntries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.expireTimeInMs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3600000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.enableWeigher&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.implementation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;caffeine&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Authorization Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.impl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.apache.gravitino.server.authorization.jcasbin.JcasbinAuthorizer&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.serviceAdmins&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin # Admin account for creating metalake&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authenticators&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;simple&lt;/span&gt;

&lt;span class="c"&gt;# ============================================
# Iceberg REST Service Configuration
# ============================================
&lt;/span&gt;
&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.classpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest-server/libs,iceberg-rest-server/conf&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;9002&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.catalog-config-provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;dynamic-config-provider&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8090/&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-metalake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;my_metalake  # Metalake name to use&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.gravitino-simple.user-name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;rest-catalog # User for IRC service to fetch catalog info&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.default-catalog-name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;catalog_iceberg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Deployment Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Database Initialization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to scripts directory&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;distribution/package/scripts

&lt;span class="c"&gt;# Execute SQL based on database type&lt;/span&gt;
&lt;span class="c"&gt;# MySQL example&lt;/span&gt;
mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;host&amp;gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &amp;lt;user&amp;gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; &amp;lt;database&amp;gt; &amp;lt; xxx.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Download Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download MySQL driver&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$GRAVITINO_HOME&lt;/span&gt;
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar libs/
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar catalogs/lakehouse-iceberg/libs
&lt;span class="nb"&gt;cp &lt;/span&gt;mysql-connector-java-8.0.27.jar iceberg-rest-server/libs

&lt;span class="c"&gt;# Copy bundle jar files&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; bundles/aws-bundle/build/libs/&lt;span class="k"&gt;*&lt;/span&gt;.jar distribution/package/catalogs/lakehouse-iceberg/libs
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; bundles/aws-bundle/build/libs/&lt;span class="k"&gt;*&lt;/span&gt;.jar distribution/package/iceberg-rest-server/libs

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.9.2/iceberg-aws-bundle-1.9.2.jar
&lt;span class="nb"&gt;cp &lt;/span&gt;iceberg-aws-bundle-1.9.2.jar distribution/package/iceberg-rest-server/libs
&lt;span class="nb"&gt;cp &lt;/span&gt;iceberg-aws-bundle-1.9.2.jar distribution/package/catalogs/lakehouse-iceberg/libs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.3 Start Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Gravitino service&lt;/span&gt;
/bin/bash bin/gravitino.sh start

&lt;span class="c"&gt;# Check service status&lt;/span&gt;
/bin/bash bin/gravitino.sh status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
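
&lt;p&gt;Once the service is started, you can also probe the REST API directly to confirm the web server is reachable. This assumes the default host and port from &lt;code&gt;gravitino.conf&lt;/code&gt;; the version endpoint is a convenient read-only check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Query the server version as a lightweight health check&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8090/api/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;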



&lt;h3&gt;
  
  
  5.4 Create Metalake
&lt;/h3&gt;

&lt;p&gt;If you haven't created a metalake yet, use the API (or web UI) to create one named &lt;code&gt;my_metalake&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create Metalake with admin privileges&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "my_metalake",
    "comment": "",
    "properties": {}
  }'&lt;/span&gt; http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
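
&lt;p&gt;The &lt;code&gt;Authorization&lt;/code&gt; header above is plain HTTP Basic auth: the &lt;code&gt;user:password&lt;/code&gt; pair is base64-encoded inline. With the placeholder credentials used in this guide, the encoded value can be produced directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Encode placeholder credentials for the Authorization header&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;
&lt;span class="c"&gt;# Output: YWRtaW46cGFzc3dvcmQ=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;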



&lt;h3&gt;
  
  
  5.5 Create Iceberg Catalog
&lt;/h3&gt;

&lt;p&gt;Register an Iceberg catalog in Gravitino; it must use the same backend (such as HMS or JDBC) as the running Iceberg REST service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_iceberg",
    "type": "RELATIONAL",
    "provider": "lakehouse-iceberg",
    "comment": "Iceberg catalog",
    "properties": {
      "uri": "jdbc:mysql://mysql-host:3306/iceberg_db",
      "catalog-backend": "jdbc",
      "warehouse": "s3://bucket/iceberg/warehouse/",
      "jdbc-user": "mysql_user",
      "jdbc-password": "mysql_password",
      "jdbc-driver": "com.mysql.cj.jdbc.Driver",
      "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
      "s3-secret-access-key": "your_secret_key",
      "s3-access-key-id": "your_access_key",
      "s3-region": "ap-southeast-1",
      "authentication.type": "simple",
      "credential-providers": "s3-token",
      "s3-endpoint": "http://s3.ap-southeast-1.amazonaws.com",
      "jdbc-initialize": "true",
      "s3-role-arn": "arn:aws:iam::730335553010:role/sts_s3_access_role"
    }
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Access Control Management
&lt;/h2&gt;

&lt;p&gt;Next, we will use Gravitino's RBAC permission model to configure access control for the Iceberg Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Permission Model
&lt;/h3&gt;

&lt;p&gt;Gravitino provides the following privileges related to catalog/schema/table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Privilege Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Applicable Objects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;USE_CATALOG&lt;/td&gt;
&lt;td&gt;Permission to use catalog&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USE_SCHEMA&lt;/td&gt;
&lt;td&gt;Permission to use schema&lt;/td&gt;
&lt;td&gt;Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SELECT_TABLE&lt;/td&gt;
&lt;td&gt;Permission to query table&lt;/td&gt;
&lt;td&gt;Table, Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MODIFY_TABLE&lt;/td&gt;
&lt;td&gt;Permission to modify table&lt;/td&gt;
&lt;td&gt;Table, Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE_TABLE&lt;/td&gt;
&lt;td&gt;Permission to create table&lt;/td&gt;
&lt;td&gt;Schema, Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE_SCHEMA&lt;/td&gt;
&lt;td&gt;Permission to create schema&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  6.2 Create Roles and Permissions
&lt;/h3&gt;

&lt;p&gt;Create a role named "data_reader" with various privileges on catalog, schema, and table. Please adjust the catalog, schema, and table names accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create schema&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "schema1",
  "comment": "comment",
  "properties": {
    "key1": "value1"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/catalogs/catalog_iceberg/schemas

&lt;span class="c"&gt;# Create role&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "data_reader",
    "properties": {"description": "data read"},
    "securableObjects": [
      {
        "fullName": "catalog_iceberg.schema1",
        "type": "SCHEMA",
        "privileges": [
          {"name": "CREATE_TABLE", "condition": "ALLOW"},
          {"name": "USE_SCHEMA", "condition": "ALLOW"}
        ]
      },
      {
        "fullName": "catalog_iceberg",
        "type": "CATALOG",
        "privileges": [{"name": "USE_CATALOG", "condition": "ALLOW"}]
      }
    ]
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a role for the Iceberg REST server user so that it can fetch catalog information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_reader",
    "properties": {"description": "load catalog infos"},
    "securableObjects": [
      {
        "fullName": "my_metalake",
        "type": "METALAKE",
        "privileges": [{"name": "USE_CATALOG", "condition": "ALLOW"}]
      }
    ]
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/my_metalake/roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Create Users and Grant Permissions
&lt;/h3&gt;

&lt;p&gt;Create a user (for example, &lt;code&gt;spark_user&lt;/code&gt;) in Gravitino and grant it the &lt;code&gt;data_reader&lt;/code&gt; role created above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "spark_user"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/users

&lt;span class="c"&gt;# Grant permissions to user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; PUT &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"roleNames": ["data_reader"]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/permissions/users/spark_user/grant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the &lt;code&gt;rest-catalog&lt;/code&gt; user in Gravitino and grant it the &lt;code&gt;catalog_reader&lt;/code&gt; role so it can load catalogs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "rest-catalog"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/users

&lt;span class="c"&gt;# Grant permissions to user&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; PUT &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:password'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"roleNames": ["catalog_reader"]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/permissions/users/rest-catalog/grant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Spark Integration
&lt;/h2&gt;

&lt;p&gt;After configuring permissions in Gravitino, you can test and verify on the client side.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 Spark Configuration
&lt;/h3&gt;

&lt;p&gt;Using Spark as an example, configure the username on the client and point the Iceberg REST catalog URI at the IRC service address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-sql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jars&lt;/span&gt; &lt;span class="s2"&gt;"/path/to/iceberg-aws-bundle-1.9.2.jar,/path/to/iceberg-spark-runtime-3.4_2.12-1.9.2.jar"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.extensions&lt;span class="o"&gt;=&lt;/span&gt;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest&lt;span class="o"&gt;=&lt;/span&gt;org.apache.iceberg.spark.SparkCatalog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.type&lt;span class="o"&gt;=&lt;/span&gt;rest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.uri&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9002/iceberg/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest..X-Iceberg-Access-Delegation&lt;span class="o"&gt;=&lt;/span&gt;vended-credentials &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.type&lt;span class="o"&gt;=&lt;/span&gt;basic &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.basic.username&lt;span class="o"&gt;=&lt;/span&gt;spark_user &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.catalog.rest.rest.auth.basic.password&lt;span class="o"&gt;=&lt;/span&gt;user_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.2 Usage Examples
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Show available tables&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Through this guide, you have learned:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complete Deployment Process&lt;/strong&gt; - End-to-end guidance from environment preparation and database initialization through dependency downloads to service startup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Control System&lt;/strong&gt; - Understanding Gravitino's RBAC permission model, learning to create roles, assign permissions, and manage users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world Application Scenarios&lt;/strong&gt; - Learning how to use IRC access control in production through Spark integration examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Configuration Points&lt;/strong&gt; - Mastering key configuration parameters for Gravitino Server and Iceberg REST Service&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This solution provides enterprise-grade access control for your data lake, implementing fine-grained, table-level permission management without sacrificing usability.&lt;/p&gt;

</description>
      <category>datacatalog</category>
      <category>icebergrest</category>
      <category>metadata</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using Apache Gravitino with Trino for Query Federation</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 12 Feb 2026 00:29:32 +0000</pubDate>
      <link>https://dev.to/gravitino/using-apache-gravitino-with-trino-for-query-federation-4doi</link>
      <guid>https://dev.to/gravitino/using-apache-gravitino-with-trino-for-query-federation-4doi</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/hui-yu-503300394" rel="noopener noreferrer"&gt;Yu hui&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-02-11&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to integrate Apache Gravitino with Trino to enable query federation across multiple data sources through a unified metadata layer. By the end of this guide, you'll be able to configure Trino to automatically load catalogs from Gravitino and run cross-catalog queries seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connect Trino to Gravitino&lt;/strong&gt; to enable automatic loading of Gravitino-managed catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create catalogs from Trino SQL&lt;/strong&gt; including Iceberg and MySQL examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute federated queries&lt;/strong&gt; that join data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate catalog discovery&lt;/strong&gt; and inspect catalogs using Trino SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trino is a distributed SQL query engine designed for fast analytic queries against data of any size. In modern data architectures, organizations often need to query data across multiple heterogeneous systems (like MySQL, PostgreSQL, Iceberg, Hive) without moving or copying data. This is where query federation becomes essential.&lt;/p&gt;

&lt;p&gt;Apache Gravitino simplifies this by acting as a unified metadata control plane. By using the Gravitino Trino Connector, you can access multiple data sources through a single catalog interface in Trino, with automatic catalog discovery and centralized metadata management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified catalog access&lt;/strong&gt;: Query MySQL, Iceberg, Hive, and other sources through a single interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic catalog discovery&lt;/strong&gt;: Catalogs created in Gravitino are automatically available in Trino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero data movement&lt;/strong&gt;: Join across heterogeneous systems without copying data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized management&lt;/strong&gt;: Create and update catalogs in one place, reflected everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr320dbnwrkeg1eokse1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr320dbnwrkeg1eokse1a.png" alt="Gravitino Trino Query Federation Architecture" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and running (see &lt;code&gt;02-setup-guide/README.md&lt;/code&gt; in the repository)&lt;/li&gt;
&lt;li&gt;Trino server (coordinator + workers, or single-node for testing)&lt;/li&gt;
&lt;li&gt;Trino version 435, or a version compatible with your Gravitino Trino connector release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL or PostgreSQL for JDBC federation examples&lt;/li&gt;
&lt;li&gt;Hive Metastore for Iceberg catalog backend&lt;/li&gt;
&lt;li&gt;Object storage (S3/GCS/Azure) for cloud-based table storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Ensure your Gravitino server is configured to use &lt;code&gt;simple&lt;/code&gt; authentication mode. The Gravitino Trino connector currently connects as an anonymous user and does not propagate user authentication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How the Integration Works
&lt;/h3&gt;

&lt;p&gt;The Gravitino Trino Connector enables Trino to dynamically load catalogs from Gravitino:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The connector is configured as a Trino catalog named &lt;code&gt;gravitino&lt;/code&gt; via &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You create additional catalogs (like &lt;code&gt;iceberg_test&lt;/code&gt; and &lt;code&gt;mysql_test&lt;/code&gt;) through Gravitino stored procedures or REST APIs&lt;/li&gt;
&lt;li&gt;Trino automatically syncs Gravitino-managed catalogs every 10 seconds (configurable via &lt;code&gt;gravitino.metadata.refresh-interval-seconds&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;You query federated data using standard &lt;code&gt;catalog.schema.table&lt;/code&gt; naming&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Values Used in This Tutorial
&lt;/h3&gt;

&lt;p&gt;Replace these values with your environment settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino URI&lt;/strong&gt;: &lt;code&gt;http://gravitino-server:8090&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metalake&lt;/strong&gt;: &lt;code&gt;trino_metalake&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg HMS URI&lt;/strong&gt;: &lt;code&gt;thrift://hive-host:9083&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg warehouse&lt;/strong&gt;: &lt;code&gt;hdfs://namenode:9000/user/iceberg/warehouse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL JDBC URL&lt;/strong&gt;: &lt;code&gt;jdbc:mysql://mysql-host:3306?useSSL=false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL credentials&lt;/strong&gt;: &lt;code&gt;trino&lt;/code&gt; / &lt;code&gt;ds123&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Configure Gravitino Trino Connector
&lt;/h3&gt;

&lt;p&gt;The Gravitino Trino Connector must be installed on all Trino nodes (coordinator and workers).&lt;/p&gt;

&lt;h4&gt;
  
  
  Install the Connector Plugin
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Download the connector&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Download the Gravitino Trino connector from the &lt;a href="https://gravitino.apache.org/downloads" rel="noopener noreferrer"&gt;Apache Gravitino download page&lt;/a&gt; or &lt;a href="https://gravitino.apache.org/docs/1.1.0/how-to-build" rel="noopener noreferrer"&gt;build from source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install on all Trino nodes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Extract the connector and copy it to Trino's plugin directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the connector&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; gravitino-trino-connector-&amp;lt;version&amp;gt;.tar.gz

&lt;span class="c"&gt;# Copy to Trino plugin directory (on coordinator and all workers)&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; gravitino-trino-connector-&amp;lt;version&amp;gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TRINO_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/plugin/gravitino
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2c5bq294joim60l9iu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2c5bq294joim60l9iu3.png" alt="Gravitino Trino Connector Plugin Directory" width="732" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Enable Dynamic Catalog Management
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Configure Trino for dynamic catalogs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;${TRINO_HOME}/etc/config.properties&lt;/code&gt; on the coordinator node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;catalog.management&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;dynamic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Configure the Gravitino Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On each Trino node, create the Gravitino catalog configuration file, pointing to your Gravitino server and metalake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;connector.name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.uri&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://gravitino-server:8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.metalake&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;trino_metalake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The metalake specified in &lt;code&gt;gravitino.metalake&lt;/code&gt; must already exist in Gravitino. If not, create it via the Web UI or REST API:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"trino_metalake","comment":"Metalake for Trino federation","properties":{}}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://gravitino-server:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Restart Trino&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After creating the configuration file on each node, restart Trino to load the connector.&lt;/p&gt;
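&lt;p&gt;With the standard Trino tarball layout, a restart can be done with the bundled launcher script; adjust this for your service manager if Trino runs under systemd or in a container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run on the coordinator and every worker node
${TRINO_HOME}/bin/launcher restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;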

&lt;h4&gt;
  
  
  Verify Installation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Check that the &lt;code&gt;gravitino&lt;/code&gt; catalog is loaded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;gravitino&lt;/code&gt; in the catalog list, confirming successful installation.&lt;/p&gt;
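&lt;p&gt;If you are not already in a SQL session, the same check can be run non-interactively with the Trino CLI (the coordinator address below is an assumption; substitute your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trino --server http://trino-coordinator:8080 --execute "SHOW CATALOGS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;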

&lt;h3&gt;
  
  
  Step 2: Create Catalogs from Trino SQL
&lt;/h3&gt;

&lt;p&gt;Once the &lt;code&gt;gravitino&lt;/code&gt; catalog is configured, you can create additional catalogs using stored procedures in &lt;code&gt;gravitino.system&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create an Iceberg Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Example using Hive Metastore backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg_test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'lakehouse-iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'uri'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'catalog-backend'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'warehouse'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'thrift://hive-host:9083'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hdfs://namenode:9000/user/iceberg/warehouse'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: For S3 or other cloud storage, you may need to pass additional properties using the &lt;code&gt;trino.bypass.&lt;/code&gt; prefix for Trino-specific settings; see &lt;a href="https://gravitino.apache.org/docs/1.1.0/trino-connector/catalog-iceberg" rel="noopener noreferrer"&gt;Apache Gravitino Trino connector - Iceberg catalog&lt;/a&gt; for details.&lt;/p&gt;
&lt;/blockquote&gt;
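&lt;p&gt;For example, an S3-backed Iceberg catalog might pass storage credentials through with &lt;code&gt;trino.bypass.&lt;/code&gt;-prefixed properties. The property keys below are illustrative only; confirm the exact keys for your Trino version in the linked documentation before using them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CALL gravitino.system.create_catalog(
    'iceberg_s3',
    'lakehouse-iceberg',
    MAP(
        ARRAY['uri', 'catalog-backend', 'warehouse',
              'trino.bypass.hive.s3.aws-access-key', 'trino.bypass.hive.s3.aws-secret-key'],
        ARRAY['thrift://hive-host:9083', 'hive', 's3://my-bucket/iceberg/warehouse',
              '&amp;lt;access-key&amp;gt;', '&amp;lt;secret-key&amp;gt;']
    )
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;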

&lt;h4&gt;
  
  
  Create a MySQL Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Example JDBC catalog for MySQL&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'mysql_test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'jdbc-mysql'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'jdbc-url'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-password'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'jdbc-driver'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'jdbc:mysql://mysql-host:3306?useSSL=false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'trino'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ds123'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'com.mysql.cj.jdbc.Driver'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: To ignore "already exists" errors, use named arguments with &lt;code&gt;ignore_exist =&amp;gt; true&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
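&lt;p&gt;A sketch of the same MySQL catalog creation using named arguments (argument names follow the Gravitino stored-procedure signature; verify them against your connector release):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CALL gravitino.system.create_catalog(
    "catalog" =&amp;gt; 'mysql_test',
    "provider" =&amp;gt; 'jdbc-mysql',
    "properties" =&amp;gt; MAP(
        ARRAY['jdbc-url', 'jdbc-user', 'jdbc-password', 'jdbc-driver'],
        ARRAY['jdbc:mysql://mysql-host:3306?useSSL=false', 'trino', 'ds123', 'com.mysql.cj.jdbc.Driver']
    ),
    "ignore_exist" =&amp;gt; true
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;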

&lt;h4&gt;
  
  
  Verify Catalog Creation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Inspect Gravitino catalogs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   name       |     provider      | properties
--------------+-------------------+-------------------------------
 iceberg_test | lakehouse-iceberg | {...}
 mysql_test   | jdbc-mysql        | {...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Validate Catalog Discovery
&lt;/h3&gt;

&lt;p&gt;After creating catalogs in Gravitino, verify they are visible in Trino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Confirm Catalog Visibility
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Check available catalogs and schemas&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Trino syncs catalogs from Gravitino according to the configured refresh interval (10 seconds by default). If catalogs don't appear immediately, wait for the next refresh cycle and retry.&lt;/p&gt;
&lt;/blockquote&gt;
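&lt;p&gt;The refresh interval can be tuned in &lt;code&gt;etc/catalog/gravitino.properties&lt;/code&gt; alongside the settings from Step 1, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;# Sync catalogs from Gravitino every 5 seconds instead of the default 10
gravitino.metadata.refresh-interval-seconds=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;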

&lt;h3&gt;
  
  
  Step 4: Prepare Sample Data
&lt;/h3&gt;

&lt;p&gt;Create sample schemas and tables to demonstrate federation capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  Create MySQL Dimension Table
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Set up a users dimension table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create users table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify data&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create Iceberg Fact Table
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Set up an events fact table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create events table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'click'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 10:01:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify data&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Execute Federated Queries
&lt;/h3&gt;

&lt;p&gt;These examples demonstrate the core value of query federation: joining data across heterogeneous sources in a single query.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 1: Cross-Catalog JOIN
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Join dimension and fact tables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 2: Aggregation Across Catalogs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Count events by user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01 00:00:00'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 3: Semi-Join Filter
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Filter fact table by dimension membership&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 4: LEFT JOIN with Unmatched Rows
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Keep all events, even without matching users&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'unknown'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Understanding Federation Mechanics
&lt;/h3&gt;

&lt;p&gt;In federated queries, understanding where work happens is crucial for optimization:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How queries execute:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connector-level reads&lt;/strong&gt;: Each connector (Iceberg, MySQL) reads from its respective source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trino-level joins&lt;/strong&gt;: Trino combines results from multiple sources in memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pushdown optimization&lt;/strong&gt;: Some filters and predicates may be pushed to source systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query optimization tips:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Filter early&lt;/strong&gt;: Apply partition/time filters on large tables to reduce data scanned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Align join keys&lt;/strong&gt;: Use consistent data types across sources (e.g., &lt;code&gt;BIGINT&lt;/code&gt; for IDs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small dimension pattern&lt;/strong&gt;: Join large fact tables with small dimension tables for efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review query plans&lt;/strong&gt;: Use &lt;code&gt;EXPLAIN&lt;/code&gt; to understand execution strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analyze query execution&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_cnt&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Clean Up Resources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Remove sample data and catalogs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Drop tables&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Drop schemas&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;mysql_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;iceberg_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Drop catalogs&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mysql_test'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iceberg_test'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connector installation issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalog not found&lt;/strong&gt;: Ensure the Gravitino connector plugin is installed on all Trino nodes and &lt;code&gt;gravitino.properties&lt;/code&gt; exists in &lt;code&gt;etc/catalog/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic catalog not working&lt;/strong&gt;: Verify &lt;code&gt;catalog.management=dynamic&lt;/code&gt; is set in &lt;code&gt;etc/config.properties&lt;/code&gt; on the coordinator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cannot connect to Gravitino&lt;/strong&gt;: Check that Gravitino server is running and &lt;code&gt;gravitino.uri&lt;/code&gt; is correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metalake not found&lt;/strong&gt;: Ensure the metalake specified in &lt;code&gt;gravitino.metalake&lt;/code&gt; exists in Gravitino&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Catalog sync issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catalogs not appearing&lt;/strong&gt;: Wait for the sync interval (default 10 seconds) or adjust &lt;code&gt;gravitino.metadata.refresh-interval-seconds&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale catalog information&lt;/strong&gt;: Restart Trino or wait for the next sync cycle&lt;/li&gt;
&lt;/ul&gt;
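The sync interval is a connector property, set in the Trino catalog properties file. As a rough sketch of what `etc/catalog/gravitino.properties` might look like with a shorter interval (the `connector.name=gravitino` line and the URI/metalake values are assumptions; match them to your deployment):

```properties
connector.name=gravitino
gravitino.uri=http://localhost:8090
gravitino.metalake=default_metalake
# Shorten the catalog sync interval (default 10 seconds)
gravitino.metadata.refresh-interval-seconds=5
```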

&lt;p&gt;&lt;strong&gt;Query execution issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Verify the fully qualified table name format: &lt;code&gt;catalog.schema.table&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission denied (Gravitino metadata)&lt;/strong&gt;: Verify that the Trino identity (or the mapped Gravitino user, or the &lt;code&gt;anonymous&lt;/code&gt; user if the connector is configured to run anonymously) has the required catalog/schema/table privileges in Gravitino for the objects being queried&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission denied (underlying data source)&lt;/strong&gt;: If Gravitino privileges are correct but the error persists, check the credentials and permissions for the underlying data sources (for example MySQL user/password, Hive/HDFS/S3 ACLs) configured in the corresponding Gravitino catalog&lt;/li&gt;
&lt;/ul&gt;
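When debugging connection and metalake issues, it can help to query the Gravitino REST API directly, outside of Trino. A minimal sketch using only the Python standard library (the `/api/metalakes/&lt;name&gt;` endpoint matches the REST calls used elsewhere in this series; the default URI and metalake name are assumptions to adjust):

```python
import json
from urllib.request import urlopen


def metalake_endpoint(uri: str, metalake: str) -> str:
    """Build the REST URL that returns a metalake's definition."""
    return f"{uri.rstrip('/')}/api/metalakes/{metalake}"


def check_metalake(uri: str = "http://localhost:8090",
                   metalake: str = "default_metalake") -> dict:
    """Fetch the metalake; an HTTP 404 here is the same condition Trino
    reports as 'Metalake not found'."""
    with urlopen(metalake_endpoint(uri, metalake)) as resp:
        return json.load(resp)
```

If `check_metalake()` raises a connection error, fix `gravitino.uri` first; if it returns 404, create the metalake before retrying the Trino queries.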

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Trino query federation tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Trino environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Trino Connector for automatic catalog discovery&lt;/li&gt;
&lt;li&gt;Multiple registered catalogs (Iceberg and MySQL) accessible from Trino&lt;/li&gt;
&lt;li&gt;Working federated queries that join data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;Understanding of query optimization patterns and federation mechanics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Trino environment is now ready to leverage Gravitino for unified metadata management and cross-system query federation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/1.1.0/trino-connector/index" rel="noopener noreferrer"&gt;Gravitino Trino Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://trino.io/docs/current/overview/concepts.html" rel="noopener noreferrer"&gt;Trino Query Federation&lt;/a&gt; for deeper understanding&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://trino.io/docs/current/admin/tuning.html" rel="noopener noreferrer"&gt;Trino Performance Tuning&lt;/a&gt; for optimization strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="//../07-flink-streaming/README.md"&gt;Using Gravitino with Flink&lt;/a&gt; for streaming processing&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is based on version 1.1.0, the latest at the time of writing. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gravitino101</category>
      <category>trino</category>
      <category>metadata</category>
      <category>lakehouse</category>
    </item>
    <item>
      <title>Using Gravitino with Apache Spark for ETL</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Thu, 05 Feb 2026 23:24:45 +0000</pubDate>
      <link>https://dev.to/gravitino/using-gravitino-with-apache-spark-for-etl-538c</link>
      <guid>https://dev.to/gravitino/using-gravitino-with-apache-spark-for-etl-538c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73miqycef1bsajepzgwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73miqycef1bsajepzgwy.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/minghuang-li?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAAFa5V_cBwwdNNbf1bI0scSPywRS9WQBQ1Yk&amp;amp;lipi=urn%3Ali%3Apage%3Ad_flagship3_search_srp_all%3BDWCB9kUFQ7yeAZ7UUVFT3A%3D%3D" rel="noopener noreferrer"&gt;Minghuang Li&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-31&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to use Apache Gravitino with Apache Spark for ETL (Extract, Transform, Load) operations. By the end of this guide, you'll be able to build data pipelines that seamlessly access multiple heterogeneous data sources through a unified catalog interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configure Gravitino Spark Connector&lt;/strong&gt; to enable unified access to multiple data sources in Spark&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register multiple catalogs&lt;/strong&gt; including MySQL and Iceberg in Gravitino for federated access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build an ETL pipeline&lt;/strong&gt; that extracts data from MySQL, transforms it, and loads it into Iceberg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute federated queries&lt;/strong&gt; across different data sources using Spark SQL and PySpark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Spark is one of the most popular unified analytics engines for large-scale data processing. In a typical ETL pipeline, Spark often needs to interact with multiple heterogeneous data sources (like MySQL, HDFS, S3, Hive, Iceberg). Managing connectivity, credentials, and schema information for these diverse sources can be complex and error-prone.&lt;/p&gt;

&lt;p&gt;Apache Gravitino simplifies this by acting as a unified metadata lake. By using the Gravitino Spark Connector, you can access multiple data sources through a single catalog interface in Spark, without having to manually configure each source's connection details in your Spark jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified catalog&lt;/strong&gt;: Access Hive, Iceberg, MySQL, PostgreSQL, and other sources under a unified namespace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized metadata&lt;/strong&gt;: Metadata is managed centrally in Gravitino, and changes are reflected immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified configuration&lt;/strong&gt;: Configure the Gravitino connector once, and access all managed catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated querying&lt;/strong&gt;: Easily join data across different sources (e.g., join MySQL data with an Iceberg table)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;li&gt;Apache Spark 3.3, 3.4, or 3.5 installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and running (see &lt;a href="//../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;MySQL instance for testing JDBC catalog functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDFS or S3 for Iceberg data storage in production environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java and Spark installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/spark-submit &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiylgtvw69d7ojmyu95nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiylgtvw69d7ojmyu95nu.png" alt="Gravitino Spark Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Download Gravitino Spark Connector
&lt;/h3&gt;

&lt;p&gt;You need the Gravitino Spark Connector jar file to enable Spark integration with Gravitino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Obtain the connector
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Download from Maven Central Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For Spark 3.5, download the connector from:&lt;br&gt;
&lt;a href="https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-spark-connector-runtime-3.5" rel="noopener noreferrer"&gt;&lt;code&gt;gravitino-spark-connector-runtime-3.5&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For JDBC sources (MySQL, PostgreSQL), you also need the specific JDBC driver jar (e.g., &lt;code&gt;mysql-connector-j&lt;/code&gt; for MySQL) in your classpath.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Configure Spark Session
&lt;/h3&gt;

&lt;p&gt;To use Gravitino with Spark, you need to configure the Gravitino Spark plugin (&lt;code&gt;GravitinoSparkPlugin&lt;/code&gt;) in your Spark session.&lt;/p&gt;
&lt;h4&gt;
  
  
  Configure Spark SQL with Gravitino
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Start Spark SQL with the Gravitino connector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set the location of your Gravitino server&lt;/span&gt;
&lt;span class="nv"&gt;GRAVITINO_URI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8090"&lt;/span&gt;
&lt;span class="c"&gt;# The metalake you want to access&lt;/span&gt;
&lt;span class="nv"&gt;METALAKE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"default_metalake"&lt;/span&gt;

spark-sql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--packages&lt;/span&gt; org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.plugins&lt;span class="o"&gt;=&lt;/span&gt;org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.metalake&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$METALAKE_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.uri&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GRAVITINO_URI&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.sql.gravitino.enableIcebergSupport&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;1.1.0&lt;/code&gt; with the actual version you are using&lt;/li&gt;
&lt;li&gt;Ensure the Spark connector version matches your Spark version&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spark.sql.gravitino.enableIcebergSupport=true&lt;/code&gt; to enable Iceberg catalog support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Prepare Metadata in Gravitino
&lt;/h3&gt;

&lt;p&gt;Before running ETL jobs, you need to register the catalogs for your data sources in Gravitino. You can do this via the Gravitino REST API or Web UI.&lt;/p&gt;

&lt;h4&gt;
  
  
  Register MySQL Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create a MySQL catalog in Gravitino&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "mysql_catalog",
  "type": "relational",
  "provider": "jdbc-mysql",
  "properties": {
    "jdbc-url": "jdbc:mysql://localhost:3306",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/default_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Register Iceberg Catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Create an Iceberg catalog in Gravitino&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "name": "iceberg_catalog",
  "type": "relational",
  "provider": "lakehouse-iceberg",
  "properties": {
    "warehouse": "file:///tmp/iceberg-warehouse",
    "catalog-backend": "jdbc",
    "uri": "jdbc:mysql://localhost:3306/iceberg_metadata",
    "jdbc-driver": "com.mysql.cj.jdbc.Driver",
    "jdbc-user": "root",
    "jdbc-password": "password",
    "jdbc-initialize": "true"
  }
}'&lt;/span&gt; http://localhost:8090/api/metalakes/default_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This example uses a local file system for Iceberg data storage. For production environments, consider using HDFS or S3. For more detailed Iceberg catalog configuration options, see &lt;a href="//../03-iceberg-catalog/README.md"&gt;&lt;code&gt;03-iceberg-catalog/README.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
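If you would rather script these registrations than run raw `curl`, the same POSTs can be issued from Python. A sketch using only the standard library, with the endpoint and placeholder values taken from the curl examples above (nothing here is Gravitino-specific client code, just plain HTTP):

```python
import json
from urllib.request import urlopen, Request

GRAVITINO = "http://localhost:8090"      # Gravitino server, as above
METALAKE = "default_metalake"            # target metalake, as above


def catalog_payload(name: str, provider: str, properties: dict) -> bytes:
    """Serialize a catalog-creation request body (mirrors the curl -d payloads)."""
    return json.dumps({
        "name": name,
        "type": "relational",
        "provider": provider,
        "properties": properties,
    }).encode("utf-8")


def create_catalog(name: str, provider: str, properties: dict) -> dict:
    """POST the catalog definition to Gravitino and return the parsed response."""
    url = f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs"
    req = Request(url, data=catalog_payload(name, provider, properties),
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req) as resp:
        return json.load(resp)


# Example: the MySQL catalog from the curl command above
# create_catalog("mysql_catalog", "jdbc-mysql", {
#     "jdbc-url": "jdbc:mysql://localhost:3306",
#     "jdbc-user": "root",
#     "jdbc-password": "password",
#     "jdbc-driver": "com.mysql.cj.jdbc.Driver",
# })
```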

&lt;h3&gt;
  
  
  Step 4: Build an ETL Pipeline from MySQL to Iceberg
&lt;/h3&gt;

&lt;p&gt;In this scenario, we will extract user data from a MySQL database, perform some transformations, and load it into an Apache Iceberg table for analytical queries, all managed through Gravitino.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify Catalogs in Spark
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Start your Spark SQL session&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the configuration from Step 2 to start your Spark SQL session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Verify catalog visibility&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Due to Spark catalog manager limitations, SHOW CATALOGS only displays 'spark_catalog' initially&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Use a Gravitino-managed catalog to make it visible&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Now both catalogs are visible in the output&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;CATALOGS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;SHOW CATALOGS&lt;/code&gt; command initially only displays the Spark default catalog (&lt;code&gt;spark_catalog&lt;/code&gt;). After explicitly using a Gravitino-managed catalog with the &lt;code&gt;USE&lt;/code&gt; command, that catalog becomes visible in subsequent &lt;code&gt;SHOW CATALOGS&lt;/code&gt; output.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Prepare Sample Data in MySQL
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Create a sample database and table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Switch to MySQL catalog&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a sample database&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a users table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Insert sample data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Insert sample data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-20 14:30:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Charlie'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'charlie@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inactive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-10 09:15:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Diana'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'diana@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-04-05 16:45:00'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Eve'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'eve@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inactive'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2024-05-12 11:20:00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify the data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Extract Data from MySQL
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Verify data extraction&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Read data from MySQL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Transform and Load Data to Iceberg
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Create an Iceberg table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Switch to Iceberg catalog&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Execute ETL query&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ETL Query: Insert into Iceberg from MySQL with transformation&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;created_at&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mysql_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: For JDBC catalogs (like MySQL), the &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;TRUNCATE&lt;/code&gt; operations are NOT supported; only &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Verify ETL Results
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Query the target Iceberg table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  PySpark Example
&lt;/h2&gt;

&lt;p&gt;If you prefer using Python, the same pipeline can be expressed with the PySpark DataFrame API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure PySpark Session
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Create a PySpark session with Gravitino connector&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GravitinoSparkETL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.jars.packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.plugins&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.gravitino.enableIcebergSupport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
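&lt;p&gt;The &lt;code&gt;spark.jars.packages&lt;/code&gt; coordinates must match your Spark and Scala versions. As a sketch (the helper name is hypothetical, not part of any API), the coordinate string used above can be assembled like this:&lt;/p&gt;

```python
# Hypothetical helper: build the spark.jars.packages string used in the
# session config above from the Spark/Scala/Gravitino versions.
def connector_packages(spark_version="3.5", scala_version="2.12", gravitino_version="1.1.0"):
    gravitino = (
        "org.apache.gravitino:gravitino-spark-connector-runtime-"
        f"{spark_version}_{scala_version}:{gravitino_version}"
    )
    mysql = "mysql:mysql-connector-java:8.0.33"
    iceberg = f"org.apache.iceberg:iceberg-spark-runtime-{spark_version}_{scala_version}:1.10.1"
    # Comma-separated, as expected by spark.jars.packages
    return ",".join([gravitino, mysql, iceberg])
```

&lt;p&gt;A mismatched Spark/Scala suffix in these coordinates is a common cause of the &lt;code&gt;ClassNotFoundException&lt;/code&gt; covered in Troubleshooting.&lt;/p&gt;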



&lt;h3&gt;
  
  
  Execute ETL Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Read, transform, and write data using DataFrame API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from MySQL
&lt;/span&gt;&lt;span class="n"&gt;mysql_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mysql_catalog.users_db.users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform
&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mysql_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectExpr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id as user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lower(username) as username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lower(email) as email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Iceberg
&lt;/span&gt;&lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg_catalog.analytics.active_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ETL Job Completed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connector and classpath issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassNotFoundException: org.apache.gravitino.spark.connector.GravitinoCatalog&lt;/strong&gt;: The Gravitino Spark Connector JAR is missing from the classpath. Ensure you added the correct package with &lt;code&gt;--packages&lt;/code&gt; or placed the JAR in &lt;code&gt;$SPARK_HOME/jars&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing JDBC Driver&lt;/strong&gt;: When connecting to JDBC sources (MySQL/PostgreSQL) via Gravitino, Spark still needs the JDBC driver JARs in its classpath. Add the MySQL/PostgreSQL JDBC driver packages to your Spark startup command (e.g., &lt;code&gt;--packages mysql:mysql-connector-java:8.0.33&lt;/code&gt;) or place the JAR in the &lt;code&gt;jars/&lt;/code&gt; folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection refused to Gravitino Server&lt;/strong&gt;: Spark cannot reach the Gravitino server. Check that the Gravitino server is running and that the &lt;code&gt;spark.sql.gravitino.uri&lt;/code&gt; setting is correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog not found&lt;/strong&gt;: Ensure the catalogs are properly registered in Gravitino and the metalake name is correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query execution issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UPDATE/DELETE not supported on JDBC catalogs&lt;/strong&gt;: For JDBC catalogs (like MySQL), only &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; operations are supported through Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table not found&lt;/strong&gt;: Verify the fully qualified table name format: &lt;code&gt;catalog.schema.table&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
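&lt;p&gt;As a quick sanity check for the last point, a small helper (hypothetical, for illustration only) can validate the &lt;code&gt;catalog.schema.table&lt;/code&gt; format before a query is submitted:&lt;/p&gt;

```python
# Hypothetical check: a fully qualified name must have exactly three
# non-empty, dot-separated parts: catalog.schema.table
def is_fully_qualified(name):
    parts = name.split(".")
    return len(parts) == 3 and all(parts)
```

&lt;p&gt;For example, &lt;code&gt;is_fully_qualified("mysql_catalog.users_db.users")&lt;/code&gt; is true, while a bare table name or a name with an empty segment is rejected.&lt;/p&gt;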

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Spark ETL tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Spark environment with Gravitino integration, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Gravitino Spark Connector for unified catalog access&lt;/li&gt;
&lt;li&gt;Multiple registered catalogs (MySQL and Iceberg) in Gravitino&lt;/li&gt;
&lt;li&gt;A working ETL pipeline that extracts, transforms, and loads data across heterogeneous sources&lt;/li&gt;
&lt;li&gt;An understanding of federated query capabilities and PySpark integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Spark environment is now ready to leverage Gravitino for unified metadata management across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://gravitino.apache.org/docs/1.1.0/spark-connector/spark-catalog-iceberg" rel="noopener noreferrer"&gt;Gravitino Spark Connector Documentation&lt;/a&gt; for advanced configuration options&lt;/li&gt;
&lt;li&gt;Learn about &lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html" rel="noopener noreferrer"&gt;Spark SQL Guide&lt;/a&gt; for more query patterns&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://iceberg.apache.org/docs/latest/spark-configuration/" rel="noopener noreferrer"&gt;Apache Iceberg Spark Integration&lt;/a&gt; for Iceberg-specific features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Explore &lt;a href="../06-trino-query/README.md"&gt;Using Gravitino with Trino&lt;/a&gt; for federated querying&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is evolving rapidly, and this article is based on the latest version, 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gravitino101</category>
      <category>apachespark</category>
      <category>etl</category>
      <category>metadata</category>
    </item>
    <item>
      <title>Configuring Gravitino Lance REST Service</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Sat, 31 Jan 2026 07:37:34 +0000</pubDate>
      <link>https://dev.to/gravitino/configuring-gravitino-lance-rest-service-3e2b</link>
      <guid>https://dev.to/gravitino/configuring-gravitino-lance-rest-service-3e2b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsdc9lp4p1gl07izb8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsdc9lp4p1gl07izb8z.png" alt=" " width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/yuqi1129/" rel="noopener noreferrer"&gt;Qi Yu&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-23&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to configure and use the Gravitino Lance REST service. By the end of this guide, you'll have a fully functional Lance REST service that enables Lance clients to interact with Gravitino through HTTP APIs.&lt;/p&gt;

&lt;p&gt;The Gravitino Lance REST service provides a RESTful interface for managing Lance datasets, implementing the standard Lance REST API. It acts as a centralized catalog service that allows Lance clients (like Spark and Ray) to discover and access Lance datasets managed by Gravitino.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key concepts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lance REST catalog&lt;/strong&gt;: A standard HTTP API for Lance dataset operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravitino Lance REST service&lt;/strong&gt;: Implements the Lance REST API and integrates with Gravitino's metadata system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata&lt;/strong&gt;: Stores Lance dataset metadata in Gravitino, enabling centralized governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The REST endpoint base path is &lt;code&gt;http://&amp;lt;host&amp;gt;:&amp;lt;port&amp;gt;/lance/&lt;/code&gt;.&lt;/p&gt;
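&lt;p&gt;For example, a small helper (hypothetical, not part of any client library) can compose endpoint URLs from this base path:&lt;/p&gt;

```python
# Hypothetical helper: compose Lance REST endpoint URLs from the host,
# port, and the /lance/ base path described above.
def lance_endpoint(host, port, path):
    path = path.lstrip("/")
    return f"http://{host}:{port}/lance/{path}"

# e.g. the namespace listing endpoint used later in this tutorial:
list_url = lance_endpoint("localhost", 9101, "v1/namespace/$/list")
```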

&lt;p&gt;&lt;strong&gt;Architecture overview:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe89gymgd8d81nsq505s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe89gymgd8d81nsq505s.png" alt="gravitino-lance-rest-architecture" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;Python environment (3.10+) for running PySpark or Ray clients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Required Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitino server installed and configured (see &lt;a href="../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark with Lance runtime JARs for client verification (recommended for testing)&lt;/li&gt;
&lt;li&gt;Ray framework for distributed Lance data processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Python installation and install required packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;pyspark&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.5.0 lance-ray&lt;span class="o"&gt;==&lt;/span&gt;0.1.0 lance-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Start a Gravitino server with Lance REST service
&lt;/h3&gt;

&lt;p&gt;Use this approach if you want the Lance REST service embedded in a full Gravitino server (with Web UI, unified REST APIs, etc.).&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Lance REST as auxiliary service
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Install Gravitino server distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Follow the previous tutorial &lt;a href="../02-setup-guide/README.md"&gt;&lt;code&gt;02-setup-guide/README.md&lt;/code&gt;&lt;/a&gt; to download or build the Gravitino server package.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Enable Lance REST as an auxiliary service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modify &lt;code&gt;conf/gravitino.conf&lt;/code&gt; to enable the &lt;code&gt;lance-rest&lt;/code&gt; service and configure it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Lance REST service
&lt;/span&gt;&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;lance-rest&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;9101&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.namespace-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.gravitino-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.lance-rest.gravitino-metalake&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;lance_metalake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The &lt;code&gt;lance_metalake&lt;/code&gt; metalake must exist in Gravitino before you access the Lance REST service. If it doesn't, you can create it via the Gravitino REST API or Web UI after starting the server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. Start the Gravitino server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Create the Metalake (if not exists)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name":"lance_metalake","comment":"comment"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Check server logs (optional)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/gravitino-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Verify the Lance REST endpoint and create a catalog namespace
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Test the service endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can verify that the service is running with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; GET http://localhost:9101/lance/v1/namespace/&lt;span class="nv"&gt;$/&lt;/span&gt;list &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, you should see a JSON response with namespace information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a catalog namespace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a catalog namespace (e.g., &lt;code&gt;lance_catalog&lt;/code&gt;) that will hold your Lance schemas and tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:9101/lance/v1/namespace/lance_catalog/create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "id": ["lance_catalog"],
    "mode": "exist_ok"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, it returns the namespace information.&lt;/p&gt;
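&lt;p&gt;The same request can be built from Python using only the standard library; the function below is a hypothetical sketch that mirrors the &lt;code&gt;curl&lt;/code&gt; call above (it constructs the request but does not send it):&lt;/p&gt;

```python
import json
import urllib.request

# Build (but do not send) the same create-namespace POST request as the
# curl example above; pass the resulting Request to urllib.request.urlopen
# against a running Lance REST service.
def create_namespace_request(base_uri, name, mode="exist_ok"):
    body = json.dumps({"id": [name], "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{base_uri}/v1/namespace/{name}/create",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_namespace_request("http://localhost:9101/lance", "lance_catalog")
```

&lt;p&gt;Send it with &lt;code&gt;urllib.request.urlopen(req)&lt;/code&gt; once the service is up.&lt;/p&gt;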

&lt;h3&gt;
  
  
  Step 3: Connect with Spark
&lt;/h3&gt;

&lt;p&gt;Configure your PySpark session to use the Lance REST catalog.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Spark with Lance REST catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install pyspark: &lt;code&gt;pip install pyspark==3.5.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Download the &lt;code&gt;lance-spark&lt;/code&gt; bundle jar matching your Spark version (e.g., &lt;code&gt;lance-spark-bundle-3.5_2.12-0.0.15.jar&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execute sample operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run the following Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set path to your lance-spark bundle
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYSPARK_SUBMIT_ARGS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--conf &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--conf &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--master local[1] pyspark-shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_rest_demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.lancedb.lance.spark.LanceNamespaceSparkCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.impl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9101/lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.lance.parent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.defaultCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a schema and table
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE DATABASE IF NOT EXISTS demo_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE demo_schema.test_table (id INT, value STRING)
    USING lance
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/lance_catalog/demo_schema/test_table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert and query data
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO demo_schema.test_table VALUES (1, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM demo_schema.test_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Connect with Ray
&lt;/h3&gt;

&lt;p&gt;You can also access the data created by Spark from Ray via the Lance Ray integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Configure Ray with Lance REST catalog
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install required packages: &lt;code&gt;pip install lance-ray==0.1.0 lance-namespace&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execute sample operations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lance_namespace&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lance_ray&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;read_lance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_lance&lt;/span&gt;

&lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to Lance REST
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ln&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:9101/lance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Read the table created by Spark
# Note: Table ID is [catalog, schema, table]
&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_lance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lance_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Row count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Perform filtering operation
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Filtered row count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Common issues and their solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service connectivity issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service fails to start&lt;/strong&gt;: Check &lt;code&gt;logs/gravitino-server.log&lt;/code&gt; for startup errors and configuration issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection refused&lt;/strong&gt;: Verify &lt;code&gt;gravitino.lance-rest.httpPort&lt;/code&gt; (default 9101) is open and accessible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;curl&lt;/code&gt; returns 404&lt;/strong&gt;: Confirm the Lance REST base path is &lt;code&gt;/lance&lt;/code&gt; and the port matches configuration&lt;/li&gt;
&lt;/ul&gt;
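&lt;p&gt;To tell these failure modes apart quickly, a minimal probe using only the Python standard library can help. The URL below is an assumption built from the default port 9101 and the &lt;code&gt;/lance&lt;/code&gt; base path described above; adjust it to your configuration:&lt;/p&gt;

```python
import urllib.request
import urllib.error

def probe(url: str) -> str:
    """Return a short status string for an HTTP endpoint probe."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as e:
        # A 404 here usually means the base path is wrong
        return f"HTTP {e.code}"
    except (urllib.error.URLError, OSError) as e:
        # "Connection refused" usually means the port is wrong or the service is down
        return f"unreachable: {e.reason if hasattr(e, 'reason') else e}"

# Assumed default endpoint; adjust host, port, and base path to your setup
print(probe("http://localhost:9101/lance"))
```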

&lt;p&gt;&lt;strong&gt;Client connection issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark ClassNotFoundException&lt;/strong&gt;: Ensure the &lt;code&gt;lance-spark-bundle&lt;/code&gt; jar is correctly referenced in &lt;code&gt;PYSPARK_SUBMIT_ARGS&lt;/code&gt; or &lt;code&gt;--jars&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace not found&lt;/strong&gt;: Remember to create the parent catalog namespace (e.g., &lt;code&gt;lance_catalog&lt;/code&gt;) before creating schemas or tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ray connection errors&lt;/strong&gt;: Verify &lt;code&gt;lance-ray&lt;/code&gt; and &lt;code&gt;lance-namespace&lt;/code&gt; packages are installed and the REST endpoint is accessible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configuration issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metalake not found&lt;/strong&gt;: Ensure the metalake specified in &lt;code&gt;gravitino.lance-rest.gravitino-metalake&lt;/code&gt; exists in Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission errors&lt;/strong&gt;: Check that the Gravitino server has proper access to the configured storage locations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Gravitino Lance REST service configuration tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Lance REST service with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured Lance REST endpoint running on port 9101&lt;/li&gt;
&lt;li&gt;A catalog namespace configured for organizing Lance datasets&lt;/li&gt;
&lt;li&gt;Verified client connectivity through Apache Spark and Ray&lt;/li&gt;
&lt;li&gt;Understanding of Lance dataset operations across different compute engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Gravitino Lance REST service is ready to serve Lance clients across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;For more advanced configurations and detailed documentation:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Check the &lt;a href="https://github.com/apache/gravitino/blob/main/docs/lance-rest-integration.md" rel="noopener noreferrer"&gt;Lance REST Integration Guide&lt;/a&gt; for compatibility matrices and advanced configuration&lt;/li&gt;
&lt;li&gt;Learn more about &lt;a href="https://lance.org/" rel="noopener noreferrer"&gt;Lance format&lt;/a&gt; and its capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading &lt;a href="../05-spark-etl/README.md"&gt;Spark ETL&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>dataengineering</category>
      <category>lance</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 20 Jan 2026 19:28:31 +0000</pubDate>
      <link>https://dev.to/gravitino/unified-data-access-with-daft-and-apache-gravitino-simplifying-multi-cloud-data-management-47nn</link>
      <guid>https://dev.to/gravitino/unified-data-access-with-daft-and-apache-gravitino-simplifying-multi-cloud-data-management-47nn</guid>
      <description>&lt;h1&gt;
  
  
  Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management
&lt;/h1&gt;

&lt;p&gt;The modern data landscape is increasingly distributed across multiple cloud providers and storage systems. Organizations often find themselves managing data across AWS S3, Google Cloud Storage, Azure Blob Storage, and on-premises systems, each with its own access patterns, credentials, and metadata management challenges. This fragmentation complicates data discovery and access control and adds operational overhead.&lt;/p&gt;

&lt;p&gt;Today, we're excited to introduce the integration between &lt;strong&gt;Daft&lt;/strong&gt; and &lt;strong&gt;Apache Gravitino&lt;/strong&gt;, bringing unified catalog management and seamless multi-cloud data access to the Daft ecosystem. This integration focuses on &lt;strong&gt;fileset catalog support&lt;/strong&gt;, enabling you to access distributed datasets through a single, unified interface while maintaining security and performance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This integration is available in Daft v0.7.2 and later versions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Apache Gravitino?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt; is an open-source data catalog that provides unified metadata management for various data sources and storage systems. It acts as a central hub for organizing and accessing data across different platforms, offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata Management&lt;/strong&gt;: Single source of truth for data across multiple storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Cloud Support&lt;/strong&gt;: Native integration with AWS S3 and local storage, with more cloud providers coming soon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Integration&lt;/strong&gt;: Centralized credential management and access control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog Abstraction&lt;/strong&gt;: Support for both table catalogs (Iceberg, Hudi, Hive, JDBC) and fileset catalogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmebr52mj4ltyy615n94d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmebr52mj4ltyy615n94d.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Daft?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.daft.ai" rel="noopener noreferrer"&gt;Daft&lt;/a&gt; is a distributed query engine built for the Python ecosystem, designed to handle large-scale data processing with ease and efficiency. Daft brings the power of distributed computing to data scientists and engineers through a familiar DataFrame API, offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Processing&lt;/strong&gt;: Scale computations across multiple cores and machines seamlessly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Optimize query execution through intelligent query planning and predicate pushdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Support&lt;/strong&gt;: Native support for Parquet, JSON, CSV, Images, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-Native&lt;/strong&gt;: Built-in integrations with AWS S3, Google Cloud Storage, Azure Blob Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python-First&lt;/strong&gt;: Intuitive DataFrame API that feels natural to Python developers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimized&lt;/strong&gt;: Rust-powered execution engine for maximum performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional big data tools that require complex cluster management, Daft provides a simple &lt;code&gt;pip install&lt;/code&gt; experience while delivering enterprise-grade performance. Whether you're processing terabytes of data locally or across cloud infrastructure, Daft's intelligent execution engine automatically optimizes your workloads for speed and efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figmwonflwj1lx3mpi0us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figmwonflwj1lx3mpi0us.png" alt="Daft Architecture" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of Fileset Catalogs
&lt;/h2&gt;

&lt;p&gt;While table catalogs manage structured data with schemas, &lt;strong&gt;fileset catalogs&lt;/strong&gt; provide a flexible way to organize and access collections of files across different storage systems. This is particularly valuable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Lakes&lt;/strong&gt;: Managing raw data files, logs, and unstructured datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Data&lt;/strong&gt;: Handling Parquet, JSON, CSV, and other file formats in a unified way&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Storage&lt;/strong&gt;: Accessing data across different storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Datasets&lt;/strong&gt;: Working with datasets that don't fit traditional table structures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing GVFS
&lt;/h2&gt;

&lt;p&gt;The Daft + Gravitino integration introduces a new URL scheme: &lt;code&gt;gvfs://&lt;/code&gt; (Gravitino Virtual File System). This provides a unified way to access files managed by Gravitino filesets, regardless of their underlying storage location (S3, ADLS, GCS, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  URL Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gvfs://fileset/catalog/schema/fileset/path/to/file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;catalog&lt;/code&gt;: The Gravitino catalog name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema&lt;/code&gt;: The schema within the catalog&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fileset&lt;/code&gt;: The specific fileset name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;path/to/file&lt;/code&gt;: The file path within the fileset (optional)&lt;/li&gt;
&lt;/ul&gt;
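&lt;p&gt;To make the mapping concrete, here is a small standard-library sketch, not part of Daft's API, that splits a &lt;code&gt;gvfs://&lt;/code&gt; URL into those four components:&lt;/p&gt;

```python
from urllib.parse import urlparse

def parse_gvfs_url(url: str) -> dict:
    """Split a gvfs:// fileset URL into catalog, schema, fileset, and file path.

    Illustrative helper only; Daft resolves these URLs internally.
    """
    parsed = urlparse(url)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a gvfs fileset URL: {url}")
    parts = parsed.path.lstrip("/").split("/", 3)
    if len(parts) >= 3:
        catalog, schema, fileset = parts[:3]
    else:
        raise ValueError(f"expected catalog/schema/fileset in: {url}")
    path = parts[3] if len(parts) >= 4 else ""
    return {"catalog": catalog, "schema": schema, "fileset": fileset, "path": path}

print(parse_gvfs_url(
    "gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet"
))
# → {'catalog': 's3_catalog', 'schema': 'analytics', 'fileset': 'user_events', 'path': '2024/01/events.parquet'}
```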

&lt;h3&gt;
  
  
  Example URLs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Access a specific file
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Access all files in a fileset
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/ml_data/training_set/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Access partitioned data
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/year=2024/month=01/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;The Daft + Gravitino integration requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: 3.10 or later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pip&lt;/strong&gt;: 21.0 or later (recommended: latest version)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daft&lt;/strong&gt;: v0.7.2 or later&lt;/li&gt;
&lt;/ul&gt;
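&lt;p&gt;If you prefer to verify these prerequisites from code, a small standard-library check along these lines works (a naive sketch: pre-release version suffixes are ignored):&lt;/p&gt;

```python
import sys
from importlib.metadata import PackageNotFoundError, version

def version_at_least(installed: str, minimum: str) -> bool:
    """Naive dotted-integer version comparison; pre-release suffixes are dropped."""
    key = lambda v: [int(p) for p in v.split(".") if p.isdigit()]
    return key(installed) >= key(minimum)

def meets_minimum(package: str, minimum: str) -> bool:
    """True if `package` is installed at or above `minimum`."""
    try:
        return version_at_least(version(package), minimum)
    except PackageNotFoundError:
        return False

print("Python 3.10+:", sys.version_info >= (3, 10))
print("daft 0.7.2+:", meets_minimum("daft", "0.7.2"))
```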

&lt;p&gt;Make sure you have the correct versions installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation and Setup
&lt;/h3&gt;

&lt;p&gt;First, ensure you have a running Apache Gravitino server, then create a fileset catalog, a schema, and at least one fileset entity whose storage location points to S3. Refer to Gravitino's online documentation for details.&lt;/p&gt;

&lt;p&gt;Second, ensure you have installed Daft v0.7.2 or later with support for Gravitino:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"daft&amp;gt;=0.7.2"&lt;/span&gt; requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Basic Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GravitinoConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Configure Gravitino connection
&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create IOConfig with Gravitino settings
&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading Data from Filesets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read a specific file from a Gravitino fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/events.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read all Parquet files in a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List files in a fileset
&lt;/span&gt;&lt;span class="n"&gt;files_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_glob_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Usage Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Working with Multiple File Formats
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read JSON files from a fileset
&lt;/span&gt;&lt;span class="n"&gt;json_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/**/*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read CSV files with custom options
&lt;/span&gt;&lt;span class="n"&gt;csv_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/exports/daily_reports/**/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing Data to Filesets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write Parquet files to a Gravitino fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;view&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-02&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/processed_events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write CSV files to a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/exports/daily_reports/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write JSON files to a fileset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/logs/application/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Programmatic Fileset Discovery
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.gravitino&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GravitinoClient&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Gravitino client
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Discover available catalogs and filesets
&lt;/span&gt;&lt;span class="n"&gt;catalogs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_catalogs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available catalogs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalogs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load fileset metadata
&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_fileset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3_catalog.analytics.user_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Storage location: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileset_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Properties: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fileset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fileset_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the fileset with Daft
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_io_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security and Credential Management
&lt;/h2&gt;

&lt;p&gt;One of the key benefits of the Gravitino integration is centralized credential management. Instead of managing separate credentials for each storage system, Gravitino handles authentication and authorization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Gravitino manages credentials for underlying storage
# No need to configure separate S3 credentials in Daft
&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GravitinoConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8090&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metalake_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure_metalake&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oauth2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-oauth-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# All storage access is handled through Gravitino's security layer
&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IOConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gravitino_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access data with unified security
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/sensitive_data/financial/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;p&gt;The Gravitino integration is designed for optimal performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Daft's lazy execution works seamlessly with gvfs:// URLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predicate Pushdown&lt;/strong&gt;: Filters are pushed down to the storage layer when possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Processing&lt;/strong&gt;: Multi-threaded I/O operations across different storage systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Gravitino metadata is cached to reduce lookup overhead
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Efficient filtered reads with predicate pushdown
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gvfs://fileset/s3_catalog/events/daily/**/*.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;io_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pushed down to storage
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Efficient columnar filtering
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Column pruning
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Current Limitations and Future Roadmap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Current Status
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Read Operations&lt;/strong&gt;: Full support for reading files from Gravitino filesets&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Write Operations&lt;/strong&gt;: Support for writing Parquet, CSV, and JSON files to filesets&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Multiple Formats&lt;/strong&gt;: Support for Parquet, JSON, CSV, and other formats&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;S3 Storage&lt;/strong&gt;: Full support for S3-backed filesets (including S3-compatible storage like MinIO)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Local Storage&lt;/strong&gt;: Support for local file:// storage&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Security Integration&lt;/strong&gt;: Centralized credential management through Gravitino&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Credential Vending&lt;/strong&gt;: Generate temporary credentials for clients to enhance security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More Cloud Storage Backends&lt;/strong&gt;: Support for GCS and Azure Blob Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table Catalog Integration&lt;/strong&gt;: Support for reading and writing Iceberg and Lance tables through their catalogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Security&lt;/strong&gt;: Fine-grained access control and audit logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Optimizations&lt;/strong&gt;: Enhanced caching and metadata management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started Today
&lt;/h2&gt;

&lt;p&gt;Ready to try the Daft + Gravitino integration? Here's how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up Gravitino&lt;/strong&gt;: Follow the &lt;a href="https://gravitino.apache.org/docs/getting-started/index" rel="noopener noreferrer"&gt;Gravitino quickstart guide&lt;/a&gt; to set up your Gravitino server&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Daft v0.7.2 or later with Gravitino support&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"daft&amp;gt;=0.7.2"&lt;/span&gt; requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
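Before moving on, you can sanity-check that the installed Daft version meets the minimum. A minimal stdlib sketch of the comparison (the `meets_minimum` helper is illustrative, not part of Daft's API):

```python
# Compare a dotted version string against the minimum required Daft version.
def meets_minimum(version: str, minimum: tuple = (0, 7, 2)) -> bool:
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= minimum

print(meets_minimum("0.7.2"))   # True
print(meets_minimum("0.6.9"))   # False
```

In practice you would pass `daft.__version__` to the helper after importing Daft.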



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Configure your first fileset&lt;/strong&gt;: Create a fileset in Gravitino pointing to your data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start querying&lt;/strong&gt;: Use gvfs:// URLs to access your data with Daft&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The integration between Daft and Apache Gravitino represents a significant step forward in simplifying distributed data access. By combining Daft's powerful distributed query engine with Gravitino's unified catalog management, data teams can focus on extracting insights rather than managing infrastructure complexity.&lt;/p&gt;

&lt;p&gt;Whether you're building analytics pipelines, managing data lakes, or simply looking to simplify your data access patterns, the Daft + Gravitino integration provides the tools you need to succeed in today's distributed data landscape.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to learn more? Check out the &lt;a href="https://docs.daft.ai" rel="noopener noreferrer"&gt;Daft documentation&lt;/a&gt; and &lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache Gravitino project&lt;/a&gt; for detailed guides and examples.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>daft</category>
      <category>apachegravitino</category>
      <category>metadata</category>
      <category>datacatalog</category>
    </item>
    <item>
      <title>Setting up Apache Gravitino from Scratch</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 20 Jan 2026 08:31:19 +0000</pubDate>
      <link>https://dev.to/gravitino/setting-up-apache-gravitino-from-scratch-56o8</link>
      <guid>https://dev.to/gravitino/setting-up-apache-gravitino-from-scratch-56o8</guid>
      <description>&lt;p&gt;&lt;em&gt;Author: Danhua Wang&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2026-01-12&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you will learn how to install and configure Apache Gravitino from scratch. By the end of this guide, you'll have a fully functional Gravitino server running with your chosen storage backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll accomplish:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install Apache Gravitino&lt;/strong&gt; from source or pre-built binaries and configure the basic server setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure storage backends&lt;/strong&gt; including H2 for development and MySQL/PostgreSQL for production environments
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure the Gravitino server&lt;/strong&gt;, including web server, cache, and access control settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify the installation&lt;/strong&gt; by testing the server endpoints and Web UI to ensure everything is working correctly&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you will need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux or macOS operating system with outbound internet access for downloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Production Environment&lt;/strong&gt;: 4 CPU cores, 16GB RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Development Environment&lt;/strong&gt;: 2 CPU cores, 8GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Java Development Kit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JDK 17 or higher installed and properly configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optional Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL or PostgreSQL server installed and properly configured, if you choose either as your storage backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before proceeding, verify your Java installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/bin/java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
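Gravitino requires JDK 17 or higher. If you want to check this from a script, here is a hedged sketch that parses the banner line printed by `java -version` (the sample strings are illustrative):

```python
import re

# Extract the major Java version from a `java -version` banner line.
def java_major_version(banner: str) -> int:
    m = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if not m:
        raise ValueError("unrecognized version banner")
    major = int(m.group(1))
    # Legacy 1.x scheme: 'java version "1.8.0_392"' means Java 8.
    if major == 1 and m.group(2):
        major = int(m.group(2))
    return major

print(java_major_version('openjdk version "17.0.9" 2023-10-17'))  # 17
```

A result below 17 means you need to upgrade your JDK before continuing.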



&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Obtain Gravitino Binary
&lt;/h3&gt;

&lt;p&gt;You have two options for obtaining Apache Gravitino: downloading a pre-built release or building from source.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: Download Pre-built Release
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Download the latest release&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;a href="https://github.com/apache/gravitino/releases" rel="noopener noreferrer"&gt;Apache Gravitino GitHub Releases&lt;/a&gt; page and download the latest release tarball.&lt;br&gt;
For example, to download version 1.1.0, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://github.com/apache/gravitino/releases/download/v1.1.0/gravitino-1.1.0-bin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Extract the package&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; gravitino-1.1.0-bin.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;gravitino-1.1.0-bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Option 2: Build from Source
&lt;/h4&gt;

&lt;p&gt;If you prefer to build from source or need the latest development features, see &lt;a href="https://gravitino.apache.org/docs/1.1.0/how-to-build" rel="noopener noreferrer"&gt;How to Build Gravitino&lt;/a&gt; for detailed build instructions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understanding the Package Structure
&lt;/h4&gt;

&lt;p&gt;After obtaining the binary, familiarize yourself with the directory layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gravitino-&amp;lt;version&amp;gt;-bin/
├── bin/                                    # Launch scripts
│   ├── gravitino.sh                        # Main server launcher
│   ├── gravitino-iceberg-rest-server.sh    # Iceberg REST server launcher
│   └── gravitino-lance-rest-server.sh      # Lance REST server launcher
├── conf/                                   # Configuration files
│   ├── gravitino.conf                      # Main server configuration
│   ├── gravitino-iceberg-rest-server.conf  # Iceberg REST configuration
│   ├── gravitino-lance-rest-server.conf    # Lance REST configuration
│   ├── gravitino-env.sh                    # Environment variables
│   └── log4j2.properties                   # Logging configuration
├── catalogs/                               # Catalog-specific configurations
├── libs/                                   # Server dependencies
├── iceberg-rest-server/                    # Iceberg REST server package
├── lance-rest-server/                      # Lance REST server package
├── logs/                                   # Log files (created at runtime)
├── data/                                   # Default data storage
├── scripts/                                # Database initialization scripts
└── web/                                    # Frontend package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Plan Your Storage Backend
&lt;/h3&gt;

&lt;p&gt;Choose the appropriate storage backend for your deployment scenario.&lt;/p&gt;

&lt;h4&gt;
  
  
  Development/Testing: H2 Database (Default)
&lt;/h4&gt;

&lt;p&gt;For development and testing environments, H2 provides a quick setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: Embedded database, no external dependencies, works out-of-the-box
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: Not suitable for production, no data consistency guarantees
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;: No additional setup required
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Production: MySQL
&lt;/h4&gt;

&lt;p&gt;For production environments, MySQL is the recommended choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install and configure MySQL server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create database and user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="s1"&gt;'gravitino'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'%'&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'your_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'gravitino'&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="s1"&gt;'%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;FLUSH&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize database schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;mysql_ip_address&amp;gt; &lt;span class="nt"&gt;-u&lt;/span&gt; gravitino &lt;span class="nt"&gt;-D&lt;/span&gt; gravitino &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt; scripts/mysql/schema-1.1.0-mysql.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Production: PostgreSQL
&lt;/h4&gt;

&lt;p&gt;As an alternative production option:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install and configure PostgreSQL server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create database and user&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'your_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;gravitino&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Initialize database schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;postgres_ip_address&amp;gt; &lt;span class="nt"&gt;-U&lt;/span&gt; gravitino &lt;span class="nt"&gt;-d&lt;/span&gt; gravitino &lt;span class="nt"&gt;-f&lt;/span&gt; scripts/postgresql/schema-1.1.0-postgresql.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure Gravitino Server
&lt;/h3&gt;

&lt;p&gt;Configure the main server settings in the &lt;code&gt;conf/gravitino.conf&lt;/code&gt; file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Basic Server Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Configure HTTP server settings&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# HTTP Server Configuration
&lt;/span&gt;&lt;span class="py"&gt;gravitino.server.webserver.host&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.httpPort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8090&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.minThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;24&lt;/span&gt;
&lt;span class="py"&gt;gravitino.server.webserver.maxThreads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure storage backend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For H2 (Development):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Storage Backend Configuration
&lt;/span&gt;&lt;span class="py"&gt;gravitino.entity.store&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;relational&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;JDBCBackend&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:h2&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.h2.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For MySQL (Production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Configure for MySQL
&lt;/span&gt;&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:mysql://&amp;lt;mysql_ip_address&amp;gt;:3306/gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcDriver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;com.mysql.cj.jdbc.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;gravitino&lt;/span&gt;
&lt;span class="py"&gt;gravitino.entity.store.relational.jdbcPassword&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your_password&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Optional Performance Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Enable caching for better performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Caching provides significant performance improvements, particularly for authorization operations and metadata lookups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authorization Performance&lt;/strong&gt;: Dramatically reduces latency for permission checks by caching user roles, privileges, and access control decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Retrieval&lt;/strong&gt;: Speeds up frequent catalog, schema, and table metadata queries by avoiding repeated database lookups
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable caching for better performance
&lt;/span&gt;&lt;span class="py"&gt;gravitino.cache.enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.implementation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;caffeine&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.maxEntries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10000&lt;/span&gt;
&lt;span class="py"&gt;gravitino.cache.expireTimeInMs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3600000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Optional Access Control Configuration
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Configure authorization
&lt;/h5&gt;

&lt;p&gt;Gravitino includes built-in metadata authorization that you can enable with the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable access control
&lt;/span&gt;&lt;span class="py"&gt;gravitino.authorization.enable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.authorization.serviceAdmins&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;admin,gravitino&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;gravitino.authorization.serviceAdmins&lt;/code&gt; defines service administrators who are responsible for creating metalakes.&lt;br&gt;&lt;br&gt;
When a service admin creates a metalake, they automatically become the owner. As the owner, they have full control over the metalake, including the ability to drop it. Ownership can be transferred to another user if needed.&lt;/p&gt;

&lt;p&gt;For comprehensive access control documentation, see &lt;a href="https://gravitino.apache.org/docs/1.1.0/security/access-control" rel="noopener noreferrer"&gt;Access Control&lt;/a&gt;.&lt;/p&gt;
&lt;h5&gt;
  
  
  Configure authentication
&lt;/h5&gt;

&lt;p&gt;Apache Gravitino supports three authentication mechanisms: simple, OAuth, and Kerberos. Upon successful authentication, user identities from any of these methods are mapped directly to authorization principals, which govern access control decisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default Behavior&lt;/strong&gt;: If authentication is not explicitly configured, Gravitino defaults to &lt;code&gt;simple&lt;/code&gt; authentication mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Login Method&lt;/strong&gt;: Use the service administrators specified in the &lt;code&gt;gravitino.authorization.serviceAdmins&lt;/code&gt; configuration to log in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See &lt;a href="https://gravitino.apache.org/docs/1.1.0/security/how-to-authenticate" rel="noopener noreferrer"&gt;How to authenticate&lt;/a&gt; for detailed authentication setup.&lt;/p&gt;
&lt;h4&gt;
  
  
  Environment Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Configure environment variables in &lt;code&gt;conf/gravitino-env.sh&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# JVM Memory Settings&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GRAVITINO_MEM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-Xms4g -Xmx4g -XX:MaxMetspaceSize=1g"&lt;/span&gt;

&lt;span class="c"&gt;# Debug Options (uncomment for debugging)&lt;/span&gt;
&lt;span class="c"&gt;# export GRAVITINO_DEBUG_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000 -Dlog4j2.debug=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://gravitino.apache.org/docs/1.1.0/gravitino-server-config" rel="noopener noreferrer"&gt;Apache Gravitino server configurations&lt;/a&gt; for detailed server configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Optional REST services for enhanced functionality
&lt;/h3&gt;

&lt;p&gt;You can enable the Iceberg REST or Lance REST services either as auxiliary services when starting the Gravitino server, or run them as standalone services. We’ve prepared detailed guides for them in subsequent articles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Iceberg REST/Lance REST as auxiliary service
&lt;/span&gt;&lt;span class="py"&gt;gravitino.auxService.names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg-rest,lance-rest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Start and Verify Installation
&lt;/h3&gt;

&lt;p&gt;Launch the Gravitino server and verify the installation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Start Gravitino Server
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Start the server in daemon mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Check server status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/gravitino.sh status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. View server logs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/gravitino-server.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Verify Installation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Check server health&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; GET &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On success, the response looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"code":0,"version":{"version":"1.1.0","compileDate":"12/12/2025 12:38:33","gitCommit":"5a6b5ae772d50aff98878ae3659fba3598a9027f"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
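The same health check can be scripted. This sketch parses the response body shown above using only the standard library (actually sending the request requires the server from this tutorial running on localhost:8090):

```python
import json

# Pull the server version out of a /api/version response body.
def parse_version(body: str) -> str:
    data = json.loads(body)
    if data.get("code") != 0:
        raise RuntimeError(f"unexpected response code: {data.get('code')}")
    return data["version"]["version"]

sample = ('{"code":0,"version":{"version":"1.1.0",'
          '"compileDate":"12/12/2025 12:38:33",'
          '"gitCommit":"5a6b5ae772d50aff98878ae3659fba3598a9027f"}}')
print(parse_version(sample))  # 1.1.0
```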



&lt;p&gt;&lt;strong&gt;2. Access Web UI&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open your browser and navigate to &lt;code&gt;http://localhost:8090&lt;/code&gt; to access the Gravitino Web UI.  &lt;/p&gt;

&lt;p&gt;The default login page when using simple authentication mode (with access control enabled):  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukt8rzzsrkdw7k249thl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukt8rzzsrkdw7k249thl.png" alt=" " width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If access control is disabled, you are taken directly to the metalake management page:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlc4j53x6pg956h2g6pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlc4j53x6pg956h2g6pe.png" alt=" " width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Verify auxiliary services (if enabled)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check Iceberg REST service&lt;/span&gt;
curl http://localhost:9001/iceberg/v1/config

&lt;span class="c"&gt;# Check Lance REST service&lt;/span&gt;
curl http://localhost:9101/lance/v1/namespace/%24/list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create Sample Metadata
&lt;/h4&gt;

&lt;p&gt;Test your installation by creating sample metadata objects.&lt;/p&gt;

&lt;h5&gt;
  
  
  Create your first metalake
&lt;/h5&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "my_metalake", "comment": "My first metalake"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you have enabled access control, you need to add the Authorization header to the command (using username 'admin' and password '123'):&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Basic &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'admin:123'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "my_metalake", "comment": "My first metalake"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;
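The same `Authorization` header value can be produced in Python when scripting against the API (the username and password here are the tutorial's example credentials):

```python
import base64

# Build an HTTP Basic Authorization header value from credentials.
def basic_auth(user: str, password: str) -> str:
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth("admin", "123"))  # Basic YWRtaW46MTIz
```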

&lt;h5&gt;
  
  
  Create a sample catalog
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This example creates a Hive catalog. Before proceeding, ensure you have a running Hive cluster with Hive Metastore service accessible. If you don't have a Hive cluster, you can use a different catalog type (such as MySQL catalog).&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "catalog_hive",
    "type": "relational",
    "provider": "hive",
    "comment": "My Hive catalog",
    "properties": {
      "metastore.uris": "thrift://&amp;lt;hive_metastore_host&amp;gt;:&amp;lt;port&amp;gt;"
    }
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8090/api/metalakes/my_metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
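&lt;p&gt;The same catalog-creation body can be assembled programmatically. A minimal sketch, assuming you substitute the placeholder metastore address with your real one; reuse the same &lt;code&gt;Accept&lt;/code&gt;/&lt;code&gt;Content-Type&lt;/code&gt; (and, if authentication is enabled, &lt;code&gt;Authorization&lt;/code&gt;) headers shown for the metalake request.&lt;/p&gt;

```python
import json

# The catalog-creation payload from the curl example, as a Python dict.
# Sketch only: METASTORE_URI is a placeholder for your Hive Metastore
# address, and the request itself is not sent here.
METASTORE_URI = "thrift://HIVE_METASTORE_HOST:PORT"  # placeholder, substitute real values

catalog = {
    "name": "catalog_hive",
    "type": "relational",
    "provider": "hive",
    "comment": "My Hive catalog",
    "properties": {"metastore.uris": METASTORE_URI},
}

body = json.dumps(catalog)
url = "http://localhost:8090/api/metalakes/my_metalake/catalogs"
```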



&lt;h5&gt;
  
  
  Manage catalogs on web GUI
&lt;/h5&gt;

&lt;p&gt;Create a catalog:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhmh79hwmlcbfwze7lp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhmh79hwmlcbfwze7lp0.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View/Manage all catalogs:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F173zoikwbe79w21b598l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F173zoikwbe79w21b598l.png" alt="Gravitino catalogs Page" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Congratulations
&lt;/h2&gt;

&lt;p&gt;You have successfully completed the Apache Gravitino setup tutorial!&lt;/p&gt;

&lt;p&gt;You now have a fully functional Apache Gravitino installation with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A configured metadata server running on port 8090&lt;/li&gt;
&lt;li&gt;A storage backend configured for your environment
&lt;/li&gt;
&lt;li&gt;Optional auxiliary REST services for Iceberg and Lance integration&lt;/li&gt;
&lt;li&gt;Sample metadata objects to verify functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your Apache Gravitino server is ready to manage metadata across your data ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading [Iceberg Catalog]&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>metadata</category>
      <category>agenticai</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino Introduction</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Fri, 16 Jan 2026 23:05:12 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-introduction-ce7</link>
      <guid>https://dev.to/gravitino/apache-gravitino-introduction-ce7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ta9xwucijs1ql00h6xi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ta9xwucijs1ql00h6xi.jpeg" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Author: &lt;a href="https://www.linkedin.com/in/shi-shao-feng-b1867917/" rel="noopener noreferrer"&gt;shaofeng shi&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Last Updated: 2025-12-29&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;In the era of big data, enterprises often need to manage metadata from multi-cloud, multi-domain, and heterogeneous data sources, such as Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, GCS, etc. Additionally, with the extensive application of AI model training and inference, massive amounts of multimodal data and model metadata also require a unified management solution. Traditional approaches involve managing metadata separately for each data source, which not only increases operational complexity but also easily creates data silos. &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt;, as a high-performance, geographically distributed federated metadata lake, provides us with a unified solution for managing multi-source metadata.&lt;/p&gt;

&lt;p&gt;Gravitino was originally initiated and founded by Datastrato Inc., open-sourced in 2023, donated to the Apache Incubator in 2024, and graduated from the Apache Incubator in May 2025 to become an Apache Top Level Project. It has been deployed in production environments at companies like Xiaomi, Tencent, Zhihu, Uber, and Pinterest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Gravitino?
&lt;/h2&gt;

&lt;p&gt;Apache Gravitino is a high-performance, geographically distributed, federated metadata lake management system that provides users with a unified data and AI asset management platform. It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata Management&lt;/strong&gt;: Provide unified metadata models and APIs for different types of data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct Metadata Management&lt;/strong&gt;: Directly manage underlying systems, with changes reflected in real-time to source systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Engine Support&lt;/strong&gt;: Support multiple query engines such as Trino, Spark, Flink, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geographically Distributed Deployment&lt;/strong&gt;: Support cross-region, cross-cloud deployment architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Asset Management&lt;/strong&gt;: Manage not only data assets but also AI/ML model metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core concepts include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metalake&lt;/strong&gt;: Container/tenant for metadata, typically one organization corresponds to one metalake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalog&lt;/strong&gt;: Collection of metadata from specific metadata sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Second-level namespace, corresponding to the schema concept in databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table&lt;/strong&gt;: Bottom-level object representing specific data tables&lt;/li&gt;
&lt;/ul&gt;
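&lt;p&gt;The four concepts above form a strict hierarchy, so any table is addressable by a four-part name. A minimal illustrative sketch (not the SDK API; the schema and table names are hypothetical):&lt;/p&gt;

```python
from dataclasses import dataclass

# Gravitino's metadata hierarchy: metalake -> catalog -> schema -> table.
# A fully qualified name is simply the four levels joined with dots.
@dataclass(frozen=True)
class NameIdentifier:
    metalake: str
    catalog: str
    schema: str
    table: str

    def qualified_name(self):
        return ".".join([self.metalake, self.catalog, self.schema, self.table])

ident = NameIdentifier("my_metalake", "catalog_hive", "sales", "orders")
print(ident.qualified_name())  # my_metalake.catalog_hive.sales.orders
```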

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hg0z37dfbt0mo2isys8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hg0z37dfbt0mo2isys8.png" alt="Gravitino Overall Architecture" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Gravitino Core Features Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Unified Metadata Management
&lt;/h3&gt;

&lt;p&gt;Gravitino provides a unified metadata management layer that supports integration with multiple data sources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Data Source Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relational Databases&lt;/strong&gt;: MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big Data Storage&lt;/strong&gt;: Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Queues&lt;/strong&gt;: Apache Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Systems&lt;/strong&gt;: HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI/ML Data Formats&lt;/strong&gt;: Lance (columnar data format designed specifically for AI/ML workloads)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REST API Services
&lt;/h3&gt;

&lt;p&gt;Gravitino provides rich REST API services that support standardized access to different data formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gravitino Core REST API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete metadata management RESTful API interface&lt;/li&gt;
&lt;li&gt;Support for CRUD operations on all metadata objects including Metalake, Catalog, Schema, Table, etc.&lt;/li&gt;
&lt;li&gt;Complete API for user, group, role, and permission management&lt;/li&gt;
&lt;li&gt;API interfaces for advanced features like tags, policies, models, etc.&lt;/li&gt;
&lt;li&gt;Support for multiple authentication methods (Simple, OAuth2, Kerberos)&lt;/li&gt;
&lt;/ul&gt;
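&lt;p&gt;These CRUD endpoints nest naturally under the metalake path used in this series' curl examples. A sketch of the URL layout (base URL illustrative; consult the official REST API reference for the authoritative paths):&lt;/p&gt;

```python
# Sketch of how Gravitino's REST resources nest under /api/metalakes.
# The base URL is illustrative; paths follow the curl examples in this series.
BASE = "http://localhost:8090/api"

def metalake_url(metalake):
    return f"{BASE}/metalakes/{metalake}"

def catalog_url(metalake, catalog):
    return f"{metalake_url(metalake)}/catalogs/{catalog}"

def schema_url(metalake, catalog, schema):
    return f"{catalog_url(metalake, catalog)}/schemas/{schema}"

def table_url(metalake, catalog, schema, table):
    return f"{schema_url(metalake, catalog, schema)}/tables/{table}"

print(table_url("my_metalake", "catalog_hive", "sales", "orders"))
```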

&lt;p&gt;&lt;strong&gt;Iceberg REST Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complies with Apache Iceberg REST API specification&lt;/li&gt;
&lt;li&gt;Supports multiple backend storage options (Hive, JDBC, custom backends)&lt;/li&gt;
&lt;li&gt;Provides complete table management and query capabilities&lt;/li&gt;
&lt;li&gt;Supports multiple storage systems (S3, HDFS, GCS, Azure, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lance REST Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implements Lance REST API specification&lt;/li&gt;
&lt;li&gt;Optimized specifically for AI/ML workloads&lt;/li&gt;
&lt;li&gt;Supports efficient vector data storage and retrieval&lt;/li&gt;
&lt;li&gt;Provides namespace and table management functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-time Metadata Retrieval and Modification
&lt;/h3&gt;

&lt;p&gt;Gravitino adopts a direct metadata management mode to keep metadata accurate and consistent in real time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Synchronization&lt;/strong&gt;: Changes to metadata are immediately reflected in underlying data sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bidirectional Synchronization&lt;/strong&gt;: Supports metadata synchronization from Gravitino to data sources and from data sources to Gravitino&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Support&lt;/strong&gt;: Ensures atomicity and consistency of metadata operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Management&lt;/strong&gt;: Supports metadata version control and historical tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unified Access Control
&lt;/h3&gt;

&lt;p&gt;Gravitino implements unified permission management across multiple data sources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt;: Supports flexible permission management for users, groups, and roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ownership Model&lt;/strong&gt;: Each metadata object has a clear owner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permission Inheritance&lt;/strong&gt;: Supports hierarchical permission inheritance mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Control&lt;/strong&gt;: Multi-level permission control from Metalake to specific tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Supported Permission Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User and group management permissions&lt;/li&gt;
&lt;li&gt;Catalog and schema creation permissions&lt;/li&gt;
&lt;li&gt;Read/write permissions for tables, topics, filesets&lt;/li&gt;
&lt;li&gt;Model registration and version management permissions&lt;/li&gt;
&lt;li&gt;Tag and policy application permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unified Data Lineage
&lt;/h3&gt;

&lt;p&gt;Based on OpenLineage standards, Gravitino provides complete data lineage tracking capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Lineage Collection&lt;/strong&gt;: Automatically collect data lineage information through Spark plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Identifiers&lt;/strong&gt;: Convert identifiers from different data sources to Gravitino unified identifiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Data Source Support&lt;/strong&gt;: Support lineage tracking for various data sources including Hive, Iceberg, JDBC, file systems, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability and Scalability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deployment Modes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-node Deployment&lt;/strong&gt;: Suitable for development and testing environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Deployment&lt;/strong&gt;: Supports high availability and load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Deployment&lt;/strong&gt;: Supports containerized deployment and auto-scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Support&lt;/strong&gt;: Provides official Docker images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage Backends:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports multiple metadata storage backends (MySQL, PostgreSQL, etc.)&lt;/li&gt;
&lt;li&gt;Supports distributed storage systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Authentication Methods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple authentication (username/password)&lt;/li&gt;
&lt;li&gt;OAuth2 authentication&lt;/li&gt;
&lt;li&gt;Kerberos authentication (for Hive backends)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Credential Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports cloud storage credential vending (S3, GCS, Azure, etc.)&lt;/li&gt;
&lt;li&gt;Dynamic credential refresh&lt;/li&gt;
&lt;li&gt;Secure credential passing mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Apache Gravitino Integration Capabilities
&lt;/h2&gt;

&lt;p&gt;Gravitino deeply integrates with mainstream compute engines and data processing frameworks, providing users with a unified data access experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute Engine Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamless integration through Gravitino Spark Connector&lt;/li&gt;
&lt;li&gt;Supports Spark SQL and DataFrame API&lt;/li&gt;
&lt;li&gt;Automatic data lineage collection and tracking&lt;/li&gt;
&lt;li&gt;Supports unified access to multiple data sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trino&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration through Gravitino Trino Connector service&lt;/li&gt;
&lt;li&gt;Supports federated queries across data sources&lt;/li&gt;
&lt;li&gt;High-performance analytical query capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration through Gravitino Flink Connector service&lt;/li&gt;
&lt;li&gt;Supports stream-batch unified data processing&lt;/li&gt;
&lt;li&gt;Real-time data processing and analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Python Ecosystem Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PyIceberg&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports Iceberg table access in Python environments&lt;/li&gt;
&lt;li&gt;Integrates with Gravitino Iceberg REST service&lt;/li&gt;
&lt;li&gt;Supports data science and machine learning workflows&lt;/li&gt;
&lt;li&gt;Provides Pandas-compatible data interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Daft&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modern distributed data processing framework&lt;/li&gt;
&lt;li&gt;Optimized specifically for AI/ML workloads&lt;/li&gt;
&lt;li&gt;Supports multimodal data processing&lt;/li&gt;
&lt;li&gt;Integrates with Gravitino metadata management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud-Native Integration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports Kubernetes native deployment&lt;/li&gt;
&lt;li&gt;Provides Helm Charts and Operators&lt;/li&gt;
&lt;li&gt;Supports auto-scaling and fault recovery&lt;/li&gt;
&lt;li&gt;Integrates with cloud-native monitoring and logging systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  APIs and SDKs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;REST API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete RESTful API interface&lt;/li&gt;
&lt;li&gt;Supports all metadata management operations&lt;/li&gt;
&lt;li&gt;Standardized HTTP interface&lt;/li&gt;
&lt;li&gt;Supports multiple authentication methods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Java SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native Java client library&lt;/li&gt;
&lt;li&gt;Type-safe API interface&lt;/li&gt;
&lt;li&gt;Supports connection pooling and retry mechanisms&lt;/li&gt;
&lt;li&gt;Complete exception handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python client library&lt;/li&gt;
&lt;li&gt;Supports asynchronous operations&lt;/li&gt;
&lt;li&gt;Integrates with Jupyter Notebook&lt;/li&gt;
&lt;li&gt;Supports data science workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These integration capabilities allow Gravitino to slot seamlessly into existing data infrastructure, giving users a unified and efficient data management experience. Subsequent articles will cover each of Gravitino's capabilities in detail, along with configuration and usage for every integration component. Stay tuned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continue reading [Setup Guide]&lt;/li&gt;
&lt;li&gt;Follow and star &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino Repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the &lt;a href="https://gravitino.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; or submit issues on &lt;a href="https://github.com/apache/gravitino/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>gravitino101</category>
    </item>
    <item>
      <title>Apache Gravitino — 2025 Summary</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Wed, 07 Jan 2026 00:14:52 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-2025-summary-nf8</link>
      <guid>https://dev.to/gravitino/apache-gravitino-2025-summary-nf8</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;2025 was a landmark year for Apache Gravitino. The project not only graduated as a Top-Level Project (TLP) but also reached its first major stable release, version 1.0.0. Throughout the year, the community focused heavily on "Contextual Engineering" and "AI-native" metadata management, introducing groundbreaking features like the Model Context Protocol (MCP) server, the Lance REST service, and a metadata-driven action system. This article summarizes the milestones and achievements of Apache Gravitino in 2025.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Timeline&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apache Gravitino officially &lt;strong&gt;graduated as an Apache Top-Level Project on June 3, 2025&lt;/strong&gt;, marking a significant maturity milestone.&lt;/p&gt;

&lt;p&gt;In 2025, the community released several key versions, including the major 1.0.0 release and significant feature updates in 0.8.0-incubating, 0.9.0-incubating, and 1.1.0.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2025.01.24: Version 0.8.0-incubating released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Focused on strengthening AI support with the introduction of the &lt;strong&gt;Model Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Introduced credential vending for Filesets and new connectors for Flink (Iceberg/Paimon).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.05.07: Version 0.9.0-incubating released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Enhanced data governance with a new &lt;strong&gt;Data Lineage interface&lt;/strong&gt; (OpenLineage compliant).&lt;/li&gt;
&lt;li&gt;Added gcli script for better CLI experience and improved security with privilege refinements.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.09.24: Version 1.0.0 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The first stable major release, themed "From Metadata Management to Contextual Engineering."&lt;/li&gt;
&lt;li&gt;Introduced the &lt;strong&gt;Metadata-driven Action System&lt;/strong&gt; (including Statistics, Policies, and Jobs).&lt;/li&gt;
&lt;li&gt;Launched the &lt;strong&gt;MCP (Model Context Protocol) Server&lt;/strong&gt;, enabling AI Agents/LLMs to interact directly with metadata.&lt;/li&gt;
&lt;li&gt;Implemented unified Role-Based Access Control (RBAC) across catalogs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.11.20: Version 1.0.1 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A stability release featuring smarter job templates and improved Python client support.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;2025.12.19: Version 1.1.0 released&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Added the &lt;strong&gt;Lance REST service&lt;/strong&gt; to support vector data for AI workloads.&lt;/li&gt;
&lt;li&gt;Introduced a Generic Lakehouse Catalog and support for Hive 3 and multi-cluster HDFS filesets.&lt;/li&gt;
&lt;li&gt;Hardened security for the Iceberg REST service.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Features &amp;amp; Improvements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2025, Gravitino evolved from a unified catalog to an active metadata control plane. Key technical achievements include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI &amp;amp; LLM Integration&lt;/strong&gt;: The project positioned itself as an AI-native catalog by introducing the &lt;strong&gt;Model Catalog&lt;/strong&gt; for managing ML models and the &lt;strong&gt;MCP Server&lt;/strong&gt; to connect AI agents with data context. The addition of the &lt;strong&gt;Lance REST service&lt;/strong&gt; in v1.1.0 further solidified support for vector datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-Driven Actions&lt;/strong&gt;: A new framework allowing users to define policies (e.g., TTL, compaction) and execute jobs based on metadata, moving beyond passive metadata storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Governance &amp;amp; Security&lt;/strong&gt;: Full implementation of &lt;strong&gt;RBAC&lt;/strong&gt;, credential vending for secure data access (S3/GCS/ADLS), and a unified authentication flow for Iceberg REST services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem Expansion&lt;/strong&gt;: Broadened support with new connectors (Generic Lakehouse, Hive 3, Flink, Paimon) and enhancements to the &lt;strong&gt;GVFS (Gravitino Virtual File System)&lt;/strong&gt; for unified file management.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Community&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Apache Gravitino community saw explosive growth in 2025, evolving from an incubator project into a Top-Level Project (TLP) backed by a rapidly expanding global ecosystem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top-Level Graduation&lt;/strong&gt;: On &lt;strong&gt;June 3, 2025&lt;/strong&gt;, the project officially graduated to an Apache Top-Level Project, a major milestone marking its maturity in community health, vendor-neutral governance, and production readiness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Growth (Year-over-Year)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engagement&lt;/strong&gt;: GitHub stars increased by over &lt;strong&gt;130%&lt;/strong&gt;, ending the year above &lt;strong&gt;2,600&lt;/strong&gt;. Forks grew by approximately &lt;strong&gt;150%&lt;/strong&gt;, reflecting a surge in community-led integrations and local developments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contributor Base&lt;/strong&gt;: The active developer community expanded by nearly &lt;strong&gt;100%&lt;/strong&gt;. Recent major releases, such as version 1.1.0, featured contributions from &lt;strong&gt;40+ unique developers&lt;/strong&gt; representing a wide variety of global organizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development Velocity&lt;/strong&gt;: Development pace accelerated significantly, with code commits reaching a lifetime total of over &lt;strong&gt;3,300 commits&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-Graduation Committer Growth&lt;/strong&gt;: On July 7, 2025, Chenxi Pan was added as a committer; on December 15, 2025, Junda Yang and Yangyang Zhong were added as committers.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Global Presence&lt;/strong&gt;: The project established itself as the standard for federated metadata through featured presentations at &lt;strong&gt;Community Over Code (NA &amp;amp; Asia)&lt;/strong&gt; and &lt;strong&gt;QCon Shanghai&lt;/strong&gt;, gathering critical production feedback from global data engineering teams to shape the future roadmap.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Industry Trends in Metadata Management (2026)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Breaking Lakehouse Silos&lt;/strong&gt;: As organizations adopt multiple "open" table formats, the risk of "format lock-in" has replaced "vendor lock-in." The trend is toward &lt;strong&gt;Universal Lakehouse&lt;/strong&gt; architectures that provide a single entry point for fragmented data silos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Multimodal AI Explosion&lt;/strong&gt;: AI workloads are moving beyond tabular data to include massive volumes of unstructured assets (images, video, audio). Traditional data stacks are being replaced by &lt;strong&gt;AI-Native Multimodal Stacks&lt;/strong&gt; that can process complex data types with the same governance as SQL tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergence of Data Agents&lt;/strong&gt;: AI Agents are becoming the primary consumers of data. These agents require "Context Engineering"—a way to use metadata as an external brain to discover, understand, and act upon data autonomously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalating AI Security Risks&lt;/strong&gt;: The high-speed nature of AI interactions makes traditional static security (RBAC) obsolete. The industry is moving toward &lt;strong&gt;Identity-Centric Zero Trust&lt;/strong&gt; and &lt;strong&gt;Fine-Grained ABAC&lt;/strong&gt; to prevent data leakage and ensure model safety.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Future Work&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Universal Lakehouse &amp;amp; Format Interoperability&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;To solve the data silo problem, Gravitino is expanding its reach to provide a unified management layer for the modern Lakehouse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Format Support&lt;/strong&gt;: We will provide first-class support for &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, &lt;strong&gt;Hudi&lt;/strong&gt;, and &lt;strong&gt;Paimon&lt;/strong&gt;. By acting as a "Catalog of Catalogs," Gravitino allows users to manage multiple formats through a single interface, significantly reducing vendor lock-in and simplifying cross-format governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Multimodal Data Stack for the AI Era&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Gravitino is evolving to empower a new generation of AI-native data stacks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem Integration&lt;/strong&gt;: We will focus on deep integration with AI-centric engines like &lt;strong&gt;Daft&lt;/strong&gt;, &lt;strong&gt;Ray&lt;/strong&gt;, and &lt;strong&gt;Lance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empowering New Scenarios&lt;/strong&gt;: By providing a unified metadata layer for these engines, Gravitino allows users to "reuse" existing data governance capabilities—like auditing and access control—for modern multimodal scenarios, giving the new AI data stack enterprise-grade maturity from day one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Data Agent Orchestration (Metadata as the "Brain")&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Gravitino will serve as the cognitive foundation for autonomous &lt;strong&gt;Data Agents&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server &amp;amp; Action System&lt;/strong&gt;: Leveraging the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; and our &lt;strong&gt;Metadata Action System&lt;/strong&gt;, we are exploring scenario-based capabilities for Data Agents. This allows an AI agent to not only "see" the data but also "act" on it—such as performing a schema update or triggering a compaction job—using metadata as its reasoning context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Advanced Security: KMS &amp;amp; ABAC&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;As security threats become more sophisticated in the AI era, Gravitino is implementing more granular and automated security controls.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ABAC (Attribute-Based Access Control)&lt;/strong&gt;: We will implement an ABAC engine to enable fine-grained permissions. This allows access decisions to be made based on dynamic tags (e.g., Sensitivity=High) and environmental context rather than just static roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KMS &amp;amp; Credential Management&lt;/strong&gt;: To protect data at rest and in transit, we are integrating with &lt;strong&gt;Key Management Services (KMS)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;2026 will be a defining year for AI-native data infrastructure, and the Gravitino community is just getting started. Whether you’re exploring federated lakehouse architectures, multimodal AI data stacks, or data agents in production, we welcome you to build and evolve Apache Gravitino together with us❤️.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://gravitino.apache.org/blog/2025-summary/" rel="noopener noreferrer"&gt;https://gravitino.apache.org/blog/2025-summary/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Gravitino 1.1.0: A Major Step Toward Unified Metadata for the AI-Native Lakehouse</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Tue, 16 Dec 2025 17:32:19 +0000</pubDate>
      <link>https://dev.to/gravitino/apache-gravitino-110-a-major-step-toward-unified-metadata-for-the-ai-native-lakehouse-2h03</link>
      <guid>https://dev.to/gravitino/apache-gravitino-110-a-major-step-toward-unified-metadata-for-the-ai-native-lakehouse-2h03</guid>
      <description>&lt;p&gt;Apache Gravitino 1.1.0 is now available, bringing powerful new capabilities that make it easier for organizations to unify metadata, govern heterogeneous data platforms, and support emerging AI workloads.&lt;/p&gt;

&lt;p&gt;As enterprises adopt multimodal data, multiple engines, and mixed table formats, metadata fragmentation becomes a critical barrier. Gravitino 1.1.0 directly tackles this challenge with key upgrades across catalogs, security, cluster management, and developer experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;https://github.com/apache/gravitino/releases/tag/v1.1.0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What’s new in 1.1.0&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Built for the Future of AI Data&lt;/strong&gt;&lt;br&gt;
A new Lance REST service brings governed, high-performance vector access to AI pipelines, inference workloads, and data applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Stronger Security and Metadata Governance&lt;/strong&gt;&lt;br&gt;
Fine-grained authorization now covers tags, jobs, statistics, and policies, while the Iceberg REST service receives major security hardening for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Legacy-to-Lakehouse Modernization, Simplified&lt;/strong&gt;&lt;br&gt;
Hive 3 catalog support allows organizations to bring existing Hive metastores under centralized governance without data migration or risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Multi-Cluster Operations for Real-World Deployments&lt;/strong&gt;&lt;br&gt;
Support for multiple HDFS clusters gives large-scale teams the flexibility needed for DR, isolation, and multi-region architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Faster, More Stable, More Observable&lt;/strong&gt;&lt;br&gt;
From caching to metrics to connector improvements, the entire system sees meaningful boosts in performance, reliability, and usability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb5wr0pg23h5h9e592hd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb5wr0pg23h5h9e592hd.png" alt="The Unified Metadata Layer for the AI-Native Data Stack" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why it matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Organizations building AI-native architectures need a metadata system that spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple engines (Spark, Trino, Flink, Daft, etc.)&lt;/li&gt;
&lt;li&gt;multiple formats (Iceberg, Hudi, Lance)&lt;/li&gt;
&lt;li&gt;multiple clouds and clusters&lt;/li&gt;
&lt;li&gt;batch, streaming, and vector workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As AI workloads diversify, metadata must unify. Gravitino 1.1.0 brings the interoperability and governance needed to make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Community Acknowledgement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This release reflects the work of dozens of contributors across issues, PRs, tests, design contributions, and reviews. The Gravitino community continues to grow, and we are grateful for everyone who made this release possible.&lt;/p&gt;

&lt;p&gt;Read the full release notes for details on all features and fixes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino/releases/tag/v1.1.0" rel="noopener noreferrer"&gt;https://github.com/apache/gravitino/releases/tag/v1.1.0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>apacheiceberg</category>
      <category>apachegravitino</category>
      <category>datacatalog</category>
    </item>
    <item>
      <title>If You’re Not All-in on Databricks: Why Metadata Freedom Matters</title>
      <dc:creator>Yue @ Datastrato (Admin)</dc:creator>
      <pubDate>Wed, 26 Nov 2025 22:25:37 +0000</pubDate>
      <link>https://dev.to/datastrato/if-youre-not-all-in-on-databricks-why-metadata-freedom-matters-1d38</link>
      <guid>https://dev.to/datastrato/if-youre-not-all-in-on-databricks-why-metadata-freedom-matters-1d38</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr35o726wek1ppc2wt0q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcr35o726wek1ppc2wt0q.jpeg" alt=" " width="800" height="512"&gt;&lt;/a&gt;Stop and consider your data architecture right now. You are likely grappling with these challenges:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. Are you facing vendor lock-in and prohibitive costs?&lt;/em&gt;&lt;br&gt;
&lt;em&gt;2. Is fragmented metadata wasting your engineering resources?&lt;/em&gt;&lt;br&gt;
&lt;em&gt;3. Are your BI systems failing your AI strategy?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These three questions directly point to the core friction points stemming from metadata constraints, which are crippling modern data teams:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor Lock-in Risk:&lt;/strong&gt;&lt;br&gt;
Platform-bound catalogs, particularly Databricks Unity Catalog (UC), tie governance and security tightly to a specific computing environment. This limits the freedom to choose best-of-breed engines (Trino, Flink, Ray, and others) and results in prohibitive migration costs and annually increasing spend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragmentation &amp;amp; Operational Complexity:&lt;/strong&gt;&lt;br&gt;
Modern teams operate in multi-cloud, multi-engine environments. They lack a unified interface to manage metadata across different clouds (S3, Azure Blob, on-premises, etc.) and varying processing engines, which severely hinders operational efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal Data Silos:&lt;/strong&gt; &lt;br&gt;
Traditional, table-centric metadata systems fail to natively support AI/LLM workloads, leading to silos for critical Vector Embeddings, Streaming Topics (Kafka), and Model Registries. The lack of unified metadata breaks end-to-end AI lineage and reproducibility.&lt;/p&gt;

&lt;p&gt;The reality is, the modern data stack is facing a “fragmentation crisis.” Your metadata, which should be the bridge for unified governance, has instead become the primary casualty of this fragmentation.&lt;/p&gt;

&lt;p&gt;We acknowledge that platform-specific solutions like Databricks Unity Catalog (UC) deliver a smooth experience within their own ecosystem. However, for organizations operating across multiple clouds, engines, and formats, that tight integration quickly transforms into a constraint rather than a strength. Traditional systems (HMS, AWS Glue) struggle to keep up, while tightly coupled solutions like UC introduce “soft lock-in.” Here, the core metadata, the true ownership layer of the data, becomes inseparable from a single commercial platform. Once metadata is trapped, everything upstream and downstream hardens around that dependency.&lt;/p&gt;

&lt;p&gt;This challenge is amplified dramatically in the AI-Native Era. LLM and agent-based workflows absolutely depend on the ability to discover, understand, trace, and govern all data assets. Without unified metadata, AI pipelines lose transparency, governance becomes brittle, and trust in AI outcomes erodes.&lt;/p&gt;

&lt;p&gt;Therefore, the truly essential requirement is a vendor-neutral metadata layer that can unify your entire data ecosystem.&lt;/p&gt;

&lt;p&gt;This layer must abstract metadata from compute, storage, and individual vendors; provide consistent semantics across all engines; and support the multimodal assets required for AI.&lt;/p&gt;

&lt;p&gt;In short, you need Metadata Freedom. This freedom is the foundation for long-term data agility and AI readiness. It ensures the true “deed” to your data remains in your hands, not locked behind the boundaries of any single cloud, engine, or commercial platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Architecture of Freedom: Why Gravitino Breaks the Mold&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Databricks has become one of the most influential players in the modern data stack, and Unity Catalog brings strong cohesion within the Databricks ecosystem.&lt;/p&gt;

&lt;p&gt;However, no enterprise is 100% Databricks-only. Most operate across Snowflake, BigQuery, Redshift, Trino, Iceberg, MySQL, S3, and more. Even a powerful platform-native catalog cannot serve as the unified source of metadata truth for such a heterogeneous world.&lt;/p&gt;

&lt;p&gt;To bridge this gap and finally achieve cross-platform metadata freedom, we introduce Apache Gravitino™.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Metadata freedom is not a feature; it is the foundation on which future-proof data architectures are built.”&lt;/em&gt;&lt;br&gt;
 — JP Du, PMC member of Apache Gravitino™ project, CEO &amp;amp; Co-founder of Datastrato.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino™&lt;/a&gt; is not merely a replacement for existing catalogs; it is a foundational architectural shift designed to achieve true Metadata Freedom. We realize this vision by adhering to two core principles: Radical Decoupling and Federated Unification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczfepycy3fivwv6xim75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczfepycy3fivwv6xim75.png" alt="Apache™ Gravitino Architecture" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Radical Decoupling: Metadata as an Independent Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Rather than inheriting constraints from compute or storage platforms, Gravitino treats metadata as an independent, universal control layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute-agnostic:&lt;/strong&gt; Works seamlessly with Trino, Spark, Flink, PyTorch, Ray, and others, without imposing a preferred engine.&lt;br&gt;
&lt;strong&gt;Storage-agnostic:&lt;/strong&gt; A connector-based architecture supports S3, GCS, Azure Blob, HDFS, and on-prem object stores.&lt;br&gt;
&lt;strong&gt;Vendor-neutral:&lt;/strong&gt; As described earlier, Gravitino’s decoupled design breaks metadata free from compute, storage, and vendor boundaries, aligning with open standards instead of proprietary roadmaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Federated Unification: The Catalog of Catalogs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To solve the multi-engine, multi-cloud “fragmentation crisis” introduced earlier, Gravitino redefines metadata management through its “Federated Metadata Lake” positioning. This architecture unifies the ecosystem without compromising the autonomy of underlying systems.&lt;/p&gt;

&lt;p&gt;Gravitino addresses fragmentation with a federated architecture that unifies metadata across engines and clouds without forcing standardization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI-native multimodal metadata: Supports tables, unstructured files, Kafka topics, vector embeddings, and models as first-class assets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Federated control: Gravitino’s “Catalog of Catalogs” integrates with HMS, the Apache Iceberg™ REST Catalog, MySQL, PostgreSQL, and object stores, unifying governance without replacing existing systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
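&lt;p&gt;To make the federation concrete: Gravitino manages catalogs under a metalake through its REST API. The sketch below builds a payload for registering an existing Hive Metastore as a federated catalog; the field names and endpoint path follow the Gravitino REST API as best we recall, and the catalog name and metastore address are placeholders, so verify against the current documentation before use.&lt;/p&gt;

```python
import json

# Hypothetical payload for bringing an existing Hive Metastore under a
# Gravitino metalake as a federated catalog (names and URI are placeholders).
payload = {
    "name": "legacy_hive",
    "type": "RELATIONAL",
    "provider": "hive",
    "comment": "Existing HMS brought under Gravitino governance",
    "properties": {
        # Address of the Hive Metastore being federated (placeholder value).
        "metastore.uris": "thrift://hms.example.com:9083",
    },
}

# The catalog would then be created with an HTTP POST, roughly:
#   POST /api/metalakes/{metalake}/catalogs
body = json.dumps(payload)
print(body)
```

&lt;p&gt;Because the underlying metastore keeps serving its existing clients, the registration is additive: Gravitino layers discovery and governance on top rather than migrating the metadata away.&lt;/p&gt;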

&lt;p&gt;This model harmonizes metadata across legacy systems, cloud warehouses, lakehouses, and AI platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apache Gravitino™ vs. Unity Catalog: Solving the Fragmentation Between All Catalogs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Gravitino is not simply a “better Unity Catalog.” It solves the fragmentation between all catalogs. While UC excels within its platform boundaries, Gravitino is designed to function as the superset control plane for the entire heterogeneous stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3itfi8529kvpblvaa0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm3itfi8529kvpblvaa0.jpeg" alt="Apache Gravitino™ vs. Unity Catalog (OSS) Comparison" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Community Over Code: Building the Future Through Open Collaboration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For us, the creators and initial contributors to Apache Gravitino™, Metadata Freedom is more than just code. We believe the future of enterprise data architectures cannot be tied to a single corporate roadmap; it must be built by the people who use it. This focus on developer interaction and shared ownership is why we chose the Apache path.&lt;/p&gt;

&lt;p&gt;Apache Gravitino™ is governed by the Apache Software Foundation (ASF). This isn’t a mere badge; it’s a structural guarantee of neutrality. We cherish the open, democratic nature of the ASF model, where every major decision, feature, and release is subject to a community-wide consensus and voting process. This rigorous governance ensures that Gravitino’s evolution is driven purely by technical merit and user needs, making it a project that truly belongs to everyone.&lt;/p&gt;

&lt;p&gt;In addition to advancing the Apache Gravitino™ project itself, we actively contribute to the broader open-source ecosystem, submitting code to upstream and downstream projects such as &lt;strong&gt;&lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg™&lt;/a&gt;, &lt;a href="https://www.lance.com/" rel="noopener noreferrer"&gt;Lance&lt;/a&gt;, &lt;a href="https://www.daft.ai/" rel="noopener noreferrer"&gt;Daft&lt;/a&gt;, &lt;a href="https://openlineage.io/" rel="noopener noreferrer"&gt;OpenLineage&lt;/a&gt;, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt;&lt;/strong&gt; and others.&lt;/p&gt;

&lt;p&gt;This commitment to open, multi-company governance directly fuels Gravitino’s rapid momentum. Compared with proprietary-led open source alternatives such as OSS Unity Catalog, our community activity metrics, including lines of code, contributors, commits, and issues, have recently measured at over 5x. This level of activity shows that our neutral, democratic approach is exactly what the industry demands, assuring contributors and adopters of the project’s long-term health and vitality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk0a236dzf33nhmp4li9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk0a236dzf33nhmp4li9.jpeg" alt="Data for AI’s open source afterparty" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fully embracing the Apache spirit of &lt;strong&gt;“Community over code”&lt;/strong&gt;, we have established the “Data for AI” community. This focused hub convenes developers globally to exchange knowledge and collectively tackle the practical industry challenges unique to modern data infrastructure. To accelerate this collective understanding, we regularly host technical events, inviting leading data experts from cutting-edge companies such as AWS, OpenAI, NVIDIA, Uber, Pinterest, and Roku to share their latest best practices and trends. By fostering this interaction and insight sharing, we ensure Gravitino evolves in lockstep with the most urgent needs of the industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Databricks Is Part of the Future — But Not the Whole Future.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is not a Databricks critique.&lt;/p&gt;

&lt;p&gt;They are a key pillar of the ecosystem, and many workloads fit beautifully within their platform. But the modern enterprise will always be a polyglot environment.&lt;/p&gt;

&lt;p&gt;We firmly believe that enterprises welcome more flexible, open, and diverse technological solutions. No organization wants to be forced into a corner, locked in by a single supplier.&lt;/p&gt;

&lt;p&gt;If your data strategy is not 100% committed to Databricks or any other single vendor’s ecosystem, or if you are implementing a multi-cloud, multi-engine strategy, relying on platform-specific catalogs is fundamentally insufficient.&lt;/p&gt;

&lt;p&gt;The path forward requires decoupling control from computing.&lt;/p&gt;

&lt;p&gt;Metadata Freedom provides the agility, interoperability, and ultimate safeguard against vendor lock-in.&lt;/p&gt;

&lt;p&gt;AI-Native and multimodal workloads demand an open, federated metadata layer to unify tables, files, streams, and vectors.&lt;/p&gt;

&lt;p&gt;Apache Gravitino™ demonstrates what the next-generation metadata architecture looks like. Through radical decoupling and a community-driven, federated approach, it returns the sovereignty of metadata to the user.&lt;/p&gt;

&lt;p&gt;The future belongs to open, federated, and community-driven metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Metadata Freedom. This is what Apache Gravitino™ stands for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why wait for tomorrow? Join the &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino™&lt;/a&gt; community today and start your journey to Metadata Freedom now.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>apachegravitino</category>
      <category>unitycatalog</category>
      <category>metadata</category>
    </item>
  </channel>
</rss>
