<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Yan</title>
    <description>The latest articles on DEV Community by Alex Yan (@alex_yan_163de34c186edd87).</description>
    <link>https://dev.to/alex_yan_163de34c186edd87</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3420051%2F40d2a76d-43fb-4631-8c77-14230b6fe231.png</url>
      <title>DEV Community: Alex Yan</title>
      <link>https://dev.to/alex_yan_163de34c186edd87</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_yan_163de34c186edd87"/>
    <language>en</language>
    <item>
      <title>Apache Gravitino 1.0.0 — From Metadata Management to Contextual Engineering</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Sun, 05 Oct 2025 05:40:44 +0000</pubDate>
      <link>https://dev.to/datastrato/apache-gravitino-100-from-metadata-management-to-contextual-engineering-47kb</link>
      <guid>https://dev.to/datastrato/apache-gravitino-100-from-metadata-management-to-contextual-engineering-47kb</guid>
      <description>&lt;p&gt;Apache Gravitino was designed from day one to provide a unified framework for metadata management across heterogeneous sources, regions, and clouds—what we define as the metadata lake (or metalake). Throughout its evolution, Gravitino has extended support to multiple data modalities, including tabular metadata from Apache Hive, Apache Iceberg, MySQL, and PostgreSQL; unstructured assets from HDFS and S3; streaming and messaging metadata from Apache Kafka; and metadata for machine learning models. To further strengthen governance in Gravitino, we have also integrated advanced capabilities, including tagging, audit logging, and end-to-end lineage capture.&lt;/p&gt;

&lt;p&gt;Once all enterprise metadata has been centralized in Gravitino, it forms a data brain: a structured, queryable, and semantically enriched representation of data assets. This enables not only consistent metadata access but also knowledge grounding, contextual reasoning, tool use, and more. As we approach the 1.0 milestone, our focus shifts from pure metadata storage to metadata-driven contextual engineering—a foundation we call the Metadata-driven Action System, which provides the building blocks for contextual engineering.&lt;/p&gt;

&lt;p&gt;The release of Apache Gravitino 1.0.0 marks a significant engineering step forward, with robust APIs, extensible connectors, enhanced governance primitives, and improved scalability and reliability in distributed environments. In the following sections, I will dive into the new features and architectural improvements introduced in Gravitino 1.0.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata-driven action system
&lt;/h2&gt;

&lt;p&gt;In version 1.0.0, we introduced three new components that enable us to build jobs to accomplish metadata-driven actions, such as table compaction, TTL data management, and PII identification. These three new components are: the statistics system, the policy system, and the job system.&lt;/p&gt;

&lt;p&gt;Taking table compaction as an example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firstly, users can define the table compaction policy in Gravitino and associate this policy with the tables that need to be compacted.&lt;/li&gt;
&lt;li&gt;Then, users can save the statistics of the table to Gravitino.&lt;/li&gt;
&lt;li&gt;Also, users can define a job template for the compaction.&lt;/li&gt;
&lt;li&gt;Lastly, users can use the statistics with the defined policy to generate the compaction parameters and use these parameters to trigger a compaction job based on the defined job templates.&lt;/li&gt;
&lt;/ul&gt;
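The workflow above can be sketched as plain decision logic. The record shapes, threshold names, and parameter names below are illustrative only, not Gravitino's actual statistics or policy APIs:

```python
# Hypothetical sketch of the metadata-driven compaction flow described above.
# Statistics and policy field names are made up for illustration.

def should_compact(stats: dict, policy: dict) -> bool:
    """Combine table statistics with a compaction policy to decide whether to act."""
    too_many_files = stats["file_count"] > policy["max_file_count"]
    small_files = stats["avg_file_size_bytes"] < policy["min_avg_file_size_bytes"]
    return too_many_files or small_files

def build_job_params(table: str, stats: dict) -> dict:
    """Derive compaction parameters from the statistics, to feed a job template."""
    return {
        "table": table,
        "target_file_size_bytes": 128 * 1024 * 1024,
        "input_files": stats["file_count"],
    }

policy = {"max_file_count": 100, "min_avg_file_size_bytes": 32 * 1024 * 1024}
stats = {"file_count": 450, "avg_file_size_bytes": 2 * 1024 * 1024}

if should_compact(stats, policy):
    params = build_job_params("sales.orders", stats)
    # In Gravitino, parameters like these would trigger a job
    # based on a registered job template.
    print(params["input_files"])  # 450
```

The point of the pattern is that the trigger decision lives entirely in metadata: statistics supply the facts, the policy supplies the rule, and the job system supplies the execution.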

&lt;h3&gt;
  
  
  Statistics system
&lt;/h3&gt;

&lt;p&gt;The statistics system is a new component for storing and retrieving statistics. You can define and store table- and partition-level statistics in Gravitino, and fetch them through Gravitino for different purposes.&lt;/p&gt;

&lt;p&gt;For details of how we designed this component, please see &lt;a href="https://github.com/apache/gravitino/issues/7268" rel="noopener noreferrer"&gt;#7268&lt;/a&gt;. For instructions on using the statistics system, refer to the documentation &lt;a href="https://gravitino.apache.org/docs/1.0.0/manage-statistics-in-gravitino/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy system
&lt;/h3&gt;

&lt;p&gt;The policy system enables you to define action rules in Gravitino, such as compaction rules or TTL rules. A defined policy can be associated with metadata, which means its rules will be enforced on the associated metadata. Users can leverage these enforced policies to decide how to trigger an action on that metadata.&lt;/p&gt;

&lt;p&gt;Please refer to the policy system &lt;a href="https://gravitino.apache.org/docs/1.0.0/manage-policies-in-gravitino" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to know how to use it. For more information on the policy system's implementation details, please refer to &lt;a href="https://github.com/apache/gravitino/issues/7139" rel="noopener noreferrer"&gt;#7139&lt;/a&gt;.&lt;/p&gt;
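As an illustration of a TTL-style rule, the sketch below selects expired partitions from partition-level metadata. The partition naming and retention field are invented for this example; consult the policy system documentation for the real policy format:

```python
from datetime import date, timedelta

def expired_partitions(partitions: dict, ttl_days: int, today: date) -> list:
    """Stand-in for a TTL policy: return partitions older than the retention window."""
    cutoff = today - timedelta(days=ttl_days)
    return [name for name, day in partitions.items() if day < cutoff]

# Hypothetical partition-level metadata: partition name -> partition date.
parts = {"dt=2025-01-01": date(2025, 1, 1), "dt=2025-09-20": date(2025, 9, 20)}
print(expired_partitions(parts, ttl_days=90, today=date(2025, 10, 5)))
# ['dt=2025-01-01']
```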

&lt;h3&gt;
  
  
  Job system
&lt;/h3&gt;

&lt;p&gt;The job system allows you to submit and run jobs through Gravitino. Users can register a job template, then trigger jobs based on that template. Gravitino submits the job to the dedicated job executor, such as Apache Airflow, manages the job lifecycle, and persists the job status. With the job system, users can run self-defined jobs to accomplish metadata-driven actions.&lt;/p&gt;

&lt;p&gt;In version 1.0.0, the initial implementation supports running jobs as local processes. If you want to know more about the design details, you can follow issue &lt;a href="https://github.com/apache/gravitino/issues/7154" rel="noopener noreferrer"&gt;#7154&lt;/a&gt;. User-facing documentation can be found &lt;a href="https://gravitino.apache.org/docs/1.0.0/manage-jobs-in-gravitino" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
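The register-template-then-trigger pattern can be illustrated with simple parameter substitution. The template syntax and field names here are invented; see the job system documentation for Gravitino's actual template format:

```python
from string import Template

# Hypothetical job template; Gravitino's real template format may differ.
compaction_template = Template(
    "spark-submit compact.py --table $table --target-size $target_size"
)

def trigger_job(template: Template, params: dict) -> str:
    """Render a registered template into a concrete job invocation."""
    return template.substitute(params)

cmd = trigger_job(compaction_template, {"table": "sales.orders", "target_size": "128MB"})
print(cmd)
# spark-submit compact.py --table sales.orders --target-size 128MB
```

In the real system, rendering and submission happen server-side: Gravitino hands the rendered job to the configured executor and tracks its status.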

&lt;p&gt;The whole metadata-driven action system is still in an alpha phase in version 1.0.0. The community will continue to evolve the code and deliver Iceberg table maintenance as a reference implementation in the next version. Please stay tuned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-ready through the MCP server
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is a powerful protocol for bridging the gap between human language and machine interfaces. With MCP, users can communicate with an LLM in natural language, and the LLM can understand the context and invoke the appropriate tools.&lt;/p&gt;

&lt;p&gt;In version 1.0.0, the community officially delivered the MCP server for Gravitino. Users can launch it as a remote or local MCP server and connect to various MCP applications, such as Cursor and Claude Desktop. Additionally, we exposed all metadata-related interfaces as tools that MCP clients can call.&lt;/p&gt;

&lt;p&gt;With the Gravitino MCP server, users can manage and govern metadata, as well as perform metadata-driven actions using natural language. Please follow issue &lt;a href="https://github.com/apache/gravitino/issues/7483" rel="noopener noreferrer"&gt;#7483&lt;/a&gt; for more details. Additionally, you can refer to the &lt;a href="https://gravitino.apache.org/docs/1.0.0/gravitino-mcp-server" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for instructions on how to start the MCP server locally or in Docker.&lt;/p&gt;
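For reference, MCP clients such as Claude Desktop register servers under an `mcpServers` entry in their configuration file. The command and arguments below are placeholders, not the actual launch command; consult the Gravitino MCP server documentation for the real invocation:

```json
{
  "mcpServers": {
    "gravitino": {
      "command": "your-gravitino-mcp-launch-command",
      "args": ["--gravitino-uri", "http://localhost:8090"]
    }
  }
}
```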

&lt;h2&gt;
  
  
  Unified access control framework
&lt;/h2&gt;

&lt;p&gt;Gravitino introduced the RBAC system in a previous version, but it only offered the ability to grant privileges to roles and users, without enforcing access control when secure objects were manipulated. In 1.0.0, we completed this missing piece in Gravitino.&lt;/p&gt;

&lt;p&gt;Currently, users can set access control policies through our RBAC system and enforce these controls when accessing secure objects. For details, you can refer to the umbrella issue &lt;a href="https://github.com/apache/gravitino/issues/6762" rel="noopener noreferrer"&gt;#6762&lt;/a&gt;.&lt;/p&gt;
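Conceptually, enforcement means every operation on a securable object is checked against the caller's granted roles. A minimal model of that check (object names and privilege strings are illustrative, not Gravitino's actual securable-object or privilege names):

```python
# Minimal RBAC model: roles carry privileges on securable objects.
# All names and privilege strings below are made up for illustration.

roles = {
    "data_engineer": {("catalog.sales", "CREATE_TABLE"), ("catalog.sales", "SELECT_TABLE")},
    "analyst": {("catalog.sales", "SELECT_TABLE")},
}
user_roles = {"alice": ["analyst"]}

def allowed(user: str, obj: str, privilege: str) -> bool:
    """Enforcement: the user needs some role granting the privilege on the object."""
    return any((obj, privilege) in roles[r] for r in user_roles.get(user, []))

print(allowed("alice", "catalog.sales", "SELECT_TABLE"))  # True
print(allowed("alice", "catalog.sales", "CREATE_TABLE"))  # False
```

The 1.0.0 change is that checks like this now actually gate operations on secure objects, rather than the grants being recorded but never enforced.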

&lt;h2&gt;
  
  
  Support for multiple storage locations in model management
&lt;/h2&gt;

&lt;p&gt;Model management was introduced in Gravitino 0.9.0. Users have since requested support for multiple storage locations within a single model version, allowing them to select a model version with a preferred location.&lt;/p&gt;

&lt;p&gt;In 1.0.0, the community added support for multiple locations in model management. This feature is similar to the fileset’s support for multiple locations. Users can check the documentation &lt;a href="https://gravitino.apache.org/docs/1.0.0/manage-model-metadata-using-gravitino" rel="noopener noreferrer"&gt;here&lt;/a&gt; for more information; for implementation details, please refer to issue &lt;a href="https://github.com/apache/gravitino/issues/7363" rel="noopener noreferrer"&gt;#7363&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support the latest Apache Iceberg and Paimon versions
&lt;/h2&gt;

&lt;p&gt;In Gravitino 1.0.0, we have upgraded the supported Iceberg version to 1.9.0, which lays the groundwork for additional feature support in the next release. Additionally, we have upgraded the supported Paimon version to 1.2.0, introducing new features for Paimon support.&lt;/p&gt;

&lt;p&gt;You can see the issue &lt;a href="https://github.com/apache/gravitino/issues/6719" rel="noopener noreferrer"&gt;#6719&lt;/a&gt; for Iceberg upgrading and issue &lt;a href="https://github.com/apache/gravitino/issues/8163" rel="noopener noreferrer"&gt;#8163&lt;/a&gt; for Paimon upgrading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Various core features
&lt;/h2&gt;

&lt;p&gt;Core:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add the cache system in the Gravitino entity store &lt;a href="https://github.com/apache/gravitino/issues/7175" rel="noopener noreferrer"&gt;#7175&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Add Marquez integration as a lineage sink in Gravitino &lt;a href="https://github.com/apache/gravitino/issues/7396" rel="noopener noreferrer"&gt;#7396&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add Azure AD login support for OAuth authentication &lt;a href="https://github.com/apache/gravitino/issues/7538" rel="noopener noreferrer"&gt;#7538&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Catalogs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support StarRocks catalog management in Gravitino &lt;a href="https://github.com/apache/gravitino/issues/3302" rel="noopener noreferrer"&gt;#3302&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add custom configurations for clients &lt;a href="https://github.com/apache/gravitino/issues/7816" rel="noopener noreferrer"&gt;#7816&lt;/a&gt;, &lt;a href="https://github.com/apache/gravitino/issues/7817" rel="noopener noreferrer"&gt;#7817&lt;/a&gt;, &lt;a href="https://github.com/apache/gravitino/issues/7670" rel="noopener noreferrer"&gt;#7670&lt;/a&gt;, &lt;a href="https://github.com/apache/gravitino/issues/7456" rel="noopener noreferrer"&gt;#7456&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spark connector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upgrade the supported Kyuubi version &lt;a href="https://github.com/apache/gravitino/issues/7480" rel="noopener noreferrer"&gt;#7480&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;UI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add web UI for listing files / directories under a fileset &lt;a href="https://github.com/apache/gravitino/issues/7477" rel="noopener noreferrer"&gt;#7477&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add Helm chart deployment for the Iceberg REST catalog &lt;a href="https://github.com/apache/gravitino/issues/7159" rel="noopener noreferrer"&gt;#7159&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Behavior changes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compatible changes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rename the Hadoop catalog to fileset catalog &lt;a href="https://github.com/apache/gravitino/issues/7184" rel="noopener noreferrer"&gt;#7184&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Allow event listeners to modify Iceberg create-table requests &lt;a href="https://github.com/apache/gravitino/issues/6486" rel="noopener noreferrer"&gt;#6486&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Support returning aliases when listing model versions &lt;a href="https://github.com/apache/gravitino/issues/7307" rel="noopener noreferrer"&gt;#7307&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Breaking changes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Change the supported Java version to JDK 17 for the Gravitino server.&lt;/li&gt;
&lt;li&gt;Remove the Python 3.8 support for the Gravitino Python client &lt;a href="https://github.com/apache/gravitino/issues/7491" rel="noopener noreferrer"&gt;#7491&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Fix the unnecessary double encoding and decoding issue in the fileset get-location and list-files interfaces &lt;a href="https://github.com/apache/gravitino/issues/8335" rel="noopener noreferrer"&gt;#8335&lt;/a&gt;. This change is incompatible with older Java and Python clients; using an old client against a new server may cause decoding issues in some scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overall
&lt;/h2&gt;

&lt;p&gt;There are still lots of features, improvements, and bug fixes that are not mentioned here. We thank the community for their continued support and valuable contributions.&lt;/p&gt;

&lt;p&gt;Apache Gravitino 1.0.0 opens a new chapter, evolving from a data catalog into a smart catalog. We will continue to innovate and build, adding more Data and AI features. Please stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;This release acknowledges the hard work and dedication of all contributors who have helped make this release possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="mailto:1161623489@qq.com"&gt;1161623489@qq.com&lt;/a&gt;, Aamir, Aaryan Kumar Sinha, Ajax, Akshat Tiwari, Akshat kumar gupta, Aman Chandra Kumar, AndreVale69, Ashwil-Colaco, BIN, Ben Coke, Bharath Krishna, Brijesh Thummar, Bryan Maloyer, Cyber Star, Danhua Wang, Daniel, Daniele Carpentiero, Dentalkart399, Drinkaiii, Edie, Eric Chang, FANNG, Gagan B Mishra, George T. C. Lai, Guilherme Santos, Hatim Kagalwala, Jackeyzhe, Jarvis, JeonDaehong, Jerry Shao, Jimmy Lee, Joonha, Joonseo Lee, Joseph C., Justin Mclean, KWON TAE HEON, Kang, KeeProMise, Khawaja Abdullah Ansar, Kwon Taeheon, Kyle Lin, KyleLin0927, Lord of Abyss, MaAng, Mathieu Baurin, Maxspace1024, Mikshakecere, Mini Yu, Minji Kim, Minji Ryu, Nithish Kumar S, Pacman, Peidian li, Praveen, Qian Xia, Qiang-Liu, Qiming Teng, Raj Gupta, Ratnesh Rastogi, Raveendra Pujari, Reuben George, RickyMa, Rory, Sambhavi Pandey, Sébastien Brochet, Shaofeng Shi, Spiritedswordsman, Sua Bae, Surya B, Tarun, Tian Lu, Tianhang, Timur, Viral Kachhadiya, Will Guo, XiaoZ, Xiaojian Sun, Xun, Yftach Zur, Yuhui, Yujiang Zhong, Yunchi Pang, Zhengke Zhou, _.mung, ankamde, arjun, danielyyang, dependabot[bot], fad, fanng, gavin.wang, guow34, jackeyzhe, kaghatim, keepConcentration, kerenpas, kitoha, lipeidian, liuxian, liuxian131, lsyulong, mchades, mingdaoy, predator4ann, qbhan, raveendra11, roryqi, senlizishi, slimtom95, taylor.fan, taylor12805, teo, tian bao, vishnu, yangyang zhong, youngseojeon, yuhui, yunchi, yuqi, zacsun, zhanghan, zhanghan18, 梁自强, 박용현, 배수아, 신동재, 이승주, 이준하&lt;/p&gt;

&lt;p&gt;Apache, Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Ranger, Apache Spark, Apache Paimon and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Doris x Gravitino: Unified Metadata Management for Modern Lakehouse Architecture</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Sun, 28 Sep 2025 21:54:13 +0000</pubDate>
      <link>https://dev.to/datastrato/doris-x-gravitino-unified-metadata-management-for-modern-lakehouse-architecture-1ca8</link>
      <guid>https://dev.to/datastrato/doris-x-gravitino-unified-metadata-management-for-modern-lakehouse-architecture-1ca8</guid>
      <description>&lt;p&gt;With the rapid evolution of data lake technologies, building unified, secure, and efficient lakehouse architectures has become a core challenge for enterprise digital transformation. &lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt; serves as a next-generation unified metadata management platform, providing comprehensive solutions for data governance in multi-cloud and multi-engine environments. It not only supports unified management of various data sources and compute engines but also ensures secure and controllable data access through its credential management mechanism (Credential Vending).&lt;/p&gt;

&lt;p&gt;This article provides an in-depth introduction to the deep integration between &lt;a href="https://doris.apache.org/" rel="noopener noreferrer"&gt;Apache Doris&lt;/a&gt; and Apache Gravitino, building a modern lakehouse architecture based on the Iceberg REST Catalog. Through Gravitino's unified metadata management and dynamic credential vending capabilities, we achieve efficient and secure access to Iceberg data stored on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn from this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS Environment Setup&lt;/strong&gt;: How to create S3 buckets and IAM roles in AWS, configure secure credential management systems for Gravitino, and implement dynamic temporary credential distribution mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gravitino Deployment and Configuration&lt;/strong&gt;: How to quickly deploy Gravitino services, configure Iceberg REST Catalog, and enable vended-credentials functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connecting Doris to Gravitino&lt;/strong&gt;: Detailed explanation of how Doris accesses Iceberg data through Gravitino's REST API, supporting two core storage access modes:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Static Credential Mode: Doris uses fixed AK/SK to directly access S3
2. Dynamic Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS AssumeRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
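The practical difference between the two modes is credential lifetime: static keys never expire, while vended credentials must be refreshed. A sketch of the client-side handling in dynamic mode, with the vending call simulated (field names mirror STS output but are not a real client API):

```python
import time

# Sketch of how a client consumes vended (expiring) credentials.
# The vending call itself is simulated by a callable.

class VendedCredentials:
    def __init__(self, fetch, skew_secs=60):
        self._fetch = fetch    # returns (access_key, secret_key, session_token, expires_at)
        self._skew = skew_secs # refresh this many seconds before actual expiry
        self._creds = None

    def get(self, now=None):
        now = time.time() if now is None else now
        if self._creds is None or now >= self._creds[3] - self._skew:
            self._creds = self._fetch()  # re-vend via the REST catalog
        return self._creds

calls = []
def fake_vend():
    calls.append(1)
    return ("ASIA...", "secret", "token", time.time() + 3600)

creds = VendedCredentials(fake_vend)
creds.get()
creds.get()
print(len(calls))  # 1: the second call reuses the cached, unexpired credentials
```

In static mode this machinery is unnecessary, at the cost of long-lived keys sitting in Doris's catalog configuration.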
&lt;h2&gt;
  
  
  About Apache Doris
&lt;/h2&gt;

&lt;p&gt;Apache Doris is the fastest analytical and search database for the AI era.&lt;/p&gt;

&lt;p&gt;It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.&lt;/p&gt;

&lt;p&gt;With Doris, users can easily build a real-time lakehouse data platform.&lt;/p&gt;
&lt;h2&gt;
  
  
  About Apache Gravitino
&lt;/h2&gt;

&lt;p&gt;Apache Gravitino is a high-performance, distributed federated metadata lake open-source project designed to manage metadata across different regions and data sources in public and private clouds. It supports various types of data catalogs including Apache Hive, Apache Iceberg, Apache Paimon, Apache Doris, MySQL, PostgreSQL, Fileset (HDFS, S3, GCS, OSS, JuiceFS, etc.), Streaming (Apache Kafka), Models, and more (continuously expanding), providing users with unified metadata access for Data and AI assets.&lt;/p&gt;
&lt;h2&gt;
  
  
  Hands-on Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. AWS Environment Setup
&lt;/h3&gt;

&lt;p&gt;Before we begin, we need to prepare a complete infrastructure on AWS, including S3 buckets and a carefully designed IAM role system, which forms the foundation for building a secure and reliable lakehouse architecture.&lt;/p&gt;
&lt;h4&gt;
  
  
  1.1 Create S3 Bucket
&lt;/h4&gt;

&lt;p&gt;First, create a dedicated S3 bucket to store Iceberg data:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create S3 bucket&lt;/span&gt;
aws s3 mb s3://gravitino-iceberg-demo &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2
&lt;span class="c"&gt;# Verify bucket creation&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;ls&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;gravitino-iceberg-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  1.2 Design IAM Role Architecture
&lt;/h4&gt;

&lt;p&gt;To implement secure credential management, we need to create an IAM role for Gravitino to use through the STS AssumeRole mechanism. This design follows the security best practices of least privilege and separation of duties.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create Data Access Role&lt;/p&gt;

&lt;p&gt;Create trust policy file gravitino-trust-policy.json:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::YOUR_ACCOUNT_ID:root"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create IAM Role&lt;/p&gt;

&lt;p&gt;For demonstration simplicity, we'll use AWS managed policies directly. For production environments, we recommend creating more fine-grained permission controls.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create IAM role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role-name&lt;/span&gt; gravitino-iceberg-access &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://gravitino-trust-policy.json &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"Gravitino Iceberg data access role"&lt;/span&gt;

&lt;span class="c"&gt;# Attach S3 full access permissions (for testing; use fine-grained permissions in production)&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role-name&lt;/span&gt; gravitino-iceberg-access &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3FullAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Verify IAM Configuration&lt;/p&gt;

&lt;p&gt;Verify that the role configuration is correct:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test role assumption functionality&lt;/span&gt;
aws sts assume-role &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role-arn&lt;/span&gt; arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role-session-name&lt;/span&gt; gravitino-test
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Example successful response:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AccessKeyId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ASIA***************"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"SecretAccessKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"***************************"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"SessionToken"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IQoJb3JpZ2luX2VjEOj..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Expiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-07-23T08:33:30+00:00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Gravitino Deployment and Configuration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Download and Install Gravitino&lt;/p&gt;

&lt;p&gt;We'll use Gravitino's pre-compiled version for quick environment setup:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create working directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;gravitino-deployment &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;gravitino-deployment

&lt;span class="c"&gt;# Download Gravitino main program&lt;/span&gt;
wget https://dlcdn.apache.org/gravitino/0.9.1/gravitino-0.9.1-bin.tar.gz

&lt;span class="c"&gt;# Extract and install&lt;/span&gt;
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; gravitino-0.9.1-bin.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;gravitino-0.9.1-bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install Required Dependencies&lt;/p&gt;

&lt;p&gt;To support AWS S3 and credential management functionality, we need to install additional JAR packages:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create necessary directory structure&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; logs
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /tmp/gravitino

&lt;span class="c"&gt;# Download Iceberg AWS bundle&lt;/span&gt;
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.6.1/iceberg-aws-bundle-1.6.1.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-P&lt;/span&gt; catalogs/lakehouse-iceberg/libs/
&lt;span class="nb"&gt;cp &lt;/span&gt;catalogs/lakehouse-iceberg/libs/iceberg-aws-bundle-1.6.1.jar iceberg-rest-server/libs/

&lt;span class="c"&gt;# Download Gravitino AWS support package (core for vended-credentials functionality)&lt;/span&gt;
wget https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/0.9.1/gravitino-aws-0.9.1.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-P&lt;/span&gt; iceberg-rest-server/libs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Configure Gravitino Service&lt;/p&gt;

&lt;p&gt;Create or edit the conf/gravitino.conf file:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Iceberg catalog backend configuration; default memory config is for testing only, should be changed to jdbc
# Using H2 here, MySQL recommended for production:
&lt;/span&gt;&lt;span class="py"&gt;gravitino.iceberg-rest.catalog-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.jdbc-driver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.h2.Driver&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.jdbc-user&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.jdbc-password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;iceberg123&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.jdbc-initialize&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.warehouse&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;s3://gravitino-iceberg-demo/warehouse&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.io-impl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;org.apache.iceberg.aws.s3.S3FileIO&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;

&lt;span class="c"&gt;# Enable Vended-Credentials functionality
# Note: Gravitino uses these AK/SK to call STS AssumeRole and obtain temporary credentials for client distribution
&lt;/span&gt;&lt;span class="py"&gt;gravitino.iceberg-rest.credential-providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;s3-token&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-access-key-id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;YOUR_AWS_ACCESS_KEY_ID&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-secret-access-key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;YOUR_AWS_SECRET_ACCESS_KEY&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-role-arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;us-west-2&lt;/span&gt;
&lt;span class="py"&gt;gravitino.iceberg-rest.s3-token-expire-in-secs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Note: Replace the warehouse path, s3-region, access key ID, secret access key, YOUR_ACCOUNT_ID, and the other placeholder properties in the configuration above with your own values.&lt;/p&gt;
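&lt;p&gt;Before starting the service, it can be worth confirming that the role is actually assumable with the configured credentials. A quick sanity check, assuming the AWS CLI is installed and configured with the same access key as above:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify that STS AssumeRole succeeds with your AK/SK and role ARN&lt;/span&gt;
aws sts assume-role \
  --role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access \
  --role-session-name gravitino-sanity-check
&lt;/code&gt;&lt;/pre&gt;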
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start Services&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Gravitino service&lt;/span&gt;
./bin/gravitino.sh start

&lt;span class="c"&gt;# Check service status&lt;/span&gt;
./bin/gravitino.sh status

&lt;span class="c"&gt;# View logs&lt;/span&gt;
&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; logs/gravitino-server.log

&lt;span class="c"&gt;# Verify main service&lt;/span&gt;
curl &lt;span class="nt"&gt;-v&lt;/span&gt; http://localhost:8090/api/version

&lt;span class="c"&gt;# Verify Iceberg REST service&lt;/span&gt;
curl &lt;span class="nt"&gt;-v&lt;/span&gt; http://localhost:9001/iceberg/v1/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create Gravitino Metadata Structure&lt;/p&gt;

&lt;p&gt;Create the necessary metadata structures through the REST API. &lt;code&gt;MetaLake&lt;/code&gt; is the top level of Gravitino's metadata hierarchy. If you already have one, you can skip this step; otherwise, create a &lt;code&gt;metalake&lt;/code&gt; named &lt;code&gt;lakehouse&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create MetaLake&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "lakehouse",
    "comment": "Gravitino lakehouse for Doris integration",
    "properties": {}
  }'&lt;/span&gt; http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;
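&lt;p&gt;To confirm the metalake was created, you can list all metalakes (a quick check; the exact response shape may vary by Gravitino version):&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List metalakes; the output should include "lakehouse"&lt;/span&gt;
curl -H "Accept: application/vnd.gravitino.v1+json" \
  http://localhost:8090/api/metalakes
&lt;/code&gt;&lt;/pre&gt;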


&lt;p&gt;Next, we create an Iceberg Catalog, which can be done through the Web GUI or REST API. For example, here we create a catalog named &lt;code&gt;iceberg_catalog&lt;/code&gt; with &lt;code&gt;JDBC&lt;/code&gt; as the backend metadata storage and the warehouse address pointing to the S3 bucket created earlier:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create Iceberg Catalog&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/vnd.gravitino.v1+json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "iceberg_catalog",
    "type": "RELATIONAL",
    "provider": "lakehouse-iceberg",
    "comment": "Iceberg catalog with S3 storage and vended credentials",
    "properties": {
      "catalog-backend": "jdbc",
      "uri": "jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL",
      "jdbc-user": "iceberg",
      "jdbc-password": "iceberg123",
      "jdbc-driver": "org.h2.Driver",
      "jdbc-initialize": "true",
      "warehouse": "s3://gravitino-iceberg-demo/warehouse",
      "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
      "s3-region": "us-west-2"
    }
  }'&lt;/span&gt; http://localhost:8090/api/metalakes/lakehouse/catalogs
&lt;/code&gt;&lt;/pre&gt;
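&lt;p&gt;Similarly, you can verify the catalog registration by listing the catalogs under the metalake:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List catalogs; the output should include "iceberg_catalog"&lt;/span&gt;
curl -H "Accept: application/vnd.gravitino.v1+json" \
  http://localhost:8090/api/metalakes/lakehouse/catalogs
&lt;/code&gt;&lt;/pre&gt;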


&lt;p&gt;At this point, the necessary metadata has been created in Gravitino.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  3. Connecting Doris to Gravitino
&lt;/h2&gt;

&lt;p&gt;Doris can connect to Gravitino and access Iceberg data on S3 through two different approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static Credential Mode: Doris accesses S3 directly with pre-configured, long-lived AK/SK&lt;/li&gt;
&lt;li&gt;Vended (Dynamic) Credential Mode: Gravitino obtains temporary credentials via STS and distributes them to Doris&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Vended Credential Mode (Recommended)&lt;/p&gt;

&lt;p&gt;Enabling vended credential mode is the more secure approach. In this mode, Gravitino dynamically generates temporary credentials and distributes them to clients like Doris, thereby minimizing the risk of credential leakage:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Catalog with vended credential mode&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;gravitino_vending&lt;/span&gt; &lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.catalog.type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.uri'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'http://127.0.0.1:9001/iceberg/'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.warehouse'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'warehouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.vended-credentials-enabled'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.endpoint'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://s3.us-west-2.amazonaws.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.region'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'us-west-2'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Static Credential Mode&lt;/p&gt;

&lt;p&gt;In this mode, Doris directly uses fixed AWS credentials to access S3, with Gravitino providing only metadata services:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create Catalog with static credential mode&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;gravitino_static&lt;/span&gt; &lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iceberg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.catalog.type'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.uri'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'http://127.0.0.1:9001/iceberg/'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.warehouse'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'warehouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'iceberg.rest.vended-credentials-enabled'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.endpoint'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'https://s3.us-west-2.amazonaws.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.access_key'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'YOUR_AWS_ACCESS_KEY_ID'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.secret_key'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'YOUR_AWS_SECRET_ACCESS_KEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'s3.region'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'us-west-2'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Verify Connection and Data Operations&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Verify connection&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gravitino_vending&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Switch to vended credentials catalog&lt;/span&gt;
&lt;span class="n"&gt;SWITCH&lt;/span&gt; &lt;span class="n"&gt;gravitino_vending&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create database and table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;gravitino_vending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;gravitino_table&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'write-format'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'parquet'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert test data&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;gravitino_table&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Doris'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Gravitino'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Query verification&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gravitino_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Following this guide, you can build a modern lakehouse architecture based on Gravitino and Doris. This architecture delivers high performance and availability, while security mechanisms such as vended credentials keep data access secure and compliant. As data volumes grow and business requirements evolve, it can be extended flexibly to meet enterprise-level needs. If you are interested in these two projects, please star them on GitHub: &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;https://github.com/apache/gravitino&lt;/a&gt; and &lt;a href="https://github.com/apache/doris" rel="noopener noreferrer"&gt;https://github.com/apache/doris&lt;/a&gt;. We look forward to your participation in community issue discussions and PR contributions!&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>apachedoris</category>
      <category>apachegravitino</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Metadata Matters: The Force Driving Data and AI</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Sat, 27 Sep 2025 01:47:34 +0000</pubDate>
      <link>https://dev.to/datastrato/why-metadata-matters-the-force-driving-data-and-ai-b48</link>
      <guid>https://dev.to/datastrato/why-metadata-matters-the-force-driving-data-and-ai-b48</guid>
      <description>&lt;p&gt;In a world drowning in data, we often focus on the information itself – the records, the tables, the images, and the videos. What if the most important asset isn't just the data, but also the "data about data"? This is the essence of &lt;strong&gt;metadata&lt;/strong&gt;. Think of it as the DNA of your information, providing context, meaning, and structure.&lt;/p&gt;

&lt;p&gt;While metadata has long been the backbone of traditional data management, its role is now exploding in importance. It is becoming the critical link between human understanding and AI-driven intelligence. Let's explore why metadata is more crucial than ever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation of Traditional Data Management
&lt;/h2&gt;

&lt;p&gt;In the classic data ecosystem, metadata is the key to creating a single source of truth. Without it, data exists in isolated "silos"—departments have their own databases, and nobody knows what information exists elsewhere. Metadata changes this by acting as a universal translator.&lt;/p&gt;

&lt;p&gt;A well-defined &lt;strong&gt;metadata catalog&lt;/strong&gt; serves as a central library, documenting everything from data ownership and access permissions to data types and refresh schedules. This centralized view eliminates data silos and lays the groundwork for robust &lt;strong&gt;data governance&lt;/strong&gt;. When you have a unified view of your data, you can enforce quality standards, ensure regulatory compliance (like GDPR or HIPAA), and manage sensitive information effectively.&lt;/p&gt;

&lt;p&gt;Furthermore, metadata is the engine behind &lt;strong&gt;data lineage&lt;/strong&gt;. It tracks the journey of data from its source to its final destination, showing every transformation it undergoes and each point of interaction. This is invaluable for troubleshooting data quality issues, auditing processes, and understanding the complete history of your information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fuel for AI Models: Data for AI
&lt;/h2&gt;

&lt;p&gt;To train a powerful AI model, you need high-quality, relevant data. For large-scale multi-modal models that understand everything from text to images to audio, the challenge is immense. You can't just throw raw data at a model; without context, the data is simply noise.&lt;/p&gt;

&lt;p&gt;This is where metadata shines. For multi-modal AI, metadata acts as the "label" or "tag" that allows the model to interpret the data it's analyzing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt;: Annotations that identify objects ("car," "dog") and actions ("running," "jumping").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt;: Sentiment tags ("positive," "negative"), keywords, and topic classifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: Transcriptions, speaker identification, and background noise annotations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this crucial metadata, a multi-modal model would be blind and deaf. It's the metadata that allows the model to connect a picture of a cat with the word "cat," or an audio clip of a car with the term "vehicle." This process of &lt;strong&gt;data annotation&lt;/strong&gt; and labeling, powered by metadata, is the single most important step in preparing data for effective AI training.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Brain for AI Agents: AI for Data
&lt;/h2&gt;

&lt;p&gt;The most exciting development is the shift from "Data For AI" to "AI For Data." This is where AI models don't just consume data – they actively manage and understand it. Metadata provides the cognitive foundation for this new generation of &lt;strong&gt;data agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Imagine an AI that can answer complex business questions like: "What was the total revenue from our top five products in Europe last quarter?" To do this, the AI needs more than just access to a database. It needs metadata to act as its "brain." This metadata helps the model understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt;: What does the "revenue" column actually mean? Is it net or gross?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: How does the "product" table connect to the "sales" table?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantics&lt;/strong&gt;: What does "Europe" refer to in the context of the data? Is it a country or a region?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the power of &lt;strong&gt;semantic metadata&lt;/strong&gt;, which goes beyond simple descriptions to map the meaning and relationships of data. By integrating metadata with large language models (LLMs) and other tools, we can create data agents that understand the nuance of your business. These agents can autonomously clean data, generate reports, and even orchestrate complex data pipelines—all by using metadata as their guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future is Now: From Data Management to Data Intelligence
&lt;/h2&gt;

&lt;p&gt;The journey of metadata is a continuous evolution. We are moving beyond traditional data management – passive registries of information – towards something far more intelligent. This new paradigm is often called &lt;strong&gt;Data Intelligence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Data intelligence isn't just a data repository; it’s an AI-powered system that automatically understands, enriches, and connects metadata. It can infer relationships, suggest improvements, and serve as the central brain for all your AI-powered data initiatives.&lt;/p&gt;

&lt;p&gt;This vision is at the core of projects like &lt;strong&gt;Apache Gravitino&lt;/strong&gt;. The release of &lt;strong&gt;Apache Gravitino 1.0.0&lt;/strong&gt; marks a major step forward, designed from the ground up to be a modern metadata lake for data and AI. It provides a unified, open-source metadata management solution that is built to support the very transitions we've discussed: from addressing traditional data silos to providing the essential semantic layer that will power the next generation of AI applications. With Apache Gravitino, metadata is no longer a static asset but a dynamic and collaborative force in the data ecosystem.&lt;/p&gt;

&lt;p&gt;Please stay tuned for our follow-up blogs about the Gravitino 1.0.0 technical deep dive.&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>data</category>
      <category>datamanagement</category>
      <category>ai</category>
    </item>
    <item>
      <title>Catalogs as Context: How Metadata is Powering the Next Wave of AI</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Fri, 19 Sep 2025 23:19:27 +0000</pubDate>
      <link>https://dev.to/datastrato/catalogs-as-context-how-metadata-is-powering-the-next-wave-of-ai-1dem</link>
      <guid>https://dev.to/datastrato/catalogs-as-context-how-metadata-is-powering-the-next-wave-of-ai-1dem</guid>
      <description>&lt;p&gt;The promise of AI and LLMs to revolutionize business is immense, but for many organizations, it's blocked by a significant hurdle: data chaos. While our historic focus on the "3Vs"—Volume, Velocity, and Variety—advanced data architectures, it also created complex silos that trap data, making it difficult to use for effective AI.&lt;/p&gt;

&lt;p&gt;We believe the solution lies not in managing more data, but in understanding it better. The key is context, which is powered by metadata. A unified metadata layer, acting as a central "brain" for your data ecosystem, is the essential component to unlock data for AI, enabling both powerful insights and robust governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The End of an Era: Why Our Old Data Goals Are Failing Us
&lt;/h2&gt;

&lt;p&gt;The data landscape is fundamentally shifting, and the paradigms that brought us here are beginning to show their limits. We see three major challenges confronting modern data platforms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Diminishing Returns: With the end of Moore's Law, we can no longer solve data problems simply by adding more hardware.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Crushing Complexity: The modern data stack has become a tangled web of tools, creating massive overhead that slows innovation and increases risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Push to Intelligence: Data platforms must evolve beyond simple storage to intelligently understand and act on data, much like cars evolved from speed machines to autonomous vehicles.  &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Metadata: The Key to Unlocking AI
&lt;/h2&gt;

&lt;p&gt;This is where metadata—data about your data—comes in. For too long, it’s been an afterthought. In the AI era, it’s your most critical asset. Think of it as the bridge connecting the powerful brain of an LLM to your specific business data. Without it, AI is flying blind.&lt;/p&gt;

&lt;p&gt;Good metadata management delivers on three key things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clear Understanding: It's your universal "data dictionary," making sure everyone and every system is on the same page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistent Governance: It provides a single place to manage security, quality, and compliance rules everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smart Automation: It gives AI the context it needs to automate tasks and make decisions correctly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Meet Apache Gravitino: Your Data's Central Brain
&lt;/h2&gt;

&lt;p&gt;That's where Apache Gravitino comes in. We're excited to be building this open-source "catalog of catalogs"—a single place to manage all your metadata. Gravitino doesn't replace your existing systems. Instead, it works with them by providing a unified layer on top, which unlocks several key advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Single Source of Truth: Eliminate ambiguity and ensure everyone—and every system—is working with the same understanding of your data assets.&lt;/li&gt;
&lt;li&gt;Improved Efficiency &amp;amp; Discovery: Radically simplify the process of finding and using the right data for any task.&lt;/li&gt;
&lt;li&gt;Enhanced Data Quality &amp;amp; Governance: Define and enforce data quality rules, access policies, and compliance standards from one central, authoritative place.&lt;/li&gt;
&lt;li&gt;Empowering LLMs: Provide your AI models with the rich, reliable, and well-governed context they need to perform effectively and safely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future is Agentic: Putting Your Metadata to Work
&lt;/h2&gt;

&lt;p&gt;Centralizing metadata is the first step. The next is to build systems that can act on it intelligently. The future of data management is "agentic." Our roadmap for Gravitino includes building a framework for specialized AI agents that can automate today's most complex data tasks, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated Data Engineering: Imagine agents that can understand a natural language request, discover the relevant data across your entire ecosystem, and automatically build the necessary data pipelines.&lt;/li&gt;
&lt;li&gt;Automated Data Governance: Picture agents that can automatically scan, classify, and tag sensitive data, applying the correct governance policies without manual intervention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Build Your Data's Brain Before You Build Your AI
&lt;/h2&gt;

&lt;p&gt;The journey to becoming an AI-driven organization requires a shift in focus—from simply collecting data to truly understanding it. In this new era, a unified metadata catalog isn't a "nice-to-have"; it is a foundational requirement. You cannot build a powerful, trustworthy AI system on a chaotic and poorly understood data foundation.&lt;/p&gt;

&lt;p&gt;The work on Apache Gravitino is just beginning, and we are excited about the future. The project graduated to become an Apache Top-Level Project in May 2025, and we invite you to join us on this journey.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the project on our official website&lt;/li&gt;
&lt;li&gt;Star and contribute to the code on our Apache Gravitino repository.&lt;/li&gt;
&lt;li&gt;Join the conversation by subscribing to our mailing lists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, we can build the open standard for the next generation of intelligent, metadata-driven data platforms.&lt;/p&gt;

</description>
      <category>metadata</category>
      <category>ai</category>
      <category>apachegravitino</category>
    </item>
    <item>
      <title>Apache Gravitino: Production-Ready Unified Metadata for Enterprise Data</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Sat, 23 Aug 2025 05:15:35 +0000</pubDate>
      <link>https://dev.to/datastrato/apache-gravitino-production-ready-unified-metadata-for-enterprise-data-2dem</link>
      <guid>https://dev.to/datastrato/apache-gravitino-production-ready-unified-metadata-for-enterprise-data-2dem</guid>
      <description>&lt;p&gt;&lt;small&gt;Jerry Shao, CTO &amp;amp; Co-founder of Datastrato, explained the vision and capabilities of unified metadata management at &lt;a href="https://www.youtube.com/watch?v=beNdWVPz9-s&amp;amp;t=252s" rel="noopener noreferrer"&gt;Scaling Iceberg Adoption at Pinterest with Gravitino&lt;/a&gt;.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Session 1: The Modern Data Challenge - Why "Catalog of Catalogs" Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Enterprise Data Silos Problem
&lt;/h3&gt;

&lt;p&gt;Modern data-driven companies face an inevitable challenge: &lt;strong&gt;data silos everywhere&lt;/strong&gt;. Take a typical enterprise—they'll have a Hadoop-based data lake for ETL and batch processing, a data warehouse for ad hoc analytics, streaming processing stacks for real-time requirements, and machine learning platforms for AI workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result?&lt;/strong&gt; Each system serves its purpose well, but &lt;strong&gt;data becomes fragmented across isolated islands.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Traditional "Unified Data" Approach Falls Short
&lt;/h3&gt;

&lt;p&gt;The conventional wisdom says: "Unify all your data in one storage layer." Lakehouse technologies attempt this, trying to force everything into a single system.&lt;/p&gt;

&lt;p&gt;But here's the problem: &lt;strong&gt;Current lakehouse technologies cannot adequately support streaming analytics AND machine learning AND traditional analytics.&lt;/strong&gt; Each workload has unique requirements that no single system can perfectly address.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Gravitino's Revolutionary Insight
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Instead of asking "How do we unify data together?"&lt;br&gt;
We asked: "Can we unify the metadata together?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This shift in thinking is profound. Every data system needs a catalog to manage its metadata. So rather than moving massive amounts of data, &lt;strong&gt;why not create a unified layer that manages the catalogs themselves?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa768tjimmlq3vmhpx7rs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa768tjimmlq3vmhpx7rs.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Gravitino is a "catalog of catalogs"&lt;/strong&gt; - a metadata management platform that provides unified governance and access across diverse data systems without forcing you to abandon existing investments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session 2: Gravitino's Architecture - Unification Without Migration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktj3qj64533shjhnribi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktj3qj64533shjhnribi.png" alt=" " width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Generic Metadata Object Model
&lt;/h3&gt;

&lt;p&gt;At Gravitino's core is a &lt;strong&gt;universal metadata framework&lt;/strong&gt; that represents different types of data through consistent interfaces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unified Table Management&lt;/strong&gt; Tables from Hive, Iceberg, PostgreSQL, and other systems are represented through the same metadata model, enabling consistent operations across platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive Data Type Support&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tables&lt;/strong&gt;: Traditional structured data across any system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filesets&lt;/strong&gt;: Direct file and directory management on HDFS, S3, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: Machine learning model metadata and versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topics&lt;/strong&gt;: Streaming data topics and configurations&lt;/li&gt;
&lt;/ul&gt;
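<p>Because these object types share one metadata model, they are managed through the same REST surface. As an illustrative sketch (the metalake, catalog, and schema names here are hypothetical, and the payload follows Gravitino's fileset REST API), registering a fileset looks much like registering a table:<br>
</p>
<pre class="highlight shell"><code><span class="c"># Register an external fileset pointing at an existing S3 prefix</span>
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "training_data",
    "type": "EXTERNAL",
    "comment": "raw files for ML training",
    "storageLocation": "s3://my-bucket/training_data",
    "properties": {}
  }' http://localhost:8090/api/metalakes/my_metalake/catalogs/my_fileset_catalog/schemas/my_schema/filesets
</code></pre>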

&lt;h3&gt;
  
  
  The Connection Layer: Universal Data System Integration
&lt;/h3&gt;

&lt;p&gt;Gravitino connects to diverse data systems through specialized connectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive Connector&lt;/strong&gt;: Integrates with Hive Metastore ecosystems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JDBC Connector&lt;/strong&gt;: Connects to relational databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg Connector&lt;/strong&gt;: Native Apache Iceberg support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Connectors&lt;/strong&gt;: Extensible framework for new systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dual API Strategy: Standards Compliance + Innovation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gravitino Unified REST APIs&lt;/strong&gt; Generic operations across all data types and systems through consistent interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native Iceberg REST APIs&lt;/strong&gt; Full compliance with Iceberg REST specification at &lt;code&gt;/v1/namespaces&lt;/code&gt;, &lt;code&gt;/v1/tables&lt;/code&gt;, supporting standard Iceberg clients while adding enterprise capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Power of Interoperability&lt;/strong&gt; Both API sets operate on the same underlying data, so operations through one interface are immediately reflected in the other.&lt;/p&gt;
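<p>In practice, this means any standard Iceberg REST client can talk to the same service. For example, assuming a local deployment that exposes the Iceberg REST server on port 9001 with an <code>/iceberg</code> prefix (the default layout in Gravitino), the spec-defined endpoints can be exercised directly:<br>
</p>
<pre class="highlight shell"><code><span class="c"># List namespaces via the standard Iceberg REST API</span>
curl -s http://localhost:9001/iceberg/v1/namespaces

<span class="c"># Load a table's metadata (namespace and table names are illustrative)</span>
curl -s http://localhost:9001/iceberg/v1/namespaces/demo/tables/my_table
</code></pre>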

&lt;h2&gt;
  
  
  Session 3: Gravitino IRC - Production-Ready Iceberg REST Catalog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Beyond Basic REST Catalog Implementation
&lt;/h3&gt;

&lt;p&gt;While Apache Iceberg provides a reference REST catalog implementation, &lt;strong&gt;Gravitino IRC is built for enterprise production requirements&lt;/strong&gt; with enhanced capabilities that standard implementations lack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Federated Catalog Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge&lt;/strong&gt;: Organizations use different catalog backends (Hive Metastore, JDBC databases, cloud-native solutions) and don't want to abandon existing investments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gravitino's Solution&lt;/strong&gt;: Build REST endpoints &lt;strong&gt;on top of&lt;/strong&gt; existing catalogs rather than replacing them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq54gmb462snu1306q7it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq54gmb462snu1306q7it.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;User Applications → Iceberg REST Interface → Gravitino IRC → Existing Catalogs&lt;br&gt;
                                                 ├── Hive Metastore&lt;br&gt;
                                                 ├── JDBC Catalog&lt;br&gt;
                                                 └── Future: S3, Polaris&lt;/p&gt;
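&lt;p&gt;A minimal sketch of the routing idea: the Iceberg REST surface stays identical while each namespace resolves to whichever catalog backend owns it. The backend names and the routing rule here are hypothetical.&lt;/p&gt;

```python
# Illustrative only: a federated IRC resolves an incoming Iceberg REST
# request to the backend catalog that owns the namespace.

BACKENDS = {
    "warehouse_hive": "hive-metastore",
    "warehouse_pg": "jdbc-postgresql",
}

def route(namespace: str) -> str:
    """Pick the backend catalog for a namespace; the REST API stays the same."""
    catalog = namespace.split(".", 1)[0]
    try:
        return BACKENDS[catalog]
    except KeyError:
        raise ValueError(f"no backend registered for {catalog!r}")

print(route("warehouse_hive.sales"))   # served by the Hive Metastore backend
print(route("warehouse_pg.orders"))    # served by the JDBC backend
```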

&lt;h3&gt;
  
  
  Dual API Interoperability
&lt;/h3&gt;

&lt;p&gt;Both Gravitino unified APIs and Iceberg REST APIs operate on the same underlying metadata, enabling seamless interoperability:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7asrbpj24z6vseuxp41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7asrbpj24z6vseuxp41.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Enhancements Over Standard Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh13j7hqasy4ypspm69bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh13j7hqasy4ypspm69bv.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enterprise Serviceability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integrated metrics systems&lt;/strong&gt;: Native Prometheus and Grafana support for comprehensive monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt;: Complete operation tracking for governance and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event framework&lt;/strong&gt;: Pre/post-event hooks for custom business logic&lt;/li&gt;
&lt;/ul&gt;
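&lt;p&gt;The pre/post-event idea can be sketched as a small dispatcher. This is a toy illustration of the pattern, not Gravitino’s actual listener API; consult the project docs for the real interfaces.&lt;/p&gt;

```python
# Toy pre/post-event dispatcher illustrating the hook pattern.
from typing import Callable

pre_hooks: list[Callable[[str, dict], None]] = []
post_hooks: list[Callable[[str, dict], None]] = []
audit_log = []

def audit(op, ctx):
    # Example post-hook: record every operation for compliance.
    audit_log.append((op, ctx.get("table")))

post_hooks.append(audit)

def with_hooks(op: str, ctx: dict, action: Callable[[], object]):
    for hook in pre_hooks:
        hook(op, ctx)          # e.g. validate or reject the request
    result = action()
    for hook in post_hooks:
        hook(op, ctx)          # e.g. emit audit or lineage events
    return result

with_hooks("create_table", {"table": "sales.orders"}, lambda: "created")
print(audit_log)
```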

&lt;p&gt;&lt;strong&gt;Flexible Deployment Options&lt;/strong&gt; Organizations can deploy Gravitino IRC as a unified service alongside other Gravitino APIs, or as a standalone Iceberg-focused service, depending on their architectural preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Performance and Reliability&lt;/strong&gt; Unlike basic implementations, Gravitino IRC includes intelligent caching, connection pooling, and failover mechanisms designed for enterprise-scale workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session 4: Enterprise Security and Governance at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  End-to-End Authentication Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Client Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OAuth2&lt;/strong&gt;: Modern web-based authentication for applications and users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kerberos&lt;/strong&gt;: Enterprise directory integration for secure environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable Framework&lt;/strong&gt;: Custom authentication methods for unique organizational requirements&lt;/li&gt;
&lt;/ul&gt;
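&lt;p&gt;For the OAuth2 path, a client typically performs a standard RFC 6749 client-credentials exchange before calling the server. The sketch below only assembles the token request; the endpoint, client id, and scope are placeholders.&lt;/p&gt;

```python
# Sketch of the OAuth2 client-credentials exchange a client performs
# before calling the metadata service. All values are placeholders.

def client_credentials_request(token_url, client_id, client_secret, scope):
    # Request body for an RFC 6749 client-credentials grant.
    return token_url, {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }

url, body = client_credentials_request(
    "https://auth.example.com/oauth2/token",
    "gravitino-client", "s3cret", "catalog")
# POST `body` to `url`, then send the returned access token as an
# "Authorization: Bearer ..." header on every subsequent API call.
print(body["grant_type"])
```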

&lt;p&gt;&lt;strong&gt;Backend Catalog Authentication&lt;/strong&gt; Different catalog systems require different authentication approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive Catalogs&lt;/strong&gt;: Kerberos and delegation token support with impersonation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JDBC Catalogs&lt;/strong&gt;: Secure username/password management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Catalogs&lt;/strong&gt;: Native cloud identity integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Layer Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HDFS Integration&lt;/strong&gt;: Kerberos and delegation token support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage&lt;/strong&gt;: Secure credential vending that provides temporary, scoped access tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-System&lt;/strong&gt;: Consistent security model regardless of backend storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Role-Based Access Control (RBAC)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s79qvd44xz3v5nozeon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s79qvd44xz3v5nozeon.png" alt=" " width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive Identity Management&lt;/strong&gt; Through Gravitino's unified REST APIs, administrators can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add users and groups to the system&lt;/li&gt;
&lt;li&gt;Create roles with specific privileges on different entities&lt;/li&gt;
&lt;li&gt;Bind roles to users with fine-grained control&lt;/li&gt;
&lt;li&gt;Enforce policies consistently across all connected systems&lt;/li&gt;
&lt;/ul&gt;
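&lt;p&gt;The four steps above can be mirrored by a toy in-memory model. In practice these objects live server-side and are managed through Gravitino’s REST APIs; this is only a sketch of the semantics.&lt;/p&gt;

```python
# Toy RBAC model: add users, create roles with privileges on an entity,
# bind roles to users, then enforce a permission check.

users, roles, bindings = set(), {}, {}

def add_user(u): users.add(u)
def create_role(role, privileges, entity): roles[role] = (frozenset(privileges), entity)
def bind(user, role): bindings.setdefault(user, set()).add(role)

def allowed(user, privilege, entity) -> bool:
    # Checked before any read or write is permitted.
    return any(privilege in privs and ent == entity
               for privs, ent in (roles[r] for r in bindings.get(user, ())))

add_user("ada")
create_role("sales_reader", {"SELECT"}, "catalog.sales")
bind("ada", "sales_reader")
print(allowed("ada", "SELECT", "catalog.sales"))  # True
print(allowed("ada", "DROP", "catalog.sales"))    # False
```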

&lt;p&gt;&lt;strong&gt;Unified Policy Enforcement&lt;/strong&gt; When users query tables through any connected engine, Gravitino enforces access policies in real-time, checking permissions before allowing read or write operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced Data Governance Features
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokr6xtuwiu0eolo6zk3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokr6xtuwiu0eolo6zk3v.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intelligent Data Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tagging System&lt;/strong&gt;: Classify and organize data assets across systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search Integration&lt;/strong&gt;: OpenSearch integration for keyword-based data discovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Enrichment&lt;/strong&gt;: Automatic data profiling and documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Lineage Tracking&lt;/strong&gt; Gravitino captures and exposes lineage information, showing how data flows between systems and transforms through different processing stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session 5: Expanding Ecosystem Support
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Growing Catalog Backend Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current Production Support&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive Metastore&lt;/strong&gt;: Full integration with existing Hadoop ecosystems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JDBC Catalogs&lt;/strong&gt;: PostgreSQL, MySQL, and other relational database catalogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Planned Integrations&lt;/strong&gt; We plan to extend support to more catalog backends, such as &lt;strong&gt;S3 catalogs and Polaris&lt;/strong&gt;, giving organizations even more flexibility in their catalog choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Unified Metadata Future
&lt;/h2&gt;

&lt;p&gt;Apache Gravitino represents a fundamental shift in how enterprises approach data architecture. Rather than forcing organizations to migrate data or abandon existing investments, &lt;strong&gt;Gravitino enables transformation through unification&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three core principles driving this evolution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata-First Architecture&lt;/strong&gt;: Unifying metadata management enables data interoperability without data movement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federation Over Migration&lt;/strong&gt;: Preserve existing investments while gaining modern capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standards-Based Innovation&lt;/strong&gt;: Extensible platforms that maintain ecosystem compatibility&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Organizations choosing Gravitino gain:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate integration benefits without migration risk&lt;/li&gt;
&lt;li&gt;Enterprise-grade security and governance capabilities&lt;/li&gt;
&lt;li&gt;Future-proof architecture that evolves with industry standards&lt;/li&gt;
&lt;li&gt;Active community support and continuous innovation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of enterprise data architecture isn't about choosing the right system—it's about choosing the right approach to &lt;strong&gt;unified metadata management.&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to REST Catalogs for Apache Iceberg</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Wed, 13 Aug 2025 18:54:31 +0000</pubDate>
      <link>https://dev.to/datastrato/introduction-to-rest-catalogs-for-apache-iceberg-5a7l</link>
      <guid>https://dev.to/datastrato/introduction-to-rest-catalogs-for-apache-iceberg-5a7l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokzfvpgmiybdv96ty4ar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokzfvpgmiybdv96ty4ar.png" alt=" " width="720" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the Apache Iceberg Catalog?
&lt;/h3&gt;

&lt;p&gt;While Iceberg primarily concentrates on its role as an open data format for lakehouse implementations, it still needs metadata to track its tables by name. The catalog acts as a reference: it contains a pointer to the current metadata file for a given table and provides atomicity. Different backends that can serve as the Iceberg catalog (e.g. Hive, Hadoop, AWS Glue) store the current metadata pointer differently. Iceberg catalogs are flexible and can be implemented using almost any backend system. They can be plugged into any Iceberg runtime and allow any processing engine that supports Iceberg to load the tracked Iceberg tables. Iceberg also comes with several catalog implementations that are ready to use out of the box.&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REST: a server-side catalog that’s exposed through a REST API, such as Apache Gravitino or Apache Polaris&lt;/li&gt;
&lt;li&gt;Hive Metastore: tracks namespaces and tables using a Hive metastore&lt;/li&gt;
&lt;li&gt;JDBC: tracks namespaces and tables in the JDBC database&lt;/li&gt;
&lt;li&gt;Nessie: a transactional catalog that tracks namespaces and tables in a database with git-like version control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Catalogs are extremely useful for telling us where our Iceberg tables are and, subsequently, how we can access them safely. They are the backbone of data governance frameworks, and with regard to Iceberg they are used to track tables and allow external tools to interface with the metadata. For an Iceberg catalog to be production-ready, it must support atomic operations for updating the current metadata pointer. This helps ensure ACID compliance for table operations by making sure all readers and writers see the same state of the table at a given point in time. When two writers commit concurrently, it’s important to ensure that partial writes don’t happen and cause data loss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdfhg7qalnbeg77k2t72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdfhg7qalnbeg77k2t72.png" alt=" " width="720" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
Image Credit: Iceberg the Definitive Guide.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How do catalogs work in Iceberg?
&lt;/h3&gt;

&lt;p&gt;The catalog plays a central role in providing the source of truth for the metadata location of Iceberg tables and in executing operations like creating, dropping, or renaming tables. By grouping collections of tables into namespaces, the Iceberg catalog keeps track of each table’s current metadata so the right version is used when you load a specific table. It is important to note that the purpose of the Iceberg catalog is primarily technical, which means most of its functionality surrounds versioning, table management, and naming — this differs significantly from a product data catalog. To learn more about the difference between technical and product data catalogs, see &lt;a href="https://medium.com/@lisancao/technical-vs-product-data-catalogs-which-one-is-best-for-you-6a54337e1ded" rel="noopener noreferrer"&gt;“Technical vs Product Data Catalogs: Which one is best for you?”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When you implement Iceberg in your setup, one of the first steps is to initialize and configure the catalog. The catalog enables the SQL commands that allow you to manage tables and load them by name. It is configured by passing certain properties to the processing engine at initialization. For instance, a Spark catalog is registered under a property name beginning with &lt;code&gt;spark.sql.catalog.&lt;/code&gt; followed by the catalog name, for example &lt;code&gt;spark.sql.catalog.hive&lt;/code&gt;, &lt;code&gt;spark.sql.catalog.rest&lt;/code&gt;, or &lt;code&gt;spark.sql.catalog.hadoop&lt;/code&gt;, depending on where you are loading tables from. It is important to note that not all engines are configured the same way, so it’s best to always refer to the documentation for best practices.&lt;/p&gt;
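&lt;p&gt;For example, a spark-defaults style configuration registering an Iceberg REST catalog named &lt;code&gt;rest&lt;/code&gt; might look like the following. The URI is a placeholder, and the Iceberg Spark runtime jar must be on the classpath.&lt;/p&gt;

```properties
# Register a catalog named "rest" backed by an Iceberg REST service
spark.sql.catalog.rest        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type   rest
spark.sql.catalog.rest.uri    http://localhost:8181
```

&lt;p&gt;With this in place, tables become addressable as &lt;code&gt;rest.namespace.table&lt;/code&gt; in Spark SQL.&lt;/p&gt;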

&lt;p&gt;Many pluggable services exist for the Iceberg catalog, such as Apache Gravitino or Polaris, which utilize the REST catalog. Using REST as a de facto standard, these services are part of an effort to decouple catalogs from their underlying technologies. This is important because many Iceberg catalog clients contain logic particular to the source, so moving that logic to the catalog server allows for more flexibility and control. Iceberg 0.14.0 introduced the &lt;a href="https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml" rel="noopener noreferrer"&gt;REST OpenAPI specification&lt;/a&gt;, allowing server-side logic to be written in any language and use any custom technology as long as the API follows the specification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evolution of Catalogs: From Apache Hive Metastore to REST
&lt;/h3&gt;

&lt;p&gt;Generally, you will find Iceberg catalogs in two flavors: service-based catalogs and file-system catalogs. Most service-based catalogs work by running a service that is either self-managed or cloud-managed and use a backing store to maintain all the Iceberg table references, as well as any locking mechanisms needed to ensure ACID compliance and prevent conflicts. This type of catalog is becoming more common than file-system catalogs, which use a file to track tables instead of a dedicated backing store. File-system catalogs, such as the Apache Hadoop catalog, are compatible with any storage system but as a result are more prone to inconsistencies due to the &lt;a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Core_Expectations_of_a_Hadoop_Compatible_FileSystem" rel="noopener noreferrer"&gt;atomicity guarantee differences&lt;/a&gt; between the many types of storage solutions that exist, and are inherently split-state.&lt;/p&gt;

&lt;p&gt;The Hive Metastore catalog was previously a widely used implementation for managing an Iceberg catalog due to the prevalence of the Hadoop ecosystem. It works by mapping a table’s path to its current metadata file using the location table property in the table’s entry within the Hive Metastore. This property specifies the absolute path in the filesystem where the table’s current metadata file is stored. This means the metastore needs to stay in sync with the file storage to avoid failures where the metastore is written but not the data, or vice versa. Service-based catalogs like the Hive Metastore have been used to manage Iceberg, particularly during migrations from Hive to Iceberg while maintaining both systems. However, these catalogs often encountered locking issues and conflicts that required resolution. For example, when concurrent writes occur, the writer that first acquires the lock swaps in its snapshot, while the second writer retries applying its changes. Locks may also occasionally be abandoned when a shutdown fails to clean them up.&lt;/p&gt;

&lt;p&gt;As a result, the Iceberg community began exploring alternative lock implementations, though there remained a desire for a solution less heavyweight than the Hive Metastore. JDBC-based catalogs work by storing the metadata in a dedicated table in the relational database they connect to and then using that table to track changes and manage the Iceberg tables. Depending on the implementation, JDBC-based catalogs can be more prone to inconsistencies due to differences in ACID compliance across the underlying storage systems. To address these challenges, the idea of using REST was introduced: engines send HTTP requests to a REST endpoint, with conflicts handled server-side. This approach allows users to utilize Iceberg without needing an in-depth understanding of its intricacies.&lt;/p&gt;

&lt;h3&gt;
  
  
  An introduction to REST
&lt;/h3&gt;

&lt;p&gt;A REST API conforms to the principles of the Representational State Transfer (REST) architectural style, making it compatible with RESTful web services. REST is not a protocol or standard but rather a set of architectural constraints that developers can implement in various ways. When a client makes a request via a RESTful API, it receives a representation of the resource’s state, delivered over HTTP in formats including, but not limited to, JSON, XML, HTML, or plain text.&lt;/p&gt;

&lt;p&gt;Because it would be hard to accommodate everyone’s data infrastructure, and because clients in languages like Java and Python must co-exist, it is important that catalog implementations are consistent. It’s ideal to have a central place to manage all the metadata so that all these clients can interact with Iceberg seamlessly. Especially with long-running jobs, the catalog needs to be backward- and forward-compatible across Iceberg versions. REST catalogs were meant to solve this core issue by shifting much of the logic from the client side to the server side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw3jywjo8a82elzzurjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw3jywjo8a82elzzurjt.png" alt=" " width="720" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Image by mannhowie.com&lt;/small&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Iceberg’s REST implementation
&lt;/h3&gt;

&lt;p&gt;The API definitions exist in the official Iceberg documentation as a specification, but the specification itself is not an implementation. This is what we consider REST compliance, and to make full use of it we either build the service ourselves or use a catalog that provides the REST service for us, like Apache Gravitino. At a minimum, such a service requires a server to handle the requests and a backend that delegates to the catalog to fulfill them. We can also extend the functionality of the API at will depending on our needs, which, compared to the Hive Metastore, is a more pluggable approach. The service implementing the REST catalog interface can store the mapping of a table’s path to its current metadata file in any way it chooses; it could even store it in another catalog.&lt;/p&gt;
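&lt;p&gt;To make “build the service ourselves” concrete, here is a deliberately tiny sketch that answers the spec’s &lt;code&gt;GET /v1/namespaces&lt;/code&gt; route from an in-memory backend. A real implementation must cover the full OpenAPI specification, including tables, commits, authentication, and pagination.&lt;/p&gt;

```python
# Minimal sketch of a REST catalog server: one route, in-memory state.
import json, threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKEND = {"namespaces": [["sales"], ["marketing"]]}  # the "catalog state"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/v1/namespaces":
            body = json.dumps({"namespaces": BACKEND["namespaces"]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urlopen(f"http://127.0.0.1:{server.server_port}/v1/namespaces") as resp:
    payload = json.loads(resp.read())
server.shutdown()
print(payload)
```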

&lt;p&gt;A REST catalog offers several advantages that make it an appealing choice for many organizations. First, it requires fewer packages and dependencies compared to other catalogs, which simplifies deployment and management. This simplicity is largely due to its reliance on standard HTTP communication. Additionally, the REST catalog provides flexibility because it can be implemented by any service capable of handling RESTful requests and responses, and the service’s data store can vary widely. Another benefit is its support for multi-table transactions, which allows for complex operations across multiple tables. Moreover, a REST catalog is cloud-agnostic, making it suitable for organizations that are currently using a multi-cloud strategy, or plan to do so in the future, or want to avoid cloud vendor lock-in.&lt;/p&gt;

&lt;p&gt;However, there are also some disadvantages to consider. Implementing a REST catalog requires running a process to handle and respond to REST calls from engines and tools. In production environments, this often necessitates an additional data storage service to store the catalog’s state. Furthermore, the Iceberg project itself does not ship a production backend service for the REST catalog endpoints, meaning developers need to build their own or opt for a hosted service. Another limitation is that not all engines and tools support the REST catalog, though some, like Spark, Trino, PyIceberg, and Snowflake, do at the time of writing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sknes82gc93uuzos5bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sknes82gc93uuzos5bo.png" alt=" " width="720" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Compilation of Iceberg REST Catalogs&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;In terms of use cases, a REST catalog is ideal if you need a flexible, customizable solution that can integrate with a variety of backend data stores, require support for multi-table transactions, or aim to maintain cloud agnosticism. When choosing a catalog, key considerations include whether it is recommended for production, whether it requires an external system and if that system is self-hosted or managed, whether it has broad compatibility with engines and tools, whether it supports multi-table and multi-statement transactions, and whether it is cloud-agnostic. Several examples of catalogs that follow the Iceberg REST Specification and are available for use out of the box are Apache Gravitino, Apache Polaris, Project Nessie, and Unity Catalog.&lt;/p&gt;

&lt;p&gt;The tool with the largest open source ecosystem support and connectors is &lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino&lt;/a&gt;. Gravitino implements a metalake approach across data and AI assets (although in the next release it will modularize its Iceberg REST service) and can also aggregate from other catalogs. Along with Iceberg, Gravitino has native connectors to streaming sources, filesets, and relational stores, and supports querying with Flink, Trino, Spark, or StarRocks. Learn more about Apache Gravitino &lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
Apache, Apache Iceberg, Apache Hive, Apache Hadoop, Apache Polaris and Apache Gravitino are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>apacheiceberg</category>
      <category>datacatalog</category>
      <category>rest</category>
    </item>
    <item>
      <title>Datastrato Announced as an Official OPEA Partner</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Wed, 13 Aug 2025 18:54:04 +0000</pubDate>
      <link>https://dev.to/datastrato/datastrato-announced-as-an-official-opea-partner-5bgj</link>
      <guid>https://dev.to/datastrato/datastrato-announced-as-an-official-opea-partner-5bgj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2mrvwoqjbc9uqpxhck0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2mrvwoqjbc9uqpxhck0.png" alt=" " width="720" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re excited to share that Datastrato has officially become a partner for the &lt;a href="https://opea.dev/" rel="noopener noreferrer"&gt;Open Platform For Enterprise AI&lt;/a&gt; (OPEA), the &lt;a href="https://lfaidata.foundation/" rel="noopener noreferrer"&gt;LF AI &amp;amp; Data Foundation&lt;/a&gt;’s latest Sandbox Project. The pioneering initiative unites industry leaders to champion the development of open, multi-provider, robust, and composable GenAI systems.&lt;/p&gt;

&lt;p&gt;Datastrato was chosen in no small part due to our contributions to &lt;a href="https://gravitino.apache.org/" rel="noopener noreferrer"&gt;Apache (incubating) Gravitino&lt;/a&gt;, an open source metadata lake that can catalog data from unstructured, relational, and streaming data sources. By using a technical data catalog and metadata lake, you can manage access and perform data governance for all your data sources while safely using multiple engines like Spark, Trino, or Flink on multiple formats on different cloud providers.&lt;/p&gt;

&lt;p&gt;OPEA’s mission drives open source innovation in the AI and data domains by enabling collaboration and the creation of new opportunities for all members of the community. By supporting development of flexible, scalable GenAI systems, OPEA harnesses the best open source innovation from across the ecosystem. As an open source data catalog and metadata lake, we help provide the backbone for enterprise data architectures to streamline their data strategy across both legacy and modern data sources and make it accessible to multiple engines.&lt;/p&gt;

&lt;p&gt;This recognition places us alongside industry leaders like Anyscale, Cloudera, Datastax, Domino Data Lab, Hugging Face, Intel, KX, MariaDB Foundation, Minio, Qdrant, Red Hat, SAS, Yellowbrick Data, and Zilliz, highlighting our strengths in supporting large and diverse data infrastructure through versatility, scalability, and high availability.&lt;/p&gt;

&lt;p&gt;Learn more about OPEA through their &lt;a href="https://opea.dev/" rel="noopener noreferrer"&gt;website&lt;/a&gt;, or attend their upcoming webinar “&lt;a href="https://www.linkedin.com/events/genaiworkflowsolutionsforenterp7223710312330395650/theater/" rel="noopener noreferrer"&gt;GenAI Workflow Solutions for Enterprise: OPEA Demo-palooza&lt;/a&gt;”.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>data</category>
      <category>infrastructure</category>
      <category>generativeaitools</category>
    </item>
    <item>
      <title>Building A Universal Data Agent in 15 Minutes with LlamaIndex and Apache Gravitino (incubating)</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Wed, 13 Aug 2025 18:53:30 +0000</pubDate>
      <link>https://dev.to/datastrato/building-a-universal-data-agent-in-15-minutes-with-llamaindex-and-apache-gravitino-incubating-213d</link>
      <guid>https://dev.to/datastrato/building-a-universal-data-agent-in-15-minutes-with-llamaindex-and-apache-gravitino-incubating-213d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3parb3s2v5day5w33w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3parb3s2v5day5w33w.png" alt=" " width="720" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
Blogpost by Jerry Shao &amp;amp; Lisa N. Cao&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;In this new, modern era of data and generative intelligence, data infrastructure teams are struggling with getting company data served in a way that is convenient, efficient, and compliant with regulations. This is especially crucial in the development of Large Language Models (LLMs) and Agentic Retrieval-Augmented Generation (RAG), which have taken the analytics world by storm. In this article, we will cover how you can build a data agent from scratch and interact with it using an open source data catalog.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is a LLM Agent?
&lt;/h2&gt;

&lt;p&gt;Before we get started, we should first review the role of agents in RAG pipelines. While LLMs provide the general ability to understand and generate language, they lack advanced reasoning capabilities on their own. Agents take this a step further: they accept instructions, perform more complex, domain-specific reasoning, and feed the results back into the LLM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7ypqen1p4nv81c7l9uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7ypqen1p4nv81c7l9uw.png" alt=" " width="720" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Original image by Jerry Shao, inspired by Role of LLM agents at a glance — Source: &lt;a href="https://media.licdn.com/dms/image/D4E22AQE9ZyDxYRUB5w/feedshare-shrink_800/0/1695238313231?e=2147483647&amp;amp;v=beta&amp;amp;t=vHEk2w3EQKVvSIXpxddnbT3qG_pl1unG-cywZV3oKFI" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Agents can be used for different purposes and in different areas: mathematical problem solving, retrieval-augmented chat, personal assistants, and so on. A data agent is typically designed for an extractive goal, interacting directly with the data itself. By handling assistive reasoning tasks, it can greatly improve overall application performance and the accuracy of responses.&lt;/p&gt;

&lt;p&gt;Below is the general architecture of a data agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzzftro8rmxmtk261kcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzzftro8rmxmtk261kcf.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Original image by Jerry Shao, inspired by Role of LLM agents at a glance — Source: &lt;a href="https://media.licdn.com/dms/image/D4E22AQE9ZyDxYRUB5w/feedshare-shrink_800/0/1695238313231?e=2147483647&amp;amp;v=beta&amp;amp;t=vHEk2w3EQKVvSIXpxddnbT3qG_pl1unG-cywZV3oKFI" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the agent takes instructions from the LLM and, depending on the design, interfaces with the user or the LLM through a set of APIs or other agents. It breaks large tasks into smaller ones through planning, with some capacity for reflection and refinement. Combined with memory, the agent retains and recalls information over long context windows through vector stores and retrieval methods. Agents can also call external APIs to fill in missing information from alternate data sources, which is extremely useful.&lt;/p&gt;
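&lt;p&gt;The plan/act/remember loop described above can be sketched in a few lines of plain Python. Everything here (&lt;code&gt;ToyAgent&lt;/code&gt;, the &lt;code&gt;lookup&lt;/code&gt; tool) is illustrative and not tied to any framework; a real agent would delegate planning to the LLM rather than hard-coding one step per tool.&lt;/p&gt;

```python
# Toy agent loop: plan, act with tools, remember. All names are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToyAgent:
    tools: dict[str, Callable[[str], str]]           # external "APIs" the agent can call
    memory: list[str] = field(default_factory=list)  # stands in for a vector store

    def plan(self, task: str) -> list[tuple[str, str]]:
        # A real agent would ask the LLM to decompose the task;
        # here we simply schedule one step per registered tool.
        return [(name, task) for name in self.tools]

    def run(self, task: str) -> str:
        for tool_name, subtask in self.plan(task):
            result = self.tools[tool_name](subtask)
            self.memory.append(result)               # retain for later recall
        return " | ".join(self.memory)

agent = ToyAgent(tools={"lookup": lambda q: f"answer({q})"})
print(agent.run("population of Tokyo"))  # answer(population of Tokyo)
```

The same shape (decompose, call out, accumulate context) reappears in the LlamaIndex engines used later in this post.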
&lt;h2&gt;
  
  
  Production Issues in RAG Development
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;There already exist plenty of demos, POCs, and tutorials describing how to build a simple data agent, but when we turn to production usage, we still face several challenges.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data quality and integrity
&lt;/h3&gt;

&lt;p&gt;Regardless of which LLM you use, data quality and integrity directly affect the accuracy of its answers. For structured data, the quality of the metadata determines the accuracy of generated SQL statements, and bad SQL can have unintended effects. Regardless of your chunking strategy, poor source data and documents will pollute your vector embeddings and produce retrieval results that are nonsensical or hallucinatory. The saying “garbage in, garbage out” matters more than ever in the age of generative AI.&lt;/p&gt;
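&lt;p&gt;One cheap defense is a pre-flight audit of table metadata before it reaches the SQL generator. The dictionary shape below is made up for illustration (it is not Gravitino’s API); the point is only that incomplete metadata can be caught early rather than surfacing as a bad query.&lt;/p&gt;

```python
# Minimal pre-flight check: flag tables whose metadata is incomplete
# before letting an NL-to-SQL step use them. Schema shape is illustrative.
def audit_table_metadata(table: dict) -> list[str]:
    problems = []
    if not table.get("comment"):
        problems.append("missing table comment")
    for col in table.get("columns", []):
        if not col.get("type"):
            problems.append(f"column {col.get('name', '?')} has no type")
        if not col.get("comment"):
            problems.append(f"column {col.get('name', '?')} is undocumented")
    return problems

city_stats = {
    "comment": "per-city statistics",
    "columns": [
        {"name": "city_name", "type": "varchar", "comment": "city name"},
        {"name": "population", "type": "bigint"},  # undocumented on purpose
    ],
}
print(audit_table_metadata(city_stats))  # ['column population is undocumented']
```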
&lt;h3&gt;
  
  
  Retrieve information from a wide range of sources
&lt;/h3&gt;

&lt;p&gt;In any organization, data is ingested from a wide range of sources. On top of the variety of formats and storage solutions, data may also need to traverse from one data center or region into cross-regional, cross-cloud distributions. If we cannot efficiently connect to and retrieve data from this entire range of sources, we put ourselves at a huge disadvantage: missing key data and relationships makes it hard to build a knowledge graph or map out the similarities our LLM needs to provide accurate answers. Meanwhile, the traditional ETL approach of centralizing the data also lowers the effectiveness of the answers, since you typically need T+1 before the data is ready.&lt;/p&gt;
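&lt;p&gt;The “one reader per source” burden can be pictured as a registry that dispatches each location to a reader by URI scheme. The stub readers below are hypothetical; a federated catalog effectively maintains this mapping for you.&lt;/p&gt;

```python
# Sketch of per-source dispatch: pick a reader by the location's URI scheme.
from urllib.parse import urlparse

READERS = {
    "s3":    lambda loc: f"read {loc} with an object-store reader",
    "hdfs":  lambda loc: f"read {loc} with an HDFS reader",
    "mysql": lambda loc: f"read {loc} with a JDBC reader",
}

def read_any(location: str) -> str:
    scheme = urlparse(location).scheme
    try:
        return READERS[scheme](location)
    except KeyError:
        raise ValueError(f"no reader registered for scheme {scheme!r}")

print(read_any("s3://bucket/cities.parquet"))
```

Every new source type, region, or security policy adds another entry (and another credential path) to maintain, which is exactly the overhead a unified metadata layer removes.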
&lt;h3&gt;
  
  
  Data privacy, security, and compliance
&lt;/h3&gt;

&lt;p&gt;Data privacy, security, and compliance are paramount when building any production-level data system, including data agents and APIs. The problem becomes harder with LLMs because they tend to be incredibly high-dimensional and complex at scale, which makes it difficult to trace their outputs back to the source. Troubleshooting such systems, especially ones making many calls to external tools and APIs, is very hard, let alone while retaining privacy and security. It is important to design our data infrastructure and end-to-end systems for continuous visibility, observability, measurability, and robustness.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Apache Gravitino (incubating)?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/apache/gravitino" rel="noopener noreferrer"&gt;Apache Gravitino (incubating)&lt;/a&gt; is a high-performance, geo-distributed, and federated metadata lake. By using a technical data catalog and metadata lake, you can manage access and perform data governance for all your data sources (including filestores, relational databases, and event streams) while safely using multiple engines like Spark, Trino, or Flink on multiple formats across different cloud providers. This makes it easy to plug Gravitino into our data architecture and get LlamaIndex up and running quickly on top of numerous data sources at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqgij4iovx4700e3ra87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqgij4iovx4700e3ra87.png" alt=" " width="720" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Apache (incubating) Gravitino’s Architecture at A Glance&lt;/small&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  With Gravitino, you can achieve:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single Source of Truth for multi-regional data with geo-distributed architecture support.&lt;/li&gt;
&lt;li&gt;Unified Data and AI asset management for both users and engines.&lt;/li&gt;
&lt;li&gt;Security in one place, centralizing the security for different sources.&lt;/li&gt;
&lt;li&gt;Built-in data management and data access management.&lt;/li&gt;
&lt;li&gt;An AI-ready and low cost metadata fabric that standardizes across all your data stores.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details about Gravitino, please refer to our blogpost &lt;a href="https://datastrato.ai/blog/gravitino-unified-metadata-lake/" rel="noopener noreferrer"&gt;Gravitino — the unified metadata lake.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Without Gravitino, a typical agentic RAG system would look like this:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2bopsg80w8w07y9i742.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2bopsg80w8w07y9i742.png" alt=" " width="720" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Image by Jerry Shao, inspired by LlamaIndex flow — Source: &lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Users need a different reader to connect to each source, one by one, and the difficulty multiplies when data is distributed across clouds with varying security policies.&lt;/p&gt;
&lt;h3&gt;
  
  
  With Gravitino, the new architecture is streamlined:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk899cbn3lonvc7m87zb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk899cbn3lonvc7m87zb9.png" alt=" " width="720" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Image by Jerry Shao, inspired by LlamaIndex flow — Source: LlamaIndex&lt;/small&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Using Gravitino and LlamaIndex to build a Universal Data Agent
&lt;/h2&gt;

&lt;p&gt;Now, let’s show how you can build a data agent in 15 minutes. This data agent will have several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No data movement: data will stay where it is, and there’s no need to preprocess or aggregate data together.&lt;/li&gt;
&lt;li&gt;Obtain answers both from structured and unstructured data.&lt;/li&gt;
&lt;li&gt;A natural language interface: questions asked in plain language are automatically decomposed into subqueries and translated into SQL as required.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;Below we have abstracted out the code you will need to reproduce this on your own. If you are interested in running this step by step with us, we have a prepared setup that can be run locally. Keep in mind, to run this demo you will need an &lt;a href="https://openai.com/index/openai-api/" rel="noopener noreferrer"&gt;OpenAI API key&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about the playground, see here: &lt;a href="https://github.com/apache/gravitino-playground" rel="noopener noreferrer"&gt;Apache Gravitino Demo Playground&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:apache/gravitino-playground.git
cd gravitino-playground
./launch-playground.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, you will need to navigate to the Jupyter Notebook through the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the Jupyter Notebook in the browser at &lt;a href="http://localhost:8888" rel="noopener noreferrer"&gt;http://localhost:8888/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Open the &lt;code&gt;gravitino_llamaIndex_demo.ipynb&lt;/code&gt; notebook&lt;/li&gt;
&lt;li&gt;Start the notebook and run the cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall architecture of the demo that is included in the local playground looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl880opvh5knbdzmlniq5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl880opvh5knbdzmlniq5.png" alt=" " width="720" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Manage datasets using Gravitino
&lt;/h2&gt;

&lt;p&gt;First, we’ll need to set up our first catalog and connect it to our filesets. In our case, the data source is Hadoop. We’ll then need to define the schemas and provide the storage location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;demo_catalog = None
try:
    demo_catalog = gravitino_client.load_catalog(name=catalog_name)
except Exception as e:
    demo_catalog = gravitino_client.create_catalog(name=catalog_name,
                                                   catalog_type=Catalog.Type.FILESET,
                                                   comment="demo",
                                                   provider="hadoop",
                                                   properties={})

# Create schema and fileset
schema_countries = None
try:
    schema_countries = demo_catalog.as_schemas().load_schema(ident=schema_ident)
except Exception as e:
    schema_countries = demo_catalog.as_schemas().create_schema(ident=schema_ident,
                                                               comment="countries",
                                                               properties={})

fileset_cities = None
try:
    fileset_cities = demo_catalog.as_fileset_catalog().load_fileset(ident=fileset_ident)
except Exception as e:
    fileset_cities = demo_catalog.as_fileset_catalog().create_fileset(ident=fileset_ident,
                                                                      fileset_type=Fileset.Type.EXTERNAL,
                                                                      comment="cities",
                                                                      storage_location="/tmp/gravitino/data/pdfs",
                                                                      properties={})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build a Gravitino structured data reader
&lt;/h2&gt;

&lt;p&gt;Once our data sources are connected, we need a way to query them. Here we use Trino, connected via SQLAlchemy, to help us out. You could also use PySpark instead if that is what your team already uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sqlalchemy import create_engine

trino_engine = create_engine('trino://admin@trino:8080/catalog_mysql/demo_llamaindex')

with trino_engine.connect() as connection:
    cursor = connection.exec_driver_sql("SELECT * FROM catalog_mysql.demo_llamaindex.city_stats")
    print(cursor.fetchall())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build a Gravitino unstructured data reader
&lt;/h2&gt;

&lt;p&gt;Once our basic data infrastructure has been set up, we can now directly read it into LlamaIndex. Gravitino will use a virtual file system to serve the data as a directory that LlamaIndex can take as input.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.core import SimpleDirectoryReader
from gravitino import gvfs

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri=gravitino_url,
    metalake_name=metalake_name
    )

fileset_virtual_location = "fileset/catalog_fileset/countries/cities"

reader = SimpleDirectoryReader(
    input_dir=fileset_virtual_location,
    fs=fs,
    recursive=True)
wiki_docs = reader.load_data()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build SQL metadata index from the structured data connection
&lt;/h2&gt;

&lt;p&gt;With the structured connection in place, we can build our index and vector stores from the metadata alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.core import SQLDatabase
sql_database = SQLDatabase(trino_engine, include_tables=["city_stats"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build vector index from unstructured data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Insert documents into vector index
# Each document has metadata of the city attached

vector_indices = {}
vector_query_engines = {}

for city, wiki_doc in zip(cities, wiki_docs):
   vector_index = VectorStoreIndex.from_documents([wiki_doc])

   query_engine = vector_index.as_query_engine(
       similarity_top_k=2, llm=OpenAI(model="gpt-3.5-turbo")
   )

   vector_indices[city] = vector_index
   vector_query_engines[city] = query_engine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Define query engines and ask the questions
&lt;/h2&gt;

&lt;p&gt;To make this a fully functioning chat application, we need a text-to-SQL interface to pull it all together. Here we use LlamaIndex’s native functions to interface directly with the indexes we defined in the previous steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.core.query_engine import SQLJoinQueryEngine

# Define the NL to SQL engine
sql_query_engine = NLSQLTableQueryEngine(
   sql_database=sql_database,
   tables=["city_stats"],
)


# Define the vector query engines for each city
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine


query_engine_tools = []
for city in cities:
   query_engine = vector_query_engines[city]


   query_engine_tool = QueryEngineTool(
       query_engine=query_engine,
       metadata=ToolMetadata(
           name=city, description=f"Provides information about {city}"
       ),
   )
   query_engine_tools.append(query_engine_tool)


s_engine = SubQuestionQueryEngine.from_defaults(
   query_engine_tools=query_engine_tools, llm=OpenAI(model="gpt-3.5-turbo")
)


# Convert engines to tools and combine them together
sql_tool = QueryEngineTool.from_defaults(
   query_engine=sql_query_engine,
   description=(
       "Useful for translating a natural language query into a SQL query over"
       " a table containing: city_stats, containing the population/country of"
       " each city"
   ),
)
s_engine_tool = QueryEngineTool.from_defaults(
   query_engine=s_engine,
   description=(
       f"Useful for answering semantic questions about different cities"
   ),
)

query_engine = SQLJoinQueryEngine(
   sql_tool, s_engine_tool, llm=OpenAI(model="gpt-4")
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Issue query
response = query_engine.query(
   "Tell me about the arts and culture of the city with the highest"
   " population"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final answer combines two parts:&lt;/p&gt;

&lt;p&gt;First, the answer from the SQL engine: the data agent generates the SQL statement “&lt;code&gt;SELECT city_name, population, country FROM city_stats ORDER BY population DESC LIMIT 1&lt;/code&gt;” from natural language and learns from the structured data that Tokyo has the highest population.&lt;/p&gt;

&lt;p&gt;Based on that answer, the data agent then generates follow-up sub-questions about arts and culture in Tokyo, such as “&lt;strong&gt;&lt;em&gt;Can you provide more details about the museums, theaters, and performance venues in Tokyo?&lt;/em&gt;&lt;/strong&gt;”.&lt;/p&gt;

&lt;p&gt;Combining the two yields the final response:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Final response: The city with the highest population is Tokyo, Japan. Tokyo is known for its vibrant arts and culture scene, with a mix of traditional and modern influences. From ancient temples and traditional tea ceremonies to cutting-edge technology and contemporary art galleries, Tokyo offers a diverse range of cultural experiences for visitors and residents alike. The city is also home to numerous museums, theaters, and performance venues showcasing the rich history and creativity of Japan. Unfortunately, I cannot provide more details about the museums, theaters, and performance venues in Tokyo based on the context information provided.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
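&lt;p&gt;The two-stage flow behind this response can be sketched with stub functions standing in for the SQL and sub-question engines. The stand-ins below are illustrative only; the real engines are the LlamaIndex objects built earlier.&lt;/p&gt;

```python
# Two-stage join: a structured (SQL) answer feeds a semantic follow-up.
def sql_engine(question: str) -> dict:
    # Stub: pretend NL-to-SQL found the most populous city.
    return {"sql": "SELECT city_name FROM city_stats ORDER BY population DESC LIMIT 1",
            "city": "Tokyo"}

def semantic_engine(city: str, topic: str) -> str:
    # Stub: pretend a vector index answered a sub-question about the city.
    return f"{city} has a vibrant {topic} scene."

def join_query(question: str, topic: str) -> str:
    stage1 = sql_engine(question)                    # structured answer
    stage2 = semantic_engine(stage1["city"], topic)  # follow-up grounded on it
    return f"The city with the highest population is {stage1['city']}. {stage2}"

print(join_query("Which city has the highest population?", "arts and culture"))
```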

&lt;h2&gt;
  
  
  So what’s next?
&lt;/h2&gt;

&lt;p&gt;The demo here shows how to use Gravitino for data ingestion and LlamaIndex for efficient data retrieval. With Gravitino’s production-ready features, it is easy for users to build a universal data agent. We’re continually improving Gravitino to make it a key component of data agents that meet enterprise-grade standards.&lt;/p&gt;

&lt;p&gt;Ready to take your data agent to the next level? Dive into the &lt;a href="https://datastrato.ai/docs/0.5.1/" rel="noopener noreferrer"&gt;guides&lt;/a&gt; and join our &lt;a href="https://join.slack.com/t/the-asf/shared_invite/zt-2l5oovqck-S8XeAdPXjEohmpBy3Ig_Lw" rel="noopener noreferrer"&gt;ASF Community Slack Channel&lt;/a&gt; for support.&lt;/p&gt;

&lt;p&gt;Huge thanks to co-writer &lt;a href="https://www.linkedin.com/in/jerry-saisai-shao-944052a6" rel="noopener noreferrer"&gt;Jerry Shao&lt;/a&gt; for collaborating with me on this.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>generativeaitools</category>
    </item>
    <item>
      <title>Gravitino 0.5.0: Expanding the horizon to Apache Spark, non-tabular data, and more!</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Wed, 13 Aug 2025 18:52:49 +0000</pubDate>
      <link>https://dev.to/datastrato/gravitino-050-expanding-the-horizon-to-apache-spark-non-tabular-data-and-more-4han</link>
      <guid>https://dev.to/datastrato/gravitino-050-expanding-the-horizon-to-apache-spark-non-tabular-data-and-more-4han</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvfk20kgmdklw8hnkjrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvfk20kgmdklw8hnkjrw.png" alt="gravitino-0.5.0" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our community is always looking to build new features and enhancements to meet the ever-changing needs of today’s data and AI ecosystem. As such, we are glad to announce the release of Gravitino 0.5.0, which features Spark engine support, non-tabular data management, messaging data support, and a brand new Python client. Keep reading to learn what we’ve done and check out some usage examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core features and enhancements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Apache Spark connector
&lt;/h3&gt;

&lt;p&gt;Our new Spark connector provides seamless integration with one of the most widely used data processing frameworks. Users can now read and write metadata directly from Gravitino, making the management of large-scale data processing pipelines more convenient and unified. If you’re already a heavy Spark user, this will feel natural to plug into your stack compared to before, when we mainly supported Trino. You can refer to the &lt;a href="https://datastrato.ai/docs/0.5.0/spark-connector/spark-connector" rel="noopener noreferrer"&gt;Spark connector&lt;/a&gt; to get started and see the specs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support for non-tabular data management
&lt;/h2&gt;

&lt;p&gt;When we group files together as collections to organize data, we are essentially using “filesets”. Compared to relational data models like databases and tables, filesets manage non-tabular data that does not fit neatly into the traditional row-and-column structure of tables. Examples include unstructured data types like images, documents, and audio and video files.&lt;/p&gt;

&lt;p&gt;Filesets are useful because they provide a level of abstraction over the storage systems, making it possible to leverage metadata to manage the files, much like a traditional database would do with records in a table. This abstraction brings possibilities like life cycle management, security permissions, and other metadata-level management such as schema or partition information to non-tabular data.&lt;/p&gt;

&lt;p&gt;Many modern data infrastructures leverage these formats for data processing, especially with the surge of AI and ML. Their models and data management systems will usually need to use both tabular and non-tabular data, so managing both with a single application can save a lot of headaches for engineers. These difficulties become harder at scale when it comes to data operations, enforcing policies, and dealing with complex storage architectures.&lt;/p&gt;
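&lt;p&gt;Conceptually, a fileset is a small piece of metadata layered over a physical storage location, addressed by logical name rather than physical path. The toy model below is not Gravitino’s client API; its fields only loosely mirror the example payloads that follow.&lt;/p&gt;

```python
# Toy model of the fileset idea: metadata over a storage location.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fileset:
    name: str
    fileset_type: str          # "MANAGED" or "EXTERNAL"
    storage_location: str
    comment: str = ""

    def virtual_path(self, catalog: str, schema: str) -> str:
        # Engines address the fileset by logical name, not physical path,
        # so the physical location can change without breaking consumers.
        return f"fileset/{catalog}/{schema}/{self.name}"

fs = Fileset("local_fileset", "MANAGED", "file:/tmp/root/schema/local_fileset")
print(fs.virtual_path("fileset_catalog_1", "schema4"))
# fileset/fileset_catalog_1/schema4/local_fileset
```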

&lt;p&gt;Now that Gravitino supports Fileset features, managing non-tabular data on storage such as HDFS, S3, and other Hadoop-compatible systems is simpler and more integrated. This means you can manage permissions, track usage, and oversee the end-to-end lifecycle of all your data, no matter where the data physically resides. (&lt;a href="https://github.com/datastrato/gravitino/issues/1241" rel="noopener noreferrer"&gt;Issue #1241&lt;/a&gt;, refer to Fileset catalog for more details).&lt;/p&gt;

&lt;p&gt;Here is a brief example of how to use Fileset in Gravitino.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "local_fileset",
  "comment": "This is a local fileset",
  "type": "MANAGED",
  "storageLocation": "file:/tmp/root/schema/local_fileset",
  "properties": {}
}' http://localhost:8090/api/metalakes/metalake/catalogs/fileset_catalog_1/schemas/schema4/filesets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Java
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GravitinoClient gravitinoClient = GravitinoClient
    .builder("http://127.0.0.1:8090")
    .withMetalake("metalake")
    .build();

Catalog catalog = gravitinoClient.loadCatalog(NameIdentifier.of("metalake", "catalog"));
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();

Map&amp;lt;String, String&amp;gt; propertiesMap = ImmutableMap.&amp;lt;String, String&amp;gt;builder()
        .build();

filesetCatalog.createFileset(
  NameIdentifier.of("metalake", "fileset_catalog_1", "schema4", "local_fileset"),
  "This is a local fileset",
  Fileset.Type.MANAGED,
  "file:/tmp/root/schema/local_fileset",
  propertiesMap);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgxx3li3pfnpzc11n8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgxx3li3pfnpzc11n8zg.png" alt=" " width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating a Fileset catalog, we can use &lt;a href="https://datastrato.ai/docs/0.5.0/how-to-use-gvfs#introduction" rel="noopener noreferrer"&gt;Gravitino Virtual Filesystem&lt;/a&gt; to manage the data using Fileset virtual path:&lt;/p&gt;

&lt;h3&gt;
  
  
  Java
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl","com.datastrato.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl","com.datastrato.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri","http://localhost:8090");
conf.set("fs.gravitino.client.metalake","metalake");
Path filesetPath = new Path("gvfs://fileset/fileset_catalog/test_schema/test_fileset_1");
FileSystem fs = filesetPath.getFileSystem(conf);
fs.getFileStatus(filesetPath);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Spark
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrame Example")
  .getOrCreate()

// Assuming the file format is text, we use `text` method to read as a DataFrame
val df = spark.read.text("gvfs://fileset/test_catalog/test_schema/test_fileset_1")

// Show the contents of the DataFrame using `show`
// Setting truncate to false to show the entire content of the row if it is too long
df.show(truncate = false)

// Alternatively, to print the DataFrame contents to the console in a plain format like `foreach(println)`:
df.collect().foreach(println)

// Stop the SparkSession
spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further details can be found in &lt;a href="https://datastrato.ai/docs/0.5.0/hadoop-catalog" rel="noopener noreferrer"&gt;hadoop-catalog&lt;/a&gt; and &lt;a href="https://datastrato.ai/docs/0.5.0/how-to-use-gvfs#introduction" rel="noopener noreferrer"&gt;Gravitino Virtual Filesystem&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messaging data support
&lt;/h3&gt;

&lt;p&gt;Along with the previously mentioned unstructured data, real-time data analytics and processing have become commonplace in many modern data systems. To facilitate the management of these systems, Gravitino now supports messaging-type data, including Apache Kafka and Kafka-compatible messaging systems. Users can now seamlessly manage their messaging data alongside their other data sources in a unified way using Gravitino. (&lt;a href="https://github.com/datastrato/gravitino/issues/2369" rel="noopener noreferrer"&gt;Issue #2369&lt;/a&gt;, further information can be found in the Kafka catalog documentation).&lt;/p&gt;

&lt;p&gt;The following is an example of how to create a Kafka catalog, more can be found in &lt;a href="https://datastrato.ai/docs/0.5.0/kafka-catalog" rel="noopener noreferrer"&gt;Kafka catalog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "catalog",
  "type": "MESSAGING",
  "comment": "comment of kafka catalog",
  "provider": "kafka",
  "properties": {
    "bootstrap.servers": "localhost:9092"
  }
}' http://localhost:8090/api/metalakes/metalake/catalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Java
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GravitinoClient gravitinoClient = GravitinoClient
    .builder("http://127.0.0.1:8090")
    .withMetalake("metalake")
    .build();

Map&amp;lt;String, String&amp;gt; properties = ImmutableMap.&amp;lt;String, String&amp;gt;builder()
    // Replace with your own Kafka bootstrap servers that Gravitino can connect to.
    .put("bootstrap.servers", "localhost:9092")
    .build();

Catalog catalog = gravitinoClient.createCatalog(
    NameIdentifier.of("metalake", "catalog"),
    Type.MESSAGING,
    "kafka", // provider; Gravitino only supports "kafka" for now
    "This is a Kafka catalog",
    properties);
// ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python client support
&lt;/h3&gt;

&lt;p&gt;Python is the most popular AI/ML programming language, hosting some of the most popular machine learning frameworks such as PyTorch, TensorFlow, and Ray. Like many other core infrastructure tools, Gravitino was originally written in Java, but it has now added a Python client so that users can access our data management features directly from their Python environment of choice, such as a Jupyter notebook. This allows data engineers and machine learning scientists to consume the metadata in Gravitino natively using Python. Note that currently only Fileset catalogs are supported through the Python client. (&lt;a href="https://github.com/datastrato/gravitino/issues/2229" rel="noopener noreferrer"&gt;Issue #2229&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The following code is a simple example of how to use the Python client to connect with Gravitino.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider: str = "hadoop"

// Related NameIdentifer
schema_ident: NameIdentifier = NameIdentifier.of_schema(metalake_name, catalog_name, schema_name)
fileset_ident: NameIdentifier = NameIdentifier.of_fileset(metalake_name, catalog_name, schema_name, fileset_name)

// Init Gravitino client
gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name=metalake_name)

// Create catalog
catalog = gravitino_client.create_catalog(
    ident=catalog_ident,
    type=Catalog.Type.FILESET,
    provider=provider,
    comment="catalog comment",
    properties={}
)

// Create schema
catalog.as_schemas().create_schema(ident=schema_ident, comment="schema comment", properties={})

// Create a fileset
fileset = catalog.as_fileset_catalog().create_fileset(ident=fileset_ident,type=Fileset.Type.MANAGED,comment="comment of fileset",storage_location="file:/tmp/root/schema/local_file",properties={})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Apache Doris support
&lt;/h3&gt;

&lt;p&gt;Building on our real-time analytics support, we now also support &lt;a href="https://doris.apache.org/" rel="noopener noreferrer"&gt;Apache Doris&lt;/a&gt; in this release. Doris is a high-performance, real-time analytical data warehouse known for its speed and ease of use. By adding a Doris catalog, engineers deploying Gravitino now have more flexibility in their cataloging options for analytical workloads. (&lt;a href="https://github.com/datastrato/gravitino/issues/1339" rel="noopener noreferrer"&gt;Issue #1339&lt;/a&gt;, visit jdbc-doris-catalog for specifics). Related user documentation can be found &lt;a href="https://datastrato.ai/docs/0.5.0/jdbc-doris-catalog" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Empowering operations with event listener system
&lt;/h3&gt;

&lt;p&gt;To complement our efforts in enabling real-time, dynamic data infrastructures, Gravitino’s 0.5.0 release also includes a new event listener system for applications to plug into. It allows users to track and handle all operational events through a hook mechanism for custom events, enhancing auditing, real-time monitoring, observability, and integration with other applications. (&lt;a href="https://github.com/datastrato/gravitino/issues/2233" rel="noopener noreferrer"&gt;Issue #2233&lt;/a&gt;, detailed at event-listener-configuration). You can refer to the &lt;a href="https://datastrato.ai/docs/0.5.0/gravitino-server-config#event-listener-configuration" rel="noopener noreferrer"&gt;event listener system&lt;/a&gt; documentation for more information.&lt;/p&gt;
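&lt;p&gt;As a sketch, a listener is registered in &lt;code&gt;conf/gravitino.conf&lt;/code&gt; by naming it and pointing it at an implementation class. The class name below is a placeholder for your own implementation; check the event listener configuration doc for the exact keys and listener interface.&lt;/p&gt;

```
# Register one event listener named "audit" (the name is arbitrary).
gravitino.eventListener.names = audit
# Placeholder: fully qualified class of your own listener implementation.
gravitino.eventListener.audit.class = com.example.AuditEventListener
```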

&lt;h3&gt;
  
  
  JDBC storage backend
&lt;/h3&gt;

&lt;p&gt;Diversifying its storage options, Gravitino 0.5.0 now supports JDBC backends in addition to KV storage. This allows the use of popular databases like MySQL or PostgreSQL as the entity store. (&lt;a href="https://github.com/datastrato/gravitino/issues/1811" rel="noopener noreferrer"&gt;Issue #1811&lt;/a&gt;, check out storage-configuration for further insights). Users only need to set the following configurations in the configuration file to use the JDBC storage backend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gravitino.entity.store = relational
gravitino.entity.store.relational = JDBCBackend
gravitino.entity.store.relational.jdbcUrl = jdbc:mysql://localhost:3306/mydb
gravitino.entity.store.relational.jdbcDriver = com.mysql.cj.jdbc.Driver
gravitino.entity.store.relational.jdbcUser = user
gravitino.entity.store.relational.jdbcPassword = password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bug fixes and optimizations
&lt;/h3&gt;

&lt;p&gt;Gravitino 0.5.0 also contains many bug fixes and optimizations that enhance overall system stability and performance. These improvements address issues identified by the community through issues and direct feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overall
&lt;/h2&gt;

&lt;p&gt;We are grateful to the Gravitino community for its continued support and valuable contributions, including feedback and testing. It is thanks to the vocal feedback of our users that we are able to innovate and build, so cheers to all those reading this!&lt;/p&gt;

&lt;p&gt;To explore Gravitino 0.5.0 release in full, please check &lt;a href="https://datastrato.ai/docs/0.5.0/" rel="noopener noreferrer"&gt;the documentation&lt;/a&gt; and &lt;a href="https://github.com/datastrato/gravitino/releases/tag/v0.5.0" rel="noopener noreferrer"&gt;release notes&lt;/a&gt;. Your feedback is invaluable to the community and the project.&lt;/p&gt;

&lt;p&gt;Enjoy the Data and AI journey with Gravitino 0.5.0!&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;br&gt;
Apache®, &lt;a href="http://doris.apache.org/" rel="noopener noreferrer"&gt;Apache Doris™&lt;/a&gt;, &lt;a href="http://doris.apache.org/" rel="noopener noreferrer"&gt;Doris™&lt;/a&gt;, &lt;a href="http://hadoop.apache.org/" rel="noopener noreferrer"&gt;Apache Hadoop®&lt;/a&gt;, &lt;a href="http://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop®&lt;/a&gt;, &lt;a href="http://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka®&lt;/a&gt;, &lt;a href="http://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka®&lt;/a&gt;, &lt;a href="http://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, &lt;a href="http://spark.apache.org/" rel="noopener noreferrer"&gt;Spark™&lt;/a&gt;, are either registered trademarks or trademarks of the &lt;a href="https://www.apache.org/" rel="noopener noreferrer"&gt;Apache Software Foundation&lt;/a&gt; in the United States and/or other countries. Java and MySQL are registered trademarks of Oracle and/or its affiliates. Python is a registered trademark of the Python Software Foundation. PostgreSQL is a registered trademark of the PostgreSQL Community Association of Canada.&lt;br&gt;
&lt;/small&gt;&lt;/p&gt;

</description>
      <category>gravitino</category>
      <category>spark</category>
      <category>nontabulardata</category>
      <category>messagingcatalog</category>
    </item>
    <item>
      <title>How to Implement a REST Catalog for Apache Iceberg</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Mon, 11 Aug 2025 16:29:01 +0000</pubDate>
      <link>https://dev.to/datastrato/how-to-implement-a-rest-catalog-for-apache-iceberg-1end</link>
      <guid>https://dev.to/datastrato/how-to-implement-a-rest-catalog-for-apache-iceberg-1end</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ifeq2laid1m14hvq1p5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ifeq2laid1m14hvq1p5.png" alt="Blue Yellow Futuristic Virtual Technology Blog Banner" width="627" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data catalogs can be extremely useful for understanding your data at a glance. By exposing your data catalog through a REST API, it becomes widely available for applications and users to pull from; more importantly, you can also use it to manage your data catalog by plugging it into other frameworks. Today we’ll review how to use a REST catalog for Apache Iceberg.&lt;/p&gt;

&lt;p&gt;Since the current &lt;a href="https://iceberg.apache.org/releases/" rel="noopener noreferrer"&gt;Apache Iceberg version (1.5.1)&lt;/a&gt; does not natively support a REST Catalog, we will use &lt;a href="https://github.com/datastrato/gravitino" rel="noopener noreferrer"&gt;Gravitino&lt;/a&gt;, an open-source metadata lake and data catalog. It provides a standard REST catalog for Apache Iceberg. Throughout this process, you can also refer to Gravitino’s &lt;a href="https://datastrato.ai/docs/0.5.0" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. They’ve also released some blog posts that are helpful for today, such as &lt;a href="https://datastrato.ai/blog/gravitino-unified-metadata-lake" rel="noopener noreferrer"&gt;“Gravitino: the unified metadata lake”&lt;/a&gt;, &lt;a href="https://datastrato.ai/blog/i/blog/gravitino-iceberg-rest-catalog-service" rel="noopener noreferrer"&gt;“Gravitino: Next-Gen REST Catalog for Iceberg, and Why You Need It”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today, however, we will go over how to leverage its Iceberg REST catalog service instead of building one from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Getting the Gravitino package
&lt;/h2&gt;

&lt;p&gt;You can either download the latest binary package from the Gravitino GitHub &lt;a href="https://github.com/datastrato/gravitino/releases/download/v0.5.0/gravitino-0.5.0-bin.tar.gz" rel="noopener noreferrer"&gt;releases&lt;/a&gt;, or use the following commands to check out and build it from source locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:datastrato/gravitino.git

// You will see some output here of git retrieving and downloading the repo locally

cd gravitino
./gradlew clean assembleDistribution -x test

// Build the package, this may take a minute and check the output that it built successfully

ls distribution/gravitino-0.5.0-bin.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To do the initial installation, first clone the repository with git and then build the package using &lt;a href="https://docs.gradle.org/current/userguide/userguide.html" rel="noopener noreferrer"&gt;Gradle&lt;/a&gt;, a popular JVM build tool. Although everything you need is in the repository, make sure you have a supported Java version installed (e.g., JDK 8, 11, or 17). For more details on building Gravitino from source, refer to the &lt;a href="https://datastrato.ai/docs/0.5.0/how-to-build" rel="noopener noreferrer"&gt;build doc&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After building, decompress the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd distribution

tar xfz gravitino-0.5.0-bin.tar.gz
cd gravitino-0.5.0-bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Launching the Iceberg REST catalog
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./bin/gravitino.sh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Log dir doesn't exist, create /Users/user/Downloads/gravitino/distribution/gravitino-0.5.0-bin/logs
Gravitino Server start success!
Gravitino Server is running[PID:38047]
// Check server process started, will see the GravitinoServer process
jps | grep GravitinoServer
// Check interface works as expected, like `{"defaults":{},"overrides":{}}`
curl http://127.0.0.1:9001/iceberg/v1/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the Gravitino script starts the Gravitino server. If no errors are returned, you can assume the Iceberg REST catalog service has started and is listening on local port 9001. You can add more configurations in &lt;code&gt;conf/gravitino.conf&lt;/code&gt;. There are two log locations you can check: &lt;code&gt;logs/gravitino-server.log&lt;/code&gt; and &lt;code&gt;logs/gravitino-server.out&lt;/code&gt;. Some critical configuration items are listed below, but you can also refer to the &lt;a href="https://datastrato.ai/docs/0.5.0/iceberg-rest-service/" rel="noopener noreferrer"&gt;Iceberg REST catalog service document&lt;/a&gt; for details.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration item&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gravitino.auxService.names&lt;/td&gt;
&lt;td&gt;Must specify &lt;code&gt;iceberg-rest&lt;/code&gt; to start Iceberg REST service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gravitino.auxService.iceberg-rest.catalog-backend&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;memory&lt;/code&gt; is the default and is mainly used for testing; use &lt;code&gt;hive&lt;/code&gt; or &lt;code&gt;jdbc&lt;/code&gt; for production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
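&lt;p&gt;Putting the table above together, a minimal fragment of &lt;code&gt;conf/gravitino.conf&lt;/code&gt; might look like the following. The &lt;code&gt;httpPort&lt;/code&gt; key is an assumption based on the service's default port 9001; consult the Iceberg REST catalog service document for the authoritative key names.&lt;/p&gt;

```
gravitino.auxService.names = iceberg-rest
# "memory" is for testing only; switch to "hive" or "jdbc" in production.
gravitino.auxService.iceberg-rest.catalog-backend = memory
# Assumed key for the listen port (the service defaults to 9001).
gravitino.auxService.iceberg-rest.httpPort = 9001
```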

&lt;h2&gt;
  
  
  3. Using the Iceberg REST catalog
&lt;/h2&gt;

&lt;p&gt;Using Spark as an example, we can configure the Spark catalog options to use the Gravitino Iceberg REST catalog under the name &lt;code&gt;“rest”&lt;/code&gt;. This allows us to query the Iceberg catalog directly using Spark SQL; the same can be done with Trino, for instance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the spark-sql shell using the following command and set the following configurations. Note that in the example below we are using Spark 3.4, Scala 2.12, and Iceberg 1.3.1. You may need to adjust the jar file (provided by the Iceberg project) to match the actual versions in your environment.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./bin/spark-sql -v \
--packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.rest=org.apache.iceberg.spark.SparkCatalog  \
--conf spark.sql.catalog.rest.type=rest  \
--conf spark.sql.catalog.rest.uri=http://127.0.0.1:9001/iceberg/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For convenience, you can also place your configuration settings in &lt;code&gt;conf/spark-defaults.conf&lt;/code&gt;. This avoids specifying the configs each time you run Spark and ensures consistency across your cluster. Here is an example of what this file may look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.sql.extensions    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.rest         org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest.type    rest
spark.sql.catalog.rest.uri     http://127.0.0.1:9001/iceberg/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then explore your new catalog with Spark SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USE rest;
CREATE DATABASE IF NOT EXISTS dml;
CREATE TABLE dml.test (id bigint COMMENT 'unique id') using iceberg;
DESCRIBE TABLE EXTENDED dml.test;
INSERT INTO dml.test VALUES (1), (2);
SELECT * FROM dml.test;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, Spark will now default to using the Iceberg REST catalog service. You can also check the Gravitino &lt;a href="https://datastrato.ai/docs/0.5.0/iceberg-rest-service/" rel="noopener noreferrer"&gt;Iceberg REST catalog service document&lt;/a&gt;, or contact the developers on their community &lt;a href="https://join.slack.com/t/datastrato-community/shared_invite/zt-2a8vsjoch-cU_uUwHA_QU6Ab50thoq8w" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported versions
&lt;/h2&gt;

&lt;p&gt;We have verified the following versions in our testbed and in some community users’ deployments. If you have different versions of compute engines and want to verify the build, please let the developers know by filing an issue on GitHub or messaging them on &lt;a href="https://join.slack.com/t/datastrato-community/shared_invite/zt-2a8vsjoch-cU_uUwHA_QU6Ab50thoq8w" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Supported versions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;3.0 and above&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;td&gt;1.13 and above&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;405 and above&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
    </item>
    <item>
      <title>Community Voices - At Xiaomi, where are we heading with Gravitino</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Mon, 11 Aug 2025 16:28:53 +0000</pubDate>
      <link>https://dev.to/datastrato/community-voices-at-xiaomi-where-are-we-heading-with-gravitino-18p0</link>
      <guid>https://dev.to/datastrato/community-voices-at-xiaomi-where-are-we-heading-with-gravitino-18p0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88oy4pmzb7pz7d0zbr12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88oy4pmzb7pz7d0zbr12.png" alt="gravitino-xiaomi&amp;lt;br&amp;gt;
" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Xiaomi Inc. is a consumer electronics and smart manufacturing company with smartphones, smart hardware, and electric cars connected by an IoT platform at its core. As of 2023, Xiaomi was ranked among the top 3 in the global smartphone market, according to Canalys, and was listed in the Fortune Global 500 for the 5th consecutive year.&lt;/p&gt;

&lt;p&gt;The Xiaomi Cloud Computing Platform team is dedicated to providing reliable and secure data computing and processing capabilities for Xiaomi's business. We actively contribute to many open-source projects, covering storage, computing, resource scheduling, message queues, data lakes, etc. By leveraging these advanced technologies, our team has achieved significant milestones, including being named a finalist for the Xiaomi Million-Dollar Technology Award.&lt;/p&gt;

&lt;p&gt;This article focuses on how Xiaomi uses Gravitino, our plans for future work, and general guidance. There are three key points, listed below, and we look forward to growing our data-driven business with Gravitino.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unifying our metadata&lt;/li&gt;
&lt;li&gt;Integrating data and AI asset management&lt;/li&gt;
&lt;li&gt;Unifying user permission management&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  1. Unify our metadata
&lt;/h2&gt;

&lt;p&gt;With the introduction of multi-regional or multi-cloud deployments, the problem of data silos becomes even more pronounced: it is challenging to maintain a unified view of the data across different regions or cloud providers. This is certainly the case for Xiaomi. &lt;a href="https://datastrato.ai/blog/gravitino-unified-metadata-lake" rel="noopener noreferrer"&gt;Gravitino provides a solution to such challenges&lt;/a&gt; and helps break down data silos, addressing data management, governance, and analysis in a multi-cloud architecture.&lt;/p&gt;
&lt;h3&gt;
  
  
  Gravitino's position in Xiaomi's data platform
&lt;/h3&gt;

&lt;p&gt;Gravitino, highlighted in green and yellow in the diagram below, has the following features that we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified Metadata Lake&lt;/strong&gt;: As a unified data catalog, it supports multiple data sources, computing engines, and data platforms for data development, management, and governance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time and Consistency&lt;/strong&gt;: Real-time acquisition of metadata to ensure SSOT (Single Source of Truth).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Registration&lt;/strong&gt;: Supports adding/altering the data catalog on the fly, no need to restart the service, which makes maintenance and upgrades much easier than before.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Engine Support&lt;/strong&gt;: Not only data engines like Trino, Apache Spark, and Apache Flink (WIP), but also AI/ML frameworks such as TensorFlow (WIP), PyTorch (WIP), and Ray (WIP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Storage Support&lt;/strong&gt;: Supports both Data and AI domain-specific storages, including HDFS/Hive, Iceberg, RDBMS, as well as NAS/CPFS, JuiceFS, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eco-friendly&lt;/strong&gt;: Supports using external Apache Ranger for permission management, external event bus for audit and notification, and external SchemaRegistry for messaging catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;WIP: the feature is still in active development.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8n2haejlfwppeomk0wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn8n2haejlfwppeomk0wx.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Unified metadata lake, unified management
&lt;/h3&gt;

&lt;p&gt;As the types of data sources become more and more abundant, computing engines like Trino, Spark, and Flink each need to maintain a long list of source catalogs. That introduces a lot of duplicated and complicated maintenance work.&lt;/p&gt;

&lt;p&gt;To build fabric among multiple data sources and computing engines, it is often expected to manage all kinds of data catalogs in one place, and then use a unified service to expose those metadata. Gravitino is extremely useful in this context as it provides a unified metadata lake to standardize the data catalog operations, and unify all metadata management and governance.&lt;/p&gt;
&lt;h3&gt;
  
  
  User Story
&lt;/h3&gt;

&lt;p&gt;Users can use a three-level coordinate, &lt;code&gt;catalog.schema.entity&lt;/code&gt;, to describe all the data, which can then be used for data integration, federated queries, etc. What is exciting is that engines no longer need to maintain complex and tedious data catalogs, reducing the integration complexity from &lt;em&gt;&lt;code&gt;O(M*N)&lt;/code&gt;&lt;/em&gt; to &lt;em&gt;&lt;code&gt;O(M+N)&lt;/code&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: M represents the number of engines, N represents the number of data sources.&lt;/p&gt;
&lt;/blockquote&gt;
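&lt;p&gt;The complexity reduction above can be sketched with a few lines of Python; the engine and source names are illustrative only.&lt;/p&gt;

```python
# Hypothetical counts: M engines, N data sources.
engines = ["Trino", "Spark", "Flink"]             # M = 3
sources = ["Hive", "Iceberg", "MySQL", "Kafka"]   # N = 4

# Without a unified metadata lake, every engine maintains a catalog
# per data source: M * N integrations to configure and keep in sync.
direct = len(engines) * len(sources)

# With Gravitino, each engine and each source connects once to the
# metalake: M + N integrations in total.
unified = len(engines) + len(sources)

print(direct, unified)  # 12 vs. 7
```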

&lt;p&gt;Furthermore, we can use a simple and unified language to make data integration and federated queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark: Writing to Apache Doris from Apache Hive (with Gravitino Spark connector).
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO doris_cluster_a.doris_db.doris_table
SELECT
    goods_id,
    goods_name,
    price
FROM
    hive_cluster_a.hive_db.hive_table

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Trino: Making a query between Hive and Apache Iceberg (with Gravitino Trino connector).
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    *
FROM
    hive_cluster_b.hive_db.hive_table a
JOIN
    iceberg_cluster_b.iceberg_db.iceberg_table b
ON a.name = b.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  2. Integrate data and AI asset management
&lt;/h2&gt;

&lt;p&gt;In the realm of big data, we have made significant progress through data lineage, access measurement, and life cycle management. However, in the domain of AI, non-tabular data has always been the most challenging aspect of data management and governance, encompassing HDFS files, NAS files, and other formats.&lt;/p&gt;
&lt;h3&gt;
  
  
  Challenges of AI asset management
&lt;/h3&gt;

&lt;p&gt;In the realm of machine learning, the process of reading and writing files is very flexible. Users can use various formats, such as &lt;em&gt;Thrift-Sequence&lt;/em&gt;, &lt;em&gt;Thrift-Parquet&lt;/em&gt;, &lt;em&gt;Parquet&lt;/em&gt;, &lt;em&gt;TFRecord&lt;/em&gt;, &lt;em&gt;JSON&lt;/em&gt;, &lt;em&gt;text&lt;/em&gt;, and more. Additionally, they can leverage multiple programming languages, including Scala, SQL, Python, and others. To manage our AI assets, we need to take into account these diverse uses and ensure adaptability and compatibility.&lt;/p&gt;

&lt;p&gt;Similar to tabular data management, non-tabular data also needs to adapt to a variety of engines and storages, including frameworks like PyTorch and TensorFlow, as well as various storage interfaces like &lt;em&gt;FileSystem&lt;/em&gt; for file sets, &lt;em&gt;FUSE&lt;/em&gt; for instance disk, &lt;em&gt;CSI&lt;/em&gt; for container storage.&lt;/p&gt;
&lt;h3&gt;
  
  
  Non-tabular data management architecture
&lt;/h3&gt;

&lt;p&gt;We aim to establish AI asset management capabilities by leveraging Gravitino, whose core technologies are outlined in the figure below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Non-tabular data catalog management&lt;/em&gt;&lt;/strong&gt;: Achieving the auditing for AI assets, and the assurance of the specification for file paths;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;File interface support&lt;/em&gt;&lt;/strong&gt;: Ensuring seamless compatibility with various file interfaces:

&lt;ul&gt;
&lt;li&gt;Hadoop File System: Achieving the compatibility with the Hadoop file system through GVFS (Gravitino Virtual File System).&lt;/li&gt;
&lt;li&gt;CSI Driver: Facilitating the reading and writing of files within container storage.&lt;/li&gt;
&lt;li&gt;FUSE Driver: Enabling the reading and writing of files directly on the physical machine disk.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI asset lifecycle management&lt;/strong&gt;: Implementing TTL (Time-To-Live) management for non-tabular data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtb1nnrpuuw5blfev8fb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtb1nnrpuuw5blfev8fb.png" alt=" " width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  User story
&lt;/h3&gt;

&lt;p&gt;We expect that the migration process for users from the original way to the new approach will be straightforward and seamless. In fact, the transition involves just two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a fileset catalog with the storage location, and configure the TTL (Time-To-Live) on the Gravitino-based data platform.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace the original file path with a new way: &lt;em&gt;&lt;code&gt;gvfs://&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To illustrate, let's consider the example of Spark reading HDFS files as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// 1.Structured data - Parquet
val inputPath = "hdfs://cluster_a/database/table/date=20240309"
val df = spark.read.parquet(inputPath).select()...
val outputPath = "hdfs://cluster_a/database/table/date=20240309/hour=00"
df.write().format("parquet").mode(SaveMode.Overwrite).save(outputPath)
// 2.Semi-structured data - Json
inputPath = "hdfs://cluster_a/database/table/date=20240309_${date-7}/xxx.json"
val fileRDD = sc.read.json(inputPath)
// 3.Unstructured data - Text
val inputPath = "hdfs://cluster_a/database/table/date=20240309_12"
val fileRDD = sc.read.text(inputPath)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Leveraging Gravitino, we create a fileset called “myfileset” pointing to the original HDFS location. We can then replace the original &lt;code&gt;hdfs://xxx&lt;/code&gt; path with the new &lt;code&gt;gvfs://fileset/xxx&lt;/code&gt; path, offering users a seamless and intuitive way to upgrade. Users no longer have to care about the real storage location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// 1.Structured data - Parquet
val inputPath = "gvfs://fileset/myfileset/database/table/date=20240309"
val df = spark.read.parquet(inputPath).select()...
val outputPath = "gvfs://fileset/myfileset/database/table/date=20240309/hour=00"
df.write().format("parquet").mode(SaveMode.Overwrite).save(outputPath)
// 2.Semi-structured data - Json
inputPath = "gvfs://fileset/myfileset/database/table/date=20240309_${date-7}/xxx.json"
val fileRDD = sc.read.json(inputPath)
// 3.Unstructured data - Text
val inputPath = "gvfs://fileset/myfileset/database/table/date=20240309_12")
val fileRDD = sc.read.text(inputPath)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As previously mentioned, file reading and writing are highly flexible and must adapt to diverse engines. Instead of enumerating individual examples, the overarching principle is that users should be able to manage and govern non-tabular data with minimal modifications.&lt;/p&gt;

&lt;p&gt;Many challenges within AI asset management still require exploration and development work, including specifying the depth and date of file paths, facilitating data sharing, and exploring non-tabular read/write solutions based on data lake formats like Iceberg. These will be our focus in the near future.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Unify user permission management
&lt;/h2&gt;

&lt;p&gt;Metadata and user permission information are closely related, so it is always a good idea to manage them together. The metadata service also needs to integrate permission-related capabilities to authenticate resource operations. We expect to achieve this in our data platform by leveraging Gravitino.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges of unified authentication across multi-system
&lt;/h3&gt;

&lt;p&gt;In order to provide users with a seamless data development experience, the data platform often needs to integrate with various storage and computation systems. However, such integrations often lead to the challenge of managing multiple systems and accounts.&lt;/p&gt;

&lt;p&gt;Users need to authenticate themselves using different accounts in different systems like HDFS (Kerberos), Doris (User/Password), and Talos (AK/SK - Xiaomi IAM account). Such fragmented authentication and authorization processes significantly slow down development and can even block it.&lt;/p&gt;

&lt;p&gt;To address this issue, a crucial step for a streamlined data development platform is to shield the complexity of different account systems and establish a unified authorization framework to increase the efficiency of data development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified user permissions based on workspace
&lt;/h3&gt;

&lt;p&gt;Xiaomi's data platform is designed around the concept of Workspace and utilizes the RBAC (Role-Based Access Control) permission model. Gravitino allows us to generate what we call "&lt;em&gt;mini-accounts&lt;/em&gt;" (actual resource accounts, such as &lt;em&gt;HDFS-Kerberos&lt;/em&gt;) within the workspace, effectively shielding users from the complexities of &lt;em&gt;Kerberos&lt;/em&gt;, &lt;em&gt;User/Password&lt;/em&gt;, and &lt;em&gt;IAM/AKSK&lt;/em&gt; accounts.&lt;/p&gt;

&lt;p&gt;Here are the key components of this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Workspace&lt;/em&gt;&lt;/strong&gt;: Workspaces serve as the smallest operational unit within the data platform, containing all associated resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Role&lt;/em&gt;&lt;/strong&gt;: Identities within the workspace, such as Admin, Developer, and Guest. Each role is granted different permissions for accessing workspace resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Resource&lt;/em&gt;&lt;/strong&gt;: Resources within the workspace, such as &lt;code&gt;catalog.database.table&lt;/code&gt;, are abstracted into three-level coordinates thanks to unified metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Permission&lt;/em&gt;&lt;/strong&gt;: Permissions determine the level of control granted to users for operating resources within the workspace, including admin, write, and read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Token&lt;/em&gt;&lt;/strong&gt;: A unique ID used to identify individuals within the workspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Authentication&lt;/em&gt;&lt;/strong&gt;: API operations are authenticated using tokens, while IAM identities are carried through UI operations after login.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Authorization&lt;/em&gt;&lt;/strong&gt;: Authorization is managed through Apache Ranger, granting the necessary permissions to authenticated workspace roles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Mini-account&lt;/em&gt;&lt;/strong&gt;: Each workspace has a dedicated set of proxy accounts to access the resources, such as HDFS (Kerberos) or Apache Doris (User/Password). When the engine accesses the underlying resources, it seamlessly utilizes the corresponding mini-account authentication for each resource. However, the entire process remains transparent to the user, who only needs to focus on managing workspace permissions (which are equivalent to resource permissions by leveraging Gravitino).&lt;/li&gt;
&lt;/ul&gt;
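&lt;p&gt;To make the interplay of these components concrete, here is a minimal sketch in Python. All names and structures here are hypothetical illustrations of the model described above, not Xiaomi's or Gravitino's actual implementation:&lt;/p&gt;

```python
# Hypothetical sketch of the workspace RBAC plus mini-account model described
# above. All names are illustrative, not Xiaomi's actual implementation.

# Role-to-permission mapping inside a workspace (RBAC).
ROLE_PERMISSIONS = {
    "Admin": {"admin", "write", "read"},
    "Developer": {"write", "read"},
    "Guest": {"read"},
}

class Workspace:
    def __init__(self, name):
        self.name = name
        self.members = {}  # maps a user's token to their role
        # One proxy "mini-account" per underlying system, created with the
        # workspace. Users never see or manage these accounts directly.
        self.mini_accounts = {
            "hdfs": f"{name}-kerberos-principal",
            "doris": f"{name}-doris-user",
        }

    def add_member(self, token, role):
        self.members[token] = role

    def access(self, token, system, permission):
        """Check the workspace permission, then resolve the proxy account."""
        role = self.members.get(token)
        if role is None:
            raise PermissionError("unknown token")
        if permission not in ROLE_PERMISSIONS[role]:
            raise PermissionError(f"role {role} lacks {permission}")
        # Only after the check passes does the engine pick up the
        # mini-account, transparently to the user.
        return self.mini_accounts[system]

ws = Workspace("analytics")
ws.add_member("token-123", "Developer")
print(ws.access("token-123", "hdfs", "write"))   # analytics-kerberos-principal
```

&lt;p&gt;The key property of the model is that the token holder never handles Kerberos or user/password credentials directly; the engine resolves the mini-account only after the workspace permission check passes.&lt;/p&gt;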

&lt;h3&gt;
  
  
  User story
&lt;/h3&gt;

&lt;p&gt;The figure below shows a brief process for users to create and access resources on our data platform:&lt;/p&gt;

&lt;p&gt;Users are only ever aware of their workspace identity and workspace permissions.&lt;/p&gt;

&lt;p&gt;Upon creating a workspace, a suite of workspace proxy mini-accounts is automatically created. Whenever resources are created or imported within the workspace, the corresponding proxy mini-account is authorized with the necessary resource permissions.&lt;/p&gt;

&lt;p&gt;When a user attempts to read or write to a resource, the system verifies their workspace permissions. If the workspace permission check is successful, the engine utilizes the mini-account to perform the desired read or write operation on the resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw97vjw83njrnr7n685cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw97vjw83njrnr7n685cj.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog, we showcased three important scenarios at Xiaomi that we are using Gravitino to accomplish. Most of the critical work has been done, and the rest is ongoing with good progress. We are confident that all of the above scenarios will land successfully at Xiaomi to better support our data-driven business, and we are glad to be part of the Gravitino community, co-creating what may become the de facto standard for the unified metadata lake.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introducing Gravitino 0.4.0</title>
      <dc:creator>Alex Yan</dc:creator>
      <pubDate>Mon, 11 Aug 2025 16:28:45 +0000</pubDate>
      <link>https://dev.to/datastrato/introducing-gravitino-040-3n0n</link>
      <guid>https://dev.to/datastrato/introducing-gravitino-040-3n0n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsxs5x9fcngko6lz1sag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsxs5x9fcngko6lz1sag.png" alt="Introducing Gravitino 0.4.0" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, we are pleased to announce the release of Gravitino 0.4.0. This is a stable release that includes more than 280 bug fixes as well as a number of new features.&lt;/p&gt;

&lt;p&gt;In this blog post, we will walk you through the highlights of Gravitino 0.4.0, giving you a quick overview of features and enhancements. To learn more about the nitty-gritty details, we recommend going through the comprehensive Gravitino 0.4.0 &lt;a href="https://github.com/datastrato/gravitino/releases/tag/v0.4.0" rel="noopener noreferrer"&gt;release notes&lt;/a&gt;, which include a full list of major features and resolved issues across all Gravitino components.&lt;/p&gt;

&lt;h2&gt;
  
  
  Public preview of Gravitino web UI
&lt;/h2&gt;

&lt;p&gt;With the release of Gravitino 0.4.0, we are excited to announce the public preview of Gravitino’s web UI. This greatly improves the user experience of Gravitino.&lt;/p&gt;

&lt;p&gt;The Gravitino web UI supports creating, updating, and deleting metadata such as metalakes and catalogs. It can also list and display schemas, tables, columns, and their details. You can access the UI by visiting &lt;code&gt;http://{gravitino-host}:8090&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of Gravitino’s UI; you can manage metalakes as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F297tf0elt96hji4i9otm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F297tf0elt96hji4i9otm.png" alt="metalakes" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within each metalake, you can also manage catalogs; the UI lists all the catalogs in a tree structure, with schemas and tables under them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshm94f9aq7m5ot1p58n9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshm94f9aq7m5ot1p58n9.png" alt="catalogs" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1htdlor8rmsr6e3072uk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1htdlor8rmsr6e3072uk.png" alt="tables" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unified support of partition management
&lt;/h2&gt;

&lt;p&gt;One of the new features in Gravitino 0.4.0 is unified partition management for tables. With Gravitino, you can create, list, get, and delete partitions from different sources in a unified way via the REST API and Java API.&lt;/p&gt;

&lt;p&gt;Gravitino provides a generic representation of partition definitions. It supports identity partitions (Hive’s partition style), as well as the list and range partitions supported by other engines.&lt;/p&gt;
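&lt;p&gt;Conceptually, these partition styles differ only in how a partition's membership is described. The following is a minimal, hypothetical sketch of that idea in Python (the class names are illustrative, not Gravitino's API; the Shell and Java examples below show the actual interfaces):&lt;/p&gt;

```python
# Hypothetical sketch of a generic partition representation covering the
# three partition styles mentioned above. These are NOT Gravitino's actual
# classes; the Shell and Java examples below show the real API.
from dataclasses import dataclass

@dataclass
class IdentityPartition:
    # One fixed value per partitioning field, e.g. dt=2008-08-08, country=us.
    field_names: list
    values: list

@dataclass
class ListPartition:
    # A named partition holding an explicit set of value tuples.
    name: str
    value_lists: list

@dataclass
class RangePartition:
    # A named partition bounded by lower and upper values.
    name: str
    lower: object
    upper: object

p = IdentityPartition(field_names=[["dt"], ["country"]],
                      values=["2008-08-08", "us"])
print(p.values)   # prints ['2008-08-08', 'us']
```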

&lt;p&gt;Here is a brief example of how to use Gravitino to manage partitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "partitions": [
    {
      "type": "identity",
      "fieldNames": [
        [
          "dt"
        ],
        [
          "country"
        ]
      ],
      "values": [
        {
          "type": "literal",
          "dataType": "date",
          "value": "2008-08-08"
        },
        {
          "type": "literal",
          "dataType": "string",
          "value": "us"
        }
      ]
    }
  ]
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Java
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GravitinoClient gravitinoClient = GravitinoClient
    .builder("http://localhost:8090")
    .build();

// Assumes that you have a partitioned table named "metalake.catalog.schema.table".
Partition addedPartition =
    gravitinoClient
        .loadMetalake(NameIdentifier.of("metalake"))
        .loadCatalog(NameIdentifier.of("metalake", "catalog"))
        .asTableCatalog()
        .loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table"))
        .supportPartitions()
        .addPartition(
            Partitions.identity(
              new String[][] {{"dt"}, {"country"}},
              new Literal[] {
              Literals.dateLiteral(LocalDate.parse("2008-08-08")), Literals.stringLiteral("us")},
              Maps.newHashMap()));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details, you can refer to Gravitino’s &lt;a href="https://datastrato.ai/docs/0.4.0/manage-table-partition-using-gravitino" rel="noopener noreferrer"&gt;partition management documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support column default values, auto increment, and table indexes
&lt;/h2&gt;

&lt;p&gt;As a unified metadata lake, Gravitino’s goal is to provide a unified representation of different metadata. In version 0.4.0, we have added support for default values and auto-increment in column definitions, as well as indexes in tables.&lt;/p&gt;

&lt;p&gt;Users can now create tables with default values and auto-increment specified in column definitions, and with indexes specified in table definitions.&lt;/p&gt;

&lt;p&gt;Here’s also a brief example of how to use them in table creation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "table",
  "columns": [
    {
      "name": "id",
      "type": "integer",
      "nullable": false,
      "autoIncrement": true,
      "comment": "Id of the user"
    },
    {
      "name": "name",
      "type": "varchar(1000)",
      "nullable": true,
      "comment": "Name of the user"
    },
    {
      "name": "age",
      "type": "integer",
      "nullable": false,
      "comment": "Age of the user",
      "defaultValue": {
        "type": "literal",
        "dataType": "integer",
        "value": "-1"
      }
    },
    {
      "name": "score",
      "type": "double",
      "nullable": true,
      "comment": "Score of the user"
    }
  ],
  "comment": "A user table with detailed information",
  "indexes": [
    {
      "indexType": "PRIMARY_KEY",
      "name": "PRIMARY",
      "fieldNames": [["id"]]
    },
    {
      "indexType": "UNIQUE_KEY",
      "name": "name_age_score_uk",
      "fieldNames": [["name"],["age"],["score"]]
    }
  ]
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Java
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tableCatalog.createTable(
    NameIdentifier.of("metalake", "hive_catalog", "schema", "table"),
    new Column[] {
      Column.of("id", Types.IntegerType.get(), "Id of the user", false, true, null),
      Column.of("name", Types.VarCharType.of(1000), "Name of the user", true, false, null),
      Column.of("age", Types.IntegerType.get(), "Age of the user", false, false, Literals.integerLiteral(-1)),
      Column.of("score", Types.DoubleType.get(), "Score of the user", true, false, null)
    },
    "A user table with detailed information",
    tablePropertiesMap,
    Transforms.EMPTY_TRANSFORM,
    Distributions.NONE,
    new SortOrder[0],
    new Index[] {
      Indexes.of(IndexType.PRIMARY_KEY, "PRIMARY", new String[][]{{"id"}}),
      Indexes.of(IndexType.UNIQUE_KEY, "name_age_score_uk", new String[][]{{"name"}, {"age"}, {"score"}})
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details, you can refer to Gravitino’s &lt;a href="https://datastrato.ai/docs/0.4.0/manage-metadata-using-gravitino#table-column-default-value" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; on how to specify default values, auto-increment, and indexes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security enhancements
&lt;/h2&gt;

&lt;p&gt;Security is always a top priority for Gravitino. Version 0.3.0 introduced OAuth2 authentication support, and this release adds further security features.&lt;/p&gt;

&lt;p&gt;Kerberos is widely used in the big data field. In response to user demand, we have implemented Kerberos authentication for client-server communication. With SPNEGO enabled, clients can use Kerberos-authenticated headers to communicate with the server.&lt;/p&gt;

&lt;p&gt;This version also adds support for user impersonation, ensuring that each request reaches the underlying sources as the real user. For the Hive catalog, we have added Kerberos support for communicating with the Hive Metastore: by simply configuring the principal and keytab, Gravitino can now communicate with HMS using Kerberos authentication.&lt;/p&gt;

&lt;p&gt;For more details on how to enable security-related features, see the &lt;a href="https://datastrato.ai/docs/0.4.0/security" rel="noopener noreferrer"&gt;security documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Support complete operator pushdown for the Trino connector
&lt;/h2&gt;

&lt;p&gt;In version 0.4.0, we implemented complete operator pushdown for the Trino connector, which improves performance. We also ran a TPC-H benchmark; here is the performance comparison between the two versions (lower is better):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbv0qo10maeusos3pnuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbv0qo10maeusos3pnuu.png" alt="TPC-H benchmark" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, for most TPC-H queries the Gravitino 0.4.0 Trino connector outperforms the previous version, with up to a 38% performance boost and a 7% improvement on average.&lt;/p&gt;

&lt;h2&gt;
  
  
  Java 8, 11, and 17 support
&lt;/h2&gt;

&lt;p&gt;Gravitino can now run on a variety of Java environments, including Java 8, 11, and 17. This enhancement offers increased flexibility and compatibility for users.&lt;/p&gt;

&lt;h2&gt;
  
  
  More than just features in Gravitino 0.4.0
&lt;/h2&gt;

&lt;p&gt;While the spotlight often falls on new features, the project’s true strength lies in its focus on usability, stability, and incremental improvement. To that end, Gravitino 0.4.0 has resolved over 280 issues, thanks to the collaborative efforts of all the contributors. To learn more, read the &lt;a href="https://github.com/datastrato/gravitino/releases/tag/v0.4.0" rel="noopener noreferrer"&gt;release notes&lt;/a&gt; for the full list of improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Gravitino 0.4.0 today
&lt;/h2&gt;

&lt;p&gt;If you want to experiment with Gravitino 0.4.0, you can simply launch the provided Docker playground; see the &lt;a href="https://datastrato.ai/docs/0.4.0/how-to-use-the-playground" rel="noopener noreferrer"&gt;playground documentation&lt;/a&gt;. If you want to install and run it from scratch, see the documentation on &lt;a href="https://datastrato.ai/docs/0.4.0/how-to-install" rel="noopener noreferrer"&gt;how to install Gravitino&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please let us know if you have any questions. You can contact us via our &lt;a href="https://github.com/datastrato/gravitino" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;, or join our &lt;a href="https://gravitino.discourse.group/" rel="noopener noreferrer"&gt;Discourse community&lt;/a&gt; and &lt;a href="https://join.slack.com/t/datastrato-community/shared_invite/zt-2a8vsjoch-cU_uUwHA_QU6Ab50thoq8w" rel="noopener noreferrer"&gt;Slack group&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
