Yue @ Datastrato (Admin) for Apache Gravitino

Posted on Jan 16 • Edited on Jan 22

Apache Gravitino Introduction

#architecture #dataengineering #opensource #gravitino101

Author: shaofeng shi

Last Updated: [2025-12-29]

Background

Enterprises often need to manage metadata across multiple clouds, business domains, and heterogeneous data sources such as Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, and GCS. As AI model training and inference become widespread, large volumes of multimodal data and model metadata also need unified management. The traditional approach manages metadata separately for each data source, which increases operational complexity and tends to create data silos. Apache Gravitino, a high-performance, geographically distributed, federated metadata lake, provides a unified way to manage metadata from multiple sources.

Gravitino was created by Datastrato Inc., open-sourced in 2023, donated to the Apache Incubator in 2024, and graduated from the Incubator in May 2025 to become an Apache Top-Level Project. It has been deployed in production at companies such as Xiaomi, Tencent, Zhihu, Uber, and Pinterest.

What is Apache Gravitino?

Apache Gravitino is a high-performance, geographically distributed, federated metadata lake management system that provides users with a unified data and AI asset management platform. It can:

Unified Metadata Management: Provide unified metadata models and APIs for different types of data sources
Direct Metadata Management: Manage metadata directly in the underlying systems, so that changes are reflected in the source systems in real time
Multi-Engine Support: Support multiple query engines such as Trino, Spark, Flink, etc.
Geographically Distributed Deployment: Support cross-region, cross-cloud deployment architectures
AI Asset Management: Manage not only data assets but also AI/ML model metadata

Core concepts include:

Metalake: Top-level container (tenant) for metadata; typically one organization maps to one metalake
Catalog: Collection of metadata from a specific metadata source
Schema: Second-level namespace, corresponding to the schema concept in databases
Table: Bottom-level object representing a specific data table
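
Because these four levels nest in a fixed order, an object can be addressed by a dotted name. The following is a minimal Python sketch for illustration only: the `ObjectName` class, its methods, and the example names are hypothetical and are not part of any Gravitino SDK. It also assumes that no name contains a dot.

```python
from dataclasses import dataclass
from typing import Optional

# The four levels described above, from outermost to innermost.
LEVELS = ("metalake", "catalog", "schema", "table")


@dataclass(frozen=True)
class ObjectName:
    """Hypothetical helper for a dotted name such as 'org.hive_prod.sales.orders'."""

    metalake: str
    catalog: Optional[str] = None
    schema: Optional[str] = None
    table: Optional[str] = None

    @classmethod
    def parse(cls, dotted: str) -> "ObjectName":
        parts = dotted.split(".")
        if not 1 <= len(parts) <= len(LEVELS) or not all(parts):
            raise ValueError(f"expected 1 to 4 non-empty parts, got {dotted!r}")
        return cls(*parts)

    def _parts(self) -> list:
        return [p for p in (self.metalake, self.catalog, self.schema, self.table) if p]

    def level(self) -> str:
        """Name of the deepest level that is set."""
        return LEVELS[len(self._parts()) - 1]

    def parent(self) -> Optional["ObjectName"]:
        """The enclosing object, or None for a metalake."""
        parts = self._parts()
        return ObjectName(*parts[:-1]) if len(parts) > 1 else None


orders = ObjectName.parse("org.hive_prod.sales.orders")
print(orders.level())   # table
print(orders.parent())  # the enclosing schema, org.hive_prod.sales
```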

Apache Gravitino Core Features Overview

Unified Metadata Management

Gravitino provides a unified metadata management layer that supports integration with multiple data sources:

Supported Data Source Types:

Relational Databases: MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, etc.
Big Data Storage: Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)
Message Queues: Apache Kafka
File Systems: HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS
AI/ML Data Formats: Lance (columnar data format designed specifically for AI/ML workloads)

REST API Services

Gravitino provides rich REST API services that support standardized access to different data formats:

Gravitino Core REST API

Complete RESTful API for metadata management
Support for CRUD operations on all metadata objects, including Metalake, Catalog, Schema, and Table
Complete API for user, group, role, and permission management
APIs for advanced features such as tags, policies, and models
Support for multiple authentication methods (Simple, OAuth2, Kerberos)
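
To make the CRUD operations concrete, here is a hedged Python sketch that builds, but does not send, two requests using only the standard library. The base URL (a local server on port 8090), the `/api/metalakes/...` path layout, the versioned `Accept` header, and the request body fields are assumptions based on the project's published REST documentation; check them against the API reference for your version. The `build_request` helper is hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Assumption: a local Gravitino server; adjust host and port for your deployment.
BASE_URL = "http://localhost:8090/api"
HEADERS = {
    "Accept": "application/vnd.gravitino.v1+json",  # assumed versioned media type
    "Content-Type": "application/json",
}


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the assumed base URL."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    url = f"{BASE_URL}/{path.lstrip('/')}"
    return urllib.request.Request(url, data=data, headers=HEADERS, method=method)


# Create a metalake named "demo", then list the catalogs inside it.
create_metalake = build_request(
    "POST", "metalakes", {"name": "demo", "comment": "example metalake", "properties": {}}
)
list_catalogs = build_request("GET", "metalakes/demo/catalogs")

# To actually send a request: urllib.request.urlopen(create_metalake)
```

Keeping request construction separate from sending makes the paths easy to inspect before pointing the code at a real server.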

Iceberg REST Service

Complies with Apache Iceberg REST API specification
Supports multiple backend storage options (Hive, JDBC, custom backends)
Provides complete table management and query capabilities
Supports multiple storage systems (S3, HDFS, GCS, Azure, etc.)
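
Because the service follows the Iceberg REST specification, any client that speaks that specification can address it. The sketch below builds the URL for listing tables in a namespace. The `/v1/.../namespaces/{namespace}/tables` route and the unit-separator encoding of multi-level namespaces come from the Iceberg REST specification; the service address is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import quote

# Assumption: where the Iceberg REST service listens; adjust for your deployment.
ICEBERG_REST = "http://localhost:9001/iceberg"


def namespace_path(namespace: tuple) -> str:
    """Encode a namespace for use in a URL path.

    The Iceberg REST spec joins multi-level namespaces with the
    unit separator (0x1F), which is then percent-encoded.
    """
    return quote("\x1f".join(namespace), safe="")


def tables_url(namespace: tuple, prefix: str = "") -> str:
    """URL for listing the tables in a namespace (optional catalog prefix)."""
    base = f"{ICEBERG_REST}/v1" + (f"/{prefix}" if prefix else "")
    return f"{base}/namespaces/{namespace_path(namespace)}/tables"


print(tables_url(("sales",)))
print(tables_url(("sales", "eu")))
```

In practice a client library such as PyIceberg builds these URLs itself once it is configured with a REST catalog URI.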

Lance REST Service

Implements Lance REST API specification
Optimized specifically for AI/ML workloads
Supports efficient vector data storage and retrieval
Provides namespace and table management functionality

Real-time Metadata Retrieval and Modification

Gravitino manages metadata directly in the underlying systems, which keeps metadata current and consistent:

Real-time Synchronization: Changes to metadata are immediately reflected in underlying data sources
Bidirectional Synchronization: Supports metadata synchronization from Gravitino to data sources and from data sources to Gravitino
Transaction Support: Ensures atomicity and consistency of metadata operations
Version Management: Supports metadata version control and historical tracking

Unified Access Control

Gravitino implements unified permission management across multiple data sources:

Core Features:

Role-Based Access Control (RBAC): Supports flexible permission management for users, groups, and roles
Ownership Model: Each metadata object has a clear owner
Permission Inheritance: Supports hierarchical permission inheritance mechanisms
Fine-grained Control: Multi-level permission control from Metalake to specific tables

Supported Permission Types:

User and group management permissions
Catalog and schema creation permissions
Read/write permissions for tables, topics, filesets
Model registration and version management permissions
Tag and policy application permissions
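
The interaction between roles, ownership, and hierarchical inheritance can be illustrated with a toy model. This is a sketch of the general idea, not Gravitino's implementation: the `ToyAccessControl` class is hypothetical, the privilege strings are used only as labels, and the rule that an owner may do anything to the owned object is a simplification.

```python
from collections import defaultdict


class ToyAccessControl:
    """Toy model: a privilege granted on an object also covers everything below it."""

    def __init__(self):
        self.user_roles = defaultdict(set)   # user -> role names
        self.role_grants = defaultdict(set)  # role -> {(object path, privilege)}
        self.owners = {}                     # object path -> owning user

    def assign(self, user, role):
        self.user_roles[user].add(role)

    def grant(self, role, object_path, privilege):
        self.role_grants[role].add((object_path, privilege))

    def set_owner(self, object_path, user):
        self.owners[object_path] = user

    def allowed(self, user, object_path, privilege):
        # Simplification: the owner of an object may do anything to that object.
        if self.owners.get(object_path) == user:
            return True
        # Check the object itself, then each ancestor up to the metalake.
        parts = object_path.split(".")
        scopes = [".".join(parts[:depth]) for depth in range(len(parts), 0, -1)]
        return any(
            (scope, privilege) in self.role_grants[role]
            for role in self.user_roles.get(user, ())
            for scope in scopes
        )


acl = ToyAccessControl()
acl.assign("alice", "analyst")
acl.grant("analyst", "org.hive_prod", "SELECT_TABLE")  # granted at catalog level
acl.set_owner("org.hive_prod.sales.orders", "bob")

# alice inherits the catalog-level grant on every table under org.hive_prod.
print(acl.allowed("alice", "org.hive_prod.sales.orders", "SELECT_TABLE"))
```

Granting at the catalog level means new schemas and tables under that catalog are covered without further grants, which is the practical benefit of inheritance.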

Unified Data Lineage

Based on OpenLineage standards, Gravitino provides complete data lineage tracking capabilities:

Automatic Lineage Collection: Automatically collect data lineage information through Spark plugins
Unified Identifiers: Convert identifiers from different data sources to Gravitino unified identifiers
Multi-Data Source Support: Support lineage tracking for various data sources including Hive, Iceberg, JDBC, file systems, etc.
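
The idea behind unified identifiers can be sketched as a lookup. OpenLineage describes a dataset by a `namespace` (where it lives) and a `name` (what it is called there); a unified identifier replaces the source-specific namespace with a catalog name. The mapping table, the namespace strings, and the `to_unified` function below are hypothetical, and the article does not describe how the actual plugin performs this conversion.

```python
from typing import Dict, NamedTuple, Optional


class Dataset(NamedTuple):
    """An OpenLineage-style dataset reference."""

    namespace: str  # e.g. the address of the source system
    name: str       # e.g. "schema.table" within that system


# Hypothetical mapping from source namespaces to catalog names.
CATALOG_BY_NAMESPACE: Dict[str, str] = {
    "hive://metastore:9083": "hive_prod",
    "postgres://db.internal:5432": "pg_orders",
}


def to_unified(dataset: Dataset, catalogs: Dict[str, str] = CATALOG_BY_NAMESPACE) -> Optional[str]:
    """Return 'catalog.schema.table', or None if the namespace is not mapped."""
    catalog = catalogs.get(dataset.namespace)
    return f"{catalog}.{dataset.name}" if catalog is not None else None


print(to_unified(Dataset("hive://metastore:9083", "sales.orders")))
```

With every source mapped this way, a lineage graph can join events from different engines on the same identifier.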

High Availability and Scalability

Deployment Modes:

Single-node Deployment: Suitable for development and testing environments
Cluster Deployment: Supports high availability and load balancing
Kubernetes Deployment: Supports containerized deployment and auto-scaling
Docker Support: Provides official Docker images

Storage Backends:

Supports multiple metadata storage backends (MySQL, PostgreSQL, etc.)
Supports distributed storage systems

Security Features

Authentication Methods:

Simple authentication (username/password)
OAuth2 authentication
Kerberos authentication (for Hive backends)
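
For the first two methods, a client typically carries its identity in the HTTP `Authorization` header. The sketch below builds that header using the standard HTTP Basic and Bearer schemes. That the server accepts exactly these schemes for simple and OAuth2 authentication is an assumption to verify against the security documentation; the function names are hypothetical, and Kerberos is not covered here.

```python
import base64


def basic_auth_header(username: str, password: str = "") -> dict:
    """HTTP Basic scheme: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_auth_header(access_token: str) -> dict:
    """HTTP Bearer scheme, as commonly used with OAuth2 access tokens."""
    return {"Authorization": f"Bearer {access_token}"}


print(basic_auth_header("alice", "secret"))
print(bearer_auth_header("example-token"))
```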

Credential Management:

Supports cloud storage credential vending (S3, GCS, Azure, etc.)
Dynamic credential refresh
Secure credential passing mechanisms
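
Dynamic refresh usually means replacing a short-lived credential shortly before it expires. The following is a generic sketch of that pattern, not Gravitino's code: `Credential`, `RefreshingCredential`, and the `fetch` callback (which stands in for whatever call returns a freshly vended credential) are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credential:
    token: str
    expires_at: float  # epoch seconds


class RefreshingCredential:
    """Cache a short-lived credential and refetch it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Credential],
        refresh_margin: float = 60.0,
        clock: Callable[[], float] = time.time,
    ):
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock  # injectable so the behavior can be tested
        self._current: Optional[Credential] = None

    def get(self) -> Credential:
        expired = (
            self._current is None
            or self._clock() >= self._current.expires_at - self._margin
        )
        if expired:
            self._current = self._fetch()
        return self._current
```

The refresh margin avoids handing out a credential that would expire in the middle of a long read or write.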

Apache Gravitino Integration Capabilities

Gravitino deeply integrates with mainstream compute engines and data processing frameworks, providing users with a unified data access experience.

Compute Engine Integration

Apache Spark

Seamless integration through Gravitino Spark Connector
Supports Spark SQL and DataFrame API
Automatic data lineage collection and tracking
Supports unified access to multiple data sources
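
As a rough illustration of what enabling the connector involves, the sketch below assembles Spark configuration entries and renders them as `spark-submit` arguments. The plugin class name and the `spark.sql.gravitino.*` property names are given as I recall them from the connector documentation and should be treated as assumptions to verify for your version; the two helper functions are hypothetical.

```python
def gravitino_spark_conf(uri: str, metalake: str) -> dict:
    """Spark settings that (by assumption) enable the Gravitino Spark connector."""
    return {
        # Assumed plugin class and property names; verify against the connector docs.
        "spark.plugins": "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
        "spark.sql.gravitino.uri": uri,
        "spark.sql.gravitino.metalake": metalake,
    }


def as_spark_submit_args(conf: dict) -> list:
    """Render a config dict as repeated '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = gravitino_spark_conf("http://localhost:8090", "demo")
print(" ".join(as_spark_submit_args(conf)))
```

The same dictionary can be applied in PySpark by passing each key and value to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.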

Trino

Integration through Gravitino Trino Connector service
Supports federated queries across data sources
High-performance analytical query capabilities

Apache Flink

Integration through Gravitino Flink Connector service
Supports stream-batch unified data processing
Real-time data processing and analysis

Python Ecosystem Integration

PyIceberg

Supports Iceberg table access in Python environments
Integrates with Gravitino Iceberg REST service
Supports data science and machine learning workflows
Provides Pandas-compatible data interfaces

Daft

Modern distributed data processing framework
Optimized specifically for AI/ML workloads
Supports multimodal data processing
Integrates with Gravitino metadata management

Cloud-Native Integration

Kubernetes

Supports Kubernetes native deployment
Provides Helm Charts and Operators
Supports auto-scaling and fault recovery
Integrates with cloud-native monitoring and logging systems

APIs and SDKs

REST API

Complete RESTful API
Supports all metadata management operations
Standardized HTTP interface
Supports multiple authentication methods

Java SDK

Native Java client library
Type-safe API
Supports connection pooling and retry mechanisms
Complete exception handling

Python SDK

Python client library
Supports asynchronous operations
Integrates with Jupyter Notebook
Supports data science workflows

These integrations let Gravitino fit into existing data infrastructure and give users a unified, efficient data management experience. Later articles in this series will cover each capability in detail, along with how to configure and use each integration component. Stay tuned.

Next Steps

Continue reading [Setup Guide]
Follow and star Apache Gravitino Repository

Apache Gravitino is evolving rapidly. This article is based on version 1.1.0, the latest release at the time of writing. If you encounter issues, please refer to the official documentation or open an issue on GitHub.
