Datastrato

Apache Gravitino Introduction


Author: shaofeng shi

Last Updated: [2025-12-29]

Background

Enterprises often need to manage metadata across multiple clouds, domains, and heterogeneous data sources such as Apache Hive, MySQL, PostgreSQL, Iceberg, Lance, S3, and GCS. As AI model training and inference become widespread, large volumes of multimodal data and model metadata also need unified management. The traditional approach manages metadata separately for each data source, which increases operational complexity and tends to create data silos. Apache Gravitino, a high-performance, geographically distributed, federated metadata lake, provides a unified way to manage metadata from multiple sources.

Gravitino was created by Datastrato Inc., open-sourced in 2023, donated to the Apache Incubator in 2024, and graduated from the Apache Incubator in May 2025 to become an Apache Top-Level Project. It runs in production at companies such as Xiaomi, Tencent, Zhihu, Uber, and Pinterest.

What is Apache Gravitino?

Apache Gravitino is a high-performance, geographically distributed, federated metadata lake management system that provides users with a unified data and AI asset management platform. It can:

  • Unified Metadata Management: Provide unified metadata models and APIs for different types of data sources
  • Direct Metadata Management: Directly manage underlying systems, with changes reflected in the source systems in real time
  • Multi-Engine Support: Support multiple query engines such as Trino, Spark, Flink, etc.
  • Geographically Distributed Deployment: Support cross-region, cross-cloud deployment architectures
  • AI Asset Management: Manage not only data assets but also AI/ML model metadata

Core concepts include:

  • Metalake: Container (tenant) for metadata; typically, one organization corresponds to one metalake
  • Catalog: Collection of metadata from specific metadata sources
  • Schema: Second-level namespace, corresponding to the schema concept in databases
  • Table: Bottom-level object representing specific data tables
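
The four concepts above nest inside one another, so a table can be addressed by the path from its metalake down. The following is a minimal sketch of that hierarchy, not Gravitino's own client API: the `TableIdent` class, its methods, and the sample names (`my_org`, `hive_prod`, `sales`, `orders`) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableIdent:
    """Hypothetical helper: one table addressed through the four levels."""

    metalake: str  # top-level container, typically one per organization
    catalog: str   # metadata from one specific source
    schema: str    # namespace inside the catalog
    table: str     # bottom-level object

    def full_name(self) -> str:
        # Dot-joined path from the metalake down to the table.
        return ".".join((self.metalake, self.catalog, self.schema, self.table))

    @classmethod
    def parse(cls, name: str) -> "TableIdent":
        # Split a dot-joined path back into its four levels.
        parts = name.split(".")
        if len(parts) != 4:
            raise ValueError(
                f"expected metalake.catalog.schema.table, got {name!r}"
            )
        return cls(*parts)


ident = TableIdent("my_org", "hive_prod", "sales", "orders")
print(ident.full_name())
```

Keeping the levels as separate fields, rather than one string, makes it easy to walk up the hierarchy (for example, from a table to its catalog) without re-parsing names.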

[Figure: Gravitino Overall Architecture]

Apache Gravitino Core Features Overview

Unified Metadata Management

Gravitino provides a unified metadata management layer that supports integration with multiple data sources:

Supported Data Source Types:

  • Relational Databases: MySQL, PostgreSQL, OceanBase, Apache Doris, StarRocks, etc.
  • Big Data Storage: Apache Hive, Apache Iceberg, Apache Hudi, Apache Paimon, Delta Lake (in development)
  • Message Queues: Apache Kafka
  • File Systems: HDFS, S3, GCS, Azure Blob Storage, Alibaba Cloud OSS
  • AI/ML Data Formats: Lance (columnar data format designed specifically for AI/ML workloads)

REST API Services

Gravitino provides rich REST API services that support standardized access to different data formats:

Gravitino Core REST API

  • Complete metadata management RESTful API interface
  • Support for CRUD operations on all metadata objects including Metalake, Catalog, Schema, Table, etc.
  • Complete API for user, group, role, and permission management
  • API interfaces for advanced features like tags, policies, models, etc.
  • Support for multiple authentication methods (Simple, OAuth2, Kerberos)
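
To make the CRUD operations above concrete, here is a minimal sketch that only builds HTTP requests and does not send them. It assumes a local server at `http://localhost:8090`, the resource paths under `/api/metalakes/...`, and the versioned media type shown; treat all three as assumptions to verify against the official REST API reference for your version. The function names and sample object names are hypothetical.

```python
import json
from urllib.request import Request

# Assumptions: server address and media type; verify against the official docs.
BASE_URL = "http://localhost:8090"
ACCEPT = "application/vnd.gravitino.v1+json"


def list_catalogs_request(metalake: str) -> Request:
    """Build (but do not send) a request listing catalogs in a metalake."""
    url = f"{BASE_URL}/api/metalakes/{metalake}/catalogs"
    return Request(url, headers={"Accept": ACCEPT}, method="GET")


def create_schema_request(
    metalake: str, catalog: str, schema: str, comment: str = ""
) -> Request:
    """Build (but do not send) a request creating a schema in a catalog."""
    url = f"{BASE_URL}/api/metalakes/{metalake}/catalogs/{catalog}/schemas"
    body = json.dumps(
        {"name": schema, "comment": comment, "properties": {}}
    ).encode("utf-8")
    headers = {"Accept": ACCEPT, "Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


req = create_schema_request("my_org", "hive_prod", "sales", "Sales data")
print(req.get_method(), req.full_url)
```

Against a running server, the request could be sent with `urllib.request.urlopen(req)`; authentication headers are omitted here and would depend on which of the methods above is configured.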

Iceberg REST Service

  • Complies with Apache Iceberg REST API specification
  • Supports multiple backend storage options (Hive, JDBC, custom backends)
  • Provides complete table management and query capabilities
  • Supports multiple storage systems (S3, HDFS, GCS, Azure, etc.)

Lance REST Service

  • Implements Lance REST API specification
  • Optimized specifically for AI/ML workloads
  • Supports efficient vector data storage and retrieval
  • Provides namespace and table management functionality

Real-time Metadata Retrieval and Modification

Gravitino manages metadata directly in the underlying systems to keep it current and consistent:

  • Real-time Synchronization: Changes to metadata are immediately reflected in underlying data sources
  • Bidirectional Synchronization: Supports metadata synchronization from Gravitino to data sources and from data sources to Gravitino
  • Transaction Support: Ensures atomicity and consistency of metadata operations
  • Version Management: Supports metadata version control and historical tracking

Unified Access Control

Gravitino implements unified permission management across multiple data sources:

Core Features:

  • Role-Based Access Control (RBAC): Supports flexible permission management for users, groups, and roles
  • Ownership Model: Each metadata object has a clear owner
  • Permission Inheritance: Supports hierarchical permission inheritance mechanisms
  • Fine-grained Control: Multi-level permission control from Metalake to specific tables

Supported Permission Types:

  • User and group management permissions
  • Catalog and schema creation permissions
  • Read/write permissions for tables, topics, filesets
  • Model registration and version management permissions
  • Tag and policy application permissions
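
Hierarchical permission inheritance can be pictured as a lookup that walks from an object up through its parents. The sketch below is an illustrative model only, not Gravitino's implementation: the `ancestors` and `is_allowed` functions, the dot-joined object paths, and the sample roles are hypothetical, and the privilege names are used as plain labels.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# role name -> set of (object path, privilege) grants
Grants = Dict[str, Set[Tuple[str, str]]]


def ancestors(path: str) -> Iterator[str]:
    """Yield the object itself, then each enclosing object up to the metalake."""
    parts = path.split(".")
    for end in range(len(parts), 0, -1):
        yield ".".join(parts[:end])


def is_allowed(
    grants: Grants, roles: Iterable[str], path: str, privilege: str
) -> bool:
    """A privilege granted on a parent object also applies to its children."""
    for role in roles:
        role_grants = grants.get(role, set())
        if any((obj, privilege) in role_grants for obj in ancestors(path)):
            return True
    return False


grants: Grants = {
    # Granted at schema level, so it covers every table in that schema.
    "analyst": {("my_org.hive_prod.sales", "SELECT_TABLE")},
    "admin": {("my_org", "CREATE_CATALOG")},
}

print(is_allowed(grants, ["analyst"], "my_org.hive_prod.sales.orders", "SELECT_TABLE"))
print(is_allowed(grants, ["analyst"], "my_org.hive_prod.hr.salaries", "SELECT_TABLE"))
```

The model covers allow grants only; ownership and any deny semantics are left out to keep the inheritance rule visible.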

Unified Data Lineage

Based on OpenLineage standards, Gravitino provides complete data lineage tracking capabilities:

  • Automatic Lineage Collection: Automatically collect data lineage information through Spark plugins
  • Unified Identifiers: Convert identifiers from different data sources to Gravitino unified identifiers
  • Multi-Data Source Support: Support lineage tracking for various data sources including Hive, Iceberg, JDBC, file systems, etc.
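
The idea behind unified identifiers can be sketched as a lookup from a source-specific dataset reference to a catalog-qualified name. OpenLineage describes a dataset by a `namespace` and a `name`; everything else below is hypothetical: the mapping table, the `to_unified` function, the sample hosts, and the output format are assumptions for illustration and do not reproduce the rule Gravitino's plugin applies.

```python
from typing import Dict

# Hypothetical mapping: OpenLineage dataset namespace -> catalog name.
CATALOG_BY_NAMESPACE: Dict[str, str] = {
    "hive://metastore:9083": "hive_prod",
    "mysql://db-host:3306": "mysql_orders",
}


def to_unified(namespace: str, name: str, metalake: str = "my_org") -> str:
    """Rewrite a source-specific dataset reference as one unified name."""
    catalog = CATALOG_BY_NAMESPACE.get(namespace)
    if catalog is None:
        # Unknown source: keep the original reference unchanged.
        return f"{namespace}/{name}"
    return f"{metalake}.{catalog}.{name}"


print(to_unified("hive://metastore:9083", "sales.orders"))
print(to_unified("s3://unmanaged-bucket", "raw/events"))
```

With a shared naming scheme, lineage edges reported by different engines for the same table resolve to one node instead of several.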

High Availability and Scalability

Deployment Modes:

  • Single-node Deployment: Suitable for development and testing environments
  • Cluster Deployment: Supports high availability and load balancing
  • Kubernetes Deployment: Supports containerized deployment and auto-scaling
  • Docker Support: Provides official Docker images

Storage Backends:

  • Supports multiple metadata storage backends (MySQL, PostgreSQL, etc.)
  • Supports distributed storage systems

Security Features

Authentication Methods:

  • Simple authentication (username/password)
  • OAuth2 authentication
  • Kerberos authentication (for Hive backends)

Credential Management:

  • Supports cloud storage credential vending (S3, GCS, Azure, etc.)
  • Dynamic credential refresh
  • Secure credential passing mechanisms

Apache Gravitino Integration Capabilities

Gravitino deeply integrates with mainstream compute engines and data processing frameworks, providing users with a unified data access experience.

Compute Engine Integration

Apache Spark

  • Seamless integration through Gravitino Spark Connector
  • Supports Spark SQL and DataFrame API
  • Automatic data lineage collection and tracking
  • Supports unified access to multiple data sources
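
As a rough sketch of what enabling the connector involves, the helper below assembles the Spark settings and turns them into `spark-submit` arguments. The plugin class name and the two `spark.sql.gravitino.*` keys are assumed from the connector documentation and should be checked against the version in use; the function names, server address, and metalake name are hypothetical.

```python
from typing import Dict, List


def gravitino_spark_conf(uri: str, metalake: str) -> Dict[str, str]:
    """Spark settings assumed to enable the Gravitino Spark connector."""
    return {
        "spark.plugins": (
            "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin"
        ),
        "spark.sql.gravitino.uri": uri,
        "spark.sql.gravitino.metalake": metalake,
    }


def as_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn the settings into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = gravitino_spark_conf("http://localhost:8090", "my_org")
print(" ".join(["spark-submit"] + as_submit_args(conf)))
```

The connector JAR must also be on the Spark classpath. The same key-value pairs could instead be passed to `SparkSession.builder.config(key, value)`.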

Trino

  • Integration through Gravitino Trino Connector service
  • Supports federated queries across data sources
  • High-performance analytical query capabilities

Apache Flink

  • Integration through Gravitino Flink Connector service
  • Supports stream-batch unified data processing
  • Real-time data processing and analysis

Python Ecosystem Integration

PyIceberg

  • Supports Iceberg table access in Python environments
  • Integrates with Gravitino Iceberg REST service
  • Supports data science and machine learning workflows
  • Provides Pandas-compatible data interfaces
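
A minimal sketch of pointing PyIceberg at the Iceberg REST service follows. It assumes the service listens at `http://localhost:9001/iceberg/` and needs no credentials; both are assumptions to adjust for a real deployment. The helper names and the catalog name `gravitino` are hypothetical, while `load_catalog` is PyIceberg's own entry point.

```python
from typing import Dict


def iceberg_rest_properties(
    uri: str = "http://localhost:9001/iceberg/",
) -> Dict[str, str]:
    """Catalog properties for a REST catalog (the URI is an assumption)."""
    return {"type": "rest", "uri": uri}


def open_catalog(name: str = "gravitino"):
    """Connect to the REST catalog; requires `pip install pyiceberg`."""
    # Imported lazily so the properties helper works without PyIceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **iceberg_rest_properties())


props = iceberg_rest_properties()
print(props)
```

With a running service, `open_catalog().load_table("sales.orders").scan().to_pandas()` would read a table into a Pandas DataFrame; the table name here is a placeholder.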

Daft

  • Modern distributed data processing framework
  • Optimized specifically for AI/ML workloads
  • Supports multimodal data processing
  • Integrates with Gravitino metadata management

Cloud-Native Integration

Kubernetes

  • Supports Kubernetes native deployment
  • Provides Helm Charts and Operators
  • Supports auto-scaling and fault recovery
  • Integrates with cloud-native monitoring and logging systems

APIs and SDKs

REST API

  • Complete RESTful API interface
  • Supports all metadata management operations
  • Standardized HTTP interface
  • Supports multiple authentication methods

Java SDK

  • Native Java client library
  • Type-safe API interface
  • Supports connection pooling and retry mechanisms
  • Complete exception handling

Python SDK

  • Python client library
  • Supports asynchronous operations
  • Integrates with Jupyter Notebook
  • Supports data science workflows

These integration capabilities let Gravitino fit into existing data infrastructure and give users a unified, efficient data management experience. Subsequent articles will cover each capability in detail, along with configuration and usage for each integration component. Stay tuned.

Next Steps


Apache Gravitino is evolving rapidly; this article is based on version 1.1.0, the latest at the time of writing. If you encounter issues, refer to the official documentation or open an issue on GitHub.
