
Alex Yan for Datastrato

Doris x Gravitino: Unified Metadata Management for Modern Lakehouse Architecture

With the rapid evolution of data lake technologies, building unified, secure, and efficient lakehouse architectures has become a core challenge in enterprise digital transformation. Apache Gravitino is a next-generation unified metadata management platform that provides comprehensive data governance for multi-cloud and multi-engine environments. It supports unified management of various data sources and compute engines, and its credential vending mechanism ensures secure, controlled data access.

This article provides an in-depth look at the integration between Apache Doris and Apache Gravitino, building a modern lakehouse architecture on the Iceberg REST Catalog. Through Gravitino's unified metadata management and dynamic credential vending, we achieve efficient and secure access to Iceberg data stored on S3.

What you'll learn from this guide:

  • AWS Environment Setup: How to create the S3 bucket and IAM role in AWS, configure secure credential management for Gravitino, and implement dynamic distribution of temporary credentials.

  • Gravitino Deployment and Configuration: How to quickly deploy Gravitino services, configure Iceberg REST Catalog, and enable vended-credentials functionality.

  • Connecting Doris to Gravitino: Detailed explanation of how Doris accesses Iceberg data through Gravitino's REST API, supporting two core storage access modes:

1. Static Credential Mode: Doris uses fixed AK/SK to directly access S3
2. Dynamic Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS AssumeRole

About Apache Doris

Apache Doris is a high-performance analytical and search database built for the AI era.

It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.

With Doris, users can easily build a real-time lakehouse data platform.

About Apache Gravitino

Apache Gravitino is an open-source, high-performance, distributed, federated metadata lake designed to manage metadata across different regions and data sources in public and private clouds. It supports many types of data catalogs, including Apache Hive, Apache Iceberg, Apache Paimon, Apache Doris, MySQL, PostgreSQL, Fileset (HDFS, S3, GCS, OSS, JuiceFS, etc.), Streaming (Apache Kafka), Models, and more (the list is continuously expanding), providing unified metadata access for Data and AI assets.

Hands-on Guide

1. AWS Environment Setup

Before we begin, we need to prepare the required infrastructure on AWS: an S3 bucket and a carefully designed IAM role, which together form the foundation for a secure and reliable lakehouse architecture.

1.1 Create S3 Bucket

First, create a dedicated S3 bucket to store Iceberg data:

# Create S3 bucket
aws s3 mb s3://gravitino-iceberg-demo --region us-west-2
# Verify bucket creation
aws s3 ls | grep gravitino-iceberg-demo

1.2 Design IAM Role Architecture

To implement secure credential management, we create an IAM role that Gravitino assumes through the STS AssumeRole mechanism. This design follows the security best practices of least privilege and separation of duties.

  1. Create Trust Policy

    Create trust policy file gravitino-trust-policy.json:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::YOUR_ACCOUNT_ID:root"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
  2. Create IAM Role

    For demonstration simplicity, we'll use an AWS managed policy directly. For production environments, we recommend more fine-grained permission controls; a sample least-privilege policy sketch follows this list.

    # Create IAM role
    aws iam create-role \
        --role-name gravitino-iceberg-access \
        --assume-role-policy-document file://gravitino-trust-policy.json \
        --description "Gravitino Iceberg data access role"
    
    # Attach S3 full access permissions (for testing; use fine-grained permissions in production)
    aws iam attach-role-policy \
        --role-name gravitino-iceberg-access \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
    
  3. Verify IAM Configuration

    Verify that the role configuration is correct:

    # Test role assumption functionality
    aws sts assume-role \
        --role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access \
        --role-session-name gravitino-test
    

    Example successful response:

    {
        "Credentials": {
            "AccessKeyId": "ASIA***************",
            "SecretAccessKey": "***************************",
            "SessionToken": "IQoJb3JpZ2luX2VjEOj...",
            "Expiration": "2025-07-23T08:33:30+00:00"
        }
    }
    
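As mentioned in step 2 above, AmazonS3FullAccess is convenient for testing but far broader than needed. Below is a minimal sketch of a fine-grained policy scoped to the bucket and warehouse path used in this guide (the file name gravitino-s3-policy.json and statement IDs are illustrative; adjust actions and resources to your requirements):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IcebergObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::gravitino-iceberg-demo/warehouse/*"
        },
        {
            "Sid": "IcebergBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::gravitino-iceberg-demo"
        }
    ]
}

# Attach as an inline policy instead of AmazonS3FullAccess
aws iam put-role-policy \
    --role-name gravitino-iceberg-access \
    --policy-name gravitino-iceberg-s3-access \
    --policy-document file://gravitino-s3-policy.json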

2. Gravitino Deployment and Configuration

  1. Download and Install Gravitino

    We'll use Gravitino's pre-built binary distribution for a quick environment setup:

    # Create working directory
    mkdir gravitino-deployment && cd gravitino-deployment
    
    # Download Gravitino main program
    wget https://dlcdn.apache.org/gravitino/0.9.1/gravitino-0.9.1-bin.tar.gz
    
    # Extract and install
    tar -xzf gravitino-0.9.1-bin.tar.gz
    cd gravitino-0.9.1-bin
    
  2. Install Required Dependencies

    To support AWS S3 and credential management functionality, we need to install additional JAR packages:

    # Create necessary directory structure
    mkdir -p logs
    mkdir -p /tmp/gravitino
    
    # Download Iceberg AWS bundle
    wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.6.1/iceberg-aws-bundle-1.6.1.jar \
      -P catalogs/lakehouse-iceberg/libs/
    cp catalogs/lakehouse-iceberg/libs/iceberg-aws-bundle-1.6.1.jar iceberg-rest-server/libs/
    
    # Download Gravitino AWS support package (core for vended-credentials functionality)
    wget https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/0.9.1/gravitino-aws-0.9.1.jar \
      -P iceberg-rest-server/libs/
    
  3. Configure Gravitino Service

    Create or edit the conf/gravitino.conf file:

    # Iceberg catalog backend configuration. The default in-memory backend is for
    # testing only; use jdbc instead. H2 is used here for simplicity, and MySQL is
    # recommended for production:
    gravitino.iceberg-rest.catalog-backend = jdbc
    gravitino.iceberg-rest.uri = jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL
    gravitino.iceberg-rest.jdbc-driver = org.h2.Driver
    gravitino.iceberg-rest.jdbc-user = iceberg
    gravitino.iceberg-rest.jdbc-password = iceberg123
    gravitino.iceberg-rest.jdbc-initialize = true
    gravitino.iceberg-rest.warehouse = s3://gravitino-iceberg-demo/warehouse
    gravitino.iceberg-rest.io-impl = org.apache.iceberg.aws.s3.S3FileIO
    gravitino.iceberg-rest.s3-region = us-west-2
    
    # Enable vended-credentials functionality.
    # Note: Gravitino uses this AK/SK to call STS AssumeRole and hands the resulting
    # temporary credentials to clients such as Doris.
    gravitino.iceberg-rest.credential-providers = s3-token
    gravitino.iceberg-rest.s3-access-key-id = YOUR_AWS_ACCESS_KEY_ID
    gravitino.iceberg-rest.s3-secret-access-key = YOUR_AWS_SECRET_ACCESS_KEY
    gravitino.iceberg-rest.s3-role-arn = arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access
    gravitino.iceberg-rest.s3-token-expire-in-secs = 3600
    

    Note: replace warehouse, s3-region, the access key ID and secret access key, YOUR_ACCOUNT_ID, and any other placeholder values in the configuration above with your own.

  4. Start Services

    # Start Gravitino service
    ./bin/gravitino.sh start
    
    # Check service status
    ./bin/gravitino.sh status
    
    # View logs
    tail -f logs/gravitino-server.log
    
    # Verify main service
    curl -v http://localhost:8090/api/version
    
    # Verify Iceberg REST service
    curl -v http://localhost:9001/iceberg/v1/config
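
    If both services are up, the config endpoint returns a JSON document. Per the Iceberg REST catalog spec, the body has roughly this shape (the exact defaults and overrides depend on your configuration):

    {
        "defaults": {},
        "overrides": {}
    }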
    
  5. Create Gravitino Metadata Structure

    Create the necessary metadata structures through the REST API. A MetaLake is the top level of Gravitino's metadata hierarchy. If you already have one, you can skip this step; otherwise, create a metalake named lakehouse:

    # Create MetaLake
    curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "lakehouse",
        "comment": "Gravitino lakehouse for Doris integration",
        "properties": {}
      }' http://localhost:8090/api/metalakes
    

    Next, we create an Iceberg Catalog, either through the Web UI or the REST API. Here we create a catalog named iceberg_catalog, with JDBC as the backend metadata storage and the warehouse pointing to the S3 bucket created earlier:

    # Create Iceberg Catalog
    curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "iceberg_catalog",
        "type": "RELATIONAL",
        "provider": "lakehouse-iceberg",
        "comment": "Iceberg catalog with S3 storage and vended credentials",
        "properties": {
          "catalog-backend": "jdbc",
          "uri": "jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL",
          "jdbc-user": "iceberg",
          "jdbc-password": "iceberg123",
          "jdbc-driver": "org.h2.Driver",
          "jdbc-initialize": "true",
          "warehouse": "s3://gravitino-iceberg-demo/warehouse",
          "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
          "s3-region": "us-west-2"
        }
      }' http://localhost:8090/api/metalakes/lakehouse/catalogs
    

    At this point, the necessary metadata has been created in Gravitino.
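
    To double-check, you can list the new objects back through the same REST API (assuming the default ports used in this guide):

    # List metalakes
    curl -H "Accept: application/vnd.gravitino.v1+json" \
      http://localhost:8090/api/metalakes
    
    # List catalogs in the lakehouse metalake
    curl -H "Accept: application/vnd.gravitino.v1+json" \
      http://localhost:8090/api/metalakes/lakehouse/catalogs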

3. Connecting Doris to Gravitino

Doris can connect to Gravitino and access Iceberg data on S3 through two different approaches:

  • Static Credential Mode: Doris uses pre-configured, fixed AK/SK to access S3 directly
  • Dynamic (Vended) Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS

  1. Vended Credential Mode (Recommended)

    Vended credential mode is the more secure approach: Gravitino dynamically generates temporary credentials and distributes them to clients such as Doris, minimizing the risk of credential leakage:

    -- Create Catalog with vended credential mode
    CREATE CATALOG gravitino_vending PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
        'iceberg.rest.warehouse' = 'warehouse',
        'iceberg.rest.vended-credentials-enabled' = 'true',
        's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
        's3.region' = 'us-west-2'
    );
    
  2. Static Credential Mode

    In this mode, Doris directly uses fixed AWS credentials to access S3, with Gravitino providing only metadata services:

    -- Create Catalog with static credential mode
    CREATE CATALOG gravitino_static PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
        'iceberg.rest.warehouse' = 'warehouse',
        'iceberg.rest.vended-credentials-enabled' = 'false',
        's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
        's3.access_key' = 'YOUR_AWS_ACCESS_KEY_ID',
        's3.secret_key' = 'YOUR_AWS_SECRET_ACCESS_KEY',
        's3.region' = 'us-west-2'
    );
    
  3. Verify Connection and Data Operations

    -- Verify connection
    SHOW DATABASES FROM gravitino_vending;
    
    -- Switch to vended credentials catalog
    SWITCH gravitino_vending;
    
    -- Create database and table
    CREATE DATABASE demo;
    USE gravitino_vending.demo;
    
    CREATE TABLE gravitino_table (
        id INT,
        name STRING
    )
    PROPERTIES (
        'write-format' = 'parquet'
    );
    
    -- Insert test data
    INSERT INTO gravitino_table VALUES (1, 'Doris'), (2, 'Gravitino');
    
    -- Query verification
    SELECT * FROM gravitino_table;
    
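As a final check outside Doris, confirm that the table's data and metadata files actually landed in the S3 warehouse (the exact paths under warehouse/ depend on the database and table names created above):

# List the files written by the demo table
aws s3 ls s3://gravitino-iceberg-demo/warehouse/ --recursive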

Summary

Following this guide, you can build a modern lakehouse architecture based on Gravitino and Doris. The architecture not only delivers high performance and availability but also keeps data access secure and compliant through the credential vending mechanism. As data volumes grow and business requirements evolve, it can scale flexibly to meet enterprise-level needs. If you're interested in the two projects, please star them on GitHub: https://github.com/apache/gravitino and https://github.com/apache/doris. We look forward to seeing you in community issue discussions and PR contributions!
