
Alex Yan for Datastrato

Doris x Gravitino: Unified Metadata Management for Modern Lakehouse Architecture

With the rapid evolution of data lake technologies, building unified, secure, and efficient lakehouse architectures has become a core challenge in enterprise digital transformation. Apache Gravitino is a next-generation unified metadata management platform that provides comprehensive data governance for multi-cloud and multi-engine environments. It supports unified management of various data sources and compute engines, and its credential vending mechanism ensures secure, controlled data access.

This article provides an in-depth look at the integration between Apache Doris and Apache Gravitino, building a modern lakehouse architecture on the Iceberg REST Catalog. Through Gravitino's unified metadata management and dynamic credential vending, we achieve efficient and secure access to Iceberg data stored on S3.

What you'll learn from this guide:

  • AWS Environment Setup: How to create the S3 bucket and IAM role in AWS, configure secure credential management for Gravitino, and implement dynamic distribution of temporary credentials.

  • Gravitino Deployment and Configuration: How to quickly deploy Gravitino services, configure Iceberg REST Catalog, and enable vended-credentials functionality.

  • Connecting Doris to Gravitino: Detailed explanation of how Doris accesses Iceberg data through Gravitino's REST API, supporting two core storage access modes:

1. Static Credential Mode: Doris uses fixed AK/SK to directly access S3
2. Dynamic Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS AssumeRole

About Apache Doris

Apache Doris is a high-performance analytical and search database built for the AI era.

It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.

With Doris, users can easily build a real-time lakehouse data platform.

About Apache Gravitino

Apache Gravitino is an open-source, high-performance, distributed, federated metadata lake designed to manage metadata across different regions and data sources in public and private clouds. It supports many types of data catalogs, including Apache Hive, Apache Iceberg, Apache Paimon, Apache Doris, MySQL, PostgreSQL, Fileset (HDFS, S3, GCS, OSS, JuiceFS, etc.), Streaming (Apache Kafka), Models, and more (the list is continuously expanding), providing unified metadata access for Data and AI assets.

Hands-on Guide

1. AWS Environment Setup

Before we begin, we need to prepare the required infrastructure on AWS: an S3 bucket and a carefully designed IAM role, which together form the foundation for a secure and reliable lakehouse architecture.

1.1 Create S3 Bucket

First, create a dedicated S3 bucket to store Iceberg data:

# Create S3 bucket
aws s3 mb s3://gravitino-iceberg-demo --region us-west-2
# Verify bucket creation
aws s3 ls | grep gravitino-iceberg-demo

1.2 Design IAM Role Architecture

To implement secure credential management, we create an IAM role that Gravitino assumes through the STS AssumeRole mechanism. This design follows the security best practices of least privilege and separation of duties.

  1. Create Trust Policy

    Create trust policy file gravitino-trust-policy.json:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::YOUR_ACCOUNT_ID:root"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
  2. Create IAM Role

    For demonstration simplicity, we'll use an AWS managed policy directly. For production environments, we recommend more fine-grained permission controls; a sample least-privilege policy sketch follows this list.

    # Create IAM role
    aws iam create-role \
        --role-name gravitino-iceberg-access \
        --assume-role-policy-document file://gravitino-trust-policy.json \
        --description "Gravitino Iceberg data access role"
    
    # Attach S3 full access permissions (for testing; use fine-grained permissions in production)
    aws iam attach-role-policy \
        --role-name gravitino-iceberg-access \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
    
  3. Verify IAM Configuration

    Verify that the role configuration is correct:

    # Test role assumption functionality
    aws sts assume-role \
        --role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access \
        --role-session-name gravitino-test
    

    Example successful response:

    {
        "Credentials": {
            "AccessKeyId": "ASIA***************",
            "SecretAccessKey": "***************************",
            "SessionToken": "IQoJb3JpZ2luX2VjEOj...",
            "Expiration": "2025-07-23T08:33:30+00:00"
        }
    }
    
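As mentioned in step 2 above, AmazonS3FullAccess is convenient for testing but far broader than needed. Below is a minimal sketch of a fine-grained policy scoped to the bucket and warehouse path used in this guide (the file name gravitino-s3-policy.json and statement IDs are illustrative; adjust actions and resources to your requirements):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IcebergObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::gravitino-iceberg-demo/warehouse/*"
        },
        {
            "Sid": "IcebergBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::gravitino-iceberg-demo"
        }
    ]
}

# Attach as an inline policy instead of AmazonS3FullAccess
aws iam put-role-policy \
    --role-name gravitino-iceberg-access \
    --policy-name gravitino-iceberg-s3-access \
    --policy-document file://gravitino-s3-policy.json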

2. Gravitino Deployment and Configuration

  1. Download and Install Gravitino

    We'll use Gravitino's pre-built binary distribution for a quick environment setup:

    # Create working directory
    mkdir gravitino-deployment && cd gravitino-deployment
    
    # Download Gravitino main program
    wget https://dlcdn.apache.org/gravitino/0.9.1/gravitino-0.9.1-bin.tar.gz
    
    # Extract and install
    tar -xzf gravitino-0.9.1-bin.tar.gz
    cd gravitino-0.9.1-bin
    
  2. Install Required Dependencies

    To support AWS S3 and credential management functionality, we need to install additional JAR packages:

    # Create necessary directory structure
    mkdir -p logs
    mkdir -p /tmp/gravitino
    
    # Download Iceberg AWS bundle
    wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.6.1/iceberg-aws-bundle-1.6.1.jar \
      -P catalogs/lakehouse-iceberg/libs/
    cp catalogs/lakehouse-iceberg/libs/iceberg-aws-bundle-1.6.1.jar iceberg-rest-server/libs/
    
    # Download Gravitino AWS support package (core for vended-credentials functionality)
    wget https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/0.9.1/gravitino-aws-0.9.1.jar \
      -P iceberg-rest-server/libs/
    
  3. Configure Gravitino Service

    Create or edit the conf/gravitino.conf file:

    # Iceberg catalog backend configuration. The default in-memory backend is for
    # testing only; use jdbc instead. H2 is used here for simplicity, and MySQL is
    # recommended for production:
    gravitino.iceberg-rest.catalog-backend = jdbc
    gravitino.iceberg-rest.uri = jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL
    gravitino.iceberg-rest.jdbc-driver = org.h2.Driver
    gravitino.iceberg-rest.jdbc-user = iceberg
    gravitino.iceberg-rest.jdbc-password = iceberg123
    gravitino.iceberg-rest.jdbc-initialize = true
    gravitino.iceberg-rest.warehouse = s3://gravitino-iceberg-demo/warehouse
    gravitino.iceberg-rest.io-impl = org.apache.iceberg.aws.s3.S3FileIO
    gravitino.iceberg-rest.s3-region = us-west-2
    
    # Enable vended-credentials functionality.
    # Note: Gravitino uses this AK/SK to call STS AssumeRole and hands the resulting
    # temporary credentials to clients such as Doris.
    gravitino.iceberg-rest.credential-providers = s3-token
    gravitino.iceberg-rest.s3-access-key-id = YOUR_AWS_ACCESS_KEY_ID
    gravitino.iceberg-rest.s3-secret-access-key = YOUR_AWS_SECRET_ACCESS_KEY
    gravitino.iceberg-rest.s3-role-arn = arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access
    gravitino.iceberg-rest.s3-token-expire-in-secs = 3600
    

    Note: replace warehouse, s3-region, the access key ID and secret access key, YOUR_ACCOUNT_ID, and any other placeholder values in the configuration above with your own.

  4. Start Services

    # Start Gravitino service
    ./bin/gravitino.sh start
    
    # Check service status
    ./bin/gravitino.sh status
    
    # View logs
    tail -f logs/gravitino-server.log
    
    # Verify main service
    curl -v http://localhost:8090/api/version
    
    # Verify Iceberg REST service
    curl -v http://localhost:9001/iceberg/v1/config
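
    If both services are up, the config endpoint returns a JSON document. Per the Iceberg REST catalog spec, the body has roughly this shape (the exact defaults and overrides depend on your configuration):

    {
        "defaults": {},
        "overrides": {}
    }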
    
  5. Create Gravitino Metadata Structure

    Create the necessary metadata structures through the REST API. A MetaLake is the top level of Gravitino's metadata hierarchy. If you already have one, you can skip this step; otherwise, create a metalake named lakehouse:

    # Create MetaLake
    curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "lakehouse",
        "comment": "Gravitino lakehouse for Doris integration",
        "properties": {}
      }' http://localhost:8090/api/metalakes
    

    Next, we create an Iceberg Catalog, either through the Web UI or the REST API. Here we create a catalog named iceberg_catalog, with JDBC as the backend metadata storage and the warehouse pointing to the S3 bucket created earlier:

    # Create Iceberg Catalog
    curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "iceberg_catalog",
        "type": "RELATIONAL",
        "provider": "lakehouse-iceberg",
        "comment": "Iceberg catalog with S3 storage and vended credentials",
        "properties": {
          "catalog-backend": "jdbc",
          "uri": "jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL",
          "jdbc-user": "iceberg",
          "jdbc-password": "iceberg123",
          "jdbc-driver": "org.h2.Driver",
          "jdbc-initialize": "true",
          "warehouse": "s3://gravitino-iceberg-demo/warehouse",
          "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
          "s3-region": "us-west-2"
        }
      }' http://localhost:8090/api/metalakes/lakehouse/catalogs
    

    At this point, the necessary metadata has been created in Gravitino.
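
    To double-check, you can list the new objects back through the same REST API (assuming the default ports used in this guide):

    # List metalakes
    curl -H "Accept: application/vnd.gravitino.v1+json" \
      http://localhost:8090/api/metalakes
    
    # List catalogs in the lakehouse metalake
    curl -H "Accept: application/vnd.gravitino.v1+json" \
      http://localhost:8090/api/metalakes/lakehouse/catalogs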

3. Connecting Doris to Gravitino

Doris can connect to Gravitino and access Iceberg data on S3 through two different approaches:

  • Static Credential Mode: Doris uses pre-configured, fixed AK/SK to access S3 directly
  • Dynamic (Vended) Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS

  1. Vended Credential Mode (Recommended)

    Vended credential mode is the more secure approach: Gravitino dynamically generates temporary credentials and distributes them to clients such as Doris, minimizing the risk of credential leakage:

    -- Create Catalog with vended credential mode
    CREATE CATALOG gravitino_vending PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
        'iceberg.rest.warehouse' = 'warehouse',
        'iceberg.rest.vended-credentials-enabled' = 'true',
        's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
        's3.region' = 'us-west-2'
    );
    
  2. Static Credential Mode

    In this mode, Doris directly uses fixed AWS credentials to access S3, with Gravitino providing only metadata services:

    -- Create Catalog with static credential mode
    CREATE CATALOG gravitino_static PROPERTIES (
        'type' = 'iceberg',
        'iceberg.catalog.type' = 'rest',
        'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
        'iceberg.rest.warehouse' = 'warehouse',
        'iceberg.rest.vended-credentials-enabled' = 'false',
        's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
        's3.access_key' = 'YOUR_AWS_ACCESS_KEY_ID',
        's3.secret_key' = 'YOUR_AWS_SECRET_ACCESS_KEY',
        's3.region' = 'us-west-2'
    );
    
  3. Verify Connection and Data Operations

    -- Verify connection
    SHOW DATABASES FROM gravitino_vending;
    
    -- Switch to vended credentials catalog
    SWITCH gravitino_vending;
    
    -- Create database and table
    CREATE DATABASE demo;
    USE gravitino_vending.demo;
    
    CREATE TABLE gravitino_table (
        id INT,
        name STRING
    )
    PROPERTIES (
        'write-format' = 'parquet'
    );
    
    -- Insert test data
    INSERT INTO gravitino_table VALUES (1, 'Doris'), (2, 'Gravitino');
    
    -- Query verification
    SELECT * FROM gravitino_table;
    
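As a final check outside Doris, confirm that the table's data and metadata files actually landed in the S3 warehouse (the exact paths under warehouse/ depend on the database and table names created above):

# List the files written by the demo table
aws s3 ls s3://gravitino-iceberg-demo/warehouse/ --recursive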

Summary

Following this guide, you can build a modern lakehouse architecture based on Gravitino and Doris. The architecture not only delivers high performance and availability but also keeps data access secure and compliant through the credential vending mechanism. As data volumes grow and business requirements evolve, it can scale flexibly to meet enterprise-level needs. If you're interested in the two projects, please star them on GitHub: https://github.com/apache/gravitino and https://github.com/apache/doris. We look forward to seeing you in community issue discussions and PR contributions!
