With the rapid evolution of data lake technologies, building unified, secure, and efficient lakehouse architectures has become a core challenge for enterprise digital transformation. Apache Gravitino serves as a next-generation unified metadata management platform, providing comprehensive solutions for data governance in multi-cloud and multi-engine environments. It not only supports unified management of various data sources and compute engines but also ensures secure and controllable data access through its credential management mechanism (Credential Vending).
This article provides an in-depth introduction to deep integration between Apache Doris and Apache Gravitino, building a modern lakehouse architecture based on Iceberg REST Catalog. Through Gravitino's unified metadata management and dynamic credential vending capabilities, we achieve efficient and secure access to Iceberg data stored on S3.
What you'll learn from this guide:
AWS Environment Setup: How to create S3 buckets and IAM roles in AWS, configure secure credential management systems for Gravitino, and implement dynamic temporary credential distribution mechanisms.
Gravitino Deployment and Configuration: How to quickly deploy Gravitino services, configure Iceberg REST Catalog, and enable vended-credentials functionality.
Connecting Doris to Gravitino: Detailed explanation of how Doris accesses Iceberg data through Gravitino's REST API, supporting two core storage access modes:
1. Static Credential Mode: Doris uses fixed AK/SK to directly access S3
2. Dynamic Credential Mode: Gravitino dynamically distributes temporary credentials to Doris via STS AssumeRole
About Apache Doris
Apache Doris is a high-performance analytical and search database built for the AI era.
It provides high-performance hybrid search capabilities across structured data, semi-structured data (such as JSON), and vector data. It excels at delivering high-concurrency, low-latency queries, while also offering advanced optimization for complex join operations. In addition, Doris can serve as a unified query engine, delivering high-performance analytical services not only on its self-managed internal table format but also on open lakehouse formats such as Iceberg.
With Doris, users can easily build a real-time lakehouse data platform.
About Apache Gravitino
Apache Gravitino is a high-performance, distributed federated metadata lake open-source project designed to manage metadata across different regions and data sources in public and private clouds. It supports various types of data catalogs including Apache Hive, Apache Iceberg, Apache Paimon, Apache Doris, MySQL, PostgreSQL, Fileset (HDFS, S3, GCS, OSS, JuiceFS, etc.), Streaming (Apache Kafka), Models, and more (continuously expanding), providing users with unified metadata access for Data and AI assets.
Hands-on Guide
1. AWS Environment Setup
Before we begin, we need to prepare a complete infrastructure on AWS, including S3 buckets and a carefully designed IAM role system, which forms the foundation for building a secure and reliable lakehouse architecture.
1.1 Create S3 Bucket
First, create a dedicated S3 bucket to store Iceberg data:
# Create S3 bucket
aws s3 mb s3://gravitino-iceberg-demo --region us-west-2
# Verify bucket creation
aws s3 ls | grep gravitino-iceberg-demo
1.2 Design IAM Role Architecture
To implement secure credential management, we need to create an IAM role that Gravitino assumes through the STS AssumeRole mechanism. This design follows the security best practices of least privilege and separation of duties.
- Create Data Access Role
Create trust policy file gravitino-trust-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::YOUR_ACCOUNT_ID:root"]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
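Before creating the role, it can help to sanity-check the trust policy document for typos. The following is a hypothetical helper (not an AWS API), assuming the Principal.AWS field is written as a list, as in the file above:

```python
import json

# Hypothetical helper: sanity-check a trust policy document before creating
# the role, catching typos in "Effect" or "Action" early.
def validate_trust_policy(doc: str) -> list:
    policy = json.loads(doc)
    problems = []
    if policy.get("Version") != "2012-10-17":
        problems.append("unexpected policy Version")
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            problems.append("statement Effect is not Allow")
        if stmt.get("Action") != "sts:AssumeRole":
            problems.append("statement Action is not sts:AssumeRole")
        # Assumes Principal.AWS is a list of ARNs, as in the file above
        if not stmt.get("Principal", {}).get("AWS", []):
            problems.append("no AWS principal listed")
    return problems

doc = """
{ "Version": "2012-10-17",
  "Statement": [ { "Effect": "Allow",
    "Principal": { "AWS": ["arn:aws:iam::123456789012:root"] },
    "Action": "sts:AssumeRole" } ] }
"""
print(validate_trust_policy(doc))  # → []
```

An empty list means the document has the shape the role creation below expects.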
- Create IAM Role
For demonstration simplicity, we'll use AWS managed policies directly. For production environments, we recommend creating more fine-grained permission controls.
# Create IAM role
aws iam create-role \
  --role-name gravitino-iceberg-access \
  --assume-role-policy-document file://gravitino-trust-policy.json \
  --description "Gravitino Iceberg data access role"

# Attach S3 full access permissions (for testing; use fine-grained permissions in production)
aws iam attach-role-policy \
  --role-name gravitino-iceberg-access \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
- Verify IAM Configuration
Verify that the role configuration is correct:
# Test role assumption functionality
aws sts assume-role \
  --role-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access \
  --role-session-name gravitino-test
Example successful response:
{
  "Credentials": {
    "AccessKeyId": "ASIA***************",
    "SecretAccessKey": "***************************",
    "SessionToken": "IQoJb3JpZ2luX2VjEOj...",
    "Expiration": "2025-07-23T08:33:30+00:00"
  }
}
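Given such a response, you can compute how long the temporary credentials remain valid. A minimal standard-library sketch (`seconds_remaining` is a name invented here, not part of any AWS SDK):

```python
import json
from datetime import datetime, timezone

# Parse an assume-role response and compute how many seconds the
# temporary credentials remain valid from a given reference time.
def seconds_remaining(response_json: str, now: datetime) -> float:
    creds = json.loads(response_json)["Credentials"]
    expires = datetime.fromisoformat(creds["Expiration"])
    return (expires - now).total_seconds()

response = ('{"Credentials": {"AccessKeyId": "ASIA...", '
            '"SecretAccessKey": "...", "SessionToken": "...", '
            '"Expiration": "2025-07-23T08:33:30+00:00"}}')
now = datetime(2025, 7, 23, 7, 33, 30, tzinfo=timezone.utc)
print(seconds_remaining(response, now))  # → 3600.0
```

One hour matches the `s3-token-expire-in-secs = 3600` setting configured for Gravitino later in this guide.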
2. Gravitino Deployment and Configuration
- Download and Install Gravitino
We'll use Gravitino's pre-built binary distribution for a quick environment setup:
# Create working directory
mkdir gravitino-deployment && cd gravitino-deployment

# Download the Gravitino binary distribution
wget https://dlcdn.apache.org/gravitino/0.9.1/gravitino-0.9.1-bin.tar.gz

# Extract and install
tar -xzf gravitino-0.9.1-bin.tar.gz
cd gravitino-0.9.1-bin
- Install Required Dependencies
To support AWS S3 and credential management functionality, we need to install additional JAR packages:
# Create necessary directory structure
mkdir -p logs
mkdir -p /tmp/gravitino

# Download the Iceberg AWS bundle
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.6.1/iceberg-aws-bundle-1.6.1.jar \
  -P catalogs/lakehouse-iceberg/libs/
cp catalogs/lakehouse-iceberg/libs/iceberg-aws-bundle-1.6.1.jar iceberg-rest-server/libs/

# Download the Gravitino AWS support package (core of the vended-credentials functionality)
wget https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/0.9.1/gravitino-aws-0.9.1.jar \
  -P iceberg-rest-server/libs/
- Configure Gravitino Service
Create or edit the conf/gravitino.conf file:
# Iceberg catalog backend configuration. The default in-memory backend is for
# testing only; use a JDBC backend instead. H2 is used here; MySQL is
# recommended for production.
gravitino.iceberg-rest.catalog-backend = jdbc
gravitino.iceberg-rest.uri = jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL
gravitino.iceberg-rest.jdbc-driver = org.h2.Driver
gravitino.iceberg-rest.jdbc-user = iceberg
gravitino.iceberg-rest.jdbc-password = iceberg123
gravitino.iceberg-rest.jdbc-initialize = true
gravitino.iceberg-rest.warehouse = s3://gravitino-iceberg-demo/warehouse
gravitino.iceberg-rest.io-impl = org.apache.iceberg.aws.s3.S3FileIO
gravitino.iceberg-rest.s3-region = us-west-2

# Enable vended-credentials functionality.
# Gravitino uses this AK/SK pair to call STS AssumeRole and obtain temporary
# credentials to distribute to clients.
gravitino.iceberg-rest.credential-providers = s3-token
gravitino.iceberg-rest.s3-access-key-id = YOUR_AWS_ACCESS_KEY_ID
gravitino.iceberg-rest.s3-secret-access-key = YOUR_AWS_SECRET_ACCESS_KEY
gravitino.iceberg-rest.s3-role-arn = arn:aws:iam::YOUR_ACCOUNT_ID:role/gravitino-iceberg-access
gravitino.iceberg-rest.s3-token-expire-in-secs = 3600
Please note: replace warehouse, s3-region, the access key ID and secret access key, YOUR_ACCOUNT_ID, and the other placeholder properties in the configuration above with your own values.
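Before starting the service, a quick pre-flight check can catch a missing key or a leftover YOUR_* placeholder. This is a hypothetical sketch, not a Gravitino tool; the key list mirrors the vended-credentials section of the configuration above:

```python
# Keys required for the vended-credentials setup shown above
REQUIRED_KEYS = [
    "gravitino.iceberg-rest.credential-providers",
    "gravitino.iceberg-rest.s3-access-key-id",
    "gravitino.iceberg-rest.s3-secret-access-key",
    "gravitino.iceberg-rest.s3-role-arn",
    "gravitino.iceberg-rest.s3-region",
]

def check_conf(text: str) -> list:
    """Parse key = value lines and report missing keys or placeholders."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    issues = [k for k in REQUIRED_KEYS if k not in conf]
    issues += [k for k, v in conf.items() if v.startswith("YOUR_")]
    return issues

sample = """
gravitino.iceberg-rest.credential-providers = s3-token
gravitino.iceberg-rest.s3-access-key-id = AKIAEXAMPLE
gravitino.iceberg-rest.s3-secret-access-key = secret
gravitino.iceberg-rest.s3-role-arn = arn:aws:iam::123456789012:role/gravitino-iceberg-access
gravitino.iceberg-rest.s3-region = us-west-2
"""
print(check_conf(sample))  # → []
```

An empty result means every required key is present and no YOUR_* placeholder remains.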
- Start Services
# Start Gravitino service
./bin/gravitino.sh start

# Check service status
./bin/gravitino.sh status

# View logs
tail -f logs/gravitino-server.log

# Verify the main service
curl -v http://localhost:8090/api/version

# Verify the Iceberg REST service
curl -v http://localhost:9001/iceberg/v1/config
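The two curl checks above can be folded into a small readiness probe. This is a sketch, not part of Gravitino; the `fetch` parameter is injectable so the logic can be exercised without a live server:

```python
import urllib.request
import urllib.error

# The two endpoints verified with curl above
ENDPOINTS = [
    "http://localhost:8090/api/version",
    "http://localhost:9001/iceberg/v1/config",
]

def ready(fetch=lambda url: urllib.request.urlopen(url, timeout=5).status):
    """Return a map of endpoint -> True/False reachability."""
    results = {}
    for url in ENDPOINTS:
        try:
            results[url] = fetch(url) == 200
        except (urllib.error.URLError, OSError):
            results[url] = False
    return results

# With a live Gravitino instance, ready() should map both URLs to True.
```

If either endpoint stays False, check logs/gravitino-server.log before proceeding.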
- Create Gravitino Metadata Structure
Create necessary metadata structures through REST API.
A MetaLake is the top level of Gravitino's metadata hierarchy. If you already have one, you can skip this step. Otherwise, we'll create a metalake named lakehouse:
# Create MetaLake
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "lakehouse",
    "comment": "Gravitino lakehouse for Doris integration",
    "properties": {}
  }' http://localhost:8090/api/metalakes
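The same MetaLake creation call can be expressed with Python's standard library. The sketch below only builds the request; actually sending it assumes the Gravitino server from the previous step is running on localhost:8090:

```python
import json
import urllib.request

# Build (but do not send) the MetaLake creation request shown above
def build_metalake_request(name: str, comment: str) -> urllib.request.Request:
    payload = json.dumps({"name": name, "comment": comment, "properties": {}})
    return urllib.request.Request(
        "http://localhost:8090/api/metalakes",
        data=payload.encode("utf-8"),
        headers={
            "Accept": "application/vnd.gravitino.v1+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_metalake_request("lakehouse", "Gravitino lakehouse for Doris integration")
# urllib.request.urlopen(req) would send it once the server is up
print(req.get_method(), req.full_url)  # → POST http://localhost:8090/api/metalakes
```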
Next, we create an Iceberg catalog, which can be done through the Web GUI or the REST API. Here we create a catalog named iceberg_catalog with JDBC as the backend metadata storage and the warehouse address pointing to the S3 bucket created earlier:
# Create Iceberg Catalog
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "iceberg_catalog",
    "type": "RELATIONAL",
    "provider": "lakehouse-iceberg",
    "comment": "Iceberg catalog with S3 storage and vended credentials",
    "properties": {
      "catalog-backend": "jdbc",
      "uri": "jdbc:h2:file:/tmp/gravitino/catalog_iceberg.db;DB_CLOSE_DELAY=-1;MODE=MYSQL",
      "jdbc-user": "iceberg",
      "jdbc-password": "iceberg123",
      "jdbc-driver": "org.h2.Driver",
      "jdbc-initialize": "true",
      "warehouse": "s3://gravitino-iceberg-demo/warehouse",
      "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
      "s3-region": "us-west-2"
    }
  }' http://localhost:8090/api/metalakes/lakehouse/catalogs
At this point, the necessary metadata has been created in Gravitino.
3. Connecting Doris to Gravitino
Doris can connect to Gravitino and access Iceberg data on S3 through two different approaches:
- Static Credential Mode: Doris uses pre-configured, fixed AK/SK to access S3 directly
- Dynamic (Vended) Credential Mode: Gravitino distributes temporary credentials to Doris via STS AssumeRole
- Vended Credential Mode (Recommended)
Enabling vended credential mode is the more secure approach. In this mode, Gravitino dynamically generates temporary credentials and distributes them to clients like Doris, thereby minimizing the risk of credential leakage:
-- Create a catalog in vended credential mode
CREATE CATALOG gravitino_vending PROPERTIES (
  'type' = 'iceberg',
  'iceberg.catalog.type' = 'rest',
  'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
  'iceberg.rest.warehouse' = 'warehouse',
  'iceberg.rest.vended-credentials-enabled' = 'true',
  's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
  's3.region' = 'us-west-2'
);
- Static Credential Mode
In this mode, Doris directly uses fixed AWS credentials to access S3, with Gravitino providing only metadata services:
-- Create a catalog in static credential mode
CREATE CATALOG gravitino_static PROPERTIES (
  'type' = 'iceberg',
  'iceberg.catalog.type' = 'rest',
  'iceberg.rest.uri' = 'http://127.0.0.1:9001/iceberg/',
  'iceberg.rest.warehouse' = 'warehouse',
  'iceberg.rest.vended-credentials-enabled' = 'false',
  's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
  's3.access_key' = 'YOUR_AWS_ACCESS_KEY_ID',
  's3.secret_key' = 'YOUR_AWS_SECRET_ACCESS_KEY',
  's3.region' = 'us-west-2'
);
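The two CREATE CATALOG statements share every property except the credential-related keys. A small sketch (property names copied from the statements above) makes the difference explicit:

```python
# Properties common to both Doris catalog modes
COMMON = {
    "type": "iceberg",
    "iceberg.catalog.type": "rest",
    "iceberg.rest.uri": "http://127.0.0.1:9001/iceberg/",
    "iceberg.rest.warehouse": "warehouse",
    "s3.endpoint": "https://s3.us-west-2.amazonaws.com",
    "s3.region": "us-west-2",
}

# Vended mode: only a flag; credentials come from Gravitino at query time
VENDED = {**COMMON, "iceberg.rest.vended-credentials-enabled": "true"}

# Static mode: fixed AK/SK baked into the catalog definition
STATIC = {
    **COMMON,
    "iceberg.rest.vended-credentials-enabled": "false",
    "s3.access_key": "YOUR_AWS_ACCESS_KEY_ID",
    "s3.secret_key": "YOUR_AWS_SECRET_ACCESS_KEY",
}

only_in_static = sorted(set(STATIC) - set(VENDED))
print(only_in_static)  # → ['s3.access_key', 's3.secret_key']
```

The comparison shows why vended mode is preferred: the long-lived AK/SK never leaves Gravitino's configuration, so Doris stores no static secrets at all.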
- Verify Connection and Data Operations
-- Verify connection
SHOW DATABASES FROM gravitino_vending;

-- Switch to the vended-credentials catalog
SWITCH gravitino_vending;

-- Create database and table
CREATE DATABASE demo;
USE gravitino_vending.demo;

CREATE TABLE gravitino_table (
  id INT,
  name STRING
) PROPERTIES (
  'write-format' = 'parquet'
);

-- Insert test data
INSERT INTO gravitino_table VALUES (1, 'Doris'), (2, 'Gravitino');

-- Query verification
SELECT * FROM gravitino_table;
Summary
By following this guide, you can build a modern lakehouse architecture based on Gravitino and Doris. The architecture delivers high performance and availability while keeping data access secure and compliant through credential vending, and it can scale flexibly as data volumes and business requirements grow. If you're interested in these two projects, please star them on GitHub: https://github.com/apache/gravitino and https://github.com/apache/doris. We look forward to your participation in community issue discussions and PR contributions!