Speaker: Shaofeng Shi @ AWS Amarathon 2025
Summary by Amazon Nova
A Brief History of Un-siloing Data
Late 1980s: Data Warehouse
2011: Data Lake
2020: Lakehouse
Goal
To achieve SSOT (Single Source of Truth)
Full management of data
Eliminate risks such as data leaks and compliance violations in a data-driven business.
New Data Silos in Clouds & Regions
Nobody likes vendor “lock-in”
If data is deployed with different cloud vendors:
Hard to process together
Expensive to move
Nobody likes geo-distributed data,
but data follows the business as it goes international:
Regulation requirement
Cost for cross-ocean transfer
More than "Data Access"
Data you see
Technical & Business Data
Legal Hold Data
Metadata you overlook
3rd Party Data
PII & PI Data
Credentials
IP Data
Data Management Functions
Data Connect: Connect to the Data That Matters Most.
Data Rights Automation: Automate end-to-end data rights requests and reporting.
Metadata Enrichment: Enrich technical metadata with business and operational metadata for full visibility.
Data Discovery: Automatically find, classify, and map all of your data - everywhere.
Data Classification: Automatically classify more types of data in more places.
Data Lifecycle Management: Simplify and automate data lifecycle management from collection to destruction.
What is Gravitino
Next-gen unified data catalog for Data/AI
Integrations:
Trino
Spark
Flink
Doris
ClickHouse
PyTorch
TensorFlow
Metadata Lake Using Gravitino Components:
Hive Metastore
Built-in Catalog
Schema Registry
Fileset Management
Model Catalog
Data Sources:
Hadoop Data Lake
Data Warehouse
Streaming Processing
Unstructured Data
Machine Learning
Problems to solve
Have a "big picture" of all the data
Achieve SSOT of data while it is distributed and consumed in various ways
Govern data in one place; secure and audit it everywhere
A next-gen data catalog is the core of the new open data architecture.
Gravitino Architecture
Functionality Layer:
Unified Processing
Unified Governing
Interface Layer:
Unified REST APIs
Iceberg REST APIs
Core with Object Model:
Metalake
Catalogs
Schemas
Object Types: Table, Fileset, Model, Topic
Connection Layer:
- Connections
Metadata Storage
Supported Data Types (Bottom Layer):
Tabular
Files
Models
Message Queue
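The object model above (a metalake containing catalogs, catalogs containing schemas, schemas containing tables, filesets, models, and topics) can be sketched as a small in-memory namespace tree. This is a minimal illustration only: the class names, the `register` and `list_objects` helpers, and the dot-separated naming are assumptions made for the example, not Gravitino's actual classes or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ObjectType(Enum):
    # The four object types named on the architecture slide.
    TABLE = "table"
    FILESET = "fileset"
    MODEL = "model"
    TOPIC = "topic"


@dataclass
class Schema:
    name: str
    # Maps object name -> object type (hypothetical simplification).
    objects: Dict[str, ObjectType] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    schemas: Dict[str, Schema] = field(default_factory=dict)


@dataclass
class Metalake:
    name: str
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def register(self, catalog: str, schema: str, obj: str,
                 obj_type: ObjectType) -> str:
        """Add an object, creating the catalog and schema if needed.

        Returns an illustrative fully qualified name.
        """
        cat = self.catalogs.setdefault(catalog, Catalog(catalog))
        sch = cat.schemas.setdefault(schema, Schema(schema))
        sch.objects[obj] = obj_type
        return f"{self.name}.{catalog}.{schema}.{obj}"

    def list_objects(self, obj_type: ObjectType) -> List[str]:
        """List every object of one type across all catalogs."""
        return sorted(
            f"{c.name}.{s.name}.{o}"
            for c in self.catalogs.values()
            for s in c.schemas.values()
            for o, t in s.objects.items()
            if t is obj_type
        )
```

The point of the sketch is that one tree gives a single place to enumerate tabular and non-tabular assets together, regardless of which backing system holds them.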
Process Tabular and Non-tabular data with Gravitino
Tabular data (via connectors)
Engines: Spark
Operations: Create, Load, Alter, Drop
API: Unified Tabular API
Schema (struct):
name: string
comment: string
properties: map
Table (struct):
name: string
columns: Column[]
partitioning: Transform[]
distribution: Distribution
sortOrder: SortOrder[]
indexes: Index[]
Related Definitions:
- Transform, Distribution, SortOrder, Index, Type
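The Schema and Table structs listed above could be mirrored in Python roughly as follows. This is a sketch under stated assumptions: `Column` and `Transform` are reduced to two fields each, `distribution`, `sort_order`, and `indexes` are simplified to strings, and the `validate` helper is hypothetical. None of this is the actual Unified Tabular API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    name: str
    type: str  # simplified; the slides reference a richer Type definition


@dataclass
class Transform:
    kind: str    # illustrative values such as "identity" or "day"
    column: str  # column the partition transform applies to


@dataclass
class SchemaDef:
    name: str
    comment: str = ""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class TableDef:
    name: str
    columns: List[Column]
    partitioning: List[Transform] = field(default_factory=list)
    distribution: Optional[str] = None                    # simplified
    sort_order: List[str] = field(default_factory=list)   # simplified
    indexes: List[str] = field(default_factory=list)      # simplified

    def validate(self) -> "TableDef":
        """Check that every partition transform refers to a known column."""
        known = {c.name for c in self.columns}
        missing = [t.column for t in self.partitioning if t.column not in known]
        if missing:
            raise ValueError(f"partition columns not in table: {missing}")
        return self


orders = TableDef(
    name="orders",
    columns=[Column("id", "long"), Column("created_at", "timestamp")],
    partitioning=[Transform("day", "created_at")],
).validate()
```

Keeping partitioning, distribution, and sort order in the table definition itself, rather than in engine-specific DDL, is what would let several engines share one description of the same table.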
Non-tabular data
Engines: Spark, PyTorch, Ray, TensorFlow
Filesystems: Gravitino Virtual FileSystem, Python FileSystem
Operations: Create, Load, Alter, Drop
API: Unified Non-tabular API
Schema (struct):
name: string
comment: string
properties: map
Fileset (struct):
name: string
storageLocation: string
type: Type
Storage Locations:
- S3, HDFS, ADLS, GCS
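A fileset pairs a logical name with a physical `storageLocation`, so clients can address files by catalog, schema, and fileset name instead of by bucket or cluster. The sketch below shows that mapping in isolation. The `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>` layout, the `resolve` function, and the `"MANAGED"` default are assumptions for illustration; consult the Gravitino documentation for the real virtual filesystem behaviour.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Fileset:
    name: str
    storage_location: str  # e.g. an S3, HDFS, ADLS, or GCS URI
    type: str = "MANAGED"  # illustrative; the slides only say `type: Type`


# Assumed virtual path prefix for this example.
VIRTUAL_PREFIX = "gvfs://fileset/"


def resolve(virtual_path: str,
            filesets: Dict[Tuple[str, str, str], Fileset]) -> str:
    """Translate a virtual fileset path into a physical storage path."""
    if not virtual_path.startswith(VIRTUAL_PREFIX):
        raise ValueError(f"not a virtual fileset path: {virtual_path}")
    # Split into catalog, schema, fileset, and an optional remainder.
    parts = virtual_path[len(VIRTUAL_PREFIX):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError("expected <catalog>/<schema>/<fileset>[/<sub-path>]")
    catalog, schema, name = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    base = filesets[(catalog, schema, name)].storage_location.rstrip("/")
    return f"{base}/{sub_path}" if sub_path else base


registry = {
    ("files", "raw", "images"): Fileset("images", "s3://example-bucket/raw/images/"),
}
physical = resolve("gvfs://fileset/files/raw/images/2025/a.png", registry)
```

Because callers only see the virtual path, the fileset could be relocated from one store to another by changing `storage_location` in the catalog, without touching training or ingestion code.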
Scenarios
Lakehouse Federation
Multi-cloud, multi-engine, and multi-format
An open solution for Lakehouse Federation
Platform Capabilities
Analytics
Machine Learning
360° View
App
Query/Language Tools
SQL
Python
R
Core Functionality
Gravitino Data Connector
Federated query across multiple clouds, formats, and engines.
Enable Data and AI teams to work together seamlessly
Roles:
Data Engineer
Data Scientist
AI Engineer
Use Scenario:
Efficient collaboration between data engineers and data scientists or AI engineers
Data scientists get a unified definition of metadata for heterogeneous data sources
Data engineers use metadata to process data
Unified metadata for multiple AI frameworks
Unified security control
Core Technology:
- Gravitino
External Factors:
Technology
Communication
ETL
Internet of Things
Automation
Networking
Data & Tools:
[ 1 ] Data Ingestion:
Spark
HDFS Client
S3 SDK
[ 2 ] Model Training:
TensorFlow
PyTorch
Ray
Gravitino Python lib
[ 3 ] Data Types:
Structured Data
Unstructured Data
Gravitino Features:
Gravitino IO (Data read & write)
Gravitino ACL (Access Control)
Gravitino Next - metadata-driven action system
Catalog service
APIs: Unified REST API, Iceberg REST API
Components: Catalog, Schema, Table, Fileset, Model, Topic
Connections: Connectors to various data sources (databases, files)
Gravitino Next
Catalog service
APIs: Unified REST API, Iceberg REST API
Components: Catalog, Schema, Table, Fileset, Model, Topic, Policy
Job system items: Job
Systems Included:
Policy system
Statistics system
Job system
Action framework
Action framework items:
TTL Action
Compaction Action
Clustering Action
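One way to read "metadata-driven action system" is that policies attached to catalog objects, combined with statistics about those objects, are turned into jobs that an action then executes. The sketch below illustrates that flow for a TTL action only. Every name here (`TtlPolicy`, `Job`, `plan_ttl_jobs`) is hypothetical, and the slides do not describe how Gravitino actually plans or runs actions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class TtlPolicy:
    table: str
    ttl_days: int  # partitions older than this many days are expired


@dataclass
class Job:
    action: str            # which action should run, e.g. "ttl"
    table: str
    partitions: List[date]  # partitions the action should drop


def plan_ttl_jobs(policies: List[TtlPolicy],
                  partitions: Dict[str, List[date]],
                  today: date) -> List[Job]:
    """Turn TTL policies plus partition metadata into jobs.

    `partitions` stands in for what a statistics system might report:
    the daily partitions that currently exist for each table.
    """
    jobs = []
    for policy in policies:
        cutoff = today - timedelta(days=policy.ttl_days)
        expired = sorted(d for d in partitions.get(policy.table, []) if d < cutoff)
        if expired:  # only create a job when there is something to do
            jobs.append(Job("ttl", policy.table, expired))
    return jobs
```

Compaction and clustering actions would fit the same shape: a policy states the desired condition, statistics show the current state, and a job is created only when the two differ.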
Team: