Eliana Lam for AWS Community On Air

Posted on Nov 22, 2025

A Modern Unified Metadata Architecture: New Approaches to Breaking Down Data Silos

#aws #cloud #beginners #productivity

Speaker: Shaofeng Shi @ AWS Amarathon 2025

Summary by Amazon Nova

A Brief History of Un-siloing Data

Late 1980s: Data Warehouse
2011: Data Lake
2020: Lakehouse

Goal

To achieve SSOT (Single Source of Truth)
Full management of data
Reduce risks, such as data leaks and compliance violations, for a data-driven business.

New Data Silos in Clouds & Regions

Nobody likes vendor “lock-in”

If data is deployed with different cloud vendors:
Hard to process together
Expensive to move

Nobody likes geo-distributed data

But data follows the business as it becomes international:
Regulatory requirements
Cost of cross-ocean transfer

More than "Data Access"

Data you see

Technical & Business Data
Legal Hold Data

Metadata you overlook

3rd Party Data
PII & PI Data
Credentials
IP Data

Data Management Functions

Data Connect: Connect to the Data That Matters Most.
Data Rights Automation: Automate end-to-end data rights requests and reporting.
Metadata Enrichment: Enrich technical metadata with business and operational metadata for full visibility.
Data Discovery: Automatically find, classify, and map all of your data - everywhere.
Data Classification: Automatically classify more types of data in more places.
Data Lifecycle Management: Simplify and automate data lifecycle management from collection to destruction.

What is Gravitino

Next-gen unified data catalog for Data/AI

Integrations:

Trino
Spark
Flink
Doris
ClickHouse
PyTorch
TensorFlow

Metadata Lake Using Gravitino Components:

Hive Metastore
Built-in Catalog
Schema Registry
Fileset Management
Model Catalog

Data Sources:

Hadoop Data Lake
Data Warehouse
Streaming Processing
Unstructured Data
Machine Learning

Problems to solve

Get a "big picture" of all the data
Achieve SSOT of data while it is distributed and consumed in various ways
Govern data in one place; secure and audit it everywhere

The next-gen data catalog is the core of the new open data architecture.

Gravitino Architecture

Functionality Layer:

Unified Processing
Unified Governing

Interface Layer:

Unified REST APIs
Iceberg REST APIs

Core with Object Model:

Metalake
Catalogs
Schemas
Object Types: Table, Fileset, Model, Topic
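
The object model above nests as Metalake, then Catalog, then Schema, then a typed object. The following is a minimal Python sketch of that hierarchy, assuming a dotted name is used to address objects. The class names, the `register` and `lookup` helpers, and the dotted naming are illustrative assumptions for this summary, not Gravitino's actual client API.

```python
from dataclasses import dataclass, field
from typing import Dict

# Object types listed in the talk's object model.
OBJECT_TYPES = {"table", "fileset", "model", "topic"}


@dataclass
class MetadataObject:
    """A leaf object in the hierarchy (hypothetical class for illustration)."""
    name: str
    type: str  # one of OBJECT_TYPES

    def __post_init__(self) -> None:
        if self.type not in OBJECT_TYPES:
            raise ValueError(f"unknown object type: {self.type}")


@dataclass
class Schema:
    name: str
    objects: Dict[str, MetadataObject] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    schemas: Dict[str, Schema] = field(default_factory=dict)


@dataclass
class Metalake:
    """Top-level container: one metalake holds many catalogs."""
    name: str
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def register(self, catalog: str, schema: str, obj: MetadataObject) -> str:
        """Store obj under catalog.schema, creating containers as needed."""
        cat = self.catalogs.setdefault(catalog, Catalog(catalog))
        sch = cat.schemas.setdefault(schema, Schema(schema))
        sch.objects[obj.name] = obj
        return f"{self.name}.{catalog}.{schema}.{obj.name}"

    def lookup(self, qualified_name: str) -> MetadataObject:
        """Resolve 'catalog.schema.object' inside this metalake."""
        catalog, schema, name = qualified_name.split(".")
        return self.catalogs[catalog].schemas[schema].objects[name]


lake = Metalake("demo")
full_name = lake.register("hive_cat", "sales", MetadataObject("orders", "table"))
found = lake.lookup("hive_cat.sales.orders")
```

The point of the sketch is that tables, filesets, models, and topics share one addressing scheme, so a single lookup path serves every object type.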

Connection Layer:

Connections

Metadata Storage

Supported Data Types (Bottom Layer):
Tabular
Files
Models
Message Queue

Process Tabular and Non-tabular data with Gravitino

Tabular data (via connectors)

Engines: Spark
Operations: Create, Load, Alter, Drop
API: Unified Tabular API

Schema (struct):

name: string
comment: string
properties: map

Table (struct):

name: string
columns: Column[]
partitioning: Transform[]
distribution: Distribution
sortOrder: SortOrder[]
indexes: Index[]

Related Definitions:

Transform, Distribution, SortOrder, Index, Type
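
To make the Table struct concrete, here is a rough Python sketch of the fields listed above as dataclasses. Field names follow the slide (with `sortOrder` written as `sort_order`), but the inner shape of `Column`, `Transform`, `Distribution`, `SortOrder`, and `Index`, as well as the `undefined_columns` helper, are assumptions made for illustration and are not taken from Gravitino's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    data_type: str  # stand-in for the slide's "Type"


@dataclass
class Transform:
    kind: str    # assumed examples: "identity", "day", "bucket"
    column: str


@dataclass
class Distribution:
    strategy: str        # assumed example: "hash"
    number: int          # number of buckets
    columns: List[str]


@dataclass
class SortOrder:
    column: str
    direction: str = "asc"


@dataclass
class Index:
    name: str
    kind: str            # assumed example: "primary_key"
    columns: List[str]


@dataclass
class Table:
    name: str
    columns: List[Column]
    partitioning: List[Transform] = field(default_factory=list)
    distribution: Optional[Distribution] = None
    sort_order: List[SortOrder] = field(default_factory=list)
    indexes: List[Index] = field(default_factory=list)


def undefined_columns(table: Table) -> List[str]:
    """Return column names referenced by the layout but not declared."""
    known = {c.name for c in table.columns}
    referenced = [t.column for t in table.partitioning]
    if table.distribution is not None:
        referenced += table.distribution.columns
    referenced += [s.column for s in table.sort_order]
    for idx in table.indexes:
        referenced += idx.columns
    return sorted(set(referenced) - known)


orders = Table(
    name="orders",
    columns=[Column("id", "long"), Column("created_at", "timestamp")],
    partitioning=[Transform("day", "created_at")],
    distribution=Distribution("hash", 8, ["id"]),
    sort_order=[SortOrder("created_at", "desc")],
    indexes=[Index("pk_orders", "primary_key", ["id"])],
)
```

Keeping partitioning, distribution, sort order, and indexes as engine-neutral fields is what lets one table definition be translated for different engines through connectors.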

Non-tabular data

Engines: Spark, PyTorch, Ray, TensorFlow
Filesystems: Gravitino Virtual FileSystem, Python FileSystem
Operations: Create, Load, Alter, Drop
API: Unified Non-tabular API

Schema (struct):

name: string
comment: string
properties: map

Fileset (struct):

name: string
storageLocation: string
type: Type

Storage Locations:

S3, HDFS, ADLS, GCS
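
A fileset pairs a logical name with a physical storage location, so a virtual filesystem can translate a logical path into a storage path. The sketch below shows that translation in plain Python. It assumes virtual paths of the form `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub path>` and a `type` value of `"managed"`; both, along with the `Fileset` class and `resolve` function, are assumptions for illustration rather than Gravitino's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed virtual path prefix; treat it as a placeholder for illustration.
VIRTUAL_PREFIX = "gvfs://fileset/"


@dataclass
class Fileset:
    name: str
    storage_location: str   # e.g. an S3, HDFS, ADLS, or GCS URI
    type: str = "managed"   # assumed value, not confirmed by the talk


FilesetKey = Tuple[str, str, str]  # (catalog, schema, fileset)


def resolve(virtual_path: str, filesets: Dict[FilesetKey, Fileset]) -> str:
    """Map a virtual fileset path to its physical storage path."""
    if not virtual_path.startswith(VIRTUAL_PREFIX):
        raise ValueError(f"not a virtual fileset path: {virtual_path}")
    parts = virtual_path[len(VIRTUAL_PREFIX):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError("expected <catalog>/<schema>/<fileset>[/<sub path>]")
    catalog, schema, name = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    base = filesets[(catalog, schema, name)].storage_location.rstrip("/")
    return f"{base}/{sub_path}" if sub_path else base


registry = {
    ("fs_cat", "raw", "events"): Fileset("events", "s3://example-bucket/events/"),
}
physical = resolve("gvfs://fileset/fs_cat/raw/events/2025/part-0.parquet", registry)
```

Because callers only see the logical path, the storage location can move between S3, HDFS, ADLS, or GCS without changes to the jobs that read the fileset.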

Scenarios

Lakehouse Federation

Multi-cloud, multi-engine, and multi-format
An open solution for Lakehouse Federation

Platform Capabilities

Analytics
Machine Learning
360° View
App

Query/Language Tools

SQL
Python
R

Core Functionality

Gravitino Data Connector
Federated query across multiple clouds, formats, and engines.

Make Data and AI teams work together seamlessly

Roles:

Data Engineer
Data Scientist
AI Engineer

Use Scenario:

Efficient collaboration between Data Engineers and Data Scientists or AI Engineers
Data Scientists get a unified definition of metadata for heterogeneous data sources
Data engineers use metadata to process data
Unified metadata for multiple AI frameworks
Unified security control

Core Technology: