Eliana Lam for AWS Community On Air

Unified catalog for Data and AI

Speaker: Shaofeng Shi @ AWS Amarathon 2025

Summary by Amazon Nova



A Brief History of Un-siloing Data

  • Late 1980s: Data Warehouse

  • 2011: Data Lake

  • 2020: Lakehouse

Goal

  • To achieve SSOT (Single Source of Truth)

  • Full management of data

  • Reduce risks, such as data leaks and compliance violations, for a data-driven business.

New Data Silos in Clouds & Regions

Nobody likes vendor lock-in

  • If data is deployed with different cloud vendors, it is:

    • Hard to process together

    • Expensive to move

Nobody likes geo-distributed data

  • But data follows the business as it becomes international, which brings:

    • Regulatory requirements

    • Costs for cross-ocean transfer

More than "Data Access"

Data you see

  • Technical & Business Data

  • Legal Hold Data

Metadata you overlook

  • 3rd Party Data

  • PII & PI Data

  • Credentials

  • IP Data

Data Management Functions

  • Data Connect: Connect to the Data That Matters Most.

  • Data Rights Automation: Automate end-to-end data rights requests and reporting.

  • Metadata Enrichment: Enrich technical metadata with business and operational metadata for full visibility.

  • Data Discovery: Automatically find, classify, and map all of your data - everywhere.

  • Data Classification: Automatically classify more types of data in more places.

  • Data Lifecycle Management: Simplify and automate data lifecycle management from collection to destruction.



What is Gravitino

Next-gen unified data catalog for Data/AI

Integrations:

  • Trino

  • Spark

  • Flink

  • Doris

  • ClickHouse

  • PyTorch

  • TensorFlow

Metadata Lake Using Gravitino Components:

  • Hive Metastore

  • Built-in Catalog

  • Schema Registry

  • Fileset Management

  • Model Catalog

Data Sources:

  • Hadoop Data Lake

  • Data Warehouse

  • Streaming Processing

  • Unstructured Data

  • Machine Learning

Problems to solve

  • Get a "big picture" of all the data

  • Achieve SSOT of data while it is distributed and consumed in various ways

  • Data governance in one place, secure and audit data everywhere

The next-gen data catalog is the core of the new open data architecture.



Gravitino Architecture

Functionality Layer:

  • Unified Processing

  • Unified Governing

Interface Layer:

  • Unified REST APIs

  • Iceberg REST APIs

Core with Object Model:

  • Metalake

  • Catalogs

  • Schemas

  • Object Types: Table, Fileset, Model, Topic
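
The object model above is hierarchical: a metalake contains catalogs, a catalog contains schemas, and a schema contains objects such as tables, filesets, models, and topics. A minimal sketch of that naming hierarchy follows. The class `ObjectName`, its fields, and the dotted name format are illustrative assumptions, not Gravitino's actual client API.

```python
from dataclasses import dataclass

# Object types listed in the talk; this tuple is illustrative only.
OBJECT_TYPES = ("table", "fileset", "model", "topic")


@dataclass(frozen=True)
class ObjectName:
    """Hypothetical fully qualified name: metalake > catalog > schema > object."""

    metalake: str
    catalog: str
    schema: str
    name: str
    object_type: str = "table"

    def __post_init__(self):
        # Reject object types that the object model does not list.
        if self.object_type not in OBJECT_TYPES:
            raise ValueError(f"unknown object type: {self.object_type}")

    def qualified(self) -> str:
        # Join the four levels into one dotted identifier.
        return ".".join((self.metalake, self.catalog, self.schema, self.name))


orders = ObjectName("demo_lake", "hive_prod", "sales", "orders")
print(orders.qualified())  # demo_lake.hive_prod.sales.orders
```

Keeping the object type next to the name lets one addressing scheme cover tabular and non-tabular objects alike.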

Connection Layer:

  • Connections

Metadata Storage

  • Supported Data Types (Bottom Layer):

    • Tabular

    • Files

    • Models

    • Message Queue



Process Tabular and Non-tabular data with Gravitino

Tabular data (via connectors)

  • Engines: Spark

  • Operations: Create, Load, Alter, Drop

  • API: Unified Tabular API

Schema (struct):

  • name: string

  • comment: string

  • properties: map

Table (struct):

  • name: string

  • columns: Column[]

  • partitioning: Transform[]

  • distribution: Distribution

  • sortOrder: SortOrder[]

  • indexes: Index[]

Related Definitions: 

  • Transform, Distribution, SortOrder, Index, Type
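
To make the Schema and Table structs above concrete, here is a rough sketch of them as Python dataclasses. The field names follow the talk; everything else is an assumption made for illustration: partitioning, sort order, and indexes are reduced to plain column names instead of full Transform, SortOrder, and Index objects, and the `unknown_columns` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    name: str
    type: str  # simplified stand-in for the talk's Type definition
    comment: str = ""


@dataclass
class Schema:
    name: str
    comment: str = ""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    columns: List[Column]
    # Simplification: each entry is just a column name, not a full
    # Transform / SortOrder / Index object.
    partitioning: List[str] = field(default_factory=list)
    distribution: Optional[str] = None
    sort_order: List[str] = field(default_factory=list)
    indexes: List[str] = field(default_factory=list)

    def unknown_columns(self) -> List[str]:
        """Return referenced column names that are not defined in `columns`."""
        known = {c.name for c in self.columns}
        referenced = [*self.partitioning, *self.sort_order, *self.indexes]
        return sorted({r for r in referenced if r not in known})


sales = Schema(name="sales", comment="order data", properties={"owner": "data-eng"})
orders = Table(
    name="orders",
    columns=[Column("order_id", "bigint"), Column("order_date", "date")],
    partitioning=["order_date"],
    sort_order=["order_id"],
    indexes=["customer_id"],  # deliberately not a defined column
)
print(orders.unknown_columns())  # ['customer_id']
```

A check like `unknown_columns` is the kind of validation a unified tabular API can run once, regardless of which engine later reads the table.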


Non-tabular data

  • Engines: Spark, PyTorch, Ray, TensorFlow

  • Filesystems: Gravitino Virtual FileSystem, Python FileSystem

  • Operations: Create, Load, Alter, Drop

  • API: Unified Non-tabular API

Schema (struct):

  • name: string

  • comment: string

  • properties: map

Fileset (struct):

  • name: string

  • storageLocation: string

  • type: Type

Storage Locations:

  • S3, HDFS, ADLS, GCS
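
A fileset maps a logical name to a physical storage location, so engines can address files without knowing which store holds them. The sketch below illustrates that idea only. The `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub path>` layout, the `resolve` function, the in-memory registry, and the bucket name are assumptions for illustration; this is not the Gravitino Virtual FileSystem implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PREFIX = "gvfs://fileset/"  # assumed virtual path prefix


@dataclass
class Fileset:
    name: str
    storage_location: str  # e.g. an S3, HDFS, ADLS, or GCS URI
    type: str = "managed"  # illustrative value


def resolve(virtual_path: str, registry: Dict[Tuple[str, str, str], Fileset]) -> str:
    """Translate a virtual fileset path into its physical storage path."""
    if not virtual_path.startswith(PREFIX):
        raise ValueError(f"not a virtual fileset path: {virtual_path}")
    # Split into catalog, schema, fileset name, and the optional sub path.
    parts = virtual_path[len(PREFIX):].split("/", 3)
    if len(parts) < 3:
        raise ValueError("expected <catalog>/<schema>/<fileset>[/<sub path>]")
    catalog, schema, name = parts[:3]
    sub_path = parts[3] if len(parts) == 4 else ""
    base = registry[(catalog, schema, name)].storage_location.rstrip("/")
    return f"{base}/{sub_path}" if sub_path else base


registry = {
    ("fs_cat", "raw", "images"): Fileset("images", "s3://demo-bucket/raw/images/"),
}
print(resolve("gvfs://fileset/fs_cat/raw/images/2025/01/cat.png", registry))
# s3://demo-bucket/raw/images/2025/01/cat.png
```

Because callers only see the virtual path, the storage location behind a fileset can change without touching training or ingestion code.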


Scenarios

Lakehouse Federation

  • Multi-cloud, multi-engine, and multi-format

  • An open solution for Lakehouse Federation

Platform Capabilities

  • Analytics

  • Machine Learning

  • 360° View

  • App

Query/Language Tools

  • SQL

  • Python

  • R

Core Functionality

  • Gravitino Data Connector

  • Federated query across multiple clouds, formats, and engines.



Make Data and AI Teams Work Together Seamlessly

Roles:

  • Data Engineer

  • Data Scientist

  • AI Engineer

Usage Scenario:

  • Efficient collaboration between Data Engineers and Data Scientists or AI Engineers

  • Data Scientists get a unified definition of metadata for heterogeneous data sources

  • Data engineers use metadata to process data

  • Unified metadata for multiple AI frameworks

  • Unified security control

Core Technology:

  • Gravitino

External Factors:

  • Technology

  • Communication

  • ETL

  • Internet of Things

  • Automation

  • Networking

Data & Tools:

  • [ 1 ] Data Ingestion:

    • Spark

    • HDFS Client

    • S3 SDK

  • [ 2 ] Model Training:

    • TensorFlow

    • PyTorch

    • Ray

    • Gravitino Python lib

  • [ 3 ] Data Types:

    • Structured Data

    • Unstructured Data

Gravitino Features:

  • Gravitino IO (Data read & write)

  • Gravitino ACL (Access Control)



Gravitino Next - metadata-driven action system

  • Catalog service

  • APIs: Unified REST API, Iceberg REST API

  • Components: Catalog, Schema, Table, Fileset, Model, Topic

  • Connections: Connectors to various data sources (databases, files)

Gravitino Next

  • Catalog service

  • APIs: Unified REST API, Iceberg REST API

  • Components: Catalog, Schema, Table, Fileset, Model, Topic, Policy

  • Job system items: Job

Systems Included:

  • Policy system

  • Statistics system

  • Job system

  • Action framework

Action framework items:

  • TTL Action

  • Compaction Action

  • Clustering Action
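
One way to read "metadata-driven action" is that a policy attached to a table, combined with collected statistics, decides what a job should do. The sketch below illustrates that reading for a TTL action. All names in it (`TtlPolicy`, `PartitionStat`, `plan_ttl_action`) are hypothetical and the logic is an assumption, not the Gravitino Next design.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class TtlPolicy:
    ttl_days: int  # how long a partition may stay unmodified


@dataclass
class PartitionStat:
    partition: str
    last_modified: date  # would come from the statistics system


def plan_ttl_action(policy: TtlPolicy, stats: List[PartitionStat], today: date) -> List[str]:
    """Return the partitions whose last modification is older than the TTL."""
    cutoff = today - timedelta(days=policy.ttl_days)
    return [s.partition for s in stats if s.last_modified < cutoff]


stats = [
    PartitionStat("dt=2025-04-01", date(2025, 4, 1)),
    PartitionStat("dt=2025-05-31", date(2025, 5, 31)),
    PartitionStat("dt=2025-06-15", date(2025, 6, 15)),
]
expired = plan_ttl_action(TtlPolicy(ttl_days=30), stats, today=date(2025, 6, 30))
print(expired)  # ['dt=2025-04-01']
```

The function only plans the action; in this reading, a job system would carry out the actual drop, which keeps policy evaluation separate from execution.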



Team:

AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
