Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management

The modern data landscape is increasingly distributed across multiple cloud providers and storage systems. Organizations often find themselves managing data across AWS S3, Google Cloud Storage, Azure Blob Storage, and on-premises systems, each with its own access patterns, credentials, and metadata management challenges. This fragmentation complicates data discovery and access control and adds operational overhead.

Today, we're excited to introduce the integration between Daft and Apache Gravitino, bringing unified catalog management and seamless multi-cloud data access to the Daft ecosystem. This integration focuses on fileset catalog support, enabling you to access distributed datasets through a single, unified interface while maintaining security and performance.

Note: This integration is available in Daft v0.7.2 and later versions.

What is Apache Gravitino?

Apache Gravitino is an open-source data catalog that provides unified metadata management for various data sources and storage systems. It acts as a central hub for organizing and accessing data across different platforms, offering:

  • Unified Metadata Management: Single source of truth for data across multiple storage systems
  • Multi-Cloud Support: Native integration with AWS S3 and local storage, with more cloud providers coming soon
  • Security Integration: Centralized credential management and access control
  • Catalog Abstraction: Support for both table catalogs (Iceberg, Hudi, Hive, JDBC) and fileset catalogs

What is Daft?

Daft is a distributed query engine built for the Python ecosystem, designed to handle large-scale data processing with ease and efficiency. Daft brings the power of distributed computing to data scientists and engineers through a familiar DataFrame API, offering:

  • Distributed Processing: Scale computations across multiple cores and machines seamlessly
  • Lazy Evaluation: Optimize query execution through intelligent query planning and predicate pushdown
  • Multi-Format Support: Native support for Parquet, JSON, CSV, Images, and more
  • Cloud-Native: Built-in integrations with AWS S3, Google Cloud Storage, Azure Blob Storage
  • Python-First: Intuitive DataFrame API that feels natural to Python developers
  • Performance Optimized: Rust-powered execution engine for maximum performance

Unlike traditional big data tools that require complex cluster management, Daft provides a simple pip install experience while delivering enterprise-grade performance. Whether you're processing terabytes of data locally or across cloud infrastructure, Daft's intelligent execution engine automatically optimizes your workloads for speed and efficiency.
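
For a quick feel of the API, here is a minimal, self-contained Daft example (no Gravitino required). The column names and values are made up for illustration:

import daft

# Build a small in-memory DataFrame (lazy; nothing executes yet)
df = daft.from_pydict({
    "user_id": [1, 2, 3, 4],
    "event_type": ["click", "view", "click", "purchase"],
    "amount": [0.0, 0.0, 0.0, 42.5],
})

# Transformations are recorded in the query plan rather than run eagerly
clicks = df.filter(daft.col("event_type") == "click").select("user_id", "amount")

# Execution happens only when results are requested
clicks.show()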

(Figure: Daft architecture)

The Power of Fileset Catalogs

While table catalogs manage structured data with schemas, fileset catalogs provide a flexible way to organize and access collections of files across different storage systems. This is particularly valuable for:

  • Data Lakes: Managing raw data files, logs, and unstructured datasets
  • Multi-Format Data: Handling Parquet, JSON, CSV, and other file formats in a unified way
  • Distributed Storage: Accessing data across different storage systems
  • Dynamic Datasets: Working with datasets that don't fit traditional table structures

Introducing GVFS

The Daft + Gravitino integration introduces a new URL scheme: gvfs:// (Gravitino Virtual File System). This provides a unified way to access files managed by Gravitino filesets, regardless of their underlying storage location (S3, ADLS, GCS, etc.).

URL Format

gvfs://fileset/catalog/schema/fileset/path/to/file

Where:

  • catalog: The Gravitino catalog name
  • schema: The schema within the catalog
  • fileset: The specific fileset name
  • path/to/file: The file path within the fileset (optional)

Example URLs

# Access a specific file
"gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet"

# Access all files in a fileset
"gvfs://fileset/s3_catalog/ml_data/training_set/"

# Access partitioned data
"gvfs://fileset/s3_catalog/logs/application/year=2024/month=01/"

Getting Started

Requirements

The Daft + Gravitino integration requires:

  • Python: 3.10 or later
  • pip: 21.0 or later (recommended: latest version)
  • Daft: v0.7.2 or later

Make sure you have the correct versions installed.

Installation and Setup

First, make sure you have a running Apache Gravitino server, and create a fileset catalog, a schema, and at least one fileset entity whose storage location points to S3. Refer to Gravitino's online documentation for details.
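
If you prefer to script this setup step, the sketch below shows roughly how a fileset catalog, schema, and S3-backed fileset could be created through Gravitino's REST API with the requests library. Treat it as a starting point only: the endpoint paths, payload fields, and especially the S3 property names should be checked against the Gravitino documentation for your server version.

import requests

GRAVITINO = "http://localhost:8090"
METALAKE = "my_metalake"  # assumed to already exist

# 1. Create a fileset catalog (fileset catalogs use the "hadoop" provider).
#    The S3 property keys below are illustrative -- verify them against the
#    Gravitino docs for your version.
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs",
    json={
        "name": "s3_catalog",
        "type": "FILESET",
        "provider": "hadoop",
        "comment": "S3-backed fileset catalog",
        "properties": {
            "filesystem-providers": "s3",
            "s3-endpoint": "https://s3.us-west-2.amazonaws.com",
            "s3-access-key-id": "<access-key>",
            "s3-secret-access-key": "<secret-key>",
        },
    },
).raise_for_status()

# 2. Create a schema inside the catalog
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs/s3_catalog/schemas",
    json={"name": "analytics", "comment": "analytics datasets", "properties": {}},
).raise_for_status()

# 3. Create a fileset whose storage location points at an S3 prefix
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs/s3_catalog/schemas/analytics/filesets",
    json={
        "name": "user_events",
        "type": "EXTERNAL",
        "comment": "raw user events",
        "storageLocation": "s3a://my-bucket/analytics/user_events",
        "properties": {},
    },
).raise_for_status()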

Second, install Daft v0.7.2 or later, which includes support for Gravitino:

pip install "daft>=0.7.2" requests

Basic Configuration

import daft
from daft.io import IOConfig, GravitinoConfig

# Configure Gravitino connection
gravitino_config = GravitinoConfig(
    endpoint="http://localhost:8090",
    metalake_name="my_metalake",
    auth_type="simple"
)

# Create IOConfig with Gravitino settings
io_config = IOConfig(gravitino=gravitino_config)

Reading Data from Filesets

# Read a specific file from a Gravitino fileset
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/user_events/events.parquet",
    io_config=io_config
)

# Read all Parquet files in a fileset
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
    io_config=io_config
)

# List files in a fileset
files_df = daft.from_glob_path(
    "gvfs://fileset/s3_catalog/analytics/user_events/**/*",
    io_config=io_config
)
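
These reads are lazy: Daft builds a query plan and only reads from the underlying storage when you materialize results. A few quick ways to inspect what came back, using the df from the Parquet reads above:

# Peek at the first rows (executes only as much as needed)
df.show(5)

# Inspect the inferred schema
print(df.schema())

# Count rows across all matched files
print(df.count_rows())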

Advanced Usage Examples

Working with Multiple File Formats

# Read JSON files from a fileset
json_df = daft.read_json(
    "gvfs://fileset/s3_catalog/logs/application/**/*.json",
    io_config=io_config
)

# Read CSV files with custom options
csv_df = daft.read_csv(
    "gvfs://fileset/s3_catalog/exports/daily_reports/**/*.csv",
    io_config=io_config
)

Writing Data to Filesets

# Write Parquet files to a Gravitino fileset
df = daft.from_pydict({
    "user_id": [1, 2, 3],
    "event_type": ["click", "purchase", "view"],
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

df.write_parquet(
    "gvfs://fileset/s3_catalog/analytics/processed_events/",
    io_config=io_config
)

# Write CSV files to a fileset
df.write_csv(
    "gvfs://fileset/s3_catalog/exports/daily_reports/",
    io_config=io_config
)

# Write JSON files to a fileset
df.write_json(
    "gvfs://fileset/s3_catalog/logs/application/",
    io_config=io_config
)
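
After a write, the files land under the fileset's storage location and can be read back through the same gvfs:// path, so a quick round-trip check (reusing the hypothetical fileset paths above) might look like this:

# Read back the Parquet output that was just written and spot-check it
written_df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/processed_events/**/*.parquet",
    io_config=io_config,
)
written_df.show()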

Programmatic Fileset Discovery

from daft.gravitino import GravitinoClient

# Initialize Gravitino client
client = GravitinoClient(
    endpoint="http://localhost:8090",
    metalake_name="my_metalake",
    auth_type="simple"
)

# Discover available catalogs and filesets
catalogs = client.list_catalogs()
print(f"Available catalogs: {catalogs}")

# Load fileset metadata
fileset = client.load_fileset("s3_catalog.analytics.user_events")
print(f"Storage location: {fileset.fileset_info.storage_location}")
print(f"Properties: {fileset.fileset_info.properties}")

# Use the fileset with Daft
df = daft.read_parquet(
    f"gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
    io_config=client.to_io_config()
)

Security and Credential Management

One of the key benefits of the Gravitino integration is centralized credential management. Instead of managing separate credentials for each storage system, Gravitino handles authentication and authorization:

# Gravitino manages credentials for underlying storage
# No need to configure separate S3 credentials in Daft
gravitino_config = GravitinoConfig(
    endpoint="http://localhost:8090",
    metalake_name="secure_metalake",
    auth_type="oauth2",
    token="your-oauth-token"
)

# All storage access is handled through Gravitino's security layer
io_config = IOConfig(gravitino=gravitino_config)

# Access data with unified security
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/sensitive_data/financial/**/*.parquet",
    io_config=io_config
)

Performance Considerations

The Gravitino integration is designed for optimal performance:

  • Lazy Evaluation: Daft's lazy execution works seamlessly with gvfs:// URLs
  • Predicate Pushdown: Filters are pushed down to the storage layer when possible
  • Parallel Processing: Multi-threaded I/O operations across different storage systems
  • Caching: Gravitino metadata is cached to reduce lookup overhead

# Efficient filtered reads with predicate pushdown
df = (
    daft.read_parquet("gvfs://fileset/s3_catalog/events/daily/**/*.parquet", io_config=io_config)
    .filter(daft.col("date") >= "2024-01-01")  # Pushed down to storage
    .filter(daft.col("event_type") == "click")  # Efficient columnar filtering
    .select("user_id", "timestamp", "page_url")  # Column pruning
)
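
If you want to see which of these optimizations a particular query picks up, Daft can print its query plan before anything runs. A minimal check, reusing the df defined above (the exact plan output varies by Daft version):

# Print both the unoptimized and optimized plans without executing the query
df.explain(show_all=True)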

Current Limitations and Future Roadmap

Current Status

  • Read Operations: Full support for reading files from Gravitino filesets
  • Write Operations: Support for writing Parquet, CSV, and JSON files to filesets
  • Multiple Formats: Support for Parquet, JSON, CSV, and other formats
  • S3 Storage: Full support for S3-backed filesets (including S3-compatible storage like MinIO)
  • Local Storage: Support for local file:// storage
  • Security Integration: Centralized credential management through Gravitino

Future Enhancements

  • Credential Vending: Gravitino credential vending can generate temporary credentials for clients, providing enhanced security
  • More Cloud Storages: Support for GCS and Azure Blob Storage
  • Table Catalog Integration: Support for reading and writing Iceberg and Lance table catalogs
  • Advanced Security: Fine-grained access control and audit logging
  • Performance Optimizations: Enhanced caching and metadata management

Getting Started Today

Ready to try the Daft + Gravitino integration? Here's how to get started:

  1. Set up Gravitino: Follow the Gravitino quickstart guide to set up your Gravitino server

  2. Install Daft v0.7.2 or later with Gravitino support:

   pip install "daft>=0.7.2" requests
  3. Configure your first fileset: Create a fileset in Gravitino pointing to your data

  4. Start querying: Use gvfs:// URLs to access your data with Daft

The integration between Daft and Apache Gravitino represents a significant step forward in simplifying distributed data access. By combining Daft's powerful distributed query engine with Gravitino's unified catalog management, data teams can focus on extracting insights rather than managing infrastructure complexity.

Whether you're building analytics pipelines, managing data lakes, or simply looking to simplify your data access patterns, the Daft + Gravitino integration provides the tools you need to succeed in today's distributed data landscape.


Want to learn more? Check out the Daft documentation and Apache Gravitino project for detailed guides and examples.
