Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management

The modern data landscape is increasingly distributed across multiple cloud providers and storage systems. Organizations often find themselves managing data across AWS S3, Google Cloud Storage, Azure Blob Storage, and on-premises systems, each with its own access patterns, credentials, and metadata management challenges. This fragmentation complicates data discovery and access control and adds operational overhead.

Today, we're excited to introduce the integration between Daft and Apache Gravitino, bringing unified catalog management and seamless multi-cloud data access to the Daft ecosystem. This integration focuses on fileset catalog support, enabling you to access distributed datasets through a single, unified interface while maintaining security and performance.

Note: This integration is available in Daft v0.7.2 and later versions.

What is Apache Gravitino?

Apache Gravitino is an open-source data catalog that provides unified metadata management for various data sources and storage systems. It acts as a central hub for organizing and accessing data across different platforms, offering:

  • Unified Metadata Management: Single source of truth for data across multiple storage systems
  • Multi-Cloud Support: Native integration with AWS S3 and local storage, with more cloud providers coming soon
  • Security Integration: Centralized credential management and access control
  • Catalog Abstraction: Support for both table catalogs (Iceberg, Hudi, Hive, JDBC) and fileset catalogs

What is Daft?

Daft is a distributed query engine built for the Python ecosystem, designed to handle large-scale data processing with ease and efficiency. Daft brings the power of distributed computing to data scientists and engineers through a familiar DataFrame API, offering:

  • Distributed Processing: Scale computations across multiple cores and machines seamlessly
  • Lazy Evaluation: Optimize query execution through intelligent query planning and predicate pushdown
  • Multi-Format Support: Native support for Parquet, JSON, CSV, Images, and more
  • Cloud-Native: Built-in integrations with AWS S3, Google Cloud Storage, Azure Blob Storage
  • Python-First: Intuitive DataFrame API that feels natural to Python developers
  • Performance Optimized: Rust-powered execution engine for maximum performance

Unlike traditional big data tools that require complex cluster management, Daft provides a simple pip install experience while delivering enterprise-grade performance. Whether you're processing terabytes of data locally or across cloud infrastructure, Daft's intelligent execution engine automatically optimizes your workloads for speed and efficiency.
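
For a quick feel of the API, here is a minimal, self-contained Daft example (no Gravitino required). The column names and values are made up for illustration:

import daft

# Build a small in-memory DataFrame (lazy; nothing executes yet)
df = daft.from_pydict({
    "user_id": [1, 2, 3, 4],
    "event_type": ["click", "view", "click", "purchase"],
    "amount": [0.0, 0.0, 0.0, 42.5],
})

# Transformations are recorded in the query plan rather than run eagerly
clicks = df.filter(daft.col("event_type") == "click").select("user_id", "amount")

# Execution happens only when results are requested
clicks.show()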

(Figure: Daft architecture)

The Power of Fileset Catalogs

While table catalogs manage structured data with schemas, fileset catalogs provide a flexible way to organize and access collections of files across different storage systems. This is particularly valuable for:

  • Data Lakes: Managing raw data files, logs, and unstructured datasets
  • Multi-Format Data: Handling Parquet, JSON, CSV, and other file formats in a unified way
  • Distributed Storage: Accessing data across different storage systems
  • Dynamic Datasets: Working with datasets that don't fit traditional table structures

Introducing GVFS

The Daft + Gravitino integration introduces a new URL scheme: gvfs:// (Gravitino Virtual File System). This provides a unified way to access files managed by Gravitino filesets, regardless of their underlying storage location (S3, ADLS, GCS, etc.).

URL Format

gvfs://fileset/catalog/schema/fileset/path/to/file

Where:

  • catalog: The Gravitino catalog name
  • schema: The schema within the catalog
  • fileset: The specific fileset name
  • path/to/file: The file path within the fileset (optional)

Example URLs

# Access a specific file
"gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet"

# Access all files in a fileset
"gvfs://fileset/s3_catalog/ml_data/training_set/"

# Access partitioned data
"gvfs://fileset/s3_catalog/logs/application/year=2024/month=01/"

Getting Started

Requirements

The Daft + Gravitino integration requires:

  • Python: 3.10 or later
  • pip: 21.0 or later (recommended: latest version)
  • Daft: v0.7.2 or later

Make sure you have the correct versions installed.

Installation and Setup

First, make sure you have a running Apache Gravitino server, and create a fileset catalog, a schema, and at least one fileset entity whose storage location points to S3. Refer to Gravitino's online documentation for details.
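
If you prefer to script this setup step, the sketch below shows roughly how a fileset catalog, schema, and S3-backed fileset could be created through Gravitino's REST API with the requests library. Treat it as a starting point only: the endpoint paths, payload fields, and especially the S3 property names should be checked against the Gravitino documentation for your server version.

import requests

GRAVITINO = "http://localhost:8090"
METALAKE = "my_metalake"  # assumed to already exist

# 1. Create a fileset catalog (fileset catalogs use the "hadoop" provider).
#    The S3 property keys below are illustrative -- verify them against the
#    Gravitino docs for your version.
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs",
    json={
        "name": "s3_catalog",
        "type": "FILESET",
        "provider": "hadoop",
        "comment": "S3-backed fileset catalog",
        "properties": {
            "filesystem-providers": "s3",
            "s3-endpoint": "https://s3.us-west-2.amazonaws.com",
            "s3-access-key-id": "<access-key>",
            "s3-secret-access-key": "<secret-key>",
        },
    },
).raise_for_status()

# 2. Create a schema inside the catalog
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs/s3_catalog/schemas",
    json={"name": "analytics", "comment": "analytics datasets", "properties": {}},
).raise_for_status()

# 3. Create a fileset whose storage location points at an S3 prefix
requests.post(
    f"{GRAVITINO}/api/metalakes/{METALAKE}/catalogs/s3_catalog/schemas/analytics/filesets",
    json={
        "name": "user_events",
        "type": "EXTERNAL",
        "comment": "raw user events",
        "storageLocation": "s3a://my-bucket/analytics/user_events",
        "properties": {},
    },
).raise_for_status()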

Second, install Daft v0.7.2 or later, which includes support for Gravitino:

pip install "daft>=0.7.2" requests

Basic Configuration

import daft
from daft.io import IOConfig, GravitinoConfig

# Configure Gravitino connection
gravitino_config = GravitinoConfig(
    endpoint="http://localhost:8090",
    metalake_name="my_metalake",
    auth_type="simple"
)

# Create IOConfig with Gravitino settings
io_config = IOConfig(gravitino=gravitino_config)

Reading Data from Filesets

# Read a specific file from a Gravitino fileset
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/user_events/events.parquet",
    io_config=io_config
)

# Read all Parquet files in a fileset
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
    io_config=io_config
)

# List files in a fileset
files_df = daft.from_glob_path(
    "gvfs://fileset/s3_catalog/analytics/user_events/**/*",
    io_config=io_config
)
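
These reads are lazy: Daft builds a query plan and only reads from the underlying storage when you materialize results. A few quick ways to inspect what came back, using the df from the Parquet reads above:

# Peek at the first rows (executes only as much as needed)
df.show(5)

# Inspect the inferred schema
print(df.schema())

# Count rows across all matched files
print(df.count_rows())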

Advanced Usage Examples

Working with Multiple File Formats

# Read JSON files from a fileset
json_df = daft.read_json(
    "gvfs://fileset/s3_catalog/logs/application/**/*.json",
    io_config=io_config
)

# Read CSV files with custom options
csv_df = daft.read_csv(
    "gvfs://fileset/s3_catalog/exports/daily_reports/**/*.csv",
    io_config=io_config
)

Writing Data to Filesets

# Write Parquet files to a Gravitino fileset
df = daft.from_pydict({
    "user_id": [1, 2, 3],
    "event_type": ["click", "purchase", "view"],
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"]
})

df.write_parquet(
    "gvfs://fileset/s3_catalog/analytics/processed_events/",
    io_config=io_config
)

# Write CSV files to a fileset
df.write_csv(
    "gvfs://fileset/s3_catalog/exports/daily_reports/",
    io_config=io_config
)

# Write JSON files to a fileset
df.write_json(
    "gvfs://fileset/s3_catalog/logs/application/",
    io_config=io_config
)
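
After a write, the files land under the fileset's storage location and can be read back through the same gvfs:// path, so a quick round-trip check (reusing the hypothetical fileset paths above) might look like this:

# Read back the Parquet output that was just written and spot-check it
written_df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/analytics/processed_events/**/*.parquet",
    io_config=io_config,
)
written_df.show()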

Programmatic Fileset Discovery

from daft.gravitino import GravitinoClient

# Initialize Gravitino client
client = GravitinoClient(
    endpoint="http://localhost:8090",
    metalake_name="my_metalake",
    auth_type="simple"
)

# Discover available catalogs and filesets
catalogs = client.list_catalogs()
print(f"Available catalogs: {catalogs}")

# Load fileset metadata
fileset = client.load_fileset("s3_catalog.analytics.user_events")
print(f"Storage location: {fileset.fileset_info.storage_location}")
print(f"Properties: {fileset.fileset_info.properties}")

# Use the fileset with Daft
df = daft.read_parquet(
    f"gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
    io_config=client.to_io_config()
)

Security and Credential Management

One of the key benefits of the Gravitino integration is centralized credential management. Instead of managing separate credentials for each storage system, Gravitino handles authentication and authorization:

# Gravitino manages credentials for underlying storage
# No need to configure separate S3 credentials in Daft
gravitino_config = GravitinoConfig(
    endpoint="http://localhost:8090",
    metalake_name="secure_metalake",
    auth_type="oauth2",
    token="your-oauth-token"
)

# All storage access is handled through Gravitino's security layer
io_config = IOConfig(gravitino=gravitino_config)

# Access data with unified security
df = daft.read_parquet(
    "gvfs://fileset/s3_catalog/sensitive_data/financial/**/*.parquet",
    io_config=io_config
)

Performance Considerations

The Gravitino integration is designed for optimal performance:

  • Lazy Evaluation: Daft's lazy execution works seamlessly with gvfs:// URLs
  • Predicate Pushdown: Filters are pushed down to the storage layer when possible
  • Parallel Processing: Multi-threaded I/O operations across different storage systems
  • Caching: Gravitino metadata is cached to reduce lookup overhead

# Efficient filtered reads with predicate pushdown
df = (
    daft.read_parquet("gvfs://fileset/s3_catalog/events/daily/**/*.parquet", io_config=io_config)
    .filter(daft.col("date") >= "2024-01-01")  # Pushed down to storage
    .filter(daft.col("event_type") == "click")  # Efficient columnar filtering
    .select("user_id", "timestamp", "page_url")  # Column pruning
)
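
If you want to see which of these optimizations a particular query picks up, Daft can print its query plan before anything runs. A minimal check, reusing the df defined above (the exact plan output varies by Daft version):

# Print both the unoptimized and optimized plans without executing the query
df.explain(show_all=True)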

Current Limitations and Future Roadmap

Current Status

  • Read Operations: Full support for reading files from Gravitino filesets
  • Write Operations: Support for writing Parquet, CSV, and JSON files to filesets
  • Multiple Formats: Support for Parquet, JSON, CSV, and other formats
  • S3 Storage: Full support for S3-backed filesets (including S3-compatible storage like MinIO)
  • Local Storage: Support for local file:// storage
  • Security Integration: Centralized credential management through Gravitino

Future Enhancements

  • Credential Vending: Gravitino credential vending can generate temporary credentials for clients, providing enhanced security
  • More Cloud Storages: Support for GCS and Azure Blob Storage
  • Table Catalog Integration: Support for reading and writing Iceberg and Lance table catalogs
  • Advanced Security: Fine-grained access control and audit logging
  • Performance Optimizations: Enhanced caching and metadata management

Getting Started Today

Ready to try the Daft + Gravitino integration? Here's how to get started:

  1. Set up Gravitino: Follow the Gravitino quickstart guide to set up your Gravitino server

  2. Install Daft v0.7.2 or later with Gravitino support:

   pip install "daft>=0.7.2" requests
  3. Configure your first fileset: Create a fileset in Gravitino pointing to your data

  4. Start querying: Use gvfs:// URLs to access your data with Daft

The integration between Daft and Apache Gravitino represents a significant step forward in simplifying distributed data access. By combining Daft's powerful distributed query engine with Gravitino's unified catalog management, data teams can focus on extracting insights rather than managing infrastructure complexity.

Whether you're building analytics pipelines, managing data lakes, or simply looking to simplify your data access patterns, the Daft + Gravitino integration provides the tools you need to succeed in today's distributed data landscape.


Want to learn more? Check out the Daft documentation and Apache Gravitino project for detailed guides and examples.
