Unified Data Access with Daft and Apache Gravitino: Simplifying Multi-Cloud Data Management
The modern data landscape is increasingly distributed across multiple cloud providers and storage systems. Organizations often find themselves managing data across AWS S3, Google Cloud Storage, Azure Blob Storage, and on-premises systems, each with its own access patterns, credentials, and metadata management challenges. This fragmentation complicates data discovery and access control and adds operational overhead.
Today, we're excited to introduce the integration between Daft and Apache Gravitino, bringing unified catalog management and seamless multi-cloud data access to the Daft ecosystem. This integration focuses on fileset catalog support, enabling you to access distributed datasets through a single, unified interface while maintaining security and performance.
Note: This integration is available in Daft v0.7.2 and later versions.
What is Apache Gravitino?
Apache Gravitino is an open-source data catalog that provides unified metadata management for various data sources and storage systems. It acts as a central hub for organizing and accessing data across different platforms, offering:
- Unified Metadata Management: Single source of truth for data across multiple storage systems
- Multi-Cloud Support: Native integration with AWS S3 and local storage, with more cloud providers coming soon
- Security Integration: Centralized credential management and access control
- Catalog Abstraction: Support for both table catalogs (Iceberg, Hudi, Hive, JDBC) and fileset catalogs
What is Daft?
Daft is a distributed query engine built for the Python ecosystem, designed to handle large-scale data processing with ease and efficiency. Daft brings the power of distributed computing to data scientists and engineers through a familiar DataFrame API, offering:
- Distributed Processing: Scale computations across multiple cores and machines seamlessly
- Lazy Evaluation: Optimize query execution through intelligent query planning and predicate pushdown
- Multi-Format Support: Native support for Parquet, JSON, CSV, Images, and more
- Cloud-Native: Built-in integrations with AWS S3, Google Cloud Storage, Azure Blob Storage
- Python-First: Intuitive DataFrame API that feels natural to Python developers
- Performance Optimized: Rust-powered execution engine for maximum performance
Unlike traditional big data tools that require complex cluster management, Daft provides a simple pip install experience while delivering enterprise-grade performance. Whether you're processing terabytes of data locally or across cloud infrastructure, Daft's intelligent execution engine automatically optimizes your workloads for speed and efficiency.
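To give a feel for the API, here is a minimal local example that runs with nothing but pip install daft; the column names are purely illustrative:
import daft
# Build a small in-memory DataFrame (illustrative data)
df = daft.from_pydict({
    "user_id": [1, 2, 3],
    "amount": [9.99, 24.50, 3.75]
})
# Transformations are lazy; nothing executes until .show() or .collect()
df.filter(daft.col("amount") > 5.0).select("user_id", "amount").show()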
The Power of Fileset Catalogs
While table catalogs manage structured data with schemas, fileset catalogs provide a flexible way to organize and access collections of files across different storage systems. This is particularly valuable for:
- Data Lakes: Managing raw data files, logs, and unstructured datasets
- Multi-Format Data: Handling Parquet, JSON, CSV, and other file formats in a unified way
- Distributed Storage: Accessing data across different storage systems
- Dynamic Datasets: Working with datasets that don't fit traditional table structures
Introducing GVFS
The Daft + Gravitino integration introduces a new URL scheme: gvfs:// (Gravitino Virtual File System). This provides a unified way to access files managed by Gravitino filesets, regardless of their underlying storage location (S3, ADLS, GCS, etc.).
URL Format
gvfs://fileset/catalog/schema/fileset/path/to/file
Where:
- catalog: The Gravitino catalog name
- schema: The schema within the catalog
- fileset: The specific fileset name
- path/to/file: The file path within the fileset (optional)
Example URLs
# Access a specific file
"gvfs://fileset/s3_catalog/analytics/user_events/2024/01/events.parquet"
# Access all files in a fileset
"gvfs://fileset/s3_catalog/ml_data/training_set/"
# Access partitioned data
"gvfs://fileset/s3_catalog/logs/application/year=2024/month=01/"
Getting Started
Requirements
The Daft + Gravitino integration requires:
- Python: 3.10 or later
- pip: 21.0 or later (recommended: latest version)
- Daft: v0.7.2 or later
Make sure you have the correct versions installed.
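A quick way to verify the installed versions from a Python shell:
import sys
import daft
# Check the interpreter and Daft versions before proceeding
print(sys.version_info)   # expect Python 3.10 or later
print(daft.__version__)   # expect 0.7.2 or later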
Installation and Setup
First, ensure you have a running Apache Gravitino server, then create a fileset catalog, a schema, and at least one fileset entity whose storage location points to S3. You can refer to Gravitino's online documentation for details.
Second, install Daft v0.7.2 or later with Gravitino support:
pip install "daft>=0.7.2" requests
Basic Configuration
import daft
from daft.io import IOConfig, GravitinoConfig
# Configure Gravitino connection
gravitino_config = GravitinoConfig(
endpoint="http://localhost:8090",
metalake_name="my_metalake",
auth_type="simple"
)
# Create IOConfig with Gravitino settings
io_config = IOConfig(gravitino=gravitino_config)
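Optionally, you can register this IOConfig as the session-wide default so you don't have to pass io_config to every call; a small sketch, assuming daft.set_planning_config with a default_io_config argument is available in your Daft version:
# Make this IOConfig the default for subsequent reads and writes
# (assumes daft.set_planning_config(default_io_config=...) is available)
daft.set_planning_config(default_io_config=io_config)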
Reading Data from Filesets
# Read a specific file from a Gravitino fileset
df = daft.read_parquet(
"gvfs://fileset/s3_catalog/analytics/user_events/events.parquet",
io_config=io_config
)
# Read all Parquet files in a fileset
df = daft.read_parquet(
"gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
io_config=io_config
)
# List files in a fileset
files_df = daft.from_glob_path(
"gvfs://fileset/s3_catalog/analytics/user_events/**/*",
io_config=io_config
)
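These reads are lazy, so no data is fetched until the DataFrame is materialized; for example:
# Inspect the inferred schema, then materialize a small preview
print(df.schema())
df.show(5)
# Preview the file listing returned by from_glob_path
files_df.show()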
Advanced Usage Examples
Working with Multiple File Formats
# Read JSON files from a fileset
json_df = daft.read_json(
"gvfs://fileset/s3_catalog/logs/application/**/*.json",
io_config=io_config
)
# Read CSV files with custom options
csv_df = daft.read_csv(
"gvfs://fileset/s3_catalog/exports/daily_reports/**/*.csv",
io_config=io_config
)
Writing Data to Filesets
# Write Parquet files to a Gravitino fileset
df = daft.from_pydict({
"user_id": [1, 2, 3],
"event_type": ["click", "purchase", "view"],
"timestamp": ["2024-01-01", "2024-01-02", "2024-01-03"]
})
df.write_parquet(
"gvfs://fileset/s3_catalog/analytics/processed_events/",
io_config=io_config
)
# Write CSV files to a fileset
df.write_csv(
"gvfs://fileset/s3_catalog/exports/daily_reports/",
io_config=io_config
)
# Write JSON files to a fileset
df.write_json(
"gvfs://fileset/s3_catalog/logs/application/",
io_config=io_config
)
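For partitioned layouts, Daft's partition_cols option can be combined with a gvfs:// destination; this is a sketch that assumes partitioned writes behave the same through gvfs:// as they do for direct object-store paths:
# Write Hive-style partitioned Parquet, one directory per event_type
# (assumes partition_cols works unchanged with gvfs:// destinations)
df.write_parquet(
    "gvfs://fileset/s3_catalog/analytics/processed_events/",
    partition_cols=["event_type"],
    io_config=io_config
)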
Programmatic Fileset Discovery
from daft.gravitino import GravitinoClient
# Initialize Gravitino client
client = GravitinoClient(
endpoint="http://localhost:8090",
metalake_name="my_metalake",
auth_type="simple"
)
# Discover available catalogs and filesets
catalogs = client.list_catalogs()
print(f"Available catalogs: {catalogs}")
# Load fileset metadata
fileset = client.load_fileset("s3_catalog.analytics.user_events")
print(f"Storage location: {fileset.fileset_info.storage_location}")
print(f"Properties: {fileset.fileset_info.properties}")
# Use the fileset with Daft
df = daft.read_parquet(
"gvfs://fileset/s3_catalog/analytics/user_events/**/*.parquet",
io_config=client.to_io_config()
)
Security and Credential Management
One of the key benefits of the Gravitino integration is centralized credential management. Instead of managing separate credentials for each storage system, Gravitino handles authentication and authorization:
# Gravitino manages credentials for underlying storage
# No need to configure separate S3 credentials in Daft
gravitino_config = GravitinoConfig(
endpoint="http://localhost:8090",
metalake_name="secure_metalake",
auth_type="oauth2",
token="your-oauth-token"
)
# All storage access is handled through Gravitino's security layer
io_config = IOConfig(gravitino=gravitino_config)
# Access data with unified security
df = daft.read_parquet(
"gvfs://fileset/s3_catalog/sensitive_data/financial/**/*.parquet",
io_config=io_config
)
Performance Considerations
The Gravitino integration is designed for optimal performance:
- Lazy Evaluation: Daft's lazy execution works seamlessly with gvfs:// URLs
- Predicate Pushdown: Filters are pushed down to the storage layer when possible
- Parallel Processing: Multi-threaded I/O operations across different storage systems
- Caching: Gravitino metadata is cached to reduce lookup overhead
# Efficient filtered reads with predicate pushdown
df = (
daft.read_parquet("gvfs://fileset/s3_catalog/events/daily/**/*.parquet", io_config=io_config)
.filter(daft.col("date") >= "2024-01-01") # Pushed down to storage
.filter(daft.col("event_type") == "click") # Efficient columnar filtering
.select("user_id", "timestamp", "page_url") # Column pruning
)
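To see what the optimizer actually does with this query, you can print the plan before materializing the result (the exact plan output varies between Daft versions):
# Show the logical and optimized plans (pushed-down filters, pruned columns)
df.explain(show_all=True)
# Execute the query and materialize the result
result = df.collect()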
Current Limitations and Future Roadmap
Current Status
- ✅ Read Operations: Full support for reading files from Gravitino filesets
- ✅ Write Operations: Support for writing Parquet, CSV, and JSON files to filesets
- ✅ Multiple Formats: Support for Parquet, JSON, CSV, and other formats
- ✅ S3 Storage: Full support for S3-backed filesets (including S3-compatible storage like MinIO)
- ✅ Local Storage: Support for local file:// storage
- ✅ Security Integration: Centralized credential management through Gravitino
Future Enhancements
- Credential Vending: Gravitino credential vending can generate temporary credentials for clients, providing enhanced security
- More Cloud Storages: Support for GCS and Azure Blob Storage
- Table Catalog Integration: Support for reading from and writing to Iceberg and Lance table catalogs
- Advanced Security: Fine-grained access control and audit logging
- Performance Optimizations: Enhanced caching and metadata management
Getting Started Today
Ready to try the Daft + Gravitino integration? Here's how to get started:
- Set up Gravitino: Follow the Gravitino quickstart guide to set up your Gravitino server
- Install Daft: Install Daft v0.7.2 or later with Gravitino support:
pip install "daft>=0.7.2" requests
- Configure your first fileset: Create a fileset in Gravitino pointing to your data
- Start querying: Use gvfs:// URLs to access your data with Daft
The integration between Daft and Apache Gravitino represents a significant step forward in simplifying distributed data access. By combining Daft's powerful distributed query engine with Gravitino's unified catalog management, data teams can focus on extracting insights rather than managing infrastructure complexity.
Whether you're building analytics pipelines, managing data lakes, or simply looking to simplify your data access patterns, the Daft + Gravitino integration provides the tools you need to succeed in today's distributed data landscape.
Want to learn more? Check out the Daft documentation and Apache Gravitino project for detailed guides and examples.

