Object Storage as Primary Storage: The MinIO Story

#storage #bigdata #cloudnative

Object storage (also known as object-based storage) is a data storage architecture that organizes information as individual units called "objects". Each object contains the data itself, customizable metadata, and a unique identifier that makes it easy to locate across a distributed system. Unlike file and block storage, object storage is designed to handle vast amounts of unstructured data and to scale out flexibly. Data is managed through HTTP-based RESTful APIs with simple commands such as PUT, GET, and DELETE. This architecture is particularly well suited to cloud-native applications, big data analytics, and backup/recovery scenarios.
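
To make the "data, metadata, unique identifier" idea concrete, here is a minimal in-memory sketch. The names StoredObject and FlatBucket are hypothetical and exist only for illustration; a real object store persists data and speaks HTTP. The point it tries to show is the flat namespace: a key such as "logs/2024/app.log" is just a string, and "folders" are nothing more than a prefix filter.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: payload, custom metadata, and a unique identifier."""
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatBucket:
    """Hypothetical in-memory bucket (illustrative, not a real storage API)."""

    def __init__(self):
        self._objects = {}  # flat mapping: key -> StoredObject, no directories

    def put(self, key, data, **metadata):
        obj = StoredObject(key=key, data=data, metadata=metadata)
        self._objects[key] = obj  # a PUT to an existing key replaces the object
        return obj

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = FlatBucket()
bucket.put("logs/2024/app.log", b"started", content_type="text/plain")
bucket.put("images/cat.jpg", b"\xff\xd8", owner="alice")
print(bucket.list_keys("logs/"))
print(bucket.get("images/cat.jpg").metadata)
```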

Comparison of storage types

The following table summarizes the key differences, pros, cons, and applications of each storage type:

Feature	Object Storage	File Storage	Block Storage
Data Structure	Data is stored as discrete units called "objects," each with data, metadata, and a unique ID.	Data is stored in a hierarchical structure of files and folders.	Data is stored in fixed-size blocks.
Scalability	Highly scalable, designed for handling vast amounts of unstructured data.	Suitable for smaller environments, may struggle with very large datasets.	Scales well by distributing data across multiple volumes, typically on a Storage Area Network (SAN).
Access Method	Uses HTTP-based RESTful APIs with commands like "PUT," "GET," and "DELETE".	Uses file-level protocols like NFS and SMB over TCP/IP.	Uses high-speed transport protocols like NVMe-oF, Fibre Channel, or iSCSI.
Metadata	Customizable metadata stored with each object, enabling policies, access rules, and more.	Fixed metadata such as names, dates, and file types.	Limited metadata managed by the storage system.
Performance	Designed for throughput and high scalability; can have higher latency than block storage.	Good for general file sharing and collaboration.	Designed for low-latency, high-speed data access.
Data Protection	Offers flexible data protection through immutability, versioning, erasure coding, and data replication.	Can be used for backup and disaster recovery, but may lack some advanced security features.	Typically relies on RAID for data protection.
Pros	Scalable, cost-effective, simplifies data management, ideal for unstructured data, has immutability via object lock, flexible data protection, handles large datasets, and has a flat architecture with equal access to all units.	Simple to use, good for document collaboration, suitable for smaller environments, familiar structure.	High performance, low-latency access, suitable for virtualized systems and server environments.
Cons	Can have higher latency than block storage, objects generally cannot be modified in place once created (a change means rewriting the object), cloud retrieval can be costly.	Can struggle with very large datasets, lacks the performance of block storage for high-speed applications.	Often requires a SAN, more complex to manage than file storage, can be more expensive than object storage.
Applications	Cloud-native applications, big data analytics, machine learning, archiving, backup and recovery. Examples include AWS S3, Azure Blob, Google Cloud Storage and MinIO.	Managing local files, document collaboration, backup and disaster recovery. Examples include network-attached storage (NAS).	Databases, server storage, email servers, virtualized systems. Examples include Storage Area Network (SAN).

The case of MinIO

MinIO is a software-defined, cloud-native object storage system built for high performance and scalability. It is hardware agnostic and runs on commodity servers with local disks, virtual machines, or container platforms like Kubernetes. It also supports multi-tenancy and uses erasure coding for data protection.

Key aspects of MinIO's architecture include:

Performance: MinIO is designed for high performance with a lightweight, S3-compatible REST API. It provides efficient data storage with high throughput and strict consistency for large datasets.
Scalability: MinIO scales out by adding more nodes and drives, to deployments of 100 petabytes and beyond. Each tenant’s data is isolated, which enables multi-tenancy.
Cloud-Native: Built for cloud environments, MinIO integrates with Kubernetes for automatic scaling, orchestration, and isolation. It runs as lightweight containers, optimizing resource usage and flexibility.
S3 Compatibility: MinIO is fully compatible with Amazon S3, making it easy to integrate with applications that use the S3 API, allowing users to treat MinIO just like AWS S3 for storage.
Data Protection: MinIO uses Reed-Solomon erasure coding for data durability. How many drive failures it tolerates depends on the configured parity: at the maximum setting, an erasure set can lose up to half of its drives (for example, 8 of 16) and still serve reads. It also includes encryption, versioning, WORM (Write Once, Read Many) object locking, and bitrot protection to ensure data integrity and security.
Multi-Cloud: MinIO works across different cloud environments. It offers a consistent interface, unified identity management, and consistent encryption key management across clouds, with replication support (both synchronous and asynchronous).
Metadata Management: MinIO stores metadata alongside data as objects, avoiding a separate metadata database. This simplifies the architecture and allows for consistent operations (like encryption and erasure coding) across the entire cluster.
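
The erasure coding trade-off mentioned under Data Protection comes down to simple arithmetic, sketched below. This is a generic Reed-Solomon calculation, not MinIO's implementation: the function name erasure_profile is hypothetical, the 16-drive set is an assumed example, and MinIO's own read/write quorum rules are not modeled.

```python
def erasure_profile(drives: int, parity: int) -> dict:
    """Summarize an erasure set of `drives` drives with `parity` parity shards.

    Generic Reed-Solomon arithmetic: each object is split into
    (drives - parity) data shards plus `parity` parity shards, and any
    (drives - parity) surviving shards are enough to rebuild it.
    """
    if not 0 <= parity <= drives // 2:
        raise ValueError("parity must be between 0 and half the drive count")
    data = drives - parity
    return {
        "data_shards": data,
        "parity_shards": parity,
        "drive_failures_tolerated": parity,   # losses that still allow a read
        "usable_fraction": data / drives,     # share of raw capacity holding data
    }


# Assumed example: one 16-drive erasure set at three parity levels.
for parity in (2, 4, 8):
    print(parity, erasure_profile(16, parity))
```

More parity buys more tolerated failures at the cost of usable capacity: with 16 drives, parity 4 keeps 75% of raw capacity usable, while parity 8 keeps 50%.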

How it works in theory

User/Application: A user or application sends a request (upload, download, delete) to the REST API.
REST API (Flask/FastAPI): The API layer handles HTTP requests and communicates with MinIO and the metadata store.
MinIO Server:
- Object Storage: Stores the actual files (data). When a file is uploaded, it gets stored here.
Metadata Store (SQLite/MongoDB):
- Object Metadata: Tracks metadata for each object, like filename, size, path, and timestamp.
Data Flow:
- Upload: User uploads a file → REST API communicates with MinIO → File is stored in Object Storage → Metadata is stored in Metadata Store.
- Download: User requests a file → REST API retrieves metadata from the store → MinIO fetches the file from Object Storage → File is sent back to the user.
- Delete: User deletes a file → REST API removes file from MinIO → Metadata is removed from Metadata Store.

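This architecture uses MinIO as the core object storage system, while the metadata store tracks the objects and supports retrieval and management. The metadata store here is an application-level index; it is separate from the object metadata that MinIO itself keeps alongside each object.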
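
The upload, download, and delete flows above can be sketched without any external services. Everything named here is an assumption made for illustration: FakeObjectStore stands in for the MinIO server (it is not the MinIO client API), a plain dict stands in for the SQLite/MongoDB metadata store, and StorageService plays the role of the REST API layer.

```python
import time
import uuid


class FakeObjectStore:
    """In-memory stand-in for the MinIO server (assumption, not a real client)."""

    def __init__(self):
        self._blobs = {}

    def put(self, object_id, data):
        self._blobs[object_id] = data

    def get(self, object_id):
        return self._blobs[object_id]

    def remove(self, object_id):
        self._blobs.pop(object_id, None)


class StorageService:
    """Hypothetical API-layer logic that coordinates the two stores."""

    def __init__(self, store, metadata):
        self.store = store
        self.metadata = metadata  # dict standing in for SQLite/MongoDB

    def upload(self, filename, data):
        object_id = uuid.uuid4().hex
        self.store.put(object_id, data)              # 1. bytes go to object storage
        self.metadata[object_id] = {                 # 2. then metadata is recorded
            "filename": filename,
            "size": len(data),
            "path": object_id,
            "timestamp": time.time(),
        }
        return object_id

    def download(self, object_id):
        meta = self.metadata[object_id]              # 1. look up metadata (KeyError if unknown)
        return meta, self.store.get(meta["path"])    # 2. fetch the bytes it points to

    def delete(self, object_id):
        self.store.remove(object_id)                 # 1. remove the object
        self.metadata.pop(object_id, None)           # 2. then its metadata


service = StorageService(FakeObjectStore(), {})
object_id = service.upload("report.txt", b"quarterly numbers")
meta, data = service.download(object_id)
print(meta["filename"], meta["size"], data)
service.delete(object_id)
```

Because each flow touches two systems, a failure between step 1 and step 2 leaves them out of sync (for example, an object with no metadata row). A real service would need retries or a cleanup job for that case.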

Key Benefits of MinIO

Performance: MinIO is designed to be fast, aiming to run close to hardware speeds. According to MinIO, it can saturate a 100-gigabit network on suitable hardware.
Scalability: MinIO can scale to hundreds of petabytes by adding more nodes and racks.
Simplicity: MinIO favors a simple design, with a single-layer architecture and no separate metadata database.
Security: MinIO includes data protection features such as encryption and object locking.
Cloud-Native: It integrates well with Kubernetes, making it ideal for cloud environments.
Multi-Cloud: It can operate consistently across different cloud providers.

MinIO in the cloud-native era

Cloud-Native Architecture: MinIO is built to be cloud-native. It runs in lightweight containers and integrates seamlessly with Kubernetes, the standard for container orchestration. This allows for easy deployment, scaling, and management in cloud environments. MinIO uses Kubernetes for the isolation, orchestration, and scaling of tenants.
S3 API Compatibility: MinIO implements the Amazon S3 API. This is a critical factor, as many cloud-native applications and frameworks are designed to work with S3. Applications written for S3 can usually integrate with MinIO by changing only the endpoint and credentials, without significant code changes.
Multi-Cloud Support: MinIO is designed to operate across multiple clouds. It provides a consistent storage layer, allowing applications to run on any cloud without needing to adapt to different storage APIs. This capability is crucial in the modern multi-cloud approach where organizations aim to avoid vendor lock-in and maintain operational consistency across different environments. MinIO provides a common identity, key management, and policies across clouds, providing a unified experience.
Scalability and Performance: MinIO is engineered for high performance and scalability. It is designed to be as fast as the underlying hardware, and its single-layer architecture and lack of a separate metadata database contribute to its speed. MinIO can scale to handle petabytes of data and is optimized for both small and large objects. Its performance and scalability make it suitable for the demanding requirements of AI and machine learning workloads.
Multi-Tenancy: MinIO supports multi-tenancy, which is critical for cloud-native environments where multiple users, applications, or teams share the same infrastructure. Each tenant runs on its own MinIO instance, which keeps tenants isolated from one another. Kubernetes is used for the orchestration and scaling of these tenants.
Data Protection and Security: MinIO provides built-in data protection features such as erasure coding to protect against drive failures, object locking (WORM, Write Once Read Many) for immutability, compliance, and governance, and encryption to safeguard data. These features keep data secure and resilient, a key requirement for AI/ML datasets.
No Separate Metadata Database: MinIO does not keep metadata in a separate database. Metadata is written alongside the data as part of each object, which removes a component that would otherwise have to be scaled and kept consistent.
Suitable for AI/ML Workloads: MinIO is particularly well-suited for AI and machine learning (AI/ML) workloads for several reasons:
- Performance at Scale: AI/ML workloads often involve processing massive datasets. MinIO's high performance and scalability make it an excellent choice for storing and retrieving these large datasets.
- Data Lake Foundation: MinIO can be used as the foundation for data lakes and data lakehouses, which are crucial for AI/ML projects.
- Integration with AI/ML Frameworks: The S3 API compatibility means that AI/ML tools and frameworks that are designed for S3 can seamlessly integrate with MinIO. This simplifies the deployment and management of AI/ML pipelines. MinIO supports frameworks such as Kubeflow.
- Support for Various Data Types: MinIO can store unstructured data, such as log files, images, videos, and model artifacts, which are common in AI/ML projects.
- Object-Level Granularity: MinIO's erasure coding operates at the object level, rather than at the volume level. This allows different erasure coding schemes to be applied to different objects, optimizing cluster capacity, a useful feature for managing the various types of data found in AI/ML workloads.
- Integration with Databases: MinIO is seeing increasing adoption from databases, especially those used for data analytics and AI/ML. Databases are moving towards object storage for its scalability and the ability to query the data directly.
Software-Defined: MinIO is software-defined, meaning its administration and management capabilities are abstracted from the underlying hardware.
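
The S3 compatibility point above is easiest to see from the client side. The sketch below assumes the third-party boto3 library, a MinIO server reachable at http://localhost:9000, and placeholder credentials; none of these come from the article. The helper names s3_client_kwargs and make_s3_client are hypothetical. The only difference between talking to AWS S3 and talking to MinIO is the endpoint and credentials passed to the client.

```python
def s3_client_kwargs(endpoint_url=None, access_key=None, secret_key=None):
    """Build keyword arguments for boto3.client("s3", ...).

    With no endpoint_url, boto3 talks to AWS S3; with one, it talks to
    whatever S3-compatible server lives at that address (here, MinIO).
    """
    kwargs = {}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    if access_key and secret_key:
        kwargs["aws_access_key_id"] = access_key
        kwargs["aws_secret_access_key"] = secret_key
    return kwargs


def make_s3_client(**kwargs):
    # boto3 is a third-party package; it is imported lazily so that the
    # helper above stays usable without it.
    import boto3
    return boto3.client("s3", **kwargs)


aws_kwargs = s3_client_kwargs()  # default AWS endpoint and credential chain
minio_kwargs = s3_client_kwargs(
    "http://localhost:9000",      # assumed local MinIO address
    "EXAMPLE_ACCESS_KEY",         # placeholder credentials
    "EXAMPLE_SECRET_KEY",
)
print(minio_kwargs["endpoint_url"])

# With boto3 installed and a MinIO server running at that address:
# client = make_s3_client(**minio_kwargs)
# client.put_object(Bucket="datasets", Key="train.csv", Body=b"a,b\n1,2\n")
```

The rest of the application code (put_object, get_object, and so on) stays the same for both targets, which is what lets S3-based tools and frameworks point at MinIO.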

Thought experiment: build your own object storage

Choose Storage Backend
- Local File System (easy for prototypes)
- MinIO (open-source, S3-like object storage)
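
For the local file system option, a prototype backend can be as small as the sketch below. The class name LocalBackend and its methods are hypothetical. It assumes one file per object under a root directory, named by a generated ID, and it validates IDs so that a caller cannot escape the root with a crafted path.

```python
import re
import tempfile
import uuid
from pathlib import Path


class LocalBackend:
    """Hypothetical prototype backend: one file per object under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, object_id):
        # Accept only IDs in the format this class generates (32 hex characters),
        # which rules out path traversal such as "../../etc/passwd".
        if not re.fullmatch(r"[0-9a-f]{32}", object_id):
            raise ValueError("invalid object id")
        return self.root / object_id

    def put(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex
        self._path(object_id).write_bytes(data)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._path(object_id).read_bytes()

    def delete(self, object_id: str) -> None:
        self._path(object_id).unlink(missing_ok=True)  # Python 3.8+


with tempfile.TemporaryDirectory() as tmp:
    backend = LocalBackend(tmp)
    oid = backend.put(b"hello object storage")
    roundtrip = backend.get(oid)
    backend.delete(oid)
    exists_after_delete = (Path(tmp) / oid).exists()

print(roundtrip, exists_after_delete)
```

Swapping this for MinIO later means replacing the three methods with calls to an S3-compatible client while the rest of the prototype stays the same.
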
Set Up Metadata Store - Store object info (e.g., filename, size, path) in a database:
- SQLite (simple)
- MongoDB/PostgreSQL (scalable)
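
For the SQLite option, the metadata store could look like the sketch below, using only Python's built-in sqlite3 module. The table name, column names, and function names are assumptions chosen to match the fields listed above (filename, size, path) plus an ID and a timestamp.

```python
import sqlite3
import time

# Assumed schema: one row per stored object.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects (
    object_id  TEXT PRIMARY KEY,
    filename   TEXT NOT NULL,
    size       INTEGER NOT NULL,
    path       TEXT NOT NULL,
    created_at REAL NOT NULL
)
"""


def open_metadata_store(db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows can be read by column name
    conn.execute(SCHEMA)
    return conn


def record_object(conn, object_id, filename, size, path):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO objects VALUES (?, ?, ?, ?, ?)",
            (object_id, filename, size, path, time.time()),
        )


def find_object(conn, object_id):
    row = conn.execute(
        "SELECT * FROM objects WHERE object_id = ?", (object_id,)
    ).fetchone()
    return dict(row) if row else None


def forget_object(conn, object_id):
    with conn:
        conn.execute("DELETE FROM objects WHERE object_id = ?", (object_id,))


conn = open_metadata_store()
record_object(conn, "abc123", "cat.jpg", 2048, "objects/abc123")
print(find_object(conn, "abc123")["filename"])
forget_object(conn, "abc123")
print(find_object(conn, "abc123"))
```

Parameterized queries (the ? placeholders) keep user-supplied filenames from being interpreted as SQL.
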
Build API Layer - Create endpoints for interacting with storage:
- Flask/FastAPI (Python)
- Express.js (Node.js)
- Endpoints: POST /upload (upload file), GET /download/:id (download file), DELETE /delete/:id (delete file)
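
The three endpoints can be prototyped before committing to a framework. The sketch below deliberately swaps Flask/FastAPI for a plain WSGI callable from the Python standard library, so it runs with no dependencies. The OBJECTS dict is an in-memory stand-in for both the storage backend and the metadata store, and the call helper is a hypothetical test shim that invokes the app in-process.

```python
import io
import uuid

OBJECTS = {}  # in-memory stand-in for storage backend plus metadata store


def app(environ, start_response):
    """Minimal WSGI app exposing /upload, /download/<id> and /delete/<id>."""
    method = environ["REQUEST_METHOD"]
    parts = environ.get("PATH_INFO", "").strip("/").split("/")

    if method == "POST" and parts == ["upload"]:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        object_id = uuid.uuid4().hex
        OBJECTS[object_id] = environ["wsgi.input"].read(length)
        start_response("201 Created", [("Content-Type", "text/plain")])
        return [object_id.encode()]

    if len(parts) == 2 and parts[1] in OBJECTS:
        action, object_id = parts
        if method == "GET" and action == "download":
            start_response("200 OK", [("Content-Type", "application/octet-stream")])
            return [OBJECTS[object_id]]
        if method == "DELETE" and action == "delete":
            del OBJECTS[object_id]
            start_response("204 No Content", [])
            return []

    # Unknown route, unknown ID, or wrong method for the route.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def call(method, path, body=b""):
    """Invoke the app in-process, without opening a socket."""
    seen = []
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, lambda status, headers: seen.append(status))
    return seen[0], b"".join(chunks)


status, body = call("POST", "/upload", b"hello")
new_id = body.decode()
print(status)
print(call("GET", f"/download/{new_id}"))
print(call("DELETE", f"/delete/{new_id}")[0])

# To serve it over HTTP for manual testing:
# from wsgiref.simple_server import make_server
# make_server("", 8000, app).serve_forever()
```

The same routing maps directly onto Flask or FastAPI handlers once the prototype needs request validation, streaming uploads, or authentication.
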
Handle Object Upload/Download
- Upload: Save file to storage, record metadata.
- Download: Retrieve file based on metadata.
- Delete: Remove file and metadata.
Implement Security
- JWT (authentication)
- API keys (for user access)
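
To show what the JWT and API key checks involve, here is a standard-library sketch of an HS256-signed token plus a constant-time API key comparison. The function names are hypothetical, and the claim set (sub, exp) is an assumed minimum. It is meant to illustrate the mechanism; a production service would normally rely on a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, subject: str, ttl_seconds: int = 3600) -> str:
    """Create a JWT-style token: header.payload.signature, signed with HMAC-SHA256."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": subject, "exp": int(time.time()) + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_token(secret: bytes, token: str):
    """Return the claims if signature and expiry check out, else None."""
    try:
        header, payload, signature = token.split(".")
        signing_input = f"{header}.{payload}".encode("ascii")
        # Always verify with HMAC-SHA256; the header's "alg" field is not trusted.
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except ValueError:  # wrong part count, bad base64, bad JSON, non-ASCII input
        return None
    if not isinstance(claims, dict) or claims.get("exp", 0) < time.time():
        return None
    return claims


def api_key_matches(presented: str, expected: str) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(presented.encode(), expected.encode())


SECRET = b"change-me"  # placeholder; load from configuration in practice
token = issue_token(SECRET, "alice")
print(verify_token(SECRET, token)["sub"])
print(verify_token(b"wrong-secret", token))
```

In the API layer, a check like this would run before any upload, download, or delete handler touches the storage backend.
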
Testing & Scaling
- Stress-test for performance.
- Scale using Docker or Kubernetes.