<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sasha Bonner Wodtke</title>
    <description>The latest articles on DEV Community by Sasha Bonner Wodtke (@sashawodtke).</description>
    <link>https://dev.to/sashawodtke</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1102079%2F2df5a643-6bf7-448f-a548-0408c7ca7291.png</url>
      <title>DEV Community: Sasha Bonner Wodtke</title>
      <link>https://dev.to/sashawodtke</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sashawodtke"/>
    <language>en</language>
    <item>
      <title>Replication Strategies Deep Dive</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Thu, 15 Feb 2024 18:51:07 +0000</pubDate>
      <link>https://dev.to/sashawodtke/replication-strategies-deep-dive-e85</link>
      <guid>https://dev.to/sashawodtke/replication-strategies-deep-dive-e85</guid>
      <description>&lt;p&gt;By AJ, MinIO&lt;/p&gt;

&lt;p&gt;In previous blogs we’ve talked about replication &lt;a href="https://blog.min.io/minio-replication-best-practices/"&gt;best practices&lt;/a&gt; and the different types of replication, such as &lt;a href="https://blog.min.io/batch-replication-s3-minio/"&gt;Batch&lt;/a&gt;, &lt;a href="https://blog.min.io/multi-site-replication-resync/"&gt;Site&lt;/a&gt; and &lt;a href="https://blog.min.io/active-active-replication/"&gt;Bucket&lt;/a&gt;. With all these replication types floating around, one has to wonder which strategy to use where. Do you use &lt;code&gt;mc mirror&lt;/code&gt; or Batch Replication when migrating data from an existing S3-compatible data store? When replicating between clusters, should you use Site Replication or Bucket Replication?&lt;/p&gt;

&lt;p&gt;Today we’ll demystify these different replication strategies to see which one should be used in which scenario.&lt;/p&gt;

&lt;h2&gt;Replicating from an existing source&lt;/h2&gt;

&lt;p&gt;Generally, if you already have existing data, either locally on a drive or in an existing S3-compatible store, there are two ways we recommend replicating the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch Replication: This requires an existing source that is either MinIO or another S3-compatible store such as AWS S3.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;mc mirror&lt;/code&gt;: The source could be a local directory or an NFS mount, among others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we go through the specifics let's take a look at some of the prerequisites.&lt;/p&gt;

&lt;p&gt;Create an &lt;code&gt;alias&lt;/code&gt; in &lt;code&gt;mc&lt;/code&gt; called &lt;code&gt;miniostore&lt;/code&gt; for the MinIO cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set miniostore https://MINIO-ENDPOINT:9000 ACCESS-KEY SECRET-KEY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a bucket in &lt;code&gt;miniostore&lt;/code&gt; where the data from &lt;code&gt;olderstore&lt;/code&gt; will be transferred to.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc mb miniostore/mybucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create another &lt;code&gt;alias&lt;/code&gt; in &lt;code&gt;mc&lt;/code&gt; for the existing S3-compatible store.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set olderstore https://OLDERSTORE-ENDPOINT:9000 ACCESS-KEY SECRET-KEY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In this case we will assume there is already a bucket in &lt;code&gt;olderstore&lt;/code&gt; named &lt;code&gt;mybucket&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;Batch Replication&lt;/h2&gt;

&lt;p&gt;Let's take a look at how we can use Batch Replication to migrate data from an existing S3-compatible source to a MinIO bucket.&lt;/p&gt;

&lt;p&gt;Create the YAML for the batch replication configuration&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc batch generate olderstore/ replicate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see a &lt;code&gt;replication.yaml&lt;/code&gt; file similar to the one below; the &lt;code&gt;source&lt;/code&gt; is &lt;code&gt;olderstore&lt;/code&gt; and the target is &lt;code&gt;miniostore&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replicate:

  apiVersion: v1

  # source of the objects is `olderstore` alias

  source:

    type: TYPE # valid values are "s3"

    bucket: BUCKET

    prefix: PREFIX

    # NOTE: if source is remote then target must be "local"

    # endpoint: ENDPOINT

    # credentials:

    #   accessKey: ACCESS-KEY

    #   secretKey: SECRET-KEY

    #   sessionToken: SESSION-TOKEN # Available when rotating credentials are used


  # target where the objects is `miniostore` alias

  target:

    type: TYPE # valid values are "s3"

    bucket: BUCKET

    prefix: PREFIX

    # NOTE: if target is remote then source must be "local"

    # endpoint: ENDPOINT

    # credentials:

    #   accessKey: ACCESS-KEY

    #   secretKey: SECRET-KEY

    #   sessionToken: SESSION-TOKEN # Available when rotating credentials are used


[TRUNCATED]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execute batch replication using the command below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc batch start olderstore/ ./replicate.yaml

Successfully start 'replicate' job `E24HH4nNMcgY5taynaPfxu` on '2024-02-04 14:19:06.296974771 -0400 EDT'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the replicate job ID above, in this case &lt;code&gt;E24HH4nNMcgY5taynaPfxu&lt;/code&gt;, we can find the status of the batch job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc batch status olderstore/ E24HH4nNMcgY5taynaPfxu

●∙∙

Objects:        28766

Versions:       28766

Throughput:     3.0 MiB/s

Transferred:    406 MiB

Elapsed:        2m14.227222868s

CurrObjName:    share/doc/xml-core/examples/foo.xmlcatalogs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can list all the batch jobs currently running and inspect the configuration of any one of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc batch list olderstore/


ID                      TYPE            USER            STARTED

E24HH4nNMcgY5taynaPfxu  replicate       minioadmin      1 minute ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc batch describe olderstore/ E24HH4nNMcgY5taynaPfxu


replicate:

  apiVersion: v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also &lt;a href="https://min.io/docs/minio/linux/reference/minio-mc/mc-batch.html?ref=blog.min.io"&gt;cancel and start&lt;/a&gt; the batch job if, for example, it's saturating the network and you need to resume it at a later time when traffic is at its lowest.&lt;/p&gt;
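That pause-and-resume flow can be sketched with the batch subcommands below. This is a sketch, assuming the job ID and `replicate.yaml` from the example above, and an `mc` build recent enough to include the `cancel` subcommand:

```shell
# Cancel the running batch job (job IDs come from `mc batch list olderstore/`)
mc batch cancel olderstore/ E24HH4nNMcgY5taynaPfxu

# Later, during off-peak hours, start a fresh job from the same definition
mc batch start olderstore/ ./replicate.yaml
```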

&lt;h2&gt;mc mirror&lt;/h2&gt;

&lt;p&gt;Let’s take a quick look at how &lt;code&gt;mc mirror&lt;/code&gt; would work in this case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc mirror --watch olderstore/mybucket miniostore/mybucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The above command is similar to rsync. It not only copies the data from &lt;code&gt;olderstore&lt;/code&gt; to &lt;code&gt;miniostore&lt;/code&gt;, but also watches for newer objects arriving in &lt;code&gt;olderstore&lt;/code&gt; and copies them to &lt;code&gt;miniostore&lt;/code&gt;. There are some nuances to batch jobs on versioned vs. unversioned buckets: if either the source or the target is an S3-compatible store, the batch job works just like mirror and copies only the latest version of each object; however, version IDs will not be preserved.&lt;/p&gt;

&lt;p&gt;You can compare the two buckets to see if the data has been copied over successfully.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc diff olderstore/mybucket miniostore/mybucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It's as simple as that.&lt;/p&gt;

&lt;h2&gt;Which is the better option?&lt;/h2&gt;

&lt;p&gt;Although &lt;code&gt;mc mirror&lt;/code&gt; seems simple and straightforward, we actually recommend the &lt;a href="https://min.io/docs/minio/linux/administration/batch-framework-job-replicate.html?ref=blog.min.io"&gt;Batch Replication&lt;/a&gt; method for migrating data from an existing S3-compatible store, for several reasons.&lt;/p&gt;

&lt;p&gt;Batch replication runs on the server side, while &lt;code&gt;mc mirror&lt;/code&gt; runs on the client side. This means batch replication has the full resources of the MinIO servers available to perform its jobs, whereas &lt;code&gt;mc mirror&lt;/code&gt; is bottlenecked by the client system where the command is run, so your data takes a longer route. In other words, with batch replication the traceroute would look like &lt;code&gt;olderstore -&amp;gt; miniostore&lt;/code&gt;, but with mirroring it would look like &lt;code&gt;olderstore -&amp;gt; mc mirror -&amp;gt; miniostore&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Batch jobs are one-time processes that allow fine-grained control over replication. For example, if you notice the network being saturated while replication is running, you can cancel the batch job and resume it during off-peak hours when traffic is at its lowest. In the event that some objects fail to replicate, the job will retry them multiple times so the objects eventually replicate.&lt;/p&gt;

&lt;p&gt;So does batch replication have no downsides? Not many. One concern we see in the real world is that batch replication is sometimes slow rather than instantaneous. Depending on network speed and transfer rates, you might see some slowness compared to other methods. That being said, we still recommend batch replication because it's more stable and gives us more control over how and when the data gets migrated.&lt;/p&gt;

&lt;h2&gt;Replicating to another site&lt;/h2&gt;

&lt;p&gt;Once you have data in your MinIO cluster, you will want to ensure it gets replicated to another MinIO cluster at another site for redundancy, performance and disaster recovery purposes. There are several ways to do this, but in this case let's talk about the following two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Site Replication&lt;/li&gt;
&lt;li&gt;Bucket Replication&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Site Replication&lt;/h2&gt;

&lt;p&gt;Once data is in a MinIO object store cluster, several different possibilities open up for replicating and managing your data.&lt;/p&gt;

&lt;p&gt;The first step is to set up three identical MinIO clusters, named minio1, minio2 and minio3, respectively. We will assume minio1 already has the data migrated to it using batch replication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc alias set minio1 http:// minioadmin minioadmin

mc alias set minio2 http:// minioadmin minioadmin

mc alias set minio3 http:// minioadmin minioadmin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable site replication across all 3 sites&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc admin replicate add minio1 minio2 minio3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify that site replication is set up properly across the 3 sites&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc admin replicate info minio1


SiteReplication enabled for:


Deployment ID                        | Site Name     | Endpoint

f96a6675-ddc3-4c6e-907d-edccd9eae7a4 | minio1        | http://

0dfce53f-e85b-48d0-91de-4d7564d5456f | minio2        | http://

8527896f-0d4b-48fe-bddc-a3203dccd75f | minio3        | http://
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the current replication status using the following command&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc admin replicate status minio1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once site replication is enabled, data will automatically start to replicate between all the sites. Depending on the amount of data to transfer and the network and disk speeds, it might take anywhere from a couple of hours to a few days for the objects to be synchronized across the sites.&lt;/p&gt;

&lt;p&gt;If it's taking longer than expected, or you still don’t see everything replicated over, you can run the &lt;code&gt;resync&lt;/code&gt; command as below&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc admin replicate resync start minio1 minio2 minio3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;status&lt;/code&gt; can be checked using the following command&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc admin replicate resync status minio1 minio2 minio3&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Eventually all the data will be replicated to the &lt;code&gt;minio2&lt;/code&gt; and &lt;code&gt;minio3&lt;/code&gt; sites.&lt;/p&gt;

&lt;h2&gt;Bucket Replication&lt;/h2&gt;

&lt;p&gt;Bucket replication, as the name suggests, sets up replication on a particular bucket in MinIO based on ARN.&lt;/p&gt;

&lt;p&gt;Set up the following two MinIO aliases&lt;/p&gt;

&lt;p&gt;Source:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set minio1 https://MINIO1-ENDPOINT:9000 ACCESS-KEY SECRET-KEY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Destination:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set minio2 https://MINIO2-ENDPOINT:9000 ACCESS-KEY SECRET-KEY&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once both aliases are set, on the &lt;code&gt;minio2&lt;/code&gt; side create a replication user &lt;code&gt;repluser&lt;/code&gt; and attach a user policy to it; the actions listed in the policy below are the minimal permissions required for replication.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc admin user add minio2 repluser repluserpwd&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Set the minimum policy required for &lt;code&gt;repluser&lt;/code&gt; to run the replication operations&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat &amp;gt; replicationPolicy.json &amp;lt;&amp;lt; EOF

{

 "Version": "2012-10-17",

 "Statement": [

  {

   "Effect": "Allow",

   "Action": [

    "s3:GetBucketVersioning"

   ],

   "Resource": [

    "arn:aws:s3:::destbucket"

   ]

  },

  {

   "Effect": "Allow",

   "Action": [

    "s3:ReplicateTags",

    "s3:GetObject",

    "s3:GetObjectVersion",

    "s3:GetObjectVersionTagging",

    "s3:PutObject",

    "s3:ReplicateObject"

   ],

   "Resource": [

    "arn:aws:s3:::destbucket/*"

   ]

  }

 ]

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach the above &lt;code&gt;replpolicy&lt;/code&gt; to &lt;code&gt;repluser&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ mc admin policy add minio2 replpolicy ./replicationPolicy.json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ mc admin policy set minio2 replpolicy user=repluser&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is where it gets interesting. Now that you have the replication user (&lt;code&gt;repluser&lt;/code&gt;) and the replication policy (&lt;code&gt;replpolicy&lt;/code&gt;) created on the &lt;code&gt;minio2&lt;/code&gt; cluster, you need to set the bucket replication target on &lt;code&gt;minio1&lt;/code&gt;. This doesn’t start bucket replication yet; it only sets it up for later, when we actually start the process.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ mc replicate add minio1/srcbucket https://repluser:repluserpwd@replica-endpoint:9000/destbucket --service "replication" --region "us-east-1"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Replication ARN = 'arn:minio:replication:us-east-1:28285312-2dec-4982-b14d-c24e99d472e6:destbucket'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Finally - this is where the rubber meets the road - let's start the replication process.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ mc replicate add minio1/srcbucket --remote-bucket https://repluser:repluserpwd@replica-endpoint:9000/destbucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Any objects uploaded to the source bucket that meet replication criteria will now be automatically replicated by the MinIO server to the remote destination bucket. Replication can be disabled at any time by disabling specific rules in the configuration or deleting the replication configuration entirely.&lt;/p&gt;
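To confirm the rule is in place and watch it work, a quick sketch using the replication subcommands (assuming the `minio1/srcbucket` source from above):

```shell
# List the replication rules configured on the source bucket
mc replicate ls minio1/srcbucket

# Show replication status: queued, failed and replicated counts per target
mc replicate status minio1/srcbucket
```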

&lt;h2&gt;Which is the better option?&lt;/h2&gt;

&lt;p&gt;So why can’t we use site-to-site replication for everything, and why do we need Batch Replication? Batch Replication provides more control over the replication process. Think of site replication as a firehose: when you start it for the first time, it has the potential to use all the available network bandwidth, to the point where no other applications can use the network. On the other hand, while batch replication might sometimes be slow, it will not disrupt your existing network during the initial data transfer. Bucket replication is generally useful when you want to replicate just a handful of buckets and not the entire cluster.&lt;/p&gt;

&lt;p&gt;Okay great, then what about site replication? Batch replication is not ideal for continuous replication, because once the batch job ends it won’t replicate any new objects. You would have to keep re-running the batch job at certain intervals to ensure the delta gets replicated to the &lt;code&gt;minio2&lt;/code&gt; site. Site replication, on the other hand, allows data to be replicated both from &lt;code&gt;minio1&lt;/code&gt; to &lt;code&gt;minio2&lt;/code&gt; and vice versa, if you have an active-active replication setup.&lt;/p&gt;

&lt;p&gt;It is not possible to have both bucket and site replication enabled at the same time; you have to pick one or the other. So, unless you want to replicate only certain buckets or certain objects in a particular bucket, we highly recommend going with site replication: it will replicate not only existing buckets and objects but also any new buckets and objects that are created. Moreover, without too much configuration you can set up replication in a distributed manner, where you have &lt;code&gt;minio1&lt;/code&gt; in North America and &lt;code&gt;minio2&lt;/code&gt; in Africa, so the MENA (Middle East and North Africa) region adds data to &lt;code&gt;minio2&lt;/code&gt; and the North America region adds data to &lt;code&gt;minio1&lt;/code&gt;, and they replicate to each other.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;In this post we went deeper into the Bucket, Batch and Site replication types. While there is no set rule for choosing a particular replication strategy, our engineers behind &lt;a href="https://blog.min.io/subnet-series-communication/"&gt;SUBNET&lt;/a&gt;, after working with countless cluster setups - migrating them, expanding them, and thinking through disaster recovery scenarios - have come up with the above replication strategies, which should help most folks out there thinking of migrating their data to MinIO.&lt;/p&gt;

&lt;p&gt;If you have any questions on replication or best practices, be sure to reach out to us on &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt;, or better yet, sign up for &lt;a href="https://blog.min.io/subnet-series-communication/"&gt;SUBNET&lt;/a&gt; and we can get you going.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>ai</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Hungry GPUs Need Fast Object Storage</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Mon, 05 Feb 2024 19:48:58 +0000</pubDate>
      <link>https://dev.to/sashawodtke/hungry-gpus-need-fast-object-storage-57gh</link>
      <guid>https://dev.to/sashawodtke/hungry-gpus-need-fast-object-storage-57gh</guid>
      <description>&lt;p&gt;By Keith Pijanowski, AI/ML SME, MinIO&lt;/p&gt;

&lt;p&gt;A chain is only as strong as its weakest link - and your AI/ML infrastructure is only as fast as your slowest component. If you train machine learning models with GPUs, then your weak link may be your storage solution. The result is what I call the “Starving GPU Problem.” The Starving GPU Problem occurs when your network or your storage solution cannot serve training data to your training logic fast enough to fully utilize your GPUs. The symptoms are fairly obvious. If you are monitoring your GPUs, then you will notice that they never get close to being fully utilized. If you have instrumented your training code, then you will notice that total training time is dominated by IO.&lt;/p&gt;

&lt;p&gt;Unfortunately, there is bad news for those who are wrestling with this issue. Let’s look at some advances being made with GPUs to understand how this problem is only going to get worse in the coming years.&lt;/p&gt;

&lt;h2&gt;GPUs Are Getting Faster&lt;/h2&gt;

&lt;p&gt;GPUs are getting faster. Not only is raw performance getting better, but memory and bandwidth are also increasing. Let’s take a look at these three characteristics of Nvidia’s most recent GPUs: the &lt;a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-nvidia-us-1521051-r2-web.pdf?ref=blog.min.io"&gt;A100&lt;/a&gt;, the &lt;a href="https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet?ref=blog.min.io"&gt;H100&lt;/a&gt; and the &lt;a href="https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446?ref=blog.min.io"&gt;H200&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ofot1a7qyxvqa3yxxxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ofot1a7qyxvqa3yxxxl.png" alt="Image description" width="800" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Note: the table above uses the statistics that align with a PCIe (Peripheral Component Interconnect Express) socket solution for the A100 and the SXM (Server PCI Express Module) socket solution for the H100 and the H200. SXM statistics do not exist for the A100. With respect to performance, the Floating Point 16 Tensor Core statistic is used for the comparison.)&lt;/p&gt;

&lt;p&gt;A few observations on the statistics above are worth calling out. First, the H100 and the H200 have the same performance (1,979 TFLOPS), which is 3.17 times greater than the A100. The H100 has twice as much memory as the A100, and memory bandwidth increased by a similar amount - which makes sense; otherwise, the GPU would starve itself. The H200 can handle a whopping 141GB of memory, and its memory bandwidth also increased proportionally with respect to the other GPUs.&lt;/p&gt;

&lt;p&gt;Let’s look at each of these statistics in more detail and discuss what it means to machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt; - A teraflop (TFLOP) is one trillion (10^12) floating-point operations per second. That is a 1 with 12 zeros after it (1,000,000,000,000). It is hard to equate TFLOPs to IO demand in gigabytes, as the floating-point operations that occur during model training involve simple tensor math as well as first derivatives against the loss function (a.k.a. gradients). However, relative comparisons are possible. Looking at the statistics above, we see that the H100 and the H200, which both perform at 1,979 TFLOPS, are roughly 3 times faster than the A100 - potentially consuming data 3 times faster, if everything else can keep up.&lt;/p&gt;
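As a quick sanity check on the 3x claim, here is the arithmetic behind the 3.17x figure. It assumes an A100 FP16 Tensor Core figure of 624 TFLOPS (with sparsity) - a value implied by the article's ratio rather than stated in it, so verify against Nvidia's datasheets:

```python
# FP16 Tensor Core throughput in TFLOPS
a100_tflops = 624     # A100 PCIe, with sparsity (assumed from the 3.17x ratio)
h100_tflops = 1_979   # H100/H200 SXM, quoted in the article

# Relative speedup of the newer chips over the A100
speedup = h100_tflops / a100_tflops
print(f"H100/H200 over A100: {speedup:.2f}x")  # 3.17x
```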

&lt;p&gt;&lt;strong&gt;GPU Memory&lt;/strong&gt; - Also known as Video RAM or Graphics RAM. GPU memory is separate from the system's main memory (RAM) and is specifically designed to handle the intensive processing tasks performed by the graphics card. GPU memory dictates batch size when training models. In the past, batch size was decreased when moving training logic from a CPU to a GPU. However, as GPU memory catches up with CPU memory in terms of capacity, the batch size used for GPU training will increase. When performance and memory capacity increase at the same time, the result is larger requests, with each gigabyte of training data processed faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt; - Think of GPU memory bandwidth as the "highway" that connects the memory and computation cores. It determines how much data can be transferred per unit of time. Just like a wider highway allows more cars to pass in a given amount of time, a higher memory bandwidth allows more data to be moved between memory and the GPU. As you can see, the designers of these GPUs increased the memory bandwidth for each new version proportional to memory; therefore, the internal data bus of the chip will not be the bottleneck.&lt;/p&gt;

&lt;h2&gt;A Look into the Future&lt;/h2&gt;

&lt;p&gt;In August 2023, Nvidia &lt;a href="https://nvidianews.nvidia.com/news/gh200-grace-hopper-superchip-with-hbm3e-memory?ref=blog.min.io"&gt;announced&lt;/a&gt; its next-generation platform for accelerated computing and generative AI - The GH200 Grace Hopper Superchip Platform. The new platform uses the &lt;a href="https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip?ref=blog.min.io"&gt;Grace Hopper Superchip&lt;/a&gt;, which can be connected with additional Superchips by &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/?ref=blog.min.io"&gt;NVIDIA NVLink&lt;/a&gt;, allowing them to work together during model training and inference.&lt;/p&gt;

&lt;p&gt;While all the specifications on the Grace Hopper Superchip represent an improvement over previous chips, the most important innovation for AI/ML engineers is its unified memory. Grace Hopper gives the GPU full access to the CPU’s memory. This is important because, in the past, engineers wishing to use GPUs for training had to first pull data into system memory and then from there, move the data to the GPU memory. Grace Hopper eliminates the need to use the CPU’s memory as a bounce buffer to get data to the GPU.&lt;/p&gt;

&lt;p&gt;The simple comparison of a few key GPU statistics as well as the capabilities of Grace Hopper, has got to be a little scary to anyone responsible for upgrading GPUs and making sure everything else can keep up. A storage solution will absolutely need to serve data at a faster rate to keep up with these GPU improvements. Let’s look at a common solution to the hungry GPU problem.&lt;/p&gt;

&lt;h2&gt;A Common Solution&lt;/h2&gt;

&lt;p&gt;There is a common and obvious solution to this problem that does not require organizations to replace or upgrade their existing storage solution. You can keep your existing storage solution intact so that you can take advantage of all the enterprise features your organization requires. This storage solution is most likely a Data Lake that holds all of your organization’s unstructured data - therefore, it may be quite large, and the total cost of ownership is a consideration. It also has a lot of features enabled for redundancy, reliability and security, all of which impact performance.&lt;/p&gt;

&lt;p&gt;What can be done, however, is to set up a storage solution that is in the same data center as your compute infrastructure - ideally, this would be in the same cluster as your compute. Make sure you have a high-speed network with the best storage devices available. From there, copy only the data needed for ML training. &lt;/p&gt;

&lt;p&gt;Amazon’s recently announced &lt;a href="https://aws.amazon.com/s3/storage-classes/express-one-zone/?ref=blog.min.io"&gt;Amazon S3 Express One Zone&lt;/a&gt; exemplifies this approach. It is a bucket type optimized for high throughput and low latency and is confined to a single Availability Zone (no replication). Amazon’s intention is for customers to use it to hold a copy of data that requires high-speed access. Consequently, it is purpose-built for model training. According to Amazon, it provides 10x the data access speed of S3 Standard at 8x the cost. Read more about our assessment of Amazon S3 Express One Zone &lt;a href="https://blog.min.io/recent-launch-of-amazon-s3-express-one-zone-validates-that-object-storage-is-primary-storage-for-ai/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The MinIO Solution&lt;/h2&gt;

&lt;p&gt;The common solution I outlined above required AWS to customize its S3 storage solution by offering specialty buckets at an increased cost. Additionally, some organizations (that are not MinIO customers) are buying specialized storage solutions that do the simple things I described above. Unfortunately, this adds complexity to an existing infrastructure since a new product is needed to solve a relatively simple problem. &lt;/p&gt;

&lt;p&gt;The irony to all this is that MinIO customers have always had this option. You can do exactly what I described above with a new installation of MinIO on a high-speed network with NVMe drives. MinIO is a software-defined storage solution - the same product runs on bare metal or the cluster of your choice using a variety of storage devices. If your corporate Data Lake uses MinIO on bare metal with HDDs and it is working fine for all of your non-ML data - then there is no reason to replace it. However, if the datasets used for ML require faster IO because you are using GPUs, then consider the approach I outlined in this post. Be sure to make a copy of your ML data for use in your high-speed instance of MinIO - a gold copy should always exist in a hardened installation of MinIO. This will allow you to turn off features like replication and encryption in your high-speed instance of MinIO, further increasing performance. Copying data is easy using MinIO’s &lt;a href="https://min.io/docs/minio/linux/reference/minio-mc/mc-mirror.html?ref=blog.min.io"&gt;mirroring&lt;/a&gt; feature.&lt;/p&gt;
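That copy step is a one-liner with `mc mirror`. A sketch, using hypothetical aliases `datalake` (the hardened gold copy) and `fastminio` (the high-speed NVMe instance):

```shell
# Copy only the ML training data into the high-speed instance;
# --overwrite refreshes objects that already exist at the destination
mc mirror --overwrite datalake/ml-training-data fastminio/ml-training-data
```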

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cv8do3o9f277snszbsk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cv8do3o9f277snszbsk.png" alt="Image description" width="399" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MinIO is capable of the performance needed to feed your hungry GPUs - a &lt;a href="https://blog.min.io/nvme_benchmark/"&gt;recent benchmark&lt;/a&gt; achieved 325 GiB/s on GETs and 165 GiB/s on PUTs with just 32 nodes of off-the-shelf NVMe SSDs.&lt;/p&gt;

&lt;p&gt;Download &lt;a href="https://min.io/download?ref=blog.min.io"&gt;MinIO&lt;/a&gt; today and learn just how easy it is to build a data lakehouse. If you have any questions be sure to reach out to us on &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>objectstorage</category>
      <category>devops</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Understanding True Costs - Hardware and Software for 10PB</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Wed, 24 Jan 2024 22:12:16 +0000</pubDate>
      <link>https://dev.to/sashawodtke/understanding-true-costs-hardware-and-software-for-10pb-3k3j</link>
      <guid>https://dev.to/sashawodtke/understanding-true-costs-hardware-and-software-for-10pb-3k3j</guid>
      <description>&lt;p&gt;By Dudley Nostrand, Head of Global Value Engineering, MinIO &lt;/p&gt;

&lt;p&gt;We had a conversation with the CIO of a major bank the other day. They are one of the global systemically important banks - the biggest of the big. The CIO had decided to bring in MinIO as the object store for a data analytics initiative. This deployment collects data from mortgage, transactional and news platforms to run Spark and other analytical tools to drive insights for the Bank. The implementation that MinIO was replacing was a proprietary platform. The switch to MinIO was motivated by technical glitches and inflated costs of the proprietary solution.&lt;/p&gt;

&lt;p&gt;Things were going swimmingly until we got a call from our CIO friend. His incumbent vendor (they have both HW and SW) had persuaded the procurement team that MinIO was 4x as expensive as the incumbent’s object storage software solution. &lt;/p&gt;

&lt;p&gt;This was a curious development because the incumbent vendor doesn’t have object storage software - they have an appliance. It took a few moments for us to realize what was going on, and less than an hour to get it sorted out. This post talks about what the vendor did, how that obscured the value assignment, and how it came back to bite them in the end. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In doing so, we are going to tell you what you can expect to pay for 10PB of storage - both from a software and hardware perspective.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Feel free to use these numbers on us and on your hardware vendor of choice. If they won’t hit the numbers, ping us and we will tell you who will - and they will honor that price (assuming the same config, yada yada). &lt;/p&gt;

&lt;p&gt;Before we bury the lede: your software costs for 10PB of storage from MinIO with the Standard plan will be $667K per year. Your total hardware costs for 10PB of NVMe storage from a leader in the commodity hardware space will be about $1.28M. This is a full-featured, all-flash system that can handle anything from AI/ML to archival. &lt;/p&gt;

&lt;p&gt;So, back to the story. &lt;/p&gt;

&lt;p&gt;The CIO explained what the incumbent vendor had presented to him, and it went like this: the vendor’s “new” software-defined object storage solution isn’t ready for primetime (it has been in “beta” for a few years now), but the vendor wasn’t about to let another hardware vendor into the account. It is a $250M+ per year account for them. &lt;/p&gt;

&lt;p&gt;What they did was present MinIO on their HW platform as one offering, with the HW at full freight. The combined price was MinIO at $667K and HW at $2.1M+. Separately, they offered their legacy (read “old”) object storage solution with their HW as a bundle, and in this case they dropped the HW price considerably. The effect was appliance-style pricing that attributed no value to their software (which tells you what they think of their soon-to-be-EOL software).&lt;/p&gt;

&lt;p&gt;The goal of their proposal was to make MinIO’s software look very expensive to the bank, thereby leading them to the appliance solution. After we explained what had happened, the CIO went back to procurement and asked for the pricing of the hardware without any software, just as they might do with any commodity hardware provider. In a “shocking” development, the procurement team learned that the incumbent’s hardware costs were 4x what a commodity vendor was willing to provide them - in writing! &lt;/p&gt;

&lt;p&gt;This is the power of unbundling the cost of hardware and software. &lt;strong&gt;Always demand the unbundled costs from your vendors.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;When they offer to “separate” them out - don’t take the bait. Understand what each component offers. When they claim, “but our software is optimized for our hardware” don’t buy that either - the HW salesperson will gladly throw their software brethren under the bus as soon as you tell them you are selecting another hardware vendor. The response will be immediate, “of course, our hardware works great with x software, in fact these three Fortune 500 companies run it….” &lt;/p&gt;

&lt;p&gt;Trust us here. If they can unbundle, and the other option is losing the deal, you will get the truth. &lt;/p&gt;

&lt;p&gt;The reason appliance vendors like to “separate out” the costs is that it is easy to manipulate where the value lies. The incumbent in this case actually has nice HW. They just have bad incentives when it comes to doing the right thing for the customer. They “subsidize” the HW with mediocre to poor software. Generally, the customer’s procurement team doesn’t see the difference. They see a bill of materials for object storage software and accompanying HW, and calculate what they think to be the “all in per TB” price.  &lt;/p&gt;

&lt;p&gt;In this case, when all was said and done, the incumbent found itself in an awkward conversation about their HW pricing. If the software added no value (as they demonstrated by pricing it at almost zero dollars), then the HW was, in turn, fairly expensive - specifically, 4x the commodity price of that same type of hardware. More importantly, that was the price the bank had been paying for years. So much for investing in the relationship. &lt;/p&gt;

&lt;h2&gt;
  
  
  The True Cost: 10PB of Software and Hardware
&lt;/h2&gt;

&lt;p&gt;We will keep it simple. &lt;/p&gt;

&lt;p&gt;If you are buying 10PB of usable capacity from MinIO on a multi-year deal you should expect to pay $66.7K per PB per year from us. You might be able to do better depending on the length of the term. &lt;/p&gt;

&lt;p&gt;If you are buying 10PB of usable capacity from one of the leading commodity HW players you would expect to pay around $1.28M ($128K per PB).  &lt;/p&gt;

&lt;p&gt;That means your all-in costs for HW and SW would be around $1.95M in year 1, and $2.7M for the subsequent 4 years, for a total of $4.6M for 5 years.&lt;/p&gt;
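&lt;p&gt;The arithmetic above is easy to sanity-check. Here is a quick sketch using only the list prices quoted in this post (your negotiated rates may differ):&lt;/p&gt;

```python
# Illustrative 5-year TCO math, using the list prices quoted in this post.
SOFTWARE_PER_PB_PER_YEAR = 66_700   # MinIO Standard plan, USD
HARDWARE_PER_PB = 128_000           # one-time commodity NVMe hardware, USD
CAPACITY_PB = 10
YEARS = 5

software_per_year = SOFTWARE_PER_PB_PER_YEAR * CAPACITY_PB   # $667K
hardware_total = HARDWARE_PER_PB * CAPACITY_PB               # $1.28M

year_one = software_per_year + hardware_total    # hardware is paid up front
subsequent_years = software_per_year * (YEARS - 1)
total = year_one + subsequent_years

print(f"Year 1:       ${year_one:,}")            # $1,947,000
print(f"Years 2-5:    ${subsequent_years:,}")    # $2,668,000
print(f"5-year total: ${total:,}")               # $4,615,000
```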

&lt;p&gt;This gets you best of breed. &lt;/p&gt;

&lt;p&gt;The alternative? The appliance, with software that will be EOL in year two of the five, at a cost of $5.4M - $5.8M for 5 years. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The appliance model is designed to obscure value. It almost always shields one of the components, the hardware or the software. The argument that it has a lower TCO simply doesn’t hold water if it isn’t cloud-native and doesn’t work with the rest of the cloud-native ecosystem. Given the legacy of many of these companies, it is the software that they need to obscure, because in their DNA they are HW vendors, not software developers. &lt;/p&gt;

&lt;p&gt;So, ALWAYS get your unbundled, not separated-out, pricing. If you do that you will end up with better pricing, better software and better hardware, and a better overall solution for your needs. You don’t have to pick us - we mean that sincerely, but we hope you follow our advice. Feel free to reach out to us at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; to have that conversation.&lt;/p&gt;

</description>
      <category>software</category>
      <category>opensource</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Never Say Die: Persistent Data with a CDC MinIO Sink for CockroachDB</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Wed, 24 Jan 2024 01:37:40 +0000</pubDate>
      <link>https://dev.to/sashawodtke/never-say-die-persistent-data-with-a-cdc-minio-sink-for-cockroachdb-568d</link>
      <guid>https://dev.to/sashawodtke/never-say-die-persistent-data-with-a-cdc-minio-sink-for-cockroachdb-568d</guid>
      <description>&lt;p&gt;By Brenna Buuck, Databases and Datalakes SME, MinIO &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/?ref=blog.min.io"&gt;CockroachDB&lt;/a&gt; scurries onto the database scene as a resilient and scalable distributed SQL database. Drawing inspiration from the tenacity of its insect namesake, CockroachDB boasts high availability even in the face of hardware failures. Its distributed architecture spans multiple nodes, mirroring the adaptability of its insect counterpart.&lt;/p&gt;

&lt;p&gt;With strong consistency and ACID transaction support, CockroachDB becomes a reliable choice for applications requiring data accuracy and reliability, thriving in dynamic environments and effortlessly managing the complexities of distributed data.&lt;/p&gt;

&lt;p&gt;This blog introduces using MinIO as a changefeed sink for CockroachDB. By doing so, you not only benefit from CockroachDB's strengths but also leverage the &lt;a href="https://blog.min.io/hdd-durability-erasure-coding/"&gt;durability&lt;/a&gt;, &lt;a href="https://min.io/product/scalable-object-storage?ref=blog.min.io"&gt;scalability&lt;/a&gt;, and &lt;a href="https://blog.min.io/introducing-speedtest-for-minio/"&gt;performance&lt;/a&gt; of MinIO. Use this as a guide to establishing an enterprise-grade CDC strategy. It is inspired by &lt;a href="https://gist.github.com/dbist/ec38494e3a886258679b30d1909705b4?ref=blog.min.io"&gt;this&lt;/a&gt; awesome repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CDC?
&lt;/h2&gt;

&lt;p&gt;CDC is a database management technique that tracks and captures changes in a relational database like CockroachDB. It acts like a monitor, detecting INSERTs, UPDATEs and DELETEs in real time.&lt;/p&gt;

&lt;p&gt;CDC's strength lies in its ability to identify only the altered data, making it more efficient in terms of network bandwidth and expense than traditional methods of replication. This efficiency is crucial for tasks like data integration, real-time analytics, and maintaining consistency across distributed systems. CDC is truly a prerequisite for real-time connected data fabrics and remains a fundamental tool for keeping databases synchronized and maintaining reliability in dynamic data environments. &lt;/p&gt;

&lt;p&gt;This continuous stream of real-time data updates provides a rich foundation for training and optimizing machine learning models which require huge volumes of up-to-date data for success. &lt;/p&gt;

&lt;h2&gt;
  
  
  Changefeed Sinks with CockroachDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/changefeed-sinks?ref=blog.min.io"&gt;Changefeed sinks&lt;/a&gt; in CockroachDB are like data pipelines that efficiently funnel CRUD operations happening in the database to an external destination. In this instance, MinIO serves as one such destination. When configured as a sink, MinIO becomes the repository for the continuous stream of changes, offering a durable and scalable storage solution for CDC operations.&lt;/p&gt;

&lt;p&gt;One benefit of this approach is that you can use CDC data in your MinIO bucket to replicate your data strategy across clouds. For example, if your CockroachDB is hosted in AWS but you need your data on-prem for your models to run, this approach makes that possible. It can be an effective way to implement a multi-cloud data strategy. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow this guide, make sure you have Docker Compose installed. You can install Docker Engine and Docker Compose separately or together using Docker Desktop. The easiest way is to go for Docker Desktop.&lt;/p&gt;

&lt;p&gt;Check if Docker Compose is installed by running this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You will need an Enterprise license for self-hosted CockroachDB. For this local deployment, please note that CockroachDB on ARM for macOS is experimental and not yet ready for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;To get started clone or download the project folder from &lt;a href="https://github.com/minio/blog-assets/tree/main/cockroach-db?ref=blog.min.io"&gt;this&lt;/a&gt; location. &lt;/p&gt;

&lt;p&gt;Open a terminal window, navigate to the project folder and run the following command: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command instructs Docker Compose to read the configuration from the &lt;code&gt;docker-compose.yml&lt;/code&gt; file, create and start the services defined in it, and run them in the background, allowing you to use the terminal for other tasks without being tied to the container's console output. &lt;/p&gt;

&lt;p&gt;After running the command you should be able to see the containers are up and running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdsbfnr5qnojmi00qvmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdsbfnr5qnojmi00qvmt.png" alt="Image description" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can access the Cockroach UI at &lt;code&gt;http://127.0.0.1:8080&lt;/code&gt;. Verify that your node is live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy267vij888czp4b9hj4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy267vij888czp4b9hj4y.png" alt="Image description" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can access the MinIO UI at &lt;code&gt;http://127.0.0.1:9001&lt;/code&gt;. Log in with the username and password combination of &lt;code&gt;minioadmin:minioadmin&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl4anbz2poskmhu4ttpo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl4anbz2poskmhu4ttpo.png" alt="Image description" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you log in, you should be able to verify that the mc command in the &lt;code&gt;docker-compose.yml&lt;/code&gt; was executed and a bucket named &lt;code&gt;cockroach&lt;/code&gt; was automatically created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b6w9rtt72r8u6gaokk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5b6w9rtt72r8u6gaokk.png" alt="Image description" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Commands
&lt;/h2&gt;

&lt;p&gt;Once your containers are up and running, you’re now ready to run SQL commands. Run the following command in a terminal window in the same folder where you downloaded the tutorial files and started your containers. This command executes an interactive SQL shell inside the &lt;code&gt;crdb-1&lt;/code&gt; container, connecting to the CockroachDB instance running in that container, and allowing you to enter SQL queries.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker exec -it crdb-1 ./cockroach sql --insecure&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see the following if you executed the shell correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
# Server version: CockroachDB CCL v19.2.2 (x86_64-unknown-linux-gnu, built 2019/12/11 01:33:43, go1.12.12) (same version as client)
# Cluster ID: 0a668a2d-056d-4203-a996-217ca6169f80
#
# Enter \? for a brief introduction.
#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As noted in the prerequisites, you need an Enterprise CockroachDB account to set up CDC. Use the same terminal window to enter the next commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET CLUSTER SETTING cluster.organization = '&amp;lt;organization name&amp;gt;';

SET CLUSTER SETTING enterprise.license = '&amp;lt;secret&amp;gt;';

SET CLUSTER SETTING kv.rangefeed.enabled = true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Crushing CDC
&lt;/h2&gt;

&lt;p&gt;You can now build your database and tables. The SQL below creates a new database named &lt;code&gt;ml_data&lt;/code&gt;, switches the current database context to &lt;code&gt;ml_data&lt;/code&gt;, creates a table named &lt;code&gt;model_performance&lt;/code&gt; to store information about machine learning models, and inserts two rows of data into this table representing the performance metrics of specific models. Execute these commands in the same terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATABASE ml_data;

SET DATABASE = ml_data;

CREATE TABLE model_performance (
     model_id INT PRIMARY KEY,
     model_name STRING,
     accuracy FLOAT,
     training_time INT);

INSERT INTO model_performance VALUES
   (1, 'NeuralNetworkV1', 0.85, 120),
   (2, 'RandomForestV2', 0.92, 150);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, run the following command to create a changefeed for the &lt;code&gt;model_performance&lt;/code&gt; table in CockroachDB and configure it to stream updates to MinIO.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE CHANGEFEED FOR TABLE model_performance INTO 'experimental-s3://cockroach?AWS_ACCESS_KEY_ID=minioadmin&amp;amp;AWS_SECRET_ACCESS_KEY=minioadmin&amp;amp;AWS_ENDPOINT=http://minio:9000&amp;amp;AWS_REGION=us-east-1' with updated, resolved='10s';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the Cockroach UI at &lt;code&gt;http://127.0.0.1:8080&lt;/code&gt; to verify that your changefeed has been successfully created and a high-water timestamp has been established.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cgroc9u1a0im9mj8a86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cgroc9u1a0im9mj8a86.png" alt="Image description" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make a change to your data to see the changefeed in action.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE model_performance SET model_name = 'ResNet50' WHERE model_id = 1;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Run the following query to confirm that the update was applied.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM model_performance ;
  model_id |   model_name   | accuracy | training_time  
+----------+----------------+----------+---------------+
         1 | ResNet50       |     0.88 |           130  
         2 | RandomForestV2 |     0.92 |           150
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a production environment, your transactions would populate MinIO without the need for this step. &lt;/p&gt;

&lt;p&gt;Navigate back to the MinIO UI at &lt;code&gt;http://127.0.0.1:9001&lt;/code&gt; to see your changefeed in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvkhks2idlv0lhk37fc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvkhks2idlv0lhk37fc2.png" alt="Image description" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  End of the Line
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you’ve gone through the process of creating a changefeed sink with MinIO to enable a CDC strategy for your enterprise-licensed CockroachDB. &lt;/p&gt;

&lt;p&gt;This CDC strategy sets the stage for a &lt;a href="https://www.smithsonianmag.com/smart-news/cockroaches-are-becoming-increasingly-resistant-pesticides-180972567/?ref=blog.min.io"&gt;resilient&lt;/a&gt; and continuously synchronized data fabric for your most critical data. When you absolutely need to have a perfect replica of your data for data exploration, analytics or AI applications the combination of CockroachDB and MinIO is your winning strategy. &lt;/p&gt;

&lt;p&gt;Evolve what has been outlined here and you will continue to support better decision-making, facilitate AI/ML endeavors, and maintain the reliability of your dynamic data environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://min.io/docs/minio/linux/reference/minio-mc.html?ref=blog.min.io"&gt;Download&lt;/a&gt; MinIO today for all your cockroach-related needs. MinIO - CockroachDBs check-in, but they don't check out. Reach out to us if you have any questions or trouble with bugs at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; or on &lt;a href="https://minio.slack.com/?ref=blog.min.io"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I want to thank the folks at CockroachDB – big shoutout to John Billingsley – for helping us with this post.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Everything You Need to Know to Repatriate from AWS S3 to MinIO</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Mon, 22 Jan 2024 20:49:07 +0000</pubDate>
      <link>https://dev.to/sashawodtke/everything-you-need-to-know-to-repatriate-from-aws-s3-to-minio-213c</link>
      <guid>https://dev.to/sashawodtke/everything-you-need-to-know-to-repatriate-from-aws-s3-to-minio-213c</guid>
      <description>&lt;p&gt;By Matt Sarrel, Director of Technical Marketing, MinIO &lt;/p&gt;

&lt;p&gt;The response to our previous post, &lt;a href="https://blog.min.io/repatriate-s3-minio/"&gt;How to Repatriate From AWS S3 to MinIO&lt;/a&gt;, was extraordinary - we’ve fielded dozens of calls from enterprises asking us for repatriation advice. We have aggregated those responses into this new post, where we dig a little deeper into the costs and savings associated with repatriation to make it easier for you to put together your own analysis. Data migration is a daunting task for many. In practice, organizations point new data at MinIO and take their time migrating old data from the cloud, or leave it in place and simply stop growing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repatriation Overview
&lt;/h2&gt;

&lt;p&gt;To repatriate data from AWS S3, you will follow these general guidelines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Review Data Requirements:&lt;/strong&gt; Determine the specific buckets and objects that need to be repatriated from AWS S3. Make sure you understand business needs and compliance requirements on a bucket-by-bucket basis.&lt;br&gt;
&lt;strong&gt;2. Identify Repatriation Destination:&lt;/strong&gt; You’ve already decided to repatriate to MinIO, now you can choose to run MinIO in an on-premises data center or at another cloud provider or colocation facility. Using the requirements from #1, you will select hardware or instances for forecasted storage, transfer and availability needs.&lt;br&gt;
&lt;strong&gt;3. Data Transfer:&lt;/strong&gt; Plan and execute the transfer of data from AWS S3 to MinIO. Simply use MinIO's built-in Batch Replication or mirror using the MinIO Client (see &lt;a href="https://blog.min.io/repatriate-s3-minio/"&gt;How to Repatriate From AWS S3 to MinIO&lt;/a&gt; for details). There are several additional methods you can use for data transfer, such as using AWS DataSync, AWS Snowball or &lt;a href="https://www.tdsynnex.com/na/us/data-migration-service/?ref=blog.min.io"&gt;TD SYNNEX data migration&lt;/a&gt;, or directly using AWS APIs. &lt;br&gt;
&lt;strong&gt;4. Data Access and Permissions:&lt;/strong&gt; Ensure that appropriate access controls and permissions are set up for the repatriated data on a per-bucket basis. This includes IAM and bucket policies for managing user access, authentication, and authorization to ensure the security of the data.&lt;br&gt;
&lt;strong&gt;5. Object Locks:&lt;/strong&gt; It is critical to preserve the object lock retention and legal hold policies after the migration. The target object store has to interpret the rules in the same way as Amazon S3. If you are unsure, ask for the &lt;a href="https://min.io/cohasset?ref=blog.min.io"&gt;Cohasset Associates Compliance Assessment&lt;/a&gt; on the target object store implementation.&lt;br&gt;
&lt;strong&gt;6. Data Lifecycle Management:&lt;/strong&gt; Define and implement a data lifecycle management strategy for the repatriated data. This includes defining retention policies, backup and recovery procedures, and data archiving practices on a per-bucket basis.&lt;br&gt;
&lt;strong&gt;7. Data Validation:&lt;/strong&gt; Validate the transferred data to ensure its integrity and completeness. Perform the necessary checks and tests to confirm that the data was transferred without corruption or loss: after the transfer, the object names, ETags, metadata, checksums and object counts should all match between source and destination. &lt;br&gt;
&lt;strong&gt;8. Update Applications and Workflows:&lt;/strong&gt; The good news is that if you follow cloud-native principles to build your applications, then all you will have to do is reconfigure them for the new MinIO endpoint. However, if your applications and workflows were designed to work with the AWS ecosystem, make the necessary updates to accommodate the repatriated data. This may involve updating configurations, reconfiguring integrations or in some cases modifying code.&lt;br&gt;
&lt;strong&gt;9. Monitor and Optimize:&lt;/strong&gt; Continuously monitor and optimize the repatriated data environment to ensure optimal performance, cost-efficiency, and adherence to data management best practices.&lt;/p&gt;
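&lt;p&gt;Step 7 above is easy to automate once you have an inventory from each side (for example, from Amazon S3 Inventory on the source and &lt;code&gt;mc ls&lt;/code&gt; on the destination). A minimal sketch that diffs two key-to-ETag mappings - the bucket contents shown are hypothetical:&lt;/p&gt;

```python
def diff_inventories(source, destination):
    """Compare two {object_key: etag} inventories and report discrepancies.

    Caveat: multipart-uploaded objects can carry different ETags on
    different stores even when the bytes match, so treat an ETag mismatch
    as a prompt for a checksum comparison, not proof of corruption.
    """
    missing = sorted(k for k in source if k not in destination)
    extra = sorted(k for k in destination if k not in source)
    mismatched = sorted(
        k for k in source if k in destination and source[k] != destination[k]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

# Hypothetical inventories gathered from each side of the migration.
src = {"data/a.parquet": "etag-1", "data/b.parquet": "etag-2"}
dst = {"data/a.parquet": "etag-1", "data/b.parquet": "etag-9"}

report = diff_inventories(src, dst)
print(report)   # {'missing': [], 'extra': [], 'mismatched': ['data/b.parquet']}
```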

&lt;h2&gt;
  
  
  Repatriation Steps
&lt;/h2&gt;

&lt;p&gt;There are many factors to consider when budgeting and planning for cloud repatriation. Fortunately, our engineers have done this with many customers and we’ve developed a detailed plan for you. We have customers that have repatriated everything from a handful of workloads to hundreds of petabytes. &lt;/p&gt;

&lt;p&gt;The biggest planning task is to think through choices around networking, leased bandwidth, server hardware, archiving costs for the data not selected to be repatriated, and the human cost of managing and maintaining your own cloud infrastructure. Estimate these costs and plan for them. Cloud repatriation costs will include data egress fees for moving the data from the cloud back to the data center. These fees are intentionally high enough to compel cloud lock-in. Take note of these high egress fees - they substantiate the economic argument to leave the public cloud because, as the amount of data you manage grows, the egress fees increase. Therefore, if you’re going to repatriate, it pays to take action sooner rather than later. &lt;/p&gt;

&lt;p&gt;We’re going to focus on data and metadata that must be moved – this is eighty percent of the work required to repatriate. Metadata includes bucket properties and policies (access management based on access/secret key, lifecycle management, encryption, anonymous public access, object locking and versioning).&lt;/p&gt;

&lt;p&gt;Let’s focus on data (objects) for now. For each namespace you want to migrate, take inventory of the buckets and objects you want to move. It is likely that your DevOps team already knows which buckets hold important current data. You can also use &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html?ref=blog.min.io"&gt;Amazon S3 Inventory&lt;/a&gt;. At a high level, this will look something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kw1mvxfljgw83871xrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kw1mvxfljgw83871xrh.png" alt="Image description" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to list, by namespace, each bucket and its properties for every bucket you’re going to migrate. Note the application(s) that store and read data in that bucket. Based on usage, classify each bucket as hot, warm or cold tier data.&lt;/p&gt;

&lt;p&gt;In an abridged version, this will look something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcuasw7yu20vvh4yrcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygcuasw7yu20vvh4yrcd.png" alt="Image description" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have some decisions to make about data lifecycle management at this point, and pay close attention, because here’s a great way to save money on AWS fees. Categorize objects in each bucket as hot, warm or cold based on how frequently they are accessed. &lt;strong&gt;A great place to save money is to migrate cold tier buckets directly to S3 Glacier – there’s no reason to incur egress fees to download data just to upload it again.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depending on the amount of data you’re repatriating, you have a few options to choose how to migrate. We recommend that you load and work with new data on the new MinIO cluster while copying hot and warm data to the new cluster over time. The amount of time and bandwidth needed to copy objects will, of course, depend on the number and size of the objects you’re copying.&lt;/p&gt;

&lt;p&gt;Here’s where it will be very helpful to calculate the total data that you’re going to repatriate from AWS S3. Look at your inventory and total the size of all the buckets that are classified as hot and warm.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qk9oshjmeejwu6dd05m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qk9oshjmeejwu6dd05m.png" alt="Image description" width="800" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Calculate data egress fees based on the above total. I’m using &lt;a href="https://aws.amazon.com/ec2/pricing/on-demand/?ref=blog.min.io"&gt;list price&lt;/a&gt;, but your organization may qualify for a discount from AWS. I’m also using 10 Gbps as the connection bandwidth, but you may have more or less at your disposal. Finally, I’m working from the assumption that one-third of S3 data will merely be shifted to S3 Glacier Deep Archive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eiqbyl2ty0dcjn4i810.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0eiqbyl2ty0dcjn4i810.png" alt="Image description" width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t forget to budget for S3 Glacier Deep Archive usage moving forward.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy1botd0bf225s7wrxoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy1botd0bf225s7wrxoh.png" alt="Image description" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, the above calculation includes neither the fee for per-object operations ($0.40 per million) nor the cost of LISTing ($5 per million). For very large repatriation projects, we can also compress objects before sending them across the network, saving you some of the egress fees.  &lt;/p&gt;

&lt;p&gt;Another option is to use AWS Snowball to transfer objects. Snowball devices are each 80TB, so we know up front that we need 20 of them for our repatriation effort. The per-device fee includes 10 days of use, plus 2 days for shipping. Additional days are available for $30/device.&lt;/p&gt;
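
&lt;p&gt;A rough sizing sketch for the Snowball route follows; the per-job fee and the per-day overrun fee are assumptions, so check current AWS Snow Family pricing for your region:&lt;/p&gt;

```python
import math

# Rough Snowball sizing. Fees below are assumptions for illustration.
total_tb = 1600        # data to move, with some headroom
device_tb = 80         # capacity per Snowball device
per_device_fee = 300   # assumed per-job fee, includes 10 days of use
extra_day_fee = 30     # assumed fee per device per additional day
extra_days = 5         # assumed overrun past the included 10 days

devices = math.ceil(total_tb / device_tb)
cost = devices * (per_device_fee + extra_days * extra_day_fee)
print(f"{devices} devices, estimated ${cost:,}")  # 20 devices, estimated $9,000
```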

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvct6u6u3n1ohpkuea02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvct6u6u3n1ohpkuea02.png" alt="Image description" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS will charge you standard request, storage, and data transfer rates to read from and write to AWS services including &lt;a href="https://aws.amazon.com/s3/pricing/?ref=blog.min.io"&gt;Amazon S3&lt;/a&gt; and &lt;a href="https://aws.amazon.com/kms/pricing/?ref=blog.min.io"&gt;AWS Key Management Service (KMS)&lt;/a&gt;. There are further considerations when working with &lt;a href="https://docs.aws.amazon.com/datasync/latest/userguide/create-s3-location.html?ref=blog.min.io#using-storage-classes"&gt;Amazon S3 storage classes&lt;/a&gt;. For S3 export jobs, data transferred to your Snow Family device from S3 is billed at standard S3 charges for operations such as LIST, GET, and others. You are also charged standard rates for Amazon CloudWatch Logs, Amazon CloudWatch Metrics, and Amazon CloudWatch Events. &lt;/p&gt;

&lt;p&gt;Now we know both how long it will take to migrate this massive amount of data and how much it will cost. Make a business decision as to which method meets your needs based on the combination of timing and fees.&lt;/p&gt;

&lt;p&gt;At this point, we also know the requirements for the hardware needed to run MinIO on-prem or at a colocation facility. Take the requirement above for 1.5PB of storage, estimate data growth, and consult our &lt;a href="https://min.io/product/reference-hardware?ref=blog.min.io"&gt;Recommended Hardware &amp;amp; Configuration&lt;/a&gt; page and &lt;a href="https://blog.min.io/selecting-hardware-for-minio-deployment/"&gt;Selecting the Best Hardware for Your MinIO Deployment&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The first step is to recreate your S3 buckets in MinIO. You’re going to have to do this regardless of how you choose to migrate objects. While both S3 and MinIO store objects using server-side encryption, you don’t have to worry about migrating encryption keys. You can connect to your KMS of choice using &lt;a href="https://blog.min.io/s3-security-access-control/"&gt;MinIO KES to manage encryption keys&lt;/a&gt;. This way, new keys will be automatically generated for you as encrypted tenants and buckets are created in MinIO.  &lt;/p&gt;

&lt;p&gt;You have multiple options to copy objects: Batch Replication and &lt;code&gt;mc mirror&lt;/code&gt;. My previous blog post, &lt;a href="https://blog.min.io/repatriate-s3-minio/"&gt;How to Repatriate From AWS S3 to MinIO&lt;/a&gt;, included detailed instructions for both methods. You can copy objects directly from S3 to on-prem MinIO, or use a temporary MinIO cluster running on EC2 to query S3 and then mirror to on-prem MinIO. &lt;/p&gt;

&lt;p&gt;Typically, customers use tools we wrote combined with AWS Snowball or TD SYNNEX’s data migration hardware and services to move larger amounts of data (over 1 PB). &lt;/p&gt;

&lt;p&gt;MinIO recently partnered with Western Digital and TD SYNNEX to field a Snowball alternative. Customers can schedule windows to take delivery of the Western Digital hardware and pay for what they need during the rental period. More importantly, the service is not tied to a specific cloud - meaning the business can use the service to move data into, out of, and across clouds - all using the ubiquitous S3 protocol. Additional details on the service can be found on the &lt;a href="https://www.tdsynnex.com/na/us/data-migration-service/?ref=blog.min.io"&gt;Data Migration Service&lt;/a&gt; page on the TD SYNNEX site. &lt;/p&gt;

&lt;p&gt;Bucket metadata, including policies and bucket properties, can be read using &lt;code&gt;get-bucket&lt;/code&gt; &lt;a href="https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/index.html?ref=blog.min.io"&gt;S3 API calls&lt;/a&gt; and then set up in MinIO. When you sign up for MinIO SUBNET, our engineers will work with you to migrate these settings from AWS S3: access management based on access key/secret key, lifecycle management policies, encryption, anonymous public access, immutability and versioning. One note about versioning: AWS version IDs aren’t usually preserved when data is migrated because each version ID is an internal UUID. This is largely not a problem for customers because objects are typically called by name. However, if AWS version ID is required, then we have an extension that will preserve it in MinIO and we’ll help you enable it.&lt;/p&gt;
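
&lt;p&gt;As a sketch of pulling those settings programmatically, the helper below is hypothetical but relies only on standard S3 API calls (&lt;code&gt;get-bucket-policy&lt;/code&gt;, &lt;code&gt;get-bucket-versioning&lt;/code&gt;) through an injected boto3-style client:&lt;/p&gt;

```python
# Hypothetical helper: read bucket settings from S3 so they can be
# recreated in MinIO. The client is injected, so the same code runs
# against AWS S3 or any S3-compatible endpoint.
def collect_bucket_metadata(s3, bucket):
    metadata = {"bucket": bucket}
    try:
        metadata["policy"] = s3.get_bucket_policy(Bucket=bucket)["Policy"]
    except Exception:
        metadata["policy"] = None  # no bucket policy attached
    versioning = s3.get_bucket_versioning(Bucket=bucket)
    metadata["versioning"] = versioning.get("Status", "Disabled")
    return metadata
```

&lt;p&gt;With boto3 this would be called as &lt;code&gt;collect_bucket_metadata(boto3.client("s3"), "bucket1")&lt;/code&gt; for each bucket in your inventory.&lt;/p&gt;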

&lt;p&gt;Pay particular attention to &lt;a href="https://blog.min.io/secure-hybrid-cloud-minio-iam/"&gt;IAM and bucket policies&lt;/a&gt;. S3 isn’t going to be the only part of AWS’s infrastructure that you leave behind. You will have a lot of service accounts for applications to use when accessing S3 buckets. This would be a good time to list and audit all of your service accounts. Then you can decide whether or not to recreate them in your identity provider. If you choose to automate, then use Amazon Cognito to share IAM information with external OpenID Connect IDPs and AD/LDAP. &lt;/p&gt;

&lt;p&gt;Pay particular attention to Data Lifecycle Management, such as object retention, object locking and archive/tiering. Run a &lt;code&gt;get-bucket-lifecycle-configuration&lt;/code&gt; on each bucket to obtain a human-readable JSON list of lifecycle rules. You can easily recreate AWS S3 settings using MinIO Console or MinIO Client (mc). Use commands such as &lt;code&gt;get-object-legal-hold&lt;/code&gt; and &lt;code&gt;get-object-lock-configuration&lt;/code&gt; to pinpoint objects that require special security and governance treatment.   &lt;/p&gt;
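
&lt;p&gt;A minimal sketch of digesting that JSON, where the helper name and output shape are illustrative but the input follows the structure &lt;code&gt;get-bucket-lifecycle-configuration&lt;/code&gt; returns:&lt;/p&gt;

```python
# Illustrative helper: flatten lifecycle rules into a short summary
# that can guide recreating the same rules in MinIO (e.g. via mc ilm).
def summarize_lifecycle(config):
    summary = []
    for rule in config.get("Rules", []):
        entry = {"id": rule.get("ID"), "status": rule.get("Status")}
        if "Expiration" in rule:
            entry["expire_days"] = rule["Expiration"].get("Days")
        for transition in rule.get("Transitions", []):
            entry.setdefault("transitions", []).append(
                (transition.get("StorageClass"), transition.get("Days"))
            )
        summary.append(entry)
    return summary
```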

&lt;p&gt;While we’re on the subject of lifecycle, let’s talk about backup and disaster recovery for a moment. Do you want an additional MinIO cluster to replicate to, for backup and disaster recovery?  &lt;/p&gt;

&lt;p&gt;After objects are copied from AWS S3 to MinIO, it’s important to validate data integrity. The easiest way to do this is to use the MinIO Client to run &lt;code&gt;mc diff&lt;/code&gt; against old buckets in S3 and new buckets on MinIO. This will compute the difference between the buckets and return a list of only those objects that are missing or different. This command takes the arguments of the source and target buckets. For your convenience, you may want to create &lt;a href="https://min.io/docs/minio/linux/reference/minio-mc/mc-alias-set.html?ref=blog.min.io"&gt;aliases&lt;/a&gt; for S3 and MinIO so you don’t have to keep typing out full addresses and credentials. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc diff s3/bucket1 minio/bucket1&lt;/code&gt;&lt;/p&gt;
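
&lt;p&gt;If you would rather script the comparison, a rough, simplified equivalent looks like the following. It ignores pagination and checks only keys and sizes, and the helper itself is illustrative rather than a replacement for &lt;code&gt;mc diff&lt;/code&gt;:&lt;/p&gt;

```python
# Illustrative, simplified diff between two S3-compatible endpoints.
# Clients are injected (e.g. boto3 clients with different endpoint_url
# values). Real listings need pagination; this sketch omits it.
def diff_buckets(src_client, dst_client, bucket):
    src = {o["Key"]: o["Size"]
           for o in src_client.list_objects(Bucket=bucket).get("Contents", [])}
    dst = {o["Key"]: o["Size"]
           for o in dst_client.list_objects(Bucket=bucket).get("Contents", [])}
    missing = sorted(k for k in src if k not in dst)
    differing = sorted(k for k in src if k in dst and src[k] != dst[k])
    return missing, differing
```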

&lt;p&gt;The great news is that all you have to do is point existing apps at the new MinIO endpoint. Configurations can be rewritten app by app over a period of time. Migrating data in object storage is less disruptive than migrating a filesystem: just change the URL to read/write from the new cluster. Note that if you previously relied on AWS services to support your applications, those won’t be present in your data center, so you’ll have to replace them with their open-source equivalents and rewrite some code. For example, Athena can be replaced with Spark SQL, Apache Hive and Presto, Kinesis with Apache Kafka, and AWS Glue with Apache Airflow. &lt;/p&gt;

&lt;p&gt;If your S3 migration is part of a larger effort to move an entire application on-prem, then chances are you used &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html?ref=blog.min.io"&gt;S3 event notifications&lt;/a&gt; to call downstream services when new data arrived. If this is the case, then do not fear - MinIO supports &lt;a href="https://blog.min.io/event-notifications-vs-object-lambda/"&gt;event notification&lt;/a&gt; as well. The most straightforward migration here would be to implement a custom webhook to receive the notification. However, if you need a destination that is more durable and resilient, then use messaging services such as Kafka or RabbitMQ. We also support sending events to databases such as PostgreSQL and MySQL.&lt;/p&gt;
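
&lt;p&gt;The receiving side of such a webhook mostly amounts to parsing the event body. The helper below is a sketch; the field names follow the S3-style event records that MinIO emits:&lt;/p&gt;

```python
import json

# Sketch: extract the interesting fields from a bucket-notification
# payload delivered to a custom webhook endpoint.
def parse_event(body):
    event = json.loads(body)
    results = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        results.append({
            "event": record.get("eventName"),
            "bucket": s3_info.get("bucket", {}).get("name"),
            "key": s3_info.get("object", {}).get("key"),
        })
    return results
```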

&lt;p&gt;Now that you’ve completed repatriating, it’s time to turn your attention to storage operation, monitoring and optimization. The good news is that no optimization is needed for MinIO – we’ve built optimization right into the software so you know you’re getting the best performance for your hardware. You’ll want to start monitoring your new MinIO cluster to assess resource utilization and performance on an ongoing basis. MinIO exposes &lt;a href="https://min.io/docs/minio/kubernetes/upstream/operations/monitoring/metrics-and-alerts.html?ref=blog.min.io"&gt;metrics&lt;/a&gt; via a Prometheus endpoint that you can consume in your &lt;a href="https://min.io/product/integrations?ref=blog.min.io"&gt;monitoring and alerting platform of choice&lt;/a&gt;. For more on monitoring, please see &lt;a href="https://blog.min.io/multi-cloud-monitoring-alerting-prometheus-and-grafana/"&gt;Multi-Cloud Monitoring and Alerting with Prometheus and Grafana&lt;/a&gt; and &lt;a href="https://blog.min.io/opentelemetry-flask-prometheus-metrics/"&gt;Metrics with MinIO using OpenTelemetry, Flask, and Prometheus&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;With &lt;a href="https://blog.min.io/tag/subnet/"&gt;SUBNET&lt;/a&gt;, we have your back when it comes to &lt;a href="https://blog.min.io/minio-day-2-administration-ops/"&gt;Day 2 operations&lt;/a&gt; with MinIO. Subscribers gain access to built-in automated troubleshooting tools to keep their clusters running smoothly. They also get unlimited, direct-to-engineer support in real-time via our support portal. We also help you future-proof your object storage investment with an annual architecture review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrate and Save
&lt;/h2&gt;

&lt;p&gt;It’s far from a secret that the days of writing blank checks to cloud providers are gone. Many businesses are currently evaluating their cloud spend to find potential savings. Now you have everything you need to start your migration from AWS S3 to MinIO, including concrete technical steps and a financial framework.&lt;/p&gt;

&lt;p&gt;If you get excited about the prospect of repatriation cost savings, then please reach out to us at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>cloud</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LanceDB: Your Trusted Steed in the Joust Against Data Complexity</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Thu, 18 Jan 2024 21:43:11 +0000</pubDate>
      <link>https://dev.to/sashawodtke/lancedb-your-trusted-steed-in-the-joust-against-data-complexity-p65</link>
      <guid>https://dev.to/sashawodtke/lancedb-your-trusted-steed-in-the-joust-against-data-complexity-p65</guid>
      <description>&lt;p&gt;By &lt;a href="https://www.linkedin.com/in/brennabuuck/"&gt;Brenna Buuck&lt;/a&gt;, Databases and Datalakes SME, MinIO &lt;/p&gt;

&lt;p&gt;Built on Lance, an open-source columnar data format, LanceDB has some interesting features that make it attractive for AI/ML. For example, LanceDB supports explicit and implicit vectorization with the ability to handle various data types. LanceDB is integrated with leading ML frameworks such as &lt;a href="https://blog.min.io/optimizing-ai-model-serving/"&gt;PyTorch&lt;/a&gt; and &lt;a href="https://blog.min.io/hyper-scale-machine-learning-with-minio-and-tensorflow/"&gt;TensorFlow&lt;/a&gt;. Cooler still is LanceDB’s fast neighbor search which enables efficient retrieval of similar vectors using approximate nearest neighbor algorithms. All of these combine to create a vector database that is fast, easy to use and so lightweight it can be deployed anywhere.&lt;/p&gt;

&lt;p&gt;LanceDB is capable of querying data in S3-compatible object storage. This combination is optimal for building high-performance, scalable, and cloud-native ML data storage and retrieval systems. MinIO brings performance and unparalleled flexibility across diverse hardware, locations, and cloud environments to the equation, making it the natural choice for such deployments.&lt;/p&gt;

&lt;p&gt;Upon completion of this tutorial, you will be prepared to use LanceDB and MinIO to joust with any data challenge. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Lance?
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://lancedb.github.io/lance/format.html?ref=blog.min.io"&gt;Lance&lt;/a&gt; file format is a columnar data format optimized for ML workflows and datasets. It is designed to be easy and fast to version, query, and use for training, and is suitable for various data types, including images, videos, 3D point clouds, audio, and tabular data. Additionally, it supports high-performance random access, with &lt;a href="https://blog.lancedb.com/benchmarking-random-access-in-lance-ed690757a826?ref=blog.min.io"&gt;Lance reporting benchmarks&lt;/a&gt; up to 100 times faster than Parquet for random-access queries. Lance’s speed is in part the result of its implementation in Rust and of its cloud-native design, which includes features like zero-copy versioning and optimized vector operations.&lt;/p&gt;

&lt;p&gt;One of its &lt;a href="https://lancedb.github.io/lance/?ref=blog.min.io"&gt;key features&lt;/a&gt; is the ability to perform vector search, allowing users to find &lt;a href="https://suhasjs.github.io/files/diskann_neurips19.pdf?ref=blog.min.io"&gt;nearest neighbors&lt;/a&gt; in under 1 millisecond and combine OLAP queries with vector search. Other production applications for the Lance format include edge-deployed low-latency vector databases for ML applications, large-scale storage, retrieval, and processing of multi-modal data in self-driving car companies, and billion-scale+ vector personalized search in e-commerce companies. Part of the appeal of the Lance file format is its compatibility with popular tools and platforms, such as Pandas, &lt;a href="https://blog.min.io/duckdb-and-minio-for-a-modern-data-stack/"&gt;DuckDB&lt;/a&gt;, Polars, and &lt;a href="https://blog.min.io/building-performant-data-infrastructure-with-apache-arrow-and-minio/"&gt;Pyarrow&lt;/a&gt;. Even if you don’t use LanceDB, you can still leverage the Lance file format in your data stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built for AI and Machine Learning
&lt;/h2&gt;

&lt;p&gt;Vector databases like LanceDB offer distinct advantages for AI and machine learning applications, thanks to their efficient &lt;a href="https://blog.min.io/three-applications-to-start-your-disaggregation-journey/"&gt;decoupled storage&lt;/a&gt; and compute architectures and retrieval of high-dimensional vector representations of data. Here are some key use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural Language Processing (NLP):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic Search:&lt;/strong&gt; Find documents or passages similar to a query based on meaning, not just keywords. This powers chatbot responses, personalized content recommendations, and knowledge retrieval systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question Answering:&lt;/strong&gt; Understand and answer complex questions by finding relevant text passages based on semantic similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic Modeling:&lt;/strong&gt; Discover latent topics in large text collections, useful for document clustering and trend analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer Vision:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image and Video Retrieval:&lt;/strong&gt; Search for similar images or videos based on visual content, crucial for content-based image retrieval, product search, and video analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Detection and Classification:&lt;/strong&gt; Improve the accuracy of object detection and classification models by efficiently retrieving similar training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video Recommendation:&lt;/strong&gt; Recommend similar videos based on the visual content of previously watched videos&lt;/p&gt;

&lt;p&gt;Among the plethora of vector databases on the market, LanceDB is particularly well suited for AI and machine learning because it supports querying on S3-compatible storage. Your data is everywhere; your database should be everywhere too. &lt;/p&gt;

&lt;h2&gt;
  
  
  Architecting for Success
&lt;/h2&gt;

&lt;p&gt;Using MinIO with LanceDB offers several benefits, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/pulse/why-use-minio-s3-over-database-saving-file-kunal-dhanda/?ref=blog.min.io"&gt;Scalability and Performance:&lt;/a&gt; MinIO’s cloud-native design is built for scale and high-performance storage and retrieval. By leveraging MinIO's scalability and performance, LanceDB can efficiently handle large amounts of data, making it well-suited for modern ML workloads.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.min.io/achieving-data-strategy/"&gt;High Availability and Fault Tolerance:&lt;/a&gt; MinIO is highly available, immutable, and highly durable. This ensures that data stored in MinIO is protected against hardware failures and provides high availability and fault tolerance, which are crucial for data-intensive applications like LanceDB.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://min.io/product/active-data-replication-for-object-storage?ref=blog.min.io"&gt;Active-active replication:&lt;/a&gt; Multi-site, active-active replication enables near-synchronous replication of data between multiple MinIO deployments. This robust process ensures high durability and redundancy, making it ideal for shielding data in mission-critical production environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of MinIO and LanceDB provides a high-performance scalable cloud-native solution for managing and analyzing large-scale ML datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;To follow along with this tutorial, you will need to use &lt;a href="https://docs.docker.com/compose/gettingstarted/?ref=blog.min.io#:~:text=You%20need%20to%20have%20Docker,Docker%20Engine%20and%20Docker%20Compose"&gt;Docker Compose&lt;/a&gt;. You can install the Docker Engine and Docker Compose binaries separately or together using Docker Desktop. The simplest option is to install Docker Desktop.&lt;/p&gt;

&lt;p&gt;Ensure that Docker Compose is installed by running the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker compose version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You will also need to install Python. You can download Python from &lt;a href="https://www.python.org/downloads/?ref=blog.min.io"&gt;here&lt;/a&gt;. During installation, make sure to check the option to add Python to your system's PATH.&lt;/p&gt;

&lt;p&gt;Optionally, you can choose to create a Virtual Environment. It's good practice to create a virtual environment to isolate dependencies. To do so, open a terminal and run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m venv venv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To Activate the virtual environment:&lt;/p&gt;

&lt;p&gt;On Windows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.\venv\Scripts\activate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;On macOS/Linux:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;source venv/bin/activate&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Begin by cloning the project from &lt;a href="https://github.com/minio/blog-assets/pull/10/files?ref=blog.min.io"&gt;here&lt;/a&gt;. Once done, navigate to the folder where you downloaded the files in a terminal window and run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose up minio&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will start up the MinIO container. You can navigate to &lt;a href="http://127.0.0.1:9001"&gt;http://127.0.0.1:9001&lt;/a&gt; to take a look at the MinIO console. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJYgyiXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07942k4zx3feugnu9zqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJYgyiXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07942k4zx3feugnu9zqj.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with the username and password &lt;code&gt;minioadmin:minioadmin&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, run the following command to create a MinIO bucket called &lt;code&gt;lance&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker compose up mc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6EJVFzuw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nqhbjvhg09qsple4txrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6EJVFzuw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nqhbjvhg09qsple4txrd.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command performs a series of &lt;a href="https://min.io/docs/minio/linux/reference/minio-mc.html?ref=blog.min.io"&gt;MinIO Client&lt;/a&gt; (mc) commands within a shell. &lt;/p&gt;

&lt;p&gt;Here's a breakdown of each command:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;until (/usr/bin/mc config host add minio &lt;a href="http://minio:9000"&gt;http://minio:9000&lt;/a&gt; minioadmin minioadmin) do echo '...waiting...' &amp;amp;&amp;amp; sleep 1; done;:&lt;/strong&gt; This command repeatedly attempts to configure a MinIO host named &lt;code&gt;minio&lt;/code&gt; with the specified parameters (endpoint, access key, and secret key) until successful. During each attempt, it echoes a waiting message and pauses for 1 second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/usr/bin/mc rm -r --force minio/lance;:&lt;/strong&gt; This command forcefully removes (deletes) all contents within the &lt;code&gt;lance&lt;/code&gt; bucket in MinIO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/usr/bin/mc mb minio/lance;:&lt;/strong&gt; This command creates a new bucket named &lt;code&gt;lance&lt;/code&gt; in MinIO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/usr/bin/mc policy set public minio/lance;:&lt;/strong&gt; This command sets the policy of the &lt;code&gt;lance&lt;/code&gt; bucket to public, allowing public read access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;exit 0;:&lt;/strong&gt; This command ensures that the script exits with a status code of 0, indicating successful execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  LanceDB
&lt;/h2&gt;

&lt;p&gt;Unfortunately, LanceDB does not have native S3 support, and as a result, you will have to use something like boto3 to connect to the MinIO container you made. As LanceDB matures we look forward to native S3 support that will make the user experience all the better.&lt;/p&gt;

&lt;p&gt;The sample script below will get you started.&lt;/p&gt;

&lt;p&gt;Install the required packages using pip. Create a file named requirements.txt with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lancedb~=0.4.1
boto3~=1.34.9
botocore~=1.34.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the following command to install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will need to change your credentials if your method of creating the MinIO container differs from the one outlined above.&lt;/p&gt;

&lt;p&gt;Save the below script to a file, e.g., &lt;code&gt;lancedb_script.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import lancedb
import os
import boto3
import botocore
import random

def generate_random_data(num_records):
    data = []
    for _ in range(num_records):
        record = {
            "vector": [random.uniform(0, 10), random.uniform(0, 10)],
            "item": f"item_{random.randint(1, 100)}",
            "price": round(random.uniform(5, 100), 2)
        }
        data.append(record)
    return data

def main():
    # Set credentials and region as environment variables
    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
    os.environ["AWS_ENDPOINT"] = "http://localhost:9000"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

    minio_bucket_name = "lance"

    # Create a boto3 session with path-style access
    session = boto3.Session()
    s3_client = session.client("s3", config=botocore.config.Config(s3={'addressing_style': 'path'}))

    # Connect to LanceDB with the bucket URI; credentials and endpoint
    # come from the environment variables set above
    db_uri = f"s3://{minio_bucket_name}/"
    db = lancedb.connect(db_uri)

    # Create a table with more interesting data
    table = db.create_table("mytable", data=generate_random_data(100))

    # Open the table and perform a search
    result = table.search([5, 5]).limit(5).to_pandas()
    print(result)

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will create a Lance table from randomly generated data and add it to your MinIO bucket. Again, if you don’t use the method in the previous section to create a bucket, you will need to do so before running the script. Remember to change the sample script above to match what you name your MinIO bucket.&lt;/p&gt;

&lt;p&gt;Finally, the script opens the table, without moving it out of MinIO, and uses Pandas to do a search and print the results.&lt;/p&gt;

&lt;p&gt;The result of the script should look similar to the one below. Remember that the data itself is randomly generated each time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   vector      item  price  _distance
0  [5.1022754, 5.1069164]   item_95  50.94   0.021891
1   [4.209107, 5.2760105]  item_100  69.34   0.701694
2     [5.23562, 4.102992]   item_96  99.86   0.860140
3   [5.7922664, 5.867489]   item_47  56.25   1.380223
4    [4.458882, 3.934825]   item_93   9.90   1.427407
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Expand on your Own
&lt;/h2&gt;

&lt;p&gt;There are many ways to build on the foundation offered in this tutorial to create performant, scalable and future-proofed ML/AI architectures. You have two cutting-edge and open-source building blocks in your arsenal – MinIO object storage and the LanceDB vector database – consider this your winning ticket to the ML/AI &lt;a href="https://excalibur.mgmresorts.com/en/entertainment/tournament-of-kings.html?ref=blog.min.io"&gt;tournament&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Don’t stop here. LanceDB offers a wide range of &lt;a href="https://github.com/lancedb/vectordb-recipes?ref=blog.min.io#examples"&gt;recipes&lt;/a&gt; and tutorials to expand on what you’ve built in this tutorial including a recently announced Udacity course on &lt;a href="https://www.udacity.com/course/building-generative-ai-solutions-with-vector-databases--cd12952?ref=blog.min.io"&gt;Building Generative AI Solutions with Vector Databases&lt;/a&gt;. Of particular interest is &lt;a href="https://github.com/lancedb/vectordb-recipes/tree/main/applications/docchat-with-langroid?ref=blog.min.io"&gt;this&lt;/a&gt; recipe to chat with your documents. We are all for breaking down barriers to getting the most from your data. &lt;/p&gt;

&lt;p&gt;Please show us what you’re building and should you need guidance on your noble quest don’t hesitate to email us at &lt;a href="mailto:hello@minio.io"&gt;hello@minio.io&lt;/a&gt; or join our round table on &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>database</category>
      <category>opensource</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Future of AI is Open-Source</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Thu, 18 Jan 2024 21:18:20 +0000</pubDate>
      <link>https://dev.to/sashawodtke/the-future-of-ai-is-open-source-2nia</link>
      <guid>https://dev.to/sashawodtke/the-future-of-ai-is-open-source-2nia</guid>
      <description>&lt;p&gt;By Brenna Buuck, Datalakes and Databases SME, MinIO&lt;/p&gt;

&lt;p&gt;Imagine a future where AI isn't locked away in corporate vaults, but built in the open, brick by brick, by a global community of innovators. Where collaboration, not competition, fuels advancements, and ethical considerations hold equal weight with raw performance. This isn't science fiction; it's the open-source revolution brewing in the heart of AI development. But Big Tech has its own agenda, masking restricted models as open source while attempting to reap the benefits of a truly open community. Let's peel back the layers of code and unveil the truth behind these efforts. This exploration of the future of open-source AI will dissect the “pretenders” and champion the “real ones” in AI development to uncover the innovation engine that is open-source software humming beneath it all. The bottom line is that open-source AI will beget an open-source data stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Need
&lt;/h2&gt;

&lt;p&gt;A recent article by Matteo Wong in The Atlantic, ‘&lt;a href="https://www.theatlantic.com/technology/archive/2024/01/ai-transparency-meta-microsoft/677022/?ref=blog.min.io"&gt;There Was Never Such a Thing as ‘Open’ AI&lt;/a&gt;’ describes a growing trend in academia and the software community toward truly open-source AI. “The idea is to create relatively transparent models that the public can more easily and cheaply use, study, and reproduce, attempting to democratize a highly concentrated technology that may have the potential to transform work, police, leisure and even religion.” That same article suggests that Big Tech companies like Meta are trying to fill this need in the market by ‘open-washing’ their products. They are assuming the qualities and positive reputation of the open-source community without truly open-sourcing their product. But, there is no substitute for the real thing. This is because true open-source software drives innovation and collaboration: two qualities that are desperately needed to move forward with AI responsibly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pretenders
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ai.meta.com/llama/?ref=blog.min.io"&gt;LLaMA 2&lt;/a&gt;, is a large language model created by Meta that is free to use for both research and commercial uses. Leading some to suggest LLaMA 2 is open source. However, Meta has implemented some severe restrictions on the use of their model. For example, LLaMA 2 cannot be used to improve any other large language model. A position that goes against the traditional &lt;a href="https://pubsonline.informs.org/doi/10.1287/orsc.14.2.209.14992?ref=blog.min.io"&gt;private collective innovation model&lt;/a&gt; of open software which promotes the free and open revelation of innovation for the benefit of everyone in the software community. &lt;/p&gt;

&lt;p&gt;Meta further crippled the use of their model by not allowing integration of LLaMA 2 with products that have more than 700 million monthly active users and by not disclosing what data their model is trained on or the code they used to build it. By not disclosing, Meta is opening itself to questions of inherent bias and accidental discrimination. A model trained on discriminatory data will &lt;a href="https://www.forbes.com/sites/jeffraikes/2023/04/21/ai-can-be-racist-lets-make-sure-it-works-for-everyone/?sh=5f84b91a2e40&amp;amp;ref=blog.min.io"&gt;serve up discriminatory responses&lt;/a&gt;. Without the software community at large being able to view either the code used to build the model to see if any safeguards have been built in or the data used to train it, we are left in the dark on these moral questions. In a time when &lt;a href="https://dl.acm.org/doi/10.1145/3531146.3533083?ref=blog.min.io"&gt;published research on AI&lt;/a&gt; is more concerned with performance than justice and respect, this obfuscation is particularly disturbing. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Ones
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mistral.ai/?ref=blog.min.io"&gt;Mistral AI&lt;/a&gt; has gained recognition for its open-source large language models, notably Mistral 7B and Mixtral 8x7B. The company strives to ensure broad accessibility to its AI models, encouraging review, modification, and reuse by the open software community.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vllm-project/vllm?ref=blog.min.io"&gt;vLLM&lt;/a&gt; stands for "vectorized low-latency model serving" and is an open-source library specifically designed to speed up and optimize large language models (LLMs). It is a powerful tool that can significantly improve the performance and usability of LLMs. This makes it a valuable asset for developers working on a variety of AI applications, from chatbots and virtual assistants to content creation and code generation. So much so that, Mistral recommends using vLLM as the inference server for the 7B and 8x7B models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.eleuther.ai/about?ref=blog.min.io"&gt;EleutherAI&lt;/a&gt; is a non-profit AI research lab that has grown from a Discord server for discussing GPT-3 to a leading non-profit research organization. The group is known for its work in training and promoting open science norms in Natural Language Processing. They have released various open-source large language models and are involved in research projects related to AI alignment and interpretability. Their &lt;a href="https://github.com/EleutherAI/lm-evaluation-harness?ref=blog.min.io"&gt;LM-Harness&lt;/a&gt; project is probably the leading open-source evaluation tool for language models. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/microsoft/phi-2?ref=blog.min.io"&gt;Phi-2&lt;/a&gt; is Microsoft's LLM that punches above its weight. Trained on a blend of synthetic texts and filtered websites, this small, but powerful model excels at tasks like question-answering, summarizing, and translation. What truly sets Phi-2 apart is its focus on reasoning and language understanding, leading to impressive performance even without advanced alignment techniques.&lt;/p&gt;

&lt;p&gt;Many competent open-source embedding models are strengthening the overall open-source generative AI space. These are the current state-of-the-art for open source and include &lt;a href="https://huggingface.co/WhereIsAI/UAE-Large-V1?ref=blog.min.io"&gt;UAE-Large-V1&lt;/a&gt; and &lt;a href="https://huggingface.co/intfloat/multilingual-e5-large?ref=blog.min.io"&gt;multilingual-e5-large&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are many more in this ever-growing field. This limited list is just a start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Drives Innovation
&lt;/h2&gt;

&lt;p&gt;Embracing a philosophy of extreme open innovation, companies that truly participate in open-source software development challenge traditional notions of competitive advantage by acknowledging that &lt;a href="https://www.researchgate.net/publication/222535083_A_Man_on_the_Inside_Unlocking_Communities_as_Complementary_Assets?ref=blog.min.io"&gt;not all good code or great ideas reside within their organization&lt;/a&gt;. This shift supports the &lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6867631/?ref=blog.min.io"&gt;argument&lt;/a&gt; that shared innovations within the open-source ecosystem lead to faster market growth, providing even smaller software firms with more limited R&amp;amp;D funds the &lt;a href="https://link.springer.com/article/10.1007/s12130-006-1003-9?ref=blog.min.io"&gt;opportunity to benefit&lt;/a&gt; from R&amp;amp;D spillovers present in open-source software. This is because, in contrast to traditional outsourcing, open innovation &lt;a href="https://journals.sagepub.com/doi/10.1177/0149206304271760?ref=blog.min.io"&gt;enhances internal resources&lt;/a&gt; by leveraging the collective intelligence of the community, without diminishing internal R&amp;amp;D efforts. This means that open-source software companies don’t have to sacrifice their budgets to pursue thought leadership and code outside their organization. Additionally, open-source software companies strategically drive innovation by &lt;a href="https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=hhpJrrMAAAAJ&amp;amp;citation_for_view=hhpJrrMAAAAJ%3AWF5omc3nYNoC&amp;amp;ref=blog.min.io"&gt;releasing code early and often&lt;/a&gt;, recognizing the cumulative nature of the innovation process in the software community. All of which is to say what many already recognize: open-source software drives innovation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Fosters Collaboration
&lt;/h2&gt;

&lt;p&gt;Through &lt;a href="https://psycnet.apa.org/record/2015-11147-003?ref=blog.min.io"&gt;networking&lt;/a&gt; in the open-source software community, entrepreneurs are able to fulfill both short-term and long-term goals. Short-term profit goals build companies and long-term profit goals sustain them. At the same time, this networking effort self-perpetuates the network itself - growing it for the next entrepreneur. It is well known that open-source platforms provide access to the source code, enabling developers to create upgrades, plug-ins and other pieces of software and use them according to their requirements. This particular kind of collaboration experienced a boom with the wide adoption of Kubernetes by the wider software community. Now more than ever, modern technologies work together with very little friction and can be deployed together in minutes almost anywhere.&lt;/p&gt;

&lt;p&gt;Big Tech companies acknowledge this deep collaboration inherent to the open-source community when they freely release frameworks, libraries, and languages they created to maintain and develop internal tools. Doing so deepens the pool of developers capable of working on their products and starts to set the standard for how similar technologies should operate. That same Atlantic article quotes Meta founder Mark Zuckerberg as saying it has “been very valuable for us to provide that because now all of the best developers across the industry are using tools that we’re also using internally”. &lt;/p&gt;

&lt;h2&gt;
  
  
  Open Source Begets Open Source
&lt;/h2&gt;

&lt;p&gt;These factors explain why we so often see synergies between open-source companies. Open-source AI and ML companies will naturally develop solutions with other open-source products, from foundational products like object storage all the way up the stack to visualization tools. When one open-source company steps forward, we all do. This cohesive and blended approach is probably our best bet for developing AI that takes a human-centered approach. The market need for open-source AI, combined with the innovation and collaboration inherent in open-source software, will drive the AI data stack open source.&lt;/p&gt;

&lt;p&gt;Please join and contribute to this conversation and our community by emailing us at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; or sending us a message on our &lt;a href="https://minio.slack.com/?ref=blog.min.io"&gt;Slack channel&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>softwareengineering</category>
      <category>devops</category>
    </item>
    <item>
      <title>AI/ML Reproducibility with lakeFS and MinIO</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Wed, 06 Dec 2023 21:23:08 +0000</pubDate>
      <link>https://dev.to/sashawodtke/aiml-reproducibility-with-lakefs-and-minio-24oi</link>
      <guid>https://dev.to/sashawodtke/aiml-reproducibility-with-lakefs-and-minio-24oi</guid>
      <description>&lt;p&gt;This post was written in collaboration with Amit Kesarwani from &lt;a href="https://lakefs.io/?ref=blog.min.io"&gt;lakeFS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reality of running multiple machine learning experiments is that managing them can become unpredictable and complicated - especially in a team environment. What often happens is that during the research process, teams constantly change configuration and data between experiments. For example, they may try several training sets and several hyperparameter values and - when large data sets are involved - also different configurations of distributed compute engines such as Apache Spark. &lt;/p&gt;

&lt;p&gt;Part of the ML engineer’s work requires going back and forth between these experiments for comparison and optimization. When engineers manage all these experiments manually, they are less productive.&lt;/p&gt;

&lt;p&gt;How can engineers run ML experiments confidently and efficiently? &lt;/p&gt;

&lt;p&gt;In this article, we dive into reproducibility to show you why it’s worth your time and how to achieve it with lakeFS and MinIO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why data practitioners need reproducibility
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is reproducibility?&lt;/strong&gt;&lt;br&gt;
Reproducibility ensures that teams can repeat experiments using the same procedures and get the same results. It’s the foundation of the scientific method and, therefore, a handy approach in ML. &lt;/p&gt;

&lt;p&gt;In the context of data, reproducibility means that you have everything needed to recreate the model and its results, such as data, tools, libraries, frameworks, programming languages, and operating systems. That way, you can produce identical results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do data teams need reproducibility?&lt;/strong&gt;&lt;br&gt;
ML processes aren’t linear. Engineers usually experiment with various ML methods and parameters iteratively and incrementally to arrive at a more accurate ML model. Because of the iterative nature of development, one of the most difficult challenges in ML is ensuring that work is repeatable. For example, training an ML model meant to detect cancer should return the same model if all inputs and systems used are the same.&lt;/p&gt;

&lt;p&gt;Additionally, reproducibility is a key ingredient for regulatory compliance, auditing, and validation. It also increases team productivity, improves collaboration with nontechnical stakeholders, and promotes transparency and confidence in ML products and services.&lt;/p&gt;

&lt;p&gt;As stated previously, the ML pipeline can get complicated. You must manage code, data sets, models, hyperparameters, pipelines, third-party packages, and environment-specific configurations. Repeating an experiment accurately is challenging. You need to recreate the exact conditions used to generate the model. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of reproducible data&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;&lt;br&gt;
Given the same data, you want a model to deliver the same outcome. This is how you can establish confidence in your data products. If you acquire the same result by repeating the experiment, the users' experience will also be consistent.&lt;/p&gt;

&lt;p&gt;Moreover, as part of the research process, you want to be able to update a single element, such as the model's core, while keeping everything else constant - and then see how the outcome has changed. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security and Compliance&lt;/strong&gt;&lt;br&gt;
Another consideration is security and compliance. In many business verticals such as banking, healthcare, and security, organizations are required to maintain and report on the exact process that led to a given model and its result. Ensuring reproducibility is a common practice in these verticals. &lt;/p&gt;

&lt;p&gt;To accomplish this, you need to version all the inputs to your ML pipeline such that reverting to a previous version reproduces a prior result. Regulations often require you to recreate the former state of the pipeline. This includes the model, data, and a previous result.&lt;/p&gt;
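One way to make "version all the inputs" concrete is a run manifest that records the data commit, code version, and hyperparameters, and derives a stable ID from them. This is a hypothetical sketch: the field names and version values are illustrative, and only the `data_commit` value reuses the lakeFS commit ID shown later in this article.

```python
import hashlib
import json

def manifest_id(manifest):
    """Derive a short, stable ID from a canonical rendering of a run manifest."""
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

run = {
    "data_commit": "296e54fbee5e176f3f4f4aeb7e087f9d57515750e8c3d033b8b841778613cb23",
    "code_version": "v1.4.2",
    "hyperparameters": {"learning_rate": 0.01, "epochs": 20},
    "spark_version": "3.5.0",
}

# The same manifest always yields the same ID, so storing the ID alongside a
# model is enough to later recreate the exact pipeline state it came from.
assert manifest_id(run) == manifest_id(dict(run))
```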

&lt;p&gt;&lt;strong&gt;Easier management of changing data&lt;/strong&gt;&lt;br&gt;
Data is always changing. This makes it difficult to keep track of its exact status over time. People frequently keep only one state of their data: the present state.&lt;/p&gt;

&lt;p&gt;This has a negative impact on the work since it makes the following tasks incredibly difficult:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging a data problem.&lt;/li&gt;
&lt;li&gt;Validating the correctness of machine learning training (re-running a model on different data yields different results).&lt;/li&gt;
&lt;li&gt;Observing data audits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How do you achieve reproducibility?&lt;/strong&gt;&lt;br&gt;
To achieve reproducibility, data practitioners often keep several copies of the ML pipeline and data. But copying enormous training datasets each time you wish to explore them is expensive and just isn’t scalable. &lt;/p&gt;

&lt;p&gt;Furthermore, there is no method to save atomic copies of many model artifacts and their accompanying training data. Add to that the challenges of handling many types of structured, semi-structured, and unstructured training data, such as video, audio, IoT sensor data, tabular data, and so on. &lt;/p&gt;

&lt;p&gt;Finally, when ML teams make duplicate copies of the same data for collaboration, it's difficult to implement data privacy best practices and data access limits. &lt;/p&gt;

&lt;p&gt;What you need is a &lt;a href="https://lakefs.io/blog/data-versioning/?ref=blog.min.io"&gt;data versioning&lt;/a&gt; tool that has a zero-copy mechanism and lets you create and track multiple versions of your data. Version control is the process of recording and controlling changes to artifacts like code, data, labels, models, hyperparameters, experiments, dependencies, documentation, and environments for training and inference.&lt;/p&gt;

&lt;p&gt;Can you get away with using Git for versioning data? This might sound like a good idea, but Git is neither secure, adequate, nor scalable for data. &lt;/p&gt;

&lt;p&gt;The version control components for data science are more complicated than those for software projects, making reproducibility more challenging. Moreover, since raw training data is often stored in cloud object stores (S3, GCS, Azure Blob), teams need a versioning solution that works for data in-place (in object stores). &lt;/p&gt;

&lt;p&gt;Luckily, there is an open-source tool that does just that: lakeFS.&lt;/p&gt;
&lt;h2&gt;
  
  
  Data version control with lakeFS
&lt;/h2&gt;

&lt;p&gt;lakeFS is an open-source tool that enables teams to manage their data using Git-like procedures (commit, merge, branch), supporting billions of files and petabytes of data. lakeFS adds a management layer to your object storage, like S3, and transforms your entire bucket into something akin to a code repository. Additionally, although lakeFS only handles a portion of the MLOps flow, it strives to be a good citizen within the MLOps ecosystem by interacting with all the tools shown below - especially &lt;a href="https://lakefs.io/data-quality/data-quality-tools/?ref=blog.min.io"&gt;data quality tools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d4cq0pBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9f5skfekag0kf8ljpdky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d4cq0pBT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9f5skfekag0kf8ljpdky.png" alt="Image description" width="736" height="706"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reproducibility means that team members have the capability to time travel between multiple versions of the data, taking snapshots of the data at various periods and with varying degrees of modification. &lt;/p&gt;

&lt;p&gt;To ensure data reproducibility, we recommend committing to a lakeFS repository every time the data in it changes. As long as a commit has been made, replicating a given state is as simple as reading data from a route that includes the unique commit_id produced for that commit.&lt;/p&gt;

&lt;p&gt;Getting the current state of a repository is straightforward. We use a static route with the repository name and branch name. For example, if you have a repository called example with a branch called main, reading the most recent state of this data into a Spark Dataframe looks like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df = spark.read.parquet("s3://example/main/")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note: This code snippet assumes that all items in the repository under this path are in Parquet format. If you’re using a different format, use the appropriate Spark read method.&lt;/p&gt;
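Because a lakeFS path is simply repository, then ref, then object path, and the ref can be a branch name, a tag, or a commit ID, a small helper makes the pattern explicit. The helper itself is hypothetical, written here for illustration only; the commit ID is the one used later in this article.

```python
def lakefs_uri(repo, ref, path=""):
    """Build a lakeFS object-store URI; `ref` may be a branch, tag, or commit ID."""
    return f"s3://{repo}/{ref}/{path}"

# Latest state of the main branch (mutable; changes as new commits land):
latest = lakefs_uri("example", "main", "training_dataset/")

# Pinned to an immutable commit, so reads are reproducible forever:
pinned = lakefs_uri(
    "example",
    "296e54fbee5e176f3f4f4aeb7e087f9d57515750e8c3d033b8b841778613cb23",
    "training_dataset/",
)

print(latest)  # s3://example/main/training_dataset/
```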

&lt;p&gt;However, we can also look at any previous commit in a lakeFS repository. Any commit can be reproduced. If a commit can be reproduced then the results are repeatable. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ5MW2cp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/maulde65r7vmtjksuw7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQ5MW2cp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/maulde65r7vmtjksuw7m.png" alt="Image description" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the repository above, each time the model training script is run, a new commit is made to the repository, and the commit message specifies the exact run number. What if we wanted to re-run the model training script and get the same results as a previous run? As an example, let's say we want to reproduce the results of run #435. To do this, we simply copy the commit ID associated with that run and read the data into a dataframe as follows:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df = spark.read.parquet("s3://example/296e54fbee5e176f3f4f4aeb7e087f9d57515750e8c3d033b8b841778613cb23/training_dataset/")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The ability to reference a single commit_id in code makes it easy to reproduce the specific state of a data collection, or of several collections at once. This has several typical uses in data development, such as historical debugging, discovering deltas in a data collection, audit compliance, and more.&lt;/p&gt;
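Discovering deltas, for instance, reduces to comparing the object listings of two refs. A toy sketch with hard-coded listings standing in for real listing output (in practice these would come from listing each ref with mc or lakectl):

```python
# Object listings as they might appear under two training-run commits
# (illustrative data, not real lakeFS output).
run_434 = {
    "training_dataset/part-0001.parquet",
    "training_dataset/part-0002.parquet",
}
run_435 = run_434 | {"training_dataset/part-0003.parquet"}

added = run_435 - run_434    # objects introduced by run #435
removed = run_434 - run_435  # objects deleted since run #434

assert added == {"training_dataset/part-0003.parquet"}
assert removed == set()
```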
&lt;h2&gt;
  
  
  Object storage with MinIO
&lt;/h2&gt;

&lt;p&gt;MinIO is a high-performance, S3 compatible object store. It is built for large scale AI/ML, data lake, and database workloads. It runs on-prem and on any cloud (public or private) and from the data center to the edge. MinIO is software-defined and open source under GNU AGPL v3. Enterprises use MinIO to deliver against ML/AI, analytics, backup, and archival workloads - all from a single platform. Remarkably simple to install and manage, MinIO offers a rich suite of enterprise features targeting security, resiliency, data protection, scalability, and identity management. In the end-to-end demo presented here, MinIO was used to store a customer's documents.&lt;/p&gt;
&lt;h2&gt;
  
  
  Putting it all together: lakeFS + MinIO
&lt;/h2&gt;

&lt;p&gt;lakeFS provides object-storage-based data lakes with version control capabilities in the form of Git-like operations. It can work on top of your MinIO storage environment and integrate with all contemporary data frameworks like Apache Spark, Hive, Presto, Kafka, R, and Native Python, among others.&lt;/p&gt;

&lt;p&gt;Using lakeFS on top of MinIO, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a development environment that keeps track of experiments.&lt;/li&gt;
&lt;li&gt;Efficiently modify and version data with zero-copy branching and commits for every experiment.&lt;/li&gt;
&lt;li&gt;Build a robust data pipeline for delivering new data to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how you can set up lakeFS over MinIO and make data processing easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MinIO Server Installed from &lt;a href="https://docs.min.io/docs/minio-quickstart-guide?ref=blog.min.io"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Installed mc from &lt;a href="https://docs.min.io/docs/minio-client-quickstart-guide?ref=blog.min.io"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Installed docker from &lt;a href="https://docs.docker.com/get-docker/?ref=blog.min.io"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
Let’s start by installing lakeFS locally on your machine. More installation options are available in &lt;a href="https://docs.lakefs.io/howto/deploy/?ref=blog.min.io"&gt;the lakeFS docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An installation fit for production calls for a persistent PostgreSQL installation. But in this example, we will use a local key-value store within a Docker container.&lt;/p&gt;

&lt;p&gt;Run the following command, replacing &lt;code&gt;&amp;lt;minio-endpoint&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;minio-access-key&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;minio-secret-key&amp;gt;&lt;/code&gt; with the corresponding values from your MinIO installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name lakefs \
             --publish 8000:8000 \
             -e LAKEFS_BLOCKSTORE_TYPE=s3 \
             -e LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true \
             -e LAKEFS_BLOCKSTORE_S3_ENDPOINT=&amp;lt;minio-endpoint&amp;gt; \
             -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=&amp;lt;minio-access-key&amp;gt; \
             -e LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=&amp;lt;minio-secret-key&amp;gt; \
             treeverse/lakefs:latest \
             run --local-settings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;br&gt;
Go to lakeFS and create an admin user: &lt;a href="http://127.0.0.1:8000/setup"&gt;http://127.0.0.1:8000/setup&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take note of the generated access key and secret.&lt;/p&gt;

&lt;p&gt;Log in to lakeFS using the access key and secret key: &lt;a href="http://127.0.0.1:8000"&gt;http://127.0.0.1:8000&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4CDxPbcB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ykss033nb51ax84zlvcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4CDxPbcB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ykss033nb51ax84zlvcc.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use the lakectl binary to carry out lakeFS operations. Find the distribution suitable for your operating system &lt;a href="https://github.com/treeverse/lakeFS/releases?ref=blog.min.io"&gt;here&lt;/a&gt;, and extract the lakectl binary from the tar.gz archive. Place it somewhere in your $PATH and run lakectl --version to verify.&lt;/p&gt;

&lt;p&gt;Then run the following command to configure lakectl, using the credentials you got in the setup before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lakectl config
# output:
# Config file /home/janedoe/.lakectl.yaml will be used
# Access key ID: 
# Secret access key: 
# Server endpoint URL: http://127.0.0.1:8000/api/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure that lakectl can access lakeFS with the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lakectl repo list&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you don’t get any error notifications, you’re ready to set a MinIO alias for lakeFS:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set lakefs http://s3.local.lakefs.io:8000 &amp;lt;lakefs-access-key&amp;gt; &amp;lt;lakefs-secret-key&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you don’t already have one, set an alias for your MinIO storage as well:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set myminio &amp;lt;minio-endpoint&amp;gt; &amp;lt;minio-access-key&amp;gt; &amp;lt;minio-secret-key&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now that we understand the basic concepts and have everything installed, let’s walk through an end-to-end example that demonstrates how easy this is to incorporate into your AI/ML engineering workflow. You will also notice that using lakeFS is very Git-like. If your engineers know Git, they will have an easy time learning lakeFS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero Clone Copy and Reproducibility example&lt;/strong&gt;&lt;br&gt;
One of the key advantages of combining MinIO with lakeFS is the ability to achieve parallelism without incurring additional storage costs. lakeFS leverages a unique approach (zero clone copies), where different versions of your ML datasets and models are efficiently managed without duplicating the data. This functionality will be demonstrated in this section.&lt;/p&gt;

&lt;p&gt;Let’s start by creating a bucket in MinIO! &lt;/p&gt;

&lt;p&gt;Note that this bucket will be created directly in your MinIO installation. Later on, we’ll use lakeFS to enable versioning on this bucket.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc mb myminio/example-bucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, create a repository in lakeFS:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lakectl repo create lakefs://example-repo s3://example-bucket&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Generate two example files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "my first file" &amp;gt; myfile.txt
echo "my second file" &amp;gt; myfile2.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a branch named experiment1, copy a file to it and commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lakectl branch create lakefs://example-repo/experiment1 --source lakefs://example-repo/main

mc cp ./myfile.txt lakefs/example-repo/experiment1/

lakectl commit lakefs://example-repo/experiment1 -m "my first experiment"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create a tag for the committed data in the experiment1 branch (your ML models can access your data by tag later):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lakectl tag create lakefs://example-repo/my-ml-experiment1 lakefs://example-repo/experiment1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s merge the branch back to main:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lakectl merge lakefs://example-repo/experiment1 lakefs://example-repo/main&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Create a branch named experiment2, copy a file to it, commit it and tag it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lakectl branch create lakefs://example-repo/experiment2 --source lakefs://example-repo/main

mc cp ./myfile2.txt lakefs/example-repo/experiment2/

lakectl commit lakefs://example-repo/experiment2 -m "my second experiment"

lakectl tag create lakefs://example-repo/my-ml-experiment2 lakefs://example-repo/experiment2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let’s merge the branch back to main:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lakectl merge lakefs://example-repo/experiment2 lakefs://example-repo/main&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;List the data for different experiments by using tags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc ls lakefs/example-repo/my-ml-experiment1
# only myfile.txt should be listed

mc ls lakefs/example-repo/my-ml-experiment2
# both files should be listed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Full example
&lt;/h2&gt;

&lt;p&gt;You can review the ML reproducibility example, along with multiple ML experiments, &lt;a href="https://github.com/treeverse/lakeFS-samples/blob/1c33a4950f2437dc8e041bbfd64b00f997187714/00_notebooks/ml-reproducibility.ipynb?ref=blog.min.io"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Setup instructions for this example are &lt;a href="https://github.com/treeverse/lakeFS-samples/tree/1c33a4950f2437dc8e041bbfd64b00f997187714?ref=blog.min.io#lets-get-started-"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;If you’re ready to extend your MinIO object storage with Git-like features, take the installation and configuration steps outlined above and try it yourself!&lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://docs.lakefs.io/v0.52/integrations/minio.html?ref=blog.min.io"&gt;this documentation page&lt;/a&gt; to get started. &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By bringing lakeFS and MinIO together, you can take advantage of the power of Git branching to create reproducible experiments. &lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://docs.lakefs.io/?ref=blog.min.io"&gt;the lakeFS documentation &lt;/a&gt;to learn more and join the vibrant lakeFS community on their &lt;a href="https://join.slack.com/t/lakefs/shared_invite/zt-ks1fwp0w-bgD9PIekW86WF25nE_8_tw?ref=blog.min.io"&gt;public Slack channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have questions about MinIO, then drop us a line at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; or join the discussion on MinIO’s &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;general Slack channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Using LXMIN in MinIO Multi-Node cluster</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Thu, 30 Nov 2023 21:44:12 +0000</pubDate>
      <link>https://dev.to/sashawodtke/using-lxmin-in-minio-multi-node-cluster-40ak</link>
      <guid>https://dev.to/sashawodtke/using-lxmin-in-minio-multi-node-cluster-40ak</guid>
      <description>&lt;p&gt;By AJ, MinIO&lt;/p&gt;

&lt;p&gt;MinIO includes several ways to replicate data, and we give you the freedom to choose the best methodology to meet your needs. We’ve written about &lt;a href="https://blog.min.io/minio-multi-site-active-active-replication/"&gt;bucket based active-active replication&lt;/a&gt; for replication of objects at a bucket level, &lt;a href="https://blog.min.io/announcing-minio-batch-framework-batch-replication/"&gt;batch replication&lt;/a&gt; for replication of specific objects in a bucket, which gives you more granular control, and other &lt;a href="https://blog.min.io/minio-replication-best-practices/"&gt;best practices&lt;/a&gt; when it comes to &lt;a href="https://min.io/product/active-data-replication-for-object-storage?ref=blog.min.io"&gt;site-to-site replication&lt;/a&gt;. MinIO uses &lt;a href="https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html?ref=blog.min.io"&gt;Erasure Coding&lt;/a&gt; to protect and ensure data redundancy and availability, reconstructing objects on the fly without any additional hardware or software. In addition, MinIO also ensures that any corrupted object is captured and fixed on the fly to ensure data integrity using &lt;a href="https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html?ref=blog.min.io"&gt;Bit Rot Protection&lt;/a&gt;. There are several reasons data can be corrupted on physical disks. It could be due to voltage spikes, bugs in firmware, or misdirected reads and writes, among other things. In other words, you never have to worry about backing up your data saved to MinIO as long as you set up your MinIO cluster following best practices.&lt;/p&gt;
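Bit rot protection rests on a simple principle: record a checksum with each object at write time and re-verify it on every read, so silent corruption is caught before it spreads. A simplified illustration of that principle (MinIO's internal hashing and repair machinery differ; this is not its implementation):

```python
import hashlib

def checksum(data):
    """Content hash recorded at write time and re-verified on every read."""
    return hashlib.sha256(data).hexdigest()

stored = b"important object data"
stored_sum = checksum(stored)  # persisted alongside the object

# Simulate bit rot: a single flipped bit in the first byte.
corrupted = bytes([stored[0] ^ 0x01]) + stored[1:]

assert checksum(stored) == stored_sum      # a healthy read verifies cleanly
assert checksum(corrupted) != stored_sum   # corruption is detected on read
```

Once a mismatch is detected, a system with erasure coding can rebuild the damaged object from the redundant shards on other drives.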

&lt;p&gt;When an entire node fails, it often takes several minutes to get it back online. After the node has been reprovisioned it must be reconfigured, and configuration management tools such as Ansible or Puppet often have to run multiple times to achieve idempotency. In this case it is very useful to have a snapshot of the VM with the configuration already &lt;a href="https://blog.min.io/ci-cd-build-pipeline/"&gt;baked in&lt;/a&gt; so the node can come back online as soon as possible. Once the node is online, its data gets rehydrated from the other nodes in the cluster, bringing the cluster back to a good state. &lt;/p&gt;

&lt;p&gt;In this post let's take a look at how to set up multiple LXMIN servers backing up to a multi-node multi-drive MinIO cluster. We use LXC internally for our lab but you can use these concepts with any platform. We’ll set up each VM broker server with its own LXMIN service as a means of reducing load on the backup system. We recommend this setup because each request from a specific instance goes to the hypervisor where it's hosted. Moreover, a request for a specific backup goes to the node which is able to list it. In turn, the LXMIN services connect to a single MinIO endpoint which, in this example, allows access to a multi-node multi-drive MinIO cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LC4hOY_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqt1zn59zg9fk5c9narq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LC4hOY_8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bqt1zn59zg9fk5c9narq.png" alt="Image description" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each server in the cluster, obtain wildcard certificates issued by a signing authority. Create the following files on the server.&lt;/p&gt;

&lt;p&gt;In this example, we used &lt;code&gt;*.lab.domain.com&lt;/code&gt; certificates&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p $HOME/.minio/certs/CAs
mkdir -p $HOME/.minio/certs_intel/CAs
vi $HOME/.minio/certs_intel/public.crt
vi $HOME/.minio/certs_intel/private.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
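&lt;p&gt;A quick way to confirm that the &lt;code&gt;public.crt&lt;/code&gt; and &lt;code&gt;private.key&lt;/code&gt; you just created actually belong together is to compare their RSA moduli with &lt;code&gt;openssl&lt;/code&gt;. The sketch below generates a throwaway self-signed pair so it runs anywhere; point the two checksum commands at your real files instead.&lt;/p&gt;

```shell
# Generate a throwaway RSA key and self-signed certificate as stand-ins
# for the real pair copied into $HOME/.minio/certs_intel
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout private.key -out public.crt \
  -days 1 -subj "/CN=lab.domain.com" 2>/dev/null

# A certificate and key match when their RSA moduli hash to the same value
cert_md5=$(openssl x509 -noout -modulus -in public.crt | openssl md5)
key_md5=$(openssl rsa -noout -modulus -in private.key | openssl md5)

if [ "$cert_md5" = "$key_md5" ]; then
  echo "certificate and key match"
else
  echo "MISMATCH: certificate does not belong to this key" >&2
fi
```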



&lt;p&gt;Download and prepare LXMIN on each of the MinIO servers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir $HOME/lxmin
cd $HOME/lxmin
wget https://github.com/minio/lxmin/releases/latest/download/lxmin-linux-amd64
chmod +x lxmin-linux-amd64 
sudo mv lxmin-linux-amd64 /usr/local/bin/lxmin-aj
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;systemd&lt;/code&gt; unit file on the MinIO servers for our LXMIN service. NOTE: this service may need to run as &lt;code&gt;User=root&lt;/code&gt; and &lt;code&gt;Group=root&lt;/code&gt; if it fails with the error &lt;code&gt;Unable get instance config&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo vi /etc/systemd/system/lxmin-aj.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
[Unit]
Description=Lxmin
Documentation=https://github.com/minio/lxmin/blob/master/README.md
Wants=network-online.target
After=network-online.target
AssertFileIsExecutable=/usr/local/bin/lxmin-aj

[Service]
User=aj
Group=aj

EnvironmentFile=/etc/default/lxmin-aj
ExecStart=/usr/local/bin/lxmin-aj

# Let systemd restart this service always
Restart=always

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65536

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=infinity
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
###
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To go with the LXMIN service we need a settings file which contains startup configuration for the LXMIN service. Change the values as required.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo vi /etc/default/lxmin-aj&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
## MinIO endpoint configuration
LXMIN_ENDPOINT=https://node5.lab.domain.com:19000
LXMIN_BUCKET="lxc-backup"
LXMIN_ACCESS_KEY="REDACTED"
LXMIN_SECRET_KEY="REDACTED"
LXMIN_NOTIFY_ENDPOINT="https://webhook.site/REDACTED"

## LXMIN address
LXMIN_ADDRESS=":8000"

## LXMIN server certificate and client trust certs.
LXMIN_TLS_CERT="$HOME/.lxmin/certs_intel/public.crt"
LXMIN_TLS_KEY="$HOME/.lxmin/certs_intel/private.key"
###
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the &lt;code&gt;systemd&lt;/code&gt; unit and config files, go ahead and enable the service so it starts at boot time of the VM. Once it's enabled you can start it and check its status via the logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl enable --now lxmin-aj.service
sudo systemctl start lxmin-aj.service
sudo systemctl status lxmin-aj.service
sudo journalctl -f -u lxmin-aj.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following commands are also useful if you want to disable the service at boot, or to stop or restart it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl disable lxmin-aj.service
sudo systemctl stop lxmin-aj.service
sudo systemctl restart lxmin-aj.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next let’s download and prepare MinIO on the instances&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd $HOME
mkdir minio
cd $HOME/minio
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
sudo mv minio /usr/local/bin/minio-aj
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each MinIO server in the cluster, obtain certificates that cover every node: either wildcard certificates issued by a signing authority, or self-signed certificates generated with &lt;code&gt;certgen&lt;/code&gt; as shown below. &lt;/p&gt;

&lt;p&gt;Copy them to the following locations on the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Be sure to update the SANs (Subject Alternative Names) below with the IPs of the MinIO nodes in the cluster so the certificates get accepted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir -p $HOME/.minio/certs/CA
wget https://github.com/minio/certgen/releases/latest/download/certgen-linux-amd64
mv certgen-linux-amd64 certgen
chmod +x certgen 
./certgen -host "127.0.0.1,localhost,&amp;lt;node-1-IP&amp;gt;,&amp;lt;node-2-IP&amp;gt;,&amp;lt;node-3-IP&amp;gt;,&amp;lt;node-4-IP&amp;gt;"
mv public.crt $HOME/.minio/certs/public.crt
mv private.key $HOME/.minio/certs/private.key
cat $HOME/.minio/certs/public.crt | openssl x509 -text -noout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
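&lt;p&gt;To check just the SANs rather than scanning the full text dump, &lt;code&gt;openssl&lt;/code&gt; can print that single extension. The sketch below creates a self-signed stand-in with &lt;code&gt;-addext&lt;/code&gt; (OpenSSL 1.1.1+) in place of the certgen output, using placeholder IPs; run the last command against &lt;code&gt;$HOME/.minio/certs/public.crt&lt;/code&gt; to inspect your real certificate.&lt;/p&gt;

```shell
# Create a stand-in certificate carrying SAN entries, similar to what
# certgen's -host flag produces (the IPs here are placeholders)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout san-test.key -out san-test.crt -days 1 \
  -subj "/CN=lab.domain.com" \
  -addext "subjectAltName=DNS:localhost,IP:127.0.0.1,IP:10.0.0.11" 2>/dev/null

# Print only the subjectAltName extension; every node IP should appear here
openssl x509 -in san-test.crt -noout -ext subjectAltName
```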



&lt;p&gt;Create a &lt;code&gt;systemd&lt;/code&gt; unit file on the MinIO servers for our &lt;code&gt;minio&lt;/code&gt; service. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo vi /etc/systemd/system/minio-aj.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
[Unit]
Description=MinIO
Documentation=https://min.io/docs/minio/linux/index.html
Wants=network-online.target
After=network-online.target
AssertFileIsExecutable=/usr/local/bin/minio-aj

[Service]
WorkingDirectory=/usr/local

User=aj
Group=aj
ProtectProc=invisible

EnvironmentFile=-/etc/default/minio-aj
ExecStartPre=/bin/bash -c "if [ -z \"${MINIO_VOLUMES}\" ]; then echo \"Variable MINIO_VOLUMES not set in /etc/default/minio-aj\"; exit 1; fi"
ExecStart=/usr/local/bin/minio-aj server $MINIO_OPTS $MINIO_VOLUMES

# Let systemd restart this service always
Restart=always

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65536

# Specifies the maximum number of threads this process can create
TasksMax=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=infinity
SendSIGKILL=no

[Install]
WantedBy=multi-user.target

# Built for ${project.name}-${project.version} (${project.name})
###
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's open up the MinIO default configuration file and add the following values&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo vi /etc/default/minio-aj&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
# Allow MinIO to use data paths in your home directory (CI/lab mode)
MINIO_CI_CD=1
# Set the hosts and volumes MinIO uses at startup
# The command uses MinIO expansion notation {x...y} to denote a
# sequential series.
#
# The following example covers four MinIO hosts
# with 4 drives each at the specified hostname and drive locations.
# The command includes the port that each MinIO server listens on
# (default 9000)

MINIO_VOLUMES="https://node{4...7}.lab.domain.com:19000/home/aj/disk{0...1}/minio"
#MINIO_VOLUMES="https://65.49.37.{20...23}:19000/home/aj/disk{0...1}/minio"

# Set all MinIO server options
#
# The following explicitly sets the MinIO Console listen address to
# port 9001 on all network interfaces. The default behavior is dynamic
# port selection.

MINIO_OPTS="--address :19000 --console-address :19001 --certs-dir /home/aj/.lxmin/certs_intel"

# Set the root username. This user has unrestricted permissions to
# perform S3 and administrative API operations on any resource in the
# deployment.
#
# Defer to your organizations requirements for superadmin user name.

MINIO_ROOT_USER=REDACTED

# Set the root password
#
# Use a long, random, unique string that meets your organizations
# requirements for passwords.

MINIO_ROOT_PASSWORD=REDACTED

# Set to the URL of the load balancer for the MinIO deployment
# This value *must* match across all MinIO servers. If you do
# not have a load balancer, set this value to any *one* of the
# MinIO hosts in the deployment as a temporary measure.
MINIO_SERVER_URL="https://node5.lab.domain.com:19000"
###
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start or restart the service depending on the status of the service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo systemctl enable --now minio-aj.service
sudo systemctl start minio-aj.service
sudo systemctl status minio-aj.service
sudo journalctl -f -u minio-aj.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be sure to repeat all the steps for installing MinIO and LXMIN on the other servers in the cluster. They all have to be configured identically.&lt;/p&gt;
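&lt;p&gt;Rather than repeating every step by hand, you can push the unit and config files from the first node to the rest. Below is a dry-run sketch, assuming hypothetical hostnames &lt;code&gt;node5&lt;/code&gt; through &lt;code&gt;node7&lt;/code&gt; and passwordless SSH for the &lt;code&gt;aj&lt;/code&gt; user; the loop only echoes the commands, so remove the &lt;code&gt;echo&lt;/code&gt; once the output looks right.&lt;/p&gt;

```shell
# Files that must be identical on every node in the cluster
files="/etc/systemd/system/minio-aj.service /etc/systemd/system/lxmin-aj.service \
/etc/default/minio-aj /etc/default/lxmin-aj"

# Dry run: print the copy commands instead of executing them
for host in node5 node6 node7; do
  for f in $files; do
    echo scp "$f" "aj@${host}.lab.domain.com:$f"
  done
  echo ssh "aj@${host}.lab.domain.com" sudo systemctl daemon-reload
done
```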

&lt;p&gt;Let's take a look at the MinIO console to validate the entire setup&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://node5.lab.domain.com:19001/login&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can also use &lt;code&gt;mc&lt;/code&gt; to double-check that the cluster is running correctly.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mc alias set myminio https://node7.lab.domain.com:19000 MINIO_ROOT_USER MINIO_ROOT_PASSWORD&lt;/code&gt;&lt;/p&gt;
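&lt;p&gt;With the alias in place, &lt;code&gt;mc admin info&lt;/code&gt; gives a quick health summary of the deployment. A minimal sketch; it needs the live cluster, so it skips itself cleanly when the &lt;code&gt;myminio&lt;/code&gt; alias is not configured.&lt;/p&gt;

```shell
# Needs the MinIO Client and a reachable cluster; skip cleanly otherwise
if ! mc alias list myminio >/dev/null 2>&1; then
  echo "myminio alias not configured; skipping cluster checks"
else
  mc admin info myminio      # per-node status, drives online, erasure sets
  mc ls myminio/lxc-backup   # confirm the lxmin backup bucket exists
fi
```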

&lt;p&gt;To test the LXMIN/MinIO backup, access the following LXMIN endpoint. The assumption here is that the wildcard certificate and key are available at &lt;code&gt;$HOME/.vm-broker/ssl&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -X GET "https://node4.lab.domain.com:8000/1.0/instances/*/backups" -H "Content-Type: application/json" --cert $HOME/.vm-broker/ssl/tls.crt --key $HOME/.vm-broker/ssl/tls.key | jq .

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1100  100  1100    0     0   7766      0 --:--:-- --:--:-- --:--:--  8148
{
  "metadata": [
    {
      "instance": "delete",
      "name": "backup_2023-08-08-09-3627",
      "created": "2023-08-08T16:37:40.627Z",
      "size": 583116000,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "delete",
      "name": "backup_2023-08-08-10-3807",
      "created": "2023-08-08T17:39:21.241Z",
      "size": 583446188,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "delete",
      "name": "backup_2023-08-08-15-2135",
      "created": "2023-08-08T22:22:49.79Z",
      "size": 583774771,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "dilvm1",
      "name": "backup_2023-08-08-15-2618",
      "created": "2023-08-08T22:27:53.034Z",
      "size": 888696535,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "new0",
      "name": "backup_2023-08-08-17-5231",
      "created": "2023-08-09T00:53:52.235Z",
      "size": 617889321,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "new2",
      "name": "backup_2023-08-08-15-2805",
      "created": "2023-08-08T22:29:26.87Z",
      "size": 617917969,
      "optimized": false,
      "compressed": false
    },
    {
      "instance": "test-error",
      "name": "backup_2023-08-08-16-1501",
      "created": "2023-08-08T23:16:09.232Z",
      "size": 516457141,
      "optimized": false,
      "compressed": false
    }
  ],
  "status": "Success",
  "status_code": 200,
  "type": "sync"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
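&lt;p&gt;The raw JSON above is easier to scan once &lt;code&gt;jq&lt;/code&gt; pulls out the interesting fields. A small sketch using a trimmed-down copy of the response so it is self-contained; in practice you would pipe the &lt;code&gt;curl&lt;/code&gt; call straight into &lt;code&gt;summarize&lt;/code&gt; (a helper name made up here).&lt;/p&gt;

```shell
# Summarize each backup as: instance, backup name, size in MiB
summarize() {
  jq -r '.metadata[] | "\(.instance)\t\(.name)\t\((.size / 1048576 | floor)) MiB"'
}

# Trimmed-down copy of the LXMIN response for a self-contained demo
cat <<'EOF' | summarize
{
  "metadata": [
    {"instance": "dilvm1", "name": "backup_2023-08-08-15-2618", "size": 888696535},
    {"instance": "new0",   "name": "backup_2023-08-08-17-5231", "size": 617889321}
  ],
  "status": "Success"
}
EOF
```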



&lt;h2&gt;
  
  
  Replication is not Backup
&lt;/h2&gt;

&lt;p&gt;As a DevOps engineer, I learned very early on that replication is not equal to backup (Replication != Backup). MinIO clusters are lightning fast, scale to hundreds or thousands of nodes and heal corrupted objects on the fly. What we are recommending is not to back up the data inside MinIO, but rather the configuration each node needs for MinIO to run and operate in the cluster. The goal is that if a node is degraded, we can restore it to a good state as soon as possible so it can rehydrate (using &lt;a href="https://min.io/docs/minio/linux/operations/concepts/erasure-coding.html?ref=blog.min.io"&gt;Erasure Coding&lt;/a&gt;) and participate in cluster operations again. The actual integrity and replication of the data is managed by the MinIO software itself.&lt;/p&gt;

&lt;p&gt;In addition to the API, LXMIN also includes a command line interface for local management of backups. This command line interface lets you script backup operations and build on top of LXMIN's functionality in any language of your choice.&lt;/p&gt;
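&lt;p&gt;As a sketch of what that looks like, the commands below back up, list and restore an instance from the listing above. The subcommand names follow the lxmin README; on this setup the binary was renamed &lt;code&gt;lxmin-aj&lt;/code&gt;, so substitute accordingly, and confirm the exact syntax with &lt;code&gt;--help&lt;/code&gt;.&lt;/p&gt;

```shell
# Requires the lxmin binary and a configured LXD host; skipped cleanly otherwise
if ! command -v lxmin >/dev/null 2>&1; then
  echo "lxmin not installed; skipping"
else
  lxmin backup dilvm1                              # back up instance 'dilvm1'
  lxmin list dilvm1                                # list its backups in the bucket
  lxmin restore dilvm1 backup_2023-08-08-15-2618   # restore a specific backup
fi
```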

&lt;p&gt;If you would like to know more about MinIO replication and clusters, give us a ping on &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt; and we’ll get you going!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>programming</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Architects Guide to the Modern Data Stack</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Wed, 08 Nov 2023 18:48:28 +0000</pubDate>
      <link>https://dev.to/sashawodtke/the-architects-guide-to-the-modern-data-stack-e14</link>
      <guid>https://dev.to/sashawodtke/the-architects-guide-to-the-modern-data-stack-e14</guid>
      <description>&lt;p&gt;By Brenna Buuck, Databases and Modern Datalakes SME, MinIO. This post first appeared on The New Stack.&lt;/p&gt;

&lt;p&gt;While its precise definition may be elusive, one thing is clear about the modern data stack: It's not your traditional, monolithic approach favored by giants of the past. The modern data stack is a dynamic ensemble of specialized tools, each excelling in a specific facet of data handling. It's a modular, shape-shifting ecosystem that accommodates the fluidity of technology and ever-changing business needs. &lt;/p&gt;

&lt;p&gt;Despite or perhaps because of this fluidity, the modern data stack does have some defining characteristics. It is cloud native, modular, performant, compatible with RESTful APIs, features decoupled compute and storage, and is open. Let’s look at those in a little more detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud native: Cloud native tools deliver unparalleled scalability, allowing organizations to seamlessly process and analyze vast data sets while maintaining high performance across diverse cloud environments. Whether it's the public clouds or private ones, the modern data stack is multicloud compatible, ensuring flexibility and avoiding vendor lock-in.&lt;/li&gt;
&lt;li&gt;Modular:  The modern data stack offers a buffet of specialized tools, each optimized for a specific data task. This modularity allows organizations to craft a customized data infrastructure tailored to their unique needs, promoting agility and adaptability in a rapidly evolving data landscape. &lt;/li&gt;
&lt;li&gt;Performant: Performance is at the core of the modern data stack. Its components are engineered for high performance, enabling organizations to process and analyze data efficiently. &lt;/li&gt;
&lt;li&gt;RESTful API compatibility is employed for smooth and standardized communication between stack components, promoting interoperability and enabling the creation of microservices that break up the stack into manageable components. An example of this is the pervasiveness of the &lt;a href="https://min.io/product/s3-compatibility?ref=blog.min.io"&gt;S3 API&lt;/a&gt; inside the stack. &lt;/li&gt;
&lt;li&gt;Decoupled compute: &lt;a href="https://blog.min.io/object-storage-primary-storage/#:~:text=One%20can%20make%20a%20good,matter%20how%20demanding%2C%20within%20reach."&gt;Decoupling compute from storage&lt;/a&gt; is a fundamental architectural principle of the modern data stack. This separation allows organizations to independently scale their computational resources and storage capacity, optimizing cost efficiency and performance. It also enables dynamic resource allocation, ensuring that computational power is matched to specific workloads.&lt;/li&gt;
&lt;li&gt;Open: The modern data stack champions openness by embracing open source solutions and &lt;a href="https://blog.min.io/a-developers-introduction-to-apache-iceberg-using-minio/"&gt;open table formats&lt;/a&gt;, dismantling proprietary silos and eradicating vendor lock-in. This commitment to openness fosters collaboration, innovation and data accessibility across a wide spectrum of platforms and tools, reinforcing the stack's adaptability and inclusivity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Shape of the Modern Data Stack
&lt;/h2&gt;

&lt;p&gt;Picture the modern data stack as a symphony orchestra, with each instrument playing its part while following the conductor, Kubernetes, to create a harmonious data experience. While the players may change, the components remain constant: data integration, storage, transformation, data observability, data discovery, data visualization, data analytics and machine learning and AI. Let's delve into each of these areas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sch63A5---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2caf8xo42vty9te7fld9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sch63A5---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2caf8xo42vty9te7fld9.png" alt="Image description" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage
&lt;/h2&gt;

&lt;p&gt;Object storage plays a crucial role in the modern data stack. Object storage offers a scalable, performant and flexible storage solution for the ever-increasing volume of data. The stack's agility is enhanced by object storage, as the best-of-breed object storage can be deployed across diverse infrastructures, underscoring the importance of software-defined storage.&lt;/p&gt;

&lt;p&gt;Storage increasingly performs an active role, seamlessly integrating with elements in the rest of the stack and serving as the backbone for &lt;a href="https://min.io/solutions/modern-data-lakes-lakehouses?ref=blog.min.io"&gt;lakehouse architectures&lt;/a&gt;. Lakehouses, like those built using MinIO and &lt;a href="https://blog.min.io/lakehouse-architecture-iceberg-minio/"&gt;Iceberg&lt;/a&gt;, &lt;a href="https://blog.min.io/streaming-data-lakes-hudi-minio/"&gt;Hudi&lt;/a&gt; and &lt;a href="https://blog.min.io/delta-lake-minio-multi-cloud/"&gt;Delta Lake&lt;/a&gt;, exemplify this use case perfectly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Integration
&lt;/h2&gt;

&lt;p&gt;Ingest is the bridge that connects disparate data sources. Modern &lt;a href="https://min.io/product/integrations?ref=blog.min.io"&gt;data integration&lt;/a&gt; tools embrace the ethos of flexibility and democratization. They don't hoard data in proprietary silos; instead, they facilitate data accessibility, irrespective of where data resides. Whether it's in the public cloud, private cloud, on bare-metal infrastructure or at the edge, data integration tools break down the barriers that once kept data isolated.&lt;/p&gt;

&lt;p&gt;One noteworthy player in this realm is &lt;a href="https://nifi.apache.org/?ref=blog.min.io"&gt;Apache NiFi&lt;/a&gt;, an open source data integration tool that orchestrates data flows with ease. It's &lt;a href="https://blog.min.io/minio-events-with-apache-nifi/"&gt;object storage-friendly&lt;/a&gt;, ensuring your data can seamlessly traverse various environments. &lt;a href="https://blog.min.io/apache-airflow-minio/"&gt;Airflow&lt;/a&gt; is another obvious performer in this space: an open source platform designed for orchestrating, scheduling and monitoring complex data workflows, making it easier to manage and automate data-related tasks.&lt;/p&gt;

&lt;p&gt;The older pattern of data integration involving actual data movement has been largely unseated by the concept of integrating in place. This paradigm shift represents not just a change in the way we manage data but a fundamental transformation in how we approach data freedom, accessibility and agility. Data in the modern data stack belongs to you, not to proprietary systems. The entity that reaps the benefits must be you and your organization, not a multinational company selling an outdated relational database management system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transformation
&lt;/h2&gt;

&lt;p&gt;While there may be some overlap between transformation and data integration applications, it's important to note the existence of highly specialized transformation tools like &lt;a href="https://blog.min.io/spark-minio-kubernetes/"&gt;Apache Spark&lt;/a&gt; and &lt;a href="https://www.getdbt.com/?ref=blog.min.io"&gt;DBT&lt;/a&gt;. These tools serve a distinct purpose, allowing data engineers and analysts to modify and refine data before it's used by downstream applications within the stack. With object storage as both the source and destination for data, these tools ensure that data remains &lt;a href="https://blog.min.io/the-new-metrics-of-object-storage/"&gt;consistent&lt;/a&gt;, accessible and reliable throughout the transformation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Observability
&lt;/h2&gt;

&lt;p&gt;Ensuring data reliability and quality is paramount in the modern data stack. Data observability tools act as the watchful guardians, offering insights into the health and behavior of your data pipelines. These tools not only monitor but also detect anomalies, helping you maintain data integrity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.min.io/opentelemetry-flask-prometheus-metrics/"&gt;Prometheus&lt;/a&gt;, a popular observability tool, empowers you to gain deep insights into your data infrastructure, providing the necessary observability along with the &lt;a href="https://blog.min.io/the-architects-guide-to-the-modern-data-stack/#:~:text=along%20with%20the-,S3%20compatibility,-that%20is%20the"&gt;S3 compatibility&lt;/a&gt; that is the standard for the modern data stack. &lt;a href="https://min.io/docs/minio/container/operations/monitoring/grafana.html?ref=blog.min.io"&gt;Grafana&lt;/a&gt;, while often associated with infrastructure and application monitoring, can also be extended to monitor data pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Discovery
&lt;/h2&gt;

&lt;p&gt;Tools like Apache Atlas and &lt;a href="https://cdn.collibra.com/Community/Documentation/2022.09/Collibra-DIC-Edge-Guide-2022.09.pdf?ref=blog.min.io"&gt;Collibra&lt;/a&gt; provide the means to catalog and discover data assets across the organization. Integrating with object storage repositories ensures that all data, regardless of its location, can be discovered and used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Visualization
&lt;/h2&gt;

&lt;p&gt;Data visualization tools turn raw data into meaningful and actionable insights. They enable users to craft compelling stories, uncover patterns and make data-driven decisions. These tools thrive on accessibility, ensuring that data is within reach for everyone, not just data scientists or analysts. Here again, we see the prevalent use of RESTful APIs used to connect to data in the stack.&lt;/p&gt;

&lt;p&gt;Tools like Tableau, Power BI, Looker and &lt;a href="https://blog.min.io/how-to-druid-superset-minio/"&gt;Apache SuperSet&lt;/a&gt; lead the way in this category, offering insights on data wherever it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Analytics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.min.io/databases-for-object-storage/"&gt;Object storage is primary storage&lt;/a&gt; for online analytical processing (OLAP) analytical databases. This forward-looking approach, adopted by analytical giants like &lt;a href="https://min.io/solutions/snowflake?ref=blog.min.io"&gt;Snowflake&lt;/a&gt;, &lt;a href="https://min.io/solutions/sqlserver?ref=blog.min.io"&gt;SQL Server&lt;/a&gt; and &lt;a href="https://min.io/resources/docs/Teradata-solution-brief.pdf?ref=blog.min.io"&gt;Teradata&lt;/a&gt; hinges on the concept of queryable tables that eliminate the need for data migration and allows these highly performant databases to focus their energies on query performance instead of storage. This trend follows the next logical step with smaller, lightweight analytics engines like &lt;a href="https://blog.min.io/duckdb-and-minio-for-a-modern-data-stack/"&gt;DuckDB&lt;/a&gt; that have completely ceded storage and instead rely only on in-memory processes to further accelerate data analytics workloads.&lt;/p&gt;

&lt;p&gt;Cloud native analytics platforms that pursue the advantages of object storage of scale, performance and cost effectiveness are revolutionizing the way enterprises extract value from their data. It's not just a technological shift; it's a strategic imperative for organizations seeking to stay competitive in today's data-driven world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Machine Learning and AI
&lt;/h2&gt;

&lt;p&gt;Now more than ever, Machine Learning (ML) and AI have a prominent place in the modern data stack, driving transformative insights and decision-making capabilities. ML frameworks like &lt;a href="https://blog.min.io/hyper-scale-machine-learning-with-minio-and-tensorflow/"&gt;TensorFlow&lt;/a&gt; and PyTorch take center stage, showcasing their capacity to hyperscale when integrated with highly performant object storage. This powerful synergy not only accelerates the training and inference phases of ML models but also amplifies the agility of AI-driven applications, allowing organizations to harness the potential of their data for &lt;a href="https://blog.min.io/anomaly-detection-from-log-files-the-performance-at-scale-use-case/"&gt;anomaly detection&lt;/a&gt;, natural language processing, &lt;a href="https://blog.min.io/object-detection-minio-yolo/"&gt;computer vision&lt;/a&gt; and more. In this era of data-driven innovation, ML and AI have become indispensable pillars, reshaping industries and unlocking new possibilities for businesses willing to explore the frontiers of intelligent automation and data-driven decision support backed by powerful object storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These contenders for the modern data stack aren’t the end-all be-all options for the enterprise architect. There are plenty that have been left out and plenty more we have yet to explore, but the categories should be the takeaway for readers.  The modern data stack will continue to evolve, embracing new tools and technologies. The constant, however, is its requirements around scale, performance, data accessibility, modularity and flexibility. &lt;/p&gt;

&lt;p&gt;At MinIO, we view these pillars as engineering-first principles. In fact, we think of ourselves as more of a data company than a storage company. We aim to be part of the overall data orchestra, enabling large-scale pieces as well as improvisation.&lt;/p&gt;

&lt;p&gt;Keep exploring, keep innovating, and keep unlocking the limitless potential of your data. The modern data stack is your symphony, and you are the composer. You can drop us a note on &lt;a href="https://minio.slack.com/ssb/redirect?ref=blog.min.io"&gt;Slack&lt;/a&gt; or send us an email at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; if you have any questions or ideas on what belongs in the modern data stack.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>devops</category>
      <category>cloud</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>If You're Having a Hard Time Migrating to the Cloud, You're Doing It Wrong</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Wed, 01 Nov 2023 19:47:39 +0000</pubDate>
      <link>https://dev.to/sashawodtke/if-youre-having-a-hard-time-migrating-to-the-cloud-youre-doing-it-wrong-j6g</link>
      <guid>https://dev.to/sashawodtke/if-youre-having-a-hard-time-migrating-to-the-cloud-youre-doing-it-wrong-j6g</guid>
      <description>&lt;p&gt;By Brenna Buuck, MinIO, Databases and Modern Datalakes SME &lt;/p&gt;

&lt;p&gt;Cloud migration is a strategic imperative for businesses of all sizes. However, many organizations struggle to successfully migrate their applications and data to the cloud. Part of the reason for this struggle is a fundamental misunderstanding of what the cloud is. Too many believe that it is a physical location – you sign up for a public cloud provider and like magic all your software becomes cloud-native. The result? They find themselves locked into inappropriate infrastructures and stuck with sky-high bills.&lt;/p&gt;

&lt;p&gt;In reality, the cloud should be viewed not as a location, but as a set of processes and procedures for how you want your technology to work. This operating model consists of guiding principles that empower businesses to build and operate simple, standards-compliant, automatable, portable data tools and services. The cloud operating model is open and standards-based, and at first, it may be difficult for enterprise technologists to step outside the age-old paradigm of being sold expensive, locked-down, restrictive ecosystems that only partially meet their needs.&lt;/p&gt;

&lt;p&gt;It should not be that hard and the benefits are immense. When you change your mindset, you open multiple paths forward creating valuable optionality for your company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redefining the Cloud: An Operating Model
&lt;/h2&gt;

&lt;p&gt;The key to a successful cloud migration is to design systems that can be deployed anywhere: not only on public clouds, but also on private clouds, bare metal or at the edge. It’s been said that if your entire infrastructure can’t be deployed with a single Kubernetes YAML multiple times a day in different locations, you haven’t truly migrated to the cloud. &lt;/p&gt;

&lt;p&gt;Relying on proprietary services, hardware, or software dependencies is counterproductive as it limits flexibility and control. The key to a successful cloud migration is demanding portable data solutions that unlock the full potential of the cloud through enhanced flexibility and adaptability.&lt;/p&gt;

&lt;h2&gt;The Role of Tools in Successful Cloud Migration&lt;/h2&gt;

&lt;p&gt;Your choice of tools can make or break your cloud migration. It's not an overstatement to say that this decision is a determining factor in your project's success. Luckily, with this updated view of the cloud as a set of principles, it becomes easier to make these critical decisions – if only through the power of elimination. When tools and platforms are held to the standard of a successful cloud operating model, the field dramatically narrows with only a few able to meet expectations and survive the culling. When new tools and services enter the field, it’s easy to assess their value: they either represent your cloud operating model or they don’t. The best part is that all the pieces of this model work together by design.&lt;/p&gt;

&lt;p&gt;Your stack should have the following characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-Native: Your tools should embrace cloud-native practices, not just run in the cloud. That means they must be purpose-built for &lt;a href="https://blog.min.io/scaling-minio-more-hardware-for-higher-scale/"&gt;scale&lt;/a&gt; and &lt;a href="https://blog.min.io/minio-multi-site-active-active-replication/"&gt;resilience&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Modular: Pick and choose best-of-breed tools to meet your requirements instead of being locked into subpar vendor offerings just because they happen to be bundled together.&lt;/li&gt;
&lt;li&gt;Performant: &lt;a href="https://blog.min.io/nvme_benchmark/"&gt;Speed&lt;/a&gt; and efficiency are paramount. Choose tools that take a software-first approach and prioritize performance and developer experience.&lt;/li&gt;
&lt;li&gt;Compatibility with RESTful APIs: Interoperability is non-negotiable. Your tools must speak the same language, and this language is S3 for cloud storage.&lt;/li&gt;
&lt;li&gt;Decoupled Compute and Storage: &lt;a href="https://blog.min.io/three-applications-to-start-your-disaggregation-journey/"&gt;Decoupling&lt;/a&gt; these elements gives you more flexibility and scalability and allows the services and tools you use to focus on what they do best. &lt;/li&gt;
&lt;li&gt;Embracing Open Standards: Open standards promote interoperability and future-proof your investments. This means not only open-source, but open table formats like &lt;a href="https://blog.min.io/a-developers-introduction-to-apache-iceberg-using-minio/"&gt;Apache Iceberg&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only tools that survive these rigorous tests deserve to be part of a successful cloud strategy. &lt;/p&gt;

&lt;h2&gt;Navigating Multi-Cloud Environments&lt;/h2&gt;

&lt;p&gt;The cloud operating model is even more important for organizations seeking to deploy applications in multi-cloud environments. A multi-cloud environment is one in which applications are deployed across multiple cloud providers - public and private. This can give organizations more flexibility and choice, but it can also make management more complex.&lt;/p&gt;

&lt;p&gt;When you have true ownership over your data and your tools, navigating multi-cloud environments becomes infinitely more manageable. You are free to move tools and services around to suit your budget and requirements. For example, applications that use MinIO as a backend just need to have their S3 endpoint reconfigured when they move. Remember the cloud is a mindset, not a location, and multi-cloud deployments test and prove this axiom true every day. &lt;/p&gt;
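&lt;p&gt;A minimal sketch of what that reconfiguration looks like in practice: the application reads its S3 endpoint from the environment instead of hard-coding it, so moving between clouds becomes a one-variable change. (The variable name and endpoints below are hypothetical.)&lt;/p&gt;

```python
import os
from urllib.parse import urlsplit

def s3_endpoint() -> str:
    """Return the object-store endpoint the application should use.

    Reading it from the environment means a cloud move is a config
    change, not a code change. Variable name is illustrative.
    """
    endpoint = os.environ.get("S3_ENDPOINT", "https://play.min.io:9000")
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"invalid S3 endpoint: {endpoint!r}")
    return endpoint

# On-prem MinIO today, a public-cloud endpoint tomorrow; same code path.
os.environ["S3_ENDPOINT"] = "https://minio.internal.example.com"
print(s3_endpoint())
```

&lt;p&gt;Everything else in the storage layer stays untouched, because every endpoint speaks the same S3 API.&lt;/p&gt;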

&lt;h2&gt;The Future of Cloud Migration&lt;/h2&gt;

&lt;p&gt;The logical progression of this mindset has driven organizations to embrace &lt;a href="https://blog.min.io/object-storage-primary-storage/"&gt;object storage as their primary storage solution&lt;/a&gt;. This choice is primarily motivated by object storage's remarkable capacity to deliver exceptional performance at scale, in perfect harmony with the prevailing cloud operating model detailed earlier. Recent research underscores this shift, with an astounding &lt;a href="https://www.techtarget.com/searchstorage/infographic/Object-based-storage-gains-primary-storage-traction?ref=blog.min.io"&gt;80% of respondents recognizing object storage as capable of supporting their most critical IT initiatives&lt;/a&gt;. This trend has even infiltrated the traditionally dominant domains of SAN-based block storage and NAS-based file storage: &lt;a href="https://blog.min.io/databases-for-object-storage/"&gt;databases&lt;/a&gt;. Today, the only innovation in software we see is cloud-native. Innovation cannot exist without adhering to the cloud operating principles described above.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The evolution of thinking about cloud migration from a location-centric mindset to an operating model is shaping a new future for organizations where they are free to innovate and collaborate. This paradigm shift emphasizes principles that simplify operations and foster portability, enabling cloud systems to be deployed across various environments, including multi-cloud setups. A vital aspect of this transformation is the adoption of object storage as the primary storage solution. Object storage's scalability, performance, and alignment with the cloud operating model make it a pivotal choice and showcases a promising path forward in cloud migration.&lt;/p&gt;

&lt;p&gt;If you have any questions or need assistance with your cloud migration strategy with MinIO, feel free to contact us at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt; or join our &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt; community for support.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>opensource</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Building a Scalable, Data Sovereign National ID System</title>
      <dc:creator>Sasha Bonner Wodtke</dc:creator>
      <pubDate>Thu, 26 Oct 2023 05:08:05 +0000</pubDate>
      <link>https://dev.to/sashawodtke/building-a-scalable-data-sovereign-national-id-system-42m4</link>
      <guid>https://dev.to/sashawodtke/building-a-scalable-data-sovereign-national-id-system-42m4</guid>
      <description>&lt;p&gt;By Brian Costa, Field CTO &amp;amp; Executive, MinIO &lt;/p&gt;

&lt;p&gt;Some of the smartest minds in philanthropy are backing the concept of a simple yet powerful national ID system. The &lt;a href="https://www.gatesfoundation.org/?ref=blog.min.io"&gt;Bill and Melinda Gates Foundation&lt;/a&gt;, the &lt;a href="https://www.tatatrusts.org/?ref=blog.min.io"&gt;Tata Trusts&lt;/a&gt;, &lt;a href="https://omidyar.com/?ref=blog.min.io"&gt;the Omidyar Network&lt;/a&gt; and the &lt;a href="https://eecs.iisc.ac.in/initiatives/pratiksha-trust-initiative-in-brain-computation-and-data/?ref=blog.min.io"&gt;Pratiksha Trust&lt;/a&gt; have all gotten involved with this movement because of its foundational capabilities for enabling a wide range of social programmes. They have put their resources behind an &lt;a href="https://github.com/mosip?ref=blog.min.io"&gt;open source project called MOSIP&lt;/a&gt; and it is quietly remaking national identity across Africa and Asia:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j84jqmgz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ku8boitopr8zvebbd4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j84jqmgz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2ku8boitopr8zvebbd4b.png" alt="Image description" width="800" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A national ID system is a centralized database that stores information about all citizens and legal residents of a country. This information can include name, date of birth, address, photograph, fingerprints, and other biometric data. Core applications include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enhancing security:&lt;/strong&gt; These systems can help to prevent identity theft and fraud by providing a secure and reliable way to verify a person's identity. This is important for a number of purposes, such as opening a bank account, applying for a job, or voting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Promoting efficiency:&lt;/strong&gt; A national ID system can help to streamline government services by making it easier for citizens and legal residents to access them. For example, a national ID card can be used to verify a person's identity when applying for a driver's license, passport, or other government document.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encouraging financial inclusion:&lt;/strong&gt; A national ID system can help to make financial services more accessible to people who have previously been excluded from the formal financial system. This is because a national ID card can be used to open a bank account or obtain a loan, even if the person does not have other documentation, such as a birth certificate or marriage certificate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improving access to healthcare:&lt;/strong&gt; A national ID system can help to improve access to healthcare by making it easier for people to register with a doctor or hospital. This is important for people who move frequently or who do not have other documentation, such as a permanent address.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are other ancillary benefits as well. These include reducing corruption by curbing identity fraud, boosting economic growth through access to the financial system, and reducing friction while increasing social cohesion. &lt;/p&gt;

&lt;p&gt;MOSIP can obviously work in the greenfield model where the program is built from scratch and each of the open source modules are customized - from pre-registration to issuance and verification. MOSIP also works in a brownfield model where the open source modules are integrated with existing databases or identity systems. &lt;/p&gt;

&lt;p&gt;At its heart, this is a technology problem and one where MinIO is deeply embedded. Indeed, MOSIP ultimately recommends two deployment options: MinIO for countries that keep their data within their borders, and AWS when that data is permitted to leave the country and go to the cloud. &lt;/p&gt;

&lt;p&gt;We want to document how this architecture goes together and why a MinIO-based data sovereign approach matters. &lt;/p&gt;

&lt;p&gt;The following aspects are important when considering a storage platform capable of handling a national ID program:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong Security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By its nature, a national ID program stores the most sensitive data imaginable: personally identifiable information including personal data, images, and biometric data. The &lt;a href="https://blog.min.io/s3-security-access-control/"&gt;highest levels of data security&lt;/a&gt; are required when dealing with such data. MinIO provides enterprise-level encryption for data both in flight, using TLS, and at rest, using keys stored in external key management systems.&lt;/p&gt;

&lt;p&gt;MinIO supports Transport Layer Security (TLS) v1.2+ between all components in the cluster. This approach ensures there are no weak links in either inter or intra-cluster encrypted traffic. TLS is a ubiquitous encryption framework: it’s what puts the s in https and is the same encryption protocol used by banks, e-commerce sites and other enterprise-grade systems that rely on data storage encryption.&lt;/p&gt;

&lt;p&gt;MinIO’s state-of-the-art encryption schemes support granular object-level encryption using modern, industry-standard encryption algorithms, such as AES-256-GCM, ChaCha20-Poly1305, and AES-CBC. MinIO is fully compatible with S3 encryption semantics, and also extends S3 by including support for non-AWS key management services such as HashiCorp Vault, Gemalto KeySecure, and Google Secret Manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://min.io/product/enterprise-object-storage-encryption?ref=blog.min.io"&gt;Enterprise Grade Object Storage Encryption&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easily Expandable Capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Populations tend to grow with time, and the types and amount of data stored per ID are likely to grow as well. Typically a MOSIP deployment starts with a few million IDs and grows from there to the hundreds of millions. Capacity, and the ability to easily scale it, becomes critical for a deployment such as a national ID program. MinIO’s &lt;a href="https://blog.min.io/erasure-coding/"&gt;erasure coding&lt;/a&gt; provides very efficient storage with the ability to survive the loss of drives and nodes. Capacity is easily scaled through the use of &lt;a href="https://blog.min.io/server-pools-streamline-storage-operations/"&gt;Server Pools&lt;/a&gt;, which eliminate the need to rebalance data - a legacy approach that is both expensive and time consuming. Scaling in pools allows growth from terabytes to petabytes as needs change.&lt;/p&gt;
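&lt;p&gt;For intuition, the capacity-versus-parity trade-off behind erasure coding can be sketched with simple arithmetic (the drive count, drive size, and EC:4 parity below are assumed for illustration, not a substitute for proper sizing):&lt;/p&gt;

```python
def usable_capacity(drives: int, drive_tib: float, parity: int):
    """Usable capacity and drive-failure tolerance of one erasure set.

    Each object is striped across `drives` shards, `parity` of which
    are redundancy, so reads survive up to `parity` lost drives.
    """
    data_shards = drives - parity
    return data_shards * drive_tib, parity

# Hypothetical 16-drive erasure set of 8 TiB drives with EC:4 parity:
usable, tolerated = usable_capacity(16, 8.0, 4)
print(f"{usable} TiB usable, survives {tolerated} drive failures")
# 96.0 TiB usable out of 128 TiB raw
```

&lt;p&gt;Compared with triple replication (which would yield roughly 42 TiB usable from the same raw capacity), erasure coding is a far more efficient way to buy resiliency.&lt;/p&gt;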

&lt;p&gt;&lt;a href="https://min.io/product/scalable-object-storage?ref=blog.min.io"&gt;Scalable Object Storage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent High-Throughput Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the number of IDs grows, and the data per ID grows, throughput becomes critical to a successful national program in order to ensure a speedy user experience. Each ID record is made up of a large number of individual data items, so aggregate throughput matters. MinIO is a high-performance object storage system designed for these types of workloads. When properly deployed across 32 nodes, MinIO can deliver sustained read throughput of over 320 GiB per second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.min.io/nvme_benchmark/"&gt;MinIO NVMe Benchmark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Replication for BC/DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At-scale replication is the only rational way to provide data resiliency across sites. The time it takes to back up or restore even a small quantity of data, for example 10 TiB, across a slow network is unacceptable for almost all use cases, and certainly unacceptable for a government ID system that needs to be available 24/7. Active-Active Replication for object storage is a key requirement for mission-critical production environments. MinIO is the only vendor that offers synchronous, object-level replication to multiple sites today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://min.io/product/active-data-replication-for-object-storage?ref=blog.min.io"&gt;Active Active Replication for Object Storage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficient and Cost-Effective Deployment and Growth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The required storage per ID varies with the MOSIP modules that are deployed, the types of data stored, the resolution of the images and the biometric data items.  Please consult MOSIP for a more accurate &lt;a href="https://docs.mosip.io/1.1.5/build-and-deploy/hardware-sizing?ref=blog.min.io"&gt;prediction of storage needs&lt;/a&gt;. That being said, when a deployment starts at a few million IDs the data storage required is typically on the order of 10 TiB. When it grows to 100 million IDs, the storage requirements can be on the order of 1 PiB, and the throughput requirements grow proportionately. A &lt;a href="https://min.io/product/reference-hardware?ref=blog.min.io"&gt;typical initial deployment&lt;/a&gt; for MinIO, able to handle a few million IDs, would consist of 4 nodes with 4 drives each.&lt;/p&gt;
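&lt;p&gt;A quick sanity check on those figures: they imply single-digit MiB of stored data per enrolled ID (the 3-million midpoint for "a few million" is an assumption made here for the arithmetic):&lt;/p&gt;

```python
TIB = 2**40  # tebibyte in bytes
PIB = 2**50  # pebibyte in bytes
MIB = 2**20  # mebibyte in bytes

def mib_per_id(total_bytes: int, ids: int) -> float:
    """Average stored bytes per enrolled ID, expressed in MiB."""
    return total_bytes / ids / MIB

# ~10 TiB at an assumed 3 million IDs; ~1 PiB at 100 million IDs.
print(round(mib_per_id(10 * TIB, 3_000_000), 1))    # 3.5 MiB per ID
print(round(mib_per_id(1 * PIB, 100_000_000), 1))   # 10.7 MiB per ID
```

&lt;p&gt;The growth in per-ID footprint as the program matures is consistent with richer images and biometric data accumulating over time.&lt;/p&gt;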

&lt;p&gt;MinIO runs on &lt;a href="https://blog.min.io/selecting-hardware-for-minio-deployment/"&gt;commodity hardware&lt;/a&gt;, so any vendor is fine. As an example, below is a four-node 2U system (4 separate CPUs, RAM, and drive sets in a single 2U chassis) from Supermicro that makes deployment easy. This unit supports up to 6 disks per CPU for a total of 24 drives in the chassis. Hardware like this makes it easy and cost-effective to deploy enterprise-grade object storage using MinIO in order to support MOSIP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.min.io/supermicro-minio-rack-optimized-storage/"&gt;Supermicro GrandTwin™ SuperServer and MinIO&lt;/a&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.supermicro.com/en/products/system/grandtwin/2u/sys-211gt-hntr?ref=blog.min.io"&gt;Supermicro GrandTwin™ SuperServer SYS-211GT-HNTR 2U server enclosure&lt;/a&gt; is a dense, rack-optimized platform for deploying MinIO object storage. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wU0NDMtY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp0vxr8b9s9vnw1o5p78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wU0NDMtY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp0vxr8b9s9vnw1o5p78.png" alt="Image description" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the MOSIP deployment grows, additional units can be added to scale MinIO storage capacity. Three of the above-referenced Supermicro units deployed across 3 racks would provide exceedingly fast access to over 3 PiB of storage. MinIO is resilient: &lt;a href="https://blog.min.io/erasure-coding/"&gt;erasure coding&lt;/a&gt; would allow for the loss of 96 drives, or 4 servers, or 1 rack, while still maintaining full functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bringing a National ID Program to Life&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As national ID systems proliferate for all the benefits outlined above, and the quantity of data they store grows, it becomes incumbent on governments to store that data in object storage that is secure, cost-effective, rapidly scalable, and high-performance. &lt;/p&gt;

&lt;p&gt;Governments and NGOs can’t take risks when it comes to national ID systems because citizen data is too valuable and sensitive to lose. Implementing a national ID system such as MOSIP, built on an enterprise-class object storage system such as MinIO, sets both the deployment and the overall program up for success.&lt;/p&gt;

&lt;p&gt;MinIO is always available for a call to discuss your object storage needs and growth path, and to work with your hardware vendor to ensure a proper object storage deployment. Reach out to us on &lt;a href="https://slack.min.io/?ref=blog.min.io"&gt;Slack&lt;/a&gt; or email us at &lt;a href="mailto:hello@min.io"&gt;hello@min.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>webdev</category>
      <category>data</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
