🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - GenAI, Hold the Waste: How H2O Fixed Its Storage Bottleneck (STG204)
In this video, H2O.ai discusses how it cut its EBS footprint from 2 petabytes to under 1 petabyte using Datafy's autonomous storage solution. H2O.ai, a leader in enterprise AI, faced overprovisioned EBS storage running at just 25% utilization. Datafy's agent-based solution automatically scales EBS volumes up and down without downtime, integrating seamlessly with Kubernetes, Bottlerocket, and Terraform. The implementation reached 80% capacity utilization while preserving security and the existing Velero backup process, delivering significant cost savings with a zero-downtime deployment.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
H2O.ai's EBS Storage Challenge: Scaling Issues with 2 Petabytes of Underutilized Cloud Storage
Let's get started. H2O.ai is a global leader in the enterprise AI realm, doing both generative and predictive AI. We deliver an entire platform that can be installed on premises, in clouds, and in air-gapped environments, and H2O.ai is a leader on the GAIA benchmark.
Our technology stack runs entirely on Kubernetes. Everything runs in EKS in the cloud, and we rely very heavily on EBS storage, because when we train our models we need very fast, reliable storage for our AI engines. The problem we had at H2O.ai with EBS was a lot of unutilized storage: we were provisioning over 2 petabytes, and it was growing very quickly. We couldn't scale well. We couldn't scale down, we couldn't be more efficient with our cloud storage, and we ended up wasting a lot of storage in the cloud.
We looked at this issue and evaluated several solutions. Most of the solutions we found required us to migrate data from our existing EBS volumes to new ones, which was a hard and painful process. That was the situation before we started working with Datafy.
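To make the scale-down pain concrete: AWS lets you grow an EBS volume in place with a single ModifyVolume call, but there is no API to shrink one, so reclaiming space means creating a smaller volume and migrating data onto it. A minimal boto3 sketch (volume IDs, sizes, and region are illustrative):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Growing an EBS volume in place is a single API call...
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=500)  # grow to 500 GiB

# ...but there is no equivalent call to shrink one. Reclaiming
# overprovisioned space means creating a smaller volume, copying the
# data across, and re-attaching it -- the painful migration described above.
smaller = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=200,            # the right-sized capacity
    VolumeType="gp3",
)
# (The data copy and the detach/attach of the old volume are omitted;
# that step is exactly what made the solutions H2O.ai evaluated so painful.)
```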
Datafy's Autonomous Storage Solution: Dynamic Auto-Scaling for EBS Without Downtime
When we first met H2O.ai, we were pleasantly surprised, because the problems Ophir described are exactly what we set out to solve with Datafy: large EBS capacity, overprovisioned and underutilized. Datafy is an autonomous storage solution that manages cloud storage automatically for AWS customers. You can deploy it on the fly, and it will adjust your EBS volume capacity automatically, auto-scaling it based on your needs. If you fill a volume up, it grows automatically; if you delete files, it shrinks automatically. It has no impact on performance, and you can deploy it in real time.
With Datafy, you get dynamic auto-scaling: the solution is completely autonomous and overcomes the EBS limitations Ophir mentioned that prevented him from consuming EBS efficiently. It lets you scale your EBS capacity endlessly, both up and down, as customers write data, delete it, and cycle through growth and deletion. With Datafy there is no downtime at all; we've always believed customers will not tolerate downtime in their applications just to improve storage utilization.
You can install or uninstall it without any impact on your applications, your file system, or anything else running in your stack. Furthermore, no changes to your stack are needed: it integrates seamlessly with Kubernetes, CloudFormation, and Terraform, and it supports any Linux operating system and any tech stack you run on top of it. So you never need to tell your customers you're taking downtime to do something with the storage. The whole thing happens automatically and seamlessly, without any intervention from you or your application owners.
How does Datafy work? It is based on a low-level agent that you install on your EC2 servers or your containerized clusters. This agent manages the underlying EBS volumes automatically and dynamically, without impacting your applications. In addition, we have a SaaS control plane, a backend that runs in a VPC. It monitors the agents and issues commands such as growing or shrinking volumes. It also provides analytics, showing you exactly what is going on with your EBS deployments, how much storage you're consuming, how efficient you are, and how much efficiency Datafy brought to the table.
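The talk does not go into the internals of that control plane, but the grow/shrink behavior it describes can be pictured as a simple threshold loop. A hypothetical sketch (thresholds, message format, and function names are assumptions, not Datafy's actual protocol):

```python
# Hypothetical sketch of the control-plane decision loop described above.
# Datafy's actual implementation is not public; thresholds, names, and the
# agent protocol here are illustrative assumptions.

GROW_AT = 0.80    # target utilization; grow before a volume fills up
SHRINK_AT = 0.50  # reclaim capacity once utilization drops well below target

def reconcile(agent_report: dict) -> dict | None:
    """Turn an agent's utilization report into a grow/shrink command."""
    used, provisioned = agent_report["used_gib"], agent_report["provisioned_gib"]
    utilization = used / provisioned
    if utilization >= GROW_AT:
        # Stay ahead of writes: add headroom so the app never hits 100%.
        return {"action": "grow", "target_gib": int(used / GROW_AT * 1.25)}
    if utilization <= SHRINK_AT:
        # Give capacity back so you stop paying for empty gigabytes.
        return {"action": "shrink", "target_gib": int(used / GROW_AT)}
    return None  # within the buffer zone: do nothing

# Example: a 1000 GiB volume holding only 250 GiB gets shrunk toward
# ~312 GiB, which lands utilization back at the 80% target.
print(reconcile({"used_gib": 250, "provisioned_gib": 1000}))
```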
Finally, we've put a lot of effort into integrating with infrastructure-as-code environments, Kubernetes among them. This lets you use the product without making any modifications, and it integrates fully into your CI/CD lifecycle.
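In Kubernetes terms, that kind of no-modification integration typically means applications keep requesting PersistentVolumeClaims exactly as before, with only the storage class routing provisioning through the managed volumes. A hedged sketch using the official Python client (the `datafy-ebs` class name is hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Applications keep requesting storage exactly as before; only the
# StorageClass routes provisioning through the managed volumes.
# "datafy-ebs" is a hypothetical class name used here for illustration.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="datafy-ebs",
        resources=client.V1ResourceRequirements(
            requests={"storage": "100Gi"}
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```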
Integration Challenges: Bottlerocket, Security, and Velero Backup Compatibility
When we started working with Datafy, we encountered a few challenges we needed to solve to streamline the process and ensure everything worked with our existing platform. The first was Bottlerocket. Together with Datafy, we made sure the Datafy agent runs on our existing Bottlerocket infrastructure in EKS. This integration was critical to maintaining our current operational setup.
The second challenge was security. As an AI platform, we host our customers' data, and we wanted to make sure that working with Datafy maintained the same level of security as before. The key point is that no data leaves the cluster or EKS; only management operations happen externally, so we keep the same level of security and reliability for our customers' data. The last challenge was keeping our existing backup implementation. We have a solution that uses Velero to take backups and send them to backup PVs. Together with Datafy, we ensured we could keep using the existing Velero solution and seamlessly continue backing up and restoring data volumes without any issues.
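Since Velero backups are ordinary Kubernetes custom resources that reference PVCs, a solution operating below the PVC layer can leave them untouched. As a rough illustration, an existing Backup definition like the following would keep working as-is (names, namespaces, and the TTL are illustrative, not H2O.ai's actual configuration):

```python
from kubernetes import client, config

config.load_kube_config()

# Velero backups are ordinary custom resources; because the storage
# management happens below the PVC layer, the definition is unchanged.
backup = {
    "apiVersion": "velero.io/v1",
    "kind": "Backup",
    "metadata": {"name": "nightly-pv-backup", "namespace": "velero"},
    "spec": {
        "includedNamespaces": ["h2o-platform"],  # illustrative namespace
        "snapshotVolumes": True,
        "ttl": "720h0m0s",
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="velero.io", version="v1",
    namespace="velero", plural="backups", body=backup,
)
```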
Optimization Results: From 25% to 80% Capacity Utilization and Cutting Storage Costs in Half
We thought it would be nice to show you an example of the optimization Datafy brought to H2O. The left-hand chart shows capacity utilization. Customers typically have low capacity utilization, especially on EBS: they tend to overprovision because they are afraid of running out of space and not being able to grow in time. At the beginning of the chart, capacity utilization is just 25%, which means they are paying 4x more than they would if they could pay for what they actually use rather than for reserved capacity, which is how EBS bills today.
As Datafy is deployed across more and more environments, the utilization chart keeps growing and improving until it stabilizes around 80%, which is the natural level where we want to hold a buffer. We don't want it to hit 100%, because that would mean running out of space. Stabilizing around 80% is quite typical, and it is what we consider our success criterion. The right-hand chart shows what is happening with the capacity itself. When we started out, their total capacity footprint was about 2 petabytes, but the green line shows that only 0.5 petabyte of data was actually written to it. That is why we say the capacity utilization is 25%.
As time progressed with Datafy in use, something really interesting happens. Even though the green line grows a bit because they wrote more data, the blue line, the capacity they are paying for on EBS, keeps dropping as Datafy scales capacity down. Over time, the blue line nearly converges on the green line, landing at the 80% utilization we were targeting. The savings for H2O are significant: instead of paying for 2 petabytes, they are actually paying for less than 1 petabyte. And this is applicable to just about any EBS customer out there.
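Working through the numbers from the charts:

```python
# Back-of-the-envelope math using the figures quoted in the talk.
written_pb = 0.5          # green line: data actually written
provisioned_before = 2.0  # blue line at the start
provisioned_after = written_pb / 0.80  # blue line at the 80% target

print(f"utilization before: {written_pb / provisioned_before:.0%}")   # 25%
print(f"overpayment factor: {provisioned_before / written_pb:.0f}x")  # 4x
print(f"capacity needed at 80%: {provisioned_after:.3f} PB")          # 0.625 PB
```

The green line grew over time, which is why the final footprint is reported as "under 1 petabyte" rather than the 0.625 PB this back-of-the-envelope math would suggest.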
Deployment Success: Zero-Downtime Implementation with Significant Cost Savings and Performance Gains
The result of our work with Datafy was that we were able to deploy the solution across all our customers. Datafy is deployed with our existing tools: we still use Terraform to deploy it, it is integrated into our GitOps process, and we did not have to make any substantial changes to our infrastructure to get it working. And of course, we achieved the very significant cost savings we just showed: we raised our utilization and reduced our cost.
While doing that, we were also able to give our customers better performance: Datafy reduced our EBS costs while increasing the performance of our EBS volumes. I think the most important part of this solution is that there was zero downtime. Once the Datafy agent was deployed across all the clusters in read-only mode, we just had to flip a switch, and from that point on it started doing its magic, reducing our storage footprint without any manual intervention on our part. With the flip of a switch, we started seeing reduced storage costs.
This article is entirely auto-generated using Amazon Bedrock.