🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Train high-performing AI models at scale on AWS (AIM365)
In this video, the AWS SageMaker team and Roblox demonstrate training large AI models at scale. Michael Oguike and Tomonori Shimomura cover six critical dimensions: compute availability, performance, resiliency, observability, ease of use, and cost. They explain distributed training challenges, showing that Llama 70B requires 1.4 terabytes of GPU memory. SageMaker offers two solutions: Training Jobs for ephemeral compute and HyperPod for persistent clusters with SLURM or EKS orchestration. Key features include automatic health monitoring, node replacement, managed tiered checkpointing, and task governance for multi-team resource allocation. Denis Goupil from Roblox shares their experience training a 4D foundation model using 100 million 3D assets and models up to 70 billion parameters on H200 GPUs, highlighting HyperPod's flexibility and EKS integration, which enabled production deployment in one month.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Training High Performance AI Models on AWS SageMaker
Good afternoon everyone and welcome to day one of re:Invent. We're very pleased to have you all join us to discuss training high performance AI models on AWS using SageMaker. I am Michael Oguike, a senior technical product manager at Amazon Web Services. I work on AWS SageMaker where we help customers train high performance models. With me today is Tomonori Shimomura, who's a principal solutions architect who also works on SageMaker and helps customers train AI models. We are very pleased to be joined by Denis Goupil, who's a principal machine learning engineer from Roblox, and they're going to be sharing an exciting use case with us.
Today we're going to go through the need to train large AI models. We're going to discuss the challenges that we see in training these large models, and then we'll discuss how SageMaker helps to resolve these challenges. Then Tomonori will come up and show us an actual demo, followed by Denis from Roblox, who will show us a real-life example of how they've built a 4D foundation model using SageMaker and HyperPod. Before we go on, by a quick show of hands, how many of us in the audience have trained or customized a large model? And how many of you have done that using SageMaker? Great, so there's going to be a lot of capabilities that we share and a lot to learn here.
Before we go deeper, I think a couple of you may have noticed the growing popularity of this sparkle icon. You may have seen it on your re:Invent app where it's helping you get more out of your week, or you may have noticed it on your playlist where AI is trying to help you create a better playlist, or even on autonomous driverless cars on the road. For me, even while building this presentation, as I was trying to crop photos, AI was there trying to help me crop my photos better. But behind each of these icons is a trained model, and as your customers and your users get more used to using AI to improve their experience, you're also going to need to add more capable models to the experiences that you already deliver for your customers.
In fact, we see this in studies. A study by McKinsey showed that 47% of companies that said they were using generative AI were also training or customizing AI models to better serve their use cases. We built SageMaker for AI with this trend in mind: to help you deliver more capable models to your customers, it's important that we give you, our customers, the right capabilities to build, train, and deploy these models.
SageMaker Training Capabilities: From Training Jobs to HyperPod
We started this journey in 2017 when we launched SageMaker Training Jobs. SageMaker Training Jobs offers you a fully managed API where all you need to do is bring in your training data. You bring in your training script. You tell us which instances you want to run, and we take all that input, we spin up a cluster, we train the model, and we deliver the model artifacts to an S3 bucket. It's an ephemeral cluster and you only pay for what you use. Our customers had a great experience with that and they continue to use training jobs.
But then we had customers like you asking for more capabilities, such as being able to manage the cluster using Slurm or Kubernetes. We had customers asking for more granular control and observability over these clusters, and there were also customers who wanted more persistent clusters so they could run both training jobs and inference jobs on the same infrastructure. So we launched SageMaker HyperPod to meet these needs. For today's conversation, the two key capabilities are SageMaker Training Jobs, which provide ephemeral compute, and SageMaker HyperPod, which provides persistent clusters. These are the two key services that SageMaker offers for you to train your models.
As we've continued to work with you and other customers to learn about what matters most to you for training large AI models, we've seen six key dimensions: compute availability, performance, resiliency, observability, ease of use, and most importantly, cost.
Every lever that we move on each of these dimensions has a direct impact on your cost. In the next couple of slides, I'm going to dive deeper into the first three: compute availability, performance, and resiliency. When Tomonori comes up, he's going to go deeper into observability and ease of use as we walk through the demo together.
Compute Availability: Securing and Optimizing GPU Resources
Let's start with compute availability and look at the trend that we've been observing over the last couple of years. We've seen that we need to use more compute to train more capable models. This is primarily driven by the fact that we need more training data and we're training these models with more parameters so they can be more capable and more accurate. We're using them in cases like the legal field, medical sciences, and autonomous driving. We need bigger models so they are more capable.
Over the last two to three years, we've actually gotten to the point where we're now using 10^24 FLOPs of compute power to train these very high-performance models. The prefix for 10^24 is yotta, so that's a yottaFLOP. Let me make that more concrete. What's a yottaFLOP? Five yottaFLOPs is the equivalent of running 1,000 P5 GPUs consistently for one month. That's a lot of compute power, and that's very expensive. That's why it's essential that we optimize the utilization of these GPUs. But we also want access to the most performant compute so we can get our training jobs done quickly.
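As a rough sanity check of that equivalence, here is a back-of-the-envelope calculation in Python. The ~2 PFLOP/s per-GPU figure is my own order-of-magnitude assumption for sustained low-precision throughput, not an official spec:

```python
# Rough sanity check: how many FLOPs do 1,000 H100-class GPUs deliver in a month?
gpus = 1_000
flops_per_gpu = 2e15                 # assumed ~2 PFLOP/s sustained low-precision throughput
seconds_per_month = 30 * 24 * 3600   # ~2.6 million seconds

total_flops = gpus * flops_per_gpu * seconds_per_month
print(f"{total_flops:.1e} FLOPs")    # ~5.2e+24, i.e. roughly 5 yottaFLOPs
```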
To do this, SageMaker offers multiple capabilities for you to get the best compute whenever you want it. We have a wide selection of GPUs and accelerators, from single H100 GPUs to GB200 ultraservers. You can run single-GPU jobs or use a 72-GPU ultraserver within a rack, depending on your needs. But selecting the GPU is just one thing. We also understand that your business needs and your timelines are different, so we have various options to secure capacity.
You can use on-demand capacity, or spot capacity at up to 90 percent less than on-demand pricing. In fact, just last week we launched support for Spot Instances on SageMaker HyperPod, which is great for fault-tolerant workloads and experimentation. If you want a longer-term reservation with guaranteed capacity, you can get a one-year or three-year reservation with reserved capacity. And if you're looking for something in between, where there's a calendar period, say next month, when you want capacity for a particular window of time, you can use flexible training plans.
Now that we've got GPUs, what are we going to do with them? As important as it is to get the compute capacity, it is even more important that we utilize them effectively to avoid wastage. With SageMaker, you have the option to use the training jobs managed API where you just submit the job and we handle the scheduling and the distribution of the jobs across the GPUs. Or you can also use SLURM and Kubernetes depending on what your scientists are more familiar with.
If you have multiple teams running different types of jobs and you want to prioritize those jobs and allocate capacity across your organization, you can use a HyperPod capability called task governance to allocate resources to different teams. You can also give higher preference to, say, an inference job over a training job, so that whenever an inference job arrives, it preempts the training job. We also have an integration with AWS Batch that allows you to do this with SageMaker training jobs. All these capabilities come together to ensure that you have the right compute and can optimize the utilization of these GPUs and accelerators, which reduces your training cost.
Performance: Distributed Training and High-Speed Networking
We then talk about performance. To discuss performance, let's quickly look at the relationship between the amount of memory required by the parameters in a model and the memory available per GPU on many of the latest GPUs and accelerators. On this chart, we have two trends. The steeper line represents the number of parameters in many popular models over the last 10 years; as we discussed earlier, we can see the growth in the number of parameters required for these large models. The gentler slope shows how memory per GPU is changing over time. The number of parameters is growing faster than the memory per GPU.
The implication of this becomes clear when we demonstrate it with one example: the Llama 70B model. The amount of GPU memory we need to train a 70 billion parameter model can be calculated by determining the amount of memory each parameter requires. On average, the rule of thumb is 18 to 20 bytes when you account for the weights, gradients, and optimizer states. So 20 bytes times 70 billion parameters requires almost 1.4 terabytes of memory on the GPU. Even the Nvidia H100 has only 80 gigabytes of memory, so you do not have enough memory per GPU to load and train the model.
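That arithmetic is easy to reproduce. Here is a quick sketch using the 18-to-20-bytes-per-parameter rule of thumb from above:

```python
# GPU memory needed to hold the full training state of a 70B-parameter model.
params = 70e9            # Llama 70B
bytes_per_param = 20     # rule of thumb: weights + gradients + optimizer states (18-20 bytes)

total_bytes = params * bytes_per_param
print(f"{total_bytes / 1e12:.1f} TB of training state")                  # -> 1.4 TB
print(f"~{total_bytes / 80e9:.0f} H100s (80 GB each) just to hold it")   # -> ~18 GPUs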
To overcome this scaling challenge, we must use what we call distributed training. Some of you are probably familiar with this, so I'll go over it quickly. In the Llama 70B case, where the model needs more memory than we can fit on one GPU, we shard the model across different GPUs and distribute the training across them. As training happens synchronously, the different GPUs communicate with each other to update their state. The data is small enough for us to replicate it across the different GPUs: each GPU performs a micro-batch of the training and updates its state as it goes along. This is model parallelism, where we split the model but replicate the data.
Conversely, if we have a large training dataset but a small enough model, such as a Llama 3.2 1-billion-parameter model that fits on one H100 because it requires about 20 gigabytes of memory versus the 80 gigabytes available on the GPU, we can split the training data across different GPUs and replicate the model across them. We do the training in batches and have all the GPUs synchronize their states as they train over steps and epochs. One thing we might observe here is that networking becomes very important, because the GPUs have to update state and communicate with one another over time.
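As an illustration of this data-parallel case, here is a minimal PyTorch DistributedDataParallel sketch. A toy model and dataset stand in for the real ones, and this is generic PyTorch rather than a SageMaker-specific library; you would launch it with something like `torchrun --nproc_per_node=8 train.py`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("nccl")                      # NCCL handles the inter-GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])           # set by torchrun
torch.cuda.set_device(local_rank)

# Toy model and dataset stand in for the real ones; every GPU holds a full model replica.
model = DDP(nn.Linear(128, 1).cuda(local_rank), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))
sampler = DistributedSampler(dataset)                # each rank trains on its own shard of the data
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(2):
    sampler.set_epoch(epoch)                         # reshuffle the shards each epoch
    for x, y in loader:
        loss = nn.functional.mse_loss(model(x.cuda(local_rank)), y.cuda(local_rank))
        loss.backward()                              # DDP all-reduces gradients across ranks here
        optimizer.step()
        optimizer.zero_grad()
```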
If we step up one level in complexity, we distribute the training data and shard the model at the same time: the model is sharded across the layers of the neural network and also across individual tensors. In this case, we are doing a hybrid of both data parallelism and model parallelism. What is most important here is that your networking can quickly become a bottleneck. You need high-bandwidth, low-latency communication between all the nodes to ensure that you are not spending more time than necessary on the training job.
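To give a flavor of the hybrid case, the sketch below uses PyTorch FSDP's HYBRID_SHARD strategy, which shards model state across the GPUs within each node and replicates across nodes. The toy transformer is a placeholder, and this is generic PyTorch rather than the SageMaker distributed training libraries:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# Toy transformer standing in for a large model.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=8,
)

# HYBRID_SHARD: shard parameters, gradients, and optimizer state across the GPUs
# inside each node (model sharding), and replicate that sharded group across nodes
# (data parallelism) -- a hybrid of the two strategies described above.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    device_id=local_rank,
)
```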
SageMaker offers multiple capabilities to ensure that you get the most performance and high-speed networking when running these training jobs. First, we have an automated, easy setup where you can use the CLI, SDK, CloudFormation, or even the console to set up your distributed training job. We support multiple popular frameworks like PyTorch, TensorFlow, and even Ray. For communication libraries, you can use NVIDIA NCCL or, if you are using Trainium instances, the Neuron Collective Communication Library. For high-speed networking, you have access to NVIDIA NVLink within a node, and for inter-node connectivity, high-speed networking solutions.
Specifically, you have access to Elastic Fabric Adapter from Amazon where you can get up to 3200 Gbps of bandwidth. All these together help to ensure that you have an efficient distributed training setup where your GPUs and your training jobs are connecting and moving at a very high speed, which is essential for reducing your time to market and being cost efficient.
Resiliency: Four-Pronged Strategy for Job-Level Fault Tolerance
Finally, we talk about resiliency. To discuss resiliency, we quickly bring back our favorite single GPU. Each of these GPUs has a mean time to failure. When you scale to a large number of GPUs, say hundreds or thousands, we move from GPU-level resiliency to job-level resiliency, because a failure of any GPU in the cluster can potentially interrupt the progress of your entire job. We have actually seen this in research: a study from Meta published at IEEE earlier this year describes an experiment on two large state-of-the-art clusters over about 11 months. If you follow the lower line, they saw that for jobs running on 1,024 nodes, the average time to failure was about 8 hours. That is three failures in one day: one for breakfast, one for lunch, and one to wake you up at 2 o'clock in the morning with a page.
How does SageMaker help us resolve these issues and prevent us from getting paged? We have a four-pronged strategy: mitigation, prevention, detection, and recovery. With mitigation, we encourage and make it easy for you to checkpoint your training so you are regularly saving safe points. Even if something goes wrong two days or a week into the run, you do not lose all your training progress, and we are able to restart from a checkpoint. Earlier this year we also launched managed tiered checkpointing, which reduces the penalty of checkpointing frequently, since frequent checkpoints can carry network and storage overhead. With managed tiered checkpointing you have a more efficient, faster checkpointing process, and you will see that when Tomonori shows his demo.
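The SageMaker checkpointing library has its own API, which I won't reproduce here; the sketch below just shows the generic save-and-resume pattern in plain PyTorch that these capabilities build on. The checkpoint path is the conventional local checkpoint directory for SageMaker training jobs, used here purely for illustration:

```python
import os
import torch

CKPT_PATH = "/opt/ml/checkpoints/latest.pt"   # conventional local checkpoint dir on SageMaker

def save_checkpoint(model, optimizer, step):
    # Persist the training state so a restarted job can resume from this step.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last saved step if a checkpoint exists, otherwise start fresh.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1
```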
For prevention, we have multiple types of health checks with standard health checks or deep health checks where essentially before we add a node to your job, we do a burn-in test and we really pressure test these instances. If there is actually a problem with any of these instances or GPUs we can isolate that instance from the cluster so it never even makes it into your cluster. For detection, we have automatic health monitoring agents that are continually running through the cluster and running through your job to identify any issues and resolve them. For recovery, in the event that something does happen, if we identify there is a slow GPU or a slow node, we can automatically reboot the node or if that does not resolve the issue, we will automatically replace the node and then restart the job from your checkpoint.
Once again, all these capabilities come together and we have seen cases where it saves up to 40 percent of the time spent on training, which essentially helps you save on your overall costs because you are able to respond to failures faster. I will now invite Tomonori on to show us a demo of these in action.
Developer Experience Demo: SageMaker Training Jobs and HyperPod in Action
Thank you, Michael. Am I audible? Great. I am Tomonori Shimomura, a solutions architect in the SageMaker AI organization, helping SageMaker AI customers every day. Michael covered the fundamental concepts of SageMaker AI's training capabilities, so let us dive deep into the actual developer experience, including a demonstration.
As Michael explained, there are two training capabilities: SageMaker Training Jobs and SageMaker HyperPod. SageMaker Training Jobs provide fully managed APIs. A training job automatically creates a cluster before execution and automatically deletes it when the job finishes. Essentially, the compute resource is allocated only during job execution, so you can expect a lower cost.
You can use a high-level Python SDK, which is a wrapper Python library on top of SageMaker's service API. This allows you to easily use SageMaker's training capability by writing Python code or running notebooks. Because this is based on the SageMaker service API, you can easily integrate the training job with your MLOps pipeline.
Here is an example code fragment for a training job. As you can see, this Python code imports a module called sagemaker_core and instantiates a training job object, passing parameters such as the hyperparameter configuration, instance type, instance count, and container image. By providing these parameters, you configure and start the training job, and a cluster is created automatically behind the scenes.
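A minimal sketch of what such a fragment might look like with the sagemaker-core library is below. The job name, IAM role, image URI, S3 paths, and instance settings are placeholders, and the parameter names simply mirror the underlying CreateTrainingJob API, so treat this as illustrative rather than the exact code from the slide:

```python
from sagemaker_core.resources import TrainingJob
from sagemaker_core.shapes import (
    AlgorithmSpecification, Channel, DataSource, S3DataSource,
    OutputDataConfig, ResourceConfig, StoppingCondition,
)

training_job = TrainingJob.create(
    training_job_name="my-training-job",                        # placeholder name
    role_arn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    hyper_parameters={"epochs": "5", "learning_rate": "1e-4"},
    algorithm_specification=AlgorithmSpecification(
        training_image="<account>.dkr.ecr.<region>.amazonaws.com/my-image:latest",
        training_input_mode="File",
    ),
    input_data_config=[Channel(
        channel_name="train",
        data_source=DataSource(s3_data_source=S3DataSource(
            s3_data_type="S3Prefix",
            s3_uri="s3://my-bucket/train/",
            s3_data_distribution_type="FullyReplicated",
        )),
    )],
    output_data_config=OutputDataConfig(s3_output_path="s3://my-bucket/output/"),
    resource_config=ResourceConfig(
        instance_type="ml.g5.2xlarge", instance_count=1, volume_size_in_gb=100,
    ),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=3600),
)
training_job.wait()   # block until the job completes and the cluster is torn down
```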
The other training capability from SageMaker is HyperPod. Unlike SageMaker training jobs, HyperPod provides persistent clusters, which means cluster creation is not automatic: you explicitly create and delete the clusters. However, HyperPod helps you create such high-performance clusters easily. You can customize the cluster based on your needs, share it with your teammates and multiple users, and execute multiple jobs on the same cluster. That is the key difference from SageMaker training jobs.
As mentioned earlier, SageMaker comes with advanced resiliency capabilities, which I will explain in more detail later; I have also included a demonstration of them in my video. Because HyperPod is purpose-built for AI development rather than a general-purpose cluster, it comes with a comprehensive observability dashboard and useful libraries and tools to optimize your training.
HyperPod supports two orchestration options: one is SLURM, and the other is Amazon EKS (Elastic Kubernetes Service), that is, Kubernetes. Both SLURM and EKS are powerful frameworks for running distributed training, but the developer experience differs, so let me briefly explain how they are different.
SLURM is an open source job orchestration framework that lets you run batch jobs across multiple nodes. In the case of HyperPod, you create three types of nodes: controller nodes, compute nodes, and login nodes. Typically, you log in to a login node and execute SLURM commands such as sinfo, sbatch, and squeue to manage your jobs.
The other orchestrator is EKS, or Kubernetes. Many people have heard of Kubernetes because it is popular. Kubernetes is a framework for running various types of containerized applications, including distributed training applications. Unlike SLURM orchestration, when you choose Kubernetes the HyperPod cluster contains only compute nodes, because the EKS control plane serves as the control plane; the HyperPod cluster itself does not contain controller or login nodes. You can run kubectl or your preferred Kubernetes client on your development machine, even on your local laptop.
So which is the better choice for you? This is a tough question because it is case by case. If you already have a preferred orchestrator, choose it. If you want to directly access the hardware or run applications at the host level, SLURM may be the right choice.
If you love containers, Kubernetes might be the better choice. I think Michael already covered some portion of resiliency, but let me recap quickly before going into the actual demonstration.
SageMaker HyperPod resiliency consists of three core steps: health monitoring, instance replacement, and job auto-resuming. These three components form the flow of HyperPod resiliency. HyperPod constantly runs health checks, and if we find a faulty GPU or faulty instance, we trigger the instance replacement. After the instance replacement, the suspended job automatically resumes. We provide such a framework.
We would like to do two demos. I captured pre-recorded videos, and I partially accelerated the video to fit this within the limited time of this session. Starting from SageMaker training jobs, I'm going to open a code editor in the browser. SageMaker has a code editor in the browser, by the way. I'll execute a notebook to trigger the training job, and eventually, we will see the execution result on training metrics by MLflow.
First, I'm opening the code editor on SageMaker by clicking the open button. It looks like Visual Studio Code, right? It is based on the open source build of the Visual Studio Code editor. I'm browsing the training code; this is the code actually executed on the compute resources. I'm executing a notebook, and the first half of this notebook handles data processing and uploading to an S3 bucket. This is a preparation step.
I'm visualizing the training dataset and configuring some parameters for the training job, like the S3 upload location. Then I'm creating and configuring the training job. You can see a ModelTrainer class. I'm instantiating the ModelTrainer class with some parameters. This is the Python API.
Then I'm configuring the input data channels and starting the training. You can also see the list of SageMaker training jobs on the SageMaker Management console like this. Now it's showing training status. Training is in progress. The progress can also be seen by visiting the CloudWatch logs management console. You can see output from the container.
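Put together, the steps I just walked through look roughly like the sketch below. The container image, source directory, instance type, and S3 paths are placeholders rather than the exact values used in the demo:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData, SourceCode

model_trainer = ModelTrainer(
    training_image="<account>.dkr.ecr.<region>.amazonaws.com/pytorch-training:latest",  # placeholder
    source_code=SourceCode(
        source_dir="./src",               # placeholder directory containing the training code
        requirements="requirements.txt",
        entry_script="train.py",
    ),
    compute=Compute(instance_type="ml.g5.12xlarge", instance_count=1),  # placeholder sizing
    hyperparameters={"epochs": 5, "batch_size": 64},
    base_job_name="demo-training-job",
)

# Configure the input data channels and start the training job; SageMaker spins up
# the cluster, runs the job, and tears the cluster down afterwards.
model_trainer.train(
    input_data_config=[InputData(channel_name="train", data_source="s3://my-bucket/train/")],
    wait=True,
)
```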
There are several options for monitoring progress. You can watch CloudWatch Logs, or you can use the notebook and call methods on the Python objects to get the log output. As I will show next, you can also check the MLflow output or use your preferred observability solution.
SageMaker has managed MLflow. I click the MLflow button and open MLflow. I'm seeing a list of executed training jobs, and you can see training metrics like this. The next demo is SageMaker HyperPod. In this demo, I'm going to create a cluster easily and customize the cluster by installing some other components and running some training jobs. I'm going to simulate some hardware failure artificially by injecting a simulated error and verify that HyperPod resiliency works, and eventually see the observability dashboard. You can easily create a SageMaker HyperPod cluster.
For quick setup, all you have to do is choose the right instance type and instance count. That's basically it. Cluster creation can be complex if you do it manually, but we provide an easy setup experience. The actual cluster creation progress can be monitored on the CloudFormation management console because it requires not only the cluster itself, but also various prerequisite resources such as VPC, subnet, Security Group, S3 bucket, and IAM role. When the HyperPod cluster is getting created, we can visit the SageMaker management console and see the instance initialization progress at the instance level. You just saw a deep health check being executed.
Now let's install some add-ons, starting with the HyperPod observability add-on. It's literally one click for basic installation, but I'm also enabling some additional advanced metrics. Then I'm enabling HyperPod task governance, which provides task prioritization and dynamic compute quota allocation between multiple teams. This is useful when sharing a cluster between multiple teams and multiple purposes. I'm configuring job priorities depending on the task type, such as inference, training, or experiments. I'm also configuring two teams, Team A and Team B, defining eight instances as the default quota, and allowing lending and borrowing of compute resources between the teams. Then I'm configuring a training job.
As you can see in the second line, HyperPod training job is a custom resource definition installed by the HyperPod training operator, and I'm using it for advanced training. I'm installing additional software and modifying the checkpointing code in order to enable managed tiered checkpointing. I will explain more later. I'm using some SageMaker-specific Python classes for the checkpointing code. Now on the terminal, I'm monitoring three pieces of information. Top left, I'm monitoring the node status. Top right, I'm monitoring pod status. The center shows log output from the training job. I executed the training job using kubectl, which is the standard CLI interface.
The training is progressing. Next, I'm going to inject an artificial GPU failure in order to simulate and verify the resiliency capability. A good feature of HyperPod is that you have direct access to the infrastructure: not just the container level, and not only monitoring logs, you can log into compute nodes like this over an SSM session. In this case, I injected error messages into the kernel log to simulate a hardware error. You can see that one node went into NotReady status and one pod went into Pending status, which means the job was suspended because of the artificially injected error. A new node came in thanks to instance replacement, and a new container is now in creating status and should soon start executing. The newly created container started and the job resumed. Next, I would like to show you what the observability dashboard looks like. It is based on Prometheus, Amazon Managed Service for Prometheus, and Amazon Managed Grafana.
I'm starting with the cluster dashboard, which shows hardware-resource-related information at the cluster level or instance level, like this. The next dashboard is the task dashboard, and it's quite powerful. You can see key metrics like CPU, GPU, and memory utilization at the task level, not just the instance level. Since HyperPod is built for distributed training, task-level metrics are especially useful. You're probably interested in which team is using GPUs efficiently, which team needs improvement, and which job is underperforming and what its bottleneck is. That information is really important when operating the cluster, so this observability dashboard is very useful.
In the demo, I included some newly introduced features this year. Let me highlight some new features. The first feature I'd like to highlight is the enhanced HyperPod cluster creation experience on the management console. You can easily set up a cluster by entering some parameters and configuring it. Configuring a cluster can be very complex because you need to configure not only the cluster itself but also prerequisite resources. However, you can use this browser-based experience to easily configure a cluster.
The next feature is one-click observability. Observability is important, as I said, but setting it up is complex because it requires a metric-emitting component on the cluster itself, a time series database, and a visualization layer. With HyperPod, you can literally install all of this with one click. The HyperPod training operator is for efficient distributed training, and it can recover intelligently: it has job hang detection and managed tiered checkpointing, which reduces the overhead of checkpointing. Another feature I didn't include in my demo is AWS Batch support for SageMaker training jobs. A training job by itself doesn't have batch job scheduling capability; it just runs a single training run. But by combining a SageMaker training job with AWS Batch, you can enqueue training jobs in a queue and intelligently schedule them by priority.
This is also a new feature I didn't include in the demo: you can now launch IDEs and Jupyter notebooks on a HyperPod cluster, so you can use a HyperPod cluster not only for training jobs but also for interactive development and experimentation. Lastly, we built an MCP server for SageMaker AI. You can use this MCP server to create and maintain HyperPod clusters using natural language. For example, you can say "create a cluster with 4 G5 12X instances," or "scale up my cluster from 4 instances to 8 instances."
Roblox Case Study: Building a 4D Foundation Model at Scale
So now we've covered the fundamental concepts of SageMaker's training capabilities, and I've shown some demonstrations. Now, I think you're interested in what HyperPod and SageMaker AI can do in the real world, right? So I'd like to introduce Denis Goupil from Roblox. His team is using SageMaker HyperPod for one of the core components of their AI platform. Thank you. Hello, can you hear me? Cool. Yes, so I'm Denis. I work at Roblox, where I lead the AI infrastructure and AI platform team. Today I'm going to talk about AI at Roblox to give you some context.
And then we'll dig into our training infrastructure, especially how it has been impacted by training our 4D foundation model. I'll finish by discussing what we want to build next.
If you're not familiar with Roblox, it's a platform where millions of people come together to create, play, and interact with each other through one of the many experiences, all created by our community of creators. As of last quarter, we have 150 million daily active users and 45 million peak concurrency, which means that 45 million people are playing on Roblox at the same time. That's pretty huge. In terms of AI, our AI infrastructure supports about 1 million-plus queries per second across the 350-plus models we have running in production. So as you can see, AI is pretty much everywhere in what we do at Roblox.
Especially when it comes to safety, which is core to our principles, we train and serve models for real-time moderation of voice and text to make sure there is no bad content on the platform. Those models are actually open source, so you can use them for your own use cases. For safety, we also do avatar moderation as well as bot detection, abuse reports, and clickbait detection. So there are a lot of AI use cases for safety.
For our players, AI drives our recommendation and search systems to help our users find the right content. This means game recommendations as well as friend recommendations for social interaction, and for the assets that you can use in your game or for your avatar, we also have marketplace search driven by AI. Finally, for our creators, we introduced a set of generative AI tools to help improve their creative processes. In one example, a creator assistant built into Roblox helps you create a 3-by-3 grid of orbs; it can also generate a script that turns a blue orb red and makes it disappear when a player touches it. Similar to what you can do as a developer using Cursor, we have the same kind of tools for our creators.
More recently we introduced our Cube model, which is a 4D object generation system. By 4D I mean a 3D object that you can interact with and that is functional. For example, with a car, you can actually open the door, get into the car, drive it, and the wheels will turn. There are two examples on the screen: on the left, a watermelon gun in the shape of a banana that actually shoots watermelons; on the right, a magic carpet built by our 4D model that you can fly. That's pretty cool. So you can see that in the future our players and creators will be able to create anything they can imagine.
Now let's see how we actually train this foundation model and how our training infrastructure has been impacted by that. But first, let's step back and look at the AI platform as a whole to give you some context. On the left side, in purple, you have all our data components, starting with our data lake with structured and unstructured data, and our feature store and embedding store, where you can do online and offline retrieval. At the top middle, the blue components are used for training: data processing, where you go from raw data to features, and the training component itself, where you go from features to trained models. The yellow component is what we use for serving: once you have a trained model, you push it to a model registry, and our serving component reads the model from the registry and serves it for real-time use cases.
The green components were already described earlier. Finally, on the right side, you have the rest of what we use for observability, cost tracking, and our experimentation platform. That's pretty much the AI platform as a whole. Today, we are going to focus on this red box, which is our distributed training component. This covers how we do multi-GPU, multi-node distributed training. Spoiler alert: it's going to use a HyperPod cluster.
Let's go back to the 4D model and the scaling factors Michael was talking about, and how they apply to our use case. First, on the data side, we have about 100 million 3D assets available, which is roughly 6.5 petabytes of data. We don't necessarily use everything for training, but that gives you an idea of the scale. For model size, the requirement was between 1 billion and 70 billion parameters, which is a very wide range. In terms of hardware, we didn't know exactly what would end up being built, but we knew we would need a lot of GPU memory to train those models, and the newer the better, so we went with H200s.
In terms of training technique, the research team didn't have any specific requirement except that it had to be performant, stable, and easy to use. Finally, and probably the most important point for them, is the number of jobs they can run. This matters because we had never trained a foundation model, especially a 4D foundation model, so being able to run as many jobs as possible in parallel makes the team more efficient and helps them refine the hypotheses they need to build that model, especially regarding model size.
HyperPod Implementation at Roblox: Challenges, Solutions, and Future Vision
How does that affect our AI platform? Well, first of all, you go talk to your finance team and make sure you have enough budget for it. Once we got the budget, we sat down and looked at what was going to break or what was missing from our current solution. We came up with three challenges. The first one was how to actually get capacity. We already had thousands of GPUs across training and inference, but those were usually smaller GPUs for inference or already used by other teams. We simply needed more capacity and more high-performance GPUs.
The second challenge was how to ensure resiliency. As Michael was describing, as the data and the model size grow, you are very likely to hit a failure at some point during training. We had to make sure that whatever this new project brought to the platform, the hundreds of ML engineers working on other projects would not be blocked because of it. The last challenge is that once you have GPU capacity and resiliency, you need to make it cost efficient and ensure you utilize GPU resources in the best way possible. That's where HyperPod comes into the picture and helps us get there.
The first point was that it's actually fairly easy to set up. Don't get me wrong, it's not as easy as setting up an EC2 instance or an EKS cluster, but it's reasonable. It took us one month from prototyping to production, and that includes making sure that everything we do on the platform works the same way in a HyperPod cluster, with InfoSec review and actually setting up the network so our EKS control plane can talk to HyperPod nodes. We recently created a new EKS HyperPod cluster and it took us less than one hour to do it, so I would say it's reasonable.
The next point was GPU capacity. As I was saying, we need these high-performance GPUs, and HyperPod was one way to get there. With HyperPod, you have multiple options to get more capacity, whether with reserved capacity, spot instances, or flexible training plans. We don't use flexible training plans today, but as we take on more time-bound or ephemeral projects, I do see that happening in the future. HyperPod helped us with those changes. The next point was stability.
As we scale up, a failure is very likely to happen. Building those resiliency features ourselves would have taken a lot of time and most likely delayed the project; HyperPod helped us get there faster. Actually, as I was preparing the slides for today, I couldn't find a single node failure in the last three months, even with heavy GPU utilization during that time. So kudos to the team. However, something that is missing is visibility into node failure detection and what kinds of failures can happen. I would have loved to show you the thousands of tests that AWS is running behind the scenes, where everything is green and fine, but I couldn't find that information, so that's probably something for the team to take as feedback.
The last one is flexibility, and honestly, to us that was the most important point. We have already invested a lot in an AI platform with tooling that we want to reuse as much as possible. The fact that HyperPod integrates with EKS is phenomenal for us. When you create a HyperPod cluster, your HyperPod nodes join your EKS cluster, and at that point they look just like any other node in your cluster. That means you can taint those nodes, use node selectors, run any DaemonSet on them, and any tooling that you have works the same way.
HyperPod is also framework agnostic for both scheduling and training. In our case, that means we were able to use YuniKorn for scheduling, just like any other AI workload that we have on the platform, and we were able to use Ray for distributed training, again just like any other workload. That was definitely a good selling point for us. I'll finish with what we want to build next, what we call decentralized compute. The problem remains the same: access to GPUs is always a bottleneck. HyperPod was able to solve that in a single-region, single-cluster environment, but as we want access to more capacity, we need to extend that.
The solution we came up with is what we call decentralized compute. We need a way to schedule workloads dynamically across all the resources available to us, regardless of region or cluster or where the capacity is. What matters is making that transparent for our users: we need to abstract the scheduling process so the user gets a unified experience and can just submit a job without caring where it runs. That's where we want to be. Finally, that means we want to extend HyperPod to support multi-region and multi-cluster capabilities. We hope to work with the SageMaker team to make that happen, and hopefully you can benefit from it as well. That's all for me. I'll let Michael do the recap.
Conclusion: Flexibility of Choice Across the AI Training Stack
Thank you very much, Denis and Tomonori, and thank you all so much for sticking around. We have just three more slides to close out this session. As a quick recap, we talked about the dimensions that are critical to customers: compute availability, performance, resiliency, observability, and ease of use, and how they all affect cost. Tomonori also talked about flexibility of choice, and on this slide we summarize that, starting from the bottom of the stack. For hardware, we have multiple options for compute, storage, and networking. On the software and driver layer, we have multiple prebuilt images, device drivers, and toolkits. Moving further up, we have multiple training libraries to ensure highly performant distributed training with PyTorch and TensorFlow, multiple distributed training strategies, and various MLOps and third-party integrations: we integrate with tools like MLflow and Weights & Biases, and you can submit jobs with Ray. For observability, as Tomonori showed, we support Prometheus, Grafana, and CloudWatch. You can use SageMaker Studio or notebooks, or submit jobs directly from your own machine.
At the top layer, whether you're using SageMaker HyperPod to orchestrate your jobs with Kubernetes on Amazon EKS or Slurm, those options are available to you. Alternatively, if you just want to use a fully managed API, you can use the SageMaker training jobs managed API.
What this shows is that we understand your use cases might be different and your technology team might have different interests in using various frameworks. We offer the capability for you to pick and choose what works best for your use case. We are sharing a link to a blog where we discuss both the capabilities for training jobs and SageMaker HyperPod. It's a good resource that I usually recommend.
It also links to example notebooks on GitHub where you can literally just copy the code and run it. These examples are set up for you, and we have hundreds of them available to help you get started. Thank you so much for joining this session, and please leave feedback in the app at the end of the day or later. We've really enjoyed talking to you today. We'll also be outside if you have more questions or things you want to discuss, but thank you very much and enjoy the rest of re:Invent.
; This article is entirely auto-generated using Amazon Bedrock.