Kazuya

AWS re:Invent 2025 - Accelerate AI workloads with UltraServers on Amazon SageMaker HyperPod (AIM362)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Accelerate AI workloads with UltraServers on Amazon SageMaker HyperPod (AIM362)

In this video, Rekha Seshadrinathan and Paulo present Amazon SageMaker HyperPod and EC2 UltraServers for generative AI development. They address key challenges including compute access, resource allocation, distributed training complexity, and hardware failures. SageMaker HyperPod offers flexible training plans (1 day to 6 months), task governance for efficient multi-team resource allocation with preemption rules, pre-benchmarked training recipes, and automated resiliency features that handle failures without manual intervention. Paulo explains the EC2 UltraServer architecture, describing how GB200/GB300 servers bundle up to 72 GPUs using the NVLink switch and the Elastic Fabric Adapter with the Scalable Reliable Datagram protocol. The demo shows creating training plans, configuring task governance with priority queues, capacity borrowing and lending between teams, and topology-aware scheduling. Advanced use cases cover Mixture of Experts training on UltraClusters with 25,000+ interconnected GPUs, profiling that showed 60% of training time spent on the network, and up to 68% cost savings versus on-demand pricing.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: The Brooklyn Bridge Analogy and the AI Infrastructure Challenge

Today we're going to dive into the world of Amazon SageMaker HyperPod and EC2 UltraServers. I'm Rekha Seshadrinathan, a Senior Manager of Product on Amazon SageMaker, and I'm joined today on stage by my colleague Paulo. Hi, I'm Paulo. I'm a Principal Specialist Solutions Architect at AWS, helping customers build their foundation models. Thank you everyone for being here today.

Thumbnail 0

Thumbnail 40

Before we get started, I wanted to begin with some history. How many of you here recognize this bridge? Yes, it is the iconic Brooklyn Bridge that connects the boroughs of Manhattan and Brooklyn. Back in the mid-1800s, the very idea of constructing a bridge across the East River was considered absurd. The river was deep, the currents were fierce, and a span longer than any that had ever been attempted seemed impossible.

Thumbnail 80

Thumbnail 120

Yet one man, John Roebling, had a vision. He had a vision for a suspension bridge unlike any the world had ever seen, one that could transport thousands of people and carry tons of cargo every day. While the newspapers called it a fantasy, the idea persisted, and it led to the steel wire rope that John Roebling designed. This was the 6x19 design: basically six strands, each with 19 wires, surrounding a core. This unique hexagonal design created a very strong yet flexible wire, and he used it to construct the bridge.

Thumbnail 160

The construction of the bridge itself took over 14 years, and when it finally opened, it was the first steel wire suspension bridge and the longest suspension bridge at the time. Now we stand at a similar moment in history. For decades, we believed that many forms of intelligence, reasoning, creativity, and language were uniquely human domains. We believed that scaling models beyond billions of parameters was beyond our computational reach.

Thumbnail 170

Thumbnail 210

Yet breakthroughs arrived: transformers, foundation models, RAG, and reinforcement learning. And now the impossible seems inevitable. In fact, over 70 percent of enterprises are already using AI in at least one function. It's not just chatbots anymore. Reasoning systems and agentic workflows are creating new efficiencies and growth. Organizations like yours are investing millions in AI. But just like the Brooklyn Bridge needed a new architecture of steel, these AI breakthroughs demand a new compute architecture and new software systems designed to solve the unique challenges of generative AI.

Thumbnail 280

Thumbnail 290

Thumbnail 300

Now here at AWS, it is our mission to build the best place for generative AI training and inference. That means building the best compute infrastructure and software systems to solve these unique challenges. That's why we're excited to talk to you about SageMaker HyperPod and EC2 UltraServers today. In today's session, we'll first talk about some of the challenges with generative AI development, training, and inference. We'll then introduce SageMaker HyperPod and tell you how it uniquely solves some of these challenges.

Thumbnail 310

Thumbnail 330

After that, we'll talk about EC2 UltraServers, and Paulo will do a deep dive on the architecture. We'll also go through a comprehensive demo on how you can use all of these tools yourself. And then finally we'll go through some advanced use cases and give you some resources that you can use.

Thumbnail 340

Thumbnail 350

The Four Major Challenges of Generative AI Development

Let's get started. Why don't we walk through what a typical AI journey looks like for customers who are training and hosting models. The very first step that customers need to take is to get access to the compute that they need. Because these accelerated instances are in high demand, they're not always available on demand. When you're trying to get these instances for training, they may not be available at the time that you need them.

The alternative for you is to potentially make a long-term reservation, which means making an upfront commitment for one to three years, and that's not always feasible for every organization. It may also lead to underutilization because you may not be able to use these resources for the entire duration of your reservation. By nature, these AI workflows tend to be spiky with lots of experimentation. So let's say you've procured this compute. The next step that your organization is going to take is decide how to allocate this compute across multiple teams in your organization.

Thumbnail 420

Thumbnail 450

There are going to be different teams who are fine-tuning models, training, hosting them, and building applications, and you need these resources for all of these use cases. Typically, an administrator will go around asking different teams what their compute needs are and then statically allocate this compute across the organization. But that can lead to inefficiencies. Let me walk through this example where you have two teams in an organization.

Thumbnail 480

Here you see the first team. They have two instances allocated to them and they have multiple jobs that are queued up. The first two jobs are of lower priority, and they're occupying these two instances. Then a third higher priority job comes in. However, it's standing in the queue waiting for these jobs to complete. Ideally you want your higher priority job to get access to these compute resources and then pick up the lower priority jobs later.

Thumbnail 490

Thumbnail 510

Now let's look at team two. Here, they've got three instances. They have two jobs of varying priority that are using two of those instances. But you've got a third instance there that's sitting idle, and ideally that instance could be used for one of the backlogged jobs for team one. Because you're statically allocating this compute, you end up underutilizing your resources and you're not always prioritizing the highest priority workloads in your organization.

Thumbnail 540

You've procured the compute and you've allocated the resources across different teams. The next step, if you're training models or fine-tuning them, is to see how you can use this hardware effectively. Foundation models have been growing larger and larger every year. Even if you look at GPT-3, which is now quite old, that was a 175 billion parameter model.

If you're trying to load up the model, you need memory to store these parameters. If you're using FP16 precision, that requires about two bytes per parameter, and so you need 350 gigabytes just to store the model weights. Now add in activations and gradients, and the memory needed is far north of that. But a single H100 GPU only has 80 gigabytes of memory. So what do you do?

Teams will typically use model parallelism, splitting the model across multiple GPUs. But that requires expertise in distributed training and weeks or months of iteration to get the right configuration for the particular type of hardware you're using and the particular model family that you're trying to train.
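To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The 2-bytes-per-parameter figure and the 80 GB H100 capacity come from the talk; the optimizer-state multiplier is a common rule of thumb for Adam in mixed precision, not a number quoted in the session.

```python
import math

def training_memory_estimate(params_billion: float, bytes_per_param: int = 2,
                             optimizer_bytes_per_param: float = 8.0) -> dict:
    """Rough memory footprint for training a dense model.

    weights_gb: parameters * bytes_per_param (FP16/BF16 weights only)
    train_state_gb: weights plus gradients and Adam optimizer state
    (commonly ~8-16 extra bytes per parameter in mixed precision).
    Activations are workload-dependent and excluded here.
    """
    params = params_billion * 1e9
    weights_gb = params * bytes_per_param / 1e9
    train_state_gb = params * (bytes_per_param + optimizer_bytes_per_param) / 1e9
    return {"weights_gb": weights_gb, "train_state_gb": train_state_gb}

est = training_memory_estimate(175)          # GPT-3 scale
print(est["weights_gb"])                     # ~350 GB of weights alone
print(math.ceil(est["weights_gb"] / 80))     # H100s needed just to hold the weights
```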

Thumbnail 630

If you've done all of this and gotten this far and kicked off your model training, the next challenge you run into is hardware failures. Although these accelerated instances give you a lot of compute and memory power, they also tend to fail. Because distributed training is a highly synchronous process with a lot of internal communication, a single GPU failure can halt your entire fine-tuning or training workload.

When that happens, a data scientist or engineer in your organization needs to go and isolate the issue, debug the failure, solve the problem, and restart the training from the last safe checkpoint. All of this can result in hours of frustration and downtime. These failures can happen at any time, so if they happen in the middle of the night, your entire cluster might be sitting idle until that problem is resolved, leading to significant downtime. As the size of your cluster or models gets larger, the probability of these hardware failures only increases.

Thumbnail 740

In fact, the industry has coined a term called goodput for this. Goodput basically means the percentage of time that you spend actually making forward progress, the percentage of time that you're actually productive instead of doing all these activities like debugging. Many of the customers we spoke with as they started training these models were suffering from low utilization and poor goodput on their clusters. There are a host of challenges with generative AI development, and they're not all technical. You have memory bottlenecks and communication bottlenecks, but you also have operational challenges and business challenges as well.
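To illustrate the definition with a toy calculation (the numbers below are made up for illustration, not from the talk):

```python
def goodput(productive_hours: float, total_cluster_hours: float) -> float:
    """Fraction of wall-clock cluster time spent making forward training progress,
    i.e. excluding failure detection, debugging, node replacement, and
    checkpoint-restart overhead."""
    return productive_hours / total_cluster_hours

# A week-long run that loses 30 hours to failures and restarts:
print(f"{goodput(168 - 30, 168):.0%}")  # ~82%
```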

Thumbnail 770

SageMaker HyperPod: A Comprehensive Solution for AI Training

All of these problems are interconnected, so solving them in isolation doesn't usually result in good results for your organization. You need a more comprehensive solution that attacks all of these issues. That's why we built SageMaker HyperPod and launched it right here at re:Invent two years ago. SageMaker HyperPod is purpose-built for generative AI development and provides you the tools you need to address the unique challenges we discussed. It comes with unique resiliency features that help you debug and isolate failures quickly, helps you maximize cluster utilization, and gives you observability tools that you need to make sure you're using these resources effectively.

Thumbnail 820

Let's revisit that AI journey with SageMaker HyperPod. The first challenge we talked about was access to compute. SageMaker HyperPod supports flexible training plans. Instead of reserving capacity for one to three years, you can now request capacity from one day up to six months, up to eight weeks in advance. You can go to the SageMaker console, specify how many instances you need and for how long, and you're able to see a series of plans that meet your needs.

Thumbnail 860

Once you find a plan that you like, you can purchase that plan and then reference that training plan when you're creating your HyperPod cluster. Later, Paulo is going to show you a demo of how all of this works. HyperPod will simply scale up the cluster when that training plan becomes active and scale down the cluster when the plan expires. Once you've set up your cluster, you're then able to start submitting your fine-tuning and training workloads and begin using the cluster.
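Beyond the console flow described above, the same thing can be scripted. Below is a rough boto3 sketch of searching for and purchasing a plan; the operation names follow the SageMaker training plans API, but treat the exact parameter names and values as illustrative assumptions and check the current API reference before relying on them.

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# 1. Find offerings that match the capacity and duration you need.
#    (Parameter names are illustrative; see the SearchTrainingPlanOfferings reference.)
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=4,
    DurationHours=7 * 24,                 # one week
    TargetResources=["hyperpod-cluster"],
)

# 2. Purchase the offering you like. Remember this is an upfront commitment.
offering_id = offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"]
plan = sm.create_training_plan(
    TrainingPlanName="my-hyperpod-plan",
    TrainingPlanOfferingId=offering_id,
)

# Reference this ARN from a HyperPod instance group when creating the cluster.
print(plan["TrainingPlanArn"])
```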

Thumbnail 890

Thumbnail 900

The next challenge we had talked about was properly using these resources across multiple teams and how that can be challenging. SageMaker HyperPod comes with a capability called task governance.

Thumbnail 960

Task governance helps you efficiently allocate these compute resources across teams. The way this works is an administrator can go in and specify compute allocations for different teams, but along with that compute allocation, they can also specify priorities for different tasks and set up preemption rules. Once this is set up, HyperPod will automatically prioritize the highest priority tasks that come in. While every team gets their guaranteed compute allocation, if there are idle resources, then another team is able to opportunistically run tasks on that idle compute, which helps you maximize utilization while still making sure that every team is getting their fair share of resources.
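The allocation and preemption rules themselves are configured in the console (Paulo walks through this in the demo), but the scheduling semantics are easy to state. The toy function below only illustrates the guaranteed-quota, borrowing, and queueing behavior described above; it is not HyperPod's actual implementation.

```python
def admission_decision(team_used: int, team_quota: int,
                       cluster_used: int, cluster_capacity: int) -> str:
    """Where a newly submitted task lands under task-governance-style rules."""
    if team_used < team_quota:
        return "runs on guaranteed quota"
    if cluster_used < cluster_capacity:
        return "runs opportunistically on borrowed idle capacity"
    # Otherwise it waits; with preemption enabled, a high-priority task may
    # still displace a lower-priority task that is running on borrowed capacity.
    return "queued"

print(admission_decision(team_used=2, team_quota=2, cluster_used=4, cluster_capacity=5))
```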

Thumbnail 970

Thumbnail 1000

Thumbnail 1010

The next challenge we discussed was distributing these models across large clusters. HyperPod also comes with training recipes. Think of these as a playbook for distributed training, where we pre-benchmark recipes for different model architectures and instance families ahead of time. You are able to submit your jobs using one of these recipes and get started on your fine-tuning or training workload quickly. The last challenge we discussed was hardware faults and how once you get started on your training, you might run into hardware faults or network issues. HyperPod has a bunch of resiliency features that help you with this.

First, it comes with a health monitoring agent that proactively monitors your cluster for hardware failures or network issues. When an issue is detected, HyperPod will pause the training, either reboot or replace the node depending on the kind of error, and then restart your training from a previously saved checkpoint, all without any manual intervention. This happens behind the scenes without needing to use your precious data scientist or ML engineering resources. In addition, there are also deep health checks that you can optionally enable on the cluster. When you are creating your cluster, you can set up these deep health checks, and they are comprehensive checks that will run and reduce the probability of failures once your workload actually starts running.
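HyperPod handles the detection, node replacement, and job restart, but your training loop still needs to write checkpoints and resume from them when it comes back up. A minimal sketch, with the checkpoint path on shared storage purely as an illustrative assumption:

```python
import os
import torch

CKPT_PATH = "/fsx/checkpoints/latest.pt"   # illustrative: shared storage visible to every node

def save_checkpoint(model, optimizer, step: int) -> None:
    """Persist everything needed to resume: model weights, optimizer state, step."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT_PATH)

def maybe_resume(model, optimizer) -> int:
    """Called at startup. If the job was restarted after a node replacement,
    pick up from the last saved step instead of starting from step 0."""
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0
```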

Thumbnail 1110

We also have some exciting new features on HyperPod that were released just today, where we are able to resume your training by loading the model state from peer nodes without needing to go to durable storage or a previously saved checkpoint, which saves you even more time. We have also launched a new feature today called elastic training, which allows you to reconfigure your training on the fly depending on the resources that are available. I would encourage you to check those out as well. It is because of all of these features that top AI companies are using HyperPod to both train and deploy their models. If you happened to watch Writer's CEO in the keynote earlier this week, they talked about how they have trained their latest models on HyperPod and been able to speed up their training by over three times.

Thumbnail 1150

Thumbnail 1160

EC2 UltraServers: Advanced Compute Architecture for Model Training

So now that I have given you an introduction to HyperPod, I am going to hand it over to Paulo to talk more about EC2 UltraServers and give you a demo of how all of this works. Thank you so much, Rekha. So we talked about how to create those clusters and how to allocate that compute. Let's talk about that advanced UltraServer compute. First of all, UltraServers are a definition of how you combine hardware together to make your model training and your workloads perform faster and work better. At last year's re:Invent, we initially announced UltraServers based on AWS Trainium chips. Those chips were designed and developed by Annapurna Labs, which is one of Amazon's companies. Annapurna developed that chip focused initially on training, and that is why it was named Trainium. But nowadays, the way the chip is designed, it does training and also inference. Developing your model and running it on Trainium devices is very simple: you have the AWS Neuron SDK, which integrates with most of the machine learning frameworks on the market.

Frameworks such as PyTorch, for example, allow you to define the device during model training. Instead of defining the device as CUDA, where you would be leveraging NVIDIA GPUs, you define the device as Neuron. By doing so, PyTorch and the frameworks supported by Neuron will compile the graphs and run that model training on Trainium devices. To help you speed up training, we developed the UltraServer based on Trainium. It combines four instances together. Those instances are interconnected by a very fast network, which is not a traditional network. It's similar to what NVIDIA has with the NVLink Switch, a dedicated proprietary network. We developed a proprietary network for those Neuron devices so they can communicate together, bundling four of those servers up to 64 devices, with each server having 16 devices.
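In practice, with the Neuron SDK the device shows up through PyTorch/XLA rather than a literal "neuron" device string. A minimal sketch, assuming a Trainium instance with torch-neuronx installed; check the Neuron SDK documentation for the exact idiom for your framework version:

```python
import torch
import torch_xla.core.xla_model as xm   # provided by the PyTorch/XLA + Neuron (torch-neuronx) stack

# On a Trainium instance the XLA device maps onto the NeuronCores,
# so the usual "move model and data to the device" pattern carries over.
device = xm.xla_device()                  # instead of torch.device("cuda")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024).to(device)

loss = model(x).sum()
loss.backward()
xm.mark_step()                            # trigger XLA graph compilation and execution
```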

The same goes for the GB200s and the recently announced GB300 servers, where we are bundling up NVIDIA GPUs. Putting them together across different servers, each with a certain number of GPUs, we can bundle up to 72 GPUs together. You can imagine the difference between training your model on a single server with a few GPUs and training on an UltraServer with 72 GPUs, and why that matters. But before we discuss this, let's understand how those servers are interconnected and how that innovation was developed.

When we talk about UltraServers, the name implies one server, but actually you have different instances. For the UltraServer based on Trainium, I mentioned we have four instances, each with 16 devices. For the UltraServers with NVIDIA GPUs, you have up to 18 instances, where each instance has four devices. Those instances are interconnected by the NVLink Switch, which takes NVLink, the proprietary interconnect, out to an external switch. If you heard Jensen Huang, the CEO of NVIDIA, talking about this last week, it is a piece of innovation: a proprietary connection that sits outside of those servers and interconnects up to 18 of them. That's not Ethernet, and that's not an InfiniBand network. You also get a secondary network in case you need to expand your workload across multiple servers.

So is the UltraServer the only innovation? Putting those instances together on an NVLink Switch, is that the only innovation? The answer is no. The other thing that NVIDIA did was develop a chip, a CPU. A company that was heavily GPU-based decided to create a CPU. Why is that? Because they understood that to help customers, that CPU, which is ARM-based and not x86, needed to be connected to the GPU over a very fast link as well. So they developed the superchip, which is a GPU and a CPU together. That superchip enables some interesting use cases. If you're training your model and your GPU, which has 108 gigabytes of memory, needs to access memory mapped on the CPU, it now can, through a single coherent memory mapping. That was a very interesting innovation, especially when you're talking about offloading activations: your model is training and learning the weights, but at some point data that is not immediately being used is offloaded from GPU to CPU memory, and now you have memory available on the GPU to load more data.

Of course, when you're training your models and you're using a lot of those devices, you want data as close to the compute as possible. To help with that, Amazon SageMaker HyperPod has a feature called topology-aware scheduling: when you're deploying those workloads across many different servers, we understand how to place them so that the data and the compute are close together.
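Conceptually, topology-aware placement comes down to giving the scheduler a hint about which NVLink domain (which UltraServer) a pod should land on. Below is a minimal, illustrative pod definition expressed as a Python dict; the annotation key, value, and image are placeholders made up for illustration, not the real HyperPod keys, so take the actual names from the HyperPod documentation.

```python
# Illustration only: a pod pinned to a single UltraServer so all of its ranks
# share the same NVLink domain. The annotation key below is a placeholder.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "trainer-rank-0",
        "annotations": {
            "example.aws/ultraserver-id": "ultraserver-0001",   # placeholder key and value
        },
    },
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "my-training-image:latest",                # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": 4}},     # all four GPUs of one instance
        }],
    },
}
```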

On the UltraServer example specifically, we have that UltraServer ID annotation, which you have to put on your container definition. We're going to see during the demonstration how, even if you're not a Kubernetes expert, you can use that, because we have tools to help you remove that undifferentiated heavy lifting. So let's go to the demo and understand how we can connect all of these features together. During the demo, I'm going to be walking through a video that I pre-recorded, so we don't have any problems with the internet connection, though we might have some problems with the video. Well done. Perfect.

Thumbnail 1570

Thumbnail 1580

Thumbnail 1590

Thumbnail 1600

Demo: Flexible Training Plans and Task Governance in Action

So let's start with getting compute capacity. I have the SageMaker AI console here, and we already have a training plan, but let's go through the creation of a training plan. Training plans can be defined for SageMaker HyperPod clusters, for training jobs, or even for inference endpoints. You choose an instance type, you choose the number of instances you want for the training plan, and you choose the start date. The start date could be in the future, so you can plan to get capacity ahead of time if you want to. Then you click on find training plan, and you're going to get some options. Those options could be immediately available, as we see with the first two options, or there could be options that only become available in the future, even though you asked for a start date of today.

Thumbnail 1620

Thumbnail 1640

Thumbnail 1650

Thumbnail 1660

After you have done that, you can choose which training plan you want. It's important to understand which availability zone that compute capacity is allocated in, because once again, we want that compute capacity to be as close as possible to your data. So you look at the availability zone and then you decide that that's the right training plan. You have to give that training plan a name, which could be any kind of name, and you can add tags to it. Then you have this summary, this review. You review the availability zone again. You review how much you're going to pay for the training plan, and remember, when you're paying for a training plan, that's an upfront cost. You also have the price per instance and all of the details, and of course you have to confirm that you want to buy that training plan.

Thumbnail 1680

Thumbnail 1690

Thumbnail 1720

When you click on that button, the plan will show up on the console as pending, along with the number of instances you requested, because it takes a few minutes until that training plan is actually active. After that training plan is active, you can go to your SageMaker HyperPod cluster and allocate that training plan to a specific instance group. An instance group is a logical definition: a specific instance type, the name of the group, and the training plan. Then, when you're submitting your jobs, they will run on the instance group that you have. So if you need a job that runs on GPUs, you're going to have an instance group with GPUs, and if you need a CPU job, that instance group can be CPU-based as well.

Thumbnail 1730

Thumbnail 1750

Thumbnail 1760

Thumbnail 1770

Some of the definitions on that instance group: I define the instance type and the training plan, but I also have to provide an IAM role that gives the orchestrator, either Slurm or Amazon EKS, permission to use that instance group, and an S3 bucket where you keep the lifecycle scripts. After you have your instance groups, and you can see I have a few different ones there, we can go ahead and use the other features of SageMaker HyperPod. Remember, those instances might take a while to come up, because we are installing a lot of software for you, configuring the network, running some benchmarks, and making sure that everything is working correctly. After they are allocated, you can go back to the training plan console and see how many of the instances that you have bought are allocated or idle.
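For reference, here is a rough boto3 sketch of what an instance group definition looks like when creating the cluster programmatically. The field names follow the SageMaker CreateCluster API as I understand it, but treat them, and especially the training plan field and the placeholder ARNs and paths, as assumptions to verify against the current API reference.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_cluster(
    ClusterName="my-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-training",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            # IAM role the orchestrator uses for this instance group (placeholder ARN).
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            # S3 location of the lifecycle scripts run when nodes come up (placeholder).
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            # The flexible training plan purchased earlier (placeholder ARN).
            "TrainingPlanArn": "arn:aws:sagemaker:us-east-1:111122223333:training-plan/my-hyperpod-plan",
        },
    ],
)
```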

Thumbnail 1790

Thumbnail 1800

Perfect. Let's go back to the cluster and actually activate the task governance feature. Activating that feature is as simple as a single click. You click on a button, and behind the scenes we install the software for you and configure task governance. After it's ready to be used, all you have to do is define the policies.

Thumbnail 1820

Thumbnail 1830

Thumbnail 1850

Thumbnail 1860

Thumbnail 1890

Policies are going to be the different priorities in your queue. For example, I'm defining four different priorities, where inference has the highest priority, fine-tuning has the lowest priority, and I also have experimentation and training priorities. After I have defined those policies, I have to actually allocate compute to different teams. I can create different teams, take the capacity that I just bought through the flexible training plan, and allocate it to my teams. Right now, I'm creating Team A. I'm going to call it my awesome team A, and that team will get some capacity allocated.

One of the things that I can define when allocating that capacity is whether I'm going to be generous with other teams or not: I can lend capacity to, or borrow capacity from, those teams. That's a decision I have to make. After I decide what kind of lending and borrowing I'm going to be doing, I can also define how much capacity I'm willing to lend to or borrow from other teams. The last feature on that panel is enabling preemption. If you remember, we talked about different priority jobs running on a queue, where lower priority jobs are running and then higher priority jobs come in. You want that higher priority job to preempt the lower priority job, so you enable that for a team when defining the allocation of those compute nodes.

Thumbnail 1920

Thumbnail 1930

Thumbnail 1940

Thumbnail 1950

Thumbnail 1970

Thumbnail 1980

Thumbnail 1990

Thumbnail 2030

Thumbnail 2040

Thumbnail 2050

Thumbnail 2060

Thumbnail 2070

Thumbnail 2080

Thumbnail 2090

You can choose specific instance groups, so you don't have to choose from all the instance groups. You can choose to allocate one whole instance to that specific team, or you can define that you're going to allocate just a few GPUs, a few vCPUs, and some memory to that team. The allocation of compute resources is the baseline of compute that team has. It doesn't mean the team cannot have more compute: it can borrow. So that we can experiment with that, I'm defining a second team, Team B. Both teams are going to have a single resource, and the example I'm going to show you is submitting jobs. Now we're going to see how everything we just configured comes together.

First of all, this is my cluster. My cluster has two instance groups: one is the controller, the other one is the training instance group, where the controller doesn't have GPUs and the training instance group has some GPUs. After I review that, I submit a basic training job, and I'm not prioritizing it. I'm not defining any tag for prioritization here; I'm just submitting it. This SageMaker HyperPod cluster is orchestrated by Kubernetes, but I'm using the HyperPod CLI, which allows you to submit those jobs without knowledge of Kubernetes. With hyp create hyp-pytorch-job and the definition of the job name, container image, the pull policy, and the number of nodes you require (you can see I'm defining three nodes here), you submit the job. Behind the scenes, the HyperPod CLI creates the container definition for you and submits it to the Amazon EKS cluster.

That job was submitted. We have three replicas, meaning we are using three pods across three servers. The pods are running and the workload is running. Let's look at those pods the way a Kubernetes administrator would: I could run kubectl get pods and see those pods, and then check whether the pods are training. Yes, we can see that the pods have started training. We have the training split, so we are splitting the dataset into training and validation, and we are fetching the files directly from Hugging Face.

Perfect, it's simple to submit a training job, right? But what happens when one of those nodes fails? During large distributed trainings, nodes will fail. Right now in this demo, I'm simulating a failure by marking one of the nodes as unschedulable, with the reason pending reboot. Let's see what happens when I do that.

Thumbnail 2110

Thumbnail 2140

Thumbnail 2160

Demo: Automated Failure Recovery and Dynamic Capacity Management

I submit this command to the SageMaker HyperPod cluster. I can go to the management console, and I'm going to see that node pending. If that were a harder failure, what SageMaker HyperPod would do is mark that node as failed, remove it from your cluster, bring in a new node which is running perfectly, and add that node back to your cluster. So initially you would see your node in an unknown or not-ready state, but after a few moments, once SageMaker HyperPod had replaced that node, you would see it back online. Because we were down a node, one of the pods stopped running and went to pending. Now that the node is back online, that specific pod is running again. And you can see all of that in the SageMaker HyperPod management console.

Thumbnail 2170

Thumbnail 2180

Thumbnail 2200

Thumbnail 2210

Now let's talk about borrowing compute capacity. I'm doing the same thing: I'm starting a training job, but what's the difference here? The difference is that I'm now annotating my training job with a priority, the training priority, and I'm giving that training job a namespace. That namespace was defined when you created the team. When you create a team, you get a namespace definition, so you have isolation between different teams running on the same EKS cluster.

Thumbnail 2220

Thumbnail 2230

Thumbnail 2240

Thumbnail 2250

I submit the job. The job was submitted successfully, the pods are running successfully, and I can see that the HyperPod CLI automatically annotated that job for me. When I checked the queue for those two different teams, Team A has one running job and Team B has no jobs at all. Perfect, but how am I running that if Team A only had a single node and they submitted a job for more than one node? If you see over here, node count equals two. Let's see what happens.

Thumbnail 2280

Thumbnail 2290

What happened is that Team A actually borrowed capacity from Team B, and we can see that borrowed capacity in the quota allocation. Instead of having only 48 vCPUs, we borrowed an additional 40, and now we have 88 vCPUs. Instead of having only 4 GPUs, we borrowed another 4, and we have a total of 8 GPUs to run our use case. So that's borrowing capacity. Now let's go back and ask: what happens if Team B submits a job? I'm borrowing capacity from Team B. Will that job run?

Thumbnail 2310

Thumbnail 2320

Thumbnail 2340

Let's see what's going to happen. I'm submitting the job the same way I've been submitting jobs using the HyperPod CLI, just for a different team with the same priority. When I do that, the job runs, and we can see that the job was annotated. Let's check the queue and see what's happening right now. We can see that the job that was running for Team A is now suspended and pending, and the job for Team B is running. Team A is suspended and pending because it doesn't have that borrowed capacity anymore; Team B reclaimed it. When we look at the jobs from a HyperPod CLI perspective, we can see that the Team A namespace has a suspended job and the Team B namespace has a running job.

Thumbnail 2350

Thumbnail 2360

Thumbnail 2370

Thumbnail 2390

Now let's go ahead and preempt a lower priority task. Right now training is a lower priority than experimentation. So we're going to submit a new job on Team B with the experimentation priority. We can verify that experimentation has a weight of 80 while training has a weight of 70. When we submit the job, the job is submitted successfully and the job is running. Now we can see on Team B we have one running job, which is the experimentation one, and that training job has been suspended because it has a lower priority. If we check the allocation, we can see that Team A is not borrowing capacity. That's why the Team A job is still pending. Team B is actually borrowing capacity from Team A. So that experimentation job requires some additional capacity. It borrowed from Team A, and that's where we are putting all of the features together.

Thumbnail 2420

Thumbnail 2430

Thumbnail 2450

Thumbnail 2490

Thumbnail 2500

Thumbnail 2510

UltraClusters and Network Innovation: Enabling Mixture of Experts at Scale

So that's it for the demo. That's what I wanted to show you, and I want to go back to the slides and show you some advanced use cases for HyperPod with those UltraServers. Before we dive deep into Mixture of Experts and how to leverage UltraServers with it, let's recap what we have covered so far. We have seen that the rapid evolution of generative AI brings challenges, and that SageMaker HyperPod helps you overcome those challenges and increase your goodput. One example is Perplexity: Perplexity is using SageMaker HyperPod and speeding up their development by 40%. You can also easily access GPUs by using flexible training plans, and one of the advantages of training plans is that you get up to a 68% discount versus on-demand pricing. So you reserve capacity through flexible training plans, you plan that capacity, and you also get a discount. Task governance together with topology-aware scheduling makes sure that all of the jobs you submit land on the right compute, close to the data, using the fastest network possible.

Combining all of that, and thinking about real-world and advanced use cases that require a lot of compute power, let's dive deep into how you can leverage more than one server and how those servers are connected together. All of the jobs that I submitted during the demonstration were using more than one instance. You don't want one instance in one data center and the second one in another data center. You don't even want instances that are not interconnected on the same network spine. That's why AWS developed UltraClusters: a combination of many different servers interconnected on what we call a single spine, meaning they are interconnected with the lowest latency possible, the highest throughput possible, and the fewest network devices in a single route.

Thumbnail 2590

Thumbnail 2640

Not only that, but UltraClusters also bundle in storage, because you want compute close to the data. You want to bring those datasets close to the GPU, and you want storage close to the GPU because you need to save checkpoints as fast as possible. UltraClusters connect Amazon FSx for Lustre on that single spine as well, so you get all of that performance. I've been talking a lot about the network, but is it really important, or am I just talking about a specific AWS product? We profiled a training run of Nemotron 15 billion parameters, and we identified that 60% of that training time was spent on the network. It was spent on what we, and the market, call communication collectives. Meaning: when a GPU is training the model, it is learning weights, but at some point it needs to synchronize the gradients, do some math, and apply a loss function, and when it starts doing that, it uses the network heavily. Those collective communications are based on libraries like the Message Passing Interface (MPI), and they make heavy use of the network.

Thumbnail 2650

Thumbnail 2660

Thumbnail 2680

So how did AWS develop a network that can make that model training go very fast? Traditional networks use Equal-Cost Multi-Path routing, or ECMP, with flow-based routing. Meaning: when you start transferring data, you open a single channel and transfer all the data you need on that channel. You might open a second channel and start transferring another batch of data, but those two channels might use the same network device in the same route, and that network device could become either a bottleneck or a single point of failure. So what AWS did to help customers is develop the Elastic Fabric Adapter, which uses Scalable Reliable Datagram, or SRD, a protocol developed by AWS that we use on our network. With SRD, when you open that channel for communication, it actually shards your data and sprays it across multiple different channels, using all of them together.

Thumbnail 2710

Different channels are used together, so you transfer data from instance A to instance B using all of the available channels and all the available network devices at the same time. This means that if a single network device fails, you are not heavily impacted. We have published benchmarks that showcase the connection between instance A and instance B: without SRD, the connection stops for a while if a network device fails, but with SRD and EFA, it does not stop.

Thumbnail 2760

It has a small hiccup in terms of performance; it might drop from 100% to 80% and then come back to 100% very quickly, because we are spraying those packets across multiple devices. Putting it all together, the NVIDIA superchip with the Grace Blackwell GPUs and those EFA devices: a simple diagram of one of those instances shows the network devices connected directly to the GPUs, so the GPUs have access to that fast network, and the CPU connected to the GPU.

Thumbnail 2790

How does that tie together with Mixture of Experts, which is an advanced use case? It starts with understanding how transformer models work. Transformer models perform encoding, embedding, and data transformation, generating and correlating new data. For advanced use cases, such as specializing a model for translation or image creation, training runs were taking too long, because all of the GPUs wanted to do everything and the model was not getting good specialization in different domains.

Mixture of Experts helps you have different GPUs that are the experts, and you have a router, which is a function or logic that understands how to manage data throughout those experts so the experts get the right data. A good example could be a classroom with different teachers. You could have a math teacher and an English teacher. The English teacher does not want to get those very complex algebraic equations. That English teacher wants to get Shakespeare books and literature, while the math teacher gets the numbers. You need someone to distribute that load, and you have that router.
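To make the router/expert idea concrete, here is a deliberately tiny, single-device PyTorch sketch of a top-1 Mixture-of-Experts layer. It is an illustration of the routing concept only, not the distributed implementation used on UltraServers (which adds load-balancing losses and all-to-all communication between GPUs and instances):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Minimal top-1 MoE layer: a learned router sends each token to one expert."""

    def __init__(self, d_model: int = 256, d_ff: int = 512, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # the "router" from the analogy
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)        # routing probabilities
        expert_idx = scores.argmax(dim=-1)                # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                        # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 256)
print(layer(tokens).shape)   # torch.Size([10, 256])
```

In a real training run the experts live on different GPUs or instances, so the token dispatch in the loop above becomes network traffic, which is exactly the communication Paulo describes next.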

Thumbnail 2890

Thumbnail 2900

Thumbnail 2910

Mixture of Experts requires a lot of network communication. How that communication happens in this use case involves several steps. First, you have a leader, rank 0, which starts the process, starts the communication, and determines how to define the different experts. After that, the tensors, which are parts of your neural network, are mapped to those experts: the parts of your neural network that will actually train on each expert are moved there.

Thumbnail 2920

Once those experts are defined, data is exchanged, and a router might be sitting on one instance sending data to another one. At the bottom of that slide you can see an image with node zero, an instance with 2 GPUs, and node one, a different instance also with 2 GPUs. You can see that one of the routers, the one in the middle, sends data to both node 0 and node 1.

Thumbnail 2990

If you are using an UltraServer, 72 GPUs are interconnected, so you can have different experts running on those 72 GPUs over the NVLink switch alone, without leveraging the secondary network. But if you need more than the 72 GPUs of an UltraServer, or if you are using other instances that have fewer GPUs, then you need the secondary network for that router communication. At the end of your training, you have to collect all of those gradients and save the model state one last time, so you can use it later for inference. So how does all of this map to UltraServers and UltraClusters?

Thumbnail 3010

Best Practices, Resources, and Closing Remarks

This is where everything gets tied together. On the UltraServers, if you're creating your model and you're enabling Mixture of Experts, you can define expert parallelism equal to 18, meaning you're defining every single instance in the UltraServer as a single expert, and each expert has access to 4 GPUs. By doing so, you are isolating the communication of that specific expert on a single instance, and you don't need to do any computation outside of that instance. Then you define tensor parallelism equal to 4, meaning you're sharding that neural network across the 4 different GPUs, so you have 4 GPUs, each with one-quarter of that expert's neural network, spinning up your model.
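As a quick sanity check of that layout (just the counting rules for expert and tensor parallelism on a 72-GPU UltraServer, not an actual launcher configuration):

```python
# GB200 NVL72 UltraServer as described in the talk: 18 instances x 4 GPUs = 72 GPUs
instances_per_ultraserver = 18
gpus_per_instance = 4

expert_parallelism = 18    # one expert per instance
tensor_parallelism = 4     # each expert's weights sharded across that instance's 4 GPUs

gpus_needed = expert_parallelism * tensor_parallelism
assert gpus_needed == instances_per_ultraserver * gpus_per_instance == 72

# Each expert stays inside a single instance, so its tensor-parallel traffic never
# leaves that instance, and expert-to-expert (all-to-all) traffic stays on the
# UltraServer's NVLink switch.
print(f"{expert_parallelism} experts x {tensor_parallelism}-way TP = {gpus_needed} GPUs")
```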

Thumbnail 3080

But if you're moving beyond 72 GPUs, growing to 144 or even more than that, that's where the UltraCluster comes in. You have the UltraServers, each with up to 72 GPUs connected over the NVLink switch, and then you interconnect those UltraServers with the Elastic Fabric Adapter on top of the Scalable Reliable Datagram protocol, and you can get more than 25,000 GPUs interconnected, working as a single system.

Thumbnail 3110

So here are some calls to action that I would really like you to take away before you leave this session. First, there are a lot of best practices for running those UltraServers. It's not only acquiring capacity and using the SageMaker HyperPod features for task governance and priorities on those queues; you also have to set up specific libraries with specific versions, and we have all of that well documented in the SageMaker HyperPod documentation and the EC2 UltraServer documentation as well. You might have heard about Multi-Instance GPU (MIG), a capability we recently announced support for, meaning you can take a GPU and carve it, for example, into two different virtual GPUs. That is not supported on UltraServers today.

Thumbnail 3160

The other thing I want you to take away before leaving this session: get your phones out and scan these QR codes. The first one, AI on SageMaker HyperPod, is a repository we created with a lot of instructions on how to put all of those features together easily and quickly, so you can speed up the deployment of those clusters in your AWS account. Most of what we have seen in today's demo is explained there, so you can replicate it in your own account.

The second repository, the Awesome Distributed Training repository, has examples of, for instance, fine-tuning DeepSeek or Llama, training Llama 4 and Llama 3, or running micro-benchmarks if you want to test and troubleshoot your cluster. All of those assets come from customer engagements, and they are things you can leverage to move faster, so you don't have to dig through documentation or figure out how to put all of those pieces together yourself.

Thumbnail 3270

With that, let's close the session. Thank you so much for being here, not only at re:Invent, but for coming to this session and listening to me and Rekha. Rekha, thank you so much for sharing the stage with me, and please don't forget to fill out the survey. As the data-driven organization we are at AWS and Amazon, we really need that survey, because we read your comments. Whatever you put there, we're going to use as lessons learned for our next session, and hopefully you're going to have even more fun then than you had today. Thank you so much.


; This article is entirely auto-generated using Amazon Bedrock.
