Kazuya

AWS re:Invent 2025 - Zoox: Building Machine Learning Infrastructure for Autonomous Vehicles (AMZ304)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Zoox: Building Machine Learning Infrastructure for Autonomous Vehicles (AMZ304)

In this video, Zoox's Jim Robinson-Bohnslav explains how the company trains multimodal foundation models for autonomous robotaxis using AWS SageMaker HyperPod. He describes Zoox's approach to handling long-tail driving scenarios—from jaywalkers to tanks on the Las Vegas strip—through large-scale language action models that combine LiDAR, radar, and camera data. The team uses supervised fine-tuning with tens of thousands of driving hours and reinforcement learning techniques like GRPO. Avinash Kolluri details HyperPod's auto-recovery capabilities and resilient infrastructure supporting 500+ node clusters with EFA networking. Anindya demonstrates their implementation using HSDP and tensor parallelism, achieving 95% GPU utilization on P5/P6 instances with Mosaic data streaming, while transitioning from SLURM to EKS orchestration for greater scalability.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Foundation Models at Zoox and AWS SageMaker HyperPod

Thanks everyone for coming. Today, we're going to be talking about how Zoox uses machine learning infrastructure built on AWS to train our foundation models. My name is Jim Robinson-Bohnslav, and I'm the Technical Lead Manager for Foundation Models here at Zoox. My co-presenters Anindya and Avinash will introduce themselves when they come up.

Thumbnail 30

Here's the overall agenda. First, I'm going to give a short introduction to Zoox. Then I will give an overview of the use cases for foundation models here at Zoox, and my co-presenters will go over AWS SageMaker HyperPod and how we're integrating that into our training and evaluation stack.

Thumbnail 60

Zoox's Vision: Building a Purpose-Built Robotaxi with 360-Degree Sensing

What is Zoox and why did we build this company? Zoox aims to make personal transportation safer, cleaner, and more enjoyable for everyone. To do this, we built a purpose-built robotaxi that is fully electric, fully autonomous, and designed for riders. Hopefully you've already seen it driving around the strip.

Thumbnail 80

Why did we decide to do this? We believe the model of individually owned human-driven vehicles is broken. Human drivers are quite dangerous, and there are many fatalities every year. Human-driven vehicles are idle the vast majority of the time, not actually taking you from place to place, and they also put out a lot of pollution.

Thumbnail 120

Thumbnail 140

Zoox is live in Las Vegas, so I'll let you see this little demo. You can see our robotaxi going on the strip. It's very beautiful. I encourage you to take out your phones, get the QR code, and get the app. Riding is free in Vegas right now, so that's very fun. I strongly encourage you to take a ride and let us know what you think.

Thumbnail 150

Thumbnail 160

Thumbnail 170

Here's a high-level overview of Zoox's autonomy. It drives itself, hopefully that is not a surprise to any of you. To do this, we have sensor pods all around the vehicle. You can see these pods on the top corners in this actual image. They have LIDARs, radars, cameras, thermal cameras, microphones, and tons of other sensors. We have a 360-degree field of view so we can see all around the robotaxi, and they are redundant in case of hardware failure. We can see up to and over 150 meters away from the vehicle.

Thumbnail 190

Thumbnail 200

The Three Pillars of Autonomy: Perception, Prediction, and Planning

The three pillars of autonomy are perception, prediction, and planning and controls. Perception's job is to take raw data streaming in from all of our sensors and turn it into a structured understanding of the world. The core of that is the 3D bounding box: we detect and track every agent around the vehicle in three dimensions so we can tell whether it's a car, a pedestrian, or a truck. We also detect traffic lights, in addition to using our HD maps.

Thumbnail 230

Thumbnail 260

Prediction's job is to take all these agents that are moving about and predict where they might be in the future. These predictions are multimodal, which means we predict multiple possible futures of every agent. The planning stack and controls can then turn that into the actual volts that end up controlling the vehicle. Planning integrates our perception stack, the outputs of prediction, the map that we've prebuilt, and of course the mission. You have to know where the rider wants to go in order to make the high and low level plans and controls.

Thumbnail 270

Thumbnail 280

Navigating the Long Tail: Unusual Scenarios on Las Vegas Streets

Most driving is easy, but some driving is very hard. Here we have a log where we're going at a pretty good rate, I think between 25 and 45 miles per hour, and there are two fire trucks in the right lane. Note that they don't have their sirens or their lights on, so it doesn't look like they're actively responding to an incident, though we obviously take active fire engines very seriously. Even so, they had put a fire hose into the drivable lane, and our autonomy stack is capable of detecting this and making a plan to safely move around it.

I have been driving for 18 years and I have never seen anything like this on the road, but our robotaxis see things like this all the time.

Thumbnail 330

Thumbnail 350

Here is another example right from here in Las Vegas. We are going down the strip at 30 miles an hour and somebody decides to walk across all 6, maybe 7 lanes of traffic. Again, this is not something I have ever seen, and I do not think that was a great decision, but our autonomy stack has to be capable of handling people making poor decisions.

Thumbnail 360

Jaywalkers are important to deal with, but they are not particularly unusual. We do see very unusual things in our robotaxis. Going across from top left to top right, we have a child dressed up as a construction zone flagger. Spoiler alert, we do listen to construction zone flaggers, but in this case it is a little unclear if we should or not.

One time we saw a convoy of tanks going down the strip, which you might imagine for our machine learning-based perception stack is a little unusual to deal with. We do not have a specific tank class in our ontology. The next image shows that we obviously pay a lot of attention to traffic cones, as they tell us where construction zones are, incident zones, and other things. Somebody, perhaps on a fun Vegas night out, decided to put one of those on top of a sign. Again, it is not clear that we should listen to that.

We see cars on fire, which is maybe not that unusual. We see handwritten construction zone signs saying "keep right" that we might actually have to obey. Here in Vegas, there is a half tour bus, half shopping cart called the Cartzilla that we see often. Our autonomy stack has to be able to handle all these long tail cases.

We even have to be able to handle when people are behaving badly. In this particular image on the right side, it is a little dim because it was at night, and it is somebody's foot as they climb on top of the robotaxi. In the left panel is their friend videoing them. Of course, we also have somebody riding around with a dog in the backpack. It is unclear if we should call that agent a dog or a pedestrian or a bicyclist or what is going on there.

Thumbnail 510

Applying the Bitter Lesson: Training Foundation Models for Zero-Shot Performance

This is the long tail of autonomy, and our system has to be able to deal with all of these. But you might imagine the edge cases proliferate, and it becomes increasingly hard to deal with as we scale. One way to handle this, and the way that our team has decided to handle this, is with foundation models. When you have a really difficult problem, you go back to theory and principle. In this case, we turn to the Bitter Lesson, which should be familiar to all of you. This is the famous blog post from Richard Sutton, and the key quote is: "The biggest lesson from 70 years of AI research is that general methods that leverage computation are ultimately the most effective."

Thumbnail 550

Thumbnail 570

I have adapted a graph I had seen on Twitter showing three possible ways that a model might improve in its performance over time. The first way, in light blue, is where you add a lot of human engineering and expertise. You might have a small model designed to do one specific thing, like determining whether a construction zone flagger is a kid in a costume or not. That is going to be great in the short term and it is always the best in the short term.

Thumbnail 580

Thumbnail 600

In the middle, maybe you have a more general model. But we are really betting on this long-term approach, which is that we are going to use a lot of compute, which my co-presenters will tell you about, and we are going to have a very simple approach. We are not going to put a ton of human knowledge or biases in. We are just going to train big models on big data. Specifically, we are going to combine many different tasks and datasets and approaches into one multimodal language action model.

Thumbnail 610

Our goal is to train a model to zero-shot these long tail scenarios. Zero-shot means we don't have to see one, two, five, or ten tanks before we can accurately deal with it, or we don't have to see ten kids in flagger costumes. Our stack will accurately handle this the very first time we ever see an event. That's what it means to zero-shot, and that's our goal.

Thumbnail 640

Building a Multimodal Language Action Model for Robotaxi Control

Our approach is to make a multimodal language action model. The core of it is an LLM, and the output is robotic controls for the robotaxi: acceleration, braking, steering, and so on. We also output 3D detections like our perception stack does currently, and other more generative outputs like answers to questions. For example, am I supposed to listen to this flagger, or is the hotel employee gesturing at me or at the guy next to me? We also produce descriptions of scenes; captions are very useful for many applications.

Thumbnail 690

The input to this, like any LLM, is a text prompt that goes through a pre-trained embedding layer, and those embeddings go into our LLM. In this case, the prompt is a dummy prompt: "You are the driver of a Zoox robotaxi. What should you do in this scenario? Think step by step." Along with the text prompt, we have images or videos coming off of our sensors, and those go through pre-trained encoders and then a projection layer into embeddings that the LLM can process. This is a standard multimodal LLM approach.
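As a rough illustration of that encoder-plus-projection pattern, here is a minimal PyTorch sketch. The module names and dimensions are assumptions for illustration, not Zoox's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Minimal sketch of the standard multimodal pattern described above: a frozen,
    pre-trained encoder produces feature vectors, and a small projection MLP maps
    them into the LLM's token-embedding space so they can be concatenated with the
    embedded text prompt. Dimensions are illustrative only."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sensor_features: torch.Tensor) -> torch.Tensor:
        # sensor_features: (batch, num_tokens, vision_dim) from a frozen camera/LiDAR encoder
        return self.proj(sensor_features)  # (batch, num_tokens, llm_dim)


# Concatenate projected sensor tokens with embedded text tokens before the LLM backbone.
projector = VisionProjector()
sensor_tokens = projector(torch.randn(2, 256, 1024))   # encoded camera/LiDAR features
text_tokens = torch.randn(2, 32, 4096)                  # output of the LLM's embedding layer
llm_inputs = torch.cat([sensor_tokens, text_tokens], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 288, 4096])
```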

Thumbnail 720

Thumbnail 740

Thumbnail 750

Uniquely at Zoox, we've been driving around for a long time and we have nearly infinite LiDAR and radar data. We encode that and project it into the LLM as well. One of the main benefits of using an LLM as the core is its flexibility. We can encode and pass many things into our model, including the outputs of our perception stack right now if we wanted to. For example, the 3D boxes that we generate can be encoded and passed into the model.

Thumbnail 780

Thumbnail 790

So how are we actually training this? This is a very high-level overview, but I'll walk you through it. We start with pre-trained models. We didn't want to get into the pre-training game for reasons you might imagine. We take state-of-the-art vision language models like Qwen 3 VL, which came out a few weeks ago, and we based our model off of that.

Thumbnail 810

Our first stage is really large-scale supervised fine-tuning. This includes behavior cloning from tens of thousands of hours of human driving, where we train the model to do what the human did. We have millions of 3D detection labels so we can train the model to understand the world in three dimensions. We also use more standard LLM techniques like visual question answering and spatial understanding.
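As a sketch of what "train the model to do what the human did" can look like in practice, here is a minimal behavior-cloning SFT loss in PyTorch. The tokenization of driving actions and the function names are assumptions for illustration, not Zoox's pipeline.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(logits: torch.Tensor,
                          target_action_tokens: torch.Tensor,
                          loss_mask: torch.Tensor) -> torch.Tensor:
    """Standard SFT objective applied to behavior cloning: the model's next-token
    logits are scored against action tokens derived from the human driving log,
    and the loss is applied only at action positions (prompt and sensor positions
    are masked out).
    Shapes: logits (B, T, V), target_action_tokens (B, T), loss_mask (B, T)."""
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab),
        target_action_tokens.reshape(-1),
        reduction="none",
    ).view_as(target_action_tokens)
    mask = loss_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```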

Thumbnail 840

The next stage in our pipeline is a smaller but higher quality supervised fine-tuning stage. This includes rare object detection, driving in particularly difficult scenarios, and synthetic chain-of-thought reasoning to help the model learn to think and reason its way to an answer.

Thumbnail 870

Thumbnail 890

Thumbnail 900

This leads to the last stage in the training pipeline, which is reinforcement learning. We fine-tune for robotic controls, that is, driving the robotaxi, on our most difficult scenarios, using techniques like GRPO and DAPO. For deployment, we use vLLM for offline purposes, and for online purposes we're working towards potential integrations with something like TensorRT-LLM.
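To make the GRPO idea concrete, here is a hedged sketch of the group-relative advantage computation at its core; the reward values are made up, and this is not Zoox's training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Core idea of GRPO: for each prompt, sample a group of G rollouts, score them
    with a reward function, and use the group-normalized reward as the advantage
    instead of training a separate value/critic network.
    rewards: (num_prompts, G) tensor of scalar rewards."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 driving prompts, 4 sampled trajectories each, scored by some reward function.
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],
                        [1.0, 0.2, 0.2, 0.2]])
print(grpo_advantages(rewards))
```

These advantages then weight the token log-probabilities in a clipped, PPO-style objective; DAPO adjusts the clipping and sampling details.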

Training Challenges: Managing Petabytes of Data and Scalable Infrastructure

So what are the challenges in training a model like this? There are many, as you might imagine. First, we have petabytes of data.

We have multimodal sensor data from cameras, LiDAR, and radar, plus all of the structured data that our autonomy stack already generates, which together number in the petabytes. Of course, accuracy and performance are a challenge: we want the model to be as good as possible and as low latency as possible, which is a fundamental trade-off. For the researchers and engineers on our team, we want rapid and scalable iteration. We need to be able to launch an experiment, track it, get metrics, and run our evaluations as quickly and smoothly as possible so that they can move on to their next follow-up experiment.

There's a whole host of difficult infrastructure that goes into this: managing datasets, allocating compute, debugging the model training pipeline, and so on. For this section, I will now turn it over to my co-presenter Avinash from AWS to tell you about AWS SageMaker HyperPod.

Thank you, Jim, for covering extensive details about Zoox and the challenges of large language models; that's exciting to hear. I'm Avinash Kolluri, a Senior Solutions Architect with AWS supporting Amazon and its subsidiaries on their cloud journey.

Thumbnail 1010

AWS SageMaker HyperPod: Purpose-Built Infrastructure for Resilient Distributed Training

How many of you here are building foundation models or doing distributed training and fine-tuning with large language models? I see a few hands raised. Thanks for that. I'm pretty sure you have run into issues around the hardware: you end up spending more time debugging and fixing node failures, and while you do that, your GPUs sit idle, which ends up costing you a huge amount.

Thumbnail 1040

For these reasons, we have SageMaker HyperPod. It's a resilient cluster that was purpose-built for distributed training and for running your machine learning workloads, so it takes care of the heavy lifting, driving efficiency up and cost down. Let's look at HyperPod's versatile environment. I'll try to unfold it and make it simple so that you understand what SageMaker HyperPod is about.

Thumbnail 1070

At the very beginning, on the foundation layer, we have storage and compute. As Jim mentioned, large language models need extensive resources and careful resource optimization. SageMaker HyperPod supports AWS Trainium accelerators as well as NVIDIA GPUs on P4, P5, and P6 instances. In fact, we have seen clusters of more than 500 nodes running on SageMaker HyperPod for distributed training and ML workloads.

Then comes the storage. Just for an example, with Zoox, there's a lot of data being covered across the lidar, cameras, and sensors. In fact, it is petabytes in scale. So you need persistent storage and better efficiency storage mechanisms like Amazon S3, FSx for Lustre, and EFS so that you can load this data directly onto the GPUs in order to do any sort of distributed training. These nodes do not come in ones or twos because when you load this amount of data, you need multiple nodes, tens and hundreds of them.

When you do that, you need high-performance networking to keep data transfer speeds high at scale, because you have to load the data and train on it. For that, we have Elastic Fabric Adapter (EFA), a network interface attached to these GPU instances that gives you up to 3,200 Gbps of bandwidth per node so you can run in high-performance compute mode.

Thumbnail 1180

Thumbnail 1200

Then comes the runtime environment. SageMaker HyperPod now supports two types of runtime environments: one is NVIDIA-based with the CUDA libraries, and the other is the AWS Neuron runtime. Right above the runtime, we have various ML frameworks and tools.

Thumbnail 1260

ML engineers deal with many tools and frameworks on any given day, for example PyTorch, NeMo, and Megatron, as well as Kubeflow or Ray for job submission. One of my personal favorites in this entire stack is observability. I strongly believe that you need to measure something to improve it, and having a single unified experience always helps with operational excellence. SageMaker HyperPod integrates easily with Amazon CloudWatch as well as Amazon Managed Grafana and Amazon Managed Service for Prometheus, so you have one-stop observability for your workloads and can build monitoring and alerts on top of it.

Thumbnail 1280

At the surface layer, we have the orchestrators. We have seen many customers using SLURM as well as Amazon EKS. Let's check what SageMaker HyperPod's sample architecture is. Whether you take an EKS-based or SLURM-based approach, this is a generic architecture that helps you understand how SageMaker HyperPod really behaves.

On the left side of the screen, you see the compute nodes. Typically, all of these compute nodes live in the Amazon SageMaker service account and are mapped into the customer's environment. The reason for that is the health checks running on the compute nodes, shown towards the bottom, so that SageMaker HyperPod can take control of those nodes, govern and guard them, and automatically replace them whenever there is a failure. That is one less job for ML engineers to worry about.

Thumbnail 1380

You can also see that there is a control node and a login node. The control node is basically the head node, used for orchestration and for handling the jobs that your ML engineers submit. The login nodes exist to protect the control node and to carry out any admin activities. SageMaker HyperPod also comes with a lot of built-in features. For example, lifecycle scripts run when an instance is booting up and make sure the cluster configuration and setup are done correctly. It also provides checkpointing capabilities and is directly integrated with Amazon CloudWatch for observability.

Thumbnail 1420

One of the most important pillars of SageMaker HyperPod is resiliency. Let's take a look at why resilience is important, and I'll try to break it down so it is easily understood. On the screen you're looking at a 16-node cluster running three jobs: job A, job B, and job C. Let's assume that job A has taken 8 nodes and jobs B and C have taken 4 nodes each.

Thumbnail 1450

When an ML experiment or workload is running and an instance fails, the cause is either software or hardware. The usual software failures are things like hyperparameter misconfigurations or even a missing semicolon, which is a normal part of an ML engineer's lifecycle: debug, fix it, resume the job, and pick it up from there.

Thumbnail 1480

But when it comes to hardware failures, this is physics. It can easily happen because of overheating or other reasons. When that happens, ML engineers end up spending a lot of time debugging, but ultimately all that happens is replacing the node. When the cluster doubles from 8 to 16 to 30 nodes and beyond, you'll end up spending more and more time debugging and fixing these nodes one after the other. That is expensive because your GPUs sit idle.

Thumbnail 1490

Here is a typical scenario. If the failure is a model issue, ML engineers go and debug it, as we just discussed. But if it is an instance issue, you have to investigate and replace the node, and if it occurs again, you have to repeat the same process over and over. This is where SageMaker HyperPod comes into the picture. Its biggest value proposition is the auto-resume feature based on health checks.

Thumbnail 1510

How this works is that SageMaker HyperPod keeps a set of reserved nodes in its service account, and it automatically health-checks and replaces failed instances so that you don't have to do that heavy lifting yourself. You can focus on building models instead.

Thumbnail 1540

We discussed some of the capabilities, and here's a quick recap of a few more. Resilience is something we touched on just now. The next one is scalability. When you deal with hundreds and hundreds of nodes, EFA really stands out, giving all of these nodes the bandwidth they need to intercommunicate and exchange information. There are also SageMaker flexible training plans, which Anindya will discuss next; they let you reserve instances up front so you can keep operations running.

SageMaker HyperPod also supports a versatile environment. Within the stack we have seen, many frameworks and tools are supported. This lets ML engineers focus on writing code and building models and worry less about integrating the many frameworks and tools. Another capability is efficiency. With the integration of deep learning AMIs and built-in scripts, many frameworks and tools are available out of the box, so you can immediately spin up your cluster and start building models. At the same time, the integration with CloudWatch lets you build an observability platform on top of it, so you can govern your resources and make the right decisions at the right time.

Thumbnail 1640

Thumbnail 1650

Thumbnail 1660

Thumbnail 1670

What's a typical lifecycle within SageMaker HyperPod when you build large language models? Jim mentioned that Zoox collects a lot of data; anyone building foundation models needs to. Once you collect the data, you need a mechanism to store it, and this is where S3 comes in. Now you have the data in S3. Next, you need to load this data and bring it to the GPUs, and for that you need efficient, distributed training mechanisms so the GPUs can keep the training process going. Towards the end, once the training is done, you deploy these models.

Thumbnail 1720

Zoox's Implementation: Architecture and Workflow on SageMaker HyperPod

With that, I'll hand it over to Anindya, who will discuss the implementation side of SageMaker HyperPod and the challenges.

Thank you, Jim and Avinash, for setting the context on why we are doing this and why we need SageMaker HyperPod. I lead the distributed training efforts in the ML platform at Zoox. There are many features in SageMaker HyperPod, but I will highlight a couple that really caught our attention. The first one was that we wanted to do large-scale training, which means FSDP, HSDP, and DDP. We want a framework or an ecosystem where we can do that freely with our own tools. We also want EFA enabled because we want to train large models across nodes and GPUs.

There are other features like checkpointing, which I'm going to talk about in a bit. The most important was training plans. Since we are working with models like Qwen 3 VL, we use instances such as P5s and P6s. They are costly instances and we cannot get them on demand, so how can we guarantee up front that we will get these nodes when we need them? SageMaker HyperPod training plans offer that flexibility: once we decide what we need, it is guaranteed to be available.

Thumbnail 1790

Let's understand the fundamental requirements of any training platform for foundation models, whether we have HyperPod or not. First of all, we should be able to run distributed training jobs on multi-GPU nodes; we have a vision to run jobs on 64+ GPUs, and to run FSDP, HSDP, and tensor parallelism.

Thumbnail 1830

Is there a way to do it? SageMaker HyperPod gives us that opportunity and ecosystem. We can bring our own code and own libraries. They also have their own library, but we are not bound to use them. We can freely switch between the implementations.

Thumbnail 1870

Terabyte scale data streaming is another key requirement. SageMaker HyperPod does not mandate us to use any particular data loader. We can bring our own data loader. We use Mosaic data streaming for our data loading strategies, so this fits very well. It's a win-win situation because we are using SageMaker HyperPod as a compute platform, but we are bringing our own data loaders and solution on top of it. Both systems coexist together.

Thumbnail 1900

The third requirement is production resilience. Our jobs typically run for two to three days. If one of the nodes crashes due to a GPU failure, it should automatically recover and the node should be replaced. Traditionally in our Zoox infrastructure, we had a SLURM-based ecosystem where we had to manually come and restart the nodes. But SageMaker HyperPod already gives us the auto-recovery capability which we want to leverage.

Thumbnail 1920

Unified observability is also critical. When multiple engineers are training many jobs, we should have EFA, GPU, and FSx metrics available. We do not want to build all that observability in-house, so we use the SageMaker HyperPod add-ons for observability to do that.

Thumbnail 1960

Checkpointing is another important aspect. We want distributed asynchronous checkpointing, and if any failure occurs, we should be able to resume from the last checkpoint automatically. We do not want any human in the loop to do that job for us. We want to automate this, so we looked at SageMaker HyperPod and they already have built the ecosystem. The HyperPod ecosystem is really geared towards large-scale training and they have thought about all these challenges and problems and have built them into the product itself, so it was a natural fit for us.
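As a sketch of what distributed asynchronous checkpointing with automatic resume can look like, here is a minimal example using PyTorch's distributed checkpointing API (available in recent PyTorch releases). The paths are hypothetical, and this is not Zoox's actual checkpointing code.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint_async(model, optimizer, step, base_dir="/fsx/checkpoints"):
    """Each rank saves only its own parameter/optimizer shards; async_save hands the
    write off to a background thread so training continues while data lands on FSx."""
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    return dcp.async_save(state, checkpoint_id=f"{base_dir}/step_{step}")  # returns a future

def resume_from_checkpoint(model, optimizer, checkpoint_dir):
    """On auto-resume, every rank loads its shards from the last complete checkpoint
    and applies them back to the live model and optimizer."""
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd}
    dcp.load(state, checkpoint_id=checkpoint_dir)
    set_state_dict(model, optimizer,
                   model_state_dict=state["model"],
                   optim_state_dict=state["optim"])
```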

Thumbnail 2010

Avinash told us about the generic architecture. This is our Zoox architecture, which is very simple and straightforward. In the center, we have multiple partitions of P5 and P6N nodes. SageMaker HyperPod runs on a managed VPC, so all these compute nodes are in the managed VPC. On the left, we have the Zoox VPC where we have Zoox-specific components like FSx and other components. On the right side, we have additional components from SageMaker apart from SageMaker HyperPod. For example, we track our experiments in Comet ML and we also have CloudWatch integration.

Thumbnail 2090

Now let's look at the user workflow and how we use this architecture. In the center, covered by the green box, is FSx, which is the heart of the system. On the bottom, we have the compute nodes, login node, and SLURM controller, which we already discussed. The beauty of this ecosystem is that FSx is a shared storage mounted everywhere: not only on the compute nodes, but on the controller and the login node as well. If I make any change to the FSx file system, every compute node picks it up.


On the right-hand side, we have a section for model checkpoints and evaluation. It is not related to SageMaker HyperPod. We do the evaluation once the checkpoints are created, and we fire off the evaluation as a separate job. This highlights how the ecosystem allows both SageMaker and non-SageMaker components to coexist in the same VPC layout and infrastructure setup. They both collaborate with each other without competing.


Thumbnail 2100

The workflow starts when the user logs in to the login node. Step two is creating a Python virtual environment. We do this to give users the flexibility to bring their own libraries. For each experiment, you can have your own virtual environment.

If you look at the folder structure, fsx/user1/venv, you can have venv1, venv2, and so on up to venvN. You decide what you want to bring for your experiment and create the virtual environment on FSx. Once it is created on FSx from the login node, and since FSx is mounted on all nodes, the compute nodes will naturally see those virtual environments as well.

Thumbnail 2140

Thumbnail 2150

Thumbnail 2160

Step three is to check out the particular branch or SHA that you want to run on SageMaker HyperPod. You can also make changes to the training code on the fly and submit the job to the SLURM controller; submitting means we use sbatch. The SLURM controller takes the job and schedules it on the compute nodes. This is where step four comes in: the distributed training job is orchestrated, and all the health agents from SageMaker kick in and monitor the job. If any node has to go down, HyperPod replaces that node, moves the job into a pending state, and kicks it off again when a new node is spun up. Everything here is orchestrated by SageMaker HyperPod.

Thumbnail 2180

Thumbnail 2190

In step five, the model outputs are written as checkpoints. In step six, after the checkpoints are written, we kick off the evaluation jobs. That is our entire workflow as it exists right now. Now let's look at the techniques we typically use to scale the model training process.

Scaling Training with HSDP and EFA: Challenges, Solutions, and Performance Gains

To scale the training process, we mostly use HSDP, hybrid sharded data parallelism. Because we have partitions of P5 and P6 instances, we can apply data-parallel replication across nodes and FSDP sharding inside the nodes. Some of the models are too large because we are working with Qwen; they come in at multiple billions of parameters, so we have to shard the matrix multiplications as well, which means applying tensor parallelism too.

Third, we apply training optimizations: we use BF16 to make computation much faster with half the memory consumption of float32, gradient checkpointing to increase the batch size, and torch.compile to compile the PyTorch graph. The scalability of HyperPod across nodes is governed by the EFA fabric.
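Here is a minimal PyTorch sketch of the HSDP wrapping just described: sharding within a node and replication across nodes via a 2D device mesh, with BF16 mixed precision. This is an illustrative setup under assumed node counts, not Zoox's training code, and it assumes the process group is set up by a launcher such as torchrun.

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_hsdp(model: torch.nn.Module, num_nodes: int, gpus_per_node: int = 8):
    """HSDP: parameters are sharded FSDP-style within each node ("shard" mesh dim)
    and replicated across nodes ("replicate" mesh dim), so heavy all-gathers stay on
    fast intra-node links and only gradient reduction crosses the EFA fabric."""
    mesh = init_device_mesh(
        "cuda",
        (num_nodes, gpus_per_node),
        mesh_dim_names=("replicate", "shard"),
    )
    return FSDP(
        model,
        device_mesh=mesh,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,   # BF16 compute: faster, half the memory of FP32
            reduce_dtype=torch.bfloat16,
        ),
        use_orig_params=True,             # keeps torch.compile compatibility
    )

# Gradient (activation) checkpointing on transformer blocks and torch.compile on the
# wrapped model are the other optimizations mentioned above, e.g.:
#   model = torch.compile(wrap_hsdp(model, num_nodes=8))
```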

Thumbnail 2270

We have been talking about training, but a short primer on the data plane is also critical. We use Mosaic Data Streaming (MDS), which has helped us a lot in achieving high throughput and efficiency, because direct streaming and local caching are native features of Mosaic Data Streaming. It is aware of the node topology and understands which shard needs to be downloaded to a particular GPU or node.

We also get deterministic sampling and resumable iteration: if a node dies or a GPU crashes and then restarts, Mosaic Data Streaming allows mid-epoch resumption. It maintains a global state of the samples it has seen, so when it resumes it skips the samples already processed and only goes through the unseen ones. MDS also supports batch prefetching. Because we are working with video data with complex preprocessing logic, we prepare the next batch while the GPU is working on the current one so the GPU is not starved.
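A minimal sketch of this setup with the open-source `streaming` (MosaicML) library is shown below; the bucket path and cache directory are placeholders, not Zoox's actual dataset layout.

```python
from streaming import StreamingDataset, StreamingDataLoader

# Hypothetical shard location; MDS shards are written offline to object storage.
dataset = StreamingDataset(
    remote="s3://example-bucket/driving-shards",  # streamed on demand per node/rank
    local="/tmp/mds-cache",                       # local cache so shards download once
    shuffle=True,                                 # deterministic, seed-driven shuffling
    batch_size=8,                                 # helps align shard assignment to ranks
)

# StreamingDataLoader exposes state_dict()/load_state_dict(), which is what makes
# mid-epoch resumption possible after a node replacement: already-seen samples are skipped.
loader = StreamingDataLoader(dataset, batch_size=8, num_workers=8, prefetch_factor=4)

state = loader.state_dict()      # persist alongside the model checkpoint
# ... after auto-resume on a fresh node:
loader.load_state_dict(state)
```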

Thumbnail 2360

Every system goes through a series of challenges when you implement it for the first time. I will only talk about two that are broadly applicable, the first one and the last one; the other three are very specific to Zoox. For the first one, we started with Docker plus Pyxis as the combination to run container images on SLURM. It was quite complicated, because you first have to build a Docker image before you can run anything on SLURM. Developers could not iterate quickly; they had to rebuild Docker images many times, which was hindering velocity.

The last one: when we started adding more P5 and P6 nodes, we did not initially do a good job of calculating the CIDRs and IP ranges.

P5 and P6 instances come with EFAs, and each EFA consumes one extra IP per network card, in addition to the IP consumed by the node itself. As we grew with more nodes and more training, we ran out of subnet IPs and nodes became unavailable, so we had to redesign our CIDRs and VPCs to fix that issue.
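As a back-of-the-envelope illustration of that sizing problem, here is a small Python sketch. The per-instance EFA interface count is an assumption (for example, 32 network cards on a p5.48xlarge); check your instance type's actual numbers.

```python
import math

def minimum_subnet_prefix(num_nodes: int, efa_cards_per_node: int, overhead_ips: int = 16) -> int:
    """Each node needs one primary IP plus one IP per EFA network card; AWS also
    reserves 5 addresses per subnet. Returns the largest /prefix that still fits."""
    ips_needed = num_nodes * (1 + efa_cards_per_node) + overhead_ips + 5
    bits = math.ceil(math.log2(ips_needed))
    return 32 - bits

# Example: 64 nodes with an assumed 32 EFA interfaces each -> 64 * 33 = 2112 IPs,
# so the training subnet needs to be at least a /20 (4096 addresses).
print(minimum_subnet_prefix(64, 32))  # 20
```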

Other challenges included enabling certificates and connectivity. Zoox's internal infrastructure is very secure but quite complicated: our SRE setup means you cannot connect to just any resource from just any node. Everything has to be properly secured, and we had to install certificates on all compute nodes. SageMaker's lifecycle scripts came to our rescue, because we can run a lifecycle script when a node is created, which lets us add our Zoox-specific logic there. This may or may not be applicable to you, but in our ecosystem we had this challenge and had to fix it.

Another aspect is that SageMaker already comes with many features, but we can also build our own. We deployed the NVIDIA DCGM exporter on each compute node because we wanted to see the GPU metrics coming out of these nodes. We built our custom exporter, created our custom events, and created our own CloudWatch metrics. Once the metrics were available, we could create CloudWatch dashboards. I'm showing GPU utilization for P6 and P5, but the point to drive home is that even though we use SageMaker HyperPod, it's still extensible. You can customize it to your own needs and do not need to use every feature.
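A minimal sketch of that exporter idea, assuming the DCGM exporter's default Prometheus endpoint and a hypothetical CloudWatch namespace, might look like this; it is not Zoox's actual exporter.

```python
import boto3
import requests

def publish_gpu_utilization(namespace: str = "Training/GPU", node: str = "node-0") -> None:
    """Scrape the local NVIDIA DCGM exporter (Prometheus text format on :9400) and
    republish average GPU utilization as a CloudWatch custom metric for dashboards."""
    text = requests.get("http://localhost:9400/metrics", timeout=5).text
    samples = [
        float(line.rsplit(" ", 1)[1])
        for line in text.splitlines()
        if line.startswith("DCGM_FI_DEV_GPU_UTIL")  # standard DCGM GPU-utilization field
    ]
    if not samples:
        return
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "Node", "Value": node}],
            "Value": sum(samples) / len(samples),
            "Unit": "Percent",
        }],
    )
```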

We are currently in a transition phase. We started with SLURM because we had experience with SLURM clusters before. We have our own high-performance compute system in SLURM, so we initially started with SLURM so that the learning curve would be low and we could bootstrap and unblock our science team faster. However, we are now moving to the EKS-orchestrated SageMaker HyperPod because of more flexibility and scalability.

Thumbnail 2480

Thumbnail 2530

One example of this transition is metrics that come from a real job we are currently testing on EKS. In the previous slide, I showed custom CloudWatch metrics, but this time we do not want to build anything in-house. We want to leverage what is provided by SageMaker HyperPod itself. This boils down to the question of whether you want to build it or reuse it. Do you want to focus on building observability by yourself, or do you want to focus on using the observability to fine-tune or make better decisions about what your batch size should be and whether you are using the EFA correctly?

For example, if you look at the EFA metrics, EFA RDMA reads and writes show whether you have EFA set up correctly. EFA is supposed to be used with RDMA; with RDMA, communication between the GPUs bypasses the CPUs and the kernel. If these metrics are high and stable, we are using EFA the right way. If they are low, there is probably room for improvement and we need to fix it.

Similarly, EFA received bytes show whether the GPUs are receiving data at a steady, high level. If yes, everything is working as desired; if there is a sudden drop, something is wrong in our setup. GPU memory used is straightforward: am I consuming the full memory or not? If not, I need to increase my batch size.

Another important metric is tensor core utilization. We use the BF16 data format, and when the nodes come with tensor cores, BF16 computation runs on them: speed roughly doubles and memory use is halved. Seeing high tensor core utilization gives us confidence that BF16 is really being used.

If the graphs are low, there is something wrong we need to fix. The fundamental point to drive here is that we would not be building all these metrics by ourselves. It's already part of SageMaker, so we should focus on the business problem at hand that we want to train large-scale models efficiently, and SageMaker HyperPod helps to do that.

Thumbnail 2730

This slide is pretty much self-explanatory, comparing what we had before with what we have now. In our existing setup, we did not have a good experience with EFA and RDMA. Installing EFA yourself is a four-to-five-step process: you have to install the EFA installer, Libfabric, the AWS OFI NCCL plugin, and the CUDA drivers, and the versions need to match. There are many combinations where you can go wrong.

Instead of focusing on doing that ourselves, with all the places we could go wrong, we just reuse what SageMaker provides; it comes with everything already working. We focus on using the technology to improve our models and train more efficiently. That's why our GPU utilization bumped up to 95 percent after moving to SageMaker HyperPod. Because we are using EFA and RDMA, we get almost linear scalability in the multi-node setup. We used to train models of around 400 million parameters; now we regularly train 7-billion-parameter models, and we are moving towards 32-billion-parameter models as well.

Recovery time is also critical. As I mentioned, we want to reduce the human in the loop. If an error occurs overnight and the person running the job only sees it the next morning, we have lost 6 to 7 hours of P5 or P6 time. These machines are costly, and we cannot afford that. SageMaker HyperPod is geared towards detecting these failures quickly, resuming jobs automatically as a product feature, and letting us integrate these alerts with our observability.

Thumbnail 2860

Thumbnail 2880

Thumbnail 2890

Lessons Learned and the Road Ahead: Transitioning to EKS and Multi-Region Infrastructure

There are many lessons learned, but I will highlight four for simplicity. Use streaming data loaders and shared data everywhere, and do not download everything to your disk; do streaming processing. Leverage HSDP as far as you can before moving to tensor parallelism; try HSDP with gradient accumulation, torch.compile, and other techniques first. And run on EFA-enabled instances.

Thumbnail 2920

Currently, we do not have multi-AZ or multi-region infrastructure; we only run in us-east-2. We have to graduate from there and run across regions. It is hard to get nodes in one particular region, but if you enable multi-region, it is easier to get compute capacity. The fourth point is that if you cannot see what you are doing and have no clue how things are performing under the hood, you simply cannot fix it. You need strong visibility and observability; you should be metrics-driven.

Thumbnail 2940

Thumbnail 2960

This is a quote from our director. Fundamentally, SageMaker HyperPod has unblocked large-scale training at Zoox, and we are using it to the full extent. We have great visibility into the utilization and performance. We are not complete in the journey yet. We are somewhere in the middle. We started eight or nine months back with a SLURM-based ecosystem. Now we are transitioning into an EKS-based ecosystem.

The reason is that Kubernetes is popular, flexible, and scalable, and there are lots of open-source frameworks built on top of it that we can use along with SageMaker HyperPod. We also want to transition to the SageMaker HyperPod training operator, which is customized for HyperPod; we currently use the Kubeflow PyTorch job operator on EKS. The HyperPod training operator also gives us regular-expression-based log scanning.

For example, if your job is not progressing fast enough or if the validation loss is not decreasing, you can have expressions that detect job hangs or training stalls. We also want to implement task governance. Currently we work on a first-come, first-served basis and keep a manual roster of who is using which machine. We want to move away from that: teams should have compute quotas, and they should be able to lend and borrow between teams.

For example, suppose Team A has so many jobs running that it has consumed all of its resources, and I need to run one more job. Should I just sit idle? No, I can ask Team B to lend me some resources for the time being so that I can run my job. Task governance in SageMaker HyperPod gives you that feature, and we want to leverage it. The fourth point is that we want to continue leveraging FSDP and HSDP and do more experiments. Those are the technical requirements.

What are the non-technical requirements on the road ahead? Can I ask for any resource? Can I ask for the entire cluster, or all of the nodes in it? Obviously not. So we need to implement guardrails and resource quotas. We have namespaces, so every user or namespace should have some restrictions so that we do not end up with runaway clusters where everyone asks for everything and effectively DDoSes the cluster. We also want to expand to multi-region, which I mentioned is on our radar but not yet implemented.

Zoox has its own log monitoring system. SageMaker HyperPod does not mandate or have an opinion on how you should manage your logs. Logs are created from SLURM jobs or EKS jobs, and it is the user's responsibility to have a system on top of it. We already have an established robust system for monitoring logs. We will bring that system into SageMaker HyperPod. Similarly, I showed you some metrics on Grafana. Zoox has its own observability stack, its own Grafana, its own Prometheus. We will integrate that.

As you can see, we are leveraging HyperPod for compute, orchestration, recovery, and resilience. For the things that already exist at Zoox, we are simply integrating and plugging them in, so both ecosystems coexist and play well together. All of this discussion from my two co-presenters and me ultimately culminates in this: we want to build the future of transportation. We want to build a robotaxi that is safe, reliable, and scalable. To build such a system, we need better models to drive the robotaxi, and to build such models, we need better infrastructure. That is why we have a partnership with AWS on SageMaker HyperPod.

If you want to learn more about Zoox, please visit zoox.com. We also have the One Amazon lane at Caesars Forum. You have been hearing about a lot of innovations in other areas as well, and you can try them out hands-on at the One Amazon lane. We also have a Zoox booth there, where you can check out the robotaxi, get into it, take pictures, and just come talk to us. We would love to explain more.

Thank you for coming to the talk. The three of us are presenting it, but behind us there are many more people who have supported this infrastructure and are working in the background. Thank you all.


; This article is entirely auto-generated using Amazon Bedrock.
