Kazuya

AWS re:Invent 2025 - Zoox: Building Machine Learning Infrastructure for Autonomous Vehicles (AMZ304)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Zoox: Building Machine Learning Infrastructure for Autonomous Vehicles (AMZ304)

In this video, Zoox's Jim Robinson-Bohnslav, Avinash Kolluri, and Anindya discuss how Zoox leverages AWS SageMaker HyperPod to train foundation models for autonomous robotaxis. They explain Zoox's multimodal language action model that processes camera, LIDAR, and radar data to handle edge cases like jaywalkers and unusual objects. The presentation covers their training pipeline from supervised fine-tuning with tens of thousands of driving hours to reinforcement learning using GRPO and DAPO. Key infrastructure challenges include managing petabytes of data, implementing HSDP and tensor parallelism across 64+ GPUs, and achieving 95% GPU utilization. SageMaker HyperPod's auto-resume capability, EFA-enabled networking at 3200 Gbps, and integrated observability through CloudWatch and Grafana proved essential. They're transitioning from SLURM to EKS orchestration for greater flexibility and implementing multi-region training with P5 and P6 instances.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to Zoox: Building Purpose-Built Robotaxis for Safe, Clean Transportation

Alright, thanks everyone for coming. Today we're going to be talking about how Zoox uses machine learning infrastructure built on AWS to train our foundation models. So my name is Jim Robinson-Bohnslav. I'm the tech lead for foundation model training here at Zoox, and my co-presenters Anindya and Avinash will introduce themselves when they come up.

Thumbnail 30

All right, so here's the overall agenda. First, I'm going to give a short introduction to Zoox. Hopefully you all know already, and then I will give an overview of the use cases for foundation models here at Zoox. My co-presenters will go over AWS SageMaker HyperPod and how we're integrating that into our training and evaluation stack.

Thumbnail 60

All right, so what is Zoox? Why did we build this company? Zoox aims to make personal transportation safe, clean, and enjoyable for everyone. To do this, we built a purpose-built robotaxi, which hopefully you have already seen driving around the strip, that is fully electric, fully autonomous, and designed for riders.

Thumbnail 80

Why did we decide to do this? We believe the model of individually owned, human-driven vehicles is broken. You can see here human drivers are actually quite dangerous, and there are many fatalities every year. Human-driven vehicles are idle the vast majority of the time. They're not actually taking you from place to place, and of course they also put out a lot of pollution.

Thumbnail 120

Thumbnail 130

So I hope you all know this already, but Zoox is live in Las Vegas, so I'll let you see this little demo. So you can see our robotaxi going on the strip. It's very beautiful. I hope you agree. Yeah, so I encourage you to take out your phones, get the QR code, get the app. Riding is free in Vegas right now, so that's very fun. I strongly encourage you to take a ride and let us know what you think.

Thumbnail 150

Thumbnail 160

Thumbnail 170

Zoox's Autonomy Stack and the Challenge of Handling Long Tail Scenarios

All right, so here's a high level overview of Zoox's autonomy. So it drives itself, hopefully that is not a surprise to any of you. To do this, we have sensor pods all around the vehicle. You can see that here in this actual image. These are the pods on the top corners that have LiDARs, radars, cameras, thermal cameras, microphones, tons of stuff on there. And we have of course 360 degree field of view so we can see all around the robotaxi, and they are redundant in case of hardware failure. We can see up to and over 150 meters away from the vehicle.

Thumbnail 200

Thumbnail 220

All right, so the three pillars of autonomy are perception, prediction, and planning and controls. Perception's job is to take raw data streaming in from all of our sensors and turn that into a structured understanding of the world. The core of that is of course the 3D bounding box, where we are going to detect and track every agent in three dimensions around the vehicle so we can tell if it's a car, a pedestrian, a truck. We detect traffic lights in addition to using our HD maps.

Thumbnail 230

Thumbnail 240

Thumbnail 250

Thumbnail 260

All right, so prediction's job then is to take all these agents that are moving about and predict where they might be in the future. These predictions are multimodal, which in this case means we predict multiple possible futures for every agent, so that the planning stack and controls can turn them into the actual volts that end up controlling the vehicle. Planning integrates our perception stack, the outputs of prediction, the map that we've prebuilt, and of course the mission: you have to know where the rider wants to go in order to make the high- and low-level plans and controls.

Thumbnail 270

Thumbnail 280

All right, so most driving is easy, but some driving is very, very hard. Here we have a log where we're going at a pretty good rate, I think somewhere between 25 and 45 miles an hour, and there are two fire trucks in the right lane. Note that they don't have their sirens or flashing lights on, so it doesn't look like they're actively responding to an incident. We obviously take active fire engines very seriously, but even in this case they had put a fire hose into the drivable lane, and our autonomy stack is capable of detecting this and making a plan to safely move around it.

I've been driving for 18 years and I've never seen anything like this on the road, but our robotaxis see things like this all the time.

Thumbnail 330

Thumbnail 340

Thumbnail 350

Here's another example right from here in Las Vegas. We're going down the strip at 30 miles an hour and somebody decides to walk across all six, maybe more, lanes of traffic. Again, not something I've ever seen and I don't think that was a great decision, but our autonomy stack has to be capable of handling people making poor decisions basically.

Thumbnail 360

Jaywalkers are important to deal with, but they're not particularly unusual, and we do absolutely see very unusual things in our robotaxis. Going across from top left to top right, we have a child dressed up as a construction zone flagger, and spoiler alert, we do listen to construction zone flaggers, but in this case it's a little unclear if we should or not. One time we saw a convoy of tanks going down the strip, which you might imagine for our machine learning based perception stack is a little unusual to deal with. We don't have a specific tank class in our ontology. The next image is, you know, we obviously pay a lot of attention to traffic cones. They tell us where construction zones are, incident zones, and other things, and somebody perhaps on a fun Vegas night out decided to put one of those on top of a sign. Again, not clear that we should listen to that.

We see cars on fire, which is maybe not that unusual. We see handwritten construction zone signs saying keep right that we might actually have to obey. Here in Vegas there's a half tour bus, half shopping cart called the Cartzilla that we see often, and our autonomy stack has to be able to handle all these long tail cases. We even have to be able to handle when people are behaving badly. This particular image on the right side, it's a little dim, it was at night, is somebody's foot as they climb on top of the robotaxi, and in the left panel of that is their friend videoing them. And of course, we have somebody riding around with a dog in the backpack. It's unclear if we should maybe call that agent a dog or a pedestrian or bicyclist or what's going on there.

Thumbnail 510

Foundation Models as the Solution: Building a Multimodal Language Action Model

This is the long tail of autonomy and our system has to be able to deal with all of these, but you might imagine the edge cases proliferate and it becomes increasingly hard to deal with as we scale. One way to handle this, and the way that we on our team have decided to handle this, is with foundation models. When you have a really difficult problem, you go back to theory and principle, and in this case we turn to the bitter lesson which should be familiar to all of you. This is the famous blog post from Richard Sutton, and the key quote is the biggest lesson from 70 years of AI research is that general methods that leverage computation are ultimately the most effective.

Thumbnail 540

I adapted a graph I had seen on Twitter showing three possible ways a model might improve its performance over time. The first way, in light blue, is where you add a lot of human engineering and expertise. You might have a small model designed to do one specific thing, like an "is this construction zone flagger a kid in a costume or not" model, and that's going to be great in the short term; it's always the best in the short term. In the middle, maybe you have a more general model. But we're really betting on the long-term approach, which is that we're going to use a lot of compute, which my co-presenters will tell you about, and we're going to keep the approach very, very simple. We're not going to put in a ton of human knowledge or biases; we're just going to train big models on big data.

Thumbnail 600

Specifically we are going to combine many, many different tasks and datasets and approaches into one multimodal language action model.

Thumbnail 610

So our goal is to train a model to zero shot these long tail scenarios. Zero shot means that we don't have to see one or two or five tanks before we can accurately deal with it, or we don't have to see ten kids in flagger costumes. Our stack will accurately handle this the very first time we ever see an event. That's what it means to zero shot, and that's our goal.

Thumbnail 640

Thumbnail 690

So our approach that we're taking is to make a multimodal language action model. The core of it is an LLM, and what we're going to output is robotic controls, so controls for the robotaxi like acceleration, braking, steering, and so on, as well as 3D detections like our perception stack does currently. And also other more generative techniques like answers to questions. Am I supposed to listen to this flagger? Is the hotel employee gesturing at me or at the guy next to me? So answers to questions like these, and of course descriptions of scenes. Captions are very useful for many, many applications.

Thumbnail 720

So the input to this, like any LLM, is going to be a text prompt as well as a pre-trained embedding layer in this case, and those are going to go as embeddings into our LLM. In this case, the prompt is just a dummy prompt: you are the driver of a Zoox robotaxi, what should you do in this scenario, think step by step. As well as the text prompt, we have images or videos coming off of our sensors, and those go through pre-trained encoders that we then pass through a projection layer. This is a standard multimodal LLM approach into embeddings that the LLM can process.

Thumbnail 740

Thumbnail 750

Thumbnail 780

As well, and a little uniquely, you might imagine at Zoox we've been driving around for a long time. We have nearly infinite LIDAR and radar data, so we encode that and project it into the LLM as well. Lastly, one of the main benefits of using an LLM as the core is its flexibility. So you might imagine we can encode and pass many, many, many things into our model, including the outputs of our perception stack right now if we wanted to. So something like the 3D boxes that we generate, we can encode those and pass those into the model.
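To make the projection idea concrete, here is a schematic sketch of this standard multimodal pattern: frozen, pre-trained encoders per modality, a learned projection into the LLM's embedding space, and concatenation with the text prompt embeddings. This is an illustration only, not Zoox's actual architecture; all module names and dimensions are placeholders.

```python
# Schematic sketch of a multimodal prefix builder (illustrative, not Zoox's code).
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Projects pre-trained encoder features into LLM token embeddings."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (batch, num_tokens, llm_dim)

class MultimodalPrefixBuilder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # One projector per sensor stream; the encoders themselves are assumed frozen.
        self.camera_proj = ModalityProjector(1024, llm_dim)
        self.lidar_proj = ModalityProjector(512, llm_dim)
        self.radar_proj = ModalityProjector(256, llm_dim)
        self.boxes_proj = ModalityProjector(128, llm_dim)  # e.g. encoded 3D boxes

    def forward(self, text_emb, cam_feats, lidar_feats, radar_feats, box_feats):
        # Concatenate all modality tokens in front of the text prompt embeddings;
        # the LLM then attends over the full sequence.
        return torch.cat([
            self.camera_proj(cam_feats),
            self.lidar_proj(lidar_feats),
            self.radar_proj(radar_feats),
            self.boxes_proj(box_feats),
            text_emb,
        ], dim=1)
```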

Thumbnail 790

Training Pipeline and Infrastructure Challenges for Foundation Models at Zoox

So how are we actually training this? This is a very high-level slide, but I'll walk you through it. We start with pre-trained models; we didn't want to get into the pre-training game, for reasons you might imagine. So we take state-of-the-art vision language models like Qwen 2 VL, which had come out a few weeks earlier, and base our model on that.

Thumbnail 810

Our first stage is really large scale supervised fine tuning. So this includes behavior cloning from tens of thousands of hours of human driving, where we train the model to do what the human did. We have millions of 3D detection labels, so we can train the model to understand the world in three dimensions. And then of course we have more standard LLM techniques like visual question answering, spatial understanding, and more.
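As a rough illustration of what this supervised fine-tuning objective looks like, here is a minimal sketch of the standard next-token cross-entropy loss with prompt positions masked out, so the loss is computed only on the target tokens (for example, the logged human driving actions or the answer text). This is a generic sketch, not Zoox's training code.

```python
# Minimal SFT / behavior-cloning loss sketch: next-token prediction with the
# prompt and sensor positions masked out of the loss.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab); labels: (batch, seq) with IGNORE_INDEX on
    prompt positions and real token ids on target positions."""
    # Shift so that position t predicts token t + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```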

Thumbnail 840

Thumbnail 870

So the next stage in our pipeline is a smaller but higher quality SFT stage. So this includes rare object detection, driving particularly in difficult scenarios, and synthetic chain of thought reasoning to help the model learn to think and reason its way to an answer. Which leads to the last stage in the training pipeline, which is reinforcement learning. So we are fine tuning for robotic controls or driving the robotaxi on our most difficult scenarios, and we use techniques like GRPO and DAPO for this.
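For readers unfamiliar with GRPO, here is a simplified sketch of its core idea: sample a group of responses per prompt, score them with a reward function, and use the group-normalized reward as the advantage in a PPO-style clipped objective (DAPO modifies the clipping and sampling details). This is illustrative only, not Zoox's RL implementation.

```python
# Simplified GRPO-style sketch: group-relative advantages plus a clipped
# policy-gradient surrogate (KL regularization omitted for brevity).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards per sampled response."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate applied with the group-relative advantages."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```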

Thumbnail 890

Thumbnail 900

And the last stage is deployment: for offline purposes we use vLLM, and for potential online integrations we're working towards something like TensorRT-LLM. Okay, so what are the challenges in training a model like this? There are many, as you might imagine. First is that we have petabytes of data.

We have multimodal sensor data from cameras, lidars, radars, and all of the structured data that our autonomy stack already generates, which numbers in the petabytes. Of course, accuracy and performance is a challenge. We want it to be as good as possible and as low latency as possible, which is a fundamental trade-off.

For our researchers and engineers working on our team, we want rapid and scalable iteration, so we need to be able to launch an experiment, track it, get metrics, and run our evaluations as quickly and smoothly as possible so that they can run their next follow-up experiment quickly. There's a whole host of difficult infrastructure that goes into this, including managing datasets, allocating compute, debugging the model training pipeline, and so on. For this section, I will now turn it over to my co-presenter Avinash from AWS to tell you about AWS SageMaker HyperPod.

Thumbnail 1010

AWS SageMaker HyperPod: A Purpose-Built Platform for Distributed Training

All right. Thank you, Jim, for covering extensive details about Zoox and the large language model challenges; that's exciting to hear. I'm Avinash Kolluri. I'm a Senior Solutions Architect with AWS supporting Amazon and its subsidiaries on their cloud journey. How many of you here are building foundation models or doing distributed training, as well as fine-tuning large language models? I see a few hands raised. Thanks for that. I'm pretty sure you might have run into certain issues around the hardware, and you end up spending more time debugging and fixing the node failures. And guess what? When you do that, your capacity, the GPUs, sits idle, which ends up costing you a huge amount.

Thumbnail 1040

For these reasons, we have SageMaker HyperPod. It's a resilient cluster purpose-built for distributed training and for running your machine learning workloads, so that it takes care of the heavy lifting, maximizing efficiency and reducing cost. Let's look at HyperPod's versatile environment. I'll try to unfold it and make it simple so that you understand what SageMaker HyperPod is about.

Thumbnail 1070

At the very beginning, on the foundation layer, we have storage and compute. As Jim mentioned, large language models need extensive resources and resource optimization. SageMaker HyperPod supports Trainium, AWS's own ML accelerator, as well as NVIDIA-based GPU instances such as P4s, P5s, and P6s. In fact, we have seen clusters of more than 500 nodes running on SageMaker HyperPod for distributed training as well as ML workloads.

And then comes the storage. Just as an example, today with Zoox there is a lot of data being collected across the lidars, cameras, and sensors; in fact, it is at petabyte scale. So you need persistent, efficient storage mechanisms like Amazon S3, FSx for Lustre, and EFS so that you can load this data directly onto the GPUs for any sort of distributed training. These nodes do not come one or two at a time: when you load this amount of data, you need multiple nodes, tens and hundreds of them. And when you do that, you need high-performance computing and networking to keep data transfer speeds high, because you have to load the data and train on it.

Thumbnail 1180

So for that, we have Elastic Fabric Adapter, a network device attached to these GPU instances that gives you 3200 gigabits per second of bandwidth per node, so that you're running in a high-performance compute mode. And then comes the runtime environment. SageMaker HyperPod now supports two types of runtime environments: one is NVIDIA-based with the CUDA libraries, and the other is the AWS Neuron runtime.

Thumbnail 1200

Just right above the runtime, we have various sets of ML frameworks and tools. I know this is something that might be interesting to a lot of ML engineers because at any given day, many ML engineers deal with many sets of tools and frameworks.

For example, they use PyTorch, NeMo, Megatron, as well as kubectl or Ray for job submissions. One of my personal favorites in this entire section is observability. I strongly believe that you need to measure something to improve it, and having a single unified experience always helps in terms of achieving the right operational excellence. SageMaker HyperPod is easily integrated with Amazon CloudWatch, as well as Amazon Managed Grafana and Prometheus, so that you can have one-stop observability or operational excellence for your workloads to make sure that you are monitoring and building alerts on top of it.

Thumbnail 1260

Thumbnail 1280

At the surface layer, we have the orchestrators. We have seen many customers using SLURM as well as Amazon EKS. All right, let's look at a sample HyperPod architecture. Whether you take the EKS-based or the SLURM-based flavor, this is a generic architecture that helps you understand how SageMaker HyperPod really behaves. On the left side of the screen, you see the compute nodes. Typically all of these compute nodes are in the Amazon SageMaker service account and are mapped to the customer account. The reason for that is the health checks running on the compute nodes, so that SageMaker HyperPod can take control of them, govern and guard those nodes, and automatically replace them whenever there is a failure. That is one less job to worry about for a lot of ML engineers.

You can also see there is a control node and a login node. The control node is basically the head node, which is used for orchestration and for handling the jobs that your ML engineers submit. The login nodes are there to protect the control node and to carry out any admin activities. SageMaker HyperPod also comes with a lot of built-in features. For example, it uses lifecycle scripts, which run when an instance is booting up and make sure the cluster configuration and setup are done correctly. It also provides capabilities for the checkpoint mechanism. And as we just discussed, Amazon CloudWatch, the observability platform, is directly integrated with it.

Thumbnail 1380

Thumbnail 1420

Resiliency and Auto-Resume: How SageMaker HyperPod Handles Hardware Failures

All right, one of the primary or the most important pillars or aspects of SageMaker HyperPod is resiliency. So let's take a look at why resilience is important, and I'll try to break it down in such a way that it is easily understood. On the screen you're looking at a 16-node cluster where you have three jobs: Job A, Job B, and Job C. Let's assume that Job A has taken eight nodes and Job B and C have taken four nodes each. When an ML experiment or when ML workloads are running, and you happen to find a failure of instances, it could be either because of software or hardware. The usual software failures are something like hyperparameter misconfigurations or even missing a semicolon, and that's pretty usual in the case of an ML engineer's lifecycle. They go, debug, fix it, resume the job, and pick it up, right?

Thumbnail 1450

But when it comes to hardware failures, this is physics; it can easily happen. It could be because of overheating or other reasons. When that happens, ML engineers end up spending a lot of time debugging it, but ultimately all that happens is replacing the node. As the cluster doubles from 8 to 16 to 32 nodes and so on, you end up spending more and more time debugging and fixing these nodes one after the other. That is expensive, because your GPUs sit idle in the meantime.

Thumbnail 1480

Thumbnail 1490

So here is a typical scenario. If the failure is a model issue, ML engineers usually go and debug it, as we just discussed. But if it is with the instances, that is when you have to investigate, replace the node, and if it occurs again, repeat the same process over and over. This is where SageMaker HyperPod comes into the picture.

Thumbnail 1510

Its biggest value proposition is the auto-resume feature based on health checks. How this works is that SageMaker HyperPod keeps a set of reserved spare nodes in the SageMaker service account, and it automatically checks and replaces failed instances so that you don't have to do that heavy lifting. You can just focus on building the models.

Thumbnail 1540

We discussed some of the capabilities, and here's a quick recap of a few more capabilities of SageMaker HyperPod. Resilience is something we have just touched on. The next one is scalability. In fact, when you deal with hundreds and hundreds of nodes, EFA really stands out in giving you the bandwidth for all these nodes to intercommunicate and exchange information. At the same time, there are SageMaker Flexible Training Plans, which Anindya will discuss next; this is another capability that lets you reserve instances and keep operations running.

SageMaker HyperPod also supports a versatile environment. In fact, within the stack we have just seen, a lot of frameworks and tools are supported. This frees ML engineers to focus on writing code and building models and to worry less about the frameworks and tools that are integrated. Another one is efficiency. For example, with the integration of deep learning AMIs and the built-in scripts, a lot of frameworks and tools are available out of the box, so you can immediately get your cluster spun up and start building models. At the same time, the integration with CloudWatch gives you the capability to build an observability platform on top of it, so that you can govern your resources and make sure you're taking the right decisions at the right time.

What's a typical lifecycle within SageMaker HyperPod when you build large language models? Jim just mentioned that they collect a lot of data. Anyone building foundation models needs to collect a lot of data. Once they collect the data, they need a mechanism to store it, and this is where we have discussed S3 and storing the data. Now you have the data with you in S3, so what are you going to do next? You need to load this data and bring it to the GPUs so that you can continue doing the model training.

Thumbnail 1670

Once you have this data, you can load it and then do the training. For that, you need efficient mechanisms and distributed training mechanisms for your GPUs to continue on the training process. Towards the end, once the training is done, you deploy these models. With that, I'll hand it over to Anindya, and he would be discussing the implementation side on SageMaker HyperPod and also the challenges.

Thumbnail 1720

Zoox's Requirements for Foundation Model Training and Why HyperPod Was a Natural Fit

Thank you, Jim and Avinash, for setting the context on what we are trying to do and why we need SageMaker HyperPod. I lead the distributed training efforts in the ML platform at Zoox. There are many features in SageMaker HyperPod, but I will just highlight a couple that really caught our attention. The first one was we wanted to do large-scale training, which means we want to do FSDP, HSDP, and DDP. We want a framework or an ecosystem where we can do it freely with our own tools. We also want EFA to be enabled because we want to train large-scale models, and it should be across nodes and GPUs.

There are other features like checkpointing, which I'm going to talk about in a bit. The most important was training plans, because since we are working with models like Qwen 3 VL or similar models, we'll be using instances such as P5 and P6. They are costly instances, and we cannot get them on demand. So how can we guarantee upfront that we will surely get these nodes when we need them? HyperPod Training Plans offer that flexibility. Once we decide what we need, it is guaranteed that it is available.

Thumbnail 1790

Whether we have HyperPod or not, let's understand the fundamental requirements of any training platform for foundation models. First of all, distributed training jobs: I should be able to run jobs on multi-GPU nodes.

Sometimes we have a vision to run jobs on 64+ GPUs and also run FSDP, HSDP, and tensor parallelism. Is there a way to do it? SageMaker HyperPod gives us that opportunity and ecosystem. We can bring our own code and our own libraries. They also have their own library, but we are not bound to use them. We can freely switch between the implementations.

Thumbnail 1830

For terabyte scale data streaming, SageMaker HyperPod does not mandate us to use any particular data loader. We can bring our own data loader, and we'll talk about that. We use Mosaic data streaming for our data loading strategies, so this fits very well. It's a win-win situation because we are using SageMaker HyperPod as a compute platform, but we are bringing our own data loaders and solution on top of it. So both systems coexist together.

Thumbnail 1870

The third one is production resilience. Our jobs typically run for two days or three days. If one of the nodes crashes due to a GPU failure, it should automatically recover. The node should be replaced. Traditionally in Zoox, we had a SLURM-based ecosystem. In that case, we had to manually come and restart the nodes. But SageMaker HyperPod already gives the auto-resume capability, which we want to leverage.

Thumbnail 1900

Unified observability is important when multiple engineers are training many jobs. We should have EFA, GPU, and FSx metrics available, and we do not want to build all of that observability in-house, so we use the SageMaker HyperPod observability add-ons to do that.

Thumbnail 1920

For checkpointing, we want distributed asynchronous checkpointing, and once any failure occurs, we should be able to resume from the last checkpoint automatically. We do not want any human in the loop to come and do that job for us, so we want to automate it. We looked into SageMaker HyperPod, and they have already built this ecosystem. The HyperPod ecosystem is really geared towards large-scale training; they have thought about all these challenges and problems and embedded solutions into the product itself, so it was a natural fit for us.
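As a rough sketch of what distributed asynchronous checkpointing can look like, here is a minimal example using PyTorch's torch.distributed.checkpoint, assuming a recent PyTorch version that provides async_save. Paths and step numbers are placeholders, not Zoox's setup.

```python
# Minimal async distributed checkpointing sketch (assumes a recent PyTorch).
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

def async_checkpoint(model, optimizer, step: int, root: str = "/fsx/checkpoints"):
    # Gather a sharded state dict that each rank can write in parallel.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state = {"model": model_sd, "optim": optim_sd, "step": step}

    # async_save returns a future; training continues while shards are written.
    future = dcp.async_save(state, checkpoint_id=f"{root}/step_{step}")
    return future  # call future.result() before starting the next save
```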

Thumbnail 1960

Implementation Architecture and User Workflow: From Login to Model Checkpoints

Avinash talked about the generic architecture. This is the Zoox architecture, which is very simple and straightforward and derived from it. In the center, we have multiple partitions of P5 and P6N nodes. Again, SageMaker HyperPod runs on a managed VPC, so all these compute nodes are in the managed VPC. On the left, we have the Zoox VPC with Zoox-specific components; for example, we have FSx and other components in our VPC. On the right side are additional components beyond SageMaker HyperPod; for example, we track our experiments in Comet ML, and we also have CloudWatch integration. I will talk about it in a moment.

Thumbnail 2010

So we talked about the architecture and the placement of the components. Let's look at the user workflow. When we have this, how do we use it? It's the user journey. In the center, which covers the green box, is the FSx, which is like the heart of the system. On the bottom, we have the compute nodes, login, and SLURM controller, which we already discussed. But the beauty of this ecosystem is the FSx is a shared storage. It's mounted everywhere, not only on the compute nodes but on the controller as well as the login node. So if I make any change onto the FSx file system, every compute node should be able to pick it up.

On the right-hand side, we have a section for model checkpoints and evaluation. It is not related to SageMaker HyperPod; we do the evaluation once the checkpoints are created and fire it off as a separate job. But this highlights that both SageMaker and non-SageMaker components can coexist in the same VPC layout and the same infrastructure setup. The two collaborate without competing with each other.

Thumbnail 2090

Thumbnail 2100

So how does the workflow look? First of all, the user logs in to the login node. Then in step two, the user creates a Python virtual environment. Why do we do that? We want to give users the flexibility to bring their own libraries. For each experiment, you can have your own virtual environment.

If you look at the folder structure, for example fsx/user1/venv, you can have venv1, venv2, up to venv N. So you decide what you want to bring for your experiment and create the virtual environment on FSx. Once it is created on FSx from the login node, since FSx is mounted on all nodes, the compute nodes will naturally see those virtual environments as well.

Thumbnail 2140

Thumbnail 2150

Thumbnail 2160

Then you check out the particular branch or SHA that you want to run on SageMaker HyperPod. You can also make changes to the training code on the fly and submit the job to the SLURM controller. Submit means we use sbatch, SLURM's batch submission. The SLURM controller takes the job and schedules it on the compute nodes. This is where step 4 comes in: the distributed training job is orchestrated. All the health agents from SageMaker kick in and monitor the job. If any node has to go down, HyperPod will replace that node, move the job into a pending state, and re-kick off the job when a new node is spun up. So everything is orchestrated by SageMaker HyperPod here.
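To make the submission step concrete, here is a hypothetical sketch of what such a flow could look like: generate an sbatch script that activates a virtual environment from FSx and launches torchrun on every allocated node, then hand it to the SLURM controller. Paths, job settings, and GPU counts are placeholders, not Zoox's real configuration.

```python
# Hypothetical SLURM submission sketch: write an sbatch script and submit it.
import subprocess
from pathlib import Path

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=fm-train
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --output=/fsx/logs/%x_%j.out

source /fsx/user1/venvs/venv1/bin/activate   # virtual env shared via FSx

head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \\
  --nnodes={nodes} --nproc_per_node=8 \\
  --rdzv_backend=c10d --rdzv_endpoint="$head_node":29500 \\
  /fsx/user1/code/train.py --config {config}
"""

def submit(nodes: int, config: str) -> None:
    script = Path("/tmp/train.sbatch")
    script.write_text(SBATCH_TEMPLATE.format(nodes=nodes, config=config))
    subprocess.run(["sbatch", str(script)], check=True)

if __name__ == "__main__":
    submit(nodes=8, config="/fsx/user1/configs/sft.yaml")
```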

Thumbnail 2180

Thumbnail 2190

In step 5, the outputs of the model are written as checkpoints, and in step 6, after the checkpoints are written, we kick off the evaluation jobs. So this is our entire workflow as it exists right now. Now let's look at the techniques we typically use to scale the model training process.

To scale the training process, we mostly use HSDP, Hybrid Sharded Data Parallel. Because we have partitions of P5s and P6s, we can apply DDP across nodes but FSDP inside each node. Some of the models we work with at Zoox are very large, multi-billion parameter models, so we have to shard the matrix multiplications as well, which means applying tensor parallelism too. We also apply other training optimizations, such as BF16 to make computation much faster with half the memory consumption of float32, gradient checkpointing to increase the batch size, and torch.compile to compile the graph. Scalability across nodes on HyperPod is governed by the EFA fabric.
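Here is a minimal sketch of how HSDP plus BF16 mixed precision and torch.compile can be wired up in PyTorch, assuming PyTorch 2.2 or later launched with torchrun across several multi-GPU nodes. This is an illustration of the technique, not Zoox's training code; tensor parallelism and gradient checkpointing would be layered on top of this.

```python
# Minimal HSDP sketch: replicate across nodes, shard within each node.
import os
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_hsdp(model: torch.nn.Module) -> torch.nn.Module:
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2D mesh: replicate across nodes (DDP-style), shard within each node (FSDP-style).
    gpus_per_node = torch.cuda.device_count()
    nodes = dist.get_world_size() // gpus_per_node
    mesh = init_device_mesh("cuda", (nodes, gpus_per_node),
                            mesh_dim_names=("replicate", "shard"))

    # BF16 halves memory relative to float32 and runs on tensor cores.
    bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                          reduce_dtype=torch.bfloat16,
                          buffer_dtype=torch.bfloat16)

    model = FSDP(model.cuda(),
                 device_mesh=mesh,
                 sharding_strategy=ShardingStrategy.HYBRID_SHARD,
                 mixed_precision=bf16,
                 use_orig_params=True)  # recommended when combining with torch.compile
    return torch.compile(model)
```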

Thumbnail 2270

We have been talking about training, but a short primer on the data plane is also critical. We use MDS, Mosaic Data Streaming, which has helped us a lot in achieving high throughput and efficiency because direct streaming and local caching are native features of Mosaic Data Streaming. It is aware of the node topology: it understands which shard needs to be downloaded on a particular GPU or node. We also get deterministic sampling and resumable iteration, which means that if a node dies or a GPU crashes and then restarts, Mosaic Data Streaming allows mid-epoch resumption. It maintains a global state of the samples it has seen, so that when it resumes, it knows to skip the samples it has already processed and only go through the samples that are still unseen.
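For illustration, here is a minimal sketch of streaming data with MosaicML Streaming (the mosaicml-streaming package is assumed), including the loader state that enables mid-epoch resumption. The bucket path, cache directory, and batch size are placeholders.

```python
# Minimal MosaicML Streaming sketch with resumable iteration.
from streaming import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(
    remote="s3://my-bucket/mds/train",   # hypothetical MDS shard location
    local="/tmp/mds-cache",              # local cache on each node
    shuffle=True,
    batch_size=8,                        # per-rank batch size, used for shard assignment
)

loader = StreamingDataLoader(dataset, batch_size=8, num_workers=8)

# Mid-epoch resumption: persist the loader state alongside the model checkpoint.
state = loader.state_dict()
# ... save `state` with the checkpoint ...
# loader.load_state_dict(state)  # on restart: skips samples already seen this epoch
```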

Thumbnail 2360

Overcoming Challenges and Achieving Results: From 400M to 32B Parameter Models

MDS also supports batch prefetching, which matters because we are working with video data. As the GPU is working on the current batch, we need to prepare the next batch, and we have complex preprocessing logic there, so we prefetch so that the GPU is not starved. Now, challenges: every system goes through a series of challenges when you implement it for the first time. I will only talk about two that are broadly applicable, the first one and the last one; the other three are very specific to Zoox.

For the first one, we started with a combination of Docker, Pyxis, and Enroot to run Docker images on SLURM. It was quite complicated because you first have to build a Docker image to run anything on SLURM. Developers could not iterate; they had to rebuild Docker images many times, so it was hindering velocity. The last one: when we started having more P5 EN and P6 EN instances, we did not do a good job initially of calculating the CIDRs, that is, the IP ranges.

P5 and P6 instances come with EFAs, and each EFA network card consumes one extra IP in addition to the IP consumed by the node itself. It was a good problem to have: as we grew to more nodes and more training, we ran out of subnet IPs and nodes were not available. So we had to redesign our CIDRs and VPCs to fix that.
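As a back-of-the-envelope illustration of why this bites, here is a small sketch that sizes a subnet from the node count and an assumed number of EFA interfaces per node. The interface count is an assumption for illustration, check your instance type's specs; note that AWS reserves five IPs per subnet.

```python
# Back-of-the-envelope subnet sizing for EFA-enabled nodes.
import math

def min_subnet_prefix(num_nodes: int, efa_interfaces_per_node: int,
                      reserved_ips: int = 5) -> int:
    """Smallest /prefix whose addresses cover the nodes plus their EFA interfaces.

    AWS reserves 5 IPs per subnet, so those are added to the requirement.
    """
    ips_needed = num_nodes * (1 + efa_interfaces_per_node)
    # Smallest power-of-two subnet that covers needed IPs plus the reserved ones.
    size = 2 ** math.ceil(math.log2(ips_needed + reserved_ips))
    return 32 - int(math.log2(size))

# Example: 64 nodes with an assumed 32 EFA interfaces each
# -> 64 * 33 = 2112 IPs, so a /21 (2048 addresses) is too small; you need a /20.
print(min_subnet_prefix(64, 32))  # -> 20
```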

Other challenges included enabling certificates and connectivity to Git. Inside Zoox, things are very secure and quite complicated: the SREs have set it up so that we cannot connect to just any remote system from any node. Everything has to be properly secured, and we have to install certificates on all compute nodes. Here too, SageMaker's lifecycle scripts came to the rescue, because we can add a lifecycle script that runs when a node gets created and put our Zoox-specific logic there. It may or may not be applicable to you, but in our ecosystem we had this challenge and we had to fix it.

Thumbnail 2480

Another part is, well, SageMaker already comes with so many features. Can we build our own? Of course we can. So here is an example. We deployed NVIDIA DCGM exporter on each compute node because we wanted to see the GPU metrics that are coming out of these nodes. So we built our custom exporter, we created our custom events, and we created our own CloudWatch metrics. And once the metrics were available, we could create CloudWatch dashboards. Here I'm showing GPU utilization for P6 and P5, but the point to drive here is, well, we use SageMaker HyperPod, but it's still extensible. You can customize it to your own needs. You do not need to use each and every feature. So this is very important.
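As a rough sketch of this kind of custom pipeline, the snippet below scrapes GPU utilization from a DCGM exporter's Prometheus endpoint and publishes it as a custom CloudWatch metric with boto3. The metric name, namespace, and exporter port are assumptions for illustration; adapt them to your deployment.

```python
# Sketch: scrape GPU utilization from dcgm-exporter and push it to CloudWatch.
import socket
import urllib.request
import boto3

DCGM_URL = "http://localhost:9400/metrics"   # dcgm-exporter endpoint (assumed port)
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"     # per-GPU utilization gauge (assumed name)

def scrape_gpu_util() -> list[float]:
    """Parse the Prometheus text format and return one utilization value per GPU."""
    text = urllib.request.urlopen(DCGM_URL, timeout=5).read().decode()
    return [float(line.rsplit(" ", 1)[-1])
            for line in text.splitlines()
            if line.startswith(GPU_UTIL_METRIC)]

def publish(values: list[float]) -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/HyperPodGPU",      # hypothetical namespace
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "Host", "Value": socket.gethostname()},
                           {"Name": "GPU", "Value": str(i)}],
            "Value": util,
            "Unit": "Percent",
        } for i, util in enumerate(values)],
    )

if __name__ == "__main__":
    publish(scrape_gpu_util())
```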

Thumbnail 2530

We are in a transition phase now. We started with SLURM because we had experience with SLURM cluster before. We have our own high-performance compute system in SLURM. So we initially started with SLURM so that the learning curve is low and we could bootstrap and unblock our science team faster. But later, we are now currently moving on to the EKS-orchestrated SageMaker HyperPod. And the reason we want to move there is because of more flexibility and scalability.

As one example, these are metrics from a real job that we are currently testing out on EKS as part of that transition. In the previous slide I showed you custom CloudWatch metrics, but this time we don't want to build anything in-house; we want to leverage what is provided by SageMaker HyperPod itself. It boils down to the question of whether you want to build it or reuse it. Do you want to focus on building observability yourself, or do you want to focus on using the observability to fine-tune and make better decisions? What should your batch size be? Are you using the EFA correctly or not?

For example, if you look at the EFA metrics, EFA RDMA read writes, what does it show? EFA is supposed to be used with RDMA. If you use RDMA, then the communication between the GPUs is skipping the CPUs and other kernel levels. If the metrics are high and stable, it means that we are using EFA in the right way. If it is low, we need to fix it. Probably there is room for improvement.

Similarly, EFA received bytes: are the GPUs receiving bytes at a steady, high level? If yes, everything is working as desired; if there is a sudden drop, something is wrong in our setup. Similarly for GPU memory used, it's pretty straightforward: am I consuming the full memory or not? If I'm not, then I need to increase my batch size.

Another one is tensor core utilization. Why is it important? Well, I mentioned that we also use BF16 data format. If we use BF16 and the nodes come with tensor cores, the tensor cores can be utilized to do the BF16 computation. And if you do BF16, it is half of the memory and the speed is double. So having this tensor core utilization gives us the confidence that, okay, so we use BF16 and it is really being implemented and used.

If the graphs are low, there is something wrong we need to fix. So the fundamental point to drive here is we would not be building all these metrics by ourselves. It's already part of SageMaker, so we should focus on the business problem at hand that we want to train large-scale models efficiently, and SageMaker HyperPod helps to do that.

Thumbnail 2730

So this is a slide which is pretty much self-explanatory, showing what we had before and what is new. In our existing setup, we did not have a good experience with EFA and RDMA. Installing EFA yourself is a four-to-five-step process: you have to install the EFA installer, Libfabric, the AWS OFI NCCL plugin, and the CUDA drivers, and the versions need to match. There are many combinations where you can go wrong. Instead of doing that ourselves, we just reuse what SageMaker HyperPod provides with its managed instances: they come with everything already working, so we focus on using the technology to improve our models and train more efficiently.
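One way to sanity-check that the fabric is doing its job is a simple collective benchmark. Below is a minimal sketch (not an official AWS tool) that runs a large all_reduce under torchrun and reports rough throughput; with NCCL_DEBUG=INFO set, the NCCL logs should also indicate which network provider was selected.

```python
# Minimal all_reduce throughput check, run under torchrun on EFA-enabled nodes.
import os
import time
import torch
import torch.distributed as dist

def allreduce_bandwidth(num_elems: int = 256 * 1024 * 1024, iters: int = 10) -> float:
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(num_elems, dtype=torch.bfloat16, device="cuda")

    for _ in range(3):                    # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    gbytes = x.element_size() * num_elems * iters / 1e9
    return gbytes / elapsed               # rough GB/s per rank

if __name__ == "__main__":
    bw = allreduce_bandwidth()
    if dist.get_rank() == 0:
        print(f"approx all_reduce throughput: {bw:.1f} GB/s")
```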

Because the stack comes pre-configured, GPU utilization also bumped up to 95% after we moved to SageMaker HyperPod. Because we are using EFA and RDMA, we get almost linear scalability with the multi-node setup. We were training models of around 400 million parameters, but now we regularly train 7 billion parameter models and are moving towards 32 billion parameter models as well.

And then there is recovery time. As I mentioned, we want to reduce the human in the loop. We do not want errors, but say an error occurs while the person running the job is away at night; they come back the next morning and we have lost six or seven hours of P5 or P6 time. These machines are costly; we cannot afford that. SageMaker HyperPod is geared towards, first of all, detecting these failures efficiently and quickly. Secondly, it resumes the job as part of the product feature. And third, we can also integrate these alerts with our observability.

Thumbnail 2860

Thumbnail 2880

Thumbnail 2890

Thumbnail 2920

Okay, lessons learned. There are many, but I will highlight just four for simplicity. Use data loaders and data streaming everywhere and don't download everything to your disk; do streaming processing. Leverage HSDP as best you can before moving to tensor parallelism, and try out HSDP, gradient accumulation, torch.compile, and other techniques first. And try to run on EFA-enabled devices.

Currently we do not have multi-AZ infrastructure; we only run in us-east-2. We have to graduate and grow from there to run across regions. It is hard to get nodes in one particular region, but if you enable multi-region, it is easier to get compute nodes. And the fourth point is: if you cannot see what you are doing and have no clue how things are performing under the hood, you just cannot fix it. So you have to have good visibility and observability; you should be metrics-driven.

Thumbnail 2940

Road Ahead: Transitioning to EKS and Building the Future of Transportation

This is a quote from our director. I will not go through it line by line, but fundamentally, SageMaker HyperPod has unblocked large-scale training at Zoox, and we are using it to full extent and we have great visibility into the utilization and performance.

Thumbnail 2960

Road ahead. What are we going to do? So we are not complete in the journey yet. We're just somewhere in the middle. We started eight or nine months back with SLURM-based ecosystem. Now we are transitioning into an EKS-based ecosystem. The reason being Kubernetes is quite popular, it's very flexible, it's very scalable, and there's lots of open source frameworks built on top of Kubernetes, and we can utilize them along with SageMaker HyperPod.

We also want to transition to SageMaker HyperPod Training Operator, which is very customized for HyperPod. We currently use Kubeflow PyTorch Job Operator on the EKS. HPTO also gives us regular expression-based log scanning.

For example, if your job is not progressing fast enough, or if the validation loss is not decreasing, you can have expressions that will detect job hang or situations where your training is stalling. We also want to implement task governance. Currently we have a first come, first serve basis and we make a manual roster for who will be using which machine. We want to move away from that.

We should have teams and teams should have compute quotas, and they should be able to lend and borrow between teams. For example, there are so many jobs running in Team B and they have consumed all their resources. Now I need one more job to run. Should I just sit idle? No, I can ask Team A, can you lend me some resources for the time being so that I can run my job. Task governance in SageMaker HyperPod will give you that feature, and we want to leverage that. The fourth point is we want to continue leveraging FSDP, HSDP, and do more experiments. So those are the technical requirements.

Thumbnail 3080

What are the non-technical requirements on the road ahead? The non-technical question is: can I ask for any resource? Can I ask for the entire cluster? Can I ask for all nodes in the cluster? Obviously no. So we need to implement guardrails and resource quotas. We have namespaces, so every user or namespace should have some restrictions, so that we do not end up with runaway workloads where everyone asks for everything and we effectively DDoS our own cluster.

We want to expand to multi-region, which I mentioned is on our radar, but we have not implemented it yet. Zoox has its own log monitoring system. SageMaker HyperPod does not mandate or does not have an opinion on how you should manage your logs. Logs are created from SLURM jobs or EKS jobs, and it's the user's responsibility to have a system on top of it. We already have an established robust system of monitoring logs, and we will bring that system into SageMaker HyperPod.

Similarly, I showed you some metrics on Grafana. Zoox has its own observability stack, their own Grafana, their own Prometheus, and we will integrate that. As you can see, we are leveraging HyperPod for the compute nodes, orchestration, recovery, and resilience. For the other things which are already in Zoox, we are just integrating them and plugging them in, so both ecosystems coexist together and play well together.

Thumbnail 3180

Everything my two co-presenters and I have talked about ultimately culminates in this: we want to build the future of transportation. We want to build a robotaxi that is safe, reliable, and scalable. In order to build such a system, we need better models to drive the robotaxi, and in order to build such models, we need better infrastructure to build those models. That's why we have a partnership with AWS on SageMaker HyperPod.

Thumbnail 3220

If you want to learn more about Zoox, please visit zoox.com. We also have One Amazon Lane at Caesars Forum. You have been hearing a lot of innovations in all other aspects as well. You can come and try out hands-on each of these innovations in this One Amazon Lane at Caesars Forum. We also have a Zoox booth in the Amazon Lane where you can check out the robotaxi, get into it, take pictures, and just come talk to us. We would love to explain more.

Thumbnail 3270

Thank you for coming to the talk. This is the three of us representing the talk, but understand that behind these three, there are many more people who have supported this infrastructure and are working in the background behind us. Thank you all.


This article is entirely auto-generated using Amazon Bedrock.
