🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - HPC at Scale with AWS Parallel Computing Service (PCS) (CMP340)
In this video, Ian Colle from AWS introduces AWS Parallel Computing Service (PCS), a fully managed Slurm-based HPC service designed to simplify high-performance computing workloads. He shares AWS's HPC journey since 2017, addressing initial customer skepticism from companies like Shell and Toyota. The session features case studies from both organizations: Satoshi Ikemure explains how Toyota reduced environment setup from six weeks to 30 minutes with PCS, while Michael Gujral describes Shell's evolution from failed POCs to production success, achieving 3-5x performance improvements and accelerating 2.5 years of wall clock time through capacity blocks. The video also covers PCS architecture, pricing, global availability plans, and demonstrates AWS Batch integration with SageMaker for Toyota's large behavioral model training in robotics.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
From Skepticism to Partnership: The Early Days of HPC on AWS
Good afternoon. I'm Ian Colle, the director of advanced computing and simulation at AWS. Thank you for coming today. Before we start, I want to make sure that this silent disco setup is really working. I've never done one of these before, and I don't know how many of you have either, so I want to take a little poll to see how we can make this interactive. For how many of you is this your first re:Invent? Please raise your hand. Wow, congratulations. On the other end of the spectrum, how many of you have been to more than ten re:Invents? Anyone? Wow. Well, I'm somewhere in between. This is actually my ninth re:Invent and my eighth presenting. I started in November of 2017. Unfortunately, I was staying at the Vdara, and no one gave me the hint about wearing comfortable shoes, so I subsequently walked a half marathon on my first day going to different presentations here and there. Since then, I've learned my lesson, but it was an interesting time to be talking about HPC on AWS.
Two weeks before that had been Supercomputing 2017 in Denver. I was a freshly minted AWS employee, and I ran into a number of my former colleagues at Whamcloud, at Intel, at Inktank, and at Red Hat. They asked me, "Ian, what does Amazon Web Services have to do with high performance computing?" Let's just say we had a lot of skeptics around. Part of the reason I'm so excited to have the two colleagues that I have with me today is because two of the biggest skeptics, and honestly two of the biggest companies that held us accountable for developing the HPC infrastructure and orchestration tools that we have today, are Shell and Toyota.
In fact, I recall very vividly walking into a customer meeting with Shell at Supercomputing 2017. A very kind gentleman with a very strong Texas drawl said, Ian, you seem like a nice guy, but you don't have the performance. I don't know about your security, and to be honest, I don't know that you'll ever hit the price that I need to perform my high performance computing workloads on AWS. Welcome to the company. That was quite a shock, but it was really refreshing to hear someone that could be that open and honest with me.
A couple of months later, one of my first international trips was to go meet with Toyota. I sat down with them, and they explained that they had decades of experience tweaking their high performance computing clusters to where they could perform at such levels of efficiency that they weren't sure we could ever demonstrate the business value of moving some of those workloads onto AWS. Here we are. It took us a while to get there, but not only do we have both Toyota and Shell migrating a portion of their workloads onto AWS, but they're going to talk to you today about how they have them in production and what that has meant to them.
Introducing AWS Parallel Computing Service: A Managed Slurm Solution
So what are we going to talk about today? We're going to talk about PCS, Parallel Computing Service. What is it? Why did we develop it? In large part because of what I just said—very demanding customers like Toyota and Shell and others asking us for a fully managed HPC as a service. We're going to go a little bit into the design, cost, and pricing, where you can find it, and what regions it's in. Then you'll hear from them directly. You'll hear from Shell and from Toyota about how they're implementing it. Then I'll come back up and do a little recap, and I'll give you a little teaser with a pretty cool demo on one of our other services, AWS Batch.
What is driving this need for HPC as a service? What is causing us to really evaluate how we can make HPC workloads simpler for customers to execute? Part of it is just the explosion of HPC. With the growth of AI workloads and the integration of AI and simulation, we see customers across disciplines looking at how to more efficiently and effectively implement architectures that were previously considered among the most demanding in all of scientific computing. That's why we've developed Parallel Computing Service to address these needs. Instead of the complexity of creating your own infrastructure, we're allowing simple infrastructure as code, repeatable setups that can be driven by APIs.
We're allowing customers to focus on what they do best, which is scientific workloads, research and development, engineering, and simulation. The expertise in actual infrastructure is handled by our managed service, taking away some of that undifferentiated heavy lifting. We're still allowing on-site HPC administrators to perform their very important tasks of upleveling what their customers can accomplish by giving them more tools. They can rely on us to handle some of the more mundane patching and updates—things that previously would have required taking down an entire cluster—which can now be done via API-driven development seamlessly to the underlying cluster.
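As a rough illustration of what that API-driven, infrastructure-as-code setup can look like, here is a minimal sketch using boto3's pcs client. The client and the create_cluster operation come from the public PCS API, but the cluster name, Slurm version string, subnet, and security group values below are placeholders to adapt to your own environment.

```python
"""Minimal sketch: standing up a PCS cluster programmatically.

Assumes boto3's "pcs" client; the resource IDs and the Slurm version
string are illustrative placeholders, not values from this talk.
"""
import boto3

pcs = boto3.client("pcs", region_name="us-east-1")

# Create a managed Slurm cluster; PCS runs and patches the controller for you.
response = pcs.create_cluster(
    clusterName="research-cluster",                    # placeholder name
    scheduler={"type": "SLURM", "version": "24.05"},   # version string is an assumption
    size="SMALL",                                      # controller "t-shirt size"
    networking={
        "subnetIds": ["subnet-0123456789abcdef0"],     # placeholder subnet
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder security group
    },
)
print(response["cluster"]["id"], response["cluster"]["status"])
```

Because the whole definition is a single API call, the same snippet can live in version control and be replayed to get a repeatable environment, which is the point being made here.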
So what is AWS Parallel Computing Service, or PCS? Our initial version is a managed Slurm offering. How many people have heard of Slurm? Please raise your hand. A bunch of people. Slurm is the most popular open source scheduler. While there are a number of popular schedulers across industries, from LSF to PBS Pro to Symphony, Slurm has taken on a really important role, especially in large language model training and in AI, because of its role in academia and its open source background. Because of that, we thought: let's start with a scheduler that's fully open source and that lets us provide a breadth of experience to our customers.
We're still reviewing other schedulers. If I didn't mention your favorite scheduler, or if you say, "Hey Ian, Slurm is great to have managed, but I've got a bunch of stuff written that's tailored to a specific scheduler and I would really like a managed version of that," come talk to me afterwards or send me an email. You have my contact info at the end of this presentation. We're still looking to expand this. We just started with Slurm because of its popularity within open source.
AWS HPC Portfolio Evolution: From ParallelCluster to PCS
Where does PCS fall within our HPC portfolio? HPC is a pretty broad term, so I wanted to give you a picture of everything that we talk about. It's just like you would think about high performance computing on premises. We need networking, we need compute, we need performance storage, and we need some way to tie it all together into a coherent cluster. That's why you see here we have remote visualization from Amazon DCV. We have various ways that you can instantiate different clusters depending on what you're looking for. I mentioned a little bit about AWS Batch, and I'll go into that later. I also talk about how we've evolved from ParallelCluster to PCS and why some customers choose to use one over the other.
We've got this whole family of EC2 instances, from our HPC-optimized instances all the way to our latest and greatest R8I on the Intel Xeon 6 generation Granite Rapids chips that we recently released, all the way to the latest and greatest NVIDIA GPUs, from the GB200 to the B300 available in our P6 family. Whatever particular EC2 instance you want to orchestrate, we can do that with PCS. It's not just the compute though; it's the storage. You have a whole family of storage products that you can integrate into your cluster depending on your needs. Some people really like FSx for Lustre because of the open source background, similar to Slurm. A whole bunch of people have their infrastructure set up with FSx for NetApp ONTAP and want to maintain that. We support all of that. You can connect the storage that you choose for your workloads to the particular compute you like, all using PCS.
So how did we get here? I talked about how I joined Amazon at the end of 2017. When I showed up, there was this long partnership between Intel and AWS where they had a vision. Back in 2015, we saw the need to make it easier for customers to instantiate these large HPC clusters, so we jointly created an open source toolkit called CfnCluster, short for CloudFormation cluster. But when I showed up, I talked with a bunch of customers and they said, "Is this even associated with Amazon? We see a GitHub repo, and this doesn't really look like something I'm ready to stand behind and run my production workloads on."
We very quickly said we need to stand this up as more than just an open source toolkit. We need to have this be a fully fledged Amazon supported solution with a support team behind it. So we rebranded that open source repo as AWS ParallelCluster and stood up a whole team of engineers that to this day continue to upgrade and improve AWS ParallelCluster.
Along the way we learned from a lot of customers who said they appreciate the modularity and the ability to hack AWS ParallelCluster because it is completely downloadable as an open source toolkit. However, they really would like a managed service. There are so many things that are hard to do with just an open source toolkit. For one, it's annoying that they have to blow away their entire cluster and recreate it from scratch just to pick up the latest and greatest ParallelCluster version. They asked if there was some way we could do it like all the other AWS services, where it seamlessly upgrades behind the scenes and they don't have to think about it. They also wanted to be able to do infrastructure as code and more API driven development. They asked if we could make it a true AWS service. That's why last year we released AWS PCS.
Dynamic Scaling and Managed Services: Key Features of AWS PCS
What does it allow you to do? AWS PCS provides managed Slurm, so what does Slurm allow you to do? It allows you to scale and schedule resources for compute jobs. You can define how you allocate those resources, assign priorities, and set up various queues. We'll go into some of the details of the actual orchestration of those jobs.
One of the key values of moving your scheduling to a service like AWS PCS is the ability to dynamically provision resources. Typically within the history of high performance computing, we're looking at a fixed box with an on-premises resource that is, let's say, 100 nodes or 1,000 nodes, and we are trying to efficiently allocate jobs to exercise and execute on those resources. But here's the thing with the cloud: we blow the top off that box and we say, when does it make sense to have those resources? Do we have jobs in the queue? If we have jobs in the queue, then let's scale up. If we don't have jobs in the queue, then let's scale down. That's one way that customers take advantage of this dynamic scaling.
There are other times where, especially in the new world of very scarce GPU resources, customers say they have these GPUs and they're not letting go of them. They're going to use them just like they would on-premises resources and keep them up 24/7, feeding them just like they would, and they want us to maximize scheduling against those 50 GPUs as much as we can. But depending upon that static allocation or dynamic allocation, it's all up to you. You have control over how you allocate those resources to the compute jobs that you need to execute.
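To make those two allocation patterns concrete, here is a hedged sketch of how a compute node group's scaling bounds could express them via the boto3 pcs client. create_compute_node_group is a real PCS operation, but the launch template, instance profile, subnet, instance types, and counts shown here are placeholders rather than anything from the session.

```python
"""Sketch: dynamic vs. static capacity in PCS compute node groups.

The launch template, instance profile, subnet, instance types, and
counts below are illustrative placeholders.
"""
import boto3

pcs = boto3.client("pcs")

network = {
    "clusterIdentifier": "research-cluster",  # cluster from the earlier sketch
    "subnetIds": ["subnet-0123456789abcdef0"],
    "customLaunchTemplate": {"id": "lt-0123456789abcdef0", "version": "1"},
    "iamInstanceProfileArn": "arn:aws:iam::111122223333:instance-profile/pcs-nodes",
}

# Dynamic allocation: scale to zero when the queue is empty, burst up when jobs arrive.
pcs.create_compute_node_group(
    computeNodeGroupName="burst-cpu",
    instanceConfigs=[{"instanceType": "hpc7a.96xlarge"}],
    scalingConfiguration={"minInstanceCount": 0, "maxInstanceCount": 100},
    **network,
)

# Static allocation: hold scarce GPU capacity at a fixed size, 24/7.
pcs.create_compute_node_group(
    computeNodeGroupName="held-gpu",
    instanceConfigs=[{"instanceType": "p5.48xlarge"}],
    scalingConfiguration={"minInstanceCount": 50, "maxInstanceCount": 50},
    **network,
)
```

The min and max bounds are where the "blow the top off the box" idea shows up: a zero minimum lets idle capacity disappear, while equal min and max pins capacity on for workloads you never want to release.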
One of the nice things about having a managed service is that we're now integrated with the broader AWS system, so you get CloudWatch Logs. You can look at what happened with your jobs, with the scaling, with the execution, and you get insight that you didn't have before with ParallelCluster. One of the high-value benefits I see in moving from ParallelCluster to PCS is this: customers have told me there are so many times where they just want to upgrade something. It could be upgrading an OS on the nodes, applying a particular patch, or upgrading Slurm versions. All of that used to have to be done by the admin. We figured out a way for AWS to take some of that undifferentiated heavy lifting off of your admins, so you can just say update the cluster, hit an API, and it happens.
We talked about the flexible architecture where you can use CPUs, GPUs, and pick your favorite storage, regardless of what is particular to your infrastructure and your needs. We're not prescribing what you have to use, and that's part of the benefit of moving to cloud resources. It's not a one-time snapshot where you have to pick your 80% solution, which might right now be a certain flavor of GPU and a certain flavor of CPU. At the rate that new GPU versions are being released, by the time you actually get one into production, you may be two or more iterations behind on the innovations that have happened.
With the flexible scheduling that you have on AWS, you can spin up some of the older instances and some of the newer instances. It's totally in your control what resources you allocate to your jobs. When we were creating PCS, we wanted to make sure that we were meeting the needs of multiple stakeholders. Within HPC, we have not just scientists and engineers but a whole family, including HPC system administrators, the people who are actually allocating the resources for the scientists to run experiments on.
What are their needs? There has to be some sort of really robust logging. There has to be a way that they can have telemetry into how efficiently their jobs are being scheduled, what resources they're taking advantage of, whether they have resources sitting there that are not being utilized, and how to measure that efficiency. For the scientists and engineers, who are the core end users, we want to make it as seamless as possible so that they don't even notice if they're running on AWS with PCS, maybe AWS with ParallelCluster, or on premises. We want to make that Slurm migration such that for them it's the same experience because at the end of the day, the last thing we want to do is slow down innovation by our researchers. We want the opposite. We want to help them speed up, and you're going to see some examples from customers about how migrating to PCS has helped them really improve that innovation cycle.
Architecture, Pricing, and Real-World Impact: PCS in Action at Tune Therapeutics
We have a number of partners and ISVs that have built on top of ParallelCluster before and now on PCS that wanted us to really do that integration of an AWS service so they could build a partner offering for their customers. So let's take a quick deep dive into what PCS looks like. It's a standard cluster. A cluster unsurprisingly has a number of node groups. We have one for logins and one for compute types, and then we have a number of queues that schedule back and forth. We'll go into different options that you can have around queuing, and then again pick your favorite flavor of storage and your favorite directory service. We'll integrate with it. It's not prescriptive.
Here's an example where a user logs into node group A. They have a queue that is attached to two different compute types: one node group that's specified for ARM CPUs and one node group that's specified for x86 CPUs, and that queue can schedule between them. Similarly, you could have one that's GPUs and one that's CPUs. In the second example below, you have the opposite: instead of one queue that goes to two node groups, you've got two separate queues that go to a single node group. And again, that gets to the scenario I was talking about. Let's say you have committed on-demand capacity that you want to maximize, so you don't really care whether the jobs come from queue A or queue B. You want to make sure that the hardware resources you've committed to are being employed as effectively and efficiently as possible.
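A hedged sketch of the first pattern, one queue fanning out across an ARM node group and an x86 node group, might look like the following. create_queue is a real PCS operation, but the cluster identifier and node group IDs are placeholders; in practice you would pass the IDs returned when the node groups were created.

```python
"""Sketch: one PCS queue scheduling across two compute node groups.

Assumes two node groups already exist (an ARM group and an x86 group);
the identifiers below are placeholders.
"""
import boto3

pcs = boto3.client("pcs")

pcs.create_queue(
    clusterIdentifier="research-cluster",
    queueName="mixed-arch",
    computeNodeGroupConfigurations=[
        {"computeNodeGroupId": "pcs_arm_nodegroup_id"},  # e.g. a Graviton-based group
        {"computeNodeGroupId": "pcs_x86_nodegroup_id"},  # e.g. an x86-based group
    ],
)
```

The reverse pattern from the second example, two queues feeding one node group, is simply two create_queue calls that each reference the same computeNodeGroupId.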
Here we pull back the covers a little more into the detail of the architecture. You can SSH into the login node, and a user can then submit to their queues from two different permission groups. What I really want you to focus on is the two big boxes, or actually one big box on top and a little rectangle on the bottom: that is the shared responsibility model. Everything in that bottom rectangle, we handle. We handle all those updates; you don't have to worry about any of that. If there are minor security patches or new versions of software, we take care of it.
You don't have to do anything; we just upgrade the service. Everything in the top box is in your VPC, and you control your compute nodes. So if you're going to do something like a massive update to all of your underlying AMIs, that's something your admin would continue to own. But if we provide a newer version of Slurm for you to run, you can use our update cluster API to upgrade your Slurm version.
We wanted to make this as seamless as possible, just like every AWS service. Console, API, SDK—it all works. How can you purchase this? We're talking about scaling EC2 nodes. You purchase EC2 nodes the same way—ODCRs, spot, on-demand—you can utilize those. Here's that shared responsibility model. Here's what we update for the controller, and here's what you update. In some cases with minor versions, it's a nondisruptive update, so we just do it. If it's a major version, we want you to have that control so you can decide whether or not to update it. Same with PCS features—we update it. It's fully supported. Where is PCS? Right now these are the locations where it's available. You can see there's some in Europe, some in the Pacific, and some in North America. But the most exciting thing I wanted to talk about is that by the end of next year we won't need this slide because it'll be everywhere. We are making it globally available, so AWS PCS by the end of 2026 will be in every Amazon region.
Observability with CloudWatch Logs—I talked a little bit about that. It's the standard observability you get from AWS, and you'll be able to get it with PCS. Especially with taggable resources, this is how customers often track their costs. By tagging specific jobs to work groups, CloudWatch Logs lets them say, "OK, this work subgroup ran these jobs, which cost X," and you can do your own internal chargebacks if that's what you're doing for cost accounting.
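For instance, a cost-allocation tag can be attached to a PCS resource and then used to slice cost and log views by team. tag_resource and list_tags_for_resource follow the standard AWS resource-tagging pattern, though the ARN and tag keys below are made-up placeholders.

```python
"""Sketch: tagging a PCS queue for chargeback-style cost tracking.

The ARN and tag keys are illustrative placeholders; the calls follow
the usual AWS tagging pattern.
"""
import boto3

pcs = boto3.client("pcs")

queue_arn = (
    "arn:aws:pcs:us-east-1:111122223333:"
    "cluster/pcs_abc123/queue/pcs_def456"  # placeholder ARN
)

# Tag the queue so cost and log views can be grouped by work group.
pcs.tag_resource(
    resourceArn=queue_arn,
    tags={"CostCenter": "geophysics", "Workgroup": "imaging-team-a"},
)

print(pcs.list_tags_for_resource(resourceArn=queue_arn)["tags"])
```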
There are two fees in the pricing, so let me go into this. The first is how your cluster controller is set up. Right now, and this is something we want to improve in the future, you have to pick a t-shirt size: I think I'm going to need a small, or maybe I need a large. You can upgrade, but you need to tear everything down and rebuild in order to do it. In the future, we want to make this dynamic, so maybe at one point you need a small and sometimes you need a large, and you can migrate back and forth without it being as invasive as it is today. But right now you need to commit to what size your controller is, because that limits how many jobs you can execute. Then we have the accounting feature, where we do Slurm accounting. That saves an accounting database, which takes up S3 storage, so we're going to bill you back for that storage. It's not a big plus-up or anything; we're just trying to account for the storage that the accounting database can take up. And then you're going to pay for your regular EC2 instances, which are in your account.
Last, I want to talk about a real-world example before I turn it over to Satoshi Ikemure for a brief overview of how they're using PCS at Toyota, and then a deeper look from Michael Gujral at what they're doing at Shell. Tune Therapeutics—if you haven't heard of them, it's really interesting. They're a company in the biomedical space studying epigenetics. Basically, what they're looking at is not gene editing, but how you turn gene expression off and on via medicine. They went into trials earlier this year on a treatment for hepatitis B. Instead of gene splicing, where you can get the wrong thing or unknown side effects, they limit their intervention to how genes or groups of genes are expressed, and they see a very favorable outcome from working at the level of epigenetics as opposed to genetic sequencing and editing.
What they've seen from us is that by moving to PCS, they were able to cut their innovation cycle down from 12 weeks to 2 to 3 weeks. Now I want you to think about that in real terms. Imagine if your scientists, engineers, and researchers were able to turn the crank on great ideas 4 times a year. That's what they were doing before. Now it's 25 times a year. What kind of a change would that allow in your organization if you could turn the crank on innovation that much more rapidly?
Toyota Central R&D Labs: Accelerating Research with AWS PCS
So with that, I'm going to give up the stage to Satoshi Ikemure, where he'll talk to you about what they're doing at Toyota. Hello, everyone. My name is Satoshi Ikemure, and I'm the Administration Leader of HPC at Toyota Central R&D Labs. Today, I'd like to share how our organization modernized our HPC environment.
First, let me briefly introduce our company. Toyota Central R&D Labs is a research institute within the Toyota Group. It was established back in 1960 to create useful and high-quality research results. We work together with a variety of companies and research organizations, both in Japan and overseas, including Toyota Group companies and our technical partners.
We conduct research across a wide range of fields including information technology, environment and energy, and material science. Our major achievements include the development of the Quick-response code, or QR code, the practical application of the Visible-Light Active Photocatalyst, and the demonstration of artificial photosynthesis. These studies contribute to solving social challenges and building a sustainable society and continue to evolve through collaboration with Toyota Group companies and research institutions worldwide.
We integrate the computing needs of various research teams and provide an efficient and scalable shared computing environment. This environment is operated by a team of about 10 members and is designed to flexibly support a wide range of applications and resources, including CAE simulations, AI and machine learning workloads, and large-scale research data storage. We support researchers in efficiently accessing the computing resources they need when they need them.
Since this is a shared computing environment, we usually design and provide standard computing resources. However, depending on the research project, some teams need large-scale resources or long-running computations that can take several months to complete. In traditional HPC environments, when we tried to meet all those demands, a lot of jobs ended up waiting in the queue, and adding new computing resources could take months from ordering to actual deployment.
To address our challenges, we started by providing dedicated EC2 instances. However, costs were incurred even when the instances were idle. Next, we adopted AWS ParallelCluster. But setting up individual environments for each researcher was time-consuming.
Running separate controllers also increased cost, so we decided not to expand its use. Later, AWS Parallel Computing Service was released. It provided the dedicated nature of EC2 and the flexibility of ParallelCluster in just a few clicks, so we quickly decided to adopt it. AWS Parallel Computing Service is provided as a managed service that includes the scheduler. Since the PCS environment can be intuitively configured and built through the API, we can now easily create environments on our own without relying on external vendors as we did before. For providing computing environments, we listen to each researcher's needs and simply link the appropriate instance type and the required number of instances to their own dedicated queue. This allows them to start large-scale or long-running computations immediately.
In addition, because the instances automatically stop when the computation is finished, it has also contributed to reducing overall costs. Finally, here's the outcome we achieved by adopting AWS PCS. Environment setup, which usually takes six weeks on premises and several days with ParallelCluster, can now be completed in just thirty minutes with PCS. This allows researchers to spend more time focusing on their work. Soon after the release, actual use began with requests for r7i.48xlarge instances for large-scale material simulations and p4de.24xlarge instances for AI and machine learning workloads.
Being able to quickly accommodate these advanced computing demands demonstrates the flexibility and effectiveness of PCS in practice. In addition, integration with job execution now allows resources to be used only when needed, improving overall utilization and leading to cost optimization. We are also freed from daily operations, giving us more time to focus on creating new value. Let's transform together, from systems that simply process to systems that respond to needs. Thank you very much.
Shell's Cloud Journey: From Proof-of-Concept Struggles to Production Success
Good afternoon, everyone. I'm Michael Gujral with Shell. I lead our global HPC engineering operations. As you can tell, I do not have a Texas drawl, so it was not me in 2017, but we will talk about 2017. Ian and I have had many similar meetings, but I'm happy to be on stage here to talk about our HPC journey with AWS and specifically some of the things we've seen in PCS. First, let me talk a bit about Shell. I don't work at a gas station and it took me a long time to convince my kids of that. Thankfully they now know where my office is, so they believe that I don't go to a gas station every day.
But as a company, with 2022 numbers, we're the size of a city. We're in over 70 countries with 93,000 employees and 46,000 retail stations that see 32 million customers daily. Those top two are things that most people associate with us and know about us, but it's really not what I'm going to talk much about today. What I'm going to talk about is the upstream side of the business, because that's really who uses HPC. We produce about 2.8 million barrels of oil equivalent per day, and we sell 66 million tonnes of liquefied natural gas. These are two key energy sources.
What does that mean and what do those ultimately produce? A lot of energy solutions. Hopefully everyone drove a Toyota here and filled it up with Shell, and then you'll be compliant with the presentation here. We see ourselves as an integrated energy company providing energy for the energy needs of tomorrow. We're going to talk about getting into that and the many different customer segments that our products help fuel.
To fill up all those pipes, we need a tremendous amount of compute. Not to actually produce it, but to find it. So what we're going to talk about today is upstream. In our upstream business, 85 to 95 percent of our compute is used for seismic imaging. What's seismic imaging? You send sound waves into the earth and you listen back to it. It's very complex physics. So two questions to think about when we look at this slide: Why do we need HPC? Why do we need more HPC? And then why cloud? Why do we believe in the cloud and what's going on?
Why do we need more HPC? As we go into more complex areas and try to find oil and gas either deeper or in more challenging spaces, we need more physics. The algorithms that we're solving are physics probably based in the 1970s. It's wave equation based. The challenge has always been the affordability of compute. Essentially, we give our geophysicists a new machine and they fill it up algorithmically. The same thing happens with storage, but this is not a talk about storage.
Why cloud? The cloud enables us to be more reactive. Historically, we guess how much compute we think we'll need based on how many projects we think we'll have in how many years, and we buy a big system. We fill the system, but we never really know what we could have done. Generally, we prioritize our workloads. But blowing the cap off, as Ian said, was a huge driver. What does that do for us as a business? It enables more rapid decisions. If we can make those decisions much closer to when we need it, if we can turn capacity on when we need it, we're enabling the value of information that for us is a critical path to our bottom line.
Getting these images out faster enables our business to move faster. Then there's delivery timelines. Getting gear on premise has become incredibly challenging. Prior to COVID, things were generally smooth and simple. Now in a post-COVID world with the AI boom and modern-day GPUs using a tremendous amount of power, this has become a complex challenge. We still do have some on premise today, but I'm very happy that AWS does a ton of it for us because the scale, the repeatability, and the efficiency that they do it at means my engineers don't need to worry about that.
Let's go down memory lane. From 2017 to 2022, we struggled. We saw this thing called cloud and thought we should try to do HPC in the cloud. We tried a lot of different things, largely at proof-of-concept scale. We tried to do the exact same thing. We said, "I have 10 nodes here. Let's provision 10 nodes." But the 10 nodes didn't look like ours. We didn't like the nodes. We tried spot because it's cheaper, but we couldn't handle interruptions. We just kept failing and didn't see a lot of traction.
Then we said, what if we could go 10x as fast? When you go 10x as fast, you change how you do things. When you take something that takes 60 minutes to do and you can do it in 6 minutes, you're going to do it more times. So the 10x challenge we posed was: Amazon, we want you to make our embarrassingly parallel workload go 10x faster in wall clock time. That's all. It's a simple ask. You're infinite. Let's do it.
We also said to do it technically first. Commercials are important, but we're going to put that on a parallel track. If it's really expensive, we may never do it. We may do it once, and if it's affordable, we'll do it all the time. We separated the two tracks and asked how do we go faster and how do we actually impact our bottom line.
It was still hard, because going to 10x scale breaks everything. In 2022, the stars aligned for us. One of our key algorithms, full waveform inversion, had just been accelerated onto GPUs, which led to a much more homogeneous design. Our nodes, because of NVIDIA's reference architecture, look a lot like a P-series node in AWS. We also realized that we didn't have to take the whole workflow. We always thought we needed to take every single bit of our HPC workflow end to end, but we realized we don't.
Data has gravity, but HPC has attraction. We decided to unwind our workflow and said this bit runs on accelerators. What we found is that the input size is actually pretty small. We can send that up and use a tremendous amount of GPUs and bring it back down. All of those things aligned, and in December 2022 we went live in production with P4DEs and A100s. That was really the start of our production journey, and it's been a wild time because in December 2022, ChatGPT had just come out. These A100s are great, but it became very challenging to get them.
Coming to the tagline here, the advantage of working with Amazon for a long time is you start to speak their language as well. We always thought this was going to be a staircase. We thought we had to get to step one, then step two, but we're now at a point where we're actually in a flywheel. We iterate much more. One of the biggest things that happened for us in 2025 is capacity blocks. Getting predictable bursts was a challenge when we were trying to get a P4de, but with EC2 Capacity Blocks and the visibility we have into when we need capacity, this has been a huge enabler for us.
Since 2022, because of our ability to burst, we've accelerated 2.5 years of wall clock time of projects by enabling burst on critical path items to our bottom line. That's critical because we've been able to accelerate by blowing the lid off and thinking differently about what capacity we need. The bad news is we're still not at 10x. The good news is we've sustainably hit 3 to 5x, and that 2.5 years is of course a proof point that we are definitely headed towards there.
Migration to PCS: Shell's Seamless Transition and Operational Transformation
Now we're in the flywheel, and the pace that we're able to iterate and do things is where PCS has become a piece of that. The journey now, if I look at what we've done even just this year in 2025 versus what we did in 2022 to 2024, the pace of innovation that we're able to drive is just so much faster. That's of course what our HPC users do, but even what we as an HPC department do, how we think about it, how we provision our systems. Let's dive into a bit more detail about how we started and where we go. Very similar to Toyota, we started with ParallelCluster. One of the things we learned in all those early POCs that failed was that we realized we don't have a containerized workload.
While Batch is a great tool, it wasn't the right tool for us, though we did try to use it a good number of times. We tried other solutions as well, but what we realized is that ParallelCluster and Slurm were familiar because it's what we use on premises. So it was a comfortable feeling: the cloud wasn't this scary thing anymore. We know how to use Slurm, so it was a very familiar thing. We were able to use it in multiple regions and multiple clusters, and that was really what enabled us in 2022. It's a really great tool, but of course there were challenges with ParallelCluster. We knew we could do better. We knew there was more that we could do in this cloud journey.
Where were some of the pain points or inflection points? We wanted to combine multiple architectures in a queue. You don't really think about needing to do that on premise, but now you can.
What if I want Graviton and AMD EPYC sitting beside each other? That's a novel concept, and it was reasonably hard to do, or needed a lot of clever engineering, in ParallelCluster. Native Terraform support was another one: in ParallelCluster we worked through CloudFormation, and it was a lot of work. We had great engineers who made that work, and we'll leave it at that, but it was maybe not where we wanted to invest all of our time.
Multiple OS support is important, especially as our AI workloads continue to emerge. We're seeing more need for Ubuntu. We're using Rocky. We're using RHEL. Being able to have multiple operating systems is just another thing that gives us flexibility as an HPC provider inside the company. Seamless integration of Spot Instances has been great, and I'll come back to that story a bit later on how cool that's been. It's an unpopular thing to say in an HPC talk, but some of my HPC engineers have admitted in private that they like the GUI. They will never tell their peers that, but the GUI has been a very handy tool for visualizing what the cluster looks like.
Instant cluster upgrades matter because we came from a world with ParallelCluster where the pattern is very familiar: you take the cluster down, and you bring it back up with any update. The reality in the cloud is that you're doing way more updates if you want to, because on premises you bring in a cluster and keep the same nodes for any number of years, whereas new instance types are being released all the time. Maybe I want a different shape, maybe I want a different size, so the number of updates can be so much higher. That was really our inflection point when they announced PCS. It looked great, but that was one of those hard customer meetings where we said: we're there with you, we just need these features. My kudos and recognition go to the PCS team. We set up a path, we looked at the roadmap, and said, okay, there's our jumping point.
We needed key things like Rocky 8. We needed capacity block support. We were not willing to do PCS as a toy or as a POC. We were going to go with our biggest production workloads. That's how we're going. We're not going just for fun. We're going for it all. Once those were met, we already talked about the motivations, and then the fun really started to happen.
We had this timeline and knew when the features were coming. We had everything lined up. We have a great AWS ProServe team that's been working with us since 2021. In the eleventh hour, we're getting ready in the final sprint. I get a call on a Friday morning and one of our ProServe engineers has decided to leave AWS in two weeks. That was a challenging call and it led to a bit of a fire alarm scenario. We said, okay, who are the engineers we need from AWS? Who are the engineers we need from Shell? We got them all in a room in the Netherlands. Thankfully we were able to get everyone's visa in time. One of the visas came in Sunday night as the flight was Monday morning, so there were some hairy moments in there, but we got all the people together and it was phenomenal.
The excitement in that room, of people being able to solve problems and deliver, was really amazing to see. That came together, and there were some considerations. We do things on a head node because it's there in Slurm on premises; it's just convenient. Well, now it's Amazon's head node and we're not allowed to touch it, so now we have an admin node. There were some considerations with node groups. It's a great idea, but when you haven't thought about it before, it does take some thought about how best to use a node group: how many node groups do we want, how many nodes do you put in each, and what's maintainable? There was dealing with AWS quotas and convincing the PCS team that yes, you really do need to increase the quota. And then there was exposing Slurm features, because having used Slurm, we knew what we would want to see. It's just been a really great partnership.
Here's what it looks like. I'm not going to go into detail. Ian showed you a bit of this already. I will call out because someone's going to look at their screenshot later and say, is there a hamster in there? Yes, there is a hamster in there. Hamster is one of our API tools. Our optimization team has a thing with rodents. They're called the Rats. That is Hamster. There is a tool called Mice and others, but yes, that is a hamster. Essentially on the left you have your user and the user has no idea whether it's ParallelCluster or PCS. They are aware whether it's on premise or the cloud, but that's only because of data, and they need to make sure their data is in the right place.
Beyond that, it's the same user experience running across multiple availability zones and multiple regions in a very seamless way. You'll see Spot instances in there, and I'll come back to that.
We were at Supercomputing 25 a couple of weeks ago, and one of our lead developers was there. He asked the engineer leading our PCS effort if he could get a Spot instance in Oregon. He opened his console, wrote two lines of Terraform, hit commit, and within a few minutes, there it was in Oregon, running under PCS. Historically, how would that have worked? You'd update the cluster, take the cluster down, drain the cluster, and do it all over again. The seamless upgrade that Ian talked about is very real. With Terraform, two lines of code take seconds, and a big cluster change is probably a minute, versus the 30 minutes it used to take. It was just such a cool thing to see how happy our researcher was when it came through.
The other thing we always had to do was manage cluster changes carefully. We have the advantage of an embarrassingly parallel workload, so when we did these cluster changes, we could still make it seamless to the user. You would slowly drain one cluster and slowly rehydrate it. It's really clever engineering that our HPC engineers are able to do. But now they don't even need to do that. Now they can use their creativity elsewhere.
Once we got through the initial concerns about losing resources and making sure we had all the right things in place, I can tell you our migration from ParallelCluster to Parallel Computing Service was really anticlimactic. The way we found out about it was that one of our overseas engineers decided to just do it, and then the team in the US came in and asked why one monitoring thing was broken. Nothing with the cluster was broken; the response was just, "Oh, I decided to switch to PCS." Thanks for telling us. Really, that's exactly what you want. No news was good news, and it's been a really great thing.
That's why PCS is helping us focus on what we need to do next. For our HPC side, one of the things now is that they control our Slurm scheduler. They know where the capacity is. They know how to play multi-region and how to get more creative about finding and unlocking more capacity, especially because we're using primarily P-series NVIDIA instances, which are on the harder end of the spectrum of nodes to get. If you want to know more about Shell's HPC journey, there's a session tomorrow at MGM where I'll talk a bit more about our overall journey.
With that, I want to give a big thank you to Ian, to the whole PCS team, and to my team that made this possible. A tremendous amount of work has gone into this journey since 2017, but even more recently in getting to PCS. I'm really excited to be here and to share where we've been. Ian, thank you, and with that, back to you.
Beyond PCS: AWS Batch, Large Behavioral Models, and the Future of HPC Innovation
Huge thank you to Michael and Satoshi for sharing a little bit about their journey on AWS from on-premises to AWS ParallelCluster to AWS PCS. If you yourself want to find out more details about PCS, please take a screenshot of this and look at the QR codes. We've got some getting started documents as well as further details on it. But I also couldn't leave this presentation without at least giving you a little teaser on AWS Batch and some other things we've been cooking up.
AWS Batch falls into our orchestration portfolio. You heard Michael talk about how he doesn't have containerized workloads, so he wasn't interested in it. Well, if you do have containerized workloads and you're running on ECS, Fargate, or EKS, Batch is a great scheduler that we've created on AWS. One of the interesting areas where we see differentiation is that customers say if they want to run their containerized workload, they might consider AWS Batch. However, it is an AWS proprietary scheduler, so it doesn't have a lot of the open source ecosystem around it that Slurm does. But if you're willing to go all in on AWS APIs, AWS Batch is a great solution for you to consider for your containerized workloads.
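If you do have a containerized workload, job submission through Batch is a couple of SDK calls. Here is a minimal hedged sketch with boto3; submit_job is the standard Batch API, while the job queue and job definition names are placeholders for resources you would have registered beforehand.

```python
"""Sketch: submitting a containerized job to AWS Batch with boto3.

"sim-queue" and "seismic-image:3" are placeholder names for a job
queue and job definition created separately.
"""
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="imaging-run-042",
    jobQueue="sim-queue",             # placeholder job queue
    jobDefinition="seismic-image:3",  # placeholder job definition:revision
    containerOverrides={
        "command": ["python", "run_imaging.py", "--shots", "1024"],
        "environment": [{"name": "OMP_NUM_THREADS", "value": "16"}],
    },
)
print("Submitted job:", response["jobId"])
```

Batch then places the job on ECS, Fargate, or EKS capacity according to the compute environments attached to that queue, which is the trade-off mentioned above: a proprietary but fully AWS-native scheduling path rather than the open source Slurm ecosystem.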
One of the very interesting updates we did this past year was integrating Batch with SageMaker training jobs. Where this is really cool is that Batch allowed Toyota Research to create large behavioral models.
For those of you familiar with LLMs, large language models, LBMs are not as well known. Large behavioral models come into play when training robots to interact with the everyday world. Instead of having to go step by step with training, they create a model, update the dataset, and then allow the robot to iterate. I'm going to show you a video next that's about one minute long, and I invite you to look at the differences between the video on the left, which shows standard point-by-point training, and the one on the right, which is trained with an LBM. I especially invite you to look at how the robot struggles with the banana.
This exercise shows a robot being set to the task of setting the table. Imagine if you will that we're living in an era where someone is physically incapacitated or needs some sort of support and is looking for a robot to help them set the table. You can see here that this has been accelerated to get it through in time for this demonstration, but you can see how the robot on the left is struggling with a lot of these features that a human could very easily do—pick up the cereal, pick up the bowl, pick up the banana again, really struggling to pick up the banana. Then on the right, it's just sitting there done. It's been waiting there. The breakfast is set. Where do we go? The robot on the left is just pouring cereal all over the table.
This integration with AWS Batch and SageMaker training jobs allowed the creation of this efficient large behavioral model where Toyota's innovation was able to really accelerate. The robot is still struggling with the spoon. I'm going to let it go. What does this allow you to do now? This didn't require any code change between the left and the right. This is just using the trained model with a different dataset and allowing these robots to really generate new insights with updated visualization datasets. If you want to learn more about AWS Batch, I know this is just a brief overview, but I wanted to show you what I thought was a pretty cool demo.
Now, there's more. We made an announcement at Supercomputing a couple of weeks ago: early next year you will have access to the latest and greatest AMD Turin in our HPC8a instances. If you want to see the details, I invite you to take a snapshot of this QR code. I spent a long time in the US military as a naval intelligence officer, so this next one means a lot to me personally. We saw that our customers in the classified and government spaces specifically did not have ready access to the latest and greatest supercomputing and AI resources. We have committed up to a fifty billion dollar investment to ensure that our customers in GovCloud East, GovCloud West, Secret, and Top Secret regions have access to the most innovative hardware resources they can get to execute their mission-specific workloads.
I also couldn't leave without giving credit to my team's recognition at the Supercomputing awards: for the eighth year in a row, we were recognized as the best HPC cloud platform. AWS PCS was also recognized as one of the top five new technologies to watch in all of HPC. If you're here for the rest of the week, I invite you to check out the ton of HPC sessions we have. We've got breakout sessions, working sessions, and a few more of these keynotes. Please take a look at that QR code, go to those sessions, and ask more questions.
I'll leave you with this. There's our contact information if you want to get hold of any of us; we'll be standing up here on the side. I invite you to please complete your survey. It's super important to us; that's how we get better. We want to hear directly from you. Was this valuable? Did I talk too long? Was the content really insightful? That's the type of feedback we're always looking for, so you can tell us directly how we're doing and how we can make this interactive for you. Did you enjoy the session? Was it weird? Anything you want to put in there. This is my first session like this, and I thought it worked pretty well, but I'd love to hear from you all. Thank you so much for your time, and I hope to see more of you.
This article is entirely auto-generated using Amazon Bedrock.