Kazuya

AWS re:Invent 2025 - Accelerating Engineering: Cross-Industry HPC Cloud Transformations (CMP302)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Accelerating Engineering: Cross-Industry HPC Cloud Transformations (CMP302)

In this video, AWS presents HPC market trends showing 23.5% growth in 2024, with cloud adoption reaching $10 billion and projected to hit $23.7 billion by 2029. The session covers AWS HPC building blocks including HPC6A and HPC7A instances, the upcoming AMD Turin-based HPC8A, Graviton4 processors, and Project Rainier, which will host 500,000 Trainium2 chips for training Anthropic's Claude models. Karima Dridi from Arm details their cloud transformation journey, explaining the CloudRunner platform running 5,000 concurrent jobs with 90% Spot savings and the Cloud Foundation Platform using the LSF scheduler with FSx for NetApp ONTAP, and demonstrating how Graviton instances outperform competitors for EDA workloads. Taylor Gowan from DTN describes their complete migration from on-premises to AWS, running global weather forecasts hourly and generating 20TB of data daily, and implementing production AI NWP models such as NVIDIA FourCastNet on G6E instances for under $9 per day, accurately predicting Hurricane Melissa's landfall three days in advance.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Accelerating Engineering with HPC on AWS Across Industries

Good afternoon. If you don't mind putting on your headsets, we're going to watch the premiere of Avatar 3. Actually, no, we'll be having a session today, CMP302, about accelerating engineering using HPC on AWS across industries. Before we get started, I want to remind you that toward the end of the presentation there will be QR codes on the screen, and I encourage you to take a picture of them. They list all the sessions covering HPC throughout re:Invent this year, and we encourage you to go and attend them.

Thumbnail 70

The second thing that I wanted to say is we really, really appreciate everybody filling out the surveys at the end. That is how we can get better year on year and give you the best content and the best experience when you come back next year at re:Invent 2026. So before we dive into the presentation, I wanted to introduce two additional speakers that will be on stage with me a little bit later in this hour. They're sitting up here up front. The first co-speaker is Karima Dridi from Arm. She is the Vice President of Productivity Engineering at Arm. She leads global teams that run Arm's large-scale design and verification flows, including adopting AI and cloud-based EDA to help Arm's CPU and Neoverse platforms reach customers faster. She will be the first speaker to come on stage after my 20-minute discussion.

Thumbnail 130

The second speaker who will join me is Taylor Gowan. She is the Modeling and Engineering Team Lead for atmospheric science at DTN. She leads a team of scientists who build AI-based and physics-based numerical weather prediction systems on AWS, powering DTN's global forecasts and weather intelligence for customers around the world. So today's topic is accelerating engineering cross-industry cloud transformations. Let's take a look at the agenda. I'm going to start by giving you some key market trend updates, and if you attended Supercomputing in St. Louis two weeks ago, some of this will be a repeat of what you heard in the Hyperion Research discussions.

Thumbnail 170

I'll dive into cloud adoption in HPC. We'll talk about the AWS HPC building blocks. We'll talk about cross-industry use cases to make all of this a little bit more real, and Karima and Taylor will talk about the journey of Arm and the journey of DTN as they were leveraging AWS for their HPC needs. I'll give you a few comments at the end for recap and remarks, and we should be able to close within an hour.

Thumbnail 180

Market Trends: Explosive Growth in HPC and AI Infrastructure

So let's start with a few market trends. As you can imagine, HPC and AI as an overall market is growing quite a bit. The number you see here is the 2024 growth: according to Hyperion Research, the market grew 23.5%. Historically, over the last 20 years, growth rates were about 7 to 8% a year, so 2024 grew at roughly three times the historical rate. The overall market size for HPC is now close to $60 billion. And a little spoiler alert: cloud is about $10 billion of that $60 billion. You may think, well, of course, last year AI grew gangbusters, so really that's the effect of AI, and you're not completely wrong.

Thumbnail 220

Thumbnail 230

But what I do want to point out is that if we take AI out of the growth of the HPC and AI market, the HPC portion, traditional simulation, calculation, and scientific innovation, still grew 8.4%. And compared to the historical growth of 6 to 8%, that's still faster than we've seen in the past. So traditional HPC workloads, compounded with the AI enhancements that we all see and implement day to day, have been driving this growth of over 20% year on year. AI infrastructure growth, as I said, was fantastic: that portion grew 166%.

Thumbnail 270

So now let's double-click into cloud adoption, because we're here to talk about cloud, right? Cloud adoption in 2024, as I said, was about $10 billion, and we expect that adoption to continue growing. If you look at the prediction for 2029, the belief is that it will reach as high as $23.7 billion, which would account for about 20% of the overall HPC and AI market. So if you do the math, if cloud was roughly 15% last year and it's going to be 20% of the overall market, that means cloud is growing faster than on-premises. Not really a surprise, and we'll talk about the reasons we see that trend.

Thumbnail 320

Five Key Drivers Behind Cloud's Faster Growth Than On-Premise HPC

Here are five reasons that explain why cloud is growing faster than on-premises. Reason number one: some workloads cannot run on-premises. Jobs get blocked by capacity limits. You want to run a simulation, design a part for a car, or go after the next scientific discovery, but the queue is clogged and you can't run it on-premises, so cloud can really help in that situation.

The second reason is access to the latest GPUs. If you watched or attended the keynote with Matt Garman, our CEO, you saw that we are already launching GB300. We're launching Trainium3, and Trainium4 has already been announced and will launch later. All those technologies are moving fast, and in the cloud you have access to them as soon as they're available. You don't have to build a data center, allocate power, and try to bring all of this to bear in record time.

Number three is running at larger scale. We all think we've designed the right environment and the right infrastructure for what we need today. We put in a little buffer for tomorrow and the day after, but we always end up needing more. More infrastructure and more access to technology help you design and build more products for your customers, and cloud elasticity is what makes it possible. If you were using 10 instances yesterday and tomorrow you want twice or 10 times that number, you can do that in a cloud environment.

Number four, time to results and queue times can be shortened. It's just a lever. You want bigger infrastructure, you can get things done faster. You may have to pay a little bit more, but at least you have that flexibility. If you have a looming deadline to go finish a design and you need to go pile it up high so you finish on time, you can make it happen in a cloud environment.

The last one is collaboration globally and sustainability. I encourage you to go on the AWS website. There is a page that talks about our efficiency for our data centers. PUE is amazing considering the scale that we're at, and then collaboration globally. Anybody in the world can access AWS instances, and if you have your data in an AWS environment, everybody technically under your account can have access to it.

Thumbnail 460

If I had to summarize it in one diagram, this is it. On the left you have a wheel with multiple steps to go from idea to data. On the right, with HPC run in the cloud, you can go a lot faster: four steps, you start with an idea, and before you know it you already have results you can analyze. That's the flexibility and the power of cloud.

Thumbnail 490

AWS HPC Building Blocks: Purpose-Built Infrastructure and Storage Solutions

So let's dive a bit into the building blocks on AWS. We have built purpose-built building blocks for HPC. You see them here in three sections, from the bottom to the top. At the bottom you have the infrastructure components: HPC-optimized compute instances, accelerators, network fabric, and storage. Then you get to the next layer, which is how you consume. You can consume on-demand, you can create savings plans if you want to commit to longer-term utilization, you can access Spot instances, and recently we also launched capacity blocks, which let you plan future capacity usage and have us reserve it for you.

On top of that we've got a set of services, and I'll point out a few of them: PCS (Parallel Computing Service), ParallelCluster, Batch, FSx for Lustre, and RES (Research and Engineering Studio). Those services are available as soon as you build up your environment, and they help you build and manage that environment for all of your HPC users.

Thumbnail 560

Let's look briefly at the instances that are optimized for HPC. On the compute side you have HPC6A, HPC7A, HPC6ID, and HPC7G. A little cheat sheet: the number is the generation, so 6 is two generations ago, 7 is current, and 8 is the one coming next. The letter at the end is the technology: A stands for AMD, I for Intel, and G for Graviton. On the right-hand side you see all the accelerated computing technologies we have from NVIDIA, from the L4 all the way to the GB200, and again, if you heard the keynote with Matt Garman this morning, the GB300 is being launched already.

Thumbnail 610

Now, this is just for HPC today. I do have one slide letting you know that pretty soon, I cannot tell you exactly when, but in 2026, we will have an HPC8A, which will be Turin-based, the 5th-generation AMD EPYC processor, available to all of you. Get excited, because it brings better performance, and the cost performance is going to unlock the next tranche of workloads and improve your ability to save money and go faster on your simulations and designs.

Thumbnail 650

Okay, let's take a look at the rest. Those were optimized for HPC, but we also have a broad portfolio. I'm not going to go through every single instance here; you can find them online. But I do want to point out that we at AWS also design our own chips. Graviton is our processor based on Arm. Trainium and Inferentia are accelerators, and while Trainium sounds like a training chip, it's also great for inference. So those chips are available, and you can have access to them.

Thumbnail 680

If I double-click into Trainium, you may or may not know that we have decided to build a massive set of data centers under a project called Rainier, which will host 500,000 Trainium2 chips used to train the next generation of Anthropic's Claude models. So if in your mind you were thinking, well, Trainium seems to be interesting silicon, but does it really scale? We're about to demonstrate to the world that yes, it does scale, and it does bring the next generation of large language models to life.

Thumbnail 720

Thumbnail 740

Graviton4: I'd be remiss not to mention that we've continued to innovate with our 4th generation of Graviton chip, based on the Neoverse V2 core, with a lot of cache, seven chiplets, 12 DDR5 channels, and 96 lanes of PCIe Gen5. Okay, so compute is great, but compute without storage is useless. In your environment you now have all the compute flexibility you want. On the storage side, you have object, file, and block. Object is S3, which is the most popular, lowest-cost solution for our customers.

When you get into the file portion of the portfolio, we have EFS, FSx for Windows File Server, FSx for NetApp ONTAP, FSx for Lustre, which is probably the most commonly used high-speed storage solution for HPC and AI customers, and FSx for OpenZFS. And if you need block storage, you have Amazon EBS. All of these connect to the compute instances through our EFA fabric, and you can mix and match; you can also do some clever movement of data between object and file depending on what you're looking to accomplish in your environment.
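As a hedged illustration of that object-to-file movement, here is a minimal boto3 sketch that links an existing FSx for Lustre file system to an S3 bucket through a data repository association. The file system ID, bucket name, paths, and policies are placeholders for illustration, not values from the session.

```python
import boto3

# Sketch: link an FSx for Lustre file system to an S3 bucket so objects
# appear as files in the Lustre namespace and new files can be exported
# back to S3. All identifiers below are placeholders.
fsx = boto3.client("fsx", region_name="us-east-1")

response = fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",        # placeholder Lustre file system
    FileSystemPath="/scratch/input",             # path inside the Lustre namespace
    DataRepositoryPath="s3://example-hpc-data",  # placeholder S3 bucket
    BatchImportMetaDataOnCreate=True,            # import existing object metadata
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
print(response["Association"]["AssociationId"])
```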

Thumbnail 800

Orchestration Services and Quantum Computing: From PCS to Braket

So if you remember the layer cake I showed you, there are three layers: the infrastructure, the consumption model, and, now that we've talked about storage, the orchestration layer on top. I'll only talk about three services here. I know there were more on the previous slide, but I'll cover PCS, Batch, and ParallelCluster. PCS is a fully managed service based on Slurm that makes it easier and more intuitive to build your environment if you don't want to go all-CLI and build it yourself, or if you're managing teams that don't have the in-depth knowledge; the environment is already preset, and I encourage you to try it. We launched it about a year and a half ago, and it's now fully featured with all the improvements we've gotten from the initial customers.

The second one is Batch. Batch is a computing service that schedules and runs containerized workloads. It has a full range of compute offerings and could be the right solution for you if you're running containers. The last one is ParallelCluster. That's our legacy service, and legacy doesn't mean it won't still be heavily used by a lot of you. It's open source, and it's what most of our customers are using for HPC today.
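To make the Batch idea concrete, here is a minimal sketch, assuming boto3, of submitting a containerized job to an existing Batch job queue. The queue name, job definition, and command are hypothetical placeholders, not resources referenced in the session.

```python
import boto3

# Minimal sketch: submit a containerized HPC job to an existing AWS Batch
# job queue. Queue, job definition, and command below are placeholders.
batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="cfd-smoke-test",
    jobQueue="hpc-spot-queue",               # placeholder queue
    jobDefinition="openfoam-solver:3",       # placeholder job definition
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "16"},
            {"type": "MEMORY", "value": "32768"},   # MiB
        ],
        "command": ["./run_case.sh", "--case", "motorbike"],
    },
)
print("Submitted job:", response["jobId"])
```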

Thumbnail 880

I would be remiss not to talk about quantum. And this may seem weird because this is not a quantum talk, but we see more and more HPC partners, customers, researchers think about quantum and at AWS we have made an effort to continue to bring technology to our customers and partners, and we have this service called Braket that uses a lot of our partners' quantum computers.

It makes them available to our AWS customers. So now in one single environment in your AWS account, you can have access to HPC environments but also quantum environments, and you can do mix and match. By the way, the world is still trying to figure out where to fit quantum in an overall workflow. But today we're looking at at least three modalities that we have made available: ion trap, superconducting, and neutral atoms. If you're interested as an HPC customer, you even have access to credits to go run some quantum experiments, and feel free to talk to me after the presentation today if you want to know a little bit more about it.
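For a sense of what "access in the same AWS account" looks like, here is a minimal sketch using the Amazon Braket Python SDK: a small circuit sent to the managed SV1 simulator, where swapping the device ARN is how you would target a partner QPU instead. The shot count and workflow are illustrative only.

```python
from braket.aws import AwsDevice
from braket.circuits import Circuit

# Sketch: build a two-qubit Bell-state circuit and run it on the managed
# SV1 state-vector simulator. Pointing the ARN at a partner QPU (ion trap,
# superconducting, or neutral atom) is the main change needed.
bell = Circuit().h(0).cnot(0, 1)

device = AwsDevice("arn:aws:braket:::device/quantum-simulator/amazon/sv1")
task = device.run(bell, shots=1000)

print(task.result().measurement_counts)
```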

Thumbnail 950

Thumbnail 970

Cross-Industry Applications: From CAE and EDA to Weather Modeling

Okay, I'm going to go a little bit faster through the rest of the slides because I want my co-speakers to also have plenty of time to talk about their journey. But all this technology is great. It's theory, it's out there, you can click it, but a lot of customers and partners have used that in the industry, and I'm going to go one after the other. The first one I'm going to talk about is CAE and EDA for structural analysis, finite element analysis, IC design, timing and signal integrity. You recognize a lot of the logos at the bottom.

Thumbnail 990

Thumbnail 1010

The second area where we've seen a lot of AWS utilization in an HPC framework is healthcare and life sciences: genomic sequencing, drug discovery, personalized medicine, epidemiological modeling. That's a tongue twister, but you recognize the majority of the names at the bottom, and they're using AWS for their HPC workloads in healthcare and life sciences. Then energy: power grid simulation, nuclear reactor design, oil and gas, renewable energy system modeling, all logos that you recognize.

Thumbnail 1030

Thumbnail 1040

Thumbnail 1050

The next one is financial services. A lot of financial services platforms are running on AWS, and so for HPC workloads, risk analysis, fraud detection, portfolio optimization, market simulation, and forecasting, all those are being used. Automotive. Computational fluid dynamics, crash and safety, advanced driver assistance, autonomous driving algorithms, all those logos that you recognize. And the last ones, university and academia, where you see high energy physics, astrophysical modeling, computational chemistry, social science and analytics.

Thumbnail 1070

Thumbnail 1080

Actually, I missed one. I missed that one: weather. Global climate modeling, weather forecasting, hurricane and storm surge, air quality modeling. I recognize the DTN logo in there. Okay, as a slight primer for our next two speakers, I'm just going to cover briefly some of the challenges that are being resolved in a cloud environment for EDA. I'm going to skip this and just go to the challenges themselves.

Thumbnail 1100

Thumbnail 1110

Thumbnail 1120

Why would EDA run best on AWS? It gives you the ability to innovate faster, collaborate faster and better, reduce risk, and reduce cost. Weather, giving you again just a little primer as to why you would want to run weather modeling on AWS in the cloud. Look at these benefits: deploying production weather and climate forecasting faster, collaborating better, reducing risk, and reducing cost. So you can see a theme here. You can do a whole lot in the cloud on AWS, but this again is only theory and names.

Arm's Cloud Journey: Building Products with AWS HPC Infrastructure

So it is my pleasure to bring Karima to the stage so she can walk us through her journey at Arm, accelerating engineering on AWS. Hello everyone. My name is Karima Dridi. I'm very excited to be on stage today. I'm leading productivity engineering at Arm, and my team is looking after compute strategy and enablement for engineering. Today, I will take you into Arm's cloud journey, a transformation that is shaping our business and our industry.

Thumbnail 1200

I will speak first about our business, then I will take you through our cloud journey and how we use the cloud to build our products, and last but not least, I will speak about the optimization and efficiency work that enables the best usage of the cloud. So first of all, Arm's global compute footprint is unmatched, powering nearly every connected device and enabling innovation across industries.

100% of the connected global population relies on Arm, and over 325 billion Arm-based chips have been shipped to date by our ecosystem. We also have 22 million software developers building on Arm, so Arm definitely has the scale, the power efficiency, and the performance to accelerate AI everywhere.

Thumbnail 1240

Thumbnail 1270

In fact, on the infrastructure side, we've come a long way since the launch of our Neoverse brand seven years ago for cloud and data centers. We have shipped over 1 billion Neoverse cores into data centers and are currently tracking toward 50% of unit shipments into hyperscalers in 2025. We owe this success to technology leadership as well as the strength of our ecosystem. Working together, we have enabled 70,000 companies to adopt Arm in the cloud.

Thumbnail 1310

Fifteen years ago, it was mainly Arm building support, moving software to Arm, and making investments in the infrastructure space. Today our entire ecosystem, including all the hyperscalers, is actively investing to ensure support. The question is no longer "does it support Arm?" but "does it support Arm first and deliver a seamless user and developer experience?" Our strategic partnership with AWS started back in 2018. It is about a shared vision fueled by the most efficient and scalable compute, as well as a joint value proposition with Neoverse and Graviton.

Thumbnail 1330

Thumbnail 1340

Now I will speak about how we actually use the cloud in our environment to build our products. As we look at our current business landscape, we have four interconnected challenges that feed directly into our strategy and priorities. The first is industry pressure, driven by increased complexity, relentless time to market, and industry dynamics. This shapes how we plan, execute, and deliver our technology.

The second is the explosion in demand for compute, driven by AI workloads and also the need for sustainability. We have definitely seen a shift from compute being a cost center to compute being a strategic enabler, one that must be power efficient and high performance. The third element is the strategic goals across our business. Our business units have their own priorities, yet they share common goals around scalable compute, speed of innovation, and the need for transformation.

Thumbnail 1440

Last but not least is transformation: to be able to move, we need to adopt cloud agility in everything we do. So our strategy is about bringing these four pieces of the puzzle into one cohesive, forward-looking plan. Our product development cycle can take at least a year or so, depending on the complexity and scope of the product. It is split into two main phases, front end and back end, and each of these phases has different needs in terms of compute and storage.

Front end is about massive regressions, smaller memory footprints, and single-threaded jobs, while back end is about much longer, multi-threaded jobs requiring significant memory. So we leverage different types of instances: workload placement is critical to enable faster project iteration and efficiency, and right-sizing instances to fit those workload needs is just as important. In summary, cloud agility is a cornerstone of Arm's strategy and product development.

Thumbnail 1500

But it is not only about workload placement. A cloud native approach requires us to tune our infrastructure to unlock the full potential of computing in the cloud.

Cloud Native Platforms and Optimization: CloudRunner, CFP, and Cost Efficiency at Arm

So we build platforms that enable us to run different jobs and leverage the power of the cloud. These platforms were built with four elements taken into account. The first one is fit-for-purpose platform, which is about matching the right platform with the right job with efficient data and storage management. Automation plays a crucial role in complexity management. The goal is to increase reliability, reduce costs, and remove human bottlenecks.

Third is observability. It is providing real-time insights for failure analysis, cost control, and fault tolerance. The goal is really to keep the resource usage optimal. Last but not least, continuous integration and continuous delivery, which is about streamlining application development and releasing features fast and reliably. So in summary, it is not only where the apps are running, but how they are optimized and built that unlocks the full potential of computing.

Thumbnail 1590

I will now zoom in on two platforms that we built with this cloud native approach. The first is CloudRunner, which is optimized mainly for front-end jobs. CloudRunner operates like a factory. It leverages AWS Batch as its scheduler, and Arm and AWS worked hand in hand to improve AWS Batch into the leading service it is today. CloudRunner in a few numbers: 5,000 concurrent jobs running in parallel, 100 projects executing, and 1,000 engineers working on those projects. An environment can be ready within one hour, and we can support security and export control requirements.

On top of that, it leverages Spot, benefiting from 90% savings because these jobs can be interrupted, and we also achieve a 70% cost reduction in storage because these jobs run without shared storage; they don't require it. This is what a factory means for the front end, and it is enabling us to increase engineering productivity.
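As a hedged sketch of the pattern described here, and not Arm's actual CloudRunner configuration, the snippet below creates a managed AWS Batch compute environment that scales on Spot capacity for interruptible front-end jobs. Subnets, security groups, instance families, and the instance role are placeholders.

```python
import boto3

# Sketch of the pattern only (not Arm's CloudRunner setup): a managed AWS
# Batch compute environment drawing on Spot capacity for interruptible jobs.
# Network IDs, role names, and instance families are placeholders.
batch = boto3.client("batch", region_name="us-west-2")

batch.create_compute_environment(
    computeEnvironmentName="frontend-spot-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 20000,                       # headroom for thousands of jobs
        "instanceTypes": ["c7g", "m7g", "r7g"],  # Graviton families as an example
        "subnets": ["subnet-aaaa1111"],          # placeholder subnet
        "securityGroupIds": ["sg-bbbb2222"],     # placeholder security group
        "instanceRole": "ecsInstanceRole",       # placeholder instance profile
    },
    # serviceRole omitted: Batch can use its service-linked role
)
```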

Thumbnail 1670

The Cloud Foundation Platform, our CFP, is optimized for back-end jobs, and it redefines how we see the cloud by making its functionality completely equivalent to an on-premises HPC environment. It uses LSF as the scheduler with FSx for NetApp ONTAP as the storage system on top, so it operates exactly like on-premises. User identification uses LDAP, access is controlled with Unix permissions, users can connect back to on-premises to access Git repositories, and jobs can run in batch or interactively using LSF commands.

The main difference here is the elasticity of the cloud: we no longer need to reserve 100 terabytes of storage from day one, for instance. We can start with five terabytes and scale up as needed. So this is a hybrid environment that lets us benefit from the security and performance of on-premises HPC combined with the elasticity of the cloud.
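Since CFP is described as operating just like an on-premises LSF farm, here is a minimal sketch of what a back-end submission could look like, wrapped in Python for consistency with the other examples. The queue name, resource string, and command are placeholders, not Arm's actual flow.

```python
import subprocess

# Sketch: submit a back-end style job through LSF's bsub, the same way it
# would be done on-premises. Queue, resources, and command are placeholders.
cmd = [
    "bsub",
    "-q", "backend",              # placeholder queue
    "-n", "16",                   # slots for a multi-threaded back-end job
    "-R", "rusage[mem=64000]",    # request roughly 64 GB of memory
    "-o", "pnr_%J.out",           # %J expands to the LSF job ID
    "./run_pnr.sh", "block_top",
]
subprocess.run(cmd, check=True)
```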

Thumbnail 1760

So AWS Cloud provides the flexibility needed to run our jobs, addressing both vCPU peaks and storage capacity. It is not only a technology shift; it is a transformation of how we build, deliver, and execute our projects. The number of projects we can now run in parallel has shifted us from limited to unlimited.

Thumbnail 1790

You can see here that we used to have 4,000 slots on-premises for four months. Now we can have five times that within one month.

Thumbnail 1820

This is a good example from front-end verification where we can see the elasticity of the cloud. We can scale up to burst to our peak, then scale down to reduce costs. We also see that we can get up to 66% added throughput once we move to the cloud. And the number of projects we can run in parallel has increased: in the space of four years, we moved from 15 products to 25 products as of today.

Thumbnail 1840

Thumbnail 1870

I spoke about cloud agility and how it is a cornerstone of how we build our products, but the optimization part is critical too: using cloud resources in the most efficient way is just as important. I will zoom into three areas: savings plans, tooling, and storage. There are different purchasing options to enable cost savings for the products we are building. Spot can be used for interruptible jobs, instance savings plans are mainly for predictable jobs, and compute savings plans are mainly for evolving workloads.

Thumbnail 1920

This is an example of how we stack these savings models: starting with savings plans, complemented by on-demand to secure the slots, and then moving any job that can be interrupted to Spot. That's very important, and this is how we do it on AWS. You can see that we use a mixed and balanced set of purchasing options; a toy model of this stacking is sketched below. It is also very important to monitor how we use these slots. This is an example from one region, Oregon: we saw capacity increasing massively with no coverage, so we decided to move some of the jobs to Spot to save costs. This matters because it is all about making timely decisions in this space to enable efficiency and to run the jobs in the most optimal way.
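The toy model below illustrates the stacking idea only. All rates and hours are made up for illustration; they are not AWS pricing and not Arm's numbers.

```python
# Toy model of the stacking described above: a savings-plan-covered baseline,
# on-demand for the uncovered remainder, and Spot for interruptible jobs.
# All rates and hours are illustrative placeholders.
on_demand_rate = 1.00          # $/vCPU-hour (illustrative)
savings_plan_rate = 0.60       # committed rate, e.g. ~40% off on-demand
spot_rate = 0.30               # e.g. ~70% off on-demand for interruptible work

baseline_hours = 100_000       # steady, predictable load -> savings plan
peak_hours = 40_000            # bursty, must-run-now load -> on-demand
interruptible_hours = 60_000   # re-runnable regressions -> Spot

blended = (baseline_hours * savings_plan_rate
           + peak_hours * on_demand_rate
           + interruptible_hours * spot_rate)
all_on_demand = (baseline_hours + peak_hours + interruptible_hours) * on_demand_rate

print(f"blended cost:  ${blended:,.0f}")
print(f"all on-demand: ${all_on_demand:,.0f}")
print(f"savings:       {1 - blended / all_on_demand:.0%}")
```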

Thumbnail 1960

Electronic design automation, as my colleagues mentioned, is critical to our development, yet the tools can have different behaviors and different needs, and that directly impacts the performance of the jobs we run, even for the same tool. It is very important to look at characteristics like runtime as well as kernel tuning to get the best out of these tools, and the same applies to your own tooling. Understanding how these tools behave is essential to optimizing both cost and runtime. As you can see, this is for front-end jobs, and with the variety of EC2 instance choices that AWS provides, it is critical to match the tooling to the right instances.

Thumbnail 2020

Workloads can be very spiky, and read-write performance can vary dynamically, so it is important here as well to monitor usage and how the workload behaves, especially for EDA workloads, which can have very different characteristics: front end can be limited by metadata operations, while back-end activity can be limited by data throughput. So it is important to have elasticity in the storage system. FSx for NetApp ONTAP enables that scalability and provides very compelling results for this type of workload.

Thumbnail 2070

Graviton's impact on front-end verification shows that it can be very strong both in performance per dollar and in runtime. You can see how it outperforms AMD and Intel; this compares Graviton3 and Graviton4 against instances based on other providers' processors. It is very compelling, particularly for our front-end verification, where it provides these great results.

Thumbnail 2110

In summary, the cloud native approach is critical in enabling us to deliver products in the most efficient way. Efficiency matters: profiling and benchmarking the tooling play a crucial role in how we manage complexity and optimize both cost and runtime. And Graviton instances provide very compelling advantages for our environment and our workloads.

Our cloud journey started before 2020, and it is still a work in progress. We keep reinventing the way we do things and optimizing our resources to unlock the full potential of computing. Thank you.

Thumbnail 2160

DTN's Transformation: From On-Premises to Fully Cloud-Based Weather Forecasting

All right, I'm Taylor Gowan. I am the team lead of the Atmospheric Modeling and Engineering group at DTN. At DTN, our mission is to provide decision-grade weather intelligence to customers that have a material business risk from weather. This includes agriculture, aviation, utilities, energy trading, offshore, and much more. Today I'm going to tell you the story of how DTN went from an on-premises computing company to a fully cloud-based system driven by HPC on AWS.

We're also going to talk about how our forecast system is fully integrated with HPC and how we're pushing the boundaries of numerical weather prediction with the advent of AI NWP, which we're running in production on AWS. But first, a quick pulse check before we get into the presentation. How many of you, by a show of hands, are familiar with numerical weather prediction? Okay. Now how many of you have a weather app on your phone? Okay, so that's everyone.

Thumbnail 2240

Okay, so that app shows output from a numerical weather prediction model. So whether you knew it or not, you are all familiar with numerical weather prediction. What a lot of you probably don't know is that increases in numerical weather prediction skill have been directly tied to increases in HPC capability and efficiency. With every new iteration of HPC, we're able to increase our computational power. This is allowing us to move from coarse-resolution global forecast models to convection-permitting scales and sub-kilometer forecasts.

We're able to assimilate more high-quality observations. These are billions of observations from things like satellites, aircraft, buoys, covering the globe in places where we don't have traditional observations available. And then finally, of course, it is giving us a better understanding of our atmosphere, so we're getting better physical parameterizations, we're getting better data assimilation, and we're understanding the processes that drive our weather every day.

Thumbnail 2300

So, as I said, I'm going to tell you the story of my company and how we use HPC on AWS. We're going to talk about where we've been, where we are now, and where we're going. For many decades we used fully on-premises HPC. That worked for us for a while, but then we needed the scaling and elasticity that HPC on AWS provides.

So in 2020, we completed our migration to AWS HPC for both our atmospheric and marine modeling workloads. In 2022, we were starting to push the boundaries of our price and performance with optimized HPC for specialized modeling, which we'll talk about more a little bit later. In 2023, we moved to driving our entire DTN forecast system with AWS HPC, and then just this year in 2025, we started running production AI NWP models in the cloud. And then looking forward to what we've got in the future, we're certainly going to be looking at new AWS offerings such as Parallel Computing Service and the new HPC8A.

Thumbnail 2360

So, some of the building blocks for our main workloads. For specialized modeling, we use ParallelCluster to run instances like HPC6A and HPC7A, we use FSx for Lustre for our file I/O, and we store our customer output in S3 buckets and then move it to Glacier as needed. Our DTN forecast system is fully containerized and also runs on FSx for Lustre and S3, and you'll see the same common themes again in production AI NWP.
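As a hedged sketch of the "store in S3, move to Glacier as needed" step, the snippet below sets a lifecycle rule on a bucket so aged output transitions to Glacier automatically. The bucket name, prefix, and day counts are placeholders, not DTN's configuration.

```python
import boto3

# Sketch: age forecast output out of S3 Standard into Glacier with a
# lifecycle rule. Bucket, prefix, and thresholds are placeholders.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-forecast-output",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-forecasts",
                "Filter": {"Prefix": "wrf-output/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete after a year (illustrative)
            }
        ]
    },
)
```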

Thumbnail 2400

We run AWS Batch with GPU-enabled EC2 instances such as the G6Es, and we store that output in S3 as well.

Okay, so the first workflow I'm going to talk to you about is our optimized HPC for specialized modeling. We run two specialized models in-house. They are computational fluid dynamics models: the Weather Research and Forecasting model, or WRF, and the Model for Prediction Across Scales, or MPAS. We run these predominantly for utility customers who are trying to mitigate their wildfire risk and have to make decisions that affect their customers, like public safety power shutoffs. This has big implications not only for their business but for the livelihoods of their customers.

Here's an idea of a workflow we have within HPC; this is for our WRF modeling. Everything lives in a single AWS region and VPC. We have a NAT gateway in a public subnet that handles egress, and within a private subnet, ParallelCluster handles our HPC6A and HPC7A jobs. We use FSx for Lustre here, as I said, and we store all of our data in client-specific buckets.

Thumbnail 2480

Over the years, we have worked with AWS to optimize our price for performance, especially for these customized models. One of our biggest successes was working in 2022 with the AWS HPC experts on a test case for Category 4 Hurricane Laura. We tested scaling for MPAS, the model I just mentioned, on a simulation of Hurricane Laura. We used a 15-kilometer global mesh covering the entire world and a refined 3-kilometer grid over the Gulf of Mexico. We ran tests scaling up from 32 to 128 instances and found that the sweet spot for price per performance was 64 instances. This taught us a lesson in how we can scale other workflows and how we can apply these learnings across what we do at DTN; a toy version of that trade-off is sketched below.
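This is an illustrative calculation only. The runtimes and hourly rate are invented to show how cost per simulation and wall-clock time trade off as instance count grows; they are not DTN's measured results.

```python
# Illustrative sweet-spot calculation for a strong-scaling study like the
# MPAS Hurricane Laura test: doubling instances cuts runtime, but parallel
# efficiency falls, so cost per simulation eventually climbs.
# Runtimes and the hourly rate below are made up for illustration.
hourly_rate = 2.90  # $/instance-hour (placeholder for an HPC-class instance)

# instance count -> assumed wall-clock hours (imperfect scaling)
runs = {32: 8.0, 64: 4.4, 128: 3.4}

for instances, hours in runs.items():
    cost = instances * hours * hourly_rate
    print(f"{instances:4d} instances: {hours:4.1f} h  ->  ${cost:,.0f} per simulation")

# With these made-up numbers, 64 instances nearly halves the runtime of 32
# for about 10% more cost, while 128 costs roughly 55% more for a much
# smaller additional speedup, making 64 the price/performance sweet spot.
```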

Thumbnail 2530

Next, moving on to applying our learnings, like I just said, to our global forecast system. The DTN forecast system is not just a single weather model. It's an intelligent blend of advanced models from global and regional agencies across the globe. We combine these models along with our proprietary models that we run in-house, such as those specialized custom models, as well as the AI NWP models that we'll talk about in a few minutes. Our DTN forecast system runs hourly out to 15 days. It's a marine and atmospheric coupled model, and we produce over 350 parameters every single day. This decision-grade data is then sent to not only our customers but our downstream applications at DTN that create solutions for sectors like agriculture, utilities, aviation, offshore, and much more.

Thumbnail 2600

If we take a look under the hood at the HPC driving the DTN forecast system: every single hour we run this forecast model. We take the latest available atmospheric and marine model input and blend it into our model. We compute statistics such as model bias and variance to understand each model's skill over the past month. We then apply model weighting using unsupervised machine learning to pick the models that are performing best at each grid cell for each parameter and each hour of lead time, so we're truly getting the best available forecast. Then we apply observation ramping for the first couple of hours of the forecast, because we want to take in the best, most recent data available.
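The numpy sketch below illustrates the per-grid-cell weighting idea only: each input model is weighted by the inverse of its recent error so better-performing models dominate the blend. It is a minimal conceptual example, not DTN's actual algorithm, and all values are synthetic.

```python
import numpy as np

# Minimal sketch of per-grid-cell model weighting: weight each input model
# by the inverse of its recent error so better-performing models dominate
# the blend. Conceptual only; not DTN's algorithm. Data is synthetic.
rng = np.random.default_rng(0)

n_models, ny, nx = 4, 181, 360                            # 4 models on a 1-degree grid
forecasts = rng.normal(288.0, 5.0, (n_models, ny, nx))    # e.g. 2 m temperature [K]
recent_mae = rng.uniform(0.5, 3.0, (n_models, ny, nx))    # past-month error per cell

weights = 1.0 / (recent_mae + 1e-6)        # lower error -> higher weight
weights /= weights.sum(axis=0, keepdims=True)

blended = (weights * forecasts).sum(axis=0)   # weighted blend per grid cell
print(blended.shape)                          # (181, 360)
```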

Thumbnail 2670

Finally, we go to the post-processing and derivation step. This makes sure that all of our parameters, post-blending and post-statistical-processing, are consistent and physically reasonable. We then create probabilistic outputs from that data, and that's how we get the decision-grade data that is sent to our customers every hour. This is a tremendous amount of data and a high-powered workflow that has to run every single hour, which would pose a real challenge without AWS and the elastic computing that allows us to scale up and work efficiently.

Every single hour we run this forecast. It takes about 25 minutes to run a 15-day hourly global forecast to completion. Every single day we generate over 20 terabytes of data, which is a huge amount of forecast data. That's 20,000 gridded global files that are generated every day. To accomplish this, we run over 20,000 Lambdas.

Thumbnail 2720

AI Revolution in Weather Prediction: Running Production AI NWP Models on AWS

The marine modeling alone (I talk about the atmospheric side a little more because that's my specialty) consumes 1,000 cores running on HPC6A and HPC7A and produces over 300 gigabytes of data by itself. I know some of the buzzwords this year have been AI and how it fits into every sector of technology, and numerical weather prediction is no different. We are now looking to enhance and improve our models using AI wherever we can, and the AI revolution in numerical weather prediction is truly here. It has been driven by big tech, startups, and everyone in the industry coming together to see what we can do to improve weather forecasting across the globe.

These AI numerical weather prediction models are trained on over 50 years of data. We use reanalysis data from centers around the world, such as ECMWF, the European Centre for Medium-Range Weather Forecasts, and we train the models to get an accurate representation of the atmosphere in the forecast. These AI models are not only trained on decades of data, they also run at orders-of-magnitude lower compute cost. And as you can see in the plot on the left, the AI models actually show a clear step-function jump in skill compared to their traditional physics-based counterparts.

The solid lines here show the 500-millibar geopotential height skill from ECMWF, which is kind of the gold standard of weather prediction. Around 2022 you can see dashed lines in the corresponding colors that represent the skill of ECMWF's AI model for the same parameter. So not only is it fast and efficient, it's improving in skill as well.

Thumbnail 2830

So with that said, we now run an AI NWP model in production on AWS: FourCastNet, developed by NVIDIA. It has a 0.25-degree global resolution and is trained on the European reanalysis version 5 (ERA5). We run this on AWS, it's fast and efficient, and it costs less than $9 a day, less than the cost of two coffees, or, if you made the mistake I did and got a latte from Starbucks here, less than one latte.

Thumbnail 2890

This lower compute cost allows for rapid testing and innovation within our AI NWP models, and like I said, it runs fast. A deterministic model run finishes in less than 10 minutes and then we run a 30 member ensemble, so that's essentially 30 different forecasts, and that compute time is only 15 minutes. So we've been working on this closely with our colleagues at AWS and Nvidia. We just recently published a blog post with AWS about how DTN accelerates operational NWP using Nvidia Earth-2 on AWS, and this is a schematic of our workflow.

To give you an idea of how this works, we break it down into three steps. First, data prep: an AWS Lambda function triggers our Python utilities container to format the initial conditions, the base-state conditions that kick off the model run. Then, model inference: AWS Batch deploys NVIDIA Earth-2 Studio on GPU-enabled EC2 instances, in our case the G6Es, to execute FourCastNet inference. And finally the results, which is what everybody's looking for: the Python utilities container post-processes the output to quantify forecast uncertainty and provide the latest weather data, and that information then goes into our DTN forecast system blend.
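To make the Lambda-to-Batch hand-off concrete, here is a minimal sketch of step two: a Lambda handler that submits the GPU inference step to AWS Batch once initial conditions are ready. The queue name, job definition, bucket, and environment variable names are hypothetical placeholders, not values from DTN's workflow or the blog post.

```python
import boto3

batch = boto3.client("batch")

def lambda_handler(event, context):
    """Sketch of the data-prep -> inference hand-off: once initial conditions
    are staged, submit the GPU inference step to AWS Batch. Queue, job
    definition, bucket, and variable names are placeholders."""
    init_time = event.get("init_time", "2025120100")  # forecast initialization time

    response = batch.submit_job(
        jobName=f"fourcastnet-{init_time}",
        jobQueue="g6e-gpu-queue",                # placeholder GPU-backed queue
        jobDefinition="earth2-inference:1",      # placeholder job definition
        containerOverrides={
            "environment": [
                {"name": "INIT_TIME", "value": init_time},
                {"name": "OUTPUT_BUCKET", "value": "example-ai-nwp-output"},
            ],
        },
    )
    return {"jobId": response["jobId"]}
```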

Thumbnail 2960

Now a success story I wanted to share about running these AI NWP models in production on HPC. This was a forecast for Hurricane Melissa earlier in October. The hurricane was very strong, hit the island of Jamaica, and unfortunately caused a great deal of loss of life and property. But I wanted to highlight that the AI NWP model we are running showed a huge amount of promise in terms of the skill it provided for this forecast.

The image on the right is a forecast issued three days in advance, and it nearly nailed the point of landfall and the area of strongest sustained winds.

One thing I do want to point out is that even though the skill of these models is really good, we're still finding ways they can be improved. For instance, with hurricanes, AI NWP models tend to under-predict intensity; it's a known bias. So we want to make sure we use these tools in addition to our physics-based models to get the best information out there, and because of the elasticity and efficiency available with HPC on AWS, we're able to do that.

Thumbnail 3040

This is the same model run for the same forecast, except here we're running our 30-member ensemble. What was really neat, again, is that this 30-member ensemble ran in less than 15 minutes, and you can see there is skill in it. The ensemble members converge, which means there's more certainty in the forecast, and it showed the correct track of the hurricane over the island of Jamaica and even the sharp right turn it made just as it was nearing landfall. So again, you want to take these models with a grain of salt, but they are really powerful and fast, we can only run them because of the resources we have with AWS HPC, and we combine them, as I said, with our physics-based models.

Thumbnail 3100

So, what's next for us? I've shown you where we were and where we are; now, where we're going. As I mentioned, we're looking into AWS Parallel Computing Service, which can be applied to most of our current workflows. It will allow us to transition off the ParallelCluster system we're using today onto the next generation of fully managed AWS platforms. We're also going to move into more advanced AI NWP. Right now we're just running an AI NWP model; our next step is to actually train one, and we will only be able to do that using high-powered GPU instances from AWS. And finally, we want to add value-added customer models to our suite: leveraging DTN customer data with pre-trained models and applying them to customers who have very niche use cases like utilities, fuels, and wildfire.

Thumbnail 3180

Looking Ahead: AWS Investments and Recognition in HPC Excellence

All right, I want to thank you all for coming to the session. I know you had a lot of other sessions going on simultaneously, so I appreciate your time and attention, and I'll hand it back. Thank you, Taylor. Thank you. Fascinating. Well, everybody, I think we're in the final stretch. I wanted to do a quick recap. Today during the session we talked about the market trends in HPC: we see a lot of growth, continuing to consume a lot of infrastructure and drive innovation in the services offered. Cloud adoption is growing faster than on-premises, and that's not a surprise for all the reasons we covered in this talk.

AWS is continuing to build optimized HPC building blocks as well as general-purpose building blocks, so the set of options you'll have in 2026 and the years following will continue to grow. We covered a few cross-industry use cases really quickly, and then the amazing presentations from Karima and Taylor about Arm's and DTN's journeys on AWS, hopefully inspiring for some of you and sparking some ideas. And look, everything is a journey. You have so many options that you'll want to engage with AWS specialists. We're here to help.

Thumbnail 3240

I did want to bring up a couple of additional items that were just announced but haven't gotten a lot of coverage. One is that we continue to invest, and we continue to invest with all the customers we don't always focus on in mind. Announced a few days ago: we are making a $50 billion investment. I know that's a big number, but it's spread over 10 years, so it's a bit less per year, to build AI and supercomputing infrastructure for US government agencies. And for countries and agencies outside the US: if we are able to build this for US agencies, you can imagine that we'd be able to do the same around the world. That was really important for me to mention because it didn't get a lot of coverage.

Thumbnail 3280

And the second thing is that in Saint Louis two weeks ago, AWS received for the eighth year in a row the award for best HPC cloud platform. So I know it's a bit of a brag, but I didn't want to put that as the last slide. It is an independent survey that is run by HPCwire, the authority in HPC and AI publications, and this is the eighth year. We're continuing to improve, we're continuing to add value because of all of you customers on AWS. We get better every year.

So with that, remember I told you you're going to have QR codes. This QR code will point you to additional sessions. If you like this session, by the way, fill out the survey in the app, provide feedback. We get better every year thanks to all the feedback, and I just want to close by saying thank you to all of you that showed up today for this presentation. Special thanks to Karima and Taylor for joining me on stage. Enjoy the rest of the re:Invent this year, and hopefully we'll see you again next year.


; This article is entirely auto-generated using Amazon Bedrock.
