Kazuya

AWS re:Invent 2025 - Mission-Ready HPC: From NOAA Today to AI Tomorrow (WPS205)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Mission-Ready HPC: From NOAA Today to AI Tomorrow (WPS205)

In this video, AWS and NOAA explore how high-performance computing converges with AI to revolutionize weather forecasting. David Michaud from the National Weather Service details NOAA's diverse mission, spanning from solar monitoring to tsunami warnings, and its operation of two 14-petaflop supercomputers producing 14 million products daily with 99.97% on-time delivery. Rayette Toles-Abdullah demonstrates NOAA's cloud architecture using HPC7a instances, Elastic Fabric Adapter, and FSx for Lustre, highlighting the Hurricane Analysis and Forecasting System running on 399 EC2 instances. The presentation showcases AI models like GraphCast reducing forecast time from 1.5 hours on 25,000 cores to 7 minutes on a few H100 GPUs. AWS also announces a $50 billion infrastructure investment for government AI and supercomputing centers, with examples from S2 Labs and Senera demonstrating real-world HPC-AI convergence applications.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The Convergence of HPC and AI: Market Growth and Presentation Overview

Good afternoon and welcome to re:Invent. Imagine tracking a developing hurricane in the Atlantic Ocean with many lives at stake as coastal communities prepare for the impact. Behind every forecast, behind every evacuation order, behind every decision that saves lives, there is a massive amount of computational power at work. Today, along with my esteemed colleagues David Michaud from the National Oceanic and Atmospheric Administration and Rayette Toles-Abdullah from AWS, we will explore how high-performance computing is evolving from traditional supercomputers to AI-led forecasting, transforming the way in which we predict, track, and forecast dangerous weather events that impact our planet.

Thumbnail 80

Thumbnail 110

The convergence of high-performance computing and artificial intelligence is an evolutionary technology. However, it is revolutionizing how quickly and how accurately we can respond to very dangerous events and therefore save more lives. We are excited to take you through this journey with us today. Our presentation is organized as follows. First, let's talk about what the market is doing in terms of this convergence of AI and high-performance computing. According to Hyperion Research back in 2023, the HPC market was about 37 billion dollars and it grew by 24 percent in 2024. We expect the numbers to be even higher in 2025.

By 2028, one-third of the HPC market is expected to be in the cloud. By 2029, the converged AI and HPC market is expected to reach 49 billion dollars. That is huge validation of the market, driven by some of the early results we are seeing, and it follows the innovations that agencies like NOAA are developing and executing with HPC at scale. I will then invite David Michaud to the stage and he will do a deep dive into a NOAA case study. He will talk about NOAA's mission, the important work that they do, how it impacts lives, and how massive amounts of computing at scale are used at NOAA to achieve various mission outcomes.

Then Rayette Toles-Abdullah will come to the stage. As a Principal Solutions Architect at AWS, she has been leading the capability development and service development for supercomputing and other cloud opportunities at NOAA. She will do a deep dive into the architecture capabilities that power NOAA's missions. Finally, I will come back to the stage and discuss some of the new investments that AWS has committed in this space as well as provide a couple of additional examples of innovation that we are seeing in the converged HPC and AI space in the cloud by some of our other partners. With that, I would be honored to have David Michaud join us on the stage. Thank you.

Thumbnail 290

Thumbnail 300

NOAA's Mission: From Sun to Sea, Protecting Lives and Economy

Good afternoon, everyone. My name is Dave Michaud and I work for the National Weather Service. In my role at the National Weather Service, I am essentially responsible for managing the operations of our enterprise-wide networking, all the supercomputing, all the data dissemination systems, and local support for all the national centers that we have in College Park, Maryland. I have been with NOAA for about 30 years. It is my life and my passion, so I am very excited to be here today to talk to you about what we do.

Thumbnail 310

Let me start with a little bit about the mission of the National Weather Service. Often when I talk to people about the National Weather Service, they don't realize the diversity of what we do. Many people think about what they need to wear the following day or whether they need to bring an umbrella. We're much more than that. The National Weather Service has a very diverse mission. Our domain space covers anywhere from the entire Atlantic basin to the mainland US to the entire Pacific Basin in the northern hemisphere in terms of the domain and scale of what we're forecasting. The breadth is astounding. We do everything from the surface of the sun to the bottom of the sea.

We monitor the activity on the surface of the sun and forecast electromagnetic interruptions or storms that are produced through coronal mass ejections, monitoring those as they approach the Earth. We have traditional forecasting, but we also have tropical forecasting with the National Hurricane Center, watching those storms. We provide aviation weather across the entire Atlantic and Pacific basin. When you think of aviation, we're producing data that helps airlines calculate flight efficiency and how much fuel to put on planes. Every pilot needs to have a certified weather brief before they take off, and all of our data and forecasts feed those mechanisms.

Thumbnail 470

We issue tsunami warnings, monitoring seismic activity, its effects on the surface of the ocean, and the impacts those waves would have as they approach land. We have fire weather with actual meteorologists that reside and are embedded with fire chiefs at the site of forest fires and wildfires. There are many more pieces that I didn't cover, and one last piece is the Ocean Prediction Center. When you think of the Ocean Prediction Center, think of the economic impact of all the vessel traffic traveling across the ocean and these massive storms that come through. This creates the need for ship avoidance, routing vessels around the storm. We're really focused on life safety and the economy there.

Thumbnail 530

We operate across the United States, anywhere from American Samoa to Hawaii to Guam to Alaska to Puerto Rico to the mainland US. We're embedded in a large number of communities across the United States. We have 122 forecast offices, which are probably what you would traditionally recognize as a weather forecast office. We're also embedded with the FAA at air traffic control centers providing decision support. We forecast river crests, river rises and falls, and flooding. We have specialty national centers that are looking at unique niche types of weather, and we have a water center looking at hydrology. We're really geared to be out in the community with the people who need to make those decisions.

Today we're going to talk about high performance computing and the associated workloads and how we generate a large amount of forecast information that helps forecasters produce the forecast and the forecast process. But I don't want you to lose sight from a mission perspective that forecasts are only one piece of what we produce. We've really been focused over the past decade or so on making sure that we're working with decision makers in the community to understand what the impact of the weather is going to be.

Let me give you an example. I'm from Washington DC. If we have the slightest bit of snow, that wreaks havoc on traffic and that's a massive impact. Where I grew up in New Hampshire, we could have probably one foot of snow and people would still be out moving around with ease. Impact-wise, different places have different tolerances for different weather and different types of events. We're really focused on delivering the forecast and making sure that we're providing a risk assessment to those decision makers in the community that are providing life-saving direction to the community—such as evacuating for a hurricane or closing highways.

We're really focused on creating that trust relationship with the community decision makers. When things happen, we don't have to learn who we're working with. We have that trust relationship so that we can move quickly and get those warnings out.

Thumbnail 640

High Performance Computing at NOAA: Product Timeliness and Dual Supercomputer Architecture

Let's talk about high performance computing. What gets me excited about high performance computing is that it's one of those areas where any improvement in high performance computing or technology directly correlates to improvement in mission performance. We see improvements in high performance computing capacity or the way our software works and the efficiency on high performance computing that results in better advancements, more accurate forecasts, and higher integrity science going to our forecasters to allow them to perform the mission better. There's that direct correlation between improving HPC and improving the forecast, which then improves those trust relationships.

Thumbnail 700

We talked about the mission and the diversity in the mission. The way we operate within the mission provides certain constraints for the requirements and how we operate the HPC and how we think about our workloads in the HPC environment. One of the metrics we chase is product timeliness. On a daily basis we have a cycle of models that run everywhere from every one hour to every six hours. It's very time-driven and cyclical in nature. For a landfalling hurricane, there's a certain cadence where our forecasters have to interact with decision makers, and those decision makers are looking for updates. People in the public are looking for updates, and there's a certain timeliness and rhythm to that which creates a certain level of trust.

From a product timeliness perspective, we create 14 million products a day out of our supercomputer. We judge ourselves by asking whether each product was produced within 15 minutes of its delivery time on a previous day. If it is, we're good. If it's not, we mark ourselves down and reduce our timeliness numbers and statistics. We try to achieve 99.9% on-time delivery. I think we have a metric of 99%, but we regularly achieve about 99.97 to 99.98% on-time delivery on a daily basis.
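As a rough illustration of the metric described above, here is a minimal sketch of how such an on-time check could be computed. The data structures and field names are assumptions for the example, not NOAA's actual product schema.

```python
from datetime import timedelta

ON_TIME_THRESHOLD = timedelta(minutes=15)

def on_time_percentage(products_today, products_yesterday):
    """Return the percent of today's products delivered on time.

    Each argument maps a product ID to the datetime it was produced.
    A product counts as "on time" if it was produced no more than 15 minutes
    later than the same product was produced the previous day.
    (Illustrative structure only, not NOAA's actual schema.)
    """
    checked = on_time = 0
    for product_id, produced_at in products_today.items():
        baseline = products_yesterday.get(product_id)
        if baseline is None:
            continue  # no prior-day baseline to compare against
        checked += 1
        delay = produced_at - (baseline + timedelta(days=1))
        if delay <= ON_TIME_THRESHOLD:
            on_time += 1
    return 100.0 * on_time / checked if checked else 100.0

# With roughly 14 million products a day, even 99.97% on-time delivery still
# leaves on the order of 14_000_000 * 0.0003 ≈ 4,200 late products.
```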

Thumbnail 800

In terms of timeliness and the way we generate and architect our solution, we have two high performance supercomputers that are about 14 petaflops each, or about 40,000 cores per system. They run on different power grids—one on the east coast and one on the west coast. We're able to switch operations between those within a 10-minute period. I can move my entire forecast operation in 10 minutes from one system to the next with no real issue. That's the type of availability and data movement we're looking at on a regular basis with our systems.

Thumbnail 860

Workflow Complexity: Data Assimilation, Model Forecasting, and the Challenge of 80 Different Models

Let's talk about workflow. A modeling system generally has data that moves into the system. We stream on a consistent basis billions of observations a day into our supercomputer. We stream one feed into our operational primary supercomputer and have a separate parallel feed into our backup supercomputer. When the time cutoff comes to generate a forecast, we dump that data out, quality control that data, and then run what we call an analysis state.

We essentially use physical-based forecast models to look backwards at the observational data. We run the model ahead 3 hours in a forecast and recheck it with observational data. We run it ahead 3 hours and recheck it with data. We run it ahead 3 hours, then recheck it and do a final analysis state. This gets us to an atmosphere that is continuous in the way the model expects. The real world, with all of its observations, is not always continuous. Even if we had an observation at exactly every data point that our model has, we would still have to reconcile it through a data assimilation process.

Then we run a model forecast out into the future. Some forecasts run a few days into the future. Our global models run 16 days out into the future, and other models run 2 weeks. We also have a coupled climate analysis system that runs 9 months out. As the model is running, we need to take the model state out of memory at each time step and post that information to a file format that people can download and look at, and that our forecasters can examine.
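To make that cadence concrete, here is a minimal sketch of a single forecast cycle as described above: cycled 3-hour data assimilation up to the cutoff time, followed by a free forecast that posts output at every step. The model components are passed in as generic callables; they are placeholders for this sketch, not NOAA's actual workflow code.

```python
from datetime import timedelta

def run_cycle(cutoff_time, observations_between, prior_state,
              step_model, assimilate, post_output,
              forecast_length=timedelta(days=16)):
    """Sketch of one cycle: 3-hour cycled assimilation, then a free forecast.

    step_model, assimilate, post_output, and observations_between are callables
    supplied by the caller; they stand in for real NWP components.
    """
    window = timedelta(hours=3)

    # Analysis: march the model forward in 3-hour increments, reconciling the
    # state with observations at each step until the final analysis time.
    state = prior_state
    t = cutoff_time - 4 * window
    while t < cutoff_time:
        state = step_model(state, window)                                # run ahead 3 hours
        state = assimilate(state, observations_between(t, t + window))   # recheck with obs
        t += window

    # Free forecast: step forward, posting the model state at every output time
    # so forecasters and downstream users can pick it up as soon as it is written.
    lead = timedelta(0)
    while lead < forecast_length:
        state = step_model(state, window)
        lead += window
        post_output(state, valid_time=cutoff_time + lead)
    return state
```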

We have a certain amount of post-processing, whether it be statistical post-processing or adjustments to the model or adding derived fields. The interesting part of this from an overall workload perspective is that we have certain parts of the forecast model that require a large amount of cores that are massively parallel. Our global forecast model will require about 25,000 cores over a one-and-a-half hour period to operate a physical-based model. Our model analysis will be quite similar, but then we have a lot of serial post-processing that runs around these jobs.

The way that we currently construct our architectures, we have 40,000 of the same kind of HPC cores. Some will have higher memory than others, but we have a lot of processing on there that is really not HPC per se type of processing. However, it runs on our HPC because it needs to be in proximity to the data where we run it. This presents an interesting use case when you start to look at cloud and the flexibility that you can get with the various types of instances versus trying to forecast ahead when you are buying an on-premises system about exactly what the architecture should be and what exact balance should be with the multiple core workloads versus the serial workloads.

Thumbnail 1090

This is what our production suite looks like. It looks like a mess, right? There are 80 different models. We have it all scheduled, and it is a bit of a Tetris puzzle to fit everything in. You see the peaks and valleys generally 4 times a day. If you look at the bottom of the scale, that is a 24-hour period from zero Greenwich Mean Time to zero Greenwich Mean Time. Each one of those layers represents a different type of model workload that we are running that is layered on top of each other.

Where this gets interesting is that we have a timeliness metric. In general, on a daily basis, things run pretty smoothly and everything works out well. Certain days you might have a glitch in the system where your workload might get backed up. On an on-premises system when your workload gets backed up, we have to do a bit of traffic control because you can only peak so far on your on-premises system to catch up. Often to catch up, you may have to load shed certain models or certain model cycles to make that workload work.

If we were to contemplate a cloud environment, that could mean a quicker catch-up cycle because we could burst differently. These are the types of things that we are looking at when we are trying to weigh the benefits of working with an on-premises versus a cloud setup. Just as a side note from an AI perspective, there is a lot of talk about inference models which we will get to a bit later. It is interesting when you look at the inference models because it is really easy to generate these niche, bespoke AI models that have a specialty. One might be your standard global forecast model, another might be geared towards tropical forecasting, and another might be geared towards a certain domain space or hydrology, or different things.

In a production environment when you're trying to run things on a routine basis and manage the implementations that go into operating a system reliably, you could easily get yourself into a situation where you're running so many different models that it becomes a bit of a mess from a change management perspective.

Thumbnail 1260

On-Demand Workloads and Data Dissemination: Hurricane Forecasting, Surge Computing, and Billion Daily Hits

As we look at AI, it's exciting, but from a change management perspective, it's an interesting thing to think about as well. The question becomes: are we going to have 500 models to manage, or are we going to focus more on gaining integrity in a lower number of models? What I showed you in the last slide was the general layout of the modeling suite, which represents the bulk of our regular routine scheduled work. You have about 80 different models. They're all dumping data, they're all doing their initial state to the atmosphere, they're all forecasting into the future, they're all posting data, and they're all stacked on each other.

We also have some other different workloads that lend themselves well to a more on-demand elastic environment. One of those is tropical storm prediction or hurricane storm prediction. We have in our production suite right now reserved slots in our production environment where we can run up to 12 different hurricane modeling runs. Each of these yellow boxes represents the domain of that particular model, and that model moves with the storm itself while using the global model for boundary conditions as it's moving through. We can run up to 12 of these in our operational environment at any given time, four times a day.

Thumbnail 1380

The problem is that we have to reserve that compute capacity from a production perspective to be ready any time we need to run those storms. This means that space on our system cannot otherwise expand the bulk production environment that I showed in the slides before. When we need to run the models, we need to spin them up quickly. We can't wait around for the environment to spin up when we need it. It needs to be there when we need to start running it. That's another interesting piece in architecting the workload with these sorts of on-demand runs.

Thumbnail 1400

We also have development workloads, which I'd like to call surge work. Our developers are working on constant improvements in the models, but there are also cases where you might want to look back 30 years at observational data and apply that to the latest technology with the models to create what we call a reanalysis dataset. You're using the current model to fit it to a continuous atmosphere, fitting it to the grid to make it nice and neat and tidy. You're doing that going back 30 years, and you're doing four forecasts a day for 30 years moving forward, which creates a massive amount of computational workload. But we don't need to do that all the time. That's an example of a surge workload.
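A quick back-of-the-envelope calculation, using only the numbers quoted above, shows why a 30-year reanalysis is such a large surge:

```python
# Rough size of the 30-year reanalysis surge described above.
years = 30
cycles_per_day = 4
total_cycles = years * 365 * cycles_per_day
print(f"~{total_cycles:,} assimilation-and-forecast cycles")  # ~43,800 cycles, run once rather than routinely
```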

Thumbnail 1490

Another example might be taking all of that numerical weather prediction physics-based model and applying that to an AI model and training the model to do that sort of workload. That's an example of a one-time surge. On an on-premises environment, you really have to either reprioritize your workloads or you have to go get more money, take the time to revision that, get that on the floor, and then maybe six to eight months later you're able to start running your workload. That's an example of development surge workloads. Shifting gears a bit, this is newer to us: these AI workloads, in particular the inference workload. The things that we run operationally in production, we're experimenting with these, and we're in the process of getting those running in real time as we speak.

Currently, we have a GraphCast-based model that's trained with our global forecast model. There are two types of modeling systems: a deterministic system, which means you make a single model run from one initial condition, or you can run it in an ensemble, which means you take an initial condition, perturb it slightly, and then run it out into the future.

What that does is help us with uncertainty calculations. If you look at a suite of 31 different forecasts starting from slightly different initial conditions with the same model, that gives you an indicator of the type of spread in the forecast that you could have in the future. We have inference runs that run out to 16 days. A run like this on a physical-based model would take about 25,000 cores for an hour and a half. This model, with just a handful of H100 GPUs, can run in 7 minutes. It's a game changer.
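Restating the figures above in rough numbers (the GPU count is only given as "a handful", so only the wall-clock comparison is firm):

```python
# Rough comparison of the two approaches described above, using the quoted figures.
physics_core_hours = 25_000 * 1.5      # ~37,500 CPU core-hours per 16-day global run
physics_wall_minutes = 90
ai_wall_minutes = 7

print(f"Physics-based run: {physics_core_hours:,.0f} core-hours, {physics_wall_minutes} min wall clock")
print(f"AI inference run:  a few H100 GPUs, {ai_wall_minutes} min wall clock")
print(f"Wall-clock speedup: ~{physics_wall_minutes / ai_wall_minutes:.0f}x")   # ~13x faster to product
```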

Thumbnail 1600

Our global ensemble forecast is very similar in nature, and we're very close to running these in real time operationally. We're pretty excited about that and we're embarking on this mission. However, there's a word of caution from an AI perspective versus physical-based models. Are we going to go out tomorrow and just switch straight to AI models directly from observations? No, there's a codependency at least for now. We need physical-based models to train the AI models, which are statistically-based models, and then we can run those really quickly from an inference perspective.

The model runs we're talking about are performing at a very similar performance level to our physical-based models, but keep in mind that those are trained off of the physical-based model, so that's a codependence that we have. We still need significant resources to run reanalysis or large training datasets with the physical-based models to train the AI-based models, but it's an interesting context. From an operational perspective, you saw the amount of compute capacity that we needed to run our physical-based models on a regular timeframe. Now think about that with inference runs that run in 7 minutes versus an hour and a half. It really changes the nature of how we think about our high-performance computing, and even those inference runs at that scale raise the question of whether they're really true HPC or not. We're trying to figure that out.

Another interesting aspect that we're trying to figure out is the impact in terms of moving that data out to the public. I'll give you some statistics in a moment about the type of data movement that we have. You're going from outputting a large amount of data over a 1.5-hour period with a physical-based model to dropping that same amount of data in 7 minutes. From a customer perspective, they're all expecting that data to be out and available. Now you have a data streaming issue where a lot of customers are looking for the same exact data at the same time, and all of that's compressed in a 7-minute period. That's not really fun to think about from a data distribution perspective.

The other piece of this is looking at the advancements of these models. It's really fast-paced in terms of advancements, but the other issue we have is that the integrity of our data and the scientific integrity is super important. There's a large validation process that we have to go through before we implement a model. You want to look back retrospectively to understand how these perform in different seasons. If you're updating a hurricane model, for instance, you might want to run that on 90 to 100 different storms to make sure it's verifying. You need to make sure that if you're improving the troposphere or the mid-latitudes, you're not impacting the tropics, which would then impact the performance of hurricane runs and so on.

Thumbnail 1820

This pace of advancement and model improvements requires us to come to grips with adding potentially more automation and insight into the validation of those models to ensure that we have scientific integrity, and that's a challenge for us.

Let me finish up on data dissemination. We produce about 12 to 15 terabytes a day and make that available to the public. We serve out about 300 terabytes a day, so about 10 to 11 petabytes a month of data that we're streaming out. You can see that as these model implementations reach higher resolutions, it's not a steady amount. We have to plan for growth in that as well.
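As a quick sanity check on those dissemination figures, using round numbers from the talk:

```python
# Sanity check on the dissemination volumes quoted above.
served_tb_per_day = 300
served_pb_per_month = served_tb_per_day * 30 / 1000
print(f"~{served_pb_per_month:.0f} PB/month served")        # ~9 PB, roughly in line with the 10-11 PB quoted

sustained_gbps = served_tb_per_day * 1e12 * 8 / 86_400 / 1e9
print(f"~{sustained_gbps:.0f} Gbps sustained egress")       # ~28 Gbps around the clock, before surges
```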

Thumbnail 1900

We offload about 50 percent of that data distribution to content delivery services, and about 50 percent streams out from our on-premises systems from a web services or API-based perspective. We get a little over a billion hits a day on our services. About 80 percent of those we offload through content delivery services. However, our load changes based on what we have going on—weather events, hurricane events, snowstorms, tsunami events.

This is an example of a tsunami case we had recently where there was an earthquake off the coast of Russia that created a large amount of warnings and advisories for tsunamis across the basin. We typically get about 2.5 million hits a day on our web services for tsunami. In 12 hours, that surged to about a billion hits. That's a good example of how we have to be ready to surge as we move through these events.

Thumbnail 1970

Thumbnail 1980

NOAA's Cloud Journey: HPC as a Service and Real-Time Hurricane Analysis on AWS

Now I'd like to hand it off to Rayette, and we'll go through some of the more practical and technical aspects on the cloud.

My name is Rayette Toles-Abdullah. I started supporting NOAA as their solutions architect in April 2020. I was excited to become someone supporting such a wonderful mission at NOAA, helping to save life and property. I certainly took it seriously, and I'm practically a NOAA employee who sits at AWS.

To explain NOAA's journey with HPC: we started in early to late 2020 by benchmarking in the cloud some of the models that they run on-premises. We started looking at proofs of concept and actually porting their MPI applications, which proved to be the longest pole in getting this done. Once we got the benchmarking done for models such as GFS and GEFS, and got the metrics and the stringent performance information we needed about how those models run in the cloud, we were in a place where we started to talk to AWS partners, some of our force multipliers, to help grow their offerings at NOAA for their scientists and researchers.

We flattened the learning curve along the way. HPC in the cloud is a bit of a black box for some scientists and researchers, so we set up enablement sessions to help everyone understand how it works in the cloud and what it looks like. There are fewer knobs for you to turn in the cloud; many of the knobs we had been turning on-premises turned out to be unnecessary as well. We work with our HPC specialists to help enable that. I want to give a shameless plug for Aaron Boucher, the HPC specialist who helped us actually enable this.

NOAA has a specialist who helped enable what we see today at NOAA in the cloud in terms of HPC. There are two offerings at NOAA that they provide to their scientists and researchers in the cloud on AWS. They can either run self-managed HPC clusters themselves, building up the clusters, interconnecting those clusters, and having their head nodes and compute nodes as they do on premises. Or they can leverage the HPC as a Service offering that they have available.

In that case, we partnered with Parallel Works and GDIT as the prime to provide a unified, simplified interface for scientists to run their jobs. Either way, the architecture looks the same. NOAA standardized on purpose-built HPC EC2 instances. These are HPC7a instances, which were introduced at Supercomputing 2025, and HPC8a instances that will be released in 2026. They standardize on these instances. These are tightly coupled clusters, as they are on premises. Cluster placement groups can be used so that the instances in those clusters are in close physical proximity to one another in AWS.
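For readers who want to see what that looks like at the API level, here is a minimal, hedged boto3 sketch of creating a cluster placement group and launching a few hpc7a nodes into it. Real deployments would more likely use AWS ParallelCluster or similar tooling; the AMI, subnet, and security group IDs below are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Keep the cluster nodes physically close together for low-latency MPI traffic.
ec2.create_placement_group(GroupName="noaa-demo-cluster", Strategy="cluster")

# Launch a small, tightly coupled cluster of hpc7a nodes into that placement group.
# The AMI, subnet, and security group IDs are placeholders for illustration.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="hpc7a.96xlarge",
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "noaa-demo-cluster"},
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
        "InterfaceType": "efa",   # Elastic Fabric Adapter, described in the next paragraph
    }],
)
```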

Thumbnail 2120

They standardize on using Elastic Fabric Adapter, which gives you very low, microsecond-scale latency between the nodes that you have as cluster nodes, just as you see on premises. They standardize on using FSx for Lustre as their file system, which gives them high throughput from their storage system. Key to that is also the ability to integrate natively with Amazon S3 to import and export data, meaning they don't have to maintain that expensive storage consistently and persistently in the cloud. They can export the data that the models have output and do post-processing or further analysis with that data from S3, using less expensive object storage instead of keeping the FSx for Lustre storage persistent.
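The FSx for Lustre pattern described above, where the file system is hydrated from S3, exports results back, and is then discarded, might look roughly like this with boto3. The bucket name, subnet ID, and capacity are illustrative assumptions.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-2")

# Scratch Lustre file system hydrated from (and exporting back to) S3, so the
# expensive parallel file system only exists for the lifetime of the run.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=2400,          # GiB; SCRATCH_2 sizes are 1200 or multiples of 2400
    SubnetIds=["subnet-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://example-noaa-model-data/input/",
        "ExportPath": "s3://example-noaa-model-data/output/",
    },
)
fs_id = response["FileSystem"]["FileSystemId"]

# ... run the model with the cluster mounting this file system,
# then export results back to S3 ...

# Once results are in S3, the scratch file system can simply be deleted.
fsx.delete_file_system(FileSystemId=fs_id)
```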

After their runs are done, this cluster is basically terminated and they move on to the next job that needs to be run. Some of the on-premises workflow capability exists. They still use PBS for job management. They still use SLURM, or you can use AWS Step Functions and Lambda functions to handle that workflow management as well. They tend to prefer to stick with some of the tooling that they use now on premises for workflow management and job management.

Going to the HPC as a Service offering, this was key to enabling the scientists and researchers at NOAA to accelerate research to operations. They have a simplified web portal. They just go into the web portal, select how many cores they need, how much memory they need, which EC2 instance types they need, and what type of storage they need, and they run their jobs. What this does is abstract the scientists and researchers away from worrying about the configuration of clusters, how the network is set up, and how to tweak the configuration knobs in the backend.

Thumbnail 2310

Using infrastructure as code, based on what they've selected, an entire cluster gets deployed for them to run their jobs. Those jobs run at the performance capability that was talked about earlier, for example when discussing the Hurricane Analysis and Forecasting System. With this offering, there came a time when we were able to see the convergence that I'm going to talk about in a bit. The scientists were coming in and saying they need HPC clusters now and they can't wait for the on-premises queues.

Thumbnail 2430

They can't wait to do their R&D because they want to get it done now; they're trying to go operational in a particular year or month. So being able to have this capacity and capability was a big success at NOAA. We have a couple of use cases to talk about from that perspective. The biggest one, and the same one that David was talking about earlier, is the Hurricane Analysis and Forecasting System (HAFS). We've run this for a couple of years on AWS in real time during hurricane season. A 21-member ensemble would run twice a day. The output data was exported to Amazon S3 and shared with scientists so they can collaborate. They can look at that data and see where refinements are needed, and this helps them solve problems and even improve the model over time without waiting for when they can run these on-premises, or for the analysis they need after those runs.

Thumbnail 2520

We ran 399 hpc6a.48xlarge Amazon Elastic Compute Cloud (EC2) instances at that time for those twice-a-day runs. This was beneficial when they were doing a testbed and trying to analyze in real time the data that was coming out of the models. Another great case study that came up was what we call RRFS, the Rapid Refresh Forecast System. My understanding is that RRFS is slated to go operational in 2026. One of the reasons they've been able to do so is that they were able to leverage AWS to run the model. This too was a model ensemble that they ran consistently during the day. They standardized on the Lustre file system, giving them that high throughput. They standardized on the purpose-built HPC instances, and they also exported the data from those models to the AWS Open Data program so that further analysis and understanding of how the model runs, and how it can be tweaked, can be leveraged as well.
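For a sense of scale, here is the rough core count behind those real-time HAFS runs, assuming the published 96 physical cores per hpc6a.48xlarge instance:

```python
# Rough scale of the real-time HAFS ensemble described above, assuming 96
# physical cores per hpc6a.48xlarge instance (per AWS's published specs).
instances = 399
cores_per_instance = 96
print(f"~{instances * cores_per_instance:,} physical cores per cycle")  # ~38,304 cores, twice a day
```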

Thumbnail 2590

Thumbnail 2650

AI Weather Prediction: From Hours to Minutes with Single GPU Instances

They also found that it ran 15% faster than on-premises, and it was 25% better from a cost-performance standpoint. But like I said before, there was a convergence. We worked through a few years with the HPC as a Service offering and then we found something interesting. The scientists and researchers were saying they want access to Jupyter notebooks and to AI/ML services. They also need the capability to analyze the large datasets these models output. We've seen the convergence from using the big, tightly coupled HPC clusters to AI/ML services within this particular offering.

While we were enabling that, we were asking why, and David explained it earlier. We didn't know early on why, but we figured it out later: we were seeing the convergence into AI weather prediction coming. David talked earlier about the traditional numerical weather prediction models. These are physics-based. They take the data from the sensors, from the buoys, from the satellites, and they run those computations on a supercomputer on-premises to get the analysis, the forecast, and the reanalysis. AI weather prediction is slightly different in the sense that the model is trained on that numerical weather prediction output, and then inference happens with that trained model.

What is astounding to me is that from a SageMaker Jupyter notebook, in minutes instead of hours, your scientists and researchers are able to do weather forecasts. They get the output and the data that they need. This democratizes weather prediction. We have the challenges that David talked about earlier, and now we have more data coming in. I think it's a good problem to have, but we can also see how this convergence is helping us solve problems and drive other scientific advancements with AI weather prediction.

Thumbnail 2770

What I'm showing on this slide is a single GPU instance instead of a large, tightly coupled HPC cluster. David talked about the models NOAA is working on. There are other models out there, such as FourCastNetv2, Pangu-Weather, GraphCast, and Aurora, all of which can be run on AWS with a single GPU, or, if you have an ensemble like David was talking about earlier, you're essentially going to be looking at several GPUs. But the idea here is that from a Jupyter notebook, you can run an AI weather prediction in minutes instead of hours.
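As a hedged illustration of what "a forecast from a notebook" can look like, here is a generic single-GPU autoregressive rollout in PyTorch. The checkpoint name, tensor shapes, and time step are assumptions for the sketch; each of the models named above ships its own loaders, normalization, and data pipelines, so this is not their actual interface.

```python
import torch

device = "cuda"  # e.g. a single GPU instance or a SageMaker notebook GPU

# Illustrative only: assumes the checkpoint stores a complete nn.Module.
model = torch.load("fourcastnet_v2_small.pt", map_location=device)
model.eval()

# Initial condition: one global analysis state, shape (batch, channels, lat, lon).
# A 0.25-degree grid is roughly 721 x 1440; the channel count is a placeholder.
state = torch.randn(1, 73, 721, 1440, device=device)  # stand-in for real analysis data

steps = 16 * 4  # 16 days at an assumed 6-hour time step
with torch.no_grad():
    for _ in range(steps):
        state = model(state)   # each call advances the forecast one time step
        # ... write the step out for plotting or verification ...
```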

Thumbnail 2820

Thumbnail 2840

Thumbnail 2850

To prove that, because the proof is in the pudding: I'm a technologist, not a scientist, yet I was able to run the FourCastNetv2 small AI model. What you see here is Hurricane Helene coming through Florida into North Carolina. You can see my name up there; that's my notebook running, based on some open source code that I found available. So even I, as a technologist, am able to do these runs. Just imagine researchers and scientists having this capability to run these forecasts at will, with the capacity and compute options they need available in the cloud.

The Future of HPC and AI: $50 Billion Investment and Innovation Across Industries

With that, Vicky is going to come up and talk about the future of HPC and AI. Thank you, Rayette. You heard a great case study from our colleague from the National Oceanic and Atmospheric Administration, with some great examples of the use of supercomputing to solve their mission challenges and achieve the right mission outcomes. Then you heard from Rayette about how AWS is powering that infrastructure and providing support to the National Oceanic and Atmospheric Administration.

Thumbnail 2960

So what does the future of HPC and AI look like? We at AWS believe it is bright. About a week ago, we announced a $50 billion infrastructure investment to build new AI and supercomputing centers for the U.S. government. This $50 billion investment corresponds to roughly 1.3 gigawatts of compute capacity and will serve the missions of various government agencies, including those with workloads in AWS GovCloud (US), in our Secret regions, and in our Top Secret regions, across all data classifications.

Thumbnail 2980

Thumbnail 2990

This investment is based on real-world innovation that we've been seeing from our customers like the National Oceanic and Atmospheric Administration, as well as many other agencies. I wanted to take you through a couple of these examples to spur innovation. First, we have a partner called S2 Labs that does subsurface mapping. What does subsurface mapping mean? Basically, it means that without doing any excavation, you're trying to figure out underground structures.

It could be infrastructure, cables, pipelines, or construction material, but you have to determine what is underground, especially when excavation is not possible. In 2004, Hurricane Ivan destroyed an offshore oil rig, and all the critical infrastructure fell into the ocean and was buried under meters of sediment. Because it was also hazardous waste, it took 18 years of trials using traditional acoustic methods to try to figure out where that infrastructure was buried, with no success, until S2 Labs used deep learning, applied physics, and inferencing along with supercomputing-grade APIs to solve the problem. They were able to do that with unprecedented clarity, so much so that they developed a workflow that can map an area of 400 meters by 400 meters by 60 meters in under 5 seconds. That is about 29.5 football fields. The key innovation here is that this has applicability in many different industries: oil and gas, energy, utilities, urban development, construction safety, and many others. This is a great example of what we are seeing with the convergence of HPC and AI and the results it is delivering.

Let me look at one more example. Many of you may have encountered challenges when trying to support decision-making that involves multiple expert teams that need to come together, collaborate, and then provide some kind of analysis. This happens a lot in the engineering domain, in the science domain, and even in the financial services domain, where you are trying to approve a loan, for example. We have a partner called Senera, which is a process automation company. They worked in an engineering domain where, in order to provide a request for quotation for an engineering component that is composed of many other components that need to be assembled together, each component has to be monitored and its information provided by a different engineering expert. All that information needs to be synthesized, and they wanted to model that and create a solution for it using HPC performance-grade APIs as well as AI/ML.

They developed a solution based on Amazon Bedrock, EC2 instances running NVIDIA GPUs, and additional AI/ML-based inferencing. What they were able to do was take this assembly of complex engineering parts from multiple different experts, take all of their expertise, and run a multi-agent AI system that gathers this information, reasons through it, and then provides the result. They reduced the time from 3 weeks to several minutes. Again, this is a great example of how we worked with our partner to apply HPC and supercomputing to solve a very complex problem. These two examples are just scratching the surface. There are many areas of innovation that are possible with this convergence of HPC and AI/ML, and with the investment that we are making, especially for the US government, we expect this to become even more advanced in the future.
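To give a flavor of the pattern described above, and only the pattern, since Senera's actual system is not public, here is a heavily simplified sketch of synthesizing several experts' inputs with a single Amazon Bedrock call. In a real multi-agent setup, each domain would be its own agent with tools and retrieval; the model ID, prompt, and expert inputs below are illustrative placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder inputs standing in for what different engineering experts would supply.
expert_inputs = {
    "materials": "Housing requires 6061-T6 aluminum, 4-week lead time.",
    "machining": "CNC milling, 12 hours per unit at current shop load.",
    "electrical": "Wiring harness sourced externally, $480 per assembly.",
}

prompt = "Draft a request-for-quotation summary from these expert inputs:\n" + "\n".join(
    f"- {domain}: {detail}" for domain, detail in expert_inputs.items()
)

# Single foundation-model call that synthesizes the gathered inputs.
response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model choice
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```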

Let me do a quick recap. We started by talking about the convergence of HPC and AI/ML in the cloud. Back in 2023, the HPC market was about 37 billion dollars. We see this progressing quite rapidly, with 24 percent growth in 2024, and by 2029 the analysts are telling us that the converged market is going to be about 49 billion dollars. Then David came up and talked about NOAA's mission. It is not just about forecasting, remember that; they do so many other things which are extremely important. And then Rayette came up and talked about the infrastructure, the compute, the capacity, and the architecture that supports NOAA's mission. We hope you enjoyed it. Thank you very much for your time.


; This article is entirely auto-generated using Amazon Bedrock.
