AWS re:Invent 2025 - Architecting for sustainable IT at scale (AIM255)

Overview

In this video, Nitin Pathak, a technical account manager at AWS, discusses sustainability in cloud computing and generative AI. He highlights that cloud workloads grow almost 25% annually, while generative AI workloads increase by up to 300-500% yearly. Training a standard GPT model consumes around 1,200 megawatt-hours of energy, enough to power 150 to 1,000 houses for a year depending on location. He emphasizes the AWS Well-Architected Framework's Sustainability pillar with its six best practices, including proper resource scaling, region selection, and data management. Key recommendations include using auto scaling, serverless, and AWS Compute Optimizer for a 20-30% reduction in carbon footprint, as well as AWS silicon such as Graviton (up to 60% more energy efficient), Inferentia 2 (up to 50% better power consumption), and Trainium (up to 25% better training efficiency). He also stresses monitoring data transfer impacts and using the AWS Customer Carbon Footprint Tool and CloudWatch metrics for tracking sustainability KPIs.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The Growing Environmental Impact of Cloud Workloads and the AWS Well-Architected Framework for Sustainability

Hi, everyone. I'm Nitin Pathak. I'm a technical account manager at AWS, and I have an important question for you. Show your hands if you have conducted an AWS Well-Architected Review for sustainability. Keep your hands raised if you have conducted it at least twice. The hands go down, because we have seen that people do not repeat this review as they continue to grow on their cloud journey. This is what we're here to discuss. First, as we continue to scale our cloud workloads, what is the environmental impact? Second, what is generative AI's impact on our sustainability? Third, how is AWS silicon key for energy efficiency? And finally, how do we measure our sustainability and our footprint?

Cloud workloads are growing almost 25% year on year. As we continue to grow and scale, we often neglect the impact this has on sustainability. There are a few key aspects to consider. First and most important is the higher carbon footprint. As we increase our workloads across various domains, our carbon footprint will increase in proportion to the usage added. Second, there will certainly be increased operational costs as we continue to grow and evolve. Our cloud workloads are growing, and our costs will grow along with them.

When these things are growing simultaneously, we often find ourselves not in line with regulatory compliance. In many countries, governments have standards on carbon reporting, how efficient we are in terms of water usage, and our scopes of emission. All these things need to be in check as we continue to grow and evolve. Finally, there is a brand reputation impact. We might ask why we care about sustainability and emissions. Well, our customers care about sustainability, our stakeholders care about sustainability, and our government may also care about sustainability. It is important to have a measure of what we are doing for the environment and for climate, and ensure that our workloads stay in line.

We talked about the scale at which our regular workloads grow, but there is another key aspect we have been dealing with in the past two years. We have been saying that last year was the year of proof of concept and this year is the year of production for generative AI. If that is the case and we continue to grow, there are impacts. The generative AI workloads we are consuming are growing by up to 300 to 500% year on year. There are also considerations regarding the models and the transformers we are using. Standard GPT training could consume around 1,200 megawatt-hours, and standard GPT inference around 550 megawatt-hours. That is just one workload consuming so much energy.

What does this translate to? In everyday terms, 1,200 megawatt-hours of energy will power approximately 150 houses for an entire year in the United States. If we take this number to Europe, it is almost 300 houses for an entire year. In somewhere like India, 1,000 houses can be powered by the energy used to train a standard GPT. This is the consideration we need to make. But that is the problem. What should we keep in check? How do we get ahead of this? How do we overcome these things?
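The house counts above imply very different annual consumption per household in each region. The short calculation below is only a sketch of that arithmetic. It assumes the quoted training figure is a total in megawatt-hours, and the per-household values it prints are derived from the talk's own numbers, not taken from any official statistics.

```python
# Training energy quoted in the talk, read here as megawatt-hours (an assumption).
TRAINING_ENERGY_MWH = 1200

# Houses powered for one year, as quoted in the talk.
HOUSES_QUOTED = {"United States": 150, "Europe": 300, "India": 1000}


def implied_household_mwh(total_mwh: float, houses: int) -> float:
    """Annual consumption per household implied by the quoted figures."""
    return total_mwh / houses


implied = {
    region: implied_household_mwh(TRAINING_ENERGY_MWH, houses)
    for region, houses in HOUSES_QUOTED.items()
}

for region, mwh in implied.items():
    print(f"{region}: about {mwh:.1f} MWh per household per year")
```

Read this way, the same training run covers more homes wherever the typical household uses less electricity.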

We always go back to the Sustainability pillar of the AWS Well-Architected Framework. In 2021, when AWS added the Sustainability pillar, it was based on six foundational best practices. These are the cornerstones for sustainability and how we should manage our workloads. These six practices have been used for our reviews and our attention on our cloud workloads. They will remain the same while we are using our generative AI workloads as well. Has anyone heard of these? Has anyone been practicing any of these? Have any of you gone through any of these best practices so far? Where will they align? How are we selecting our regions in terms of usage? Are we doing proper scaling for alignment to our demand?

What are the data strategies which are in place? Are there hardware and services you're utilizing to the best impact and fullest utilization? What are the software architecture practices we are using? Have you ever given consideration to how much energy a Python code will consume? And lastly, what is the process and culture of our company itself? Are we doing a sustainability-first approach as we are building our cloud workloads?

These best practices form the foundation. There are a few low-hanging fruits which we can always incorporate, and there are things which we can do further to enhance them. Speaking of low-hanging fruits, the first and easiest one is that overprovisioned resources waste energy. We've had this principle: the cleanest energy you can use is the energy that is not used. That quote from our CTO still stands and makes sense even now.

The first thing we can easily do here is auto scaling. Use the flexibility of the cloud. Do not provision more than what is required; scale up only as your requirements demand. Utilizing Spot Instances is also a good way to consume capacity that would otherwise be wasted, for the benefit of both your cost efficiency and your energy efficiency.
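To make "do not provision more than what is required" concrete, here is a minimal sketch of a proportional scaling rule. It is similar in spirit to target tracking but is not AWS's actual algorithm. The function name, the 60% target, and the size bounds are all hypothetical.

```python
import math


def desired_capacity(current: int, observed_pct: float, target_pct: float,
                     min_size: int = 1, max_size: int = 20) -> int:
    """Resize the fleet so average utilization moves toward the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    # Capacity needed if the same load were spread at the target utilization.
    proposed = math.ceil(current * observed_pct / target_pct)
    # Clamp to the group's bounds so we never scale to zero or without limit.
    return max(min_size, min(max_size, proposed))


# A lightly loaded fleet shrinks; a heavily loaded fleet grows.
scale_in = desired_capacity(current=10, observed_pct=20, target_pct=60)
scale_out = desired_capacity(current=4, observed_pct=90, target_pct=60)
print(scale_in, scale_out)
```

The point of the sketch is the scale-in case: an idle fleet of ten drops to a handful of instances instead of drawing power for capacity nobody is using.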

Using serverless, things like Lambda, using these simple items in order to get your work done quickly and efficiently will help you reduce your overall carbon footprint. And finally, utilizing the tools that we have in place for you. The AWS Compute Optimizer will continue to give you recommendations on what you should resize in your architecture, so that you will get better usage and better efficiency. You're not wasting anything altogether.

What will it lead to? If you're doing right-sizing in a standard way, you can get up to a 20 to 30 percent reduction in your carbon footprint and also in your cost. That's a very quick and easy win, and it can certainly be leveraged. So always try to see where your workloads are, revisit them, review them regularly, see where you're falling short, and find ways to get better efficiency for your workloads.
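As a rough illustration of what a right-sizing review looks for, the sketch below classifies instances by their peak CPU utilization. It is not how AWS Compute Optimizer works internally. The thresholds, instance names, and utilization samples are hypothetical.

```python
def rightsizing_finding(cpu_samples_pct, low_pct=20.0, high_pct=80.0):
    """Classify an instance from its CPU utilization samples (percent)."""
    if not cpu_samples_pct:
        raise ValueError("need at least one utilization sample")
    peak = max(cpu_samples_pct)
    if peak < low_pct:
        # Even the busiest sample leaves most of the capacity idle.
        return "overprovisioned"
    if peak > high_pct:
        return "underprovisioned"
    return "optimized"


# Hypothetical fleet with a few utilization samples per instance.
FLEET = {
    "web-1": [12, 9, 15, 11],
    "batch-1": [55, 70, 62, 48],
    "api-1": [85, 93, 88, 97],
}

findings = {name: rightsizing_finding(samples) for name, samples in FLEET.items()}
for name, finding in findings.items():
    print(f"{name}: {finding}")
```

In practice you would feed in weeks of data and also consider memory and network, but the review loop is the same: measure, classify, resize, repeat.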

AWS Silicon Solutions and Practical Strategies for Measuring and Optimizing Sustainable AI Architecture

This is what you will be doing, but AWS is with you on this journey. Yesterday, Matt Garman also spoke in the keynote about the chips we've been designing. We always talk about the performance and efficiency of these chips: Graviton, Inferentia, and Trainium. As we continue to grow, they have a major implication for sustainability as well, as they are extremely power efficient.

As a general rule, a Graviton processor is up to 60 percent more energy efficient than a comparable EC2 instance of the same size and computing power. Inferentia was built to be more energy efficient for inference, so much so that Inferentia 2 is up to 50 percent better in power consumption and efficiency. And finally, Trainium: when we are training our models, it offers up to 25 percent better training efficiency compared with the standard option.
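A figure like "up to 60 percent more energy efficient" can be read in two ways, and the two readings give different energy totals. The sketch below works through both for a hypothetical baseline of 100 kWh. Neither function is an AWS formula, and the talk does not say which reading is intended.

```python
def energy_if_uses_less(baseline_kwh: float, fraction_less: float) -> float:
    """Reading 1: the same work uses `fraction_less` less energy."""
    return baseline_kwh * (1 - fraction_less)


def energy_if_more_work_per_kwh(baseline_kwh: float, fraction_more: float) -> float:
    """Reading 2: each kWh does `fraction_more` more work."""
    return baseline_kwh / (1 + fraction_more)


BASELINE_KWH = 100.0  # hypothetical energy for a fixed amount of work

reading_1 = energy_if_uses_less(BASELINE_KWH, 0.60)
reading_2 = energy_if_more_work_per_kwh(BASELINE_KWH, 0.60)
print(f"uses 60% less energy:     {reading_1:.1f} kWh")
print(f"60% more work per kWh:    {reading_2:.1f} kWh")
```

Either way the saving is large, but when you report numbers to stakeholders it is worth stating which definition you used.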

So when we are using either our standard compute or you're using our AI-based models or generative AI-based models, these things combined together can help you maintain the sustainable posture of your IT infrastructure. Another thing is the sustainable AI architecture. We need to focus towards the training optimization, the inference optimization, and data management and data efficiency as well. We'll talk about the first two aspects.

As we use models for our own purposes, we tend to want to train them ourselves, get the data in the right shape, and run the inference and training ourselves. But there's always the option of using what is already available. We have a lot of pretrained and pre-optimized models available with SageMaker that you can certainly use and leverage. As you do, you will see a considerable impact on your overall energy efficiency.

The next part, which is somehow the most neglected aspect of the sustainability journey, is data management strategy. A lot of times we see a lot of data that is not being used for either training or inference, but it's still there. Are we utilizing it? Are we looking back at it? It is not just adding cost and volume to our resources; it is also wasting energy and consuming resources. So keep a check on that, utilize data lifecycle management policies, and see if you can reduce and remove as much as possible, because the data used for training purposes always runs to petabytes and petabytes.
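One common way to apply a data lifecycle policy is an Amazon S3 lifecycle rule that moves older objects to colder storage and later expires them. The sketch below only builds the rule as a Python dictionary in the shape boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, prefix, and day counts are hypothetical, and the talk does not prescribe these values.

```python
def build_lifecycle_configuration(rule_id: str, prefix: str,
                                  archive_after_days: int,
                                  expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, objects."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to colder storage once they stop being read often.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects once they are no longer needed at all.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_configuration(
    rule_id="archive-then-expire-raw-training-data",  # hypothetical name
    prefix="training-data/raw/",                      # hypothetical prefix
    archive_after_days=90,
    expire_after_days=365,
)
print(config["Rules"][0]["ID"])

# To apply it (requires boto3, credentials, and a real bucket name):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
```

Check your own retention requirements before expiring anything; the value of the rule is that cleanup happens automatically instead of depending on someone remembering to do it.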

There's a lot of data being used for this purpose, so try to reduce that as much as possible. Next is optimizing the data transfer strategy. With many customers I've spoken with and worked with, we've always had this discussion. We are using AWS regions which are already powered by 100% renewable energy. We are using all the techniques we are discussing. Where are we adding this? Where is our carbon footprint coming from? The one thing which people do not tend to manage is the data transfer and the impact it has.

We are building our workloads in us-east-1, and most of our user base is probably in Europe, in Asia, in different places. When things are served from America to a different part of the world, the data transfer adds to your energy use and carbon footprint. So keep a check on that. Use alignment to demand and region selection to place workloads closer to your user base, so you avoid the emissions that come from data transfer. And how can you keep track of all this?
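A simple way to reason about "closer to your user base" is to weight each candidate region by where the traffic actually comes from. The sketch below uses latency as a stand-in for distance. The traffic shares and latency figures are invented for illustration, and the three region names are just examples.

```python
# Hypothetical share of requests by user geography (sums to 1.0).
USER_SHARE = {"europe": 0.6, "asia": 0.3, "north_america": 0.1}

# Hypothetical round-trip latency in milliseconds from each geography.
LATENCY_MS = {
    "us-east-1": {"europe": 90, "asia": 220, "north_america": 20},
    "eu-west-1": {"europe": 20, "asia": 180, "north_america": 90},
    "ap-south-1": {"europe": 130, "asia": 40, "north_america": 220},
}


def weighted_latency(region: str) -> float:
    """Average latency for this region, weighted by where users are."""
    return sum(share * LATENCY_MS[region][geo] for geo, share in USER_SHARE.items())


best_region = min(LATENCY_MS, key=weighted_latency)

for region in LATENCY_MS:
    print(f"{region}: {weighted_latency(region):.0f} ms weighted")
print("closest to the user base:", best_region)
```

With most users in Europe, the European region wins even though the workload was originally built in the United States. A fuller analysis would also weigh the carbon intensity of each region's grid.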

AWS has aspects that can be useful for you. Have you used AWS Customer Carbon Footprint so far or gone through it? For everyone's benefit, in your Cost Explorer dashboard, which you use to track your costs, there's always another option available for your AWS Customer Carbon Footprint tool. I recommend you take a look at this, see where your workloads are located, how much energy efficiency there is, what kind of emissions are occurring with this Customer Carbon Footprint tool.

Secondly, utilize CloudWatch metrics to improve your efficiency. CloudWatch metrics can be used as great KPIs for sustainability. You get a chance to view how your instances and resources have been performing and what kind of utilization they have. You'll be able to see your network data transfer in and out and determine whether it's really necessary for a system. You'll also get to see their memory consumption. Utilize the metrics that are available to you to get KPIs for your business and for sustainability as well.
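One way to turn raw metrics into a sustainability KPI is to normalize resource use by the work done, for example vCPU-hours or gigabytes transferred per thousand requests. The sketch below assumes you have already aggregated those totals, for instance from CloudWatch. The class, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodUsage:
    label: str
    vcpu_hours: float
    gb_transferred_out: float
    requests: int


def per_thousand_requests(amount: float, requests: int) -> float:
    """Normalize a resource total by the work done in the same period."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return amount / requests * 1000


# Hypothetical monthly totals: same compute, more traffic, less data out.
periods = [
    PeriodUsage("last month", vcpu_hours=7200, gb_transferred_out=900, requests=3_000_000),
    PeriodUsage("this month", vcpu_hours=7200, gb_transferred_out=840, requests=3_600_000),
]

kpis = {
    p.label: (
        per_thousand_requests(p.vcpu_hours, p.requests),
        per_thousand_requests(p.gb_transferred_out, p.requests),
    )
    for p in periods
}

for label, (vcpu, gb) in kpis.items():
    print(f"{label}: {vcpu:.2f} vCPU-hours and {gb:.3f} GB out per 1,000 requests")
```

A falling value means the business is growing faster than its resource use, which is the trend you want to see from one review to the next.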

We keep mentioning that sustainability is nothing but an amalgamation of cost optimization and environmental responsibility. So combine these things together and you'll get better results. As your IT continues to grow and scale with your business, make sure your workloads remain sustainable and efficient as well. Keep a check on that. Please do take a look at some of our AWS Skill Builder courses over here. I will be at the sustainability booth, which is right on the corner. If you have any questions on sustainability, AWS sustainability, and how it can help you, please take a look at this, and thank you so much for attending.

This article is entirely auto-generated using Amazon Bedrock.