🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Sustainable and cost-efficient generative AI with agentic workflows (AIM333)
In this video, AWS solutions architects Isha Dua and Parth Patel address the environmental impact of generative AI while presenting sustainable development practices. They explain the generative AI lifecycle across four stages: problem framing, model training/adaptation, deployment/inference, and monitoring. Key recommendations include using managed services like Amazon Bedrock, selecting appropriately-sized models rather than defaulting to the largest, leveraging features like Bedrock Evaluations, prompt caching, intelligent prompt routing, and model distillation to reduce costs by up to 75%. They emphasize using AWS's optimized silicon like Trainium and Inferentia instances for better energy efficiency. The session covers agentic AI systems, introducing Amazon Bedrock Agent Core with its runtime that suspends CPU cycles during LLM wait times and Agent Core Gateway that uses semantic search to reduce context windows by 90%, significantly lowering carbon footprint and costs.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Environmental Crisis Behind AI's Promise: Introduction to Sustainable Generative AI
Good afternoon, everyone. Thank you for attending our session. You will be hearing a lot about agentic AI and generative AI throughout this week of re:Invent, and you will hear a lot more in this session as well. All of this artificial intelligence promises to solve some of our biggest challenges, but its very development has led to one of the most critical environmental crises of our time. What we aim to deliver in this session is sustainable and cost-efficient generative AI with agentic workflows, along with some practices that you can take away. Welcome everyone. I'm Isha Dua, a senior solutions architect here at AWS, and with me I have Parth. I'll let him do a quick introduction.
Hi everyone. I'm Parth Patel, senior solutions architect at AWS, focusing on machine learning and sustainability. Thank you. Let's go right into it. What we're going to be covering in the session today is the rise of generative AI. We're also going to talk about sustainability at AWS. We will look into the generative AI lifecycle and how we can optimize the lifecycle at each of those phases. Toward the latter half of the presentation, we're going to go into agentic AI systems and talk about Bedrock AgentCore and multi-agent systems.
Everybody knows AI models are getting bigger and resource consumption is increasing. For over a decade, data centers kept their energy consumption fairly stable at around 100 terawatt hours, and many offsets were put in place to manage growing demand. As of 2021, however, that number increased drastically. Model training scale has grown roughly 350,000-fold, the demand for electricity has increased quite significantly, and there are many studies indicating this now.
As of August 2025, Goldman Sachs published research forecasting that about 60 percent of the increased electricity demand from generative AI systems, and from what we have to build for our agentic AI systems, will be met by burning fossil fuels. This is likely to produce around 215 to 220 million tons of carbon dioxide. To give you some perspective, if you drive a gas-powered vehicle for 5,000 miles, that's about a ton of carbon dioxide you're releasing. Not just Goldman Sachs, but many other researchers have published on this topic. One example is from the World Economic Forum, which states that the energy consumption of these systems is doubling every 100 days.
All of these numbers are staggering and very significant. As developers, scientists, and hyperscalers like AWS, it's our responsibility to make sure that we're building innovations with the right interventions in place to mitigate this ballooning carbon footprint. We need to focus not only on the efficiency of the hardware we're using and the efficiency of the algorithms we put in place, but also on how we design our data centers and how we implement cooling techniques in them. There are many other efficiencies that now need to be thought about much more carefully.
AWS's Climate Pledge and Energy Efficiency Advantages
Let's talk about sustainability and what it means at AWS. In 2019, AWS took the Climate Pledge, and one of our core tenets was to power our operations with 100 percent renewable energy. We met that goal as of 2023. One of the things you can do when you're building your generative AI systems or your agentic systems or when you're consuming these services is that if you are building them on AWS, you're automatically a little more sustainable. AWS is about 4.1 times more energy efficient than on-premises solutions, and this is because of all the efficiencies we have built into our ecosystems.
We are powering our operations with 100 percent renewable energy. We have hardware efficiencies built into our services. We have purpose-built silicon, which we'll talk about later in the presentation, that is optimized for model training and model inference. And we are building the right cooling efficiencies into our data centers.
Our data centers also incorporate low-carbon concrete into their infrastructure. When you add up all of these efficiencies, moving to or building on AWS gives you the ability to lower your carbon footprint by up to 90%. This represents a huge carbon reduction opportunity to consider when you think about generative AI and building these systems at scale.
Understanding the Generative AI Lifecycle: Four Primary Stages
I want to pivot now to what we call the generative AI lifecycle. When I talk about the lifecycle, I'm referring to four primary stages. These are the high-level primary stages that we're going to discuss. The first stage is problem framing. When I say problem framing, I mean any idea or use case that you have where you want to build a generative AI system to achieve a certain business outcome. It starts as an idea that a developer, architect, or senior executive has thought about. They want to build or consume a model to achieve a specific business outcome, and they may have collected data for training the model if they want to build one.
In the problem framing stage, you identify that this is a generative AI problem. Once you have identified the outcome and what you need, you reach the stage where you decide whether you want to train a model or use an existing model. This is the model training or adaptation phase. Once you have figured out the model training or adaptation approach, you move into model inference and model deployment. You have a model ready, and you're deploying it for inference to start generating predictions and responses. That is the deployment and inference stage.
Monitoring is something that we would like you to think about across the entire lifecycle. You want to build observability at each and every phase. You also want to ensure that once the model is deployed, you continuously monitor it and work on its improvement. You should look for data drift and model drift, and check whether the output is actually relevant in the real world. This is what we mean when we talk about the lifecycle, and these are the optimizations we will cover at each of these phases.
Problem Framing Stage: Choosing the Right Model with Amazon Bedrock
Let's first go into the problem framing stage. This is the stage where you ask yourself the question: for this business outcome and use case, do I even need generative AI? Do I need to build an agent, or do I just want to build an agent because it's the trendy thing to do right now? The first question is whether this is a simple business rules engine or something that requires generating open-ended content. Can you get away with using a less resource-intensive traditional machine learning approach? What if it's a simple classification problem and you're overcomplicating it by using a foundational model?
The first step in the problem framing stage is to ask yourself these questions. Now, there is a scenario where you ask yourself these questions and determine that yes, this is a generative AI problem and you must use a foundational model. In that case, the first recommendation would be to use a managed service if you have to build something from scratch. We recommend a managed service because it lets you operate more efficiently: we have shifted the responsibility of optimizing the underlying hardware, maintaining capacity, ensuring high utilization, and sustainability optimization of that hardware to AWS. We have taken away the undifferentiated heavy lifting from you, so you can focus on actually using the service.
One of the services we would recommend if you want to consume a model is Amazon Bedrock. It's a serverless, fully managed service that provides access to a diverse range of models through a single API. There are over 200 models available on Bedrock from different providers: Anthropic, Amazon, Cohere, Meta (Llama), Mistral, image models from Stability AI, and the list keeps growing.
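As a quick illustration of that single API, here is a minimal sketch of calling a Bedrock model with the boto3 Converse API; the model ID, region, and prompt are placeholders you would swap for your own.

```python
import boto3

# Minimal sketch: one Converse API call works across the models Bedrock offers.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # placeholder: any Bedrock model you have access to
    messages=[{"role": "user", "content": [{"text": "Give me three tips for writing efficient prompts."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```

Switching models is a one-line change to modelId, which makes the evaluation and right-sizing discussed below much easier.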
Amazon Bedrock offers far more than just model availability. As a managed service, it provides numerous capabilities and features you can leverage. You can fine-tune your models on Bedrock, use retrieval augmented generation, and access features that help you make easier, more sustainable, and more cost-efficient choices. Beyond these capabilities, Bedrock also has guardrails built in. You can apply guardrails to your systems if you want to build responsible AI principles into your application, ensuring your outputs are correct, fair, and explainable, with no potential harm or bias.
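As a hedged sketch of how a guardrail attaches to a request, the Converse API accepts a guardrailConfig block; the guardrail identifier and version below are placeholders for a guardrail you have already created in Bedrock.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-1234567890",  # placeholder: your guardrail ID or ARN
        "guardrailVersion": "1",
        "trace": "enabled",  # include the guardrail trace in the response for auditing
    },
)
print(response["output"]["message"]["content"][0]["text"])
```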
When you have access to 200 or more models, the question naturally arises: how do you select the right model? What criteria should guide your model selection? This is a critical step in the process. I want to emphasize that the biggest, brightest model available is not necessarily the best model for your use case. Just because a model has 200 billion parameters does not mean you should choose it. You may not need a model that large. Larger models are more expensive, their inference costs more, and they generate a larger carbon footprint.
You need to select your model more appropriately by asking yourself questions about your specific use case and the outcomes you want to achieve. Do you need the model to be open source or proprietary? Do you need to fine-tune the model? Do you only need to serve English-speaking customers, or do you also have customers who need Japanese output or other languages? Do you need a multilingual model? How many parameters should it have? Is it a general-purpose model or a domain-specific model? Are you building something generic, or does the model need to provide focused healthcare-related outputs?
Ask yourself these questions before you begin the model selection process. Remember that at inference time, a larger model consumes more resources and memory. The largest model is not always the best solution. Consider this example: GPT-3.5 had 175 billion parameters. A team at Stanford transferred that knowledge to a smaller model and built Stanford Alpaca with 7 billion parameters, dramatically fewer than GPT-3.5. Both models behaved qualitatively similarly. The point is that sometimes a smaller model can work well for your use case, cost you less, and be more sustainable.
Let me give you another example using the Nova models. We have three different types of Nova models with different costs. Although the numbers may look small, these costs accumulate across millions of data points. If you do not need multimodal capabilities or very high levels of accuracy, you can get away with a Nova Micro or Nova Lite. You do not necessarily need a Nova Pro. This principle applies to other providers as well.
One of the Bedrock features that can help is Bedrock Evaluations, a very useful feature that helps you assess and compare different models. You can choose how you want to evaluate them. You can use an LLM as a judge, which evaluates your output based on correctness, completeness, and potential harm. You can also use traditional metrics like the BLEU score or an F1 score. There is a third way where you can use a human workforce, either your own private workforce or an AWS-provided workforce for evaluation of your outputs. The process is fairly straightforward: you define the task type, provide your custom prompts, set up your evaluation metrics, and assess which model produces the best results. That will be the most sustainable choice based on how the service is designed. This is a critical feature, and if you have not explored it, I recommend looking at model evaluations.
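To make the "traditional metrics" option concrete, here is a small, self-contained sketch of a token-overlap F1 score, the kind of reference-based metric you can compute alongside an LLM-as-a-judge evaluation; the example strings are invented.

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1: a simple reference-based metric for comparing model outputs."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# Compare two candidate models' answers against the same reference answer.
reference = "your order ships within two business days"
print(token_f1("the order ships in two business days", reference))   # higher overlap
print(token_f1("shipping times vary by region", reference))          # lower overlap
```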
Model Training and Adaptation: Progressive Strategies from Prompt Engineering to Training from Scratch
Now let's move to the next stage in the workflow: model training and adaptation. We've completed the problem framing and selected the right model. Two things could happen here. One, we didn't find the right model, so we have to train one from scratch. Or we found the model, but we still need to customize it further to meet our exact use case and achieve the quality of output we're looking for. So it could be model training, or it could be adaptation and customization.
This is a very important infographic that I've spoken about in multiple places, and I would love for you to remember this. When we think about model training and adaptation, this is the order in which I want you to think about strategies, in a progressive manner. If I selected a base model from Bedrock and I have to customize it, the simplest approach would be to try prompt engineering. That's the lowest rung here, marked PE. Prompt engineering is the lowest-effort, least-cost, and least-carbon option. You give the model your prompts and see what kind of output it's generating.
If prompt engineering solves your use case and meets your requirements, that's the best choice. However, there are scenarios where prompt engineering may not meet your use case, and you may have to do a little more enhancement. In that case, retrieval augmented generation comes into the picture. You can provide some proprietary information to the model and add context so it's generating more tailored responses. For example, if you have an automotive company and you're building a chatbot that helps customers, you can provide your car manual or service manual as proprietary documentation that the model can use for additional context.
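As a hedged sketch of that pattern on Bedrock, the Knowledge Bases RetrieveAndGenerate API retrieves relevant passages from your documents (the car manual in this example) and feeds them to the model as context; the knowledge base ID and model ARN are placeholders.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "How do I reset the tire pressure warning light?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder: knowledge base holding your manuals
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",  # placeholder
        },
    },
)
print(response["output"]["text"])                     # grounded answer
print(len(response.get("citations", [])), "citations returned")
```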
Retrieval augmented generation is the second approach you can try. But there are scenarios where prompt engineering didn't work and retrieval augmented generation didn't produce the results you wanted. That's when we have to think about fine tuning. There could be parameter efficient fine tuning or full fine tuning. This is the increasing order of emissions, cost, and resource consumption. Parameter efficient fine tuning like LoRA or prefix tuning means you do not train all of the billions of parameters in the model. You only train a subset of them.
This is an efficient technique where you can fine tune and see if it's actually meeting your requirements. If that doesn't work, you go to full fine tuning. But if none of these work, that's when we would recommend that you think about training from scratch. Training a model from scratch is a very resource intensive, expensive, and time consuming process. You need to have all the hardware available, and it can take a lot of time. This is something you should only think about when none of these other techniques have worked out for you.
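For the parameter-efficient option, here is a minimal sketch using the Hugging Face peft library's LoRA adapters; the base model ID and LoRA hyperparameters are illustrative choices, not recommendations from the session.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; substitute the model you selected.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# LoRA trains small adapter matrices instead of all of the base model's weights.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections are a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```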
If we look at this infographic from a Bedrock perspective, it's very simple. You have some tasks at hand, and there could be two scenarios. That particular task does not need the latest data, or it does need some external data. If it does not need the latest data and it's a very simple task you're trying to achieve, then prompt engineering on Bedrock can solve that for you. But if it's a complex task that's a little domain specific, you have to do some fine tuning, so Bedrock fine tuning can come in handy at that stage.
If we move to the left side where you actually do need external data sources and up to date information, there can be two scenarios. One is relatively static information that you need. You provide documents to the Bedrock knowledge base and you're going to use retrieval augmented generation. That's the scenario where you augment with RAG. But there is a scenario where real time information may be required. You may have to access databases or call APIs. In that case, we augment it with agents and tools, and that's where Bedrock agents comes in handy. This is what we'll talk about in the latter half of the presentation, which is Bedrock agents. This is the same idea just from a Bedrock perspective.
If none of the customization techniques worked out for you, then there's no other option: you have to train from scratch.
That's when we have two recommendations again. First, use a managed service. There are multiple managed services available. We have the SageMaker ecosystem for those of you who are comfortable with SageMaker, so you can train your models there, and there's EKS as well. We also have SageMaker HyperPod: for those of you coming from HPC backgrounds who are familiar with SLURM-based orchestration, HyperPod lets you train your models with SLURM-based orchestrators, whereas EKS gives you Kubernetes-based orchestration. All of these are managed services, and using them lets you leverage the hardware efficiencies our service teams have built in, so you can focus on the training and the task at hand.
Second, when you use these managed services, use the right silicon. There are lots of EC2 instances: if you need GPUs, there are instances available from the P family, the G family, and other EC2 families. But we also offer the Trainium family, which has been built to be more energy efficient than comparable EC2 instances, so it produces fewer emissions. Trainium 1 was about 25% more energy efficient, Trainium 2 is 3 times more energy efficient than Trainium 1, and Trainium 3 is coming and will be even more energy efficient.
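As a hedged sketch of what "use the right silicon" looks like in practice, here is a SageMaker training job pointed at a Trainium (trn1) instance; the container image, IAM role, hyperparameters, and S3 paths are all placeholders, and in practice you would supply a Neuron (torch-neuronx) compatible training container and script.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<neuron-training-container-uri>",            # placeholder: torch-neuronx training image
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder IAM role
    instance_count=1,
    instance_type="ml.trn1.32xlarge",                       # Trainium instance instead of a GPU family
    sagemaker_session=session,
    hyperparameters={"epochs": 3},                          # illustrative only
)
estimator.fit({"train": "s3://my-bucket/train/"})            # placeholder S3 training data
```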
Deployment and Inference Optimization: Silicon Selection and Bedrock Features
When you're thinking about training from scratch, using a managed service and using the right silicon will help you a lot in terms of both sustainability and sometimes also in terms of price performance. So now we're at that stage where we've either trained the model from scratch or selected a model and customized it based on those strategies that we talked about, and now we're ready to deploy that model for inference.
Again, when you want to deploy it, you have to think about the silicon and about where you're deploying it. We have the option of deploying to EC2 instances, and there are many families of EC2 instances to choose from, but we also have Inferentia, from the Annapurna family, which delivers 50% better performance per watt, so you have the ability to use the Inferentia family of instances to deploy your models. There may be scenarios where you don't need GPUs for inference at all, such as smaller models or classifier models. In those cases, you can use Graviton instances, which are 60% more energy efficient and were built with that in mind; they've improved performance about 4 times since their launch in 2018.
Thinking about silicon even at the inference stage is very important. There are lots of techniques that you can implement at the deployment and inference stage. There are techniques that can help you reduce the model size and optimize memory usage. You have the ability to compress the models and make sure that you're distributing the models and building them for distributed processing. You're using the right kind of hardware efficiencies. You're removing any kind of unnecessary weights by pruning them. You're transferring knowledge to smaller models through distillation, or you're trying out different precision types for efficiency through quantization.
So there are techniques even at the inference stage that you can use to compress the model size further and make memory usage more efficient. This helps with cost, emissions, and resource consumption. There are libraries like DeepSpeed, Hugging Face Accelerate, and FasterTransformer that can help you do all of this fairly easily, and these are also available in our large model inference (LMI) containers. These techniques are something you should definitely think about to optimize the inference stage.
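As a small sketch of the precision idea, loading a model in half precision with Hugging Face Transformers roughly halves its memory footprint compared with float32; the model ID is a placeholder, and the same pattern extends to 8-bit or 4-bit quantization with the appropriate libraries.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Half-precision weights cut memory use roughly in half versus float32;
# device_map="auto" (requires the accelerate library) spreads layers across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
print(f"loaded {model.num_parameters() / 1e9:.1f}B parameters in fp16")
```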
For these techniques, there are some capabilities that are specific to Bedrock that can help. One of them is model distillation. It's an efficiency-focused tool. What it's letting you do is transfer knowledge over to a smaller student model from a larger teacher model.
You'll select a larger teacher model that you know works for your use case. You'll provide some custom prompts, and this fine-tunes the smaller model based on the results that the teacher model generates. The smaller distilled model can behave very similarly with almost 98% accuracy to the larger teacher model, and it saves you about 75% on the cost. This 75% cost reduction with 98% accuracy is very good for scenarios where you want to trade off between model size and performance while saving money and getting performance benefits.
Another Bedrock feature I'd like to point out is prompt caching. When we talk about prompts, we discuss two things: prompt optimization and prompt caching. Prompt optimization means you should not write large, convoluted prompts where the model has to process thousands of tokens. Make sure your prompts are optimized and that you have prompt templates in your organization for certain repeatable tasks.
There's also the caching aspect that Bedrock lets you use, which avoids recomputation of repeatable patterns or matching prefixes. This can reduce costs by up to 90% and latency by up to 85%, because you're avoiding the computation of these repeated parts. For example, if you have a coding assistant model with a snippet of code that needs to run repeatedly, Bedrock pre-computes the results of that snippet and caches them for use in subsequent prompts you pass to it.
For coding assistance scenarios, system instructions are often repeated very frequently. This combination of prompt caching and optimization reduces your cost and accelerates your response time by helping with latency. Using Bedrock prompt caching is something we would recommend.
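Here is a hedged sketch of what prompt caching looks like with the Converse API, assuming the cachePoint block structure: everything before the cache point (the long, repeated system instructions) can be reused across calls instead of being reprocessed. The model ID and instruction text are placeholders, and the model must be one that supports prompt caching.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

LONG_SYSTEM_INSTRUCTIONS = "You are a coding assistant for our internal style guide... (several thousand tokens)"

system = [
    {"text": LONG_SYSTEM_INSTRUCTIONS},
    {"cachePoint": {"type": "default"}},  # everything above this block is eligible for caching
]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # placeholder: a caching-capable model
    system=system,
    messages=[{"role": "user", "content": [{"text": "Refactor this function to remove duplication."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```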
Finally, I'd like to talk about another Bedrock feature called intelligent prompt routing. This is a very smart cost optimization feature that lets you select multiple models like Nova, Claude, and Llama, and designate them as your routers. You provide a prompt and let Bedrock intelligently decide which model to route that particular request to by looking at the complexity of the prompt.
If it's a fairly simple prompt that doesn't need a lot of convoluted processing or complicated logic, Bedrock will send it to a smaller, more cost-efficient model. If it's a more complicated prompt, it will be processed by a different model, perhaps a bigger one that's better able to handle it. You can reduce your expenses by roughly 30% using intelligent prompt routing, and it saves you from having to make that model selection decision yourself.
Each request is traceable, so you'll be able to see which particular prompt went to which model and see metrics about it as well. You'll be able to analyze the results later. Intelligent prompt routing eliminates the need for model selection and complex routing logic that you would otherwise have to build yourself. Bedrock intelligently does that for you.
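From the application's point of view, using a prompt router is a one-line change: you pass the router's ARN where a model ID would normally go, and Bedrock picks the model per request. This is a hedged sketch; the ARN below is a placeholder, and the trace inspection at the end assumes routing details are returned in the response metadata.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder ARN for a prompt router configured with, say, a small and a large model.
router_arn = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/anthropic.claude:1"

response = bedrock.converse(
    modelId=router_arn,  # the router, not a single model, decides which model serves the request
    messages=[{"role": "user", "content": [{"text": "What is your return policy?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
print(response.get("trace", {}))  # routing details, including which model actually handled the prompt
```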
Continuous Monitoring and Observability Across the Lifecycle
We are now at the final stage. We have framed our problem, trained or adapted or customized our model, deployed it, and looked at some deployment and inference techniques for the model. Now we're at that final stage where we need to continuously monitor and optimize what we have deployed.
However, we also need to remember that monitoring and observability is not something we want to treat as an afterthought. We want it to be something you are thinking about across the entire life cycle, even when you're training the model. You want to be able to use these tools, and there are multiple tools available.
CloudWatch allows you to look at CPU, memory, and other kinds of metrics. The SageMaker Profiler really helps with training jobs and allows you to look at specific training metrics: you can examine data distributions, how the training is progressing, the loss, and the accuracy. Neuron Monitor is a tool from the AWS Neuron SDK for monitoring training and inference workloads on Trainium and Inferentia, so if you're using that silicon, Neuron Monitor is another good option.
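As a hedged sketch of pulling those numbers programmatically, this queries CloudWatch for the GPU utilization a SageMaker training job emits; the job name is a placeholder, and the namespace and dimension names assume the standard SageMaker training-job metrics.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/TrainingJobs",                            # assumed standard namespace
    MetricName="GPUUtilization",
    Dimensions=[{"Name": "Host", "Value": "my-training-job/algo-1"}],   # placeholder job name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```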
Of course, if you're using the NVIDIA family of instances for training or inference, you have the ability to use the NVIDIA System Management Interface (nvidia-smi), where you can see GPU metrics and optimize resource usage based on the numbers you're seeing. This can also help you identify potential bottlenecks. With that, we complete the life cycle, and I'm going to pass it over to Parth, who will talk to you about agentic AI systems now.
From Generative AI to Agentic AI: Understanding AI Agents and Bedrock Agent Core
Thanks, Isha. So we learned about the life cycle of generative AI. Now we're hearing throughout the year that this is the era of agents and that generative AI will only be used with agents. Does this mean that generative AI is over? The short answer is no. Both have different applications and different perspectives. Generative AI, which you've been using for quite a while now, is fantastic for generating images and articles, summarizing content, and translation. However, agentic AI takes it one step further. It uses the same underlying large language model, but it is used for a specific task. It is goal-oriented and is specifically useful when the decision is subjective and doesn't follow a standard path. We believe that with agentic AI, we are able to leverage the full potential of large language models.
What are AI agents? Agents are essentially AI systems that act autonomously: they make their own decisions to achieve a goal rather than following a specific predefined path or workflow. They have the flexibility to make autonomous decisions based on the data or context they encounter. Let's understand how an agent or agentic system works.
For an agent, you need to have a specific goal. You can give the agent a certain goal based on instructions. You should also provide some tools so that agents can utilize them to achieve certain goals. You provide all these details, such as what state the agent is being invoked in, what is happening, and what it needs to achieve, all as part of the context. Once you have those base elements, you can activate the agent. The agent will utilize a large language model underneath and generate a plan for how to achieve the goal. While executing the goal, it will take certain actions, mostly utilizing the tools. It analyzes the output of that particular action or the tools and then reiterates the process until it achieves the goal. Once the goal is achieved, the output could be text, a generated image, or an action performed by the tools. That's the basis of any agent or agentic system, where you can have multiple agents chained together.
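The loop described above (goal, plan, act with tools, observe, iterate) can be sketched in a few lines. This is a deliberately simplified, framework-free illustration: call_llm is a stub standing in for a real model call, and the single tool is invented for the example.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Stub for a real LLM call (e.g., via Amazon Bedrock); returns a canned decision here."""
    return "FINISH: all pending orders have been summarized"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search_orders": lambda query: f"3 orders matching {query!r}",  # invented example tool
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    context = f"Goal: {goal}\nAvailable tools: {', '.join(TOOLS)}"
    for _ in range(max_steps):
        # The model plans the next step given the goal, tools, and everything observed so far.
        decision = call_llm(f"{context}\nDecide the next action, or reply 'FINISH: <answer>'.")
        if decision.startswith("FINISH"):
            return decision.partition(":")[2].strip()           # goal achieved
        tool_name, _, tool_input = decision.partition(":")      # e.g. "search_orders: late deliveries"
        observation = TOOLS.get(tool_name.strip(), lambda q: "unknown tool")(tool_input.strip())
        context += f"\nAction: {decision}\nObservation: {observation}"  # feed the result back and iterate
    return "Stopped without reaching the goal."

print(run_agent("Summarize all pending customer orders"))
```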
Now, to create an agent, you need a lot of things. You need to make sure that you provide context. Since the underlying model is stateless, you need to provide the right tools and pass information back and forth so that the agent understands what action was taken before and what action needs to be taken after the first step.
You also need to manage the orchestration. For example, if an agent decides on a certain plan and based on the action it realizes that the plan is not working, it needs to reiterate the plan. So you need to manage the orchestration for that. In most cases, a single agent could work, but we realize that organizations have much more complicated use cases. So you will have multiple agents chained together or multiple agents talking to each other. So you need agent-to-agent communication as well.
Fortunately for us, we have a large open source community with many frameworks available. Amazon has Bedrock Agent Core, which is specifically designed for production-grade agents. LangChain and CrewAI are some of the popular frameworks in the developer community that grew up alongside generative AI. We also have frameworks from model providers like OpenAI and Google. There are many frameworks available, and they take care of a lot of the boilerplate code needed to create an agent. By utilizing these frameworks, you can create an agent in your development environment or as a proof of concept.
However, scaling agents is really hard. Gartner mentioned that by 2027, 40% of agentic AI projects will be scrapped due to certain restrictions like governance, security, and scalability. The underlying LLM doesn't have state, so you need to provide memory or context for the agent. You need to make sure the agent adheres to your organization's security and governance rules.
You need to make sure that all actions taken by the agent are auditable. This means you need to ensure that tracing is available, all responses are being logged, and all decisions and reasoning are tracked. All of these things require a specific type of infrastructure, different from what we are used to using for applications. That's why we have Bedrock Agent Core.
Bedrock Agent Core comprises fully managed services that can be used together or separately. It comes with around five services internally. These are different features, and you don't have to use all of them together; you can pick any of these services and use them with your existing framework or existing agentic workflows. On top of that, you can use any framework we talked about, such as the OpenAI framework or LangChain. You don't have to use a specific framework.
Also, you can use any model. You can use models available from Bedrock, but if you are already using models available from OpenAI or Gemini, you can still use those models even while using Bedrock Agent Core. Bedrock Agent Core is specifically designed for any framework, any model, and any protocol infrastructure that helps you build production-grade agents. This means you don't have to make a choice between open source flexibility and AWS's reliability and security.
Now let's understand how Bedrock Agent Core works. As we discussed, for any agent to work, you need base components: instructions, tools available within the framework, and context. As part of Agent Core, we have the Agent Core Runtime, where you can use any framework or any model. You can have a production-grade agent with just Agent Core Runtime. That's the basic setup: a basic agent that you can work with.
However, most organizations have complexity where you need to call tools outside the purview of the framework or use your existing APIs or existing applications, so we have tools that you can leverage alongside agents. Agent Core Gateway, as the name suggests, is a gateway for your MCP (Model Context Protocol) tools. You can call any of your existing APIs or tools outside of your network or environment using Agent Core Gateway. Agents can also have a browser and a code interpreter.
We see use cases where you're required to open a browser to a URL because the older application doesn't have an API. You can have an agent open a browser, use the information from the browser, or have an action taken like clicking a button and seeing the output, which the agent can then use. You can also have certain use cases where you need very precise output. You can have a code interpreter where it will actually open an environment, execute code, and then return you the output.
All these things require security, so for that we have AgentCore Identity. AgentCore Identity will help you secure your agent, determining which persona is calling the agent or which persona the agent is working under, as well as which tools the agent is calling. It will also leverage your existing IAM security or credentials to call the tools which are available on AWS or outside. So identity works on both sides: inbound and outbound authorizers.
Along with this, agents may run for a long time, and the agent will eventually run out of context window, because underneath it's a large language model and a lot of information and iteration fill up the context. So we have AgentCore Memory to help you move information out of the LLM context and into memory, so that when needed, you can pick up information from a previous step and continue the process. There is also long-term memory inside AgentCore Memory, where you can save preferences to make your agent more inclined toward certain behavior.
None of this works without traceability, so GenAI observability is more important than ever. We need to understand why the agent makes certain decisions, what information was available, and why it makes different decisions at different points in time. Everything is logged so that you can trace it and make it auditable.
AgentCore Runtime and Gateway: Reducing Carbon Footprint Through Efficient Compute and Semantic Search
We will go a little deeper on AgentCore Runtime and AgentCore Gateway to understand how they help you save cost and reduce your carbon footprint. Let's take a look at AgentCore Runtime. With AgentCore Runtime, you can run a production-grade agent on its own, and it supports any open framework, any model, and any protocol.
There are also a few differentiators. It provides true session isolation, which means that every agent gets its own environment to execute in. It supports payloads of up to 100 megabytes for multimodal use cases. It has a very low startup time of around 200 milliseconds, and it can run for up to 8 hours.
Now, let's understand the lifecycle and how AgentCore Runtime works. Whenever a client starts a request, the default timeout is 15 minutes, or 60 minutes for a streaming application. When a session is initiated, it gets its own isolated environment. If your request is idle because you're waiting for a response, the session is suspended: the CPU cycles are suspended. This is a key differentiator.
Whenever you start an agent, you have certain variables in your code or framework, and there is a state the agent maintains. Most of the time, the agent is waiting for a response from the LLM: it has called a tool or the LLM to process certain data and is waiting for the result. With Firecracker technology, we are able to suspend the CPU cycles while the agent is waiting. Your variables and application state stay in memory, but the CPU is not being used, so you are not charged for compute while the agent is waiting.
Let's say your agent session times out after 15 minutes; when it times out, it releases whatever memory and compute it had occupied. For use cases where you need more time, you can run the agent for up to 8 hours, and we would love to hear about the use cases where you require a longer period of time.
There are different ways to achieve this. When you look at the cost, you will see that it's a consumption-based model, as it is a managed service. You don't have to choose how much CPU or memory is required: AgentCore identifies, based on the framework you use and your code, how much CPU and memory to allocate. When the agent is waiting for an LLM response, which is most of the time, you are not charged; you are charged only when active I/O is happening. Memory is charged throughout, because it needs to hold your variables for the whole session.
Let's take the example of a scenario with two different cases: compute light and compute heavy. Compute light is a very standard request with two cost components: the foundational model and the Agent Core runtime. The foundational model here is Claude Sonnet 3.5, charged per million tokens at the standard inference rate. Agent Core is charged per vCPU-hour and per GB of memory per hour.
If the agent works for 60 seconds with an input token count of 200 and output of 600, which is a very standard request, you will typically use one CPU and 2 GB of memory. Out of 60 seconds, only 20 seconds of CPU is being used, so you will be charged only for those 20 seconds. You will not be charged for the 40 seconds when the CPU is waiting. On the right-hand side, you can see that 95% of the cost is just the model and only 5% is the cost for Agent Core. This way, we are able to reduce the CPU cycle significantly.
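To make that billing model concrete, here is a small arithmetic sketch of the compute-light session above; the per-unit prices are hypothetical placeholders, not actual AgentCore or model pricing, and only the structure of the calculation reflects the description in the talk.

```python
# Hypothetical unit prices, for illustration only.
VCPU_PER_HOUR = 0.10        # $ per vCPU-hour (placeholder)
GB_MEMORY_PER_HOUR = 0.01   # $ per GB-hour of memory (placeholder)

session_seconds = 60        # total session length
active_cpu_seconds = 20     # CPU billed only while actively processing, not while waiting on the LLM
vcpus, memory_gb = 1, 2

cpu_cost = vcpus * VCPU_PER_HOUR * active_cpu_seconds / 3600            # 20 s, not 60 s
memory_cost = memory_gb * GB_MEMORY_PER_HOUR * session_seconds / 3600   # memory billed for the full session
print(f"runtime cost per session: ${cpu_cost + memory_cost:.6f}")
```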
If you look at the compute-heavy scenario, where the agent runs for one hour with 30 calls, 60,000 input tokens, and 18,000 output tokens, we are using 4 vCPUs and 8 GB of memory. Overall, across that hour, we are able to save almost 50% of the CPU cycle time. In the future, when you have hundreds or thousands of agents running multiple processes for all your employees, you will see huge cost savings as well as compute savings. This will help you reduce your carbon footprint and help you scale from a cost perspective.
Now let's look at how Agent Core Gateway helps. Agent Core Gateway is a unified way for agents to access MCP tools. For example, if you create an MCP server, first of all, you need to have a server always available. You need to manage all your permissions, networking, and compute for your server. Alongside this, you need to make sure that the tools available for your MCP server are discoverable. Agent Core will take all the heavy lifting along with identity and observability. Agent Core Gateway is a unified and fully managed service for agents to access MCP tools.
It also helps you convert your existing APIs, whether they're defined with OpenAPI or Smithy, into MCP tools. You can also make your existing Lambdas available as MCP tools without any code changes. Gateway also has a semantic search feature, so it's not only about exposing these tools; you can use semantic search to find the right ones. One more thing: Agent Core Gateway doesn't require Agent Core Runtime. You can use Agent Core Gateway with VS Code, with Kiro, with your existing MCP inspector, or with any other application as well.
A critical point that developers often miss is that in production, you have hundreds of tools. There are two approaches: you either have multiple MCP servers, each with a smaller number of tools, or you have a single MCP server with many tools. In this example, let's say an organization has 300 tools. For each LLM or tool call, you need to provide definitions of your tools so that the LLM can understand which tool is required to perform a certain action. In this case, you are providing 300 definitions with every request.
This takes up a lot of your context window. What Agent Core Gateway does is provide semantic search: it tries to understand and filter out which tools are required to perform the action. Instead of passing 300 definitions with every request, Agent Core Gateway can pass only 4 definitions in this case, which reduces the context window consumed by your MCP tool definitions by 90%. This helps in multiple ways. It makes your agentic calls much faster because it reduces the context. It improves accuracy because the model doesn't have to process additional information it doesn't need. It increases speed and reduces cost, and as we've discussed, reducing the context window also saves a lot of carbon footprint. With semantic search, you also free up context window that you can use to include more information, improving your agent and its accuracy as well.
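A quick arithmetic sketch shows the scale of the saving in tool-definition tokens alone; the average definition size is a hypothetical placeholder.

```python
TOKENS_PER_TOOL_DEFINITION = 300   # hypothetical average size of one MCP tool definition
all_tools = 300                    # tools registered behind the gateway
selected_by_semantic_search = 4    # tools actually relevant to this request

before = all_tools * TOKENS_PER_TOOL_DEFINITION
after = selected_by_semantic_search * TOKENS_PER_TOOL_DEFINITION
print(f"tool-definition tokens per request: {before:,} -> {after:,} "
      f"({1 - after / before:.0%} fewer)")
```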
If you look at the cost side of it, there are two types of costs. One is per-call pricing for tool invocations and search calls. The other is tool indexing, which is charged per month just for indexing the tools so that semantic search can work. Let's take a practical example with HR assistants. This is the time of year when open enrollment happens in the US: you have a new benefits application available, and you want to check your payroll, your benefits, or your balances. That's the kind of application where you have an HR agent. For a mid-sized organization, let's consider roughly 50 million requests per month.
For each of those requests, assume one search API call and 4 tool calls, since semantic search lets us reduce the number of tool calls. Overall, for 50 million requests, that is 50 million searches and 200 million tool invocations, and all of this comes to under $2,500. This way, we are able to help you with managed services without you maintaining any infrastructure, and you can leverage your MCP tools as well as your Agent Core runtime. To recap what we have learned so far about how agentic systems work: for production-grade agents, we have Amazon Bedrock Agent Core, where you can choose any framework, any model, and any protocol. Agent Core Runtime separates compute from memory to help you reduce compute and cost for your agentic application. Agent Core Gateway helps you unify your MCP tools and uses semantic search to reduce your context window.
In the first section, Isha explained how utilizing managed services helps, how we should choose the base model based on our use cases, use the right silicon and the right inference optimization techniques, and how we can continuously improve across the generative AI lifecycle. With that, we have a couple of resources, and we are available for any questions that you have. Thank you so much for your time today.
; This article is entirely auto-generated using Amazon Bedrock.