🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Architecting scalable and secure agentic AI with Bedrock AgentCore (AIM431)
In this video, Marc Brooker, VP and Distinguished Engineer at AWS, explores the architecture and infrastructure of AI agents through AgentCore. He demonstrates how agents loop between inference and tool calls to achieve goals, using examples like calculating mathematical functions and weather-based activity recommendations. The presentation covers AgentCore's key components: Runtime (built on Firecracker for VM-level isolation), Gateway (for secure tool connections with MCP and OpenAPI support), Memory (for user preference retention), and the Cedar-based policy layer, powered by neurosymbolic automated reasoning, for fine-grained authorization control. Brooker emphasizes the importance of tool curation, showing how excessive tools can reduce agent performance despite using cutting-edge models. He introduces online evaluations for production environments and explains how AgentCore's security model creates a controllable "box" around agents, enabling teams to build agents rapidly while maintaining security through infrastructure-level isolation rather than code-level trust.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Understanding AI Agents: A Practical Example of Looping Between Inference and Tool Calls
I'm Marc Brooker, VP and Distinguished Engineer at AWS. I spend most of my time working on agentic AI and infrastructure for agentic AI, including AgentCore. I hope most of you watched Matt's keynote this morning, which touched on some really exciting launches and some of the work that we've been doing in AgentCore to build great infrastructure for AI agents. I'm going to get into some of the details of those launches as the talk goes on.
We're going to start off by looking inside an agent and understanding what goes on inside AI agents and what I mean when I say this word, AI agents. Then we're going to talk about what agents need and what you need to get agents into production. We're going to talk about runtime (places to run code), memory, gateway (connecting your agent to the outside world), evaluations, and some of the cool neurosymbolic work the team has been doing.
I have a very important problem in my life that I need to solve every day, like many of you do. Every day I need to multiply the number of R's in strawberry by the hours of the current time and the current outdoor temperature in Seattle. Then I need to calculate the log of the gamma function of the result. It should be obvious to all of you why I need to do this. This is the kind of thing that we all do in our day to day lives. To automate this, which I was doing manually with a pencil and paper, I needed to build an AI agent. Let's talk about how I would do that and how an AI model would help me do this.
Counting the number of R's in strawberry is a famous problem. Models mostly can do this pretty well themselves. At the very least, it requires no state from the outside world. It is a fixed function of the input, and so modern models can calculate the output pretty reliably. Then we get to the next two problems. I need to get the hours of the current time.
In isolation, an AI model just can't do this. It doesn't know what the time is. But many models have the current time or time and date in their system prompt somewhere. I can put the time into the system prompt of my agent and have the model know this answer already, without having to reach out to the outside world. Models can also do math, not particularly well as we might see later in the talk, but they can do basic multiplication pretty reliably as long as the numbers don't get too big.
But then we run into problems that a model by itself just cannot do. It cannot know what the current outdoor temperature in Seattle is. I could look that up and put that into the system prompt, but then I have to look up everything and put it into the system prompt. So I need to give the model a tool to use to look up this kind of information. Finally, models aren't particularly good at anything looking like advanced math, math with big numbers, math with floating points, and so on. So I also need to give this model a tool to do this mathematical work that I'm asking it to do.
We have three problems here. We have a relatively easy closed problem for the model to solve. We have a kind of medium problem that it should be able to do, maybe with some additional prompting or maybe we'll need a tool call. Then we have these two problems that modern models can't solve, and even with a bunch of model advancement they can't solve because they're facts about the current world. We also have this larger meta problem of having to read this whole prompt and figure out what it means and decide what to do. At its core, that is what we're doing with the model.
When I talk about AI agents, what I'm talking about is systems that are given a goal. Some kind of goal here, calculating this very important number that I need to calculate every day in my day to day life. These systems loop between inference with an AI model and tool calls to achieve that goal. Eventually, ideally, achieving it and giving me a correct and trustworthy answer.
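As a rough sketch, independent of any particular framework or API, the control flow looks something like this (call_model and run_tool are placeholders, not a real SDK):

```python
def run_agent(goal, tools, system_prompt):
    # Loop between inference and tool calls until the model stops asking for tools.
    messages = [{"role": "user", "content": goal}]
    while True:
        response = call_model(system_prompt, messages, tools)  # one inference call
        messages.append(response)
        if response["stop_reason"] != "tool_use":
            return response["content"]              # the model produced a final answer
        for request in response["tool_calls"]:      # the model asked for one or more tools
            result = run_tool(tools, request)       # execute the tool and capture its output
            messages.append({"role": "tool", "content": result})
```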
This isn't quite the entire picture for modern agents. More and more agents I see are including code in their definitions: tools defined in code, code generated by the model, and code that implements a few steps of a workflow, where it is used to improve reliability and lower latency because it requires fewer inference calls.
Building Agents with Strands Framework: From Code to Bedrock API Implementation
Let's dive one layer deeper into the implementation of this very useful agent that I'm building. I'm going to use the Strands agent framework. This is an agent framework that we developed at AWS to build our internal agents, including those foundational agents you saw launched in the keynote today. With AgentCore, you can use any AI framework. You can use LangChain, you can use whatever SDK, you can build your own SDK, or just use no framework at all. I like Strands because it's very flexible. It allows me to build anything between agents that are just a single prompt and agents that are complex, hard-coded workflows.
There are also some cool new features in Strands like constraints that we've launched over the last few weeks. Here's what a snippet of the Strands code looks like. I'm using this @tool notation to locally define a tool that is available to my agent. I can also connect tools into this agent using an MCP client, using an OpenAPI client, or over any other protocol that I choose. The next line of code is the definition of this tool. This is a tool description that is given to the model to help it understand when this tool should be used. If you're not familiar with the gamma function, it is a kind of real-number version of the factorial function. So if you ever wondered what 5.5 factorial is, this is the way that you work it out.
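Putting those pieces together, here's a sketch of what that Strands code could look like (the tool bodies and the weather helper are illustrative, not the exact code from the slide):

```python
import math
from datetime import datetime
from strands import Agent, tool

@tool
def get_current_time() -> str:
    """Return the current local time as an ISO-8601 string."""
    return datetime.now().isoformat()

@tool
def get_weather_fact(city: str) -> str:
    """Return the current outdoor temperature in Fahrenheit for a city."""
    return lookup_temperature(city)  # hypothetical helper wrapping a weather API

@tool
def log_gamma(x: float) -> float:
    """Return the natural log of the gamma function of x. Use this tool for
    any advanced math rather than calculating it yourself."""
    return math.lgamma(x)

# The docstrings above double as the tool descriptions given to the model.
agent = Agent(tools=[get_current_time, get_weather_fact, log_gamma])
result = agent("Multiply the number of R's in strawberry by the hours of the "
               "current time and the current outdoor temperature in Seattle, "
               "then calculate the log of the gamma function of the result.")
```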
Then I define my agent. I say, here's a tool for getting the current time, a tool for getting the weather, and this math tool. Then I pass my prompt into the agent. I pass in that goal. This is as simple as implementing an agent in Strands can be: just a handful of lines of code in Python. I think the minimum one is probably four lines of code. Let's go one layer deeper and talk about what's happening at the inference API. Here we're going to look at the Bedrock API. One of the cool things about AgentCore is that you can use it with any inference API. But here I'm going to use Bedrock because it's a great, secure place to get at cutting-edge AI models.
Here's what goes on the wire when I make my first inference call to Bedrock. I'm telling the model what my goal is: multiply the number of Rs in strawberry, and so on. I'm giving the model my tool specifications. This is a list of tools and their descriptions that the model can then use to decide when to make a tool call to get a particular piece of data. Then I'm giving it a system prompt that I defined. I didn't put this in the Strands code, but this is a system prompt I built in that describes the overall goals of this agent. Here I'm saying, you are a helpful assistant, and giving it a few extra hints about the tools that it has. This isn't strictly necessary, but in this case, it improves the performance of the model.
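In boto3 terms, that first request against the Bedrock Converse API looks roughly like this (the model ID and schema details are illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

messages = [{
    "role": "user",
    "content": [{"text": "Multiply the number of R's in strawberry by the hours of the "
                         "current time and the current outdoor temperature in Seattle, "
                         "then calculate the log of the gamma function of the result."}],
}]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    system=[{"text": "You are a helpful assistant with tools for time, weather, and math."}],
    messages=messages,
    toolConfig={"tools": [
        {"toolSpec": {
            "name": "get_weather_fact",
            "description": "Get the current outdoor temperature in Fahrenheit for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }},
        # ...toolSpec entries for get_current_time and log_gamma elided...
    ]},
)
```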
So the first time I call the inference API, the response I get back is a tool use request. Bedrock says to me, okay, I ran the model, and what the model asked me to do is run this tool and call it again. So here I have to call get_weather_fact, which is going to get this outdoor temperature, which the model needs. It has looked at that prompt and planned the way that it's going to do this. It's obviously going to get the temperature and then it's going to do some math, and we're going to step through this and go around the loop. If I ran this again, it might do the steps in a different order. This is one of the challenges of building reliable AI agents. But here I'm getting the weather fact.
I call the tool locally. Strands handles this for me. I don't need to do it. But I essentially just run that Python function and give its results. Here you can see that the request is growing and growing. This is the second call, the second turn around the loop. I give the model the whole thing again. My original content, multiply the number of Rs. Its response, saying, I'll help you calculate this step by step. Let me gather the information. Let's do a tool call. And then my response saying, I did that tool call for you.
Here is the text that I got back from the tool call: apparently 3, which is a bit of an alarming outdoor temperature, at least in Fahrenheit for Seattle. Hopefully it's a little bit warmer than that. So you might be wondering, well, if we're just growing this over and over, aren't we doing N squared work? And yes, in the naive implementation, that is basically what we're doing, but there are some techniques that we can use to bring that back to a more scalable footing.
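Continuing the same sketch, each turn appends the model's toolUse request and our toolResult to the message list and calls Converse again; Strands does this bookkeeping for you, but the wire format looks roughly like this:

```python
# The model's reply asks for a tool call: stopReason is "tool_use" and the
# assistant message contains a toolUse block naming the tool and its input.
assistant_message = response["output"]["message"]
tool_use = next(block["toolUse"] for block in assistant_message["content"]
                if "toolUse" in block)

# Run the tool locally (here, just a plain Python function) and wrap the result
# in a toolResult block tied back to the toolUseId the model generated.
temperature = lookup_temperature(**tool_use["input"])  # hypothetical weather helper

messages.append(assistant_message)
messages.append({
    "role": "user",
    "content": [{"toolResult": {
        "toolUseId": tool_use["toolUseId"],
        "content": [{"text": str(temperature)}],  # e.g. "3" in the talk's example
    }}],
})

# Call converse() again with the grown message list, and repeat until
# stopReason comes back as "end_turn" with the final answer as text.
```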
I'm not going to go down to the next level, but what's happening inside Bedrock here is taking this structured API and turning it into a blob of text or a blob of tokens that are passed to the model. Exactly how that is done is model dependent. There are a bunch of delimiters. Some of them look a bit like JSON, some of them look a little bit like ad hoc XML, and some of them look like magic spells. But essentially, I take all of this stuff, I put it into a blob of text, I pass it to the model, and the model responds to me.
Now if you're paying attention to that, you'll notice that delimiting is a little bit ad hoc. And it is one of the core challenges of building secure agents. Because if you're a security person, you will notice that this is in-band signaling, where we are putting control signals to the agents and user-defined content into the same document in a way that is not particularly clearly delimited. So here I go around and around this inference loop. In this case, I go around it on average about 3.5 times to solve this problem. And as I go around it 3.5 times, it sort of steps towards the solution and gives me the result.
That 3.5 times is also a little bit surprising. It means that there's some nondeterminism in the system. Sometimes the agent is able to figure out how to do it in 3 steps, and sometimes it's able to do it in 4 steps. If I wanted to optimize this to reduce my token usage, given that it is a fixed plan every time, I might lift the planning step away from the LLM and turn this into a more fixed function, step-by-step workflow. I might still want to use the LLM to put the big picture together, but not necessarily to figure out the step-by-step plan.
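As a rough sketch of what that optimization could look like, the deterministic steps become plain code, using the same hypothetical helpers as before, and the model is only consulted where language understanding is actually needed:

```python
import math

def run_fixed_workflow() -> float:
    # Deterministic steps done in plain code: no planning tokens, no extra
    # round trips through the model.
    r_count = "strawberry".count("r")            # 3, and it never changes
    hour = get_current_hour()                    # hypothetical helper for local time
    temperature = lookup_temperature("Seattle")  # hypothetical weather helper
    product = r_count * hour * temperature
    return math.lgamma(product)                  # log of the gamma function
```

A single model call could still be layered on top if I want the answer explained in natural language, but the plan itself no longer costs inference tokens.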
AgentCore Components: Runtime, Gateway, Memory, and Security Infrastructure
Now we understand the internals of agents. Let's step a little bit into what components you need to run agents successfully in production. And here we're going to use a slightly less silly example. I like to do some outdoor sports with my family, and that is very weather dependent in the Pacific Northwest. So I want to build an agent that gets the snow conditions, weather, and river levels near me, and then tells me if I should go skiing or boating. That's a practical thing to want to do with a personal agent.
How might I run this in the cloud? Well, again, I might implement this agent using a framework like Strands. I need to run that agent code somewhere. I need to connect that agent code to the tools that are going to provide it the information that it needs to run the agent—the weather API, the river-level API, the snow-level API. Maybe if I choose to go skiing, I might ask it in my system prompt to book parking for me. So I want to give it an API to do that too. I need to connect it to memory so it remembers my preferences over time. We'll get into that in the talk. And then I might need to have guardrails that make sure that the responses are safe.
So the core of AgentCore is AgentCore Runtime. This is the place where you can securely run agent code. This is where I would run that Strands code. When I say securely, I mean isolated with a per-session, hardware-backed virtual machine. This is one of the most powerful features of AgentCore, providing really strong security isolation around code. I'm going to go into that later. Then we have AgentCore Gateway. AgentCore Gateway allows you to attach agents or connect agents to the tools inside your company or to externally available tools. You're not going to transform your whole company to agents overnight. You've still got microservices. You've got tons of databases. Those databases hold some of your company's most important assets. So being able to connect them to your agents securely, in a place where you can do audit and control, is very important.
AgentCore Gateway makes it really easy to connect OpenAPI tools, MCP tools, and other types of tools into your agents, talking the protocol that your agents use. It provides features to expose the right set of tools to your agents through tool curation, which I'm also going to talk about a little bit later in the talk. Then we have memory, which allows our agent to remember things over time between sessions. It allows state to be sticky between sessions. I talked about how runtime gives you a new VM per session, and at the end of that session, all of that state is securely deleted. That's great for security, but it does mean that your agents have no memory whatsoever. So you bring that back with an explicit memory component, where you add back in the ability for agents to remember between sessions.
I need to connect the user's identity to tools. Here, that ski parking booking API might need an identity from me as a human. So I might need to go through an OAuth-type flow and provide credentials for that identity to the agent to use when it makes the decision to use them. AgentCore Identity makes it easy to set up these flows in a secure way. I also need to be able to connect my agents to websites. We could all hope that everything we interact with has an API, like that ski parking API, but it doesn't. So we also provide a browser tool, which is a secure, isolated environment that allows your agents to automate a browser and figure out how to click through a website flow.
What I haven't pictured here is AgentCore Code Interpreter, which is a secure environment for you to run code. I didn't picture it here because AgentCore Code Interpreter is very useful, but it's also a little bit of a minority case. Because of the isolation model of AgentCore Runtime, in a lot of cases, you can take the code that your agent generates and run it right there in the runtime, where it will still be strongly isolated from other sessions and other users of the same agent. Code Interpreter is super useful when you want to give that running code less than the full permissions that your agent has.
AgentCore Runtime: Firecracker-Powered Serverless Isolation with Per-Session Virtual Machines
Let's dive a little deeper into each of these components. Runtime is a place for your agents to live. It's a place for your agent code to run. Hooking your agent up, an agent built in Strands or LangChain or whatever, is very easy. It's just a handful of lines of code. You add this app entry point annotation, and then essentially just run your code. Here, I'm passing in the gateway URL and a little bit more configuration. I'm running the agent, and then I'm returning the response.
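A minimal sketch of that wiring, assuming the bedrock-agentcore Python SDK (the import path, payload keys, GATEWAY_URL, and the build_tools helper are assumptions, not the exact code from the slide):

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # import path may differ by SDK version
from strands import Agent

app = BedrockAgentCoreApp()

# Created once per runtime micro-VM; in-memory state lives here for the whole session.
agent = Agent(tools=build_tools(GATEWAY_URL))  # hypothetical helper wiring in Gateway tools

@app.entrypoint
def invoke(payload):
    # Each session gets its own isolated VM, so this handler only ever sees one
    # user's conversation; calls with the same session ID land on the same VM.
    result = agent(payload.get("prompt", ""))
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()
```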
What you'll notice here is this allows a multi-turn conversation with an agent inside the session. So I could easily build a client that allowed me to have a step-by-step conversation with this agent, as I might have with an agent that I asked to help me plan re:Invent, for example. If you came to the innovation talk yesterday, you would have seen my silly re:Invent agent, and that was implemented in exactly this way. If I call again with the same session ID, I will get the same runtime with this copy of the agent running with everything that was in memory before. If I call with another session ID, I will get a different VM, a new VM that is strongly isolated from this one.
Agent runtime, I believe, needs to be serverless, scalable, secure, and offer fine-grained isolation. Agents introduce a novel set of security threats, and one of the great ways to mitigate that is by strongly isolating agents, for example in a VM, where you can make precise statements about the way that state flows around your system and what agents can do. We'll talk a little bit more about safety later. I like serverless in this space because it allows me to use operational agents, like the one that Matt announced in his keynote this morning, to do all of these operations for me without having to worry about the complexities of running infrastructure.
Agent Core Runtime is built on Firecracker. Firecracker is a micro VM hypervisor that we built approximately seven to ten years ago and announced at re:Invent 2018.
This is a piece of technology we use to power AWS Lambda, DynamoDB, and many of our analytics tools. We've used Firecracker with its proven track record of high security, strong stability, and performance to provide powerful isolation for agents. This is a real virtual machine started up for your agent in milliseconds. If you want to know how that works, I'm going to go super deep into that down to the memory page level in a 500-level talk tomorrow. Using Firecracker allows us to bring the fine-grained isolation level of Lambda and make it even finer-grained down to the session level while still offering great performance, scalability, and cost-effectiveness for your agents.
Unlike Lambda, AgentCore Runtime is priced based on busy time. Here you can see that I have multiple conversations with my agent. In the purple, it's running and using both CPU and memory. Then it gets back to me and says I need some more input to do this task. During the time that it is waiting for me, waiting for a tool call, or waiting on something to happen, you don't pay for the CPU of the AgentCore Runtime. The cost of running an agent drops substantially, by about 20x on average, between when it is running and when it is idle waiting for a response. This is a huge cost benefit for agents that spend time waiting on humans, waiting on long tool calls, or waiting on other agents.
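To make that concrete, here is a toy cost calculation; the per-second rates and the exact idle discount are made up for illustration and are not from the talk:

```python
# Purely illustrative rates, not AWS pricing.
ACTIVE_RATE_PER_SEC = 0.0001                   # CPU + memory while the agent is busy
IDLE_RATE_PER_SEC = ACTIVE_RATE_PER_SEC / 20   # memory only while waiting (~20x cheaper)

session_seconds = 600   # a 10-minute conversation with the agent
busy_seconds = 30       # the agent only computes for 30 seconds of it

busy_time_cost = (busy_seconds * ACTIVE_RATE_PER_SEC
                  + (session_seconds - busy_seconds) * IDLE_RATE_PER_SEC)
always_on_cost = session_seconds * ACTIVE_RATE_PER_SEC  # paying full rate for the whole session

print(f"busy-time billing: ${busy_time_cost:.4f}  vs  always-on: ${always_on_cost:.4f}")
```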
Now we have runtime isolation. My conversation with this agent and another customer's conversation with this agent gets different VMs. This is interesting for security reasons because it means I don't have to worry about my conversation and another user's conversation being mixed up. There is no prompt injection, no bug in the code, or even remote code execution bug that would allow me to access the content of another concurrent session. It is running in a different hardware-backed virtual machine, strongly isolated from this one.
This is also nice because it helps you write simpler agent code. You don't have to think about multi-tenancy when you're writing your agent code. You don't have to think about building your agent code to be multi-threaded. You don't have to think about propagating identity and cleaning variables and clearing memory in agent code. That is all handled in the infrastructure. You can write your agent code in a very simple single-threaded, single-user way and have all of the isolation and multi-user handling done by the infrastructure.
Building agents reliably is challenging enough, and so we've tried to use the infrastructure here to simplify the journey of building agents and make it much easier. It also means you can bring any framework, any library, any language. I've used Python, and I also like to build agents in Rust from time to time. I like to bring libraries, maybe another math library to do a gamma function. All of these things are possible because it is a real virtual machine. You're not limited in what you can bring. You can even mix languages. You can create new processes. You can fork off new threads, whatever you need to do to get the job done.
AgentCore Memory: Enabling User Preference Persistence Across Sessions
One of the really annoying things about talking to naive agents is that they forget things. Here I'm interacting with an agent to help me choose a meal during the week. The first thing I'm going to say to it is, I don't like pizza. Then I'm going to come back to it the next day and say, suggest a meal for me. And it says, have you tried that new pizza place? If this was a joke, it would be a pretty decent one, but it is really annoying that agents don't remember this.
Memory is the simple idea that we can take the conversations that I have with agents and write them down and reuse them.
There are many forms of memory, but probably the most useful single one is user preference memory. Behind the scenes, with just a few clicks in AgentCore, we can set up a pipeline for you that takes the conversations that your users have with your agents, runs them through a model that extracts the preferences that those users have—like, I don't like pizza—and provides those to future invocations of this agent for the same user. User isolation is handled, the setting up of the pipeline is handled, the prompting of the model is handled, retries are handled, and everything is built into AgentCore memory. All you really need to do is create the memory resource, prompt your agent to use it, and wire it in.
It's a little bit more boilerplate. I was hoping we would fix this and have the SDK make it a lot cleaner, but we didn't have time, unfortunately, so you get to see the long, ugly version for the couple of weeks that it's going to be around. There are a couple of small things here worth pointing out. When retrieval happens and I run the agent again, I'm asking for a maximum of 5 facts about this user, the 5 most relevant facts. I'm also setting a relevance filter, saying don't give me things that are truly irrelevant to this conversation. This is a way of managing context size.
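As a rough sketch of the retrieval side, with client and parameter names that are assumptions rather than the exact AgentCore SDK surface:

```python
# Hypothetical client and parameter names; the real AgentCore memory SDK may differ.
memories = memory_client.retrieve_memories(
    memory_id=MEMORY_ID,            # the memory resource created earlier
    namespace=f"/users/{user_id}",  # keeps each user's extracted preferences separate
    query=user_message,             # so retrieval returns facts relevant to this turn
    top_k=5,                        # at most the 5 most relevant facts, to bound context size
)
# A relevance threshold keeps truly irrelevant facts out of the prompt; the
# retrieved preferences are then added to the agent's context before it runs.
```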
If you're an agent builder, you will have noticed that if you tell a model a bunch of irrelevant stuff, its performance tends to go down. It's useful to tell models relevant things. With that integration with memory, I can say I don't like pizza. It's going to remember that. It's going to take the trace of that conversation, extract that preference, and write it down in AgentCore memory. That takes up to a second or two to happen in the background. Then the next conversation I have with the agent, it remembers that for me. It's a very simple way to make agents feel a lot more useful to humans, a lot less cold, to have them remember the things that your customers say.
AgentCore Gateway and Tool Curation: Balancing Performance with Accuracy
Let's talk about Gateway and how agents do things. Fundamentally, tool calls are what make agents useful. If an agent can't call tools, if it can't interact with the outside world, it's not doing a whole lot. You can have agents that just talk to customers and chat with them, but it is tool calls and their ability to have side effects on the outside world that makes agents so powerful and useful. Gateway was built to make those tool calls easy to set up, easy to curate, and possibly most importantly, controllable and auditable. You're able to set policy, which I'll dive into in a minute.
Tools allow you to connect your agent to the data you need and the effects that your agent needs to have on the outside world. This could include services and microservices, data in databases, data in storage, software as a service, applications, or even other agents. Gateway provides a single place to connect all of those into your agent. It also provides a place to do tool curation. One of the anti-patterns I see when folks build agents is they provide every tool in their organization. If somebody said to you, "Here's a toolbox with 1000 tools. Good luck picking up the right one for the job based on a little label on each one," that would be overwhelming. Curating sets of tools and giving them to agents is really critical to having your agents perform well and having them have good cost-performance trade-offs. Setting the right tool description is also super useful.
Let's talk a little bit about some data on tool curation. Remember I talked about multiplication, and how models are sometimes good at doing it themselves. I also showed, when we were talking about the Bedrock API, how a tool call is another trip through the model; most of the time models can't do multiple tool calls in a single trip, and each trip consumes input tokens and output tokens. I'm comparing the performance of that first agent that I built between a version where the model does the math and a version where I give it a multiplication tool and have the multiplication tool do the math.
Looking just at tokens, what we see is that doing the math in the model is substantially more token efficient. It's substantially lower latency and higher performance. That's great news. On that view, giving this model more tools has made the agent's outcome substantially worse, even in this very simple example with a cutting-edge model.
But before we get to the conclusion, there's a little bit of a caveat. It only got the multiplication right about 40% of the time. Giving it a multiplication tool cost more tokens and more latency, but made the success rate of the agent jump up from 40% to 100%. I'm defining success as correctly calculating the result. Tool curation isn't about only giving a minimal set of tools; it's about giving the right set of tools to an agent. Gateway provides a very convenient place to do that outside the agent code where you can iterate on it, test it, and do many other things.
Neurosymbolic AI and Agent Call Policy: Mathematical Certainty for Agent Authorization
Now let's talk about one of the most exciting launches from this morning. One of the things that we've been doing at Amazon for over a decade is investing heavily in automated and formal reasoning. This is the kind of reasoning that maybe would have been called AI in the 1990s, using things like SAT and SMT solvers and other formal reasoning approaches to reason about mathematics symbolically. It is an extremely powerful set of tools that allows us to say very precise things about the world.
It also was a little bit of a dead end on the route to AI because it is quite inflexible. It doesn't have the same kind of natural language understanding that LLMs have, and it's not great at planning in the ways that some LLMs are. We've been investing heavily over the last few years at AWS in neurosymbolic AI that mixes the neural approaches with the symbolic approaches to provide the best of both. What excites me most as a technologist about this policy launch is that it's a big release based on neurosymbolic AI.
You would have seen automated reasoning guardrails for Bedrock released at re:Invent last year. Based on similar technology, we've been using that to iterate quickly on making this more flexible, more powerful, and more accurate. We now use it here to power agent call policy. Let's talk about what policy is. Agent call policy is a layer inside Gateway that allows you to very crisply, with mathematical certainty, say what your agent is allowed to do.
When I think about agent safety, I think about what agents are allowed to say, and that's where I would use something like Bedrock guardrails. I also think about what agents are allowed to do, and that's where I would use authorization and Bedrock agent call policy. Policy sits on the call path out of your Gateway to your tools, to those data sources, to those services, to that software as a service. It allows you to specify at a very fine grain exactly what your agents are allowed to do with those tools.
This is done with the Cedar policy language. Cedar is a policy language that is designed to be powerful and expressive, has an implementation that is formally verified, and has really cool semantics for doing things like composition of policies. If you're a security person, you might have tried to take multiple policies and understand what happens when you stick them together into one policy. Cedar has extremely well-defined semantics for composition that allow us to run many policies very efficiently and with well-defined semantics.
The other great thing about Cedar's crisp mathematical nature is that the policy layer will give you a statement, grounded in mathematical logic, about why it made a particular decision. Ideally, you never look at these. But if one day your agents do something unexpected, they're there for you to look at.
You can see exactly what the inputs and outputs were that made your policy not have the effect that you thought it was going to have. Here's an example of a policy. It's very simple. This is an authorization example. All I'm saying is that this tool call is only allowed for one particular principal, one particular OAuth principal in this case. I can attach this into my gateway without having this feature built into any of the tools, without changing my agent code, and give different users of my agents different tool call permissions. I can then have my security team look at and audit all of these policies, see what they look like when composed together, and understand their effects.
Here's a slightly more detailed example. Those two agents that I talked about earlier in the talk get weather data. I asked them to get weather data from Seattle. But maybe they're going to hallucinate. Maybe they're going to make bad decisions. Maybe one of my users is going to try and make that agent go off track and be a general weather agent, looking at weather around the world. So here I've written this policy saying when you call the weather service API, it can only be called for Washington State. Maybe I could scope that down further and say it can only be called for Seattle. No matter what the user does with this agent, what kind of prompt injection they do, even if they get remote code execution, whatever decisions the model makes, there is no way for this agent to call the weather API without setting its scope to Washington.
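A hedged illustration of what such a policy could look like in Cedar; the entity types and action names here are invented for the example, not the actual AgentCore schema:

```cedar
// Invented entity types and action names, for illustration only.
permit (
  principal,
  action == AgentCore::Action::"InvokeTool",
  resource == AgentCore::Tool::"get_weather"
)
when {
  // Whatever the model decides or the user injects, the Gateway only
  // authorizes weather calls scoped to Washington State.
  context.input.state == "WA"
};
```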
This is a really powerful place to control what your agents can do and what they can ask for, without needing that kind of fine-grained authorization baked into every tool and without needing to trust the agent code. When I think about safety and security of agents, I think about putting agents in a box. This is the architecture in my head. Here I have two agentic applications, one of them with two agents, one of them with agents and memory. Each one gets its own gateway and its own policy layer that defines, for that application, what tools it can have (this is the tool curation that I talked about) and what powers it should have when it calls those tools.
You might be familiar with gateways in other contexts—things like AWS API Gateway. You might have the mental model of putting a gateway in front of each tool. My favorite use of AgentCore Gateway isn't like that. I want to associate an AgentCore Gateway with each application, with each set of agents with a defined purpose. And there I can govern them, I can connect them to the right set of tools, and I can set policy based on that application. This is a really powerful way to control what your agents can do in your organization outside the agent code.
Now remember that these agents are running in their own virtual machines, and so they have no way to get out of this box except through the gateway. The only way they can send packets to the world is through the gateway. Writing things down in memory or on storage has no effect, because that memory and storage is erased at the end of the session securely. And so here I have built a box around my agent. I don't need to worry about what's happening inside its head. I don't need to worry about untrusted inputs. I only need to worry about what it can do in the outside world.
AgentCore Evaluations: Online Feedback Loops for Production-Ready Agents
This isn't a mental model you can use for every agent. Some agents are too complicated; some policies are too subtle to express this way. But it is the beginning of the way that I always think about building safety and security into agents. And with policy available now, you have a really powerful toolkit to express what you need to: what's allowed out of this box, what's allowed into the box, and so on. Now let's talk about evaluations. This is one of the other things that Matt announced this morning about AgentCore, and I think that evaluations are so powerful.
What are evaluations? Well, evaluations are based on a simple idea: is my agent doing the thing I'm expecting it to do? That's a subtle question, more subtle than you might think. Is it doing things efficiently? Is it communicating with users in a good way? Is it calling tools in the right way? I want to measure these things so I can know how to operate my agent. I can know when it's off track, so I can improve my prompts, improve my tool descriptions, and improve my memory policies.
I want to measure these things so I can know whether I can safely change models. Maybe I want to try out a new model. Maybe I want to try out a smaller model to save time and money. Maybe I want to try out a big, exciting new model to see if I can make this agent even more powerful. Evaluations help you answer this basic question: are those things working?
If you're like me and come from a traditional software engineering background, you would have built CI/CD pipelines where you have multiple steps of testing and evaluation. You put something into a pre-production stage (gamma, in Amazon terms), you run some tests against it, you see whether it works, and you set standards. Evaluations are the tool that allows you to do those kinds of things on AI agents. But they're different from regular integration tests because they're more open-ended. By their nature, agents are more flexible. They have more agency. The whole reason you're building an AI agent instead of a traditional workflow is because you want it to be flexible, adaptable, and maybe even creative in finding solutions.
If you just wanted to go through a fixed function workflow, you would use step functions, which is a great solution for that. You wouldn't build an agent. You want to build an agent because it has dynamic behavior, but then you have to answer: is this dynamic behavior good? It is a tough question. I'm an electrical engineer by training, so I think about the world very much as a set of feedback loops. Evaluations are feedback for agents. They allow us to feed the agent's performance back into our development processes so we can improve agents.
Dynamic systems fundamentally don't work reliably without feedback, and so evaluations are the core of making this work. Unlike a lot of folks in the industry, we have focused on what we call online evaluations for AgentCore. That means evaluating agents in the infrastructure, in the cloud, rather than at research time. This is a product mainly aimed at developers, system operators, DevOps and SRE teams, rather than at science and research teams. I love my science and research teams, but they have a lot of evaluation tools to choose from. AgentCore evaluations is there for when you get into or near production, when you want to test against the real tools, ideally with real user inputs, and know when things are working.
You can use evaluations at development time too. There's nothing stopping you from putting your agent traces into AgentCore observability and having them evaluated while running your agent on your desktop. That's exactly what I do when I build agents. But it is most powerful once you get into production, once you get onto the real infrastructure, and once you are interacting with real tools and real user inputs. Because ultimately, that's what matters.
We all know as software developers that writing tests against stubs and mocks has useful utility, but it has fairly limited utility. Those mocks and stubs never quite have the same behavior as the real world. We always learn something new about our software when it gets real data and real user input. Agents are the same, perhaps more so. You're going to learn so much about the behavior of your agents when they are exposed to the real world, when you are actually asking them to do that dynamic process of going around a loop and solving for a goal.
Evaluations is the beginning of a journey, but it is already super powerful. We have canned evaluators built into evaluations with a single click. You can set up a goal success rate to measure how often your agent is actually able to iterate around until it achieves its goal. You can set up a conciseness evaluator, perhaps wanting your agent's result to be something like "you should ski today" or "you should boat today" instead of a 3,000-word essay about its life story. We all know that models sometimes are more verbose than we would like. You can set up a tool calling success rate to measure how often your agent is calling a tool, succeeding at providing the right inputs, succeeding at passing those policies, and succeeding at providing the right authentication and authorization materials. There are many more evaluators built into AgentCore, allowing you to think about the performance of your agents without having to build an evaluation pipeline that gathers all of that data yourself.
This is just scratching the surface of the components you need to bring agents into production. Over the last year, I have been focusing on this because I spoke to a lot of customers who were saying they were building agents on their laptops and desktops and having a great time. The proofs of concept looked fantastic, but then they did not know how to get into production. They did not know how to make it reliable, secure, isolated, or operable. They did not know how to observe it or control it.
As we built AgentCore, our focus has been solving those problems of how you run real agents in production, how you get to reliability and security, and ultimately how you achieve positive ROI on your businesses. When I talk to folks, they want their teams to move fast. They want to unblock their developers and even non-developers and business users to build agents with low risk. At the same time, they say they want to move fast, but they need to move fast securely and safely.
By building a box that you can put agents in, that you can reason about, and that your security team can think about, you unlock your development teams and say, go off and build agents. We are going to give you the sandbox to play in. Once you have shown that you can do the right thing, we will start giving you access to production tools where you can get access to real data, access to real customers, and then ultimately start changing the world and having those side effects. This is the core endpoint of the agent journey.
This is all about accelerating the building of agents by providing technology and infrastructure that makes it easier, faster, and lower risk to build agents.
This article is entirely auto-generated using Amazon Bedrock.