Kazuya

AWS re:Invent 2025 - Agents in the enterprise: Best practices with Amazon Bedrock AgentCore (AIM3310)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Agents in the enterprise: Best practices with Amazon Bedrock AgentCore (AIM3310)

In this video, AWS Principal Product Manager Kosti Vasilakakis and Tech Lead Maira Ladeira Tanke present best practices for moving agentic AI from POC to production using AgentCore. They outline nine key rules including starting small with defined use cases, implementing observability from day one using OpenTelemetry, exposing tools with clear descriptions, adopting multi-agent architectures, scaling securely with user-specific memory and identity controls, using deterministic code for calculations, and continuous testing with evaluations. The session features live demos showing how to build agents with frameworks like Strands and LangChain, integrate Gateway for API access, implement Code Interpreter for analysis, and deploy with Runtime's micro VM isolation. Phil Norton from Clearwater Analytics shares their journey building 800 agents and 500 tools, emphasizing that "context is king" and highlighting their migration to AgentCore for zero-downtime deployments and eliminating noisy neighbor problems in their financial data processing workflows.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Bridging the Gap Between Agent Demos and Production Scale

Hi everyone. If you've ever been in an agent demo that went perfectly and then thought about how to actually do this at scale across many users, how to do this securely, and how to do this in a way that you can trust it performs well at scale, then you're in exactly the right room. Quick raise of hands if you don't mind—if you are a developer or an engineer working on agents. And if you are representing the platform team that is enabling infrastructure for the developers. If you are from the business side supporting or using those initiatives. We have about a 50-20-30 split. Thank you all for being here. The session is built with these three personas in mind, and we'll share a lot of best practices about what it takes for all of you to help bring agents into production.

My name is Kosti Vasilakakis, and I'm a Principal Product Manager on Agentic AI. I had the pleasure of being with the team since the early days of the vision. We built the PRFAQs, we launched the service, and we've iterated with many hundreds of design partners who not only helped us build the service but whom we also helped bring into production. I'm very excited to be sharing a lot of the lessons we learned. With me on stage is Maira Ladeira Tanke, who is a Tech Lead on the Agentic AI team and has helped many hundreds of customers go from POCs to production. I'm sure she's very excited to share a lot of the lessons we've learned.

Thumbnail 140

Later in the session, we have the pleasure of hosting Phil Norton from Clearwater Analytics. He's a Senior Manager of Software Development, and Clearwater Analytics is a unified investment platform with more than $10 trillion in assets under management. They have a lot of agents already in production, and I'm sure you'll learn a lot from Phil. Today's session will cover what separates POCs from production, some of the best practices and lessons learned as you think about taking your agents into production, and of course we'll hear from Phil a lot of the insights that he has to share. This session is full of insights plus demos, and the pace is probably quite fast. Please hold your questions, and we'll be very happy to answer them right after.

Thumbnail 170

The POC-to-Production Chasm: Six Critical Capabilities for Scaling Agents

Now let's talk about what we're hearing from the field. Our customers keep telling us that moving from POC to production is difficult. They keep describing this POC-to-production chasm. Why is it so difficult to go from your demo to a production application that scales across different users and provides the governance that you need? Well, the key to that is in six different capabilities. First of all, you need to have accuracy. You need your agent to work in real life with real users. Users tend not to behave like developers expected. Users tend to want more once they see how cool these applications are. As we see generative AI and agentic AI evolving, we also see user experience changing and user expectations evolving. Keeping this accuracy is extremely important.

Then there's scalability. Working across different users and different domains, you want to have your agents scaling seamlessly while keeping this hyper-personalization. The beauty of generative AI is that you can have your application being tailored to your needs, being tailored to customer needs, and being completely different yet similar. This brings us to the memory part of it. Handling memory across users and across different agents while handling this memory securely is crucial. Security is a crucial part of the journey to production, and having those agents being secured in production is essential because those agents are accessing live systems. They are accessing real data, and this data access needs to be secure.

Then there's cost. Models keep getting cheaper and applications keep getting more viable, but we are talking about a lot of tokens. Agents can reason, but they use tokens for that. Hosting infrastructure requires cost, and you want to be able to understand where the cost of your application is so that you can control the cost.

You cannot understand the cost, and you cannot understand the accuracy of your agent; you cannot do any of this without an observability pipeline. So you need to be able to understand what your agent is doing in order to create and evolve these agents. This is a crucial part of the job.

This comes to the last part of the challenge, which is monitoring. After you create your agents, you have the observability of everything that is happening, and you want to monitor it in production. We are seeing a transformation in how things are working. We are seeing open source moving really quickly, and in months the technology completely changes. So how can we embrace open source while keeping all of those requirements?

Thumbnail 380

How can we embrace open source securely? How can we embrace open source scalably with monitoring, with observability, cost control, all of it? That's the biggest question. I'm sure it sounds familiar, and that's actually what we've been hearing from our customers. This is why we built AgentCore. These are the very reasons we decided to invest in agentic infrastructure.

Thumbnail 400

AgentCore Overview: A Modular Infrastructure for Agentic AI

Before we discuss the best practices, let me give a quick overview of what AgentCore is. Let's start with Runtime. Runtime is a secure serverless hosting engine for your tools and agents. You can build any agent using any framework, for example LangGraph, LangChain, CrewAI, and Strands. You can leverage any model from Bedrock and, of course, other models like OpenAI and Gemini. You can host all of them and satisfy real-time and long-running use cases.

Thumbnail 450

You not only have an endpoint that you can use for interactive chats, but you can also host long-running deep research as well as coding agents. Runtime is MCP and A2A compatible, which means you can host your tools and have an MCP interface on top of them, or you can allow your agents to collaborate using Google's A2A protocol. Because agents need to remember, we also built Memory. Memory offers short-term and long-term capabilities: short-term is within the session, while long-term means that across sessions the agent can remember facts and summaries from the interactions it had with that particular user.

Thumbnail 470

Thumbnail 490

Because an agent is only as good as the insights and data that it has access to, we built Gateway. You can expose any internal API, any Lambda, any MCP server into that gateway, and you can expose that gateway to any number of agents. But how do you control who has access to the right agent and right tool? This is the reason why we built Identity. Identity helps you use your workforce credentials, for example Okta, Entra ID, or Cognito, to establish who has access to Runtime and Gateway, but also who has access to third-party resources from the gateway, like AWS resources as well as third-party services like Jira and Slack.

Thumbnail 520

Because you want to enforce certain rules on those tools, we built Policy, which we launched yesterday. You can define in real time exactly what controls you want. For example, a particular user should not be allowed to proceed with a transaction of more than one hundred dollars. You can establish all of that with Policy. Many of the tools that customers are using tend to be similar, so we decided to invest in some tools.

Thumbnail 540

Thumbnail 570

We offer, for example, Browser and Code Interpreter. Browser is a headless browser for your agents to navigate the web, and Code Interpreter is an isolated execution engine for your agents: once they generate code, they can execute JavaScript, TypeScript, and Python in a fully secure fashion. Obviously, you need observability. You need to know exactly what the agent did in the end-to-end session, and this is why we launched Observability. Yesterday you might have heard that we also launched Evaluations, because you need to be able to know how well your agent is performing based on your own metrics. We'll dive into all of them through the best practices, but that's the overview of AgentCore: fully modular and fully managed.

Thumbnail 590

Rule #1: Start Small with Clear Business Problems and Scope Definition

That's a lot. How do we get started? Well, at AWS we always say that we start from the business problem, so we work backwards from the problem. I'm going to add here that we should start small. We want to define the use case. We want to define what the agent should and should not do.

We want to define how the agent should perform, the tones, and everything. We want to define which kind of APIs the agent can access. But we want to start small. We want to create a proof of concept. We want to fail fast. We want to validate what works and what doesn't work so that we can create the best possible agents.

Thumbnail 650

I want to say here that starting small does not mean thinking small. You think big. You prototype and you iterate on this agent to create your applications to solve your business problem. Let's ground ourselves with one very common use case: that of a financial analyst assistant. I'm sure many of you are building similar cases. Let's imagine three personas, all of whom we have in the room. John is the developer who wants to build that assistant that pulls in data from internal and external data sources, does analysis, and returns reports so that the analyst can leverage them and do whatever they need to do.

Anna is the user that is actually going to be consuming that agent. What she really cares about is that the agent works perfectly, that the agent remembers her preferences, and that the agent doesn't leak any information between her user sessions versus other users' sessions. Then you also have Michael, the AI platform leader that works with John to establish the mechanism and the processes that are being used to build and scale agents across the organization. Michael really cares about establishing security policies, approving certain tools, and being able to centrally monitor and understand what those agents are doing. So there are three people, three problems, and one use case.

Thumbnail 730

Where do you start? Let's start from the business problem. Let's talk with Anna. Anna is the user of the application. She knows what she wants to do and she also knows what the agent should not do. Anna is a financial analyst. She's going to ask a financial question: "What is the biggest market of Q3 in Asia?" The agent should answer this question, and it should answer with "The UK was the biggest market." In the background, the agent should go to the live systems and get the data to reply to the question.

However, we are defining the scope of this agent and we are defining also what the agent should not do. So we are going to ask Anna what should be out of scope. For instance, "How much is my colleague earning?" That is completely out of scope. The agent should say, "I'm not here to reply to this question. I am here to help you with financial questions." So we need to set the guardrails for the agent for keeping it on scope, and to do that we should also define those queries beforehand.

Thumbnail 800

Rule #2: Establish Observability from Day One Across All Personas

Rule number two: Set up observability from the beginning. I know you may think observability is something that comes close to production, but unless you set it up at the beginning, you won't know what your agents are doing. As you set up your observability, there are a few recommendations that we have for you. Always use OpenTelemetry-compatible traces. All AgentCore services emit logs in an OpenTelemetry-compatible way so that you can understand what the agent did: for example, what models were used, what APIs were invoked, and how many tokens were used.

At the session level, we also recommend enabling trace-level debugging so that you know, for that particular session, what the user said, how the reasoning steps went, and exactly which API calls were used. That way you can always remediate if something was wrong. Do use dashboards; on the AgentCore side we offer a lot of generative AI observability dashboards for AgentCore specifically. But you're free to use any of your favorite observability stacks. If you're using Datadog, New Relic, or Dynatrace, you can use the OpenTelemetry traces with that observability stack. You're not tied to AgentCore.
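To make that concrete, here is a minimal sketch of wrapping an agent invocation in an OpenTelemetry span so traces can flow to AgentCore Observability or any OTLP-compatible backend. The service name, span attributes, and the invoke_agent helper are illustrative assumptions, not AgentCore-specific APIs.

```python
# Minimal OpenTelemetry setup for tracing one agent invocation.
# Endpoint is taken from OTEL_EXPORTER_OTLP_ENDPOINT; names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "financial-analyst-agent"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("financial-analyst-agent")

def invoke_agent(agent, session_id: str, prompt: str):
    """Wrap one agent call in a span so session, model, and token usage are traceable."""
    with tracer.start_as_current_span("agent.invocation") as span:
        span.set_attribute("session.id", session_id)
        span.set_attribute("gen_ai.request.model", "anthropic.claude-sonnet")  # assumed attribute value
        result = agent(prompt)  # your framework's invoke call
        span.set_attribute("gen_ai.usage.total_tokens", getattr(result, "total_tokens", 0))
        return result
```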

Thumbnail 880

Observability is crucial throughout the personas of our agentic development. Anna wants to know that what she is getting as information is correct, that the right systems were accessed, and that the right information is returned. John wants to debug his agents. He wants to understand when things are going wrong and where the issue is. He wants to be able to change things and fix problems, so he wants to use observability for debugging.

Thumbnail 940

Michael has the bigger organizational picture. He's on the platform team and wants to know the use case, costs, what is working, what's not working, and how much each team is spending. He wants to have the bigger picture. Observability goes throughout the different personas, and if you want to be successful with your use case, you also need to plan for observability throughout the different personas to have the right data available for the right tasks.

Rule #3: Expose Tools with Clear Descriptions and Reuse MCP Servers

Rule number three is to be prepared to create tools and expose them, as well as internal APIs, to your agents. As Phil will tell you, context is king, and those tools can allow you to get the right context at the right time for your agent and give you accurate data. As you do that, remember a few things. Ensure you use clear descriptions and specify the parameters that are required for each tool. Many of the issues that customers face come from not writing explicitly what is expected by that particular tool. For example, if a JSON schema is expected in a specific way or a date format is expected in a specific way, providing that will help reduce ambiguity for the agent, reduce the tokens that are used, and reduce the number of reasoning steps that are needed.
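As an illustration of what a clear description looks like, here is a hedged sketch of a tool defined with the Strands `@tool` decorator (assuming the strands-agents SDK). The tool name, parameters, and returned schema are hypothetical; the docstring is the part the model actually reads.

```python
from strands import tool  # assumes the strands-agents SDK is installed

@tool
def get_quarterly_revenue(region: str, quarter: str) -> dict:
    """Return revenue for one region and quarter from the internal finance API.

    Args:
        region: Region code, one of "EMEA", "APAC", or "AMER".
        quarter: Calendar quarter in the form "YYYY-Qn", for example "2025-Q3".

    Returns:
        A dict such as {"region": "APAC", "quarter": "2025-Q3", "revenue_usd": 1234567.89}.
    """
    # Deterministic data access goes here (internal API, Lambda, etc.).
    # The explicit formats above reduce ambiguity, tokens, and reasoning steps.
    return {"region": region, "quarter": quarter, "revenue_usd": 0.0}
```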

Thumbnail 1050

Also, set error handling guidelines. Make sure you have some retry logic, some tool fallbacks, as well as proper user notification if things fail. Reuse MCP servers. For example, there are many in the ecosystem from Slack, Jira, and GitHub. Leverage them, but not only the external ones; also the internal ones. There are many parts of the organization that are building MCP servers for specific tools, and you can definitely reuse them. Create a wiki, leverage the platform team to expose that wiki, and create examples to showcase exactly how to leverage that particular MCP server. Unless you collaborate and share those assets, innovation will slow down. We highly recommend you reuse those assets.

Thumbnail 1100

Thumbnail 1120

Thumbnail 1130

Now let's go towards our first demo. In the first demo, we'll showcase how John the developer can leverage multiple frameworks, Strands and LangChain, and we'll show you how you can expose a gateway with various tools so that the financial assistant agent can read from those tools, get some insight, create analysis in code that can execute in the code interpreter, and then everything will be flowing into observability. One thing I forgot to mention is that whatever comes from the gateway will have some identity and access controls so that the end user will only be seeing the data that they are allowed to see from those tools that they are invoking. This is real life. John has worked with different teams and saw that the gateway already has all the information that he needs. He is connected with Cognito authentication and already has the permissions that he needs to use this gateway. It is connected with a target Lambda function that has all the functionality already implemented, so John can focus on testing Strands versus LangChain.

Thumbnail 1140

Thumbnail 1150

He will import the required libraries for code interpretation. He will reuse system prompts and implement his own tools as well, so he will extend the functionality from the gateway with some local tools. He will just start testing. He will test Strands and see how the agent is behaving there. He can ask questions like what was the Q4 revenue. He can get this from his local tools using code interpretation and get all those aggregations. He can also ask questions from the gateway, such as what is the budget for a certain project. This is all available as live information that is connected in the gateway.
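For orientation, wiring an agent to an AgentCore Gateway usually amounts to attaching an MCP client over streamable HTTP with a bearer token from the identity provider. The sketch below assumes the strands-agents and mcp Python packages; the gateway URL, token, and question are placeholders.

```python
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

GATEWAY_URL = "https://<gateway-id>.gateway.bedrock-agentcore.<region>.amazonaws.com/mcp"  # placeholder
ACCESS_TOKEN = "<cognito-access-token>"  # obtained from the identity provider, e.g. Cognito

# MCP client that speaks streamable HTTP to the Gateway, passing the caller's identity.
gateway_client = MCPClient(
    lambda: streamablehttp_client(GATEWAY_URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
)

with gateway_client:
    tools = gateway_client.list_tools_sync()  # tools exposed by the Gateway targets
    agent = Agent(tools=tools, system_prompt="You are a financial analyst assistant.")
    print(agent("What is the budget for Project Aurora?"))  # hypothetical question
```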

Thumbnail 1190

He compares that with LangChain and gets exactly the same response because it is using the same prompt and the same tools. He implemented the same tools, and because John created observability from day one, he can see how the frameworks are behaving. So now the choice of framework is his own. The choice of framework here is based on his developer needs.

Thumbnail 1210

Thumbnail 1220

John decides to go with Strands because he has all the information required to choose his framework. Let's recap what we just saw. We saw an agent built in Strands fully locally from John's laptop. John added a Code Interpreter to do data analysis, a Gateway to expose a variety of tools, and an Identity primitive to enforce who has access to those tools. Nothing has been deployed yet or is ready to scale, and I'm sure John will get there. But the point we want to make here is that every bit of AgentCore is fully modular. Whatever use case you have, start with the modules that really make sense for you, and then you can expand or reduce however you actually need. Nothing is truly tied together. It's a nice integrated experience, but you're not forced to use everything.

Thumbnail 1270

Rules #4 and #5: Implement Automated Evaluations and Multi-Agent Architectures

Now, how can John start interacting with his agent to improve it? He needs to evaluate. He cannot improve his agent blindly. He cannot improve his agent with just some queries that he's thinking about. He needs to have automated evaluation. He needs to start getting ground truth. He needs to start getting all the queries that Anna expects him to have on his agent. With automated evaluation, he needs to set all those metrics and gather diverse sets of examples. He needs to think about how to say the same thing in two, three, or four different ways and consider the metrics.

Thumbnail 1340

Thumbnail 1350

Thumbnail 1360

Thumbnail 1370

Thumbnail 1380

The technical metrics are great: latency and accuracy, how correct an answer is. John is going to use those, but the business metrics are the ones that will give Anna the trust in the agent and the certainty that this agent is solving her business problem. John will work with Anna to create these metrics. Agent development is an iterative process, and evaluations follow the same feature-requirements type of process, where one needs to understand what is expected and what is not expected. The interaction between Anna and John shows how a traditional feature-requirements conversation happens, but for the evaluation space. For example, Anna is stating that if she's asking specifically about Germany, she shouldn't be given any information about France. John understands that he needs to evaluate accuracy and needs to evaluate whether citations are given back from the agent. Anna also says she wants a response that is fast, a response that is not only good but also fast. John decides to also establish latency and cost as mechanisms to evaluate that particular response.
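One lightweight way to turn that conversation into something testable is a small ground-truth suite scored on the agreed metrics. The sketch below is framework-agnostic; the cases, the citation check, and the latency threshold are illustrative assumptions.

```python
import time

# Ground-truth cases co-authored with the business user (illustrative examples).
CASES = [
    {"question": "What was Germany's Q3 revenue?",
     "must_mention": ["Germany"], "must_not_mention": ["France"], "needs_citation": True},
    {"question": "What is the biggest market of Q3?",
     "must_mention": ["biggest market"], "must_not_mention": [], "needs_citation": True},
]

def evaluate(agent, max_latency_s: float = 5.0) -> list[dict]:
    """Score each case on accuracy proxies, citations, and latency."""
    results = []
    for case in CASES:
        start = time.perf_counter()
        answer = str(agent(case["question"]))
        latency = time.perf_counter() - start
        results.append({
            "question": case["question"],
            "mentions_required": all(t.lower() in answer.lower() for t in case["must_mention"]),
            "avoids_forbidden": not any(t.lower() in answer.lower() for t in case["must_not_mention"]),
            "has_citation": ("source:" in answer.lower()) if case["needs_citation"] else True,
            "within_latency": latency <= max_latency_s,
        })
    return results
```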

Thumbnail 1420

This conversation is exactly why we built Evaluations. You define what good is, for example accuracy, citations, latency, and cost, and we continuously score those agent interactions in real time so that you know how well your agent is doing relative to those metrics. As you do that, be ready to swap your models, be ready to add more tools, and be ready to edit your prompts. Be ready to use multi-agent architectures. Agents behave pretty much like us. We have specialists. We have supervisors. It is a very similar organization to what we have in our enterprises. As your solution scales, it becomes harder for the agents to handle all of this information. We need to define clear roles and responsibilities for each agent. We need to orchestrate between the different agents in the correct manner. You can have sequential types of orchestration when there are dependencies between the work that each agent needs to do. But sometimes things are independent, so you can optimize your latency with parallel types of orchestration. All of this needs to have context and memory in mind. You need to know what each agent knows in order to architect good applications. And of course, I'm going to say that many times today: you cannot have a good application without observability, without monitoring everything that is happening in the background.

Thumbnail 1500

Let's look at John's financial analyst assistant. We started with a few tools and now we have ten tools to read Excel files, generate growth trajectories, calculate statistics, and create visualizations. Naturally, the more you want the agent to do, the more tools you need to expose to it. But as you do that, the code gets significantly more complicated, which means you have to implement a lot more error handling scenarios. You need to cater for many edge cases, establish a lot more conditional logic, and obviously the code becomes much harder to debug and understand.

As you expose more tools, the agent has more options and more reasons to be wrong. It is not just the tools that can be wrong. The agent can mix the parameters it uses for different tools or mix the sequencing of the tools it is using. Because of that, the agent becomes a lot slower and more expensive. All those descriptions and tools have to be mounted into the context, so you use a lot more tokens. Because you use more tokens, the agent will do a lot more reasoning steps, which will take a lot more time. This is very natural and not a failure at all, but just a signal that you need to start thinking about breaking your agent into multiple different agents.

Thumbnail 1590

For example, in a financial analyst assistant, we can think of three agents that might be required: one that does the data retrieval, one that does the analysis, and one that does the reporting. All of a sudden, we can start optimizing for much more focused objectives, which will help simplify the code, make the agent a lot more accurate just because the objective is one instead of ten, and also make it a lot faster and easier to run. Of course, that only pays off at scale. If you have two or three tools and you break them up, you probably just add the overhead of orchestrating. But if you have tens of tools and you break them up into the right agents, then you start getting meaningful speed and accuracy improvements.
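One common way to express this split is the agents-as-tools pattern, where a supervisor only decides which specialist to call and in what order. A minimal sketch with the Strands SDK follows; the prompts and agent boundaries are assumptions for illustration.

```python
from strands import Agent, tool  # assumes the strands-agents SDK

retrieval_agent = Agent(system_prompt="Fetch raw financial data from approved sources only.")
analysis_agent = Agent(system_prompt="Run statistics and growth calculations on provided data.")
reporting_agent = Agent(system_prompt="Summarize analysis into a short report with citations.")

@tool
def retrieve_data(request: str) -> str:
    """Delegate data retrieval to the specialist retrieval agent."""
    return str(retrieval_agent(request))

@tool
def analyze_data(request: str) -> str:
    """Delegate calculations and statistics to the analysis agent."""
    return str(analysis_agent(request))

@tool
def write_report(request: str) -> str:
    """Delegate report writing to the reporting agent."""
    return str(reporting_agent(request))

# The supervisor only routes work; each specialist keeps a single, focused objective.
supervisor = Agent(
    system_prompt="You are a financial analyst assistant. Use the tools to retrieve, analyze, and report.",
    tools=[retrieve_data, analyze_data, write_report],
)
```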

Thumbnail 1640

Let's take a break here to talk about multi-agent collaboration protocols versus patterns. We hear a lot about A2A right now, which is agent-to-agent communication. A2A is a protocol, and so is MCP. These define how agents talk to each other, so they care about the message format and the API formats, like the agent card: everything that makes them interoperate. When we are talking about multi-agent collaboration, we also have to think about the patterns of the collaboration. How is the information going to be handled between the different agents? The pattern defines how the agents organize this communication. It is about having sequential, hierarchical, or peer-to-peer types of patterns, so it is about how your agents are organized. It cares about the workflow and the coordination parts of it.

Thumbnail 1710

Rule #6: Scale Securely to Multiple Users with Isolation and Personalization

It is always good to remember that you can do A2A with a sequential pattern. You can do that with a hierarchical pattern or with peer-to-peer. It is about how you are implementing those things. Rule number six is to scale your agents to multiple users in a fully secure fashion and also in a personalized manner. Obviously, local development can take you to a certain extent, but if you want to really go to many users, you need to think about a lot of requirements.

For example, always isolate context and sessions. When a user is interacting with an agent session, they do not want the information to leak to a different user with a different session. For this reason, within AgentCore Runtime, we offer what we call full isolation. We dedicate a micro VM at every session not only to maintain persistence but also to ensure that whenever that session ends, we completely clean up the CPU, completely clean up the memory, and clean up the file system. So each end user gets their dedicated micro VM for full security and there is no data leakage.

Secondly, make sure you set up user-specific memory, both short-term and long-term, where the agents you execute can remember what happened within and across sessions in many ways: for example, what facts the user cares about, what preferences they have, and exactly what the topic of the conversation was in previous sessions. You can namespace memories using user IDs, so each user can get their own dedicated memory that nobody else can access (see the sketch after these requirements).

Third, make sure you have policies on top of the tools and access toward agents and the gateway. For example, you can leverage your workforce credentials like Okta or Entra ID to enforce access rights to the agents, but you can also establish policies on who can use whichever tool under which conditions. Policies come in very handy for that.

Thumbnail 1870

Lastly, make sure you host your agents and your tools separately. Runtime can host both tools and agents, and we recommend our customers leverage this by splitting each agent or part of the agentic system into different agents and tools and hosting them separately. That way you have much more visibility into what each particular agent did, and you can reuse those agents across many agentic solutions. For example, in the next demo we'll showcase how we use the same agent but split it up into two: one is going to be a PM agent, and the other one is going to be a financial agent. It's going to be an agentic system fully hosted on Runtime, again leveraging Code Interpreter to execute analysis after it has leveraged the Gateway and proper identity controls to fetch the data.
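To make the user-specific memory requirement concrete, here is a rough sketch using the MemoryClient from the bedrock-agentcore Python SDK as shown in AWS samples; the method names, memory ID, and namespace convention should be treated as assumptions to verify against the current SDK.

```python
from bedrock_agentcore.memory import MemoryClient  # assumed SDK module, per AWS samples

client = MemoryClient(region_name="us-west-2")
MEMORY_ID = "<agentcore-memory-id>"  # placeholder for a memory resource created beforehand

def remember_turn(user_id: str, session_id: str, user_text: str, agent_text: str) -> None:
    """Store one conversation turn; long-term strategies extract facts and preferences asynchronously."""
    client.create_event(
        memory_id=MEMORY_ID,
        actor_id=user_id,        # namespacing by user keeps memories isolated per person
        session_id=session_id,
        messages=[(user_text, "USER"), (agent_text, "ASSISTANT")],
    )

def recall_preferences(user_id: str, query: str):
    """Retrieve long-term memories only from this user's namespace."""
    return client.retrieve_memories(
        memory_id=MEMORY_ID,
        namespace=f"/users/{user_id}/preferences",  # assumed namespace convention
        query=query,
    )
```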

Thumbnail 1910

Thumbnail 1920

Thumbnail 1930

Thumbnail 1940

At the same time, we'll showcase how you can use Memory so that the agent can remember, for that particular user, what happened in previous sessions and what they have been served before. Let's look at how John's code evolved. He is now importing the necessary packages to handle memory and to deploy that in real time. He set up his Bedrock AgentCore application. He is changing his tools to become agents. Now those agents will be his tools. He's doing multi-agent collaboration. He's setting the entry point for his runtime so that he can host this agent. He's handling authorization from Cognito here as well and passing that to his memory so that his agent can now handle different memories for each customer.
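For reference, the Runtime hosting shape John ends up with looks roughly like the following, assuming the BedrockAgentCoreApp entrypoint pattern from the bedrock-agentcore SDK samples; the payload fields are assumptions.

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # assumed module path, per AWS samples
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent(system_prompt="You are a financial analyst assistant.")

@app.entrypoint
def invoke(payload):
    """Runtime calls this per request; each session gets its own isolated micro VM."""
    prompt = payload.get("prompt", "")  # assumed payload field
    result = agent(prompt)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()  # local test server; deploy via your CI pipeline or the AgentCore tooling
```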

Thumbnail 1950

Thumbnail 1960

You see, on Runtime he can have different versions. He can connect his Identity to it. On Memory he can have different types of strategies as well. In the UI, you can see this being exposed to different users. I'm creating my own user here, asking my questions, getting my responses from my agent, and also getting visibility into the tools. So I get my observability here. In real life, you check this observability as necessary. John can also observe everything on AgentCore Observability together with Michael, so all the visibility is there, and it is out of the box for them. You have tokens, latency, and all of this.

Thumbnail 2010

Thumbnail 2020

Thumbnail 2030

But now Kosti also wants to use the same agent, so Kosti is just going to log in here. Now it is Kosti's session, so the agent will behave for Kosti now, and it will start creating the memory for Kosti. Kosti can ask his questions as a product manager, get his response, and use this agent for his own work. So now you are getting this hyper-personalization as well. Taking a look at what we did here, we are deploying this agent on Runtime. We are still experimenting, so we deployed all the agents in the same runtime. We would expose them as different runtimes as we move forward and get more trust in our agents.

We are adding this memory and we are also connecting the user to the memory. We are keeping everything we already had before, so we continue having the gateway. We continue having the gateway connected to identity for getting the identity of the tools, but now we are also connecting this identity to the runtime so that you can also get the identity of who is accessing the agent. And observability now spans all of those different services, out of the box. You didn't have to do anything extra for the observability; it is there for you.

Thumbnail 2090

Rule #7: Leverage Deterministic Code Over Agents When Possible

Rule number seven: Do use code whenever possible. I know agents are exciting, but agents are not a hammer. Models are non-deterministic. Agents are completely stochastic, and of course user behaviors change all the time.

Whenever you need to do calculations, forecasting, or anything that is very deterministic, don't rely on the LLM and the agent to perform those tasks. Reserve the agent to do the reasoning tasks and keep functions as tools—either internal or external tools that the agent can use. For example, calculations, validations, or rules-based logic are great things that you can create as fully deterministic code that runs super fast and cheaply. You should leverage that and use the agent to orchestrate which tool or function is used on top of that logic.

You should always measure cost versus value. For example, if you know that it's just cheaper and faster to use code to perform an action, build the code and have the agent orchestrate on top of the logic. If it doesn't make sense, then use the agent. To be honest, one of the most common mistakes we see with customers is that they allow the agent to do pretty much everything. While many new LLMs allow for calculations and other tasks, you still cannot always rely on them to do the calculation correctly. So expose that calculation to your agent as a tool.

Thumbnail 2190

Let me bring up a classic example here that I've seen with many customers, which is the current date. Current date is even a native tool for some frameworks, but it's a classic thing that you can just pass as an attribute to your context. It's super easy to calculate and it avoids one orchestration loop. So now you're using fewer tokens and you're passing this information to your agent. The agent can still reason about this date, can still know what's tomorrow or the day after tomorrow, but you don't have to call a tool to get this information.
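Concretely, the idea is to compute the date deterministically and place it in the context rather than registering it as a tool. A minimal sketch, assuming a Strands agent:

```python
from datetime import date
from strands import Agent  # assumes the strands-agents SDK

# Computed once per request in plain code: no tool call, no extra reasoning loop.
today = date.today().isoformat()

agent = Agent(
    system_prompt=(
        "You are a financial analyst assistant. "
        f"The current date is {today}. "
        "Use it when reasoning about deadlines, quarters, and relative dates."
    )
)
```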

Thumbnail 2230

Other things that you can do with code instead of having the agent orchestrate are authentication of users and getting user profiles. You can combine different tools that should always work together in code because that will be faster, cheaper, and more deterministic. So let's use agents when agents are bringing value because agents do bring a lot of value.

Thumbnail 2250

Rules #8 and #9: Continuous Testing and Organizational Collaboration for Scale

Rule number eight is test, test, test, test, test, and never stop testing. You might think that going to production is actually the end, but to be honest, we think that's actually the beginning. As I mentioned, models are stochastic and user behaviors are changing, so we recommend that you implement a continuous testing pipeline. Create a few scenarios or many tens of scenarios. For example, in the Clearwater Analytics case, whenever they create an agent update, they should run that pipeline of testing so that it can determine if the agent version is good or not and if it's maintaining accuracy or increasing accuracy.

If the version is performing well, it should be pushed. But if it's not, it should be pulled back. In a similar fashion, you should use A/B testing. You should have in production endpoints that only a few users are using so that you can understand how those agents are performing. AgentCore Runtime does come built in with versioning, endpoint management, and rollback. So whenever you see the beta is performing well, we can swap over to it. But if it's not performing well, we can bring all the traffic back to the alpha.

You should do the same thing in real time. You should always monitor for drift using evaluations. If there is any degradation of performance for your important metrics, you should roll back to the previous version. You can allow yourself to get alerted or you can roll back to a previous version so that you always have the right experience for your end users.
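A hedged sketch of what such a gating step might look like in a deployment pipeline; the check names reuse the evaluation harness sketched earlier, and the threshold is a placeholder for your own bar.

```python
import sys

ACCURACY_THRESHOLD = 0.90  # illustrative bar a new agent version must clear

def gate(results: list[dict]) -> None:
    """Decide whether to promote a candidate agent version based on regression results.

    `results` is the per-case output of an evaluation harness such as the one
    sketched earlier (each dict holds boolean checks for one scenario).
    """
    checks = ("mentions_required", "avoids_forbidden", "has_citation", "within_latency")
    passed = sum(all(r[k] for k in checks) for r in results)
    score = passed / len(results)
    if score < ACCURACY_THRESHOLD:
        print(f"Candidate scored {score:.0%}; failing the pipeline so traffic stays on the current version.")
        sys.exit(1)
    print(f"Candidate scored {score:.0%}; safe to shift traffic to the new Runtime endpoint version.")
```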

Thumbnail 2350

Here's the last demo and architecture that we're going to cover. You'll see how easy it is to add Evaluations into the existing architecture that we have already covered. You define what good is, you define your metrics, and as those metrics come in along with your observability—for example, in observability you have operational metrics—you'll now start seeing performance metrics for those agent interactions.

Thumbnail 2380

Thumbnail 2390

Thumbnail 2400

So here is how it looks in real life. You set your evaluation configuration. I'm going to pass some metrics here, like built-in metrics from AgentCore Evaluations, and then that's it. You just continue to invoke your agent, and that will automatically create those dashboards for you so you can see it over time. You can see the detailed overview for each trace of your agent with explanations for your metrics.

You can see all of those different metrics at the trace level and at the session level when you're thinking about goal success, and also in your parameters. Inside a trace, looking at the invocation and examining tool selection accuracy, you can see that out of the box. You can also create your own evaluation metrics if you would like, using custom evaluation metrics. This is a significant simplification of the work needed to have evaluations running in production.

Thumbnail 2420

Thumbnail 2450

Here we have seven of the nine primitives that we offer, all working together in a fully modular fashion. What we hope you saw is that the agent developer just built the logic and they didn't have to build agentic infrastructure to host those agents. They didn't have to think about authentication flows. They didn't have to build monitoring pipelines. Everything is there for them, and that's what we consider production looks like.

Thumbnail 2480

This takes us to the last rule: scale via continuous updates and organizational setups. For example, make sure you consolidate the responses from the models so that you can understand and flag what is good and what is not. You can also collect CSAT scores so that you can very quickly understand satisfaction with thumbs up and thumbs down from the end user. Use that as a way to test your agents.

Definitely consider creating a platform team if you don't already have one, or leverage your existing platform team. That's your mechanism to ensure that your organization has reusable assets and is leveraging the common approved technologies to push and deploy code. Make sure you foster collaboration across teams. At the end of the day, everyone is excited to build with agents, but unless we all collaborate, we won't innovate as fast as we should.

Thumbnail 2530

Let's look at what an agentic AI organization setup can look like. You start having different use case teams. Those use case teams will innovate and create different use cases for their business. They will try different functionalities, but they will also reuse functionality from the platform team. The platform team is where everything gets centralized, so you get different tools and different agents exposed.

Across the use case teams, you also do the governance of everything, so you can see how the agents are behaving. You can see where the bigger costs are, the bigger challenges, where the customer is happy, and where accuracy is working. I like to say that agentic AI is the next step of software engineering. You have all the DevOps capabilities still there. They are still very, very important. So you need to set up the DevOps pipelines. You're going to have version control. You're going to roll back when things go wrong. You still need all of your software engineering, with the extra step of agentic AI.

Thumbnail 2620

When we look at a platform and different teams collaborating, you are going to have the capabilities that are created by the platform. Those are the capabilities that go across the enterprise and that the platform team wants to work on and invest in. You also have the different teams and the agents that they are creating and exposing to the organization.

Thumbnail 2680

If you look at a budget team, the budget team will create the best agent for doing budget analysis because they have the domain knowledge. They are the experts on budget analysis. They should also be the ones creating the agents to interact with their data, but this is such an important capability that other teams in the organization will want to reuse it. That's where the platform also comes into play. You want to be close to the platform team and work with them. The budget team is going to work with the platform team to expose the agent via the platform so that the next team can use this agent.

Now, what we discussed today, in a nutshell, are these nine things we have been hearing from our customers. We see them struggling with these, so here is a quick recap. First and foremost, start small, define one use case, define what good is, build it, learn from it, scale it to five or ten users, then move it on to a larger audience. Always use observability.

Thumbnail 2790

Understand what is happening, using OpenTelemetry-compatible logs, so that you can debug and improve. Plan to expose tools and APIs to your agents. Make sure your descriptions are clear, make sure you avoid any ambiguity, and reuse MCP servers. Split your agents into multi-agent architectures so that you can focus on making those subcomponents really good at what they are supposed to do. When you scale to multiple users, isolate contexts, ensure that you use user-specific memories, and always use identity to enforce who has access to what. Use code whenever possible: deterministic, fast code whenever you need to perform calculations and you know exactly what to expect from the code. Test, test, and test again. This is where you can leverage Evaluations, and you can have your own test suite so that you always know what is happening. Lastly, make sure you keep all the interactions and satisfaction scores so that you can improve your agents over time, and do leverage your platform team. They are your friends and can help you scale your organization and innovation. Last but not least, don't start by flying. Crawl, walk, run. We're super excited about agents.

Thumbnail 2800

We're very thankful you're here. We're very thankful to everyone who is using AgentCore, and we're also very excited to be hosting Phil Norton from Clearwater Analytics, who can share a lot of lessons from their journey.

Thumbnail 2850

Clearwater Analytics Case Study: From 500 Tools to Production-Ready Agents with AgentCore

Thanks Kosti and Maira. My name is Phil Norton. I'm from Clearwater Analytics, and I'm here to talk about our AgentCore use case. Clearwater Analytics is a public fintech company. We provide financial accounting and reporting for institutional investors, and we have a very large SaaS platform that we've built on AWS. I'm a software development manager. My illustrator here is GPT 5.1 because he also likes Art Deco. Let's get going and see how we're using this.

Thumbnail 2870

From the beginning, let's start with our journey in the agentic AI space. We were early adopters. June 2023 is when we first got together as a group: GPT-3.5 had come out, people started seeing what it could do, and everyone started getting excited, so we got the group together.

Thumbnail 2880

Thumbnail 2890

The first thing we did was a standard stack with LangChain, RAG, data access, and application awareness. We had our first production agent internally in July of that year, and externally we got them to clients in December.

Thumbnail 2900

Let's go through our use cases. We started from the beginning by crawling. We started with internal knowledge bases and standard operating procedures. This is really the sweet spot of Gen AI. It does a really good job with RAG and explaining things.

Thumbnail 2910

Then we built a chat agent that works inside our application. Then we started adding Salesforce ticket support, assisting client services with handling some complex financial use cases and questions coming from clients.

Thumbnail 2920

Then we started getting into the data. This is a really big use case for us: accounting data analysis, anomaly detection, and visualization. It's a big use case for our clients. They all have different requirements for looking into the data, finding problems, cleaning the data, and getting their books in order.

Thumbnail 2940

But we don't just do that by hand. We don't just tell the agent what to do. We schedule it and automate a lot of those things.

Thumbnail 2950

Thumbnail 2970

Finally, like everybody else, we're using automated coding agents and code review agents. In addition to interactive coding, we can also put in Jira tickets and have the agents pick up and write code for us and then review the code. One of the more complex workflows is financial data intake. A lot of data providers provide data in PDF, so we have to use more advanced techniques to get that data into JSON format so we can put it into our accounting engines.

Thumbnail 2990

Let's talk about our Gen AI apps. This is the part where I get to brag about all the cool stuff that teams have built over the last couple of years. Our brand name is Quick. First up is Quick Chat. This is a web component that we can embed in our production applications. We've got it in five applications for now. This is the original flagship reporting engine.

Thumbnail 3020

We've got 55 more instances and we're building more every month. The agent is aware of the report that we're on, the account that we're loaded into, things like that. You can also switch which agent you're chatting with.

Thumbnail 3040

Next up, we can create our own agents. This is how we've built 800 agents, and this is available to our clients. It's no-code: you just fill out a form, write your prompts, give it tools, give it knowledge, and you can create your own specialized agent.

Thumbnail 3060

Next up is Quick Flow. This is scheduling and triggering, either single- or multi-agent collaboration, so you can put together your prompts, kick it off, say, every Monday at 8 a.m., and get the results in your inbox. You can go back and see what the agents have done.

Thumbnail 3090

Quick Orchestrator is the killer app. This is kind of like LangChain. You can plug together different agents. It's multi-agent workflows, and there's a nice visual interface. You can see, even as it's executing, which step it's on and which decision it made. It's like building a flowchart that a human can follow: you build the flowchart here, and now the AI follows it. It's a very natural progression for financial people.

Thumbnail 3110

Thumbnail 3130

When we were looking at getting to the next level, here's what we needed. We need scalability. We need zero-downtime deploys. We're getting into four-hour workloads. We have very large reports that take a long time to generate, and then some more processing on top of that. When we do that, we want to make sure that we're not having pods drop out and kill the runtime in the middle of it.

Thumbnail 3140

Thumbnail 3150

We built 500 tools before MCP was a thing. These had been put together in a single FastAPI server, and we wanted to move those to MCP servers. We also wanted to maintain rapid follow-ups, that is, maintain context when talking to data, so that you can interactively ask more questions without having to go fetch all that data again.

Thumbnail 3160

Thumbnail 3170

Thumbnail 3180

Thumbnail 3190

Thumbnail 3200

Of course, any platform we chose had to support backwards compatibility. We've got a very mature system. We've got custom role-based access control that we wanted to keep, and we want to make sure that we can maintain our behavior. We have a homegrown agent; we've moved off of LangChain into our own thing. We want connectivity to the rest of our AWS stack, since we use a lot of different features of AWS. Of course we needed access to all the major, state-of-the-art models that we use through LiteLLM, and we want to make sure we keep those. We would like to migrate to a better RAG solution. Finally, we have our own observability solution, and we wanted to make sure that we can maintain that.

Let's take a look at where we were before. It's a pretty standard stack. We made a lot of use of EKS, but we had a lot of shared workloads. We've got agent runtime pods, we've got a data processing agent as a sub-agent, and we've got pods running tools. Of course, I talked about zero-downtime deploys when these agents run for a long time: we deploy all the time, all day, every day. We don't want to wait until it's safe to deploy; we just go.

Challenge number two is noisy neighbors. If you work with Python asyncio and somebody does a synchronous call in a shared workload, that noisy neighbor is going to screw up your agent run. That was another problem that we had. Same thing with memory: sometimes tools can be sloppy with memory and take out other workloads.
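The asyncio point is easy to reproduce: a single synchronous call inside a shared event loop stalls every other agent run on that pod. A minimal, self-contained illustration:

```python
import asyncio
import time

async def well_behaved_tool(name: str) -> None:
    await asyncio.sleep(1)   # yields the event loop while waiting
    print(f"{name} finished")

async def noisy_neighbor() -> None:
    time.sleep(5)            # synchronous call: blocks the whole event loop
    print("noisy neighbor finished")

async def main() -> None:
    # Every coroutine here shares one event loop; the blocking sleep delays all of them.
    await asyncio.gather(noisy_neighbor(), well_behaved_tool("agent A"), well_behaved_tool("agent B"))

asyncio.run(main())
```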

Thumbnail 3260

After the move to AgentCore, here's what things look like. For one thing, we're able to get rid of that SQS layer with the direct async invoke, so there are fewer moving parts and fewer things to break. We make use of AgentCore Runtime because that's a very natural progression moving from Kubernetes: we just move our pod over and it runs. We have our sub-agent that runs in Runtime, and we have MCP servers. What's nice about this technology is that everything runs in isolation. We have no more noisy neighbors. Maybe some people are wondering about startup times. I can tell you it's negligible. Don't worry about it.

So why did we choose AgentCore?

Thumbnail 3320

Thumbnail 3330

Thumbnail 3340

Thumbnail 3360

Thumbnail 3370

As I mentioned, zero-downtime deployment is a perfect fit for micro VMs. We have flexibility in our tech stack, because we've built all this infrastructure and we want to bring it over without having to rewrite everything. Sticky sessions allow our data processing agent to maintain its session so you can ask repeat questions. We wanted to accelerate data access tooling, and we needed an easy way to create MCP servers so that all the different development teams could own their own MCP server. Memory is something we haven't fully built yet, but it's coming very soon. Another thing I'd like to mention is that we recently came up with a nice use case for making use of AgentCore's browser capabilities. The real big selling point is that this is undifferentiated heavy lifting: we could manage all of this ourselves, because we're smart engineers and we're good at what we do, but AgentCore made it so much easier to accomplish.

Thumbnail 3380

I'd like to follow up with some of the best practices we've figured out along our journey so that I can share those with you and hopefully keep you all on a happy path. First of all, context is king. I'm sure you've heard this before, but don't forget it, because it's a really big thing. I've found that LLMs are highly reliable if they're given unambiguous context. A good rule of thumb for context: when I see these big prompts and read through them, if I can't understand what's being asked at a glance, if I can't read through it and say what the ask is or what you are asking the agent to do, then the LLM won't understand either. That's really your first test.

I would say that hallucinations are pretty much non-existent if you have proper context. However, that's only what you can control. Your users are going to throw all sorts of weird stuff at your agents, so you don't really have control over that. LLMs will always do their best, but poor context means they might hallucinate. So how do you manage that? There are really two main ways that we interact with these agents: chat and automated workflows. In chat, you want to instruct the LLM to ask for clarification. If somebody asks something that's really unclear, the LLM can say "I'm not sure what you mean" and ask for clarification. You want to give it an out for that. For automated workflows, you don't have that interactivity, but what you can do is have the agent output some metadata about what it came up with while processing, such as confidence and rationale.
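One way to wire that into an automated workflow is to have the agent return a small structured envelope and to flag low-confidence results for human follow-up. The field names and threshold below are assumptions, not a Clearwater-specific format.

```python
import json

CONFIDENCE_FLOOR = 0.7  # illustrative threshold for routing to human review

def handle_agent_output(raw: str) -> dict:
    """Parse the agent's JSON envelope ({"answer", "confidence", "rationale"}) and flag weak answers."""
    try:
        envelope = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "needs_review", "reason": "output was not valid JSON", "raw": raw}

    if float(envelope.get("confidence", 0.0)) < CONFIDENCE_FLOOR:
        return {"status": "needs_review",
                "reason": envelope.get("rationale", "low confidence"),
                "answer": envelope.get("answer")}
    return {"status": "accepted", "answer": envelope.get("answer")}
```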

Thumbnail 3480

If the agent comes back and says my confidence is only 5 percent and I'm not sure about this because some of the fields had different names and I wasn't really sure what to do, then when that goes back into your automated system, you should flag those low confidence answers, follow up, figure out what was wrong, and deal with it there. Now let's talk about rollout and best practices because we're building these workflows for people to use. First of all, this is a culture change problem. We're asking people to change the way they work and people don't like change. So you have to identify what's in it for them, what are their pain points, and how their life is going to be improved. Then comes training, training, and more training. You need to schedule time with your users, show them how to use the tools, and show them what it can do for them.

Second, build narrow use cases. GenAI is a general purpose solution and you can do pretty much anything with it, but that's really hard to understand. What you want to do is focus on very specific tasks. You've got this huge workflow that people do, but just focus on one part of it, get that automated. Probably you want to focus on the most boring part, get that automated, make their life better, and then they find the next piece and move on to that. People can really understand that approach. Finally, monitor everything because observability is key. Figure out what's broken and fix it. Figure out who your champions are, who are the people who are really enthusiastic about this, and empower them. Identify who's on the fence and persuade them, and identify your detractors. Listen to them. You probably can't change their mind, but at least you can understand where they're coming from, and they'll change their own mind when they see themselves falling behind.

Thumbnail 3590

That's all I had. Kosti, back to you to close it out. Thank you so much, that was awesome. I just want to say a big thank you to all of you for being here. These are the resources that you can leverage: documentation, code samples, and workshops. We have anything you need, and we're very happy to answer your questions after the session. Again, a very big thank you for being here.


; This article is entirely auto-generated using Amazon Bedrock.
