DEV Community

Cover image for AWS re:Invent 2025 - Building Intelligent Workflows with Event Driven AI (MAM327)
Kazuya
Kazuya

Posted on

AWS re:Invent 2025 - Building Intelligent Workflows with Event Driven AI (MAM327)

🦄 Making great presentations more accessible.
This project aims to enhances multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Building Intelligent Workflows with Event Driven AI (MAM327)

In this video, Marin and Jeff from Unicorn Gaming Shop demonstrate integrating AI agents into event-driven architecture using Amazon EventBridge and Amazon Bedrock AgentCore. They showcase two patterns: EventBridge triggering agents for customer game recommendations with sentiment analysis, and agent-triggered EventBridge events for automated site reliability engineering. The demo reveals how agents autonomously handle customer queries, generate SQL queries without hardcoded logic, and automatically triage CloudWatch alarms by severity, escalating only high-priority issues to engineers while auto-remediating lower-severity problems—enabling 50x user growth without overwhelming limited engineering resources.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Scaling Unicorn Gaming Shop with AI and Event-Based Architecture

Thank you all for being here today. Building net new applications or modernizing legacy applications using event-based architecture is an effective way to modernize your applications using events and asynchronous behavior to achieve scalability. What happens though when the business events and characteristics dictate that that's not enough? That's Marin and I'm Jeff, and over the next hour, we're going to introduce where and how to include AI and agents into your event-based architecture to meet those needs of your business.

Thumbnail 50

First we're going to set up the scenario that we have here at our company, the Unicorn Gaming Shop. We're going to play a couple roles here. Apologize that the cartoons don't look exactly like us, but this is what licensing provided us. Thank you, Jeff. So my name is Marin. Welcome, and my name also. I am an application product owner. I'm new to this company. I have 20 plus years of experience in doing these things. Basically at this company, I am tasked to define new features to drive greater customer satisfaction. The main thing is we want to expand and we want to try to automate as much as possible our customer interaction on one side, but we also want to help our engineers in the back end when we have an issue with the application.

Any type of issue to help them out and to speed up the process when they need to resolve some issues. Also, we want to automate part of the resolution process so that only the most important and most hard things come to our backend engineers, which will then step in and resolve the issue. We want to lower that response time for our troubleshooting process.

And I'm Jeff. I'm an experienced software developer and I've been working at Unicorn Gaming Shop for quite some time. I've been modernizing that application over the years using event-driven architecture and EventBridge, and I'm tasked with maintaining and building the new features that Marin's coming up with for the Gaming Shop application that we have. And of course, I do have lots of new features. I'm new here. I need to prove myself, and I definitely have some big ideas for this company. We need to expand. We need to expand to increase our customer base.

Thumbnail 150

Thumbnail 160

Really great ideas, Marin. I'd like to hear some of these things. You may not like some of them because they are really big ideas. So basically I signed 10 new deals for our company and that will drive actually 50 times user growth than what we have now. So I think you have your plate full for this one, Jeff.

Thumbnail 170

Thumbnail 190

Did you say 10? And 50 times more growth exactly. Thank you. That's quite a task that he set me up for. Some of the things that I'm immediately thinking about when I have to deal with that is the scale that that's going to cause in terms of the overload to the team that I have. How about having the operational visibility about everything that's happening across this dynamically scaling and growing system? I'd like to use AI but I got limited experience with that both myself and my team, and I got limited team, limited time, limited budget.

Thumbnail 210

EventBridge and the Strangler Fig Pattern: Building the Foundation with Amazon Bedrock

Now, I mentioned before that I've been modernizing this application that we have here using event-driven architecture and EventBridge, and EventBridge is the core of what our architecture is right now that I have in the system. And EventBridge really is providing me three things. One, it takes in the events that are coming from any one of the producers that we have here. In this case, it's my application. And then I have a bus, and I can have three different kinds of buses. I can have the default bus, which integrates with the APIs of AWS services, the custom event bus which I'm using to handle the particular events that we have in our system, or the SaaS event bus. And then the rules is the third part of this, and the rules dictate what happens with the event that comes in and where does it go to.

Thumbnail 260

Now I've been coupling that with using the strangler fig architecture pattern. The strangler fig architecture pattern is where I have a legacy system up here and I'm building net new capability in that system off to the side of it. I'm building that using EventBridge here to handle particular transactions that I have here that were business features that Marin came up with before, which is introducing an event-based chat capability to our customer assistants who could answer questions asynchronously with customers that we have who are purchasers of games through our Unicorn Gaming Shop.

Thumbnail 300

And it works well, but I'm concerned about how that's going to scale with the amount of growth that we're going to be having here based upon what Marin's telling me with 50 times more users.

Thumbnail 320

That's going to be overwhelming to our customer service assistant. And how am I going to maintain the operational observability and everything that's going on with all those logs being produced by all these new things that are happening? I could use my system as it is right now. Technically it'll work. It's serverless, it's event-based, and all the messages flowing through there are asynchronous. However, it only scales to the customer assistant. The customer assistant has to do the research to answer the questions that come in from the customer, and it's slow. It's slow because the customer assistant is doing a lot of work here, and they're going to get overwhelmed.

Thumbnail 360

I've been doing a lot of research, and I've learned that I can introduce AI to help me out here. And in the AWS world, AI really comes in what we call Amazon Bedrock. Bedrock is our service that provides API access to both AWS first-party models as well as third-party models from the leading providers. What I like about this is with this API layer, I can integrate that into my system without binding myself to any particular model, and then I can learn how my system is providing the answers that I want to my customer on the far end, and I can change the models as I want over time. I can also fine-tune my models to even get better quality in terms of the results that I'm getting out there. So I'm really interested in how I'm going to put this into my system to really start to drive the results that I want with the customer that I have.

Thumbnail 420

So I built a little chat that I can use, a research assistant chat for my customer assistants. My architecture is still serverless, it's still event-based, and it's still asynchronous. Now I have research augmentation here through the integration of Bedrock in there for that customer service, the customer assistance person, and it is faster than it was before. However, it's still not fast, and it still only scales to the customer assistant.

Thumbnail 450

Understanding Agents and Amazon Bedrock AgentCore: Autonomous Decision-Making at Scale

In doing more research, I got even more information about this idea of agents. And what really intrigues me about the agents is they are autonomous systems that can understand requests, make decisions, and act autonomously to do tasks. This sounds like a great way to augment that customer assistant and scale to the business that I have. And when I think about the agents, they come in a couple of different styles. First, I can be thinking about agents that do department-level tasks, which are things that are going to require data that's very near, typically things that aren't going to be terribly complicated, and maybe the security is not that strong in that area. Organization agents will work across the boundaries of my systems, maybe across the boundaries of my lines of business inside my company. Usually greater access to data, security is going to start to become a little bit more important here in terms of who can access what. And then external agents, which are going to be working outside the boundaries of my application, outside the boundaries of my company, interacting with third-party APIs and third-party data. I absolutely need a lot of control in what's happening in there.

Thumbnail 530

And as I think about this, I think about how am I going to manage all of this in this world and how am I going to run it. And that's where Amazon Bedrock AgentCore comes in, and it provides three things for me. First, it provides the tools and the memory that my agent needs, and the memory is really important because a lot of times when you're doing prototyping with agents, they're very short-term memory, meaning you give them a task, they do it, they forget about it when they're done, you ask them a subsequent question and they've got to go redo everything they did the first time. With this, there's memory built into this through AgentCore of that agent, so when you ask subsequent questions, they remember what was there before and you can chain everything together. Second, it provides the runtime environment and the identity management of the agent so that I can then control who can access that agent and what that agent is allowed to do. This is important to me because that agent that I have is going to be producing recommendations, is going to be interacting with a large set of data, and is going to be responding back externally.

And third, the observability. There's going to be a lot going on in the system as it scales, and it's going to be producing a lot of decisions. I need to have the observability not only into the logs that are getting produced from the transactions that are happening here, but I also need to be able to audit and observe

Thumbnail 620

how that agent made the decisions that it did so that I can train the model even further to make the decisions even more in line with what I want going forward. Now, as I think about bringing this all together in my application, there are two real patterns that I'm focusing on here with my existing application and how I would be integrating and incorporating agents for the decision making that I want to have in here. This is where everything comes together with EventBridge and the agent.

Pattern One: EventBridge Triggering Agents for Game Recommendations

First, the first pattern is EventBridge triggering agents. In this case, events are coming in from any number of sources. It could be S3, it could be a schedule, it could be Lambda. In my case, it's going to be this web application that I have that has a chat built into it. The messages come through as events into EventBridge, and EventBridge is going to invoke Bedrock to go do something. In the case that I have here, it's going to be to go do some research based upon what the customer is asking for and provide back a recommendation on what could serve the need of that particular customer.

Thumbnail 740

The second one is agent-triggered EventBridge events. This is kind of the opposite of it. This starts by the agent doing something, doing some research, making a decision, looking at something, and then it produces an event that gets put out onto EventBridge that can then work in a fan-out kind of pattern to say anyone who wants to do something based upon this event can now go do something based upon this event. Together, these two patterns are going to help me greatly in how I need to both handle the concerns that I have around producing this game recommendation feature that I want to have from my customer assistant, which is really going to be the first pattern, and then how am I going to handle all this operational observability that I have and maintaining the site reliability with everything that's going to be going on in my system.

So first, agent-powered game recommendation. I have my system as it was before, and what I'm doing here is I'm introducing the agent that I built, which is going to act on the information that it receives from the event, which is the chat that the customer is having to say I'm interested in new types of games, I like these kinds of things, what would you recommend for me? The agent is going to get that, and the agent is going to do a couple things first before it even does the research. You can have the agent work to determine the sentiment of this person. If they're positive or maybe they're neutral, you probably don't need to have a lot of touch with them. You can have the agent kind of offline handle that themselves.

But maybe the person's calling to say your games stink, I want my money back, I don't like what's going on here. Here you can have the agent make a decision and say, I understand the sentiment, this person probably needs to go talk to another person, I'm going to route that to my customer service agent. However, we build good games, and we think that's going to be a small amount of the transactions. So when the event comes in, the agent can first determine what's the sentiment, who should I be routing this to? When it decides that it should be handling it, it can go do that research, gather the information, and provide the recommendations back out to my customer.

Now remember, because this is also built using AgentCore, which has the first thing which I said was the memory, there can be a chain set of conversation going back and forth here. And the agent's going to be consistently remembering what was before and building all the responses based upon what it knows and not having to redo all that work that it did the first time. One, that saves time. Two, it saves money because the agent's not incurring more cycles to go reproduce stuff that it already did analysis on. So this becomes a way that I can introduce that first pattern, that EventBridge triggering agents pattern, to solve for this problem that I have about building and scaling game recommendations for my customer. That solves one of my problems.

Thumbnail 880

Pattern Two: Agent-Triggered EventBridge for Site Reliability Engineering

My second problem is site reliability engineering. So I have my system and it's all instrumented with CloudWatch, and now that my scale's growing 50 times, I'm producing a lot of logs, and my engineer that I have, remember I've got a limited team, limited budget, he's going to become overwhelmed with having to understand what's happening in there and look for things that need to be dug into and where maybe something is trending badly, something needs to be fixed, something needs to be done.

So the CloudWatch logs are growing exponentially. As good as I build this system, guess what, errors occur. And sometimes they're minor, and sometimes they're major. Either way, they need to be looked at.

Thumbnail 930

Here using CloudWatch, I can set up CloudWatch alarms based upon particular things that could be happening erroneously in my system. Based upon those CloudWatch alarms, when something goes into an alarm, I can invoke my agent. I can invoke my SRE agent. That agent can do a couple things. First, it's going to be able to understand what's happening with that particular alarm, and it also understands the architecture of my system and the implementation of that system to say what could that possibly be the root cause of.

It can then determine the severity of that. Hopefully, most of my issues are low severity type of things, and it can be a quick fix kind of thing, but some might be high severity. And similar to what we had in the first pattern, it can make the decision that I want high severity ones to get dropped as an SNS message that my engineer is going to pick up, and hopefully there should be a small amount of those things. But for the ones that are low severity or medium severity, I want the agent to go fix it themselves, and maybe I can build a little something there that when I have that, the agent can go, you know, maybe redeploy some code for me somewhere else, maybe shut down a failing server.

It can do the things in my system that my engineer used to have to do, but now they can do it fully autonomously. Now this thing can scale out very large to that 50 times volume I want without overwhelming that engineer, and it'll do it tirelessly and it'll do it repetitively. So here's how I can introduce that agent triggered EventBridge pattern to solve for that problem. That combination of the agents in the system running on AgentCore, that memory and the runtime have allowed me to both scale to meet the need of the business that I have as well as scale to that operational support that I need to have, and it's providing wonderful results.

Demo Part One: Customer-Facing Agent with Sentiment Analysis Using AWS Amplify

Now it's great to hear about all this stuff, but it's even better to see a demo on how this all gets built. And with that, I will hand it over to Martin for the demo. OK, I got it from the first time.

Thumbnail 1060

So as Jeff explained, and you can see this is a recording. There is a reason why this is recording. We did not trust the demo gods that everything would work on the stage, so just to make it safe and secure that it's actually going to run and show everything we want to show, we recorded the demo session. So I will go through it. It's about 22 minutes long, so please bear with me while we go through this.

We will show two cases. So as Jeff explained, the first case will be helping our customer agents to help our customers look for similar games that they want to play, to offer them some similar games that they already played, and to enable customers to interact with the agents. In this case, an AI agent that will give them some proposals, but an additional layer on top of that will be that they will be able to provide a sentiment or some kind of feeling. Do they like something or don't they like something, so the agent can then based on that, rephrase the proposal and give them maybe a different list of games to play with.

The second one, the second demo, which is a bit longer one because it has more components, will be the one including our engineer in the backend where we will see mostly stuff in the console and in the code that supports our engineer to resolve cases much more faster. So the logic is again, as explained, that we will have an alert triggered by a fake Lambda trigger which will go then to EventBridge, and then that alert will follow through AgentCore which will then go through, seek additional information about the problem, and then decide is that problem low, medium, or high severity. And based again on the sentiment of the problem, we'll send it either to some auto remediation, which can be restart the service, start something, stop something, run a code or whatever, or if it decides it's a high level problem or issue, send an email to the engineer to escalate further into the case.

Thumbnail 1200

So we'll go through this together, hopefully. Let's see. Can I hide this? Yes, so this is our folder with the project.

Thumbnail 1210

Let me show you the project folder structure. You can see multiple different files that we'll be using, including instructions and other configuration files. We'll go through those files later to show what they contain. Here we're activating the Python environment because we're using Python to communicate with the agent. We also need to install the requirements. As you can see, most of these requirements are already satisfied because we've run this environment previously.

Thumbnail 1240

Thumbnail 1280

Thumbnail 1290

Now I'll run the application itself. Let me stop briefly here to explain. The application is running AWS Amplify front end to be used as a user interface for the customer to interact with. I should mention that Amazon Bedrock AgentCore and Amplify are running locally. They're running locally but can run anywhere, of course. In this case, it's running locally. As you can see, the system is loading the instructions file. The instruction file contains prompt engineering for the agent to understand how to behave. Interestingly enough, as we'll see later, it also contains additional data about sentiment, specifically how to determine what type of sentiment the customer is producing within the system.

Thumbnail 1300

Thumbnail 1310

Thumbnail 1320

Thumbnail 1340

Let's deploy the front end. Amplify should launch the front end itself. The front end is based on React, though you can use whatever framework you want. This is our front end. It doesn't look particularly polished, but it serves our demonstration purposes. Our assistant is called Gus. Let me interact with Gus by asking, "Hi Gus, how are you?" He will respond with information about what he can do for us. Now I'm acting as a customer interacting with the agent, giving him information about what I like or what I used to play, specifically what type of games I used to play. I'm typing slowly, and there's also an intentional spelling mistake to make it look natural. However, Gus is very smart and will easily recognize what I actually need or what I'm asking.

Thumbnail 1360

This query goes to the database in the backend, where we'll see different tools that have been used in the code to help Gus understand what I'm actually asking. You'll notice that in the first run of this iteration, Gus didn't find anything initially. He didn't understand at first, but then he reiterated based on keywords, the database table structure, and what I was saying in my question. He went back, reran the code, recreated the SQL query needed to retrieve that information from the database, and then provided me with recommendations. As you can see, there are multiple different options that he gave me.

Thumbnail 1420

Thumbnail 1460

Now I'll give him some sentiment feedback. I don't like Resident Evil because I scare easily, so that's not something I enjoy. I want to tell him not to propose anything similar to Resident Evil. There's another spelling error here, but again, Gus is very smart and will easily understand that error. Here we can see what's happening in my interaction with the agent. You can see what I'm asking, how he's preparing the memory, and at the end, sentiment analysis is enabled. Interestingly, my first question doesn't have any sentiment. I'm just saying what I'm into, that I like adventure games and this type of game. There's no sentiment related to it. Sentiment comes when I say I don't like horror games or ones where I scare easily. You can see sentiment neutral, no sentiment here because this is my first question.

Thumbnail 1470

Thumbnail 1490

Thumbnail 1500

The agent will start to run the query. Interestingly, by using the tools, the agent is generating the SQL query needed to ask the backend database for what I want. There is no SQL query visible in the code. There's no hardcoded SQL query that tells Gus to do this. There are just instructions about what I want Gus to do. There's no embedded code in the query. Records found zero. You can see it created a query, select console and genre from the table and so on. Records found zero, it didn't find anything initially.

Thumbnail 1510

Thumbnail 1520

Thumbnail 1530

Thumbnail 1540

Thumbnail 1550

Now, the agent will reiterate and rerun the code. It will add additional genres like action or adventure, and then rerun the query to find 150 records this time. The agent will run it again, considering that this might not be enough to show, and it will find some additional credits at the end. Not there, but in the next one. So you can see additional records returned, five in total. The agent doesn't just add records but basically reruns again and shortens the list. Now we have sentiment coming into play. When I mentioned that I'm scared of Resident Evil games, this sentiment is now being provided to the agent, telling it that this sentiment is negative or bad, I would put it that way. So the agent needs to avoid giving me any results containing similar games next time.

Thumbnail 1580

Thumbnail 1600

Thumbnail 1610

If you go into the code itself, you can see here those are the instructions that I mentioned. Within those instructions are guidelines on how the agent needs to behave, what the agent needs to do, how to communicate with me, and how to give me back the information that I need from the agent. On the bottom itself, you will also see sentiment monitoring. So as you can see, the agent continuously monitors conversations and then decides what type of sentiment it is. Is it a good sentiment or a bad sentiment? I could have said, for example, no, I like Resident Evil, give me more of that, or I don't like it as I did, don't give me more of that.

Thumbnail 1640

Thumbnail 1650

Thumbnail 1660

Thumbnail 1670

In the application file itself, which is Python code, we have lots of tools defined. We have tools that will go to the database and extract information, tools that help the agent generate the SQL query that will then bring me back the results. We also have other tools that will help with sentiment generation. Unfortunately, I cannot go through the whole code itself, and it's also very, very small here. We have different instructions again, as we just saw. Let's go to some definitions that basically define how the agent responds back to me. Again, in all of this code, you will not see any query. There is no SQL query saying what needs to be done, because then it wouldn't be agent-based. It would be just a SQL query based on some keywords which I provided in the entry point for the agent.

Thumbnail 1680

Thumbnail 1690

Thumbnail 1700

Thumbnail 1710

So basically, it is based on invocation, totally asynchronous as mentioned, which is the idea. I don't need to be waiting for something to happen. I need it to happen immediately, back and forth. Okay, session ID, user ID, and so on and so forth. Everything, of course, for this to work, the LLM is running in AWS, so you cannot run your LLM locally as you are currently running this agent core and Amplify. So you need to have your credentials. You need to have access to your backend where your database will be and the agent logic or LLM that you're actually questioning.

Thumbnail 1720

Thumbnail 1730

Thumbnail 1740

Thumbnail 1750

Sentiment analysis, so you can see it can analyze sentiment. It always analyzes sentiment as you enter something and interact with the agent itself. We can go to definitions on how to analyze sentiment. So how to analyze sentiment and how to output back to EventBridge. Once you have this in EventBridge, you can then do whatever you want, basically. But in this case, we are just providing feedback back to the customer on how to create it. I'm so sorry, this is like a silent session and this is my first time doing it. I used to have interaction with people asking questions, so it's weird having you like aliens with those red ones.

Thumbnail 1770

Thumbnail 1790

Thumbnail 1800

Yeah, cool. So the code basically shows the pattern and code that helps out in this interaction and defines how all this in the backend works. Again, the most important thing is to understand this is the EventBridge part where we basically send all that information, sentiment, confidence, and summary back to the user to understand that part. Okay, so I will stop here for a second.

The second part, as explained, builds on the first part. What can you do with this? There are so many agents that are pre-built, or you can just reuse them. This one is interesting because it's very easy to deploy. You can run it, test it locally, and then push it to AWS and run it there if you wish. So this is just a concept that proves the concept of basically asynchronous communication between the customer and the agent in the backend and reusing existing LLMs to provide additional data. This is super easy code, it's super short, it's not long, and it can be very easily tested. And this is with user interaction when I'm prompting questions and asking give me this, give me that, and so on and so forth, and this is all based of course on your existing knowledge that you possess in the backend of the database of all the games that you have.

Thumbnail 1880

Thumbnail 1900

Thumbnail 1910

Thumbnail 1920

Demo Part Two: Autonomous Alert Management with CloudWatch and AgentCore Runtime

The second part includes a bit more components, and the second part is, I would say, not as interactive because the idea is that we have a permanent watcher or alert-based system. This time it is through CloudWatch where we are basically generating an alert and then sending that alert through EventBridge to Amazon Bedrock AgentCore, which then decides the sentiment of that alert and then does some action, and this is something that is totally autonomous versus the first one where I need to input something, ask back and forth, and interact. Here we do have the alert itself. As you can see, the alert was set up in this case, maybe not for production, but every one event with one data point then raises the alert, and then I have a rule in EventBridge basically that is watching for those alerts, any change of state, and it has a target of Amazon Bedrock AgentCore to do something with that alert. So we'll see later on what that AgentCore actually does.

Thumbnail 1930

Thumbnail 1940

Thumbnail 1960

So the important thing is that I'm not just sending it, there is an alert. I'm sending it, okay, there is an alert. You need to go back to the CloudWatch Logs. You need to pick up some additional information about that alert, and you will see later in the code we have again additional tools that do that and then do something with it. So this is Amazon Bedrock AgentCore. I will stop quickly just here because Jeff also mentioned some of these. We are using Agent Runtime in this case, but there are additional options that you have like Built-in Tools and Gateways. We did some upgrades recently on Gateways in this tool. Memory also mentioned, Identity is super important, and we will see later on Observability also super important for this because you don't want this running wild, not knowing what is actually happening, how many invocations do you have, how many times your Lambda fired because something happened. You want to fine tune it either because of the resource usage, cost usage, or just optimization of the code. So we are currently using here Agent Runtime for this particular AgentCore.

Thumbnail 2040

So again, alert. I have a fake Lambda you will see later that will trigger a couple of alerts, again, nothing to use in production, just for demo purposes. That will go to EventBridge. EventBridge will send that to AgentCore. AgentCore has instructions. You will see again instructions files, a bit different one, and this one is not called something simple, it has a much more ominous name, but it will then look, oh, there is an alert. Let me go back to CloudWatch. Let me collect some additional information. Let me determine the sentiment. And let me then decide what to do with that information. So what this is doing is actually relieving our engineers that they don't have to be there for each and every event, but they can basically just get the top level, highest priority events and then act on them.

Thumbnail 2050

Thumbnail 2080

Thumbnail 2100

Okay, so this is the dive. Maybe one thing to notice here is just it's a Version 7. I have run this multiple times, so once we come to the end and run it again, so when we go through all the files, it will be Agent Version 9, I think, because we have run it multiple times. So there is versioning, of course, within the agent. We can see the invocation code itself. In this case, Python blurs some things which contain sensitive information. So it's very, very simple code. Multiple versions, currently 7. Once we run it actually again it will be 9. Additional information is about the endpoint itself. You can add tags if you want, of course, when it was created, status ready, and so on and so forth.

Thumbnail 2110

Thumbnail 2130

Thumbnail 2140

This part is super important: observability. This is the thing that you definitely want to have. You want to be able to have insights into the processes and what is actually happening inside. You can of course change the range when something is happening. Here I'm changing it for I think four weeks because I ran it for four weeks before. You can see how many endpoints, how many sessions, traces, error rates, and so on. To see this information, session invocations, invocations, and so on, you need to enable this in Bedrock.

Thumbnail 2160

Thumbnail 2170

Thumbnail 2180

There is a checkbox, basically a switch, that you turn on that then enables invocations. Then inside, we'll just go now into that part, into Bedrock, where you can see basically what types of invocations you want to log and where do you want to store that information. So going to Amazon Bedrock, in my case, it will be turned on because I already turned it on. So settings, and then here, let me stop quickly. Did I stop? No, I did not. Here, stopped.

So this needs to be turned on: model invocation logging. If you don't have that turned on, you will not see that data over there, which is not good. You want to have that data. What type do you want to include into the log? I included all text, image, embedding, and video. Where do you want to store that? S3, CloudWatch Logs only. I will just store it in CloudWatch Logs in this iteration.

Thumbnail 2220

Thumbnail 2230

Let's go to CloudWatch, to CloudFront to see what was deployed. To see what was deployed, there were multiple resources deployed through CloudFormation for this. I will stop quickly. When it loads, let me hide this. I hate how it doesn't hide. Come on. Forget it.

So we have Lambda, EventBridge, and AgentCore. That's when it comes to EventBridge. EventBridge invokes Lambda that then talks to AgentCore and tells it there is an event to investigate. Lambda error generator, that's the fake generator for errors where I will click several times just to generate some errors. Error alarm itself, agent SNS topic which will be used to send an email to an engineer in case the severity or sentiment of the error is high. Then we have the role itself and some metadata on the bottom.

Thumbnail 2320

All of these have adequate permissions to access all these services, so you can really scope the permissions for each of these to have access only to what it needs to have access to. Now let's go to the project file. It is similar to the one that we already saw, so it's similar to the projects that we already saw. You see requirements, you see instructions. The instructions are of course different. It's not the same thing.

I'm instructing in this case, I'm instructing the agent to, when it receives an alert, when it receives the alert, to go back to CloudWatch to collect additional data about that alert, and then based on that to decide what level of sentiment it is. Also, I am using several, you'll see several different tools to get additional information about the tool and about the alerts, and those tools are defined in the code itself. So I want to get log group name query. I want to create CloudWatch Insights query, when it started, when it ended, and so on and so forth.

Thumbnail 2390

Thumbnail 2400

Then on the end, you'll see, I also instructed the agent to retry, to retry at least two times if it can do it. So get more information tool, notify high severity problem tool, those are all the tools that an agent uses once it gets more information and determines the sentiment of the problem itself. Everything in the instructions file, so the better you do this, the better results you will have. Again, some error handling. You should always have some error handling. It's not the best or smartest one, but at least it's there for the tool to know, for the agent to know what needs to be done.

Thumbnail 2410

Again, this is a Python application. This application basically defines how the agent starts once it's initiated, what tools are available, what instructions are there, and so on and so forth, so it preloads everything into the service. I've been using Amazon Bedrock AgentCore Runtime for this, which basically in the backend builds a container, deploys a Docker container to ECR, then deploys the agent in it and runs it through that infrastructure.

Thumbnail 2440

Thumbnail 2450

Thumbnail 2460

Thumbnail 2470

The tools themselves are not invoked immediately, of course, when the tool is deployed or the agent is deployed. They are invoked after something happens. So there are two parts to this, you may say. The first part is how to start the agent, what are the instructions, and what tools are available. The second part is when something happens in an event-driven manner, as we've been talking about all day, then what to do with it and what tools to use to construct the proper query. This includes collecting data from the logs and then basically deciding what level of severity the issue actually is.

Thumbnail 2490

Thumbnail 2500

Thumbnail 2510

All of this happens before we've even seen our engineer yet. This is standing in the background of the service, listening for those alerts and then triggering when needed. So these are the tools that I have mentioned, either tools for sentiment analysis, tools for generating inquiries, or tools for getting additional data on the log files, because I just cannot decide sentiment, or the agent cannot decide the sentiment based just on there being an alert. It needs to go back, collect the alert information, and then basically decide the sentiment and what will happen next. So those tools have arguments like log group name, query, start time, and end time.

Thumbnail 2530

Thumbnail 2550

Thumbnail 2560

Of course, when that happens, it triggers a Lambda function, which will then go to the agent and determine the next possible steps. This could involve ECS service, Lambda, RDS, and so on and so forth. So again, it can send a message, it can restart a service, it can run code, it can be whatever is needed. The question, of course, here is how much automation do you want to give it? Do you need to have somebody in the middle, like a person that will go through this and validate if something is wrong? How much power do you want to give to these kinds of services while they're restarting, rebooting, changing, modifying, or sending some commands to have something done?

Thumbnail 2630

So here we are deploying the agent core from the terminal. It will generate the Docker container and run it, using all the instructions in the application Python code and collecting all the tools that it needs to basically run the agent. There are two Lambda functions, as I said. We have one Lambda that is like a dummy Lambda for generating errors. That's the one on the top, the Lambda Error Generator, and then the second Lambda is basically to act as a middleman between EventBridge and the agent core itself, which then defines again all those tools and how to access that information.

Thumbnail 2650

Thumbnail 2670

This is just a simple error generator which just triggers a couple of errors once we go back into the console to see how that gets triggered and the result, which I'm sorry to say is not super spectacular, but it will show you what actually happens. So let's go back to Agent Runtime. Let's see what happened. As you can see, Version 9, so there was one version in between which we didn't see that was run for test purposes. So this is now Version 9.

Thumbnail 2690

Thumbnail 2700

If we go to Lambda functions, you will see those two Lambda functions that we are basically talking about. This one is to communicate with the agent core. No, this one is to generate events, right? Yes, for testing. So this one will generate, if I click it multiple times, it will just send some fake events.

Thumbnail 2720

Thumbnail 2730

Thumbnail 2740

Thumbnail 2750

We expected errors against demo gods, so we even recorded one. It sends to CloudWatch, and you will see a couple of events in CloudWatch. Let's see that part. As you can see, several events were generated from Lambda. In real life, of course, this would be something that happened actually in production. Again, the alarm that is triggered changed state. When the state is changed, it creates an EventBridge event, and then from there we are choosing to send this to AgentCore. That's one of the options where the AgentCore with all those instructions will decide the severity and everything else. But once you're here, you basically can do anything. This is the beauty of EventBridge from previous options that you had.

Thumbnail 2770

Thumbnail 2780

Thumbnail 2790

Thumbnail 2800

Thumbnail 2820

Here again, I see two events happening. Back to Lambda, for the other one, the Lambda that will actually talk to AgentCore. It has instructions to go back to CloudWatch and look for more information on the problem. Again, you will see here that it has been triggered. It's a small trigger, but it's there. You see that this has been triggered. Let's see the logs itself. This is what's happening. Let me stop quickly here. Here you can see the full log of what's actually happening. Lambda errors, and inside here you can even see that it's going back, instructing that it's going back to CloudWatch and basically saying it needs to collect additional information on this error and define the severity of this error.

Thumbnail 2870

Again, there is no query anywhere. There is no secret query. The query and all the queries are generated on the fly based on the information provided from the log. At the end, it's a JSON payload. As you can see, severity was determined to be high. So severity was determined to be high in this case. It could be low, medium, or high. Again, we decide what is low, medium, high. I mean, we decide what action happens if it's low, medium, or high. In this case, it's high, so a high severity notification was sent.

Let me stop here. As I said, not a very spectacular result, but this is what we sent to our Site Reliability Engineer. This is the email he got. You can make it prettier. It doesn't have to go to email. It can go to a ticketing system with a high priority assigned to somebody and so on and so forth. But this is basically the idea. The general idea, as Jeff explained, is to actually automate part of this in a super easy way with the technologies that were already there for most applications in event-driven architecture. We are just adding on top of it some smarts in a way that lets us reuse existing LLMs but give them on top some of our own knowledge from the logs that we're collecting, and then utilize that to offload work from our people, from engineers that need to resolve those issues, of course, at the end for the benefit of the customer and the application.

Thumbnail 2990

Conclusion: Enhancing Event-Driven Applications Without Starting from Scratch

I hope I managed to put it a bit closer for all of you to understand what is the possibility and how to enhance your existing service application. You don't need to build everything from the ground up. Jeff and I will close up now. Thank you. That's a good demo. Hopefully that demonstrated how by starting with building your architecture in an event-based manner using EventBridge and then incorporating agentic agents into that process, you can help scale to deal with characteristics that you might have previously thought were beyond your business capabilities. With that, Martin and myself, we're happy with what we built. Our customers are happy, our engineers are happy, and our customer assistants are happy.

Thumbnail 3000

With that, we thank you very much for your time today. We thank you for coming to re:Invent, and we're going to be hanging around a little bit longer if you have any questions for us. Thank you very much.


; This article is entirely auto-generated using Amazon Bedrock.

Top comments (0)