Kazuya

AWS re:Invent 2025 - Elevate Streaming Quality with AI: Prime Video's Innovative Approach (AMZ306)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Elevate Streaming Quality with AI: Prime Video's Innovative Approach (AMZ306)

In this video, Prime Video engineers Brian and Mona share how they use AI agents to scale streaming operations for 200+ million members globally. Brian demonstrates an automated artwork quality moderation system using Amazon Bedrock and Strands agents that reduced manual review by 88% and evaluation time from days to under an hour, achieving 78% precision for defect detection like safe zones and text placement. Mona presents a multi-agent system for detecting and mitigating streaming quality issues in real-time, using orchestrated agents for request handling, routing, data integration, analysis, and reasoning. Both teams emphasize breaking problems into smaller tasks, building robust evaluation loops, and using AI throughout the software development lifecycle to handle massive scale—like 18 million concurrent viewers for Thursday Night Football.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Prime Video's Scale and AI-Driven Quality Challenges

Welcome everyone to session AMZ 306: Elevate Streaming Quality with AI: Prime Video's Innovative Approach. Before we start, how many of you watch Prime Video? It's awesome. You might have seen the latest shows like Fallout and Ted Lasso, or watched NBA or Thursday Night Football on there. This session will be really interesting for you to learn how Prime Video uses AI, particularly generative AI, to improve their artwork quality and to improve the streaming quality so that you can watch uninterrupted movies and uninterrupted sports on your favorite channel, Prime Video.

To start off, I'm Tulip Gupta, a Principal Solutions Architect with AWS. I support strategic accounts under AWS, primarily in media and entertainment, and Prime Video is one of my customers. I have with me Brian and Mona from Prime Video. Brian is a Principal SDE at Prime Video, and Mona is a Senior Manager of Data Engineering. They will introduce themselves later when they present their use cases.

Thumbnail 80

Thumbnail 100

We are going to start with an introduction and go through all of the use cases that Brian will talk about regarding artwork quality moderation using generative AI. Then Mona will elaborate on streaming quality and how they improve that using generative AI agents. We will also talk about the challenges they faced in their individual journeys, as well as the solution and journey on AWS. We will also discuss the demonstration and benefits at the end.

When we talk about Prime Video, Prime Video is global. Prime Video has millions of titles in its catalog that you can stream on hundreds of devices. They support over 30 different languages, operate in more than 240 countries and territories, and support over 200 million Prime members worldwide. So nearly anywhere around the world you can log in to Prime Video and enjoy your favorite content. Let's say you're traveling to Europe or India or anywhere else—you'll be able to log in and watch your favorite show there.

Thumbnail 150

Prime Video has 315 million data points and a global ad-supported reach of over 300 million monthly viewers. This is how Prime Video has grown over time. They started in 2016 with The Late Show, and as you can see from 2018 to 2020 to 2022, they have increased the number of streams and the number of shows they produce. There is a lot going on, and as they scale, they also need to scale how they manage streaming quality and how they reach the millions of customers out there. It's a huge accomplishment, and as they continue to grow, scaling efficiently impacts their ability to deliver these events.

Thumbnail 220

They are lining up their biggest slate of theatrical movies ever, and with Prime Video, customers can customize their viewing experience and find their favorite movies, series, and live events, including Amazon MGM Studios produced series and movies like Fallout. In 2025, they also added Peacock, Fox One, and Peacock Premium Plus. It has been quite a journey as they have grown over time.

Thumbnail 240

Here is a little trivia. In 2025, what was the peak audience for the Commanders versus Packers Thursday Night Football game on Prime Video? Can anyone answer? We see 18 million. You can see how massive it was with 18 million people logging in at the same time and watching the Thursday Night Football game. To operate at that scale, they also want to scale out the infrastructure primarily on AWS to meet those demands.

Thumbnail 260

As for the use cases, they wanted to improve and moderate streaming quality, which Mona will touch on, and moderate artwork quality at scale, which Brian will talk about. When you look at the scale involved, it grows linearly. To operate at this scale, they depended on solutions like generative AI tools for evaluation and performance, and that is how they were able to scale out. For example, for Brian's use case, they get content from Peacock, from Starz, and from PBS for NFL, MLB, and NBA, and they are able to moderate it by using generative AI tools.

Thumbnail 310

Thumbnail 320

Understanding AI Agents and AWS's Agentic AI Portfolio

Some of you may already be aware of the GenAI tools out there; I'm going to cover some of them in the coming slides. When we talk about GenAI tools, you might have heard about AI agents in particular. So what are AI agents? They represent a tectonic shift in how we build, deploy, and interact. At AWS, we have GenAI agents: autonomous or semi-autonomous software systems that can reason, plan, and act to accomplish goals in digital or physical environments.

Thumbnail 370

Foundation models on their own handle a single request, but agents are able to carry out a task independently. They can call the foundation model, access databases, and call tools, and they perform the task on their own. They leverage AI to reason, plan, and accomplish the task autonomously. And that brings us to how we think about the evolution of agentic AI.

Moving from left to right, there is the most human oversight on the left, where we first started with generative AI assistants, which you might have heard about when they came out two years ago. They follow a set of rules and automate repetitive tasks. As we move toward the right, we have the GenAI agents we touched on in the previous slide, which can do a singular task very well. Further to the right, we have agentic AI systems: fully autonomous multi-agent systems that can orchestrate among themselves to accomplish a set of tasks and can mimic human logic and reasoning. As we move from left to right, there is less and less human oversight and increasing autonomy.

Thumbnail 420

Thumbnail 440

When we talk about the AWS agentic AI portfolio, some of you might be familiar with SageMaker. SageMaker essentially gives you the compute power and the models so you can customize your own model. You can build, train, and host your own models, and we also provide compute options like Trainium and Inferentia. Then we have the middle layer, which is AI and agent development software and services. In this layer we have Amazon Bedrock. How many of you are familiar with Amazon Bedrock out here? For folks who are not, this is a managed service that gives you a flexible, comprehensive service for AI application and agent development. It offers access to foundation models, and you can customize models, build applications with your data, and apply safety guardrails.

Thumbnail 490

Let's say you're building an assistant that provides movie recommendations, and you want it to refuse questions about politics. That's what we mean by guardrails. Bedrock also provides AgentCore, one of the new services we added, which essentially helps you scale, deploy, and operate your agents at scale on AWS. If you build an application and you want to scale it out to hundreds or millions of users, AgentCore is what you would use. This is just briefly touching on the services. Then we have SDKs for agents, like Nova Act, which is designed to take action within the browser, and Strands agents, which Mona and Brian are going to talk about in depth when they cover their use cases and how they have been leveraging Strands agents.

Thumbnail 530

Strands Agents is another open source SDK that AWS has developed. It is a lightweight SDK that helps you develop agents very quickly and lets them call multiple tools to accomplish a task. On top we have applications. As you move up from SageMaker, things get more and more managed: you don't have to manage your own infrastructure. For Bedrock, you do not have to manage your infrastructure, and the application layer gives you whole applications out of the box that you can play around with. For example, CodeWhisperer acts as a coding assistant; if you want to develop an app like a Streamlit app that acts like a travel agent, it can quickly design and build it out for you.

Thumbnail 590

Then we have Amazon Q Developer for accelerating software development, Amazon Q Business for your data and for answering questions, AWS Transform for accelerating enterprise modernization of .NET, mainframe, and VMware workloads, and Amazon Connect, which speeds up customer service to delight customers. And lastly, there is the marketplace where you can access a lot of agents, just a quick overview of the whole portfolio out there. And that brings us to Strands Agents. We briefly covered what Strands Agents is in the previous slide. Again, as I said, it's an open source SDK. In the past, we would basically provide agents a template for how to do things.

Right now, LLMs are getting smarter and smarter, and Strands Agents provides a lightweight model for more intuitive development: it can figure out how to call the LLM and the tools on its own. You can get started in minutes instead of hours, and it provides robust capabilities like native tools and MCP servers, and you can also extend it with custom model providers, custom tools, and MCP. This helps with rapid development and prototyping. That's what Mona's and Brian's teams leveraged to be able to experiment quickly, use new services, and iterate on what they were developing to create that robust evaluation loop.
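To make that concrete, here is a minimal sketch of a Strands agent, assuming the open source `strands-agents` package with its `Agent` class and `@tool` decorator; the tool, prompt, and question below are illustrative and not part of Prime Video's system.

```python
from strands import Agent, tool  # pip install strands-agents

@tool
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

# By default the agent uses a Bedrock-hosted model; Strands decides on its own
# when to call the tool based on the question it receives.
agent = Agent(
    system_prompt="You answer questions about text, using tools when helpful.",
    tools=[word_count],
)

result = agent("How many words are in the sentence 'streaming quality at scale'?")
print(result)
```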

Brian Breck on Artwork Quality: The Challenge of Manual Evaluation

Thanks Mona. My name is Brian Breck, and I'm a principal engineer with the partner experience team within Prime Video. We work with our content creators like major studios and independent filmmakers to ingest their content and prepare it for streaming customers. That content includes not only the streamable assets but also artwork, metadata, trailers, and bonus material as part of that data set.

Thumbnail 710

Just to get us started, I have a quick question for everybody: raise your hands, how many of you stream with more than one device? Probably most of you. I stream with my phone, my laptop, my TV, and my tablet, and that creates a lot of complexity at Prime Video. We have to support a number of form factors and backgrounds on top of 30+ languages in over 200 territories.

What we're going to be talking about today is artwork quality. We use artwork to represent movies, TV shows, channels, and carousels, and that artwork can show up in the streaming site as well as in marketing material and advertising. Now, the artwork is provided by our partners, and the artwork may be beautiful, but it may not meet the needs of Prime Video.

Thumbnail 760

For example, we need to be able to crop artwork depending on what form factor we're going to be showing it. We also may need to overlay a logo or an action button. So we need a safe zone on the perimeter of the artwork so that we can work within those parameters. I've highlighted the safe zone in purple. In the first example, we can see on the left there's plenty of text on the side of Lioness, and on the right, only the shoulder of the last actor is being cut off. So this is a perfectly acceptable safe zone.

Thumbnail 810

Now in the second example, we can see that a head is being cut off for one of the characters and there's some text that's completely obscured, so we could not use this for Prime Video. This is just one type of defect that we're looking for. We're looking for other things like mature content, pixelation, issues with localization, as well as accessibility for things like color blindness.
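To illustrate the geometry involved, here is a hedged sketch of a form-factor crop plus a perimeter safe-zone box, using Pillow; the margin ratio, aspect ratio, and file name are assumptions for illustration, and the hard part, deciding whether text or faces fall inside that margin, is what the LLM-based detection described later handles.

```python
from PIL import Image  # pip install pillow

def safe_zone_box(width: int, height: int, margin_ratio: float = 0.1):
    """Return the (left, top, right, bottom) box of the interior safe zone.

    margin_ratio is an assumed fraction of each dimension reserved on the
    perimeter for logos, action buttons, and cropping headroom.
    """
    dx, dy = int(width * margin_ratio), int(height * margin_ratio)
    return (dx, dy, width - dx, height - dy)

def crop_to_aspect(img: Image.Image, target_ratio: float) -> Image.Image:
    """Center-crop an artwork to a target form-factor aspect ratio (width / height)."""
    w, h = img.size
    if w / h > target_ratio:                 # too wide: trim left/right
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        return img.crop((left, 0, left + new_w, h))
    new_h = int(w / target_ratio)            # too tall: trim top/bottom
    top = (h - new_h) // 2
    return img.crop((0, top, w, top + new_h))

art = Image.open("lioness_key_art.jpg")      # hypothetical file name
mobile = crop_to_aspect(art, 2 / 3)          # assumed portrait form factor
print("mobile safe zone:", safe_zone_box(*mobile.size))
```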

Thumbnail 840

Building an Evaluation Framework with Generative AI

We have a number of challenges in this space. One of the biggest is that a lot of this work is traditionally done manually. Our partners provide us a piece of artwork. It needs to get into a manual evaluator's queue. They need to provide their feedback, and then that feedback needs to get back to the partner, and there may be multiple iterations of this, so the entire process can take multiple days.

Our solution to this generally has been to create an ML model that we train with artwork examples, and then we use those examples to find defects. But that's a time-consuming process and we have over 30 defects that we're looking for today in our artwork, and that number is only growing. Now another problem that we have is data. We have been collecting data from our manual evaluators on acceptable and not acceptable content.

Thumbnail 940

Evaluators don't always follow the same standard operating procedure, so what may pass one evaluator may not pass another. This inconsistency can leak into our data sets and make them more difficult to use. Once we have a solution, the evaluation can take several days to run over a data set and then figure out what changes need to be made for the next iteration to improve the system.

Thumbnail 970

When we're talking about evaluation, what we generally do is take a few thousand images and have a human annotator go through and provide what the correct result should be. That's what we see with the ground truth column. Then we run our automated solution and look for discrepancies in the results. In this example, in the second image, ground truth says that this should have passed, but the model we were using is going to fail that piece of artwork. Maybe it thought that someone's hair was too close to the edge, and so now we need to provide more artwork as training data to solve for this problem or create additional instruction in our algorithm to allow this to pass.
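Conceptually, the discrepancy check is a precision/recall comparison between ground truth and model verdicts. A minimal sketch, with illustrative labels and field names:

```python
def evaluate(ground_truth: dict[str, str], predictions: dict[str, str]) -> dict:
    """Compare model verdicts against human ground truth for one defect type.

    Both dicts map artwork_id -> "pass" | "fail". A "fail" prediction that the
    ground truth marks "pass" is a false positive, and vice versa.
    """
    tp = fp = fn = tn = 0
    for art_id, truth in ground_truth.items():
        pred = predictions.get(art_id, "pass")
        if truth == "fail" and pred == "fail":
            tp += 1
        elif truth == "pass" and pred == "fail":
            fp += 1
        elif truth == "fail" and pred == "pass":
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "false_positives": fp, "false_negatives": fn}

print(evaluate(
    {"img1": "fail", "img2": "pass", "img3": "fail", "img4": "pass"},
    {"img1": "fail", "img2": "fail", "img3": "fail", "img4": "pass"},
))
```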

Thumbnail 1010

But we're only looking at four images here. Imagine if we're looking at thousands of images with results in S3 buckets that we've got to pore over. It can just take a lot of time to evaluate the results. One of the things we noticed is that with LLMs and their multimodal counterparts, we can detect certain defects with those foundation models. We've seen anecdotes of this being used in other places, and so we wanted to try it out for our use case.

Thumbnail 1080

What we ended up doing is using the Q CLI, which is now folded into Kiro, to generate a few algorithms for us, and we used a few different foundation models to try out some results. We wanted to move quickly and make sure that this was going to work for us. When we ran the results, we anecdotally saw that it was promising. So then we wanted to go and perform one of the evaluations that we just looked at. We saw that our precision wasn't high enough, and we knew that we could improve it, but we also knew that it was going to take a few iterations, so we wanted to move a lot faster than we had in the past.

Thumbnail 1120

What we ended up doing is creating an evaluation framework that we use for defect detection. We take as input data sets with ground truth as well as some initial configuration, and then as output we get the results as well as feedback on how it could be improved. Instead of diving into S3 buckets, we can see views of the artwork. For example, with Uncharted, we can see a mobile view versus a web view and how that's going to look in the different use cases.

Thumbnail 1140

Strands Agents Architecture: Orchestrating Defect Detection

We also get some benchmarking statistics based on how the run compared to ground truth as well as to previous runs. We also have the ability to dive in and take a look at some of the issues at the individual artwork level. Here we're taking a look at the backend architecture for our evaluation system. It is a system orchestrated by Strands, with individual agents that perform the defect detection, perform the evaluation of the results, and also agents for suggesting improvements to the process.

Thumbnail 1170

The way that it works is that a user will submit an initial request through our CloudFront and load balancer instances and hit our API. That API will store the data in our config table and that will include an initial prompt used for defect detection. It will include a link to a data set along with the ground truth data. It will also include some initial configuration such as which model to use and things like temperature.
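For illustration, a config record along those lines might look like the following; the table name, attribute names, and values are assumptions, not Prime Video's actual schema.

```python
import boto3
from decimal import Decimal

config_table = boto3.resource("dynamodb").Table("artwork-eval-config")  # hypothetical table

config_table.put_item(Item={
    "run_id": "safe-zone-001",
    "defect_type": "safe_zone",
    # Initial prompt used for defect detection (abbreviated here).
    "prompt": "You are reviewing streaming artwork. Fail the image if key text "
              "or a face falls inside the outer safe-zone margin...",
    "dataset_s3_uri": "s3://example-artwork-eval/datasets/safe_zone/",
    "ground_truth_s3_uri": "s3://example-artwork-eval/ground_truth/safe_zone.json",
    "model_id": "anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model id
    "temperature": Decimal("0.2"),   # boto3's resource layer expects Decimal, not float
})
```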

Thumbnail 1240

Thumbnail 1250

The orchestrator will pick up that configuration and delegate each piece of artwork to our evaluation subject agent that will actually perform the defect detection. Once we've gotten through all of those pieces of artwork, we write the results to our S3 results bucket. Once the results are written, we use our results calculator to generate the statistics that we saw in the previous slide. We also use that data to determine some next steps.

Finally, we send that data—both the results from the defect detection as well as the results that were calculated—to our prompt improver. The prompt improver agent will take a look at all of that data and make a determination on what should be done next. That could be making changes to the prompt, suggesting different models to use, or different configurations for using those models. Once we've gotten through that process, that data is written back to our config table and can be used for the next run.

Strands is doing a few things for us. First, it's simplifying our interaction with LLMs. Second, it is allowing us to easily create relationships between agents, and it also provides some out-of-the-box tools that we've been able to take advantage of. When we're talking about tools, there are a few built-in tools that we use on a regular basis. One would be the image reader, which allows us to prepare the artwork for the LLM or the LMM when it's initially being called.

Thumbnail 1330

We use file read and write so that we can do intermediate manipulation of the images as part of certain processes. We also use agents as tools to create those relationships between the agents and be able to call them explicitly. On top of that, we've also created some custom tools. For example, like the safe zone example that we talked about, we have a crop image custom tool. We have also a transparency check tool for readability and accessibility, and the set of tools that we have available to us now has grown significantly.
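As a sketch of what a custom tool can look like in Strands (the real crop_image and transparency-check tools' interfaces aren't shown in the talk, so the signature below is hypothetical, and the `strands_tools` package name is assumed):

```python
from PIL import Image
from strands import Agent, tool
from strands_tools import image_reader  # built-in image reader tool; package name assumed

@tool
def crop_image(path: str, left: int, top: int, right: int, bottom: int, out_path: str) -> str:
    """Crop an artwork file to the given pixel box and write it to out_path."""
    Image.open(path).crop((left, top, right, bottom)).save(out_path)
    return out_path

# An agent that can read artwork and crop it while checking the safe zone.
safe_zone_agent = Agent(
    system_prompt="Check whether important content survives the safe-zone crop.",
    tools=[image_reader, crop_image],
)
```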

Thumbnail 1390

Unlike the safe zone defect detection process, not everything can be a simple pass or fail, nor can we always use that information as pass or fail in order to improve the system. Sometimes we need qualitative results as well as quantitative results in order to decide what our next steps are. What we do is create a judge that takes a look at the evaluation performance and provides additional context for why things failed and what could be done better on an individual artwork basis.

Thumbnail 1430

We provide some initial configuration to our DynamoDB table for the judge. The judge configurator agent reads that information and prepares it for the judge itself. Then the judge will take a look at the results of the evaluation and provide additional context that can then be used as part of the prompt improver step. We could use a judge for all of our defects, but it's expensive, so we only add that configuration and step when it's absolutely necessary. However, it has been critical to get some of our defect detection mechanisms in place.
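A minimal sketch of that judge step as a Strands agent; the prompt, the disagreement record, and the field names are illustrative assumptions, not the production configuration.

```python
from strands import Agent

judge = Agent(
    system_prompt=(
        "You are a quality judge for artwork defect detection. For each case where "
        "the automated verdict disagreed with ground truth, explain the likely cause "
        "and suggest what the detection prompt should handle better."
    ),
)

disagreements = [
    {"artwork": "example_key_art.jpg", "ground_truth": "pass",
     "model_verdict": "fail", "model_reason": "hair too close to the edge"},
]

# The qualitative critique feeds the prompt improver step described earlier.
critique = judge(f"Review these disagreements and summarize the themes: {disagreements}")
print(critique)
```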

Overall, Strands has greatly simplified a lot of what we're trying to do. It has made it so that we don't have to move images around. All of our interfaces are text-based, and we can access the images centrally.

Results and Lessons Learned: 88% Reduction in Manual Effort

We are also able to use the system to run regression tests. So if we want to change the model, change the configuration, or change a prompt, we can validate that we haven't made things worse. This has been so successful that we've also started using it for things like text. As I mentioned, we receive metadata and have to validate the synopsis, so we have a bunch of defects that we look for in a synopsis running through the same process.

Thumbnail 1550

We generally use about 2000 images and have humans run through them and provide the ground truth result. When we initially started this process, we were running into some local maximums with precision. We just weren't hitting the values that we thought we should be able to reach, so we were in situations where fixing one false positive or false negative would cause another one. What we realized is that as we started digging into the data, the manual evaluators were using inconsistent criteria. Some of them would pass something that others would fail and vice versa.

As part of this process, we ended up creating a standard operating procedure for the manual evaluators that would also be shared with the automated system so we could have consistent results. Those consistent results led to a better ground truth dataset. That better ground truth dataset allowed us to run the evaluation loop where we could run an evaluation dataset, look for ways to improve it, and then run another evaluation set, basically creating an auto-tuning mechanism that was completely hands-off. We could just watch the system improve itself. This actually simplified our runtime solution.

Thumbnail 1650

We take the configuration that we were generating in our evaluation phase and load it into an App Config instance. Then we allow our partners to upload their artwork through our portal, which goes through API Gateway and is delegated to modules, each representing a particular defect. We can run defect detection in parallel, and for each defect detection mechanism, we are reading the configuration from App Config. In combination with providing the artwork to Bedrock, we're able to generate those results almost in real time. Where it would take several days for the partners to get the results back, we're actually providing that within a minute.
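A hedged sketch of that runtime path: each defect module loads its prompt from AWS AppConfig and asks a multimodal model on Bedrock for a verdict, with modules running in parallel. The application, profile, model, and file names are placeholders.

```python
import concurrent.futures
import json
import boto3

appconfig = boto3.client("appconfigdata")
bedrock = boto3.client("bedrock-runtime")

def load_config() -> dict:
    """Fetch the latest defect-detection configuration from AWS AppConfig."""
    session = appconfig.start_configuration_session(
        ApplicationIdentifier="artwork-moderation",        # placeholder identifiers
        EnvironmentIdentifier="prod",
        ConfigurationProfileIdentifier="defect-prompts",
    )
    resp = appconfig.get_latest_configuration(
        ConfigurationToken=session["InitialConfigurationToken"])
    return json.loads(resp["Configuration"].read())

def detect(defect: str, prompt: str, image_bytes: bytes) -> str:
    """Ask a multimodal model on Bedrock for a pass/fail verdict on one defect."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model id
        messages=[{"role": "user", "content": [
            {"text": prompt},
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
        ]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

config = load_config()
image = open("uploaded_art.jpg", "rb").read()           # artwork uploaded via the portal
with concurrent.futures.ThreadPoolExecutor() as pool:    # one module per defect, in parallel
    futures = {d: pool.submit(detect, d, p, image) for d, p in config["prompts"].items()}
    print({d: f.result() for d, f in futures.items()})
```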

The solution isn't perfect, so when partners get a result that they don't agree with, they are allowed to override it. What that means is that we recommended a particular update, but they think the artwork is fine, so it goes into a manual evaluator's queue and follows the old process. The great thing about this solution is that where we previously reviewed 100 percent of the artwork provided to us, now we're only reviewing about 10 to 12 percent of it, and that has been a huge time saver for our manual evaluators.

Thumbnail 1760

Thumbnail 1800

Through this process, starting with the QCLI solution, we were only at about 37 percent precision. We were able to get that number to 78 percent for the safe zone. We were able to reduce false positives and negatives by 70 percent, and we were also able to reduce the amount of time it takes to get results from several days to less than an hour for certain circumstances. But most importantly, we were able to reduce that manual effort by 88 percent. So we learned a few things along the way. First was don't try to do too much at once.

The context windows have been growing recently, and we initially tried to take advantage of that by running defect detection for multiple types of issues simultaneously. However, we found that this approach was not effective. Instead, we broke the problem down into individual defect types, ranked them by how often they occurred and how much effort it took to manually perform the detection, and tackled them one at a time.

The next thing we learned is that we can really use generative AI throughout the entire software development lifecycle. We started with our initial proof of concept being generated by Amazon Q. Then we used generative AI to create our system design, followed by development, evaluation, and much of our monitoring. We estimate that we used generative AI for approximately 85% of what we showed for both the evaluation framework and the production system, which was a huge time saver and saved months of engineering work.

We also found that LLMs are effective at improving their own prompts. We used Claude to examine prompts that we were providing to Claude, and it was effective at telling us where we could improve things that were specific to that particular model. Additionally, we found it very helpful to establish a robust evaluation loop, allowing us to iterate quickly and make changes. Even when the changes we were making were manual, we could simply kick the process off again and see how it worked without spending so much time in the investigation phase, which was critical for our success.

Thumbnail 1950

Finally, we learned that manual evaluation is hard and error-prone. Therefore, it was worth taking the time to generate a high-quality dataset to ensure that our automated processes are successful. We started with the safe zone and have since moved to logo and text placement, offensive and mature content detection. We are also looking at text legibility, localization, and accessibility. All of these capabilities are either in production and being used by our partners, or they will be available by the end of the year.

Thumbnail 2020

Enterprise Agentic AI Applications: From Code to Custom Workflows

That concludes the presentation. It was great to be able to share this with you, and I'm going to hand it back to Tulip, who will continue.

You heard from Brian how they were able to use Amazon Q and how they were able to use agentic AI in their framework to moderate their artwork quality and place it in the right location. I want to talk about agentic AI a little bit more before I hand it over to Mona to talk about her story regarding agentic AI and generative AI.

Thumbnail 2060

When we think about the two flavors of agentic AI, one is accelerating software development, where we discussed applications like Amazon Q Developer helping you as a code assistant and helping you code and develop applications. The second is reimagining business workflows with custom agents, and this is where Mona's use case becomes really important. They were able to use multiple agents that could orchestrate between them and create a custom workflow to accomplish their goal of improving streaming quality. This is the use case that we will focus on next, which Mona will discuss.

When we talk about enterprise agentic AI applications, we have an LLM at the center that serves as the brain, reasoning through information and providing output. However, it needs help. LLMs typically do not have the latest information. For example, if you ask any of the LLMs what the weather is in New York, it would not be able to answer because it does not know what the current date is or what the weather conditions are. Therefore, we provide it with data points through tools that help it call a weather API or access the current time, and databases can show information like where New York is located, providing the information it needs to give you an accurate response.

It also probably needs memory to be able to understand the current context and conversation, take that action, and give you the correct answer. For example, it's 56 degrees Fahrenheit in New York. I'm not sure if it's actually 56 degrees right now, but I'm just using that as an example. Another important thing is observation and guardrails, where we want to ensure that if you're asking it for weather information, it doesn't answer you about politics or tell you who the current president of the United States is. We want to be able to restrict it to answer only for the prompt or the context that we're asking it for.
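To ground the weather example, here is a hedged sketch of tool use with the Amazon Bedrock Converse API; the tool name, schema, and model id are illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# The model is told which tools exist; it cannot fetch live weather on its own.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_current_weather",                 # hypothetical tool
            "description": "Look up the current temperature for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model id
    messages=[{"role": "user", "content": [{"text": "What's the weather in New York?"}]}],
    toolConfig=tool_config,
)

# When the model decides it needs live data, it returns a toolUse block instead of
# a final answer; the application then calls the real weather API and sends the
# result back in a follow-up message.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("model wants to call:", block["toolUse"]["name"], block["toolUse"]["input"])
```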

Thumbnail 2150

With Amazon Bedrock, you're able to do all of that. It provides you the ability to access models. It provides the ability to call tools with MCP, the Model Context Protocol introduced by Anthropic, and A2A, agent-to-agent, for orchestrating multiple agents. It provides frameworks like the Strands Agents we talked about, as well as LangGraph, so you can build that agentic framework code and deploy your agents at scale on AWS with runtime, gateway, memory, and observability. And of course, it provides the ability to customize and fine-tune your models. With that, I'll hand it over to Mona to talk about her use case and how they improve streaming quality.

Thumbnail 2220

Mona on Streaming Quality: Building an Autonomous Detection System

Thanks. So, quick show of hands: how many of you have been on a call trying to use a whole bunch of data to detect, localize, and mitigate issues? I see a few hands here. Well, I'm very excited to talk about our journey of building an agentic workflow to detect, localize, root-cause, and mitigate streaming quality issues. I'm Mona, and I'm a Senior Manager at Amazon Prime Video. Prime Video is a global platform. We stream high-scale live events, video on demand, and linear content for our customers. Whether it is the Patriots scoring a touchdown or Barcelona scoring a goal, we obsess over our customers being able to take in that moment.

Sometimes, while streaming to millions of customers, even a brief interruption can mean that thousands or millions of customers are interrupted in their viewing experience. Traditional operational approaches, such as manual monitoring of metrics or reactive root-causing, simply do not cut it at our scale. We asked ourselves: what must we fundamentally build? We need a system that is not just monitoring these metrics, but actively understanding them, learning from them, and able to autonomously take action on them. That's exactly what we set out to build.

Thumbnail 2290

We had a few challenges that we kept in mind and used as guiding principles while working through these systems. First, we wanted our system to have access to a multimodal set of data. This can include things like time series, infrastructure logs, player logs, graphs, and so on. We also wanted the system to not just have access to this data, but to actually understand it and almost build an intuition for it. Finally, we wanted the system to be accessible to pretty much anyone on the engineering teams, so that it did not require special expertise or domain expertise to understand our data.

Thumbnail 2330

With that in mind, we went ahead and built an agentic AI system that put together multiple agents orchestrated using Strands Agents to accomplish these tasks. One of the qualities of this system is that it can reason through complex tasks, break them down into multiple simple tasks, and then chain them together in a sequence to accomplish the set task. We also made sure that the system was not just a one-and-done, but was constantly learning from all of the data around it, keeping the most current operational snapshot of the data, and actively learning from any past mistakes or feedback that it received.

Thumbnail 2380

So let's take an overall 30,000-foot view of what the architecture looks like. This system was built as an AWS-native as well as an AI-native system. It's orchestrated with Strands Agents, uses AWS Lambda for things like authentication as well as orchestrating across the different agents, uses Athena for querying as a data store, and uses DynamoDB for some of the global state management.

This backend system is foundational and can be used for a multitude of different frontend interfaces. For example, it can be used as a chatbot interface where somebody can input a natural language question and receive an answer. It can also be autonomously triggered by a different system that has detected an issue, enabling a bunch of different use cases to be facilitated with the same backend system.

Multi-Agent System Architecture: From Request Handler to Response

Diving into the components of this system, this is a multi-agent system with a bunch of different agents and sub-agents working together to accomplish the task. First, we start by looking at the request handler. The request handler is the front gate of the system. Once it receives a request, it will first authenticate the request, then validate it, and then decompose this request into simpler tasks.

For example, if the question asked was what was the rebuffering rate on iPhone devices in Germany over the past week, the request handler will break this down to understand that this ask involves a metric ask, a trend analysis ask, and an ask for a specific cohort involving devices, geographic information, and time periods. The request handler also has a guard rail agent that validates and ensures the request is compliant with what the system is supposed to support.
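For illustration, the decomposed form of that question might be represented roughly as follows; the field names and values are assumptions, not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class DecomposedRequest:
    """Illustrative shape of a request after the handler has broken it down."""
    metric: str
    analysis: str
    device: str | None = None
    territory: str | None = None
    time_range: str | None = None
    subtasks: list[str] = field(default_factory=list)

decomposed = DecomposedRequest(
    metric="rebuffer_rate",
    analysis="trend",
    device="iPhone",
    territory="DE",
    time_range="last_7_days",
    subtasks=[
        "fetch rebuffer_rate time series for iPhone devices in DE over the past week",
        "compute week-over-week trend",
        "summarize findings in natural language",
    ],
)
print(decomposed)
```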

Now that we have discussed the request handler, let's look at what happens after that. The routing agent can be thought of as an intelligent orchestrator or traffic controller. Based on the request it received from the handler, it tries to understand what different capabilities need to be invoked to service this request. The routing agent understands what those capabilities are and passes that along as a signal to invoke other sub-agents, agents, tools, data sources, and so on. It can really be thought of as the brain of the operation once it gets that decomposed request.

The routing agent also uses the chain of thought process in terms of breaking down a complex task, reasoning through it like a human would, and then finally understanding what capabilities it requires. From the routing agent, we then have the integrator sub-agent. The integrator sub-agent can be thought of as a traffic controller. Once it has received the request from the routing agent, it knows what specific tools and data sources it needs to connect to.

This integrator sub-agent works through MCP, which is Model Context Protocol, and is able to communicate through a host of different tools and data sources. It is able to work through different access patterns, APIs, and access formats. It is also able to combine a bunch of these data together by knowing the right join conditions. The integrator sub-agent also serves as a data quality check and ensures that only the right kind of data and the right quality of data is accessed by the system.
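As a hedged sketch of how an MCP client connects to a tool server, using the open source `mcp` Python SDK; the server script and tool name here are hypothetical stand-ins for the kinds of data sources the integrator reaches.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def query_metrics() -> None:
    """Connect to a (hypothetical) metrics MCP server and call one of its tools."""
    server = StdioServerParameters(command="python", args=["metrics_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "get_rebuffer_rate",                      # hypothetical tool name
                {"device": "iPhone", "territory": "DE", "days": 7},
            )
            print(result.content)

asyncio.run(query_metrics())
```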

Post the integrator sub-agent, we then have the analysis sub-agent. The analysis sub-agent can really be thought of as a data scientist in a box. The analysis sub-agent is primarily based on Amazon Bedrock and has a whole bunch of both large language models and small language models that it can access to service a specific request. You can really think of this as sitting in an ensemble of different models and leveraging the right models per the use case and per the capability that is needed.

Now that we have discussed the analysis sub-agent, the next thing we have is the reasoning agent. The reasoning agent takes all of the input it has received from the prior agents and sub-agents and uses the business context it has access to in order to determine whether the analysis it has been given is pertinent and relevant. This can be thought of as an iterative approach that uses an LLM as a judge to validate the responses received from the previous analysis agent, for example.

Using that business context, it is able to tell whether this is really the expected answer or whether it would be something else. The reasoning agent also has the ability to have an iterative loop to go back and request a different capability or invoke a different data source based on what it might have received from the LLM as a judge.
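A minimal sketch of that validate-and-retry loop as it might look with a Strands judge agent; the prompts, the acceptance criterion, and the retry budget are all assumptions, and we assume the agent result stringifies to the model's reply.

```python
from strands import Agent

judge = Agent(system_prompt=(
    "You are validating a streaming-quality analysis against business context. "
    "Reply ACCEPT if the conclusion is consistent with the context, otherwise "
    "reply RETRY followed by what additional data or analysis is needed."
))

def reason(analysis: str, business_context: str, rerun_analysis, max_iterations: int = 3) -> str:
    """Accept the analysis or loop back for another pass, up to a retry budget."""
    for _ in range(max_iterations):
        verdict = str(judge(f"Context: {business_context}\n\nAnalysis: {analysis}"))
        if verdict.strip().upper().startswith("ACCEPT"):
            return analysis
        # Ask the earlier agents for another pass, guided by the judge's feedback.
        analysis = rerun_analysis(verdict)
    return analysis   # fall back to the latest analysis after exhausting retries
```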

Thumbnail 2720

Thumbnail 2740

From the reasoning agent, we then have the response handler. The response handler takes all of the different input that it has received from the routing agent, the reasoning agent, as well as any multiple iterations of the loop that we have just discussed. It packages all of this information together into the expected output format.

Thumbnail 2790

This could be things like the response to a natural language question, generation of a complex SQL query, or alternatively an autonomous action or a mitigation lever that can be pulled in response to a certain trigger. The response handler also interacts with the guard rail agent to make sure that the response is compliant with the required data sharing and other such activities. Separately, the response handler logs all of these decisions that it has made, similar to what all of the other agents have done. It takes all of these logs which can then be used for reflective analysis in terms of improving some of the decision making.

Taking a step back and looking at the system overall, we have the request handler that takes in a request, validates it, and makes sure it is in the right format. It decomposes it into simpler tasks and moves that along to the routing agent, which knows which capabilities it needs to invoke. From there it goes to the integrator agent, which can orchestrate a bunch of different tools and data sources, followed by the analysis agent, which is really the data scientist in a box ensemble of LLMs that can be used. Finally, we have the reasoning agent that uses business context along with what is given from the analysis agent to make sure that it is an acceptable and reliable answer. It can also trigger reiterations of the loop, invoking other capabilities and data sources as needed, and finally the response handler that packages all of this and makes it available either as some sort of output, a mitigation lever, or an autonomous action.

Thumbnail 2880

Key Takeaways: Automation, Iteration, and the Agentic Future

As you can see, we also have both an online evaluation system, similar to the LLM as a judge that I mentioned, as well as an offline evaluation system that takes all of these logs of decision making and can be used iteratively in terms of understanding and improving the system. Now that we have discussed the overall components of the system and how it works in real time to help detect, localize, as well as root cause issues, let me share some of the lessons that we learned through the process of developing the system.

First, data quality always beats data quantity. When you are trying to develop such a system, there can be a whole lot of things that you might want to add, such as infrastructure logs, metrics, past incidents, and tickets. However, you really want to make the most efficient use of your context window and make sure that you are giving the right set of data that will actually get you to the right outcomes. Being judicious in terms of the data that you have as part of the system is especially important.

The next lesson is building feedback loops early and often. This really helps in the efficiency of development as well as the efficiency of getting your system to the levels of accuracy and reliability that you are looking for. Another important lesson is planning for failure modes. Systems do fail, and while we want to have an autonomous system, there are going to be times when you might want to have human evaluation or when the system is exposed to a brand new situation. You do want to have safe ways that the system can fail and trigger the right sort of human involvement as needed.

Thumbnail 2970

Finally, continue to use AI to amplify human expertise, whether that is understanding the business context better, understanding data better, or other aspects. You always want to have AI amplifying human expertise throughout the software development life cycle process. In terms of what is next for our system, we want to continue to build more mitigation levers and have more autonomous actions that the system is able to perform by itself. We are continuing to use AI through the software development cycle in terms of accelerating our development, having quick prototyping using things like SageMaker, as well as tools for quick orchestration and quick proof of concepts. We are also using AI in the deployment as well as the overall maintenance of the system.

And finally, having more and more safety mechanisms so that you continue to not only build but also keep trust in your system and leverage more and more autonomous actions. So with that, I will hand it over to Tulip.

Thank you, Mona. So let's see how many of you were paying attention. What was the peak audience during the Commanders versus Packers game in 2025? Does anyone remember? That's right, yes. So you heard from Mona and Brian about how they were able to use GenAI in their respective use cases and how, if you remember that agentic AI evolution from more human oversight to less human oversight, they were able to automate a lot of those tasks.

Thumbnail 3060

For both of their use cases, the main goal was to deliver premium video and artwork at scale to millions of customers across thousands of different devices. And obviously, when you're trying to do that, with millions of customers globally and content streaming 24/7, you want to ensure that none of Prime Video's customers are impacted while the teams experiment and watch how content is being delivered. That's where automation using AI helped them. It was more precise, with increased productivity, because there was less human oversight: they were able to let the agents do the work, experiment faster and faster, and get the output they needed.

Thumbnail 3110

Thumbnail 3130

And so their first key takeaway is more automation. They reduced their evaluation time; as you heard from Brian, it took days before, and they were able to reduce it to 15 minutes while maintaining performance without human intervention, moving along that slide from more human oversight to less human oversight. They were also able to scale smart: they broke the problem into small chunks and then scaled out. So if you have a bigger use case like the ones they wanted to achieve, break it into smaller chunks, iterate on it, use AI agents, and then build a multitude of agents that orchestrate across it to accomplish the goal you need.

Thumbnail 3150

And have a robust evaluation loop. Whatever you build and whatever you want to do, as in both Mona's and Brian's use cases, a robust evaluation loop enables rapid iteration and continuous improvement throughout the process. They were able to validate the agents' outcomes, see whether they were right, and then improve their agents as their experimentation went on.

Thumbnail 3180

Thumbnail 3200

So I'll leave you with this quote from Andy Jassy, who recently said that what makes this agentic future so compelling for Amazon is that these agents are going to change the scope and speed at which we can innovate for customers. And that's true. We are working on it. That's the agentic future we look at at Amazon, and that's the scope we have in mind when we talk about it.

So if you're here, around the corner there's One Amazon Lane, where you can see some of the other use cases we have from Prime Video. We have X-Ray Recap, Rapid Recap, and the NASCAR Burn Bar, which show some of the sports innovations we have done, as well as X-Ray itself. If you watch Prime Video, when you watch a movie or a TV episode, it can show you what the scene is about, and you can recap and summarize it. That's out there in the demo at One Amazon Lane, on top of other demos like the Zoox robotaxi. So if you're here, I would highly recommend passing by and checking out some of the cool demos from Amazon, including the ones from Prime Video.

Thumbnail 3260

And with that, I'll end the session. If you have enjoyed the session, please complete the session survey in the mobile app. We look forward to your feedback and we obviously work off your feedback to improve our sessions as we go. So thank you so much for attending the session.


This article is entirely auto-generated using Amazon Bedrock.
