Kazuya
AWS re:Invent 2025 - Elevate Streaming Quality with AI: Prime Video's Innovative Approach (AMZ306)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Elevate Streaming Quality with AI: Prime Video's Innovative Approach (AMZ306)

In this video, Prime Video engineers Brian and Mona demonstrate how they use AI agents and Amazon Bedrock to solve streaming quality challenges at massive scale. Brian explains their artwork quality moderation system that reduced manual evaluation by 88% and cut review time from days to under an hour, achieving 78% precision for safe zone detection. Mona presents their multi-agent system for detecting and mitigating streaming issues across 200+ countries and 315 million viewers, using Strands Agents orchestration with components like Request Handler, Routing Agent, and Analysis Sub-Agent. Both teams emphasize breaking problems into smaller tasks, establishing robust evaluation loops, and leveraging generative AI throughout the development lifecycle to serve millions of Prime Video customers watching content like Thursday Night Football's 18 million peak audience.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Prime Video's Scale and AI-Driven Quality Challenges

Welcome everyone to session AMZ 306, Elevate Streaming Quality with AI: Prime Video's Innovative Approach. Before we start, how many of you watch Prime Video? It's awesome. You might have seen the latest shows like Fallout and Lasso, or watched NBA or Thursday Night Football on there. This session will be really interesting for you to learn how Prime Video uses AI, particularly generative AI, to improve their artwork quality and streaming quality so that you can watch uninterrupted movies and uninterrupted sports on your favorite channel, Prime Video.

To start off with, I'm Tulip Gupta. I'm a Principal Solutions Architect with AWS, and I support the strategic accounts under AWS, primarily the media and entertainment accounts under Amazon. Prime Video is one of my customers, and I have with me Brian and Mona from Prime Video. Brian is a Principal SDE at Prime Video, and Mona is a Senior Manager of Data Engineering, and they are going to introduce themselves later on when they present their use cases.

Thumbnail 80

Thumbnail 100

We are going to start off with an introduction, go through all of the use cases that Brian is going to talk about regarding artwork quality moderation that he uses generative AI with, and then Mona is going to elaborate on the streaming quality and how they improve that using generative AI agents. We're going to talk about the challenges that they faced in their individual journeys as well as the solution and journey on AWS, and also talk about the demonstration and benefits at the end.

When we talk about Prime Video, Prime Video is global. Prime Video has millions of titles in its catalog that you can stream on hundreds of devices. They support over 30 different languages, operate in more than 240 countries and territories, and support over 200 million Prime members worldwide. Nearly anywhere around the world you can log in to Prime Video and enjoy your favorite content. So let's say you're traveling to Europe or India or anywhere else, you'll be able to log in and watch your favorite show there.

Thumbnail 150

Prime Video has a global ad-supported reach of over 315 million monthly viewers, and this is how Prime Video has grown over time. They started off in 2016 with the Late Show, and as you can see, moving from 2018 to 2020 to 2022, they have increased the amount of streaming they do and the number of shows they produce. There's a lot going on, and as they scale, they also want to be able to scale how they manage streaming quality and how they reach millions of customers out there.

Thumbnail 220

It's a huge accomplishment, and as they continue to grow bigger and bigger, scaling efficiently does impact their ability to deliver these events over streaming. They're lining up their biggest slate of theatrical movies ever, and with Prime Video, customers can customize their viewing experience and find their favorite movies, series, and live events, including Amazon MGM Studios produced series and movies like Fallout. In 2025, they also added Fox One and Peacock Premium Plus, so it's been quite a journey as they've grown over time.

Thumbnail 240

Here's a little trivia question. In 2025, what was the peak audience for the Commanders versus Packers Thursday Night Football game on Prime Video? Can anyone answer? Any guesses? 18 million. You can see how massive it was, with 18 million people logging in at the same time and watching the Thursday Night Football game. To operate at that scale, they also need to scale out the infrastructure, primarily on AWS, to be able to meet those demands.

Thumbnail 260

To talk about what the use cases were, they wanted to be able to improve and moderate streaming quality, which Mona is going to touch on, and to moderate artwork quality at scale, which Brian is going to talk about. Given the scale we just looked at, to operate at that level they depended on solutions like generative AI tools for evaluation and performance, and that's how they were able to scale. For example, for Brian's use case, they get content from Peacock, Starz, PBS, the NFL, MLB, and the NBA, and they are able to moderate it using generative AI tools.

Thumbnail 310

Thumbnail 320

Understanding AI Agents and AWS's Agentic AI Portfolio

Some of you may already be familiar with the Gen AI tools out there, and I'm going to cover a few of them in the coming slides. When we talk about Gen AI tools, you might have heard about AI agents in particular. So what are AI agents? They represent a tectonic shift in how we build, deploy, and interact. At AWS, these are autonomous or semi-autonomous software systems that can reason, plan, and act to accomplish goals in digital or physical environments.

Thumbnail 370

We had foundation models, which can do one task, but when we talk about agents, they're able to carry out a task independently. They're able to call the foundation model, access databases, and call tools, so they can perform that task on their own. They leverage AI to reason, plan, and accomplish the task themselves. And that brings us to how we think about the evolution of agentic AI. When we move from left to right, there was more human oversight when we first started with generative AI assistants, which you might have heard about when they came out two years ago. They follow a set of rules and automate repetitive tasks.

Thumbnail 410

Thumbnail 420

As we move towards the right, we have the Gen AI agents we touched on in the previous slide, and they are able to do a singular task very well. Then we move further right to agentic AI systems, which are fully autonomous multi-agent systems that can orchestrate among themselves to accomplish a set of tasks and can mimic human logic and reasoning. As we move from left to right, there's less and less human oversight and increasing autonomy. And that brings us to the AWS agentic AI portfolio; some of you might be familiar with something called SageMaker.

Thumbnail 440

SageMaker essentially gives you the compute power and the tooling to customize your own model. You can build, train, and host your own models, and we also provide custom chips like Trainium and Inferentia. Then we have the middle layer, which is AI and agent development software and services, and in this we have Amazon Bedrock. How many of you are familiar with Amazon Bedrock? Cool. For folks who are not, this is the managed service. Basically, it gives you a flexible, comprehensive service for AI application and agent development.

Thumbnail 490

It offers access to foundation models. You can customize models and applications with your data and apply safety guardrails. So let's say you're building an assistant that provides movie recommendations and you say, hey, don't answer questions about politics; it shouldn't be able to do that. That's what we mean by guardrails. It also provides AgentCore, one of the new services we added, which essentially helps you scale and deploy your agents and operate at scale on AWS. So if you build an application and you want to scale it out to millions of users, AgentCore is what you'll be using. And this is just briefly touching on the services.

Thumbnail 530

Then we have SDKs for agents, like Nova Act, which is designed to take action within the browser. And then we have Strands Agents, which Mona and Brian are going to talk about in depth when they cover their use cases and how they have been leveraging Strands. Strands Agents is another open source SDK that AWS has developed; it's lightweight and helps you develop agents very quickly and call multiple tools to accomplish a task. And on top we have applications.

As you move up from raw infrastructure, it obviously gets more and more managed, and you don't have to manage your own infrastructure. For Bedrock, you do not have to manage your infrastructure, and with the applications, you get whole applications that you can play around with. For example, Kiro acts as a coding assistant: if you want to develop an app, like a Streamlit app that acts as a travel agent, it'll be able to quickly design it and build it out for you. Then we have Amazon Q Developer for accelerating software development, Q Business for your data and answering questions, Transform to accelerate enterprise modernization of .NET, mainframe, and VMware workloads, and Amazon Connect, which speeds up customer service to delight customers.

Thumbnail 590

And then lastly, there's the marketplace, where you can go and access a lot of agents, so that's a quick overview of the whole portfolio. And that brings us to Strands Agents. We briefly covered what Strands Agents is in the previous slide; again, it's an open source SDK. In the past, we basically had to provide agents a template of how to do things. Now LLMs are getting smarter and smarter, and Strands Agents provides a lightweight model for more intuitive development, where the agent can figure out how to call the LLM and the tools on its own.

You can get started in minutes instead of hours, and it provides robust capabilities like native tools and MCP servers, and you can also extend it with custom model providers, custom tools, and MCP. So it helps you with rapid development and prototyping. And that's the key thing that Mona's and Brian's teams leveraged to be able to experiment quickly, use new services, and iterate on what they're developing to build that robust evaluation loop.
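To make that concrete, here is a minimal sketch of a Strands agent in Python. This is not code from the talk: the catalog question and the `get_title_metadata` tool are illustrative assumptions, and the SDK's default model configuration is used.

```python
# Minimal Strands Agents sketch (assumes the strands-agents package is installed).
from strands import Agent, tool


@tool
def get_title_metadata(title: str) -> dict:
    """Hypothetical tool: look up basic catalog metadata for a title."""
    # In a real system this would call a catalog service or database.
    return {"title": title, "type": "series", "languages": ["en", "de", "ja"]}


# The agent decides on its own when to call the tool and how to use the result.
agent = Agent(
    system_prompt="You answer questions about the Prime Video catalog.",
    tools=[get_title_metadata],
)

answer = agent("Which languages is Fallout available in?")
print(answer)
```

The SDK handles the model call and the decision of when to invoke the tool, which is what makes this kind of prototyping fast.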

Brian Breck on Artwork Quality Moderation: The Challenge of Manual Evaluation

And with that, I'll hand it over to Brian to talk about his use case and the artwork quality. Thanks, Tulip. So my name is Brian Breck, and I'm a Principal Engineer with the Partner Experience team within Prime Video. We work with our content creators like major studios and independent filmmakers to ingest their content and prepare it for streaming customers. That content includes not only the streamable assets but also artwork, metadata, trailers, and bonus material as a part of that data set.

Thumbnail 710

So just to get us started, quick question for everybody. How many of you, raise your hands, stream with more than one device? Probably most of you. It looks like most of you. So I stream with my phone, my laptop, my TV, my tablet, and that creates a lot of complexity at Prime Video. We have to support a number of form factors and backgrounds on top of 30+ languages in over 200 territories.

Thumbnail 760

Now what we're going to be talking about today is artwork quality. We use artwork to represent movies, TV shows, channels, and carousels, and that artwork can show up on the streaming site as well as in marketing material and advertising. Now, the artwork is provided by our partners, and the artwork may be beautiful, but it may not meet the needs of Prime Video. For example, we need to be able to crop artwork depending on what form factor we're going to be showing it on. We also may need to overlay a logo or an action button, and so we need a safe zone on the perimeter of the artwork so that we can work within those parameters.

Thumbnail 810

So I've highlighted the safe zone in purple. In the first example, we can see on the left there's plenty of space on the side of Lioness, and on the right, only the shoulder of the last actor is being cut off. So this is a perfectly acceptable safe zone. Now in the second example, we can see that a head is being cut off for one of the characters and there's some text that's completely obscured, so we could not use this for Prime Video.

Thumbnail 840

Now this is just one type of defect that we're looking for. We're looking for other things like mature content, pixelation, issues with localization, as well as accessibility for things like color blindness. So we have a number of challenges in this space. One of the biggest is that a lot of this work is traditionally done manually. So our partners provide us a piece of artwork, it needs to get into a manual evaluator's queue, they need to provide their feedback, and then that feedback needs to get back to the partner. And there may be multiple iterations of this, so the entire process can take multiple days.

Now our solution to this generally has been to create an ML model that we train with artwork examples, and then we use those examples to find defects. But that's a time-consuming process, and we have over 30 defects that we're looking for today in our artwork, and that number is only growing. Now another problem that we have is data. We have been collecting data from our manual evaluators on acceptable and not acceptable content. However, they don't always follow the same standard operating procedure, and so what may pass one evaluator may not pass another, and that data can leak into our datasets and make it more difficult to use. And then once we have a solution, the evaluation can take a while. We can take several days to run it over a dataset and then figure out what changes need to be made next for the next iteration so that we can improve the system.

Thumbnail 940

Thumbnail 970

Now when we're talking about evaluation, what we generally do is we take a few thousand images and we have a human annotator that goes through and provides what the correct result should be, and that's what we see with the ground truth column. Then we run our automated solution and then we look for discrepancies in the results. So in this example, in the second image, we can see that ground truth says that this should have passed, but the model that we were using is going to fail that piece of artwork. Maybe it thought that Ryan Reynolds' hair was too close to the edge, and so now we need to provide more artwork as training data to solve for this problem or create some additional instruction in our algorithm to allow this to pass. But we're only looking at four images here. Imagine if we're looking at thousands of images with results in S3 buckets that we've got to pore over. It can just take a lot of time to evaluate the results.

Thumbnail 1010

Building an Automated Evaluation Framework with Strands Agents

Now one of the things that we noticed is that with LLMs and their multimodal counterparts, we can detect certain defects with those foundation models. We've seen anecdotes of this being used in other places, and so we wanted to try it out for our use case. So what we ended up doing is using Q CLI, which is now folded into Kiro, to generate a few algorithms for us and try out a few different foundation models to see some results. We wanted to move quickly. We wanted to make sure that this was going to work for us.

When we ran the results, we anecdotally saw that it was promising, so then we wanted to go and perform one of the evaluations that we just took a look at. We saw that our precision wasn't high enough, and we knew that we could improve it, but we also knew that it was going to take a few iterations, so we wanted to move a lot faster than we had in the past. So what we ended up doing is creating an evaluation framework that we use for defect detection. We take as input datasets with ground truth as well as some initial configuration, and then as output we get the results as well as feedback on how it could be improved.

Thumbnail 1120

Instead of diving into S3 buckets, we can see views of the artwork, so for example, with Uncharted, we can see a mobile view versus a web view and how that's going to look in the different use cases. And then we get some benchmarking statistics based off of how it compared to ground truth as well as how it compared to previous runs, and then we also have the ability to dive in and take a look at some of the issues at the individual artwork level.

Thumbnail 1140

Thumbnail 1170

So here we're taking a look at the backend architecture for our evaluation system. It is orchestrated by Strands, with individual agents that perform the defect detection, agents that perform the evaluation of the results, and agents for suggesting improvements to the process. The way it works is that a user submits an initial request through our CloudFront and load balancer instances and hits our API. That API stores the data in our config table, which includes an initial prompt used for defect detection, a link to a dataset along with the ground truth data, and some initial configuration such as which model to use and settings like temperature that determine how to use that model.

Thumbnail 1210

Next, the orchestrator picks up that configuration and delegates each piece of artwork to our evaluation sub-agent, which actually performs the defect detection. Once we've gotten through all of those pieces of artwork, we write the results to our S3 results bucket.
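As a rough sketch of that orchestration step (run configuration in DynamoDB, per-image delegation, results written to S3), the following Python is illustrative only: the table name, bucket name, and field names are assumptions, and `detect_defect` stands in for the evaluation sub-agent.

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

CONFIG_TABLE = "artwork-eval-config"    # illustrative names, not the real ones
RESULTS_BUCKET = "artwork-eval-results"


def run_evaluation(run_id: str, detect_defect) -> None:
    """Evaluate every artwork item listed in a run's configuration."""
    config = dynamodb.Table(CONFIG_TABLE).get_item(Key={"run_id": run_id})["Item"]
    prompt = config["prompt"]                # initial defect-detection prompt
    model_id = config["model_id"]            # e.g. a Bedrock model identifier
    dataset = json.loads(config["dataset"])  # [{"image_uri": ..., "ground_truth": ...}, ...]

    results = []
    for item in dataset:
        verdict = detect_defect(item["image_uri"], prompt, model_id)
        results.append({
            "image_uri": item["image_uri"],
            "ground_truth": item["ground_truth"],
            "verdict": verdict,
        })

    s3.put_object(
        Bucket=RESULTS_BUCKET,
        Key=f"{run_id}/results.json",
        Body=json.dumps(results).encode("utf-8"),
    )
```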

Thumbnail 1240

Thumbnail 1250

Once the results are written, we use our results calculator to generate the statistics that we saw in the previous slide, and we also use that data to determine some next steps. Finally, we will send that data on both the results from the defect detection as well as the results that were calculated and send that to our prompt improver. The prompt improver agent will take a look at all of that data and make a determination on what should be done next. That could be making changes to the prompt, it could be suggesting different models to use, or different configurations for using those models.

Once we've gotten through that process, that data is written back to our config table and then can be used for the next run. I'll take a step back real quick. Strands is doing a few things for us. First, it's simplifying our interaction with LLMs. Second, it is allowing us to easily create relationships between agents, and it also provides some out-of-the-box tools that we've been able to take advantage of.
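Here is a hedged sketch of what that prompt-improver step could look like as a Strands agent; the system prompt wording, the JSON response shape, and the 20-item disagreement sample are assumptions, not details from the talk.

```python
import json

from strands import Agent

prompt_improver = Agent(
    system_prompt=(
        "You review defect-detection results against ground truth. "
        "Suggest a revised detection prompt, and optionally a different "
        "model or temperature, that would reduce the disagreements. "
        "Respond as JSON with keys: prompt, model_id, temperature, rationale."
    )
)


def improve_config(current_config: dict, stats: dict, disagreements: list) -> dict:
    """Ask the improver agent for the next run's configuration."""
    response = prompt_improver(json.dumps({
        "current_config": current_config,
        "stats": stats,                       # e.g. precision and recall vs. the previous run
        "disagreements": disagreements[:20],  # a small sample keeps the context focused
    }))
    # Assumes the model follows the JSON instruction in the system prompt.
    return json.loads(str(response))
```

The returned configuration would then be written back to the config table for the next run.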

Thumbnail 1330

When we're talking about tools, there are a few built-in tools that we use on a regular basis. One is the image reader, which allows us to prepare the artwork for the LLM (or rather the LMM, the large multimodal model) when it's initially being called. We use file read and write so that we can do intermediate manipulation of the images as part of certain processes, and then agents as tools to create those relationships between the agents and be able to call them explicitly.

On top of that, we've also created some custom tools. Like the safe zone example that we talked about, we have a crop image custom tool. We also have a transparency check tool for readability and accessibility, and the set of tools that we have available to us now has grown significantly.
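As an illustration of what custom tools like the crop-image or transparency check might look like with the Strands `@tool` decorator, here is a sketch using Pillow; the function names, the 10% margin, and the transparency heuristic are assumptions rather than Prime Video's actual implementation.

```python
from io import BytesIO

from PIL import Image  # Pillow, assumed here for image handling
from strands import tool


@tool
def crop_to_safe_zone(image_bytes: bytes, margin_pct: float = 0.1) -> bytes:
    """Crop an image to its interior safe zone, removing a margin on every side."""
    img = Image.open(BytesIO(image_bytes))
    w, h = img.size
    dx, dy = int(w * margin_pct), int(h * margin_pct)
    cropped = img.crop((dx, dy, w - dx, h - dy))
    out = BytesIO()
    cropped.save(out, format="PNG")
    return out.getvalue()


@tool
def transparency_ratio(image_bytes: bytes) -> float:
    """Return the fraction of fully transparent pixels, as a rough readability signal."""
    img = Image.open(BytesIO(image_bytes)).convert("RGBA")
    alpha = img.getchannel("A").getdata()
    transparent = sum(1 for a in alpha if a == 0)
    return transparent / (img.width * img.height)
```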

Thumbnail 1390

Now, unlike the safe zone defect detection process, not everything can be evaluated as a simple pass-fail, nor can we always use pass-fail information alone to improve the system. Sometimes we need qualitative results as well as quantitative results in order to decide what our next steps are. What we do is create a judge that takes a look at the evaluation performance and provides some additional context for why things failed and what could be done better on an individual artwork basis.

Thumbnail 1430

What we do is we provide some initial configuration to our DynamoDB table for the judge. The judge configurator agent reads that information and prepares it for the judge itself, and then the judge will take a look at the results of the evaluation and provide additional context that can then be used as a part of the prompt improver step. Now, we could use a judge for all of our defects, but it's expensive, and so we only add that configuration and step when it's absolutely necessary. However, it has been critical to get some of our defect detection mechanisms in place.
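A minimal sketch of that judge setup, assuming a DynamoDB table keyed by defect type and a Strands agent built from the stored instructions; the table name and attribute names are illustrative.

```python
import json

import boto3
from strands import Agent

JUDGE_CONFIG_TABLE = "artwork-judge-config"  # illustrative table name


def build_judge(defect_type: str) -> Agent:
    """Read the judge configuration for a defect type and construct the judge agent."""
    item = boto3.resource("dynamodb").Table(JUDGE_CONFIG_TABLE).get_item(
        Key={"defect_type": defect_type}
    )["Item"]
    return Agent(system_prompt=item["judge_instructions"])


def judge_result(judge: Agent, artwork_uri: str, verdict: dict, ground_truth: str) -> str:
    """Ask the judge for qualitative context on why a verdict disagreed with ground truth."""
    return str(judge(json.dumps({
        "artwork": artwork_uri,
        "model_verdict": verdict,
        "ground_truth": ground_truth,
        "question": "Why might these disagree, and what would improve the prompt?",
    })))
```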

Overall, Strands has greatly simplified a lot of what we're trying to do. It has made it so that we don't have to move images around. All of our interfaces are text-based, and we can access the images centrally.

Production Implementation and Results: 88% Reduction in Manual Effort

We are also able to use the system to run regression tests. So if we want to change the model or if we want to change the configuration or a prompt, we can validate that we haven't made things worse. This has been so successful that we've also started using it for things like text. As I mentioned, we receive metadata, and so we have to validate the synopsis. We have a bunch of defects that we look for in a synopsis that is running through the same process.

Thumbnail 1550

As I was saying before, we generally use about 2,000 images and have humans run through them and provide the ground truth result. When we initially started this process, we were running into some local maxima with precision. We just weren't hitting the values we thought we should be able to reach, and we were running into situations where fixing one false positive or false negative would cause another one. As we started digging into the data, we realized that the manual evaluators were using inconsistent criteria. Some of them would pass something that others would fail, and vice versa.

As a part of this process, we ended up creating a standard operating procedure for the manual evaluators that would also be shared with the automated system so we could have consistent results. Those consistent results led to a better ground truth data set. That better ground truth data set allowed us to run that loop that I was showing where we could run an evaluation data set, we could look for ways to improve it, and then run another evaluation set, basically creating an auto-tuning mechanism where it was completely hands off the wheel and we could just watch the system improve itself.
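For reference, the kind of benchmarking this loop relies on can be as simple as comparing verdicts to ground truth. This sketch assumes a plain pass/fail labeling scheme and is not the team's actual scoring code.

```python
def score_run(results: list[dict]) -> dict:
    """Compare model verdicts to ground truth and compute simple precision/recall."""
    tp = fp = fn = tn = 0
    for r in results:
        predicted_fail = r["verdict"] == "fail"
        actual_fail = r["ground_truth"] == "fail"
        if predicted_fail and actual_fail:
            tp += 1
        elif predicted_fail and not actual_fail:
            fp += 1  # false positive: acceptable artwork was rejected
        elif not predicted_fail and actual_fail:
            fn += 1  # false negative: a defect slipped through
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "false_positives": fp, "false_negatives": fn}
```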

Thumbnail 1650

That actually simplified our runtime solution. We take the configuration we were generating in the evaluation phase and load it into an AppConfig instance. Then we allow our partners to upload their artwork through our portal. It goes through API Gateway and is then delegated to modules, each representing a particular defect, so we can run defect detection in parallel. Each defect detection mechanism reads the configuration from AppConfig and provides the artwork to Bedrock, and we're able to generate those results in almost real time. Where it would take several days for the partners to get results back, we're now providing them within a minute.
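At the core of each defect module is a multimodal model call. A hedged sketch of what that could look like with the Bedrock Converse API is below; the prompt text and the PASS/FAIL convention are assumptions (in production the prompt comes from AppConfig), and the model id is whatever the configuration specifies.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

SAFE_ZONE_PROMPT = (
    "You are checking streaming artwork. Is any face, head, or text cut off "
    "or obscured within a 10% safe zone on each edge? Answer PASS or FAIL "
    "with a one-sentence reason."
)  # illustrative prompt; the production prompt is loaded from AppConfig


def check_safe_zone(image_bytes: bytes, model_id: str) -> str:
    """Send one piece of artwork plus the detection prompt to a Bedrock model."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": SAFE_ZONE_PROMPT},
            ],
        }],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},
    )
    return response["output"]["message"]["content"][0]["text"]
```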

The solution isn't perfect, and so when the partner gets a result that they don't agree with, they are allowed to override that result. What that means is we recommend that they make some particular update, and they think the artwork is fine, so it goes into a manual evaluator's queue and we'll use the old process. But the great thing about this solution is that where we were reviewing 100% of the artwork that was provided to us, now we're only reviewing say 10 to 12% of the artwork, and that has been a huge time saver for our manual evaluators.

Thumbnail 1760

Thumbnail 1800

Through this process, starting with the QCLI solution, we were only at about 37% precision. We were able to get that number to 78% for the safe zone. We were able to reduce false positives and negatives by 70%, and we were also able to reduce the amount of time it takes to get results from several days to less than an hour for certain circumstances. But most importantly, we were able to reduce that manual effort by 88%.

So we learned a few things along the way. First was don't try to do too much at once.

The context windows have been growing recently and we initially tried to take advantage of that. We tried to run defect detection for multiple types of issues at once and found that that was just not the way to go. So what we ended up doing was we broke the problem down into the individual defect types, ranked them by how often they occurred and how much effort it was to manually perform the detection, and tackled them one at a time.

The next thing that we learned is that we can really use generative AI throughout the life cycle. So we started off with our initial proof of concept being generated by Q Developer. Then we used generative AI to create our system design and then for development and then for evaluation and a lot of our monitoring. So we estimate that we used generative AI for about 85% of what I showed for both the evaluation framework and the production system, and so it was a huge time saver, saved months of engineering work.

We also found that LLMs are effective at improving their own prompts, so we used Claude to take a look at prompts that we were providing to Claude, and it was effective at telling us where we could improve things that were specific to that particular model. We also found it very helpful to establish that robust evaluation loop, so being able to just iterate quickly, make changes, even when the changes that we were making were manual, we could just kick the process off again, see how it worked, and not spend so much time in the investigation phase was critical for our success. And last, we learned that manual evaluation is hard and error prone, and so it was worth it to take the time to generate a high quality data set so that we can make sure that our automated processes are successful.

Thumbnail 1950

So we started off with the safe zone. We've since moved to logo and text placement, offensive and mature content. We're taking a look at text legibility, localization, and accessibility. All of these things are either in production, being able to be used by our partners, or they will be there by the end of the year. So that's the presentation. It was great to be able to share that with you, and I'm going to hand it back to Tulip, who is going to continue.

Thumbnail 2020

Enterprise Agentic AI Applications: Reimagining Business Workflows

Thank you Brian. So you heard from Brian how they were able to use agentic AI and Bedrock in their framework to moderate their artwork quality and place it correctly. I want to talk about agentic AI a little bit more before I hand it over to Mona to tell her story about Bedrock agents and generative AI. When we think about the two flavors, one is accelerating software development, where we talked about applications like Kiro and Q Developer acting as coding assistants and helping you code and develop apps. The second one is reimagining business workflows with custom agents, and this is where Mona's use case becomes really important.

Thumbnail 2060

So they were able to use a few agents out there that were able to orchestrate between them and created that custom workflow to be able to accomplish that goal that they need, which was improving the streaming quality. And so this is the one that we're going to focus on in the next use case that Mona is going to talk about. And so briefly about when we talk about enterprise agentic AI applications, when we have an LLM in the center, that's the brain that basically reasons, that basically understands and gives you the output. But it needs help.

When we talk about LLMs, it probably doesn't have the latest information. So if you go and ask any of the LLMs out there what's the weather in New York, it wouldn't be able to answer it because it doesn't know what date it is, it doesn't know what the weather is, so we want to give it some data points. And so the tools out there help it call maybe a weather API or the current time, the database can maybe show you where New York is, things like that.

These tools give it the information it needs to correctly reason and respond. It also probably needs memory to be able to understand the current context and conversation, take that action, and give you the correct answer, like, oh, it's 56 degrees Fahrenheit in New York. I'm not sure if it's actually 56 degrees right now, I'm just using that as an example. Another important thing is observability and guardrails, where we want to ensure that if you're asking it for weather information, it doesn't answer about politics or who the current president of the United States is. So we want to restrict it to answer only for the prompt or the context that we're asking about.
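To illustrate that tools-plus-guardrails idea, here is a small hedged sketch with Strands: the weather lookup is a hypothetical stand-in for a real weather API, and the refusal instruction in the system prompt is only a prompt-level stand-in for proper guardrails.

```python
from datetime import datetime, timezone

from strands import Agent, tool


@tool
def current_time() -> str:
    """Give the model the current UTC time, which it cannot know on its own."""
    return datetime.now(timezone.utc).isoformat()


@tool
def get_weather(city: str) -> dict:
    """Hypothetical weather lookup; a real agent would call a weather API here."""
    return {"city": city, "temperature_f": 56, "conditions": "partly cloudy"}


assistant = Agent(
    system_prompt=(
        "You answer weather questions only. Politely refuse anything else, "
        "such as questions about politics."  # prompt-level stand-in for guardrails
    ),
    tools=[current_time, get_weather],
)

print(assistant("What's the weather in New York right now?"))
```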

Thumbnail 2150

And so with Amazon Bedrock, you're able to do all of that. It provides you the ability to access models. It provides you the ability to call tools with MCP, the Model Context Protocol introduced by Anthropic, and A2A, agent to agent, which is how multiple agents can orchestrate with each other. It supports frameworks like Strands Agents, which we talked about, Q, and LangGraph to build that agentic framework, and AgentCore to deploy your agents at scale on AWS with runtime, gateway, memory, and observability. And obviously it provides you the ability to customize and fine-tune your models. And with that, I'll hand it over to Mona to talk about her Prime Video use case and how they improve streaming quality.

Thumbnail 2220

Mona on Streaming Quality: A Multi-Agent System for Autonomous Issue Detection

All right, thanks Tulip. So quick show of hands, how many of you have been on call trying to use a whole bunch of data to detect, localize, and mitigate issues? All right, I see a few hands here. Well, I'm very excited to talk about our journey of building an agentic workflow to detect, localize, root cause, and mitigate streaming quality issues. I'm Mona and I'm a senior manager at Amazon Prime Video. Prime Video is a global platform. We stream high-scale live events, video on demand, and linear content for our customers.

Ensuring our customers are able to watch their favorite content, whether it is the Patriots scoring a touchdown, go Pats, or Barcelona scoring a goal, we want to obsess over our customers being able to take in that moment. Sometimes what can happen is that, while streaming to millions of customers, even a brief interruption can mean that thousands or millions of customers are interrupted from their viewing experience. Traditional operational approaches, such as manual monitoring of metrics or reactive root-causing, just do not cut it at our scale. So we asked ourselves: what must we fundamentally build? A system that's not just monitoring these metrics, but actively understanding them, learning from them, and able to autonomously take action on them. And that's exactly what we set out to build.

Thumbnail 2290

Now a few challenges that we had kept in mind and used as the guiding principle while working through these systems. First off, we wanted our system to be able to have access to a multimodal set of data. So this can include things like time series, infrastructure logs, player logs, graphs, and so on and so forth. We also wanted the system to not just have access to this data, but to actually understand this data and almost build an intuition behind it. And finally, we wanted to be able to have the system accessible to pretty much anyone in the engineering teams, so this did not require special expertise or domain expertise when it came to understanding our data.

Thumbnail 2330

With that in mind, we went ahead and built an agentic AI system that puts together multiple agents orchestrated using Strands to accomplish these tasks. One of the qualities of this system is that it can reason through complex tasks, break them down into multiple simpler tasks, and then chain them together in a sequence to accomplish the given task. We also made sure that the system was not just a one-and-done, but was constantly learning from all of the data around it, keeping the most current operational snapshot of the data, and also able to actively learn from any past mistakes or feedback it received.

Thumbnail 2380

So with that, let's take an overall 30,000-foot view of what the architecture looks like. This system was built as an AWS-native as well as an AI-native system. It's orchestrated with Strands, uses AWS Lambda for things like authentication as well as orchestrating across the different agents, and uses Athena for querying and as a data store, along with DynamoDB for some of the global state management.
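For context on those building blocks, here is a hedged sketch of how an agent tool might query Athena and persist state in DynamoDB with boto3; the database, table, and bucket names are placeholders, and the polling loop is simplified (no pagination or error handling).

```python
import time

import boto3

athena = boto3.client("athena")
dynamodb = boto3.resource("dynamodb")

STATE_TABLE = "streaming-quality-state"       # illustrative names
ATHENA_DB = "streaming_metrics"
ATHENA_OUTPUT = "s3://example-athena-results/"


def query_metric(sql: str) -> list:
    """Run an Athena query and return its result rows (simplified, no pagination)."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DB},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]


def save_snapshot(session_id: str, snapshot: dict) -> None:
    """Persist the current operational snapshot for global state management."""
    dynamodb.Table(STATE_TABLE).put_item(Item={"session_id": session_id, **snapshot})
```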

Thumbnail 2430

This system is the foundational backend system that can be used for a multitude of different frontend interfaces. For example, it can be used as a chatbot interface where somebody can put in a natural language question and be able to get an answer. It can also be used to autonomously be triggered by a different system that maybe has detected an issue, so really a bunch of different use cases that can be facilitated with the same backend system.

Thumbnail 2450

So diving into the components of this system. As I said, this is a multi-agent system. We have a bunch of different agents as well as sub-agents working together to be able to accomplish the task. First off, we start by looking at the Request Handler. The Request Handler is the front gate of the system. What the Request Handler does is once it receives a request, it will first go ahead and authenticate the request, then it will validate the request, and then it will go ahead and start to decompose this request into simpler tasks. So for example, if the question asked was what was the rebuffering rate on iPhone devices in Germany over the past week, the Request Handler will break this down to understand, okay, this ask involves a metric ask, it involves a trend analysis ask, and then it also involves an ask for a specific cohort, cohort being devices, geographic information, and time periods.
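A rough illustration of what a decomposed request might look like after the Request Handler breaks down that rebuffering question; the field names and cohort encoding are assumptions for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class DecomposedRequest:
    """Illustrative shape of a request after the Request Handler decomposes it."""
    metric: str                                          # e.g. "rebuffering_rate"
    analysis: str                                        # e.g. "trend"
    devices: list[str] = field(default_factory=list)     # cohort: device types
    geography: list[str] = field(default_factory=list)   # cohort: countries/regions
    time_window: str = "7d"                              # cohort: time period


# "What was the rebuffering rate on iPhone devices in Germany over the past week?"
example = DecomposedRequest(
    metric="rebuffering_rate",
    analysis="trend",
    devices=["iPhone"],
    geography=["DE"],
    time_window="7d",
)
```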

Thumbnail 2520

The Request Handler also has a Guardrails Agent, as you heard from Tulip previously, that validates and makes sure the request is compliant with what the system is supposed to support. Now that we have talked about the Request Handler, let's look at what happens after that. Next after the Request Handler is the Routing Agent. You can think of the Routing Agent as an intelligent orchestrator or a traffic controller that, based on the request it got from the handler, tries to understand which capabilities need to be invoked to service the request. So the Routing Agent understands what those capabilities are and passes that along as a signal to invoke other sub-agents, agents, tools, data sources, and so on. It can really be thought of as the brain of the operation once it gets that decomposed request.
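Using the agents-as-tools pattern mentioned earlier, a routing agent can delegate to sub-agents exposed as tools. The sketch below is illustrative: the system prompts and the particular agent split are assumptions, not the production agents.

```python
from strands import Agent, tool

# Sub-agents, each with a narrow responsibility (illustrative prompts).
integrator_agent = Agent(system_prompt="You fetch and join data from the available sources.")
analysis_agent = Agent(system_prompt="You analyze streaming quality metrics and trends.")


@tool
def fetch_data(task: str) -> str:
    """Delegate data retrieval to the Integrator Sub-Agent."""
    return str(integrator_agent(task))


@tool
def analyze(task: str) -> str:
    """Delegate statistical analysis to the Analysis Sub-Agent."""
    return str(analysis_agent(task))


routing_agent = Agent(
    system_prompt=(
        "Decide which capabilities are needed for a decomposed request, "
        "invoke them in the right order, then summarize the findings."
    ),
    tools=[fetch_data, analyze],
)
```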

Thumbnail 2560

Thumbnail 2570

The Routing Agent also uses a chain-of-thought process, breaking down a complex task, reasoning through it like a human would, and then finally understanding which capabilities it requires. From the Routing Agent, we then have the Integrator Sub-Agent. The Integrator Sub-Agent can be thought of as a traffic controller: once it has the request from the Routing Agent, it knows which specific tools and data sources it needs to connect to. The Integrator Sub-Agent works through MCP, the Model Context Protocol, and is able to talk to a host of different tools and data sources and work through things like different access patterns, APIs, access formats, and so on. It's also able to combine this data by knowing the right join conditions and so on.

Thumbnail 2620

The other thing about the Integrator Sub-Agent is that it also serves as a data quality check, making sure that only the right kind and quality of data is accessed by the system. After the Integrator Sub-Agent, we then have the Analysis Sub-Agent. The Analysis Sub-Agent can really be thought of as a data scientist in a box. It is primarily based on Bedrock, and it has a whole range of both large language models and small language models that it can access in order to service a specific request. You can really think of it as sitting on top of an ensemble of different models and leveraging the right model for the use case and the capability that's needed.

Thumbnail 2650

Thumbnail 2660

System Components in Action: From Reasoning to Response and Lessons Learned

Now once we have talked about the Analysis Sub-Agent, the next thing we have is the Reasoning Agent. The Reasoning Agent takes all of the input it has received from the prior agents and sub-agents and uses the business context it has access to in order to determine whether the analysis it has been given is pertinent and relevant. This can be thought of as an iterative approach that uses an LLM as a judge, an independent LLM that validates the responses received from the previous Analysis Agent, for example, and using that business context it can tell, okay, is this really the expected answer, or would it be something else?
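A compact sketch of that validate-or-retry loop is below; the ACCEPT/RETRY convention, the three-iteration cap, and the caller-supplied rerun hook (which would re-invoke whatever capability the judge suggests) are all hypothetical.

```python
from strands import Agent

judge = Agent(system_prompt=(
    "You independently validate an analysis against the given business context. "
    "Reply ACCEPT, or RETRY followed by the capability or data source to invoke next."
))


def validate(analysis: str, business_context: str, rerun_with, max_iterations: int = 3) -> str:
    """LLM-as-a-judge loop: accept the analysis or retry with a suggested capability.

    rerun_with is a hypothetical caller-supplied hook that re-invokes the
    capability or data source named in the judge's RETRY verdict.
    """
    for _ in range(max_iterations):
        verdict = str(judge(f"Context:\n{business_context}\n\nAnalysis:\n{analysis}"))
        if verdict.strip().upper().startswith("ACCEPT"):
            return analysis
        analysis = rerun_with(verdict)
    return analysis
```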

Thumbnail 2720

Thumbnail 2740

The reasoning agent also has the ability to have an iterative loop to go back and request a different capability or invoke a different data source based on what it might have received from the LLM as a judge. From the reasoning agent, we then have the response handler. The response handler takes all of the different input that it has received from the routing agent, the reasoning agent, as well as from multiple iterations of the loop that we have just talked about, and packages all of this information together into the expected output format. This could be things like the response to a natural language question, generation of a complex SQL query, or alternatively it could also be an autonomous action or a mitigation lever that can be pulled in response to a certain trigger.

Thumbnail 2790

The response handler also interacts with the guardrail agent again to make sure that the response is compliant with the required data sharing and other such activities. Separately from that, the response handler also goes ahead and logs all of these decisions that it has made, similar to what all of the other agents have done, so it takes all of these logs which can then be used for reflective analysis in terms of improving some of the decision making. Taking a step back and looking at the system overall, what we talked about first was the request handler that takes in a request, validates it, makes sure it's the right format, decomposes it into simpler tasks, and moves that along to the routing agent that then knows which capabilities it's going to need to invoke.

From there it goes to the integrator agent, which can orchestrate a bunch of different tools and data sources, followed by the analysis agent, which is really the data scientist in a box ensemble of LLMs that can be used. Finally, the reasoning agent uses business context along with what's given from the analysis agent to make sure that it is an acceptable and reliable answer. Along with that, it can also trigger reiterations of the loop, invoking other capabilities and invoking other data sources as needed, and finally the response handler that packages all of this and makes it available either as some sort of an output or a mitigation lever or an autonomous action.

Thumbnail 2880

As you can see, we also have both an online evaluation system, similar to the LLM as a judge that I mentioned, as well as an offline evaluation system that takes all of these logs of decision making and can be used iteratively in terms of understanding and improving the system. Now that we have talked through the overall components of the system and how it works in real time to be able to help detect, localize, as well as root cause issues, let's take a step into some of the lessons that we learned through the process of developing the system.

First off, data quality always beats data quantity. When you're trying to develop such a system, there can be a whole lot of things that you might want to add, things like infrastructure logs, metrics, past incidents, tickets, and so on and so forth, but you really want to make the most efficient use of your context window and make sure that you're giving the right set of data and data that will actually get you to the right outcomes. Being judicious in terms of the data that you have as part of the system is especially important. The next thing is building feedback loops early and often, which really helps in the efficiency of development as well as the efficiency of getting your system to the levels of accuracy and levels of reliability that you're looking for.

The other one is planning for failure modes. Systems do fail, and we want an autonomous system, but there are going to be times when you want human evaluation or when the system is exposed to a brand new situation. You want safe ways for the system to fail and to trigger the right sort of human involvement as needed. Finally, continue to use AI to amplify human expertise, whether that is understanding the business context better, understanding the data better, and so on; you always want AI amplifying human expertise throughout the SDLC, the software development life cycle.

Thumbnail 2970

In terms of what's next for our system, we want to continue to build more mitigation levers and have more and more autonomous actions that the system can perform by itself. We're continuing to use AI through the software development cycle to accelerate our development, with quick prototyping using things like SageMaker as well as orchestration tools for quick proofs of concept, and using AI in the deployment and overall maintenance of the system. Finally, we're adding more and more safety mechanisms so that we continue to not only build but also keep trust in the system as it takes on more and more autonomous actions.

Key Takeaways: Automation, Smart Scaling, and the Agentic Future

So with that, I will hand it over to Tulip.

Thumbnail 3060

Thank you, Mona. So let's see how many of you were paying attention. What was the peak audience during the Commanders and the Packers game in 2025? Does anyone remember? That's right, yes. So you heard from Mona and Brian about how they were able to use Gen AI in their solutions and their use cases respectively, and how they were able to move through that Gen AI evolution from having more human oversight to less human oversight and be able to automate a lot of those tasks. Both of them wanted to be able to deliver premium video and artwork at scale to millions of customers across thousands of different devices for their use cases. And obviously when you're trying to do that, there are millions of customers globally and you're streaming content 24/7. It can impact thousands or even millions of viewers instantly if you're not doing the right job. So they wanted to ensure that none of their customers or Prime Video viewers are impacted while they're trying to experiment and see how content is being delivered. That's where automation using AI helped them. They were able to do it more precisely with increased productivity because there was less human oversight. They were able to make those agents do the work and experiment faster and faster and be able to get the output that they needed.

Thumbnail 3110

Thumbnail 3130

Thumbnail 3150

And so their first key takeaway is more automation. They reduced their evaluation time: you heard from Brian that it took them days before, and they were able to reduce it to 15 minutes and maintain performance without human intervention. Going back to that earlier slide, moving from more human oversight to less human oversight, they were able to scale smart. What they did was break the problem into small chunks and then scale out. So if you have a bigger use case you want to achieve, you break it into smaller chunks, iterate on it, use AI agents, and then build a multitude of agents that orchestrate across it to accomplish the goal you need. And have a robust evaluation loop. For both Mona's and Brian's use cases, they were able to enable rapid iteration and continuous improvement through all of their processes because they had a robust evaluation loop. Whatever they were building, they were able to validate the agent's outcome, see if it was the right one, and then improve their agents as they went on with their experimentation.

Thumbnail 3180

Thumbnail 3200

So I'll leave you with this quote from Andy Jassy, who recently said that what makes this agentic future so compelling for Amazon is that these agents are going to change the scope and speed at which we can innovate for customers, and that's true. We are working on it. That's the agentic future we look at at Amazon, and that's the scope we have in mind when we talk about it. If you're here, right around the corner there's the One Amazon Lane, where you can see some of the other use cases we have from Prime Video. We have X-Ray Recaps, Rapid Recap, and the NASCAR Burn Bar, which show some of the sports innovations we have done. If you watch Prime Video, X-Ray Recaps lets you see what a scene or TV episode is about; you can go back and recap, and it summarizes it for you. That's out there in the demo at One Amazon Lane, alongside other demos like the Zoox robotaxi. So if you're here, I would highly recommend passing by and checking out some of the cool demos from Amazon, including the ones from Prime Video.

Thumbnail 3260

And with that, I'll end the session. If you have enjoyed the session, please complete the session survey in the mobile app. We look forward to your feedback, and we obviously work off your feedback to improve our sessions as we go. So thank you so much for attending the session.


This article is entirely auto-generated using Amazon Bedrock.
