Kazuya


AWS re:Invent 2025 - Amazon Ads Creative Agent uses AWS to democratize ad creation (IND3335)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Amazon Ads Creative Agent uses AWS to democratize ad creation (IND3335)

In this video, Amazon Ads presents their Creative Agent, a production agentic AI system that democratizes ad creation at scale. Yashal and Fabio detail the technical architecture, including multi-level routing, parallel tool execution with stateful tools, and handling millions of tokens through intelligent memory selection. They achieved 8x faster video animation, 4x context reduction, and can productionalize new agents in just 2 days. The system uses layered guardrails across text, images, and video, processes 5 billion tokens weekly, serves 1M+ active users, and delivered a 338% CTR increase for advertisers. It is built on Amazon Bedrock with an agent SDK, capability operators, and automated diagnostics that use AI to debug AI workflows.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Democratizing Ad Creation with Amazon Ads Creative Agent

Good morning, and I hope you are all having a great conference here at re:Invent. I'm Avinash Kolluri, a Senior Solutions Architect with AWS supporting Amazon and its subsidiaries on their cloud journey. Today, I have my co-speakers Yashal, Lead Applied Scientist from the Amazon Ads team, and Fabio, Principal Engineer from the Amazon Ads team. We are going to discuss how Amazon Ads has democratized ad creation with their Creative Agent. I know this is an AI-heavy topic, but how many of you here are familiar with the buzziest word in the tech industry this year, which is "agent"? I see a lot of hands. In that case, you are exactly in the right place. We are going to show you how Amazon Ads has built their Creative Agent.

Thumbnail 70

We have quite a packed agenda. First, we are going to go over the Amazon Ads overview, and then we are going to dive deep into the Ads Creative Agent. Just a spoiler alert on the Creative Agent: this is something that the Amazon Ads team recently launched at their unBoxed event, and it has gotten great momentum and attention. We will also dive deep into the science and research side of the house, along with the infrastructure challenges and how the infrastructure was implemented. Towards the end, we are going to discuss some additional advertising use cases for generative AI and show how AWS is helping to evolve the landscape for you.

Thumbnail 110

With that, I am going to hand it over to Yashal, and he is going to discuss the Amazon Ads overview as well as the Creative Agent from the science and research side. Good morning, everyone. Thanks for attending. My name is Yashal. I have been at Amazon for nine years and I am a Lead Applied Scientist on the team.

Thumbnail 130

Thumbnail 140

Amazon Ads at Scale: Reaching 300 Million Users Across Multiple Platforms

I will be walking you through three key areas. I will first introduce Amazon Ads and the scale at which we operate. Then I will introduce the Creative Agent. I want to be precise here: this is not another text-to-image or text-to-video model. It is a complete production agentic system that is running on AWS and serving real customers today. Then we will dive deep into research and science, and we will talk about the challenges that we faced and how you can use AWS to solve these challenges.

Thumbnail 160

Amazon Ads has evolved into a full-funnel solution. It is available to businesses of all sizes, from Fortune 500 companies to small and medium businesses. We have an average monthly reach of over 300 million across the US. To put that into perspective, that is almost the entire US adult population. We have over 200 million reach on streaming TV alone. We power advertising across a variety of surfaces like Amazon.com, Prime Video, Twitch, Audible, Fire TV, Music, and even more third-party publishers.

Thumbnail 190

Thumbnail 210

Thumbnail 220

Thumbnail 230

Thumbnail 240

Thumbnail 250

Now, let us take a quick look at some of the features of streaming TV. Play along at home and see if you can get there first. Exciting, is it not? That is just one of the various formats we have. All of these formats and advertisements need to be created. Advertisers need to create text, images, videos, and headlines across all of these variety of formats. That is where our team helps.

Thumbnail 270

Thumbnail 280

The Challenge: Scale, Speed, and Cost in Generative AI for Advertising

Over the last three years, we have launched a variety of products, models, and features to power generative AI for ads. This includes image generation, audio generation, headline generation, and even video generation. All of these features are available in the AI Creative Studio. It allows advertisers to select any of their products on Amazon or even upload their own images. For instance, in this example, we see a set of cups that are sold on Amazon. This is a feature that we first launched over two and a half years ago and have improved since. For products like these cups, you can generate imagery or videos that feature these cups. Along with improving this, we have launched other features like holidifying your images, animating videos, or creating templatized multi-scene videos.

Thumbnail 320

Thumbnail 330

As we continue to improve these systems, we identified some areas that are important to our advertisers. I want to highlight that these are not nice-to-haves. These are must-haves in today's generative media landscape. The first is scale. For every campaign that advertisers create, they need to generate 10, 20, sometimes even 50 variations, and you need to do that for all the events. You need to do that for Christmas, Black Friday, Thanksgiving, Father's Day, Valentine's, and so much more. When you multiply that by the hundreds of products you have, you can imagine the number of variations in content that you need to create.

Thumbnail 360

Thumbnail 380

Then you couple that with the demand for video, which is greater than ever before. Advertisers need to create a lot of content. And then here's a paradox. All of them want automation, but they also want control. They want these systems to think intelligently and automate the generation, but they want to do that using their own brand guidance. They want to control how these systems operate. Then think about time to market. In advertising, time to market is critical. By the time they generate content, market trends may shift, so you need to be fast. Not every small advertiser can spend $50,000 to $100,000 to create this content. So you also need to think about cost.

Thumbnail 410

All of this, of course, requires expertise. We need expert technical creators, content writers, and so on, and not all small businesses may have access to these. Finally, it's not just enough to create. You need to optimize. You need to know what colors work, what brand messaging resonates with your shoppers. You need to optimize all the content. If you think about the entire traditional approach, you need to plan and hire, spend weeks creating content, then optimize it, take feedback, and based on the feedback repeat the whole process again. By the time you spent weeks creating your content, market trends may shift. So you need to democratize content.

Thumbnail 450

Thumbnail 460

Thumbnail 470

Thumbnail 480

Thumbnail 490

Thumbnail 500

Creative Agent in Action: An Intelligent System Built on Three Pillars

How do you democratize creating content for advertisements? The answer is Creative Agent. Watch as it springs into action for a simple query like creating content for my latest hiking gear. It immediately responds and reasons and shows real-time updates to the users. It presents some concepts about the video to the user, and at any moment the user can edit and chat with the agent to update them. In this case, they decided to go with concept 3. The agent immediately begins executing. It generates stunning imagery for the concept and presents them as a storyboard as an artifact in the UI. The advertiser can review these images, they can chat to edit, or they can choose to animate them. Once the advertiser asks to animate, it begins animating all the images into a stunning video. It compiles everything into a single video and brings it together, and you'll notice it highlights the product and then it's ready to serve on Amazon. That is Creative Agent.

Thumbnail 510

Thumbnail 520

Thumbnail 530

Thumbnail 540

It is not a simple model. It has planning. It has reasoning. It acts on behalf of advertisers, and it is intelligent. It is built on three key pillars, and as I mentioned, it's not a simple UI wrapper over a single model. The first key pillar is experience. Users can chat with it naturally, or they can also interact with it in a full-fledged image and video editor. The second key pillar is integration. Our traditional AI models operate independently: you have a model that generates text to image, and you need to take the output and pass it on to another model by yourself. But this integrates not just all the models, but also Amazon product data and other features that are powered by our systems.

Thumbnail 560

Thumbnail 570

Thumbnail 600

The last pillar is intelligence. It does not just execute commands on behalf of the user. It's not the user who is always deciding what to do and what to say. It intelligently knows what to do and it uses multiple agents to do so and acts on behalf of the advertisers. Now let's get to the fun part. Before I go, I have a quick question. Raise your hand if you ever face problems with tokens, context windows, or if you've ever gotten the message that you've reached your daily limit on AI systems. Exactly, it's so commonplace, and that's exactly the kind of challenge that we face and we solve. So we'll go through these challenges and how we solve them. I will cover four more areas, and while this gets technical,

Thumbnail 610

Thumbnail 620

Thumbnail 630

Thumbnail 640

Thumbnail 650

I want you to stay with me because these challenges and the solutions apply to any agentic system and they're not just applicable to our system. The first is agents and routing. So do you cram everything into a single system prompt and hope for it to work, or do you build a modular system? Long-running tasks is one of the toughest problems to solve in agentic systems. Every conversation in Creative Agent has hundreds of images, and a single LLM does not support hundreds of images. So how do you make sure that the agent can see all the images and act accordingly? And then this one kept me up at night. How do you evaluate your system and make sure that every deployment is safe and actually improves rather than leading to regression? As we went through these problems, we answered some key questions, and each of these questions is important because it has a direct impact on the scale, latency, cost, and the performance of the system.

Thumbnail 660

Agent Architecture: Multi-Level Routing and Modular Design for Complex Tasks

The first, as we discussed, is whether you put everything in a single system prompt and hope your system works, or whether you build multiple agents and tools. We tried that. We tried with a simple system prompt. We put all the instructions in a single system prompt, and it worked okay to start with at a small scale. But as you add many features and build a bigger system, it simply fails. The model starts reasoning for simple questions, or it simply ignores some instructions and does things by itself.

Thumbnail 690

Second, an important thing is handling of user queries. Not all user queries are equal. Asking a simple question like what can you do is very different from saying take my top ten products and create an amazing video for it. So you need to understand what the user needs and adjust the complexity and difficulty based on that.

Thumbnail 710

And last is balancing real-time updates versus long-running tasks. Naturally, as you support many features, some of them take longer. Making a video may take one minute, but generating an image only takes three seconds. Responding immediately may also only take three seconds. So how do you balance showing real-time updates while also doing tasks that take longer?

Thumbnail 740

Thumbnail 750

Thumbnail 760

Thumbnail 770

Thumbnail 780

Thumbnail 790

Thumbnail 800

As we answered all of these questions, it led to our current design. And before we discuss this, I want you to understand that this is a dynamic system. It is actually living, so based on every query, the path it takes and the tools it calls changes, but this is a generalized view of what happens. It starts with the user's message. As soon as the user sends the message, our system decides: is this a difficult question, do I need sub-agents to handle this, or can I handle it myself? It then does two things in parallel, so it forwards some of the requests to sub-agents, but at the same time it immediately begins responding to the user, and that response is shown to the user in real time. While it is showing it to the user in real time, it sends that message to the sub-agents, which have their own tools. These tools can be image generation models, web search, and so on. And all of these tools have access to memory where they write. So for instance, if they generate images, they can write to the asset store. They can save some content to the shared memory as well, and this is shared across the tools and sub-agents. Once they write to the memory, they send that response back to the sub-agents. The sub-agents correlate all the responses, and then we have a correlated response sent back to the agent. Once everything is generated, it's also shown to the user in the UI. And that's how the user sees the real-time messages and also content like images, videos, and so on. And all of this is fed back into the conversational memory and ready for the next query. And because it's a chat system, all of this can keep happening and you keep pulling from the memory and from the user.

Thumbnail 840

Now we built this over different steps, and I want to walk you through those steps because you'll understand how we built them in layers. So this is the most basic agent, and it was step one for us. It can take a user query, reason about what to respond, and then it sends that response back to us. Then we of course use prompt caching or KV caching to improve speeds, and that alone led to massive improvements. So this agent works fine for simple queries, but it cannot act on the user's behalf. It does not have any tools. It cannot really take any actions on behalf of the user. And that led us to our step two.
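
To make the prompt caching mentioned above concrete, here is a minimal, hedged sketch of a "step one" agent that reuses a large static system prompt through Bedrock prompt caching via the Converse API. This is not the Amazon Ads implementation; the model ID and instructions are placeholders, and it assumes a model that supports prompt caching with cachePoint blocks.

```python
# Minimal sketch (not the Amazon Ads implementation): a single-turn agent that
# reuses a large, static system prompt via Bedrock prompt caching. The
# cachePoint block marks everything before it as cacheable, so repeated calls
# avoid re-processing the instructions. Model ID and prompt are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

SYSTEM_INSTRUCTIONS = "You are an ad-creative assistant. ..."  # long, static text

def answer(user_message: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model
        system=[
            {"text": SYSTEM_INSTRUCTIONS},
            {"cachePoint": {"type": "default"}},  # cache the static prefix
        ],
        messages=[{"role": "user", "content": [{"text": user_message}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(answer("What can you do?"))
```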

Thumbnail 870

So now this agent, as you see, it takes a user query and can reason. But instead of directly responding, it plans. Then it acts using the tools and sub-agents that it has. And once it has the response, instead of showing that response directly, it also reflects. So if you're generating an image, it will call the image model and then reflect on whether the image is good or not. And then it can either send that message back to the user or it can choose to continue, so it builds a loop. If you think about a multi-step process like making a video, it can generate an image, animate it, add music, and then decide to show that to the user.
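
As an illustration of the loop just described, here is a minimal sketch of a reason, act, reflect cycle. The call_llm planner and the tool registry are stand-ins, not the actual Amazon Ads SDK; in production the planning and reflection would be model-driven rather than the simple rules shown here.

```python
# Illustrative sketch of the reason -> act -> reflect loop described above.
# call_llm and the tool registry are stand-ins, not Amazon Ads' actual SDK.
from typing import Callable

def generate_image(prompt: str) -> str:
    return f"image_asset_for:{prompt}"          # placeholder tool

def animate_image(asset_id: str) -> str:
    return f"video_asset_for:{asset_id}"        # placeholder tool

TOOLS: dict[str, Callable[[str], str]] = {
    "generate_image": generate_image,
    "animate_image": animate_image,
}

def call_llm(state: dict) -> dict:
    """Stand-in for the planning/reflection model.

    Returns either {'action': name, 'input': ...} or {'respond': text}."""
    if not state["artifacts"]:
        return {"action": "generate_image", "input": state["query"]}
    if len(state["artifacts"]) == 1:
        return {"action": "animate_image", "input": state["artifacts"][-1]}
    return {"respond": f"Here is your video: {state['artifacts'][-1]}"}

def run_agent(query: str, max_steps: int = 10) -> str:
    state = {"query": query, "artifacts": []}
    for _ in range(max_steps):                  # loop: plan, act, reflect
        decision = call_llm(state)
        if "respond" in decision:               # reflection says we are done
            return decision["respond"]
        result = TOOLS[decision["action"]](decision["input"])
        state["artifacts"].append(result)       # feed the result back for reflection
    return "Stopped after reaching the step limit."

print(run_agent("Make a short video for my hiking gear"))
```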

Thumbnail 930

This approach is very powerful because it generalizes to almost any task you have at hand. However, there's still the problem of complexity that we discussed. Imagine asking a simple question like "What can you do?" Do you really need to reason, act, and think through whether you should reply with a paragraph of thinking before responding to the user that you can make images? You don't really need to do that.

Thumbnail 950

That led us to our final iteration, which is to have a multi-level router. Now when a user sends a message, we quickly decide whether it really needs a level of thinking or whether it can be responded to quickly. If it's a simple question, we immediately respond, and if not, we spend the time needed to reason, react, and reflect.

Here's a quick example of actual output. For a simple query like "What can you do?" you'll see that it immediately responds with the entire content. It does not spend minutes thinking and reasoning before responding after five minutes. So we adjust the effort based on difficulty. The reactive loop that we discussed also handles errors because if your image model fails, it can read the error and react. It can think that the model is overloaded and let me call another model instead.

It is modular and reusable, so it not only helps to build this system, but if tomorrow you wanted to build another agent with a subset of these features, it would be very easy, because not everything is crammed into a single model or system prompt. You can reuse components and tools. Finally, it helps to optimize tokens, because for a simple question like "What can you do?" you don't spend thousands of tokens thinking about how to respond. You can respond immediately with a simpler model.
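
Below is a hedged sketch of the multi-level routing idea: a cheap first-pass decision sends simple queries straight to a lightweight responder and only routes complex requests into the full reason, act, reflect loop. In practice the classifier would be a small, fast model; the keyword heuristic and function names here are placeholders.

```python
# Hedged sketch of a multi-level router: simple queries get a fast, cheap
# response; only complex creative requests enter the full agent loop.
def classify_complexity(query: str) -> str:
    creative_verbs = ("create", "generate", "animate", "make", "edit")
    return "complex" if any(v in query.lower() for v in creative_verbs) else "simple"

def quick_responder(query: str) -> str:
    return "I can create images, videos, and headlines for your products."

def full_agent(query: str) -> str:
    return f"[reason -> act -> reflect loop handles: {query!r}]"

def route(query: str) -> str:
    if classify_complexity(query) == "simple":
        return quick_responder(query)       # respond immediately, few tokens
    return full_agent(query)                # spend reasoning effort only when needed

print(route("What can you do?"))
print(route("Create ten video variations for my best sellers"))
```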

Thumbnail 1020

Thumbnail 1040

Solving Long-Running Tasks: Parallel Execution with Stateful Tools and Subagents

Now let's talk about one of the most complex problems. We discussed the loop, but imagine asking a very complex question like "Find my best-selling products and make ten videos for them." You can imagine it may take ten, twenty, or even fifty steps to do that. So how do you ensure all of those fifty steps execute correctly? For instance, if I asked what are the key steps needed to make a video, even the agent responds that it has to do a lot of things.

Thumbnail 1070

As you're executing these fifty steps, what if you fail on step forty-one? Do you restart from the beginning all over again? How do you make sure that all of this fits into the context? As you're generating images and making videos, you may go out of the LLM context window. So how do you make sure that on step forty-nine, the LLM still remembers step one?

Thumbnail 1080

Thumbnail 1090

And then how can you optimize? Do you really need to do these steps one by one, or can you do them in parallel? Those are the questions we ask. So how do you ensure that they can run successfully? How do you make sure that it fits into the context? And do you really need to run them in serial? For instance, if you're generating ten images and they're independent of each other, you don't need a loop to keep generating them one by one. Can you generate them in parallel?

That simple question led us to the most basic design, which is to ensure that all tools are called in parallel. This can be done either using parallel tool calling, which some LLMs support, or you can build this into your own tool. You can design the schema in a way that ensures that things are executed in parallel. So the agent invokes multiple tools in parallel, but you need to bring them together and reduce them to a single output.
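
Here is a minimal sketch of that fan-out, using asyncio to run independent image-generation calls concurrently. The generate_image coroutine is a stand-in for a real model call; as described above, the collected results still need to be reduced into a single output for the agent.

```python
# Sketch of fanning independent tool calls out in parallel (here with asyncio).
import asyncio

async def generate_image(prompt: str) -> str:
    await asyncio.sleep(0.1)                    # simulate model latency
    return f"image_for:{prompt}"

async def generate_variations(prompts: list[str]) -> list[str]:
    # All calls run concurrently instead of one loop iteration per image.
    return await asyncio.gather(*(generate_image(p) for p in prompts))

prompts = [f"hiking gear, holiday scene {i}" for i in range(10)]
images = asyncio.run(generate_variations(prompts))
print(images)
```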

Thumbnail 1130

In the most basic implementation, you use another agent to do so. The agent invokes ten different requests, and then another agent looks at all of these ten responses and combines them into a single output. However, that led to some problems. You always need an LLM to reduce them to a single output, but that is very costly.

So we improved the tools using something called stateful tools. Now instead of needing another LLM to look at the outputs and reduce them to a single output, we can use fast, non-LLM-based reducers. So if you're generating two images in parallel, you can reduce them or join them into a single list without needing an LLM, which takes a minute to do so. That led to massive improvements in speed.
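
The following is an illustrative sketch of that stateful-tool pattern, under the assumption that each parallel tool writes its result into shared state and a plain-code reducer, rather than another LLM call, merges them into one compact output. The class and function names are hypothetical, not the internal API.

```python
# Sketch of a "stateful tool" pattern: tools write results into shared state,
# and a millisecond-scale code reducer (no LLM) merges them for the agent.
from dataclasses import dataclass, field

@dataclass
class SharedState:
    assets: list[str] = field(default_factory=list)

def image_tool(state: SharedState, prompt: str) -> None:
    # Write to the shared asset store instead of returning raw output to the LLM.
    state.assets.append(f"asset://{prompt.replace(' ', '_')}")

def reduce_assets(state: SharedState) -> str:
    # Fast, non-LLM reduction: join the parallel outputs into a single summary.
    return f"{len(state.assets)} assets ready: " + ", ".join(state.assets)

state = SharedState()
for prompt in ["red mug on desk", "mug with holiday lights"]:
    image_tool(state, prompt)
print(reduce_assets(state))
```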

Thumbnail 1160

And it all comes together with subagents. Instead of calling multiple tools in parallel, you can call multiple subagents in parallel. Each of them can take multiple steps. So if you think about those fifty steps we discussed, maybe ten of them can be taken by a subagent. As soon as that subagent takes those ten steps, the head agent does not need to know about all ten of them. It can reduce them into a single output.

Thumbnail 1200

The original action of taking fifty steps reduces to certain chunks which are hidden away from the main agent. So the subagent reduces them to one output, and all of those outputs are reduced to a single head agent output that ensures that the context remains in control. And the results speak for themselves. Our video animation became eight times faster. Our context reduced by over four times.

It led to parallel execution, but also crucially, it led to parallel development. Before, you had a single system prompt or single agent where all the scientists and engineers needed to collaborate. There were so many merge conflicts, but now all of them can develop these things in parallel. You can have one person improving the image agent, another person improving the music agent, and you can bring them all together into a single system.

Thumbnail 1240

Handling Multimodal Context: Managing Millions of Tokens and Hundreds of Images

Then there are tool operations which did not need an LLM, and that's not a typo. It really went from 30 seconds to a few milliseconds because you don't need an LLM to do everything for you. You can just write code to reduce it. The third key innovation that we added was how to handle multiple images and videos. Imagine a conversation in which you've already created a video, you've gone through the whole cycle of 50 steps, and then you ask to edit it.

Thumbnail 1260

Thumbnail 1270

So you say, "Hey, this video looks good, but can you add a logo and slogan to the end?" Now in this case, it needs to remember the full context, all the hundreds of images and videos it already generated. Moreover, if you remember the UI edits I talked about, the user can also interact with the video in a full-fledged video editor. So the agent not only needs to remember all the hundreds of images and videos it created, it now needs to act on them. The user may edit these videos and then take actions in the UI.

Thumbnail 1310

Because it's chat, the user can come and say, "Do what I just did, but better." Now you need to remember all the UI actions that the user did as well. You can imagine that this simply cannot fit in a single LLM's context window. It's millions of tokens. So how do you ensure that your agent can handle millions of tokens and hundreds of images and videos? This is the design for that.

Thumbnail 1320

Thumbnail 1330

Thumbnail 1340

Thumbnail 1350

For every UI interaction event, we have a special processor that listens to all the events in the UI and then processes them. Similarly, for any uploads that the user does—images, videos—we have a special media handler that processes them in place. Both of these are saved in the artifact memory. Notice they're not directly fed into the LLM, but rather we save them in the memory first. Similarly, all the user messages continue going to the conversational memory. Again, all of these still don't go to the LLM directly. We make sure we decorate the memory and select appropriately.

Thumbnail 1360

Thumbnail 1370

Based on every request, we select what media or what UI events need to be sent to the LLM. Similarly, all the conversations are not directly fed in. We compress them, but also select which messages are most relevant. Then all of this is fed to the agent. This pattern in between ensures that no matter what the user does—it could be millions of tokens worth of images—our agent is still shielded. The selection is intelligent, so we don't forget. You still remember step one, you still remember the first image that you uploaded.
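
Here is a hedged sketch of that memory-selection pattern: UI events, uploads, and prior turns live outside the model as references, and only the items judged relevant to the current request are placed into the prompt. The relevance scoring below is a trivial keyword overlap purely for illustration; in practice this would be an embedding-based or learned selector, and all names are hypothetical.

```python
# Sketch: artifacts and conversation live in external memory; a selector picks
# only the relevant pieces for the LLM, shielding it from millions of tokens.
from dataclasses import dataclass

@dataclass
class Artifact:
    ref: str           # pointer to the asset store, not raw pixels
    description: str   # short text the model can reason over

ARTIFACT_MEMORY = [
    Artifact("asset://video_001", "30s hiking gear video with logo outro"),
    Artifact("asset://img_042", "product image of blue water bottle"),
]
CONVERSATION_MEMORY = ["user: make a hiking video", "agent: here is concept 3 ..."]

def select_context(request: str, max_items: int = 2) -> list[str]:
    def score(text: str) -> int:
        return len(set(request.lower().split()) & set(text.lower().split()))
    ranked = sorted(ARTIFACT_MEMORY, key=lambda a: score(a.description), reverse=True)
    artifacts = [f"{a.ref}: {a.description}" for a in ranked[:max_items]]
    return artifacts + CONVERSATION_MEMORY[-max_items:]   # recent turns, compressed elsewhere

print(select_context("add a logo and slogan to the hiking video"))
```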

Thumbnail 1390

This also led to massive improvements. Our agent running in production can support hundreds of turns. It can support 10x more images and videos than our first version. It costs less and runs faster, and interestingly, it performs better. You might think that reducing context or selecting makes it worse, but because you're helping the agent to focus on what matters and removing anything that does not matter, the quality actually improves.

Thumbnail 1420

Evaluation Strategy: Adaptive Benchmarks and Test Agent Prompts for Evolving Systems

Now let's talk about the daunting task of evaluating. The key questions here are that it is ambiguous compared to traditional ML evaluations. You might have seen those CAPTCHAs about selecting buses in images. It's mostly black and white, and sometimes they have buses which you're really not sure if it's really a bus or not. But compared to that, evaluating generative AI systems is much more ambiguous. What is a good Father's Day ad? Gen Z may like something else. Millennials may like something else. So it's difficult to have a binary yes or no answer.

Thumbnail 1450

Thumbnail 1460

You also need speed. As we're improving our ability to code and deploy faster, you also need to evaluate faster. And finally, there's this dichotomy of manual evaluation. You need manual evaluation because they're great quality and you cannot rely on automated systems. But at the same time, if you're pushing features like 10 features every week, by the time you're done with manual evaluation, you've likely made 10 more features. So how do you keep up?

Thumbnail 1480

So if I were to summarize, we have ambiguous success. To solve that, we made our system adaptable. We started with making fixed rubrics, like, "Is there brand color in this image? Does it have a headline?" But that led to fixed benchmarks, and it did not really evolve with the system. So we make sure our benchmark adapts to the growing needs. For speed, we made sure we use automated benchmarks when possible.

Thumbnail 1510

Thumbnail 1520

To ensure that we can rely on humans while keeping up the speed, we make sure to validate instead of depending solely on human evaluations. This is the architecture we follow for any bug, feature request, or feedback from the user. As we create a ticket, we make sure to write a test agent prompt. Instead of writing a fixed rubric, we write a test agent prompt that chats with a real agent. This simulated conversation shows us what's possible for this specific feature or bug. Then a human evaluates and refines the test agent prompt based on that simulated conversation.

Thumbnail 1540

This is iterated until the reviewer is satisfied with the simulated conversation. What this helps us do is ensure that as our agent evolves, our benchmark can evolve along with it. The test agent prompt becomes part of our actual test suite. Now we have a collection of hundreds of prompts which lead to hundreds of test agents that all simulate different aspects important to our customers and product managers. We can simulate conversations with our agent.
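
To make the flow concrete, here is a minimal sketch of a test-agent-driven evaluation: a scripted tester persona chats with the real agent, and the resulting transcript is checked against success criteria. In practice both the tester and the judge would be LLM-driven and refined by a human reviewer; the stubs and rubric below are placeholders.

```python
# Sketch of the "test agent prompt" idea: a tester persona drives a simulated
# conversation with the agent under test, and a judge checks the transcript.
def creative_agent(message: str) -> str:
    return f"agent reply to: {message}"          # the system under test (stub)

def test_agent(turn: int) -> str:
    # A "test agent prompt" compiled into a simple scripted persona.
    script = [
        "Create a Father's Day ad for my grill set",
        "Make the headline warmer and add the brand color",
    ]
    return script[turn]

def run_simulation(turns: int = 2) -> list[tuple[str, str]]:
    transcript = []
    for t in range(turns):
        user_msg = test_agent(t)
        transcript.append((user_msg, creative_agent(user_msg)))
    return transcript

def judge(transcript: list[tuple[str, str]]) -> bool:
    # Placeholder rubric; a reviewer refines the test agent prompt until the
    # simulated conversation covers the feature or bug in question.
    return all(reply for _, reply in transcript)

print(judge(run_simulation()))
```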

Thumbnail 1580

That was a deep dive into science and research, and I have some key takeaways for you that you can apply to any agentic system that you have. The first thing we talked about was agent architecture, and it's important to specialize here. It's not just to make your system better, but it's also helpful to develop and deploy in parallel. For long-running tasks, it's important to execute in parallel and to use stateful tools and workflows.

Thumbnail 1600

Thumbnail 1610

Thumbnail 1630

While large language models can do everything, it does not mean that you use them for everything. You can depend on your own code to reduce different contexts or perform certain actions that are much faster than relying on large language models. To handle multimodal context, you don't need to feed everything to the large language model at the same time. You need to ensure that you offload as much as possible and then intelligently select and forward that to the large language model. And lastly, it's important for evaluations to be liquid. You cannot have a fixed rubric that stays on for twelve months. It needs to evolve and it needs to consider feedback loops from real customers.

Thumbnail 1660

Engineering Infrastructure: Building Reusable Components and Guardrails at Scale

With that, I'll hand it over to Fabio to talk more about the underlying infrastructure. Thanks, Yashal. Hi everyone. I'm Fabio and I'm a principal engineer in Amazon Ads, and I directly support the team behind the Creative Agent. While Yashal has mentioned a lot about the science and research behind this project, I'm going to dive deep into the engineering and architecture that's powering the service at scale and then lessons learned and the road ahead for the new improvements that are coming along next year.

Thumbnail 1670

With generative AI agent products and workflows, usually twenty percent of the time is spent on achieving eighty percent of the prototype, and then you spend eighty percent of the time to fine-tune the last twenty percent and productionalize it to make it work. In that last part is where we found our main technical challenges. How do we avoid duplication of efforts since every team in Amazon and every team in ads are building generative AI workflows? And then how do we keep up with the state of the art while delivering at the same pace as anyone else in the industry? Also scalability and latency. We want to make sure that our services don't just work for a prototype subset of users, but they scale up to millions of users. And lastly, how do we embrace nondeterministic agents with classical engineering? You have a certain input and you expect a certain output. With agents, randomness is a feature, so how do we make that all work?

Thumbnail 1730

Thumbnail 1740

Let's tackle the first challenge. Because if you can't build reusable components, you will spend a lot of time just building a single prototype and won't be able to replicate the successes. So our main challenge was solving this problem with architecture discipline. We divided it into five different layers. Let's start from the bottom. Infrastructure comprises all the building block components that are reusable across our project. Then we have the models and services layer. This layer allows us to reuse every model and connection within Amazon advertising and the overall ecosystem. Then we have the tools layer. This is where we create tools that are reusable across many different agents. Above that is the agent layer that my colleague Yashal focused on, where we have agents, orchestrators, sub-agents, and all the capabilities collected for the agent to work. The last layer is the service layer, and that's what our advertisers see and interact with, which is our advertising console or could be our AI creative studio.

Thumbnail 1800

Let's get into the details. The diagram shows the actual implementation and how the components are spread across these layers.

Thumbnail 1820

So something that I want you to notice is that there is a spectrum of skills where science tends to contribute more on the left side of the diagram, while engineering is usually mainly focused on the right side of the diagram. Those skill sets help bridge some of the gaps in between. Here is the architecture that made all of this reusable. As you can see, the left side is where the customers interact with us. They come in through one of our entry points and websites, and then a service gateway routes the request across our compute workflows depending on the type of workflow they request.

In the compute workflow, we have one of our main components that allows reusability, which is our agent SDK. We also have some services that we decided to centralize so that every team didn't have to rebuild them from scratch, one of which is the guardrail service. Guardrails are key to our workflows. We don't want any harmful input or any harmful output to come out of our workloads. We also have capability operators, which is where we centralize most of the capabilities so that once implemented by a team, they are usable all across advertising.

Thumbnail 1900

Thumbnail 1920

Thumbnail 1930

Everything is powered by registries (workflow registries and others), so individual components, models, services, and the overall workflows or agents we create are easily available for anyone in Ads to reuse as soon as they are made. This is how the agent SDK powers any new agent creation within Amazon advertising. First, I want you to look at this. It's very simple and allows us to create an agent description, and with that we import subagents, which are the main components you describe. Each subagent uses the same SDK to import a series of tools. Then each tool has its own capabilities. Those capabilities are nothing more than clients, and some of those are Bedrock clients or SageMaker clients.
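
For illustration only, here is a hypothetical composition in that shape. The Amazon Ads agent SDK is internal, so the class names below are assumptions; the point is the layering: an agent is a description plus subagents, a subagent imports tools, and each tool wraps a capability client such as a Bedrock or SageMaker client.

```python
# Hypothetical sketch of agent -> subagent -> tool -> capability composition.
from dataclasses import dataclass, field

@dataclass
class Capability:
    client_name: str                      # e.g. "bedrock-runtime" or "sagemaker-runtime"

@dataclass
class Tool:
    name: str
    capability: Capability

@dataclass
class SubAgent:
    name: str
    tools: list[Tool] = field(default_factory=list)

@dataclass
class Agent:
    description: str
    subagents: list[SubAgent] = field(default_factory=list)

image_tool = Tool("generate_image", Capability("bedrock-runtime"))
music_tool = Tool("generate_music", Capability("sagemaker-runtime"))

creative_agent = Agent(
    description="Creates ad images and videos for advertisers",
    subagents=[
        SubAgent("image_agent", [image_tool]),
        SubAgent("music_agent", [music_tool]),
    ],
)
print([s.name for s in creative_agent.subagents])
```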

Thumbnail 1960

This is also very reflective of how our layered infrastructure is architected and what kind of contribution engineering makes on which part versus what science contributes. The result is that we can build sophisticated agents in a shorter and shorter amount of time the more capabilities we have in the bottom layers. Moving forward, we don't just allow reuse of tools or capabilities, but also foundational blocks that we don't want every single team to reinvent, such as logging and monitoring. We want to standardize the logs across all these workflows. Logs are some of the most important components in nondeterministic systems. You want to know at any given time why a decision was made and where.

Security is incredibly important to centralize. We want our security teams to focus on just one part of the infrastructure so that everyone can benefit without them going to every single team. Caching is another thing we want to serve out of the box, whether prompt caching or asset store caching. We want to make sure engineers don't have to reinvent it every time. Checkpointing is something that every agent needs; it allows us to recover from any Lambda or other compute failures and resume from where the workload left off.
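
Here is a minimal sketch of the checkpointing idea just described: each completed step is persisted so that after a compute failure the workflow resumes from the last checkpoint instead of restarting from step one. A real system would write to a durable store; a local JSON file and the step names below are placeholders.

```python
# Minimal checkpointing sketch: persist progress after each step and skip
# already-completed steps on restart.
import json
import pathlib

CHECKPOINT = pathlib.Path("checkpoint.json")

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"done": []}

def save_checkpoint(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def run_workflow(steps: list[str]) -> None:
    state = load_checkpoint()
    for step in steps:
        if step in state["done"]:
            continue                      # already completed before the failure
        print(f"executing {step}")        # placeholder for the real tool call
        state["done"].append(step)
        save_checkpoint(state)            # persist progress after every step

run_workflow(["generate_images", "animate", "add_music", "compile_video"])
```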

Thumbnail 2060

State management is incredibly important because it allows us to persist the conversation across days, months, and eventually in perpetual state. Time travel allows us to understand what happened across a conversation and recover from a previous step. For conversation, it's extremely important to test different scenarios when we improve our tools or our prompts, and all of this is baked into the SDK so that any new agent that gets created already benefits from all this work. The challenge that we have is how do we make these tools and capabilities available to everyone. So we created a very simple component, which is the capability operator, and every capability that gets registered by a team will flow into this component that orchestrates the traffic and routes the traffic across the different components.

Thumbnail 2100

A team only has to know about the ID or a capability name, and from there they could easily connect and route the traffic through there. I also want you to notice the red box on the 3P capabilities, and that's what I mentioned before. The security team will focus all their efforts so that every team doesn't have to do the same.

Moving forward, responsible AI (RAI) is not optional, and it's not simple either. Some of the unique challenges we faced are that RAI has to work across modalities. We need to have security and safety for text, images, videos, and audio. There's also category diversity: a guardrail might work for one category but not for another, so we need to make sure we diversify. There's also scale and latency consistency. Reviewing a video or assessing content usually takes a long time for any specific guardrail, so how do we make sure we don't ruin the experience? And then there's accuracy and consistency. We need to sustain high accuracy and high coverage across all our workflows.

We came down to a layered approach. We have multiple checks at input and output so we can diversify the type of checks, but also at every single step between layers and tools. We also have multiple layers of filters that are much simpler than machine learning guardrails. Some of the solutions cover real-time scalability. We do parallel moderation just as we do parallel tool calling, and we also use sliding windows. When it comes to streaming output, we don't want to hold the stream back until it finishes just to determine whether it could damage the experience.

Thumbnail 2220

We also have content safeguards, so there are category-specific protectors that we invoke when we classify and categorize a prompt or a tool output. We have customized rules for diversity, PII detection, and so on. We also have automated checks and safe failing. With this kind of workflow, the great part is that when something fails, if you pass the guardrail output back, the agent can take another path in the conversation or try a different approach. So the conversation never stops, and we can always guarantee an output for our advertiser.

Thumbnail 2260

Here is how it's implemented. As you can see on the left side, input guardrails and output guardrails are where we apply our guardrail service. We also apply that at any step between agent and tool, specifically when it comes to third-party integrations. Here we didn't have to invent much. AWS already provides all the building blocks, so the service is fairly simple. It's a router collecting all the services that AWS provides—Comprehend, Bedrock Guardrails, SageMaker for some personalized guardrails that we have, and Rekognition. AWS makes it simple, and it's a fairly one-stop shop for all these guardrail rules.
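
Below is a hedged sketch of a text-only slice of such a guardrail router, combining the Bedrock ApplyGuardrail API with a Comprehend PII check. The guardrail identifier, version, and pass/fail policy are placeholders (the call would fail without a configured guardrail), and the real service also fans out to Rekognition and SageMaker-hosted checks for images and video.

```python
# Hedged sketch of a guardrail router over managed AWS services (text path only).
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
comprehend = boto3.client("comprehend", region_name="us-east-1")

def check_text(text: str) -> bool:
    guardrail = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="gr-placeholder-id",   # placeholder guardrail config
        guardrailVersion="1",
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    pii = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    blocked = guardrail["action"] == "GUARDRAIL_INTERVENED" or bool(pii["Entities"])
    return not blocked    # True means the text is safe to pass to the next step

if not check_text("Write a headline for my coffee brand"):
    print("Blocked: pass the guardrail output back so the agent can take another path.")
```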

Some of the results we achieved with our reusability efforts show that it now takes just two days to productionalize a new agent from prototype. We have more than 100 experimental workflows running today and 40 production agent workflows currently supporting advertiser traffic. We have onboarded more than 130 models. These are not all distinct base models; they could be similar models, different versions, different fine-tunes, or not even LLMs. We generate more than 150 million assets every single month. My favorite one is that it takes just two hours from when a new model comes out to implement it and serve it to every advertiser.

Thumbnail 2320

Production Results: Scaling from 10 Beta Users to Millions with 338% CTR Increase

Now reusability gives you velocity, but that means nothing if you can't scale it to all the customers that you have. Here is how we moved from 10 beta users to millions of active users today. On the left side is scalability. Using a lot of the building blocks on the infrastructure layer that we discussed before that are coming from AWS makes it very simple. You need more traffic and more scale? Service quota is where you would make the change, and instantly you can get more scaling.

We also implemented a lot of fallbacks on similar models that allow us to diversify the model we take. If a model capacity is not responding as we expect, we can move over to a similar model—text to text or text to image—that could produce a good output. Similar to that, we have spillover quotas. We make sure that we provide access to some models and the same model on a self-hosted capacity so that if we have capacity constraints, we can spill over this quota to our self-hosted models that we can control much easier.

Very important for scalability is also service map and resource tracing. We want to know ahead of time when we're about to hit the limits. Knowing every service and their contribution to the whole general workflow and what quota is left is very important. At 70 percent of the quota, we can proactively increase all the limits.

Now we can also move on to latency. As we mentioned before, we implemented sliding-window guardrails for streaming, which guarantees a very short user-perceived latency. We do parallel tool execution and quick-start workflows. As I showed before, many decision steps are required for the agent. Many times we recognize that those steps are recipes that advertisers want to execute all the time. When we recognize and classify these patterns, we can run a series of steps in parallel, which shortens the time advertisers need to get to their output.

We also have token caching that is saving us more than three times the cost we were paying before, and latency is also affected quite significantly. Regarding thinking versus non-thinking, not every task needs thinking, and not every task needs the same thinking quota. When you give a quota to an LLM for thinking, most LLMs will try to use all of it. Having that dynamically adjusted depending on the task helps a lot in reducing latency.

We also have aggressive timeouts. We understand that with these kinds of workflows, some components, tools, or LLMs could suddenly take a huge amount of time. We do not want a single use case where you have ten tools running in parallel to jeopardize the whole execution. So what we do is apply aggressive timeouts and then retry more often. There is also seeds over prompts. We used to create prompts for every single variation of our assets, but many times just changing the seeds or randomizing the seed could get us higher quality diversification across those assets, and that saves a lot of time.

Now we can move to auto diagnostics. Traditional engineering usually relies on certain inputs and certain outputs, as we discussed before. With agents, every single execution is a different path, so it is fairly impossible when you have millions of customers to analyze the logs and spot a problem that might only happen three percent of the time or zero point one percent of the time. How do we get through all these logs and understand if we have a problem?

We created some notebooks that allow us to automatically collect logs from all our executions. Then, as you can see, we categorize those logs per conversation and per turn ID. We use LLMs to read all these logs and proactively tell us if at any step of our conversation something could be optimized, a tool is not working as expected, or a prompt could be rewritten. More than that, if we find a problem, these notebooks will try to see if there is already a ticket open to address some of these issues. If not, they will download the code repo and try to reproduce the same problem with the inputs found from the logs.

If they manage to reproduce the same problem, they will try to create a fix. So engineers will wake up in the morning or middle of the day and find agents providing them with clear signals and code changes to apply fixes to these kinds of problems. This is AI using AI to build AI, which is something we are trying to push more and more forward to make sure we can always understand what our agents are doing.
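
As a rough illustration of that diagnostics flow, the sketch below groups structured logs by conversation and turn and asks a reviewing model to flag steps worth a ticket. The reviewing call is stubbed out with a simple rule, and the log fields are assumptions rather than the actual log schema.

```python
# Sketch of auto-diagnostics: group logs per conversation/turn, then let a
# reviewer (an LLM in practice, a rule here) flag steps that need attention.
from collections import defaultdict

logs = [
    {"conversation": "c1", "turn": 1, "tool": "image_gen", "latency_ms": 900, "error": None},
    {"conversation": "c1", "turn": 2, "tool": "animate", "latency_ms": 45000, "error": "timeout"},
]

def group_logs(entries):
    grouped = defaultdict(list)
    for e in entries:
        grouped[(e["conversation"], e["turn"])].append(e)
    return grouped

def review(turn_logs) -> str:
    # Stand-in for an LLM reading the logs and suggesting a fix or opening a ticket.
    problems = [e for e in turn_logs if e["error"] or e["latency_ms"] > 30000]
    return "open ticket / attempt repro" if problems else "looks healthy"

for key, turn_logs in group_logs(logs).items():
    print(key, review(turn_logs))
```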

Some of the results for scalability and latency show that our average time to first chunk is below four seconds. That seems like a lot, but for generative AI workflows, some of which are creating a video, that is a fairly low average. We also have more than one million active users in the US who use our services on a daily basis, and we consume more than five billion tokens per week across all the foundation models we use. Our service availability is ninety-nine point nine percent, which is not quite the standard AWS would hold, but this is a generative AI workflow, so there are many things that can go wrong.

Now we have solved velocity and discussed scale. Now comes the real problem: how can we build a reliable system when we have non-deterministic agents? The retry strategy is everything you need to make sure you can control your agents. We make very good use of fallback strategy, spillover quotas, and retry strategies for any type of use cases. For example, for tool error, we could try a different tool or the same tool with exponential backoff. For agent error, we can try a different agent, restore from a checkpoint error, or try a completely different path from the orchestrator. For network error, we can also do exponential backoff, aggressive timeout, and swap different routes, and eventually overall timeout.
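
Here is a minimal sketch of one slice of that retry strategy: exponential backoff on the same tool, then a fallback to a similar model, so the conversation can still reach an output. The delays, tool names, and error handling are illustrative only.

```python
# Sketch of a layered retry strategy: retry with exponential backoff, then
# fall back to a similar (for example self-hosted spillover) model.
import random
import time

def call_with_retries(tool, fallback, attempts: int = 3):
    for attempt in range(attempts):
        try:
            return tool()
        except RuntimeError:
            time.sleep((2 ** attempt) * 0.1)   # exponential backoff between attempts
    return fallback()                           # swap to a similar tool/model as a last resort

def flaky_image_model():
    if random.random() < 0.7:
        raise RuntimeError("model overloaded")
    return "image from primary model"

def self_hosted_image_model():
    return "image from spillover (self-hosted) model"

print(call_with_retries(flaky_image_model, self_hosted_image_model))
```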

Thumbnail 2720

No matter what path occurs across the system or what flaws emerge, we always end up with successful completion. We always try to make sure we have an output for our advertisers. Non-deterministic systems demand observability at every possible level. You cannot debug what you cannot see. We divided this using the same idea of architectural layers. We have agent-level observation, where we track latency, tool calls, tool steps, and structured logs. We also have platform and infrastructure logs, which are the classic logs you are probably used to, such as AWS Lambda logs, WebSocket logs, and dependencies.

Thumbnail 2790

We also have tool- and model-level logs so that we can trace every single tool execution or model inference, depending on the path taken across our workflows. Then we have product and UX layers where we want to know how our advertisers are using our product, how they turn our creatives into campaigns, and whether they are actually successful. We also have guardrails and safety. We need to make sure we have proactive alarms if we detect any abuse. We track capacity and quota providers across 1P, 2P, and 3P models and infrastructure quota usage limits.

We have alarms for all of that, and we have automation so that anyone in the team can address those problems. We have playbooks on how to move regions, how to fail over to different models, and how to close that gap very quickly before any advertiser even notices. We also focus on closing the loop from signals to fixes. We have a lot of auto evaluation in place that is able to understand all the signals, understand all the logs, and help us try to fix problems before any issue spreads.

Thumbnail 2850

We built this and scaled it. Does it work? This is one sample of the first results we got from our Creative Agent. One of our partners using the Creative Agent saw striking results: a 338% increase in click-through rate versus all other campaigns not using generative AI workflows; 89% new-to-brand orders, which means we do not just appeal to current brand customers but also unlock a completely new set of customers; and 121% return on ad spend, which means we are lowering costs using generative AI workflows. Advertisers are seeing double the revenue compared to their ad spending.

Thumbnail 2900

Thumbnail 2910

Thumbnail 2920

Thumbnail 2930

Thumbnail 2940

Thumbnail 2950

Thumbnail 2960

Lessons Learned and the Future: AWS AI Stack for Agentic Systems

We know it works based on the metrics and we know it scales well, but how does it look? Those were some real campaigns that advertisers are running on streaming TV. Now, what are the lessons learned across all of this? Productize the building blocks, not the demo. We started with just one prototype and understood pretty quickly that it was not scalable, so we had to shift to productionizing building blocks.

Keep agent scopes small, optimize for context, and delegate through an orchestrator, as mentioned earlier. We discovered this at a very early stage. Build for parallelism and retries from day one. This is key to working with non-deterministic agents. Also, make evaluation and observability part of development, never an afterthought. Without data, it is impossible to debug.

Thumbnail 3010

Moving forward, these are some of the things that we want to improve in our creative agent. We want to enhance creative agent autonomy and deep research to improve context. We want our agent to have significantly more intelligence, know everything about a brand, and predict what could work for any advertising campaign. We also want deep integration with broader Amazon native agents. Every team in Amazon is building agents, and every team in Amazon Ads is also building agents. We want those agents to collaborate with each other and leverage each other's capabilities. One could specialize in data, another could specialize in supply or a specific customer segment, and we want to use all that intelligence within our creative agent.

Thumbnail 3070

We also want to continue expanding our multi-foundational model effort. We don't want to focus on just one or a few models. We want every model that's best for the task to be dynamically selected by our creative agent. Additionally, we're scaling the service worldwide. We are making all our efforts so that we can serve this service in the US and to every other advertiser worldwide for free. Now I'm inviting Avinash back on the stage for some final remarks.

Thumbnail 3110

Thumbnail 3130

All right, am I audible? Thanks, Fabio. Thanks, Yashal. That was a great deep dive on the entire Creative Agent for ads. I would like to continue with some of the industry use cases for agentic AI and also show you how the AWS landscape is helping. We just discussed one of the use cases, which is the Creative Agent, but there are many, beginning with customer experiences. In fact, we have seen a lot of agents being built as chatbots and assistants within advertising industry tech. What we see is that these agents are helping with campaign management and insights, and also serving as chatbots for many applications. Then you have personalization. I won't go more into it since we just discussed the ads use case on personalization. It's all about how creative you can be with the agents: building videos, images, stitching it all together, and driving insights.

Thumbnail 3150

Thumbnail 3170

Well, this is my personal favorite: text analysis and automation. In fact, many of you might have seen on Amazon.com there's a Rufus agent which predominantly helps you summarize all the reviews and also gives you insights about the products. That's another use case where we're seeing huge growth in terms of how agents are being used. Then comes democratizing analytics. This is where the actual insights that help drive the business come in. For example, you want to make sure that your campaign ad managers or business insights people can put in natural language text, such as "Show me my last 10 reviews" or "Show me my last 10 products where they are at the top scale," and these can be easily converted into SQL queries so that they can automatically build those insights and dashboards for you.

Thumbnail 3210

The reason why we highlighted personalization is that it's the key theme for today on the creative agent. With that, agentic AI is evolving a lot and the landscape is changing rapidly. We have been seeing agents moving from POCs to production within less time. In fact, as Fabio and Yashal mentioned, there are a lot of aspects within the entire agentic AI systems. For example, you need these agents to talk to each other, and at the same time they have to scale to support from hundreds of users to millions of users. That's one of the primary challenges you're seeing here in the orchestration and scaling for these agents to start communicating with each other and then maintaining and scaling.

You need a solid platform where you can build, deploy, and run these agents. Then come prompts and guardrails. Fabio just mentioned responsible AI, and in fact this is not limited to the creative side of the house. Any agentic AI application should be equally responsible in implementing proper guardrails and mechanisms. That could mean handling few-shot prompting, prompt injection, or prompt leaking; all of these have to be brought under control, and you have to make sure you're implementing all the responsible AI methodologies. The third aspect is advanced memory. Agents are not fire and forget. You are constantly interacting with them, and you want to make sure that these interactions stay intact so that when you come back the next day, it picks up from where you left off. For that you need a lot of context.

Context engineering is the new keyword that is evolving, but all of this requires a lot of memory. Advanced memory techniques are another challenge that has to be addressed for entire agentic systems. The final consideration is identity and access, which doesn't need much of an introduction. All our systems are following a trend of evolving from monolithic to microservices and now to agentic AI systems where you talk with multiple agents, so you need proper access mechanisms for all of this.

Thumbnail 3340

Well, as we discussed all of these challenges, if you're building agents, you don't have to do something from scratch because we're in a fast-paced environment and you need systems that are reliable to help you start building agents, ship to production, and start seeing those ROIs. What we have on AWS, I'll try to unfold it for you and make it simple within the entire AWS AI stack.

Thumbnail 3370

To begin with, we have the infrastructure layer. On the right side you're seeing Amazon SageMaker AI and on the left side AI compute. If you're building or fine-tuning models to make them well suited for your agentic AI systems, we usually recommend the SageMaker-related products under SageMaker AI. Then for AI compute, we have Trainium and Inferentia, which are AWS-designed AI chips to help you train and host models.

Thumbnail 3400

The very next layer is development software and services, and this is where we can spend more time discussing the available agentic components. In fact, we recently launched Amazon Bedrock AgentCore. Bedrock AgentCore has multiple components. To begin with, the runtime, where you can deploy, host, and scale your agents so that you don't have to worry about orchestration mechanisms.

Thumbnail 3470

Then the AgentCore Gateway is a personal favorite of mine. It serves as a central piece for you to host your tools and MCPify everything (MCP being another buzzword) so that you can connect with all your tools in one go. We also discussed memory, so AgentCore Memory provides the capability to build large-scale short-term or long-term memory systems. We also have Amazon Nova models, foundation models that Bedrock AgentCore and the rest of this stack support. Open-source SDKs like LangChain, CrewAI, and Strands are also supported.

Thumbnail 3520

But if you don't want to build these from scratch and you want to use something off the shelf, we have Amazon Quick Suite and Amazon Connect, which are purpose-built agentic systems that you can use directly. Kiro is an IDE-based agentic system that was recently launched. I'm pretty sure you must have checked out the Kiro booth near the expo area. Kiro is built around spec-driven development, which is agentic AI based.

With that, I hope you all have a great session and really thanks for making it here. Feedback is a gift. It helps us enrich and make sure we bring the right content for you. Please fill out the session survey towards the end and thank you once again. Have a great conference.


This article is entirely auto-generated using Amazon Bedrock.
