Kazuya

AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

In this video, Amazon Ads demonstrates how they built a large-scale LLM inference system on AWS to process billions of daily requests for understanding shoppers and products. The team shares their architecture using Amazon ECS with over 10,000 GPUs, implementing optimizations like disaggregated inference and KV-aware routing through NVIDIA Dynamo, achieving 50% throughput improvement and 20-40% latency reduction. They detail three inference patterns—offline batch, near real-time, and real-time—each optimized for different latency requirements ranging from seconds to milliseconds, while dynamically allocating GPU capacity based on traffic patterns to maximize resource utilization.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 30

Thumbnail 50

Introduction: Amazon's Scale and the AWS-Amazon Ads Partnership

Great, so let's get started here. Thank you. This is a session on the Amazon on AWS track. In this particular track, we feature solutions that were built by teams across Amazon. This particular talk showcases a solution built by Amazon Ads. As you see, Amazon is a group of more than 100 different entities. You may know some of them: Prime Video, Ads, Ring. What's common to all of them is the scale at which they operate. As this slide builds up, you'll see some rather large numbers, whether it's storage, throughput, or general capacity. These numbers are really huge. Here's a fun fact: the numbers you see on the screen don't cover a year or even a month. They represent just the four days of the Prime Day event that happened in July earlier this year.

Thumbnail 80

Thumbnail 110

Let's do a quick show of hands. How many of you here have used consumer LLMs such as GPT or Gemini? Before I could finish the sentence, I saw people raising their hands, so that's a good thing. If you've used LLMs, you know that using them in a process that needs to be deterministic is challenging. At Amazon scale, that adds layers of complexity to the challenge. If you want to know how to go about solving this, this is the session for you.

I'm seeing all orange headsets, which is a good thing: you're listening to my session and not some other session. In this session, you'll learn how Amazon Ads harnesses the power of LLMs to understand their shoppers more deeply. This understanding helps them deliver better outcomes for their advertisers. You'll also learn about the architecture they used to build the solution. It was built on AWS, and we'll share all the fine-tuning and optimizations they did.

Thumbnail 170

I'm Varun Kamakarran, a principal customer solutions manager at AWS. Joining me are Shenghua Bao, director of sponsored advertising at Amazon Ads, and Bole Chen, a senior manager on Shenghua's team who was tasked with building the solution. Our agenda for today: I'll go through the introduction and talk about the relationship between Ads and AWS and how this engagement got started. I'll then pass it over to Shenghua, who will talk about the use cases they pursued and set context for why a GenAI-based solution was the right option for them. Then we'll have Bole on stage to walk you through the journey of building the solution, the fine-tuning he did, and the architecture options they chose, and he'll wrap up with lessons learned.

Thumbnail 210

Thumbnail 230

Thumbnail 240

Amazon Ads, on one hand, is a customer of AWS just like you. They use AWS services and solutions to build their own products and services and give those to their customers. On the other hand, they're our partner. We package our products with theirs and provide that to our joint customers. This dual relationship has been greatly beneficial to us. Looking at where Ads came from, as you'd imagine, they were born in the cloud. Their first product, which was more than a decade ago, was an Amazon advertising product. As their customer base has grown and more people have joined their platform, they've built more complex products. They were early investors in AI and most recently launched two products: Creative Agent and Ads Agent, both agentic solutions built on AWS.

Thumbnail 280

Thumbnail 300

Looking at the services they use, early on, as you'd imagine, they started off with EC2 and S3, foundational building block services. They moved on to more complex, more managed and orchestrated services such as Step Functions or EMR. They were early adopters of SageMaker AI and most recently Bedrock. The key point here is they use a vast majority of AWS services. They use more than 180 different services.

And for each of these services, they push us. They help mature the services. They tell us, "Hey, this feature is not there, this capability is missing. Can you do this for us?" So we take the feedback, we build those capabilities, we mature the service, and that benefits customers such as you. So in general, our partnership has been mutually beneficial.

Thumbnail 360

Thumbnail 370

How Amazon Ads Works and Where GenAI Can Make a Difference

To give you a little bit more context on the use case, I'm going to bring Shenghua on stage. Shenghua, please take it over. Thank you for the introduction. It has indeed been a great journey together with AWS. Next, I'm going to introduce a little more about Amazon Ads, how it works, and the new opportunities GenAI introduces. Nearly 20 years ago, Amazon asked a simple question: How can we combine a great shopping experience with branded discovery through advertising?

Thumbnail 420

Today, hundreds of millions of shoppers engage with Amazon through different channels. They can stream a video, shop at Amazon, or interact with Alexa. Through these interactions, we learn how Amazon shoppers discover products, explore brands, and make purchase decisions. With this in mind, we apply this learning to full-funnel advertising solutions to help advertisers reach customers at scale. Let's make it more concrete using an example in the Amazon store. This is a query for Halloween costumes, one of my favorite seasonal queries.

Thumbnail 430

Thumbnail 440

Thumbnail 460

When a shopper searches for Halloween costumes, they may start with seeing a particular brand recommendation along with featured products to build brand awareness. As the shopper continues browsing, we may serve more relevant products in the search results for consideration or product groups that share certain attributes to help the shopper with navigation. When the shopper clicks on a product and lands on a particular page, we may also serve complementary products or alternatives to support the purchase decision.

Thumbnail 470

So what's behind the scenes is how it works. Given a query of Halloween costumes, the Amazon Ads system needs to first retrieve the most relevant subset of products from hundreds of millions of items. This is often on the order of tens of thousands. We then use machine learning models to score these candidates and reduce them to a few hundred. Finally, we select a few to present to the shoppers. On a daily basis, we need to process billions of such requests.
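
As a rough illustration of this retrieve-score-select funnel, here is a minimal Python sketch. The stage sizes, catalog, and helper functions are hypothetical; Amazon Ads' actual retrieval and ranking stack is far more sophisticated.

```python
# Illustrative sketch of the retrieve -> score -> select funnel.
# Stage sizes and helpers are made up; this is not the actual Amazon Ads pipeline.
from typing import List, Tuple


def retrieve_candidates(query: str, catalog: List[str], limit: int = 20_000) -> List[str]:
    """Stage 1: pull the most relevant subset (tens of thousands) from hundreds of millions of items."""
    terms = set(query.lower().split())
    return [p for p in catalog if terms & set(p.lower().split())][:limit]


def score_candidates(query: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Stage 2: ML models score each candidate; keep a few hundred."""
    scored = [(p, float(len(set(query.lower().split()) & set(p.lower().split())))) for p in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:300]


def select_ads(scored: List[Tuple[str, float]], slots: int = 4) -> List[str]:
    """Stage 3: pick the final few items to actually present to the shopper."""
    return [p for p, _ in scored[:slots]]


catalog = ["halloween costume witch", "halloween costume pirate", "garden hose", "halloween candy bowl"]
candidates = retrieve_candidates("halloween costumes", catalog)
print(select_ads(score_candidates("halloween costume", candidates)))
```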

Thumbnail 510

Thumbnail 520

This experience is powered by hundreds of models behind the scenes running in real time. Now let's zoom in a little further on the models, what they typically look like and how they match a product to a particular shopper. The models often take multiple inputs, including the query, the search context, and the product features of the candidate ad. They can also include shopper signals. All of these inputs flow into a neural network architecture, which often includes attention mechanisms or mixture-of-experts designs, and eventually produce a probability score that tells how likely a shopper is to click or purchase a particular product.
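
To make the shape of such a model concrete, here is a toy numeric sketch: query, context, product, and shopper-signal embeddings pass through a scaled dot-product attention step and a linear head to produce a click probability. The dimensions and weights are random placeholders, not the production architecture, which has billions of parameters.

```python
# Toy illustration of a scoring model: multiple input embeddings, an attention step
# over shopper history, and a sigmoid head producing a click/purchase probability.
import numpy as np

rng = np.random.default_rng(0)
d = 32                                   # embedding dimension (illustrative)

query_emb   = rng.normal(size=(1, d))    # search query embedding
context_emb = rng.normal(size=(1, d))    # search context (device, page, time of day, ...)
product_emb = rng.normal(size=(1, d))    # candidate ad's product features
shopper_seq = rng.normal(size=(8, d))    # recent shopper activity as a sequence of signals


def attention(q, k, v):
    """Scaled dot-product attention over the shopper's activity sequence."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
    return weights @ v


# Attend over shopper history, using the query + product as the attention query.
shopper_summary = attention(query_emb + product_emb, shopper_seq, shopper_seq)

# Concatenate all signals and map to a single logit, then a probability.
features = np.concatenate([query_emb, context_emb, product_emb, shopper_summary], axis=-1)
w = rng.normal(size=(features.shape[-1], 1))
p_click = 1.0 / (1.0 + np.exp(-(features @ w)))
print(f"predicted click probability: {p_click.item():.3f}")
```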

Thumbnail 590

Without going into the technical details, these models often have billions of parameters, and during inference, we need to perform tens of billions of operations within a tight latency budget of 40 milliseconds. These models are useful and effective at predicting likelihood. However, they are not good at telling us why a particular product is or is not a good match. That's where GenAI can make a difference. These are some examples we saw earlier. Imagine now we are advertisers; we might ask a question: given this product, who are the shoppers that might be interested in these items?

Thumbnail 620

Thumbnail 630

Thumbnail 660

Thumbnail 680

Thumbnail 700

Thumbnail 710

GenAI can take the product image, title, and description as input and describe which shoppers are likely to be interested in these items: those looking for fantasy scenes, those looking for traditional costumes, or those simply after fun options. Similarly, GenAI can help with shopper understanding. Imagine we have two shoppers searching the same query, Halloween costumes. Even though they start with the same query, their behavior may tell us they have different interests, and GenAI is good at telling the nuanced difference as a shopper engages with different products. In this case, GenAI can tell that John might be looking for fun, attention-grabbing looks, maybe for a special school event, while Alice may be looking for fantasy scenes with elegant design. Now, with both product understanding and shopper understanding in place, GenAI can further help reason about the matching. We already know the tastes John and Alice are looking for, and GenAI can identify the fourth group as a potential match for John and the first group as a potential match for Alice, while explaining in human-understandable language why the rest of the products are not as good a match.

Thumbnail 750

Thumbnail 780

From these examples, we can tell what GenAI models are good at. They understand machine learning and also human language. They possess broad world knowledge and can tell nuanced differences in product attributes. They can also understand evolving shopper interests. With that, they can understand and reason about why a particular product is or is not a good match for a particular shopper. These are all promising directions we are exploring. But it comes with challenges. In particular, GenAI models can go up to hundreds of billions of parameters. Many of the use cases require us to respond within seven seconds. We also need to respond to evolving shopper interests and campaign changes. All of this needs to happen at the required scale of billions of requests a day. With many LLM use cases in place, this workload can translate to something ten times bigger than the common consumer language models we use today. With that, I am going to introduce Bole to the stage to talk about how we built the solution and the lessons we have learned so far.

Thumbnail 820

Building a Dedicated LLM Inference Service: Architecture and Use Cases

Thank you. Thanks Varun and Shenghua for the great setup. In the remaining part of this session, I will walk you through, behind the scenes, how we built a dedicated LLM inference service to support the GenAI use cases you just heard about. Let's recap the main factors that can shape LLM inference performance. The first that comes to mind is model size. In general, bigger models mean more compute and higher latency. That is straightforward. Second is token length. Longer input tokens as well as generated output tokens both add work per request. Together, these affect the latency and throughput each single host can deliver. But these two are just half the story. The latency SLA as well as overall traffic volume also affect how much capacity our system actually needs. Here is a universal rule: latency and throughput trade off against each other. That is, if you need super low latency, you will need more capacity, which means more cost.
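
A quick back-of-envelope calculation shows how these factors combine into capacity needs. All numbers below are placeholders, not Amazon Ads figures; the point is the shape of the relationship, which follows Little's law.

```python
# Back-of-envelope capacity math for the latency/throughput trade-off described above.
# All numbers are placeholders; only the relationship between them matters.

peak_qps = 50_000           # requests per second the service must absorb at peak
avg_latency_s = 0.8         # average end-to-end latency per request at the chosen operating point
concurrency_per_host = 64   # requests a single GPU host can serve concurrently at that latency

# Little's law: in-flight requests = arrival rate x latency.
in_flight = peak_qps * avg_latency_s
hosts_needed = in_flight / concurrency_per_host
print(f"~{in_flight:,.0f} requests in flight -> ~{hosts_needed:,.0f} hosts")

# Tightening the latency target usually lowers per-host concurrency (smaller batches),
# so the same traffic needs more hosts -- the cost side of the trade-off.
```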

Thumbnail 880

Thumbnail 900

With this idea in mind, let's revisit the GenAI use cases and see how each one pulls these factors in different ways. Let's go back to the product understanding example, the one where we try to figure out which shopper segment would love different types of Halloween costumes. Historically, we rely on product image, title, and brand embeddings to represent the product. This works, but not very well when the goal is to identify nuanced shopping intent, like shoppers who may like attention-grabbing looks.

Thumbnail 930

LLMs can unlock a richer view by digesting full product descriptions, customer reviews, and other long-form content, which can easily go up to 100,000 tokens. Because this information is relatively static, we don't need super low latency here. What matters is the highest possible throughput, as we need to process hundreds of millions of requests on a daily or weekly basis. On the infrastructure side, we use AWS Step Functions and EventBridge to orchestrate this large-scale offline batch processing. Data flows from S3 into a throughput-optimized LLM endpoint, which is capable of high concurrency. Then we sync the data into a storage layer on ElastiCache and S3, where other ecosystem components will consume it as additional signals.
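
Here is a simplified sketch of that batch flow: read long-form product content from S3, call a throughput-optimized LLM endpoint, and sync the result to ElastiCache (Redis) and S3. The bucket names, endpoint URL, and prompt are hypothetical, and the real pipeline is orchestrated by Step Functions and EventBridge rather than a single script.

```python
# Simplified sketch of the offline batch flow: S3 -> LLM endpoint -> ElastiCache + S3.
# Bucket names, endpoint URL, and prompt are hypothetical.
import json

import boto3
import redis
import requests

s3 = boto3.client("s3")
cache = redis.Redis(host="my-elasticache-endpoint", port=6379)   # hypothetical cache endpoint
LLM_URL = "http://llm-batch-endpoint.internal/v1/completions"    # hypothetical internal endpoint


def understand_product(bucket: str, key: str) -> None:
    # Full product description, reviews, and other long-form content (can approach ~100k tokens).
    doc = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    resp = requests.post(
        LLM_URL,
        json={"prompt": f"Which shopper segments would be interested in this product?\n\n{doc}",
              "max_tokens": 256},
        timeout=300,  # throughput-optimized path: generous timeout, high concurrency upstream
    )
    segments = resp.json()

    # Write to the shared storage layer where downstream components consume the signal.
    cache.set(f"product-understanding:{key}", json.dumps(segments))
    s3.put_object(Bucket="product-understanding-output", Key=key, Body=json.dumps(segments))
```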

Thumbnail 970

Let's move to the shopper understanding example. Here the goal is to infer what a shopper is actually looking for based on recent activities. For example, from a few clicks, an LLM can tell that John is a shopper who prefers fun and humorous products. To do this, we assemble session context, consuming raw signals like clicks, items viewed but not clicked, purchases, and even products in a shopping cart, and putting them into an input prompt. Then the LLM can reason about why a shopper might like or dislike a product. Typically this input ranges from a few hundred to a few thousand tokens.
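
A minimal sketch of that prompt assembly might look like the following; the event fields and prompt wording are illustrative only.

```python
# Minimal sketch of assembling a shopper-understanding prompt from raw session signals.
# Event fields and prompt wording are illustrative, not the production prompt design.
from typing import Dict, List


def build_session_prompt(events: List[Dict]) -> str:
    """Turn clicks, views, purchases, and cart adds into a prompt of a few hundred to a few thousand tokens."""
    lines = [f"- {e['type']}: {e['title']}" for e in events]
    history = "\n".join(lines)
    return (
        "Given this shopper's recent session activity, describe what they appear to be "
        "looking for and why they might like or dislike a candidate product.\n\n"
        f"Session activity:\n{history}\n"
    )


events = [
    {"type": "click", "title": "Inflatable T-Rex Halloween costume"},
    {"type": "view_no_click", "title": "Elegant witch costume with cape"},
    {"type": "add_to_cart", "title": "Banana suit, adult size"},
]
print(build_session_prompt(events))
```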

Thumbnail 1010

Thumbnail 1030

Latency-wise, what matters is that the response returns before the shopper's next engagement with Amazon, which is typically a few seconds. We don't need to optimize for the minimum possible latency, which lets us balance cost efficiency with throughput. On the infrastructure side, we leverage Amazon managed Kafka and Flink to scale our streaming pipeline, which is capable of handling hundreds of thousands of queries per second. The streaming pipeline makes asynchronous calls to the LLM endpoint. Similar to offline batch inference, LLM outputs are written into a storage layer.
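
The snippet below is a small Python stand-in for that asynchronous call pattern, with a semaphore bounding the number of in-flight requests; the endpoint URL is hypothetical, and the real pipeline runs inside managed Flink rather than a standalone script.

```python
# Asynchronous LLM calls with bounded concurrency, as a stand-in for the Kafka/Flink pipeline.
import asyncio

import aiohttp

LLM_URL = "http://llm-nearline-endpoint.internal/v1/completions"  # hypothetical endpoint
MAX_IN_FLIGHT = 128


async def infer(session: aiohttp.ClientSession, sem: asyncio.Semaphore, prompt: str) -> dict:
    async with sem:  # cap concurrent requests so the endpoint isn't overwhelmed
        async with session.post(LLM_URL, json={"prompt": prompt, "max_tokens": 200}) as resp:
            return await resp.json()


async def process_batch(prompts: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(infer(session, sem, p) for p in prompts))

# asyncio.run(process_batch(["...session prompt..."]))  # outputs are then written to ElastiCache/S3
```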

Thumbnail 1050

Thumbnail 1060

Thumbnail 1070

We have covered offline batch inference and near real-time inference. There are also scenarios where real-time decision making is critical, such as shopper-product matching, as well as using LLMs to assist with product ranking. Here latency is the most important thing, because we need to return a response within a few hundred milliseconds while supporting a scale of hundreds of thousands of queries per second. To do this, our servers make direct synchronous calls to a latency-optimized LLM endpoint, invoked from different components of the ad-serving workflow, including ad sourcing and ranking.
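
A hedged sketch of that real-time pattern: call the latency-optimized endpoint synchronously with a strict timeout and fall back to the existing non-LLM signal if the budget is blown. The endpoint URL and response shape are assumptions; production uses a lightweight Java client rather than Python.

```python
# Real-time path: synchronous call with a strict latency budget and a fallback score.
import requests

LATENCY_BUDGET_S = 0.2  # a few hundred milliseconds end to end
LLM_URL = "http://llm-realtime-endpoint.internal/v1/completions"  # hypothetical endpoint


def rank_with_llm(prompt: str, fallback_score: float) -> float:
    try:
        resp = requests.post(
            LLM_URL,
            json={"prompt": prompt, "max_tokens": 16},
            timeout=LATENCY_BUDGET_S,  # fail fast rather than delay the page
        )
        resp.raise_for_status()
        return float(resp.json()["score"])  # assumed response shape
    except (requests.RequestException, KeyError, ValueError):
        # Never let LLM inference add delay to the existing shopping experience.
        return fallback_score
```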

Thumbnail 1110

Now that we have covered several GenAI use cases, let's summarize the key tenets and what they mean to our system requirements. First, we need flexibility across many aspects, including model selection, workloads, and optimization objectives. The system needs to support diverse use cases as models and applications continue to evolve. Second, we need high throughput wherever possible for cost efficiency, especially for heavy loads from offline batch inference and near real-time inference. Finally, for certain scenarios, we do need ultra-low latency, so that LLM inference does not introduce additional delay to existing shopping experience.

With these tenets in mind, we developed a dedicated LLM inference solution using a hybrid stack of both software and hardware. The service runs entirely on Amazon Elastic Container Service with a mixture of EC2 instances to support different demands. This choice is not only for cost efficiency; it also helps us navigate GPU constraints. Many of our workloads can run on more readily available instance types such as G6e rather than competing for high-demand NVIDIA GPU instances such as P5 or P6.

At the orchestration layer, we developed a model router and job scheduler. These modules help route traffic to the right model and schedule offline batch jobs based on their priority and current capacity status. For software, because this space evolves so fast and different model servers including vLLM, TensorRT-LLM, and SGLang can offer different optimizations, we try to keep our technical choices and the system flexible enough. We also benchmark different configurations, compare their performance, and pick the best setup for each use case. Finally, system optimization is very critical when talking about running LLMs in production at scale. We integrate with NVIDIA Dynamo to adopt several of the promising optimizations, which I will cover in a moment.
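
The following sketch illustrates the shape of that orchestration layer: a model router mapping use cases to endpoint pools, and a priority-based scheduler that admits offline batch jobs only while spare GPU capacity exists. Pool names, priorities, and thresholds are all illustrative, not the actual implementation.

```python
# Hedged sketch of a model router plus a priority-based batch job scheduler.
import heapq
from dataclasses import dataclass, field

MODEL_ROUTES = {
    "product_understanding": "throughput-optimized-pool",   # hypothetical pool names
    "shopper_understanding": "nearline-pool",
    "realtime_ranking": "latency-optimized-pool",
}


def route(use_case: str) -> str:
    """Route traffic to the right model endpoint pool for the use case."""
    return MODEL_ROUTES[use_case]


@dataclass(order=True)
class BatchJob:
    priority: int                      # lower number = higher priority
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)


class JobScheduler:
    def __init__(self, spare_gpus: int):
        self.spare_gpus = spare_gpus
        self.queue: list[BatchJob] = []

    def submit(self, job: BatchJob) -> None:
        heapq.heappush(self.queue, job)

    def dispatch(self) -> list[BatchJob]:
        """Admit the highest-priority jobs that fit in current spare capacity."""
        started = []
        while self.queue and self.queue[0].gpus_needed <= self.spare_gpus:
            job = heapq.heappop(self.queue)
            self.spare_gpus -= job.gpus_needed
            started.append(job)
        return started


sched = JobScheduler(spare_gpus=64)
sched.submit(BatchJob(priority=1, name="weekly-product-refresh", gpus_needed=48))
sched.submit(BatchJob(priority=2, name="experimental-backfill", gpus_needed=32))
print([j.name for j in sched.dispatch()])   # only the refresh fits right now
```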

Thumbnail 1210

Let me also discuss how this system integrates holistically with the broader ecosystem. It provides a lightweight Java client for servers to make direct synchronous calls for real-time inference. It also allows streaming pipelines to invoke near real-time inference. Shopper streaming events including views, clicks, add to cart, and purchases flow into a Kafka and Flink managed streaming pipeline, which makes asynchronous calls to the LLM endpoint using the same Java client. For large-scale offline batch inference, we offer a batch interface for different teams to submit and schedule their jobs. Aside from real-time inference where servers make direct calls to the LLM endpoint, both offline batch inference and near real-time inference follow the same pattern by writing LLM output to a storage layer managed by ElastiCache and S3, where servers consume them as additional signals.

Thumbnail 1270

Thumbnail 1280

Optimization Strategies: Disaggregated Inference, KV-Aware Routing, and Resource Management

In the next few slides, I will walk you through several optimization strategies we found effective for our application use cases. Let me set up some context first. In LLM inference, the first output token depends on the entire input prompt. After that, the model generates output tokens one at a time in an auto-regressive loop. The phase that processes the entire input prompt is called prefill, and the subsequent token-by-token generation is called decode. Additionally, to better utilize GPUs, modern model servers normally adopt continuous batching, which packs requests of different sizes together, mixing prefill and decode stages within the same batch at a token level.
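
Here is a toy illustration of prefill versus decode. The "model" is a stub; the point is that the whole prompt is processed once (prefill), and then output tokens are generated one at a time, each step reusing the cached state (decode).

```python
# Toy prefill/decode loop. fake_forward stands in for a transformer step.
def fake_forward(tokens: list[str], cache: list[str]) -> tuple[str, list[str]]:
    """Stand-in for a model step: returns the next token and the updated KV cache."""
    cache = cache + tokens            # a real model stores per-layer key/value tensors here
    next_token = f"tok{len(cache)}"   # dummy next-token choice
    return next_token, cache


def generate(prompt_tokens: list[str], max_new_tokens: int = 5) -> list[str]:
    # Prefill: one pass over the entire input prompt; dominates latency for long prompts.
    first_token, kv_cache = fake_forward(prompt_tokens, cache=[])

    # Decode: auto-regressive loop, one token per step, each step touching only the newest token.
    output = [first_token]
    for _ in range(max_new_tokens - 1):
        next_token, kv_cache = fake_forward([output[-1]], kv_cache)
        output.append(next_token)
    return output


print(generate(["the", "shopper", "is", "looking", "for"]))
```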

Thumbnail 1360

Because prefill and decode can run within the same batch, latency is always gated by the slowest phase, and typically it is prefill. As you have larger models or longer input tokens, this latency penalty to the prefill stage becomes even more visible. As you can see in the example, assuming we have four requests in a batch, when request five arrives with its heavy prefill stage, it can cause interruption to the decode stage of other requests and delay them. To address this inefficiency, one idea is called disaggregated inference. We separate the prefill and decode stages onto different containers and services, allowing them to run and scale independently. This can lead to better utilization of resources and improve both latency and throughput.

Recall the near real-time shopper understanding example, where input tokens range from two thousand to four thousand, much longer than the output size of only two hundred tokens. Here, prefill interruption would otherwise be unavoidable, which is why disaggregated inference can be effective. For example, we configured a disaggregated setup using NVIDIA Dynamo with twelve prefill workers, each with one GPU, and one decode worker with four GPUs. Testing against a 235B-parameter model, this setup achieved up to fifty percent throughput improvement at the same latency compared to the aggregated setup.
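
Expressed as a simple capacity sketch, that setup looks like this; note this is not the NVIDIA Dynamo configuration format, just an illustration of the shape of the deployment.

```python
# Shape of the disaggregated deployment: prefill and decode in separate, independently scaled pools.
disaggregated_setup = {
    "prefill_pool": {"workers": 12, "gpus_per_worker": 1},   # many small workers absorb long prompts
    "decode_pool":  {"workers": 1,  "gpus_per_worker": 4},   # decode batches many streams on fewer GPUs
}

total_gpus = sum(p["workers"] * p["gpus_per_worker"] for p in disaggregated_setup.values())
print(f"total GPUs: {total_gpus}")  # 16 in this example

# Because the pools scale independently, a surge of long prompts can be handled by adding
# prefill workers without disturbing in-flight decode streams -- the source of the throughput
# gain at equal latency described above.
```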

Thumbnail 1430

Thumbnail 1470

Thumbnail 1490

Another powerful idea is called KV-aware routing. Let me provide some context first. The KV cache is a technique used to improve inference performance by storing intermediate computation results from previously processed tokens. As you recall, a token can be represented as a vector of floating point values, also known as an embedding. For simplification, you can imagine the keys, values, and queries as matrices used to compute internal attention scores from these embeddings. When multiple requests share overlapping prefixes, KV caching can help avoid redundant computation. KV-aware routing takes this one step further by directing requests to the GPU worker that already holds the most relevant cached data. In this example, requests from the same user almost always share the same input prompt except for the last moving part, which is the shopping query. When we direct requests from the same user to the same GPU worker, this maximizes the KV cache hit rate and helps avoid unnecessary computation.
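
A minimal sketch of the idea: requests whose prompts share a prefix (here, everything tied to the same shopper) are hashed to the same worker so that worker's KV cache can be reused. Real KV routers such as the one in NVIDIA Dynamo track actual cache contents; this hash-based version only captures the core intuition.

```python
# Prefix-affinity routing: send requests sharing a prompt prefix to the same worker.
import hashlib


def pick_worker(shared_prefix_key: str, num_workers: int) -> int:
    digest = hashlib.sha256(shared_prefix_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


NUM_WORKERS = 8

# Two requests from the same shopper differ only in the trailing query, so they share a prefix key
# and land on the same worker, maximizing KV cache hits for the long, static part of the prompt.
print(pick_worker("shopper:john", NUM_WORKERS))
print(pick_worker("shopper:john", NUM_WORKERS))   # same worker as above
print(pick_worker("shopper:alice", NUM_WORKERS))  # likely a different worker
```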

Thumbnail 1530

In this benchmark, running NVIDIA Dynamo against a 32B model, you can see that with the KV router enabled, this setup helps reduce end-to-end latency by 20 to 40% at different percentiles. The difference is even more obvious if we focus on the prefill stage, or the time to first token, alone.

Thumbnail 1540

Thumbnail 1550

Now, let's take a look at the key enablers for this optimization. At the data plane layer, we use Amazon ECS to orchestrate and manage a cluster of over 10,000 GPUs of different instance types. This is the foundation for us to build a scalable service and also adopt some of the optimizations.

Thumbnail 1570

Thumbnail 1590

For high-performance networking, we use AWS EFA to accelerate inter-node communication and better leverage network bandwidth. This is also essential for us to run distributed inference and adopt some optimizations. Finally, we integrate with NVIDIA Dynamo, which brings together several key system-level optimizations including the KV router, disaggregated inference, and low-latency data transfer. Together, these enable the optimizations above and help with our latency and throughput.

Thumbnail 1610

Here's another thing I'd like to cover. We all know GPUs are expensive, and sometimes you may also face GPU supply chain constraints. Of course, you don't want to waste your money or hardware, and neither do we. Beyond system optimization, another practical idea to increase resource utilization is to dynamically allocate your capacity across different workloads based on their traffic patterns.

Thumbnail 1650

Fortunately, real-time inference for shopping sites like Amazon has a pretty predictable daily traffic pattern. As you can see, there is a clear daily traffic peak and valley in the chart. Real-time inference must meet this demand. However, offline batch inference does not need to compete for capacity during these peak hours. Before the traffic peak hour, we can allocate more GPUs to real-time inference service to meet the latency SLA. As we enter the off-peak hours, the same GPUs can be shifted to support offline batch inference workloads. Overall, we do not overprovision on capacity.
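 
A hedged sketch of that time-of-day capacity shifting follows; the hours and pool splits are placeholders, not Amazon Ads' actual schedule.

```python
# Time-of-day capacity shifting between real-time serving and offline batch workloads.
TOTAL_GPUS = 10_000
PEAK_HOURS = range(9, 23)          # assume peak shopping traffic roughly 9:00-22:59


def allocate(hour_of_day: int) -> dict:
    if hour_of_day in PEAK_HOURS:
        realtime = int(TOTAL_GPUS * 0.8)   # protect the latency SLA at peak
    else:
        realtime = int(TOTAL_GPUS * 0.4)   # valley traffic needs far fewer GPUs
    return {"realtime": realtime, "offline_batch": TOTAL_GPUS - realtime}


for hour in (3, 12, 20):
    print(hour, allocate(hour))
```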

Thumbnail 1660

Thumbnail 1700

Another important aspect of running GPUs at scale is dealing with hardware faults. Typically, GPUs have a lifespan from 3 to 7 years, and over time they become degraded, showing symptoms like higher latency or even exception error codes in responses. For a cluster running over 10,000 GPUs, this issue can actually happen on a daily basis. We leverage AWS tools to continuously monitor these signals and mark GPU instances for replacement whenever we identify degradation in some of the signals. This helps with our system's availability without requiring human intervention from our engineer team.
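
The degradation check can be pictured roughly as follows; the signal names and thresholds are illustrative, and the real system relies on AWS monitoring tooling rather than a hand-rolled loop.

```python
# Illustrative GPU health check: flag instances whose latency or error-rate signals drift.
from dataclasses import dataclass


@dataclass
class GpuInstanceHealth:
    instance_id: str
    p99_latency_ms: float
    error_rate: float        # fraction of responses with GPU exception error codes


LATENCY_THRESHOLD_MS = 500.0
ERROR_RATE_THRESHOLD = 0.01


def needs_replacement(h: GpuInstanceHealth) -> bool:
    return h.p99_latency_ms > LATENCY_THRESHOLD_MS or h.error_rate > ERROR_RATE_THRESHOLD


fleet = [
    GpuInstanceHealth("i-0aaa", p99_latency_ms=180.0, error_rate=0.0),
    GpuInstanceHealth("i-0bbb", p99_latency_ms=820.0, error_rate=0.02),   # degraded
]
print([h.instance_id for h in fleet if needs_replacement(h)])
```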

Thumbnail 1710

Key Learnings, Acknowledgments, and Closing Remarks

This almost concludes the main topics I would like to cover. Our journey is still ongoing, but we want to share some early learnings with you. First, as you can see, we did not build everything from scratch. We heavily rely on a big range of AWS services and offerings. This is the key for us to achieve the scalability and reliability we need and also frees us to really focus on building the application layer, business logic, and trying different optimization ideas.

Thumbnail 1780

Second, we stay closely connected to industry partners and the community. This is very important for us to adopt optimizations and new solutions even earlier than their general availability. We also keep our system flexible and open, so that whenever we adopt a new approach or capability, we do not need to re-engineer the entire stack. Finally, not every emerging optimization works for your use case and application out of the box. It's equally important to work closely with your application partners to understand their use case, including their traffic patterns, their input prompt design, and their optimization objectives. This information can help you make the right decisions about solutions and optimizations.

Thumbnail 1800

Finally, we'd like to acknowledge the AWS and NVIDIA teams, including the NVIDIA Dynamo team, for their partnership. Their support spans from optimization benchmarking to helping us adopt Dynamo running in containers on EKS. This has been instrumental in bringing the technical solution into production at scale. Thank you everyone for your time, and I'm handing it back to Varun.

Thank you, Bole. As Bole mentioned, the solution approach was rather simple: it was built on foundational building blocks. You can use the same approach and the techniques Bole described for your own use cases.

That concludes our session today, but before I let you go, please visit the Amazon demos at One Amazon Lane, located here in Caesars Forum. If you scan the QR code, it will tell you the location. This is a curated set of demos featuring what our delivery drivers use from a technology perspective and what is inside the Rivian van; there is a Rivian van there you can go check out. If you want to know what Prime Video is doing for their live broadcasts and what X-Ray means, there is a demo for that as well.

Finally, there is the Zoox robotaxi, which is a cool piece of technology. If you have taken a ride in Zoox, you will know what I am talking about. If you have not, go check that out. All of that is at One Amazon Lane, right around the corner here. That concludes our session. Thank you very much for joining us.

Please provide us feedback. Go to your app and let us know how we did. If there are any topics or sessions you would like to see at the next re:Invent, please put that in there as well. Thank you very much. Have a wonderful rest of your day at re:Invent. Thank you, and take care.


; This article is entirely auto-generated using Amazon Bedrock.
