Kazuya

AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

In this video, Amazon Ads presents their large-scale LLM inference solution built on AWS to understand shoppers and improve advertiser outcomes. The team processes billions of daily requests during events like Prime Day, using GenAI for product understanding, shopper behavior analysis, and matching optimization. They built a dedicated inference service on Amazon ECS with mixed EC2 instances (including G6e), implementing key optimizations like disaggregated inference (achieving a 50% throughput improvement), KV-aware routing (reducing latency by 20-40%), and dynamic capacity allocation across workloads. The architecture leverages AWS services including EKS with 10,000+ GPUs, EFA networking, and NVIDIA Dynamo integration, while handling diverse latency requirements from milliseconds to seconds across offline batch, near real-time, and real-time inference scenarios.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 30

Introduction: Amazon Ads and AWS Partnership at Scale

Great, so let's get started here. Thank you. So this is a session on the Amazon on AWS track. In this particular track, we feature solutions that were built by teams across Amazon. This particular talk features a solution built by Amazon Ads. As you can see, Amazon is a group of more than 100 different entities. You may know some of them, like Prime Video, Ads, and Ring, but common to all of them is the scale at which they operate.

Thumbnail 50

As this slide builds up, you'll see some rather large numbers, whether it's storage, throughput, or general capacity. These numbers are really huge. And here's a fun fact: the numbers you see on the screen are not for a duration of a year or even a month. They represent just four days in July during the Prime Day event that happened earlier this year. These are the stats for that event.

Thumbnail 80

Let's do a quick show of hands. How many of you here have used consumer LLMs such as GPT or Gemini? Before I could finish the sentence, I saw people raising their hands, so that's a good thing. Great. So if you use LLMs, you know that using them in a process that needs to be deterministic is challenging, right? And at Amazon scale, that adds layers of complexity to that challenge.

Thumbnail 110

Well, if you wanted to know how you can go about solutioning this, this is the session for you, right? And I'm seeing all orange headsets, which is a good thing. You're listening to my session and not some other session, so that's great. Okay, so in this session, you'll learn how Amazon Ads harnesses the power of LLMs to understand their shoppers more deeply, and this understanding helps them deliver better outcomes for their advertisers. You'll also learn the architecture that they used to build the solution. This architecture was built on AWS, and all the fine-tunings and optimizations that they did, we'll share those with you.

Thumbnail 170

So I'm Varun Kamakarran. I'm a Principal Customer Solutions Manager at AWS, and joining me are Shenghua Bao, who's the Director of Sponsored Advertising from Amazon Ads, and Bole Chen, who's the Senior Manager in Shenghua's team. He was tasked with building the solution. So our agenda for today: I'll go through the introduction and talk to you about the relationship between Ads and AWS and how this whole engagement got started. I'll pass it over to Shenghua, who will talk to you about the use cases they pursued. He'll also set context as to why a GenAI-based solution was the right option for them. Then we'll have Bole on stage, who will walk you through the journey of building the solution, what fine-tunings he did, what architecture options they chose, and he'll wrap this up with lessons learned.

Thumbnail 210

Thumbnail 230

So let's get started here. Amazon Ads, on one hand, is a customer of AWS just like you. They use AWS services and solutions to build their own products and services and give those to their customers. On the other hand, they're our partner. We package our products with theirs and provide that to our joint customers. So this dual relationship has been greatly beneficial to us.

Thumbnail 240

Now, to look a little at where Ads came from: as you'd imagine, they were born in the cloud. Their first product, launched more than a decade ago, was the Amazon advertising product, and as their customer base has grown and as more and more people have joined their platform, they've built more complex products. They were an early investor in AI, and most recently they've launched two products: Creative Agent as well as Ads Agent. They're both agentic solutions built on AWS.

Thumbnail 280

Thumbnail 300

A look into the services they use: early on, as you'd imagine, they started off with EC2 and S3, foundational building block services, and they moved on to have more complex, more managed services and orchestrated services such as Step Functions or EMR. They were early adopters of SageMaker AI and most recently Bedrock. But the key is this: they use a vast majority of AWS services. They use more than 180 different services,

and for each of these services they push us. They help mature the services. They tell us, hey, this feature is not there, this capability is missing, can you do this for us? So we take the feedback, we build those capabilities, we mature the service, and that benefits customers such as you. So in general, our partnership has been mutually beneficial and continues to grow. To give you a little bit more context on the use case, I'm going to bring Shenghua on stage. Shenghua, please take it over.

Thumbnail 360

Thumbnail 370

How Amazon Ads Works and Where GenAI Can Make a Difference

All right. Thank you for the introduction. It has indeed been a great journey together with AWS. Next, I'm going to introduce a little more about Amazon Ads, how it works, and the new opportunity GenAI introduces. Nearly 20 years ago, Amazon asked a simple question: how can we combine a great shopping experience with branded discovery through advertising? Today, hundreds of millions of shoppers engage with Amazon through different channels. They can stream a video, shop at Amazon, or interact with Alexa.

Thumbnail 420

Through these interactions, we learn how Amazon shoppers discover products, explore brands, and make purchase decisions. With this in mind, we apply this learning to full-funnel advertising solutions to help advertisers reach customers at scale. Let's make it more concrete using one example in the Amazon store. This is a query for Halloween costumes, one of my favorite seasonal queries.

Thumbnail 430

Thumbnail 440

Thumbnail 450

Thumbnail 460

When a shopper searches for Halloween costumes, they may start by seeing a particular brand recommendation along with a featured product to build brand awareness. As the shopper continues browsing, we may surface more relevant products in the search results for consideration, or product groups that share a certain attribute to help the shopper with navigation. When the shopper clicks a product and lands on a product page, we may also surface complementary products or alternatives to support the purchase decision.

Thumbnail 470

So what happens behind the scenes? Given a query for Halloween costumes, the Amazon Ads system needs to first retrieve the most relevant subset of products from hundreds of millions of options, oftentimes on the order of tens of thousands. We then use machine learning models to score these candidates and reduce them to a few hundred, and finally select a few to present to the shopper. On a daily basis, we need to process billions of such requests.
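
As a rough illustration of that retrieve-score-select funnel (not Amazon's actual code; the function names, data structures, and candidate counts below are hypothetical stand-ins for the stages described above):

```python
# Illustrative sketch of the retrieve -> score -> select ad funnel.
# All functions, objects, and numbers are hypothetical.

def retrieve_candidates(query: str, catalog_index) -> list:
    """Narrow hundreds of millions of products down to ~tens of thousands of candidates."""
    return catalog_index.lookup(query, limit=50_000)

def score_candidates(query: str, candidates: list, ranking_model) -> list:
    """Use ML models to score candidates and keep a few hundred."""
    scored = [(ranking_model.predict(query, c), c) for c in candidates]
    scored.sort(reverse=True, key=lambda pair: pair[0])
    return scored[:300]

def select_ads(scored: list, slots: int = 4) -> list:
    """Finally select the handful of ads actually shown to the shopper."""
    return [candidate for _, candidate in scored[:slots]]

def handle_request(query: str, catalog_index, ranking_model):
    candidates = retrieve_candidates(query, catalog_index)
    top = score_candidates(query, candidates, ranking_model)
    return select_ads(top)
```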

Thumbnail 510

Thumbnail 520

And this experience is powered by hundreds of models behind the scenes, running in real time. Now let's zoom in a little further on the models: what they typically look like and how they match products to a particular shopper. A model oftentimes takes multiple inputs, including the query and search context, the product features of the candidate ad, and shopper signals.

With all of these inputs, the signals flow into a neural network architecture, which oftentimes includes attention mechanisms or mixture-of-experts designs, and eventually produces a probability score that tells how likely a shopper is to click or purchase a particular product. Without going into the technical details, these models oftentimes have billions of parameters, and during inference we need to perform tens of billions of operations within a tight latency budget of 40 milliseconds.
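
As a quick sanity check on the scale implied by those numbers, here is a back-of-the-envelope calculation; the specific figures below are assumptions chosen only to match the rough orders of magnitude mentioned above, not the team's actual measurements:

```python
# Back-of-the-envelope check of the compute budget described above.
# All numbers are illustrative assumptions.

ops_per_inference = 20e9          # "tens of billions of operations" per scored request
latency_budget_s = 0.040          # 40 ms latency budget
requests_per_day = 2e9            # "billions of such requests" daily

# Sustained compute needed for a single request to finish inside the budget:
flops_needed_per_request = ops_per_inference / latency_budget_s
print(f"{flops_needed_per_request / 1e12:.1f} TFLOP/s just to serve one request in time")

# Aggregate daily compute across all requests:
total_daily_ops = ops_per_inference * requests_per_day
print(f"{total_daily_ops:.2e} operations per day across the fleet")
```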

Thumbnail 590

These models are useful and effective in predicting the likelihood. However, these models are not good at telling why a particular product is a good match and why it is not. That's where GenAI can make a difference. These are some examples we saw earlier. Imagine now we are advertisers, we might ask a question: given this product, who are the shoppers that might be interested in these items?

Thumbnail 620

Thumbnail 630

GenAI can take the product's image, title, and description as input and produce insights about which shoppers are likely interested in these items. They could be looking for fantasy scenes or traditional costumes, or simply for fun options.

Thumbnail 660

Similarly, GenAI can help with shopper understanding. Imagine we have two shoppers searching for the same query of Halloween costumes. Even though they start with the same query, their behavior may show they have different interests, and GenAI is good at telling the nuanced difference as a shopper engages with different products. In this case, GenAI can tell that, for example, John might be looking for fun, attention-grabbing looks, maybe for a special school event, while Alice may be looking for fantasy scenes with elegant designs.

Thumbnail 680

Thumbnail 690

Thumbnail 700

Now, with both product understanding and shopper understanding in place, GenAI can further help reason about the matching. We already know the tastes John and Alice are looking for, and GenAI can help identify the fourth group as a potential match for John and the first group as a potential match for Alice, while explaining in human-understandable language why the rest of the products are not as good a match.

Thumbnail 710

From these examples, we can tell what GenAI models are good at. They understand machine learning and also human language. They possess broad world knowledge and are able to tell the nuanced differences between product attributes. They can also understand evolving shopper interests. With that, they can understand and reason about why a particular product is or is not a good match for a particular shopper. These are all promising directions we are exploring, but they come with challenges.

Thumbnail 750

Thumbnail 780

In particular, the GenAI models can go up to tens or hundreds of billions of parameters, and many of the use cases require us to respond within milliseconds. We also need to respond to evolving shopper interests and campaign changes. All of this needs to happen at a request scale of billions a day. With many LLM use cases in place, this workload can translate to something ten times bigger than the common consumer language model services we use today.

Building LLM Inference for Three Distinct Use Cases: Batch, Near Real-Time, and Real-Time

With that, I'm going to introduce Bole to the stage to talk about how we built the solution and the lessons we have learned so far. Thanks, Varun and Shenghua, for the great setup. In the remainder of the session, I will take you behind the scenes of how we built a dedicated LLM inference service to support the GenAI use cases you just heard about.

Thumbnail 820

Let's recap the main factors that shape LLM inference performance. The first one that comes to mind is model size: in general, a bigger model means more compute and higher latency. That is straightforward. Second, token length: longer input tokens as well as generated output tokens both add work per request, and together this affects latency and the throughput each single host can deliver. But these two are just half the story. The latency SLA as well as the overall traffic volume also determine how much capacity our system actually needs. And here's a universal rule: latency and throughput trade off against each other. If you need super low latency, you will need more capacity, which means more cost.
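
To see how these factors translate into capacity, here is a rough, hypothetical sizing sketch. The traffic, peak ratio, and per-host throughput numbers are made up purely to illustrate the latency-versus-throughput trade-off described above:

```python
# Rough capacity-planning sketch; all numbers are illustrative assumptions.

def hosts_needed(requests_per_day: float,
                 per_host_throughput_rps: float,
                 peak_to_average_ratio: float = 2.0,
                 headroom: float = 1.3) -> int:
    """Estimate hosts needed to absorb peak traffic with some headroom."""
    average_rps = requests_per_day / 86_400
    peak_rps = average_rps * peak_to_average_ratio
    return int(peak_rps * headroom / per_host_throughput_rps) + 1

# A latency-optimized setup typically serves fewer requests per host than a
# throughput-optimized one, so the same traffic costs more capacity:
print(hosts_needed(2e9, per_host_throughput_rps=50))   # throughput-optimized
print(hosts_needed(2e9, per_host_throughput_rps=10))   # latency-optimized, ~5x more hosts
```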

Thumbnail 880

Thumbnail 900

With this idea in mind, let's revisit the GenAI use cases and see how each one pulls these factors in different ways. Let's go back to the product understanding example, the one where we try to figure out which shopper segments would love different types of Halloween costumes. Historically, we relied on product image, title, and brand embeddings to represent the product. This works, but not very well when the goal is to identify nuanced shopping intent, like shoppers who may like attention-grabbing looks.

Well, large language models can unlock a richer view by digesting product descriptions, customer reviews, and other long-form content, which can easily go up to 100,000 tokens. Because this information is relatively static, we don't need super low latency here. What matters is maximizing throughput, as we need to process hundreds of millions of requests on a daily or weekly basis.

Thumbnail 930

On the infrastructure side, we use AWS Lambda functions and EventBridge to orchestrate this large-scale offline batch process. Data flows from S3 into a throughput-optimized LLM endpoint, which is capable of high concurrency. Then we sync the data into a storage layer on ElastiCache and S3, where other ecosystems will consume it as an additional signal.
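
As a hedged sketch of this batch flow, the handler below reads a shard of product records from S3, calls a throughput-optimized LLM endpoint, and writes the enriched output back to storage. The bucket names, endpoint URL, payload shape, and field names are assumptions for illustration; only the S3 calls are real boto3 APIs:

```python
# Minimal sketch of an event-driven offline batch step (hypothetical names).

import json
import boto3
import urllib.request

s3 = boto3.client("s3")
LLM_ENDPOINT = "http://llm-batch-endpoint.internal/v1/generate"  # hypothetical

def handler(event, context):
    # EventBridge/Lambda passes in which S3 object (batch shard) to process.
    bucket, key = event["bucket"], event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    results = []
    for line in body.splitlines():
        product = json.loads(line)
        prompt = f"Describe which shopper segments would be interested in: {product['title']}"
        req = urllib.request.Request(
            LLM_ENDPOINT,
            data=json.dumps({"prompt": prompt, "max_tokens": 256}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            results.append({"asin": product["asin"],
                            "insight": json.loads(resp.read())["text"]})

    # Sync results to the storage layer (S3 here; a cache write follows the same pattern).
    s3.put_object(Bucket="ads-llm-insights",  # hypothetical bucket
                  Key=f"product-understanding/{key}",
                  Body="\n".join(json.dumps(r) for r in results).encode("utf-8"))
```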

Thumbnail 950

Thumbnail 970

Let's move to the shopper understanding example. Here the goal is to infer what a shopper is actually looking for based on recent activities. For example, from a few clicks, the LLM can tell that a shopper prefers humorous and attention-grabbing products. To do this, we assemble session context by consuming raw signals like clicks, items viewed but not clicked, purchases, and even products in the shopping cart, and putting them into an input prompt. Then the LLM can reason about why a shopper might like or dislike a product. Typically this input ranges from a few hundred to a few thousand tokens.
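
Here is a minimal, hypothetical sketch of that prompt-assembly step; the event schema and prompt wording are illustrative, not the production prompt:

```python
# Illustrative sketch of building a shopper-understanding prompt from raw signals.

def build_session_prompt(query: str, events: list[dict]) -> str:
    lines = [f'The shopper searched for "{query}" and then:']
    for e in events:
        if e["type"] == "click":
            lines.append(f"- clicked: {e['title']}")
        elif e["type"] == "impression":
            lines.append(f"- saw but did not click: {e['title']}")
        elif e["type"] == "add_to_cart":
            lines.append(f"- added to cart: {e['title']}")
        elif e["type"] == "purchase":
            lines.append(f"- purchased: {e['title']}")
    lines.append("In one or two sentences, describe what this shopper is likely looking for "
                 "and which product attributes they care about.")
    return "\n".join(lines)

prompt = build_session_prompt("halloween costumes", [
    {"type": "click", "title": "Inflatable T-Rex costume"},
    {"type": "impression", "title": "Classic vampire cape"},
    {"type": "add_to_cart", "title": "Banana suit, adult size"},
])
# The resulting prompt is typically a few hundred to a few thousand tokens.
```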

Thumbnail 1010

Latency-wise, what matters is that the result returns before the shopper's next engagement with Amazon, which is typically a few seconds away, so we don't need to optimize for the absolute minimum latency; this lets us balance cost, efficiency, and throughput. On the infrastructure side, we leverage Amazon Managed Streaming for Apache Kafka (MSK) and Amazon Managed Service for Apache Flink to scale our streaming pipeline, which is capable of handling hundreds of thousands of queries per second. Within the streaming pipeline, we make asynchronous calls to the LLM endpoint. And similar to offline batch inference, LLM outputs are written into a storage layer.

Thumbnail 1030

Thumbnail 1050

Thumbnail 1060

We have covered offline batch inference and near real-time inference. There are also scenarios where real-time decision making is critical, such as shopper-product matching and using LLMs to assist with product ranking. Here latency is the most important thing, because we need to return responses within a few hundred milliseconds while supporting a scale of hundreds of thousands of queries per second. To do this, our servers make direct synchronous calls to a latency-optimized LLM endpoint, and these calls come from different components of the serving workflow, including ad sourcing and ranking.
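
A minimal sketch of that real-time pattern might look like the following, with a tight client-side deadline and a graceful fallback to the existing ranking if the LLM misses its budget. The endpoint URL, payload, and response shape are assumptions:

```python
# Sketch of a synchronous, latency-bounded call to a latency-optimized LLM endpoint.

import json
import urllib.request
from urllib.error import URLError

LATENCY_OPTIMIZED_ENDPOINT = "http://llm-realtime-endpoint.internal/v1/generate"  # hypothetical

def rerank_with_llm(query: str, candidates: list[str], timeout_s: float = 0.15) -> list[str]:
    payload = json.dumps({"prompt": f"Rank these products for '{query}': {candidates}",
                          "max_tokens": 64}).encode("utf-8")
    req = urllib.request.Request(LATENCY_OPTIMIZED_ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return json.loads(resp.read())["ranking"]
    except (URLError, TimeoutError, KeyError):
        # Fall back to the existing ML ranking so the LLM never delays ad serving.
        return candidates
```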

Thumbnail 1070

Now that we have covered several GenAI use cases, let's summarize the key tenets and what they mean to our system requirements. First, we need flexibility across many aspects, including model selection, workloads, and optimization objectives. The system needs to support diverse use cases as models and applications continue to evolve. Second, we need high throughput wherever possible for cost efficiency, especially for heavy loads from offline batch inference and near real-time inference. And finally, for certain scenarios, we do need ultra-low latency, so that LLM inference does not introduce additional delay to the existing shopping experience.

Thumbnail 1110

Dedicated LLM Inference Solution: Hybrid Stack Architecture and System Integration

With these tenets in mind, we developed a dedicated LLM inference solution using a hybrid stack of both software and hardware. The service runs entirely on Amazon Elastic Container Service with a mixture of EC2 instance types to support different demands. This choice is not only about cost efficiency; it also helps us navigate GPU supply constraints. Many of our workloads can run on more readily available instance types such as G6e rather than competing for high-demand instances such as P5 or P6 with the latest NVIDIA GPUs.

At the orchestration layer, we developed a model router and job scheduler. This module routes traffic to the right model and schedules offline batch jobs based on their priority and the current capacity status. On the software side, because this space evolves so fast and different model servers, including vLLM, TensorRT-LLM, and SGLang, offer different optimizations, we try to keep our technical choices and our system flexible. We also benchmark different configurations, compare their performance, and pick the best setup for each use case.
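
Conceptually, the router and scheduler could be sketched like this; the routing table, model names, GPU counts, and capacity checks are illustrative assumptions, not the production implementation:

```python
# Conceptual sketch of a model router plus priority-based batch job scheduler.

from dataclasses import dataclass, field
import heapq

ROUTES = {  # use case -> (model, endpoint profile); hypothetical entries
    "product_understanding": ("large-throughput-model", "throughput-optimized"),
    "shopper_understanding": ("mid-size-model", "near-real-time"),
    "realtime_matching": ("small-distilled-model", "latency-optimized"),
}

def route(use_case: str):
    """Pick the model and endpoint profile for an incoming request."""
    return ROUTES[use_case]

@dataclass(order=True)
class BatchJob:
    priority: int
    name: str = field(compare=False)

class BatchScheduler:
    """Dispatch queued offline jobs only when spare capacity is available."""
    def __init__(self):
        self.queue: list[BatchJob] = []

    def submit(self, job: BatchJob):
        heapq.heappush(self.queue, job)

    def maybe_dispatch(self, free_gpus: int, gpus_per_job: int = 8):
        while self.queue and free_gpus >= gpus_per_job:
            job = heapq.heappop(self.queue)
            free_gpus -= gpus_per_job
            print(f"dispatching {job.name} (priority {job.priority})")

scheduler = BatchScheduler()
scheduler.submit(BatchJob(priority=1, name="product-understanding-refresh"))
scheduler.maybe_dispatch(free_gpus=32)
```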

And finally, system optimization is critical when running LLMs in production at scale. We integrate with NVIDIA Dynamo to adopt several promising optimizations, which I will cover in a moment.

Thumbnail 1210

Let's also talk about how this system integrates holistically with other ad ecosystem components. It provides a lightweight Java client for servers to make direct synchronous calls for real-time inference, and it also allows streaming pipelines to invoke it for near real-time inference. Shopper streaming events, including views, clicks, add-to-carts, and purchases, flow into the Kafka- and Flink-managed streaming pipelines, which make asynchronous calls to the LLM endpoint using the same Java client. For large-scale offline batch inference, we offer a batch interface for different teams to submit and schedule their jobs.

Thumbnail 1270

Aside from real-time inference where servers make direct calls to LLM endpoints, both offline batch inference and near real-time inference follow the same pattern by writing LLM output to a storage layer managed by ElastiCache and S3, where servers will consume them as additional signals. In the next few slides, I will walk you through several optimizations we found effective for our application use cases for your reference.

Thumbnail 1280

Key Optimizations: Disaggregated Inference, KV-Aware Routing, and Dynamic Resource Allocation

Let me set up some context first. In LLM inference, the first output token depends on the entire input prompt, and after that, the model generates output tokens one at a time in an auto-regressive loop. The phase that processes the entire input prompt is called prefill, and the subsequent token-by-token generation is called decode. In addition, to better utilize GPUs, modern model servers normally adopt continuous batching, which packs requests at different stages together, mixing prefill and decode within the same batch at the token level.

Because prefill and decode can run within the same batch, latency is always gated by the slowest phase, and typically that's prefill. With larger models or longer input prompts, this is even more visible, as the latency penalty from the prefill stage grows. As you can see in the example, assuming we have four requests in a batch, when request five arrives with its heavy prefill stage, it interrupts the decode stage of the other requests and of course delays them.
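
A toy calculation makes this interference concrete; the per-token costs below are arbitrary units chosen only to show the effect, not real timings:

```python
# Toy model of continuous batching: a newly arrived long prompt (heavy prefill)
# stalls the token-by-token decode of the requests already in flight.

PREFILL_COST_PER_TOKEN = 1.0   # arbitrary time units
DECODE_COST_PER_STEP = 1.0

def time_to_next_decode_token(inflight_requests: int, new_prompt_tokens: int = 0) -> float:
    """Time until each in-flight request emits its next token, when a new
    request with `new_prompt_tokens` of prefill joins the same batch."""
    decode_work = inflight_requests * DECODE_COST_PER_STEP
    prefill_work = new_prompt_tokens * PREFILL_COST_PER_TOKEN
    return decode_work + prefill_work

print(time_to_next_decode_token(4))                          # steady decode: 4 units
print(time_to_next_decode_token(4, new_prompt_tokens=4000))  # request 5 arrives: 4004 units
```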

Thumbnail 1360

To deal with this inefficiency, one idea is disaggregated inference: we separate the prefill and decode stages onto different containers and services, allowing them to run and scale independently. This can lead to better utilization of resources and improve both latency and throughput. Remember the near real-time shopper understanding example, where our input ranges from 2,000 to 4,000 tokens, much longer than the output, which is only about 200 tokens. Here, prefill interference would otherwise be unavoidable, which is why this optimization is so effective.

For example, we configured a disaggregated setup using NVIDIA Dynamo with 12 prefill workers, each with one GPU, and a decode worker with four GPUs. Testing against the Qwen 32B model, this setup achieved up to 50% higher throughput at the same latency compared to the aggregated setup. Another powerful idea is KV-aware routing.

Thumbnail 1430

A little more context first. KV cache is a technique that improves inference performance by storing intermediate results from previously processed tokens. As you recall, a token can be represented as a vector of floating point numbers, also known as an embedding. To simplify, you can imagine the key, value, and query (K, V, Q) as matrices used to compute attention scores between these embeddings. When multiple requests share overlapping prefixes, KV caching can help avoid redundant computation.
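
A quick, hypothetical arithmetic example shows why shared prefixes matter so much:

```python
# Rough illustration of prefix caching; token counts are hypothetical.

def prefill_tokens_to_compute(prompt_tokens: int, cached_prefix_tokens: int) -> int:
    """With prefix caching, only the uncached suffix needs prefill compute."""
    return max(prompt_tokens - cached_prefix_tokens, 0)

# Two requests from the same shopper share a ~3,000-token context and differ
# only in the final ~50-token query:
first_request = prefill_tokens_to_compute(3_050, cached_prefix_tokens=0)       # 3050
second_request = prefill_tokens_to_compute(3_050, cached_prefix_tokens=3_000)  # 50
print(first_request, second_request)  # the cache hit removes ~98% of the prefill work
```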

Thumbnail 1470

Thumbnail 1490

KV-aware routing takes this one step further by directing requests to the GPU worker that already holds the most relevant cached data. In this example on the right, you can see that requests from the same user share almost the same input prompt except for the last, changing part, which is the shopping query. When we direct requests from the same user to the same GPU worker, we maximize the KV cache hit rate and avoid unnecessary computation.
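
A heavily simplified stand-in for this idea is sticky, hash-based routing on the session or shopper id, sketched below. The real NVIDIA Dynamo KV router tracks which worker actually holds which cached blocks; this sketch only captures the "same context goes to the same worker" intuition, with hypothetical worker names:

```python
# Minimal sticky-routing sketch: requests sharing a context land on the same worker.

import hashlib

WORKERS = ["gpu-worker-0", "gpu-worker-1", "gpu-worker-2", "gpu-worker-3"]

def pick_worker(session_id: str) -> str:
    """The same session always lands on the same worker, so its KV cache is warm."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return WORKERS[int.from_bytes(digest[:4], "big") % len(WORKERS)]

print(pick_worker("shopper-123"))  # some worker
print(pick_worker("shopper-123"))  # the same worker again, reusing its KV cache
```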

Again, in this benchmark running on NVIDIA Dynamo against the 32B model, you can see that with the KV router enabled, this setup reduces end-to-end latency by 20 to 40% across different percentiles. And the difference is even more obvious if we look at the prefill stage, or the time to first token, alone.

Thumbnail 1530

Thumbnail 1540

Thumbnail 1550

Thumbnail 1570

Now, let's take a look at the key enablers for these optimizations. At the data plane layer, we use Amazon EKS to orchestrate and manage a cluster of over 10,000 GPUs across different instance types. This is the foundation for us to build a scalable service and adopt some of the optimizations. For high-performance networking, we use AWS Elastic Fabric Adapter (EFA) to accelerate inter-node communication and better leverage network bandwidth, which is also essential for running distributed inference. And finally, we integrate with NVIDIA Dynamo, which brings together several key system-level optimizations, including the KV router, disaggregated inference, and low-latency data transfer. Together, these enabled the optimizations and improved our latency and throughput.

Thumbnail 1590

Here's another thing I'd like to cover. So we all know GPUs are expensive, right? And sometimes you may also face GPU supply chain constraints. Of course, you don't want to waste your money or hardware, and neither do we. Beyond system optimization, another practical idea to increase your resource utilization is to dynamically allocate your capacity across different workloads based on their traffic patterns.

Thumbnail 1610

Fortunately, real-time inference for a shopping site like Amazon has a pretty predictable daily traffic pattern. As you can see, there are clear daily traffic peaks and valleys in the chart. Real-time inference must meet this demand, but offline batch inference workloads do not need to compete for capacity during the peak hours. So before peak traffic hours, we allocate more GPUs to real-time inference services to meet the latency SLA, and as we enter off-peak hours, the same GPUs are shifted to support offline batch inference workloads. Overall, we do not need to overprovision capacity.
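
A minimal sketch of that time-based reallocation might look like the following; the GPU counts and the peak window are illustrative assumptions, not the actual schedule:

```python
# Sketch of shifting GPU capacity between real-time and offline batch workloads.

TOTAL_GPUS = 10_000

def realtime_gpu_target(hour_utc: int) -> int:
    """Give real-time inference more GPUs around peak shopping hours and
    release the rest to offline batch inference off-peak."""
    peak_hours = range(15, 23)          # hypothetical daily peak window
    return 7_000 if hour_utc in peak_hours else 3_000

def allocate(hour_utc: int) -> dict:
    realtime = realtime_gpu_target(hour_utc)
    return {"realtime": realtime, "offline_batch": TOTAL_GPUS - realtime}

print(allocate(18))  # {'realtime': 7000, 'offline_batch': 3000}
print(allocate(4))   # {'realtime': 3000, 'offline_batch': 7000}
```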

Thumbnail 1650

Thumbnail 1660

Another important aspect of running GPUs at scale is dealing with hardware faults. Typically, GPUs have a lifespan of 3 to 7 years, and over time they can degrade, showing symptoms like higher latency or even error codes in responses. For a cluster running over 10,000 GPUs, this can happen on a daily basis. So we leverage Amazon CloudWatch to continuously monitor these signals, and we mark a GPU instance for replacement whenever we identify degradation in any of them. This maintains our system's availability without human intervention from our engineering team.
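
As a hedged sketch of that health check, the snippet below pulls a per-instance latency percentile from CloudWatch and flags instances that drift past a threshold. The metric namespace, metric name, dimensions, and threshold are hypothetical; only the CloudWatch API call itself is a real boto3 API:

```python
# Sketch of degradation detection from CloudWatch metrics (hypothetical metric names).

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def p99_latency_ms(instance_id: str) -> float:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AdsLLM/Inference",                      # hypothetical namespace
        MetricName="TokenLatencyMs",                       # hypothetical metric
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        ExtendedStatistics=["p99"],
    )
    points = stats["Datapoints"]
    return max(p["ExtendedStatistics"]["p99"] for p in points) if points else 0.0

def mark_for_replacement(instance_id: str, threshold_ms: float = 250.0) -> bool:
    """Return True if the instance looks degraded and should be cycled out."""
    return p99_latency_ms(instance_id) > threshold_ms
```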

Thumbnail 1700

Thumbnail 1710

Lessons Learned and Closing Remarks

Okay, so this almost concludes the main topics I wanted to cover. Our journey is still ongoing, but we want to share some early learnings with you. First of all, as you can see, we did not build everything from scratch. We rely heavily on a broad range of AWS services and offerings. This is key to achieving the scalability and reliability we need, and it also frees us to focus on building the application layer and business logic and on trying different optimization ideas. Second, we stay closely connected to industry partners and the community. This is very important for adopting optimizations and new solutions, sometimes even earlier than their general availability. We always keep our system flexible and open, so whenever we adopt a new approach or capability, we do not need to re-engineer the entire stack. And finally, not every emerging optimization works for your use case and application out of the box. It's equally important to work closely with your application partners to understand their use cases, including their traffic patterns, their input prompt design, and their optimization objectives. This information helps you make the right decisions on solutions and optimizations.

Thumbnail 1780

Thumbnail 1800

Finally, we'd like to acknowledge the AWS teams as well as the NVIDIA teams, including the NVIDIA AWS and NVIDIA Dynamo teams, for their partnership. Their support, spanning from optimization benchmarks to helping us adopt Dynamo on containers on EKS, has been instrumental in bringing this technical solution into production at scale. Thank you everyone for your time, and I'm handing it over to Varun.

Thank you, Bole. So as Bole mentioned, the solution approach was rather simple: it was built on foundational building blocks, right? You can use the same approach and techniques that Bole mentioned in your own use cases.

That concludes our session today, but before I let you go, please visit the Amazon demos in One Amazon Lane, located here in Caesars Forum. If you scan the QR code, it'll tell you exactly where it is. This is a curated list of demos. If you want to know what our delivery drivers use from a technology perspective, or what's in the Rivian van, there's a Rivian ramp there where you can go check it out.

If you want to know what Prime Video is doing for their live broadcasts and what X-Ray means, there's a demo for that. And finally, there's the Zoox robotaxi. It's a cool piece of technology. If you've taken a ride in a Zoox, you'll know what I'm talking about. If you haven't, go check it out. That's all in One Amazon Lane, right around the corner here.

That concludes our session. Thank you very much for joining us. Please do provide us feedback: go to your app and let us know how we did, and if there are any topics or sessions you'd like to see at the next re:Invent, put that in there as well. Thank you very much. Have a wonderful rest of your day and re:Invent. Thank you, take care.


This article is entirely auto-generated using Amazon Bedrock.
