🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Building and managing conversational AI at scale: lessons from Alexa+ (AMZ305)
In this video, Brittany Hurst, Luu Tran, and Sai Rupanagudi from Amazon Devices and Services explain how they rearchitected Alexa into the generative AI-powered Alexa+ while maintaining service for 600 million+ customers. They detail four major challenges: accuracy in routing and API selection, latency reduction through prompt caching and speculative execution, balancing determinism with conversational creativity, and model flexibility using a multi-model architecture with Amazon Bedrock and SageMaker. Key techniques include minification, instruction tuning, API refactoring, and context engineering. They emphasize that traditional optimization methods weren't sufficient—new approaches like disabling chain of thought reasoning in production and right-sizing models for specific use cases were essential. The team demonstrates real-world capabilities like the "Daisy the dog" feeding notification example, showcasing how Alexa+ now handles complex, multi-step tasks while maintaining the reliability customers expect from existing integrations.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Transforming Alexa for 600 Million Customers
Good morning everyone. I'm Brittany Hurst, and I lead the global AWS relationship with Amazon Devices and Services. Joining me today are two of my customers, Luu Tran and Sai Rupanagudi, two people who were instrumental in the rearchitecting and rebuilding of Alexa, the voice assistant we all know and love, and transforming it into the generative AI powered Alexa+. Over the next 45 minutes, we're going to take you behind the scenes of one of the most challenging engineering problems that we've ever tackled.
How do you evolve a voice assistant serving 600 million+ customers from scripted commands to natural conversation without breaking a single existing integration? This isn't a story about building something from scratch. This is about transforming a massive production system that customers depend on every day while maintaining the reliability that they expect and adding conversational capabilities that they demand in this new world.
Here's what we'll cover today. Sai is going to walk us through Alexa's evolution, a number of the challenges that we faced, and the customer learnings we've gathered from having this technology in customers' hands for the last decade. Then Luu is going to outline our design considerations and the techniques that we used to rearchitect it with generative AI and large language models behind the scenes. You're going to walk away with battle-tested lessons learned, and hopefully things that you can use in your own projects.
Alexa's Journey: From 13 Skills in 2014 to a Global Voice Assistant
Thank you, Brittany. How are you all doing today? I'm Sai Rupanagudi. I lead the product team for Alexa AI. How many people remember the first Alexa device? We're going to talk a little bit about the background of where we started. We first launched Alexa in 2014. At that time, we had about 13 skills, about a handful of things that Alexa could do, all from one developer, Amazon, and that was only in the US when we started.
But these were things that were delightful for customers. You wanted to play music while cleaning the house? Just ask Alexa. You didn't need to free up your hands. You wanted to convert ounces to grams while you were baking and making dough? You didn't need to clean your hands. And if you were feeling a little lazy about getting off the couch to turn on the lights, just ask Alexa. Now, all kidding aside, that last bit has been invaluable for our customers living with disabilities.
So customers loved what they were seeing. However, remember, this was a decade, more than a decade ago at this point. It came with significant technical challenges. You had to pick up voices from across the room. If anybody remembers that original Pringles-can-shaped device, the blue light would follow you around. That was to make sure that we were getting the right signals from the right places. Once you actually had that, you had to understand the customer's intent. Then we actually had to go get that information for the customer and execute the actions.
Now, all of this is through voice, so it needed to happen within a second or two. Otherwise, you're not going to wait for answers. No awkward pauses. It's a difficult thing to deal with. Fast forward to today, Alexa has over 600 million customers and devices. We work with developers across the world. Alexa is the assistant with the most products and services connected to a billion plus devices.
But then, customers still had to work with a machine. They felt like they were talking to a machine. You had to phrase things a certain way for Alexa to respond properly, much like you had to do with search at the time. Generative AI promised to break through that barrier for us. How we took Alexa from what it was to where we wanted it to be is the story that we're going to talk about. With 600 million devices though, we had many challenges that we had to overcome.
Alexa Plus: Making Conversations Natural and Getting Things Done
We wanted to use generative AI for capabilities that would take Alexa to the next step with Alexa Plus. That meant it had to be more conversational and much smarter. When you come in here and say it looks dark in here, most humans would get up and try to find a light switch. If you tell that to an LLM right now, it goes through its thoughts telling you, "I think this person thinks it's dark in here. What do I do? Maybe we should look for lights around here," and then finally gets to the lights and tries to turn on a light, only to find out it's hallucinating a light that doesn't exist in their home. We can't have that.
Besides these challenges, we also had to make sure that the things customers had already been using Alexa for worked seamlessly. Dropping in at home, checking in on your grandma and how she's doing, using your smart home routines that you've set up and use every day from the morning. All of these had to work seamlessly. Let's take a look at what we had to do to get Alexa to where it is with Alexa Plus.
We can just talk now. I'm all ears, figuratively speaking. Do you know how to manage my kids' schedules? I noticed a birthday party conflicts with picking up grandma at the airport. Want me to book her a ride? Billie Eilish is in town soon. I can share when tickets are available in your city. Yes, please. Got any spring break ideas? Somewhere not too far, only if there's a beach and nice weather. Santa Barbara is great for everyone. I found a restaurant downtown I think you'd like. What is Santa Barbara known for? It has great upscale shops and oceanfront dining. Can you go whale watching? Absolutely. Want me to book a catamaran tour? What's the next step? Remove the nut holding the cartridge. Should I get bangs? You might only love them for a little while. You're probably right. Make a slideshow of baby tea nuts. Mom, what part am I looking for again? Two-inch washers. Your Uber is two minutes away. For real. Wait, did someone let the dog out today? I checked the cameras, and yes, in fact, Mozart was just out.
That's pretty awesome, isn't it? How many people have actually tried Alexa Plus here? That's smaller than I expected. We hope you all will try it pretty soon. Alexa Plus is a step change in every way that we can think of. We had to make it more conversational. When you talk to it, it must feel like you're talking to a human. You can talk about whether it's dark inside and it would understand what you actually meant. It had to be smarter. If you are planning a vacation, Alexa will find what the weather is there. It might recommend what you should wear there or if there are any weather advisories that you need to take care of. Alexa Plus is very personalized. The more you tell Alexa about the things in your household, the more you can work with it. The smarter it becomes about what you have in your household, the better it works for you. If you want to shop for Christmas gifts, Alexa will help brainstorm gifts with you and ship them to you as well.
But the most important thing that we had to do with Alexa Plus was get things done. Yes, you can plan and brainstorm that vacation all you want, but you also had to book that vacation. You had to book experiences in that vacation. This is quite often where a lot of other systems out there stumble. Once you actually try to make reservations, set up experiences, or book things or do things in the real world, things start to fall apart. You saw an example in the video about asking if someone let the dog out. Let's talk about a real example in this case. We have a dog named Daisy. She's a cream golden retriever. When we go to work, we have someone drop by to feed her. I wanted to make sure that she's fed every day.
So I have a Ring cam pointing at the area around her food bowl. All I had to do was ask Alexa to let me know if Daisy is not fed by noon every day. That's it. I get a notification on any day that the person or people we've asked to help us out doesn't come in by noon.
Think about the things that have to go into making that happen. Alexa first has to understand what I said—just recognizing the speech itself. It then has to understand the context of my ask. It has to recognize that we have a golden retriever, which means it has to bring up the personal context. It then has to figure out what I meant by letting me know if she doesn't eat or if she's not fed, which means it's going to look for Ring cameras that I have.
So the devices and the context of what I have in the household also need to be ready for the LLM to respond. It then has to take action: it has to look for a cream-colored golden retriever and check whether she has eaten at any time before noon, and it has to do that every day. If it does not see a dog eating by that time, it has to send out a notification to me. These are hard things for LLMs to do these days.
To do this, we had to push the boundaries of accuracy on LLMs. We had to push the boundaries of latency. Remember, all of that is done in the background, yet Alexa still has to come back and acknowledge me in a couple of seconds, max. Otherwise you lose context. And we had to dig into how to best use the right model at the right time. To talk through these challenges and how we worked through them, I want to bring Luu Tran on stage.
Four Critical Challenges in Building Generative AI Applications
Good morning, everyone. My name is Luu Tran. I'm one of the engineers who worked on Alexa+. I'm going to talk to you about four of the challenges that we came across—just four. I guarantee you there were a lot more. But I think these four are going to be the most relevant to all of you because anyone building a generative AI application on top of LLMs these days is going to run into these issues.
First is accuracy. This is about getting the LLM to do what we want it to do. Second is around latency, and that's about the user-perceived latency. There are lots of elements of slowness in the system, but as anyone who's worked on LLMs will know, the big hitter in terms of time-consuming processing is the inference cycle of the LLM at runtime handling customer requests. Third is determinism. Obviously, an LLM is inherently non-deterministic. So when you have a use case like Alexa where customers expect it to do the right thing, how do you make sure it's consistent and reliable while still maintaining the creativity that we all have come to know and love about LLMs?
Finally, I'll talk about model flexibility. This is about making sure that we're picking the right model for the job. Starting with accuracy, as we all know, LLMs are great at understanding natural language, listening to customer requests, and figuring out what the intent is behind that request. If you're building a chatbot, that might be good enough. But when you're taking real-world actions like Alexa is—turning on lights, playing music, booking tickets—it's really important to get accuracy as high as possible.
The stakes are much higher than just a conversational chatbot when you're taking real-world actions. In a system as complex as Alexa is with lots of steps along the way, the errors compound. So you really need to drive up accuracy at every step of the way, especially in the LLM inference cycle.
Accuracy Challenge: Teaching LLMs to Route, Plan, and Execute Correctly
The first task that we're asking the LLM to do is routing. Let's take the example mentioned earlier: "Let me know when Daisy doesn't get fed by noon." In Alexa, we have integrations with tools and software we call experts, or possibly other agents, that can carry out real-world actions. The first thing we ask the LLM to do is determine which one of the universe of tools, experts, and agents can handle this particular request.
For the "Let me know when..." example, that might be a notification expert, a reminder expert, or a calendar expert. It's up to the LLM to figure out which one of these experts to call on to get the task done. I'd argue that this is probably the most difficult and important step, because if you get it wrong, it's very hard to recover downstream. The next thing we ask the LLM to do is, given that expert, determine which API should be invoked from the set of APIs that expert offers to the system.
For example, to create a reminder, you need to know what API name should be invoked and what parameters need to be provided to that API. For a reminder, you need to know when to fire it off, what conditions go into it, how often it recurs, the target of the notification, and the message of the notification. All those values need to be retrieved at runtime by the system, as orchestrated by the LLM, to call those APIs. All of this represents complex planning that we're asking the LLM to do.
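To make that planning step concrete, here's a minimal sketch of how an expert registry and a routing prompt might look. The expert names, API names, and parameters below are hypothetical illustrations rather than Alexa's actual integrations, and `call_llm` stands in for whatever inference endpoint you use.

```python
import json

# Hypothetical expert registry; the names, APIs, and parameters are illustrative only.
EXPERTS = {
    "reminders": {
        "create_reminder": ["trigger_time", "recurrence", "message"],
    },
    "notifications": {
        "create_conditional_notification": [
            "condition", "deadline", "recurrence", "target", "message",
        ],
    },
}

def build_routing_prompt(utterance: str) -> str:
    # Step 1 of the plan: ask the model to pick the one expert that can handle
    # the request. Step 2 (a separate inference cycle) would ask it to pick the
    # API and fill in the parameter values.
    return (
        "Choose the single expert best suited to handle this request.\n"
        f"Experts and their APIs: {json.dumps(EXPERTS)}\n"
        f"Request: {utterance}\n"
        "Answer with the expert name only."
    )

def route(utterance: str, call_llm) -> str:
    # call_llm is a stand-in for whatever model endpoint you are using.
    expert = call_llm(build_routing_prompt(utterance)).strip()
    if expert not in EXPERTS:
        raise ValueError(f"Model picked an unknown expert: {expert}")
    return expert
```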
Each selection along the way—from which expert to which API to which parameters to which values—represents inference cycles that can add to or reduce accuracy if you get it wrong. It's not an easy process to go through. You really have to try what works and figure out what doesn't, especially with different models. What we learned was that providing examples, or exemplars, of how to invoke these APIs was useful in the beginning, especially a number of months ago when the state-of-the-art LLMs weren't as capable as they are today.
We learned that giving an utterance and an example API call helped the model understand the use of those APIs and experts. As we were running into bugs, we kept adding to those examples, and it actually counterintuitively reduced accuracy. It made things worse. Why is that? Well, it turns out, as we all know now but was new to us at the time, working with LLMs means you can overload the context and the prompts that go into the LLMs. You can give it too much information, and it ends up overfitting or acting in a way that's too specific to a particular use case.
Like humans, LLMs have limited attention. If you overload them with too much information, especially irrelevant information, they can get forgetful and do the wrong thing. And if you have conflicting information, some bug fixes might call for examples that contradict the examples added for other bugs. It's a really tricky situation to get into. We ended up having to take out a lot of those examples and exemplars to improve accuracy. It was a lot like squeezing a balloon. We would fix a problem in one area. Take smart home use cases: when you turn on the lights, it's important for the LLM to know which lights are in the household.
So when the customer says turn on the lights, which light are they talking about? But if the customer says play some music, then providing that context about what lights are in the household is irrelevant and actually reduces accuracy for the music playback use case. We found ourselves solving problems in one area and then causing problems in another area.
Refactoring the APIs helped by making their purpose more obvious, so that we wouldn't have to provide examples to the language model. It would just figure it out. What we also learned was that it was actually harmful to be too obvious. If you had an API called create reminder, you don't need an instruction in the prompt that says "use this API to create a reminder." It's too obvious, and it's actually harmful in terms of accuracy for the reasons I mentioned before: overfitting and forgetfulness, or prompt overload.
I remember when we first got it working more often than not, we were so excited. Yes, finally. This was a complete rearchitecture of the Alexa system, an architecture that had been around and served customers well on 600 million devices for 10 years. We had to redo all of it to get the language model integrated. Little did we know that we had a lot more work ahead of us than behind us.
Latency Challenge: From Parallelization to Token Optimization
Our next big challenge was around latency. We had accuracy where we wanted it to be, but then everything was so slow. Anyone working with language models will know that if you're using a chatbot and you're typing, it's okay to have the responses get typed out and you can do some latency masking techniques like the thinking in the chatbot. But if you're Alexa, customers expect the lights to come on practically instantaneously.
We tackled this using traditional latency reduction techniques like parallelization, streaming, and prefetching. They worked to a certain extent. Parallelization is taking APIs that are largely independent and calling them at the same time, not waiting for one to finish before you call another. Streaming is about getting started as soon as possible, changing a finish-to-start dependency into a start-to-start dependency.
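As a rough illustration of the parallelization idea, here's a minimal sketch that issues independent backend lookups concurrently instead of sequentially; the fetcher functions are hypothetical stand-ins for real service calls.

```python
import asyncio

async def gather_independent_context(fetch_device_info, fetch_user_profile, fetch_smart_home_state):
    # None of these lookups depend on each other, so fire them all at once
    # rather than waiting for one to finish before starting the next.
    device, profile, smart_home = await asyncio.gather(
        fetch_device_info(),
        fetch_user_profile(),
        fetch_smart_home_state(),
    )
    return {"device": device, "profile": profile, "smart_home": smart_home}
```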
In an utterance like "Let me know when...", Alexa doesn't have to wait until it hears the rest of the utterance and request. It can just get started picking an expert that can handle use cases that involve "when," like a reminder expert. Prefetching is about loading in the right context. In most cases, when you just say "Alexa" as the wake word, Alexa is already getting started before you say even another single word, because it can gather information about the device you're talking to and the time zone that device is configured in, in case you're going to ask for the weather, for example.
It can look up your account for personalization use cases. All those sorts of things happen in real time and are prefetched before they're needed, so that they're ready to go when the rest of the utterance finishes. But we quickly exhausted all of the traditional techniques. Language models were new to us at the time, and what we learned was that there's a big difference between input and output tokens. Tokens are the numerical representation of the words you say to a language model, or the inputs, the images, or whatever it is that you give to the language model.
Can anyone guess which takes longer: processing input tokens or generating output tokens? How many think it's input tokens? About 25% of you. How many of you think output token generation is more expensive? So more of you, about three-quarters. It turned out that with at least the models we were using, and I think this is universal, output token generation is literally orders of magnitude—not one order, but multiple orders of magnitude—more expensive in time than processing input tokens.
While we were really careful with being efficient on the input token side to avoid overfitting and too much context, we were meticulous about output tokens. How many of you are familiar with the chain of thought technique? That's where you say in your prompt to the language model, "think out loud," and it will think out loud. In its output, it'll start saying things like, "I think I understand what the customer wants. I think I need to call this tool." So if the customer says, "remind me when Daisy doesn't get fed by noon," the model needs to get the APIs for the reminder expert and so on.
That's great, but it's generating output tokens the whole time. It's great because in some cases, and there are papers written about this, chain of thought can help with accuracy. If the model is thinking out loud, it has a higher likelihood of doing the right thing and predicting the next right tokens to take the right actions. Certainly, it's great for debugging purposes because if you're developing and you don't know what's going on inside the language model, you can turn on chain of thought reasoning and it'll output exactly what it's thinking.
However, there are orders of magnitude latency impacts on a running system. You don't want to keep this on in a production system. It's like turning on trace-level logging in your services and leaving it on in production and flushing to disk on every request. It's just not a good idea. It's great for troubleshooting and development and figuring out what your system is doing, but you don't want to leave it on in production.
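One way to treat chain of thought like trace-level logging is to make it a development-time switch. Here is a minimal sketch, assuming the behavior is controlled purely through the prompt:

```python
def build_instructions(base_instructions: str, debug: bool = False) -> str:
    if debug:
        # Development / troubleshooting: let the model think out loud so you can
        # inspect how it chose an expert and an API.
        return base_instructions + "\nThink step by step and show your reasoning before answering."
    # Production: every reasoning token is an output token, and output tokens are
    # the expensive part, so ask for the final answer only.
    return base_instructions + "\nRespond with the final tool call only. Do not explain your reasoning."
```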
The other thing we had to do with token processing was on the input side. Language model inference is one of the most expensive steps in terms of latency in the architecture. We needed to take advantage of past processing of past utterances that are very similar. How many times do customers say, "Alexa, stop"? We don't want to have to redo that same exact inference. In any language model application, a lot of the prompt is going to be the same and doesn't change from utterance to utterance. It includes your identity, what you can expect customers to do, the tools and experts and agents you have available to get the job done, and all those sorts of instructions that don't change from utterance to utterance.
We wanted to take advantage of past processing, and caching is a great way to do that. Back when we were getting started with this, prompt caching wasn't available yet. Today, we kind of take it for granted. It's amazing how fast this field is moving that something like prompt caching is just out of the box now with pretty much all the models. But when we were working on Alexa+, it wasn't available, so we had to invent it working with AWS and our model providers.
When you're caching in this way, ordering matters, because the cache captures the internal state of the language model, and that state depends on the exact sequence of input tokens it has processed. You have to make sure that the stuff at the beginning is the most stable and push all the stuff that's changing toward the end. There's another problem with that, which I'll cover in a minute. But that was some of the invention that we had to go through early on—invention that is now ubiquitous. It was really important to get latency down.
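Here's a minimal sketch of that ordering principle: keep the parts of the prompt that never change at the front so a prompt cache can reuse the shared prefix, and append the per-request pieces at the end. The instructions and context strings are placeholders, not Alexa's actual prompts.

```python
# Stable for every request: identity, instructions, and tool/expert definitions.
# Keeping this block identical and at the front makes it cache-friendly.
STATIC_PREFIX = (
    "You are a voice assistant.\n"
    "Available experts: reminders, notifications, smart_home, music.\n"
    "Always choose exactly one expert per request.\n"
)

def assemble_prompt(personal_context: str, utterance: str) -> str:
    # Volatile content (this customer's context, this utterance) goes last, so a
    # change here doesn't invalidate the cached processing of the stable prefix.
    return f"{STATIC_PREFIX}\nCustomer context: {personal_context}\nCustomer said: {utterance}\n"
```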
And then the prompts themselves. We went through a lot of effort to improve and optimize the input prompts, including techniques like minification and instruction tuning. Minification is where you take a long string of words or tokens in the input and reduce it somehow. Think of it like a compression algorithm. You have to do it in a way that doesn't affect the behavior of the language model.
So you're looking for elements—for example, take identifiers. They're naturally long because they have to be unique. But it doesn't really matter to the language model whether it's this particular value or that particular value. You can replace it on the way in and then restore it on the way out, so that the rest of your system operates on the right value. But as far as the language model is concerned, it doesn't matter what it is.
It actually helped with caching too, because you imagine you don't want "Alexa, stop" and then have that cache miss because you have an identifier in there that's different for every customer. You want that same inference cycle to give you the same result. So minification helps in that regard as well.
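A minimal sketch of the minification idea: swap long, per-customer identifiers for short placeholders on the way into the model and restore them on the way out. The identifier format here is made up for illustration.

```python
import re

def minify_ids(prompt: str):
    """Replace long, unique device IDs with short placeholders before inference."""
    mapping = {}

    def shorten(match):
        placeholder = f"dev{len(mapping) + 1}"
        mapping[placeholder] = match.group(0)
        return placeholder

    # Hypothetical ID format; real identifiers would have their own pattern.
    minified = re.sub(r"device-[0-9a-f]{32}", shorten, prompt)
    return minified, mapping

def restore_ids(model_output: str, mapping: dict) -> str:
    # Put the real identifiers back so the rest of the system operates on real values.
    for placeholder, real_id in mapping.items():
        model_output = model_output.replace(placeholder, real_id)
    return model_output
```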
You have to be careful with minification because when you do the analysis, you're looking at tokens. Not every model has the same tokenizer, and even different versions of models from the same vendor might change in their tokenization process. So you don't want to get too tightly coupled and dependent on how tokenization works in a model. But when you do use this and take advantage of it, it can really help reduce latency.
Instruction tuning is about looking at the instructions you're sending into the language model and in some cases using the language model to give you feedback on how you can make it clearer, use fewer words to describe the same thing, fewer examples—all things to reduce the number of input tokens going into the system and therefore reduce the amount of processing required and reduce latency.
Finally, there are a lot of model-related techniques that we went through to help reduce latency, like speculative execution. I mentioned parallelization, streaming, and prefetching as ways of reducing latency, but speculative execution with models takes a different angle. Early on, the models we were working with weren't as capable as state-of-the-art models today, and today's models won't be as capable as the state of the art tomorrow.
But you can use a high recall, potentially low accuracy model with fewer parameters. Again, this is just how the state of the world and technology is today, though that could change. But fewer parameters usually translates into faster inference execution time. So it's going to come back faster and have lower latency. You can get started on that result before you're certain and sure that it's a high accuracy result.
At the same time, you can fire off requests to a higher accuracy model that potentially has more parameters and therefore higher latency. Later, if the result is the same, great—you've gotten a head start on calling all those APIs of that expert that you've identified. And if they're different, no worries. Just throw it out and call the expert APIs that the higher accuracy model predicted. That way, the customer gets what they want. And in a lot of cases, the lower latency model was right because it was close enough.
That's pretty cool. But there's a nuance here because if you get started on something and it turns out to be wrong, you have to make sure that the APIs you're calling aren't harmful. If they're wrong, you don't want to turn on the wrong light or play music when someone is asking to turn on a light. You have to make sure that the APIs you're calling are idempotent or don't have unintended side effects, or in other ways are safe to call or can be undone if it turns out to be the wrong APIs.
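Putting that together, here's a minimal sketch of speculative execution with two models. `call_llm` and `execute_plan` are hypothetical stand-ins and the model names are placeholders; the key point is that the speculative work only touches calls that are safe to discard.

```python
import asyncio

async def speculative_route(utterance: str, call_llm, execute_plan):
    # Fire both models at once: a small, fast model and a larger, more accurate one.
    fast_task = asyncio.create_task(call_llm("small-fast-model", utterance))
    accurate_task = asyncio.create_task(call_llm("large-accurate-model", utterance))

    fast_plan = await fast_task
    # Get a head start, but only with idempotent / side-effect-free calls,
    # so the work can be thrown away if the accurate model disagrees.
    speculative = asyncio.create_task(execute_plan(fast_plan, safe_only=True))

    accurate_plan = await accurate_task
    if accurate_plan == fast_plan:
        return await speculative  # speculation paid off: lower perceived latency
    speculative.cancel()          # models disagreed: discard the speculative work
    return await execute_plan(accurate_plan, safe_only=False)
```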
By far the most impactful part of the process, we found, was reducing the number of times we went to the language model to ask it to run an inference cycle. API refactoring was super important, because you could take a sequence of fine-grained APIs and combine them into a single coarse-grained API, or a small number of them, that could then be predicted in fewer inference cycles through the language model. Fine-tuning was about turning the foundation models we got from our vendors into specialized models that were a better fit for our use cases and could operate more quickly, using the traffic we expect from Alexa customers as training data.
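As a hypothetical before-and-after of that API refactoring, imagine the Daisy example requiring three fine-grained calls that the model must predict in sequence versus one coarse-grained call it can predict in a single inference cycle. None of these API names are Alexa's real APIs.

```python
# Before: three fine-grained APIs, each one predicted in its own inference cycle.
#   find_camera(area) -> create_schedule(deadline, recurrence) -> create_notification(message)

# After: one coarse-grained API that the model can predict in a single cycle.
def create_conditional_notification(condition: str, camera_id: str,
                                    deadline: str, recurrence: str, message: str) -> None:
    """Watch a camera for a condition and notify the customer if it hasn't
    happened by the deadline, repeating on the given schedule."""
    ...  # implementation lives in the expert, not in the model
```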
Determinism Challenge: Balancing Reliability with Creativity
Balancing between accuracy and latency was a very methodical, data-driven process. Our traditional systems before language models were really good at doing things reliably and consistently. But when you throw in a non-deterministic, stochastic, or statistically driven language model into the mix, they're great at being creative and engaging in conversation in a chatbot. However, when you absolutely positively have to get it right—you're booking tickets, turning on lights—it's not okay to just play the right music on the right speaker most of the time for our customers. That's just not acceptable. It's got to work all the time.
Balancing between accuracy and latency, we're tuning the system to be more and more efficient, and unfortunately, more and more robotic. It lost some of that character and personality that we've come to know and love in the chatbots and language model-driven AIs that we're now used to. So we had to actually tune things back and dial things back, reducing a little bit of the determinism so that it could reinject some of its creativity and be less robotic.
If you say "Alexa, turn on the lights," just do it all the time. If you say "Alexa, I'm bored," sometimes she might offer to play some music, and sometimes she'll say, "Hey, you want to pick up that conversation about the fun travel destinations that we talked about earlier?" It's hard to build a system that's both deterministic and consistent and reliable for the use cases that are more like using a tool, like a light switch, and also give it the personality, character, and creativity that we can come to expect from language models, which are really good at that. Balancing that is incredibly challenging.
Of course, we all know now that parametric answers—answers that come from the model itself based on its training—are only as good as the data that was available at the time it was trained. Current events or updated datasets like personalization and knowledge bases aren't going to be available in the parametric answers that the model is giving. So for that, we use the standard techniques of grounding and retrieval-augmented generation, or RAG.
As is well known, the challenge here is how you balance just giving the answer and embellishing it a little bit. For example, Alexa might troll the Yankees in an answer if you're a Red Sox fan, or if you say "I'm bored," it might tell the classic dad joke: "Hi, Bored. I'm Alexa."
What we learned along the way was that this subset of prompt engineering, which we call context engineering, was super important to get this balance right. I mentioned the difference between smart home context, like what light switches you have, and music context. It's not okay for the model to hallucinate and make up that you have 67 speakers in the household when you really only have 5, and then not play the music because it's thinking it's playing on a non-existent speaker. That's not acceptable.
You have to be really careful to make sure that the model has the context it needs to make the right decisions. Elements like past conversations provide continuity between utterances and interactions with Alexa and the customer. It was really an iterative process to decide what to include and what to exclude from this context. Summarization helps, of course, but there's a lot more that goes into making sure there's the right context without negatively impacting latency and accuracy.
We learned, as many have learned these days, that models exhibit this sort of recency bias, where instructions towards the end of the prompt are given a little bit more weight, just like humans. I tend to remember things I'm just told more than things I'm told much earlier. It turns out models do that too. Ordering in the context and in the prompt matters, not just for caching as I mentioned earlier, but also for getting the accuracy right and balancing the behavior of the language models so that we're able to give a use case or handle an experience that's both deterministic and creative.
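A minimal sketch of that context engineering, with relevance filtering and ordering that puts the most important pieces last. The relevance checks are toy keyword tests standing in for whatever retrieval or classification you'd actually use.

```python
def mentions_lights(utterance: str) -> bool:
    # Toy relevance check; in practice this would be retrieval or a trained classifier.
    return "light" in utterance.lower()

def mentions_music(utterance: str) -> bool:
    return any(word in utterance.lower() for word in ("music", "play", "song"))

def assemble_context(utterance: str, lights: list, speakers: list, history_summary: str) -> str:
    sections = []
    # Only include context that is relevant to this request; irrelevant context
    # (your light switches, when you asked for music) hurts accuracy.
    if mentions_lights(utterance):
        sections.append(f"Lights in the household: {lights}")
    if mentions_music(utterance):
        sections.append(f"Speakers in the household: {speakers}")
    sections.append(f"Conversation so far (summarized): {history_summary}")
    # Recency bias: put the instructions and the request itself at the end,
    # where the model weights them most heavily.
    sections.append(f"Customer request: {utterance}")
    return "\n".join(sections)
```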
Safety is non-negotiable for us. It always has to be a safe experience, and the guardrails that we use are truly indispensable. We took a belts and suspenders approach. We didn't trust that everything that went into the model was safe and everything that came out of the model was safe. So we had guardrails all over the place. We would prompt the model to do things safely, but in case it didn't, we would have other guardrails in place to take care of that.
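A belts-and-suspenders sketch of that layering: prompt the model to behave safely, but also screen what goes in and what comes out. The checker functions are stand-ins for whatever guardrail service or classifiers you rely on.

```python
def answer_safely(utterance: str, call_llm, input_is_safe, output_is_safe) -> str:
    # Guardrail on the way in: don't assume everything reaching the model is safe.
    if not input_is_safe(utterance):
        return "Sorry, I can't help with that."
    response = call_llm(utterance)
    # Guardrail on the way out: don't assume everything the model produces is safe,
    # even if the prompt asked it to be.
    if not output_is_safe(response):
        return "Sorry, I can't help with that."
    return response
```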
Model Flexibility: Building a Multi-Model Architecture for Diverse Use Cases
All of this couldn't have been possible without a multi-model architecture. It was absolutely essential and an early decision that turned out to be the right one. You can't expect a single model to handle all the use cases, especially in a system as diverse in its customer base and experiences as Alexa offers. As we were balancing accuracy and latency and making trade-offs with personality and creativity, we found that there were a lot of other dimensions we had to consider, like capacity and GPU cost and many others.
This multi-model architecture was built into the system early on, partly out of necessity because the early models we were working with weren't as capable. So it was really convenient to be able to swap out and use other models. But what really helped us was discovering that we don't have to turn this off. We don't have to do this only in development time.
We can leave this on in production. And that meant that we didn't need to go looking for a one size fits all model. We could find the right model for the right job, for the right use cases. Working with AWS was amazing because the Bedrock service made it super easy for us to just swap out the underlying model on the back end at runtime whenever we needed to.
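As a rough sketch of what that runtime swap can look like with the Bedrock Converse API, a small lookup table maps each use case to a right-sized model; the model IDs below are placeholders you'd replace with real Bedrock model identifiers.

```python
import boto3

# Placeholder model IDs, to be replaced with real Bedrock model identifiers.
MODEL_FOR_USE_CASE = {
    "simple_command": "placeholder-small-fast-model",      # e.g. "Alexa, stop"
    "tool_planning": "placeholder-mid-size-model",
    "open_conversation": "placeholder-large-creative-model",
}

def invoke_right_sized_model(use_case: str, utterance: str) -> str:
    client = boto3.client("bedrock-runtime")
    # Swapping models is just a matter of changing modelId; the calling code
    # doesn't need to know which vendor or size sits behind it.
    response = client.converse(
        modelId=MODEL_FOR_USE_CASE[use_case],
        messages=[{"role": "user", "content": [{"text": utterance}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```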
Not every challenge or use case is a nail that requires an LLM hammer. You don't need an LLM to handle the "Alexa, stop" use case. I mean, you could, but it's kind of overkill. There's a feature in Alexa Plus where you can email in a PDF of, say, your kid's school schedule and then later ask Alexa about it. When is my son's homework due again? When is that exam? And Alexa will answer, given the information in that PDF.
Sure, the LLM could handle this use case as well. But that would mean you'd have to feed that PDF in as part of the input prompt. That's a lot of tokens. And then you have to figure out when it's relevant and whether the customer is even asking about it. If it's not relevant, you've added extra latency by giving that context to the LLM. In our case, this is one of those use cases where a purpose-built, bespoke, non-LLM traditional ML model can do the job just fine.
We train lots of models in this way in the Alexa system. Not only do we have multiple LLMs to draw on, but we have multiple non-LLM ML models that we use in the system to handle all of the use cases. Working with AWS and SageMaker makes it really easy and convenient for us to build up all of these different bespoke models. But if you have all these models, now you have a new challenge: how do you choose the right one?
One approach is, just like speculative execution, you could use all of them all the time and then pick one answer, maybe the fastest, maybe the most accurate, or maybe the cheapest. But calling all of these models in parallel might help with latency, but it also costs you in capacity and runtime GPUs. We found that you have to use a combination of techniques with multiple models. As we were going through this, it was really like peeling the layers of an onion. Fixing one problem only uncovered yet another set of issues and challenges that we had to overcome.
I've talked about four of them: accuracy, latency, determinism, and model flexibility. Some we anticipated, and some we didn't. But I hope sharing this with all of you, you can decide for yourself what you need to anticipate as you're building your use cases. Back to you, Brittany.
Key Takeaways: Model Flexibility, Iterative Experimentation, and Step Progression
Thanks, Luu. So what are the key takeaways that we want you all to walk away with and take back to your own projects? The first is model flexibility. Luu talked a lot about how early design decisions for our models afforded us flexibility later. We think this is a really important layer of how we built Alexa Plus, because what we found is that one model doesn't fit all, and right-sizing models for specific use cases actually delivers better outcomes than adhering to just one and hoping it covers everything.
As you look across that continuum, optimizing for accuracy, speed, and cost depends on the need and the use case that you're trying to deliver on. Second, and this is critical, iterative experimentation is essential. Traditional optimization wasn't enough, and you heard Luu talk a lot about how we started with traditional techniques like parallelization and streaming, but they didn't go the distance on their own. We had to bring in new techniques like prompt caching, API refactoring, and speculative execution, and it was the blend of those techniques that allowed us to create something magical for our customers with generative AI.
What works in theory doesn't always work in production, and it especially doesn't always work in production at scale. Building experimentation into your process and your mental model is critical. Lastly, we took a step progression. We wanted it first to be right, then to be fast, and then to be reliable. In your own world, you'll have to make these trade-offs. It's a balance, and how you make them will shape the outcome you're looking to deliver.
Most importantly, Daisy the dog did get fed, but it's almost noon, so we're going to see what happens today. If you want to experience Alexa Plus live, come to the One Amazon Lane activation at Caesars Forum. That's where you can discover many of the innovations powered by AWS across Amazon's different lines of business. Thank you so much, and please fill out the survey in your mobile app.
; This article is entirely auto-generated using Amazon Bedrock.