🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Building and managing conversational AI at scale: lessons from Alexa+ (AMZ305)
In this video, Amazon engineers detail the transformation of Alexa into Alexa+, a generative AI-powered assistant serving 600 million devices. The team discusses four critical challenges: accuracy in routing requests and API selection, latency reduction through techniques like prompt caching and speculative execution, balancing determinism with conversational creativity, and implementing multi-model architecture. Key innovations include minification, instruction tuning, and context engineering to optimize token processing. The presentation demonstrates real-world applications like monitoring pets through Ring cameras and emphasizes that traditional optimization techniques alone weren't sufficient—requiring novel approaches like API refactoring and model flexibility to maintain reliability while adding natural conversation capabilities.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Transforming Alexa for 600 Million Customers Without Breaking Existing Integrations
Good morning everyone. I'm Brittany Hurst, and I lead the global AWS relationship with Amazon Devices and Services. Joining me today are two of my customers, Luu Tran and Sai Rupanagudi, two people who were instrumental in the rearchitecting and rebuilding of Alexa, the voice assistant we all know and love, and transforming it into the generative AI powered Alexa+. Over the next 45 minutes, we're going to take you behind the scenes of one of the most challenging engineering problems that we've ever tackled. How do you evolve a voice assistant serving 600 million plus customers from scripted commands to natural conversation without breaking a single existing integration?
This isn't a story about building something from scratch. This is about transforming a massive production system that customers depend on every day, while maintaining the reliability that they expect and adding the conversational capabilities that they demand in this new world. Here's what we'll cover today. Sai is going to walk us through Alexa's evolution, a number of the challenges that we faced, and the customer learnings we gained from having this technology available to customers for the last decade. Then Luu is going to outline our design considerations and the techniques that we used to rearchitect it with generative AI and large language models behind the scenes. You're going to walk away with battle-tested lessons learned and, hopefully, things that you can use in your own projects.
Alexa's Journey: From 13 Skills in 2014 to 600 Million Devices and the Challenge of 'Alexa Speak'
Thank you, Brittany. How are you all doing today? Coffee still kicking in, it sounds like. How many people remember the first Alexa device? We're going to talk a little bit about the background of where we started, and I'm Sai Rupanagudi. I lead the product team for Alexa AI. We first launched Alexa in 2014. Alexa users? All right. At that time, we had about 13 skills, about a handful of things that Alexa could do, all from one developer, Amazon, and that was only in the US when we started.
But these were things that were delightful for customers. You wanted to play music while cleaning the house? Just ask Alexa; you didn't need to free up your hands. You wanted to convert ounces to grams while you were baking and making dough? You didn't need to clean your hands. And if you were feeling a little lazy about getting off the couch to turn on the lights, just ask Alexa. Now, all kidding aside, that last bit has been invaluable for our customers living with disabilities. So, customers loved what they were seeing.
This, however, and remember it's more than a decade ago at this point, came with significant technical challenges at the time. You had to pick up voices from across the room. If anybody remembers that Pringles-can design, that blue light ring would follow you around. That was to make sure that we were getting the right signals from the right places. Once you actually had that, you had to understand the customer's intent. Then we actually had to go get that information for the customer and execute the actions. Now, all of this is through voice, so it needed to happen within a second or two. Otherwise, customers aren't going to wait for answers. No awkward pauses allowed, and that's a difficult thing to deal with.
Fast forward to today: Alexa has over 600 million customers, or devices I should say. And we work with developers across the world. Alexa is the assistant with the most products and services, connected to a billion-plus devices. But customers still had to work with a machine. They felt like they were talking to a machine, if you will. "Alexa speak," yes, that's what I called it. You had to phrase things a certain way for Alexa to respond properly, much like you had to do with search at the time. Generative AI promised to break through that barrier for us, and how we took Alexa from what it was to where we wanted it to be is the story that we're going to talk about.
Alexa+: Making Voice Assistance More Conversational, Smarter, and Action-Oriented
With 600 million devices though, we had many challenges that we had to overcome.
We wanted to use Generative AI for things that would take Alexa to the next step with Alexa+. What that meant was that it had to be more conversational. It had to be much smarter. When you come in here and say it looks dark in here, most humans would get up and try to find a light switch. If you tell that to an LLM right now, it'll go through its thoughts, telling you: I think this person thinks it's dark in here. What do I do? Maybe we should look for lights around here. And then it finally gets to the lights and tries to turn one on, only to find out it's hallucinating a light that doesn't exist in your home. We can't have that.
Besides these challenges, we also had to make sure that the things that customers had already been using Alexa for, dropping in at home, checking in on your grandma and how she's doing, using your smart home routines that you've set up and use every day from the morning, all of these had to work seamlessly. Let's take a look at what we had to do to get Alexa to where it is with Alexa+.
Wow wow, look at my style. I know you ain't seen it like this in a while. Oh hey there, so we can just like talk now. I'm all ears figuratively speaking. Oh, do you know how to manage my kids' schedules? I noticed a birthday party conflicts with picking up grandma at the airport. Want me to book her a ride? Billie Eilish is in town soon. No way. I can share when tickets are available in your city. Yes, please. Got any spring break ideas? Somewhere not too far, only if there's a beach and nice weather. Santa Barbara is great for everyone. I found a restaurant downtown I think you'd like. What is Santa Barbara known for? It has great upscale shops and oceanfront dining. Can you go whale watching? Absolutely. Want me to book a catamaran tour? What's the next step? Remove the nut holding the cartridge. Should I get bangs? You might only love them for a little while. You're probably right. Make a slideshow of baby tea nuts. Mom, what part am I looking for again? 2-inch washers. Your Uber is 2 minutes away. For real. Wait, did someone let the dog out today? I checked the cameras, and yes, in fact, Mozart was just out. Wow.
Pretty awesome, isn't it? How many people have actually tried Alexa+ here? Well, that's smaller than I expected. We hope you all will try it pretty soon. Alexa+ is a step change in every way that we can think of. We had to make it more conversational. When you talk to it, it must feel like you're talking to a human. You can talk about whether it's dark inside, and it will understand what you actually meant.
It had to be smarter. If you are planning a vacation, Alexa will go find out what the weather is there. It might recommend what you should wear, or flag any weather advisories that you need to take care of. And Alexa+ is very personalized. The more you tell Alexa about the things in your household, the more you can work with it. The smarter it becomes about what you have in your household, the better it works for you. If you want to shop for Christmas gifts, Alexa will help brainstorm gifts with you and ship them to you as well.
But the most important thing that we think we had to do with Alexa+ was just get things done. Yes, you can plan and brainstorm that vacation all you want, but you also have to book that vacation. You have to book experiences on that vacation. This is quite often where a lot of what you see out there stumbles. Once you actually try to make reservations, set up experiences, book things, or do things in the real world, things start to fall apart.
The Daisy Example: Understanding Complex Context and Taking Real-World Actions
You saw an example in the video about asking if someone let the dog out. Let's talk about a real example in this case. We have a dog named Daisy. She's a cream golden retriever. And when we go to work, we have someone drop by to feed her.
I wanted to make sure that she's fed every day, so I have a Ring cam pointing in the vicinity of the food bowl. All I had to do was ask Alexa to let me know if Daisy is not fed by noon every day. And that's it. I get a notification any day that the person or people we've asked to help us out don't come in by noon.
Think about the things that have to happen to make that work. Alexa first has to understand what I said, just recognizing the speech itself. It then has to understand the context of my ask. It has to recognize that we have a golden retriever, which means it has to bring up the personal context. It then has to figure out what I meant by "let me know if she doesn't eat," or "if she's not fed," which means it's going to go look for the Ring cameras that I have. So the devices, the context of what I have in the household, also need to be ready for the LLM to respond.
It then has to take action: watch for a cream golden retriever eating anytime before noon, and do that every day. And if it does not see the dog eating by that time, it has to send out a notification to me. These are hard things for LLMs to do, even today.
To do this, we had to push the boundaries of accuracy on LLMs, and we had to push the boundaries of latency. Remember, all of that work done in the background still has to come back and acknowledge me within a couple of seconds at most; otherwise you lose the context. To get there, we had to dig into how to best use the right model at the right time. And to talk through the challenges and how we worked through them, I want to bring Luu on stage.
Accuracy Challenge: Getting LLMs to Route Correctly and Invoke the Right APIs
Thank you, Sai. Good morning, everyone. My name is Luu Tran. I'm one of the engineers who worked on Alexa+. And I'm going to talk to you about four of the challenges that we came across, just four. I guarantee you there were a lot more. But I think these four are going to be the most relevant to all of you because anyone building a Gen AI application on top of LLMs these days, I think, is going to run into these issues.
First is accuracy. And this is about getting the LLM to do what we want it to do. Second is around latency. And that's about the user perceived latency. There are lots of elements of slowness in the system. But as anyone who's worked on LLMs will know, the big hitter in terms of time-consuming processing is the inference cycle of the LLM at runtime, handling customer requests.
And third is determinism. Obviously, an LLM is inherently non-deterministic. So when you have a use case like Alexa, where customers expect it to do the right thing, how do you make sure it's consistent and reliable, but still maintains the creativity that we all have come to know and love about LLMs? And then finally, I'll talk about model flexibility. And this is about making sure that we're picking the right model for the job.
So, starting with accuracy. As we all know, LLMs are great at understanding natural language, listening to customer requests and figuring out what the intent is behind that request. And if you're building a chatbot, that might be good enough. But when you're taking real-world actions like Alexa is, as I mentioned, turning on lights, playing music, booking tickets, it's really important to get accuracy as high as possible.
And the stakes are much higher than just a conversational chatbot when you're taking real-world actions. And in a system as complex as Alexa is with lots of steps along the way, the errors compound. And so you really need to drive up accuracy at every step of the way, especially in the LLM inference cycle.
The first task that we're asking the language model to do is routing. Let's take the example that I mentioned: "Let us know when Daisy doesn't get fed by noon." In Alexa, we have integrations with tools and software we call experts, or possibly other agents that can carry out the real-world actions. The first thing we ask the LLM to do is, of the universe of tools and experts and agents, which one can handle this particular request?
In the "let me know when" example, that might be a notification expert, a reminder expert, or it might be a calendar expert, but it's up to the language model to figure out which one of these experts to call on to get the task done. I'd argue that this is probably the most important and difficult step, because if you get this step wrong, it's super hard to recover downstream.
The next thing we ask the LLM to do is, given that expert, work with the set of APIs that the expert offers to the system. For example, to create a reminder, what API name should be invoked? What parameters need to be provided to that API? For a reminder, you need to know when to fire it off, what conditions go into it, what frequency, the target for the notification, and the message of the notification. All those values need to be retrieved at runtime by the system, as orchestrated by the LLM, to call those APIs. All of this represents complex planning that we're asking the LLM to do.
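As a rough sketch of these two inference steps, routing to an expert and then planning the API call, the flow might look something like the Python below. The expert names, the API list, and the call_llm helper are hypothetical placeholders, not Alexa's actual interfaces.

```python
# A minimal sketch of the two-step selection described above: first route the
# utterance to an expert, then ask the model to fill in a structured API call.
import json

EXPERTS = {
    "notification_expert": ["create_notification", "cancel_notification"],
    "reminder_expert": ["create_reminder", "list_reminders", "delete_reminder"],
    "calendar_expert": ["create_event", "query_events"],
}

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM inference call (e.g., via Amazon Bedrock)."""
    raise NotImplementedError

def route_to_expert(utterance: str) -> str:
    # Inference cycle 1: pick the expert. Getting this wrong is hard to
    # recover from downstream, so the candidate list is kept small and clear.
    prompt = (
        "Pick exactly one expert that can handle the request.\n"
        f"Experts: {list(EXPERTS)}\n"
        f"Request: {utterance}\n"
        "Answer with the expert name only."
    )
    return call_llm(prompt).strip()

def plan_api_call(utterance: str, expert: str) -> dict:
    # Inference cycle 2: given the chosen expert, select an API and its
    # parameters (trigger condition, schedule, notification target, message).
    prompt = (
        f"Expert: {expert}\nAPIs: {EXPERTS[expert]}\n"
        f"Request: {utterance}\n"
        'Return JSON like {"api": ..., "parameters": {...}}.'
    )
    return json.loads(call_llm(prompt))

# Example: utterance = "Let me know if Daisy is not fed by noon every day."
# expert = route_to_expert(utterance)   -> e.g., "reminder_expert"
# plan   = plan_api_call(utterance, expert)
```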
Each selection along the way, from which expert to which API to which parameters to which values, all represent inference cycles that can add to or reduce accuracy if you get it wrong. Finally, it's not an easy process to go through. You really have to try what works and figure out what doesn't, especially with different models.
What we learned was providing examples or exemplars of how to invoke these APIs was useful in the beginning. Again, this was quite a number of months ago at this point, where the state-of-the-art language models weren't as capable as they are today. We learned that giving an utterance and an example API call helped it to understand the use of those APIs and experts. As we were running into bugs, we kept adding to those examples, and it actually counterintuitively reduced accuracy. It made things worse.
Why is that? Well, it turns out, as we all know now but was new to us at the time working with LLMs, that you can overload the context and the prompts that go into them. You can give the model too much information, and it ends up overfitting, acting in a way that's too specific to a particular use case. Like humans, models have limited attention. If you overload them with too much information, especially irrelevant information, they can get forgetful and do the wrong thing. And you can certainly end up with conflicting information: examples added to fix one bug might contradict the examples added for another. It's a real tricky situation to get into.
We ended up having to take out a lot of those examples and exemplars to help improve accuracy. It was a lot like squeezing a balloon, where we would fix a problem in one area, like in smart home use cases. You turn on the lights, and it's important for the language model to know which lights are there in the household.
So which light is the customer talking about when they say turn on the lights? But if the customer says play some music, then providing that context about what lights are in the household is irrelevant and actually reduces accuracy for the music playback use case. And so we found ourselves solving problems in one area and then causing problems in another area.
Refactoring the APIs helped, making them more self-explanatory so that we wouldn't have to provide the examples to the language model; it would just figure it out. And what we also learned was that it's actually harmful to be too obvious in the prompt. If you have an API or expert called create reminder, you don't need an instruction in the prompt that says "use this API to create a reminder." It's too obvious, and it actually hurts accuracy for the reasons I mentioned before: overfitting and forgetfulness, or prompt overload.
I remember when we first got it working more often than not, we were so excited. It was like, yes, finally. I mean, this was a complete rearchitecture of the Alexa system, an architecture that had been around and served customers well on 600 million devices for 10 years, and we had to redo all of it to get the LLM integrated. Little did we know that we had a lot more work ahead of us than behind us.
Latency Challenge: Optimizing Token Processing and Reducing Inference Cycles
Our next big challenge was around latency. We had gotten accuracy to where we wanted it to be, but then everything was so slow. And again, anyone working with language models will know: if you're using a chatbot and you're typing, it's okay to have the responses get typed out, and you can use latency-masking techniques like the "thinking" indicator in the chatbot. But if you're Alexa, customers expect the lights to come on practically instantaneously.
And so we tackled this the way we knew how, using traditional latency-reduction techniques like parallelization, streaming, and prefetching. And they worked, to a certain extent. Parallelization is taking APIs that are largely independent and calling them at the same time, not waiting for one to finish before you call another. Streaming is about getting started as soon as possible, changing a finish-to-start dependency into a start-to-start dependency. So in an utterance like "let me know when...", Alexa doesn't have to wait until it hears the rest of the request. It can get started picking an expert that can handle use cases involving "when", like a reminder expert.
And prefetching is about loading in the right context. So in most cases, when you just say Alexa in the wake word, Alexa is already getting started before you say even another single word. Because it can gather information about the device you're talking to, the time zone that that device is configured in, in case you're going to ask for the weather, for example. It can look up your account for personalization use cases. All those sorts of things happen in real time and prefetch before they're needed so that they're ready to go when the rest of the utterance finishes.
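A minimal sketch of those traditional techniques, parallelization and prefetching, using asyncio is shown below; the fetch_* helpers are made-up stand-ins for the real context lookups (device settings, time zone, account profile), not Alexa's actual APIs.

```python
# Prefetch context as soon as the wake word is heard, and run independent
# lookups concurrently instead of serially.
import asyncio

async def fetch_device_settings(device_id: str) -> dict:
    await asyncio.sleep(0.05)          # simulate an independent backend call
    return {"device_id": device_id, "time_zone": "America/Los_Angeles"}

async def fetch_account_profile(customer_id: str) -> dict:
    await asyncio.sleep(0.08)          # simulate another independent call
    return {"customer_id": customer_id, "preferred_name": "Sai"}

async def on_wake_word(device_id: str, customer_id: str) -> dict:
    # Prefetch: kick off context gathering before the rest of the utterance
    # arrives. Parallelization: the calls are independent, so gather them
    # concurrently rather than waiting for one to finish before the next.
    settings, profile = await asyncio.gather(
        fetch_device_settings(device_id),
        fetch_account_profile(customer_id),
    )
    return {**settings, **profile}

if __name__ == "__main__":
    context = asyncio.run(on_wake_word("echo-kitchen", "customer-123"))
    print(context)   # ready before the full request finishes streaming in
```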
But we quickly exhausted all of the traditional techniques. LLMs were new to us at the time, and what we learned was that there's a big difference between input and output tokens. Tokens are the numerical representation of the words that you say to a language model or the inputs, the images, or whatever it is that you give to the language model.
Can anyone guess which takes longer: processing input tokens or generating output tokens? How many think it's input tokens that take longer? Okay, about 25% of you. How many of you think output token generation is more expensive? So more of you, about half to three-quarters. It turns out, at least with the models we were using, and I think this is universal, that output token generation is literally orders of magnitude, not one order but multiple orders of magnitude, more expensive in time than processing input tokens. So while we were really careful about being efficient on the input token side, as I mentioned, to avoid overfitting and too much context, we were even more meticulous about output tokens.
How many of you are familiar with the chain-of-thought technique? That's where, in your prompt to the language model, you say "think out loud," and it will think out loud. In its output, it'll start saying: I think I understand what the customer wants, I think I need to call this tool. The customer says, remind me when Daisy doesn't get fed by noon, so I need to get the APIs for the reminder expert, and so on. That's great, but it's generating output tokens this whole time. It's useful because in some cases, and there are papers written about this, thinking out loud makes the model more likely to do the right thing and predict the next right tokens to take the right actions. And it's certainly great for debugging, because if you're developing and you don't know what's going on inside the language model, you can just turn on chain-of-thought reasoning and it'll spit out exactly what it's thinking.
But again, orders of magnitude in latency impact on a running system. You don't want to keep this on in a production system. It's kind of like turning on trace-level logging in your services and leaving it on in production and flushing to disk on every request. It's just not a good idea. It's great for troubleshooting, it's great for development and figuring out what your system is doing, but you don't want to do that and leave it on in production.
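A small sketch of that idea, treating chain-of-thought as a debug-only switch rather than an always-on behavior, might look like this; the prompt wording and the generate() call are illustrative assumptions, not the production prompts.

```python
# Gate chain-of-thought (and the output tokens it generates) behind a debug
# flag, and keep the production output-token budget tight.
DEBUG_TRACE = False   # like trace-level logging: on for troubleshooting only

def build_prompt(utterance: str, debug: bool = DEBUG_TRACE) -> tuple[str, int]:
    base = (
        "Select the expert and API call for the request below.\n"
        f"Request: {utterance}\n"
    )
    if debug:
        # Thinking out loud can improve accuracy and is great for debugging,
        # but every reasoning token is expensive output at runtime.
        return base + "Think step by step, then give the final answer.", 1024
    # Production path: answer only, with a tight output-token budget.
    return base + "Respond with the final answer only, no explanation.", 64

prompt, max_output_tokens = build_prompt("Let me know if Daisy is not fed by noon.")
# generate(prompt, max_tokens=max_output_tokens)   # hypothetical model call
```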
The other thing we had to do with token processing was on the input side. Again, language model inference is one of the most latency-expensive steps of the architecture, and we needed to take advantage of past processing of utterances that are very similar. I mean, how many times do customers say, "Alexa, stop"? We don't want to have to redo that same exact inference. Or, just like in any language model application, a lot of the prompt is going to be the same, right? It doesn't change from utterance to utterance: here's your identity, here's what you can expect customers to do, here are the tools and experts and agents you have available to get the job done. All those sorts of instructions don't change from utterance to utterance. So we wanted to take advantage of past processing, and obviously caching is a great way to do that.
Again, back when we were getting started with this, it wasn't a thing; it wasn't available yet. Today we kind of take it for granted. It's just amazing how fast this field is moving that something like prompt caching now comes out of the box with pretty much all the models. But when we were working on Alexa+, it wasn't available, and so we had to invent it, working with AWS and our model providers.
It turns out that when you're caching in this way, ordering matters, because the internal state of the language model takes a different path depending on the input tokens it's processing. So you have to make sure that the stuff at the beginning is the most stable, and push all the stuff that's changing toward the end. There's another problem with that, which I'll cover in a minute.
But that was some of the invention that we had to go through early on, invention that now is ubiquitous. But it was really important to get latency down.
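Conceptually, the ordering rule looks something like the sketch below: keep the stable content (identity, instructions, tool catalog) at the front and push the volatile, per-request content to the end so the shared prefix stays identical across utterances. The hashing here just makes the shared prefix visible; real prompt caching happens inside the model service, not in application code.

```python
# Order the prompt for cacheability: stable prefix first, volatile tail last.
import hashlib

STABLE_PREFIX = "\n".join([
    "You are a voice assistant.",                          # identity
    "Available experts: reminder, notification, music.",   # tool catalog
    "Follow the safety policy at all times.",              # standing instructions
])

def assemble_prompt(per_request_context: str, utterance: str) -> str:
    # Stable content first, changing content last -- the cached prefix stays
    # identical from utterance to utterance.
    return f"{STABLE_PREFIX}\n{per_request_context}\nCustomer: {utterance}"

def cache_key(prompt: str, prefix_len: int = len(STABLE_PREFIX)) -> str:
    # Only the stable prefix is a useful cache key; the tail always differs.
    return hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()

p1 = assemble_prompt("Device: kitchen echo", "Alexa, stop")
p2 = assemble_prompt("Device: bedroom echo", "Turn on the lights")
assert cache_key(p1) == cache_key(p2)   # same prefix -> same cache entry
```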
And then there are the prompts themselves. We put a lot of effort into improving and optimizing the input prompts, including techniques like minification and instruction tuning. Minification is where you take a long string of words or tokens in the input and reduce it somehow. Think of it like a compression algorithm, and you have to do it in a way that doesn't affect the behavior of the language model. So you're looking for elements you can shrink. Take identifiers, for example. They're naturally long because they have to be unique, but it doesn't really matter to the language model whether it's this particular value or that particular value. So you can replace it on the way in and then restore it on the way out, so that the rest of your system operates on the right value. As far as the language model is concerned, it doesn't matter what it is.
It actually helped with caching too, because you can imagine you don't want "Alexa, stop" to be a cache miss just because there's an identifier in there that's different for every customer. You want that same inference cycle to give you the same result, so minification helps in that regard as well. You do have to be careful with minification, because when you do the analysis you're looking at tokens, and not every model has the same tokenizer. Even different versions of models from the same vendor might change their tokenization, so you don't want to get too tightly coupled to how tokenization works in one model. But when you do take advantage of it, it can really help reduce latency.
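A minimal sketch of the minification idea, swapping long identifiers for short placeholders on the way in and restoring them on the way out, could look like this; the identifier format and prompt are invented for illustration.

```python
# Replace long, unique identifiers with short aliases before the prompt goes
# to the model, then restore the real values in whatever the model produces.
import re

def minify(prompt: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    def replace(match: re.Match) -> str:
        alias = f"ID{len(mapping) + 1}"          # short, cache-friendly alias
        mapping[alias] = match.group(0)          # remember the real value
        return alias
    # Assume device identifiers look like "device-<hex>" for this sketch.
    minified = re.sub(r"device-[0-9a-f-]{8,}", replace, prompt)
    return minified, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    for alias, original in mapping.items():
        text = text.replace(alias, original)
    return text

prompt = "Turn on device-3f9c2a7e-11ad in the living room."
small, mapping = minify(prompt)
# small -> "Turn on ID1 in the living room."  (fewer tokens, same per customer)
model_output = "call smart_home.turn_on(ID1)"    # pretend model response
print(restore(model_output, mapping))            # real ID back for execution
```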
And then instruction tuning is about looking at the instructions you're sending into the language model and, in some cases, using the language model to give you feedback on how you can make it clearer, use fewer words to describe the same thing, fewer examples, all things to reduce the number of input tokens going into the system and therefore reduce the amount of processing required and reduce latency.
And then finally, there are a lot of model-related techniques that we went through to help reduce latency, like speculative execution. I mentioned parallelization, streaming, and prefetching as ways of reducing latency, but speculative execution with models starts from the observation that the models we were working with back then weren't as capable as state-of-the-art models today, and today's models won't be as capable as the state of the art tomorrow. You can use a high-recall, potentially lower-accuracy model with fewer parameters. Again, this is just how the state of the technology is today and could change, but fewer parameters usually translates into faster inference time, so it's going to come back faster with a lower-latency answer when you ask it to, say, route to a particular expert or pick an API.
You can get started on that result before you're certain that it's a high-accuracy result. At the same time, you can fire off a request to a higher-accuracy model that potentially has more parameters and therefore higher latency. Later, if the results are the same, great: you've gotten a head start on calling all those APIs of the expert that you've identified. And if they're different, no worries, just throw it out and call the expert and APIs that the higher-accuracy model predicted. That way, the customer gets what they want, and in a lot of cases the lower-latency model was right because it was close enough.
And you've actually reduced latency. That's pretty cool. But there's a nuance here because if you get started on something and it turns out to be wrong, you have to make sure that the APIs you're calling aren't harmful. If they're wrong, you don't want to turn on the wrong light or play music when someone is asking to turn on a light. So you have to make sure that the APIs you're calling are idempotent or don't have unintended side effects, or in other ways are safe to call or undo if it turns out to be the wrong APIs.
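A simplified sketch of that speculative pattern is shown below: act on the fast model's routing guess with side-effect-free work only, confirm with the slower model, and discard the speculative work on disagreement. The routers and the prefetch hook are hypothetical stand-ins.

```python
# Speculative execution across two models: speculate on the fast model,
# confirm with the accurate one, and only ever speculate on safe-to-undo work.
import asyncio

async def fast_router(utterance: str) -> str:
    await asyncio.sleep(0.05)   # small model: low latency, lower accuracy
    return "reminder_expert"

async def accurate_router(utterance: str) -> str:
    await asyncio.sleep(0.40)   # larger model: higher latency, higher accuracy
    return "reminder_expert"

async def begin_safe_prefetch(expert: str) -> None:
    # Only side-effect-free work here: load the expert's API schemas, warm
    # caches, open connections. Nothing the customer can observe yet.
    await asyncio.sleep(0.10)

async def handle(utterance: str) -> str:
    fast_task = asyncio.create_task(fast_router(utterance))
    accurate_task = asyncio.create_task(accurate_router(utterance))

    guess = await fast_task
    prefetch = asyncio.create_task(begin_safe_prefetch(guess))   # head start

    confirmed = await accurate_task
    if confirmed == guess:
        await prefetch            # the head start pays off
    else:
        prefetch.cancel()         # discard speculative work, no harm done
    return confirmed

print(asyncio.run(handle("Let me know if Daisy is not fed by noon.")))
```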
By far the most impactful part of the process was reducing the number of times we went to the language model and asked it to do an inference cycle. Here, API refactoring was super important, because you could take a sequence of fine-grained APIs and combine them into a single coarse-grained API, or a small number of them, that could then be predicted in fewer inference cycles through the language model. Fine-tuning was about turning the foundation models we got from our vendors into specialized models that were a better fit for our use cases and could operate more quickly, by training them on the kind of traffic we can expect from Alexa customers.
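As a rough before-and-after illustration of that kind of API refactoring (with invented function names and fields), collapsing several fine-grained calls into one coarse-grained call might look like this:

```python
# Collapse a sequence of fine-grained calls, each of which needed its own
# planning step, into one coarse-grained call the model can predict at once.
from dataclasses import dataclass

# Before: three fine-grained APIs, each a separate prediction the model had to
# get right (and a separate chance for errors to compound).
def create_trigger(condition: str) -> str: ...
def set_schedule(trigger_id: str, frequency: str) -> None: ...
def attach_notification(trigger_id: str, target: str, message: str) -> None: ...

# After: one coarse-grained API whose parameters the model fills in together.
@dataclass
class WatchRequest:
    condition: str       # "no dog seen eating at the food bowl"
    deadline: str        # "12:00 local time"
    frequency: str       # "daily"
    notify: str          # "owner's phone"
    message: str         # "Daisy hasn't been fed yet"

def create_watch(request: WatchRequest) -> str:
    """Single call covering trigger, schedule, and notification."""
    return "watch-1"     # stand-in for the real backend work

create_watch(WatchRequest(
    condition="no dog seen eating at the food bowl",
    deadline="12:00 local time",
    frequency="daily",
    notify="owner's phone",
    message="Daisy hasn't been fed yet",
))
```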
Determinism Challenge: Balancing Reliability with Creativity and Ensuring Safety
Balancing accuracy and latency was a very methodical, data-driven process. Our traditional systems before LLMs were really good at doing things reliably and consistently. But then you throw a non-deterministic, stochastic, statistically driven LLM into the mix. LLMs are great at being creative and engaging in conversation in a chatbot. But when you absolutely, positively have to get it right, booking tickets, turning on lights, it's not okay to play the right music on the right speaker only most of the time. That's just not acceptable to our customers. It's got to work all the time.
As we balanced accuracy and latency, we were tuning the system to be more and more efficient and, unfortunately, more and more robotic. It lost some of that character, that personality that we've come to know and love in the chatbots and LLM-driven AIs we're now used to. So we had to dial things back and reduce the determinism a little so that it could re-inject some of its creativity and be less robotic.
"Turn on the lights": just do it, all the time. If you say, "Alexa, I'm bored," sometimes she might offer to play some music. Sometimes she'll say, "Hey, do you want to pick up that conversation about the fun travel destinations that we talked about earlier?" It's hard to build a system that's deterministic, consistent, and reliable for the use cases that are more like using a tool, like a light switch, while also giving it the personality, character, and creativity that we've come to expect and that LLMs are really good at. Balancing that is incredibly challenging as well.
Of course, we all know now that parametric answers, answers that come from the model itself based on its training, are only as good as the data that was available at the time it was trained. Current events or updated datasets like personalization knowledge bases aren't going to be reflected in the parametric answers that the model is giving. For that, we use the standard techniques of grounding and retrieval-augmented generation (RAG).
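A toy sketch of that grounding pattern, retrieving fresh or personal facts and placing them in the prompt so the answer isn't purely parametric, is below; the knowledge store and word-overlap scoring are simplifications of a real retrieval stack.

```python
# Ground the model's answer in retrieved facts it could not know from training.
import re

KNOWLEDGE = [
    "Household speakers: Kitchen Echo, Living Room Echo.",
    "Daisy is a cream golden retriever.",
    "Red Sox game tonight at 7pm.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy relevance score: shared words. A real system would use embeddings.
    q = words(query)
    return sorted(KNOWLEDGE, key=lambda doc: -len(q & words(doc)))[:k]

def grounded_prompt(question: str) -> str:
    facts = "\n".join(retrieve(question))
    return (
        "Answer using only the facts below; say so if they are not enough.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}"
    )

print(grounded_prompt("Which speakers do I have in the house?"))
```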
But the challenge here is how you balance just giving the answer versus embellishing it a little bit. So, for example, Alexa might troll the Yankees in an answer if you're a Red Sox fan. Or if you say, "I'm bored," it might tell the classic dad joke: "Hi, Bored. I'm Alexa."
And so what we learned along the way was that this subset of prompt engineering that we now call context engineering was super important for getting this balance right. I mentioned the difference between smart home context, like what light switches you have, and music context. It's not okay for the model to hallucinate that you have 67 speakers in the household when you really only have 5, and then not actually play the music because it thinks it's playing on a speaker that doesn't exist. That's not okay. So you have to be really careful to make sure it has the context it needs to make the right decisions.
Elements like past conversations matter, so there's continuity between the utterances and the interactions between Alexa and the customer. It was really an iterative process to decide what to include and what to exclude from this context. Summarization, of course, helps. But there's a lot more that goes into making sure the right context is there without negatively impacting latency and accuracy.
And we also learned, as many have learned by now, that models exhibit a sort of recency bias, where instructions toward the end of the prompt are given a little more weight. Just like humans: I tend to remember things I was just told better than things I was told much earlier. I just forget. It turns out models do that too. So ordering in the context and in the prompt matters, not just for caching, as I mentioned earlier, but also for getting the accuracy right and balancing the behavior of the LLMs so that we're able to handle an experience that's both deterministic and creative.
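Putting those context-engineering ideas together, a simplified sketch might look like the following: include only the context a given request needs, keep a conversation summary for continuity, and place the must-follow instruction near the end where recency bias gives it more weight. The domain detection and context sources are placeholders.

```python
# Assemble per-request context: only relevant context, critical instruction last.
SMART_HOME_CONTEXT = "Lights: kitchen ceiling, living room lamp, porch light."
MUSIC_CONTEXT = "Speakers: Kitchen Echo, Living Room Echo."
CONVERSATION_SUMMARY = "Earlier: discussed spring-break trip to Santa Barbara."

def detect_domain(utterance: str) -> str:
    text = utterance.lower()
    if "light" in text or "lamp" in text:
        return "smart_home"
    if "play" in text or "music" in text:
        return "music"
    return "general"

def build_context(utterance: str) -> str:
    parts = [CONVERSATION_SUMMARY]             # continuity between turns
    domain = detect_domain(utterance)
    if domain == "smart_home":
        parts.append(SMART_HOME_CONTEXT)       # relevant only for this domain
    elif domain == "music":
        parts.append(MUSIC_CONTEXT)            # irrelevant for a lights request
    parts.append(f"Customer request: {utterance}")
    # Recency bias: the instruction that must not be ignored goes last.
    parts.append("Use only devices listed above; never invent devices.")
    return "\n".join(parts)

print(build_context("Turn on the lights"))
```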
And then finally, safety is just non-negotiable for us. It always has to be a safe experience, and the guardrails that we use are truly indispensable. We took a sort of belt-and-suspenders approach: we just didn't trust the models. We didn't trust that everything that went into the model was safe or that everything that came out of the model was safe. So we had guardrails all over the place. Of course, we would prompt the model to do things safely, but in case it didn't, we had other guardrails in place to take care of that.
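A belt-and-suspenders sketch of that idea, with independent checks on both what goes into the model and what comes out of it, might look like this; the blocked topics and allow-listed actions are placeholders, not Alexa's actual safety policy.

```python
# Safety checks on both sides of the model, independent of prompt instructions.
BLOCKED_TOPICS = ("disable the smoke alarm", "unlock the front door for a stranger")
ALLOWED_ACTIONS = {"turn_on_light", "play_music", "create_reminder"}

def input_guardrail(utterance: str) -> bool:
    # Don't trust that everything going into the model is safe.
    return not any(topic in utterance.lower() for topic in BLOCKED_TOPICS)

def output_guardrail(planned_action: str) -> bool:
    # Don't trust that everything coming out of the model is safe either:
    # only explicitly allow-listed actions may reach the real world.
    return planned_action in ALLOWED_ACTIONS

def handle(utterance: str, planned_action: str) -> str:
    if not input_guardrail(utterance):
        return "Sorry, I can't help with that."
    if not output_guardrail(planned_action):
        return "Sorry, I couldn't complete that safely."
    return f"Executing {planned_action}"

print(handle("Turn on the porch light", "turn_on_light"))
```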
Model Flexibility: Building a Multi-Model Architecture to Choose the Right Tool for Each Job
And none of this would have been possible without a multi-model architecture. It was absolutely essential. It was an early decision, and it turned out to be the right one: you can't expect a single model to handle all the use cases, especially in a system as diverse in its customer base and experiences as Alexa. And as we were balancing accuracy and latency and making trade-offs with personality and creativity, we found that there were a lot of other dimensions that we had to consider, like capacity, GPU cost, and many, many others.
And this multi-model architecture, lucky for us, was built into the system early on. And part of that was out of necessity, because as I mentioned, the early models we were working with weren't as capable. And so, it was really convenient to be able to swap out and use other models.
But what really helped us was discovering that we don't have to turn this off. We don't have to do this only in development time. We can leave this on in production. That meant that we didn't need to go looking for a one-size-fits-all model. We could find the right model for the right job, for the right use cases. Again, working with AWS was amazing because the Bedrock service made it super easy for us to just swap out the underlying model on the back end at runtime whenever we needed to.
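A simplified sketch of that runtime model swap using the Amazon Bedrock Converse API is below; the model IDs, the region, and the routing rule are illustrative placeholders rather than the production setup.

```python
# Pick the model per request and swap it at runtime via Bedrock's Converse API.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # adjust region

FAST_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"        # placeholder IDs
CAPABLE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def pick_model(utterance: str) -> str:
    # Toy rule: short, command-like utterances go to the cheaper, faster model.
    return FAST_MODEL if len(utterance.split()) <= 4 else CAPABLE_MODEL

def converse(utterance: str) -> str:
    response = bedrock.converse(
        modelId=pick_model(utterance),
        messages=[{"role": "user", "content": [{"text": utterance}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# converse("Alexa, stop")                        -> routed to the fast model
# converse("Plan a spring break trip with a beach and nice weather")
#                                                -> routed to the capable model
```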
Not every challenge or use case is a nail that requires an LLM hammer. You don't need an LLM to handle the "Alexa, stop" use case. I mean, you could, but it's kind of overkill. There's a feature in Alexa+ where you can email a PDF of, say, your kid's school schedule, and then later ask Alexa about it. When is my son's homework due again? When is that exam? Alexa will answer, given the information in that PDF.
Sure, the LLM could handle this use case as well, but that would mean feeding that PDF in as part of the input prompt. That's a lot of tokens. Then you have to figure out when it's relevant: is the customer asking about this? If it isn't relevant, then again, it's extra latency to give that context to the LLM. In our case, this is one of those use cases where a purpose-built, bespoke, non-LLM traditional ML model could do the job just fine. We train lots of models in this way in the Alexa system.
Not only do we have multiple LLMs to draw on, but we have multiple non-LLM ML models that we use in the system to handle all of the use cases. Again, here working with AWS and SageMaker makes it really easy and convenient for us to build up all of these different bespoke models. But then, if you have all these models, now you have a new challenge, which is how do you choose the right one?
Well, one approach is, just like speculative execution, you could use all of them all the time and then pick one answer, maybe the fastest, maybe the most accurate, maybe the cheapest. But again, calling all of these models in parallel might help with latency, but it also costs you in capacity and runtime GPUs. We found that you have to use a combination of techniques with multiple models.
As we were going through this, it was really like peeling the layers of an onion. Fixing one problem only uncovered yet another set of issues and challenges that we had to overcome. I've talked about four of them: accuracy, latency, determinism, and model flexibility. Some we anticipated, some we didn't. But I hope that by sharing this with all of you, you can decide for yourself what you need to anticipate as you're building your use cases. Back to you, Brittany.
Key Takeaways: Model Flexibility, Iterative Experimentation, and Step Progression
Thanks, Luu. Okay, so what are the key takeaways that we want you all to walk away with and take back to your own projects? First, model flexibility. Luu talked a lot about how early design decisions around our models afforded us flexibility later. We think this is a really important part of how we built Alexa+, because what we found is that one model doesn't fit all, and right-sizing models for specific use cases actually delivers better outcomes than sticking to just one and hoping it works out. Then, as you look across that continuum, you optimize for accuracy, speed, and cost based on the need and the use case that you're trying to deliver on.
Second, and this is critical, iterative experimentation is essential. Traditional optimization wasn't enough. You heard Luu talk about how we started with traditional techniques like parallelization and streaming, but they didn't go the distance. We had to bring in new techniques like prompt caching, API refactoring, and speculative execution, and it was the blend of those techniques that allowed us to create something magical for our customers with generative AI. Also, what works in theory doesn't always work in production, and what works in production doesn't always work at scale, so building experimentation into your process and your mental model is critical.
Lastly, we took a step-by-step progression. We wanted it first to be right, then to be fast, and then to be reliable. In your own world, you'll have to make these trade-offs. It's a balance, and how you make them will shape the outcome you're looking to deliver.
Most importantly, Daisy the dog did get fed, but it's almost noon, so we'll see what happens today. If you want to experience Alexa+ live, come to the One Amazon Lane activation at Caesars Forum. That's where you can discover many of the innovations powered by AWS across Amazon's different lines of business. Thank you so much, and please fill out the survey in your mobile app.
; This article is entirely auto-generated using Amazon Bedrock.