🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - End-to-end foundation model lifecycle on AWS Trainium (AIM351)
In this video, AWS presents Trainium3, their fourth-generation ML chip, whose second-generation UltraServer delivers 4.4x more FP8 compute than the Trainium2 generation, alongside a walkthrough of the complete AI model lifecycle from selection through deployment. Matt McClean details optimization techniques using Optimum Neuron for fine-tuning and the new Neuron Profiler tool for performance analysis, while showcasing vLLM integration for inference. Randeep Bhatia from Splash Music demonstrates their Humming LM model, which converts human humming into music, achieving a 54% cost reduction and 159% better efficiency by migrating from GPUs to Trainium. The session highlights AWS's commitment to open-sourcing the entire Neuron software stack, including the NKI compiler and kernel libraries, with new PyTorch-native support enabling seamless device migration for developers.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Optimizing the End-to-End AI Model Lifecycle on Trainium
Thanks for joining at the end of the day. I imagine I'm holding you back from your dinner plans and your happy hours, but thanks for joining us and meeting us all the way out here at the MGM as well. I think we have a great talk today, and I'm pretty excited for this one. What we want to look at today is the end-to-end model lifecycle and how we can optimize every part of it, and we're going to be looking at how we do that on Trainium.
Today, I have a couple of guest presenters with us to bring this topic to you. Matt McClean, who leads our customer engineering team at Annapurna, and we also have Randeep Bhatia, who is the CTO of Splash, and he's going to bring a lot of color and excitement to the end of this presentation. So with that, kind of jumping in and starting from the beginning, the circle of the AI model lifecycle.
We generally start looking at applications from generative AI models or agentic workflows, and that's a big deal right now, especially at AWS. How are we going to optimize for these agentic workflows when the compute demands and the cost demands are getting so high? So what I wanted to look at with you guys today is really how do we optimize every level of this and what is this AI model lifecycle that we should be looking at and what are the different parts of this puzzle that we can optimize for.
Understanding the AI Model Lifecycle and Its Key Bottlenecks
This probably looks very familiar to all the builders here in the room today, right? You want to typically start with an idea, a business problem, an application that you want to build a model for to serve those needs. So you're going to start with your use case discovery, prioritization, some type of qualification of what the use case you're driving towards. Next, you're going to look at the data you have. You're going to look at open source datasets, you're going to look at your proprietary data, and what is your value add to this as well.
Then you're going to move on to model selection. You're going to think about all the different types of models that could solve this problem and what the best case model is going to be for that use case. You're going to adapt that model, make it very specific to your users and their applications. Then you're going to start evaluating, you're going to do some offline evaluation, then you're going to optimize it for deployment.
You're going to then reevaluate it under your production metrics, right? Is it serving fast enough? Is the latency meeting your customer's expectations? Then you're going to deploy for scale. You're going to scale it up, you're going to roll it out, you're going to test it with that audience, and hopefully you're going to start making some money on this one as well, right? And then ultimately you're going to rinse and repeat, you're going to build on to the next application, and the cycle continues and it never really ends, right?
But there are key parts of this circle of life that we want to talk about today, and that's really looking at all of these pink levels from model selection down to optimizing for your production environment. Really, when we look at the cost of building and deploying models, this is where you're going to spend the most money. And the reason for that is the decisions you make along these points on this axis, on this part of the circle, are really going to dictate the business value metrics you're going to see at scale, right?
So if we can optimize the circle of life at this level first, choosing the right model, choosing the right hardware to deploy that model to, choosing the right libraries and techniques to adapt that model in the most efficient way, we're going to be able to optimize all the way through to our deployment lifecycle and then scale it up, right?
So when we take those AI bottlenecks out of this, we want to be able to iterate quickly on the front part of our model lifecycle, which is choosing the right model and typically being able to leverage a lot of the resources out in the open source community to do this. We want to overcome slow iteration processes, slow fine-tuning processes, and accelerate that. We want to manage our rising compute costs, right? Because the more times we go around that circle because we weren't optimizing at all of these different steps, it adds to our cost and it adds to the cost of our application that then we need to recoup in the market.
So we want to manage our compute costs. We want to be able to deploy on hardware and to think about the compute scarcity when we're getting closer to our deployment environment and starting to scale up. But before we get to our scale and deployment, we want to make those considerations early in this model lifecycle as well, and then even from selecting our model and how we're adapting that model and all the way through optimizing it for deployment.
We're going to be conscious of whether this will meet my customers' care-abouts, right? Can I serve this fast? Can I serve it cheaply? Can I serve it at scale? Those are the things we want to talk about today, and we're here to look at overcoming those bottlenecks with Trainium. So we're going to be focusing on how we create faster iterations. We're going to look at how we select from open-source models so that we don't have to pre-train models and can adapt them more quickly, and at choosing the right hardware, not only to reduce our cost to build and deploy these models by utilizing Trainium, but also to serve them, both from a deployment perspective and with easier access to compute as well.
AWS's Full Stack Approach and the Annapurna Labs Mission
And then of course optimizing for lower inference times and total latencies or the objectives that you're looking for that could be high throughput, low latency, or total end throughput as well. So we're going to be touching on all of these throughout this presentation. And I wanted to start with what do we have at AWS. So at AWS, I'm sure with the other sessions you're visiting today and all of the other keynote messaging, we have a full stack for you. So if you're developing at a higher level of the stack where you want to just take models from our agentic workflows through Amazon Bedrock or utilize the SDK for agents using Agent Core and other Strands libraries, it makes it really easy to start developing those more complex gen AI and agentic workflows rather than starting from scratch.
And as you kind of go down the stack and you want to own more of this stack yourself and develop it in your environment because maybe you have data residency issues that you want to be cognizant of, maybe you are worried about, you know, you want to optimize for the end-to-end lifecycle of our model lifecycle as well to optimize at all levels and own all of the data and you want to make sure the data lineage is going to be clean throughout, then you'll want to kind of take an infrastructure-first approach where you're doing this on your self-managed instances, right? And so today we'll focus on training and inference as part of that. And this is probably a good segue to Trainium and Inferentia. How many of you have heard of Trainium and Inferentia? Oh, that's fantastic. That's more hands than I'm used to.
How many of you tried Inferentia and Trainium for your applications today? OK, so, well, we got a couple. That's wonderful. So for those of you that are not super familiar, we're part of Annapurna Labs. We're the hardware division of AWS and we build purpose-built silicon that serves our customers' most challenging problems, right? So the first product that Annapurna brought to market was Nitro for network acceleration and hypervisor solutions. The next product was Graviton, bringing general purpose compute with lower cost and higher performance, and of course Inferentia and Trainium.
And we've been doing this for a very long time, right? Over 10 years now, and in that 10 years we've produced and brought to market many chips across our three product families. And now with Trainium3 launching this morning in the keynote, this is our 4th-generation ML chip, right? We've been building these solutions and focusing on Inferentia and Trainium for a few reasons. Fundamentally, when we were working with our customers here at AWS, we noticed the types of bottlenecks they were facing. First, performance, right? As models get more complex, as we expect more out of them, and as agentic workflows demand 10x or 100x more token generation, or as we train these models, we wanted to provide high-performance options for both training and inference.
We also wanted to make it lower cost for our customers, and investing in our own silicon made it more achievable to deliver those cost savings to customers as well. And lastly, we want to make it more accessible. We want to democratize AI. What that means is it shouldn't just be the ones with the deepest pockets who get access to the compute they need. So we want to give customers choice, and we want to invest in these technologies and build our own AI chips for those reasons, right? Over the last 6 years we've now introduced our 4th-generation chip, Trainium3, and we're super excited about bringing these features to you and what it could mean for inspiring the next generation of agentic models, mixture-of-experts models, or video generation models.
But our other generations like Trainium 2 or Inferentia 2 are great accelerators for small models and embedding models as well. So there's still a lot of opportunity to use the entire product portfolio. But with our time today I'm just going to be looking at, or wanting to introduce, Trainium 3.
Trainium3 Launch: Breakthrough Performance and Architecture
Trainium 3, we've doubled our compute on chip, we've increased our memory capacity on chip by 50%, and we've increased the memory bandwidth by an additional 1.7 times. And what this really means is when we put it all together into our second generation Ultra Server, we're giving you 4.4 times more compute, FP8 compute, than we made available in our last generation. We also are giving 3.9 times higher memory bandwidth and 3.4 times more memory capacity at the full Ultra Server level, and we've innovated at every level of this server design as well.
So starting from the chip, we also redesigned our compute sleds and the servers themselves, bringing in a better switching matrix as well with our introduction of Neuron Switch. What this allows is for all 144 chips to communicate with each other at very low latency and high bandwidth, and this is especially important as we move into more complex AI workloads where multiple models work together to produce your final output. It allows you to scale, say, large mixture-of-experts models very efficiently across all of these chips at extremely low latency.
And so with this now, you know, all the specs on the slide are always kind of maybe a little dry. What does this really mean from a performance perspective, right? So there's kind of two ways to look at what this performance really means. The bottom line in blue here, that's our Trainium 2, our previous architecture, and the light blue line above it is our Trainium 3 Ultra Server, and we're looking at GPT-3 and serving this at scale. And so the dimensions of this chart are interesting. On the bottom, on the horizontal axis, we have our interactivity or tokens per second, how many tokens can an individual user get? And on the y-axis we're looking at how many users can I serve, or my concurrency of that model.
So with Trainium2 you're able to deploy your workloads today and get really responsive models, and you're able to do this with what I would call medium concurrency, right? We can serve these large models, we can serve multiple users at a time, and get good performance. So if you have workloads today that are utilizing models like this, you can deploy them directly on Trainium3 and immediately serve 5.4 times more users with that same exact interactivity, right? Each user will get the same experience, and now you can serve more of these customers, bringing down your total cost and being able to expand to more workloads.
The other way to think about this is, well, I want to increase the number of features and the performance I'm giving each of my users and customers, right? I want to give them deeper reasoning. I want to provide them content from more models as I'm building this out. So the other way to think about it is now we go horizontally, right? With the same exact concurrency, I can now generate more tokens per user per chip. And what this allows me to do is deeper reasoning, more complex models, more turns inside of my agentic workflows, generating more tokens without increasing the cost and without degrading the user experience for my end customers. And we can deliver over six times the number of tokens with Trainium3 Ultra Servers than we could with Trainium2 Ultra Servers.
And so we're really excited about that, clearly I'm really excited about that as well, right? We're also excited to see Anthropic and our other lead customers leaning in and sharing that excitement, as you may have already heard through other parts of re:Invent this year or in previous news. Anthropic has been ramping up their Trainium2 usage over this last year. At re:Invent last year, we launched Project Rainier. It was an ambitious project at the time, where we wanted to build the largest compute cluster for Anthropic to take advantage of, and we're pleased to say that it is now in full production.
We had a stated goal at the time of deploying hundreds of thousands of chips for Anthropic's Project Rainier. We ended that with 500,000 chips deployed in Project Rainier and a million chips already deployed serving all of our customers this year, and we're on track to exceed that already with Trainium3.
Customer Success Stories: Descartes and Splash Music on Trainium
The other really exciting part of this is not just for LLMs. I think that LLMs have a lot of excitement going on right now, but there's also a lot of innovation happening in the computer vision space or in the video-to-video space. One of our lead partners in this area is Descartes. Descartes is a startup focused on building more visually exciting experiences with their models, and so they have a lot of innovation there as well. They're actually doing a session tomorrow which I'll share the session ID at the end, and I highly encourage everyone to take a look at them.
We worked with them ahead of our Trainium3 launch today to see what they could do with Trainium. They had been extensive and expert GPU users for a very long time, and they built a lot of kernels to accelerate their models and innovate on top of a GPU platform. When we introduced Trainium to them, they were really excited, and it really met the architectural criteria they had in mind. In under two months, they were able to optimize their workflows and models, bring them to Trainium3, and experience higher performance and lower latencies than they had achieved with their previous architectures.
We have a session tomorrow with one of the co-founders, actually it's a brother pair: Dean is the CEO and Oren is the CTO, and Oren is going to be presenting. He's going to go into a lot of depth, and I highly encourage that. We also have the Descartes demo running in our expo booth, so definitely check that out as well.
And of course Splash Music. We also have the privilege of having Splash Music's CTO, Randeep, here as well. He's going to be going into more details on their use case, and I'm super excited for everyone to hear about that. So with that, I wanted to pass it over to my colleague Matt to talk a little bit more on the libraries.
Model Selection: Balancing Intelligence and Cost with Open-Weight Models
Thanks, Cameron. What I'm going to do is kind of, Cameron gave you an overview of the AI model lifecycle, so I'm going to double-click on a few of these stages to give you a little bit more details on how you should approach this, especially with AWS Trainium and Inferentia. The first stage is model selection. That's when you want to select the model that you want to use. There are many different models to choose from.
Here is a benchmark from a website called Artificial Analysis, and you can see the models ranked by intelligence. Intelligence is essentially a key metric for selecting your model. The models in black are the proprietary models. These are models typically only accessible through an API, the likes of Claude from Anthropic, Gemini from Google, and GPT-5 from OpenAI. These are awesome models, really high on the intelligence scale.
But what you may not know is that open-weight models, so these are models, for example, that you can download from Hugging Face, are actually quite comparable. This is a fairly recent benchmark just from last week, and the Kimi K2 model is actually very, very close to the top-performing Gemini 3 model. The key message here is don't exclude open-weight models. They're actually very, very competitive and have very high intelligence compared to these proprietary models.
Intelligence is one criterion. Another key criterion is cost, because especially when you're deploying, you're going to be making a lot of inference calls to these models. It's one thing to do a small proof of concept; it's another thing to deploy at scale, and there cost becomes a key criterion. This chart shows the intelligence scale on the vertical axis and the actual cost to run inference for these models on the horizontal axis.
Again, proprietary models are great on the intelligence axis, but they tend to be on the right-hand side in terms of cost to run, sort of in that top right quadrant. Really, the ideal models you should be looking for sit in the top left corner: high on the intelligence scale but lower in cost. An interesting thing here is that the only models present in this top left green quadrant are actually the open-weight models, models such as GPT-OSS, MiniMax M2, and DeepSeek.
The good thing is that all of these models can actually be fine-tuned and deployed on AWS Trainium and Inferentia. So we've selected our model. Now typically the next stage is adapting the model. In technical terms, we often refer to this as post-training, where you're doing things such as supervised fine-tuning or using reinforcement learning to really adapt the model to your specific use case.
Model Adaptation: Post-Training with Optimum Neuron and LoRA
One of the most popular open-source libraries for doing this post-training is the Hugging Face Transformers library. Are there any users of Hugging Face Transformers? Alright, so we've got a few users. We've actually collaborated with Hugging Face on an optimized version of the Transformers library for AWS Trainium called Optimum Neuron. Essentially it provides you the same APIs you're familiar with in the Transformers library, but under the hood, we've done a lot of optimization. We've integrated very closely with our software stack and we've integrated a whole bunch of kernels to make sure that the performance is really good when you're fine-tuning or deploying these models. This is just showing a snapshot of the landing page, the documentation page where you can go and find more information on this library.
The steps in terms of when you're doing your post-training using Optimum Neuron are essentially these four steps. You're going to load and prepare your datasets. You're going to fine-tune your model using LoRA. LoRA is a very efficient way to fine-tune a model. It uses a lot less memory than standard full model fine-tuning. Step three is you want to consolidate your LoRA adapters into the open weights, and then finally, optionally, you can push them up into the Hugging Face Hub.
So if we go into each one of these, loading and preparing the dataset is nothing specific to AWS Trainium. This is what you would have to do on any sort of fine-tuning project. You have a choice of over 400,000 datasets. These are all up in Hugging Face Hub you can choose from, or typically in a business, you'll actually take your proprietary data and you want to use that in order to have your model generate things that are specific for your business. So once you've selected the dataset, the next thing you need to do is format it depending on the kind of application. For example, if you're building a chatbot kind of application, you want to format the dataset into an instruction format, so you're telling the model the types of data that will be sent to the model and the way it has to respond.
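As a rough illustration of that preparation step (not the exact code from the session), here is a minimal sketch using the Hugging Face datasets library; the dataset name and prompt template are placeholders you would swap for your own data and your model's chat format.

```python
from datasets import load_dataset

# Example public dataset from the Hugging Face Hub (placeholder choice).
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_instruction_format(example):
    # Flatten each record into an instruction/response prompt. The exact
    # template should match the chat format your base model expects.
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}"
        )
    }

dataset = dataset.map(to_instruction_format, remove_columns=dataset.column_names)
print(dataset[0]["text"][:200])
```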
So now we get into, once we've prepared the dataset, we're ready to fine-tune and launch our training jobs. And here you can actually take existing scripts you've used on other accelerators and basically bring them across because the code changes are essentially minimal. Here I'm highlighting essentially the only kind of classes you'd need to replace. This is replacing the standard Transformers classes with Optimum Neuron classes, and then things should just work.
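To make the "swap the classes" point concrete, here is a minimal LoRA fine-tuning sketch. It assumes Optimum Neuron exposes NeuronTrainer and NeuronTrainingArguments as drop-in replacements for the Transformers Trainer classes (check the current Optimum Neuron documentation for the exact names and arguments); the model ID, dataset file, and LoRA settings are placeholders.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed drop-in replacements for transformers.Trainer / TrainingArguments;
# confirm the class names against your installed optimum-neuron version.
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA trains small adapter matrices instead of the full weights,
# which keeps memory usage low on the accelerator.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

def tokenize(example):
    tokens = tokenizer(example["text"], truncation=True, max_length=1024)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = NeuronTrainer(
    model=model,
    args=NeuronTrainingArguments(
        output_dir="llama-neuron-lora",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
)
trainer.train()
```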
So we've trained our model and we've got our LoRA adapters. Now we want to consolidate everything together. Here are a couple of different options. The first option is to use the Optimum Neuron CLI. This will essentially consolidate the LoRA adapters into the open weights downloaded from Hugging Face and bring them all together. That's one option. The other option shown below is just standard Python code where you have a little bit more control over how that consolidation will work.
So now we've consolidated everything into one package. Now you can either deploy, or if you want to actually share this newly fine-tuned model to, for example, other users, then you have the option of pushing it back up into Hugging Face. Here's just a code snippet showing you how you can actually do that.
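For reference, here is what the merge-and-publish flow can look like using the generic PEFT APIs rather than the exact Optimum Neuron CLI shown on the slide; the model IDs, directory names, and Hub repository name are placeholders.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B"   # placeholder base model
adapter_dir = "llama-neuron-lora"     # directory produced by training

# Load the base weights, attach the trained LoRA adapters, then fold them
# into the base weights so the result is one standalone checkpoint.
base = AutoModelForCausalLM.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("llama-finetuned-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("llama-finetuned-merged")

# Optionally share the merged model on the Hugging Face Hub
# (requires a prior `huggingface-cli login`; the repo name is a placeholder).
merged.push_to_hub("my-org/llama-finetuned-merged")
```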
Performance Optimization: Neuron Profiler and Neuron Kernel Interface
Alright, so we've adapted our model and we've done the fine-tuning. Often the next thing you want to do is optimize the performance of your model, and this is what we'll look into now. Before we show how you can do this, just to level set on a few best practices and principles: essentially, performance optimization is all about trying to maximize the utilization of our AI chips. For AI chips, that typically means you want to ensure the workload is what we call compute bound, basically using all of the available FLOPs in the accelerator to the maximum extent possible. There are a few different ways you can do this. One way is to pipeline operations.
What this means is essentially while you're doing, for example, a matrix multiplication for one operation, you can at the same time load the data for the next operation in parallel. You can also save the data for the subsequent operation at the same time as well. So you're ensuring that your compute engines are always being utilized.
The second principle is you want to minimize the data movement because, while the bandwidth between the different hierarchies of memory is fast, you still want to typically keep memory on the chip as much as possible. This is what we talk about when we mention minimizing data movement. For example, for our activation tensors, we want to keep them in the chip's SRAM, a small amount of very high bandwidth but small capacity SRAM, typically in the tens of megabytes. We want to keep our activation tensors there in order to avoid doing lots of reads and writes back to our high bandwidth memory.
As well as minimizing data movement, when we do have to move data, we want to maximize the throughput. Typically, we want to make sure that our read and write operations are using large chunks of data rather than having small amounts. That will just ensure that our bandwidth is maximized.
Finally, when we're running inference on our models, we're typically running large models, right? They have tens or hundreds of billions of parameters, which means they can't fit on a single accelerator. The model has to be sharded across multiple accelerators, and those accelerators have to communicate between each other. That communication is what we call the collective operations. One key core principle is we want to ensure our collective time is less than the time it takes to perform those computations, such as our matrix multiplications and the like.
So how can you actually know if your model is really performance-optimized for AWS Trainium and Inferentia? This week we've launched a new tool called Neuron Profiler. It gives you different levels of hierarchy, so you can view your model from a high level, layer by layer, module or operation, right down to the low-level instructions running on the hardware. We provide it via a web application or as a VS Code plugin. Very soon we'll have integrations for system profiling, which means you can also view how your host's CPU and memory are performing alongside your device.
Here is a screenshot, and in a second I will go through a full demo of this new Neuron Profiler tool. So let's say we've found a bottleneck in our code. What can you do about it? How can you actually improve the utilization of your AI chip? This is where our kernel interface comes in. It's called the Neuron Kernel Interface, or NKI for short. Essentially, it provides full control and programmability of our AI chips. We provide it as a Python DSL, and you can write low-level ISA commands, which are essentially the machine instructions your AI chip will execute. This gives you control over things such as memory layout, tiling, scheduling, and allocation. For example, Cameron mentioned Descartes earlier. Descartes used NKI to optimize their video-to-video model to get the best performance possible.
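As a flavor of what an NKI kernel looks like, here is a minimal element-wise addition sketch modeled on the public NKI examples; treat the module paths and API details as assumptions to confirm against the Neuron documentation, and note that real kernels tile larger tensors rather than loading them in one shot.

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl


@nki.jit
def add_kernel(a, b):
    # Output tensor allocated in device HBM (shared so the host can read it back).
    out = nl.ndarray(a.shape, dtype=a.dtype, buffer=nl.shared_hbm)

    # Index grids covering a single tile of the inputs; this sketch assumes
    # the tensors fit in one SBUF tile (partition dimension <= 128).
    ix = nl.arange(a.shape[0])[:, None]
    iy = nl.arange(a.shape[1])[None, :]

    # Explicitly stage data from HBM into on-chip SBUF, compute, write back.
    a_tile = nl.load(a[ix, iy])
    b_tile = nl.load(b[ix, iy])
    nl.store(out[ix, iy], value=a_tile + b_tile)
    return out
```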
Okay, so now we're going to have a demo of the Neuron Profiler, which again is a new tool just launched this week. Here is a profile of a large language model. It's a decoder, and it has not been optimized. Here we can see a summary page. On the top left, we can see the overall FLOPS utilization, how compute-efficient our model is. We've got these bars, which show the utilization of the different engines within the chip: a tensor engine, a scalar engine, a vector engine, and a general-purpose SIMD engine.
We've got some recommendations that will tell you the things it has picked up and suggest next steps for how you can alleviate some of these bottlenecks. We also have some info on the collective operations, such as their sizes, so you can spot outliers to dive into. We also have memory bandwidth utilization, which we typically want to keep as high as possible. It's quite low in this particular case, so that's another indication there's a bit of work we can do to optimize this model.
So that's one view. Another view is a bit more detailed. This is of the same model, and here we have our hierarchical view. We can essentially go from each layer in our model and dive into it. Within each layer we can see, for example, the major components. Typically these are things like an attention component and an MLP component. Within each of those, we can break them down into the specific operations down to, for example, a matrix multiplication, an add, or a value.
On the right-hand side, top right, you'll see a similar sort of view right down to the operation level, and we can see how efficient each of those operations is on our hardware. On the bottom left, we can actually see how this all maps down into the specific engines running on our AI chip. For example, we can see the tensor engine utilization, the vector engine, the scalar engine, and all the engine utilization. We can also see the collective operations when we are having to communicate between chips, and we can see how long each of those operations takes. We can also see our memory bandwidth utilization, so how the memory is being moved around on the chip and how efficient that is through the sort of low-level device view.
So what we can do is also look at a specific module. We can, for example, take the attention module, and we can actually look and see how long the latency is for that particular operation. Here we're taking sort of two markers and we're seeing that it takes around, it's pretty hard to see here, but it's about 125 microseconds to run the attention part of our model. So that's one sort of data point, and we can use that as our baseline.
And then what we're doing now is we're actually looking, I've opened a new profile and this is an optimized version of that same model, but this time we're using an NKI kernel. So again, we have that hierarchical view that I mentioned before. We can go down and see down to the low-level operations. We can go again and mark, for example, the time, the latency of the particular component. And here you can see that we've actually sped it up. So we've gone from 125 microseconds down to around 79 microseconds. So we have about a 50% speed up on this particular model.
We also have a new feature, which is a code view. So we can actually map the specific instructions running on the hardware back to our NKI code. This is our kernel code, which is showing you on the top right. So on the top right is our NKI code written in Python, and we can highlight an area in our device profile and actually see which lines of code in our NKI kernel actually are responsible for that instruction. So what you can do is point back to a line of code and then you can go back to your code, optimize it, try something new and see the effect on the operation or the speed up of that particular component and how it maps to the low-level device instructions. Right, so that's the demo of the new Explorer tool.
Deployment at Scale: vLLM Integration with Trainium and Inferentia
So once we've kind of optimized the model, now we're ready to actually deploy a model into production. So this is the deploy and scale phase. So one of the most popular libraries used for deploying, especially foundation models, is vLLM. Any users of vLLM here? Okay, we've got a couple of folks. So it's a really popular open-source library. It's really designed for high throughput, low latency LLM serving. And it does this through various different mechanisms. It has a really efficient KV cache management mechanism. So it's doing things such as paged attention and also really efficient batching. So it's using concepts such as continuous batching.
It has a very vibrant open-source community, with folks from Red Hat and many other companies contributing, and it supports a lot of different models. Popular models like Llama, GPT, and DeepSeek, the most popular open-weight models, are supported by vLLM. And now it's part of the Linux Foundation, so there's a lot of collaboration happening. Here's just an example of a code snippet showing how you would actually use vLLM. It's really simple to use. You set up things such as your sampling parameters, like your temperature. You configure how you want to shard your model, that is, how many accelerator chips you want to shard your model weights over. In this case, we're using tensor parallelism to shard our model across two cores. And then we can basically call generate to produce some outputs.
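The snippet itself is not reproduced in this transcription, but the usage described above looks roughly like the following sketch; the model name and parallelism settings are placeholders, and on Trainium or Inferentia the Neuron plugin described below handles device placement.

```python
from vllm import LLM, SamplingParams

# Sampling controls such as temperature, nucleus sampling, and output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Shard the model weights across two devices with tensor parallelism.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

# Generate completions for a batch of prompts.
outputs = llm.generate(
    ["Summarize what AWS Trainium is in one sentence."],
    sampling_params,
)
for output in outputs:
    print(output.outputs[0].text)
```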
So vLLM is fully integrated with AWS Trainium and Inferentia, and this is done via the vLLM Neuron plugin. So we have support for some of the most popular open weight models, Llama, Qwen, GPT, and Mistral. We've integrated a lot of kernels, so our team has been busy writing kernels, so you don't need to write them for the most popular open-source models.
There are things such as FlashAttention, which is a popular approach to accelerate the attention component of a model. We have other kernels such as Fused QKV to get better support and better performance. And also other features such as speculative decoding, which is another way for you to speed up your model to generate tokens a lot faster during the decoding phase.
All right, so now I've explained the end-to-end lifecycle in a hypothetical way. I'd actually like to invite Randeep, who is going to show you a concrete use case of how they've actually managed the end-to-end model lifecycle using AWS Trainium and Inferentia. Thanks, bud.
Splash Music's Remix: Building Interactive Music with Humming LM on Trainium
Hello, everyone. I'm Randeep Bhatia, CTO for Splash Music. And today with AWS we are launching something new. Not a product, not a platform, but a new way of making music, a completely new format, and that is going to be interactive, and we call it Remix. Wait, before I go and explain, I'm getting a call. This person usually doesn't call me unless there's an emergency. I hope you guys don't mind. I'm going to take this and FaceTime it.
Hello. Hey, Randeep, I know you're talking at the AWS re:Invent conference today. Oh, whoa, you're on stage right now. Hey everyone, sorry for interrupting. I just wanted to give you a call and let you know that I sent you a good luck mix. I just texted it to you. Check it out. All right, we'll talk later. Bye.
Wow, that's a really awesome good luck message to receive from a coworker right in the middle of a presentation. Perfect timing as well. So what I'm also going to do is respond to the message she sent over and show how to co-create: "I am on this stage. I am on the stage."
Done. All I did was just sing my melody with the lyrics, with vocals, and combined it into a single musical composition that we both created together. It's interactive, and it's fun. It's music, it's a vibe, and it is magical.
So what makes it interactive? Music is one of the ways people love to communicate. It's the melody that makes us feel connected. Gen Z and Gen Alpha have been communicating with each other through text messages, videos, and images. But we wanted to change that. We wanted to bring in a new format of communication through music. And that's why we made Remix.
Now, in under 15 seconds, you can have infinite variations of compositions that you can layer on top of each other. And this really changes where music is headed today. It is also another way of supporting artists. So how do we do this? Let's take a look behind the scenes.
When we started this, we asked ourselves: why is making music so hard?
People don't really express themselves in prompts. How do you express feelings in prompts? It's impossible to do it. They hum in the shower. They make random noises. Sometimes they tap on the desk while sitting. So our goal was very simple. We wanted to take these everyday moments and turn that into a mode of communication. We wanted to take technology out of it, but we also wanted to do it ethically and at scale.
To solve this, we built the first-ever model of its kind, called Humming LM. It is the model that takes your hum and listens to it. If you're off-key, totally fine. If you're offbeat, that's even better. Because in every single imperfection, we can understand the intent behind your expression. Humming LM is built to understand that intent and make it sound great.
Of course, converting random hums into music is not easy, because it is a challenging problem. We built massive datasets of people singing and recording their vocals, getting the right key, the BPM, the melody, and then we built the entire model end to end. We first trained it on GPUs, which is very expensive for any startup. But then we started working with AWS. With the help of the AWS AI Innovation Center team, we experimented on AWS Trainium chips. That cut our training costs by 54%. It increased our throughput by 8.3%. And it gave us 159% better efficiency on the metrics that represent what people can actually hear, which means cheaper, faster, better music from your hums.
How do you measure "better"? We have metrics that define the quality of the music, so that's how we know it's better. And of course, everybody needs a little inspiration, because you have an idea and you want to take it somewhere. So what we have curated is a set of sounds from different artists that represent different intents. Have a heartbreak? We have a sound for that. Are you happy? We have a sound for that. Have a good vibe? We have a sound for that as well. And that gives direct attribution back to the artist, and it benefits all the creators.
Now, you might wonder why we didn't use an open-source model to begin with. Open-source models aren't built for melodies. They don't understand the intentions behind the melodies that are created. People usually hum off-key. They start off the note, they drift, and then they express emotions that are not mathematically clean. In music, those imperfections define the intent. These are the clues that make them magical.
So to create this new format of music, we needed a way to understand your hum, your emotion, your intent. And that's why we built Humming LM, which encapsulates all of these intents and the music together. When people ask what we do at Splash, we empower anyone to make music with their favorite artist, with friends, family, and even people from the crowd.
So we are not an AI music company. In fact, we are not a music company. We started with a simple goal around how people want to express themselves through music. But to build that, we built the entire system around understanding the melody, timbre, structure, and emotion that represent music. Along our journey, CPUs were too slow to run the algorithms, and GPUs got too expensive. We needed something that could give us both availability and scale at the same time. That's when Trainium came into the picture.
So with Trainium and SageMaker HyperPod, we started our journey with just four Trainium nodes. We were able to scale all the way up to 100 Trainium nodes, get our model trained, and have it ready to be served across millions of endpoints through Inferentia. We also used other AWS services such as FSx for Lustre for storage and built the orchestration around it.
Now, you saw that we can take a very simple hum and convert it into a song with your favorite artist in just 15 seconds. You hum, we capture your intent, and that's what you hear. But that's only half of the story.
Since music is a form of communication, we want people to add their own voice to these songs, like how you saw my coworker send me a good luck message, and I was able to relate to it and sing my response back into it. The capabilities we have built into our player let you do it right from that experience. You don't have to navigate anywhere. It is as simple as typing a message to somebody.
Since we launched less than a year ago, the 30,000 creations on our platform have been streamed over 750 million times. That's almost 10,000 years of listening in less than a year. The only question is: what are you all going to create today? I truly appreciate re:Invent for having me on this stage to share our story.
The Future of AI Development: Open Source Neuron Stack and Next Steps
Thank you so much, Randeep. It's really inspirational to see how creative his team has been, leveraging both the services from AWS and Trainium to achieve their goals, to lower their costs, of course, but also to create something new, right? You know, even in my own world right now we tend to focus on LLMs as the source of AI. We're like, oh yeah, generative workflows are going to take over, and to be honest they will, but it's also really interesting to see how AI is going to be applied and utilized in so many different kinds of applications and markets today, right? We're just scratching the surface of this, really.
From creative endeavors like what Splash is doing with music today, to the impact these new models, generative AI and whatever comes after it, are going to have on healthcare, medicine, and therapeutics. Video is another great example of this as well, so I think there's a lot of growth still to be discovered and a lot of inspirational moments still ahead of us, right? And so what we at the Annapurna Labs and AWS team are thinking about is, well, how do we make our developer stack more accessible to more people? How do we make Trainium and Inferentia more valuable to this discovery of different ideas, right?
So we have our developer stack called Neuron, and we've been really focused on how to make it better and more accessible to developers at all levels, right? Maybe you're an ML developer looking to use AI building blocks, models from some of the resources our presenters talked about today, and you want great off-the-shelf resources that make it easy to integrate AI into your applications and hit your performance goals without having to get in there and optimize the models yourself. Or maybe you're a researcher, like the team at Splash, who decided there wasn't an open-source resource or model that fit their use case and needed to design something from the ground up, right?
For that, we're really investing in resources at the framework level: we want to make native PyTorch support really robust and give you the full ecosystem around PyTorch to leverage. So starting this week, we're introducing our new PyTorch-native support for Trainium. This will let you take any PyTorch code you're running today, or libraries running on top of PyTorch, change the device description from CPU or GPU to Neuron, and deploy directly on Trainium and Inferentia, taking advantage of the performance and cost savings we can offer very easily, along with the PyTorch ecosystem: FSDP, PyTorch eager mode, torch.compile, and many of the other features there, and then plugging into the ecosystem of observability tools like PyTorch Profiler, Weights & Biases, and many others.
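The idea is that targeting Trainium becomes a one-line device change in otherwise standard PyTorch code. Purely as an illustration of that idea, and not the released API: the device string used below ("neuron") is an assumption, so check the Neuron documentation for the actual naming and any required plugin import.

```python
import torch
import torch.nn as nn

# On CPU/GPU this would be "cpu" or "cuda"; with the new PyTorch-native
# support the intent is that you only switch this string to the Neuron
# device (assumed to be "neuron" here) on a Trainium instance.
device = "cpu"  # e.g. change to "neuron" when running on Trainium

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 8)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device=device)
target = torch.randint(0, 8, (32,), device=device)

# The training step itself is unchanged from standard PyTorch code.
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()
```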
And then for performance engineers, Matt touched on a lot of the features we're building here with our new profiler capabilities coming through Neuron Explorer, which really gives you unparalleled access into what's happening and how your models are performing. We're going to keep investing in that space as well, with more access through our Neuron Kernel Interface, expanding access into the ISA of our devices. And then of course there's our optimized kernel library, the Neuron Kernel Library, which is also launching this week. These are pre-optimized kernels our team is developing. Some of the GPT-OSS numbers we shared earlier were achieved with kernels we built, and we're going to be open sourcing and sharing all of those as well.
And what underpins our entire developer stack, across all the different types of developers we want to engage with, is open source. We want to make it easy and accessible. We want to build with you in this room and with everyone out there as well, so we're committing to open sourcing our entire software stack. Today that includes our Neuron Kernel Interface, our library of kernels, and the NKI compiler, plus all our plugins for PyTorch, vLLM, Hugging Face, and others, and over time our entire software stack, including our core graph compiler as well.
With that, we have some time for questions, but I also wanted to say, we're sitting at the end of Tuesday, and I thank you so much for being here. It's past 5:00 on a Tuesday, and I expect you all have evening plans as well, but thank you for joining us throughout today. We still have two more exciting full days at re:Invent. We have a lot of sessions, and if you're inspired by this one, we have a workshop tomorrow, I believe it's AIM 309, so if you want to get hands-on with Inferentia and Trainium and take your knowledge to the next level, it's a great opportunity. We'll have all our Neuron experts in the room, and you can ask them a ton of questions as well.
And then of course, join us on Thursday for more deep dives and innovation talks. Join us for AIM 201, where we're going to peel back the onion a little bit and talk about the innovation that went into our software stack, our chips, and our server design. That one will also feature two of our partners, who will talk about how they've uncovered that innovation and utilized it in inference and training along the way.
And then lastly, and probably most importantly, come build with us. We're super excited to build this with you. We want everyone to leverage Inferentia and Trainium to build the next big thing, to build unique new experiences for their customers, and to do it at lower cost and with easier access to compute. We don't want this to be compute-limited anymore. So with that, I'd like to thank everyone for joining us today.
; This article is entirely auto-generated using Amazon Bedrock.