
Kazuya


AWS re:Invent 2025 - End-to-end foundation model lifecycle on AWS Trainium (AIM351)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - End-to-end foundation model lifecycle on AWS Trainium (AIM351)

In this video, AWS presents the end-to-end AI model lifecycle optimization using Trainium chips. Cameron introduces Trainium3, featuring 4.4x more FP8 compute and 5.4x higher concurrency than Trainium2, with Anthropic deploying 500,000 chips in Project Rainier. Matt demonstrates Optimum Neuron for Hugging Face integration, the new Neuron Explorer profiler, and NKI for kernel optimization, achieving 50% speedup on attention modules. Randeep from Splash Music showcases their Humming LM model, which reduced training costs by 54% and increased throughput by 8.3% on Trainium, enabling interactive music creation in 15 seconds. AWS announces open-sourcing their entire Neuron software stack, including native PyTorch support and the Neuron Kernel Library, making AI development more accessible across different developer levels.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Optimizing the AI Model Lifecycle on Trainium

Thanks for joining us at the end of the day. I imagine I'm holding you back from your dinner plans and happy hours, but thanks for joining us and meeting us all the way out here at the MGM as well. I think we have a great talk today and I'm pretty excited for this one. What we want to look at today is the end-to-end model life cycle and how we can optimize every part of that. We're going to be looking at how we do that on Trainium. Today, I have a couple of guest presenters with us to bring this topic to you: Matt McClean, who leads our customer engineering team at Annapurna Labs, and we also have Randeep Bhatia, the CTO of Splash Music, who's going to bring a lot of color and excitement to the end of this presentation.

Thumbnail 60

So jumping in and starting from the beginning, let's look at the circle of the AI model life cycle. We generally start looking at applications from generative AI models or agentic workflows, and that's a big deal right now, especially at AWS. How are we going to optimize for these agentic workflows when the compute demands and the cost demands are getting so high? What I wanted to look at with you today is how we optimize every level of this and what this AI model life cycle is that we should be looking at, and what are the different parts of this puzzle that we can optimize for.

This probably looks very familiar to all the builders here in the room today. You typically want to start with an idea, a business problem, an application that you want to build a model for to serve those needs. So you're going to start with your use case discovery, prioritization, and some type of qualification of what the use case you're driving towards is. Next, you're going to look at the data you have. You're going to look at open source datasets. You're going to look at your proprietary data and what your value add is as well.

Then you're going to move on to model selection. You're going to think about all the different types of models that could solve this problem and what model is going to be best for that use case. You're going to adapt that model and make it very specific to your users and their applications. Then you're going to start evaluating. You're going to do some offline evaluation, then you're going to optimize it for deployment. You're going to then reevaluate it under your production metrics. Is it serving fast enough? Is the latency meeting your customer's expectations?

Thumbnail 180

Then you're going to deploy for scale. You're going to scale it up, you're going to roll it out, you're going to test it with that audience, and hopefully you're going to start making some money on this one as well. And then ultimately you're going to rinse and repeat. You're going to build on to the next application and the cycle continues. It never really ends. But there are key parts of this circle of life that we want to talk about today, and that's really looking at all of these pink levels from model selection down to optimizing for your production environment.

Addressing AI Bottlenecks: Cost, Iteration Speed, and Deployment Challenges

When we look at the cost of building and deploying models, this is where you're going to spend the most money. The reason for that is the decisions you make along these points on this axis, on this part of the circle, are really going to dictate the business value metrics you're going to see at scale. So if we can optimize the circle of life at this level first—choosing the right model, choosing the right hardware to deploy that model to, choosing the right libraries and techniques to adapt that model in the most efficient way—we're going to be able to optimize all the way through to our deployment life cycle and then scale it up.

Thumbnail 230

When we take those AI bottlenecks out of this, we want to be able to iterate quickly on the front part of our model life cycle, which is choosing the right model and typically being able to leverage a lot of the resources out in the open source community to do this. We want to find and overcome slow iteration and fine-tuning processes and accelerate them. We want to manage our rising compute costs: every extra trip around that circle that happens because we weren't optimizing at each of these steps adds to our cost, and to the cost of the application that we then need to recoup in the market.

We want to manage our compute costs, be able to deploy on hardware, and think about compute scarcity as we get closer to our deployment environment and start to scale up. But before we get to scale and deployment, we want to make those considerations early in the model life cycle as well, from selecting our model and how we're adapting it, all the way through optimizing it for deployment.

Thumbnail 320

We're going to be considerate of whether this is going to meet my customers' requirements. Can I serve this fast? Can I serve it cheaply? Can I serve it at scale? All of those things that we want to talk about today are how to overcome those challenges. We want to think about overcoming those bottlenecks with training. We're going to be focusing on how we create faster iterations. We're going to be looking at how we select from open source models so that we don't have to pre-train models and can adapt those models more quickly.

Thumbnail 380

Choosing the right hardware not only reduces our cost to deploy and build these models utilizing Trainium, but also enables us to serve them from a deployment perspective and provides easier access to compute. We're also optimizing for lower inference times and total latencies or the objectives that you're looking for, which could be high throughput, low latency, or total end-to-end throughput. We're going to be touching on all of these throughout this presentation. I wanted to start with what we have at AWS. At AWS, I'm sure with the other sessions you're visiting today and all of the other keynote messaging, we have a full stack for you.

AWS's Full Stack Approach and the Evolution of Purpose-Built Silicon

If you're developing at a higher level of the stack, where you want to take models from our agentic workflows through Amazon Bedrock or utilize the SDK for agents using AgentCore and other supporting libraries, it makes it really easy to start developing those more complex generative AI and agentic workflows rather than starting from scratch. As you go down the stack and want to own more of it yourself and develop it in your environment, perhaps because you have data residency issues you need to be cognizant of, or because you want to optimize at every level of the model lifecycle, own all of the data, and ensure the data lineage is clean throughout, then you'll want to take an infrastructure-first approach where you're doing this on your self-managed instances.

Thumbnail 460

Today we'll focus on training and inference as part of that. This is probably a good segue to Trainium and Inferentia. How many of you have heard of Trainium and Inferentia? That's fantastic. That's more hands than I'm used to. How many of you have tried Inferentia and Trainium for your applications today? Well, we got a couple. That's wonderful. For those of you that are not super familiar, we're part of Annapurna Labs. We're the hardware division of AWS and we build purpose-built silicon that serves our customers' most challenging problems.

Thumbnail 510

Thumbnail 520

Thumbnail 530

The first product that Annapurna brought to market was Nitro for network acceleration and hypervisor solutions. The next product was Graviton, bringing general purpose compute with lower cost and higher performance, and of course Inferentia and Trainium. We've been doing this for a very long time, now over ten years. In that ten years, we've produced and brought to market many chips across our three product families. Now with Trainium 3 launching this morning in the keynote, this is our fourth generation ML chip. We've been building these solutions and focusing on Inferentia and Trainium for a few reasons.

Fundamentally, when we were working with our customers here at AWS, we noticed the types of bottlenecks our customers were facing. First, performance. As the models get more complex and we expect more out of our models, as agentic workflows now expect ten times or even one hundred times more token generation requirements or when training these models more efficiently, we wanted to provide high performance options for both training and inference. We also wanted to make it lower cost for our customers, and investing in our own silicon made it more achievable to deliver those cost savings to customers.

Thumbnail 600

Lastly, we want to make it more accessible. We want to democratize AI. What that means is it shouldn't just be the ones with the deepest pockets who get access to the compute they need. We want to give customers choice and invest in these technologies to build our own AI chips for these reasons. Over the last six years, we've now introduced our fourth generation chip, Trainium 3, and we're super excited about it.

Thumbnail 640

Thumbnail 660

Trainium3 Launch: Breakthrough Performance and Scale with Project Rainier

Trainium3 is built to inspire the next generation of agentic models, MoE models, and video generation models, while our other generations like Trainium2 and Inferentia2 remain great accelerators for small models and embedding models as well. So there's still a lot of opportunity to use the entire product portfolio. But with our time today, I'm just going to introduce Trainium3. With Trainium3, we've doubled our compute on chip, we've increased our memory capacity on chip by 50 percent, and we've increased the memory bandwidth by an additional 1.7 times. What this really means is that when we put it all together into our second generation Ultra server, we're giving you 4.4 times more FP8 compute than we made available in our last generation. We're also giving 3.9x higher memory bandwidth and 3.4x more memory capacity at the full Ultra server level, and we've innovated at every level of server design as well.

Starting from the chip, we also redesigned our compute sleds and the servers themselves, bringing in a better switching matrix with our introduction of Neuron Switch. What this allows is for all 144 chips to communicate with each other at very low latency and high bandwidth. This is especially important as we move into more complex AI workloads where we're using multiple models together to produce your final output. It allows you to scale, for example, large MoE models very efficiently across all of these chips at extremely low latency.

Thumbnail 730

Now, all the specs on the slide are always kind of dry. What does this really mean from a performance perspective? There are two ways to look at what this performance really means. The bottom line in blue here is our Trainium2, our previous architecture, and the light blue line above it is our Trainium3 Ultra server, and we're looking at GPT-OSS and serving this at scale. The dimensions of this chart are interesting. On the bottom, on the horizontal axis, we have interactivity or tokens per second, how many tokens can an individual user get? On the y-axis, we're looking at how many users can I serve or the concurrency of that model.

Thumbnail 800

With Trainium2, you're able to deploy your workloads today and get really responsive models, and you're able to do this with medium concurrency. So we can serve these large models, we can do it with multiple users at a time and get good performance. So if you have workloads today that are utilizing models like this, you can deploy them directly on Trainium3 and immediately serve 5.4 times more users with that same exact interactivity. Each user will get the same experience, and now you can serve more customers, bringing down your total cost and being able to expand to more workloads.

Thumbnail 820

Thumbnail 840

The other way to think about this is, well, I want to increase the number of features and performance I'm giving each of my users and each of my customers. I want to give them deeper reasoning. I want to provide them content from more models as I'm building this out. The other way to think about it is now we go horizontally. With the same exact concurrency I serve, now I can generate more tokens per user per chip. What this allows me to do is deeper reasoning, more complex models, more turns inside of my agentic workflows, serve and generate more tokens without increasing the cost or reducing the user experience to my end customers. We can deliver over 6x the number of tokens with Trainium3 Ultra servers than we could with Trainium2 Ultra servers.

Thumbnail 870

We're really excited about that. We're also excited to see Anthropic and our other lead customers leaning in and sharing that excitement as well. As you may have already heard through other parts of re:Invent this year or in previous news, Anthropic has been ramping up their Trainium usage over this last year. At re:Invent last year, we launched Project Rainier. It was an ambitious project at the time where we wanted to launch the largest compute cluster for Anthropic to take advantage of. We're pleased to say that it is now in full production. We had a stated goal at the time of deploying hundreds of thousands of chips for Project Rainier.

We ended that with 500,000 chips deployed in Project Rainier and a million chips already deployed, serving all of our customers this year, and we're on track to exceed that already with Trainium 3.

Thumbnail 930

Beyond LLMs: Descartes and Splash Music Pioneer New Applications

The other really exciting part of this is not just for LLMs. I think that LLMs have a lot of excitement going on right now, but there's also a lot of innovation happening in the computer vision space or in the video-to-video space. One of our lead partners in this area is Descartes. Descartes is a startup focused on building more visually exciting experiences with their models, and they have a lot of innovation there as well. They're actually doing a session tomorrow, and I'll share the session details at the end; I highly encourage everyone to take a look at them.

We worked with them ahead of our Trainium 3 launch today to see what they could do with Trainium. They had been deep GPU experts for a very long time and built a lot of kernels to accelerate their models and innovate on top of a GPU platform. When we introduced Trainium to them, they were really excited, and it really met the kind of architectural criteria they had in mind as well. In under two months, they were able to optimize their workflows and models, bring them to Trainium 3, and experience higher performance and lower latencies than they were able to achieve with their previous architectures.

Thumbnail 1040

We have a session tomorrow where one of the co-founders will be presenting. It's actually a brother pair: Dean is the CEO and Oren is the CTO. He's going to be presenting tomorrow and going into a lot of depth. I highly encourage that. We also have the Descartes demo running in our expo booth, so definitely check that out as well. And of course, Splash Music. We also have the privilege of having Splash Music's CTO, Randeep Bhatia, here with us as well. He's going to be going into more details on their use case, and I'm super excited for everyone to hear about that.

Thumbnail 1060

Model Selection Strategy: Balancing Intelligence and Cost with Open-Weight Models

With that, I wanted to pass it over to my colleague Matt to talk a little bit more on the libraries. Thanks, Cameron. What I'm going to do is kind of double click on a few of these stages to give you a little bit more details on how you should approach this, especially with AWS Trainium and Inferentia. Cameron gave you an overview of the AI model lifecycle.

Thumbnail 1080

The first stage is model selection. That's when you want to select the model that you want to use. There are many different models to choose from. Here is a benchmark from a website called Artificial Analysis AI, and you can see the highest-intelligence models. Intelligence is essentially a key metric for selecting your model. The models in black are the proprietary models. These are models typically only accessible through an API. These are the likes of Claude from Anthropic, Gemini from Google, and GPT-5 from OpenAI. These are awesome models, really high on the intelligence scale.

But what you may not know is that open-weight models, models that you can download from Hugging Face, are actually quite comparable. This is a fairly recent benchmark, just from last week, and the Kimi K2 model is actually very close to the top-performing Gemini 3 model. The key message here is don't exclude open-weight models. They're actually very competitive and have very high intelligence compared to these proprietary models.

Intelligence is one criterion. Another key criterion is cost, because especially if you're deploying, you're going to be making a lot of inference calls to these models. Doing a small proof of concept is one thing; deploying at scale is another, and there cost becomes a key criterion. This chart shows the intelligence scale on the vertical axis and the actual cost to do inference for these models on the horizontal axis. Proprietary models are great on the intelligence axis, but they tend to be on the right-hand side of the cost to run, so sort of in that top right quadrant.

Thumbnail 1160

What you should be thinking about and the ideal models is essentially the top left corner. That is high on the intelligence scale but lower on the cost. An interesting thing here is that the only models that are present in this top left green quadrant are actually the open-weight models.

Models such as GPT-OSS, MiniMax M2, and DeepSeek are in that most attractive region. The good thing is that all of these models can actually be fine-tuned and deployed on AWS Trainium and Inferentia. So we've selected our model. Now typically the next stage is adapting the model. In technical terms, we often refer to this as post-training, where you're doing things such as supervised fine-tuning or using reinforcement learning to really adapt the model to your specific use case.

Thumbnail 1230

Thumbnail 1250

Model Adaptation with Optimum Neuron: Streamlined Fine-Tuning on Trainium

One of the most popular open-source libraries for doing this post-training is the Hugging Face Transformers library. Are there any users of Hugging Face Transformers here? Alright, so we have a few users. We've actually collaborated with Hugging Face on an optimized version of the Transformers library for AWS Trainium and Inferentia called Optimum Neuron. Essentially, it provides you the same APIs you're familiar with in the Transformers library, but under the hood, we've done a lot of optimization. We've integrated very closely to our software stack and we've integrated a whole bunch of kernels to make sure that the performance is really good when you're fine-tuning or deploying these models.

Thumbnail 1300

This is just showing a snapshot of the landing page and documentation page where you can go and find more information on this library. The steps in terms of when you're doing your post-training using Optimum Neuron are essentially these four steps. You're going to load and prepare your datasets. You're going to fine-tune your model using LoRA. LoRA is a very efficient way to fine-tune a model. It uses a lot less memory than standard full model fine-tuning. Step three is you want to consolidate your LoRA adapters into the open weights, and then finally, optionally you can push up into the Hugging Face Hub.

Thumbnail 1330

If we go into each one of these, loading and preparing the dataset is nothing specific to AWS Trainium. This is what you would have to do on any sort of fine-tuning project. You have a choice of over 400,000 datasets. These are all up in Hugging Face Hub that you can choose from, or typically in a business you would actually take your proprietary data and use that in order to generate things that are specific for your business. Once you've selected the dataset, the next thing you need to do is format it depending on the kind of application.

Thumbnail 1380

For example, if you're building a chatbot kind of application, you want to format the dataset into an instruction format so you can tell the model the types of data that will be sent to the model and the way it has to respond. So now we get into it. Once we've prepared the dataset, we're ready to fine-tune and launch our training jobs. Here you can actually take existing scripts you've used on other accelerators and basically bring them across because the code changes are essentially minimal. Here I'm highlighting essentially the only kind of classes you'd need to replace. This is replacing the standard Transformers classes with Optimum Neuron classes, and then things should just work.
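To make this concrete, here is a hedged sketch of what such a script could look like, loosely following the Optimum Neuron documentation; the exact class and argument names (NeuronSFTConfig, NeuronSFTTrainer, tensor_parallel_size) may vary by library version, and the dataset and model IDs are just placeholders, not what was shown on the slide.

```python
# Hedged sketch of LoRA fine-tuning with Optimum Neuron; names may differ by version.
from datasets import load_dataset
from peft import LoraConfig
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer  # drop-in for TRL's SFTConfig/SFTTrainer

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")  # placeholder dataset

def to_instruction_format(sample):
    # Flatten each record into the instruction/response prompt a chatbot would see.
    parts = [f"### Instruction\n{sample['instruction']}"]
    if sample.get("context"):
        parts.append(f"### Context\n{sample['context']}")
    parts.append(f"### Answer\n{sample['response']}")
    return "\n\n".join(parts)

lora_config = LoraConfig(  # LoRA keeps memory use far below full fine-tuning
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)

args = NeuronSFTConfig(
    output_dir="my-finetuned-model",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    tensor_parallel_size=8,  # assumption: shards the model across NeuronCores
)

trainer = NeuronSFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # placeholder open-weight model
    args=args,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=to_instruction_format,
)
trainer.train()
```

The key point is the one highlighted in the talk: apart from swapping in the Neuron-prefixed classes, the script looks like an ordinary Transformers/TRL fine-tuning job.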

Thumbnail 1410

So we've trained our model and we've got our LoRA adapters. Now we want to consolidate everything together. Here are a couple of different options. The first option is to use the Optimum Neuron CLI. This will essentially consolidate the LoRA adapters into the open weights downloaded from Hugging Face and bring them all together. That's one option. The other option shown below is just standard Python code where you have a little bit more control over how that consolidation will work.
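As an illustration of that plain-Python path, here is a hedged sketch using the standard PEFT merge pattern rather than the exact code from the slide; model names and paths are placeholders.

```python
# Hedged sketch: merge LoRA adapters into the base weights and save one consolidated checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16  # placeholder base model
)
peft_model = PeftModel.from_pretrained(base, "my-finetuned-model")  # path to the LoRA adapters (assumption)
merged = peft_model.merge_and_unload()  # folds the low-rank updates into the base weights

merged.save_pretrained("my-finetuned-model-merged")
AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B").save_pretrained("my-finetuned-model-merged")

# Optionally share the consolidated model on the Hugging Face Hub (placeholder repo id):
# merged.push_to_hub("my-org/my-finetuned-model")
```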

Thumbnail 1440

Thumbnail 1470

So now we've consolidated all into one package. Now you can either deploy, or if you want to actually share this newly fine-tuned model to, for example, other users, then you have the option of pushing it back up into Hugging Face. Here's just a code snippet showing you how you can actually do that. Alright, so we've adapted our model and we've done the fine-tuning. Now often what you can do is you want to optimize the performance of your model, and this is what we'll look into now.

Thumbnail 1480

Performance Optimization: Neuron Explorer and Neuron Kernel Interface in Action

So before we show how you can do this, just to level set on a few base practices and principles. Essentially, performance optimization is all about trying to maximize the utilization of our AI chips. When you're talking about AI chips, that typically means you want to ensure that it's what we call compute bound. It's basically using all of the available FLOPs in the accelerator to the maximum extent possible. There are a few different ways you can do this.

One way is to pipeline operations. This means that while you're performing one operation, such as a matrix multiplication, you can simultaneously load the data for the next operation in parallel. You can also save the data for the subsequent operation at the same time. This ensures that your compute engines are always being utilized.

The second principle is to minimize data movement. While the bandwidth between different memory hierarchies is fast, you typically want to keep memory on the chip as much as possible. For your activation tensors, you want to keep them in the chip's SRAM, which has a small amount of very high bandwidth but limited capacity, typically in the tens of megabytes. Keeping your activation tensors there saves you from doing many reads and writes back to your high bandwidth memory.

As well as minimizing data movement, when you do have to move data, you want to maximize the throughput. This typically means ensuring that your read and write operations use large chunks of data rather than small amounts. This ensures that your bandwidth is maximized. Finally, when running inference on your models, you're typically running large models with tens or hundreds of billions of parameters. This means the model can't fit on a single accelerator, so the model has to be sharded across multiple accelerators. Those accelerators have to communicate with each other, and that communication is what we call collective operations. One key core principle is to ensure that your collective operation time is less than the time it takes to perform computations such as matrix multiplications.
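One way to reason about the compute-bound versus memory-bound distinction above is a simple roofline-style estimate. The sketch below uses made-up peak FLOPS and bandwidth figures, not official Trainium specs, just to show the arithmetic.

```python
# Back-of-the-envelope roofline check for a matmul: compute-bound or memory-bound?
def is_compute_bound(m, n, k, bytes_per_elem, peak_tflops, hbm_tb_per_s):
    flops = 2 * m * n * k                                       # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)      # read A and B, write C
    arithmetic_intensity = flops / bytes_moved                  # FLOPs per byte of HBM traffic
    ridge_point = (peak_tflops * 1e12) / (hbm_tb_per_s * 1e12)  # FLOPs/byte where compute and bandwidth balance
    return arithmetic_intensity > ridge_point

# A batch-1 decode matmul is memory-bound; a large prefill matmul is compute-bound.
print(is_compute_bound(1, 4096, 4096, bytes_per_elem=2, peak_tflops=1000, hbm_tb_per_s=3))     # False
print(is_compute_bound(4096, 4096, 4096, bytes_per_elem=2, peak_tflops=1000, hbm_tb_per_s=3))  # True
```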

Thumbnail 1630

So how can you actually know if your model is optimized for AWS Trainium and Inferentia? We've launched this week a new tool called Neuron Explorer, which is our new profiler. This provides you with different levels of hierarchy, so you can view your model from a high level, looking at each layer, module, or operation, right down to the low-level instructions running on the hardware. We provide it via a web application or as an integration like a plugin to VS Code. Very soon we'll have integrations for system profiling, which means you can also view how your host CPU and memory are performing as well as your device.

Thumbnail 1690

Let's say we've found a bottleneck in our code. What can you do about it? How can you actually improve the utilization of your AI chip? This is where our Neuron Kernel Interface, or NKI for short, comes in. This provides you full control and programmability of our AI chips. We provide it as a Python DSL, and you can basically write low-level ISA commands, which are the machine instructions that your AI chip will execute. This gives you control of things such as your memory layout, how you do tiling, how you do scheduling, and allocation of your model. For example, Descartes used NKI to optimize their video-to-video model to get the best performance possible.
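To give a feel for what NKI code looks like, here is a minimal element-wise kernel modeled on the public NKI getting-started examples; the module paths and decorator name are assumptions that may differ across Neuron SDK versions.

```python
# Hedged sketch of an NKI kernel: load tiles into on-chip SBUF, compute, store back to HBM.
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # Output tensor allocated in device HBM; inputs are assumed to fit in a single tile
    # (partition dimension of at most 128 on current hardware).
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    a_tile = nl.load(a_input)         # HBM -> on-chip SBUF
    b_tile = nl.load(b_input)
    c_tile = a_tile + b_tile          # element-wise add on the on-chip compute engines

    nl.store(c_output, value=c_tile)  # SBUF -> HBM
    return c_output
```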

Thumbnail 1740

Thumbnail 1770

Now we're going to have a demo of the Neuron Explorer, which is a new tool just launched this week. Here is a profile of a large language model decoder that hasn't been optimized. On the summary page, on the top left, we can see some overall FLOPS utilization and how compute efficient our model is. We have bars showing the utilization of the different engines within the chip. We have a tensor engine, a scalar engine, a vector engine, and a general-purpose SIMD engine.

We have some recommendations that tell you things it's picked up and give you some next steps for how you can alleviate some of these bottlenecks. We also have information on the collective operations, such as their sizes, so you can spot outliers that you can dive into. We also have memory bandwidth utilization, and typically we want to keep this as high as possible. It's quite low in this particular case, so that's another indication that there's work we can do to optimize this particular model. That's one view. Another view is a more detailed view of the same model.

Thumbnail 1810

Here we have our hierarchical view. So we can essentially go from each layer in our model, we can dive into it and then within each layer we can see, for example, the major components. Typically these are things like an attention component and an MLP component, and within each of those, we can break them down into the specific operations down to, for example, a matrix multiplication, an add, or a value.

Thumbnail 1860

On the right-hand side, top right, you'll see a similar sort of view, right down to the operation level, and we can see how efficient each of those operations is on our hardware. Then on the bottom left, we can actually see how this all maps down into the specific engines running on our AI chip. For example, we can see the tensor engine utilization, the vector and scalar engine utilization, and so on for every engine. We can also see the collective operations: when we have to communicate between chips, we can see how long each of those operations takes.

Thumbnail 1870

Thumbnail 1880

Thumbnail 1890

Thumbnail 1900

We can also see our memory bandwidth utilization, so how the memory is being moved around on the chip and how efficient that is through the low-level device view. So what we can do is also look at a specific module. For example, let's take the attention module, and we can actually look and see how long the latency is for that particular operation. So here we're taking two markers and we're seeing that it takes around 125 microseconds to run the attention part of our model. So that's one data point, so we can use that as our baseline.

And then what we're doing now is we're actually looking at an optimized version of that same model, but this time we're using a Neuron kernel. So again, we have that hierarchical view that I mentioned before. We can go down and see down to the low-level operations. We can again mark, for example, the time, the latency of the particular component. And here you can see that we've actually sped it up. So we've gone from 125 microseconds down to around 79 microseconds. So we have about a 50% speed up on this particular model.

Thumbnail 1940

We also have a new feature, which is a code view. So we can actually map the specific instructions running on the hardware back to our NKI code. This is our kernel code, which is showing on the top right. So on the top right is our NKI code written in Python, and we can highlight an area in our device profile and actually see which lines of code in our Neuron kernel are responsible for that instruction. So what you can do is point back to a line of code and then you can go back to your code, optimize it, try something new, and see the effect on the operation or the speed up of that particular component and how it maps to the low-level device instructions.

Thumbnail 1990

Thumbnail 2000

Production Deployment at Scale: vLLM Integration with AWS Trainium and Inferentia

So that's the demo of the new explorer tool. Once we've optimized the model, now we're ready to actually deploy it into production. So this is the deploy and scale phase. One of the most popular libraries used for deploying, especially foundation models, is vLLM. It's a really popular open-source library designed for high throughput, low latency LLM serving. And it does this through various different mechanisms. It has a really efficient KV cache management mechanism, doing things such as PagedAttention, and also really efficient batching, using concepts such as continuous batching.

Thumbnail 2060

It has a very open, vibrant open-source community with folks from Red Hat and many other companies contributing to it, and it has a lot of model support. So popular models like Llama, GPT-OSS, DeepSeek, and the most popular open-weight models are supported by vLLM. And now it's part of the Linux Foundation, so there's a lot of collaboration happening. Here's just an example of a code snippet of how you would actually use vLLM. It's really simple to use. You set up things such as your sampling parameters, like your temperature. You configure how you want to shard your model, meaning how many accelerator chips you want to shard your model weights over. In this case, we're using tensor parallelism to shard our model across two cores. And then we can basically call generate to generate some outputs.
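The slide's snippet itself isn't reproduced in the transcript, but a hedged reconstruction using vLLM's standard Python API (the model ID and prompt are placeholders) looks roughly like this:

```python
# Hedged sketch of basic vLLM usage: sampling parameters, tensor-parallel sharding, generate.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# tensor_parallel_size=2 shards the model weights across two accelerator devices.
llm = LLM(model="openai/gpt-oss-20b", tensor_parallel_size=2)

outputs = llm.generate(["Explain continuous batching in one sentence."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```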

Thumbnail 2090

So vLLM is fully integrated with AWS Trainium and Inferentia, and this is done via the vLLM Neuron plugin. So we have support for some of the most popular open weight models: Llama, Qwen, GPT-OSS, and Mistral. We've integrated a lot of kernels, so our team has been busy writing kernels so you don't need to write them for the most popular open-source models.

Thumbnail 2130

There are techniques such as Flash Attention, which is a popular approach to accelerate the attention component of a model. We also have other kernels such as Fused QKV to get better support and performance. Additionally, there are other features such as Speculative Decoding, which is another way for you to speed up your model to generate tokens much faster during the decoding phase.

Now I've explained the end-to-end lifecycle in a hypothetical way. I'd actually like to invite Randeep, who is going to show you a concrete use case of how they've managed the end-to-end model lifecycle using AWS Trainium and Neuron. Thanks, Randeep.

Thumbnail 2180

Splash Music's Journey: Building Humming LM and Revolutionizing Interactive Music Creation

Hello, everyone. I'm Randeep Bhatia, CTO for Splash Music. Today with AWS, we are launching something new. Not a product, not a platform, but a new way of making music: a completely new format that is going to be interactive. We call it Remix.

Thumbnail 2200

Wait, before I explain, I'm getting a call. This person usually doesn't call me unless there's an emergency. I hope you don't mind. I'm going to take this and FaceTime it.

Thumbnail 2220

Thumbnail 2230

Thumbnail 2240

Hello. Hey, Randy, I know you're talking at the AWS re:Invent conference today. Oh, wow, you're on stage right now. Hey, everyone. Sorry for interrupting. I just wanted to give you a call and let you know that I sent you a good luck mix. I just texted it to you. Check it out.

Thumbnail 2290

Thumbnail 2300

Thumbnail 2310

All right, we'll talk later. Bye. That's a really awesome good luck mix to receive from a coworker right in the middle of a presentation. Perfect timing as well. So what I'm going to do is respond to the message she sent me. I'm on this stage, and I'm going to show how to co-create.

Thumbnail 2320

Thumbnail 2330

Thumbnail 2340

Done. All I did was sing my melody with the lyrics and vocals and combine it into a single musical composition that we both created together. It's interactive and fun. It's music, it's a vibe, and it is magical.

Thumbnail 2350

Thumbnail 2360

So what makes music interactive? Music is one of the ways that people love communicating with each other. It's the melody that makes us feel connected. Gen Z and Gen Alpha have been communicating with each other through text messages, through videos, through images. But we wanted to change that. We wanted to bring in a new format of communication through music. That's where we made Remix.

Thumbnail 2380

Thumbnail 2400

Now, in under 15 seconds, you can have infinite variations of compositions that you can add on top of each other. This really changes the way music is actually headed today and is another way of supporting artists. So how do we do this? Let's take a look behind the scenes. When we started this, we asked ourselves why music making is so hard.

People don't really express themselves in prompts. How do you express feelings in prompts? It's impossible.

Thumbnail 2420

Thumbnail 2430

They hum in the shower. They make random noises. Sometimes they tap on the desk while sitting. Our goal was very simple. We wanted to take these everyday moments and turn that into a mode of communication. We wanted to take technology out of it, but we also wanted to do it ethically and at scale.

Thumbnail 2440

Thumbnail 2450

To solve this, we built the first-ever LLM of its kind, called Humming LM. It is the model that takes your hum and listens to it. If you're off-key, that's totally fine. If you're offbeat, that's even better. Because in every single imperfection, we can understand the intent behind your expression. Humming LM is built to understand that intent and make it sound great.

Thumbnail 2510

Thumbnail 2520

Converting random hums into music is not easy; it is a challenging problem. We built massive datasets of people singing, people recording their vocals, getting the right key, the BPM, the melody, and then we built the entire model end to end. We first trained it on GPUs, which is very expensive for any startup. But then we started working with AWS. With the help of the AWS Generative AI Innovation Center team, we experimented on AWS Trainium chips. That cut our training costs by 54%. It increased our throughput by 8.3%. And it gave us 159% better efficiency on the metrics that represent what people can actually hear, which means you get cheaper, faster, better music from your hums.

Thumbnail 2540

Thumbnail 2580

How do you measure "better"? We have metrics that define the quality of the music, so that's how we know it's better. Of course, everybody needs a little inspiration, because you have an idea and you want to take it somewhere. What we have curated are sounds from different artists that represent different intents. Had a heartbreak? We have a sound for that. Are you happy? We have a sound for that. Got a good vibe? We have a sound for that as well. That gives direct attribution back to the artist, and it benefits all the creators.

Now, you might wonder why we didn't use an open-source model to begin with. Open-source models aren't built for melodies. They don't understand the intentions behind the melodies they're given. People usually hum off-key. They start off-note, they drift, and then they express emotions that are not mathematically clean. In music, those imperfections define the intent. These are the clues that make them magical. To create this new form of music, we wanted a way to understand your hum, your emotion, your intent. That's when we built Humming LM, which encapsulates all of these intents and music together.

Thumbnail 2640

So when people ask what we do at Splash, we empower anyone to make music with their favorite artist, with friends, family, and even people from the crowd. We are not an AI music company. In fact, we are not a music company. We started off with a simple goal on how people want to express themselves through music. But to build that, we built the entire system around understanding melody, timbre, structure, and emotion that represents music.

Thumbnail 2690

Across our complete journey, CPUs were too slow to run the algorithms and GPUs got too expensive. So we needed something that could give us availability and scale at the same time. That's when Trainium came into the picture. With Trainium and SageMaker HyperPod, we started our journey with just 4 Trainium nodes. We were able to scale it all the way up to 100 Trainium nodes, and get our model trained and ready to be served across millions of endpoints through Inferentia. We also used other AWS services such as FSx for Lustre for storage and the orchestration around it.

Thumbnail 2720

Thumbnail 2730

Thumbnail 2740

We can take a very simple hum and convert it into a song with your favorite artist in just 15 seconds. You hum, we capture your intent, and that's what you hear. But that's only half of the story. Since music is a form of communication, we want people to add their own voice to these songs. Like how my co-worker sent me a good luck message. I was able to relate with that and send my response back into that message. The capabilities we have built in our player enable you to do it right from that experience. You don't have to navigate or anything. It is as simple as typing a message to somebody.

Thumbnail 2770

Since we launched less than a year ago, we have 30,000 creations on our platform, and they have been streamed over 750 million times. That's almost 10,000 years of listening in less than a year. The only question is, what are you all going to create today? I truly appreciate Edler Willis for having me on the stage here and sharing our story. Thank you so much.

Thumbnail 2830

The Future of AI Development: Open Source Commitment and Building Together

It's really inspirational to see how creative the team has been, and how they were able to leverage both the services from AWS and Trainium to achieve their goals, to lower their costs, and to create something new. In my own world right now, we focus on LLMs as the source of AI. We think generative workflows are going to take over, and to be honest they will. But it's also really interesting to see how AI is going to be applied and utilized in so many different kinds of applications and markets today. We're just scratching the surface of this, really.

Thumbnail 2870

From creative endeavors like music, with what Splash is doing today, the impact of these new models, generative AI, and what's coming after generative AI is really going to reach healthcare, medicine, and therapeutics. Video is another great example of this as well. I think there's a lot of growth still left to be discovered and a lot of inspirational moments still ahead of us. At Annapurna Labs and the AWS team, what we're thinking about is how do we make our developer stack more accessible to more people? How do we make Trainium and Inferentia more valuable to this discovery of different ideas?

Thumbnail 2900

Thumbnail 2920

Thumbnail 2930

We have our developer stack called Neuron. We've been really focused on how we make it better and more accessible to more developers at all levels. You might be an ML developer looking to utilize AI building blocks and models from some of the resources our presenters talked about today; we want to make it easier for you to integrate AI inside of your applications and extend them, with great off-the-shelf resources that achieve the performance goals you want without you having to really get in there and optimize these models. Or you might be a researcher, like the designers at Splash, where you think there's no open-source resource or open-source model that best fits your use, and you really need to design something from the ground up. For that, we're really investing in resources at the framework level with native PyTorch support. We want to make it really robust with the full ecosystem around PyTorch where you can leverage that.

Starting this week, we're introducing our new PyTorch native support for Neuron. This will allow you to basically take any of your PyTorch code you're running today, or the libraries running on top of PyTorch, change the device description from CPU or GPU to Neuron, and deploy directly on Inferentia to take advantage of the performance and cost savings that we can offer you. You can do it very easily and take advantage of the PyTorch ecosystem with FSDP, PyTorch eager mode, torch.compile, and many of the other features there, then plug into the observability ecosystem with PyTorch Profiler, Weights & Biases, and many others.
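As a hedged sketch of that "just change the device string" idea: the exact device name for the new native support is an assumption here, and earlier Neuron releases routed PyTorch through torch-xla's "xla" device instead.

```python
# Hedged sketch: retargeting existing PyTorch code to Neuron by swapping the device string.
import torch
import torch.nn as nn

device = torch.device("neuron")            # assumption: was "cuda" or "cpu" in the original script
model = nn.Linear(4096, 4096).to(device)   # same model code, different target device
x = torch.randn(8, 4096, device=device)
y = model(x)

compiled = torch.compile(model)            # part of the PyTorch ecosystem mentioned above
y2 = compiled(x)
```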

Thumbnail 3010

For performance engineers, we think about them at the bottom level of the stack. Matt touched on a lot of the features we're building here. Our new profiler capabilities coming through Neuron Explorer really give you unparalleled access into what's happening and how your models are performing. We're going to keep investing in that space as well, with more access through our Neuron Kernel Interface, expanding access into the ISA of our devices, and of course our optimized kernel library, the Neuron Kernel Library, which is launching this week as well.

Thumbnail 3070

These are pre-optimized kernels our team is developing. Some of the numbers we shared earlier to achieve the GPT-OSS 120B performance are kernels we built, and we're going to be open sourcing and sharing all of those as well. Our entire developer stack that's available for all the different types of developers we want to engage with is underpinned by open source. We want to make it easy and accessible. We want to build with you, everyone in this room and everyone out there as well.

We're committing to open sourcing our entire software stack. Today that includes our Neuron Kernel Interface kernels, our library of kernels, and our Neuron compiler is also going to be open source. All our plugins for PyTorch and vLLM, Hugging Face, and others will be open sourced. Over time, our entire software stack, including our core graph compiler, will be available as well.

Thumbnail 3110

We have some time for questions, but I wanted to also share that we're sitting at the end of Tuesday. Thank you so much for sitting here. It's just past 5 on a Tuesday, and I expect you have a lot of evening plans as well, but thank you for joining us throughout today. We still have two more exciting full days at re:Invent with a lot of sessions. If you're inspired by this one, we have a workshop tomorrow, AIM 309. If you want to get hands-on with Inferentia and Trainium and take your knowledge to the next level, it's a great opportunity. We have all our Neuron experts in the room, and you can ask them a ton of questions as well.

Thumbnail 3200

Join us on Thursday for more deep dives and innovation talks. We have AIM 201, where we're going to be looking at all the innovation that went into our software stack, our chips, and our server design. This session will feature two of our partners as well, who will be talking about innovation and how they've discovered and utilized Inferentia and Trainium along the way.

Most importantly, come build with us. We're super excited to build this with you. We want everyone to leverage Inferentia and Trainium to build the next big thing. We want you to build unique new experiences for your customers and do it at lower cost with easier accessibility. We don't want this to be compute limited anymore. With that, I'd like to thank everyone for joining us today.


; This article is entirely auto-generated using Amazon Bedrock.
