Kazuya

AWS re:Invent 2025 - AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

In this video, AWS introduces Trainium 3, their next-generation AI training chip designed for agentic workloads and reasoning models. Ron Diamant details the chip's 362 petaflops compute, 20.7TB HBM3E capacity, and innovations like microscaling hardware circuits and accelerated softmax instructions that achieve sustained performance close to peak specs. The Trainium 3 Ultra server scales to 144 chips with NeuronSwitches enabling low-latency all-to-all communications. Jonathan Gray from Anthropic demonstrates real kernel optimizations, showing how they achieve 60% tensor engine utilization on Trainium 2 and over 90% on Trainium 3 while serving the majority of Claude traffic on nearly a million Trainium 2 chips. The presentation covers ease-of-use improvements including PyTorch native support, the open-sourced NKI compiler, and Neuron Explorer profiling tools with nanosecond-level observability.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 40

Introduction: AI's Transformative Impact on Scientific Discovery and Software Engineering

Welcome everyone. My name is Joe Senerchia. I'm the EC2 product manager for our Trainium chips, and I'm super excited to have everyone here. Just a quick show of hands, how many are familiar with Trainium? OK, what about Anthropic's Claude models? OK, a few more. Well, today I'm super excited because we have two experts on both of those things. We have Ron Diamant, the chief architect of Trainium, and we have Jonathan Gray, who's the Trainium inference lead for Anthropic, thinking about optimizing Claude models on Trainium. So quickly, what we have in store today: I'll first walk through how AWS thinks about building AI infrastructure. Then Ron will walk through Trainium and how he built it for performance, scale, and ease of use. Then Jonathan Gray will come up and look at how he actually optimizes different kernels to run on Trainium effectively. OK, great. So let's get started.

Thumbnail 70

So first, why is there so much news around AI? Why is there so much excitement? I think it comes down to the fact that this is really a tectonic shift for how we build, deploy, and interact with the world. This isn't just incremental change. We're seeing new capabilities pop up because AI has enabled them. Andy Jassy said this most recently in a quote: "We are at the beginning of the biggest technological transformation of our lifetime." I think one of the areas that I want to step back and take a look at is where AI is really reshaping scientific domains.

Thumbnail 100

Thumbnail 110

Thumbnail 120

We look at this in things like protein biology, where models can now predict and design new proteins in minutes, which traditionally took hours to do. Or even in mathematics, where models like AlphaGeometry are competing at an Olympiad level and also solving formal proofs. Another area that's become its own scientific domain in and of itself is software engineering, where AI is now a breakthrough force of its own. It can develop and deploy its own code, solve bugs within the code, and even reason across large code bases. This is just the beginning.

Thumbnail 150

Thumbnail 160

Thumbnail 180

Together, these innovations are no longer just supporting scientific discovery. They're actually becoming the engine driving that scientific discovery. Let's take a look at how this is happening in practice with software engineering, as I mentioned. Over time, we saw traditional programming, and over the past few years, we've seen things like code completions and chat-based programming or even collaboration with vibe coding start to take off. As these start to take off in the same time frame, we've also seen benchmarks exceeded on things like SWE-Bench Lite, where models are completing up to 80 percent of real GitHub issues. Or even on harder benchmarks like SWE-Bench Verified, where some models can complete up to 50 percent with full correctness.

Thumbnail 200

Thumbnail 220

What that enables then is the next phase of AI in software engineering, which is deploying agents, agent fleets, or agent clusters so that they can autonomously operate and solve software problems. We don't know the exact shape this future will take, but we can envision a world where we have software developers collaborating closely with an entire fleet of agents. What that does is it opens up a new set of speed and scale for delivering software features and functionality. So taking a step back, why does this all matter? The important part is really that none of this happens without the infrastructure underneath that's powering all of this AI.

Thumbnail 240

Thumbnail 250

Thumbnail 260

Thumbnail 270

Thumbnail 290

AWS AI Infrastructure Stack: A Decade of Silicon Innovation

At AWS, we spent millions and more than a decade building the most comprehensive, deeply integrated AI stack, starting at the top with compute, where we offer a broad portfolio of NVIDIA GPU instances as well as our latest Inferentia and Trainium chips, which offer cost efficiency for AI workloads. At the network layer, we're deploying UltraClusters capable of scaling up to tens or even hundreds of thousands of chips, all connected with the low-latency, low-jitter Elastic Fabric Adapter. Then you have storage, where we've expanded our high-throughput storage options so you can keep those GPUs fed with FSx for Lustre or S3 Express One Zone, giving you access to your data at 10 times faster speeds than before. Then there's security, which matters for all of our infrastructure at AWS but especially for AI, where the Nitro system isolates your workloads to protect customer data. At the very bottom, we offer management services and observability tools like CloudWatch, where you can monitor your nodes to make sure they're healthy and operating efficiently.

Thumbnail 300

Thumbnail 310

All of this comes together as a full stack platform for training and inference frontier models. The reason that we can do this is because behind the scenes, we've been developing silicon for over a decade, right? Whether it's in our Nitro system at the top here, we have over 6 generations available now for offloading virtualization to dedicated hardware for higher performance and stronger isolation of your workloads.

We built Graviton, which is now supporting a multitude of workloads and over tens of thousands of customers. We also anticipated the growth of AI and started building our inference and training chips early with the release of Inferentia in 2019, and we continue to innovate across this full stack, which is really important to drive the next phase of AI infrastructure.

Thumbnail 350

Thumbnail 360

Thumbnail 380

Starting with last year when we announced Trainium 2, we talked a lot about the chip and the specs on the chip, but we also showed that it wasn't just about the chip. It's also about the innovation that we're bringing at the server level and the network level. At the chip level, you have innovations that push compute, like 1,300 teraFLOPS of dense compute. At the server level, we released our first Trainium 2 UltraServer, capable of scaling up to 64 chips across NeuronLink, which has one terabyte per second of connectivity. At the network level, we deployed tens of thousands of these chips all connected with our Elastic Fabric Adapter.

Thumbnail 390

Thumbnail 410

Thumbnail 420

When you look at all that engineering and that end-to-end design coming together, it enables us to do things we had never done before. One example is shrinking the time from when we receive chips from our manufacturer to when we can put them in customers' hands. Over the course of Trainium 2's life, we shrunk that by 70 percent. The result is that it allows us to ramp Trainium 2 four times faster than any other prior AWS AI instance and to a footprint that is 33 times larger in capacity than any other instance, and all of that capacity is fully subscribed.

Thumbnail 440

Thumbnail 470

Emerging Trends and System Requirements: From Reasoning Models to Trainium 3

That end-to-end innovation is what's required to build something like Project Rainier, which is the world's largest publicly announced compute cluster. But as we build more scale, we want to keep our eye on what is scaling next. We look at the trends continually with customers and we look at the industry to see where they're going. Here we'll start to look through some of the trends that we saw over 2025. There's more emphasis on post-training. Reinforcement learning is becoming more important as customers and model developers look to put their models in real environments and get feedback, whether that's virtually generated or actual real environments like robotics.

Thumbnail 480

Thumbnail 500

Then you have reasoning models, where the models take a little more time, trading some latency to reason over multiple steps so they can generate a more accurate response to a deep question. And then coming back to what we talked about with software, you have agentic workloads, where we see multiple agents collaborating autonomously and making tool calls to drive independent solutions for a wide variety of problems. Digging a bit deeper, what is the impact? How does this impact what we're building next for AWS AI infrastructure?

I think it really comes down to a few new system requirements that we're looking at. I say system, not just chip, because this is about the bigger picture. These systems now need to support much longer context lengths; as reasoning models work over longer contexts, context lengths are reaching over a million tokens. We need support for mixture of experts models, which are communication heavy, with sparsely activated experts communicating across the scale-up domain. You also need infrastructure that can be used for pre-training, post-training, and inference, so customers can optimize the compute they have available as they scale each of these independently.

Thumbnail 560

Thumbnail 580

Then the last requirement is support for really high-batch-size, high-throughput systems that can handle lots of concurrent agents operating autonomously. The key theme here is that next-generation AI infrastructure isn't just about compute FLOPS. It's about more than that. It's about having balanced compute, which means more memory, more memory bandwidth, and also a larger scale-up domain so you can support a wide range of expert-parallel designs as those models scale. That's really why we're happy to introduce Trainium 3, the chip built for the next-generation agentic workloads, reasoning workloads, and video generation workloads that are going to drive the compute demand for these next AI systems.

Thumbnail 600

As I mentioned before, it's not just about the chip, it's also about the system. If you caught Matt's keynote, we recently announced our Trainium 3 Ultra servers which scale up to 144 chips.

Thumbnail 640

I won't walk through all these stats, so I'll leave that to Ron, who's our chief architect here. The key thing to remember is that there is innovation at each one of these that drives the capabilities of our next AI systems. So with that, I'll pass it off to Ron to walk through Trainium 3.

Thumbnail 650

Building Trainium 3 for Performance: Balanced Compute and NeuronSwitch Architecture

Thanks a lot, and thanks for being with us today. For the next part of the talk, I'd like to go a little deeper into how we built Trainium 3 and specifically how we built it to be performant, ready for scale, and easy to use. Let's start with performance.

Performance is actually not a single metric; it's a combination of metrics. Of course there's compute floating point operations per second, but you also care about memory bandwidth and memory capacity, and the interconnect that connects between these chips. All of these need to be balanced in order to achieve maximum performance. We actually touched on that in detail in last year's talk, and you have a QR link at the top right. By the way, throughout this talk every time there's an opportunity for offline self-learning there will be a QR link at the top right.

Thumbnail 700

Trainium 3 Ultra servers made significant leaps across each one of these performance dimensions. We got 362 petaflops of MXFP8 compute. I'll explain exactly what that means in a second. That's 4.4x more than what we had with the Trainium 2 Ultra servers. We have 20.7 terabytes of HBM3E capacity, 3.4x more, and 706 terabytes per second of HBM3E memory bandwidth, 3.9x more than the Trainium 2 Ultra servers. We also have a 2x faster interconnect.

Thumbnail 730

I'd like to draw your attention to these switches in the middle of the rack. These are new components we call NeuronSwitches, and they connect the Trainium 3 compute sleds in a full mesh topology. Each sled is connected to every other sled within a single hop. These are the sort of system optimizations that don't come through in the top-level specs, but they absolutely impact real-life workload performance. That's because they give us more flexibility to deploy different topologies, they cut down the latency between each pair of Trainium 3 devices, and they give us really high performance for all-to-all communications.

Thumbnail 770

Thumbnail 800

The reason we care about performance in all-to-all communications, or at least one of the reasons we care, is what we call mixture of experts models, or MOE for short. In such models, we tend to place different experts on different chips and then route a token to the relevant expert in real time in order to do the compute on that specific chip. That requires blazing fast all-to-all communication, which is exactly what the NeuronSwitches provide.

Thumbnail 860

That brings me to my next point, which is peak performance versus sustained performance. In real life, if you think about what I quoted to you a slide ago, those were spec performance numbers or peak performance numbers. But in real life, that's where the performance story only begins. It's not where it ends. A nice analogy for that is who would you bet on winning a marathon race, a sprinter or a marathoner? Obviously a marathoner, right? But if you think about it, the sprinter has a higher spec speed or peak speed. It just can't sustain it over the entire marathon race. So you can see that there are situations where we actually care more about sustained performance rather than short peak performance. In AI chips, it's actually the same. We care a lot about achievable and sustained performance more than specifically some spec number.

Thumbnail 870

Thumbnail 880

Thumbnail 900

Achieving Sustained Performance: Microscaling, Accelerated Softmax, and Real-World Benchmarks

When we started developing the Trainium 3 chip, our software team posed a challenge to us. What would it take to build a chip where the sustained performance is as close as possible to the peak performance? You get every single floating point operation that you paid for. That led us to a list of microarchitectural improvements aimed at giving you every last percentage of performance. I'd like to walk you through a couple of those just to give you a sense of what this looks like and how we're really optimizing these workloads end to end. Let's start with microscaling. The motivation for low precision training and inference is very clear, right? It's pure physics. If you use a smaller data type or a lower precision data type, you can run the compute on smaller circuits and move smaller data around the chip, which leads to higher performance and better energy efficiency.

However, this comes with some important considerations.

Thumbnail 930

Thumbnail 960

For example, if you naively cast from a high precision data type like BFloat16 into a lower precision data type like FP8, you can completely destroy your model. The reason is that BFloat16 has a much higher dynamic range—the range of numbers that it can represent—compared to FP8. This means that large numbers overflow to infinity and small numbers tend to be squashed to zero. We can fix this through a technique called quantization. Here I'm showing you a quantization technique called ABSmax, where we calculate the maximum absolute value in a tensor and then scale the entire tensor so that it exactly captures the entire dynamic range of FP8. This actually works quite well until we reach an interesting case: outliers in the tensor's distribution.

Thumbnail 990

Thumbnail 1020

Imagine a case where one of the elements in the tensor is 100 times larger than the other values—that's the green element right there. We would scale the tensor such that the green element maps to the maximum representable value in FP8, but then all other elements will be squashed to zero or near zero. This means we completely lose the representation capability after casting or quantizing to FP8. We can solve this through a technique called microscaling. With microscaling, we do ABSmax quantization one more time, but this time we do it in small groups of elements.

Here we have one group with the green and yellow elements and another group with the orange, pink, and blue elements. You can see that the green is an outlier—the green element is much larger than any other element. What this causes is that with the first microscaling group, after we quantize, green goes to maximum representable and yellow gets squashed to zero. But in the second microscaling group, we quantize from scratch with a new distribution, and you see that the blue, pink, and orange elements are quantized quite well without any impact from the green element or the outlier. That's exactly what microscaling does, and it has been shown to be very efficient in preserving model accuracy in low precision training and inference.
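To make the idea concrete, here is a minimal NumPy sketch of ABSmax quantization applied tensor-wide versus per microscaling group. It is illustrative only: it rounds onto a coarse integer grid as a stand-in for a real FP8 format, and the group size of four simply mirrors the example above rather than Trainium 3's actual block size.

```python
# Minimal sketch of ABSmax quantization, tensor-wide vs. per microscaling group.
# A coarse symmetric grid (QMAX = 7) stands in for a real low-precision float
# format so the outlier effect is visible at a glance.
import numpy as np

QMAX = 7  # illustrative stand-in for the max representable low-precision value


def absmax_quantize(x, qmax=QMAX):
    """Scale so the largest |value| maps to qmax, round to the grid, dequantize."""
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale


def microscaled_quantize(x, group_size=4, qmax=QMAX):
    """ABSmax applied independently to each small group of elements."""
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(x.shape)


# One outlier (100.0) forces a huge tensor-wide scale and squashes everything
# else to zero; per-group scaling keeps the second group's small values intact.
x = np.array([100.0, 0.8, 0.5, 0.3, 0.2, 0.1, 0.9, 0.4])
print("tensor-wide:", absmax_quantize(x))
print("per-group:  ", microscaled_quantize(x))
```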

Thumbnail 1090

However, microscaling is hard to do because you need to take a tensor, break it into groups, compute the scaling in each group, calculate the scale, apply the scale, and then do all of this in reverse order when you dequantize. What we did in Trainium 3 is build hardware circuits to completely offload microscaling quantization and dequantization. You basically get all the accuracy benefits of microscaled quantization without any overhead on your compute engines, and even though that doesn't appear in the peak numbers I showed you, it absolutely improves your end-to-end workload performance.

Thumbnail 1130

Let's do another example involving accelerated softmax instructions. To give you some background, at the core of most modern AI models is an operator called self-attention. It was one of the breakthroughs in the transformer architecture that makes models like Claude and others work as well as they do today. At the core of the self-attention computation, we multiply two matrices, Q and K here, then compute softmax on the result, and finally multiply that by another matrix, V.
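For reference, here is a minimal NumPy sketch of that computation for a single attention head: one matrix multiplication for Q·Kᵀ, a numerically stable softmax over the keys, and a second matrix multiplication with V. It follows the standard transformer formulation and is not the Trainium kernel itself.

```python
# Minimal single-head self-attention: matmul, softmax, matmul.
import numpy as np


def self_attention(q, k, v):
    """q, k, v: [seq_len, head_dim] for a single attention head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # matmul 1: Q @ K^T
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the exponent
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                              # matmul 2: weights @ V


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 64)) for _ in range(3))
print(self_attention(q, k, v).shape)  # (8, 64)
```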

Thumbnail 1170

Thumbnail 1190

If I show you a timeline, we do a matrix multiplication followed by a softmax operation, followed by another matrix multiplication. If I pipeline that over multiple tiles of computation, you can see that we can get a very clean pipeline where the tensor engine—the engine that is doing matrix multiplication, the most precious resource in the system—is constantly busy at 100 percent utilization. We love that. Now let's apply the previous optimization that I told you about: microscaled FP8. All the matrix multiplications now run way faster, but the overall self-attention computation didn't accelerate by nearly as much, and that's because softmax doesn't leverage FP8. It actually runs at a higher precision, which we need to do in order to keep the accuracy of the model.

Thumbnail 1220

That's a well-known secret in the ML space. If we weren't paying attention to that, we could have encountered a couple of problems. First, despite the nice optimization I showed you before, the end-to-end speed up is not as much as we wanted it to be, and the tensor engine, the most precious resource in the system, is now underutilized.
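A quick back-of-envelope illustration, using made-up per-tile timings and ignoring pipelining, shows why: if the two matrix multiplications get 2x faster but softmax does not, Amdahl's law caps the end-to-end gain, and the gap only closes once softmax itself is accelerated, as described next.

```python
# Illustrative only: arbitrary per-tile timings for a matmul-softmax-matmul
# sequence, serialized for simplicity (the real pipeline overlaps engines).
matmul_time = 1.0
softmax_time = 0.5

baseline = matmul_time + softmax_time + matmul_time                       # 2.5
fp8_only = matmul_time / 2 + softmax_time + matmul_time / 2               # 1.5
fp8_fast_softmax = matmul_time / 2 + softmax_time / 4 + matmul_time / 2   # 1.125

print(f"FP8 matmuls alone:        {baseline / fp8_only:.2f}x end-to-end")
print(f"FP8 + 4x faster softmax:  {baseline / fp8_fast_softmax:.2f}x end-to-end")
```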

Thumbnail 1260

Luckily, our team saw it coming, and as we worked on the microscaled FP8 optimization, we also introduced another list of optimizations to make sure that we always keep the tensor engine running. In this case, it was an accelerated softmax instruction that is able to run softmax 4x faster at the same precision with zero loss of accuracy. That's what it looks like with the accelerated softmax instructions. Now we get the end-to-end speedup that we wanted, and the tensor engine is constantly running at 100% utilization again. Achieved performance is as close as possible to peak performance.

Thumbnail 1280

Thumbnail 1300

Thumbnail 1310

We have a huge list of these optimizations. We actually document them and you can do self-learning online. There's a link at the top right, and all of these optimizations build on top of one another to make sure that you get to use every single floating point operation per second that the Trainium 3 device offers. Let's put it all together. Here we benchmark GPT-OSS, a 120-billion-parameter open-source, open-weight model from OpenAI. On the x-axis, we measure what we call interactivity, which is the per-user experience and how quickly we can generate output tokens. On the y-axis, we measure overall throughput: if the server is serving multiple requesters at a time, what's the overall number of tokens that it can generate per second?

To make it a fair apples-to-apples comparison, we normalize the y-axis to a megawatt. Now we're comparing Trainium 2 and Trainium 3 on even ground to see which one is more efficient. The results are beyond impressive. We can generate 5x more tokens per megawatt with Trainium 3 compared to Trainium 2, and at the same time we're also improving interactivity. We're really proud of these results. We think they will generate real value for you.

Thumbnail 1360

Thumbnail 1370

Thumbnail 1390

Thumbnail 1400

Designing for Scale: Modular Architecture and Project Rainier's Million-Chip Deployment

Let's move to scale. Last year I showed you this graph, and it demonstrates that the adoption curves in the ML space are very different than the adoption curves we're used to from other technologies. This is a typical adoption curve. We have the early adopter phase, then we start ramping up and eventually we get to mass volume. With ML, when we introduce a new technology, just like we're doing today with Trainium 3, we immediately get customer demand to build giant clusters with this new technology and new generation. This required us to build Trainium 3 to be ready for scale from the very first day.

That's exactly where Annapurna and AWS meet and complement each other. We at Annapurna built Trainium 3 for scale from day one, and then we marry that with AWS, which has decades of expertise in deploying massive compute clusters faster than anyone in the world, and we build projects like Project Rainier and what we're going to build with Trainium 3. What you see here is a Trainium 3 compute sled. It's a very modular design, and that's not just an elegant design choice—it's important.

It means that we can test every component independently and then plug it into the system. Every single component is top-accessible and replaceable, and this is critical because it allows us to automate the production line and make the assembly completely robotic. That means we can scale much faster compared to manual or complicated assembly. It also means that when we need to service these cards in production, we can do it very quickly and efficiently and keep your infrastructure running, which is what we all want.

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

Let's break this down. At the back of this compute sled, you see 4 Trainium 3 devices. Then in the front, you can see 2 Nitro devices for scale-out networking. And there in the middle you see the Graviton CPU that is responsible for input, output, and overall management. All these chips were built in-house with deep expertise and deep co-optimization between them. We know how to optimize them, we know how to debug them, and we know how to service them.

Thumbnail 1530

That's critical to give you maximum performance. Achievable performance needs to be as close as possible to peak performance, and we need to optimize across the entire stack to do it. Joe showed you these graphs. With Trainium 2, we deployed 4x faster and 3x larger capacity than any other AI chip in AWS. For example, he mentioned Project Rainier. Let's talk about Project Rainier. Last year, when we were on this stage, we announced Project Rainier. We said that we were going to build a giant AI cluster for Anthropic, and now, 12 months later, there are 1 million chips running, training and serving state-of-the-art Claude models in production. I'm not talking about some future announcement. This is running today, and this happened in 12 months.

Thumbnail 1580

Thumbnail 1590

Ease of Use Across Customer Personas: From ML Developers to Performance Engineers

With what I just showed you with Trainium 3, we expect to scale faster than we scaled with Trainium 2, actually much faster and to much larger quantities. Lastly, let's talk about ease of use. We're building a very sophisticated infrastructure here, and we need to make sure that our customers can easily use it and get the maximum value from it. We knew that if we want to optimize for ease of use, we needed to deeply know our customers, so we talked to them a lot, and what emerged is that we have three customer personas with different needs.

At the top, we have the ML developers that are building AI applications based on existing models, and what they value the most is very strong and robust third-party library integrations and ready-to-use pre-optimized models. Then we have researchers who are inventing new models and new operators. They want to iterate quite quickly, and they care about a robust, frictionless experience much more than they care about performance. They care about developer cycles. The experimentation needs to happen very quickly. Finally, we have our performance engineers. These are folks like Jonathan Gray that live and breathe hardware optimization. He's one of the best in the field, and he'll explain very nicely how he's optimizing for Trainium 2 at this point, with Trainium 3 coming. What they value the most is tools that give them full control over the hardware.

Thumbnail 1670

Let's go one by one, starting with ML developers. With Neuron deeply integrated into third-party libraries like PyTorch Lightning, vLLM, and Hugging Face, you can take models from these libraries and run them on Trainium seamlessly and frictionlessly. We're also engaging the community via university courses and hackathons. You can see one example there, where folks take a Hugging Face model and fine-tune it to do a certain task. You can see a QR link there for a hackathon that we have for fine-tuning models to play chess. Eventually, we serve the models on Trainium as well, and the feedback that we're getting so far is overwhelmingly positive.

Thumbnail 1710

Thumbnail 1720

For researchers, we have deep integration with PyTorch and JAX, and what I'm really excited to share is that Trainium is also becoming PyTorch native. Let's talk about that. With recent advances from the PyTorch team—a mechanism called PrivateUse1 that allows you to integrate a custom AI backend into PyTorch—we made Trainium natively supported by PyTorch. That means that code you write in PyTorch that can run on a CPU or a GPU can seamlessly run on Trainium, the same exact code. You get the eager execution experience that you know and love from PyTorch on Trainium devices, and it also means that you get the automatic code optimization that PyTorch introduced via torch.compile, also running seamlessly on Trainium.

Thumbnail 1790

A nice side effect here is that all the tools and libraries that you know and love that run on top of PyTorch also come along for the ride. If you're using FSDP or DTensor for distributing your workload, that will run seamlessly on Trainium as well. If you're using libraries like Torch Titan to do large-scale training, that will also run seamlessly on Trainium. Here's what it looks like in code. On the left, we have PyTorch code that runs on GPU, and on the right, we have the corresponding PyTorch code that runs on Trainium. It should be hard to spot the differences because there are not many differences.

Thumbnail 1820

It's literally one word. Instead of writing to("cuda"), you write to("neuron"), and we take care of the rest. It just works. We wanted to give a lot of credit to the PyTorch team here. The way that they extended the PyTorch framework allowed us to do what we're showing you here. We're already piloting this capability with a select set of customers. We're getting very good feedback and we plan to make it generally available in Q1.
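Here is a minimal sketch of that one-word change. It assumes the native backend exposes a "neuron" device string, as the slide implies; treat the exact string and availability as assumptions rather than confirmed API.

```python
# Sketch of the "one word" change: same PyTorch code, different device string.
# The "neuron" device string is an assumption based on the slide shown here.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
x = torch.randn(8, 1024)

# On a GPU you would write:
#   model = model.to("cuda"); x = x.to("cuda")
# With the PyTorch-native Trainium backend, the same code becomes:
model = model.to("neuron")
x = x.to("neuron")

y = model(x)                     # eager execution on the Trainium device
compiled = torch.compile(model)  # torch.compile goes through the same backend
y2 = compiled(x)
```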

Last but not least, let's talk about performance engineers. For this customer persona, we introduced two new capabilities. The Neuron Kernel Interface, which we call NKI for short, is a low-level programming interface for directly programming the Trainium devices. This existed last year and we've evolved it quite a bit. The second piece is the Neuron Explorer, a toolkit for doing performance optimizations on Trainium devices. It's built on top of the Neuron profiler and gives you deep insight and observability into your workload running on Trainium. With both of these together, you get full control over optimizing your workload on Trainium.

Thumbnail 1900

NKI is a Python-embedded DSL with something quite unique: it combines two levels of abstraction in a single programming environment. You can implement your code in a tile-based language, just doing computation between submatrices, which is very easy to ramp up on, especially if you're coming from NumPy or Triton. But then if you identify an area where you really want to optimize, you can go all the way down to the assembly level with semantics very similar to the tile-based ones. That combination lets you ramp up very quickly and optimize very deeply.
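For a flavor of the tile-based level, here is a minimal element-wise kernel adapted from the public NKI getting-started material. The decorator and buffer names (nki.jit, nl.shared_hbm, nl.load, nl.store) follow that documentation, but treat the exact signatures as assumptions that may vary across Neuron SDK versions.

```python
# Sketch of a tile-based NKI kernel: load tiles from HBM into on-chip SBUF,
# compute on the engines, and store the result back to HBM.
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl


@nki.jit
def nki_tensor_add_kernel(a_input, b_input):
    """Element-wise addition of two tensors that fit in a single tile."""
    # Allocate the output tensor in HBM (buffer name per the public docs).
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    a_tile = nl.load(a_input)   # HBM -> SBUF
    b_tile = nl.load(b_input)   # HBM -> SBUF
    c_tile = a_tile + b_tile    # element-wise add on the compute engines

    nl.store(c_output, value=c_tile)  # SBUF -> HBM
    return c_output
```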

Thumbnail 1940

This year we're introducing a couple of new capabilities in NKI, including scheduling and allocation APIs that allow you to do fine-grained control over the scheduling of different instructions running on the machine, as well as where we allocate the different tensors. That allows you to build the very structured pipeline that I showed you before in the self-attention example. This is actually a feature request from some customers. We listened and this is already available. You can start using it.

In addition, we also introduced a new front end with much improved error messaging that allows you to self-serve and iterate much more quickly on the NKI programming experience and improve your time to get an optimized kernel on Trainium. Last but not least, I'm actually pretty excited about this one. We decided to open source the NKI compiler. It's coming in the next couple of months, and the reason we decided to do it is because NKI is all about giving you control and observability. So now we give you full transparency on how your code is compiling to the Trainium hardware, and we also welcome industry and community contribution across the entire Trainium stack.

Thumbnail 2020

Neuron Explorer and Performance Optimization Tools: Industry-Leading Profiling Capabilities

Here's one nice example. There's a company called Descartes that has a cool application where they do real-time AI video generation. They can ingest the video, edit it, and generate the video back to you. You can see examples here. They decided to build their entire model based on NKI and they achieved phenomenal utilization numbers, actually beyond what I expected the team could do in three to four months.

Thumbnail 2050

Thumbnail 2060

Next, let's talk about the Neuron Explorer. If you ever wrote highly optimized code, you know that your best friend is a strong profiler or tracer that tells you what's running on the hardware and where the bottlenecks are. With Trainium we have the industry's leading Neuron profiler that allows you to get instruction-level trace of what's running on the hardware with zero performance hit on the actual workload running. This is nanosecond-level observability without slowing down your workload. We extended the Neuron profiler a lot and built a suite of tools that we call the Neuron Explorer on top of it. First of all, it's four times more interactive, which means that you can debug much faster and get a better overall debugging experience. But on top of that, we made it available via web applications for easy sharing between developers, and we deeply integrated it with IDEs like VS Code.

This is actually quite important. What you see on the screen here is that I highlighted one of the lines in my code, and the Neuron Explorer automatically highlighted the relevant instructions in the profiler. This gives you a much tighter connection between the code that you're writing and what's actually running on the hardware, and gives you a sense of what's worth optimizing.

Thumbnail 2160

We're also introducing system-level profiling, which is not ready yet but will come in a month. This allows you to see a full run on multiple devices and determine if they're tightly synchronized or if there's one slow machine. It really helps you when you debug highly distributed code like a big training run.

Thumbnail 2190

We did a couple more things. We introduced a hierarchical view. When the Neuron Explorer is brought up, it shows you framework-level operators, such as self-attention or fully connected layers, and then you can click and drill down all the way to the instructions. This makes your debug experience much more incremental. You can start at a high level and try to understand where the bottlenecks may lie, and then when you really want to zoom into something, you can just drill down through it. It makes the debug experience much nicer.

Thumbnail 2230

We'll also give you a summary page that shows you how the different engines are utilized. Here you can see the tensor engine on the left. The tensor engine is utilized very well here at 60 to 70 percent MFU, and the other engines are lightly utilized, so that shows you how the workload is running. At the top right, you can see how we're utilizing our memory bandwidth, what portion is used for reads, what portion is used for writes, and what portion of the time the memory is actually sitting idle. When you look at this top-level view, you can really get a sense of how well your workload is running on the hardware.
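The utilization number on that page is conceptually simple: achieved throughput divided by the engine's peak. A tiny helper with made-up numbers illustrates the calculation.

```python
# Illustrative only: MFU as achieved FLOP/s over peak FLOP/s.
def model_flops_utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Return utilization as a fraction of the hardware's peak throughput."""
    return achieved_tflops / peak_tflops


# Made-up example: sustaining 850 TFLOP/s on a device with a 1,300 TFLOP/s
# peak corresponds to roughly 65% utilization.
print(f"{model_flops_utilization(850, 1300):.0%}")
```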

Thumbnail 2260

We also give you stats and visualizations. The one on the bottom right is the one that I particularly like. Here we're showing collective communication throughout the execution of the inference run in this case, and we're showing you a scatter plot of them. What you see here is that most of the communications are happening almost exactly for the same duration, which means the performance is very consistent and predictable. But if you ever see a large spread here, you get a sense that there's an outlier and you need to go debug and try to understand what happened.
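The same idea is easy to automate: if collective durations cluster tightly around the median, performance is predictable, and anything far outside that cluster is a straggler worth debugging. A small illustrative check, with an arbitrary threshold, might look like this.

```python
# Flag collectives that run much longer than the median duration.
# The 1.5x tolerance and the sample durations are arbitrary illustrations.
import numpy as np


def flag_outlier_collectives(durations_us, tolerance=1.5):
    """Return indices of collectives slower than tolerance * median."""
    durations = np.asarray(durations_us, dtype=float)
    median = np.median(durations)
    return np.where(durations > tolerance * median)[0]


durations = [102, 99, 101, 100, 98, 185, 103]  # one slow all-to-all
print(flag_outlier_collectives(durations))     # -> [5]
```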

Thumbnail 2290

And lastly, this is a cool one. We introduced something that we called performance insights. On the summary page, you'll see a bunch of boxes that show you where we think the performance bottlenecks are and what you can actually do to solve them. We do it via a combination of AI-based techniques and human-based techniques. If we debug something a couple of times, we'll introduce a rule here and try to give you a hint that this might help improve performance.

Thumbnail 2310

We showed this to the folks at Anthropic. There's a brilliant performance engineer here named Tristan, and when he saw that, he said this is the dream of every performance engineer. That's one of the quotes that I love the most in recent years, especially from someone like Tristan.

Thumbnail 2320

Wrapping up, we provide this ease of use across different customer personas, and most of what I showed you today is going to be open-sourced. I talked about the NKI compiler. In addition, the PyTorch-native Trainium backend is going to be open-sourced, and we're also open-sourcing a Neuron kernel library, a suite of pre-optimized kernels that we built for our use cases and want to make available to the world.

Thumbnail 2340

Just before I pass it to Jay Gray: as you can imagine, we're deep into implementing Trainium 4. It's a little early to share detailed specs, but we're accelerating over time. We're shooting to exceed a 6X performance uplift in FP4, a 4X memory bandwidth uplift, and a 2X memory capacity uplift. The energy efficiency uplift is going to be tremendous, but I'm not ready to share that just yet.

Thumbnail 2390

Anthropic's Trainium Journey: Powering Claude Models with Custom Kernel Engineering

Jay Gray, why don't we talk about how we're actually using these chips? Thank you.

Thanks Ron, thanks Joe. It's a pleasure to share the stage with you guys. I'm super stoked to be here. Hi everyone, my name is Jay Gray, and I'm the Trainium inference lead at Anthropic. Anthropic is the fastest-growing business at its scale in history. Our Claude 4 and Claude 4.5 models are the most trusted AI by enterprises all over the world, and especially with our release just last week of Claude Opus 4.5, Claude is the best coding model in the world and the best model for agentic workflows.

Thumbnail 2440

The key to all of this is that across all of our product surfaces—across our first-party API and AWS Bedrock, every usage of Claude Code, our web apps, and our mobile apps—the majority of our traffic today is served on Trainium 2. My team's job is to provide the core inference technology on Trainium that enables us to scale at such an unprecedented rate. Today we're going to take a deep dive into the kind of performance engineering work that enables this scale.

Thumbnail 2460

Our work is fundamentally about running our models as fast as possible while serving an exponentially growing set of customers as efficiently as we can. Every time we shave 10% off the pre-fill time of our models, it opens up new product use cases. More ergonomic uses of longer context so you can put your entire codebase in the context, faster response times to enable more ergonomic interactive use cases, and every time we increase the token generation speed, it enables Claude to think a little longer, your code to get written a little faster, or perhaps it enables us in the backend to increase the sampling batch size and silently serve your traffic a little more efficiently.

Thumbnail 2500

Thumbnail 2510

At Anthropic, every operation and every kernel of our model inference is designed to get the best performance out of Trainium chips. Today I thought it would be fun to take you on a deep dive into the kind of performance work we do on a day-to-day basis. Let's have an overview of Anthropic's custom model architectures and custom kernels. This is a bit of a joke—I'm not going to literally run you through our model architectures, because we still do have some trade secrets. However, I am going to take you through some real optimizations that we've done on a realistic large-scale LLM inference kernel.

Thumbnail 2520

Thumbnail 2530

Thumbnail 2540

Thumbnail 2550

This is going to be our playground for the next 5 or 10 minutes. This is a real fused Flash Attention kernel in 3 parts. It starts with a large-scale matrix multiplication that generates the queries, keys, and values, which are the inputs to the self-attention operation that Ron described earlier. There's the actual self-attention operation, and then it ends with another big matrix multiplication that projects the outputs of attention back into the residual stream space. Before I really get into it, I'm going to give a very quick overview of the NeuronCore architecture. If you're already programming in Trainium, this is a review for you. If you're more familiar with programming other architectures, then this is hopefully just a quick and interesting overview of the NeuronCore architecture.

At the core of every Trainium chip is a set of NeuronCores, and in each core are a number of different engines which specialize in different linear algebra operations. At the heart of this is the Tensor Engine, which does small tiles of matrix multiplication. If you take just one takeaway from an ML performance or a kernel optimization talk, it should be this: the goal of a kernel and the goal of a kernel engineer is to make sure the Tensor Engine is always doing matrix multiplications. Everything else is essentially auxiliary data movement and extra operations to ensure that when the Tensor Engine is done with one matrix multiplication, the data needed for the next one is ready to go in, and we densely pack our operations.

The Vector Engine is an engine which specializes in doing reductions and processing over streams of data, like a summation over a vector, and the Scalar Engine specializes in doing element-wise operations like activation functions or the exponent part of softmax. The last engine here is a fun innovation in the Trainium architecture called the GPSIMD, or General Purpose SIMD engine, which basically lets us write arbitrary C code to operate on our data and fit in whatever unusual operation of your custom architecture that doesn't map onto the other engines. All of these engines read and write to a set of fast SRAM memory banks near the engines called SBUF and PSUM, and there are a set of DMA engines which shuttle data back and forth between the fast SRAM memory close to the engines and the larger HBM on the device.

Thumbnail 2670

Thumbnail 2680

Thumbnail 2700

Deep Dive into Flash Attention Optimization: From FP8 Conversion to Trainium 3 Performance Gains

Back to our Flash Attention kernel. What you're seeing here is the actual profiler view of a real kernel. Every row here corresponds to one of the engines that I just described, and every line is an actual operation happening on one of those engines. What you can see here, without even diving into the numbers, is that we're doing pretty well. Visually you can just tell the Tensor Engine is densely packed with these blue matrix multiplication operations. This is looking pretty good, but how did we get there? For the first optimization we're going to dive into the first of the big matrix multiplications, which is the QKV projection. Let's start by looking at a single operation happening on the Tensor Engine.

Thumbnail 2710

I'm going to pause here for a moment and really emphasize this point because I think this is one of my favorite things about programming on Trainium. What you're seeing here is the actual ISA readout of a single 128 by 128 matrix multiplication operation, one of many that happens within a kernel during a full forward pass. You're seeing the full readout here down to the nanosecond, the individual bytes of memory space that are being read from and written to. This is exactly what is happening, and if you're used to programming on other chip architectures, you understand immediately how remarkable this is.

Thumbnail 2760

Thumbnail 2780

This is a level of visibility into the performance of your kernels that you really just don't get anywhere else. Every FLOP, every nanosecond, every byte of memory in every operation of every kernel can be traced to this level of detail. This is what enables us to get the maximum performance out of Trainium chips. Here we're starting with a well densely packed matrix multiplication in the standard bfloat16 format, but a lot of modern LLM inference, especially in decode, is about using smaller, more efficient data formats. Trainium2 is designed to get twice the speed out of the smaller FP8 formats that you get out of a full width bfloat16.

Thumbnail 2800

Thumbnail 2810

By moving these operations from the slower bfloat16 into a faster FP8E4M3 format, we immediately get a 2x speedup on this matrix multiplication. Next, let's dive into the actual self-attention operation. This is a bit more of a complex kernel, and optimizing attention is one of the most interesting problems in modern LLM inference. It's a much more complex optimization than working with a single matrix multiplication because there are many more operations and many more opportunities for bottlenecks that prevent your kernel from spending all of its time in matrix multiplications.

Thumbnail 2860

What you can see here visually is that unlike the matrix multiplication we were just looking at—if you look at the green Tensor Engine row, third from the top—we are not densely packed doing matrix multiplications the entire time like we want to be. What we see are bursts of matrix multiplications interspersed with gaps where we're actually doing a large number of small vector core operations. When we dive in here using the profiler view and reading the ISA view, what we can see is that the bottleneck is not the matrix multiplications we want to be doing. The bottleneck is actually in shuttling the results of the matrix multiplications between one memory bank and another using an inefficiently large number of small vector core operations.

Thumbnail 2880

Thumbnail 2900

When we realized that, we rewrote the tiling so that we move memory from one bank to another using a smaller number of larger vector core operations, amortizing the instruction launch overhead and making better use of each instruction, so we spend more of our time in matrix multiplications. Just by touching this, we get a 13% speedup in attention. Maybe it doesn't sound like a lot, but at the scale at which we operate, this is a huge amount of chips saved and a huge amount of extra traffic we can serve.
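A rough model with made-up constants shows why this works: each instruction carries a fixed launch overhead, so moving the same number of bytes with eight times fewer operations cuts the total overhead by roughly the same factor.

```python
# Illustrative only: assumed per-instruction overhead and copy bandwidth,
# not actual Trainium numbers.
LAUNCH_OVERHEAD_NS = 200   # assumed fixed cost per vector-engine instruction
BYTES_PER_NS = 100         # assumed sustained copy bandwidth


def copy_time_ns(total_bytes, num_ops):
    """Total time to move total_bytes split evenly across num_ops copies."""
    per_op_bytes = total_bytes / num_ops
    return num_ops * (LAUNCH_OVERHEAD_NS + per_op_bytes / BYTES_PER_NS)


total = 1_000_000  # one megabyte of attention results to move between banks
print("many small ops: ", copy_time_ns(total, num_ops=512))  # ~112,400 ns
print("fewer large ops:", copy_time_ns(total, num_ops=64))   # ~22,800 ns
```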

Thumbnail 2910

Let's talk about communications. It's been many years now that modern LLMs are large enough that they don't fit on a single chip. A lot of the interesting design space that we have as performance engineers is how to split up and shard the data and the computation of a full LLM forward pass across multiple chips and then communicate between them using collectives to arrive at the correct results. Trainium, like most chip architectures, operates with a smaller amount of fast SRAM memory that communicates with a larger amount of HBM.

By default, in order to do a collective operation, you take the result of one of your operations, shuttle it from the fast SRAM down to HBM, do a collective from HBM to HBM of different chips, and then shuttle the result back up to SRAM. Especially in token decode when you're trying to stream tokens as fast as possible, this three-step memory movement is terrible for latency. If you're unable to overlap your communications with other computation, spending time in communications like this is just the death of a low-latency kernel.

Thumbnail 3000

What Trainium allows us to do in this optimization is take advantage of one of the cool hardware features that allows us to do direct collectives from SRAM to SRAM along different chips, saving the extra hops of memory between SRAM and HBM. What you can see here is not super obvious, but I've notated with the red circles the GPSIMD operation which on the left is spending all of its time writing descriptors for the memory movement DMAs between SBUF and HBM.

Thumbnail 3020

This goes away in the right, and with the faster SRAM to SRAM collectives, the amount of time that we spend in comms is lower and the latency of our decode is faster. The last optimization, of course, is to run this kernel on Trainium 3. Every operation that I've described today gets faster on Trainium 3.

The double-speed FP8 matmuls that we looked at in the first optimization are 4x faster on Trainium 3 and make use of the microscaling architecture that Ron described to do more efficient blockwise quantization and dequantization. The vector and scalar operations that can so easily become the bottleneck of a complex real workload like attention are made faster on Trainium 3. The comms are made faster. The amount of HBM capacity per ICI domain is larger, which lets us serve larger models on a single ICI domain.

Thumbnail 3070

Thumbnail 3080

I could go on and on. In this case, the kernel that we've been working with for the last 5 or 10 minutes, which achieves after the optimizations about 60% tensor engine utilization on Trainium 2, gets to over 90% on Trainium 3. A year ago we announced Project Rainier and Anthropic announced its initial use of Trainium chips. A year later, we're serving Anthropic models on nearly a million Trainium 2 chips, and we're so excited to see in 2026 and beyond where we can get with Trainium 3. Back to you, Joe.

Thumbnail 3120

Great, just another round of thanks for both Jay Gray and Ron. They did a great job deep diving into the details of how they think about optimizing Trainium broadly across the stack. A few takeaways before we leave: Trainium 3 is generally available. The chip was announced yesterday at Matt's keynote. Think about Trainium 3 not just as the chip, but as the whole system. You saw how important the system is for the optimizations, as Jay Gray talked through not just the flops, but also the ICI bandwidth, or what we call NeuronLink.

Thumbnail 3150

We're building systems that really scale with 144 chips, the Trainium 3 UltraServer. It's easy to get started. We're making it really easy, so we have a lot of information available about Neuron SDK, and we want to make sure that you can get out there and learn. If you have time tomorrow and you're interested in learning more, there's a few workshops available on Thursday, so definitely recommend checking it out. You can scan this and see the full list.

Thumbnail 3170

To learn more, and if you don't have time on Thursday, you can always scan one of these, get started, and quickly ramp up on your own. We have a lot of tutorials available, and we're really excited about having folks develop on Trainium across the board. So with that, thank you again to everyone for being here. We really appreciate it, and if you have some time, please complete the survey.


This article is entirely auto-generated using Amazon Bedrock.
