🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - High-performance inference for frontier AI models (AIM226)
In this video, Philip Kiely from Baseten discusses high-performance inference for frontier AI models, introducing the concept of inference engineering. He explains how Baseten optimizes runtime performance through techniques like quantization (including NVFP4), speculative decoding (Eagle 3, look-ahead decoding), KV cache-aware routing, parallelism for mixture of experts architectures, and disaggregation of prefill and decode workers. The infrastructure layer features traffic-based auto-scaling and multi-cluster capacity management across regions for reliability and scalability. Real-world examples include OpenEvidence serving billions of LLM calls weekly for healthcare providers, Zed achieving 2x faster code completion, and Superhuman cutting P95 latency by 80% for embedding models. Baseten supports all major modalities including LLMs, embeddings, image generation, video generation, and speech processing across two million open-source models on Hugging Face.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Inference Engineering and the Rise of Open Source Foundation Models
Thank you so much for joining us this afternoon. I'm Philip Kiely from Baseten, and I'm going to be talking about high-performance inference for frontier AI models. We're going to cover a few things: the idea of inference engineering, the rise of open source foundation models, the components of an inference stack including runtime performance optimization and infrastructure, and then look at what that does in production.
Baseten is inference. We are an inference provider in the AWS marketplace, and we serve open source, fine-tuned, and custom models on infrastructure that's purpose-built for production. We think about inference in a few different parts. Number one is performance. At the runtime level, you want to make sure that your GPU is utilized to its fullest extent to create the highest possible throughput at the lowest possible latency. On the infrastructure level, once you have that incredibly efficient model service, eventually you're going to have enough traffic that you're going to be saturating it, so you need to be able to scale horizontally to two replicas, ten replicas, one hundred, one thousand replicas across clusters and across regions.
All of that needs to be wrapped in a delightful developer experience and delivered with hands-on support from expert forward-deployed engineers. Why did we build this? Why does all of this exist? It really starts with open source frontier AI models. Today there are well over two million open source models on Hugging Face, up from just a few tens of thousands four or five years ago. With these models, it's not just about the number, it's about the quality. Open source models are now routinely hitting frontier quality across different parameter sizes. When you look at models like the Qwen QwQ thinking model, or think back to January with DeepSeek R1, we see open source models finally closing the gap with closed models and delivering frontier intelligence.
But it's not just for large language models. In voice, we have outstanding models for automatic speech recognition, for text-to-speech generation, and for diarization. We have great image generation models from Flux and Stable Diffusion. We have models for generating and processing videos, as well as embedding models that can be used for all kinds of tasks on various data including both text data and multimodal data. With all of these different models all at frontier quality, you need something that we call inference engineering. Inference engineering is the process of running AI models in production, and there are three principles that define this engineering practice.
The first is the idea that optimization requires constraints. If we're going to optimize our system's performance, we need a goal that we're optimizing toward. We can't just try and make it better at everything. The second is that scale unlocks more performance techniques. If we're going to use something like large scale parallelism, disaggregation, or other performance techniques, we're going to need enough traffic that we can cost effectively leverage the hardware required to actually pull this off. Finally, we're going to need to stay dynamic. We're not going to configure our system one time statically and deploy it. We want our system to be able to update itself in real time to adjust to the different traffic configurations that are coming in from our live user base.
One thing I like to talk about to illustrate these points is the idea of the NFL. I'm a big fan of the NFL, and one of the reasons I like watching it is you see those guys out there on the field. They're big, they're fast, they're strong, and they're playing this sport. But they're not as big as sumo wrestlers. They're not as fast as Olympic sprinters. They're not as strong as a champion powerlifter. This is a useful lesson when it comes to inference optimization. Sometimes it's easy to get caught up in just trying to be number one on whatever benchmark. What really matters is making sure that you have the right mix of capabilities to actually serve the unique needs of your application.
A great example of someone who's really found that balance is OpenEvidence. The easiest way for me to explain OpenEvidence is it's a chat application for doctors to be able to get up-to-date information. They're a massively successful AI healthcare startup, and the CTO said about Baseten that we support billions of custom and fine-tuned LLM calls per week for serving high stakes medical information to healthcare providers at just about every major healthcare facility in the country. How do you achieve this scale? How do you do that with excellent performance and reliably enough to be used in a healthcare setting?
That comes down to the idea of an inference stack. Inference has multiple parts. You need both the runtime component and the infrastructure component. You need to make sure that your individual GPU is working as well as possible, and you need the infrastructure component to ensure that you can scale that optimized instance across many replicas.
Runtime Performance Optimization: Quantization, Speculation, Caching, and Disaggregation
Let's start with the runtime layer. The inference runtime is really about applied research. There are constantly interesting papers being published; NeurIPS is in San Diego right now, and there's a lot of great research happening there. The question is how to take these ideas from papers and actually apply them in production.
One of the major pieces of optimization technology that we use often is quantization. Quantization is the idea of moving from a 16-bit floating point number format down to 8 bits or 4 bits, so that you can access higher-throughput tensor cores and take better advantage of your memory bandwidth by sending less data with each pass through the model. We think about quantization both in terms of being selective with the formats we're using. I gave a talk at NVIDIA's booth about everything we're doing with NVFP4, which is a new microscaling data format introduced with Blackwell.
We also think carefully about what we quantize. Quantization isn't as simple as taking your entire model, putting it in your model optimizer, and bringing out a quantized model. You want to think carefully about quantizing maybe just weights, or just weights and activations. Maybe you can carefully quantize the KV cache to FP8, and you probably want to leave attention alone. Within these neural networks and within each component, maybe you quantize only the middle layers and leave the input and output layers intact. There's a lot of granularity that you can apply in quantization to ensure that you preserve quality.
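To make that granularity concrete, here is a minimal PyTorch-style sketch of selective, weight-only quantization. The helper names and the "attn" naming convention are hypothetical, and this is not Baseten's actual pipeline; it only illustrates the idea of skipping the first and last blocks and leaving attention in full precision.

```python
import torch
import torch.nn as nn

def quantize_weight_int8(linear: nn.Linear) -> None:
    # Illustrative weight-only int8 quantization with a per-output-channel scale.
    # For simplicity the weights are immediately dequantized in place; a real
    # engine keeps the int8 tensor and uses fused dequantization kernels.
    w = linear.weight.data
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127)
    linear.weight.data = w_q * scale

def selectively_quantize(blocks: nn.ModuleList) -> None:
    # Quantize only the middle blocks' non-attention linear layers, preserving
    # the input block, the output block, and attention projections in FP16/BF16.
    # (Assumes attention modules have "attn" in their names, which is a guess.)
    for i, block in enumerate(blocks):
        if i == 0 or i == len(blocks) - 1:
            continue
        for name, module in block.named_modules():
            if isinstance(module, nn.Linear) and "attn" not in name:
                quantize_weight_int8(module)
```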
Another main driver of performance for us is speculation, or speculative decoding. You can use different algorithms to generate draft tokens, which increases tokens per second on memory-bound decode: with every pass through the model, we can create more than one token, which is really promising. We use many different speculation algorithms. One that's achieving really great results right now is Eagle 3, which uses a specialized trained model to do the speculation; the draft model takes hidden states from the target model and generates draft tokens. We also do a lot of look-ahead decoding and n-gram speculation, especially in the code completion field where you have a very constrained vocabulary.
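As a toy illustration of the draft-and-verify idea (not Eagle 3 itself, and not any specific engine's API), here is a sketch of one greedy speculative decoding round, assuming `draft_next` and `target_next` are hypothetical callables that return the next token ID for a given context.

```python
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model (hypothetical)
    target_next: Callable[[List[int]], int],  # full target model (hypothetical)
    k: int = 4,
) -> List[int]:
    """One round of greedy speculative decoding: propose k draft tokens, then
    keep the longest prefix the target model agrees with, plus one corrected or
    bonus target token, so progress is at least one token per round."""
    # Propose k tokens with the draft model.
    ctx = list(prompt)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Verify against the target. A real engine verifies all k positions in a
    # single batched forward pass; this loop just shows the acceptance rule.
    accepted = []
    ctx = list(prompt)
    for t in draft:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)          # draft token matches: accept it
            ctx.append(t)
        else:
            accepted.append(expected)   # first mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_next(ctx))  # every draft accepted: one bonus token
    return accepted
```

Because the acceptance rule only keeps tokens the target model would have produced anyway, greedy output quality is unchanged; the win is that matching tokens cost one target pass instead of several.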
Another really important optimization for us is caching. We do a lot of KV cache-aware routing to ensure high hit rates on KV cache reuse. This is especially important for things like code completion. A customer of ours, Zed, which is an IDE, was able to achieve 2x faster end-to-end code completion with the Baseten Inference Stack, including a good deal of KV cache reuse. So as you're typing, those suggestions pop up twice as fast, with 45% lower P90 latency and 3.5x higher system throughput. That's just one example of what these techniques can achieve.
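Here is a greatly simplified sketch of what cache-aware routing can look like at the load balancer: hash a prompt prefix and keep requests that share it on the same replica so its cached prefill can be reused. The class, field names, and character-based prefix are illustrative assumptions, not Baseten's router.

```python
import hashlib
from collections import defaultdict

class PrefixAwareRouter:
    """Illustrative router: requests that share a prompt prefix go to the same
    replica so that replica's KV cache can be reused; unseen prefixes fall back
    to the least-loaded replica."""

    def __init__(self, replicas, prefix_chars: int = 1024):
        self.replicas = list(replicas)
        self.prefix_chars = prefix_chars     # crude stand-in for a token-level prefix
        self.in_flight = defaultdict(int)    # replica -> active requests
        self.prefix_home = {}                # prefix hash -> replica

    def route(self, prompt: str) -> str:
        key = hashlib.sha256(prompt[: self.prefix_chars].encode()).hexdigest()
        replica = self.prefix_home.get(key)
        if replica is None:
            replica = min(self.replicas, key=lambda r: self.in_flight[r])
            self.prefix_home[key] = replica
        self.in_flight[replica] += 1
        return replica

    def release(self, replica: str) -> None:
        self.in_flight[replica] -= 1         # call when the request finishes
```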
A couple of other things we think about a lot are parallelism strategies, especially with the rise of mixture of experts as the predominant architecture for large-scale language models. You have techniques like tensor parallelism and expert parallelism that need to be carefully balanced to ensure that you're making the right trade-offs between latency and throughput. With other modalities like video generation, which challenge even 8xH100 systems, you need approaches like context parallelism to ensure that you're able to split attention across 8 GPUs and use all of your compute efficiently.
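As a rough sketch of that trade-off (a toy heuristic, not how Baseten actually plans deployments), tensor parallelism spends GPUs on lowering per-token latency, while expert parallelism spends them on throughput by spreading experts across devices:

```python
from dataclasses import dataclass

@dataclass
class ParallelismPlan:
    tensor_parallel: int  # each layer's matmuls sharded across this many GPUs (latency)
    expert_parallel: int  # experts spread across this many GPU groups (throughput)

def plan_for(gpus: int, num_experts: int, latency_sensitive: bool) -> ParallelismPlan:
    """Toy heuristic for a mixture-of-experts model on a fixed GPU budget.
    Real deployments benchmark these configurations rather than guessing."""
    if latency_sensitive:
        return ParallelismPlan(tensor_parallel=gpus, expert_parallel=1)
    expert_parallel = min(gpus, num_experts)
    return ParallelismPlan(
        tensor_parallel=max(1, gpus // expert_parallel),
        expert_parallel=expert_parallel,
    )

# Example: an 8-GPU node serving a 64-expert model.
print(plan_for(8, 64, latency_sensitive=True))   # TP=8, EP=1
print(plan_for(8, 64, latency_sensitive=False))  # TP=1, EP=8
```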
Finally, one more runtime optimization I'm really excited about is disaggregation, where you separate prefill and decode onto separate workers that scale independently. This allows you to do many of the things I just talked about, as well as adopt different kernel strategies and different runtimes: you can specialize each of your workers for the specific challenge of either compute-bound prefill or memory-bandwidth-bound decode. Disaggregation, which is also supported by Dynamo, is another one of these model performance techniques that we've seen a lot of interest in.
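A minimal asyncio sketch of the prefill/decode split is below. The stub functions stand in for real engine calls; in practice the KV cache moves between GPU workers over a fast interconnect such as NVLink or RDMA, not through a Python queue.

```python
import asyncio

# Hypothetical stand-ins for engine calls, used only to show the worker split.
async def run_prefill(prompt: str) -> dict:
    await asyncio.sleep(0.05)      # compute-bound: one pass over the whole prompt
    return {"prompt": prompt, "generated": []}

async def run_decode_step(state: dict) -> str | None:
    await asyncio.sleep(0.005)     # memory-bandwidth-bound: one token per step
    done = len(state["generated"]) >= 16
    return None if done else f"tok{len(state['generated'])}"

async def prefill_worker(prompts: asyncio.Queue, handoff: asyncio.Queue) -> None:
    """Scales with prompt length; can run on compute-optimized workers."""
    while True:
        state = await run_prefill(await prompts.get())
        await handoff.put(state)   # hand the KV-cache-bearing state to a decode worker

async def decode_worker(handoff: asyncio.Queue) -> None:
    """Scales with generated tokens; can run on bandwidth-optimized workers."""
    while True:
        state = await handoff.get()
        while (tok := await run_decode_step(state)) is not None:
            state["generated"].append(tok)   # stream tokens back to the caller here
```

Because the two pools scale independently, a burst of long prompts can add prefill workers without over-provisioning decode capacity, and vice versa.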
Infrastructure Components: Auto-Scaling, Multi-Cluster Management, and Multi-Modal Deployment
Putting all of that together, I just went through five of the primary model performance techniques. Even if you do a world-class job of implementing them, that's not going to be enough for your production inference service. You also need the infrastructure to match. A big part of the infrastructure component is auto-scaling. A lot of companies, especially companies that start out with large training clusters, end up with fixed amounts of GPU capacity. The problem with fixed GPU capacity for inference is that traffic fluctuates.
For business applications, it's highest during business hours and lower overnight and on weekends. Or maybe you have a consumer application: sometimes you go viral on TikTok and get a million users overnight, and some days it's much quieter. In these cases, static capacity is not a great match for your traffic. When traffic is low, you have wasted spend, and when traffic is high, even with all of these optimizations, your system won't have enough throughput and you'll start missing SLAs. That's where auto-scaling comes in: closely matching the GPU capacity you're provisioning at any given time to your traffic.
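Here is a deliberately simple sketch of a traffic-based scaling rule. The parameter names are hypothetical; a production autoscaler also smooths traffic over a time window and accounts for cold-start delay so it doesn't thrash.

```python
import math

def desired_replicas(
    in_flight_requests: int,
    target_concurrency_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 100,
) -> int:
    """Toy traffic-based rule: provision enough replicas that each one stays
    near its target concurrency, clamped to a configured range."""
    needed = math.ceil(in_flight_requests / max(1, target_concurrency_per_replica))
    return max(min_replicas, min(max_replicas, needed))

# Example: 640 in-flight requests, each replica comfortably handles 32
# concurrent requests -> scale to 20 replicas.
print(desired_replicas(640, 32))  # 20
```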
At Baseten, we make a lot of traffic-based auto-scaling decisions, and we're able to get this capacity via something that we call multi-cluster capacity management. Many inference services rely on capacity within only a single region (say you're only in us-east-1 and you build your entire inference system there). With multi-cluster capacity management, you're able to pool compute across multiple regions and multiple independent clusters. With a single global control plane, you treat it as a unified resource: if you have, say, 10 replicas, you can schedule eight of them on one cluster and two on the other. That gets you active-active reliability across regions and ensures you have access to more capacity and more resilience, as well as more geographic proximity to your end users for globally distributed applications.
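And here is a toy sketch of how a global control plane might spread those replicas across clusters with different free capacity. The cluster names and the greedy fill rule are illustrative only; a real control plane also weighs geography, hardware type, and cost.

```python
def place_replicas(desired: int, free_capacity: dict[str, int]) -> dict[str, int]:
    """Toy multi-cluster placement: fill clusters in order of free capacity so
    one region's shortage doesn't block the whole deployment."""
    placement = {}
    remaining = desired
    for cluster, free in sorted(free_capacity.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take:
            placement[cluster] = take
        remaining -= take
        if remaining == 0:
            break
    return placement

# Example: 10 replicas across two clusters -> an 8/2 split like the one above.
print(place_replicas(10, {"cluster-us-east": 8, "cluster-us-west": 4}))
# {'cluster-us-east': 8, 'cluster-us-west': 2}
```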
That matters because no matter how fast you make your model server, the part where the actual inference workload runs, if your network latency is slow because you're sending a request from Singapore to California and back, or if your queue depth is high because you only have 10 GPUs but 20 GPUs' worth of traffic, your end-to-end inference time is going to suffer, no matter how well you've optimized the inference runtime layer we talked about earlier. That's why it matters to do both the inference runtime optimization and the infrastructure optimization together to ensure great end-to-end latency.
Another customer who's experienced this is a company called Latent. Latent is a pharmaceutical search company, and they like to talk about how we save them a lot of stress and developer time in implementing highly reliable inference with this multi-cluster strategy and with these auto scaling capabilities within the inference-optimized infrastructure. But everything I've talked about so far has been mostly focused on the idea of large language models. Large language models and AI are deservedly somewhat synonymous, but there's a lot of other modalities of open-source models, like we were talking about at the beginning. There's image generation models, video generation, embedding models, text-to-speech, speech-to-text, and all sorts of novel modalities being developed today.
A great example of adapting this specific setup to a given modality is the idea of embedding inference. Embedding models take text as input and output vectors that encode the semantic meaning of that text. I should also say there are multimodal embedding models that take images or videos and encode those similarly. If you have this great capability around running language models, how do you then turn around and run embedding models? It turns out that embedding models, like many other modalities, are architecturally very similar to large language models. Most of the frontier quality embedding models today are things like Embedding Gemma and Qwen Embed, models that are built out of open-source large language models. As such, you can use the same runtime if you're able to build the rest of the system around it, which is a common theme in what I've been talking about with inference today.
For Baseten embedding inference, you have a model server sitting in front to process requests, plus a multi-worker tokenizer that can take the hundreds of thousands of individual sentences or inputs that might be batched together in a single inference request to an embedding model, hand them to a batch manager, queue them up, and let them take advantage of the same token-by-token in-flight batching and continuous batching mechanism that your runtime provides for language models. All of a sudden you have the same high-quality inference service in a completely novel modality.
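A minimal sketch of that batch-splitting step is below, assuming a hypothetical `embed_batch` call into the model server. Preserving input order matters because callers match output vectors back to their inputs by position.

```python
from typing import Callable, List

def embed_large_request(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],  # hypothetical model-server call
    micro_batch_size: int = 256,
) -> List[List[float]]:
    """Toy batch manager for embedding inference: a single API request may carry
    hundreds of thousands of inputs, so split it into micro-batches sized for the
    runtime's continuous batching while keeping results in input order."""
    out: List[List[float]] = []
    for start in range(0, len(texts), micro_batch_size):
        out.extend(embed_batch(texts[start : start + micro_batch_size]))
    return out
```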
We've done this for embeddings, speech-to-text, text-to-speech, image generation, and video generation; together with LLMs, that covers all six major modalities of models you could think of. One of our customers using this is Superhuman. You might know them because they were recently acquired by Grammarly, which then rebranded itself as Superhuman. Loïc Houssier was the CTO of the original Superhuman email app. With Baseten embedding inference, they were able to cut P95 latency by 80% across many different fine-tuned embedding models that power key features in their app. I like to highlight these P90, P95, and P99 latency gains because they show the impact of the two-part problem I've been discussing: it's not enough to be able to run fast. You need to be able to run fast reliably, and that comes not only from the runtime but also from the infrastructure.
If you're going to build inference in production, you need to build four things. First, you need to deliver world-class, state-of-the-art performance in terms of time to first token, tokens per second, or whatever other metrics matter to you and your end users. Second, you need to pair that with infrastructure that delivers four nines or better of reliability so that you can trust your applications for mission-critical deployments in fields like healthcare. Third, you need to be able to scale GPU capacity across regions and maybe even across different VPCs so that you can handle AI-powered applications that are growing fast. We see customers growing by compounding multiples within a single year, so you need to be able to scale compute very rapidly, both month over month at a global level and within any given day.
You need to be able to scale automatically up and down with traffic. Finally, you need to be able to do that not just once, not just for one model, not just for one modality, but for any model, any of the two million open-source models on Hugging Face, any fine-tuned model, any customized model that enterprises are increasingly turning to in order to deliver differentiated value in production. All four of those things need to be done together. Thank you for the time today. I really enjoyed speaking with you. We're going to be at booth 1632, which is right over that way behind the Red Hat booth. We've got artificially intelligent t-shirts and a bunch of my teammates over there doing demos. If you have any questions whatsoever about the content that we talked about today, please join me over there. Thank you so much for your time.
; This article is entirely auto-generated using Amazon Bedrock.