
Kazuya

AWS re:Invent 2025 - vLLM on AWS: testing to production and everything in between (OPN414)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - vLLM on AWS: testing to production and everything in between (OPN414)

In this video, Phi Nguyen and Omri Shiv present a four-level journey for deploying open source LLMs on AWS: foundation (AI gateway for cost visibility and control), optimization (using the vLLM inference engine to achieve 80% GPU utilization and 5x throughput), memory management (KV cache offloading with NVIDIA NIXL showing 3x better time to first token), and distributed inference (disaggregated architecture with prefill/decode separation). They demonstrate the AI on EKS open source project with practical deployment patterns including model testing with Qwen 3, benchmarking using GuideLLM, autoscaling with Ray, and container optimization using SOCI and model streaming (reducing startup time by 63%). The session covers advanced techniques like AIBrix for context-aware load balancing, LoRA adapter management, and distributed KV cache across multiple accelerators including Trainium and Inferentia.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 20

Introduction: The LLM Journey and the Foundation Stage with AI Gateways

Welcome to OPN 414 vLLM on AWS: Testing to Production and Everything in Between. My name is Phi Nguyen. I'm a Senior GenAI Specialist and I'm joined by Omri Shiv, who's a Senior Open Source ML Engineer. I'll be covering the LLM journey and some of the findings from working with many customers. And Omri will be covering the AI on EKS project and some pragmatic LLM deployment patterns. Let's get started.

Thumbnail 40

I think we can all relate to this story. At the beginning of the LLM era, it was common to prototype using OpenAI, build your first Q&A chatbot, and then when you move from prototyping to production, you might get surprised by the real cost of running this application in production. The good news is that open source has really caught up to frontier models, promising better cost, more model choices, and providing better control, which opens up LLMs to more use cases and applications.

Thumbnail 80

Thumbnail 100

Thumbnail 110

Today, I'm going to walk you through a four-level journey that organizations of all sizes—both enterprise and startup—are using to deploy open source models in order to reduce costs, improve performance, and gain better control for your LLM applications. The first stage is foundation. Similar to the previous story, it's very common for many teams to swipe their credit card and start using model providers such as OpenAI, Claude, or even Bedrock. However, this creates a number of challenges.

Thumbnail 130

Thumbnail 140

This approach provides no centralized cost visibility, makes it very difficult to enforce policies, and offers no compliance controls. The solution is straightforward: put an AI gateway in front of all the different users and all the different requests. An AI gateway provides the ability to track cost by user, team, or project. It allows you to enforce policies such as content filtering for PII data as well as other safety mechanisms. It allows you to audit everything and control access by storing the different API keys, so teams can only access what they are allowed to access, whether across model providers or across individual model names.

Now that you have visibility and control, we can move on to the next phase. I should mention that there are a number of frameworks out there that allow you to implement this type of gateway. If you're a Kubernetes user, Envoy AI Gateway might be a popular choice. If you're more Pythonic or more developer-oriented, LiteLLM is another great option. And if you're looking for a managed gateway, OpenRouter could be a great option. Now we have control and visibility, so let's move on to the next level.
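To make the pattern concrete, here is a minimal sketch of what "everything goes through the gateway" looks like from an application's point of view, assuming an OpenAI-compatible gateway such as a LiteLLM proxy. The endpoint, virtual key, and model alias below are illustrative assumptions, not values from the session.

```python
# Minimal sketch: routing every request through an OpenAI-compatible AI gateway
# (for example, a LiteLLM proxy). The URL, virtual key, and model alias are
# placeholders you would replace with your own gateway configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # the gateway, not the model provider
    api_key="sk-team-a-virtual-key",      # per-team key issued by the gateway
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # an alias the gateway maps to a provider or self-hosted model
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)
print(response.choices[0].message.content)
```

Because every call carries a gateway-issued key, cost, policy enforcement, and audit logs are attributed per team or project rather than hidden behind one shared provider key.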

Thumbnail 230

Thumbnail 270

Understanding GPU Utilization Challenges and the Transformer Architecture

Maybe you've gone to Hugging Face and picked a model, you download the model and you use the Hugging Face Transformers library, and maybe you put it behind a web server such as FastAPI or Flask. By doing that, you can only achieve 40 to 50 percent utilization of your GPU. You need some kind of optimization in order to maximize the resources that you are using. Just by taking a model and deploying it, your infrastructure is underutilized, your costs are high, and your users might be left frustrated.

Thumbnail 280

Before we cover some inference optimizations, it might be worthwhile to take a small detour and understand the Transformer architecture. One thing to note is that the whole industry has converged on the Transformer architecture, and many models are using this type of architecture or small variants of it. There are three different stages: tokenization, which is the process of breaking text into tokens.

Thumbnail 340

This is CPU bound and very often not the bottleneck. Then you have the prefill stage, which is the process of processing your prompt to create a KV cache representation of your query based on the model that you've picked. This is accelerator bound. Finally, the decode stage is the process of using that KV cache to predict the next token. That process is memory bound and does not use a lot of GPU compute.

Another factor that might impact the resources of your application is the context window. With the latest models allowing you to include a very large input sequence length, you need to understand that you also need a large amount of memory in order to process and store the KV cache. As you double the sequence length of your input, you also need to double the memory resources for your application. While a model can accommodate a very large input text, it is worthwhile to understand the upper bounds of what you are allowing in your applications so that you can plan your resources accordingly and avoid out-of-memory errors.
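That linear growth is easy to sanity-check with a back-of-the-envelope formula: KV cache size is roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per element x sequence length. The sketch below uses illustrative Llama-2-7B-like dimensions, not numbers from the session.

```python
# Rough back-of-the-envelope KV cache sizing: doubling the sequence length
# doubles the cache. Model dimensions below are illustrative assumptions.
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32,
                   num_kv_heads=32, head_dim=128, bytes_per_elem=2):  # FP16 = 2 bytes
    # The factor of 2 accounts for storing both a key and a value per token per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len * batch_size

for seq_len in (4_096, 8_192, 16_384):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 1e9:.1f} GB per sequence")
```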

Thumbnail 400

Optimizing Infrastructure with vLLM: Inference Engine Capabilities and Performance Gains

vLLM is an inference engine. It is open source and part of the LF AI & Data Foundation as well as the PyTorch Foundation. It is very popular, with over 60,000 GitHub stars, and very active, with over 800 pull requests per month from over 1,700 contributors.

Thumbnail 420

vLLM provides a single platform to run over 100 model architectures across different accelerators. It is not just NVIDIA GPUs, but also Trainium and Inferentia as well as AMD and other accelerators.

Thumbnail 450

Going back to our use case, if we now use an inference engine such as vLLM, it provides a number of optimizations out of the box. PagedAttention is one technique; think of it like virtual memory for GPUs. Instead of reserving one large contiguous block of memory for each request's KV cache, the cache is stored in small pages, which reduces fragmentation and frees up resources to process larger batches.

Continuous batching is another technique. If you consider a batch size of one, you will never be able to really optimize GPU utilization, but you will get better latency. If you have a very high batch size limit, you will maximize throughput but compromise on the latency of your application. Continuous batching is the best of both worlds: it dynamically resizes the batch so that you get good throughput without compromising on your latency.

Thumbnail 530

Quantization is another optimization. Models downloaded from Hugging Face often ship in a higher precision data type such as FP32 or BF16. There are techniques that allow you to reduce the precision to FP16 or FP8, reducing the memory footprint of your model and leaving more resources for processing your batches.
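As a rough sketch of how these knobs are exposed, here is what they look like through vLLM's offline Python API. The model name and settings are illustrative assumptions; check the vLLM documentation for the options supported by your version. Continuous batching is handled automatically by the engine.

```python
# Sketch: the optimizations above via vLLM's offline Python API.
# Model name and settings are illustrative, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",
    quantization="fp8",          # reduce precision to shrink the memory footprint
    gpu_memory_utilization=0.90, # give PagedAttention most of the GPU for KV cache blocks
    max_model_len=8192,          # cap the context window you actually need
)

prompts = [f"Write a one-line summary of ticket #{i}." for i in range(64)]
# The engine batches these continuously; you do not manage batch sizes yourself.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```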

Thumbnail 550

By using an inference engine with the existing optimizations, you can increase your GPU utilization to 80 percent. You can expect 5x better throughput and 80 percent savings per token.

Thumbnail 570

vLLM is not the only inference engine out there. There are others such as SGLang and NVIDIA Triton Inference Server, and they are all open source as well.

Advanced Memory Management: KV Cache Strategies and Performance Benchmarks

You have optimized your infrastructure and increased GPU utilization, but users are still complaining about latency, especially in certain scenarios. Maybe you build RAG applications or maybe you build tool calling via MCP. As part of building those capabilities, you need to provide instructions and system prompts that are sent to the model as you call those different tools.

Thumbnail 620

Every time you call your LLM application, for example to search your Python documentation via a tool, you will be processing your system templates over and over again. One could argue that not all tokens are created equal, and some tokens could be cached or reused so that you can really focus on the new tokens or the new text being sent to the application.

Thumbnail 640

Another trend that is compounding the problem is the shift from input-heavy tokens to output-heavy tokens. I call it the great token ratio inversion. At the beginning of the LLM era, think of summarization or translation: you would stuff your input context window using RAG, and the output would be dramatically smaller. As we moved to multi-turn chatbots, Q&A, and code generation, you see a more balanced ratio between input and output. Finally, today's reasoning models create a lot of intermediate tokens that need to be carried and iterated on in order to produce a final output.

Thumbnail 720

For use cases such as deep research, coding agents, and multi-agent systems, you can guess that a lot of those tokens are context being carried over across the thinking of those models, which is not very optimal if you need to reprocess those tokens all over again. So caching is needed in this case. In order to work around those memory management challenges, there are a few techniques. One of them is called KV cache offloading. GPU memory is quite finite, so new frameworks are emerging that offload the KV cache across different memory and storage tiers.

Thumbnail 770

Thumbnail 800

NVIDIA NIXL is a new framework from NVIDIA that allows you to transfer the KV cache between two different GPUs or two different nodes, but also across different storage and memory tiers. NVMe, FSx, and S3 are all storage options that allow you to load the data directly into GPU memory, bypassing the CPU and the operating system. Another architecture evolution concerns routing. Say I have processed the beginning of a conversation on a worker node and its KV cache is stored somewhere: a traditional load balancer has no awareness of whether that KV cache has already been computed. It may be using a round robin algorithm, and therefore it does not give you the ability to reuse the KV cache that was processed previously.

Thumbnail 840

With KV cache offloading, it is pretty common to see AI routing mechanisms that are KV cache, model, and capacity aware, so that you can route a request to a worker that has already processed the KV cache and therefore maximize cache hits. To put it together, you might combine strategies such as prefix caching, KV cache offloading, or semantic caching with an AI router, and you get a better time to first token because you can skip the prefill and reuse the KV cache across different requests and different users.
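The simplest of these strategies, prefix caching, is a single switch in vLLM. The sketch below assumes an illustrative model and a long shared system prompt; the idea is that the second request reuses the KV cache computed for the shared prefix instead of redoing the prefill.

```python
# Sketch: automatic prefix caching in vLLM, so the KV cache for a long shared
# system prompt is reused across requests. Model and prompts are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", enable_prefix_caching=True)

system_prompt = "You are a support agent. Follow these policy rules...\n" * 50
questions = ["Can I return an opened item?", "How long do refunds take?"]

params = SamplingParams(max_tokens=128)
for q in questions:
    # After the first call, the shared prefix is served from cache and only the
    # new question tokens go through prefill, improving time to first token.
    out = llm.generate([system_prompt + "\nUser: " + q], params)
    print(out[0].outputs[0].text)
```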

Thumbnail 860

Thumbnail 900

These benchmarks are from NVIDIA Dynamo, which is a platform for running these types of solutions. They tested 100,000 requests using Llama 70B, with an average input of 4,000 tokens and an average output of 800 tokens, and they found 3x better time to first token and 2x better latency overall. Another benchmark they conducted for KV cache offloading, with 20 multi-turn conversations across 15 users on the Qwen 3 8B model, showed a 2.2x to 12x improvement in time to first token.

Thumbnail 910

Thumbnail 930

Thumbnail 940

Thumbnail 960

Scaling to Distributed Inference: Parallelism Strategies and Disaggregated Architecture

When we talk about KV cache and different frameworks out there, LMCache, Mooncake, and NVIDIA Dynamo are all open-source frameworks for you to use to implement those types of solutions. The last step in our journey is called scale or distributed inference. This is not a path I would recommend to everyone. If you can stick to a single node and keep it simple, that might be better. However, if you have use cases or needs such as a high volume of traffic—we're talking about 1000 requests per second or more—or maybe you need to accommodate a model that cannot be contained within one node, these are good examples for you to consider how to do distributed inference.

Thumbnail 970

One thing to know is the three dimensions of parallelism. When you're going distributed, there are usually different strategies on how to do that: data, tensor, and pipeline. Data parallelism is when you duplicate the model across different nodes, shard the data, and pass the data across different nodes. Tensor parallelism is when you shard the weights of a model and you have the data passing through the different weights. Pipeline parallelism is when you shard different layers of a model and you pipeline the data across different layers across different nodes.
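As a rough illustration of how the tensor and pipeline dimensions map onto engine configuration, here is a vLLM sketch; data parallelism is usually just more replicas behind a router, so it does not appear as an engine argument. The model and sizes below are assumptions for illustration, not a sizing recommendation.

```python
# Sketch: tensor and pipeline parallelism as vLLM engine arguments.
# (Expert parallelism for MoE models has its own option in recent vLLM
# versions; check the docs for your release.)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative large model
    tensor_parallel_size=8,    # shard each layer's weights across 8 GPUs in a node
    pipeline_parallel_size=2,  # split the stack of layers across 2 nodes
)
```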

Thumbnail 1010

Now there's a new dimension called expert parallelism. If you consider the latest state-of-the-art open weight models, such as Qwen, DeepSeek, or Kimi, they all use a mixture-of-experts architecture. The big difference is that they are a lot more efficient. DeepSeek has 256 experts, and when you perform inference, you only activate 8 of them—a fraction of the weights—and get better performance overall.

When you do expert parallelism, the communication pattern is different, which increases the complexity of how you architect those different parallelisms. For example, expert parallelism uses an all-to-all token exchange or peer-to-peer communication. There is 3 to 6x more communication. The communication is asymmetric—it happens between different experts—and it is therefore very hard to predict between which nodes the communication is going to occur. Contrast this with data, tensor, and pipeline parallelism, where communication patterns are symmetric and fixed in volume.

Thumbnail 1090

Thumbnail 1120

Another pattern in inference engines is called disaggregated architecture. Remember, a transformer model has two different stages: the prefill and the decode. When you run inference on one node, once you have completed the prefill and are performing the decode, the GPU compute will often sit idle because decode is memory bound. Many inference engines are now adopting a disaggregated architecture. Naturally, you want to decouple those two stages into their respective clusters so that you can allocate more adequate resources for each of them: maybe more GPU compute for your prefill cluster and more memory-optimized resources for your decode cluster. Note here that NVIDIA NIXL is being used to transfer the KV cache across those different clusters. Now you can allocate better resources, decouple the two stages, and scale them independently.

Thumbnail 1160

Thumbnail 1180

Thumbnail 1190

If you put everything together—disaggregated architecture with prefill and decode, and maybe you're using those four different levels of parallelism: data, tensor, pipeline, and expert—for serving DeepSeek, you can really scale and increase the throughput as you add more nodes to your clusters. This is another benchmark from NVIDIA, and they've benchmarked this disaggregated architecture on one node versus two nodes. On one node, you can see a 1.3x better throughput. On two nodes, a 2x better throughput. This suggests that as you add more nodes, you continue to see improvements in throughput.

Thumbnail 1220

Thumbnail 1250

Thumbnail 1300

Thumbnail 1310

I've talked about vLLM and NVIDIA quite a bit, but there are other frameworks that allow you to implement, run, and architect all those different capabilities. AIBrix from ByteDance, llm-d from Red Hat, and projects from a number of other companies are all open source options for running advanced inference workloads. To recap, we talked about setting up the foundation: how can I set up an AI gateway so that I have more control and visibility across different model providers? By using an inference engine, I can increase utilization and optimize GPU consumption while increasing throughput. By implementing memory and KV cache strategies, I can reduce latency and reuse the KV cache. Finally, by using a disaggregated architecture or different parallelism strategies, I can perform distributed inference and increase throughput even more by adding more nodes to our clusters. There's a great blog post showing how the Amazon team, more specifically the Rufus team, used vLLM to scale Rufus, the Amazon shopping assistant, to over 80,000 Inferentia and Trainium chips during Prime Day. With that, I'm going to hand it over to Omri. Omri is going to talk about the AI on EKS project and some practical deployment patterns.

Thumbnail 1360

Thumbnail 1370

AI on EKS Project: Infrastructure, Deployments, and Practical Implementation

Hi everyone. We are going to talk about the practical side of everything Phi just mentioned, and to do that we need a couple of things. We need infrastructure on which we can actually deploy these patterns. We need deployments, so blueprints or charts or something to actually implement these patterns, and then we need guidance to actually know how to use these. We have this open source repository called AI on EKS. Please feel free to visit it. Basically, it has a three-tiered strategy of what we offer. It's exactly those layers.

Thumbnail 1430

On the infrastructure side, what we have is purpose-built architectures that give you training, inference, MLOps, or agentic capabilities—basically all the components you need to run whatever workload you want to put on top of it. The infrastructure is very modular, so if you want to run a vector database, for instance, and we provide one tool but you want to use a different one, you can swap them in and out really easily. It's optimized for a bunch of different hardware, so we start with traditional x86-64, but of course we also support AWS accelerators like Trainium and Inferentia (via Neuron) as well as Graviton. It's very cost effective: we use this hardware optimization, we keep the infrastructure really lean, and we use Karpenter to scale it out when we need to grow the environment.

On top of that, we have the deployments. Depending on what it is you want to do, we have blueprints for inference, for MLOps, for training, and for agentic workloads, just to keep aiding you in your journey, whatever that might be. We are of course flexible. We know that our talk today is on vLLM, but of course there are other inference engines. We talked about AIBrix, llm-d, and other productionization stacks, so we want to make sure that we're flexible. Just like the infrastructure, our deployments are scalable. We want to make sure that we're cost effective there and that we support the appropriate hardware.

Thumbnail 1480

And then finally on the guidance front, we want to make sure that we provide you with all the information you need to make effective decisions in how you want to approach things. So we try to optimize for performance and optimize for costs. Of course we want to give you practical guidance, so we talk about why we chose the tools that we did and some alternatives if you want to consider them. Everyone here is on the LLM journey.

Thumbnail 1520

Everyone here is at a different step in the LLM journey. We know some people might have never deployed an LLM or might have never used one. We want to make sure that we support you at that point as well as anyone who's going to global production. And just like with everything else we do, we have a whole host of different topics that we're going to talk about.

I apologize for the very dense slide here. This is to show you what we're going to use to put into practice some of the things we talked about today. The inference ready cluster is one of our infrastructures and the core one that we use to highlight inference. Just to highlight a few things here, you'll see we use four availability zones. In this case we're in US West 2, so that gives us that flexibility. The reason we do that is we want to give you the best success at actually getting a GPU instance. We know they're not always available in all availability zones. But we also take into account the fact that going across availability zones adds latency and cost, so we make sure we handle that in the deployment by making them topology aware.

Thumbnail 1580

We also provide all of the controllers that you're going to need to go through this journey of deploying your first LLM, benchmarking it, and so on. I just want to show you how easy this is. Three lines: you clone our repo, you cd into the infrastructure, and then you run this install script. If you look on the right there, this is a file that's available in every one of our infrastructures. This is our customization. You can see we have these enable lines. That's the modularity we talked about. You don't want to use the observability stack, turn it off. Actually, everything's off by default; you can just delete it. You don't want to use AIBrix for some reason or you want to use something else instead, you change that line. You want to use fewer availability zones or a different region, we try to make it really easy.

Thumbnail 1630

So that's the infrastructure. Let's talk about the deployment component. We also offer a set of charts to go with this infrastructure. We're talking today a lot about LLMs, of course that's a very popular topic, but we recognize there's other things you might want to do on this infrastructure or just in general. So we also provide diffusion support if you want to do text to image. We've got a few models there that you can use. We're all about choice, so we have vLLM, diffusers, llama.cpp, if you want to do something on Graviton or ARM-based processors, and of course a whole host of inference servers and more. We just published some benchmarking that we're going to talk about, and then we have some of the other controllers that are available right there.

Thumbnail 1680

I just want to show how easy it is to use the charts. On the left there is the chart. Most of that is actually optional, but you can see, for instance, we have a model up here. You want a different model, just change the model name. This is the exact model name from Hugging Face, and then over here is the deployment. Once you add and update the repo, you don't have to do that again unless you want to pull a new version, and you can see this file here. If you go to the repo, you can see we have a whole host of other templates you can use if you want to do something else.

Thumbnail 1710

Thumbnail 1720

Thumbnail 1730

Model Testing and Benchmarking: From Deployment to Performance Evaluation

With that in mind, let's talk about the actual journey. We know that everyone here, whether you're just starting out or sort of on the end or somewhere in between, we've got your back and we're going to help you out. Let's start at the beginning. You've never deployed an LLM before and you just want to check it out. We call this model testing. In a model testing scenario, you just want to see: does this model work? Does it give me some sort of semblance of reasonable output? How do I use it? We're going to use Qwen 3 for that. We also want to keep the infrastructure small, so by default, very slim infrastructure at steady state, and we're going to see how we actually scale up to bring in a model server.

Thumbnail 1790

It's consistent, so you saw that Helm file a second ago. Everything you're going to see is going to use a very similar Helm file, and all the endpoints are going to stay the same. If you're testing vLLM with Qwen 3 and Triton with some really big model, for instance, everything is as consistent as we can make it, so you can swap in and out. It's customizable, so we have all the model parameters exposed. If you want to override things or change things, that's all possible to do. Our Helm chart there. If we zoom in on our inference ready cluster, we're going to throw out all of the controllers and all of the pods that just keep our infrastructure running, and now we're just going to focus on model serving.

We have our inference cluster and want to deploy a model. Right now we don't have any nodes that will support it, because we're not going to bring up a GPU node when you don't have any GPU requests. You installed that Helm chart and now you've got a pending vLLM pod. What's going to happen? Karpenter is going to realize that there's an unscheduled pod, and it's going to bring in a node.

Thumbnail 1860

To actually get the model started, we need two things. We need to pull and start the container, and then we need to bring in the weights. If you're just starting out, you're probably using Docker. We're going to pull the Docker image and then pull the weights, and now we've got a running model server that we can actually start consuming. Whether it's a pod that's in your infrastructure or if you're port forwarding or using an ingress and connecting to it externally, you now have a model that you can actually send and receive requests from.
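At that point the first request is a plain OpenAI-compatible call. The sketch below assumes you have port-forwarded the service to localhost:8000 and deployed an illustrative Qwen 3 model; adjust both to your own deployment.

```python
# Sketch: first request against a freshly deployed vLLM pod, assuming the
# service has been port-forwarded to localhost:8000 (e.g. kubectl port-forward).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # must match the model name the server was started with
    messages=[{"role": "user", "content": "Give me three facts about Kubernetes."}],
)
print(resp.choices[0].message.content)
```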

Thumbnail 1870

Thumbnail 1890

Thumbnail 1910

Thumbnail 1920

In this case, we talk about vLLM. Of course, we talked about Triton a little bit. There's SGLang and other model servers, but we just want to show you how easy it is to prove to yourself that vLLM works and gives you this performance, and that Triton also works and gives you what it does. If we look at our charts, we go from vLLM to Triton. We change our image, keeping the same exact configuration file otherwise, and you can deploy Triton and get a very similar endpoint, other than the Triton-specific parts. It's very similar and very consistent. Skipping a few steps, now we have a Triton model server running and we're able to send and receive requests just like that.

Thumbnail 1930

So we have a model server running and we're able to see that it works. Now we actually want to understand how well it works, and that's where benchmarking comes in. We use benchmarking because we want to see if this model actually gives us usable output and if this model server gives us good results. There are a few different components, so we're able to switch and test different things. We want to optimize our parameters to make sure that they meet our SLOs. We want to test those components and then, really importantly, test it with your own data.
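Dedicated tools (covered below) do this properly and at scale, but a short sketch shows the two metrics they report for each request: time to first token and decode throughput. Endpoint and model name are the same illustrative assumptions as before.

```python
# Sketch: hand-rolled TTFT / throughput measurement against an OpenAI-compatible
# endpoint. Benchmarking frameworks do this at scale with your own dataset; this
# only illustrates what they are measuring.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

start = time.perf_counter()
first_token_at = None
tokens = 0
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1  # roughly one token per streamed chunk

total = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft:.2f}s, decode throughput: {tokens / max(total - ttft, 1e-6):.1f} tok/s")
```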

Thumbnail 1980

Thumbnail 1990

Synthetic data is great, but because of the way token distributions skew performance, if you have a different distribution than what you're actually going to see in production, your model server will perform differently. That's why it's really important to bring your own data and be able to test with it. We're going to look at what that looks like. We have benchmarking here, and instead of our own model consumer pod, we now have a benchmarking framework that's going to run a test and give us results on how this model server is performing. If you go to the QR code, we have a very deep guide on using Inference Perf. Because we're talking a lot about vLLM and the vLLM ecosystem, there's another tool called GuideLLM that's part of the vLLM project that does complementary work and also does a few other things that are really nice. We don't yet have the chart for that, but I do want to show you really quickly how it can run tests and give you this output.

Thumbnail 2030

The really nice thing it does, which I really like, is being able to subjectively understand at first how well the model is actually performing. If you look, I'm not going to claim that this is our dataset, but this is a Databricks dataset, and it is a real dataset rather than just being synthetic data. You can see in the dataset on the instruction side these are the prompts, and over here is sort of what the response should be. If you look at a small excerpt of what I actually got from it, you can see maybe the output here is summarized as being 2000 versus giving you the full date. If that's close enough for you, great, but if you need maybe more, maybe you need to tweak your prompt, maybe you need to get a different model, this gives you a subjective view of whether this is doing what you expect it to do.

Thumbnail 2100

The next step, of course, is to go into deeper evals. Unfortunately, we don't have time for that today, but I'm happy to talk about that in a different setting. With benchmarking done, you've now optimized a single instance of a model server to meet your SLOs.

Production Readiness: Autoscaling with Ray, Container Optimization, and Distributed Inference

You have to start thinking now about Service Level Objectives, or SLOs. What are we going to do with this model server once it leaves our small environment? Ultimately, we're going to send requests to it, and we need it to be reactive to those requests. Are we going to scale this up and have 200 instances running all the time without knowing whether those requests are needed or whether all those replicas are needed? Or are we going to start really low and cause issues because things are timing out or the latencies just aren't good enough?

Thumbnail 2140

That's why we do model scaling. In this case, we're going to look at Ray as part of the vLLM ecosystem. What we want to make sure is that as we scale up, we scale up because traffic is increasing. We're not doing it just in case, because if you scale up your replicas, you are consuming resources whether or not they're actually in use. That means you're not able to run other workloads or your costs are going to increase for no reason.

Thumbnail 2200

On the other hand, we also want to make sure we can scale down. As traffic drops, maybe you're very regional as far as traffic patterns and you find that overnight you're not getting as many requests, you want to make sure that you can scale down for that and aren't doing it just based on rules. Of course, it's more cost effective, and we want to make sure that just like we provide the configuration for the model server itself, we also make autoscaling configurable. So of course we have a chart for it. If you look now, instead of just vLLM, we are asking for Ray vLLM, and then we expose some options that are specific to autoscaling.
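The chart configures this through Helm values; as a rough conceptual equivalent, here is what request-driven autoscaling looks like in Ray Serve, which the Ray vLLM pattern builds on. The class is a stub so the sketch stays self-contained, and the autoscaling key names can differ between Ray versions, so treat the values as illustrative.

```python
# Sketch: request-driven autoscaling in Ray Serve. In the real deployment the
# replica wraps a vLLM engine; a stub keeps this runnable on its own.
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,             # stay small at steady state
        "max_replicas": 4,             # cap GPU spend
        "target_ongoing_requests": 8,  # add replicas when queues build up (key name varies by Ray version)
    },
    ray_actor_options={"num_gpus": 1},
)
class ModelServer:
    def __init__(self):
        self.ready = True  # stand-in for loading a vLLM engine

    async def __call__(self, request):
        return {"ok": self.ready}

app = ModelServer.bind()
# serve.run(app)  # deploy onto a running Ray cluster
```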

Thumbnail 2210

Thumbnail 2230

Thumbnail 2250

Let's look at what that looks like in practice. In this case, you can see we have a consumer and it's sending a lot of requests, way faster than our responses are coming back. Ray, via the head pod, is actually going to be able to see that these requests are queuing up. What it's going to do is automatically create another replica. Of course, we don't have a server just sitting there waiting for it. We're being reactive and cost conscious, so we'll have to scale up our replica or our instance, then pull the container, pull the weights, and now we're able to meet those request and response times. Then of course, as those drop, we scale back in, or scale our model servers back down.

Thumbnail 2270

Thumbnail 2300

The problem with this, if you've ever been in the scenario where you're waiting for a container to come up, is that these containers are very large, probably tens of gigabytes, and the model weights are very large too, also tens of gigabytes or more. If you think about what you have to do to be able to do that, you have to create a node, then you have to wait for the container to pull, and only then can you pull your weights. This is a time-consuming process, and if your users are waiting for you to do that, they might just give up and go somewhere else.

So there are a few strategies to address this. On the top, we talk about caching the actual container itself. You can do that with an EBS snapshot. When your node comes up, your container images are already there and you'll pull them instantly because it'll come out of the cache. Likewise, you can do the same thing by either putting your model into the container image or putting it somewhere where it's more accessible. The problem there, if you've ever done this, is you have to maintain this cache. Images get updated, models get updated. It's kind of a pain.

What we do is try to optimize this process in two ways. We optimize the container pull using something called SOCI, which is Seekable OCI. It comes by default on the inference ready cluster, and it allows you to pull all of the layers in parallel, so they come onto the node much faster. The other side, once we've got the container running, is pulling the model weights, and we do something very similar with model streaming. vLLM has an option to do model streaming; it uses the Run:ai Model Streamer, so we're able to pull the model weights onto the container faster.
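SOCI is configured at the node and containerd level rather than in application code, so the sketch below only covers the model-streaming half: pointing vLLM at pre-staged weights (staged to S3, as described next) and switching the load format to the Run:ai streamer. The bucket path is a placeholder, and this load format requires the corresponding vLLM extra, so check your version.

```python
# Sketch: loading weights with vLLM's Run:ai Model Streamer support instead of
# the default loader. The S3 path is a placeholder for pre-staged weights.
from vllm import LLM

llm = LLM(
    model="s3://my-model-bucket/qwen3-8b/",  # staged copy, not pulled from Hugging Face
    load_format="runai_streamer",            # stream tensors to the GPU as they arrive
)
```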

Thumbnail 2380

Thumbnail 2390

Thumbnail 2400

Let's go through what that looks like. We wrote a template here, very simple. What it basically does is: you give it a model name and it's going to put it on S3 for you. So when you run this, it'll just pull the model and push it over to S3. Now we're not having to pull directly from Hugging Face; we can go through S3 because the model is already there for us.

Thumbnail 2410

Now what happens is, you can see it looks similar. Unfortunately, it's very hard to show a parallel pull on a slide, but this does happen much faster, I promise you. We're able to meet these demands a lot better, but you can believe me or, hopefully, believe my benchmarks here. What we did was run the same benchmark on two different instances. We took one that was just good enough to fit the model—the baseline instance for the model we used—and one that was way overkill, just to benchmark and see how much better this is on a far more performant node. We're also using a 15 gigabyte model, just to give you a little more context around what we're doing.

So if you look at the top left there, that's the image pull. This is just using Docker Hub. How long does it take to get this container image pulled? On the cost effective node, 3 minutes 46 seconds. On the overkill node, 1 minute 41 seconds. That has to do with enhanced networking, faster processing, and faster storage. Then we looked at SOCI. SOCI is how we address the first part of this problem, which is the container pull. So with our SOCI parallel pull, we dropped the cost effective container pull from 3 minutes 46 to 1.5 minutes, and then on the performance from 1 minute 41 to 36 seconds. So really big difference there.

Thumbnail 2540

Then we looked at the second part of the problem: how do we get those weights onto the node faster? With our cost effective instance, 14 minutes 18 seconds; with our performance instance, 11 minutes 18 seconds. Now with model streaming, we're able to get the model weights pulled in 1 minute and 29 seconds, and for the performance one, 1 minute 14 seconds. Not much of a difference between the two nodes there; there are still a few things we can tweak, but this was just to see how well this works. To summarize, on the cost effective side we went from 8 minutes and 5 seconds total container start time to 2 minutes and 58 seconds—a 63% improvement. On the performance side, about 3 minutes down to 1 minute 50 seconds. If you're comparing, with these optimizations that took very little to do, we got the cost effective node to start faster than the performance node without the optimizations. So not a whole lot of work on anyone's part to actually implement this, but a much faster startup time on a more cost effective node.

Thumbnail 2580

Thumbnail 2600

So we'll take a quick detour here and talk about distributed inference. We've so far been talking about models that fit on one GPU or up to 8 GPUs, but basically on a single node. But what if you want to use a really large model, or a model that's maybe a little big but you want to spread it across smaller nodes, maybe a little more cost effectively? This is an option here. The thing that we leverage here is the degrees of parallelism we discussed earlier: we use tensor and pipeline parallelism. Of course, we want to make sure that we know where we are availability zone wise so that we're not spreading these nodes across availability zones, which would increase our latency and our costs.

Thumbnail 2630

So we have a chart for that. You can see here we switched from the plain vLLM framework to LeaderWorkerSet vLLM. Here we set a pipeline parallel size, which is the number of nodes we're going to spread this across, and then the tensor parallel size decides how many GPUs we're going to use from each node. The other thing I want to point out that's really important—you'll see this instance type here. This tells us which instance we're going to use to actually deploy this model. If you're familiar with Karpenter or just autoscaling with node groups, you might have a node group that says I just want G5 instances or G6 instances or maybe just GPU instances. If you deploy this without selecting an instance type, you might get instances that are smaller than others.

Thumbnail 2700

So as you bring up your pods, some of them might start while others might crash because they're on smaller GPUs, or you're going to have weird networking issues because you've got nodes that are really fast but bottlenecked by GPUs that are slower. So I would highly recommend, if you're going to use this, make sure you're using the same instance type. And again, here is what this looks like; we'll skip to the actual inferencing part.

Thumbnail 2730

Advanced Deployment Patterns: LLM Gateways, AI Bricks, and Community Resources

We have a request going to our leader, and then the leader coordinates amongst the workers how the processing happens, comes back to the leader, and then back to the consumer. Let's talk about LLM gateways a little bit. LLM gateways are really great for a lot of different reasons. They're great for routing because you're able to do things like error handling. If you have multiple model servers and one of them fails for whatever reason, you can retry that same request without sending a failure back to your client.

Thumbnail 2800

They let you centralize observability. If you have tracing that you want to aggregate rather than forcing all of your ML teams to do their own thing, you can centralize that through the gateway. They let you centralize guard rails, so if you're managing a lot of inferencing, you're able to say, "Don't return a response back that has PII in it," which could be a good idea. And of course, if you're a platform team, we're talking a lot about self-managed inference, but you can use these with external providers. If you don't want to give API keys to everyone in your organization, you can centralize them in the gateway and then have all of that accounting over there. Teams consume them, and you can track that and attribute them to teams.

Thumbnail 2850

In this example, we're going to show that one of the really nice things about the gateway is being able to anticipate the best model to send a request to. In this case, a large request might go to a frontier model on a model provider versus one of your smaller, self-managed inference servers. Routing out to Bedrock is one scenario. The other scenario, as I mentioned, is the retry. It's going to try the smaller, cheaper one first. Maybe there's a failure, so before sending an error back it's going to go to the other server and then return the response.
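A gateway does this routing and retrying server-side; purely to illustrate the behavior, here is a client-side sketch of the same try-small-then-fall-back logic. The service URLs and model names are hypothetical placeholders.

```python
# Sketch of the routing behavior described above: try the small self-hosted
# model first, fall back to a larger target on failure. A gateway handles this
# for you; endpoints and model names here are placeholders.
from openai import OpenAI

SMALL = ("http://vllm-small.inference.svc:8000/v1", "Qwen/Qwen3-8B")
LARGE = ("http://vllm-large.inference.svc:8000/v1", "Qwen/Qwen3-32B")

def chat_with_fallback(prompt: str) -> str:
    for base_url, model in (SMALL, LARGE):
        try:
            client = OpenAI(base_url=base_url, api_key="not-needed")
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        except Exception:
            continue  # retry on the next, larger target instead of failing the client
    raise RuntimeError("all model servers failed")

print(chat_with_fallback("Draft a short release note for version 2.4."))
```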

Thumbnail 2870

We've talked about how to identify and benchmark models, whether you want big ones or small ones, and how to do scaling. Now we're at that advanced stage where we're looking at some more advanced techniques. AIBrix is a member of the vLLM ecosystem, so we chose to highlight it here. There are a lot of other tools you can use as well. We also have Dynamo in our repository, so if you want to swap this out with Dynamo, that's a possibility. The idea is that you've got more context for where load balancing happens.

If you think about multiple model servers and the stochasticity that comes with running these LLMs, you might have a few instances that are processing requests a lot longer than other ones. You don't want round robin, where you're going to jam up certain replicas. Instead, you want to make sure you always send to the model server that has the most free processing capacity. That's context-aware load balancing. AIBrix also does LoRA adapter management. If you have LoRA adapters, it can know, based on the request, whether to load or unload them and where to route the request.
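The underlying capability being managed here is vLLM's multi-LoRA serving. As a rough sketch, assuming an illustrative base model and a hypothetical adapter name and path, a per-request adapter looks like this; AIBrix's contribution is deciding which replica should hold which adapter and routing accordingly.

```python
# Sketch: multiple LoRA adapters on a single base model with vLLM, the
# capability that adapter-aware routing builds on. Adapter name/path are
# placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen3-8B", enable_lora=True, max_loras=4)

out = llm.generate(
    ["Classify this support ticket: 'My invoice is wrong.'"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("ticket-classifier", 1, "/models/loras/ticket-classifier"),
)
print(out[0].outputs[0].text)
```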

Thumbnail 2940

Thumbnail 2950

We talked a lot about distributed KV cache, and of course, AWS has a lot of different accelerators. This gives you the ability to mix and match those and still maintain your SLOs. The only change is the framework: if you were using this with vLLM before and now you want to try this more advanced setup, you can switch from vLLM to AIBrix and you've got this AIBrix deployment.

Thumbnail 2970

Here we'll show you an example of routing to a model server that has a LoRA adapter that you need, or maybe one that has less load on it. Based on the Envoy gateway that's running and the AIBrix controller, it'll know which model server to route this request to.

Thumbnail 3000

We know everyone is at a different point in their LLM journey. We want to be here to support you and enable you. For a lot of companies and people here, running infrastructure is not what you want to do. Deploying models is not what you want to do. You want to build and enable on top of that. We're here for you, whether you're just starting out or towards the end of that journey.

Of course, with that, we have more things to show. I'll slow down on this slide because I know there are a few QR codes. We have some workshops, so please, if you're interested in learning more, we have the Generative AI on EKS workshop, which goes through a lot of this material in a more guided manner, as well as introducing agentic AI and observability. We have this other repository: you may have noticed the AI/ML observability line on one of the deployments; we have an open source observability stack for AI/ML, so if you're interested in that, please check it out.

Thumbnail 3070

There's way more in the repository than we could talk about in an hour, so please check out the repository. It's entirely an open source project. We love getting issues and contributions. Please feel free to open issues for feature requests, bugs, or anything like that. We are really hoping to engage everyone here and make this a great experience.

Finally, we have another resource: our AWS Skill Builder. If you are just starting out on this journey or just want to level up a little bit, we have some material on AI here that you can use and go deeper into. Thank you everyone for joining us today.


; This article is entirely auto-generated using Amazon Bedrock.
