🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Scale AI agents with custom models using Amazon SageMaker AI & SGLang (AIM387)
In this video, AWS demonstrates building production-ready agentic AI applications using Amazon SageMaker's end-to-end capabilities. The session covers fine-tuning Llama 3.2 3B with QLoRA on medical data, deploying models using custom SGLang containers via SageMaker's BYOC paradigm, and orchestrating workflows with SageMaker Pipelines. A healthcare agent demo showcases integration with Bedrock AgentCore, implementing MCP tools for patient lookup and S3 report uploads, with full observability through MLflow and CloudWatch tracing. SGLang's lead engineer details performance optimizations including speculative decoding V2, prefill-decode disaggregation, hierarchical KV cache, and GB200 NVL72 rack-scale deployment achieving significant throughput improvements for large-scale inference.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Challenge of Deploying Agentic AI Applications at Scale
Welcome everyone. This session is for data scientists and AI developers who are looking to customize models and deploy them at scale to build high-quality, cost-effective agentic applications. So let's get started.
We are seeing two big trends in the industry. First, we are seeing rapid adoption of agentic AI in enterprise software applications. This adoption is expected to grow from 1% in 2024 to 33% by 2028, a 33x increase in just four years. To build these agentic applications, you need models that are high quality, cost effective, and fast, and customers are increasingly relying on open weight models to build them.
Despite the huge opportunity and a clear line of sight on how to build these applications, we continue to see that the majority of them never reach production. Let's take a look at the key challenges customers face that lead to these failures.
First, customers lack standardized tools to customize models with different techniques, so they end up spending time building out these workflows themselves, which delays model customization. Once the models are customized, you need an inference stack to host them at the right price-performance ratio. It takes a lot of effort to find the right instance and container configuration and to constantly adopt new inference optimization techniques so that your inference stack keeps up with the latest requirements. This is complex and requires a ton of effort, which further delays these projects.
After the model is deployed, customers need tools to track the behavior of these models and agents. Often observability is fragmented across different tools, which makes root-causing and debugging much harder, further delaying progress. And once you identify all these issues, it's time to take your model customization workflow into production. Often the experimentation work is done with glue code, which doesn't scale well when you try to deploy it to production, so these pipelines frequently need to be rebuilt. This disconnect leads to further delays.
Lastly, to deploy these models at enterprise scale, you need the right governance practices. You need capabilities to track, audit, and version your models and generative AI assets so that, as they move from dev to staging to production environments, you can keep track of them and ensure they meet compliance and governance requirements before they are deployed to customers.
SageMaker Training Capabilities: Fine-Tuning Models with Resilient Infrastructure
I'm Ahmed Moody, Senior Manager for SageMaker Model Operations and Inference, and I have with me Dan, our Worldwide Specialist for Gen AI, and Jing, a co-creator of SGLang. In this session, we'll do a quick overview of all the SageMaker capabilities that can help you address these challenges. Then Dan will do a demo that brings these capabilities to life, and Jing will talk about some of the key highlights of SGLang that help you host cost-optimized inference at scale. So let's dive in.
SageMaker offers capabilities to train and fine-tune these models. SageMaker offers the broadest selection of models, including all popular open source models as well as Amazon models you can use to get started. They meet enterprise requirements, with security scanning and so on already done, so you can get started quickly. You can choose from the broadest range of recipes to fine-tune and customize your model, and we'll take a deeper dive into these recipes in the coming slides. You can also choose the right training infrastructure, whether that's ephemeral compute with fully managed training jobs or persistent clusters for large-scale jobs with SageMaker HyperPod. SageMaker has resiliency built into this infrastructure, which helps you train your models faster. Let's take a look at some of the key training capabilities.
The first key capability is that when you kick off training jobs on HyperPod, SageMaker constantly checkpoints them. If there is a node failure on the cluster, the cluster self-heals by replacing the bad node with a good one and resumes the training job from the last checkpoint, so your work is never lost. This also accelerates training, because multiple node failures can happen in a single day.
The self-healing cluster takes away the burden from you to identify those nodes, replace them, and resume the job from the last checkpoint. All of this is taken care of for you as a managed capability.
Another key capability is these fine-tuning recipes. SageMaker offers fine-tuning recipes to customize your model for most commonly used popular models and for most common techniques. You can get started by putting your training and validation data into the right directories, identify the right recipe for your use case, and then kick off the recipe either on the fully managed training jobs or on the HyperPod clusters, depending on your compute requirements.
Cost-Effective Inference with Multi-Model Endpoints and Speculative Decoding
Next, let's take a look at SageMaker AI Inference. We talked about how the key challenge with inference is hosting it cost-effectively, and you need capabilities that let you deploy your models quickly. SageMaker AI Inference lets you deploy open source or fine-tuned models in a couple of steps onto a managed instance, either through the UI or the SDK. You can host any model, on any framework, on any type of infrastructure to make sure you're getting the best performance. Out of the box, SageMaker offers high throughput and low latency for your endpoints, and we'll take a deeper dive into some of these features.
With SageMaker, you can deploy multiple models onto the same endpoint. This ensures that you can scale the one endpoint to all your use cases to drive the maximum utilization out of GPUs. The intelligent routing built into the endpoint redirects the request to the right model and ensures there is no performance penalty. You can scale up to hundreds of models on the same endpoint depending on the memory requirements.
SageMaker also emits metrics for each of these models, so you can configure auto-scaling policies for each model separately. This allows your SageMaker Inference endpoint to auto-scale as traffic grows. In this example, Foundation Model 3 has high traffic, so the endpoint auto-scales only Foundation Model 3 to make sure it meets its requirements.
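As a rough sketch, per-model scaling for endpoints that use inference components is typically configured through Application Auto Scaling. The component name below is a placeholder, and the dimension and metric names should be verified against the current documentation:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Placeholder inference component name; each model gets its own scaling policy.
resource_id = "inference-component/foundation-model-3-ic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="fm3-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 4.0,  # target invocations per copy; tune to your latency goals
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```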
We also recently launched a capability that gets you better throughput out of these models. Customers expect fast inference and fast responses, but in reality these LLMs generate one token at a time, which makes it really hard to meet that expectation.
We launched speculative decoding recently. Speculative decoding works this way. You have a foundation model and you have a draft model. The draft model is a smaller model that does predictions for the prompt, and the foundation model validates these predictions by calculating some probabilities. It accepts some of the tokens and rejects the rest. This is a simplistic view of how this works. In reality, you can have the draft model actually produce multiple variations that the foundation model can choose from.
Let's take a quick look at an example. In this case, the foundation model and the draft model both receive the prompt. The draft model produces the next set of tokens, the foundation model computes the probability of each token, and it accepts some of them. That is how speculative decoding works, and it leads to latency reduction and up to 2.5x higher throughput for your model without compromising accuracy.
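To make the mechanics concrete, here is a minimal, self-contained toy sketch of one speculative-decoding round. It uses dummy models and a simple acceptance threshold rather than the exact rejection-sampling rule production systems use, so treat it only as an illustration of the draft-propose / target-verify loop:

```python
import random

def draft_propose(prefix, k=4):
    """Toy draft model: cheaply guesses the next k tokens."""
    return [f"tok{random.randint(0, 9)}" for _ in range(k)]

def target_accept_prob(prefix, token):
    """Toy foundation model: the probability it assigns to a proposed token."""
    return random.random()

def speculative_step(prefix, k=4, threshold=0.5):
    """One round: the draft proposes k tokens, the foundation model verifies
    them and keeps the longest accepted prefix; the first rejection ends the round."""
    accepted = []
    for token in draft_propose(prefix, k):
        if target_accept_prob(prefix + accepted, token) >= threshold:
            accepted.append(token)
        else:
            break  # the foundation model would emit its own next token here instead
    return accepted

print(speculative_step(["The", "patient", "presents", "with"]))
```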
All these capabilities work out of the box. The way it works is you bring your own dataset, which maps to your traffic patterns, and use that dataset to fine-tune the draft model so the draft model can do better predictions. SageMaker kicks off an asynchronous training job to fine-tune your draft model. After the draft model is fine-tuned, you can look at the evaluation metrics, choose to deploy that on the same SageMaker endpoint without provisioning a new instance, and then the speculative decoding gets started.
End-to-End Observability with Serverless MLflow Integration
Out of the box, SageMaker offers managed containers for hosting the most popular frameworks. You can use the open source Deep Learning Containers, which come in vLLM and SGLang flavors, or the proprietary LMI container, which also supports TensorRT-LLM. Today, we also announced the launch of Serverless MLflow on SageMaker. Serverless MLflow is a fully managed experience, so you don't need to worry about managing any infrastructure; it scales up and down based on your requirements, whether you're running a large-scale training job or sending agent traffic its way. We'll talk about those use cases in a bit.
You can log your experiment runs, evaluation results, or agent traces all in one place. SageMaker MLflow is free and does not have any additional charges. Here is a screenshot of how experiment tracking works with MLflow. Within the experiment you can see all the runs, each of which is a training run. You can compare and contrast the training runs to identify the best candidate model to take to production, so you have full observability from the moment of model customization.
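As a minimal sketch of what that experiment tracking looks like in code (the tracking server ARN, experiment name, and logged values are placeholders):

```python
import mlflow

# For SageMaker managed MLflow, the tracking URI is the tracking server ARN.
mlflow.set_tracking_uri("arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/demo")
mlflow.set_experiment("llama32-qlora-medical")

with mlflow.start_run(run_name="qlora-r16-lr2e-4"):
    mlflow.log_params({"lora_r": 16, "learning_rate": 2e-4, "epochs": 1})
    mlflow.log_metric("eval_loss", 0.87)      # placeholder value
    mlflow.log_artifact("model_card.md")      # e.g., attach a model card or eval report
```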
If you're using Bedrock AgentCore to deploy your agents and SageMaker AI to customize the models, AgentCore already emits all the traces into its observability dashboards, and you can also send them to MLflow because they follow the OpenTelemetry specifications. SageMaker also offers Partner AI Apps, third-party applications available as a native capability, which can help you monitor agent and model performance.
So now let's take a look at how this works. You can now see on the left side there's a complete trace tree, starting with the invoke agent at the top, drilling down through the workflow build process, capturing each step of the LangChain operations, and even showing the tool calls, and you can see multiple assistant interactions too. This hierarchical view gives you the complete visibility into every step of the agent, including the conditional branches and iterations, which makes it easy to root cause any issues within the agents. And you can go down to the model customization and see which model was used, what dataset was used, which makes debugging much easier.
Production Workflows and Model Governance with SageMaker Pipelines and Model Registry
Now that you have your experimental workflows figured out and all your observability in one place, you need capabilities to turn those experimental workflows into scalable, repeatable production workflows. SageMaker AI offers Pipelines. You can convert your existing experimental code into pipeline steps by annotating it with the @step decorator, or drag and drop the steps in a UI to create an end-to-end pipeline. SageMaker also offers built-in steps for training, evaluation, and deployment that eliminate the need to write redundant wrapper code for spinning up training jobs and infrastructure. And Pipelines is a serverless orchestrator, so you don't need to manage any infrastructure.
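A minimal sketch of that @step-to-pipeline flow, assuming a recent SageMaker Python SDK; the step bodies, S3 paths, and role ARN are placeholders:

```python
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline

@step(name="preprocess")
def preprocess(raw_s3_uri: str) -> str:
    # ...tokenize and format the dataset here...
    return "s3://my-bucket/processed/"          # placeholder output location

@step(name="finetune")
def finetune(processed_s3_uri: str) -> str:
    # ...run the fine-tuning logic here...
    return "s3://my-bucket/model-artifacts/"    # placeholder artifact location

# Chaining the decorated calls wires one step's output into the next step's input.
model_artifacts = finetune(preprocess("s3://my-bucket/raw/"))

pipeline = Pipeline(name="llama-qlora-pipeline", steps=[model_artifacts])
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole")  # placeholder
execution = pipeline.start()
```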
We emit all the telemetry to CloudWatch where you can observe the behavior of each pipeline step. SageMaker Pipelines also have caching built in, so if a pipeline fails and you retrigger the pipeline, it will skip the steps that were already successfully executed. Lastly, for governance, we talked about how you need to keep track of and manage each version of the generative assets. So SageMaker offers Model Registry, which is the single source of truth for all your model versions.
SageMaker Model Registry is a central hub to manage the entire lifecycle of the machine learning models. We'll take a look into some of those key capabilities in a moment. You can track all the metadata corresponding to the training jobs, evaluation runs, and so on into SageMaker Model Registry. So here, with SageMaker Model Registry, you can set up cross-account access that allows you to track all the models in all your environments in one place. So you can track not only the environment that you're working in, but also your staging and the production environments.
Model Registry also enables you to capture all the details of the training job, so you can have end-to-end lineage of what the model went through, which makes it much easier to debug in case you run into any issues. Next, I will hand it over to Dan, who will walk us through the demo to bring all these capabilities to life. Thank you.
Demo Introduction: Building a Clinical Healthcare Agent with Custom Models
Hello everyone. Today we're going to go through this demo, which showcases a lot of the capabilities that Ahmed was talking about. Let's imagine that we are some kind of clinical healthcare provider, maybe a clinic or possibly an emergency room, and we want to increase the number of people we can see using machine learning and artificial intelligence. So we're going to increase the number of patients we can see by developing an agent using a lot of the techniques that Ahmed mentioned.
We're going to start by hosting a language model onto a SageMaker managed endpoint. We're going to fine-tune it and we're going to orchestrate that entire process using SageMaker Pipelines. We're going to make some friends along the way. We're going to incorporate MLflow for observability and tracing.
We're going to deploy that model onto not just the managed endpoint, but we're going to log it inside a model registry. We'll look at SageMaker's model registry as well as MLflow's model registry and how we can incorporate governance considerations like a model card in both of them. Then once our endpoint is hosted and up and running, we'll examine how we can connect that endpoint to something like an agent.
We'll start with an agent that's built off Strands SDK. If anyone's never worked with Strands SDK, it's a very lightweight open source tool that lets you build agents very quickly. We'll also examine how you use native MCP and connect all of this with some of the capabilities within Bedrock AgentCore. Another thing we're going to do that's a little interesting is take advantage of SageMaker's Bring Your Own Container paradigm.
We're going to use a new repository called ML Container Creator. The QR code links to that if you wanted to take a look at it. It's a very new, just launched like two weeks ago, open source repository. So if any of you are interested in contributing to the open source world, this is a great place to get started because it is very new.
So without further ado, let's take a look at this first part of launching our model. This is the part of the architecture that we're going to be focusing on in this first part of the demo where we're going to be training a Llama 3.2 3B Instruct model using QLoRA adaptation. We're going to host it on a SageMaker endpoint using SGLang for serving. The SGLang container that we're using is actually one that we're bringing ourselves. We're building it ourselves using some of the assets generated by the ML Container Creator repository.
Deploying Llama 3.2 with Custom SGLang Container Using SageMaker Pipelines
So let's dive into how this actually works. The first part of any notebook is a big list of imports. Don't bother yourself with too many of these, except for these steps. These are the pipeline definition files that we're importing from our file system; they tell us how to execute each part of the pipeline, and we'll see in a moment how they come together to form a directed acyclic graph, like the one Ahmed showed on the slides before.
Each one of these has special instructions about how each step is supposed to run. The outputs of one step get processed and become the inputs to the next step. I'll show you what one of these steps looks like in a moment. As we continue, we're basically setting up the rest of our environment so that we have a couple of Boto3 clients to work with and a fancy timer for keeping track of things.
Where this really gets interesting is where we initialize our MLflow environment. For this demo, we need to give our code a place to point all of its telemetry requests and all of the parameters and metrics that we want to log through each of these pipeline steps. What we're actually doing here is capturing the MLflow ARN. This demo uses a server-bound MLflow instance, but with serverless you can now do the same thing. We're setting the tracking URI and experiment name in our environment. These two lines of code are included here for completeness; you'll see them again inside the steps, where we're publishing metrics, logs, and configuration items to MLflow within each pipeline step.
So this is where we set these up and you'll see it more in each step as we look through those. Then we continue to identify where we're getting our configuration and our container assets from. I took the liberty of using ML Container Creator to generate an SGLang container, and then I uploaded those files to S3. It's really just a Dockerfile and a serving file that basically spins up an SGLang server onto whatever instance you deploy it on.
So I've put all of those onto S3 already and we're going to define where we're creating that inside Elastic Container Registry because we're creating our own container and deploying it onto a SageMaker endpoint. We're going to be using Llama 3.2 3B Instruct for this fine-tuning job.
We're going to get the model ID directly from Hugging Face inside our fine-tuning job. This was my Hugging Face key. And this wall of text is the pipeline definition. So each one of these blocks is effectively a method call to the files that we imported at the very beginning. I'm not going to spend too much time on this, because if you haven't worked with pipelines before, you're not really going to appreciate this big wall of text like you would appreciate this diagram.
This is the directed acyclic graph generated by the big wall of text we just saw. Effectively, it starts with building our container, then preprocessing a dataset for fine-tuning. That dataset is the FreedomIntelligence medical-o1 reasoning dataset, a series of questions and answers with chain-of-thought reasoning that we're going to use to fine-tune our model into a better clinical advisor.
So we're going to preprocess that to work for our Llama model, do QLoRA adaptation on it, and then deploy it using our custom SGLang container. We're also going to register it in SageMaker Model Registry and the MLflow Model Registry. All of these steps are represented and defined in these method calls. We stitch them together using a SageMaker Pipeline object, using the steps we imported earlier, and we upsert the pipeline with an existing role. This role has permission to do everything we need to do within each part of the pipeline.
And then the magic happens: we run pipeline.start. This creates the pipeline if it doesn't already exist (in this case it does) and kicks off a new pipeline execution. Once it's successful, each of these steps shows a green tick mark, and you can drill into each one to take a look at what happened. It points you to the CloudWatch logs and other parameters for each pipeline step, and this is how it looks within SageMaker.
We can go into the console now and take a look at all of my failed runs and my one successful run, or we can look at how this looks in MLflow. MLflow has a very similar layout. Instead of a graph, we have this list of pipeline steps that have run under a run ID, and you can see the BYOC container, data preprocessing, QLoRA adaptation, SGLang server, model registration. All of the same steps are also logged in MLflow.
Inside the QLoRA Fine-Tuning Pipeline Step with MLflow Tracking
So let's look at one of these steps to see how this was set up. I'm going to pull out the QLoRA fine-tuning orchestration step. Each of these steps is about the same. The only difference is what's actually happening, what you're actually instructing the training job to do. In this step, we're instructing the training job to fine-tune the model. Every training step is going to have, or every pipeline step is going to have some of the same elements.
We're going to have an @step annotation. This allows us to define a step separately from our notebook or pipeline context so that you don't have lots and lots of very long methods in the same file. You can import them the way we showed before, as long as you define a set of requirements, because what you're doing is creating a training job and executing this code inside that SageMaker training job. We're also passing information about the MLflow tracking server, which allows this step to publish metrics and telemetry data back to MLflow.
Then some of the things you'd come to expect from a fine-tuning job: training dataset, test dataset, the model ID that we're actually training. In this scenario, we're pulling this straight from Hugging Face, so we have a Hugging Face token that we're passing, and a role. Here again we see we're setting the tracking URI and the experiment name just like we showed before in the previous notebook. This is how the execution context knows how to publish metrics and log data to MLflow. Then we define the MLflow context. We're also setting up automatic system metrics logging. This captures system metrics like memory used, disk used, network bytes, and IO.
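A condensed sketch of what one of these steps might look like with MLflow wired in; the step body, parameter values, and the helper that launches the training job are placeholders, and log_system_metrics=True is what turns on the automatic system metrics capture:

```python
import mlflow
from sagemaker.workflow.function_step import step

@step(name="qlora-finetune")
def finetune(train_s3_uri: str, mlflow_arn: str, experiment_name: str) -> str:
    # Point this step's execution context at the MLflow tracking server.
    mlflow.set_tracking_uri(mlflow_arn)
    mlflow.set_experiment(experiment_name)

    # log_system_metrics=True captures memory, disk, network, and GPU utilization
    # alongside the training metrics.
    with mlflow.start_run(run_name="qlora-finetune", log_system_metrics=True):
        mlflow.log_params({"base_model": "meta-llama/Llama-3.2-3B-Instruct", "lora_r": 16})
        model_s3_uri = launch_training_job(train_s3_uri)   # placeholder: builds the trainer and calls train()
        mlflow.log_metric("final_train_loss", 0.42)        # placeholder value
        return model_s3_uri
```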
And we begin running the job, and this is all the kinds of configuration we need. This is where we would set up our training configuration, and you can see eventually as we get down to the end, we're going to kick off a training job. We're defining our compute, input data, and eventually a model trainer.
Before we actually call train, all of the things that are happening before that are logs, pieces of information being built and sent out to MLflow to make it easier to understand what has happened in this run. We can take a look at some of the system metrics for the fine-tuning job. We could take a look at some of the metrics that are published and tagged.
So these are the kinds of things you'll see when you're writing these steps for your SageMaker pipeline; this is where it's going to end up. Now that we've defined our pipeline, before we move on I should mention the model card that was set up in the model registration step. This is an important piece to make note of, because this is a medical scenario where an AI agent is advising healthcare professionals, so a model card should contain something similar to this (not exactly this, but similar) when you're implementing appropriate governance standards.
So now we'll test the endpoint. I set this endpoint up a while ago, which is why I'm hard-coding the endpoint name and inference component, and then we just create a standard predictor like you would with the SageMaker SDK. We'll run a basic prompt against this endpoint, and you're going to see this prompt a lot: a 45-year-old male with fever and cough, temperature is 101.3 and heart rate is 98. Give me a treatment recommendation and a report, and upload the report to S3 when done.
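Invoking an endpoint that fronts inference components can also be done directly with the boto3 runtime client. The endpoint and component names below are placeholders, and the request schema depends on what the custom serving container expects (here I assume an OpenAI-style messages payload):

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

payload = {
    "messages": [
        {"role": "system", "content": "You are a clinical decision-support assistant."},
        {"role": "user", "content": "45-year-old male with fever and cough, temperature 101.3, "
                                    "heart rate 98. Give me a treatment recommendation and a report."},
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

response = smr.invoke_endpoint(
    EndpointName="sglang-llama32-3b-endpoint",       # placeholder endpoint name
    InferenceComponentName="llama32-3b-qlora-ic",    # placeholder inference component
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```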
Now this is just a simple model that's deployed onto an SGLang endpoint. It doesn't have tools capabilities. It's just been fine-tuned. So we know that the model should have the ability to perform some basic medical reasoning because of the fine-tuning job, but it's not enabled with tools to upload a report to S3 yet. So this is the output here. You can see that the patient details have been set up in a nice report style format, some kind of basic markdown syntax, and the model did a fairly good job. It's performing a preliminary assessment of the patient's symptoms to suggest that the patient is exhibiting signs of pneumonia or something similar.
So this is the first part of our pipeline. The model is deployed onto an endpoint, it's using the BYOC container for SGLang that we built ourselves, we fine-tuned it, and we have tracked the entire experiment with MLflow, which we can go back and revisit whenever we want. Our model card is registered and logged, and the MLflow model registry also shows more information about the model and where it lives. This is version one.
Building a Strands Agent with SageMaker Endpoint Integration and Bedrock AgentCore
So if you do this again and publish a new version of that model, it gets added to the version list, the model card might get updated, and so on. That's the first part of our demo: we've successfully deployed a fine-tuned model to SageMaker. Now let's give it some more capabilities. Let's build out an agent that can take some of the actions we were talking about, like uploading the initial diagnostic report to S3 for a clinician to review at a later time.
We might also want to give it some additional capabilities around patient lookups. If a patient's already been there before, how does the agent know who this patient is? Maybe there's a patient database that we can query to get more information about that patient. And so really what we're building out is this piece down here. We're going to be using Bedrock AgentCore to deploy a Strands agent onto the AgentCore runtime.
The Strands agent is something we're going to be building out inside this notebook. It's going to be a pretty large block of code. I'm going to walk you guys through what that looks like.
We're going to use Bedrock AgentCore telemetry to capture telemetry data about each request that the agent is handling. So Bedrock AgentCore telemetry kind of complements MLflow for observability. Again, we have this big long list of imports. Nothing really to pay too much attention to here. Same kind of environment setup.
What I've done is copied the endpoint listing inside SageMaker. This is the endpoint that's been deployed, along with the name of the inference component we're working with. I forgot to mention that this model is deployed on an inference component, so it scales independently of the underlying infrastructure, as long as there is capacity on that infrastructure to scale into. So we define the endpoint name and the inference component; I hardcode them for the sake of this demo.
Now we're going to build out the Strands agent. The Strands agent is defined as an f-string in Python; we inject variables into the f-string so that, instead of flipping between different files, everything stays in one place, which is easier for demo purposes. You can see we're passing in the region, the endpoint name, an S3 bucket for artifacts, and the inference component.
We've defined a patient data class with basic information about who the patient is and what kind of symptoms they have. You can imagine that this might be analogous to a record in a patient database. I'm not a healthcare professional, so I have no idea what you might actually find in a patient database, but I took a stab at what it might look like, and this is what we came up with.
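For illustration, a minimal sketch of what such a patient data class might look like; the fields are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Patient:
    patient_id: str
    name: str
    age: int
    symptoms: list[str] = field(default_factory=list)
    medical_history: list[str] = field(default_factory=list)
    current_medications: list[str] = field(default_factory=list)

record = Patient(
    patient_id="12345",
    name="John Doe",
    age=45,
    symptoms=["fever", "cough"],
    medical_history=["hypertension"],
)
print(record)
```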
Now let's skip ahead to the more interesting part, where we invoke the SageMaker endpoint we defined earlier. This is written inside a method called invoke_sagemaker_endpoint_async, which invokes the SageMaker endpoint we deployed, but there are a couple of other things happening here too. First, we define our prompt, and it's basically the same query I showed you earlier; I'm not changing it. It's the 45-year-old man with a fever exhibiting a cough.
What we're doing is wrapping this up in a super prompt, which basically says that this is a clinical training scenario. What would you do in this situation as a clinical healthcare provider? Once we have that in place, we're going to define our tracer. The tracers are how Bedrock AgentCore observability captures trace data and trace information inside your runtime.
We've hardcoded some attributes here, like the SageMaker endpoint and inference component that is servicing the request. This might be really helpful if you have an agent that's making multiple calls to several endpoints and you don't know exactly which one was called when. This is a great piece of information to keep inside a trace, as well as some of the hyperparameters like max tokens, the temperature, and what the actual prompt was. These are examples of what you might want to put inside an agent trace.
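A minimal sketch of that span pattern using the OpenTelemetry Python API, which is what AgentCore observability consumes; the attribute names, placeholder constants, and stubbed endpoint call are illustrative rather than the session's exact code:

```python
from opentelemetry import trace

tracer = trace.get_tracer("medical-agent")

ENDPOINT_NAME = "sglang-llama32-3b-endpoint"   # placeholder
INFERENCE_COMPONENT = "llama32-3b-qlora-ic"    # placeholder

def call_endpoint(prompt: str) -> str:
    # Stub standing in for the sagemaker-runtime invoke_endpoint call shown earlier.
    return "Preliminary assessment: likely respiratory infection..."

def invoke_sagemaker_endpoint(prompt: str) -> str:
    with tracer.start_as_current_span("invoke_sagemaker_endpoint") as span:
        # Record which endpoint served the request plus the key generation parameters.
        span.set_attribute("sagemaker.endpoint_name", ENDPOINT_NAME)
        span.set_attribute("sagemaker.inference_component", INFERENCE_COMPONENT)
        span.set_attribute("gen_ai.request.max_tokens", 512)
        span.set_attribute("gen_ai.request.temperature", 0.2)
        span.set_attribute("gen_ai.prompt", prompt)
        response = call_endpoint(prompt)
        span.set_attribute("gen_ai.response.length", len(response))
        span.set_attribute("request.success", True)
        return response
```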
Then we define our payload. We're using the messages API, so we have a component which is the system prompt and we have a component which is the user prompt, again, that wrapped up prompt of the gentleman with symptoms of pneumonia. And then we invoke our endpoint.
This is how Strands lets us define the endpoint we're going to use within our agent. Once we get the response back from the model, we set additional information on our span object, which is the context that captures the trace: the duration of the model's response, the length of the response, and additional metadata about the success of the query.
So far, all we've done is define a way to query the SageMaker endpoint we deployed; now let's configure the actual agent using Strands.
I've set this up to run both in a local mode and inside an AgentCore runtime context. The AgentCore runtime uses FastAPI, so you'll see a little bit of overlap in this f-string to illustrate how to run this locally and how to run it on the AgentCore runtime. Here we're defining a SageMaker AI model; this is the class the Strands SDK uses to define an LLM hosted on a SageMaker endpoint. It was introduced in Strands in August, I believe, so it's a fairly new capability, and it allows you to use language models outside of API-based providers such as Amazon Bedrock.
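A minimal sketch of wiring a Strands agent to a SageMaker endpoint. The import path follows the Strands SageMaker model provider mentioned here, but the exact constructor arguments are assumptions on my part, so check the Strands documentation before relying on this:

```python
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel

# Constructor argument names below are assumptions; consult the Strands docs.
model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": "sglang-llama32-3b-endpoint",      # placeholder
        "inference_component_name": "llama32-3b-qlora-ic",  # placeholder
        "region_name": "us-east-1",
    },
    payload_config={"max_tokens": 512, "temperature": 0.2},
)

agent = Agent(
    model=model,
    system_prompt="You are a clinical decision-support assistant for trained providers.",
)

result = agent("45-year-old male with fever and cough, temperature 101.3, heart rate 98. "
               "Give me a treatment recommendation and a report.")
print(result)
```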
Then we just pass our prompt to the model in much the same way we illustrated before. You can see we have additional tracer information, more examples of how to use traces, identifying some attributes on the trace and then defining a span context within which we perform additional executions. If we're not in local mode, we create the Bedrock AgentCore app, which effectively creates a FastAPI server, and we call create_medical_agent, the block of code that creates the SageMakerAIModel object and the Strands agent.
Then we define our entry point, which again performs effectively the same task of passing a prompt to the model, except this code only ever runs when we deploy onto Bedrock AgentCore. When we run this locally, which I'll show you in a moment, it doesn't execute this block; it executes effectively the same logic in a local context. So what I do here is write this big f-string out to a Python file, and now I'm going to run it using shell magic with the local flag, passing in the same friendly prompt that we've come to know and love.
When we run this, we're spinning up the file inside our Python runtime and it passes the message to the language model endpoint. This is exactly what the agent will do once it's deployed to the AgentCore runtime; we're just not accessing the agent code over the runtime API yet. What it comes back with is basically the same thing we saw earlier: based on the clinical scenario provided, a 45-year-old male presents with fever and cough, his temperature of 101.3 suggests some kind of respiratory infection, et cetera. So this is all running locally.
Deploying and Monitoring the Agent on Bedrock AgentCore Runtime
Now we deploy it. Everything checks out so far, so let's run this on the Bedrock AgentCore runtime. We're going to use the Bedrock AgentCore Starter Toolkit. If you've never seen it before, it's a pretty fancy wrapper around the AWS CLI or API commands you would otherwise run to deploy an agent to the AgentCore runtime for the first time. Effectively you're defining a runtime object, configuring your agent, and wrapping up the Python code we wrote alongside a requirements file and other runtime information.
After we've defined this runtime object, all we do is execute the launch command. I'm using launch with CodeBuild, which means CodeBuild receives the objects we've built, runs a new build that takes everything we just walked through, and deploys it onto the AgentCore runtime control plane. I'm not going to actually run this because it takes a little while, but you can see what it looks like as it walks through all of the steps CodeBuild takes to build this agent.
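A rough sketch of that configure-and-launch flow with the starter toolkit; the method names follow the toolkit's configure/launch/invoke pattern, but the argument names and values here are assumptions, so verify them against the toolkit's documentation:

```python
from bedrock_agentcore_starter_toolkit import Runtime

runtime = Runtime()

# Argument names are assumptions; the entrypoint is the agent file we wrote out above.
runtime.configure(
    entrypoint="medical_agent.py",
    requirements_file="requirements.txt",
    execution_role="arn:aws:iam::111122223333:role/AgentCoreRuntimeRole",  # placeholder
    region="us-east-1",
    agent_name="strands-medical-agent",
)

launch_result = runtime.launch()   # builds the container with CodeBuild and deploys it
print(launch_result)

# Once deployed, invoke the agent through the runtime endpoint.
response = runtime.invoke({"prompt": "45-year-old male with fever and cough..."})
print(response)
```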
Eventually we get an ARN at the end, and that ARN lets us invoke the agent through the AgentCore runtime. So I'm going to invoke this.
When you deploy this inside AgentCore, you'll see something like this: the Strands medical agent with no tools, all the different versions that have been deployed, and the active endpoint. The active endpoint is what captures the telemetry data from our trace objects, what accrues the logs, and what we're actually calling when we invoke the AgentCore runtime with our prompts. So we're going to get the same, or a very similar, response to the one we saw earlier.
While we're waiting for that to finish, you can see an example of the telemetry data we've captured inside our agent. This is a screenshot of the observability window inside CloudWatch. When you select the observability dashboard from your agent's view inside AgentCore, it takes you to CloudWatch and you'll see something a lot like this: the different sessions and traces that have happened over the agent's lifetime, and the different layers of invocation. For example, the POST request to the SageMaker endpoint, which is the innermost layer of the invocation, sits beneath an API call through the SageMaker SDK and is wrapped in further layers above that. We can pull this apart to see what's happening at each layer of the invocation.
We can also see the amount of resources consumed by the agents: how many virtual CPU hours, how much memory in gigabyte hours, and so on. And wow, this is still going. There we go. Well, it wouldn't be a demo without a hiccup; we'll come back to that. If I were a braver man, I'd run the same thing with Boto3. At one point this worked, so take a look at the output. And this is an example of what a trace might look like. This one is a trace with an error, not a timeout unfortunately, but an error showing that we have malformed JSON: some of the JSON keys or values don't have the right number of quotes around them, so it throws an error. We fixed that, and now the entire chain runs nicely.
Implementing Native MCP Tools with AgentCore Gateway for Patient Search and Report Generation
This is a great example of how you can use tracing to identify issues deep within your execution. Now, we have a few more minutes left and we haven't actually built out any tools yet, so I'm going to go through the tools part of this and switch gears from Strands to native MCP. The reason I want to do this is to showcase the capabilities of AgentCore Gateway at the very end. Effectively, we're building out another agent in almost exactly the same way as before, but instead of using Strands and the SageMakerAIModel object, we're building a custom MCP server, plus an MCP client that makes calls to the server. The server has methods coded in to handle the tools/list and tools/call requests that are part of the MCP protocol.
Again, we have much the same initialization logic at the beginning of our notebook. We're creating a Bedrock AgentCore role with a pretty standard policy document for a Lambda function that will make calls to the Bedrock AgentCore control plane. Then we define our MCP server. This is where I want to spend a little time, and we only have five minutes left, so I'm going to walk through how we implement these two methods, tools/list and tools/call. Tools/list prints out the list of tools our MCP server knows how to execute: patient search and save medical report. Patient search takes in some general identifier, probably a compound key for a patient search database.
In this scenario, it's just a generic identifier and it doesn't matter because I've hard coded it to return one patient for demo purposes. And then save medical report, which takes in the content that we're trying to upload to S3, where we're uploading it to, and what kind of format. Perhaps we want to support more formats besides just markdown. You can imagine a world where we're supporting lots of different patient portals, so we might want to support different kinds of file formats.
That covers tools/list. Tools/call takes in requests from the Lambda function, identifies which tool is actually being called, grabs the parameters passed in with that request, and forwards them on to the method itself. So when we call the save medical report tool, we extract the arguments and pass them to the save medical report method, and the same goes for patient search.
These methods are implemented down here. Patient search just returns John Doe with some basic medical history, and save medical report drops the response from the language model into a markdown file and uploads it to S3. Whatever the content is, it goes into a markdown file in S3. It's a fairly basic capability, a fairly basic MCP tool that our agents will now be equipped to use for us.
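A condensed sketch of what that Lambda-backed MCP server might look like; the event shape the client sends, the bucket name, and the tool schemas are assumptions for illustration:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
REPORT_BUCKET = "clinical-reports-demo-bucket"   # placeholder

TOOLS = [
    {"name": "patient_search",
     "description": "Look up a patient record by identifier.",
     "inputSchema": {"type": "object", "properties": {"patient_id": {"type": "string"}}}},
    {"name": "save_medical_report",
     "description": "Save a generated report to S3 as markdown.",
     "inputSchema": {"type": "object", "properties": {"content": {"type": "string"}}}},
]

def patient_search(patient_id: str) -> dict:
    # Hard-coded record for demo purposes, mirroring the session.
    return {"patient_id": patient_id or "12345", "name": "John Doe",
            "history": ["hypertension"], "medications": ["lisinopril"]}

def save_medical_report(content: str) -> dict:
    key = f"reports/report-{datetime.now(timezone.utc):%Y%m%d%H%M%S}.md"
    s3.put_object(Bucket=REPORT_BUCKET, Key=key, Body=content.encode("utf-8"))
    return {"s3_uri": f"s3://{REPORT_BUCKET}/{key}"}

def lambda_handler(event, context):
    # Assumed event shape: {"method": "tools/list"} or
    # {"method": "tools/call", "params": {"name": ..., "arguments": {...}}}
    method = event.get("method")
    if method == "tools/list":
        return {"tools": TOOLS}
    if method == "tools/call":
        name = event["params"]["name"]
        args = event["params"].get("arguments", {})
        if name == "patient_search":
            result = patient_search(args.get("patient_id", ""))
        elif name == "save_medical_report":
            result = save_medical_report(args.get("content", ""))
        else:
            return {"error": f"unknown tool: {name}"}
        return {"content": [{"type": "text", "text": json.dumps(result)}]}
    return {"error": f"unsupported method: {method}"}
```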
This has already been deployed to AWS Lambda, and we can now make calls to it. You can see it's all the same code we wrote before. This is the code block that deploys the Lambda function, and then we recreate an agent. The agent we're recreating here is effectively the same agent we had before; it uses that f-string notation to make it a little easier to read inside a notebook, but it follows the same idea as before.
When we add the ability for tools to be called here (and it's a little hard to find everything in this block), we're applying a little bit of determinism to our agent: telling it which tools to call and in what order, and telling it which patients have come in. The agent receives the initial prompt from whatever interface we've built, calls the MCP gateway to get the patient information, performs some analysis on that information, and generates a report from it.
I'm running out of time, so I apologize, I'm going to skip to the actual report that's generated. This is the markdown report, and you can see the same information that's printed inside our notebook is now printed out here. I downloaded this directly from S3, so it's not something I saved in my file system; trust me, it definitely came from S3.
And now we've enabled our model to make tool calls within the Bedrock AgentCore runtime. If we wanted to, we could extend this to use an AgentCore Gateway instead of native MCP. I've already taken the liberty of doing that and registered this Lambda function as an MCP target on the gateway.
If we go down to the bottom, we're using the requests API to make calls to the gateway, like list tools and call tool. We define our gateway URL here, and using requests we can make calls to it. This is what the output looks like: the JSON that comes back from the gateway over the MCP protocol. These are the tools that are supported, patient search and save medical report.
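A minimal sketch of those gateway calls; MCP is JSON-RPC 2.0 over HTTP, and the gateway URL and bearer token below are placeholders for whatever your gateway and authorizer issue:

```python
import requests

GATEWAY_URL = "https://example-gateway-id.gateway.bedrock-agentcore.us-east-1.amazonaws.com/mcp"  # placeholder
ACCESS_TOKEN = "example-access-token"  # placeholder token from the gateway's authorizer

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "application/json"}

# List the tools the gateway exposes.
list_resp = requests.post(
    GATEWAY_URL, headers=headers,
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
)
print(list_resp.json())

# Call one of them.
call_resp = requests.post(
    GATEWAY_URL, headers=headers,
    json={"jsonrpc": "2.0", "id": 2, "method": "tools/call",
          "params": {"name": "patient_search", "arguments": {"patient_id": "12345"}}},
)
print(call_resp.json())
```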
This is the output of calling those tools: patient search returns patient 12345, John Doe, and this is the structured report located at this location in S3. I apologize for going a little over time. I'm going to pass it over to Jing from SGLang to take us home.
SGLang Performance Optimizations: From Hierarchical KV Cache to Rack-Scale Deployment
Thank you, Ahmed and Dan, for the great introduction to SageMaker AI. I'd also like to give you some insight into the underlying SGLang so that you have a better idea of how to leverage its advantages. SGLang is a community-driven open source project, so the development momentum is continuous. I'd like to talk a little bit about the SGLang roadmap for the last quarter, which has two focuses.
First, we have spent a lot of effort recently on improving the user experience, and we're collaborating with all the open source model providers with the aim of running the latest open models efficiently at large scale from day zero. Second, we have recently focused on a big refactoring effort to make all of the advanced features compatible with each other, for both high performance and usability. There are five major components here. The first is prefill-decode (PD) disaggregation, where we keep improving compatibility; features that were previously supported only in non-PD scenarios are now supported in PD scenarios as well.
For speculative decoding, we're implementing a V2 version with a better engineering design that accommodates different speculative algorithms. For parallelism, we're refactoring pipeline parallelism (PP) and expert parallelism (EP) to reduce bubbles and make them more efficient. And for the memory pools, the memory pool V2 refactoring will support different attention mechanisms, with particular optimization for hybrid attention, including linear attention, window attention, and full attention.
Overlap scheduling is a unique feature in SGLang: we carefully designed the scheduling logic so that CPU overhead is hidden under GPU computation, achieving effectively zero CPU overhead. We are now working on making all five of these components compatible with each other. This emphasis on refactoring reflects our philosophy that good engineering is at the core of the project.
I would also like to highlight some other features. One of them is the hierarchical KV cache, which is very useful for agentic applications with many turns and huge cache-reuse opportunities, and it was also mentioned in the previous slides. For this feature, we use multiple layers of storage for the KV cache, including GPU memory, CPU memory, and remote memory; we have also tried NVMe. We support multiple backends, including DepFS, Mooncake, and NVIDIA NIXL.
This is one experimental result on a particular workload: you can see it achieves much better latency and much higher throughput, because this feature gives us a much higher cache hit rate. Another highlight is speculative decoding V2. As I mentioned before, speculative decoding is a very powerful optimization, especially for online serving with very small batch sizes, where it usually gives you a two to three times speedup. It is especially valuable for reasoning models, which usually produce a huge number of output tokens, making speculative decoding a very good way to accelerate them.
In the V2 design, we add one more capability: compatibility with the overlap scheduler. Previously, speculative decoding was implemented so that we would run one decoding step, then the CPU scheduling, then another decoding step.
In the new version, we have been able to further decouple the CPU control plane from the GPU compute plane. We now hide the CPU scheduling under the GPU computation, keeping only one mapping that stores the values on the GPU and only the pointers on the CPU. After each iteration, there is just a small step to store the future tensors into this map and then read them back from it. This reduces the overhead by a further 10 to 20%.
Moving on to large-scale serving, which has always been a focus of SGLang: we have been deployed in production serving scenarios at very large scale. Prefill-decode disaggregation and wide expert parallelism are the two core features used here, especially for the DeepSeek use cases. Separate prefill and decode engines give us better specialization for the prefill and decode phases, along with better latency control, and a large degree of expert parallelism increases concurrency and throughput. As a result, in the first half of this year we were the first open-source framework to match the official DeepSeek results.
This is a deployment where the prefill engines use one partition strategy, the decode engines use a different partition strategy, and a KV transfer engine sits in between. Moving forward, we have further optimized this large-scale deployment for new hardware, in particular the new world of rack-scale Blackwell GPU deployments.
In the future, inference will be at rack scale, such as the GB200 NVL72. Previously we deployed a model on a single node of GPUs, but now we have 72 GPUs in one rack connected by NVLink, which enables more optimization for really large-scale parallelism, especially expert parallelism. With prefill-decode disaggregation and wide expert parallelism, plus the new kernels for Blackwell and the new all-to-all communication kernels, we get significant performance gains. This is one result we published in the summer showing the performance gain on GB200, and we're now also working on GB300.
That summarizes the highlight features. Some of them are still experimental, so they are not always turned on by default; I encourage you to go to our documentation website for the most recent updates and the guides for turning on those optimizations. One last thing: last month, in addition to language models, we released support for image and video generation, including image-in, image-out. We call this SGLang Diffusion, and it accelerates image and video generation by about 1.2 times in general, and in special cases up to 6 times.
We think future models will be multimodal-in, multimodal-out, combining autoregressive language models and diffusion models. In this release, we benchmarked against the existing Hugging Face Diffusers, and you can see the performance gain. On the right figure, we list how the different optimization techniques contribute to throughput. There are two main techniques, sequence parallelism and CFG parallelism, and when we combine the two, we get an even better improvement.
Session Recap: Accelerating AI Development with SageMaker and SGLang
Yeah, I think that ends my part then. Okay, so we covered a lot of ground today. I'm just going to do a quick recap of the challenges that we referred to in the very beginning. Here we go.
Model customization can be a time-consuming process. It can be a little expensive at times, and it may be unclear how to optimize that workflow. What we've discussed today is really meant to give you a couple of tips and tricks on how you might do that. (I think the clicker stopped working. There we go.)
We looked at how you might leverage fully managed training jobs to continuously repeat these experiments and optimize without having to manage the experimentation harness. We talked about how you might host SageMaker models more effectively, and Jing just went through several great stats about how SGLang is a really great engine for efficient model serving and inference. We covered end-to-end observability with managed MLflow, and now managed MLflow Serverless, which is a little more cost-effective than hosted MLflow.
We talked about using SageMaker Pipelines to be able to repeat these experiments over and over and over again so that you can effectively scale your experimentation. Finally, tracking, auditing, and versioning models within SageMaker Model Registry and MLflow Model Registry, giving you the ability to determine exactly where you're going to focus your optimization efforts. Thank you all for attending. This is a link to the demo video. The code is not available yet on GitHub. It will be at some point soon, but this is a link to the demo video.
If you go into your app, the AWS Events app, don't forget to leave a review for the session so that we know how to improve it for next time. Thank you very much.
; This article is entirely auto-generated using Amazon Bedrock.