
Kazuya


AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)

In this video, Vivek Gangasani and Kareem Syed-Mohammed from AWS, along with Richard Chang from Salesforce, discuss deploying LLMs on SageMaker AI Inference. They cover 2025 trends including agentic workflows and reasoning models with test-time compute. The session focuses on three pillars: price performance optimization through Eagle 2/3 speculative decoding (achieving 2.5x throughput), dynamic LoRA adapter hosting, and inference components for multi-model deployment; flexibility with support for any framework, bi-directional streaming for voice agents, and multi-modality; and ease of use via self-service GPU capacity reservation and built-in observability. Richard Chang demonstrates how Salesforce built Agentforce Voice using SageMaker's bi-directional streaming, serving 22 regions globally with inference components, multi-adapters, and fine-tuned open-source models like GPT-OSS to handle over 1.5 million customer service requests with low latency and high security.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Speakers and the Evolution of Generative AI in 2025

Thank you so much for being here. This is day two of re:Invent, and I hope you are all having lots of fun. There is almost a week left, so I hope you get to learn as much as you can. I'm Vivek Gangasani, a Principal GenAI Specialist Solutions Architect. In my role as a lead specialist architect for SageMaker, I have been working with hundreds of startups and enterprises, helping them train, deploy models, and scale them up. I am also responsible for building solutions for running models, and I partner closely with our product and engineering teams to help shape the features for SageMaker Inference.

Today we are going to talk about how you can efficiently and effectively deploy LLMs and generative AI workloads on SageMaker and manage them. These are learnings based on our testing, feedback from customers, and our new features that we are launching. I want to introduce Kareem to share a bit about himself.

Thank you all. Can you hear me? Awesome. I'm Kareem Syed-Mohammed, Principal Product Manager for SageMaker AI Inference. I'm happy to be here on stage with you and explain how SageMaker AI Inference has evolved and the new capabilities we are bringing. We will be here after the session ends if you have any questions, so feel free to drop by and ask us. Now I will hand off to our partner here.

Good afternoon. Can you hear me? Yes. My name is Richard Chang. I'm a Software Architect at Salesforce. It is my great pleasure to work with Vivek and Kareem to tell you what product we are building and how we build it.

Thumbnail 120

Awesome. Thank you, Richard, for sharing the stage with us. We are super excited to have you. All right, let's get right in. In terms of agenda, we will start off by going through the trends in 2025 and the new things we have been seeing since 2024. Then we will talk about the steps involved in deploying models to SageMaker AI and the features available to you in SageMaker.

We will then talk about the three important pillars that will enable you to run models efficiently and cost-effectively at production scale. The first is optimizing for price performance: we will talk about how you can optimize your models on SageMaker AI. Then we will talk about how you can deploy any model or framework of your choice, giving you flexibility with SageMaker AI. We will also talk about how easy it is to deploy and manage your cluster of nodes with SageMaker.

Thumbnail 200

Thumbnail 210

We will then talk about how you consume models once you deploy them on SageMaker, and how you build agentic workflows with them. Then we will hand it off to our partner Richard to talk about what Agentforce is, how Salesforce is building their platform, and how they are using SageMaker AI to power it. All right, let's jump right in. In terms of trends, since the launch of ChatGPT in late 2022, we have all used ChatGPT in some way or another for text generation, summarization, and chatbots. You ask a question and you get a response. This year in 2025, agentic workflows have picked up. By that I mean enterprises are seriously trying to understand how to build an agentic workflow, integrate it with their existing workforce, and gain productivity from it.

Agentic AI and the Rise of Test Time Compute

Agentic AI is pretty interesting because it does not just give you a response; it takes actions. It goes and figures out what to do to accomplish a goal that you give it. You give it some tools, and it will figure out how to break down the problem first, decide how to execute it, and then execute it on your behalf. This is a super interesting new development taking place in 2025.

Thumbnail 270

Another key development that has taken place is the rise of test-time compute, or inference-time compute. This is the computational effort a model needs to generate an inference response. A year or a year and a half ago, general models like GPT-4 or the Claude models would just respond to your question. But as you see in the latest set of models, the reasoning models go through a chain-of-thought process before giving you a response. They look at your question, break it down, plan, identify different ways to answer it, and then give you the right response. We have seen this with the DeepSeek models and the new Qwen reasoning models. Basically all the models coming out these days, open source and closed models alike, are reasoning models that use chain of thought.

Now this is great: it gives better accuracy, and it has proven to work better because the model does not just use the knowledge it gained through training; as it goes through a chain of thought, it improves its answer. But the trade-off is that you generate many more tokens per inference response than you would with a straight response.

Thumbnail 360

Thumbnail 380

So if your prompt is 100 tokens and your response is expected to be 100 tokens, the reasoning adds another 200 tokens, so you end up generating 300 tokens instead of 100. The demand for computational power has gone up with reasoning models. This is a prediction from Gartner: over a third of applications will have some kind of agentic workflow by 2028. That's the expectation, and enterprises are moving towards it. Also, a lot of day-to-day tasks will not be totally automated, but they will eventually have some percentage of automation with agents.
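To make the math concrete, here is a back-of-the-envelope calculation in Python using only the illustrative figures from the talk (a 100-token answer plus roughly 200 reasoning tokens).

```python
# Back-of-the-envelope: generated tokens with and without reasoning,
# using the illustrative figures from the talk.
prompt_tokens = 100        # input, not generated
answer_tokens = 100        # the visible answer
reasoning_tokens = 200     # extra chain-of-thought tokens

plain_generation = answer_tokens                        # non-reasoning model
reasoning_generation = reasoning_tokens + answer_tokens  # reasoning model

print(f"generated tokens (plain):     {plain_generation}")
print(f"generated tokens (reasoning): {reasoning_generation}")
print(f"output-token multiplier:      {reasoning_generation / plain_generation:.1f}x")
# -> 3x more generated tokens per response, hence more GPU time per request
```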

Production Challenges and SageMaker vs Bedrock: Understanding the Differences

So this is all great, but the challenge we hear from customers is going from POC to production. It's easy to spin up a Studio account and a GPU, deploy a Qwen model, use LangChain to write some orchestration code, and then invoke the model to get responses. It works. It also works if you have a team of 5 people, but if you look at scaling up from 5 people to 500,000, these are the challenges that customers face.

Let's talk about what these are and how we are solving them with SageMaker AI. First is performance. LLMs, like I said before, take time to generate a response, and as concurrency increases, the load increases. So for a good user experience, it is important to return responses within the time frame that users expect. Customers find it challenging to first benchmark the right number of GPUs, and then, once they have the GPUs, to optimize performance to get the best responses as they scale up. That's the number one challenge.

The second one is related to it, which is costs. We all know that GPUs are expensive and scarce, and it is super important to ensure that you have the maximum GPU utilization and squeeze maximum efficiency out of a GPU to keep the costs in check. But it is not clear how you optimize costs. Then there's scalability. Unlike CPU workloads, you cannot just keep scaling up and scaling down because of the cost reasons. So you need to plan within a cluster how you efficiently manage your resources, your deployments, your workloads, and scale up and down within your resource constraints and the number of GPUs you have.

The fourth one is that AI is still evolving and there are a lot of new frameworks out there. There are a lot of environment variables you have to adjust for different models and optimization techniques, so setting up your infrastructure is complex. How do you set up the drivers, your containers, your load balancing, your routing? There are a lot of minute details that come into the picture when you're going into production, and all of these steps are different from non-AI workloads: what works for a web app does not carry over directly to AI.

Before I hand it off to Kareem, I just want to spend one minute on a question that we get asked a lot. Bedrock has inference, and SageMaker also has inference. What is the difference? I think a lot of you might have this question. To keep it simple, there are three differences. One: proprietary models like Anthropic's Claude, Amazon Nova, or Cohere models are available only on Bedrock as APIs; they are not available on SageMaker. That's difference one.

The second one is Bedrock has a certain set of models that are available on it, and they support certain open source model architectures so you can import and use them. Whereas with SageMaker you get full flexibility in deploying basically anything you want. Today when a model releases you can just download it from Hugging Face, download a container, and deploy right away. You also get lower level access so you can configure your post-processing scripts, your metrics, how you want to process your input and output—it's totally customizable. You can also bring in your trained models of any architecture, whether it's custom or open source, and you get full flexibility there.

The third difference is that Bedrock is serverless, so you pay on a token basis, whereas SageMaker is server-full, where you pay for the GPU you use. That is why it is important to ensure that when you pay for a GPU, you squeeze maximum efficiency out of it, and that is what Kareem is going to talk about next.
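To illustrate the difference in practice, here is a minimal sketch of the two invocation paths with boto3. The model ID, endpoint name, and request bodies are placeholders; the SageMaker body format in particular depends on the container you deploy.

```python
import json
import boto3

# Bedrock: serverless, pay per token, call a hosted model by ID.
bedrock = boto3.client("bedrock-runtime")
br_resp = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
print(br_resp["output"]["message"]["content"][0]["text"])

# SageMaker: server-full, pay for the GPU instance, call an endpoint you deployed.
smr = boto3.client("sagemaker-runtime")
sm_resp = smr.invoke_endpoint(
    EndpointName="my-llm-endpoint",          # placeholder: an endpoint you created
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello"}),    # body shape depends on your container
)
print(sm_resp["Body"].read().decode())
```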

Thumbnail 650

Thumbnail 670

Three Pillars of SageMaker AI Inference: Price Performance, Flexibility, and Ease of Use

Thank you. Can you still hear me? Awesome. So thank you, Vivek, and thank you for also setting the stage here. Now, what we are going to do at this part of the talk is show how for all the capabilities or all the 2025 trends that Vivek spoke about, SageMaker AI Inference makes it easy for you to do the deployments of your models, any model. You get good price performance, get flexibility, and get ease of use, so that you get the best out of it. That is what we are going to do in the next part of this talk.

Thumbnail 680

So as Vivek hinted, there are three main pillars we will be focusing on. The first, as you can see, is price performance. Price performance is important because you want to get the best performance out of your models: high throughput, low latency, and auto scaling. Auto scaling is very important because production traffic is never constant. You can prepare for steady traffic, but there can be a burst. You all just came off Cyber Monday, Thanksgiving Friday, and so on, right? You would have seen the peak in traffic. How do you plan for that? So quick auto scaling is important, but at the same time, you need to have low latency and high throughput.

The second pillar for us is flexibility. As Vivek hinted, you can bring any model: if a model is released today and its weights are available on Hugging Face, you can download it and deploy it. You can deploy with any framework and on any GPU instance you have access to. And last but not least, ease of use is very important for us. We want to give you all these capabilities, but without good ease of use, that is a big hurdle you have to cross. How do you go from model weights all the way to invoking the model and getting responses back? For us, that flow is extremely important, and it is a tenet for us to keep it as simple as possible.

Thumbnail 760

Now let us go through these one by one. First, I want to talk about this pretty interesting slide because it takes you through the journey of what you can do on SageMaker AI Inference and what you get with it. Let me explain. On the right side, you can see that you can bring models from S3, from SageMaker JumpStart in SageMaker Studio, from Hugging Face, or even as your own ECR image. Then you create an endpoint, you deploy the model to that endpoint, and you start invoking. That is simple.
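As a concrete example of that journey, here is a minimal sketch using the SageMaker Python SDK's Hugging Face support to pull weights from the Hub and stand up an endpoint. The model ID, instance type, and container versions are illustrative choices, not recommendations from the talk.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # inside SageMaker; otherwise pass an IAM role ARN

# Point the managed Hugging Face container at weights on the Hub.
model = HuggingFaceModel(
    role=role,
    env={
        "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",  # example model; any Hub ID works
        "HF_TASK": "text-generation",
    },
    transformers_version="4.37",  # pick a published DLC version combination
    pytorch_version="2.1",
    py_version="py310",
)

# Create the endpoint on a GPU instance of your choice.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict({"inputs": "Hello, SageMaker!"}))
```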

What do you get with it? You get instance choice: it can be a G series, a P series, whatever series you need. You get your choice of weights, instances, and auto scaling. You can use on-demand capacity, a reservation, or a mix of both: a reservation for your steady traffic, and on-demand capacity when you scale up, so that operations continue uninterrupted. Built-in observability is of course necessary because we have to make sure that our instances and our models are performing well.

If there is a failure, you need to know why, where, and when it happened so that you can take corrective measures. Bring your own container, which is what Vivek was hinting at: there are no restrictions there. I will dive a bit into multi-model. Because you are paying for the GPU or the instance itself, we give you the ability to run multiple models on the same GPU so that you are getting the best out of it, squeezing as much as possible out of it. And streaming responses: we will talk about a feature we launched that enables a much more real-time streaming experience.

Thumbnail 900

Before I move on, if you want to take a picture, please do. What I do want to call out is that SageMaker Inference gives you flexibility in the models you bring, the containers you bring, the instances you use, performance, scalability, and getting the best out of your GPUs. Once your setup is done, this is what the deployment looks like. At the bottom is your infrastructure layer, which is where your GPUs and CPUs live.

Along with that are the containers you bring, or any open source container, and then your models. You can deploy a single model on the instance, because maybe you have the use case for that. You can also deploy multiple models on the same GPU, as I was discussing, so that you get the best out of it, and that also enables you to scale.

Thumbnail 950

For example, if one particular model is getting more traffic, you can scale that model on the same GPU as a separate component so that you are able to meet the traffic needs. If you need to scale out to an on-demand instance, you can do that as well. This is the auto scaling model copies capability. Now I'll be touching upon some releases that we have done leading up to re:Invent.

Eagle Speculative Decoding: Accelerating Throughput Without Compromising Quality

One of our focus areas has been price performance optimization. Let me walk you through a bunch of releases we did in that area. I'm really happy to announce that we have Eagle 2 and Eagle 3 speculative decoding available on SageMaker AI Inference. This is best explained with a use case, so I'm going to walk you through an example so you can see how the throughput increase we are claiming actually happens.

Thumbnail 1000

Normally, what do customers expect when they are working with a chatbot or with inference? They want fast inference. That's table stakes nowadays. But in reality, LLMs generate one token at a time, which is slow. Hence, your throughput in general is not at the performance level you want it to be.

Thumbnail 1020

Thumbnail 1040

Here come Eagle heads: speed without compromise. You increase throughput without compromising on quality. The way it works is that you have your foundation model, and you also have Eagle heads playing the role of a draft model, generating the next n consecutive tokens all at once. The foundation model then evaluates every token that has been generated and checks whether it is actually the token it would have generated itself.

Instead of generating one token at a time, the foundation model just evaluates all these tokens at the same time, accepting the ones it would have generated and rejecting the ones it wouldn't have. So from generating one token at a time, the foundation model now accepts, say, three tokens at a time, increasing your throughput.

Thumbnail 1080

Thumbnail 1110

Let me play this out with an example. Let's say my LLM is writing a book about a cat, and it's about to finish the sentence "The cat...". My draft model proposes "jumped over the moon" with the given probabilities; the draft model generated this entire set of tokens together. The foundation model then looks at every token together and checks the probability it would have assigned, and if that probability is within range of the draft model's probability, it accepts the token and rejects the ones that are not.

Thumbnail 1140

With this, your inference speed has increased. Eagle 3 makes it even more powerful: it generates multiple variants of the tokens from the draft model. It can propose "jumped over the moon", "climbed over the bed", "hopped over the fence", and so on, all at the same time. The foundation model evaluates all of these together, selects the best, and gives you the throughput increase.
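To make the draft-and-verify idea concrete, here is a toy sketch of the acceptance loop with made-up probabilities. It shows the mechanics only; it is not the production Eagle implementation.

```python
import random

# Toy draft-and-verify loop for speculative decoding. The "models" here are
# just probability tables so the accept/reject mechanics are visible.
def draft_propose(prefix, n=4):
    # A small draft head guesses the next n tokens with its own probabilities.
    return [("jumped", 0.8), ("over", 0.9), ("the", 0.95), ("moon", 0.6)]

def target_prob(prefix, token):
    # The probability the large foundation model would have assigned.
    table = {"jumped": 0.7, "over": 0.9, "the": 0.97, "moon": 0.2}
    return table.get(token, 0.0)

def verify(prefix, proposals):
    accepted = []
    for token, draft_p in proposals:
        p = target_prob(prefix + accepted, token)
        # Standard acceptance test: keep the token with probability p / draft_p.
        if random.random() < min(1.0, p / draft_p):
            accepted.append(token)
        else:
            break  # first rejection ends the speculative run; target resamples here
    return accepted

prefix = ["The", "cat"]
print(verify(prefix, draft_propose(prefix)))
# Several tokens can be accepted per single verification pass of the big model,
# which is where the throughput gain comes from.
```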

Thumbnail 1160

Last but not least, it leads to latency reduction. In our tests we have seen 2.5x throughput, with no trade-off in accuracy. Now, how did we implement this? You can run it as an optimization job today. You can bring your own dataset, or we can provide one. If you bring your own dataset, the Eagle heads get trained for your specific use case; or you can use an open source dataset, which still improves throughput.

We create an optimization job and then run the evaluations. We share the evaluation report with you as a chart, so you can see the values and determine whether your throughput has actually increased and by how much. If you are happy with the output, you can deploy the new weights, which are your optimized model, or update an existing model with these new weights so that the same user experience just got faster.

Thumbnail 1210

Dynamic LoRA Adapters and Advanced Routing Capabilities

This is another capability that we have brought to you. One of the trends is that you can fine-tune your model with multiple adapters. What you can do in that case is deploy your base model on an instance and deploy multiple LoRA adapters on the same instance. Based on which LoRA adapter gets invoked, the responses are handled by that adapter. What we have done here is make this even more dynamic.

Thumbnail 1260

Thumbnail 1270

Let me explain what that means. Let's say you have your S3 storage, your CPU, and your GPU. Initially, all your trained LoRA adapters are sitting in storage. At the first invoke, the adapters move to your GPU and start serving traffic based on what has been invoked. If your GPU memory is getting full, adapters that are not in use get offloaded so that the next adapter that is invoked, which is not yet in GPU memory, can come up. In a way, it gives you effectively limitless LoRA adapters: the right adapter is loaded from storage into instance memory and starts serving pretty quickly.
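For illustration, here is a hedged sketch of the multi-adapter pattern: register an adapter against an already-deployed base component and target it by name at invoke time. The field names follow the SageMaker multi-adapter inference component pattern as best as it can be reconstructed here, and all resource names are placeholders; check the current API reference before relying on them.

```python
import json
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Register a LoRA adapter on top of an already-deployed base component.
# (Field names are a sketch; verify against the current API reference.)
sm.create_inference_component(
    InferenceComponentName="billing-faq-adapter",           # placeholder
    EndpointName="shared-gpu-endpoint",
    Specification={
        "BaseInferenceComponentName": "base-llm-component",  # the base model component
        "Container": {"ArtifactUrl": "s3://my-bucket/adapters/billing-faq/"},
    },
)

# At invoke time, target the adapter by name; SageMaker loads it onto the GPU
# on first use and can offload idle adapters when memory gets tight.
resp = smr.invoke_endpoint(
    EndpointName="shared-gpu-endpoint",
    InferenceComponentName="billing-faq-adapter",
    ContentType="application/json",
    Body=json.dumps({"inputs": "How do I update my billing address?"}),
)
print(resp["Body"].read().decode())
```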

Thumbnail 1300

This is where you can deploy multiple models, or multiple copies of the same model, on the same instance using our Amazon SageMaker AI Inference Components capability. Inference components enable you to host multiple models. If you look at the picture here, Model A is the pink model and Model B is the blue model. On the same instance, which has 8 GPUs, you are able to deploy one or two copies of Model A and two copies of Model B. This not only enables you to scale each model independently but also lets you maximize the utilization of your GPUs.
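Here is a minimal sketch of what that looks like with the inference components API: one model registered as a component with its own GPU reservation and copy count, then scaled independently. All names are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Register Model A as its own scalable component on a shared multi-GPU endpoint.
sm.create_inference_component(
    InferenceComponentName="model-a",            # placeholder names throughout
    EndpointName="shared-gpu-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "model-a-sm-model",         # a SageMaker Model created earlier
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,  # GPUs reserved per copy
            "MinMemoryRequiredInMb": 16 * 1024,
        },
    },
    RuntimeConfig={"CopyCount": 2},              # two copies of Model A to start
)

# If Model A starts getting more traffic, scale just its copies, not the endpoint.
sm.update_inference_component_runtime_config(
    InferenceComponentName="model-a",
    DesiredRuntimeConfig={"CopyCount": 3},
)
```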

Thumbnail 1350

Thumbnail 1380

We have also brought container caching to the NVMe volumes, so your containers are already available locally. When your instance comes up, the containers are loaded from the NVMe volumes, and we have seen upwards of a 50% improvement in autoscaling when containers are cached this way. Last but not least, we also have improvements for latency: specifically, load-aware routing and session-aware routing.

Think of load-aware routing as the case where it is okay to get a response from any available accelerator or GPU. It is stateless: I just need the response, for example, "What is the weather in Seattle?" Load-aware routing sends your prompt to the GPU with the least traffic so that you get the output back quickly. Session-aware routing is stateful: it is for multi-turn chat, or when you want proper context with the past history so that you get more personalized responses back. That is where session-aware routing comes into the picture. You create a session and keep sending your prompts within that session so that your chatbot or similar application gets a good experience.
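Load-aware routing is configured on the endpoint config's production variant; a minimal sketch follows, with placeholder names. Session-aware (sticky) routing is enabled separately at invocation time and is not shown here.

```python
import boto3

sm = boto3.client("sagemaker")

# Load-aware routing: send each request to the least busy instance/copy.
sm.create_endpoint_config(
    EndpointConfigName="routing-demo-config",   # placeholder
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",                # placeholder model name
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 2,
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)
```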

Thumbnail 1460

Flexibility in Model Deployment: Multi-Modality and Bi-Directional Streaming

Now with this, I want to hand off to my friend Vivek, who will walk you through the flexibility capabilities we are bringing in. Thank you. All right, so Kareem talked through two things: how you optimize performance and increase throughput, and how you pack more models into a GPU and manage them so you get maximum efficiency out of it. Now let's talk about the second pillar, which is choice: how you deploy any framework of choice, any model of choice, and how you can experiment quickly.

Thumbnail 1480

The first thing is that SageMaker has managed containers: we build containers using open source frameworks and add optimizations to make them run well on SageMaker. We have those, but some customers have security requirements or want to do additional customization on top of the containers. So we support anything: bring your own containers and your own inference scripts. A lot of customers have pre-processing and post-processing scripts, for example taking part of an inference response and writing it to a database.

You have complete capability to customize your inference scripts. The third key capability is multi-modality. Text generation is the original capability that started with the OG models. But right now we are seeing video generation, video understanding, audio generation, and voice agents. We are launching features to support all use cases across multi-modality, and that is a key capability we will talk about in a launch as well.

Thumbnail 1550

When we say flexibility, this is a screenshot from Artificial Analysis, which has become the industry standard to look at performance. The key point I want to make here is that open source models are on par with closed source models. We have many customers who want to use open source models for multiple reasons because of the open source ecosystem with Hugging Face, the ease of customizability that you can have with open source, and the cost efficiency that you gain.

We have benchmarked models like GPT-OSS, which are pretty good at tool calling and function calling, which is required for agents, and they achieve pretty good results in our testing. These models are very cost efficient compared to the bigger models and are also much faster: the smaller the model, the faster it responds to inference requests, and these are also reasoning models. The GPT-OSS model can run on a single ml.g6e.2xlarge instance, which costs around two dollars an hour, and it can handle forty to fifty concurrent requests at the same time. So it is super cost efficient. At some point, when you are scaling up significantly, GPU-based inferencing will give you better price performance. You can basically take any open source model from here, choose the right GPU instance, and deploy it on SageMaker using the same code. You do not even have to change the code.

Thumbnail 1640

I am also excited to introduce a new capability: bi-directional streaming. Bi-directional streaming is super important for audio transcription. When you speak, it needs to be transcribed in real time. It is also important for audio translation or for a customer service agent where you are sending a streaming request and you expect a streaming response from the model. This has been implemented for re:Invent, and the way it works is we have a router and we have the SageMaker endpoint. You can bring your own container to work with bi-directional streaming, or we have also partnered with DeepGram to bring their models to work with SageMaker.

We use the HTTP/2 protocol. When the client connects to the SageMaker router, it connects using HTTP/2, and the router forwards the request to the API, which uses a WebSocket connection. The WebSocket connection takes care of establishing a connection between the model and the router and streams responses over that same connection. It does not keep opening new connections; it uses the existing connection to stream in both directions. This is super useful in cases like talking to a customer agent: the agent does not have to wait until you send the entire request, and the response keeps coming in as real-time transcription happens while you are speaking. So it is super efficient, and we also have a demo of how this works.
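The exact SageMaker client API for bi-directional streaming is not spelled out here, so the sketch below only illustrates the general duplex pattern described above, one connection carrying audio up and transcripts down, using a plain WebSocket client against a hypothetical URL.

```python
# Illustrative only: a single duplex connection carrying audio chunks up and
# partial transcripts down. The URL and message shapes are hypothetical
# placeholders, not the actual SageMaker bi-directional streaming API.
import asyncio
import websockets  # pip install websockets

AUDIO_CHUNKS = [b"\x00" * 3200 for _ in range(10)]  # stand-in for 100 ms PCM frames

async def stream_audio(ws):
    for chunk in AUDIO_CHUNKS:
        await ws.send(chunk)       # keep sending on the same connection
        await asyncio.sleep(0.1)   # pace roughly like real-time capture
    await ws.send(b"")             # hypothetical end-of-stream marker

async def read_transcripts(ws):
    async for message in ws:       # partial transcripts arrive as they are ready
        print("partial transcript:", message)
    # loop ends when the server closes the connection

async def main():
    uri = "wss://example.invalid/bidirectional-demo"  # placeholder endpoint
    async with websockets.connect(uri) as ws:
        await asyncio.gather(stream_audio(ws), read_transcripts(ws))

asyncio.run(main())
```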

Thumbnail 1760

Thumbnail 1780

Thumbnail 1800

We also partnered with Deepgram to bring their models natively into SageMaker, so you can just go to Marketplace, go to Deepgram, select a few details, and deploy directly to SageMaker. Here is the demo: "You are speaking with our virtual pharmacy assistant. How may I assist you?" "I would like to check the status of my order." "Okay, let me look that up for you. Could you please provide your member ID?" "M1001." (Can you hear the audio?) "Thank you. Let me check your orders. You have one order, ORD001, currently processing." "What medication is in this order?" "Amoxicillin 500 mg, twenty-one pills." "Is there any refill remaining on my prescription?"

Thumbnail 1810

Thumbnail 1820

Thumbnail 1830

"You have 0 refills remaining for RX 1001." "When will this be ready?" "Your order ORD 001 should be ready for pickup on December 6th at 10:00 a.m." "Thank you." "Thank you for calling. Goodbye."

As you can see, the assistant and the customer have a conversation in a streaming way, which enables whoever is acting on it to take actions quicker because it is streaming instead of waiting for each request. This is super important and this is available only on SageMaker in AWS. It takes a lot of effort to build this, so we are offering it as a capability right on SageMaker.

Thumbnail 1870

Ease of Use: Capacity Reservations, Observability, and Managed Containers

The next thing I want to talk about is the ease of use. One of the common asks from customers is how they know how many instances they will need to handle production-grade capacity. How do you know if a specific GPU is a good fit or not? Or how do you know if a specific framework is a good fit or not? Customers want an easy way to spin up GPU capacity and experiment with it for shorter periods. Some need it for a week, some for a month, and some for two months.

Until now, it has been a manual process of working with your account team to get the capacity you need. We are launching a capability where you can go to the SageMaker console, make the request, and reserve your capacity in a self-service fashion. You select the instance type you need, the time frame, and the number of days, and you have the capacity starting from the date you selected. Once it is approved, you can start using the capacity right away; you do not have to create a separate request or raise any limit increase. That has been a challenge in the GPU space for a while, and we are solving it with this.

Thumbnail 1960

At the end of the day, we want to make it easy for you to get started, experiment, and deploy. You can use it for benchmarking and testing, or if you have a planned release where you suddenly need, say, 20 instances, you can come here and use it. The next thing is GPU observability. The way you know that you are maximizing GPU utilization is by having an observability stack. We built an observability stack into SageMaker that tracks per-instance GPU metrics: for every GPU, it shows the GPU utilization and the CPU utilization. The basic idea is that, based on this, you take the necessary action, either scaling up or making sure you push more requests through these GPUs.
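Here is a small sketch of pulling those per-instance utilization metrics, which the endpoint publishes out of the box, from CloudWatch with boto3; the endpoint and variant names are placeholders.

```python
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)
dims = [
    {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # placeholder
    {"Name": "VariantName", "Value": "AllTraffic"},
]

# Per-instance GPU and CPU utilization published by the endpoint.
for metric in ("GPUUtilization", "CPUUtilization"):
    stats = cw.get_metric_statistics(
        Namespace="/aws/sagemaker/Endpoints",
        MetricName=metric,
        Dimensions=dims,
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    print(metric, stats["Datapoints"])
```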

Thumbnail 2020

We also track the number of invocations. Our router exports metrics on how many errors occur and what the latency is. The idea is that you do not have to build all of this yourself: we have it, and you can just deploy your models and get these metrics out of the box in CloudWatch. One of the questions we get a lot is what the ideal container or best framework is for the best performance. The answer is: it depends. vLLM has been a pretty popular option and it supports the latest models on day zero, so whenever a new Llama 4 or Llama 5 comes out, it will support it on day zero. vLLM fits most use cases.

SGLang has also been good for some models like mixture-of-experts models, and if you have extremely long context lengths, SGLang sometimes performs better. So it is for you to benchmark and find out. But the idea here is that we have our own container, the LMI container, which is built on these frameworks and offered as a managed container. We also have open source containers and solutions in our GitHub repo, which I will share at the end, where we walk through Dockerfiles showing how you can take an open source container and make some minor modifications, basically adding the health checks and the right port, to run it on SageMaker.

Thumbnail 2080

And then we also just launched a new container. Kareem has been talking about LoRA earlier and speculative decoding, but the way to answer the question about how you can start using that is through this container, right? This container is built off of vLLM, which we shared earlier, and then we have additional optimizations on top of it to run efficiently on SageMaker.

We have the capability with LoRA hosting where you deploy a base model and add multiple LoRA adapters. You just configure them using the Python SDK and then start deploying it. Even with speculative decoding, after the entire optimization job runs, you can use the LMI container to host your model and get those performance benefits.

We also support the OpenAI chat completions schema, which has become the industry standard. This means you don't have to change your client code: you can use the same client code and the same request format with any model you deploy, whether it's OpenAI's GPT-OSS or another model. Since this is a managed container from AWS, we ensure that it is free of vulnerabilities and tested to production-grade quality. That's another thing customers value: they don't want to build containers themselves. They want AWS to do that heavy lifting for them, which is why these managed containers are so useful for getting started and deploying.
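Concretely, an OpenAI-style chat completions request to such an endpoint looks like the sketch below; the endpoint name is a placeholder, and the response parsing assumes the usual chat-completions response shape.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

# The same messages-format payload you would send to any OpenAI-compatible
# server, just delivered to the SageMaker endpoint.
payload = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize speculative decoding in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.2,
}

resp = smr.invoke_endpoint(
    EndpointName="lmi-chat-endpoint",      # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
body = json.loads(resp["Body"].read())
print(body["choices"][0]["message"]["content"])  # chat-completions style response
```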

Thumbnail 2170

Building Agentic Workflows with SageMaker and Agent Core Runtime

We have covered all three pillars: price performance, flexibility, and ease of use. Now, once you follow all the best practices and deploy your models on SageMaker, what do you do next? Since the theme is agentic AI (by the way, agentic workflows are not the only thing you can do on SageMaker; you can also do personalization, computer vision, and anything else), how do you build an agentic workflow?

Thumbnail 2200

Thumbnail 2220

All of these frameworks have connectors for SageMaker. So if you deploy a model on a SageMaker endpoint, you can use any of these frameworks with the SageMaker connector to talk to the model directly from your agentic workflows. Here is a sample use case of how you would do that. You have your SageMaker endpoint where you deploy a model like Qwen, and then you take LangGraph, for example. You write your agent workflow, like an SRE agent or a customer support agent, in LangGraph, and most of the flow is the same as it would be anywhere else. The difference is the connector part: instead of using OpenAI or Bedrock, you use the SageMaker connector so that it connects and authenticates to SageMaker and gets the response.
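As a sketch of that connector piece, here is the community SageMaker LLM wrapper with a hand-written content handler. The endpoint name is a placeholder, the response parsing assumes a Hugging Face-style text-generation output, and the import path may differ slightly depending on your langchain-aws version.

```python
import json
from langchain_aws.llms import SagemakerEndpoint
from langchain_aws.llms.sagemaker_endpoint import LLMContentHandler

class Handler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        # Shape the request the way your container expects it.
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode()

    def transform_output(self, output):
        # Parsing assumes a text-generation style response; adjust for your container.
        return json.loads(output.read())[0]["generated_text"]

llm = SagemakerEndpoint(
    endpoint_name="qwen-endpoint",     # placeholder: your deployed Qwen endpoint
    region_name="us-east-1",
    content_handler=Handler(),
    model_kwargs={"max_new_tokens": 256},
)

# Drop this LLM into a LangGraph node the same way you would an OpenAI client.
print(llm.invoke("Draft a one-line status update for an SRE incident."))
```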

Once you write that code as Python scripts, you can deploy it to the Agent Core runtime. That's a new capability we launched in July, and it is a scalable platform where you can run your agentic workflows with features like memory, gateway authentication, and more. You can use it with SageMaker as well. Once you have the Python script, you create a Dockerfile, build a Docker container, push it to your container registry, and then go to Agent Core and deploy that container.

Once you deploy the container, it becomes available in a healthy state. Your applications can start connecting to the Agent Core runtime to invoke an agentic workflow, and the LangGraph code that you have will connect to the SageMaker endpoint to give you responses. We have a workshop tomorrow for this exact same thing. If you are interested, you can come and have a hands-on experience deploying a model to a SageMaker endpoint, a Qwen model, and writing Python code for a deep research agent and deploying it to Agent Core. You can stop by tomorrow for the workshop.

Thumbnail 2330

These are some of the takeaways we have. The re:Invent session tomorrow gives you details about where it is. We have a workshop that you can do in a self-service way on deploying models and doing autoscaling and more. The third one has a lot of examples on deploying different models and doing optimizations for your workflows. So take a picture. I'll just give a minute. Then I'll hand it over to our friend Richard to talk about Salesforce.

Thumbnail 2390

Thumbnail 2400

Salesforce Agentforce: Building an Enterprise Agentic Platform with SageMaker

Thank you. Can you guys hear me? All right. Thanks, Vivek. Thanks, Kareem. I'm really excited to share our journey with AWS on how we built an agentic platform. Before I start, has anyone used Salesforce or heard of Salesforce? All right, I see a good number of people.

Let me tell you what Salesforce is. Salesforce is the largest CRM company in the world. We were founded in 1999, and since the beginning, our mission has been to innovate and reinvent how businesses manage their customer relationships. Starting from there, we have the Salesforce platform. In 2016, we introduced one of the most important innovations: the Einstein platform, which brings predictive science and machine learning automation to all applications across the Salesforce platform.

Thumbnail 2480

In 2024, we achieved a new milestone with Agentforce. With Agentforce, we build autonomous assistive agents natively on the Einstein platform. These agents not only can assist, but they can also reason, take action, and complete complicated workflows. This transforms how our businesses manage their customers.

Thumbnail 2500

Thumbnail 2510

I'm going to play a brief video to tell you about what Agentforce is. A year ago, Agentforce was just a whisper, an idea. Today it's the fastest growing product we've ever launched. Let's talk about Salesforce unveiling Agentforce and how big companies are using it. It's not just what is possible, it's what you are going to make possible. Humans with agents driving customer success together. This is what AI was meant to be.

This story isn't just about scale, it's about something more profound. It's about a choice. Is AI going to replace us, or are we going to be in command of it? At Salesforce, we've made our choice. We're building AI that elevates people, where agents handle the busy work so we can focus on the work that matters. Because AI, when built with trust, elevates your people, helps them move faster, think deeper, and connect more meaningfully.

Thumbnail 2560

Thumbnail 2570

Thumbnail 2590

Agentforce has now handled more than 1.5 million customer service requests at Salesforce. That's not just a milestone, that's over a million questions answered with precision, a million customers understood with empathy and care, and most of all, more than a million times our people were empowered to drive greater value. We are very excited to talk about Agentforce. Our vision is that human beings and agents can work together to deliver success stories.

Thumbnail 2610

All of these agents are seamlessly integrated with our ecosystems, and they can perform very complicated jobs. This slide shows the comprehensive architecture of Agentforce. We start with the foundation layer, which is the data layer, built from Salesforce data and customer data. On top of it, we build a knowledge base and a vector database. This data layer serves as a single source of truth to enrich AI and agents.

On top of that, we build a model ecosystem. We have Salesforce-trained models, we have customer-trained models, and we also allow customers to access state-of-the-art large language models such as Anthropic's models through Bedrock, ChatGPT, and Gemini. After that, we build the trust layer, where Salesforce helps customers securely retrieve data and perform grounding. The core of the architecture is Agentforce itself. In this part, we have autonomous and assistive agents built on top of our reasoning engines, protected by guardrails, and they can perform classification and tool use.

Thumbnail 2700

The entire Agentforce is integrated with every single product from Salesforce such as Sales, Service, Tableau, and Slack. This was last year. This year, we introduced voice.

Thumbnail 2720

Agentforce Voice is basically a natural extension of Agentforce Core. It allows human beings to directly interact with Agentforce. Now there are three main features of Agentforce Voice. First, we designed the voice to be ultra-realistic and conversational. We introduced a wide variety of voices to represent a customer's brand. We have audio caching, interrupt handling, and activity detection to make the entire voice experience fluid and natural. On top of that, our speech-to-text and text-to-speech models perform activity and entity detection and redaction for any security issues.

Second, we want to make sure the voice can do real-time agentic reasoning. This requires low latency, intent detection, topic selection, and tool use. It is fully integrated with our knowledge base, including the RAG systems and RAG pipelines we have built. Last but not least, Agentforce Voice is multi-channel and multi-platform ready, which means any time a human being needs to get involved, the entire context and conversation with the agents can be seamlessly transferred to that person.

Thumbnail 2810

Those are the features. Now, what are the requirements? There are three major requirements we need to fulfill. First, we want the entire end-to-end voice flow to run within the Salesforce security boundaries to achieve maximum security. Second, we want to make sure the voice experience is fluid and conversational, which means we need bidirectional, low-latency streaming support. This is what Vivek and Kareem just shared with you; we chose SageMaker because it provides those features, so we can deliver very low-latency bidirectional voice support.

Thumbnail 2880

Third, we want to support low-latency open-source models. We use open-source models fine-tuned in-house and tailored for Salesforce use cases. This is the architecture of the voice stack. On the left side, we support different channels: traditional PSTN, digital channels, and WebRTC. The majority of that work is currently under testing and will be released very soon. In the middle is the voice call, which connects to SageMaker, providing bidirectional low-latency streaming. Agentforce Core itself has two major connections: one to external large language models, and the other to the in-house fine-tuned models based on open-source models. On top of that is the Salesforce knowledge base we built through the RAG pipeline.

Thumbnail 2930

We have talked about voice at Salesforce. Now I would like to step back and look at overall model serving. At Salesforce, we serve 22 regions globally at this time, and by the beginning of next year, we will expand to 28 regions. We use a combination of SageMaker model serving strategies for our generative AI models; I will name several. First, we use inference components, which allow us to share endpoints and utilize instances more efficiently. Second, we support multi-adapters: we use the same baseline open-source model, tailored for different use cases, which Kareem just covered.

We also have single model endpoints, which are for traditional deployment cases. Last but not least, we have the custom model import support from Bedrock so that we can upload customized model images.

Thumbnail 3010

Now, on this slide, I want to show you some of the thought process and practice we have at Salesforce. We use SageMaker AI and Large Model Inference to optimize our different model serving strategies, because SageMaker AI and Large Model Inference integrate multiple state-of-the-art inference frameworks. That work includes several items I'd like to touch on. First, we choose the right framework: vLLM or TRT-LLM. Second, we do comprehensive instance scanning to pick the best instance to serve our models and achieve the best latency and throughput.

The next items we use are rolling and batching strategies. On top of that, we have quantization: we use a variety of quantization approaches depending on the model accuracy and latency requirements. Last but not least, we use speculative decoding as well. Overall, with this combination of strategies, we have been able to serve real-time traffic as well as large-volume batch traffic.

Thumbnail 3090

On this slide, I'd like to touch on how we do model fine-tuning at Salesforce. At Salesforce, we host almost 10,000 training pipelines, so here I'm only going to focus on the fine-tuning part. We start with open source models such as GPT-OSS or other open source models like Llama 4 and Llama 3, and we use a good amount of data for pre-training or pre-fine-tuning; those datasets are typically much larger in volume.

After that, we get a baseline model. Then we use very high quality labeled data, as well as AI-generated data, to fine-tune the model further, which achieves the best throughput and accuracy for Salesforce use cases. Through this process, we reduce hallucinations, improve the output format, and specialize the model for customer use cases. Overall, at Salesforce, we are looking for integrated solutions and infrastructure for security, scalability, and high availability to empower Agentforce, which includes our agent builder, prompt builders, and trust layers.

Thumbnail 3200

We also want access to the latest models, such as Anthropic's. AWS has partnered with us to provide very integrated solutions, letting Salesforce really focus on innovation and uphold our core values with customers so that we can serve customers in highly regulated industries. I'm going to pause here. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
