DEV Community

Kazuya


AWS re:Invent 2025 - Develop AI Agents faster with Amazon SageMaker Studio & Bedrock AgentCore - AIM388

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Develop AI Agents faster with Amazon SageMaker Studio & Bedrock AgentCore - AIM388

In this video, Amazon SageMaker AI and Amazon Bedrock AgentCore capabilities for building custom AI agents are demonstrated. The session introduces SageMaker Model Customization Agent for planning workflows, serverless synthetic data generation, serverless reinforcement learning fine-tuning supporting models like Meta Llama and Amazon Nova, serverless MLflow for experiment tracking, and serverless model evaluation. A live demo shows end-to-end agent development from fine-tuning a Llama 3.1 8B model for SQL query generation to deployment using Bedrock AgentCore with Strands Agent SDK. Robinhood's Nikhil Singhal shares their production experience using these tools for customer support agents, achieving over 50% latency reduction through fine-tuning while balancing the cost-quality-latency trilemma in multi-stage agentic applications.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 20

Introduction: Accelerating Agentic AI Development with SageMaker and Bedrock

Hello, everyone. Thank you for coming all the way down to Mandalay Bay on a Thursday afternoon. I'm Sumit Thakur, and I lead product for Amazon SageMaker AI. Today I have with me Davide Gallitelli, who is a senior specialist for generative AI at AWS, and Nikhil Singhal, who is a senior staff ML engineer at Robinhood. Together, we are going to talk about how we can help you accelerate building agents using SageMaker Studio and Amazon Bedrock AgentCore.

Thumbnail 60

Now, before we talk about how we can help accelerate your agentic AI development, I would love to take a step back and look at some of the key drivers of agentic AI adoption across enterprises. Enterprises today are deploying agents to improve their customer experience, automate operational workflows, and even improve employee productivity. Very often, enterprises choose to use custom models to power these agents. They take off-the-shelf foundation models and then customize them on their private data to give them specialized domain knowledge and enable them for specific tasks.

Thumbnail 100

Thumbnail 110

Thumbnail 120

Gartner predicts that over the next three years, the usage of these custom models is going to explode, with more than 50% of enterprises choosing to use custom models as compared to just 1% a year ago. This upsurge in model customization is not a surprise. This is driven by very broad accessibility and availability of various state-of-the-art fine-tuning techniques like reinforcement fine-tuning, preference tuning, and supervised fine-tuning. Using these techniques, our customers are able to take their private data and build this knowledge into the models in a very cost-effective way.

Thumbnail 140

Thumbnail 160

Some of these techniques, especially reinforcement learning, have been quite instrumental in not just imparting specialized knowledge, but also giving these models critical reasoning capabilities. With these reasoning capabilities, a model can now take an incoming request, break it down into discrete steps, execute those steps using tools, aggregate the responses, and then plan the next set of steps. This continuous improvement loop of thinking, acting, and observing enables a model to continuously make progress towards its goal in a self-sufficient way. This is exactly what forms the bedrock of the brain of your agentic workflow.
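
The thinking, acting, and observing loop described above can be sketched in a few lines of framework-free Python. Everything here is a hypothetical stand-in for illustration: a real agent would call an LLM inside `plan_next_step` and real external tools instead of the toy `lookup_weather`, but the loop structure is the point.

```python
# Minimal sketch of the think-act-observe loop that powers an agent.
# plan_next_step and lookup_weather are hypothetical stand-ins: a real
# agent would call an LLM to plan and real APIs as tools.

def lookup_weather(city: str) -> str:
    """Toy tool: a real agent would call an external API here."""
    return {"Las Vegas": "sunny, 75F"}.get(city, "unknown")

TOOLS = {"lookup_weather": lookup_weather}

def plan_next_step(goal: str, observations: list) -> dict:
    """Stand-in planner: a real agent would ask the model what to do next."""
    if not observations:
        return {"action": "lookup_weather", "input": "Las Vegas"}
    return {"action": "finish", "input": f"The weather is {observations[-1]}."}

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):                    # think: plan the next step
        step = plan_next_step(goal, observations)
        if step["action"] == "finish":
            return step["input"]
        tool = TOOLS[step["action"]]              # act: execute with a tool
        observations.append(tool(step["input"]))  # observe, then plan again
    return "gave up"

print(run_agent("What's the weather in Las Vegas?"))
```

The loop makes continuous progress toward the goal exactly as described: each iteration plans, acts through a tool, and folds the observation back into the next planning step.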

Thumbnail 180

The Agent Development Workflow and AWS Purpose-Built Services

Now, I'm going to walk you through how you can build an agent for yourself. This is what an agent development workflow typically looks like. First, you have to plan out your entire agentic AI development workflow. In this first step, you think about what your use case is, who your target users are, what goals and objectives you have, and you even set up success criteria. You often use the success criteria to evaluate the performance of your customized model against that of a base model.

Then you move on to the next step where you gather and collect data for training, validation, and evaluation. Then you go on to set up a tuning infrastructure and then tune, evaluate, and customize your model candidates. Once you identify a model candidate that meets your success criteria, you push it into production for inference. Finally, you take this customized model running in production and build an agent on top.

Thumbnail 240

AWS gives you several purpose-built services to help you with each step of this workflow. You have Amazon SageMaker AI, which is a model development service that lets you tune, customize, evaluate, and deploy foundation models for any use case. Then you have Amazon Bedrock, which lets you take these custom models and deploy them onto a serverless infrastructure, giving you high cost performance and great ease of use for inference. Finally, you have the Amazon Bedrock AgentCore suite of tools and services that lets you build, deploy, and monitor agents in production on top of these custom models.

Thumbnail 280

Now, I'm going to walk you through the experience of how to use these services to build an agent for yourself, starting with the first step, planning and setting up an evaluation criteria. Customers often tell us that in this step, there's a lot of inertia. You not only have to choose one base model from among many foundation models, you also have to set up robust evaluation criteria so you can ensure your models behave responsibly and accurately in production. This often requires expertise and can take weeks if not months to get started.

Thumbnail 320

Planning with SageMaker Model Customization Agent: A Pet Store Chatbot Example

To help customers overcome this hurdle, we are pleased to announce the launch of a new capability in SageMaker, SageMaker Model Customization Agent, which was launched in preview this week. Using the SageMaker Model Customization Agent, you can now plan, execute, and reproduce your end-to-end model customization workflow using plain language instructions.

Thumbnail 340

Here is what the experience looks like. All you need to do is go inside SageMaker Studio, which is the visual interface for SageMaker. Open the chat interface of the Model Customization Agent and then describe your use case in plain natural language. Then, through a mutual discussion and conversation with the agent, you will eventually arrive at a model customization workflow for your use case. Once you and the agent mutually agree upon that workflow, the agent converts that workflow into a hardened specification which is stored as a JSONL file. This specification is then used to both drive the workflow as well as reproduce it at any point in time.
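
The session doesn't show the hardened specification itself, but a spec-driven record of this kind might look roughly as follows. All field names below are assumptions for illustration, not the actual SageMaker spec schema:

```python
import json
import os
import tempfile

# Illustrative sketch of a hardened model-customization spec stored as a
# JSONL record. Every field name here is an assumption for illustration;
# the actual SageMaker spec schema is not shown in the session.
spec = {
    "use_case": "Customer service chatbot for a pet store",
    "base_model": "llama-3.1-8b-instruct",
    "technique": "dpo",  # preference tuning, as the agent recommended
    "success_metrics": [
        {"name": "tone_politeness", "target": ">= 0.9"},
        {"name": "factual_grounding", "target": ">= 0.95"},
        {"name": "response_succinctness", "target": "mean_words <= 120"},
    ],
}

path = os.path.join(tempfile.gettempdir(), "customization_spec.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(spec) + "\n")

# Reloading the spec is what makes the workflow reproducible later.
with open(path) as f:
    reloaded = json.loads(f.readline())
print(reloaded["technique"])
```

The point of hardening the spec as a file, rather than leaving it in chat history, is exactly this round trip: the same record can drive the workflow now and reproduce it later.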

Thumbnail 380

Now I'm going to show you what this specification-driven workflow looks like inside SageMaker Studio. So here is the landing page of the chat interface. You can see on your left there's a progress bar which helps you see where you stand in your customization workflow. In the main chat area, you see the agent which greets you and then asks you certain clarifying questions about your use case. For example, whom are you planning to target? What success criteria do you have in mind? Are there any challenges that you face today in addressing this use case with other alternative solutions?

Thumbnail 420

I am going to take an example of building a customer service chatbot for a pet store. So what I tell this agent is, hey, I want to build this customer service chatbot. This is going to target my current and potential customers of Petto who are inquiring about my products and services. As far as success criteria goes, I have a few things in mind. I want this agent to always respond to my customers in a polite and empathetic tone. I want this agent to be grounded in factual correctness when it comes to producing information about my catalog of products and services. I want these responses to be very clear and succinct. My hope is by deploying this customer service chatbot, I can achieve a higher customer satisfaction rate and a higher customer retention rate of at least 20%.

Thumbnail 470

Thumbnail 480

The agent takes all these inputs and then asks me further clarifying questions. For example, what are the typical prompt-response pairs that we expect this customer service chatbot to receive in production? I provided a few examples which I have in mind. As the conversation goes, the agent finally determines that the right technique to use in this case is preference tuning using DPO. It provides an explanation that, hey, DPO is a good technique for this use case because you want to align the tone of your customer service chatbot to match your brand voice, as well as you want to ground this agent in factual correctness. You can meet both of these objectives if you use DPO.

Thumbnail 510

Thumbnail 520

It goes on to then suggest a small language model which is a great fit for my use case. I'm going to choose the same model. It's Llama 3.1, as an example. Then finally it lays out the success criteria. It presents these criteria both in the form of simple, human-readable text and in the form of measurable success metrics. It has defined success metrics around measuring the politeness of the tone, measuring the alignment of the tone with my brand voice, and making sure the responses are succinct. All these metrics are aligned with how I had described my use case to the agent.

Thumbnail 570

Synthetic Data Generation for Model Training

Now, once the agent and I mutually agree upon these success criteria, I'm going to hit approve, and this will convert it into a hardened specification file. Once that specification is ready, we are ready to move on to the next step, which is to gather and prepare data. In this step, customers often tell us that sometimes collecting data can be quite costly and cumbersome. Other times it may simply contain some private and sensitive information, making it impossible to use that data for training.

Thumbnail 590

To help customers address this challenge, we announced a new synthetic data generation capability in SageMaker AI. It's also available in preview starting this week. Using this new capability, you can generate data which is grounded in your context completely privately and securely.

Thumbnail 610

To use this capability, all you need to do is go back into the model customization chat interface and then provide your context. You can choose to provide context in the form of a few inline prompt response samples, or you can point to a content repository in S3 which contains things like your PDF documents, customer support tickets, log files, and other things that you want to provide as context. The agent uses this context to generate statistically similar data and also a data quality report, and all of this process runs completely on a serverless infrastructure, which means you don't have to manage any of the underlying compute and it automatically matches the size of your workload.

Thumbnail 650

Here is what the experience looks like in the chat interface. It opens up an input form asking you to provide some inputs, for example, how many samples you want to generate and the S3 link for the context repository. Once you hit start synthetic generation, it generates both the dataset and a data quality evaluation report with a bunch of useful metrics.

Thumbnail 670

Thumbnail 690

Let's take a look at a few of those. So you get something like diversity analysis, which shows you how diverse your generated data is and whether you have all your demographic groups well represented without any biases. It also shows you certain quality statistics. In this case, it's showing you the mean length of responses generated in the synthetic dataset. If you remember, I had asked the agent to create a customer service chatbot for me which produces clear, succinct responses. Hence, looking at the mean length of responses is an important success criterion, or metric, to look at in the synthetic data.
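
The mean-response-length check from the quality report is easy to reproduce on any prompt-response dataset. A minimal sketch, assuming a simple prompt/response record layout (the layout and samples are illustrative, not the report's actual internals):

```python
import statistics

# Sketch: compute a simple quality statistic over a synthetic dataset,
# like the mean response length shown in the data quality report.
# The prompt/response record layout is an assumption for illustration.
records = [
    {"prompt": "Do you sell dog food?",
     "response": "Yes, we carry several brands of dog food."},
    {"prompt": "What are your hours?",
     "response": "We are open 9am to 6pm, Monday through Saturday."},
    {"prompt": "Can I return a leash?",
     "response": "Yes, returns are accepted within 30 days with a receipt."},
]

# Whitespace tokenization is a crude but serviceable length proxy.
lengths = [len(r["response"].split()) for r in records]
mean_len = statistics.mean(lengths)
print(f"mean response length: {mean_len:.1f} words")
```

If succinctness is a success criterion, a statistic like this gives you a quick sanity check that the generated data actually exhibits the behavior you want the model to learn.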

Thumbnail 710

It also gives you access to a few of the responsible AI quality metrics to make sure that your generated data does not contain any toxic, harmful, or sensitive content. Once you review these quality parameters, you can proceed onto the next step, which is to customize and evaluate the models.

Thumbnail 730

Serverless Reinforcement Learning and Model Customization Techniques

In this step, customers were telling us it's often very hard to set up the right fine-tuning infrastructure and then manage, scale, and operate this infrastructure over time. And then as new fine-tuning techniques keep on coming up, you have to keep on experimenting with those techniques, try out all those different hyperparameter choices, try to get the right cost performance, and all of this process of iteration and experimentation can take months. To help customers quickly get to a fine-tuned model, we announced a new serverless reinforcement learning capability in SageMaker AI.

Thumbnail 760

Thumbnail 780

Using this capability, you can now fine-tune a broad choice of popular open weights and Amazon Nova models using a host of fine-tuning techniques, including reinforcement learning in a completely serverless way. This capability supports a lot of popular open weight models like Meta Llama, OpenAI GPT OSS, Qwen, DeepSeek, and Amazon Nova model families. It also supports a broad suite of customization techniques.

For example, you can use Reinforcement Learning with Verifiable Rewards for domains like math, science, code generation, or structured data output generation, where you can measure the output of a model using a simple reward function. And the reward score generated by this function can then be used to adjust the model weights. You can also use reinforcement learning with AI feedback where instead of a reward function, you can use a state-of-the-art judge model like a Claude Sonnet model to evaluate your model responses and generate a reward score.

We also support preference tuning that lets you align your models to human preferences by providing a data set consisting of a prompt and a positive and a negative response as evaluated by human evaluators. This can be really useful if you're trying to align your model to things like helpfulness, to tone, to brand voice, where human preference is super important. And finally, you can run all these techniques, as I said, in a completely serverless way, which means you don't have to procure, provision, or manage any GPU clusters. You simply fire away your job with a push button approach with all the default hyperparameters which have been tested by the service to give you the best cost performance, and I'm going to show you what that experience looks like.

Thumbnail 880

So you go back into SageMaker Studio and you go back to the JumpStart Model Hub. From the JumpStart Model Hub, you get access to a host of state-of-the-art open weights and proprietary foundation models. I'm going to choose the Meta Llama model as an example. Once you click on customize model, you're going to see a bunch of options. We already saw customization using AI agent as one of the user interfaces a couple of slides ago. Now I'm going to show you the visual interface, how to customize using UI.

Thumbnail 900

So once I click on customize with UI, a screen opens up where I have to give a few minimal inputs.

Thumbnail 910

Thumbnail 930

First, I need to select the customization technique. Here I'm going to choose Reinforcement Learning with Verifiable Rewards. Then I need to upload my training dataset. Finally, I need to choose a reward function. I can either choose from one among the many available built-in reward functions, or I can write my own reward function as well. If I have a reward function hosted as a Lambda function, I can bring that Lambda in as well.
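
A bring-your-own reward function for Reinforcement Learning with Verifiable Rewards can be as simple as a handler that scores a model completion against a verifiable target. Here is a hedged sketch in the shape of a Lambda-style handler; the event and response field names are assumptions for illustration, not the service's actual contract:

```python
# Sketch of a verifiable reward function in the shape of a Lambda-style
# handler. The event/response field names are assumptions; the actual
# contract SageMaker uses for custom rewards is not shown in the session.

def normalize_sql(query: str) -> str:
    """Crude normalization: collapse whitespace, lowercase, drop ';'."""
    return " ".join(query.lower().split()).rstrip(";")

def handler(event, context=None):
    completion = event["completion"]  # model-generated SQL
    reference = event["reference"]    # gold SQL from the dataset
    # Verifiable reward: 1.0 on an exact (normalized) match, else 0.0.
    # The trainer would use this score to adjust the model weights.
    reward = 1.0 if normalize_sql(completion) == normalize_sql(reference) else 0.0
    return {"reward": reward}

print(handler({
    "completion": "SELECT region, SUM(sales) FROM orders GROUP BY region;",
    "reference":  "select region, sum(sales) from orders group by region",
}))
```

Exact-match rewards are the simplest case; for domains like SQL you would usually prefer an execution-based check, since two textually different queries can be semantically identical.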

Thumbnail 940

Once you have given these few minimal inputs, you can just hit run. As you can see, all the hyperparameters have been filled up by default. As I mentioned, these recipes have been optimized by AWS to ensure you get the best cost performance on AWS hardware, so you can just go ahead and hit the run button to get a customized model.

Thumbnail 970

Thumbnail 980

Tracking Experiments and Evaluating Model Performance

Now, as you go about doing these experiments, you would need a place to track and compare these experiments side by side. To help with this problem, we have launched a serverless MLflow capability. You can track all your model customization experiments using the serverless MLflow and compare those experiments side by side. This capability gives you access to the familiar MLflow interface. The entire MLflow tracking server runs in a completely serverless way, so there is no compute to manage, and it's available at no additional charge. It's free and available in SageMaker Studio, and I'm going to show you the experience.

Thumbnail 1000

Thumbnail 1020

Thumbnail 1030

Once you launch a fine-tuning job and open up the job details page, you can see some of the metrics which already start appearing on the job details page, like training loss and validation accuracy. All these metrics are getting pulled from the serverless MLflow. If you want to see more metrics and do a deep dive, you can just click on these links embedded right there to open up MLflow in a separate browser tab. There you can choose to go back into that training run and do a deep dive, or you can choose multiple training runs side by side and do a comparison.

Thumbnail 1050

Once you have identified a model candidate after running a bunch of experiments that you want to take forward, now is the time for model evaluation to make sure it meets the success criteria you had determined at the beginning. To run model evaluation, we are pleased to also add another capability of serverless model evaluation in SageMaker AI that lets you evaluate and compare models on multiple different dimensions in a completely serverless way. You might have already noticed I'm using the word serverless with every capability, and that's what our vision is, that every capability comes to you without having to manage any infrastructure throughout the model customization workflow.

Thumbnail 1070

Thumbnail 1080

With this serverless model evaluation capability, you can simply try out multiple different evaluation techniques. For example, you can run industry standard benchmarks like MMLU. You can choose to provide your own custom scoring function, or you can use an LLM as a judge, where you can use a Claude Sonnet or a GPT model as a judge to evaluate your model responses and generate evaluation scores. This capability not only collects the evaluation results from your job, but also summarizes them, aggregates them, and gives you a summarized report comparing the evaluation outcomes of your customized model against that of the base model. You can quickly determine if your customization is having the right impact on your model.
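
The LLM-as-a-judge technique boils down to prompting a strong model with the candidate response and parsing a score out of its reply. A minimal sketch with the judge call stubbed out (a real setup would invoke Claude Sonnet or another judge model; the prompt template and score format are assumptions):

```python
import re

# Sketch of the LLM-as-a-judge pattern: build a judge prompt, send it to
# a strong model, and parse a numeric score from the reply. The judge
# call is a stub; a real setup would call Claude or a GPT model.

JUDGE_TEMPLATE = (
    "Rate the following response for politeness on a scale of 1-5.\n"
    "Reply with 'Score: <n>' and a one-line justification.\n\n"
    "Response: {response}"
)

def call_judge_model(prompt: str) -> str:
    """Stub standing in for a real judge-model API call."""
    return "Score: 4 - Courteous and clear, slightly terse."

def judge(response: str) -> int:
    reply = call_judge_model(JUDGE_TEMPLATE.format(response=response))
    match = re.search(r"Score:\s*(\d)", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))

print(judge("Thanks for reaching out! Your order ships tomorrow."))
```

Running the same judge over base-model and fine-tuned-model outputs, then aggregating the scores, is essentially what the summarized comparison report automates for you.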

Thumbnail 1120

Thumbnail 1150

I'm going to show this experience. Here's the visual experience in SageMaker Studio. You can see the three evaluation techniques on the top. I'm going to choose LLM as a judge, and I'm going to choose Claude Sonnet as the judge model. Then I'm going to specify the evaluation metrics I want to evaluate the model on. I can choose one of the many quality and responsible AI metrics which are available out of the box, or I can provide my own custom metrics using a simple prompt. You can use one of the built-in prompt templates, write down a prompt for your custom metrics, or you can write a prompt completely from scratch. All of these options and flexibility are available for you, so you can choose to run your evaluation in the way that best fits your use case.

Thumbnail 1170

Thumbnail 1180

Thumbnail 1190

Deploying Models for Inference with Cost-Performance Optimization

Once you run the evaluation, it generates this summary report which I was talking about, that lets you compare the base model against the fine-tuned model on all the summary statistics. You can quickly, at a glance, determine if your model looks better after customization and move on to the next step, which is to deploy the model in production for inference. When it comes to deployment, SageMaker gives you two easy-to-use options. You can choose to deploy either to Bedrock in a few simple clicks, or you can deploy it on SageMaker for inference. If you deploy on SageMaker AI, we provide a bunch of built-in capabilities to help you improve your cost performance.

For example, SageMaker Inference comes with multi-LoRA deployments: if you are training multiple different LoRA adapters for different use cases on top of the same base model, you can bin pack all those LoRA adapters along with the base model onto the same instance and set up flexible, independent scaling policies for each adapter. That lets you match the demand patterns for each of those LoRA variants, as well as reduce cost by bin packing them onto the same instance.

Similarly, we also give you other techniques like dynamic adaptive speculative decoding, which lets you train personalized draft models on your private data. We call them Eagle heads. Using these Eagle heads, you can run inference against the incoming prompts in a very cost-effective way and then pass on only a subset of the generated tokens to your main customized model. This two-phase inference reduces the overall cost of inference while improving performance. All these capabilities are built in as part of SageMaker AI Inference, so I would definitely encourage you all to go and give it a spin.

Thumbnail 1290

Thumbnail 1300

Building Agents with Bedrock AgentCore and Strands SDK

Now, once you are done deploying the model for inference, now comes the final step, building the agent. For this, you use the Bedrock AgentCore suite of tools which works with both Bedrock endpoints as well as the SageMaker AI Inference endpoints. With Bedrock AgentCore, you get a bunch of tools and services to deploy and monitor your agents in production. First, Bedrock AgentCore is quite open. It works with all the popular agentic development software toolkits. For example, you can bring Crew AI, LangGraph, OpenAI SDK, or you can use Amazon's Strands Agent to build your agentic workflow.
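
These toolkits differ in surface syntax, but they share one core pattern: you register plain functions as tools that the agent loop can discover and invoke by name. A framework-free Python sketch of that registration pattern (this is not the actual Strands, LangGraph, or CrewAI API, only the underlying idea):

```python
# Framework-free sketch of the tool-registration pattern that agent SDKs
# such as Strands, LangGraph, and CrewAI build on. Not any real SDK's
# API; it only illustrates exposing plain functions as named, described
# tools that an agent runtime can enumerate and invoke.

TOOL_REGISTRY = {}

def tool(fn):
    """Decorator that registers a function as an agent tool."""
    TOOL_REGISTRY[fn.__name__] = {"fn": fn, "description": fn.__doc__}
    return fn

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by its ID."""
    return f"Order {order_id}: shipped"

@tool
def list_products(category: str) -> list:
    """List products in a category."""
    return ["dog food", "leash"] if category == "pets" else []

# The agent runtime can now enumerate tools and invoke them by name.
print(sorted(TOOL_REGISTRY))
print(TOOL_REGISTRY["get_order_status"]["fn"]("A123"))
```

The docstrings matter: real SDKs feed the tool descriptions to the model so it can decide which tool fits the current step of the plan.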

You can then deploy it onto the completely serverless runtime of the AgentCore. This runtime allows you to run the agents even for prolonged periods of time if you have any batch process running behind the scenes. AgentCore also gives you managed primitives for memory, both short term and long-term memory, so you can maintain the context of your conversations as well as maintain user preferences across sessions.

Now, as your agents run, they would typically need to connect with multiple tools to complete their task. So AgentCore gives you a managed MCP gateway to connect with these third party and first party tools. In addition, it also gives you certain first party tools out of the box. So for example, it gives you a code execution runtime and also a browser runtime to browse the web and extract information on the fly.

Thumbnail 1390

And finally, once you are done deploying the agent, it gives you ways to monitor the agents and trace the entire agentic call throughout the agentic workflow, and it publishes these traces in OpenTelemetry format so you can bring your own choice of observability stack to analyze and observe them. One of the things I want to call out here in the Bedrock AgentCore toolkit is the Strands SDK, which really makes it easy to develop your agentic workflow. This forms the backbone of developing your agent. To show you how easily this can be done, I'm going to invite my friend Davide to come on stage and show you a live demo.

Thumbnail 1430

Live Demo: Building a Business Analyst SQL Agent End-to-End

Thank you, Sumit. Awesome. Let me prepare my computer for the demo. There we go. Perfect. So, we're just gonna go ahead and get started with the demo. Let me give you a brief rundown of what this is gonna be all about. Let's assume that you've been tasked with the one challenge that we've seen very common across basically every customer that I've been working with. Customers normally have a business analyst or a business intelligence persona. They have a platform where they can throw some questions against this. It could be a chatbot, it could be just a web interface, a UI.

Normally, these business analysts have great knowledge about the domain they work in, but don't necessarily know how to write SQL queries from scratch. So this is the task that we have. Our goal is to work with this company. This company is in the retail industry. They have different tables, they have a data catalog, and they want to make sure that the agent they build is a business analyst agent, which can receive natural language questions, output SQL queries, execute those queries, and provide the results once again in natural language back to the user.

To do that, we will use, of course, Amazon SageMaker AI. Specifically, we will use the new capabilities that were just introduced by Sumit in the SageMaker Studio interface.

Thumbnail 1520

I'm showing you the UI aspect of it just for the sake of it being easy for you to understand right now, but everything you see is also available via code in the form of an SDK. So what are the steps that we're going to go through in the demo? We're going to start by choosing an off-the-shelf small language model. Our goal is to use a small language model to be cost effective, to have a training that doesn't last for weeks and instead lasts for hours, maybe even minutes. We will leverage synthetic data, because the customer unfortunately didn't provide us with any specific data or queries that they've already pre-built. So we're going to use the serverless data generation capabilities in order to create a synthetic dataset. We'll use that to customize the small language model, evaluate it, deploy it, and use it for an agent.

Thumbnail 1580

I think we spoke already enough in general terms about the demo, so let's just jump directly into the UI and into the code. Perfect. Alright, I'm going to zoom in a little bit so that the people that are a little bit further away can see at least a little bit better. Cool. So our journey, as always, starts within SageMaker Studio. So in SageMaker Studio, when you jump in, you will have all the different applications that are available, the JupyterLabs, the Studios, the Canvas, the Code Editor, and the new serverless MLflow. But our journey actually starts from the JumpStart Model Hub, which you can access from the Models tab here on the left.

Thumbnail 1610

Thumbnail 1630

There are some spotlight models, and we're going to use one of them today. Some of them are from Amazon, like the Nova 2.0 family, which is now available also in SageMaker for customization, but also open source models ranging from Meta Llama models to Qwen models to OpenAI models. And you can actually drill down and see which models are available for serverless fine-tuning by selecting this filter here and checking trainable serverless. Right now we have 14 models supported by the serverless capabilities, but you can expect this number to grow a lot in the coming weeks and months.

Thumbnail 1640

Thumbnail 1680

So what I'm going to be doing today is choosing the Meta Llama 3.1 8 billion Instruct. It's a model that I know and have tested already, so this is the model that I want to go ahead and customize. But of course, for your use case, you can use whatever you want. You can go as small as, for example, a 0.3 or 0.6 billion model, or 2, 3, 4 billion, et cetera. There are a lot of different models out there. I'm just going to go ahead and choose this one. And I actually have three options. There's the code-first version, customizing with code. There's the UI-driven customization, using the interface that was shown before by Sumit. But today we're going to focus on customizing a model with an AI agent.

Thumbnail 1700

Thumbnail 1710

Thumbnail 1720

So the very first thing is that the agent is simply going to ask us, hey, what is the specific task that you want to solve? And we're going to tell it. I should already have it here in memory, so let's just use this one. So I want to fine-tune a model to assist business analysts and data analysts in writing correct and efficient SQL queries, improving query correctness and reducing time to insights. This is a message that I already had pre-created, but you can solve virtually any kind of problem here, because the goal of the agent behind the scenes is to walk you through the steps so that you can understand and get the best suggestions on how to solve your problem.

You will see, as it was already shown by Sumit, that this is going to be spec-driven development. You're going to see a lot of files that detail step by step what the agent needs to do in order to make sure that it solves the right problem for you. So first of all, it already identifies, okay, this is likely going to be an SFT use case, Supervised Fine-Tuning, and it tells you why. SFT leverages labeled examples of SQL prompts and correct answers, so basically prompt-completion pairs. This seems correct to me, but if I wanted to, I could have told it, hey, another good way to train a model for SQL query generation is, for example, Reinforcement Learning with Verifiable Rewards, because I could be executing the queries behind the scenes, getting the results, and using those results, if correct, to drive the fine-tuning of the model further.
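
The alternative Davide mentions, executing queries and using correct results as the training signal, can be sketched concretely with SQLite: reward 1.0 when the generated query returns the same rows as a reference query. The schema and queries below are toy assumptions for illustration:

```python
import sqlite3

# Sketch of a verifiable reward for SQL generation: execute both the
# generated and the reference query against a scratch database and
# reward an exact result-set match. Schema and data are toy assumptions.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('west', 100), ('west', 50), ('east', 30);
""")

def execution_reward(generated_sql: str, reference_sql: str) -> float:
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # queries that do not even run earn no reward
    want = conn.execute(reference_sql).fetchall()
    # Sort rows so that semantically equal results in a different
    # order still count as a match.
    return 1.0 if sorted(got) == sorted(want) else 0.0

# Textually different but semantically equivalent queries both pass.
print(execution_reward(
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT region, TOTAL(amount) FROM sales GROUP BY region",
))
```

This is why execution-based rewards are attractive for SQL: unlike string matching, they credit any query that produces the correct result.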

Thumbnail 1790

So for now, for some reason it's asking me again, so we'll just say it again. So we're going to be using SFT, and you're going to see that the very next step is going to be asking a little bit more questions about, you know, what should be the data that we want to use. So first, the first step is selecting the model. It remembers that I'm using 3.1 8 billion, so I'm going to go ahead and use this model, of course accepting the license agreement in this case. And then the next step is, okay, how do I tell it which data I want to use. In this case, I don't have any data.

Thumbnail 1820

I just have an example that I can share with it, and you're going to see that it will prompt me on how to generate the dataset. First, it makes sure that the use case is the correct one. So it gives me the spec. This is what I was referring to in terms of spec-driven development. It tells me, hey, the business problem you want to solve is to assist the business analysts and data analysts, what I told you before. It defines some tenets, tenets for success. The query needs to be correct, needs to be efficient, needs to be explainable, needs to have a professional tone. Okay, maybe this one is less relevant. Most importantly, it needs to be safe, so it shouldn't issue any write queries back to the database.

Thumbnail 1850

Thumbnail 1860

If I don't like any of these specs, I can go ahead and just edit. So I can go, for example, here and change anything about the description of the explanation clarity. To be honest, they're pretty fine. I'm going to keep also the professional tone, why not? So I'm going to save and approve. Remember that whenever we do spec-driven development, your goal is not to just blindly accept the spec that comes from the agent. That spec is going to be the one that drives your use case, so make sure that it is aligned with what you actually want to achieve.

Thumbnail 1880

Thumbnail 1890

So in just a second here, what's going to happen is that we are going to jump, there we go. So, let's create an example for your use case. If you have any specific details about the expected behavior, blah blah blah, please share them now. I do have something already. Which is, there we go. I have an example, so this is an example that I can use to generate more examples here, sorry for the repetition there.

Thumbnail 1910

Thumbnail 1920

Thumbnail 1930

So my prompt says, you are an AI assistant tasked with answering user questions with accurate SQL queries, so I have a data schema. In this case, I only have two tables, so it's going to be very short, but you can enrich this as you want. And I provide it with a completion, so a SQL query with detailed examples. So what's going to happen now is that the agent is going to leverage this example that I give to it to generate two more examples. I'm going to review them, and if I like them, then I can define these as the base for my spec for data generation.

Let's look at the two examples that were given, and you can see that the three examples that are now in the chat, they actually follow the style that I gave to the agent. So it has a prompt, which has, let's say the system prompt per se, the user input in terms of the query, in this case find total sales and total quantity for each region where total sales exceeds one million, and the response in a well-structured SQL query. This looks good, so let's say yes, proceed.
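As a rough sketch of what such a prompt-completion pair can look like as a JSONL record (the field names and schema below are assumptions for illustration, not the exact format the agent emits):

```python
import json

# Hypothetical system prompt and table schema, echoing the demo's example.
SYSTEM_PROMPT = (
    "You are an AI assistant tasked with answering user questions "
    "with accurate SQL queries.\n\nSchema:\n"
    "CREATE TABLE sales (region TEXT, amount REAL, quantity INTEGER);"
)

def make_sft_record(question: str, sql: str) -> dict:
    """Pack one training example as a prompt-completion pair."""
    return {
        "prompt": f"{SYSTEM_PROMPT}\n\nQuestion: {question}",
        "completion": sql,
    }

record = make_sft_record(
    "Find total sales and total quantity for each region "
    "where total sales exceeds one million.",
    "SELECT region, SUM(amount) AS total_sales, SUM(quantity) AS total_quantity\n"
    "FROM sales GROUP BY region HAVING SUM(amount) > 1000000;",
)

# JSONL: one JSON object per line, ready to upload to S3.
line = json.dumps(record)
print(line[:60])
```

Five thousand of these, one per line, is what the synthetic data generation step later produces.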

Thumbnail 1970

Thumbnail 1980

Thumbnail 1990

So as I say yes, proceed, it's going to tell me, okay, now I understand how I'm supposed to create this data, so let's define together the spec for how to generate it, and we will use that spec to create a certain number of examples. So let's go through the spec first. You see information about the use case definition, what's important about the data, some of the examples, which are the ones that we wrote before. And finally, quality standards: what does it mean to have a good answer, and what does it mean to have a bad generation in this case?

Thumbnail 2000

Thumbnail 2020

Once again, here I'm accepting everything blindly. I'm not suggesting you do that. I'm suggesting you actually go through the details here, change anything that might be less relevant. We try to make it as aligned as possible to your use case, but there's no one else out there that knows your use case better than you. So let's save the changes in this case and approve the data spec.

Thumbnail 2030

Thumbnail 2040

Thumbnail 2050

And now what's going to happen is it's going to tell me, hey, configure the synthetic data generation. For a good SFT use case, it's recommended to create five thousand records. I'm going to also use this path for the demo data, with a role ARN which I have already handy here. And if I have additional data which I can use as context for the data generation, I can provide it here. Right now I don't have any, but I'm going to start the synthetic data generation.

Thumbnail 2060

So what you're going to see is that this is going to take a little while. So if you don't have anything to do for the next forty-five minutes, we can wait for this to be completed. Of course that's a joke. We have one already pre-done. But what's important is that the data generated will be stored in S3. It's going to be JSONL format. We can take a look at that in just a second, and this you can use not just as part of the AI agent experience, but it can be used also with your own training.

Thumbnail 2100

To make sure you guys are not waiting too long here, because you have to attend a party tonight, so you might not want to wait necessarily here until this is done, I'm going to show you a complete end-to-end process, all right? So allow me to, you know, pull a TV magic trick and get one already done. So, data spec, generation of the data, data generation is completed.

Thumbnail 2110

Thumbnail 2120

Thumbnail 2130

Thumbnail 2140

Thumbnail 2150

I can take a look at the generated synthetic data. It's going to give me a couple of examples here, just the first five. It's going to tell me that it generated 5,000 records for the SFT strategy, so it's going to be prompt-completion pairs. They are correctly formatted with the data schema as well as the completions, and you're going to see all of these examples. If you want, you can look them up in the file tree to see the complete list. But it also provides you with data quality results. This is very important because if you're doing any kind of training process, you want to understand how the data actually looks, not just from a few random samples, but from statistics on your whole dataset. Get the n-gram diversity, the total text length, the mean word length, and any Responsible AI metrics. These are very important to make sure the data is aligned with your use case.
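A hedged sketch of how a couple of those statistics (mean word length, and n-gram diversity as distinct n-grams over total n-grams) might be computed; the real report likely uses more sophisticated definitions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dataset_stats(texts, n=2):
    """Rough, illustrative versions of two metrics from the quality report."""
    words = [w for t in texts for w in t.split()]
    mean_word_len = sum(len(w) for w in words) / len(words)
    grams = [g for t in texts for g in ngrams(t.split(), n)]
    diversity = len(set(grams)) / len(grams)  # distinct / total
    return {"mean_word_length": mean_word_len, f"{n}gram_diversity": diversity}

print(dataset_stats(["select region from sales", "select market from orders"]))
```

Low n-gram diversity would flag a dataset where the generator kept repeating itself, which is exactly what you want to catch before training.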

Thumbnail 2170

The toxicity score is super low. It's basically zero, almost negligible. In fact, there are zero toxic records. This should probably be like a zero toxicity score. Once that's done, the very next step that the agent will suggest is to start a training job. And once again, we're doing it here just because we want to showcase the AI agent capability, which is available in private preview. But again, you can take this data and use it anywhere else. You could be using the UI that Sumit was just showing before, you could be using it in your code, really whatever you want. Any of these components is modular, and you can reuse it where you want.

Thumbnail 2200

Thumbnail 2210

For the training job, you have to configure a couple of parameters like the batch size and learning rate. We provide these values by default because we have tested a couple of recipes behind the scenes. But again, these values depend on the use case and on the complexity of the use case, so you might still want to have a little bit of knowledge of data science. Of course, I've already launched the training job behind the scenes, so I can show you the full metrics. As you can see, this was a job for seven epochs. Epochs are complete, lasted 6,000 seconds. You do the math. I don't remember exactly what that is. Batch size of 64, seven epochs, learning rate very low.
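To actually do the math on those numbers, here is a trivial helper using the values quoted above: 6,000 seconds over seven epochs is 1 hour 40 minutes, or about 857 seconds per epoch.

```python
def summarize_training(total_seconds: int, epochs: int, batch_size: int) -> dict:
    """Turn raw job numbers into a human-readable summary."""
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return {
        "wall_clock": f"{h}h {m:02d}m {s:02d}s",
        "seconds_per_epoch": round(total_seconds / epochs, 1),
        "batch_size": batch_size,
    }

# The job described in the talk: 6,000 seconds, 7 epochs, batch size 64.
print(summarize_training(6000, epochs=7, batch_size=64))
# → {'wall_clock': '1h 40m 00s', 'seconds_per_epoch': 857.1, 'batch_size': 64}
```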

Thumbnail 2240

Thumbnail 2260

Thumbnail 2270

Thumbnail 2280

Let's look at the metrics with the new serverless MLflow. You're going to see that with a simple click of a button, without even leaving the agent per se, I can be rerouted to MLflow. I'm going to get the list of experiments. Spoiler alert, you're going to see a lot of examples here because this is a shared account, so we're using it with a bunch of other folks out there. But if I remember right, this is mine, so it's the Llama SQL with system prompt. Aptly called Llama SQL because it's a Llama model that generates SQL queries. Yeah, in tech we don't have a lot of imagination for naming. And so I can go into the details of my experiment, get the model metrics, get the system metrics, and so on. You can see the training looks pretty decent: the loss is continuously falling, so it's converging, which is good. And you can take a look at all the different details and get a much deeper understanding of the different outcomes from this training.
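The convergence check being eyeballed here can be sketched in code, and logging the same loss curve to a (serverless) MLflow tracking server takes only a few lines with the standard MLflow client. The experiment name and tolerance below are illustrative.

```python
def is_converging(losses, window=3, tol=1e-4):
    """Heuristic: the tail of the loss curve is still (weakly) decreasing."""
    tail = losses[-window:]
    return all(b <= a + tol for a, b in zip(tail, tail[1:]))

def log_history(losses, experiment="llama-sql-with-system-prompt"):
    """Log a loss curve to MLflow; needs a tracking server configured."""
    # Guarded import: only required when actually logging.
    import mlflow
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        for step, loss in enumerate(losses):
            mlflow.log_metric("train_loss", loss, step=step)

print(is_converging([2.0, 1.5, 1.1, 0.9, 0.85]))  # → True
```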

Thumbnail 2290

Thumbnail 2310

Thumbnail 2320

Once that is done, then what we can do is start the evaluation. Now, once again, the agent is going to suggest a spec for evaluation, right? It's going to give me some metrics that I could be using if I wanted to evaluate the performance of this model. It's going to take a while to run the evaluation. I had to rerun it this morning because of some cleanup in this account. As you can see, it's still a little bit running, but you can see the progress. Basically what it does, oh there we go, there was an error. Maybe I changed something in there, but it will run an LLM-as-a-judge evaluation as well as a custom scoring of what the model actually is performing and will give me at the end of this a nice report saying, hey, this is how the model performs according to the metrics that we have defined.

Thumbnail 2340

Thumbnail 2350

Thumbnail 2360

The metrics it defines itself, for example in this case, efficiency focus, user-friendly explanation, cost awareness. Those are all very generic and they require an LLM to be confirmed, which is why we use LLM-as-a-judge behind the scenes. But again, this has been approved as is. If you want, you can go ahead and change anything in it. So let's take a look at the model results in general. I'm going to go once again into the models tab here. I'm going to look at my models, going to go through all the different models that my colleagues have trained, and I'm going to look at mine, Llama SQL with system prompt.
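A minimal sketch of the LLM-as-a-judge setup implied here: build a judge prompt from the rubric and average the returned scores. The rubric wording and 1-5 scale are invented for illustration, not the agent's actual evaluation spec.

```python
RUBRIC = {
    "efficiency_focus": "Does the query avoid unnecessary scans and joins?",
    "user_friendly_explanation": "Is the answer explained in plain language?",
    "cost_awareness": "Does the answer consider query cost?",
}

def build_judge_prompt(question: str, answer: str, rubric=RUBRIC) -> str:
    """Compose the prompt sent to the judge model."""
    criteria = "\n".join(f"- {k}: {v}" for k, v in rubric.items())
    return (
        "You are an impartial judge. Score the answer 1-5 on each criterion.\n"
        f"{criteria}\n\nQuestion: {question}\nAnswer: {answer}\n"
        'Reply as JSON, e.g. {"efficiency_focus": 4, ...}'
    )

def aggregate(scores: dict) -> float:
    """Mean score across criteria, one number per evaluated response."""
    return sum(scores.values()) / len(scores)

print(aggregate({"efficiency_focus": 4, "user_friendly_explanation": 5,
                 "cost_awareness": 3}))  # → 4.0
```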

Thumbnail 2380

Thumbnail 2390

Thumbnail 2400

So what I have here is a single pane of glass with all the information about my model. So what did I train it on? When was it trained? What is the training job that led to this model? What was the training dataset? When did this happen? Which technique was it? What was the base model? Training hyperparameters, performance of the model training as well, which come directly from MLflow. And then I can also take a look at the evaluation report that I was telling you about before. It's still going on, but we can take a look at the details later. I can also go ahead and deploy this model. This is really important because once we have trained the model, as Sumit was telling us about before, a model that has been fine-tuned is only valuable when you can actually put it into production.

So the very next step after evaluating a model should be to deploy. In SageMaker, it's as easy as clicking two buttons. Click the deploy button all the way up top right, and then choose your technique. You can deploy it on an endpoint on SageMaker, which runs 24/7, is always available, and can scale horizontally as much as needed. Or you can deploy on Bedrock, though rather than "deploying," we should really say "importing" into Bedrock as a custom model, which will make it available in a serverless fashion, and you will pay only for the requests that you make. We tend to prefer SageMaker when you have a certain threshold of throughput that you want to meet, because then you define the architecture, you define how big the endpoint is going to be, and you define how many requests you can receive. If your workload is spiky or you're still more in a development sort of fashion, Bedrock is a great choice with custom model import because it allows you to test, understand, and further improve your model.
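The rule of thumb from this paragraph can be captured in a tiny helper; the 5 requests-per-second threshold is invented for illustration, since the talk only describes the trade-off qualitatively.

```python
def pick_deployment(steady_throughput_rps: float, spiky_or_dev: bool) -> str:
    """Encode the talk's rule of thumb; the 5 rps cutoff is an assumption."""
    if spiky_or_dev or steady_throughput_rps < 5:
        # Serverless custom model import: pay only per request.
        return "bedrock-custom-model-import"
    # Dedicated endpoint: you size it, it runs 24/7, scales horizontally.
    return "sagemaker-endpoint"

print(pick_deployment(50, spiky_or_dev=False))   # → sagemaker-endpoint
print(pick_deployment(50, spiky_or_dev=True))    # → bedrock-custom-model-import
```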

Thumbnail 2480

I can decide to create a new endpoint or use an existing endpoint. I'm not going to do that right now because I already have many endpoints deployed here. We are using a shared account, so no surprise there. But what I'm going to show you right now is what you can do once you have deployed that model. So my model is available. Let me actually clean up this code so we can walk you through step by step.

Thumbnail 2500

Thumbnail 2510

Thumbnail 2530

As we have deployed our model, the next step is to use it for inferences. The very first thing I'm going to do is create some utility functions. We can go back into this in a second. My very first operation will be to generate an inference just to test if the model is actually working or not on the endpoint. Of course, we have trained it according to the best practices, so it should be working just fine. In fact, I'm going to use a SageMaker Predictor SDK here to generate an inference. You see it was super fast.

Thumbnail 2540

Thumbnail 2550

Thumbnail 2560

You can see that this model, because I'm passing it the information, I'm passing it the system prompt saying you are an advanced AI assistant, your goal is to execute a tool to generate SQL queries and to execute them using the tool. The query is, can you give me the top three markets with the highest number of returns? The model correctly replies by saying, hey, you need to call a tool. The tool is called execute SQL query. I've created it at the beginning of this notebook, and the SQL query you need to execute is the select market count, and so on, limit three. Not exactly the most difficult SQL query out there, but just for the sake of the demo it's going to do fine.
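A hedged sketch of that test inference with the SageMaker Python SDK: the messages-style payload is an assumption for a chat-tuned Llama endpoint, so match it to your container's actual contract. Requires `pip install sagemaker` and a deployed endpoint to run the live call.

```python
SYSTEM = ("You are an advanced AI assistant. Your goal is to generate SQL "
          "queries and to execute them using the execute_sql_query tool.")

def build_payload(question: str) -> dict:
    """Shape the request body for a chat-style endpoint (assumed contract)."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        "max_tokens": 512,
    }

def invoke(endpoint_name: str, question: str):
    """Live call against a SageMaker endpoint; needs AWS credentials."""
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer
    predictor = Predictor(endpoint_name=endpoint_name,
                          serializer=JSONSerializer(),
                          deserializer=JSONDeserializer())
    return predictor.predict(build_payload(question))

print(build_payload("Top three markets with the highest number of returns?"))
```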

Thumbnail 2580

Thumbnail 2590

Thumbnail 2600

Thumbnail 2610

Thumbnail 2620

Then I can go ahead and test this query to make sure that it works all right. As you can see, the output here says in my demo dataset, the market with the most returns is LATAM with 296, followed by APAC and the United States. You might say, David, this is nothing new, this is not an agent because I had to call it manually. So how about we build an agent with this? We're going to use Strands Agent for our use case, and the good thing about Strands Agent is that it executes the tool behind the scenes, and all I have to do is just provide it with a system prompt, provide it with the tools, pass it the query, and after a couple of query executions, it provides me exactly the same result. As you see, I didn't go through any steps of defining what the tool code was. It did everything automatically behind the scenes.
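In the spirit of the demo, a Strands agent with a stubbed execute-SQL tool might look like the sketch below. The canned rows mirror the demo output but are made up here, the SageMaker model-provider import is an assumption (check the Strands Agents docs for the exact provider name), and the SDK import is deferred so the sketch loads without `strands-agents` installed.

```python
def execute_sql_query(query: str) -> list[dict]:
    """Demo stand-in: pretend to run the query and return rows."""
    return [
        {"market": "LATAM", "returns": 296},
        {"market": "APAC", "returns": 251},
        {"market": "United States", "returns": 198},
    ]

def build_agent(endpoint_name: str):
    """Wire the tool and the fine-tuned endpoint into a Strands agent."""
    from strands import Agent, tool
    # Assumption: provider module/class name for SageMaker-hosted models.
    from strands.models.sagemaker import SageMakerAIModel

    return Agent(
        model=SageMakerAIModel(endpoint_config={"endpoint_name": endpoint_name}),
        tools=[tool(execute_sql_query)],
        system_prompt="You are an advanced AI assistant. Generate SQL queries "
                      "and execute them using the execute_sql_query tool.",
    )

# agent = build_agent("llama-sql-endpoint"); agent("Top 3 markets by returns?")
print(execute_sql_query("SELECT market, COUNT(*) ... LIMIT 3")[0])
```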

Thumbnail 2630

Thumbnail 2640

Thumbnail 2650

Thumbnail 2660

Thumbnail 2670

Thumbnail 2680

The final thing that I'm going to show you, and then I'm going to let Nikhil come on stage to talk about the good things that Robinhood is doing with SageMaker AI. The very last step is that now your agent is running on the notebook, but that's not really useful. How about we put it into production? So I wrote this little script that basically what it does is that it defines the same agent that I just defined. As you see, it's the same Strands Agent configuration, but additionally, I have this Bedrock AgentCore app definition. This is all you need to add in order to provide information on how to run this agent in production on AgentCore. In fact, you can see that as I invoke my agent on AgentCore with the prompt, with the same kind of question, I will get my output saying, show me the number of orders that were returned for each market, and if everything goes right, then I have exactly the same response.
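A minimal sketch of that AgentCore wrapper: the `"prompt"` payload key and response shape are assumptions, and the agent call is stubbed so the entrypoint logic stays testable without the `bedrock-agentcore` SDK installed.

```python
def handle(payload: dict) -> dict:
    """Pure entrypoint logic: extract the prompt, run the agent, wrap result."""
    prompt = payload.get("prompt", "")
    if not prompt:
        return {"error": "missing 'prompt'"}
    # In the real script this invokes the Strands agent; stubbed for the sketch.
    return {"result": f"agent answer for: {prompt}"}

def main():
    """Serve the handler on AgentCore; call main() in your deployed script."""
    from bedrock_agentcore.runtime import BedrockAgentCoreApp
    app = BedrockAgentCoreApp()

    @app.entrypoint
    def invoke(payload):
        return handle(payload)

    app.run()

print(handle({"prompt": "Show me the number of returned orders per market"}))
```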

Of course, we can look it up, you can try different queries, try different prompts. The goal for me was to make you understand how you go end to end from chatting with an agent to building a machine learning model, a Gen AI model, a small language model, and taking it all the way to production. The pattern is ready to go to production, but please don't use this exact code in your production environment, because this is still early, rough demo code, so you might want to be careful about that.

All right, let's look at someone that actually built good production-level code with a great team. So I'm going to invite Nikhil on stage to talk about accelerating AI innovation at Robinhood.

Thumbnail 2760

Robinhood's AI Journey: Fine-Tuning at Scale for Mission-Critical Applications

Thank you, thank you. That was such a wonderful talk. My name is Nikhil Singhal. I'm a Senior Staff ML Engineer at Robinhood. I lead our agentic platform initiatives, everything from LLM evaluations to fine-tuning infrastructures to inference systems at production scale. As I'm on the infrastructure team, the platform team, we build capabilities to accelerate agentic application development at Robinhood. We have built platform capabilities on top of Amazon SageMaker AI and Amazon Bedrock. We use Amazon SageMaker AI for training and Bedrock for inferencing. So today, I'm going to talk about how we are accelerating AI development through fine-tuning, but before that, let me contextualize it.

Thumbnail 2780

I'll start with how Robinhood started with a bold question from our co-founders, Baiju and Vlad. What if finance were for everybody, not just for the ultra-wealthy? This simple but powerful idea sparked a movement, and Robinhood broke all the barriers with commission-free trading. We did not stop there. We extended into crypto, cash management, credit cards, and recently stock tokens, enabling access for our users to the market in a way once unthinkable. Throughout our journey, we stay true to our mission, which is democratizing finance for all. We build capabilities which are simple, ergonomic, and empowering for users so they can take full control of their financial future.

Thumbnail 2840

Moving to our AI vision. For us to truly realize our mission, we believe that we need to give our users the same level of support and insight as a high net worth individual. For that, we need to really harness the transformative power of AI and machine learning. Therefore, AI isn't just a feature for us. It is one of the cornerstones for us to fulfill our mission.

Thumbnail 2880

Thumbnail 2890

Thumbnail 2900

Thumbnail 2910

Thumbnail 2920

Let me walk you through some of the applications which we have built, which are some of the complex applications. Powering mission-critical agentic apps, this is Cortex. If you open any stock trading app, you will see sometimes that a stock is moving up or down. Before, you needed to be really a detective to figure out why it moved up or why it moved down. This is where we thought that AI can really be that detective for you and for our users. It can crunch way too much information and can provide unique, cohesive, and useful information in a digestible manner to our users. This, we believe, is very empowering, and what we believe is that it enables users to get the same level of insight that somebody with an extended team would have received.

Thumbnail 2930

Thumbnail 2940

Thumbnail 2950

Customer support is the other frontier where we have built an LLM agent to automate our customer support. For users, when they are dealing with trading, there are a lot of nuances and a lot of support they need. That is where we thought that building an LLM agent for customer support will be empowering for our users because that is how we can scale and provide users answers to their questions, whether they are complex or whether they are about exploration or a product idea.

Our customer support agent is split into three stages. The first is intent understanding, where for observability reasons and for the downstream stages, we first understand the intent of a user question. Is it a brokerage question? Is it a crypto question? Or does it require complex reasoning? Then we move to planner and tool selection. To solve this question, what kind of planning do I need? What kind of tools or database queries do I need to execute? At the end, when it figures out that this is the plan it needs, it executes the tool invocations.

Once it does the tool invocation, it retrieves all the context. Once the context is built, then the final answer generation happens. Many of the models here do get served out of Amazon Bedrock.
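The three stages described above can be sketched as a toy pipeline. The intents, keyword routing, and tool names below are invented stand-ins; in production each stage is an LLM call (many served from Amazon Bedrock), not keyword matching.

```python
# Illustrative routing table: intent → keywords (invented for the sketch).
INTENT_KEYWORDS = {
    "crypto": ["bitcoin", "crypto", "wallet"],
    "brokerage": ["stock", "order", "shares"],
}

def classify_intent(question: str) -> str:
    """Stage 1: intent understanding (an LLM call in production)."""
    q = question.lower()
    for intent, kws in INTENT_KEYWORDS.items():
        if any(k in q for k in kws):
            return intent
    return "complex_reasoning"

def plan(intent: str) -> list[str]:
    """Stage 2: planner / tool selection (also an LLM call in production)."""
    return {
        "crypto": ["get_crypto_positions"],
        "brokerage": ["get_orders", "get_positions"],
    }.get(intent, ["search_help_center"])

def answer(question: str) -> dict:
    """Stage 3: tool invocation builds context for final answer generation."""
    intent = classify_intent(question)
    tools = plan(intent)
    context = {t: f"<result of {t}>" for t in tools}  # stubbed invocations
    return {"intent": intent, "tools": tools, "context": context}

print(answer("Why did my stock order fail?"))
```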

Thumbnail 3030

Moving forward, while we build these apps, we realize that we need to scale them as well. And that is where I talk about the generative AI trilemma. In the world of generative AI, these three variables—cost, quality, and latency—are often fighting against each other. You may think that sometimes you can throw high-end hardware to reduce your latency, but that burns your cost budget. So the idea is very simple. If you go with quality, you will need a high-end frontier model, but that puts a lot of pressure on your cost and latency. But at the same time, if we are working with a small size model, often you will see, and we saw in the previous session as well, that there is a quality hill climbing exercise that needs to be performed.

But for us, this is even more challenging because we are not building single-interaction applications. We are building AI agents. These agents make any number of LLM calls. So if one of these stages is either slow or is inferior in quality, the results are amplified. That is why we had to be methodical in approaching and solving this problem.

Our approach here splits into three categories. First, we are very selective about which models we select. For example, intent understanding is one of the easiest, relatively simpler stages. We question why we really need a high-end model for that. Can we work with a smaller model? That is how our approach has evolved: whatever we do, evaluate first before building. Be cognizant of whether a given model is the right fit or not.

Then, if we don't get the quality, for example in the planner case and the tool selection case, where we understand that for a user question, these are the tools we need to invoke, if we don't get the quality just by tweaking the prompt or optimizing the prompt, we inject few-shot examples which carry high fidelity with the user question. We call it trajectory optimization. Once we have squeezed all the juice out of the prompt tuning and trajectory tuning, and we want to further optimize our cost or latency, at that point we go into fine-tuning. The idea is simple there. Don't treat every problem as a nail, otherwise we'll end up overutilizing the fine-tuning hammer. As the generative AI space is evolving, there are a lot of problems that still land in the fine-tuning space where we end up doing cost and latency optimization.
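A self-contained take on that trajectory-optimization idea: retrieve the stored examples that best match the user question and inject them as few-shot demonstrations. Real systems would use embedding similarity; plain word overlap keeps the sketch dependency-free.

```python
def overlap(a: str, b: str) -> int:
    """Crude similarity: count of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def pick_few_shots(question: str, bank: list[dict], k: int = 2) -> list[dict]:
    """Select the k stored trajectories closest to the user question."""
    return sorted(bank, key=lambda ex: overlap(question, ex["q"]), reverse=True)[:k]

def build_prompt(question: str, bank: list[dict]) -> str:
    """Inject the selected trajectories as few-shot demonstrations."""
    shots = pick_few_shots(question, bank)
    demos = "\n".join(f"Q: {ex['q']}\nTools: {ex['tools']}" for ex in shots)
    return f"{demos}\nQ: {question}\nTools:"

bank = [
    {"q": "cancel my stock order", "tools": ["cancel_order"]},
    {"q": "transfer crypto to wallet", "tools": ["crypto_transfer"]},
]
print(build_prompt("how do I cancel an order", bank))
```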

Thumbnail 3210

Thumbnail 3250

With that, I'll share the fine-tuning platform we have built at Robinhood Markets and how we are utilizing Amazon SageMaker and Amazon Bedrock. Every fine-tuning exercise starts with a goal, and the goal definition has to be clear. Often there may be two goals, like, as I said, latency and cost, but what we internally recommend is to have a primary goal. Then, once your goal is defined, do a base model selection. This is essential, and it can only happen once we have an eval dataset. And dataset creation is a challenging exercise.

Thumbnail 3260

Thumbnail 3290

We say that walk before you run, which basically means spend some amount of time in your dataset preparation. Your evals need to be a true reflection of your production use cases. So it has to be well stratified. It needs to have a good sample of complex use cases and simple use cases. Once we have selected a model, move to a dataset creation. And the dataset creation is an extended exercise of how you would have created an eval dataset. Same stratification exercise needs to be understood. For example, in the CX use case which I talked about, one of the dimensions which we utilize is intent, making sure that the questions which we have are well diversified across all intents.

The other potential dimension here is number of turns, whether the user is asking a single turn question or a multi-turn conversation with an AI assistant. Once we have the data, we move to the training, and this is where we have a fork.

Thumbnail 3330

Today, we saw that there are a lot of standard recipes which are available on SageMaker. We leverage those recipes, and that is where we utilize SageMaker JumpStart. But if there are use cases which require some customization, for example, your context length is really long and your dataset requires special handling, at that point we use SageMaker Studio. You have the Jupyter notebook there, you can attach to a P4DE instance or a P5 instance, and you can basically get a machine for yourself.

Thumbnail 3380

Thumbnail 3390

Either way, once you fine-tune your model, you can serve it through Bedrock, and this is where we unify our serving flow. We use custom model import to import those models to Bedrock, and once the models are imported to Bedrock, we integrate through our LLM gateway. This is how all of the applications at Robinhood Markets integrate through the LLM gateway. So once the model is accessible through the LLM gateway, it is available for our internal playground, internal evaluations, and for the production use cases.
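Custom model import can be scripted with boto3's `create_model_import_job`. The names and ARNs below are placeholders, and the live call is kept in its own function since it needs AWS credentials and a role with access to the S3 model artifacts.

```python
def import_job_params(job_name: str, model_name: str,
                      role_arn: str, s3_uri: str) -> dict:
    """Build the request for Bedrock custom model import (pure, testable)."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }

def start_import(params: dict, region: str = "us-east-1"):
    """Live call; not executed in this sketch."""
    import boto3
    return boto3.client("bedrock", region_name=region) \
                .create_model_import_job(**params)

params = import_job_params(
    "llama-sql-import-1",
    "llama-sql-with-system-prompt",
    "arn:aws:iam::123456789012:role/BedrockImportRole",  # placeholder
    "s3://my-bucket/models/llama-sql/",                  # placeholder
)
print(params["jobName"])
```

Once the import job completes, the model gets an ID you can route to through a gateway like the one Robinhood describes.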

Thumbnail 3410

Thumbnail 3420

So this is how we operationalize fine-tuning and offer it as a platform capability to our users. Let me talk about the impact. We have achieved more than 50% latency savings through a fine-tuned model. Let me put it in perspective. Before, for one of the planner stages, we were using a high-end model, and that was giving 3 to 6 seconds of latency. And then, particularly at P90 and P95 and above, we were able to cut down the latency heavily and bring it under 1 second. And with this validation, we have started to extend it to other sets of agents, and we are seeing similarly encouraging results.
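Checking the claim is simple arithmetic: even the best pre-fine-tuning case, 3 seconds down to just under 1 second, is roughly a two-thirds reduction, comfortably over 50%.

```python
def pct_reduction(before_s: float, after_s: float) -> float:
    """Percentage latency reduction going from before_s to after_s."""
    return 100.0 * (before_s - after_s) / before_s

print(round(pct_reduction(3.0, 1.0), 1))  # → 66.7
print(round(pct_reduction(6.0, 1.0), 1))  # → 83.3
```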

Closing Remarks and Q&A

With that, I will invite Sumit and Davide. Thank you. You can stay here with us. Thank you, everyone, once again for coming all the way to Mandalay Bay on a Thursday afternoon, and thank you for staying through the presentation. Please do go ahead and provide your feedback through the survey in the app. We would always love to go back and look at all the feedback that you shared with us so we can come back with better content and better presentations every year.

We're going to stay here for some more time just to take any questions you might have. Please feel free to come to the stage and we can chat away. Thank you so much. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
