
Kazuya


AWS re:Invent 2025 - Fine-tuning models for accuracy and latency at Robinhood Markets (IND392)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Fine-tuning models for accuracy and latency at Robinhood Markets (IND392)

In this video, Robinhood Markets shares how they use fine-tuning to improve accuracy and latency for generative AI use cases like Cortex Digest and their CX AI agent. They present a methodological tuning roadmap—prompt tuning, trajectory tuning, and LoRA fine-tuning—to balance the generative AI trilemma of cost, quality, and latency. Key insights include their three-layer evaluation system using LLM-as-judge with human feedback, task-specific metrics like categorical correctness and semantic intent for the CX planner, and strategic dataset curation focusing on quality over quantity through stratification. Their LoRA implementation on Amazon SageMaker achieved over 50% latency reduction (from 3-6 seconds to 1-2 seconds) while maintaining quality parity with frontier models, demonstrating production-scale success in a regulated financial services environment.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Robinhood's Journey to Production-Scale Generative AI

Thank you all for coming today. My name is Viraj Padte, and I'm a Senior Solutions Architect at AWS. Today, we're here to hear an interesting story from Robinhood Markets on how they've used fine-tuning of models to improve accuracy and latency for various use cases. Before we get there, I'm assuming many people in the room today who are looking at generative AI adoption, especially at scale in production use cases, are looking at either improving accuracy, reducing cost, or reducing latency when it comes to using LLMs.

Thumbnail 50

Various approaches we've seen for customers doing all three of these include improving the curation of data, right-sizing models through fine-tuning, and using optimized deployment options that AWS provides for reducing generative AI latency. In this process of productionizing use cases with generative AI, AWS has been partnering with Robinhood for a while. We have helped Robinhood with creating interesting offerings within their product, for example, the Cortex Digest, a revolutionary customer experience, and a CX agent which helps resolve customer queries. We have also been working with Robinhood on a lot of their future releases that are coming up.

Thumbnail 80

With this, I want to take this opportunity to quickly introduce two of our strongest champions within Robinhood: Nikhil and Davide, who are going to tell the story of how they're using fine-tuning for improving efficiencies across various generative AI use cases at Robinhood Markets. Thank you, Viraj. Thank you. My name is Nikhil Singhal. I'm a Senior Staff ML Engineer at Robinhood. I lead our agentic platform initiatives, everything from LLM evaluations to fine-tuning infrastructure to inference systems at production scale.

Today, Davide and I will walk you through how we have adapted models, and continue to adapt them, for latency, cost, and quality in mission-critical workloads. Hi everyone. I'm Davide Giovanardi. I'm a Senior Machine Learning Engineer at Robinhood, on the same team as Nikhil. I work on developing agentic apps as well as model optimization. This means fine-tuning, LoRA, DPO, all the way to building our evaluation framework. I'm very excited to talk to you about our fine-tuning efforts and also about building out our evals.

Thumbnail 170

Democratizing Finance Through AI: Robinhood's Mission and Agentic Use Cases

I'll start with democratizing finance for all. Robinhood began with a bold question from our co-founders: What if finance is for everybody and not just for the ultra-wealthy? That simple but powerful question sparked a movement, and this is where we started our journey. We broke barriers with commission-free trading, and we did not stop there. We entered into crypto, cash management, credit cards, and recently stock tokens. We gave users access to markets in ways once unthinkable.

Thumbnail 200

Throughout our journey, we stayed true to our mission, which is to democratize finance for all, and we enabled it by building tools and capabilities which are ergonomic, intuitive, and empowering for users to take control of their financial future. Coming to Robinhood's AI vision, we believe that to truly realize our mission of democratizing finance, we need to give users the same level of support and insight as the ultra-wealthy. For that, we need to harness the transformative power of AI and machine learning. So for us, AI isn't just a feature; it is one of the cornerstones for us to fulfill our mission.

Realizing that mission means that we need to build agents that work for millions of users simultaneously.

Thumbnail 310

Let's take this out of the abstract and talk about one of our agentic use cases at Robinhood. This is Cortex Digest. We've all been there: we open the app, we see a stock jump 10% or maybe tank 5%, and our first question is why? To answer that, usually you have to be a detective. You have to read summaries, scour analyst ratings and all the news possible, and that takes a lot of work. What we did with Cortex is do that detective work for you. But finance is a hard domain. We can't have hallucinated summaries. We can't have vague or unclear digests. This is where fine-tuning becomes our power. A fine-tuned model is usually better at, for instance, vocabulary: it knows that advice is not just a casual word, it actually means guidance in financial terms. It's also really good at being more objective if fine-tuned properly, for instance giving a balanced view that is compliant. And finally, it learns importance, so it may learn that an analyst report is more important than a random blog post. So, back to Cortex Digest here on the right: we see that it's processing all this data and telling us why a stock may be up or down.

But the next question is: OK, we know why, but how do we act on this knowledge? This brings us to the second frontier of our work, which is called custom indicators and scans. This solution translates natural language user queries into actionable, executable trading logic. We announced custom indicators and scans at AWS Summit last September, actually here in Vegas. Usually, if you want a chart to light up, maybe with a golden cross as we see in this animation, you have to be a programmer. You have to know how to write scripts, for instance in JavaScript. With Cortex, we are removing that barrier. First, custom indicators. With custom indicators, the user can just ask in plain English for an indicator, for instance a golden crossover, and then under the hood our agent will write the code. Our aim is also to fine-tune a model that knows and learns indicator syntax, writes the code for you, and then automatically displays that indicator on the chart. With custom scans, we scale this logic to the entire market: a scan searches stocks, ETFs, and any asset in real time depending on the filters that our coding logic produces. The technical win here is that we are democratizing algorithmic trading for all, letting users use plain English to build trading logic and custom scans.

Now, this brings us to our third and maybe one of our most complex use cases, which also turns out to be the primary focus of this talk: the CX AI agent. To serve millions of customers, we had to build a solution that is more advanced and goes through different stages of maturity; it's not just a general chatbot. The first stage of maturity is what we call the foundation. We leverage Amazon Bedrock to handle the heavy lifting here, the heavy lifting of inference. We needed state-of-the-art models that take messy user questions and translate them into tool calls, into planned actions, into high-quality responses. But having a state-of-the-art model is usually not enough for answering questions in a personalized way. So this brings us to our second pillar, which is knowledge and tool access. We gave our bot knowledge of our internal tools and knowledge of the account history, so that when customers ask "where is my tax form?" or "what is the status of my latest crypto transfer?", our bot has access to those tools and can actually help in a very personalized way. And this brings us to our third pillar, which is fine-tuning. This is how we scale ourselves.

The Generative AI Trilemma: Balancing Cost, Quality, and Latency

For fine-tuning, we use Amazon SageMaker to scale up our fine-tuning effort with methods like LoRA. We will deep dive into that later, and we were able to improve substantially on aspects like latency. But as we move from a prototype to deploying to millions of users, the next question is: how do we make this sustainable as we keep adding intelligence? We are continuously increasing our price tag, and this may not be sustainable over time if we don't take action.

Thumbnail 670

So this realization forced us to confront ourselves with the fundamental constraint of our industry. In the world of generative AI, these three players—cost, quality, and latency—are always fighting with each other. We call it the generative AI trilemma or a problem triangle. The idea is simple here. If we go after quality and use a large frontier model, we will definitely get quality, but it blows up our latency and cost budget. But if we go with a tiny model to improve latency or cost, then the quality often dips and it goes below our safety threshold, and that gets blocked by our guardrails.

The problem here is even harder when we talk about agents, because an agent is not a single conversation turn. It is a multi-tier pipeline where we are making many model calls. So if one of the calls in your entire agent flow is either slow or producing inferior quality, it may jeopardize your end-to-end user-facing responses. Either the user doesn't get to see the response, or the guardrail blocks it, or the latency is high, and that causes a dip in customer satisfaction.

So how do we go about it? That is where we want to be very methodological. As Davide just covered, our CX block diagram has three stages. One is intent understanding. The second is the planner, or tool selection, where we decide which tools to call to retrieve data and answer a particular user question. And the third is final answer generation. We cannot just put a large model in all three stages. We need to be selective, and that is where we need to be methodological. This is what we will be covering in detail today. We apply prompt tuning, we apply trajectory tuning, and we are selective about when to apply fine-tuning to improve quality and get the best of both worlds of quality and latency, and also cost.

Thumbnail 810

With that, I would like to share the agenda for today's talk. Now that we have contextualized it well, the first is the foundations. Davide will cover the evaluation and data processing workflows, and then I will be covering the tuning roadmap, which I just discussed—the prompt tuning, trajectory tuning, and fine-tuning—and how we have built infrastructure capabilities over AWS SageMaker and Bedrock. Then Davide will be doing a deep dive, and at the end, I will be covering the lessons learned and some of the examples where we have solved these problems.

Thumbnail 860

Building a Foundation: Evaluation Frameworks and Data Quality

Before we talk about learning rates, ranks, or loss curves, we have to talk about evaluation. Sometimes there is a temptation in GenAI to just vibe-check the model: ask it a few questions, see if it looks good, and then ship it. But when you are fine-tuning, especially for a financial application, vibes are not enough. We adopted a philosophy at Robinhood of walk before you run. If we cannot reliably measure performance, we don't have a baseline. And without a baseline, we have no idea if our fine-tuning is actually improving the model or just making it different.

So we realized that we needed to move beyond a simple vibe check and actually build a framework that measures performance both on the end-to-end scale and also at the task-specific level.

Thumbnail 910

Let's talk first about the end-to-end evaluation system. Training the model might actually be the easy part sometimes. The hard part is answering the sometimes uncomfortable question of whether we actually improved the model or just made it different. To answer this with confidence over time, we implemented a three-layer evaluation system, which is especially valid for end-to-end evaluation.

Thumbnail 940

Thumbnail 950

At the top, we see what we call the unified control plane. This is important because evaluation is not just an engineering task. It's most importantly also a product requirement. We want to make sure that we are aligned between engineers, product managers, and data scientists, and especially that we are aligned on the success criteria.

This evaluation framework is also powered by Braintrust, which brings us to our second point: hybrid evaluation. It's not sustainable to evaluate millions of chat logs by yourself or with human grading, so we heavily leverage LLM-as-judge for end-to-end automated evaluation. But we don't stop at LLM-as-judge. We also backstop it with human feedback and hand-curated evaluation datasets, which we will cover later.

Finally, at the bottom of the slide we see three types of models: our fine-tuned model, but also closed-source and open-source models. This is our competitive benchmarking criteria. Whenever we want to ship a fine-tuned model, we want to compare it to the baseline and especially to large models that are either open source or third party. This approach gives us what we call system-level visibility, allowing us to sometimes catch regressions before we actually go to production and the model is served to customers.
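To make the hybrid-evaluation idea concrete, here is a minimal sketch of an LLM-as-judge call through the Amazon Bedrock Converse API. The model ID, rubric, and scoring format are illustrative assumptions, not Robinhood's actual configuration.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_RUBRIC = (
    "You are grading a customer-support answer. Score 1-5 for factual accuracy "
    'and 1-5 for helpfulness, then reply with JSON: {"accuracy": n, "helpfulness": n}.'
)

def judge(question: str, agent_answer: str, gold_answer: str) -> dict:
    """Ask a judge model to grade one agent response against the gold answer."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any strong judge model
        system=[{"text": JUDGE_RUBRIC}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Question: {question}\nGold answer: {gold_answer}\n"
                                 f"Agent answer: {agent_answer}"}],
        }],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

In practice, a judge like this would be backstopped with human review on a sample of its grades, as described above.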

Thumbnail 1040

In addition to system level, we are also talking about task-specific evaluation. This is where I want to zoom in and use our CX agent as an example, particularly the CX planner. To give you some context, the planner in the CX architecture sits between the user question and the rest of the agentic system. Its job is to create specific tool calls or invoke downstream agents, and importantly, it's not a creative job or necessarily a free-form generative job.

For this particular task, it was very useful to identify metrics that are not necessarily LLM as a judge. If we look at the first type of metric on the left-hand side, we call this categorical correctness. We can think of this as a routing check. In other words, did the planner select the right downstream tool or downstream agent? For example, if a customer is asking about the price of Apple and the planner is actually invoking the crypto wallet, that would be a hard fail. We see that this is a classification task, and it's very natural to use precision, recall, and F1, which are traditional classification metrics for this task.

On the right-hand side, we have semantic intent. In semantic intent, we are dealing with the input argument to the downstream agents. This means that the planner, in addition to routing to the downstream agent, also gives it a query or an argument. Comparing those planner-generated queries with a reference set, we want to make sure that the similarity is high enough. In this case, we use semantic similarity to make sure that over time we maintain high similarity to the reference set, and that we never pass a low-similarity query to the downstream agents.
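As a rough illustration of these two task-specific checks, the sketch below scores routing decisions with standard classification metrics and compares planner-generated queries against reference queries with embedding similarity. The encoder name, example data, and threshold are assumptions for illustration only.

```python
from sklearn.metrics import precision_recall_fscore_support
from sentence_transformers import SentenceTransformer, util

# --- Categorical correctness: did the planner route to the right downstream agent? ---
gold_routes = ["equities_quote", "crypto_wallet", "tax_docs", "equities_quote"]
pred_routes = ["equities_quote", "crypto_wallet", "equities_quote", "equities_quote"]
precision, recall, f1, _ = precision_recall_fscore_support(
    gold_routes, pred_routes, average="weighted", zero_division=0
)
print(f"routing precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# --- Semantic intent: is the planner-generated query close to the reference query? ---
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here
reference = "current price of AAPL stock"
generated = "latest quote for Apple shares"
similarity = util.cos_sim(encoder.encode(reference), encoder.encode(generated)).item()
SIM_THRESHOLD = 0.75  # illustrative acceptance bar, not Robinhood's actual value
print(f"semantic intent similarity={similarity:.2f}, pass={similarity >= SIM_THRESHOLD}")
```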

The key point here is that we saw the end-to-end evaluation system and the task-specific evaluation. So the question is: do we use both at the same time, or when does one come into play? By isolating the planner for the task-specific metric, we actually unlock the ability to do very rapid hyperparameter tuning and model comparison when we are fine-tuning. We are able to zero in on a model candidate very fast. And then once we have a model candidate, we reserve the end-to-end, somewhat more expensive and time-consuming, evaluation for the final acceptance.

Thumbnail 1210

So we've talked about metrics, but metrics are not super useful if we don't have very high quality evaluation data. In this case, we'll stick with the CX chatbot. If we look at the left-hand side, we have our sampling and annotation strategy. With sampling and annotation, we start from real escalated cases, which end up giving us our gold answers. Let's see how this happens in our case. We don't just pull random chat logs; we use our internal platform called Optimus to specifically sample escalated cases. In this context, escalation means moments where the CX chatbot failed at giving an answer and the case was escalated to a human. Each such case gets sampled and goes to the QA team, which writes gold answers that contribute to this gold dataset.

However, we can sense that this process, although it's optimized for negative examples, which are usually the most important ones, is a little slow and doesn't scale well. So on the right-hand side, we have explored some synthetic data generation techniques. At the top we see self-play, or coverage expansion. In this case, we take our gold dataset and generate variations of the same question to expand the coverage and representation of the dataset. At the bottom we have active sampling strategies, and these are very useful because the chatbot, especially in CX, handles many different intents. So we want to make sure that we are sampling in underrepresented areas, areas where the bot has higher uncertainty, or areas we deem higher impact based on the feedback data we get.
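A minimal sketch of the active-sampling idea, assuming each chat log carries an intent label: it pulls more candidates from intents that are underrepresented in the current gold dataset, a simplified stand-in for the uncertainty and impact signals mentioned above. Field names and the per-intent target are illustrative.

```python
import random
from collections import Counter

def active_sample(candidate_logs, gold_dataset, per_intent_target=50, seed=0):
    """Sample more candidates from intents that the gold dataset underrepresents."""
    random.seed(seed)
    gold_counts = Counter(example["intent"] for example in gold_dataset)
    by_intent = {}
    for log in candidate_logs:                       # each log: {"intent": ..., "messages": ...}
        by_intent.setdefault(log["intent"], []).append(log)

    sampled = []
    for intent, logs in by_intent.items():
        deficit = max(per_intent_target - gold_counts.get(intent, 0), 0)
        k = min(deficit, len(logs))                  # only fill the gap for this intent
        sampled.extend(random.sample(logs, k))
    return sampled
```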

Thumbnail 1350

By combining the high quality signals from Optimus and the emerging synthetic generation strategies, we are building a roadmap where our evaluation dataset evolves along with our models as well. So this brings us to the so what of evaluation. We don't just build evals to just get a score; we actually build evals for velocity. This means that when our evals are reliable or when they correlate with human judgment, we can move fast. One key takeaway here is that we actually stop guessing and start defining our problem with much higher precision. Instead of saying the model feels off or the vibe is off, we can say we have a five percent improvement in quality or the latency increased by two seconds.

Thumbnail 1410

A Hierarchical Tuning Roadmap: From Prompt Tuning to Fine-Tuning

Once we have this data-driven problem statement, we actually have to decide how to fix it. That decision is exactly what we want to show you next. At this point, we have good conviction that our evals are working. We know what is bleeding and we know what is working. Say, for example, in our customer support use case, questions like "how does instant transfer work" or "what are the fees for instant transfer" are typically the easier questions, and our CX agent performs well in such cases. But questions like "why has my incoming ACAT transfer failed" require parsing a lot of error logs, joining multiple datasets, and then coming up with a well-reasoned answer that gives the user insight into why their incoming transfer might have failed. This mirrors the workflow a human agent would follow when attempting to answer such a question.

So the question here is not whether we can fix it or not. The question rather is how we can fix it in an efficient manner, because you will always find a smarter, more powerful model which can answer more complex queries. But is it the right model for you to productionize? This is where we are methodological in our approach. We believe that if we look at every problem as a nail, then we will over-utilize the fine-tuning hammer, asking all the time whether fine-tuning can solve it or not. That itself is a big undertaking, and that is why we want a hierarchical approach to deciding which tuning method is most appropriate for a particular problem.

We start with the base, which we call prompt tuning. With prompt tuning, the idea is that we hill-climb on prompts. When we want to move from a larger model to a smaller model and it gives lower quality, can we tweak and make changes to the prompt that elicit the best results?

If prompt tuning doesn't work, then we move to trajectory tuning. In trajectory tuning, the idea is that you give the model some few-shot examples, dynamic few-shot examples which carry higher fidelity to the user question. It works and has uplifted quality by a big margin, but it has its own issue. When you inject more dynamic few-shot examples into your context, your context grows, and with that your input tokens go up, which has an impact on your latency and also the cost.

When we see an uptick with trajectory tuning, it becomes a good signal for us: if we give the model more data and more examples, the model is able to learn. That is when we jump into fine-tuning. This is what we will be talking about, because higher quality means a higher self-solve rate, which means higher satisfaction for the user. The bot is able to successfully answer the user, and with lower latency, we are able to answer the user's question quickly. So the first is prompt tuning.

Thumbnail 1620

Implementing Prompt Tuning and Trajectory Tuning for Multi-Stage Agents

Before I walk you through this block diagram, as a user, if I were to tune a prompt, I would basically pick some examples. Some are easier, some are tough, and run my prompts against them to see whether I am getting the desired results or not. If I do not get desired results, then I mutate my prompt. I see what could be the gaps, and then I hill climb on my prompts. This is a natural cycle we typically follow while tuning a prompt.

But when we look at an LLM agent which is multi-stage, you are not just dealing with one prompt; you are dealing with n prompts. Optimizing all those n prompts is a big hassle and a lot of manual work. This is where we lean heavily on the prompt tuning tooling which we have built as a platform capability at Robinhood. There are existing prompt optimization offerings from many companies, but our use case was a little distinct because we were looking at the implication of a prompt change at a particular stage on the entire agent response.

Because our application is not a single-turn, single-interaction application but a multi-stage pipeline, this is split into four sections. The first is the base prompt with the foundation model at the bottom left, then the evaluation, then the optimization loop, and then the final output, which is an optimized prompt. We start with the base prompt and a foundation model. We see with the eval how it is doing. The eval dataset has to be well stratified; this is what Davide covered at length: the importance of stratification and of the diversity of your eval dataset.

We will also talk about its importance in the context of fine-tuning. Now you have the eval dataset. With this prompt, if your eval score is good and hopefully your eval is well diversified, then the problem is solved. If it is not, then you throw it into this optimization loop, where you utilize a frontier model to critique your prompt and you generate more candidates. You basically mutate your prompt and generate more candidates. It is your choice. There are some configurations we offer to our users internally regarding whether you want to include few-shot examples or not, because sometimes that causes overfitting and sometimes it is fine, depending on your use case.

Based on the number of epochs, what we have seen is that if we keep a fan out of 10 or say 16, five epochs are generally good enough.

For this evaluation process, generally 10 to 50 rows are good enough. You pick the top 5 or top 4 candidates at every epoch, and finally you have an optimized prompt which gives you higher quality than the prompt you started with. The benefit with this approach is that you understand the impact of the change on the entire agent because sometimes it's a multi-stage pipeline, and that interconnection across stages is equally important.
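The loop described here can be written as a small hill-climbing routine, sketched below. The `mutate_fn` and `score_fn` callables stand in for the frontier-model critique step and the end-to-end agent evaluation, and the defaults mirror the fan-out, epoch, and top-candidate numbers mentioned above; both callables are assumptions, not Robinhood's actual tooling.

```python
def optimize_prompt(base_prompt, eval_rows, mutate_fn, score_fn,
                    fan_out=10, epochs=5, keep_top=4):
    """Hill-climb on a single stage's prompt.

    mutate_fn(prompt, n) -> list[str]: ask a frontier model to critique the prompt
        and return n candidate rewrites.
    score_fn(prompt, eval_rows) -> float: run the (multi-stage) agent with this
        prompt swapped in and return the eval score on 10-50 stratified rows.
    """
    survivors = [(score_fn(base_prompt, eval_rows), base_prompt)]
    for _ in range(epochs):
        candidates = []
        for _, prompt in survivors:
            for mutated in mutate_fn(prompt, fan_out):
                candidates.append((score_fn(mutated, eval_rows), mutated))
        # Keep the best-scoring prompts (parents included) for the next round.
        survivors = sorted(candidates + survivors, key=lambda c: c[0], reverse=True)[:keep_top]
    best_score, best_prompt = survivors[0]
    return best_prompt, best_score
```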

However, there is a limit on what quality uplift you can get through this approach, and that is where dynamic few-shot examples come in. We call it trajectory tuning. Why trajectory tuning? Because of the agentic nature of it. If you have a planner and at the end a communicator, also known as the final answer generation stage, and you inject some examples into the planner which teach the model how to answer questions like "what is my balance across multiple portfolios," you are teaching the model domain-specific information which carries high fidelity to the user question. You would not inject that example if somebody asks a question around credit cards. So we call it dynamic few-shot example injection, and that is why we call it trajectory tuning because by changing the planner itself, you change the entire trajectory of your agent workflow.

There are four pillars to building trajectory tuning. The first is an annotated dataset. This is where the real magic lies. You build a strategy based on your LLM-as-judge or your evaluation, which works as a filtering logic. You build a dataset for humans to review, and those humans annotate the dataset, marking whether it is good or bad. If it is bad, they generate a golden answer. That golden answer is the main key here, because now you know what the bot should have provided but failed to provide, and what the delta is against the success criteria you have defined.

You have your agent and then the eval loop. The eval loop is straightforward and could be just a similarity check or an LLM-as-judge that determines whether the golden answer and the agent answer are really similar, or whether specific figures like the account balance or stock price match. Then you have a vector database where you store your updated, high-quality few-shot examples.

Here is how we walk through the process. We start with the annotated dataset labeled by humans. We give the input to the agent, and the agent generates an answer. That answer gets evaluated against your golden answer. If it does not match the golden answer, we go into the analyzer loop. The analyzer loop tweaks the planner and the execution phase of the agent, and we keep doing this until we find a match. Once we find a match with the golden answer, we have found a modification to the input prompt, with one few-shot example, which has resulted in the golden answer. That few-shot example which we inject into the prompt is our golden few-shot example, which we then put in the vector database.

At real inference time for a user question, we do a similarity check or embedding similarity and then retrieve those 5 or 10 few-shot examples from that vector database. That uplifts the quality. This overall is called trajectory tuning, and it takes us beyond what prompt tuning could yield because of the dynamic nature of it. However, it has its own challenges. Your context length grows, your input tokens grow, and your latency grows as well.
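A minimal sketch of the retrieval side of trajectory tuning, assuming an arbitrary text-embedding callable: golden few-shot examples are stored with their embeddings, and the closest ones are injected into the planner prompt at inference time. The in-memory store and prompt format are simplified stand-ins for the vector database described above.

```python
import numpy as np

class TrajectoryStore:
    """In-memory stand-in for the vector database of golden few-shot examples."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn          # any text-embedding function (assumption)
        self.examples, self.vectors = [], []

    def add(self, user_question, golden_trajectory):
        self.examples.append({"question": user_question, "trajectory": golden_trajectory})
        self.vectors.append(np.asarray(self.embed_fn(user_question), dtype=float))

    def retrieve(self, user_question, k=5):
        q = np.asarray(self.embed_fn(user_question), dtype=float)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.examples[i] for i in top]

def build_planner_prompt(base_prompt, store, user_question, k=5):
    # Inject the most similar golden trajectories as dynamic few-shot examples.
    shots = store.retrieve(user_question, k)
    rendered = "\n\n".join(
        f"Question: {s['question']}\nPlan: {s['trajectory']}" for s in shots
    )
    return f"{base_prompt}\n\nExamples:\n{rendered}\n\nQuestion: {user_question}\nPlan:"
```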

Thumbnail 2120

Fine-Tuning Strategy: Quality Over Quantity in Dataset Creation

Additionally, there are limitations on what you can do based on the number of examples you can inject into the prompt. With that, we finally jump into fine-tuning, where we are not just optimizing the context but optimizing the weights of the model for our domain-specific use cases. The one thing I want to underscore in the fine-tuning context is that it doesn't really require too much machine learning or AI expertise. The real magic here is in how you create a dataset. The magic isn't just in the recipe, though there is magic in the recipe; there are a lot of standard recipes out there. For example, we utilize SageMaker JumpStart, which has ready-to-use recipes for some open source models, and you can just work with those.

Provided with a golden dataset, you can get good results without requiring much machine learning or AI domain knowledge. We sometimes talk about hyperparameters, but the real magic again is in the dataset. With hyperparameters, you can try a few and typically reach good quality in about three or four iterations. These are serverless iterations, so you are not paying too much in training cost; you are only paying for the training time itself. The first part is the training dataset. The way we create a training dataset, we go after quality, not quantity. It's not that we throw everything at the model and leave it alone; that is a failed approach. Sometimes it could work if you are lucky, but it's not the right approach.

We basically employ a strategy where we first find strata, which work as dimensions to cluster your input dataset. For example, in the customer support use case, intent is one of the strata we have used. Another stratum is the number of turns: whether it's a single-turn conversation, a multi-turn conversation, or whether the user is just asking for help repeatedly. These become our stratification dimensions. Once we split our dataset across these dimensions, you can use k-means clustering and sample, say, five from each cluster to create your dataset. Maybe five thousand or ten thousand examples; for our use case, fifteen thousand was a good spot.

Once you have created a dataset, apply the same stratification and sampling strategy to create an evaluation or validation dataset, which is ten to twenty percent of your training dataset. You give it to a foundation model and measure where you stand; suppose you see forty percent quality and your goal is eighty percent quality, you pick a foundation model accordingly. We use LoRA heavily. This is what we have productionized, and this is what we believe is easy to adopt and easy to roll out in production, with good support through AWS Bedrock and Custom Model Import. There are other techniques like DPO and PPO with RFT in the reasoning model space, which are exploratory at the moment, and as we move further into the reasoning model era, I see us adopting them more in the future.
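A rough sketch of this curation strategy: stratify by a dimension such as intent, k-means-cluster each stratum's embeddings, and sample a handful of examples per cluster. Field names, cluster counts, and per-cluster sample sizes are illustrative, not Robinhood's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_sample(examples, vectors, strata_key="intent",
                      clusters_per_stratum=8, samples_per_cluster=5, seed=0):
    """Curate a training set by stratifying, clustering, and sampling per cluster.

    examples: list of dicts carrying a stratification field (e.g. "intent").
    vectors:  precomputed text embeddings, one per example.
    """
    rng = np.random.default_rng(seed)
    by_stratum = {}
    for example, vector in zip(examples, vectors):
        by_stratum.setdefault(example[strata_key], []).append((example, vector))

    curated = []
    for _, items in by_stratum.items():
        X = np.stack([v for _, v in items])
        k = min(clusters_per_stratum, len(items))
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        for cluster in range(k):
            idx = np.where(labels == cluster)[0]
            chosen = rng.choice(idx, size=min(samples_per_cluster, len(idx)), replace=False)
            curated.extend(items[i][0] for i in chosen)
    return curated
```

The same routine can then be reused to draw the ten to twenty percent evaluation split from held-out data.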

Thumbnail 2390

Deep Dive into LoRA: Low-Rank Adaptation for Efficient Model Training

Once everything works, we get a fine-tuned model that should work well and should have that quality bump on our evaluation dataset. With that, I would like to thank you. So we saw our tuning roadmap: prompt tuning, trajectory tuning, and fine-tuning. Now we would like to do a deep dive on LoRA. This is the method that we use the most. So what we'll do here is define LoRA as a refresher, and then we'll go deeper into why we use LoRA at Robinhood, how it works under the hood, and then we'll also see how we implement our fine-tuning platform in more detail.

So what is LoRA? Before that, the standard way of fine-tuning was just full fine-tuning. If you see the left-hand side of the diagram where we have regular fine-tuning, we have the pre-trained weights matrix W. Then if we wanted to fine-tune for a specific task, let's say we are using a 70 billion parameter model, we have to learn the delta on all 70 billion parameters. That means tracking gradients and optimizer states for all 70 billion, which is very prohibitive in terms of cost and GPU usage.

So the real question is actually, do we always need to do full fine-tuning? And the answer is sometimes, and actually oftentimes, no. This is where we get to Low-Rank Adaptation, or LoRA, on the right-hand side of the diagram. We still have the pre-trained weights, the W, but we keep these frozen, so we don't change them during training. Then we introduce the green boxes here on the right-hand side: two more learnable matrices, A and B.

One of the key things about LoRA is the inner rank, so the rank that you see between these two matrices in the visualization. This is basically the inner dimension of these matrices. If you force that rank to be small, very common values are 8 or 16, then we are essentially reducing the trainable parameters by a factor up to 10,000 depending on the model that you're fine-tuning. Instead of learning all these parameters, we end up with these two small matrices.

At inference time, we actually have two options. One is to keep them as is and swap them out, so we would have the base model plus the learnable matrices depending on the task. Or we can actually merge them in and then deploy the model in full. This is what we'll look at in the next few slides, and we'll also give examples of how we implement this at Robinhood.
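A small numerical sketch of the idea, using NumPy stand-ins for a single projection matrix: the pre-trained weight stays frozen, only the low-rank factors are learnable, and at deployment the update can be merged back so the served model matches the base architecture. Sizes and the scaling factor are illustrative.

```python
import numpy as np

d_out, d_in, rank = 4096, 4096, 16          # typical projection size; rank is often 8 or 16
W = np.random.randn(d_out, d_in) * 0.02     # pre-trained weight, kept frozen
A = np.random.randn(rank, d_in) * 0.01      # learnable low-rank factor A
B = np.zeros((d_out, rank))                 # learnable factor B, initialized to zero
alpha = 32                                  # LoRA scaling hyperparameter

def lora_forward(x):
    # Base path plus low-rank update; only A and B would receive gradients in training.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

# Parameter comparison for this single matrix: full fine-tuning vs. LoRA.
full_params = W.size                        # ~16.8M for a 4096x4096 projection
lora_params = A.size + B.size               # ~131K: roughly a 128x reduction here
print(full_params, lora_params)

# Option 2 from above: merge the adapters so the deployed weights are a single
# matrix with the exact shape of the original (zero extra inference latency).
W_merged = W + (alpha / rank) * (B @ A)
```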

Thumbnail 2550

Thumbnail 2560

Let's dig deeper into the why of LoRA. You've seen the math and the diagram, but what are the main factors that let industry really adopt this method? The first one is cost. This is the most immediate and straightforward. We've just talked about it. We are freezing up to 99% of the model. By doing that, we are reducing the training parameters massively. We don't need to store optimizer states, and this means that we can fine-tune most of the time on a single GPU, and this was the case for the Robinhood Cortex Chatbot.

Thumbnail 2590

Thumbnail 2620

Second is latency. This is a little bit more nuanced. When we introduce more parameters or additional matrices, we might think we are adding some extra latency at inference time. But it turns out that this is not the case. Because the math is linear, and at inference time, especially if we just merge these weights with the base model and deploy the model as is, there is actually zero latency overhead. And then finally, accuracy. This is the natural question: if I'm fine-tuning just a small share, maybe 1% of the weights of the model, will my performance take a hit? The answer is usually no. There is a lot of empirical research showing LoRA achieving performance very comparable to a full fine-tuning run, and this is also what we found on our side for the Cortex Chatbot.

Thumbnail 2660

In conclusion, we're getting the performance of a full fine-tuned model at a fraction of the cost and at no extra latency at inference time. Let's dive deeper and see how it works under the hood. The first concept here is integration. So the question is where do we put the LoRA matrices in the context of the transformer architecture. As we see in the two diagrams here, we have two different types of blocks: one is the multi-head self-attention and one is the feed-forward. If you see the blue box here, these are the frozen weights that we were talking about in the previous slides. This is the big W. Then you see the LoRA green boxes. These are attached to each of them, and these are the ones that are learned during training. Now, the second part is the training strategy.

If we can attach this to every layer and every matrix, does it mean that we need to train all of them? This is a trade-off. So on the limit, if you attach it to every layer, you should probably get the best performance, but this also comes with cost in terms of training time and compute. On our side, what we found out was that the sweet spot, especially for the CX chatbot planner, was to only target the multi-head self-attention weights.

So this means that once we train these weights, we end up with the multi-head self-attention matrices. On our side, especially using Amazon SageMaker and Amazon Bedrock with CMI, it was a seamless deployment because the base model plus the adapters are merged, and then we have a final model that is identical in architecture to the base model. We can just deploy a smarter version that is optimized for the planner task.
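For illustration, this is roughly how targeting only the self-attention projections looks with the Hugging Face PEFT library; the base model ID and hyperparameters are hypothetical, and the merged artifact is what could then go through Bedrock Custom Model Import as described.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model; the talk does not name Robinhood's actual base model.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=16,                       # low rank
    lora_alpha=32,
    lora_dropout=0.05,
    # Target only the multi-head self-attention projections, as described above.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% of total parameters

# ... training loop or SageMaker training job runs here ...

# Merge the adapters back into the base weights so the exported model is
# architecturally identical to the base model (zero inference overhead).
merged = model.merge_and_unload()
merged.save_pretrained("planner-lora-merged")
```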

Thumbnail 2780

So finally, let's recap why it matters. On the left-hand side, we have some practical benefits. We talked about scalability and training way fewer parameters, and this also means very short training time, which means we can train for many use cases instead of just focusing on one. This lets us scale our fine-tuning to multiple use cases at Robinhood, and this also leads to fast iteration.

Fast iteration means that we are training on the same use case many times and then comparing and seeing which model is performing better. And finally, portability. If you train the entire model, it's many gigabytes, but LoRA matrices are usually a few megabytes, so it's very portable. In terms of use cases, this unlocks use cases that were prohibitive in terms of cost before LoRA.

For instance, we have domain specialization. We can have a model specialized on SQL language and another in Python, for instance. And then we have persona or tone tuning. For the CX chatbot, it would be very useful to have a softer tone versus a more objective tone for, let's say, financial writing. And then finally, AB testing is very effective here because we can train many different versions, maybe doing some hyperparameter tuning, and then test this directly in an AB testing setup and see which one is received better by the end user.

Thumbnail 2880

So let's get to how we integrate LoRA into our fine-tuning platform. The first block here we see is the goal and success criteria. This ties back to the end-to-end eval slide we were showing before, where we have the unified control plane. The first step is always defining the goal and success criteria in partnership with the product team and data science team, to make sure that we are aligned on the goal.

Thumbnail 2900

Once we do that, then there is the base model selection. We want to choose a base model that is aligned with the goal, so if the goal is latency, we will choose the model differently than if the goal is quality, or maybe a mix of both. Once we have the base model, evals come in very quickly, because evals are very important to establish a baseline. We want to know how good the base model is off the shelf at this task.

Thumbnail 2930

Once we have baseline evals, we work on creating the training dataset. This may employ some synthetic data generation depending on the task, or it may mean just accumulating enough data so that LoRA can run. And LoRA usually can run on relatively small dataset quantities, which is another advantage. So when it comes to training, we actually developed two paths depending on the use case.

Thumbnail 2960

At the top, we see standard LoRA recipes. This is what we call the fast path, and here we leverage SageMaker JumpStart, which is a great tool if you want quick experimentation and testing your hypothesis. It allows you to apply standard LoRA recipes and choose the most common and popular hyperparameters like rank, for instance, and target weights as we saw before in the other slide.

For cases where the data is a little messier or when you want to try something more custom, we have what we call the power lane at the bottom, which is for custom LoRA recipes. This is where we leverage SageMaker Studio and SageMaker training jobs, which is more like what we call our lab, where engineers spend their time testing different iterations and different flavors of LoRA recipes. No matter which lane we choose, the fast lane or the power lane, we unify the deployment with Amazon Bedrock, specifically with Custom Model Import, or CMI. This allows us to seamlessly deploy our model, which also connects to the Robinhood LLM gateway. This is very important because engineers who don't work on fine-tuning don't care where the model is coming from; they just want to hit our API or endpoint and use the model. This provides an abstraction layer for them and makes it very easy to use.
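A hedged sketch of the fast path using the SageMaker Python SDK's JumpStart estimator. The model ID, instance type, hyperparameter names, and S3 paths below are illustrative and vary by the specific JumpStart recipe you pick.

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# "Fast path": a standard LoRA recipe through SageMaker JumpStart.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-8b-instruct",  # hypothetical choice
    instance_type="ml.g5.12xlarge",
    environment={"accept_eula": "true"},
    hyperparameters={
        "epoch": "3",
        "learning_rate": "1e-4",
        "lora_r": "16",          # hyperparameter names depend on the recipe
        "lora_alpha": "32",
    },
)
# Training data is a prompt/completion dataset staged in S3 (path is hypothetical).
estimator.fit({"training": "s3://my-bucket/cx-planner/train/"})
```

The power lane would instead package a custom training script as a SageMaker training job; either way, the resulting merged model is imported into Bedrock via CMI.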

Production Results and Lessons Learned: Achieving 50% Latency Reduction

Finally, we have the eval-based iteration loop. We don't stop eval at the baseline. We make sure that once we have the model, we rerun eval to ensure that it's better than the baseline. If so, we ship it to production. If not, we do iterations until we see improvements. With that, we've discussed a lot about technical details and ideas. Let's talk about some numbers and what we have received in production.

With our LoRA fine-tuned model for one of the stages in our CX agent, we have achieved more than 50% latency savings. To put that in perspective, our previous model was giving us 3 to 6 seconds of latency, and with the LoRA fine-tuned model, we cut it down to within 1 to 2 seconds. The major gain was on the long tail, because we were seeing P90 and P95 latency upwards of 55 seconds, which was causing dissatisfaction with users who were wondering why it was taking so long.

As we have follow-up stages, the P90 impact gets amplified, and we run out of time to serve a particular request. The other important aspect here was that we maintain quality parity, which was very important and which we mentioned in the beginning. We can't really compromise on quality. We were able to match the categorical correctness of the trajectory-tuned frontier model, and that was very essential for us to really productionize this.

With this success, we plan to extend it to other sets of agents under the Cortex portfolio, and we have seen early positive results from adopting fine-tuning there. Jumping to the last slide on lessons learned, I have four cards here; let me go through them one by one. The first is evaluation. Evaluation was very critical, and there is a flywheel here. I talked a lot about prompt tuning. Prompt tuning isn't just for improving your agent prompts; it is also useful for building LLM judge prompts. I'll give you two examples here.

One example is with the CX bot. When we were doing a lot of human evaluation, it was difficult for us to scale the evaluation flow. Our initial approach to the LLM judge was to throw all the account signals into a prompt and ask the model to do the evaluation, but that overwhelms the model because there is just too much account information. So we built a two-tier approach leveraging prompt tuning, where we first collect the signals needed to answer a particular user question and then use them in a second step to do the evaluation. This has helped us scale our evaluation and also calibrate the human reviews.

We have seen that human reviews for some sets of intents were lenient while others were more strict, and this approach has helped us there as well. In one of the other use cases, the financial crime use case, what we have seen with our eval-driven development is that we were able to get the same quality out of the box with a smaller model, matching the frontier model. We could make it happen because of our eval-driven development; otherwise, there is just a tendency to go after and adopt the frontier model.

Thumbnail 3310

With that, a nice segue to data preparation. Data preparation is equally important. The question is not about quantity; it's about quality. Understanding which data needs to go into evaluation versus training is important. For example, if a model is performing well on some set of questions or some categories of questions, you don't necessarily need to include them in your training dataset; you can reduce their footprint if you do include them. But you should definitely include them in your evaluation dataset so there are no regressions.

Thumbnail 3360

Thumbnail 3380

With that approach, you apply the same methodology to create the evaluation dataset and the fine-tuning dataset. With that, we covered the tuning methodology, which ensures that we are using engineering resources efficiently and not swinging the fine-tuning hammer all over the place. We have prompt tuning, trajectory tuning for dynamic few-shot examples, and then fine-tuning for additional quality gain. The last piece here is inference. We work a lot with AWS Bedrock and CMI, where we have customized our inference capabilities. We pick hardware, whether H100, A100, or something else, based on our needs, whether we want to optimize latency or cost.

There are other techniques including prompt caching that we definitely leverage. For example, if there are prompts in your agents, move your static contents towards the beginning so the model isn't building the attention KV cache on every user question. It reduces your cost and also reduces the latency, so it has multiple benefits. Leverage prompt compression as well. If we study the input data or the input prompt going into the model, there are many opportunities for compressing the data. Can you change the way you represent your data? Is tabular a better way to represent your data? Can you remove some UUID from the data? Can you remove the null values or columns from the data?
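As an illustration of these two tips, the sketch below strips UUIDs and null fields, renders account records as a compact table instead of verbose JSON, and keeps the static instructions at the front of the prompt so prefix caching can reuse the KV cache across requests. The field handling and prompt layout are assumptions for illustration.

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def compress_records(records):
    """Drop null and UUID-valued fields, then render rows as a compact table."""
    cleaned = [
        {k: v for k, v in row.items()
         if v is not None and not (isinstance(v, str) and UUID_RE.match(v))}
        for row in records
    ]
    columns = sorted({k for row in cleaned for k in row})
    header = " | ".join(columns)
    lines = [" | ".join(str(row.get(c, "")) for c in columns) for row in cleaned]
    return "\n".join([header] + lines)

def build_prompt(static_instructions, dynamic_records, user_question):
    # Static content first, so the cached prompt prefix is identical across requests;
    # the compressed, per-request data goes at the end.
    return (
        f"{static_instructions}\n\n"
        f"Account activity:\n{compress_records(dynamic_records)}\n\n"
        f"Customer question: {user_question}"
    )
```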

Thumbnail 3470

This overall helps in ensuring that the input token counts are low, which brings two benefits at once: latency and cost. I want to say thank you for attending this talk. Thanks a lot, Yuan. We appreciate you all spending the time here. Just one last note: if Robinhood, which operates in financial services, a regulated industry, can be this sophisticated in using the various AWS services to ship more generative AI powered workloads to production, you all can do it too. I hope you got some really good lessons and some ideas on how you can use this more reliably in your production workloads. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
