Kazuya

AWS re:Invent 2025 - Mastering model choice: The 3-step Amazon Bedrock advantage (AIM391)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Mastering model choice: The 3-step Amazon Bedrock advantage (AIM391)

In this video, AWS presents a three-step framework for selecting AI models in Amazon Bedrock: identify, evaluate, and optimize. The identification phase filters models by modality and benchmarks, highlighting differentiated capabilities like reasoning, agentic features, and domain-specific models. Evaluation involves creating golden datasets with ~100 use cases and using LLM-as-a-judge alongside human review to measure quality, latency, and cost. Optimization includes multi-model routing strategies and fine-tuning. CoinMarketCap's Bryan Koh demonstrates real-world application, processing 10+ billion tokens daily across sentiment extraction, planning, and summarization tasks. Their GLAS system enables rapid model evaluation in hours, combining AI insights with human judgment. The framework helped a financial crimes investigation agent achieve 80% cost reduction while scaling from 500 million to 5 billion daily requests. Amazon Bedrock now offers 80+ models from providers including Anthropic, Google, Nvidia, and Minimax, with new inference tiers providing up to 50% cost savings.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: The Challenge of Model Selection in AI Applications

Welcome everyone. Thank you for joining this session on mastering model choice in Amazon Bedrock. We're glad you're here. How many of you are overwhelmed by the number of AI models available when building AI applications? Great. And how many of your organizations have a systematic framework for picking out models for AI applications? All right, fewer hands there, great.

Well, we're here to help. We're going to talk through a simple framework that you and your organization can use to identify models, evaluate those models, and optimize them for production based on the work we're doing with our customers at AWS. I'm Scott Munson, Principal Worldwide AI Specialist with the Amazon Bedrock team. Along with me today are John Liu, Principal Product Manager with Amazon Bedrock, and Bryan Koh, Senior AI Product Manager with CoinMarketCap.

Thumbnail 70

So quick agenda. We're going to start off with just an overview to set the context. What are the challenges that our customers are seeing, and how are we helping them? Then we'll talk through the framework itself, how we identify and evaluate and optimize models. And then we'll have a chance to hear from Bryan on how he's using this framework with his team to deploy AI applications that are serving over 65 million monthly active users.

Thumbnail 100

The Overwhelming Landscape of AI Models and Its Consequences

So to get a sense of the landscape, 2.19 million public models are available on Hugging Face today. I chatted with the Chief Product Officer of Hugging Face yesterday. He said there's a new model every 10 seconds, an incredible pace, right? That adds up to about 4,000 models per day. That's just publicly available open-weight models. Add to that all the proprietary model makers like Anthropic, Amazon, OpenAI, and all the rest. We have so much choice.

This is in every modality: text, image, video, audio, new modalities. There's so much innovation. It's an exciting time to be building these applications with AI models, but there's a little bit of a bottleneck. I mean, we're hearing from customers that picking the model itself is actually a challenging task.

Thumbnail 150

So why is it challenging? What's the consequence of this? What we're hearing is that the POCs matter. When you're building out an AI application for the first time with your organization, you want to make sure that it's a successful trial run. They're not all guaranteed to be successful, but it matters to our customers that they have a good win early in the process. It matters for the time and labor spent, but also for momentum and organizational reputation. It can be a tough loss if they pick the wrong model for the job and build around that.

We're also hearing this specific problem of no selection framework to work through all the models with, and so we're going to talk through that today, of course. And the pace of innovation is challenging. If it takes four weeks to test out your models and pick the one you like and a better model came out yesterday, it's sort of a challenging environment to work in, so pace is important to our customers. And then, of course, any customer-facing application, any large organization application needs to function at scale. So sometimes our POCs aren't optimized for the big production workloads, and we want to make sure that our customers are successful in that arena as well.

Thumbnail 220

Amazon Bedrock: Simplifying AI Development with Comprehensive Model Access

So Amazon Bedrock is AWS's solution to provide developers what they need so they can focus on developing their applications, and we provide the rest. You know, Bedrock has a great selection of top models that are coming out continuously, and we're committed to providing great model choices. We optimize our inference to make sure that our customers can use it in a serverless manner, so they can scale up and scale down. They can use different modes of inference. They can balance the requirements of their application without having to do all the deployment of AI models themselves.

A lot of our customers are seeing success by leveraging their data, so that could be in the form of a RAG knowledge-based solution, or it could be in model distillation or fine-tuning according to their specific proprietary data. And, of course, safety and responsibility are critical. So Amazon Bedrock provides guardrails along with other features that help our customers ship their products with confidence in the reliability and responsibility of their AI applications. And we're doing a lot with agentic work. This week we launched a number of agent capabilities that are very exciting to help you seamlessly deploy and operate agents at scale.

Thumbnail 300

To go a bit further on model options themselves, we really do work with top model providers and continue to expand that number over time. We think that's important. We're hearing from our customers they want that choice of models.

So we're committed to giving them that. Model evaluation tools are a critical counterpart with that to make sure that customers know what the right model is. We also allow people to import their own models. So if you fine tune a model or have another model that you're interested in using, we can import that through custom model import and provide that inference along with the other models available, which is a big convenience factor, again letting us handle that inference task for customers.

Thumbnail 340

And again, as this slide shows, Bedrock now works with 13 model providers. We added 4 additional model providers this week, launching models from Google, from Minimax, from Nvidia, and from Moonshot. Very exciting. We're over 80 models now available on Amazon Bedrock. We really do hear from our customers, and are convinced, that us providing this inference for them, this Bedrock service, gives them the simplicity, scale, and security they need so they can just focus on their domain and their application development.

Thumbnail 370

A Three-Step Framework for Model Selection: Identify, Evaluate, Optimize

So great to have all those models. You still have to pick some to use for your application, and so this is the model selection framework. We've pulled this from working with a lot of customers and trying to simplify it down so you can work with your stakeholders and understand easily what stage you're in. The process itself is relatively simple if you think about it in these terms. First up, you identify candidate models, so you look at all these options and filter them down. We'll go into exactly how to do that in a moment.

Then you evaluate those models on your data, on your use case. This is really critical. Generalized benchmarks are not the greatest indicator of what model is going to be most performant to meet your application's requirements. And then we have another step to optimize those models that might be through breaking up the AI workload into multiple models to meet the performance requirements you need, or potentially through fine tuning as well. So we'll talk through all that today.

Thumbnail 430

And to help keep us grounded, we're going to use this use case to help understand the process. So we've got a financial crimes investigation agent scenario. This is one that our customers have built that we've talked about and know what this looks like. In this case, a financial institution, something gets flagged for potential financial crime. They still have a human review process required, so these analysts are manually reviewing a lot of documents, transaction data. It's time intensive work and they want to accelerate this workflow for them.

The requirements we're looking at for this application would be to process structured and unstructured data. We want to generate really accurate, concise summaries. This is a large-scale operation, so we're looking at about 5 billion tokens per day, and we want to keep that in mind as we're picking out the models we want to use. And then, of course, security is essential. The goal here would be a 20% efficiency gain for these analysts' workload so they can work more quickly.

Thumbnail 500

Identifying Candidate Models: Filtering by Modality and Benchmarks

So, the identification stage, I'll go into more detail on this. Put simply, we're going to filter out all the models that are relevant by modality and just consider the ones that are relevant for our use case. Next up we'll look at all those benchmarks. They are valuable. There is utility there, especially the more detailed benchmarks, and then considering benchmarks alongside other metrics that you care about. And finally we'll look at some differentiated capabilities. There's so much innovation. There's a lot of categories of models, but there's a few that I want to highlight that really we're seeing matter for application success.

Thumbnail 530

So first up, modality. This one's relatively simple. What's the input type that the model can ingest and what's the output type that it provides? There's a huge amount of text data, of course, in business. A lot of these models were initially started there, so we've got some great options within text. But we're seeing a lot of great use cases developed with an image input, video, audio, as well as the outputs of all of those modalities. So step one, just look at the models that are relevant for your use case.

Thumbnail 560

And here's a little view on what's available in Bedrock, by modality. This can be helpful, right? A lot of great models available within the text category. Multimodal input, so those are often text plus image, also some video understanding available. Multimodal output, and then embeddings, a lot of growth in those areas. We're launching new models all the time and meeting more of our customers' needs that way.
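To make step one concrete, here's a minimal sketch in Python of filtering a model catalog by modality. The catalog entries and model names are hypothetical placeholders, not real Bedrock listings:

```python
# Sketch: filtering a model catalog by modality (step 1 of the identify stage).
# The catalog entries and model names are illustrative, not a real API response.
CATALOG = [
    {"model": "model-a", "inputs": {"text"}, "outputs": {"text"}},
    {"model": "model-b", "inputs": {"text", "image"}, "outputs": {"text"}},
    {"model": "model-c", "inputs": {"text"}, "outputs": {"image"}},
]

def filter_by_modality(catalog, required_inputs, required_outputs):
    """Keep only models whose supported modalities cover the use case."""
    return [
        m["model"]
        for m in catalog
        if required_inputs <= m["inputs"] and required_outputs <= m["outputs"]
    ]

# Text-to-text use case (e.g., document analysis and summarization):
candidates = filter_by_modality(CATALOG, {"text"}, {"text"})
print(candidates)  # both model-a and model-b handle text in / text out
```

In practice you would pull the real model list from the Bedrock console or API rather than a hand-written catalog; the point is simply that modality filtering is a cheap first cut before any benchmarking.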

Thumbnail 590

So we've got it filtered by modality. Now let's think about a way to compare quickly. Artificial Analysis is a very popular third party resource. They provide an index and metrics for a huge range of models as well as inference providers.

It's nice to be able to go there and consider top intelligence and price, or latency, or whatever metrics you care most about, side by side, to quickly evaluate which model you might want. They also let you plot both X and Y axes to find the sweet-spot quadrant with optimal price-to-intelligence ratios. In this example, we see a few of our Bedrock models, Claude Haiku, DeepSeek, as well as the GPT OSS models, standing out with really high intelligence at considerably lower prices. So it's important to take note of that.

Thumbnail 650

I think the other thing I want to make sure that we call out is that within benchmarks, often we talk about just a single kind of combined intelligence score, which is really valuable for rapid comparison, but there are really detailed benchmarks. If you know you have something in your use case that's relevant, for example, long context reasoning, that's something that's really relevant in this use case we're talking about, you can look at specific benchmarks for that. In this case, we've got Claude standing out alongside DeepSeek as some really good contenders for long context reasoning capabilities.

Thumbnail 680

This is also an agentic application, so I've highlighted here as well that we've got agentic tool use listed. This is an Artificial Analysis resource, but Galileo publishes another agent leaderboard, and Berkeley maintains an agentic leaderboard as well. You've got a lot of variety: how does a model do as an orchestration or managing-level model, how does it do as an executor or worker model, and how is its tool-use selection? There are a lot of meaningful benchmarks available when you're trying to pick out models and decide which role they should play within an agentic application.

Thumbnail 730

Differentiated Model Capabilities: Reasoning, Agentic, Customization, and Domain Specificity

So this again filters things down: what are a couple of top models we should think about? Modality is done, and we're thinking about metrics: where is the price-performance or another combined ratio strong, and which benchmarks are really strong? I also want to talk about a few other categories of models, ones I decided to highlight because I see a lot of our customers drive a lot of success with them. There are others, but I really think these are the ones to focus on.

First up is reasoning. You know, DeepSeek landed in January. There were reasoning models before then, but it drew a lot of attention, and many major model providers adopted reasoning models, even as part of hybrid architectures. So there's a lot available. Reasoning models can handle more complex, step-by-step roles, often again playing that orchestration role, and can be valuable in agentic use cases. They can explain their rationale, which is very useful when you need to understand why the model made a choice; traceability is very nice to have that way. And of course they're strong in science, math, and coding.

Thumbnail 790

Within this category, we have a variety available. The Claude models are good examples. The GPT OSS models, we've got models from DeepSeek, from Qwen, and a number of others that are all reasoning models. Again, agentic, you know, autonomous task completion, very exciting. I think the best practices are being developed, new offerings are being developed every day, and these are important to consider, you know, what's the tool selection capabilities, orchestration of multiple tools, all the rest there.

Thumbnail 810

Customization is worth kind of pausing and considering as well. This tends to be dominated by open weight models. There are some abilities to customize proprietary models, but this is the category where we see the most popularity for open weight models where the models can be used to fine-tune. And I'll say from our experience, we're seeing customers fine-tune when they have a very specific performance requirement they're having a hard time hitting in any other way. Nothing off the shelf is getting them to the goal they need, and often it's latency sensitive.

So if you have a user experience that requires a really quick model response time, often we'll see a customer take a smaller model and distill from a larger one, or they'll have a very specific use case. We're also seeing this grow in agentic applications: if you've got a worker model that does a very specific task every time, you can fine-tune a model and get really strong results while hitting the latency and cost targets you require. So it's worth considering these open-weight, customizable models as candidates in your identification stage, in case you need to fine-tune now or want the option to fine-tune later. We're also seeing this within specific domains, where terminology may be relevant.

Thumbnail 880

And last, domain specificity. This is something that's important, and we're seeing a lot of models developed in this space. Finance and healthcare are two domains that really stand out. This drives more success when you have specific domains where the language is very relevant; you can sometimes get higher performance, higher intelligence, from the model itself. Bedrock Marketplace has a lot of offerings in this area.

A couple of examples: Upstage has a really good model for Korean language translation, and from Writer there's the Palmyra financial model as well as a healthcare model. So those are a few additional options to consider if your domain has an offering in that space.

So back to our use case: let's look at the financial crimes investigation agent. First step there, filter by modality. These are just text problems we're solving right now, so text to text: document analysis, summarization. Then we'd want to look at those benchmarks. We considered summarization specifically, long context reasoning, and some of these agentic capabilities, down to the differentiated capabilities. And in this case, because of the financial crimes terminology, we probably want to include a customizable model as well. So that gives us a result, right? We're going from all the models in the world down to a short list that we think could be really strong for this. That would include Claude Sonnet, OpenAI GPT-OSS, and DeepSeek V3.1. So that's the identify stage. I'm now going to hand off to my colleague John Liu, who's going to talk through evaluation and optimization. Thank you.

Thumbnail 930

Thumbnail 990

Creating Golden Datasets: The Foundation of Model Evaluation

All right, so now we get to go into the evaluation stage, and when it comes to evaluation, as Scott mentioned earlier, it's important to have a framework in place to do this, right? Some of our customers have put in place frameworks that let them evaluate a model and decide whether they want to include that in their production workload within 24 to 48 hours, and today we're going to present a framework that'll help you hopefully set something similar in place. When it comes to evaluation, the first step is always create your golden dataset, which is that source of truth that you want to benchmark all your models that you're trying against to see how well it performs. And then when you benchmark these models, you want to make sure you're continuously benchmarking the models because even if you don't change the models themselves, the incoming context might change which could lead to unexpected results.

Thumbnail 1040

Thumbnail 1060

So let's go into our sample golden dataset and see what that actually is about. The golden dataset is really just made up of two pieces, right? One is going to be the prompt. In this case, it describes a particular use case we want to measure in our financial crimes agent. And the second part is the source of truth. This is why we call it the golden dataset, the ground truth. This is what you want to measure the models that you want to try against and see how close they come to this source of truth.
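As a sketch, one record of such a golden dataset might look like the following. The field names and the transaction details are illustrative, not a required schema:

```python
import json

# Sketch of a golden-dataset record: a prompt plus a reference ("ground truth")
# answer. Field names and content are illustrative, not a required schema.
record = {
    "prompt": "Summarize the flagged wire transfers for account 1234 and state "
              "whether the pattern suggests structuring.",
    "referenceResponse": "Three transfers of $9,500 within 48 hours, each just "
                         "under the $10,000 reporting threshold; the pattern is "
                         "consistent with structuring and warrants escalation.",
}

# Golden datasets are commonly stored one record per line (JSONL) so they can
# be streamed into evaluation jobs line by line.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["referenceResponse"])
```

Each candidate model's answer is then compared against `referenceResponse` to see how close it comes to the ground truth.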

Thumbnail 1070

When you create a golden dataset, first you start with the basics. You want a comprehensive dataset of about 100 use cases that you choose carefully. They should be very specific to your particular terminology and the way you're designing your workflow, because only by creating this tailored dataset do you know whether the model you selected is going to be useful for you. The generic benchmarks that Scott presented are a very good starting point, but you want to see how much of that actually translates to the use case you have in mind.

Thumbnail 1110

You also don't want to just live in what we call the easy mode. You want to also select cases that are adversarial. You want to trip up the models, because you're trying to define where the limits of your model are going to be. Typically these adversarial, tricky use cases make up about 5% of the 100 use cases I mentioned earlier. A good way to start: set up 10 use cases, then run an internal feedback loop with your subject matter experts and build that up to 100. But as you're rolling out to production and introducing more models, you're now thinking about 200 to 300 use cases, and potentially there's a better way to scale these golden datasets than continuously leaning on human resources. Subject matter experts are expensive. So let's take a look at whether we can do that.

Thumbnail 1160

Thumbnail 1180

What humans are not so great at or not very efficient at doing, you know, is all these detailed tasks, right, these detailed low-level tasks or trying to create multiple iterations of datasets or reading detailed SOPs, standard operating procedures, and translating that and making sure that's captured in your golden dataset. We can do it, but it's not efficient. And what humans are quite good at doing, a good use of our intuition, right, is the high-level judgment that we have. And we're also quite good at looking at total frameworks and seeing hey, what works well and what doesn't work well. So you see here there are three things: high-level judgment, setting rules and standards through rubrics, and then looking for ways to improve solutions.

So if there's a way that we can leverage the strength of humans and then delegate what we call the low-level tasks to agents, then we've got a pretty good solution to help us scale our golden dataset from that 100 to the 200, 300 type of use cases.

Thumbnail 1220

Thumbnail 1230

Scaling Golden Datasets with Multi-Agent Systems

And now let's look at what that actually looks like. We start with the first agent, right? This is the user simulator, and you can think of that as actually the KYC expert. He's doing work to find out whether there's a potential financial crime associated or fraudulent activity associated with this particular transaction. You have a very descriptive mission for the agent and a persona and then some example questions, because the more descriptive you are with your agents, the more accurate they're going to be.

Thumbnail 1260

You pair this user simulator with your task agent, and this is the agent that actually goes through and does all the work. It looks through all the documents, looks through the online and offline credit cards, for example, synthesizes that information, and comes back with a recommendation to the user saying, hey, is this a fraudulent or not fraudulent activity? Here you can see again mission and persona described, and they have an action. They have tool calls because they're actually going through and pulling in this information.

Thumbnail 1290

Finally, you pair this with the critique agent, and this is the actual agent that looks through how close the task agent is or how well the task agent performs against a set of rubrics or rules that the subject matter expert, the human, set in place earlier. Here you see the mission, and your job is to go and evaluate the task agent, make sure it does a good job, and your persona, you're an expert teacher, and you've got some tools as well. Your actions are you have some global evaluation methods, and you have some relative evaluation methods that you call dynamically when you're actually evaluating the task agent itself.

Thumbnail 1330

Thumbnail 1340

Thumbnail 1350

Thumbnail 1360

Let's put that together. We have our user simulator agent. It passes a request over to the task agent. Paired with this, of course, is our critique agent, which reads rules from the rubric that was defined by the humans. The critique agent continuously iterates on the results from the task agent until it says yes, the response meets the rubrics that were set in place. At that point, the critique agent passes the correct answer back to the task agent, and the task agent sends it to the user simulator. The critique agent also passes that record to the golden dataset. So now you're automatically scaling your golden dataset.
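The loop just described can be sketched with deterministic stand-ins for the three agents. In a real system each function would call an LLM; here the rubric, question, and answers are toy placeholders:

```python
# Sketch of the user-simulator -> task-agent -> critique-agent loop.
# All three "agents" are deterministic stubs standing in for LLM calls.

RUBRIC = {"must_include": ["amount", "recommendation"]}  # human-authored rules

def user_simulator():
    """Stands in for the KYC-expert persona generating a question."""
    return "Is transaction T-42 fraudulent?"

def task_agent(question, feedback=None):
    """Stands in for the worker agent; improves its answer when critiqued."""
    answer = "Transaction T-42 moved a large amount."
    if feedback and "recommendation" in feedback:
        answer += " Recommendation: escalate for human review."
    return answer

def critique_agent(answer, rubric):
    """Checks the answer against the rubric; returns (ok, missing items)."""
    missing = [k for k in rubric["must_include"] if k not in answer.lower()]
    return (len(missing) == 0), missing

def generate_golden_example(max_iters=5):
    """Iterate until the critique agent accepts, then emit a golden record."""
    question = user_simulator()
    feedback = None
    for _ in range(max_iters):
        answer = task_agent(question, feedback)
        ok, missing = critique_agent(answer, RUBRIC)
        if ok:
            return {"prompt": question, "referenceResponse": answer}
        feedback = missing  # the critique guides the next attempt
    raise RuntimeError("rubric not satisfied within iteration budget")

golden = generate_golden_example()
print(golden["referenceResponse"])
```

The human's role here is exactly what the talk describes: not reviewing every response, but editing `RUBRIC` so the whole loop produces stronger records.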

Thumbnail 1380

Thumbnail 1400

Now where's the human, right? The human is actually evaluating this entire system at this time, right? They're not in the detailed weeds of looking at every single one of these responses. They look at how the overall framework behaves, and they can make suggestions that say how can I improve how the critique agent is actually guiding the task agent. Maybe we need to have more tables that are coming through so it's easier for our customers to read. So they update the rubric, and now this entire golden dataset gets stronger and the rubric gets stronger as well.

Thumbnail 1410

Now, to build this, customers can benefit from AWS's agentic stack. Customers can start with the open-source framework and very quickly build and deploy agents locally. As they're ready to scale, they can lean on Amazon Bedrock AgentCore and benefit from enterprise-grade security, dedicated runtime environments, and memory management. And there were lots of announcements this re:Invent around AgentCore.

Thumbnail 1440

Evaluation Metrics and Methods: From Benchmarks to LLM-as-a-Judge

So we've created our golden dataset. Now we get into the metrics we want to evaluate against. The operational metrics are the foundational piece: cost, latency, and scalability, just like almost any software. Some things to keep in mind: when you're measuring these operational metrics, make sure you're sending inference requests that cover a variety of patterns. You want to look at your P95 and P99 latencies across a range of context window lengths, and you also want to send a variety of different workloads so you can see where the model, or the model-serving solution you're using, runs into errors.
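As a minimal illustration of the latency side, here's how you might compute P50/P95/P99 from recorded timings. The latency samples are synthetic; in practice you would time real inference requests across varied context lengths and workloads:

```python
# Sketch: measuring tail latency from recorded inference timings.
# The samples below are synthetic placeholders.
def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[rank - 1]

latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 450, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)  # 180 900 900
```

Note how a single slow request dominates P95 and P99 here; that is exactly why the talk stresses measuring the tail rather than just the average.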

Thumbnail 1480

Once you have your operational metrics covered, you can move into the comprehensive evaluation metrics around semantics and knowledge, right? So you've probably seen the quality and accuracy, the style and usability, responsible AI. This is pretty much, again, foundational aspects that customers are now used to.

Thumbnail 1510

However, as customers have been rolling these into production over the last year, they've also introduced their own custom metrics. What these custom metrics measure, just like in any software, are KPIs you want to bring in. For example, it could be click-through rates for a particular task you want to get done, or it could be override rates, any time a particular user decides to override what the agent does; maybe that's another KPI you want to bring in. So keep these in mind as you're evaluating your model. Don't just lean on the tried-and-true quality and accuracy metrics. These are important, but bring in the actual custom metrics from your workload as well.

Thumbnail 1550

Some other metrics to keep in mind: temporal consistency, which I mentioned a little bit earlier. You have to consistently look at and evaluate your model. You want to catch model drift early on. As you're introducing more modalities, you want to make sure you're measuring the modality not just in isolation but also as they interact with each other. And finally, agentic capabilities. How well is your entire system or your entire model behaving when it's calling tools or when it's actually orchestrating across different agents?

Thumbnail 1580

We've gone through our metrics. Let's look at what methods are used to evaluate these models. They really break down into three. We start with the benchmarks, the programmatic type of evaluations, and these are really good when there's a true right or wrong answer. You can think of benchmarks like MMLU, which measures how well a model performs across a broad range of knowledge and reasoning tasks, or GSM8K, which measures how good a model is at grade-school mathematical reasoning. A very good starting point. But if you want to pick up the nuances of summarization, or how people talk and think, then you start leaning on humans. That's where human evaluation comes in. Your subject matter experts have to come in and look for those nuanced, semantic types of evaluations.

Thumbnail 1650

Now to help us scale, we can lean on models as well because they've got much deeper reasoning capabilities that have been developed over the last year. Customers are now using powerful models, whether it's a single powerful model as a judge to evaluate the output of other models, or they put in place maybe ten to twelve lightweight models as a jury to evaluate the output of the model that's being evaluated. Amazon Bedrock model evaluation supports all three of these. We've got our programmatic evaluation, the human evaluation, and also LLM-as-a-judge. Importantly, we also support custom metrics. And if you want to evaluate models that are not directly within Bedrock, you can do that too. You can bring the responses from those models and run them through Amazon Bedrock model evaluation.
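The jury approach reduces to a vote across judges. This sketch hard-codes the verdicts that real lightweight judge models would produce:

```python
from collections import Counter

# Sketch: aggregating an LLM "jury". Each judge would normally be a separate
# lightweight model scoring the candidate response; here the verdicts are
# hard-coded placeholders.
def jury_verdict(verdicts):
    """Majority vote across judges; ties count as 'fail' to stay conservative."""
    counts = Counter(verdicts)
    if counts["pass"] > counts["fail"]:
        return "pass"
    return "fail"

# Ten lightweight judges evaluating one model response:
verdicts = ["pass", "pass", "fail", "pass", "pass",
            "pass", "fail", "pass", "pass", "fail"]
print(jury_verdict(verdicts))  # 7 pass vs 3 fail -> "pass"
```

A single powerful judge replaces the vote with one scored rubric evaluation; the jury trades a stronger model for robustness to any one judge's bias.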

Thumbnail 1680

Let's dive a little into LLM-as-a-judge, because that's what we're going to use for the financial crimes agent that's been our reference case. Under the hood, we've optimized the prompts for a variety of our evaluator models, which include the Anthropic models, some of the Meta models, and the Nova models as well. We tell the judge how to behave and what type of response to give. What gets passed in is the JSONL (JSON Lines) file of the golden dataset: here are all my prompts, here's my reference answer, and here's the response from the model that you're going to evaluate against.

Thumbnail 1720

Thumbnail 1730

Thumbnail 1740

Thumbnail 1750

To set this up with Amazon Bedrock model evaluation, you start by selecting the models. There are two things you want to select: the model you want to be the evaluator and the model you want to evaluate. In this case, we're picking Claude 3.7 Sonnet as the evaluator, and then we're going to evaluate gpt-oss-120b. You select the metrics. Amazon Bedrock model evaluation has twelve built-in metrics, and of course you can also import your custom metrics as mentioned earlier. Now you run the model evaluation, you get a result, and you can compare different models against each other.

Thumbnail 1770

Thumbnail 1780

In this case, I've run a simple evaluation against gpt-oss as mentioned earlier and also DeepSeek V3.1, and you can see a radar chart of how well they perform. You can also dive deep into each of the prompts you're evaluating. It gives you a score from zero to one, with one being the best. And if I want to understand why I got a particular score, I can do that too. I click in and it shows an explanation. In this case, it got a one because the response being evaluated matched very well with the ground truth that was provided.

Thumbnail 1800

Optimization Strategies: Multi-Model Routing and Fine-Tuning for Production

Now we can move into the optimize stage. We've gone through identification, we've evaluated, and we have a set of models to choose from.

Thumbnail 1810

When we think about optimizing, we want to start by thinking that it's not just about optimizing a single model, but rather optimizing the entire system. You can choose single models to replace, and you can also introduce multiple models to optimize the entire end-to-end workflow. You can further customize individual models through fine-tuning and distillation, and then you can optimize how you're sending your inference requests for cost and latency.

Thumbnail 1860

As an example, we announced the inference tiers in today's Dave Brown keynote. With these tiers, you can pay a small premium to access the priority tier and make sure your inference requests go to the front of the line, or, for less time-sensitive workloads, receive up to a 50% discount with the flex tier. This is how you start trading off cost versus latency. When it comes to optimizing across multiple models, you have to consider your routing strategies, and there are really just four approaches I'd like to share with you today.
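The cost side of this trade-off is simple arithmetic. The sketch below blends a hypothetical priority-tier premium with a flex-tier discount; the multipliers and the base price are made-up assumptions, not published pricing.

```python
# Back-of-envelope blended cost for splitting traffic between a priority
# tier (premium) and a flex tier (discounted). Multipliers are hypothetical.
def blended_cost(base_cost_per_m_tokens: float, tokens_m: float,
                 flex_share: float, priority_premium: float = 1.25,
                 flex_discount: float = 0.50) -> float:
    priority_share = 1.0 - flex_share
    return tokens_m * base_cost_per_m_tokens * (
        priority_share * priority_premium + flex_share * (1.0 - flex_discount)
    )

# 1,000M tokens/day at a hypothetical $3 per 1M tokens, 70% routed to flex:
print(round(blended_cost(3.0, 1000, flex_share=0.7), 2))  # 2175.0
```

Routing the bulk of non-urgent traffic to the discounted tier is what turns a per-token discount into a meaningful bill reduction at scale.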

Thumbnail 1890

You can start with rule-based routing: simple queries go to a light model, complex queries go to heavier models, and highly sensitive private queries might go to a model you host yourself. But that has a challenge too, because someone always has to write and maintain those rules. So we can benefit from machine learning-based routing using classic machine learning: train classifiers on the dataset you have and route incoming inference requests to the proper model.

Thumbnail 1910

Thumbnail 1920

Of course, the challenge there is that you need a training dataset. So you can go further and use the LLMs themselves: an LLM can look at the incoming request and decide, based on its profile, whether to send it to a light, medium, or heavy model, and you can fine-tune that router as you collect more data. The best approach, of course, is to combine all three, which gives you a very robust way to route inference requests across different models.
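The combined approach can be sketched as a small router where explicit rules handle the clear-cut cases and everything else falls through to a classifier. Here the classifier is a trivial length heuristic standing in for a trained model or an LLM-based router, and the sensitive-term list is invented for illustration.

```python
# Rules first, classifier fallback: a minimal sketch of combined routing.
SENSITIVE_TERMS = {"ssn", "account number", "passport"}

def classify_complexity(query: str) -> str:
    # Stand-in for an ML classifier or an LLM-based router.
    return "heavy" if len(query.split()) > 30 else "medium"

def route(query: str) -> str:
    q = query.lower()
    if any(term in q for term in SENSITIVE_TERMS):
        return "self-hosted"          # rule: keep sensitive data in-house
    if q.endswith("?") and len(q.split()) <= 8:
        return "light"                # rule: short questions -> light model
    return classify_complexity(query) # fall through to learned routing

print(route("What is Bitcoin?"))                       # light
print(route("Flag this ssn 123-45-6789 transaction"))  # self-hosted
```

In production, the misroutes logged by the fallback classifier become the labeled data you use to improve it, which is the feedback loop the combined strategy relies on.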

Thumbnail 1930

When you look at the individual models themselves, you can customize them through a variety of methods in Amazon Bedrock. You can do straight fine-tuning, or distillation, where you take a teacher model and transfer its knowledge into a lighter model. You can also do advanced customization on your own, perhaps through SageMaker, and bring those models into Bedrock through Amazon Bedrock's custom model import. And we announced Nova Forge about two days ago, which further helps customers customize their models: you can benefit not just from your own dataset but from Nova's pre-training data as well, bringing the two together to train the models.

Thumbnail 1980

So looking at our financial crimes investigation agent after these two steps, what are we looking at? We've got Claude 3.5 Haiku for fast classification tasks, Claude Sonnet for the complex analysis, and a fine-tuned version of the Haiku model. Then we optimized our inference through the different inference tiers I mentioned, which leads to an 80% cost reduction while scaling the workload from 500 million to 5 billion inference requests per day. We haven't named the customer behind this, but it's a real use case we've seen in production, benefiting from model choice, the optimizations, and the framework we've talked about so far.

Thumbnail 2020

Thumbnail 2050

Now let's review and put it all together. In our three-step framework, we start with the identify stage: you have a library of models to look at, you review the general benchmarks out there and the modalities, and you select candidate models. You then evaluate those models against your specific use case, using the golden dataset you've created. Finally, you optimize your setup, whether you're running single-model workloads or multi-model workloads.

Thumbnail 2060

Thumbnail 2070

Thumbnail 2080

And importantly, you want to bring all this information back as feedback so you can improve your framework, including monitoring for latency, cost, and accuracy, and feeding real-time user feedback into that golden dataset. Another benefit of a strong golden dataset is that it doesn't just power your model evaluation; you can also use it to fine-tune models, so you get almost two for the price of one. That's how this three-step framework works. Put something like this in place and you can evaluate a new model for your workload within 24 to 48 hours of its release.

So with that said, I'd like to bring on Bryan who can show you how CoinMarketCap has done a similar type of framework and enabled Gen AI workloads for millions of users daily.

CoinMarketCap's AI Journey: From Risk Minimization to Production Scale

Thanks, John. Hi everyone. My name is Bryan and I manage AI products at CoinMarketCap. Over the next 20 minutes or so, I'll walk you through how CoinMarketCap selects models. The analogy I like to use is that although you could use an F1 car to deliver groceries, you wouldn't want to. You could use the strongest, most advanced models for the simplest tasks, but you wouldn't want to, for several reasons, cost being one of them.

Thumbnail 2140

Thumbnail 2160

Before I continue, how many of you know about CoinMarketCap? Okay, not many people here know about CoinMarketCap, so this slide will be useful. CoinMarketCap is a cryptocurrency data platform and we've been around since 2013. For reference, one Bitcoin at that time cost roughly $100. When I checked this morning, a Bitcoin was roughly $93,000. But more than a decade later, we're the home of crypto with over 65 million monthly active users and more than 1 billion page views.

Thumbnail 2200

If you're more familiar with traditional finance, we're often referred to as the Bloomberg of crypto. We have more than 1 million API users with institutions like Google, Yahoo Finance, Coinbase, and central banks around the world using our crypto data. Our Gen AI journey started in Q3 2023, and since then, we have more than 10 user-facing products in production. Our AI products are used by millions around the world, and since 2023, we have consumed trillions of LLM tokens. Today, we consume more than 10 billion LLM tokens every single day.

Thumbnail 2220

This scale is exactly why we are rigorous about our model selection and why Amazon Bedrock matters in our stack. So, what do we use these more than 10 billion tokens for every single day? Four main things. First, users come to our site to find alphas, to find signals. With more than 27 million cryptocurrencies tracked by CoinMarketCap and thousands more created every single day, there is a lot of noise. LLMs compress this noise into signals and users use our AI products to find these signals.

Second, explain. Users come to our site to try to understand why is Bitcoin price up, why is Bitcoin price down, what is Bitcoin, what is proof of work, and as we all know, AI does a great job explaining all of these. Third, forecasts. Users want price predictions so that they can make money. They want potential scenarios and although nobody has a crystal ball, AI can help lay out those options and give reasoning behind each of them.

Thumbnail 2300

Fourth, act. Insight is only useful if you can act on it. We turn AI outputs into watchlists, alerts and automations, so users can act on these insights. Let me show you one of our use cases, what we call CMCAI. This is essentially a chatbot. So imagine you wake up one day, like today, Bitcoin is up and you wonder why Bitcoin is up. Instead of having to open 10 different tabs, you just go to CoinMarketCap and ask CMCAI.

Thumbnail 2310

Thumbnail 2320

Thumbnail 2330

Using live crypto data, on-chain data, and social sentiment data, it gives you an answer. And then you're wondering, you read about this hack that just happened and you're wondering why did it happen. Instead of having to deep dive into many news articles and many social posts, you just ask CMCAI. And of course, this experience is also on our app. Millions of users use CoinMarketCap's portfolio feature to track their portfolio and because we have this data, we can personalize the answers for users.

Thumbnail 2340

The same question from a different user would yield different, personalized results. Of course, we didn't start with the chatbot you just saw. Back in 2023, when we started our AI journey, our aim was to minimize risk and maximize learning. We needed two critical components before we scaled to the AI chatbot: first, stakeholder buy-in, and second, a deep understanding of how to build on top of the non-deterministic nature of LLMs.

The image you see on the screen is AI FAQs, one of our earlier products. We chose this use case because it sits on a less-trafficked part of the page, so the risk is lower. For every AI FAQ we generate, a human thoroughly reviews it to make sure it is correct and accurate before it goes live. This manual review process, although tedious and time-consuming, was very important because we learned the limitations and strengths of LLMs.

With this product, and with more organizational buy-in over time, we built AI features for more prominent parts of the site and integrated AI more deeply within the product.

Thumbnail 2410

CoinMarketCap's Model Selection Process: Specialized Models for Specialized Tasks

We've moved from simple use cases like AI FAQs to what we have now: the AI chatbot and many other integrations across the site. One thing we realized is that we cannot have the same model do many different tasks. The example I gave earlier is that you don't want to use an F1 car to deliver groceries, and I'll give another analogy: in an F1 pit crew, you have different people for different tasks, and speed and efficiency come from that specialization.

I'll highlight five different tasks that we do within CoinMarketCap using AI. First, a sentiment extractor. Every day, we process millions of social posts and news articles, and we want to extract the sentiment within them: is it positive, very positive, or the opposite? The LLM we select here needs to be cheap, given the scale I just mentioned, and it needs to be fast, so the most advanced models typically don't suit this task well.

The second is planning. This is about taking the user's query, however complex or ambiguous, and transforming it into a to-do list. This LLM needs to be very good at reasoning, understanding a user's query, and converting it into a to-do list. The third is data retrieval. This is about taking the to-do list from the second step and converting it into a list of tool calls. What are tools? You can think of tools as API calls that LLMs can make to retrieve context, such as the latest data or the latest news that the LLM doesn't have in its training data. The model we select here needs to be good at choosing the right tools and the right parameters for those tools.

The fourth is summarization. Now that we have retrieved all that data for the large language model, in this step, the LLM needs to take hundreds of pages of data and news articles and convert that into one single page of data for the user to read. And the fifth is translating natural language into CoinMarketCap IDs because we track more than 27 million cryptocurrencies. The same name can represent many different cryptocurrencies. What we found here is that we don't need the strongest model to do this conversion. A simple chat model generally works very well for this.

Thumbnail 2560

So we care about three main things when selecting models for the different use cases I mentioned earlier. One is quality. Quality looks different for each use case. So metrics like factual accuracy, relevance, proper evidence usage, and groundedness matter to us depending on the use case. Second is speed. Time to first token, tokens per second, and time to last token matters a lot to us. And the third is cost efficiency. We don't care very much about the raw cost per token. What we care more about is the cost to quality ratio.

Thumbnail 2600

So with so many models coming out, I think Scott previously shared that there's one model coming out every 10 seconds. How do we choose which models to test? We can't possibly test every single model every 10 seconds. So in CoinMarketCap, we have three main categories of models, at least the way we think about it. The first is full reasoning. These are your top-tier models. They are very good at complex tasks. Using the Claude family of models as examples in this slide, this would be your Claude Sonnet 4.5 and your Claude Opus 4.1.

Second, lite reasoning. These are mid-tier models that can still reason but are generally faster and more cost-efficient; this would be your Claude Haiku 4.5. And third, fast and low cost. These are lightweight models, typically chat models without any reasoning, used for simple, high-volume tasks; this would be your Claude 3.5 Haiku.

Thumbnail 2650

John talked quite a bit about golden dataset, and that's what I'm going to be talking about within this slide. So in order to prepare the golden dataset, we always start by setting the scope. We need to make clear what is being tested, and over the next couple of slides, we will use tool calling as an example. If a model has access to 10 tools, we want to make sure that it is calling the right tools to retrieve the right data to generate the answer for the users. Say a user's query only requires two of the tools because it doesn't need that much data. The model should not go and fetch data from five of the tools and be excessive about tool calling. So that is the scope: tool calling accuracy.

The second is ground truth. Humans will prepare the golden dataset. Say we have 1,000 user queries that we know users care about. For each of these user queries, we will think about which tool do you need to call to retrieve the data to answer the question. So say for question one, we select four different tools. That is the ground truth.

The third step is we run checks. We call the LLM API and give it the ten tools, and we see what it actually returns to us. Then we compare what they returned to us with our ground truth. If the LLM selected four tools when our ground truth has two tools, then we know something is wrong. There are three metrics we care about here: precision, recall, and F1. I'll talk more about these shortly.

The last step is update. This step is super important. Your ground truth will shift over time. Your ground truth needs to reflect what users care about, and what your users care about changes over time. Using the same set of one thousand questions, maybe one hundred of those one thousand questions will not matter three months from now. So you need to remove those one hundred questions and replace them with something new.

Thumbnail 2760

Let's dive a bit deeper into tool use. Precision is the first metric we care about, and what it tells us is amongst the tools that the model actually called, how many of them were correct, how many of them were necessary. Recall tells us out of all the tools the model should have called, how many did it indeed call. F1 score combines both of them. So F1 score is essentially a formula between the two of them, and it's what we use to compare between models.

The reason why F1 score is important is because, say there are ten tools and the model calls all ten tools. In this case, recall is full because it called all the tools it needed to call. However, precision is extremely low, and the F1 score will reflect that. So if model A has an F1 score of 0.7 and model B has an F1 score of 0.4, then model A is better and we will choose that for the tool calling use case.
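These three metrics are straightforward to compute over sets of tool names. The sketch below reproduces the example above: calling all ten tools when only two were needed yields perfect recall, 0.2 precision, and an F1 of about 0.33.

```python
# Precision, recall, and F1 over tool-call sets.
def tool_call_scores(predicted: set, ground_truth: set) -> tuple:
    tp = len(predicted & ground_truth)  # tools both called and needed
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

all_ten = {f"tool_{i}" for i in range(10)}  # the model called everything
needed = {"tool_0", "tool_1"}               # ground truth: only two needed
p, r, f1 = tool_call_scores(all_ten, needed)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.2 1.0 0.33
```

Because F1 is the harmonic mean of the two, a model cannot hide excessive tool calling behind perfect recall: the wasted calls drag precision, and therefore F1, down.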

Thumbnail 2810

Let's dive into another use case: summarization. Summarization, like I mentioned earlier, is about taking hundreds of pages of content and converting that into a single page or maybe two pages of content for the user to read. The model needs to have very good ability to understand large context, and there are many different metrics we care about here. I'll mention four. First, is the answer relevant to the user's question? Second, is it readable? Third, is the answer well structured? And fourth, groundedness. Is the final answer grounded in the context it was provided?

We all know LLMs hallucinate. So what we found is that sometimes LLMs in the final answer will come up with facts that we did not provide it within the context, and groundedness is super important. The chart you see is our test of seven different LLMs for groundedness. The orange line represents the median, the box represents the interquartile range, which means the middle fifty percent of the results, and the dots represent outliers.

You can see that the first result has a tight interquartile range and a decent median score. The second result has a very wide interquartile range and the highest median, so it has the potential for the best results but also for poor ones. The remaining models generally show weaker metrics. In this particular test, we selected model one: although model two had the potential for very good results, it was too unpredictable, so we couldn't go with it.
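The selection logic behind this chart can be sketched as: filter out models whose interquartile range is too wide, then pick the best median among the stable ones. The scores and the IQR threshold below are made up for illustration.

```python
import statistics

def iqr(scores):
    # statistics.quantiles with n=4 returns [Q1, median, Q3].
    q = statistics.quantiles(scores, n=4)
    return q[2] - q[0]

# Hypothetical groundedness scores per model.
model_scores = {
    "model_1": [0.82, 0.84, 0.85, 0.86, 0.87, 0.88],  # tight IQR, decent median
    "model_2": [0.55, 0.70, 0.95, 0.97, 0.98, 0.60],  # highest median, wide spread
}

def pick_model(scores_by_model, max_iqr=0.10):
    # Keep only predictable models, then take the best median among them.
    stable = {m: s for m, s in scores_by_model.items() if iqr(s) <= max_iqr}
    return max(stable, key=lambda m: statistics.median(stable[m]))

print(pick_model(model_scores))  # model_1
```

Filtering on spread before ranking on median encodes exactly the judgment in the text: predictability beats occasional brilliance for a production system.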

Thumbnail 2910

Internally, we have this system called GLAS, which stands for Generic LLM as a Judge Evaluation Service. What this does is it allows us to use LLM as a judge very easily. It allows any team member to call LLM as a judge and to get the results in hours instead of in days. And if anybody wants to use LLM as a judge internally, the main thing they have to do is to define the JSON input. They define the goal of the test, they define the dimensions that they care about, and they define the candidates that they are testing.

After they have this JSON input, they simply call the API, and underneath the hood of the API we actually call many different leading models. All of these models combine together to produce a PDF output that contains charts like what you saw earlier, and also AI insights. Using these AI insights combined with more human evaluation, we can decide which model is good for the job.
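GLAS is internal to CoinMarketCap, so its actual schema isn't public. The dict below is a hypothetical sketch of the three inputs described (goal, dimensions, candidates); every field name is an assumption made for illustration.

```python
import json

# Hypothetical GLAS (Generic LLM as a Judge Evaluation Service) input.
glas_input = {
    "goal": "Pick the best model for crypto news summarization",
    "dimensions": ["relevance", "readability", "structure", "groundedness"],
    "candidates": [
        {"name": "model_a", "endpoint": "bedrock://model-a"},
        {"name": "model_b", "endpoint": "bedrock://model-b"},
    ],
    # Illustrative pointer to a golden dataset in JSONL form.
    "dataset": "s3://cmc-eval/summarization-golden.jsonl",
}

payload = json.dumps(glas_input, indent=2)
print(len(glas_input["dimensions"]))  # 4
```

Keeping the whole test definition in one declarative payload is what lets any team member kick off an evaluation without touching the judging pipeline itself.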

Thumbnail 2970

Continuous Evaluation, AWS Partnership, and Future Directions

Model evaluation is not a one-off event. New models come out very frequently, and there are typically two triggers that make us want to do a model evaluation. One is a product insight, and the second is a new model coming out that we think is worth testing. The first step of model evaluation is always human evaluation. A lot of models get filtered out simply in this step. So humans, our team, myself or my colleagues, will go in and test the model on a few important use cases.

If it passes this stage, we will run the GLAS system, which is what I mentioned in the previous slide. And then combined with AI insights and more human evaluation, we will know whether this model is suitable for the use case. And if it's suitable, we will push it to production.

In production, every single answer is run through an LLM as a judge, so we know the quality of every answer generated in production, and we have alerts if something goes wrong. This step is very important because we have gained many insights from it alone.
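Judging every production answer with alerting can be sketched as a rolling-window check. The `judge_score` stub, window size, and threshold below are all illustrative assumptions standing in for a real judge-model call and alerting hook.

```python
from collections import deque

WINDOW, THRESHOLD = 100, 0.75
recent_scores = deque(maxlen=WINDOW)  # rolling window of judge scores
alerts = []

def judge_score(question: str, answer: str) -> float:
    # Stub: in production this would call a judge model with a rubric
    # prompt and parse a 0-1 quality score out of its response.
    return 0.9 if answer else 0.0

def record_answer(question: str, answer: str) -> None:
    recent_scores.append(judge_score(question, answer))
    if len(recent_scores) == WINDOW:
        avg = sum(recent_scores) / WINDOW
        if avg < THRESHOLD:
            alerts.append(f"quality dropped: rolling avg {avg:.2f}")

for _ in range(100):
    record_answer("Why is BTC up?", "Spot demand and macro factors ...")
print(len(alerts))  # 0: rolling average stays above the threshold
```

Alerting on a rolling average rather than individual scores keeps one bad answer from paging anyone while still catching a genuine quality regression.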

Thumbnail 3030

Of course, I need a slide about how amazing AWS and Amazon Bedrock are, and to be fair, they have been great partners, so I think they fully deserve it. The first thing is inference. Many of the latest models are available on Amazon Bedrock; whenever Anthropic releases a new Claude model, Amazon Bedrock has it the same day. To us, calling an LLM is just an API call: we don't deal with infrastructure, we don't deal with scaling, and we are very grateful for that. That's the main thing Amazon Bedrock helps us with.

The second thing is RAG. We use Amazon OpenSearch Service and Cohere Rerank, which help us fetch the right content for the model to use to answer the user's question. The third is security. Built-in features like zero data retention are super important because we process the portfolio data of millions of users, so we need that from our inference provider.

And the fourth is hands-on partnership. Since the beginning, Amazon hasn't been just an inference provider. They have always been very helpful, offering deep technical support, and in fact, we are working together on a crypto AI benchmark. This benchmark will compare general purpose chatbots with CMC AI's ability for crypto use cases.

Thumbnail 3110

So to sum up, four points. First, the use case drives model choice. The best model is like a Ferrari: it isn't suitable for every job, and certainly not for delivering groceries. Specialists usually beat generalists across different use cases. Second, evaluation is not a side project; it is a product. It's never set and forget, and only through continuous evaluation can you turn what may be a fun demo into a production-ready product.

Third, humans matter. As much as we try to automate everything away, we've found that humans, alone or combined with AI, bring us the best insights. And fourth, the right partner multiplies output. AWS and Amazon Bedrock have been super helpful, and you need to pick the right inference partners.

Thumbnail 3170

What's next for CoinMarketCap? There are three things we will be focusing on in 2025. The first is reducing the cost and latency of inference, and there are a couple of ways we're thinking of doing this: fine-tuning models for different use cases, and model distillation, which is available on Amazon Bedrock.

The second is B2B opportunities. Many enterprises have indicated interest in integrating AI within their products, and instead of building it from the ground up, they come to us for it. So if any of you are in crypto and want this, reach out to me. The third is better AI UX and expanded CMC AI capabilities. Users mainly interact with large language models through a chat interface right now, and to get the best out of that, users need to be expert prompters and understand the limitations of a chat interface.

I don't think that is the future of AI, and I think that AI needs to be more proactive, it needs to be easier to use, it needs to be more personalized, and that's what we will be focusing on in 2025. So I hope now 20 minutes later you are better at choosing models, and if any of you are in crypto, try out CMC AI and let me know what you think. I'll hand the time back to Scott.

Thumbnail 3260

Thank you so much, Bryan. The work that Bryan and his teams are doing is really fantastic, and it's exciting to see the level of maturity and the scale of production they're working at. I really appreciate you sharing all that with our audience today. A quick recap of what we talked about today, in three steps. First, identify models: look at the general benchmarks, consider modalities, and pick good candidate models that are relevant for your use case.

Next step is evaluate. Run those models in your environment on your data, on your use case, and generate that golden dataset so that you can verify and know that you're picking the right models for your use case. And then optimize those models through model combinations or through fine-tuning to make sure that you're achieving the benchmarks and the performance requirements that you need. And great to hear from CoinMarketCap as well to understand what that looks like in production for them today.

Thumbnail 3310

I want to close with a few resources that I think could be helpful for you all. The first is an article written by one of our VPs on the value of model choice, a nice general-purpose explanation of why we're committed to it in Amazon Bedrock. We've also got a model choice page on the Amazon Bedrock website. It lists all the new models we launched, including the 18 new open-weight models that take that category to over 20 this week, plus the new model providers, with lots more coming. Keep staying tuned; more models are on the way.

We've also got a few GitHub resources. There's a downloadable model evaluation application that you can run locally in tandem with the Bedrock evaluation tools. It's a Streamlit application that generates a nice visual, letting you compare intelligence, accuracy, latency, and performance side by side, which makes it a good one to share with stakeholders to help them understand why you're picking the model you're picking. And if you want to go even further, there's a model evaluation workshop available on GitHub as well. Take a shot at those resources to understand what's available within Amazon Bedrock at whatever depth you want.

Thumbnail 3400

I want to thank you all for being here. It was great getting to talk through this with you. There is a little survey if you don't mind filling that out. That does help us out. We do read those surveys. We're really committed to trying to deliver great presentations for you all. Thank you again for being here today and enjoy the rest of re:Invent.


; This article is entirely auto-generated using Amazon Bedrock.
