
Kazuya


AWS re:Invent 2025 - Balance cost, performance & reliability for AI at enterprise scale (AIM3304)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Balance cost, performance & reliability for AI at enterprise scale (AIM3304)

In this video, Jared Dean, Ankur Desai, and Deepen Mehta discuss Amazon Bedrock's inference tier options for balancing cost, performance, and reliability at enterprise scale. They introduce four tiers using an airline analogy: Reserved (private plane) for mission-critical workloads with steady traffic, Priority (first class) for spiky latency-sensitive requests at premium pricing, Standard (economy plus) for day-to-day workloads tolerating some throttling, and Flex (basic economy) for latency-tolerant agentic workloads at 50% discount. Intuit's Deepen Mehta shares how they leverage these tiers—Reserved for TurboTax's seasonal traffic, Priority/Standard for daily spikes, and Flex for non-critical experiments. All tiers support explicit prompt caching with 90% discount on cached tokens. The session includes technical implementation details and CloudWatch monitoring capabilities for optimizing token usage across different workload patterns.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Balancing Cost, Performance, and Reliability with Amazon Bedrock's Inference Tiers

The title of this session is "Balance Cost, Performance, and Reliability for AI at Enterprise Scale." I'm Jared Dean, a principal solutions architect with the Bedrock team. Joining me today are Ankur Desai, who's a principal product manager with Amazon Bedrock, and Deepen Mehta, who's a senior engineering manager for AI Foundations at Intuit. We're going to discuss the various aspects of building at enterprise scale and the choices and options available from recent releases in the Amazon Bedrock portfolio.

Thumbnail 50

First, we'll do some introductions. Next, we'll provide an overview of options. Then we'll look at a customer experience, which I know is always beneficial to many of you. After that, we'll go through some technical details, and then we'll have a Q&A session at the end where we'll all come on stage and be able to answer your questions.

Thumbnail 60

Thumbnail 70

How many of you got to Las Vegas for the conference via airplane? Perfect. I'll use an analogy that I hope will resonate with all of you. When we travel, we have a variety of choices. The analogy I'm going to use is airline-based, since we've all traveled, and how that relates to some of the Bedrock offerings we have. It's not a perfect analogy, so please don't look for holes to poke in it; the goal is simply to use it as a frame for the information we'll cover.

Thumbnail 100

Thumbnail 120

First, we have private planes. The value of a private plane is that it's at your disposal. It leaves on your schedule when you're ready, rather than you having to conform to the airline's schedule, and it's the most comfortable option available. Next, we have the first class cabin. First class gives you priority boarding, dedicated overhead space, and premium service, but you pay an increased cost for that service.

Thumbnail 140

Thumbnail 150

The next option is economy plus. This is where I end up traveling most of the time. You have better seats and earlier boarding, but you're not the first ones on the plane. For a reasonable price, you can upgrade to sit in those seats. Then there's how my flight will be tomorrow evening on my red-eye: I'll be at the back of the plane. That's the basic economy seat, which is the lowest price. You're the last to board, and you might have to gate-check your bag if there's no overhead space left for it. But you get on the plane and you're able to get to your destination in the most economical way.

Thumbnail 160

Thumbnail 170

Those are the four ways of air travel, and I assume all of you arrived in Las Vegas via one of those methods. This is analogous to the inference tiers available in Amazon Bedrock. In the last two weeks, we have announced a number of options. To compare them, the analogy here is that we have the Priority tier, which is similar to that first class cabin, with priority access to fulfilling an inference request. We have the Standard tier, which is like economy plus, where you have that standard request. That's what you've been using in Bedrock thus far: when you've made an on-demand request, you've used the standard tier.

Moving down to the second row, we have what we call Flex, which would be equivalent to the basic economy in the back, and there are some trade-offs there. You get a discount on your tokens served, but it also takes the longest to fulfill those requests. Finally, Reserved Capacity would be equivalent to the private plane. There's dedicated capacity available to you that's reserved on your behalf, and that allows you to use it at your leisure. Those are the four offerings we have available and the expansion that's been done in the last couple of weeks if you've seen those announcements from the Bedrock team.

Thumbnail 250

Thumbnail 270

Thumbnail 290

Thumbnail 300

The Three-Axis Challenge: Optimizing Accuracy, Latency, and Cost for Production Workloads

With these requests and these new options available to you, we know that you're now trying to figure out how to take advantage of them and optimize. With every production workload, you have to balance across three axes: the accuracy of the model, the speed and latency of the inference that you need, and the cost. These are the three main components that factor into every workload you run against a large language model. The criticality of these applications is use case dependent, and so with some models you are trying to find the smartest answer because you need high accuracy. Using the smartest model often means an increased price and increased latency—it takes longer to complete.

For some workloads you need that high accuracy. For other workloads you need a good enough answer, and so you're willing to sacrifice the accuracy of your model in order to reduce the cost or increase the speed. These are the trade-offs I think you're all familiar with.

Thumbnail 380

One of the goals of the tiers that we have released is to make that more apparent. Traditionally in Bedrock, looking back over the last year, you really only had the option to make on-demand requests, with no way to differentiate whether a request was high-priority traffic or less important traffic that could tolerate more latency. So we're trying to give you that optionality, and that's the main goal here.

We realize that beyond cost, latency, and accuracy, there are other things that factor into the inference decisions you're making, beyond just what we're going to focus on today. There's also which models you pick and which features you take advantage of. I don't want to minimize that the problem is larger in scope—you have lots of choices that you're working through. But today we're just going to focus on the inference aspects and the choices that you have at runtime for where you're making those decisions.

Thumbnail 420

Going back to our airline analogy, let me give some examples of things that would be critical or latency sensitive, where we would potentially want to use a priority tier. One of the ideas I came up with was flight booking during checkout. Once you decide to purchase that flight, you want to know right away whether you've successfully gotten a seat or purchased your ticket. Also, if there was a gate change announcement for any of you as you were traveling here, you'd want to know about that gate change immediately so that you could start navigating to the new gate to get on the plane. We wouldn't want the airline to say it's okay if that gets completed in 40 to 45 minutes, because then you're going to miss your flight.

Thumbnail 490

The third example would be mobile app check-in. When you check in for your flight, you want to get that boarding pass and add it to your Apple or Google Wallet so that you're able to proceed to the gate. These are all latency sensitive applications. As we talked about today, you now have those options to be able to say this is a latency sensitive application—I want to prioritize that traffic.

Then there are things that are more latency tolerant, where you're happy to take a discount on the pricing since you don't need them completed right away. Some latency-tolerant examples in the airline industry would be crew scheduling and assignments. Those take place weeks and months in advance. Loyalty program mileage posting is another example: I'm happy to look at my status on different airlines, but whether that happens two hours from now or tomorrow isn't critical to my travel. So an airline could process those at a later time.

Thumbnail 560

Similarly, I assume you have workloads. Many of the companies that I work with have workloads that aren't urgent but just need to be completed at some point during the business day. These new tiers allow you the opportunity to prioritize those and tell us which things are latency sensitive and which ones are not, with corresponding prices for those.

Thumbnail 580

Just to recap here before we hand off and learn about some personal experience from Intuit, we have the on-demand tier, the reserved tier, and then batch inference. That's what has existed. What we've added to that is now within the on-demand tier we have priority, in which case you can identify on a request by request basis that this inference request is priority. In that case, it will be prioritized, and there's a premium price paid for that. There's standard, which is what you've been consistently using for the last year or two with Bedrock. And then there's also the new flex tier, which allows you to receive a discount but also integrates with your event-driven architectures that you have today.

The distinction between Flex and Batch is that one is event-driven and one would be set into a batch job. This is what the landscape looks like today in terms of what's available to you as you choose how to do inferencing for the various workloads you have on Bedrock.

Intuit's Experience: Serving Millions of LLM Requests Across Seasonal and Spiky Traffic Patterns

Now we're going to hand over to Intuit to learn about their experience and how they're using these tiers and workloads. I'm Deepen. I lead the AI Foundations hosting service at Intuit, and today I'm here to talk about how Intuit serves millions of LLM requests using all of the flexible inference options that Bedrock provides, and how Intuit has balanced cost versus reliability.

Thumbnail 680

Thumbnail 690

Before going into that, let me introduce you to Intuit. Intuit's mission is to power prosperity around the world. Our strategy is to build an AI-driven expert platform to serve hundreds of millions of our consumers, small businesses, and mid-market customers so that they can earn more money, feel confident, and make their financial decisions with greater confidence.

Thumbnail 720

These are some of the experiences currently powered at Intuit using AI. It's a combination of both generative AI as well as classical AI, including document extraction for receipts, automated data entry for TurboTax, and cash flow forecasting. Taking a deep dive into how Intuit has solved for cost, reliability, and performance to serve all of these LLM requests, at Intuit we have one platform called Intuit Genos. It has multiple different components, but one of the main components is the model router. We have one model router which serves different products, but we only have one model gateway.

Thumbnail 780

The challenge is how we can serve different use cases, which have different LLM needs. Some require high reliability, some require low latency, and some use cases are very seasonal in nature, especially for TurboTax. For some use cases, we see daily spiky traffic. LLMs are expensive, and we cannot host LLMs at scale because it would not be cost-effective at all. The return on investment would not be great.

Thumbnail 840

I'm going to take a deep dive into different use cases depending on the nature of the traffic and how we have solved for it. Especially for TurboTax, which is very seasonal in nature, we know that there would be a need for high throughput for a short time, and we can forecast the steady token profile that would be needed for that. We are leveraging the Reserved model for Bedrock that helps us pre-purchase that capacity to power those experiences with high reliability and predictable latency. The outcome is we are paying a premium, just like a private plane, so we can use it the way we want to. It gives us that flexibility and the guarantee that the business needs, but it comes at a cost.

For other use cases where we see a lot of daily interactive traffic spikes, the Reserved model does not work because we cannot provision reserve capacity for a spike in traffic. It would result in a lot of cost wastage due to underutilization, and we cannot forecast those spikes efficiently. That's where the Priority or Standard mode offering really helps us. For critical traffic that falls under daily spikes and which we want to serve with low latency, we can leverage the Priority mode. For spikes that are low and not that critical but still latency-important, we can definitely use the Standard on-demand capacity model from Bedrock. The advantage here is that Priority is more expensive than the Standard tier, but if we have to look at the overall value proposition, it makes sense for our business needs.

Thumbnail 930

But if we look at the amortized rate over the whole year, even if we are paying a premium for Priority, it is not as expensive as purchasing reserved capacity for the whole year or for months. We are only paying a premium for what we use, not for 24/7. The third use case is generative AI experimentation. We are seeing a lot of experimentation happening, with many new experiments and use cases being launched under beta testing. These are not that critical because they are not yet production ready, so they fall under non-critical. However, the nature of the requests is very real-time, not completely batch. That is where the Flex mode helps us plan the workload accordingly, where we get a discounted rate on each and every one of our requests. We can get the designated throughput that we need for the model, but latency is not that important because the flow is not critical for that use case.

Thumbnail 1000

That is how we have planned the LLM serving framework that we have at Intuit, depending on your use case and what your traffic looks like. This is a mental map of how we have planned across different capacity, different modes, and different use cases. We have reserved, which is very seasonal where we know we need guaranteed capacity. We have priority and standard, which are latency sensitive but spiky in nature. And we have flex for non-critical and offline jobs. With that, I will hand it over to Ankur, who will cover the technical details.

Thumbnail 1060

Thumbnail 1070

Technical Deep Dive: Understanding the Standard Tier and Prompt Caching Benefits

Thanks, Deepen. Hello everyone, I am Ankur Desai. I am a product manager on the Bedrock team, and I have been involved with the recent launches of the service tiers. Let us deep dive into the technical details so that we understand how to use these service tiers and what they actually mean when you use them. Let us start with the simplest one, the standard tier. How many of you have used Bedrock recently? It is highly likely that all of you have used the standard tier without calling it the standard tier, right? The nomenclature is out now, but before, we did not have Priority or Flex; you just had Bedrock on-demand. This is a tier that is designed for your day-to-day GenAI workloads.

These workloads sometimes can tolerate some amount of retries. What that means is that even when you are in your defined Bedrock quota, sometimes you can see a little bit of throttling. When you see that, you have to retry the request, and it most likely goes through the next time. The standard tier is best effort, but we do have alarming frameworks and dashboards. If we see a lot of throttling, people wake up in the middle of the night to make sure that we rebalance the capacity and keep the throttling limits under our designated thresholds.

You have all been using the standard tier when you use Bedrock via invoke model or the recently supported OpenAI SDKs like chat completion and responses. One thing that comes with the standard tier is explicit prompt caching. Prompt caching is a great technique for both performance and cost. Typically, when you need better performance, you have to pay more. But with prompt caching, you get actually both. Because your prompt is cached, it does not have to be processed on the GPU for the next request. That means you get much faster time to first token. Your input has already been processed, and because we do not have to use GPUs, we actually pass on the cost benefit to you, so it comes at a 90% discount on your typical input token processing. We wanted to make sure that this great benefit actually goes across all of the tiers.

Prompt caching is supported across all of the tiers, including explicit prompt caching. You can see how many of your tokens have been processed as regular input tokens, how many were written to the prompt cache, and how many were read from the cache. All of that is available for you to observe in the CloudWatch metrics. This is how you would look at your usage and establish a baseline if you have been using Bedrock on demand.
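
To make the mechanics concrete, here is a minimal sketch (not from the session) of an explicit cache point with the Converse API in boto3, along with the cache read/write token counts that come back in the usage block. The model ID, region, and shared context are placeholders, and cache support and minimum cacheable prompt sizes vary by model.

```python
import boto3

# Minimal sketch of explicit prompt caching via the Converse API.
# Placeholders: model ID, region, and the long shared context.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

long_shared_context = "..."  # e.g., a large policy document reused across requests

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5",  # placeholder model ID
    system=[
        {"text": long_shared_context},
        {"cachePoint": {"type": "default"}},  # everything above this marker is cacheable
    ],
    messages=[{"role": "user", "content": [{"text": "Summarize the key points."}]}],
)

usage = response["usage"]
print(
    "regular input:", usage.get("inputTokens"),
    "cache writes:", usage.get("cacheWriteInputTokens"),
    "cache reads:", usage.get("cacheReadInputTokens"),
)
```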

Establishing this kind of usage baseline helps especially when you're going for the reserved tier, where you have to buy reserved capacity 24/7. If you look at your current usage for your workload, you can establish P90, P50, and P25 for your input tokens and your output tokens. Then you can make an educated decision on how much you want to reserve, or when you will use priority versus standard.
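
As a rough sketch of that baselining step (assuming the documented AWS/Bedrock CloudWatch namespace and a placeholder model ID), you could pull per-minute token counts and compute the percentiles yourself:

```python
import boto3
from datetime import datetime, timedelta, timezone
from statistics import quantiles

# Pull per-minute input-token usage for one model and compute P25/P50/P90.
# For windows longer than about a day, page through GetMetricData instead.
cw = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

resp = cw.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-5"}],  # placeholder
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=60,              # one-minute buckets, i.e. tokens per minute
    Statistics=["Sum"],
)

tpm = sorted(dp["Sum"] for dp in resp["Datapoints"])
if len(tpm) >= 2:
    q = quantiles(tpm, n=100, method="inclusive")
    print(f"Input TPM baseline: P25={q[24]:,.0f}  P50={q[49]:,.0f}  P90={q[89]:,.0f}")
```

Repeating the same query for OutputTokenCount gives the output-token side of the profile.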

Thumbnail 1270

Reserved Tier Explained: Guaranteed Capacity for Mission-Critical Workloads with Predictable Usage

Next, we'll talk about the reserved tier. This is where Deepen mentioned the TurboTax tax season. They know it's going to be high volume, they know it's going to require a lot of throughput, and they know they cannot have end users sitting in TurboTax asking questions and having to retry. The user will just be frustrated and go away. When you have situations like that, it's better to just reserve the capacity based on your usage. You can look at your P50, and maybe that's what you're comfortable spending.

Once you have reserved the input and output TPM, tokens per minute, based on your usage, we make sure that capacity is always available to you. This is unlike the on-demand standard tier, where if there is high demand from all customers at the same time, you may see some throttling, because that's the nature of the standard tier. On the reserved tier, even if other customers are sending lots of requests at the same time, your capacity is guaranteed. Other people will get throttled, but not you. That's the benefit here.

We also have flexible provisioning of input and output tokens. What that means is, let's say you have a summarization use case where you're sending walls of text and you just want a paragraph out of it as a summary of the entire content. Here, you would have lots of input tokens and very few output tokens. But let's say the use case is around content generation where there is a small set of instructions and you have to generate maybe a few pages of text. Now the number of input tokens is really small, and you need a lot of output tokens.

Based on your use case, you may need different token profiles. Sometimes you may need higher input TPM, sometimes you may need higher output TPM, and we give you that flexibility. It's not like a hardware reservation behind the scenes where we say here is a GPU-enabled instance and you get out of it what you get out of it. We have built the smartness on top of it to allow you to actually provision and reserve what is required for your use case.

Now, this is a premium offering, so it comes at a cost. It's a fixed hourly cost per 1000 input and output tokens per minute. What that means is if you don't use it, you're still paying for it. That's where your baseline comes into effect where you actually look at your usage for a given use case and say maybe I want to reserve only for P50 so that I make sure I don't waste the reserved capacity. Most of the time I will use it.

The cost of having this reserved capacity and processing guarantee means you are paying for it 24/7. It is optimized for workloads such as the one Deepen mentioned for the TurboTax tax season. We have other customers with trading platforms, and they know what their usage is and what the pattern is. Their customers, in the very latency-sensitive trading world, cannot tolerate retries. When you know the daily usage of your applications and what that translates into in Bedrock usage, you can once again make an educated decision on how much you want to reserve.

Those are the type of workloads where you absolutely cannot tolerate latency delays as well as downtimes. So no throttling and no retries is the nature of the game. One other benefit we have with the reserved tier is you can burst into on-demand. So the question is, let's say I reserve for my P50. At my peak, I have a lot more tokens coming into Bedrock.

What happens if I run out of reservation? At that time, you just burst to standard tier. You don't have to worry about getting throttled beyond your reservation. You will automatically be served with standard tier. In the future, we may even introduce a feature where you can say that when I spill over to pay-as-you-go, process my request with higher priority, so the priority tier, not standard tier.

One other key feature here is we do support explicit prompt caching for the reserved tier. What that means is, because you're paying for 24/7, how do we actually give you the cost benefit of prompt caching? What we do is use a different burn-down rate. Based on the model, let's say Anthropic Claude Sonnet 4.5, the cache-read token pricing is at a 90% discount versus the regular input token processing price. We do the same burn-down rate conversion for the reserved tier.

Let's say you are sending 1 million tokens per minute and your cache hit rate is 90%. For those 900,000 tokens that are being read from the cache, we will apply a burn-down factor of 0.1, so they will count as 90,000 tokens, not 900,000 tokens. When you're reserving capacity, you don't have to reserve for 1 million input tokens. You can reserve for the 100,000 regular input tokens plus the 900,000 cached tokens that, after burn-down, effectively count as 90,000. So a reservation of around 200,000 tokens should suffice for your 1 million tokens per minute. That's how you get the benefits of prompt caching in the reserved tier.
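
The arithmetic from that example, written out (illustrative numbers from the talk, not a pricing calculator):

```python
# With a 0.1 burn-down factor on cache reads, 1M input TPM at a 90% cache hit
# rate needs far less than 1M TPM of reserved capacity.
total_tpm = 1_000_000
cache_hit_rate = 0.90
burn_down_factor = 0.10                       # cache reads count at 10% of a regular token

cached = total_tpm * cache_hit_rate           # 900,000 tokens read from cache
uncached = total_tpm - cached                 # 100,000 regular input tokens
effective_tpm = uncached + cached * burn_down_factor  # 100,000 + 90,000 = 190,000

print(f"Reserve roughly {effective_tpm:,.0f} input TPM instead of {total_tpm:,}")
```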

You can observe the usage via CloudWatch. Now, there are a few things to keep in mind here. The first one is you're reserving for a certain threshold. When it spills over, it is automatically served with standard processing. So in CloudWatch, you may want to see: I sent 1 million tokens in this given minute. How many were served with reserved tier? How many were served with standard tier? Based on that, do I need to balance my reservation? Do I need to reduce it? Do I need to increase it?

Thumbnail 1670

All of that information is available to you in CloudWatch metrics where you can actually see the tier you requested at the time of making the Bedrock request and the actual tier the request was served with at the end.

Priority and Flex Tiers: Pay-As-You-Go Options for Spiky Traffic and Latency-Tolerant Workloads

All right. So now let's talk about Priority tier. This is actually meant for pay-as-you-go, so you're not reserving anything, you're not paying for anything 24/7. You only pay for it when you use it. But there is a price premium. The idea here is the standard tier is for workloads that can tolerate a little bit of retries, but there are workloads where you don't want to waste time retrying and you cannot tolerate throttling.

One example is we have this new generation of banks where it's all online. In their apps, as you deposit checks, they have Sonnet in their workflow for verification. This is a workflow where you don't want to reserve capacity because people don't deposit checks like every day, every hour of the day. But when somebody's trying to do that in your application, you want to make sure that it gets processed immediately and there are no retries, so the customer doesn't go away.

Use cases like this don't justify reservation. You don't want to pay 24/7 if you're not going to use it. But when the request comes, you want a flag where you can say this is a high-priority request and I'm willing to pay a premium if you give me latency benefits and uptime guarantee. That's what the priority tier is. It's for spiky traffic, sporadic traffic, but it is still important and you cannot tolerate multiple retries, you cannot tolerate throttling.

It is priced at a premium over standard tier, so somewhere around 75% to 100% more, so you are going to pay more per token, but hopefully your use case justifies that kind of premium. The example I gave, if the customer goes away because your check processing fails, there might be downstream implications of customer loss for that new generation bank. You don't want to lose that customer. It's much easier to pay the premium than face the other implications.

Priority Tier also supports prompt caching, which is a wonderful feature that gives you both performance and cost benefits at the same time. The way it works in Priority Tier is if your token is read from cache, there is typically a 90% discount. We apply that discount to the premium rate as well. So let's say you're paying $1.75 per token instead of $1 per token. You get a 90% discount on $1.75, so we still provide you the cost benefit of prompt caching.

For certain models, we also give you better end-to-end latency by providing better output tokens per minute. What happens behind the scenes is that if it's a Priority request, we send it to a part of the fleet where we have optimized the fleet or hardware configuration for speed. In simple terms, for Standard Tier we may have a batch size of 8 or 4. For Priority Tier we keep the batch size smaller. So on the same instance, taking an Anthropic Claude model as an example, instead of processing 8 requests simultaneously, we only process 4 requests. Because of that, on a per-request level, you see faster output token processing, so your latency is better.

That's one other benefit of Priority Tier. You get higher priority, you jump the queue, you will not get throttled, and you will also get faster processing. We have CloudWatch metrics where you can see the latency benefits as well as the uptime benefits. You can search your requests by service tier priority, look at your end-to-end latency, look at your 503s or rate-limit throttling, and compare them to your Standard Tier usage as well. This is observable in the real world in your applications when you're using Priority Tier versus Standard Tier. You can look at the graphs side by side and see how your throttling is lower and your speed is faster on Priority Tier.

Thumbnail 1950

Now let's talk about Flex Tier. This is a new offering for us on Bedrock, driven directly by customer demand. A major use case for this is agentic workloads. Let's say you have automated workflows end to end where there is no human being sitting waiting for a response. There are 3 steps in the process. Whenever the first step completes, the next step starts, and so on. Because there is no human being waiting, you may want to actually take longer for the processing if you can get a discount. We're talking about around a 50% discount, and instead of 10 seconds, it might take 10 minutes. But your entire workflow still completes in half an hour, which is good enough.

One example is scheduling of your workflows or workforce. That doesn't have to be done in real time. If you're optimizing how your employees should operate at the airport, it's not a one-second thing. It can take longer, and you can optimize for that. You can also get a discount on the processing of the data. Flex Tier is priced at a discount relative to the Standard Tier. We are thinking around 50%. It can vary from model to model, but the mental model is around 50%. We want to give you that cost benefit if you are willing to wait a little bit longer so that other high priority requests finish ahead of you, but you will still get your request finished when you need it. It just takes longer.

Flex Tier also supports prompt caching. If we gave you a 50% discount on Flex Tier but didn't support prompt caching there, then with prompt caching's 90% discount, your Standard Tier would actually be more cost effective. So similar to Priority Tier, Flex Tier gives you a similar discount. Let's say on Standard Tier you're spending $1 per token; on Flex Tier you're spending 50 cents per token. If your token is read from cache, we give you a 90% discount on that 50 cents, so we are passing on the same benefits to you on the Flex tier as well.
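
Putting the caching discount and the tier multipliers together, a blended per-token rate looks roughly like this (the $1.75 / $1.00 / $0.50 figures are the illustrative numbers from the talk, not actual Bedrock prices):

```python
# Illustrative blended input-token rate per tier for a given cache hit rate.
tier_rate = {"priority": 1.75, "standard": 1.00, "flex": 0.50}  # relative price per token
cache_read_discount = 0.90                                      # 90% off the tier's own rate

def effective_rate(tier: str, cache_hit_rate: float) -> float:
    rate = tier_rate[tier]
    uncached_part = rate * (1 - cache_hit_rate)
    cached_part = rate * (1 - cache_read_discount) * cache_hit_rate
    return uncached_part + cached_part

for tier in tier_rate:
    print(tier, round(effective_rate(tier, cache_hit_rate=0.8), 3))
```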

Thumbnail 2120

Implementation and Summary: Easy Service Tier Switching and Choosing the Right Tier for Your Use Case

One thing we wanted to make sure of as we added these service tiers is that they're easy to use and easy to swap. Maybe your request gets throttled once or twice on the standard tier, and the third time you want to send it with Priority. That should be easy enough to do, right? If you have heuristics where, if your latency goes above 5 seconds, you send the next request with priority, or if your request gets throttled twice in a row, you send it with priority, you can do that.

Here you can see an example of the chat completion API with a simple service tier parameter, and you can easily switch between them. Here it is set to priority, but it could be reserved, default, or flex. All of the service tiers are easily swappable in your inference request on a request level. It doesn't require a lot of coding changes—it's just a parameter where you change the value. So it's very flexible that way. You can actually decide to use the right service tier as it serves you in real time. If something happens and you're seeing high latencies, you just go with priority.
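
The slide itself isn't reproduced here, but a rough equivalent of that request might look like the sketch below, including the throttle-then-escalate heuristic mentioned above. The endpoint URL, credential handling, model ID, and the exact accepted service tier values ("default", "flex", "priority", "reserved") are assumptions based on the slide rather than verified settings.

```python
import os
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1",  # assumed endpoint
    api_key=os.environ.get("AWS_BEARER_TOKEN_BEDROCK", "placeholder"),     # assumed auth
)

def ask(prompt: str, tier: str = "default"):
    # The service tier is just one request-level parameter; nothing else changes.
    return client.chat.completions.create(
        model="anthropic.claude-sonnet-4-5",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        service_tier=tier,
    )

# Escalation heuristic: if the standard request is throttled, resend with priority.
try:
    reply = ask("Summarize this itinerary change.")
except RateLimitError:
    reply = ask("Summarize this itinerary change.", tier="priority")
```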

Thumbnail 2210

One thing I want to call out, though, is that on the reserved tier, you need a prior reservation. Your inference request will say reserved, but before it actually gets processed as a reserved request, you have to create a reservation, which is a control plane action. An admin in your company will have to do that, and then developers can set the reserved tier on their requests. Across all of this, we didn't talk about batch much, but this is another inference option that you have at your disposal. Typically, batch is meant for things like creating daily reports for 100 different businesses or business units in your company, or running evaluations of your agents.

Thumbnail 2270

Thumbnail 2280

Here you're going to do bulk processing of a bunch of prompts. Maybe you'll take hundreds of prompts together, put them in a file, and send it to Bedrock. It can take its own time, but a few hours later you get answers for all of the prompts at the same time. This is also at a 50% discount on a token level. You can decide when to use it. Typical workflows or workloads are evaluations and reporting. It has a 24-hour completion window, and you get the discount appropriately. So I want to wrap up now and make sure we bring all of this together so you can leave with a mental model of when to use what.

The reserved tier is for mission-critical workloads with steady traffic. You have a usage pattern, and you can reserve capacity for it with no wastage of cost. You would do that for workloads such as tax season at TurboTax, right? You know it's going to be heavy usage. From your previous year, you know what the patterns look like and what your input and output TPM requirements are, and you're going to reserve that so you can rest assured that your users will not get throttled or have to retry. The pricing model is fixed, which means if you don't use it, there could be cost wastage, so you have to be careful about how much you want to reserve.

The Priority tier is next in the line of priority. Think about these tiers as levels of quality of service. If there is a reserved tier request, it gets served first ahead of everyone else, and we have made sure there is enough capacity for reserved. Priority tier requests get served before standard and flex, and standard gets served before flex. So it's basically real-time prioritization on the platform. The Priority tier is for mission-critical workloads with spiky traffic. We talked about the check deposit use case, and there are many other use cases like that where it's spiky traffic. It doesn't happen 24/7 or every day of the week, but when it happens, you want to make sure it gets prioritized and you're willing to pay a premium for it. That's what the priority tier is for. It's pay-as-you-go pricing, and you pay a premium per token.

The Standard tier is for day-to-day workloads that can tolerate some rate limits. Some of your requests may experience throttling. We have a very high threshold for when we get alarms and wake up to start rebalancing, but in very rare cases, some of your requests may get throttled, and you can handle that in your code. You can just do a 503 retry again. It's simple to do that, but that retry actually adds latency for your end users. As long as that is fine, typical use cases are code generation, right? A developer asks for code generation or code completion, and it doesn't complete at that point, but you can retry and it takes 5 more seconds, but the code is there. That's still an OK experience in my view. This is the pay-as-you-go standard token pricing.

Flex is for latency-tolerant workloads such as agentic workloads, right? There is no human being sitting in front of a desk hoping to see a response. It's all automated processing, and if it takes a few minutes, that's fine. You get the discount and you actually optimize for cost here versus performance. This is pay-as-you-go pricing with a discounted model. You're paying per token that you send to the platform and process on the platform.

Batch is the last option. This is specifically for bulk processing where you have hundreds of thousands of prompts that you want answers for. You can wait for up to 24 hours, and it's a repetitive batch bulk processing use case where you do it maybe once a week or every day. That's what batch processing does for you.
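
For the batch option, a job submission might look something like this minimal sketch: prompts staged as a JSONL file in S3 go in, and results land in S3 within the completion window. The bucket names, role ARN, and model ID are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_invocation_job(
    jobName="weekly-agent-evaluation",
    modelId="anthropic.claude-sonnet-4-5",                      # placeholder model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder role
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/prompts/batch.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/results/"}},
)
print(job["jobArn"])
```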

With that, I think we will wrap up and open it up for Q&A.


; This article is entirely auto-generated using Amazon Bedrock.
