🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Boost performance and reduce costs in Amazon Aurora and Amazon RDS (DAT312)
In this video, Principal Database Solutions Architect Pini Dibask demonstrates performance and cost optimization strategies for Amazon RDS and Aurora through a fictional company called AnyCompany. He covers three main cost dimensions: compute, storage, and backup. Key techniques include using CloudWatch Database Insights for observability, SQL optimization with proper indexing, instance right-sizing with Graviton processors (achieving 46% cost reduction), leveraging read replicas for workload separation, implementing io2 Block Express for sub-millisecond latency, and using Optimized Reads with local NVMe SSD for complex queries. For Aurora specifically, he explains Aurora I/O-Optimized (23% savings), tiered cache capabilities (90% cost reduction), fast clones combined with Aurora Serverless for test environments (90% storage savings), and Aurora Global Database for multi-region deployments. The session emphasizes that cost and performance optimization are complementary goals, demonstrating how proper tooling and architectural choices can simultaneously improve both metrics.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Performance and Cost Optimization Journey with AnyCompany
Hello everyone and welcome. I'm excited to have you here today as we dive into the world of performance and cost optimization in Amazon Aurora and Amazon RDS. We're going to follow the journey of a fictional company named AnyCompany, and while the name is made up, it does represent some of the most common questions I get from customers just like you. My goal is that you will walk away from this session with actionable insights that you can directly apply to your own database environments. My name is Pini Dibask, and I'm a Principal Database Solutions Architect here at AWS. Thank you for joining me, and let's get started.
I assume many of you are already familiar with Amazon RDS, but just to level set, Amazon RDS is a fully managed relational database service. With Amazon RDS, you can spend time innovating and optimizing your applications instead of focusing on operational database tasks such as upgrades, backups, provisioning, and disaster recovery. Amazon RDS supports both popular commercial engines and open source engines, and as part of the Amazon RDS family, we also offer Amazon Aurora, which deserves special mention as our cloud-native relational database engine which is fully compatible with both MySQL and PostgreSQL.
Amazon Aurora was designed to provide you the enterprise-grade security, availability, and reliability of commercial-grade databases with the simplicity and cost effectiveness of open source databases. Due to Amazon Aurora's innovations and unique capabilities, it is actually the fastest growing service in the history of AWS. We have hundreds of thousands of customers who use it today for their relational database needs. Amazon Aurora has evolved over the last decade into a family of different options. The latest addition to the Amazon Aurora family was Aurora DSQL, which was announced last year at re:Invent. However, I just want to set expectations up front. We are not going to cover Aurora DSQL in this presentation because it has a completely different pricing model and architecture. We are going to talk about RDS and Aurora, both Aurora Provisioned, which is instance-based, and Aurora Serverless.
Let's get started by talking about the RDS cost dimensions. There are costs associated with every RDS cluster, and this includes compute and storage. There are also costs associated with most RDS clusters, like backup, data transfer, and IOPS, and this depends on your specific cluster configuration. And there are additional cost dimensions which really depend on your specific business and technical use cases and which database features you use for your applications. As part of today's presentation, we will talk about some of these cost dimensions by showing examples from AnyCompany.
AnyCompany's First Challenge: CPU Spikes and the Power of CloudWatch Database Insights
Speaking of AnyCompany, I would like to provide some context to our customer story. AnyCompany empowers e-commerce sellers with generative AI to turn every product into a bestseller. They run Amazon EKS for the application layer and Amazon RDS for the operational database. As an early-stage, fast-growing startup, they faced a common challenge: balancing performance needs as the workload scales while staying cost conscious. This is a common challenge that we as solutions architects help customers solve.
So back to the RDS cost dimensions, I would like to focus on three cost dimensions today: compute, storage, and backup. Let's get started with compute. I would like to introduce AnyCompany's first challenge, and some of these challenges will probably resonate with you. AnyCompany started very fast with a great customer base, but suddenly their CPU spiked to 100 percent due to poorly performing SQL statements. So there are two key questions we need to ask here. First, can we identify and resolve the problematic SQL? And second, what is the right instance type to use for AnyCompany's workload? They started with R6g instances, but does that mean it is the best instance for their needs? We need to think about that, and this is a common question we get from customers where scaling decisions need to be both performance and cost conscious.
With AWS, you have a wide range of instance classes to choose from. Burstable instances are for smaller and variable workloads. General purpose instances offer a balance of compute, memory, and networking for a broad range of workloads. Memory-optimized instances are for memory-intensive applications. And Graviton instances, which are based on the ARM CPU architecture, offer enhanced price performance compared to their x86 counterparts, making them a great, cost-effective choice for many of the workloads running on AWS.
Choosing the right instance class and instance type may seem complex, but with the right observability tools, it can become much simpler. I would like to talk about observability for a moment.
For many years, RDS and Aurora customers like you used various different tools and solutions: CloudWatch metrics, instance-level metrics, Enhanced Monitoring for operating system level insights, Performance Insights for query analysis and wait event analysis, and custom solutions built on RDS events, RDS logs, and scripts. However, all of these tools created silos, forcing you to jump between different consoles without necessarily seeing the full picture.
This is exactly why last year we introduced CloudWatch Database Insights. CloudWatch Database Insights is our single pane of glass for database observability. It unifies metrics, logs, and performance data into one cohesive experience. I would like to show you what it looks like with a quick demo.
Resolving Performance Bottlenecks Through SQL Optimization and Instance Right-Sizing
What you can see here is the fleet level view, which means you can see an entire fleet of instances. You can choose a fleet based on tags or specific instances you would like to dive into. We have a fleet of six instances, and one of them has a warning due to database load. When we go to top database instances, I can see there is one instance with the name AnyCompany, and that instance is the top in terms of database load.
We also can see the top SQL across the entire fleet. There is a join between customers and payments, and we will talk about that SQL later in this presentation. I can also look at the top database instances based on various metrics like CPU utilization, number of connections, IOPS, network throughput, and so on. AnyCompany is the one that suffers from most of the bottlenecks and most of the database issues.
We would like to dive into AnyCompany's instance, and what we can see here is the database load chart, the average active sessions, and the number of vCPUs, which is represented by the horizontal dotted line. We can see that we have top wait events that exceed the number of vCPUs in the machine. We can see CPU and also the IO:BufFileWrite wait event. Please remember IO:BufFileWrite, because we will talk about it later in this presentation.
You can see database telemetry with dozens of metrics you can choose from. You can see logs, slow query logs, RDS events, OS processes, and all of these different capabilities, which used to live in different consoles, are now fully available in one unified, cohesive experience. I can analyze the dashboard and the performance for a specific time period. It also gives me automated analysis, so when I click on the view performance analysis dashboard, I can see unusually high load.
We have 21 times more load than the typical database load. Obviously there is something that we need to address here. When I scroll down, I can see that it identifies that 91 percent of my database load is associated with CPU and IO wait events, and it also shows me what is the problematic SQL. This is the same one as before. Essentially this is a join between customers and payments, so we do a join between those two tables and we order by the amount.
We show the top customer information and payment information, ordered by amount descending, so we see the top X payments. If I want to see the exact SQL statement, I can go back to CloudWatch Database Insights, and here I can see the full text of the SQL. We can see that the statement ends with LIMIT 100, which essentially means I would like to see the top 100 payments for those customers.
I am using my favorite IDE, which is DataGrip, and I am running EXPLAIN ANALYZE. You can see the query takes 30 seconds, which is pretty slow for an operational database. We are not going to dive deep into execution plans in this session, but we can see parallel sequential scans on both payments and customers, which are full table scans. I need to sort the entire table in order to get the top 100 because there is no index. By definition and by design, an index is already sorted.
So I created an index on amount and customer ID on the payments table. Now when I run EXPLAIN ANALYZE again, you can see that instead of 30 seconds, it's reduced to 46 milliseconds. That's a significant improvement. Now I can see that I'm using indexes for the join and I no longer have the issue. If I go back to CloudWatch Database Insights, I can see that AnyCompany no longer has any warnings. The SQL that we saw earlier doesn't appear as a top SQL query; we actually see the CREATE INDEX statement instead.
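As a quick aside, here is a minimal sketch of reproducing that workflow outside the console with Python and psycopg2. The connection string and the schema details (the customers and payments tables, the amount and customer_id columns, and the index definition) are assumptions reconstructed from the description above, not the exact demo environment.

```python
# Minimal sketch: inspect the slow top-100 query, add an index, and re-check the plan.
# Connection details and schema names are assumptions based on the session's description.
import os
import psycopg2

TOP_PAYMENTS_SQL = """
    SELECT c.*, p.amount
    FROM customers c
    JOIN payments p ON p.customer_id = c.customer_id
    ORDER BY p.amount DESC
    LIMIT 100
"""

def explain(cur, sql):
    """Print the execution plan with actual runtimes."""
    cur.execute("EXPLAIN ANALYZE " + sql)
    for (line,) in cur.fetchall():
        print(line)

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # connection string to the RDS endpoint
conn.autocommit = True
with conn.cursor() as cur:
    explain(cur, TOP_PAYMENTS_SQL)   # before: parallel sequential scans, ~30 s in the demo
    # An index on amount (descending) lets the top 100 come straight off the index
    # instead of sorting the whole payments table.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_payments_amount_customer "
                "ON payments (amount DESC, customer_id)")
    explain(cur, TOP_PAYMENTS_SQL)   # after: index scan, ~46 ms in the demo
conn.close()
```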
If I scroll down to the top database load instances, instead of 90 percent or higher utilization, now it's 11 percent. We're looking at a much healthier instance right now. If I go to database instance, everything looks significantly better. CPU utilization dropped, and you can see the I/O latency also dropped. You can see the database load chart at the top, and in the past, you can see that the wait events were higher than the number of vCPUs, which is always a red flag for performance. But now I no longer have this issue.
That was a quick demo to show you how useful CloudWatch Database Insights can be. As a reminder, the first issue we had with AnyCompany was that we saw spikes to 100 percent CPU utilization caused by poorly performing SQL. What we did was optimize the SQL, and following that optimization, I no longer need such a large instance, so I can optimize cost. Instead of a 2xlarge instance, I moved to an xlarge, which has half of the vCPUs and half of the memory, but I don't need more than that.
You may also notice something interesting here. I moved from R6g to R8g. R6g is based on the Graviton2 generation, and R8g is based on the Graviton4 generation, which provides improved price performance. By combining SQL optimization and instance right-sizing, and by leveraging the latest Graviton generation, I was able to reduce cost by 46 percent and average CPU utilization by 70 percent. That was a quick example of how, using the right tools, we can identify problematic SQL, and using right-sizing, we can make cost-conscious decisions about the instance that we're using.
Tackling Noisy Neighbors: Read Replicas and Caching Strategies
Now I would like to talk about the second challenge, which is noisy neighbors. AnyCompany's core API performance degraded once data analysts, who are essentially internal customers and employees of AnyCompany, started running reporting queries on top of this database. This is the same database that serves the customer-facing application where the company actually makes money. This impacted the overall user experience and reliability. A common challenge we need to address is that we have competing workloads that interfere with each other. We need a solution that can separate these workloads and manage them separately without affecting each other.
One traditional scaling option is to scale vertically. You're all familiar with this option: we can add more memory and vCPUs and get more resources. However, since this use case involves only reporting queries, why not offload those queries to read replicas? By doing so, we can effectively scale read traffic without impacting the core customer-facing application.
You may wonder about the cost differences between vertical scaling and horizontal scaling. Let's assume we scaled vertically and moved from R8g xlarge to R8g 2xlarge. We would end up doubling the compute cost. Instead, with read replicas, we have more flexibility. We can create up to 15 read replicas, each one a different size. For the reporting queries, we realized we don't need such a large instance. We can use large, and by using an R8g large, we can meet the requirements for the reporting queries and reduce the cost significantly. That's another way for us to make cost-conscious decisions instead of just scaling vertically and spending more money.
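Here is a hedged boto3 sketch of what creating that smaller reporting replica might look like; the instance identifiers and region are illustrative assumptions, while the db.r8g.large class follows the example above.

```python
# Sketch: offload reporting queries to a smaller read replica instead of scaling
# the writer vertically. Identifiers and region are illustrative assumptions.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="anycompany-reporting-replica",  # hypothetical replica name
    SourceDBInstanceIdentifier="anycompany-prod",         # hypothetical primary name
    DBInstanceClass="db.r8g.large",                       # smaller than the writer, per the example
)
# Point the reporting and BI tools at the replica endpoint; the customer-facing
# application keeps using the primary instance endpoint.
```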
Please note that for this example and all remaining examples in this presentation, I will be using on-demand pricing from the North Virginia region. You could obviously reduce costs further with longer-term commitments such as reserved instances, but for simplicity's sake, I'm sticking with on-demand pricing. Another alternative for handling the noisy neighbors issue is to use caching. For example, we can cache frequently accessed data like SQL result sets. One way to do that is with Amazon ElastiCache. One strategy is called write-through: we update the data in the cache whenever we write data to the database. Then, when we want to retrieve the information with a query, we can read it from the in-memory store, Amazon ElastiCache, which provides microsecond latency. That's much faster than any EBS option we offer. That's one example of scaling without necessarily moving to a larger instance type.
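To make the write-through idea concrete, here is a minimal sketch of the pattern using the redis-py client against a Redis-compatible ElastiCache endpoint. The endpoint, keys, and schema are illustrative assumptions, not AnyCompany's actual code.

```python
# Sketch of the write-through pattern: every database write also refreshes the cache,
# and reads try the in-memory store first. Endpoint, keys, and schema are assumptions.
import json
import psycopg2
import redis

cache = redis.Redis(host="anycompany-cache.example.com", port=6379)
db = psycopg2.connect("host=anycompany-db.example.com dbname=app user=app")

def update_customer_total(customer_id: int, total: float) -> None:
    """Write-through: persist the change to RDS, then refresh the cached copy."""
    with db, db.cursor() as cur:
        cur.execute("UPDATE customers SET total_spend = %s WHERE customer_id = %s",
                    (total, customer_id))
    cache.set(f"customer:{customer_id}:total", json.dumps(total))

def get_customer_total(customer_id: int):
    """Read path: microsecond-latency cache hit, with a database fallback on a miss."""
    cached = cache.get(f"customer:{customer_id}:total")
    if cached is not None:
        return json.loads(cached)
    with db, db.cursor() as cur:
        cur.execute("SELECT total_spend FROM customers WHERE customer_id = %s",
                    (customer_id,))
        row = cur.fetchone()
    if row is None:
        return None
    total = float(row[0])
    cache.set(f"customer:{customer_id}:total", json.dumps(total))
    return total
```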
However, this approach is typically useful when you have repetitive queries. For the AnyCompany use case, most of the queries are ad hoc and dynamic, with dynamic filters, so they decided not to go with the caching approach. Instead, they offloaded the queries to a read replica, which enabled them to maintain consistent performance for their production application without interfering with the core customer application. They achieved an overall 50% improvement in the core API of the production application. Read replicas are key when you have reporting queries that you can offload, so you should be aware of that and utilize it.
Storage Optimization: Choosing the Right EBS Type for Mission-Critical Workloads
Now I would like to talk about storage. In Amazon RDS you have a range of storage options. gp2 ties the number of IOPS to the amount of storage allocated: as you allocate more storage to the volume, you get more IOPS. The ratio is 1 to 3, so for each gigabyte you get 3 IOPS, and a 100-gigabyte volume equals 300 IOPS. With gp3, on the other hand, you have more flexibility: you can allocate IOPS and throughput independently of the storage size.
However, both gp2 and gp3 are best suited for development and testing environments as well as medium-sized workloads. If you have mission-critical applications that require consistently low latency, I typically recommend io1 or io2 Block Express. These are the go-to options for mission-critical applications, with io2 being the more recommended option because of its improved performance and durability. In fact, io2 is the only EBS option in RDS that provides sub-millisecond latency. Additionally, io2 provides five nines (99.999%) of durability, which is 100 times more durable than io1. And here is a low-hanging fruit: we price io2 the same as io1. So if you are using io1 today, you can and should move to io2 to get improved performance and durability for the same price, without downtime and without any workload impact.
Back to AnyCompany for another challenge, and this time it revolves around EBS high latency, as you probably guessed because we're talking about EBS and storage now. As the database scaled, they reached EBS IOPS limits. Essentially they saw a spike from 2 milliseconds IO latency to 500 milliseconds. That's a significant performance hit. This is a scaling issue we occasionally see where high demand could lead to bottlenecks. We need to find a solution that ensures sustained performance under load. The requirement from the company's DevOps team is two things. First, we need to support 20K IOPS, and second, we need to support sub-millisecond IO latency.
Similar to what we saw earlier, CloudWatch Database Insights was useful to identify the issue. If we look at the database telemetry, we can clearly identify that there are two issues here. First, we see an increase in the number of EBS IOPS, and second, in the IO latency, which spiked to roughly 500 milliseconds. That's a clear indication of a problem we need to address. The solution here was to use right-sizing for the storage.
Right-sizing for storage is another important consideration. The company decided to move to io2, which allowed them to achieve consistent sub-millisecond IO latency for the workload.
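For reference, a storage change like this can be scripted; here is a hedged boto3 sketch that moves the instance to io2 with the 20K provisioned IOPS called out in the requirement. The instance identifier is an illustrative assumption.

```python
# Sketch: move the instance's storage to io2 Block Express with 20K provisioned IOPS,
# per the requirements above. The instance identifier is an illustrative assumption.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="anycompany-prod",  # hypothetical name
    StorageType="io2",                       # io2 Block Express
    Iops=20000,                              # provisioned IOPS target from the requirement
    ApplyImmediately=True,                   # storage modification is applied online
)
# The same call with StorageType="io2" also covers the io1-to-io2 move discussed
# earlier, at the same price point.
```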
Accelerating Complex Queries with RDS Optimized Reads
Challenge number 4 involves slow reporting queries running on the read replica. These are queries that use joins, aggregations, and GROUP BY clauses, basically complex queries that generate a lot of temporary objects. The executives at AnyCompany want to see dashboards showing performance data, but the dashboard was taking around 10 seconds to load, which was not acceptable. The requirement was to reduce the dashboard load time to 5 seconds or less.
The root cause for the performance issues is temporary objects written to the EBS volume. These queries generate a lot of temporary objects, and database systems like PostgreSQL try to perform operations like complex sorting in memory. You can tune parameters like work_mem in PostgreSQL, but there are situations where you must write to EBS. When we write to EBS, or in other words, spill to disk, that causes IO latency and impacts the performance of the queries.
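As a complement to the Database Insights view shown next, here is a hedged sketch of checking for those spills from inside PostgreSQL itself, using the pg_stat_database counters and the work_mem setting. The connection details are assumptions.

```python
# Sketch: confirm temp-file spills from inside PostgreSQL, complementing the
# CloudWatch Database Insights view. Connection details are illustrative.
import psycopg2

conn = psycopg2.connect("host=anycompany-replica.example.com dbname=app user=admin")
with conn.cursor() as cur:
    # Cumulative count and volume of temporary files written by this database;
    # steadily growing numbers mean sorts and hashes are spilling to disk.
    cur.execute("""
        SELECT datname, temp_files, temp_bytes
        FROM pg_stat_database
        WHERE datname = current_database()
    """)
    print(cur.fetchone())

    # work_mem bounds how much memory a sort or hash may use before spilling.
    cur.execute("SHOW work_mem")
    print("work_mem =", cur.fetchone()[0])
conn.close()
```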
There is one very elegant way to address this, but before we talk about that, I would like to show you how we identified the issue of the temporary objects. This is the CloudWatch Database Insights view that we saw earlier, and you can see two clear indications of the problem. First, IO:BufFileWrite, represented in purple, is a wait event that occurs whenever RDS creates temporary files. The second indication is the database telemetry: there are dozens of different metrics, and I chose IO files and IO bytes per second. You can clearly see that both of these metrics indicate bottlenecks from excessive temporary writes caused by complex SQL.
So we know we have an issue and we know what the issue is. Now I would like to show you an elegant way to address that. Instead of just increasing to a very large instance with a lot of memory and CPU, why not use a feature called Optimized Reads? Optimized Reads is a feature that allows you to use local NVMe SSD to process these complex temporary objects in the local NVMe SSD instead of EBS. It provides much faster query execution time by utilizing instance types that have local NVMe SSD. These are the instance types with D at the end, such as R8gd or M8gd.
These instance types utilize the local NVMe SSD. Now, instead of writing the temporary objects to EBS, which adds high latency, we use the local NVMe SSD, which can be a much more cost-effective way to address slow reporting queries than moving to a larger and larger instance that costs more money. Optimized Reads is highly recommended for this use case, and I encourage you to use it for similar challenges.
Now I would like to show you a cost comparison so you can choose between a larger instance and Optimized Reads. If we scale vertically from R6g large to R6g xlarge, the compute cost is doubled. Instead, with the same amount of vCPUs and memory but with the d variant, R6gd, you can see that the cost is significantly lower compared to the vertical scaling approach. This provides a way to improve performance without overspending.
By changing the read replica to R6gd large, the instance now comes with local NVMe SSD, which allows a significant performance boost in a much more cost-effective way than scaling vertically. It actually enabled the dashboard to load two times faster, cutting the dashboard load time in half.
Backup Cost Reduction Through Strategic Retention and Export Methods
Now for the last cost dimension we will cover, which is backup. In Amazon RDS, we offer two types of backups: automated backups and database snapshots. Automated backups include daily EBS snapshots and transaction logs that are sent to Amazon S3 every 5 minutes. The benefit of automated backups is that they enable point-in-time recovery, so you can go back to a specific second, with a retention of up to 35 days. Alternatively, if you would like longer retention, say a couple of months or a year, you can use database snapshots. Database snapshots can be taken at any time and they do not expire, so they give you more control over long-term backup retention.
I would like to introduce the last challenge with RDS, and that's high backup costs. AnyCompany faced this challenge because they had a backup retention policy of storing data for up to one month. However, a closer look at the business requirement revealed they don't really need one month of data to be restorable. They only need one week of data to be restorable; any older data, say two or three weeks old, needs to be available only for compliance queries.
Many of my customers say they need a retention period of one or two months, but they only need a certain amount of data to be restorable, maybe a few days, and all the data for compliance requirements needs to be available for queries only. So we can find a solution that will be much more cost-effective than just using automated backups for one month. Let's see a cost example like we did earlier. Before we show the cost example, I would like to show you the different options for longer-term backup methods which are more cost-effective.
Snapshot export to S3 allows you to export full or partial snapshot data in Apache Parquet format. This is an open-source columnar storage format that allows you to query the data very quickly with tools like Amazon Athena, for those of you who are familiar with it. Alternatively, you can run logical backups to Amazon S3 with native tools like pg_dump for PostgreSQL or MySQL Shell dump for MySQL. The benefit of both of these options is that you actually have access to the underlying S3 bucket. With the RDS native backups that we discussed earlier, automated backups and RDS snapshots, you don't have that access; the data is stored in AWS's own backup infrastructure. Here you have control over it, so based on your access pattern you can decide: if this is frequently accessed data, use S3 Standard; if it's longer-term archival data, maybe use S3 Glacier. You have more control over storing it in the right S3 storage class.
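The two steps described above can be scripted; here is a hedged boto3 sketch that trims the automated-backup retention to one week and exports an older snapshot to S3 in Parquet. All identifiers, ARNs, and bucket names are illustrative assumptions.

```python
# Sketch: keep only one week of restorable automated backups, and export an older
# snapshot to S3 in Apache Parquet format for compliance queries with Athena.
# All identifiers, ARNs, bucket and key names are illustrative assumptions.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# 1) Reduce point-in-time-recovery retention from 30 days to 7 days.
rds.modify_db_instance(
    DBInstanceIdentifier="anycompany-prod",
    BackupRetentionPeriod=7,
    ApplyImmediately=True,
)

# 2) Export an older snapshot to S3 as Parquet (queryable with Amazon Athena).
rds.start_export_task(
    ExportTaskIdentifier="anycompany-snapshot-export-2025-11",
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:anycompany-2025-11-01",
    S3BucketName="anycompany-db-archive",
    IamRoleArn="arn:aws:iam::123456789012:role/rds-snapshot-export",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
)
```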
Now I would like to show you the cost example. The original backup plan was 30 days, and we assume that the initial 500-gigabyte full snapshot is free of charge, because with RDS you get free backup storage up to 100% of the database storage size in a region. However, you are charged for each daily incremental backup, meaning the EBS block changes in your database. We can see the calculation now. If we instead keep only one week of data in automated backups and store any older data using the Apache Parquet snapshot export, the total cost can be reduced dramatically. I've seen cases where we achieved a 50% or higher reduction in backup costs just by using this approach and not retaining old snapshots that we don't really need.
So there are ways to optimize backup costs as well, and this strategy provided a 30% reduction in backup costs, demonstrating how efficient backup retention can lead to significant savings while still meeting access and retention requirements. I would like to summarize AnyCompany's RDS journey, and then we will talk about a similar but different journey with Aurora. We started with the challenge of increased workload and CPU spikes due to poorly performing SQL. We did SQL optimization with indexes and instance right-sizing, and by utilizing the latest Graviton generation we achieved a 46% instance cost reduction. Then we addressed the noisy neighbors issue by offloading queries to read replicas.
We managed to maintain consistent performance and reduced the API response time of the core customer-facing application by 50%. We then faced an issue of high latency, with spikes from 2 milliseconds to 500 milliseconds, and we brought it down to sub-millisecond latency by using io2 Block Express. Later, we encountered the issue of slow reporting queries running on the read replica. By utilizing RDS Optimized Reads with the R6gd instance type, the d variant that comes with local NVMe SSD, we managed to improve the dashboard load time by 2x. Lastly, to address the high cost of backups, we used a more cost-effective long-term backup method, which reduced the backup cost by 30%. This was a summary of the journey with RDS. Hopefully, you learned some new techniques for optimizing performance and cost in RDS.
Transitioning to Amazon Aurora: Understanding Storage Architecture and Cost Dimensions
Now I would like to follow AnyCompany's journey again, starting from a similar place but with a different punchline, this time with Aurora. As a reminder, Aurora is our cloud-native database engine. It is fully compatible with MySQL and PostgreSQL, as mentioned in the beginning. It offers commercial-grade reliability and scalability with the cost-effectiveness and simplicity of open-source databases. One of Aurora's key innovations lies in the storage layer, which automatically stores your data across three Availability Zones. For these properties and more, AnyCompany now chooses to use Amazon Aurora.
Last year we celebrated a decade of innovation with Amazon Aurora. Today we have Aurora Provisioned, also known as Aurora instance-based, for predictable workloads. Aurora Serverless is great for spiky and unpredictable workloads; it has an auto-scaling mechanism that grows and shrinks the amount of CPU and memory based on actual usage, without any interruption to running sessions or SQL queries. And Aurora DSQL, the latest addition to the Aurora family, provides virtually unlimited scale and is fully serverless. It can be deployed in a single region or across multiple regions with strong consistency and active-active characteristics, as I mentioned at the beginning of the presentation. Just to set expectations up front, we will be focusing on Aurora Provisioned and Aurora Serverless, because Aurora DSQL has a completely different architecture and pricing model. If you would like to learn about Aurora DSQL, there are great sessions here at re:Invent for that.
So with Aurora, we have the same basic cost model of three dimensions, but they work differently. Let's explore how. Aurora differs most from standard databases in the storage layer, so I would like to break the sequence and start with storage this time; with RDS we started with compute. Aurora separates storage from the database instance, moving it into the Aurora storage fleet. This is a multi-tenant storage fleet spread across three Availability Zones in an AWS region, made up of a large number of special-purpose nodes. These nodes are responsible for storing your data, balancing, repairing, and essentially offloading many other operations from the database instance.
Aurora makes six copies of your data across three Availability Zones, in other words, two copies in each Availability Zone. But you only pay for one copy, so you don't really need to care about the fact that we store six copies of the data. This means that Aurora can handle the unavailability of an entire Availability Zone plus one additional storage node, and that's one of the key features of Aurora. But how does that impact performance and cost?
Let's explore this by walking through some challenges like we did before. The first challenge is growth. We have more customers, but we also have bursty customers. Perhaps their geography wakes up, or maybe there is a new Taylor Swift ticket sale, and suddenly there is a burst of high load. We need a way to handle this automatically, and Aurora will do it for you. This flexibility creates some challenges and cost considerations, so we need to make sure that we have more predictability around cost.
So let's see how we do that. And of course, in addition to predictability, we would like to reduce the cost. That's the most important thing here. Aurora storage is a pay-for-what-you-use system in two storage dimensions: storage size and storage I/O.
Let's start with storage size. You are billed per gigabyte-month, but unlike other options like RDS, you don't need to provision storage in advance, and you don't need to provision I/O or throughput; it can grow and shrink as needed. Because Aurora automatically resizes your storage, you have more control. You can do things like dropping unused or old partitions, and you can use VACUUM in PostgreSQL to keep storage under control. You can monitor usage with the VolumeBytesUsed metric in CloudWatch.
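Here is a hedged sketch of polling that metric with boto3; the cluster identifier is an illustrative assumption.

```python
# Sketch: track the Aurora cluster's billed volume size with the VolumeBytesUsed
# CloudWatch metric mentioned above. The cluster identifier is illustrative.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="VolumeBytesUsed",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "anycompany-aurora"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,                 # one datapoint per day
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    gib = point["Average"] / (1024 ** 3)
    print(point["Timestamp"].date(), f"{gib:.1f} GiB")
```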
Aurora I/O-Optimized: Achieving Predictable Pricing and Cost Savings
In terms of I/O, we charge per 1 million I/O operations. Every page read, whether it's MySQL or PostgreSQL, is counted as one I/O. For writes, Aurora doesn't write pages, it writes logs, chunked into 4-kilobyte units. In both cases, reads or writes, you are billed per 1 million I/O operations. I/O is also automatically scalable; you don't need to define I/O in advance. And I will spoil the surprise by telling you right now that the same model applies to reader nodes as well.
If you add more replicas to an Aurora cluster, they all work against the same shared storage layer, so from that perspective they all operate the same way, and we'll see that in a moment. Let me walk you through a cost example. We have 100 gigabytes of storage and we grow 100 megabytes per day. The read and write I/O operations are 600 ops per second, with 400 reads and 200 writes. Just like in the earlier RDS example, we use an r8g.xlarge instance. You can see the monthly cost for the instance, and we also can see the storage cost per month. I already did the math for the monthly cost, taking into account not just the initial storage size but also the growth rate, and with 10 cents per gigabyte-month, which is the storage pricing in N. Virginia, we get to roughly $300 per month.
We have regular storage I/O, and that leads us to around $207. Here is the challenge: we also have bursty I/O of 15,000 I/O operations per second for two hours every day. That's a big problem to solve, because you can see that it adds $648. That's pretty significant in terms of cost, and it also creates predictability challenges because we cannot predict in advance exactly how much I/O we will use.
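To make the bursty-I/O line item concrete, here is the arithmetic as a small script. The $0.20 per 1 million I/O requests figure is assumed to be the N. Virginia list price for Aurora Standard at the time of the session; verify it against current pricing.

```python
# Back-of-the-envelope arithmetic for the bursty I/O charge described above,
# under Aurora Standard pricing. The $0.20 per 1M I/O rate is an assumption
# (N. Virginia list price at the time of the session).
PRICE_PER_MILLION_IO = 0.20    # USD, Aurora Standard, us-east-1 (assumption)

burst_iops = 15_000            # I/O operations per second during the burst
burst_hours_per_day = 2
days_per_month = 30

burst_ios = burst_iops * burst_hours_per_day * 3600 * days_per_month
burst_cost = burst_ios / 1_000_000 * PRICE_PER_MILLION_IO
print(f"Bursty I/O per month: {burst_ios:,} operations -> ${burst_cost:.0f}")
# -> 3,240,000,000 operations -> $648, matching the example above.
# Under Aurora I/O-Optimized (discussed next), this line item drops to $0.
```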
One way to deal with this issue is to reduce cost by scaling up the instance, which brings more memory to the instance and improves the buffer pool cache hit ratio. By doing that, we are able to use more I/O reads from memory instead of from disk, so that can reduce the I/O cost. In addition to reducing cost, it also improves read latency because now I can read from memory because I have a bigger instance. This is a valid technique that works, and many customers use this technique, but what I would like to show you is what I believe would be a better technique for you.
That would be using Aurora I/O-Optimized. With Aurora I/O-Optimized, there are zero charges for I/O operations, so you pay nothing for any I/O you drive. It also brings you I/O predictability because I/O is no longer a variable; you no longer need to estimate in advance how much you will use. You can save up to 40 percent on costs for I/O-intensive workloads, and you can switch from Aurora Standard storage to Aurora I/O-Optimized once every 30 days and switch back at any time, without any interruption or workload impact.
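The switch itself is a single cluster modification; here is a hedged boto3 sketch, with an illustrative cluster identifier.

```python
# Sketch: switch an existing Aurora cluster to I/O-Optimized storage.
# The cluster identifier is illustrative. You can move to aurora-iopt1 once
# every 30 days and return to standard ("aurora") storage at any time.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_cluster(
    DBClusterIdentifier="anycompany-aurora",
    StorageType="aurora-iopt1",     # Aurora I/O-Optimized; "aurora" is the standard tier
    ApplyImmediately=True,
)
```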
I would like to show you a cost example again, this time using I/O-Optimized. You see that some of the data is grayed out because it's the same: there are no changes to the 30 days and 24 hours. But you can notice that the compute is priced at a roughly 30 percent higher rate with I/O-Optimized. The storage cost is also higher: it used to be 10 cents per gigabyte-month, and now it's 22.5 cents.
So because we pay 22.5 cents per gigabyte-month, both the compute cost and the storage cost are higher. However, and this is where it becomes exciting, you pay nothing for I/O, neither the regular I/O nor the burst I/O. These are completely free. And the outcome, if you combine the costs together, is 23% cost savings: even though you pay more for compute and storage, you still come out ahead because you pay nothing for I/O. This is particularly useful for I/O-intensive applications. By seamlessly switching to I/O-Optimized, AnyCompany managed to achieve 23% cost savings and, as a bonus, got predictable pricing.
Aurora Optimized Reads with Tiered Cache for Enhanced Read Performance
Now AnyCompany has another challenge. They have internal customers running different workloads. It could be write-heavy reporting jobs or batch jobs, and this can cause buffer pool cache overload. So this is another challenge that we need to solve. What they want is a way to reduce the impact of these back-end jobs on the front-end queries, those that are actually important for the business. They would like to improve the overall performance for these read-heavy queries and remain cost effective.
So what we can do here, like we described earlier in the RDS section, is add more replicas to separate the workloads. Aurora supports this very similarly but with a different approach: due to its shared storage architecture, all the read replicas access the same shared storage. We can add replicas, and they all access the same physical storage. What's important to say here is that none of this affects durability; durability is fully handled by the storage layer, regardless of how many replicas we add or remove. There is still one writer node, and the writer keeps the read replicas updated with up to 100 milliseconds of lag.
We talked about Optimized Reads earlier in the context of RDS, but now I would like to talk about it in the context of Aurora. It has the same name and works in a similar way, but it's a bit different. It still uses instance types with local NVMe SSD, and it still stores temporary objects there, but here is the big difference, which is unique to Aurora: the tiered cache. With Aurora, the tiered cache lets you extend the buffer pool into the local NVMe SSD, effectively scaling up the amount of cache available to the instance, and that can deliver up to 8 times improved query performance for read-heavy queries.
Remember, you can adopt this by switching the instance to a d type, for example by doing a failover operation. It's very easy, and a failover in Aurora typically completes with an RTO of up to 30 seconds; you can make it even more seamless with RDS Proxy. The d instance types are available in various configurations. The impact of any caching solution, including Optimized Reads, always depends on the workload's cacheability, which in turn depends on the size of the cache. With Optimized Reads you have great control over that, because a larger instance comes with a larger local NVMe SSD. For example, if you move from R8gd 2xlarge to 4xlarge or 8xlarge, you get a larger local NVMe SSD, which increases the amount of cache you have.
That's one factor. The other is the working set of your application, and you can use CloudWatch metrics to estimate it. So far we've been using an R8g xlarge in the example. I would like to switch to an R8gd xlarge, which gives us roughly 230 gigabytes of tiered cache in the instance. If I needed an equivalent amount of memory from a regular instance, not an Optimized Reads instance, I would need an R8g 12xlarge. Obviously that would be much more expensive: an R8g 12xlarge comes to roughly $6,200, while an R8gd xlarge comes to roughly $580. Now, obviously memory is faster than local NVMe SSD, but if local NVMe meets your performance requirements, it is a much more cost-effective way to address your needs.
Storage size and storage I/Os are the same in this example; the only difference is that we compare an Optimized Reads instance against a regular instance with more memory, and you can see how powerful this can be. By using Aurora Optimized Reads, AnyCompany was able to reduce costs by 90%. That's huge, and it's something I also see with my customers who use this option. It's much better than moving to a larger instance, and the transition is easy: you use a classic Aurora failover operation and you get the desired instance type.
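Here is a hedged boto3 sketch of that transition: add a reader on an NVMe-backed d instance class, then fail over to it. Identifiers and the engine are illustrative assumptions.

```python
# Sketch of the transition described above: add a reader on an Optimized Reads
# (NVMe-backed "d") instance class, then fail over to it. Identifiers and the
# engine are illustrative assumptions.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="anycompany-aurora-r8gd",
    DBClusterIdentifier="anycompany-aurora",
    DBInstanceClass="db.r8gd.xlarge",      # comes with local NVMe SSD for the tiered cache
    Engine="aurora-postgresql",
)

# Once the new instance is available, promote it with a managed failover
# (typically ~30 seconds; RDS Proxy can make this even more seamless for clients).
rds.failover_db_cluster(
    DBClusterIdentifier="anycompany-aurora",
    TargetDBInstanceIdentifier="anycompany-aurora-r8gd",
)
```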
Transforming Development Environments with Fast Clones and Aurora Serverless
Now I would like to move to challenge number 3. AnyCompany has developers who need test environments. They want to test with production-like data, but they obviously don't want to do it on production, because that would be a bad idea. What they have been doing so far is running complex custom ETL processes; the ETL pipeline takes time and costs money, and they would like something much better. Each developer needs their own environment, which drives up the compute and storage costs for development. Also remember that developers don't work 24 hours per day; most environments sit idle overnight, which also wastes money.
Ideally, we would like a solution that can refresh these environments very quickly to improve developer productivity. So we want to reduce storage costs, reduce compute costs, and improve creation time. What we can do is use a feature called fast clones. Has anyone heard about fast clones in Aurora? Some of you have. Fast clones is a technique where the Aurora storage volume underneath your instance is virtually copied using a copy-on-write protocol. That means we don't actually copy the data, only a few pointers. Every day, AnyCompany creates a golden image of production by taking a clone. When you read unmodified data from the clone, you don't pay anything for the clone's storage, because the clone simply follows pointers back to the source volume; only the changes are billed. This is one way to deal with this situation, and it drives down the storage cost. We can create up to 15 clones in a set; here you can see only 2.
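Here is a hedged boto3 sketch of cutting such a copy-on-write clone from the production cluster; identifiers are illustrative assumptions.

```python
# Sketch: create a copy-on-write fast clone of the production cluster, the daily
# "golden image" described above. Identifiers are illustrative assumptions.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="anycompany-dev-clone-1",
    SourceDBClusterIdentifier="anycompany-aurora",
    RestoreType="copy-on-write",        # clone shares unmodified pages with the source
    UseLatestRestorableTime=True,
)
# The clone starts with no compute attached; add an instance (for example the
# Aurora Serverless v2 configuration sketched further below) before connecting.
```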
As I mentioned, you're only charged for the space you actually occupy: when you change blocks, either in the clone or in the source database, you pay for those changes. So if you have 1 terabyte of storage and you only change 10 megabytes of data in the cloned environment, you pay for only 10 megabytes of storage. Obviously, that is much more cost-effective than restoring the entire volume. But we also need to talk about compute, because so far we have only reduced the storage cost of the test environments. The solution for compute is Aurora Serverless. AnyCompany's test environments need good performance when in use, but when they are idle, nobody wants to pay for them. So we can use Aurora's automatic scaling model, Aurora Serverless, which is built from the ground up to instantly scale up and down. It can grow and shrink, and it scales the database based on capacity units, Aurora Capacity Units, or ACUs for short.
Each ACU comes with 2 gigabytes of memory. You define a range of minimum and maximum capacity units, and the database grows and shrinks based on your actual usage. It can grow up to 256 ACUs, which is equivalent to half a terabyte of memory, so it can serve not only the test environments we've been discussing but production-like use cases as well. It's also just another instance in your cluster, so you can mix and match: you can have a provisioned instance as the production primary and a Serverless read replica, which gives you a lot of flexibility. We'll follow the same example with Serverless and clones, and note that both work with Aurora Standard storage and Aurora I/O-Optimized.
Aurora I/O-Optimized has no limitations from that perspective. A team of 20 developers can use the same 100 gigabytes of storage, with a 10% change rate per developer, similar to what we discussed before. By looking at historical patterns, AnyCompany knows that the instances they're currently using are idle for roughly half of an 8-hour shift, and in the other half the workload fits nicely on an R8g large, which has 16 gigabytes of memory, the equivalent of 8 ACUs. If we do the math, the compute cost with provisioned instances is roughly $132 per day, while with Serverless it's roughly $76 per day, which is much more cost-effective.
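For those who want to sanity-check those daily figures, the calculation looks roughly like this; the hourly rates here are illustrative placeholders, not quoted prices, so plug in current N. Virginia pricing for your engine and storage tier.

```python
# Rough reconstruction of the daily compute comparison above for 20 developer
# environments. Hourly rates are illustrative placeholders, not quoted prices.
DEVELOPERS = 20
PROVISIONED_HOURLY = 0.276     # assumed db.r8g.large rate (placeholder)
ACU_HOURLY = 0.12              # assumed Aurora Serverless v2 ACU rate (placeholder)
ACTIVE_HOURS = 4               # half of an 8-hour shift is actually busy
ACUS_WHEN_ACTIVE = 8           # ~16 GiB of memory, matching db.r8g.large

provisioned_per_day = DEVELOPERS * PROVISIONED_HOURLY * 24          # runs around the clock
serverless_per_day = DEVELOPERS * ACUS_WHEN_ACTIVE * ACU_HOURLY * ACTIVE_HOURS
# Idle and overnight hours contribute $0 because Serverless v2 can pause to zero ACUs.

print(f"Provisioned: ${provisioned_per_day:.0f}/day, Serverless: ${serverless_per_day:.0f}/day")
# -> roughly $132/day vs. $77/day, in line with the figures quoted above.
```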
Now if we take into account the storage difference because we are now using clones instead of full volume restorations, we managed to reduce it by 90%, which is a pretty significant improvement. The I/O is the same for both. So how did we do this? First, by using fast clones to create those test databases with a low rate of changes, we managed to reduce the storage cost by 90%. By using Serverless, we managed to reduce the compute cost by 38%. The bonus point is that with Serverless we can scale down to zero. If your database is idle overnight because you're not working in the environment, you pay nothing for it. It's completely paused and you pay $0 for the compute.
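Here is a hedged boto3 sketch of attaching a Serverless v2 instance to the cloned cluster with a small capacity range; the capacity values and the minimum of 0 ACUs (auto-pause, available on recent engine versions) are assumptions chosen to match the example.

```python
# Sketch: give the cloned cluster a Serverless v2 instance so compute scales with
# use and can pause when developers go home. Capacity values are illustrative.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_cluster(
    DBClusterIdentifier="anycompany-dev-clone-1",
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0,     # scale to zero when idle (supported on recent engine versions)
        "MaxCapacity": 8,     # 8 ACUs ~= 16 GiB of memory, per the example above
    },
    ApplyImmediately=True,
)

rds.create_db_instance(
    DBInstanceIdentifier="anycompany-dev-clone-1-instance",
    DBClusterIdentifier="anycompany-dev-clone-1",
    DBInstanceClass="db.serverless",
    Engine="aurora-postgresql",
)
```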
Global Resiliency with Aurora Global Database and Final Recommendations
Now let's quickly talk about the last cost dimension, which is backup, and from that perspective Aurora is pretty similar to RDS, although some points are quite different. Aurora provides built-in continuous, incremental backups. The Aurora storage layer writes the backup data to Amazon S3 with no performance impact, since this is handled entirely by the storage layer and not by the database instance. There is no scheduled maintenance needed for that; backups happen continuously and are current within 5 minutes, so we can always restore to within the latest 5 minutes.
Incremental backups are charged based on the storage needed to restore to any point in time, and backup retention is free if you keep only 1 day of retention. If you want more than 35 days, it's very similar to RDS: database snapshots have no expiration, you can use snapshot exports to Apache Parquet format like we did earlier, you can use logical backups just like with RDS, and you can also optimize the retention period.
Because with Aurora you have more control over the storage size, you can do things like vacuuming in PostgreSQL or dropping unused partitions, and by keeping your volume size in check you also reduce the backup storage cost. Now I would like to look at a challenge that spans all the dimensions. This is the last challenge: global resiliency. Things are looking great, and AnyCompany now has customers all around the globe. They are thinking about creating parallel application stacks in multiple regions, so they would like to support low-latency local reads across regions. If there is a customer in EMEA, they want low local-latency reads; if they have customers in Australia, they want them to have low local-latency reads too.
The other thing is that they also want resiliency across regions. What we can do is use a feature called Aurora Global Database. Aurora Global Database uses storage-based, physical replication to replicate the data, which is much more consistent and much lower latency than logical replication. The typical RPO is up to 1 second, meaning the lag between the primary region and a secondary region is up to 1 second. For AnyCompany, we're going to focus on how Global Database enables local reads while giving flexible price performance.
Here is an Aurora Global Database, shown to illustrate how we can provide a solution with minimal operational overhead. We have one region and one instance. Now we would like to add another region. With the click of a button or a CLI call, we can create this storage-based replication with zero operational overhead. That's where Aurora Global Database comes in: it gives you a very consistent lag between the primary region and the secondary region. Here you see only one secondary region.
But we actually support up to 10 regions. Those of you who use Global Database are probably familiar with the limit of 5 regions, but earlier this year we increased it to 10. Each of those secondary regions can scale independently of the primary region, and remember we talked about replicas earlier: you can create replicas in each of those regions, and Aurora storage takes care of the replication independently. Regardless of how many replicas or regions you use, you can create asymmetric clusters, so each cluster can be different from the others.
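As a reference, here is a hedged boto3 sketch of promoting an existing cluster into a Global Database and attaching a secondary region; the identifiers, account number, regions, and engine are illustrative assumptions.

```python
# Sketch: turn the existing cluster into a Global Database and attach a secondary
# region for local reads. Identifiers, regions, and engine are illustrative.
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")
rds_secondary = boto3.client("rds", region_name="eu-west-1")

# Wrap the existing primary cluster in a global cluster.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="anycompany-global",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:anycompany-aurora",
)

# Add a secondary cluster in another region; the engine and version should match
# the primary. Aurora handles the storage-based replication with no operational overhead.
rds_secondary.create_db_cluster(
    DBClusterIdentifier="anycompany-aurora-eu",
    GlobalClusterIdentifier="anycompany-global",
    Engine="aurora-postgresql",
)

# Readers in the secondary region can be sized independently of the primary.
rds_secondary.create_db_instance(
    DBInstanceIdentifier="anycompany-aurora-eu-reader",
    DBClusterIdentifier="anycompany-aurora-eu",
    DBInstanceClass="db.r8g.large",
    Engine="aurora-postgresql",
)
```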
To summarize, by switching from the previous hand-built solution, which was based on logical replication, to Aurora Global Database, AnyCompany is now able to achieve fast in-region local reads with millisecond latencies instead of hundreds of milliseconds. They are able to optimize costs by using different instance types rather than symmetrical clusters, and as a bonus, they also got disaster recovery with an RPO of 1 second. Now I would like to reach the end of AnyCompany's Aurora journey and do a quick recap. We solved the challenge of storage cost control by using I/O-Optimized, which led to 23% cost savings. We solved the issue of read performance by using Aurora Optimized Reads, which gave us up to 8 times read performance improvement.
We talked about the fact that Aurora has a tiered cache in addition to temporary objects on the local NVMe SSD, and that is unique to Aurora. We had the issue of the exploding cost of test environments, and by combining fast clones and Aurora Serverless, we managed to reduce storage costs by 90% and save a significant amount on compute. Finally, for customers working globally, AnyCompany with Global Database managed to achieve fast local reads with millisecond latencies for global reads, and as a bonus, an RPO of 1 second.
For all of the AnyCompanys out there, you now have the blueprint. You learned about features and techniques in both RDS and Aurora. Cost and performance optimization are not competing priorities; they are actually complementary. You can use powerful tools like CloudWatch Database Insights to monitor and improve your performance, because you cannot fix what you cannot see. Observability is key, and this is why we used CloudWatch Database Insights throughout this session to show you how the right tools can improve performance. What I would like to leave you with is this mission: when you go back home after having a great time at re:Invent, go forth and optimize your RDS and Aurora performance and cost. You can do both at the same time.
Here are some extra recommended sessions if you are interested. You can also see Aurora DSQL at the bottom; I mentioned that we did not cover it today, but if you would like to learn about it, you can go to the related sessions that I find interesting. Please do fill in the survey. We are a data-driven company, so your feedback would be greatly appreciated. Hopefully you learned new things and new techniques about RDS and Aurora performance and cost optimization. Thank you for your time and enjoy re:Invent.
This article is entirely auto-generated using Amazon Bedrock.