Kazuya

AWS re:Invent 2025 - Troubleshooting database performance issues at scale (COP331)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Troubleshooting database performance issues at scale (COP331)

In this video, Joe Alioto, a Senior Specialist Solutions Architect, demonstrates troubleshooting database performance issues at scale through the story of Alex, an SRE lead at Ocktank. The session covers fleet-wide observability across multiple database engines (PostgreSQL, MySQL, SQL Server), using unified monitoring to identify performance bottlenecks. Key features include analyzing database load through honeycomb visualizations, investigating query patterns and execution plans, examining lock trees for blocking scenarios, and leveraging application signals for end-to-end transaction analysis. The presentation showcases how Database Insights and application performance monitoring tools enable rapid problem identification—from pinpointing problematic queries run by specific users to tracing errors in Lambda functions. A troubleshooting workflow is outlined: identify instances from fleet view, analyze queries and locks, review metrics, and utilize application performance monitoring to determine if issues originate upstream or in the database itself.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Meet Alex and the Challenge of Fleet-Wide Database Monitoring

Welcome everybody. Today we're going to talk about troubleshooting database performance issues at scale. But before we dive in, let me paint a picture that might sound familiar. I'd like to ask for a show of hands. When you're managing a large fleet of databases, it can be very difficult to identify where the issue is or even whether it's the database at all. Do teams ever reach out to you saying something like, "Hey, our retail service is slow. We think it might be due to this," or "Our ordering service is slow. We've looked at the front end, and we think it's the database. You need to look at it," or "We're not sure what the issue is"? I know I've run into situations like that for the past twenty-some years.

Thumbnail 60

My name is Joe Alioto, and I'm a Senior Specialist Solutions Architect focused on cloud operations, including observability, monitoring, governance, compliance, and centralized operations management across compute and database resources. I deal with a lot of different areas. Today, we're going to discuss some common customer challenges related to what I just mentioned. We'll explore fleet-wide observability, look at how we can perform end-to-end transaction analysis with application performance monitoring built in, and walk through a common troubleshooting workflow. Additionally, this session will be available on YouTube within the next 48 hours, so you'll be able to review it again.

Thumbnail 90

Thumbnail 110

Before we proceed, let me introduce you to Alex. Alex is an SRE lead at Ocktank and has experience supporting everything operations needs. He's the superhero who actually fixes things, often unsung in the organization. Alex will be our champion today. What Alex commonly encounters is managing a wide variety of databases. His customers have different databases ranging from PostgreSQL to MySQL to Microsoft SQL Server. They have multiple databases from different teams, and he has to manage all of them. A lack of application context has been a persistent problem. When teams say, "Hey, we think there's something wrong with the retail application," and after looking at the application they find nothing wrong, they assume it must be the database. Being able to solve that without spending hours going back and forth trying to figure out what's going on has been a longstanding challenge, and we're going to see how we can simplify that.

Thumbnail 170

Additionally, there's an ever-growing set of tools. Different tools exist for PostgreSQL, MySQL, and Microsoft SQL Server. We use different things to look at the databases and applications. Each tool has very specific characteristics tailored to specific database engines, which makes it complex. So here's Alex's situation: the sales team is reaching out saying, "Our retail service is slow! What's happening?" Alex doesn't know yet, but he's going to take a look and dive in.

Thumbnail 180

What Alex can do is use fleet-wide monitoring that gives unified access across accounts and regions. The key thing is it's across three regions at a time currently due to some limitations, but he's able to see the different alarms and the different average load for the databases, as well as the maximum load for those databases. You can see them in those honeycombs, and if it's high, it's red. If it's really low, it's white, going down that scale. He can save those fleets to different views like the retail prod application view. He can see the top ten instances and think, "Oh, this one has a lot of load. Maybe I need to focus my attention there." Additionally, he can look at the different events, seeing restarts, failures, and different levels of severity that have been built in to understand what's going on, as well as application integration where that prod retail service had 45 percent failures, which might be an issue.

Additionally, there are some top ten metrics you can define by CPU, memory, disk, network I/O, and things that are important to you, and a table view at the bottom where you can see that honeycomb in a table format. This is a great way to start looking across your entire global fleet, identify where the problem is happening, and then dive deeper into the instance view once you've found it. So once you've found that instance, you can start to look at it and get the time span data at the top, where you can see each individual second during the time span. At the bottom, you have the holistic time span, so you can see over the duration we're looking at right now, which is one hour, that most of the load was due to a specific query.
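The fleet view itself is a console feature, but the load signal behind it is the DBLoad metric (average active sessions) that RDS publishes to CloudWatch when Database Insights / Performance Insights is enabled. As a rough sketch of how you could rank a handful of instances yourself, the instance names here are hypothetical:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Hypothetical instance identifiers; replace with your own fleet.
INSTANCES = ["retail-prod-1", "orders-prod-1", "inventory-prod-1"]

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# DBLoad (average active sessions) is published to the AWS/RDS namespace
# when Performance Insights / Database Insights is enabled on the instance.
queries = [
    {
        "Id": f"load{i}",
        "Label": name,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": "DBLoad",
                "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": name}],
            },
            "Period": 300,
            "Stat": "Average",
        },
    }
    for i, name in enumerate(INSTANCES)
]

resp = cloudwatch.get_metric_data(
    MetricDataQueries=queries, StartTime=start, EndTime=end
)

# Rank instances by their average load over the hour, highest first.
ranked = sorted(
    (
        (r["Label"], sum(r["Values"]) / len(r["Values"]) if r["Values"] else 0.0)
        for r in resp["MetricDataResults"]
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, load in ranked:
    print(f"{name}: {load:.2f} average active sessions")
```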

Thumbnail 280

Thumbnail 300

Diving Deep: Analyzing Query Performance, Locks, and Execution Plans

You can view this in different ways, whether you want a line or a bar graph. You can see the maximum vCPU line, which is a general load ceiling for the size of your database; if you have sessions consistently above it, that's generally a performance issue, and you can start to carve it out. But what's really interesting here is who's running the SELECT * FROM orders, and why it's consistently running. We want to start to carve that out. So what we can do is select that query and look at it. Who's running this? It's the dba_intern. Why is he running it, and why is it continuously running? Where is he doing it from? I can see he's doing it from these two hosts.
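The per-instance "top SQL" slicing shown here is backed by the Performance Insights API, which Database Insights builds on. A minimal sketch of pulling database load grouped by SQL text for one instance; the resource identifier is a hypothetical placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

pi = boto3.client("pi", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Identifier is the instance's DbiResourceId (db-XXXXXXXX), not its name.
resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-EXAMPLERESOURCEID",   # hypothetical
    StartTime=start,
    EndTime=end,
    PeriodInSeconds=60,
    MetricQueries=[
        {
            # db.load.avg sliced by SQL text: the "top SQL" view of the console.
            "Metric": "db.load.avg",
            "GroupBy": {"Group": "db.sql", "Limit": 10},
        }
    ],
)

for metric in resp["MetricList"]:
    sql_text = metric["Key"].get("Dimensions", {}).get("db.sql.statement", "(total)")
    peak = max((p["Value"] for p in metric["DataPoints"] if "Value" in p), default=0.0)
    print(f"peak load {peak:.2f} AAS  <-  {sql_text[:80]}")
```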

Thumbnail 310

Thumbnail 320

So I know it's the dba_intern, and I know he's running it from two locations. What application is he using? From there you can see he's using their db-operations tooling, so from that application dimension you can identify the who, the what, the when, and the where of these queries. We haven't answered the why, but this gives you the information you need to reach out to that team and say, "Hey, what's going on? Was this a mistake? Is this something they need to clean up?"

Thumbnail 340

Thumbnail 350

Thumbnail 360

Additionally, you can start to carve it out by the other areas. You want to see which hosts have the most load, which users, which applications are generating the most load on the database. You can do it very easily. You kind of choose how you want to carve that up. Next, let's move on. "Alex, help me." Sales team, "We can't run our reports. We don't know what's going on. Can you help us fix it?"
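Carving the load up by user, host, or application maps to the same API's dimension groups. A sketch, again with a hypothetical resource identifier, that asks which users generate the most load and partitions it by host (dimension names here assume a PostgreSQL engine):

```python
import boto3
from datetime import datetime, timedelta, timezone

pi = boto3.client("pi", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Which users are generating the most load, and from which hosts?
# Swap "db.user" / "db.host" / "db.application" to carve the load up differently.
resp = pi.describe_dimension_keys(
    ServiceType="RDS",
    Identifier="db-EXAMPLERESOURCEID",   # hypothetical DbiResourceId
    StartTime=start,
    EndTime=end,
    Metric="db.load.avg",
    GroupBy={"Group": "db.user", "Limit": 5},
    PartitionBy={"Group": "db.host", "Limit": 5},
)

for key in resp["Keys"]:
    user = key["Dimensions"].get("db.user.name", "unknown")
    print(f"{user}: {key['Total']:.2f} total load")
```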

Alright, so he was able to find the instance. We already looked at the fleet view, but he found the reports, and what he's seeing is there's a lot of locking going on. He's able to use Database Insights for this instance to start looking at the lock analysis or lock table that we have, or this lock tree, and we can start to investigate deeper. We can view the specific details that are important to us, but we can identify things like popular record locking scenarios and all the other common ones.

There are so many different locking scenarios you can run into, but we can start to get an idea of what's running and then we can start to carve it out or slice out the views to get a better understanding of what's actually going on. Right now at the top, we're looking at the waits view. We can see what is causing the different waits, and we can see specifically what that is and where it is. We can start looking at the blocked objects. So if there was a specific single object that's blocking it, you would be able to see that and see it going across, see what the value of that is, and you can just kind of key in on the various aspects here.

Thumbnail 430

Thumbnail 440

Thumbnail 450

Blocking session: if you had a single session holding up a specific locking scenario, you would be able to view it here, see it across the time span, and then identify how to clear it up, which could potentially be an application issue, but we'll look at that later as well. From here, what we can also do is choose the segment we want to key in on; that lock tree is focused on that one area. For this area, we select that one-minute timeframe and then break it down into 15-second increments where we have this specific lock tree. Depending on the locking scenario, you're able to dive really deep to understand what's going on without having to leave the console. You don't have to connect to the database. It's all here for you.
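For reference, the lock tree roughly replaces the kind of query you would otherwise run by hand against the engine. A sketch of that manual equivalent for PostgreSQL, using pg_blocking_pids; the connection details are hypothetical, and the point of the console view is that you no longer need to do this:

```python
import psycopg2

# Hypothetical connection details; the lock tree in the console means you
# don't have to connect like this anymore.
conn = psycopg2.connect("host=reports-db.example.internal dbname=reports user=alex")

BLOCKING_TREE_SQL = """
SELECT blocked.pid       AS blocked_pid,
       blocked.usename   AS blocked_user,
       blocked.query     AS blocked_query,
       blocking.pid      AS blocking_pid,
       blocking.usename  AS blocking_user,
       blocking.query    AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
"""

with conn, conn.cursor() as cur:
    cur.execute(BLOCKING_TREE_SQL)
    for (blocked_pid, blocked_user, blocked_query,
         blocking_pid, blocking_user, blocking_query) in cur.fetchall():
        print(f"pid {blocked_pid} ({blocked_user}) is blocked by pid {blocking_pid} ({blocking_user})")
        print(f"  waiting: {blocked_query[:70]}")
        print(f"  blocker: {blocking_query[:70]}")
```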

Thumbnail 470

Thumbnail 480

Alright, so next: "Help me, Alex." Sales team again; the sales team needs a lot of help. "Our order processing was having issues looking up orders. What happened? It happened sometime earlier today. We're not sure when. It was just having some problems." So what you can do is go take a look. Looking at the specific query that we know is part of that order lookup, we can see there were some issues during this timeframe. But I can't tell much yet. I can see there were two execution plans run. Okay, so I know that something happened, but I only really see one right now.

Thumbnail 510

Thumbnail 520

But if I change the pane to a plan view to slice it by execution plan, I can see there are two plans. I can barely even see the other one. Where is it? The first plan is driving about 8.5 average active sessions, and when I switch to the other one, it's far more efficient. It's down to less than a tenth of an average active session; it was very performant. So now we can go down to the bottom, and it might be a little small, but you can see where we were doing a full table scan on that orders table, and actually it's the September 19th partition. We can see the efficient plan, the yellow one, is doing an index-only scan.
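If you want to cross-check the plan difference by hand, a plain EXPLAIN shows what the planner would choose right now, which you can compare against the captured plans in the console. A sketch with hypothetical connection details, column names, and partition key; Database Insights surfaces the captured plans without requiring any of this:

```python
import psycopg2

# Hypothetical connection string and predicate columns, for illustration only.
conn = psycopg2.connect("host=orders-db.example.internal dbname=orders user=alex")

with conn, conn.cursor() as cur:
    # Plain EXPLAIN does not execute the query; it just reports the chosen plan.
    cur.execute(
        "EXPLAIN SELECT * FROM orders WHERE order_id = %s AND order_date = %s",
        (123456, "2025-09-19"),
    )
    plan_lines = [row[0] for row in cur.fetchall()]

for line in plan_lines:
    print(line)

# A sequential scan on a large partition is the smoking gun from the session;
# an Index Only Scan is the behavior you want to see instead.
if any("Seq Scan" in line for line in plan_lines):
    print(">>> full table scan detected - check partition indexes and statistics")
```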

Thumbnail 550

Thumbnail 570

And what you'll notice is that it's on a different partition, still on the orders table. We can see what the differences are, start to look a little deeper into the cause of that SQL, and understand when and why it happened. We know it was happening on a different partition, but we can see what the specific execution plans were when they happened and dive a bit deeper from there. Next: "Help me, Alex." Application devs, the good old app team. "Our dev databases have been running slow recently. What's the problem?"

Thumbnail 580

Thumbnail 600

Beyond the Database: Custom Metrics, Slow Queries, and End-to-End Transaction Tracing with Application Signals

I don't know, but I'll find out. That's what Alex is going to do. He has these automatic dashboards that are built in with metrics. He's done some of the other analysis already, but what he can do is go into his database telemetry. He has 18 different metrics already built into this dashboard per database engine, whether that's PostgreSQL, MySQL, or whatever engine you're using. You update it once, and the entire fleet has that same view. But we also have hundreds of other metrics. Just search by keyword. Let's look for latency. I want some latency information, so I want my average read and write latency. I'm going to create a new widget specifically for that so I can view it across my entire fleet.

Thumbnail 610

This is going to make it really easy to have those customized metric dashboards you want no matter what instance you look at. You do it once and that's it. We can see that there are more reads there. Let's move on.
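Database Insights has its own widget builder for this, but the same read/write latency view can also be expressed as an ordinary CloudWatch dashboard widget if that's where your team already lives. A sketch with hypothetical dashboard and instance names, using the standard AWS/RDS ReadLatency and WriteLatency metrics:

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical names; ReadLatency/WriteLatency are standard AWS/RDS metrics.
instance = "dev-orders-db-1"
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Average read/write latency",
                "region": "us-east-1",
                "stat": "Average",
                "period": 300,
                "metrics": [
                    ["AWS/RDS", "ReadLatency", "DBInstanceIdentifier", instance],
                    ["AWS/RDS", "WriteLatency", "DBInstanceIdentifier", instance],
                ],
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="dev-database-latency",        # hypothetical
    DashboardBody=json.dumps(dashboard_body),
)
```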

Thumbnail 630

Thumbnail 640

"Help me, Alex." This is the inventory manager this time. "Sometimes it takes a long time to view our inventory. We're not sure what's going on, but we want to know more." So what you can do here: we also have slow query analysis. When is it happening, why is it happening, and where? We can go in and take a look in our database telemetry. If you've noticed, this is all in the same pane of glass so far. We've gone to different tabs, but we haven't changed to a different section. We're now looking at the slow queries.

We can see query patterns, and if we select a specific query pattern, we can see the queries related to it. We can see how many times it's run, the average duration, the 50th percentile, the 95th percentile, and then the max, the ones that are running long. This can help you identify a long-running query that you maybe didn't know was running long, and also find out when it was running long so you can start to analyze what was going on.
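If you want the same percentile math outside the console, one option is to run it directly against the exported PostgreSQL logs with CloudWatch Logs Insights. A sketch that assumes log_min_duration_statement is enabled and the logs are exported to CloudWatch; the log group name is hypothetical:

```python
import time
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client("logs", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

# PostgreSQL writes "duration: <ms> ms  statement: ..." lines when
# log_min_duration_statement is set and the log is exported to CloudWatch.
query = """
parse @message /duration: (?<duration_ms>[0-9.]+) ms/
| filter ispresent(duration_ms)
| stats count() as runs,
        avg(duration_ms) as avg_ms,
        pct(duration_ms, 50) as p50_ms,
        pct(duration_ms, 95) as p95_ms,
        max(duration_ms) as max_ms
"""

qid = logs.start_query(
    logGroupName="/aws/rds/instance/inventory-db/postgresql",  # hypothetical
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=query,
)["queryId"]

# Poll until the query finishes (Logs Insights queries are asynchronous).
while True:
    result = logs.get_query_results(queryId=qid)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```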

Thumbnail 690

Thumbnail 700

"Help me, Alex." Application devs. "Our retail service is having issues. It has to be the database, I know it's the database. I've looked at everything else. There's no way it's not the database." What we have here is built-in integration with Application Signals. We call it calling services: these are the services that call the database. You have to have Application Signals configured on your applications, but then we can see the fault rate of 47% on that retail service. And now we can see why.

Thumbnail 720

What we're going to do is go into Application Signals. This is no longer Database Insights; it jumps us over to the Application Signals service, specifically that retail service we need to investigate. From here, what you're going to see is a dashboard that gives you an idea of the different service level objectives, like availability or latency, against thresholds you define. It can give you some operational audit information out of the box. You didn't do anything; it's just there. It can show you periods of high latency, again just there by default, by looking at the behavior and understanding, hey, this looks out of the ordinary.

Thumbnail 750

Thumbnail 760

Thumbnail 770

From there you could take a look at the service operations. This is generally going to be where the compute's happening. This is happening in Lambda for me. I've built my application to run in Lambda, and I can start to see what's happening here. I don't see any specific errors there yet, but let me look at the dependencies that run within that Lambda function. I have the select operation. I have some other uninstrumented areas that need to be fixed, but I can see the fault rate's high on that select operation.

Thumbnail 780

Thumbnail 790

Thumbnail 800

What's going on with that select operation? What I can do is select the faults and start to view the spans. Spans give us an idea of what's occurring within that entire transaction. I see there are failures, and I see how long they run. I can pop one open and get a trace map with the trace details end to end. The customer comes in and goes through API Gateway. Something's happening with API Gateway, and red has got to be bad. I can see the Lambda function, the service that's running within the Lambda function, and when it's connecting to the database.

Thumbnail 810

Thumbnail 820

Additionally, on the right, I can go down and view the actual timeline of the entire trace. I can view each individual span. I could see where I'm making that PostgreSQL query. I can go directly back to the database from there if I want. I can also view the specific query for that trace that's occurring during that specific span. If you're looking at this, you can see the duration it took to run that call. If that's within normal behavior of what you expect for that application, you know it's not related to your database.
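Application Signals traces are stored in X-Ray, so the same fault traces can also be pulled programmatically. A sketch with a hypothetical service name; treat the exact filter expression as an assumption to adapt to your own naming:

```python
import boto3
from datetime import datetime, timedelta, timezone

xray = boto3.client("xray", region_name="us-east-1")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Pull only the traces that ended in a fault for the retail service
# (service name and filter expression are assumptions for illustration).
paginator = xray.get_paginator("get_trace_summaries")
trace_ids = []
for page in paginator.paginate(
    StartTime=start,
    EndTime=end,
    FilterExpression='service("retail-service") AND fault = true',
):
    for summary in page["TraceSummaries"]:
        trace_ids.append(summary["Id"])

# Fetch the full traces (segments/spans) for the first few fault traces;
# batch_get_traces accepts at most five trace IDs per call.
if trace_ids:
    traces = xray.batch_get_traces(TraceIds=trace_ids[:5])
    for trace in traces["Traces"]:
        print(trace["Id"], f"{trace.get('Duration', 0):.3f}s",
              len(trace["Segments"]), "segments")
```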

Thumbnail 830

Thumbnail 860

In this instance, I was able to find an error further up the stack in the Lambda function, and I can see that line 96 of the Lambda function's code is where the error in the stack trace is being thrown. I know the database call was within the threshold we wanted, but the actual application code threw the error, and the stack trace pointed to the exact line it's at. As long as I know that duration is within normal operating expectations, it helps you point out what could potentially be wrong pretty easily.

Conclusion: A Generic Troubleshooting Workflow and Resources for Database Observability

Alright, thanks Alex. Alex is our superhero. He manages a lot of different things, but through mostly a single pane of glass for his global fleet, he was very easily able to see how to fix those different problems, whether they were related to database load or to execution plans that had flipped to a bad plan. There are a lot of other things you can do as well, like looking at the logs directly without having to change panes and search through them. Alex saved the day.

Thumbnail 890

Thumbnail 900

With what we just went through and all the problems Alex solved, let's talk about a pretty generic troubleshooting workflow. First, identify the database instances and their load from the fleet view. Look at your fleet if you don't know the exact instance you need to go to, take a look at where the problem is supposed to be, and try to identify where there's high load.

Thumbnail 910

Thumbnail 920

Thumbnail 930

From there, once you get to that instance view, dive a little deeper and look at the queries that are being run. Understand where the load is, whether it's related to locks, to execution plans changing, or to whatever else is potentially causing it, or whether there is upstream load occurring that you have to address. Then also review key metrics to help you understand what's going on: block reads and hits per second, how many pages are being read and how those are piling up, average active executions. There are hundreds of metrics, but anything that helps you correlate and understand what that load is and how it's changed over time is valuable.

Thumbnail 960

And then lastly, if you have application performance monitoring enabled for the applications in front of the database, you can very easily see how the end-to-end transaction is running. You can see where the latency is for the queries and whether it's within normal thresholds. Not only can you track down when something is performing poorly, you can also set up alarms on those dependencies not behaving as you expect. This makes it easier to be notified before you actually have an issue on your hands and you're not sure what's going on.
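As a sketch of that "alert before it hurts" idea on the database side (the same pattern applies to Application Signals latency and fault metrics for the dependency itself), here is an alarm on DBLoad staying above an assumed vCPU count, with hypothetical names and thresholds:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical instance and threshold: alarm when average active sessions
# stay above the instance's vCPU count (assumed to be 8 here) for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="retail-prod-db-load-above-vcpus",
    Namespace="AWS/RDS",
    MetricName="DBLoad",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "retail-prod-1"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=8.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Database load exceeded available vCPUs on retail-prod-1",
    # AlarmActions=["arn:aws:sns:...:ops-alerts"],  # hypothetical SNS topic
)
```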

Thumbnail 990

Thumbnail 1010

So with that, I have a few resources here. These are just some real scenarios; go ahead and take a screenshot of this. You can go to that QR code and review some very specific scenarios. There's also the Database Insights documentation, which has detailed sections on execution plan analysis, lock analysis, and some of the other functionality that's in there as well. And you've got to love the cat, super cat. This one is more about observability as a whole, not just Database Insights. If you go to this QR code, it takes you to a GitHub page with our observability best practices, and it has a lot of really good links in there.

Thumbnail 1030

And then the next thing—it's early in re:Invent, the expo hasn't opened yet, but when it does, in the AWS Village we have Cloud Ops kiosks covering everything from Cloud Ops: management, observability, governance and compliance, resilience, AIOps, multicloud, and more. Just come to the expo. I will be at the expo on Wednesday between 12:00 and 4:00 at the AIOps booth as well as the observability booth. If you want to talk more about Database Insights or other observability for your applications, I will be there and happy to talk to you.

Thumbnail 1060

And with that, thank you all so much for attending today. Please remember to do the session survey for COP331. That's how they rate me and let me know if you want me to come back and how we can make this better for you. I hope you all have a wonderful time at re:Invent. Ride the Datadog slide—it's amazing, it's fun, and it's one of my favorite things to do. I do it every year. And check out the expo, check out all the partners, the other sessions, and everything that interests you. I'll be right over there if anyone wants to chat right after this. But with that, thank you all so much for being here, and hopefully I get a chance to talk to you all in just a minute.


; This article is entirely auto-generated using Amazon Bedrock.
