How a few queries were compromising the whole operation of a business.
This situation happened to one of our customers and it shows how important it is to know exactly what is running in your database servers and to keep track of that.
“MeusPedidos is a SaaS application that handle the processing of thousand of orders per second. Those Orders made by Industries, Sales Representative and Distributors can be done by users in their Web App, users of Mobile Apps, third party integrations of a public API and finally by some huge XLS Sheets imported directly in their Web App.”
Overview
Initially, the RDS was running on a “m3.2xlarge” instance. After executing some tests, MeusPedidos’ team found some limitations in the “m3.2xlarge” instance with Multi-AZ (you can see more info about that here). They got in touch with the AWS Support and were told to upgrade to a “m4.2xlarge”.
After the migration, there was a reduction in the frequency and intensity with which instabilities occurred in the system but it was still possible to identify Write Throughput and High DiskQueueDepth spikes with CloudWatch’s monitoring. As you can see in the picture below:
Monitoring with Nazar
Then we started monitoring their application with Nazar to identify the offensive queries responsible for the spikes. After that we could then identify some infrequent queries with a high execution time.
By crossing the time these queries were run and the peak times presented in CloudWatch, we identified the relationship between them.
Cause
When these queries were executed, a relatively large amount of data was manipulated, which generated the need for disk write operations specially for the “Creating sort index” and “Sending data” stages in the execution plan.
Example:
Duration 3.58268150
Creating sort index 2.843240
Sending data 0.721862
Solution
With Nazar it was very easy to identify the 3 queries that were presenting this characteristics and they were optimized by taking the following actions:
- Index creation;
- SQL rewritten;
- Reducing the number of columns in the SELECT list;
Results
After the implementation (04/18/2018) of the optimizations in the 3 queries identified, there was no more Write Throughput and High DiskQueueDepth spikes.
Conclusion
The secret to anticipate that daily performance problems will scale and become critical issues is to observe them continuously and to monitor the cause, not the consequences. Even after upgrading to a higher instance the queries continued to be executed and to compromise the application’s performance.
"The partnership with Nazar was really helpful to tackle down write throughput problems that were haunting us for a very long time." - Israel Fonseca — Senior Software Developer
Top comments (0)