Query Optimization with MySQL

#mysql #optimization #development #software

There is a reason why DBA is a profession, the nuances of a database can lead anyone to insanity and MySQL nuances are no different.

I don’t know these nuances at heart and, as many reading this article, I’m an engineer that works with MySQL, complex queries and large amounts of data on a daily basis. Talking about performance require placing a threshold to understand what is a poorly performant query and for that, I will use MySQL default parameter: 1 second.

Even with larger datasets and a large number of rows, queries can perform under 1 second of time. MySQL query optimization engine is really good and, as long as you know how to properly write a query, it will optimize it to run as fast as it can.

When it comes to making a query perform faster is important to have at least some basic knowledge of how the engine optimization works so one could leverage all the potential and avoid the caveats of the same. Unfortunately, unless you are a DBA is very unlikely that you will know those nuances at heart. Little things like understanding that MySQL runs every portion of the query or every query like it would run a JOIN statement.
Query Construction

To understand how to properly improve the performance of a query you must first understand how to properly break a query into portions. Every SELECT statement will have:

SELECT
    columns
    sub-selects
    functions
    improper aggregations-- main table
    joins
    sub-selects-- where clauses-- aggregation
    groupping, ordering, having and limiting

Columns

Each of these portions can impact the performance of your query in a way or another, but the columns are the only one that might impact less.

The caveat on the columns portion will be if you have a sub-select or a function call. If the performance of that function is poor, then it will impact your query performance. The same can be said for the sub-selects.

The last point that can impact on this portion of the query is improper aggregation. There are a considerable amount of resources from MySQL that performs extremely fast such as TOP, MAX, MIN, etc.

Main-Table, Joins, and sub-selects

The order that you place your joins or the sequence of the table that is being read here will not impact much on the performance, but the number of rows on each table, or on the result of the sub-select, will. Indexes will surely help with this portion of the query as long the whole query can be executed at memory or the needed portion of the indexes for the query is available in memory. At the moment that this information is not available in memory, the database will create a temporary resource to store the results and then filter out.

The best way of handling this is to only bring the tables that you really need for the query. For example:

SELECT f.id, f.name, f.description, f.type_id
  FROM foo f
  JOIN foo_type ft ON ft.id = f.type_id
 WHERE ft.id = 2;

If you are only bringing from the foo table the rows that have the type_id equals 2, why joining the foo_type, especially if that type name is not available. Furthermore, if the type name would be necessary, could this be pre-loaded information that is displayed during run time?

SELECT f.id, f.name, f.description, f.type_id
  FROM foo f
 WHERE f.type_id = 2;

The easy trick here is to ask the questions:

Do I really need this table to achieve the result?
Am I only bringing this table because I might need it?

Where Clauses

Where clauses are interesting when it comes to performance. Too little filtering will cause more rows and therefore a slowdown, but too much filtering will also cause a slowdown. The best way is to find a healthy medium.

For example:

select 
    distinct o.id as order_id
    , round(sum(oi.amount * oi.quantity), 2) as total
  from orders o
  join order_items oi on oi.order_id = o.id
  join order_type ot on ot.id = o.order_type_id
 where o.create_date between '2019-01-01 00:00:00' and '2019-01-01 23:59:59'
   and oi.type_id in (1,2,3)
   and o.type_id = 2
 group by oi.id

That where clause has to filter the date range, the item type, and the order type. If all those tables (fictitious) had a large number of rows (consider over 2 million), that operation could be costly even if the indexes were properly created. With the exception of the range clause, which is expensive by itself, you can play with the other 2 conditions to see which one will perform faster.

For example, if you remove the condition to check for the order type and move it to the SELECT block, it can perform faster.

select 
    distinct o.id as order_id
    , round(sum(if o.type_id != 2, 0, oi.amount * oi.quantity), 2) as total
  from orders o
  join order_items oi on oi.order_id = o.id
  join order_type ot on ot.id = o.order_type_id
 where o.create_date between '2019-01-01 00:00:00' and '2019-01-01 23:59:59'
   and oi.type_id in (1,2,3)
 group by oi.id

The end result of the query above is the same as the end result of the first query, but it performs better because it is doing fewer conditions on the where clause.

The reason why this happens is simple. MySQL will traverse all the required rows from orders and then will match those to the order items, finally matching those to the order type. After this resultset is created, a temporary table or not, it needs to apply each of the conditions. Lesser conditions mean faster especially if you can add these as part of the column generation.

The final query should look like this:

select 
    distinct o.id as order_id
    , round(sum(if o.type_id != 2, 0, oi.amount * oi.quantity), 2) as total
  from orders o
  join order_items oi on oi.order_id = o.id
 where o.create_date between '2019-01-01 00:00:00' and '2019-01-01 23:59:59'
   and oi.type_id in (1,2,3)
 group by oi.id

There is no need to having the order type joined there because it was doing nothing at all and if the justification of joining a table is to ensure that the foreign keys are valid, then, there is a whole new issue that needs to be addressed but that is not query performance.

Ordering, grouping, having and limiting

This last portion of the query can impact the query performance in a few ways. Ordering can be slow if being applied through filesort, the lack of grouping and limiting can bring more results than it is required and, finally, the having clause can be really expensive especially if they are doing mathematical comparison such as

having sum(oi.amount * oi.quantity) > 999.00

Depending on the query, the having clause will need to calculate overall entries and then filter out the result.
Saved by Explain

I know that the default answer for everything when it comes to query optimization is by adding indexes. Well, that is a lie. Indexes are a portion of the optimization that can be applied and if you really want to optimize something then you must understand the results from explain, even better, the results from explain extended.

Executing explain is really simple you just add it in front of the query and let it’s magic tell you what is wrong.

explain extended select 
    distinct o.id as order_id
    , round(sum(if o.type_id != 2, 0, oi.amount * oi.quantity), 2) as total
  from orders o
  join order_items oi on oi.order_id = o.id
 where o.create_date between '2019-01-01 00:00:00' and '2019-01-01 23:59:59'
   and oi.type_id in (1,2,3)
 group by oi.id

From the results that it presents there are 6 columns you need to pay attention to:

type: look for the word ALL. That means that it scanned the whole table.
table: look for something like on it
rows: these are the number of rows traversed to get to the result
filtered: this is the percentage over the number of rows that were filtered to reach the result.
extra: this column can get a bit big, so make sure you are reading it completely, but here look for temporary and filesort.

It does provide a lot more information but, these columns will be your direct indication on why the query is not performing correctly or if there is anything that can be done to improve the query performance. Keep in mind that the number of rows and the percentage filtered should be read together since one is an influencer to the other.
All options exhausted

Suppose that you did all that you could to improve your query performance and all options are now exhausted. Aside from beefing up the server with more power, there is nothing left to do. Well, this will be a direct dependency on the application you have, but there are options.

The following are strategies that can be used when the query optimization is exhausted:

Break the query into smaller blocks and merge them together on the application layer.
Return chunks of the result until there are no more chunks to be returned and merge the chunks together on the application layer. Similar to the first, just different approach.
Create an ETL where the archived data can be stored in a pre-defined format and then the query would be performing 2 operations. One to read from ETL and then another to read live data.
Application layer caching. Similar to ETL but instead storing the results on a Redis instance for example.

The important information here is that, if you can’t optimize the query then you must consider alternatives. There will be situations that the data is just too big to be handled in less than a second.
Resources

I could write till the end of time here and probably not be able to cover everything that is needed with query optimization. This is fairly the very basic of it and there is a lot more that can be achieved. The following resources I found to be essential to optimize queries:

O’Reilly High-Performance MySQL chapter on query optimization goes really deep into the nuances of query optimization and how MySQL optimization query performs.
MySQL Explain output documentation. Doesn’t matter how many times you run that command, this is still essential information.
MySQL Select optimization documentation page. Provides a lot of good information on how MySQL engine optimization will execute and how you can take advantage of it.

This post was originally written at https://bit.ly/2KJwEED