Lawrence Cooke

Posted on Oct 13

Optimizing SQL Queries

#mysql #sql #mariadb

When writing queries, we should always take time to find the best way to write the query.

Sometimes this can mean using methods that on the surface seem like they wouldn't be fast, but actually are.

Query optimization is critical to having an efficient website.

While query optimization also applies to reporting and analytics, queries that run as part of a web service are the ones most noticed by users of your website.

For this article I am using the MySQL test employee database: https://dev.mysql.com/doc/employee/en/

The Schema

CREATE TABLE `employees` (
  `emp_no` int NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`),
  KEY `name` (`first_name`,`last_name`) 
)

CREATE TABLE `salaries` (
  `emp_no` int NOT NULL,
  `salary` int NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `salary` (`emp_no`,`salary`)
)

The salaries table can contain the same employee multiple times, each time an employees salary changes, it's a new row in the salaries table.

The task

The task for this query is to return a unique list of employee number, first_name, last_name who earn over $50,000 a year.

Along with selecting the data, we will need to ensure there are no duplicate employees.

Using DISTINCT

SELECT DISTINCT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
    INNER JOIN salaries USING (emp_no)
WHERE
    salary > 50000

In general, the use of DISTINCT is an indication that the query could be written better.

DISTINCT fetches all the possible rows, and at the end of the query process, strips out duplicate rows it doesn't need.

Distinct is calculated against all selected rows. This can mean that it's possible to return duplicate names in some cases.

An example of when this could occur would be if we included a column where each row for an employee changed, for example salary

SELECT DISTINCT
    employees.emp_no,
    first_name,
    last_name,
    salary
FROM
    employees
    INNER JOIN salaries USING (emp_no)
WHERE
    salary > 50000

Query Execution Plan:

-> Table scan on <temporary>  (cost=241946..245972 rows=321886)
   └─> Temporary table with deduplication  (cost=241946..241946 rows=321886)
      └─> Nested loop inner join  (cost=209757 rows=321886)
         ├─> Filter: (salaries.salary > 50000)  (cost=97097 rows=321886)
         │  └─> Index scan on salaries using salary  (cost=97097 rows=965756)
         └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=0.25 rows=1)

The execution plan shows the use of a temporary table and a high cost. Temporary tables are generally slower queries. They are necessary at times, but if you can find a way to query without the use of a temporary table, it's generally going to be more efficient.

Average response time: 745ms

Using GROUP BY

A common method of ensuring unique users is to use GROUP BY

GROUP BY is generally faster than DISTINCT. It doesn't need that last step of removing duplicates to complete the query plan

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
    INNER JOIN salaries USING(emp_no)
WHERE
    salary > 50000
GROUP BY
    employees.emp_no

Query Execution Plan:

-> Table scan on <temporary>  (cost=241946..245972 rows=321886)
   └─> Temporary table with deduplication  (cost=241946..241946 rows=321886)
      └─> Nested loop inner join  (cost=209757 rows=321886)
         ├─> Filter: (salaries.salary > 50000)  (cost=97097 rows=321886)
         │  └─> Index scan on salaries using salary  (cost=97097 rows=965756)
         └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=0.25 rows=1)

While the GROUP BY is slightly faster than DISTINCT, the execution plan is the same. The difference between them in this case is generally related to the internal query optimizer, query caching etc.

While execution plans are very useful, they don't always give you the whole story of what is going on internally, which leads to subtle differences between queries that might have the same execution plan.

Average response time: 721ms

Using Subquery

While subqueries are often viewed as less efficient, there are times where they can reduce the row count, which can make queries faster.

In this case, we are going to use a subquery to find the employee numbers where salary is over $50,000

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
WHERE
    emp_no IN(
        SELECT
            emp_no FROM salaries
        WHERE
            salary > 50000)

Using this method, the query time drops significantly.

Query Execution Plan:

-> Nested loop inner join  (cost=89029 rows=33961)
   ├─> Remove duplicates from input sorted on salary  (cost=5161 rows=33961)
   │  └─> Filter: (salaries.salary > 50000)  (cost=5161 rows=33961)
   │     └─> Index scan on salaries using salary  (cost=5161 rows=965756)
   └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=80472 rows=1)

Here you will see that the query is no longer using a temporary table, and is using a much simpler plan, with a much lower cost value.

These factors lead to a faster response time.

Average response time: 234ms

While using a subquery significantly improved the query performance, we may be able to achieve better results by using the EXISTS clause, which offers some advantages over the IN statement used in the subquery.

Using EXISTS

When using EXISTS, the query early terminates once it finds a match. In this case, it will early terminate once it has found a specific employee.

While there are multiple rows in the salaries table for an employee, it does not need to continue checking if that specific employee exists if it has found a matching row, so it stops looking for the employee and moves onto looking for the next one.

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
WHERE
    EXISTS (
        SELECT
            1
        FROM
            salaries
        WHERE
            salaries.emp_no = employees.emp_no
            AND salary > 50000)

We use SELECT 1 in this query because EXISTS only returns TRUE or FALSE, not what that the row contains.

While we could use SELECT emp_no or SELECT *, returning a constant makes the intent of the query clearer, and in some cases, can be more efficient.

Query Execution Plan:

-> Nested loop inner join  (cost=89029 rows=33961)
   ├─> Remove duplicates from input sorted on salary  (cost=5161 rows=33961)
   │  └─> Filter: (salaries.salary > 50000)  (cost=5161 rows=33961)
   │     └─> Index scan on salaries using salary  (cost=5161 rows=965756)
   └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=80472 rows=1)

While this query plan is the same as the subquery query plan, the early termination improves the execution time.

Average response time: 220ms

Summary

Distinct: 745ms
Group By: 721ms
Subquery: 234ms
Exists : 220ms

Using subqueries is not always the most efficient querying method, however, in scenarios like this, it can significantly improve your query.

While just changing the query can help fix slow queries, there are other optimizations that could be considered.

Creating better indexes can also help resolve slow queries, but adding indexes should be reserved for times where rewriting the query doesn't help the query to be more efficient.

It's important to try out different query strategies on your own data. While EXISTS was the most efficient strategy when querying this dataset, results may differ on other datasets, so try out a variety of queries and see which one works best for you.

DEV Community

Optimizing SQL Queries

The Schema

The task

Using DISTINCT

Using GROUP BY

Using Subquery

Using EXISTS

Summary

Top comments (0)

Read next

Finally Contributing Actual Code to Open Source

The Great Firewall of Work-Life Balance: Breaking Through Tech Burnout

21. for Loop

VARRAY in PLSQL Functions | LIMIT | COUNT | EXTEND