Lawrence Cooke

Posted on Dec 29, 2023

Finding the best SQL query for the task.

#sql #mysql #mariadb

Each time we write a SQL query, we are faced with having to find a solution that solves a specific problem.

Finding the best query is key to keeping a website, or reporting tools efficient.

I recently needed to find the Median value for a table of salaries with about 1 million rows and tried multiple methods to find the most efficient query to use.

What is Median?

Median is the middle value in a result set.

If you have a result set with three results, the median will be the value of the second result.

If you have as result set with four results, the median will be the average of the second and third values.

It averages the two values out because when there is an even number of results, there is no single middle value.

Window functions

Window functions allow you to perform calculations across either partial or complete result set.

Using the ROW_NUMBER Window function is a handy way to get the sequential row numbers needed to calculate the Median in the result set.

SELECT
 ROUND(AVG(salary))  median_salary 
FROM
    ( SELECT salary, ROW_NUMBER() OVER ( ORDER BY salary ) row_num 
FROM salaries ) t
WHERE
    t.row_num IN (
    FLOOR(((SELECT COUNT(*) count_num FROM salaries) + 1) / 2 ) ,
    CEIL(((SELECT COUNT(*) count_num FROM salaries) + 1) / 2 ) 
    );

Average Time: 1.02 seconds

This is a good base line, it's not bad, but there are probably better ways to do it.

Common Table Expressions (CTEs)

CTEs are temporary result sets that you can reference within your query. While CTEs and Subqueries can serve similar purposes, they differ in a lot of ways.

They can be handy in complex queries to break down the query into its parts, Making the queries more readable. They also offer additional functionality not possible in a subquery. But are they more efficient than using subqueries?

WITH 
get_rows AS 
(
SELECT salary, 
ROW_NUMBER() OVER ( ORDER BY salary ) row_num FROM salaries 
),

count_rows AS 
( SELECT COUNT(*) count_num FROM salaries ) 

SELECT ROUND(AVG( salary )) median_salary 
FROM get_rows 
WHERE row_num IN (
    FLOOR(((SELECT count_num FROM count_rows) + 1) / 2 ),
    CEIL(((SELECT count_num FROM count_rows) + 1) / 2 )
    );

Average Time: 1.02 seconds

No different in terms of time compared to subqueries. There will be cases where there are differences in performance between them, but not in this case.

Temporary Tables

Temporary Tables can be good with large data sets, where you can create a temporary table with data from a query already generated in preparation to run more queries on the data.

CREATE TEMPORARY TABLE get_rows ( 
salary DECIMAL ( 10, 2 ), row_num INT );
CREATE TEMPORARY TABLE count_rows ( count_num INT );

INSERT INTO get_rows ( salary, row_num ) SELECT
salary, ROW_NUMBER() OVER ( ORDER BY salary ) row_num 
FROM salaries;

INSERT INTO count_rows ( count_num ) SELECT COUNT(*) count_num 
FROM salaries;

SELECT ROUND( AVG( salary )) median_salary 
FROM get_rows 
WHERE row_num IN (
        FLOOR(((SELECT count_num FROM count_rows ) + 1) / 2 ),
        CEIL(((SELECT count_num FROM count_rows ) + 1) / 2 ));

DROP TEMPORARY TABLE get_rows;
DROP TEMPORARY TABLE count_rows;

Temporary tables are slightly different than other queries. Two things need to be taken into account.

How long does the whole set of queries take to run?
How long does the query that you need to run on it take?

Average Time: 3.4 seconds (this includes the creation of the temporary table and inserting the data into the table).

Average Query Time: 0.27 seconds.

If this wasn't a one off query, this is certainly a good option to consider. 0.27 seconds for a query is efficient, but as a one off query, the 3.4 seconds it takes to build the temporary table could be inefficient and not the best option.

SQL Variables

Using SQL variables is another alternative method to use.

Variables can be really handy to keep track of data beyond the row currently being read.

In this case, its being used to create row numbers for each row, similar to what ROW_NUMBER() was doing in previous queries.

SET @rownum := - 1;
SELECT
    ROUND(AVG( salary )) AS median_salary 
FROM
    ( SELECT @rownum := @rownum + 1 AS rownum, 
salary FROM salaries ORDER BY salary ) AS t 
WHERE
    t.rownum IN (
    FLOOR( @rownum / 2 ),
    CEIL( @rownum / 2 ));

Average Time: 0.55 seconds

For a one time query, this is the fastest method so far. About twice as fast as the original query.

There is a way to make the SQL variable query more efficient.

Move the variable initialization from the top, to inside the query. It's not as tidy, but it does add efficiency.

It's not always the right choice to make, but in this case, it helped the query.

SELECT
     ROUND(AVG(salary )) median_salary 
FROM
    ( SELECT @rownum := @rownum + 1 rownum, salary 
        FROM salaries,
        (SELECT @rownum := -1) init 
        ORDER BY salary )  t 
WHERE
    rownum IN (
    FLOOR( @rownum / 2 ),
    CEIL( @rownum / 2 ));

Average Time: 0.35 seconds

This would be the query I would choose for this task. 0.35s is efficient, not an unnecessarily complex query, and does the job.

Comparing the queries with EXPLAIN

Comparing the first query with the last query highlights something to look for in your queries.

EXPLAIN for original query

EXPLAIN for final query

While there are a few differences between these queries, one worth highlighting is the internal temporary table used in the original query.

Internal temporary tables are not always avoidable, but they can cause a decrease in efficiency.

The existence of temporary tables (or file sorts) may indicate that better indexing is needed on the table, or as was the case here, the query needed to be improved to avoid the need for the temporary table.

Final Thoughts

Take the time to look and see if there are better ways to write your queries.

Use the EXPLAIN feature to look for possible inefficiencies.

Sometimes your first query will be the right one, but often there are better options, you just need to think outside the box to find them.

While using variables was the best option this time, it's not going to be right for every task. All the methods I tried here serve a purpose and will have their time to shine in a query that you will be asked to design.

DEV Community

Finding the best SQL query for the task.

What is Median?

Window functions

Common Table Expressions (CTEs)

Temporary Tables

SQL Variables

Comparing the queries with EXPLAIN

Final Thoughts

Top comments (0)

Read next

Roast my css code

Adding advanced search to a Strapi and Next.js project with Algolia

Faktor Yang Menunjukan Pendidikan Yang Baik

TweetGagger Plugin: getLikes, getRetweets, getReplies