SQLFlash

Posted on Nov 14

MySQL COUNT Scalar Subquery Optimization: The Complete Guide

#sql #database #mysql #ai

Preface

In modern database applications, achieving efficient query performance is a core challenge for system performance. Developers often use COUNT scalar subqueries for existence checks (e.g., (SELECT COUNT(*) FROM ...) > 0). However, this type of query can trigger MySQL's DEPENDENT SUBQUERY execution plan, leading to significant performance issues: each row from the outer table may trigger a full table scan in the subquery. When data volumes are large, performance degrades sharply as a result of these repeated scans and aggregate calculations for each outer row .

By rewriting the COUNT scalar subquery into an IN subquery, MySQL's SEMI JOIN optimization mechanism can be activated. This changes the execution plan from a Nested Loop to a more efficient Hash Join or index lookup, thereby avoiding redundant full table scans and aggregate calculations . This article will demonstrate the optimization effect through comprehensive test data generation, performance comparison experiments, and execution plan analysis, providing a standardized rewriting procedure .

I. Test Data Generation (Simulating Student Course Selection Scenario)

1. Creating Test Tables

-- Create students table (Student Information)
CREATE TABLE students (
    student_id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    class_id INT NOT NULL
) ENGINE=InnoDB;

-- Create enrollments table (Course Enrollment Records)  
CREATE TABLE enrollments (
    enrollment_id INT AUTO_INCREMENT PRIMARY KEY,
    student_id INT NOT NULL,
    course_id INT NOT NULL,
    enroll_time DATETIME NOT NULL
) ENGINE=InnoDB;

2. Batch Insertion of 5 Million Test Records

-- Insert 500,000 student records (random class 1-100)
SET SESSION cte_max_recursion_depth = 500000;
INSERT INTO students (name, class_id)
WITH RECURSIVE seq AS (
    SELECT 0 AS n
    UNION ALL
    SELECT n+1 FROM seq WHERE n < 499999
)
SELECT
    CONCAT('Student_', n) AS name,
    FLOOR(RAND(n) * 100) + 1 AS class_id
FROM seq;

-- Insert 5 million course enrollment records (random students and courses)
INSERT INTO enrollments (student_id, course_id, enroll_time)
WITH RECURSIVE seq AS (
    SELECT 0 AS n
    UNION ALL
    SELECT n+1 FROM seq WHERE n < 4999999
)
SELECT
    FLOOR(RAND(n) * 500000) + 1 AS student_id,
    FLOOR(RAND(n) * 50) + 1 AS course_id,
    NOW() - INTERVAL FLOOR(RAND(n) * 365) DAY AS enroll_time
FROM seq;

3. Creating Indexes (Accelerating Join Queries)

ALTER TABLE enrollments ADD INDEX idx_student (student_id);

II. Performance Comparison Testing

1. Test Environment

Database Version: MySQL 8.0.18
Hardware Configuration: 4-core CPU/8GB RAM/SSD Storage
Data Volume:
- students table: 500,000 records
- enrollments table: 5,000,000 records

2. Original SQL

SELECT * FROM students WHERE (
    SELECT COUNT(*) FROM enrollments 
    WHERE students.student_id = enrollments.student_id
    AND enrollment_id > 4990000
) > 0;

View detailed report:https://sqlflash.ai/app/sqlResult?shareid=202bcc49-35e3-4bb5-a5de-62a4eb6d6ea3

3. Optimized SQL

We attempted to rewrite using SQLFlash.

The rewritten SQL obtained is as follows:

SELECT * FROM students  
WHERE student_id IN (  
    SELECT student_id  
    FROM enrollments  
    WHERE students.student_id = enrollments.student_id  
    AND enrollment_id > 4990000  
);

4. Test Results

Metric	Original Approach (COUNT Subquery)	Optimized Approach (IN Subquery)	Performance Improvement
Execution Time	4.41 seconds	0.04 seconds	99%
Rows Scanned	~10 million rows	~39,028 rows	99.6%

III. Optimization Principle Analysis

1. SQLFlash Analysis

Compared to the original SQL, the rewritten statement uses the form (order_id, product_id) IN (...), which avoids row-by-row calculation of COUNT(*). This allows the database to more easily recognize it as a filtering subquery, improves readability, and reduces redundant calculation overhead in some cases. At the logical level, compared to the original correlated subquery which required executing an aggregate function for each row of the main query, this rewritten form allows the optimizer to more directly determine the existence of (order_id, product_id), potentially reducing invalid access and iterative processes on the detail table, thereby achieving significant performance improvements in large dataset environments.

2. Execution Plan Comparison

Original SQL Execution Plan

mysql> explain SELECT * FROM students WHERE student_id IN ( SELECT student_id FROM enrollments WHERE students.student_id = enrollments.student_id AND enrollment_id > 4990000 );
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key     | key_len | ref                    | rows  | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+
| 1  | SIMPLE       | <subquery2> | NULL       | ALL    | NULL                | NULL    | NULL    | NULL                   | NULL  | 100.00   | Using where |
| 1  | SIMPLE       | students    | NULL       | eq_ref | PRIMARY             | PRIMARY | 4       | <subquery2>.student_id | 1     | 100.00   | NULL        |
| 2  | MATERIALIZED | enrollments | NULL       | range  | PRIMARY,idx_student | PRIMARY | 4       | NULL                   | 19514 | 100.00   | Using where |
+----+--------------+-------------+------------+--------+---------------------+---------+---------+------------------------+-------+----------+-------------+

Optimized SQL Execution Plan

mysql> explain SELECT * FROM students WHERE ( SELECT COUNT(*) FROM enrollments WHERE students.student_id = enrollments.student_id AND enrollment_id > 4990000 ) > 0;
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+
| id | select_type        | table       | partitions | type | possible_keys       | key         | key_len | ref                    | rows   | filtered | Extra                    |
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+
|  1 | PRIMARY            | students    | NULL       | ALL  | NULL                | NULL        | NULL    | NULL                   | 498986 |   100.00 | Using where              |
|  2 | DEPENDENT SUBQUERY | enrollments | NULL       | ref  | PRIMARY,idx_student | idx_student | 4       | ct.students.student_id |     21 |     0.39 | Using where; Using index |
+----+--------------------+-------------+------------+------+---------------------+-------------+---------+------------------------+--------+----------+--------------------------+

IV. Summary

Through this optimization rule, the following benefits can be achieved:

99% Performance Improvement: Execution time is reduced from 4.41 seconds to 0.04 seconds, and the number of scanned rows drops from 10 million to 40,000, completely eliminating the performance bottleneck of nested loops caused by DEPENDENT SUBQUERY.
Efficient Index Utilization: Avoids back-table queries by directly locating data through indexes, reducing I/O and CPU overhead.
Code Maintainability: Transforms implicit COUNT aggregation logic into explicit IN existence checks, resulting in clearer semantics and reduced maintenance complexity.

DEV Community