DEV Community

Cover image for SQL Optimization Practices Episode 7: GROUP BY Clause Elimination
SQLFlash
SQLFlash

Posted on

SQL Optimization Practices Episode 7: GROUP BY Clause Elimination

Preface: Performance Pitfalls of Redundant GROUP BY

When SQL queries apply GROUP BY to unique columns (like primary keys) without aggregate functions, forced grouping triggers unnecessary sorting and temporary table generation. This report validates performance gains from eliminating redundant GROUP BY through million-row dataset experiments.

1. Experimental Environment & Data Preparation

1.1 Hardware Configuration

Component Specification
CPU 4-core
RAM 8GB
Storage 1TB SSD
MySQL Version 8.0.18

1.2 Table Schema Design

-- Students table (Primary key: student_id)
CREATE TABLE students (
  student_id INT AUTO_INCREMENT PRIMARY KEY,
  student_name VARCHAR(100) NOT NULL,
  date_of_birth DATE,
  gender ENUM('Male','Female','Other'),
  major VARCHAR(100)
) ENGINE=InnoDB;
Enter fullscreen mode Exit fullscreen mode

2. Data Generation Method

SET SESSION cte_max_recursion_depth = 1000000;

INSERT INTO students(student_name, date_of_birth, gender, major)
WITH RECURSIVE seq AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n+1 FROM seq WHERE n<1000000
)
SELECT
  CONCAT('Student_',LPAD(n,7,'0')),
  DATE('2000-01-01')+INTERVAL FLOOR(RAND()*365*20) DAY,
  ELT(FLOOR(RAND()*3)+1,'Male','Female','Other'),
  CONCAT('Major_',FLOOR(RAND()*10)+1)
FROM seq;
Enter fullscreen mode Exit fullscreen mode

3. Performance Comparison Experiment

3.1 Original SQL

SELECT
*
FROM
(
    SELECT
    *
    FROM
    students
    WHERE
    student_id > 100000
    GROUP BY
    student_id
) dt
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

3.2 Optimized SQL

We attempted to rewrite the SQL using SQLFlash.

The rewritten SQL is as follows:

SELECT *
FROM students
WHERE student_id > 100000
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

View detailed report

3.3 Performance Metrics Comparison

To control variables, we conducted a comparative experiment using only the GROUP BY clause elimination rule.

Metric Control Group (with GROUP BY) Experimental Group (without GROUP BY)
Execution Time 0.58s 0.001s
Rows Scanned 1,000,000 500,000

3.4 SQLFlash Analysis

The most significant improvement in the rewritten SQL is the removal of the original GROUP BY student_id operation, which avoids potential temporary table creation and additional sorting. The query now retains only basic primary key range filtering and LIMIT clause, significantly reducing execution steps. Compared to the original approach that required grouping first and then taking the top 10 results, the simplified SQL can directly read data within the qualified index range segment and immediately stop scanning after obtaining sufficient results, eliminating unnecessary grouping processing and potential filesort overhead. In scenarios with large data volumes, this can effectively reduce server-side CPU and I/O burden and accelerate data return speed.

4. In-Depth Execution Plan Analysis

4.1 Control Group Execution Plan

mysql> explain select * from(select * from students where student_id > 100000 group by student_id) dt LIMIT 10;

+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| 1  | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL | 498385 | 100.00   | NULL        |
| 2  | DERIVED     | students   | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 498385 | 100.00   | Using where |
+----+-------------+------------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
Enter fullscreen mode Exit fullscreen mode

Problem Diagnosis: The GROUP BY clause prevents MySQL from performing subquery pushdown optimization, resulting in a secondary full table scan of the students table.

4.2 Experimental Group Execution Plan (without GROUP BY)

mysql> explain select * from(select * from students where student_id > 100000) dt LIMIT 10;

+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
| 1  | SIMPLE      | students | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL | 498412  | 100.00   | Using where |
+----+-------------+----------+------------+-------+---------------+---------+---------+------+---------+----------+-------------+
Enter fullscreen mode Exit fullscreen mode

Optimization Principle: The outer query SELECT * FROM (subquery) dt LIMIT 10 is simplified by the optimizer into a single-layer query that directly operates on the original students table, avoiding secondary scanning.

5. Conclusion

For grouping queries based on primary keys or unique indexes, eliminating redundant GROUP BY clauses is a critical optimization strategy in OLTP scenarios. When combined with index optimization, this approach can resolve over 80% of unnecessary grouping performance issues, ensuring sub-second response times even in high-concurrency environments.

Top comments (0)