DEV Community

Cover image for Stop Using NOT EXISTS? Unveiling the Optimization Secrets of LEFT JOIN Anti-Join
SQLFlash
SQLFlash

Posted on

Stop Using NOT EXISTS? Unveiling the Optimization Secrets of LEFT JOIN Anti-Join

I. Preface

In complex business query scenarios, the need for anti-joins—specifically, "finding records that lack associated entries"—is very common. The traditional LEFT JOIN + HAVING COUNT = 0 approach, while logically straightforward, often generates substantial intermediate results and incurs grouping and aggregation overhead in large-scale data scenarios, leading to performance bottlenecks. This solution is based on a real-world business case. It utilizes a constructed test dataset of approximately 200,000 records to compare the execution plans and performance metrics of two anti-join implementations: LEFT JOIN versus NOT EXISTS. The objective is to validate the significant effectiveness of anti-join optimization techniques in a VIP customer screening scenario and to clarify the applicable boundaries and key implementation points for this optimization.

II. Test Environment Setup

1. Table Structure Design

-- Customer master table (100k records)
CREATE TABLE customers (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL,
    vip_level VARCHAR(20) NOT NULL,
    total_spent DECIMAL(10,2) NOT NULL DEFAULT 0.0,
    INDEX idx_vip_spent (vip_level, total_spent)
) ENGINE=InnoDB;

-- Order detail table (100k records)
CREATE TABLE order_items (
    order_item_id INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    product_category VARCHAR(50) NOT NULL,
    INDEX idx_customer_category (customer_id, product_category)
) ENGINE=InnoDB;
Enter fullscreen mode Exit fullscreen mode

2. Test Data Generation

-- Generate 100,000 customer records
SET cte_max_recursion_depth = 100000;
INSERT INTO customers (customer_name, vip_level, total_spent)
WITH RECURSIVE seq AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
    CONCAT('Customer_', n) AS customer_name,
    CASE
        WHEN n % 3 = 0 THEN 'gold'
        WHEN n % 3 = 1 THEN 'platinum'
        ELSE 'silver'
    END AS vip_level,
    ROUND(3000 + RAND(n) * 7000, 2) AS total_spent
FROM seq;

-- Generate 100,000 order detail records
INSERT INTO order_items (customer_id, product_category)
WITH RECURSIVE seq AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
    FLOOR(1 + RAND(n) * 100000) AS customer_id,
    CASE
        WHEN n % 4 = 0 THEN 'Electronics'
        WHEN n % 4 = 1 THEN 'Clothing'
        WHEN n % 4 = 2 THEN 'Books'
        ELSE 'Food'
    END AS product_category
FROM seq;
Enter fullscreen mode Exit fullscreen mode

III. SQL Optimization

1. Original SQL

SELECT 
    c.customer_id,
    c.customer_name,
    c.vip_level,
    c.total_spent
FROM customers c
LEFT JOIN order_items oi 
    ON c.customer_id = oi.customer_id 
    AND oi.product_category = 'Electronics'
WHERE c.vip_level IN ('gold', 'platinum')
  AND c.total_spent > 5000
GROUP BY c.customer_id, c.customer_name, c.vip_level, c.total_spent
HAVING COUNT(oi.customer_id) = 0;
Enter fullscreen mode Exit fullscreen mode

2. Optimized SQL

We attempted to rewrite the SQL using SQLFlash:

SQLFlash Demo

The rewritten SQL obtained is as follows:

SELECT 
    c.customer_id,
    c.customer_name,
    c.vip_level,
    c.total_spent
FROM customers c
WHERE c.vip_level IN ('gold', 'platinum')
  AND c.total_spent > 5000
  AND NOT EXISTS (
      SELECT 1
      FROM order_items oi
      WHERE oi.customer_id = c.customer_id
        AND oi.product_category = 'Electronics'
  );
Enter fullscreen mode Exit fullscreen mode

View detailed report: https://sqlflash.ai/app/sqlResult?shareid=0cd99623-0c25-464d-bb9b-1f5c976057e5

IV. Performance Analysis

SQLFlash Analysis

According to the analysis provided by SQLFlash, this rewrite fundamentally changes the query execution strategy by converting the anti-join pattern of LEFT JOIN + HAVING COUNT = 0 into a NOT EXISTS subquery. The execution flow before the rewrite required first performing a LEFT JOIN, generating 244,602 rows of intermediate results, then performing GROUP BY aggregation via a temporary table, and finally filtering with HAVING. After the rewrite, the MySQL optimizer materializes the subquery into a temporary table (executed only once, cost 10,084.25). Subsequently, for each eligible customer, it quickly determines whether they exist in the materialized table via hash lookups, avoiding extensive intermediate data processing and grouping overhead. The execution cost was reduced from 50,261.35 to 35,969.60, a 28.43% reduction, with actual execution time improving by 52.14%.

Original Query Plan

mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c LEFT JOIN order_items oi ON c.customer_id = oi.customer_id 
AND oi.product_category = 'Electronics' WHERE c.vip_level IN ('gold', 'platinum')
AND c.total_spent > 5000 GROUP BY c.customer_id, c.customer_name, c.vip_level, 
c.total_spent HAVING COUNT(oi.customer_id) = 0;

+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys          | key                   | key_len | ref                    | rows  | filtered | Extra                        |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
|  1 | SIMPLE      | c     | NULL       | ALL  | idx_vip_spent          | NULL                  | NULL    | NULL                   | 99949 |    58.24 | Using where; Using temporary |
|  1 | SIMPLE      | oi    | NULL       | ref  | idx_customer_category  | idx_customer_category | 206     | c.customer_id,const    |     4 |   100.00 | Using index                  |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
Enter fullscreen mode Exit fullscreen mode

Query Cost: 50,261.35

Execution Time: Average 0.3367 seconds (Minimum 0.2646 seconds, Maximum 0.3731 seconds)

Result Rows: 10,000 rows

Optimized Query Plan

mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c WHERE c.vip_level IN ('gold', 'platinum') AND c.total_spent > 5000
AND NOT EXISTS (SELECT 1 FROM order_items oi WHERE oi.customer_id = c.customer_id 
AND oi.product_category = 'Electronics');

+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
| id | select_type    | table       | partitions | type   | possible_keys          | key                   | key_len | ref               | rows  | filtered | Extra                        |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
|  1 | SIMPLE         | c           | NULL       | ALL    | idx_vip_spent          | NULL                  | NULL    | NULL              | 99949 |    58.24 | Using where                  |
|  1 | SIMPLE         | <subquery2> | NULL       | eq_ref | <auto_distinct_key>    | <auto_distinct_key>   | 5       | c.customer_id     |     1 |   100.00 | Using where; Not exists      |
|  2 | MATERIALIZED   | oi          | NULL       | index  | idx_customer_category  | idx_customer_category | 206     | NULL              | 99800 |   100.00 | Using where; Using index     |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
Enter fullscreen mode Exit fullscreen mode

Performance Metrics Comparison

Metric Original Query Optimized Query
SQL Pattern LEFT JOIN + GROUP BY + HAVING NOT EXISTS Anti-Join
Execution Time 0.3367s 0.1612s
Performance Improvement 52.14%
Query Cost 50,261.35 35,969.60
Cost Reduction 28.43%
Query Level Single-level (includes temp table) Materialized Subquery
Rows Scanned 99,949 (main table) + 244,602 (JOIN result) 99,949 (main table) + 99,800 (materialized table)
Intermediate Results 244,602 rows No JOIN intermediate results needed
Extra Identifier Using temporary Materialized + Not exists
Optimizer Strategy Nested Loop + Group Aggregation Subquery Materialization + Hash Lookup
GROUP BY Overhead Requires temporary table for grouping No grouping required

Optimization Principle Analysis

The essence of NOT EXISTS optimization lies in reducing intermediate data inflation and avoiding group aggregation:

1. Eliminate Redundant Intermediate Results
LEFT JOIN generates all matching intermediate records (240,000 rows in this example), while NOT EXISTS only needs to check for existence, directly filtering without the need to retain intermediate data.

2. Replace Computation-Intensive Operations
Converts the aggregate computation of GROUP BY + HAVING (which requires temporary table sorting/grouping) into simple hash lookups (the materialized table only needs to be built once).

3. Optimize Data Access Patterns
After materialization, the subquery forms an in-memory hash table. The main query only requires constant-time lookups, significantly reducing CPU and I/O costs.

V. Conclusion

In anti-join scenarios for "finding records without associated entries," NOT EXISTS achieves over 50% performance improvement compared to LEFT JOIN + HAVING COUNT = 0. The key points are:

  • Eliminate intermediate result inflation (reducing data volume by 60%)
  • Avoid group aggregation overhead (eliminating temporary table operations)
  • Leverage the optimization strategy of subquery materialization + hash lookup

Development Guideline: Prioritize the NOT EXISTS pattern for anti-join queries, ensure associated fields are indexed, and verify that the execution plan includes the "Materialized" identifier.

Top comments (0)