I. Preface
In complex business query scenarios, the need for anti-joins—specifically, "finding records that lack associated entries"—is very common. The traditional LEFT JOIN + HAVING COUNT = 0 approach, while logically straightforward, often generates substantial intermediate results and incurs grouping and aggregation overhead in large-scale data scenarios, leading to performance bottlenecks. This solution is based on a real-world business case. It utilizes a constructed test dataset of approximately 200,000 records to compare the execution plans and performance metrics of two anti-join implementations: LEFT JOIN versus NOT EXISTS. The objective is to validate the significant effectiveness of anti-join optimization techniques in a VIP customer screening scenario and to clarify the applicable boundaries and key implementation points for this optimization.
II. Test Environment Setup
1. Table Structure Design
-- Customer master table (100k records)
CREATE TABLE customers (
customer_id INT AUTO_INCREMENT PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
vip_level VARCHAR(20) NOT NULL,
total_spent DECIMAL(10,2) NOT NULL DEFAULT 0.0,
INDEX idx_vip_spent (vip_level, total_spent)
) ENGINE=InnoDB;
-- Order detail table (100k records)
CREATE TABLE order_items (
order_item_id INT AUTO_INCREMENT PRIMARY KEY,
customer_id INT NOT NULL,
product_category VARCHAR(50) NOT NULL,
INDEX idx_customer_category (customer_id, product_category)
) ENGINE=InnoDB;
2. Test Data Generation
-- Generate 100,000 customer records
SET cte_max_recursion_depth = 100000;
INSERT INTO customers (customer_name, vip_level, total_spent)
WITH RECURSIVE seq AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
CONCAT('Customer_', n) AS customer_name,
CASE
WHEN n % 3 = 0 THEN 'gold'
WHEN n % 3 = 1 THEN 'platinum'
ELSE 'silver'
END AS vip_level,
ROUND(3000 + RAND(n) * 7000, 2) AS total_spent
FROM seq;
-- Generate 100,000 order detail records
INSERT INTO order_items (customer_id, product_category)
WITH RECURSIVE seq AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT
FLOOR(1 + RAND(n) * 100000) AS customer_id,
CASE
WHEN n % 4 = 0 THEN 'Electronics'
WHEN n % 4 = 1 THEN 'Clothing'
WHEN n % 4 = 2 THEN 'Books'
ELSE 'Food'
END AS product_category
FROM seq;
III. SQL Optimization
1. Original SQL
SELECT
c.customer_id,
c.customer_name,
c.vip_level,
c.total_spent
FROM customers c
LEFT JOIN order_items oi
ON c.customer_id = oi.customer_id
AND oi.product_category = 'Electronics'
WHERE c.vip_level IN ('gold', 'platinum')
AND c.total_spent > 5000
GROUP BY c.customer_id, c.customer_name, c.vip_level, c.total_spent
HAVING COUNT(oi.customer_id) = 0;
2. Optimized SQL
We attempted to rewrite the SQL using SQLFlash:
The rewritten SQL obtained is as follows:
SELECT
c.customer_id,
c.customer_name,
c.vip_level,
c.total_spent
FROM customers c
WHERE c.vip_level IN ('gold', 'platinum')
AND c.total_spent > 5000
AND NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.customer_id = c.customer_id
AND oi.product_category = 'Electronics'
);
View detailed report: https://sqlflash.ai/app/sqlResult?shareid=0cd99623-0c25-464d-bb9b-1f5c976057e5
IV. Performance Analysis
SQLFlash Analysis
According to the analysis provided by SQLFlash, this rewrite fundamentally changes the query execution strategy by converting the anti-join pattern of LEFT JOIN + HAVING COUNT = 0 into a NOT EXISTS subquery. The execution flow before the rewrite required first performing a LEFT JOIN, generating 244,602 rows of intermediate results, then performing GROUP BY aggregation via a temporary table, and finally filtering with HAVING. After the rewrite, the MySQL optimizer materializes the subquery into a temporary table (executed only once, cost 10,084.25). Subsequently, for each eligible customer, it quickly determines whether they exist in the materialized table via hash lookups, avoiding extensive intermediate data processing and grouping overhead. The execution cost was reduced from 50,261.35 to 35,969.60, a 28.43% reduction, with actual execution time improving by 52.14%.
Original Query Plan
mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c LEFT JOIN order_items oi ON c.customer_id = oi.customer_id
AND oi.product_category = 'Electronics' WHERE c.vip_level IN ('gold', 'platinum')
AND c.total_spent > 5000 GROUP BY c.customer_id, c.customer_name, c.vip_level,
c.total_spent HAVING COUNT(oi.customer_id) = 0;
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
| 1 | SIMPLE | c | NULL | ALL | idx_vip_spent | NULL | NULL | NULL | 99949 | 58.24 | Using where; Using temporary |
| 1 | SIMPLE | oi | NULL | ref | idx_customer_category | idx_customer_category | 206 | c.customer_id,const | 4 | 100.00 | Using index |
+----+-------------+-------+------------+------+------------------------+-----------------------+---------+------------------------+-------+----------+------------------------------+
Query Cost: 50,261.35
Execution Time: Average 0.3367 seconds (Minimum 0.2646 seconds, Maximum 0.3731 seconds)
Result Rows: 10,000 rows
Optimized Query Plan
mysql> EXPLAIN SELECT c.customer_id, c.customer_name, c.vip_level, c.total_spent
FROM customers c WHERE c.vip_level IN ('gold', 'platinum') AND c.total_spent > 5000
AND NOT EXISTS (SELECT 1 FROM order_items oi WHERE oi.customer_id = c.customer_id
AND oi.product_category = 'Electronics');
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
| 1 | SIMPLE | c | NULL | ALL | idx_vip_spent | NULL | NULL | NULL | 99949 | 58.24 | Using where |
| 1 | SIMPLE | <subquery2> | NULL | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 5 | c.customer_id | 1 | 100.00 | Using where; Not exists |
| 2 | MATERIALIZED | oi | NULL | index | idx_customer_category | idx_customer_category | 206 | NULL | 99800 | 100.00 | Using where; Using index |
+----+----------------+-------------+------------+--------+------------------------+-----------------------+---------+-------------------+-------+----------+------------------------------+
Performance Metrics Comparison
| Metric | Original Query | Optimized Query |
|---|---|---|
| SQL Pattern | LEFT JOIN + GROUP BY + HAVING | NOT EXISTS Anti-Join |
| Execution Time | 0.3367s | 0.1612s |
| Performance Improvement | — | 52.14% |
| Query Cost | 50,261.35 | 35,969.60 |
| Cost Reduction | — | 28.43% |
| Query Level | Single-level (includes temp table) | Materialized Subquery |
| Rows Scanned | 99,949 (main table) + 244,602 (JOIN result) | 99,949 (main table) + 99,800 (materialized table) |
| Intermediate Results | 244,602 rows | No JOIN intermediate results needed |
| Extra Identifier | Using temporary | Materialized + Not exists |
| Optimizer Strategy | Nested Loop + Group Aggregation | Subquery Materialization + Hash Lookup |
| GROUP BY Overhead | Requires temporary table for grouping | No grouping required |
Optimization Principle Analysis
The essence of NOT EXISTS optimization lies in reducing intermediate data inflation and avoiding group aggregation:
1. Eliminate Redundant Intermediate Results
LEFT JOIN generates all matching intermediate records (240,000 rows in this example), while NOT EXISTS only needs to check for existence, directly filtering without the need to retain intermediate data.
2. Replace Computation-Intensive Operations
Converts the aggregate computation of GROUP BY + HAVING (which requires temporary table sorting/grouping) into simple hash lookups (the materialized table only needs to be built once).
3. Optimize Data Access Patterns
After materialization, the subquery forms an in-memory hash table. The main query only requires constant-time lookups, significantly reducing CPU and I/O costs.
V. Conclusion
In anti-join scenarios for "finding records without associated entries," NOT EXISTS achieves over 50% performance improvement compared to LEFT JOIN + HAVING COUNT = 0. The key points are:
- Eliminate intermediate result inflation (reducing data volume by 60%)
- Avoid group aggregation overhead (eliminating temporary table operations)
- Leverage the optimization strategy of subquery materialization + hash lookup
Development Guideline: Prioritize the NOT EXISTS pattern for anti-join queries, ensure associated fields are indexed, and verify that the execution plan includes the "Materialized" identifier.

Top comments (0)