DEV Community

SQLFlash
SQLFlash

Posted on

SQL Optimization Practices Episode 1: Window Function Performance Comparison and Optimization Strategies

SQL Optimization Practices Episode 1

Introduction

In the era of big data, real-time data processing performance optimization has become the core challenge in system architecture design. This report specifically addresses the performance bottlenecks of window functions in high-frequency TopN query scenarios within monitoring systems. Through empirical testing with tens of millions of alert records, we reveal three critical pain points in traditional full-table sorting approaches: I/O pressure from full data scans, memory contention caused by massive sorting buffers, and computational resource waste from complex execution plans.

Leveraging MySQL 8.0's window function features and indexing mechanisms, we propose a "pre-fetch and post-processing" optimization paradigm. Through SQL logic rewriting and coordinated index strategies, we reduced Top3 query response times from 8 seconds to millisecond-level (10ms), achieving two orders of magnitude performance improvement. This report not only provides directly reusable optimization patterns but also establishes a boundary condition determination model for TopN query optimization, offering methodological guidance for SQL tuning across different business scenarios. All experimental data were validated in standardized testing environments to ensure optimization solutions' universality and reliability.

I. Business Scenario Simulation

Requirements Background

A monitoring system requires real-time display of "latest 3 alert records with global sequence numbers". Original window function implementation faced performance bottlenecks, now validated through SQL rewriting.

Data Characteristics

  • Warning table with timestamp field (10M+ records)
  • Requires absolute accuracy (no duplicate skipping)
  • High-frequency queries (dozens per minute)

II. Test Environment Setup

Table Creation Statement

-- Original table (no indexes)
CREATE TABLE t (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  alert_time DATETIME(6) NOT NULL COMMENT 'Alert Time',
  content VARCHAR(255) NOT NULL
) ENGINE=InnoDB;
Enter fullscreen mode Exit fullscreen mode

Data Generation Method

-- Generate 5M test records
INSERT INTO t (alert_time, content)
SELECT 
  NOW(6) - INTERVAL FLOOR(RAND()*86400 * 365) SECOND,
  CONCAT('Alert-', UUID())
FROM 
  information_schema.columns c1,
  information_schema.columns c2 
LIMIT 5000000;
Enter fullscreen mode Exit fullscreen mode

III. SQL Optimization

Original Test Query

-- Original SQL (full table sort)
SELECT * FROM (
  SELECT 
    ROW_NUMBER() OVER (ORDER BY alert_time DESC) AS rownumber,
    id,
    alert_time,
    content  
  FROM t
) dt WHERE rownumber <= 3;
Enter fullscreen mode Exit fullscreen mode

Optimized Implementation with SQLFlash

SQLFlash Demo

WITH t_topN AS (
  SELECT id, alert_time, content 
  FROM t 
  ORDER BY alert_time DESC 
  LIMIT 3
)
SELECT * FROM (
  SELECT 
    ROW_NUMBER() OVER (ORDER BY alert_time DESC) AS rownumber,
    id,
    alert_time,
    content  
  FROM t_topN
) dt;
Enter fullscreen mode Exit fullscreen mode

Add supporting index:

ALTER TABLE t ADD INDEX idx_at (alert_time);
Enter fullscreen mode Exit fullscreen mode

View Full Report

IV. Performance Comparison

Testing Methodology

  1. Hot cache tests (3 executions averaged)
  2. Monitoring metrics:
    • Execution time (seconds)
    • Scanned rows (Handler_read_next)

Performance Metrics (5M Records)

Metric Original SQL Optimized SQL Optimized SQL (Indexed)
Cached execution time 8s 1.4s 0.01s
Scanned rows 4,975,244 4,975,244 3

Performance Analysis

According to SQLFlash's analysis, the main improvements from this optimization are:

  1. Using CTE with LIMIT 3 to conduct significant result set contraction before applying ROW_NUMBER() function
  2. Compared to original SQL, this avoids window function sorting and numbering on full 1,000,000 rows, reducing CPU and memory usage
  3. Final top 3 results can be determined immediately after data retrieval completion, reducing overall computational overhead and achieving performance improvement

V. Execution Plan Analysis

Original Query Plan

EXPLAIN FORMAT=TREE
SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY alert_time DESC) AS rownumber, id, alert_time, content  
  FROM t
) dt WHERE rownumber <= 3;

-> Window aggregate
   -> Sort: t.alert_time DESC  (cost=503747.65 rows=4975244)
      -> Table scan on t  
         └─➔ Full table scan with filesort
Enter fullscreen mode Exit fullscreen mode

Key Issues:

  1. Missing index causing full table scan
  2. Built a 5,000,000-row sorting buffer in memory
  3. Generated temporary tables to store intermediate results

Optimized Query Plan

EXPLAIN FORMAT=TREE
WITH t_topN AS (...)
SELECT * FROM (...);

-> Window aggregate
   -> Sort: t_topN.alert_time DESC  
      -> Table scan on t_topN  (cost=2.76 rows=3)
         └─➔ Materialized CTE (3 rows)
            -> Limit: 3 row(s)
               -> Index scan on t using idx_at(reverse)  (cost=0.00 rows=3)
Enter fullscreen mode Exit fullscreen mode

Optimization Highlights:

  1. Leverage index to directly locate latest 3 records
  2. Window function calculations limited to 3 rows
  3. Avoid temporary table generation completely

VI. Optimization Boundaries

Valid Optimization Case

-- Original
SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY score DESC) AS rn 
  FROM students
) t WHERE rn <= 5;

-- Optimized
WITH top_students AS (
  SELECT score FROM students ORDER BY score DESC LIMIT 5
)
SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY score DESC) AS rn 
  FROM top_students 
) t;
Enter fullscreen mode Exit fullscreen mode

Invalid Optimization Cases

Dynamic N-value:

-- Fails: N from subquery
SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY sales DESC) AS rn 
  FROM products
) t WHERE rn <= (SELECT top_num FROM config);
Enter fullscreen mode Exit fullscreen mode

Non-consecutive range:

-- Fails: BETWEEN clause
SELECT * FROM (
  SELECT ROW_NUMBER() OVER (ORDER BY create_time) AS rn 
  FROM orders
) t WHERE rn BETWEEN 100 AND 103;
Enter fullscreen mode Exit fullscreen mode

VII. Technical Principles

  1. ​Index Skip Scan​

    • Leverages B+Tree ordered index to directly locate TOP N records
  2. ​Early Data Processing​

    • Completes core filtering during data retrieval phase ​
  3. Memory Optimization​

    • Reduces sorting operations from 10M to 3 rows ​
  4. Execution Plan Stability​

- Complex O(n log n) plan simplified to O(1) complexity
Enter fullscreen mode Exit fullscreen mode




VIII. Conclusion

SQLFlash optimizes TopN queries by:

  • Limiting data early via CTE/LIMIT in initial query phase
  • Eliminating full-table scans through index utilization
  • Reducing window function computation to 3 rows
  • Maintaining query accuracy while achieving 800x performance gains

Recommended reading

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

How to use SQLFlash in a database?

Top comments (2)

Collapse
 
jsgurujobs profile image
JavaScript Jobs

It's very useful. Thank you.

Collapse
 
sqlflash1024 profile image
SQLFlash

So happy to hear that it helps!