In this part, we'll learn how to think about query optimization in PostgreSQL with practical examples, drawing on everything we've covered so far about the different scans, joins, and indexes.
Query Optimization
Query optimization is the process of improving the performance of a query. It involves analyzing and enhancing multiple aspects of how your queries interact with PostgreSQL. While there are many factors to consider, we can break down the optimization process into a few key areas:
- Query Structure and Logic
  - Writing efficient WHERE clauses
  - Using appropriate JOINs
  - Avoiding unnecessary subqueries
  - Proper use of aggregations
- Schema Design and Indexing
  - Creating the right indexes for your queries
  - Understanding when indexes help or hurt
  - Maintaining optimal table statistics
  - Smart choices in data types and constraints
- Data Access Patterns
  - Understanding how your application uses data
  - Identifying frequently accessed columns
  - Analyzing query patterns and frequency
  - Considering caching strategies
The key to successful optimization is taking a methodical approach:
- Measure current performance
- Identify bottlenecks using EXPLAIN ANALYZE
- Make targeted improvements
- Validate the changes
- Repeat if necessary
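In practice, the first two steps boil down to running the query under EXPLAIN ANALYZE. A minimal sketch, using the sales table from the examples below (the optional BUFFERS flag is a standard addition that also reports cache activity):

-- ANALYZE actually executes the query and reports real timings and row counts;
-- BUFFERS additionally shows how many pages came from cache versus disk.
explain (analyze, buffers)
select * from sales where amount > 1090;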
Let's explore these concepts with practical examples...
Example 1:
Get all sales from the last month, together with their employee and department details, where the sale amount is greater than 1090.
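The article doesn't show the underlying DDL, so here is a minimal sketch of the schema these examples assume, reconstructed from the queries and plans (names match the queries; the column types are educated guesses):

create table departments (
    id              int primary key,
    department_name text
);

create table employees (
    id            int primary key,
    department_id int references departments (id),
    salary        numeric                          -- used in Example 2
);

create table sales (
    id          int primary key,
    employee_id int references employees (id),
    amount      numeric,                           -- the plans cast 1090 to numeric
    sale_date   timestamptz
);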
create view latest_extended_sales as
select
    s.id as sale_id,
    s.amount as sale_amount,
    s.sale_date as sale_date,
    e.id as employee_id,
    d.id as department_id,
    d.department_name as department_name
from
    sales s
    join employees e on s.employee_id = e.id
    join departments d on e.department_id = d.id
where
    s.sale_date > now() - interval '1 month'
    and s.amount > 1090;
explain analyze
select * from latest_extended_sales order by sale_id desc;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=268392.55..268402.13 rows=3831 width=538) (actual time=1064.998..1065.159 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=0.57..268164.54 rows=3831 width=538) (actual time=6.135..1063.872 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.42..267542.42 rows=3831 width=22) (actual time=6.126..1060.664 rows=3932 loops=1)
               ->  Seq Scan on sales s  (cost=0.00..260474.20 rows=3831 width=18) (actual time=6.101..1050.484 rows=3932 loops=1)
                     Filter: ((amount > '1090'::numeric) AND (sale_date > (now() - '1 mon'::interval)))
                     Rows Removed by Filter: 9835028
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=3932)
               Index Cond: (id = e.department_id)
 Planning Time: 0.237 ms
 Execution Time: 1065.935 ms
(18 rows)
Let's analyze the query plan to understand where we can make optimizations.
Understanding the Query Plan
The query plan shows several key operations:
- Sequential Scan on the sales table:
  - Scans over 9.8M rows to filter by sale_date and amount
  - Takes 1050ms - the major bottleneck
  - Removes 9.8M rows that don't match the filters
- Index Scans:
  - Uses the index on the employees table to match employee_id
  - Uses the index on the departments table to match department_id
  - These operations are relatively fast
- Overall Performance:
  - Total execution time: 1065ms
  - Most time is spent on the sequential scan (1050ms)
  - The remaining operations only take ~15ms combined
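As a sanity check, you can confirm those row counts directly. One quick way is the pg_stat_user_tables statistics view (a sketch; n_live_tup is an estimate maintained by autovacuum, not an exact count):

-- Approximate live row counts for the three tables in this example.
select relname, n_live_tup
from pg_stat_user_tables
where relname in ('sales', 'employees', 'departments');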
Optimizing the Query
The main bottleneck is the sequential scan on the sales table, which examines over 9.8M rows. The query plan shows this operation takes 1050ms out of the total 1065ms execution time. Since we're filtering on both the sale_date and amount columns, and scanning the full table to apply these filters, we should consider creating appropriate indexes to avoid the costly sequential scan.
postgres=# create index on sales(amount);
CREATE INDEX
Time: 15829.995 ms (00:15.830)
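Note that building this index took almost 16 seconds, during which a plain CREATE INDEX blocks writes to the table. On a busy production system you would typically build it concurrently instead (standard PostgreSQL syntax, not used in the original session):

-- Builds the index without blocking concurrent writes; slower overall,
-- and it cannot run inside a transaction block.
create index concurrently on sales (amount);

With the index in place, re-running the EXPLAIN ANALYZE from before gives: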
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=66342.37..66351.94 rows=3830 width=538) (actual time=340.297..340.749 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=1.01..66114.42 rows=3830 width=538) (actual time=0.204..338.719 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.86..66017.29 rows=3830 width=22) (actual time=0.197..337.137 rows=3932 loops=1)
               ->  Index Scan using sales_amount_idx on sales s  (cost=0.43..58950.51 rows=3830 width=18) (actual time=0.189..311.728 rows=3932 loops=1)
                     Index Cond: (amount > '1090'::numeric)
                     Filter: (sale_date > (now() - '1 mon'::interval))
                     Rows Removed by Filter: 94309
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Memoize  (cost=0.15..0.17 rows=1 width=520) (actual time=0.000..0.000 rows=1 loops=3932)
               Cache Key: e.department_id
               Cache Mode: logical
               Hits: 3922  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
                     Index Cond: (id = e.department_id)
 Planning Time: 0.369 ms
 Execution Time: 340.886 ms
(19 rows)
- 1065ms -> 340ms
- Improvement: 3.1x
Analysis:
- Created an index on the amount column of the sales table
- The query planner now uses this index to filter rows where amount > 1090
- The index scan on the sales table takes 311ms vs 1050ms before - a significant improvement
- However, the sale_date filter is still applied after the index scan, removing 94,309 additional rows
- Total execution time reduced from 1065ms to 340ms (3.1x faster)

This shows the index helps but isn't optimal, since we're still doing post-index filtering on sale_date. Let's create a composite index on both columns to improve this further.
Optimizing the Query Further
postgres=# drop index sales_amount_idx;
postgres=# create index on sales(amount, sale_date);
CREATE INDEX
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12543.60..12553.18 rows=3831 width=538) (actual time=26.234..26.545 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=1.02..12315.59 rows=3831 width=538) (actual time=0.064..24.952 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.86..12218.43 rows=3831 width=22) (actual time=0.058..23.751 rows=3932 loops=1)
               ->  Index Scan using sales_amount_sale_date_idx on sales s  (cost=0.44..5150.22 rows=3831 width=18) (actual time=0.051..12.323 rows=3932 loops=1)
                     Index Cond: ((amount > '1090'::numeric) AND (sale_date > (now() - '1 mon'::interval)))
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Memoize  (cost=0.15..0.17 rows=1 width=520) (actual time=0.000..0.000 rows=1 loops=3932)
               Cache Key: e.department_id
               Cache Mode: logical
               Hits: 3922  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
                     Index Cond: (id = e.department_id)
 Planning Time: 0.379 ms
 Execution Time: 26.683 ms
(17 rows)
- 1065ms -> 26ms
- Improvement: 40x
Analysis:
- Created a composite index on the amount and sale_date columns of the sales table
- The query planner now uses this index to efficiently filter on both conditions: amount > 1090 and sale_date > now() - interval '1 month'
- The index scan on the sales table takes only 12ms vs 1050ms before
- The additional index scans on the employees and departments tables are very fast (<1ms) thanks to their primary key indexes
- Total execution time reduced dramatically from 1065ms to 26ms
- Overall improvement: 40x faster query performance
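A side note on column order (an assumption worth testing, not something the original benchmarks): with two range predicates like these, only the first column of a composite B-tree index determines where the scan starts and stops; the second column is filtered within the index. If the date filter is more selective than the amount filter in your data, the reverse order may scan fewer index entries:

-- Alternative column order (hypothetical): test both with EXPLAIN ANALYZE,
-- since the better choice depends on which predicate is more selective.
create index on sales (sale_date, amount);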
Example 2:
Get the employee with the highest salary for each department.
create index on employees(department_id, salary);

explain analyze
with cte as (
    select department_id, max(salary) as m_salary
    from employees
    group by department_id
), cte2 as (
    select
        d.department_name,
        e.id,
        e.salary
    from
        employees e
        join cte on cte.department_id = e.department_id and e.salary = cte.m_salary
        join departments d on d.id = cte.department_id
)
select * from cte2 order by department_name;
                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=23358.94..23358.94 rows=1 width=528) (actual time=243.129..243.131 rows=10 loops=1)
   Sort Key: d.department_name
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=23334.57..23358.93 rows=1 width=528) (actual time=243.061..243.119 rows=10 loops=1)
         Join Filter: (employees.department_id = d.id)
         ->  Nested Loop  (cost=23334.42..23358.75 rows=1 width=20) (actual time=243.055..243.105 rows=10 loops=1)
               ->  HashAggregate  (cost=23334.00..23334.10 rows=10 width=36) (actual time=243.036..243.039 rows=10 loops=1)
                     Group Key: employees.department_id
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on employees  (cost=0.00..18334.00 rows=1000000 width=12) (actual time=0.006..51.934 rows=1000000 loops=1)
               ->  Index Scan using employees_department_id_salary_idx on employees e  (cost=0.42..2.44 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=10)
                     Index Cond: ((department_id = employees.department_id) AND (salary = (max(employees.salary))))
         ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
               Index Cond: (id = e.department_id)
 Planning Time: 0.236 ms
 Execution Time: 243.167 ms
(16 rows)
Let's break down what this query does:
- First, find the highest salary in each department
- Then, get the details of the employee who earns that maximum salary
Understanding the Query Plan
Let's analyze the execution plan:
- The query starts with a sequential scan on the employees table (which has 1M rows, per rows=1000000 in the plan)
  - This scans the ENTIRE table to compute the max salary per department
  - Very expensive: this one operation accounts for most of the 243ms execution time
- For each department's max salary:
  - Uses an index scan on employees to find the matching employee
  - Joins with the departments table to get department names
- Performance issues:
  - Total execution time: 243ms
  - The sequential scan is extremely inefficient
  - Indexes are not being used effectively
While this may not seem slow, in the database world it is a disaster.
This query performance is problematic for a production database:
- Sequentially scanning a million rows is resource-intensive
- The response time is too high for interactive applications
- It could cause bottlenecks under concurrent load (such queries are usually spotted as sketched below)
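In a production system, expensive queries like this are typically identified with the pg_stat_statements extension (a sketch, assuming the extension is installed and preloaded; column names are from PostgreSQL 13+):

-- Top 5 queries by cumulative execution time.
select query, calls, mean_exec_time
from pg_stat_statements
order by total_exec_time desc
limit 5;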
The key question is: Can we restructure this query to:
- Avoid the full table scan?
- Make better use of our indexes?
- Process only the rows we actually need?
Optimizing the Query
explain analyze
select
    d.department_name,
    e.id,
    e.salary
from
    departments d
    join lateral (
        select id, salary
        from employees
        where department_id = d.id
        order by salary desc
        limit 1
    ) e on true
order by department_name;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=87.44..87.79 rows=140 width=528) (actual time=0.085..0.086 rows=10 loops=1)
   Sort Key: d.department_name
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.42..82.45 rows=140 width=528) (actual time=0.016..0.076 rows=10 loops=1)
         ->  Seq Scan on departments d  (cost=0.00..11.40 rows=140 width=520) (actual time=0.003..0.004 rows=10 loops=1)
         ->  Limit  (cost=0.42..0.49 rows=1 width=12) (actual time=0.007..0.007 rows=1 loops=10)
               ->  Index Scan Backward using employees_department_id_salary_idx on employees  (cost=0.42..6251.68 rows=100000 width=12) (actual time=0.006..0.006 rows=1 loops=10)
                     Index Cond: (department_id = d.id)
 Planning Time: 0.101 ms
 Execution Time: 0.101 ms
(10 rows)
- 243.167ms -> 0.101ms
- Improvement: 2400x
Let's analyze the query plan:
- Departments Table Scan
  - A sequential scan is performed on the departments table
  - This is acceptable since the table only has 10 rows
  - Cost is minimal at 0.003-0.004ms
- Employees Table Index Scan
  - For each department, an index scan is done on the employees table
  - The scan is backward (descending) to get the highest salary first
  - The LIMIT 1 ensures we only get the top earner
  - Very efficient: 0.006-0.007ms per department
- Overall Performance
  - Total execution time: 0.101ms
  - Compared to the original 243.167ms
  - Represents a 2400x improvement
  - Shows the power of proper indexing and query structure
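As an aside, DISTINCT ON is another common PostgreSQL idiom for this top-row-per-group pattern. A sketch (not from the original article; it generally reads far more of the index than the lateral version, since PostgreSQL has no skip scan, so benchmark before preferring it for readability):

-- One row per department: DISTINCT ON keeps the first row of each group
-- in ORDER BY order, i.e. the highest salary per department_id here.
select distinct on (e.department_id)
    d.department_name,
    e.id,
    e.salary
from
    employees e
    join departments d on d.id = e.department_id
order by
    e.department_id, e.salary desc;

Note the result is ordered by department_id; wrap it in a subquery if you need it ordered by department_name as in the lateral version.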
Key Takeaways
- Query Optimization Process
  - Start with EXPLAIN ANALYZE to understand the execution plan
  - Identify the most expensive operations
  - Look for opportunities to use indexes effectively
  - Restructure the query to minimize resource usage
- Important Query Plan Concepts
  - Sequential Scan: full table scan
  - Index Scan: using an index to access data
  - Nested Loop: row-by-row join operation
  - Hash Join: builds a hash table for joining
  - Merge Join: merges pre-sorted inputs
  - LIMIT: restricts the number of rows returned
- Success Metrics
  - Reduced execution time
  - Better resource utilization
  - Proper use of available indexes
  - Scalable query structure
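One last point, tying back to "maintaining optimal table statistics" above: every rows=... estimate in the plans we examined comes from statistics gathered by ANALYZE. Autovacuum normally keeps them fresh, but after a bulk load it's worth refreshing them manually so the planner isn't working from stale numbers:

-- Refresh planner statistics for one table; plain ANALYZE covers all tables.
analyze sales;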