Aditya Agrawal · Originally published at adiagr.com

Navigating PostgreSQL - Query Optimizations

In this part, we'll learn how to think about query optimization in PostgreSQL through practical examples, building on what we've already covered about scans, joins, and indexes.

Query Optimization

Query optimization is the process of improving the performance of a query. It involves analyzing and enhancing multiple aspects of how your queries interact with PostgreSQL. While there are many factors to consider, we can break down the optimization process into a few key areas:

  1. Query Structure and Logic

    • Writing efficient WHERE clauses
    • Using appropriate JOINs
    • Avoiding unnecessary subqueries
    • Proper use of aggregations
  2. Schema Design and Indexing

    • Creating the right indexes for your queries
    • Understanding when indexes help or hurt
    • Maintaining optimal table statistics
    • Smart choices in data types and constraints
  3. Data Access Patterns

    • Understanding how your application uses data
    • Identifying frequently accessed columns
    • Analyzing query patterns and frequency
    • Considering caching strategies

The key to successful optimization is taking a methodical approach:

  1. Measure current performance
  2. Identify bottlenecks using EXPLAIN ANALYZE (see the sketch just after this list)
  3. Make targeted improvements
  4. Validate the changes
  5. Repeat if necessary
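For step 2, EXPLAIN ANALYZE is the workhorse. A minimal sketch, using the sales table from the first example below as a stand-in; the BUFFERS option additionally reports shared-buffer hits and reads, which helps separate CPU work from I/O:

-- Minimal sketch: time the query and report buffer usage.
-- The table and filter here are placeholders for whatever you're measuring.
explain (analyze, buffers)
select *
from sales
where amount > 1090;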

Let's explore these concepts with practical examples...

Example 1:

Get all sales from the last month with an amount greater than 1090, along with the matching employee and department.
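Before looking at the query, here's the rough schema these examples assume. The column names come from the queries below; the exact types and constraints are an assumption, not something stated in the original setup:

-- Assumed schema, inferred from the columns the queries reference.
-- Types and constraints are illustrative only.
create table departments (
    id              bigint primary key,
    department_name text
);

create table employees (
    id            bigint primary key,
    department_id bigint references departments(id),
    salary        numeric
);

create table sales (
    id          bigint primary key,
    employee_id bigint references employees(id),
    amount      numeric,
    sale_date   timestamptz
);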

create view latest_extended_sales as
select
    s.id as sale_id,
    s.amount as sale_amount,
    s.sale_date as sale_date,
    e.id as employee_id,
    d.id as department_id,
    d.department_name as department_name
from
    sales s
    join employees e on s.employee_id = e.id
    join departments d on e.department_id = d.id
where
    s.sale_date > now() - interval '1 month'
    and s.amount > 1090
;

explain analyze 
select * from latest_extended_sales order by sale_id desc;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=268392.55..268402.13 rows=3831 width=538) (actual time=1064.998..1065.159 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=0.57..268164.54 rows=3831 width=538) (actual time=6.135..1063.872 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.42..267542.42 rows=3831 width=22) (actual time=6.126..1060.664 rows=3932 loops=1)
               ->  Seq Scan on sales s  (cost=0.00..260474.20 rows=3831 width=18) (actual time=6.101..1050.484 rows=3932 loops=1)
                     Filter: ((amount > '1090'::numeric) AND (sale_date > (now() - '1 mon'::interval)))
                     Rows Removed by Filter: 9835028
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=3932)
               Index Cond: (id = e.department_id)
 Planning Time: 0.237 ms
 Execution Time: 1065.935 ms
(18 rows)

Let's analyze the query plan to understand where we can make optimizations.

Understanding the Query Plan

The query plan shows several key operations:

  1. Sequential Scan on sales table:

    • Scans over 9.8M rows to filter by sale_date and amount
    • Takes 1050ms - the major bottleneck
    • Removes 9.8M rows that don't match filters
  2. Index Scans:

    • Uses index on employees table to match employee_id
    • Uses index on departments table to match department_id
    • These operations are relatively fast
  3. Overall Performance:

    • Total execution time: 1065ms
    • Most time spent on sequential scan (1050ms)
    • Remaining operations only take ~15ms combined

Optimizing the Query

The main bottleneck is the sequential scan on the sales table that examines over 9.8M rows. The query plan shows this operation takes 1050ms out of the total 1065ms execution time. Since we're filtering on both sale_date and amount columns, and scanning the full table to apply these filters, we should consider creating appropriate indexes to avoid the costly sequential scan.

postgres=# create index on sales(amount);
CREATE INDEX
Time: 15829.995 ms (00:15.830)

explain analyze
select * from latest_extended_sales order by sale_id desc;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=66342.37..66351.94 rows=3830 width=538) (actual time=340.297..340.749 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=1.01..66114.42 rows=3830 width=538) (actual time=0.204..338.719 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.86..66017.29 rows=3830 width=22) (actual time=0.197..337.137 rows=3932 loops=1)
               ->  Index Scan using sales_amount_idx on sales s  (cost=0.43..58950.51 rows=3830 width=18) (actual time=0.189..311.728 rows=3932 loops=1)
                     Index Cond: (amount > '1090'::numeric)
                     Filter: (sale_date > (now() - '1 mon'::interval))
                     Rows Removed by Filter: 94309
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Memoize  (cost=0.15..0.17 rows=1 width=520) (actual time=0.000..0.000 rows=1 loops=3932)
               Cache Key: e.department_id
               Cache Mode: logical
               Hits: 3922  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
                     Index Cond: (id = e.department_id)
 Planning Time: 0.369 ms
 Execution Time: 340.886 ms
(19 rows)
  • 1065ms -> 340ms
  • Improvement: 3.1x

Analysis:

  • Created an index on the amount column of the sales table
  • The query planner now uses this index to filter rows where amount > 1090
  • Index scan on sales table takes 311ms vs 1050ms before - a significant improvement
  • However, the sale_date filter is still applied after the index scan, removing 94,309 additional rows
  • Total execution time reduced from 1065ms to 340ms (3.1x faster)

This shows the index helps but isn't optimal since we're still doing post-index filtering on sale_date. Let's create a composite index on both columns to improve this further.


Optimizing the Query Further

postgres=# drop index sales_amount_idx;
postgres=# create index on sales(amount, sale_date);
CREATE INDEX

explain analyze
select * from latest_extended_sales order by sale_id desc;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12543.60..12553.18 rows=3831 width=538) (actual time=26.234..26.545 rows=3932 loops=1)
   Sort Key: s.id DESC
   Sort Method: quicksort  Memory: 404kB
   ->  Nested Loop  (cost=1.02..12315.59 rows=3831 width=538) (actual time=0.064..24.952 rows=3932 loops=1)
         ->  Nested Loop  (cost=0.86..12218.43 rows=3831 width=22) (actual time=0.058..23.751 rows=3932 loops=1)
               ->  Index Scan using sales_amount_sale_date_idx on sales s  (cost=0.44..5150.22 rows=3831 width=18) (actual time=0.051..12.323 rows=3932 loops=1)
                     Index Cond: ((amount > '1090'::numeric) AND (sale_date > (now() - '1 mon'::interval)))
               ->  Index Scan using employees_pkey on employees e  (cost=0.42..1.85 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=3932)
                     Index Cond: (id = s.employee_id)
         ->  Memoize  (cost=0.15..0.17 rows=1 width=520) (actual time=0.000..0.000 rows=1 loops=3932)
               Cache Key: e.department_id
               Cache Mode: logical
               Hits: 3922  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 2kB
               ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
                     Index Cond: (id = e.department_id)
 Planning Time: 0.379 ms
 Execution Time: 26.683 ms
(17 rows)
  • 1065ms -> 26ms
  • Improvement: 40x

Analysis:

  • Created a composite index on amount and sale_date columns of the sales table
  • The query planner now uses this index to efficiently filter both conditions:
    • amount > 1090
    • sale_date > now() - 1 month
  • Index scan on sales table takes only 12ms vs 1050ms before
  • Additional index scans on employees and departments tables are very fast (<1ms) due to primary key indexes
  • Total execution time reduced dramatically from 1065ms to 26ms
  • Overall improvement: 40x faster query performance
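One more thing worth noting after a change like this: because the composite index leads with amount, it can also serve queries that filter on amount alone, which is why dropping the old single-column index is reasonable. To double-check what is now defined on the table, the pg_indexes catalog view lists every index and its definition:

-- Sanity check: list the indexes currently defined on the sales table
select indexname, indexdef
from pg_indexes
where tablename = 'sales';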

Example 2:

Get the employee with the highest salary for each department.

create index on employees(department_id, salary);


explain analyze
with cte as (
    select department_id, max(salary) m_salary
    from employees
    group by department_id
), cte2 as (
    select 
        d.department_name, 
        e.id, 
        e.salary
    from
        employees e
        join cte on cte.department_id=e.department_id and e.salary=cte.m_salary
        join departments d on d.id=cte.department_id
) select * from cte2 order by department_name
;

Sort  (cost=23358.94..23358.94 rows=1 width=528) (actual time=243.129..243.131 rows=10 loops=1)
   Sort Key: d.department_name
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=23334.57..23358.93 rows=1 width=528) (actual time=243.061..243.119 rows=10 loops=1)
         Join Filter: (employees.department_id = d.id)
         ->  Nested Loop  (cost=23334.42..23358.75 rows=1 width=20) (actual time=243.055..243.105 rows=10 loops=1)
               ->  HashAggregate  (cost=23334.00..23334.10 rows=10 width=36) (actual time=243.036..243.039 rows=10 loops=1)
                     Group Key: employees.department_id
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on employees  (cost=0.00..18334.00 rows=1000000 width=12) (actual time=0.006..51.934 rows=1000000 loops=1)
               ->  Index Scan using employees_department_id_salary_idx on employees e  (cost=0.42..2.44 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=10)
                     Index Cond: ((department_id = employees.department_id) AND (salary = (max(employees.salary))))
         ->  Index Scan using departments_pkey on departments d  (cost=0.14..0.16 rows=1 width=520) (actual time=0.001..0.001 rows=1 loops=10)
               Index Cond: (id = e.department_id)
 Planning Time: 0.236 ms
 Execution Time: 243.167 ms
(16 rows)

Let's break down what this query does:

  • First, find the highest salary in each department
  • Then, get the employee details who has that maximum salary

Understanding the Query Plan

Let's analyze the execution plan:

  1. The query starts with a sequential scan on the employees table (which has 1M rows)

    • This scans the ENTIRE table to find max salaries per department
    • Together with the HashAggregate over those rows, this accounts for nearly all of the 243ms execution time
  2. For each department's max salary:

    • Uses index scan on employees to find matching employee
    • Joins with departments table for department names
  3. Performance issues:

    • Total execution time: 243ms
    • Sequential scan is extremely inefficient
    • Not utilizing indexes effectively

While 243ms may not seem slow, in the database world this is a disaster.

This query performance is problematic for a production database:

  • Sequentially scanning a million rows is resource intensive
  • Response time is too high for interactive applications
  • Could cause bottlenecks under concurrent load

The key question is: Can we restructure this query to:

  1. Avoid the full table scan?
  2. Make better use of our indexes?
  3. Process only the rows we actually need?

Optimizing the Query

explain analyze
select 
    d.department_name, 
    e.id, 
    e.salary
from
    departments d
    join lateral (
        select 
            id, salary
        from
            employees
        where
            department_id=d.id
        order by
            salary desc
        limit 1
    ) e on true
order by department_name
;


QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=87.44..87.79 rows=140 width=528) (actual time=0.085..0.086 rows=10 loops=1)
   Sort Key: d.department_name
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.42..82.45 rows=140 width=528) (actual time=0.016..0.076 rows=10 loops=1)
         ->  Seq Scan on departments d  (cost=0.00..11.40 rows=140 width=520) (actual time=0.003..0.004 rows=10 loops=1)
         ->  Limit  (cost=0.42..0.49 rows=1 width=12) (actual time=0.007..0.007 rows=1 loops=10)
               ->  Index Scan Backward using employees_department_id_salary_idx on employees  (cost=0.42..6251.68 rows=100000 width=12) (actual time=0.006..0.006 rows=1 loops=10)
                     Index Cond: (department_id = d.id)
 Planning Time: 0.101 ms
 Execution Time: 0.101 ms
(10 rows)
  • 243.167ms -> 0.101ms
  • Improvement: 2400x


Let's analyze the query plan:

  1. Departments Table Scan

    • A sequential scan is performed on the departments table
    • This is acceptable since the table only has 10 rows
    • Takes a negligible 0.003-0.004ms
  2. Employees Table Index Scan

    • For each department, an index scan is done on the employees table
    • The scan is backward (descending) to get highest salary first
    • The LIMIT 1 ensures we only get the top earner
    • Very efficient: 0.006-0.007ms per department
  3. Overall Performance

    • Total execution time: 0.101ms
    • Compared to original 243.167ms
    • Represents a 2400x improvement
    • Shows the power of proper indexing and query structure
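As a side note, PostgreSQL's DISTINCT ON is another common way to express "top row per group". It won't necessarily beat the lateral join here, but it's a useful formulation to compare with EXPLAIN ANALYZE on your own data; a sketch:

-- Alternative formulation with DISTINCT ON: keeps the first row per
-- department_id according to the ORDER BY, i.e. the highest salary.
-- (To match the original output ordering, you'd still sort the result
-- by department_name in an outer query.)
select distinct on (e.department_id)
    d.department_name,
    e.id,
    e.salary
from
    employees e
    join departments d on d.id = e.department_id
order by
    e.department_id, e.salary desc;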

Key Takeaways

  1. Query Optimization Process

    • Start with EXPLAIN ANALYZE to understand the execution plan
    • Identify the most expensive operations
    • Look for opportunities to use indexes effectively
    • Restructure query to minimize resource usage
  2. Important Query Plan Concepts

    • Sequential Scan: Full table scan
    • Index Scan: Using an index to access data
    • Nested Loop: Row-by-row join operation
    • Hash Join: Build hash table for joining
    • Merge Join: Pre-sorted merge operation
    • LIMIT: Restricting number of rows returned
  3. Success Metrics

    • Reduced execution time
    • Better resource utilization
    • Proper use of available indexes
    • Scalable query structure

