DEV Community

Cover image for Learn Why we call to use index with PostgreSQL
NightBird07
NightBird07

Posted on

Learn Why we call to use index with PostgreSQL

To better understand how the cost estimation process works in PostgreSQL, it can be helpful to explore practical examples of how it impacts query execution. In previous articles, we discussed the role of the planner, executor, sequential scan, and runtime scan in query execution. To apply these concepts to real-world scenarios, it's recommended to review the previous article on sequential scans and cost estimation in PostgreSQL.

What is an index

An index in PostgreSQL is similar to an index in a book or a set of notes, where it provides a way to quickly locate and retrieve specific information. Unlike normal tables, indexes are stored in a specialized data structure optimized for quick insertion and searching, such as B-tree, GiST, or Hash.

The most common type of index in PostgreSQL is the balanced-tree index, or B-tree, which efficiently stores and organizes data in a tree-like structure. This allows for fast searching and retrieval of data based on the values in the indexed column(s).

Indexes can significantly improve query performance by allowing PostgreSQL to locate and retrieve specific rows more efficiently than scanning the entire table. However, it's important to note that indexes are not always the optimal solution for all cases. In some cases, the overhead of maintaining the index can outweigh the performance benefits. Therefore, it's important to carefully consider the columns to include in an index and monitor the performance of your queries to ensure that the indexes are providing the expected performance improvements.

Let's write some SQL

I have created a PostgreSQL function that generates rows by taking user input for the number of rows and table name. With this function, I was able to generate over 10 million rows within seconds. Next, I executed several queries to observe their performance.

CREATE OR REPLACE FUNCTION generate_series(n integer,name text ,t text)
RETURNS void AS
$BODY$
DECLARE
    i integer;
BEGIN
    FOR i IN 1..n LOOP
        EXECUTE 'INSERT INTO ' || quote_ident(t) || ' VALUES (' || i || ',' || quote_literal(name || i) || ')';
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Enter fullscreen mode Exit fullscreen mode
EXPLAIN  ANALYZE SELECT id from t where name='NAME';
CREATE INDEX ON t(id);
RESET ALL;
EXPLAIN  ANALYZE SELECT id from t where t.id = 1000; 
EXPLAIN  ANALYZE SELECT name from t where t.id = 2000; 
Enter fullscreen mode Exit fullscreen mode

The first query took the longest time as it used a sequential scan to filter the rows based on a text column. The second and third queries used an index-only scan on the primary key column, resulting in much faster query performance.

Results

the first one will take the longest query as

Gather  (cost=1000.00..11614.43 rows=1 width=10) (actual time=1.348..46.516 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..10614.33 rows=1 width=10) (actual time=24.932..39.442 rows=0 loops=3)
         Filter: (name = 'NAME10000'::text)
         Rows Removed by Filter: 333333
 Planning Time: 0.058 ms
 Execution Time: 46.533 ms
Enter fullscreen mode Exit fullscreen mode

When analyzing query performance in PostgreSQL, the EXPLAIN ANALYZE command provides valuable insights into the query execution process. The output of this command includes information such as the number of worker threads used by the planner, the type of scan performed (such as sequential or index-only), the planning time for the query, and the execution time for the query.

While PostgreSQL can use parallel threads to read data from the database and improve performance, this should not be relied on as a solution for poorly optimized queries. It's important to write clean, optimized queries to avoid putting undue load on the CPU.

The planning time for a query can vary depending on its complexity and optimization, but an optimized query should have a planning time that does not significantly impact the total execution time. The execution time is the time needed to retrieve the data from the table and is a critical factor in query performance.

It's recommended to focus on optimizing queries to improve their execution time, rather than relying solely on parallelism or increasing hardware resources. By optimizing queries and carefully selecting which columns to include in indexes, it's possible to achieve significant performance improvements in PostgreSQL.

 Index Only Scan using t_id_idx on t  (cost=0.42..8.44 rows=1 width=4) (actual time=0.832..0.834 rows=1 loops=1)
   Index Cond: (id = 100000)
   Heap Fetches: 1
 Planning Time: 0.060 ms
 Execution Time: 0.873 ms
Enter fullscreen mode Exit fullscreen mode

when index is not really needed

there are two main situations the first is realted to the table itself and the other is realted to the query

1- table

  • the table is small
  • the table is updated frequently
  • low selective columns

2- Data
the excuted query always fetch pages or large portion of tables then the Seq SCAN is faster.

Top comments (0)