DEV Community

Cover image for How PostgreSQL plant a tree
NightBird07
NightBird07

Posted on

How PostgreSQL plant a tree

in this section we are going to go in a deep to understand the architecture of the porstgreSQL and how the process is a little complicated to create a tree for the execution of a single query though we mentioned the postgreSQL handles the parallel queries, so far we will start by a single query.

Stages before executing

  1. Parser tree is used to create the preprocess of the query clause
  2. Analyzer used to analyze the created tree from the previous stage-parser
  3. rewrite in this stage we define again the tree in a more format that sticks with the defined rules.
  4. Planner the most effected tree that will be excuted.
  5. Executor we finally go to the last stage where the result goes to one of the following {memory, buffer, temporary file}.

Parser Tree

the parser stage is responsible for constructing a parse tree from the given query. The parse tree represents the syntactic structure of the query and categorizes each word or phrase into its corresponding category based on a set of predefined grammar rules. for example, fromClause, whereClause, SortBy, Targetlist,

SELECT id,data
FROM table_a
WHERE ID<200 
ORDER BY data
Enter fullscreen mode Exit fullscreen mode

the table_a => fromClause subtree, (id, data)=>TargetList, data => sortBy, (200,ID,'>') => whereClause.

Analyzer

in this stage the analyzer analyze the output of the previous t=stage-pars tree and generate a query tree- * also known as an operator tree, is a data structure used in database systems to represent the logical plan of a query. It is constructed during the query optimization phase, after the query has been parsed and transformed into a parse tree.*
from the previous example we can define the query tree as following

  1. the tragetList will have two columns the ID,Data each one has its metadata the type and the index of the var.
  2. the rangeTable holds the name of the table and the OID
  3. stores the FROM clause and WHERE clause
  4. store the SortGroup Clause.

Rewriter

the rewrite is like the secretary of the office who will not pass the paper-tree to the officer- executor unless it matches the rule defined. these rules have a very interesting system but the blog is going to be too long.

Planner

after the regeneration of the tree from the rewrite stage we convert the tree to be a query again.. this process is actually based on the appending the cost-based optimization not a rule-based which makes it more complicated. but let's say you have done the plan tree and resulted in the min path cost on the query we move directly to the last stage

Executor

the execution of the plan tree is quiet simple and straight forward. In single-table queries, the executor takes the plan nodes in an order from the end of the plan tree to the root and then invokes the functions that perform the processing of the corresponding nodes. the process goes under the cycle of two stage

  1. seqscan of the nodes
  2. sorting these nodes

    how to estimate the cost of anything in postgreSQL?

    the resulted value is a rough estimation as the CPU divers from a device to another but we use it as a way to determine the shortest path -cost-based optimized in planner stage.

  3. seqScan Cost

  4. Index Scan Cost

  5. Total Cost

  6. Sort Cost

but before we give the formula to define each term we define three other main variables.

  1. startup time the time needed to fetch the first tuple
  2. run time the time needed to fetch all the tuples
  3. total time => run + start up ### seqScan cost The total run time of a sequential scan is the sum of the time it takes to read all the pages of the table and the time it takes to process the tuples on those pages. The time it takes to read the pages depends on the disk speed and the number of pages to be read, which can be affected by factors such as the size of the table, the selectivity of the query, and the availability of indexes.

the startup time is 0 and the run cost = cpu run cost + disk run cost = (CPU_TUBLE_COST + CPU_OPERATOR_COST) * N1 + SEQ_PAGE_COST * N2 : N1 = #of tuples and N2=#of pages other are default value

why the Startup is 0

The startup cost for a sequential scan in PostgreSQL is the cost of initializing the scan structures, such as opening the table and setting up the scan state. This cost is usually negligible compared to the cost of actually reading the data from disk. Therefore, the startup cost is often reported as 0 because it is considered insignificant in the overall cost of the scan operation.

what is the run time

the run time in sequential scan is the time needed to read the pages form the hard disk following the structure set by the startup.

Finally

in another blog as index scan needs to be estimated through equations but eventually the total cost will be a prime factor in choosing the path as the min cost path in the planner stage. so it is critical to have a good estimation on the time needed for each path.

Top comments (0)