Exploring the World of Databases: Cost Estimation in Single-Table Queries with PostgreSQL

Are you ready to dive deep into the world of databases and discover the inner workings of query processing? Let's blast off into the universe of databases and explore the exciting world of cost estimation in single-table queries!

In this blog post, we'll explore how PostgreSQL's cost-based query optimization works. But don't worry, we'll break down complex database concepts using fun analogies. Think of cost as the price tag on a product: it's not an absolute measure of performance, but a tool for comparing the relative expense of different operations.

Costs are estimated by the functions defined in costsize.c, and all operations executed by the executor have corresponding cost functions. For instance, the costs of sequential scans and index scans are estimated by cost_seqscan() and cost_index() respectively. In PostgreSQL, there are three types of costs: start-up, run, and total, with the total cost being the sum of start-up and run costs.

The start-up cost is the cost expended before the first tuple is fetched. Picture it like the cost of getting dressed and ready before leaving your house. On the other hand, the run cost is the cost of fetching all tuples, or the actual work done to achieve a result. It's like the energy you spend to travel to a destination after getting ready.

The EXPLAIN command shows both the start-up and total costs in each operation. For example, when you run EXPLAIN SELECT * FROM tbl;, the cost section displays two values: 0.00 and 145.00. Here, the start-up and total costs are 0.00 and 145.00, respectively.
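For reference, on a table with the statistics used throughout this post (10,000 tuples stored in 45 pages), that output looks roughly like this; note that 145.00 is simply cpu_tuple_cost × 10000 + seq_page_cost × 45 = 0.01 × 10000 + 1.0 × 45, since a query without a WHERE clause incurs no cpu_operator_cost term:

EXPLAIN SELECT * FROM tbl;
--                        QUERY PLAN
-- ---------------------------------------------------------
--  Seq Scan on tbl  (cost=0.00..145.00 rows=10000 width=8)
-- (1 row)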

Now, let's dive into how to estimate the sequential scan, index scan, and sort operations in detail. We will use a specific table and an index to demonstrate how these operations work.
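If you want to follow along, here is a minimal sketch of a setup that roughly matches the statistics used below (about 10,000 rows in 45 heap pages, with the index name that appears in the plans later); the exact column layout and data are just an illustration:

-- A small two-column table: ~10,000 rows end up in roughly 45 heap pages.
CREATE TABLE tbl (id int, data int);
INSERT INTO tbl SELECT i, i FROM generate_series(1, 10000) AS i;

-- The index used by the index-scan example later in this post.
CREATE INDEX idx_tbl_id ON tbl (id);

-- Refresh the statistics (reltuples, relpages, pg_stats) the planner relies on.
ANALYZE tbl;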

As noted above, the cost of a sequential scan is estimated by the cost_seqscan() function. Let's estimate the sequential-scan cost of the following query: SELECT * FROM tbl WHERE id < 8000;

In the sequential scan, the start-up cost is equal to 0, and the run cost is defined by the equation:

run cost = (cpu_tuple_cost + cpu_operator_cost) × Ntuple + seq_page_cost × Npage,

where seq_page_cost, cpu_tuple_cost, and cpu_operator_cost are set in the postgresql.conf file, and the default values are 1.0, 0.01, and 0.0025, respectively. Ntuple and Npage represent the number of tuples and pages of the table, respectively.
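You can check what these parameters are set to on your own server, for example:

-- Current values of the planner cost parameters used in the formula above.
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'cpu_tuple_cost', 'cpu_operator_cost');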

We can obtain the values of Ntuple and Npage using the following query: SELECT relpages, reltuples FROM pg_class WHERE relname = 'tbl';. For our example table, the output is:

 relpages | reltuples
----------+-----------
       45 |     10000
(1 row)

Therefore, the run cost can be calculated as follows:

run cost = (0.01 + 0.0025) × 10000 + 1.0 × 45 = 170.0.

And the total cost is the sum of the start-up and run costs:

total cost = 0.0 + 170.0 = 170.0.
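If you like, you can reproduce this arithmetic straight from the statistics stored in pg_class; this simply re-applies the formula above with the default parameter values, it is not how the planner itself is invoked:

-- Recompute the sequential-scan estimate from the table's statistics.
SELECT (0.01 + 0.0025) * reltuples   -- cpu_tuple_cost + cpu_operator_cost per tuple
     + 1.0 * relpages                -- seq_page_cost per heap page
       AS estimated_total_cost       -- 170 for reltuples = 10000, relpages = 45
FROM pg_class
WHERE relname = 'tbl';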

If you run the EXPLAIN command for the above query, you'll see the output below:

QUERY PLAN

Seq Scan on tbl (cost=0.00..170.00 rows=8000 width=8)
Filter: (id < 8000)
(2 rows)

Now, let's move on to estimating the cost of an index scan. An index scan can be more efficient than a sequential scan because it reads only the pages that contain matching rows instead of scanning the entire table. Its cost is estimated by the cost_index() function. Let's use the same table tbl and an index on the id column to estimate the cost of an index scan for the query SELECT * FROM tbl WHERE id < 8000;

The real cost_index() formula is fairly involved, so we'll use a simplified model here. The start-up cost covers scanning the index to find the starting point; we'll approximate it as the cost of one random page access, which is governed by the random_page_cost setting in the postgresql.conf file. The run cost is the cost of fetching all the matching tuples: it looks like the per-tuple CPU cost of a sequential scan, but scaled by the selectivity, i.e. the fraction of the table's rows that satisfy the query condition, plus a term for reading the index pages themselves.
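To get a feel for the numbers involved, you can look at the index's own size in pg_class (idx_tbl_id is the index name used in the plan below) and at the random_page_cost setting:

-- How many pages and entries the index itself occupies.
SELECT relpages, reltuples FROM pg_class WHERE relname = 'idx_tbl_id';

-- The per-page cost the planner charges for non-sequential page reads.
SHOW random_page_cost;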

The selectivity is estimated by the planner from statistics collected by the ANALYZE command. For our example query, let's assume a selectivity of 0.8, meaning that 80% of the rows satisfy the condition id < 8000.
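Those statistics live in the pg_stats view; for a range condition like id < 8000, the planner derives the selectivity from the column's histogram (and from any most-common-value list):

-- The per-column statistics the planner consults to estimate selectivity.
SELECT most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl' AND attname = 'id';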

Under this simplified model, the cost of the index scan works out as follows (index_page_cost and Nindex, the assumed per-page cost and number of index pages, are part of the simplified model, with values 0.2 and 45):

start-up cost = random_page_cost × 1 = 4.0 (assuming the default random_page_cost = 4.0)
run cost = (cpu_tuple_cost + cpu_operator_cost) × Ntuple × selectivity + index_page_cost × Nindex
= (0.01 + 0.0025) × 10000 × 0.8 + 0.2 × 45
= 100.0 + 9.0 = 109.0
total cost = start-up cost + run cost = 4.0 + 109.0 = 113.0
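As a sanity check, the same simplified arithmetic can be written as a single SQL expression (the constants 0.8, 0.2, and 45 are the assumed selectivity, per-index-page cost, and index page count from above, not values taken from the planner):

-- Re-apply the simplified index-scan formula with the assumed constants.
SELECT 4.0                               -- start-up: one random page access
     + (0.01 + 0.0025) * 10000 * 0.8     -- CPU cost for the matching tuples
     + 0.2 * 45                          -- simplified index page I/O term
       AS estimated_total_cost;          -- = 113.0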

With these assumptions, the EXPLAIN output for the query would look something like this:

QUERY PLAN

Index Scan using idx_tbl_id on tbl (cost=4.00..113.00 rows=8000 width=8)
Index Cond: (id < 8000)
(2 rows)

According to these estimates, the index scan (113.00) is cheaper than the sequential scan (170.00), so the planner would consider the index scan the more efficient plan for this particular query.
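A handy way to make the planner show you the alternative plan is to temporarily discourage index scans in your session and re-run EXPLAIN; the enable_* settings don't remove a plan type, they just make it look very expensive, so they're meant for experimentation rather than production:

-- Discourage index-based plans so the sequential-scan estimate is shown.
SET enable_indexscan = off;
SET enable_bitmapscan = off;   -- otherwise a bitmap heap scan may be chosen instead
EXPLAIN SELECT * FROM tbl WHERE id < 8000;

-- Restore the defaults for this session.
RESET enable_indexscan;
RESET enable_bitmapscan;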

Finally, let's briefly discuss how the cost of a sort operation is estimated. Sorting can be expensive in both CPU and I/O terms. The cost of a sort is estimated by the cost_sort() function, which takes into account the number of tuples to be sorted, the memory available to the sort (the work_mem setting), and, when the data doesn't fit in memory, the cost of writing and reading temporary data on disk.
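You can see this in action by explaining a query that has to sort, for instance an ORDER BY on the unindexed data column from the sketch above; work_mem decides whether the sort runs in memory (quicksort) or spills to disk (external merge):

-- How much memory a single sort operation may use before spilling to disk.
SHOW work_mem;

-- EXPLAIN ANALYZE shows both the estimated sort cost and the actual sort method used.
EXPLAIN ANALYZE SELECT * FROM tbl ORDER BY data;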

In conclusion, estimating the cost of query operations is a critical aspect of query optimization. Understanding how costs are estimated can help you write more efficient queries and improve the performance of your database. We hope this blog post has given you a better understanding of how PostgreSQL's cost-based optimizer works. Happy querying!
