Aliaksei Kirkouski

Strange PostgreSQL query planner behavior in subqueries

In a previous article, I described the problems that arise when working with temporary tables, and briefly explained why we have to use them so often. One of the reasons was that the PostgreSQL query planner does not always behave correctly. In this article I will show another fairly simple and common case where the planner gets it wrong, which can lead to a significant increase in resource consumption.

The problem reproduces on the latest stable version of PostgreSQL at the time of writing, 16.4, with the default settings. I tried changing various settings, but I could not get a correct plan in the general case, since the problem is one of logic rather than cost estimation. However, anyone can easily reproduce the situation locally and experiment with the settings.

Consider a simple domain model with documents and their lines. Each line has an amount. The lines are stored in a separate table and reference their document:

CREATE TABLE doc (id int PRIMARY KEY);
CREATE TABLE line (id int PRIMARY KEY, docId int, amount numeric);
CREATE INDEX line_doc ON line (docid);

Let's fill the tables with test data. We generate 100,000 documents that have 10, 20, 50, and 100 lines in equal proportions:

INSERT INTO doc (id) SELECT generate_series AS id FROM generate_series(1, 100000);
INSERT INTO line (id, docId, amount) (SELECT docId*100 + id AS id, docId, random() AS amount FROM generate_series(1, 10) AS id, generate_series(1, 25000) AS docid);
INSERT INTO line (id, docId, amount) (SELECT docId*100 + id AS id, docId, random() AS amount FROM generate_series(1, 20) AS id, generate_series(25001, 50000) AS docid);
INSERT INTO line (id, docId, amount) (SELECT docId*100 + id AS id, docId, random() AS amount FROM generate_series(1, 50) AS id, generate_series(50001, 75000) AS docid);
INSERT INTO line (id, docId, amount) (SELECT docId*100 + id AS id, docId, random() AS amount FROM generate_series(1, 100) AS id, generate_series(75001, 100000) AS docid);
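
As a quick sanity check (my addition, not part of the original setup): each bucket contains 25,000 documents, so the line table should hold 25,000 * (10 + 20 + 50 + 100) = 4,500,000 rows, which is consistent with the rows=4499977 estimates that appear in the plans below.

SELECT count(*) AS total_lines,
       count(DISTINCT docid) AS total_docs
FROM line;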

Let's run ANALYZE so that PostgreSQL collects up-to-date statistics for these tables:

ANALYZE doc;
ANALYZE line;

Let's look at the plan of a simple query that fetches the lines of one particular document:

SELECT id FROM line WHERE docId = 4353;
Index Scan using line_doc on line  (cost=1.73..12.03 rows=70 width=4)
  Index Cond: (docid = 4353)

We can immediately see that PostgreSQL is a pessimist: it expects to get 70 records, while on the test data a document has 45 lines on average. This is not critical, and in some cases even useful, so the planner can be forgiven, but in our particular case this pessimism will only make the problem worse.
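
To compare the estimate of 70 with reality, one can compute the actual average directly (a check I added, not from the original article); for this data it is (10 + 20 + 50 + 100) / 4 = 45 lines per document:

SELECT AVG(cnt) AS avg_lines_per_doc
FROM (SELECT docid, count(*) AS cnt FROM line GROUP BY docid) AS per_doc;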

Next, let's run a query that, from a developer's point of view, is quite simple:

SELECT docId, SUM(amount) FROM line WHERE docId IN (3531,6572) GROUP BY 1;

It simply fetches the two documents along with the sum of their lines. But look at what we see in the plan:

GroupAggregate (cost=1.73..35.03 rows=139 width=36)
  Group Key: docid
  -> Index Scan using line_doc on line (cost=1.73..23.91 rows=139 width=15)
        Index Cond: (docid = ANY ('{3531,6572}'::integer[]))

The planner thinks that for these two documents it will read 139 rows (which matches the statistics from the previous query), but it also thinks that grouping them by document will produce the same 139(!) groups, although there obviously cannot be more than 2. The estimate is thus off from reality by a factor of almost 70.
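
EXPLAIN ANALYZE makes this gap easy to see, since it prints estimated and actual row counts side by side; running it on the same query (output not shown here) should report rows=139 estimated against 2 actual rows coming out of the GroupAggregate node:

EXPLAIN ANALYZE
SELECT docId, SUM(amount) FROM line WHERE docId IN (3531,6572) GROUP BY 1;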

By the way, if you run the query for a single document, the estimate is correct (apparently the planner has a special heuristic for this case):

SELECT docId, SUM(amount) FROM line WHERE docId IN (3531) GROUP BY 1;
GroupAggregate (cost=1.73..12.79 rows=1 width=36)
  -> Index Scan using line_doc on line (cost=1.73..12.03 rows=70 width=15)
        Index Cond: (docid = 3531)

This single-value optimization appeared only in recent versions of PostgreSQL. Before version 15 it did not exist, and the planner estimated 70 records in this case as well.

Next, consider the following task: we need to show the user one page of a document list, 50 records long, together with the amount for each document. To do this, let's write the IDs of all documents on the page into a separate temporary table:

CREATE TEMPORARY TABLE tmp (id int PRIMARY KEY);
INSERT INTO tmp (id) SELECT * FROM generate_series(1, 50);
ANALYZE tmp;

Finally, let's build a query that returns the document amounts, using a subquery to calculate the amount for each document. First, let's take the most straightforward approach that comes to mind:

SELECT tmp.id,
       sq.amount
FROM tmp
LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   GROUP BY 1) sq ON sq.docid = tmp.id

Unfortunately, PostgreSQL does not realize that it only needs the amounts for 50 documents, so it calculates them for every document in the table:

Hash Right Join (cost=155816.90..163627.72 rows=1616168 width=36)
  Hash Cond: (line.docid = tmp.id)
  -> Finalize HashAggregate (cost=155811.30..159691.74 rows=64674 width=36)
        Group Key: line.docid
        -> Gather (cost=135115.62..151930.86 rows=129348 width=36)
              Workers Planned: 2
              -> Partial HashAggregate (cost=134115.62..137996.06 rows=64674 width=36)
                    Group Key: line.docid
                    -> Parallel Seq Scan on line (cost=0.00..96615.82 rows=1874990 width=15)
  -> Hash (cost=2.60..2.60 rows=50 width=4)
        -> Seq Scan on tmp (cost=0.00..2.60 rows=50 width=4)
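
As an aside that the article does not pursue, one way to force the aggregation to run only for the documents on the page is a LATERAL subquery, which computes the sum once per row of tmp (typically via the line_doc index); a minimal sketch on the same schema:

SELECT tmp.id,
       sq.amount
FROM tmp
LEFT JOIN LATERAL
  (SELECT SUM(amount) AS amount
   FROM line
   WHERE line.docid = tmp.id) sq ON true;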

The plan above is clearly not optimal, so we will help PostgreSQL a bit and add a JOIN with our temporary table inside the subquery (by the way, the lsFusion platform does this automatically when generating queries):

SELECT tmp.id,
       sq.amount
FROM tmp
LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   JOIN tmp ON tmp.id = docid
   GROUP BY 1) sq ON sq.docid = tmp.id

This results in a much better plan:

Hash Right Join (cost=8.82..744.26 rows=870 width=36)
  Hash Cond: (line.docid = tmp.id)
  -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
        Group Key: line.docid
        -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
              Merge Cond: (line.docid = tmp_1.id)
              -> Index Scan using line_doc on line (cost=1.73..453359.63 rows=4499977 width=15)
              -> Index Only Scan using tmp_pkey on tmp tmp_1 (cost=0.56..5.86 rows=50 width=4)
  -> Hash (cost=2.60..2.60 rows=50 width=4)
        -> Seq Scan on tmp (cost=0.00..2.60 rows=50 width=4)

However, even though the plan itself is essentially optimal, there is one big problem with the estimates. First, when grouping by document, PostgreSQL assumes there will be as many groups as there are rows (3479). That is, just as in the examples above, it is off by a factor of about 70.

Second, after the LEFT JOIN with the temporary table tmp, it does not realize that the subquery produces unique keys. This is quite surprising, given that GROUP BY by definition yields unique values of the grouping columns.

Obviously, the join result cannot contain more records than the tmp table itself, yet PostgreSQL expects 870 records, more than 15 times the correct number.

In this simple case the estimation error is not that critical. However, as the query gets more complex, the error can lead to completely wrong query plans. For the test, in order not to complicate the logic, let's add the same LEFT JOIN two more times (as if we needed to calculate sums from other tables as well), and also read an extra field from the source table doc:

SELECT tmp.id,
       sq.amount,
       sq2.amount,
       sq3.amount,
       d1.id
FROM tmp
LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   JOIN tmp ON tmp.id = docid
   GROUP BY 1) sq ON sq.docid = tmp.id

LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   JOIN tmp ON tmp.id = docid
   GROUP BY 1) sq2 ON sq2.docid = tmp.id

LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   JOIN tmp ON tmp.id = docid
   GROUP BY 1) sq3 ON sq3.docid = tmp.id

LEFT JOIN doc d1
    ON tmp.id = d1.id

We get a plan like this:

Hash Left Join (cost=1824.83..2788.04 rows=263256 width=104)
  Hash Cond: (tmp.id = sq3.docid)
  -> Hash Left Join (cost=914.89..1715.05 rows=15134 width=72)
        Hash Cond: (tmp.id = sq2.docid)
        -> Merge Left Join (cost=4.95..795.76 rows=870 width=40)
              Merge Cond: (tmp.id = line.docid)
              -> Merge Left Join (cost=1.73..15.76 rows=50 width=8)
                    Merge Cond: (tmp.id = d1.id)
                    -> Index Only Scan using tmp_pkey on tmp (cost=0.56..5.86 rows=50 width=4)
                    -> Index Only Scan using doc_pkey on doc d1 (cost=1.17..10028.77 rows=100000 width=4)
              -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                    Group Key: line.docid
                    -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                          Merge Cond: (line.docid = tmp_1.id)
                          -> Index Scan using line_doc on line (cost=1.73..453359.63 rows=4499977 width=15)
                          -> Index Only Scan using tmp_pkey on tmp tmp_1 (cost=0.56..5.86 rows=50 width=4)
        -> Hash (cost=701.20..701.20 rows=3479 width=36)
              -> Subquery Scan on sq2 (cost=3.22..701.20 rows=3479 width=36)
                    -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                          Group Key: line_1.docid
                          -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                                Merge Cond: (line_1.docid = tmp_2.id)
                                -> Index Scan using line_doc on line line_1 (cost=1.73..453359.63 rows=4499977 width=15)
                                -> Index Only Scan using tmp_pkey on tmp tmp_2 (cost=0.56..5.86 rows=50 width=4)
  -> Hash (cost=701.20..701.20 rows=3479 width=36)
        -> Subquery Scan on sq3 (cost=3.22..701.20 rows=3479 width=36)
              -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                    Group Key: line_2.docid
                    -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                          Merge Cond: (line_2.docid = tmp_3.id)
                          -> Index Scan using line_doc on line line_2 (cost=1.73..453359.63 rows=4499977 width=15)
                          -> Index Only Scan using tmp_pkey on tmp tmp_3 (cost=0.56..5.86 rows=50 width=4)

In this plan we can see that the error “accumulates”: here PostgreSQL expects 263,256 records, although there cannot be more than 50. Still, the optimizer manages to reorder the JOINs and reads the fields of the doc table only for the records from the temporary table (see lines 7-10 of the plan). The problem is that when the number of JOINs is large (in particular, larger than the join_collapse_limit parameter, which is 8 by default), this optimization may no longer help.
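
For reference, the current value of this parameter can be checked and, as a possible mitigation for wide queries (my suggestion, with the documented trade-off of longer planning time), raised:

SHOW join_collapse_limit;       -- 8 by default
SET join_collapse_limit = 16;   -- lets the planner reorder more JOINs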

Let's simulate this situation by simply setting the join_collapse_limit parameter to one and running the same query:
SET join_collapse_limit=1;

Hash Left Join (cost=12873.00..16545.19 rows=263256 width=104)
  Hash Cond: (tmp.id = d1.id)
  -> Hash Left Join (cost=1828.70..2736.54 rows=263256 width=100)
        Hash Cond: (tmp.id = sq3.docid)
        -> Hash Left Join (cost=918.76..1663.55 rows=15134 width=68)
              Hash Cond: (tmp.id = sq2.docid)
              -> Hash Right Join (cost=8.82..744.26 rows=870 width=36)
                    Hash Cond: (line.docid = tmp.id)
                    -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                          Group Key: line.docid
                          -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                                Merge Cond: (line.docid = tmp_1.id)
                                -> Index Scan using line_doc on line (cost=1.73..453359.63 rows=4499977 width=15)
                                -> Index Only Scan using tmp_pkey on tmp tmp_1 (cost=0.56..5.86 rows=50 width=4)
                    -> Hash (cost=2.60..2.60 rows=50 width=4)
                          -> Seq Scan on tmp (cost=0.00..2.60 rows=50 width=4)
              -> Hash (cost=701.20..701.20 rows=3479 width=36)
                    -> Subquery Scan on sq2 (cost=3.22..701.20 rows=3479 width=36)
                          -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                                Group Key: line_1.docid
                                -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                                      Merge Cond: (line_1.docid = tmp_2.id)
                                      -> Index Scan using line_doc on line line_1 (cost=1.73..453359.63 rows=4499977 width=15)
                                      -> Index Only Scan using tmp_pkey on tmp tmp_2 (cost=0.56..5.86 rows=50 width=4)
        -> Hash (cost=701.20..701.20 rows=3479 width=36)
              -> Subquery Scan on sq3 (cost=3.22..701.20 rows=3479 width=36)
                    -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
                          Group Key: line_2.docid
                          -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
                                Merge Cond: (line_2.docid = tmp_3.id)
                                -> Index Scan using line_doc on line line_2 (cost=1.73..453359.63 rows=4499977 width=15)
                                -> Index Only Scan using tmp_pkey on tmp tmp_3 (cost=0.56..5.86 rows=50 width=4)
  -> Hash (cost=5044.30..5044.30 rows=100000 width=4)
        -> Seq Scan on doc d1 (cost=0.00..5044.30 rows=100000 width=4)

We can see that PostgreSQL now executes the JOINs in the same order in which they are written in the query. As a result, because of the error in the statistics, the planner decides it is efficient to do a sequential scan of the entire doc table followed by a hash join (see the last two lines of the plan). That would be reasonable if the first part of the query really produced 260 thousand records. In practice, however, it produces 50, and a simple index scan would be much more efficient.

If we use a regular table instead of the temporary table tmp, the result does not change. However, if we use the main table with a WHERE condition instead of the temporary table, the planning suddenly becomes correct:

SELECT doc.id,
       sq.amount
FROM doc
LEFT JOIN
  (SELECT docid,
          SUM(amount) AS amount
   FROM line
   JOIN tmp ON tmp.id = docid
   GROUP BY 1) sq ON sq.docid = doc.id
WHERE doc.id >= 1 AND doc.id <= 50;
Merge Left Join (cost=4.39..744.12 rows=52 width=36)
  Merge Cond: (doc.id = line.docid)
  -> Index Only Scan using doc_pkey on doc  (cost=1.17..7.51 rows=52 width=4)
        Index Cond: ((id >= 1) AND (id <= 50))
  -> GroupAggregate (cost=3.22..527.25 rows=3479 width=36)
        Group Key: line.docid
        -> Merge Join (cost=3.22..248.93 rows=3479 width=15)
              Merge Cond: (line.docid = tmp.id)
              -> Index Scan using line_doc on line (cost=1.73..453359.63 rows=4499977 width=15)
              -> Index Only Scan using tmp_pkey on tmp (cost=0.56..5.86 rows=50 width=4)

As we can see, PostgreSQL now expects 52 rows. It still makes the same mistake when estimating the number of rows in the nested subquery, but after the JOIN it no longer assumes that the number of records will grow. And all this despite the fact that there is no actual connection between the doc table and the nested subquery: the subquery uses only the line and tmp tables, which, from the point of view of the database schema, are not related to doc in any way.

Unfortunately, I do not have an explanation for this planner behavior yet. Using the source table with a WHERE condition in the outer query is not a great option either, because with a complex filter PostgreSQL can misestimate the number of matching records as well. With a JOIN on a temporary table, at least it knows exactly how many records the table contains.

I have only considered the simplest case of grouping in a nested query. In practice, the planning error also reproduces on much more complex subqueries. Fortunately, the problem does not surface very often, since PostgreSQL's JOIN reordering usually saves the situation, but every now and then we still stumble upon such inefficient queries.

In the lsFusion platform this problem is easy to work around by materializing the nested subquery, without changing the application logic. But this has downsides: additional stored fields increase the size of the database and slow down write transactions. Another option is to write the nested queries into temporary tables first and then run ANALYZE on them; the platform also does this automatically when a query's execution time exceeds a certain threshold.
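
A minimal sketch of that second workaround, assuming the same schema as above (the tmp_sums name is mine, not something the platform generates):

-- Materialize the subquery result for the 50 documents on the page
CREATE TEMPORARY TABLE tmp_sums AS
SELECT docid, SUM(amount) AS amount
FROM line
JOIN tmp ON tmp.id = docid
GROUP BY 1;

-- Collect statistics so the planner knows tmp_sums holds ~50 unique docids
ANALYZE tmp_sums;

-- The outer query now joins two small tables with accurate statistics
SELECT tmp.id, s.amount
FROM tmp
LEFT JOIN tmp_sums s ON s.docid = tmp.id;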

Recently, a patch was proposed for PostgreSQL that addresses this problem in a fairly simple way. If it is accepted, the problem may finally be solved.
