Frits Hoogland for YugabyteDB

Posted on Aug 17, 2021

Postgres query execution: plpgsql

#postgres #internals #database #performance

This is a series of blogposts intended to look into how postgres query execution works, so you can get a better understanding, and understand what is happening and what you see. This post is looking at simple protocol usage of an anonymous code block using plpgsql. Anonymous code is plpgsql code. Procedures and functions generally are written in plpgsql too, and therefore follow the same pattern.

recap

In the previous posts I described how simple query execution works, and how that can be optimized with prepared statements, which allow you to do some of the work (parse and rewrite) once. The conclusion from the prepared statements post is that despite some work is done during the prepare, the step that potentially can take a lot of time outside of execution, the planning alias generation of an execution plan, is executed during each execution of a prepared statement. Potentially, the planning step can choose a generic plan after generating 5 custom plans. Choosing a generic plan means it fixed the plan, and no new plans are computed.

plpgsql

But how does that work with plpgsql (postgres procedural SQL)? And more importantly: does plpgsql optimize executing SQL?

This blogpost is using the simple query protocol, which means that an execution must use the steps of:

pg_parse_query (syntactic parse)
pg_analyze_and_rewrite->parse_analyze (semantic parse)
pg_analyze_and_rewrite->pg_rewrite_query (rewriting)
pg_plan_queries (planning)
PortalRun->PortalRunMulti->PortalRunUtility (execution) Which are executed inside simple protocol execution function 'exec_simple_query'.

An anonymous code block starts with a 'do' command. During parsing the parser has marked the anonymous code as a 'utility' command, alias no select/update/insert/delete, therefore gets it to the standard_ProcessUtility function, which selects the required function to handle it, ExecuteDoStmt.

In ExecuteDoStmt, the language is selected, after which the handler for that language is run. The handler first compiles the anonymous code block, and then executes it.

I have been doing tests with the following very simple code:

drop table if exists t;
create table t (i int);
insert into t values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20);
--
do
$$
declare 
 o int;
begin
 for c in 1..20 loop
  select i from t where i=c into o;
 end loop;
end;
$$;

plpgsql inline SQL

SQL statements in plpgsql are documented to have an implicit prepared statement created. The documentation also mentions 'the plan is cached', which to me in the most positive sense is ambiguous.

First of all, the documentation about plpgsql internals and cached plans reveals a very important detail: when a SQL statement is first executed (so not when it's compiled), the plpgsql interpreter parses and analyzes the command to create a prepared statement. This means we will see the SQL statement being processed in the plpgsql 'plpgsql_exec_function' function that performs the execution, and not in 'plpgsql_compile_inline' function, that compiles plpgsql.

Inside plpgsql_exec_function, these are the functions that handle the 'select i from t where i=c into o' statement. Please mind that inside these functions we exactly see the previous mentioned mandatory steps for query execution:

exec_stmt_block
ㄴ exec_stmts
  ㄴ exec_stmt
    ㄴ exec_stmt_fori
      ㄴ exec_stmts
        ㄴ exec_stmt
          ㄴ exec_stmt_execsql
            ㄴ exec_prepare_plan
              ㄴ SPI_prepare_params
                ㄴ _SPI_prepare_plan
                  ㄴ pg_parse_query                parse (syntactic) 
                  ㄴ pg_analyze_and_rewrite_params parse (semantic)
                    ㄴ pg_rewrite_query            rewrite
            ㄴ SPI_execute_plan_with_paramlist
              ㄴ _SPI_execute_plan
                ㄴ GetCachedPlan                   planner
                  ㄴ BuildCachedPlan
                    ㄴ pg_plan_queries
              ㄴ _SPI_pquery
                ㄴ ExecutorRun                     execute

The big question is what happens when the query is executed multiple times, which is what happens in the above simple code example.

When closely following the executed internal functions further along execution of the plpgsql do block containing a loop executing a SQL statement, the following is seen after the above first round of all the mandatory steps:

            ㄴ SPI_execute_plan_with_paramlist
              ㄴ _SPI_execute_plan
                ㄴ GetCachedPlan                   planner
                  ㄴ BuildCachedPlan
                    ㄴ pg_plan_queries
              ㄴ _SPI_pquery
                ㄴ ExecutorRun                     execute

The steps of parsing and rewriting do not happen again, only the planning and execute step. This is actually very understandable, this is identical to the sequence in manually prepared and executed statements.

This means that SQL statements in plpgsql will implicitly be executed as prepared statements, which means the parse and rewriting steps only happen during the implicit prepare phase, and executing and re-executing a SQL in plpgsql will perform the planning step before the execution every single time.

In line with prepared statements, the logic I explained in the prepared statements post in the 'inner working of the planner' paragraph is performed, which in a normal situation means 5 custom plans are computed, generated and used before potentially a generic plan is created and permanently used, which then skips the plan generation.

The reason I mentioned that the documentation is ambiguous is that the statement 'the plan is cached' probably means to say the query tree is cached, and not the execution plan. The following functions do mention query tree as plan (exec_prepare_plan, _SPI_prepare_plan). The actual execution after these functions still needs to obtain an execution plan, which has to create 5 custom plans before a generic plan might be chosen.

plpgsql execute

There are some cases where plpgsql inline SQL does not work, of which a notable case is when table names are dynamic (needs to be a variable). In such a case, plpgsql provides a solution using the plpgsql execute command. Please mind the plpgsql execute command has a collision with the postgres SQL command execute to execute a prepared statement.

The documentation is clear about plpgsql execute. It explicitly mentions that because plpgsql execute is dynamic, a plan is never cached.

This means that for plpgsql execute, the execution is parsed, rewritten, planned and executed for each individual invocation.

Example code for the use of execute:

drop table if exists t;
create table t (i int);
insert into t values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20);
--
do
$$
declare 
 o int;
 tabname varchar:='t';
begin
 for c in 1..20 loop
  execute format('select i from %I where i = %s',tabname,c) into o;
 end loop;
end;
$$;

plpgsql execute prepared statements?

For the plpgsql execute command I concluded that the documentation explicitly mentions SQL executed via 'execute' is never cached, and that the execute command in plpgsql has a different function than the execute command in SQL.

However my colleague Franck Pachot came up with a very creative way of getting benefit of prepared statements AND still being able to use the execute command to allow table names to be dynamic, by using plpgsql execute to explicitly build a prepared statement, and "execute the execute".

Using explicit prepared statements in plpgsql is not an optimal situation, it's significantly less optimal than executing prepared statements implicitly, but that cannot be used because implicit prepared statements cannot use dynamic table assignments.
On the other hand it is much more optimal than not using prepared statements at all.

Example code:

do
$$
declare
  o int;
  c int;
  tabname varchar:='t';
begin
  execute format('prepare my_i(int) as select i from %I where i = $1',tabname);
  for c in 1..20 loop
    execute format('execute my_i(%s)',c) into o;
  end loop;
  execute format('deallocate my_i');
end;
$$;

Let get an overview of what happens internally. These are the internal C functions for the prepare part in the anonymous code:

exec_stmt_block
ㄴ exec_stmts
  ㄴ exec_stmt
    ㄴ exec_stmt_dynexecute
      ㄴ SPI_execute
        ㄴ _SPI_prepare_oneshot_plan
          ㄴ pg_parse_query                 A-parse syntactic
        ㄴ _SPI_execute_plan
          ㄴ pg_analyze_and_rewrite         A-parse semantic
            ㄴ pg_rewrite_query             A-rewrite
          ㄴ GetCachedPlan
            ㄴ BuildCachedPlan
              ㄴ pg_plan_queries            A-planner
          ㄴ pgss_ProcessUtility
            ㄴ standard_ProcessUtility
              ㄴ PrepareQuery               A-execute (prepare)
                ㄴ CreateCachedPlan
                ㄴ parse_analyze_varparams  B-parse semantic
                ㄴ QueryRewrite             B-rewrite
                ㄴ CompleteCachedPlan
                ㄴ StorePreparedStatement

I have prefixed the explanation with A and B:
A = the execute in plpgsql.
B = the prepare statement being executed.
The plpgsql execute can be identified by the 'exec_stmt_dynexecute' function. Because this executes a command, the mandatory steps of parse, rewrite, planning and execute have to be performed.

The execution of the prepare can be seen with 'A', the result of the prepare can be seen with 'B': a prepared statement is created, for which the work of parsing and rewriting is done, after which it is saved in a hash table for later use with SQL execute.

And these are the internal functions for the execute part (that executes an execute) in the anonymous code:

exec_stmt_fori
ㄴ exec_stmts
  ㄴ exec_stmt
    ㄴ exec_stmt_dynexecute
      ㄴ SPI_execute
        ㄴ _SPI_prepare_oneshot_plan
          ㄴ pg_parse_query                A-parse syntactic
        ㄴ _SPI_execute_plan
          ㄴ pg_analyze_and_rewrite.       A-parse semantic
            ㄴ pg_rewrite_query            A-rewrite
          ㄴ GetCachedPlan
            ㄴ BuildCachedPlan
              ㄴ pg_plan_queries           A-planner
          ㄴ pgss_ProcessUtility           
            ㄴ standard_ProcessUtility
              ㄴ ExecuteQuery              A-execute (execute)
                ㄴ GetCachedPlan
                  ㄴ BuildCachedPlan
                    ㄴ pg_plan_queries     B-planner
                ㄴ PortalRun
                  ㄴ PortalRunSelect
                    ㄴ ExecutorRun         B-execute

I have prefixed the explanation with A and B:
A = the execute in plpgsql.
B = the execute statement being executed.
The essence is that the execute of the command 'execute' has to follow the mandatory steps of parse, rewrite, planning and execute, which are shown with 'A'.

The result of the execute can be seen with 'B': an execute statement, for which it performs the planning and execute steps only. For the planning the normal rules apply, which mean the prepared statement has to build at least 5 custom plans before it will consider a generic plan.

DEV Community