PostgreSQL
PostgreSQL is a powerful and reliable database system that offers many features and benefits. However, it also has some challenges and drawbacks, such as the table bloat problem. Table bloat occurs when the space occupied by a table or an index is larger than the actual data size. This can affect the performance and efficiency of the database, as well as increase the backup and restore time.
One of the main causes of table bloat is the accumulation of dead tuples. Dead tuples are rows that have been deleted or updated, but are still visible to concurrent transactions. PostgreSQL uses a mechanism called MVCC (Multi-Version Concurrency Control) to ensure transaction isolation and consistency. MVCC allows each transaction to see a snapshot of the database at a certain point in time, without locking the rows or tables. This means that old versions of rows are not immediately removed, but are marked as dead and invisible to new transactions.
To reclaim the space occupied by dead tuples, PostgreSQL runs a process called vacuum. Vacuum removes dead tuples from the tables and indexes, and marks the space as free for reuse. Vacuum also updates the visibility map, which tracks which pages of a table contain only visible tuples. This helps to speed up index-only scans, which can avoid reading the table pages if the visibility map indicates that they are not needed.
Autovacuum
PostgreSQL has an automatic vacuum daemon called autovacuum, which runs in the background and performs vacuum on tables that need it. Autovacuum is triggered by various factors, such as the number of dead tuples, the age of the transaction ID, and the size of the table. Autovacuum also performs analyze, which updates the statistics used by the query planner to choose the best execution plan.
However, sometimes autovacuum may not run frequently enough for some tables, especially those with a large number of tuples and a high update or delete rate. This can lead to excessive table bloat and poor query performance. In this blog post, we will show you how to control the frequency of autovacuum for specific tables using storage parameters.
Storage Parameters
Storage parameters are options that can be set for each table or index to control their physical storage characteristics. Some of these parameters affect how autovacuum behaves for a given relation. The two parameters that we will focus on are:
- autovacuum_vacuum_threshold: The minimum number of dead tuples in a table before autovacuum will issue a VACUUM command on it. The default value is 50.
- autovacuum_vacuum_scale_factor: A fraction of the total number of tuples in a table that will be added to autovacuum_vacuum_threshold to determine whether autovacuum should issue a VACUUM command on it. The default value is 0.2.
The formula for calculating when autovacuum will run on a table is:
(the number of dead tuples)
> autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (the number of tuples)
For example, if a table has 100,000 tuples and 20,000 dead tuples, autovacuum will run on it when:
20,000
> 50 + 0.2 * 100,000
which simplifies to:
20,000
> 20,050
which is false, so autovacuum will not run on this table.
However, if we change the storage parameters for this table to:
- autovacuum_vacuum_threshold = 1000
- autovacuum_vacuum_scale_factor = 0
then autovacuum will run on it when:
20,000
> 1000 + 0 * 100,000
which simplifies to:
20,000
> 1000
which is true, so autovacuum will run on this table.
By adjusting these parameters, we can make autovacuum more aggressive or more conservative for specific tables, depending on our needs and preferences.
Example
Let's see how this works in practice using an example. We will use the pgbench_accounts table from the pgbench benchmark tool. This table has 100,000 rows and is frequently updated by pgbench transactions.
We can check the current storage parameters for this table using the \d+ command in psql:
postgres=# \d+ pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------+---------------+-----------+----------+---------+---------+--------------+-------------
aid | integer | | not null | | plain | |
bid | integer | | | | plain | |
abalance | integer | | | | plain | |
filler | character(84) | | | | extended| |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid_index" btree (bid)
Options: autovacuum_vacuum_threshold=1000, autovacuum_vacuum_scale_factor=0
We can see that the table has the storage parameters that we want to use for this example. If we want to change them, we can use the ALTER TABLE command:
postgres=# ALTER TABLE pgbench_accounts SET (autovacuum_vacuum_threshold = 1000);
ALTER TABLE
postgres=# ALTER TABLE pgbench_accounts SET (autovacuum_vacuum_scale_factor = 0.0);
ALTER TABLE
Now, let's generate some dead tuples by updating some rows in the table:
postgres=# update pgbench_accounts set abalance = 1 where aid < 1002;
UPDATE 1001
We can check the number of dead tuples and the last autovacuum time for this table using the pg_stat_all_tables view:
postgres=# SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_all_tables WHERE relname = 'pgbench_accounts';
relname | n_dead_tup | last_autovacuum
------------------+------------+-------------------------------
pgbench_accounts | 1001 | 2023-07-01 00:32:23.627931+09
(1 row)
We can see that the table has 1001 dead tuples and the last autovacuum was at 00:32:23. If we wait for a few minutes, we can see that autovacuum runs again on this table:
postgres=# SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_all_tables WHERE relname = 'pgbench_accounts';
relname | n_dead_tup | last_autovacuum
------------------+------------+-------------------------------
pgbench_accounts | 0 | 2023-07-01 00:43:24.498598+09
(1 row)
We can see that the number of dead tuples is now zero and the last autovacuum was at 00:43:24. This means that autovacuum ran on this table as soon as it reached our set threshold of 1000 dead tuples.
Apache AGE:
Apache AGE is a graph database extension for PostgreSQL that allows you to store, query, and analyze graph data in a relational database. Graph data is composed of nodes and edges that represent entities and relationships. Visit their Official Website and Github Repo to learn More.
Conclusion
In this blog post, we learned how to control the frequency of autovacuum for specific tables using storage parameters. We saw how to adjust the autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor parameters to make autovacuum more aggressive or more conservative for a given table. We also saw an example of how this works in practice using the pgbench_accounts table from the pgbench benchmark tool.
We hope that this blog post was helpful and informative for you. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!
Top comments (0)