My name is Yakupov Azat. I'm a Data Architect, and here is the continuation of the saga of table types in PostgreSQL. In this part, we'll talk about clustered and foreign tables. Let's look at examples of their creation, areas of application, and pros and cons of their use.
Clustered Tables in PostgreSQL
In PostgreSQL, a clustered table is a table whose physical order of rows has been changed to match the order of the rows in an index. This is different from the normal behavior of a table, where the physical order of the rows may not match the order of the rows as defined by any particular index.
Few people like chaos, everyone likes order. Within the framework of relational databases, the concept of chaos is closely intertwined with the storage of information, because throughout its life cycle, a table is constantly changing.
At the disk level, the contents of a table change constantly while the DBMS works. For example, you update a row and the new version lands on another page of the table (this is where FILLFACTOR comes in), leaving a dead tuple in the old position. Then the autovacuum process deletes the dead tuple, and the vacated slot is filled with a newly arrived row.
A simple test that you can do yourself. Run the following commands against a regular, newly created table:
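The table definition itself is not shown; a minimal sketch that fits the INSERTs below might look like this (the column types are an assumption):

```sql
-- Hypothetical definition of the test table used in the examples below
CREATE TABLE test
(id   INTEGER,
 name VARCHAR);
```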
INSERT INTO test(id,name) VALUES(1, 'Peter');
INSERT INTO test(id,name) VALUES(2, 'Ivan');
INSERT INTO test(id,name) VALUES(3, 'Sergey');
After executing the SQL query (please note, there is no ORDER BY):
SELECT *
FROM test;
You'll see the expected picture:
But by doing a row update
UPDATE test
SET name = 'Ruslan'
WHERE id = 2;
and then by executing the same SQL, you'll get:
The order of the rows has changed! Entropy has grown.
Now imagine that you're looking for a value in the table, say the number 4. How can you do this within the chaotic green topology I drew below left? Only by going record after record: you pick a position at random and compare the value there with the 4 you need. In fact, you'd have to go through all the entries, because there could be more than one 4. In other words, a sequential scan is needed.
But when you have order, as in the table on the right, you clearly know that 4 lies between 3 and 5. That's the whole point of organizing order and of clustered tables: they help create an ordered structure out of chaos. If you pick a position at random in the ordered blue table while searching for 4, three outcomes are possible:
- The number is equal to the desired one.
- The number is less than the desired one.
- The number is greater than the desired one.
This gives a great advantage in search speed. If the number is greater than 4, you continue searching up the table; if less, you go down. Or you can narrow down a range and look for 4 inside it. This is much faster than scanning through all the data, as in the unorganized green topology: logarithmically faster, in fact.
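This logarithmic lookup is exactly what a B-tree index gives the planner. You can see it yourself on the assumed test table from above (the index name is hypothetical, and the exact plan depends on table size and statistics):

```sql
-- With a B-tree index on id, a point lookup can use an Index Scan
-- (logarithmic) instead of a Seq Scan (linear).
CREATE INDEX test_id_idx ON test (id);

-- Show which access path the planner chooses for the lookup.
EXPLAIN SELECT * FROM test WHERE id = 4;
```

On a table of only three rows the planner may still prefer a sequential scan, since reading one page is cheaper than descending an index; the index scan wins as the table grows.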
Consider an example of creating a clustered table:
CREATE TABLE test.cluster_table
(id INTEGER,
name VARCHAR) WITH (FILLFACTOR = 90);
CREATE INDEX id_idx ON test.cluster_table (id);
CLUSTER test.cluster_table USING id_idx;
Here I created a table called cluster_table and set its FILLFACTOR to 90%, the page fill percentage. It doesn't affect clustering as such; it's just an example of setting a property when creating a table of this type. Next, I create a B-tree index (CREATE INDEX) on the id field and call the CLUSTER command. The CLUSTER command clusters the table using the index that we previously created.
It's important to know that until clustering completes, the table is unavailable: CLUSTER takes an exclusive lock, so all current transactions on the table are blocked. The blocking occurs because Postgres rebuilds the table in the order you require, based on the index, and after creating this order it must save the result to another file.
In fact, this is a disk-level data migration operation from one file to another, but only in the specified order. The data should be placed based on the index, in our case by the id field. I have graphically shown this in the figure below by referring to the metadata before and after the clustering of the table.
Initially, the table was stored in a file numbered 45969. After the CLUSTER command, the file number changed: data was moved from one file to another. This is why blocking occurs and, accordingly, why incoming traffic can't use this table until it becomes available again.
You can also create an index for subsequent clustering that contains many columns (multicolumn index), or specify a descending order for certain columns (DESC / ASC).
You can optionally use the CLUSTER VERBOSE command, which will report details of what PostgreSQL did: how many pages there were, which pages were moved, and so on.
Test Cases and Data Order
Let’s do a little test:
CREATE TABLE test.cluster_table
(id INTEGER,
name VARCHAR) WITH (FILLFACTOR = 90);
CREATE INDEX id_idx ON test.cluster_table (id);
INSERT INTO test.cluster_table
SELECT (random()*100)::INTEGER, 'test'
FROM generate_series(1,100) AS g(i);
SELECT id
FROM test.cluster_table;
We create a table and an index on the id field, then generate 100 rows with arbitrary ids using the generate_series function. The result is unordered data:
To get ordered output, we would need to add the keyword ORDER BY. But it's important to remember that ORDER BY also costs resources, and you pay that cost on every query. Under high-load traffic every nanosecond counts, and sorting adds up.
In this case, let's cluster the table with the CLUSTER VERBOSE command, using the index that I created in advance:
CLUSTER VERBOSE test.cluster_table USING id_idx;
SELECT id
FROM test.cluster_table;
Voila, the data is sorted without sorting:
But there's a trap here. Let's update all the rows (in fact, changing the value of just one row would be enough):
UPDATE test.cluster_table
SET id = id * (random()::INTEGER);
SELECT id
FROM test.cluster_table;
In this case, chaos will return to our clustered table:
To restore the order, you'll need to run the CLUSTER command again. You don't even have to specify the index, because it's preserved in the PostgreSQL metadata: the database remembers which index the table is clustered on.
CLUSTER VERBOSE test.cluster_table;
SELECT id
FROM test.cluster_table;
You'll see the order again only after the CLUSTER command. This is the Achilles heel of clustered tables: any change to the clustering key can immediately bring disorder back into the data.
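Rather than eyeballing the output, you can measure how much physical order remains using the planner statistic pg_stats.correlation: values near 1 mean the column's physical order closely follows its logical order, values near 0 mean chaos. A sketch, assuming the table and column from above:

```sql
-- Refresh statistics, then inspect the physical/logical ordering
-- correlation for the clustering column. It is close to 1.0 right
-- after CLUSTER and drifts toward 0 as updates scatter rows.
ANALYZE test.cluster_table;

SELECT tablename, attname, correlation
FROM pg_stats
WHERE schemaname = 'test'
  AND tablename  = 'cluster_table'
  AND attname    = 'id';
```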
When Clustered Tables are Suitable
Clustered tables are a good fit for reference data (or SCD, Slowly Changing Dimensions), for example an address system. This type of table is convenient if you load new data fairly rarely, for example once a month.
If the table changes very often and is subject to constant INSERT, UPDATE and DELETE operations, it will have to be re-clustered constantly, which is inconvenient and, under load, critical. The purpose of clustering is to avoid unnecessary ORDER BY in frequent queries against the table by the clustered field or fields.
Metadata of a Clustered Table
From the metadata of the clustered table, you can understand that it’s clustered:
SELECT c.oid AS "OID",
       c.relname AS "Relation name"
FROM pg_class c INNER JOIN pg_index i
     ON i.indrelid = c.oid
WHERE c.relkind = 'r' AND
      c.relhasindex AND
      i.indisclustered;
A true value in the relhasindex field indicates that the table has an index, and the indisclustered flag marks the index that supports clustering. When we rebuild the cluster with the next CLUSTER command, PostgreSQL will use this index from the metadata.
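The marker works in the other direction too: you can tell PostgreSQL to forget it. A sketch, assuming the cluster_table from above:

```sql
-- Clear indisclustered in the metadata. After this, CLUSTER with no
-- USING clause has no recorded index for this table, and a bare
-- database-wide CLUSTER will no longer re-order it.
ALTER TABLE test.cluster_table SET WITHOUT CLUSTER;
```

Conversely, running CLUSTER with no arguments at all re-clusters every previously clustered table you own in the current database.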
Foreign Tables in PostgreSQL
Foreign tables in PostgreSQL are tables whose data is stored outside the database. The data can live on another database server or in a file, for example. Foreign tables are useful for quickly getting data from another source so that you can join it with your own.
In addition, with some tinkering you can implement a data lifecycle and enforce a Retention Policy. The following set of tools can help you here:
- VIEW (virtual table).
- A set of regular tables separated by data retention logic (POOD design) with up-to-date data.
- Foreign tables pointing to files that store data outside the database on cheaper disks (this is where you keep old data that has exceeded the Retention Policy).
There are many kinds of foreign tables and connection types, for example:
- CSV file.
- Connections to many other RDBMSs.
- Connections to some NoSQL databases.
Let's look at an example of a foreign table based on a CSV file. The file_fdw extension, built on the fdw (foreign data wrapper) mechanism, will help us with this:
CREATE EXTENSION file_fdw;
CREATE SERVER csv_log FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE test.csv (
id INTEGER,
name VARCHAR
) SERVER csv_log
OPTIONS (filename '/var/lib/postgresql/file.csv',
delimiter ';', format 'csv');
I create a foreign table and describe its attributes, specifying the fdw server that I created in advance, with options for working with the file.
If I run an SQL query against the foreign table, I'll see the data from the file. Since the foreign table is registered (meaning there's an entry in the PostgreSQL metadata), a question arises: is the data perhaps stored not in the external file, but in a PostgreSQL data file?
SELECT oid AS "OID",
       pg_relation_filepath(oid) AS "File path",
       pg_relation_size(oid) AS "Relation Size"
FROM pg_class
WHERE relname = 'csv';
Execution result:
So, the foreign table as an object is registered in the metadata (there’s an OID identifier of the object), but there’s no corresponding data file, that is, the data is presented only in an external source.
Queries to Foreign Tables
How do queries to foreign tables work? Let’s take the CSV file as an example.
Loading the data involves a noticeable delay, which is why foreign tables suit old data parked on old, cheap disks. To get the data, PostgreSQL has to open a descriptor for the external file, copy the data into memory or into a temporary file, and return it to us. If we re-execute the same query a little later, there won't be any speedup: the whole process is repeated from scratch.
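You can watch this delegation happen in the query plan: a foreign table has no heap pages of its own, so the planner emits a Foreign Scan node instead of the usual Seq Scan. A sketch against the test.csv table defined above (the exact plan text varies by version):

```sql
-- The FDW, not the executor, does the reading: the plan shows a
-- Foreign Scan node that hands the work to file_fdw.
EXPLAIN SELECT * FROM test.csv WHERE id = 1;
-- Expect a plan of roughly this shape:
--   Foreign Scan on csv
--     Filter: (id = 1)
--     Foreign File: /var/lib/postgresql/file.csv
```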
There are a great many foreign data wrapper libraries for various needs. For example, postgres_fdw, which lets us connect to PostgreSQL from PostgreSQL. It's very much like a database link:
CREATE EXTENSION postgres_fdw;
DROP FOREIGN TABLE test.csv;
CREATE SERVER pg_log FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.168.56.10', port '5432', dbname 'course_db');
CREATE USER MAPPING FOR test SERVER pg_log
OPTIONS (user 'test', password 'test');
CREATE FOREIGN TABLE test.csv (
id INTEGER,
name VARCHAR
) SERVER pg_log
OPTIONS (schema_name 'test', table_name 'user');
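With postgres_fdw you don't even have to describe each foreign table by hand: IMPORT FOREIGN SCHEMA can generate the definitions from the remote catalog. A sketch, assuming the pg_log server and remote test schema from above:

```sql
-- Pull in the definition of the remote "user" table instead of
-- writing CREATE FOREIGN TABLE manually ("user" is quoted because
-- it is a reserved word).
IMPORT FOREIGN SCHEMA test
    LIMIT TO ("user")
    FROM SERVER pg_log
    INTO test;
```

Dropping LIMIT TO imports every table of the remote schema, which is convenient when you mirror a whole reporting schema.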
A huge number of libraries are available for working with external sources. For example:
- Oracle, MySQL, SQLite, MS SQL Server, Sybase.
- Cassandra, MongoDB, HBase, Redis, Neo4j.
- Twitter, Telegram.
- JSON, XML, GeoFiles, LDAP.
Metadata of the Foreign Table
As we found out, the foreign table as an object is fixed in the metadata:
SELECT oid AS "OID",
       relname AS "Relation name",
       CASE
         WHEN relpersistence = 'p' THEN 'Permanent'
         WHEN relpersistence = 't' THEN 'Temporary'
         ELSE 'Unlogged'
       END AS "Type",
       relkind AS "Subtype"
FROM pg_class
WHERE relname = 'csv';
It's a permanent table (surprisingly), but its relkind value 'f', the relation subtype, indicates that our table is foreign, that is, external.
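If you want to list every foreign table in a database rather than look one up by name, the standard information_schema view gives a portable way to do it; a sketch:

```sql
-- All foreign tables in the current database, with the server
-- each one is attached to.
SELECT foreign_table_schema,
       foreign_table_name,
       foreign_server_name
FROM information_schema.foreign_tables
ORDER BY foreign_table_schema, foreign_table_name;
```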
In the Next Series
That’s all for today. In the following material we’ll analyze:
- Partitioned tables.
- Inherited tables.