DEV Community

Cover image for Clustered and Foreign Tables in PostgreSQL
Quadcode team for Quadcode

Posted on • Originally published at Medium

Clustered and Foreign Tables in PostgreSQL

My name is Yakupov Azat. I'm a Data Architect, and here is the continuation of the saga of table types in PostgreSQL. In this part, we'll talk about clustered and foreign tables. Let's look at examples of their creation, areas of application, and pros and cons of their use.

Clustered Tables in PostgreSQL

In PostgreSQL, a clustered table is a table whose physical order of rows has been changed to match the order of the rows in an index. This is different from the normal behavior of a table, where the physical order of the rows may not match the order of the rows as defined by any particular index.

Few people like chaos, everyone likes order. Within the framework of relational databases, the concept of chaos is closely intertwined with the storage of information, because throughout its life cycle, a table is constantly changing.

In the process of working with the DBMS at the disk level, the contents of the table are constantly changing. For example, you've updated data and your updated row falls on another page of the table (here we should mention FILLFACTOR) with a dead tuple in the current position. Then the autovacuum process deleted the dead tuple, and the vacated slot was filled with the newly received row.

A simple test that you can do yourself. Make the following commands into a regular newly created table:

INSERT INTO test(id,name) VALUES(1, ‘Peter’);
INSERT INTO test(id,name) VALUES(2, ‘Ivan’);
INSERT INTO test(id,name) VALUES(3, ‘Sergey’);
Enter fullscreen mode Exit fullscreen mode

After executing the SQL query (please note, there is no ORDER BY):

SELECT *
   FROM test;
Enter fullscreen mode Exit fullscreen mode

You'll see the expected picture:

Image description

But by doing a row update

UPDATE test
  SET name = ‘Ruslan’
WHERE id = 2;
Enter fullscreen mode Exit fullscreen mode

and then by executing the same SQL, you'll get:

Image description

The order of the rows has changed! Entropy has grown.

Now imagine that you're looking for data in the table, for example, number 4. How can you do this within the chaotic green topology I drew below left? Just going through record after record: you accidentally poke at some number and compare it to the needed number 4. In fact, you'd have to go through all the entries, because there could be more than one number 4. In other words, a sequential scan is needed.

Image description

But when you have order, as in the table on the right, you’ll clearly know that the number 4 lies between 3 and 5. That’s the whole point of organizing order and clustered tables: they help create an ordered structure out of chaos. If you randomly select a random position in the ordered blue table in search of the number 4, then three outcomes are possible:

  1. The number is equal to the desired one.
  2. The number is less than the desired one.
  3. The number is greater than the desired one.

This gives a great advantage in the speed of search execution. If the number is greater than 4, you’ll continue to search up the table. If less, you’ll go down. Or you can get a range and search for the number 4 inside it. This is much faster than searching through all the data, as it was in the unorganized green topology, namely, logarithmic times faster.

Consider an example of creating a clustered table:

CREATE TABLE test.cluster_table
(id       INTEGER,
 name VARCHAR) WITH (FILLFACTOR = 90);

CREATE INDEX id_idx ON test.cluster_table (id);

CLUSTER [VERBOSE] test.cluster_table USING id_idx;
Enter fullscreen mode Exit fullscreen mode

Here I created a table called cluster_table and set the FILLFACTOR value for it to 90% — this is the fill percentage. It doesn’t affect our clustered table in any way, it’s just an example of how you can set a property when creating a table of this type. Next, I create a BTree index on the table (CREATE INDEX) in the id field and call the CLUSTER command. The CLUSTER command does the clustering of the table using the index that we previously created.

It’s important to know here that until clustering is completed, all current transactions in the table will be blocked. Traffic blocking occurs because Postgres is trying to rebuild the table in the order you require based on the index. And after creating this order, Postgres should save it to another file.

In fact, this is a disk-level data migration operation from one file to another, but only in the specified order. The data should be placed based on the index, in our case by the id field. I have graphically shown this in the figure below by referring to the metadata before and after the clustering of the table.

Initially, the table was placed in a file with the number 45969. After the CLUSTER command, the file name changed. Data was moved from one file to another. Therefore, blocking occurs, and, accordingly, incoming traffic can’t use this table until it becomes available.

Image description

You can also create an index for subsequent clustering that contains many columns (multicolumn index), or specify a descending order for certain columns (DESC / ASC).

You can optionally use the CLUSTER VERBOSE command, which will return details of what PostgreSQL did, namely how many pages there were, which pages were moved, and so on.

Test Cases and Data Order

Let’s do a little test:

CREATE TABLE test.cluster_table
(id       INTEGER,
 name VARCHAR) WITH (FILLFACTOR = 90);

CREATE INDEX id_idx ON test.cluster_table (id);

INSERT INTO test.cluster_table
  SELECT (random( )*100)::INTEGER, 'test'
     FROM generate_series(1,100) AS g(i);

SELECT id
   FROM test.cluster_table;
Enter fullscreen mode Exit fullscreen mode

Let’s create a table, an index by the id field and then generate 100 arbitrary rows using the generate_series command. The result is unordered data:

Image description

To achieve order in the output, we need to add the keyword ORDER BY. But here it’s important to remember that the ORDER BY operation also requires resources and you should pay for it. Every nanosecond counts with high-load traffic, and then there’s sorting.

In this case, let’s do the clustering of the table with the CLUSTER VERBOSE command, using the index that I created in advance:

CLUSTER VERBOSE test.cluster_table USING id_idx;

SELECT id
 FROM test.cluster_table;
Enter fullscreen mode Exit fullscreen mode

Voila, the data is sorted without sorting:

Image description

But there’s a trap here. Let’s update all the rows — but in fact, it’s enough to change the value of one row.

UPDATE test.cluster_table
   SET id = id * (random( )::INTEGER);

SELECT id
 FROM test.cluster_table;
Enter fullscreen mode Exit fullscreen mode

In this case, chaos will return to our clustered table:

Image description

To restore the order, you’ll need to run the CLUSTER command again. You don’t even have to specify the index again, because it’s preserved in the PostgreSQL metadata. And the database will understand next time what you’re doing clustering on.

CLUSTER VERBOSE test.cluster_table;

SELECT id
   FROM test.cluster_table;
Enter fullscreen mode Exit fullscreen mode

You’ll be able to observe the order again only after the CLUSTER command. This is the Achilles heel of clustered tables: any change in the clustering key can immediately bring disorder to the data.

When Clustered Tables are Suitable

Clustered tables are suitable if your data is reference tables (well, or SCD — Slowly Changing Dimension), for example, an address system. This type of tables is convenient if you upload new data quite rarely, for example, once a month.

If the table changes very often and is subject to INSERT, UPDATE and DELETE operations, it will have to be clustered constantly, and this is not very convenient and generally critical. The purpose of clustering is to avoid unnecessary ORDER BY in constant queries to the table by a clustered field or fields.

Metadata of a Clustered Table

From the metadata of the clustered table, you can understand that it’s clustered:

SELECT  c.oid AS “OID”,
        c.relname AS “Relation name”
  FROM pg_class c INNER JOIN pg_index i 
                  ON i.indrelid = c.oid
WHERE c.relkind = ‘r’ AND 
      c.relhasindex AND 
      i.indisclustered;
Enter fullscreen mode Exit fullscreen mode

Image description

The “true” value in the relhasindex field indicates that there’s an index to support clustering. When we rebuild the cluster in the next CLUSTER command, PostgreSQL will use the specified index from the metadata.

Foreign Tables in PostgreSQL

Foreign tables in PostgreSQL are tables that are stored outside of the database. These tables can be located on another database server or in a file, for example. Foreign tables are useful from the point of view of quickly getting data from another source, if you have the ability to join it.

In addition, if you tinker with it, you can provide a so-called data lifecycle, to provide a Retention Policy metric. The following set of tools can help you here:

  • VIEW (virtual table).
  • A set of regular tables separated by data retention logic (POOD design) with up-to-date data.
  • Foreign tables that focus on files that store data outside the database on cheaper disks (here you’ll find old data that exceeded the Retention Policy metric).

There are many foreign tables and connection types, for example:

  • CSV file.
  • Connection with many other RDBMS.
  • Connection to some NoSQL databases.

Let’s look at an example of a foreign table based on a CSV file. The file_fdw extension based on fdw — foreign data wrapper — will help us with this:

CREATE EXTENSION file_fdw;

CREATE SERVER csv_log FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE test.csv (
  id   INTEGER,
  name VARCHAR
) SERVER csv_log 
  OPTIONS (filename '/var/lib/postgresql/file.csv', 
                   delimiter ‘;', format 'csv');
Enter fullscreen mode Exit fullscreen mode

I create a foreign table and describe the attributes, specifying the server for fdw, which I created in advance with options for working with the file.

If I make an SQL query to a foreign table, I’ll see the data that is presented in the file. Since the foreign table is registered (meaning there’s an entry in the PostgreSQL metadata), I have a hypothesis: is the data stored not in an external file, but in a PostgreSQL data file?

SELECT  oid AS “OID”,
        pg_relation_filepath(oid) AS “File path”,
        pg_relation_size(oid) AS “Relation Size”  
 FROM pg_class
WHERE relname = ‘csv’;
Enter fullscreen mode Exit fullscreen mode

Execution result:

Image description

So, the foreign table as an object is registered in the metadata (there’s an OID identifier of the object), but there’s no corresponding data file, that is, the data is presented only in an external source.

Queries to Foreign Tables

How do queries to foreign tables work? Let’s take the CSV file as an example.

Image description

While the data’s being loaded, there’s a fairly long time delay, so we store the old data somewhere on the old disks. To get the data, we need to open an external file descriptor, then copy the data to memory or to a temporary file and return the data to us. If we re-execute the same request a little later, there won’t be any acceleration: the process remains the same.

There are a great many libraries of foreign tables for various needs. For example, postgres_fdw. With its help, we can connect to PostgreSQL from PostgreSQL. It’s very much like a database link:

CREATE EXTENSION postgres_fdw;

DROP FOREIGN TABLE test.csv;

CREATE SERVER pg_log FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.168.56.10', port '5432', dbname ‘course_db');

CREATE USER MAPPING FOR test SERVER pg_log 
OPTIONS (user 'test', password 'test');

CREATE FOREIGN TABLE test.csv (
  id   INTEGER,
  name VARCHAR
) SERVER pg_log 
  OPTIONS (schema_name 'test', table_name ‘user');
Enter fullscreen mode Exit fullscreen mode

A huge number of libraries are available for working with external sources. For example:

  • Oracle, MySQL, SQLite, MS SQL Server, Sybase.
  • Cassandra, MongoBD, HBase, Redis, Neo4j.
  • Twitter, Telegram
  • JSON, XLM, GeoFiles, LDAP.

Metadata of the Foreign Table

As we found out, the foreign table as an object is fixed in the metadata:

SELECT  oid AS "OID",
        relname AS “Relation name",
        CASE
         WHEN relpersistence = 'p' THEN 'Permanent'
         WHEN relpersistence = 't' THEN 'Temporary'
         ELSE 'Unlogged'
       END AS “Type”,
       relkind AS “Subtype”
 FROM pg_class
WHERE relname = ‘csv’;
Enter fullscreen mode Exit fullscreen mode

Image description

It’s a permanent table (surprisingly), but it has a pointer “f” which is a subtype of the relation. And it indicates that our table is foreign, that is, external.

In the Next Series

That’s all for today. In the following material we’ll analyze:

  • Partitioned tables.
  • Inherited tables.

Top comments (0)