Introduction to Database Clusters, Databases, and Tables for Effective Data Management: PostgreSQL

#postgressql #clusters #tutorial

The main objective of this blog is to get readers with some brief insights of the concept of Database Clusters and provide a comprehensive overview of PostgreSQL, which serves as an illustrative example of a database cluster. The article covers the following topics in detail:

The logical structure of a database cluster.
The physical structure of a database cluster.
The internal structure of a heap table file.
The procedures involved in writing and reading data into a table.

By exploring these aspects, readers will gain a fundamental understanding of how database clusters are organized and how data is managed within them, specifically focusing on PostgreSQL as a practical implementation.

1-Database Logical Structure Overview:

First of all, PostgreSQL is a server that manages cluster of database( collection of database)

A database is a compilation of various database objects that serve the purpose of storing and referencing data. These objects can take the form of different data structures. One commonly used example is a table (often referred to as a heap), but there are also other types of objects such as indexes, sequences, views, functions, and more. In PostgreSQL, the database itself is considered a database object and is logically distinct from other objects. To manage these objects internally, PostgreSQL utilizes Object Identifiers (OIDs), which are 4-byte unsigned integers assigned to each object. The relationships between database objects and their corresponding OIDs are stored in designated system catalogues.

2-Lets Talk about the Physical Structure:

Cluster of Db is a basically a sub-directory which usually contains a lot of files and other related Db stuff.
A database is a collection of database objects that are designed to store and reference data. These objects encompass various data structures, including tables (commonly known as heaps), as well as indexes, sequences, views, functions, and more. In PostgreSQL, the actual database is treated as a distinct database object, separate from other objects within the system. To effectively handle these objects, PostgreSQL employs Object Identifiers (OIDs), which are 4-byte unsigned integers uniquely assigned to each object. The relationships between the database objects and their respective OIDs are stored in specific system catalogues for efficient management.

Database Cluster's layout:
In terms of file and sub-directory descriptions, the following details can be provided:

Files:

PG_VERSION: This file holds the major version number of PostgreSQL.
postgresql.conf: It is a file utilized for configuring parameters.
postgresql.auto.conf: This file is responsible for storing configuration parameters that are set in ALTER SYSTEM (version 9.4 or later).

Sub-directories:

base/: This sub-directory encompasses per-database subdirectories.
global/: Within this sub-directory, cluster-wide tables like pg_database and pg_control are stored.
pg_stat/: This sub-directory is dedicated to permanent files for the statistics subsystem.
pg_tblspc/: Here, symbolic links to tablespaces can be found.

Db's Layout:
Under the base sub-directory, each database in PostgreSQL is represented as a separate sub-directory. The names of these database directories correspond to their respective Object Identifiers (OIDs).

Information regarding Files Associated with Tables and Indexes:

Here are key points regarding the files associated with tables and indexes,

Tables and indexes in PostgreSQL are internally managed by individual Object Identifiers (OIDs).
The data files of tables and indexes are managed by a variable called "relfilenode".
Certain commands like TRUNCATE, REINDEX, and CLUSTER are used to change the relfilenode values of tables and indexes.
When the file size of a table or index exceeds 1GB, PostgreSQL creates a new file with the name relfilenode.1. Subsequent files are created with names like relfilenode.2 when the previous file is full.
Each table has two associated files: one suffixed with '_fsm' for the free space map and another suffixed with '_vm' for the visibility map.
Free space maps store information about the free space capacity of each page in the table or index, while visibility maps store visibility information.
Indexes only have individual free space maps and do not have a visibility map.
These maps and files can also be referred to internally as forks of each relationship.
The free space map is the first branch of the table/index data file (with branch number 1), and the visibility map is the second branch of the table data file (with branch number 2).
The fork number of the data file is 0.

Tablespaces:
In PostgreSQL, a tablespace is an extra storage area located outside the base directory. When creating a tablespace using the CREATE TABLESPACE statement, a version-specific subdirectory is established within the specified directory. For instance, a subdirectory named PG_14_202011044 may be created. When a new table is created within a database belonging to the base directory, a new directory is generated under the version-specific subdirectory, bearing the same name as the existing database OID. Subsequently, the new table file is stored within the created directory.

3. Internal Layout of a Heap Table File:

In PostgreSQL, the data file, including the heap table, index, free space map, and visibility map, is partitioned into fixed-length pages or blocks. By default, each page is 8192 bytes or 8 KB in size. These pages are assigned sequential block numbers starting from 0 within each file. When a file reaches its capacity, PostgreSQL appends a new empty page at the end of the file to expand its size and accommodate additional data.

Heap tuple(s): These are records stored in a table page, arranged in order from the bottom of the page.
Line pointer(s): Each line pointer is a 4-byte element that points to a heap tuple within the page.
Header data: This 24-byte section stores general information about the page, providing an overview of its properties.

A table page in PostgreSQL contains heap tuples, line pointers, header data, free space or holes, and Tuple Identifiers (TIDs). Heap tuples are ordered from the bottom of the page, line pointers indicate their locations, and the header data provides general information. Free space refers to empty areas, and TIDs include block and offset numbers. Larger heap tuples use TOAST for storage.

Methods of Writing and Reading Tuples:

In a single-page table with one heap tuple, when a second tuple is inserted, a new row pointer is added to the first tuple to point to the second tuple. The pointers for pd_lower and pd_upper are updated accordingly, along with other necessary header modifications in the page.

Learn how to read the Head Tables

Sequential Scan: All tuples on each page are read sequentially by scanning row pointers.

B-tree Index Scan: The index file contains index tuples with an index key and a TID pointing to the target heap tuple. When an index tuple matching the searched key is found, PostgreSQL utilizes the TID to retrieve the desired heap tuple.

For more Detailed information visit the official documentation website: https://www.interdb.jp/pg/
Similar articles like this: https://medium.com/@nimratahir1212/chapter-01-81c49052a73