DEV Community

Hadi Atef
Hadi Atef

Posted on

"The Internals of PostgreSQL" chapter #1

Logical Structure of Database Cluster:

A database is a collection of related data that is managed and accessed together.PostgreSQL allows multiple databases to be created within a single cluster, which can be useful for organizing related data.
I should note that Object Identifier (OID) columns used to be a feature of PostgreSQL, OIDs were a way to assign a unique identifier to each row in a table, which could be used as a primary key or for other purposes. The OID value was automatically generated by PostgreSQL and could be accessed using the oid column in system catalogs.

Physical Structure of Database Cluster:

In PostgreSQL, a database cluster is a directory structure that contains subdirectories and files for managing databases. The base directory is the main directory, and it contains subdirectories for databases, tables, indexes, and configuration files. When initializing a new database cluster, a base directory is created, which can be set to the environment variable PGDATA. PostgreSQL also supports tablespaces, which are directories containing data outside of the base directory.

A heap table file:

Stores data in pages, which are fixed-size blocks of disk storage. Each page has a header with page metadata and contains a body consisting of an item pointer array and free space. Rows are inserted into pages by adding the new row to the item pointer array and updating the free space. When rows are deleted, the space they occupy is marked as free in the page's free space map. Over time, the free space on a page may become fragmented, requiring PostgreSQL to perform periodic vacuum processes to reclaim unused space and reorganize the data for better performance. The hierarchy of the heap table file is organized into a linked list of pages, with table-level metadata at the highest level.

Writing and reading Heap Tuples:

when a new tuple is inserted into a table file in PostgreSQL, the database engine follows a multi-step process to ensure transaction safety and consistency. If the table contains only one page with one heap tuple, and a second tuple is inserted, the second tuple is placed after the first one, and the page's pd_lower is updated to point to the second line pointer, while pd_upper is updated to point to the second heap tuple. The line pointers and heap tuples are used to organize the data within the page, while other header data, such as pd_lsn, pg_checksum, and pg_flag, are rewritten to appropriate values. These steps are carried out in a carefully managed process to ensure the stability and consistency of the table file.

PostgreSQL uses two main access methods to retrieve tuples from a table file: sequential scan and B-tree index scan.

  • Sequential scan reads all tuples in all pages of the table file by scanning all line pointers in each page. This method is used when there is no specific index to use for the query, but it can be slow for large tables.

  • B-tree index scan uses an index file that contains index tuples composed of an index key and a TID pointing to the target heap tuple. If the index tuple with the key being searched is found, PostgreSQL reads the desired heap tuple using the obtained TID value. This method is faster for large tables with appropriate indexes and can significantly improve query performance by reducing the amount of data that needs to be scanned.

Top comments (0)