This post summarizes Chapter 5 of the book "The Internals of PostgreSQL".
The chapter discusses about concurrency control techniques, with a huge focus on MVCC (multi-Version Concurrency Control), S2PL (strict Two-Phase Locking), and OOC (optimistic Concurrency Control).
MVCC is a technique where a new version of a data item is created in every write operation, while retaining the old version. This helps both the reader and writer to operate simultaneously, without blocking each other. A variation of MVCC, Snapshot Isolation (SI) is used by PostgreSQL. SI ensures isolation while preventing issues like Dirty read, Phantom Read, and non-repeatable read.
In version 9.1, PostgreSQL introduced Serializable Snapshot Isolation (SSI) in order to achieve true serializability. SSI helps to detect serialization anomalies, and resolves conflicts caused by them - hence, providing a serializable ioslation level. Transaction isolation levels such as READ COMMITTED, REPEATABLE READ, AND SERIALIZABLE have been explained in terms of their behavior, with regards to the anomalies mentioned above.
Tuples are the fundamental units of storage - they represent individual rows of data. The operations done on tuples include:
Insertion: inserting a new tuple into a table refers to inserting it directly into a page of the target table. The header fields (txid, ctid) are updated to indicate information about the transaction. This helps keep track of the history and visibility of the tuple.
Deletion: The id of the transaction that runs the delete operation is set as the maximum transaction ID that can see the tuple (t_xmax). This indicates that the tuple being deleted is logically marked as deleted, and that is is no longer needed, and therefore it can be considered as a dead tuple. In the process of VACCUM processing, these dead tuples are eventually removed from pages.
Update: To update a tuple, PostgreSQL deleted the existing tuple, and creates a new one with the updated values. This ensures integrity of data, and an ability to track the changes made in the past. the t_xmax field of the old tuple, and the t_xmin field of the new tuple is updated to the ID of the transaction performing the update operation.
Free Space Map (FSM) is used to manage space allocation in case of insertion or updation. Information about the free space capacity of each page is stored here, and FMS therefore helps to determine which page is capable of accomodating a new tuple, and selects the appropriate page accordingly.
Commit Log (clog) in postgreSQL holds statuses of transactions. It is useful throughout transaction processing. The clog maintains an array-like structure, where the status of a transaction ID is shown by each item. The status may be IN-PROGRESS, COMMITTED, ABORTED, and SUB-COMMITTED. The clog is backed-up to files on fisk, and old data from the clog is removed by vaccum processing.
Transaction Snapshot: information about the active transactions at a given point in time is stored in a transaction snapshot. It is basically a dataset, and it helps determine the visibility of tuples while the query is being executed. It is useful for visibility checks of tuples. Its format is:
'xmin:xmax:xip_list', where xmin is the earliest active transaction ID, xmax is the first unassigned transaction ID, and xip_list represents the active transaction IDs.
Visibility Check Rules: These are a set of rules used to determine whether a tuple is visible or not, based on parameters like t_xmin, t_xmax values, the clog and the transaction snapshot. Considerations such as transaction status, committed or aborted transactions are also included. These rules ensure isolation levels and consistency in data throughout the database system.
PostgreSQL determines the versions of data visible to a transaction, based on the status and tid. These rules are then used to determine the outcome of the SELECT commands run. This ensures that the data visible to transactions is committed, and issues like dirty read (uncommitted changes are visible to other transactions) are prevented.
PostgreSQL ensures that updated data is not accessible by transactions until it has been committed - thus preventing dirty read. This ensures data integrity as transactions are prevented from accessing unverified data.
Other anomalies that are prevented by PostgreSQL include phantom reads, where a transaction accesses newly inserted data which was not visible earlier, and Lost updates, where concurrent transactions update the data simultaneously.
PostgreSQL prevents phantom reads by considering a set of rules that make newly inserted data invisible to transactions that are ongoing; PostgreSQL's first-updater-win scheme is used to prevent loss updates. When multiple transactions access / update the data simultaneously, PostgreSQL ensures that only one transaction's changes are applied, thus preventing data incosistencie.
In summary, the chapter discusses about mechanisms used by PostgreSQL for data consistency, and transaction Isolation. Visibility checks are covered, along with anomalies and how they are prevented. PostgreSQL implements strategies like MVCC and SSI to maintain data integrity, and allow a reliable access to data.
Top comments (0)