<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hasnain Somani</title>
    <description>The latest articles on DEV Community by Hasnain Somani (@hasusomani).</description>
    <link>https://dev.to/hasusomani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1103929%2F154ad35e-1945-49e4-85e4-da63df867bcc.png</url>
      <title>DEV Community: Hasnain Somani</title>
      <link>https://dev.to/hasusomani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hasusomani"/>
    <language>en</language>
    <item>
      <title>Enterprise DB : About it.</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Fri, 28 Jul 2023 07:21:20 +0000</pubDate>
      <link>https://dev.to/hasusomani/enterprise-db-about-it-2al5</link>
      <guid>https://dev.to/hasusomani/enterprise-db-about-it-2al5</guid>
      <description>&lt;p&gt;&lt;strong&gt;About EDB:&lt;/strong&gt;&lt;br&gt;
Enterprise DB is a company that provides enterprise-class services and products, most of which are based on the open-source PostgreSQL DBMS. EDB's main objective is to promote the use of PostgreSQL and make it suitable for more complex applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EDB's offerings include:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EDB Postgres Platform, which includes tools for performance monitoring and database management.&lt;/li&gt;
&lt;li&gt;EDB Postgres Advanced Server: this is EDB's flagship product, and it provides additional features such as Oracle compatibility, improved security, and better performance for complex workloads.&lt;/li&gt;
&lt;li&gt;EDB Postgres Ark Platform: a Database-as-a-Service (DBaaS) offering that allows users to manage PostgreSQL-based databases in a cloud-based environment.&lt;/li&gt;
&lt;li&gt;EDB Postgres Tools, which make the management, monitoring, and modification of PostgreSQL databases easier.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;A Recap on Bitnine:&lt;/strong&gt;&lt;br&gt;
Bitnine is a different company, and a pioneer in graph database technologies. Bitnine's flagship product is AgensGraph, a graph database management system built on top of PostgreSQL. Bitnine also provides its users access to its own graph query language, the "Agens Query Language (AQL)".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How are they related?&lt;/strong&gt;&lt;br&gt;
EDB and Bitnine offer different types of services and focus on different areas: EDB promotes PostgreSQL for general enterprise use, whereas Bitnine specializes in graph database technology. Apart from the fact that both organizations build on PostgreSQL, there is no collaboration between them.&lt;/p&gt;

</description>
      <category>edb</category>
      <category>bitnine</category>
    </item>
    <item>
      <title>PgAdmin: About it, and getting it.</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Fri, 07 Jul 2023 08:10:28 +0000</pubDate>
      <link>https://dev.to/hasusomani/pgadmin-about-it-and-getting-it-4iod</link>
      <guid>https://dev.to/hasusomani/pgadmin-about-it-and-getting-it-4iod</guid>
      <description>&lt;p&gt;pgAdmin is an administration and development platform for PostgreSQL databases. It is open-source, and it provides a GUI for managing and interacting with PostgreSQL databases.&lt;/p&gt;

&lt;p&gt;Features of pgAdmin include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Database Management: the creation, modification, and deletion of databases can be done effectively using pgAdmin. In addition, pgAdmin provides an intuitive interface for tasks such as defining constraints and managing indexes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Querying and Scripting: pgAdmin has a built-in SQL query editor, which enables the user to execute SQL queries against PostgreSQL databases. Advanced features like syntax highlighting and query history are also available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visual Tools: these are used in the designing and modification of database schemas: you can visually edit and create tables, views, functions and triggers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User and Role Management: pgAdmin allows you to manage users that can access the database, and their privileges. Thus, you can use pgAdmin to create and delete user accounts, or set permissions and assign roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backup and Restore: pgAdmin offers features to perform backups (both full and partial) and restore them when needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extension Support: pgAdmin supports extensions of PostgreSQL. You can manage, install and configure these extensions through pgAdmin's interface.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
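&lt;p&gt;As a small sketch of the query editor in use (the table and column names here are hypothetical), a typical session might look like this:&lt;/p&gt;

```sql
-- Create a sample table from pgAdmin's Query Tool.
CREATE TABLE employees (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    dept text
);

INSERT INTO employees (name, dept)
VALUES ('Alice', 'Engineering'), ('Bob', 'Sales');

-- Standard SQL works as-is; results appear in the output grid.
SELECT dept, count(*) FROM employees GROUP BY dept;
```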

&lt;p&gt;pgAdmin is available for Windows, macOS, and Linux, and it is widely used by database professionals and data analysts for managing PostgreSQL databases. It is feature-rich, and it helps simplify tasks such as querying, scripting, and user management. &lt;/p&gt;

&lt;p&gt;To get pgAdmin on Windows, follow the video in the link below:&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=E-Qn2tbcmDE"&gt;https://www.youtube.com/watch?v=E-Qn2tbcmDE&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(source: Rishabh Mishra on YouTube).&lt;/p&gt;

</description>
      <category>pgadmin</category>
      <category>bitnine</category>
      <category>postgressql</category>
    </item>
    <item>
      <title>AgensGraph: About it, and Getting access to it.</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Fri, 07 Jul 2023 07:25:04 +0000</pubDate>
      <link>https://dev.to/hasusomani/agensgraph-about-it-and-getting-access-to-it-3o2h</link>
      <guid>https://dev.to/hasusomani/agensgraph-about-it-and-getting-access-to-it-3o2h</guid>
      <description>&lt;p&gt;AgensGraph is a database server and a complete DBMS that provides storage and management functionality for graph-structured data. AgensGraph extends PostgreSQL's relational database with graph-specific features and functions. It runs as a separate server process, and clients can connect to it for data storage, querying, and analysis. &lt;/p&gt;

&lt;p&gt;Just like any Database server would do, AgensGraph also manages storage of data, provides concurrency control, ensures data integrity, and offers various optimization techniques which help query performance enhancement. It maintains data consistency while allowing multiple clients to interact with the database concurrently. &lt;/p&gt;

&lt;p&gt;AgensGraph provides a dedicated environment for managing graph data, allowing you to access and store large-scale graph structured datasets in an efficient way. Clients can connect to AgensGraph server using various programming languages, or libraries which support the PostgreSQL protocol. &lt;/p&gt;

&lt;p&gt;In summary, AgensGraph leverages PostgreSQL's infrastructure to provide essential database functionality, and it is used by users and developers to interact with the graph database server, modify data, execute queries, and retrieve results.&lt;/p&gt;

&lt;p&gt;Some key aspects of AgensGraph include graph data modeling, a graph query language, graph analytics, integration with the PostgreSQL ecosystem, and use cases involving complex and interconnected data.&lt;/p&gt;
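&lt;p&gt;As a brief illustration (the graph, labels, and property names below are hypothetical, and syntax follows AgensGraph's Cypher-style extension), AgensGraph lets you mix SQL with graph statements once a graph is selected:&lt;/p&gt;

```sql
-- Create a graph and make it the default for this session.
CREATE GRAPH network;
SET graph_path = network;

-- Create two vertices and an edge between them.
CREATE (:person {name: 'Alice'})-[:knows]->(:person {name: 'Bob'});

-- Traverse the graph.
MATCH (a:person)-[:knows]->(b:person)
RETURN a.name, b.name;
```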

&lt;p&gt;&lt;strong&gt;Installation of AgensGraph Database Server:&lt;/strong&gt;&lt;br&gt;
AgensGraph is a flagship product of Bitnine Global Inc., and therefore, access to it can be obtained through their official website. Below are the steps to download and install AgensGraph on your systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Google Search "download agensgraph database"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zJ-zPJ4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nb4g42txfknkwwon1cgb.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zJ-zPJ4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nb4g42txfknkwwon1cgb.JPG" alt="Image description" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Go to Bitnine's official website.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the Windows version under Community Edition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yZfZWC6---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4gbrb5r62k4fuvak2y1m.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yZfZWC6---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4gbrb5r62k4fuvak2y1m.JPG" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fill this form and Send your details to Bitnine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AZ6eYBJ2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7q4ksqltgvnwztw1etdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AZ6eYBJ2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7q4ksqltgvnwztw1etdw.png" alt="Image description" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Open the Email, and go to the download option to get a Zip file, which contains the executable file for AgensGraph.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ho9xlwRA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vqwdgu7tf6lk6efcz360.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ho9xlwRA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vqwdgu7tf6lk6efcz360.png" alt="Image description" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Run the installation file, and follow the steps in the image:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y0XhZqhV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd79dtbrlw8mtvczntz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y0XhZqhV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xd79dtbrlw8mtvczntz0.png" alt="Image description" width="599" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to Advanced Options if you want to change any settings (such as the port number). Otherwise, keep the default settings and leave the Advanced Options checkbox unchecked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2h3RCASZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/91rg8jlpd8qj3jbt639a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2h3RCASZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/91rg8jlpd8qj3jbt639a.png" alt="Image description" width="632" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set your password:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IiiwITTp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zj3mk8kqanqotj0k10ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IiiwITTp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zj3mk8kqanqotj0k10ip.png" alt="Image description" width="516" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYi_MHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w904uuaf7eix5it9hrqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYi_MHHK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w904uuaf7eix5it9hrqf.png" alt="Image description" width="528" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Now that the installation is complete, look for Agens in your Windows Start menu, and open it. This should open a command prompt. Execute the commands, and ensure that your result matches that shown in the screenshot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--InMAl0yx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f50kft0l6qh525izjodg.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--InMAl0yx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f50kft0l6qh525izjodg.JPG" alt="Image description" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. That is it. You have successfully downloaded, installed and tested the AgensGraph Database Server.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agensgraph</category>
      <category>bitnine</category>
      <category>postgressql</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 6: Vacuum Processing</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Thu, 29 Jun 2023 16:25:25 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-6-vacuum-processing-188c</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-6-vacuum-processing-188c</guid>
      <description>&lt;p&gt;This post summarizes Chapter 6 of the book "The Internals of PostgreSQL".&lt;br&gt;
Vacuum processing is a maintenance process used in PostgreSQL to remove dead tuples and freeze transaction ids, ensuring efficient operation of the database. Vacuum processing can be categorized into two branches: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Concurrent Vacuum: removes dead tuples from each page of the table, while allowing other transactions to read the table.&lt;/li&gt;
&lt;li&gt;Full Vacuum: removes dead tuples and defragments live tuples for the whole file. Other transactions cannot access the table during this process.&lt;/li&gt;
&lt;/ol&gt;
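&lt;p&gt;The two branches correspond to two forms of the VACUUM command (the table name below is hypothetical):&lt;/p&gt;

```sql
-- Concurrent vacuum: reclaims dead tuples page by page;
-- other transactions can still read the table.
VACUUM mytable;

-- Full vacuum: rewrites the whole table file;
-- requires an exclusive lock on the table.
VACUUM FULL mytable;
```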

&lt;p&gt;Vacuum processing scans entire tables, and it is therefore a costly operation. It had to be executed manually until version 8.0, and improvements have been made over the years - including the Visibility Map in version 8.4 (which improves the efficiency of dead-tuple removal and reduces the cost of vacuum processing), and enhancements to the freeze process in version 9.6.&lt;/p&gt;

&lt;p&gt;Vacuum processing involves 3 blocks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first block:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performs freeze processing.&lt;/li&gt;
&lt;li&gt;removes index tuples pointing to dead tuples.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second block:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removes dead tuples.&lt;/li&gt;
&lt;li&gt;updates the Free Space Map (FSM) page by page.&lt;/li&gt;
&lt;li&gt;updates the Visibility Map (VM) page by page.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The third block:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cleans up indexes.&lt;/li&gt;
&lt;li&gt;truncates the last page (if possible).&lt;/li&gt;
&lt;li&gt;updates statistics and system catalogs.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Statistics and system catalogs related to vacuum processing are updated once vacuum processing is complete, and unnecessary parts of the clog are also removed.&lt;br&gt;
In v13, the PARALLEL option was introduced, which allows vacuuming and index cleanup to run in parallel when multiple indexes exist on a table. This option is available for the VACUUM command - not for autovacuum.&lt;/p&gt;
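&lt;p&gt;A sketch of the v13 syntax (the table name is hypothetical):&lt;/p&gt;

```sql
-- Vacuum the table using up to 4 parallel workers
-- for the index vacuuming and cleanup phases.
VACUUM (PARALLEL 4) mytable;
```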

&lt;p&gt;The Visibility Map (VM) is an individual map linked with each table, indicating the visibility of each page. The use of the VM allows vacuum processing to skip pages that do not contain dead tuples - improving efficiency as a result.&lt;/p&gt;

&lt;p&gt;Freeze processing has two modes: lazy mode and eager mode. In lazy mode, only pages containing dead tuples are scanned, with the help of the VM. In eager mode, all pages are scanned, and the relevant system catalogs are updated. However, eager mode is triggered only when specific conditions are satisfied - one such condition is the database exceeding its freeze threshold.&lt;/p&gt;

&lt;p&gt;The clog (commit log) stores transaction states, and when the pg_database.datfrozenxid value is updated, unnecessary files can be removed from the clog. If the minimum pg_database.datfrozenxid is in a given clog file, older files can be removed, since all transactions stored in those files are considered frozen.&lt;/p&gt;
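&lt;p&gt;The per-database freeze horizon mentioned here can be inspected directly:&lt;/p&gt;

```sql
-- age() shows how far each database's frozen cutoff
-- lags behind the current transaction id.
SELECT datname, datfrozenxid, age(datfrozenxid)
FROM pg_database;
```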

&lt;p&gt;The autovacuum daemon automates the vacuum process in PostgreSQL. It periodically invokes multiple autovacuum worker processes; by default, it wakes every minute and invokes three workers.&lt;/p&gt;

&lt;p&gt;Concurrent vacuum is not sufficient to reduce the table size, and PostgreSQL therefore provides the Full VACUUM mode to help with this. Full VACUUM involves creating a new table file, copying all live tuples from the old table into it, and then removing the old file. It also updates the statistics, Free Space Map, and Visibility Map. During Full VACUUM, temporary disk space usage can be up to twice the table size, and access to the table is prohibited.&lt;/p&gt;

&lt;p&gt;In summary, this chapter focuses on maintenance and optimization of the database. Vacuuming, analyzing, and monitoring of database performance have been covered, and the importance of vacuuming to remove dead tuples has been highlighted. &lt;/p&gt;

</description>
      <category>bitnine</category>
      <category>summary</category>
      <category>postgressql</category>
      <category>database</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 5: Concurrency Control</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Thu, 29 Jun 2023 15:39:57 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-5-concurrency-control-42j7</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-5-concurrency-control-42j7</guid>
      <description>&lt;p&gt;This post summarizes Chapter 5 of the book "The Internals of PostgreSQL".&lt;br&gt;
The chapter discusses concurrency control techniques, with a strong focus on MVCC (Multi-Version Concurrency Control), S2PL (Strict Two-Phase Locking), and OCC (Optimistic Concurrency Control).&lt;/p&gt;

&lt;p&gt;MVCC is a technique where a new version of a data item is created on every write operation, while the old version is retained. This lets readers and writers operate simultaneously without blocking each other. PostgreSQL uses a variation of MVCC called Snapshot Isolation (SI). SI ensures isolation while preventing issues like dirty reads, phantom reads, and non-repeatable reads.&lt;/p&gt;

&lt;p&gt;In version 9.1, PostgreSQL introduced Serializable Snapshot Isolation (SSI) in order to achieve true serializability. SSI detects serialization anomalies and resolves the conflicts caused by them - hence providing a serializable isolation level. The transaction isolation levels READ COMMITTED, REPEATABLE READ, and SERIALIZABLE are explained in terms of their behavior with regard to the anomalies mentioned above. &lt;/p&gt;
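&lt;p&gt;The isolation level is chosen per transaction; a minimal sketch (the table name is hypothetical):&lt;/p&gt;

```sql
-- Run a transaction at the SERIALIZABLE level;
-- SSI checks it for serialization anomalies.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = 1;
COMMIT;

-- Or change the session default.
SET default_transaction_isolation = 'repeatable read';
```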

&lt;p&gt;Tuples are the fundamental units of storage - they represent individual rows of data. The operations done on tuples include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Insertion: inserting a new tuple into a table refers to inserting it directly into a page of the target table. The header fields (txid, ctid) are updated to indicate information about the transaction. This helps keep track of the history and visibility of the tuple.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deletion: the id of the transaction that runs the delete operation is stored in the tuple's t_xmax field. This logically marks the tuple as deleted and no longer needed, so it can be considered a dead tuple. These dead tuples are eventually removed from their pages by vacuum processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update: to update a tuple, PostgreSQL deletes the existing tuple and creates a new one with the updated values. This ensures data integrity and the ability to track past changes. The t_xmax field of the old tuple and the t_xmin field of the new tuple are both set to the id of the transaction performing the update.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
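&lt;p&gt;The header fields discussed above can be inspected through PostgreSQL's hidden system columns (the table here is hypothetical):&lt;/p&gt;

```sql
-- xmin: txid of the inserting transaction;
-- xmax: txid of the deleting/updating transaction (0 while the tuple is live);
-- ctid: the tuple's physical location (block number, offset).
SELECT xmin, xmax, ctid, * FROM accounts;
```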

&lt;p&gt;The Free Space Map (FSM) is used to manage space allocation for insertions and updates. Information about the free space capacity of each page is stored here, and the FSM therefore helps determine which page is capable of accommodating a new tuple, selecting the appropriate page accordingly.&lt;/p&gt;

&lt;p&gt;The commit log (clog) in PostgreSQL holds the statuses of transactions, and it is used throughout transaction processing. The clog maintains an array-like structure, where each item holds the status of a transaction ID: IN_PROGRESS, COMMITTED, ABORTED, or SUB_COMMITTED. The clog is backed up to files on disk, and old data is removed from it by vacuum processing.&lt;/p&gt;

&lt;p&gt;Transaction snapshot: information about the active transactions at a given point in time is stored in a transaction snapshot, which is used for visibility checks of tuples while a query is being executed. Its textual format is&lt;br&gt;
'xmin:xmax:xip_list', where xmin is the earliest active transaction ID, xmax is the first unassigned transaction ID, and xip_list lists the active transaction IDs.&lt;/p&gt;
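&lt;p&gt;The current snapshot can be viewed in exactly this format (the values shown in the comment are illustrative only):&lt;/p&gt;

```sql
-- Returns a value such as '100:104:100,102', meaning:
--   xmin = 100 (earliest active txid),
--   xmax = 104 (first unassigned txid),
--   xip_list = 100,102 (txids active when the snapshot was taken).
SELECT txid_current_snapshot();
```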

&lt;p&gt;Visibility check rules: these are a set of rules used to determine whether a tuple is visible or not, based on its t_xmin and t_xmax values, the clog, and the transaction snapshot. Considerations such as whether a transaction is in progress, committed, or aborted are also included. These rules enforce isolation levels and keep data consistent throughout the database system.&lt;br&gt;
PostgreSQL determines the versions of data visible to a transaction based on transaction status and txid, and these rules then determine the outcome of the SELECT commands that are run. This ensures that only committed data is visible to transactions.&lt;br&gt;
In particular, updated data is not accessible to other transactions until it has been committed - preventing dirty reads (where uncommitted changes are visible to other transactions) and protecting data integrity, as transactions cannot access unverified data. &lt;/p&gt;

&lt;p&gt;Other anomalies that PostgreSQL prevents include phantom reads, where a transaction accesses newly inserted data that was not visible earlier, and lost updates, where concurrent transactions update the same data simultaneously. &lt;br&gt;
PostgreSQL prevents phantom reads with rules that make newly inserted data invisible to ongoing transactions, while its first-updater-win scheme prevents lost updates: when multiple transactions update the same data simultaneously, PostgreSQL ensures that only one transaction's changes are applied, thus preventing data inconsistencies. &lt;/p&gt;

&lt;p&gt;In summary, the chapter discusses the mechanisms used by PostgreSQL for data consistency and transaction isolation. Visibility checks are covered, along with anomalies and how they are prevented. PostgreSQL implements strategies like MVCC and SSI to maintain data integrity and allow reliable access to data.&lt;/p&gt;

</description>
      <category>bitnine</category>
      <category>postgressql</category>
      <category>database</category>
      <category>summary</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 4: Foreign Data Wrappers and Parallel Query</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Wed, 28 Jun 2023 16:39:28 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-4-foreign-data-wrappers-and-parallel-query-21o7</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-4-foreign-data-wrappers-and-parallel-query-21o7</guid>
      <description>&lt;p&gt;Abbreviations to be used in the post:&lt;br&gt;
FDW = Foreign Data Wrappers&lt;br&gt;
MED = Management of External Data&lt;/p&gt;

&lt;p&gt;FDW allows access to remotely stored data with the help of SQL/MED. In order to use an FDW, the appropriate extension has to be installed, and the necessary objects have to be set up with commands such as "CREATE SERVER" and "CREATE FOREIGN TABLE". Once the set-up is complete, foreign tables on the remote servers can be accessed through SELECT queries. Moreover, FDW also allows join operations between foreign tables. &lt;/p&gt;
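&lt;p&gt;A minimal set-up sketch using the postgres_fdw extension (the server address, credentials, and table definition are placeholders):&lt;/p&gt;

```sql
-- Install the wrapper and describe the remote server.
CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', port '5432', dbname 'salesdb');

-- Map the local role to a role on the remote server.
CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'remote_user', password 'secret');

-- Declare a local stand-in for the remote table.
CREATE FOREIGN TABLE orders (
    id     int,
    amount numeric
) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'orders');

-- Query it like any local table.
SELECT count(*) FROM orders;
```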

&lt;p&gt;The process of how FDW works is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The analyzer generates the query tree from the SQL input.&lt;/li&gt;
&lt;li&gt;The executor connects to the remote server using libraries specific to the foreign data wrapper being used. &lt;/li&gt;
&lt;li&gt;The cost of each plan path is estimated using the EXPLAIN command, provided that the use_remote_estimate option is enabled.&lt;/li&gt;
&lt;li&gt;Deparsing: a plain-text SQL statement is created from the plan tree.&lt;/li&gt;
&lt;li&gt;This plain-text SQL statement is sent to the remote server, and the result is processed by the executor. This processing includes join operations and other necessary operations on the tables, and the results may also be converted into a format PostgreSQL can read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;FDW extensions use the EXPLAIN command to get statistics and estimate the plan tree for a query (as discussed in Chapter 3). These estimates are best reflected by the postgres_fdw extension; other FDW extensions may not provide enough information.&lt;/p&gt;

&lt;p&gt;In order to execute SELECT queries, a sequence of SQL statements are followed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start a remote transaction.&lt;/li&gt;
&lt;li&gt;Declare a cursor.&lt;/li&gt;
&lt;li&gt;Fetch the result.&lt;/li&gt;
&lt;li&gt;Close the cursor.&lt;/li&gt;
&lt;li&gt;Commit the transaction.&lt;/li&gt;
&lt;/ol&gt;
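&lt;p&gt;In other words, the remote server sees a sequence roughly like the following (the cursor name and query are illustrative):&lt;/p&gt;

```sql
START TRANSACTION;
DECLARE c1 CURSOR FOR
    SELECT id, amount FROM public.orders;
FETCH 100 FROM c1;   -- repeated until all rows have been fetched
CLOSE c1;
COMMIT TRANSACTION;
```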

&lt;p&gt;For multi-table queries, each foreign table is fetched individually through a single-table SELECT statement, and the join operation is then performed on the local server. This was done in earlier versions even if the foreign tables were on the same remote server; in later versions, postgres_fdw can execute the join operation remotely if the tables are on the same server (the use_remote_estimate option should be enabled). However, it is important to keep in mind that distributed deadlock detection is not supported by FDW - hence, if multiple transactions simultaneously involve updates on both local and foreign tables, deadlocks can occur. This is one disadvantage, or shortcoming, of PostgreSQL's FDW.&lt;/p&gt;

&lt;p&gt;Overall, FDW supports access to remote data, and allows managing it just like local data - resulting in seamless integration of data.&lt;/p&gt;

</description>
      <category>bitnine</category>
      <category>postgressql</category>
      <category>summary</category>
      <category>database</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 3 Query Processing (Part 1,2,3)</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Wed, 28 Jun 2023 16:00:53 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-3query-processing-part-123-4f9a</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-3query-processing-part-123-4f9a</guid>
      <description>&lt;p&gt;This post summarizes Chapter 3 of the book "The Internals of PostgreSQL".&lt;br&gt;
3.1:&lt;br&gt;
PostgreSQL query processing comprises several subsystems working together to respond to client queries. The main subsystems involved are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parser: the parser receives an SQL statement in the form of a plain text, and in result, it generates a parse tree which represents the syntactic structure of the query. In addition, the parser also performs syntax checking, but semantics are not checked here.&lt;/li&gt;
&lt;li&gt;Analyzer: semantic analysis of the parse tree (generated by the parser) is done here. A query tree is generated as a result, representing the metadata and the structure of the query. It contains information such as the target list, range table, and sort clause.&lt;/li&gt;
&lt;li&gt;Rewriter: The rewriter applies rules present in the rule system so that necessary transformations can be made to the query tree. The query tree is therefore modified on the basis of pre-defined rules.&lt;/li&gt;
&lt;li&gt;Planner: The planner generates a plan tree from the query tree it receives from the rewriter. This plan tree is a representation of the most effective execution plan for the query, and a cost based optimization approach is used to determine the best plan. Factors like resources and statistics play a vital role in determining the optimal plan.&lt;/li&gt;
&lt;li&gt;Executor: at this stage, the plan tree generated by the planner is executed. The executor follows the instructions in the plan tree to access tables and indexes accordingly. The retrieval and processing of data is also done in the order specified in the plan tree. Temporary files are created if needed, and the executor enforces concurrency control mechanisms to maintain data consistency and isolation during query execution. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;3.2: &lt;br&gt;
The cost of an operation in PostgreSQL is a dimensionless value that is useful for comparing the relative performance of operations, and it varies per operation. The types of costs are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start-up cost: The cost before the first tuple is fetched.&lt;/li&gt;
&lt;li&gt;Run cost: The cost of fetching all tuples. There are various formulas for the different types of scans.&lt;/li&gt;
&lt;li&gt;Total cost: This is the sum of start-up cost and run cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This section examines cost estimation for sequential scans and index scans. &lt;/p&gt;

&lt;p&gt;3.3:&lt;br&gt;
This section discusses how a plan tree is created for a single-table query in PostgreSQL. The three main steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing: this is done on the PlannerInfo structure. Target lists, limit clauses, and other expressions are simplified using built-in functions; in addition, nested AND/OR expressions are flattened.&lt;/li&gt;
&lt;li&gt;Getting the cheapest access path: the costs of all possible access paths are estimated, and the cheapest one is selected. A RelOptInfo structure is created to store the access paths and their costs; sequential scans, index scans, and bitmap scans are all candidates held in this structure.&lt;/li&gt;
&lt;li&gt;Creating the plan tree: the planner generates a plan tree from the cheapest access path. The root of the plan tree holds information about the type of operation and the tables involved in the query. The tree consists of plan nodes, each corresponding to an operation such as a sequential or index scan; each node carries information such as the estimated costs, the number of rows to be scanned, and the qual conditions, and the nodes are linked together to form the plan tree.&lt;/li&gt;
&lt;/ol&gt;
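&lt;p&gt;A toy version of step 2, choosing the cheapest access path, can be sketched as follows (the path kinds and cost numbers are invented for illustration; the real planner works on RelOptInfo and Path structures in C):&lt;/p&gt;

```python
# Toy model of access-path selection: each candidate path carries an
# estimated total cost, and the planner keeps only the cheapest one.
from dataclasses import dataclass

@dataclass
class Path:
    kind: str          # e.g. "SeqScan", "IndexScan", "BitmapHeapScan"
    total_cost: float  # estimated start-up cost + run cost

def cheapest_path(paths: list[Path]) -> Path:
    """Mimic the planner keeping the lowest-cost path for a relation."""
    return min(paths, key=lambda p: p.total_cost)

# Invented cost estimates for scanning one table:
candidates = [
    Path("SeqScan", 170.0),
    Path("IndexScan", 42.5),
    Path("BitmapHeapScan", 96.0),
]
print(cheapest_path(candidates).kind)  # → IndexScan
```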

&lt;p&gt;3.4:&lt;br&gt;
This section discusses how the executor performs query processing by evaluating plan nodes from the leaves of the plan tree up towards the root node. Every plan node corresponds to an operation, such as an index scan, a sequential scan, or sorting, and the functions associated with each are stored in the src/backend/executor/ directory.&lt;br&gt;
The EXPLAIN command is the most convenient way to examine the plan; its output makes the most sense when read from the bottom line to the top. EXPLAIN ANALYZE provides additional information such as the actual row counts, memory usage, and run time.&lt;br&gt;
Overall, the EXPLAIN command is useful for understanding how the executor processes a query. It also helps reveal the use of temporary files in the process, which are created in the base/pgsql_tmp subdirectory.&lt;/p&gt;
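&lt;p&gt;The bottom-up evaluation described above can be sketched as a recursive walk over a toy plan tree (the node types and rows here are invented; the real executor nodes live in src/backend/executor/ and pull tuples one at a time rather than in whole lists):&lt;/p&gt;

```python
# Toy plan-tree executor: each node evaluates its children first, so
# execution effectively proceeds from the leaves toward the root --
# the same order in which EXPLAIN output is best read (bottom to top).
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    op: str
    children: list["PlanNode"] = field(default_factory=list)
    rows: list[tuple] = field(default_factory=list)  # leaf output only

def execute(node: PlanNode) -> list[tuple]:
    child_rows = [execute(c) for c in node.children]  # leaves run first
    if node.op == "SeqScan":
        return node.rows
    if node.op == "Sort":
        return sorted(child_rows[0])
    raise NotImplementedError(node.op)

# Sort on top of SeqScan, i.e. roughly "SELECT * FROM t ORDER BY ...":
plan = PlanNode("Sort", [PlanNode("SeqScan", rows=[(3,), (1,), (2,)])])
print(execute(plan))  # → [(1,), (2,), (3,)]
```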

&lt;p&gt;3.5:&lt;br&gt;
PostgreSQL supports several join algorithms, including the nested loop join, merge join, and hash join:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Nested loop join: this is the most basic join operation and can be used with any join condition. It has five variations in PostgreSQL: the materialized nested loop join reduces the cost of rescanning the inner table by storing its tuples in a temporary file; the indexed nested loop join uses an index on the inner table to look up matching tuples directly instead of scanning sequentially; and the other variations reduce the cost in their own ways.&lt;/li&gt;
&lt;li&gt;Merge join: this is used for natural joins and equi-joins, and has four variations. The materialized merge join likewise uses temporary files, and other variations include the merge join with an outer index scan, which reduces cost by utilizing an index on the outer table.&lt;/li&gt;
&lt;li&gt;Hash join: this involves hashing the join keys of both the inner and outer tables and then matching tuples on the basis of their hash values; an in-memory hash table is used for matching tuples. This approach is preferred for large datasets or datasets with non-indexed columns. The hash join has two variations of its own.&lt;/li&gt;
&lt;/ol&gt;
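&lt;p&gt;A minimal sketch of the two extremes above, a plain nested loop join and an in-memory hash join, run on invented data keyed on the first column (this illustrates the algorithms only, not PostgreSQL's tuple-at-a-time implementation):&lt;/p&gt;

```python
# Nested loop join: compare every outer tuple with every inner tuple.
def nested_loop_join(outer, inner):
    return [(o, i) for o in outer for i in inner if o[0] == i[0]]

# Hash join: build an in-memory hash table on the inner relation's join
# key, then probe it once per outer tuple.
def hash_join(outer, inner):
    table = {}
    for i in inner:                       # build phase
        table.setdefault(i[0], []).append(i)
    return [(o, i) for o in outer for i in table.get(o[0], [])]  # probe

emp = [(1, "ann"), (2, "bob"), (3, "cid")]
dept = [(1, "sales"), (3, "hr")]
print(nested_loop_join(emp, dept) == hash_join(emp, dept))  # → True
```

Both produce the same result; the hash join simply trades memory for fewer comparisons, which is why it tends to win on large, non-indexed inputs.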

&lt;p&gt;The choice of algorithm depends on factors such as the sizes of the input relations, the join condition, and user settings. Examples of all three joins, along with their EXPLAIN output, are shown in this section.&lt;/p&gt;

&lt;p&gt;3.6:&lt;br&gt;
This section covers the creation of a plan tree for a multiple-table query. The steps are broadly similar:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocessing: the planner pulls up subqueries in the FROM clause using the built-in function pull_up_subqueries(). In addition, outer join queries are transformed into inner join queries where possible.&lt;/li&gt;
&lt;li&gt;Getting the cheapest path: the planner examines various combinations of join methods and indexes to determine an optimal plan tree. It uses dynamic programming when there are fewer than 12 tables; for larger numbers of tables, the planner uses the Genetic Query Optimizer, which finds a reasonable (if not provably optimal) plan. The planner selects the cheapest access path at each level, from level 1 upwards.&lt;/li&gt;
&lt;li&gt;Getting the cheapest path of a triple-table query: the planner makes cost estimates, and the cheapest path is determined using combinations of RelOptInfo structures. In addition, the join paths are also evaluated.&lt;/li&gt;
&lt;/ol&gt;
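&lt;p&gt;The search in step 2 can be caricatured as trying every join order and keeping the cheapest. The cost function below is a made-up stand-in for the planner's real estimates, and the table sizes are invented; PostgreSQL itself uses dynamic programming below the geqo_threshold and GEQO above it, not brute-force enumeration:&lt;/p&gt;

```python
# Toy join-order search: enumerate all orders of three tables and keep
# the order whose made-up cost (sum of intermediate sizes) is lowest.
from itertools import permutations

# Invented row-count estimates per table:
sizes = {"a": 1000, "b": 10, "c": 100}

def order_cost(order: tuple[str, ...]) -> int:
    # Pretend each join's intermediate result is as large as its bigger input.
    cost, acc = 0, sizes[order[0]]
    for t in order[1:]:
        acc = max(acc, sizes[t])
        cost += acc
    return cost

best = min(permutations(sizes), key=order_cost)
print(best, order_cost(best))
```

Even this crude model prefers joining the small tables first and bringing in the large table last, which is the intuition behind the planner's level-by-level search.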

&lt;p&gt;In summary, this chapter provides an overview of query processing and query optimization in PostgreSQL. Beginning with an introduction to how a query is processed, it explains the stages of query optimization, including cost estimation and plan generation, and discusses the creation of plan trees.&lt;/p&gt;

</description>
      <category>bitnine</category>
      <category>postgressql</category>
      <category>summary</category>
      <category>database</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 2: Process and Memory Architecture</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Fri, 23 Jun 2023 21:21:39 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-2-process-and-memory-architecture-3can</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-2-process-and-memory-architecture-3can</guid>
      <description>&lt;p&gt;This post summarizes Chapter 2 of the book "The Internals of PostgreSQL".&lt;br&gt;
PostgreSQL is a relational database management system with a client/server, multi-process architecture: it runs on a single host, and several types of processes cooperate to manage a database cluster.&lt;br&gt;
The PostgreSQL server is the parent process that oversees the entire database cluster: it starts background processes, allocates shared memory, and awaits client connections. The server listens on a specific network port (default: 5432), and multiple instances can run on the same host by using different port numbers.&lt;br&gt;
Backend processes are spawned by the server to handle incoming queries from connected clients. A client communicates with its backend process over a single TCP connection, which is terminated as soon as the client disconnects. Every backend process operates on a single database, and the parameter max_connections determines the maximum number of concurrent client connections (the default value is 100).&lt;br&gt;
Background processes handle various tasks: the background writer, checkpointer, autovacuum, WAL writer, and so on. These are discussed in the upcoming chapters.&lt;br&gt;
The memory architecture of PostgreSQL is simple: each backend process allocates its own local memory area, and a shared memory area is used by all processes of the PostgreSQL server. In addition, PostgreSQL allocates memory for access control mechanisms, transaction processing, and other purposes. The details of both the local and shared memory areas are discussed in upcoming sections.&lt;br&gt;
In a nutshell, PostgreSQL has a multi-process architecture: the server is the parent process, and backend processes handle client queries. It employs both local and shared memory areas to help manage data and provide concurrency control.&lt;/p&gt;
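&lt;p&gt;The process-per-connection model described above can be mimicked with Python's multiprocessing: one parent hands each "client request" to a dedicated child process. All names here are invented; real backends are forked by the PostgreSQL server and speak the PostgreSQL wire protocol over TCP:&lt;/p&gt;

```python
# Toy postmaster: spawn one worker ("backend") process per client
# request, mirroring PostgreSQL's process-per-connection model.
import multiprocessing as mp

ctx = mp.get_context("fork")  # fork keeps this sketch guard-free (POSIX only)

def backend(query: str, results) -> None:
    # A real backend would parse, plan, and execute the query.
    results.put(f"handled: {query}")

results = ctx.Queue()
clients = ["SELECT 1", "SELECT 2", "SELECT 3"]
procs = [ctx.Process(target=backend, args=(q, results)) for q in clients]
for p in procs:
    p.start()
for p in procs:
    p.join()  # the "connection" ends when the backend process exits
collected = sorted(results.get() for _ in clients)
print(collected)
```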

</description>
      <category>bitnine</category>
      <category>postgressql</category>
      <category>summary</category>
      <category>database</category>
    </item>
    <item>
      <title>The Internals of PostgreSQL: Chapter 1: Database Cluster, Databases, and Tables</title>
      <dc:creator>Hasnain Somani</dc:creator>
      <pubDate>Fri, 23 Jun 2023 18:19:34 +0000</pubDate>
      <link>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-1database-cluster-databases-and-tables-2c44</link>
      <guid>https://dev.to/hasusomani/the-internals-of-postgresql-chapter-1database-cluster-databases-and-tables-2c44</guid>
      <description>&lt;p&gt;This post summarizes Chapter 1 of the book "The Internals of PostgreSQL". &lt;/p&gt;

&lt;p&gt;The logical structure of PostgreSQL consists of a collection of databases managed by a single PostgreSQL server running on a host. Each database in the cluster is treated as a separate entity and contains various database objects such as indexes, sequences, views, functions, and tables.&lt;/p&gt;

&lt;p&gt;These objects, together with their object identifiers (OIDs), are tracked in system catalogs within each database (pg_database stores the OIDs of databases, and the pg_class catalog stores the OIDs of heap tables). The OIDs of objects can be retrieved by querying these catalogs.&lt;/p&gt;

&lt;p&gt;Physically, a database cluster is a directory hierarchy: the base directory is the top-level directory and contains further subdirectories and files. The path of the base directory is set in the environment variable PGDATA, and a new database cluster can be created with the initdb command.&lt;/p&gt;

&lt;p&gt;Some of the important files and subdirectories include:&lt;/p&gt;

&lt;p&gt;PG_VERSION: A file containing the major version number of PostgreSQL.&lt;br&gt;
pg_hba.conf: A file used to control client authentication for PostgreSQL.&lt;br&gt;
pg_ident.conf: A file used to map PostgreSQL user names.&lt;br&gt;
postgresql.conf: A file used to configure various parameters for PostgreSQL.&lt;br&gt;
postgresql.auto.conf: A file used to store configuration parameters set in ALTER SYSTEM (version 9.4 or later).&lt;br&gt;
base/: A subdirectory containing per-database subdirectories.&lt;br&gt;
global/: A subdirectory containing cluster-wide tables like pg_database and pg_control.&lt;br&gt;
pg_wal/ (Version 10 or later): A subdirectory containing Write Ahead Logging (WAL) segment files.&lt;br&gt;
pg_tblspc/: A subdirectory containing symbolic links to tablespaces.&lt;br&gt;
Each database within the cluster has its own subdirectory under the base directory, with the subdirectory name corresponding to the OID of the database.&lt;/p&gt;

&lt;p&gt;Tables and indexes are stored as separate individual files within their respective subdirectories. The names of these files correspond to the relfilenode value, which is usually the same as the OID of the table, except for tables and indexes whose size exceeds 1GB. Once a file exceeds 1GB, a new segment file is created with the suffix ".1", followed by ".2", and so on.&lt;/p&gt;

&lt;p&gt;Additional files (_fsm and _vm) also exist within the database subdirectories. These are the free space map and visibility map files, respectively: _fsm files store information about the free space capacity, whereas _vm files store information about page visibility within the table.&lt;/p&gt;

&lt;p&gt;PostgreSQL also supports tablespaces, which hold data outside the base directory. Each tablespace is created under a specified directory and has its own subdirectory there.&lt;/p&gt;

&lt;p&gt;A heap table file consists of fixed-length pages, 8192 bytes by default. Each page contains record data (tuples), line pointers (which point to the tuples), and header data. The header data includes the checksum value, the Log Sequence Number (LSN), and other details related to the page.&lt;/p&gt;
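&lt;p&gt;A toy model of this page layout, with line pointers growing forward from the header and tuple data growing backward from the end of the page, can be sketched as follows (the sizes are real defaults, but the header fields and API are simplified inventions for illustration):&lt;/p&gt;

```python
# Toy 8192-byte heap page: header at the front, line pointers growing
# forward after it, tuple data growing backward from the end.
PAGE_SIZE = 8192
HEADER_SIZE = 24   # PageHeaderData is 24 bytes
POINTER_SIZE = 4   # ItemIdData is 4 bytes

class Page:
    def __init__(self) -> None:
        self.buf = bytearray(PAGE_SIZE)
        self.n_items = 0
        self.upper = PAGE_SIZE  # start of tuple space (grows downward)

    def add_tuple(self, data: bytes) -> int:
        """Store a tuple and its line pointer; return its item number."""
        self.upper -= len(data)
        self.buf[self.upper:self.upper + len(data)] = data
        lp = HEADER_SIZE + self.n_items * POINTER_SIZE
        self.buf[lp:lp + POINTER_SIZE] = self.upper.to_bytes(4, "little")
        self.n_items += 1
        return self.n_items

    def get_tuple(self, item: int, length: int) -> bytes:
        """Follow line pointer number `item` to its tuple data."""
        lp = HEADER_SIZE + (item - 1) * POINTER_SIZE
        off = int.from_bytes(self.buf[lp:lp + POINTER_SIZE], "little")
        return bytes(self.buf[off:off + length])

page = Page()
page.add_tuple(b"first tuple")
page.add_tuple(b"second")
print(page.get_tuple(2, 6))  # → b'second'
```

The indirection through line pointers is what lets a tuple be addressed by a stable (page, item) pair even as free space inside the page is reorganized.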

&lt;p&gt;In conclusion, the chapter gives an overview of the logical and physical structure of a PostgreSQL database cluster, and of how databases, objects, files, and subdirectories are organized and managed.&lt;/p&gt;

</description>
      <category>bitnine</category>
      <category>postgressql</category>
      <category>summary</category>
      <category>database</category>
    </item>
  </channel>
</rss>
