<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hammadsaleemm</title>
    <description>The latest articles on DEV Community by hammadsaleemm (@hammadsaleemm).</description>
    <link>https://dev.to/hammadsaleemm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026157%2F2e05dd11-f879-4ebc-a220-07db1a956b5e.jpeg</url>
      <title>DEV Community: hammadsaleemm</title>
      <link>https://dev.to/hammadsaleemm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hammadsaleemm"/>
    <language>en</language>
    <item>
      <title>Enhancing PostgreSQL Performance with SP-GiST</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Thu, 04 May 2023 13:54:29 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/enhancing-postgresql-performance-with-sp-gist-4i0a</link>
      <guid>https://dev.to/hammadsaleemm/enhancing-postgresql-performance-with-sp-gist-4i0a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a powerful open-source relational database management system that provides many advanced features to developers. One of the features that makes PostgreSQL stand out from other database systems is its extensible indexing: beyond the familiar B-tree, it ships with several specialized index access methods tailored to specific data types, which can lead to much better query performance. In this article, we will explore SP-GiST, one of these index types, and how it can significantly enhance the performance of PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is SP-GiST?
&lt;/h2&gt;

&lt;p&gt;SP-GiST stands for Space-Partitioned Generalized Search Tree. It is an index type that efficiently handles data that can be recursively partitioned, such as geometric points, ranges, network addresses, and text prefixes. Like GiST, SP-GiST provides a generalized infrastructure for building search trees, but where GiST builds balanced trees, SP-GiST supports unbalanced, space-partitioned structures such as quad-trees, k-d trees, and radix trees.&lt;/p&gt;

&lt;h2&gt;
  
  
  How SP-GiST works
&lt;/h2&gt;

&lt;p&gt;SP-GiST is based on the concept of space partitioning. It divides the search space into non-overlapping regions and assigns each region a label. These labels are then used to traverse the search tree efficiently. SP-GiST also uses a split algorithm that ensures that each partition is split optimally to minimize the search time. The split algorithm takes into account the distribution of data and the size of the partitions.&lt;/p&gt;
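&lt;p&gt;As a minimal sketch, an SP-GiST index is created with the usual &lt;code&gt;CREATE INDEX&lt;/code&gt; syntax by naming the &lt;code&gt;spgist&lt;/code&gt; access method (the table and column names here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- a table of host addresses
CREATE TABLE hosts (id serial PRIMARY KEY, addr inet);

-- SP-GiST builds a radix tree over the network addresses
CREATE INDEX idx_hosts_addr ON hosts USING spgist (addr);

-- subnet-containment searches can then use the index
-- (&gt;&gt; means "contains" for inet values)
EXPLAIN SELECT * FROM hosts WHERE inet '10.0.0.0/8' &gt;&gt; addr;
&lt;/code&gt;&lt;/pre&gt;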

&lt;h2&gt;
  
  
  Benefits of using SP-GiST
&lt;/h2&gt;

&lt;p&gt;There are several benefits to using SP-GiST over other index types. Firstly, SP-GiST can handle complex data types, which makes it suitable for a wide range of applications. Secondly, SP-GiST is highly customizable, allowing developers to fine-tune it to their specific needs. Finally, SP-GiST is more efficient than other index types when dealing with high-dimensional data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases for SP-GiST
&lt;/h2&gt;

&lt;p&gt;SP-GiST is particularly useful for applications that deal with complex data types. Some examples of such applications include:&lt;/p&gt;

&lt;p&gt;Geographic information systems (GIS) that deal with spatial point data&lt;br&gt;
Applications that search text by prefix, which SP-GiST supports via radix trees&lt;br&gt;
Network management systems that deal with IP addresses and network topologies&lt;br&gt;
Applications that query range types, such as scheduling systems working with time intervals&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SP-GiST is a powerful specialized index type that can significantly enhance the performance of PostgreSQL when dealing with partitionable data types. Its space-partitioning approach and split algorithm make it efficient and highly customizable. Developers can use SP-GiST to fine-tune their PostgreSQL databases for optimal performance in a wide range of applications.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Boosting Database Performance with GIN Indexes in PostgresPro</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Wed, 03 May 2023 17:42:29 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/boosting-database-performance-with-gin-indexes-in-postgrespro-3okh</link>
      <guid>https://dev.to/hammadsaleemm/boosting-database-performance-with-gin-indexes-in-postgrespro-3okh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a popular open-source database management system, known for its flexibility, robustness, and extensibility. It provides various indexing techniques, one of which is the Generalized Inverted Index or GIN. In this article, we will explore how GIN indexes work, and how they can significantly enhance database performance in PostgresPro.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are GIN Indexes?
&lt;/h2&gt;

&lt;p&gt;GIN indexes are designed to handle composite values, including arrays, JSON documents, and full-text search vectors. A GIN index is an inverted index: each entry maps a key, a term extracted from the indexed data, to a posting list of pointers to all the rows that contain that term. GIN indexes are particularly useful for queries that involve multiple conditions, as they can quickly intersect the posting lists to identify the rows that match all the query conditions.&lt;/p&gt;
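&lt;p&gt;As a brief sketch (table and column names are illustrative), a GIN index on a &lt;code&gt;jsonb&lt;/code&gt; column is created like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- documents stored as jsonb
CREATE TABLE docs (id serial PRIMARY KEY, payload jsonb);

-- GIN indexes every key and value inside the documents
CREATE INDEX idx_docs_payload ON docs USING gin (payload);

-- containment queries (@&gt;) can use the index
EXPLAIN SELECT * FROM docs WHERE payload @&gt; '{"status": "active"}';
&lt;/code&gt;&lt;/pre&gt;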

&lt;h2&gt;
  
  
  How GIN Indexes work in PostgresPro?
&lt;/h2&gt;

&lt;p&gt;Internally, a GIN index stores its keys in a B-tree. Each key appears only once, no matter how many rows contain it; the key's entry points either to a compact posting list of row pointers (TIDs) or, when that list grows large, to a separate B-tree of TIDs called a posting tree. This structure keeps the index small for data with many repeated terms and lets PostgresPro quickly locate the matching entries for each search key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of GIN Indexes:
&lt;/h2&gt;

&lt;p&gt;There are several benefits of using GIN indexes in PostgresPro, including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved query performance:&lt;/strong&gt; GIN indexes can significantly improve query performance, particularly for complex queries that involve multiple conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficient support for array and JSON data types:&lt;/strong&gt; GIN indexes are particularly useful for queries that involve array or JSON data types, as they can efficiently index and search through these data types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility and extensibility:&lt;/strong&gt; GIN indexes are highly flexible and can be extended to support custom data types and query operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;GIN indexes are a powerful indexing technique that can significantly enhance database performance in PostgresPro. They provide efficient support for complex data types, including arrays, JSON, and full-text search, and can improve query performance for complex queries. By leveraging the power of GIN indexes, PostgresPro users can achieve better scalability, faster query processing, and improved overall database performance.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Boosting Query Performance with GIST Indexes in PostgreSQL</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Wed, 03 May 2023 17:38:30 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/boosting-query-performance-with-gist-indexes-in-postgresql-55ga</link>
      <guid>https://dev.to/hammadsaleemm/boosting-query-performance-with-gist-indexes-in-postgresql-55ga</guid>
      <description>&lt;p&gt;PostgreSQL is a popular open-source relational database management system known for its robustness, extensibility, and compliance with SQL standards. One of its powerful features is its support for various types of indexes, including the Generalized Search Tree (GIST) index. In this article, we will explore how GIST indexes work and how they can improve query performance in PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are GIST indexes?
&lt;/h2&gt;

&lt;p&gt;GIST indexes are a type of index in PostgreSQL that can handle complex data types, such as geometric shapes, full-text search, and network addresses. They are a type of balanced search tree, similar to B-trees, but they can handle more complex data structures.&lt;/p&gt;

&lt;p&gt;GIST indexes work by dividing data into a set of overlapping rectangles or boxes, each containing one or more data points. These boxes are organized into a hierarchical tree structure, with the root node containing all the data and subsequent nodes containing smaller subsets of data.&lt;/p&gt;

&lt;p&gt;When a query is executed, PostgreSQL traverses the GIST index tree to locate the relevant data, starting at the root node and descending into the appropriate branch based on the query conditions. As a result, GIST indexes can efficiently handle queries that involve complex data types and multi-dimensional data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a GIST index in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;To create a GIST index in PostgreSQL, you specify the &lt;code&gt;gist&lt;/code&gt; access method in the &lt;code&gt;CREATE INDEX&lt;/code&gt; statement. An operator class may be named explicitly, but for most data types the default GIST operator class for the column's type is used automatically. For example, to index a column of geometric data:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX idx_geometry ON mytable USING gist (geom);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In this example, "idx_geometry" is the name of the index, "mytable" is the name of the table that contains the column you want to index, and "geom" is the name of the column. An explicit operator class name is needed only when several operator classes exist for the column's data type and you want a non-default one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a GIST index to improve query performance
&lt;/h2&gt;

&lt;p&gt;GIST indexes can improve query performance by allowing PostgreSQL to quickly find the relevant data without having to scan the entire table. For example, suppose you have a table that contains millions of rows of geographic data, and you want to find all the data points within a specific radius of a given location. Without an index, PostgreSQL would have to scan the entire table to find the relevant data, which could take a long time. With a GIST index, PostgreSQL can use the index to quickly locate the relevant data and return the results much faster.&lt;/p&gt;
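&lt;p&gt;A brief sketch with PostgreSQL's built-in geometric types (table and column names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- bounding boxes for some spatial objects
CREATE TABLE regions (id serial PRIMARY KEY, area box);

-- the default GIST operator class for box is chosen automatically
CREATE INDEX idx_regions_area ON regions USING gist (area);

-- the containment operator @&gt; can use the index
EXPLAIN SELECT * FROM regions WHERE area @&gt; box '((2,2),(3,3))';
&lt;/code&gt;&lt;/pre&gt;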

&lt;p&gt;GIST indexes can also be used to optimize full-text search queries, network address queries, and queries that involve multi-dimensional data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GIST indexes are a powerful feature in PostgreSQL that can improve query performance for complex data types and multi-dimensional data. By dividing data into a set of overlapping rectangles or boxes, GIST indexes provide a hierarchical tree structure that allows PostgreSQL to efficiently locate the relevant data without having to scan the entire table. If you're working with complex data types in PostgreSQL, consider using GIST indexes to boost query performance and improve the overall efficiency of your database.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
    </item>
    <item>
      <title>What's new in PostgreSQL 14: Highlights of the upcoming release</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Wed, 03 May 2023 17:33:05 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/whats-new-in-postgresql-14-highlights-of-the-upcoming-release-240h</link>
      <guid>https://dev.to/hammadsaleemm/whats-new-in-postgresql-14-highlights-of-the-upcoming-release-240h</guid>
      <description>&lt;p&gt;PostgreSQL 14, the latest major release of the world's most advanced open-source relational database management system, is expected to be released in the third quarter of 2021. This release includes a host of new features, enhancements, and improvements, including:&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Performance and Scalability
&lt;/h2&gt;

&lt;p&gt;PostgreSQL 14 offers significant performance improvements over previous versions, especially for large databases and systems with many connections. Connection scalability received particular attention: snapshot handling was reworked so that throughput degrades far less as the number of concurrent connections grows. Parallel query execution was also extended to more operations, such as &lt;code&gt;REFRESH MATERIALIZED VIEW&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhanced Security Features
&lt;/h2&gt;

&lt;p&gt;PostgreSQL 14 strengthens the database's security defaults. The most notable change is that password encryption now defaults to SCRAM-SHA-256 instead of the older MD5 scheme, making stored credentials considerably harder to attack.&lt;/p&gt;

&lt;p&gt;This release also adds the predefined roles &lt;code&gt;pg_read_all_data&lt;/code&gt; and &lt;code&gt;pg_write_all_data&lt;/code&gt;, which make it easier to grant broad read-only or write access without handing out superuser privileges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Data Management
&lt;/h2&gt;

&lt;p&gt;PostgreSQL 14 also includes several new features to improve data management. One of these is the ability to detach a partition from a partitioned table concurrently (&lt;code&gt;ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY&lt;/code&gt;), which avoids blocking queries on the table while the partition is removed.&lt;/p&gt;

&lt;p&gt;In addition, &lt;code&gt;REINDEX&lt;/code&gt; can now operate on partitioned tables and indexes directly, recursing into all partitions instead of requiring each partition's index to be rebuilt by hand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Developer Experience
&lt;/h2&gt;

&lt;p&gt;PostgreSQL 14 also includes several new features aimed at improving the developer experience. Extended statistics can now be defined on expressions, not just plain columns, which helps the planner produce better estimates for queries that filter on computed values. The release also introduces subscripting syntax for &lt;code&gt;jsonb&lt;/code&gt; values, allowing elements to be read and written with familiar bracket notation, and new multirange types for working with sets of non-contiguous ranges.&lt;/p&gt;
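&lt;p&gt;Among the developer-facing additions in PostgreSQL 14 is subscripting syntax for &lt;code&gt;jsonb&lt;/code&gt; values; a minimal sketch (table and column names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- reading a nested value with bracket notation (new in PostgreSQL 14)
SELECT payload['customer']['name'] FROM orders;

-- writing a value in place; note the jsonb string literal
UPDATE orders SET payload['status'] = '"shipped"';
&lt;/code&gt;&lt;/pre&gt;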

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;PostgreSQL 14 is shaping up to be a major release, with significant improvements in performance, security, data management, and developer experience. These new features and enhancements make it an even more compelling choice for organizations of all sizes looking for a robust and scalable relational database management system. If you're already using PostgreSQL, upgrading to version 14 is likely to be a no-brainer. If you're not yet using PostgreSQL, now is a great time to start.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
    </item>
    <item>
      <title>A Comprehensive Guide to Concurrency Control in PostgreSQL</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Fri, 07 Apr 2023 12:45:11 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/a-comprehensive-guide-to-concurrency-control-in-postgresql-5d37</link>
      <guid>https://dev.to/hammadsaleemm/a-comprehensive-guide-to-concurrency-control-in-postgresql-5d37</guid>
      <description>&lt;p&gt;Concurrency control is a mechanism that ensures atomicity and isolation, two key properties of the ACID, when multiple transactions run simultaneously in a database. In this article, we will focus on the concurrency control mechanisms used in PostgreSQL, a popular open-source relational database management system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Control Techniques
&lt;/h2&gt;

&lt;p&gt;There are three primary concurrency control techniques used in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-version Concurrency Control (MVCC):&lt;/strong&gt; Each write operation creates a new version of a data item while retaining the old version. When a transaction reads a data item, the system selects one of the versions to ensure isolation of the individual transaction. PostgreSQL uses a variation of MVCC called Snapshot Isolation (SI).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict Two-Phase Locking (S2PL):&lt;/strong&gt; In S2PL, a writer acquires an exclusive lock for an item, blocking readers until the writer has finished.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimistic Concurrency Control (OCC):&lt;/strong&gt; OCC assumes that conflicts between transactions are rare and allows transactions to proceed without locks. OCC checks for conflicts only at the end of each transaction and rolls back if conflicts are detected.&lt;/p&gt;

&lt;p&gt;PostgreSQL uses MVCC with Snapshot Isolation as its primary concurrency control technique.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction Isolation Levels
&lt;/h2&gt;

&lt;p&gt;PostgreSQL supports the following transaction isolation levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;READ COMMITTED:&lt;/strong&gt; This level allows non-repeatable and phantom reads but prevents dirty reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REPEATABLE READ:&lt;/strong&gt; This level allows phantom reads but prevents non-repeatable and dirty reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SERIALIZABLE:&lt;/strong&gt; This level prevents all three anomalies: dirty reads, non-repeatable reads, and phantom reads.&lt;/p&gt;
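&lt;p&gt;The isolation level is chosen per transaction; a minimal sketch (the table is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = 1;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;  -- may be rejected with a serialization failure,
         -- in which case the application retries the transaction
&lt;/code&gt;&lt;/pre&gt;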

&lt;p&gt;It's important to note that in PostgreSQL version 9.0 and earlier, requesting the SERIALIZABLE level actually provided snapshot isolation, the behavior now offered as REPEATABLE READ, because it prevented the three anomalies defined in the ANSI SQL-92 standard. With the implementation of Serializable Snapshot Isolation (SSI) in version 9.1, a true SERIALIZABLE level was introduced, and the snapshot-isolation behavior became the current definition of REPEATABLE READ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction ID
&lt;/h2&gt;

&lt;p&gt;In PostgreSQL, each transaction is assigned a unique identifier called a transaction ID (txid). When a transaction starts, the transaction manager assigns a txid, which is a 32-bit unsigned integer. PostgreSQL reserves three special txids: 0 (Invalid txid), 1 (Bootstrap txid, used only in the initialization of the database cluster), and 2 (Frozen txid).&lt;/p&gt;
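&lt;p&gt;The txid assigned to the current transaction can be inspected directly; a quick sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- returns the current transaction's ID, assigning one if needed
SELECT txid_current();

-- in PostgreSQL 13 and later, pg_current_xact_id() is the preferred name
&lt;/code&gt;&lt;/pre&gt;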

&lt;h2&gt;
  
  
  Concurrency Control in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL uses Snapshot Isolation (SI) for DML (Data Manipulation Language, e.g., SELECT, UPDATE, INSERT, DELETE), and Strict Two-Phase Locking (S2PL) for DDL (Data Definition Language, e.g., CREATE TABLE, etc). Here are the main components of PostgreSQL's concurrency control mechanism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit log (clog):&lt;/strong&gt; The clog holds all transaction states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transaction snapshots:&lt;/strong&gt; Transaction snapshots represent a consistent view of the database for a particular transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility check rules:&lt;/strong&gt; Visibility check rules determine which version of each row is visible to a given transaction, by combining the transaction states recorded in the clog with the transaction's snapshot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preventing Anomalies
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's Snapshot Isolation (SI) technique prevents the three anomalies defined in the ANSI SQL-92 standard: Dirty Reads, Non-Repeatable Reads, and Phantom Reads. However, SI does not achieve true serializability and allows serialization anomalies such as Write Skew and Read-only Transaction Skew.&lt;/p&gt;

&lt;p&gt;To address this issue, Serializable Snapshot Isolation (SSI) was introduced in PostgreSQL version 9.1. SSI detects serialization anomalies and resolves conflicts caused by such anomalies, providing a true SERIALIZABLE isolation level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Concurrency Control in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;In conclusion, concurrency control is a crucial mechanism in any database management system that allows multiple transactions to execute simultaneously while maintaining the integrity of the database. PostgreSQL implements three main concurrency control techniques: MVCC, S2PL, and OCC. Among these, MVCC is the most popular, and PostgreSQL uses a variation of it called Snapshot Isolation (SI).&lt;/p&gt;

&lt;p&gt;While SI prevents the three anomalies defined in the ANSI SQL-92 standard, it cannot achieve true serializability because it allows serialization anomalies. To address this issue, Serializable Snapshot Isolation (SSI) was introduced in PostgreSQL version 9.1 and later, which provides a true SERIALIZABLE isolation level.&lt;/p&gt;

&lt;p&gt;This article has discussed the key features required for implementing the concurrency control mechanism, including transaction IDs, tuple structure, the commit log (clog), transaction snapshots, and visibility check rules. We have also looked at specific examples of concurrency control in PostgreSQL and discussed its transaction isolation levels.&lt;/p&gt;

&lt;p&gt;Overall, PostgreSQL's concurrency control mechanism is a powerful tool for managing multiple transactions in a database environment. Its use of MVCC and SI ensures that transactions can read and write data simultaneously without blocking each other, while SSI provides true serializability. By understanding these concepts and techniques, database administrators and developers can ensure that their applications are efficient, scalable, and reliable.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>Understanding Hash Indexing in PostgreSQL: Hash (Part 3)</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Thu, 06 Apr 2023 11:00:36 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/understanding-hash-indexing-in-postgresql-hash-part-3-1n7c</link>
      <guid>https://dev.to/hammadsaleemm/understanding-hash-indexing-in-postgresql-hash-part-3-1n7c</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-an-introduction-4h9l"&gt;first article&lt;/a&gt; described PostgreSQL indexing engine, the &lt;a href="https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-part-2-1fp5"&gt;second one&lt;/a&gt; dealt with the interface of access methods, and now we are ready to discuss specific types of indexes. Let's start with hash index.&lt;/p&gt;

&lt;p&gt;When it comes to database indexing, PostgreSQL provides several options, one of which is hash indexing. In this article, we will discuss the general theory behind hash indexing, its structure, and how it works in PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Hash Indexing?
&lt;/h2&gt;

&lt;p&gt;A hash index is a type of database indexing technique that maps keys to values using a hash function. It is similar to a regular array, but instead of being indexed by an integer, it is indexed by a hash value. Hashing is a way to associate a small number with a value of any data type. In PostgreSQL, hash indexes support only equality comparisons, so they are useful for fast lookups by exact key; they can be especially attractive when the indexed values are large, since only the hash code is stored in the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hash Index Structure
&lt;/h2&gt;

&lt;p&gt;The hash index in PostgreSQL is structured in the following way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta Page:&lt;/strong&gt; This is the first page in the hash index, containing information about what is inside the index.&lt;br&gt;
&lt;strong&gt;Bucket Pages:&lt;/strong&gt; The main pages of the index, storing data as "hash code - TID" pairs.&lt;br&gt;
&lt;strong&gt;Overflow Pages:&lt;/strong&gt; Structured the same way as bucket pages and used when one page is insufficient for a bucket.&lt;br&gt;
&lt;strong&gt;Bitmap Pages:&lt;/strong&gt; These pages keep track of overflow pages that are currently clear and can be reused for other buckets.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Hash Indexing Works in PostgreSQL:
&lt;/h2&gt;

&lt;p&gt;When inserting into the hash index, the hash function is computed for the key; in PostgreSQL it always returns the "integer" type, giving 2&lt;sup&gt;32&lt;/sup&gt; ≈ 4 billion possible values. The number of buckets initially equals two and dynamically increases to adjust to the data size. The bucket number is computed from the hash code using bit arithmetic, and the TID is stored in the appropriate bucket.&lt;/p&gt;

&lt;p&gt;When searching the hash index, the hash function is computed for the key, and the bucket number is obtained. The contents of the bucket are then searched for the matching TIDs with the appropriate hash codes. Since stored "hash code - TID" pairs are ordered, this is done efficiently.&lt;/p&gt;

&lt;p&gt;However, collisions can occur where different keys have the same four-byte hash codes. In such cases, the access method asks the general indexing engine to verify each TID by rechecking the condition in the table row.&lt;/p&gt;
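&lt;p&gt;A minimal sketch of creating and using a hash index (table and column names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE accounts (id bigint, email text);

-- hash indexes support equality lookups only
CREATE INDEX idx_accounts_email ON accounts USING hash (email);

EXPLAIN SELECT * FROM accounts WHERE email = 'alice@example.com';
&lt;/code&gt;&lt;/pre&gt;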

&lt;h2&gt;
  
  
  Mapping Data Structures to Pages
&lt;/h2&gt;

&lt;p&gt;From the perspective of the buffer cache manager, all information and index rows must be packed into pages. In PostgreSQL hash indexing, four kinds of pages are used: meta pages, bucket pages, overflow pages, and bitmap pages.&lt;/p&gt;

&lt;p&gt;When the index increases, PostgreSQL creates twice as many buckets and pages as were last created. Overflow pages are allocated as needed and are tracked in bitmap pages, which are also allocated as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, hash indexing is a useful technique for fast data access in static data sets with relatively small keys. PostgreSQL's hash indexing uses a hash function to map keys to values, and the data is stored in buckets. While collisions can occur, the index structure is designed to handle them efficiently. Understanding the structure and workings of hash indexing in PostgreSQL can help optimize database performance.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Understanding Indexes in PostgreSQL: Part 2</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Wed, 05 Apr 2023 14:05:44 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-part-2-1fp5</link>
      <guid>https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-part-2-1fp5</guid>
      <description>&lt;p&gt;PostgreSQL is an object-relational database management system that uses access methods to provide efficient ways of storing and retrieving data. An access method is an algorithm for organizing data in a way that makes it easy to search and retrieve. PostgreSQL provides several built-in access methods, including B-tree, hash, GIST, GIN, SP-GiST, and BRIN.&lt;/p&gt;

&lt;p&gt;In this series of articles, we will take a closer look at access methods in PostgreSQL, starting with an overview of the access method interface. In the &lt;a href="https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-an-introduction-4h9l"&gt;first article&lt;/a&gt;, we've mentioned that an access method must provide information about itself. Let's look into the structure of the access method interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interface: How does the access method interface work?
&lt;/h2&gt;

&lt;p&gt;An access method interface is a set of properties and functions that define the behavior of the access method. PostgreSQL stores all access method properties in the "pg_am" table, which contains a list of available access methods.&lt;/p&gt;

&lt;p&gt;In PostgreSQL versions 9.5 and lower, each property was represented with a separate field of the "pg_am" table. Starting with version 9.6, properties are queried with special functions and are separated into several layers: access method properties, properties of a specific index, and properties of individual columns of the index.&lt;/p&gt;

&lt;p&gt;The access method layer and index layer are separated to reflect the fact that all indexes based on one access method always share the same access-method-level properties. The following are the access method properties for the B-tree access method:&lt;/p&gt;

&lt;p&gt;can_order: the access method enables us to specify the sort order for values when an index is created.&lt;br&gt;
can_unique: support of the unique constraint and primary key.&lt;br&gt;
can_multi_col: an index can be built on several columns.&lt;br&gt;
can_exclude: support of the exclusion constraint EXCLUDE.&lt;/p&gt;

&lt;p&gt;The following are the index properties for an existing B-tree index:&lt;/p&gt;

&lt;p&gt;clusterable: a possibility to reorder rows according to the index.&lt;br&gt;
index_scan: support of index scans.&lt;br&gt;
bitmap_scan: support of bitmap scans.&lt;br&gt;
backward_scan: the result can be returned in the reverse order of the one specified when building the index.&lt;/p&gt;

&lt;p&gt;Finally, the following are column properties:&lt;/p&gt;

&lt;p&gt;asc, desc, nulls_first, nulls_last, orderable: these properties are related to ordering the values.&lt;br&gt;
distance_orderable: the result can be returned in the sort order determined by the operation.&lt;br&gt;
returnable: a possibility to use the index without accessing the table, that is, support of index-only scans.&lt;br&gt;
search_array: support of searching for several values with the expression «indexed-field IN (list_of_constants)».&lt;br&gt;
search_nulls: a possibility to search by IS NULL and IS NOT NULL conditions.&lt;/p&gt;
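&lt;p&gt;These layered properties can be queried directly; a sketch using the functions available since version 9.6 (the index name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- list the available index access methods
SELECT amname FROM pg_am WHERE amtype = 'i';

-- an access method property
SELECT pg_indexam_has_property(oid, 'can_order') FROM pg_am WHERE amname = 'btree';

-- a property of a specific index
SELECT pg_index_has_property('t_pkey', 'clusterable');

-- a property of the first column of that index
SELECT pg_index_column_has_property('t_pkey', 1, 'asc');
&lt;/code&gt;&lt;/pre&gt;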

&lt;h2&gt;
  
  
  Operator classes and families: What are operator classes and families?
&lt;/h2&gt;

&lt;p&gt;In addition to properties of an access method exposed by the interface, PostgreSQL uses operator classes and families to determine which data types and operators the access method accepts.&lt;/p&gt;

&lt;p&gt;An operator class contains a minimal set of operators (and maybe, auxiliary functions) for an index to manipulate a certain data type. An operator class is included in some operator family, and one common operator family can contain several operator classes if they have the same semantics.&lt;/p&gt;

&lt;p&gt;For example, "integer_ops" family includes "int8_ops", "int4_ops", and "int2_ops" classes for types "bigint", "integer", and "smallint", having different sizes but the same meaning. PostgreSQL provides a wide range of operator classes and families for each access method, allowing users to optimize data storage and retrieval for specific use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;In this article, we've discussed the structure of the access method interface in PostgreSQL, including the properties of access methods, indexes, and individual columns. We've also introduced the concepts of operator classes and families, which contain information about the data types and operators that an access method accepts. Understanding these concepts is essential for optimizing the performance of PostgreSQL databases and creating efficient indexes. In the next article, we'll dive deeper into the properties of specific access methods, starting with the popular "btree" method.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Understanding Indexes in PostgreSQL: An Introduction</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Wed, 05 Apr 2023 13:19:43 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-an-introduction-4h9l</link>
      <guid>https://dev.to/hammadsaleemm/understanding-indexes-in-postgresql-an-introduction-4h9l</guid>
      <description>&lt;p&gt;As an application developer working with a DBMS, you have likely come across indexes in PostgreSQL. Indexes are auxiliary structures in the database, designed to speed up data access and enforce integrity constraints. However, with so many different types of indexes available, it can be challenging to choose the right one for your application.&lt;/p&gt;

&lt;p&gt;In this series of articles, we will take a closer look at indexes in PostgreSQL, discussing the available types of indexes, why there are so many different types of them, and how to use them to optimize query performance. We will also delve into the details of the internal workings of indexes, giving you the tools to make informed decisions about which indexes to use.&lt;/p&gt;

&lt;p&gt;In this first article, we will discuss the distribution of responsibilities between the general indexing engine, which belongs to the DBMS core, and the individual index access methods, which PostgreSQL enables us to add as extensions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Indexes in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has six different types of indexes built in as of version 9.6, with one more available as an extension. Despite all the differences between index types, each of them associates a key with the table rows that contain this key. Each row is identified by a TID (tuple id), which consists of the block number in the file and the position of the row inside the block.&lt;/p&gt;
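&lt;p&gt;You can observe these TIDs through the system column "ctid". A minimal illustration, assuming a hypothetical table "accounts":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- ctid is the (block number, position inside the block) pair of each row version
SELECT ctid, * FROM accounts LIMIT 3;
-- typical ctid values: (0,1), (0,2), (0,3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;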

&lt;p&gt;It's important to note that while indexes can speed up data access, there is a cost involved. Each operation on indexed data, such as insertion, deletion, or update of table rows, requires the indexes for that table to be updated as well, and in the same transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensibility of Indexes in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;To enable easy addition of a new access method to the system, an interface of the general indexing engine has been implemented in PostgreSQL. Its main task is to get TIDs from the access method and work with them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read data from the corresponding versions of table rows.&lt;/li&gt;
&lt;li&gt;Fetch row versions TID by TID or in a batch using a prebuilt bitmap.&lt;/li&gt;
&lt;li&gt;Check visibility of row versions for the current transaction, taking into account its isolation level.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The indexing engine is involved in performing queries and is called according to a plan created at the optimization stage. The optimizer evaluates different ways to perform the query, taking into account the capabilities of all access methods that are potentially applicable.&lt;/p&gt;

&lt;p&gt;The optimizer is not the only component that needs information about an access method: when building an index, the system must also decide whether the index can be built on several columns and whether it ensures uniqueness. Each access method therefore provides all the necessary information about itself through special functions in PostgreSQL.&lt;/p&gt;

&lt;p&gt;The indexing engine in PostgreSQL enables the database to work with various access methods uniformly, taking into account their features. The main scanning techniques include index scan and bitmap scan. Index scan returns TID values one by one until the last matching row is reached, and the indexing engine accesses the table rows indicated by TIDs in turn. Bitmap scan, on the other hand, first returns all TIDs that match the condition, and the bitmap of row versions is built from these TIDs before row versions are read from the table. The choice of which scanning technique to use depends on the number of retrieved rows and other factors, and is determined by the optimizer.&lt;/p&gt;
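&lt;p&gt;The chosen technique is visible in the EXPLAIN output. As a sketch, assuming a hypothetical table "t" with a B-tree index on column "a", a highly selective condition tends to yield an index scan, while a less selective one may yield a bitmap scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- few matching rows: the planner typically chooses an index scan
EXPLAIN SELECT * FROM t WHERE a = 1;
--  Index Scan using t_a_idx on t ...

-- many matching rows: the planner may switch to a bitmap scan
EXPLAIN SELECT * FROM t WHERE a &amp;lt; 1000;
--  Bitmap Heap Scan on t
--    Recheck Cond: (a &amp;lt; 1000)
--    -&amp;gt;  Bitmap Index Scan on t_a_idx ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;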

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have introduced the topic of indexes in PostgreSQL, discussing the different types of indexes available and the cost involved in using them. We have also touched on the extensibility of indexes in PostgreSQL, which enables easy addition of new access methods to the system.&lt;/p&gt;

&lt;p&gt;In the next article, we will dive deeper into the interface of the access method and critical concepts such as classes and operator families. Armed with this knowledge, we can begin to explore the different types of indexes available in PostgreSQL and how to use them to optimize query performance.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>postgres</category>
      <category>database</category>
      <category>agedb</category>
    </item>
    <item>
      <title>Maximizing PostgreSQL Cluster Performance with Pgpool-II Load Balancing and Scaling Techniques</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Sun, 12 Mar 2023 09:13:34 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/maximizing-postgresql-cluster-performance-with-pgpool-ii-load-balancing-and-scaling-techniques-52m</link>
      <guid>https://dev.to/hammadsaleemm/maximizing-postgresql-cluster-performance-with-pgpool-ii-load-balancing-and-scaling-techniques-52m</guid>
      <description>&lt;p&gt;If you're using PostgreSQL for your database needs and are looking to scale up your cluster while also ensuring load balancing, then Pgpool-II can be an excellent solution for you. Pgpool-II is a middleware that sits between the PostgreSQL database and its clients, and provides a variety of features such as connection pooling, load balancing, and query routing. In this article, we'll explore how Pgpool-II can help you scale up and load balance your PostgreSQL cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up a PostgreSQL Cluster
&lt;/h2&gt;

&lt;p&gt;To get started with Pgpool-II, you'll need to set up a PostgreSQL cluster first. A cluster is a group of PostgreSQL instances that work together to provide high availability and scalability. Setting up a cluster involves configuring multiple PostgreSQL instances to replicate data between each other, so that in the event of a failure, the data can be quickly recovered. You'll also need to set up a monitoring and failover mechanism, which can be achieved using tools such as repmgr or Patroni.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Pgpool-II
&lt;/h2&gt;

&lt;p&gt;Once you have a PostgreSQL cluster set up, you can proceed to install Pgpool-II. Pgpool-II is available for installation on Linux and Windows, and can be installed using your operating system's package manager or by compiling from source. After installation, you'll need to configure Pgpool-II to work with your PostgreSQL cluster. This involves specifying the connection details of your PostgreSQL instances, as well as configuring settings such as load balancing and connection pooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Load Balancing
&lt;/h2&gt;

&lt;p&gt;One of the primary features of Pgpool-II is its ability to load balance connections across multiple PostgreSQL instances. This helps distribute the load evenly across the cluster and ensures that no single instance is overloaded. Pgpool-II distributes read queries among the backends according to configurable per-backend weights, so you can steer proportionally more traffic to stronger nodes. You can tune this behavior in the Pgpool-II configuration file.&lt;/p&gt;
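&lt;p&gt;As a minimal sketch, load balancing is enabled in pgpool.conf along the following lines (the hostnames and weights are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pgpool.conf (excerpt)
load_balance_mode = on

backend_hostname0 = 'primary.example.com'
backend_port0 = 5432
backend_weight0 = 1

backend_hostname1 = 'replica1.example.com'
backend_port1 = 5432
backend_weight1 = 2   # receives roughly twice the read queries of backend 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;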

&lt;h2&gt;
  
  
  Configuring Connection Pooling
&lt;/h2&gt;

&lt;p&gt;Another important feature of Pgpool-II is connection pooling. Connection pooling helps reduce the overhead of creating and destroying database connections by reusing existing connections. This can help improve performance and reduce resource usage. Pgpool-II allows you to configure the maximum number of connections to be pooled, as well as settings such as the maximum time a connection can be idle before it's closed.&lt;/p&gt;
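&lt;p&gt;The relevant pgpool.conf settings look roughly as follows (the values are illustrative, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pgpool.conf (excerpt)
num_init_children = 32       # pgpool worker processes accepting client connections
max_pool = 4                 # cached backend connections per worker
child_life_time = 300        # seconds an idle worker lives before being restarted
connection_life_time = 600   # seconds a cached backend connection may stay idle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;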

&lt;h2&gt;
  
  
  Configuring Query Routing
&lt;/h2&gt;

&lt;p&gt;Pgpool-II also provides a feature called query routing, which allows you to route queries to specific PostgreSQL instances based on their type or content. This can be useful if you have certain queries that need to be executed on specific instances, or if you want to balance the load across instances more evenly. Query routing can be configured in the Pgpool-II configuration file using regular expressions.&lt;/p&gt;
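&lt;p&gt;One way to express such routing is through the redirect preference lists in pgpool.conf, which accept regular expressions; the database name, application name, and node numbers below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pgpool.conf (excerpt)
# send sessions for databases matching "reportdb.*" to node 1
database_redirect_preference_list = 'reportdb.*:1'

# route clients whose application_name is "batch_loader" to the primary
app_name_redirect_preference_list = 'batch_loader:primary'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;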

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, if you're looking to scale up your PostgreSQL cluster and ensure load balancing, Pgpool-II is an excellent solution that provides a variety of features to help you achieve your goals. By configuring load balancing, connection pooling, and query routing, you can ensure that your PostgreSQL cluster is highly available, scalable, and performs well under heavy loads. So why not give it a try?&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.ashnik.com/scaling-up-and-load-balancing-your-postgresql-cluster-using-pgpool-ii/"&gt;https://www.ashnik.com/scaling-up-and-load-balancing-your-postgresql-cluster-using-pgpool-ii/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>Understanding the Performance of Postgres_fdw Extension in PostgreSQL 4.1.2 (Part 2)</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Sat, 11 Mar 2023 06:23:27 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/understanding-the-performance-of-postgresfdw-extension-in-postgresql-412-part-2-42l4</link>
      <guid>https://dev.to/hammadsaleemm/understanding-the-performance-of-postgresfdw-extension-in-postgresql-412-part-2-42l4</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/hammadsaleemm/understanding-foreign-data-wrappers-fdw-in-postgresql-41-37i0"&gt;our previous article&lt;/a&gt;, we discussed Foreign Data Wrappers (FDW) in PostgreSQL and how they enable PostgreSQL to interact with other data sources. We also introduced the postgres_fdw extension, which is a module maintained by the PostgreSQL Global Development Group that allows users to create foreign tables in PostgreSQL for remote data sources.&lt;/p&gt;

&lt;p&gt;In this article, we will dive deeper into the postgres_fdw extension and its performance characteristics, following section 4.1.2 of The Internals of PostgreSQL. We will cover the release notes related to postgres_fdw and how it processes multi-table queries, sort operations, and aggregate functions. Additionally, we will discuss how postgres_fdw processes other Data Manipulation Language (DML) statements, including INSERT, UPDATE, and DELETE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Notes Related to Postgres_fdw
&lt;/h2&gt;

&lt;p&gt;Postgres_fdw has undergone numerous updates over the years. The table below shows the release notes related to postgres_fdw, as cited from the official documentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--43f_ZkDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aaijxgi43cfps6434vv5.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--43f_ZkDu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aaijxgi43cfps6434vv5.PNG" alt="Image description" width="652" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing Multi-Table Queries, Sort Operations, and Aggregate Functions
&lt;/h2&gt;

&lt;p&gt;Postgres_fdw can execute queries involving multiple tables by fetching each foreign table with a single-table SELECT statement and then joining them on the local server.&lt;/p&gt;

&lt;p&gt;In version 9.5 or earlier, even if the foreign tables were stored on the same remote server, postgres_fdw would fetch them individually and join them. However, in version 9.6 or later, postgres_fdw has been improved to execute the remote join operation on the remote server when the foreign tables are on the same server and the use_remote_estimate option is on.&lt;/p&gt;

&lt;p&gt;Let us consider how PostgreSQL processes the following query that joins two foreign tables, tbl_a and tbl_b:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id AND a.id &amp;lt; 200;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The EXPLAIN output for this query shows that the planner selects a merge join, and the query is processed in the following steps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Merge Join  (cost=532.31..700.34 rows=10918 width=16)&lt;br&gt;
   Merge Cond: (a.id = b.id)&lt;br&gt;
   -&amp;gt;  Sort  (cost=200.59..202.72 rows=853 width=8)&lt;br&gt;
         Sort Key: a.id&lt;br&gt;
         -&amp;gt;  Foreign Scan on tbl_a a  (cost=100.00..159.06 rows=853 width=8)&lt;br&gt;
   -&amp;gt;  Sort  (cost=331.72..338.12 rows=2560 width=8)&lt;br&gt;
         Sort Key: b.id&lt;br&gt;
         -&amp;gt;  Foreign Scan on tbl_b b  (cost=100.00..186.80 rows=2560 width=8)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The output shows that the executor fetches the rows of tbl_a using a foreign table scan and sorts them on the local server; it then does the same for tbl_b. Finally, it merge-joins the two sorted inputs and returns the result of the query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sort Operation
&lt;/h2&gt;

&lt;p&gt;When a query involves sorting the result set, postgres_fdw can perform the sort operation on the remote server if possible, thus reducing the amount of data that needs to be transferred to the local server. However, this depends on the planner's decision, which considers the cost of transferring the data versus the cost of performing the sort operation remotely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregate Functions
&lt;/h2&gt;

&lt;p&gt;Postgres_fdw can also push aggregate functions to the remote server if possible. This means that the remote server will perform the aggregation operation and return the aggregated result set to the local server. However, not all aggregate functions can be pushed down to the remote server, and the planner will make a decision based on the cost of transferring the data versus the cost of performing the aggregation operation remotely.&lt;/p&gt;
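&lt;p&gt;Whether pushdown actually happened can be checked with EXPLAIN VERBOSE, which prints the SQL sent to the remote server. A sketch, reusing the foreign table tbl_a from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN (VERBOSE, COSTS OFF)
SELECT count(*) FROM tbl_a;
-- if the aggregate is pushed down, the plan is a single Foreign Scan whose
-- "Remote SQL:" line contains the count(*), e.g.
--   Remote SQL: SELECT count(*) FROM public.tbl_a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;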

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Postgres_fdw is a powerful tool that allows PostgreSQL to work with data stored on remote servers. The postgres_fdw extension has been gradually improved in recent versions, allowing it to perform more operations on the remote server, such as joins, sorts, and aggregate functions. However, it's important to note that postgres_fdw and the FDW feature do not support the distributed lock manager and the distributed deadlock detection feature, which means that deadlocks can occur in certain situations. Nevertheless, postgres_fdw remains a valuable tool for working with remote data sources in PostgreSQL.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>postgres</category>
      <category>apach</category>
      <category>database</category>
    </item>
    <item>
      <title>Understanding Foreign Data Wrappers (FDW) in PostgreSQL 4.1</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Fri, 10 Mar 2023 09:20:13 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/understanding-foreign-data-wrappers-fdw-in-postgresql-41-37i0</link>
      <guid>https://dev.to/hammadsaleemm/understanding-foreign-data-wrappers-fdw-in-postgresql-41-37i0</guid>
      <description>&lt;p&gt;In PostgreSQL, accessing remote data has been possible since version 9.1 through the implementation of SQL Management of External Data (SQL/MED). Foreign Data Wrappers (FDW) use SQL/MED to manage foreign tables in much the same way as local tables. This feature makes it possible to access foreign tables from the local server and to execute join operations involving foreign tables stored on different servers. Many FDW extensions have been developed for PostgreSQL, including the officially developed and maintained extension postgres_fdw, for accessing remote PostgreSQL servers.&lt;/p&gt;

&lt;p&gt;This article follows section 4.1 of the fourth chapter of The Internals of PostgreSQL, which explains in detail how FDW works in PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of FDW in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;To use the FDW feature, one needs to install the appropriate extension and execute setup commands such as CREATE FOREIGN TABLE, CREATE SERVER, and CREATE USER MAPPING. Once set up, the functions defined in the extension are invoked during query processing to access the foreign tables.&lt;/p&gt;
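&lt;p&gt;For postgres_fdw, the setup sequence looks as follows; the server name, connection options, and table definition are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '192.168.1.10', port '5432', dbname 'remotedb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'remote_user', password 'secret');

CREATE FOREIGN TABLE tbl_a (id int, data text)
    SERVER remote_pg
    OPTIONS (schema_name 'public', table_name 'tbl_a');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;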

&lt;p&gt;The analyzer creates the query tree of the input SQL using the definitions of the foreign tables stored in the pg_catalog.pg_class and pg_catalog.pg_foreign_table catalogs. To reach the remote server, the planner (or executor) uses a specific client library; connection parameters such as the username, the server's IP address, and the port number are stored in the pg_catalog.pg_user_mapping and pg_catalog.pg_foreign_server catalogs, where they were placed by the CREATE USER MAPPING and CREATE SERVER commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Plan Tree Using EXPLAIN Commands (Optional)
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's FDW supports obtaining statistics of the foreign tables to estimate the plan tree of a query. Some FDW extensions, such as postgres_fdw, mysql_fdw, tds_fdw, and jdbc2_fdw, use this feature. If the use_remote_estimate option is set to on using the ALTER SERVER command, the planner queries the cost of plans to the remote server by executing the EXPLAIN command. Otherwise, the embedded constant values are used by default.&lt;/p&gt;
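&lt;p&gt;For example, assuming a foreign server named remote_pg, the option is switched on like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER SERVER remote_pg
    OPTIONS (ADD use_remote_estimate 'on');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;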

&lt;h2&gt;
  
  
  Creating a Query Tree
&lt;/h2&gt;

&lt;p&gt;The analyzer creates the query tree using the definitions of the foreign tables, which are created with the CREATE FOREIGN TABLE or IMPORT FOREIGN SCHEMA command and stored in the pg_catalog.pg_class and pg_catalog.pg_foreign_table catalogs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to the Remote Server
&lt;/h2&gt;

&lt;p&gt;The planner (or executor) uses a specific library to connect to the remote database server, such as libpq to connect to the remote PostgreSQL server and libmysqlclient to connect to the MySQL server. Connection parameters such as the username, server's IP address, and port number are stored in the pg_catalog.pg_user_mapping and pg_catalog.pg_foreign_server catalogs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deparsing
&lt;/h2&gt;

&lt;p&gt;While generating the plan tree, the planner deparses the scan paths of the foreign tables into plain text SQL statements. Each statement is sent to the remote server, and the executor receives the result and processes the received data further if necessary. For instance, if a multi-table query is executed, the executor performs the join between the received data and the other tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Foreign Data Wrappers in PostgreSQL are a useful feature that allows accessing remote tables from local servers and executing join operations with foreign tables stored on different servers. Understanding the workings of FDW is crucial for optimizing query performance. The officially developed extension, postgres_fdw, is the most reliable and best-maintained FDW extension available.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>agedb</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>Installing age-viewer (Part 3: Installing Apache AGE and PSQL from Source)</title>
      <dc:creator>hammadsaleemm</dc:creator>
      <pubDate>Fri, 03 Mar 2023 19:45:42 +0000</pubDate>
      <link>https://dev.to/hammadsaleemm/installing-age-viewer-part3-installing-apache-and-psql-from-source-2798</link>
      <guid>https://dev.to/hammadsaleemm/installing-age-viewer-part3-installing-apache-and-psql-from-source-2798</guid>
      <description>&lt;p&gt;Welcome back to the third installment of our series on the Apache AGE graph database. In my previous two articles, I discussed how to install &lt;a href="https://dev.to/hammadsaleemm/how-to-istall-apache-age-and-psql-from-source-on-ypur-machine-i6h"&gt;Age and PSQL&lt;/a&gt;, as well as how to create and add a node to a &lt;a href="https://dev.to/hammadsaleemm/initialization-of-the-cluster-creating-a-graph-and-adding-a-node-part-2-installing-apache-and-psql-from-source-40da"&gt;graph database using Age and PSQL&lt;/a&gt;. In this article, I will introduce Age-viewer, a web application that allows you to visualize data stored in your graph database. Before we dive into the details of Age-viewer, there are a few prerequisites you need to install.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Dependencies
&lt;/h2&gt;

&lt;p&gt;Before we start with Age-viewer, we need to install nodejs and npm to run the app. We recommend installing node version 14.16.0. You can install the specific version using the nvm install 14.16.0 command. If you do not have nvm, you can download it from &lt;a href="https://www.freecodecamp.org/news/node-version-manager-nvm-install-guide/" rel="noopener noreferrer"&gt;https://www.freecodecamp.org/news/node-version-manager-nvm-install-guide/&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install nodejs npm

or

# recommended
nvm install 14.16.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Downloading and Cloning
&lt;/h2&gt;

&lt;p&gt;After installing the necessary dependencies, we need to download and clone Age-viewer. Go back to the root directory and run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/apache/age-viewer.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Start
&lt;/h2&gt;

&lt;p&gt;Once Age-viewer is downloaded, navigate to the Age-viewer directory using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd age-viewer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start the app, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
npm run setup
npm run start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take some time to start the app. Once it is up and running, connect it with the database server by entering the login details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# address (default : localhost)
url: server_url;

# port_num (default 5432)
port: port_num;

# username for the database
username: username;

# password for the user
pass: password;

# database name you want to connect to
dbname: demodb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For me it was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url: localhost;

port: 5432;

username: hammad;

# arbitrary value, as no password is set for this user
pass: 1234;

dbname: demodb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2c6yw8ohlix2y88eha0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2c6yw8ohlix2y88eha0.png" alt=" " width="712" height="373"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Graph Visualization
&lt;/h2&gt;

&lt;p&gt;Now that Age-viewer is connected to the database, we can start visualizing our data. First, run the following commands to create some Person nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "hammad", bornIn : "Pakistan"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "sohaib", bornIn : "Pakistan"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "sadeed", bornIn : "Pakistan"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "taimoor", bornIn : "Pakistan"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "adele", bornIn : "US"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Person {name : "nandhini", bornIn : "US"}) $$) AS (a agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have created these nodes, you can view them on the age-viewer. To do this, simply refresh the viewer page and you should be able to see the newly created nodes.&lt;/p&gt;

&lt;p&gt;Now, let's create some nodes for the COUNTRY label.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM cypher('demo_graph', $$ CREATE (n:Country{name : "Pakistan"}) $$) AS (a agtype);
SELECT * FROM cypher('demo_graph', $$ CREATE (n:Country{name : "US"}) $$) AS (a agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, you can view these nodes on the age-viewer by refreshing the page.&lt;/p&gt;

&lt;p&gt;Finally, we will create a relationship between the Person and Country nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM cypher('demo_graph', $$ MATCH (a:Person), (b:Country) WHERE a.bornIn = b.name CREATE (a)-[r:BORNIN]-&amp;gt;(b) RETURN r $$) as (r agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query will create a relationship between the Person nodes and Country nodes where the "bornIn" property of the Person node matches the "name" property of the Country node.&lt;/p&gt;

&lt;p&gt;You can now view the entire graph on the age-viewer by executing the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * from cypher('demo_graph', $$ MATCH (a:Person)-[r]-(b:Country) WHERE a.bornIn = b.name RETURN a, r, b $$) as (a agtype, r agtype, b agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns the Person nodes, the Country nodes they were born in, and the relationships between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tkmxtuyevjj0ed842jp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tkmxtuyevjj0ed842jp.png" alt=" " width="688" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, you now know how to use the age-viewer to visualize your data. The age-viewer is a powerful tool that can help you gain insights into your data by allowing you to visually explore the relationships between different entities in your graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://age.apache.org/age-manual/master/intro/setup.html" rel="noopener noreferrer"&gt;https://age.apache.org/age-manual/master/intro/setup.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.postgresql.org/docs/current/install-procedure.html" rel="noopener noreferrer"&gt;https://www.postgresql.org/docs/current/install-procedure.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>systemdesign</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
