<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nile Lazarus</title>
    <description>The latest articles on DEV Community by Nile Lazarus (@nilelazarus).</description>
    <link>https://dev.to/nilelazarus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1102506%2F7abdbfcd-4494-43bb-879c-f3d4186be2fa.jpg</url>
      <title>DEV Community: Nile Lazarus</title>
      <link>https://dev.to/nilelazarus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilelazarus"/>
    <language>en</language>
    <item>
      <title>A Guide to Setting Up Pgadmin for Development on Windows(Updated)</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Thu, 12 Oct 2023 14:16:20 +0000</pubDate>
      <link>https://dev.to/nilelazarus/a-guide-to-setting-up-pgadmin-for-development-on-windowsupdated-3fc6</link>
      <guid>https://dev.to/nilelazarus/a-guide-to-setting-up-pgadmin-for-development-on-windowsupdated-3fc6</guid>
      <description>&lt;p&gt;In my last guide on this topic, some instructions may not have worked for everyone who attempted it.&lt;br&gt;
This time around, the issues with my last guide have been resolved and thoroughly tested to see if they work for anyone and everyone wanting to configure pgadmin 4 for development on a Windows system.&lt;/p&gt;
&lt;h3&gt;
  
  
  Install prerequisites
&lt;/h3&gt;

&lt;p&gt;First and foremost, please ensure that all of the following requirements are fulfilled on your system.&lt;/p&gt;

&lt;p&gt;git (&lt;a href="https://git-scm.com/downloads"&gt;https://git-scm.com/downloads&lt;/a&gt;)&lt;br&gt;
Node.js 16 and above (&lt;a href="https://nodejs.org/en/download"&gt;https://nodejs.org/en/download&lt;/a&gt;)&lt;br&gt;
yarn (&lt;a href="https://classic.yarnpkg.com/lang/en/docs/install"&gt;https://classic.yarnpkg.com/lang/en/docs/install&lt;/a&gt;)&lt;br&gt;
Python 3.7 and above (&lt;a href="https://www.python.org/downloads/"&gt;https://www.python.org/downloads/&lt;/a&gt;)&lt;br&gt;
PostgreSQL server (&lt;a href="https://www.postgresql.org/download"&gt;https://www.postgresql.org/download&lt;/a&gt;)&lt;/p&gt;
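Before moving on, you can quickly check which of these are already available from your terminal. This simple loop only looks each command up on the PATH; verifying the versions (e.g. Node.js 16+, Python 3.7+) is still up to you:

```shell
# Report which prerequisite commands are reachable on the PATH.
# This does not check versions, only presence.
for cmd in git node yarn python psql; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found:   $cmd"
  else
    echo "missing: $cmd"
  fi
done
```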
&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open a terminal of your choice; I will be using Git Bash.&lt;/li&gt;
&lt;li&gt;Create a directory for your setup and navigate to the directory
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir pgadmin
cd pgadmin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Clone the pgadmin 4 git repository
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/pgadmin-org/pgadmin4.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can now begin building the runtime for your frontend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;code&gt;/pgadmin4/runtime&lt;/code&gt; directory (while in the &lt;code&gt;pgadmin&lt;/code&gt; directory created above)
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd pgadmin4/runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Run the following command and copy the contents of the &lt;code&gt;dev_config.json.in&lt;/code&gt; file
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat dev_config.json.in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The contents will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "pythonPath": "C:/Python38/python.exe",
    "pgadminFile": "../web/pgAdmin4.py"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the string stored in &lt;code&gt;pythonPath&lt;/code&gt; with the actual path to &lt;code&gt;python.exe&lt;/code&gt; stored on your system.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now run this command to create a new file called &lt;code&gt;dev_config.json&lt;/code&gt; and open it for writing
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt; dev_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal will wait on a blank line after you enter this command. Paste the contents of the file copied earlier and hit &lt;code&gt;CTRL + D&lt;/code&gt; to write the file.&lt;/p&gt;
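If you prefer not to paste interactively, a heredoc writes the same file in one go. The pythonPath value below is only an example; substitute the actual path to python.exe on your system:

```shell
# Create dev_config.json without interactive pasting.
# The pythonPath shown is an example and must point at your own python.exe.
cat > dev_config.json <<'EOF'
{
    "pythonPath": "C:/Python38/python.exe",
    "pgadminFile": "../web/pgAdmin4.py"
}
EOF
```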

&lt;ol&gt;
&lt;li&gt;Run the command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Execute the runtime by running this command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/nw/nwjs/nw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now configure the Python environment for the backend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate out of the runtime directory
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a virtual environment using whatever name you wish. I named my environment &lt;code&gt;pgenv&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m virtualenv pgenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Activate the environment
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source pgenv/Scripts/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Upgrade to the latest version of pip
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Add the path to your PostgreSQL installation bin directory to your environment variables with this command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export PATH="$PATH:/c/Program Files/PostgreSQL/13/bin"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm using PostgreSQL v13 but you can change the path to match the version you have installed.&lt;/p&gt;
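If you want to confirm the directory really was appended, a quick check like this helps (the PostgreSQL path is an example and should match your installation):

```shell
# Append the PostgreSQL bin directory (example path), show the last
# PATH entry, and check whether psql now resolves.
export PATH="$PATH:/c/Program Files/PostgreSQL/13/bin"
echo "$PATH" | tr ':' '\n' | tail -n 1
command -v psql || echo "psql not found - double-check the path"
```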

&lt;ol&gt;
&lt;li&gt;Install dependencies
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Open up a second terminal and navigate to the web directory. Then run the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn run webpacker --watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the first terminal, navigate to the web directory again. Ensure that your virtual environment has been activated and start the server by running
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python pgAdmin.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get a message like &lt;code&gt;Starting pgAdmin 4. Please navigate to http://127.0.0.1:5050 in your browser.&lt;/code&gt;. Navigate to &lt;code&gt;http://127.0.0.1:5050&lt;/code&gt; in your browser.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Install and Configure Kerberos on Windows 10</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Mon, 28 Aug 2023 16:23:57 +0000</pubDate>
      <link>https://dev.to/nilelazarus/how-to-install-and-configure-kerberos-on-windows-10-1jlm</link>
      <guid>https://dev.to/nilelazarus/how-to-install-and-configure-kerberos-on-windows-10-1jlm</guid>
      <description>&lt;p&gt;Cybercrime is an unfortunate part of reality which most of have been victim to in one way or another. As a developer or cyber security specialist, it is important to keep yourself up-to-date and aware of measures you can take to prevent or reduce cyber threats to you applications.&lt;/p&gt;

&lt;p&gt;Kerberos can be a great starting point.&lt;/p&gt;

&lt;p&gt;Kerberos is a computer network security protocol that authenticates service requests between two or more trusted hosts across an untrusted network, like the internet. It uses secret-key cryptography and third-party authentication to authenticate client-server applications and verify users' identities.&lt;/p&gt;

&lt;p&gt;In this blog, I will show you how to install and configure Kerberos on your system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Head over to this &lt;a href="https://www.secure-endpoints.com/netidmgr/v2/#download"&gt;site&lt;/a&gt; and navigate to the "&lt;strong&gt;Downloads&lt;/strong&gt;" section&lt;/li&gt;
&lt;li&gt;Click on the 64-bit version to download: "&lt;strong&gt;Network Identity Manager 2.5.0.106 (64-bit MSI)&lt;/strong&gt;".&lt;/li&gt;
&lt;li&gt;Once the download is done, open up the installer and select the "&lt;strong&gt;Typical&lt;/strong&gt;" installation option.&lt;/li&gt;
&lt;li&gt;Now access this &lt;a href="https://www.secure-endpoints.com/heimdal/#download"&gt;site&lt;/a&gt; and once again navigate to the "&lt;strong&gt;Downloads&lt;/strong&gt;" section.&lt;/li&gt;
&lt;li&gt;Click on "&lt;strong&gt;Heimdal 7.4.0 (64-bit and 32-bit)&lt;/strong&gt;" to download.&lt;/li&gt;
&lt;li&gt;Once it is finished downloading, open the installer and follow the default installation steps.&lt;/li&gt;
&lt;li&gt;Once everything has been installed, restart your computer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Follow these steps to get your own valid Kerberos ticket.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open your Start menu and search for "&lt;strong&gt;Network Identity Manager&lt;/strong&gt;".&lt;/li&gt;
&lt;li&gt;Open the application and right-click on "&lt;strong&gt;My Keystore&lt;/strong&gt;" and then "&lt;strong&gt;Obtain new credentials&lt;/strong&gt;".
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0uP48VAn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8y77n6rprqd7vhsytls.png" alt="Image description" width="450" height="419"&gt;
&lt;/li&gt;
&lt;li&gt;Enter a username and default realm.&lt;/li&gt;
&lt;li&gt;Enter your system's password.&lt;/li&gt;
&lt;li&gt;(Optional) Set up your Keystore.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kerberos</category>
      <category>tutorial</category>
      <category>cybersecurity</category>
      <category>security</category>
    </item>
    <item>
      <title>Graph Databases and Their Applications</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Sun, 27 Aug 2023 19:57:01 +0000</pubDate>
      <link>https://dev.to/nilelazarus/graph-databases-and-their-applications-dba</link>
      <guid>https://dev.to/nilelazarus/graph-databases-and-their-applications-dba</guid>
      <description>&lt;p&gt;Real-life relationships and data are usually heavily interconnected and extremely complex. A lot of valuable information cannot be restricted to simple tables and documents. This is where graph databases come in handy.&lt;/p&gt;

&lt;p&gt;A graph database (GDB) is a NoSQL database that uses graph structures to store data. It uses nodes, edges, and properties instead of the rows, columns, tables, and documents of traditional databases. Nodes typically represent entities, and edges represent the relationships between them. Graph databases are crucial for applications where the relationships between data elements are as important as the elements themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases and Applications
&lt;/h2&gt;

&lt;p&gt;Graph databases shine the brightest in scenarios where understanding the complexity of relationships is placed at the forefront.&lt;br&gt;
The first such scenario that should come to mind is social networks. Social network platforms rely heavily on the capabilities that GDBs offer. An example would be using a GDB to discover hidden patterns within friend circles, interests, and interactions, and utilizing them to provide users with a more personalized experience.&lt;br&gt;
GDBs are also used in the e-commerce realm to provide personalized recommendations by analyzing past purchases and shared preferences.&lt;br&gt;
Additionally, search engines are also fueled by knowledge graphs and so are fraud detection systems which utilize connections to spot anomalous behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitnine Global Inc's Graph Database Software
&lt;/h2&gt;

&lt;p&gt;Bitnine is a company that specializes in fully integrated graph databases.&lt;br&gt;
Apache AGE is an open-source GDB offered by Bitnine. Its key features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GDB plugin for PostgreSQL&lt;/li&gt;
&lt;li&gt;Hybrid query support (openCypher and SQL)&lt;/li&gt;
&lt;li&gt;Fast graph query processing&lt;/li&gt;
&lt;li&gt;Graph visualization and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AgensGraph is their closed-source GDB option. Its key features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid query processing (Cypher and SQL)&lt;/li&gt;
&lt;li&gt;Enhanced security&lt;/li&gt;
&lt;li&gt;Data sharding&lt;/li&gt;
&lt;li&gt;Native graph storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Apache AGE vs AgensGraph
&lt;/h3&gt;

&lt;p&gt;The main difference between the two is that AGE is an openCypher plugin for PostgreSQL whereas AgensGraph is a complete graph database built upon PostgreSQL.&lt;/p&gt;

&lt;p&gt;By being a fork of PostgreSQL, AgensGraph is tied to a specific version of PostgreSQL. However, AGE is an extension and is not tied to any specific version of PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bitnine.net/"&gt;Bitnine Global Inc. official website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/age"&gt;Apache AGE repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bitnine-oss/agensgraph"&gt;AgensGraph repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>bitnine</category>
      <category>learning</category>
      <category>postgres</category>
    </item>
    <item>
      <title>How to install pgadmin from source code for development on Windows</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Thu, 24 Aug 2023 21:50:39 +0000</pubDate>
      <link>https://dev.to/nilelazarus/how-to-install-pgadmin-from-source-code-for-development-on-windows-3551</link>
      <guid>https://dev.to/nilelazarus/how-to-install-pgadmin-from-source-code-for-development-on-windows-3551</guid>
      <description>&lt;p&gt;pgAdmin is an open-source, web-based tool designed for the administration and management of PostgreSQL databases. Users can use its intuitive graphical interface (GUI) to interact with their PostgreSQL databases.&lt;br&gt;
pgAdmin is available for Windows, macOS, and Linux, making it a versatile cross-platform choice for database administrators and developers.&lt;/p&gt;

&lt;p&gt;On Windows, pgAdmin 4 is usually installed by default alongside PostgreSQL. However, if you are interested in contributing to pgAdmin development or need to work with a specific version of pgAdmin, knowing how to configure it for development is crucial.&lt;/p&gt;

&lt;p&gt;In this blog, I will show you how to install and configure pgAdmin for development on Windows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Guide
&lt;/h2&gt;

&lt;p&gt;The two easiest ways to set up pgAdmin for development are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using a virtual environment and Python for the backend&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;using Node.js and Yarn for the frontend&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will be covering the second method in this guide, as dedicated frontend tooling yields better optimization and a smoother development experience, and it keeps the frontend and backend environments cleanly separated. It is slightly more complicated to set up than the first approach but offers more flexibility.&lt;/p&gt;

&lt;p&gt;It goes without saying that this method requires you to have Python 3.6 or later installed, as well as Node.js and Yarn.&lt;/p&gt;
&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open a terminal of your choice; I will be using Git Bash.&lt;/li&gt;
&lt;li&gt;Create a directory for your setup and navigate to the directory
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir pgadmin
cd pgadmin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Clone the pgadmin 4 git repository
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/pgadmin-org/pgadmin4.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can now begin building the runtime for your frontend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;code&gt;/pgadmin4/runtime&lt;/code&gt; directory (while in the &lt;code&gt;pgadmin&lt;/code&gt; directory created above)
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd pgadmin4/runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Run the following command and copy the contents of the &lt;code&gt;dev_config.json.in&lt;/code&gt; file
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat dev_config.json.in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The contents will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "pythonPath": "C:/Python38/python.exe",
    "pgadminFile": "../web/pgAdmin4.py"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the string stored in &lt;code&gt;pythonPath&lt;/code&gt; with the actual path to &lt;code&gt;python.exe&lt;/code&gt; stored on your system.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now run this command to create a new file called &lt;code&gt;dev_config.json&lt;/code&gt; and open it for writing
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt; dev_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal will wait on a blank line after you enter this command. Paste the contents of the file copied earlier and hit &lt;code&gt;CTRL + D&lt;/code&gt; to write the file.&lt;/p&gt;
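Alternatively, a heredoc writes the same file non-interactively. The pythonPath value below is only an example; substitute the actual path to python.exe on your system:

```shell
# Create dev_config.json without interactive pasting.
# The pythonPath shown is an example and must point at your own python.exe.
cat > dev_config.json <<'EOF'
{
    "pythonPath": "C:/Python38/python.exe",
    "pgadminFile": "../web/pgAdmin4.py"
}
EOF
```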

&lt;ol&gt;
&lt;li&gt;Run the command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Execute the runtime by running this command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/nw/nwjs/nw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now configure the Python environment for the backend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate out of the runtime directory
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a virtual environment using whatever name you wish. I named my environment &lt;code&gt;pgenv&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m virtualenv pgenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Activate the environment
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source pgenv/Scripts/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Upgrade to the latest version of pip
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Add the path to your PostgreSQL installation bin directory to your environment variables with this command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export PATH="$PATH:/c/Program Files/PostgreSQL/13/bin"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm using PostgreSQL v13 but you can change the path to match the version you have installed.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install dependencies
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start the server by running
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python web/pgAdmin.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get a message like &lt;code&gt;Starting pgAdmin 4. Please navigate to http://127.0.0.1:5050 in your browser.&lt;/code&gt;. Navigate to &lt;code&gt;http://127.0.0.1:5050&lt;/code&gt; in your browser.&lt;/p&gt;

</description>
      <category>development</category>
      <category>opensource</category>
      <category>postgres</category>
      <category>webdev</category>
    </item>
    <item>
      <title>An Introduction to Postgres Enterprise Manager</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Sat, 19 Aug 2023 14:51:54 +0000</pubDate>
      <link>https://dev.to/nilelazarus/an-introduction-to-postgres-enterprise-manager-l54</link>
      <guid>https://dev.to/nilelazarus/an-introduction-to-postgres-enterprise-manager-l54</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Postgres Enterprise Manager (PEM) is an all-in-one database administration solution which offers management, monitoring, and performance tuning capabilities.&lt;br&gt;
This blog will attempt to briefly explain what PEM is and the core features it offers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Postgres Enterprise Manager (PEM)
&lt;/h2&gt;

&lt;p&gt;When you install PostgreSQL, you are given the option of also installing pgAdmin. If you've ever used pgAdmin before, you may already be familiar with PEM, which can be described as a more advanced version of pgAdmin 4.&lt;br&gt;
PEM provides database administrators with a browser-based console platform for monitoring, performance tuning, backup management, and security enhancement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of Postgres Enterprise Manager
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance Monitoring and Tuning&lt;/strong&gt;: PEM provides key insights into database performance, identification of bottlenecks, and query optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert Management&lt;/strong&gt;: PEM empowers database administrators to proactively tackle issues by allowing them to create customized alerts and notifications for critical events and potential issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Diagnostics&lt;/strong&gt;: PEM's diagnostic tools help administrators troubleshoot by highlighting performance problems. This allows for speedy root cause analysis and easy identification of effective solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup and Recovery Management&lt;/strong&gt;: PEM simplifies backup and recovery, safeguards data integrity, and minimizes downtime through features such as Barman, Bart, and Failover Manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Management and Security&lt;/strong&gt;: PEM provides extensive capabilities to efficiently manage user access and fine-tune security controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Analysis&lt;/strong&gt;: PEM can help administrators optimize database performance by analyzing complex SQL queries and pinpointing where improvements could be made. This is a crucial feature for improving query response times and increasing efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity Planning&lt;/strong&gt;: PEM can provide forecasts and predictions to aid in resource planning. This can help administrators anticipate growth and allocate resources efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.enterprisedb.com/docs/pem/latest/"&gt;PEM official documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtube.com/playlist?list=PLownlFUq_rL6rhJZTQJY6f_jGDZRV51Ot"&gt;Official PEM Demo Videos&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>postgressql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploring PostgreSQL Extensions: Enhance Your Database Capabilities</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Sat, 19 Aug 2023 13:15:20 +0000</pubDate>
      <link>https://dev.to/nilelazarus/exploring-postgresql-extensions-enhance-your-database-capabilities-576o</link>
      <guid>https://dev.to/nilelazarus/exploring-postgresql-extensions-enhance-your-database-capabilities-576o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today’s world, organizations grapple with increasingly complex data challenges, and the need for a robust yet flexible database system has never been more apparent. This is where PostgreSQL’s exceptional feature set comes into play, with one particular aspect shining brightly: extensions.&lt;/p&gt;

&lt;p&gt;Being able to tailor database solutions to suit your specific needs is an invaluable asset for database administration. PostgreSQL seems to understand this need and allows users to seamlessly enhance PostgreSQL’s capabilities without altering its core codebase.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll delve into the purpose and potential of these extensions, exploring how they transform a standard PostgreSQL installation into a powerhouse of specialized functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding PostgreSQL Extensions
&lt;/h2&gt;

&lt;p&gt;Think of extensions as specialized tools that can be plugged into PostgreSQL to give you exactly those functionalities which you require without needing to touch the codebase.&lt;/p&gt;

&lt;p&gt;This allows you to get the capabilities you need without making any risky, complex changes to PostgreSQL’s architecture. Adding extensions provides benefits like retaining a maintainable architecture, customization without complexity, flexibility, adaptability, and community-driven innovation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commonly Used PostgreSQL Extensions and Their Functions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pg_trgm&lt;/strong&gt;: A text similarity measurement extension that facilitates efficient text search and comparison, crucial for applications involving natural language processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hstore&lt;/strong&gt;: Key-value storage model, ideal for managing semi-structured or schema-less data within PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostGIS&lt;/strong&gt;: Equips PostgreSQL with geospatial objects and allows for advanced location-based queries, used for spatial data management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uuid-ossp&lt;/strong&gt;: Generates universally unique identifiers (UUIDs), essential for ensuring data integrity and uniqueness across distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pg_stat_statements&lt;/strong&gt;: Provides insights into query optimization, enabling better database performance by tracking and analyzing SQL query performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;citext&lt;/strong&gt;: Useful in scenarios requiring case-insensitive text searching and matching.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Extensions for Specialized Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TimescaleDB&lt;/strong&gt;: Tailored for time-series data, it optimizes data storage and retrieval for temporal datasets. This extension is a must-have for IoT applications and financial analyses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgBouncer&lt;/strong&gt;: Addresses connection pooling, and efficiently manages database connections. Useful for enhancing resource utilization and scalability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PL/Python&lt;/strong&gt; and &lt;strong&gt;PL/pgSQL&lt;/strong&gt;: Enable the incorporation of Python and SQL procedural languages, respectively. Ideal for creating custom functions and stored procedures that best fit with your application's logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pgAudit&lt;/strong&gt;: Provides detailed database auditing and monitoring capabilities, essential for tracking data access and changes.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
      <category>database</category>
      <category>extensions</category>
    </item>
    <item>
      <title>Unraveling pgAdmin: A Comprehensive Guide to PostgreSQL Management</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Sun, 23 Jul 2023 18:06:49 +0000</pubDate>
      <link>https://dev.to/nilelazarus/unraveling-pgadmin-a-comprehensive-guide-to-postgresql-management-482o</link>
      <guid>https://dev.to/nilelazarus/unraveling-pgadmin-a-comprehensive-guide-to-postgresql-management-482o</guid>
      <description>&lt;p&gt;In the world of database management systems (RDBMS), PostgreSQL has become a leading choice for both businesses and developers alike due to its robustness, scalability, and adaptability due its open-source nature. To effectively manage PostgreSQL databases, an essential tool comes into play - pgAdmin.&lt;br&gt;
In this blog, we'll delve into the features of pgAdmin to better understand its significance in database management.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is pgAdmin?
&lt;/h2&gt;

&lt;p&gt;pgAdmin is an open-source, web-based tool designed for the administration and management of PostgreSQL databases. Users can use its intuitive graphical interface (GUI) to interact with their PostgreSQL databases.&lt;br&gt;
pgAdmin is available for Windows, macOS, and Linux, making it a versatile cross-platform choice for database administrators and developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;User-Friendly Interface: pgAdmin offers a well-organised, intuitive, and extremely easy-to-use graphical user interface that makes it highly accessible to beginners and seasoned database administrators alike.&lt;/li&gt;
&lt;li&gt;SQL Query Editor: the SQL query editor in pgAdmin allows users to query their databases directly. It supports syntax highlighting, code completion, error checking, and enhances the overall development experience.&lt;/li&gt;
&lt;li&gt;Database Object Management: pgAdmin provides users with a hassle-free way to manage database objects like tables, views, indexes, triggers, and functions. The tool is equipped with a variety of options to create, modify, and delete these objects.&lt;/li&gt;
&lt;li&gt;Server Dashboard: the server dashboard feature provides users with all the essential information needed to monitor their PostgreSQL server. This includes details regarding active connections, resource utilisation, and overall performance.&lt;/li&gt;
&lt;li&gt;Backup and Restore: database administrators can easily create and manage database backups as well as restore data from said backups.&lt;/li&gt;
&lt;li&gt;Data Visualization and Reporting: users can visualise data through charts, graphs, and pivot tables. They can also set up triggers to generate custom reports based on specific query results.&lt;/li&gt;
&lt;li&gt;Security and User Management: pgAdmin provides user and group management capabilities which help ensure secure and controlled user access to databases.&lt;/li&gt;
&lt;li&gt;Foreign Data Wrappers (FDW) Support: I've discussed Foreign Data Wrappers in detail in &lt;a href="https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-4-1da9"&gt;part 4&lt;/a&gt; of my Internals of PostgreSQL series. In short, Foreign Data Wrappers enable users to connect to external data sources as foreign tables and manage them alongside their native PostgreSQL data. pgAdmin supports FDWs and allows users to manage them directly from the application.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;pgAdmin is a powerful and versatile tool for managing PostgreSQL databases. Its user-friendly, easy-to-navigate graphical interface gives beginners and database experts alike the essential capabilities they need to manage PostgreSQL databases efficiently. Like PostgreSQL itself, pgAdmin is open-source, has a strong community, and receives regular updates. All these factors make it a great choice for developers and database administrators seeking a flexible and convenient way to interact with their databases.&lt;/p&gt;

</description>
      <category>pgadmin</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>EDB BigAnimal: Harnessing the Power of Cloud PostgreSQL</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Mon, 17 Jul 2023 09:36:51 +0000</pubDate>
      <link>https://dev.to/nilelazarus/edb-biganimal-harnessing-the-power-of-cloud-postgresql-23fo</link>
      <guid>https://dev.to/nilelazarus/edb-biganimal-harnessing-the-power-of-cloud-postgresql-23fo</guid>
<description>&lt;p&gt;In today's data-driven world, businesses face immense pressure to handle and process vast amounts of data efficiently. This has led to the emergence of advanced database solutions that are both highly scalable and deliver optimal performance. EDB BigAnimal is one such solution for managing large-scale PostgreSQL databases in the cloud.&lt;/p&gt;

&lt;p&gt;In this blog, we will explore the features, benefits, and capabilities of EDB BigAnimal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is EDB BigAnimal?
&lt;/h2&gt;

&lt;p&gt;BigAnimal is a cloud-based, fully managed PostgreSQL database solution. It was developed by EnterpriseDB (EDB), a leading global provider of Postgres software and services.&lt;br&gt;
BigAnimal helps users migrate from Oracle, offering deep compatibility along with a plethora of useful tools and features aimed at helping users take full advantage of PostgreSQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features and Functionality
&lt;/h3&gt;

&lt;p&gt;Let's take a closer look at the features and functionalities EDB BigAnimal has to offer.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: BigAnimal is designed to handle massive workloads and scale horizontally to tackle ever-increasing data demands. Organisations using BigAnimal can easily add or remove nodes, keeping their data infrastructure flexible and adaptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Availability&lt;/strong&gt;: Ensuring high availability is crucial for any business. Users need to be able to access their data at any time and from anywhere. BigAnimal offers high availability configurations such as synchronous and asynchronous replication, automatic failover, and data redundancy to minimise downtime and prevent data loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disaster Recovery&lt;/strong&gt;: Protecting data and ensuring disaster recovery is also a crucial requirement in database management. BigAnimal provides robust disaster recovery solutions and allows organisations to enable point-in-time recovery and to minimise damage from data loss caused by unforeseen disasters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo-distributed Data&lt;/strong&gt;: BigAnimal allows users to implement an active-active architecture and deploy clusters across multiple regions and/or availability zones with multi-write access enabled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Monitoring and Tuning&lt;/strong&gt;: BigAnimal offers advanced performance monitoring and tuning capabilities, with up to 97% of available parameters user-tunable. This provides administrators with valuable insights into database performance and helps optimise resource utilisation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Platform Support&lt;/strong&gt;: BigAnimal is designed to support various operating systems and to ensure compatibility with cloud providers and on-premises deployments alike. You can deploy on any cloud, either in your own account or in EDB BigAnimal's.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Command Line Interface (CLI)
&lt;/h2&gt;

&lt;p&gt;BigAnimal offers a user-friendly CLI for management and configuration activities. The CLI can be installed on Linux, macOS, and Windows and enables system administrators and developers to script and automate BigAnimal operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Oracle Compatibility
&lt;/h2&gt;

&lt;p&gt;We have already discussed how BigAnimal is built to provide a smooth migration from Oracle Database to PostgreSQL. But that's not all.&lt;br&gt;
BigAnimal offers compatibility with Oracle SQL and PL/SQL and also allows procedures written in PL/SQL to be converted to PL/pgSQL (PostgreSQL's procedural language).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;EDB BigAnimal is a robust and scalable PostgreSQL database management solution that empowers enterprises to meet the challenges of large-scale data workloads head-on. It offers high availability, disaster recovery, and performance optimisation features, making it a powerful and reliable option for organisations seeking to utilise the full potential of PostgreSQL in the cloud.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>The Importance of String Distances: Levenshtein, Jaro, Naive Recursive</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Sat, 15 Jul 2023 15:13:41 +0000</pubDate>
      <link>https://dev.to/nilelazarus/the-importance-of-string-distances-levenshtein-jaro-naive-recursive-2jno</link>
      <guid>https://dev.to/nilelazarus/the-importance-of-string-distances-levenshtein-jaro-naive-recursive-2jno</guid>
<description>&lt;p&gt;Being able to calculate the similarity or dissimilarity between strings accurately and cost-efficiently is a crucial task in various domains like natural language processing, data mining, and database management.&lt;br&gt;
In the context of databases, string distance algorithms allow tasks like spell checking, record linkage, data deduplication, and information retrieval to be carried out effectively.&lt;br&gt;
In this blog, we will be focusing on three popular string distance algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Levenshtein distance&lt;/li&gt;
&lt;li&gt;Jaro similarity&lt;/li&gt;
&lt;li&gt;Naive Recursive (Edit Distance)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will be discussing how each algorithm works, their significance in databases, and what the advantages and disadvantages are for each.&lt;br&gt;
Additionally, we will also cover the UTL_MATCH package available in Oracle Database for string comparison.&lt;/p&gt;
&lt;h2&gt;
  
  
  Levenshtein Distance
&lt;/h2&gt;

&lt;p&gt;The Levenshtein distance is named after the Soviet mathematician Vladimir Levenshtein, who introduced it in 1965.&lt;br&gt;
It follows a simple concept of calculating the minimum (least) number of single character edits required to transform one string into another. These edits include insertions, deletions and substitutions.&lt;/p&gt;

&lt;p&gt;Let's try to understand it better with an example:&lt;br&gt;
Suppose two words/strings: "kangaroo" and "potato"&lt;/p&gt;

&lt;p&gt;This would be the initial matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|   |   | k | a | n | g | a | r | o | o |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| p | 1 |   |   |   |   |   |   |   |   |
| o | 2 |   |   |   |   |   |   |   |   |
| t | 3 |   |   |   |   |   |   |   |   |
| a | 4 |   |   |   |   |   |   |   |   |
| t | 5 |   |   |   |   |   |   |   |   |
| o | 6 |   |   |   |   |   |   |   |   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's fill the matrix. Compare each character of the two strings. If a character matches, the value from the diagonal cell is copied. If they do not match, the minimum value out of those present in the left, diagonal and upper cells is selected and incremented by one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|   |   | k | a | n | g | a | r | o | o |
|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| p | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| o | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| t | 3 | 3 | 2 | 2 | 3 | 4 | 5 | 6 | 7 |
| a | 4 | 4 | 3 | 3 | 3 | 4 | 5 | 6 | 7 |
| t | 5 | 5 | 4 | 4 | 4 | 4 | 5 | 6 | 7 |
| o | 6 | 6 | 5 | 5 | 5 | 5 | 5 | 6 | 7 |

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final value for the Levenshtein distance is the value calculated for the bottom-right cell of the matrix, which in our case is 6. Note how the matching characters ('a' with 'a', 'o' with 'o') pull the distance down: when the characters match, the diagonal value is copied without adding one.&lt;/p&gt;
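The matrix-filling procedure described above can be sketched in Python (a minimal illustration, not tuned for performance):

```python
def levenshtein(s1, s2):
    # dp[i][j] holds the distance between s1[:j] and s2[:i],
    # mirroring the matrix above (s2 down the side, s1 across the top).
    rows, cols = len(s2) + 1, len(s1) + 1
    dp = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        dp[0][j] = j          # first row: 0..len(s1)
    for i in range(rows):
        dp[i][0] = i          # first column: 0..len(s2)
    for i in range(1, rows):
        for j in range(1, cols):
            if s1[j - 1] == s2[i - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # match: copy diagonal
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # upper cell
                                   dp[i][j - 1],      # left cell
                                   dp[i - 1][j - 1])  # diagonal cell
    return dp[-1][-1]

print(levenshtein("kangaroo", "potato"))  # 6, the bottom-right cell
```

Each cell is computed from three already-filled neighbours, which is why the matrix can be filled row by row.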

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides a precise value for string dissimilarity considering edits required.&lt;/li&gt;
&lt;li&gt;Suitable for fuzzy matching, spell checking, and data deduplication.&lt;/li&gt;
&lt;li&gt;Can handle strings of differing lengths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computationally expensive. Time complexity is &lt;em&gt;O(m*n)&lt;/em&gt; where &lt;em&gt;m&lt;/em&gt; and &lt;em&gt;n&lt;/em&gt; are the lengths of the two strings.&lt;/li&gt;
&lt;li&gt;May perform poorly with lengthier strings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Jaro Similarity
&lt;/h2&gt;

&lt;p&gt;The Jaro similarity algorithm was introduced in 1989 by Matthew A. Jaro. The algorithm calculates the similarity between two strings by comparing their characters and the order in which they appear. The algorithm provides a resultant value between 0 and 1, where 0 indicates no similarity at all and 1 indicates a perfect match.&lt;/p&gt;

&lt;p&gt;Let's use an example to try to understand better.&lt;br&gt;
Suppose two strings: "marble" and "table"&lt;/p&gt;

&lt;p&gt;Jaro similarity is calculated using the following formula:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C0S5dWCx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbtolhsre46ef2e1ze6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C0S5dWCx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbtolhsre46ef2e1ze6x.png" alt="JaroWinklerDist" width="407" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;m is the number of matching characters&lt;/li&gt;
&lt;li&gt;t is half the number of transpositions&lt;/li&gt;
&lt;li&gt;|s1| and |s2| are the lengths of string 1 and string 2 respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our example ("marble" &amp;amp; "table"):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 matching characters. &lt;strong&gt;m = 4&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No transpositions. &lt;strong&gt;t = 0&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Jaro Similarity = (4 / 6 + 4 / 5 + (4 - 0) / 4) / 3 = &lt;strong&gt;0.822&lt;/strong&gt;&lt;/p&gt;
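As an illustrative sketch, the calculation can be written in Python. This assumes the conventional matching window of floor(max(|s1|, |s2|) / 2) - 1, which the formula above takes for granted:

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    # Characters count as matching only within this sliding window.
    window = max(len(s1), len(s2)) // 2 - 1
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        stop = min(i + window + 1, len(s2))
        for j in range(start, stop):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear out of order.
    matched1 = [s1[i] for i in range(len(s1)) if used1[i]]
    matched2 = [s2[j] for j in range(len(s2)) if used2[j]]
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("marble", "table"), 3))  # 0.822
```

For "marble" and "table" the window is 2, the matches are a, b, l, e (m = 4), and none are transposed (t = 0), reproducing the hand calculation above.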
&lt;h3&gt;
  
  
  Jaro-Winkler Similarity
&lt;/h3&gt;

&lt;p&gt;Jaro-Winkler similarity is a slight modification of the Jaro similarity. It gives additional weight to a common prefix (a substring shared at the beginning of both strings).&lt;br&gt;
It is useful for calculating similarity between shorter strings or when you want to favour strings with matching prefixes.&lt;/p&gt;

&lt;p&gt;Here are the steps to calculate the Jaro-Winkler similarity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate the Jaro similarity as done earlier. We will continue with the value from our example (0.822).&lt;/li&gt;
&lt;li&gt;Calculate the &lt;em&gt;prefix boost&lt;/em&gt;, using a scaling factor of p = 0.1 and the common prefix length (conventionally capped at 4). The formula for this is:
&lt;em&gt;boost = 0.1 * common prefix length * (1 - Jaro similarity)&lt;/em&gt;
"marble" and "table" share no prefix, so for our example this would be:
boost = 0.1 * 0 * (1 - 0.822) = 0&lt;/li&gt;
&lt;li&gt;Add the boost to the Jaro similarity:
&lt;em&gt;Jaro-Winkler similarity = Jaro similarity + boost&lt;/em&gt;
We would get:
Jaro-Winkler similarity = 0.822 + 0 = &lt;strong&gt;0.822&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
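The steps above can be sketched in Python. The function takes the Jaro similarity computed earlier as an input, and the cap of 4 on the prefix length is the conventional Winkler choice rather than something stated in this article:

```python
def jaro_winkler(s1, s2, jaro_sim):
    # jaro_sim: the Jaro similarity of s1 and s2, computed separately.
    # Count the common prefix, capped at 4 characters by convention.
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix != 4:
            prefix += 1
        else:
            break
    boost = 0.1 * prefix * (1 - jaro_sim)  # p = 0.1 scaling factor
    return jaro_sim + boost

# "marble" and "table" share no prefix, so the boost is zero:
print(jaro_winkler("marble", "table", 0.822))  # 0.822
```

With a shared prefix the boost kicks in: for "martha" and "marhta" (Jaro 0.9444, prefix "mar"), the result rises to about 0.961.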

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Considers the order of characters when calculating string similarity.&lt;/li&gt;
&lt;li&gt;Used in tasks such as record linkage, string matching, and deduplication.&lt;/li&gt;
&lt;li&gt;Can handle strings of different lengths as well as strings having transpositions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less precise than the Levenshtein distance, since individual edits are not counted in the calculation.&lt;/li&gt;
&lt;li&gt;May not perform well in cases where common prefixes are not that significant.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Naive Recursive (Edit Distance)
&lt;/h2&gt;

&lt;p&gt;The Naive Recursive algorithm, also known as the Edit Distance algorithm, is similar to the Levenshtein distance algorithm in that it also counts the minimum number of edits needed to transform one string into the other. However, unlike Levenshtein, this algorithm uses a recursive approach: it breaks the problem down into smaller subproblems and solves them recursively.&lt;/p&gt;

&lt;p&gt;Let's once again use an example to understand this better using the strings "book" and "back".&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If either string is null/empty, the distance is set to the length of the other string.&lt;/li&gt;
&lt;li&gt;If the last characters of the strings match (common postfix), the distance for the remaining substrings is calculated.&lt;/li&gt;
&lt;li&gt;If the conditions above are false, the distance is calculated using the following three recursive calls:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Insertion: EditDistance(s1, s2[:-1]) + 1&lt;/li&gt;
&lt;li&gt;Deletion: EditDistance(s1[:-1], s2) + 1&lt;/li&gt;
&lt;li&gt;Substitution: EditDistance(s1[:-1], s2[:-1]) + 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our example, the Edit Distance would be 2.&lt;/p&gt;
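The three recursive calls translate almost directly into Python (a minimal sketch; without memoisation the runtime grows exponentially, as noted below):

```python
def edit_distance(s1, s2):
    if not s1:                       # base case: one string is empty,
        return len(s2)               # so the distance is the other's length
    if not s2:
        return len(s1)
    if s1[-1] == s2[-1]:             # common postfix: recurse on the rest
        return edit_distance(s1[:-1], s2[:-1])
    return 1 + min(
        edit_distance(s1, s2[:-1]),         # insertion
        edit_distance(s1[:-1], s2),         # deletion
        edit_distance(s1[:-1], s2[:-1]),    # substitution
    )

print(edit_distance("book", "back"))  # 2
```

For "book" and "back", the matching 'b' and 'k' fall through the common-postfix branch, and the two middle characters each cost one substitution, giving 2.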

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple, easy to understand and implement.&lt;/li&gt;
&lt;li&gt;Calculates an accurate distance between strings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exponential time complexity, hence only suitable for short strings.&lt;/li&gt;
&lt;li&gt;Not suitable for real-world use with large datasets and real-time applications due to time complexity.&lt;/li&gt;
&lt;li&gt;Computationally costly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Oracle Database
&lt;/h2&gt;

&lt;p&gt;Oracle Database offers all of the above capabilities in its UTL_MATCH package, which provides a variety of string comparison functions including the Levenshtein distance and Jaro-Winkler similarity algorithms discussed above.&lt;/p&gt;

&lt;p&gt;Here is how the Levenshtein Distance is calculated using UTL_MATCH.EDIT_DISTANCE function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
  distance NUMBER;
BEGIN
  distance := UTL_MATCH.EDIT_DISTANCE('kangaroo', 'potato');
  DBMS_OUTPUT.PUT_LINE('Levenshtein Distance: ' || distance);
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output will be: 'Levenshtein Distance: 6'&lt;/p&gt;

&lt;p&gt;We can also use the UTL_MATCH.JARO_WINKLER_SIMILARITY function to calculate the Jaro-Winkler similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE
  similarity NUMBER;
BEGIN
  similarity := UTL_MATCH.JARO_WINKLER_SIMILARITY('marble', 'table');
  DBMS_OUTPUT.PUT_LINE('Jaro-Winkler Similarity: ' || similarity);
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output here will be: 'Jaro-Winkler Similarity: 82'. Note that UTL_MATCH.JARO_WINKLER_SIMILARITY returns the similarity as an integer percentage between 0 and 100; the related UTL_MATCH.JARO_WINKLER function returns the raw value between 0 and 1 (0.822 in this case).&lt;/p&gt;

&lt;p&gt;The UTL_MATCH package in Oracle Database offers optimised functions for string comparison, further enhancing the capabilities of databases in handling string-related operations.&lt;br&gt;
Understanding and utilising these algorithms and tools can greatly enhance the efficiency and accuracy of database systems.&lt;/p&gt;

</description>
      <category>database</category>
      <category>oracle</category>
    </item>
    <item>
      <title>Demystifying the Internals of PostgreSQL - Chapter 6</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Wed, 28 Jun 2023 23:06:20 +0000</pubDate>
      <link>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-6-3i87</link>
      <guid>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-6-3i87</guid>
      <description>&lt;p&gt;Welcome back to the sixth instalment in our journey towards understanding The Internals of PostgreSQL.&lt;/p&gt;

&lt;p&gt;In the last blog, we covered chapter 5 which delves into how PostgreSQL handles Concurrency Control.&lt;/p&gt;

&lt;p&gt;In this blog, we will be exploring chapter 6 which covers Vacuum Processing. So without further ado, let's begin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Vacuum Processing is a maintenance process in PostgreSQL which we briefly touched upon at the end of the last blog. The main responsibilities that Vacuum Processing handles are removing dead tuples and freezing transaction IDs.&lt;/p&gt;

&lt;p&gt;There are two types of methods used in vacuum processing of dead tuples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Concurrent VACUUM: dead tuples are simply removed and other transactions can still read the table during this process.&lt;/li&gt;
&lt;li&gt;Full VACUUM: dead tuples are removed and live tuples are defragmented as well. Other transactions cannot read the table while this process is underway.&lt;/li&gt;
&lt;/ol&gt;
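In SQL terms, these two modes correspond to the standard VACUUM and VACUUM FULL commands (the table name below is a hypothetical placeholder):

```sql
-- Concurrent VACUUM: reclaims dead tuples in place; other transactions
-- can continue to read the table while it runs.
VACUUM my_table;

-- Full VACUUM: rewrites the table into a new file, reclaiming all unused
-- space, but locks the table so other transactions cannot read it meanwhile.
VACUUM FULL my_table;
```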

&lt;p&gt;Up until version 8.0 of PostgreSQL, vacuum processing had to be run manually; it was only automated (via the autovacuum daemon) in 2005.&lt;br&gt;
Additionally, vacuuming is a costly process, which is why the Visibility Map (VM) feature was introduced in version 8.4 to increase efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outline of Concurrent VACUUM
&lt;/h2&gt;

&lt;p&gt;Vacuum Processing handles 3 tasks for all or some tables in the database:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Removing dead tuples: removes dead tuples, defragments live tuples, and removes index tuples that point to dead tuples.&lt;/li&gt;
&lt;li&gt;Freezing old txids: freezes old txids, updates the frozen txids in the system catalogs, and removes unnecessary parts of the clog if possible.&lt;/li&gt;
&lt;li&gt;Others: updates the FSM and VM of processed tables, and updates statistics such as pg_stat_all_tables.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For each table in the database, PostgreSQL performs the processing in three phases, referred to as the first block, second block, and third block. After this, statistics and system catalogs are updated, and unnecessary files and pages of the clog are removed if possible.&lt;br&gt;
Let's now delve into what exactly the first block, second block, and third block do.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Block
&lt;/h3&gt;

&lt;p&gt;First block is responsible for freeze processing and removing index tuples which point to dead tuples.&lt;br&gt;
PostgreSQL will first create a list of dead tuples, freeze old tuples, and store the list in maintenance_work_mem. After this, the list is used to remove the index tuples of the dead tuples.&lt;br&gt;
Once this is done, PostgreSQL moves onto Second Block.&lt;/p&gt;

&lt;h3&gt;
  
  
  Second Block
&lt;/h3&gt;

&lt;p&gt;Second block removes dead tuples and updates the FSM and VM.&lt;br&gt;
Dead tuples are removed and the remaining live tuples are reordered (defragmentation) after which the FSM and VM of the page this was performed on are updated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Third Block
&lt;/h3&gt;

&lt;p&gt;Third block performs &lt;em&gt;cleanup&lt;/em&gt; for the deleted indexes and updates the statistics and system catalogs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visibility Map
&lt;/h2&gt;

&lt;p&gt;As mentioned before, vacuum processing is costly, hence PostgreSQL introduced the Visibility Map in version 8.4 to improve efficiency.&lt;br&gt;
The VM records which pages contain dead tuples, which allows the vacuum process to skip pages that have none.&lt;br&gt;
The efficiency provided by the VM was further enhanced in version 9.6, when it was extended to also track which pages contain only frozen tuples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Freeze Processing
&lt;/h2&gt;

&lt;p&gt;Freeze processing has two modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Lazy Mode: only pages which contain dead tuples are scanned using the VM of each respective table. Tuples whose t_xmin is less than the freezeLimit txid are frozen. The following formula is used to calculate freezeLimit txid:&lt;br&gt;
&lt;em&gt;freezeLimit_txid = (OldestXmin − vacuum_freeze_min_age)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eager Mode: every page is scanned regardless of whether it does or does not contain dead tuples. System catalogs are also updated and unnecessary parts of the clog are removed if possible. The eager mode is performed when the following condition is satisfied:&lt;br&gt;
&lt;em&gt;pg_database.datfrozenxid &amp;lt; (OldestXmin − vacuum_freeze_table_age)&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Full VACUUM
&lt;/h2&gt;

&lt;p&gt;Although Concurrent VACUUM seems thorough at a glance, it falls short in some areas: notably, it does not shrink the table file even after many dead tuples have been removed.&lt;br&gt;
This negatively impacts both the efficiency of disk space usage and the overall performance of the database.&lt;br&gt;
To tackle this issue, PostgreSQL provides the Full VACUUM mode which takes the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates new table file&lt;/li&gt;
&lt;li&gt;Copies live tuples to the new table file&lt;/li&gt;
&lt;li&gt;Deletes the old file, rebuilds indexes and updates statistics, FSM, and VM&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>opensource</category>
      <category>sql</category>
    </item>
    <item>
      <title>Demystifying the Internals of PostgreSQL - Chapter 5</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Wed, 28 Jun 2023 01:07:36 +0000</pubDate>
      <link>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-5-2pe4</link>
      <guid>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-5-2pe4</guid>
      <description>&lt;p&gt;Welcome back to another step in our journey towards understanding the Internals of PostgreSQL.&lt;/p&gt;

&lt;p&gt;In the last blog, we covered chapter 4 which explains Foreign Data Wrappers and Parallel Query.&lt;/p&gt;

&lt;p&gt;Now we're going to be moving on to chapter 5 which explains how PostgreSQL manages Concurrency Control. Let's jump right in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When multiple transactions are running simultaneously in a database, concurrency control is needed to maintain &lt;em&gt;atomicity&lt;/em&gt; and &lt;em&gt;isolation&lt;/em&gt; which are two crucial ACID properties.&lt;/p&gt;

&lt;p&gt;There are three concurrency control techniques commonly used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multi-version Concurrency Control (MVCC)&lt;/li&gt;
&lt;li&gt;Strict Two-Phase Locking (S2PL)&lt;/li&gt;
&lt;li&gt;Optimistic Concurrency Control (OCC)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each technique has its own variations. PostgreSQL uses a variation of MVCC called Snapshot Isolation (SI). In MVCC, when a write is performed, a new version of the data item is created while the old version is also retained. When a transaction then attempts to read a data item, the system selects one of the versions to ensure isolation. In this way, read operations don't block write operations and vice versa.&lt;/p&gt;

&lt;p&gt;In the SI implementations used by other RDBMSs, old versions of the data items being written to are stored in rollback segments. In PostgreSQL's variation, the new data item is inserted directly into the target table. When reading an item, PostgreSQL selects the appropriate version of the item through &lt;em&gt;visibility check rules&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;SI does not allow &lt;em&gt;Dirty Reads, Non-Repeatable Reads, and Phantom Reads&lt;/em&gt;. However, SI allows serialisation anomalies, which makes it unable to achieve true serializability. To handle this issue, PostgreSQL uses Serializable Snapshot Isolation (SSI), added as of version 9.1. This enables PostgreSQL to offer a true SERIALIZABLE isolation level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction ID
&lt;/h2&gt;

&lt;p&gt;When a transaction begins, it is assigned a unique identifier called a transaction id (txid) by the transaction manager. In PostgreSQL, the txid is a 32-bit unsigned integer. To view the current transaction's txid, execute the &lt;em&gt;txid_current()&lt;/em&gt; function. PostgreSQL will then return either the current txid or one of the three following txids reserved by it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; which means Invalid txid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt; which means Bootstrap txid (used in the initialization of database cluster)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; which means Frozen txid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;txids can be compared with one another; however, a concept of &lt;em&gt;past&lt;/em&gt; and &lt;em&gt;future&lt;/em&gt; must be kept in mind. If your current txid is 100, you can only view txids less than that, since they are considered &lt;em&gt;past&lt;/em&gt;. Transaction ids greater than that are considered future and hence invisible.&lt;/p&gt;
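For example, in a psql session (txid_current() is the function the book uses; from PostgreSQL 13 onward the equivalent function is pg_current_xact_id()):

```sql
BEGIN;
-- Assigns a txid to this transaction if it does not have one yet,
-- and returns it.
SELECT txid_current();
COMMIT;
```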

&lt;h2&gt;
  
  
  Tuple Structure
&lt;/h2&gt;

&lt;p&gt;Heap tuples in table pages have three parts: HeapTupleHeaderData structure, NULL bitmap, and user data.&lt;/p&gt;

&lt;p&gt;The HeapTupleHeaderData structure has seven fields, four of which are described in this chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;t_xmin&lt;/strong&gt; contains the txid of the transaction which inserted this tuple.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t_xmax&lt;/strong&gt; contains the txid of the transaction which deleted or updated this tuple, while 0 means this tuple has not been deleted or updated yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t_cid&lt;/strong&gt; holds the current command id (cid)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;t_ctid&lt;/strong&gt; holds the tuple id (tid) that points to itself or to a new tuple id if tuple has been updated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Inserting, Deleting and Updating Tuples
&lt;/h2&gt;

&lt;p&gt;This section of the chapter contains in-depth examples of what happens when inserting, deleting or updating a tuple and of how a Free Space Map (FSM) is used by PostgreSQL.&lt;/p&gt;

&lt;p&gt;I highly recommend reading this section from the book to better understand through the examples and diagrams provided.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit Log (clog)
&lt;/h2&gt;

&lt;p&gt;The statuses of transactions performed are stored in the Commit Log (clog) in PostgreSQL.&lt;/p&gt;

&lt;p&gt;PostgreSQL has 4 defined transaction states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IN_PROGRESS&lt;/strong&gt;: transaction is in progress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMMITTED&lt;/strong&gt;: transaction has been committed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ABORTED&lt;/strong&gt;: transaction has been aborted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SUB_COMMITTED&lt;/strong&gt;: denotes sub-transactions (not elaborated upon in the book)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The clog takes up one or more 8 KB pages in shared memory and logically forms an array. The indices of this array correspond to transaction ids, and the status of each transaction is stored at its respective index.&lt;br&gt;
When the current txid goes beyond the capacity of the clog, a new page is appended.&lt;/p&gt;

&lt;p&gt;If PostgreSQL shuts down or the checkpoint process runs, the contents of the clog are copied into files stored in the &lt;em&gt;pg_xact&lt;/em&gt; subdirectory. The files are named 0000, 0001, and so on, with a maximum file size of 256 KB. When PostgreSQL starts up, the data from the files stored in pg_xact is used to initialise the clog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction Snapshot
&lt;/h2&gt;

&lt;p&gt;A transaction snapshot stores data about whether transactions are or are not active at a certain point in time.&lt;br&gt;
In PostgreSQL, this is textually represented in a form such as '100 : 100 :' which denotes that txids less than 100 are not active and txids greater than or equal to 100 are active.&lt;/p&gt;

&lt;p&gt;Transaction Snapshots are provided by the Transaction Manager. These snapshots are then used for visibility checks by PostgreSQL as mentioned above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visibility Check Rules
&lt;/h2&gt;

&lt;p&gt;Visibility check rules determine whether a tuple is invisible (future) or visible (past) by using the data stored in t_xmin, t_xmax, the clog, and the transaction snapshot. This chapter only goes into the minimal rules used for visibility checks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule 1&lt;/strong&gt;: if t_xmin status is ABORTED, tuple is invisible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rule 2&lt;/strong&gt;: if t_xmin status is IN_PROGRESS, its value is equal to the current txid, and t_xmax is &lt;strong&gt;INVALID&lt;/strong&gt;, then the tuple is visible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;If Status(t_xmin) = IN_PROGRESS ∧ t_xmin = current_txid ∧ t_xmax = INVALID ⇒ Visible&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule 3&lt;/strong&gt;: if t_xmin status is IN_PROGRESS and its value is equal to current txid (given that t_xmax is &lt;strong&gt;not INVALID&lt;/strong&gt;), tuple is invisible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;If Status(t_xmin) = IN_PROGRESS ∧ t_xmin = current_txid ∧ t_xmax ≠ INVALID ⇒ Invisible&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rule 4&lt;/strong&gt;: if the tuple was inserted by another transaction (t_xmin is &lt;strong&gt;not&lt;/strong&gt; equal to the current txid) and t_xmin status is IN_PROGRESS, the tuple is invisible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;If Status(t_xmin) = IN_PROGRESS ∧ t_xmin ≠ current_txid ⇒ Invisible&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 5&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ Snapshot(t_xmin) = active ⇒ Invisible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 6&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ (t_xmax = INVALID ∨ Status(t_xmax) = ABORTED) ⇒ Visible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 7&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ Status(t_xmax) = IN_PROGRESS ∧ t_xmax = current_txid ⇒ Invisible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 8&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ Status(t_xmax) = IN_PROGRESS ∧ t_xmax ≠ current_txid ⇒ Visible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 9&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ Status(t_xmax) = COMMITTED ∧ Snapshot(t_xmax) = active ⇒ Visible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule 10&lt;/strong&gt;: &lt;em&gt;If Status(t_xmin) = COMMITTED ∧ Status(t_xmax) = COMMITTED ∧ Snapshot(t_xmax) ≠ active ⇒ Invisible&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
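&lt;p&gt;The ten rules above can be condensed into a short function. The following is a simplified Python sketch, not PostgreSQL source: the &lt;em&gt;status&lt;/em&gt; and &lt;em&gt;active&lt;/em&gt; callables stand in for the clog lookup and the snapshot check respectively.&lt;/p&gt;

```python
# Hedged sketch of the ten visibility rules, not the actual implementation.
# status(txid) returns the clog status; active(txid) returns the snapshot check.

IN_PROGRESS, COMMITTED, ABORTED = "IN_PROGRESS", "COMMITTED", "ABORTED"
INVALID = 0  # t_xmax = INVALID means the tuple has not been deleted

def is_visible(t_xmin, t_xmax, status, active, current_txid):
    if status(t_xmin) == ABORTED:                       # Rule 1
        return False
    if status(t_xmin) == IN_PROGRESS:
        if t_xmin == current_txid:
            return t_xmax == INVALID                    # Rules 2 and 3
        return False                                    # Rule 4
    # status(t_xmin) == COMMITTED from here on
    if active(t_xmin):                                  # Rule 5
        return False
    if t_xmax == INVALID or status(t_xmax) == ABORTED:  # Rule 6
        return True
    if status(t_xmax) == IN_PROGRESS:
        return t_xmax != current_txid                   # Rules 7 and 8
    # status(t_xmax) == COMMITTED: deleter committed, but the deletion only
    # hides the tuple if it is outside the snapshot's active set
    return active(t_xmax)                               # Rules 9 and 10
```

For example, a tuple inserted by a committed transaction and never deleted (t_xmax = INVALID) falls through to Rule 6 and is visible.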

&lt;p&gt;This chapter goes into further detail about how these rules are implemented, and how Lost Updates* are prevented, through scenarios and examples which I highly recommend reading.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;a Lost Update (also called a ww-conflict) is an anomaly that occurs when two transactions attempt to update the same row simultaneously&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Serializable Snapshot Isolation
&lt;/h2&gt;

&lt;p&gt;If a cycle containing conflicts is formed in the precedence graph, a serialisation anomaly occurs.&lt;/p&gt;

&lt;p&gt;There are three types of conflicts: wr-conflicts (Dirty Reads), ww-conflicts (Lost Updates), and rw-conflicts. wr and ww conflicts are already prevented by PostgreSQL, hence SSI implementation in PostgreSQL only handles rw-conflicts using the following strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Record all objects (tuples, pages, relations) accessed by transactions as SIREAD locks.&lt;/li&gt;
&lt;li&gt;Detect rw-conflicts using SIREAD locks whenever any heap or index tuple is written.&lt;/li&gt;
&lt;li&gt;Abort the transaction if a serialisation anomaly is detected by checking detected rw-conflicts.&lt;/li&gt;
&lt;/ol&gt;
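&lt;p&gt;Step 3 boils down to cycle detection in the precedence graph. Here is a bare-bones Python sketch; the graph edges stand in for detected rw-conflicts between txids, which is not how PostgreSQL actually stores them:&lt;/p&gt;

```python
# Hedged sketch: a serialisation anomaly corresponds to a cycle in the
# precedence graph, so checking for one is ordinary depth-first cycle
# detection. graph maps each txid to the txids it has a conflict edge to.

def has_cycle(graph):
    visiting, done = set(), set()

    def visit(n):
        visiting.add(n)
        for m in graph.get(n, ()):
            if m in visiting:
                return True        # back edge: a cycle of conflicts exists
            if m not in done and visit(m):
                return True
        visiting.discard(n)
        done.add(n)
        return False

    return any(n not in done and visit(n) for n in graph)

# Three transactions whose conflicts form a cycle would have to be aborted:
print(has_cycle({1: [2], 2: [3], 3: [1]}))  # True
print(has_cycle({1: [2], 2: [3], 3: []}))   # False
```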

&lt;h2&gt;
  
  
  Required Maintenance Processes
&lt;/h2&gt;

&lt;p&gt;The following maintenance processes are required by PostgreSQL's concurrency control mechanism:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Remove dead tuples and index tuples that point to corresponding dead tuples&lt;/li&gt;
&lt;li&gt;Remove unnecessary parts of the clog&lt;/li&gt;
&lt;li&gt;Freeze old txids*&lt;/li&gt;
&lt;li&gt;Update FSM, VM, and the statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;FREEZE is a process in PostgreSQL whereby a frozen txid is defined in such a way that it is always older than other txids and is hence always inactive and visible&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
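&lt;p&gt;Why freezing is needed can be sketched in a few lines of Python: txids are 32-bit and compared circularly (modulo 2^32), so without freezing, a sufficiently old txid would eventually appear to be in the future. The function below is an illustrative sketch, not the actual implementation:&lt;/p&gt;

```python
# Hedged sketch of circular txid comparison (modulo 2**32), not PostgreSQL
# source. PostgreSQL reserves txid 2 as the frozen txid, which is defined
# to be older than every normal txid and hence always inactive and visible.

FROZEN_TXID = 2

def txid_precedes(a, b):
    """Is txid a older than txid b under circular comparison?"""
    if a == FROZEN_TXID:
        return True               # a frozen txid is older than any normal txid
    diff = (a - b) & 0xFFFFFFFF
    return diff > 0x80000000      # a is older if it sits in the half-circle behind b

print(txid_precedes(100, 200))    # True: 100 is older than 200
# The wraparound hazard: a txid ~3 billion ahead looks *older* than 100,
# which is why old txids must be frozen before the circle wraps.
print(txid_precedes(100, 3000000100))  # False
```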

</description>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Demystifying the Internals of PostgreSQL - Chapter 4</title>
      <dc:creator>Nile Lazarus</dc:creator>
      <pubDate>Fri, 23 Jun 2023 19:40:54 +0000</pubDate>
      <link>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-4-1da9</link>
      <guid>https://dev.to/nilelazarus/demystifying-the-internals-of-postgresql-chapter-4-1da9</guid>
      <description>&lt;p&gt;Welcome back to our journey into &lt;a href="https://www.interdb.jp/pg/index.html"&gt;The Internals of PostgreSQL&lt;/a&gt;.&lt;br&gt;
In the last blog in this series, we covered Chapter 3 'Query Processing'.&lt;br&gt;
Now we're going to cover &lt;a href="https://www.interdb.jp/pg/pgsql04.html"&gt;Chapter 4 'Foreign Data Wrappers and Parallel Query'&lt;/a&gt;. Let's jump right in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Foreign Data Wrappers (FDW)
&lt;/h2&gt;

&lt;p&gt;SQL Management of External Data (SQL/MED) is a part of the SQL standard that was added in 2003. It defines a table on a remote server as a &lt;em&gt;foreign table&lt;/em&gt;. PostgreSQL's Foreign Data Wrappers (FDW) use SQL/MED to manage these foreign tables.&lt;br&gt;
Once you install the required extension and configure your settings appropriately, you can begin accessing foreign tables on remote servers. For example, you can run SELECT queries against foreign tables stored on different servers.&lt;br&gt;
Many different FDW extensions have been developed and are listed in the PostgreSQL wiki, but only the postgres_fdw extension is officially developed and maintained by the PostgreSQL Global Development Group.&lt;/p&gt;

&lt;p&gt;To use the FDW feature, you need to not only install the required extension but also execute setup commands such as CREATE FOREIGN TABLE, CREATE SERVER, and CREATE USER MAPPING.&lt;br&gt;
The workflow of the FDW feature in PostgreSQL is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Analyzer creates a query tree for the given SQL query using the foreign table definitions. These definitions are stored in the pg_catalog.pg_class and pg_catalog.pg_foreign_table catalogs.&lt;/li&gt;
&lt;li&gt;The Planner or Executor then connects to the remote server using the appropriate library. For example, postgres_fdw uses libpq to connect to a remote PostgreSQL server, and mysql_fdw uses libmysqlclient to connect to a MySQL server.&lt;/li&gt;
&lt;li&gt;If the &lt;em&gt;use_remote_estimate&lt;/em&gt; option has been enabled, the Planner executes EXPLAIN commands to estimate the cost of each plan path. If not, embedded constant values are used by default.&lt;/li&gt;
&lt;li&gt;The Planner creates a plain-text SQL statement from the plan tree. This process is called &lt;em&gt;deparsing&lt;/em&gt; in PostgreSQL.&lt;/li&gt;
&lt;li&gt;The Executor sends the plain-text SQL statement created by the Planner to the remote server and receives the result.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This section also details how the postgres_fdw extension performs and how it has evolved over multiple versions of PostgreSQL, starting with version 9.3. I highly recommend reading through it yourself, as it contains many examples that can help you understand how different SQL operations and functions are handled by the FDW.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This chapter also covers Parallel Query; however, that section is currently under construction&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
