<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Namsi Lydia</title>
    <description>The latest articles on DEV Community by Namsi Lydia (@namsi).</description>
    <link>https://dev.to/namsi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1152677%2Ffbab2ed5-7d29-4fae-8d17-57f26aa5a1ae.png</url>
      <title>DEV Community: Namsi Lydia</title>
      <link>https://dev.to/namsi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/namsi"/>
    <language>en</language>
    <item>
      <title>PostgreSQL Backup and Restore Strategies: A Comprehensive Guide</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Sun, 14 Jan 2024 11:23:46 +0000</pubDate>
      <link>https://dev.to/namsi/postgresql-backup-and-restore-strategies-a-comprehensive-guide-27ec</link>
      <guid>https://dev.to/namsi/postgresql-backup-and-restore-strategies-a-comprehensive-guide-27ec</guid>
      <description>&lt;p&gt;PostgreSQL is a powerful and widely used open-source database management system known for its reliability and robustness. To ensure the integrity of your data and protect against potential disasters, it is crucial to have a well-designed backup and restore strategy in place. In this comprehensive guide, we will explore various backup and restore techniques in PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
A solid backup and restore strategy is crucial for any database system, ensuring data recoverability in the event of hardware failures, user errors, or natural disasters. In PostgreSQL, there are multiple approaches to backup and restore, each with its own benefits and considerations. In this guide, we will cover the traditional backup strategies, continuous archiving with point-in-time recovery, and other advanced techniques.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional Backup Strategies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In traditional backup strategies, periodic full backups of the database are taken, along with periodic incremental backups to capture the changes since the last full backup. Full backups provide a complete snapshot of the database, while incremental backups only capture the changes, reducing backup time and storage requirements. However, restoring from incremental backups can be more complex and time-consuming compared to full backups.&lt;/p&gt;
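
&lt;p&gt;As a minimal sketch of a full logical backup and restore (the database name and file name here are illustrative), PostgreSQL's own tools can be used as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Full backup in pg_dump's custom format
pg_dump -Fc -f mydb.dump mydb

# Restore into an (empty) database of the same name
pg_restore -d mydb mydb.dump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;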

&lt;p&gt;&lt;strong&gt;Understanding Write Ahead Log (WAL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the core of PostgreSQL's backup and recovery mechanisms lies the &lt;code&gt;Write Ahead Log (WAL)&lt;/code&gt;. &lt;code&gt;WAL&lt;/code&gt; is a transaction log that records every change made to the database's data files. Its primary purpose is crash-safety, allowing the database to be restored to consistency by replaying the log entries since the last checkpoint. The existence of &lt;code&gt;WAL&lt;/code&gt; also enables more advanced backup strategies, such as continuous archiving for point-in-time recovery.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Continuous Archiving: The Power of Point-in-Time Recovery&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
One of the most powerful features of continuous archiving is point-in-time recovery. With this technique, it is possible to restore the database to its state at any time since the base backup was taken. By continuously archiving the sequence of WAL files, we can achieve a consistent snapshot of the database at any given point. This is particularly valuable for large databases where frequent full backups may not be practical. By feeding the sequence of WAL files to another machine with the same base backup, a warm standby system can be created, allowing for nearly-instantaneous failover.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Up WAL Archiving&lt;/strong&gt;&lt;br&gt;
To enable WAL archiving, specific settings must be configured in the postgresql.conf file. Setting the wal_level configuration parameter to 'replica' or higher and the archive_mode parameter to 'on' are the first steps. Additionally, the archive_command configuration parameter should be specified to define how completed segment files are copied to the archive location. This command can be as simple as a shell command using cp or a more complex custom script. It is crucial to ensure that the archive command returns a zero exit status upon successful archiving and handles any errors or interruptions appropriately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making a Base Backup&lt;/strong&gt;&lt;br&gt;
Creating a base backup is an essential step in the backup and restore process. The pg_basebackup tool provides an easy way to perform this task. It can create a base backup as regular files or as a tar archive. It is important to note that the backup should include all the &lt;code&gt;WAL&lt;/code&gt; segment files generated during and after the file system backup. To aid in tracking the backup and its associated &lt;code&gt;WAL&lt;/code&gt; files, a backup history file is created and stored in the WAL archive area. This file contains information about the starting and ending times, &lt;code&gt;WAL&lt;/code&gt; segments, and the label string given to the backup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovering Using a Continuous Archive Backup&lt;/strong&gt;&lt;br&gt;
In the event of a disaster or data loss, the recovery process using a continuous archive backup is straightforward. The first step is to stop the server, if it is running, and make a copy of the cluster data directory and any tablespaces as a precaution. Next, all existing files and subdirectories under the cluster data directory and tablespace roots should be removed. The database files can then be restored from the file system backup, ensuring correct ownership and permissions. It is essential to remove any files present in the &lt;code&gt;pg_wal/&lt;/code&gt; directory that came from the file system backup, as they are most likely obsolete. If any unarchived WAL segment files were saved, they should be copied into the &lt;code&gt;pg_wal/&lt;/code&gt; directory. Recovery configuration settings, such as the restore_command, should be set in postgresql.conf, and a recovery.signal file should be created in the cluster data directory. Finally, the server can be started, initiating the recovery process, which will read through the archived WAL files and restore the database to the desired state.&lt;/p&gt;
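
&lt;p&gt;The recovery-side configuration described above can be sketched as follows (all paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf: how to fetch archived WAL segments during recovery
restore_command = 'cp /mnt/server/archivedir/%f %p'

# Then signal recovery and start the server:
#   touch /path/to/data/recovery.signal
#   pg_ctl -D /path/to/data start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;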

&lt;p&gt;&lt;strong&gt;Timelines: Managing Parallel Universes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When performing point-in-time recovery or experimenting with different recovery scenarios, timelines play a significant role. PostgreSQL assigns a unique timeline ID to each series of WAL records generated after a recovery. This allows for recovery to a specific timeline, ensuring that changes made in separate timelines do not interfere with each other. Timeline history files are created to document the branching of timelines and are archived into the WAL archive area. These files are essential for choosing the correct WAL segment files during recovery from an archive with multiple timelines. By utilizing timelines effectively, administrators can recover to any prior state, even if they abandoned that timeline earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tips and Examples for Continuous Archiving&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To enhance the continuous archiving process, there are several tips and examples that can be followed. One popular approach is to use scripts to define the archive_command. This allows for greater flexibility and complexity management within the script, enabling tasks such as copying data to secure off-site storage and interfacing with other backup and recovery software. Compressing archive logs using tools like gzip can also help reduce storage requirements. It is essential to monitor the archiving process and ensure that it is working correctly. Designing a robust archive command or library that does not overwrite pre-existing archive files is crucial to preserving the integrity of the archive. Additionally, considering potential failure scenarios and designing appropriate alerts or notifications can help resolve issues promptly.&lt;/p&gt;
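
&lt;p&gt;As an illustration of the script-based approach, &lt;code&gt;archive_command&lt;/code&gt; can point at a small shell script; the script path and archive directory below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf
archive_command = '/usr/local/bin/archive_wal.sh %p %f'

# /usr/local/bin/archive_wal.sh
#!/bin/sh
# $1 = path of the completed WAL segment, $2 = its file name
set -e
# Never overwrite a pre-existing archive file
test ! -f /mnt/server/archivedir/"$2".gz
cp "$1" /mnt/server/archivedir/"$2"
gzip /mnt/server/archivedir/"$2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;set -e&lt;/code&gt;, the script exits with a nonzero status if any step fails, so PostgreSQL will retry the segment rather than consider it archived.&lt;/p&gt;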

&lt;p&gt;&lt;strong&gt;Caveats and Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While continuous archiving provides powerful backup and restore capabilities, there are some limitations and considerations to keep in mind. For example, when executing a CREATE DATABASE command during a base backup, caution should be exercised in modifying the template database to avoid propagating those changes into the created database during recovery. &lt;code&gt;CREATE TABLESPACE&lt;/code&gt; commands are also &lt;code&gt;WAL-logged&lt;/code&gt; with the literal absolute path, potentially causing issues if the &lt;code&gt;WAL&lt;/code&gt; is replayed on a different machine. Additionally, it is important to note that the default WAL format includes disk page snapshots, which can impact storage requirements. Administrators may consider adjusting checkpoint interval parameters and exploring options to reduce the number of page snapshots included in the WAL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
PostgreSQL offers a range of backup and restore strategies to ensure the integrity and recoverability of your data. Whether you choose traditional backups, continuous archiving with point-in-time recovery, or a combination of techniques, it is crucial to design a strategy that aligns with your specific requirements. By implementing a well-designed backup and restore strategy, you can safeguard your data and minimize the impact of potential disasters.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>backup</category>
    </item>
    <item>
      <title>Exploring The Set Clause: Updating Labels and Properties in Apache Age</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Sat, 13 Jan 2024 22:59:24 +0000</pubDate>
      <link>https://dev.to/namsi/exploring-the-set-clause-updating-labels-and-properties-in-graph-databases-4j12</link>
      <guid>https://dev.to/namsi/exploring-the-set-clause-updating-labels-and-properties-in-graph-databases-4j12</guid>
      <description>&lt;p&gt;Graph databases are a powerful tool for managing and querying connected data. Cypher, the query language for graph databases, provides various clauses to manipulate and update data. In this article, we will dive into the SET clause in Cypher, which allows us to update labels and properties on vertices and edges. We will explore different use cases and provide examples to demonstrate its functionality&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Terminal SET Clauses&lt;/strong&gt;&lt;br&gt;
A &lt;code&gt;SET&lt;/code&gt; clause may be the final clause in a Cypher query, making it a terminal clause. When a query ends with a terminal clause, no results will be returned from the cypher function call. However, the cypher function call still requires a column list definition. To handle this, when a Cypher query ends with a terminal clause, you can define a single value in the column list definition; no data will be returned in this variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Properties on Nodes and Relationships&lt;/strong&gt;&lt;br&gt;
One of the primary use cases of the SET clause in Cypher is to set properties on nodes and relationships. By using the &lt;code&gt;SET clause&lt;/code&gt;, you can easily update the values of specific properties within your graph database. Let's take a look at an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM cypher('graph_name', $$
   MATCH (v {name: 'Alexandria'})
   SET v.job = 'Software Engineer'
$$) as (v agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, we match the node named 'Alexandria' and set its job property to 'Software Engineer'. Because the query ends with a terminal SET clause, no rows are returned. This simple yet powerful application of the SET clause allows for effortless property updates within your graph database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updating and Returning Modified Vertices&lt;/strong&gt;&lt;br&gt;
In addition to updating properties, a query using the SET clause can also return the modified vertices. To update a vertex and return it, you can use the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM cypher('graph_name', $$
    MATCH (v {name: 'Alexandria'})
    SET v.job = 'Software Engineer'
    RETURN v
$$) as (v agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, we are not only setting the job property of the matched node to 'Software Engineer', but we are also returning the newly changed vertex as the result of the query. This allows for seamless modification and retrieval of vertices, facilitating efficient data management within your graph database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Removing Properties with SET Clause&lt;/strong&gt;&lt;br&gt;
While the REMOVE command is typically used to remove properties in Cypher, the SET clause also provides the flexibility to remove properties. This can be particularly useful when dealing with properties that originate from parameters. Let's consider the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM cypher('graph_name', $$
    MATCH (v {name: 'Alexandria'})
    SET v.name = NULL
    RETURN v
$$) as (v agtype);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, we are matching a node with the name 'Alexandria' and setting the name property to NULL. The node is returned as the result of the query, and the name property is now missing. This demonstrates how the SET clause can be utilized to remove properties when needed, providing flexibility in managing your graph database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Multiple Properties with a Single SET Clause&lt;/strong&gt;&lt;br&gt;
In many cases, you may need to set multiple properties within a single query. The SET clause offers a convenient solution to accomplish this by allowing you to separate multiple property assignments with a comma. Consider the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * 
FROM cypher('graph_name', $$
MATCH (v {name: 'Alexandria'})
SET v.department = 'Microservices', v.state = 'Nairobi'
RETURN v
$$) as (v agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, we are matching a node with the name 'Alexandria' and setting both the department and state properties. The modified vertex is returned as the result of the query, showcasing the ability to set multiple properties using a single SET clause. This feature enhances query efficiency and simplifies data manipulation within your graph database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced Techniques with SET Clause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the basics of the SET clause cover most common use cases, there are advanced techniques that can further enhance its functionality. For instance, you can use expressions or functions within the SET clause to perform calculations or transformations on property values. This allows you to create dynamic updates based on the existing data in your graph.&lt;/p&gt;
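
&lt;p&gt;For example, assuming the vertex carries a numeric &lt;code&gt;age&lt;/code&gt; property (a hypothetical property for this sketch), an expression can derive the new value from the existing one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM cypher('graph_name', $$
    MATCH (v {name: 'Alexandria'})
    SET v.age = v.age + 1
    RETURN v
$$) as (v agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;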

&lt;p&gt;Additionally, you can leverage variables and parameters in the SET clause to make your queries more flexible and reusable. By using variables, you can update properties based on the values stored in other parts of the query, resulting in more dynamic and adaptable data manipulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practices for Using SET Clause&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make the most out of the SET clause in Cypher, it is important to follow some best practices. Here are a few tips to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use descriptive property names: Clear and concise property names make your code more readable and maintainable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Group related property updates: When setting multiple properties, group them logically to improve code organization and readability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handle null values gracefully: Consider the impact of null values when setting properties, and handle them appropriately in your queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leverage constraints and indexes: Take advantage of graph database features such as constraints and indexes to optimize the performance of SET clause operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these best practices, you can ensure that your SET clause queries are efficient, maintainable, and scalable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SET clause in Cypher is a powerful tool for updating labels and properties on vertices and edges in graph databases. By harnessing its capabilities, you can manipulate data with ease, customizing your graph structures to suit your specific needs. Throughout this comprehensive guide, we have explored the various aspects of the SET clause, from its basic usage to advanced techniques and best practices. Armed with this knowledge, you are now ready to dive into the world of graph database manipulation and unleash the full potential of the SET clause. Happy querying!&lt;/p&gt;

&lt;p&gt;Resources:&lt;br&gt;
[&lt;a href="https://age.apache.org/age-manual/master/clauses/set.html"&gt;https://age.apache.org/age-manual/master/clauses/set.html&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>postgres</category>
      <category>database</category>
      <category>graphdatabase</category>
    </item>
    <item>
      <title>Understanding Cypher Prepared Statements in Apache Age</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Sat, 13 Jan 2024 13:03:40 +0000</pubDate>
      <link>https://dev.to/namsi/understanding-cypher-prepared-statements-in-apache-age-330h</link>
      <guid>https://dev.to/namsi/understanding-cypher-prepared-statements-in-apache-age-330h</guid>
      <description>&lt;p&gt;Optimizing your cypher queries with prepared statements can significantly improve performance.When you prepare a statement cypher parses and analyzes the query then stores the execution plan for later use. This means when you execute the prepared statement cypher can skip the parsing and planning phase and go straight to executing the query&lt;/p&gt;

&lt;p&gt;This is useful for queries you execute often with different filter values for example the preparation does the hard work  of parsing and optimizing the  cypher query up front  so each execution after that has minimal overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to format Cypher parameters in prepared statements&lt;/strong&gt;&lt;br&gt;
When using Cypher prepared statements in Apache Age, formatting your parameters properly is crucial; if done incorrectly, you will get errors and your queries won't run.&lt;br&gt;
The following tips can be used to construct Cypher prepared statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1. Use $ for params&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Cypher parameters in Apache Age always start with a &lt;code&gt;$&lt;/code&gt; followed by a letter and then alphanumeric characters, e.g.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$name&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Do not start the name with a number, and avoid symbols or spaces; stick to letters, numbers, and underscores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2. Place Cypher parameters in the Cypher query string&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
When writing the Cypher query within the prepared statement, include Cypher parameters wherever values need to be inserted, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (v:Person)
WHERE v.name = $name
RETURN v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not include PostgreSQL-style parameters (i.e. $1, $2) in the Cypher query string; use only Cypher parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preparing the statement&lt;/strong&gt;&lt;br&gt;
To prepare a statement, use the &lt;code&gt;PREPARE&lt;/code&gt; clause and include Cypher parameters ($name) in the query string. Then call the cypher function and pass the SQL parameter ($1) as the third argument.&lt;/p&gt;

&lt;p&gt;e.g&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREPARE cypher_stored_procedure(agtype) AS
SELECT *
FROM cypher('expr', $$
     MATCH (v:Person)
     WHERE v.name =$name
     RETURN v
$$, $1)
AS (v agtype);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Executing the prepared statement&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To execute the prepared statement, call &lt;code&gt;EXECUTE&lt;/code&gt; and pass an agtype map with the Cypher parameter values. Exclude the &lt;code&gt;$&lt;/code&gt; when passing the values.&lt;/p&gt;

&lt;p&gt;e.g&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXECUTE find_person('{"age":30}');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will execute the prepared &lt;code&gt;find_person&lt;/code&gt; query, binding the values in the agtype map (passed as &lt;code&gt;$1&lt;/code&gt;) to the corresponding Cypher parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Prepared Statements&lt;/strong&gt;&lt;br&gt;
There are a few key benefits to using prepared statements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Security:&lt;/strong&gt;&lt;br&gt;
Prepared statements prevent SQL injection attacks. The parameters are bound to the statement, so malicious input cannot be executed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Efficiency:&lt;/strong&gt;&lt;br&gt;
Prepared statements are compiled once and then executed multiple times. This makes them faster than executing a new query string for each execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Readability:&lt;/strong&gt;&lt;br&gt;
Prepared statements clearly separate the query string from the input parameters. This makes queries easier to read and debug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Reusability:&lt;/strong&gt;&lt;br&gt;
The same prepared statement can be executed with different parameters, reducing duplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should you use prepared statements?&lt;/strong&gt;&lt;br&gt;
As we now know, prepared statements are useful when you want to run the same query multiple times with different parameters. The following considerations can help determine when to use them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Security&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Use prepared statements when SQL injection is a concern: because the parameters are bound to the statement, malicious input cannot modify the SQL code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Performance&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Prepared statements also help with performance: they are compiled once and then executed multiple times with different parameters, avoiding the overhead of compiling the statement on each execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Prepared statements allow you to write clean, reusable Cypher queries with parameters. By understanding how to define and execute prepared statements, you can improve the security, performance, and readability of your queries.&lt;/p&gt;

</description>
      <category>apacheag</category>
      <category>postgres</category>
      <category>database</category>
      <category>graphdatabase</category>
    </item>
    <item>
      <title>Understanding and Optimizing Parallel Queries in PostgreSQL</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Sat, 30 Dec 2023 19:39:25 +0000</pubDate>
      <link>https://dev.to/namsi/understanding-and-optimizing-parallel-queries-in-postgresql-4l8b</link>
      <guid>https://dev.to/namsi/understanding-and-optimizing-parallel-queries-in-postgresql-4l8b</guid>
      <description>&lt;p&gt;PostgreSQL, a powerful open-source relational database management system, employs parallel query execution to enhance performance for certain types of queries. This article delves into the mechanics of parallel query execution, exploring how it works, when it is utilized, and the considerations involved in optimizing its performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Parallel Query Works&lt;/strong&gt;&lt;br&gt;
When the PostgreSQL query optimizer determines that parallel execution is the most efficient strategy, it constructs a query plan that includes a Gather or Gather Merge node. This node serves as the coordinator for parallel execution, collecting results from parallel worker processes. The number of workers is determined by the planner and can be visualized using the &lt;code&gt;EXPLAIN&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;The Gather or Gather Merge node appears at the top of the query plan tree if the entire query is to be executed in parallel. Otherwise, it is positioned where parallelism is applied within the plan. For example, for a query that filters rows of the pgbench_accounts table on the filler column, the planner may choose a parallel sequential scan.&lt;/p&gt;
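
&lt;p&gt;Such a plan can be inspected with &lt;code&gt;EXPLAIN&lt;/code&gt;; on a pgbench database the output looks roughly like the following (costs, widths, and worker counts will vary by system and configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
         Filter: (filler ~~ '%x%'::text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;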

&lt;p&gt;Parallel execution involves background worker processes, and the number of workers is constrained by configuration parameters like &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;. If the available workers are insufficient, or if certain conditions prevent parallel execution, the query may revert to a serial execution plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Query Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Several factors influence whether a parallel query plan is generated&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Settings&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; must be set to a value greater than zero.&lt;br&gt;
The system should not be in single-user mode.&lt;/p&gt;
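
&lt;p&gt;These settings can be inspected and adjusted per session, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHOW max_parallel_workers_per_gather;
-- Allow up to 4 workers per Gather node for this session
SET max_parallel_workers_per_gather = 4;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;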

&lt;p&gt;&lt;strong&gt;Query Characteristics&lt;/strong&gt;:&lt;br&gt;
Queries involving data modification or row locking do not support parallel plans.&lt;br&gt;
Certain commands like &lt;code&gt;CREATE TABLE ... AS&lt;/code&gt; or &lt;code&gt;SELECT INTO&lt;/code&gt; can use parallel plans for the underlying &lt;code&gt;SELECT&lt;/code&gt; operations.&lt;br&gt;
Queries that call functions labeled &lt;code&gt;PARALLEL UNSAFE&lt;/code&gt; are ineligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution Context&lt;/strong&gt;:&lt;br&gt;
Nested queries or queries within parallel constructs may not use parallel plans.&lt;br&gt;
Operations like DECLARE CURSOR or PL/pgSQL loops may preclude parallel execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety Measures&lt;/strong&gt;:&lt;br&gt;
The system avoids parallel plans if it suspects partial or incremental execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Query Execution Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even when a parallel query plan is generated, execution may face challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worker Availability&lt;/strong&gt;:&lt;br&gt;
If the total number of background workers exceeds system limits (&lt;code&gt;max_worker_processes&lt;/code&gt; or &lt;code&gt;max_parallel_workers&lt;/code&gt;), parallel execution may be constrained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client Constraints&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The client's request, such as a non-zero fetch count, may limit parallel execution. This is influenced by the extended query protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Adjustments&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;To prevent suboptimal serial execution, consider adjusting &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; dynamically based on the expected execution context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Understanding PostgreSQL's parallel query execution is crucial for optimizing database performance. Configuring relevant parameters, adhering to parallel-safe practices, and recognizing constraints on parallelism contribute to efficient query processing. As PostgreSQL evolves, staying informed about advancements and best practices ensures effective utilization of parallel query capabilities for enhanced database performance.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>data</category>
    </item>
    <item>
      <title>A Guide to Implementing Community Detection in Apache Age</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Wed, 27 Dec 2023 19:37:14 +0000</pubDate>
      <link>https://dev.to/namsi/a-guide-to-implementing-community-detection-in-apache-age-ob0</link>
      <guid>https://dev.to/namsi/a-guide-to-implementing-community-detection-in-apache-age-ob0</guid>
      <description>&lt;p&gt;As we all know Apache Age has been a force to reckon with in the graph database space  for a while now which is offering solutions to the analytics space and also helping  data scientist to solve data issues with ease.In this article we are going to discuss/define what is community detection and how best can one implement community detection in Apache Age.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What is community detection ?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Community detection techniques are useful for social media algorithms to discover people with common interests and keep them tightly connected. Community detection can be used in machine learning to detect groups with similar properties and extract groups for various reasons. For example, this technique can be used to discover manipulative groups inside a social network or a stock market.&lt;/p&gt;

&lt;p&gt;Based on ongoing research, some of these algorithms are still at the implementation stage, but since Apache Age supports various programming languages such as Java and Python, with readily available drivers, one can implement such algorithms using those drivers.&lt;/p&gt;

&lt;p&gt;One community detection algorithm that can be used in Apache Age is &lt;code&gt;eigenvector-based community detection&lt;/code&gt;. This is a method used to identify communities or groups within a network by analyzing the eigenvectors of the network's adjacency matrix. The basic idea behind this approach is that nodes belonging to the same community will be more strongly connected to each other than to nodes in other communities.&lt;/p&gt;

&lt;p&gt;This approach begins by computing the adjacency matrix of the network, capturing the relationships between nodes. Subsequently, the eigenvalues and corresponding eigenvectors of this matrix are determined. The eigenvectors associated with the highest eigenvalues are employed to categorize nodes into communities.&lt;/p&gt;

&lt;p&gt;Essentially, the concept relies on the notion that nodes within the same community will exhibit comparable eigenvector values for these prominent eigenvectors. Through the clustering of nodes with analogous eigenvector values, the method facilitates the identification of distinct communities within the network.&lt;/p&gt;
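
&lt;p&gt;As a self-contained sketch of this idea, independent of Apache Age (the toy graph below, and the use of the modularity matrix as in Newman's leading-eigenvector method, are assumptions for illustration), the sign pattern of the leading eigenvector splits a small two-triangle graph into its two communities:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Two triangles (0,1,2) and (3,4,5) joined by the single edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
m = len(edges)

# Adjacency matrix and node degrees
A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0
k = [sum(row) for row in A]

# Modularity matrix: B_ij = A_ij - k_i * k_j / (2m)
B = [[A[i][j] - k[i] * k[j] / (2 * m) for j in range(n)] for i in range(n)]

# Shift by a Gershgorin bound so power iteration converges to the
# eigenvector of B's most positive eigenvalue
shift = max(sum(abs(x) for x in row) for row in B)
M = [[B[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]

v = [float(i + 1) for i in range(n)]  # start vector, not orthogonal to the target
for _ in range(200):
    w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]

# Nodes are assigned to communities by the sign of the leading eigenvector
community = [0 if x >= 0 else 1 for x in v]
print(community)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running this places nodes 0, 1, and 2 in one community and nodes 3, 4, and 5 in the other, matching the two triangles.&lt;/p&gt;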

&lt;p&gt;The following approach can be implemented in the Python programming language; the steps below show how.&lt;/p&gt;

&lt;p&gt;Requirements for the test:&lt;br&gt;
Install the apache-age community detection library:&lt;br&gt;
&lt;code&gt;pip install apache-age-community-detection&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Below is a sample implementation of community detection using Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from age_cda import Graph

# Define the nodes and edges for the graph (node F must be declared too,
# since it appears in the edge list)
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"], ["D", "E"], ["D", "F"], ["E", "F"]]

# Measure the execution time
start_time = time.time()

# Create a graph and obtain the community structure
example_graph = Graph.Graph()
example_graph.createGraph(nodes, edges)
results = example_graph.get_community()

print(results)

end_time = time.time()
print("Elapsed time: {} seconds".format(end_time - start_time))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conclusion &lt;br&gt;
This approach is one of the few that is readily documented, but further approaches can be researched to achieve community detection in Apache AGE.&lt;/p&gt;

&lt;p&gt;Resource&lt;br&gt;
[&lt;a href="https://arxiv.org/pdf/physics/0605087.pdf"&gt;https://arxiv.org/pdf/physics/0605087.pdf&lt;/a&gt;]&lt;br&gt;
[&lt;a href="https://pypi.org/project/apache-age-community-detection/"&gt;https://pypi.org/project/apache-age-community-detection/&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>algorithms</category>
      <category>graphdatabase</category>
      <category>database</category>
    </item>
    <item>
      <title>Understanding Use Of Shortest Path Algorithm In Apache Age</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Tue, 26 Dec 2023 14:00:49 +0000</pubDate>
      <link>https://dev.to/namsi/understanding-use-of-shortest-path-algorithm-in-apache-age-2fng</link>
      <guid>https://dev.to/namsi/understanding-use-of-shortest-path-algorithm-in-apache-age-2fng</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;What is shortest path algorithm?&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
In graph theory, the shortest path problem is the problem of finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized.&lt;/p&gt;

&lt;p&gt;The problem of finding the shortest path between two intersections on a road map may be modeled as a special case of the shortest path problem in graphs, where the vertices correspond to intersections and the edges correspond to road segments, each weighted by the length of the segment.&lt;/p&gt;
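&lt;p&gt;The road-map analogy can be sketched in plain Python before bringing Apache AGE into the picture. The intersections and distances below are invented for illustration; the priority-queue loop is the core of Dijkstra's algorithm that the library used later in this article builds on.&lt;/p&gt;

```python
import heapq

def dijkstra(graph, start):
    """graph: {node: {neighbour: weight}}; returns shortest distances from start."""
    dist = {start: 0}
    heap = [(0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist

# Intersections as vertices, road segments weighted by length (km)
roads = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}
print(dijkstra(roads, "A"))  # A reaches B via C: 2 + 1 = 3
```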

&lt;p&gt;Shortest path algorithm is widely applied in various domains, including transportation networks, social networks, and network routing. Apache AGE provides an efficient implementation of the shortest path algorithm, allowing users to find the most optimal path between nodes in their graph data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of using shortest path algorithm in Apache AGE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;br&gt;
Shortest path algorithms, when implemented in Apache AGE, contribute to the scalability of the system. As the size of the graph grows, the algorithm can still provide efficient pathfinding without a significant degradation in performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized Query Performance&lt;/strong&gt;&lt;br&gt;
Use of the shortest path algorithm allows for optimized query performance, especially when dealing with large and complex graphs. It minimizes the computational cost by efficiently navigating the graph structure to find the shortest path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Response&lt;/strong&gt;&lt;br&gt;
Shortest path algorithms enable real-time responses to pathfinding queries. This is particularly valuable in scenarios where quick decision-making based on the latest data is critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to take when implementing the shortest path algorithm in Apache AGE using &lt;code&gt;Dijkstra's algorithm&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Requirements for a Dijkstra implementation in Python include:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt-get install python3-dev libpq-dev
pip install --no-binary :all: psycopg2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1. Set up Apache AGE&lt;/strong&gt;&lt;br&gt;
Install Apache AGE and its Python client library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install apache-age apache-age-python-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, after installing Apache AGE and its Python client library, install the age-dijkstra library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install apache-age-dijkstra
pip install antlr4-python3-runtime==4.9.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.Create your graph database connection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from age_dijkstra import Age_Dijkstra

con = Age_Dijkstra()
con.connect(
    host="localhost",        # default is "172.17.0.2"
    port="5432",             # default is "5432"
    dbname="databaseName",   # default is "postgres"
    user="postgresUser",     # default is "postgres"
    password="password",     # default is "agens"
    printMessage = True      # default is False
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3.Define Vertex and Edge Properties:&lt;/strong&gt;&lt;br&gt;
Vertices: Store vertex IDs and any additional properties (e.g., name, location).&lt;br&gt;
Edges: Store source and target vertex IDs, weight, and any other properties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Load Data into the Graph:&lt;/strong&gt;&lt;br&gt;
Create the graph, then insert the vertices and their connecting edges (with distance weights) using &lt;code&gt;set_vertices&lt;/code&gt; and &lt;code&gt;set_edge&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample use-case implementation of Dijkstra's algorithm in Apache AGE.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from age_dijkstra import Age_Dijkstra, Graph

con = Age_Dijkstra()
con.connect(
    host="localhost",       # default is "172.17.0.2"
    port="5432",            # default is "5432"
    dbname="postgresDB",    # default is "postgres"
    user="postgresUser",    # default is "postgres"
    password="postgresPW",  # default is "agens"
    printMessage = True     # default is False
)

con.create_graph( graph_name = "states_graph")

# Create one vertex per county
for county in ["Nairobi", "Mombasa", "Kisumu", "Nakuru"]:
    con.set_vertices(
        graph_name = "states_graph",
        label="County",
        property={"name" : county}
    )

# Connect the counties; distances are stored as integers (km) so they
# can be used directly as edge weights
neighbours = [
    ("Nairobi", "Mombasa", 484),
    ("Nairobi", "Nakuru", 161),
    ("Nakuru", "Kisumu", 183),
    ("Nairobi", "Kisumu", 345),
]
for name1, name2, distance in neighbours:
    con.set_edge(
        graph_name = "states_graph",
        label1="County",
        prop1={"name" : name1},
        label2="County",
        prop2={"name" : name2},
        edge_label = "Neighbour",
        edge_prop = {"distance": distance}
    )

edges = con.get_all_edge()
nodes = []
for x in con.get_all_vertices():
    nodes.append(x['name'])

init_graph = {}
for node in nodes:
    init_graph[node] = {}

for edge in edges:
    v1 = edge['v1']['name']
    v2 = edge['v2']['name']
    dist = int(edge['e']['distance'])
    init_graph[v1][v2] = dist

graph = Graph(nodes, init_graph)
previous_nodes, shortest_path = Graph.dijkstra_algorithm(graph=graph, start_node="Nairobi")
Graph.print_shortest_path(previous_nodes, shortest_path, start_node="Nairobi", target_node="Kisumu")

con.delete_graph( graph_name = "states_graph")
con.close_connection()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code sample above prints the shortest distance between the selected counties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The sample code above is one of several ways you can implement Dijkstra's algorithm in Apache AGE using Python as your programming language.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>graphdatabase</category>
      <category>database</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Exploring Columnar Indexes in PostgreSQL</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Fri, 15 Dec 2023 18:53:39 +0000</pubDate>
      <link>https://dev.to/namsi/exploring-columnar-indexes-in-postgresql-ola</link>
      <guid>https://dev.to/namsi/exploring-columnar-indexes-in-postgresql-ola</guid>
      <description>&lt;p&gt;One of the most powerful tool/mechanism that is offered by postgresql as a database management system is &lt;code&gt;indexing&lt;/code&gt; techniques which is essential for query performance.The most common ,notable and powerful indexing mechanism is columnar index which focus on optimizing the storage and retrieval of data in a column-wise manner. &lt;br&gt;
In this article we are going to have deep dive on understanding columnar indexes ,its Features,considerations,limitations and benefits.&lt;/p&gt;

&lt;p&gt;First off, we are going to start by defining what a columnar index is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;What is a columnar index?&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
In PostgreSQL, columnar indexes store information column by column rather than row by row. This distinction becomes particularly beneficial in scenarios where analytical queries involve aggregations, filtering, and computations on specific columns rather than entire rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key Features of columnar indexes&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Compression&lt;/em&gt;&lt;br&gt;
Columnar indexes often employ compression techniques that take advantage of the similar data types and values within a column. Compression leads to reduced storage requirements and can improve query performance by minimizing the amount of data that needs to be read from disk.&lt;/p&gt;
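&lt;p&gt;As a toy illustration of why column-wise storage compresses so well (this is not PostgreSQL's actual compression, just the underlying intuition), run-length encoding a column of repetitive values shrinks it dramatically, whereas the same values interleaved with other columns row-wise would not form long runs:&lt;/p&gt;

```python
def run_length_encode(values):
    """Collapse consecutive duplicates into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A status column is highly repetitive when stored on its own
status_column = ["shipped"] * 4 + ["pending"] * 3 + ["returned"]
encoded = run_length_encode(status_column)
print(encoded)  # [['shipped', 4], ['pending', 3], ['returned', 1]]
```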

&lt;p&gt;&lt;em&gt;Data Aggregation&lt;/em&gt;&lt;br&gt;
Columnar indexes are well-suited for analytical queries and data aggregation operations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Columnar Storage&lt;/em&gt;&lt;br&gt;
Columnar indexes store data in columns. Each column is stored separately, allowing for more efficient compression and storage of data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Materialized Views&lt;/em&gt;&lt;br&gt;
Columnar indexes can be effectively used to create materialized views, which store the results of complex queries in a precomputed form, improving query response times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some of factors you should put into consideration when using columnar indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Maintenance Overhead&lt;/em&gt;&lt;br&gt;
The use of columnar indexes can lead to increased maintenance efforts, particularly when dealing with updates to the data. It's essential to weigh the trade-offs between enhanced query performance and the additional costs associated with index maintenance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compatibility&lt;/em&gt;&lt;br&gt;
Ensure that your PostgreSQL version supports columnar indexes and extended statistics. Check for updates and patches to benefit from any improvements or bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Workload Analysis&lt;/em&gt;&lt;br&gt;
Comprehend the characteristics of your queries. Columnar indexes are most efficient when dealing with analytical workloads that include aggregations and scans. For transactional workloads that involve frequent updates, conventional B-tree indexes may be more appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of using columnar indexes&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Improved Query Performance&lt;/em&gt;&lt;br&gt;
Columnar indexes excel when dealing with analytical queries that involve aggregations or filtering on specific columns. They reduce the amount of data that needs to be scanned, leading to faster query execution times.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Space Efficiency&lt;/em&gt;&lt;br&gt;
Columnar indexes are built on individual columns, they can be more space-efficient than row-based indexes. This is particularly advantageous when dealing with wide tables where only a subset of columns is frequently queried.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Reduced I/O Operations&lt;/em&gt;&lt;br&gt;
Columnar indexes can minimize the number of I/O operations required to fetch relevant data. This is especially beneficial in scenarios where disk I/O is a significant performance bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Below is a sample implementation of columnar indexing concepts in PostgreSQL&lt;/em&gt;&lt;br&gt;
Use case: sample sales data for a boutique store&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE sales_data (
    order_id INT,
    product_id INT,
    customer_id INT,
    order_date DATE,
    quantity_sold INT,
    total_amount DECIMAL(10, 2),
    PRIMARY KEY (order_id)
);

-- Insert some data into the sales_data table
INSERT INTO sales_data(order_id,product_id,customer_id,order_date,quantity_sold,total_amount)
  VALUES(13,123,1,'2023-02-12',23,2345);
INSERT INTO sales_data(order_id,product_id,customer_id,order_date,quantity_sold,total_amount)
  VALUES(11,125,2,'2023-02-13',25,2345);
INSERT INTO sales_data(order_id,product_id,customer_id,order_date,quantity_sold,total_amount)
  VALUES(12,129,3,'2023-02-14',29,2345);
INSERT INTO sales_data(order_id,product_id,customer_id,order_date,quantity_sold,total_amount)
  VALUES(14,122,4,'2023-02-15',21,2345);
INSERT INTO sales_data(order_id,product_id,customer_id,order_date,quantity_sold,total_amount)
  VALUES(18,456,4,'2023-02-15',21,2345);

-- Create an index on the frequently queried columns
CREATE INDEX idx_sales_product_orderdate ON sales_data (product_id, order_date);

-- Sample queries

-- Query to find total quantity sold for a specific product in a given date range
SELECT product_id, order_date, SUM(quantity_sold) AS total_quantity
FROM sales_data
WHERE product_id = 123 AND order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY product_id, order_date;

-- Query to retrieve the latest orders for a specific product
SELECT *
FROM sales_data
WHERE product_id = 456
ORDER BY order_date DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conclusion&lt;br&gt;
Columnar indexes in PostgreSQL present a compelling strategy for enhancing the efficiency of analytical queries and data warehousing tasks. By structuring and storing data based on columns, these indexes deliver advantages like enhanced compression, heightened query speed, and effective parallel processing. Proficiently utilizing columnar indexes can substantially boost the overall performance of PostgreSQL databases, particularly in situations where analytical processing and reporting play pivotal roles in the data workflow.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>data</category>
    </item>
    <item>
      <title>Understanding Pluggable Storage In Postgresql And Its Benefits</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Mon, 11 Dec 2023 19:34:25 +0000</pubDate>
      <link>https://dev.to/namsi/understanding-pluggable-storage-in-postgresql-and-its-benefits-34af</link>
      <guid>https://dev.to/namsi/understanding-pluggable-storage-in-postgresql-and-its-benefits-34af</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;What is pluggable storage?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Pluggable storage refers to the ability to use different storage engines interchangeably within the same database system. A storage engine is a component of a database management system (DBMS) that is responsible for handling how data is stored, indexed, and retrieved. By making the storage engine pluggable, developers can choose and switch between different storage engines based on their specific needs and use cases.&lt;/p&gt;
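&lt;p&gt;The "pluggable" idea can be sketched in a few lines of Python (a hypothetical mini-DBMS, not PostgreSQL's real API): the database core programs against one interface, and concrete engines with different storage layouts can be swapped in behind it.&lt;/p&gt;

```python
class StorageEngine:
    """Interface the database core programs against."""
    def insert(self, row): raise NotImplementedError
    def scan(self): raise NotImplementedError

class HeapEngine(StorageEngine):
    """Heap-style storage: rows kept in insertion order."""
    def __init__(self): self.rows = []
    def insert(self, row): self.rows.append(row)
    def scan(self): return list(self.rows)

class OrderedEngine(StorageEngine):
    """Index-organized-style storage: rows kept sorted."""
    def __init__(self): self.rows = []
    def insert(self, row): self.rows.append(row); self.rows.sort()
    def scan(self): return list(self.rows)

def load(engine, rows):
    # The caller does not care which engine is plugged in
    for row in rows:
        engine.insert(row)
    return engine.scan()

print(load(HeapEngine(), [3, 1, 2]))     # [3, 1, 2]
print(load(OrderedEngine(), [3, 1, 2]))  # [1, 2, 3]
```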

&lt;p&gt;In this article we are going to discuss more about the various storage engines that are currently being used in postgresql and also describe the architecture of some of these pluggable storage engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Some of the examples of pluggable storage engines used in postgresql include&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pluggable Table Access Methods.&lt;/li&gt;
&lt;li&gt;Zheap.&lt;/li&gt;
&lt;li&gt;OrioleDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zheap&lt;/strong&gt;&lt;br&gt;
Zheap is a storage engine that keeps table bloat under control by running UPDATE-intense workloads a lot more efficiently. The project was originally started by EnterpriseDB and a lot of effort has been put into it already.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The main objectives of Zheap include:&lt;/strong&gt;&lt;br&gt;
1.Reduce write amplification both by avoiding rewrites of heap pages and by making it possible to do an update that touches indexed columns without updating every index.&lt;/p&gt;

&lt;p&gt;2.Reduce the tuple size by shrinking the tuple header and eliminating most alignment padding.&lt;/p&gt;

&lt;p&gt;3.Provide better control over bloat.  zheap will prevent bloat by allowing in-place updates in common cases and by reusing space as soon as a transaction that has performed a delete or non-in-place-update has committed. In short, with this new storage, whenever possible, we’ll avoid creating bloat in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic architecture of Zheap:&lt;/strong&gt;&lt;br&gt;
The three components that make up the zheap architecture are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zheap: The table format&lt;/li&gt;
&lt;li&gt;undo: Handling transaction rollback, etc.&lt;/li&gt;
&lt;li&gt;WAL: Protect critical writes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zheap: The table format&lt;/strong&gt;&lt;br&gt;
In the zheap system, transaction details are now stored directly on the page, making data much smaller and improving performance.&lt;/p&gt;

&lt;p&gt;Each transaction consumes 16 bytes of storage and includes three pieces of information: transaction id, epoch, and the latest undo record pointer for that transaction. A row in the system points to a transaction slot.&lt;/p&gt;

&lt;p&gt;By default, a table has 4 transaction slots, which is usually enough for large tables. However, in some cases, more slots are required. To handle this, zheap uses a feature called "TPD" (Transaction Processing Data), which acts as an overflow area. It stores extra transaction information when the default slots are insufficient.&lt;/p&gt;
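&lt;p&gt;A toy model of the slot mechanism just described (the class and field names are invented for illustration, not zheap's on-disk format): each page starts with four transaction slots, and extra concurrent transactions spill into the TPD overflow area.&lt;/p&gt;

```python
class ZheapPage:
    """Toy page: 4 built-in transaction slots plus a TPD overflow area."""
    DEFAULT_SLOTS = 4

    def __init__(self):
        self.slots = {}  # xid -> (epoch, latest undo record pointer)
        self.tpd = {}    # overflow used when the default slots are exhausted

    def reserve_slot(self, xid, epoch, undo_pointer):
        if xid in self.slots or len(self.slots) < self.DEFAULT_SLOTS:
            self.slots[xid] = (epoch, undo_pointer)
        else:
            self.tpd[xid] = (epoch, undo_pointer)

page = ZheapPage()
for xid in range(100, 106):  # six concurrent transactions touch the page
    page.reserve_slot(xid, epoch=0, undo_pointer=f"undo/{xid}")
print(len(page.slots), len(page.tpd))  # 4 2
```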

&lt;p&gt;&lt;em&gt;Zheap :Tuple Formats&lt;/em&gt;&lt;br&gt;
In the context of PostgreSQL's zheap implementation, the structure of a single tuple (a row of data) is crucial. In a standard heap tuple in PostgreSQL, a significant portion of the tuple's size is dedicated to storing transactional information within the tuple itself, making the header quite large (20+ bytes). However, in zheap, this approach changes. Transactional details are now stored in page-level structures called transaction slots, allowing the tuple header to shrink to a mere 5 bytes.&lt;/p&gt;

&lt;p&gt;Furthermore, zheap optimizes the layout by eliminating CPU alignment (padding) between the tuple header and the actual data in the row. Typically, standard tuples require padding, which consumes additional bytes for each row in the table. Zheap, on the other hand, avoids this padding, resulting in more efficiently packed storage.&lt;/p&gt;

&lt;p&gt;Additionally, zheap achieves space savings by removing padding from pass-by-value data types. All these optimizations collectively contribute to saving valuable storage space in every individual row of the table.&lt;/p&gt;

&lt;p&gt;In summary, zheap's tuple format streamlines the storage structure by relocating transactional information, reducing header size, eliminating unnecessary padding, and optimizing the handling of specific data types, ultimately leading to more space-efficient storage for each row in the table.&lt;/p&gt;
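&lt;p&gt;Back-of-the-envelope arithmetic makes the header shrinkage tangible. The heap figure below is an assumption (roughly 24 bytes per tuple header including alignment padding, versus the 5-byte zheap header mentioned above); only the order of magnitude matters:&lt;/p&gt;

```python
# Assumed sizes: ~24 bytes per heap tuple header (incl. alignment padding)
# versus 5 bytes for a zheap tuple header
rows = 10_000_000
heap_header_bytes = 24
zheap_header_bytes = 5

saved_bytes = rows * (heap_header_bytes - zheap_header_bytes)
print(f"Header savings on {rows:,} rows: {saved_bytes / 1024**2:.0f} MiB")
```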

&lt;p&gt;&lt;strong&gt;Undo: Handling transaction rollback&lt;/strong&gt;&lt;br&gt;
One key concept integral to zheap is "undo". What is its primary purpose? We can understand it by describing the following operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;INSERT: Adding rows&lt;/li&gt;
&lt;li&gt;UPDATE: Modifying data&lt;/li&gt;
&lt;li&gt;DELETE: Removing rows&lt;/li&gt;
&lt;li&gt;UNDO and ROLLBACK in action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;INSERT: Adding rows&lt;br&gt;
When you perform an INSERT operation on a zheap table, the system sets aside a transaction slot and then records an undo entry. This undo entry helps to correct any issues that may arise in case of an error during the INSERT process. Unlike traditional heap tables in PostgreSQL, zheap allows for prompt reclamation of space if the INSERT operation is rolled back, which is a significant distinction between the two storage mechanisms. The Transaction ID (TID) plays a crucial role in managing and undoing the changes made during the INSERT operation in zheap.&lt;/p&gt;

&lt;p&gt;UPDATE: Modifying data&lt;br&gt;
When dealing with an UPDATE statement on a zheap table, there are two primary scenarios to consider:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;New Row Fits in the Old Space&lt;/em&gt;:&lt;br&gt;
If the new row can be accommodated within the existing space allocated for the old row, the process is relatively straightforward. The system overwrites the old row with the new data and creates an undo entry that contains the complete old row. This way, if necessary, the system can revert to the old structure by copying the information from the undo entry.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;New Row Does Not Fit in the Old Space&lt;/em&gt;:&lt;br&gt;
When the new row is longer than the old one, the process becomes more complex. In such cases, zheap essentially has to perform a DELETE followed by an INSERT operation. This is less efficient than an in-place UPDATE, and it can impact performance.&lt;/p&gt;

&lt;p&gt;There are specific conditions where space can be promptly reclaimed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When updating a row to a shorter version.&lt;/li&gt;
&lt;li&gt;When non-inplace UPDATEs are executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In essence, the efficiency of the UPDATE operation in zheap depends on whether the new data fits into the existing space or requires a more resource-intensive DELETE / INSERT operation.&lt;/p&gt;
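&lt;p&gt;The two UPDATE scenarios and their undo entries can be modelled in a few lines of Python (a deliberately simplified sketch; real zheap pages, slots and undo records are far more involved):&lt;/p&gt;

```python
def update_row(page, slot, new_row, undo_log):
    """page: list of row strings; undo entries record (index, old value)."""
    old = page[slot]
    undo_log.append((slot, old))  # the undo entry holds the complete old row
    if len(new_row) <= len(old):
        page[slot] = new_row      # new row fits: overwrite in place
        return "in-place"
    undo_log.append((len(page), None))  # undoing the INSERT drops the new slot
    page[slot] = None             # ...otherwise DELETE...
    page.append(new_row)          # ...followed by INSERT elsewhere
    return "delete+insert"

def rollback(page, undo_log):
    """Replay undo entries newest-first to reinstate the original rows."""
    while undo_log:
        index, old = undo_log.pop()
        if old is None and index == len(page) - 1:
            page.pop()            # remove the row added by the non-in-place path
        else:
            page[index] = old

page, undo_log = ["hello", "world"], []
print(update_row(page, 0, "hey", undo_log))          # in-place
print(update_row(page, 1, "much longer", undo_log))  # delete+insert
rollback(page, undo_log)
print(page)  # ['hello', 'world']
```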

&lt;p&gt;&lt;em&gt;DELETE: Removing rows&lt;/em&gt;&lt;br&gt;
When it comes to the DELETE operation in zheap, the process involves emitting an undo record. This undo record is crucial for restoring the old row in case of a ROLLBACK. During the DELETE operation, the row needs to be eliminated from the zheap to complete the removal process. In simpler terms, when you delete a row in zheap, the system creates a record that allows it to undo the deletion and puts the removed row back in its original place if needed during a rollback.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;UNDO and ROLLBACK in action&lt;/em&gt;&lt;br&gt;
When a ROLLBACK occurs, it's crucial to ensure that the original state of the table is reinstated. This involves executing the previously scheduled undo action. If errors occur during this process, the undo action is implemented within a new transaction to ensure success.&lt;/p&gt;

&lt;p&gt;To optimize efficiency, it's ideal to apply all undo actions associated with a single page simultaneously. This minimizes the amount of Write-Ahead Logging (WAL) that needs to be recorded. Additionally, this approach allows for reducing page-level locking to the bare minimum, decreasing contention and enhancing overall performance.&lt;/p&gt;

&lt;p&gt;However, dealing with long transactions poses challenges, especially in scenarios where a substantial amount of data, such as a terabyte, needs to be rolled back. To address this, if the undo action surpasses a configurable threshold, a background worker process is employed to handle the task. This solution is not only elegant but also ensures a positive end-user experience, avoiding prolonged rollbacks. Furthermore, the system must be resilient to potential crashes during the rollback process.&lt;/p&gt;

&lt;p&gt;Undo actions can be removed under three circumstances:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When no active transactions can access the data.&lt;/li&gt;
&lt;li&gt;Once all undo actions have been successfully completed.&lt;/li&gt;
&lt;li&gt;For committed transactions, until they become visible to 
all transactions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Sample example implementation of Undo: Handling transaction rollback&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Example 1: Inserting a Row
BEGIN;
    INSERT INTO my_table (id, name) VALUES (1, 'John Doe');
    -- Undo operation: DELETE FROM my_table WHERE id = 1;
COMMIT / ROLLBACK;

-- Example 2: Deleting a Row
BEGIN;
    DELETE FROM employee WHERE employee_id = 101;
    -- Undo operation: INSERT INTO employee (employee_id, name, department) VALUES (101, 'John Smith', 'IT');
COMMIT / ROLLBACK;

-- Example 3: Updating Data
BEGIN;
    UPDATE products SET price = 15.99 WHERE product_id = 2001;
    -- Undo operation: UPDATE products SET price = 19.99 WHERE product_id = 2001;
COMMIT / ROLLBACK;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the following examples above ,the &lt;code&gt;BEGIN&lt;/code&gt;  initiates a transaction, and subsequent operations modify the database state. The crucial aspect is that zheap maintains undo information to revert these changes in case of a &lt;code&gt;ROLLBACK&lt;/code&gt; or to finalize them in case of a &lt;code&gt;COMMIT&lt;/code&gt;. The undo operations are essentially the reverse of the original operations, ensuring that the database remains in a consistent state despite transaction outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pluggable Table Access Methods&lt;/strong&gt;&lt;br&gt;
Table access methods refer to a capability that enables a different approach to storing data within a table. Before PostgreSQL 11, these methods were mainly used for index data, allowing users to choose storage methods like B-Tree or hash. However, there was no similar flexibility for tables.&lt;/p&gt;

&lt;p&gt;With the introduction of PostgreSQL 12, this feature was expanded to include tables as well. Access methods for tables provide a way to implement alternative storage mechanisms. These methods expose APIs  allowing PostgreSQL developers to create their own methods. In PostgreSQL 12, the traditional heap format for tables is transitioned into a table access method, which is now the default option. This enhancement provides more flexibility and control over how data is stored in postgresql tables.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;How do users use the table access method interface?&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
To define a table access method, use CREATE ACCESS METHOD with TYPE TABLE. To then have a table use the new access method, specify it in the USING clause in CREATE TABLE, CREATE TABLE AS SELECT, or CREATE MATERIALIZED VIEW. Alternatively, you can specify the default table access method via the &lt;code&gt;default_table_access_method&lt;/code&gt; parameter in postgresql.conf.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE ACCESS METHOD heap1 TYPE TABLE HANDLER heap_tableam_handler;
CREATE TABLE tbl1(id int, name text) USING heap1 ...;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Basic syntax support for table access methods&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE ACCESS METHOD &amp;lt;new access method&amp;gt; TYPE TABLE
HANDLER &amp;lt;table_am_handler&amp;gt;
CREATE TABLE ... USING &amp;lt;new access method&amp;gt; ...
CREATE TABLE ... USING &amp;lt;new access method&amp;gt; AS ...
CREATE MATERIALIZED VIEW ... USING &amp;lt;new access method&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Advantages of using table access methods&lt;/strong&gt;&lt;br&gt;
The attractiveness of this functionality lies in the following aspects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Concise and Pluggable Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The feature offers a streamlined and adaptable structure that is both straightforward and easy for PostgreSQL developers to work with.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User-Friendly Customization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users have the ability to explicitly define the access method for each individual table, providing a high degree of customization and control over data storage strategies.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compatibility Across Databases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This functionality caters to a wide range of databases, accommodating both open source and commercial database systems. It ensures flexibility for users who might be using different types of databases within their environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Supports Diverse Access Methods:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The coexistence of various table access methods within the same database is facilitated. This means that different tables can use different storage mechanisms, allowing users to tailor their approach based on specific requirements or performance considerations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By leveraging pluggable table access methods, it becomes feasible to implement a diverse range of table access methods. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;zheap as an alternative to heap&lt;br&gt;
Provides better control over bloat using in-place updates and undo records for deletes, and reduces write amplification compared to heap.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Columnar table&lt;br&gt;
The storage layout of a columnar table is column-wise instead of row-wise. With the pluggable table AM interface, such implementations can stay focused without needing changes to the core server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In-memory table&lt;br&gt;
Main memory is used as the primary storage for tables. Rows in the table are read from and written to memory. Some considerations:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;A copy of the table data is maintained on disk, used only for durability&lt;/li&gt;
&lt;li&gt;A purely in-memory table (MongoDB, for example, offers an in-memory storage engine)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Index-organized table&lt;br&gt;
Unlike a conventional heap-organized table, where data is stored without a specific order, an index-organized table stores its data within a B-tree index structure, sorted by primary key. Each leaf block in this index structure contains both the primary key and the non-key columns of the table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;What is the future of pluggable storage engines in PostgreSQL?&lt;/strong&gt;&lt;br&gt;
The future of pluggable storage engines in PostgreSQL looks very promising. More initiatives from the PostgreSQL community are under way to improve the existing extensions, and the arrival of OrioleDB on the market may help address some of these challenges as well.&lt;/p&gt;

&lt;p&gt;Resources:&lt;br&gt;
&lt;a href="https://www.pgcon.org/2019/schedule/attachments/536_pgcon2019_pluggable_table_AM_V1.3.pdf"&gt;https://www.pgcon.org/2019/schedule/attachments/536_pgcon2019_pluggable_table_AM_V1.3.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://wiki.postgresql.org/wiki/Future_of_storage"&gt;https://wiki.postgresql.org/wiki/Future_of_storage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cybertec-postgresql.com/en/zheap-reinvented-postgresql-storage/"&gt;https://www.cybertec-postgresql.com/en/zheap-reinvented-postgresql-storage/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.enterprisedb.com/blog/zheap-storage-engine-provide-better-control-over-bloat"&gt;https://www.enterprisedb.com/blog/zheap-storage-engine-provide-better-control-over-bloat&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>data</category>
    </item>
    <item>
      <title>Concurrency Control Techniques in Postgresql: A Comprehensive Analysis</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Tue, 05 Dec 2023 20:38:00 +0000</pubDate>
      <link>https://dev.to/namsi/concurrency-control-techniques-in-postgresql-a-comprehensive-analysis-5fnb</link>
      <guid>https://dev.to/namsi/concurrency-control-techniques-in-postgresql-a-comprehensive-analysis-5fnb</guid>
      <description>&lt;p&gt;This article describes the behavior of the PostgreSQL database system when two or more sessions try to access the same data at the same time. The goals in that situation are to allow efficient access for all sessions while maintaining strict data integrity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is concurrency control&lt;/strong&gt;&lt;br&gt;
Concurrency control is an essential aspect of any database system: it ensures that transactions can be executed concurrently without leading to data inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some of the concurrency techniques used by PostgreSQL include:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;MVCC (Multi-Version Concurrency Control)&lt;/li&gt;
&lt;li&gt;Explicit Locking&lt;/li&gt;
&lt;li&gt;Transaction Isolation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Multi-version Concurrency Control (MVCC)&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Unlike traditional database systems, which use locks for concurrency control, PostgreSQL maintains data consistency by using a multiversion model (Multiversion Concurrency Control, MVCC). While querying the database, each transaction sees a snapshot of the data (a database version) as it was some time ago, regardless of the current state of the underlying data. This protects the transaction from viewing inconsistent data that could be caused by other concurrent transactions updating the same data rows, providing transaction isolation for each database session. MVCC is particularly well suited to read-heavy workloads.&lt;/p&gt;

&lt;p&gt;The main advantage to using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading.&lt;/p&gt;
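
&lt;p&gt;This can be observed with two concurrent psql sessions (a sketch; the accounts table is hypothetical, and Repeatable Read is used so the snapshot spans the whole transaction):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session 1: a long-running reader
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- takes a snapshot

-- Session 2: a concurrent writer, not blocked by the reader
UPDATE accounts SET balance = balance + 10 WHERE id = 1;

-- Session 1 again: still sees its original snapshot,
-- and at no point did either session wait on the other
SELECT balance FROM accounts WHERE id = 1;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;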

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Transaction Isolation&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Transaction isolation is a concept in most database management systems that defines the extent to which the operations within a transaction are isolated from the effects of other concurrently executing transactions. In other words, it ensures that the intermediate states of a transaction are not visible to other transactions until the transaction is committed.&lt;/p&gt;

&lt;p&gt;An isolation level, on the other hand, is a property that can be set for a transaction; it determines the degree to which the transaction is isolated from the effects of other concurrent transactions.&lt;/p&gt;

&lt;p&gt;The SQL standard defines four transaction isolation levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read Uncommitted&lt;/li&gt;
&lt;li&gt;Read Committed (the PostgreSQL default)&lt;/li&gt;
&lt;li&gt;Repeatable Read&lt;/li&gt;
&lt;li&gt;Serializable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These isolation levels are defined in terms of three phenomena that must be prevented between concurrent transactions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dirty read&lt;/strong&gt;&lt;br&gt;
A transaction reads data written by a concurrent uncommitted transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-repeatable read&lt;/strong&gt;&lt;br&gt;
A transaction re-reads data it has previously read and finds that data has been modified by another transaction (one that committed since the initial read).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phantom read&lt;/strong&gt;&lt;br&gt;
A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently committed transaction.&lt;/p&gt;

&lt;p&gt;In PostgreSQL you can request any of the four standard isolation levels, but Read Uncommitted behaves identically to Read Committed: because of the multiversion architecture, queries never see uncommitted data, so a weaker level would serve no purpose. (Before PostgreSQL 9.1, Repeatable Read likewise mapped to Serializable; since 9.1 the two are distinct, with Serializable adding detection of serialization anomalies on top of the Repeatable Read snapshot behaviour.) The actual isolation level you get may therefore be stricter than the one you select, which is the only sensible way to map the standard levels onto the multiversion concurrency control architecture.&lt;/p&gt;
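
&lt;p&gt;For example, an isolation level can be requested per transaction (a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Request a level as part of BEGIN
BEGIN ISOLATION LEVEL REPEATABLE READ;
SHOW transaction_isolation;  -- reports the requested level
COMMIT;

-- Or set it just after BEGIN, before the first query
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;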

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Read Committed Isolation Level&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
When a transaction runs at this isolation level, a SELECT query sees only data committed before the query began; it never sees uncommitted data or changes committed during query execution by concurrent transactions.&lt;/p&gt;

&lt;p&gt;The Read Committed isolation level ensures that when you execute a query (e.g., &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;), it only sees a snapshot of the database as of the moment the query starts running. Even within a single transaction, two consecutive SELECT queries might retrieve different data if other transactions commit changes during the execution of the first SELECT.&lt;/p&gt;

&lt;p&gt;For commands like UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE, they behave similarly to SELECT in terms of finding target rows. They locate rows committed as of the command's start time, but these rows may have been modified or deleted by other transactions by the time they are found. In such cases, the second transaction must wait for the first one to either commit or roll back. If the first transaction rolls back, the effects are undone, and the second transaction can proceed. If the first transaction commits, the second transaction will either ignore the row (if it was deleted) or apply its operation to the updated version of the row.&lt;/p&gt;

&lt;p&gt;Due to this behavior, an updating command may perceive an inconsistent snapshot, showing the effects of concurrent updates on the same rows it's trying to update, but not on other rows in the database. This makes Read Committed mode less suitable for commands with complex search conditions but suitable for simpler cases.&lt;/p&gt;

&lt;p&gt;In simpler terms, Read Committed ensures that when you interact with the database, you see a consistent view of the data at the moment your operation begins. However, it may not capture all changes happening concurrently in the database, especially for more complex queries.&lt;/p&gt;
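
&lt;p&gt;The per-statement snapshot behaviour of Read Committed can be seen with two sessions (a sketch; the accounts table is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Session 1 (Read Committed, the default)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- suppose this returns 1000

-- Session 2 commits a change in the meantime
UPDATE accounts SET balance = 900 WHERE id = 1;

-- Session 1: the second SELECT takes a fresh snapshot
-- and sees the newly committed value
SELECT balance FROM accounts WHERE id = 1;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;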

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Serializable Isolation Level&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The Serializable isolation level is the strictest level for database transactions. It simulates the idea of executing transactions one after another, in a serial manner, rather than simultaneously. However, applications using this level must be ready to retry transactions due to potential serialization failures.&lt;/p&gt;

&lt;p&gt;In Serializable mode, a SELECT query only sees data committed before the transaction began; it doesn't see uncommitted data or changes committed during the transaction's execution by other transactions. Successive SELECT commands within a single transaction always observe the same data, as they see a snapshot as of the start of the transaction.&lt;/p&gt;

&lt;p&gt;Commands like &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;SELECT FOR UPDATE&lt;/code&gt;, and &lt;code&gt;SELECT FOR SHARE&lt;/code&gt; behave similarly to &lt;code&gt;SELECT&lt;/code&gt; in terms of finding target rows. They locate rows committed as of the transaction start time, but these rows may have been updated, deleted, or locked by concurrent transactions. In such cases, the serializable transaction waits for the first updating transaction to commit or roll back. If the first transaction rolls back, the serializable transaction can proceed; if it commits, the serializable transaction is rolled back with an error message indicating a concurrent update. The application, upon receiving this error, should abort the current transaction and retry it from the beginning.&lt;/p&gt;

&lt;p&gt;It's important to note that only updating transactions may need to be retried; read-only transactions won't face serialization conflicts. While Serializable mode ensures a wholly consistent view of the database for each transaction, the application should be prepared to retry transactions when concurrent updates make serial execution impractical.&lt;/p&gt;

&lt;p&gt;Serializable mode is recommended when updating transactions are complex enough to potentially yield incorrect results in Read Committed mode. It's typically necessary when a transaction executes several successive commands that must observe identical views of the database.&lt;/p&gt;

&lt;p&gt;Below is a sample demonstration of the Serializable isolation level, run as two concurrent sessions (for example, two psql terminals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
-- Create a sample table; the CHECK constraint keeps balances non-negative
CREATE TABLE account (
    id serial PRIMARY KEY,
    balance integer CHECK (balance &amp;gt;= 0)
);

-- Insert some initial data
INSERT INTO account (balance) VALUES (1000), (2000);

-- Session 1: move 100 from account 1 to account 2
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Session 2: runs concurrently and reads and updates the same rows
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(balance) FROM account;
UPDATE account SET balance = balance - 100 WHERE id = 2;
-- If PostgreSQL cannot serialize the two transactions, COMMIT fails
-- with SQLSTATE 40001 (serialization_failure); the application should
-- then retry the whole transaction from the beginning.
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explicit Locking&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
PostgreSQL provides various lock modes to control concurrent access to data in tables. These modes can be used for application-controlled locking in situations where MVCC does not give the desired behavior. Also, most PostgreSQL commands automatically acquire locks of appropriate modes to ensure that referenced tables are not dropped or modified in incompatible ways while the command executes. &lt;/p&gt;

&lt;p&gt;Explicit locking involves several lock granularities and related concepts, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table-Level Locks&lt;/li&gt;
&lt;li&gt;Row-Level Locks&lt;/li&gt;
&lt;li&gt;Page-Level Locks&lt;/li&gt;
&lt;li&gt;Deadlocks&lt;/li&gt;
&lt;li&gt;Advisory Locks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Table-Level Locks&lt;br&gt;
Table-level locks are a type of lock that can be applied to an entire table. These locks are used to control access to a table, preventing multiple transactions from interfering with each other. There are different types of locks in PostgreSQL and they include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access Share lock&lt;/li&gt;
&lt;li&gt;Row Share lock&lt;/li&gt;
&lt;li&gt;Row Exclusive lock&lt;/li&gt;
&lt;li&gt;Share lock&lt;/li&gt;
&lt;li&gt;Share Update Exclusive lock, to name a few&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access Share&lt;/strong&gt;&lt;br&gt;
The ACCESS SHARE lock is acquired on a table by the PostgreSQL SELECT command. It is the weakest lock mode and conflicts only with ACCESS EXCLUSIVE; a transaction that holds only this lock is reading the table, not editing it.&lt;/p&gt;

&lt;p&gt;syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin;
lock table Username IN ACCESS SHARE MODE;
select * from Username;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Row Share&lt;/strong&gt;&lt;br&gt;
The ROW SHARE lock is acquired on a table by the SELECT FOR UPDATE and SELECT FOR SHARE statements. This lock conflicts with the EXCLUSIVE and ACCESS EXCLUSIVE locking modes.&lt;/p&gt;

&lt;p&gt;Syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin;
lock table Email IN ROW SHARE MODE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Row Exclusive&lt;/strong&gt;&lt;br&gt;
This lock conflicts with the SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE modes of PostgreSQL. It is acquired on a table by the UPDATE, DELETE, and INSERT statements.&lt;/p&gt;

&lt;p&gt;syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin;
lock table Email IN ROW EXCLUSIVE MODE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Share&lt;/strong&gt;&lt;br&gt;
This lock conflicts with the ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE modes. This lock mode is acquired by PostgreSQL’s CREATE INDEX command (without CONCURRENTLY).&lt;/p&gt;

&lt;p&gt;Syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin;
lock table Email IN SHARE MODE;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Page Level Locks&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In addition to table- and row-level locks, there are two types of page-level locks, share and exclusive, which limit read/write access to table pages in the shared buffer pool. These locks are released immediately after a row is fetched or updated. Application developers normally need not be concerned with page-level locks; they are mentioned here only for completeness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-Level Locks&lt;/strong&gt;&lt;br&gt;
Row-level locks are a type of lock that can be applied to individual rows within a table. Row-level locks provide a more granular level of control over concurrent access to data, allowing transactions to lock specific rows rather than the entire table. This can be useful in scenarios where you want to prevent other transactions from modifying specific rows concurrently.&lt;/p&gt;

&lt;p&gt;There are two main row-level lock modes used in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FOR UPDATE&lt;/strong&gt;&lt;br&gt;
When a transaction acquires a FOR UPDATE lock on a row, it indicates an intention to update the row. Other transactions attempting to acquire a FOR UPDATE lock on the same row will be blocked until the lock is released.&lt;br&gt;
This lock mode is used in situations where a transaction wants to ensure that it has exclusive rights to modify a particular row.&lt;/p&gt;

&lt;p&gt;syntax for &lt;code&gt;FOR UPDATE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
SELECT * FROM your_table WHERE key_column = 'some_value' FOR UPDATE;
-- Perform updates or other operations on the selected row(s)
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;FOR SHARE&lt;/strong&gt;&lt;br&gt;
The FOR SHARE lock mode is less restrictive than FOR UPDATE. It allows multiple transactions to acquire shared locks on the same row simultaneously.&lt;br&gt;
Transactions using FOR SHARE can read the locked row, but they cannot acquire a FOR UPDATE lock on the same row until the shared lock is released.&lt;/p&gt;

&lt;p&gt;syntax for &lt;code&gt;FOR SHARE&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
SELECT * FROM your_table WHERE key_column = 'some_value' FOR SHARE;
-- Perform read operations on the selected row(s)
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Data Consistency Checks&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Enforcing consistency through Serializable Transactions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When using the Serializable isolation level, routing database access through a framework that automatically retries transactions rolled back with a serialization failure avoids placing an unnecessary burden on application programmers. It may be a good idea to set default_transaction_isolation to serializable. It would also be wise to ensure that no other transaction isolation level is used, either inadvertently or to subvert integrity checks, by checking the transaction isolation level in triggers.&lt;/p&gt;
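
&lt;p&gt;As a sketch (mydb is a placeholder database name), the default can be set per database or per session, and then verified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Per database: applies to new sessions connecting to mydb
ALTER DATABASE mydb SET default_transaction_isolation TO 'serializable';

-- Per session
SET default_transaction_isolation TO 'serializable';

-- Verify the current default
SHOW default_transaction_isolation;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;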

&lt;p&gt;&lt;em&gt;Enforcing consistency through Explicit Blocking Locks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When dealing with non-serializable writes in PostgreSQL, it's crucial to use mechanisms like SELECT FOR UPDATE, SELECT FOR SHARE, or appropriate LOCK TABLE statements to ensure the current validity of a row and protect it from concurrent updates. These techniques help in avoiding conflicts when multiple transactions are attempting to modify the same data simultaneously.&lt;/p&gt;

&lt;p&gt;It's important to note that SELECT FOR UPDATE and SELECT FOR SHARE specifically lock the returned rows against concurrent updates, while LOCK TABLE locks the entire table. When migrating applications from other database environments to PostgreSQL, understanding and adapting to these concurrency control mechanisms is essential.&lt;/p&gt;

&lt;p&gt;One notable aspect is that using SELECT FOR UPDATE doesn't guarantee that a concurrent transaction won't update or delete a selected row. In PostgreSQL, ensuring such protection involves actually updating the row, even if no changes to the values are needed. SELECT FOR UPDATE temporarily prevents other transactions from acquiring the same lock or executing conflicting operations, but the protection is lifted once the holding transaction commits or rolls back, unless a real UPDATE of the row occurred during the lock.&lt;/p&gt;

&lt;p&gt;Note that if you are relying on explicit locking to prevent concurrent changes in PostgreSQL, you should either use Read Committed mode or, in Repeatable Read mode, take care to acquire locks before executing queries. In Repeatable Read mode, holding a lock ensures that no other transactions modifying the table are still active. However, if the transaction's snapshot predates obtaining the lock, it might miss changes that have since been committed to the table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Concurrency control in PostgreSQL plays a pivotal role in upholding data consistency and isolation within a multi-user environment. As a robust relational database management system, PostgreSQL employs a spectrum of techniques, including Multi-version Concurrency Control (MVCC), Transaction Isolation and Explicit Locking to adeptly manage concurrent transactions.&lt;/p&gt;

&lt;p&gt;PostgreSQL's concurrency control mechanisms extend beyond mere data manipulation to address critical challenges in complex scenarios. By skillfully managing transactions and utilizing MVCC, PostgreSQL guarantees data consistency and isolation. This is particularly beneficial for applications that require concurrent access to data while upholding the highest standards of integrity. The reliability and versatility of PostgreSQL's concurrency control make it a preferred choice for diverse applications demanding robust concurrent data management.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>data</category>
      <category>sql</category>
    </item>
    <item>
      <title>Comprehensive Guide :Getting Started with Apache Age In Java</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Sun, 26 Nov 2023 12:43:16 +0000</pubDate>
      <link>https://dev.to/namsi/comprehensive-guide-getting-started-with-apache-age-in-java-2f29</link>
      <guid>https://dev.to/namsi/comprehensive-guide-getting-started-with-apache-age-in-java-2f29</guid>
      <description>&lt;p&gt;As we all have come to learn Apache Age is a force to reckon with in the realm of data analysis and management .Apache age as we all know is a postgresql extension which it's capabilities is to  integrate graph database capabilities into the familiar and robust PostgreSQL environment, empowering users to harness the power of graphs without sacrificing the familiarity and performance of PostgreSQL.&lt;/p&gt;

&lt;p&gt;In this article we are going to describe the sample template guide on how to get started with apache age using programming language such as java.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisite needed to get started with Apache Age in java&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: install version 12 or higher&lt;/li&gt;
&lt;li&gt;Java Development Kit: install JDK 8 or higher&lt;/li&gt;
&lt;li&gt;Apache AGE: install the Apache AGE extension version compatible with your PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is recommended that you install the above tools, as they are essential prerequisites for working with the JDBC driver and AGE. Once you have set up your PostgreSQL, JDK, and AGE environment, you can follow the steps below to set up the JDBC driver and start working with graph data in your Java applications.&lt;/p&gt;

&lt;p&gt;Next we are going to install the JDBC driver. The Apache AGE JDBC driver enables users to connect to an AGE database from a Java application. It provides a standard interface for accessing and manipulating the data stored in an AGE database, allowing developers to work with both relational and graph data models using the familiar JDBC API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Setting up the JDBC driver&lt;/strong&gt;&lt;br&gt;
To set up the environment for using JDBC with Apache AGE, the following jar files and packages must be installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gradle build tool&lt;/li&gt;
&lt;li&gt;postgres JDBC driver&lt;/li&gt;
&lt;li&gt;antlr4-4.9.2-complete&lt;/li&gt;
&lt;li&gt;common-lang3&lt;/li&gt;
&lt;li&gt;commons-text-1.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NB: ensure you unzip the jars if they are zipped before using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Build from source&lt;/strong&gt;&lt;br&gt;
After unzipping the jar files, clone the official Apache AGE repository. The repository contains the necessary source code and build scripts for compiling the driver and creating a JAR file that you can use in your Java applications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/apache/age.git
cd age/drivers/jdbc

gradle assemble
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the build completes successfully, a jar file will be created at age/drivers/jdbc/lib/build/libs/lib.jar. Add this JAR file to the classpath of your Java project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Connect to the PostgreSQL database and create a graph&lt;/strong&gt;&lt;br&gt;
To connect to the PostgreSQL database from Java, we can use the following code to create a connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import org.apache.age.jdbc.base.Agtype;
import org.postgresql.jdbc.PgConnection;

import java.sql.*;

public class studentDb {
    static final String DB_URL = "jdbc:postgresql://localhost:5432/students-demo";
    static final String USER = "postgres";
    static final String PASS = "pass";

    public static void main(String[] args) {

        // Open a connection
        try {

            PgConnection connection = DriverManager.getConnection(DB_URL, USER, PASS).unwrap(PgConnection.class);
            connection.addDataType("agtype", Agtype.class);

            // configure AGE
            Statement stmt = connection.createStatement();
            stmt.execute("CREATE EXTENSION IF NOT EXISTS age;");
            stmt.execute("LOAD 'age'");
            stmt.execute("SET search_path = ag_catalog, \"$user\", public;");

            // Create the graph if it does not already exist
            // (create_graph raises an error when called for an existing graph)
            stmt.execute("SELECT create_graph('stud_graph') WHERE NOT EXISTS"
                    + " (SELECT 1 FROM ag_catalog.ag_graph WHERE name = 'stud_graph');");

            // Run a cypher query returning every vertex in the graph
            ResultSet rs = stmt.executeQuery("SELECT * from cypher('stud_graph', $$ MATCH (n) RETURN n $$) as (n agtype);");

            while (rs.next()) {

                // Returning Result as Agtype
                Agtype returnedAgtype = rs.getObject(1, Agtype.class);

                String nodeLabel = returnedAgtype.getMap().getObject("label").toString();
                String nodeProp =  returnedAgtype.getMap().getObject("properties").toString();

                System.out.println("Vertex : " + nodeLabel + ", \tProps : " + nodeProp);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code creates a connection to the PostgreSQL database, configures AGE, and finally executes a cypher query against the graph.&lt;/p&gt;

&lt;p&gt;This is a simple example of how to get started with Apache AGE in Java, and it can serve as a starting guide for implementing Apache AGE in your own Java applications. In the next articles we will explore further how Apache AGE can be used in Java across various use cases.&lt;/p&gt;

</description>
      <category>apacheage</category>
      <category>java</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>Securing Postgresql Database</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Wed, 22 Nov 2023 14:03:12 +0000</pubDate>
      <link>https://dev.to/namsi/securing-postgresql-database-47en</link>
      <guid>https://dev.to/namsi/securing-postgresql-database-47en</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Database Security&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This concerns with the use of broad range of information security controls to protect database potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural/administrative and physical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the importance of database security?&lt;/strong&gt;&lt;br&gt;
Database security is crucial for the following reasons:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Regulatory compliance&lt;/em&gt;&lt;br&gt;
With the General Data Protection Regulation (GDPR) coming into force on 25 May 2018, data security is more important than ever before. Under the GDPR, organizations must ensure that their systems keep personal data secure and confidential, which makes strong database security essential.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Reputation damage due to data breaches&lt;/em&gt;&lt;br&gt;
Data breaches can also damage your brand and reputation. According to one study, almost 80% of consumers would take their business elsewhere if a company suffered a data breach that put their information at risk. It is therefore important that organisations keep their database systems secure from external threats at all times, to prevent breaches that could destroy the organisation's reputation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Protect organisational data&lt;/em&gt; &lt;br&gt;
To protect organizational data, which contains trade information and customer data. The data can be accessed by cybercriminals for malicious reasons, compromising customer privacy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threats in Database security&lt;/strong&gt;&lt;br&gt;
Attacks on database systems come in many forms, from malware-infected computer systems to targeted cyberattacks on the database systems themselves. Some of the most common types of attack include:&lt;/p&gt;

&lt;p&gt;SQL injection: This is when an attacker inserts malicious code into a website’s login page to obtain information.&lt;/p&gt;

&lt;p&gt;Denial of service (DoS): This is an attempt to overload a database or website with superfluous requests, causing it to slow down or even shut down.&lt;/p&gt;

&lt;p&gt;Database exploitation: This refers to the unauthorized use of sensitive data.&lt;/p&gt;

&lt;p&gt;Code modification: This occurs when attackers change the code of a database.&lt;/p&gt;

&lt;p&gt;Malicious insiders: These are people who have been given access to a database but misuse their privileges.&lt;/p&gt;

&lt;p&gt;Hidden flaws in code: These can be exploited by attackers to steal sensitive data or cause damage to your system.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Deep dive into the mechanisms used in PostgreSQL database security&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Database security in PostgreSQL is addressed at several levels, which include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Database file protection. All files stored within the database are protected from reading by any account other than the Postgres superuser account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connections from a client to the database server are, by default, allowed only via a local Unix socket, not via TCP/IP sockets. Historically the backend had to be started with the -i option to allow non-local clients to connect; in current versions this is controlled by the listen_addresses configuration parameter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client connections can be restricted by IP address and/or user name via the pg_hba.conf file in the PGDATA directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Client connections may be authenticated via other external packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each user in Postgres is assigned a username and (optionally) a password. By default, users do not have write access to databases they did not create.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Users may be assigned to groups, and table access may be restricted based on group privileges.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
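
&lt;p&gt;As a minimal sketch of the pg_hba.conf rules mentioned above (the database name, user, and subnet are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TYPE  DATABASE  USER     ADDRESS         METHOD
# Local Unix-socket connections, authenticated by OS user.
local   all       all                      peer
# Allow appuser to reach appdb from the local subnet, with password auth.
host    appdb     appuser  192.168.0.0/24  scram-sha-256
# Reject everything else coming over TCP/IP.
host    all       all      0.0.0.0/0       reject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;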

&lt;p&gt;&lt;strong&gt;Network-level security&lt;/strong&gt;&lt;br&gt;
The network-level security mechanism has three main components: the Unix domain socket, the TCP/IP socket, and the firewall deployed on the host machine.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Unix Domain sockets&lt;/em&gt;&lt;br&gt;
Unix domain sockets (UDS) are the feature that enables local connections to a PostgreSQL server on a Unix-based system. A UDS appears as a special file in the filesystem. The socket owner is the OS user that runs the Postgres server, and the only way to access that UDS file directly is from the machine where it resides.&lt;/p&gt;

&lt;p&gt;A UDS is assigned the same access controls as other files on the filesystem, although it only requires write permission. You can control and manage a UDS using the unix_socket_group and unix_socket_permissions configuration options. You can also change access controls on the directory hosting the socket to change its permissions.&lt;br&gt;
PostgreSQL lets you create multiple sockets via the unix_socket_directories option. You can specify several directories, each with its own permissions, to define separate permissions for different applications or groups of users.&lt;/p&gt;
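
&lt;p&gt;For illustration, the socket options above could be combined in the configuration file as follows (the directory and OS group names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Place the socket in a dedicated directory.
unix_socket_directories = '/var/run/postgresql'

# Restrict the socket to its owner and a hypothetical pg_clients group.
unix_socket_group = 'pg_clients'
unix_socket_permissions = 0770
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;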

&lt;p&gt;&lt;em&gt;Restricting listening addresses&lt;/em&gt;&lt;br&gt;
For example, if the node that PostgreSQL is running on has several network interfaces, you can make use of the &lt;code&gt;listen_addresses&lt;/code&gt; parameter of the configuration file. This parameter ensures that the server listens only on those addresses that clients will use to connect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_addresses = 'localhost, 192.168.0.1'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If users connecting to the database are always on the same node, you can disable listening to TCP sockets. Therefore, the server will accept connections from Unix domain sockets only, and not from the network.&lt;/p&gt;

&lt;p&gt;To achieve this, set an empty string as the address list the server should listen on, as below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;listen_addresses = ''&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The following methods are helpful to ensure an appropriate level of security at the network level and protect your Postgres databases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Firewalls&lt;/em&gt;&lt;br&gt;
A firewall lets you define rules for what inbound and outbound traffic is allowed. There are several parameters you can use to define these rules, including the local port (the default is 5432 in PostgreSQL), the protocol (typically TCP), and the source address (typically a list of subnets or addresses). To ensure your database remains protected, strictly control external access to the server.&lt;/p&gt;

&lt;p&gt;One of the methods used for database security with firewalls is the &lt;em&gt;iptables utility&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A sample implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Make sure to keep the established connections valid.
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow SSH.
iptables -A INPUT -p tcp -m state --state NEW --dport 22 -j ACCEPT

# Allow PostgreSQL.
iptables -A INPUT -p tcp -m state --state NEW --dport 5432 -j ACCEPT

# Allow all outbound, drop everything else inbound.
iptables -A OUTPUT -j ACCEPT
iptables -A INPUT -j DROP
iptables -A FORWARD -j DROP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Database-level security&lt;/strong&gt;&lt;br&gt;
At the database level we are going to look at some of the mechanisms that can be used to secure your database, including:&lt;br&gt;
roles and permissions, auditing, and row-level security.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Auditing&lt;/em&gt;&lt;br&gt;
Keeping an audit trail ensures that you have accurate and detailed records. You can monitor several aspects of Postgres, including the server itself, by enabling verbose logging.&lt;/p&gt;

&lt;p&gt;One of the most efficient options is keeping a detailed log. To enable it, add the following parameters to the server configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; Log successful and unsuccessful connection attempts.
log_connections = on

; Log terminated sessions.
log_disconnections = on

; Log all executed SQL statements.
log_statement = all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Row-level security&lt;/em&gt;&lt;br&gt;
The row-level security (RLS) method is used to set up privileges on table rows. It relates both to existing rows returned by the SELECT command and to new rows resulting from the INSERT, UPDATE, or DELETE commands. To apply this method, you need to enable RLS for the table and configure access (create a policy).&lt;/p&gt;

&lt;p&gt;When implementing row-level security policies, these are some of the aspects you should consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the CREATE POLICY command—to define new row-level security policies for each table.&lt;/li&gt;
&lt;li&gt;The policy is selective, only applying to rows that match a predefined SQL expression.&lt;/li&gt;
&lt;li&gt;Use ALTER TABLE … ENABLE ROW LEVEL SECURITY—to enable row-level security on the table. If this option is not enabled, your policy cannot be applied to the table.&lt;/li&gt;
&lt;li&gt;USING statements are used to check existing table rows for the policy expression.&lt;/li&gt;
&lt;li&gt;WITH CHECK statements are used to check new rows.&lt;/li&gt;
&lt;li&gt;Policy names can be repeated across tables—you can create a policy with the same name and reuse it across multiple tables.&lt;/li&gt;
&lt;li&gt;Each policy grants permission to specific database operations, such as DELETE, UPDATE, or SELECT.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of the parameters to consider while implementing row-level security include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name—this is the name of the policy.&lt;/li&gt;
&lt;li&gt;table_name—the name of the table the policy is applied to.&lt;/li&gt;
&lt;li&gt;command—the command the policy applies to: ALL (the default), SELECT, INSERT, UPDATE, or DELETE.&lt;/li&gt;
&lt;li&gt;role_name—the default is PUBLIC, which means the policy applies to all database users.&lt;/li&gt;
&lt;li&gt;using_expression—a SQL expression that returns boolean. Each existing row is checked against this expression—if it returns false, the row is silently suppressed and cannot be viewed or modified by the user. No error is returned.&lt;/li&gt;
&lt;li&gt;check_expression—a SQL expression returning boolean, used when INSERT or UPDATE operations are performed on the table. Rows are allowed if the expression returns true; if it returns false, an error is returned.&lt;/li&gt;
&lt;li&gt;RESTRICTIVE—lets you create a restrictive policy. Restrictive policies can be applied together to a single query. You can use this parameter to restrict access to tables.&lt;/li&gt;
&lt;/ul&gt;
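
&lt;p&gt;Putting the steps above together, a minimal sketch might look like this (the table, column, and policy names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- A table where each row belongs to one manager.
CREATE TABLE accounts (manager TEXT, balance NUMERIC);

-- Enable RLS; without this, policies on the table have no effect.
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

-- Only let a user see and write rows whose manager matches their login.
CREATE POLICY manager_rows ON accounts
    USING (manager = current_user)
    WITH CHECK (manager = current_user);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;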

&lt;p&gt;&lt;em&gt;Roles and permissions&lt;/em&gt;&lt;br&gt;
PostgreSQL comes with a built-in permission system designed around the concept of roles. A database account name is treated as a role, and it carries a LOGIN attribute that enables the role to connect to the database. Roles can be assigned other attributes that grant them certain types of privileges.&lt;/p&gt;

&lt;p&gt;A sample implementation of how roles are created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE USER jane;
Or
CREATE ROLE jane LOGIN;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
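
&lt;p&gt;Building on the role example above, privileges can then be granted through a group role. A hedged sketch (the readonly role name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- A group role that cannot log in by itself.
CREATE ROLE readonly NOLOGIN;

-- Grant read access on existing tables in the public schema.
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

-- Make jane a member of the group.
GRANT readonly TO jane;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;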



&lt;p&gt;&lt;strong&gt;Transport-Level Security&lt;/strong&gt;&lt;br&gt;
Transport Layer Security (TLS) is a secure protocol for safe communication over the internet. PostgreSQL supports it for data transport and database connections (it is called SSL in the documentation). When enabled, it allows the client to authenticate itself to the server (and vice versa) using certificates issued by a trusted certificate authority.&lt;/p&gt;

&lt;p&gt;To enable TLS for PostgreSQL connections, you’ll need a server key and certificate. Protect the key and certificate with a passphrase, which can either be entered manually when the server starts, or automatically, by writing a script that uses the ssl_passphrase_command configuration parameter.&lt;/p&gt;

&lt;p&gt;There are two transport-level security configurations that can be applied to your database:&lt;/p&gt;

&lt;p&gt;Client TLS.&lt;br&gt;
Server TLS.&lt;/p&gt;
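
&lt;p&gt;A minimal server-side TLS configuration might look like the following (the file names assume the key and certificate live in the data directory; the passphrase script path is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enable TLS ("SSL") on the server.
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'

# Optional: obtain the key passphrase from an external command.
# ssl_passphrase_command = '/usr/local/bin/pg_passphrase.sh'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;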

&lt;p&gt;&lt;strong&gt;Best postgresql security practices&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Prevent external connections to the database&lt;/em&gt;&lt;br&gt;
You can disable remote access to your database using pg_hba.conf. If you need remote access, you can use SSH to access the database host machine, then use a local connection. You can also use SSH to configure tunnel access to your PostgreSQL database.&lt;/p&gt;
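
&lt;p&gt;For example, a local tunnel to a remote database host could be opened like this (the host, user, and database names are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Forward local port 6000 to port 5432 on the database host.
ssh -L 6000:localhost:5432 admin@db-host.example.com

# Then connect through the tunnel as if the database were local.
psql -h localhost -p 6000 -U jane appdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;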

&lt;p&gt;You can also configure listen_addresses to localhost, or to the specific host machine running the application that uses the database. The server will then reject connection attempts originating from any machine other than the PostgreSQL host or another known host. This can help prevent unauthorized access to your database and any underlying data structures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Staying alert to critical security updates and patches&lt;/em&gt;&lt;br&gt;
Regularly check for updates and patches. This is a good practice because it lets you see which updates are important for the secure and efficient working of your PostgreSQL databases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Restricting database logs from revealing more information than intended.&lt;/em&gt;&lt;br&gt;
Ensure you use standard practices to administer your database, to prevent revealing sensitive information in database logs. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Using one-way encryption for values that do not need to be decrypted&lt;/em&gt;&lt;br&gt;
Data protected with two-way encryption techniques such as AES can be decrypted. Hash-based techniques, on the other hand, such as MD5 (or, preferably, a modern salted algorithm such as bcrypt), perform one-way encryption. Ensure you use the most appropriate technique for each type of data. For example, use two-way encryption for data that is needed for analysis, and one-way hashing for user passwords.&lt;/p&gt;
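
&lt;p&gt;With the pgcrypto extension, both approaches can be sketched in SQL (the key and passwords shown are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Two-way: the value can be recovered with the same key.
SELECT pgp_sym_decrypt(
    pgp_sym_encrypt('4111 1111 1111 1111', 'secret_key'),
    'secret_key');

-- One-way: store a salted bcrypt hash of a password instead of the password.
SELECT crypt('user_password', gen_salt('bf'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;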

&lt;p&gt;&lt;strong&gt;Data security technologies that can help secure your PostgreSQL database:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are various data security technologies that organisations can use to secure and protect their database systems from external and internal threats and some of the technologies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data encryption.&lt;/li&gt;
&lt;li&gt;Data masking.&lt;/li&gt;
&lt;li&gt;Tokenization.&lt;/li&gt;
&lt;li&gt;Authentication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Data Masking&lt;/em&gt;&lt;br&gt;
Masking can help protect your data from exposure to unauthorized or malicious sources, externally or internally. Masking can be applied to personally identifiable information (PII), such as a phone number or email address, by obscuring parts of the PII, e.g., the first eight digits or letters within a database.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tokenization&lt;/em&gt;&lt;br&gt;
Tokenization replaces data with random characters. The “token,” which maps to the original data, is stored separately in a database lookup table, where it is protected from unauthorized access.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Authentication&lt;/em&gt;&lt;br&gt;
This is the process of confirming or validating user login credentials to make sure they match the information stored in the database. User credentials include usernames, passwords, PINS, security tokens, swipe cards, biometrics, etc.&lt;/p&gt;

&lt;p&gt;Authentication is a frontline defense against unauthorized access to confidential and sensitive information, making it an important process. Authentication technologies, such as single sign-on, multi-factor authentication, and breached password detection make it simpler to secure the authentication process while maintaining user convenience.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data encryption&lt;/em&gt;&lt;br&gt;
Data encryption uses an algorithm to scramble every data character, converting information into an unreadable format. Encryption keys, held by authorized users only, are needed to decrypt the data before the files can be read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Database security is, and should be, a top priority for any organisation managing a database system. You therefore need to apply the appropriate measures and set them to work together as solid barriers against intruders. Such barriers will block hackers and hinder them as much as possible from accessing your database systems.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
      <category>security</category>
    </item>
    <item>
      <title>Partition Strategies: Fundamental Concepts for PostgreSQL Partitioning</title>
      <dc:creator>Namsi Lydia</dc:creator>
      <pubDate>Fri, 17 Nov 2023 18:57:53 +0000</pubDate>
      <link>https://dev.to/namsi/partition-strategies-fundamental-concepts-for-postgresql-partitioning-3l24</link>
      <guid>https://dev.to/namsi/partition-strategies-fundamental-concepts-for-postgresql-partitioning-3l24</guid>
      <description>&lt;p&gt;&lt;strong&gt;Table Partitioning&lt;/strong&gt;&lt;br&gt;
This is a database design technique used to divide a large table into smaller, more manageable chunks called partitions. Each partition is essentially a separate table that stores a subset of the original data. This technique can significantly improve query performance and data management for large datasets.&lt;/p&gt;

&lt;p&gt;Partitioning can be done in a variety of ways depending on the specific context, and the best partitioning method for a particular application will depend on the specific requirements of that application. In some cases, it may be beneficial to use a combination of partitioning methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Factors to consider when to partition your database in postgresql&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Query performance degradation&lt;/strong&gt;&lt;br&gt;
You can consider partitioning your database when you notice your queries starting to slow down, especially those that should only touch a subset of your data. Partitioning can significantly enhance query performance when your daily queries include searches based on a specific range or criteria. Let's say you're dealing with time-series data: partitioning by date can help you quickly retrieve records within a particular time frame without scanning the entire table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Maintenance Overhead&lt;/strong&gt;&lt;br&gt;
As a table grows, maintenance operations like VACUUM, ANALYZE, and indexing can take longer and might start impacting your operational efficiency. Partitioning can simplify these operations because you can focus on maintaining smaller partitions independently, reducing the impact on your database's overall performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. To reduce the use of memory&lt;/strong&gt;&lt;br&gt;
If you want to perform various operations on your data using less memory, you might benefit from partitioning, as smaller indexes and data chunks fit better in memory and improve cache hit rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Large tables&lt;/strong&gt;&lt;br&gt;
As we know, managing tables with large data sets can be cumbersome, so it may be beneficial to partition your tables to make your data easier to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. High ingestion rate&lt;/strong&gt;&lt;br&gt;
At times the current table size might not be massive, but a high data ingestion rate may indicate that the table will grow significantly in the future. Implementing a partitioning strategy early may help you manage this growth before it begins to affect your performance and maintenance operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best practices to partitioning your data in postgresql&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Optimize your queries&lt;/strong&gt;&lt;br&gt;
Ensure that you analyze and understand the query execution plan, and validate that only the necessary partitions are being scanned.&lt;/p&gt;
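
&lt;p&gt;For instance, assuming a table partitioned by an enrollment date (the names are illustrative), the plan should show that pruning leaves only the relevant partitions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Only partitions covering 2023 should appear in the plan.
EXPLAIN
SELECT count(*) FROM students
WHERE enrollment BETWEEN DATE '2023-01-01' AND DATE '2023-12-31';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;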

&lt;p&gt;&lt;strong&gt;2. Choose the right partitioning key&lt;/strong&gt;&lt;br&gt;
Choose a key that aligns with your query patterns. For instance, if most of your queries filter by date, a timestamp or date column would be an ideal partitioning key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Choose the right partition size&lt;/strong&gt;&lt;br&gt;
This is one of the most integral factors when partitioning your data: ask yourself what your ideal partition size is. PostgreSQL can handle large data sets, but you should still consider what partition size you want to work with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partitioning Methods in PostgreSQL&lt;/strong&gt;&lt;br&gt;
PostgreSQL offers various partitioning methods one can work with, and they include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List Partitioning.&lt;/li&gt;
&lt;li&gt;Range Partitioning.&lt;/li&gt;
&lt;li&gt;Hash Partitioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Understanding PostgreSQL partitions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How to Create a Partitioned Table&lt;br&gt;
First, use CREATE TABLE and specify the partition key and partition type. &lt;/p&gt;

&lt;p&gt;Then create each partition using CREATE TABLE ... PARTITION OF, specifying the bounds for each partition.&lt;br&gt;
You can use the following code snippets to create the tables:&lt;/p&gt;

&lt;p&gt;For example, first create the main table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE main_table_name (
column_1 data type,
column_2 data type,
) PARTITION BY RANGE (column_2);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secondly, create the partition table as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE partition_name
PARTITION OF main_table_name FOR VALUES FROM (start_value) TO (end_value);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;after creating the main table and partition we are going to define the various partition methods&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List Partitioning&lt;/strong&gt;&lt;br&gt;
A list partition is where data is partitioned based on specified discrete values. This is used when you need to group discrete data, such as regions, with arbitrary values.&lt;/p&gt;

&lt;p&gt;An example scenario and implementation of partitioning by list:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use case: students in an institution&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE students(
studentId INTEGER, 
status TEXT,
studentNAME TEXT,
course TEXT,
enrollment DATE
) PARTITION BY STATUS(course);

CREATE TABLE students_active  PARTITION OF students FOR VALUES IN ('ACTIVE');

CREATE TABLE students_completed PARTITION OF students FOR VALUES IN ('COMPLETED');

CREATE TABLE students_other PARTITION OF students DEFAULT;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RANGE PARTITION&lt;br&gt;
Partitioning by range is when data is partitioned into segments based on a chosen range of values. It is useful when you need to access time-series data.&lt;/p&gt;

&lt;p&gt;An example scenario and implementation of partitioning by range:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use case: students in an institution&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE students_enrol PARTITION OF students
FOR VALUES FROM ('2021-01-01') TO ('2023-02-01');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HASH PARTITION&lt;br&gt;
Partitioning by hash is where data is partitioned by supplying a modulus and a remainder. Each partition will contain the rows for which the hash value of the partition key, divided by the specified modulus, produces the given remainder. This can be beneficial when you want to avoid concentrating access on a single table by distributing data almost evenly.&lt;/p&gt;

&lt;p&gt;An example scenario and implementation of partitioning by hash:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use case: students in an institution&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE students (
studentId INTEGER, 
status TEXT,
studentNAME TEXT,
course TEXT,
enrollment DATE
) PARTITION BY HASH (studentId);

CREATE TABLE students_1 PARTITION OF students
FOR VALUES WITH (MODULUS 3, REMAINDER 0);

CREATE TABLE students_2 PARTITION OF students
FOR VALUES WITH (MODULUS 3, REMAINDER 1);

CREATE TABLE students_3 PARTITION OF students
FOR VALUES WITH (MODULUS 3, REMAINDER 2);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using hashes, the partitions will receive approximately the same number of rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of table partitioning in PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced storage requirements:&lt;/strong&gt; Partitioning can also reduce storage requirements by allowing you to archive or delete older partitions. For example, if you partition a table by year, you can archive or delete partitions for older years that you no longer need to access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easier maintenance:&lt;/strong&gt; Partitioning can make it easier to maintain large tables by breaking them down into smaller, more manageable pieces. For example, you can update or migrate data in one partition without affecting the data in other partitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved performance:&lt;/strong&gt; Partitioning can improve performance by reducing the amount of data that needs to be scanned for each query. For example, if you partition a table by date, then a query that only needs to return data from the past year will only need to scan the partitions for that year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Partitioning can be a powerful tool for improving the performance, manageability, and storage requirements of large PostgreSQL tables. However, it is important to carefully consider the partitioning strategy and the partition key columns before creating a partitioned table.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>datamanagement</category>
    </item>
  </channel>
</rss>
