<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ksheroz</title>
    <description>The latest articles on DEV Community by ksheroz (@ksheroz).</description>
    <link>https://dev.to/ksheroz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1108111%2F335507b3-e8f0-479c-8ee6-bf824c04f02d.png</url>
      <title>DEV Community: ksheroz</title>
      <link>https://dev.to/ksheroz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ksheroz"/>
    <language>en</language>
    <item>
      <title>Debugging C/C++ on M1 Macs</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Sat, 23 Sep 2023 20:24:48 +0000</pubDate>
      <link>https://dev.to/ksheroz/debugging-cc-on-m1-macs-32bp</link>
      <guid>https://dev.to/ksheroz/debugging-cc-on-m1-macs-32bp</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Recently I had to debug a project written in C, but this time around I was using my new M1 Mac, and it turns out the old familiar &lt;code&gt;gdb&lt;/code&gt; debugger is not available on it (GDB does not support Apple Silicon). I had to dig deep to find a solution, and I'm writing this article so you don't have to do the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;You can either use some fancy workarounds to get GDB running on M1, or, as a simpler and recommended solution, use &lt;code&gt;lldb&lt;/code&gt; instead of &lt;code&gt;gdb&lt;/code&gt;. &lt;code&gt;lldb&lt;/code&gt; is similar to &lt;code&gt;gdb&lt;/code&gt; and fully supported on M1 Macs. Its commands are also similar to &lt;code&gt;gdb&lt;/code&gt;'s, so you won't have many issues. &lt;/p&gt;
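
&lt;p&gt;As a quick illustration, here is what a minimal session might look like (the binary &lt;code&gt;a.out&lt;/code&gt; and the variable &lt;code&gt;x&lt;/code&gt; are just placeholders for your own program):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ clang -g main.c                    # compile with debug symbols
$ lldb ./a.out                       # start lldb with your binary
(lldb) breakpoint set --name main    # same idea as 'b main' in gdb
(lldb) run                           # start the program
(lldb) next                          # step over, like gdb's 'next'
(lldb) print x                       # inspect a variable
(lldb) bt                            # backtrace
(lldb) quit
&lt;/code&gt;&lt;/pre&gt;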

&lt;p&gt;Here's the documentation for it:&lt;br&gt;
&lt;a href="https://lldb.llvm.org"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Heap Only Tuple and Index-Only Scans (Part 7)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Tue, 11 Jul 2023 15:25:05 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-heap-only-tuple-and-index-only-scans-part-7-5f55</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-heap-only-tuple-and-index-only-scans-part-7-5f55</guid>
      <description>&lt;p&gt;In the previous part we learnt about VACUUM processing that helps with cleaning the dead tuples. Heap Only Tuple (HOT) was introduced in version 8.3 to reduce the need for VACUUM processing. It optimises the usage of index and table pages. Index-Only Scans were introduced in 9.2 to reduce the I/O costs by directly using the index key with the table pages when all entries of SELECT are available in the index. In this part lets explore these two features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Only Tuple (HOT)
&lt;/h2&gt;

&lt;p&gt;When a tuple is updated and the new version can be placed on the same page as the old one, HOT avoids inserting a new index tuple and instead links the old tuple to the new one within the page. It uses two bits to manage this state: the &lt;code&gt;HEAP_HOT_UPDATED&lt;/code&gt; bit, set on the old tuple, and the &lt;code&gt;HEAP_ONLY_TUPLE&lt;/code&gt; bit, set on the new tuple, both stored in the &lt;code&gt;t_infomask2&lt;/code&gt; field. It is important to note that HOT is not available in some cases, such as when the key value of the index tuple is updated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index-Only Scans
&lt;/h2&gt;

&lt;p&gt;When a select query is executed using an index, Postgres checks the visibility of the tuples using the visibility map, and if all the tuples on the target page are known to be visible, it uses the index tuple's key directly without reading the table page. Otherwise it has to read the page to check visibility. This reduces I/O overhead and costs. &lt;/p&gt;
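
&lt;p&gt;To see this in action, you can check a query plan with &lt;code&gt;EXPLAIN&lt;/code&gt; (the table &lt;code&gt;t&lt;/code&gt; and index here are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE t (id int);
CREATE INDEX t_id_idx ON t (id);
INSERT INTO t SELECT generate_series(1, 100000);
VACUUM t;  -- updates the visibility map
EXPLAIN SELECT id FROM t WHERE id BETWEEN 1 AND 100;
-- the plan should report an Index Only Scan using t_id_idx
&lt;/code&gt;&lt;/pre&gt;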

</description>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Vacuum Processing (Part 6)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Wed, 28 Jun 2023 15:34:03 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-vacuum-processing-part-6-4dbg</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-vacuum-processing-part-6-4dbg</guid>
      <description>&lt;p&gt;Vacuum processing is definitely one of those features which makes Postgres SQL great. It essentially removes dead tuples and freezes transaction ids to ensure the smooth operation of the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  VACUUM
&lt;/h2&gt;

&lt;p&gt;Two modes are provided for dead tuple removal: concurrent VACUUM and full VACUUM. In concurrent VACUUM, transactions can still run while the dead tuples are being removed, but in full VACUUM the table is locked until all the dead tuples are removed. VACUUM had to be run manually until v8, which introduced an &lt;em&gt;autovacuum&lt;/em&gt; daemon. In v8.4 the visibility map that we referred to in a previous post was introduced. It helps skip pages that contain no dead tuples, making the complete process more efficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrent VACUUM
&lt;/h3&gt;

&lt;h4&gt;
  
  
  First Block
&lt;/h4&gt;

&lt;p&gt;This block performs freeze processing and removes index tuples that point to dead tuples.&lt;/p&gt;

&lt;h4&gt;
  
  
  Second Block
&lt;/h4&gt;

&lt;p&gt;In this block the dead tuples are actually removed and FSM and VM are also updated in a page-by-page fashion. &lt;/p&gt;

&lt;h4&gt;
  
  
  Third Block
&lt;/h4&gt;

&lt;p&gt;It performs the required clean-up and updates the statistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full VACUUM
&lt;/h3&gt;

&lt;p&gt;While concurrent VACUUM removes dead tuples, it does not do much to reduce the table size, which can hinder performance. This is exactly why you need a full VACUUM: it both removes dead tuples and shrinks the table, reclaiming the space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Freeze processing
&lt;/h2&gt;

&lt;p&gt;Freeze processing freezes old transaction ids. It is a part of VACUUM processing itself. It has two modes: lazy mode and eager mode. Lazy mode scans pages with dead tuples, while eager mode scans all pages regardless of dead tuples.&lt;/p&gt;

&lt;p&gt;The PARALLEL option allows parallel processing of vacuuming index and cleaning index phases in the VACUUM command.&lt;/p&gt;

&lt;p&gt;The other two important features that make Postgres stand out are the removal of unnecessary clog files, as discussed in part 5, and the autovacuum daemon, which handles the vacuums automatically. &lt;/p&gt;
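
&lt;p&gt;For reference, the variants discussed above map to these commands (&lt;code&gt;mytable&lt;/code&gt; is a placeholder name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;VACUUM mytable;               -- concurrent VACUUM, table stays usable
VACUUM FULL mytable;          -- rewrites the table, takes an exclusive lock
VACUUM (FREEZE) mytable;      -- aggressively freeze old transaction ids
VACUUM (PARALLEL 2) mytable;  -- vacuum indexes with up to 2 parallel workers
&lt;/code&gt;&lt;/pre&gt;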

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheage</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Concurrency Control (Part 5)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Wed, 28 Jun 2023 15:27:55 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-concurrency-control-part-5-jah</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-concurrency-control-part-5-jah</guid>
      <description>&lt;p&gt;Let's assume that you are working as an Engineer designing some new kind of database, what's the most important thing that you need to make sure exists in your database system for it to be practical? Well it would have to be the ACID properties. Even among the ACID properties the two most important ones are atomicity and isolation. This is exactly why we need concurrency control. It helps us maintain these two properties in Postgres. &lt;/p&gt;

&lt;p&gt;There are three broad concurrency control techniques: Multi-version Concurrency Control (MVCC), Strict Two-Phase Locking (S2PL), and Optimistic Concurrency Control (OCC), each with its own set of variations as well. Needless to say this can get a little complicated, but I'll try my best to keep it simple. MVCC keeps multiple versions of the data: when a row is written, the old version is retained alongside the new one, so readers don't block writers and writers don't block readers. S2PL is a locking-based mechanism instead, where whoever holds the lock at a particular time can perform an operation while the others wait to acquire it. Snapshot Isolation (SI) is the variant of MVCC used by Postgres. It's essentially like always having a backup: when someone wants to read data, we look up the transaction id, apply some visibility checks, and decide where to read from, the old version or the updated data. &lt;/p&gt;

&lt;p&gt;Transaction IDs (txids) serve as unique identifiers for transactions. There are special txids as well, indicating things like frozen transactions and so on. &lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Tuple
&lt;/h2&gt;

&lt;p&gt;This is a very important data structure that stores a bunch of relevant information, such as which transaction id inserted the tuple, which transaction id updated or deleted it, how many commands were executed, etc. Insertion, deletion, and updating are made possible by using this data structure and manipulating its fields to maintain consistency. &lt;/p&gt;

&lt;p&gt;When inserting an index tuple, PostgreSQL uses the FSM (Free Space Map) of the corresponding table or index to select a page into which it can be inserted. The commit log holds the statuses of all the transactions. There are four statuses: IN_PROGRESS, COMMITTED, ABORTED, and SUB_COMMITTED.&lt;/p&gt;
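
&lt;p&gt;You can actually peek at some of this bookkeeping through the hidden system columns of any table (&lt;code&gt;t&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT txid_current();             -- txid of the current transaction
SELECT xmin, xmax, ctid, * FROM t;
-- xmin: txid that inserted the tuple
-- xmax: txid that deleted or updated it (0 if still live)
&lt;/code&gt;&lt;/pre&gt;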

&lt;h2&gt;
  
  
  Transaction Snapshot
&lt;/h2&gt;

&lt;p&gt;It is the snapshot of all the transactions and their statuses at a particular point in time. &lt;/p&gt;

&lt;h2&gt;
  
  
  Visibility Rule Checks
&lt;/h2&gt;

&lt;p&gt;Using the structures mentioned above and a set of rules, we can check whether a tuple is visible or invisible to a transaction. Let's explore Rule 1 only to get the idea. Rule 1: &lt;em&gt;If Status(t_xmin) = ABORTED ⇒ Invisible&lt;/em&gt;. Since t_xmin records the inserting transaction, if that insertion was aborted it makes sense that the tuple is invisible. &lt;/p&gt;

&lt;h2&gt;
  
  
  Serializable Snapshot Isolation
&lt;/h2&gt;

&lt;p&gt;In order to prevent anomalies like Dirty Reads, Non-Repeatable Reads, and Phantom Reads, SSI is used. It uses a graph-based approach and tries to make sure no cycle of dependencies is produced. &lt;/p&gt;

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheage</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Foreign Data Wrappers (Part 4)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Sun, 25 Jun 2023 20:58:54 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-foreign-data-wrappers-part-4-4hc1</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-foreign-data-wrappers-part-4-4hc1</guid>
      <description>&lt;p&gt;In this part 4 of the series, we will be discussing the Foreign Data Wrappers support in Postgres. It is certainly a very interesting feature.&lt;/p&gt;

&lt;p&gt;Before getting to FDWs, let's first discuss a bit about the SQL/MED feature of the SQL standard, which FDWs build on. In SQL/MED, a table on a remote server is called a foreign table, and FDWs allow managing foreign tables just like local tables. You install the extension, and then you can access remote tables just like local tables using a select query. But how does this work exactly? Let's find out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The analyzer/analyser creates the query tree of the input SQL.&lt;/li&gt;
&lt;li&gt;The planner is responsible for connecting to the remote server. &lt;/li&gt;
&lt;li&gt;An optional cost estimate check can also be done by planner if we turn on the &lt;em&gt;use_remote_estimate&lt;/em&gt; option. &lt;/li&gt;
&lt;li&gt;An SQL statement is generated back from the plan tree, a step known as &lt;em&gt;deparsing&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The deparsed statement is sent by the executor to the remote server, and its result is also received by the executor. &lt;/li&gt;
&lt;/ol&gt;
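
&lt;p&gt;As a concrete sketch, here is roughly how a foreign table is set up with the &lt;code&gt;postgres_fdw&lt;/code&gt; extension (the server name, credentials, and table name are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_srv
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote-host', dbname 'remotedb', use_remote_estimate 'true');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER remote_srv OPTIONS (user 'remote_user', password 'secret');

IMPORT FOREIGN SCHEMA public FROM SERVER remote_srv INTO public;

SELECT * FROM some_remote_table;  -- queried just like a local table
&lt;/code&gt;&lt;/pre&gt;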

&lt;p&gt;It should be noted that, for optimization, depending on the type of query some operations might be performed locally rather than pushed to the remote server, such as parts of ORDER BY processing and some aggregate functions.   &lt;/p&gt;

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheage</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Query Processing (Part 3)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Sun, 25 Jun 2023 20:42:21 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-query-processing-part-3-49ki</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-query-processing-part-3-49ki</guid>
      <description>&lt;p&gt;In this third in a series blog post we will try to understand how query is processed in Postgres, this is definitely one of the more complicated topic so I'll try to abstract the complexities at times for a clearer core concepts without many implementation specific details.&lt;/p&gt;

&lt;p&gt;I have further divided this post into 3 parts, but the important one is part 1, so feel free to skip parts 2 and 3 in case it gets too complicated. They are there for the more curious, and you would not lose much by missing them. You can also refer back to them later. &lt;/p&gt;

&lt;h1&gt;
  
  
  Part 1
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Query Processing
&lt;/h2&gt;

&lt;p&gt;The query processing subsystem includes 5 components which are essential to the processing: Parser, Analyzer/Analyser, Rewriter, Planner, and Executor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parser
&lt;/h2&gt;

&lt;p&gt;The parser is essentially responsible for generating the parse tree, which establishes an order of precedence among the operations. If the syntax is not valid it returns an error; otherwise a parse tree is returned. No semantics are checked at this stage, so if you misspell a table name it won't be detected here. &lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzer/Analyser
&lt;/h2&gt;

&lt;p&gt;It runs a semantic analysis on the parse tree and returns a valid query tree, so the misspelled table name will definitely be caught here. Let's get a bit familiar with the query tree. It has a target list of the output columns. The range table is the list of relations used in the query. The join tree stores the FROM and WHERE clauses. The sort clause is a list of SortGroupClause nodes. We can safely skip how it looks exactly at this point. &lt;/p&gt;

&lt;h2&gt;
  
  
  Rewriter
&lt;/h2&gt;

&lt;p&gt;It essentially serves as a rule checker. It transforms the query tree by checking the rules defined in &lt;em&gt;pg_rules&lt;/em&gt; system catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planner and Executor
&lt;/h2&gt;

&lt;p&gt;The planner receives a query tree from the rewriter and generates the plan tree that is most efficient based purely on cost optimization. The plan tree consists of &lt;em&gt;plan nodes&lt;/em&gt;. Each plan node carries all the information the executor needs, and for a single-table query the executor processes the nodes in a bottom-to-top fashion. &lt;/p&gt;

&lt;h1&gt;
  
  
  Part 2: Optimizing the cost for single-table-query
&lt;/h1&gt;

&lt;p&gt;Cost optimization is done in the planner and can be split into two types: single-table queries and multi-table queries. In this part we discuss the former, and in part 3 we will discuss the latter. Since there's a lot of math involved, we will just cover the basics and skip the math for now. It's also best to split the details into a few points to remember them better; I'll keep them short and sweet. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Estimating the cost of different operations such as scans, joins, sorts, etc. Each operation has a unique cost and a different estimation formula, with different weightings assigned to different factors.&lt;/li&gt;
&lt;li&gt;Factors affecting cost: disk I/O, CPU, and memory consumption.&lt;/li&gt;
&lt;li&gt;PostgreSQL uses statistics collected from the database. Estimates are made for the selectivity of predicates, the cardinality of tables, and other relevant factors.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;EXPLAIN&lt;/code&gt; command exists to view the estimated cost of a query.&lt;/li&gt;
&lt;/ol&gt;
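
&lt;p&gt;For example (the table &lt;code&gt;t&lt;/code&gt; is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;EXPLAIN SELECT * FROM t ORDER BY id;
-- shows the chosen plan nodes and their estimated costs

EXPLAIN ANALYZE SELECT * FROM t ORDER BY id;
-- also runs the query and reports actual row counts and times
&lt;/code&gt;&lt;/pre&gt;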

&lt;h1&gt;
  
  
  Part 3: Joins
&lt;/h1&gt;

&lt;p&gt;A multiple-table query involves joining data from multiple tables to retrieve the desired results. Three join methods are supported by Postgres: nested loop join, merge join, and hash join. In a nested loop join each row from one table is compared with each row from the other table to find matching rows. A merge join is used when both tables being joined are sorted on the join key. A hash join builds a hash table on the join key of one of the participating tables and performs hash-based lookups to find the matching rows.&lt;/p&gt;

&lt;p&gt;The planner picks the most suitable join method based on the lowest cost. The sequence of joins is encoded in the plan tree, which is then executed by the executor. While the plan is executing, buffers can be used. Transaction atomicity is ensured by the isolation mechanisms and concurrency support. &lt;/p&gt;

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheage</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres SQL for Dummies: The Architecture (Part 2)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Sun, 25 Jun 2023 19:48:09 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-the-architecture-1pd5</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-the-architecture-1pd5</guid>
      <description>&lt;p&gt;Link to prev: &lt;a href="https://dev.to/ksheroz/postgres-sql-for-dummies-database-cluster-databases-and-tables-h2n"&gt;Link&lt;/a&gt;&lt;br&gt;
In the second in a series post, we will try to explore the Architecture of Postgres which can be primarily divided in two sub Architectures: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Process Architecture&lt;/li&gt;
&lt;li&gt;The Memory Architecture&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Process architecture
&lt;/h2&gt;

&lt;p&gt;Postgres essentially works as a client-server application with multi-process support. The master of all the processes is the server process itself, which is why it was historically called the postmaster. It is the parent process. It starts up by allotting shared memory, launching background processes, and waiting for client connection requests, and it manages the entire database cluster. It's also responsible for spinning up the backend processes for clients. After the server process, we can divide the rest of the processes into two types, as I like to remember them: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Backend processes&lt;/li&gt;
&lt;li&gt;Background processes: Background writer, checkpointer, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Backend processes
&lt;/h3&gt;

&lt;p&gt;Backend processes handle all the queries and statements from connected clients. Each connection has its own backend process. &lt;/p&gt;

&lt;h3&gt;
  
  
  Background processes
&lt;/h3&gt;

&lt;p&gt;The background processes have specific tasks, and there are 7 in total. You can read up further on them, but here I'll just mention a few to give you an idea. The background writer writes dirty pages to persistent storage such as an SSD. The checkpointer, as the name suggests, performs checkpoints. The archiver archives the WAL logs, and so on. A TCP connection is used for communication with the client, and a backend process can operate on only one database, which must be specified by the client when connecting.&lt;/p&gt;
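
&lt;p&gt;You can observe some of these processes from within the database itself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT pid, backend_type FROM pg_stat_activity;
-- lists client backends plus background processes such as
-- 'checkpointer', 'background writer', 'walwriter', 'autovacuum launcher'
&lt;/code&gt;&lt;/pre&gt;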

&lt;h2&gt;
  
  
  The Memory Architecture
&lt;/h2&gt;

&lt;p&gt;Postgres has two types of memory: local memory and global or shared memory. Local memory is allocated to each backend process, whereas shared memory is used by all the server processes. &lt;/p&gt;

&lt;p&gt;Local memory is allocated by each backend process for its own use and is divided into sub-areas for different purposes, such as work memory and maintenance work memory, while shared memory is allocated by the server itself at startup. Shared memory is also divided into sub-areas for storing pages, commit logs, etc. &lt;/p&gt;

&lt;p&gt;Heap tables can be read either with a sequential scan or with a more complex but often more efficient B-tree index scan.&lt;/p&gt;

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheag</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Postgres SQL for Dummies: Database Cluster, Databases, and Tables (Part 1)</title>
      <dc:creator>ksheroz</dc:creator>
      <pubDate>Sun, 25 Jun 2023 19:23:07 +0000</pubDate>
      <link>https://dev.to/ksheroz/postgres-sql-for-dummies-database-cluster-databases-and-tables-h2n</link>
      <guid>https://dev.to/ksheroz/postgres-sql-for-dummies-database-cluster-databases-and-tables-h2n</guid>
      <description>&lt;p&gt;Hi there, I'm Sheroz and by profession I'm a Software Engineer who likes to explore complex engineering topics such as the internals of Postgres and likes to make them simpler to understand for myself and others too. In this series we'll try to explore the workings of Postgres on a simplistic level. The aim of this first in a series article would be to serve as a notes for understanding how Postgres works in order for making the most of it. &lt;/p&gt;

&lt;p&gt;There are 3 basic things we need to cover straight up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The physical structure of Postgres&lt;/li&gt;
&lt;li&gt;The logical structure of Postgres&lt;/li&gt;
&lt;li&gt;Heap table structure and its use in Postgres&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. The physical structure of Postgres
&lt;/h2&gt;

&lt;p&gt;A PostgreSQL server runs on a single host and manages a single database cluster. A database itself is a collection of objects, each of which can be referred to by an OID (Object ID). These objects include indexes, heap tables, and so on. Each table also has an OID. A cluster contains multiple databases. &lt;/p&gt;
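
&lt;p&gt;These OIDs are easy to look up in the system catalogs (&lt;code&gt;mytable&lt;/code&gt; is a placeholder name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT oid, datname FROM pg_database;  -- OIDs of databases in the cluster
SELECT 'mytable'::regclass::oid;       -- OID of a table
&lt;/code&gt;&lt;/pre&gt;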

&lt;h2&gt;
  
  
  2. The logical structure of Postgres
&lt;/h2&gt;

&lt;p&gt;Logically, Postgres can be thought of as a file structure wherein you have a base directory and subdirectories. A database, for instance, is itself a subdirectory under the base directory, and tables are files within it. Postgres also has the concept of tablespaces, where you can have a 'space' that exists outside of the base directory with its own subdirectory structure. The files, essentially tables and indexes, can be identified by a &lt;em&gt;relfilenode&lt;/em&gt;, which may or may not be the same as the OID depending on whether the file has been altered or not. Files also have associated &lt;em&gt;free space maps&lt;/em&gt; and &lt;em&gt;visibility maps&lt;/em&gt;.&lt;/p&gt;
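
&lt;p&gt;You can look up the &lt;em&gt;relfilenode&lt;/em&gt; and the actual file path from the system catalogs (&lt;code&gt;mytable&lt;/code&gt; is again a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT relname, oid, relfilenode FROM pg_class WHERE relname = 'mytable';
SELECT pg_relation_filepath('mytable');  -- path relative to the data directory
&lt;/code&gt;&lt;/pre&gt;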

&lt;h2&gt;
  
  
  3. Heap table structure in Postgres
&lt;/h2&gt;

&lt;p&gt;Tables are stored as heap table files divided into pages (blocks) of fixed size, numbered sequentially; a page can be accessed randomly, and the file grows dynamically. Each block contains heap tuples (record data), line pointers (pointers to tuples), and other metadata. The data is read and written using this structure.&lt;/p&gt;

</description>
      <category>database</category>
      <category>apache</category>
      <category>apacheage</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
