Scaling your PostgreSQL read performance

We've all been in the situation where we find our database queries take way too long to execute. This typically happens when our tables grow too large, which causes the cost of table operations to increase as well. A common approach here is to simply add a read replica: all read queries are redirected to a replicated database while our writes go to the main/master database.

However, there are some cost-effective solutions we might want to consider before spinning up a read replica.

We will take a look at two methods that, when used together, have the potential to increase our read capacity significantly.

Indexing

First we will take a look at indexing.

What is indexing?
Indexing is a technique that organises the data in a table into an ordered data structure that allows fast retrieval. The data structure used is typically a B-tree, a multi-level tree structure that breaks the data down into fixed-size blocks or pages.

For example, remember our good old friend, Yellow Pages? (Yellow Pages is a telephone directory of businesses, widely used in the 90s.) Let's say you want to find the address of a specific company among thousands of different companies. Now if the listings in Yellow Pages weren't ordered in any form, you could expect to flip through each page and check every company until you found the address you were looking for. This is called a sequential scan in PostgreSQL and, as you can imagine, depending on the size of your database, it can take up a significant amount of time.

To solve this problem, we can use indexes. With indexes, we can sort our database table in an orderly manner. In the Yellow Pages example, we can order all companies alphabetically, and to find the address of Tesla, we know we can start searching on the pages where company names start with a "T". In doing so we are far more efficient in finding what we need, as we can skip the first 19 letters of the alphabet (each of which may span multiple pages).

How to create an index in PostgreSQL
Let's see the theory above in action.

We will create a table that stores the details of multiple companies, then run queries with and without an index on it to compare the execution times:

First, let's create a new database in PostgreSQL.

Note: I used NodeJS, Sequelize and FakerJS to quickly seed some random data into my database.



Creating the table

// Sequelize migration: create the companies table
module.exports = {
  up: async (queryInterface, Sequelize) => {
    await queryInterface.createTable('companies', {
      id: {
        type: Sequelize.INTEGER,
        allowNull: false,
        autoIncrement: true,
        primaryKey: true,
      },
      name: Sequelize.TEXT,
      category: Sequelize.TEXT,
    });
  },

  down: async (queryInterface, Sequelize) => {
    await queryInterface.dropTable('companies');
  }
};


Seeding the table

// Sequelize seeder: insert 10,000 companies with random names and categories
const { faker } = require('@faker-js/faker');

const generateRandomCategory = () => {
  const categories = ['Tech', 'Health Care', 'Food & Beverage', 'Tourism'];
  const randomIdx = Math.floor(Math.random() * categories.length);
  return categories[randomIdx];
};

module.exports = {
  async up (queryInterface, Sequelize) {
    const companyList = [];

    for (let i = 0; i < 10000; i++) {
      companyList.push({
        name: faker.company.name(),
        category: generateRandomCategory(),
      });
    }

    await queryInterface.bulkInsert('companies', companyList);
  },

  async down (queryInterface, Sequelize) {
    await queryInterface.bulkDelete('companies', null, {});
  }
};



Looking at our current database, if we run SELECT * FROM companies WHERE name = 'Rau Group', we'll see that we have an entry where the company name is 'Rau Group'.



Now let's run an EXPLAIN ANALYZE command to see the planning and execution time PostgreSQL takes to select all companies where the name is 'Rau Group':
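Here is the command, together with a reconstruction of roughly what the output looks like (the timings are the ones from my run; costs and row counts will vary):

EXPLAIN ANALYZE SELECT * FROM companies WHERE name = 'Rau Group';

 Seq Scan on companies  (cost=... rows=... width=...) (actual time=... rows=1 loops=1)
   Filter: (name = 'Rau Group'::text)
   Rows Removed by Filter: 9999
 Planning Time: 3.908 ms
 Execution Time: 70.679 ms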


Based on the results, we can see that PostgreSQL had to make a Sequential Scan (going through each entry in the table) and took 3.908ms to plan and 70.679ms to execute the query. Not too bad..

Let's see if we can do better.

Let's create an index on the name column and run the query again.

Creating an index in PostgreSQL is quite straightforward.
We can simply use PostgreSQL's CREATE INDEX command:
CREATE INDEX <index_name> ON <table_name> (<column_name>);
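In our case, to index the companies by name (the index name companies_name_idx is just my own choice; pick any name you like):

CREATE INDEX companies_name_idx ON companies (name);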


And now we run the same query:
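Again, a rough reconstruction of the output:

EXPLAIN ANALYZE SELECT * FROM companies WHERE name = 'Rau Group';

 Index Scan using companies_name_idx on companies  (cost=... rows=... width=...) (actual time=... rows=1 loops=1)
   Index Cond: (name = 'Rau Group'::text)
 Planning Time: 0.602 ms
 Execution Time: 2.418 ms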


Looking at the results now, we see that PostgreSQL uses an Index Scan, taking 0.602ms to plan and 2.418ms to execute.

As mentioned, database indexes are stored in a B-tree data structure, and the Index Scan performs a B-tree traversal, walks through the leaf nodes to find all matching entries, and fetches the corresponding table data.
This cuts our execution time from roughly 70ms to under 3ms and hence leads to much better database performance. Indexing works well once we have identified the frequently queried columns in our table.

Cons of using indexes

  • One of the downsides of creating an index in PostgreSQL is that indexes slow down data entry and modification: whenever a row with an indexed column is inserted or updated, the index has to be updated as well

  • Increased disk space usage

Partitioning

Now we'll take a look at partitioning in PostgreSQL.
There are two methods of partitioning (horizontal and vertical), but we will focus our attention on horizontal partitioning here.

What is a partition?
To partition means to slice our single large table horizontally (by rows) or vertically (by columns), splitting it into multiple smaller tables (partitions) identified by a specified key. As you can already imagine, each resulting table is only a fraction of the original size, which already improves our read queries a fair bit.

When the tables in our database get too large, performance and scaling are affected. As table size increases, more data is scanned, more pages are swapped to memory, and other table operation costs increase as well. Partitioning helps by dividing our large table into smaller tables (categorised by your desired field/key), thus reducing table scans and memory swap problems, which ultimately increases performance.

In our Yellow Pages example, seeing that each company has a category associated with it (and assuming we have identified this as part of our data access pattern), we can go a step further in improving our read queries by partitioning the companies table by category.

Methods of PostgreSQL partitioning
PostgreSQL allows two types of partitioning: declarative and inheritance partitioning. Both have their limitations, but here we will use declarative partitioning.

There are multiple partitioning methods in PostgreSQL (of which we will use List; a sketch of each follows the list below):

  • Range Partitioning
    The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions.

  • List Partitioning
    The table is partitioned by explicitly listing which key value(s) appear in each partition.

  • Hash Partitioning
    The table is partitioned by specifying a modulus and a remainder for each partition. Each partition holds the rows for which the hash value of the partition key divided by the specified modulus produces the specified remainder.
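As a quick sketch of what each method's DDL looks like (the table and column names here are made up purely for illustration):

-- Range: e.g. split orders by date
CREATE TABLE orders (id INT, created_at DATE) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2022 PARTITION OF orders
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

-- List: split by explicit key values (what we will use below)
CREATE TABLE shops (id INT, category TEXT) PARTITION BY LIST (category);
CREATE TABLE tech_shops PARTITION OF shops FOR VALUES IN ('Tech');

-- Hash: spread rows evenly across a fixed number of partitions
CREATE TABLE users (id INT) PARTITION BY HASH (id);
CREATE TABLE users_p0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);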

Creating a partition
In most real-life scenarios, we would be implementing partitioning on an already existing database and table, and here arises the problem of database availability. Migrating a table with live data to partitions can take a long time, and database locks are held for the duration of the transaction, which can cost us significantly in availability. With the method below, I see the potential of minimising database locks and hence keeping availability.

NOTE: There are other strategies to tackle this problem; I am following this approach based on my own research and knowledge. Be sure to do your own research before implementing anything at this scale in production.

This is a five-step process performed in a single transaction. If any of the steps fails, our database rolls back to its original state, keeping everything safe and sound.

Step 1
Rename the original table (it is about to become the legacy/old table). We do this so that we can create our partitioned table with the same name as before, allowing our application to keep querying the same table name.

ALTER TABLE companies RENAME TO companies_old;

Step 2
Create the new partitioned table with the old name, for the same reason mentioned in Step 1. Here we partition by LIST on the category column:

CREATE TABLE companies (
        id SERIAL,
        name TEXT,
        category TEXT
) PARTITION BY LIST (category);

Step 3
Create the partition tables to hold data specific to each category

CREATE TABLE tech_companies PARTITION OF companies FOR VALUES IN ('Tech');
CREATE TABLE health_care_companies PARTITION OF companies FOR VALUES IN ('Health Care');
CREATE TABLE fnb_companies PARTITION OF companies FOR VALUES IN ('Food & Beverage');
CREATE TABLE tourism_companies PARTITION OF companies FOR VALUES IN ('Tourism');

Step 4
Select each partition's data from the old table using CTEs (common table expressions). Note that a WITH clause only exists for the duration of the single statement it is attached to, so Step 4 and Step 5 together form one command

WITH tech_companies AS (
        SELECT * FROM companies_old WHERE category = 'Tech'
), fnb_companies AS (
        SELECT * FROM companies_old WHERE category = 'Food & Beverage'
), health_care_companies AS (
        SELECT * FROM companies_old WHERE category = 'Health Care'
), tourism_companies AS (
        SELECT * FROM companies_old WHERE category = 'Tourism'
)

Step 5
Finally, we insert the data into the new table, letting PostgreSQL route each row to its partition. This INSERT is the statement the WITH clause from Step 4 is attached to

INSERT INTO companies
SELECT * FROM tech_companies
UNION ALL SELECT * FROM fnb_companies
UNION ALL SELECT * FROM health_care_companies
UNION ALL SELECT * FROM tourism_companies;

Here is the full DDL:

BEGIN;
-- rename the legacy/old table
ALTER TABLE companies RENAME TO companies_old;

-- Create a new partitioned table with the legacy/old table's name to ensure application compatibility moving forward from here
CREATE TABLE companies (
        id SERIAL,
        name TEXT,
        category TEXT
) PARTITION BY LIST (category);

-- Create a new partition table for each category value
CREATE TABLE tech_companies PARTITION OF companies FOR VALUES IN ('Tech');
CREATE TABLE health_care_companies PARTITION OF companies FOR VALUES IN ('Health Care');
CREATE TABLE fnb_companies PARTITION OF companies FOR VALUES IN ('Food & Beverage');
CREATE TABLE tourism_companies PARTITION OF companies FOR VALUES IN ('Tourism');

-- Move the data from the legacy/old table into the partitioned table
-- (the WITH clause and the INSERT below form a single statement)
WITH tech_companies AS (
        SELECT * FROM companies_old WHERE category = 'Tech'
), fnb_companies AS (
        SELECT * FROM companies_old WHERE category = 'Food & Beverage'
), health_care_companies AS (
        SELECT * FROM companies_old WHERE category = 'Health Care'
), tourism_companies AS (
        SELECT * FROM companies_old WHERE category = 'Tourism'
)
INSERT INTO companies
SELECT * FROM tech_companies
UNION ALL SELECT * FROM fnb_companies
UNION ALL SELECT * FROM health_care_companies
UNION ALL SELECT * FROM tourism_companies;
COMMIT;
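One extra detail worth handling: the new companies table got a brand-new SERIAL sequence starting from 1, even though we copied rows with existing ids into it. To stop future inserts from reusing old ids, we can bump the sequence up to the current maximum (a small sketch, run after the COMMIT):

SELECT setval(
    pg_get_serial_sequence('companies', 'id'),
    (SELECT max(id) FROM companies)
);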

Inspecting our new partitioned tables
Now, if executing the above commands succeeded, we can go ahead and inspect our table. In your psql terminal, run:

\d+ companies

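The output should look something like this (trimmed for brevity):

                Partitioned table "public.companies"
  Column  |  Type   |
----------+---------+
 id       | integer |
 name     | text    |
 category | text    |
Partition key: LIST (category)
Partitions: fnb_companies FOR VALUES IN ('Food & Beverage'),
            health_care_companies FOR VALUES IN ('Health Care'),
            tech_companies FOR VALUES IN ('Tech'),
            tourism_companies FOR VALUES IN ('Tourism')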

Looking at the details, we see that our companies table is in fact a partitioned table, with a LIST partition key on the category column, and that it has 4 partitions in total.

We have successfully partitioned our one large table.

Now let's run some tests to see how this improves our read queries!

Analysing the query
Now let's run the same SELECT * FROM companies WHERE name = 'Rau Group' query and see what the results give us.

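Because we filter on name rather than on the partition key, PostgreSQL cannot prune any partitions, so it scans all four and appends the results. Roughly:

 Append  (actual time=... rows=1 loops=1)
   ->  Seq Scan on fnb_companies  (actual time=... rows=... loops=1)
         Filter: (name = 'Rau Group'::text)
   ->  Seq Scan on health_care_companies  (...)
   ->  Seq Scan on tech_companies  (...)
   ->  Seq Scan on tourism_companies  (...)
 Planning Time: 27.435 ms
 Execution Time: 11.117 ms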
Keep in mind that we have only partitioned our table at this point and have yet to add indexes to the partitions. Even so, PostgreSQL took 27.435ms to plan and 11.117ms to execute, an execution time still way quicker than on the regular, unpartitioned table.

Let's see how much more improvement we can make with indexes in our partitioned tables.
Creating an index on a partition is the same as on a regular table. We will create an index on the company name, based on our knowledge of the data access pattern (on PostgreSQL 11 and newer, creating a single index on the parent companies table would cascade it to every partition automatically):

CREATE INDEX tech_companies_name_idx ON tech_companies (name);
CREATE INDEX fnb_companies_name_idx ON fnb_companies (name);
CREATE INDEX health_care_companies_name_idx ON health_care_companies (name);
CREATE INDEX tourism_companies_name_idx ON tourism_companies (name);

Now let's make the same query and take a look at the results:

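Once more, a rough reconstruction of the output:

 Append  (actual time=... rows=1 loops=1)
   ->  Index Scan using fnb_companies_name_idx on fnb_companies  (...)
         Index Cond: (name = 'Rau Group'::text)
   ->  Index Scan using health_care_companies_name_idx on health_care_companies  (...)
   ->  Index Scan using tech_companies_name_idx on tech_companies  (...)
   ->  Index Scan using tourism_companies_name_idx on tourism_companies  (...)
 Execution Time: 1.76 ms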
We can see that we have gained a further improvement here, with a 1.76ms execution time.

Done correctly, partitioning can greatly improve the speed of read queries on really large tables: splitting groups of data into smaller tables significantly reduces PostgreSQL's table operation costs, and adding indexes to the partitions improves things further. What's important is that we identify our data access patterns, properly normalise our tables and learn to find grouping patterns in our data.

Cons of partitioning

  • Lack of fault tolerance. Unlike distributed replication, all partitions live on the same instance; if that instance fails, your database is unavailable

  • Constraints are limited: any primary key or unique constraint on a partitioned table must include the partition key, and on older PostgreSQL versions (before 12) foreign keys referencing a partitioned table are not supported at all (see the sketch after this list)

  • If your data outgrows the partition scheme, you end up with the same problem inside the partitions, and a full re-partitioning would be needed to increase the partition count
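For example, on our companies table an id-only primary key is rejected because it does not include the partition key (a sketch, assuming PostgreSQL 11 or newer):

-- Fails: every primary key / unique constraint must include the partition key
ALTER TABLE companies ADD PRIMARY KEY (id);

-- Works: the partition key (category) is part of the key
ALTER TABLE companies ADD PRIMARY KEY (id, category);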

Please feel free to leave any comments or corrections on my implementations ✌️
