Emtiaj Hasan

Demystifying ORM Queries: The Search for Optimized PostgreSQL Queries

While TypeORM is a great library, I sometimes feel it does not generate the most efficient queries. To test that suspicion, I decided to use PostgreSQL's query execution plans to compare how different SQL queries perform.

Let's imagine there's this cool app that works like a personal dictionary. Whenever someone comes across a word they don't know, they can save it in the app along with its meaning, examples, notes, etc. This way, they can easily look it up later. Guess what? I'm going to create that app!

Designing the data model for this app is pretty straightforward. We need only three tables: User, Vocabulary, and Definition.

User

| Column | Type              | Nullable | Default            |
| ------ | ----------------- | -------- | ------------------ |
| id     | uuid              | not null | uuid_generate_v4() |
| email  | character varying | not null |                    |

Vocabulary

| Column | Type              | Nullable | Default            |
| ------ | ----------------- | -------- | ------------------ |
| id     | uuid              | not null | uuid_generate_v4() |
| word   | character varying | not null |                    |
| userId | uuid              | not null |                    |

Definition

| Column       | Type              | Nullable | Default            |
| ------------ | ----------------- | -------- | ------------------ |
| id           | uuid              | not null | uuid_generate_v4() |
| meaning      | character varying | not null |                    |
| vocabularyId | uuid              | not null |                    |

We all know a word might have multiple definitions, and a user might have many vocabulary entries in their dictionary. From the SQL perspective, the relationship between Definition and Vocabulary is many-to-one, and the same relationship holds between Vocabulary and User.

Foreign Key

| Table Name | Column       | Constraint Name                          | Foreign Constraint Definition  |
| ---------- | ------------ | ---------------------------------------- | ------------------------------ |
| Vocabulary | userId       | FK_Vocabulary_userId_User_id             | REFERENCES "User" ("id")       |
| Definition | vocabularyId | FK_Definition_vocabularyId_Vocabulary_id | REFERENCES "Vocabulary" ("id") |
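In TypeORM terms, these relations could be declared roughly as below. This is an illustrative sketch rather than the app's actual entity code, so treat the property names as assumptions.

import { Column, Entity, JoinColumn, ManyToOne, OneToMany, PrimaryGeneratedColumn } from 'typeorm';

@Entity('User')
export class User {
    @PrimaryGeneratedColumn('uuid')
    id: string;

    @Column({ unique: true })
    email: string;
}

@Entity('Vocabulary')
export class Vocabulary {
    @PrimaryGeneratedColumn('uuid')
    id: string;

    @Column()
    word: string;

    // Many vocabulary rows belong to one user.
    @ManyToOne(() => User)
    @JoinColumn({ name: 'userId' })
    user: User;

    // The `vocabulary.definitions` relation joined in the queries below.
    @OneToMany(() => Definition, (definition) => definition.vocabulary)
    definitions: Definition[];
}

@Entity('Definition')
export class Definition {
    @PrimaryGeneratedColumn('uuid')
    id: string;

    @Column()
    meaning: string;

    // Many definitions belong to one vocabulary entry.
    @ManyToOne(() => Vocabulary, (vocabulary) => vocabulary.definitions)
    @JoinColumn({ name: 'vocabularyId' })
    vocabulary: Vocabulary;
}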

Besides creating or updating vocabulary, there should be a way to display all of it. I need an API that fetches data from the database chunk by chunk, since bringing everything at once would perform worse. There should also be a mechanism to keep requesting the next chunk until everything has been received, which is possible if the API returns the total number of vocabulary entries.

Let’s quickly show the input and output payload of the API request and response, respectively.

interface RequestPayload {
   userId: string;
   pageNumber: number;
   pageSize: number;
}
interface ResponsePayload {
   data: {
       id: string;
       word: string;
       definitions: {
           id: string;
           meaning: string;
       }[];
   }[];
   total: number;
}

The batch size will be 20, and the results must be sorted in descending order of the word column.
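For reference, here is the page-to-offset arithmetic I use throughout this post, assuming 1-based page numbers.

// Page 250 with a page size of 20 covers rows 4981-5000, i.e. skip 4980 rows.
const toSkip = (pageNumber: number, pageSize: number): number =>
    (pageNumber - 1) * pageSize;

toSkip(250, 20); // 4980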

TypeORM Approach (Version 0)

I can achieve this quickly using TypeORM's getManyAndCount method.

async fetch(userId: string, skip: number, pageSize: number): Promise<ResponsePayload> {
    // Imagine I calculated the value of the skip using pageSize and pageNumber
    const [vocabularies, total] = await this.createQueryBuilder('vocabulary')
        .leftJoin('vocabulary.definitions', 'definition')
        .where(`vocabulary.userId = :userId`, { userId })
        .orderBy(`vocabulary.word`, 'DESC')
        .skip(skip)
        .take(pageSize)
        .select(['vocabulary.id', 'vocabulary.word'])
        .addSelect(['definition.id', 'definition.meaning'])
        .getManyAndCount();

    return {
        data: vocabularies,
        total,
    };
}

The getManyAndCount method produces three queries before sending the data back. Let’s see those three queries.

Query 1

SELECT DISTINCT "distinctAlias"."vocabulary_id" AS "ids_vocabulary_id", "distinctAlias"."vocabulary_word"
FROM (SELECT "vocabulary"."id"      AS "vocabulary_id",
             "vocabulary"."word"    AS "vocabulary_word",
             "definition"."id"      AS "definition_id",
             "definition"."meaning" AS "definition_meaning"
      FROM "Vocabulary" "vocabulary"
               LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
      WHERE "vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d') "distinctAlias"
ORDER BY "distinctAlias"."vocabulary_word" DESC, "vocabulary_id" ASC
LIMIT 20 OFFSET 4980;

Query 2

SELECT "vocabulary"."id"      AS "vocabulary_id",
       "vocabulary"."word"    AS "vocabulary_word",
       "definition"."id"      AS "definition_id",
       "definition"."meaning" AS "definition_meaning"
FROM "Vocabulary" "vocabulary"
         LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
WHERE ("vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d')
  AND ("vocabulary"."id" IN
       ('41f89c90-7029-46a4-8211-5f8c6e527a2d', 'b19457a9-03fb-4c3b-b649-83001b6b9616',
        '17d84794-88a0-4e62-9f1c-88dea1fe8148', '9381945a-14d5-458b-b283-870ba3fa2057',
        'c00d27cb-7ad1-4fd4-9641-73d3484f3741', '34e8fc07-dec6-4dc8-b461-ed412a5f1cc8',
        'c5a52ddf-328f-4f1f-8c19-e38bc5d74ce1', '70de43d8-dbf2-45a4-86f1-1c4a0dc07512',
        '9150aff2-27b0-499e-82de-7a15373f35b0', '25aa3d73-5e50-410a-b16a-522246b54982',
        '3bf46a90-6982-46ac-9414-a5b3d36e9340', '05c26854-bfe2-48b3-ae2b-c3a76e26a473',
        '40808019-f2c1-4857-8bab-a1820cfb974f', '5cb2e93d-c64c-407c-bfd2-5b765f630298',
        '0fdf440c-37cd-46c3-a305-bd3b8937c377', '3fd8baf6-7e5e-4008-a8ef-81e30fcb6b0b',
        '9501f02f-b77c-4abc-b6ef-fe1a44a575f3', 'cf6c5111-9077-4c9f-9d1b-07eeb44a7160',
        '83c4e918-b14e-4848-ac16-e6e88fa6d774', 'ae156eb4-5a94-431a-8c1f-6b384d062e3b',
        'b44b3033-1b54-4d3f-93a0-c73477dddf68'))
ORDER BY "vocabulary_word" DESC;

Query 3

SELECT COUNT(DISTINCT ("vocabulary"."id")) AS "cnt"
FROM "Vocabulary" "vocabulary"
        LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
WHERE "vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d';

To be honest, I am not impressed. I think this can be made a bit better: for example, the LEFT JOIN and the DISTINCT can be removed from the first query entirely, because all it needs to do is select the vocabulary IDs for the given userId so that the second query can use them.
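For illustration, the trimmed-down first query could look like the sketch below, run through the repository's raw query method with bound parameters. This is my rewrite, not anything TypeORM emits.

// Fetch only the current page of vocabulary IDs for the user; no JOIN or
// DISTINCT is needed for this step.
const idRows: { id: string }[] = await this.query(
    `SELECT "vocabulary"."id"
     FROM "Vocabulary" "vocabulary"
     WHERE "vocabulary"."userId" = $1
     ORDER BY "vocabulary"."word" DESC
     LIMIT $2 OFFSET $3`,
    [userId, pageSize, skip],
);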

Since the library generates these queries and I could not modify them without forking the repository, I decided to try other approaches.

Query Version 1

SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';


SELECT vocabulary.id,
      vocabulary.word,
      JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions,
      (COUNT(*) OVER ())::INTEGER                                                     AS total
FROM "Vocabulary" AS vocabulary
        LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
GROUP BY vocabulary.id, vocabulary.word
ORDER BY vocabulary.word DESC
OFFSET 4980 LIMIT 20;

The above query aggregates the associated definitions of each vocabulary entry into a JSON array using JSON_AGG and JSON_BUILD_OBJECT, so I do not need to map them later. It also uses the window function COUNT(*) OVER () to calculate the total number of vocabulary entries created by the requesting user. The fundamental difference from the TypeORM version is that this one returns all the data with a single query.
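Wired into the service, version 1 could look like the sketch below. The user ID is passed as a bound parameter instead of the SET SESSION trick I use for the psql experiments, and the response mapping is my assumption rather than code from the actual app.

const rows = await this.query(
    `SELECT vocabulary.id,
            vocabulary.word,
            JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions,
            (COUNT(*) OVER ())::INTEGER AS total
     FROM "Vocabulary" AS vocabulary
              LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
     WHERE vocabulary."userId" = $1
     GROUP BY vocabulary.id, vocabulary.word
     ORDER BY vocabulary.word DESC
     OFFSET $2 LIMIT $3`,
    [userId, skip, pageSize],
);

// Every row carries the same window-function total, so strip it out of the data.
return {
    data: rows.map(({ total, ...vocabulary }) => vocabulary),
    total: rows[0]?.total ?? 0,
};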

Query Version 2

There are trade-offs between fetching the required data with a single query versus multiple queries, and I will not debate them here. Instead, I tried another approach: we can stop using OFFSET and leverage ROW_NUMBER to pick the subset.

SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';


SELECT vocabulary.id, vocabulary.word, vocabulary.definitions, vocabulary.total
FROM (SELECT vocabulary.id,
            vocabulary.word,
            JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions,
            (COUNT(*) OVER ())::INTEGER                                                     AS total,
            ROW_NUMBER() OVER (ORDER BY vocabulary.word DESC)                               AS "rowNumber"
     FROM "Vocabulary" AS vocabulary
              LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
     WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
     GROUP BY vocabulary.id, vocabulary.word) vocabulary
WHERE vocabulary."rowNumber" BETWEEN 4981 AND 5000;

Pretty straightforward: the inner query assigns a rank to each vocabulary row, and the outer query filters out the correct batch at the end.
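The BETWEEN bounds come from the same page arithmetic as before:

// ROW_NUMBER is 1-based, so page 250 of size 20 covers rows 4981-5000.
const from = (pageNumber - 1) * pageSize + 1; // 4981
const to = pageNumber * pageSize; // 5000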

Query Version 3

I wanted to try LATERAL to avoid the GROUP BY. With a little effort, I wrote the following query.

SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';


SELECT vocabulary.id,
      vocabulary.word,
      definitions,
      (COUNT(*) OVER ())::INTEGER AS total
FROM "Vocabulary" AS vocabulary
        LEFT JOIN LATERAL (
   SELECT JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
   FROM "Definition" AS definition
   WHERE definition."vocabularyId" = vocabulary.id
   ) AS definitions ON TRUE
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
ORDER BY vocabulary.word DESC
OFFSET 4980 LIMIT 20;

Query Version 4

The whole idea of calculating the total number of available vocabulary entries is to let the front-end app decide whether to make one more API request to the back-end. The app won't ask for more once it has received everything.

There is an elegant way to avoid retrieving the total number of vocabulary created by a user.

Keyset pagination, also known as the seek method, relies on the ordering of columns to paginate through the data and avoids OFFSET altogether. Instead of using the page number to determine how many rows to skip, we can use the last word of the previous chunk to fetch the next subset.

SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';

SELECT vocabulary.id,
       vocabulary.word,
       JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
FROM "Vocabulary" AS vocabulary
         LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
  AND vocabulary.word < 'abcdef0105'
GROUP BY vocabulary.word, vocabulary.id
ORDER BY vocabulary.word DESC
LIMIT 20;

The above solution works as expected when we paginate on a meaningfully orderable column. Since the id column's type is UUID and the values are generated randomly, it cannot play that role: this approach would not be a good fit if we wanted to switch the sorting field from word to id.
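Setting that caveat aside, plugged into the service the keyset variant could look like this sketch, where lastWord is the word that closed the previous page and is undefined for the first request. The null-cursor handling is my assumption.

const rows = await this.query(
    `SELECT vocabulary.id,
            vocabulary.word,
            JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
     FROM "Vocabulary" AS vocabulary
              LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
     WHERE vocabulary."userId" = $1
       AND ($2::TEXT IS NULL OR vocabulary.word < $2)
     GROUP BY vocabulary.word, vocabulary.id
     ORDER BY vocabulary.word DESC
     LIMIT $3`,
    [userId, lastWord ?? null, pageSize],
);

// No total is needed here: the client stops asking for more as soon as a
// page comes back with fewer than pageSize rows.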

By the way, an awesome video by Hussein Nasser explains the limitation of using offset.

Performance Measurement

You have probably noticed it already: all my queries target page number 250, fetching the 4981st to 5000th vocabulary entries.

Well, I wrote SQL queries to insert a good amount of data into the database.

That script did the following things; a rough sketch of it appears after the list.

  • It created 50 users.
  • It created 5K vocabulary for each user, i.e., 250K rows in the Vocabulary table.
  • It created two definitions for each vocabulary entry. That means the Definition table has 500K rows in total, and each user has 10K definitions.

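The original script is not part of this post, but a GENERATE_SERIES-based sketch along these lines produces a comparable dataset (the word and email formats are my assumptions):

// Hedged seeding sketch; `dataSource` is a connected TypeORM DataSource.
await dataSource.query(
    `INSERT INTO "User" (email)
     SELECT 'user-' || n || '@example.com'
     FROM GENERATE_SERIES(1, 50) AS n`,
);

// 5K words per user, i.e. 250K rows.
await dataSource.query(
    `INSERT INTO "Vocabulary" (word, "userId")
     SELECT 'word-' || LPAD(n::TEXT, 5, '0'), u.id
     FROM "User" AS u, GENERATE_SERIES(1, 5000) AS n`,
);

// Two definitions per vocabulary row, i.e. 500K rows.
await dataSource.query(
    `INSERT INTO "Definition" (meaning, "vocabularyId")
     SELECT 'meaning ' || d || ' of ' || v.word, v.id
     FROM "Vocabulary" AS v, GENERATE_SERIES(1, 2) AS d`,
);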
I injected this much data to get a realistic idea of how my queries perform. SQL EXPLAIN is a beautiful tool for measuring the estimated execution cost.

Since indexes directly affect performance, I will share the indexes available in this POC database. Most of them back primary keys. I have also created a composite unique constraint on the Vocabulary table so that a user cannot insert the same word twice, plus one more unique constraint on the email column of the User table.

Index Key

| Table Name | Index Name                | Index Definition                                                                                    |
| ---------- | ------------------------- | --------------------------------------------------------------------------------------------------- |
| User       | PK_User_id                | CREATE UNIQUE INDEX "PK_User_id" ON public."User" USING btree (id)                                  |
| User       | UQ_User_email             | CREATE UNIQUE INDEX "UQ_User_email" ON public."User" USING btree (email)                            |
| Vocabulary | PK_Vocabulary_id          | CREATE UNIQUE INDEX "PK_Vocabulary_id" ON public."Vocabulary" USING btree (id)                      |
| Vocabulary | UQ_Vocabulary_word_userId | CREATE UNIQUE INDEX "UQ_Vocabulary_word_userId" ON public."Vocabulary" USING btree (word, "userId") |
| Definition | PK_Definition_id          | CREATE UNIQUE INDEX "PK_Definition_id" ON public."Definition" USING btree (id)                      |

Executing a query using EXPLAIN ANALYZE gives us a lot of crucial information, but here I want to focus only on the planning and execution times.
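For anyone reproducing the numbers, the measurement looks like the sketch below. Note that EXPLAIN ANALYZE actually executes the query; the filtering helper is mine, not part of any library.

// EXPLAIN ANALYZE returns one row per plan line in a column named "QUERY PLAN".
const plan: { 'QUERY PLAN': string }[] = await dataSource.query(
    `EXPLAIN ANALYZE
     SELECT id, word
     FROM "Vocabulary"
     WHERE "userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d'`,
);

// Keep only the two timing lines at the bottom of the plan.
const timings = plan
    .map((row) => row['QUERY PLAN'])
    .filter((line) => /^(Planning|Execution) Time/.test(line));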

Time Comparison

| Query Version | Planning Time (Milliseconds) |    Execution Time (Milliseconds)     |
| :-----------: | :---------------------------: | :----------------------------------: |
|       0       | 0.544 + 0.235 + 0.642 = 1.421 | 116.484 + 58.069 + 108.542 = 283.095 |
|       1       |             0.350             |               133.773                |
|       2       |             0.240             |               129.735                |
|       3       |             0.168             |              107734.892              |
|       4       |             0.190             |                49.012                |

Since TypeORM generated three queries, I summed their times to compare against the other versions.

We see versions 1 and 2 are almost identical, the LATERAL-based version is the worst, and the TypeORM one is good enough. The best one is the keyset pagination-based query.

Let's not jump to a definitive conclusion, though. Execution time depends on many variables: data volume, hardware, caching, concurrency, etc. Let me share a few scenarios.

  • I executed the same queries multiple times with the same user ID and page number; the timings differed on every run.
  • Before creating multiple users, I checked performance against a smaller dataset with only one user, 5K vocabulary entries, and 10K definitions. In that scenario, the TypeORM-generated queries outperformed versions 1 and 2.
  • The hosting environment (AWS RDS, Aurora, or another vendor) also affects performance. I have not tried this dataset there, but I once ran a similar comparison with another dataset on RDS.

Summary

We always need to make decisions carefully. Most importantly, we should not rely blindly on any ORM library. These libraries are designed to support multiple databases, which is probably why they produce somewhat generic solutions and, sometimes, unexpected and less-than-optimal queries.


Did you know that trying to memorise a word at random is like chasing shadows? Just like crafting queries, memorising words requires strategy, which is why scientists invented techniques to make it easier. Spaced repetition is one such technique for memorising anything long-term. The Leitner system works on this principle: it suggests creating a flashcard for each item and reviewing those at intervals.

(Cough, cough) I created the Firecracker Vocabulary Flashcards app based on the idea. You are invited to join the learning party!


To use or not to use ORM

Well, we wrote several SQL queries. Since the ORM is already integrated into the project, we need a way to execute the chosen query. The easiest way looks like the one below.

await this.query(/** paste the raw query here **/);

Some say using raw queries within an ORM is not a good approach, and that whatever the solution is, it should be expressed through the query builder or find options. But not every query can be translated that way. So it is entirely up to the developer to decide, based on preference, requirements, and analysis.

Goodbye

Thank you for your patience. I hope you found it insightful and enjoyable!


