While TypeORM is a great library, I sometimes feel it does not generate the most efficient queries. To test that hunch, I decided to use PostgreSQL's query execution plan to compare how different SQL queries perform.
Let's imagine there's this cool app that works like a personal dictionary. Whenever someone comes across a word they don't know, they can save it in the app along with its meaning, examples, notes, etc. This way, they can easily look it up later. Guess what? I'm going to create that app!
Designing the data model to keep the required data for this app is pretty straightforward. We need to create only three tables for this: User, Vocabulary, and Definition.
User
| Column | Type | Nullable | Default |
| ------ | ----------------- | -------- | ------------------ |
| id | uuid | not null | uuid_generate_v4() |
| email | character varying | not null | |
Vocabulary
| Column | Type | Nullable | Default |
| ------ | ----------------- | -------- | ------------------ |
| id | uuid | not null | uuid_generate_v4() |
| word | character varying | not null | |
| userId | uuid | not null | |
Definition
| Column | Type | Nullable | Default |
| ------------ | ----------------- | -------- | ------------------ |
| id | uuid | not null | uuid_generate_v4() |
| meaning | character varying | not null | |
| vocabularyId | uuid | not null | |
We all know a word might have multiple definitions, and users might have many vocabulary entries in their dictionaries. From the SQL perspective, the relationship between Definition and Vocabulary is many-to-one. The same relationship holds between Vocabulary and User.
Foreign Key
| Table Name | Column | Constraint Name | Foreign Constraint Definition |
| ---------- | ------------ | ---------------------------------------- | ------------------------------ |
| Vocabulary | userId | FK_Vocabulary_userId_User_id | REFERENCES "User" ("id") |
| Definition | vocabularyId | FK_Definition_vocabularyId_Vocabulary_id | REFERENCES "Vocabulary" ("id") |
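For reference, here is a minimal DDL sketch that matches the columns, constraint names, and indexes described in this post. It assumes the uuid-ossp extension for uuid_generate_v4(); it is not necessarily how I actually created the tables.
-- Sketch only: types and constraint names are taken from the tables in this post.
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE "User" (
    "id"    UUID              NOT NULL DEFAULT uuid_generate_v4(),
    "email" CHARACTER VARYING NOT NULL,
    CONSTRAINT "PK_User_id" PRIMARY KEY ("id"),
    CONSTRAINT "UQ_User_email" UNIQUE ("email")
);

CREATE TABLE "Vocabulary" (
    "id"     UUID              NOT NULL DEFAULT uuid_generate_v4(),
    "word"   CHARACTER VARYING NOT NULL,
    "userId" UUID              NOT NULL,
    CONSTRAINT "PK_Vocabulary_id" PRIMARY KEY ("id"),
    CONSTRAINT "UQ_Vocabulary_word_userId" UNIQUE ("word", "userId"),
    CONSTRAINT "FK_Vocabulary_userId_User_id" FOREIGN KEY ("userId") REFERENCES "User" ("id")
);

CREATE TABLE "Definition" (
    "id"           UUID              NOT NULL DEFAULT uuid_generate_v4(),
    "meaning"      CHARACTER VARYING NOT NULL,
    "vocabularyId" UUID              NOT NULL,
    CONSTRAINT "PK_Definition_id" PRIMARY KEY ("id"),
    CONSTRAINT "FK_Definition_vocabularyId_Vocabulary_id" FOREIGN KEY ("vocabularyId") REFERENCES "Vocabulary" ("id")
);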
Besides creating or updating vocabulary, there should be a way to display all of it. I need an API responsible for fetching data from the database chunk by chunk, since bringing everything at once would perform worse. There should also be a mechanism to keep requesting the next chunk until all of it has been received. That is possible if the API returns the total number of vocabulary entries.
Let’s quickly show the input and output payload of the API request and response, respectively.
interface RequestPayload {
userId: string;
pageNumber: number;
pageSize: number;
}
interface ResponsePayload {
data: {
id: string;
word: string;
definitions: {
id: string;
meaning: string;
}[];
}[];
total: number;
}
The batch size will be 20, and I must ensure responses are sent in descending order of words.
TypeORM Approach (Version 0)
I can quickly achieve this using TypeORM's getManyAndCount method.
async fetch(userId: string, skip: number, pageSize: number): Promise<ResponsePayload> {
// Assume skip was already calculated as (pageNumber - 1) * pageSize
const [vocabularies, total] = await this.createQueryBuilder('vocabulary')
.leftJoin('vocabulary.definitions', 'definition')
.where(`vocabulary.userId = :userId`, { userId })
.orderBy(`vocabulary.word`, 'DESC')
.skip(skip)
.take(pageSize)
.select(['vocabulary.id', 'vocabulary.word'])
.addSelect(['definition.id', 'definition.meaning'])
.getManyAndCount();
return {
data: vocabularies,
total,
};
}
The getManyAndCount method produces three queries before sending the data back. Let's see those three queries.
Query 1
SELECT DISTINCT "distinctAlias"."vocabulary_id" AS "ids_vocabulary_id", "distinctAlias"."vocabulary_word"
FROM (SELECT "vocabulary"."id" AS "vocabulary_id",
"vocabulary"."word" AS "vocabulary_word",
"definition"."id" AS "definition_id",
"definition"."meaning" AS "definition_meaning"
FROM "Vocabulary" "vocabulary"
LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
WHERE "vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d') "distinctAlias"
ORDER BY "distinctAlias"."vocabulary_word" DESC, "vocabulary_id" ASC
LIMIT 20 OFFSET 4980;
Query 2
SELECT "vocabulary"."id" AS "vocabulary_id",
"vocabulary"."word" AS "vocabulary_word",
"definition"."id" AS "definition_id",
"definition"."meaning" AS "definition_meaning"
FROM "Vocabulary" "vocabulary"
LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
WHERE ("vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d')
AND ("vocabulary"."id" IN
('41f89c90-7029-46a4-8211-5f8c6e527a2d', 'b19457a9-03fb-4c3b-b649-83001b6b9616',
'17d84794-88a0-4e62-9f1c-88dea1fe8148', '9381945a-14d5-458b-b283-870ba3fa2057',
'c00d27cb-7ad1-4fd4-9641-73d3484f3741', '34e8fc07-dec6-4dc8-b461-ed412a5f1cc8',
'c5a52ddf-328f-4f1f-8c19-e38bc5d74ce1', '70de43d8-dbf2-45a4-86f1-1c4a0dc07512',
'9150aff2-27b0-499e-82de-7a15373f35b0', '25aa3d73-5e50-410a-b16a-522246b54982',
'3bf46a90-6982-46ac-9414-a5b3d36e9340', '05c26854-bfe2-48b3-ae2b-c3a76e26a473',
'40808019-f2c1-4857-8bab-a1820cfb974f', '5cb2e93d-c64c-407c-bfd2-5b765f630298',
'0fdf440c-37cd-46c3-a305-bd3b8937c377', '3fd8baf6-7e5e-4008-a8ef-81e30fcb6b0b',
'9501f02f-b77c-4abc-b6ef-fe1a44a575f3', 'cf6c5111-9077-4c9f-9d1b-07eeb44a7160',
'83c4e918-b14e-4848-ac16-e6e88fa6d774', 'ae156eb4-5a94-431a-8c1f-6b384d062e3b',
'b44b3033-1b54-4d3f-93a0-c73477dddf68'))
ORDER BY "vocabulary_word" DESC;
Query 3
SELECT COUNT(DISTINCT ("vocabulary"."id")) AS "cnt"
FROM "Vocabulary" "vocabulary"
LEFT JOIN "Definition" "definition" ON "definition"."vocabularyId" = "vocabulary"."id"
WHERE "vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d';
To be honest, I am not impressed, as I think it can be made a bit better. For example, we can remove the LEFT JOIN and the DISTINCT entirely from the first query, because all it needs to do is select the vocabulary IDs for the given userId so that the second query can use them.
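For illustration, here is a hand-written sketch of that slimmer first query (my own query, not something I could make TypeORM emit):
-- Sketch: fetch only the requested page of vocabulary IDs; no join, no DISTINCT needed.
SELECT "vocabulary"."id" AS "vocabulary_id"
FROM "Vocabulary" "vocabulary"
WHERE "vocabulary"."userId" = '41f89c90-7029-46a4-8211-5f8c6e527a2d'
ORDER BY "vocabulary"."word" DESC
LIMIT 20 OFFSET 4980;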
Since the library generates this query, and I could not change it without forking the repository, I decided to try other approaches.
Query Version 1
SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';
SELECT vocabulary.id,
vocabulary.word,
JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions,
(COUNT(*) OVER ())::INTEGER AS total
FROM "Vocabulary" AS vocabulary
LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
GROUP BY vocabulary.id, vocabulary.word
ORDER BY vocabulary.word DESC
OFFSET 4980 LIMIT 20;
The above query aggregates the associated definitions of each vocabulary item into a JSON array using JSON_AGG and JSON_BUILD_OBJECT, so I do not need to map them later. It also uses the window function COUNT(*) OVER () to calculate the total number of vocabulary entries created by the requesting user. The fundamental difference from the TypeORM version is that this one returns the data with a single query.
Query Version 2
There are trade-offs between single and multiple queries to fetch the required data; I will not debate them here. I decided to try another approach: we can stop using OFFSET and leverage ROW_NUMBER to get a subset.
SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';
SELECT vocabulary.id, vocabulary.word, vocabulary.definitions, vocabulary.total
FROM (SELECT vocabulary.id,
vocabulary.word,
JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions,
(COUNT(*) OVER ())::INTEGER AS total,
ROW_NUMBER() OVER (ORDER BY vocabulary.word DESC) AS "rowNumber"
FROM "Vocabulary" AS vocabulary
LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
GROUP BY vocabulary.id, vocabulary.word) vocabulary
WHERE vocabulary."rowNumber" BETWEEN 4981 AND 5000;
Pretty straightforward: it just assigns a rank to each vocabulary row and filters out the correct batch at the end.
Query Version 3
I wanted to try LATERAL to avoid GROUP BY. With a slight effort, I wrote the query.
SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';
SELECT vocabulary.id,
vocabulary.word,
definitions,
(COUNT(*) OVER ())::INTEGER AS total
FROM "Vocabulary" AS vocabulary
LEFT JOIN LATERAL (
SELECT JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
FROM "Definition" AS definition
WHERE definition."vocabularyId" = vocabulary.id
) AS definitions ON TRUE
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
ORDER BY vocabulary.word DESC
OFFSET 4980 LIMIT 20;
Query Version 4
The whole idea of calculating the total number of available vocabulary is to provide a way for the front-end app to decide whether to make one more API request to the back-end. The app won't ask for more if it has already received all vocabulary.
There is an elegant way to avoid retrieving the total number of vocabulary created by a user.
Keyset pagination, also known as the seek method, relies on the ordering of columns to paginate through the data, and it avoids OFFSET. Instead of using the page number to determine how many rows need to be skipped, we use the previous chunk's last word to fetch the next subset of data.
SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';
SELECT vocabulary.id,
vocabulary.word,
JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
FROM "Vocabulary" AS vocabulary
LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
AND vocabulary.word < 'abcdef0105'
GROUP BY vocabulary.word, vocabulary.id
ORDER BY vocabulary.word DESC
LIMIT 20;
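For the very first page there is no previous word to seek from yet; a sketch of that initial request simply drops the extra predicate:
SET SESSION my.userId = '41f89c90-7029-46a4-8211-5f8c6e527a2d';

-- First keyset page: no "word < ..." condition yet; later pages seek from the
-- last word of the previous chunk, as shown above.
SELECT vocabulary.id,
       vocabulary.word,
       JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
FROM "Vocabulary" AS vocabulary
         LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
GROUP BY vocabulary.word, vocabulary.id
ORDER BY vocabulary.word DESC
LIMIT 20;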
The above solution works as expected because we order by a column with a meaningful ordering. Since the id column's type is UUID and its values are randomly generated, we cannot seek on it the same way. So this approach would not work well if we wanted to change the sorting field from word to id.
By the way, an awesome video by Hussein Nasser explains the limitation of using offset.
Performance Measurement
You probably noticed it already: all my queries target page number 250, i.e., OFFSET (250 - 1) * 20 = 4980, fetching the 4,981st to 5,000th vocabulary rows.
Well, I wrote SQL queries to insert a good amount of data into the database. The script did the following things (a seeding sketch follows the list):
- It created 50 users.
- It created 5K vocabulary entries for each user, i.e., 250K rows in the Vocabulary table.
- It created two definitions for each vocabulary entry. That means the Definition table holds 500K rows in total, and each user has 10K definitions.
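Here is a minimal sketch of how such a dataset can be seeded with GENERATE_SERIES. My actual script differs; the emails, words, and meanings below are made up, and the id columns rely on the defaults shown earlier.
-- 50 users
INSERT INTO "User" ("email")
SELECT 'user-' || n || '@example.com'
FROM GENERATE_SERIES(1, 50) AS n;

-- 5K vocabulary entries per user => 250K rows
INSERT INTO "Vocabulary" ("word", "userId")
SELECT 'word-' || LPAD(n::TEXT, 4, '0'), u.id
FROM "User" AS u
         CROSS JOIN GENERATE_SERIES(1, 5000) AS n;

-- two definitions per vocabulary entry => 500K rows
INSERT INTO "Definition" ("meaning", "vocabularyId")
SELECT 'meaning ' || d, v.id
FROM "Vocabulary" AS v
         CROSS JOIN GENERATE_SERIES(1, 2) AS d;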
I injected this much data to get a realistic idea of how my queries perform. SQL EXPLAIN is a beautiful tool to measure the estimated execution cost.
Since there is a direct relation between indexes and performance, I will share the indexes available in this POC database. The majority of them back primary keys. I have also created a composite unique constraint on the Vocabulary table, so a user cannot insert the same word twice, and one more unique constraint over the email column of the User table.
Index Key
| Table Name | Index Name | Index Definition |
| ---------- | ------------------------- | --------------------------------------------------------------------------------------------------- |
| User | PK_User_id | CREATE UNIQUE INDEX "PK_User_id" ON public."User" USING btree (id) |
| User | UQ_User_email | CREATE UNIQUE INDEX "UQ_User_email" ON public."User" USING btree (email) |
| Vocabulary | PK_Vocabulary_id | CREATE UNIQUE INDEX "PK_Vocabulary_id" ON public."Vocabulary" USING btree (id) |
| Vocabulary | UQ_Vocabulary_word_userId | CREATE UNIQUE INDEX "UQ_Vocabulary_word_userId" ON public."Vocabulary" USING btree (word, "userId") |
| Definition | PK_Definition_id | CREATE UNIQUE INDEX "PK_Definition_id" ON public."Definition" USING btree (id) |
Executing a query using EXPLAIN ANALYZE gives us a lot of crucial information, but here I want to focus only on the planning and the execution time.
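For example, this is how the keyset version (4) was measured; the last two lines of the plan output are the numbers that go into the comparison table below.
EXPLAIN ANALYZE
SELECT vocabulary.id,
       vocabulary.word,
       JSON_AGG(JSON_BUILD_OBJECT('id', definition.id, 'meaning', definition.meaning)) AS definitions
FROM "Vocabulary" AS vocabulary
         LEFT JOIN "Definition" AS definition ON vocabulary.id = definition."vocabularyId"
WHERE vocabulary."userId" = CURRENT_SETTING('my.userId')::UUID
  AND vocabulary.word < 'abcdef0105'
GROUP BY vocabulary.word, vocabulary.id
ORDER BY vocabulary.word DESC
LIMIT 20;
-- ...plan nodes...
-- Planning Time: 0.190 ms
-- Execution Time: 49.012 ms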
Time Comparison
| Query Version | Planning Time (Milliseconds) | Execution Time (Milliseconds) |
| :-----------: | :---------------------------: | :----------------------------------: |
| 0 | 0.544 + 0.235 + 0.642 = 1.421 | 116.484 + 58.069 + 108.542 = 283.095 |
| 1 | 0.350 | 133.773 |
| 2 | 0.240 | 129.735 |
| 3 | 0.168 | 107734.892 |
| 4 | 0.190 | 49.012 |
As TypeORM generated three queries, I summed their times to compare them with the other versions.
We see that versions 1 and 2 are almost identical, the LATERAL-based version is by far the worst, and the TypeORM one is good enough. The best one is the keyset pagination-based query.
Let’s not make a concrete decision instantly. Execution time depends on many variables, e.g., the data volume, the hardware's nature, caching, concurrency, etc. Let me share a few scenarios.
- I executed the same queries multiple times using the same user ID and page number. The time always differs.
- Before creating multiple users, I checked performance with a different dataset with only one user, 5K vocabularies, and 10K definitions. In that scenario, TypeORM-generated queries outperformed versions one and two.
- The hosting environment, e.g., AWS RDS or Aurora (or another vendor), also affects performance. I haven't tried this dataset there, but I once ran a similar comparison with another dataset on RDS.
Summary
We always need to make decisions carefully. Most importantly, we should not rely blindly on any ORM library: those libraries are designed to support multiple databases, so their solutions are somewhat generic and can sometimes produce unexpected, not-so-optimal queries.
Do you know that randomly trying to memorise a word is like chasing shadows? Just like crafting queries, memorising words requires strategy. That is why scientists invented techniques to make it easier. Spaced repetition is one such technique for remembering anything in the long term. The Leitner system is based on this principle: it suggests creating a flashcard for each item and reviewing the flashcards at increasing intervals.
(Cough, cough) I created the Firecracker Vocabulary Flashcards app based on the idea. You are invited to join the learning party!
To use or not to use ORM
Well, we wrote several SQL queries. Since the project already integrates an ORM, we need a way to execute whichever query we settle on. The easiest way is the one below.
await this.query(/** paste the raw query here **/);
Some say using raw queries in an ORM is not a good approach: whatever the solution is, it should be translated into the query builder or find options. But not every query can be expressed that way. So, it is entirely up to the developer to decide based on their preferences, the requirements, and their analysis.
Goodbye
Thank you for your patience. I hope you found it insightful and enjoyable!
NB: I have taken the cover image from here.