Preface: This article is not a tutorial on how to debug database performance, nor is it about how to fix it. Instead, it is a guide on how to do a basic lookup into the query operation and an attempt to explain some related concepts.
In the world of speed and performance, every single point of traction can make a big difference in the delivery of anything. Every main service probably has a data source where data is stored and served, so being able to retrieve data faster can lead to better performance and an overall reduction in costs.
But how we can measure performance? Most database offer some level of “performance debug” tools for that, in this article im using postgres and explain analyze
, but the idea is the same for other sql services.
Summary
- First Query and Index
- Explaining the index
- More complex queries
- Bitmaps
- Mixing indexed and non-indexed columns
- What to take from searches and indexes?
- KV Stores
- What are some examples of KV stores?
- What is the main difference between a KV store and a SQL database?
- When to use one or another
- References
First Query and Index
The first case is a very basic one, just a table with a id and a username column with the given sql.
CREATE TABLE IF NOT EXISTS "users" (
"id" varchar PRIMARY KEY NOT NULL,
"username" text NOT NULL
);
Our first query is to retrieve a record with a given id
explain analyze
select *
from users
where id = '01H7BZ4XC9HTG7H4DTH7B3AAY5';
Index Scan using users_pkey on users (cost=0.28..8.29 rows=1 width=47) (actual time=0.033..0.035 rows=1 loops=1)
Index Cond: ((id)::text = '01H7BWSR1GJCJBKGGPV5NS3YNY'::text)
Planning Time: 0.070 ms
Execution Time: 0.056 ms
Index Scan using users_pkey on users
: This part of the output indicates that the database engine is performing an index scan on the "users" table using the "users_pkey" index. An index scan is a method of retrieving rows from a table using an index to look up the rows that match the specified condition.(cost=0.28..8.29 rows=1 width=47)
: This section provides information about the estimated cost of executing the query. The cost is an arbitrary unit used by the database optimizer to compare query execution strategies. In this case, the estimated cost of the query is between 0.28 and 8.29. It also estimates that there will be 1 row returned, and the width of each row is 47 bytes.(actual time=0.033..0.035 rows=1 loops=1)
: This part of the output shows the actual execution time of the query. It took between 0.033 and 0.035 milliseconds to execute the query. It also indicates that only 1 row was returned, and the query executed in a single loop.Index Cond: ((id)::text = '01H7BWSR1GJCJBKGGPV5NS3YNY'::text)
: This line specifies the condition used in the index scan. It's comparing the "id" column of the "users" table, cast to text, to the value '01H7BWSR1GJCJBKGGPV5NS3YNY' also cast to text. This condition is used to identify the row(s) that match the WHERE clause of your query.Planning Time: 0.070 ms
: This line provides the time it took to plan the query, which is the process of determining the most efficient way to execute the query. In this case, it took 0.070 milliseconds to plan the query.Execution Time: 0.056 ms
: This line indicates the total execution time of the query, which is the time it took to execute the query after planning. In this case, it took 0.056 milliseconds to execute the query.
But what if we only wanted the indexed value itself? We can skip the lookup in the table and return the index value itself.
explain analyze
select id
from users
where id = '01H7BZ4XC9HTG7H4DTH7B3AAY5';
Index Only Scan using users_pkey on users (cost=0.29..4.30 rows=1 width=27) (actual time=0.024..0.025 rows=1 loops=1)
Index Cond: (id = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Heap Fetches: 0
Planning Time: 0.091 ms
Execution Time: 0.041 ms
Scanning index and maybe looking into row
For our next query, now we make a more realistic one: find the user with a given username
explain analyze
select *
from users
where username = 'username-999';
Seq Scan on users (cost=0.00..22.50 rows=1 width=47) (actual time=0.179..0.180 rows=1 loops=1)
Filter: (username = 'username-999'::text)
Rows Removed by Filter: 999
Planning Time: 0.056 ms
Execution Time: 0.208 ms
We are faced with a different scan, now with Seq Scan
and as the name implies, we sequentially scanned through the table looking rows with matching conditions. At the time of the query, the are 1000 rows in the table, only one with ‘username-999’ as a column value, but we had to scan the entire table as there was no way to guarantee that only that row had the value.
But lets say that there is a new business rule about unique usernames.
ALTER TABLE "users" ADD CONSTRAINT "users_unique_username" UNIQUE("username");
Making the same query again will result in a different scan.
explain analyze
select *
from users
where username = 'username-999';
Index Scan using users_unique_username on users (cost=0.28..8.29 rows=1 width=47) (actual time=0.023..0.024 rows=1 loops=1)
Index Cond: (username = 'username-999'::text)
Planning Time: 0.084 ms
Execution Time: 0.038 ms
Explaining the index
Why did it change from a Seq to an Index scan? Previously, we had no way to find which record had the username without scanning the entire table for the matching row(s). However, now we can have better assurance that the value is unique when modifying the data, and a faster lookup as we can scan the index more efficiently.
Indexes are amazing when you think about them, but why don't we index everything? There are costs associated with indexes, both in terms of storage and logic. When you mutate the table, the indexes also need to update their values, which adds costs to the overall database.
More complex queries
Now lets expend more into the possible search aspect of a table with a more complex one: Just a simple posts
table with a relationship with the users
table.
CREATE TABLE IF NOT EXISTS "posts" (
"id" varchar PRIMARY KEY NOT NULL,
"title" text NOT NULL,
"body" text NOT NULL,
"published" boolean DEFAULT false NOT NULL,
"created_at" timestamp DEFAULT now(),
"user_id" varchar NOT NULL
);
--> statement-breakpoint
DO $$ BEGIN
ALTER TABLE "posts" ADD CONSTRAINT "posts_user_id_users_id_fk" FOREIGN KEY ("user_id") REFERENCES "users"("id") ON DELETE no action ON UPDATE no action;
EXCEPTION
WHEN duplicate_object THEN null;
END $$;
For our next example query: search all published posts.
explain analyze
select *
from posts
where published is true;
Seq Scan on posts (cost=0.00..2538.00 rows=50410 width=84) (actual time=0.012..15.176 rows=49868 loops=1)
Filter: (published IS TRUE)
Rows Removed by Filter: 50132
Planning Time: 0.074 ms
Execution Time: 17.384 ms
Now we are faced with a situation similar to before: a Seq Scan for a "simple" comparison. The first thought is to simply add an index. Yes, but not completely right. We only care about the published posts, so we can index them only based on that aspect, as we don't care about the other ones.
CREATE INDEX IF NOT EXISTS "posts_published_index" ON "posts" ("published") WHERE published IS true;
explain analyze
select *
from posts
where published is true;
Bitmap Heap Scan on posts (cost=438.23..2480.33 rows=50410 width=84) (actual time=1.931..11.999 rows=49868 loops=1)
Recheck Cond: (published IS TRUE)
Heap Blocks: exact=1538
-> Bitmap Index Scan on posts_published_index (cost=0.00..425.63 rows=50410 width=0) (actual time=1.658..1.659 rows=49868 loops=1)
Planning Time: 0.066 ms
Execution Time: 14.315 ms
Running the same query now give us a new scan method: Bitmap.
Bitmaps
Bitmap Index uses bits (0s and 1s) to represent whether a particular value exists or not in the column. Each bit corresponds to a row in the table, and if a value is present for a row, its corresponding bit is set to 1; otherwise, it's set to 0. The Bitmap Heap is where the database collects all the relevant rows from the table based on these conditions.
https://www.cybertec-postgresql.com/en/postgresql-indexing-index-scan-vs-bitmap-scan-vs-sequential-scan-basics/
Another case to access the data is from the users side, as follows:
explain analyze
select *
from posts
where user_id = '01H7BZ4XC9HTG7H4DTH7B3AAY5'
Seq Scan on posts (cost=0.00..2788.00 rows=10 width=84) (actual time=1.265..15.436 rows=11 loops=1)
Filter: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Rows Removed by Filter: 99989
Planning Time: 0.066 ms
Execution Time: 15.452 ms
You can start to see the pattern around accessing the data. When you have a case where you have a clear path on how to access the data, you can check if the index is worth the case.
CREATE INDEX IF NOT EXISTS "posts_user_id_index" ON "posts" ("user_id");
explain analyze
select *
from posts
where user_id = '01H7BZ4XC9HTG7H4DTH7B3AAY5';
Bitmap Heap Scan on posts (cost=4.50..42.20 rows=10 width=84) (actual time=0.054..0.073 rows=11 loops=1)
Recheck Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on posts_user_id_index (cost=0.00..4.49 rows=10 width=0) (actual time=0.047..0.047 rows=11 loops=1)
Index Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Planning Time: 0.372 ms
Execution Time: 0.096 ms
Mixing indexed and non-indexed columns
We indexed both published posts and user posts, but what if we wanted both of them?
explain analyze
select *
from posts
where user_id = '01H7BZ4XC9HTG7H4DTH7B3AAY5'
and published is true;
Bitmap Heap Scan on posts (cost=4.49..42.20 rows=5 width=84) (actual time=0.100..0.117 rows=3 loops=1)
Recheck Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Filter: (published IS TRUE)
Rows Removed by Filter: 8
Heap Blocks: exact=10
-> Bitmap Index Scan on posts_user_id_index (cost=0.00..4.49 rows=10 width=0) (actual time=0.072..0.073 rows=11 loops=1)
Index Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Planning Time: 0.157 ms
Execution Time: 0.151 ms
We previously indexed both fields separately, but now we have both of them in the same query.
In this case, we can delve further into the query to better understand its planning aspect. By first index scanning the user_id, and then the published value, we can eliminate a larger portion of unwanted rows by examining the user_id. This is because there are more possibilities of having different values that we don't want. We can then check a smaller subset of the published filter.
We can also create a compounded index for the case above: user_id and published.
CREATE INDEX IF NOT EXISTS "posts_user_id_published_index" ON "posts" ("user_id", "published") WHERE "published" IS TRUE;
explain analyze
select *
from posts
where user_id = '01H7BZ4XC9HTG7H4DTH7B3AAY5'
and published is true;
Bitmap Heap Scan on posts (cost=4.33..23.54 rows=5 width=84) (actual time=0.045..0.051 rows=3 loops=1)
Recheck Cond: (((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text) AND (published IS TRUE))
Heap Blocks: exact=3
-> Bitmap Index Scan on posts_user_id_published_index (cost=0.00..4.33 rows=5 width=0) (actual time=0.037..0.038 rows=3 loops=1)
Index Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Planning Time: 0.342 ms
Execution Time: 0.076 ms
But what happens if we need to add a non-indexed column to the query? Will an index not be used?
explain analyze
select *
from posts
where user_id = '01H7BZ4XC9HTG7H4DTH7B3AAY5'
and published is true
and created_at > '2023-01-01';
Bitmap Heap Scan on posts (cost=4.33..23.55 rows=5 width=84) (actual time=0.038..0.044 rows=3 loops=1)
Recheck Cond: (((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text) AND (published IS TRUE))
Filter: (created_at > '2023-01-01 00:00:00'::timestamp without time zone)
Heap Blocks: exact=3
-> Bitmap Index Scan on posts_user_id_published_index (cost=0.00..4.33 rows=5 width=0) (actual time=0.028..0.028 rows=3 loops=1)
Index Cond: ((user_id)::text = '01H7BZ4XC9HTG7H4DTH7B3AAY5'::text)
Planning Time: 0.115 ms
Execution Time: 0.071 ms
The query planner is smart enough to understand that a subset of the query conditions are indexed, so it can perform that faster subset first and then the "slower" conditions later.
What to take from searches and indexes?
Indexes can speed up data access, but they come at a cost. The key takeaway from the examples above is to learn the patterns of how you access your data. Once you've identified these patterns, you can create indexes or explore alternative approaches to accessing the data.
KV Stores
But what if the SQL query isn't fast enough? Are there any alternatives? Yes, we can use key-value (KV) stores for that.
Here's a simple analogy: a SQL database is like a giant list that you filter and merge with queries, while KV stores are like map/hashmap data structures that allow direct access to the data you're looking for.
Not a real metric value, just an example to understand
What are some examples of KV stores?
- Redis: A classic one, slow query performance? Just cache it on redis =D
- DynamoDB: A modern and dangerous one, the aws service to have O(1) performance (supposedly) when acessing the data
- Cassandra (Scylla): The dark horse, you hear about it on giant scales like the discord trillion messages and such.
What is the main difference between a KV store and a SQL database?
In "classic" databases, there are many points of access to the data. You can jump from table1 to table2 to table3 to table n, as long as there is a logical connection between them. To take advantage of this, you have to plan what should happen in the query. Database engines are great at this. However, one drawback is that you must plan the entire process and execute it afterward. In Postgres, this is shown as Planning Time: N ms - Execution Time: N ms.
With KV stores, you can achieve almost linear data access at the expense of a rigid access pattern. You must access data using the key → value pattern every time. There is a base cost to using this pattern, as you must compute the key to locate the data. For small amounts of data, this cost may not be worth paying. However, you can tip the scale in favor of the initial cost for larger amounts of data.
When to use one or another
In general, SQL databases are more flexible and offer decent performance when you don't know the data access pattern. When using a KV store, you trade flexibility for faster read speeds. For some use cases, this tradeoff is acceptable, but not for all. However, SQL databases and KV stores are not mutually exclusive. You can have both at the same time.
Top comments (0)