Postgres organizes Indexes

#database #postgres #indexes

Index

what is Index?

The index is a datatype, separated from our data from the table. It likes a map of locations of data that points to a value located in which page on heap AND row id (for Postgres, it is tupleId). Heap is a table of our data with its page after another.

(index value) => (Page, rowId)
(index value) => (Page, rowId)

Page

At first, I thought when we make a query to get a single row, it just fetches that single row for us. Actually, it will fetch the whole page that contained our row from the heap, then collect and return the row we want. For each page, we call it an I/O. So the performance of our query depends on how many pages the DB fetches. In case each row of data contains a huge string or a lot of columns, then a page will include fewer rows, and then we will need more IO, so the performance of our query will decrease.

Clustered Index

Sometimes, we may hear about Clustered index. When we define a column as a clustered index, the table now will organize around that key. In case a new row is added to the table, the row will be added to the location on the heap that follows the index order. For example, we have a user table below with the id column is uuid, and priority_level is a clustered index. Although, we all see the rows ordered Dantis > Vu > Ngan > John, the actual order in the heap is Dantis > Ngan > Vu > John.

id	name	priority_level
awes	Dantis	10
zwqe	Vu	5
tydb	Ngan	9
mnbv	John	4

The Clustered index is really beneficial when we perform a range query, like getting all users that are older than 90 with column age is a clustered index. The DB just need to find the start page and the end page, then fetch all of them.

However, in Postgres, we just have an eventually clustered database. Because even if we create a clustered index for our table, then a new row comes, Postgres does not guarantee that the row will be followed our order.

Organizing Indexes (Postgres vs Mysql)

For Postgres, we still can do that, but actually, those indexes are all secondary indexes. Different from other RDBMS, all indexes will point directly to the heap. So, you may ask where is the primary index, Postgres uses that for tupleId (it likes rowId) to manage the data.

For MySQL, we can create both primary index and secondary index, then the primary index will point to the heap, and the secondary index will point to the primary one.

Because of that design, Postgres and Mysql have their own behaviors when querying, inserting, and updating (or deleting) a row. Basically, they are around the idea that the index needs to be aware of any change related to it.

Querying

Because Postgres point the index directly to the heap, the DB just needs to use an index scan or another heap scan to get our targeted row. Meanwhile, Mysql will perform 2 index scans on the secondary index and the primary index. So, Postgres can process just a little bit faster than Mysql.

Inserting

For inserting, Postgres and Mysql do the same thing that they need to update all the existing indexes to make those indexes aware of the new row coming. For example, if our table has 10 indexes, then the table needs to perform 10 updates. That is the problem when we have too many indexes in a table.

Updating (Deleting) This is the most interesting part

When updating a row, instead of changing the data, Postgres create a new row from the original data with the changes and a new tupleId, then it points all the indexes to that new row. You may ask about the old tupleId (dead tuple). Postgres uses that to revert a transaction, but then if we forget to clean up those dead tuples, they will take a large resource for nothing. That is why we need the 'vacuum', not to free our memory but to reuse it.

For Mysql, the thing is quite more simple. it just needs to update the primary and the related index if necessary.

Datatype

The problem of the index datatype mostly comes from the randomness of the data. In case our index is Sequential data, when we create or update a row, it is easy for the DB to locate the next location on the heap for the row. In addition, B-tree ,which is the most popular index datatype by default, is also need that sequential. Because it takes a lot of step to re-balance the structure.

Paypal.

I am really happy to receive your feedback on this article. Thanks for your precious time reading this.