Abel Ayalew

Posted on Jan 18 • Originally published at nabroleon.hashnode.dev

Inside the Database: Tables, Indexing, and Keys

#database #backend #postgres #tutorial

That's a pretty long title, huh? 😰 Exploring the ins and outs of a database isn't something we can cover in a quick blog post, but I'll break down the main ideas in a simple way. This way, you can get the basics down and be all set to dive into more advanced stuff on your own.

For the second part of the "Database Fundamentals" series, I was going to write about Database Indexing, but then I thought there were a couple of things we need to understand before we move on to database indexing concepts.

How do databases store table data?
What are indexes? Do they have a type? How are they stored?

These and some other questions are what we are going to address in this blog, I will keep it concise bear with me.

How do databases store table data?

You might say "On disk, duh! 🤷‍♂️", but there is a bit more detail to it. The tables we see(i.e. rows and columns) to visualize our data are just logical representations. The way databases store the table data is with things called pages. A page is nothing but a fixed-size memory location and is a unit of information read from or written to disk. The size of a page differs from database to database(i.e. 8kb for PostgreSQL, 16kb for MySQL(InnoDB)). Under the hood, these pages are managed by the underlying storage system.

Each page can store multiple table rows inside depending on the size of the row. The database doesn't read a single row when we do an I/O. It reads one or more pages and we get lots of rows.

An I/O is a read request to disk but sometimes it can go to the operating system cache instead of disk.

The cost of page I/O dominates the cost of typical database operations, and DBMSs are optimized to minimize this cost.

Ok got it, we store the rows in pages, but where do we store the actual table data which we have organized in pages? well, we store it in a heap structure and it is not sorted at all. There is neither a relationship between the rows stored on the same page nor is there any connection between the pages. This holds true across various DBMSs.

In this simplest and most basic type of organization, records are placed in the file in the order in which they are inserted, so new records are inserted at the end of the file. Such an organization is called a heap or pile file. ~ Elmasri

Inserting a new record into the heap is very efficient. However, traversing the heap is very expensive as it holds lots of data and it involves a linear search through the file page by page. This is why we need to utilize indexes to optimize certain kinds of retrieval operations. The indexes tell us what page in the heap we need to pull.

💡For more info check Part III STORAGE AND INDEXING from "Database Management Systems" by Raghu Ramakrishnan and Johannes Gehrke.

What are indexes?

An index is a data structure that organizes data records on a disk to optimize retrieval operations. Without indexing, DBMS has to iterate through every page until it finds the requested data. Without indexing this would be us 👇

An index makes the query fast⚡. Awesome, now, let's talk about two types of indexes: clustered and non-clustered. There are more types, but today, we'll focus on these two.

Clustered Indexes

Sometimes a heap file is organized around a single index, a clustered index. A clustered index is a type of index that physically orders the data in the database table according to the index key. This means the order of the data in the clustered index is the same or close to how the data is stored in the table, which makes it very efficient for range queries. They are also called Index Organized Tables.

In a clustered index, the index data and the actual row data are not stored separately. They are in the same table and the clustered index ensures the table data is sorted physically using the key column of the index. Usually, a primary key is considered unless otherwise specified. We can think of clustered indexes as the dictionary. In a dictionary, the words are sorted by alphabet and the definitions(our data) are located next to the corresponding words(our index).

One thing we mustn't forget is that since the clustered index sorts the rows using an index key, there can only be one clustered index. Additionally, the choice of the key for the clustered index is crucial, as it directly impacts the physical storage and retrieval efficiency of the table's data.

MySQL(InnoDB) uses a clustered index, how is this clustered index created?🤔. Well, a clustered index is chosen in this order

Default Primary Key:
- If a primary key exists, it's automatically chosen as the clustered index.
Unique Non-Null Column:
- Without a primary key, any unique, non-null column becomes the clustered index.
Hidden Clustered Index:
- If neither is available, InnoDB creates a hidden clustered index on a system column with row IDs.

If unfamiliar with InnoDB, it's a MySQL storage engine known for balancing high reliability and high performance. MySQL offers other engines like ISAM, MyISAM, MEMORY, and more, each with distinct features and use cases. InnoDB is the default one.

Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees. B-tree is a self-balancing tree that maintains sorted data and allows searches, insertions, deletions, and sequential access in logarithmic time. You don't have to worry about how B-trees are implemented as it is abstracted for you but if you are interested, I can write a blog about B-trees and B+ trees. But until then, if you want to see how B Tree and B+ Tree work, check out these resources.

MIT OpenCourseWare - Trees and B-Trees
B Tree visualizer
B+ Tree visualizer

Non-clustered Indexes

A non-clustered index is a type of index that does not affect the physical order of the data in the database table. Instead, it creates a separate data structure, such as a B-tree or a hash table, that stores the index key and a pointer to the corresponding record in the table.

Non-clustered indexes are like "Indexes" in books, wherein we can check the page number and then use that page number to read the page we want.

So as you can see just like the book indexes and the actual content, index data and actual table data(heap data) are stored separately. Take a look at the next figure

In PostgreSQL, all indexes are secondary and stored separately from the table's main data (heap). Unlike clustered indexes, secondary indexes require fetching data from both the index and heap during a typical scan. While offering flexibility for optimizing queries, and supporting multiple columns and indexes per table, they may increase storage, memory, and disk I/O. Balancing these factors is essential for maximizing the benefits of secondary indexes in PostgreSQL.

As you have seen the way MySQL(InnoDB) and PostgreSQL store their indexes differently and it affects their reads and writes differently. I might write a blog about it in the future.

If you want to learn more about this, Check out Hussain's video(Indexing in PostgreSQL vs MySQL). Don't forget to check out his other videos too as he makes exceptional content.

Conclusion

Before we dive into the world of database indexing, we took a moment to understand how databases store information and explored different types of indexes and how they are stored. If it seems a bit overwhelming, don't worry! Give it another read and check out additional resources for more clarity. We touched on things like primary and secondary keys briefly, but we'll delve into them more in future blogs.

📆 In the upcoming blog, we'll explore the practical details of using indexes to speed up queries, including different types of scans. Get ready for more insights into Database Indexing! Stay tuned, and everything will become clearer.

Happy reading! Bye for now! 👋

👀 Always remember, this is just the beginning – a glimpse into the vast world of database intricacies. For deeper insights into database indexing and storage, check the resources below.

Resources

"Fundamentals of Database Systems" by Elmasri and Navathe.
"Database Management Systems" by Raghu Ramakrishnan and Johannes Gehrke.
"SQL Performance Explained" by Markus Winand
PostgreSQL Docs
CMU Intro to Database Systems - YouTube
Hussein Nasser - Database Engineering - YouTube

DEV Community

Inside the Database: Tables, Indexing, and Keys

How do databases store table data?

What are indexes?

Clustered Indexes

Non-clustered Indexes

Conclusion

Resources

Top comments (0)

Read next

Bridging the Gap: A Case Study on Synchronizing Shopify and Microsoft Dynamics GP

Day 16: Introduction to DockerHub

Set Git to Recognize Case Changes

Exploring new AWS Aurora DSQL. What is it ? Why it is important ? How to quickstart ?