SQL queries are used to retrieve data from a database. We use the SELECT statement to retrieve data and usually a WHERE clause to filter the data.
Let's see how SELECT statements work.
First let's define our table. We have a
posts table defined by the below schema:
CREATE TABLE posts ( id INTEGER PRIMARY KEY AUTOINCREMENT, user_id INTEGER, title TEXT, body TEXT, created_at DATETIME DEFAULT CURRENT_TIMESTAMP, updated_at DATETIME DEFAULT CURRENT_TIMESTAMP FOREIGN KEY (user_id) REFERENCES users(id) );
Let's say we want to retrieve all posts. We can do this by using the SELECT statement.
SELECT * FROM posts;
This will return all the posts in the
Let's take this one level up. We want to filter all posts where the title contains the word "Java" (since its my favorite language).
SELECT * FROM posts WHERE title = "Java";
This will return all the posts where the title is "Java". The
WHERE clause is used to filter the data.
All is pretty convenient till now. Queries can be much complex but this should be enough to understand what is happening behind the scenes.
Let's examine how our query works. What does the system do in the background?
- The first step is to locate the table we want to query. So it reads the FROM clause. Now it knows that it needs to query the
- It will then read the WHERE clause and find the
titlecolumn mentioned there. Now it knows that it needs to filter the results based on the
titlecolumn. If there were multiple columns in the WHERE clause, it would filter the results based on all the columns.
- Next step is to go through the
titlecolumn of each row and find the rows that contain the word "Java".
- Finally, it will return all the rows that contain the word "Java".
Let's say this table is part of a huge database. Since it goes through the entire database, it will take a long time to run.
This is where the concept of indexing comes in.
What if the rows were sorted alphabetically by the title column?
Would it be faster to query the database?
We can achieve this by creating an index on the title column.
Let's see how we can create an index on the
CREATE INDEX posts_title_index ON posts (title);
This will create an index on the
The index is basically a lookup table. It has two parts:
- A key - It is the value of the column we are indexing on - in this case
- A value - It is the pointer to the row that contains the entire
Note: It is also possible to index multiple columns together. For example, if we want to find all the posts that contain the word "Java" and the word "Python", we can index the
title column and the
body column together.
CREATE INDEX posts_title_body_index ON posts (title, body);
Here the index will have two keys. The first key is the
title column and the second key is the
body column. When both the keys match, the index will point to the row.
So if we run the query again, it will look up the key in the index and return the row id. The row id can then be used to return the entire row.
- The query optimizer will look at the WHERE clause and see if it can find an index on the column.
- If it can, it will use the index to find the rows.
- If it cannot, it will run the query normally.
In our case, the query optimizer will find the index on the
title column. It will use the index to find the rows. Now the speed depends on how the index is implemented.
- if the index is implemented as a hash table, as can be done for values that can be consistently hashed, then the query will be quite fast - O(1)
- if the index is implemented as a binary search tree, as with values that can be sorted, then the query will be slightly slower but still faster than the query without the index which was taking linear time. - O(log n)
There are many more ways to implement indexes and each one has its own advantages and disadvantages. To read further refer to - https://en.wikipedia.org/wiki/Index_(database)
While it is possible to create an index for each column, it is not necessary. It has a performance cost. Wait, why?
We saw that the index makes querying faster but what about writing to the database?
- If we write to the database, we will have to write to the index as well. Indexed need to be maintained in real-time.
- So every time a new row is added to the table, all the indexes on its column will need to be updated. More the indexes, longer the write time.
For optimization purposes, the database engine will automatically create indexes for some columns.
- The primary key column - This is the column that is used to uniquely identify a row. It is automatically indexed.
- The columns with unique constraints - These are automatically indexed. There can be many columns with unique constraints. Disclaimer: Each specific database implementation may have different rules for indexing although most of them follow the above two rules.
There are many factors that can influence the speed of indexing and it is important to choose the columns that are most likely to be queried.
- The column should be indexed if it is likely to be queried often - For example, finding all the posts by a user is a common query.
- Multiple columns should be indexed only if they are likely to be queried together - For example, finding all the posts that contain the word "Java" and the word "Python" is not a common query.
- Don't index columns that are changed frequently - This creates a lot of unnecessary index updates.
- Columns used to join tables are a good candidate for indexing. It is likely that they will be used in queries.
As you can see, the decision criteria are not always obvious but the general idea is to choose the columns that are most likely to be queried. Keeping this in mind, we should decide our querying strategy and indexing strategy together. Keep as few columns in queries as possible and look at the possibility of those columns having their own index.
Thanks for reading. This should give you an idea about indexes in SQL databases and how they can be an important part of your SQL performance. If you want to connect with me, you can find me on Twitter @abh1navv