If you are just starting out on your SQL database journey, you may be wondering what indexes are and why we care so much about them.
While the subject of indexes can be complex, learning some basics will get you on your way.
What are indexes?
Much like an index in the back of a book, database indexes help speed up queries by knowing where to look for the data you are after.
If a book didn't have an index, you would need to look through every page until you found what you were after, but having an index in the back of the book means you can look up what page something is on and go straight to the page, which speeds up your search.
This is the role indexes have in a database.
What types of indexes are there?
There are four types of indexes.
Primary
A Primary index is a unique index. It ensures that the field that is set as Primary will contain a unique value in every row of a table.
A table can only have one primary index.
Unique
Much like a primary index, a unique index ensures that the field value is unique in every row of the table.
There can be many unique indexes in a table.
Index
This type of index is a general index. It tells the database to make an index of the data in the field, which it may use to speed up queries. Field values do not need to be unique.
Fulltext
While the other indexes index on the whole value in the field, full text indexes index on text within each row of the field set as full text. This allows the index to work on searches for partial text. It's particularly useful on fields set as TEXT or BLOB data types where large amounts of data may be stored in the field.
When should an index be added?
It is likely that queries that take longer than it should to run need an index. Inefficient queries can bog down your server and lead to increased server load, long response times, and poor server performance.
Adding indexes to tables with small amounts of data in them won't help speed up the query. What you will most likely find is that the indexes are ignored.
Before adding an index, make sure it really needs to be indexed. While indexes will speed up SELECT queries. It can slow down INSERTs or UPDATEs.
While it may be tempting to index every field in a table. This is not good practice and can lead to poor server performance.
If a query runs well without an index, it may not need an index, but it if runs slow and bogs down your system, its probably a good idea to add one.
Use the EXPLAIN keyword in your query to get an insight into how a query will run, this may give you information you need to make informed choices on indexing.
What is EXPLAIN?
EXPLAIN will show you what the database is planning to do to retrieve the results from your query. This is called a query execution plan.
If you have a query like:
SELECT * FROM employees WHERE gender='f';
To use the EXPLAIN keyword to look at the query plan you would add EXPLAIN to the beginning of your query
EXPLAIN SELECT * FROM employees WHERE gender='f';
Which will show a result similar to:
Temporary Tables and File Sorting
Using EXPLAIN, sometimes you will see Using Temporary or Using File Sort. These are both things to watch out for, there is a good chance the query is not optimized well if the query plan contains these.
Temporary Tables
While temporary tables are sometimes required to run a query, if SQL has decided that creating a temporary table is the best option, it's an indication that it's worth taking a second look at the query.
Sometimes the way to prevent the need for the temporary table is to add an index, but don't just blindly add indexes (we want to keep the number of indexes on a table to as few as practical), first take a look and see if there is another way you could write the query that doesn't require a temporary table.
File Sorting
File sorting means that SQL is trying to sort all the results in a temporary buffer, which can slow the query down.
Indexes can help with removing file sorting, but as with temporary tables, look to see if there is a way to improve your query before adding in index.
Using an index on a WHERE clause
For this tutorial I am using a sample database available from https://github.com/datacharmer/test_db.
Looking at the WHERE part of a query is a good place to start when first starting out with indexes. If a field referenced in the WHERE clause has an index, the query will try to use that to speed up the query.
SELECT * FROM employees WHERE first_name='mary';
The EXPLAIN for this query looked like this:
Explain shows us that it used the WHERE to execute the query since there is no index to use. It also shows that it looked through 299556 rows to find the data. This is a full table scan. It's looked through every row and to see if the value of the first name field is equal to Mary.
Full table scans are a good indication that an index might be needed, especially in large data sets.
CREATE INDEX first_name_idx ON employees (first_name);
You can see from explain that it used the index this time and only looked at 224 rows rather than all 299556 rows.
This makes the query much faster.
This is what indexes are good for.
Multi-field indexes
Indexes can be created to combine multiple fields. In this example we might want to run queries on first name and last name.
SELECT * FROM employees
WHERE first_name='Mary'
AND last_name='Cooley'
Without an index this would result in a table scan and explain would look similar to the previous table scan
We could add indexes on both first name and last name fields separately.
ALTER TABLE employees
ADD INDEX first_name (first_name),
ADD INDEX last_name (last_name);
This would provide a decent index for the query we have run, but not the best index.
Explain would provide an interesting result.
The database has merged the two indexes together then looked for a row matching both fields combined.
If instead we create an index combining the two fields, then the database doesn't need to merge the indexes and it will be a less taxing query.
ALTER TABLE employees
ADD INDEX full_name (first_name,last_name);
This is a much better looking query execution plan and will run faster than it did with separate indexes. The overhead of merging the indexes together takes time, so anything we can do to help the database execute our queries faster is a good thing.
Special Note: The order fields are added to a multi column index matters in some cases
This query:
SELECT * FROM employees
WHERE first_name='Mary'
AND last_name='Cooley'
Could be rewritten as
SELECT * FROM employees
WHERE last_name='Cooley'
AND first_name='Mary'
This will be fine, it will use the index, as long as both fields are present in the WHERE clause.
But what happens if we want to run a query on just on first name?
SELECT * FROM employees
WHERE first_name='Mary'
Even though only one field in the index has been used in the query, it still uses the index.
What happens if we want to run a query on last name?
SELECT * FROM employees
WHERE last_name='Cooley'
You can see that it is no longer using the index.
So what happened here?
Multi field indexes cascade from left to right. For the index to be used, the first field in the index needs to be in the query before it will look for the presence of the second field in the index.
To gain the advantages multi-field indexes give us and also run queries on last name alone, an additional index could be added for just last name
ALTER TABLE employees
ADD INDEX last_name (last_name);
Note: There is no need to add an index for first_name since the multi-field index will work with just the first name.
If we rerun the original query
SELECT * FROM employees
WHERE first_name='Mary'
AND last_name='Cooley'
If we look at the fields Possible Keys and Key, it will show that it has considered both indexes but chosen the multi-field index
If we rerun the query looking for just last name
SELECT * FROM employees
WHERE last_name='Cooley'
This time it will use the index we created for just last name.
Indexing on ORDER BY
WHERE clauses are not the only place indexes are used. Indexes are also used when ordering data in the query.
SELECT * FROM employees
ORDER BY last_name LIMIT 100
Even though there is no WHERE clause, the index is used to order the data faster.
Indexing on GROUP BY
Indexes are also used in GROUP BY queries.
SELECT COUNT(*),last_name
FROM employees
GROUP BY last_name
If this query is run without an index Explain gives us this result:
This is possibly the worst kind of query. It's using a file sort and it's creating an internal temporary table to execute the query. This will cause the query to be quite slow.
Rerunning the query with an index on last_name added:
Now it is using the index and will be a much faster query.
Indexing with LIKE
In some cases indexes will not be used with a LIKE keyword.
For queries similar to the the ones below, the index can be used:
SELECT * FROM employees WHERE last_name LIKE 'cooley'
SELECT * FROM employees WHERE last_name LIKE 'coo%'
However, queries similar to the ones below will not use the index:
SELECT * FROM employees WHERE last_name LIKE '%ley'
SELECT * FROM employees WHERE last_name LIKE '%ole%'
The index is used up until it hits a wild card, after that, it no longer uses the index.
In cases like these, it may be more beneficial to use a full text index and query the full text data rather than using LIKE.
The 30% Rule
Sometimes SQL will not use an index even when you think it should. To demonstrate this I have added an index on the hire_date field
ALTER TABLE employees
ADD INDEX hire_date (hire_date);
SELECT * FROM employees
WHERE hire_date = '1991-06-01'
This query will use the index just fine
But what if I want to find all the employees hired in 1991?
SELECT * FROM employees
WHERE hire_date BETWEEN '1991-01-01'
AND '1991-12-01'
Looks good also, it used the index.
Now we want employees hired between the start of 1987 and the end of 1991
SELECT * FROM employees
WHERE hire_date BETWEEN '1987-01-01'
AND '1991-12-01'
This query does not use the index.
Whats going on here?
The answer lays in the amount of rows being returned. When the number of rows returned goes over about 30% of the total rows, SQL is less likely to use the index.
Sometimes this is OK, other times it is a problem that needs to be addressed.
One option is to paginate your results by limiting the results returned
SELECT * FROM employees
WHERE hire_date BETWEEN '1987-01-01'
AND '1991-12-01'
LIMIT 1000
Creating good queries without an index
Consider the following query
SELECT * FROM employees
WHERE first_name = 'Mary'
In this query, first_name does not have an index and the result is that a table scan is done to retrieve the results
An index could be added to prevent the table scan and that would fix the issue.
However, Before we create the index, we should see if there is another way we can query the database that won't result in a table scan.
One option is to paginate the results. We can do this by limiting the result set.
SELECT * FROM employees
WHERE first_name = 'Mary'
LIMIT 10
At first glance you may conclude that this would prevent the table scan because it's only returning 10 results.
However, this isn't the case.
Looking at the execution plan, it looks exactly the same as before.
This is because even though it returns only 10 rows, The execution plan has determined that a full table scan is the best way to go.
To prevent this table scan, there are a few options, but one simple method is to order the data by the tables primary key
SELECT * FROM employees
WHERE first_name = 'Mary'
ORDER BY emp_no
LIMIT 10
Now it is only looking at the number of rows it needs to return the number of results asked for.
By paginating the results, there is a good chance that result offsetting will be required.
SELECT * FROM employees
WHERE first_name = 'Mary'
ORDER BY emp_no
LIMIT 10,100
The execution plan shows it is looking in more rows than it does without the offset, The plan is telling us that it will scan only up until it finds the the 10 rows it is looking for, and not scan the whole table.
While adding an index would provide a faster query, to prevent having too many indexes, you can adjust your query to prevent the need for the index. This is all part of balancing adding indexes with adjusting queries to be more efficient queries.
Final Thoughts
Indexes will help optimize your queries, but you do need to balance adding indexes with improving your query instead.
Try to improve your query before adding an index, adding too many indexes can slow down inserting and updating data into your table.
Use EXPLAIN on the queries you design to ensure they are well optimized.
It is also worth noting that if you are testing your query locally, that the way the query optimizes may differ than on live due to the amount of data and type of data on live vs local development.
Top comments (6)
Nice post, Lawrence!
Adding too many indices can impact performance and disk space too.
I do have one question about this query and why its matching
rows
column inEXPLAIN
is 10:rows
is an estimated number of rows needed to be scanned so it could be off by a large margin.Nothing tells the DB engine that employees named
Mary
are in the first 10 rows when sorted by emp_no. You could have 10000 John's or Steve's or Sarah's in which case the DB engine will have to scan them all.Any insights?
In the actual execution plan, the "select type" is set to SIMPLE.
In this case, it scans until there are 10 rows passing through the WHERE clause, sorted by the primary key.
However, since the data in the database is already sorted by the primary key, the ORDER BY clause is unnecessary.
Hmmm, your comment is valid if
emp_no
is indeed the primary key. Unless I am mistaken this post doesn't tell us about the table structure. Soemp_no
could just be a unique index.Anyhow, I'd like to know where that
rows=10
comes from. I understand it scans until it finds 10 rows that match the condition but why does it think it needs to scan only 10 rows? It could need to scan 900 rows cosemp_no
is unrelated tofirst_name
.The 10 would be the minimum possible rows it can look at to get the results you are after. In the case of the LIMIT 10,100 its shown as 110 as it knows it needs to find at least 110 rows before it will find enough rows to show you. In reality the number of rows will be larger than that, its unlikely that the first 10 rows are going to match the criteria, but it is possible and its that "possible" that EXPLAIN is returning.
Using an index doesn't always guarantee an improvement in query performance.
In fact, it can even be slower than a full scan, even when the optimizer expects the index to be faster.
yes, you always need to weigh up adding an index. its not always the right answer, its always worth testing both ways and see which way is more performant.