In today's data-driven world, databases play a crucial role in modern computing, serving as the backbone of countless applications and systems. A critical component of database management is indexing, which helps optimize database performance by speeding up data retrieval. In this article, we will explore indexing techniques in SQL and their impact on query performance.
An index is a data structure that maps the values in a column to the locations of the corresponding rows, which can dramatically reduce the time needed to run certain queries. Not all indexing techniques are created equal, however: different techniques suit different types of data and queries. The sections below cover the most common indexing techniques and discuss their strengths and weaknesses, giving you a solid foundation for optimizing the performance of your own databases.
The Theory Behind Indexing
In a traditional database, data is stored in tables that are composed of rows and columns. When a query runs against a table that has no suitable index, the database engine must search through every row in the table to find the data that matches the query conditions. This process can be time-consuming, especially when the table is very large or the query conditions are complex.
Indexing works by creating a separate data structure that maps the values in a particular column to the location of the corresponding rows in the table. By creating an index on a column, the database engine can quickly locate the rows that match a particular value or range of values in that column, without having to search through every row in the table.
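The idea can be sketched in a few lines of Python. This is only an illustration, not how a real engine stores data: the `rows` list, the `city` column, and the helper names `full_scan` and `build_index` are all made up for the example, and a list position stands in for a row's location on disk.

```python
from collections import defaultdict

# A tiny "table": each row's position in the list stands in for its location on disk.
rows = [
    {"id": 1, "name": "Ana", "city": "Lima"},
    {"id": 2, "name": "Ben", "city": "Oslo"},
    {"id": 3, "name": "Cleo", "city": "Lima"},
    {"id": 4, "name": "Dan", "city": "Rome"},
]

def full_scan(table, column, value):
    """Check every row: the cost grows with the size of the table."""
    return [pos for pos, row in enumerate(table) if row[column] == value]

def build_index(table, column):
    """Map each distinct value in `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(table):
        index[row[column]].append(pos)
    return index

city_index = build_index(rows, "city")

# Both approaches return the same row positions.
print(full_scan(rows, "city", "Lima"))
print(city_index.get("Lima", []))
```

The scan touches all four rows, while the index lookup goes straight to the matching positions. The price is the extra structure, which has to be built and kept up to date.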
Before understanding the main techniques of indexing, it's important to have a basic understanding of how a hard drive works.
Hard drives
Hard drives are physical devices that store data on magnetic disks. Each disk is divided into the following units in order to store data:
A block is the smallest unit of data that the operating system or database reads from or writes to the drive in a single operation. The examples in this article assume 512-byte blocks, although the size varies with the drive, the file system, and the database engine.
A sector is a subdivision of a track on a hard drive. Tracks are concentric circles on the surface of the disk, and each track is divided into arc-shaped sectors. A sector is the smallest physical unit the drive can address, traditionally 512 bytes (4,096 bytes on many newer drives), and it carries some additional information such as an error correction code. A block maps to one or more sectors.
A track is a circular path on the surface of the hard drive that contains multiple sectors. The number of tracks on a hard drive varies depending on the size of the disk and the density of the data stored on it. Modern hard drives can have thousands of tracks per inch, allowing for very high data densities.
When data is read from or written to a hard drive, the disk rotates at high speed, typically several thousand revolutions per minute, and the read/write head moves across the surface of the disk to reach the desired track and sector. Because the head can only access one track at a time, every block read costs time, and a search that checks a table row by row must read every block the table occupies. Indexing techniques reduce this cost by cutting the number of blocks that have to be read.
Non-Clustered Index
A non-clustered index is a type of database index that creates a separate data structure to map the values of the indexed column(s) to the physical locations of the data on disk. When a non-clustered index is created, the database engine generates a new data structure that stores a copy of the indexed column(s), along with a pointer to the location of the corresponding data row(s) in the table. Unlike a clustered index, a non-clustered index arranges the data logically rather than physically. The physical order of the rows in the table may therefore differ from the order of the entries in the index: the index is sorted by the indexed column(s), while the table itself is left as it is.
When a query is executed that involves the indexed column(s), the database engine can use the non-clustered index to quickly locate the relevant rows in the table, without having to perform a full table scan. The engine follows the pointers in the index to locate the data rows on disk, and then retrieves the relevant data from those rows.
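A rough Python sketch of that lookup path follows. It assumes a made-up four-row table and models the index as a sorted list of (key, row position) pairs; real engines use an on-disk tree rather than a Python list, and the names `table`, `name_index`, and `lookup_by_name` are hypothetical.

```python
import bisect

# Heap table: rows stay in insertion order, which is unrelated to the name column.
table = [
    (104, "Maria"),
    (101, "Zoe"),
    (103, "Adam"),
    (102, "Liam"),
]

# Non-clustered index on name: a separate, sorted list of (key, row_position) pairs.
name_index = sorted((name, pos) for pos, (_id, name) in enumerate(table))
index_keys = [key for key, _pos in name_index]

def lookup_by_name(name):
    """Binary-search the index, then follow each pointer back to the table."""
    start = bisect.bisect_left(index_keys, name)
    end = bisect.bisect_right(index_keys, name)
    return [table[pos] for _key, pos in name_index[start:end]]

print(lookup_by_name("Liam"))
```

Note that the table is never reordered. Only the index is sorted, and each index entry costs one extra hop to reach the actual row.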
Advantages:
More flexibility: Non-clustered indexes allow for greater flexibility in database design, as they can be created on any column in a table, including columns that are not part of the primary key, and a table can have several of them.
Improved concurrency: Non-clustered indexes can improve database concurrency by reducing the need for table locks during data retrieval operations.
Compact size: A non-clustered index stores only a copy of the indexed column(s) and the corresponding pointer(s) to the actual data, so it is usually much smaller than the table it indexes. It does, however, take up disk space in addition to the table.
Disadvantages:
Additional maintenance overhead: Non-clustered indexes require regular maintenance to ensure that they remain effective, which can add to the administrative burden of managing a database.
Potential for fragmentation: Over time, non-clustered indexes can become fragmented, which can reduce their effectiveness and slow down data retrieval operations.
Reduced query performance for some queries: While non-clustered indexes can speed up data retrieval for some queries, they can actually slow down performance for others, particularly those that involve multiple joins or complex search criteria.
Clustered Index
When a clustered index is created, the database engine does not build a separate structure that points to the rows. Instead, it rearranges the data rows in the table so that they are physically stored in the order specified by the index key. The index and the table are, in effect, the same structure.
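The following Python sketch illustrates the idea under simplifying assumptions: the table is a single in-memory list, the key is an integer `id`, and the helper names `insert_row` and `range_by_id` are invented for the example. It is not how any particular engine lays out pages.

```python
import bisect

# Clustered storage: the rows themselves are kept sorted by the index key (id).
clustered_rows = []

def insert_row(row_id, name):
    """Insert the row at the position that keeps the table ordered by id."""
    bisect.insort(clustered_rows, (row_id, name))

def range_by_id(low, high):
    """Return rows with low <= id <= high as one contiguous slice.

    Assumes integer ids, so (high + 1,) marks the first row past the range.
    """
    start = bisect.bisect_left(clustered_rows, (low,))
    end = bisect.bisect_left(clustered_rows, (high + 1,))
    return clustered_rows[start:end]

# Rows arrive out of order but end up stored in key order.
for row in [(104, "Maria"), (101, "Zoe"), (103, "Adam"), (102, "Liam")]:
    insert_row(*row)

print(clustered_rows)
print(range_by_id(102, 103))
```

Every insert has to find the right position and shift later rows, which hints at why writes can be more expensive on a clustered table, while a lookup by key range reads one contiguous run of rows.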
Advantages
No separate copy of the data: Because the table itself is stored in index order, a clustered index needs no second copy of the rows. The sorted layout also makes retrieval faster, because the database engine can seek directly to the data and return it in key order without performing additional sorting operations.
Reduced disk I/O: Clustered indexes can reduce disk I/O by storing all related data in a single location, which makes retrieving data faster and easier.
Simplified maintenance: Clustered indexes typically require less maintenance compared to non-clustered indexes because they are directly linked to the table and don't require additional indexing.
Disadvantages
Increased write time: Clustered indexes can slow down write operations, because the rows must be kept in key order. Inserting a row in the middle of the key range, or updating a key value, may force the engine to move rows or split pages.
Limited flexibility: A table can have only one clustered index, because its rows can be physically ordered in only one way. The clustered index is usually based on the primary key.
Range searches on other columns: A clustered index speeds up range searches (i.e., searches for values within a specific range) only on its own key. A range search on any other column still needs a separate index or a scan of the table.
B-tree indexes
In a B-tree index, the index is stored as a tree, with each node representing a range of values from the indexed column. The root node represents the entire range of values in the indexed column, while the leaf nodes represent individual values.
To understand this, please consider the table shown in Figure 5, where each row can store 128 bytes of data, and one block on the hard drive can store four rows of that table.
To find a record with an ID of 1000 without an index, we would need to search through 250 blocks on the disk: one block holds 512 bytes, which is four 128-byte rows, so 1000 rows occupy 1000 / 4 = 250 blocks. However, we can improve performance by adding an index structure (as shown in Figure 6) that stores a pointer for each ID. This enables quick access to the exact location of a given record on the disk, improving the speed of record retrieval.
Indexes are also stored on disk, and the space required to store an ID (10 bytes) and a pointer (6 bytes) gives a total of 16 bytes for each row in the index structure. As a result, up to 32 ID entries can be stored per disk block.
Now, if a user needs to retrieve record 3, the index pointer for ID = 3 can be used to quickly locate the specific block on the disk that contains that record. This allows the database system to access the desired record much more quickly and efficiently than if it had to search through all of the blocks on the disk. However, there is still room to further improve performance. As a reminder, 32 index entries fill one 512-byte block. Another layer can be added on top that holds one entry per index block, each covering a group of 32 IDs, which would further improve performance.
For example, when searching for ID=32, the database system can start at the first layer (Indexes block) and determine that the desired record is located in the first block. Then, the system can move to the second layer (Indexes-Table) to obtain the exact pointer where the data is stored.
Figure 7 provides a good representation of the concept of multilevel indexing, a technique commonly used in database management systems. With multilevel indexing, the index is itself indexed, creating a hierarchical structure of indexes that enables efficient searching and retrieval of data.
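The arithmetic behind this example can be checked with a short Python calculation. The row, block, and entry sizes come from the text; the total of 1,000 rows is an assumption (it matches the lookup of ID 1000), as is the simplification that an index level is scanned block by block.

```python
import math

# Figures taken from the text.
ROW_BYTES = 128
BLOCK_BYTES = 512
INDEX_ENTRY_BYTES = 10 + 6  # ID (10 bytes) + pointer (6 bytes)

# Assumption: the table holds 1,000 rows, so ID 1000 is the last one.
TOTAL_ROWS = 1000

rows_per_block = BLOCK_BYTES // ROW_BYTES                           # 4
data_blocks = math.ceil(TOTAL_ROWS / rows_per_block)                # 250
entries_per_block = BLOCK_BYTES // INDEX_ENTRY_BYTES                # 32
index_blocks = math.ceil(TOTAL_ROWS / entries_per_block)            # 32
second_level_blocks = math.ceil(index_blocks / entries_per_block)   # 1

# Worst-case block reads to find ID 1000 under each layout.
no_index = data_blocks                    # scan every data block
one_level = index_blocks + 1              # scan the index, then read one data block
two_level = second_level_blocks + 1 + 1   # top-level block, one index block, one data block

print(no_index, one_level, two_level)
```

Under these assumptions the worst case drops from 250 block reads to 33 with a single index level, and to 3 once the index is itself indexed.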
When Figure 7 is rotated by 90 degrees and with some added creativity, a tree structure similar to the one in Figure 8 becomes visible.
B-trees are a type of multilevel indexing in which data is stored in a hierarchical structure of nodes, with each node containing a certain number of keys and pointers to child nodes. The use of B-trees in database management systems is a prime example of how multilevel indexing techniques can be employed to optimize database performance.
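As a hedged illustration of the lookup path only, the sketch below builds a static multilevel index bottom-up from sorted keys and then searches it from the root. It is not a full B-tree: there is no insertion, deletion, or rebalancing, and the function names `build_levels` and `search` and the fanout of 32 are choices made for the example.

```python
import bisect

def build_levels(sorted_keys, fanout):
    """Build index levels bottom-up from a non-empty sorted key list.

    levels[0] is the root level and levels[-1] holds the leaf nodes.
    Each node holds at most `fanout` keys; an upper-level node stores
    the smallest key of each of its children.
    """
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        firsts = [node[0] for node in level]
        level = [firsts[i:i + fanout] for i in range(0, len(firsts), fanout)]
        levels.insert(0, level)
    return levels

def search(levels, fanout, key):
    """Descend from the root, reading one node per level.

    Returns (found, nodes_read).
    """
    node_idx = 0
    for depth, level in enumerate(levels):
        node = level[node_idx]
        if depth == len(levels) - 1:
            pos = bisect.bisect_left(node, key)
            return (pos < len(node) and node[pos] == key), depth + 1
        # Follow the last child whose smallest key is <= key.
        child = max(bisect.bisect_right(node, key) - 1, 0)
        node_idx = node_idx * fanout + child

levels = build_levels(list(range(1, 1001)), fanout=32)
print(len(levels), search(levels, 32, 1000))
```

With 1,000 keys and 32 keys per node, the structure has two levels, so any key is found or ruled out after reading two nodes. A real B-tree keeps this shape balanced as keys are added and removed.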
Advantages:
- Efficient access: B-trees provide efficient access to data, even in very large databases. The hierarchical structure allows for rapid access to data, making B-trees well-suited for applications with high read-to-write ratios.
- Scalability: B-trees are designed to scale well as the size of the database grows. By adding new nodes to the tree as needed, B-trees can handle very large datasets without significant performance degradation.
- Flexibility: B-trees are flexible and can be used to index many different types of data, including numerical, alphabetical, and even non-sequential data.
- Range queries: B-trees are particularly effective at performing range queries, which involve searching for data that falls within a specific range of values. This is because the hierarchical structure of the tree allows for rapid traversal of data that meets certain criteria.
- Self-balancing: B-trees are designed to self-balance as new data is added, deleted, or modified. This means that performance remains consistent even as the database changes over time.
Disadvantages:
- Overhead: B-trees require additional storage space to maintain the index, which can be significant in large databases.
- Complexity: The implementation of B-trees can be complex, and the algorithms used to maintain the index can be difficult to optimize.
- Write-intensive operations: B-trees are optimized for read operations and can be slower for write-intensive operations, such as inserting or deleting records from the database.
- Limited flexibility: B-trees are well-suited for certain types of data, such as numerical or alphabetical data, but may not perform as well with more complex or unstructured data.
- Fragmentation: As data is added, deleted, or modified, B-trees can become fragmented, which can negatively impact performance over time. Regular maintenance is required to prevent or mitigate fragmentation.
Bitmap Index
Unlike traditional indexes, which store a list of pointers to the locations of the indexed data, bitmap indexes store one bitmap for each unique value in the indexed column. Each bit in a bitmap corresponds to one row of the table: a 1 indicates that the row holds that value, while a 0 means that it does not. To understand this, please consider the following table:
Cardinality refers to the number of distinct values in a column. In Figure 9, for example, the ID column has a cardinality of 1000, since every ID in the table is unique, and the same is true for the name column, assuming no name repeats in another row. The Gender and is_active columns, on the other hand, have cardinalities of 3 and 2, respectively. When a bitmap index is created on the Gender column, the database system generates three bitmaps: one for "Female", another for "Male", and one for "Non-Binary". Each of these bitmaps contains a sequence of 1s and 0s that indicates which rows contain that value.
A bitmap index can quickly determine which rows contain a specific value or set of values by storing a bitmap for each unique value in the indexed column, without having to search through the entire table.
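A minimal Python sketch of the idea, using plain integers as bitmaps. The six sample rows and the helper names `build_bitmaps` and `rows_from_bitmap` are invented for illustration; production systems store compressed bitmaps on disk.

```python
# Column values for six rows; the list position is the row number.
gender = ["Female", "Male", "Non-Binary", "Female", "Male", "Female"]
is_active = [True, False, True, True, True, False]

def build_bitmaps(column):
    """Return {value: int}, where bit i of the int is 1 if row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_from_bitmap(bitmap):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

gender_bitmaps = build_bitmaps(gender)
active_bitmaps = build_bitmaps(is_active)

# WHERE gender = 'Female' AND is_active = true becomes a single bitwise AND.
matches = gender_bitmaps["Female"] & active_bitmaps[True]
print(rows_from_bitmap(matches))
```

The Gender column produces three bitmaps, one per distinct value, and combining conditions on two low-cardinality columns is a single bitwise operation rather than a scan.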
Advantages:
- Faster querying: Bitmap indexing can significantly speed up queries that involve filtering or grouping by values in the indexed columns. Since bitmaps are small and easy to manipulate, the database can quickly retrieve the relevant rows without having to scan the entire table.
- Reduced I/O: Because bitmap indexes store only a single bit (0 or 1) per row for each distinct value, they require much less disk space than other types of indexes when the column has few distinct values. This can lead to reduced I/O and better performance overall.
- Simple to implement: Bitmap indexing is a simple technique that can be easily implemented in most database management systems. It can be used with any data type and is especially effective for low-cardinality columns.
- Concurrent reads: Many users can query a bitmap index at the same time while the database is in use. Concurrent writes are a different matter, as the disadvantages below explain.
Disadvantages:
- High memory usage: Bitmap indexes can consume a significant amount of memory, especially for high-cardinality columns. This can lead to increased memory requirements for the database and may require additional hardware resources.
- Slow inserts: Bitmap indexes are not well-suited for tables that require frequent inserts or updates. Updating a bitmap index can be time-consuming and may require the entire index to be rebuilt, which can slow down database performance.
- Limited applicability: Bitmap indexing is most effective for low-cardinality columns, and may not be as useful for columns with high cardinality or columns that require complex queries.
- Complexity for range queries: Bitmap indexes are not well-suited for range queries (e.g., finding all values greater than a certain threshold), as these queries require the use of multiple bitmaps and may be slow to execute.
Hash Indexing
Hash indexing is a technique used to locate data records quickly based on a key value. This technique works by using a hash function to generate a fixed-length key value, which is used to find data records in the database.
The hash function is a mathematical function that takes an input value (known as the key) and produces a fixed-length output (known as the hash code). The hash code is used as an index into a hash table that contains pointers to the data records. The hash table is typically stored in memory to enable fast access.
However, hash indexing has some limitations. One limitation is that it can be challenging to design a good hash function that provides a uniform distribution of hash codes. If the hash function is poorly designed, it can lead to many collisions, where two or more keys produce the same hash code. This can slow down the search process, as the DBMS has to check every entry that shares that hash code to find the correct data record.
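The sketch below shows one common way to organize such a table: a fixed number of buckets, with colliding keys chained in the same bucket. It is an assumption-laden illustration rather than any engine's implementation; the `HashIndex` class is hypothetical, and Python's built-in `hash()` stands in for the engine's hash function.

```python
class HashIndex:
    """Fixed number of buckets; keys that collide share a bucket (chaining)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # Python's built-in hash() stands in for the engine's hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, row_pointer):
        self._bucket_for(key).append((key, row_pointer))

    def lookup(self, key):
        """Exact match only: scan just the one bucket the key hashes to."""
        return [ptr for k, ptr in self._bucket_for(key) if k == key]

# With only two buckets, keys 1 and 3 land in the same bucket (a collision).
index = HashIndex(num_buckets=2)
index.insert(1, "row-a")
index.insert(3, "row-b")
index.insert(4, "row-c")

print(index.lookup(3))
```

A lookup still returns the right row after a collision, but it has to walk past the other entries in the bucket first. The structure also has no notion of key order, which is why it cannot answer range queries.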
Advantages:
Fast access to data records: Hash indexing can provide fast access to data records, as long as the hash function is well-designed and the hash table is appropriately sized.
Efficient for exact matches: Hash indexing is well-suited for exact match queries, where the search criteria is an exact match to the key value.
Small storage space: Compared to other indexing techniques like B-trees, hash indexing requires relatively small storage space in memory.
Good for large datasets: Hash indexing can be especially efficient for large datasets where exact match queries are common.
Disadvantages:
Limited flexibility: Hash indexing is not well-suited for range queries, where the search criteria is a range of values.
Poorly designed hash functions can lead to collisions: If the hash function is poorly designed, it can lead to collisions, where two or more keys produce the same hash code. This can slow down the search process.
Additional overhead: Hash indexing requires additional overhead to maintain the hash table, which can increase the cost of storage and retrieval.
Limited support for complex data types: Hash indexing is limited in its support for complex data types like text and binary data, which may require additional processing to generate a hash code.
Full-Text Indexing
These indexes are used to improve the performance of full-text search queries in a database. Full-text search allows users to search for words or phrases within the text of documents, web pages, or other unstructured data stored in the database.
Traditional indexes, like B-tree indexes, are optimized for searching for exact values, such as an integer or a string. However, they are not well-suited for full-text search, which involves searching for words or phrases within blocks of text.
Full-text indexing works by breaking the text down into words, known as "tokens," and building an index that maps each token to the documents or records that contain it (an inverted index). This index allows the database engine to quickly search for documents or records that contain specific words or phrases.
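A small Python sketch of token-based indexing follows. The tokenizer is deliberately naive (lowercasing and splitting on non-alphanumeric characters, with no stemming or stop words), and the sample documents and function names are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_all(index, query):
    """Return ids of documents containing every token in the query (boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

documents = {
    1: "Indexes speed up data retrieval",
    2: "Full-text search looks for words inside documents",
    3: "A B-tree index supports range search",
}
inverted = build_inverted_index(documents)

print(search_all(inverted, "search"))
print(search_all(inverted, "range search"))
```

Because this tokenizer does no stemming, a query for "index" matches document 3 but not document 1, which contains "indexes". Handling such variants is part of what a real full-text engine's language configuration adds.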
Advantages
Improved search performance: Full-text indexes allow for more efficient searching of large amounts of text data, resulting in faster and more accurate search results.
Flexible search capabilities: Full-text search allows for more flexible search capabilities, including searching for words or phrases within the text, as well as the ability to perform more complex searches using Boolean operators and other search parameters.
Better support for natural language search: Full-text indexes are designed to support natural language search, which means that they can understand and interpret natural language queries in a more human-like way.
Better handling of misspellings and synonyms: Full-text indexes can be configured to handle common misspellings and synonyms, which can help to improve the accuracy of search results.
Disadvantages
Increased storage requirements: Full-text indexes can require significant amounts of storage space, especially for large text data sets.
Higher processing overhead: Full-text indexes can be more resource-intensive to create and maintain than other types of indexes, which can lead to higher processing overhead and slower query performance.
Limited support for structured data: Full-text indexes are designed to work with unstructured text data, which means that they may not be well-suited for searching structured data or data that is stored in a highly normalized format.
Limited support for some languages: Full-text indexes may not support all languages equally well, and some languages may require additional configuration or customization to work effectively with full-text search.
Indexing techniques supported by DB engines
- MySQL: supports B-tree indexes, hash indexes, and full-text indexes.
- PostgreSQL: supports B-tree indexes, hash indexes, GiST (Generalized Search Tree) indexes, SP-GiST (Space-Partitioned Generalized Search Tree) indexes, GIN (Generalized Inverted Index) indexes, and BRIN (Block Range Index) indexes.
- Oracle: supports B-tree indexes, bitmap indexes, hash clusters, and function-based indexes.
- Microsoft SQL Server: supports clustered indexes (B-tree), nonclustered indexes (B-tree), full-text indexes, and XML indexes.
- MongoDB: supports B-tree indexes, hash indexes, and geospatial indexes.
- Cassandra: stores data in SSTables sorted by primary key rather than in B-trees, and supports secondary indexes on other columns.
- SQLite: supports B-tree indexes and full-text search using FTS (Full-Text Search) extension.
It's worth noting that each database engine may have different variations of these indexing techniques, and some may support additional indexing methods not listed here.
Benchmark
To further investigate the indexing techniques discussed in the previous paragraphs, a performance test was conducted to compare a query that searches for a citizen's name in two different scenarios: one without an index and the other with a B-tree index.
Set up
To set up the test, a script was executed to insert 12,000 citizen records into a Postgres database. The objective of the test was to search these 12,000 rows for the citizen named "Laura Donaldson" using the "citizen_name" column, assuming that names in this column are unique.
To measure the performance of each scenario, a benchmark in Python was implemented to execute 100 iterations of the query. This allowed for precise measurement of the execution time of the query. A sample of the benchmark code is provided below.
import timeit
import psycopg2
# Connect to the database
conn = psycopg2.connect(host="localhost", database="citizen", user="postgres", password="postgres")
cur = conn.cursor()
# Define the query to be benchmarked
query = "SELECT * FROM citizen WHERE citizen_name = 'Laura Donaldson'"
# Number of times the query is executed
number = 100
# Function to be timed; it reuses the connection and cursor opened above,
# so no separate timeit setup string is needed
def benchmark_query():
    cur.execute(query)
# Run the benchmark and print the average time per query in milliseconds
total_seconds = timeit.timeit(benchmark_query, number=number)
print("Average execution time: {:.2f} ms".format(total_seconds * 1000 / number))
# Release the database resources
cur.close()
conn.close()
For scenario 2 (using a B-tree index), the following SQL command was executed in the database in order to create an index:
CREATE INDEX btree_citizen_name ON citizen USING BTREE (citizen_name)
Results
After executing the performance tests for the first scenario (without the index) for 100 iterations, the following result was obtained:
Average execution time: 0.95 ms
For the second scenario (using a B-tree index), the following result was obtained:
Average execution time: 0.11 ms
As the results show, the indexed query was approximately 0.84 ms faster per execution (0.11 ms versus 0.95 ms), which makes it roughly 8.6 times faster, even on a table of only 12,000 rows. The performance boost can be attributed to the efficient organization and retrieval of data through the index, which reduces the number of rows and blocks the engine has to examine and leads to faster query execution times. Overall, the use of indexing is a highly effective strategy for improving the performance of database queries.
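A timing difference alone does not prove that the planner used the index, so it can help to inspect the query plan before and after creating it. The sketch below does this with Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, not with the PostgreSQL setup used in the benchmark; the in-memory table, the generated names, the extra `city` column, and the index name `idx_citizen_name` are all assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the benchmark's PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citizen (id INTEGER PRIMARY KEY, citizen_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO citizen (citizen_name, city) VALUES (?, ?)",
    [(f"Citizen {i}", "Springfield") for i in range(12000)],
)

query = "SELECT * FROM citizen WHERE citizen_name = 'Citizen 42'"

def query_plan(connection, sql):
    """Return the planner's description of how it would run `sql`."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in plan_rows)

plan_without_index = query_plan(conn, query)
conn.execute("CREATE INDEX idx_citizen_name ON citizen (citizen_name)")
plan_with_index = query_plan(conn, query)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan differs between SQLite versions, but the first plan should describe a scan of the table and the second should mention the index by name. In PostgreSQL, the EXPLAIN command serves the same purpose.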
Conclusion
Selecting the appropriate indexing technique is crucial for database performance optimization. Each indexing technique has its own strengths and weaknesses, and selecting the right one depends on the specific needs of the database system.
Ultimately, choosing the right indexing technique is a balancing act between the system's performance requirements, storage capacity, and the type of queries that the system will handle. Therefore, careful consideration of the database design, including the indexing technique used, is essential for achieving optimal performance in any database system.