Tomas@PawSQL

Principles of Database Index Design

Part of the advanced SQL tuning series

Overview

Indexes are one of the most important tools for optimizing database performance. However, creating too many indexes or indexing the wrong columns can also have a negative impact on performance. Therefore, it is important to follow certain principles when designing indexes.

Principle 1. Create indexes based on your workload

The most important principle for creating efficient indexes is to design them around your workload, not your table structure. The purpose of an index is to make database operations more efficient, and the SQL statements executed against a database make up its workload. Any index-creation approach that does not start from the workload is therefore misguided.

When building a set of indexes for a workload, we need to consider the following characteristics of the workload:

  • SQL type: In OLTP scenarios, where users frequently insert new data and modify existing data, too many indexes can hurt performance; it is recommended to create the minimum number of indexes that meets your requirements. In OLAP scenarios, where queries dominate the workload, you can add more indexes, each index can cover multiple columns, and even function indexes and partial (conditional) indexes can be created.
  • SQL frequency: Indexes should be created for the most frequently executed queries. By creating the best indexes for these queries, the overall performance of the system is maximized (see the sketch after this list).
  • Importance of SQL queries: The more important a query is, the more you may want to optimize its performance by creating an index.
  • The structure of the SQL statements themselves, which is the subject of the next principle.
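
For example, suppose (hypothetically) that one of the most frequent queries in an OLTP workload fetches a customer's most recent orders. A workload-driven index targets exactly that access path; the query, column choices, and index name below are illustrative:

-- Hypothetical high-frequency query in the workload
select o_orderkey, o_orderdate, o_totalprice
from orders
where o_custkey = 12345
order by o_orderdate desc
limit 10;

-- Workload-driven index: supports the equality filter and the ordering
create index orders_custkey_orderdate_idx on orders (o_custkey, o_orderdate);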

Principle 2. Create indexes based on the structure of your SQL statements

Indexes serve three main functions:

  • Quickly locate data
  • Avoid sorting
  • Avoid table lookup

Fast Location

Indexes can quickly locate data by matching the query's conditions, which may appear in the WHERE clause, the HAVING clause, or the ON clause of a join. Matching between an index and these conditions follows the leftmost prefix matching principle.

Leftmost Prefix Matching Principle

The leftmost prefix matching principle states that a query can use an index when its equality conditions exactly match the leftmost column, or leftmost consecutive columns, of that index. When a range condition (>, <, BETWEEN, LIKE) is encountered, matching stops: the range condition itself is still matched, but any index columns after it are not.

For the composite index lineitem (l_shipdate, l_quantity), the first two SQL queries below satisfy the leftmost prefix matching principle and can use the index. The last one does not meet the leftmost prefix matching principle and cannot use the index.

select * from lineitem where l_shipdate = date '2021-12-01'; -- index can be used
select * from lineitem where l_shipdate = date '2021-12-01' and l_quantity = 100; -- index can be used
select * from lineitem where l_quantity = 100; -- The index cannot be used

The execution plans for these three SQL queries are as follows:

-> Index lookup on lineitem using lidx (L_QUANTITY=100.00, L_SHIPDATE=DATE'2021-12-01') (cost=0.35 rows=1)

-> Index lookup on lineitem using lidx (L_QUANTITY=100.00, L_SHIPDATE=DATE'2021-12-01') (cost=0.35 rows=1)

-> Filter: (lineitem.L_QUANTITY = 100.00) (cost=15208.05 rows=49486)
     -> Table scan on lineitem (cost=15208.05 rows=148473)

Besides the leftmost prefix principle, when creating a composite index you should also consider the number of distinct values (the cardinality) of each column when deciding the order of the index fields: columns with higher cardinality should generally be placed first.
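
As a quick sanity check (a minimal sketch; the index name is illustrative), you can estimate each candidate column's cardinality before fixing the column order:

-- Estimate the number of distinct values (cardinality) of each candidate column
select count(distinct l_partkey)  as partkey_cardinality,
       count(distinct l_shipdate) as shipdate_cardinality
from lineitem;

-- If l_partkey turns out to have the higher cardinality, place it first
create index lpartkey_shipdate_idx on lineitem (l_partkey, l_shipdate);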

Equal Conditions

  • Single-table equality conditions

    • COL = 'A'
    • COL IN ('A')
  • Equality conditions in table joins: when a table is the driven (inner) table of a join, the equality join condition can also be treated as an equality condition for index matching.

    • T1.COL = T2.COL
    select * from orders, lineitem where o_orderkey = l_orderkey;
    
    • Execution plan of the above query
    -> Nested loop inner join (cost=484815.77 rows=1326500)
        -> Table scan on orders (cost=20540.71 rows=200128)
        -> Index lookup on lineitem using lineitem_idx(L_ORDERKEY=orders.O_ORDERKEY) (cost=1.66 rows=7)
    

Range Conditions

  • range operators (>,>=,<,<=,BETWEEN)
  • IN ('A','B')
  • IS NOT NULL
  • IS NULL
  • LIKE 'ABC%'
  • COL = 'A' OR COL = 'B'

Range conditions can also be used to quickly locate data. For example:

create index lshipdate_idx on lineitem(l_shipdate);
explain format = tree select * from lineitem where l_shipdate >= date '2021-12-01';

-> Index range scan on lineitem using lshipdate_idx over ('2021-12-01' <= L_SHIPDATE), with index condition: (lineitem.L_SHIPDATE >= DATE'2021-12-01') (cost=11855.06 rows=26344)

Index columns that come after a range condition cannot be used for seeking, because of the leftmost prefix matching principle.
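
For example, with a composite index like the hypothetical one below, only the range on l_shipdate narrows the index seek; the trailing l_quantity column can no longer be used for seeking and is applied as a filter on the matched index entries instead:

-- Hypothetical composite index: range column first, equality column after it
create index lship_qty_idx on lineitem (l_shipdate, l_quantity);

-- The range on l_shipdate locates the index entries;
-- l_quantity = 100 is evaluated as a filter on those entries.
select * from lineitem
where l_shipdate >= date '2021-12-01'
  and l_quantity = 100;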

Avoiding Sorting

Because B+ tree indexes are stored sorted by their index key, they can be used to avoid explicit sorting in SQL queries. The SQL constructs involved mainly include:

  • GROUP BY
  • ORDER BY
  • DISTINCT
  • PARTITION BY... ORDER BY...

Suppose we have an index lshipdate_idx as follows:

create index lshipdate_idx on lineitem(l_shipdate);

The execution plans of the following SQL statements show that the lshipdate_idx index is leveraged to avoid sorting.

  • SQL1 (ORDER BY)
  select * from lineitem order by l_shipdate limit 10;
  • Execution Plan for SQL1
  -> Limit: 10 row(s) (cost=0.02 rows=10)
      -> Index scan on lineitem using lshipdate_idx (cost=0.02 rows=10)
  • SQL2 (GROUP BY)
  select l_shipdate, sum(l_quantity) as sum_qty from lineitem group by l_shipdate;
  • Execution Plan for SQL2
  -> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473)
      -> Index scan on lineitem using lshipdate_idx (cost=15208.05 rows=148473)
  • SQL3 (DISTINCT)
  select DISTINCT l_shipdate from lineitem;
  • Execution Plan for SQL3
  -> Covering index skip scan for deduplication on lineitem using lshipdate_idx (cost=4954.90 rows=15973)
  • SQL4 (PARTITION BY ... ORDER BY ...)
  select rank() over (partition by L_SHIPDATE order by L_ORDERKEY) from lineitem;
  • Execution Plan for SQL4
  WindowAgg (cost=0.29..545.28 rows=10000 width=28)
  -> Index Only Scan using lshipdate_idx on lineitem (cost=0.29..370.28 rows=10000 width=20)

Note:

  1. For grouping and deduplication, the order of the columns does not matter.
  2. For sorting, the order of the sort columns must match the order of the indexed columns; otherwise, the index cannot be used to avoid sorting.
  3. If sorting and grouping appear in the same query, the sorting column needs to come first in the index.

For example, for the following SQL statement:

select l_shipdate, l_orderkey, sum(l_quantity) as sum_qty from lineitem group by l_shipdate, l_orderkey order by l_orderkey;
  • Case 1: Create an index on (l_shipdate, l_orderkey); the query uses the index but still requires a sort, with an actual execution time of 486.526 ms.
-> Sort: lineitem.L_ORDERKEY (actual time=479.465..486.526 rows=149413 loops=1)
     -> Stream results (cost=30055.35 rows=148473) (actual time=0.175..423.447 rows=149413 loops=1)
         -> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473) (actual time=0.170..394.978 rows=149413 loops=1)
             -> Index scan on lineitem using lshipdate_idx2 (cost=15208.05 rows=148473) (actual time=0.145..359.567 rows=149814 loops=1)
  • Case 2: Create an index on (l_orderkey, l_shipdate); the query uses the index and avoids sorting, with an actual execution time of 228.401 ms.
-> Group aggregate: sum(lineitem.L_QUANTITY)  (cost=30055.35 rows=148473) (actual time=0.067..228.401 rows=149413 loops=1)
    -> Index scan on lineitem using lshipdate_idx3  (cost=15208.05 rows=148473) (actual time=0.052..194.479 rows=149814 loops=1)

Avoiding Table Lookup

When all the columns referenced by a query are contained in an index, the database only needs to access the index to obtain the required data, avoiding the extra lookup into the table. In some scenarios this can greatly improve query efficiency.

For the following SQL statement:

select l_shipdate, l_orderkey,  sum(l_quantity) as sum_qty from lineitem group by l_orderkey,l_shipdate;
  • An index on (l_orderkey, l_shipdate) does not include l_quantity, so a table lookup is required; the actual execution time is 194.875 ms.
-> Group aggregate: sum(lineitem.L_QUANTITY)  (cost=30055.35 rows=148473) (actual time=0.044..194.875 rows=149413 loops=1)
    -> Index scan on lineitem using lshipdate_idx3  (cost=15208.05 rows=148473) (actual time=0.034..159.863 rows=149814 loops=1)
  • An index on (l_orderkey, l_shipdate, l_quantity) includes l_quantity, so no table lookup is needed; the actual execution time is 113.433 ms, a performance improvement of approximately 71.8%.
-> Group aggregate: sum(lineitem.L_QUANTITY)  (cost=30055.35 rows=148473) (actual time=0.035..113.433 rows=149413 loops=1)
    -> Covering index scan on lineitem using lshipdate_idx4  (cost=15208.05 rows=148473) (actual time=0.026..82.266 rows=149814 loops=1)

Other Related Topics

Indexes for Partitioned Tables

For partitioned tables, different databases support different types of indexes. Generally, partitioned tables can have the following three types of indexes:

  • Local partitioned indexes (PostgreSQL/MySQL/Oracle/Opengauss)
  • Global partitioned indexes (Oracle)
  • Global non-partitioned indexes (Oracle/Opengauss)

Local Partitioned Indexes

In terms of index maintenance, local partitioned indexes are easier to manage than global indexes: when you add, drop, or truncate a table partition, the corresponding index partitions are maintained automatically. MySQL and PostgreSQL support only local partitioned indexes; in Oracle and Opengauss, the keyword local must be specified when creating them.

create index lshipdate_idx on lineitem(l_shipdate) local;

Global Partitioned Indexes

Similar to table partitioning, the index's partition key does not need to have any relationship to the table's partition key, and even non-partitioned tables can have global partitioned indexes. Oracle supports global partitioned indexes.
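
As a rough sketch of the Oracle syntax (the index name and partition bounds are illustrative), a global partitioned index might be declared like this:

-- Oracle: a range-partitioned global index on the shipping date
create index lshipdate_gidx on lineitem (l_shipdate)
  global partition by range (l_shipdate)
  ( partition p2021 values less than (date '2022-01-01'),
    partition pmax  values less than (maxvalue) );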

Global Non-partitioned Indexes

For global non-partitioned indexes, partition maintenance operations on the table may leave the index unusable, and you then need to explicitly update or rebuild it. In terms of efficiency, global indexes outperform local partitioned indexes for queries that do not include the partitioning columns. Oracle and Opengauss create global non-partitioned indexes by default for partitioned tables.

create index lshipdate_idx on lineitem(l_shipdate) global;
create index lshipdate_idx on lineitem(l_shipdate);

When performing partition maintenance operations, you need to add the update global index clause to rebuild the index; otherwise, the index becomes unusable.

alter table t DROP PARTITION partition_name update global index;

Function Indexes

A function index (or expression index) is an index built on the value of a function or expression rather than directly on a column. The computed value serves as the index key, which can improve query performance without modifying the application.

To use a function index, the expression in the SQL query must match the indexed expression exactly, so the conditions for using one are relatively strict. Function indexes are therefore best suited to targeted optimization of important or high-frequency queries. Consider the following query:

select * from lineitem where EXTRACT(DAY from l_shipdate) = 1;

If there is only a normal index on l_shipdate, the database optimizer will not be able to use it:

Seq Scan on lineitem (cost=0.00..1870.24 rows=238 width=158) (actual time=0.502..10.655 rows=1616 loops=1)
   Filter: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
   Rows Removed by Filter: 46000
Planning Time: 0.107 ms
Execution Time: 10.709 ms

If we create a function index on EXTRACT(DAY from l_shipdate) as follows, the optimizer uses the index and chooses a much better execution plan.

create index idx on lineitem(EXTRACT(DAY from l_shipdate));

Bitmap Heap Scan on lineitem (cost=6.13..593.60 rows=238 width=158) (actual time=0.216..0.981 rows=1620 loops=1)
   Recheck Cond: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
   Heap Blocks: exact=889
   -> Bitmap Index Scan on idx (cost=0.00..6.08 rows=238 width=0) (actual time=0.149..0.149 rows=1620 loops=1)
         Index Cond: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
Planning Time: 0.102 ms
Execution Time: 1.075 ms

From the execution plans, we can see that after utilizing the function index, execution time drops from 10.709 ms to 1.075 ms, roughly a 10x improvement.
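
The matching is strict: only a query that contains essentially the same expression as the index definition can use it. A small sketch, reusing the idx index above:

-- Can use idx: the expression matches the index definition
select * from lineitem where EXTRACT(DAY from l_shipdate) = 1;

-- Cannot use idx: a different expression, so the optimizer falls back to a full scan
select * from lineitem where EXTRACT(MONTH from l_shipdate) = 1;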

Partial Indexes

A partial index (also called a conditional index) is an index built on a subset of a table's rows. The subset is defined by a conditional expression, and the index contains only the rows that satisfy that expression.

The conditions for using a partial index are relatively strict: the optimizer will use it only when it can determine that the query's WHERE clause logically implies the index's conditional expression.

Take the following partial index as an example; its conditional expression is l_shipdate > '2022-01-01'.

create index l_partkey_idx on lineitem(l_partkey) where l_shipdate > '2022-01-01';

Since the condition l_shipdate = date '2021-12-01' in the following query does not fall within the scope of the index's conditional expression, the index cannot be used, and the execution plan falls back to a full table scan.

select l_partkey, count(1) from lineitem where l_shipdate = date '2021-12-01' and l_partkey < 100 group by l_partkey;

GroupAggregate (cost=1870.25..1870.27 rows=1 width=12)
   Group Key: l_partkey
   -> Sort (cost=1870.25..1870.26 rows=1 width=4)
         Sort Key: l_partkey
         -> Seq Scan on lineitem (cost=0.00..1870.24 rows=1 width=4)
               Filter: ((l_partkey < 100) AND (l_shipdate = '2021-12-01'::date))

However, the condition l_shipdate = date '2022-12-01' in the following query falls within the scope of the conditional expression, so the database optimizer will use this index, and you can see that the performance improves considerably.

select l_partkey, count(1) from lineitem where l_shipdate = date '2022-12-01' and l_partkey < 100 group by l_partkey;

GroupAggregate (cost=402.37..402.39 rows=1 width=12)
   Group Key: l_partkey
   -> Sort (cost=402.37..402.38 rows=1 width=4)
         Sort Key: l_partkey
         -> Index Scan using lorderkey_idx on lineitem (cost=0.28..402.36 rows=1 width=4)
               Filter: ((l_partkey < 100) AND (l_shipdate = '2022-12-01'::date))

Note: MySQL does not currently support partial (conditional) indexes, but PostgreSQL, Opengauss, and Oracle all support them.

Index Merge

Index Merge is an optimization technique that uses multiple indexes to perform a single-table data access. When multiple conditions on a table are involved in a query and each condition has a suitable index, Index Merge can merge the results of multiple indexes before retrieving the data, which can improve query performance.

For the lineitem table, there are single-column indexes on l_shipdate and l_partkey. For the following SQL statement:

select * from lineitem where l_shipdate = date '2010-12-01' or l_partkey=100;

Execution Plan in PostgreSQL

Bitmap Heap Scan on lineitem (cost=9.05..202.96 rows=59 width=158)
   Recheck Cond: ((l_shipdate = '2010-12-01'::date) OR (l_partkey = 100))
   -> BitmapOr (cost=9.05..9.05 rows=59 width=0)
         -> Bitmap Index Scan on l_shipdate_idx (cost=0.00..4.70 rows=54 width=0)
               Index Cond: (l_shipdate = '2010-12-01'::date)
         -> Bitmap Index Scan on l_partkey_idx (cost=0.00..4.33 rows=5 width=0)
               Index Cond: (l_partkey = 100)

Execution Plan in MySQL

-> Filter: ((lineitem.L_SHIPDATE = DATE'2010-12-01') or (lineitem.L_PARTKEY = 100)) (cost=12.53 rows=21)
     -> Deduplicate rows sorted by row ID (cost=12.53 rows=21)
         -> Index range scan on lineitem using l_shipdate_idx over (L_SHIPDATE = '2010-12-01') (cost=1.11 rows=1)
         -> Index range scan on lineitem using l_partkey_idx over (L_PARTKEY = 100) (cost=3.03 rows=20)

The optimizer can use both indexes to satisfy the query, and then merge the results of the two indexes. This approach avoids the need for a full table scan and can improve query performance.

Index Merge is supported by various databases such as MySQL, PostgreSQL, Oracle, and SQL Server. However, not all databases support all types of Index Merge, and the optimization behavior may vary across different versions of the same database.
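
In MySQL, for example, the individual index-merge strategies can be toggled through the optimizer_switch system variable if the optimizer's choice is not what you want (a sketch; adjust the flags to your own case):

-- Disable only the union-based index merge strategy for the current session
set session optimizer_switch = 'index_merge_union=off';

-- Re-enable it
set session optimizer_switch = 'index_merge_union=on';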

Indexes for Foreign Keys

It is recommended to create an index on foreign key columns. This may seem to contradict the first principle (create indexes based on your workload), but it is actually consistent with it: in real-world applications most table joins follow primary key/foreign key relationships, so an index on the foreign key improves the efficiency of those joins.

In MySQL, defining a column as a foreign key automatically creates an index on it. In PostgreSQL and related databases, however, declaring a foreign key does not automatically create an index on the referencing columns, so you must create one yourself, as sketched below.
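
A minimal PostgreSQL sketch (the constraint and index names are illustrative): after declaring the foreign key, the supporting index still has to be created explicitly.

-- Declaring the foreign key does not create an index on l_orderkey in PostgreSQL
alter table lineitem
  add constraint lineitem_orderkey_fk
  foreign key (l_orderkey) references orders (o_orderkey);

-- Create the supporting index explicitly to speed up joins and foreign key checks
create index lineitem_orderkey_idx on lineitem (l_orderkey);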

Principle 3. Constraints of Index Design

When designing indexes, it is important to consider the following constraints:

  1. Maximum number of indexes per table: Creating too many indexes can have a negative impact on write performance, as every insert or update to the table must also update the indexes. Therefore, it is important to limit the number of indexes per table.
  2. Maximum number of columns per index: Creating indexes with too many columns can also have a negative impact on performance, as the index may become too large to be efficient. Therefore, it is important to limit the number of columns per index.
  3. Disk space usage: Indexes can take up a significant amount of disk space, so it is important to consider the available disk space when designing indexes; a quick way to check this in PostgreSQL is sketched below.
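
In PostgreSQL, for example, you can check how much space a table's indexes already consume before adding more (a sketch reusing this article's table and index names):

-- Total size of all indexes on the lineitem table
select pg_size_pretty(pg_indexes_size('lineitem'));

-- Size of one specific index
select pg_size_pretty(pg_relation_size('lshipdate_idx'));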

To design and maintain indexes that meet these constraints, the following methods can be used:

  1. Index selection: Create indexes for the most important SQL statements or the most frequently executed queries.
  2. Column selection: Create indexes on the columns with the best filtering effect, based on an evaluation of each column's selectivity, and avoid indexing columns that are frequently updated.
  3. Index consolidation: By choosing the right column order in a composite index, a single index can be used to accelerate multiple queries.
  4. Index deletion: Unused indexes can be periodically dropped to free up disk space and reduce maintenance overhead; the sketch after this list shows one way to find candidates.
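
In PostgreSQL, one way to find deletion candidates is the pg_stat_user_indexes statistics view (a sketch; idx_scan counts usage only since statistics were last reset, so interpret the result with care):

-- Indexes that have never been scanned since the statistics were last reset
select schemaname, relname as table_name, indexrelname as index_name, idx_scan
from pg_stat_user_indexes
where idx_scan = 0
order by table_name, index_name;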

Summary

In summary, the process of creating indexes can be abstracted as follows: define the benefit of an index to the workload, then use heuristic algorithms to compute the index set that maximizes the overall workload benefit under the constraints above. This is also the internal logic of the PawSQL index engine.
