loading...

How does database indexing work?

patarapolw profile image Pacharapol Withayasakpunt ・1 min read

Especially for non-equality comparisons; and sorting.

If it is equality; maybe it is like a sorted hashtable? Not that I know much about how hashtable work...

But what about more than, less than, string startsWith, string endsWith, string contains, string case-insensitive operations? What are possible ways to optimize this (probably on write)?

Discussion

markdown guide
 

In non-technical terms, database indices are structured more or less like associative arrays.

Assuming you have a table called users, where each user is a row. That table would be, by default, saved in a way like this:

[ID] => row

Now let's assume you create an index /we'll ignore index types for now) on the "email" column of your row because you do a lot of searching for users by email. Your database engine will now ALSO store this table like this:

[ID] => row,
[email] => row

When you run your query, the engine will identify possible indexes to match based on its content, and then use that version of the saved table to find results.

If your query has WHERE email="brandinchiu@gmail.com", then it will know to use the email index you created beforehand.

Now this is reasonable for simple indices, but gets more complicated when you start looking at compound keys, and when you start having indices conflict with each other.

This is why properly identifying which indices matter is so important. Every time you create an index, your database gets significantly bigger, relative to the size of the table where the index is. Optimally, you want your database to run in memory, so having ten different indices on a table with a hundred million rows is going to a headache.

Different index types help you tell the engine how to optimize the queries that use them. Integer-based indexes will always be faster, and performing arithmetic against them is going to be relatively trivial.

String indices are possible, but should be limited in scope whenever possible. Indexing something like a "description" column for purposes of searching is not something you should be doing. Instead, you would create a separate table of individual "keywords" and use the foreign key to search, not the actual paragraph text itself.

 

What about pre-segmentation of strings for string contains operators, probably just like how full-text-search in SQLite Virtual Table works?

Currently not even understanding why hashtable is near O(1), but I believe string startsWith operator is just akin to integer greater than?

  • Different index types
  • String indices

Now that you mentioned it, how do they work with best performances in relative large datasets?

Also, about String indices, if I don't plan to make unique, it might be possible to optimize, by limiting index size, and looking up multiple times?

 

A lot of this is going to be engine-dependant, but in almost all cases I can think of, string indices should be avoided for the purpose of searching and/or sorting.

In MYSQL, string operations are just integer operations, but there's a step of converting the strings to integers which degrades performance at scale.

MYSQL supports the FULLTEXT index type which indexes the entire string column. However, the larger the string is, the less performant the search will be unless you're searching for the exact, entire string (which rarely happens and defeats the purpose of a string index in the first place)

 

To my knowledge mainstream relational databases (MySQL, PostgreSQL, SQLite) all use a form of B-tree, because hashtable is not good for comparison. Abstractly you can view B-tree as a sorted set while a hashtable an unordered set. For string operations I suspect they are just implemented as regular expressions, which if you limit them to no look back can be implemented efficiently as finite state machines.

 

I have written a while ago a medium article about this: medium.com/faun/explained-indexing...

Long story short, primary keys are like partition keys, and databases knows where to find them really fast if u give one to it. He knows on disk where to find id 1, 2, etc.

Indexes also act like partition keys, but they map the indexed key to the primary key. So if u give it an indexed field, he tries to find the primary keys then knows where to find the records.

 

Same as Table of Index of a book, You look up chapter name and go straight to page where that chapter starts, you don't roll page one by one till the chapter starts.

 

Why is it O(1), yet comparable? How do I implement it in JavaScript / Python (i.e. not using existing libraries)?

The point is, how can I have faster substring indices? How can I have prebuilt indices?