Table of Contents
- Intro
- Business case
- No free lunches
- Without indexes
- Using B-tree indexes
- Introducing reverse indexes
- Using reverse indexes with default parser
- Using reverse indexes with n-gram parser
- InnoDB reverse index performance degradation
- Alternatives
- Outro
Intro
Full text search means looking for a given phrase anywhere in a given text. This article will show you how to do it efficiently in MySQL and save you from falling into many traps along the way.
Business case
Let's assume we have an address book of our customers and the goal is to quickly find a person by a fragment of their name or email.
CREATE TABLE `address_book` (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(128) NOT NULL,
`email` VARCHAR(128) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB CHARSET=utf8mb4;
We will populate the address book with 1_000_000 randomly generated people, each inserted with a separate query. Names will always be in tidy form - first name followed by last name. Emails will be more chaotic - with varying first/last name order and presence, different separators and some random numbers.
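A data set with these properties can be produced by a short script. This is a minimal sketch, not the exact generator used for the benchmarks - the name pools, separators and the example.com domain are illustrative assumptions:

```python
import random

# Tiny illustrative name pools; the real data set used many more.
FIRST = ["Reed", "Reilly", "Theodore", "Duncan", "Maranda", "Hugh"]
LAST = ["Slavik", "Isaacson", "Klosinski", "Sinke", "Cabrara", "Harrop"]

def random_person(rng):
    first, last = rng.choice(FIRST), rng.choice(LAST)
    # Name is always tidy: first name followed by last name.
    name = f"{first} {last}"
    # Email is chaotic: shuffled name parts, random separator, random number,
    # and sometimes a name part is dropped entirely.
    parts = [first.lower(), last.lower(), str(rng.randint(0, 999))]
    rng.shuffle(parts)
    sep = rng.choice([".", "-", ""])
    email = sep.join(parts[: rng.randint(2, 3)]) + "@example.com"
    return name, email

def insert_statement(name, email):
    # One row per INSERT, as in the benchmark load.  No quoting/escaping
    # needed here because the synthetic values are plain alphanumerics.
    return (f"INSERT INTO `address_book` (`name`, `email`) "
            f"VALUES ('{name}', '{email}');")

if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(insert_statement(*random_person(rng)))
```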
> SELECT `name`, `email` FROM `address_book` LIMIT 8;
+--------------------+---------------------------------+
| name | email |
+--------------------+---------------------------------+
| Reed Slavik | 664-slavik-reed@example.com |
| Reilly Isaacson | reilly972isaacson@example.com |
| Theodore Klosinski | 942.klosinski@example.com |
| Duncan Sinke | 912.duncan@example.com |
| Maranda Cabrara | cabrara-809-maranda@example.com |
| Hugh Harrop | hugh765@example.com |
| Bernard Luetzow | bernard887luetzow@example.com |
| Niki Manesis | niki-247@example.com |
+--------------------+---------------------------------+
Tests will be performed on a stock MySQL 8.0.32 Docker image with default configuration (unless stated otherwise). Hardware is an AMD 6800U, 32GB RAM, PCIe NVMe 4.0 x4 SSD. Operating system is vanilla Arch Linux with BTRFS and LUKS disk encryption.
No free lunches
There is no such thing as a free lunch. Indexes speed up SELECT statements but slow down INSERT/UPDATE/DELETE statements, because they cost additional CPU to calculate and additional disk transfer and space to store. For each method I'll summarize when to use it and what the benefits and drawbacks are.
Without indexes
The most naive approach is to have no indexed columns and to use the LIKE '%john%' syntax.
Because there are no indexes to maintain, this method does not increase data loading time or storage space.
$ time cat address_book.sql | mysql
real 23m 31.43s
> SELECT data_length, index_length FROM information_schema.tables WHERE table_name = 'address_book';
+-------------+--------------+
| DATA_LENGTH | INDEX_LENGTH |
+-------------+--------------+
| 71942144 | 0 |
+-------------+--------------+
Performance is poor. When no index can be used, MySQL falls back to the Turbo Boyer-Moore algorithm to find matching rows.
> SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%';
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.222 sec)
All rows need to be read from disk for analysis, as shown by the query EXPLAIN.
> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%'\G
id: 1
select_type: SIMPLE
table: address_book
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 996458
filtered: 1.23
Extra: Using where
Use: When your application does full text search rarely and you are willing to accept low query performance. Works really well on a small data set. The simple implementation is a big bonus.
Avoid: When full text search is used frequently - you will burn a lot of database performance, especially on a large data set. Also, because of the full row scan, it may block other queries in your application that need FOR UPDATE locks on the table.
Using B-tree indexes
Slapping an index on a column and calling it a day unfortunately will not work here. In a B-tree index, text is converted into a tree of binary (true/false) tests that runs from the start to the end of the searched phrase. For sample data:
1 John
2 Joseph
3 Joseph
4 Ann
It will look something like this.
                 <="a"?
                /      \
              yes       no
              /           \
         <="nn"?        <="jo"?
           /               /
         yes             yes
         /               /
       [4]           <="h"?
                     /     \
                   yes      no
                   /          \
              <="n"?       <="seph"?
                /              /
              yes            yes
              /              /
            [1]           [2,3]
If you are looking for Joseph you test the first character. Because j>a you take the no path. Then you test the first two characters; because jo=jo you trim them from the phrase and take the yes path. Then you test the next unmatched character, which is h... You continue this series of tests until you finally reach a list of rows containing the phrase you are looking for, in this case rows 2 and 3. But this shows that this type of index must work from the start to the end of the phrase, meaning the phrase cannot start with a wildcard.
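The same ordering argument can be sketched outside the database: over a sorted list, a prefix lookup can binary-search because all matches are contiguous, while a substring (leading wildcard) lookup has no ordering to exploit and degenerates into a full scan. A toy Python illustration:

```python
import bisect

names = sorted(["Ann", "John", "Joseph", "Joseph"])

def prefix_search(sorted_names, prefix):
    # All names sharing a prefix form one contiguous range in sorted order,
    # so two binary searches locate it in O(log n).
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[lo:hi]

def substring_search(names_list, needle):
    # A leading wildcard gives no ordering to exploit: every row is checked,
    # just like LIKE '%needle%' without an index.
    return [n for n in names_list if needle in n]

print(prefix_search(names, "Jo"))       # uses the sort order
print(substring_search(names, "seph"))  # full scan
```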
Let's add it to our table.
> ALTER TABLE `address_book` ADD KEY (`name`), ADD KEY (`email`);
As you can see, when the searched phrase starts with a wildcard the index will not be used.
> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE '%john%' AND `name` LIKE '%doe%'\G
id: 1
select_type: SIMPLE
table: address_book
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 996458
filtered: 1.23
Extra: Using where
If we know that the text has some specific structure (in our case the first name goes first), we can leverage this knowledge and search for the name without a leading wildcard.
> SELECT * FROM `address_book` WHERE `name` LIKE 'john%' AND `name` LIKE '%doe%';
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.003 sec)
EXPLAIN shows that this time the index is used: all names starting with john are found within the index, and Boyer-Moore is only needed to filter that set against doe.
> EXPLAIN SELECT * FROM `address_book` WHERE `name` LIKE 'john%' AND `name` LIKE '%doe%'\G
id: 1
select_type: SIMPLE
table: address_book
partitions: NULL
type: range
possible_keys: name
key: name
key_len: 514
ref: NULL
rows: 3602
filtered: 100.00
Extra: Using index condition
This approach quickly shows its limitations when it comes to email. Emails are too chaotic - they may start with the first name, the last name, or something completely different. In that case query time is just like the unindexed case.
> SELECT * FROM `address_book` WHERE `email` LIKE '%john%' AND `email` LIKE '%doe%';
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.314 sec)
Performance-wise, this slows down data loading a bit and more than doubles total storage without being very useful.
$ time cat address_book.sql | mysql
real 24m 12.81s
> SELECT data_length, index_length FROM information_schema.tables WHERE table_name = 'address_book';
+-------------+--------------+
| DATA_LENGTH | INDEX_LENGTH |
+-------------+--------------+
| 71942144 | 112623616 |
+-------------+--------------+
Use: When you can split the text into well-defined columns with their own indexes - for example, restructuring the table to store first_name and last_name separately. You must also be willing to sacrifice the leading wildcard.
Avoid: When the text is too unpredictable and unordered, like an email address or the names of various products in your store.
Note: Right-to-left languages are no exception; the searched phrase cannot start with a wildcard no matter which direction the text is written in.
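If restructuring the table is an option, a sketch of the split-column variant could look like this (the column names are illustrative):

```sql
ALTER TABLE `address_book`
    ADD COLUMN `first_name` VARCHAR(64) NOT NULL,
    ADD COLUMN `last_name` VARCHAR(64) NOT NULL,
    ADD KEY (`first_name`),
    ADD KEY (`last_name`);

-- Both conditions can now use an index; no leading wildcard is needed:
SELECT * FROM `address_book`
WHERE `first_name` LIKE 'john%' AND `last_name` LIKE 'doe%';
```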
Introducing reverse indexes
First let's explain what a reverse index is. The B-tree index was a series of tests on the searched phrase from start to end. A reverse index takes a different approach: it creates tokens from words. A token can be a whole word or an n-gram (a substring of a given length taken from a word; for Johnie the 3-letter n-grams are joh, ohn, hni, nie).
That allows the index to be built in a slightly different way. For sample data:
1 Paul
2 Roland
3 Carol
Index of 3 letter n-gram tokens will look like:
pau => [p1r1] # that means this n-gram is at position 1 in row 1
aul => [p2r1]
rol => [p1r2,p3r3]
ola => [p2r2]
lan => [p3r2]
and => [p4r2]
car => [p1r3]
aro => [p2r3]
Now if we look for rol we immediately know that this token is present in rows 2 and 3. If we search for a longer phrase like roland, the database may use this index twice: if rol is found at some position, then and must be found 3 characters later. Only row 2 matches this condition.
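The lookup described above can be sketched in a few lines of Python. This is a toy model of the idea, not InnoDB's actual storage format; it stores (row, position) pairs per trigram, the same shape as the listing above (but with 0-based positions):

```python
from collections import defaultdict

N = 3  # n-gram length

def build_index(rows):
    # Map each n-gram to the (row_id, position) pairs where it occurs.
    index = defaultdict(list)
    for row_id, text in rows.items():
        text = text.lower()
        for pos in range(len(text) - N + 1):
            index[text[pos:pos + N]].append((row_id, pos))
    return index

def search(index, phrase):
    # Take n-grams N characters apart and intersect candidates: if 'rol'
    # starts a match at position p, 'and' must occur at p + 3.  Phrases
    # shorter than N find nothing, and a tail shorter than N is ignored
    # here (a real database would do supplementary filtering for it).
    phrase = phrase.lower()
    starts = None  # candidate (row_id, match_start) pairs
    for offset in range(0, len(phrase) - N + 1, N):
        gram = phrase[offset:offset + N]
        found = {(row, pos - offset) for row, pos in index.get(gram, [])}
        starts = found if starts is None else starts & found
    return sorted({row for row, _ in starts}) if starts else []
```

With the sample data, search(index, "rol") returns rows 2 and 3, while search(index, "roland") narrows the result to row 2 via the positional intersection.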
Using reverse indexes with default parser
Reverse indexes have their own syntax; let's add them to our table.
ALTER TABLE `address_book` ADD FULLTEXT (`name`), ADD FULLTEXT(`email`);
The default tokenizer uses word boundaries to find tokens, meaning one continuous word is one token.
To utilize a full text index, the MATCH () AGAINST () syntax must be used. The AGAINST section can work in NATURAL LANGUAGE MODE, where the searched text is also tokenized, or in the more useful BOOLEAN MODE, which has its own powerful mini expression language. I won't dive too deeply into BOOLEAN MODE syntax; basically + means AND.
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE);
+--------+---------------+------------------------------+
| id | name | email |
+--------+---------------+------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
+--------+---------------+------------------------------+
1 row in set (0.001 sec)
> SELECT * FROM `address_book` WHERE MATCH (`email`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE);
+--------+---------------+------------------------------+
| id | name | email |
+--------+---------------+------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
+--------+---------------+------------------------------+
1 row in set (0.001 sec)
Whoa, that is fast - over 200x faster than the no-index approach. We are not restricted to searching from the beginning of the phrase like with a B-tree index, so searching in email is fast too. Our index filters rows very well according to EXPLAIN.
> EXPLAIN SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+johnie +doemel' IN BOOLEAN MODE)\G
id: 1
select_type: SIMPLE
table: address_book
partitions: NULL
type: fulltext
possible_keys: name
key: name
key_len: 0
ref: const
rows: 1
filtered: 100.00
Extra: Using where; Ft_hints: no_ranking
Life is great. Or is it?
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
Empty set (0.002 sec)
First trap! You cannot find a phrase shorter than the token length, and by default whole words are tokens. This is the balance between search speed and index build/storage cost. And we are already paying significant penalties in those areas.
$ time cat address_book.sql | mysql
real 29m 34.44s
# du -bc /var/lib/mysql/default/fts_*
492453888 total
That is 126% of the unindexed loading time, and the fulltext index alone takes about 7 times more space than the data itself. Note that there is no easy way to check fulltext index size from INFORMATION_SCHEMA; it has to be measured on the MySQL server filesystem.
Use: When you want to search by whole words. The boolean mode expression allows some neat tricks, like excluding words or ranking by relevance, which you may find useful. But you must be willing to accept higher write times and much higher storage costs.
Using reverse indexes with n-gram parser
This time each word will be split into n-grams. The default n-gram length is defined by a server configuration variable:
> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| ngram_token_size | 2 |
+------------------+-------+
The index creation syntax must name the tokenizer (called a "parser" here) explicitly.
ALTER TABLE `address_book` ADD FULLTEXT (`name`) WITH PARSER ngram, ADD FULLTEXT(`email`) WITH PARSER ngram;
This time rows are found as expected, even when the search uses partial words.
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.266 sec)
But what about this horrible performance? This is slower than without indexes! The answer lies in the n-gram size. If the searched phrase does not match the n-gram size, the database must query the index several times and merge results, or do supplementary unindexed filtering. Let's restart our server with --ngram_token_size=3 and rebuild the table.
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (0.087 sec)
So in this case doe matched the token size and the index was used directly, but john had to be found indirectly in this index. This is even more obvious if we ask for a COUNT.
> SELECT COUNT(*) FROM `address_book` WHERE MATCH (`email`) AGAINST ('+john' IN BOOLEAN MODE);
+----------+
| COUNT(*) |
+----------+
| 3563 |
+----------+
1 row in set (0.064 sec) # phrase longer than token
> SELECT COUNT(*) FROM `address_book` WHERE MATCH (`email`) AGAINST ('+doe' IN BOOLEAN MODE);
+----------+
| COUNT(*) |
+----------+
| 431 |
+----------+
1 row in set (0.003 sec) # phrase equal to token
So we sacrificed the ability to search by 2 characters using the index, gained a big boost when searching by exactly 3 characters, and a mediocre boost in other cases.
Using this method is a pile of tradeoffs. No, you cannot have several indexes on the same column with different n-gram sizes to handle various search phrase lengths. It gets worse - the configuration variable is global, so you cannot even have two FULLTEXT indexes on different tables with different n-gram sizes. One setting must fit all your needs server-wide.
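Because the setting is global and read at startup, the only place to pin it down is the server configuration. A sketch of the relevant my.cnf fragment (the value 3 matches the experiment above):

```ini
[mysqld]
# Global n-gram length for ALL ngram-parsed FULLTEXT indexes on this server.
# Changing it requires a restart and a rebuild of every affected index.
ngram_token_size = 3
```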
How about write performance and storage penalty?
$ time cat address_book.sql | mysql
real 26m 31.05s
# du -bc /var/lib/mysql/default/fts_*
362430464 total
Unfortunately they are big; the index takes about 5 times more space than the data.
Use: When you want to search by parts of words. Boolean mode expressions also work here. But first you must find the right server-wide balance for the token length, and accept higher write times and much higher storage costs. Phrases whose length differs from the token size will still be faster than the unindexed approach, but without the "wow" factor.
Avoid: When your text uses an ideographic language (like Chinese or Japanese) and requires single-character tokens. There is a separate MeCab tokenizer for Japanese, but that is beyond the scope of this article.
InnoDB reverse index performance degradation
Let's reuse the data from the previous chapter and delete all rows.
> DELETE FROM `address_book`;
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
Empty set (0.233 sec)
So the query took 0.087s on a table with data, but now takes 0.233s on an empty table? That is because when a row is deleted from an InnoDB table it is not deleted from the FULLTEXT index. Instead, a separate hidden table tracks removed rows, and a search in the outdated index must compare stale results for 1_000_000 rows against the list of 1_000_000 deleted rows. This gets progressively worse. Let's add, remove, add, remove and add our data again, so we are back at the original 1_000_000 rows in the table. The same amount of rows that we started with.
> SELECT * FROM `address_book` WHERE MATCH (`name`) AGAINST ('+john +doe' IN BOOLEAN MODE);
+--------+----------------+-------------------------------+
| id | name | email |
+--------+----------------+-------------------------------+
| 222698 | Johnie Doemel | doemel.36.johnie@example.com |
| 316137 | Johnnie Doepel | johnnie-doepel-72@example.com |
+--------+----------------+-------------------------------+
2 rows in set (7.038 sec)
That escalated quickly... Now it is time to enter a very trippy land. To rebuild an InnoDB FULLTEXT index and regain performance you must alter the whole table. That requires extensive database user privileges and is very likely to cause application downtime. But fear not. There is a global innodb_optimize_fulltext_only=ON flag that globally(!) changes ALTER/OPTIMIZE (in InnoDB those are synonyms) to only purge old entries from the FULLTEXT index. You can configure how many tokens are purged per run by setting the innodb_ft_num_word_optimize variable, maximum value 10_000. And there is no feedback on whether you are done. I repeat once again - there is no feedback on whether you are done; you are supposed to run consecutive OPTIMIZEs hoping that at some point your FULLTEXT index will have no outdated entries.
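A sketch of that purge loop; the statements are real MySQL, but the number of iterations needed is guesswork by design:

```sql
SET GLOBAL innodb_optimize_fulltext_only = ON;
SET GLOBAL innodb_ft_num_word_optimize = 10000;  -- max tokens purged per run

-- Run repeatedly; nothing tells you when the index is clean.
OPTIMIZE TABLE `address_book`;
OPTIMIZE TABLE `address_book`;
-- ...and so on, hoping the backlog of deleted entries is gone.

SET GLOBAL innodb_optimize_fulltext_only = OFF;
```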
That is garbage UI design.
The cure is worse than the disease. MyISAM purges its FULLTEXT index on the fly, so it does not degrade as data churns. You may convert your InnoDB table to MyISAM, losing all the InnoDB goodies. Or you can build a supplementary MyISAM table like address_book_fts, keep the FULLTEXT indexes there, and synchronize data from the InnoDB table using triggers. And when you think you are victorious - GTID consistency kicks in. If you are using GTID transaction identifiers in replication you cannot update an InnoDB and a MyISAM table in the same transaction, meaning you must risk autocommit writes in your flow. Yuck.
Alternatives
I hope you now better understand MySQL's capabilities regarding full text search. There are tradeoffs and there are traps. If you haven't found a solution matching your needs, I recommend:
- Switching to PostgreSQL. I try to be objective, but full text search in MySQL feels like a weird, unfinished patchwork in my experience. PostgreSQL's solution is much better, and maybe I'll write a follow-up to this article using Postgres.
- Staying in the MySQL ecosystem but using the Sphinx plugin instead of the built-in solutions.
- Outsourcing full text search to Elasticsearch. That of course comes with a whole new set of challenges around data synchronization, but ES capabilities are awesome.
Outro
Thanks for reading. If you successfully implemented full text search in your application, please share the technology you used and the issues you encountered in the comments.