Tom Nijhof

Posted on Jan 17 • Edited on Jan 18

SQL and PostgreSQL for Advanced Full-Text Search

#sql #postgres #database #backend

Searching within your database is a handy feature for your users. However, there are many different ways to do this. Here I will explore pure SQL options and special PostgreSQL options. All of them can be useful in different situations.
We will explore WHERE, LIKE, ILIKE, tsvector, tsquery, and tsvector-index.

The examples given below are all run on the same database of coffees and teas looking for all the drinks with rooibos in the name; rooibos is a kind of herb tea from South Africa.

By using EXPLAIN I get a cost for the queries to compare against each other. The cost will be in the form of 1..10 meaning it is between 1 and 10. The numbers are an abstraction. Lower numbers are better, numbers close to each other mean there is not a lot of variance. In general, I take the highest number to compare different queries.

AI-generated image of the PostgreSQL mascot doing a full-text search

Pure SQL

No matter if you have PostgreSQL, MySQL, MariaDB, MSSQL, or one of the many other databases, these queries should work for you.
To make the comparison more realistic an index is put on the column name. This is done with the following query.

CREATE INDEX if not EXISTS idx_drink_name
ON public.drinks
USING btree (name)
TABLESPACE pg_default;

Naive comparing

The most basic search is comparing. You use WHERE and compare a column against a value. This has the lowest cost of all methods, but it is also very sensitive to errors. A wrong capital, only part of a word, or only one word, and it does not work. For this reason, it is often used for IDs and other cases where you know exactly what you are looking for.

SELECT name
FROM drinks
WHERE name = 'Rooibos';

cost=0.27..2.49

LIKE and ILIKE

The LIKE operator is used for pattern matching in a WHERE clause. It is used to search for a specified pattern in a column. The % wildcard is often used with LIKE to represent any sequence of characters, and _ (underscore) represents a single character.
The ILIKE operator is similar to LIKE. However, it disregards the case of the letters in the comparison. So LIKE would say Rooibos and rooibos are different, but ILIKE would say they are the same.
There is no cost difference between LIKE and ILIKE. However, the max cost is about 10 times higher as a naive comparison.

The query will return 17 results, from “Rooibos” to “Rooibos Green Raspberry Orange”

SELECT name
FROM drinks
WHERE name ILIKE 'rooi%';

cost=0.00..24.36

PostgreSQL: tsvector

In PostgreSQL, a tsvector is a special data type used to represent text in a form optimized for full-text search. It’s part of PostgreSQL’s full-text search capabilities.
A tsvector is a vector of lexemes, which are the basic units of text. It is a way to simplify language for search operations.
This processing involves breaking down the document into words followed by:

removing common words. These are words like the, for, in, is, be, being etc. These words are so common that it does not add value to know if they are there. If you search for “the best coffee” you do not care for every coffee with the word “the” in the name.
Stemming is reducing a word to its most basic form. No plurals, and verbs only in the simplest form. So “cats” become “cat”, “loving” becomes “love” etc. This will help us if we search for “coffee love d by cat s ” you will find “a cat love s this coffee”.

Let’s look at this in action.

"This coffee is great for cats! Meow Coffees"

will become

'cat':6 'coffe':2, 8 'great':4 'meow':7.

Coffe is a weird result, but this does not matter, because all coffee, coffe, coffees will all become coffe. For us, it is useful that a typo will still result in coffee results.
The number behind the word is the location where the word is found. In the case of coffee, the second and eighth words have the stem coffee.

Naive tsquery

If we combine the tsvector with tsquery we can search our database based on it. Firstly we are going to do this on the fly. This means all the cost will be at the moment we match. This cost is very high, but we can bring that down later using an index.

The below query will give us every name that has at least 1 word matching our text “rooibos”.

SELECT *
FROM drinsks
WHERE to_tsvector(name) @@ to_tsquery('rooibos');

cost=0.00..158.86

Using a wildcard

Let’s power this up a bit with wildcards. We do not need to complete the word “rooibos” but can use a wildcard with :. Similar to % in **LIKE* and ILIKE. Also here it does not have an impact on the cost.

SELECT *
FROM drinks
WHERE to_tsvector(name) @@ to_tsquery('rooibo:*');

cost=0.00..158.86

Cleaver index

In order to index the vectors we are going to introduce a new column with the vectors. Every time we update a row or insert a new one, we will generate a (new) value for this column. This index makes inserting and updating more expensive.
This time we will search over the name AND description combined.

Step 0: Remove NULL
**Make sure none of the values are NULL. Replace them with empty strings. The query below does this for all descriptions.

UPDATE drinks
SET description = COALESCE(description, '')
WHERE description IS NULL;

Step 1: Create an index

Just run this query. Trust me!

ALTER TABLE drinks
ADD COLUMN fts tsvector
GENERATED ALWAYS
as (to_tsvector('english', description || ' ' || name)) STORED;
CREATE INDEX drink_fts
ON drinks
USING gin (fts);

Fine, let’s break it down so you do not just have to trust a random person on the internet.

ALTER TABLE drinks
ADD COLUMN fts tsvector
GENERATED ALWAYS
as (to_tsvector('english', description || ' ' || name)) STORED;

This command adds a new column called “fts” of type tsvector. The generated always makes sure that the values in the “fts” column are automatically generated when a row is updated or inserted.

The data in this column is generated using the to_tsvector function, which processes the concatenated values of the “description” and “name” columns. It will use the English dictionary to get the stem of the words. The stored keyword indicates that the generated values are stored physically on disk for efficient retrieval.

CREATE INDEX drink_fts
ON drinks
USING gin (fts);

This command creates an index named “drink_fts” on the “drinks” table. The index is a GIN (Generalized Inverted Index) index, and it is created on the “fts” column. GIN indexes are suitable for indexing complex data types like tsvector.

Step 2: New search query

So now we use the indexed column “fts” for searching. We do this by replacing the to_tsvector(name) with our new column. The cost is lower than even the LIKE and the ILIKE. If we add a wildcard the cost is going up but still lower than the LIKE and the ILIKE. Making our PostgreSQL perform better than native SQL for full-text searching.

SELECT *

FROM drinks

WHERE fts @@ to_tsquery('english','rooibos');

cost=3.31..4.42

SELECT *

FROM drinks

WHERE fts @@ to_tsquery('rooibo:*');

cost=4.69..11.26

Conclusion

In our exploration of PostgreSQL’s full-text search options, we navigated from basic SQL methods like WHERE, LIKE, and ILIKE to advanced features like tsvector. This special data type optimizes language for efficient searches by breaking down text, removing common words, and applying stemming.

The combination of tsvector and tsquery provides dynamic search capabilities, while the strategic use of GIN indexes on generated tsvector columns significantly enhances search performance. PostgreSQL emerges as a robust solution, offering a versatile toolkit to implement powerful and efficient full-text search functionalities in applications.

DEV Community

SQL and PostgreSQL for Advanced Full-Text Search

Pure SQL

Naive comparing

LIKE and ILIKE

PostgreSQL: tsvector

Naive tsquery

Using a wildcard

Cleaver index

Conclusion

Top comments (0)

Read next

Linking Data in PostgreSQL Without Explicit Keys: A Practical Guide

SQL: Perform statistics at different granularities based on the time span of the data #85

Understanding Sequelize's 'escape' function and why ideally we should avoid using it manually

SQL Server Execution Plans