Kairat

Posted on Nov 23, 2024

Beyond LIKE: Advanced Text Search and Keyword Matching in Postgres using Full Text Search

#database #postgres #sql

Imagine you’re collecting video data from YouTube using their API and storing it in a Postgres database. Now, you want to automatically tag certain videos based on specific keywords stored in another table or enable text search functionality.

The simplest solution would be to create a database view using the basic LIKE operator. You could then set up a cron job to periodically fetch new records and assign tags to them.

This approach might work for simple use cases, but it has several limitations:

while LIKE supports basic substring matching, it has significant limitations. You could use ILIKE for case-insensitive searches or ~ for regex matches, but these are still quite basic. Achieving advanced features like ranking, synonym recognition, or phrase-based matching is either extremely difficult or impossible with these tools.
as your data grows, performance becomes a concern. The engine has to scan through the text to find matches, which can be computationally expensive. Indexing can improve performance to some extent. For example, a B-tree index works well for patterns like "rust%" (starts with). However, searches such as "%rust%" (anywhere in the text) or "%rust" (ends with) cannot leverage the index effectively and will fall back to full-text sequential scans.

Because of that, using LIKE and similar operators might not be the most suitable solution for this use case.

Fortunately, Postgres has built-in full-text search functions that address these limitations.

To demonstrate how it works, let's create simple youtube_videos table.

CREATE TABLE IF NOT EXISTS
  "public"."youtube_videos" (
    "id" uuid DEFAULT "gen_random_uuid" () NOT NULL PRIMARY KEY,
    "title" text
  );

It has title column and we gonna run our searches against it.
But to be more specific, tokenized version of it.

And you might be wondering what do I mean by that?

Well, Postgres has this function - to_tsvector() which stands for "to text search vector". Using this function we can transform our video title into searchable tokens that can be used for comparisons.

This conversion involves lemmatization which normalizes word. It involves making word lower-cased, removing suffixes (such as es in Eng) and stop words (such as a and the in Eng). This step is done by the use of dictionaries. Various standard dictionaries are provided, and custom ones can be created for specific needs.

To avoid making this transformation every time we need to search for something, let's store it in the column of type tsvector.

alter table youtube_videos
add column fts_title tsvector generated always as (to_tsvector('english', title)) stored;

Running this query will create a new column in our videos table. It will be generated every time we update the title column (on INSERT and UPDATE).

First parameter passed to function is a dictionary name. In this example as you can see we used basic english dictionary. You can read more about Postgres dictionaries here.

Also, in this specific use case, we know for sure, every video has a title, but if you are dealing with nullable columns, it's better to use coalesce(title, '') to handle null values.

To speed up the text search process even more, let's create GIN index on column. As this column is of type tsvector, regular B-Tree index won't work in this case.

create index title_fts_inx on youtube_videos using gin (fts_title);

Now, after inserting a few records into the table, we can take a look at what fts_title value looks like:

As you can see, normalization was applied to the title: ""

"stocks -> stock"
"to, you, in" were dropped

And now, to actually find keyword matching records, we need to use the to_tsquery() function that converts a query string into tokens to match.

select
  *
from
  youtube_videos
where
  fts_title @@ to_tsquery('english', 'rust');

Here, @@ symbol is the "match" symbol for full-text search.
Execution time for this command is roughly between 0.05 - 0.065 ms.

For comparison, running search using ILIKE operator on indexed title column, is way slower (1.45 ms - 1.50 ms).

create index title_inx on youtube_videos(title)

SELECT
  *
FROM
  youtube_videos
WHERE
  title ilike '%rust%';

Or let's say we wanna get all videos about stocks (maybe I wanna buy PLTR or Tesla).

Running query like that:

SELECT
  *
FROM
  youtube_videos
WHERE
  title ilike '%stock%';

Returns these 2 records:

Maybe we can use a bit different query:

SELECT
  *
FROM
  youtube_videos
WHERE
  title ~* '\ystock\y';

Returns none as stock is singular.

But running this query, returns the desired result.

SELECT
  *
FROM
  youtube_videos
WHERE
  title ~* '\ystock\y';

And let's say we have keywords table.

Now we can create a view that has all the videos containing any of those keywords:

CREATE OR REPLACE VIEW videos_with_keywords AS
select
  yv.*
from
  youtube_videos yv
  join keywords on (yv.fts_title @@ to_tsquery(keywords.keyword))

We can also do partial searches:

select * from youtube_videos where fts_title @@ to_tsquery('english', 'sto:*')

While this article focused on the basics of full-text search, PostgreSQL offers many advanced features that you can explore further. Here are some great resources to dive deeper:

Docs
Supabase Article

And that's it, guys.

I hope that you have learned something new today!
I would appreciate it if you could like this post or leave a comment below!

Also, feel free to follow me on GitHub and Medium!

Adios, mi amigos)