Introduction
Recap
In the previous article, we saw how to use fuzzy search in PostgreSQL. This article is about using the power of full-text indexes to search through large text data. This is different from other PostgreSQL indexes, which work with alphanumeric values or smaller textual content.
Dataset Preparation
To understand how full-text search works, we are going to use the same dataset we saw in the PostgreSQL Fuzzy Search blog.
The table has 79,091 rows.
We will focus on the column episode_narrative, which has the most text content, as it narrates everything that happened during that particular weather event.
Fuzzy Search vs. Full-Text Search
Let's first understand the difference between fuzzy search and full-text search. Below is sample content from the storm_events.episode_narrative column.
Hurricane Irene formed east of the Caribbean island of Dominica, part of the Lesser Antilles region, on the afternoon of August 20. Irene moved through the Caribbean and up the east coast of the United States making landfall twice. She first made landfall as a Category 1 Hurricane near Cape Lookout, North Carolina around 7:30am on August 27, then moved offshore again during the evening. She then made a 2nd landfall, again as a Category 1 Hurricane at 540am on August 28 near Little Egg Inlet in New Jersey. She moved over New York City and then into southeastern New York State and Connecticut as a Tropical Storm a few hours later. By the end of the evening of the 28th, Irene was crossing the U.S./Canada border having produced significant amounts of rain, storm surge, inland and coastal flooding, and wind damage across southern New England and much of the east coast of the United States. ||In Southern New England, the minimum surface pressure recorded was 976.9mb taken at Barnes Municipal Airport in Westfield, Massachusetts. The storm surge experienced along the coast was generally in the two to four foot range with a high of 4.78 feet at Fox Point in Providence, Rhode Island. The highest sustained windspeed was 54 knots (62 mph) at the Physical Oceanographic Real Time System station at Conimicut Light in Narragansett Bay, RI. The highest sustained wind speed on land was 38 knots (44 mph) recorded on the Automated Surface Observing Systems at both Barnstable Municipal Airport in Hyannis, MA (KHYA) and Logan International Airport in Boston, MA (KBOS). Rainfall amounts ranged from nearly zero (0.03 at Nantucket Memorial Airport - ACK) to nearly 10 inches (9.92 in Conway, MA). ||Despite the relatively low wind speeds, sustained winds over a 6 to 12 hour long duration resulted in widespread tree damage and resulted in power outages to roughly half a million customers throughout the state. 
Some of these customers did not get their power back until the Friday following the storm (some five days later). During the passage of Tropical Storm Irene, the winds resulted in $25,000 in property damages. ||The Connecticut River at Walpole reached its highest level since the 1938 hurricane.||The collective effects of Tropical Storm Irene on August 28, resulted in 1 fatality, 0 injuries, and $127.3M in property damage in the following counties: Barnstable, Bristol, Essex, Franklin, Hampden, Hampshire, Middlesex, Nantucket, Norfolk, Plymouth, Suffolk, and Worcester (all in MA), Hartford, Tolland, and Windham (all in CT), Cheshire and Hillsborough (all in NH), and Bristol, Providence, Kent, Washington, and Newport (all in RI).
As you can see, the text is huge, spanning several paragraphs. Let's say we want to search this column for the content Logan International Airport, i.e., we want all of the storm events involving that airport.
A query might look like this:
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative ILIKE '%Logan International Airport%'
This query and its results look fine, but the query takes a long time (763 ms) for a mere 79,000 rows. A realistic production use case would involve millions of rows.
To solve this, we need to move beyond the text data type and enter the world of documents.
Terminologies
Before we jump in and use PostgreSQL's full-text capabilities, we need to understand a few basic terms.
Documents
The text that we store in PostgreSQL needs to be converted into a document structure for faster searching. From the documentation,
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
This can be done by storing the data using tsvector, a special data type that stores text in the document format. tsvector stands for text search vector. We can use the to_tsvector function to convert arbitrary text to a tsvector, similar to how we typecast other data types.
Lexemes
The tsvector type not only stores the text as a document, it stores the value as a sorted list of distinct lexemes. A lexeme is a basic unit of a language, which in our case is English. Let's understand this with an example taken from the documentation.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
If we run this, then we get:
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
Notice that the a character is present only once, and the other words are separated out. This is a pre-processing step that is required to speed up our search. We will explore more of this once we get to our actual content.
Also, if we give it a set of words containing verbs, to_tsvector will normalize them, as below.
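The original example is not reproduced here; a minimal sketch, assuming the built-in english configuration (the exact sentence is illustrative, not the one from the original article):

```sql
-- Stop words ('The') are dropped, and inflected forms are stemmed:
-- 'scared' becomes 'scare', 'cats' becomes 'cat'. The numbers record
-- each lexeme's position in the original text.
SELECT to_tsvector('english', 'The dog scared the cats');
-- 'cat':5 'dog':2 'scare':3
```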
The word scared became scare, and the word the was removed because it is a stop word; we usually would not want to match content on the word the. Normalizing words also makes it much easier to search for variations: when we search for scared, then scare is a good enough match as well. Go ahead and play around with different words to see how they behave. As usual, you can find more detail about the complete behavior of these functions in the Postgres documentation.
Searching Against the Vector
By now, we know that tsvector is not a standard text column, and that a lot of pre-processing happens to make searching easier. To search against a tsvector column, we use a value of the special type tsquery, typically built with the to_tsquery function.
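The original query is not reproduced here; a minimal sketch of such a match, assuming the built-in english configuration and the illustrative sentence from earlier:

```sql
-- The @@ operator checks whether the query matches the vector.
-- 'scares' is stemmed to 'scare', so it matches the normalized vector.
SELECT to_tsvector('english', 'The dog scared the cats')
       @@ to_tsquery('english', 'scares');
-- t
```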
The output, as we see, is simply true or false. In this case, note that we searched for the word scares, but the result was still true. That was because of the normalization we saw earlier. The @@ operator is used for direct matches. We have other operators that perform more sophisticated matches.
Now that we have the hang of the basics, let's move on to our actual table to perform some neat searches.
Implementing Full-Text Search
We can choose to implement our search either on the fly or by constructing a new column of type tsvector. Converting our data on the fly and then comparing, as we did above, would be extremely slow. Let's create a new tsvector column by converting our existing episode_narrative column.
ALTER TABLE
storm_events
ADD
COLUMN episode_narrative_tsv tsvector;
UPDATE
storm_events
SET
episode_narrative_tsv = to_tsvector(episode_narrative);
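Note that a manually populated column like this goes stale whenever episode_narrative changes. On PostgreSQL 12 and later, a stored generated column keeps it in sync automatically; a sketch of that alternative, assuming the english configuration:

```sql
-- Instead of ALTER + UPDATE above: the column is recomputed by
-- PostgreSQL itself on every INSERT or UPDATE of episode_narrative.
ALTER TABLE storm_events
ADD COLUMN episode_narrative_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', episode_narrative)) STORED;
```

On older versions, a trigger calling tsvector_update_trigger achieves the same effect.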
Then we can query for Logan International Airport using to_tsquery.
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airport')
This returns the same 85 rows as before and takes almost the same time (around 800 ms). These numbers are not practical for production use cases. In the next section, we will see how to create an index to speed up the search.
Indexing
We can create a GIN index for the tsvector column we created above.
CREATE INDEX tsv_idx ON storm_events USING gin(episode_narrative_tsv);
The query is now fast, i.e., under 10 ms almost every time (the 377 ms indicated above is rendering time, not the actual query time). An EXPLAIN ANALYZE makes things much more apparent.
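The plan output is not reproduced here; running it yourself might look like the following (exact plan shape and timings will vary with your data and PostgreSQL version):

```sql
-- With tsv_idx in place, expect a Bitmap Index Scan on tsv_idx feeding
-- a Bitmap Heap Scan, instead of the sequential scan the ILIKE query used.
EXPLAIN ANALYZE
SELECT episode_narrative
FROM storm_events
WHERE episode_narrative_tsv @@ to_tsquery('logan & international & airport');
```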
Great! Now we have a fully functional search index that rivals most text search engines out there. But there is always a trade-off, which we will get to later in the article. We can also use plurals of words in our search queries (though not for proper nouns), and our index/query will still work because of the normalization.
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airports')
Notice that the word airport is written as airports, with an s at the end, yet the query still works fine.
GIN indexes are known for being large compared to the actual content itself, because they store the data in a structure called an inverted index. Let's compare the size of the index with the size of the actual column.
SELECT
PG_SIZE_PRETTY(SUM(PG_COLUMN_SIZE(episode_narrative))) as column_size,
PG_SIZE_PRETTY(PG_RELATION_SIZE('tsv_idx')) as index_size
from
storm_events
The index is almost ten times the size of the actual content. Our dataset is not huge here, but the index can get very large if we attempt to index real-world text data such as instant messages, emails, etc. We should be mindful of what content we are indexing and its size, as these indexes occupy shared_buffers (the database cache) and can affect the performance of other queries if we are not careful.
There are a considerable number of operators covered in the documentation. I would highly encourage readers to import the dataset and experiment with these different operators to fully understand the power of search. In the next section, we will see a few special operators that can be helpful in our search.
Advanced Search Techniques
So far, we have searched for the existence of words and found that they can be matched very efficiently.
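Beyond the AND-style matching with & used above, PostgreSQL also supports phrase search: the <-> operator requires lexemes to appear adjacent and in order, and the phraseto_tsquery function builds such a query from plain text. A sketch against our table:

```sql
-- Match the exact phrase, rather than the three words appearing
-- anywhere in the narrative. phraseto_tsquery('Logan International Airport')
-- produces a query equivalent to 'logan' <-> 'intern' <-> 'airport'.
SELECT episode_narrative
FROM storm_events
WHERE episode_narrative_tsv @@ phraseto_tsquery('Logan International Airport');
```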
JSON Search
JSON is a data type that PostgreSQL natively supports, through the JSON and JSONB types. JSON indexing and its functionality in PostgreSQL have been covered extensively in a previous article. Using a GIN index, we can search entire JSON documents with performance comparable to a B-Tree index on scalar values.
Dictionaries
PostgreSQL provides a dictionary facility for common languages, which is used behind the scenes for many tasks, such as:
- Stop word removal (a, an, the, etc.).
- Normalization (scared -> scare as we saw above).
- Thesaurus.
- Spelling.
and a lot more.
We can customize this dictionary to add more stop words, add/remove certain terms, and even create our own custom dictionary for use cases where words carry different meanings in a particular business context.
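We can inspect what a given dictionary does to an individual word with the ts_lexize function. For example, with the built-in english_stem (Snowball) dictionary:

```sql
-- ts_lexize returns the lexemes a dictionary produces for a single token.
SELECT ts_lexize('english_stem', 'scared');  -- {scare}
SELECT ts_lexize('english_stem', 'cats');    -- {cat}
```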
Ranking
If our results contain more than one match, it makes sense to sort them from best match to least match. This can be done using the ts_rank function.
SELECT
event_type,
event_source
FROM
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airports')
ORDER BY
ts_rank(
episode_narrative_tsv,
to_tsquery('logan & international & airports')
)
DESC
Here we are ordering by rank in descending order, so that the most relevant match comes out on top.
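Ranking can be refined further with setweight, which tags lexemes with labels A through D so that matches in one column count more than matches in another. A sketch, reusing the table's existing event_type column alongside the narrative (the weighting scheme here is illustrative):

```sql
-- Matches in event_type (weight 'A') outrank matches that occur
-- only in the narrative (weight 'B').
SELECT event_type,
       ts_rank(
         setweight(to_tsvector('english', event_type), 'A') ||
         setweight(to_tsvector('english', episode_narrative), 'B'),
         to_tsquery('hurricane')
       ) AS rank
FROM storm_events
ORDER BY rank DESC
LIMIT 10;
```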
Caveats
The underlying technology of all search engines is similar from a data structure perspective (an inverted index), but the surrounding data stores are very different. Industry-standard search solutions such as Elasticsearch and Apache Solr are built for horizontal scaling.
We can use the search in PostgreSQL if,
- Our data fits in a relational structure, and there is a need to do a full-text search.
- It would be overkill to choose a full-blown search solution.
- The entire data model needs ACID consistency.
We must keep in mind that search in a relational use case means searching content limited to operational databases. Depending on business policy, these data stores typically move data older than a year or so to data warehouses or other systems for analytics, and do not retain data indefinitely. As always, performance test as closely to production conditions as possible before deploying, to minimize the impact.