Introduction
Recap
In the previous article, we saw how to use fuzzy search in PostgreSQL. This article is about using the power of full-text indexes to search through large text data. This is different from other PostgreSQL indexes, which work with alphanumeric values or smaller textual content.
Dataset Preparation
To understand how full-text search works, we are going to use the same dataset we saw in the PostgreSQL Fuzzy Search blog.
The table has 79,091 rows.
We will focus on the column episode_narrative, which has the most text content, as it narrates everything that happened during that particular weather event.
Fuzzy Search vs. Full-Text Search
Let's first understand the difference between fuzzy search and full-text search. Below is sample content from the storm_events.episode_narrative column.
Hurricane Irene formed east of the Caribbean island of Dominica, part of the Lesser Antilles region, on the afternoon of August 20. Irene moved through the Caribbean and up the east coast of the United States making landfall twice. She first made landfall as a Category 1 Hurricane near Cape Lookout, North Carolina around 7:30am on August 27, then moved offshore again during the evening. She then made a 2nd landfall, again as a Category 1 Hurricane at 540am on August 28 near Little Egg Inlet in New Jersey. She moved over New York City and then into southeastern New York State and Connecticut as a Tropical Storm a few hours later. By the end of the evening of the 28th, Irene was crossing the U.S./Canada border having produced significant amounts of rain, storm surge, inland and coastal flooding, and wind damage across southern New England and much of the east coast of the United States. ||In Southern New England, the minimum surface pressure recorded was 976.9mb taken at Barnes Municipal Airport in Westfield, Massachusetts. The storm surge experienced along the coast was generally in the two to four foot range with a high of 4.78 feet at Fox Point in Providence, Rhode Island. The highest sustained windspeed was 54 knots (62 mph) at the Physical Oceanographic Real Time System station at Conimicut Light in Narragansett Bay, RI. The highest sustained wind speed on land was 38 knots (44 mph) recorded on the Automated Surface Observing Systems at both Barnstable Municipal Airport in Hyannis, MA (KHYA) and Logan International Airport in Boston, MA (KBOS). Rainfall amounts ranged from nearly zero (0.03 at Nantucket Memorial Airport - ACK) to nearly 10 inches (9.92 in Conway, MA). ||Despite the relatively low wind speeds, sustained winds over a 6 to 12 hour long duration resulted in widespread tree damage and resulted in power outages to roughly half a million customers throughout the state. 
Some of these customers did not get their power back until the Friday following the storm (some five days later). During the passage of Tropical Storm Irene, the winds resulted in $25,000 in property damages. ||The Connecticut River at Walpole reached its highest level since the 1938 hurricane.||The collective effects of Tropical Storm Irene on August 28, resulted in 1 fatality, 0 injuries, and $127.3M in property damage in the following counties: Barnstable, Bristol, Essex, Franklin, Hampden, Hampshire, Middlesex, Nantucket, Norfolk, Plymouth, Suffolk, and Worcester (all in MA), Hartford, Tolland, and Windham (all in CT), Cheshire and Hillsborough (all in NH), and Bristol, Providence, Kent, Washington, and Newport (all in RI).
As you can see, the text is huge, spanning several paragraphs. Let's say we want to search this column for the content Logan International Airport, i.e., we want all of the storm events involving that airport.
A query might look like this:
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative ILIKE '%Logan International Airport%'
This query and its results look fine, but the query takes a long time (763 ms) for a mere 79,000 rows. A realistic production use case would involve millions of rows.
To solve this, we need to move beyond the text data type and enter the world of documents.
Terminologies
Before we jump in and use PostgreSQL's full-text capabilities, we need to understand a few basic terms.
Documents
The text that we store in PostgreSQL needs to be converted into a document structure for faster searching. From the documentation,
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
This can be done by storing the data using tsvector, a special data type that stores text in the document format. tsvector stands for text search vector. We can use the to_tsvector function to convert arbitrary text to a tsvector, similar to how we typecast other data types.
Lexemes
The tsvector type not only stores the text as a document, it stores the value as a sorted list of distinct lexemes. A lexeme is a basic unit of a language, which in our case is English. Let's understand this with an example taken from the documentation.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
If we run this, then we get:
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
Notice that the a character is present only once, and the other words are separated out. This is a pre-processing step that is required to speed up our search. We will explore more of this once we get to our actual content.
Also, if we give it a set of words containing verbs, to_tsvector will normalize them, as below.
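The original example is not reproduced here; a minimal sketch, assuming the built-in english configuration (the exact sentence is illustrative, not the one from the original article):

```sql
-- Stop words ('The') are dropped, and inflected forms are stemmed:
-- 'scared' becomes 'scare', 'cats' becomes 'cat'. The numbers record
-- each lexeme's position in the original text.
SELECT to_tsvector('english', 'The dog scared the cats');
-- 'cat':5 'dog':2 'scare':3
```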
The word scared became scare, and the word the was removed because it is a stop word; we usually would not want to match content on the word the. Normalizing words also makes it much easier to search for variations: when we search for scared, then scare is a good enough match as well. Go ahead and play around with different words to see how they behave. As usual, you can find more detail about the complete behavior of these functions in the Postgres documentation.
Searching Against the Vector
By now, we know that tsvector is not a standard text column, and that a lot of pre-processing happens to make searching easier. To search against a tsvector column, we use a value of the special type tsquery, typically built with the to_tsquery function.
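The original query is not reproduced here; a minimal sketch of such a match, assuming the built-in english configuration and the illustrative sentence from earlier:

```sql
-- The @@ operator checks whether the query matches the vector.
-- 'scares' is stemmed to 'scare', so it matches the normalized vector.
SELECT to_tsvector('english', 'The dog scared the cats')
       @@ to_tsquery('english', 'scares');
-- t
```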
The output, as we see, is simply true or false. In this case, note that we searched for the word scares, but the result was still true. That was because of the normalization we saw earlier. The @@ operator is used for direct matches. We have other operators that perform more sophisticated matches.
Now that we have the hang of the basics, let's move on to our actual table to perform some neat searches.
Implementing Full-Text Search
We can choose to implement our search either on the fly or by constructing a new column of type tsvector. Converting our data on the fly and then comparing, as we did above, would be extremely slow. Let's create a new tsvector column by converting our existing episode_narrative column.
ALTER TABLE
storm_events
ADD
COLUMN episode_narrative_tsv tsvector;
UPDATE
storm_events
SET
episode_narrative_tsv = to_tsvector(episode_narrative);
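Note that a manually populated column like this goes stale whenever episode_narrative changes. On PostgreSQL 12 and later, a stored generated column keeps it in sync automatically; a sketch of that alternative, assuming the english configuration:

```sql
-- Instead of ALTER + UPDATE above: the column is recomputed by
-- PostgreSQL itself on every INSERT or UPDATE of episode_narrative.
ALTER TABLE storm_events
ADD COLUMN episode_narrative_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', episode_narrative)) STORED;
```

On older versions, a trigger calling tsvector_update_trigger achieves the same effect.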
Then we can query for Logan International Airport using to_tsquery.
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airport')
This returns the same 85 rows as before and takes almost the same time (around 800 ms). These numbers are not practical for production use cases. In the next section, we will see how to create an index to speed up the search.
Indexing
We can create a GIN index for the tsvector column we created above.
CREATE INDEX tsv_idx ON storm_events USING gin(episode_narrative_tsv);
The query is now fast, i.e., under 10 ms almost every time (the 377 ms indicated above is rendering time, not the actual query time). An EXPLAIN ANALYZE makes things much more apparent.
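The plan output is not reproduced here; running it yourself might look like the following (exact plan shape and timings will vary with your data and PostgreSQL version):

```sql
-- With tsv_idx in place, expect a Bitmap Index Scan on tsv_idx feeding
-- a Bitmap Heap Scan, instead of the sequential scan the ILIKE query used.
EXPLAIN ANALYZE
SELECT episode_narrative
FROM storm_events
WHERE episode_narrative_tsv @@ to_tsquery('logan & international & airport');
```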
Great! Now we have a fully functional search index that rivals most text search engines out there. But there is always a trade-off, which we will get to later in the article. We can also use plurals of words in our search queries (though not for proper nouns), and our index/query will still work because of the normalization.
SELECT
episode_narrative
from
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airports')
Notice that the word airport is written as airports, with an s at the end, yet the query still works fine.
GIN indexes are known for being large compared to the actual content itself, because they store the data in a structure called an inverted index. Let's compare the size of the index with the size of the actual column.
SELECT
PG_SIZE_PRETTY(SUM(PG_COLUMN_SIZE(episode_narrative))) as column_size,
PG_SIZE_PRETTY(PG_RELATION_SIZE('tsv_idx')) as index_size
from
storm_events
The index is almost ten times the size of the actual content. Our dataset is not huge here, but the index can get very large if we attempt to index real-world text data such as instant messages, emails, etc. We should be mindful of what content we are indexing and its size, as these indexes occupy shared_buffers (the database cache) and can affect the performance of other queries if we are not careful.
There are a considerable number of operators covered in the documentation. I would highly encourage readers to import the dataset and experiment with these different operators to fully understand the power of search. In the next section, we will see a few special operators that can be helpful in our search.
Advanced Search Techniques
So far, we have searched for the existence of words and found that they can be matched very efficiently.
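Beyond the AND-style matching with & used above, PostgreSQL also supports phrase search: the <-> operator requires lexemes to appear adjacent and in order, and the phraseto_tsquery function builds such a query from plain text. A sketch against our table:

```sql
-- Match the exact phrase, rather than the three words appearing
-- anywhere in the narrative. phraseto_tsquery('Logan International Airport')
-- produces a query equivalent to 'logan' <-> 'intern' <-> 'airport'.
SELECT episode_narrative
FROM storm_events
WHERE episode_narrative_tsv @@ phraseto_tsquery('Logan International Airport');
```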
JSON Search
JSON is a data type that PostgreSQL natively supports, through the JSON and JSONB types. JSON indexing and its functionality in PostgreSQL have been covered extensively in a previous article. Using a GIN index, we can search entire JSON documents with performance comparable to a B-Tree index on scalar values.
Dictionaries
PostgreSQL provides a dictionary facility for common languages, which is used behind the scenes for many tasks, such as:
- Stop word removal (a, an, the, etc.).
- Normalization (scared -> scare as we saw above).
- Thesaurus.
- Spelling.
and a lot more.
We can customize this dictionary to add more stop words, add/remove certain terms, and even create our own custom dictionary for use cases where words carry different meanings in a particular business context.
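We can inspect what a given dictionary does to an individual word with the ts_lexize function. For example, with the built-in english_stem (Snowball) dictionary:

```sql
-- ts_lexize returns the lexemes a dictionary produces for a single token.
SELECT ts_lexize('english_stem', 'scared');  -- {scare}
SELECT ts_lexize('english_stem', 'cats');    -- {cat}
```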
Ranking
If our results contain more than one match, it makes sense to sort them from best match to least match. This can be done using the ts_rank function.
SELECT
event_type,
event_source
FROM
storm_events
WHERE
episode_narrative_tsv @@ to_tsquery('logan & international & airports')
ORDER BY
ts_rank(
episode_narrative_tsv,
to_tsquery('logan & international & airports')
)
DESC
Here we are ordering by rank in descending order, so that the most relevant match comes out on top.
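Ranking can be refined further with setweight, which tags lexemes with labels A through D so that matches in one column count more than matches in another. A sketch, reusing the table's existing event_type column alongside the narrative (the weighting scheme here is illustrative):

```sql
-- Matches in event_type (weight 'A') outrank matches that occur
-- only in the narrative (weight 'B').
SELECT event_type,
       ts_rank(
         setweight(to_tsvector('english', event_type), 'A') ||
         setweight(to_tsvector('english', episode_narrative), 'B'),
         to_tsquery('hurricane')
       ) AS rank
FROM storm_events
ORDER BY rank DESC
LIMIT 10;
```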
Caveats
The underlying technology of all search engines is similar from a data structure perspective (an inverted index), but the surrounding data stores are very different. Industry-standard search solutions such as Elasticsearch and Apache Solr are built for horizontal scaling.
We can use the search in PostgreSQL if,
- Our data fits in a relational structure, and there is a need to do a full-text search.
- It would be overkill to choose a full-blown search solution.
- The entire data model needs ACID consistency.
We must keep in mind that search in a relational use case means searching content limited to operational databases. Depending on business policy, these data stores typically move data older than a year or so to data warehouses or other systems for analytics, and do not retain data indefinitely. As always, performance test as closely to production conditions as possible before deploying, to minimize the impact.