Full-text search is a technique used to search within a large collection of documents or data by matching the words or phrases in the query. It is superior to traditional search because it has the capability to sort the search results in order of similarity to the query. Traditional search involves simple queries that match exact terms or phrases. Operators like LIKE
and ILIKE
are used in this case. However, full-text search builds upon this limitation and allows for the preprocessing of natural language documents before performing a search.
If you need to perform text searches on a large document that contains text data, full-text search is the way to go. This article covers the basics, use cases, and how to optimize full text search. Walk with me.
to_tsvector and to_tsvector in postgreSQL
In PostgreSQL, two commonly used functions for full-text search are:
to_tsvector:
This function is used to convert a document's text into a vector of tokens (words), where each token is associated with its normalized form. The to_tsvector
function performs several processes as part of its normalization, such as;
- Tokenization: the text is divided into individual tokens or words. The text is separated based on punctuation marks and spaces.
- Lowercasing: All tokens are converted to lowercase. This ensures the text case-insensitivity.
- Stemming: All tokens are reduced to their root forms, called "lexemes". Suffixes and endings are removed. For instance, "believe", "believer" and "believing" would be stemmed to "believ".
- Stopword removal- Stop words are common words (like "are", "the", "and") that do not contribute significantly to the meaning of the document. The function also assigns position information to each token, indicating their location in the original text.
Normalization is a crucial step in full-text indexing and searching.
For example;
-- Convert text to tsvector
SELECT to_tsvector('english', 'Innovative technologies are shaping the future');
This query would generate a vector of tokens like
'futur':6 'innov':1 'shape':4 'technolog':2
"english" is passed as a parameter into the to_tsvector function to indicate text search configuration.
to_tsquery:
This function is used to convert a query string into a query expression that can be used for full-text searching. It takes into account the same language configuration as to_tsvector
. For instance;
-- Convert text to tsquery
SELECT to_tsquery('english', 'cat & rat');
This query would create a query expression to find documents containing both "cat" and "rat".
Full-Text Search Operators
1. Match Operator ("@@")
A match operator @@
is used to test whether a tsvector matches a tsquery. It returns true if the document contains the tokens and operators specified by the query, and false otherwise.
For example, consider a table called books
with a column named title
, which contains the titles of various books:
| ID | Title |
|----|--------------------------------------------|
| 1 | The Hitchhiker''s Guide to the Galaxy |
| 2 | One Hundred Years of Solitude |
| 3 | Pride and Prejudice and Zombies |
| 4 | The Girl with the Dragon Tattoo |
| 5 | To Kill a Mockingbird |
| 6 | The Perks of Being a Wallflower |
| 7 | The Curious Incident of the Dog in the Night-Time |
| 8 | The Fault in Our Stars |
| 9 | The Catcher in the Rye |
| 10 | The Lost Girl of Paris |
| 11 | Under the Stars'' Embrace |
The source code, including the query for creating this table, can be found on GitHub
We can use the match operator to find all the books that have the word “Girl” in their titles. Here's the query to achieve that:
-- Select books where the title matches the tsquery 'Girl'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'Girl');
This query returns the following output:
| ID | Title |
|----|--------------------------------------------|
| 4 | The Girl with the Dragon Tattoo |
| 10 | The Lost Girl of Paris |
2. The & (AND) Operator
This operator is used in a query when multiple words are being searched for. The multiple words are usually separated by "&" and the query only returns documents where ALL the words are present
For example,
-- Select books where the title matches the tsquery 'girl' and 'dragon'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'girl & dragon');
In this query, the &
symbol represents the AND
operator. It would create a query expression to find documents containing both "girl" and "dragon", matching the record with ID "4" from the books
table.
| ID | Title |
|----|--------------------------------------------|
| 4 | The Girl with the Dragon Tattoo |
3. The | (OR) Operator
This operator is used in a query to find documents that contain at least one of the specified search terms.
Let's say we want to find books that have either "Galaxy" or "Stars" in their titles from the "books" table. We can use the OR operator to achieve this:
-- Select books where the title matches the tsquery 'Galaxy' or 'Stars'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'Galaxy | Stars');
The |
symbol represents the OR
operator. The query will match documents containing either "Galaxy", "Stars", or both.
The query above returns the following result:
| ID | Title |
|----|--------------------------------------------|
| 1 | The Hitchhiker''s Guide to the Galaxy |
| 8 | The Fault in Our Stars |
| 11 | Under the Stars'' Embrace |
4. The ! (NOT) Operator
This operator is used in queries to exclude documents that contain the word alongside which the operator is used.
-- Select books where the title does not match 'Perks'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', '!Perks');
The !
symbol represents the NOT
operator. The query will match only documents that do not contain the word "Perks".
5. The <-> (FOLLOWED BY) Operator
This functions like the AND
operator, but the order in which the words are written in the document matters. The two words need to be right next to each other without any other words in between.
In this instance, even though there is a row in the table with the title "One Hundred Years of Solitude", searching for "Solitude" followed by "Hundred" will not match any document because the "Hundred" is not immediately after the "Solitude".
-- Select books where the title matches 'Hundred' followed by 'Solitude'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'Hundred <-> Solitude');
There is an extension of the FOLLOWED BY operator that allows us to specify the gap between the two words, allow for a specified number of words between the position of the two words we are trying to match in the document. This is the <N>
operator. "N" represents an integer denoting the maximum gap length between the positions of the two words.
For example, we want to find sentences where the words "Hundred" and "Solitude" appear within range of 3 words from each other. We can use the <3> operator as follows:
-- Select books where the title matches 'Hundred <3> Solitude'
SELECT * FROM books
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'Hundred <3> Solitude');
In this query, the <3> operator specifies that there can be 3 words between "Hundred" and "Solitude" in the title column. The query will return this result:
| ID | Title |
|----|--------------------------------------------|
| 2 | One Hundred Years of Solitude |
Creating Indexes
Indexes can be created on a table before text search is done. These indexes help our database quickly locate the relevant documents being we're searching for.
By setting up indexes before performing text searches, we're providing a shortcut for the database to find the information we need. Instead of scanning through every record in the table, the database can follow these indexes to directly reach the data we're looking for, making the process much more efficient.
-- Create an index on the 'title' column
CREATE INDEX title_idx ON books(title);
Full text search with precomputed vectors
Here, a new column is created for the text search vectors, followed by the creation of an index on that column. This method is faster than solely creating the index on the table and vectorizing the document during query processing. We can add the new column to an already loaded table, then update the table to include the vector for each row.
-- Alter the books table to add a new column for tsvector
ALTER TABLE books ADD COLUMN title_tsvector tsvector;
-- Update the title_tsvector column for all rows
UPDATE books SET title_tsvector=to_tsvector('english',title);
-- Create an index on the 'title_tsvector' column using GIN
CREATE INDEX title_tsvector_idx ON books USING GIN(title_tsvector);
Alternatively, we can create the column while making the table and utilize "Stored Procedures" to update the column whenever a new row is added to the table.
-- Create the books table with id, title, and title_tsvector columns
CREATE TABLE books (
id serial PRIMARY KEY,
title text,
title_tsvector tsvector
);
-- Create a stored procedure to update title_tsvector
CREATE OR REPLACE FUNCTION update_title_tsvector()
RETURNS TRIGGER AS $$
BEGIN
NEW.title_tsvector := to_tsvector('english', NEW.title);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Create a trigger to automatically update title_tsvector
CREATE TRIGGER update_title_tsvector_trigger
BEFORE INSERT ON books
FOR EACH ROW
EXECUTE FUNCTION update_title_tsvector();
Ranking Search results
Full text search has the capability to sort the results in order of relevance to the query. It assigns a score to indicate the level of similarity between each match and the query. This feature makes full text search incorporated in web search engines, providing users with relevant search results from the vast amount of information available on the internet. It also enable job seekers to search for job postings based on keywords and criteria.
-- Selecting book ID, title, and calculated rank, and ordering the results by rank in descending order for relevance
SELECT id, title, ts_rank(to_tsvector('english', title), to_tsquery('english', 'stars')) AS rank
FROM books
WHERE title_tsvector @@ to_tsquery('english', 'stars')
ORDER BY rank DESC;
This query retrieves book records where the word "stars" is present in the title. The results are then ordered based on their relevance to the search query. Since the query is searching for a single word in a small table, both books might end up having a similar rank.
Conclusion
Full-text search enhances search capabilities within large collections of documents or data. Another notable feature of full-text search, which was not mentioned earlier, is its ability to assign weights, allowing us to prioritize certain words over others when performing a search. By understanding its functions and operators, we can optimize searches and efficiently retrieve relevant information.
Top comments (0)