DEV Community

Cover image for Full Text Search in PostgreSQL
Thiago Avelino for prestd

Posted on • Updated on

Full Text Search in PostgreSQL

When the subject is search engine (inverted index) it is extremely common to think of solutions like Apache Solr or Elastic (former elasticsearch), but why don't we remember the Full Text Search feature we have native in PostgreSQL?

The full text search feature in PostgreSQL became native in version 9.4, but before that we had the GiST Indexes, which was used as the basis for the development of Full Text Search (tsvector).

What Is Full Text Search Anyway?

(...) full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database; (...) distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). Wikipedia

In other words, imagine you have a set of text documents stored in a database. These documents are not just meta-data items like an author name or a country of origin, but rather an abstract for an article, or full text articles themselves, and you want to find out if certain words are present or not in them.

E.g. you want to search all the news that has subistantive related to "dog" or "fox" (we are talking about animals) are present so if they are in their singular form, you'll find them with the ILIKE keyword...

SELECT * FROM news  
WHERE  
(body ILIKE '%fox%' OR
body ILIKE '%dog%');
Enter fullscreen mode Exit fullscreen mode

... but you'll also find stuff like "foxtrot" or "Dogville", which is not quite what you intended, you had to declare in which field you would search, if you did not have the terms searched in the field will not bring record.

Another problem is that if you search for a word such as "query", and if it's present in its plural form "queries", then you won't find it if you try a simple pattern search with LIKE, even though the word is, in fact, there. Some of you might be thinking to use regular expressions, and yes, you could do that, regular expressions are incredibly powerful, but also terribly slow.

A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).

That is, in a nutshell, the principle of full text search, thinking about problems related to this was developed data type tsvector.

What do you need to know to use tsvector?

PostgreSQL has two functions that do exactly what we intend to do:

  • to_tsvector for creating a list of tokens (the tsvector data type, where ts stands for "text search");
  • to_tsquery for querying the vector for occurrences of certain words or phrases.

For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog":

SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');
Enter fullscreen mode Exit fullscreen mode

Which will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted:

                      to_tsvector
-------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
Enter fullscreen mode Exit fullscreen mode

By default, every word is normalized as a lexeme in _English+ (e.g. "jumped" becomes "jump"), case depending on the localization settings of your PostgreSQL installation.

A common doubt is about support for other languages (e.g. Brazilian Portuguese). The good news is that we have support for several languages, see the list:

SELECT cfgname FROM pg_ts_config;

cfgname
--------
simple
arabic
danish
dutch
english
finnish
french
german
hungarian
indonesian
irish
italian
lithuanian
nepali
norwegian
portuguese
romanian
russian
spanish
swedish
tamil
turkish
Enter fullscreen mode Exit fullscreen mode

Deeper into the PostgreSQL text search configuration.

Real example generating tokens

We have news table with title field, description and other meta data:

CREATE TABLE news (
    id SERIAL PRIMARY KEY,
    title text NOT NULL,
    description text NOT NULL,
    created_at timestamp without time zone NOT NULL DEFAULT now(),
    last_update timestamp without time zone NOT NULL DEFAULT now(),
    tokens tsvector
);
Enter fullscreen mode Exit fullscreen mode

The way it is when creating a record will not be created the tokens to do textual search, how will we solve this?

We can create a trigger that listens to all the creation and updating, joins the text of the title and description and generates the tokens automatically:

CREATE OR REPLACE FUNCTION set_full_text_search_on_news()
RETURNS TRIGGER AS $$
BEGIN
       NEW.tokens = to_tsvector(concat(NEW.title, ", ", NEW.description))
       RETURN NEW;
END;
$$ language 'plpgsql';

CREATE TRIGGER update_new_full_text_search
       BEFORE UPDATE ON news
       FOR EACH ROW EXECUTE PROCEDURE set_full_text_search_on_news();
Enter fullscreen mode Exit fullscreen mode

Inserting a Record:

INSERT INTO news (title, description)
VALUES
('dog history', 'The quick brown fox jumped over the lazy dog');
Enter fullscreen mode Exit fullscreen mode

The tokens field will be generated automatically, thus remaining:

'caolha':4 'da':2 'de':5 'em':12 'fax':9 'gigant':11 'história':1 'java':6 'mandar':8 'new':13 'para':10 'querem':7 'york':14 'zebrazebra':3
Enter fullscreen mode Exit fullscreen mode

It is now possible to do text search using the full text search feature (function to_tsquery):

SELECT * FROM news WHERE tokens @@ to_tsquery('fox');
Enter fullscreen mode Exit fullscreen mode

Return:

 id |    title    |                 description                  |         created_at         |        last_update         | client_id |                                 tokens
----+-------------+----------------------------------------------+----------------------------+----------------------------+-----------+------------------------------------------------------------------------
  1 | dog history | The quick brown fox jumped over the lazy dog | 2020-12-11 15:58:55.298558 | 2021-01-02 16:05:21.088482 |         1 | 'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3
(1 row)
Enter fullscreen mode Exit fullscreen mode

How to use this feature in pREST?

GitHub logo prest / prest

PostgreSQL ➕ REST, low-code, simplify and accelerate development, ⚡ instant, realtime, high-performance on any Postgres application, existing or new

RESTful API prestd

Build Status GoDoc Go Report Card codecov Homebrew Discord

pREST (PostgreSQL REST), simplify and accelerate development, instant, realtime, high-performance on any Postgres application, existing or new

PostgreSQL version 9.5 or higher

Contributor License Agreement - CLA assistant

pREST - instant, realtime, high-performance on PostgreSQL | Product Hunt

Problem

There is PostgREST written in Haskell, but keeping Haskell software in production is not an easy job. With this need prestd was born. Read more.

Test using Docker

To simplify the process of bringing up the test environment we will use docker-compose

# Download docker compose file
wget https://raw.githubusercontent.com/prest/prest/main/docker-compose-prod.yml -O docker-compose.yml
# Up (run) PostgreSQL and prestd
docker-compose up
# Run data migration to create user structure for access (JWT)
docker-compose exec prest prestd migrate up auth

# Create user and password for API access (via JWT)
## user: prest
## pass: prest
docker-compose exec postgres psql -d prest -U prest -c "INSERT INTO prest_users (name, username, password) VALUES ('pREST Full Name', 'prest', MD5('prest'))
Enter fullscreen mode Exit fullscreen mode

This feature was implemented in version 1.0.5 as a search filter and works like this:

GET /{DATAVASE}/{SCHEMA}/news?tokens:tsquery=fox 
Enter fullscreen mode Exit fullscreen mode
[
    {
        "id": 1,
        "title": "dog history",
        "description": "The quick brown fox jumped over the lazy dog",
        "created_at": "2020-12-11T15:58:55.298558",
        "last_update": "2021-01-02T16:05:21.088482",
        "client_id": 1,
        "tokens": "'brown':4 'dog':1,10 'fox':5 'historyth':2 'jump':6 'lazi':9 'quick':3"
    }
]
Enter fullscreen mode Exit fullscreen mode

Read more about pREST search filters here.

Top comments (0)