The search bar that almost became a new service
Someone opens a ticket: "We need a search bar on the articles page." The team huddles. One person suggests Elasticsearch. Another says Algolia. Someone mentions Meilisearch. Within ten minutes the discussion has shifted from "add a search bar" to "provision a new cluster, sync data with a pipeline, handle eventual consistency, and add a new service to the on-call rotation."
Meanwhile, the articles already live in PostgreSQL.
The first attempt is usually ILIKE:
SELECT id, title FROM articles
WHERE title ILIKE '%connection pooling%'
OR body ILIKE '%connection pooling%'
LIMIT 20;
It works in development with 500 rows. In production with 2 million rows it takes 3 seconds, returns results in no particular order of relevance, and matches raw substrings blindly: a search for "connection pooling" also hits "interconnection pooling". There is no stemming, no ranking, no concept of how well a document matches.
The fix is not a new service. It is five lines of SQL using features PostgreSQL has shipped since version 8.3.
Why LIKE fails at scale
ILIKE '%term%' forces PostgreSQL to read every row in the table. A B-tree index cannot help because the pattern has a leading wildcard. There is no shortcut: Postgres must scan the full text of both columns in every row, check for a substring match, and discard the rest.
On a table with 2 million articles, that looks like this:
Seq Scan on articles (cost=0.00..389420.00 rows=1843 width=312)
(actual time=12.881..3042.617 rows=347 loops=1)
Filter: ((title ~~* '%connection pooling%') OR (body ~~* '%connection pooling%'))
Rows Removed by Filter: 1999653
Nearly 2 million rows scanned, 3 seconds of wall time, all to return 347 results with no ranking. Every time a user types a query and hits enter, the database does the same full scan again.
Beyond performance, ILIKE has no understanding of language. It cannot match "deploying" when the user searches for "deploy." It does not remove stop words like "the" or "is." It returns results in insertion order, not relevance order. For anything beyond a toy prototype, you need something better.
What tsvector and tsquery are
PostgreSQL has a built-in document search engine. It works with two data types: tsvector and tsquery.
A tsvector is a sorted list of normalized words (called lexemes) extracted from text. PostgreSQL tokenizes the input, applies stemming (reducing "running" to "run", "deployed" to "deploy"), and removes stop words:
SELECT to_tsvector('english', 'Deploying PostgreSQL with zero-downtime migrations');
'deploy':1 'downtim':6 'migrat':7 'postgresql':2 'zero':5 'zero-downtim':4
Notice: "Deploying" became "deploy", "migrations" became "migrat", and the stop word "with" disappeared (although it still consumed a position). The hyphenated word was indexed both as a whole and as its parts. Each lexeme carries the position where it appeared in the text.
A tsquery is a search expression. It supports boolean operators (& for AND, | for OR, ! for NOT):
SELECT to_tsquery('english', 'deploy & postgres');
'deploy' & 'postgr'
The @@ operator checks whether a tsvector matches a tsquery:
SELECT to_tsvector('english', 'Deploying PostgreSQL with zero-downtime migrations')
@@ to_tsquery('english', 'deploy & postgres');
true
The user searched for "deploy" and "postgres." The document contains "Deploying" and "PostgreSQL." The stemmer reduced both sides to the same lexemes, and the match succeeds. This is the core of PostgreSQL full-text search.
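One caveat when wiring this to a search box: to_tsquery expects query syntax and raises an error on raw multi-word input like 'deploy postgres' (no operator between the terms). For untrusted user input, plainto_tsquery tokenizes plain text and ANDs the lexemes together:

```sql
-- plainto_tsquery accepts plain text and joins lexemes with &
SELECT plainto_tsquery('english', 'deploy postgres');
-- 'deploy' & 'postgr'
```

The websearch_to_tsquery function covered later in this article is another safe choice for raw user input.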
Making it fast with a GIN index
Without an index, every @@ query still requires a sequential scan. PostgreSQL has to compute to_tsvector for every row and check it against the query, and since parsing and stemming every document on every query is expensive, this can be even slower than ILIKE.
A GIN (Generalized Inverted Index) solves this. GIN indexes map each lexeme to the list of rows that contain it. When a query arrives, PostgreSQL looks up the lexemes in the index and intersects the row lists. No table scan needed.
You can create a GIN index directly on an expression:
CREATE INDEX idx_articles_fts ON articles
USING gin(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')));
But there is a better approach for production. Add a stored generated column that precomputes the tsvector, then index that column:
ALTER TABLE articles ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
) STORED;
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);
The generated column updates automatically whenever title or body changes. The GIN index stays in sync. Now a search query hits the index instead of scanning the table:
EXPLAIN ANALYZE
SELECT id, title
FROM articles
WHERE search_vector @@ to_tsquery('english', 'deploy & postgres')
LIMIT 20;
Limit (cost=52.08..117.34 rows=20 width=268)
(actual time=0.412..0.931 rows=20 loops=1)
-> Bitmap Heap Scan on articles (cost=52.08..11342.19 rows=3467 width=268)
(actual time=0.410..0.924 rows=20 loops=1)
Recheck Cond: (search_vector @@ '''deploy'' & ''postgr'''::tsquery)
-> Bitmap Index Scan on idx_articles_search (cost=0.00..51.21 rows=3467 width=0)
(actual time=0.287..0.288 rows=347 loops=1)
Index Cond: (search_vector @@ '''deploy'' & ''postgr'''::tsquery)
Planning Time: 0.134 ms
Execution Time: 0.972 ms
From 3,042ms to under 1ms. The GIN index looked up "deploy" and "postgr" in the inverted index, intersected the results, and returned matching rows without touching the rest of the table.
Ranking results
Finding matching documents is only half the problem. Users expect the most relevant result first.
ts_rank scores each document based on how often the query terms appear:
SELECT id, title, ts_rank(search_vector, q) AS rank
FROM articles, to_tsquery('english', 'deploy & postgres') q
WHERE search_vector @@ q
ORDER BY rank DESC
LIMIT 20;
For longer documents where term proximity matters more than raw frequency, ts_rank_cd (cover density ranking) gives better results. It rewards documents where the search terms appear close together:
SELECT id, title, ts_rank_cd(search_vector, q) AS rank
FROM articles, to_tsquery('english', 'deploy & postgres') q
WHERE search_vector @@ q
ORDER BY rank DESC
LIMIT 20;
You can also highlight matching terms in the output with ts_headline:
SELECT id, title,
ts_headline('english', body, to_tsquery('english', 'deploy & postgres'),
'StartSel=<b>, StopSel=</b>, MaxFragments=2') AS snippet
FROM articles
WHERE search_vector @@ to_tsquery('english', 'deploy & postgres')
ORDER BY ts_rank(search_vector, to_tsquery('english', 'deploy & postgres')) DESC
LIMIT 10;
This returns text fragments with matching terms wrapped in <b> tags, ready to display in a search results page.
Phrase search and advanced queries
Basic to_tsquery splits terms into individual lexemes connected by boolean operators. For phrase search, where word order and proximity matter, PostgreSQL offers more specific functions.
phraseto_tsquery matches terms that appear next to each other in the document:
SELECT * FROM articles
WHERE search_vector @@ phraseto_tsquery('english', 'connection pooling');
This only matches documents where "connection" and "pooling" (after stemming) appear adjacent, not scattered across the text.
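Under the hood, the phrase is compiled into the FOLLOWED BY operator <->. The generalized form <N> matches lexemes exactly N positions apart, which is useful when a stop word sits between the terms:

```sql
SELECT phraseto_tsquery('english', 'connection pooling');
-- 'connect' <-> 'pool'

-- <2> means "two positions later", so it tolerates one word in between:
SELECT to_tsvector('english', 'connection and pooling')
       @@ to_tsquery('english', 'connect <2> pool');
-- true
```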
websearch_to_tsquery (PostgreSQL 11+) accepts the kind of input users actually type into search bars:
SELECT * FROM articles
WHERE search_vector @@ websearch_to_tsquery('english', '"connection pooling" -pgbouncer');
This matches the exact phrase "connection pooling" but excludes documents containing "pgbouncer." The syntax is close to what Google users expect: quotes for phrases, minus for exclusion.
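It is instructive to look at what websearch_to_tsquery compiles this input into. Running just the function (with the english configuration) shows the phrase turned into FOLLOWED BY operators and the exclusion turned into a NOT:

```sql
SELECT websearch_to_tsquery('english', '"connection pooling" -pgbouncer');
-- 'connect' <-> 'pool' & !'pgbouncer'
```

Unlike to_tsquery, it never raises a syntax error, which makes it safe to feed raw user input.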
Prefix matching works with the :* suffix, useful for search-as-you-type:
SELECT * FROM articles
WHERE search_vector @@ to_tsquery('english', 'depl:*');
This matches "deploy", "deployment", "deploying", and any other word that stems to a lexeme starting with "depl."
Fuzzy and typo-tolerant search with pg_trgm
Full-text search requires exact lexeme matches after stemming. If a user types "posgres" instead of "postgres", tsvector search returns nothing. For typo tolerance and fuzzy matching, the pg_trgm extension fills the gap.
pg_trgm breaks strings into trigrams (three-character sequences) and compares the overlap. The similarity function returns a score between 0 and 1:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('postgre', 'postgresql');
-- 0.5833333
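The show_trgm function exposes the trigram set directly, which makes the overlap arithmetic visible. Note that pg_trgm pads each word with two leading spaces and one trailing space before slicing:

```sql
SELECT show_trgm('postgre');
-- {"  p"," po","gre","ost","pos","re ","stg","tgr"}
```

Most of these trigrams also appear in the set for 'postgresql', which is where the similarity score comes from.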
A GIN index with trigram operators makes fuzzy search fast on large tables:
CREATE INDEX idx_articles_title_trgm ON articles USING gin(title gin_trgm_ops);
SELECT title, similarity(title, 'postgre') AS sim
FROM articles
WHERE title % 'postgre'
ORDER BY sim DESC
LIMIT 10;
The % operator returns rows where the similarity exceeds the threshold (default 0.3, adjustable with SET pg_trgm.similarity_threshold).
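Two practical knobs worth knowing. The threshold can be loosened per session for short inputs, and a GiST trigram index (as opposed to GIN) supports the <-> distance operator, letting ORDER BY ... LIMIT walk the index instead of sorting all matches. A sketch, assuming the same articles table:

```sql
-- Loosen the match threshold for this session:
SET pg_trgm.similarity_threshold = 0.2;

-- gist_trgm_ops supports the <-> distance operator (1 - similarity),
-- so a nearest-neighbor query can be answered from the index:
CREATE INDEX idx_articles_title_trgm_gist
  ON articles USING gist (title gist_trgm_ops);

SELECT title
FROM articles
ORDER BY title <-> 'postgre'
LIMIT 10;
```

For an autocomplete endpoint that always wants "the 10 closest titles", the GiST variant is often the better fit.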
A practical pattern is to combine both approaches. Use pg_trgm for the autocomplete dropdown (matching partial, possibly misspelled input against titles) and tsvector for the full results page (matching stemmed terms across title and body with ranking).
For applications that serve users typing in multiple languages or with accented characters, the unaccent extension normalizes text so that "café" matches "cafe":
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT unaccent('Déploiement rapide du service de base de données');
-- 'Deploiement rapide du service de base de donnees'
You can create a custom text search configuration that chains unaccent with the standard dictionary, so accented and unaccented input produce the same lexemes.
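A sketch of that configuration, following the filtering-dictionary pattern from the PostgreSQL unaccent documentation (the name english_unaccent is arbitrary):

```sql
CREATE TEXT SEARCH CONFIGURATION english_unaccent (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_unaccent
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, english_stem;

-- Accented and plain input now produce the same lexemes:
SELECT to_tsvector('english_unaccent', 'Déploiement')
       @@ to_tsquery('english_unaccent', 'deploiement');
-- true
```

Use 'english_unaccent' in place of 'english' in the generated column and in every query so that both sides are normalized the same way.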
When PostgreSQL search is enough
PostgreSQL full-text search is not a toy. It handles real workloads. Here are practical thresholds where it works well:
- Corpus size: up to tens of millions of documents. GIN index lookups are fast, and the index stays compact relative to the data.
- Query volume: a few hundred queries per second is comfortable for a well-indexed setup. For most applications with a search bar, this is far more than needed.
- Language support: PostgreSQL ships with configurations for over 20 languages. If your corpus is in one or a handful of known languages, built-in support covers it.
- Feature set: stemming, ranking, phrase search, prefix matching, highlighting, boolean operators, and fuzzy matching with pg_trgm. This covers the needs of most CRUD applications with a search feature.
- Operational simplicity: the data already lives in PostgreSQL. There is no sync pipeline, no second cluster, no replication lag between the source of truth and the search index. One database, one connection string, one backup strategy.
When you actually need a dedicated search engine
There are workloads where PostgreSQL is not the right tool:
- Faceted navigation with aggregated counts. An e-commerce catalog where users filter by brand, price range, color, and size, and every filter shows a live count of matching products. Elasticsearch was designed for this. PostgreSQL can approximate it with GROUP BY and CTEs, but it is slower and harder to maintain.
- Hundreds of millions of documents with sub-10ms p99. At this scale, Elasticsearch's distributed architecture and in-memory segment caching give it an edge that a single PostgreSQL instance cannot match.
- Complex scoring pipelines. Learning-to-rank, boosting by recency or popularity, vector-plus-keyword hybrid search. These are possible in PostgreSQL but require significant custom SQL, while dedicated search engines provide them as configuration.
- Dynamic multi-language corpora. If documents arrive in dozens of languages and you need automatic language detection per document, a dedicated engine handles this more gracefully.
The key question is not "can PostgreSQL do search?" It can. The question is whether you need capabilities that justify operating a second system. Most applications with a search bar over their own data never reach that threshold. Starting with PostgreSQL and migrating later if needed is a lower-risk path than introducing Elasticsearch on day one.
Getting started
Here is a minimal setup you can run against any PostgreSQL 12+ database:
ALTER TABLE articles ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
) STORED;
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);
SELECT id, title, ts_rank(search_vector, q) AS rank
FROM articles, websearch_to_tsquery('english', 'full text search postgres') q
WHERE search_vector @@ q
ORDER BY rank DESC
LIMIT 20;
Three statements: a generated column, a GIN index, and a ranked search query. No external service, no data pipeline, no new dependency.
If you want to test this on a real PostgreSQL instance, start with a free database on Noctaploy. Extensions like pg_trgm, unaccent, and btree_gin are pre-installed and can be enabled from the dashboard with one click. Create your GIN index, run your first ts_rank query, and see if PostgreSQL covers your search needs before reaching for anything else.
This article was originally published on Noctaploy.io.