DEV Community

Franck Pachot
Franck Pachot

Posted on

Text Search with MongoDB and PostgreSQL

MongoDB Search Indexes provide full‑text search capabilities directly within MongoDB, allowing complex queries to be run without copying data to a separate search system. Initially deployed in Atlas, MongoDB’s managed service, Search Indexes are now also part of the community edition. This post compares the default full‑text search behaviour between MongoDB and PostgreSQL, using a simple example to illustrate the ranking algorithm.

Setup: a small dataset

I’ve inserted nine small documents, each consisting of different fruits, using emojis to make it more visual. The 🍎 and 🍏 emojis represent our primary search terms. They appear at varying frequencies in documents of different lengths.

db.articles.deleteMany({});

db.articles.insertMany([
 { description : "🍏 🍌 🍊" },                // short, 1 🍏
 { description : "🍎 🍌 🍊" },                // short, 1 🍎
 { description : "🍎 🍌 🍊 🍎" },             // larger, 2 🍎
 { description : "🍎 🍌 🍊 🍊 🍊" },          // larger, 1 🍎
 { description : "🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰" },  // large, 1 🍎
 { description : "🍎 🍎 🍎 🍎 🍎 🍎" },       // large, 6 🍎
 { description : "🍎 🍌" },                 // very short, 1 🍎
 { description : "🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎" },  // large, 1 🍎
 { description : "🍎 🍎 🍌 🍌 🍌" },          // shorter, 2 🍎
]);
Enter fullscreen mode Exit fullscreen mode

To enable dynamic indexing, I created a MongoDB Search Index without specifying any particular field names:

db.articles.createSearchIndex("default",
  { mappings: { dynamic: true } }
);
Enter fullscreen mode Exit fullscreen mode

I created the equivalent on PostgreSQL:

DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
    id BIGSERIAL PRIMARY KEY,
    description TEXT
);

INSERT INTO articles(description) VALUES
('🍏 🍌 🍊'),
('🍎 🍌 🍊'),
('🍎 🍌 🍊 🍎'),
('🍎 🍌 🍊 🍊 🍊'),
('🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰'),
('🍎 🍎 🍎 🍎 🍎 🍎'),
('🍎 🍌'),
('🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎'),
('🍎 🍎 🍌 🍌 🍌');
Enter fullscreen mode Exit fullscreen mode

Since text search needs multiple index entries for each row, I set up a GIN (Generalized Inverted Index) and use tsvector to extract and index the relevant tokens.

CREATE INDEX articles_fts_idx
  ON articles USING GIN (to_tsvector('simple', description))
;
Enter fullscreen mode Exit fullscreen mode

MongoDB Text Search (Lucene BM25):

I use my custom search index to find articles containing either 🍎 or 🍏 in their descriptions. The results are sorted by relevance score and displayed as follows:

db.articles.aggregate([
  { $search: { text: { query: ["🍎", "🍏"], path: "description" }, index: "default" } },
  { $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
  { $sort: { score: -1 } }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )
Enter fullscreen mode Exit fullscreen mode

Here are the results, presented in order of best to worst match:

1.024 🍏 🍌 🍊
0.132 🍎 🍎 🍎 🍎 🍎 🍎
0.107 🍎 🍌 🍊 🍎
0.101 🍎 🍎 🍌 🍌 🍌
0.097 🍎 🍌
0.088 🍎 🍌 🍊
0.073 🍎 🍌 🍊 🍊 🍊
0.059 🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰
0.059 🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎
Enter fullscreen mode Exit fullscreen mode

All documents were retrieved by this search since each contains a red or green apple. However, they are assigned different scores:

  • Multiple appearances boost the score: When a document contains the search term more than once, its ranking increases compared to those with only a single appearance. That's why documents featuring several 🍎 are ranked higher than those containing only one.
  • Rarity outweighs quantity: When a term like 🍎 appears in every document, it has less impact than a rare term, such as 🍏. Therefore, even if 🍏 only appears once, the document containing it ranks higher than others with multiple 🍎. In this model, rarity carries more weight than mere frequency.
  • Diminishing returns on term frequency: Each extra occurrence of a term adds less to the relevance score. For instance, increasing 🍎 from one to six times (from 🍎 🍌 to 🍎 🍎 🍎 🍎 🍎 🍎) boosts the score, but not by a factor of six. The effect of term repetition diminishes as the count rises.
  • Document length matters: A term that appears only once is scored higher in a short document than in a long one. That's why 🍎 🍌 ranks higher than 🍎 🍌 🍊, which itself ranks higher than 🍎 🍌 🍊 🍊 🍊.

MongoDB Atlas Search indexes are powered by Lucene’s BM25 algorithm, a refinement of the classic TF‑IDF model:

  • Term Frequency (TF): More occurrences of a term in a document increase its relevance score, but with diminishing returns.
  • Inverse Document Frequency (IDF): Terms that appear in fewer documents receive higher weighting.
  • Length Normalization: Matches in shorter documents contribute more to relevance than the same matches in longer documents.

To demonstrate the impact of IDF, I added several documents that do not contain any of the apples I'm searching for.

const fruits = [ "🍐","🍊","🍋","🍌","🍉","🍇","🍓","🫐",         
                 "🥝","🥭","🍍","🥥","🍈","🍅","🥑","🍆",  
                 "🍋","🍐","🍓","🍇","🍈","🥭","🍍","🍑",  
                 "🥝","🫐","🍌","🍉","🥥","🥑","🥥","🍍" ];
function randomFruitSentence(min=3, max=8) {
  const len = Math.floor(Math.random() * (max - min + 1)) + min;
  return Array.from({length: len}, () => fruits[Math.floor(Math.random()*fruits.length)]).join(" ");
}
db.articles.insertMany(
  Array.from({length: 500}, () => ({ description: randomFruitSentence() }))
);

db.articles.aggregate([
  { $search: { text: { query: ["🍎", "🍏"], path: "description" }, index: "default" } },
  { $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
  { $sort: { score: -1 } }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )

3.365 🍎 🍎 🍎 🍎 🍎 🍎
3.238 🍏 🍌 🍊
2.760 🍎 🍌 🍊 🍎
2.613 🍎 🍎 🍌 🍌 🍌
2.506 🍎 🍌
2.274 🍎 🍌 🍊
1.919 🍎 🍌 🍊 🍊 🍊
1.554 🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰
1.554 🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎
Enter fullscreen mode Exit fullscreen mode

Although the result set is unchanged, the score has increased and the frequency gap between 🍎 and 🍏 has narrowed. As a result, 🍎 🍎 🍎 🍎 🍎 🍎 now ranks higher than 🍏 🍌 🍊, since the inverse document frequency (IDF) of 🍏 does not fully offset its term frequency (TF) within a single document. Crucially, changes made in other documents can influence the score of any given document, unlike in traditional indexes where changes in one document do not impact others' index entries.

PostgreSQL Text Search (TF only):

Here is the result in PostgreSQL:

SELECT ts_rank_cd(  

        to_tsvector('simple', description)
     ,  
        to_tsquery('simple', '🍎 | 🍏')  

       ) AS score, description  
FROM articles  
WHERE
       to_tsvector('simple', description) 
    @@ 
       to_tsquery('simple', '🍎 | 🍏')  

ORDER BY score DESC;  
Enter fullscreen mode Exit fullscreen mode

It retrieves the same documents, but with many having the same score, even with different patterns:

 score |       description
-------+-------------------------
   0.6 | 🍎 🍎 🍎 🍎 🍎 🍎
   0.2 | 🍎 🍌 🍊 🍎
   0.2 | 🍎 🍎 🍌 🍌 🍌
   0.1 | 🍏 🍌 🍊
   0.1 | 🍎 🍌
   0.1 | 🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎
   0.1 | 🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰
   0.1 | 🍎 🍌 🍊
   0.1 | 🍎 🍌 🍊 🍊 🍊
(9 rows)
Enter fullscreen mode Exit fullscreen mode

With PostgreSQL text search, only the term frequency (TF) matters, and is a direct multiplicator of the score: 6 apples ranks 3x higher than two, and 6x than one.

There's some possible normalization available with additiona flags:

SELECT ts_rank_cd(
         to_tsvector('simple', description),
         to_tsquery('simple', '🍎 | 🍏')  ,
            0 -- (the default) ignores the document length
         |  1 -- divides the rank by 1 + the logarithm of the document length
    --   |  2 -- divides the rank by the document length
    --   |  4 -- divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
         |  8 -- divides the rank by the number of unique words in document
    --   | 16 -- divides the rank by 1 + the logarithm of the number of unique words in document
    --   | 32 -- divides the rank by itself + 1
       ) AS score,
       description
FROM articles
WHERE to_tsvector('simple', description) @@ to_tsquery('simple', '🍎 | 🍏')
ORDER BY score DESC
;
    score    |       description
-------------+-------------------------
    0.308339 | 🍎 🍎 🍎 🍎 🍎 🍎
 0.055811062 | 🍎 🍎 🍌 🍌 🍌
  0.04551196 | 🍎 🍌
  0.04142233 | 🍎 🍌 🍊 🍎
 0.024044918 | 🍏 🍌 🍊
 0.024044918 | 🍎 🍌 🍊
 0.018603688 | 🍎 🍌 🍊 🍊 🍊
 0.005688995 | 🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰
 0.005688995 | 🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎
(9 rows)
Enter fullscreen mode Exit fullscreen mode

This penalizes longer documents and those with more unique terms. Still, it doesn't consider other documents like IDF.

PostgreSQL Full Text Search scoring with ts_rank_cd is based on term frequency and proximity. It does not compute inverse document frequency, so scores do not change as the corpus changes. Normalization flags can penalize long documents or those with many unique terms, but they are length-based adjustments, not true IDF, like we have in TF‑IDF or BM25‑style search engine.

ParadeDB with pg_search (Tantivy BM25)

PostgreSQL popularity is not only due to its features but also its extensibility and ecosystem. The pg_search extension adds functions and operators that use BM25 indexes (Tantivy, a Rust-based search library inspired by Lucene). It is easy to test with ParadeDB:

docker run --rm -it paradedb/paradedb bash

POSTGRES_PASSWORD=x docker-entrypoint.sh postgres &

psql -U postgres

Enter fullscreen mode Exit fullscreen mode

The extension is installed in version 0.18.4:

postgres=# \dx
                                        List of installed extensions
          Name          | Version |   Schema   |                        Description
------------------------+---------+------------+------------------------------------------------------------
 fuzzystrmatch          | 1.2     | public     | determine similarities and distance between strings
 pg_cron                | 1.6     | pg_catalog | Job scheduler for PostgreSQL
 pg_ivm                 | 1.9     | pg_catalog | incremental view maintenance on PostgreSQL
 pg_search              | 0.18.4  | paradedb   | pg_search: Full text search for PostgreSQL using BM25
 plpgsql                | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgis                | 3.6.0   | public     | PostGIS geometry and geography spatial types and functions
 postgis_tiger_geocoder | 3.6.0   | tiger      | PostGIS tiger geocoder and reverse geocoder
 postgis_topology       | 3.6.0   | topology   | PostGIS topology spatial types and functions
 vector                 | 0.8.0   | public     | vector data type and ivfflat and hnsw access methods
(9 rows)
Enter fullscreen mode Exit fullscreen mode

I created and inserted the same as I did above on PostgreSQL and created the BM25 index:

CREATE INDEX search_idx ON articles
       USING bm25 (id, description)
       WITH (key_field='id')
;
Enter fullscreen mode Exit fullscreen mode

We can query using the @@@ operator and rank with paradedb.score(id). Unlike PostgreSQL’s built‑in @@, which uses query‑local statistics, @@@ computes scores using global IDF and Lucene’s BM25 length normalization — so adding unrelated documents can still change the scores.

SELECT description, paradedb.score(id) AS score
FROM articles
WHERE description @@@ '🍎' OR description @@@ '🍏'
ORDER BY score DESC, description;

 description | score
-------------+-------
(0 rows)
Enter fullscreen mode Exit fullscreen mode

The result is empty. Using emoji as terms can lead to inconsistent tokenization results, so I replaced them with text labels instead:

UPDATE articles SET description 
 = replace(description, '🍎', 'Gala');
UPDATE articles SET description 
 = replace(description, '🍏', 'Granny Smith');
UPDATE articles SET description 
 = replace(description, '🍊', 'Orange');
Enter fullscreen mode Exit fullscreen mode

This time, the scoring is more precise and takes into account the term frequency within the document (TF), the term’s rarity across the entire indexed corpus (IDF), along with a length normalization factor to prevent longer documents from having an unfair advantage:

SELECT description, paradedb.score(id) AS score
FROM articles
WHERE description @@@ 'Gala' OR description @@@ 'Granny Smith'
ORDER BY score DESC, description;

          description          |   score
-------------------------------+------------
 Granny Smith 🍌 Orange        |  3.1043208
 Gala Gala Gala Gala Gala Gala | 0.79529095
 Gala Gala 🍌 🍌 🍌            |  0.7512194
 Gala 🍌                       | 0.69356775
 Gala 🍌 Orange Gala           | 0.63589364
 Gala 🍌 Orange                |  0.5195716
 Gala 🍌 Orange 🌴 🫐 🍈 🍇   |  0.5195716
 🍌 Orange 🌴 🫐 🍈 🍇   Gala |  0.5195716
 Gala 🍌 Orange Orange Orange  | 0.34597924
(9 rows)
Enter fullscreen mode Exit fullscreen mode

PostgreSQL’s built-in search only provides basic, local term frequency scoring. To get a full-feature text search that can be used in application's search boxes, it can be extended with third-party tools like ParadeDB's pg_search.

Conclusion

Relevance scoring in text search can differ widely between systems because each uses its own ranking algorithms and analyzers. To better visualize my results in these tests, I used emojis and opted for the simplest definitions. I selected PostgreSQL's to_tsvector('simple') configuration to prevent language-specific processing, while for MongoDB Atlas Search, I used the default dynamic mapping.

MongoDB Atlas Search (and now in MongoDB Community Edition) uses Lucene’s BM25 algorithm, combining:

  • Term Frequency (TF): Frequent terms in a document boost scores, but with diminishing returns
  • Inverse Document Frequency (IDF): Rare terms across the corpus get higher weight
  • Length normalization: Matches in shorter documents are weighted more than the same matches in longer ones

PostgreSQL’s full-text search (ts_rank_cd()) evaluates only term frequency and position, overlooking other metrics like IDF. For more advanced features such as BM25, extensions like ParadeDB’s pg_search are needed, which require extra configuration and are not always available on managed platforms. PostgreSQL offers a modular approach, where extensions can add advanced ranking algorithms like BM25. MongoDB provides built‑in BM25‑based full‑text search in both Atlas and the Community Edition.

Top comments (0)