Mongodb text index vs regex for text searching

#mongodb #regex #database #webdev

In this post I'll discuss the difference between two methods for text searching in mongodb collection and compare their complexities, pros and cons.

A case study

Consider having posts collection where each object consists of title and content:

"_id": ObjectId(""),
"title": "PostA",
"content": "This is the content for the first post."
},
{
"_id": ObjectId(""),
"title": "PostB",
"content": "This is a different content for the second post."
}

Our objective here is to search for some text and return the matching documents.

Using REGEX

In case we want to search for title it will be easy to use normal find filter

db.posts.find({title: "PostA"})

if we know exactly the title or to use regex if we know a part of it

db.posts.find({title: 'pattern', $options: '<options>'})

But if we want to search in the content field, using this

db.posts.find({content: "first"})

will return nothing as it will search for an exact matching.
So, we can use regex here

db.posts.find({content: {$regex: /first/}})

this query will return the PostA document as its content contains the word first.
But this will do a full collection scan of O(n) and it will have a poor performance on larger datasets.

Using Text Index

Text Indexes : It converts the text into array of single words and it remove all the stop words (is, a, an, etc)
Let's create a text index on our content field

db.posts.createIndex({content: "text"})

and don't forget to specify "text" to remove unneeded words and store keywords.
to search for a word

db.posts.find({$text: {$search: "first"}})

This will return PostA.

Why we didn't search inside our content in the above query? 🤔
Since mongo treat this index as an array of words in order if you want to add another field for this text index, for example we can add both title and content to the index and it will treat them under one text only.

Example

db.posts.createIndex({title: "text", content: "text"})

Note: we can't add another text index while there is already another one so these next lines are illegal and we should add them at once,

 db.posts.createIndex({content: "text"})
 db.posts.createIndex({title: "text"})

Ok, now we have a combined index on both title and content so if we search with any keyword whether in title or content it will return the correct matching document.
This approach will be very efficient in terms of complexity as it uses indexScan ( O(log(n)) ) also in usability instead of searching in a specific field this will search in a combined multiple fields which is more practical.

Exclude words using text index

Let's try to search a post with content contain the 'post' keyword

db.posts.find({$text: {$search: "post"}})

This query will return both PostA and PostB docs, but we can return only PostB if we exclude the 'first' keyword :

db.posts.find({$text: {$search: "post -first"}})

This will exclude the docs with content having the word 'first'.

Conclusion

Finally we saw that using text index is faster, easier and preferable to be used and support keyword exclusion, but we have some other cases when we want to search for a substrings or partial word matches like the word Post in PostB in this case we have to use regex.