MongoDB Guests for MongoDB

Posted on Mar 20

How to Use MongoDB’s Text Search

#mongodb #python

This tutorial was written by Damilola Oladele.

MongoDB Search is an embedded full-text search engine that lives inside your MongoDB deployment. You don't need to run a separate search system alongside your database. You get fine-grained text indexing, a rich query language for complex search logic, customizable relevance scoring, and advanced features like autocomplete and faceted search.

This tutorial shows you how to insert a sample book catalog into MongoDB, create a search index, and run six types of text search queries in Python.

You'll learn the following in this tutorial:

What search queries, search indexes, and search analyzers are in MongoDB
How to insert sample data and create a search index
How to run a basic text search
How to search across multiple fields
How to search with fuzzy matching
How to search for a term while excluding another word
How to combine search conditions with the compound operator
How to include relevance scores in your results

You can find all the code samples for this tutorial in the GitHub repository.

What You Need to Know Before You Start

Three concepts are central to text search in MongoDB. They'll help you understand why each search operation behaves the way it does.

Search queries

A search query asks MongoDB to find documents that are relevant to a term or phrase. Search queries are different from traditional database queries. A traditional database query follows a strict syntax and returns exact matches. On the other hand, a search query can match similar phrases, tolerate typos, support wildcards, and rank results by how relevant they are.

MongoDB Search queries take the form of an aggregation pipeline stage. The $search stage returns the actual search results. The $searchMeta stage returns metadata about the results, such as counts or facet data. Both stages must come first in your aggregation pipeline. You then chain additional stages, like $project or $limit, after them.
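To make the stage ordering concrete, here is a minimal sketch of two pipelines written as plain Python lists. The index name "default", the count option, and the helper search_stage_comes_first are illustrative assumptions, not part of any MongoDB API. Nothing here talks to a cluster; the helper only checks where the search stage sits.

```python
# Stage names that must open a pipeline.
SEARCH_STAGES = {"$search", "$searchMeta"}

# $search returns documents; later stages shape the output.
search_pipeline = [
    {"$search": {"index": "default",
                 "text": {"query": "space", "path": "description"}}},
    {"$project": {"_id": 0, "title": 1}},
    {"$limit": 5},
]

# $searchMeta returns metadata (here, a total count) instead of documents.
meta_pipeline = [
    {"$searchMeta": {"index": "default",
                     "text": {"query": "space", "path": "description"},
                     "count": {"type": "total"}}},
]


def search_stage_comes_first(pipeline):
    """Hypothetical helper: True if the only search stage is at position 0."""
    names = [next(iter(stage)) for stage in pipeline]
    positions = [i for i, name in enumerate(names) if name in SEARCH_STAGES]
    return positions == [0]


print(search_stage_comes_first(search_pipeline))
print(search_stage_comes_first(list(reversed(search_pipeline))))
```

You would pass either list to books.aggregate() once your collection and index exist.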

Search indexes

A search index is a data structure that maps terms to the documents that contain them. Think of it like the index at the back of a book. Instead of scanning every page to find a topic, you look it up in the index and go straight to the right pages. A search index works the same way. MongoDB consults it instead of scanning every document in your collection, thereby improving search speed.
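The book-index analogy can be sketched in a few lines of Python. This is a toy inverted index, assuming a naive lowercase-and-split tokenizer; it illustrates the idea only and is not how MongoDB stores its search indexes.

```python
import re
from collections import defaultdict

# Three shortened descriptions keyed by a made-up document id.
docs = {
    1: "An astronaut wakes up alone in space",
    2: "A woman discovers a library between life and death",
    3: "A woman reflects on her isolated rural childhood",
}


def build_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


index = build_index(docs)
print(sorted(index["woman"]))  # [2, 3]
print(sorted(index["space"]))  # [1]
```

Looking up a term is a single dictionary access, no matter how many documents you add.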

You must create a search index before you can run any $search queries.

Search analyzers and tokens

An analyzer controls how MongoDB transforms your text into searchable units called tokens. This transformation happens both when MongoDB builds the index and when it processes your query. The analyzer you choose determines which tokens end up in the index and which tokens your query produces.

MongoDB Search ships with five built-in analyzers. The one you choose shapes what your search can and can't find.

Analyzer	How it splits text	Lowercases tokens	Removes stop words
Standard	Word boundaries (whitespace and most punctuation)	Yes	No
Simple	Any non-letter character	Yes	No
Whitespace	Whitespace only	No	No
Keyword	Treats the whole field as one token	No	No
Language	Language-specific rules	Varies	Varies

If you don't specify an analyzer, MongoDB uses the Standard analyzer. This tutorial relies on that default. Later on, you'll see how the Standard analyzer's tokenization determines what a query finds and why.
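To get a feel for how the analyzers differ, the sketch below approximates four of them with plain Python string handling. These functions are rough stand-ins written for this illustration. The real analyzers follow more detailed rules (for example, around apostrophes and Unicode), so treat the output as indicative only.

```python
import re

text = "Science Fiction: Apollo 13"


def standard_like(value):
    # Approximation: lowercase, then split on anything that is not a word character.
    return re.findall(r"\w+", value.lower())


def simple_like(value):
    # Approximation: lowercase, then keep runs of letters only (digits are dropped).
    return re.findall(r"[^\W\d_]+", value.lower())


def whitespace_like(value):
    # Split on whitespace only; case and punctuation are preserved.
    return value.split()


def keyword_like(value):
    # The whole field becomes a single token.
    return [value]


for analyzer in (standard_like, simple_like, whitespace_like, keyword_like):
    print(analyzer.__name__, analyzer(text))
```

Note how a query for "fiction" would find a token under the first two approximations, but not under the whitespace one, which keeps "Fiction:" intact.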

Now let's get started.

Prerequisites

Before you start, make sure you have the following ready:

Python 3.8 or later installed
A MongoDB Atlas account with an M0 (Free tier) cluster set up
pymongo 4.5 or later installed (the search index helpers used in this tutorial, such as create_search_index, were added in 4.5)
Basic familiarity with Python and MongoDB collections
Your Atlas connection string, available in the Atlas UI under Database > Connect > Drivers

Make sure your IP address is allowlisted in Atlas before you run any scripts. Go to Security > Network Access in the Atlas UI and add your current IP address. Your scripts won't connect to your cluster without this step.

Set Up Your Project

Create a folder for your project and navigate into it in your terminal:

mkdir mongodb-text-search
cd mongodb-text-search

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate

On Windows, activate it with:

venv\Scripts\activate

Install pymongo:

python -m pip install pymongo

You'll create a separate Python file for each step of this tutorial. All files go in the mongodb-text-search folder.

Insert Your Data and Create a Search Index

Create a file named insert_data.py. This file does two things:

It inserts a sample book catalog into your cluster.
It creates the search index you'll use for the rest of the tutorial.

You only need to run this file once. The other files in this tutorial just connect and query.

from pymongo import MongoClient
import time

client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

# Drop the collection to start clean on each run:
books.drop()

books.insert_many([
    {
        "title": "The Midnight Library",
        "author": "Matt Haig",
        "genre": "fiction",
        "description": "A woman discovers a library between life and death that contains books representing all the lives she could have lived.",
        "year": 2020
    },
    {
        "title": "Atomic Habits",
        "author": "James Clear",
        "genre": "self-help",
        "description": "A practical guide to building good habits and breaking bad ones through small, incremental changes.",
        "year": 2018
    },
    {
        "title": "Project Hail Mary",
        "author": "Andy Weir",
        "genre": "science fiction",
        "description": "An astronaut wakes up alone in space with no memory and must figure out how to save Earth from an extinction-level threat.",
        "year": 2021
    },
    {
        "title": "Educated",
        "author": "Tara Westover",
        "genre": "memoir",
        "description": "A woman reflects on her isolated rural childhood and her journey to educate herself and earn a PhD from Cambridge.",
        "year": 2018
    },
    {
        "title": "The Hitchhiker's Guide to the Galaxy",
        "author": "Douglas Adams",
        "genre": "science fiction",
        "description": "An ordinary man is swept across the universe after Earth is demolished to make way for a hyperspace expressway.",
        "year": 1979
    }
])

print("Sample data inserted.")

# Drop any existing search index named 'default' before creating a new one:
try:
    books.drop_search_index("default")
    time.sleep(5)
except Exception:
    pass

# Create a search index with dynamic mappings:
books.create_search_index({
    "name": "default",
    "definition": {
        "mappings": {
            "dynamic": True
        }
    }
})

print("Search index created. Waiting for it to go active...")

# Poll until the index is ready:
while True:
    indexes = list(books.list_search_indexes())
    if indexes and indexes[0].get("status") == "READY":
        print("Search index is active. You're ready to run queries.")
        break
    time.sleep(3)

Replace <USERNAME>, <PASSWORD>, and <HOST> with your actual Atlas credentials.
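If your username or password contains characters such as @ or /, they need to be percent-encoded before they go into the connection string. Here is a minimal sketch using the standard library. The helper name build_uri and the sample credentials and hostname are made up for illustration.

```python
from urllib.parse import quote_plus


def build_uri(username, password, host):
    """Hypothetical helper: percent-encode the credentials and assemble the URI."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Placeholder values only; substitute your own Atlas details.
uri = build_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)  # mongodb+srv://app_user:p%40ss%2Fword@cluster0.example.mongodb.net/
```

You can then pass uri to MongoClient in place of the hard-coded string.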

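The dynamic: True setting tells MongoDB to automatically index every field whose type it supports for dynamic mapping. In this dataset, that covers the string fields you'll search and the numeric year field you'll filter on later. The while loop polls the index status and pauses the script until the index is ready.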
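A polling loop with no exit condition waits forever if the index never becomes ready. One possible variation, sketched below, adds a timeout and checks the index by name. The function wait_for_index and its parameters are hypothetical; it takes any callable that returns index documents so you can try it without a cluster.

```python
import time


def wait_for_index(list_indexes, name="default", timeout=120, interval=3,
                   sleep=time.sleep):
    """Poll list_indexes() until the named index reports READY or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for index in list_indexes():
            if index.get("name") == name and index.get("status") == "READY":
                return True
        sleep(interval)
    return False


# Stand-in for a real collection: reports READY on the third poll.
polls = {"count": 0}


def fake_list_indexes():
    polls["count"] += 1
    status = "READY" if polls["count"] >= 3 else "PENDING"
    return [{"name": "default", "status": status}]


print(wait_for_index(fake_list_indexes, sleep=lambda seconds: None))  # True
```

Against a real collection, you would call wait_for_index(books.list_search_indexes) and handle a False return as a failure.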

Run the script:

python insert_data.py

You'll get the following output in your terminal:

Sample data inserted.
Search index created. Waiting for it to go active...
Search index is active. You're ready to run queries.

Run a Basic Text Search

A basic text search finds documents where a specific field contains your query term. The text operator handles this. You set query to the term you want to find and path to the field you want to search.

Create a file named basic_search.py:

from pymongo import MongoClient

# Replace the placeholder values with your Atlas credentials and cluster hostname:
client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

# Search the description field for the term "space":
results = books.aggregate([
    {
        "$search": {
            "text": {
                "query": "space",
                "path": "description"
            }
        }
    },
    {
        # Return title, author, and description — suppress _id:
        "$project": {
            "_id": 0,
            "title": 1,
            "author": 1,
            "description": 1
        }
    }
])

for doc in results:
    print(doc)

Run the script:

python basic_search.py

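You'll get output like the following. It's shown here as formatted JSON for readability; print displays each document as a single-line Python dict: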

{
  "title": "Project Hail Mary",
  "author": "Andy Weir",
  "description": "An astronaut wakes up alone in space with no memory and must figure out how to save Earth from an extinction-level threat."
}

Only "Project Hail Mary" matches because "space" appears there as a standalone word. "The Hitchhiker's Guide to the Galaxy" has the word "hyperspace" in its description, but the Standard analyzer splits text only at word boundaries such as whitespace and punctuation, so "hyperspace" is indexed as a single token rather than as "hyper" and "space". For "space" to match "hyperspace", you'd need a different approach, such as the wildcard operator or a custom analyzer that produces n-grams.
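The sketch below mimics this behavior locally with a naive tokenizer, then shows what a wildcard stage for the same search might look like. The tokenizer is an approximation of the Standard analyzer, and the wildcard stage is an untested assumption about how you could apply the operator to this collection; check the operator's documentation before relying on it.

```python
import re
from fnmatch import fnmatchcase

description = ("An ordinary man is swept across the universe after Earth is "
               "demolished to make way for a hyperspace expressway.")

# Approximate the Standard analyzer: lowercase, split on non-word characters.
tokens = re.findall(r"\w+", description.lower())

exact_match = "space" in tokens
wildcard_match = any(fnmatchcase(token, "*space") for token in tokens)
print(exact_match, wildcard_match)  # False True

# A possible $search stage using the wildcard operator on an analyzed field.
wildcard_stage = {
    "$search": {
        "wildcard": {
            "query": "*space",
            "path": "description",
            "allowAnalyzedField": True,
        }
    }
}
```

The exact lookup fails because no token equals "space", while the pattern "*space" matches the token "hyperspace".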

Search Across Multiple Fields

You can pass a list of field names to path. MongoDB searches all listed fields and returns any document where your query term appears in at least one of them.

Create a file named multi_field_search.py:

from pymongo import MongoClient

# Replace the placeholder values with your Atlas credentials and cluster hostname:
client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

# Search title, description, and genre for the term "fiction":
results = books.aggregate([
    {
        "$search": {
            "text": {
                "query": "fiction",
                "path": ["title", "description", "genre"]
            }
        }
    },
    {
        # Return title, genre, and description — suppress _id:
        "$project": {
            "_id": 0,
            "title": 1,
            "genre": 1,
            "description": 1
        }
    }
])

for doc in results:
    print(doc)

Run the script:

python multi_field_search.py

You'll get output like the following (formatted here for readability):

{
  "title": "The Midnight Library",
  "genre": "fiction",
  "description": "A woman discovers a library between life and death that contains books representing all the lives she could have lived."
},
{
  "title": "The Hitchhiker's Guide to the Galaxy",
  "genre": "science fiction",
  "description": "An ordinary man is swept across the universe after Earth is demolished to make way for a hyperspace expressway."
},
{
  "title": "Project Hail Mary",
  "genre": "science fiction",
  "description": "An astronaut wakes up alone in space with no memory and must figure out how to save Earth from an extinction-level threat."
}

All three books match because "fiction" appears as a token in each of their genre fields. The Standard analyzer tokenizes "science fiction" into two separate tokens: "science" and "fiction". So the two science fiction books match on the token "fiction", and "The Midnight Library" matches because its genre is exactly "fiction".

Search with Fuzzy Matching

A fuzzy search finds documents that contain terms similar to your query. This is useful when users make typos or spelling mistakes.

Create a file named fuzzy_search.py:

from pymongo import MongoClient

# Replace the placeholder values with your Atlas credentials and cluster hostname:
client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

# Search for terms similar to "librery" — allow up to one character edit:
results = books.aggregate([
    {
        "$search": {
            "text": {
                "query": "librery",
                "path": "description",
                "fuzzy": {
                    # maxEdits controls how many character changes MongoDB tolerates:
                    "maxEdits": 1
                }
            }
        }
    },
    {
        # Return title and description — suppress _id:
        "$project": {
            "_id": 0,
            "title": 1,
            "description": 1
        }
    }
])

for doc in results:
    print(doc)

Run the script:

python fuzzy_search.py

You'll get output like the following (formatted here for readability):

{
  "title": "The Midnight Library",
  "description": "A woman discovers a library between life and death that contains books representing all the lives she could have lived."
}

The maxEdits value controls how many character changes MongoDB tolerates. A value of 1 means MongoDB matches terms that differ from your query by one insertion, deletion, or substitution. The misspelled "librery" is one character away from "library", so the match succeeds.
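To see what "one character away" means, here is a small edit-distance function. It implements the classic Levenshtein distance (insertions, deletions, and substitutions) as an illustration of the concept; it is not MongoDB's own matching code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


print(edit_distance("librery", "library"))  # 1 (substitute e with a)
print(edit_distance("libary", "library"))   # 1 (insert the missing r)
print(edit_distance("lbrery", "library"))   # 2 (too far for maxEdits: 1)
```

A query for "lbrery" would need maxEdits set to 2 to reach "library".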

Search for a Term While Excluding Another Word

Sometimes you want to find documents that match one term but don't contain another. This is different from a basic text search because you need two conditions to work together:

a positive match
an exclusion

The compound operator handles this through its must and mustNot clauses.

A must clause requires the document to match a condition to appear in the results. A mustNot clause requires the document not to match a condition. A document is returned only if it satisfies both at once.

Create a file named exclude_term_search.py:

from pymongo import MongoClient

# Replace the placeholder values with your Atlas credentials and cluster hostname:
client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

results = books.aggregate([
    {
        "$search": {
            "compound": {
                # Must: the description field must contain "woman":
                "must": [
                    {
                        "text": {
                            "query": "woman",
                            "path": "description"
                        }
                    }
                ],
                # MustNot: exclude any book whose description contains "death":
                "mustNot": [
                    {
                        "text": {
                            "query": "death",
                            "path": "description"
                        }
                    }
                ]
            }
        }
    },
    {
        # Return title and description — suppress _id:
        "$project": {
            "_id": 0,
            "title": 1,
            "description": 1
        }
    }
])

for doc in results:
    print(doc)

Run the script:

python exclude_term_search.py

You'll get output like the following (formatted here for readability):

{
  "title": "Educated",
  "description": "A woman reflects on her isolated rural childhood and her journey to educate herself and earn a PhD from Cambridge."
}

Two books in the collection have "woman" in their description: "The Midnight Library" and "Educated". The query returns only "Educated" because "The Midnight Library" contains the word "death", which the mustNot clause blocks.

A mustNot clause doesn't contribute to the relevance score of any document. It acts purely as a filter that removes documents from the result set. A document either passes or fails the exclusion check, and that outcome doesn't change how MongoDB ranks the documents that do pass.

Combine Conditions with the Compound Operator

The compound operator lets you combine multiple search conditions in a single query. It supports four clauses:

must: the document must match for it to appear in the results. Maps to AND.
mustNot: the document must not match for it to appear in the results. Maps to AND NOT.
should: documents that match rank higher in the results. Maps to OR.
filter: the document must match, but matching doesn't affect the relevance score.

Create a file named compound_search.py:

from pymongo import MongoClient

client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

results = books.aggregate([
    {
        "$search": {
            "compound": {
                # Must: the genre field must contain the token "fiction"
                "must": [
                    {
                        "text": {
                            "query": "fiction",
                            "path": "genre"
                        }
                    }
                ],
                # MustNot: exclude any book whose description mentions "astronaut"
                "mustNot": [
                    {
                        "text": {
                            "query": "astronaut",
                            "path": "description"
                        }
                    }
                ],
                # Should: boost books whose description mentions "library"
                "should": [
                    {
                        "text": {
                            "query": "library",
                            "path": "description"
                        }
                    }
                ],
                # Filter: only include books published in 1979 or later (no score impact)
                "filter": [
                    {
                        "range": {
                            "path": "year",
                            "gte": 1979
                        }
                    }
                ]
            }
        }
    },
    {
        "$project": {
            "_id": 0,
            "title": 1,
            "genre": 1,
            "description": 1,
            "year": 1,
        }
    }
])

for doc in results:
    print(doc)

Run it:

python compound_search.py

You'll get output like the following (formatted here for readability):

{
  "title": "The Midnight Library",
  "genre": "fiction",
  "description": "A woman discovers a library between life and death that contains books representing all the lives she could have lived.",
  "year": 2020
},
{
  "title": "The Hitchhiker's Guide to the Galaxy",
  "genre": "science fiction",
  "description": "An ordinary man is swept across the universe after Earth is demolished to make way for a hyperspace expressway.",
  "year": 1979
}

The query works in four steps. The must clause narrows the candidate set to all books whose genre contains the token "fiction": "The Midnight Library" (genre: "fiction"), "Project Hail Mary" (genre: "science fiction"), and "The Hitchhiker's Guide to the Galaxy" (genre: "science fiction"). The mustNot clause then removes "Project Hail Mary" because its description contains "astronaut". The filter clause confirms that both remaining books were published in 1979 or later without adding any score. The should clause boosts "The Midnight Library" because its description contains "library", which pushes it to the top of the results.
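The same four steps can be traced in plain Python. This is a local simulation under simplifying assumptions: descriptions are shortened, tokens come from a naive lowercase split, and the should clause is modeled as a flat boost of 1 rather than a real relevance score. The function simulate_compound is hypothetical.

```python
import re

# Abbreviated copies of the tutorial's five books (descriptions shortened).
BOOKS = [
    {"title": "The Midnight Library", "genre": "fiction", "year": 2020,
     "description": "A woman discovers a library between life and death."},
    {"title": "Atomic Habits", "genre": "self-help", "year": 2018,
     "description": "A practical guide to building good habits."},
    {"title": "Project Hail Mary", "genre": "science fiction", "year": 2021,
     "description": "An astronaut wakes up alone in space with no memory."},
    {"title": "Educated", "genre": "memoir", "year": 2018,
     "description": "A woman reflects on her isolated rural childhood."},
    {"title": "The Hitchhiker's Guide to the Galaxy", "genre": "science fiction",
     "year": 1979,
     "description": "An ordinary man is swept across the universe."},
]


def tokens(text):
    """Naive stand-in for the Standard analyzer."""
    return set(re.findall(r"\w+", text.lower()))


def simulate_compound(docs):
    ranked = []
    for doc in docs:
        if "fiction" not in tokens(doc["genre"]):       # must
            continue
        if "astronaut" in tokens(doc["description"]):   # mustNot
            continue
        if doc["year"] < 1979:                          # filter
            continue
        boost = 1 if "library" in tokens(doc["description"]) else 0  # should
        ranked.append((boost, doc["title"]))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in ranked]


print(simulate_compound(BOOKS))
```

The simulation returns "The Midnight Library" first and "The Hitchhiker's Guide to the Galaxy" second, mirroring the order of the real query.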

Include Relevance Scores in Your Results

Every document in a $search result gets a relevance score. A higher score means the document is a closer match to your query. Documents score higher when the query term appears often in that document, and lower when the same term appears across many documents in the collection.

You include scores in results by adding a $project stage with the {"$meta": "searchScore"} expression. To see ranking in action, you need a query that returns more than one document. Searching for "woman" matches two books and lets you compare how MongoDB scores them against each other.

Create a file named search_with_scores.py:

from pymongo import MongoClient

# Replace the placeholder values with your Atlas credentials and cluster hostname:
client = MongoClient(
    "mongodb+srv://<USERNAME>:<PASSWORD>@<HOST>/",
    appname="devrel-tutorial-python-search"
)
db = client["bookstore"]
books = db["books"]

# Search the description field for the term "woman":
results = books.aggregate([
    {
        "$search": {
            "text": {
                "query": "woman",
                "path": "description"
            }
        }
    },
    {
        # Include the relevance score alongside each document:
        "$project": {
            "_id": 0,
            "title": 1,
            "description": 1,
            "score": {"$meta": "searchScore"}
        }
    },
    {
        # Return results from highest to lowest score:
        "$sort": {"score": -1}
    }
])

for doc in results:
    print(doc)

Run it:

python search_with_scores.py

You'll get output like the following (formatted here for readability):

{
  "title": "The Midnight Library",
  "description": "A woman discovers a library between life and death that contains books representing all the lives she could have lived.",
  "score": 0.39296838641166687
}
{
  "title": "Educated",
  "description": "A woman reflects on her isolated rural childhood and her journey to educate herself and earn a PhD from Cambridge.",
  "score": 0.39296838641166687
}

Both books contain the word "woman" in their descriptions, so both return. The scores are equal because "woman" appears exactly once in each description and both descriptions are similar in length. This is the expected behavior when term frequency and document length are similar across matching documents.

The exact score values you see may differ from another cluster's output. Relevance scores depend on term frequency and document count at index time, which can vary between deployments. What matters is the relative ranking across results, not the absolute numbers.
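To build intuition for how these factors interact, here is a sketch of a BM25-style scoring function, a common formula in full-text search engines. Treat it as an assumption about the general shape of relevance scoring, with conventional parameter values k1=1.2 and b=0.75. It is not a reproduction of MongoDB's scoring, and its numbers won't match the scores your cluster returns.

```python
import math


def bm25(term_freq, doc_len, avg_doc_len, docs_with_term, total_docs,
         k1=1.2, b=0.75):
    """BM25-style score for one term in one document (illustrative only)."""
    # Rarer terms across the collection get a larger weight.
    idf = math.log(1 + (total_docs - docs_with_term + 0.5)
                   / (docs_with_term + 0.5))
    # Term frequency saturates and is normalized by document length.
    norm = term_freq + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * term_freq / norm


# A term appearing once in an average-length document, found in 2 of 5 docs.
once = bm25(term_freq=1, doc_len=20, avg_doc_len=20, docs_with_term=2, total_docs=5)
# The same term appearing twice in the document.
twice = bm25(term_freq=2, doc_len=20, avg_doc_len=20, docs_with_term=2, total_docs=5)
# A more common term, found in 4 of 5 docs.
common = bm25(term_freq=1, doc_len=20, avg_doc_len=20, docs_with_term=4, total_docs=5)
# The same term in a document twice as long as average.
longer = bm25(term_freq=1, doc_len=40, avg_doc_len=20, docs_with_term=2, total_docs=5)

print(twice > once > common)  # True
print(longer < once)          # True
```

Two documents with the same term frequency and the same length get identical inputs, and therefore identical scores, which is the situation in the "woman" query.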

The $sort stage makes the ranking explicit. Without it, MongoDB Search still returns results from highest to lowest score by default, but the score field lets you confirm and reason about the order.

Key Takeaways

MongoDB Search is embedded in your cluster. You don't need a separate search system.
You must create a search index before running any $search query. If the index doesn't exist, MongoDB returns an empty result set without an error.
The $search stage must come first in your aggregation pipeline. You chain other stages, like $project, $sort, and $limit, after it.
The Standard analyzer tokenizes multi-word strings into separate tokens. "science fiction" becomes two tokens: "science" and "fiction". This affects which documents your queries return.
Compound words like "hyperspace" stay as a single token. A search for "space" doesn't match "hyperspace".
The compound operator lets you combine must, mustNot, should, and filter clauses in a single query. A mustNot clause acts as a pure filter and doesn't affect relevance scores.
Every $search result gets a relevance score. Documents score higher when the query term appears often in that document and lower when the same term appears across many documents in the collection.

All the code samples from this tutorial are available in the GitHub repository.
