Previously, I demonstrated MongoDB text search scoring with a simple example, creating a dynamic index without specifying fields explicitly. You might be curious about what data is actually stored in such an index. Let's delve into the specifics. Unlike regular MongoDB collections and indexes, which use WiredTiger for storage, search indexes leverage Lucene technology. We can inspect these indexes using Luke, the Lucene Toolbox GUI tool.
Set up a lab
I started an Atlas local deployment to get a container for my lab:
# download Atlas CLI if you don't have it. Here it is for my Mac:
wget https://www.mongodb.com/try/download/atlascli
unzip mongodb-atlas-cli_1.43.0_macos_arm64.zip
# start a container
bin/atlas deployments setup atlas --type local --port 27017 --force
Sample data
I connected with mongosh, created a collection, and defined a search index, as in the previous post:
mongosh --eval '
db.articles.deleteMany({});
db.articles.insertMany([
{ description : "π π π" }, // short, 1 π
{ description : "π π π" }, // short, 1 π
{ description : "π π π π" }, // larger, 2 π
{ description : "π π π π π" }, // larger, 1 π
{ description : "π π π π΄ π« π π π°" }, // large, 1 π
{ description : "π π π π π π" }, // large, 6 π
{ description : "π π" }, // very short, 1 π
{ description : "π π π΄ π« π π π° π" }, // large, 1 π
{ description : "π π π π π" }, // shorter, 2 π
]);
db.articles.createSearchIndex("default",
{ mappings: { dynamic: true } }
);
'
Get Lucene indexes
While MongoDB collections and secondary indexes are stored by the WiredTiger storage engine (by default in the /data/db directory), the text search indexes use Lucene in a mongot process (with files stored by default in /data/mongot). I copied this directory to my laptop:
docker cp atlas:/data/mongot ./mongot_copy
cd mongot_copy
One file is easy to read because it is JSON: the metadata listing the search indexes with their MongoDB configuration:
cat configJournal.json | jq
{
"version": 1,
"stagedIndexes": [],
"indexes": [
{
"index": {
"indexID": "68d0588abf7ab96dd26277b1",
"name": "default",
"database": "test",
"lastObservedCollectionName": "articles",
"collectionUUID": "a18b587d-a380-4067-95aa-d0e9d4871b64",
"numPartitions": 1,
"mappings": {
"dynamic": true,
"fields": {}
},
"indexFeatureVersion": 4
},
"analyzers": [],
"generation": {
"userVersion": 0,
"formatVersion": 6
}
}
],
"deletedIndexes": [],
"stagedVectorIndexes": [],
"vectorIndexes": [],
"deletedVectorIndexes": []
}
The directory where the Lucene files are stored is named after the indexID:
ls 68d0588abf7ab96dd26277b1*
_0.cfe _0.cfs _0.si
_1.cfe _1.cfs _1.si
_2.cfe _2.cfs _2.si
_3.cfe _3.cfs _3.si
_4.cfe _4.cfs _4.si
_5.cfe _5.cfs _5.si
_6.cfe _6.cfs _6.si
segments_2
write.lock
In a Lucene index, each .cfs/.cfe/.si set represents one immutable segment containing a snapshot of indexed data (.cfs holds the actual data, .cfe its table of contents, and .si the segment's metadata), and the segments_2 file is the global manifest that tracks all active segments so Lucene can search across them as one index.
Install and use Luke
I installed the Lucene binaries and started Luke:
wget https://dlcdn.apache.org/lucene/java/9.12.2/lucene-9.12.2.tgz
tar -zxvf lucene-9.12.2.tgz
lucene-9.12.2/bin/luke.sh
This starts the GUI asking for the index directory:
The "Overview" tab shows lots of information:
The field names are prefixed with the type. My description field was indexed as a string and named $type:string/description. There are 9 documents and 9 different terms:
The Lucene index stores each term's document frequency in order to apply Inverse Document Frequency (IDF). Here, π is present in 8 documents and π in only one.
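To see how document frequency drives IDF, here is a quick sketch of Lucene's BM25 IDF formula (it appears in the score explanations later in this post), computed for a term present in 8 of the 9 documents versus a term present in only one:

```javascript
// BM25 IDF as used by Lucene: log(1 + (N - n + 0.5) / (n + 0.5))
// N = total documents with the field, n = documents containing the term
const idf = (n, N) => Math.log(1 + (N - n + 0.5) / (n + 0.5));

console.log(idf(8, 9).toFixed(5)); // common term (8 of 9 docs): 0.16252
console.log(idf(1, 9).toFixed(5)); // rare term (1 of 9 docs):   1.89712
```

The rarer term weighs more than ten times as much, which is why matching it dominates the score.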
The "Document" tab lets us browse the documents and see what is indexed. For example, π is present in one document with { description: "π π π" }:
The flags IdfpoN-S mean that it is a fully indexed text field with docs, frequencies, positions, and offsets, with norms and stored values.
The "Search" tab lets us run queries. For example, here is { $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } } from my previous post:
This is exactly what I got from MongoDB:
db.articles.aggregate([
{ $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } },
{ $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
{ $sort: { score: -1 } }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )
1.024 π π π
0.132 π π π π π π
0.107 π π π π
0.101 π π π π π
0.097 π π
0.088 π π π
0.073 π π π π π
0.059 π π π π΄ π« π π π°
0.059 π π π΄ π« π π π° π
If you double-click on a document in the result, you can get the explanation of the score:
For example, the score of π π π is explained by:
1.0242119 sum of:
1.0242119 weight($type:string/description:π in 0) [BM25Similarity], result of:
1.0242119 score(freq=1.0), computed as boost * idf * tf from:
1.89712 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
9 N, total number of documents with field
0.5398773 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
4.888889 avgdl, average length of field
The BM25 core formula is score = boost × idf × tf.
IDF (Inverse Document Frequency) is idf = log(1 + (N - n + 0.5) / (n + 0.5)), where n is the number of documents containing π (here 1) and N is the total number of documents with this field (here 9).
TF (term frequency normalization) is tf = freq / (freq + k1 × (1 - b + b × (dl / avgdl))), where freq is the number of occurrences of the term in this document's field (here 1), k1 is the term saturation parameter (1.2 by default in Lucene), b is the length normalization parameter (0.75 by default in Lucene), dl is the document length (3 tokens here), and avgdl is the average document length for this field in the segment (here 4.888889).
MongoDB search queries can also apply a boost, which multiplies into the BM25 formula as the boost factor.
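The numbers in the score explanation above can be reproduced directly. This is a small sketch plugging the statistics Luke reported into the BM25 formula, with boost = 1 (the last digits can differ slightly from Luke's output because Lucene computes in single precision):

```javascript
// Reproduce Luke's score explanation for "π π π" with BM25 defaults
const k1 = 1.2, b = 0.75;                  // Lucene defaults
const N = 9, n = 1;                        // corpus statistics for the term
const freq = 1, dl = 3, avgdl = 4.888889;  // per-document statistics

const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
const tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
const score = 1 * idf * tf;                // boost = 1

console.log(idf.toFixed(5));  // 1.89712
console.log(tf.toFixed(7));   // 0.5398773
console.log(score.toFixed(4)); // 1.0242
```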
The "Analysis" tab helps explain how strings are tokenized and processed. For example, the standard analyzer explicitly recognized emojis:
Finally, I inserted 500 documents with other fruits, as in the previous post, and the collection-wide term frequencies were updated:
The scores reflect the change:
The explanation of the new rank for π π π is:
3.2850468 sum of:
3.2850468 weight($type:string/description:π in 205) [BM25Similarity], result of:
3.2850468 score(freq=1.0), computed as boost * idf * tf from:
5.8289456 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
After adding 500 documents without π, BM25 recalculates IDF for π with a much larger N, making it appear far rarer in the corpus, so its score contribution more than triples.
Notice that when I queried for ππ, no documents contained both terms, so the scoring explanation included only one weight. If I modify the query to include πππ, the document π π π scores highest, as it combines the weights for both matching terms, π and π:
4.3254924 sum of:
3.2850468 weight($type:string/description:π in 205) [BM25Similarity], result of:
3.2850468 score(freq=1.0), computed as boost * idf * tf from:
5.8289456 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
1.0404456 weight($type:string/description:π in 205) [BM25Similarity], result of:
1.0404456 score(freq=1.0), computed as boost * idf * tf from:
1.8461535 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
80 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
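The sum above can be reproduced the same way: one BM25 weight per matching term, added together (boost = 1, using the per-term statistics from Luke's explanation):

```javascript
// BM25 weight of one term, with Lucene defaults k1 = 1.2, b = 0.75
const weight = (n, N, freq, dl, avgdl, k1 = 1.2, b = 0.75) => {
  const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
  const tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
  return idf * tf;
};

// The top document matches both query terms: one in 1 of 509 docs,
// the other in 80 of 509 docs
const w1 = weight(1, 509, 1, 3, 5.691552);   // rare term
const w2 = weight(80, 509, 1, 3, 5.691552);  // more common term
console.log((w1 + w2).toFixed(4)); // 4.3255
```

Both terms share the same tf (same document length), so the score difference comes entirely from their document frequencies.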
Here is the same query in MongoDB:
db.articles.aggregate([
{ $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } },
{ $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
{ $sort: { score: -1 } },
{ $limit: 15 }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )
4.325 π π π
1.354 π π π π π
1.259 π« π π₯ π
1.137 π₯₯ π π π π π
1.137 π π π π π₯ π
1.137 π₯₯ π π π π π
1.084 π π π π₯₯ π π π«
1.084 π π« π₯ π π₯ π π
1.084 π₯ π π₯ π π π π
1.040 π π« π₯
1.040 π π π
1.040 π π π
1.040 π π π
1.040 π π π
1.036 π π₯₯ π π π π π π
While the scores may feel intuitively correct when you look at the data, it's important to remember that there's no magic: everything is based on well-known mathematics and formulas. Lucene's scoring algorithms are used in many systems, including Elasticsearch, Apache Solr, and the search indexes built into MongoDB.
Conclusion
MongoDB search indexes are designed to work optimally out of the box. In my earlier post, I relied entirely on default settings and dynamic mapping, and even replaced words with emojis, yet still got relevant, well-ranked results without extra tuning. If you want to go deeper and fine-tune your search behavior, or simply learn more about how it works, inspecting the underlying Lucene index can provide great insights. Since Atlas Search indexes are Lucene-compatible, tools like Luke let you see exactly how your text is tokenized, stored, and scored, giving you full transparency into how queries match your documents.