Previously, I demonstrated MongoDB text search scoring with a simple example, creating a dynamic index without specifying fields explicitly. You might be curious about what data is actually stored in such an index. Let's delve into the specifics. Unlike regular MongoDB collections and indexes, which use WiredTiger for storage, search indexes leverage Lucene technology. We can inspect these indexes using Luke, the Lucene Toolbox GUI tool.
Set up a lab
I started an Atlas local deployment to get a container for my lab:
# download Atlas CLI if you don't have it. Here it is for my Mac:
wget https://www.mongodb.com/try/download/atlascli
unzip mongodb-atlas-cli_1.43.0_macos_arm64.zip
# start a container
bin/atlas deployments setup atlas --type local --port 27017 --force
Sample data
I connected with mongosh, created a collection, and defined a search index, as in the previous post:
mongosh --eval '
db.articles.deleteMany({});
db.articles.insertMany([
{ description : "π π π" }, // short, 1 π
{ description : "π π π" }, // short, 1 π
{ description : "π π π π" }, // larger, 2 π
{ description : "π π π π π" }, // larger, 1 π
{ description : "π π π π΄ π« π π π°" }, // large, 1 π
{ description : "π π π π π π" }, // large, 6 π
{ description : "π π" }, // very short, 1 π
{ description : "π π π΄ π« π π π° π" }, // large, 1 π
{ description : "π π π π π" }, // shorter, 2 π
]);
db.articles.createSearchIndex("default",
{ mappings: { dynamic: true } }
);
'
Get Lucene indexes
While MongoDB collections and secondary indexes are stored by the WiredTiger storage engine (by default in the /data/db directory), the text search indexes use Lucene in a mongot process (with files stored by default in /data/mongot). I copied this directory to my laptop:
docker cp atlas:/data/mongot ./mongot_copy
cd mongot_copy
One file is easy to read because it is JSON: the metadata listing the search indexes with their MongoDB configuration:
cat configJournal.json | jq
{
"version": 1,
"stagedIndexes": [],
"indexes": [
{
"index": {
"indexID": "68d0588abf7ab96dd26277b1",
"name": "default",
"database": "test",
"lastObservedCollectionName": "articles",
"collectionUUID": "a18b587d-a380-4067-95aa-d0e9d4871b64",
"numPartitions": 1,
"mappings": {
"dynamic": true,
"fields": {}
},
"indexFeatureVersion": 4
},
"analyzers": [],
"generation": {
"userVersion": 0,
"formatVersion": 6
}
}
],
"deletedIndexes": [],
"stagedVectorIndexes": [],
"vectorIndexes": [],
"deletedVectorIndexes": []
}
The directory where the Lucene files are stored is named after the indexID:
ls 68d0588abf7ab96dd26277b1*
_0.cfe _0.cfs _0.si
_1.cfe _1.cfs _1.si
_2.cfe _2.cfs _2.si
_3.cfe _3.cfs _3.si
_4.cfe _4.cfs _4.si
_5.cfe _5.cfs _5.si
_6.cfe _6.cfs _6.si
segments_2
write.lock
In a Lucene index, each .cfs/.cfe/.si set represents one immutable segment containing a snapshot of indexed data (.cfs holds the actual data, .cfe its table of contents, and .si the segment's metadata), and the segments_2 file is the global manifest that tracks all active segments so Lucene can search across them as one index.
Install and use Luke
I installed the Lucene binaries and started Luke:
wget https://dlcdn.apache.org/lucene/java/9.12.2/lucene-9.12.2.tgz
tar -zxvf lucene-9.12.2.tgz
lucene-9.12.2/bin/luke.sh
This starts the GUI asking for the index directory:
The "Overview" tab shows lots of information:
The field names are prefixed with the type. My description field was indexed as a string and named $type:string/description. There are 9 documents and 9 different terms:
The Lucene index stores each term's document frequency in order to apply Inverse Document Frequency (IDF). Here, π is present in 8 documents and π in only one.
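To see how document frequency drives IDF, here is a quick sketch of Lucene's BM25 IDF formula (it appears in the score explanations later in this post), computed for a term present in 8 of the 9 documents versus a term present in only one:

```javascript
// BM25 IDF as used by Lucene: log(1 + (N - n + 0.5) / (n + 0.5))
// N = total documents with the field, n = documents containing the term
const idf = (n, N) => Math.log(1 + (N - n + 0.5) / (n + 0.5));

console.log(idf(8, 9).toFixed(5)); // common term (8 of 9 docs): 0.16252
console.log(idf(1, 9).toFixed(5)); // rare term (1 of 9 docs):   1.89712
```

The rarer term weighs more than ten times as much, which is why matching it dominates the score.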
The "Document" tab lets us browse the documents and see what is indexed. For example, π is present in one document with { description: "π π π" }:
The flags IdfpoN-S mean that it is a fully indexed text field with docs, frequencies, positions, and offsets, with norms and stored values.
The "Search" tab lets us run queries. For example, here is { $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } } from my previous post:
This is exactly what I got from MongoDB:
db.articles.aggregate([
{ $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } },
{ $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
{ $sort: { score: -1 } }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )
1.024 π π π
0.132 π π π π π π
0.107 π π π π
0.101 π π π π π
0.097 π π
0.088 π π π
0.073 π π π π π
0.059 π π π π΄ π« π π π°
0.059 π π π΄ π« π π π° π
If you double-click on a document in the result, you can get the explanation of the score:
For example, the score of π π π is explained by:
1.0242119 sum of:
1.0242119 weight($type:string/description:π in 0) [BM25Similarity], result of:
1.0242119 score(freq=1.0), computed as boost * idf * tf from:
1.89712 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
9 N, total number of documents with field
0.5398773 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
4.888889 avgdl, average length of field
The BM25 core formula is score = boost × idf × tf.
IDF (Inverse Document Frequency) is idf = log(1 + (N - n + 0.5) / (n + 0.5)), where n is the number of documents containing π (here 1) and N is the total number of documents with this field (here 9).
TF (term frequency normalization) is tf = freq / (freq + k1 × (1 - b + b × (dl / avgdl))), where freq is the number of occurrences of the term in this document's field (here 1), k1 is the term saturation parameter (1.2 by default in Lucene), b is the length normalization parameter (0.75 by default in Lucene), dl is the document length (3 tokens here), and avgdl is the average document length for this field in the segment (here 4.888889).
MongoDB search queries can also apply a boost, which multiplies into the BM25 formula as the boost factor.
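The numbers in the score explanation above can be reproduced directly. This is a small sketch plugging the statistics Luke reported into the BM25 formula, with boost = 1 (the last digits can differ slightly from Luke's output because Lucene computes in single precision):

```javascript
// Reproduce Luke's score explanation for "π π π" with BM25 defaults
const k1 = 1.2, b = 0.75;                  // Lucene defaults
const N = 9, n = 1;                        // corpus statistics for the term
const freq = 1, dl = 3, avgdl = 4.888889;  // per-document statistics

const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
const tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
const score = 1 * idf * tf;                // boost = 1

console.log(idf.toFixed(5));  // 1.89712
console.log(tf.toFixed(7));   // 0.5398773
console.log(score.toFixed(4)); // 1.0242
```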
The "Analysis" tab helps explain how strings are tokenized and processed. For example, the standard analyzer explicitly recognized emojis:
Finally, I inserted 500 documents with other fruits, as in the previous post, and the collection-wide term frequencies were updated:
The scores reflect the change:
The explanation of the new rank for π π π is:
3.2850468 sum of:
3.2850468 weight($type:string/description:π in 205) [BM25Similarity], result of:
3.2850468 score(freq=1.0), computed as boost * idf * tf from:
5.8289456 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
After adding 500 documents without π, BM25 recalculates IDF for π with a much larger N, making it appear far rarer in the corpus, so its score contribution more than triples.
Notice that when I queried for ππ, no documents contained both terms, so the scoring explanation included only one weight. If I modify the query to include πππ, the document π π π scores highest, as it combines the weights for both matching terms, π and π:
4.3254924 sum of:
3.2850468 weight($type:string/description:π in 205) [BM25Similarity], result of:
3.2850468 score(freq=1.0), computed as boost * idf * tf from:
5.8289456 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
1.0404456 weight($type:string/description:π in 205) [BM25Similarity], result of:
1.0404456 score(freq=1.0), computed as boost * idf * tf from:
1.8461535 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
80 n, number of documents containing term
509 N, total number of documents with field
0.5635748 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
1.0 freq, occurrences of term within document
1.2 k1, term saturation parameter
0.75 b, length normalization parameter
3.0 dl, length of field
5.691552 avgdl, average length of field
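The sum above can be reproduced the same way: one BM25 weight per matching term, added together (boost = 1, using the per-term statistics from Luke's explanation):

```javascript
// BM25 weight of one term, with Lucene defaults k1 = 1.2, b = 0.75
const weight = (n, N, freq, dl, avgdl, k1 = 1.2, b = 0.75) => {
  const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
  const tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
  return idf * tf;
};

// The top document matches both query terms: one in 1 of 509 docs,
// the other in 80 of 509 docs
const w1 = weight(1, 509, 1, 3, 5.691552);   // rare term
const w2 = weight(80, 509, 1, 3, 5.691552);  // more common term
console.log((w1 + w2).toFixed(4)); // 4.3255
```

Both terms share the same tf (same document length), so the score difference comes entirely from their document frequencies.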
Here is the same query in MongoDB:
db.articles.aggregate([
{ $search: { text: { query: ["π", "π"], path: "description" }, index: "default" } },
{ $project: { _id: 0, score: { $meta: "searchScore" }, description: 1 } },
{ $sort: { score: -1 } },
{ $limit: 15 }
]).forEach( i=> print(i.score.toFixed(3).padStart(5, " "),i.description) )
4.325 π π π
1.354 π π π π π
1.259 π« π π₯ π
1.137 π₯₯ π π π π π
1.137 π π π π π₯ π
1.137 π₯₯ π π π π π
1.084 π π π π₯₯ π π π«
1.084 π π« π₯ π π₯ π π
1.084 π₯ π π₯ π π π π
1.040 π π« π₯
1.040 π π π
1.040 π π π
1.040 π π π
1.040 π π π
1.036 π π₯₯ π π π π π π
While the scores may feel intuitively correct when you look at the data, it's important to remember that there's no magic: everything is based on well-known mathematics and formulas. Lucene's scoring algorithms are used in many systems, including Elasticsearch, Apache Solr, and the search indexes built into MongoDB.
Conclusion
MongoDB search indexes are designed to work optimally out of the box. In my earlier post, I relied entirely on default settings and dynamic mapping, and even replaced words with emojis, yet still got relevant, well-ranked results without extra tuning. If you want to go deeper and fine-tune your search behavior, or simply learn more about how it works, inspecting the underlying Lucene index can provide great insights. Since Atlas Search indexes are Lucene-compatible, tools like Luke let you see exactly how your text is tokenized, stored, and scored, giving you full transparency into how queries match your documents.