A document database is more than a JSON datastore. It must also support efficient storage and advanced search: equality and range predicates, fuzzy text search, ranking, pagination, and limited sorted results (top‑k). BM25 indexes, which combine an inverted index and columnar doc values, are ideal for this, with mature open‑source implementations like Lucene (used by MongoDB) and Tantivy (used by ParadeDB).
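As a refresher, BM25 ranks documents by combining term frequency, inverse document frequency, and length normalization. Here is a minimal sketch of the textbook formula (the parameter values k1 = 1.2 and b = 0.75 are common defaults, not values read from Lucene or Tantivy):

```javascript
// Textbook BM25 contribution of a single term to a document's score.
// tf: term frequency in the document; df: documents containing the term
// N: total documents; dl: document length; avgdl: average document length
function bm25Term(tf, df, N, dl, avgdl, k1 = 1.2, b = 0.75) {
  const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
  const norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * dl) / avgdl));
  return idf * norm;
}

// A rare term in a short document outscores a common term in a long one.
const rare = bm25Term(2, 100, 1_000_000, 50, 120);
const common = bm25Term(2, 400_000, 1_000_000, 300, 120);
console.log(rare > common); // true
```

The inverted index supplies the per-term statistics (df, tf), while columnar doc values supply per-document data such as lengths and filterable fields, which is why BM25 indexes pair the two structures.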
ParadeDB brings Tantivy indexing to PostgreSQL via the pg_search extension and recently published an excellent article showing where GIN indexes fall short and how BM25 bridges the gap. Here, I’ll present the MongoDB equivalent using its Lucene‑based search indexes. I suggest reading ParadeDB’s post first, as it clearly explains the problem and the solution.
I'll be lazy and use the same dataset, index and query.
MongoDB with search indexes
You can use BM25 indexes on MongoDB in several environments: the cloud-managed service (MongoDB Atlas), its local deployment (Atlas Local), on-premises MongoDB Enterprise Server, and the open-source MongoDB Community edition. The mongot engine that powers MongoDB Search is in public preview, with its source available at github.com/mongodb/mongot.
I started a local Atlas deployment on my laptop with Atlas CLI and connected automatically:
atlas deployments setup mongo --type local --connectWith mongosh --force
Dataset generation
I generated 100,000,000 documents similar to ParadeDB's benchmark:
const batchSize = 10000;
const batches = 10000;
const rows = batches * batchSize;
print(`Generating ${rows.toLocaleString()} documents`);
db.benchmark_logs.drop();
const messages = [ 'The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.', 'The research facility analyzed samples from ancient artifacts, revealing breakthrough findings about civilizations lost to the depths of time.', 'The research station monitored weather patterns across mountain peaks, collecting data about atmospheric changes in the remote depths below.', 'The research observatory captured images of stellar phenomena, peering into the cosmic depths to understand the mysteries of distant galaxies.', 'The research laboratory processed vast amounts of genetic data, exploring the molecular depths of DNA to unlock biological secrets.', 'The research center studied rare organisms found in ocean depths, documenting new species thriving in extreme underwater environments.', 'The research institute developed quantum systems to probe subatomic depths, advancing our understanding of fundamental particle physics.', 'The research expedition explored underwater depths near volcanic vents, discovering unique ecosystems adapted to extreme conditions.', 'The research facility conducted experiments in the depths of space, testing how different materials behave in zero gravity environments.', 'The research team engineered crops that could grow in the depths of drought conditions, helping communities facing climate challenges.' ];
const countries = [ 'United States', 'Canada', 'United Kingdom', 'France', 'Germany', 'Japan', 'Australia', 'Brazil', 'India', 'China' ];
const labels = [ 'critical system alert', 'routine maintenance', 'security notification', 'performance metric', 'user activity', 'system status', 'network event', 'application log', 'database operation', 'authentication event' ];
let batch = [];
const startDate = new Date("2020-01-01T00:00:00Z");
for (let i = 0; i < rows; i++) {
  batch.push({
    message: messages[i % 10],
    country: countries[i % 10],
    severity: (i % 5) + 1,
    timestamp: new Date(startDate.getTime() + (i % 731) * 24 * 60 * 60 * 1000),
    metadata: {
      value: (i % 1000) + 1,
      label: labels[i % 10]
    }
  });
  if (batch.length === batchSize) {
    db.benchmark_logs.insertMany(batch);
    batch = [];
  }
}
if (batch.length > 0) db.benchmark_logs.insertMany(batch); // flush any partial final batch
I checked the document schema and counts:
print(`Done!
\nSample: ${EJSON.stringify( db.benchmark_logs.find().limit(1).toArray(), null, 2 )}
\nDocument count: ${db.benchmark_logs.countDocuments().toLocaleString()}
`);
Done!
Sample: [
{
"_id": {
"$oid": "6997580679ab8450f81ff93c"
},
"message": "The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.",
"country": "United States",
"severity": 1,
"timestamp": {
"$date": "2020-01-01T00:00:00Z"
},
"metadata": {
"value": 1,
"label": "critical system alert"
}
}
]
Document count: 100,000,000
With 100 million documents, this is a large dataset. Because many fields can be queried, we can’t create every compound index combination. A single search index will make queries on this collection efficient.
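A quick illustration of why covering every combination is impractical: because key order matters for B-tree prefix usage, a distinct compound index exists for every non-empty ordered subset of the queryable fields (a simplified count that ignores sort directions):

```javascript
// Count the distinct compound indexes possible over n queryable fields:
// every non-empty ordered subset, since key order matters for prefixes.
function compoundIndexCount(n) {
  let total = 0;
  for (let k = 1; k <= n; k++) {
    let perms = 1;
    for (let j = 0; j < k; j++) perms *= n - j; // P(n, k) permutations
    total += perms;
  }
  return total;
}

console.log(compoundIndexCount(5)); // 325 possible compound indexes
```

Even with only the five queryable fields in this collection, hundreds of compound indexes would be needed to cover every access pattern, which is exactly what a single search index avoids.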
Search index creation
I created a search index similar to the one used by ParadeDB (here):
const mapping = {
  mappings: {
    // Equivalent to: USING bm25 (Atlas Search scores with Lucene BM25 by default)
    dynamic: false,
    fields: {
      // Equivalent to: bm25(id, message, ...): standard full-text field scored by BM25
      message: { type: "string" },
      // Equivalent to: text_fields = { "country": { fast: true, tokenizer: { type: "raw", lowercase: true } } }
      // (fast = true is implicit in Atlas Search)
      country: { type: "string", analyzer: "keywordLowercase" },
      // Equivalent to: numeric field indexed for filtering
      severity: { type: "number", representation: "int64" },
      // Equivalent to: timestamp field included in the BM25 index
      timestamp: { type: "date" },
      // Equivalent to: json_fields = { "metadata": { fast: true, tokenizer: raw } }
      metadata: {
        type: "document",
        fields: {
          value: {
            type: "number",
            representation: "int64"
          },
          // Equivalent to: metadata tokenizer = raw + lowercase
          label: {
            type: "string",
            analyzer: "keywordLowercase"
          }
        }
      }
    }
  },
  analyzers: [
    {
      // Equivalent to: tokenizer = raw, lowercase = true
      name: "keywordLowercase",
      tokenizer: { type: "keyword" },
      tokenFilters: [{ type: "lowercase" }]
    }
  ]
};
db.benchmark_logs.createSearchIndex(
  "benchmark_logs_idx",
  mapping
);
The index is built asynchronously and kept in sync with the collection via change streams.
Query and result
The query combines text search, range filter, sort by score, and limit for Top-K:
const query = [
  {
    $search: {
      index: "benchmark_logs_idx",
      compound: {
        must: [{ text: { query: "research team", path: "message" } }],
        filter: [{ range: { path: "severity", lt: 3 } }]
      },
      sort: { score: { $meta: "searchScore" } }
    }
  },
  { $limit: 10 },
  {
    $project: {
      message: 1, country: 1, severity: 1, timestamp: 1, metadata: 1,
      rank: { $meta: "searchScore" }
    }
  }
];
const start = Date.now();
print(EJSON.stringify(db.benchmark_logs.aggregate(query).toArray(),null,2));
const end = Date.now();
print(`\nExecution time: ${end - start} ms`);
It is important that the sort is part of $search, because an additional $sort stage would not be pushed down to the search engine. Keeping it inside $search lets Atlas Search run the query in Lucene’s Top‑K mode, enabling block‑max WAND (BMW) pruning driven by the minimum competitive score during collection.
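To illustrate why the limit matters, here is a toy Top‑K collector (plain JavaScript, not Lucene code): once it holds K results, its worst retained score becomes the minimum competitive score, and any candidate that cannot beat it is discarded without further work.

```javascript
// Toy Top-K collector: keeps the K best scores and exposes the
// "minimum competitive score" that drives dynamic pruning.
class TopKCollector {
  constructor(k) { this.k = k; this.heap = []; } // sorted array stands in for a heap
  minCompetitiveScore() {
    return this.heap.length < this.k ? -Infinity : this.heap[this.heap.length - 1];
  }
  collect(score) {
    if (score <= this.minCompetitiveScore()) return false; // pruned
    this.heap.push(score);
    this.heap.sort((a, b) => b - a);                 // keep descending order
    if (this.heap.length > this.k) this.heap.pop();  // evict the weakest
    return true;
  }
}

const topk = new TopKCollector(3);
[0.2, 0.9, 0.5, 0.1, 0.7, 0.3].forEach(s => topk.collect(s));
console.log(topk.heap);                  // [0.9, 0.7, 0.5]
console.log(topk.minCompetitiveScore()); // 0.5
```

The key point: this threshold only exists when Lucene knows K up front, which is exactly what pushing the sort and limit into $search provides.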
Here is the result and timing:
[{"_id":{"$oid":"699757049ce6a7c42c65d105"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-11T00:00:00Z"},"metadata":{"value":11,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d10f"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-21T00:00:00Z"},"metadata":{"value":21,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d119"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-01-31T00:00:00Z"},"metadata":{"value":31,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d123"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-02-10T00:00:00Z"},"metadata":{"value":41,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d12d"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-02-20T00:00:00Z"},"metadata":{"value":51,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d137"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in 
the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-01T00:00:00Z"},"metadata":{"value":61,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d141"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-11T00:00:00Z"},"metadata":{"value":71,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d14b"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-21T00:00:00Z"},"metadata":{"value":81,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d155"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-03-31T00:00:00Z"},"metadata":{"value":91,"label":"critical system alert"},"rank":0.6839379072189331},{"_id":{"$oid":"699757049ce6a7c42c65d15f"},"message":"The research team discovered a new species of deep-sea creature while conducting experiments near hydrothermal vents in the dark ocean depths.","country":"United States","severity":1,"timestamp":{"$date":"2020-04-10T00:00:00Z"},"metadata":{"value":101,"label":"critical system alert"},"rank":0.6839379072189331}]
Execution time: 1850 ms
On my laptop, this search over 100 million documents returns results in under two seconds, with no tuning. It performs a broad text match: the high‑frequency terms "research" and "team" generate tens of millions of candidate documents. The severity filter and scoring then require comparing tens of millions of scores, work that Lucene parallelizes across segments to stay within the two‑second budget.
Performance breakdown (explain)
Because the execution plan is long, I’ve packed it into a compact single‑line string that you can easily copy and paste into your preferred AI chatbot:
EJSON.stringify(
  db.benchmark_logs.aggregate(query).explain("executionStats")
);
{"explainVersion":"1","stages":[{"$_internalSearchMongotRemote":{"mongotQuery":{"index":"benchmark_logs_idx","compound":{"must":[{"text":{"query":"research team","path":"message"}}],"filter":[{"range":{"path":"severity","lt":3}}]},"sort":{"score":{"$meta":"searchScore"}}},"explain":{"query":{"type":"BooleanQuery","args":{"must":[{"path":"compound.must","type":"BooleanQuery","args":{"must":[],"mustNot":[],"should":[{"type":"TermQuery","args":{"path":"message","value":"research"},"stats":{"context":{"millisElapsed":1.273251,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":1292.607756,"invocationCounts":{"score":40000011}}}},{"type":"TermQuery","args":{"path":"message","value":"team"},"stats":{"context":{"millisElapsed":0.292666,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":379.190071,"invocationCounts":{"score":10000011}}}}],"filter":[],"minimumShouldMatch":0},"stats":{"context":{"millisElapsed":2.268162,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":0},"score":{"millisElapsed":3838.859709,"invocationCounts":{"score":40000011}}}}],"mustNot":[],"should":[],"filter":[{"path":"compound.filter","type":"ConstantScoreQuery","args":{"query":{"type":"BooleanQuery","args":{"must":[],"mustNot":[],"should":[{"type":"IndexOrDocValuesQuery","args":{"query":[{"type":"PointRangeQuery","args":{"path":"severity","representation":"double","lte":2.9999999999999996},"stats":{"context":{"millisElapsed":0},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}},{"type":"SortedNumericDocValuesRangeQuery","args":{},"stats":{"context":{"millisElapsed":0},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}}]},"stats":{"context":{"millisElapsed":0.010995,"invocationCounts":{"createWeight":2,"createScorer":29}},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}},{"type":"IndexOrDocValuesQuery","args":{"query":[{"type":"PointR
angeQuery","args":{"path":"severity","representation":"int64","lte":2},"stats":{"context":{"millisElapsed":0},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}},{"type":"SortedNumericDocValuesRangeQuery","args":{},"stats":{"context":{"millisElapsed":0},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}}]},"stats":{"context":{"millisElapsed":38.433371,"invocationCounts":{"createWeight":2,"createScorer":87}},"match":{"millisElapsed":875.507977,"invocationCounts":{"nextDoc":40000028,"refineRoughMatch":11}},"score":{"millisElapsed":0}}},{"type":"PointRangeQuery","args":{"path":"severity","representation":"int64","lte":2},"stats":{"context":{"millisElapsed":0.010042,"invocationCounts":{"createWeight":2,"createScorer":29}},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}},{"type":"PointRangeQuery","args":{"path":"severity","representation":"double","lte":2.9999999999999996},"stats":{"context":{"millisElapsed":0.003208,"invocationCounts":{"createWeight":2,"createScorer":29}},"match":{"millisElapsed":0},"score":{"millisElapsed":0}}}],"filter":[],"minimumShouldMatch":0},"stats":{"context":{"millisElapsed":38.985005,"invocationCounts":{"createWeight":1,"createScorer":84}},"match":{"millisElapsed":2908.644175,"invocationCounts":{"nextDoc":40000028}},"score":{"millisElapsed":0}}}},"stats":{"context":{"millisElapsed":39.017415,"invocationCounts":{"createWeight":1,"createScorer":84}},"match":{"millisElapsed":5158.791968,"invocationCounts":{"nextDoc":40000028}},"score":{"millisElapsed":0}}}],"minimumShouldMatch":0},"stats":{"context":{"millisElapsed":42.418625,"invocationCounts":{"createWeight":1,"createScorer":56}},"match":{"millisElapsed":14486.312741,"invocationCounts":{"nextDoc":40000028}},"score":{"millisElapsed":5997.668304,"invocationCounts":{"score":40000000}}}},"collectors":{"allCollectorStats":{"millisElapsed":4965.753857,"invocationCounts":{"collect":40000000,"competitiveIterator":28,"setScorer":28}},"sort":{"fieldInfos":{},"stats":{"prunedResultIte
rator":{"millisElapsed":0},"comparator":{"millisElapsed":2841.984556,"invocationCounts":{"competitiveIterator":28,"setScorer":28,"setBottom":14,"compareBottom":39999989,"setHitsThresholdReached":1}}}}},"resultMaterialization":{"stats":{"millisElapsed":0.660042,"invocationCounts":{"retrieveAndSerialize":1}}},"metadata":{"mongotVersion":"1.62.0-46-g84ab2fcdae","mongotHostName":"mongot-local-dev","indexName":"benchmark_logs_idx","cursorOptions":{"batchSize":11,"requiresSearchSequenceToken":false},"lucene":{"totalSegments":28,"totalDocs":100000000}},"resourceUsage":{"majorFaults":1,"minorFaults":176,"userTimeMs":10,"systemTimeMs":0,"maxReportingThreads":1,"numBatches":1}},"mongotDocsRequested":10,"requiresSearchMetaCursor":false,"internalMongotBatchSizeHistory":[11]},"nReturned":10,"executionTimeMillisEstimate":0},{"$_internalSearchIdLookup":{"limit":10,"subPipeline":[{"$match":{"_id":{"$eq":"_id placeholder"}}}],"totalDocsExamined":10,"totalKeysExamined":10,"numDocsFilteredByIdLookup":0},"nReturned":10,"executionTimeMillisEstimate":1},{"$limit":10,"nReturned":10,"executionTimeMillisEstimate":1},{"$project":{"_id":true,"message":true,"country":true,"severity":true,"timestamp":true,"metadata":true,"rank":{"$meta":"searchScore"}},"nReturned":10,"executionTimeMillisEstimate":1}],"queryShapeHash":"6D0DB1E1511DBF0A7AB3E604CA873DCC75454A33B79F2669445ECF20A5452721","serverInfo":{"host":"mongo","port":27017,"version":"8.2.5","gitVersion":"a471a13094434666c48a1f75451f2efa49f8f5df"},"serverParameters":{"internalQueryFacetBufferSizeBytes":104857600,"internalQueryFacetMaxOutputDocSizeBytes":104857600,"internalLookupStageIntermediateDocumentMaxSizeBytes":104857600,"internalDocumentSourceGroupMaxMemoryBytes":104857600,"internalQueryMaxBlockingSortMemoryUsageBytes":104857600,"internalQueryProhibitBlockingMergeOnMongoS":0,"internalQueryMaxAddToSetBytes":104857600,"internalDocumentSourceSetWindowFieldsMaxMemoryBytes":104857600,"internalQueryFrameworkControl":"trySbeRestricted","internal
QueryPlannerIgnoreIndexWithCollationForRegex":1},"command":{"aggregate":"benchmark_logs","pipeline":[{"$search":{"index":"benchmark_logs_idx","compound":{"must":[{"text":{"query":"research team","path":"message"}}],"filter":[{"range":{"path":"severity","lt":3}}]},"sort":{"score":{"$meta":"searchScore"}}}},{"$limit":10},{"$project":{"message":1,"country":1,"severity":1,"timestamp":1,"metadata":1,"rank":{"$meta":"searchScore"}}}],"cursor":{},"$db":"test"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1771531370,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"00"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1771531370,"i":1}}}
The search runs against a 100‑million‑document search index split into 28 Lucene segments and requests only the top 10 results:
"indexName": "benchmark_logs_idx",
"lucene": {
  "totalDocs": 100000000,
  "totalSegments": 28
},
"mongotDocsRequested": 10
High‑frequency terms ("research" and "team") cause the query to iterate over ~40 million documents, about 40% of the index:
"invocationCounts": {
  "nextDoc": 40000028
}
Evaluating the boolean query and the severity < 3 filter over those candidates accounts for ~14.5 seconds of aggregated Lucene work:
"match": {
  "millisElapsed": 14486.312741,
  "invocationCounts": {
    "nextDoc": 40000028
  }
}
Approximately 40 million documents are scored, resulting in ~6 seconds of aggregated scoring work:
"score": {
  "millisElapsed": 5997.668304,
  "invocationCounts": {
    "score": 40000000
  }
}
Selecting the top 10 results requires ~40 million score comparisons, taking ~2.8 seconds of comparator work:
"collectors": {
  "sort": {
    "stats": {
      "comparator": {
        "millisElapsed": 2841.984556,
        "invocationCounts": {
          "compareBottom": 39999989
        }
      }
    }
  }
}
Explain shows that Boolean matching and scoring alone account for roughly 20 seconds of total Lucene CPU work across all segments:
"match": { "millisElapsed": 14486.312741 },
"score": { "millisElapsed": 5997.668304 }
Because the work is parallelized across 28 Lucene segments, the ~20 seconds of aggregated work completes in ~2 seconds of real latency:
"lucene": {
  "totalSegments": 28
}
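A rough sanity check of the speedup (plain arithmetic; the plan does not break out how the match, score, and collect phases overlap, so this is only an estimate):

```javascript
// Estimated effective parallelism: aggregated Lucene work vs. wall-clock time.
const matchMs = 14486.3; // aggregated match time from explain
const scoreMs = 5997.7;  // aggregated scoring time from explain
const wallMs = 1850;     // measured client-side execution time

const parallelism = (matchMs + scoreMs) / wallMs;
console.log(parallelism.toFixed(1)); // ≈ 11.1 concurrent segment searchers
```

With 28 segments available, an effective parallelism around 11x is plausible given thread scheduling and the serial portions of the query.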
After the search completes, MongoDB fetches only 10 documents by _id, examining 10 keys and documents in ~1 ms:
"$_internalSearchIdLookup": {
  "totalDocsExamined": 10,
  "totalKeysExamined": 10
}
The execution plan shows that the query scans and scores ~40% of a 100‑million‑document index, performs ~20 seconds of parallelized Lucene work, and still returns the top 10 results in ~2 seconds of wall‑clock time.
This is very similar to the execution plan with ParadeDB (except for the time, but this runs on different servers with different parallelism).
Conclusion
Modern document databases must meet two demanding requirements: high write throughput and fast ranked search over large datasets. ParadeDB and MongoDB solve this by building BM25 indexes on top of Tantivy and Apache Lucene, respectively—search engines designed for these workloads.
A BM25 index combines an inverted index (to find candidate documents) with columnar metadata (to compute relevance scores and apply filters). This lets the engine quickly assemble a large candidate set and then rank it without loading full documents.
To support real‑time updates, both Tantivy and Lucene use a Log‑Structured Merge (LSM) design. Inserts and updates go to an in‑memory buffer rather than random in‑place writes. When the buffer fills or a statement completes, it is flushed to disk as an immutable segment.
Background compaction merges these segments into larger ones, resolves duplicates, and drops deleted documents. This turns random writes into mostly sequential I/O, sustaining high ingestion rates while keeping reads predictable.
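The segment life cycle described above can be sketched as a toy in-memory model (plain JavaScript; real Lucene and Tantivy segments are immutable on-disk files, with deletions tracked in per-segment bitsets rather than a global set):

```javascript
// Toy LSM-style index: writes buffer in memory, flushes become immutable
// segments, and a background merge compacts segments and drops deletes.
class ToyLsmIndex {
  constructor(bufferLimit = 3) {
    this.buffer = [];          // in-memory write buffer
    this.segments = [];        // immutable flushed segments
    this.deleted = new Set();  // logically deleted doc ids
    this.bufferLimit = bufferLimit;
  }
  insert(doc) {
    this.buffer.push(doc);
    if (this.buffer.length >= this.bufferLimit) this.flush();
  }
  delete(id) { this.deleted.add(id); } // mark only; no in-place write
  flush() {
    if (this.buffer.length) { this.segments.push(this.buffer); this.buffer = []; }
  }
  merge() {
    // Compact all segments into one, physically dropping deleted docs.
    const live = this.segments.flat().filter(d => !this.deleted.has(d.id));
    this.segments = [live];
    this.deleted.clear();
  }
}

const idx = new ToyLsmIndex();
for (let id = 1; id <= 7; id++) idx.insert({ id });
idx.delete(2);
idx.flush();
console.log(idx.segments.length); // 3 segments before compaction
idx.merge();
console.log(idx.segments.length, idx.segments[0].length); // 1 segment, 6 live docs
```

Every write is an append (to the buffer or to a new segment), which is what turns random updates into the mostly sequential I/O mentioned above.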
The hard part is Top‑K queries, where results must be ordered by relevance (e.g., BM25). Scoring every matching document is infeasible when queries hit tens of millions of documents.
ParadeDB, via Tantivy, uses Block WAND to avoid this. Documents are grouped into blocks, each with an upper bound on the score any document in that block can reach. As the Top‑K heap fills, the engine sets a minimum score threshold. If a block’s maximum cannot beat that threshold, the entire block is skipped.
Lucene implements the same idea with block‑max WAND and related dynamic pruning. This pruning is driven by a dynamically increasing minimum competitive score: once the Top‑K heap fills, Lucene skips entire blocks of documents whose maximum possible score cannot exceed the current threshold. Atlas Search doesn’t expose this by name, but it is applied whenever you sort by searchScore and limit results. Execution plans that show huge candidate sets but relatively small Top‑K collector times reveal this pruning. For queries dominated by required clauses (MUST/FILTER), Lucene also applies block‑max pruning at the conjunction level, further reducing unnecessary scoring.
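Block skipping can be illustrated with a toy model (simplified; real BMW operates on postings blocks with precomputed per-block score upper bounds and combines bounds across query terms):

```javascript
// Toy block-max pruning: skip whole blocks whose precomputed maximum
// score cannot beat the current minimum competitive score.
function topKWithBlockMax(blocks, k) {
  const top = []; // best scores, kept in descending order
  let scored = 0, skipped = 0;
  const threshold = () => (top.length < k ? -Infinity : top[top.length - 1]);
  for (const block of blocks) {
    if (block.maxScore <= threshold()) { // block upper bound can't compete:
      skipped += block.scores.length;    // skip every document in it
      continue;
    }
    for (const s of block.scores) {
      scored++;
      if (s > threshold()) {
        top.push(s);
        top.sort((a, b) => b - a);
        if (top.length > k) top.pop();
      }
    }
  }
  return { top, scored, skipped };
}

const blocks = [
  { maxScore: 0.9, scores: [0.9, 0.4, 0.2] },
  { maxScore: 0.5, scores: [0.5, 0.5, 0.3] },
  { maxScore: 0.3, scores: [0.3, 0.2, 0.1] }, // entire block pruned
];
const r = topKWithBlockMax(blocks, 2);
console.log(r.top);     // [0.9, 0.5]
console.log(r.skipped); // 3 documents never scored
```

At 100‑million‑document scale, skipping whole blocks this way is what keeps the comparator and scoring counts in the explain output from growing even larger.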
Conceptually, ParadeDB and MongoDB share the same core techniques—immutable segments, LSM‑style merges, and WAND‑style pruning—to make Top‑K queries efficient. They mainly differ in how visible these mechanics are.
ParadeDB surfaces them via PostgreSQL EXPLAIN, where you can see TopNScanExecState, worker counts, and segment details. MongoDB hides most internals behind Atlas Search: users write $search, $sort, and $limit, and Lucene handles pruning and parallelism.
Update visibility is another key difference. ParadeDB aligns index updates closely with PostgreSQL transactions. MongoDB Atlas Search decouples the search engine from the primary database process, optimizing more aggressively for ingestion and query throughput. This creates a trade‑off: MongoDB Search offers near‑real‑time indexing with very high performance and scalability, but with eventual consistency between writes and search visibility. The delay is usually small but is a deliberate choice in favor of throughput and parallelism.
Both ParadeDB and MongoDB show that BM25‑based search can scale to tens of millions of candidates while still returning Top‑K results in milliseconds or seconds. They rely on LSM‑style indexing and WAND‑style pruning, differing mainly in how these optimizations are exposed and how they balance transactional immediacy against performance. In both systems, the search engine does massive work efficiently—and mostly invisibly—to the user.