Prithvi S

Posted on Jun 17 • Edited on Jul 4

Distributed Scoring in OpenSearch: Why Your Results Aren't Perfectly Ranked

#opensearch #search #database #data

Understanding the trade-off between speed and accuracy in distributed search and the one query parameter that fixes it.

You index a million documents, run a search, and the top result looks right. You scale to a billion documents across fifty shards, run the same search, and suddenly the ranking feels off. Not broken just subtly wrong. The best result is third. A mediocre result is first. You re-run the query and get a different order.

Welcome to distributed scoring in OpenSearch.

This is not a bug. It is a fundamental property of how OpenSearch and any distributed search engine built on Lucene computes relevance. The good news: it is fixable. The bad news: the fix costs you a round-trip. In this post, I will explain why scores diverge across shards, when it matters, and how to decide whether to pay the latency tax for perfect ranking.

The Assumption That Breaks Everything

Most engineers assume search ranking works like this:

Collect every document that matches the query
Score all of them using global statistics
Return the top N in exact rank order

That is how it works in a single-node database. OpenSearch does not work that way. OpenSearch is distributed by design an index is split into shards, shards live on different nodes, and each shard is an independent Lucene index with its own local statistics.

When you send a query, the coordinating node does not gather every matching document from every shard and score them centrally. That would be catastrophically slow at scale. Instead, OpenSearch uses a scatter-gather pattern with a critical optimization: scores are computed locally on each shard, using local term statistics.

How Query Execution Actually Works

Here is what happens when you run a match query against a multi-shard index:

Phase 1: Query Phase (Scatter)

The coordinating node sends the query to every shard that holds data for the index. Each shard executes the query independently against its own Lucene segments. For each document, it computes a relevance score using BM25 or your chosen similarity function.

BM25 depends on two key statistics:

Term Frequency (TF): How often the term appears in this document
Inverse Document Frequency (IDF): How rare the term is across the entire collection

Here is the problem: each shard only knows about its own documents. It computes IDF using local document frequencies, not global ones. A term that appears in 1% of documents globally might appear in 5% of documents on one shard and 0.1% on another. The shard with the 5% local frequency will score matching documents lower. The shard with the 0.1% frequency will score them higher.

The result? Two identical documents on different shards can receive different scores for the same query.

Phase 2: Merge and Fetch (Gather)

Each shard returns its top-K results (document IDs + scores) to the coordinating node. The coordinating node merges these shard-local rankings into a single global ranking. But it is merging rankings that were computed with inconsistent scoring baselines.

Finally, if the client requested full documents (_source), the coordinating node issues a fetch phase to retrieve the actual document bodies from the relevant shards.

The default search type QUERY_THEN_FETCH optimizes for speed. It assumes that for most use cases, "close enough" ranking is acceptable. And for many use cases, it is.

When Local Scoring Bites You

The ranking error is usually small and random. But there are conditions where it becomes material:

1. Skewed Shard Distributions

If your data is not uniformly distributed for example, time-based indices where recent shards have different term distributions than old shards local IDF diverges significantly. A query for "Kubernetes" on a shard full of 2024 DevOps logs will score differently than on a shard of 2019 monolith logs.

2. Small Shards with Rare Terms

With many small shards, rare terms have volatile local document frequencies. A term that appears in exactly one document globally might appear in zero documents on most shards and one document on a single shard. That lucky document gets an enormous IDF boost and shoots to the top.

3. High-Precision Ranking Requirements

E-commerce search, legal discovery, and academic search often need deterministic, reproducible ranking. If you are A/B testing ranking algorithms or comparing results across replicas, non-deterministic shard-local scoring introduces noise that masks real signal.

4. Routing Keys and Custom Distributions

If you use custom routing (e.g., routing all documents for tenant X to the same shard), you create isolated term-frequency bubbles. A term common in tenant X's documents is rare globally but frequent locally and local scoring does not know the difference.

The Fix: DFS_QUERY_THEN_FETCH

OpenSearch provides an alternative search type: DFS_QUERY_THEN_FETCH. The "DFS" stands for Distributed Frequency Search a pre-query phase that gathers global term statistics before executing the actual search.

Here is the flow:

DFS Phase: The coordinating node sends a lightweight request to all shards. Each shard returns its local term frequencies and document frequencies for the query terms. The coordinating node aggregates these into global statistics.
Query Phase: The coordinating node re-sends the query to all shards, this time accompanied by the global term statistics. Every shard now scores documents using the same IDF values.
Fetch Phase: Identical to the default retrieve full documents for the globally-merged top results.

The result: perfectly consistent scoring across shards. Identical documents on different shards receive identical scores. Ranking is deterministic. The A/B test noise disappears.

The Cost: One Extra Round-Trip

DFS_QUERY_THEN_FETCH adds a network round-trip before the query phase. For simple term queries on small clusters, this overhead is negligible a few milliseconds. For complex queries with many terms on large clusters, it can add tens of milliseconds.

More importantly, the DFS phase does not reduce total work. It redistributes it. The coordinating node must collect, merge, and forward statistics. If your cluster is already network-bound or the coordinating node is under CPU pressure, DFS can amplify the bottleneck.

Here is my practical guidance:

Scenario	Recommendation
Few large shards, uniform data, speed matters	Stick with QUERY_THEN_FETCH
Many small shards, skewed distributions, precision matters	Use DFS_QUERY_THEN_FETCH
Reproducible ranking required (A/B tests, audits)	Use DFS_QUERY_THEN_FETCH
Rare-term-heavy queries (specialized domains)	Use DFS_QUERY_THEN_FETCH
High-traffic, latency-sensitive, "good enough" ranking	Stick with QUERY_THEN_FETCH

How to Use It

In OpenSearch, you specify the search type at query time:

GET /my-index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "multi_match": {
      "query": "distributed scoring",
      "fields": ["title^3", "content"]
    }
  }
}

There is no cluster-wide default to change this. It is a per-query decision, which is the right design you pay the cost only when you need the precision.

A Deeper Realization: This Is Inherent to Distributed Search

The local-vs-global scoring problem is not an OpenSearch quirk. It is inherent to any distributed search system that shards data independently. Solr has the same trade-off. Elasticsearch had it before the fork. The only engines that avoid it entirely are those that do not shard at the indexing level or those that accept the latency cost of global scoring on every query.

What OpenSearch gives you is transparency and control. You can see the behavior, measure it, and opt into the fix when your use case demands it. That is the difference between a system that hides complexity and one that exposes it with clear knobs.

What I Have Learned Maintaining Search Plugins

As a maintainer of the OpenSearch dashboards-search-relevance plugin, I have spent a lot of time in the scatter-gather path. One thing that surprised me early on: the concurrent segment search improvements in OpenSearch 3.0 do not change this trade-off. They parallelize execution within a shard, but each shard still operates with local statistics unless you request DFS.

Another surprise: most ranking anomalies reported as "bugs" are actually local IDF effects. Before you debug your analyzer or your BM25 parameters, check whether dfs_query_then_fetch makes the anomaly disappear. If it does, you have found your culprit and your fix.

Summary

OpenSearch scores documents using local shard statistics, not global ones
This causes ranking inconsistencies that are usually small but can be significant with skewed data or rare terms
DFS_QUERY_THEN_FETCH adds a pre-query phase to gather global statistics, enabling consistent cross-shard scoring
The cost is one extra round-trip acceptable for precision-critical workloads, unnecessary for speed-critical ones
This is not a bug. It is a documented, controllable trade-off in distributed search design

The next time your search results feel subtly wrong, ask yourself: am I seeing the true global ranking or fifty local opinions merged together?

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.

DEV Community