Prithvi S

Posted on Jun 13 • Edited on Jul 4

OpenSearch Distributed Scoring: How Results Merge Across 100+ Shards

#lucene #search #java #opensource

When a search query runs against an index with 100 shards, each shard independently scores documents using its own local term statistics. The coordinating node then merges these shard-local results into a global ranking. This merge process is not a simple concatenation - it is a statistical reconciliation problem that determines whether the user sees the most relevant documents or a skewed sample. Understanding how OpenSearch handles distributed scoring is essential for any cluster that has grown beyond a handful of shards.

The Core Problem: Local Statistics vs Global Relevance

Lucene scoring relies on term statistics, specifically inverse document frequency (IDF) and term frequency (TF). In a single-segment index, these statistics are computed globally across all documents. In a distributed index, each shard maintains its own statistics based only on the documents it contains. A term that appears in 100 documents on a dense shard and 2 documents on a sparse shard will have different IDF values on each shard, causing the same query to produce different scores for identical documents depending on which shard holds them.

This is not a bug. It is a fundamental trade-off between search accuracy and query performance. Computing global statistics for every query would require broadcasting term frequencies to all shards before scoring, which adds a round-trip latency and increases network load. OpenSearch defaults to local statistics for speed, but provides an alternative when accuracy matters more.

Document Frequency Skew: The Most Common Symptom

The symptom of distributed scoring problems is usually obvious: a query returns different top results after reindexing, after adding nodes, or after force-merging. All of these operations change shard boundaries, which changes local document frequencies. If a customer reports that "search quality degraded after the cluster resize," the root cause is almost always a shift in term statistics across shards.

Consider a concrete example. An e-commerce index has 100 million products split across 10 shards. The query laptop appears in 50,000 products globally. On a balanced cluster, each shard has roughly 5,000 laptops, and the IDF is consistent. But if one shard is hot because it contains a category with more laptops, that shard might have 15,000 laptop documents while another has only 500. On the hot shard, laptop is a common term with low IDF. On the sparse shard, laptop is rare with high IDF. A document on the sparse shard that mentions laptop once gets a higher score than a document on the hot shard that mentions it five times, even though the latter is more relevant.

Query Then Fetch: The Default Scoring Model

The default search type, query_then_fetch, executes the query phase on each shard using local statistics. The coordinating node receives top-N results from each shard, merges them using a global heap, and then requests the fetch phase from the winning shards. This is the fastest mode because it requires only two round-trips: query phase and fetch phase.

How the Coordinating Node Merges Shard Results

The coordinator maintains a global TopDocs priority queue with capacity equal to from + size. As each shard returns its local top results, the coordinator inserts them into the global queue. The queue evicts the worst-scoring document when it exceeds capacity. After all shards have responded, the coordinator has the globally best from + size documents, and the fetch phase retrieves only those documents from the relevant shards.

The critical detail is that the coordinator does not re-score documents during the merge. It trusts the shard-level scores. This means the final ranking is a blend of locally-optimal scores, not globally-optimal scores. For balanced shards and common queries, the difference is negligible. For skewed shards or rare terms, the ranking can be visibly wrong.

Why the Coordinator Does Not Re-Score

Re-scoring would require the coordinator to have the term statistics for every shard. It would need to know, for every term in the query, how many documents contain that term on each shard. Then it could re-compute IDF using global frequencies and re-score each candidate document. But this is computationally expensive for the coordinator and requires storing all term statistics in memory. For a query with 10 terms across 100 shards, the coordinator would need 1,000 term frequencies to perform a global re-score. OpenSearch avoids this complexity by default.

DFS Query Then Fetch: The Accuracy-First Alternative

dfs_query_then_fetch adds a preliminary Distributed Frequency Search (DFS) phase before the query phase. In this phase, the coordinator broadcasts the query to all shards and collects two pieces of information per term: the total document frequency (how many documents contain the term) and the total term frequency (how many times the term appears in all documents). The coordinator then sends these global statistics back to all shards, which use them for scoring during the query phase.

The Three-Phase Round-Trip

DFS Phase - The coordinator sends the query to every shard. Each shard computes local statistics for each query term and returns them. The coordinator aggregates these into global statistics.
Query Phase - The coordinator sends the global statistics back to all shards. Each shard re-executes the query using global IDF and global TF, producing locally-scored but globally-consistent results. The coordinator merges these into a global top-N as before.
Fetch Phase - Identical to query_then_fetch. The coordinator requests full documents from the winning shards.

The DFS phase adds one full network round-trip. For a cluster with 50 shards and a query with 5 terms, the DFS phase sends 50 requests and receives 50 responses before the query phase can begin. The query phase then sends another 50 requests. This doubles the network overhead and increases the minimum query latency by roughly one network round-trip time (typically 1-5ms within a datacenter, but higher for cross-region clusters).

When DFS Is Worth the Cost

DFS is worth using when:

Shards are significantly unbalanced (more than 2x difference in document count).
Queries contain rare terms that appear on only a few shards, where local IDF would be wildly inaccurate.
The application requires score consistency across reindexing, node addition, or shard relocation.
The result set is small (top 10-100) and the scoring accuracy directly impacts business outcomes like ad placement or product ranking.

DFS is not worth using when:

Shards are balanced (within 20% of each other in document count).
Queries contain common terms where local IDF is already close to global IDF.
Latency is critical and sub-50ms responses are required.
The application uses constant_score or script-based scoring where IDF does not apply.

Shard Allocation and Rebalancing: How They Affect Scores

Shard allocation decisions directly impact scoring because they determine which documents share statistics. When a new node joins the cluster, OpenSearch may relocate shards to balance disk usage. This changes the document distribution on each shard and thus changes the local term frequencies. A query that returned result A before rebalancing might return result B after rebalancing, even though the documents themselves did not change.

The Force Merge Trap

Force-merging an index to a single segment per shard is a common optimization for read-heavy workloads. It improves query performance by reducing segment count and eliminating the need for multi-segment merging. But it also changes the term statistics if the merge reorders documents. Lucene's merge policy is deterministic, but the document order after merge depends on the segment merge order, which can change if segments were created at different times. After force-merging, users sometimes report "relevance changed" because the document order in the merged segment altered which documents are retrieved for the same query, even though the scores themselves are theoretically unchanged.

Shard Count Changes and Reindexing

Reindexing from 5 shards to 50 shards is the most disruptive operation for scoring. The document routing hash changes, so every document lands on a different shard. The local statistics are completely recalculated. Queries that relied on the old shard distribution will produce different results. This is why applications that depend on score stability should use dfs_query_then_fetch or design scoring that is less sensitive to term statistics.

Practical Tuning: Balancing Accuracy and Performance

Setting `search_type` Per Request

search_type is a per-request parameter, not a cluster-wide setting. This allows applications to use query_then_fetch for fast, approximate queries and dfs_query_then_fetch for critical ranking queries. For example, an e-commerce site might use query_then_fetch for category browsing and dfs_query_then_fetch for search-as-you-type suggestions where the result set is small and accuracy matters.

The `preference` Parameter: Routing to Specific Shards

The preference parameter allows a client to request that the query execute on a specific set of shards or replicas. If preference is set to a session ID or a custom string, OpenSearch consistently routes the same client to the same shards (or replicas). This does not fix the statistical problem, but it makes the problem consistent - the same query from the same client will always hit the same shards and see the same local statistics, reducing the perception of scoring instability.

For applications where score consistency is more important than score accuracy (e.g., A/B testing where the same user should see the same ranking across sessions), preference is a pragmatic solution. The user sees consistent results, even if those results are not globally optimal.

`min_score` and `terminate_after`: Approximate Scoring

For applications that need fast approximate results, min_score and terminate_after can be used to skip low-scoring documents early. min_score tells each shard to skip documents that score below a threshold. terminate_after tells each shard to stop after a fixed number of matches. Both are incompatible with DFS because DFS requires the shard to evaluate all matches to compute global statistics. If you use terminate_after, the global statistics will be incomplete, and DFS will produce inaccurate results. Use these parameters only with query_then_fetch.

How Aggregations Interact with Distributed Scoring

Aggregations are computed during the query phase, before the coordinating node merges results. Each shard computes its own aggregation buckets and sends them to the coordinator. The coordinator then merges buckets by key and sums the counts. This means aggregation results are globally accurate for counts and sums, but not for metrics that depend on document scoring (like the avg aggregation on a score field, or the percentiles aggregation).

The `shard_size` Parameter for Terms Aggregations

For terms aggregations, the shard_size parameter controls how many buckets each shard returns to the coordinator. The default is shard_size = size * 1.5 + 10, meaning each shard returns 1.5x the number of buckets requested. This is necessary because a term that is rare globally might be common on one shard. If the shard only returned the top 10 buckets, it might omit a globally significant term that happens to be the 11th most common on that shard. Increasing shard_size improves accuracy but increases the coordinator's merge work and memory usage. For high-cardinality fields with 1000+ buckets, shard_size can be a significant tuning parameter.

Sub-aggregations and the Fetch Phase

Sub-aggregations (aggregations inside other aggregations) are computed at the shard level during the query phase. The coordinator receives the full aggregation tree from each shard and merges it recursively. This is computationally expensive for deeply nested aggregations. The depth parameter controls how many levels of sub-aggregations are computed, but the default is unlimited. For queries with 5+ levels of sub-aggregations, the query phase can dominate the total query time, and the fetch phase becomes irrelevant because the client usually only needs the aggregation results, not the documents.

Common Pitfalls and Edge Cases

The `script_score` Query and Distributed Scoring

When using script_score to compute custom scores, the script runs on each shard with access to the shard's local document and term statistics. A script that uses doc.freq or term.tf will see different values on different shards. If the script logic is sensitive to these values, the results will be inconsistent across shards. To ensure global consistency with script_score, either use dfs_query_then_fetch or design the script to use only document-level fields that are independent of shard statistics (like doc.value or doc.count).

The `constant_score` Query: The Nuclear Option

constant_score wraps a query and assigns a fixed score to all matching documents. This completely eliminates the IDF/TF dependency, making scores 100% consistent across shards regardless of document distribution. The trade-off is that all matches receive the same score, so ranking is determined entirely by sort order, filters, or secondary criteria. For applications where boolean relevance is sufficient (e.g., "find all documents tagged with X"), constant_score is the simplest way to avoid distributed scoring problems entirely.

Rescoring at the Coordinator Level

The rescore API allows the coordinator to re-rank the top-N results from the query phase using a more expensive scoring model. This is a hybrid approach: the query phase uses fast, local statistics to find the candidate pool, and the rescore phase uses a slower, more accurate model to rank the final results. Because rescore only operates on the top-N results (typically 100-1000), the expensive scoring is applied to a small fraction of the total matches. This is useful for applications that need a mix of speed and accuracy - the query phase is fast, and the rescore phase ensures the visible results are well-ranked.

Conclusion

Distributed scoring in OpenSearch is a statistical approximation problem. The default query_then_fetch mode optimizes for speed by using local term statistics, but it sacrifices score consistency across shards. The dfs_query_then_fetch mode restores global consistency at the cost of an additional network round-trip. Neither mode is universally better - the choice depends on shard balance, query characteristics, and the application's sensitivity to score stability.

Production operators should monitor shard balance and query latency as a pair. A cluster with balanced shards can safely use query_then_fetch for most queries. A cluster with skewed shards or high-value ranking use cases should evaluate dfs_query_then_fetch for critical queries. For applications where score consistency is paramount, consider constant_score queries, preference-based routing, or rescoring at the coordinator level. The key is understanding that scoring is not a deterministic global property in a distributed system - it is a local computation that is reconciled at the coordinator, and the reconciliation strategy is a tunable parameter that should match the business requirements.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.

DEV Community

OpenSearch Distributed Scoring: How Results Merge Across 100+ Shards

The Core Problem: Local Statistics vs Global Relevance

Document Frequency Skew: The Most Common Symptom

Query Then Fetch: The Default Scoring Model