Prithvi S

Posted on Jun 10

Concurrent Segment Search in OpenSearch 3.0: How Parallel Search Works Under the Hood

#opensearch #search #database #data

OpenSearch 3.0 ships with a quiet but powerful change: concurrent segment search is now enabled by default. If you have ever watched a single CPU core peg at 100% while the other fifteen sit idle during a heavy aggregation query, this feature is the fix you have been waiting for. It is also one of the most under-documented architectural shifts in the project, so let us walk through exactly how it works, when it helps, and where it still falls short.

The Problem: One Core, Many Segments

To understand why concurrent segment search matters, you first need to understand what a Lucene segment is. Every OpenSearch index is split into shards, and every shard is a Lucene index. A Lucene index is composed of segments - immutable files on disk that contain a subset of the documents in that shard. When you index a document, it lands in an in-memory buffer. Once the buffer is refreshed (by default, every second), a new segment is written. Segments are never modified in place; they are only merged in the background by a merge policy. This immutability is what makes Lucene fast and crash-safe, but it also means that a single shard can contain dozens or even hundreds of segments at any moment.

Before concurrent segment search, OpenSearch searched those segments sequentially on a single thread. Even if your node had 64 cores, the query phase for a single shard was bound to one core. For a simple term query over a few segments, this is fine. For a heavy aggregation spanning a hundred segments, it is a bottleneck. The CPU was the resource, and the search engine was not using it.

The Solution: Slicing Segments Across Cores

Concurrent segment search changes the execution model. Instead of walking segments one by one, OpenSearch divides the segments into slices and assigns each slice to a separate thread. The slices are searched in parallel, and their results are merged before being returned to the coordinating node. This is all handled inside the shard-local query phase, so the rest of the scatter-gather architecture stays the same.

The implementation relies on two mechanisms, both inherited from Lucene and adapted by OpenSearch:

1. Max-Slice-Count (the OpenSearch default)

In OpenSearch 3.0, the default slicing strategy is max-slice-count. The number of slices is calculated dynamically as:

max(1, min(CPU_cores / 2, 4))

On a 16-core machine, that gives you 4 slices. On an 8-core machine, it gives you 4 slices. On a 4-core machine, it gives you 2 slices. On a single-core machine, it gives you 1 slice - effectively falling back to the old behavior. The formula deliberately caps the slice count at 4 to avoid excessive thread contention, because spawning one thread per segment would drown the scheduler in context switches and cache thrashing.

2. Lucene Mechanism (the alternative)

The second strategy is the Lucene mechanism, which limits each slice to at most 250,000 documents or 5 segments. This is more aggressive and can create many more slices on large shards. It is available as a configuration option but is not the default in OpenSearch because it can oversubscribe the thread pool on nodes with many large shards.

You can configure which mechanism to use via the cluster setting search.concurrent_segment_search.mode or the index-level setting index.search.concurrent_segment_search.mode. The values are auto, all, or none. In 3.0, the default is auto.

Auto Mode: When OpenSearch Decides for You

The auto setting is the most interesting part of the design, because it is not just a boolean toggle. OpenSearch runs a set of pluggable deciders to determine whether a query should actually be executed concurrently. Even if the mode is auto, the engine can still fall back to sequential search for specific queries.

The built-in deciders check for:

Aggregation complexity: Some aggregations do not play well with parallel execution. Parent aggregations on join fields, for example, require global document ordering that is hard to maintain across slices.
Sampler and diversified_sampler aggregations: These require a global view of the document collection to produce statistically valid samples, so they disable concurrent search.
terminate_after: If a query sets terminate_after, it wants early termination after a fixed number of documents. Parallel slices would each terminate independently, breaking the semantics of the parameter.

In OpenSearch 2.17+, you can also write your own decider by implementing the ConcurrentSearchRequestDecider interface. This is a real extension point: you register a decider in a plugin, and it receives the SearchContext before the query phase begins. Your decider can inspect the query, the aggregations, the index settings, or even custom request headers, and then vote YES, NO, or SKIP (defer to the next decider). This is a subtle but powerful hook for teams that want to enforce their own concurrency policies.

For example, a team running mixed workloads might write a decider that disables concurrent search for any query hitting an index with fewer than 5 segments, because the thread-scheduling overhead would exceed the parallelism benefit. Another team might disable it for queries containing a specific custom script that is not thread-safe. The decider API makes this possible without forking the core engine.

The Trade-Off: Latency vs. CPU

Concurrent segment search is not free. The trade-off is straightforward: you trade CPU utilization for query latency. A query that took 800 milliseconds on one core might take 250 milliseconds on four cores. But those four cores are now occupied for 250 milliseconds each, which means 1,000 core-milliseconds of total CPU time versus 800 before. The overhead comes from thread creation, result merging, and cache-line bouncing between cores.

For most users, this is a good trade. Query latency is the metric that users feel, and CPU cores are a renewable resource that you are already paying for. But on heavily loaded clusters where CPU is the bottleneck, enabling concurrent search for every query can make things worse. The scheduler becomes congested, and other queries queue up.

This is why the auto mode and the deciders exist. OpenSearch tries to be conservative: it only parallelizes when the expected latency win is large enough to justify the CPU cost. If you are running on a small instance with burstable CPU credits, you might want to set the mode to none or implement a custom decider that disables concurrent search during peak hours.

How to Observe and Tune It

There are a few key settings and APIs you should know:

Setting	Default	Description
`search.concurrent_segment_search.mode`	`auto`	Cluster-wide default: `auto`, `all`, or `none`
`index.search.concurrent_segment_search.mode`	inherits cluster	Per-index override
`search.concurrent_segment_search.max_slice_count`	dynamic	Caps the number of slices when using max-slice-count
`search.concurrent_segment_search.min_slice_count`	2	Minimum slices needed to trigger concurrency

To see whether a specific query used concurrent search, you can look at the took time and the query profile, but the most direct signal is the node-level thread pool. The search thread pool will show more active threads during concurrent queries. You can also use the /_nodes/stats API to monitor CPU usage per node and correlate spikes with query latency.

If you are benchmarking, run the same query with ?search_type=dfs_query_then_fetch and without, then toggle the concurrent search mode. The latency improvement is most visible on:

Large date_histogram aggregations over many segments
terms aggregations with high cardinality
Range queries that touch a large portion of the index
Queries with complex boolean combinations that stress the scorer

It is least visible on:

Simple match queries with low size (the fetch phase dominates)
Queries that are already I/O bound (waiting on disk or network)
Very small shards with 1-2 segments (nothing to parallelize)

Limitations You Should Know

Concurrent segment search has real limitations that are not just edge cases. They are architectural constraints that you will hit in production if you are not aware of them.

First, as mentioned, parent aggregations on join fields do not work. The join field type in OpenSearch (and Elasticsearch) stores parent-child relationships in a single index. Parent aggregations need to see the full document set to build the relationship graph, so parallel slices would produce inconsistent or incomplete results. If you use parent-child joins, concurrent search is automatically disabled for those queries.

Second, sampler and diversified_sampler aggregations are unsupported. These aggregations intentionally sample a subset of documents to produce a representative bucket. If each slice samples independently, the global sample is biased toward the slice with the most documents. The statistical guarantees break down.

Third, terminate_after is ignored in concurrent mode. If you rely on terminate_after for early termination in large indices, you should either set the mode to none for those queries or rewrite the query to use a more efficient filter.

Fourth, custom scripts that are not thread-safe can break. Most Painless scripts are fine, but if you have a custom ScriptEngine or a ScriptPlugin that uses mutable shared state, concurrent search will race on that state. This is rare, but it is a real risk if your organization has custom scripting extensions.

Finally, the merge policy interacts with concurrency. A shard with many small segments creates more slices, which sounds good, but it also means more thread overhead and more result merging. If your merge policy is too lazy (for example, merge_factor is very high), you might end up with hundreds of tiny segments, and the concurrency win turns into a loss. Keep an eye on the segment count per shard via /_cat/segments.

Putting It Into Practice: A Decision Tree

Here is a simple decision tree you can use when thinking about concurrent segment search:

Is your shard count low and your segment count high? Yes - concurrency likely helps.
Are your queries aggregation-heavy? Yes - check whether the aggregation type is supported.
Is your cluster CPU-bound already? Yes - consider none or a custom decider.
Do you use terminate_after or parent-child joins? Yes - those queries will fall back to sequential automatically.
Are you on OpenSearch 3.0+? Yes - it is already on. You are already benefiting for supported queries.

If you are on an older version and want to experiment, you can enable the preview feature in 2.17+ by setting search.concurrent_segment_search.mode: auto and restarting the nodes. The implementation has been stable since 2.17, so the upgrade path to 3.0 is smooth.

The Bigger Picture: A Trend in Search Engine Design

Concurrent segment search is part of a broader trend in search engine design: moving from single-threaded per-shard execution to multi-threaded, task-based parallelism. Apache Lucene has been adding concurrency primitives for years, and OpenSearch is now surfacing them in a production-safe way. The auto mode with deciders is the right abstraction: it gives operators control without exposing the full complexity of thread scheduling to every user.

For engineers building on OpenSearch, this means two things. First, you should revisit your query latency benchmarks after upgrading to 3.0, because the numbers may have improved without any code changes. Second, you should audit your custom plugins and scripts for thread safety, because the default execution environment is no longer single-threaded per shard. A ScriptEngine that worked fine in 2.x might silently corrupt state in 3.0 if it uses mutable class fields.

Conclusion

Concurrent segment search is one of those features that is easy to miss in a release notes bullet, but it changes the performance profile of the entire engine. By slicing Lucene segments across available CPU cores, OpenSearch 3.0 turns idle hardware into lower query latency. The auto mode and pluggable deciders keep the system safe by default, while still giving advanced users the hooks to customize behavior. The limitations are real and well-defined, so you should know them before you tune, but for most workloads, the upgrade to 3.0 is a free performance win.

If you are running OpenSearch at scale, take the time to profile a few of your heaviest queries before and after the upgrade. The latency graphs will tell you whether your segments were the bottleneck - and whether your CPU cores were the solution all along.

I am Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. Follow my work on GitHub: https://github.com/iprithv

DEV Community