Image credit: Unsplash
OpenSearch 3.0 introduced a major shift in the way queries are executed inside the engine. The change is subtle – it lives inside the search core – but its impact on latency, CPU usage and scaling is profound. In this post we unpack the new concurrent segment search feature, explain why it matters, and show you how to tune it for real‑world workloads.
The Problem Before 3.0
In earlier versions each shard performed a single‑threaded scan over its Lucene segments. The algorithm looked roughly like this:
- Load the segment’s postings list.
- Walk the term dictionary.
- Score each matching document.
- Return the top‑K results for the shard.
This approach works well for small indexes or low‑traffic workloads, but it has three drawbacks:
- Latency spikes when a query touches many large segments.
- Under‑utilised CPU on multi‑core machines – most cores sit idle while a single core does the heavy lifting.
- Limited scalability for aggregation‑heavy queries that need to touch a lot of data.
The OpenSearch community recognized these pain points and introduced concurrent segment search (CSS) as the default mode in 3.0.
What Is Concurrent Segment Search?
CSS splits each Lucene segment into slices and processes each slice in a separate thread. The slicing logic is deterministic and based on two parameters:
-
max‑slice‑count – the maximum number of slices a segment can be split into. By default it is calculated as
max(1, min(CPU_cores/2, 4)). - segment size thresholds – a segment larger than 250 K documents or containing more than 5 sub‑segments is automatically sliced.
When a query runs, the shard’s executor creates a thread pool, assigns each slice to a thread, and aggregates the partial results before returning the shard‑level top‑K. The coordinating node then merges the shard results as usual.
Why It Works
- Parallelism at the segment level – Lucene segments are immutable, so multiple threads can read them without coordination.
- Cache friendliness – each thread works on a contiguous portion of the segment, keeping CPU caches warm.
- Graceful fallback – if a segment is too small to slice, CSS runs the classic single‑threaded path, preserving the correctness guarantees.
Trade‑offs
| Trade‑off | Benefit | Cost |
|---|---|---|
| Latency reduction | Queries that touch many large segments finish up to 2‑3× faster. | Higher CPU consumption per query. |
| Resource usage | Better CPU utilization on multi‑core nodes. | Potential contention with other heavy tasks (e.g., ingest pipelines). |
| Unsupported features | CSS disables some aggregations (e.g., parent aggregation on join fields) and query constructs (sampler, diversified_sampler). |
Need to fall back to single‑threaded mode for those queries. |
| Complexity | Adds a new configuration knob (search.concurrent.slice_count). |
Requires monitoring to avoid over‑slicing on memory‑constrained nodes. |
How the Decision Engine Works
The ConcurrentSearchRequestDecider (CSRD) lives in org.opensearch.search.deciders. It evaluates the query and decides whether to enable CSS. The default mode is “auto”, which applies the following heuristics:
- If the query contains heavy aggregations or script‑based scoring, enable CSS.
- If the query requests a
terminate_afteror uses theminimum_should_matchclause, stay in single‑threaded mode. - If the node’s CPU load is above a configurable threshold (
search.concurrent.max_cpu_load), CSRD disables CSS to protect the host.
You can override the decision per‑request with the ?concurrent_search=true|false URL parameter, or globally via the search.concurrent_mode setting.
Configuring CSS
Basic Settings
search:
concurrent:
mode: auto # auto | always | never
max_slice_count: 8 # maximum slices per segment
max_cpu_load: 0.8 # disable CSS if CPU > 80%
-
mode –
autorespects the decider;alwaysforces CSS even for unsupported queries (may cause errors);neverreverts to the old behavior. - max_slice_count – a higher value means more threads per segment, useful on high‑core servers (e.g., 32‑core machines).
- max_cpu_load – prevents CSS from saturating the node during peak load.
Example: Tuning for a 16‑core Search Node
search:
concurrent:
mode: auto
max_slice_count: 12 # 12 slices = ~6 threads per segment on average
max_cpu_load: 0.75
With this config the node will slice each large segment into up to 12 parts, allowing up to 6 threads to run in parallel (since each slice uses a single thread). This setting has been benchmarked to cut median query latency from 250 ms to ~100 ms on a typical e‑commerce catalog.
Real‑World Benchmarks
Below is a simplified benchmark run on a
About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.
Top comments (0)