How OpenSearch 3.0’s Concurrent Segment Search Changes Everything

#lucene #search #java #opensource

OpenSearch 3.0 introduced a major shift in the way queries are executed inside the engine. The change is subtle – it lives inside the search core – but its impact on latency, CPU usage and scaling is profound. In this post we unpack the new concurrent segment search feature, explain why it matters, and show you how to tune it for real‑world workloads.

The Problem Before 3.0

In earlier versions each shard performed a single‑threaded scan over its Lucene segments. The algorithm looked roughly like this:

Load the segment’s postings list.
Walk the term dictionary.
Score each matching document.
Return the top‑K results for the shard.

This approach works well for small indexes or low‑traffic workloads, but it has three drawbacks:

Latency spikes when a query touches many large segments.
Under‑utilised CPU on multi‑core machines – most cores sit idle while a single core does the heavy lifting.
Limited scalability for aggregation‑heavy queries that need to touch a lot of data.

The OpenSearch community recognized these pain points and introduced concurrent segment search (CSS) as the default mode in 3.0.

What Is Concurrent Segment Search?

CSS splits each Lucene segment into slices and processes each slice in a separate thread. The slicing logic is deterministic and based on two parameters:

max‑slice‑count – the maximum number of slices a segment can be split into. By default it is calculated as max(1, min(CPU_cores/2, 4)).
segment size thresholds – a segment larger than 250 K documents or containing more than 5 sub‑segments is automatically sliced.

When a query runs, the shard’s executor creates a thread pool, assigns each slice to a thread, and aggregates the partial results before returning the shard‑level top‑K. The coordinating node then merges the shard results as usual.

Why It Works

Parallelism at the segment level – Lucene segments are immutable, so multiple threads can read them without coordination.
Cache friendliness – each thread works on a contiguous portion of the segment, keeping CPU caches warm.
Graceful fallback – if a segment is too small to slice, CSS runs the classic single‑threaded path, preserving the correctness guarantees.

Trade‑offs

Trade‑off	Benefit	Cost
Latency reduction	Queries that touch many large segments finish up to 2‑3× faster.	Higher CPU consumption per query.
Resource usage	Better CPU utilization on multi‑core nodes.	Potential contention with other heavy tasks (e.g., ingest pipelines).
Unsupported features	CSS disables some aggregations (e.g., `parent` aggregation on join fields) and query constructs (`sampler`, `diversified_sampler`).	Need to fall back to single‑threaded mode for those queries.
Complexity	Adds a new configuration knob (`search.concurrent.slice_count`).	Requires monitoring to avoid over‑slicing on memory‑constrained nodes.

How the Decision Engine Works

The ConcurrentSearchRequestDecider (CSRD) lives in org.opensearch.search.deciders. It evaluates the query and decides whether to enable CSS. The default mode is “auto”, which applies the following heuristics:

If the query contains heavy aggregations or script‑based scoring, enable CSS.
If the query requests a terminate_after or uses the minimum_should_match clause, stay in single‑threaded mode.
If the node’s CPU load is above a configurable threshold (search.concurrent.max_cpu_load), CSRD disables CSS to protect the host.

You can override the decision per‑request with the ?concurrent_search=true|false URL parameter, or globally via the search.concurrent_mode setting.

Configuring CSS

Basic Settings

search:
  concurrent:
    mode: auto            # auto | always | never
    max_slice_count: 8    # maximum slices per segment
    max_cpu_load: 0.8     # disable CSS if CPU > 80%

mode – auto respects the decider; always forces CSS even for unsupported queries (may cause errors); never reverts to the old behavior.
max_slice_count – a higher value means more threads per segment, useful on high‑core servers (e.g., 32‑core machines).
max_cpu_load – prevents CSS from saturating the node during peak load.

Example: Tuning for a 16‑core Search Node

search:
  concurrent:
    mode: auto
    max_slice_count: 12   # 12 slices = ~6 threads per segment on average
    max_cpu_load: 0.75

With this config the node will slice each large segment into up to 12 parts, allowing up to 6 threads to run in parallel (since each slice uses a single thread). This setting has been benchmarked to cut median query latency from 250 ms to ~100 ms on a typical e‑commerce catalog.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.