OpenSearch powers search at scale for many organizations, but raw relevance scores often need fine-tuning to match user expectations. The Search Relevance plugin gives engineers a structured way to evaluate relevance, run experiments, and improve results without writing custom code for every change. In this post we walk through a complete workflow, from defining a query set to running experiments, measuring metrics, and applying the insights to boost search quality.
1. Why Search Relevance Matters
Even the most powerful search engine can return results that feel irrelevant if the underlying scoring models are not aligned with the business problem. Users expect the most useful documents at the top, and a mismatch can increase bounce rates, reduce conversions, and erode trust. The Search Relevance plugin provides a repeatable process to measure relevance, experiment with configurations, and iterate based on data‑driven metrics.
2. Core Concepts
2.1 Query Sets
A query set is a collection of representative user queries that you want to evaluate. Each entry includes the query text, optional filters, and a unique identifier. Building a good query set is critical: it should cover the most common intents, edge cases, and any domain‑specific terminology.
2.2 Experiments
An experiment ties a query set to one or more search configurations. A configuration may adjust analyzers, boosting rules, or ranking functions. Experiments run the queries against each configuration and collect judgments for every result.
2.3 Judgments
Judgments capture the perceived relevance of a document for a given query. They can be explicit (human annotators rating relevance on a scale) or implicit (click‑through, dwell time). The plugin stores judgments in internal system indexes, making them available for metric computation.
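For illustration, a stored explicit judgment might look like the document below; the field names here are an assumption for this sketch rather than the plugin's fixed schema, and an implicit judgment would carry signals such as click counts instead of a rating.
{
  "query_id": "q1",
  "doc_id": "doc-1234",
  "rating": 3,
  "source": "explicit"
}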
3. Building a Search Quality Evaluation Pipeline
The following steps outline a practical pipeline that you can adapt to any OpenSearch cluster.
3.1 Create the Query Set
# Example query set document stored in the .search-relevance-query-set index
curl -X POST "http://localhost:9200/.search-relevance-query-set/_doc" \
-H 'Content-Type: application/json' -d '{
"name": "ecommerce‑top‑queries",
"queries": [
{"id": "q1", "query": "wireless headphones"},
{"id": "q2", "query": "best laptop for developers"},
{"id": "q3", "query": "budget travel insurance"}
]
}'
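To verify that the query set was stored, you can read it back with a standard search request against the same index:
curl -X GET "http://localhost:9200/.search-relevance-query-set/_search?q=name:ecommerce-top-queries&pretty"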
3.2 Define Search Configurations
Each configuration lives in index .search-relevance-config. You can adjust analyzers, boost fields, or enable function scoring.
curl -X POST "http://localhost:9200/.search-relevance-config/_doc" \
-H 'Content-Type: application/json' -d '{
"name": "baseline",
"settings": {"boost": 1.0},
"analyzer": "standard"
}'
Create additional configurations (for example, custom-boost-titles) to compare against the baseline; a sketch of such a config follows.
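The document below mirrors the baseline structure and raises the weight of the title field. The boost_fields key is an illustrative assumption; encode boosts however your configurations actually express them.
curl -X POST "http://localhost:9200/.search-relevance-config/_doc" \
-H 'Content-Type: application/json' -d '{
  "name": "custom-boost-titles",
  "settings": {"boost": 1.0, "boost_fields": {"title": 2.0}},
  "analyzer": "standard"
}'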
3.3 Launch the Experiment
curl -X POST "http://localhost:9200/.search-relevance-experiment/_doc" \
-H 'Content-Type: application/json' -d '{
"name": "Q1‑baseline‑vs‑custom",
"query_set": "ecommerce‑top‑queries",
"configs": ["baseline", "custom‑boost‑titles"],
"type": "PAIRWISE_COMPARISON"
}'
The experiment moves to the RUNNING state, and the plugin orchestrates query execution across the selected configurations.
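You can poll the experiment document to check its progress; this simply reads back the index written above:
curl -X GET "http://localhost:9200/.search-relevance-experiment/_search?q=name:Q1-baseline-vs-custom&pretty"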
4. Collecting Judgments
The Search Relevance Dashboards UI presents side-by-side results for each configuration. Evaluators assign a relevance rating (e.g., 0-3) to each document, and implicit signals such as click-through are recorded automatically.
Tip: Use a small group of domain experts for explicit judgments and supplement with implicit data from production traffic.
5. Computing Metrics
After the experiment completes, the plugin calculates common relevance metrics:
- nDCG (Normalized Discounted Cumulative Gain) – accounts for position bias and relevance grades.
- Precision@k – proportion of relevant results in the top k.
- Recall@k – coverage of relevant documents in the top k.
- MRR (Mean Reciprocal Rank) – average of the reciprocal rank of the first relevant result.
The results are stored in index .search-relevance-metrics and can be queried via the REST API.
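For example, the metrics for a single experiment can be pulled back with a search against that index; the experiment field name used below is an assumption, so adjust it to match the stored documents:
curl -X GET "http://localhost:9200/.search-relevance-metrics/_search?q=experiment:Q1-baseline-vs-custom&pretty"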
6. Interpreting the Results
A typical output looks like:
config                nDCG   Precision@5   MRR
baseline              0.72   0.48          0.55
custom-boost-titles   0.79   0.55          0.62
Higher scores indicate better alignment with the judged relevance. In this example, boosting titles improves nDCG by 0.07 and lifts Precision@5 by 7 percentage points.
7. Applying the Findings
7.1 Update Index Settings
Based on the metrics, you might adjust the search analyzer, change field mappings, or add custom scoring scripts. For the title-boost example, you could boost title matches in the query DSL; the snippet below uses function_score with field_value_factor, which assumes each document carries a numeric title_boost field.
"query": {
"function_score": {
"query": {"match": {"title": "{{query}}"}},
"field_value_factor": {"field": "title_boost", "factor": 1.5}
}
}
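As a quick sanity check, you can run the boosted query directly against your catalog index. Here products is a placeholder index name, the template placeholder is replaced with a concrete query, and missing handles documents without a title_boost value:
curl -X POST "http://localhost:9200/products/_search" \
-H 'Content-Type: application/json' -d '{
  "query": {
    "function_score": {
      "query": {"match": {"title": "wireless headphones"}},
      "field_value_factor": {"field": "title_boost", "factor": 1.5, "missing": 1}
    }
  }
}'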
7.2 Re‑run the Experiment
After applying changes, rerun the experiment to verify the impact. This iterative loop ensures that each tweak produces measurable gains.
8. Real‑World Case Study: Fixing Bad Results
A media platform noticed that users searching for "latest tech news" were frequently seeing older articles. The team:
- Defined a query set focused on timeliness.
- Ran a baseline experiment and observed low nDCG (0.48).
- Added a recency boost using a decay function (sketched after this list).
- Re‑ran the experiment and saw nDCG rise to 0.73.
- Deployed the new config to production, resulting in a 15% increase in click-through rate for the targeted queries.
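A recency boost along these lines can be expressed with a gauss decay function inside function_score. The publish_date field and the scale and decay values below are illustrative assumptions; tune them to your content's freshness profile:
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "latest tech news" } },
      "gauss": {
        "publish_date": { "origin": "now", "scale": "7d", "decay": 0.5 }
      }
    }
  }
}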
9. Visual Summary
Figure 1: End‑to‑end workflow from query set creation to metric analysis.
10. Best Practices
- Keep query sets small but representative – 30‑50 queries often provide enough signal.
- Use both explicit and implicit judgments – they complement each other.
- Version your search configurations – the plugin stores them as immutable objects, making rollback trivial.
- Automate metric extraction – integrate a CI step that fetches the latest metrics after each experiment (a minimal sketch follows this list).
- Document decisions – store rationale in the experiment description field for future reference.
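A minimal CI sketch, assuming the .search-relevance-metrics index used earlier and hypothetical config, ndcg, and timestamp fields in the metric documents (adjust the names to what your cluster actually stores):
#!/usr/bin/env bash
# Fetch the latest nDCG for the candidate config and fail the build if it regresses.
NDCG=$(curl -s "http://localhost:9200/.search-relevance-metrics/_search?q=config:custom-boost-titles&size=1&sort=timestamp:desc" | jq -r '.hits.hits[0]._source.ndcg')
THRESHOLD=0.75
if (( $(echo "$NDCG < $THRESHOLD" | bc -l) )); then
  echo "nDCG $NDCG dropped below threshold $THRESHOLD"
  exit 1
fi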
11. Conclusion
Search relevance is an ongoing discipline, not a one‑time setting. The OpenSearch Search Relevance plugin gives you a repeatable, data‑driven process to measure and improve relevance. By defining clear query sets, running controlled experiments, and acting on concrete metrics, you can turn vague user complaints into quantifiable improvements.
About the author
Prithvi S – Staff Software Engineer at Cloudera and Open‑source enthusiast. Follow my work on GitHub: https://github.com/iprithv