Every search engine starts with good intentions. You pick OpenSearch for its distributed architecture, its Lucene-powered full-text capabilities, and its RESTful API. You build your indices, tune your analyzers, and ship it to production. Then the complaints start rolling in.
- "Why does 'laptop stand' return laptop bags first?"
- "Searching for 'java developer' gives me coffee shop listings."
- "The top result for 'project manager' is a book, not a job."
You know search is broken. But how do you prove it? And more importantly, how do you measure whether you actually fixed it?
This is where the OpenSearch Search Relevance plugin comes in. It gives you a systematic way to evaluate, measure, and improve search quality. Not with gut feelings, but with metrics. Let me walk you through how I use it to turn bad search results into good ones.
The Problem: Search Quality Is Invisible Until It Breaks
Search is unique among software features. When it works, nobody notices. When it fails, it fails loudly and publicly. A user who cannot find what they want in three searches is gone. They will not file a bug report. They will just leave.
The challenge is that search quality is subjective. What is "relevant" depends on who is searching, what they want, and what your business values. Product search wants purchase intent. Job search wants skill matching. Documentation search wants precision. There is no universal "good search" - only search that meets your users' needs.
Without measurement, search tuning becomes guesswork. You tweak a boost here, add a synonym there, deploy, and hope. The OpenSearch Search Relevance plugin replaces hope with data.
What Is the Search Relevance Plugin?
The Search Relevance plugin is an official OpenSearch plugin that provides tools for evaluating and comparing search quality. It lives in the OpenSearch ecosystem alongside the core engine, security plugin, and ML plugin. If you are already running OpenSearch, adding this plugin is straightforward.
The plugin is built around a simple but powerful idea: define what good search looks like for your domain, measure your current search against that definition, test improvements, and keep only the changes that actually help.
It provides four core building blocks:
- Query Sets - Collections of representative search queries that your users actually type
- Search Configurations - Named sets of search parameters (analyzers, query types, boosts, filters)
- Experiments - Structured comparisons between two search configurations using pairwise comparison
- Judgments - Human or automated relevance ratings that ground-truth your search results
These four components work together to create a search quality evaluation pipeline. Let me show you how to build one.
Step 1: Build Your Query Set
Before you can measure search quality, you need to know what people are searching for. The query set is the foundation of everything that follows. A bad query set will give you misleading metrics. A good query set will reveal real problems.
A query set is just a list of queries that represent your actual search traffic. The key word is "representative." If 40% of your searches are for product names, 40% of your query set should be product names. If you have seasonal spikes, include queries from those periods.
Here is what I look for when building a query set:
- High-frequency queries - The searches that happen most often. Improving these has the biggest impact.
- Edge cases - The weird queries that expose flaws. Misspellings, ambiguous terms, multi-word phrases.
- Business-critical queries - The searches that matter most to your organization, even if they are rare.
- Failure cases - Queries that you already know produce bad results. These are your baseline for improvement.
The Search Relevance plugin stores query sets in an internal system index. You create them via the API or through the OpenSearch Dashboards interface. Each query set gets a name, description, and a list of query strings.
A good query set size depends on your domain complexity. For a focused e-commerce catalog, 50-100 well-chosen queries might be enough. For a broad job board or content platform, you might need 500-1000. The goal is coverage, not volume. One query that captures a real failure mode is worth more than ten generic queries that all behave the same way.
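As a rough illustration of the selection criteria above, here is a minimal Python sketch that builds a candidate query list from raw search logs. The function name `build_query_set` and the sample log lines are hypothetical and not part of the plugin. It only prepares the list of query strings that you would then load into a query set.

```python
from collections import Counter


def build_query_set(log_queries, top_n, must_include=()):
    """Return a de-duplicated list of queries.

    The most frequent logged queries come first, followed by hand-picked
    ones (known failures, business-critical queries, edge cases).
    """
    # Normalize case and whitespace so "Laptop Stand" and "laptop stand" count together.
    counts = Counter(q.strip().lower() for q in log_queries if q.strip())
    selected = [q for q, _ in counts.most_common(top_n)]
    for q in must_include:
        q = q.strip().lower()
        if q and q not in selected:
            selected.append(q)
    return selected


# Hypothetical log lines, for illustration only.
logs = [
    "laptop stand", "Laptop Stand", "java developer",
    "laptop stand", "project manager", "java developer", "usb hub",
]
query_set = build_query_set(
    logs,
    top_n=2,
    must_include=["project manager", "laptp stand"],  # a known failure and a misspelling
)
print(query_set)
```

Keeping the hand-picked queries in a separate argument makes it easy to see which entries came from traffic and which you added deliberately.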
The plugin stores query sets with schema versioning. If you add new fields later, the system index migrates cleanly. This is the kind of operational detail that matters when you are running this in production.
Step 2: Define Your Search Configurations
Now you need something to compare. The Search Relevance plugin uses "search configurations" - named parameter sets that define how a search is executed. Think of them as search recipes.
A typical search configuration includes:
- Index name - Which index to search
- Query template - The OpenSearch query DSL structure
- Analyzer settings - Which analyzer to use for text processing
- Boost parameters - Field weights, function scores, decay functions
- Filters - Any constant filters applied to all searches
The power of this approach is that you can define your current production search as one configuration, and your proposed improvements as another. Then you run an experiment to see which one actually performs better.
Here is a practical example. Suppose your current search for a job board uses a simple multi_match query:
current_config:
  query: {
    "multi_match": {
      "query": "{{query}}",
      "fields": ["title^2", "description", "skills"]
    }
  }
You suspect that adding a phrase match boost and reweighting the fields would improve results. So you create a second configuration:
improved_config:
  query: {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "{{query}}",
            "fields": ["title^3", "description^1.5", "skills^2"]
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "{{query}}",
              "boost": 5
            }
          }
        }
      ]
    }
  }
Note the differences: title boost increased from 2 to 3, description boost added, skills boost increased from 1 to 2, and a phrase match added with a 5x boost. These are the kinds of incremental changes that you can test systematically.
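To make the "search recipe" idea concrete, here is a small Python sketch of how a query template could be filled in for one query. It assumes the `{{query}}` placeholder used in the examples above; the token the plugin itself expects may differ, so check its documentation. The `render` helper is hypothetical.

```python
PLACEHOLDER = "{{query}}"  # token used in this article's examples (an assumption)


def render(template, query_text):
    """Return a copy of `template` with the placeholder replaced in every string value."""
    if isinstance(template, dict):
        return {key: render(value, query_text) for key, value in template.items()}
    if isinstance(template, list):
        return [render(value, query_text) for value in template]
    if isinstance(template, str):
        return template.replace(PLACEHOLDER, query_text)
    return template  # numbers, booleans, None pass through unchanged


improved_config = {
    "bool": {
        "should": [
            {
                "multi_match": {
                    "query": "{{query}}",
                    "fields": ["title^3", "description^1.5", "skills^2"],
                }
            },
            {"match_phrase": {"title": {"query": "{{query}}", "boost": 5}}},
        ]
    }
}

body = render(improved_config, "java developer")
print(body["bool"]["should"][1])
```

Because `render` builds a new structure instead of mutating the template, the same configuration can be reused for every query in the set.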
Step 3: Create a Pairwise Comparison Experiment
With a query set and two configurations, you can now create an experiment. The Search Relevance plugin supports multiple experiment types, but the most common is PAIRWISE_COMPARISON. This runs every query in your query set against both configurations and collects the results for human judgment.
The experiment lifecycle follows a clear state machine:
- CREATED - The experiment is defined but not started
- RUNNING - The plugin is executing searches and collecting results
- COMPLETED - All queries have been processed and results are available
When you start the experiment, the plugin iterates through your query set. For each query, it executes the search against both configurations using the scatter-gather pattern that OpenSearch uses for all distributed queries. The coordinating node routes each query to the relevant shards, collects the results, and stores them.
Under the hood, each search follows the standard OpenSearch query path. The query reaches the coordinating node, fans out to one copy of each shard in the target index (or to a subset if a routing value is supplied), and executes locally against Lucene segments. With OpenSearch 3.0, concurrent segment search is enabled by default in "auto" mode, in which OpenSearch decides per request whether to search segments in parallel. That mode mainly favors aggregation-heavy requests, so plain relevance queries may see little difference.
The results are stored in an internal system index, with schema versioning to handle future field additions. This is where my work on schema evolution and backward compatibility becomes relevant: the plugin handles index migrations gracefully, which is critical for production deployments.
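The loop described in this step can be sketched in a few lines of Python. This is not the plugin's implementation; `run_pairwise`, `jaccard`, and the two stand-in search functions are hypothetical, and the document IDs are invented for illustration. In a real setup the search functions would send the rendered query to your cluster.

```python
def run_pairwise(queries, search_a, search_b, k=5):
    """Run every query against both search callables and keep the top-k document IDs."""
    return {q: {"a": search_a(q)[:k], "b": search_b(q)[:k]} for q in queries}


def jaccard(ids_a, ids_b):
    """Share of documents the two result lists have in common (1.0 means identical sets)."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Stand-in backends with made-up document IDs.
FAKE_A = {"cluster health": ["blog-1", "api-7", "guide-2"]}
FAKE_B = {"cluster health": ["api-7", "guide-2", "api-9"]}


def search_a(query):
    return FAKE_A.get(query, [])


def search_b(query):
    return FAKE_B.get(query, [])


results = run_pairwise(["cluster health"], search_a, search_b, k=3)
overlap = jaccard(results["cluster health"]["a"], results["cluster health"]["b"])
print(results, overlap)
```

A low overlap does not say which configuration is better. It only tells you that the two configurations disagree enough to be worth judging.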
Step 4: Collect Judgments
Results without judgments are just data. Judgments transform data into ground truth. The Search Relevance plugin supports multiple judgment types:
Explicit judgments come from human evaluators. They look at a query-result pair and assign a relevance grade. Common grading scales include:
- Binary: Relevant / Not Relevant
- Ternary: Relevant / Partially Relevant / Not Relevant
- Graded: 0-4 scale where 4 is perfectly relevant
The Dashboards interface for the Search Relevance plugin renders search results side by side. An evaluator sees the results from configuration A and configuration B for the same query, and assigns grades to each. Viewing both lists together is efficient because the evaluator reads the query once and can calibrate grades against the competing result list.
Implicit judgments come from user behavior signals. Click-through rates, dwell time, and conversion rates can serve as relevance proxies. If users consistently click the third result and ignore the first, that is a signal that the ranking is wrong. The plugin can ingest these signals if you have them, but explicit judgments are the gold standard for controlled experiments.
Judgments are stored in the plugin's internal system index with the same schema versioning approach. The grading scale is configurable per experiment, so you can use binary judgments for quick tests and graded judgments for deep analysis.
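When several people grade the same query-result pair, you need one grade per pair before computing metrics. The sketch below takes the median, which is one reasonable choice among several. The `consolidate` helper, the tuple layout, and the sample grades are all assumptions for illustration, not the plugin's storage format.

```python
from collections import defaultdict
from statistics import median


def consolidate(judgments):
    """Collapse (query, doc_id, grader, grade) rows into one median grade per (query, doc_id)."""
    by_pair = defaultdict(list)
    for query, doc_id, _grader, grade in judgments:
        by_pair[(query, doc_id)].append(grade)
    return {pair: median(grades) for pair, grades in by_pair.items()}


# Made-up grades on a 0-3 scale from three graders.
rows = [
    ("cluster health", "api-7", "alice", 3),
    ("cluster health", "api-7", "bob", 2),
    ("cluster health", "api-7", "carol", 3),
    ("cluster health", "blog-1", "alice", 1),
    ("cluster health", "blog-1", "bob", 0),
    ("cluster health", "blog-1", "carol", 1),
]
grades = consolidate(rows)
print(grades)
```

The median is less sensitive to a single outlier grader than the mean, which helps when the grading panel is small.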
Step 5: Compute and Interpret Metrics
With judgments collected, the plugin computes relevance metrics. The key metrics are:
nDCG (Normalized Discounted Cumulative Gain) - The gold standard for ranked search evaluation. It considers both relevance grades and position. A highly relevant result at position 1 scores higher than the same result at position 5. The "normalized" part means the score is divided by the DCG of the ideal ranking, so you get a 0-1 scale that is comparable across queries.
Precision - Of the results returned, what fraction is relevant? Useful when you care about the quality of the result set as a whole, not just the top position.
Recall - Of all relevant documents in the index, what fraction was returned? Important for completeness, especially in domains like legal or medical search where missing a relevant document is costly.
MRR (Mean Reciprocal Rank) - The average of 1/rank of the first relevant result. Good for "I just need one good result" scenarios like error code lookup or customer support search.
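To make the rank-based definitions concrete, here is a minimal Python sketch that computes them for a single query. It is not the plugin's code. It assumes linear gain (the grade itself rather than 2^grade - 1), a log2 position discount, an ideal ranking built only from the grades in the list passed in, and a relevance threshold of grade 1 or higher. Recall is left out because it needs the total number of relevant documents in the index.

```python
import math


def dcg(grades):
    """Discounted cumulative gain: each grade is divided by log2(position + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(grades, k):
    """DCG of the top k results divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0


def precision_at_k(grades, k, threshold=1):
    """Fraction of the top k results whose grade is at least `threshold`."""
    return sum(1 for g in grades[:k] if g >= threshold) / k


def reciprocal_rank(grades, threshold=1):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, g in enumerate(grades, start=1):
        if g >= threshold:
            return 1.0 / rank
    return 0.0


# Grades for one query's top 5 results, in ranked order (illustrative values).
grades = [0, 3, 2, 0, 1]
print(ndcg_at_k(grades, 5), precision_at_k(grades, 5), reciprocal_rank(grades))
```

MRR is then the mean of `reciprocal_rank` across all queries in the set, and the same averaging applies to the other per-query scores.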
The plugin computes these metrics per query and aggregates them across the query set. You get mean, median, and standard deviation, which tell you whether your improvements are consistent or just helping a few queries while hurting others.
Here is what good looks like in practice. Suppose your baseline configuration has a mean nDCG of 0.45. Your improved configuration has a mean nDCG of 0.62. That is a 38% relative improvement in mean nDCG. More importantly, you can look at the per-query breakdown and see that the improvement is consistent across query types, not just driven by one or two easy wins.
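One way to check whether a gain is consistent is to compare per-query scores rather than only the means. The following sketch uses Python's `statistics` module. The `summarize` and `compare` helpers are hypothetical and the per-query nDCG values are made up for illustration.

```python
from statistics import mean, median, pstdev


def summarize(scores):
    """Mean, median, and population standard deviation of per-query scores."""
    values = list(scores)
    return {"mean": mean(values), "median": median(values), "stdev": pstdev(values)}


def compare(baseline, candidate):
    """Count per-query wins and losses; both inputs map query -> metric value."""
    deltas = {q: candidate[q] - baseline[q] for q in baseline}
    return {
        "wins": sum(1 for d in deltas.values() if d > 0),
        "losses": sum(1 for d in deltas.values() if d < 0),
        "worst_query": min(deltas, key=deltas.get),
    }


# Made-up per-query nDCG values.
baseline = {"index template api": 0.30, "cluster health": 0.50, "shard allocation": 0.55}
candidate = {"index template api": 0.70, "cluster health": 0.60, "shard allocation": 0.50}

report = compare(baseline, candidate)
print(summarize(baseline.values()), summarize(candidate.values()), report)
```

A higher mean with several losses is a hint to inspect the losing queries before deploying, since the average can hide regressions on specific query types.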
A Real-World Walkthrough
Let me put this together with a concrete example. Suppose you run a technical documentation search engine. Users search for things like "index template API", "cluster health", and "shard allocation".
The Problem: Your current search uses a default multi_match query with equal field weights. Users complain that API reference searches return blog posts first, and conceptual searches return API endpoints instead of overviews.
The Query Set: You create 100 queries representing your traffic mix: 40% API reference lookups, 30% troubleshooting searches, 20% conceptual queries, 10% edge cases (misspellings, ambiguous terms like "cluster" which could mean hardware or Elasticsearch cluster).
Configuration A (Current):
{
  "multi_match": {
    "query": "{{query}}",
    "fields": ["title", "content", "tags"]
  }
}
Configuration B (Improved):
{
  "bool": {
    "should": [
      {
        "multi_match": {
          "query": "{{query}}",
          "fields": ["title^3", "content^1", "tags^2"],
          "type": "best_fields"
        }
      },
      {
        "match_phrase": {
          "title": {
            "query": "{{query}}",
            "boost": 5
          }
        }
      }
    ]
  }
}
The changes are deliberate. Title boost to 3 because titles are strong relevance signals. Tags boost to 2 because tags are curated metadata. Content stays at 1 because it is noisy. The phrase match rewards exact title matches, which helps with API reference searches where users often type exact endpoint names.
The Experiment: You create a PAIRWISE_COMPARISON experiment with the 100-query set. The plugin runs each query against both configurations. You recruit three technical writers to grade the top 5 results for each query on a 0-3 scale.
The Results:
- Configuration A: nDCG = 0.41, Precision@5 = 0.52, MRR = 0.63
- Configuration B: nDCG = 0.58, Precision@5 = 0.71, MRR = 0.78
The improvement is consistent across all metrics. Looking at the per-query breakdown, you see that API reference searches improved the most (nDCG from 0.35 to 0.67), which aligns with your hypothesis. Troubleshooting searches improved modestly (0.48 to 0.55), which tells you there is still work to do on that query type.
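To see where the gains are concentrated, it helps to express each reported number as a relative change. This short Python sketch does that for the figures quoted in the results above. The `relative_gain` helper and the dictionary labels are illustrative names only.

```python
def relative_gain(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# (Configuration A, Configuration B) values quoted in the results above.
reported = {
    "nDCG (all queries)": (0.41, 0.58),
    "Precision@5": (0.52, 0.71),
    "MRR": (0.63, 0.78),
    "nDCG (API reference)": (0.35, 0.67),
    "nDCG (troubleshooting)": (0.48, 0.55),
}

gains = {name: relative_gain(a, b) for name, (a, b) in reported.items()}
ranked = sorted(gains, key=gains.get, reverse=True)  # largest relative gain first

for name in ranked:
    print(f"{name}: {gains[name]:+.0%}")
```

Sorting by relative gain puts the API reference queries at the top and troubleshooting at the bottom, which matches the conclusion that troubleshooting is the next area to work on.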
The Decision: You deploy configuration B to production. But you also create a new query set focused on troubleshooting searches, because that is where the next improvement will come from.
Why This Beats Guesswork
Without the Search Relevance plugin, the typical search tuning workflow is:
- Notice a complaint
- Guess at a fix
- Deploy it
- Hope it helped
- Find out three months later that it hurt other queries
With the plugin, the workflow becomes:
- Measure baseline search quality with a representative query set
- Form a hypothesis about what will improve it
- Test the hypothesis in a controlled experiment
- Measure the improvement with objective metrics
- Deploy only if the metrics improve
- Monitor for query types that still need work
This is the difference between search engineering and search superstition. The plugin does not tell you what changes to make - that still requires domain knowledge and intuition. But it tells you whether your changes worked. That is the critical piece.
The OpenSearch Advantage
The Search Relevance plugin benefits from OpenSearch's core architecture in ways that make it production-ready:
Distributed execution: Experiments run across your cluster using the same scatter-gather pattern as regular searches. If your cluster handles production traffic, it can handle evaluation queries.
Near-real-time results: Experiment results are written to an index like any other document, so they become visible after the next refresh rather than after a separate batch job.
Plugin architecture: The Search Relevance plugin installs like any other OpenSearch plugin. It uses the standard extension points (ActionPlugin, SearchPlugin) and follows the same class loading and lifecycle patterns. If you run OpenSearch, you know how to run this.
Schema evolution: The plugin's internal system indexes use additive-only schema migrations. This means you can upgrade the plugin without losing existing query sets, configurations, and experiment results. As someone who has worked on the schema evolution patterns for this plugin, I can confirm this is designed for long-term operational stability.
Concurrent search: With OpenSearch 3.0, concurrent segment search is enabled by default in "auto" mode. For evaluation workloads that include large aggregations, this can reduce latency. Plain relevance queries may benefit less.
Getting Started
If you want to try this on your own OpenSearch cluster:
- Install the Search Relevance plugin via the opensearch-plugin tool
- Install the Dashboards plugin for the visual interface
- Create a query set from your search logs
- Define your current production search as a configuration
- Create one variant with a specific hypothesis
- Run a PAIRWISE_COMPARISON experiment
- Collect judgments (start with explicit, even if it is just you grading 20 queries)
- Look at the metrics. If your variant wins, you have data to support the change. If it loses, you learned something without impacting production.
The plugin is open source and actively maintained. You can find it at https://github.com/opensearch-project/search-relevance along with documentation and issue tracking.
Conclusion
Search quality is too important to leave to intuition. The OpenSearch Search Relevance plugin gives you the tools to measure, compare, and improve your search systematically. It is not a magic solution that fixes search automatically. It is a measurement framework that makes your search tuning decisions evidence-based.
Start with a query set. Define your current search as a baseline. Test one change at a time. Measure with nDCG, precision, and MRR. Keep the changes that improve your metrics. Iterate.
That is how you turn bad search results into good ones. Not with bigger hardware or more complex algorithms, but with better measurement and disciplined experimentation.
I'm Prithvi S, Staff Software Engineer at Cloudera and open source enthusiast. I maintain the dashboards-search-relevance plugin and contribute to the OpenSearch ecosystem. Follow my work on GitHub: https://github.com/iprithv