Introduction
You've built a search engine. Users can query your data. Results come back. But here's the uncomfortable question: are the results good?
This isn't about whether the feature works. It works. But does it rank the most relevant documents first? Do users find what they're looking for? Are you optimizing for the right metrics?
If you're operating OpenSearch at scale, you've probably felt this pain. Search quality isn't a one-time configuration problem. It's an ongoing optimization challenge. You need to measure it, experiment with it, and improve it systematically.
That's where the OpenSearch Search Relevance plugin comes in.
In this post, I'll walk you through building an end-to-end search quality evaluation pipeline using the Search Relevance plugin. By the end, you'll understand how to create representative query sets, run controlled experiments, collect human judgments, compute relevance metrics, and iterate on your search configuration until the results measurably improve.
The Search Quality Problem
Before diving into the solution, let's frame the problem clearly.
Search quality has multiple dimensions:
- Relevance - Does the top result match what the user searched for?
- Completeness - Are all relevant documents in the top-K results?
- Ranking - Are more relevant docs ranked higher than less relevant ones?
- Precision - What fraction of returned results are actually useful?
- Recall - What fraction of useful documents did we find?
The BM25 algorithm (OpenSearch's default) is good out of the box. But "good" isn't "perfect." And what's perfect for one use case (e-commerce product search) might be terrible for another (medical research papers).
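To build intuition for what BM25 actually rewards, here is a minimal sketch of the score a single query term contributes. The function and example numbers are illustrative, not OpenSearch's implementation (which lives in Lucene), though `k1=1.2` and `b=0.75` are the defaults OpenSearch ships with:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Simplified BM25 contribution of one query term to one document.

    idf rewards rare terms; the tf normalization saturates repeated
    terms and penalizes documents longer than average.
    """
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A rarer term scores higher than a common one, all else being equal.
rare = bm25_term_score(tf=3, doc_len=100, avg_doc_len=120,
                       n_docs=10_000, docs_with_term=50)
common = bm25_term_score(tf=3, doc_len=100, avg_doc_len=120,
                         n_docs=10_000, docs_with_term=5_000)
```

This is exactly the kind of scoring behavior (term rarity, length normalization) that may or may not match your domain's notion of relevance, which is why you measure.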
You need a way to:
- Define what "good" means for your domain
- Measure it quantitatively
- Test changes before deploying
- Track improvements over time
This is exactly what a search quality evaluation pipeline does.
Meet the OpenSearch Search Relevance Plugin
The Search Relevance plugin is part of the opensearch-project ecosystem. It's designed specifically to solve this problem.
At its core, it orchestrates five key components:
- Query Sets - Representative questions or search terms for your domain
- Search Configurations - Different index analyzers, query types, and boosting settings
- Experiments - Controlled comparisons between two search configurations
- Judgments - Human-provided relevance labels for query-document pairs
- Metrics - Computed evaluation scores: nDCG, precision, recall, MRR
The beauty is that it connects all of these together into a coherent workflow. You don't need to glue together five different tools. It's all built in.
Step 1: Build Your Query Set
Everything starts with queries.
A query set is a collection of representative search terms or questions for your domain. The quality of your query set directly affects the quality of your evaluation.
How to Design a Query Set
Think like your users. What would they actually search for?
For an e-commerce search engine, examples might be:
- "blue running shoes size 10"
- "wireless headphones under 100"
- "coffee maker for 2 people"
For a documentation search:
- "how to configure SSL certificates"
- "what is a shard"
- "troubleshooting connection timeouts"
For a code search:
- "implement BFS algorithm"
- "parse JSON to object"
- "handle file not found error"
Include variety. Your query set should cover:
- Short queries (1-2 terms) and long queries (5+ terms)
- Common searches and edge cases
- Different intent types (navigation, informational, transactional)
- Different domains if your search spans multiple categories
Size matters. A good query set has 50-300 queries depending on your domain:
- 50-100: Initial evaluation (fast iteration)
- 100-200: Standard evaluation for most teams
- 200+: Large-scale benchmarking across multiple configurations
Create it incrementally. Start small, run experiments, learn what queries are most impactful, expand.
Storing Your Query Set
In the Search Relevance plugin, query sets are stored as OpenSearch documents. Here's a conceptual example:
```json
{
  "_id": "ecommerce-base-v1",
  "name": "E-commerce Base Query Set",
  "description": "100 representative queries for product search",
  "queries": [
    {"id": "q001", "text": "blue running shoes"},
    {"id": "q002", "text": "wireless headphones"},
    ...
  ]
}
```
Each query has an ID and text. Simple. You can create query sets via the OpenSearch API or the Dashboards UI.
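If you script query-set creation, a small helper keeps the IDs consistent. This is a hypothetical builder matching the conceptual shape above, not the plugin's exact schema; check the plugin docs for the fields your version expects:

```python
def build_query_set(set_id, name, description, query_texts):
    """Build a query-set document with sequential, zero-padded query IDs."""
    return {
        "_id": set_id,
        "name": name,
        "description": description,
        "queries": [
            {"id": f"q{i:03d}", "text": text}
            for i, text in enumerate(query_texts, start=1)
        ],
    }

qs = build_query_set(
    "ecommerce-base-v1",
    "E-commerce Base Query Set",
    "Representative queries for product search",
    ["blue running shoes", "wireless headphones"],
)
```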
Step 2: Define Search Configurations
A search configuration is a snapshot of how you want to search: analyzers, query types, field boosting, synonym expansion, and more.
Think of it as: "This is one way to search."
What Goes Into a Configuration
Here are common things you might tune:
Analyzers
- Standard analyzer vs. custom analyzer with stemming
- Language-specific analyzers (English, French, etc.)
- Phonetic analysis for typo tolerance
Query Types
- BM25 query (default, term frequency + IDF)
- Match phrase query (exact phrase matching)
- Bool query with SHOULD/MUST/FILTER clauses
- Multi-match across multiple fields
Field Boosting
- Title field: 2x boost (more important)
- Description field: 1x boost (baseline)
- Tags field: 0.5x boost (less important)
Query Parameters
- Fuzziness (tolerating typos)
- Operator (AND vs. OR semantics)
- Minimum should match (for multi-clause queries)
Creating a Configuration
Via API:
```
PUT /search-config/_doc/config-v1
{
  "name": "Default BM25",
  "description": "Standard BM25 with field boosting",
  "analyzer": "standard",
  "query_type": "multi_match",
  "field_weights": {
    "title": 2.0,
    "description": 1.0,
    "tags": 0.5
  }
}
```
You typically create 2-3 configurations to compare:
- Your baseline (current production)
- A proposed improvement (new analyzer or boosting strategy)
- (Optional) A radically different approach
Step 3: Run an Experiment
Now the magic happens. You tell the plugin: "Compare config A vs. config B using my query set. Execute all queries against both and show me which is better."
The plugin does this:
- Iterate through each query in your query set
- Execute the query against your OpenSearch index using config A
- Execute the same query against your index using config B
- Capture the results (top-K documents for each)
- Store everything, indexed and ready for judgment
This is an experiment. The status is CREATED -> RUNNING -> COMPLETED.
Once completed, you have paired results: for each query, you can see side-by-side what config A returned vs. what config B returned.
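The plugin handles this orchestration internally, but the loop is easy to picture. Here's a sketch with a pluggable `search_fn` standing in for a real client call; the function names and result shapes are assumptions for illustration, not the plugin's internal API:

```python
def run_experiment(query_set, config_a, config_b, search_fn, k=10):
    """Execute every query against both configurations and pair the results.

    `search_fn(config, query_text, k)` is assumed to return a ranked list
    of document IDs (e.g. a thin wrapper around an opensearch-py client).
    """
    results = {"status": "RUNNING", "pairs": []}
    for query in query_set["queries"]:
        results["pairs"].append({
            "query_id": query["id"],
            "query_text": query["text"],
            "config_a_top_k": search_fn(config_a, query["text"], k),
            "config_b_top_k": search_fn(config_b, query["text"], k),
        })
    results["status"] = "COMPLETED"
    return results

# Stub search function so the sketch runs without a cluster.
def fake_search(config, query_text, k):
    return [f"{config['name']}-doc-{i}" for i in range(k)]

demo_set = {"queries": [{"id": "q001", "text": "blue running shoes"},
                        {"id": "q002", "text": "wireless headphones"}]}
outcome = run_experiment(demo_set, {"name": "A"}, {"name": "B"}, fake_search)
```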
Why This Matters
You now have:
- Reproducible, deterministic comparisons
- No randomness (same queries, same configs, same results every time)
- Side-by-side results for human evaluation
- A complete audit trail of what changed
Step 4: Collect Judgments
Here's where humans come in.
Judgments are relevance labels. A human expert looks at a query and says: "This document is highly relevant" or "This document is not relevant."
The Search Relevance plugin supports two types of judgments:
Explicit Judgments
A human grades each query-document pair on a scale. Most common:
- 0: Not relevant (wrong topic entirely)
- 1: Somewhat relevant (tangentially related)
- 2: Relevant (answers the query)
- 3: Highly relevant (perfect match)
The plugin UI presents your experiment results (config A vs. B side-by-side) and lets judges assign these grades.
Implicit Judgments
Collect signals from user behavior:
- Click-through rate (user clicked this result)
- Dwell time (user spent time reading this)
- Skip rate (user skipped this and clicked something lower)
For many teams, explicit judgments from a small pool of domain experts (5-20 people) are enough to produce a strong signal.
Step 5: Compute Metrics
Once you have judgments, the plugin computes relevance metrics:
nDCG (Normalized Discounted Cumulative Gain)
This measures ranking quality. The intuition:
- Relevant documents should rank high
- Position matters (higher positions worth more)
- Perfect ranking gets nDCG = 1.0
- A random ranking scores well below that; the exact value depends on how many relevant documents exist
Formula (simplified):
nDCG = (1/IDCG) * sum(relevance_grade / log2(position + 1))
When to use: Almost always. This is the gold standard for ranking quality.
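A minimal implementation of the simplified formula above (linear gain, 1-based positions):

```python
import math

def dcg(grades):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    # Position i is 1-based, so the discount is log2(position + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal one."""
    ideal = sorted(ranked_grades, reverse=True)
    idcg = dcg(ideal[:k])
    if idcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(ranked_grades[:k]) / idcg
```

Note that some implementations use an exponential gain, `2^grade - 1`, which weights highly relevant documents more aggressively; the linear form matches the formula above.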
Precision and Recall
Precision: What fraction of top-K results were relevant?
- Precision@10 = (# relevant in top 10) / 10
- Good for: User experience (are the top results useful?)
Recall: What fraction of relevant documents did we find?
- Recall@100 = (# relevant in top 100) / (# relevant total)
- Good for: Comprehensiveness (did we find everything?)
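Both are one-liners once you have the judged results in hand:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

# Example: 2 of the top 4 are relevant, out of 3 relevant docs overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
```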
MRR (Mean Reciprocal Rank)
The average, across queries, of the reciprocal rank of the first relevant document.
- MRR = mean over queries of (1 / rank of first relevant result)
- Good for: Cases where only the first result matters (navigation queries)
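Note that MRR averages the reciprocal ranks, not the ranks themselves. A sketch:

```python
def mrr(ranked_results_per_query, relevant_per_query):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit.

    Queries with no relevant hit contribute 0 (a common convention).
    """
    total = 0.0
    for retrieved, relevant in zip(ranked_results_per_query, relevant_per_query):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results_per_query)
```

For example, a first-position hit on one query and a second-position hit on another gives (1 + 0.5) / 2 = 0.75.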
Which Metric to Track
- nDCG: Primary metric. Use nDCG@10 or nDCG@20 for most cases.
- Precision@K: Secondary. Shows top-K quality.
- Recall@K: If comprehensiveness matters (e.g., search across entire document corpus).
- MRR: Only if navigational queries are critical.
Step 6: Iterate
The experiment outputs metrics. You see that config B's nDCG is 0.78 while config A's is 0.72. Config B wins.
Now what?
Deploy config B. Monitor search quality in production. But don't stop.
Run the next experiment. Try:
- A different analyzer
- Different field boosting
- Additional query logic
Treat search quality like any other product: continuous improvement, guided by metrics.
Putting It All Together: The Pipeline
Here's the complete workflow:
1. Design Query Set
|
v
2. Create Search Configurations (Baseline + Proposed)
|
v
3. Run Experiment (Execute queries for both configs)
|
v
4. Collect Judgments (Humans grade results)
|
v
5. Compute Metrics (nDCG, precision, recall, MRR)
|
v
6. Analyze Results (Which config wins? By how much?)
|
v
7. Deploy Winner & Monitor
|
v
8. Repeat (Design next experiment)
Each cycle takes days to weeks (depending on judgment collection speed). But you're grounded in data, not guesses.
Practical Tips
Start small. 50 queries, 2 configurations, 10 judges. Run the pipeline end-to-end. You'll learn what works.
Judge consistently. Train your judges. Create a judgment guide. Have judges re-evaluate a subset for inter-rater agreement.
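One standard way to quantify inter-rater agreement on that shared subset is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch for two judges grading the same items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two judges' grades on the same items.

    1.0 means perfect agreement; 0 means chance-level agreement;
    negative values mean systematic disagreement.
    """
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each judge's grade distribution.
    expected = sum(
        (counts_a[g] / n) * (counts_b[g] / n)
        for g in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa below roughly 0.6 on your 0-3 grading scale is a sign the judgment guide needs tightening before you trust the metrics.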
Track over time. Keep historical metrics. Did nDCG improve? By how much? This builds confidence that you're moving in the right direction.
Combine signals. Use nDCG as your primary metric, but also check precision, recall, and MRR. Sometimes improvements in one metric hurt another.
Automate where possible. Explicit judgments require humans, but experiment execution, metric computation, and result analysis should all be automated.
Version everything. Query sets, configurations, experiments, judgments. Treat them like code: track versions, enable reproducibility, enable rollback.
Challenges You'll Face
Judgment burden. Grading 100 queries x 10 results per query = 1,000 judgments. At 30 seconds per judgment, that's over 8 hours of focused work. Parallelize across judges. Use inter-rater agreement to validate quality.
Query set quality. A bad query set produces meaningless metrics. Spend time upfront building representative queries. Validate with users or domain experts.
Config comparison fairness. Make sure both configs query the same data, same index, same relevance judgments. Isolate variables.
Metric interpretation. A 0.02 improvement in nDCG might be noise or might be significant. Track confidence intervals. Run multiple rounds.
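A cheap way to put a confidence interval on a per-query metric difference is a paired bootstrap: resample the per-query differences and look at the spread of the resampled means. A sketch (the function name and defaults are my own):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000,
                      alpha=0.05, seed=42):
    """Paired bootstrap CI for the mean per-query metric difference (B - A).

    scores_a and scores_b are aligned per-query metrics (e.g. nDCG@10).
    If the returned interval excludes 0, the improvement is unlikely
    to be noise.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Degenerate demo: every query improves by exactly 0.2.
lo, hi = bootstrap_diff_ci([0.5] * 20, [0.7] * 20, n_resamples=200)
```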
Real-World Example
Let's say you run e-commerce search. Your current baseline achieves nDCG@10 = 0.68. Users complain that size/color variants aren't matching well.
You hypothesize: "If I boost the size and color fields more, users will find exact matches faster."
You create a new config with aggressive field boosting on size and color. Run an experiment with 100 queries and 15 judges.
Results:
- Config A (baseline): nDCG@10 = 0.68, Precision@10 = 0.72
- Config B (boosted): nDCG@10 = 0.71, Precision@10 = 0.75
Config B wins. You deploy it. Users are happier. That's a 4.4% relative improvement in nDCG@10, measured rather than guessed.
Next experiment: Can we improve recall without hurting precision? Try a different query operator. Repeat.
Over 6 months, you've compounded these improvements. Your nDCG went from 0.68 to 0.79. That's real impact, measured and reproducible.
Getting Started
- Install OpenSearch Search Relevance plugin. Follow the official docs: https://github.com/opensearch-project/search-relevance
- Create your first query set. Start with 50-100 queries. Validate with users.
- Set up 2 configurations. Baseline (current) and one proposed improvement.
- Run an experiment. Let it complete. Study the results.
- Collect judgments. Have 5-10 domain experts grade a subset of results.
- Compute metrics. Let the plugin calculate nDCG, precision, recall, MRR.
- Analyze. Which config won? By how much? Is it statistically significant?
- Iterate. Deploy the winner, then design the next experiment.
Conclusion
Search quality doesn't happen by accident. It's engineered, measured, and continuously improved.
The OpenSearch Search Relevance plugin gives you the infrastructure to do this systematically. Query sets, configurations, experiments, judgments, metrics. All connected. All reproducible.
If you're operating search at scale, you owe it to your users to build this pipeline. Start small. Run your first experiment this week. Measure. Iterate. Improve.
Your search results will thank you. So will your users.
About the Author
I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: https://github.com/iprithv