Prithvi S

Posted on Jun 17 • Edited on Jul 4

Query Sets in Search Relevance: Designing Representative Test Queries That Actually Matter

#opensearch #search #database #data

You have spent weeks tuning your OpenSearch cluster. You have optimized analyzers, tweaked BM25 parameters, and maybe even added vector search. But how do you know if any of it actually helps users find what they want?

Here is the uncomfortable truth: most search quality problems start before you ever run an experiment. They start with your query set.

A query set is a curated collection of search queries that represent real user behavior. In OpenSearch Search Relevance, it is the foundation of every evaluation pipeline. Get it wrong, and your metrics are meaningless. Get it right, and you have a reliable compass for every tuning decision you make.

This post is about building query sets that actually matter. Not textbook examples. Not synthetic data. Real, representative queries that tell you whether your search is working.

Why Query Sets Are the Most Underrated Component of Search Evaluation

Search engineers love to talk about algorithms. BM25 vs DFR vs vector search. Query parsers. Boosting strategies. But the best scoring algorithm in the world cannot save you from a bad query set.

Think about it: your metrics (nDCG, precision, recall, MRR) are computed against a query set. If that query set does not represent what your users actually search for, you are optimizing for the wrong thing. You might chase a 0.05 nDCG improvement on synthetic queries while your real users cannot find basic products.

I have seen teams run elaborate A/B tests, publish glowing metric dashboards, and still get complaints about search quality. The disconnect? Their query set was 50 hand-picked queries from product demos, not 500 real queries from production logs.

OpenSearch Search Relevance treats query sets as first-class citizens. They are stored as documents in an internal index, versioned, reusable across experiments, and designed to be the starting point of every evaluation. But the plugin does not tell you how to build a good query set. That is on you.

What Makes a Query Set "Representative"?

A representative query set mirrors your actual user population. It captures the full distribution of what people search for, not just what you think they search for.

Here is what that means in practice.

Cover the Query Distribution

Real user query logs follow a power law. A small number of queries (the "head") account for most of the volume. A larger middle section (the "torso") covers moderate-frequency queries. The long tail contains thousands of rare, specific queries.

Your query set should sample from all three regions. If you only pick head queries, you miss edge cases. If you only pick tail queries, your metrics are noisy and say little about the searches most users actually run. A good rule of thumb: allocate 30% head, 40% torso, 30% tail.

Head queries are your bread and butter. "iphone 15" or "running shoes." These are high-volume, often generic, and usually competitive. If your search fails here, you are losing the most users.

Torso queries are where differentiation happens. "waterproof running shoes women" or "iphone 15 pro max 256gb blue." These show intent and are common enough to matter but specific enough to test relevance depth.

Tail queries are the real test. "shoes for plantar fasciitis under 100 dollars" or "iphone 15 case compatible with magSafe wallet third party." These are rare individually but collectively represent a significant portion of user needs. They also expose gaps in your catalog coverage, synonym handling, and query understanding.
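Here is a minimal sketch of that 30/40/30 allocation, assuming you already have per-query search counts from your logs. The volume cut-offs that define head and torso (top 50% and next 30% of search volume) and the function names are my own assumptions for illustration; tune them to your traffic.

```python
import random
from collections import Counter

def split_by_volume(counts, head_share=0.5, torso_share=0.3):
    """Bucket queries into head/torso/tail by cumulative share of search volume."""
    total = sum(counts.values())
    buckets = {"head": [], "torso": [], "tail": []}
    seen = 0
    for query, count in counts.most_common():
        # Use the share of volume *before* this query, so the top query is always head.
        share_before = seen / total
        if share_before < head_share:
            buckets["head"].append(query)
        elif share_before < head_share + torso_share:
            buckets["torso"].append(query)
        else:
            buckets["tail"].append(query)
        seen += count
    return buckets

def sample_query_set(counts, size, allocation=None, seed=0):
    """Draw a query set of roughly `size` queries using a per-bucket allocation."""
    allocation = allocation or {"head": 0.3, "torso": 0.4, "tail": 0.3}
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = split_by_volume(counts)
    sampled = {}
    for name, fraction in allocation.items():
        # A bucket may hold fewer queries than its quota (the head is usually tiny).
        k = min(round(size * fraction), len(buckets[name]))
        sampled[name] = rng.sample(buckets[name], k)
    return sampled
```

If the head bucket is smaller than its quota, the sketch simply takes every head query rather than padding from another bucket, so the final set can come out slightly under `size`.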

Match Intent Diversity

Users search with different goals. Navigational queries aim to reach a specific page or product. Informational queries seek knowledge or comparisons. Transactional queries signal immediate purchase intent.

A query set that only tests transactional intent tells you how search performs for shoppers but hides the problems researchers face. Mix intent types deliberately. The exact ratio depends on your product, but ignoring any category is a blind spot.
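One way to check the mix, sketched under the assumption that someone has already hand-labeled each query with one of the three intents above. The labels and the `intent_mix` helper are illustrative; they are not something the plugin provides.

```python
from collections import Counter

INTENTS = ("navigational", "informational", "transactional")

def intent_mix(labeled_queries):
    """Return the share of each intent and the intents with no queries at all.

    labeled_queries: iterable of (query_text, intent) pairs labeled by a person.
    """
    counts = Counter(intent for _, intent in labeled_queries)
    total = sum(counts[intent] for intent in INTENTS)
    shares = {
        intent: (counts[intent] / total if total else 0.0) for intent in INTENTS
    }
    missing = [intent for intent in INTENTS if counts[intent] == 0]
    return shares, missing

shares, missing = intent_mix([
    ("nike store locator", "navigational"),
    ("running shoes", "transactional"),
    ("iphone 15 pro max 256gb blue", "transactional"),
    ("nike air max red size 10", "transactional"),
])
print(shares)
print("Blind spots:", missing)
```

A non-empty `missing` list is the blind spot described above.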

Include Known Problem Queries

Every search system has queries that are known to be problematic. Maybe they return zero results. Maybe they return irrelevant results. Maybe they are ambiguous and your current ranking is arbitrary.

These queries are gold. They are the ones where improvement is most visible and most valuable. Do not exclude them because they make your baseline metrics look bad. Include them specifically because they tell you where to focus your tuning efforts.

Avoid Synthetic Queries

There is a temptation to write queries you think users should ask. Resist it. "Best product under $50 with free shipping and 4.5 stars" is not a real query. It is a feature list pretending to be a query.

Real queries are messy. They have typos, abbreviations, brand names, model numbers, and half-remembered details. "nike air max red size 10" is a real query. "Comfortable athletic footwear with visible air cushioning technology" is not. Your query set should reflect messy reality.

Query Set Size: The Statistical Significance vs. Cost Trade-off

How many queries do you need? It depends on what you are trying to do.

Minimum Viable: 50 Queries

Fifty queries is enough for rough directional signal. You can tell if a change is catastrophic or promising. But you cannot reliably detect small improvements. The variance is too high. With 50 queries, a single query going from perfect to zero moves your mean nDCG by 0.02 (1/50), which is as large as many of the improvements you are trying to detect.

Use 50-query sets for early prototyping, debugging, or smoke testing. Do not use them for production go/no-go decisions.
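To put a rough number on that variance, you can bootstrap a confidence interval over your per-query nDCG scores. This is a generic percentile bootstrap, not something the plugin computes for you, and the scores below are random placeholders standing in for real per-query results.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the queries with replacement and record the mean each time.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Placeholder data: replace with per-query nDCG from a real experiment.
placeholder_rng = random.Random(42)
per_query_ndcg = [placeholder_rng.random() for _ in range(500)]

for size in (50, 500):
    low, high = bootstrap_ci(per_query_ndcg[:size])
    print(f"{size} queries: mean nDCG plausibly between {low:.3f} and {high:.3f}")
```

If the interval for your 50-query set is wider than the improvement you are hoping to see, the set cannot confirm that improvement.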

The Sweet Spot: 100-500 Queries

This is where most teams should live. One hundred queries gives you enough statistical power to detect meaningful differences. Five hundred gives you confidence without drowning in judgment costs.

With 100-500 queries, you can segment by query type, compute per-query metrics, and identify specific patterns. You might discover that your change improves navigational queries by 15% but hurts informational queries by 8%. That is actionable insight you cannot get with 50 queries.

Large Scale: 1000+ Queries

Thousand-query sets are for mature search products with dedicated relevance teams. They give you statistical confidence, enable fine-grained segment analysis, and support long-term trend tracking.

The downside is cost. Every query needs judgments, and judgments are expensive. Human judgments require time and money. Even implicit judgments (click-through data) need volume and infrastructure. A 1000-query set with 10 results per query is 10,000 judgments. Scale accordingly.

In OpenSearch Search Relevance, query sets are stored as indexed documents. The plugin executes every query in the set against your configured searches, collects results, and feeds them into judgment collection. Larger query sets mean longer experiment runtimes. Plan for it.
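A back-of-the-envelope budget helps before you commit to a size. The sketch below is plain arithmetic; `seconds_per_search` is a hypothetical average you would measure on your own cluster, and the judgment count is an upper bound because configurations that return the same document for a query can share one judgment.

```python
def experiment_budget(num_queries, num_configs, results_per_query,
                      seconds_per_search=0.2):
    """Rough cost estimate for one experiment run."""
    executions = num_queries * num_configs
    return {
        "search_executions": executions,
        # Upper bound: overlapping results across configurations need one judgment.
        "max_judgments": executions * results_per_query,
        "estimated_search_seconds": executions * seconds_per_search,
    }

# 1000 queries, one configuration, top 10 results each.
print(experiment_budget(1000, 1, 10)["max_judgments"])  # 10000
```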

Where to Source Your Queries

The best query sets come from real user data. Here are the most common sources and their trade-offs.

Search Query Logs

If you have logging infrastructure, this is the gold standard. Real queries from real users in your actual system. No simulation, no guessing.

Filter your logs by time period (last 30-90 days), remove bots and internal traffic, deduplicate near-identical queries, and sample across the frequency distribution. Each distinct query appears once in your set, even though it may stand for thousands of actual searches, so keep the counts alongside it if you plan to weight by frequency later.
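The cleaning steps can be sketched as follows. The record layout (`timestamp`, `query`, `user_agent`, `ip`) and the bot markers are assumptions about what your logs contain, and lowercasing plus whitespace collapsing is only a crude stand-in for real near-duplicate detection.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical substrings that mark automated traffic; extend for your own logs.
BOT_MARKERS = ("bot", "crawler", "spider")

def normalize(query):
    """Collapse whitespace and lowercase so trivial variants count as one query."""
    return re.sub(r"\s+", " ", query).strip().lower()

def count_clean_queries(records, since, internal_ips=frozenset()):
    """Count normalized queries, skipping old, bot, and internal traffic."""
    counts = Counter()
    for record in records:
        if record["timestamp"] < since:
            continue  # outside the time window
        agent = record.get("user_agent", "").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            continue  # looks like a bot
        if record.get("ip") in internal_ips:
            continue  # internal or test traffic
        query = normalize(record["query"])
        if query:
            counts[query] += 1
    return counts

window_start = datetime(2025, 1, 31) - timedelta(days=30)
records = [
    {"timestamp": datetime(2025, 1, 20), "query": "  Nike Air  Max ",
     "user_agent": "Mozilla/5.0", "ip": "203.0.113.5"},
    {"timestamp": datetime(2025, 1, 21), "query": "nike air max",
     "user_agent": "ExampleBot/1.0", "ip": "203.0.113.6"},
]
print(count_clean_queries(records, window_start))
```

The resulting counts are what you sample from, and they are also the weights you keep for later.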

Analytics Tools

Google Analytics, Amplitude, Mixpanel, or similar tools often capture search queries. Export them, filter them, and use them. The advantage is that analytics tools often include outcome data (did the user convert? did they bounce?), which helps you prioritize which queries matter most.

Customer Support Tickets

Support tickets and help desk queries reveal where search is failing. "I searched for X but could not find Y" is a perfect query for your set. These queries are often high-intent and high-frustration, and tickets understate how common they are because most users give up rather than file one.

Competitor Analysis

For new products without query logs, analyze what users search for on competitor sites or in related forums. Reddit, Quora, and industry-specific communities are rich sources of how people actually talk about your domain. This is less precise than your own logs but better than synthetic queries.

Manual Curation

Some manual curation is always necessary. You need to fill gaps, add edge cases, and ensure coverage. But manual curation should complement data-driven sampling, not replace it. Think of it as seasoning, not the main ingredient.

Query Set Maintenance: Fighting Query Drift

User behavior changes. Seasonal trends, product launches, marketing campaigns, and cultural shifts all alter what people search for. A query set from January is not representative in December.

Set a schedule for query set updates. Monthly reviews for fast-moving products. Quarterly for stable domains. At minimum, annually. When you update, compare the old and new distributions. If the head queries have shifted significantly, your metrics from old experiments may not generalize.
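One simple way to compare the old and new distributions is the overlap between their top-k queries. The sketch below uses Jaccard similarity on the two head sets; the choice of `k` and any threshold for "shifted significantly" are assumptions you need to set for your own traffic.

```python
from collections import Counter

def head_overlap(old_counts, new_counts, k=100):
    """Jaccard similarity between the top-k queries of two log periods (0.0 to 1.0)."""
    old_head = {query for query, _ in old_counts.most_common(k)}
    new_head = {query for query, _ in new_counts.most_common(k)}
    union = old_head | new_head
    return len(old_head & new_head) / len(union) if union else 1.0

january = Counter({"running shoes": 900, "iphone 15": 700, "yoga mat": 50})
december = Counter({"iphone 15": 950, "christmas lights": 800, "yoga mat": 40})
print(f"Head overlap: {head_overlap(january, december, k=2):.2f}")
```

A value near 1.0 means the head is stable. A low value suggests the old query set, and metrics computed on it, may no longer describe current traffic.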

In OpenSearch Search Relevance, query sets are reusable and versioned. You can keep old query sets for historical comparison while running new experiments on updated sets. This is powerful for long-term trend analysis. You can answer questions like: "Has our search quality improved for holiday queries year over year?"

The Relationship Between Query Sets and Experiments

Once you have a query set, it becomes the input to an experiment. Here is how the flow works in OpenSearch Search Relevance.

You create an experiment and link it to a query set. The plugin iterates through each query in the set, executes it against your configured search (or multiple searches for comparison), and collects the results. These results are then judged (by humans or implicit signals) and metrics are computed.

The query set size directly impacts experiment runtime. A 500-query set with 2 search configurations being compared means 1000 query executions. If each query hits a large index with complex aggregations, this can take significant time. The plugin handles this automatically, but you should plan your experiment windows accordingly.

Query sets also determine the granularity of your insights. A well-structured query set lets you segment results by query type, frequency, or intent. You might discover that your new analyzer improves tail queries by 20% but does nothing for head queries. That is a valuable, specific insight you would miss with a poorly designed query set.
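Segmenting is straightforward once you have per-query metrics for both configurations. The sketch assumes you have exported those scores yourself and attached your own segment label to each query; the row layout is hypothetical and the numbers are made up.

```python
from collections import defaultdict

def delta_by_segment(rows):
    """Mean per-query metric change (candidate minus baseline) for each segment."""
    deltas = defaultdict(list)
    for row in rows:
        deltas[row["segment"]].append(row["candidate"] - row["baseline"])
    return {segment: sum(values) / len(values) for segment, values in deltas.items()}

# Illustrative numbers only.
rows = [
    {"query": "iphone 15", "segment": "head", "baseline": 0.82, "candidate": 0.82},
    {"query": "running shoes", "segment": "head", "baseline": 0.75, "candidate": 0.76},
    {"query": "shoes for plantar fasciitis under 100 dollars",
     "segment": "tail", "baseline": 0.30, "candidate": 0.55},
]
for segment, delta in delta_by_segment(rows).items():
    print(f"{segment}: {delta:+.3f}")
```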

Practical Tips for OpenSearch Search Relevance Users

If you are using the OpenSearch Search Relevance plugin, here are specific tips for query set management.

Store metadata with your query sets. The plugin supports name and description fields. Use them. "Holiday 2025 product queries, 250 queries, sourced from Nov-Dec logs" is more useful than "Query Set 1."

Start small and grow. Begin with a 50-query smoke test to validate your experiment setup. Once you trust the pipeline, expand to 200-500 queries for decision-making.

Segment your query sets. If your product has distinct categories, create category-specific query sets. "Electronics queries" and "Clothing queries" should be separate if your search configurations differ between them.

Link query sets to business metrics. The best query sets are tied to business outcomes. If you know that 20% of your revenue comes from "brand name + product" queries, weight those queries heavily in your set. Your search tuning should prioritize what drives value.

Review query sets before every major experiment. Do not reuse a query set blindly. Check if the queries are still relevant, if new problem queries have emerged, and if your product catalog has changed. A query for a discontinued product is wasted effort.

Common Mistakes to Avoid

Over-reliance on synthetic queries. I have said this before, but it bears repeating. Synthetic queries give you synthetic metrics. Real users do not search like product managers.

Ignoring query frequency. Not all queries are equal. A query that happens once a month is not as important as a query that happens a thousand times a day. Your query set should reflect this, either through sampling weights or through explicit frequency-based allocation.

Static query sets. A query set is not a monument. It is a living document. If you have not updated yours in six months, it is probably stale.

Too few judgments per query. A query set with 1000 queries but only 1 judgment per query is less useful than a 100-query set with 10 judgments per query. Judgment depth matters as much as query breadth.

Chasing aggregate metrics blindly. A 0.02 nDCG improvement across 500 queries sounds small. But the aggregate hides where the change happened: a 0.15 improvement on your top 20 revenue-driving queries contributes only 0.006 to the 500-query mean (20 × 0.15 / 500), and it is still a massive win. Segment your analysis.
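The frequency point in this list can be made concrete with a weighted mean. This is a sketch that assumes you kept search counts for each query in the set; it is my own helper, not a metric the plugin reports, and the scores are invented.

```python
def weighted_mean_ndcg(ndcg_by_query, searches_by_query):
    """Mean nDCG where each query counts in proportion to how often it is searched."""
    total_weight = sum(searches_by_query[query] for query in ndcg_by_query)
    if total_weight == 0:
        raise ValueError("search counts sum to zero")
    weighted_sum = sum(
        ndcg_by_query[query] * searches_by_query[query] for query in ndcg_by_query
    )
    return weighted_sum / total_weight

# Illustrative numbers only.
ndcg = {"running shoes": 0.9, "shoes for plantar fasciitis": 0.3}
searches = {"running shoes": 1000, "shoes for plantar fasciitis": 10}

plain = sum(ndcg.values()) / len(ndcg)
print(f"unweighted: {plain:.3f}, weighted: {weighted_mean_ndcg(ndcg, searches):.3f}")
```

Report both numbers. The unweighted mean shows how well you cover distinct needs, and the weighted mean shows what a typical search experiences.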

Conclusion

Query sets are the foundation of search quality evaluation. Everything else (analyzers, scoring algorithms, experiments, metrics) rests on this foundation. A cracked foundation means shaky conclusions.

In OpenSearch Search Relevance, query sets are first-class objects. The plugin gives you the tools to manage, version, and reuse them. But it cannot tell you what queries to include. That requires understanding your users, your data, and your business.

The best query sets are sampled from reality, maintained over time, and tied to business outcomes. They are not perfect, but they are representative. And representative is enough to make better decisions than guessing.

If you have not reviewed your query set recently, do it now. Your metrics will thank you.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.
