Prithvi S

Posted on Jun 13 • Edited on Jul 4

Learning to Rank in Search Relevance: From Feature Engineering to Model Deployment

#lucene #search #java #opensource

Learning to Rank (LTR) transforms search from a hand-tuned relevance function into a machine-learned model that optimizes for business outcomes. Instead of manually setting field weights and boost parameters, LTR uses training data to learn the optimal scoring function from user behavior. This post covers the full pipeline: feature engineering, model training, and deployment in production search systems. The techniques apply to both OpenSearch and Elasticsearch, which share the same underlying Lucene architecture.

The Problem: Why Hand-Tuned Relevance Hits a Ceiling

Traditional search relevance combines lexical scoring (TF-IDF, or BM25, which is the default similarity in current Lucene versions), field boosts, and boolean filters. A typical query might look like this:

{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "wireless headphones",
          "fields": ["title^3", "description^2", "brand^1"]
        }
      },
      "filter": {
        "term": { "category": "electronics" }
      }
    }
  }
}

The title^3 boost multiplies the score of title matches by 3 before the field scores are combined, so a title match counts three times as much as an equally strong match in the brand field. This is an educated guess based on domain knowledge. It might work for 80% of queries, but it fails on edge cases: brand searches where the brand field should dominate, long-tail queries where the description matters more than the title, or queries with synonyms where the boost model does not apply.

The fundamental problem is that a single static weight cannot capture the varying importance of fields across different query types. A user searching for "iPhone 15" cares about exact model matching in the title. A user searching for "good phone for photography" cares about camera specs in the description. A single title^3 boost cannot handle both cases optimally.

When Hand-Tuning Becomes Unmanageable

As a search system grows, the number of query types and field combinations explodes. An e-commerce site might have 50 product categories, each with different relevance patterns. Fashion searches prioritize brand and style. Electronics searches prioritize specs and reviews. Grocery searches prioritize freshness and availability. Maintaining separate boost configurations for each category becomes a maintenance nightmare, and the configurations conflict with each other when a query spans multiple categories.

Learning to Rank solves this by replacing the static boost model with a learned model that adapts to the query context. The model sees the query text, the document fields, and the interaction context (user history, session data, time of day), and produces a relevance score that is optimal for that specific query-document pair.

The LTR Pipeline: Features, Judgments, and Models

The Learning to Rank pipeline has three stages:

Feature extraction - For each query-document pair, compute a set of numeric features that capture relevance signals.
Judgment collection - Gather human or implicit relevance labels for query-document pairs.
Model training - Train a machine learning model to predict relevance scores from features, using the judgments as training targets.

Feature Extraction: What the Model Sees

Features are the input to the LTR model. They must be computable at query time and should capture all signals that might indicate relevance. Common feature categories include:

Query-document text features:

BM25 score for the query against the title field
BM25 score for the query against the description field
BM25 score for the query against the brand field
Exact match count (how many query terms appear verbatim in the document)
Prefix match count (how many query terms match the beginning of a document term)
Synonym match count (how many query terms match via synonym expansion)
TF-IDF score for each query term in each field
Cosine similarity between query vector and document vector (if using dense retrieval)

Document quality features:

Click-through rate (CTR) for the document in the last 30 days
Conversion rate (purchase rate) for the document
Average review score and review count
Document age (how long since it was added to the index)
Inventory availability (in-stock or out-of-stock)
Price and price percentile within the category
Popularity rank (how many times the document was viewed)

Query context features:

Query length (number of terms)
Query category intent (electronics, fashion, etc.) via a classifier
User location (for geo-relevant searches)
Time of day and day of week (for temporal relevance)
User segment (new user vs returning user, premium vs basic)
Session history (what the user searched for and clicked in this session)

Interaction features:

Position bias (documents ranked higher get more clicks even if they are less relevant)
Previous query-document interactions (has the user clicked this document before?)
Dwell time (how long the user spent on the document page after clicking)
Skip rate (how often users skipped over this document without clicking)

In practice, a production LTR system might use 50-200 features. The feature set must be comprehensive enough to capture relevance signals but not so large that model training becomes slow or overfitting occurs. Feature selection and regularization are critical for model quality.

Judgment Collection: Explicit and Implicit Labels

Judgments are the training targets. Each query-document pair needs a relevance label. The labels are typically ordinal: 0 (irrelevant), 1 (somewhat relevant), 2 (relevant), 3 (highly relevant). Some systems use binary labels (relevant/irrelevant) or continuous labels (expected click probability).

Explicit judgments come from human annotators. A panel of judges evaluates query-document pairs and assigns relevance labels. Explicit judgments are accurate but expensive. A typical e-commerce site might need 10,000 judged query-document pairs for a category, which costs $5,000-10,000 in annotation fees. The judgment process must be carefully controlled: judges need guidelines, inter-annotator agreement must be measured (Cohen's kappa > 0.6 is considered acceptable), and edge cases must be escalated to senior annotators.

Implicit judgments come from user behavior. If a user searches for "wireless headphones" and clicks the third result, the implicit signal is that the third result was more relevant than the first and second results (which were skipped). This is the foundation of click-through based LTR. However, implicit judgments are noisy because of position bias: users click higher-ranked results more often regardless of relevance. To correct for position bias, LTR systems use click models like the Cascade Model or the Position-Based Model (PBM) that estimate the probability of a click given the position and the true relevance.
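The "clicked result beats the skipped results above it" heuristic can be sketched in a few lines. This is a minimal illustration, not a full click model: the function name skip_above_pairs and the 0-based position convention are assumptions made for the example, and it applies no position-bias correction.

```python
from typing import List, Tuple


def skip_above_pairs(ranked_doc_ids: List[str],
                     clicked_positions: List[int]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs from one result page.

    Each clicked document is treated as preferred over every
    unclicked document that was ranked above it. Positions are 0-based.
    """
    clicked = set(clicked_positions)
    pairs = []
    for pos in sorted(clicked):
        for above in range(pos):
            if above not in clicked:
                pairs.append((ranked_doc_ids[pos], ranked_doc_ids[above]))
    return pairs


# The user clicked only the third result (index 2).
print(skip_above_pairs(["d1", "d2", "d3", "d4"], clicked_positions=[2]))
# [('d3', 'd1'), ('d3', 'd2')]
```

Nothing is inferred about d4, because the user may never have looked below the clicked result.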

Hybrid approaches combine explicit and implicit judgments. Explicit judgments are used for a small set of carefully selected queries (head queries that drive 80% of traffic). Implicit judgments are used for the long tail. The model is trained on the combined dataset, with explicit judgments weighted more heavily because they are more reliable.

Model Training: Pointwise, Pairwise, and Listwise Approaches

LTR models are trained using one of three paradigms:

Pointwise treats each query-document pair as an independent sample. The model learns to predict the relevance label directly, like a standard regression or classification problem. The loss function is mean squared error (for continuous labels) or cross-entropy (when the ordinal grades are treated as classes). Pointwise models are simple to train but ignore the ranking context: they do not know that a query has multiple documents and that the goal is to order them correctly.

Pairwise treats each pair of documents for the same query as a training sample. The model learns to predict which document is more relevant. The loss function is typically a hinge loss or logistic loss that penalizes the model when it incorrectly orders a pair. Pairwise models capture the relative ordering signal but still do not optimize for the full list quality.

Listwise treats the entire ranked list for a query as a training sample. The model optimizes a list-level metric like Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP). Listwise models optimize the ranking metric most directly, but they are computationally more expensive and harder to train. LambdaRank and its gradient-boosted-tree successor LambdaMART are the most widely used algorithms in this family. Strictly speaking, they compute pairwise gradients weighted by a list-level metric, so they sit between the pairwise and listwise paradigms.

LambdaMART is a gradient boosting model that optimizes NDCG. It works by computing a "lambda" gradient for each document: for every pair of documents with different labels, the gradient of a pairwise loss is scaled by the change in NDCG that swapping the two documents would cause, and each document's lambda is the sum over its pairs. The lambdas therefore focus on pairs that are incorrectly ordered and near the top of the ranking, because those pairs have the largest impact on NDCG. Each boosting round fits a regression tree to the lambdas and adds it to the ensemble.
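To make the lambda idea concrete, here is a small sketch of one common formulation: a pairwise logistic gradient scaled by the absolute NDCG change from swapping the pair. It is for illustration only. The function name lambda_gradients and the sign convention (a positive lambda means "push this score up") are assumptions, and real libraries add details such as sigmoid scaling, truncation, and second-order terms.

```python
import math
from typing import List


def lambda_gradients(scores: List[float], labels: List[int]) -> List[float]:
    """Per-document lambdas for a single query."""
    n = len(scores)
    # Rank documents by current model score (rank 1 = highest score).
    order = sorted(range(n), key=lambda i: -scores[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r

    gain = [2 ** label - 1 for label in labels]
    discount = [1.0 / math.log2(rank[i] + 1) for i in range(n)]
    ideal = sorted(gain, reverse=True)
    idcg = sum(g / math.log2(p + 1) for p, g in enumerate(ideal, start=1))

    lambdas = [0.0] * n
    if idcg == 0:
        return lambdas  # no relevant documents, nothing to learn from

    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where i is more relevant than j
            # |NDCG change| if documents i and j swapped positions.
            delta_ndcg = abs((gain[i] - gain[j]) * (discount[i] - discount[j])) / idcg
            # Pairwise logistic gradient: large when j is scored above i.
            rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
            lambdas[i] += rho * delta_ndcg
            lambdas[j] -= rho * delta_ndcg
    return lambdas


# A relevant document (label 3) scored below an irrelevant one (label 0).
print(lambda_gradients(scores=[0.1, 0.9], labels=[3, 0]))
```

In this example the first lambda is positive and the second is its negative: the relevant document is pushed up and the irrelevant one pushed down by the same amount.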

For production deployment, LambdaMART is often the best choice because it provides a good balance between accuracy and inference speed. A typical LambdaMART model might have 100-500 trees with a maximum depth of 6-8. Inference is fast because each tree is a simple decision tree, and the ensemble prediction is the sum of tree outputs.
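The sum-of-trees prediction is simple enough to sketch directly. The nested-dict tree layout and the feature names below are hypothetical, chosen for readability. Real model files from XGBoost or RankLib use their own formats.

```python
from typing import Dict, List, Union

# A tree is either a leaf value (float) or a split node (dict).
Tree = Union[float, Dict[str, object]]


def eval_tree(node: Tree, features: Dict[str, float]) -> float:
    """Walk one tree from the root to a leaf and return the leaf value."""
    while isinstance(node, dict):
        go_left = features[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def eval_ensemble(trees: List[Tree], features: Dict[str, float]) -> float:
    """The ensemble score is the sum of the individual tree outputs."""
    return sum(eval_tree(tree, features) for tree in trees)


trees = [
    {"feature": "title_bm25", "threshold": 5.0,
     "left": 0.1,
     "right": {"feature": "ctr", "threshold": 0.05, "left": 0.4, "right": 0.9}},
    {"feature": "ctr", "threshold": 0.02, "left": -0.2, "right": 0.3},
]
print(eval_ensemble(trees, {"title_bm25": 7.0, "ctr": 0.08}))
```

Each document visits at most one node per level of each tree, which is why inference cost grows with the number of trees times the depth rather than with the total node count.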

Feature Engineering for Search: Practical Techniques

Feature engineering is the most critical and time-consuming part of LTR. The model can only be as good as the features it sees. Here are practical techniques for building high-quality feature sets.

Query Classification for Intent-Aware Features

Not all queries are the same. A user searching for "iPhone 15" has a navigational intent: they know what they want and are looking for a specific product. A user searching for "good phone for photography" has an informational intent: they are researching options. A user searching for "cheap phone" has a transactional intent: they want to compare prices.

Query classification adds a feature that captures the intent. A simple classifier can be trained on query text alone: short queries with brand names are navigational, long queries with descriptive terms are informational, queries with price-related terms are transactional. The intent feature interacts with other features: for navigational queries, the exact title match score is highly weighted. For informational queries, the description BM25 score and review quality features are more important.

Normalization and Scaling

Features have different scales. BM25 scores might range from 0 to 30, while CTR ranges from 0 to 0.1. If the model is a neural network, feature scaling is critical. If the model is a tree-based model like LambdaMART, scaling matters much less because tree splits depend only on the ordering of feature values, not on their magnitude. However, even for trees, extreme outliers can cause the model to overfit to rare cases. Log-transforming CTR (log(1 + CTR)) and clipping BM25 scores to a maximum of 50 are common preprocessing steps.
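The two preprocessing steps mentioned above can be sketched as follows. The naming convention (feature names ending in _bm25, a feature called ctr) is an assumption made for the example.

```python
import math
from typing import Dict


def preprocess(features: Dict[str, float], bm25_cap: float = 50.0) -> Dict[str, float]:
    """Clip BM25-style scores and log-transform CTR. Other features pass through."""
    out = dict(features)
    for name, value in features.items():
        if name.endswith("_bm25"):
            out[name] = min(value, bm25_cap)   # clip extreme outliers
        elif name == "ctr":
            out[name] = math.log1p(value)      # log(1 + CTR)
    return out


print(preprocess({"title_bm25": 72.0, "ctr": 0.04, "review_count": 120.0}))
```

The same function must run both when building the training set and at query time, otherwise the model sees different feature distributions in the two settings.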

Temporal Features and Recency Bias

For time-sensitive content (news, social media, product launches), recency is a strong relevance signal. A simple recency feature is 1 / (days_since_publication + 1). But recency should not dominate all queries. A query for "iPhone 15" should not be affected by recency because the iPhone 15 is a specific product. A query for "best phone 2024" should be affected by recency because newer reviews are more relevant. The interaction between query intent and recency features can be captured by adding a query-specific recency weight, which the model learns from the training data.
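A minimal sketch of the recency feature, plus one possible way to expose the intent interaction to the model. The interaction feature (recency multiplied by a time-sensitivity flag) is an assumption for illustration. A tree model can also learn the interaction on its own if it receives both inputs separately.

```python
from datetime import date
from typing import Dict


def recency(published: date, today: date) -> float:
    """1 / (days_since_publication + 1); future dates count as zero days old."""
    days = max((today - published).days, 0)
    return 1.0 / (days + 1)


def recency_features(published: date, today: date,
                     query_is_time_sensitive: bool) -> Dict[str, float]:
    r = recency(published, today)
    return {
        "recency": r,
        # Zero for queries like "iPhone 15", equal to recency for "best phone 2024".
        "recency_x_time_sensitive": r * float(query_is_time_sensitive),
    }


print(recency_features(date(2024, 1, 1), date(2024, 1, 10), query_is_time_sensitive=True))
```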

Cross-Field Features

A query might match multiple fields, and the combination of matches is a signal. For example, a query "Nike running shoes" that matches the brand field ("Nike") and the category field ("running shoes") is more relevant than a query that only matches the title field. A cross-field feature can be the product of the brand match score and the category match score, or a binary indicator that both fields matched. These interaction features capture the semantic coherence of the query-document match.
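Both variants of the cross-field feature described above fit in one small function. The feature names are hypothetical.

```python
from typing import Dict


def cross_field_features(brand_score: float, category_score: float) -> Dict[str, float]:
    """Product of the two match scores, plus a both-fields-matched indicator."""
    both_matched = brand_score > 0 and category_score > 0
    return {
        "brand_x_category": brand_score * category_score,
        "brand_and_category_matched": 1.0 if both_matched else 0.0,
    }


print(cross_field_features(brand_score=2.5, category_score=4.0))
# {'brand_x_category': 10.0, 'brand_and_category_matched': 1.0}
```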

Model Deployment: From Training to Query-Time Scoring

Once the model is trained, it must be deployed to the search engine. The deployment architecture depends on whether the model is a simple linear model, a tree ensemble, or a neural network.

Linear Model Deployment: Rescoring the Top-N Results

A linear model has the form score = w1 * f1 + w2 * f2 + ... + wn * fn. The weights are learned during training. Deployment is straightforward: the search engine computes the features for each query-document pair and applies the linear combination. This can be done in a rescore phase, which re-scores only the top-N results (the rescore window) from the initial query phase. In OpenSearch and Elasticsearch the rescore runs on each shard, over that shard's top window_size hits, before the results are merged at the coordinating node.

The rescore phase in OpenSearch and Elasticsearch allows a custom script to re-rank the top results. A Painless script can compute the linear combination of features. However, Painless is not optimized for complex models, and the script execution overhead can be significant. For linear models with 50 features, a custom rescore query is usually fast enough. For tree ensembles, the script becomes unwieldy because each tree requires multiple if-else branches.
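Outside the engine, the same logic looks like the sketch below: score with a weighted sum, then reorder only the head of the result list. The hit layout (a dict with id and features) and the function names are assumptions for illustration. Inside OpenSearch or Elasticsearch the equivalent would be expressed as a rescore query.

```python
from typing import Dict, List


def linear_score(weights: Dict[str, float], features: Dict[str, float],
                 bias: float = 0.0) -> float:
    """score = bias + w1*f1 + ... + wn*fn; missing features count as 0."""
    return bias + sum(w * features.get(name, 0.0) for name, w in weights.items())


def rescore_top_n(hits: List[dict], weights: Dict[str, float], n: int) -> List[dict]:
    """Re-rank the first n hits by the linear model; leave the tail untouched."""
    head = sorted(hits[:n],
                  key=lambda hit: linear_score(weights, hit["features"]),
                  reverse=True)
    return head + hits[n:]


hits = [
    {"id": "a", "features": {"title_bm25": 1.0, "ctr": 0.01}},
    {"id": "b", "features": {"title_bm25": 5.0, "ctr": 0.03}},
    {"id": "c", "features": {"title_bm25": 9.0, "ctr": 0.09}},
]
weights = {"title_bm25": 1.0, "ctr": 10.0}
print([hit["id"] for hit in rescore_top_n(hits, weights, n=2)])
# ['b', 'a', 'c']
```

Document c would score highest, but it sits outside the rescore window of 2, so it keeps its original position. This is the main limitation of rescoring: the first-phase query must already retrieve the good candidates.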

Tree Ensemble Deployment: Native Plugin or External Service

Tree ensembles like LambdaMART are harder to deploy because each tree is a series of if-else decisions. A 100-tree model with depth 6 can have up to 63 split nodes per tree (2^6 - 1), or about 6,300 across the ensemble, and scoring a single document walks roughly 600 of them (6 per tree). Writing this out in a Painless script is slow and error-prone. The standard approach is to use a native plugin that can evaluate the tree ensemble efficiently.

OpenSearch has an LTR plugin (a port of the open source Elasticsearch LTR plugin) that provides native support for tree ensemble models. The plugin keeps feature sets and models in a feature store, which is backed by an index rather than the cluster state, and exposes an sltr (stored LTR) query type. The query computes features using the standard query DSL and then applies the named model to score the results. The plugin documents support for RankLib and XGBoost model formats; models trained with other libraries such as LightGBM may need to be converted first.

For Elasticsearch, the LTR plugin was historically available but has been less maintained. An alternative is to deploy the model in an external service. The search engine returns the top 1000 results with all features as metadata. The external service evaluates the model and returns the re-ranked list. The trade-off is latency: an external service call adds 5-50ms depending on network latency. For applications where latency is critical, the native plugin is preferred.

Neural Network Deployment: ONNX Runtime Integration

Neural LTR models (e.g., BERT-based cross-encoders that jointly encode the query and document) are too complex for tree-based evaluation. These models require a deep learning runtime. The ONNX Runtime is a common choice because it supports models from PyTorch, TensorFlow, and other frameworks. Deployment options include:

Native ONNX plugin - Some search engines have plugins that embed the ONNX runtime. The plugin evaluates the model at query time within the search process. This is fast but requires the plugin to be maintained and compatible with the search engine version.
External inference service - A separate microservice hosts the ONNX model and receives query-document pairs via gRPC or HTTP. The search engine calls the service for the top-N results. This is flexible but adds network latency.
Pre-computed embeddings - For dense retrieval models, the document embeddings are pre-computed at index time and stored in a vector field. The query embedding is computed at query time, and the search engine performs a k-NN vector search. This is the fastest approach because the model inference is done once per query, not once per document. However, it only works for bi-encoder models (where query and document are encoded separately), not cross-encoder models (where the query and document are encoded together).
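The bi-encoder pattern from the last option can be sketched with a brute-force search: document vectors are computed ahead of time, and only the query vector is computed per request. This is an illustration under simplifying assumptions. The tiny hand-written vectors stand in for real model output, and a production system would use the engine's approximate k-NN index instead of a full scan.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Exact k-NN by cosine similarity over pre-computed document vectors."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Pretend these were produced by the document encoder at index time.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], doc_vecs, k=2))
# ['a', 'c']
```

A cross-encoder cannot use this shortcut, because its score depends on the query and document being fed through the model together.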

Training Data and Evaluation

Collecting Training Data from Production Logs

Production logs are the primary source of training data for LTR. The log pipeline should capture:

Query text and timestamp
All results returned for the query (with their positions)
All clicks (with their positions and dwell times)
Conversions (purchases, sign-ups, etc.) linked to clicks
User context (user ID, session ID, location, device)

The logs must be joined and processed to create query-document pairs with features and labels. The processing pipeline typically runs in batch mode (e.g., daily) and uses Spark or Flink to aggregate the logs. The output is a training dataset in a format like SVMLight or LibSVM, where each line is a query-document pair with features and a label.
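For reference, a line in the SVMLight-style ranking format looks like "label qid:N 1:value 2:value ... # comment". The sketch below formats one query-document pair. The helper name to_svmlight is an assumption, and feature indices are taken to be 1-based and dense.

```python
from typing import Sequence


def to_svmlight(label: int, qid: int, features: Sequence[float],
                comment: str = "") -> str:
    """Format one query-document pair as an SVMLight-style ranking line."""
    parts = [str(label), f"qid:{qid}"]
    # :g keeps the output compact (6 significant digits by default).
    parts += [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line


print(to_svmlight(3, 42, [12.5, 0.031, 1.0], comment="doc_17"))
# 3 qid:42 1:12.5 2:0.031 3:1 # doc_17
```

The qid field is what lets the trainer group rows by query, which pairwise and listwise methods require.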

Handling Position Bias in Click Data

Position bias is the biggest challenge in implicit judgment collection. A result at position 1 might get clicked 10-20% of the time even if it is irrelevant, while a result at position 10 might get clicked only 1-2% of the time even if it is highly relevant. Without correction, the model learns to rank documents higher simply because they were ranked higher before, creating a self-reinforcing loop.

The standard correction is the Position-Based Model (PBM). PBM assumes that the probability of a click at position k is the product of two probabilities: the probability that the user examines the position (which decreases with position) and the probability that the document is relevant given the query. The examination probability has to be estimated from the data. Raw click rates per position are not enough on their own, because they mix examination with the relevance of whatever was shown there, so the estimate typically comes from randomizing result order for a small share of traffic or from fitting the model with expectation-maximization. The relevance probability is then estimated as click_probability / examination_probability.

For example, if position 1 is examined 80% of the time and clicked 20% of the time, the estimated relevance is 20% / 80% = 25%. If position 5 is examined 40% of the time and clicked 10% of the time, the estimated relevance is 10% / 40% = 25%. Both positions have the same relevance estimate, which is correct even though the raw click rates differ.
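The arithmetic in this example is a single division. A minimal sketch, assuming the examination probabilities have already been estimated (the function name and the cap at 1.0 are illustrative choices):

```python
def pbm_relevance(click_rate: float, examination_prob: float) -> float:
    """Estimated relevance under PBM: click_probability / examination_probability."""
    if examination_prob <= 0:
        raise ValueError("examination probability must be positive")
    # Noisy estimates can exceed 1, so cap the result at 1.0.
    return min(click_rate / examination_prob, 1.0)


print(pbm_relevance(0.20, 0.80))  # position 1
print(pbm_relevance(0.10, 0.40))  # position 5
```

Both calls produce the same estimate, matching the worked numbers above.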

PBM requires enough data to estimate the examination probabilities accurately. For a new search system with low traffic, the estimates are noisy. In this case, explicit judgments are needed until the click volume is sufficient. A common threshold is 1,000 clicks per query position before the PBM estimates are reliable.

Evaluation Metrics: NDCG, MAP, and MRR

The quality of an LTR model is evaluated using ranking metrics. The most common metric is NDCG (Normalized Discounted Cumulative Gain), which measures the quality of the ranked list by assigning higher credit to relevant documents that appear higher in the list. NDCG is computed as:

Compute the DCG: DCG = sum((2^relevance - 1) / log2(position + 1)) for all positions.
Compute the ideal DCG (IDCG): the DCG of the perfect ranking.
NDCG = DCG / IDCG.

NDCG ranges from 0 to 1, where 1 is a perfect ranking. A typical improvement from a baseline model to an LTR model is 0.05-0.15 NDCG points. For a search engine with 1 million queries per day, a 0.1 NDCG improvement translates to a significant increase in click-through rate and conversion.
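The three steps translate directly into code. This sketch uses the same gain and discount as the formula above. The optional cutoff k is an addition for computing NDCG@k.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[int]) -> float:
    """DCG = sum((2^relevance - 1) / log2(position + 1)), positions start at 1."""
    return sum((2 ** rel - 1) / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[int], k: Optional[int] = None) -> float:
    """NDCG of a ranked list of relevance labels, optionally truncated at k."""
    ranked = list(relevances)
    ideal = sorted(ranked, reverse=True)   # the perfect ranking
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else 0.0


print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 3, 2, 1]))  # best document pushed to position 2
```

A query with no relevant documents has an IDCG of zero. Returning 0.0 in that case is a convention, and some evaluation tools skip such queries instead.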

MAP (Mean Average Precision) is another common metric, particularly for binary relevance. It computes the average precision at each position where a relevant document appears, then averages across queries. Because it uses binary labels, MAP cannot distinguish a highly relevant document from a marginally relevant one the way NDCG can, but it is easier to explain to non-experts.

MRR (Mean Reciprocal Rank) is used for tasks where only the first relevant document matters, such as question answering or navigational search. MRR is the average of 1 / rank_of_first_relevant across queries. A perfect MRR is 1.0, which means the first result is always relevant.
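Both metrics are short enough to sketch per query. The inputs are binary relevance flags in ranked order. The average precision here divides by the number of relevant documents that appear in the list, which assumes the list contains all relevant documents for the query.

```python
from typing import Sequence


def average_precision(relevant: Sequence[int]) -> float:
    """Mean of precision@k over the positions k that hold a relevant document."""
    hits = 0
    total = 0.0
    for pos, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0


def reciprocal_rank(relevant: Sequence[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for pos, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / pos
    return 0.0


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


queries = [[1, 0, 1], [0, 0, 1]]
print("MAP:", mean([average_precision(q) for q in queries]))
print("MRR:", mean([reciprocal_rank(q) for q in queries]))
```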

A/B Testing: The Final Validation

Offline metrics like NDCG are useful for model development, but they do not guarantee business impact. The ultimate validation is an A/B test in production. The test should split traffic between the baseline model (hand-tuned boosts) and the LTR model. The metrics to track are:

Click-through rate (CTR) - the percentage of queries that result in a click.
Conversion rate - the percentage of clicks that result in a purchase or other goal.
Revenue per query - total revenue divided by number of queries.
Dwell time - average time spent on clicked pages.
Zero-result rate - percentage of queries that return no results.

The A/B test should run for at least 2 weeks to capture weekly patterns. The required sample size depends on the baseline rate and the smallest effect worth detecting: 10,000 queries per variant is a reasonable floor, but detecting a small lift in conversion rate often needs far more. The winner should be determined by the primary business metric (usually revenue per query or conversion rate), not just CTR. A model that increases CTR but decreases conversion is driving clicks to less relevant results, which is worse than the baseline.
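For rate metrics such as CTR or conversion rate, a two-proportion z-test is one simple way to check significance. The sketch below uses only the standard library. The counts are made up for illustration, and a metric like revenue per query would need a different test (for example a t-test or a bootstrap), since it is not a proportion.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are 0% or 100%: no evidence of a difference
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 300 vs 360 conversions out of 10,000 queries each.
z, p = two_proportion_z_test(300, 10_000, 360, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the difference is significant at the 5% level. A smaller lift at the same traffic volume would not be, which is why the sample size should be planned from the expected effect size.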

Common Pitfalls and Production Considerations

Feature Drift and Model Retraining

Features change over time. A new product might have no CTR data initially, but after a month it has enough clicks to become a strong feature. Seasonal products (holiday decorations, swimwear) have CTR patterns that change throughout the year. If the model is trained on summer data, it will underperform in winter because the feature distributions have shifted.

Model retraining should be automated. A typical schedule is weekly retraining for fast-moving catalogs and monthly retraining for stable catalogs. The retraining pipeline should use a sliding window of data (e.g., the last 30 days) to keep the model current. Automated evaluation should compare the new model's offline NDCG against the current model's NDCG, and only deploy if the improvement exceeds a threshold (e.g., 0.01 NDCG). This prevents deploying models that are worse than the current one due to data quality issues or training instability.

The Cold Start Problem

New documents have no historical features (CTR, review count, popularity). The model cannot evaluate them accurately. This is the cold start problem. Common solutions include:

Default feature values - Use the average CTR and review count for the category as default values. This gives new documents a fair chance to rank based on their text features.
Exploration - Randomly promote new documents to higher positions for a small percentage of traffic to collect click data. This is a form of multi-armed bandit optimization that balances exploration (learning about new documents) and exploitation (ranking known good documents).
Content-based features - Use text features and document quality features that do not depend on historical data. A new document with a strong title match and good description should still rank reasonably well even without CTR data.
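The first two solutions can be sketched together. This is a simplified illustration: the function names, the None-means-missing convention, and the fixed epsilon are all assumptions, and a real bandit policy would adapt the exploration rate per document.

```python
import random
from typing import Dict, List, Optional, Set


def fill_cold_start(doc_features: Dict[str, Optional[float]],
                    category_defaults: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Replace missing historical features with category-level averages."""
    filled = dict(doc_features)
    for name, default in category_defaults.items():
        if filled.get(name) is None:
            filled[name] = default
    return filled


def explore(ranked_ids: List[str], new_ids: Set[str], epsilon: float,
            slot: int, rng: random.Random) -> List[str]:
    """With probability epsilon, move one random new document into `slot`."""
    candidates = [doc_id for doc_id in ranked_ids if doc_id in new_ids]
    if not candidates or rng.random() >= epsilon:
        return list(ranked_ids)  # exploit: keep the model's ranking
    pick = rng.choice(candidates)
    rest = [doc_id for doc_id in ranked_ids if doc_id != pick]
    rest.insert(slot, pick)
    return rest


print(fill_cold_start({"title_bm25": 8.0, "ctr": None},
                      {"ctr": 0.02, "review_count": 15.0}))
print(explore(["a", "b", "c"], new_ids={"c"}, epsilon=0.05, slot=1,
              rng=random.Random()))
```

Logging which impressions were exploration placements is important, so that the resulting clicks can be weighted correctly when they are later used as training data.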

Latency Budget and Feature Computation

LTR adds latency to search. Feature computation requires running multiple queries (one per feature) or extracting features from the document metadata. A model with 100 features can add 50-200ms to query time if features are computed naively. The latency budget is critical for production.

Optimization techniques include:

Pre-computed features - Store feature values in the index at indexing time. For example, the CTR and review count can be updated daily and stored as numeric fields. The query time feature computation is then a simple doc value lookup.
Feature caching - Cache feature values for frequently accessed documents. A document that appears in the top 10 results for 100 queries per day should have its features cached.
Feature selection - Reduce the feature set to the most impactful features. A model with 20 well-chosen features often performs nearly as well as a model with 100 features, and the latency is much lower.
Asynchronous feature computation - Compute features in parallel using multiple threads. OpenSearch and Elasticsearch support parallel query execution for some feature types.

Fairness and Bias in LTR

LTR models can inadvertently learn biases from the training data. If the training data reflects historical bias (e.g., products from dominant brands get more clicks because they are ranked higher), the model will amplify that bias. Fairness-aware LTR is an active research area, but practical steps include:

Demographic parity constraints - Ensure that the ranking distribution is similar across different groups (e.g., small brands vs large brands).
Counterfactual evaluation - Evaluate what the ranking would have been if the model had been trained on unbiased data, using inverse propensity scoring to correct for historical bias.
Diverse result sets - Enforce diversity constraints that ensure the top results include a mix of brands, prices, and styles, rather than a homogeneous set that the model over-optimized.

Conclusion

Learning to Rank transforms search relevance from a manual craft into a data-driven science. The pipeline - feature extraction, judgment collection, model training, and deployment - requires careful engineering at each stage. Feature engineering is the most important step: the model can only learn from the signals you provide. Judgment collection must correct for position bias and combine explicit and implicit labels. Model training should use listwise optimization for ranking quality. Deployment must respect the latency budget and be monitored for drift and bias.

For search teams, LTR is a force multiplier. A well-trained LTR model can improve NDCG by 10-20% and business metrics by 5-15%, which translates to millions in revenue for large-scale e-commerce. But the investment is significant: feature engineering, annotation pipelines, model training infrastructure, and A/B testing culture are all required. The teams that succeed are those that treat LTR as a product, not a one-time project, with continuous retraining, evaluation, and iteration.

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open source enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.
