A_Lucas

Enhancing Search Accuracy with RRF (Reciprocal Rank Fusion) in Alibaba Cloud Elasticsearch 8.x

Introduction to Enhanced Ranking with RRF in Elasticsearch

Elasticsearch 8.x adds Reciprocal Rank Fusion (RRF), a method for merging and re-ranking multiple result sets. Earlier versions already offered relevance ranking with BM25 and recall sorting through vector similarity. RRF keeps both and combines their rankings into a single, more precise result list. This article walks through the integration using a detailed example.

Getting Started with Elasticsearch

To embark on this journey, you'll need an Elasticsearch cluster running version 8.8 or later. Alibaba Cloud Elasticsearch has made clusters with the latest version 8.x available for immediate purchase.

1) Select version 8.8 or 8.9 and configure your nodes.
2) Navigate through and select the appropriate network configuration.
3) Proceed to checkout with a simple click.

What is Elasticsearch?

A Closer Look at RRF Testing

Introduction to RRF

RRF scores each document with a simple formula. In it, 'k' is a ranking constant that defaults to 60, 'R' is the set of ranked result lists (one list per query), and 'r(d)' is the rank of document 'd' in a given list 'r', counting from one. A document's final score is the sum of its reciprocal ranks across all lists in which it appears.

Algorithmic formula:

$$\mathrm{RRFscore}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$

RRF Ranking Example

In a scenario where documents are ranked based on both BM25 and dense embedding, RRF seamlessly blends the outcomes to produce an integrated and improved ranking.

| BM25 Rank | Dense Embedding Rank | RRF Result (k=0) |
| --- | --- | --- |
| A: 1 | B: 1 | B: 1/2 + 1/1 = 1.5 |
| B: 2 | C: 2 | A: 1/1 + 1/3 = 1.33 |
| C: 3 | A: 3 | C: 1/3 + 1/2 = 0.83 |
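
The example can be checked with a few lines of Python. This is a minimal sketch of the scoring rule described above, not Elasticsearch's internal implementation; the function name `rrf_scores` and the variable names are chosen here for illustration only.

```python
from collections import defaultdict


def rrf_scores(rankings, k=60):
    """Sum 1 / (k + rank) over every ranked list that contains the document."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start from one, as in the definition of r(d).
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return dict(scores)


# The two rankings from the example table.
bm25_ranking = ["A", "B", "C"]
dense_ranking = ["B", "C", "A"]

# k=0 matches the table; the default of 60 flattens the score differences.
scores = rrf_scores([bm25_ranking, dense_ranking], k=0)
fused = sorted(scores, key=scores.get, reverse=True)

for doc in fused:
    print(doc, round(scores[doc], 2))
# B 1.5
# A 1.33
# C 0.83
```

Document B wins because it ranks well in both lists, while A's first place under BM25 is offset by its third place in the dense ranking.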

Data and Model Readiness

Following the method described in ESRE Series (I), we deployed the text_embedding model with Eland, uploaded the initial dataset through Kibana, configured the text-embeddings pipeline, and rebuilt the index to produce indexed data that includes vectors.
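
As a rough sketch of what the pipeline and index rebuild could look like, the Python below only builds and prints the two request bodies; it does not contact a cluster. The model ID, pipeline name, and destination index come from this article. The source index name `collection`, the source field `text`, and the model input field `text_field` are assumptions for illustration and may differ in your setup.

```python
import json

# Names taken from the article.
MODEL_ID = "sentence-transformers__msmarco-minilm-l-12-v3"
PIPELINE_ID = "text-embeddings"
DEST_INDEX = "collection-with-embeddings"

# Assumptions for illustration only; the article does not state them.
SOURCE_INDEX = "collection"
SOURCE_TEXT_FIELD = "text"
MODEL_INPUT_FIELD = "text_field"


def build_pipeline_body(model_id, source_field, model_input_field):
    """Ingest pipeline with one inference processor that embeds the text."""
    return {
        "description": "Text embedding pipeline",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    # The vector is stored under text_embedding.predicted_value,
                    # the field the kNN queries in this article search.
                    "target_field": "text_embedding",
                    "field_map": {source_field: model_input_field},
                }
            }
        ],
    }


def build_reindex_body(source_index, dest_index, pipeline_id):
    """Reindex request that routes every document through the pipeline."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "pipeline": pipeline_id},
    }


pipeline_body = build_pipeline_body(MODEL_ID, SOURCE_TEXT_FIELD, MODEL_INPUT_FIELD)
reindex_body = build_reindex_body(SOURCE_INDEX, DEST_INDEX, PIPELINE_ID)

print("PUT _ingest/pipeline/" + PIPELINE_ID)
print(json.dumps(pipeline_body, indent=2))
print("POST _reindex")
print(json.dumps(reindex_body, indent=2))
```

The printed bodies can be pasted into the Kibana console after adjusting the assumed names.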

What is Vector Search and Embedding Model?

Evaluating Query Impact

For the assessment, we selected one query from the passage ranking task of the TREC 2019 Deep Learning Track and compared the results of three techniques: text search, vector search, and RRF fusion. The query used for all three is "hydrogen is a liquid below what temperature".

// RRF Mixed Arrangement Query
GET collection-with-embeddings/_search
{
  "size": 10,
  "query": {
    "query_string": {
      "query": "hydrogen is a liquid below what temperature"
    }
  },
  "knn": [
    {
      "field": "text_embedding.predicted_value",
      "k": 10,
      "num_candidates": 100,
      "query_vector_builder": {
        "text_embedding": {
          "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
          "model_text": "hydrogen is a liquid below what temperature"
        }
      }
    }
  ],
  "_source": [
    "id"
  ],
  "rank": {
    "rrf": {
      "window_size": 10,
      "rank_constant": 1
    }
  }
}

//vector search
GET collection-with-embeddings/_search
{
  "size": 10,
  "knn": [
    {
      "field": "text_embedding.predicted_value",
      "k": 10,
      "num_candidates": 100,
      "query_vector_builder": {
        "text_embedding": {
          "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
          "model_text": "hydrogen is a liquid below what temperature"
        }
      }
    }
  ],
  "_source": [
    "id"
  ] 
}

//text search
GET collection-with-embeddings/_search
{
  "size": 10,
  "query": {
    "query_string": {
      "query": "hydrogen is a liquid below what temperature"
    }
  },
  "_source": [
    "id"
  ] 
}
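
Because the three requests must share the same query text for the comparison to be fair, it can help to generate them from one place. The following is a hypothetical helper, not part of any Elasticsearch client; the function names are invented here, and the code only builds dictionaries that mirror the request bodies above.

```python
import json

MODEL_ID = "sentence-transformers__msmarco-minilm-l-12-v3"
VECTOR_FIELD = "text_embedding.predicted_value"


def text_clause(query_text):
    """Lexical (BM25) part of the request."""
    return {"query_string": {"query": query_text}}


def knn_clause(query_text, k=10, num_candidates=100):
    """Vector part; the deployed model embeds the query text at search time."""
    return {
        "field": VECTOR_FIELD,
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": query_text}
        },
    }


def build_search_bodies(query_text, size=10, rank_constant=1):
    """Return the text, vector, and RRF request bodies for one query string."""
    text = {"size": size, "query": text_clause(query_text), "_source": ["id"]}
    vector = {"size": size, "knn": [knn_clause(query_text, k=size)], "_source": ["id"]}
    rrf = {
        "size": size,
        "query": text_clause(query_text),
        "knn": [knn_clause(query_text, k=size)],
        "_source": ["id"],
        # Same rank settings as the RRF request in this article.
        "rank": {"rrf": {"window_size": size, "rank_constant": rank_constant}},
    }
    return {"text": text, "vector": vector, "rrf": rrf}


bodies = build_search_bodies("hydrogen is a liquid below what temperature")
print(json.dumps(bodies["rrf"], indent=2))
```

Each body can be sent to `GET collection-with-embeddings/_search` from the Kibana console or from whichever client you use.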

The three query types returned results of varying accuracy, graded on a scale from 0 ('not relevant') to 3 ('completely relevant'). The rankings show that, by fusing the vector and text results, RRF promotes relevant documents that one method alone misses. Document "7911557" is absent from the vector results but appears third in the fused list, and document "6080460", which the text query overlooked, appears fifth. The fused list therefore recalls relevant documents that either single method would have left out.

| RRF Paragraph ID | RRF Accuracy | Vector Paragraph ID | Vector Accuracy | Text Paragraph ID | Text Accuracy |
| --- | --- | --- | --- | --- | --- |
| 8588222 | 0 | 8588222 | 0 | 7911557 | 3 |
| 8588219 | 3 | 8588219 | 3 | 8588219 | 3 |
| 7911557 | 3 | 6080460 | 3 | 8588222 | 0 |
| 128984 | 3 | 128984 | 3 | 2697752 | 2 |
| 6080460 | 3 | 4254815 | 1 | 128984 | 3 |
| 2697752 | 2 | 6343521 | 1 | 1721142 | 0 |
| 4254815 | 1 | 1020793 | 0 | 8588227 | 0 |
| 1721142 | 0 | 4254811 | 3 | 302210 | 1 |
| 6343521 | 1 | 1959030 | 0 | 2697746 | 2 |
| 8588227 | 0 | 4254813 | 1 | 7350325 | 0 |
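
As a sanity check, the fused column can be approximated offline from the other two. The sketch below applies the RRF rule with a rank constant of 1 (the value used in the request) to the vector and text ID lists from the table. It is an illustration under that assumption, not Elasticsearch's own code, and the function name `rrf_fuse` is invented here.

```python
def rrf_fuse(rankings, rank_constant):
    """Return (doc_id, score) pairs sorted by descending RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# ID columns copied from the results table.
vector_ids = ["8588222", "8588219", "6080460", "128984", "4254815",
              "6343521", "1020793", "4254811", "1959030", "4254813"]
text_ids = ["7911557", "8588219", "8588222", "2697752", "128984",
            "1721142", "8588227", "302210", "2697746", "7350325"]
reported_rrf_ids = ["8588222", "8588219", "7911557", "128984", "6080460",
                    "2697752", "4254815", "1721142", "6343521", "8588227"]

fused = rrf_fuse([vector_ids, text_ids], rank_constant=1)
fused_ids = [doc_id for doc_id, _ in fused]

# The first seven positions have distinct scores and can be compared directly.
print(fused_ids[:7] == reported_rrf_ids[:7])  # True
```

Positions eight to ten involve documents with identical scores (for example, "1721142" and "6343521" both score 1/7), so their order depends on how ties are broken and is not compared here.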

By combining text and vector search through RRF, Elasticsearch delivers a more accurate ranking than either method provides on its own. You can try this yourself on Alibaba Cloud Elasticsearch's public cloud service.

30-Day Free Trial: Help You Implement the Latest Version of Elasticsearch

Search and Analytics Service Elasticsearch Version: Alibaba Cloud Elasticsearch is a fully managed Elasticsearch cloud service built on the open-source Elasticsearch, supporting out-of-the-box functionality and pay-as-you-go while being 100% compatible with open-source features. Not only does it provide the cloud-ready components of the Elastic Stack, including Elasticsearch, Logstash, Kibana, and Beats, but it also partners with Elastic to offer the free X-Pack (Platinum level advanced features) commercial plugin. This integration includes advanced features such as security, SQL, machine learning, alerting, and monitoring, and is widely used in scenarios such as real-time log analysis, information retrieval, and multi-dimensional data querying and statistical analysis.

For more information about Elasticsearch, please visit https://www.alibabacloud.com/en/product/elasticsearch.
