Opensearch as a Vector Database for Semantic Search

LLM, RAG, embeddings, vector database: a couple of years ago these were all buzzwords to grab everyone's attention, but not anymore. Using an effective vector database to perform Retrieval Augmented Generation or semantic search is mainstream now, and with OpenSearch you can make the process simpler.

No Extra Step for Embeddings

OpenSearch ingest pipelines make it easy to generate vector embeddings for data in flight. This means two things:

  1. You don't need to write custom logic to convert the text content to a vector type and store it
  2. You don't need to write custom logic to convert the search query to a vector type

The ingest pipeline makes use of the built-in sentence transformer models provided by OpenSearch to generate the embeddings for the text content. OpenSearch comes bundled with a collection of pre-trained models from Hugging Face, and the model that best suits your purpose can be selected to generate the embeddings.

The Setup

Step - 1: Model Deployment

Let's start by creating a new model group. Use OpenSearch Dashboards to execute the following request to create a custom model group:

POST /_plugins/_ml/model_groups/_register
{
  "name": "optimus_embedding_group",
  "description": "A custom and powerful model group"
}

On completion of this request, a model group ID is returned. Keep that group ID safe, as it's needed for the model registration.
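
The response should look something like this (the ID is a placeholder):

{
  "model_group_id": "<generated model group ID>",
  "status": "CREATED"
}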

The next part is to register an instance of the pre-trained model that you can then deploy. This step requires you to choose the right model for your use case. As we are covering semantic search in this article, I have chosen msmarco-distilbert-base-tas-b, which generates 768-dimensional embeddings well suited for semantic search.

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "<model group ID obtained from the previous step>",
  "model_format": "TORCH_SCRIPT"
}

This request is asynchronous: it immediately returns a task ID for the registration task that runs in the background. Take the task ID and check its status periodically using the following request:

GET /_plugins/_ml/tasks/<task ID from the previous request>

The response should show the state as COMPLETED and include a model_id field. Keep note of this model ID, as it's required in every step that follows.
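
For reference, an abridged response for a completed registration task looks roughly like this (IDs are placeholders):

{
  "model_id": "<generated model ID>",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED"
}

Once the model_id becomes available, use it to deploy the model: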

POST /_plugins/_ml/models/<model ID fetched once the task completes>/_deploy

This makes the model available to generate text embeddings on the fly.
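
To sanity-check the deployment, you can invoke the model directly through the ML predict API (the test sentence here is arbitrary); the response should contain a 768-dimensional sentence_embedding:

POST /_plugins/_ml/_predict/text_embedding/<model ID fetched once the task completes>
{
  "text_docs": ["a quick test sentence"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}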

Step - 2: Ingest Pipeline Setup

An ingest pipeline hooks into an index and performs the required transformations at indexing time. Pipelines are built from different types of ingest processors, and we will be making use of the text_image_embedding (text/image embedding) processor to generate the text embeddings for the searchable content.

Execute the following request to set up a new ingest pipeline:

PUT /_ingest/pipeline/<a_name_for_the_pipeline>
{
    "description": "Optimus pipeline, a super powerful pipeline for generating embeddings",
    "processors": [
        {
            "text_image_embedding": {
                "model_id": "<The model_id from the previous step>",
                "embedding": "text_embeddings",
                "field_map": {
                    "text": "article"
                }
            }
        }
    ]
}

Explanation:

  • The embedding field instructs the pipeline to store the vector data in the named field - text_embeddings
  • The field_map field maps the input type (text) to the searchable field whose content needs to be embedded

In our sample, we generate embeddings for a field called article, and the pipeline stores the embeddings in another field called text_embeddings.
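
Before attaching the pipeline to an index, you can dry-run it with the simulate API (the document content here is made up):

POST /_ingest/pipeline/<a_name_for_the_pipeline>/_simulate
{
    "docs": [
        {
            "_source": {
                "article": "A short test article used to verify the pipeline"
            }
        }
    ]
}

The simulated result should contain the original article field plus the generated text_embeddings vector.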

Step - 3: Index Creation

We will keep the index simple; the focus is on the vector field and the ingest pipeline.

Use the following request to create a new index:

PUT /news_articles
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "<name of the pipeline used during creation>"
  },
  "mappings": {
    "properties": {
      "article": {
        "type": "text"
      },
      "text_embeddings": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      }
    }
  }
}


In the index settings, we have enabled k-NN, which is required for the semantic search, and configured the default_pipeline. Whenever an insert or update operation happens on the index, the ingest pipeline is invoked to generate the text embeddings for the article field. Note that the dimension (768) in the mapping must match the output size of the model chosen earlier.

Step - 4: The Search!

Let's load the index with some sample data and see how to perform the search operation.

The following are some sample news articles that we will be searching against.


POST /news_articles/_doc/3
{
  "article": "..."
}
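
For illustration, here are three hypothetical articles in line with the topics searched below (the contents are invented; a real dataset would work the same way):

POST /news_articles/_doc/1
{
  "article": "Severe floods have displaced thousands of residents as extreme weather events intensify"
}

POST /news_articles/_doc/2
{
  "article": "OpenAI unveils new research pushing the frontier of artificial intelligence"
}

POST /news_articles/_doc/3
{
  "article": "Ferrari's strategy calls come under fire once again after another race slips away"
}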

To search using vector similarity, OpenSearch provides the neural query as part of its neural search feature. The query text is embedded with the same model at search time, and the top k closest documents are ranked by similarity derived from the distance function configured on the vector field (L2 in our index).

Use the following request to perform the vector search:

GET /news_articles/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "neural": {
            "text_embeddings": {
              "query_text": "climate crisis",
              "model_id": "<ID of the deployed model>",
              "k": 1
            }
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": "text_embeddings"
  }
}
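
With k set to 1, the response carries a single hit. Trimmed down, it looks something like this (the score value is illustrative):

{
  "hits": {
    "hits": [
      {
        "_index": "news_articles",
        "_id": "1",
        "_score": 0.42,
        "_source": {
          "article": "..."
        }
      }
    ]
  }
}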

Explanation:

In the above search query, I am explicitly asking for the top k=1 similar item because there are only 3 documents in the sample dataset. The value of k can be increased to retrieve as many similar documents as you want, and the results are ranked in descending order of similarity (or ascending order of vector distance).

The above sample search term "climate crisis" will rank the document with the "news about the floods" as the topmost result.

If I search for "developments in artificial intelligence", the "article about OpenAI" will be ranked on top.

If I search for "team that ruined the GOAT", the "article about Ferrari" will be ranked on top (not joking btw 😜).

A few add-ons

  • For a realistic use case, the document could have more than one field, and you may want to perform the search on content combined from multiple fields. In such cases, you can use the script ingest processor within the pipeline to merge the content into a single normalized field and generate the embeddings from that field; a sketch is shown after the hybrid query example below

  • If you wish to combine the similarity search with a normal phrase-based search or something else, you can achieve it by combining multiple queries within a boolean query and using query boosting to give preference to the required query criteria:

GET /news_articles/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "neural": {
            "text_embeddings": {
              "query_text": "race",
              "model_id": "<ID of the model>",
              "k": 5,
              "boost": 1.5
            }
          }
        },
        {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "race",
                  "default_field": "all",
                  "default_operator": "AND",
                  "analyze_wildcard": true,
                  "boost": 1.7
                }
              }
            ]
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": "text_embeddings"
  }
}
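
And here is the idea from the first add-on: a minimal sketch of a pipeline that uses the script processor to merge hypothetical title and body fields into a combined_text field before embedding (the pipeline and field names here are made up):

PUT /_ingest/pipeline/combined_embedding_pipeline
{
    "description": "Merges multiple source fields, then embeds the merged text (illustrative sketch)",
    "processors": [
        {
            "script": {
                "source": "ctx.combined_text = ctx.title + '. ' + ctx.body"
            }
        },
        {
            "text_image_embedding": {
                "model_id": "<ID of the deployed model>",
                "embedding": "text_embeddings",
                "field_map": {
                    "text": "combined_text"
                }
            }
        }
    ]
}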

Conclusion

If OpenSearch is your go-to data store for search, then the combination of the built-in models and the ingest pipeline makes it a powerhouse for semantic or similarity search.
