LLM, RAG, Embeddings, Vector Database. These were all buzzwords to grab everyone's attention a couple of years ago, but not anymore. Using an effective vector database to perform Retrieval Augmented Generation or semantic search is mainstream now, and with OpenSearch, you can make the process simpler
No Extra Step for Embeddings
OpenSearch ingestion pipelines make it easy to generate vector embeddings for data in flight, with no hassle. This means two things:
- You don't need to write custom logic to convert the text content to a vector type and store it
- You don't need to write custom logic to convert the search query to a vector type
The ingestion pipeline makes use of the built-in sentence transformer models provided by OpenSearch to generate the embeddings for the text content. OpenSearch comes bundled with a collection of pre-trained models from Hugging Face, and the model that best suits your purpose can be selected to generate the embeddings
The Setup
Step - 1: Model Deployment
Let's start by creating a new model group. Use OpenSearch Dashboards to execute the following request to create a custom model group
POST /_plugins/_ml/model_groups/_register
{
"name": "optimus_embedding_group",
"description": "A custom and powerful model group"
}
On completion of this request, a model group ID will be returned. Keep that group ID safe, as it's needed when registering the model in the next step
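The response is tiny; it should look roughly like this, with a generated ID in place of the placeholder
POST /_plugins/_ml/model_groups/_register returns
{
  "model_group_id": "<generated model group ID>",
  "status": "CREATED"
}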
The next part is to create an instance of a pre-trained model that you can deploy. This step requires you to choose the right model for your use case. As we are covering semantic search in this article, I have chosen msmarco-distilbert-base-tas-b, which generates 768-dimensional embeddings and works well for semantic search
POST /_plugins/_ml/models/_register
{
"name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
"version": "1.0.2",
"model_group_id": "<model group ID obtained from the previous step>",
"model_format": "TORCH_SCRIPT"
}
This request is asynchronous: it immediately returns a task ID for the registration task that runs in the background. Take that task ID and check its status periodically using the following request
GET /_plugins/_ml/tasks/<task ID from the previous request>
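Once the registration finishes, the task response looks roughly like this (the ID is a placeholder and a few extra fields have been omitted)
{
  "model_id": "<generated model ID>",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "is_async": true
}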
The response should show the state as COMPLETED, and the field model_id should be present. Once the model_id becomes available, use it to deploy the model. Keep note of this model ID, as it's required in all the following steps
POST /_plugins/_ml/models/<model ID fetched once the task completes>/_deploy
This makes the model available to generate the text embeddings on the fly
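Optionally, you can sanity-check the deployment with the text embedding predict API before moving on; the sample sentence here is arbitrary
POST /_plugins/_ml/_predict/text_embedding/<model ID fetched once the task completes>
{
  "text_docs": ["a quick test sentence"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}
A 768-dimensional sentence_embedding in the response confirms that the model is deployed and working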
Step - 2: Ingestion Pipeline Setup
The ingestion pipeline acts as a trigger for an index and performs the required transformation in real time. Pipelines support different types of ingest processors, and we will be making use of the text/image embedding processor to generate the text embeddings for the searchable content
Execute the following request to set up a new ingestion pipeline
PUT /_ingest/pipeline/<a_name_for_the_pipeline>
{
"description": "Optimus pipeline, a super powerful pipeline for generating embeddings",
"processors": [
{
"text_image_embedding": {
"model_id": "<The model_id from the previous step>",
"embedding": "text_embeddings",
"field_map": {
"text": "article"
}
}
}
]
}
Explanation:
- The embedding field instructs the pipeline to store the vector data in the specified field, text_embeddings
- The field_map field defines the searchable field that needs to be embedded
In our sample, we are going to generate embeddings for a field called article, and the pipeline will store the embeddings in another field called text_embeddings
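Before attaching the pipeline to an index, you can optionally dry-run it with the ingest simulate API to confirm that the embeddings get generated; the article text here is just a throwaway sample
POST /_ingest/pipeline/<a_name_for_the_pipeline>/_simulate
{
  "docs": [
    {
      "_source": {
        "article": "A short test article about nothing in particular"
      }
    }
  ]
}
The simulated document in the response should contain a text_embeddings array of 768 floats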
Step - 3: Index Creation
We will keep the index simple; the focus is on the vector field and the ingestion pipeline
Use the following request to create a new index
PUT /news_articles
{
"settings": {
"index.knn": true,
"default_pipeline": "<name of the pipeline used during creation>"
},
"mappings": {
"properties": {
"article": {
"type": "text"
},
"text_embeddings": {
"type": "knn_vector",
"dimension": 768,
"method": {
"engine": "lucene",
"space_type": "l2",
"name": "hnsw",
"parameters": {}
}
}
}
}
}
In the index settings, we have enabled k-NN, which is required for the semantic search, and configured the default_pipeline. Whenever an insert or update operation happens on the index, the ingestion pipeline will be invoked to generate the text embeddings for the article field
Step - 4: The Search!
Let's load the index with some sample data and see how to perform the search operation
The following are some sample news articles, and we will be performing the search on these
POST /news_articles/_doc/3
{
"article": "..."
}
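The article text is omitted above; if you want concrete data to follow along with, here are three made-up one-liners covering the same themes (floods, AI, Ferrari) that the search examples below refer to. They are placeholders, not the original articles
POST /news_articles/_doc/1
{
  "article": "Severe floods have displaced thousands of residents after record rainfall across the region"
}
POST /news_articles/_doc/2
{
  "article": "OpenAI announced fresh developments in artificial intelligence with a new set of model improvements"
}
POST /news_articles/_doc/3
{
  "article": "Ferrari's questionable strategy calls cost their star driver another podium in the latest Formula 1 race"
}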
To search using vector similarity, OpenSearch provides the neural search feature. It converts the query text into an embedding using the deployed model and ranks the top k closest documents by vector similarity
Use the following request to perform the vector search
GET /news_articles/_search
{
"query": {
"bool": {
"should": [
{
"neural": {
"text_embeddings": {
"query_text": "climate crisis",
"model_id": "<ID of the deployed model>",
"k": 1
}
}
}
]
}
},
"_source": {
"excludes": "text_embeddings"
}
}
Explanation:
In the above search query, I am explicitly asking for the top k=1 similar items because there are only 3 documents in the sample dataset. The value of k can be increased to retrieve as many similar documents as you want, and the results will be ranked in descending order of similarity (or ascending order of vector distance)
The above sample search term "climate crisis" will rank the document with the "news about the floods" as the topmost result
If I search for "developments in artificial intelligence", the "article about OpenAI" will be ranked on top
If I search for "team that ruined the GOAT", the "article about Ferrari" will be ranked on top (not joking btw 😜)
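For instance, for the "climate crisis" query, a trimmed-down response would look something like this (the score and the article text are illustrative)
{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.42,
    "hits": [
      {
        "_index": "news_articles",
        "_id": "1",
        "_score": 0.42,
        "_source": {
          "article": "Severe floods have displaced thousands of residents after record rainfall across the region"
        }
      }
    ]
  }
}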
A few add-ons
For a realistic use case, the document could have more than one field and you may want to perform the search combining the content from multiple fields. In such cases, you can use the script ingest processor within the pipeline to normalize the content into a single field and use that normalized field to generate the embeddings, as shown in the sketch below
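Here is a rough sketch of that idea; the title field, the combined_text field and the pipeline name are hypothetical, and the script simply concatenates two fields before the embedding processor runs on the combined text
PUT /_ingest/pipeline/optimus_multi_field_pipeline
{
  "description": "Hypothetical pipeline: combine title and article, then embed the combined text",
  "processors": [
    {
      "script": {
        "source": "ctx.combined_text = ctx.title + '. ' + ctx.article"
      }
    },
    {
      "text_image_embedding": {
        "model_id": "<The model_id from step 1>",
        "embedding": "text_embeddings",
        "field_map": {
          "text": "combined_text"
        }
      }
    }
  ]
}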
If you wish to combine the similarity search with a normal phrase-based search or something else, it can be achieved by combining multiple queries within a boolean query and using query boosting to give preference to the required query criteria
GET /news_articles/_search
{
"query": {
"bool": {
"should": [
{
"neural": {
"text_embeddings": {
"query_text": "race",
"model_id": "<ID of the model>",
"k": 5,
"boost": 1.5
}
}
},
{
"bool": {
"must": [
{
"query_string": {
"query": "race",
"default_field": "all",
"default_operator": "AND",
"analyze_wildcard": true,
"boost": 1.7
}
}
]
}
}
]
}
},
"_source": {
"excludes": "text_embeddings"
}
}
Conclusion
If OpenSearch is your go-to data source for search, then the combination of the built-in models and the ingestion pipeline makes it a powerhouse for semantic search or similarity search
