Semantic Search with Haystack and Elastic

#python #opensource #nlp

In this post, I'll show how to build a semantic search application with the Haystack framework and Elastic. You will:

Create an Elastic document store in Haystack
Generate text embeddings for documents in your document store
Build a semantic search pipeline to retrieve documents

If you just want the code, you can go straight to the repo and look at the notebook.

Create the Elastic Document Store

For this tutorial, we will store documents in an Elastic document store. One Haystack feature I like is that you can easily swap document stores. For example, if you prefer FAISS to Elastic, you can use it here without changing the other components in the pipeline. You can get the complete list from the Haystack Document Store docs.

First, we need to initialize the Elastic document store. Haystack has a function launch_es() that will run a subprocess to run Elastic within a Docker container. You will need to have Docker running for this command to complete.

from haystack.utils import launch_es

launch_es()

Now we can connect to the Elastic instance and configure the document store.

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index="document",
    create_index=True,
    similarity="dot_product"
)

To create the document store we provide the information about how to connect to the Elastic instance. We also create a new index called document within our Elastic instance where our documents will be stored.

Finally, we also define a similarity function, dot_product, that will be used when comparing document vectors.
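
To make the similarity setting a bit more concrete, here is a minimal pure-Python sketch comparing a dot product with cosine similarity. The vectors are made-up toy values rather than real embeddings, and the helper names are my own; Haystack and Elastic handle this computation for you.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product after scaling both vectors to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

# Toy vectors (assumed values, for illustration only).
query = [1.0, 2.0, 0.0]
doc_short = [1.0, 2.0, 0.0]  # same direction as the query
doc_long = [2.0, 4.0, 0.0]   # same direction, twice the magnitude

print(dot_product(query, doc_short))  # 5.0
print(dot_product(query, doc_long))   # 10.0

# Cosine similarity ignores magnitude, so both documents score about the same.
print(cosine_similarity(query, doc_short))
print(cosine_similarity(query, doc_long))
```

The takeaway is that a dot product rewards longer vectors while cosine similarity only looks at direction, so the similarity you pick should match how your embedding model was trained.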

The ElasticsearchDocumentStore in Haystack has many more configuration options, which you can find in the docs.

Process Text and Add to Document Store

Now documents can be added to the document store. Most Haystack tutorials use a collection of Wikipedia articles related to Game of Thrones, so we'll use that same data source here.

The data can be downloaded from S3 and stored locally. Haystack provides a helper function to fetch and store the data in a local directory.

from haystack.utils import clean_wiki_text, fetch_archive_from_http
from haystack.utils.preprocessing import convert_files_to_docs

# Read data from S3. Write text to the specified directory.
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

You should see ~180 text files in the data/article_txt_got directory when the function completes.
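
If you'd like to check the download from Python instead of browsing the folder, a small sketch like this would do it. It assumes the articles are saved as .txt files directly inside the directory, and count_text_files is a helper name I made up, not part of Haystack.

```python
from pathlib import Path

def count_text_files(directory):
    """Count the .txt files directly inside `directory` (0 if it does not exist)."""
    path = Path(directory)
    if not path.is_dir():
        return 0
    return sum(1 for _ in path.glob("*.txt"))

# Same directory used for doc_dir in this tutorial.
print(count_text_files("data/article_txt_got"))
```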

The raw documents need to be put into a format Haystack can load into the Document Store. The convert_files_to_docs function provides a convenient way to process raw text documents and put them in a Document format for Haystack. We will provide a cleaning function for this example that removes redundant line breaks, extremely short lines, and empty paragraphs.

docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text)
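
To give a feel for what a cleaning function does, here is a simplified stand-in. This is not Haystack's clean_wiki_text implementation; simple_clean and its min_line_length threshold are assumptions I picked for illustration. The only thing it shares with the real one is the shape: it takes a string and returns a string.

```python
import re

def simple_clean(text, min_line_length=3):
    """Hypothetical cleaning function; a rough sketch, not Haystack's code."""
    # Collapse runs of three or more line breaks into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        # Keep blank lines (paragraph separators) and lines that are long enough.
        if stripped == "" or len(stripped) >= min_line_length:
            kept.append(line)
    # Dropping short lines can leave empty paragraphs behind, so collapse again.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return text.strip()

raw = "Title of the article\n\n\n\nx\n\nFirst real paragraph.\n\n\n"
print(repr(simple_clean(raw)))
# 'Title of the article\n\nFirst real paragraph.'
```

Any function with that string-in, string-out shape could be passed as clean_func if the built-in cleaner doesn't fit your data.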

Here is what an example document looks like:

<Document: {'content': "Linda Antonsson and Elio García at Archipelacon on June 28, 2015.\n'''Elio Miguel García Jr.''' (born May 6, 1978) and '''Linda Maria Antonsson''' (born November 18, 1974) are authors known for their contributions and expertise in the ''A Song of Ice and Fire'' series by George R. R. Martin, co-writing in 2014 with Martin ''The World of Ice & Fire'', a companion book for the series. They are also the founders of the fansite Westeros.org, one of the earliest fan websites for ''A Song of Ice and Fire''.", 'content_type': 'text', 'score': None, 'meta': {'name': '145_Elio_M._García_Jr._and_Linda_Antonsson.txt'}, 'embedding': None, 'id': '41655cc804bb07b1569f3118ce70e05'}>

The content field is the document's text, which will be used for generating the text embeddings. The meta field stores other attributes of a document. These will be stored as fields within the Document Store. In this example, we're storing the name of the text file for the document.

The list of documents can now be written to the Elastic Document Store. For now, we'll write all but the last 10 documents. The remaining 10 will be used later.

document_store.write_documents(docs[:-10])

Create Document Embeddings

After writing the documents to the document store, embeddings can be generated for the documents with a retriever.

Here we'll use the DensePassageRetriever, which lets us define separate embedding models for queries and documents. We will use the separate pretrained DPR models from Facebook for queries and passages. The use_gpu flag is also set to True, which tells the retriever to use GPUs if they are available. If no GPUs are available, that's okay: Haystack will fall back to using the CPU. As with the document store, Haystack has a number of configurations to customize the retriever's behavior, which you can find in the documentation.

from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    use_gpu=True,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)

After defining the retriever, the embeddings of the documents can be generated. If you're using a CPU, this step will take a while, so get some coffee, stretch, or do something else while the process runs.

document_store.update_embeddings(retriever)

Querying the Document Store

Now that we have documents with embeddings, we can retrieve documents for a given query using the Haystack DocumentSearchPipeline wrapper for the retriever we just created.

from haystack.pipelines import DocumentSearchPipeline

pipeline = DocumentSearchPipeline(retriever)

We can define our query and pass it to the pipeline along with other parameters for the retriever, such as top_k, the number of results to return. In this case, we want to retrieve the top two documents related to the "Red Wedding" from Game of Thrones.

query = "what is the Red Wedding?"
result = pipeline.run(query, params={"Retriever": {"top_k": 2}})

When a query is sent to the pipeline, the query embeddings are generated using the query model defined in the retriever. Then a dot product vector similarity search is performed against the document store.
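
Conceptually, that search boils down to scoring every document vector against the query vector and keeping the best matches. Here is a brute-force sketch of the idea with toy three-dimensional vectors. The vectors, document names, and the top_k_by_dot_product helper are all made up for illustration; the real embeddings are much larger and the search runs inside Elastic.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot_product(query_vec, doc_vecs, k):
    """Return the names of the k documents with the highest dot-product score."""
    scored = [(dot(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Toy "embeddings" (assumed values, not produced by any model).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 0.2, 0.9],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k_by_dot_product(query_vec, doc_vecs, k=2))  # ['doc_a', 'doc_c']
```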

Haystack has a helper function print_documents() to display the results in a prettier format.

from haystack.utils import print_documents
print_documents(result, max_text_len=100, print_name=True, print_meta=True)

Looking at the first result, you should see a document about "The Rains of Castamere", which is the name of the episode where "The Red Wedding" occurred, so given our query, this is a very relevant result.

Adding New Documents

In this example, we created a document store from scratch, uploaded text documents, and generated embeddings for those documents. What if we wanted to write new documents into our document store? To avoid the computation time of re-generating embeddings for all the documents, you can use the update_existing_embeddings parameter of the update_embeddings method.

document_store.write_documents(docs[-10:])
document_store.update_embeddings(
    retriever,
    update_existing_embeddings=False
)

By setting update_existing_embeddings=False, only documents without an embedding in the document store will be updated. This parameter can be helpful when making incremental updates to documents in your document store.
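
If it helps to picture what the update_existing_embeddings flag does, here is a toy version of the logic using plain dictionaries. This is only a sketch of the idea, not how Haystack implements it; fake_embed is a hypothetical stand-in for a real embedding model, and the update_embeddings function below is my own simplified imitation.

```python
def fake_embed(text):
    """Hypothetical stand-in for an expensive embedding model."""
    return [float(len(text)), float(text.count(" "))]

def update_embeddings(docs, embed, update_existing_embeddings=True):
    """Embed documents in place and return how many were (re)computed."""
    updated = 0
    for doc in docs:
        if doc["embedding"] is not None and not update_existing_embeddings:
            continue  # already embedded, so skip the expensive call
        doc["embedding"] = embed(doc["content"])
        updated += 1
    return updated

store = [
    {"content": "old document", "embedding": [12.0, 1.0]},
    {"content": "new document one", "embedding": None},
    {"content": "new document two", "embedding": None},
]

# Only the two documents without an embedding are processed.
updated = update_embeddings(store, fake_embed, update_existing_embeddings=False)
print(updated)  # 2
```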

Wrapping Up

In this tutorial, I've shown how you can build a basic semantic search system using Haystack and Elastic, and covered a few of the configurations you can use to customize the pipeline. Haystack has many more, so I highly recommend you check out the documentation to see what is available to you.

I'm working on a couple more examples, so if there is something specific you'd like to see built with Haystack let me know!

Happy Tinkering!