DEV Community

Cover image for How to build a search engine with word embeddings
Mage
Mage

Posted on

How to build a search engine with word embeddings

Imagine you have a large set of news articles, and want to provide a way to allow your users to search through them. For example, here are 5 random articles from the Economist:

  1. An investment bonanza is coming
  2. Who governs a country’s airspace?
  3. What is a supermoon, and how noticeable is it to the naked eye?
  4. What the evidence says about police body-cameras
  5. Who controls Syria?

Let’s think about what kind of queries potential users issue and the articles that we would want to return.

Exact and Substring Matches

  • A search for “investment” should return the 1st article.
  • A search for “supermoon” should return the 3rd article.
  • A search for “syria” should return the 5th article.

This is quite simple to implement with traditional information retrieval techniques, since these queries have keyword matching.

Harder Examples

  • A search for “banking” should probably return the 1st article.
  • A search for “astrology” should return the 3rd article.
  • A search for “middle east” should return the 5th article.

The challenge with these queries is that none of them are actually a keyword match in the original documents. However, there isn’t any string matching we can perform between the search query and the document.

What we need is a way to perform searches over text without requiring exact keyword matches between words in the query and words in the documents.

Word Embeddings

One way we can do this is with word embeddings, which capture the semantics of words in vector representation.

If we can find a way to create a vector for the query, and a vector for every document, in the same vector space, then we can just compute the similarity between our query vector, and every document, to retrieve the document that is most semantically similar to our query.

Creating Document Embeddings from Word Embeddings

Assuming we have lookup table of a word to its embedding, how do we actually create an embedding representing a string of words?

We can think of the semantics of a document as just the average of the semantics of individual words, and compute a mean word embedding to represent a document.

Specifically:

def create_mean_embedding(words):
    return np.mean(
        [model[word] for word in words if word in model],
        axis=0)
Enter fullscreen mode Exit fullscreen mode

This would capture the average semantic of a document or query.

Another way we can build a document embedding is by by taking the coordinate wise max of all of the individual word embeddings:

def create_max_embedding(words, model):
    return np.amax(
        [model[word] for word in words if word in model],
        axis=0)
Enter fullscreen mode Exit fullscreen mode

This would highlight the max of every semantic dimension.

We can also concatenate the two embedding types to get a better representation of the document:

def create_embedding(words):
    return np.concatenate([
        create_max_embedding(words),
        create_mean_embedding(words)
    ])
Enter fullscreen mode Exit fullscreen mode

Let’s try apply this to the simple example documents we highlighted above.

We will be using gensim to load our Google News pre-trained word vectors. Find the code for this here.

Ranking with embeddings

Query: "banking"

Rank: 0 | Top Article: An investment bonanza is coming
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 3 | Top Article: Who controls Syria?
Rank: 4 | Top Article: What the evidence says about police body-cameras

Query: "astrology"

Rank: 0 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: Who controls Syria?
Rank: 3 | Top Article: What the evidence says about police body-cameras
Rank: 4 | Top Article: An investment bonanza is coming

Query: "middle east"

Rank: 0 | Top Article: Who controls Syria?
Rank: 1 | Top Article: Who governs a country's airspace?
Rank: 2 | Top Article: An investment bonanza is coming
Rank: 3 | Top Article: What is a supermoon, and how noticeable is it to the naked eye?
Rank: 4 | Top Article: What the evidence says about police body-cameras
Enter fullscreen mode Exit fullscreen mode

Dual space embedding

So far we’ve used the standard pre-trained word2vec embeddings to represent both the query and document embeddings. But lets think about how word2vec constructs these vectors to come up with a more clever way to create these embeddings.

The core idea of word2vec is that words that are used in similar contexts, will have similar meanings. Focusing on the skip-gram variant of the model, this idea is formulated as trying to predict the context words from any given word. Check out this post if this is confusing to you.

In a skip-gram model, a one-hot encoded word is mapped to a dense representation, which then is mapped back to the context of that word (context in this case refers to the words that surround the target word). Usually, we keep the first transformation matrix W (in the diagram below), which becomes our word embeddings, and discard the matrix that maps to context words W’.
Image description

If we think deeply about the behavior of W and W’ matrix: the W embedding’s goal is to map a word to an intermediate representation, and the W’ matrix maps from the intermediate representation to a context. This naturally results in W capturing the semantics of a word, while W’ captures the context of a word.

We can frame our search problem in a similar way: The query being the “word” and the document being the “context”. In this way, if the query is embedded using W, while the document is embedded using W’, comparing the similarity between the two vectors would indicate whether or not the document acts as a good context for the query.

Lets consider two simple documents, that we could potentially return for a query “harvard”, and decide which one is more relevant:

Document 1:

Harvard alumni receives 
Enter fullscreen mode Exit fullscreen mode

Document 2:

Yale, NYU and Harvard 
Enter fullscreen mode Exit fullscreen mode

As a user, you’d probably want to see the first document. However, our current embedding model would assign the second document a higher relevance score. Why? Because the terms “Pittsburg Steelers”, “Baltimore Ravens” and “Cincinnati Bengals” are similar to the term “New England Patriots”, since they are all football teams. (Recall that the word2vec model measures similarity as words that have similar contexts. Football team names would exists in similar context). Therefore, the embedding for the query and document 2 will be more similar than the query and document 1.

However, using W’ to create the document embedding, and W to create the query embedding, will allow a similarity measure of “how well does this document represent the context of this query”.

Indeed, these are the results published in A Dual Embedding Space Model for Document Ranking.
Image description

This table above shows the closest word embeddings to the words “yale” with three different embedding comparisons.

  1. IN-IN: The bolded word is embedded with W and the words below are the closest word embeddings in W.
  2. OUT-OUT: The bolded word is embedded with W’ and the words below are the closest word embeddings in W’.
  3. IN-OUT: The bolded word is embedded with W and the words below are the closest word embeddings in W’.

Observe that while “Yale” is more semantically similar to “Harvard”, if a document about Harvard came up when searching “Yale”, that wouldn’t make a lot of sense.

With that in mind, it seems the IN-OUT embedding scheme provides the most relevant words for a search system. If someone searched for the word “Yale”, I would expect documents with the terms “faculty”, “alumni”, and “orientation” rather than “harvard” and “nyu”.

How do we deploy this?

Too many ML articles stop here, but deployment is typically half the challenge.

Typically, these models are deployed as a ranker after a retrieval model. The pipeline is typically structured like this:
Image description

The purpose of the retrieval model is to find the ~100 most similar documents based on keywords among millions of other documents, while the purpose of the ranking model is to find the most semantically relevant document in the set of 100.

BM25 is a common document retrieval method which is implemented in most search systems.

First, lets implement a retrieval model that leverages the rank-bm25 package. In a production setting, we would probably use something like Elasticsearch for this part.

from rank_bm25 import BM25Okapi as BM25

class Retriever(object):
    def __init__(self, documents):
        self.corpus = documents
        self.bm25 = BM25(self.corpus)

    def query(self, tokenized_query, n=100):
        scores = self.bm25.get_scores(tokenized_query)
        best_docs = sorted(range(len(scores)), key=lambda i: -scores[i])[:n]
        return best_docs, [scores[i] for i in best_docs]
Enter fullscreen mode Exit fullscreen mode

Next, let’s build the ranking model, which takes a list of documents and re-ranks them according to the similarity between the query embedding and the document embedding.

And now, let’s stitch the code together.

Let’s take a look at a real example output of this for the query “harvard hockey” against the Microsoft News Dataset. However, there are no documents in the corpus containing both the terms “harvard” and “hockey”, making this a tricky problem.

Here are the top 20 results for the query “harvard hockey” from the BM25 model.

The documents with the highest score from the BM25 retrieval model is “Admissions scandals unfolding at Harvard university” and “How a Harvard class project changed barbecue

I suspect many people do not think this is a very relevant article to “harvard hockey

But if we re-rank the top 100 results from the BM25 model with the embedding based model, here is what we get:

Now the results are much more semantically relevant to college hockey results.

Summary

A dual space embedding approach to creating query and document embeddings for embedding based ranking can offer more contextually relevant results than the traditional word embedding approach.

References

Top comments (2)

Collapse
 
mkadirtan profile image
mkadirtan

Very helpful post, I am planning to deploy a search engine, and this looks like a good starting point. However, it seems like some parts of the post is lost. For example There is no result after this:

But if we re-rank the top 100 results from the BM25 model with the embedding based model, here is what we get

Also, that doesn't match the example:

Why? Because the terms “Pittsburg Steelers”, “Baltimore Ravens” and “Cincinnati Bengals” are similar to the term “New England Patriots”, since they are all football teams.

I am sure it is an editing error, just wanted to point out in case you want to fix them.

Collapse
 
wilmol profile image
Will

How do I use the sample code? What is 'Document tsv'?