Introduction
I recently published an article on Word Embeddings: what they are, the different kinds, and their use cases. Semantic search and information retrieval is one of those many use cases. So instead of publishing another purely theoretical blog, I decided to spice things up by building a tiny system that stores information from some top research papers and lets us find papers that talk about a certain concept. Let's get started.
What is Semantic Similarity?
Semantic Similarity is a measure of how similar two things are. In our case, those things are going to be sentences. So how do we tell whether two sentences are similar?
Semantic Similarity for Humans
Let's take an example,
Sentence 1: I love to eat pizza.
Sentence 2: It is raining outside
If I ask you whether these two sentences are similar, you'll tell me that they are not, because the first sentence talks about liking a food item while the second one talks about the weather.
Now let's take two similar sentences,
Sentence 1: I love to eat pizza.
Sentence 2: Pizza is my favorite food.
You'll say that these two sentences are pretty similar, since both talk about the fact that the person likes pizza.
Semantic Similarity for Machines
As a human you can read these sentences and extract information from them, but computers only speak in binary. So how can they tell whether two sentences are similar or not?
This is a three-step process (see the sketch below):
- Convert the sentences into a machine-understandable format, i.e. numbers, using embeddings.
- Once you have your sentences in numerical form, calculate a representative distance between the two.
- Based on this distance, judge the similarity!
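To make these three steps concrete, here is a minimal sketch using the sentence-transformers library and cosine similarity as the distance measure. The model name is just one possible choice; any embedding model would work.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love to eat pizza.",
    "Pizza is my favorite food.",
    "It is raining outside",
]

# Step 1: convert the sentences into numbers using embeddings.
embeddings = model.encode(sentences)

# Step 2: compute a representative distance, here cosine similarity.
scores = cosine_similarity(embeddings)

# Step 3: judge similarity based on the scores.
print(scores[0][1])  # pizza vs. pizza sentence -> high similarity
print(scores[0][2])  # pizza vs. weather sentence -> low similarity
```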
The process of retrieving relevant documents from a large corpus of text for a query using semantic similarity measures is called semantic search.
Simple, right?! Let's start implementing!
The Data
To build this semantic search pipeline, we need some data to search over. The NIPS Papers dataset comes in handy for this task. It contains information about papers published at the Neural Information Processing Systems conference from 1987 to 2019, along with the content of each paper.
The NIPS Dataset
Once you download and extract the data from the above link, you will find two files: `authors.csv` and `papers.csv`.

`authors.csv` has four columns:

- `source_id` - ID of the research paper.
- `first_name` - First name of the author.
- `last_name` - Last name of the author.
- `institution` - The institution to which the author belongs.

Note that a paper can have multiple authors and some of them may not disclose the institution they work at.
`papers.csv` has five columns:

- `source_id` - ID of the research paper. This is the common key between the two data files.
- `year` - The year in which the paper was published.
- `title` - Title of the paper.
- `abstract` - Abstract of the paper.
- `full_text` - Full text of the paper.
Here we see that the number of papers has increased over time, but between 2015 and 2020 it has grown exponentially. This can be attributed to recent developments in the field of NLP and the growth of datasets.

The above plot shows the total number of words in the titles of the research papers. We see that, on average, the title of a paper is around 7 to 8 words long. This tells us that the vocabulary generated from all the titles is not going to be very large.

The above picture visualizes the vocabulary of all the words present in the titles of the papers.
A Major Issue With Data
One of the major issues with this data is that the `source_id` column has non-unique values in both `papers.csv` and `authors.csv`. This is a problem because it prevents us from uniquely identifying a paper if two papers have the same `source_id`.

It also prevents us from associating `authors.csv` with `papers.csv`, since `source_id`, the assumed primary key, is not unique. How does one solve this issue?
For `papers.csv`, upon analysis of the data, I realized that no two papers published in the same year have the same `source_id`. So a unique ID can be formed by combining `source_id` with `year`.
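As a sketch in pandas (the column names come from the dataset; the exact concatenation format is my own choice):

```python
import pandas as pd

papers = pd.read_csv("papers.csv")

# Combine source_id and year into a single identifier per paper.
papers["unique_id"] = papers["source_id"].astype(str) + "_" + papers["year"].astype(str)

# Sanity check: no two papers published in the same year share a source_id.
assert papers["unique_id"].is_unique
```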
Unfortunately, we cannot use the same trick for `authors.csv`, since there is no `year` information for authors. So I'm making a very, very big assumption: `authors.csv` is already sorted on time. This means that if a `source_id` X is assigned to two authors A1 and A2, then A1 is the author of the paper with `source_id` X that appears first in the timeline.

Note: By looking at some samples, I've determined that the author names can also be inferred/verified by looking at the `full_text` of the paper. However, there is no guarantee that the full text will contain the names, and it goes somewhat beyond the scope of the problem we are trying to address.
`authors.csv` example

`papers.csv` example
We first introduce a new column in the `authors.csv` dataset to uniquely identify each group of authors associated with a `source_id`. The snippet below shows one way to do this.
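A minimal sketch of this step, assuming `authors.csv` really is sorted on time (the column name `author_group_id` is illustrative, not from the original code):

```python
import pandas as pd

authors = pd.read_csv("authors.csv")

# Under the sorted-on-time assumption, consecutive rows with the same source_id
# belong to one paper, so a new author group starts whenever source_id changes.
authors["author_group_id"] = (
    authors["source_id"] != authors["source_id"].shift()
).cumsum()
```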
Next, we assign the papers' unique IDs to these author group IDs. The snippet below shows one way to do this.
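Again only a sketch, reusing the `unique_id` and `author_group_id` columns created above. It encodes the assumption that the n-th group of authors for a given `source_id` corresponds to the n-th paper with that `source_id` in time order:

```python
# Rank papers sharing a source_id by year, and author groups sharing a source_id
# by order of appearance; matching the two ranks links each group to one paper.
papers = papers.sort_values("year")
papers["occurrence"] = papers.groupby("source_id").cumcount()

groups = authors.drop_duplicates("author_group_id")[["author_group_id", "source_id"]].copy()
groups["occurrence"] = groups.groupby("source_id").cumcount()

mapping = groups.merge(
    papers[["source_id", "occurrence", "unique_id"]],
    on=["source_id", "occurrence"],
    how="left",
)
authors = authors.merge(mapping[["author_group_id", "unique_id"]], on="author_group_id")
```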
Architecture
I've made a small diagram to show the high-level architecture of the system.

We upload two different kinds of data: paper data and author data. Once the system receives this data, it performs the following steps:

- Register the metadata for each paper in a DB. This includes the ID of the paper, its publishing year and its authors.
- Process the title and abstract of the paper, applying some preprocessing like removing stopwords, lemmatization and converting to lower case (sketched below).
- Embed the preprocessed title and abstract text and store it in a vector database.

Once data is registered in the system, we can perform searches on it to find papers relevant to our queries.
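The preprocessing step could look roughly like this. This is a sketch using NLTK; the actual project may use a different toolkit.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lower-case the text, drop stopwords and lemmatize the remaining tokens."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok.isalpha() and tok not in STOPWORDS
    ]
    return " ".join(tokens)
```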
The Application
A small standalone application server written in Flask, with a couple of APIs providing CRUD functionality, is enough to demonstrate this task. I've adopted this approach and designed a small app server with the following APIs (a rough skeleton follows the list):

- `/register` to register data of a paper.
- `/query` to query the database.
- `/delete` to remove papers from the database.
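Here is what such a server could look like. The route names come from the list above; the handler bodies are stubs for illustration, not the project's actual implementation.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/register", methods=["POST"])
def register():
    paper = request.get_json()
    # Preprocess, embed and store the paper (see the following sections).
    return jsonify({"status": "registered", "id": paper["id"]})

@app.route("/query", methods=["POST"])
def query():
    body = request.get_json()
    # Search the title or abstract collection and return matching papers.
    return jsonify({"results": []})

@app.route("/delete", methods=["POST"])
def delete():
    body = request.get_json()
    # Remove the given IDs from both collections.
    return jsonify({"status": "deleted", "ids": body["ids"]})

if __name__ == "__main__":
    app.run(debug=True)
```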
We'll get into individual components in the next couple of sections.
The Datastore
In this particular use case, we want to compare pieces of text against each other, and the way to do this is to create embeddings of the text. However, directly storing the embeddings in a traditional datastore like a relational database is unwise. This is why, for this application, I will be using a vector database.

A vector database not only provides an optimized way to store embeddings but also lets us query these embeddings at lightning-fast speeds. For our particular application we use ChromaDB.

We choose ChromaDB since it is quick to set up and easy to use. For this particular implementation we do not need to delve deep into the advanced features of vector databases or use a hefty setup, and ChromaDB provides a very minimalistic and quick way to get started.
ChromaDB stores data in collections. A collection is like a database table: it holds the `ids` of the individual documents stored in it along with their embeddings. It gives you the flexibility to either provide embeddings of your own or attach an embedding function to the collection, which does the embedding for you. Each collection also allows you to store `metadata` for each document in the form of a dictionary.

Apart from storing embeddings, ChromaDB also lets us assign a unique ID to each embedding. For our use case we create two collections: one for the embeddings of the titles and another for the embeddings of the abstracts.
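Setting this up could look something like the following. The collection names and the embedding model are my own choices for illustration.

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_store")

# Let ChromaDB compute embeddings for us via an embedding function.
embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

title_collection = client.get_or_create_collection("titles", embedding_function=embedder)
abstract_collection = client.get_or_create_collection("abstracts", embedding_function=embedder)
```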
Registering New Paper Data
When registering a new paper, the `/register` API expects the following kind of request,
{
'id': 'Unique ID of the Paper',
'title': 'Title of the paper',
'abstract': 'Abstract of the paper',
'authors': ['list', 'of', 'paper', 'authors'],
'institutions': ['list', 'of', 'author', 'institutions'],
'publishing_year': 'Year of publishing'
}
When this API is called, it executes the following steps:

- Preprocess the `title` and `abstract` data.
- Vectorize the `title` and `abstract` data.
- Create paper metadata using the `authors`, `institutions` and `publishing_year` fields.
- Save the data into the collection (see the sketch below).
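A sketch of the last two steps, reusing the collections and the `preprocess` helper sketched earlier. The field handling is illustrative; since ChromaDB metadata values must be scalars, the author and institution lists are joined into strings.

```python
def register_paper(paper: dict) -> None:
    metadata = {
        "authors": ", ".join(paper["authors"]),
        "institutions": ", ".join(paper["institutions"]),
        "publishing_year": paper["publishing_year"],
    }
    # Store the preprocessed title and abstract in their respective collections;
    # ChromaDB computes the embeddings via the attached embedding function.
    title_collection.add(
        ids=[paper["id"]],
        documents=[preprocess(paper["title"])],
        metadatas=[metadata],
    )
    abstract_collection.add(
        ids=[paper["id"]],
        documents=[preprocess(paper["abstract"])],
        metadatas=[metadata],
    )
```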
Querying The Data
You can provide a search query and search for titles or abstracts that match it. The `/query` API provides this functionality and accepts the following request,
{
'query_text': 'Query for searching...',
'query_on': 'title' # 'abstract' to query on abstracts.
}
Whether you query on the `title` or on the `abstract`, you will get back all the information about the matching papers, including title, abstract, authors, institutions and publishing year.
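Internally, this maps onto ChromaDB's `query` call. A sketch against the collections from above:

```python
def query_papers(query_text: str, query_on: str = "title", n_results: int = 5):
    collection = title_collection if query_on == "title" else abstract_collection
    # Preprocess the query the same way the documents were preprocessed.
    results = collection.query(
        query_texts=[preprocess(query_text)],
        n_results=n_results,
    )
    # results holds the matching ids, documents, metadatas and distances.
    return results
```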
Deleting Data
ChromaDB allows us to delete entries from its vector store by providing the IDs of the items to be deleted. You can delete papers via the `/delete` API by providing the IDs of the items you want to remove.

When a paper is deleted, it is removed from both the title and abstract vector stores.
{
'ids': ['0', '1', '2',...]
}
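On the ChromaDB side, this boils down to a `delete` call on each collection, roughly:

```python
def delete_papers(ids: list[str]) -> None:
    # Remove the papers from both vector stores so titles and abstracts stay in sync.
    title_collection.delete(ids=ids)
    abstract_collection.delete(ids=ids)
```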
Setting up the Code
You can get the code for this project from this repository. You can follow the README.md file to set up the code on your system and get started.
Getting Started
If you want to explore the data and the APIs, you can start by looking at the `eda.ipynb` Jupyter notebook. It serves as an entry point into understanding the project before you start hacking at the code.
Final Thoughts
Through this project, we looked at how the concepts of word embeddings and semantic search can be applied to build a search engine. You can use it as a template to build something similar, or as a boilerplate for something of your own.

We also touched upon vector databases and how they are being used in the era of AI, and we got a chance to do some data engineering by addressing issues within the dataset. I would love to hear what you learn and build from this blog.