DEV Community: Ritesh Bhat

Components of Inverted Index - The Dictionary

Ritesh Bhat — Sat, 29 Aug 2020 19:57:30 +0000

This is the fourth article of the Inverted Index series that I am writing on dev.to. In this article we will talk about the first component, i.e. the Dictionary; the other component, the Posting Lists, will be covered in the coming articles. This post is closely related to my last article, Introduction to Inverted Index. Please give it a read before this one for better understanding.

Please don't assume that this dictionary is just a Python dictionary/HashMap. There is more to it. 😌

Topics to be covered

 * Overview and need for Dictionary
 * Supported Operations
 * Types of Dictionary
 * Sort-based Dictionary
 * Hash-based Dictionary
 * Which one is better?

Overview

Let's recall the simple representation of the Inverted Index that we saw in our last post.

In this article, we are going to focus on the term column in the Inverted Index representation in the above diagram.

Since this term column contains all the words/terms in our collection, we also refer to it as our Inverted index's dictionary.

🙋 But why do I need a dictionary?
The answer is pretty simple, as seen in the diagram too: the main purpose of the dictionary is to manage the set of terms in a text collection and to provide a mapping from each index term to the location of its posting list. The posting lists column does not hold the data itself; its entries are just pointers/references to the actual data, which is available in memory or on disk.

Supported Operations by Dictionary 🚀

A "basic" dictionary implementation in inverted indexes/search engines usually provides the following operations:

Insert a new entry into the dictionary.
Search/Find a key in the dictionary:
- Find a particular key/term and return its posting list entry.
- Find the entries for all the terms which start with a given prefix.
Update the dictionary term entries as per the new incoming data. The delete operation can also be part of this, since deleting old terms is a kind of update to the dictionary based on the new incoming data.

We will look at the Search/Find and Insert operations in this post.

The problem in our beautiful dictionary? ❓
Let's assume my Dictionary has 10,000,000 terms, and I plan to search for the term "Scranton" in this dictionary. This itself becomes a problem, since scanning/grepping through the dictionary has a time complexity of O(n), where n is the number of terms. How do we bring it down? Now take a seat and let me explain the two most popular ways to achieve this:

Types to store the Dictionary terms:

Sort based dictionary
- Search Tree
- Lexicographical order list
Hash-based dictionary.

Again, there are no perfect solutions; you choose the type based on your requirement. We will have a look at the pros and cons of each implementation, which will help you choose the better approach for your problem/scenario.

Sort based Dictionary 👀

As the name suggests, this implementation keeps the terms of our text collection (aka the Dictionary) in sorted, lexicographical order. It can be implemented in two ways: sorted arrays and search trees. Searching the dictionary happens via binary search in the case of sorted arrays and via tree traversal in the case of search trees. Both need O(log n) comparisons, provided the tree is balanced.
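To make the sorted-array flavour concrete, here is a minimal Python sketch. The class name `SortedDictionary`, the sample terms and the integer "posting references" are all made up for illustration; a real engine would store pointers to posting lists and would handle memory very differently.

```python
import bisect


class SortedDictionary:
    """Toy sort-based dictionary: terms kept in a lexicographically sorted list."""

    def __init__(self):
        self.terms = []     # sorted list of terms
        self.postings = []  # postings[i] is the (hypothetical) reference for terms[i]

    def insert(self, term, posting_ref):
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            self.postings[i] = posting_ref      # term already present: update it
        else:
            self.terms.insert(i, term)          # O(n) shift keeps the list sorted
            self.postings.insert(i, posting_ref)

    def find(self, term):
        i = bisect.bisect_left(self.terms, term)  # O(log n) binary search
        if i < len(self.terms) and self.terms[i] == term:
            return self.postings[i]
        return None


d = SortedDictionary()
for term, ref in [("scranton", 3), ("dunder", 1), ("mifflin", 2)]:
    d.insert(term, ref)

print(d.terms)
print(d.find("scranton"), d.find("stamford"))
```

Note the trade-off in this sketch: lookups are O(log n), but every insert may shift the whole array, which is one reason search trees are attractive when the dictionary changes often.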

Hash-Based Dictionary 👀

For hash-based dictionaries we can use hashtables, where each term has a corresponding entry in the hashtable. Hashtables are amazingly fast when we are searching for one particular term, but there is a catch (prefix matches), which we discuss later in the article. Also, most people believe that search and insertion in hashtables always happen in O(1) time, but THAT IS NOT TRUE: O(1) is the average case, and with many collisions a lookup can degrade to O(n). To understand this you can read this answer on stackoverflow
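Here is a small sketch of why the "always O(1)" belief is shaky. It is a toy hashtable with separate chaining; the class name `HashDictionary`, the simple polynomial hash and the sample terms are my own illustrative assumptions, not how any particular search engine does it. With too few buckets every term lands in the same chain and a lookup becomes a linear scan.

```python
class HashDictionary:
    """Toy hash-based dictionary using separate chaining."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, term):
        # Simple deterministic string hash (illustrative choice only).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % (2 ** 32)
        return self.buckets[h % len(self.buckets)]

    def insert(self, term, posting_ref):
        bucket = self._bucket(term)
        for i, (existing, _) in enumerate(bucket):
            if existing == term:
                bucket[i] = (term, posting_ref)  # update existing entry
                return
        bucket.append((term, posting_ref))

    def find(self, term):
        for existing, ref in self._bucket(term):  # scans one chain only
            if existing == term:
                return ref
        return None

    def longest_chain(self):
        return max(len(b) for b in self.buckets)


terms = ["scranton", "dunder", "mifflin", "stamford", "nashua", "utica"]
crowded = HashDictionary(num_buckets=1)   # badly sized: everything collides
roomy = HashDictionary(num_buckets=64)    # more room: chains are usually short
for ref, term in enumerate(terms):
    crowded.insert(term, ref)
    roomy.insert(term, ref)

print(crowded.longest_chain(), roomy.longest_chain())
```

Both tables return the same answers; only the amount of work per lookup differs, which is exactly why the table size matters.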

Comparison between Hash-based and Sort based dictionaries. 🚀

If the hashtable size is chosen properly, looking up one particular term in a hashtable is generally faster than in a sort-based dictionary, because no binary search or tree traversal is required.
Now consider a query which requires prefix matches: searching "Jef*" in a dictionary should match all index terms starting with "Jef" -> Jeff Bezos, Jeffrey Archer, etc. For this requirement a hashtable forces the system to scan the whole hashtable (i.e. the whole term collection), which is O(n), whereas a sort-based approach is generally much faster: roughly O(log(n)) to locate the first match, plus the number of matching terms. The autocomplete feature on websites like amazon.com is probably using (some kind of) prefix match over the product catalog data somewhere in the background, and for it to be amazingly fast you just cannot have O(n) time complexity; it has to be O(log(n)) or even less. This data structure should preferably be an amazingly fast search tree. Prefix matching is also a major functionality expected from any search engine or inverted index, because at the end of the day you just want to type "lap" and get options as shown in the image here: (Accept it, you love this feature and you use it every day, accept this god damn it) 😜
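The prefix-match difference can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names and the term list are made up, and a sorted list stands in for the search tree. The sorted version finds the matching range with two binary searches, while the hash version has no choice but to look at every key.

```python
import bisect
import sys


def prefix_search_sorted(sorted_terms, prefix):
    """All terms starting with `prefix`, via two binary searches on a sorted list."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # chr(sys.maxunicode) is the highest code point, so prefix + that character
    # sorts after every ordinary term that starts with the prefix.
    hi = bisect.bisect_right(sorted_terms, prefix + chr(sys.maxunicode))
    return sorted_terms[lo:hi]


def prefix_search_hashed(term_table, prefix):
    """The same query on a hash table: every key has to be checked, O(n)."""
    return sorted(t for t in term_table if t.startswith(prefix))


terms = sorted(["jeff bezos", "jeffrey archer", "jenkins", "java", "laptop", "lapel"])
table = {t: i for i, t in enumerate(terms)}

print(prefix_search_sorted(terms, "jef"))
print(prefix_search_hashed(table, "jef"))
```

Both calls return the same terms; the difference is only in how many entries each one has to touch to get there.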

But as per our human tendency, we always want to know who's better than who, right? Messi VS Ronaldo? Robert DeNiro Vs Al Pacino? dev.to VS medium? 🙄

🙋 So which one out of sort based and hash-based dictionaries is better?

Thomas Sowell once said: There Are No Solutions, Only Trade-offs.
So, considering Thomas's statement, my answer is: Sort-based dictionaries using search trees (with decent trade-offs in query processing time, of course).

Besides prefix query support and predictable performance, there is one more reason why I said "Sort-based dictionaries are better". Ever heard of Lucene? It is the most popular search library, used by Elasticsearch and Solr under the hood.
For its Memory Index full-text search index, Apache Lucene uses the sort-based dictionary approach. Don't believe me? Check the source code yourself. Here

(I have asked the Lucene committers to confirm this, will update this too after getting the confirmation from them)

Bonus Gyaan:
FYI: Java's TreeMap, the standard SortedMap implementation, is nothing but a beautiful Red-Black tree. 🖖
Also, the holy "Introduction to Algorithms" by Thomas Cormen et al. says "Red-Black trees make good search trees." with proof on page 309. I will be covering this for sure in some of my future posts. Hope you can connect the dots from here.

So we have discussed the major trade-offs between the Dictionary implementations in this article. In the next article we will look at something else related to the dictionary implementation, i.e. the tokenization of terms. As of now we only consider "space" between the words of a sentence/document to identify the terms, but there are a lot of other things to be considered.

Also, hope you liked the article and it was helpful to you. As always, I am open to suggestions and feedback w.r.t the series.

Forward and Inverted Indexes - Requirement based differences

Ritesh Bhat — Sun, 21 Jun 2020 08:55:02 +0000

So this is my first post in the series of #explainSearchLikeImFive. I hope you guys find this useful.

This is also a part of my series named "Understanding Inverted Indexes".

In this article, we try to understand the use of Forward and Inverted Indexes based on different requirements. The article is not about why the forward indexes are better than inverted, or vice-versa. Because both of them serve different purposes/requirements as explained in the post.

Topics to be covered

 * Definitions
 * Requirement 1 and using Forward Indexes
 * Requirement 2 and using Inverted Indexes

Note: Forward indexes are heavily used in traditional SQL databases, in the form of B-tree indexes, hash indexes, etc. So if you have ever heard of "indexes" in databases, chances are it was referring to forward indexes, whereas articles and documentation about inverted indexes specifically say "inverted index".

So now moving to the question of the hour, what is the main difference between Traditional forward indexes and inverted indexes?

Inverted Index stores the words as index and document name(s) as mapped reference(s).
Forward Index stores the document name as index and word(s) as mapped reference(s).

But Ritesh, you told me that you would explain it like I am five? Stop this gibberish now.

Okay, let me give you real-world examples of forward and Inverted Index which we all see in our daily lives.

So I have this book with me, "Team of Rivals", a great book written by Doris Kearns Goodwin about the political genius of Abraham Lincoln. Let's use this book to explain the difference between forward and inverted indexes.

Requirement 1: I know that I want to read the section "Showdown in Chicago" of the book, but I don't know which page it is on.

So, how can I do this? How can I reach the "Showdown in Chicago" section of this 880-page book?

Approach 1 (Grepping): I will turn every page of the book from the beginning and check if it is the desired section. This technique is called grepping. The section "Showdown in Chicago" is on page 237, so the number of checks required to reach the section will be ~237, and this is not acceptable because of the time and effort it takes.

Approach 2 (Forward Indexes): Let's use Forward Indexes to solve this issue. You must have seen the first few pages which tell you about the exact location of the chapter/section, like this image.

This is the actual idea behind forward indexes: use a key (here the chapter/section name) to point to a specific record in the db (here the start of the chapter's content in the book). So now the number of checks to reach "Showdown in Chicago" is reduced to 1, which reduces the time and effort of our search. (It's not exactly 1 comparison, but yes, the comparisons and time required in this approach are wayyyyyyyyy less than in our approach 1, i.e. grepping.)

Now look at the next requirement related to a term search.

Requirement 2: I want to find all the pages in the book which mention the term "Baltimore". And let me remind you, there are 880 pages in the book and more than 300,000 words. Therefore grepping (aka scanning) in this case will require you to make more than 300,000 comparisons. This is enough to make any sane man go crazy.

So how do we do this? How do we find all the pages which have mentioned "Baltimore"?

Approach 1 (Grepping): You know the drill: check each and every term of the book from start to end, and note down every page which mentions "Baltimore". Again very time consuming, as already seen for Requirement 1.

Approach 2 (Inverted Indexes): Since we are talking about searching for a term in a large collection of documents (aka the collection of pages in this case), we can use Inverted Indexes to solve this issue, and yes, almost all books use Inverted Indexes to make your life easier. Just like many other books, "Team of Rivals" has an inverted index at the end of the book, as shown in this image.

So after checking the Inverted Index at the end of the book we know that "Baltimore" is mentioned on pages 629 and 630. There are two parts to this: searching for "Baltimore" in the lexicographically ordered Inverted Index list, and fetching the pages based on the value of the index entry (here 629 and 630). The search time for the term is very small, since in computing we actually use dictionaries (hash-based or search trees) to keep track of these terms. With binary search or a search tree this theoretically reduces the search complexity from O(n) to O(log n), where n is the number of words/terms in our index.

GIST: Forward Indexes are used to map a column's value to a row or group of records, whereas Inverted Indexes are usually used to map the words/terms/content of a large document collection to the list of documents in which they occur.
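If you like seeing things in code, here is a tiny Python sketch of the "document -> words" versus "word -> documents" idea from the definitions above. The page contents are made up for illustration (only the fact that "Baltimore" appears on pages 629 and 630 comes from the book example), and `invert` is a hypothetical helper name.

```python
from collections import defaultdict

# Forward index: document (here, a page number) -> the words it contains.
# These page contents are invented for the example.
forward_index = {
    629: ["lincoln", "baltimore", "train"],
    630: ["baltimore", "washington"],
    631: ["washington", "lincoln"],
}


def invert(forward):
    """Turn document -> words into word -> sorted list of documents."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word in dict.fromkeys(forward[doc_id]):  # drop duplicates, keep order
            inverted[word].append(doc_id)
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index["baltimore"])
```

With the forward index you can answer "what is on page 630?" in one lookup; with the inverted index you can answer "where is Baltimore mentioned?" in one lookup.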

There are many other differences but I don't want to go into too many jargon words/topics since this post is part of the #explainmelikeiamfive section. If you are interested in reading a lot more about the Inverted Index, you can follow this series Inverted Index - The Story begins and the corresponding posts where the topics discussed will be more at intermediate and advanced levels.

You can also read the same article on GitHub at this link: Forward Indexes and Inverted Indexes - Requirement based differences

Your feedback is most welcome, and if you think something can be improved about this post, please feel free to write it in the comment section.

Introduction to Inverted Indexes

Ritesh Bhat — Sun, 14 Jun 2020 20:51:22 +0000

You can have a look at the first post of the series here to know about all the upcoming articles in the series.

This article is not going to tell you how to use inverted indexes in any DB/framework, but it will give you a nice overview of what exactly an inverted index is, its basic structure, how it is different from traditional forward indexes and how it is used in search engines.

Topics to look out for in this article:

    * Introduction to Information Retrieval
    * What is an Inverted Index ?
    * Traditional database vs Search Engine
    * Components of Inverted Index
    * Dictionary
    * Posting Lists

Introduction to Information Retrieval

Let us suppose you wanted to determine which news articles of The Washington Post, since its inception, contain the words "environment" and "health". One approach is to start at the beginning and read through all the text, noting down each article which contains the mentioned words. This technique is generally referred to as grepping through text. And this process is going to take half your life to complete, pretty sad, right? So what shall we do?
One of the most popular ways to avoid such linear scanning for each query is to index the documents in advance. With proper indexing in place, you can do the above-mentioned task in a few seconds/minutes on a modern machine. One such index, heavily used for indexing large collections of data, is the Inverted Index. All of your popular search engines like Elasticsearch/Lucene/Solr use Inverted Indexes heavily to provide you with amazingly fast search systems.

What is an Inverted Index ? How does it make the info retrieval so fast?

To put it in extremely simple words: "A Reverse/Inverted Index provides a mapping between terms and their locations of occurrence in a text collection". Therefore you don't need to scan the whole text collection to retrieve the information, which reduces the search time.
A lot of other functionality, like ranked retrieval and spell correction, can be implemented on top of an Inverted Index system. (These topics will be covered later in this series.)

Traditional Database (Forward Indexes) vs Search Engines(Inverted Index)

Let us assume our system needs to handle a collection which has 4 documents: Doc1, Doc2, Doc3, Doc4
Doc1 : Welcome to the Hotel California Such a lovely place
Doc2 : She's buying a stairway to Heaven
Doc3 : Hey Jude, don't make it bad
Doc4 : Welcome to the heaven

In traditional SQL DB the data will look something like this:

Doc ID	Doc Content
1	Welcome to the Hotel California Such a lovely place
2	She's buying a stairway to Heaven
3	Hey Jude, don't make it bad
4	Welcome to the heaven

It is very clear that any data/information retrieval on the basis of the Doc Content column is going to be difficult and complex. Performance in traditional SQL DBs is gained by querying over the primary key or by building efficient "indexes" for traversing these db tables. You can use inverted indexes in SQL DBs like PostgreSQL, but they are generally not as efficient as they are in search engines like Elasticsearch/Lucene etc. The indexes used in SQL DBs, like the B-Tree index (the default one) and hash indexes, are a kind of forward index, where the mapping is generally from the document (aka doc id) to the whole data row.

The same operation over a search engine, on the other hand, is much simpler. Thanks to reverse indexes, retrieving information across a huge number of documents is very easy and efficient compared to traditional dbs.
In Reverse Indexes the mapping is from "terms" to the documents (as shown in the table below).

Term	Doc Id
buying	Doc2
california	Doc1
heaven	Doc2, Doc4
hotel	Doc1
jude	Doc3
lovely	Doc1
stairway	Doc2
welcome	Doc1, Doc4
and so on....	....

This table shows how a simple inverted index works (a much more complex implementation is discussed in future posts, but this will give you the gist of it). It also showcases the power of inverted indexes when terms are being searched.
For example, if you search for "welcome heaven", there is no exact match in the database, but using the inverted index we can see that the user is looking for Doc4, Doc2 or Doc1 (Doc4 has the highest rank score, since it is in the document lists of both "welcome" and "heaven").
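Here is a minimal Python sketch of that "welcome heaven" example. It assumes Doc4 is "Welcome to the heaven" as in the tables, a very naive tokenizer (lowercase, letters and apostrophes only), no stop-word removal, and a made-up scoring rule that simply counts how many query terms a document contains. Real search engines rank in far more sophisticated ways.

```python
import re
from collections import defaultdict

docs = {
    "Doc1": "Welcome to the Hotel California Such a lovely place",
    "Doc2": "She's buying a stairway to Heaven",
    "Doc3": "Hey Jude, don't make it bad",
    "Doc4": "Welcome to the heaven",
}


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes as terms.
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """term -> set of doc ids containing that term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


index = build_index(docs)
results = search(index, "welcome heaven")
print(results)
```

Doc4 comes out on top because it is the only document in the posting lists of both query terms; Doc3 never shows up because it contains neither.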

To know more about the differences between forward and inverted indexes you can read this article: Forward and Inverted Indexes - Requirement based differences

Components of Inverted Indexes

Let's start with understanding the components of the inverted index. The two main components of an inverted index are the Dictionary and the Posting Lists. For each term in a text collection, there is a posting list which contains information about the term's occurrences in the collection.

Dictionary

The dictionary works as a lookup data structure on top of the posting lists. Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary. In The Washington Post example, we first need to check whether the word "environment" is actually available in our vocabulary, i.e. the inverted index, and if so, identify the corresponding postings. This lookup operation uses a classic data structure called the dictionary. There are two broad classes of solutions: hashing and search trees. We will be going through them in the next articles.

Posting List

The actual index data is stored in the posting lists, which are accessed through the search engine's dictionary. Each term has its own posting list assigned to it.
Since posting lists are usually too large to keep in memory, it is better to store them on disk to reduce the cost. Of course, a disk-based implementation is much more complex than keeping the whole thing in RAM.
Only during query processing are the query terms' posting lists loaded into memory, as required by the query processing routines.
There is no fixed format for posting lists and the index; there are a lot of different versions, like docid index, frequency index, positional index, schema-independent index, etc. We will be going through them in the coming articles.
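To give a feel for "the dictionary points to posting lists on disk", here is a small Python sketch. Everything about the layout is my own assumption for illustration: posting lists are written as little-endian 32-bit doc ids, the dictionary keeps an (offset, count) pair per term, and an in-memory `io.BytesIO` stands in for the file on disk. The helper names `write_postings` and `read_postings` are hypothetical.

```python
import io
import struct


def write_postings(index, out):
    """Write each posting list to `out`; return term -> (byte offset, count)."""
    dictionary = {}
    for term in sorted(index):
        doc_ids = index[term]
        dictionary[term] = (out.tell(), len(doc_ids))
        out.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return dictionary


def read_postings(dictionary, term, src):
    """Load only the requested term's posting list from the postings file."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    offset, count = entry
    src.seek(offset)
    return list(struct.unpack(f"<{count}I", src.read(4 * count)))


postings_file = io.BytesIO()  # stand-in for a real file on disk
dictionary = write_postings(
    {"heaven": [2, 4], "welcome": [1, 4], "hotel": [1]}, postings_file
)

print(dictionary)
print(read_postings(dictionary, "welcome", postings_file))
```

The dictionary stays small and can live in memory, while a query only reads the few bytes that belong to its own terms.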

Here is the diagram which shows a very simplified structure of an inverted index


Stop Words : Some extremely common words, which appear to be of little value in helping select documents that match a query, are excluded from the vocabulary entirely. Like a, an, and, are, as etc.

The Terms column on the left of the inverted index table contains the whole vocabulary of our collection, which we get from parsing N documents.

The Documents/Posting list column helps us identify the location of the term in our collection, for example in which document a particular term occurs. Other information, like the frequency of the term in the document and the positions of the term in the document, can also be saved in the posting lists column along with the document id. There are different ways to implement this too, which are discussed in the coming articles.
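As a rough illustration of postings that carry more than just a document id, here is a Python sketch of a positional index. The structure (term -> {doc id: [positions]}) and the helper name are my own assumptions, the tokenizer is deliberately naive, and "Doc5" is a made-up document added only to show a term occurring more than once.

```python
import re
from collections import defaultdict


def build_positional_index(documents):
    """term -> {doc_id: [positions]}; the term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "Doc1": "Welcome to the Hotel California Such a lovely place",
    "Doc4": "Welcome to the heaven",
    "Doc5": "to be or not to be",  # hypothetical document, not from the article
}
positional = build_positional_index(docs)

print(positional["to"])
print(len(positional["to"]["Doc5"]))  # frequency of "to" in Doc5
```

A docid-only index would keep just the keys of each inner dictionary, and a frequency index would keep the lengths of the position lists.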

So far we have given an overview of the components of the inverted index, i.e. the dictionary and the posting lists. In the next article we will discuss dictionaries and their implementations in depth.

If you feel some part of the article needs more clarification, or there are topics which could be covered in the series, please put them in the comments below and I will for sure work on them.

Thanks. Stay Safe, Stay Strong.