As the name suggests Full Text Search is a method which is heavily used to get the relevant and faster search result especially when we have high volume of unstructured data.
Some of the use cases in which Full text search is used include :
1. Searching for a product on E-Commerce Website
2. Searching movies based on the description,movie star etc
3. Searching on Google( any search engine )
and many more....
Need for Full text search ?
As the volume of data and number of users grow , performing search on a column of a relational database using LIKE operator is inefficient and time consuming. Even if we create an index on the column the search is not efficient because if the LIKE operator has a wild card then that require every key of the index to be examined . Also relational database do not provide fuzzy search.
Lets understand the difference between Full Text Search and LIKE operator of relational database with an example.
- Lets say we have a database that stores all the comments of Virat Kholi's Instagram post, and you want to search for all the comments related to the word(concept) running.Then in such case if you perform a search query using LIKE = 'run' then your query would return all the comments that contains the word run.
- If the comment contains the word 'RAN' , then also that comment is related to running but that won't be returned by the LIKE operator.However,if the search is performed using some Full Text Search Solution then it returns all the comments that contains the word run,ran,running etc .
- Hence , In order to find more relevant search, Full text search solution is being used as the data becomes unstructured and the number of users grow .
How does Full text search solution give appropriate(relevant) search result ?
How Full Text Search returns the results faster than the LIKE operator on indexed Column ?
Full Text Search achieves above things by creating inverted index for the stored data. Inverted index is a hashmap like data structure that directs you from a word to a document or a web page. Lets understand with an example.
Imagine that you have documents like this:
doc1: "Today it's a perfect day to start my workout."
doc2: "I will not come today to meet you."
doc3: "Our meeting will start soon!"
doc4: "My class starts at 6."
Following are the steps used to create an inverted index:
- Tokenizing : It just extracts the words from the sentence. And each word extracted from the sentence is called the token.
- Remove stop words : The words like 'I' ,'You', 'Your' etc occur frequently and are useless words occurring in a document.Hence we remove these stop words.
- Stemming of Root Word : Whenever I want to search for 'start', I want to see a document that has information about it. But the word present in the document is 'starts' . To relate the word 'start' and 'starts' , We chop some part of each word and then we get 'root word'.
After tokenization, removing stop words, stemming the documents will look like this:
doc1: ["today", "perfect", "day", "start", "workout"]
doc2: ["come", "today", "meet"]
doc3: ["meet", "start", "soon"]
doc4: ["class", "start"]
Now we create a hashmap with word as the key and document Id as the values.
today : [doc1, doc2]
perfect : [doc1]
day : [doc1]
start : [doc1, doc3]
workout : [doc1]
come : [doc2]
meet : [doc2, doc3]
class : [doc4]
soon : [doc3]
This process is called indexing and the above mapping is called inverted index.
As the documents are stored in this format,the full text search gives most relevant search result .
And searching a word in full text search solutions takes O(1). However searching in relational database with index on the column takes O(log H) where H = Height of the B Tree , as it uses B Tree to create an index.
Various Full Text Search Solution present are Solr , Elastic Search which uses Lucene.
Top comments (0)