Search Engine From Scratch

Let's define the problem at a high level first.

  1. User enters a query.
  2. The search engine returns the most relevant document (this could be an HTML, PDF, or any other textual content) given this query.

How do we define relevance? We can say relevance is low if none of the query terms appear in the document; such a document is obviously irrelevant.
On the other hand, we could say a document is relevant if the terms occur in it with high frequency. Of course, this is not a binary category; it's a spectrum, but we'll get to that when we implement the relevance/ranking algorithm.

Before trying to build a full-fledged search engine, though, let's start simple and build up from there. The first and simplest idea that comes to mind is to just search for the query string inside the document.

document = "The quick brown fox jumps over the lazy dog."
query = "..." # comes from the user
is_relevant = document.include?(query)

This fails if the user enters "Quick" instead of "quick". We can solve this by converting both the document and the query to lowercase.

document = "The quick brown fox jumps over the lazy dog.".downcase
query = "...".downcase # comes from the user
is_relevant = document.include?(query)

This one fails if the user enters the words in the 'wrong' order, like "dog lazy". We can solve this by splitting the query into words and checking if all of them are present in the document.

is_relevant = query
  .split
  .all? { |word| document.include?(word) }

This one seems like it works, but if the document only contains some of the words in the query and not others, then it won't be considered relevant.
Instead of using all? we can use any? to check if any of the words in the query are present in the document.

is_relevant = query
  .split
  .any? { |word| document.include?(word) }
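
To see the difference between the two checks, here they are run against the example document (the query is made up for illustration):

document = "the quick brown fox jumps over the lazy dog."
query = "lazy cat"

query.split.all? { |word| document.include?(word) } # => false, "cat" is missing
query.split.any? { |word| document.include?(word) } # => true,  "lazy" is present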

Until now, we used a binary classification of relevance. If there are a large number of relevant documents (as on the web), there is no way to sort them. It would be nice if we could rank documents by how relevant they are to the query. We can't do a qualitative analysis of the document (that's a job for an AI model, which we won't use here), but we can do a quantitative one: count how many times each query word appears in the document and sum the counts.

relevance = query
  .split
  .map { |word| document.scan(word).count }
  .sum

We split the query into words with query.split, then we map each word to the number of times it appears in the document with document.scan(word).count. Finally, we sum them up with sum.

This version has the disadvantage of scoring big documents higher than small ones simply because they contain more words. We can solve this by dividing the sum by the number of words in the document, turning raw counts into frequencies. This is called term frequency (TF).

number_of_words = document.split.count
relevance = query
  .split
  .map { |word| document.scan(word).count }
  .sum / number_of_words.to_f
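
As a quick sanity check with the example document from earlier (the query is made up):

document = "the quick brown fox jumps over the lazy dog."
query = "the fox"

number_of_words = document.split.count                        # => 9
counts = query.split.map { |word| document.scan(word).count } # => [2, 1]
counts.sum / number_of_words.to_f                             # => 0.333...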

Not all words are created equal

As you already intuitively know, not all words have the same importance. You can remove some words and still understand the meaning of a sentence. For example, we can remove "the" from the sentence "The quick brown fox jumps over the lazy dog." and still understand it, but if we remove "fox", the meaning changes. We can use this information to give more importance to some words than others.

Words like "the", "a", and "an" are called stop words, and we could remove them from the query by maintaining a list of them. But that only works if we're only dealing with English. We can do something else: not only do words like "the" carry less meaning, they also appear very frequently. If you search for 'Most Common Words in English', you'll find words like "the" and "a" at the top of the list. We can use this to give less importance to words that appear frequently across all documents. The basic formula is as follows:

\text{importance} = \frac{\text{number of documents}}{\text{number of documents containing the word}}

As the number of documents containing the word increases, that means this word is not that unique or specific, so the whole fraction decreases. However, the word may not appear in any document at all; that'll lead to a division by zero. We can solve this by adding 1 to the numerator and denominator.

\text{importance} = \frac{\text{number of documents} + 1}{\text{number of documents containing the word} + 1}

Why do we also add 1 to the numerator? In the extreme case where the word appears in every document, the old formula would give 1; adding 1 to the numerator as well preserves that behavior. It's not that important, though; in the limit, this +1 doesn't matter.
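
To make the edge cases concrete, here is what the smoothed fraction gives for a hypothetical corpus of 100 documents:

(100 + 1) / (100 + 1).to_f # => 1.0   (the word appears in every document)
(100 + 1) / (1 + 1).to_f   # => 50.5  (the word appears in a single document)
(100 + 1) / (0 + 1).to_f   # => 101.0 (the word appears nowhere: no division by zero)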

To calculate the importance of a word, we can use this fraction along with the TF we calculated before.

def importance(word, documents)
  number_of_documents = documents.count
  number_of_documents_containing_word = documents.count { |document| document.include?(word) }
  (number_of_documents + 1) / (number_of_documents_containing_word + 1).to_f
end

def relevance(query, document, documents)
  number_of_words = document.split.count
  query
    .split
    .map { |word| document.scan(word).count * importance(word, documents) }
    .sum / number_of_words.to_f
end

I extracted the importance calculation into a separate function to make it more readable. We pass the documents to the importance function to calculate the number of documents. We can use this function to calculate the importance of each word in the query and multiply it by the term frequency to get the relevance of the document.

The importance formula is almost something called IDF (Inverse Document Frequency), but not exactly: IDF takes the logarithm of the fraction. Why? If a word occurs extremely frequently or extremely rarely, the raw fraction can dominate the scores; taking the logarithm dampens this effect. So we use the logarithm of the fraction instead of the fraction itself.
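
With the same hypothetical 100-document corpus as above, you can see how the logarithm flattens the extremes:

Math.log((100 + 1) / (100 + 1).to_f) # => 0.0, a word that appears everywhere contributes nothing
Math.log((100 + 1) / (1 + 1).to_f)   # => ~3.92, a rare word still counts, but no longer dwarfs everything else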

def importance(word, documents)
  number_of_documents = documents.count
  # count how many documents contain the word.
  number_of_documents_containing_word = documents.count { |document| document.include?(word) }
  Math.log((number_of_documents + 1) / (number_of_documents_containing_word + 1).to_f)
end

def relevance(query, document, documents)
  number_of_words = document.split.count
  query_words = query.split
  query_words
    .map { |word| document.scan(word).count * importance(word, documents) }
    .sum / number_of_words.to_f
end

Partial Queries

Up until now, we only worked with whole words. But what if the user enters a partial query like "tes"? I'd want the search engine to match words like "test", "testing", "tested", etc. We can solve this by working with smaller units than whole words. One option is something called stemming, which reduces a word to its root form. For example, the stem of "fishing", "fished" and "fisher" is "fish".

This would require us to implement stemming for every language we want to support, and it's not like a document has to be in a single language; how would we even know which parts are in which language? There is a simpler and more general approach, albeit a cruder one: n-grams. Instead of writing down the textbook definition, let me give you a couple of examples. Let's say we want to find all 2-grams of the word "quick". We start from the beginning, take every two-letter substring, and move to the right, like this:
"qu", "ui", "ic", and "ck". If we want the 3-grams, we get "qui", "uic", "ick". It's sort of like a moving window.
We're going to modify this a little bit. Instead of moving the window to the right, we're going to extend it to the right. For example, for the word "quick", we'll have "qu", "qui", "quic", "quick". Do you see how this is useful? People remember words by their first letters, not by their middle or last letters, so even if the user didn't enter the full word, we can still match it. We can use this to generate all the n-grams of the query and the document and check if any of them match.

The average length of a root word in European languages is roughly 5. I'm assuming there won't be single-letter words, so I'm going to use window sizes between 2 and 5. I'm generating all the n-grams between 2 and 5, not just one fixed size, because the n-grams change as the user types more letters: whether the user has entered "tes" or the full word "test", there is an n-gram of the document word "test" that matches it. We can generate all the n-grams of the query and the document and check which of them match.

def n_grams(document, min_n: 2, max_n: 5)
  words = document.split
  # collect every n-gram we extract (duplicates included, so we can count occurrences later).
  terms = []
  words.each do |word|
    # the word can be smaller than the max window size so we take the minimum.
    max_window_size = [max_n, word.length].min
    # for every window size take a substring and add it to the terms.
    (min_n..max_window_size).each do |window_size|
      terms << word[...window_size]
    end
  end
  terms
end
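
For example, running the method on a couple of inputs gives:

n_grams("quick")     # => ["qu", "qui", "quic", "quick"]
n_grams("the quick") # => ["th", "the", "qu", "qui", "quic", "quick"]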

Since we changed the smallest unit of a document we look at, we also need to change the relevance and importance methods accordingly.

def importance(term, documents)
  number_of_documents = documents.count
  number_of_documents_containing_term = documents.count { |document| n_grams(document).include?(term) }
  Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
end

def relevance(query, document, documents)
  number_of_terms = n_grams(document).count
  query_terms = n_grams(query)
  query_terms
    .map { |term| n_grams(document).count(term) * importance(term, documents) }
    .sum / number_of_terms.to_f
end

Not much has changed. Instead of using words, we're using terms/n-grams.

One last thing is to remove punctuation from the documents and the query. We can do this by using a regular expression. Luckily, Ruby has a very short way of doing this with a regex.

def clean(document)
  document.gsub(/[[:punct:]]/, " ")
end
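
For instance (the input string is just an example), every punctuation character gets replaced with a space:

clean("Don't forget those TPS reports.")
# => "Don t forget those TPS reports "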

Full Engine

Let's put the pieces together and build a full search engine.

The API could look like this:

search_engine = SearchEngine.new
search_engine.add_document("first_document", "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh")
search_engine.add_document("second_document", "Everyone,\n\nM-m-m-m-my red stapler has gone missing. H-h-has a-an-anyone seen it?\n\nMilton")
search_engine.add_document("third_document", "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh")

results = search_engine.search("tps repor")
puts results
<<-OUTPUT
[
  { :name => "first_document", :relevance => SCORE_1, :content => "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh" },
  { :name => "third_document",  :relevance => SCORE_2, :content => "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh" }
]
OUTPUT

where SCORE_1 should be higher than SCORE_2 and the documents should be sorted by relevance.

Here is the full implementation:

class SearchEngine
  def initialize
    @documents = []
  end

  def add_document(name, document)
    @documents << { name: name, content: clean(document) }
  end

  def search(query)
    # clean the query the same way we clean documents, so case and punctuation don't get in the way.
    query = clean(query)
    document_contents = @documents.map { |document| document[:content] }
    @documents
      .map { |document| { name: document[:name], relevance: relevance(query, document[:content], document_contents), content: document[:content] } }
      .sort_by { |document| -document[:relevance] }
  end

  private

  def clean(document)
    document.gsub(/[[:punct:]]/, " ").downcase
  end

  def n_grams(document, min_n: 2, max_n: 5)
    words = document.split
    terms = []
    words.each do |word|
      # the word can be smaller than the max window size like the word 'and' so we take the minimum.
      max_window_size = [max_n, word.length].min
      (min_n..max_window_size).each do |window_size|
        terms << word[...window_size]
      end
    end
    terms
  end

  def importance(term, documents)
    number_of_documents = documents.count
    # `count` method can take a block and count the number of times the block returns true.
    number_of_documents_containing_term = documents.count { |document| n_grams(document).include?(term) }
    Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
  end

  def relevance(query, document, documents)
    # Instead of just using `.split` we generate the n-grams and work on that.
    number_of_terms = n_grams(document).count
    query_terms = n_grams(query)
    query_terms
      .map { |term| n_grams(document).count(term) * importance(term, documents) }
      .sum / number_of_terms.to_f
  end
end

Performance

How performant is this? Not very. First of all, we're using a dynamically typed language like Ruby; since there is a lot of data processing involved, a systems programming language like C, Rust, or Zig would be a better fit. Another reason is that we reprocess every document for every query. Most of the time, the documents we're searching will stay the same, or change much less frequently than the queries. So we can pre-process the documents and store the n-grams and the importance of each term. Then, at query time, we only use the stored data to calculate relevance. This will make the search engine much faster.

Ok, how are we going to store this pre-processed data? The trick is to look at the relevance method.
number_of_terms = n_grams(document).count:
We need a mapping between documents and their number of terms. We can have a documents collection and store the number of terms for each document. This document could look like this:

{
  "_id": "first_document",
  "content": "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh",
  "number_of_terms": 20
}

On the web, _id would be the URL of the document, but here we're using a simple string. The database can index the _id field, so we can quickly access a document.

query_terms = n_grams(query):
We can't do much about this because the query is completely dynamic.

n_grams(document).count(term):
We need a mapping between terms and appearances. Since a term can appear in more than one document, we also need a mapping from document names to the number of appearances of the term (documents that don't contain the term simply don't get an entry). This structure could look like this:

{
  "_id": "tes",
  "appearances": {
    "first_document": 2,
    "third_document": 1
  }
}

Here I'm using the term's name as the _id because it's unique.

importance(term, documents):
We can just add another field to the above structure:

{
  "_id": "tes",
  "appearances": {
    "first_document": 2,
    "third_document": 1
  },
+ "importance": 0.5
}

By the way, this is called an inverted index. You can read more about it on Wikipedia.

Let's put these into code. I'm going to use MongoDB because it's really easy to convert from these JSONy structures to MongoDB documents.

def initialize(url)
  @client = Mongo::Client.new(url)
end

The initialize method takes a URL to connect to the MongoDB server.
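
For example (the connection string is made up; point it at your own MongoDB instance and database):

search_engine = SearchEngine.new("mongodb://localhost:27017/search_engine")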

def search(query)
  # clean the query the same way we clean documents, so case and punctuation don't get in the way.
  query = clean(query)
  @client[:documents]
    .find
    .map { |document| { name: document[:_id], relevance: relevance(query, document), content: document[:content] } }
    .sort_by { |document| -document[:relevance] }
    .reject { |document| document[:relevance] == 0 }
end

search is almost the same. We get the documents from the database instead of from an instance variable, we clean the query up front, and we drop documents whose relevance is 0.

def add_document(name, document)
  document = clean(document)
  terms = n_grams(document)
  number_of_terms = terms.count
  @client[:documents].insert_one({ _id: name, content: document, number_of_terms: number_of_terms })
  terms.each do |term|
    @client[:terms].update_one(
      { _id: term },
      # Increment the count of this document by one.
      { "$inc": { "appearances.#{name}": 1 } },
      upsert: true,
    )
  end
end

Here, after we save the document, we increment that document's counter by one for each term; if the term (or the counter) doesn't exist yet, it gets created and set to one thanks to $inc and upsert: true.
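
For example, after adding the three documents from earlier, the stored document for the term "repor" (one of the n-grams of "reports") would look roughly like this:

{
  "_id": "repor",
  "appearances": {
    "first_document": 1,
    "third_document": 1
  }
}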

def importance(term)
  appearances = @client[:terms].find(_id: term).first["appearances"]
  number_of_documents = @client[:documents].count
  number_of_documents_containing_term = appearances.count
  Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
end

Not much to say here. We don't have to scan every document to find out how many contain the term; we can just look at the length of the appearances field.
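
In plain Ruby terms, that count is just the size of a hash (hypothetical values, mirroring the term document shown earlier):

appearances = { "first_document" => 2, "third_document" => 1 }
appearances.count # => 2, the number of documents containing the term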

def relevance(query, document)
  # the document we get from `find` already carries its pre-computed term count.
  number_of_terms = document["number_of_terms"]
  query_terms = n_grams(query)
  query_terms.map { |term| term_relevance(term, document[:_id]) }.sum / number_of_terms.to_f
end

def term_relevance(term, document_id)
  # `first` returns nil for a term we've never indexed, so guard before digging into it.
  term_record = @client[:terms].find(_id: term).first
  return 0 if term_record.nil? || term_record["appearances"][document_id].nil?
  importance(term) * term_record["appearances"][document_id]
end

relevance doesn't have to take the documents as an argument anymore; it can just look at the database. I extracted the block that calculates the relevance of a single term into a separate method called term_relevance for readability. If we've never indexed the term, or it doesn't appear in this document, we return 0; otherwise, we calculate the relevance as before by multiplying the importance by the number of appearances.

The full class looks like this:

require "mongo"

class SearchEngine
  def initialize(url)
    @client = Mongo::Client.new(url)
  end

  def add_document(name, document)
    document = clean(document)
    terms = n_grams(document)
    number_of_terms = terms.count
    @client[:documents].insert_one({ _id: name, content: document, number_of_terms: number_of_terms })
    terms.each do |term|
      @client[:terms].update_one(
        { _id: term },
        # Increment the count of this document by one.
        { "$inc": { "appearances.#{name}": 1 } },
        upsert: true,
      )
    end
  end

  def search(query)
    # clean the query the same way we clean documents, so case and punctuation don't get in the way.
    query = clean(query)
    @client[:documents]
      .find
      .map { |document| { name: document[:_id], relevance: relevance(query, document), content: document[:content] } }
      .sort_by { |document| -document[:relevance] }
      .reject { |document| document[:relevance] == 0 }
  end

  private

  def clean(document)
    document.gsub(/[[:punct:]]/, " ").downcase
  end

  def n_grams(document, min_n: 2, max_n: 5)
    words = document.split
    terms = []
    words.each do |word|
      # the word can be smaller than the max window size like the word 'and' so we take the minimum.
      max_window_size = [max_n, word.length].min
      (min_n..max_window_size).each do |window_size|
        terms << word[...window_size]
      end
    end
    terms
  end

  def importance(term)
    appearances = @client[:terms].find(_id: term).first["appearances"]
    number_of_documents = @client[:documents].count
    number_of_documents_containing_term = appearances.count
    Math.log((number_of_documents + 1) / (number_of_documents_containing_term + 1).to_f)
  end

  def relevance(query, document)
    # the document we get from `find` already carries its pre-computed term count.
    number_of_terms = document["number_of_terms"]
    query_terms = n_grams(query)
    query_terms.map { |term| term_relevance(term, document[:_id]) }.sum / number_of_terms.to_f
  end

  def term_relevance(term, document_id)
    # `first` returns nil for a term we've never indexed, so guard before digging into it.
    term_record = @client[:terms].find(_id: term).first
    return 0 if term_record.nil? || term_record["appearances"][document_id].nil?
    importance(term) * term_record["appearances"][document_id]
  end
end
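
Since the index now lives in MongoDB, documents only need to be added once, and later runs can go straight to searching. A minimal end-to-end run might look like this (again, the connection string is made up):

search_engine = SearchEngine.new("mongodb://localhost:27017/search_engine")
search_engine.add_document("first_document", "Peter,\n\nI'm going to need those TPS reports on my desk first thing tomorrow! And clean up your desk!\n\nLumbergh")
search_engine.add_document("second_document", "Everyone,\n\nM-m-m-m-my red stapler has gone missing. H-h-has a-an-anyone seen it?\n\nMilton")
search_engine.add_document("third_document", "Peter,\n\nYeah, I'm going to need you to come in on Saturday. Don't forget those reports.\n\nLumbergh")

search_engine.search("tps repor").each do |result|
  puts "#{result[:name]} (#{result[:relevance].round(3)})"
end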

Conclusion

If you want to go deeper into the rabbit hole of search engines, you should look at BM25, PageRank, and Word2Vec. Maybe a next step would be using AI models, but I'm not sure that would improve the search experience; for the last few years, search engines have been getting shittier every day, so maybe it's not such a good idea.
Anyway, I hope you enjoyed this article. If you have any suggestions, please let me know. I'm still new to documenting my thought process, so any feedback is appreciated. Thanks for reading!

