Samriddha Chattopadhyay

Posted on Jul 18, 2022

How to build your own google search

#programming #algorithms #computerscience

Got interested ? Yeah we will build an algorithm to make our own search engine.

There are a couple of steps in doing so. First, we do some web scraping and index the data we obtain to store it in the database in a meaningful format. We won’t be doing these here, though. What we will do is to search what you want from these indexed data. So let’s get started.

Problem at hand

You suddenly get an urge to build your own google search. Your friend is helping you out in the initial stuff, but he is really dumb when it comes to algorithms. So you need to build an algorithm that can search for a word efficiently from a given list of strings.

We shall make some assumptions here. Firstly, we will assume that the strings will be unique and will be comprising of single word each. Also, the characters of the strings will be from a-z.

NOTE: These assumptions are just to deduce the algorithm. The algorithm that will be building can handle all characters and also multi-word strings. These assumptions will just let us deduce the algorithm easily and then we can generalise our algorithm by some minute modifications.

Intuition

Let’s first try to think what we can do in this case.

One simple option is to hash each string to a value and use the string as and when it is called. However, there are two problems in this method. We need to store the same part of the string twice. For example, consider these two cases:

[ "app", "apple" ]

Here, it is evident that storing app and apple separately will cause us to use more memory as compared to storing one of them. So the question is can we do better ?

Well, the answer is yes we can, however, there are some trade-offs for this method.

We can use something called a trie.

Tries are efficient information retrieval data structures. We can use these to retrieve data more efficiently than arrays (they don’t have that O(1) time complexity, though).

And guess what ?

Tries take lesser memory than hashmaps.

Solution

Now, let’s implement our trie. Before we start, though, I’d like to give you a basic structure of a TrieNode (basically a node or vertex in the trie).

struct Node{
    char val;
    Node* next;
};

class TrieNode{
    unordered_map<char, Node*> children;
    bool endOfWord;
public:
    TrieNode() {
        for(char c='a'; c<='z'; c++)
            children[c] = NULL;
        endOfWord = false;
    }
};

This is what a basic trie node looks like. Here, I took the information in each node as a character. It can be anything ranging from a character to complex objects. The children map will have all the characters in range (here, a-z) as keys and the node for that character’s data as the value.

So once we have that in place, now is the time we construct our trie.

First step is to understand how it will work. Basically, we will have a starting node (let’s call this a wildcard character). This node will be an empty TrieNode with null pointers as all it’s children and endOfWord as false.

Now, we keep on adding the strings in our trie. For each string, we add one character to the trie’s children with the corresponding node’s value. We repeat unless the string has ended. Once it has, we mark the endOfWord of that TrieNode as true.

void insert(string word) {
    Node* curr = root;
    for(char c : word) {
        if(curr->children.find(c) == curr->children.end()) {
            curr->children[c] = new Node();
            curr = curr->children[c];
        }
    }
    curr->endOfWord = true;
}

We do this for each string we get and at the end, we will have our trie completed.

Good, now we have our data in place. But how do we query it ?

If you understood it till now, you should’ve already guessed it. However, if you are dumb (like me), then hang on. Let’s discuss this through.

So querying it is quite simple. For the query word, we perform a similar logic and search character by character. If at any moment we see that a character that you need to complete the current word is not present, we can conclude that the word is not present in the trie and need to go further. (Case I)

In case you find the whole word, however, the last character you visited to doesn’t have endOfWord as true, then again, the word you are searching for doesn’t exist. In all other cases, you can find the word you are searching for. (Case II)

Consider the following example for case I:

  *
 / \
a   b
|   |
p   e
|   |
p   a
|   |
l   r
| 
e

So we have “apple” and “bear” in our trie. Now, say, you search for “boat”.

You see we have a ‘b’ but, hold on we do not have an o after b. Hence, “boat” doesn’t exist in our trie. Now, say you add “boat” to it. It’d look something like this:

  *
 / \
a   b
|  / \
p o   e
| |   |  
p a   a
| |   |  
l t   r
| 
e

Now, if you search for “boat”, you can see you will get all the characters, and ‘t’ will also have endOfWord as true.

Consider the following example for case I:

  *
 / \
a   b
|  / \
p o   e
| |   |  
p a   a
| |   |  
l t   r
| 
e

Now, say you search for “app”. You notice, ‘a’ is present, then ‘p’ is also present and the next ‘p’ is also present. However, this ‘p’ doesn’t have endOfWord set to true. Hence, “app” isn’t present in our trie. Now, if you add “app” to our trie, it’d look something like this.

  *
 / \
a   b
|  / \
p o   e
| |   |  
p a   a
| |   |  
l t   r
| 
e

As you can see, we marked p with endOfWord true as well in the above case.

Here is how you implement this in code:

bool search(string word) {
    Node* curr = root;
    for(char c : word) {
    if(curr->children.find(c) == curr->children.end()) {
      return false;
    }
    curr = curr->children[c];
    }

    return curr->endOfWord;
}

Also, let’s discuss a bonus thing you can do with tries. You can even predict the words (or characters in this case) that the user is going to type next. It works on the same logic as above with a slight modification.

Here, we just say if the word starts with the prefix string. All you need to do to make this into an autosuggestion feature is to put the words into an array and returning it in place of true (also, change the function return type lol).

bool startsWith(string prefix) {
    Node* curr = root;

    for(char c : prefix) {
        if(curr->children.find(c) == curr->children.end()) {
            return false;
        }
        curr = curr->children[c];
    }

    return true;
}

With that, my lecture is over. But you’d ask now, “Wait! You mentioned this is better than hashmaps and started giving a lecture without even asking if I was interested. So is it really more efficient than a hasmap ?”

So let’s try to answer that.

Here, we are traversing through the entire trie and we visit each node in a branch exactly once. So, the time complexity will be O(h) in the worst case, where h is the height of the trie. If you know that the trie has N nodes, then, you can even say it takes O(logN) time (Yay less than linear!!!).

The space needed is almost constant as each node has a fixed number(26 in this case) of children. If the height of tree is h, then, space taken is O(26*h) ~ O(h) which can also be interpreted as O(logN) if the number of nodes are N.

As you can imagine, it’s pretty efficient as compared the the naive way of using a hashmap to store stuff. Sure, we’d get O(1) search time, but we’d have taken a lot of memory and we also would’ve to give up on the autosuggest featue xP.

Well, with that, you’ve successfully planned the algorithm to build the next google search. What are you waiting for ? Go and build it! Don’t keep your friend waiting after he did all that hardwork of doing the initial indexing for you.

Thanks for reading it till the end and stay tuned for more amazing content like this one!

DEV Community

How to build your own google search

Problem at hand

Intuition

Solution

Top comments (0)

Read next

Master Efficient Window Scroll Event Handling in JavaScript: Best Practices and Tips

Latterns animation using the html css and javascript code

Why APIs Are the Unsung Heroes of App Development

Mental Health Application