How I built a Deep Learning Powered Search Engine

We use search engines every day. Without them, we would be using the Internet in a completely different way, one that I honestly can't imagine.

I decided to have a go at making my own small search engine, using Deep Learning to power the search results.

Breaking the problem down

Because making a fully fledged search engine is a very long and expensive task, I decided that my search engine would be for Wikipedia pages only (i.e. given a search query, it returns any relevant Wikipedia pages).

There were 3 main steps to building this search engine:

  • Find/Create a database of Wikipedia pages
  • Index the database of pages
  • Efficiently match search queries to pages in the index

Creating the Database

Instead of trying to find a database of Wikipedia pages online, I decided to create my own, simply because I wanted to build a web crawler!

What is a web crawler?

A web crawler is initially given a few starting URLs to visit. It visits these URLs and parses the HTML to find more URLs. Any URLs it finds are added to a queue, so that they can be visited later.

Search engines like Google and Bing use web crawlers to keep adding to their index of website links, so that those pages can come up in their search results.

Since most of the URLs on Wikipedia pages point to other Wikipedia pages, all I had to do was write a simple, generic web crawler and feed it a few initial Wikipedia pages.

import requests
import os
from tinydb import TinyDB
from html.parser import HTMLParser

db = TinyDB("results.json")

def get_host(url):
    while os.path.dirname(url) not in ["http:", "https:"]:
        url = os.path.dirname(url)

    return url

class Parser(HTMLParser):
    def __init__(self):
        super(Parser, self).__init__(convert_charrefs=True)
        self.url = ""
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False

    def set_url(self, url):
        if url[-1] == "/":
            if os.path.dirname(url[:-1]) not in ["http:", "https:"]:
                url = url[:-1]
        self.url = url

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            for attr in attrs:
                if attr[0] == "name" and attr[1] == "description":
                    self.set_description = True

                if self.set_description:
                    if attr[0] == "content":
                        self.meta_description = attr[1]
                        self.set_description = False

        elif tag == "a":
            for attr in attrs:
                if attr[0] == "href":
                    link = attr[1]

                    if link:
                        if link[0] == "/":
                            # relative link: prepend the current page's host
                            link = get_host(self.url) + link
                            self.urls.append(link)
                        elif "http://" in link and link.index("http://") == 0:
                            self.urls.append(link)
                        elif "https://" in link and link.index("https://") == 0:
                            self.urls.append(link)

        elif tag == "p" and len(self.paragraph_content) < 100:
            self.paragraph = True

        elif tag == "title":
            self.set_title = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.paragraph = False

    def handle_data(self, data):
        if self.set_title:
            self.title = data
            self.set_title = False
        elif self.paragraph:
            self.paragraph_content += data

    def clear(self):
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False

def crawl(start_queue):
    parser = Parser()

    queue = start_queue
    seen_urls = []
    while len(queue) > 0:
        if queue[0] not in seen_urls:
            print(queue[0])

            page = requests.get(queue[0])

            # parse the page for its title, description, content and outgoing links
            parser.set_url(queue[0])
            parser.feed(page.text)

            # save the page's recorded details to the local TinyDB database
            db.insert({
                "title": parser.title,
                "description": parser.meta_description,
                "content": parser.paragraph_content,
                "url": queue[0]
            })

            seen_urls.append(queue[0])
            queue = queue + parser.urls
            parser.clear()

        queue = queue[1:]

crawl(["", "", ""])  # starting Wikipedia URLs (omitted here)

Every time the crawler visits a page, it scans the HTML for any a tags and adds their href attributes to the queue.

It also records the page's title (found between the title tags), the first 100 characters of the page's article (taken from the content of the p tags) and the page's meta description (although, after crawling, I found that none of the Wikipedia pages have meta descriptions!).

After the page is scanned through, its URL and recorded details are saved to a local JSON file, using TinyDB.

I ran the crawler for a few minutes and managed to scrape around 1000 pages.

Indexing the Database

To return relevant pages for a user's search query, I planned to use a k-nearest-neighbours (KNN) algorithm to compare the vector encoding of the search query with the vector encodings of the page contents stored in the database.

Vector encodings of natural language are crucial when it comes to understanding what a user is saying. As you will see later, I decided to use a Transformer model to encode sentences into a vector representation.


With what I had so far, it was already possible to build a working search engine with the above method in mind, but it would be extremely inefficient, since it would involve going through EVERY record in the database, vectorising its content and comparing it to the query vector.
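
To make that cost concrete, here is a rough sketch of what the brute-force version would look like (this is just an illustration of the naive approach, assuming the same results.json database and model used later in the post, and not code I actually kept):

import numpy as np
from tinydb import TinyDB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

# encode the content of every page up front
records = db.all()
doc_embeddings = model.encode([record["content"] for record in records])

query = input("Search: ")
query_embedding = model.encode(query)

# cosine similarity between the query and EVERY page in the database
doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)
scores = doc_norms @ query_norm

# rank every page and take the 10 best matches
for i in np.argsort(scores)[::-1][:10]:
    print(records[int(i)]["title"], records[int(i)]["url"])

Even with the document embeddings computed up front, every single query still has to be compared against every single page, which is exactly what I wanted to avoid.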

Increasing Search Efficiency

In order to increase the efficiency of the search, I had an idea to index/preprocess the database, in order to reduce the search space for the KNN algorithm.

The first step was to vectorise the content of all the pages I had scraped and stored in my database.

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

contents = []

# collect the scraped content of every page in the database
for record in db:
    content = record["content"]
    contents.append(content)

embeddings = model.encode(contents)

As you can see, for vectorising sentences, I used the SentenceTransformer library and the "stsb-roberta-base-v2" transformer model, which was fine-tuned for tasks like Neural Search, where a query needs to be matched with relevant documents (the exact task I had at hand).


Now that I had all the vector representations of the pages, I decided to cluster semantically similar pages together, using the K-Means clustering algorithm.

The idea behind this was that, when a search query is entered, the query would be vectorised and be classified into a cluster. Then we could perform KNN with the pages in that cluster only, instead of the whole database, which should improve efficiency.

# group the page embeddings into 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(embeddings)

buckets = {}

# bucket each database record under the cluster label its embedding was assigned
for record, label in zip(db, kmeans.labels_):
    if label not in buckets:
        buckets[label] = []
    buckets[label].append(record)


import pickle

# save the fitted model and the buckets for use at search time (filenames omitted)
pickle.dump(kmeans, open("", "wb"))
pickle.dump(buckets, open("", "wb"))

Clustering pages

from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')

kmeans = pickle.load(open("", 'rb'))
buckets = pickle.load(open("", 'rb'))

while True:
    query = input("Search: ")
    encoded_query = model.encode(query)
    encoded_query = np.expand_dims(encoded_query, axis=0)
    bucket = kmeans.predict(encoded_query)
    bucket = int(bucket.squeeze())
    # collect and encode the contents of every page in the predicted cluster
    embeddings = []
    for result in buckets[bucket]:
        embeddings.append(result["content"])
    embeddings = model.encode(embeddings)

    # exact KNN over the pages in this cluster only
    neigh = NearestNeighbors(n_neighbors=10)
    neigh.fit(embeddings)
    indexes = neigh.kneighbors(encoded_query, return_distance=False)

    for index in indexes.T:
        doc = buckets[bucket][int(index.squeeze())]
        print(doc["title"], doc["url"])
        print("")

Matching user query to 10 most relevant pages

As the code above shows, I broke the database down into 3 clusters.

However, I did a lot of tinkering with the number of clusters.

If we group the database into too many clusters, searches become extremely efficient, but there is a huge drop in result accuracy; with too few clusters, it's the other way around.

I found that, with 3 clusters, the accuracy of the results was really high, but it was still extremely slow to return them (~7 seconds for each search, with the worst case being 32 seconds).

Search: how do i compose music

Musical instrument - Wikipedia

Elements of music - Wikipedia

Music criticism - Wikipedia

Contemporary classical music - Wikipedia

Accompaniment - Wikipedia

Musical improvisation - Wikipedia

Musique concrète - Wikipedia

Programming (music) - Wikipedia

Film score - Wikipedia

Song - Wikipedia

Harpsichord - Wikipedia

Music theory - Wikipedia

Music industry - Wikipedia

Definition of music - Wikipedia

Wolfgang Amadeus Mozart - Wikipedia

Invention (musical composition) - Wikipedia

Music - Wikipedia

Music - Wikipedia

Aesthetics of music - Wikipedia

Musicology - Wikipedia

Searches relating to music took the most time, possibly because the database contained a lot of music pages, and so they were all grouped into one big cluster.

With anything above 6 clusters, I found that results were returned quickly, but the accuracy was poor. For example, I'd search something simple, such as "Liverpool Football", and the engine would fail to return the Liverpool F.C. page, despite it being present in the database.

A better solution

With the trade-off between speed and accuracy in the above solution being way too sensitive, I had to find a better solution.

After a bit of research, I came across ANNOY.

ANNOY stands for "Approximate Nearest Neighbours Oh Yeah" and is a small library, provided by Spotify, to search for points in space that are close to a given query point.

Spotify themselves use ANNOY for their user music recommendations!

Approximate Nearest Neighbours?

You may be wondering why we would want an approximate nearest neighbours algorithm. Why not an exact one?

For KNN to be exact, it has to iterate through each and every datapoint given to it, which is obviously extremely inefficient.

Things can be drastically sped up if a little bit of accuracy is sacrificed and, in practice, this sacrifice barely matters. A user would not mind if the first and second closest datapoints were swapped around, since both are probably good matches for their query.
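
To get a rough feel for the difference, here is a small sketch (my own illustration, not from ANNOY's docs) that times exact KNN from scikit-learn against ANNOY on random vectors. The absolute numbers depend entirely on your machine and data, so treat it as an illustration rather than a benchmark:

import time
import numpy as np
from sklearn.neighbors import NearestNeighbors
from annoy import AnnoyIndex

dim, n_points = 768, 20000
data = np.random.rand(n_points, dim).astype("float32")
query = np.random.rand(dim).astype("float32")

# exact nearest neighbours: the query is compared against every point
exact = NearestNeighbors(n_neighbors=10).fit(data)
start = time.perf_counter()
exact.kneighbors(query.reshape(1, -1), return_distance=False)
print("exact search:", time.perf_counter() - start, "seconds")

# approximate nearest neighbours: only a handful of tree paths are explored
index = AnnoyIndex(dim, "euclidean")
for i, vec in enumerate(data):
    index.add_item(i, vec)
index.build(10)  # 10 trees, chosen arbitrarily for this example

start = time.perf_counter()
index.get_nns_by_vector(query, 10)
print("ANNOY search:", time.perf_counter() - start, "seconds")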

How ANNOY works

Here is a good article explaining how ANNOY works (written by the man who built ANNOY himself!)

ANNOY works by building loads of binary trees (a forest) from the dataset it's given.

To build a tree, it selects two random points in the vector space and divides the space into two subspaces, by the hyperplane equidistant to the two random points.


The process repeats again, in the new subspaces that were just made.


This keeps going until each subspace contains at most some number n of points.

Points that are near to each other should end up in the same subspace, since it is unlikely that a hyperplane would fall between them and split them into separate subspaces.

Now that we have all the subspaces, a binary tree can be constructed.


Each internal node of the tree represents a hyperplane. So, given a query vector, we can traverse down the tree, with each hyperplane telling us which side to descend into, until we reach a leaf containing a handful of points that should be close to the query.

ANNOY builds many of these trees to build a forest. The number of trees in the forest is specified by the programmer.

When given a query vector, ANNOY uses a priority queue to search for it across the binary trees in its forest. The priority queue lets the search focus on the most promising paths first (i.e. splits whose hyperplanes are furthest from the query vector).

After it has finished searching, ANNOY pools the candidate points its trees have found, and these form the query vector's "neighbours".


Now the k nearest neighbours can be ranked and returned.
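
To make the idea more concrete, here is a toy, single-tree version of that splitting-and-traversal process (my own simplified sketch; the real ANNOY builds a whole forest and is far more optimised):

import numpy as np

def build_tree(points, indices, leaf_size=5):
    # stop splitting once a subspace holds few enough points
    if len(indices) <= leaf_size:
        return {"indices": indices}

    # pick two random points; the splitting hyperplane is equidistant to them
    a, b = points[np.random.choice(indices, 2, replace=False)]
    normal = a - b
    midpoint = (a + b) / 2

    left = [i for i in indices if np.dot(points[i] - midpoint, normal) <= 0]
    right = [i for i in indices if np.dot(points[i] - midpoint, normal) > 0]

    # degenerate split: just keep everything in one leaf
    if not left or not right:
        return {"indices": indices}

    return {
        "normal": normal,
        "midpoint": midpoint,
        "left": build_tree(points, left, leaf_size),
        "right": build_tree(points, right, leaf_size),
    }

def query_tree(node, q):
    # walk down the side of each hyperplane that the query falls on
    if "indices" in node:
        return node["indices"]
    side = np.dot(q - node["midpoint"], node["normal"])
    return query_tree(node["left"] if side <= 0 else node["right"], q)

points = np.random.rand(200, 768)
tree = build_tree(points, list(range(len(points))))
candidates = query_tree(tree, np.random.rand(768))  # candidate neighbours, ready to be ranked
print(len(candidates), "candidate points found")

ANNOY then pools candidates like these from every tree in its forest before ranking them by their true distance to the query.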


Refactoring Indexing and Search code

It didn't take long to change the code to use ANNOY, thanks to its straightforward API.

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

descriptions = []

# collect the scraped content of every page in the database
for record in db:
    description = record["content"]
    descriptions.append(description)

embeddings = model.encode(descriptions)

# one ANNOY index entry per page embedding
index = AnnoyIndex(embeddings.shape[-1], "euclidean")

vec_idx = 0
for vec in embeddings:
    index.add_item(vec_idx, vec)
    vec_idx += 1

index.build(10)  # build the forest; 10 trees here as an example
index.save("index.ann")  # stores the results of the indexing

Indexing code

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

index = AnnoyIndex(768, "euclidean")
index.load("index.ann")  # load the index built by the indexing script

while True:
    query = input("Search: ")

    vec = model.encode(query)
    indexes = (index.get_nns_by_vector(vec, 20))
    all_db = db.all()

    for i in indexes:
        print (all_db[i]["title"], all_db[i]["url"])
        print ("")

User search code

Now let's see the results...

Search: great composers
Igor Stravinsky - Wikipedia

Contemporary classical music - Wikipedia

Gustav Mahler - Wikipedia

Symphony - Wikipedia

Art music - Wikipedia

Symphony No. 5 (Beethoven) - Wikipedia

Wolfgang Amadeus Mozart - Wikipedia

Ludwig van Beethoven - Wikipedia

Music of Central Asia - Wikipedia

Big band - Wikipedia

Program music - Wikipedia

Cello Suites (Bach) - Wikipedia

Film score - Wikipedia

Johann Sebastian Bach - Wikipedia

Elements of music - Wikipedia

Music of China - Wikipedia

Organ (music) - Wikipedia

Toccata and Fugue in D minor, BWV 565 - Wikipedia

Georg Philipp Telemann - Wikipedia

Sonata form - Wikipedia

Search: what are the rules of cricket
No-ball - Wikipedia

International cricket - Wikipedia

Toss (cricket) - Wikipedia

Match referee - Wikipedia

Cricket ball - Wikipedia

Board of Control for Cricket in India - Wikipedia

India national cricket team - Wikipedia

Caught - Wikipedia

Substitute (cricket) - Wikipedia

International Cricket Council - Wikipedia

Portal:Cricket - Wikipedia

Delivery (cricket) - Wikipedia

Cricketer (disambiguation) - Wikipedia

Cricket (disambiguation) - Wikipedia

Cricket West Indies - Wikipedia

Zimbabwe Cricket - Wikipedia

Bowled - Wikipedia

Pakistan Cricket Board - Wikipedia

World Cricket League - Wikipedia

West Indies cricket team - Wikipedia

As you can see, the results returned are pretty accurate! It obviously doesn't help that the database I had was pretty small, but the engine still yielded some good results despite that!

On top of that, these results were produced almost instantly, much, much quicker than with the previous solution.


The only thing I could see holding the search results back was the size of the database. If I set out to build a much bigger search engine in the future, I'd look to use databases such as Firebase or MongoDB, and look into how ANNOY could interface with them.

Having said that, I built this project to investigate how deep learning models can be used in document search tasks, and what can be done to perform those searches efficiently, and I think I've taken a lot away from it.

Thank you for reading and I hope you've learnt something from this too!
