loading...

Writing a (trivial) search engine

gangezilla profile image Scott ・3 min read

I was recently made tech lead of a team focusing on search, without knowing much about search. To skill up on how search engines work, I started reading An Introduction to Information Retrieval. FYI, because I didn’t know, search engines and searching fall into the field of study called Information Retrieval. Hopefully that’ll save you some frantic Googling around the lines of “How do search engines work?”.

After reading a couple of chapters, I decided to write a trivial search engine to understand what I was reading. You can find it here. It indexes three Shakespeare texts, and you can search through those 🎉. It’s a bit buggy (try searching for more than two terms 🤷🏻‍♂️), but I’m pretty happy with it for a week’s worth of evenings. My goals in building this were to:

  • Read some documents
  • Index these documents
  • Figure out the term document frequencies
  • Develop an inverted index of these documents
  • Allow a user to search for multiple search terms and get back some results.
  • Let’s break some of this down:

The material that you can search for within a particular search engine is a document. So my search engine has Shakespeare plays, as Google allows you to search through webpages. Indexing a document involves saving this into a form that you can access later on.

After you’ve indexed a document, you’ll figure out the term document frequency of each word. This is a (long) list of how many times a particular term will appear in a document. This is accomplished by breaking down the text through tokenisation and stemming. Tokenisation is separating the words in a document into their individual words. Stemming is the process of bringing a word down to it’s “stem” word so that it can be compared to other similar words. For example, the words ‘rebellious’, ‘rebelling’, ‘rebels’, and ‘rebellion’ all come back to the stem word ‘rebel’. You’ll see this in the below examples where some words look a little bit weird. There many stemming algorithms, such as the Snowball and the Porter stemmer. As part of this project I did a port of the Porter algorithm, which you can find here if you’re interested. A stemmer basically drills down to the stem of a word through a series of more and more pointed regex trimmings. A snippet of this from Hamlet looks like this:

"mind": 16,
"mine": 33,
"miner": 1,
"minist": 2,
"ministr": 1,
"minut": 2,
"miracul": 1,
"mirror": 2,
"mirth": 2,
"mischanc": 2,
"mischief": 1,
"miss": 1,

After the term document frequencies for each document is calculated , the search engine will build an inverted index. This sounds a lot scarier than it is. An inverted index is a combined term document frequency for your documents, as well as a record of which documents contain those words. Below, the key “doc_freq” is the combined document frequency of these terms, and the “postings_list” is the list of document ID’s that these terms have appeared in. This inverted index looks something like this:

"deep": {
  "doc_freq": 13,
  "postings_list": [
    1,
    2,
    3
  ]
},
"deer": {
  "doc_freq": 8,
  "postings_list": [
    1,
    2,
    3
  ]
},
"defenc": {
  "doc_freq": 3,
  "postings_list": [
    1,
    2
  ]
}

After you’ve done this you can start searching. My search engine has a very basic search parser. When your user types in a query such as ‘Hamlet sword’, this gets converted to ‘Hamlet AND sword’ so both of these terms are searched for. The upgrade to this is a Boolean search parser. This allows your user to use some Boolean logic in their query, such as: ‘Hamlet AND sword OR cat AND been’. This can be used with a phrase query, which is many words surrounded by double quotes, which will match everything inside those quotes. With a Boolean query, this might look like: ‘Hamlet AND sword OR cat AND “who was that”’.

At this point, a user will see some results come back and our job. Reading this information was interesting, but confusing. Implementing it, even simplistically, has been a great way of understanding how this works.

There’s many ways that this could be improved. For example, better performance, indexing more documents, and implementing a ranked search system. Feel free to check out the code for the porter here, the search engine here, or view the final product here.

Thanks for reading!

Posted on by:

Discussion

markdown guide