DEV Community

Cover image for Search Engine from Scratch β€” Part 1: The Inverted Index
MABD
MABD

Posted on

Search Engine from Scratch β€” Part 1: The Inverted Index

I started something big β€” building a text-based search engine from scratch πŸ”

Why this?

  • 🧠 Zero experience in search engines = massive learning opportunity
  • πŸ’‘ We use search daily (browser, Spotlight, Windows Search) yet rarely think about how it works
  • βš™οΈ Pure algorithmic challenge β€” no backends, no APIs, just data structures and efficiency

The goal: a fast, pluggable search tool I can hook into other projects.


My current knowledge

As a developer, when I hear "search feature," the first thing that comes to mind is a simple algorithm:

for all texts -> find text that contains word (case insensitive)
Enter fullscreen mode Exit fullscreen mode

That's it. The trivial "contains" algorithm.

This is usually good enough for a large portion of projects. But what happens when the data is huge, or the query is more than just a single word?
Hmmm, this is where it gets complicated, and from here, I had no idea what to do.

Information Retrieval

Information retrieval (IR) is finding material of an unstructured nature (usually text) that satisfies an information need from within large collections.

Let me explain with an example. Say you have a list of 100 words and their meanings. When you want to find a word's meaning, you traverse the list until you find it. But what happens when the list grows to 100,000 words? There's no way you can traverse the whole list every single time.

Those 100,000 words are unstructured and the collection is very large.

To retrieve information efficiently, we need a better approach. As a developer, you probably already know one solution: sorting.

Sort the words alphabetically and finding any word becomes trivial, even if the collection grows to 10 million entries.

Search engines apply similar thinking: take a huge collection of data, then use algorithms and data structures to organize it for fast future lookups.


Search Engine V1

Objective: choose a folder on you machine and search for any word in it

W’ll build our engine based on files, let’s call them documents

type Document struct {
    ID   int
    Path string
}
Enter fullscreen mode Exit fullscreen mode

We want to search for words across a folder, so ideally our data structure would map each word to the list of files where it appears. This is called an inverted index, "inverted" because instead of mapping documents β†’ words, we map words β†’ documents.

Each location where a word appears is called a posting (think of it as "posting" the document to that word's list):

type Posting struct {
    DocID int
}

type Index map[string][]Posting
Enter fullscreen mode Exit fullscreen mode

Now we need to process all files in our folder and extract unique words. This process is called indexing.

The algorithm is straightforward: for each file, split by whitespace, then trim leading and trailing spaces from each word.
But a problem appears β€” you end up with things like: (IR), Objective:, backends., }

So we need to remove symbols and punctuation to get clean words. Let's call these cleaned words tokens.

Now we can build our index:

func indexDocument(doc Document, index Index) Index {
    fileContent := getFileContent(doc.Path)
    tokens := tokenize(fileContent)

    for _, token := range tokens {
        index[token] = append(postings, Posting{DocID: doc.ID})
    }
    return index
}
Enter fullscreen mode Exit fullscreen mode

After indexing all files, we're ready for queries.

Querying

For now, let's keep it simple, searching for a single word:

func getPostings(token string) []Posting {
    return index[token]
}
Enter fullscreen mode Exit fullscreen mode

That's it! We get all documents where our token exists.

This will get more exciting later when we track positions within each document where the query appears. Try implementing that yourself πŸ™‚

You can find the full code on github

Top comments (0)