Hello all, I built a search engine from scratch in Java. In a previous course we were given it as a group project, but since I worked on just the document processing module, I didn't really understand (I didn't understand at all actually) how the rest of it worked, so I decided to rebuild it from scratch. I called it Nova Search. Keep in mind this post won't have a lot of code but I'll try to keep it as technical as possible.
Before jumping in — what is a search engine? It's a program that allows users to retrieve information from a source. That source could range from a specific set of documents to the entire internet (e.g. Google). I built mine to cover a specific set of documents, since I don't have the resources to crawl the entire internet lol. The aim was just to understand how search engines work under the hood. So let's get into it.
I'll divide this into two parts: indexing and querying. Indexing covers what happens to a document and how it gets stored. Querying covers how information is retrieved from the stored documents.
The indexing
In a search engine, the main data structure for storing data is something called an inverted index. An inverted index maps words, numbers, or terms directly to the specific documents where they appear — it flips the traditional relationship (i.e. documents to words).
Why do we need it? Imagine searching for the word "dog" in a document with a million words. You'd have to go through every single word and match them — that's O(n) time complexity, meaning the time taken increases with the number of words you have to search through. Now imagine doing that across many documents. It gets inefficient fast.
Let me show how an inverted index fixes this. Start with two documents:
- Doc1 → "The dog chased the dog around the yard, the dog barked"
- Doc2 → "The fox jumped over the fence and ran away"
To turn this into an inverted index, we create a dictionary — let's call it Index. For each word in each document, we check if it's already in Index. If it's not, we add the word as a key with an array containing the document ID as the value. If it is, we just append the document ID to the existing array.
Which gives us something like this:
| Word | Documents |
|---|---|
| the | [Doc1, Doc2] |
| dog | [Doc1] |
| chased | [Doc1] |
| around | [Doc1] |
| yard | [Doc1] |
| barked | [Doc1] |
| fox | [Doc2] |
| jumped | [Doc2] |
| over | [Doc2] |
| fence | [Doc2] |
| and | [Doc2] |
| ran | [Doc2] |
| away | [Doc2] |
So now if we wanted to search for the word "dog", we simply check Index["dog"] and it gives us every document that contains it. That's an O(1) operation — no matter how many words were in the document, finding a word always takes constant time.
Now we could call it a day and be done with our index, but there's a way to make it more efficient: normalizing the text before indexing. Normalizing basically means cleaning the data to remove inconsistencies and bring everything to a single, predictable form. Here's what that looks like:
- Strip punctuation (yes, I'm aware there are no punctuations in this example but 🐻 with me)
- Lowercase everything
- Remove stop words — filler words that don't carry much meaning. You can grab a stop word list online; for our example we'll use ["the", "at", "a", "and"]
- Reduce each word to its root. "Jumped" becomes "jump", "barked" becomes "bark" — that way "barking" and "jumping" won't register as different words. You can do this with a stemmer or a lemmatizer; a stemmer is simpler, a lemmatizer is more accurate but uses more memory. I went with a stemmer.
Here's our cleaned-up inverted index:
| Word | Documents |
|---|---|
| dog | [Doc1] |
| chase | [Doc1] |
| yard | [Doc1] |
| bark | [Doc1] |
| fox | [Doc2] |
| jump | [Doc2] |
| fenc | [Doc2] |
| ran | [Doc2] |
| away | [Doc2] |
The querying
Now that we know how to build our index, how do we retrieve information from it based on a query?
Let's walk through an example. The user searches for "A dog and a fox".
First we go through the same pipeline we used during indexing — normalize the query, strip punctuation, lowercase, remove stop words, and stem.
| Step | Result |
|---|---|
| Original query | A dog and a fox |
| Strip punctuation | A dog and a fox |
| Lowercase | a dog and a fox |
| Remove stop words | dog fox |
| Stem | dog fox |
That gives us "dog fox".
Now for each word, we check the inverted index for the documents that contain it:
- "dog" → [Doc1]
- "fox" → [Doc2]
We first take the intersection — the documents common to both. Since there's no intersection in this example, we fall back to the union: all documents that contain any of the words, which gives us [Doc1, Doc2].
Now that we have our matching documents, how do we decide which one to show first? How do we determine which is most relevant to the user? This is where a ranking algorithm comes in. There are several to choose from, but for a simple search engine I went with TF-IDF.
TF-IDF stands for term frequency–inverse document frequency. It scores how relevant a word is to a given document.
- Term frequency (TF) = number of times the word appears in the document ÷ total number of words in that document
- Inverse document frequency (IDF) = log(total number of documents ÷ number of documents that contain the word)
- TF-IDF = TF × IDF
For each document, we calculate the TF-IDF of each word in the query:
- Doc1 → dog: 0.5 × 0.301 = 0.1505, fox: 0 × 0.301 = 0 → total: 0.1505
- Doc2 → dog: 0 × 0.301 = 0, fox: 0.2 × 0.301 = 0.0602 → total: 0.0602
We sort the documents by score and present the results — Doc1 ranks higher, which makes sense, it's the dog document.
There's one more feature I added — autocompletion — but I think I'll save that for the next article.
While I was mid-building this project I realized it was following a software architecture pattern I studied last semester — pipe and filter, where data gets transformed at each stage and passed along. I'd understood it well enough to pass the exam , but seeing it click into place in something I actually built hit different. Funny how that works.
You can check out the code in the repo here, and the live link here also my portfolio is here.
Top comments (0)