Most developers build applications by gluing APIs together.
Need search?
npm install elasticsearch
Need storage?
aws s3 cp
There is nothing wrong with this approach. It ships products.
But to become a systems engineer, you cannot only consume black boxes.
You must understand how they are built.
That is why I built DevShelf — a vertical search engine and digital library for Computer Science literature, engineered entirely from scratch in Java, without Lucene, Solr, or external databases.
This article explains how I achieved constant-time retrieval and relevance-based ranking using classical data structures and mathematical models.
Architecture Overview: Split-Stack Design
A search engine cannot linearly scan files when a user types a query.
That approach is O(N) in the corpus size and fundamentally unscalable.
To achieve sub-millisecond latency, DevShelf is divided into two distinct layers:
- Offline Layer (Writer): performs the heavy computation once.
- Online Layer (Reader): executes fast, read-only queries in memory.
This separation is the foundation of scalable search systems.
Offline Indexing: Building the Secret Catalog
Before the application ever runs, an offline process (IndexerMain) analyzes the dataset containing book metadata.
The pipeline includes:
- Tokenization — splitting text into atomic terms
- Stop-word removal — eliminating noise such as "the", "and", "is"
- Stemming — reducing words to root forms (running → run)
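The three steps above can be sketched in a few lines of Java. This is an illustration, not DevShelf's actual IndexerMain: the stop-word list is truncated, and the suffix-stripping stemmer is a naive stand-in for a real Porter-style stemmer.

```java
import java.util.*;
import java.util.stream.*;

public class Pipeline {
    // Illustrative stop-word list; a real indexer would use a much larger one
    static final Set<String> STOP_WORDS = Set.of("the", "and", "is", "a", "of");

    // Very naive suffix stripping (stand-in for a real stemmer)
    static String stem(String word) {
        if (word.endsWith("ning")) return word.substring(0, word.length() - 4); // running -> run
        if (word.endsWith("ing"))  return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    static List<String> process(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+")) // tokenization
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t)) // stop-word removal
                .map(Pipeline::stem) // stemming
                .collect(Collectors.toList());
    }
}
```

For example, `process("Running and the searching")` reduces to the terms `run` and `search`.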
The output is a positional inverted index.
Instead of mapping:
Book → Words
the system maps:
Word → List of postings (book IDs and positions)
Simplified Inverted Index Structure
// Simplified inverted index structure
Map<String, List<Integer>> invertedIndex = new HashMap<>();
// "java" -> [101, 204, 305]
// "python" -> [102, 501]
Constant-Time Retrieval
With an inverted index, DevShelf can locate every book mentioning a term like "Java" with a single hash-map lookup, in O(1) expected time, regardless of how large the library grows.
Query-time execution scales with the number of query terms, not with the size of the corpus.
This is the single most important optimization in any real search engine.
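To make that concrete, here is a minimal sketch of an AND query over postings lists. The class, data, and method names are illustrative, not DevShelf's actual code; the point is that cost depends on the postings lists touched by the query terms, never on a scan of the corpus.

```java
import java.util.*;

public class PostingsLookup {
    // Simplified inverted index: term -> sorted list of book IDs (toy data)
    static final Map<String, List<Integer>> INDEX = Map.of(
            "java",   List.of(101, 204, 305),
            "search", List.of(204, 305, 501)
    );

    // AND query: intersect the postings lists of all query terms
    static List<Integer> retrieve(String... terms) {
        List<Integer> result = null;
        for (String term : terms) {
            List<Integer> postings = INDEX.getOrDefault(term, List.of());
            if (result == null) {
                result = new ArrayList<>(postings);     // first term seeds the result
            } else {
                result.retainAll(postings);             // keep only IDs in both lists
            }
        }
        return result == null ? List.of() : result;
    }

    public static void main(String[] args) {
        System.out.println(retrieve("java", "search")); // prints [204, 305]
    }
}
```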
Vector Space Ranking: Why Relevance Is Hard
Finding documents is easy.
Ranking them correctly is not.
Why should Effective Java rank higher than a book that mentions "Java" once in a footer?
To solve this, I implemented a TF-IDF (Term Frequency–Inverse Document Frequency)–based ranking engine.
Term Frequency (TF)
How often does a word appear in a specific document?
Inverse Document Frequency (IDF)
How rare is that word across the entire library?
TF-IDF balances local importance with global rarity.
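A bare-bones TF-IDF computation looks like the sketch below, using raw term counts and idf(t) = log(N / df(t)). Real engines (DevShelf included, presumably) often apply log-scaled TF or smoothing, so treat the exact weighting as an assumption.

```java
import java.util.*;

public class TfIdf {
    // tf(t, d): raw count of term t in document d's token list
    static double tf(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // idf(t) = log(N / df(t)), where df(t) = number of documents containing t
    static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }
}
```

A term that appears often in one book but in few books overall scores high; a term that appears everywhere scores near zero.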
Cosine Similarity as the Ranking Function
To compute relevance, DevShelf uses cosine similarity, which measures the angle between two vectors:
similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
where A is the query vector and B is the document vector.
In DevShelf:
- Each book is represented as a sparse vector
- Each query is converted into the same vector space
- Smaller angles indicate higher relevance
This approach:
- Normalizes for document length
- Works efficiently with sparse data
- Produces deterministic and explainable rankings
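With sparse vectors represented as term-to-weight maps, cosine similarity is a short loop over the non-zero entries. This is a generic sketch of the formula above, not DevShelf's exact implementation:

```java
import java.util.*;

public class Cosine {
    // Sparse vectors as term -> weight maps; absent terms are implicitly zero
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0); // A · B
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }
}
```

Iterating only over the query's terms is what makes this cheap: the dot product never touches the thousands of terms a book contains but the query does not.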
Zero-Storage Digital Library Architecture
One strict requirement was zero local storage.
Users should not need to download gigabytes of PDFs just to search a library.
To achieve this, I designed a serverless content distribution model using GitHub Raw Content.
Storage Split
- The application ships with the search index (~78 MB)
- The cloud hosts the book content (PDFs)
When a user clicks Read, the application streams the required byte range directly into the viewer instead of downloading the entire file.
This keeps the local footprint minimal while maintaining an instant user experience.
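Byte-range streaming boils down to an HTTP Range request. Below is a sketch using the standard `java.net.http` client; the URL is a placeholder, and whether a given host honors Range requests (returning 206 Partial Content rather than the full file) is an assumption to verify per host.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeFetch {
    // Build a GET request asking only for bytes [from, to] of the resource
    static HttpRequest buildRequest(String url, long from, long to) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Range", "bytes=" + from + "-" + to)
                .build();
    }

    // Send it; a 206 Partial Content status means the server honored the range
    static byte[] fetchRange(String url, long from, long to) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                buildRequest(url, from, to),
                HttpResponse.BodyHandlers.ofByteArray());
        return response.body();
    }
}
```

The viewer can then request just the pages being read, e.g. `fetchRange(url, 0, 65_535)` for the first 64 KB.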
Feedback Loops: Making Search Adaptive
A static search engine is limited.
A useful search engine improves with usage.
DevShelf implements a user behavior analytics loop:
1. Capture: every user click is logged to a lightweight JSON event stream.
2. Analyze: offline processes aggregate these events into a normalized PopularityVector.
3. Re-Rank: future queries incorporate popularity into the ranking formula alongside TF-IDF and rating signals:
double finalScore =
(0.6 * tfIdfScore) +
(0.2 * userRating) +
(0.2 * popularityScore);
This allows trending and highly engaged books to surface naturally, without manual curation or machine learning infrastructure.
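The normalization in the Analyze step could be as simple as dividing each book's click count by the maximum. This is a guess at one reasonable scheme, since the article doesn't specify the aggregation details:

```java
import java.util.*;

public class PopularityVector {
    // Map raw click counts into [0, 1] scores by dividing by the max count.
    // (Illustrative scheme; the real aggregation may use decay or smoothing.)
    static Map<Integer, Double> normalize(Map<Integer, Long> clicksByBook) {
        long max = clicksByBook.values().stream()
                .mapToLong(Long::longValue).max().orElse(1L);
        Map<Integer, Double> scores = new HashMap<>();
        clicksByBook.forEach((book, clicks) -> scores.put(book, (double) clicks / max));
        return scores;
    }
}
```

The resulting scores plug directly into the `popularityScore` term of the weighted formula above.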
Conclusion
Building DevShelf taught me more about computer science than any framework-driven tutorial ever could.
When you strip away abstractions and build from first principles, you stop guessing how systems work — and start knowing.
DevShelf is fully open source.
Source Code: https://github.com/Kas-sim/DevShelf
Portfolio: https://kas-sim.github.io/
Happy coding.