Most developers build applications by gluing APIs together.
Need search?
npm install elasticsearch
Need storage?
aws s3 cp
There is nothing wrong with this approach. It ships products.
But to become a systems engineer, you cannot only consume black boxes.
You must understand how they are built.
That is why I built DevShelf — a vertical search engine and digital library for Computer Science literature, engineered entirely from scratch in Java, without Lucene, Solr, or external databases.
This article explains how I achieved constant-time retrieval and relevance-based ranking using classical data structures and mathematical models.
Architecture Overview: Split-Stack Design
A search engine cannot linearly scan files when a user types a query.
That approach is O(N) in the corpus size and fundamentally unscalable.
To achieve sub-millisecond latency, DevShelf is divided into two distinct layers:
- Offline Layer (Writer): performs the heavy computation once.
- Online Layer (Reader): executes fast, read-only queries in memory.
This separation is the foundation of scalable search systems.
Offline Indexing: Building the Secret Catalog
Before the application ever runs, an offline process (IndexerMain) analyzes the dataset containing book metadata.
The pipeline includes:
- Tokenization — splitting text into atomic terms
- Stop-word removal — eliminating noise such as "the", "and", "is"
- Stemming — reducing words to root forms (running → run)
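The three steps above can be sketched in a few lines of Java. This is an illustration, not DevShelf's actual IndexerMain: the stop-word list is truncated, and the suffix-stripping stemmer is a naive stand-in for a real Porter-style stemmer.

```java
import java.util.*;
import java.util.stream.*;

public class Pipeline {
    // Illustrative stop-word list; a real indexer would use a much larger one
    static final Set<String> STOP_WORDS = Set.of("the", "and", "is", "a", "of");

    // Very naive suffix stripping (stand-in for a real stemmer)
    static String stem(String word) {
        if (word.endsWith("ning")) return word.substring(0, word.length() - 4); // running -> run
        if (word.endsWith("ing"))  return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    static List<String> process(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+")) // tokenization
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t)) // stop-word removal
                .map(Pipeline::stem) // stemming
                .collect(Collectors.toList());
    }
}
```

For example, `process("Running and the searching")` reduces to the terms `run` and `search`.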
The output is a positional inverted index.
Instead of mapping:
Book → Words
the system maps:
Word → List of postings (book IDs and positions)
Simplified Inverted Index Structure
// Simplified inverted index structure
Map<String, List<Integer>> invertedIndex = new HashMap<>();
// "java" -> [101, 204, 305]
// "python" -> [102, 501]
Constant-Time Retrieval
With an inverted index, DevShelf can locate every book mentioning a term like "Java" with a single hash-map lookup, in O(1) expected time, regardless of how large the library grows.
Query-time execution scales with the number of query terms, not with the size of the corpus.
This is the single most important optimization in any real search engine.
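To make that concrete, here is a minimal sketch of an AND query over postings lists. The class, data, and method names are illustrative, not DevShelf's actual code; the point is that cost depends on the postings lists touched by the query terms, never on a scan of the corpus.

```java
import java.util.*;

public class PostingsLookup {
    // Simplified inverted index: term -> sorted list of book IDs (toy data)
    static final Map<String, List<Integer>> INDEX = Map.of(
            "java",   List.of(101, 204, 305),
            "search", List.of(204, 305, 501)
    );

    // AND query: intersect the postings lists of all query terms
    static List<Integer> retrieve(String... terms) {
        List<Integer> result = null;
        for (String term : terms) {
            List<Integer> postings = INDEX.getOrDefault(term, List.of());
            if (result == null) {
                result = new ArrayList<>(postings);     // first term seeds the result
            } else {
                result.retainAll(postings);             // keep only IDs in both lists
            }
        }
        return result == null ? List.of() : result;
    }

    public static void main(String[] args) {
        System.out.println(retrieve("java", "search")); // prints [204, 305]
    }
}
```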
Vector Space Ranking: Why Relevance Is Hard
Finding documents is easy.
Ranking them correctly is not.
Why should Effective Java rank higher than a book that mentions "Java" once in a footer?
To solve this, I implemented a TF-IDF (Term Frequency–Inverse Document Frequency)–based ranking engine.
Term Frequency (TF)
How often does a word appear in a specific document?
Inverse Document Frequency (IDF)
How rare is that word across the entire library?
TF-IDF balances local importance with global rarity.
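A bare-bones TF-IDF computation looks like the sketch below, using raw term counts and idf(t) = log(N / df(t)). Real engines (DevShelf included, presumably) often apply log-scaled TF or smoothing, so treat the exact weighting as an assumption.

```java
import java.util.*;

public class TfIdf {
    // tf(t, d): raw count of term t in document d's token list
    static double tf(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // idf(t) = log(N / df(t)), where df(t) = number of documents containing t
    static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }
}
```

A term that appears often in one book but in few books overall scores high; a term that appears everywhere scores near zero.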
Cosine Similarity as the Ranking Function
To compute relevance, DevShelf uses cosine similarity, which measures the angle between two vectors:
similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
where A is the query vector and B is the document vector.
In DevShelf:
- Each book is represented as a sparse vector
- Each query is converted into the same vector space
- Smaller angles indicate higher relevance
This approach:
- Normalizes for document length
- Works efficiently with sparse data
- Produces deterministic and explainable rankings
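With sparse vectors represented as term-to-weight maps, cosine similarity is a short loop over the non-zero entries. This is a generic sketch of the formula above, not DevShelf's exact implementation:

```java
import java.util.*;

public class Cosine {
    // Sparse vectors as term -> weight maps; absent terms are implicitly zero
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0); // A · B
        }
        double normA = Math.sqrt(a.values().stream().mapToDouble(v -> v * v).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(v -> v * v).sum());
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }
}
```

Iterating only over the query's terms is what makes this cheap: the dot product never touches the thousands of terms a book contains but the query does not.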
Zero-Storage Digital Library Architecture
One strict requirement was zero local storage.
Users should not need to download gigabytes of PDFs just to search a library.
To achieve this, I designed a serverless content distribution model using GitHub Raw Content.
Storage Split
- The application ships with the search index (~78 MB)
- The cloud hosts the book content (PDFs)
When a user clicks Read, the application streams the required byte range directly into the viewer instead of downloading the entire file.
This keeps the local footprint minimal while maintaining an instant user experience.
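Byte-range streaming boils down to an HTTP Range request. Below is a sketch using the standard `java.net.http` client; the URL is a placeholder, and whether a given host honors Range requests (returning 206 Partial Content rather than the full file) is an assumption to verify per host.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeFetch {
    // Build a GET request asking only for bytes [from, to] of the resource
    static HttpRequest buildRequest(String url, long from, long to) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Range", "bytes=" + from + "-" + to)
                .build();
    }

    // Send it; a 206 Partial Content status means the server honored the range
    static byte[] fetchRange(String url, long from, long to) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                buildRequest(url, from, to),
                HttpResponse.BodyHandlers.ofByteArray());
        return response.body();
    }
}
```

The viewer can then request just the pages being read, e.g. `fetchRange(url, 0, 65_535)` for the first 64 KB.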
Feedback Loops: Making Search Adaptive
A static search engine is limited.
A useful search engine improves with usage.
DevShelf implements a user behavior analytics loop:
1. Capture: every user click is logged to a lightweight JSON event stream.
2. Analyze: offline processes aggregate these events into a normalized PopularityVector.
3. Re-Rank: future queries incorporate popularity into the ranking formula alongside TF-IDF and rating signals:
double finalScore =
(0.6 * tfIdfScore) +
(0.2 * userRating) +
(0.2 * popularityScore);
This allows trending and highly engaged books to surface naturally, without manual curation or machine learning infrastructure.
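The normalization in the Analyze step could be as simple as dividing each book's click count by the maximum. This is a guess at one reasonable scheme, since the article doesn't specify the aggregation details:

```java
import java.util.*;

public class PopularityVector {
    // Map raw click counts into [0, 1] scores by dividing by the max count.
    // (Illustrative scheme; the real aggregation may use decay or smoothing.)
    static Map<Integer, Double> normalize(Map<Integer, Long> clicksByBook) {
        long max = clicksByBook.values().stream()
                .mapToLong(Long::longValue).max().orElse(1L);
        Map<Integer, Double> scores = new HashMap<>();
        clicksByBook.forEach((book, clicks) -> scores.put(book, (double) clicks / max));
        return scores;
    }
}
```

The resulting scores plug directly into the `popularityScore` term of the weighted formula above.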
Conclusion
Building DevShelf taught me more about computer science than any framework-driven tutorial ever could.
When you strip away abstractions and build from first principles, you stop guessing how systems work — and start knowing.
DevShelf is fully open source.
Source Code: https://github.com/Kas-sim/DevShelf
Portfolio: https://kas-sim.github.io/
Happy coding.