<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Qasim</title>
    <description>The latest articles on DEV Community by Muhammad Qasim (@qasimio).</description>
    <link>https://dev.to/qasimio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3679778%2F11443079-3c8a-4ac8-aea8-901f9681aa5a.jpg</url>
      <title>DEV Community: Muhammad Qasim</title>
      <link>https://dev.to/qasimio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qasimio"/>
    <language>en</language>
    <item>
      <title>I built a Search Engine from Scratch in Java (No ElasticSearch, No Lucene)</title>
      <dc:creator>Muhammad Qasim</dc:creator>
      <pubDate>Fri, 26 Dec 2025 14:27:30 +0000</pubDate>
      <link>https://dev.to/qasimio/i-built-a-search-engine-from-scratch-in-java-no-elasticsearch-no-lucene-218n</link>
      <guid>https://dev.to/qasimio/i-built-a-search-engine-from-scratch-in-java-no-elasticsearch-no-lucene-218n</guid>
      <description>&lt;h3&gt;
  
  
  Most developers build applications by gluing APIs together.
&lt;/h3&gt;

&lt;p&gt;Need search?&lt;br&gt;
&lt;code&gt;npm install elasticsearch&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Need storage?&lt;br&gt;
&lt;code&gt;aws s3 cp&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There is nothing wrong with this approach. It ships products.&lt;/p&gt;

&lt;p&gt;But to become a &lt;strong&gt;systems engineer&lt;/strong&gt;, you cannot only consume black boxes.&lt;br&gt;&lt;br&gt;
You must understand how they are built.&lt;/p&gt;

&lt;p&gt;That is why I built &lt;strong&gt;DevShelf&lt;/strong&gt; — a vertical search engine and digital library for Computer Science literature, engineered entirely from scratch in Java, without Lucene, Solr, or external databases.&lt;/p&gt;

&lt;p&gt;This article explains how I achieved &lt;strong&gt;constant-time retrieval&lt;/strong&gt; and &lt;strong&gt;relevance-based ranking&lt;/strong&gt; using classical data structures and mathematical models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview: Split-Stack Design
&lt;/h2&gt;

&lt;p&gt;A search engine cannot linearly scan files when a user types a query.&lt;br&gt;&lt;br&gt;
That approach is (O(N)) and fundamentally unscalable.&lt;/p&gt;

&lt;p&gt;To achieve sub-millisecond latency, DevShelf is divided into two distinct layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline Layer (Writer)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Performs heavy computation once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Online Layer (Reader)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Executes fast, read-only queries in memory.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This separation is the foundation of scalable search systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Offline Indexing: Building the Secret Catalog
&lt;/h2&gt;

&lt;p&gt;Before the application ever runs, an offline process (&lt;code&gt;IndexerMain&lt;/code&gt;) analyzes the dataset containing book metadata.&lt;/p&gt;

&lt;p&gt;The pipeline includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt; — splitting text into atomic terms
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop-word removal&lt;/strong&gt; — eliminating noise such as "the", "and", "is"
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stemming&lt;/strong&gt; — reducing words to root forms (&lt;code&gt;running → run&lt;/code&gt;)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is a &lt;strong&gt;positional inverted index&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instead of mapping:
&lt;/h3&gt;

&lt;p&gt;Book → Words&lt;/p&gt;

&lt;h3&gt;
  
  
  the system maps:
&lt;/h3&gt;

&lt;p&gt;Word → List&lt;/p&gt;

&lt;h3&gt;
  
  
  Simplified Inverted Index Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified inverted index structure&lt;/span&gt;
&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;invertedIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// "java"   -&amp;gt; [101, 204, 305]&lt;/span&gt;
&lt;span class="c1"&gt;// "python" -&amp;gt; [102, 501]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Constant-Time Retrieval
&lt;/h2&gt;

&lt;p&gt;With an inverted index, DevShelf can locate &lt;strong&gt;every book mentioning a term like "Java" in O(1) time complexity&lt;/strong&gt;, regardless of how large the library grows.&lt;/p&gt;

&lt;p&gt;Query-time execution scales with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;the number of query terms&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;not the size of the corpus&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the single most important optimization in any real search engine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vector Space Ranking: Why Relevance Is Hard
&lt;/h2&gt;

&lt;p&gt;Finding documents is easy.&lt;br&gt;&lt;br&gt;
Ranking them correctly is not.&lt;/p&gt;

&lt;p&gt;Why should &lt;em&gt;Effective Java&lt;/em&gt; rank higher than a book that mentions "Java" once in a footer?&lt;/p&gt;

&lt;p&gt;To solve this, I implemented a &lt;strong&gt;TF-IDF (Term Frequency–Inverse Document Frequency)&lt;/strong&gt;–based ranking engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Term Frequency (TF)
&lt;/h3&gt;

&lt;p&gt;How often does a word appear in a specific document?&lt;/p&gt;

&lt;h3&gt;
  
  
  Inverse Document Frequency (IDF)
&lt;/h3&gt;

&lt;p&gt;How rare is that word across the entire library?&lt;/p&gt;

&lt;p&gt;TF-IDF balances local importance with global rarity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cosine Similarity as the Ranking Function
&lt;/h2&gt;

&lt;p&gt;To compute relevance, DevShelf uses &lt;strong&gt;cosine similarity&lt;/strong&gt;, which measures the angle between two vectors:&lt;/p&gt;

&lt;p&gt;Similarity = (A dot B) / (length of A × length of B)&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A&lt;/strong&gt; is the query vector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;B&lt;/strong&gt; is the document vector&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In DevShelf:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Each book is represented as a sparse vector&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each query is converted into the same vector space&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smaller angles indicate higher relevance&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Normalizes for document length&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Works efficiently with sparse data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Produces deterministic and explainable rankings&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Zero-Storage Digital Library Architecture
&lt;/h2&gt;

&lt;p&gt;One strict requirement was &lt;strong&gt;zero local storage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Users should not need to download gigabytes of PDFs just to search a library.&lt;/p&gt;

&lt;p&gt;To achieve this, I designed a &lt;strong&gt;serverless content distribution model&lt;/strong&gt; using GitHub Raw Content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage Split
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The application contains the &lt;strong&gt;search index&lt;/strong&gt; (~78MB)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The cloud contains the &lt;strong&gt;book content&lt;/strong&gt; (PDFs)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a user clicks &lt;strong&gt;Read&lt;/strong&gt;, the application streams the required byte range directly into the viewer instead of downloading the entire file.&lt;/p&gt;

&lt;p&gt;This keeps the local footprint minimal while maintaining an instant user experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feedback Loops: Making Search Adaptive
&lt;/h2&gt;

&lt;p&gt;A static search engine is limited.&lt;br&gt;&lt;br&gt;
A useful search engine improves with usage.&lt;/p&gt;

&lt;p&gt;DevShelf implements a user behavior analytics loop:&lt;/p&gt;

&lt;h3&gt;
  
  
  Capture
&lt;/h3&gt;

&lt;p&gt;Every user click is logged to a lightweight JSON event stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyze
&lt;/h3&gt;

&lt;p&gt;Offline processes aggregate these events into a normalized &lt;code&gt;PopularityVector&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-Rank
&lt;/h3&gt;

&lt;p&gt;Future queries incorporate popularity into the ranking formula alongside TF-IDF and rating signals:&lt;/p&gt;

&lt;p&gt;double finalScore =&lt;br&gt;
        (0.6 * tfIdfScore) +&lt;br&gt;
        (0.2 * userRating) +&lt;br&gt;
        (0.2 * popularityScore);&lt;/p&gt;

&lt;p&gt;This allows trending and highly engaged books to surface naturally, without manual curation or machine learning infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building DevShelf taught me more about computer science than any framework-driven tutorial ever could.&lt;/p&gt;

&lt;p&gt;When you strip away abstractions and build from first principles, you stop guessing how systems work — and start knowing.&lt;/p&gt;

&lt;p&gt;DevShelf is fully open source.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;: &lt;a href="https://github.com/qasimio/DevShelf" rel="noopener noreferrer"&gt;https://github.com/qasimio/DevShelf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Download&lt;/strong&gt;: &lt;a href="https://qasimio.me/projects/devshelf/" rel="noopener noreferrer"&gt;https://qasimio.me/projects/devshelf/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Portfolio&lt;/strong&gt;: &lt;a href="https://qasimio.me/" rel="noopener noreferrer"&gt;https://qasimio.me/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy coding.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>programming</category>
      <category>java</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
