<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: diogodls</title>
    <description>The latest articles on DEV Community by diogodls (@diogodls).</description>
    <link>https://dev.to/diogodls</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1277051%2Fce1becfd-2f64-4547-bddf-ebfc3e4bb4d4.jpeg</url>
      <title>DEV Community: diogodls</title>
      <link>https://dev.to/diogodls</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/diogodls"/>
    <language>en</language>
    <item>
      <title>Building a Search Engine from Scratch: Lessons from Implementing TF-IDF</title>
      <dc:creator>diogodls</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:24:45 +0000</pubDate>
      <link>https://dev.to/diogodls/building-a-search-engine-from-scratch-lessons-from-implementing-tf-idf-3ajf</link>
      <guid>https://dev.to/diogodls/building-a-search-engine-from-scratch-lessons-from-implementing-tf-idf-3ajf</guid>
      <description>&lt;p&gt;Over the last month, I’ve been working on a personal project: building a search engine from scratch.&lt;/p&gt;

&lt;p&gt;This started from a simple curiosity — I’ve always wanted to understand how tools like Google actually work under the hood. At the same time, I wanted to sharpen my backend skills and build something meaningful as I prepare to get back into the job market.&lt;/p&gt;

&lt;p&gt;The initial idea came out of brainstorming project ideas with AI assistants. From there, I kept expanding it step by step, adding more complexity as I learned.&lt;/p&gt;

&lt;p&gt;⚙️ Tech Stack &amp;amp; Architecture&lt;/p&gt;

&lt;p&gt;I built the project using NestJS, since Node.js is the ecosystem I’m most comfortable with from my previous experience as a developer.&lt;/p&gt;

&lt;p&gt;At a high level, the system is structured into:&lt;/p&gt;

&lt;p&gt;Indexer → responsible for processing and normalizing documents&lt;br&gt;
Search Engine → calculates TF-IDF scores&lt;br&gt;
Ranking layer → orders results by relevance&lt;br&gt;
All of this is exposed through an API&lt;/p&gt;

&lt;p&gt;Initially, everything lived in memory; later I migrated to a database for persistence.&lt;/p&gt;

&lt;p&gt;🔍 Indexing: The Turning Point&lt;/p&gt;

&lt;p&gt;One of the most important parts of the system is the indexing process.&lt;/p&gt;

&lt;p&gt;Every time a document is created or updated, I:&lt;/p&gt;

&lt;p&gt;Normalize the text (lowercase, clean formatting, etc.)&lt;br&gt;
Tokenize it into terms&lt;/p&gt;

&lt;p&gt;At first, I stored everything in memory using Map structures and built the inverted index directly on top of them.&lt;/p&gt;

&lt;p&gt;This was surprisingly challenging.&lt;/p&gt;

&lt;p&gt;Understanding inverted indexes — and especially implementing them using nested Maps — was one of the hardest parts of the project. I had never used this structure in depth before, and things got confusing quickly.&lt;/p&gt;

&lt;p&gt;But once it clicked, everything made more sense.&lt;/p&gt;
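
&lt;p&gt;To make the nested-map idea concrete, here is a minimal sketch in Python, with dicts standing in for the nested Map structures. It is not the project's actual code: the names tokenize and add_document, the sample documents, and the choice to store token positions are all assumptions made for illustration.&lt;/p&gt;

```python
import re

# Hypothetical names; the post does not show its actual implementation.
# Shape: term -> {doc_id -> [positions]}, the nested-map idea from the text.
InvertedIndex = dict[str, dict[str, list[int]]]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_document(index: InvertedIndex, doc_id: str, text: str) -> None:
    """Record every position at which each term occurs in the document."""
    for position, term in enumerate(tokenize(text)):
        index.setdefault(term, {}).setdefault(doc_id, []).append(position)


index: InvertedIndex = {}
add_document(index, "d1", "Search engines use inverted indexes")
add_document(index, "d2", "Inverted indexes map terms to documents")

# Lookup goes from a term straight to the documents that contain it.
print(index["inverted"])  # {'d1': [3], 'd2': [0]}
```

&lt;p&gt;With this shape, finding the documents for a term is a single key lookup, and the stored positions record where in each document the term occurs.&lt;/p&gt;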

&lt;p&gt;Later, I moved to PostgreSQL, modeling the data with:&lt;/p&gt;

&lt;p&gt;document&lt;br&gt;
term&lt;br&gt;
term_document (mapping which term appears in which document and where)&lt;/p&gt;

&lt;p&gt;This transition helped me better understand how real systems persist and query this kind of data.&lt;/p&gt;
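
&lt;p&gt;A rough sketch of what such a three-table model could look like follows. SQLite is swapped in for PostgreSQL so the example runs standalone, the table names come from the list above, and every column name is a guess rather than the project's real schema.&lt;/p&gt;

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch runs standalone.
# Table names follow the post; every column name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
    CREATE TABLE term (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
    CREATE TABLE term_document (
        term_id INTEGER NOT NULL REFERENCES term(id),
        document_id INTEGER NOT NULL REFERENCES document(id),
        position INTEGER NOT NULL,
        PRIMARY KEY (term_id, document_id, position)
    );
    """
)

conn.execute("INSERT INTO document (id, content) VALUES (1, 'inverted indexes')")
conn.execute("INSERT INTO term (id, value) VALUES (1, 'inverted'), (2, 'indexes')")
conn.execute(
    "INSERT INTO term_document (term_id, document_id, position) "
    "VALUES (1, 1, 0), (2, 1, 1)"
)

# Term -> documents lookup, the same direction as the in-memory index.
rows = conn.execute(
    """
    SELECT td.document_id, td.position
    FROM term_document AS td
    JOIN term AS t ON t.id = td.term_id
    WHERE t.value = ?
    """,
    ("indexes",),
).fetchall()
print(rows)  # [(1, 1)]
```

&lt;p&gt;The join table plays the role of the inner map: each row says that a term occurs in a document at a given position.&lt;/p&gt;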

&lt;p&gt;📊 Ranking with TF-IDF&lt;/p&gt;

&lt;p&gt;For ranking, I implemented TF-IDF as a starting point.&lt;/p&gt;

&lt;p&gt;The idea is simple but powerful:&lt;/p&gt;

&lt;p&gt;TF (Term Frequency) → how often a term appears in a document&lt;br&gt;
IDF (Inverse Document Frequency) → how rare the term is across all documents&lt;/p&gt;

&lt;p&gt;The final relevance score is:&lt;/p&gt;

&lt;p&gt;TF × IDF&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;p&gt;Documents that contain the term more frequently rank higher&lt;br&gt;
Rare terms have more weight than common ones&lt;/p&gt;

&lt;p&gt;Even though the formula is straightforward, implementing it in a real system gave me a much deeper understanding of how ranking actually works.&lt;/p&gt;
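
&lt;p&gt;As a worked illustration of the scoring, here is a small Python sketch over a toy corpus. It uses one common variant, with TF as the term count divided by the document length and IDF as log(N / df); the project may use a different weighting, and the corpus and function names are made up for the example. For brevity it scans the documents to count df instead of reading it from an inverted index.&lt;/p&gt;

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = {
    "d1": ["search", "engine", "search", "index"],
    "d2": ["database", "index"],
    "d3": ["search", "ranking", "query"],
}


def tf(term: str, tokens: list[str]) -> float:
    """Share of the document's tokens that are this term."""
    return Counter(tokens)[term] / len(tokens)


def idf(term: str) -> float:
    """log(N / df): larger when fewer documents contain the term."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def rank(term: str) -> list[tuple[str, float]]:
    """Score documents containing the term by TF * IDF, best first."""
    scores = {
        doc_id: tf(term, tokens) * idf(term)
        for doc_id, tokens in docs.items()
        if term in tokens
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rank("search"))
```

&lt;p&gt;Here d1 ranks above d3 for "search" because the term makes up a larger share of d1, and "database" gets a higher IDF than "index" because it appears in fewer documents.&lt;/p&gt;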

&lt;p&gt;🧪 Adding End-to-End Tests&lt;/p&gt;

&lt;p&gt;More recently, I started working on E2E testing, which has been a big learning experience.&lt;/p&gt;

&lt;p&gt;I created a test file (document-e2e.spec.ts) where I:&lt;/p&gt;

&lt;p&gt;Send real HTTP requests to the API&lt;br&gt;
Validate document creation&lt;br&gt;
Verify that ranking works correctly&lt;/p&gt;

&lt;p&gt;To avoid polluting the database, I run everything inside transactions and roll them back after each test.&lt;/p&gt;

&lt;p&gt;Honestly, I underestimated how complex testing can be.&lt;/p&gt;

&lt;p&gt;I even had to refactor large parts of my services to make them more testable — and I still have a lot to improve here.&lt;/p&gt;

&lt;p&gt;😵 Challenges Along the Way&lt;/p&gt;

&lt;p&gt;Some of the biggest challenges so far:&lt;/p&gt;

&lt;p&gt;Understanding and implementing inverted indexes&lt;br&gt;
Working with nested data structures like Map&amp;lt;Map&amp;lt;...&amp;gt;&amp;gt;&lt;br&gt;
Realizing that testing is much harder than it looks&lt;/p&gt;

&lt;p&gt;One important realization was how much architecture affects testability. Writing tests forced me to rethink how I structure my code.&lt;/p&gt;

&lt;p&gt;💡 What Surprised Me the Most&lt;/p&gt;

&lt;p&gt;Before this project, I had a completely different mental model of how search engines worked.&lt;/p&gt;

&lt;p&gt;I used to think:&lt;/p&gt;

&lt;p&gt;“You go through each document and count matching terms.”&lt;/p&gt;

&lt;p&gt;But the reality is the opposite.&lt;/p&gt;

&lt;p&gt;Search engines rely on inverted indexes, where:&lt;/p&gt;

&lt;p&gt;Terms point to documents&lt;br&gt;
Not documents to terms&lt;/p&gt;

&lt;p&gt;This “reverse thinking” completely changed how I understand search systems.&lt;/p&gt;

&lt;p&gt;🚀 What’s Next&lt;/p&gt;

&lt;p&gt;I still have a lot I want to implement:&lt;/p&gt;

&lt;p&gt;Phrase search&lt;br&gt;
Highlighting&lt;br&gt;
Fuzzy search&lt;br&gt;
Stemming&lt;br&gt;
Suggestions&lt;/p&gt;

&lt;p&gt;After that, I plan to explore performance improvements like caching.&lt;/p&gt;

&lt;p&gt;🔗 Project&lt;/p&gt;

&lt;p&gt;If you want to check it out or give feedback:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/diogodls/search-engine" rel="noopener noreferrer"&gt;https://github.com/diogodls/search-engine&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🏁 Final Thoughts&lt;/p&gt;

&lt;p&gt;This project started as curiosity, but it quickly turned into one of the most valuable learning experiences I’ve had.&lt;/p&gt;

&lt;p&gt;There’s still a lot to improve — and that’s exactly the point.&lt;/p&gt;

&lt;p&gt;If you're also learning about search systems or building something similar, I'd love to connect.&lt;/p&gt;

&lt;p&gt;I’ll keep building it in public 🚀&lt;/p&gt;

&lt;p&gt;Cover image from: &lt;a href="https://motopress.com/blog/top-search-engines/" rel="noopener noreferrer"&gt;https://motopress.com/blog/top-search-engines/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>backend</category>
      <category>webdev</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
