diogodls

Posted on Apr 29

Building a Search Engine from Scratch: Lessons from Implementing TF-IDF

#programming #backend #webdev #buildinpublic

Over the last month, I’ve been working on a personal project: building a search engine from scratch.

This started from a simple curiosity — I’ve always wanted to understand how tools like Google actually work under the hood. At the same time, I wanted to sharpen my backend skills and build something meaningful as I prepare to get back into the job market.

The initial idea came from brainstorming project ideas with AI assistants. From there, I kept expanding it step by step, adding more complexity as I learned.

⚙️ Tech Stack & Architecture

I built the project using NestJS, since Node.js is the ecosystem I’m most comfortable with from my previous experience as a developer.

At a high level, the system is structured into:

Indexer → responsible for processing and normalizing documents
Search Engine → calculates TF-IDF scores
Ranking layer → orders results by relevance
All of this is exposed through an API.

Initially, everything lived in memory, but I later migrated to a database for persistence.

🔍 Indexing: The Turning Point

One of the most important parts of the system is the indexing process.

Every time a document is created or updated, I:

Normalize the text (lowercase, clean formatting, etc.)
Tokenize it into terms
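
A minimal sketch of these two steps, assuming whitespace tokenization and ASCII-only terms after accent stripping. The function names `normalizeText` and `tokenize` and the exact cleaning rules are illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical helper: lowercases, strips accents and punctuation,
// and collapses runs of whitespace into single spaces.
function normalizeText(text: string): string {
  return text
    .toLowerCase()
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .replace(/[^a-z0-9\s]/g, " ") // replace punctuation and symbols with spaces
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: splits normalized text into terms.
// Returns an empty array for empty or whitespace-only input.
function tokenize(text: string): string[] {
  const normalized = normalizeText(text);
  return normalized === "" ? [] : normalized.split(" ");
}
```

The same `tokenize` function should run on both documents and queries, otherwise a query term may never match the form stored in the index.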

At first, I stored everything in memory using Map structures and implemented an inverted index directly there.

This was surprisingly challenging.

Understanding inverted indexes — and especially implementing them using nested Maps — was one of the hardest parts of the project. I had never used this structure in depth before, and things got confusing quickly.

But once it clicked, everything made more sense.
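
To make the nested-Map idea concrete, here is a rough sketch of one possible in-memory layout: the outer Map goes from term to documents, and the inner Map goes from document id to the positions where the term occurs. The type and function names are assumptions for illustration, and storing positions in the in-memory version is also an assumption.

```typescript
// term -> (document id -> positions of the term inside that document)
type InvertedIndex = Map<string, Map<number, number[]>>;

// Adds every token of one document to the index.
function addDocument(index: InvertedIndex, docId: number, tokens: string[]): void {
  tokens.forEach((term, position) => {
    let postings = index.get(term);
    if (!postings) {
      // First time this term is seen in any document.
      postings = new Map<number, number[]>();
      index.set(term, postings);
    }
    const positions = postings.get(docId) ?? [];
    positions.push(position);
    postings.set(docId, positions);
  });
}

// Returns the ids of all documents containing the term, without scanning documents.
function lookup(index: InvertedIndex, term: string): number[] {
  const postings = index.get(term);
  return postings ? Array.from(postings.keys()) : [];
}
```

A lookup is a single `Map.get` on the term, so its cost does not depend on how many documents are stored that do not contain the term.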

Later, I moved to PostgreSQL, modeling the data with:

document
term
term_document (mapping which term appears in which document and where)
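
As a sketch of how tokens could map onto the `term_document` table, the snippet below turns one document's tokens into rows holding a term id, a document id, and a position. The field names and the id assignment are hypothetical. In a real setup the database would generate the ids.

```typescript
// Hypothetical row shape for term_document; column names are assumptions.
interface TermDocumentRow {
  termId: number;
  documentId: number;
  position: number;
}

// Builds one term_document row per token occurrence.
// `termIds` stands in for the term table: term text -> term id.
function toTermDocumentRows(
  documentId: number,
  tokens: string[],
  termIds: Map<string, number>
): TermDocumentRow[] {
  return tokens.map((token, position) => {
    let termId = termIds.get(token);
    if (termId === undefined) {
      termId = termIds.size + 1; // stand-in for a database-generated id
      termIds.set(token, termId);
    }
    return { termId, documentId, position };
  });
}
```

Keeping one row per occurrence preserves positions, which is what features like phrase search would need later.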

This transition helped me better understand how real systems persist and query this kind of data.

📊 Ranking with TF-IDF

For ranking, I implemented TF-IDF as a starting point.

The idea is simple but powerful:

TF (Term Frequency) → how often a term appears in a document
IDF (Inverse Document Frequency) → how rare the term is across all documents

The final relevance score of a document for a given query term is:

TF × IDF

This means:

Documents that contain the term more frequently rank higher
Rare terms have more weight than common ones
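
Here is a small sketch of the scoring, assuming one common variant: TF is the term count divided by document length, IDF is the natural log of (total documents / documents containing the term), and a multi-term query sums the per-term scores. Other weightings exist, and the names and the `Map` of tokens per document are illustrative assumptions.

```typescript
// docs: document id -> tokens of that document (hypothetical shape).
// Returns document id -> summed TF-IDF score over all query terms.
function tfIdfScores(docs: Map<number, string[]>, query: string[]): Map<number, number> {
  const scores = new Map<number, number>();
  for (const term of query) {
    // Document frequency: how many documents contain the term at least once.
    let df = 0;
    docs.forEach((tokens) => {
      if (tokens.indexOf(term) !== -1) df++;
    });
    if (df === 0) continue; // term appears nowhere, contributes nothing
    const idf = Math.log(docs.size / df);
    docs.forEach((tokens, docId) => {
      const count = tokens.filter((t) => t === term).length;
      if (count === 0) return;
      const tf = count / tokens.length;
      scores.set(docId, (scores.get(docId) ?? 0) + tf * idf);
    });
  }
  return scores;
}

// Orders document ids from highest to lowest score.
function rankByScore(scores: Map<number, number>): number[] {
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

With this IDF definition, a term that appears in every document gets an IDF of 0, so it contributes nothing to the ranking.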

Even though the formula is straightforward, implementing it in a real system gave me a much deeper understanding of how ranking actually works.

🧪 Adding End-to-End Tests

More recently, I started working on E2E testing, which has been a big learning experience.

I created a test file (document-e2e.spec.ts) where I:

Send real HTTP requests to the API
Validate document creation
Verify that ranking works correctly

To avoid polluting the database, I run everything inside transactions and roll them back after each test.

Honestly, I underestimated how complex testing can be.

I even had to refactor large parts of my services to make them more testable — and I still have a lot to improve here.

😵 Challenges Along the Way

Some of the biggest challenges so far:

Understanding and implementing inverted indexes
Working with nested data structures like Map<Map<...>>
Realizing that testing is much harder than it looks

One important realization was how much architecture affects testability. Writing tests forced me to rethink how I structure my code.

💡 What Surprised Me the Most

Before this project, I had a completely different mental model of how search engines worked.

I used to think:

“You go through each document and count matching terms.”

But the reality is the opposite.

Search engines rely on inverted indexes, where:

Terms point to documents
Not documents to terms

This “reverse thinking” completely changed how I understand search systems.

🚀 What’s Next

I still have a lot I want to implement:

Phrase search
Highlighting
Fuzzy search
Stemming
Suggestions