Over the last month, I’ve been working on a personal project: building a search engine from scratch.
This started from a simple curiosity — I’ve always wanted to understand how tools like Google actually work under the hood. At the same time, I wanted to sharpen my backend skills and build something meaningful as I prepare to get back into the job market.
The initial idea came from brainstorming project ideas with AI assistants. From there, I kept expanding it step by step, adding more complexity as I learned.
⚙️ Tech Stack & Architecture
I built the project using NestJS, since Node.js is the ecosystem I’m most comfortable with from my previous experience as a developer.
At a high level, the system is structured into:
Indexer → responsible for processing and normalizing documents
Search Engine → calculates TF-IDF scores
Ranking layer → orders results by relevance
All of this is exposed through an API.
Initially, everything was in-memory, but later I migrated to a database for persistence.
🔍 Indexing: The Turning Point
One of the most important parts of the system is the indexing process.
Every time a document is created or updated, I:
Normalize the text (lowercase, clean formatting, etc.)
Tokenize it into terms
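As a rough illustration of these two steps (not the project's actual code), here is a minimal sketch. The function names normalize and tokenize are hypothetical, and the ASCII-only cleaning rule is an assumption made for brevity; a real indexer would probably need to handle accented characters too.

```typescript
// Hypothetical helpers; the names and the cleaning rule are assumptions.

/** Lowercase the text and replace every run of non-alphanumeric characters with one space. */
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ") // ASCII-only on purpose, to keep the sketch short
    .trim();
}

/** Normalize the text, then split it into terms on whitespace. */
function tokenize(text: string): string[] {
  const normalized = normalize(text);
  return normalized === "" ? [] : normalized.split(/\s+/);
}

console.log(tokenize("Hello, World! Hello again."));
```

With this rule, punctuation disappears and repeated words are kept, since the ranking step needs to count them later.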
At first, I stored everything in memory using Map structures and implemented an inverted index directly there.
This was surprisingly challenging.
Understanding inverted indexes — and especially implementing them using nested Maps — was one of the hardest parts of the project. I had never used this structure in depth before, and things got confusing quickly.
But once it clicked, everything made more sense.
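To make the nested Map idea concrete, here is a small sketch of one possible in-memory inverted index. The shape (term → document id → positions) and the names InvertedIndex, addDocument, and lookup are assumptions for illustration, not necessarily what the project uses.

```typescript
// Outer Map: term -> postings. Inner Map: document id -> positions of the term in that document.
type InvertedIndex = Map<string, Map<string, number[]>>;

/**
 * Add every term of a document to the index, recording where each one occurs.
 * Note: re-indexing an updated document would first need to remove its old postings.
 */
function addDocument(index: InvertedIndex, docId: string, terms: string[]): void {
  terms.forEach((term, position) => {
    let postings = index.get(term);
    if (!postings) {
      postings = new Map<string, number[]>();
      index.set(term, postings);
    }
    const positions = postings.get(docId) ?? [];
    positions.push(position);
    postings.set(docId, positions);
  });
}

/** Return the ids of all documents that contain the term. */
function lookup(index: InvertedIndex, term: string): string[] {
  const postings = index.get(term);
  return postings ? Array.from(postings.keys()) : [];
}

const index: InvertedIndex = new Map();
addDocument(index, "doc1", ["search", "engine", "search"]);
addDocument(index, "doc2", ["engine"]);
console.log(lookup(index, "engine"));
```

Looking up a term is a single Map access, no matter how many documents are stored, which is the whole point of the structure.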
Later, I moved to PostgreSQL, modeling the data with:
document
term
term_document (mapping which term appears in which document and where)
This transition helped me better understand how real systems persist and query this kind of data.
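For illustration only, the three tables could be mirrored in TypeScript roughly like this. The column names, the choice to store positions as an array per (term, document) pair, and the toTermDocumentRows helper are all assumptions; the actual schema may differ (for example, one row per occurrence).

```typescript
// Hypothetical row shapes; every column name here is an assumption.
interface DocumentRow {
  id: number;
  content: string;
}

interface TermRow {
  id: number;
  value: string;
}

interface TermDocumentRow {
  termId: number;
  documentId: number;
  positions: number[]; // where the term occurs in the document
}

/**
 * Turn a tokenized document into term_document rows, creating term rows as needed.
 * termsByValue stands in for the term table; ids are generated locally
 * instead of by the database.
 */
function toTermDocumentRows(
  doc: DocumentRow,
  tokens: string[],
  termsByValue: Map<string, TermRow>,
): TermDocumentRow[] {
  const rowsByTermId = new Map<number, TermDocumentRow>();
  tokens.forEach((token, position) => {
    let term = termsByValue.get(token);
    if (!term) {
      term = { id: termsByValue.size + 1, value: token };
      termsByValue.set(token, term);
    }
    let row = rowsByTermId.get(term.id);
    if (!row) {
      row = { termId: term.id, documentId: doc.id, positions: [] };
      rowsByTermId.set(term.id, row);
    }
    row.positions.push(position);
  });
  return Array.from(rowsByTermId.values());
}
```

The term_document rows carry the same information as the inner Map of an in-memory index, just flattened into a join table.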
📊 Ranking with TF-IDF
For ranking, I implemented TF-IDF as a starting point.
The idea is simple but powerful:
TF (Term Frequency) → how often a term appears in a document
IDF (Inverse Document Frequency) → how rare the term is across all documents
The final relevance score is:
TF × IDF
This means:
Documents that contain the term more frequently rank higher
Rare terms have more weight than common ones
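Here is a small sketch of how such a score might be computed. TF-IDF has several variants; this one assumes TF is the term count divided by the document length and IDF is ln(N / df), where N is the number of documents and df is how many of them contain the term. The function names are hypothetical, and the project may use a different variant.

```typescript
/** Share of the document's terms that are equal to `term`. */
function termFrequency(term: string, docTerms: string[]): number {
  if (docTerms.length === 0) return 0;
  const count = docTerms.filter((t) => t === term).length;
  return count / docTerms.length;
}

/** ln(N / df): larger when fewer documents contain the term. */
function inverseDocumentFrequency(term: string, corpus: string[][]): number {
  const containing = corpus.filter((doc) => doc.some((t) => t === term)).length;
  return containing === 0 ? 0 : Math.log(corpus.length / containing);
}

function tfIdf(term: string, docTerms: string[], corpus: string[][]): number {
  return termFrequency(term, docTerms) * inverseDocumentFrequency(term, corpus);
}

/** Score every document for a single term and sort by descending relevance. */
function rank(term: string, corpus: string[][]): { docIndex: number; score: number }[] {
  return corpus
    .map((docTerms, docIndex) => ({ docIndex, score: tfIdf(term, docTerms, corpus) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}

const corpus = [
  ["search", "engine", "search"],
  ["search", "index"],
  ["database"],
];
console.log(rank("search", corpus));
```

In this example the first document ranks above the second for "search" because the term makes up a larger share of it. One consequence of this IDF variant is that a term appearing in every document scores 0 everywhere, which is why some implementations add smoothing.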
Even though the formula is straightforward, implementing it in a real system gave me a much deeper understanding of how ranking actually works.
🧪 Adding End-to-End Tests
More recently, I started working on E2E testing, which has been a big learning experience.
I created a test file (document-e2e.spec.ts) where I:
Send real HTTP requests to the API
Validate document creation
Verify that ranking works correctly
To avoid polluting the database, I run everything inside transactions and roll them back after each test.
Honestly, I underestimated how complex testing can be.
I even had to refactor large parts of my services to make them more testable — and I still have a lot to improve here.
😵 Challenges Along the Way
Some of the biggest challenges so far:
Understanding and implementing inverted indexes
Working with nested data structures like Map<Map<...>>
Realizing that testing is much harder than it looks
One important realization was how much architecture affects testability. Writing tests forced me to rethink how I structure my code.
💡 What Surprised Me the Most
Before this project, I had a completely different mental model of how search engines worked.
I used to think:
“You go through each document and count matching terms.”
But the reality is the opposite.
Search engines rely on inverted indexes, where:
Terms point to documents
Not documents to terms
This “reverse thinking” completely changed how I understand search systems.
🚀 What’s Next
I still have a lot I want to implement:
Phrase search
Highlighting
Fuzzy search
Stemming
Suggestions
After that, I plan to explore performance improvements like caching.
🔗 Project
If you want to check it out or give feedback:
👉 https://github.com/diogodls/search-engine
🏁 Final Thoughts
This project started as curiosity, but it quickly turned into one of the most valuable learning experiences I’ve had.
There’s still a lot to improve — and that’s exactly the point.
If you're also learning about search systems or building something similar, I'd love to connect.
I’ll keep building it in public 🚀
Cover image from: https://motopress.com/blog/top-search-engines/