Build your own search engine, inverted index and TF-IDF

#search #nlp #tfidf #python

Type into a search box, get back the most relevant documents, instantly, out of millions. It feels like something only Google can do. But the engine behind ordinary full-text search, the thing inside Elasticsearch, Lucene, and your site's search bar, rests on two ideas you can build in about 30 lines: an inverted index to find matching documents fast, and TF-IDF to rank them by relevance.

The one idea: flip the data around

The naive way to search is to scan every document for the query word. That's O(documents × length), hopeless at scale. The fix is to build the index backwards.

A normal index maps document → its words. An inverted index maps the other way: word → the documents that contain it (and how often). Now "find documents containing python" is a single dictionary lookup instead of a scan. Flipping the mapping is the entire reason search is fast.

from collections import defaultdict

def tokenize(text):
    return text.lower().split()        # real engines also strip punctuation, stem, etc.

def build_index(docs):
    index = defaultdict(dict)          # term -> {doc_id: term_frequency}
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = [
    "python is a great language for data",
    "search engines rank documents by relevance",
    "python powers data science and search",
]
index = build_index(docs)
print(index["python"])     # {0: 1, 2: 1}  -> docs 0 and 2, once each
print(index["search"])     # {1: 1, 2: 1}

One lookup gives you every document containing a term, plus the term frequency, which we'll need for ranking.

The second idea: rank by relevance with TF-IDF

Matching isn't enough, you need the best matches first. TF-IDF scores how important a term is to a document with two intuitions:

TF (term frequency): a document that uses the query word more is more about it. More mentions, higher score.
IDF (inverse document frequency): a word that appears in every document (like "the") tells you nothing, so it should count for little. A rare word that appears in few documents is highly discriminating, so it should count for a lot. IDF is log(total_docs / docs_containing_term), big for rare words, near zero for ubiquitous ones.

Multiply them: a term scores high in a document when it appears often there but rarely overall. That's the insight that makes results feel relevant, it automatically downweights common filler and rewards distinctive matches.

import math

def search(query, index, n_docs):
    scores = defaultdict(float)
    for term in tokenize(query):
        if term not in index:
            continue
        df  = len(index[term])              # how many docs contain the term
        idf = math.log(n_docs / df)         # rare terms -> larger idf
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf * idf      # accumulate TF-IDF across query terms
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("python data", index, len(docs)))
# docs 0 and 2 rank top: they match both query terms

Two details that matter:

We only score documents in the index for the query terms. Thanks to the inverted index, we never touch documents that don't match. The cost scales with how many docs contain your words, not your whole corpus, the reason this is fast.
Scores accumulate across query terms. A document matching both "python" and "data" outranks one matching only one of them. Multi-word relevance falls right out of the sum.

From this to a real engine

This 30-line core is genuinely how production search starts. The rest is refinement, not new ideas:

Better tokenization: lowercasing, removing punctuation, stemming ("running" → "run") so variants match, dropping stop words.
Better ranking: BM25, a tuned evolution of TF-IDF that handles document length and term saturation, this is what Elasticsearch/Lucene actually use by default.
Phrase and boolean queries, using the positions of terms (store position lists in the index, not just counts).
Scale: sharding the index across machines, the inverted-index structure is what makes that distribution possible.

Every one of those sits on top of "word → documents" plus "score by TF-IDF." Even modern semantic search with vectors is usually combined with this keyword index, not a replacement for it.

Why this is worth building

Search is one of those capabilities that looks like deep infrastructure and turns out to be two clean ideas: invert the mapping so lookups are instant, and score by TF-IDF so the best matches rise. Build it once and Elasticsearch stops being a black box, it's this, hardened and distributed.

Taking it further, BM25, positional queries, stemming, and the bridge to embeddings, is exactly the path the NLP track follows, building the search engine instead of configuring one.