DEV Community

Haji Rufai
Haji Rufai

Posted on • Originally published at hajirufai.github.io

Building a Full-Text Search Engine from Scratch in Python

Every time you type a query into Google, Elasticsearch, or even VS Code's file search, a chain of algorithms runs beneath: tokenization, stemming, inverted indexing, scoring, ranking. Most developers treat these as black boxes.

I opened the box. SearchLite is a full-text search engine I built from scratch in Python — zero external dependencies — implementing every layer from raw text analysis to BM25-ranked results.

The Problem

I wanted to understand how search actually works. Not "call elasticsearch.search()" understanding, but "here's why 'running' matches 'run' and why document #7 ranks above document #3" understanding.

The goal: build something that handles real queries like python AND "data pipelines" NOT java with proper relevance ranking.

Architecture

Query → Parser → AST → Posting List Operations → Scorer → Ranked Results
                                    ↑
Text → Tokenizer → Normalizer → Stemmer → Inverted Index
Enter fullscreen mode Exit fullscreen mode

The search engine has 16 modules, each handling one piece:

Layer Module Job
Analysis tokenizer.py Break text into words with position tracking
Analysis normalizer.py Lowercase, strip accents, Unicode normalization
Analysis stemmer.py Porter stemmer — "running" → "run"
Analysis stopwords.py Filter "the", "is", "and"
Analysis analyzer.py Chain them into a pipeline
Schema schema.py Define field types and behavior
Index posting.py Posting lists with set operations
Index index.py Inverted index — term → document mapping
Query query.py AST nodes: Term, Phrase, Bool, Wildcard
Query query_parser.py Recursive descent parser
Scoring scorer.py BM25 and TF-IDF implementations
Search searcher.py Execute queries against the index
Search highlighter.py Extract best-match snippets
Search facets.py Faceted counts and filtering
Storage storage.py JSON persistence with compaction
API engine.py High-level SearchEngine class

The Inverted Index

The core data structure. Instead of storing "Document 1 contains words [a, b, c]", we flip it: "Word 'python' appears in documents [1, 4, 7] at positions [0, 3, 12]."

# What you'd think search is:
documents = {"doc1": "python is great", "doc2": "java is fine"}
results = [doc for doc in documents if "python" in documents[doc]]

# What search actually is (simplified):
index = {
    "python": PostingList([Posting(doc_id=0, positions=[0], tf=1)]),
    "great":  PostingList([Posting(doc_id=0, positions=[2], tf=1)]),
    "java":   PostingList([Posting(doc_id=1, positions=[0], tf=1)]),
}
Enter fullscreen mode Exit fullscreen mode

The position data is what makes phrase queries possible. When you search "machine learning", I find documents containing both terms, then check if "learning" appears exactly one position after "machine" in any document.

BM25 Scoring

TF-IDF is the classic, but BM25 (Best Matching 25) is what modern search engines use. The key insight: BM25 has term frequency saturation — the 10th occurrence of "python" in a document matters far less than the 1st.

class BM25Scorer:
    def __init__(self, k1: float = 1.2, b: float = 0.75):
        self.k1 = k1  # term frequency saturation
        self.b = b     # length normalization weight

    def score(self, tf, df, doc_length, avg_doc_length, total_docs):
        # IDF: rare terms score higher
        idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1.0)

        # Normalized TF with saturation
        norm = 1 - self.b + self.b * (doc_length / avg_doc_length)
        tf_norm = (tf * (self.k1 + 1)) / (tf + self.k1 * norm)

        return idf * tf_norm
Enter fullscreen mode Exit fullscreen mode

Two parameters control everything:

  • k1 (default 1.2): How quickly TF saturates. Higher = more weight on repeated terms.
  • b (default 0.75): Length normalization. 1.0 = heavily penalize long documents. 0.0 = ignore length.

The Query Parser

The most interesting piece. A recursive descent parser that turns (python OR java) AND "data pipelines" into an executable AST.

Input: (python OR java) AND "data pipelines"

Tokens: LPAREN, WORD(python), OR, WORD(java), RPAREN, AND, PHRASE("data pipelines")

AST:
  BoolQuery(must=[
    BoolQuery(should=[TermQuery("python"), TermQuery("java")]),
    PhraseQuery(["data", "pipelines"])
  ])
Enter fullscreen mode Exit fullscreen mode

It handles operator precedence (AND binds tighter than OR), field-specific queries (title:python), wildcards (pyth*), boost modifiers (python^2.0), and parenthesized grouping.

Porter Stemmer — From the Original Algorithm

The stemmer is the most algorithmically dense part. Five sequential steps, each with pattern-matching rules that strip suffixes based on the word's consonant-vowel structure.

The algorithm uses a "measure" — counting consonant-vowel sequences to decide when stripping is safe:

m=0: tr, ee, tree      → don't strip (too short)
m=1: trouble, oats     → strip cautiously
m=2: private, rate     → strip more aggressively
Enter fullscreen mode Exit fullscreen mode

Examples:

  • "caresses" → "caress" (Step 1a: SSES → SS)
  • "running" → "run" (Step 1b: strip -ING + cleanup)
  • "relational" → "relat" (Step 2: -ATIONAL → -ATE, then Step 5 strips final e)
  • "generalization" → "gener" (Step 2: -ALIZATION → -ALIZE → Step 3 → Step 5)

Performance

On 1,000 synthetic documents:

Simple term query:     0.26ms avg
Boolean AND:           0.52ms avg
Phrase query:          2.07ms avg
Wildcard prefix:       0.31ms avg
Indexing:              ~830 docs/sec
Enter fullscreen mode Exit fullscreen mode

Not Elasticsearch-fast, but respectable for pure Python with zero C extensions.

Usage

from searchlite import SearchEngine, Schema, TextField, KeywordField

engine = SearchEngine("./my_index", schema=Schema(
    title=TextField(boost=2.0),
    body=TextField(),
    tags=KeywordField(faceted=True),
))

# Index documents
engine.add({"title": "Python Data Science", "body": "...", "tags": ["python"]})
engine.commit()  # persist to disk

# Search with facets and highlighting
results = engine.search("python", facets=["tags"])
for hit in results:
    print(hit.score, hit.doc["title"])
    print(hit.highlight("body", fragment_size=100))
print(results.facets)  # {"tags": {"python": 2, ...}}
Enter fullscreen mode Exit fullscreen mode

What I Learned

  1. Position tracking is expensive but essential. Without it, you can't do phrase queries — one of the most common search patterns.

  2. BM25's simplicity is deceiving. Two parameters (k1, b) produce surprisingly good relevance across diverse document sets.

  3. The query parser is where complexity hides. Tokenizing title:"machine learning" AND python^2.0 NOT java into the right AST requires careful precedence handling.

  4. Stemming is lossy by design. "university" and "universe" both stem to "univers" — that's a feature, not a bug. The recall gain outweighs the precision loss for most use cases.

  5. Zero-dependency constraint forced better code. No nltk for stemming, no json5 for storage. Every algorithm had to be understood and implemented.

Source

194 tests across 11 test modules. 4 working examples. ~4,400 lines of Python.

GitHub: github.com/hajirufai/searchlite
Live page: hajirufai.github.io/searchlite

Top comments (0)