Building a Full-Text Search Engine from Scratch in Python

#python #searchengine #algorithms #tutorial

Every time you type a query into Google, Elasticsearch, or even VS Code's file search, a chain of algorithms runs beneath: tokenization, stemming, inverted indexing, scoring, ranking. Most developers treat these as black boxes.

I opened the box. SearchLite is a full-text search engine I built from scratch in Python — zero external dependencies — implementing every layer from raw text analysis to BM25-ranked results.

The Problem

I wanted to understand how search actually works. Not "call elasticsearch.search()" understanding, but "here's why 'running' matches 'run' and why document #7 ranks above document #3" understanding.

The goal: build something that handles real queries like python AND "data pipelines" NOT java with proper relevance ranking.

Architecture

Query → Parser → AST → Posting List Operations → Scorer → Ranked Results
                                    ↑
Text → Tokenizer → Normalizer → Stemmer → Inverted Index

The search engine has 16 modules, each handling one piece:

Layer	Module	Job
Analysis	`tokenizer.py`	Break text into words with position tracking
Analysis	`normalizer.py`	Lowercase, strip accents, Unicode normalization
Analysis	`stemmer.py`	Porter stemmer — "running" → "run"
Analysis	`stopwords.py`	Filter "the", "is", "and"
Analysis	`analyzer.py`	Chain them into a pipeline
Schema	`schema.py`	Define field types and behavior
Index	`posting.py`	Posting lists with set operations
Index	`index.py`	Inverted index — term → document mapping
Query	`query.py`	AST nodes: Term, Phrase, Bool, Wildcard
Query	`query_parser.py`	Recursive descent parser
Scoring	`scorer.py`	BM25 and TF-IDF implementations
Search	`searcher.py`	Execute queries against the index
Search	`highlighter.py`	Extract best-match snippets
Search	`facets.py`	Faceted counts and filtering
Storage	`storage.py`	JSON persistence with compaction
API	`engine.py`	High-level SearchEngine class

The Inverted Index

The core data structure. Instead of storing "Document 1 contains words [a, b, c]", we flip it: "Word 'python' appears in documents [1, 4, 7] at positions [0, 3, 12]."

# What you'd think search is:
documents = {"doc1": "python is great", "doc2": "java is fine"}
results = [doc for doc in documents if "python" in documents[doc]]

# What search actually is (simplified):
index = {
    "python": PostingList([Posting(doc_id=0, positions=[0], tf=1)]),
    "great":  PostingList([Posting(doc_id=0, positions=[2], tf=1)]),
    "java":   PostingList([Posting(doc_id=1, positions=[0], tf=1)]),
}

The position data is what makes phrase queries possible. When you search "machine learning", I find documents containing both terms, then check if "learning" appears exactly one position after "machine" in any document.

BM25 Scoring

TF-IDF is the classic, but BM25 (Best Matching 25) is what modern search engines use. The key insight: BM25 has term frequency saturation — the 10th occurrence of "python" in a document matters far less than the 1st.

class BM25Scorer:
    def __init__(self, k1: float = 1.2, b: float = 0.75):
        self.k1 = k1  # term frequency saturation
        self.b = b     # length normalization weight

    def score(self, tf, df, doc_length, avg_doc_length, total_docs):
        # IDF: rare terms score higher
        idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1.0)

        # Normalized TF with saturation
        norm = 1 - self.b + self.b * (doc_length / avg_doc_length)
        tf_norm = (tf * (self.k1 + 1)) / (tf + self.k1 * norm)

        return idf * tf_norm

Two parameters control everything:

k1 (default 1.2): How quickly TF saturates. Higher = more weight on repeated terms.
b (default 0.75): Length normalization. 1.0 = heavily penalize long documents. 0.0 = ignore length.

The Query Parser

The most interesting piece. A recursive descent parser that turns (python OR java) AND "data pipelines" into an executable AST.

Input: (python OR java) AND "data pipelines"

Tokens: LPAREN, WORD(python), OR, WORD(java), RPAREN, AND, PHRASE("data pipelines")

AST:
  BoolQuery(must=[
    BoolQuery(should=[TermQuery("python"), TermQuery("java")]),
    PhraseQuery(["data", "pipelines"])
  ])

It handles operator precedence (AND binds tighter than OR), field-specific queries (title:python), wildcards (pyth*), boost modifiers (python^2.0), and parenthesized grouping.

Porter Stemmer — From the Original Algorithm

The stemmer is the most algorithmically dense part. Five sequential steps, each with pattern-matching rules that strip suffixes based on the word's consonant-vowel structure.

The algorithm uses a "measure" — counting consonant-vowel sequences to decide when stripping is safe:

m=0: tr, ee, tree      → don't strip (too short)
m=1: trouble, oats     → strip cautiously
m=2: private, rate     → strip more aggressively

Examples:

"caresses" → "caress" (Step 1a: SSES → SS)
"running" → "run" (Step 1b: strip -ING + cleanup)
"relational" → "relat" (Step 2: -ATIONAL → -ATE, then Step 5 strips final e)
"generalization" → "gener" (Step 2: -ALIZATION → -ALIZE → Step 3 → Step 5)

Performance

On 1,000 synthetic documents:

Simple term query:     0.26ms avg
Boolean AND:           0.52ms avg
Phrase query:          2.07ms avg
Wildcard prefix:       0.31ms avg
Indexing:              ~830 docs/sec

Not Elasticsearch-fast, but respectable for pure Python with zero C extensions.

Usage

from searchlite import SearchEngine, Schema, TextField, KeywordField

engine = SearchEngine("./my_index", schema=Schema(
    title=TextField(boost=2.0),
    body=TextField(),
    tags=KeywordField(faceted=True),
))

# Index documents
engine.add({"title": "Python Data Science", "body": "...", "tags": ["python"]})
engine.commit()  # persist to disk

# Search with facets and highlighting
results = engine.search("python", facets=["tags"])
for hit in results:
    print(hit.score, hit.doc["title"])
    print(hit.highlight("body", fragment_size=100))
print(results.facets)  # {"tags": {"python": 2, ...}}

What I Learned

Position tracking is expensive but essential. Without it, you can't do phrase queries — one of the most common search patterns.
BM25's simplicity is deceiving. Two parameters (k1, b) produce surprisingly good relevance across diverse document sets.
The query parser is where complexity hides. Tokenizing title:"machine learning" AND python^2.0 NOT java into the right AST requires careful precedence handling.
Stemming is lossy by design. "university" and "universe" both stem to "univers" — that's a feature, not a bug. The recall gain outweighs the precision loss for most use cases.
Zero-dependency constraint forced better code. No nltk for stemming, no json5 for storage. Every algorithm had to be understood and implemented.