Every time you type a query into Google, Elasticsearch, or even VS Code's file search, a chain of algorithms runs beneath: tokenization, stemming, inverted indexing, scoring, ranking. Most developers treat these as black boxes.
I opened the box. SearchLite is a full-text search engine I built from scratch in Python — zero external dependencies — implementing every layer from raw text analysis to BM25-ranked results.
The Problem
I wanted to understand how search actually works. Not "call elasticsearch.search()" understanding, but "here's why 'running' matches 'run' and why document #7 ranks above document #3" understanding.
The goal: build something that handles real queries like python AND "data pipelines" NOT java with proper relevance ranking.
Architecture
Query → Parser → AST → Posting List Operations → Scorer → Ranked Results
↑
Text → Tokenizer → Normalizer → Stemmer → Inverted Index
The search engine has 16 modules, each handling one piece:
| Layer | Module | Job |
|---|---|---|
| Analysis | tokenizer.py |
Break text into words with position tracking |
| Analysis | normalizer.py |
Lowercase, strip accents, Unicode normalization |
| Analysis | stemmer.py |
Porter stemmer — "running" → "run" |
| Analysis | stopwords.py |
Filter "the", "is", "and" |
| Analysis | analyzer.py |
Chain them into a pipeline |
| Schema | schema.py |
Define field types and behavior |
| Index | posting.py |
Posting lists with set operations |
| Index | index.py |
Inverted index — term → document mapping |
| Query | query.py |
AST nodes: Term, Phrase, Bool, Wildcard |
| Query | query_parser.py |
Recursive descent parser |
| Scoring | scorer.py |
BM25 and TF-IDF implementations |
| Search | searcher.py |
Execute queries against the index |
| Search | highlighter.py |
Extract best-match snippets |
| Search | facets.py |
Faceted counts and filtering |
| Storage | storage.py |
JSON persistence with compaction |
| API | engine.py |
High-level SearchEngine class |
The Inverted Index
The core data structure. Instead of storing "Document 1 contains words [a, b, c]", we flip it: "Word 'python' appears in documents [1, 4, 7] at positions [0, 3, 12]."
# What you'd think search is:
documents = {"doc1": "python is great", "doc2": "java is fine"}
results = [doc for doc in documents if "python" in documents[doc]]
# What search actually is (simplified):
index = {
"python": PostingList([Posting(doc_id=0, positions=[0], tf=1)]),
"great": PostingList([Posting(doc_id=0, positions=[2], tf=1)]),
"java": PostingList([Posting(doc_id=1, positions=[0], tf=1)]),
}
The position data is what makes phrase queries possible. When you search "machine learning", I find documents containing both terms, then check if "learning" appears exactly one position after "machine" in any document.
BM25 Scoring
TF-IDF is the classic, but BM25 (Best Matching 25) is what modern search engines use. The key insight: BM25 has term frequency saturation — the 10th occurrence of "python" in a document matters far less than the 1st.
class BM25Scorer:
def __init__(self, k1: float = 1.2, b: float = 0.75):
self.k1 = k1 # term frequency saturation
self.b = b # length normalization weight
def score(self, tf, df, doc_length, avg_doc_length, total_docs):
# IDF: rare terms score higher
idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1.0)
# Normalized TF with saturation
norm = 1 - self.b + self.b * (doc_length / avg_doc_length)
tf_norm = (tf * (self.k1 + 1)) / (tf + self.k1 * norm)
return idf * tf_norm
Two parameters control everything:
- k1 (default 1.2): How quickly TF saturates. Higher = more weight on repeated terms.
- b (default 0.75): Length normalization. 1.0 = heavily penalize long documents. 0.0 = ignore length.
The Query Parser
The most interesting piece. A recursive descent parser that turns (python OR java) AND "data pipelines" into an executable AST.
Input: (python OR java) AND "data pipelines"
Tokens: LPAREN, WORD(python), OR, WORD(java), RPAREN, AND, PHRASE("data pipelines")
AST:
BoolQuery(must=[
BoolQuery(should=[TermQuery("python"), TermQuery("java")]),
PhraseQuery(["data", "pipelines"])
])
It handles operator precedence (AND binds tighter than OR), field-specific queries (title:python), wildcards (pyth*), boost modifiers (python^2.0), and parenthesized grouping.
Porter Stemmer — From the Original Algorithm
The stemmer is the most algorithmically dense part. Five sequential steps, each with pattern-matching rules that strip suffixes based on the word's consonant-vowel structure.
The algorithm uses a "measure" — counting consonant-vowel sequences to decide when stripping is safe:
m=0: tr, ee, tree → don't strip (too short)
m=1: trouble, oats → strip cautiously
m=2: private, rate → strip more aggressively
Examples:
- "caresses" → "caress" (Step 1a: SSES → SS)
- "running" → "run" (Step 1b: strip -ING + cleanup)
- "relational" → "relat" (Step 2: -ATIONAL → -ATE, then Step 5 strips final e)
- "generalization" → "gener" (Step 2: -ALIZATION → -ALIZE → Step 3 → Step 5)
Performance
On 1,000 synthetic documents:
Simple term query: 0.26ms avg
Boolean AND: 0.52ms avg
Phrase query: 2.07ms avg
Wildcard prefix: 0.31ms avg
Indexing: ~830 docs/sec
Not Elasticsearch-fast, but respectable for pure Python with zero C extensions.
Usage
from searchlite import SearchEngine, Schema, TextField, KeywordField
engine = SearchEngine("./my_index", schema=Schema(
title=TextField(boost=2.0),
body=TextField(),
tags=KeywordField(faceted=True),
))
# Index documents
engine.add({"title": "Python Data Science", "body": "...", "tags": ["python"]})
engine.commit() # persist to disk
# Search with facets and highlighting
results = engine.search("python", facets=["tags"])
for hit in results:
print(hit.score, hit.doc["title"])
print(hit.highlight("body", fragment_size=100))
print(results.facets) # {"tags": {"python": 2, ...}}
What I Learned
Position tracking is expensive but essential. Without it, you can't do phrase queries — one of the most common search patterns.
BM25's simplicity is deceiving. Two parameters (k1, b) produce surprisingly good relevance across diverse document sets.
The query parser is where complexity hides. Tokenizing
title:"machine learning" AND python^2.0 NOT javainto the right AST requires careful precedence handling.Stemming is lossy by design. "university" and "universe" both stem to "univers" — that's a feature, not a bug. The recall gain outweighs the precision loss for most use cases.
Zero-dependency constraint forced better code. No
nltkfor stemming, nojson5for storage. Every algorithm had to be understood and implemented.
Source
194 tests across 11 test modules. 4 working examples. ~4,400 lines of Python.
GitHub: github.com/hajirufai/searchlite
Live page: hajirufai.github.io/searchlite
Top comments (0)