Search touches every corner of modern software. Whether you’re indexing your company’s internal docs or crawling the open web, the ability to store, rank, and retrieve information at scale is a core super‑power. This book is written for practical engineers who want to move beyond sample projects and build a production‑grade search engine—one that lives in the data‑center, survives real traffic, and answers queries in tens of milliseconds.
This book is a comprehensive guide to architecting and implementing such a system from the ground up. We will dissect the core components, explore the technologies that power industry giants like Google and privacy-focused innovators like Brave, and provide practical, production-ready code. By the end of this journey, you will not only understand how modern search works but will have built the foundational components of a powerful search engine capable of indexing the diverse and dynamic content of the modern web.
Throughout the chapters you’ll build Cortex Search, an independent index that reaches billions of pages, supports hybrid lexical + vector retrieval, and exposes a developer‑friendly gRPC API. Code is peppered throughout; each section ends with hands‑on labs you can run locally or on inexpensive cloud nodes.
Prerequisites
Solid Python or Rust, basic networking & Linux, and a willingness to debug distributed systems.
How to Use This Book
Each chapter stands alone but builds toward a complete system. Code blocks are MIT‑licensed; feel free to drop them into your repo. Wherever you see a 🚧 emoji, that section includes an optional extension (e.g., swapping FAISS for Milvus).
Table of Contents
Part I · Foundations
- Chapter 1: Introduction to Search Engines
- 1.1 What is a Search Engine?
- 1.2 Market Landscape
- 1.3 The Buy vs. Build Decision Matrix
- 1.4 Core Components at a Glance
- 1.5 Open Source Search Engines as a Blueprint
- Chapter 2: Design Goals & System Architecture
- 2.1 Latency Budgets & Service Level Agreements (SLAs)
- 2.2 Coverage & Freshness KPIs
- 2.3 Choosing Languages: Rust for Indexer, Python for Glue
- 2.4 Data-Flow vs. Microlith & Architectural Evolution
- 2.5 Privacy-First Evolution: The Brave Model
- 2.6 Failure Domains & Replication
- Chapter 3: Hardware & Cluster Baseline
- 3.1 Storage Tier
- 3.2 Compute Tier
- 3.3 Network Tier
Part II · Data Acquisition & Processing
- Chapter 4: Web Crawling at Scale
- 4.1 Crawler Framework & Architecture
- 4.2 The URL Frontier and Scheduler
- 4.3 Distributed Crawling and Parallel Processing
- 4.4 Hands-On Lab 1: Hello Crawler
- Chapter 5: Politeness, Robots, and Legal Compliance
- 5.1 Honoring Robots.txt and Handling Errors
- 5.2 Rate Limiting and Adaptive Throttling
- 5.3 Ethical and Legal Considerations
- Chapter 6: Parsing, Boilerplate Removal, & Metadata Extraction
- 6.1 The Content Processing Pipeline
- 6.2 High-Performance Parsing
- 6.3 Metadata Extraction
- Chapter 7: De‑Duplication & Canonicalisation
- 7.1 Near-Duplicate Detection
- 7.2 Efficient URL Tracking
- Chapter 8: Text Processing: Tokenisation, Stemming, and Language ID
- 8.1 Core Text Processing Steps
- 8.2 Implementation in Python
- 8.3 Implementation in Rust
Part III · The Indexing Engine
- Chapter 9: Building the Inverted Index
- 9.1 The Role of the Inverted Index
- 9.2 Technology Choices: Tantivy (Rust) & Lucene (Java)
- 9.3 Creating a Simple Inverted Index in Python
- 9.4 Creating an Inverted Index in Rust with Tantivy
- 9.5 Index Optimization: Persistence and Compression
- Chapter 10: Embeddings & Vector Representations
- 10.1 Introduction to Vector Embeddings
- 10.2 Generation and Storage
- Chapter 11: Approximate Nearest‑Neighbour Search with FAISS & HNSW
- 11.1 The Need for Approximation
- 11.2 Core Technologies: FAISS and HNSW
- 11.3 Scalable Indexing Techniques
- Chapter 12: Hybrid Retrieval Strategies
- 12.1 Combining Lexical and Semantic Search
- 12.2 A Practical Hybrid Search Strategy
- Chapter 13: Link Analysis & PageRank
- 13.1 The PageRank Algorithm
- 13.2 Python Implementation of PageRank
- Chapter 14: Learning‑to‑Rank and Neural Re‑Ranking
- 14.1 Introduction to Learning-to-Rank (LTR)
- 14.2 Model Choices and Caching
- 14.3 Feature Engineering for LTR
- Chapter 15: Incremental & Real‑Time Index Updates
- 15.1 The Challenge of Freshness
- 15.2 Real-Time Update Strategies
Part IV · Serving & Operations
- Chapter 16: Query Serving Architecture & gRPC API Design
- 16.1 The Query Engine
- 16.2 API Design and Protocols
- 16.3 Security and Advanced Features
- Chapter 17: SERP Front‑End with React & Tailwind
- 17.1 Frontend Technology Choices
- 17.2 Conceptual UI with Flask
- 17.3 User Interface Best Practices
- Chapter 18: Distributed Sharding & Fault Tolerance
- 18.1 The Need for Distribution
- 18.2 Sharding Strategies
- 18.3 Replication for High Availability
- Chapter 19: Low‑Latency Optimisations
- 19.1 Caching and Index Efficiency
- 19.2 Load Balancing and Memory Management
- Chapter 20: Observability: Metrics, Tracing, and Alerting
- 20.1 Metrics and Tracing
- 20.2 Alerting, Chaos Testing, and Logging
- Chapter 21: Security, Privacy, and Abuse Mitigation
- 21.1 Data Handling and Compliance
- 21.2 User Data Anonymization
- Chapter 22: Cost Engineering & Cloud Deployment Patterns
- 22.1 Managing Storage and Compute Costs
- 22.2 Leveraging Cloud Infrastructure
- Chapter 23: Continuous Integration & Delivery
- 23.1 Development and Deployment Workflow
- 23.2 Sample Project Plan
Part V · Advanced Topics & Case Studies
- Chapter 24: Advanced Features: Snippets, Entities, and QA
- 24.1 Snippet Generation
- 24.2 Indexing Alternative Content Sources
- Chapter 25: Scaling to Billions of Documents
- Chapter 26: Personalisation & LLM‑Enhanced Ranking
- Chapter 27: Case Study: Operating Cortex Search in Production
- 27.1 A High-Level Implementation Roadmap
- 27.2 Final Words
- Appendices
- Appendix A: Config Templates
- Appendix B: Cheat-Sheets
- Appendix C: Glossary
Part I · Foundations
Chapter 1: Introduction to Search Engines
This chapter introduces the fundamental concepts of a search engine, examines the current market, and outlines the core components that form the basis of any modern search system.
1.1 What is a Search Engine?
A search engine is a software system that retrieves and ranks information from a large dataset, typically the web, based on user queries. It consists of several components working together to deliver relevant results quickly. Modern search engines like Google and Brave handle billions of pages, requiring sophisticated algorithms and infrastructure.
1.2 Market Landscape
Google, Bing, Baidu, and Yandex dominate web search, but vertical engines (Brave, Perplexity, Pinterest, academic indexes) prove that niches matter. Owning the full stack lets you:
- Control ranking criteria & bias.
- Integrate domain‑specific features (e.g., chemical structure search).
- Avoid API rate limits and vendor lock‑in.
1.3 The Buy vs. Build Decision Matrix
If your queries exceed ≈ 100 QPS, or you need ranking that commercial APIs can’t provide, building a system from the ground up becomes cost-competitive.
Factor | SaaS API | Self‑Hosted Solr / Elasticsearch | Ground‑Up Engine |
---|---|---|---|
CapEx | Low | Medium | High |
OpEx | Usage‑based | Cluster maintenance | Full infra & dev team |
Custom Ranking | Limited | Plugin support | Unlimited |
Latency Control | Vendor‑dependent | Moderate | Full control |
1.4 Core Components at a Glance
A search engine consists of several interconnected components. At a high level, the process of searching the web can be broken down into three main stages: crawling, indexing, and query processing/ranking. In addition, we need a front-end interface and infrastructure to serve search results quickly to users.
- Web Crawler (Spider): A program that systematically browses the web to discover new and updated pages. It starts from a set of seed URLs and follows links recursively, fetching page content.
- Indexer: The component that processes fetched documents and builds an index. Indexing involves parsing documents, extracting textual content, and creating data structures (like the inverted index) that allow fast retrieval of documents by keywords.
- Searcher / Query Processor: When a user issues a query, the search engine must interpret the query, look up relevant documents in the index, rank them by relevance, and prepare results.
- Ranking Module: This applies algorithms to sort the retrieved documents by relevance. Classic ranking methods include textual relevance and link analysis.
- User Interface: Allows users to input queries and view results, including titles, URLs, and a snippet.
A conceptual system overview can be visualized as a pipeline where each component can be scaled independently.
[ Crawler ] → [ Parser ] → [ Indexer ] → [ Query Engine ] → [ Ranker ] → [ API / UI ]
(per-stage artifacts, left to right: robots.txt, parsed content, inverted index, BM25 / dense retrieval, relevance scores, frontend / API)
┌────────────┐ ┌───────────┐ ┌────────────┐
│ Crawler ├──►│ Parser ├──►│ Indexer │
└────────────┘ └───────────┘ └────┬───────┘
│
┌──────────▼─────────┐
│ Search Service │
└──────────┬─────────┘
│
┌──────▼───────┐
│ Front‐End │
└──────────────┘
1.5 Open Source Search Engines as a Blueprint
Open source search engines provide a blueprint for creating a production-grade search engine with low latency and rich indexing coverage. These systems, such as OpenSearch, Meilisearch, and Typesense, are freely available for study, allowing you to learn from their implementations. By analyzing these engines, you can learn about efficient data structures like inverted indices for quick retrieval, distributed architectures for scalability and fault tolerance, and API design for developer-friendly integration and real-time capabilities.
Aspect | OpenSearch | Meilisearch | Typesense |
---|---|---|---|
Base Technology | Apache Lucene | Rust, LMDB | Adaptive Radix Tree, RocksDB |
Architecture | Distributed, role-based nodes | Modular, RESTful API | Single master, read-only replicas |
Low Latency | Distributed processing, in-memory | In-memory, sub-50ms responses | In-memory, sub-50ms searches |
Rich Indexing | Full-text, ML, vector search | Typo-tolerance, faceted search | Typo-tolerance, faceted navigation |
Real-Time Updates | Supported via distributed nodes | Real-time update mechanism | Asynchronous replica updates |
Programming Language | Java (Lucene-based) | Rust | C++ |
License | Apache 2.0 | MIT | GPL-3.0 |
This comparison highlights the diversity in approaches, with each engine offering unique strengths for achieving low latency and rich indexing.
Chapter 2: Design Goals & System Architecture
This chapter outlines the high-level design goals and architectural patterns that will guide the construction of our search engine.
2.1 Latency Budgets & Service Level Agreements (SLAs)
Achieving low latency requires a strict budget for each stage of the query processing pipeline. Results should be returned in milliseconds, achieved through efficient indexing and caching.
Target End-to-End Latency:
- P95 Latency: < 50 ms full pipeline.
- Cold-cache Query: ≈ 25 ms.
- Warm-cache Query: ≈ 10 ms.
Latency Breakdown per Stage:
- Candidate generation (≤ 5 ms)
- Feature assembly (≤ 3 ms)
- Learning-to-Rank (≤ 5 ms)
- Answer generation / snippets (≤ 8 ms)
2.2 Coverage & Freshness KPIs
Rich indexing coverage ensures comprehensive search results.
KPI | Good baseline |
---|---|
Indexed pages | 12–20 B unique URLs (Brave’s public figure). |
Average doc age | < 30 min for news; < 24 h global. |
P95 latency | < 50 ms full pipeline. |
Crawl politeness | ≤ 1 req/s/host; adaptive throttling. |
2.3 Choosing Languages: Rust for Indexer, Python for Glue
To build a system that is both high-performance and flexible, this book will adopt a dual-language approach, leveraging the unique strengths of Rust and Python.
- Rust for the Core Engine: The heart of our search engine—the indexer, the data structures like the inverted index, and the query processor—will be built in Rust. Rust provides C++-level performance without sacrificing memory safety, a critical feature for building reliable, long-running systems. Its powerful concurrency model allows us to build highly parallelized indexing and query pipelines that can take full advantage of modern multi-core processors. For a component where every microsecond of latency counts, Rust is the ideal choice. Major search engines built in Rust include Meilisearch and GitHub's Blackbird, with Tantivy serving as a foundational library.
- Python for the Periphery: The components responsible for data acquisition, parsing, and machine learning will be built in Python. Python's vast ecosystem of libraries makes it unparalleled for these tasks. We will use libraries like `requests` and `BeautifulSoup` for web crawling and parsing, and the `transformers` library for generating vector embeddings with state-of-the-art models. Python's agility and rich libraries allow for rapid development and experimentation.
2.4 Data-Flow vs. Microlith & Architectural Evolution
The conceptual pipeline of crawling, indexing, and serving has remained constant, but the underlying architecture has evolved from monoliths to microservices. This transformation was driven by the explosive growth of the web and the need for greater freshness and scalability.
Modern search engines employ a multi-tiered indexing architecture, often built on a microservices model. This allows the system to serve a blended result set, providing both up-to-the-minute freshness from a real-time tier and comprehensive historical depth from batch tiers.
- Near Real-time Index: Ingests and indexes new content within seconds or minutes.
- Weekly Batch Index: Processes a larger, more recent slice of data weekly for training ML models.
- Full Batch Index: The historical archive, re-indexed infrequently, used for large-scale model training and long-tail queries.
This complexity is managed by breaking the system into microservices. Each component—query suggestion, ranking, news indexing, image search—becomes an independent, horizontally scalable service communicating via lightweight protocols like gRPC or REST.
2.5 Privacy-First Evolution: The Brave Model
In a market dominated by a few large players, new entrants must differentiate themselves strategically. Brave Search has done so by focusing on privacy and user control. While many alternative search engines are simply facades that pull results from Bing or Google's APIs, Brave is built on its own independent search index, created from scratch. This independence is the cornerstone of its privacy promise; by not relying on Big Tech, Brave can guarantee that user queries are not tracked or profiled. This level of customization is only possible because Brave controls its own index and ranking algorithms.
2.6 Failure Domains & Replication
Distributed architectures (OpenSearch) and replication strategies (Typesense) ensure scalability and fault tolerance, crucial for handling large datasets and achieving low latency. Understanding trade-offs, such as availability over consistency (Typesense), informs design decisions based on use case requirements.
Chapter 3: Hardware & Cluster Baseline
This chapter details the foundational hardware choices necessary for a web-scale search engine, balancing performance with cost.
Tier | Why it matters | Proven pattern |
---|---|---|
Storage | Crawling at web scale generates tens–hundreds TB/day. You need something faster than object storage but cheaper than all-RAM. | NVMe + distributed cache à la Exa’s 350 TB Alluxio pool fronting S3; 400 GbE keeps copy time out of the critical path. |
Compute | Two distinct workloads: (a) IO-bound crawl/parse, (b) CPU/GPU-bound indexing & ranking. | Dual pools: low-cost x86/Graviton for crawl; GPU boxes (H100/H200) for embedding & vector search. Exa reports <$5 M for an H200-backed training cluster that outruns Google on benchmark queries. |
Network | Latency floor is set by cross-node hops. | Keep index shards and rankers on the same host; rely on 100-400 GbE for unavoidable hops. |
A successful architecture depends on matching the right hardware to each component's workload.
3.1 Storage Tier
Crawling at web scale generates tens to hundreds of terabytes of data per day. This requires a storage solution that is faster than object storage but more cost-effective than an all-RAM approach. A proven pattern is to use NVMe drives coupled with a distributed cache, such as Alluxio fronting an object store like S3. High-speed networking (e.g., 400 GbE) is essential to ensure that data transfer times do not become a bottleneck.
3.2 Compute Tier
Search engine workloads are diverse. They can be broadly categorized into two types:
- IO-bound tasks like crawling and parsing.
- CPU/GPU-bound tasks like indexing and ranking.
To handle this, a dual-pool approach is effective. Use low-cost x86 or ARM-based (Graviton) instances for crawling, and powerful GPU-equipped machines (e.g., H100/H200) for computationally intensive tasks like generating embeddings and performing vector search.
3.3 Network Tier
The physical distance and number of network hops between nodes set the floor for latency. To minimize this, index shards and their corresponding rankers should be co-located on the same physical host whenever possible. For hops that are unavoidable, high-bandwidth interconnects (100-400 GbE) are critical.
Part II · Data Acquisition & Processing
Chapter 4: Web Crawling at Scale
The crawler is the sensory organ of the search engine, responsible for discovering and fetching the vast and varied content that will ultimately populate our index.
4.1 Crawler Framework & Architecture
A production-grade crawler must be a distributed system capable of handling billions of URLs and fetching content concurrently from thousands of servers.
- Framework: A good starting point is to fork a battle-tested framework like StormCrawler (Java on Apache Storm). It is built for streaming, low-latency fetch cycles and scales horizontally out of the box.
- Tools & Libraries (Rust): For those building a custom crawler in Rust, `reqwest` is a robust library for making HTTP requests, and `tokio` is the standard for asynchronous concurrency.
4.2 The URL Frontier and Scheduler
The URL Frontier is the central nervous system of the crawler. It's a sophisticated data structure that manages the queue of URLs to be visited, prioritizing them and ensuring politeness. For large-scale crawls, the frontier must be disk-backed and implement priority queueing logic.
The Scheduler works in tandem with the frontier. It should maintain a priority queue keyed by properties such as `(host, URL, last-seen)`. To discover new and important content quickly, it should mix in URLs from various sources like RSS feeds, sitemaps, and pages with high change-frequency hints.
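A minimal in-memory sketch of such a frontier in Python is shown below; the priority scheme and the one-second politeness delay are illustrative assumptions, and a production frontier would be disk-backed and partitioned across machines.

import heapq
import time
from urllib.parse import urlparse

class Frontier:
    """A toy in-memory URL frontier; production versions are disk-backed."""

    def __init__(self, per_host_delay=1.0):
        self.heap = []                 # entries: (next_fetch_time, -priority, url)
        self.last_fetch = {}           # host -> timestamp of last dequeue
        self.per_host_delay = per_host_delay

    def add(self, url, priority=0.0):
        host = urlparse(url).netloc
        # Politeness: never schedule a host sooner than per_host_delay after its last fetch.
        earliest = self.last_fetch.get(host, 0.0) + self.per_host_delay
        next_time = max(time.time(), earliest)
        heapq.heappush(self.heap, (next_time, -priority, url))

    def pop(self):
        # Return the next URL that may be fetched now, or None if nothing is ready.
        if not self.heap:
            return None
        next_time, neg_priority, url = heapq.heappop(self.heap)
        if next_time > time.time():
            heapq.heappush(self.heap, (next_time, neg_priority, url))
            return None
        self.last_fetch[urlparse(url).netloc] = time.time()
        return url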
4.3 Distributed Crawling and Parallel Processing
To achieve high throughput, a crawler must fetch pages in parallel using multiple worker processes or machines.
- Python Implementation: The `multiprocessing` library can be used to parallelize crawling tasks. The following is a conceptual example. A real implementation would need to handle shared state, like the set of visited URLs and the URL queue, across processes.
from multiprocessing import Pool

def crawl_parallel(urls, max_pages=10):
    # This is a conceptual example. A real implementation would need to share
    # the 'visited' set and 'to_visit' queue across processes.
    with Pool(processes=4) as pool:
        # The 'crawl' function would need to be defined elsewhere in the book
        # results = pool.map(crawl, [(url, max_pages // 4) for url in urls])
        pass
        # return set().union(*results)
    return set()

# Example usage
urls = ["https://example.com", "https://anvil.works"]
crawled_urls = crawl_parallel(urls)
- Rust Implementation: In Rust, the `rayon` crate provides an easy way to parallelize iterators. This can be applied to process multiple search queries or other batch tasks concurrently.
use rayon::prelude::*;

// Assume SearchQuery, SearchResponse, SearchError, and a search method are defined
// use crate::{SearchQuery, SearchResponse, SearchError};

pub struct SearchEngine;

impl SearchEngine {
    // pub async fn search(&self, query: SearchQuery) -> Result<SearchResponse, SearchError> {
    //     // Implementation of a single search
    //     unimplemented!()
    // }

    // pub async fn parallel_search(&self, queries: Vec<SearchQuery>) -> Vec<Result<SearchResponse, SearchError>> {
    //     // Process multiple queries in parallel
    //     let results: Vec<Result<SearchResponse, SearchError>> = queries
    //         .into_par_iter()
    //         .map(|query| {
    //             // In practice, you'd need to handle async in parallel processing more carefully
    //             tokio::task::block_in_place(|| {
    //                 tokio::runtime::Handle::current().block_on(self.search(query))
    //             })
    //         })
    //         .collect();
    //
    //     results
    // }
}
4.4 Hands-On Lab 1: Hello Crawler
Below is a minimal asynchronous crawler in Python 3.12 using `aiohttp` and `aiodns`. It respects `robots.txt`, handles redirects, and streams pages into a Kafka topic.
Run `docker compose up` with Kafka + Zookeeper first; see Appendix A for compose files.
import asyncio, re, ssl, json, time
from urllib.parse import urljoin, urlparse
import aiohttp
from aiokafka import AIOKafkaProducer
ROBOT_CACHE = {}
USER_AGENT = "CortexBot/0.1 (+https://cortex.example.com/bot)"
async def fetch_text(session, url):
"""Helper to fetch raw text content for robots.txt parsing."""
try:
async with session.get(url, timeout=10) as resp:
if resp.status == 200:
return await resp.text()
except Exception:
return ""
return ""
async def fetch(session, url):
try:
async with session.get(url, timeout=15) as resp:
if resp.status != 200 or "text/html" not in resp.headers.get("content-type", ""):
return None
return await resp.text()
except Exception:
return None
async def allowed(session, url):
host = urlparse(url).netloc
if not host:
return False
if host in ROBOT_CACHE:
# Simplified check; a real implementation would parse the rules properly
return all(not url.startswith(disallowed) for disallowed in ROBOT_CACHE[host])
robots_url = urljoin(f"https://{host}", "/robots.txt")
txt = await fetch_text(session, robots_url)
disallows = re.findall(r"Disallow: (.*)", txt, re.I)
# Store absolute disallowed URLs
ROBOT_CACHE[host] = [urljoin(robots_url, d.strip()) for d in disallows]
return all(not url.startswith(disallowed) for disallowed in ROBOT_CACHE[host])
async def crawl(seed_urls, kafka_bootstrap="localhost:9092"):
producer = AIOKafkaProducer(bootstrap_servers=kafka_bootstrap)
await producer.start()
sslctx = ssl.create_default_context()
sslctx.set_ciphers("DEFAULT@SECLEVEL=1")
async with aiohttp.ClientSession(headers={"User-Agent": USER_AGENT},
connector=aiohttp.TCPConnector(ssl=sslctx)) as session:
q = asyncio.Queue()
for u in seed_urls:
await q.put(u)
seen = set(seed_urls)
while not q.empty():
url = await q.get()
if not await allowed(session, url):
q.task_done()
continue
html = await fetch(session, url)
if html:
print(f"Crawled: {url}")
await producer.send_and_wait("raw_pages", json.dumps({"url": url, "html": html}).encode())
for link in re.findall(r"href=\"(http[^\"]+)\"", html):
if link.startswith("http") and link not in seen:
seen.add(link)
await q.put(link)
q.task_done()
await producer.stop()
if __name__ == "__main__":
    # To run this demo, start Kafka and Zookeeper first.
# See Appendix A for Docker Compose files.
# seeds = ["https://example.org/"]
# asyncio.run(crawl(seeds))
pass
Chapter 5: Politeness, Robots, and Legal Compliance
A well-behaved crawler must be "polite." This is crucial for avoiding being blocked by web servers and for maintaining the overall health of the web ecosystem.
5.1 Honoring Robots.txt and Handling Errors
Always honor the `robots.txt` file. The `allowed` function in our lab crawler provides a basic implementation of this principle. A robust crawler should also gracefully handle server responses. This means backing off when it receives HTTP 4xx (client error) or 5xx (server error) status codes and rotating IP addresses to avoid aggressive throttling from hosts.
5.2 Rate Limiting and Adaptive Throttling
The primary mechanism for enforcing politeness is to limit the rate of requests to any single host. A good baseline is to aim for no more than one request per second per host (≤ 1 req/s/host). Furthermore, implement adaptive throttling that adjusts the crawl rate based on server response times, slowing down if latency increases.
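To make this concrete, here is a small sketch of a per-host adaptive throttle in Python; the thresholds and scaling factors are assumptions to be tuned against real server behaviour.

import asyncio
import time

class HostThrottle:
    """Per-host delay that grows when a server slows down and shrinks when it recovers."""

    def __init__(self, base_delay=1.0, min_delay=1.0, max_delay=30.0):
        self.delay = base_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = 0.0

    async def wait(self):
        # Sleep until at least `delay` seconds have passed since the last request to this host.
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.delay:
            await asyncio.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()

    def record(self, response_time):
        # Slow responses increase the delay; fast ones gently decrease it.
        if response_time > 2.0:
            self.delay = min(self.delay * 1.5, self.max_delay)
        else:
            self.delay = max(self.delay * 0.9, self.min_delay)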
5.3 Ethical and Legal Considerations
Beyond basic politeness, a responsible crawler operator must consider several ethical and legal factors:
- Robots.txt: Use a reliable parser to interpret `robots.txt` rules. Python's `urllib.robotparser` is a standard choice (see the sketch after this list).
- Request Delays: Implement delays between consecutive requests to the same host to avoid causing server overload.
- Error Handling: Handle HTTP errors gracefully instead of retrying aggressively.
- Nofollow Attribute: Respect `rel="nofollow"` attributes on links as a hint not to pass authority, though crawlers may still follow the link for discovery purposes.
- Transparency: Use a clear `User-Agent` string that points to a page explaining the purpose of your bot.
- Opt-Out Mechanism: Implement a way for site owners to request that their content be removed or not crawled.
- Content Safety: Store hashes of unsafe or illegal content to avoid re-indexing or displaying it in search results.
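As referenced above, here is a minimal example using the standard library's `urllib.robotparser`; the bot name and URLs are placeholders.

from urllib.robotparser import RobotFileParser

USER_AGENT = "CortexBot"

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch(USER_AGENT, "https://example.com/some/page.html"):
    print("Allowed to crawl this page")

# crawl_delay() returns the Crawl-delay directive for our agent, if the site declares one
print(f"Requested crawl delay: {rp.crawl_delay(USER_AGENT)}")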
Chapter 6: Parsing, Boilerplate Removal, & Metadata Extraction
This chapter details the content pipeline that transforms raw crawled data into structured, indexable information.
6.1 The Content Processing Pipeline
Once raw HTML is fetched, it must be processed into clean, structured data suitable for indexing. This involves several stages, each of which can be optimized for latency.
Stage | Detail | Latency tricks |
---|---|---|
Boiler-plate stripping | Use a library like jusText or a clone of Mozilla's Readability to extract the main article content, stripping away menus, ads, and footers. | Run in worker threads; stream content directly to the parser as it's downloaded. |
Tokenisation & POS | Tokenisation and Part-of-Speech tagging are necessary for building the inverted index (BM25) and for generating features for learning-to-rank models. | Keep a small static vocabulary in RAM for frequent terms. |
Embeddings | Generate sentence embeddings using models like Sentence-T5 or E5. | Batch documents on a GPU to amortise the overhead of transferring data to the device. |
Link & anchor features | Compute PageRank-like metrics incrementally from the link graph. | Store partial sums in a key-value store like RocksDB and update them in place. |
6.2 High-Performance Parsing
For high-performance parsing in Rust, the `scraper` crate is a good choice for DOM extraction. For more advanced or lenient HTML parsing where the input might be malformed, `select.rs` or `html5ever` are excellent alternatives. To handle non-HTML content like PDFs, you can use bindings to native libraries such as `poppler` or `pdfium`.
6.3 Metadata Extraction
During the crawl, it is crucial to extract and store essential metadata. This avoids needing a second, expensive pass over the raw content later. Key metadata includes:
- Language
- Character set
- Canonical URL (`<link rel="canonical">`)
- The graph of outbound links
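A minimal sketch of this extraction step using `BeautifulSoup` might look like the following; the returned field names are illustrative, not a prescribed schema.

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_metadata(url, html):
    soup = BeautifulSoup(html, "html.parser")
    # Language from the <html lang="..."> attribute, if present
    lang = soup.html.get("lang") if soup.html else None
    # Character set from a <meta charset="..."> declaration, if any
    meta_charset = soup.find("meta", charset=True)
    charset = meta_charset["charset"] if meta_charset else None
    # Canonical URL from <link rel="canonical">, falling back to the fetched URL
    canonical_tag = soup.find("link", rel="canonical")
    canonical = canonical_tag.get("href") if canonical_tag else url
    # Outbound links, resolved to absolute URLs
    outlinks = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return {"lang": lang, "charset": charset, "canonical": canonical, "outlinks": outlinks}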
Chapter 7: De‑Duplication & Canonicalisation
The web is filled with duplicate and near-duplicate content. Identifying and filtering this content early in the pipeline is critical for saving significant computational resources and storage.
7.1 Near-Duplicate Detection
To detect near-duplicates, not just exact copies, use specialized hashing algorithms. SimHash or MinHash are designed for this purpose, creating a "fingerprint" of a document that can be compared to others to find similarities. Hashing raw content early in the pipeline allows you to skip processing documents that have already been seen.
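To illustrate the idea, here is a compact SimHash sketch in Python with a 64-bit fingerprint; the tokenisation is deliberately naive and the hash choice is arbitrary.

import hashlib

def simhash(text, bits=64):
    """Compute a SimHash fingerprint from whitespace tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        # Derive a stable 64-bit hash for the token
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if vector[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two near-duplicate documents should have a small Hamming distance
d1 = simhash("the quick brown fox jumps over the lazy dog")
d2 = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(d1, d2))  # typically a small number for near-duplicates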
7.2 Efficient URL Tracking
The URL Frontier, in conjunction with a duplicate detection module, must prevent the redundant crawling of identical or canonicalized URLs. An extremely efficient data structure for checking if a URL has been seen before is the Bloom filter. It provides a probabilistic check with a small memory footprint, making it ideal for tracking billions of URLs.
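A toy Bloom filter in Python is sketched below to show the mechanics; production crawlers would use a tuned, sharded implementation with sizing derived from the expected URL count.

import hashlib

class BloomFilter:
    """A toy Bloom filter for 'have we seen this URL?' checks."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)      # True
print("https://example.com/new" in seen)   # almost certainly False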
Chapter 8: Text Processing: Tokenisation, Stemming, and Language ID
Indexing organizes crawled data for fast retrieval. This involves breaking down text into searchable units through several standard text processing steps.
8.1 Core Text Processing Steps
- Tokenization: The process of splitting a stream of text into individual words or terms, called tokens.
- Stop Word Removal: Removing common words (e.g., "the", "a", "is") that provide little semantic value for search. Python's NLTK library provides standard stop word lists for many languages.
- Stemming: The process of reducing words to their root or base form (e.g., "running" becomes "run"). This helps the search engine match related terms. The Porter Stemmer is a classic algorithm for this task in English.
8.2 Implementation in Python
Here is a simple text processing pipeline in Python using the NLTK library.
from collections import defaultdict
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Ensure NLTK data is downloaded
# import nltk
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
def tokenize(text):
# Simple tokenization: lowercase and remove non-alphanumeric characters
words = re.findall(r'\b\w+\b', text.lower())
return words
def process_text(text):
words = tokenize(text)
words = [ps.stem(word) for word in words if word not in stop_words]
return words
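Applying the pipeline to a short sentence drops the stop words and stems the remaining tokens (the exact output depends on the installed NLTK stop word list):

print(process_text("The quick brown foxes are running over the lazy dogs"))
# ['quick', 'brown', 'fox', 'run', 'lazi', 'dog']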
8.3 Implementation in Rust
A similar tokenizer can be implemented in Rust for higher performance. This example demonstrates the basic structure.
use std::collections::HashSet;
use unicode_normalization::UnicodeNormalization;
pub struct Tokenizer {
stop_words: HashSet<String>,
min_token_length: usize,
max_token_length: usize,
}
impl Tokenizer {
pub fn new() -> Self {
let stop_words = [
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for",
"of", "with", "by", "is", "are", "was", "were", "be", "been", "have", "has"
].iter().map(|s| s.to_string()).collect();
Self {
stop_words,
min_token_length: 2,
max_token_length: 40,
}
}
pub fn tokenize(&self, text: &str) -> Vec<String> {
// Normalize Unicode characters to handle accents etc.
let normalized: String = text.nfc().collect();
let mut tokens = Vec::new();
let mut current_token = String::new();
for ch in normalized.chars() {
if ch.is_alphanumeric() {
current_token.push(ch.to_ascii_lowercase());
} else {
if !current_token.is_empty() {
self.process_token(&mut tokens, current_token);
current_token = String::new();
}
}
}
// Don't forget the last token
if !current_token.is_empty() {
self.process_token(&mut tokens, current_token);
}
tokens
}
fn process_token(&self, tokens: &mut Vec<String>, token: String) {
if token.len() >= self.min_token_length
&& token.len() <= self.max_token_length
&& !self.stop_words.contains(&token) {
tokens.push(self.stem(&token));
}
}
fn stem(&self, token: &str) -> String {
// A real implementation would use a crate like rust-stemmers.
// For simplicity, we'll just return the token as-is.
token.to_string()
}
}
Part III · The Indexing Engine
Chapter 9: Building the Inverted Index
The inverted index is the core data structure of any modern search engine. It enables rapid lookup of documents that contain specific terms, forming the foundation of lexical search.
9.1 The Role of the Inverted Index
An inverted index is a data structure that maps terms (words) to the documents that contain them. Instead of storing documents and searching through them one by one, the index allows the engine to directly retrieve a list of relevant documents for any given term, which is dramatically faster.
9.2 Technology Choices: Tantivy (Rust) & Lucene (Java)
Choosing the right technology for the index is a critical architectural decision. The following table summarizes proven choices for the different types of indexes a modern search engine requires.
Index | Tech choice | Why | Latency note |
---|---|---|---|
Inverted (lexical) | Apache Lucene 9 / Tantivy 0.21 | Battle-tested BM25 ranking, near-real-time (NRT) readers for fresh data. | Keep hot posting lists (the lists of documents for a term) in the OS page cache using mmap . |
Vector | FAISS IVF-PQ/HNSW on GPU | Achieves sub-20 ms Approximate Nearest Neighbour search on millions of documents. | Tune parameters like nprobe and efSearch for P99 latency; pre-warm GPU RAM with the index. |
Link graph | Sparse adjacency matrix in RocksDB or a dedicated Graph Store | Used for authority signals (like PageRank) and de-duplication. | Pull link data into RAM for top-k ranked documents only to avoid latency. |
9.3 Creating a Simple Inverted Index in Python
To understand the concept, we can build a simple in-memory inverted index using Python's `defaultdict`.
from collections import defaultdict
import re
def tokenize(text):
words = re.findall(r'\b\w+\b', text.lower())
return words
def build_index(documents):
index = defaultdict(list)
for doc_id, content in documents.items():
words = tokenize(content)
for word in set(words): # Use set to store each word only once per doc
index[word].append(doc_id)
return index
# Example usage
documents = {
1: "The quick brown fox jumps over the lazy dog",
2: "A fox fled from danger"
}
index = build_index(documents)
print(index)
9.4 Creating an Inverted Index in Rust with Tantivy
For a production system, a library like Tantivy is essential. Tantivy is a full-text search engine library in Rust, inspired by Apache Lucene, that provides a high-level API for creating, populating, and searching indexes efficiently.
use tantivy::schema::*;
use tantivy::{doc, Index, TantivyError};
fn tantivy_example() -> Result<(), TantivyError> {
let mut schema_builder = Schema::builder();
schema_builder.add_text_field("title", TEXT | STORED);
schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();
// Create the index in RAM for this example
let index = Index::create_in_ram(schema.clone());
let mut index_writer = index.writer(50_000_000)?; // 50MB heap size for writer
let title = schema.get_field("title").unwrap();
let body = schema.get_field("body").unwrap();
index_writer.add_document(doc!(
title => "Rust is awesome",
body => "Rust is a language empowering everyone to build reliable and efficient software."
))?;
index_writer.commit()?;
Ok(())
}
9.5 Index Optimization: Persistence and Compression
- Persistence: For production use, the index must be persistent; do not store it only in RAM. Use a persistent key-value store like `sled` or `rocksdb`, or leverage the file-based persistence that comes standard with libraries like Tantivy.
- Compression: To reduce disk space and improve performance by fitting more of the index into memory, compress the index. Techniques like delta encoding for document IDs and variable-byte encoding for integers are commonly used.
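To make the compression idea concrete, the following Python sketch applies delta encoding followed by variable-byte encoding to a sorted posting list; it illustrates the principle rather than the on-disk format used by Tantivy or Lucene.

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode the gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Variable-byte encoding: 7 bits per byte, high bit marks the final byte of a gap
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap == 0:
                out.append(byte | 0x80)  # terminator byte
                break
            out.append(byte)
    return bytes(out)

def decode_postings(data):
    doc_ids, gap, shift, previous = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:  # last byte of this gap
            previous += gap
            doc_ids.append(previous)
            gap, shift = 0, 0
    return doc_ids

postings = [3, 7, 21, 150, 100_000]
encoded = encode_postings(postings)
assert decode_postings(encoded) == postings
print(f"{len(postings) * 4} bytes as int32 vs {len(encoded)} bytes compressed")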
Chapter 10: Embeddings & Vector Representations
While inverted indexes are powerful for keyword matching, modern search requires understanding the semantic meaning behind queries. Vector embeddings are numerical representations of text that capture this meaning, enabling searches based on concepts rather than just keywords.
10.1 Introduction to Vector Embeddings
Vector embeddings are dense numerical vectors generated by deep learning models. These models are trained to map words, sentences, or entire documents to a high-dimensional space where semantically similar items are located close to one another.
10.2 Generation and Storage
- Generation: State-of-the-art models like Sentence-T5 or E5 can be used to generate high-quality vectors for documents. This is a computationally intensive process. Batching documents on a GPU is crucial to amortize the overhead of transferring data over the PCIe bus and maximize throughput.
- Vector Index: These embeddings are then stored in a specialized vector index that is optimized for performing Approximate Nearest-Neighbor (ANN) search, which is the subject of the next chapter.
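A minimal sketch of batched embedding generation with the `sentence-transformers` library is shown below; the model name, batch size, and device are assumptions, and E5-style models expect a "passage: " prefix on documents.

from sentence_transformers import SentenceTransformer

# Model choice is an assumption; any sentence-embedding model with the same API works.
# device="cuda" assumes a GPU is available; drop it to run on CPU.
model = SentenceTransformer("intfloat/e5-base-v2", device="cuda")

def embed_documents(texts, batch_size=64):
    # E5-style models expect a "passage: " prefix on documents ("query: " on queries).
    prefixed = [f"passage: {t}" for t in texts]
    # Batching amortises the CPU-to-GPU transfer overhead.
    return model.encode(prefixed, batch_size=batch_size,
                        normalize_embeddings=True, show_progress_bar=False)

vectors = embed_documents(["Rust is a systems programming language.",
                           "FAISS performs approximate nearest-neighbour search."])
print(vectors.shape)  # (2, 768) for a base-size model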
Chapter 11: Approximate Nearest‑Neighbour Search with FAISS & HNSW
Finding the exact nearest neighbors for a query vector in a high-dimensional space is computationally prohibitive at scale. Approximate Nearest-Neighbor (ANN) search algorithms trade a small amount of accuracy for a massive gain in search speed, which is essential for interactive applications.
11.1 The Need for Approximation
For a query to be answered in milliseconds, we cannot afford to compare the query vector against every single document vector in the index. ANN algorithms provide a way to find "good enough" neighbors quickly.
11.2 Core Technologies: FAISS and HNSW
- FAISS (Facebook AI Similarity Search) is a leading open-source library for efficient vector search. It offers a rich collection of index types that can be tuned for different trade-offs between speed, memory usage, and accuracy.
- HNSW (Hierarchical Navigable Small World) is a popular and powerful ANN algorithm that builds a multi-layered graph data structure for fast searching. It is available within FAISS and other vector search libraries and is known for its excellent performance.
11.3 Scalable Indexing Techniques
To build indexes that can handle billions of items, we can combine several techniques:
- IVF (Inverted File Index): This partitions the vector space into cells, and a search only needs to scan the cells nearest to the query vector.
- PQ (Product Quantization): This technique compresses the vectors themselves, significantly reducing their memory footprint.
Combining IVF and PQ (IVF-PQ) is a common strategy for building highly scalable and memory-efficient vector indexes. 🚧 An alternative to FAISS for production deployments is a dedicated vector database like Milvus or Weaviate.
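As a hedged illustration, the following sketch builds an IVF-PQ index with FAISS over random vectors; the parameters (`nlist`, `m`, `nprobe`) are placeholders to be tuned for your corpus and latency budget.

import numpy as np
import faiss

d = 768          # embedding dimension
nlist = 1024     # number of IVF cells (partitions of the vector space)
m = 64           # number of PQ sub-quantizers (d must be divisible by m)

# Random data stands in for real document embeddings
xb = np.random.random((100_000, d)).astype("float32")
xq = np.random.random((10, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code

index.train(xb)   # learn the IVF centroids and PQ codebooks
index.add(xb)

index.nprobe = 16                  # how many IVF cells to scan per query
distances, ids = index.search(xq, 10)
print(ids[0])     # positions of the 10 approximate nearest neighbours for the first query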
Chapter 12: Hybrid Retrieval Strategies
Hybrid search combines the strengths of traditional keyword-based (lexical) search and modern semantic search to improve both the breadth (recall) and quality (relevance) of search results.
12.1 Combining Lexical and Semantic Search
Lexical search is excellent at finding documents that contain the exact keywords from a query. Semantic search excels at finding conceptually related documents, even if they don't share any keywords. By combining them, we get the best of both worlds. Benchmarks from search platforms like Vespa have repeatedly validated that a hybrid approach improves both recall and latency.
12.2 A Practical Hybrid Search Strategy
A common and effective strategy is to execute two searches in parallel for each user query:
- A traditional keyword search using a BM25 scoring function on the inverted index.
- A single-vector ANN search on the vector index.
The system then takes the top ~1,000 documents from each result set, merges them into a single candidate list (removing duplicates), and passes this list to a final re-ranking stage.
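The sketch below shows one way to merge the two candidate lists in Python. It uses reciprocal rank fusion (RRF) as the merge heuristic, which is a common choice rather than the only option; the constant k=60 is the conventional default.

def hybrid_merge(bm25_results, vector_results, k=60, top_n=1000):
    """Merge two ranked lists of doc IDs with reciprocal rank fusion (RRF)."""
    scores = {}
    for results in (bm25_results, vector_results):
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank); duplicates accumulate both contributions.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]

bm25_hits = ["doc3", "doc1", "doc7"]      # from the inverted index (BM25 order)
vector_hits = ["doc1", "doc9", "doc3"]    # from the ANN index (similarity order)
print(hybrid_merge(bm25_hits, vector_hits))
# doc1 and doc3 rise to the top because both retrievers agree on them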
Chapter 13: Link Analysis & PageRank
PageRank is a foundational algorithm in web search that assigns an importance score to web pages based on the structure of the web's link graph. It operates on the principle that a link from page A to page B is a vote of confidence from A to B. It remains a key signal for determining the authority of a document.
13.1 The PageRank Algorithm
PageRank is an iterative algorithm that propagates "rank" through the link graph. The score of a page is determined by the number and quality of pages that link to it.
13.2 Python Implementation of PageRank
The following Python code provides a simple implementation of the PageRank algorithm.
from collections import defaultdict
def pagerank(links, iterations=20, damping=0.85):
# 'links' is a dict where key is a page and value is a list of pages it links to
pages = set(links.keys())
for linked_pages in links.values():
pages.update(linked_pages)
N = len(pages)
if N == 0:
return {}
pr = {page: 1/N for page in pages}
for _ in range(iterations):
new_pr = {page: (1 - damping) / N for page in pages}
        for page in pages:
            outgoing_links = links.get(page, [])
            # Handle dangling nodes: pages with no outgoing links, including
            # pages that only appear as link targets
if not outgoing_links:
# Distribute its PageRank equally among all pages
for p_target in pages:
new_pr[p_target] += damping * pr[page] / N
else:
for linked_page in outgoing_links:
if linked_page in new_pr:
new_pr[linked_page] += damping * pr[page] / len(outgoing_links)
pr = new_pr
return pr
# Example usage
links = {
'page1': ['page2', 'page3'],
'page2': ['page1'],
'page3': ['page1']
}
pr_scores = pagerank(links)
print(f"PageRank scores: {pr_scores}")
Chapter 14: Learning‑to‑Rank and Neural Re‑Ranking
Learning to Rank (LTR) reframes the ranking problem as a supervised machine learning task. Instead of relying on a single, handcrafted formula like BM25, LTR uses a model trained on human-judged data to learn the optimal way to combine hundreds of different relevance signals.
14.1 Introduction to Learning-to-Rank (LTR)
LTR is typically used as a final re-ranking stage. After an initial candidate set of documents is retrieved (e.g., via hybrid search), the LTR model scores each of these candidates to produce the final, ordered list presented to the user. This re-ranking step is computationally intensive and should only be applied to a small number of top results (e.g., N ≤ 128).
14.2 Model Choices and Caching
- Model: For the LTR model, gradient-boosted decision trees like LightGBM are a powerful and efficient choice. Alternatively, for higher accuracy, a transformer-based cross-encoder can be used. This re-ranking step is best performed on a GPU.
- Caching: To reduce latency for common searches, the logits (raw output scores) of the LTR model can be cached for popular queries.
14.3 Feature Engineering for LTR
The power of an LTR model comes from the richness of the features it uses to evaluate a query-document pair. These features fall into several categories:
- Static Features: Query-independent signals about the document's quality, such as PageRank, URL length, and document freshness.
- Dynamic Features: Query-dependent signals that measure the textual match, such as TF-IDF or BM25 scores.
- Semantic Features: Features that capture conceptual relevance, like the cosine similarity between the query embedding and the document embedding.
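To make this concrete, here is a hedged sketch that assembles such features and trains a LambdaMART-style ranker with LightGBM; the feature names and toy training data are illustrative placeholders.

import numpy as np
import lightgbm as lgb

def build_features(query_doc_pair):
    """Assemble one feature vector for a (query, document) pair."""
    return [
        query_doc_pair["bm25_score"],         # dynamic: lexical match strength
        query_doc_pair["pagerank"],           # static: link-graph authority
        query_doc_pair["doc_age_days"],       # static: freshness
        query_doc_pair["url_length"],         # static: URL quality proxy
        query_doc_pair["cosine_similarity"],  # semantic: query/doc embedding similarity
    ]

# Toy training set: 2 queries with 3 judged documents each.
X = np.random.random((6, 5))
y = np.array([2, 1, 0, 2, 0, 1])   # human relevance labels (0 = bad, 2 = perfect)
group = [3, 3]                     # documents per query, in order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# At query time: score the candidate set and sort descending.
candidates = np.random.random((4, 5))
order = np.argsort(-ranker.predict(candidates))
print(order)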
Chapter 15: Incremental & Real‑Time Index Updates
To keep the index fresh and reflect the ever-changing web, it is inefficient and impractical to rebuild the entire index from scratch constantly. The system must support incremental and near real-time updates.
15.1 The Challenge of Freshness
Users expect search results to be up-to-date, especially for news and trending topics. A system that only updates its index daily or weekly will feel stale.
15.2 Real-Time Update Strategies
- Percolator-style Updates: A proven pattern, pioneered by Google, involves streaming small batches of new or updated documents through a transactional update pipeline. This allows the main index to stay very fresh (e.g., less than one hour stale) while avoiding the cost and complexity of full re-builds.
- Built-in Mechanisms: Many open-source search engines provide built-in mechanisms for real-time updates. OpenSearch achieves this via its distributed architecture, while Meilisearch uses a dedicated update queue to process changes asynchronously.
Part IV · Serving & Operations
Chapter 16: Query Serving Architecture & gRPC API Design
This chapter covers the system that receives user queries, processes them through the ranking pipeline, and returns results.
16.1 The Query Engine
The query engine is the component that interprets user queries and executes them against the index. It must support a variety of features to be useful:
- Scoring Functions: Standard algorithms like BM25 for lexical relevance.
- Logic and Filtering: Boolean logic (AND, OR, NOT) and the ability to filter results by metadata such as date, domain, or language.
- Fuzzy Matching: Tolerance for typos and misspellings.
16.2 API Design and Protocols
- API Layer (Rust): The API serves as the entry point for all queries. For high performance, it should be built using a modern Rust web framework like `axum`, `actix-web`, or `warp`.
use axum::{routing::post, Router};

// Assume search_handler is an async function that takes a query and returns results
// async fn search_handler(...) -> ... {}
// let app = Router::new().route("/search", post(search_handler));
- Protocol: For internal, service-to-service communication, use a high-performance protocol like gRPC or HTTP/2 with Protobuf-encoded responses. This is significantly more efficient than traditional JSON over HTTP/1.1. A typical search response would include the list of documents, their scores, and potentially an explanation of the scoring for debugging.
16.3 Security and Advanced Features
- Security: If you expose a public Search Engine Results Page (SERP) or a developer API, you must implement rate limiting and authentication to prevent abuse. The Brave Search API is a good model to study for designing a public-facing API.
- Features: Implement popular user-facing features like result clustering and "!bang" redirect syntax (used by Brave and DuckDuckGo for searching other sites directly).
Chapter 17: SERP Front‑End with React & Tailwind
This section covers building the user-facing Search Engine Results Page (SERP), where users interact with the search engine.
17.1 Frontend Technology Choices
- Standard Frontend: For the user interface, a modern JavaScript framework like `React` combined with `TypeScript` is a robust and popular choice.
- Full-Stack Rust: For developers looking for a full-stack Rust solution, consider frameworks that support Server-Side Rendering (SSR) such as `Leptos` or `Yew`.
17.2 Conceptual UI with Flask
A simple web UI can be built with any backend framework. Here is a conceptual example using Python's Flask to demonstrate the basic components of a search page.
from flask import Flask, request, render_template_string
app = Flask(__name__)
html_template = '''
<!DOCTYPE html>
<html>
<head><title>Search Engine</title></head>
<body>
<h1>My Search Engine</h1>
<form method="GET">
<input type="text" name="query" placeholder="Enter your query" value="{{ query }}">
<input type="submit" value="Search">
</form>
{% if results %}
<h2>Results</h2>
<ul>
{% for doc_id, score in results %}
<li>Document {{ doc_id }} (Score: {{ "%.2f"|format(score) }})</li>
{% endfor %}
</ul>
{% endif %}
</body>
</html>
'''
@app.route('/')
def search_page():
query = request.args.get('query', '')
results = []
if query:
# Assumes a search function is defined that takes the query
# and returns a list of (doc_id, score) tuples.
# results = rank_documents(tfidf, query, documents)
pass
return render_template_string(html_template, query=query, results=results)
if __name__ == '__main__':
# This block is for demonstration purposes.
# app.run(debug=True)
pass
17.3 User Interface Best Practices
A good SERP should have a prominent search bar, display results clearly with titles, URLs, and snippets, and include features like pagination and filters to help users refine their results.
Chapter 18: Distributed Sharding & Fault Tolerance
For a web-scale document collection, a compressed index will still be too large to fit on a single machine. The system must be distributed across a cluster of nodes to be scalable and resilient.
18.1 The Need for Distribution
Distributing the index and query processing load is essential for handling large volumes of data and traffic while maintaining low latency.
18.2 Sharding Strategies
Sharding is the process of splitting the index into smaller, more manageable pieces called shards. There are two primary strategies:
- Document Partitioning: The collection of documents is divided into subsets, and each shard is a self-contained index for its assigned subset. This is the most common approach.
- Term Partitioning: The dictionary of all terms is divided, and each shard holds the complete posting lists (lists of documents) for its assigned subset of terms.
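A minimal Python sketch of the more common document-partitioning approach, with scatter-gather querying, is shown below; the hash routing and the toy per-shard scoring are illustrative assumptions.

import hashlib

NUM_SHARDS = 4

def shard_for(doc_id):
    """Route a document to a shard by hashing its ID (document partitioning)."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

# Each shard holds a self-contained index for its subset of documents.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_document(doc_id, content):
    shards[shard_for(doc_id)][doc_id] = content

def search_all(query, top_k=10):
    """Scatter the query to every shard, then gather and merge the partial results."""
    partial = []
    for shard in shards:
        # Stand-in scoring: count query-term occurrences; a real shard runs BM25 locally.
        for doc_id, content in shard.items():
            score = sum(content.lower().count(t) for t in query.lower().split())
            if score > 0:
                partial.append((score, doc_id))
    return sorted(partial, reverse=True)[:top_k]

index_document("doc-1", "rust search engine internals")
index_document("doc-2", "python crawler politeness")
print(search_all("search engine"))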
18.3 Replication for High Availability
To ensure high availability and fault tolerance, each shard is replicated one or more times on different nodes in the cluster. If a node containing a primary shard fails, a replica can be promoted to take its place, ensuring the search service remains available.
Chapter 19: Low‑Latency Optimisations
Every millisecond counts in search. This chapter consolidates various techniques for optimizing latency across the system.
19.1 Caching and Index Efficiency
- Caching: Use an in-memory cache like Redis to store the results of frequent queries, bypassing most of the query processing pipeline for popular searches.
- Efficient Indexing: Use compressed data structures within the index to reduce its size, minimize disk I/O, and allow more of the index to fit into the OS page cache.
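A small sketch of a Redis-backed query cache in Python follows; the key scheme and TTL are assumptions to adapt to your traffic, and it assumes a Redis server is reachable locally.

import json
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # how long a cached result set stays valid

def cache_key(query):
    return "serp:" + hashlib.sha1(query.strip().lower().encode()).hexdigest()

def cached_search(query, search_fn):
    """Return cached results for popular queries, falling back to the full pipeline."""
    key = cache_key(query)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search_fn(query)                     # full query-processing pipeline
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(results))
    return results

# Example usage with a stand-in search function
results = cached_search("rust web framework", lambda q: [{"doc_id": 1, "score": 4.2}])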
19.2 Load Balancing and Memory Management
- Load Balancing: Distribute incoming queries evenly across multiple replica servers to prevent any single node from becoming a bottleneck.
- Memory Management: In languages with manual memory management or custom allocators like Rust, use object pools for frequently allocated objects to reduce allocation overhead. For example, `bumpalo` can be used for specific workloads where memory can be allocated and cleared in large, efficient blocks.
Chapter 20: Observability: Metrics, Tracing, and Alerting
To operate a reliable production system, you need deep visibility into its performance and health. This is known as observability.
20.1 Metrics and Tracing
- Metrics: Track key performance indicators (KPIs) such as Queries Per Second (QPS), P50/P95 latency, CPU/GPU utilization, and crawl queue depth. Use a time-series database like `Prometheus` for collecting metrics and `Grafana` for creating dashboards.
- Tracing: Use a distributed tracing system like OpenTelemetry to trace requests as they flow through the entire system (crawler → indexer → ranker → API). The `tracing` crate is the de facto standard for instrumenting Rust applications.
20.2 Alerting, Chaos Testing, and Logging
- Alerting: Configure alerts to notify operators of critical issues, such as a high ratio of server errors (5xx) or sudden, unexpected spikes in query volume.
- Chaos testing: Proactively test the system's resilience by periodically and automatically killing nodes or injecting network latency. This ensures that shard replicas, caches, and failover mechanisms work as expected without requiring human intervention.
- Logging: Store structured logs in an analytical database like `PostgreSQL` or `ClickHouse`. This allows for powerful analytics and debugging of system behavior.
Chapter 21: Security, Privacy, and Abuse Mitigation
A search engine handles user data and interacts with the entire web, making security and privacy paramount.
21.1 Data Handling and Compliance
- Adhere strictly to legal requirements such as GDPR for user data and DMCA for takedown notices.
- Always enforce `robots.txt` and `noindex` directives found on web pages and in meta tags.
21.2 User Data Anonymization
Protect user privacy by anonymizing user data. For example, strip personally identifiable information like IP addresses from query logs after a short retention period (e.g., 24 hours).
Chapter 22: Cost Engineering & Cloud Deployment Patterns
Running a web-scale service can be expensive. Cost engineering involves making architectural choices that optimize for performance per dollar.
22.1 Managing Storage and Compute Costs
- Cache hierarchy: Implement a multi-tiered cache (e.g., NVMe → RAM → GPU RAM) to reduce expensive egress and object storage (S3) costs. Exa’s Alluxio cache is an example that demonstrates multi-TB/s aggregate throughput.
- Quantised vectors: Use techniques like product quantization (PQ) and 8-bit integers (int8) to compress vector embeddings. This can slash GPU memory demand by ~4x with a recall loss of less than 1%.
22.2 Leveraging Cloud Infrastructure
- Use Spot/pre-emptible instances for non-critical, stateless workloads like crawler workers. This can significantly reduce compute costs.
- Keep stateful, latency-sensitive services like rankers and index shards on more reliable on-demand or reserved hardware.
Chapter 23: Continuous Integration & Delivery
A structured development and deployment process is essential for building and maintaining a complex distributed system.
23.1 Development and Deployment Workflow
Local Development:
- Dockerize each component (crawler, indexer, API) to create consistent, reproducible development environments.
- Use `docker-compose` to orchestrate the services and simulate a distributed setup locally.
Production Deployment:
- Orchestrate containers at scale using Kubernetes.
- Use `Redis` for distributed job queues and caching.
- Use a robust database like `PostgreSQL` or `ClickHouse` for logging and analytics.
23.2 Sample Project Plan
This table provides a high-level project plan to structure the development process.
Week | Task |
---|---|
1–2 | Build async crawler |
3–4 | Parser & Content Extractor |
5–6 | Indexer using Tantivy or custom implementation |
7–8 | Query engine + basic ranking |
9–10 | API & UI |
11+ | Optimize, scale, implement ML ranker |
Part V · Advanced Topics & Case Studies
Chapter 24: Advanced Features: Snippets, Entities, and QA
Once the core search functionality is in place, you can add advanced features to enhance the user experience.
24.1 Snippet Generation
Snippets are the short descriptions shown below the title and URL in search results. An efficient way to generate them is to pre-compute sentence embeddings for all sentences in a document. At query time, you can perform a nearest-sentence search inside the retrieved document vectors to find the most relevant sentences to display as a snippet. This process should be highly optimized and can be done in ≤ 8 ms on a GPU.
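As an illustrative sketch, assuming sentence embeddings were precomputed at index time, nearest-sentence snippet selection might look like this:

import numpy as np

def best_snippet(query_vec, sentence_vecs, sentences, max_sentences=2):
    """Pick the sentences whose precomputed embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    similarities = s @ q                       # cosine similarity per sentence
    top = np.argsort(-similarities)[:max_sentences]
    # Preserve document order so the snippet reads naturally
    return " ".join(sentences[i] for i in sorted(top))

# Example with random stand-in embeddings
sentences = ["Rust guarantees memory safety.",
             "The weather was nice.",
             "It compiles to fast native code."]
vecs = np.random.random((3, 384)).astype("float32")
query = np.random.random(384).astype("float32")
print(best_snippet(query, vecs, sentences))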
24.2 Indexing Alternative Content Sources
Extend the crawler and parsers to index content beyond standard web pages.
- Telegram: Use the Telegram Bot API or scraping libraries to ingest content from public channels.
- Reddit: Use the Pushshift dataset or the official Reddit API to index discussions.
- PDFs: Use libraries like `pdf_extract` in Rust or `PyMuPDF` in Python to extract text from PDF documents, followed by text cleanup and processing.
Chapter 25: Scaling to Billions of Documents
The principles outlined in previous chapters—distributed crawling, sharding, replication, and efficient data structures—are the foundation for scaling to billions of documents. The key is horizontal scalability, where adding more machines to the cluster results in a proportional increase in capacity for crawling, indexing, and serving. Brave Search's public figure of indexing 12-20 billion unique URLs serves as a good baseline for a web-scale index.
Chapter 26: Personalisation & LLM‑Enhanced Ranking
To further improve relevance, the search experience can be personalized. This can involve re-ranking results based on a user's past search history or location. Additionally, Large Language Models (LLMs) can be integrated into the ranking pipeline, either as powerful re-rankers or to generate direct answers to user queries.
Chapter 27: Case Study: Operating Cortex Search in Production
This final chapter provides a high-level roadmap for assembling the complete Cortex Search system and offers some closing thoughts.
27.1 A High-Level Implementation Roadmap
- Spin up a StormCrawler cluster and begin seeding it with an initial set of URLs.
- Stand up Lucene or Tantivy shards to handle lexical search. Create a data pipeline that pipes the output of the crawler through a parser that writes directly to the shards’ near-real-time (NRT) writer.
- On a dedicated GPU cluster, batch-generate embeddings for all new content, for example, on a nightly basis. Build FAISS HNSW indexes from these embeddings and ship the resulting index files to the serving nodes.
- Deploy a serving layer using a framework like Vespa.ai (or your own custom microservices) so that a single `/search` API call fans out to both the lexical and vector indexes. This layer then executes the ML-based re-ranking on the combined candidate set and returns a final JSON response.
- Layer on analytics, A/B testing capabilities, and plan for the gradual roll-out of new ranking models and features.
27.2 Final Words
Follow this roadmap and you’ll have a vertically-integrated, independent search index capable of delivering sub-50 ms responses at web-scale—a capability that only a handful of vendors offer today.
Happy indexing.