Haji Rufai

Posted on May 26 • Originally published at github.com

Building a Vector Search Engine from Scratch in Python (Flat, IVF, HNSW)

#python #machinelearning #database #algorithms

Every AI application — from ChatGPT plugins to recommendation engines — relies on vector search. But how does it actually work under the hood?

I built VectorLite, a complete vector search engine from scratch in Python, implementing three industry-standard index algorithms: Flat, IVF, and HNSW. No FAISS. No Annoy. Just NumPy and clean code.

In this article, I'll walk through the core algorithms and the engineering decisions that make vector search fast.

Why Build One From Scratch?

Tools like Pinecone, Chroma, and FAISS are fantastic. But using them is like driving a car without knowing how the engine works. Building from scratch teaches you:

Why HNSW is fast (and when it isn't)
The recall-speed tradeoff at a fundamental level
How metadata filtering interacts with ANN search
What parameters actually do (M, ef, nprobe, nlist)

Plus, it makes for a great portfolio project. Let's dive in.

Architecture Overview

vectorlite/
├── distance.py        # Vectorized distance metrics (NumPy)
├── indexes/
│   ├── base.py        # Abstract index with metadata filtering
│   ├── flat.py        # Brute-force exact search
│   ├── ivf.py         # IVF with k-means clustering
│   └── hnsw.py        # HNSW graph-based ANN
├── collection.py      # Collection management
├── storage.py         # Disk persistence (NumPy + JSON)
├── engine.py          # Orchestration layer
├── api.py             # FastAPI REST API
└── cli.py             # Rich CLI

Distance Metrics: The Foundation

Every vector search starts with measuring how "close" two vectors are. VectorLite supports three metrics:

import numpy as np

def cosine_similarity(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    query_norm = np.linalg.norm(query)
    if query_norm == 0:
        return np.zeros(vectors.shape[0])

    vector_norms = np.linalg.norm(vectors, axis=1)
    vector_norms = np.where(vector_norms == 0, 1.0, vector_norms)

    dot_products = vectors @ query
    return dot_products / (query_norm * vector_norms)

def euclidean_distance(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    diff = vectors - query[np.newaxis, :]
    return np.sqrt(np.sum(diff ** 2, axis=1))

Key insight: All operations are vectorized with NumPy — we compute distances to ALL candidates in a single matrix operation. No Python loops.

Index #1: Flat (Brute Force)

The simplest index. Compare the query against every stored vector:

class FlatIndex(BaseIndex):
    def search(self, query, top_k=10, filters=None):
        allowed_ids = self._filter_ids(filters)
        ids = [vid for vid in self._vectors if not allowed_ids or vid in allowed_ids]

        vectors = np.array([self._vectors[vid] for vid in ids])
        distances = get_distance_fn(self.metric)(query, vectors)
        scores = distance_to_score(distances, self.metric)

        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(ids[i], float(scores[i])) for i in top_indices]

Complexity: O(n) per query. Guarantees exact results. Use for datasets under ~10K vectors.

Index #2: IVF (Inverted File Index)

IVF partitions the vector space using k-means, then only searches relevant clusters:

Training Phase

def train(self, max_iterations=50):
    all_vectors = np.array(list(self._vectors.values()))

    # K-means++ initialization (better than random)
    centroids = self._kmeans_plus_plus_init(all_vectors, self.nlist)

    for _ in range(max_iterations):
        assignments = self._assign_to_centroids(all_vectors, centroids)
        new_centroids = np.zeros_like(centroids)

        for c in range(self.nlist):
            mask = assignments == c
            if np.any(mask):
                new_centroids[c] = all_vectors[mask].mean(axis=0)
            else:
                new_centroids[c] = centroids[c]

        if np.linalg.norm(new_centroids - centroids) < 1e-6:
            break
        centroids = new_centroids

Search Phase

def search(self, query, top_k=10, filters=None):
    # Find the nprobe nearest centroids
    centroid_dists = euclidean_distance(query, self._centroids)
    probe_centroids = np.argsort(centroid_dists)[:self.nprobe]

    # Only search vectors in those partitions
    candidates = []
    for c_idx in probe_centroids:
        candidates.extend(self._partitions[int(c_idx)])

    # Exact search within candidates
    vectors = np.array([self._vectors[vid] for vid in candidates])
    distances = get_distance_fn(self.metric)(query, vectors)
    ...

Key parameters:

nlist: Number of clusters. Rule of thumb: sqrt(n)
nprobe: Clusters to search. Higher = better recall, slower

Complexity: O(n/nlist × nprobe) — much faster than flat for large datasets.

Index #3: HNSW (The Star of the Show)

HNSW is the state-of-the-art algorithm used by most vector databases. It builds a multi-layer proximity graph:

Layer 2:    A ─────────────────── D          (few nodes, long-range links)
            │                     │
Layer 1:    A ──── B ──── C ──── D ──── E    (more nodes, medium links)
            │      │      │      │      │
Layer 0:    A ── B ── C ── D ── E ── F ── G  (all nodes, short links)

How insertion works:

def _on_add(self, vector_id, vector):
    # 1. Assign random level (geometric distribution)
    node_level = int(-math.log(uniform()) * (1/ln(M)))

    # 2. Navigate from top layer down to node_level+1
    ep = [self._entry_point]
    for layer in range(self._max_level, node_level, -1):
        results = self._search_layer(vector, ep, ef=1, layer=layer)
        ep = [results[0][1]]  # Greedy: take closest

    # 3. Insert at each layer from node_level down to 0
    for layer in range(min(node_level, self._max_level), -1, -1):
        results = self._search_layer(vector, ep, ef=self.ef_construction, layer=layer)
        neighbors = self._select_neighbors(vector_id, results, m=self.m)

        # Add bidirectional edges
        for neighbor in neighbors:
            self._graph[layer][vector_id].add(neighbor)
            self._graph[layer][neighbor].add(vector_id)

How search works:

def search(self, query, top_k=10):
    ep = [self._entry_point]

    # Phase 1: Navigate top layers with ef=1 (greedy)
    for layer in range(self._max_level, 0, -1):
        results = self._search_layer(query, ep, ef=1, layer=layer)
        ep = [results[0][1]]

    # Phase 2: Thorough search at layer 0
    results = self._search_layer(query, ep, ef=max(self.ef_search, top_k), layer=0)
    return results[:top_k]

The magic: Upper layers act like "highways" for fast navigation, while layer 0 has the precise local connections. This gives O(log n) search time.

Key parameters:

M: Max connections per node (16 is a good default)
ef_construction: How carefully we build the graph (200 is solid)
ef_search: How thoroughly we search (50-100 for most use cases)

Metadata Filtering

Real applications need to filter results. VectorLite applies filters at the base index level:

def _filter_ids(self, filters):
    if not filters:
        return None
    matching = set()
    for vid, meta in self._metadata.items():
        if all(meta.get(k) == v for k, v in filters.items()):
            matching.add(vid)
    return matching

For HNSW, filtering happens post-search (search with larger ef, then filter). For Flat and IVF, it's pre-search (only consider matching vectors).

REST API

The full-featured FastAPI server supports all operations:

# Create collection
curl -X POST http://localhost:8000/collections \
  -H "Content-Type: application/json" \
  -d '{"name": "docs", "dimension": 384, "index_type": "hnsw"}'

# Insert vectors
curl -X POST http://localhost:8000/collections/docs/vectors/batch \
  -d '{"vectors": [{"id": "v1", "vector": [...], "metadata": {"topic": "ml"}}]}'

# Search
curl -X POST http://localhost:8000/collections/docs/search \
  -d '{"vector": [...], "top_k": 5, "filters": {"topic": "ml"}}'

Interactive docs available at /docs (Swagger UI).

Performance

On my test setup (1,000 128-dim vectors, 100 queries):

Index	Avg Latency	QPS
Flat	0.8ms	1,250
IVF (nprobe=4)	0.3ms	3,333
HNSW (ef=50)	0.5ms	2,000

For a pure Python implementation with no C extensions, these numbers are quite respectable. The real advantage of IVF and HNSW shows at scale (100K+ vectors).

Testing

93 tests cover every component:

$ pytest -v
tests/test_distance.py     ✓ 15 tests
tests/test_flat.py         ✓ 13 tests
tests/test_ivf.py          ✓ 9 tests
tests/test_hnsw.py         ✓ 12 tests
tests/test_collection.py   ✓ 10 tests
tests/test_engine.py       ✓ 14 tests
tests/test_api.py          ✓ 19 tests + 1 train

93 passed in 1.61s

What I Learned

NumPy vectorization is everything. A naive Python loop for cosine similarity is 100x slower than the vectorized version.
HNSW's brilliance is in the multi-layer structure. Without upper layers, it degrades to a regular graph search. The skip-list-like hierarchy is what gives O(log n).
K-means++ matters. Random initialization for IVF leads to unbalanced clusters. K-means++ (selecting initial centroids proportional to distance) gives much better partitions.
The recall-speed tradeoff is continuous. There's no cliff — you can smoothly trade search quality for speed by adjusting nprobe (IVF) or ef_search (HNSW).
Persistence is tricky. HNSW's graph state (sets of neighbor IDs per layer) needs careful serialization. NumPy arrays are easy; graph topology less so.

Try It

git clone https://github.com/hajirufai/vectorlite.git
cd vectorlite
pip install -r requirements.txt
pytest -v  # Run 93 tests

Check out the full source on GitHub. Star it if you found this useful! ⭐

Building this helped me deeply understand the algorithms that power every vector database in production today. If you're working with embeddings, I highly recommend implementing at least a flat index from scratch — it changes how you think about similarity search.

python #machinelearning #database #algorithms

DEV Community