Every AI application — from ChatGPT plugins to recommendation engines — relies on vector search. But how does it actually work under the hood?
I built VectorLite, a complete vector search engine from scratch in Python, implementing three industry-standard index algorithms: Flat, IVF, and HNSW. No FAISS. No Annoy. Just NumPy and clean code.
In this article, I'll walk through the core algorithms and the engineering decisions that make vector search fast.
Why Build One From Scratch?
Tools like Pinecone, Chroma, and FAISS are fantastic. But using them is like driving a car without knowing how the engine works. Building from scratch teaches you:
- Why HNSW is fast (and when it isn't)
- The recall-speed tradeoff at a fundamental level
- How metadata filtering interacts with ANN search
- What parameters actually do (M, ef, nprobe, nlist)
Plus, it makes for a great portfolio project. Let's dive in.
Architecture Overview
vectorlite/
├── distance.py # Vectorized distance metrics (NumPy)
├── indexes/
│ ├── base.py # Abstract index with metadata filtering
│ ├── flat.py # Brute-force exact search
│ ├── ivf.py # IVF with k-means clustering
│ └── hnsw.py # HNSW graph-based ANN
├── collection.py # Collection management
├── storage.py # Disk persistence (NumPy + JSON)
├── engine.py # Orchestration layer
├── api.py # FastAPI REST API
└── cli.py # Rich CLI
Distance Metrics: The Foundation
Every vector search starts with measuring how "close" two vectors are. VectorLite supports three metrics:
import numpy as np
def cosine_similarity(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
query_norm = np.linalg.norm(query)
if query_norm == 0:
return np.zeros(vectors.shape[0])
vector_norms = np.linalg.norm(vectors, axis=1)
vector_norms = np.where(vector_norms == 0, 1.0, vector_norms)
dot_products = vectors @ query
return dot_products / (query_norm * vector_norms)
def euclidean_distance(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
diff = vectors - query[np.newaxis, :]
return np.sqrt(np.sum(diff ** 2, axis=1))
Key insight: All operations are vectorized with NumPy — we compute distances to ALL candidates in a single matrix operation. No Python loops.
Index #1: Flat (Brute Force)
The simplest index. Compare the query against every stored vector:
class FlatIndex(BaseIndex):
def search(self, query, top_k=10, filters=None):
allowed_ids = self._filter_ids(filters)
ids = [vid for vid in self._vectors if not allowed_ids or vid in allowed_ids]
vectors = np.array([self._vectors[vid] for vid in ids])
distances = get_distance_fn(self.metric)(query, vectors)
scores = distance_to_score(distances, self.metric)
top_indices = np.argsort(scores)[::-1][:top_k]
return [(ids[i], float(scores[i])) for i in top_indices]
Complexity: O(n) per query. Guarantees exact results. Use for datasets under ~10K vectors.
Index #2: IVF (Inverted File Index)
IVF partitions the vector space using k-means, then only searches relevant clusters:
Training Phase
def train(self, max_iterations=50):
all_vectors = np.array(list(self._vectors.values()))
# K-means++ initialization (better than random)
centroids = self._kmeans_plus_plus_init(all_vectors, self.nlist)
for _ in range(max_iterations):
assignments = self._assign_to_centroids(all_vectors, centroids)
new_centroids = np.zeros_like(centroids)
for c in range(self.nlist):
mask = assignments == c
if np.any(mask):
new_centroids[c] = all_vectors[mask].mean(axis=0)
else:
new_centroids[c] = centroids[c]
if np.linalg.norm(new_centroids - centroids) < 1e-6:
break
centroids = new_centroids
Search Phase
def search(self, query, top_k=10, filters=None):
# Find the nprobe nearest centroids
centroid_dists = euclidean_distance(query, self._centroids)
probe_centroids = np.argsort(centroid_dists)[:self.nprobe]
# Only search vectors in those partitions
candidates = []
for c_idx in probe_centroids:
candidates.extend(self._partitions[int(c_idx)])
# Exact search within candidates
vectors = np.array([self._vectors[vid] for vid in candidates])
distances = get_distance_fn(self.metric)(query, vectors)
...
Key parameters:
-
nlist: Number of clusters. Rule of thumb: sqrt(n) -
nprobe: Clusters to search. Higher = better recall, slower
Complexity: O(n/nlist × nprobe) — much faster than flat for large datasets.
Index #3: HNSW (The Star of the Show)
HNSW is the state-of-the-art algorithm used by most vector databases. It builds a multi-layer proximity graph:
Layer 2: A ─────────────────── D (few nodes, long-range links)
│ │
Layer 1: A ──── B ──── C ──── D ──── E (more nodes, medium links)
│ │ │ │ │
Layer 0: A ── B ── C ── D ── E ── F ── G (all nodes, short links)
How insertion works:
def _on_add(self, vector_id, vector):
# 1. Assign random level (geometric distribution)
node_level = int(-math.log(uniform()) * (1/ln(M)))
# 2. Navigate from top layer down to node_level+1
ep = [self._entry_point]
for layer in range(self._max_level, node_level, -1):
results = self._search_layer(vector, ep, ef=1, layer=layer)
ep = [results[0][1]] # Greedy: take closest
# 3. Insert at each layer from node_level down to 0
for layer in range(min(node_level, self._max_level), -1, -1):
results = self._search_layer(vector, ep, ef=self.ef_construction, layer=layer)
neighbors = self._select_neighbors(vector_id, results, m=self.m)
# Add bidirectional edges
for neighbor in neighbors:
self._graph[layer][vector_id].add(neighbor)
self._graph[layer][neighbor].add(vector_id)
How search works:
def search(self, query, top_k=10):
ep = [self._entry_point]
# Phase 1: Navigate top layers with ef=1 (greedy)
for layer in range(self._max_level, 0, -1):
results = self._search_layer(query, ep, ef=1, layer=layer)
ep = [results[0][1]]
# Phase 2: Thorough search at layer 0
results = self._search_layer(query, ep, ef=max(self.ef_search, top_k), layer=0)
return results[:top_k]
The magic: Upper layers act like "highways" for fast navigation, while layer 0 has the precise local connections. This gives O(log n) search time.
Key parameters:
-
M: Max connections per node (16 is a good default) -
ef_construction: How carefully we build the graph (200 is solid) -
ef_search: How thoroughly we search (50-100 for most use cases)
Metadata Filtering
Real applications need to filter results. VectorLite applies filters at the base index level:
def _filter_ids(self, filters):
if not filters:
return None
matching = set()
for vid, meta in self._metadata.items():
if all(meta.get(k) == v for k, v in filters.items()):
matching.add(vid)
return matching
For HNSW, filtering happens post-search (search with larger ef, then filter). For Flat and IVF, it's pre-search (only consider matching vectors).
REST API
The full-featured FastAPI server supports all operations:
# Create collection
curl -X POST http://localhost:8000/collections \
-H "Content-Type: application/json" \
-d '{"name": "docs", "dimension": 384, "index_type": "hnsw"}'
# Insert vectors
curl -X POST http://localhost:8000/collections/docs/vectors/batch \
-d '{"vectors": [{"id": "v1", "vector": [...], "metadata": {"topic": "ml"}}]}'
# Search
curl -X POST http://localhost:8000/collections/docs/search \
-d '{"vector": [...], "top_k": 5, "filters": {"topic": "ml"}}'
Interactive docs available at /docs (Swagger UI).
Performance
On my test setup (1,000 128-dim vectors, 100 queries):
| Index | Avg Latency | QPS |
|---|---|---|
| Flat | 0.8ms | 1,250 |
| IVF (nprobe=4) | 0.3ms | 3,333 |
| HNSW (ef=50) | 0.5ms | 2,000 |
For a pure Python implementation with no C extensions, these numbers are quite respectable. The real advantage of IVF and HNSW shows at scale (100K+ vectors).
Testing
93 tests cover every component:
$ pytest -v
tests/test_distance.py ✓ 15 tests
tests/test_flat.py ✓ 13 tests
tests/test_ivf.py ✓ 9 tests
tests/test_hnsw.py ✓ 12 tests
tests/test_collection.py ✓ 10 tests
tests/test_engine.py ✓ 14 tests
tests/test_api.py ✓ 19 tests + 1 train
93 passed in 1.61s
What I Learned
NumPy vectorization is everything. A naive Python loop for cosine similarity is 100x slower than the vectorized version.
HNSW's brilliance is in the multi-layer structure. Without upper layers, it degrades to a regular graph search. The skip-list-like hierarchy is what gives O(log n).
K-means++ matters. Random initialization for IVF leads to unbalanced clusters. K-means++ (selecting initial centroids proportional to distance) gives much better partitions.
The recall-speed tradeoff is continuous. There's no cliff — you can smoothly trade search quality for speed by adjusting nprobe (IVF) or ef_search (HNSW).
Persistence is tricky. HNSW's graph state (sets of neighbor IDs per layer) needs careful serialization. NumPy arrays are easy; graph topology less so.
Try It
git clone https://github.com/hajirufai/vectorlite.git
cd vectorlite
pip install -r requirements.txt
pytest -v # Run 93 tests
Check out the full source on GitHub. Star it if you found this useful! ⭐
Building this helped me deeply understand the algorithms that power every vector database in production today. If you're working with embeddings, I highly recommend implementing at least a flat index from scratch — it changes how you think about similarity search.
Top comments (0)