Ebendttl

Posted on May 26 • Originally published at akinseinde.netlify.app

Building a Vector Search Engine from Scratch: The Math and Mechanics of HNSW

#ai #machinelearning #typescript #computerscience

This is an excerpt. The full article includes a live 2D vector space sandbox — drag a query vector across a semantic coordinate field and watch cosine similarity scores update in real time across 9 framework nodes, with a toggle between Brute Force KNN and HNSW skip-graph traversal modes. Read the full interactive version →

The Geometry of Semantics

In the era of large language models, unstructured text is treated not as raw characters, but as a vector coordinate in a high-dimensional mathematical space.

When you pass text into an embedding model (like Google's text-embedding-004), it maps the semantic meaning into an array of floating-point numbers:

"Redis In-Memory Cache" → [ 0.01249, -0.04581, 0.08914, ..., -0.00312 ]  // 3072 dimensions
"PostgreSQL Database"   → [ 0.01183, -0.04491, 0.08877, ..., -0.00298 ]  // 3072 dimensions
"PyTorch Model"         → [-0.09214,  0.07321, -0.03812, ...,  0.06441 ]  // 3072 dimensions

Within this latent space, conceptually similar texts are geometrically close. "Redis" and "PostgreSQL" cluster near each other (both databases). "PyTorch" is far away (AI/ML domain).

Semantic search = finding the K-nearest neighbor coordinates to a query vector.

The Mathematical Distance Metrics

Three metrics dominate vector similarity calculations:

1. Cosine Similarity (most common for NLP)

cos(θ) = (A · B) / (|A| × |B|)

Measures the angular projection between two vectors. Range: -1 to 1. Completely ignores magnitude — two vectors pointing in the same direction score 1.0 even if one is 100× longer. This is ideal for text embeddings where magnitude carries no semantic information.

2. Dot Product (fastest for normalized vectors)

A · B = Σ(Aᵢ × Bᵢ)

When vectors are L2-normalized (unit length = 1), dot product yields identical rankings to cosine similarity with significantly less computation. Normalize your embeddings on ingestion, then use dot product at query time.

3. L2 Euclidean Distance (for spatial data)

d(A, B) = √(Σ(Aᵢ - Bᵢ)²)

Measures raw physical distance between vector tips. Highly sensitive to magnitude variations. Preferred for coordinate-based spatial systems, not raw text embeddings.

The Scaling Bottleneck: Why Brute Force Fails

In naive Brute Force KNN, you compare the query vector against every document in your database:

Complexity: O(D × N)  — where D = dimensions, N = document count

At scale, brute force saturates CPU and destroys real-time response requirements. You need Approximate Nearest Neighbor (ANN) indexing.

HNSW: Hierarchical Navigable Small World

HNSW translates the Skip List data structure into a multi-layered graph:

LAYER 2 (Sparse Highway) ─── node_A ──────────────────── node_B ───
LAYER 1 (Intermediate)   ─── node_A ─── node_C ─── node_B ─── node_D
LAYER 0 (Dense Ground)   ─ ALL nodes fully connected to neighbors ──

This reduces search complexity from:

Brute Force: O(N)         → Linear — fails at scale
HNSW:        O(log N)     → Logarithmic — sub-5ms at 10M+ docs

TypeScript HNSW Implementation

class HNSWIndex {
  private layers: Map<string, string[]>[] = [];
  private vectors: Map<string, number[]> = new Map();
  private readonly maxLayers: number;
  private readonly maxConnections: number;

  constructor(maxLayers = 4, maxConnections = 16) {
    this.maxLayers = maxLayers;
    this.maxConnections = maxConnections;
    this.layers = Array.from({ length: maxLayers }, () => new Map());
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
    const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
    const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
    return magA === 0 || magB === 0 ? 0 : dot / (magA * magB);
  }

  private getRandomLayer(): number {
    let layer = 0;
    while (layer < this.maxLayers - 1 && Math.random() < 0.5) {
      layer++;
    }
    return layer;
  }

  insert(id: string, vector: number[]): void {
    this.vectors.set(id, vector);
    const assignedLayer = this.getRandomLayer();

    for (let level = 0; level <= assignedLayer; level++) {
      if (!this.layers[level].has(id)) {
        const candidates = this.searchLayer(vector, level, this.maxConnections);
        this.layers[level].set(id, candidates.map(c => c.id));
      }
    }
  }

  search(queryVector: number[], k: number): { id: string; score: number }[] {
    let candidates = this.getEntryPoints(queryVector);

    for (let level = this.maxLayers - 1; level >= 0; level--) {
      candidates = this.searchLayer(queryVector, level, k, candidates);
    }

    return candidates
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }

  private searchLayer(
    query: number[],
    level: number,
    k: number,
    entryPoints: { id: string; score: number }[] = []
  ): { id: string; score: number }[] {
    const visited = new Set<string>();
    const results: { id: string; score: number }[] = [...entryPoints];
    const queue = [...entryPoints];

    while (queue.length > 0) {
      const current = queue.shift()!;
      if (visited.has(current.id)) continue;
      visited.add(current.id);

      const neighbors = this.layers[level].get(current.id) ?? [];
      for (const neighborId of neighbors) {
        if (!visited.has(neighborId)) {
          const vec = this.vectors.get(neighborId)!;
          const score = this.cosineSimilarity(query, vec);
          queue.push({ id: neighborId, score });
          results.push({ id: neighborId, score });
        }
      }
    }

    return results
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }

  private getEntryPoints(query: number[]): { id: string; score: number }[] {
    const entries: { id: string; score: number }[] = [];
    for (const [id, vec] of this.vectors) {
      entries.push({ id, score: this.cosineSimilarity(query, vec) });
      if (entries.length >= 3) break;
    }
    return entries;
  }
}

Production Scaling

Memory-Mapped Files (mmap): Keeping large indices in RAM causes OOM crashes. Bind binary index files directly to virtual address space — the OS swaps vector blocks in/out of disk cache automatically based on query patterns.

Product Quantization (PQ): Divide each high-dimensional vector into sub-vectors and map them to centroids. This reduces memory footprint by up to 95% with minimal recall degradation.

Engineering Takeaways

Always L2-normalize embeddings on ingestion to enable fast dot product scoring instead of full cosine calculation.
HNSW reduces O(N) to O(log N) — the single most impactful optimization for vector search at scale.
Use mmap + Product Quantization in production to handle tens of millions of vectors without OOM.

The full article includes a drag-and-drop 2D semantic space sandbox — 9 framework nodes plotted across a latent coordinate field, real-time cosine similarity leaderboard, toggle between Brute Force and HNSW skip-graph modes, and live layer traversal visualization.

Read the full interactive article →

Written by Ebenezer Akinseinde — Software Developer & AI Automations Engineer.

Portfolio · GitHub

Top comments (2)

Harjot Singh • May 31

Implementing HNSW from scratch is the best way to actually understand why vector search behaves the way it does, and it demystifies the magic. Once you've built the layered skip-list-over-a-proximity-graph, the practical knobs stop being mysterious: M (connections per node) and efConstruction trade index quality vs build time, efSearch trades recall vs latency at query time, and crucially ANN is approximate - you're accepting a small recall miss for a massive speed win over brute-force. Knowing that tradeoff in your bones is what lets you tune retrieval instead of cargo-culting a library's defaults.

The reason this matters beyond the algorithm: retrieval quality is the whole ballgame for RAG/agent context, and HNSW's recall/latency tradeoff is exactly the dial that decides whether your system surfaces the right chunk or a plausible-but-wrong one. Understanding it at this level is what separates "I added a vector DB" from "I tuned retrieval." That retrieval-quality obsession is core to how I build Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where what context an agent gets is deliberate and verified, not nearest-match-and-hope. Multi-model routing keeps a build ~$3 flat, first run free no card. Excellent deep dive - from-scratch HNSW is a rite of passage. Did building it change how you set efSearch in practice, now that you can feel the recall/latency curve instead of guessing?

Ebendttl • Jun 9 • Edited

@harjotscs ,

Thanks for such a thoughtful and detailed comment! You hit the nail on the head - there's a massive database architecture gap between "I added a vector DB" and "I actually tuned retrieval."

I checked out Moonshift - the 14-agent, 10-phase DAG pipeline is a beautiful piece of engineering. Orchestrating code generation, tests, and deployments in ~6 minutes is highly complex. I can see exactly why context verification is your main bottleneck: when you have agents fanning out in parallel (like frontend, backend, and security auditors), a single bad retrieval chunk in the planner phase would cascade across the entire DAG and generate broken builds or failed contract-validators. In an autonomous loop, nearest-match-and-hope just doesn't cut it.

To answer your question: yes, building HNSW from scratch completely changed how I configure efSearch in production.

A few key shifts in how I set it now:

The Construction Dependency: I realized that tuning efSearch is only as good as your efConstruction bounds. If the graph is built with local minima (low efConstruction / low M), turning up efSearch at query time just wastes compute traversing dead ends.
SLA-Based Dynamic Tuning: Instead of a global database default, I now set efSearch dynamically based on the operation. For initial agent routing or conversation history, I scale it down for sub-50ms latency. For heavy agent reasoning steps (like gathering core facts for code generation), I crank it up since the latency penalty is dwarfed by the cost of missing context.
Empirical Recall Charting: I no longer guess. I run validation sets to plot our actual recall-versus-latency curve to find the exact inflection point where increasing efSearch gives diminishing returns.

Are you running your agent context routing on top of a managed vector store, or did you build a custom in-memory vector index directly inside the Moonshift planner node to save on round-trip latency?