I Built a Vector Search Engine from Scratch — Here's What I Learned

#java #machinelearning #algorithms #programming

I Built a Vector Search Engine from Scratch — Here's What I Learned

Implementing HNSW (Hierarchical Navigable Small World) graphs, hybrid BM25 + dense retrieval, HyDE query rewriting, and atomic index persistence — achieving recall@10 = 0.984.

Why Build Your Own Vector Search?

When I started building Vektr — a RAG (Retrieval-Augmented Generation) engine — I had a choice: use an existing vector database like Pinecone, Weaviate, or FAISS, or build my own.

I chose to build my own. Not because existing solutions are bad (they're excellent), but because you don't truly understand a system until you've built it.

This post is about what I learned building HNSW from scratch.

What is HNSW?

HNSW (Hierarchical Navigable Small World) is the algorithm powering most modern vector databases. It achieves near-linear search time with high recall by organizing vectors into a hierarchical graph.

The key insight: approximate nearest neighbor search is fast enough, and "approximate" is closer to exact than you'd think.

My implementation achieves recall@10 = 0.984 — meaning for 98.4% of queries, all 10 true nearest neighbors appear in the top 10 results.

The HNSW Structure

Layer 2 (sparse):  1 ──────────── 5
                   │              │
Layer 1 (medium):  1 ── 3 ── 4 ── 5
                   │    │    │    │
Layer 0 (dense):   1─2─3─4─5─6─7─8─9

Each vector is inserted at layer 0. With probability 1/ln(M), it also appears in layer 1, and so on. This creates a highway network — you navigate quickly through sparse upper layers, then zoom in at the dense bottom layer.

Building the Index

public class HNSWIndex {
    private final int M;           // Max connections per node
    private final int efConstruction; // Search width during construction
    private final int maxLayer;
    private final Map<Integer, Node> nodes;
    private final Random random;
    private Node entryPoint;

    public void insert(int id, float[] vector) {
        int level = getRandomLevel();
        Node newNode = new Node(id, vector, level);

        if (entryPoint == null) {
            entryPoint = newNode;
            nodes.put(id, newNode);
            return;
        }

        // Start from entry point, navigate down to insertion level
        Node current = entryPoint;
        for (int l = entryPoint.level; l > level; l--) {
            current = greedySearch(current, vector, 1, l).get(0);
        }

        // Insert at each layer from level down to 0
        for (int l = Math.min(level, entryPoint.level); l >= 0; l--) {
            List<Node> candidates = searchLayer(current, vector, efConstruction, l);
            List<Node> neighbors = selectNeighbors(candidates, M, vector);

            newNode.setConnections(l, neighbors);

            // Add backlinks
            for (Node neighbor : neighbors) {
                neighbor.addConnection(l, newNode);

                // Prune if over capacity
                if (neighbor.getConnections(l).size() > M) {
                    List<Node> pruned = selectNeighbors(
                        neighbor.getConnections(l), M, neighbor.vector
                    );
                    neighbor.setConnections(l, pruned);
                }
            }
        }

        nodes.put(id, newNode);
        if (level > entryPoint.level) {
            entryPoint = newNode;
        }
    }

    private int getRandomLevel() {
        // Level distribution: P(level = l) = (1/ln(M))^l
        double r = -Math.log(random.nextDouble()) * (1.0 / Math.log(M));
        return (int) Math.min(r, maxLayer);
    }
}

Searching the Index

public List<SearchResult> search(float[] query, int k, int ef) {
    // Navigate from entry point down to layer 1
    Node current = entryPoint;
    for (int l = entryPoint.level; l > 0; l--) {
        current = greedySearch(current, query, 1, l).get(0);
    }

    // Beam search at layer 0 with ef candidates
    List<Node> candidates = searchLayer(current, query, ef, 0);

    // Return top-k by distance
    return candidates.stream()
        .sorted(Comparator.comparingDouble(n -> cosineSimilarity(query, n.vector)))
        .limit(k)
        .map(n -> new SearchResult(n.id, cosineSimilarity(query, n.vector)))
        .collect(Collectors.toList());
}

private List<Node> searchLayer(Node entry, float[] query, int ef, int layer) {
    Set<Node> visited = new HashSet<>();
    PriorityQueue<Node> candidates = new PriorityQueue<>(
        Comparator.comparingDouble(n -> -cosineSimilarity(query, n.vector))
    );
    PriorityQueue<Node> results = new PriorityQueue<>(
        Comparator.comparingDouble(n -> cosineSimilarity(query, n.vector))
    );

    candidates.add(entry);
    results.add(entry);
    visited.add(entry);

    while (!candidates.isEmpty()) {
        Node candidate = candidates.poll();

        // Termination condition: best candidate is worse than worst result
        if (results.size() >= ef &&
            cosineSimilarity(query, candidate.vector) <
            cosineSimilarity(query, results.peek().vector)) {
            break;
        }

        for (Node neighbor : candidate.getConnections(layer)) {
            if (!visited.contains(neighbor)) {
                visited.add(neighbor);
                candidates.add(neighbor);
                results.add(neighbor);
                if (results.size() > ef) results.poll();
            }
        }
    }

    return new ArrayList<>(results);
}

Hybrid Retrieval: BM25 + Dense Search

Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. The solution: combine both.

public List<SearchResult> hybridSearch(String query, int k) {
    // Dense retrieval
    float[] queryEmbedding = embedder.embed(query);
    List<SearchResult> denseResults = index.search(queryEmbedding, k * 2, efSearch);

    // Sparse retrieval (BM25)
    List<SearchResult> sparseResults = bm25.search(query, k * 2);

    // Reciprocal Rank Fusion
    return reciprocalRankFusion(denseResults, sparseResults, k);
}

private List<SearchResult> reciprocalRankFusion(
    List<SearchResult> dense,
    List<SearchResult> sparse,
    int k
) {
    Map<Integer, Double> scores = new HashMap<>();
    int k_rrf = 60; // RRF constant

    // Dense scores
    for (int i = 0; i < dense.size(); i++) {
        int id = dense.get(i).id;
        scores.merge(id, 1.0 / (k_rrf + i + 1), Double::sum);
    }

    // Sparse scores
    for (int i = 0; i < sparse.size(); i++) {
        int id = sparse.get(i).id;
        scores.merge(id, 1.0 / (k_rrf + i + 1), Double::sum);
    }

    return scores.entrySet().stream()
        .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
        .limit(k)
        .map(e -> new SearchResult(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
}

RRF (Reciprocal Rank Fusion) is elegant: each result gets a score of 1 / (k + rank) from each retriever. Results appearing in both lists get combined scores, naturally surfacing the best matches.

HyDE: Hypothetical Document Embeddings

Query: "What is the capital of France?"

The problem: this query, embedded, looks nothing like a Wikipedia article about Paris. Dense retrieval fails.

HyDE solution: Generate a hypothetical answer first, then embed that.

public float[] hydeEmbed(String query) {
    // Generate hypothetical answer
    String hypothetical = llm.generate(
        "Write a short factual answer to: " + query
    );

    // Embed the hypothetical answer instead of the query
    return embedder.embed(hypothetical);
}

Query: "What is the capital of France?"
Hypothetical: "The capital of France is Paris, located in northern France along the Seine River..."

Now the embedding actually matches relevant documents.

Impact: +8% recall@10 on my test set.

Atomic Index Persistence

The naive approach to saving the index:

// DANGEROUS — if the process dies here, the file is corrupted
try (FileOutputStream fos = new FileOutputStream("index.bin")) {
    serialize(index, fos);
}

The safe approach — write-to-tmp + rename (atomic on POSIX systems):

public void saveIndex() throws IOException {
    Path tempFile = Files.createTempFile("index-", ".tmp");

    try {
        // Write to temp file
        try (ObjectOutputStream oos = new ObjectOutputStream(
            new BufferedOutputStream(Files.newOutputStream(tempFile))
        )) {
            oos.writeObject(this.nodes);
            oos.writeObject(this.entryPoint);
        }

        // Atomic rename — either succeeds completely or fails completely
        Files.move(tempFile, indexPath,
            StandardCopyOption.ATOMIC_MOVE,
            StandardCopyOption.REPLACE_EXISTING
        );
    } catch (Exception e) {
        Files.deleteIfExists(tempFile);
        throw e;
    }
}

ATOMIC_MOVE is a single filesystem operation — it either completes or doesn't happen at all. No corrupted state.

Result: Index loads in <15ms on restart, matching LevelDB's durability pattern.

Performance Results

Tested on 1,000 vectors (sentence embeddings, 384 dimensions):

Metric	Result
recall@10	0.984
Cold query latency	35ms
Cached query latency	<1ms
Index load time	<15ms
Index build time (1K vectors)	~200ms

The cold vs cached gap shows the LRU cache working: 35ms first query, sub-millisecond repeat queries.

Key Lessons

1. The probabilistic layer structure is brilliant.
O(log n) search complexity comes naturally from the exponential decay of upper layers. You don't need a balanced tree — randomness does the work.

2. ef and M are the critical parameters.

M: max connections per node. Higher = better recall, more memory.
efConstruction: search width during insertion. Higher = better index quality, slower build.
efSearch: search width at query time. Higher = better recall, slower queries.

3. Hybrid retrieval almost always beats pure dense retrieval.
BM25 catches exact matches that dense embeddings miss. RRF fusion requires no tuning.

4. Atomic writes are non-negotiable for any persistent data structure.
Write-to-tmp + rename is the standard pattern — use it everywhere.

5. HyDE is underrated.
Generating a hypothetical answer before embedding significantly improves recall for factoid queries with minimal overhead.