TJ Sweet
How I sped up HNSW construction ~2.7x

HNSW Build Time at 1M Embeddings: 27 Minutes to 10 Minutes by Fixing Insertion Order

For a 1M-embedding corpus, we reduced HNSW construction time from about 27 minutes to about 10 minutes (2.7x) without changing recall or graph quality.

This post explains:

  • the problem (where traversal work is wasted during construction),
  • the solution (BM25-seeded insertion order),
  • and the math behind the observed speedup.

All numbers in this writeup use the validated parameters from the current implementation:

  • M=16
  • ef_construction=100
  • seed set size = 256 * 8 = 2,048 nodes

Problem: Random insertion order creates traversal waste

HNSW build quality and build cost both depend on insertion order. With random insertion:

  • early nodes form accidental local hubs,
  • new nodes frequently enter a poor region first,
  • greedy search spends extra distance evaluations before finding useful neighbors.

That wasted traversal work compounds over time.

Visual A: Where random-order traversal waste comes from

[Image: random insertion order]

In practice, this increases construction cost by a multiplicative overhead factor. I will call that factor beta:

build_time = ideal_time * beta

Where:

  • ideal_time is the minimum cost if each insert reaches good neighbors with minimal detours,
  • beta > 1 captures wasted traversal and repair work.

Baseline mechanics: why layer-0 dominates at 1M scale

Using M=16 with the standard level multiplier mL = 1/ln(M), the level distribution gives:

  • P(node at layer >= 1) = 1/M = 1/16 = 6.25%
  • so 93.75% of inserted nodes exist only in layer 0, which is where nearly all insertion work happens.

For ef_construction=100, expected distance computations per insertion are:

Upper layers: 0.067 * 100 * 16 =   107
Layer 0:      1.000 * 100 * 32 = 3,200
                                 -----
Total per insert               = 3,307

(0.067 is the expected number of upper layers per node, 1/(M-1) = 1/15; 32 is 2*M, the layer-0 connection bound.)
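This per-insert estimate is easy to reproduce. The sketch below recomputes it from M and ef_construction; the function names here are mine for illustration, not NornicDB's API:

```go
package main

import "fmt"

// expectedUpperLayers returns the expected number of upper layers
// (levels >= 1) a node participates in, assuming the standard HNSW
// level distribution with multiplier mL = 1/ln(M):
// sum over l >= 1 of P(level >= l) = sum of (1/M)^l = 1/(M-1).
func expectedUpperLayers(m int) float64 {
	return 1.0 / float64(m-1)
}

// expectedDistanceOps estimates distance computations per insertion:
// each visited layer costs roughly efConstruction * maxConnections,
// where the connection bound is M on upper layers and 2*M on layer 0.
func expectedDistanceOps(m, efConstruction int) float64 {
	upper := expectedUpperLayers(m) * float64(efConstruction) * float64(m)
	layer0 := float64(efConstruction) * float64(2*m)
	return upper + layer0
}

func main() {
	ops := expectedDistanceOps(16, 100)
	fmt.Printf("expected distance ops per insert: %.0f\n", ops) // 3307
}
```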

So the primary optimization target is not exotic upper-layer behavior; it is reducing layer-0 traversal waste during insertion.

Solution: BM25-seeded insertion creates a backbone first

Instead of random insertion order, we pick a lexically diverse seed set from BM25 and insert those vectors first.

Seed extraction:

  • take high-IDF terms,
  • for each term, take top docs by term frequency,
  • defaults: NORNICDB_HNSW_LEXICAL_SEED_MAX_TERMS=256, NORNICDB_HNSW_LEXICAL_SEED_PER_TERM=8,
  • maximum seed set: 256 * 8 = 2,048 nodes.
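The extraction step can be sketched as follows. Everything here (the `extractSeeds` signature, the map-based postings) is invented for illustration and is not NornicDB's actual implementation; it only mirrors the three bullets above:

```go
package main

import (
	"fmt"
	"sort"
)

// extractSeeds picks a lexically diverse seed set: the top maxTerms
// terms by IDF, and for each term its top perTerm documents by term
// frequency, deduplicated across terms.
func extractSeeds(termIDF map[string]float64, postings map[string]map[int]int,
	maxTerms, perTerm int) []int {

	// Rank terms by descending IDF (high-IDF terms are the most selective).
	terms := make([]string, 0, len(termIDF))
	for t := range termIDF {
		terms = append(terms, t)
	}
	sort.Slice(terms, func(i, j int) bool { return termIDF[terms[i]] > termIDF[terms[j]] })
	if len(terms) > maxTerms {
		terms = terms[:maxTerms]
	}

	seen := make(map[int]bool)
	var seeds []int
	for _, t := range terms {
		// Rank this term's documents by descending term frequency.
		docs := make([]int, 0, len(postings[t]))
		for d := range postings[t] {
			docs = append(docs, d)
		}
		sort.Slice(docs, func(i, j int) bool { return postings[t][docs[i]] > postings[t][docs[j]] })
		if len(docs) > perTerm {
			docs = docs[:perTerm]
		}
		for _, d := range docs {
			if !seen[d] {
				seen[d] = true
				seeds = append(seeds, d)
			}
		}
	}
	return seeds
}

func main() {
	idf := map[string]float64{"rare": 5.0, "common": 0.5}
	postings := map[string]map[int]int{
		"rare":   {1: 3, 2: 1, 3: 2},
		"common": {1: 9, 4: 2},
	}
	// Docs 1, 3, 2 come from "rare" (by tf), then 4 from "common".
	fmt.Println(extractSeeds(idf, postings, 256, 8)) // [1 3 2 4]
}
```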

Build order:

  1. insert seed nodes first,
  2. insert remaining N - seed_count nodes.

This gives the graph a broad early backbone, so later inserts find useful neighbors quickly instead of wandering.
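The two-phase order itself is simple; a minimal sketch (hypothetical names, assuming nodes are identified by index):

```go
package main

import "fmt"

// seedFirstOrder returns an insertion order over n nodes that places
// the seed set first, then every remaining node in its original order.
func seedFirstOrder(n int, seeds []int) []int {
	isSeed := make(map[int]bool, len(seeds))
	order := make([]int, 0, n)
	for _, s := range seeds {
		isSeed[s] = true
		order = append(order, s)
	}
	for i := 0; i < n; i++ {
		if !isSeed[i] {
			order = append(order, i)
		}
	}
	return order
}

func main() {
	fmt.Println(seedFirstOrder(6, []int{4, 1})) // [4 1 0 2 3 5]
}
```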

Visual B: Seed-first construction reduces detours

[Image: seed-first construction]

The math check on the 27 -> 10 minute result

Use a conservative distance-op estimate for 1024-dim float32 vectors in Go:

  • compute plus memory effects: about 160 ns per distance operation in this workload class.

Then ideal floor for 1M insertions:

ideal_time
= 1,000,000 * 3,307 * 160 ns
= 529 s
= 8.8 min

Now map measured times to beta:

beta_random = 27 / 8.8 = 3.07
beta_seeded = 10 / 8.8 = 1.13
speedup     = beta_random / beta_seeded
            = 3.07 / 1.13
            = 2.72x ~= 2.7x

This is the key point: the reported speedup is exactly what you expect if seeded order mostly removes traversal waste.
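The whole chain can be checked end to end in a few lines. The small differences from the figures above (3.06 vs 3.07) come from carrying the floor at full precision instead of rounding it to 8.8 minutes first:

```go
package main

import "fmt"

func main() {
	const (
		nodes        = 1_000_000
		opsPerInsert = 3307   // per-insert estimate derived earlier
		secPerOp     = 160e-9 // ~160 ns per distance op (workload estimate)
	)
	idealMin := nodes * opsPerInsert * secPerOp / 60 // ideal floor, minutes

	betaRandom := 27.0 / idealMin
	betaSeeded := 10.0 / idealMin
	fmt.Printf("ideal floor: %.1f min\n", idealMin)                     // 8.8
	fmt.Printf("beta random: %.2f, seeded: %.2f\n", betaRandom, betaSeeded)
	fmt.Printf("speedup: %.1fx\n", betaRandom/betaSeeded)               // 2.7
}
```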

Visual C: Overhead factor (beta) before vs after

beta (lower is better)

Random order   | ############################### 3.07
Seeded order   | ###########                     1.13

Interpretation:

  • random build spends about 3.07x the ideal work,
  • seeded build is close to the floor at about 1.13x.

Time decomposition for the 1M run

Total build time decomposition (minutes)

Case            Ideal floor   Overhead   Total
Random order      8.8          18.2      27.0
Seeded order      8.8           1.2      10.0

Visual D: Same floor, different overhead

Random order | [#########.................][##################] 27.0
Seeded order | [#########.................][#]                  10.0
               ideal floor (8.8 min)       overhead

Both runs share the same algorithmic floor; the difference is how much overhead is paid while traversing and wiring the graph.

Why this does not require a recall tradeoff

This change does not reduce ef_construction, M, or search-time quality knobs. It changes insertion order so the builder spends less effort reaching good neighborhoods.

That is why a large build-time gain can occur without reducing recall or graph quality: the graph is built with the same target connectivity constraints, but with less wasted traversal on the way there.

How to reproduce in your environment

  1. Keep HNSW params fixed (M, ef_construction unchanged).
  2. Build once with seeding disabled (or effectively zero seed set).
  3. Build once with defaults:
NORNICDB_HNSW_LEXICAL_SEED_MAX_TERMS=256
NORNICDB_HNSW_LEXICAL_SEED_PER_TERM=8
  4. Log and compare:
     • total build time,
     • insertion throughput,
     • recall on a fixed validation query set.

Use ratio as the primary cross-machine signal. Absolute minutes depend on CPU, memory bandwidth, cache behavior, and runtime effects.

Secondary effect: same seed mechanism helps k-means init

The same BM25-derived seed mechanism is also used by bm25+kmeans++ seed mode for centroid initialization. That improves initial centroid spread and typically reduces convergence iterations in the k-means phase.

The important architectural detail is reuse: one seed extraction pass supports both HNSW construction order and k-means initialization.

Closing

The 27-to-10 minute result is not a tuning artifact. It is a direct consequence of reducing traversal waste during graph construction:

  • keep the same quality parameters,
  • improve insertion geometry,
  • move beta from about 3.07 to about 1.13.

At 1M scale, this is enough to produce a repeatable 2.7x build-time improvement while preserving result quality.

https://github.com/orneryd/NornicDB
