I built a Vamana-based vector search engine in C++ called sembed-engine. Recently I made a pull request that sped up queries by 16x and builds by 9x. The algorithm stayed exactly the same. The recall stayed at 1.0. The number of visited nodes did not change.
The speedup came from data layout.
## The old design
The original code stored vectors as separate objects pointed to by shared_ptr:
```cpp
struct Record {
    int64_t id;
    std::shared_ptr<Vector> vector;
};
```
This is clean C++. Every record has an id and a vector. The vector knows how to calculate distance. In the hot path, though, the CPU had to load the record, read the shared_ptr, follow the pointer, call virtual methods, and read each float through an abstraction layer. Millions of times per query.
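To make the cost chain concrete, here is a hypothetical sketch of what such an abstraction might look like. The names (`Vector`, `DenseVector`, `distance_to`, `at`) are illustrative, not sembed-engine's real API; the point is that every float read and every distance call crosses a virtual boundary the compiler cannot see through:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Illustrative only: the shape of an object-per-vector design with a
// virtual distance interface. Not sembed-engine's actual code.
struct Vector {
    virtual ~Vector() = default;
    virtual std::size_t dims() const = 0;
    virtual float at(std::size_t i) const = 0;
    virtual float distance_to(const Vector& other) const = 0;
};

struct DenseVector : Vector {
    std::vector<float> data;
    explicit DenseVector(std::vector<float> d) : data(std::move(d)) {}
    std::size_t dims() const override { return data.size(); }
    float at(std::size_t i) const override { return data[i]; }
    float distance_to(const Vector& other) const override {
        float sum = 0.0f;
        for (std::size_t i = 0; i < dims(); ++i) {
            float diff = at(i) - other.at(i); // virtual call per float read
            sum += diff * diff;
        }
        return std::sqrt(sum); // scalar sqrt in the hot path
    }
};
```

A design like this reads nicely, but the per-element virtual calls block both inlining and vectorization.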
## The new layout
I replaced the object graph with a flat array. All vector values live in one contiguous block:
```text
ids    = [id0, id1, id2, ...]
values = [v0_dim0, v0_dim1, ..., v1_dim0, v1_dim1, ...]
```
Vector i starts at `values[i * D]`. A `FloatVectorView` is just a pointer and a dimension count. No allocations. No pointer chasing. The next vector is right after the previous one in memory.
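A minimal sketch of that layout, reconstructed from the description above (the `FlatStore` name and exact member layout are my assumptions, not sembed-engine's types):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A view is just a pointer into the shared `values` block plus a
// dimension count. Creating one allocates nothing.
struct FloatVectorView {
    const float* data;
    std::size_t dims;
};

// All ids and all vector values in two contiguous arrays.
struct FlatStore {
    std::size_t dims;
    std::vector<int64_t> ids;  // ids[i] is the id of vector i
    std::vector<float> values; // vector i occupies values[i*dims .. i*dims + dims)

    FloatVectorView view(std::size_t i) const {
        return {values.data() + i * dims, dims}; // no pointer chase, no virtual call
    }
};
```

Because consecutive vectors are adjacent, a scan over candidates walks memory linearly, which is exactly what hardware prefetchers and SIMD loads want.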
The assembly tells the story. The old code had virtual calls and scalar square roots:
```asm
call   rax        ; virtual dispatch
sqrtss xmm2, xmm2 ; scalar square root
```
The new code loads packed floats and operates on four at a time:
```asm
movups xmm1, XMMWORD PTR [rdi+rax]
subps  xmm1, xmm3
mulps  xmm1, xmm1
```
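The kind of kernel that produces those packed instructions is nothing exotic; a plain loop over contiguous floats is enough for the compiler to auto-vectorize at `-O2`/`-O3` (exact codegen depends on compiler and flags; the function name here is mine):

```cpp
#include <cstddef>

// Squared Euclidean distance over two contiguous float arrays. With
// contiguous inputs and no virtual calls, compilers typically emit
// packed loads/subtracts/multiplies (movups/subps/mulps) for this loop.
float squared_distance(const float* a, const float* b, std::size_t dims) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < dims; ++i) {
        float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum; // note: no sqrt here
}
```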
## Removing unnecessary square roots
Euclidean distance includes a square root. For nearest-neighbor search, we only care about ordering, not the absolute distance value. If sqrt(25) < sqrt(100), then 25 < 100. The ordering is identical.
Switching to squared distances eliminated `sqrtss` entirely from the hot path. One caveat: Vamana pruning uses an alpha parameter. When everything is squared, alpha must be squared too to preserve the same comparison semantics.
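The alpha adjustment in one picture. With true distances the Vamana-style prune test has the shape `alpha * d(a, c) <= d(p, c)`; squaring both sides turns `alpha` into `alpha * alpha`. This is a sketch with my own function and variable names, not the engine's actual prune code:

```cpp
// Prune test with true Euclidean distances:
//   keep candidate only if alpha * d_ac > d_pc fails, i.e. prune when
//   alpha * d_ac <= d_pc.
bool prune_with_true_distance(float alpha, float d_ac, float d_pc) {
    return alpha * d_ac <= d_pc;
}

// Same test with squared distances (d2 = d * d): squaring both sides
// of `alpha * d_ac <= d_pc` gives `alpha^2 * d2_ac <= d2_pc`, so alpha
// must be squared to keep the decision identical.
bool prune_with_squared_distance(float alpha, float d2_ac, float d2_pc) {
    return (alpha * alpha) * d2_ac <= d2_pc;
}
```

Forgetting to square alpha silently changes which edges survive pruning, which would alter the graph even though each individual comparison still looks reasonable.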
## Caching scores during sort
The old comparator computed distances inside the sort call. Sorting invokes the comparator many times, so the same distance was recomputed repeatedly. The fix was to compute each distance once, store it in a `ScoredNode { node; score; }`, and sort by the cached score.

In the assembly, the old comparator called `new_view_squared` repeatedly; the new comparator just loads two floats and compares them.
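The fix in sketch form. `ScoredNode` follows the shape given above; the `sort_by_distance` helper and the `distance_to_query` callback are my stand-ins for the real scoring call:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ScoredNode {
    uint32_t node;
    float score; // cached distance, computed exactly once per node
};

template <typename DistFn>
std::vector<ScoredNode> sort_by_distance(const std::vector<uint32_t>& nodes,
                                         DistFn distance_to_query) {
    std::vector<ScoredNode> scored;
    scored.reserve(nodes.size());
    for (uint32_t n : nodes)
        scored.push_back({n, distance_to_query(n)}); // one distance call per node
    std::sort(scored.begin(), scored.end(),
              [](const ScoredNode& a, const ScoredNode& b) {
                  return a.score < b.score; // comparator only compares two floats
              });
    return scored;
}
```

For N nodes, std::sort makes O(N log N) comparator calls, so moving the distance computation out of the comparator cuts distance evaluations from O(N log N) down to N.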
## Results
| Workload | Metric | Before | After |
|---|---|---|---|
| gvec query latency | p50 | 4.094 ms | 0.631 ms |
| w2v query latency | p50 | 25.15 ms | 1.524 ms |
| w2v build time | total | 17.91 s | 1.889 s |
The search visited the same number of nodes. It stopped paying unnecessary tax at every node.
For the full benchmark methodology, assembly breakdown, and PR diff: *How I Made My Vector Search Engine 16x Faster*.