Reading Algorithms Like an Engineer: What DiskANN Taught Me About Pseudocode

#algorithms #ann #cpp #diskann

The first time I implemented Vamana from the DiskANN paper, my approximate nearest neighbor index was slower than brute force. On tiny test fixtures, brute force took 0.27 ms per query. My Vamana implementation took 22.98 ms.

That sounds absurd. ANN exists to skip work. The problem was not the algorithm. It was how I mapped the paper's abstractions to actual data structures.

A set is not a data structure

The DiskANN pseudocode talks about sets L, V, and Nout(p). That is fine for explanation. Code cannot store an abstract set.

When the paper says L (the candidate list), I had to decide: sorted vector? heap? bounded priority queue? How do I find the closest unvisited element? How do I enforce the search-list bound? How do I remove duplicates?

When the paper says V (the visited set), I had to decide: unordered_set? dense bitset? boolean array? Node ids in my case were dense integers, so an indexed bit operation beat a hash-table lookup by a wide margin.

When the paper says "remove candidates," I had to ask whether removal is physical or logical. In a hot loop, marking a candidate as deleted and skipping it is much cheaper than erasing from a vector and reshuffling everything behind it.

The fix

In my sembed-engine project, I changed the implementation to match the invariants the algorithm already needed, rather than copying the pseudocode literally.

A Neighbour struct became { float distance; NodeId node; bool marked; }. A SortedBoundedVector kept candidates sorted as they were inserted, capped the list size, rejected duplicates, and tracked the next unexpanded node. Visited tracking moved to boost::dynamic_bitset. Pruning switched from physical deletion to marker-style bookkeeping.

The algorithm did not change. The code started matching the invariants the algorithm already needed.

After the fix, Vamana went from 22.98 ms to 0.02 ms on the same small fixture. On a larger dataset, it delivered 5.34x the query throughput of brute force while keeping recall at 1.0.

The lesson

Slow down at the nouns in pseudocode. If it says L, ask what operations L needs. If it says V, ask how membership is checked. If it says "remove," ask whether deletion is physical or logical. If it says "bounded," ask where that bound is enforced.

The paper gives the map. Implementation is the terrain.

For the full benchmark data, PR details, and code snippets: Reading Algorithms Like an Engineer.