Aakash T M

I built a vector search library in Rust/WASM. Here's what I learned about performance, browser limits, and building in public with AI

I wanted to build a privacy-first RAG app. The kind where your documents never leave the browser. That means no API keys, no server, and no third-party vector database watching what you search for.

The architecture was obvious: embed documents client-side with something like Transformers.js, store the vectors locally, and search them with cosine similarity. Simple enough. Except the "search them" part fell apart at about 5,000 vectors.

Pure JavaScript vector search has a ceiling, and it's lower than you'd think.

The math itself isn't complicated: cosine similarity is just a dot product divided by two norms. But when you're doing it across 10,000 vectors, each with 1,536 dimensions (standard for OpenAI embeddings), you're running roughly 15 million floating-point multiplications per query. JavaScript's garbage collector doesn't care that you're in a hot loop. It will pause when it wants to.
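For reference, this is the shape of that hot loop in plain TypeScript. It's a minimal brute-force sketch, not VecLite's actual code:

```typescript
// Cosine similarity: dot product divided by the two vector norms.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every stored vector against the query.
// At 10,000 vectors × 1,536 dims, the loop above runs ~15M multiplications per query.
function search(query: Float32Array, vectors: Float32Array[], topK: number) {
  return vectors
    .map((v, id) => ({ id, score: cosineSimilarity(query, v) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```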

I benchmarked every existing client-side library I could find. The results were consistent:

| Library | Runtime | Algorithm | Notes |
| --- | --- | --- | --- |
| client-vector-search | Pure JS | Brute-force | ~37ms at 10k |
| MeMemo | Pure JS | HNSW | ~17ms at 10k, but minutes to build the index |
| Vectra | Pure JS | Brute-force | Node.js only, requires OpenAI API key |
| VecLite | Rust/WASM + SIMD | Flat index | ~8ms at 10k |

At 10,000 vectors with 1,536 dimensions, client-vector-search took 37ms per query. MeMemo's HNSW was faster at 17ms, but building its index took minutes, and it's still twice as slow as VecLite's 8ms. Beyond 10k, pure JS starts to hurt: at 100k vectors, we're talking 1.5+ seconds per search.

The gap in the ecosystem was clear: nothing existed between "toy library that caps at 1k vectors" and "production vector database that requires a server." I needed something that worked at 100k vectors, entirely in the browser, with sub-second latency.

So I built VecLite.


The architecture decision

The first question was whether to stay in JavaScript or reach for something lower-level. I chose Rust compiled to WebAssembly, and every subsequent decision cascaded from that one.

Why Rust/WASM

Rust/WASM gives you three things JavaScript can't:

  1. No garbage collection. WASM linear memory is a flat buffer. No GC pauses, no surprise latency spikes during search.
  2. SIMD instructions. WebAssembly SIMD (simd128) lets you process four f32 values per instruction. At 1,536 dimensions, that's 384 SIMD iterations instead of 1,536 scalar ones.
  3. Predictable memory layout. Contiguous f32 arrays mean cache-friendly access patterns, which matters enormously when you're iterating over 100k vectors.

This isn't speculative. PGlite, DuckDB-WASM, and Figma all ship native code compiled to WebAssembly for the same reasons. The toolchain is mature.

[Diagram] VecLite's three-layer architecture: a TypeScript API layer handles validation and persistence, the WASM boundary enforces batching and Float32Array transfer, and the Rust core does pure computation with SIMD.

The WASM boundary rules

The architecture has three strict layers: TypeScript, the WASM boundary, and Rust. The boundary rules matter more than the algorithm:

  1. Always batch. Never call WASM in a loop. One crossing per operation, no matter how many vectors you're upserting.
  2. Pass vectors as flat Float32Array. Not nested JS arrays. This enables zero-copy transfer into WASM linear memory.
  3. Serialize metadata to JSON before crossing. Rust parses it on the other side.
  4. Validate everything in TypeScript. Rust should never receive malformed input: no NaN, no Infinity, no wrong dimensions.
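
As an illustration of rule 4, a minimal validation pass could look like this (a sketch, not VecLite's actual validator):

```typescript
// Reject anything Rust should never see: wrong dimension, NaN, Infinity.
function validateVectors(vectors: Float32Array[], dimension: number): void {
  for (const v of vectors) {
    if (v.length !== dimension) {
      throw new Error(`Expected dimension ${dimension}, got ${v.length}`);
    }
    for (let i = 0; i < v.length; i++) {
      if (!Number.isFinite(v[i])) {
        throw new Error(`Vector contains NaN or Infinity at index ${i}`);
      }
    }
  }
}
```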

In practice, the upsert path looks like this:

// TypeScript validates, flattens, serialises and then ONE crossing
this.index.upsert(
  JSON.stringify(ids),
  flattenVectors(vectors),   // → contiguous Float32Array
  JSON.stringify(metas),
)

Every WASM crossing has overhead. The difference between calling WASM once with 10,000 vectors and calling it 10,000 times with one vector is the difference between 40ms and 4 seconds.
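
The flattenVectors helper referenced above is the kind of thing rule 2 implies. A rough sketch, assuming every vector shares one dimension (the real helper may differ):

```typescript
// Copy N vectors of dimension D into one contiguous Float32Array of length N*D,
// so the whole batch crosses the WASM boundary in a single transfer.
function flattenVectors(vectors: Float32Array[]): Float32Array {
  const dim = vectors[0]?.length ?? 0;
  const flat = new Float32Array(vectors.length * dim);
  vectors.forEach((v, i) => flat.set(v, i * dim));
  return flat;
}
```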

f32 vs f64: the decision that halved memory

All vectors are stored and computed as f32, never f64. This was a deliberate choice:

  • OpenAI, Cohere, and most embedding models output f32 precision. There's no meaningful accuracy gain from f64.
  • At 100k vectors × 1,536 dimensions, f32 uses ~600MB. f64 would be 1.2GB. At that scale, your browser crashes.
  • WASM SIMD intrinsics process four f32 values per instruction. With f64 you'd get two.
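
The memory arithmetic behind that middle bullet, spelled out:

```typescript
const vectors = 100_000;
const dims = 1_536;

// 4 bytes per f32 value vs 8 bytes per f64 value.
const f32Bytes = vectors * dims * 4; // 614,400,000 bytes ≈ 600MB
const f64Bytes = vectors * dims * 8; // 1,228,800,000 bytes ≈ 1.2GB
```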

The storage adapter pattern: keeping it dumb

Storage adapters in VecLite are deliberately dumb: a four-method key/value interface:

interface StorageAdapter {
  get(key: string): Promise<string | null>
  set(key: string, value: string): Promise<void>
  delete(key: string): Promise<void>
  clear(): Promise<void>
}

The adapter has zero knowledge of vectors, metadata, or search. VecLite owns all serialisation. This means community adapters for localStorage, React Native AsyncStorage, or SQLite are 10–20 lines of code.
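
For instance, a localStorage adapter could be as small as this (a hypothetical sketch against the interface above, not a shipped adapter):

```typescript
class LocalStorageAdapter implements StorageAdapter {
  // Prefix keys so clear() only touches this adapter's entries.
  constructor(private prefix = "veclite:") {}

  async get(key: string): Promise<string | null> {
    return localStorage.getItem(this.prefix + key);
  }
  async set(key: string, value: string): Promise<void> {
    localStorage.setItem(this.prefix + key, value);
  }
  async delete(key: string): Promise<void> {
    localStorage.removeItem(this.prefix + key);
  }
  async clear(): Promise<void> {
    Object.keys(localStorage)
      .filter((k) => k.startsWith(this.prefix))
      .forEach((k) => localStorage.removeItem(k));
  }
}
```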

The temptation was to make the storage layer "smart" by letting it do partial loads or native querying. I resisted. A thin interface is easy to implement, easy to test, and easy to swap. The complexity belongs in one place (VecLite), not spread across every adapter.


The honest benchmarks

I ran benchmarks with Vitest against a pure-JS Float32Array implementation, the fairest comparison I could construct. Both use Float32Array, both use the same algorithm (brute-force cosine similarity); the only difference is JS vs WASM+SIMD.

What I expected vs what actually happened

I expected a 10–20× speedup, based on Rust/WASM benchmarks I'd taken at face value. I got ~4×.

| Dataset | VecLite (Rust/WASM) | Pure JS | Speedup |
| --- | --- | --- | --- |
| 10k vectors, dim=1536 | 40ms | 152ms | 3.8× |
| 50k vectors, dim=1536 | 200ms | 778ms | 3.9× |
| 100k vectors, dim=1536 | 400ms | 1,576ms | 3.9× |

Why not 20×? Because V8 is genuinely good at optimising tight Float32Array loops. The JS baseline isn't naive; it's a well-optimised hot loop. The WASM advantage comes from SIMD parallelism and the absence of GC pauses, not from some fundamental inefficiency in JavaScript. 4× is real, repeatable, and honest.

Why HNSW lost to flat index at dim=1536

This was the most counterintuitive result. HNSW (Hierarchical Navigable Small World) is the gold standard for approximate nearest neighbour search. Every production vector database uses it. So I implemented it in v0.3.

It was slower than brute force. At every scale.

| Scale | Flat (exact) | HNSW ef=200 | Winner |
| --- | --- | --- | --- |
| 1k vectors | 0.83ms | 0.95ms | Flat, 1.14× faster |
| 5k vectors | 4.1ms | 4.4ms | Flat, 1.08× faster |
| 10k vectors | 8.2ms | 8.8ms | Flat, 1.07× faster |

The reason: at 1,536 dimensions, each hop in the HNSW graph traverses a massive vector space. The "neighbourhood structure" that makes HNSW efficient at low dimensions (where nearby points are geometrically clustered) becomes noise at high dimensions. Graph traversal overhead exceeds the savings from reducing the candidate set.

HNSW would probably win (untested) at dim=128 with 500k+ vectors. But at the dimensions real embedding models use (512–3,072), brute-force SIMD is faster at every practical browser scale.

I kept HNSW in the library for users with specific use cases, but the flat index is the default and recommended path.

The filter pre-computation result that surprised me

Pre-filtering (applying metadata filters before scoring) produced dramatic speedups:

| Filter | Mean latency | vs unfiltered |
| --- | --- | --- |
| $gte (~50% selectivity) | 10ms | 3.9× faster |
| $in (~25% selectivity) | 3ms | 12× faster |

At 25% selectivity, you're only computing cosine similarity on a quarter of the vectors. The filter itself is cheap (string/number comparison), so selective filters genuinely eliminate compute rather than just hiding it.
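
Conceptually, pre-filtering is just narrowing the candidate set before the expensive loop runs. A sketch of the idea (not VecLite's internals), reusing the cosineSimilarity function from the earlier snippet:

```typescript
type Metadata = Record<string, string | number>;

// Score only the vectors whose metadata passes the predicate.
// Everything the filter rejects costs one cheap comparison instead of 1,536 multiplications.
function filteredSearch(
  query: Float32Array,
  vectors: Float32Array[],
  metas: Metadata[],
  predicate: (meta: Metadata) => boolean, // e.g. the $gte / $in check
  topK: number,
) {
  const results: { id: number; score: number }[] = [];
  for (let id = 0; id < vectors.length; id++) {
    if (!predicate(metas[id])) continue;
    results.push({ id, score: cosineSimilarity(query, vectors[id]) });
  }
  return results.sort((a, b) => b.score - a.score).slice(0, topK);
}
```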

P99 spikes

The honest part: p99 latency spikes exist. WASM initialisation has a cold-start cost (mitigated by caching). And at 100k vectors, serialising the full index to JSON for persistence is the real bottleneck — not search.


Building with AI

VecLite was built with Claude Code, and one file changed everything: CLAUDE.md.

How CLAUDE.md changed my workflow

CLAUDE.md is a 257-line file at the root of the repo. It contains the architecture, every locked decision, the API surface, WASM boundary rules, security constraints, and what's deliberately deferred.

The effect was immediate. With CLAUDE.md in context, Claude Code stopped suggesting things I'd already rejected: no proposals to inline WASM as base64, no attempts to break the three-layer architecture, no re-inventing HNSW from scratch when the flat index was the right choice.

It turned the AI from a "clever junior who hasn't read the docs" into a collaborator who understood the project's constraints.

What Claude Code is good at vs where it needs guidance

Where it excelled:

  • Boilerplate with consistent patterns like adapter implementations, test scaffolding, TypeScript types
  • wasm-bindgen ceremony, a.k.a. the glue code between JS and Rust
  • Test generation: given a function and edge cases, it writes thorough test suites
  • Documentation

Where it needs guidance:

  • Rust SIMD intrinsics: the core::arch::wasm32 f32x4 path took iteration. AI-generated SIMD code often compiles but has subtle correctness issues (wrong horizontal sum, missing scalar tail)
  • Crate selection for WASM targets: It suggested crates that depend on rayon, which is incompatible with wasm32-unknown-unknown. I had to manually verify WASM compatibility for every Rust dependency
  • Performance-critical architecture decisions: The u32 bit-pattern trick for HNSW distance metrics (reinterpreting f32 as u32 for the graph's comparison function) required domain expertise

What's next

  • veclite/rag: A batteries-included RAG pipeline as a sub-path export. Bring a document, get semantic search. Chunking, local embeddings via Transformers.js, and VecLite search under the hood. Zero config. No API keys. No data leaves the device.
  • Chunked persistence: The current JSON blob works at 50k vectors but won't scale forever
  • Web Worker support: Embedding already runs off-thread, but the core search path could benefit too
  • React Native / Node.js: The architecture supports it, the plumbing doesn't exist yet
  • Community adapters: SQLite, AsyncStorage, localStorage adapters following the 4-method interface

Try it

npm install veclite

GitHub: github.com/thealpha93/VecLite

If you're building something with client-side vector search, I'd love to hear about it. Open an issue, submit a PR, or just star the repo so I know people care. This is being built in public, and your feedback shapes the roadmap.
