I wanted to build a privacy-first RAG app: the kind where your documents never leave the browser. That means no API keys, no server, no third-party vector database watching what you search for.
The architecture was obvious: embed documents client-side with something like Transformers.js, store the vectors locally, and search them with cosine similarity. Simple enough. Except the "search them" part fell apart at about 5,000 vectors.
Pure JavaScript vector search has a ceiling, and it's lower than you'd think.
The math itself isn't complicated: cosine similarity is just a dot product divided by two norms. But when you're doing it across 10,000 vectors, each with 1,536 dimensions (standard for OpenAI embeddings), you're running over 15 million floating-point multiplications per query. JavaScript's garbage collector doesn't care that you're in a hot loop. It will pause when it wants to.
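For concreteness, here's the shape of that hot loop in plain TypeScript (a minimal sketch with illustrative names, not VecLite's API):

```ts
// Cosine similarity: dot product divided by the two norms.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Brute-force search: score every stored vector against the query.
// At 10k vectors × 1,536 dims, that's roughly 15M multiplies per query.
function search(query: Float32Array, vectors: Float32Array[], topK: number) {
  return vectors
    .map((v, id) => ({ id, score: cosine(query, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```

Every one of those multiplications runs on the main thread, with the GC free to interrupt at any point.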
I benchmarked every existing client-side library I could find. The results were consistent:
| Library | Runtime | Algorithm | Notes |
|---|---|---|---|
| client-vector-search | Pure JS | Brute-force | ~37ms at 10k |
| MeMemo | Pure JS | HNSW | ~17ms at 10k, but minutes to build the index |
| Vectra | Pure JS | Brute-force | Node.js only, requires OpenAI API key |
| VecLite | Rust/WASM + SIMD | Flat index | ~8ms at 10k |
At 10,000 vectors with 1,536 dimensions, client-vector-search took 37ms per query. MeMemo's HNSW was faster at 17ms, but building its index took minutes, and it's still roughly twice as slow as VecLite's 8ms. Beyond 10k, pure JS starts to hurt: at 100k vectors, we're talking 1.5+ seconds per search.
The gap in the ecosystem was clear: nothing existed between "toy library that caps at 1k vectors" and "production vector database that requires a server." I needed something that worked at 100k vectors, entirely in the browser, with sub-second latency.
So I built VecLite.
## The architecture decision
The first question was whether to stay in JavaScript or reach for something lower-level. I chose Rust compiled to WebAssembly, and every subsequent decision cascaded from that one.
### Why Rust/WASM
Rust/WASM gives you three things JavaScript can't:
- No garbage collection. WASM linear memory is a flat buffer. No GC pauses, no surprise latency spikes during search.
- SIMD instructions. WebAssembly SIMD (`simd128`) lets you process four `f32` values per instruction. At 1,536 dimensions, that's 384 SIMD iterations instead of 1,536 scalar ones.
- Predictable memory layout. Contiguous `f32` arrays mean cache-friendly access patterns, which matters enormously when you're iterating over 100k vectors.
This isn't speculative. PGlite, DuckDB-WASM, and Figma all push their heavy compute into WASM for the same reasons. The toolchain is mature.
### The WASM boundary rules
The architecture has three strict layers: TypeScript, the WASM boundary, and Rust. And the boundary rules matter more than the algorithm:
- Always batch. Never call WASM in a loop. One crossing per operation, no matter how many vectors you're upserting.
- Pass vectors as a flat `Float32Array`. Not nested JS arrays. This enables zero-copy transfer into WASM linear memory.
- Serialize metadata to JSON before crossing. Rust parses it on the other side.
- Validate everything in TypeScript. Rust should never receive malformed input: no NaN, no Infinity, no wrong dimensions.
In practice, the upsert path looks like this:
```ts
// TypeScript validates, flattens, serialises, and then makes ONE crossing
this.index.upsert(
  JSON.stringify(ids),
  flattenVectors(vectors), // → contiguous Float32Array
  JSON.stringify(metas),
)
```
Every WASM crossing has overhead. The difference between calling WASM once with 10,000 vectors and calling it 10,000 times with one vector is the difference between 40ms and 4 seconds.
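Here's a sketch of what the TypeScript side does before that single crossing (an illustrative version of the `flattenVectors` helper used above, not VecLite's exact code):

```ts
// Flatten number[][] into one contiguous Float32Array for zero-copy
// transfer, rejecting anything Rust should never see.
// (Illustrative version of the helper, not VecLite's code.)
function flattenVectors(vectors: number[][], dim: number): Float32Array {
  const out = new Float32Array(vectors.length * dim)
  vectors.forEach((vec, i) => {
    if (vec.length !== dim) {
      throw new Error(`vector ${i}: expected dim ${dim}, got ${vec.length}`)
    }
    for (let j = 0; j < dim; j++) {
      const v = vec[j]
      if (!Number.isFinite(v)) {
        throw new Error(`vector ${i}[${j}] is NaN or Infinity`)
      }
      out[i * dim + j] = v
    }
  })
  return out
}
```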
### `f32` vs `f64`: the decision that halved memory
All vectors are stored and computed as `f32`, never `f64`. This was a deliberate choice:
- OpenAI, Cohere, and most embedding models output `f32` precision. There's no meaningful accuracy gain from `f64`.
- At 100k vectors × 1,536 dimensions, `f32` uses ~600MB. `f64` would be 1.2GB (quick check below). At that scale, your browser crashes.
- WASM SIMD intrinsics process four `f32` values per instruction. With `f64` you'd get two.
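The arithmetic is easy to sanity-check:

```ts
// 100k vectors × 1,536 dims × 4 bytes (f32) vs 8 bytes (f64)
const bytesF32 = 100_000 * 1536 * Float32Array.BYTES_PER_ELEMENT
const bytesF64 = 100_000 * 1536 * Float64Array.BYTES_PER_ELEMENT
console.log(bytesF32 / 1e6, 'MB') // 614.4 MB
console.log(bytesF64 / 1e9, 'GB') // ~1.23 GB
```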
## The storage adapter pattern: keeping it dumb
Storage adapters in VecLite are deliberately dumb: a four-method key/value interface.
```ts
interface StorageAdapter {
  get(key: string): Promise<string | null>
  set(key: string, value: string): Promise<void>
  delete(key: string): Promise<void>
  clear(): Promise<void>
}
```
The adapter has zero knowledge of vectors, metadata, or search. VecLite owns all serialisation. This means community adapters for localStorage, React Native AsyncStorage, or SQLite are 10–20 lines of code.
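For example, a localStorage adapter is little more than the interface itself (a sketch; the `veclite:` key prefix is my own convention, not something the library mandates):

```ts
// Minimal localStorage adapter: four methods, zero knowledge of vectors.
// The "veclite:" prefix is an illustrative namespacing convention.
class LocalStorageAdapter implements StorageAdapter {
  private prefix = 'veclite:'

  async get(key: string): Promise<string | null> {
    return localStorage.getItem(this.prefix + key)
  }
  async set(key: string, value: string): Promise<void> {
    localStorage.setItem(this.prefix + key, value)
  }
  async delete(key: string): Promise<void> {
    localStorage.removeItem(this.prefix + key)
  }
  async clear(): Promise<void> {
    Object.keys(localStorage)
      .filter((k) => k.startsWith(this.prefix))
      .forEach((k) => localStorage.removeItem(k))
  }
}
```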
The temptation was to make the storage layer "smart" by letting it do partial loads or native querying. I resisted. A thin interface is easy to implement, easy to test, and easy to swap. The complexity belongs in one place (VecLite), not spread across every adapter.
## The honest benchmarks
I ran benchmarks with Vitest against a pure-JS `Float32Array` implementation, the fairest possible comparison. Both use `Float32Array`, both use the same algorithm (brute-force cosine similarity); the only difference is JS vs WASM+SIMD.
### What I expected vs what happened
Going in, I expected a 10–20× speedup, based on Rust/WASM benchmarks I'd taken at face value. I got ~4×.
| Dataset | VecLite (Rust/WASM) | Pure JS | Speedup |
|---|---|---|---|
| 10k vectors, dim=1536 | 40ms | 152ms | 3.8× |
| 50k vectors, dim=1536 | 200ms | 778ms | 3.9× |
| 100k vectors, dim=1536 | 400ms | 1,576ms | 3.9× |
Why not 20×? Because V8 is genuinely good at optimising tight `Float32Array` loops. The JS baseline isn't naive; it's a well-optimised hot loop. The WASM advantage comes from SIMD parallelism and the absence of GC pauses, not from some fundamental inefficiency in JavaScript. 4× is real, repeatable, and honest.
### Why HNSW lost to the flat index at dim=1536
This was the most counterintuitive result. HNSW (Hierarchical Navigable Small World) is the gold standard for approximate nearest neighbour search. Every production vector database uses it. So I implemented it in v0.3.
It was slower than brute force. At every scale.
| Scale | Flat (exact) | HNSW ef=200 | Winner |
|---|---|---|---|
| 1k vectors | 0.83ms | 0.95ms | Flat, 1.14× faster |
| 5k vectors | 4.1ms | 4.4ms | Flat, 1.08× faster |
| 10k vectors | 8.2ms | 8.8ms | Flat, 1.07× faster |
The reason: at 1,536 dimensions, each hop in the HNSW graph traverses a massive vector space. The "neighbourhood structure" that makes HNSW efficient at low dimensions (where nearby points are geometrically clustered) becomes noise at high dimensions. Graph traversal overhead exceeds the savings from reducing the candidate set.
HNSW would probably win (untested) at dim=128 with 500k+ vectors. But at the dimensions real embedding models use (512–3,072), brute-force SIMD is faster at every practical browser scale.
I kept HNSW in the library for users with specific use cases, but the flat index is the default and recommended path.
### The filter pre-computation result that surprised me
Pre-filtering (applying metadata filters before scoring) produced dramatic speedups:
| Filter | Mean latency | vs unfiltered |
|---|---|---|
| `$gte` (~50% selectivity) | 10ms | 3.9× faster |
| `$in` (~25% selectivity) | 3ms | 12× faster |
At 25% selectivity, you're only computing cosine similarity on a quarter of the vectors. The filter itself is cheap (string/number comparison), so selective filters genuinely eliminate compute rather than just hiding it.
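The mechanics are simple: run the cheap predicate first, score only the survivors. A sketch of the idea (reusing the `cosine` helper from earlier, not VecLite's internals):

```ts
// Pre-filtering: the metadata predicate runs before any vector math,
// so a 25%-selective filter skips 75% of the cosine computations.
function filteredSearch(
  query: Float32Array,
  vectors: Float32Array[],
  metas: Record<string, unknown>[],
  predicate: (meta: Record<string, unknown>) => boolean,
  topK: number,
) {
  const hits: { id: number; score: number }[] = []
  for (let id = 0; id < vectors.length; id++) {
    if (!predicate(metas[id])) continue // cheap string/number comparison
    hits.push({ id, score: cosine(query, vectors[id]) })
  }
  return hits.sort((a, b) => b.score - a.score).slice(0, topK)
}
```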
### P99 spikes
The honest part: p99 latency spikes exist. WASM initialisation has a cold-start cost (mitigated by caching). And at 100k vectors, serialising the full index to JSON for persistence is the real bottleneck — not search.
## Building with AI
VecLite was built with Claude Code, and one file changed everything: CLAUDE.md.
### How CLAUDE.md changed my workflow
CLAUDE.md is a 257-line file at the root of the repo. It contains the architecture, every locked decision, the API surface, WASM boundary rules, security constraints, and what's deliberately deferred.
The effect was immediate. With CLAUDE.md in context, Claude Code stopped suggesting things I'd already rejected: no proposals to inline WASM as base64, no attempts to break the three-layer architecture, no re-inventing HNSW from scratch when the flat index was the right choice.
It turned the AI from a "clever junior who hasn't read the docs" into a collaborator who understood the project's constraints.
### What Claude Code is good at vs where it needs guidance
Where it excelled:
- Boilerplate with consistent patterns like adapter implementations, test scaffolding, TypeScript types
- wasm-bindgen ceremony, a.k.a. the glue code between JS and Rust
- Test generation: given a function and edge cases, it writes thorough test suites
- Documentation
Where it needs guidance:
- Rust SIMD intrinsics: the `core::arch::wasm32` `f32x4` path took iteration. AI-generated SIMD code often compiles but has subtle correctness issues (wrong horizontal sum, missing scalar tail)
- Crate selection for WASM targets: it suggested crates that depend on `rayon`, which is incompatible with `wasm32-unknown-unknown`. I had to manually verify WASM compatibility for every Rust dependency
- Performance-critical architecture decisions: the `u32` bit-pattern trick for HNSW distance metrics (reinterpreting `f32` as `u32` for the graph's comparison function, illustrated below) required domain expertise
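That last trick leans on an IEEE 754 property worth spelling out: for non-negative floats, the raw bit pattern read as an unsigned integer sorts in the same order as the float value. It's easy to demonstrate outside Rust (an illustration of the property, not VecLite's code):

```ts
// For non-negative IEEE 754 floats, comparing the u32 bit patterns
// gives the same ordering as comparing the floats themselves.
function f32Bits(x: number): number {
  const view = new DataView(new ArrayBuffer(4))
  view.setFloat32(0, x)
  return view.getUint32(0)
}

const distances = [0.12, 0.5, 1.73, 0.0]
const byBits = [...distances].sort((a, b) => f32Bits(a) - f32Bits(b))
console.log(byBits) // [0, 0.12, 0.5, 1.73], identical to a numeric sort
```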
## What's next
- `veclite/rag`: A batteries-included RAG pipeline as a sub-path export. Bring a document, get semantic search. Chunking, local embeddings via Transformers.js, and VecLite search under the hood. Zero config. No API keys. No data leaves the device.
- Chunked persistence: The current JSON blob works at 50k vectors but won't scale forever
- Web Worker support: Embedding already runs off-thread, but the core search path could benefit too
- React Native / Node.js: The architecture supports it; the plumbing doesn't exist yet
- Community adapters: SQLite, AsyncStorage, localStorage adapters following the 4-method interface
## Try it

```bash
npm install veclite
```
GitHub: github.com/thealpha93/VecLite
If you're building something with client-side vector search, I'd love to hear about it. Open an issue, submit a PR, or just star the repo so I know people care. This is being built in public, your feedback shapes the roadmap.
