I was getting 200ms latency on vector search with only 50,000 embeddings. For a drone that needs to recognize objects in <50ms, that's not a database — that's a liability.
So I did what any reasonable developer would do. I spent 3 weeks tuning HNSW parameters. ef_search, M, ef_construction — I tried every combination. I switched to IVF. I tried PQ (product quantization). I even implemented a custom filtering layer to skip low-score candidates early.
Nothing moved the needle. 180ms. 190ms. 210ms if the CPU was busy with sensor fusion.
Then I realized the problem wasn't the search algorithm. It was the index structure itself — and the fact that I was treating an embedded database like a server database.
The Setup: Vector Search on a Drone
I'm building moteDB, an embedded multi-modal database for edge AI. The use case: a drone needs to store and query:
- Vector embeddings (image patches, for object re-identification)
- Time-series data (telemetry: altitude, GPS, battery)
- State (mission waypoints, current task)
All on a Raspberry Pi 4 with 8GB RAM and a heatsink that's doing its best.
The vector search workload: given a query image, find the top-5 most similar patches from the last 10 minutes of flight. This is for visual odometry — if the drone loses GPS, it needs to recognize where it's been.
With 50,000 embeddings (128-dimensional, float32), a brute-force search takes ~8ms on the Pi 4. That's actually fine. But I wanted to support 500,000+ embeddings (for longer missions), so I needed an index.
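Those numbers line up with simple bandwidth math: 50,000 vectors × 128 dims × 4 bytes ≈ 25.6MB, and a sequential scan at the ~3GB/s of memory bandwidth the Pi 4 can realistically sustain works out to roughly 8ms. At 500,000 vectors that becomes ~256MB and ~80ms, which is why brute force stops being an option.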
Week 1: HNSW Tuning Hell
I started with HNSW (Hierarchical Navigable Small World), the go-to algorithm for vector search. Libraries like hnswrs and engines like Qdrant use it. Seemed like the right choice.
My first benchmark: 200ms for a single query. That's unacceptable for a drone that needs to make control decisions at 50Hz.
So I did what the internet told me to do — I tuned parameters:
- M=16, ef_construction=200: 200ms
- M=32, ef_construction=400: 180ms, but a 3x larger index
- M=8, ef_construction=100: 220ms; smaller index, but slower queries
- ef_search=50: faster (150ms), but recall dropped to 85%
- ef_search=200: slower (250ms), but 98% recall
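For anyone who hasn't been down this hole: these are the three knobs HNSW gives you, and they mostly trade latency, recall, and index size against each other. A sketch of what they mean (the config struct here is hypothetical shorthand; real crates expose equivalents under similar names):

```rust
/// The three standard HNSW knobs, named as in the original paper.
/// (Hypothetical config struct, for illustration only.)
struct HnswParams {
    /// Max neighbors per node per layer. Higher = denser graph:
    /// better recall, but a larger index and more memory traffic per hop.
    m: usize,
    /// Candidate-list size during index construction. Higher = better
    /// graph quality, much slower inserts.
    ef_construction: usize,
    /// Candidate-list size during queries. The recall/latency dial:
    /// 50 gave me 150ms at 85% recall, 200 gave 250ms at 98%.
    ef_search: usize,
}

fn main() {
    // A few of the combinations from the list above; none of them broke
    // 150ms at >95% recall on the Pi 4.
    for (m, ef_c, ef_s) in [(16, 200, 100), (32, 400, 100), (8, 100, 50)] {
        let p = HnswParams { m, ef_construction: ef_c, ef_search: ef_s };
        println!("M={} ef_c={} ef_s={}", p.m, p.ef_construction, p.ef_search);
    }
}
```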
No matter what I did, I couldn't get below 150ms with >95% recall. And that's for a single query — in production, the drone needs to run multiple queries concurrently (object detection + visual odometry + geofence checking).
Week 2: Trying Other Algorithms
At this point I was still committed to making HNSW work, but I was also starting to question the choice. So I benchmarked other algorithms:
IVF (Inverted File Index):
- Pro: Fast queries if you get nprobe right
- Con: Needs to be trained, and the clustering falls apart when embeddings are added dynamically (which happens on a drone in real time)
- Result: 120ms with nprobe=32, but recall was inconsistent (80-95% depending on data distribution)
PQ (Product Quantization):
- Pro: Compresses embeddings, less memory bandwidth
- Con: Lossy compression, and the quantization error is unpredictable
- Result: 90ms with 8-bit PQ, but recall dropped to 75% — unacceptable for visual odometry
Brute-force with SIMD:
- Pro: Perfect recall, and f32x4 SIMD helps
- Con: O(n) scan, doesn't scale
- Result: 8ms for 50K vectors, but 80ms for 500K. And that's just the vector search, not including the time-series or state queries
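That brute-force path is worth showing, because it's what ultimately informed the fix. A minimal sketch, assuming 128-dim vectors; the four independent accumulator lanes give the compiler an easy auto-vectorization target (the same shape you'd write by hand with f32x4 on nightly std::simd):

```rust
/// Squared L2 distance over 128-dim vectors. Four independent
/// accumulator lanes let the compiler auto-vectorize the loop.
fn l2_sq(a: &[f32; 128], b: &[f32; 128]) -> f32 {
    let mut acc = [0.0f32; 4];
    for i in (0..128).step_by(4) {
        for lane in 0..4 {
            let d = a[i + lane] - b[i + lane];
            acc[lane] += d * d;
        }
    }
    acc.iter().sum()
}

/// O(n) top-k scan: perfect recall, purely sequential memory access.
fn top_k(query: &[f32; 128], db: &[[f32; 128]], k: usize) -> Vec<(usize, f32)> {
    let mut best: Vec<(usize, f32)> = Vec::with_capacity(k + 1);
    for (id, v) in db.iter().enumerate() {
        let d = l2_sq(query, v);
        // Insert in sorted order, keep only the k closest.
        let pos = best.partition_point(|&(_, bd)| bd < d);
        if pos < k {
            best.insert(pos, (id, d));
            best.truncate(k);
        }
    }
    best
}
```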
Week 3: The Realization
I was staring at perf top output for the 100th time when I noticed something. The CPU wasn't spending time in the HNSW graph traversal (which is what I was optimizing). It was spending time in page cache miss handling.
Every time I queried the HNSW index, the Pi had to pull graph nodes from RAM (or worse, swap to microSD). The HNSW graph was ~200MB for 500K vectors, and it was randomly accessed — terrible for cache locality.
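The arithmetic is unforgiving: an HNSW query visits on the order of a few hundred graph nodes, each sitting on an effectively random 4KB page of that 200MB graph. If even a few dozen of those land on cold pages, and a microSD random read costs a millisecond or more, you're past 100ms before a single distance computation runs.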
The problem wasn't the algorithm. It was the index access pattern.
What I Did Wrong
I treated the index like a server-side structure. On a server with 64GB RAM and NVMe SSD, a 200MB randomly-accessed index is fine. The page cache handles it. On a Pi with 8GB RAM (and other processes using most of it), that same index causes page faults on every query.
I didn't account for concurrent queries. HNSW is fast for a single query, but when you run 3-4 queries concurrently, they compete for memory bandwidth. The Pi 4's memory controller is not designed for this.
I was storing vectors alongside the graph. Every graph node stored the full 128-dimensional vector (512 bytes). That's 256MB of vectors for 500K entries, plus the graph structure. Too much for the Pi's memory.
The Fix: Embedded-Aware Index Design
I realized I needed to redesign the index for embedded constraints:
1. Partitioned Storage
Instead of one global HNSW graph, I partitioned vectors by time window (10-minute buckets). Each bucket has its own small HNSW graph (~5MB for 5K vectors). Queries search the most recent N buckets (usually 3-5 for visual odometry).
This fixed the cache locality problem — the active bucket's graph fits in L2 cache.
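A sketch of the shape of it, with a hypothetical BucketIndex standing in for the per-bucket graph:

```rust
use std::collections::VecDeque;

/// One 10-minute time window. Small enough (~5MB for ~5K vectors)
/// that its graph stays cache-resident. Hypothetical type standing
/// in for a per-bucket HNSW graph.
struct BucketIndex {
    // ... small HNSW graph over this window's vectors ...
}

impl BucketIndex {
    fn search(&self, _query: &[f32; 128], _k: usize) -> Vec<(u64, f32)> {
        // ... (vector_id, distance) hits from this bucket's graph ...
        Vec::new()
    }
}

struct PartitionedIndex {
    buckets: VecDeque<BucketIndex>, // oldest at the front, newest at the back
}

impl PartitionedIndex {
    /// Search only the most recent `n` buckets (3-5 in practice for
    /// visual odometry), then merge per-bucket hits into a global top-k.
    fn search_recent(&self, query: &[f32; 128], k: usize, n: usize) -> Vec<(u64, f32)> {
        let mut hits: Vec<(u64, f32)> = self
            .buckets
            .iter()
            .rev()          // newest first
            .take(n)        // bound the work and the memory touched
            .flat_map(|b| b.search(query, k))
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1)); // ascending distance
        hits.truncate(k);
        hits
    }
}
```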
2. Vector Separation
I separated the graph structure (which needs random access) from the vector data (which is only accessed when a candidate is promising). The graph stores only vector IDs and distances; the actual vectors are stored sequentially and accessed only for final re-ranking.
This cut memory usage by 3x.
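Here's roughly what the split looks like; types and sizes are illustrative (assuming M=16 neighbors and 128-dim f32 vectors):

```rust
/// Graph side: topology only, no embedded vectors.
/// 16 neighbor IDs x 4 bytes = 64 bytes/node, instead of 64 + 512.
struct GraphNode {
    neighbors: [u32; 16], // vector IDs, not pointers or inline vectors
}

/// Vector side: one flat, contiguous array, read sequentially.
struct VectorStore {
    data: Vec<f32>, // 128 floats per vector, back to back
}

impl VectorStore {
    fn get(&self, id: u32) -> &[f32] {
        let start = id as usize * 128;
        &self.data[start..start + 128]
    }
}

/// The full vectors are touched exactly once, to re-rank the final
/// candidate set the graph traversal produced.
fn rerank(store: &VectorStore, query: &[f32], candidates: &mut Vec<(u32, f32)>) {
    for (id, dist) in candidates.iter_mut() {
        *dist = store
            .get(*id)
            .iter()
            .zip(query)
            .map(|(a, b)| (a - b) * (a - b))
            .sum();
    }
    candidates.sort_by(|a, b| a.1.total_cmp(&b.1)); // smallest distance first
}
```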
3. Preallocated Memory Pool
Instead of allocating graph nodes dynamically (which causes fragmentation and unpredictable page faults), I preallocate a memory pool at database initialization and lock it into RAM with mlock, so the kernel can't swap it out mid-flight.
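A minimal sketch of the idea, assuming Linux and the libc crate (error handling and alignment elided):

```rust
/// One big allocation at startup; graph nodes are carved out of it
/// with a bump allocator, so there's no fragmentation and no growth.
struct NodePool {
    buf: Vec<u8>,
    next: usize,
}

impl NodePool {
    fn new(capacity: usize) -> Self {
        let mut buf = vec![0u8; capacity];
        // Lock the pool into RAM so the kernel can't page it out to
        // microSD mid-flight. (mlock(2) via the libc crate; check the
        // return value in real code.)
        unsafe {
            libc::mlock(buf.as_mut_ptr() as *const libc::c_void, buf.len());
        }
        NodePool { buf, next: 0 }
    }

    /// Bump-allocate `size` bytes. Panics when the pool is exhausted,
    /// which on a drone beats an unpredictable page fault.
    fn alloc(&mut self, size: usize) -> &mut [u8] {
        let start = self.next;
        self.next += size;
        &mut self.buf[start..self.next]
    }
}
```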
Results
- Before: 200ms/query, 95% recall
- After: 12ms/query, 97% recall
- Memory: 60MB steady-state (instead of 200MB+ spiking)
What I Learned
If you're building vector search for embedded/edge scenarios:
Benchmark on target hardware. My initial benchmarks were on my MacBook Pro (M2, 32GB RAM). Everything looked great. On the Pi 4, it was a different story.
Cache locality > Algorithm complexity. An O(n) scan with good locality can outperform O(log n) with random access if your memory is constrained.
Don't copy server designs to embedded. HNSW is great for server-side vector search (Qdrant, Weaviate). But for embedded, you need to think about memory access patterns first, algorithm second.
Profile before optimizing. I wasted 2 weeks tuning HNSW parameters when the real bottleneck was page cache misses.
perf, htop, and /proc/meminfo are your friends.
The moteDB Approach
This experience shaped how I'm building moteDB. It's not just "a vector database" — it's a vector database designed for the constraints of embedded hardware:
- LSM-tree storage (not B-tree) for non-blocking writes
- Partitioned indexes for cache locality
- Preallocated memory pools to avoid unpredictable allocations
- Multi-modal storage (vectors + time-series + state in one engine) to avoid cross-process communication overhead
If you're working on edge AI and hitting performance walls with existing databases, I'd love to hear about your use case. The constraints are different from server-side AI, and the solutions need to be different too.
I'm building moteDB, an open-source embedded multi-modal database for edge AI. It's 100% Rust, Apache 2.0 licensed. Check it out at github.com/motedb — and if you're working on embodied AI or edge inference, I'd love to collaborate.