<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Kennon</title>
    <description>The latest articles on DEV Community by Andrew Kennon (@andrewkennon).</description>
    <link>https://dev.to/andrewkennon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3286897%2F7f9a3dac-ce2c-4073-8674-054153b3a40c.png</url>
      <title>DEV Community: Andrew Kennon</title>
      <link>https://dev.to/andrewkennon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrewkennon"/>
    <language>en</language>
    <item>
      <title>When RAG Meets Real-World Robotics Data</title>
      <dc:creator>Andrew Kennon</dc:creator>
      <pubDate>Mon, 21 Jul 2025 07:09:34 +0000</pubDate>
      <link>https://dev.to/andrewkennon/when-rag-meets-real-world-robotics-data-45eg</link>
      <guid>https://dev.to/andrewkennon/when-rag-meets-real-world-robotics-data-45eg</guid>
      <description>&lt;p&gt;I’ve been building AI systems for autonomous vehicles long enough to develop a love-hate relationship with retrieval-augmented generation (RAG). It’s a great concept — bring relevant context into your LLM prompt at runtime — but the second you move beyond text-heavy enterprise use cases into robotics or real-time perception, things get weird fast.&lt;/p&gt;

&lt;p&gt;Let’s talk about what happens when you try to apply RAG to high-dimensional, multimodal data, and why your choice of vector database can quietly make or break your pipeline.&lt;/p&gt;




&lt;h3&gt;
  
  
  Not All Embeddings Are Created Equal
&lt;/h3&gt;

&lt;p&gt;Most RAG tutorials use sentence-transformers or OpenAI embeddings on small textual corpora. But when you’re fusing LiDAR, radar, and camera inputs — or even running multimodal embeddings from perception models like Perceiver or CLIP — you’re suddenly dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2,048 to 4,096 dimensions per vector&lt;/li&gt;
&lt;li&gt;tens of millions of vectors per sensor window&lt;/li&gt;
&lt;li&gt;updates on the scale of milliseconds, not hours&lt;/li&gt;
&lt;/ul&gt;
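&lt;p&gt;To make that concrete, here’s the back-of-envelope math (my numbers, assuming FP32 storage and zero index overhead):&lt;/p&gt;

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw storage for n_vectors embeddings at the given dimensionality (FP32 default)."""
    return n_vectors * dims * bytes_per_dim

# 20M vectors at 4,096 dims, FP32:
total = raw_vector_bytes(20_000_000, 4096)
print(f"{total / 1e9:.1f} GB")  # 327.7 GB, before any index structures
```

That’s a third of a terabyte before you build a single index, which is why quantization stops being optional at this scale.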

&lt;p&gt;The vector DBs that look great on standard SIFT1M or Wikipedia benchmarks often collapse here. I’ve seen Milvus handle this scale better than most (especially with its tiered IVFPQ indexing), while something like Pinecone starts to choke unless you heavily batch and precompute everything.&lt;/p&gt;




&lt;h3&gt;
  
  
  Querying in the Chaos: Real-Time Constraints
&lt;/h3&gt;

&lt;p&gt;In AV systems, RAG isn’t just about semantic search — it’s about making the right decision &lt;em&gt;right now&lt;/em&gt;. Think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What similar trajectories did I see in prior encounters with a jaywalking pedestrian?”&lt;/li&gt;
&lt;li&gt;“Are there any annotated LiDAR clusters from edge cases similar to this object’s motion?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means your vector DB needs sub-50ms retrieval with high accuracy — and most importantly, &lt;strong&gt;low tail latency&lt;/strong&gt;. An index that hits 95% recall with a comfortable P50 but spikes to 800ms at P99 is a nonstarter. For me, that ruled out FAISS-on-disk solutions and pushed us toward in-memory hybrid setups, sometimes backed by Milvus or even Redis-AI when latency spikes were unacceptable.&lt;/p&gt;
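&lt;p&gt;If you want to see the P50/P99 gap for yourself, a minimal harness looks like this (&lt;code&gt;fake_search&lt;/code&gt; is a stand-in for whatever client you’re actually calling):&lt;/p&gt;

```python
import random
import time

import numpy as np

def measure_latencies(search, queries):
    """Time each query and return (P50, P99) latency in milliseconds."""
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        search(q)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return np.percentile(latencies, [50, 99])

def fake_search(q):
    # Stand-in for a real client call: occasionally slow, like a cold shard.
    time.sleep(0.001 if random.random() > 0.05 else 0.01)

p50, p99 = measure_latencies(fake_search, range(200))
print(f"P50={p50:.1f}ms P99={p99:.1f}ms")
```

The point isn’t the stub; it’s that you should be plotting the whole distribution, not a single mean.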




&lt;h3&gt;
  
  
  Hybrid Search Isn’t Optional
&lt;/h3&gt;

&lt;p&gt;Another trap: pure ANN (approximate nearest neighbor) isn’t enough. We need hybrid search — combining structured filters (e.g. location, object class, time window) with vector similarity — to avoid surfacing irrelevant results that are semantically close but contextually useless.&lt;/p&gt;
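&lt;p&gt;A toy sketch of what I mean by hybrid search (field names here are invented): apply the structured filter first, then rank the survivors by cosine similarity:&lt;/p&gt;

```python
import numpy as np

def hybrid_search(query_vec, records, object_class, top_k=3):
    """Structured filter stage first, then vector-similarity stage."""
    candidates = [r for r in records if r["object_class"] == object_class]
    if not candidates:
        return []
    mat = np.stack([r["vec"] for r in candidates])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = mat @ q  # cosine similarity on normalized vectors
    order = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in order]
```

Real engines do the filtering inside the index rather than in Python, but the semantics are the same: contextually wrong candidates never reach the similarity stage.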

&lt;p&gt;The systems I’ve liked best so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://milvus.io/" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt;&lt;/strong&gt;: Flexible filtering + multi-modal vector support + GPU acceleration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;&lt;/strong&gt;: Graph-aware queries and filters, good for chaining across knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;&lt;/strong&gt;: Surprisingly solid for real-time hybrid search, nice JSON filter DSL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, Chroma and LanceDB are great for lightweight prototyping but start to wobble under serious ingestion or query pressure.&lt;/p&gt;




&lt;h3&gt;
  
  
  What I’d Do Differently (And What I’d Keep)
&lt;/h3&gt;

&lt;p&gt;If I were rebuilding a RAG stack for AV today, here’s where I’d land:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HNSW-based indexes tuned for short queries&lt;/li&gt;
&lt;li&gt;Streaming ingestion pipelines with nightly reindexing&lt;/li&gt;
&lt;li&gt;Embedding normalization (even small vector scale issues cascade fast)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Change:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use separate DBs for long-term recall vs short-term context&lt;/li&gt;
&lt;li&gt;Bake in observability for query latency distribution — not just mean/median&lt;/li&gt;
&lt;li&gt;Use hybrid pipelines: Redis or Vespa for immediate low-latency + Milvus for batch-heavy recall&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Final Thought
&lt;/h3&gt;

&lt;p&gt;RAG in robotics isn’t just a language problem — it’s a systems problem. The tech that works for enterprise chatbots often breaks under the weight of real-time perception and control loops. But with the right infra — and a vector DB that understands filters, scale, and latency — it’s not just possible. It’s damn useful.&lt;/p&gt;

&lt;p&gt;If you’re working on similar problems (or have war stories from trying RAG with non-text data), I’d love to swap notes.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Lab Toy to Core Infra: Why Vector Databases Became Default in My AD Projects</title>
      <dc:creator>Andrew Kennon</dc:creator>
      <pubDate>Mon, 14 Jul 2025 09:28:21 +0000</pubDate>
      <link>https://dev.to/andrewkennon/from-lab-toy-to-core-infra-why-vector-databases-became-default-in-my-ad-projects-58cn</link>
      <guid>https://dev.to/andrewkennon/from-lab-toy-to-core-infra-why-vector-databases-became-default-in-my-ad-projects-58cn</guid>
      <description>&lt;p&gt;I used to treat vector databases like a novelty — good for academic demos or a flashy product prototype. Definitely not something I’d trust in a mission-critical stack, especially in autonomous driving where latency budgets are brutal and edge cases rule everything.&lt;/p&gt;

&lt;p&gt;But somewhere between building yet another scene retrieval pipeline and rewriting my own ANN glue code for the fifth time, vector DBs matured. Or maybe I just got tired of reinventing the same thing with FAISS and Redis.&lt;/p&gt;

&lt;p&gt;Either way, they’re in my stack now — for semantic search, intent classification, offline query replay, even some perception data filtering. Here’s how they earned their place.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Search Needs Got Weirder Than “Top-5 Similar”
&lt;/h3&gt;

&lt;p&gt;In AD systems, especially those with human-in-the-loop tools (e.g., labeling UI, validation dashboards), search isn’t just about vector proximity. You often want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Find me all LiDAR scenes with fog that confused the neural planner”&lt;/li&gt;
&lt;li&gt;“Search for similar failure cases — but only from vehicles with the same camera calibration”&lt;/li&gt;
&lt;li&gt;“Pull historical conversations where the driver reported ‘not feeling in control’ — even if they didn’t use those exact words”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these work well with classic keyword matching or naive ANN-only lookup. I need hybrid queries — fast vector search &lt;em&gt;and&lt;/em&gt; structured filters. That’s where vector databases like &lt;strong&gt;Milvus&lt;/strong&gt;, &lt;strong&gt;Qdrant&lt;/strong&gt;, and &lt;strong&gt;Weaviate&lt;/strong&gt; started making sense.&lt;/p&gt;

&lt;p&gt;Postgres + pgvector? I tried. It’s okay for low-QPS analytics queries. But it gets crushed when you scale up or need low latency.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. What Actually Worked in My Benchmarks
&lt;/h3&gt;

&lt;p&gt;I ran real tests on 10M+ 768-dim vectors (text+sensor fusion output), using &lt;code&gt;m6id.2xlarge&lt;/code&gt; on AWS. Here’s what I got for recall vs throughput vs memory:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index Type&lt;/th&gt;
&lt;th&gt;Recall (%)&lt;/th&gt;
&lt;th&gt;QPS&lt;/th&gt;
&lt;th&gt;Memory/Vector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IVF_FLAT (FP32)&lt;/td&gt;
&lt;td&gt;95.2&lt;/td&gt;
&lt;td&gt;236&lt;/td&gt;
&lt;td&gt;3,072 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF_SQ8&lt;/td&gt;
&lt;td&gt;94.1&lt;/td&gt;
&lt;td&gt;611&lt;/td&gt;
&lt;td&gt;768 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IVF_RABITQ&lt;/td&gt;
&lt;td&gt;76.3&lt;/td&gt;
&lt;td&gt;898&lt;/td&gt;
&lt;td&gt;96 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RABITQ + SQ8&lt;/td&gt;
&lt;td&gt;94.7&lt;/td&gt;
&lt;td&gt;864&lt;/td&gt;
&lt;td&gt;96 + 768 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are from Milvus with hybrid filters and metadata attached. I could keep 100M+ vectors online and still get under-10ms latency on common queries — something I couldn’t hit reliably with Weaviate or Pinecone when adding structured constraints.&lt;/p&gt;
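&lt;p&gt;The Memory/Vector column is easy to sanity-check for 768-dim vectors (my reading: FP32 is 4 bytes/dim, SQ8 is 1 byte/dim, and RABITQ is roughly 1 bit/dim):&lt;/p&gt;

```python
# Per-vector storage for the index types in the table above, at 768 dims.
dims = 768
fp32_bytes = dims * 4      # IVF_FLAT (FP32) row: 3,072 bytes
sq8_bytes = dims * 1       # IVF_SQ8 row: 768 bytes
rabitq_bytes = dims // 8   # IVF_RABITQ row: ~96 bytes (1 bit per dim)
print(fp32_bytes, sq8_bytes, rabitq_bytes)  # 3072 768 96
```

The RABITQ + SQ8 row is the two smaller footprints combined: the binary code for fast candidate generation plus the SQ8 copy for refinement.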




&lt;h3&gt;
  
  
  3. You Can’t Cheat Hybrid Search
&lt;/h3&gt;

&lt;p&gt;Every vendor now claims “hybrid search support.” But here’s what that actually means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Milvus&lt;/strong&gt;: True hybrid execution. You can filter on metadata + vector with one query. SQL-like interface helps. Big win if you need production control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant&lt;/strong&gt;: Also solid. Filters are expressive, and Rust backend flies. Just be careful with multi-field combinations — debugging errors can be opaque.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate&lt;/strong&gt;: Flexible schema and GraphQL are neat, but hybrid joins can be flaky under load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt;: Honestly? Great uptime, but very much a black box. Not ideal if you need low-level index tuning or want to reason about query paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my vehicle incident retrieval tooling, hybrid filters are &lt;em&gt;non-negotiable&lt;/em&gt;. I need to narrow results to “same model year,” “rainy weather,” or “disabled radar” &lt;em&gt;before&lt;/em&gt; doing ANN. Otherwise I get nonsense.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. What Made Me Trust It in Production
&lt;/h3&gt;

&lt;p&gt;The real turning point wasn’t just benchmark performance — it was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Durability&lt;/strong&gt;: Can I shut down the node and not lose 10M vectors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index reusability&lt;/strong&gt;: Can I train an index once, persist it, and reuse it across environments?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration support&lt;/strong&gt;: Is there a Python SDK that doesn’t feel like a student project?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Does it give me metrics, logs, and alerts when a query fails?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Milvus (especially Zilliz Cloud) checked most of those. I still had to do some index config trial-and-error (RABITQ vs HNSW vs IVF_SQ8), but once tuned, it stuck. Redis+FAISS, by comparison, felt like building a transmission from scratch just to drive to the store.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaways (From the Autonomous Driving Trenches)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;If you're building anything involving sensor data retrieval, semantic logs, or multimodal fusion — a vector DB is probably worth it.&lt;/li&gt;
&lt;li&gt;Milvus is my current go-to, but Qdrant is catching up fast. If you want zero-infra headaches, Pinecone or Zilliz Cloud are decent bets.&lt;/li&gt;
&lt;li&gt;Don’t fall for benchmarks without filters — hybrid search is where things get interesting (and painful).&lt;/li&gt;
&lt;li&gt;Plan your index strategy up front. Retrofitting after launch sucks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wouldn’t call vector databases “solved” tech — but they’re finally usable. Not perfect, not plug-and-play, but good enough that I stopped rebuilding my own.&lt;/p&gt;

&lt;p&gt;Curious if others in AD or robotics have put these into production too — what worked? What didn’t? Always down to trade scars.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Working with Messy Embeddings in Real Systems: A Quick Post from Today's Debug Session</title>
      <dc:creator>Andrew Kennon</dc:creator>
      <pubDate>Mon, 07 Jul 2025 06:49:50 +0000</pubDate>
      <link>https://dev.to/andrewkennon/working-with-messy-embeddings-in-real-systems-a-quick-post-from-todays-debug-session-5ddc</link>
      <guid>https://dev.to/andrewkennon/working-with-messy-embeddings-in-real-systems-a-quick-post-from-todays-debug-session-5ddc</guid>
      <description>&lt;p&gt;Today was supposed to be a routine day. I was reviewing some logs for a multi-modal retrieval pipeline we’ve been running—camera images, lidar frames, and a few NLP tags all go into a vector store for downstream search. Pretty standard setup, right?&lt;/p&gt;

&lt;p&gt;But then the recall dropped. Quietly. No errors, no crashes, just… worse results.&lt;/p&gt;

&lt;p&gt;Turns out, this whole thing was caused by a seemingly small detail: &lt;strong&gt;inconsistent embedding norms from different modalities&lt;/strong&gt;. It sent me down a 3-hour rabbit hole involving cosine distances, vector scaling, and my own past assumptions about database behavior. Here’s what I learned (again).&lt;/p&gt;




&lt;h2&gt;
  
  
  Context: The Setup
&lt;/h2&gt;

&lt;p&gt;We’re storing multi-modal embeddings into a vector database—specifically, lidar-to-text retrieval for a roadside perception system. Each data point looks roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;image_embedding&lt;/code&gt;: 512-dim vision encoder output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lidar_embedding&lt;/code&gt;: 256-dim learned BEV encoder output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;text_embedding&lt;/code&gt;: 768-dim from a BERT variant&lt;/li&gt;
&lt;li&gt;Metadata: GPS, weather, scenario tags, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system uses &lt;strong&gt;Milvus&lt;/strong&gt; (v2.3) with HNSW for approximate search. Each modality goes into its own collection, but the RAG pipeline combines results at query time via re-ranking.&lt;/p&gt;
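&lt;p&gt;The query-time combine step is conceptually simple; a rough sketch (shapes and scoring are simplified here, not our production code):&lt;/p&gt;

```python
def merge_hits(per_modality_hits, top_k=5):
    """Merge (id, score) hits from several collections: keep the best
    score per id, then re-rank the union by score."""
    best = {}
    for hits in per_modality_hits:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

This max-score merge is the simplest option; weighted sums or a learned re-ranker slot into the same place.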




&lt;h2&gt;
  
  
  The Problem: Recall Drift
&lt;/h2&gt;

&lt;p&gt;We noticed that queries with natural language inputs (e.g. "car parked under bridge in fog") were retrieving fewer relevant lidar segments than expected. Visual embeddings still worked well, but lidar retrieval became noticeably noisier.&lt;/p&gt;

&lt;p&gt;The embeddings were going in, indexes were fine, metadata filters were working. So what changed?&lt;/p&gt;

&lt;h3&gt;
  
  
  The culprit: vector magnitude variance.
&lt;/h3&gt;

&lt;p&gt;Some of our lidar embeddings had significantly lower norms (around 0.5–1.2), while the text embeddings were tightly clustered around 7–9.&lt;br&gt;
Cosine similarity, which we used for all retrievals, is theoretically scale-invariant—but in practice, &lt;strong&gt;index-level normalization matters&lt;/strong&gt;, especially when mixed with filtered + hybrid queries.&lt;/p&gt;
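&lt;p&gt;Here’s a two-dimensional toy case showing the failure mode: once norms differ across modalities, dot-product ranking and cosine ranking pick different winners:&lt;/p&gt;

```python
import numpy as np

q = np.array([1.0, 0.0])
a = np.array([0.9, 0.1])  # nearly parallel to q, small norm (lidar-like)
b = np.array([5.0, 5.0])  # 45 degrees off, but a big norm (text-like)

dot_winner = "a" if q @ a > q @ b else "b"
cos = lambda u, v: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
cos_winner = "a" if cos(q, a) > cos(q, b) else "b"
print(dot_winner, cos_winner)  # dot product picks b, cosine picks a
```

If any stage scores with the raw dot product, the big-norm vector wins regardless of direction, which is exactly the noise we were seeing in lidar retrieval.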




&lt;h2&gt;
  
  
  Lessons Learned (or Re-Learned)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Always normalize before insert. Always.
&lt;/h3&gt;

&lt;p&gt;I had assumed that the downstream ingestion code was already L2-normalizing the embeddings. It wasn’t. And even though cosine distance is supposed to ignore magnitude, many ANN libraries (including FAISS and Milvus’s HNSW) &lt;strong&gt;use raw dot product internally&lt;/strong&gt; and normalize at query time only.&lt;/p&gt;

&lt;p&gt;Result? Insert-time magnitude variance = weird scoring behavior.&lt;/p&gt;

&lt;p&gt;Fix: added &lt;code&gt;embedding = embedding / np.linalg.norm(embedding)&lt;/code&gt; before inserts. Immediately improved recall by ~15%.&lt;/p&gt;
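&lt;p&gt;For batched inserts, the same fix with a zero-norm guard (a degenerate all-zeros embedding would otherwise turn into NaNs on the way into the index):&lt;/p&gt;

```python
import numpy as np

def l2_normalize(batch, eps=1e-12):
    """Row-wise L2 normalization with a guard against zero-norm rows."""
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    return batch / np.maximum(norms, eps)
```

Run it as the last step before insert, after every encoder, so no single team’s pipeline can reintroduce the drift.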




&lt;h3&gt;
  
  
  2. Vector DBs don’t protect you from messy upstream models
&lt;/h3&gt;

&lt;p&gt;No matter how good your vector database is, it doesn’t validate the statistical properties of your data. If your embedding distribution drifts (like ours did after a model retrain), the index won’t scream at you. It’ll just… get worse.&lt;/p&gt;

&lt;p&gt;In this case, the new lidar encoder was producing vectors on a much smaller scale. Nothing broke, but everything degraded.&lt;/p&gt;

&lt;p&gt;Takeaway: embedding stats should be part of CI. Track means, norms, sparsity, drift. It’s cheap and saves hours later.&lt;/p&gt;
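&lt;p&gt;A minimal version of that CI check, assuming you persist a baseline of norm stats per encoder version (the 0.25 relative tolerance is arbitrary; tune it for your models):&lt;/p&gt;

```python
import numpy as np

def embedding_stats(batch):
    """Cheap summary stats worth tracking per encoder version."""
    norms = np.linalg.norm(batch, axis=1)
    return {"mean_norm": float(norms.mean()), "std_norm": float(norms.std())}

def drifted(stats, baseline, tolerance=0.25):
    """Flag if the mean norm moved more than `tolerance` (relative) vs baseline."""
    rel = abs(stats["mean_norm"] - baseline["mean_norm"]) / baseline["mean_norm"]
    return rel > tolerance
```

This would have caught our lidar retrain immediately: a mean norm dropping from ~8 to ~1 is a huge relative shift.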




&lt;h3&gt;
  
  
  3. Metadata filters can mask retrieval bugs
&lt;/h3&gt;

&lt;p&gt;When recall dropped, our re-ranking + metadata filtering kept returning "reasonable" results, which made debugging harder. The top-3 looked OK—until we noticed they were all from the same location tag.&lt;/p&gt;

&lt;p&gt;Moral: if you're using metadata filters (which you should), &lt;strong&gt;test recall both with and without filters&lt;/strong&gt;. Otherwise, you’re debugging the wrong component.&lt;/p&gt;
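&lt;p&gt;In practice that just means computing recall@k twice, once unfiltered and once with the metadata filter applied; a large gap between the two points at the filter logic, not the index:&lt;/p&gt;

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]).intersection(relevant_ids)
    return len(hits) / len(relevant_ids)

# Same query, scored twice. The ids here are made up for illustration.
unfiltered = recall_at_k(["a", "b", "c"], {"a", "c", "d"}, k=3)
filtered = recall_at_k(["a", "x", "y"], {"a", "c", "d"}, k=3)
print(round(unfiltered, 2), round(filtered, 2))  # 0.67 0.33
```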




&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;No, this wasn’t a massive failure. It was one of those slow, silent bugs that creep into production pipelines when different teams train models, build retrievers, and wire up search logic. Nothing crashed—but the user experience got worse.&lt;/p&gt;

&lt;p&gt;I’m sharing this mostly to remind myself (and maybe you) that &lt;strong&gt;ANN infrastructure is only as good as the vectors you feed it&lt;/strong&gt;. And the most boring parts—like normalization—still bite you the hardest.&lt;/p&gt;




&lt;p&gt;If you’ve run into similar issues with mixed-modality embeddings or have better ways to track embedding drift, I’m all ears. Thinking of adding some lightweight checksums or vector histograms to our monitoring pipeline next.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Evaluating Vector Databases in Real Systems</title>
      <dc:creator>Andrew Kennon</dc:creator>
      <pubDate>Thu, 03 Jul 2025 09:14:24 +0000</pubDate>
      <link>https://dev.to/andrewkennon/evaluating-vector-databases-in-real-systems-p1i</link>
      <guid>https://dev.to/andrewkennon/evaluating-vector-databases-in-real-systems-p1i</guid>
      <description>&lt;p&gt;Over the past couple years, I’ve integrated vector databases into several autonomous driving and LLM-based perception workflows — think multimodal RAG pipelines, scene retrieval across lidar+camera streams, or sensor signature matching. These workloads aren’t your average chatbot demos; they demand high recall, stable latency, and the ability to filter by time, location, or sensor type.&lt;/p&gt;

&lt;p&gt;So I’ve been watching the vector DB landscape pretty closely. Below is a benchmark-grounded evaluation of what’s out there, with some firsthand perspective on what actually works when you’re pushing real data through these systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Evaluate (And Why It Matters)
&lt;/h2&gt;

&lt;p&gt;I care about five things when selecting a vector DB for production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query performance&lt;/strong&gt; — Low latency matters, especially when you’re feeding LLMs in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; — In robotics, wrong retrievals mean wrong plans. I need to trust the top-k.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert + index time&lt;/strong&gt; — If I’m syncing sensor data every few seconds, I don’t want indexing to be the bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; — Millions of vectors from real-world scenes pile up fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functionality&lt;/strong&gt; — Filtering, hybrid queries, stability under load — all must-haves.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I use results from &lt;strong&gt;ANN-Benchmark&lt;/strong&gt; and &lt;strong&gt;VectorDBBench&lt;/strong&gt;, with embeddings ranging from 960 to 1536 dimensions — roughly what you’d expect from vision transformers or OpenAI’s text models.&lt;/p&gt;
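&lt;p&gt;For context, recall in these benchmarks is scored against exact brute-force neighbors; a naive ground-truth pass looks like this (fine for small corpora, hopeless at production scale, which is the whole point of ANN):&lt;/p&gt;

```python
import numpy as np

def ground_truth_topk(queries, corpus, k=10):
    """Exact top-k neighbors by L2 distance: the benchmark ground truth."""
    dists = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :k]
```

An ANN index’s recall@k is then the overlap between its top-k ids and these exact ids, averaged over queries.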




&lt;h2&gt;
  
  
  &lt;a href="https://zilliz.com/" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt; / Milvus
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;This one consistently performs well on both recall and throughput, especially with disk-based indexing and hybrid search. I’ve used Milvus in a lidar-tagged object retrieval task (think “find similar scenes where the ego vehicle was overtaken”) — and the filtering capabilities really helped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High QPS and recall in real-world tests (VectorDBBench)&lt;/li&gt;
&lt;li&gt;Supports hybrid queries (e.g., timestamp &amp;gt; t &amp;amp;&amp;amp; similarity &amp;gt; x)&lt;/li&gt;
&lt;li&gt;Scales well for large workloads&lt;/li&gt;
&lt;li&gt;Milvus (open-source) for flexibility, Zilliz (cloud) for lower ops&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Self-hosted setup is non-trivial (you’re running Pulsar and etcd)&lt;/li&gt;
&lt;li&gt;Zilliz Cloud simplifies deployment but limits deep tuning (e.g., IVF params)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;I tested Weaviate in a cross-modal search prototype — matching dashcam clips to driving logs using metadata filters. It handled hybrid queries well, though indexing took longer than I expected during rapid ingestion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Good recall/QPS balance in most benchmarks&lt;/li&gt;
&lt;li&gt;First-class support for metadata filtering and multimodal modules&lt;/li&gt;
&lt;li&gt;Friendly APIs (GraphQL, REST), quick to start&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Index construction slower on large or dynamic datasets&lt;/li&gt;
&lt;li&gt;Memory usage can spike during concurrent reads&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;Pinecone feels like a SaaS-native solution. I tried it in a project where fast prototyping mattered more than custom indexing. It worked — but I hit walls when I wanted to tune performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy to deploy, zero ops&lt;/li&gt;
&lt;li&gt;Solid QPS under moderate load&lt;/li&gt;
&lt;li&gt;Strong ecosystem integration (LangChain, OpenAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Indexing parameters not exposed — you get what you get&lt;/li&gt;
&lt;li&gt;Recall underperforms slightly on larger or high-dimensional datasets&lt;/li&gt;
&lt;li&gt;Cost becomes a concern once you scale past toy projects&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;This one surprised me. I ran a batch object similarity task on CPU (no GPU) and it held up pretty well. Still rough around the edges on some features, but promising for edge use or CPU-bound systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Efficient insert/search on commodity hardware (Rust backend)&lt;/li&gt;
&lt;li&gt;REST/gRPC APIs, filtering supported&lt;/li&gt;
&lt;li&gt;Lightweight and open-source&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited support for hybrid search with complex schemas&lt;/li&gt;
&lt;li&gt;Ecosystem not as mature as Milvus or Weaviate (yet)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  FAISS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;I still use FAISS when I want to test indexing strategies in isolation. But I wouldn’t use it in a full production loop — no filtering, no persistence, no service layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Excellent for raw ANN algorithm comparisons&lt;/li&gt;
&lt;li&gt;GPU support, fast brute-force testing&lt;/li&gt;
&lt;li&gt;Customizable index combinations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not a database: no filters, no auth, no scaling&lt;/li&gt;
&lt;li&gt;Can’t be used standalone in production workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Chroma
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I’ve Seen
&lt;/h3&gt;

&lt;p&gt;Nice for LangChain demos. I once used it in a hackathon to build a doc-based LLM assistant. Fast to start, but hit scaling limits quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Minimal setup, beginner-friendly&lt;/li&gt;
&lt;li&gt;Works well for prototypes or one-off use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Weak on indexing and recall with larger datasets&lt;/li&gt;
&lt;li&gt;Missing core DB features like hybrid filters and distributed execution&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vector DB&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;QPS&lt;/th&gt;
&lt;th&gt;Indexing&lt;/th&gt;
&lt;th&gt;Hybrid Search&lt;/th&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Milvus / Zilliz&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;OSS + Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;OSS + Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Cloud only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;OSS + Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAISS&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Local only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chroma&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Local / Prototypes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;For robotics, perception, or any RAG-heavy pipeline, my current picks look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Milvus/Zilliz&lt;/strong&gt; if I need indexing performance + hybrid filtering at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate&lt;/strong&gt; if schema flexibility and metadata filtering are key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant&lt;/strong&gt; if I’m deploying on the edge or working CPU-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pinecone&lt;/strong&gt; if I want managed infrastructure and don’t mind tradeoffs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, nothing beats testing with your own embeddings and real query patterns. Benchmarks help, but your workload always tells the truth.&lt;/p&gt;

&lt;p&gt;Let me know if you want a breakdown by use case — like fraud detection vs. vision search vs. conversational RAG. I’ve tested across a few and the performance shifts depending on the shape of your vectors and latency needs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Autonomous Driving Tech: Who’s Actually Winning in 2025?</title>
      <dc:creator>Andrew Kennon</dc:creator>
      <pubDate>Mon, 30 Jun 2025 09:19:51 +0000</pubDate>
      <link>https://dev.to/andrewkennon/autonomous-driving-tech-whos-actually-winning-in-2025-4ch</link>
      <guid>https://dev.to/andrewkennon/autonomous-driving-tech-whos-actually-winning-in-2025-4ch</guid>
      <description>&lt;h1&gt;
  
  
  2025 Autonomous Driving Leaderboard — A Practical, Technical Ranking
&lt;/h1&gt;

&lt;p&gt;I’ve spent the past few years deep in the weeds of AI system integration for autonomous vehicles — mostly working on sensor fusion and neural planning stacks. And while the headlines keep swinging between “self-driving is dead” and “AI will solve it all,” the truth is way more nuanced.&lt;/p&gt;

&lt;p&gt;So here’s a real ranking — not based on hype or stock price, but on what’s actually deployed, how the tech works, and how well it scales.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’m Ranking — and Why
&lt;/h2&gt;

&lt;p&gt;Forget “Is it Level 4 or 5?” These are the five technical criteria that actually matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perception&lt;/strong&gt;: Sensor fusion, occlusion handling, extreme conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decision &amp;amp; Control&lt;/strong&gt;: Driving policy intelligence, smoothness, human-likeness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Architecture&lt;/strong&gt;: Rule-based vs. end-to-end; data flywheel maturity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Scale&lt;/strong&gt;: Real-world deployment footprint — not just test demos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Can it generalize to new cities, new cars, new situations?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2025 Leaderboard (Narrative Style)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  #1 — Waymo
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gold standard in safety, smoothness, and robustness.&lt;/li&gt;
&lt;li&gt;Mature sensor fusion, great occlusion handling.&lt;/li&gt;
&lt;li&gt;Slow, expensive expansion is their biggest weakness.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #2 — Tesla FSD v12+
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pure end-to-end transformer stack — no lidar, no HD maps.&lt;/li&gt;
&lt;li&gt;Unmatched improvement rate due to fleet-scale data.&lt;/li&gt;
&lt;li&gt;Still brittle with weird edge cases, pedestrians, and turns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #3 — Cruise (Post-Reset)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strong planning stack, especially in dense urban areas.&lt;/li&gt;
&lt;li&gt;Setback after 2023 incident, public trust/reputation damaged.&lt;/li&gt;
&lt;li&gt;Rebuilding mode, but core tech still solid.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #4 — XPeng XNGP
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strong BEV-based perception and memory-style planning.&lt;/li&gt;
&lt;li&gt;OTA updates frequent; impressive highway+city integration.&lt;/li&gt;
&lt;li&gt;Still too rule-heavy and less robust in unmapped zones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #5 — Huawei ADS 2.0
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Too engineered" — great in well-mapped areas.&lt;/li&gt;
&lt;li&gt;Relies heavily on lidar + HD maps.&lt;/li&gt;
&lt;li&gt;Lacks flexibility outside coverage zones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #6 — Baidu Apollo Go
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cost-efficient, city-scaled robotaxi service.&lt;/li&gt;
&lt;li&gt;Rule-based, HD map-heavy planning.&lt;/li&gt;
&lt;li&gt;Less adaptable than Tesla/XPeng in novel situations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  #7 — Mobileye SuperVision
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;More ADAS than AV, but worth mentioning.&lt;/li&gt;
&lt;li&gt;Plug-and-play scale with global OEMs.&lt;/li&gt;
&lt;li&gt;Perception stack is world-class; autonomy is limited.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Who’s Actually Doing End-to-End Neural Driving?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Company&lt;/th&gt;
&lt;th&gt;Planning Type&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tesla&lt;/td&gt;
&lt;td&gt;End-to-end transformer&lt;/td&gt;
&lt;td&gt;Outputs control tokens directly from video + vehicle state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wayve&lt;/td&gt;
&lt;td&gt;End-to-end + LLM&lt;/td&gt;
&lt;td&gt;Explains decisions in natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Others&lt;/td&gt;
&lt;td&gt;Classical stack&lt;/td&gt;
&lt;td&gt;Perception → planning → control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Personal Testing Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tesla FSD v12.3.6 (Bay Area)&lt;/strong&gt;: Smooth suburban driving, but struggles with weird U-turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waymo (SF)&lt;/strong&gt;: Still the smoothest and most confident rides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XPeng G9 (Guangzhou)&lt;/strong&gt;: Great in mapped zones; fragile in new areas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cruise (Austin, pre-incident)&lt;/strong&gt;: Polished, but sometimes too cautious (e.g., freezes at crosswalks).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where This Is Headed (2025–2026 Bets)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal BEV + LLM Fusion&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Spatial reasoning + language-based policy → more explainable driving logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Closed-Loop Training Pipelines&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Simulation + auto-labeling at fleet scale. Tesla is miles ahead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-Map Urban Generalization&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Whoever nails robust, map-free city driving wins the long game.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How many miles until takeover?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We should be asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can this system outperform average human drivers in daily driving — and fail gracefully when it can’t?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Waymo = Safest and most refined&lt;/li&gt;
&lt;li&gt;Tesla = Boldest and fastest evolving&lt;/li&gt;
&lt;li&gt;XPeng/Huawei/Baidu = Scaling fast in China, each with unique trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve tested these systems or want a deeper dive into any specific stack — like LLM planners, BEV fusion, or how Tesla tokenizes control — let me know. Always down to go deeper.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
