Kailash Sankar

Posted on Jun 21

# Vector Search and RAG: A Primer

#programming #vectordatabase #ai

A short learning path from a weekend project: I indexed my personal markdown notes (~800 chunks), tried a few local embedding models, stored the same vectors in four different backends, and wired up simple RAG. Not a production guide — just the basics, with honest results from a corpus small enough to reason about.

The idea, without the jargon pile

Keyword search looks for shared words. Vector search converts text into a list of numbers (an embedding), treats that list as a point in space, and finds nearby points. Similar meaning → nearby, even when the words differ.

That is the retrieval half of RAG (Retrieval-Augmented Generation):

your docs → split into chunks → embed → store in a vector index
your question → embed → find nearest chunks → pass chunks to an LLM → answer

The vector database does not understand language. It only stores vectors and finds neighbors. All the "meaning" comes from the embedding model you chose upstream.

What I indexed


Field	Details
Corpus	My personal dev-notes wiki: database notes, system design summaries, frontend/backend cheatsheets
Size	~50 markdown files → ~800 chunks after splitting
Chunking	Split on headings, then ~800-character windows with ~120-character overlap
Goal	Each retrieved hit should be a readable paragraph, not a whole file

Chunking matters as much as the model. Bad splits give you the right topic in the wrong paragraph — and no amount of vector magic fixes that.
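The splitting described in the table above can be sketched roughly as follows. This is a minimal illustration, not the exact chunker used for these notes: the function names (`split_on_headings`, `window`, `chunk_markdown`) are hypothetical, the heading detection is a simple regex that would also trip on `#` comment lines inside fenced code, and the 800/120 defaults only mirror the approximate sizes quoted above.

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split a markdown document into sections, one per heading."""
    # Zero-width split before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


def window(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Cut text into fixed-size character windows that overlap their neighbor."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_markdown(markdown: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Headings first, then sliding windows inside each section."""
    chunks = []
    for section in split_on_headings(markdown):
        chunks.extend(window(section, size, overlap))
    return chunks


# Toy document: one long section and one short one.
doc = "# Replication\n" + "x" * 1000 + "\n## Failover\nShort section."
for chunk in chunk_markdown(doc):
    print(len(chunk), repr(chunk[:20]))
```

Character windows can cut mid-sentence; snapping window edges to paragraph or sentence boundaries is a common refinement once the basic pipeline works.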

Step 1 — Prove the pipeline locally (sqlite-vec)

I started with the smallest possible stack: a local embedding model (runs on my laptop, no API key) and sqlite-vec — vectors stored in a SQLite file, cosine similarity search in SQL. No Docker, no server.

First win: I searched for "clickhouse merge tree vs mysql" and the top hit was my ClickHouse notes — a comparison table about column-oriented storage vs row stores. No shared keywords required; the embedding captured the intent.

What clicked:

One chunk → one point in space. For the model I ended up preferring, that is 384 numbers: not one number per word, but one set of coordinates for the whole paragraph.
Lower cosine distance = better match (think angle between directions, not km/miles on a map).
At ~800 chunks, brute-force "compare to every vector" is fast enough. You do not need fancy indexes on day one.
Even sqlite-vec splits metadata (file path, text) from vectors — the same pattern every vector DB uses, just with different names.
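To make the points above concrete, here is a small sketch of brute-force cosine search with metadata and vectors kept in separate tables. It deliberately uses only Python's standard `sqlite3` module and computes distances in Python, so it is not sqlite-vec's SQL API; the table names and the toy 3-dimensional vectors are made up for illustration (real embeddings would have 384 or more dimensions).

```python
import math
import sqlite3
from array import array


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


db = sqlite3.connect(":memory:")
# Metadata in one table, vectors in another, joined by chunk id.
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, text TEXT)")
db.execute("CREATE TABLE vectors (chunk_id INTEGER PRIMARY KEY, embedding BLOB)")

# Hypothetical chunks with hand-written toy vectors.
toy = [
    ("clickhouse.md", "Column stores vs row stores", [0.9, 0.1, 0.0]),
    ("react.md", "Memoization and re-renders", [0.0, 0.2, 0.9]),
    ("replication.md", "Leader and follower replicas", [0.3, 0.9, 0.1]),
]
for i, (path, text, vec) in enumerate(toy, start=1):
    db.execute("INSERT INTO chunks VALUES (?, ?, ?)", (i, path, text))
    db.execute("INSERT INTO vectors VALUES (?, ?)", (i, array("f", vec).tobytes()))


def search(query_vec, k=2):
    """Brute force: score every stored vector, keep the k closest."""
    scored = []
    for chunk_id, blob in db.execute("SELECT chunk_id, embedding FROM vectors"):
        vec = array("f")
        vec.frombytes(blob)
        scored.append((cosine_distance(query_vec, vec), chunk_id))
    scored.sort()  # lower distance first
    results = []
    for dist, chunk_id in scored[:k]:
        path, text = db.execute(
            "SELECT path, text FROM chunks WHERE id = ?", (chunk_id,)
        ).fetchone()
        results.append((dist, path, text))
    return results


for dist, path, text in search([0.8, 0.2, 0.1]):
    print(f"{dist:.3f}  {path}  {text}")
```

The full scan is linear in the number of chunks, which is why it stays comfortable at a few hundred vectors and becomes the thing to replace at much larger sizes.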

Step 2 — Not all embedding models behave the same

Same corpus, three models, three indexes. Same five test queries. I compared whether the top hit was actually useful — not absolute scores across models (different models live in different spaces).

Model	Dimensions	Index time
MiniLM	384	~20s
bge-small	384	~45s
Nomic	768	~3.5 min

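Some retrieval-tuned models want different prefixes at index vs query time (e.g. Nomic uses search_document: for documents and search_query: for questions; the passage: and query: pair comes from the E5 family, while BGE recommends an instruction prefix on queries only). MiniLM needs no prefix. Wrong prefixes still produce vectors; they just hurt quality quietly, so check each model card.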
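One way to keep prefixes from drifting between indexing and querying is to route all text through a single helper. The sketch below is an assumption about how that could look, not a library API: the profile names and the `prepare` function are hypothetical, and the `passage: ` / `query: ` strings are only illustrative. The exact strings for a given model come from its model card.

```python
# Illustrative prefix profiles; replace the strings with what the model card says.
PREFIXES = {
    "no-prefix-model": {"document": "", "query": ""},
    "prefixed-model": {"document": "passage: ", "query": "query: "},
}


def prepare(text: str, profile: str, role: str) -> str:
    """Return the text exactly as it should be passed to the embedding model."""
    if role not in ("document", "query"):
        raise ValueError("role must be 'document' or 'query'")
    return PREFIXES[profile][role] + text


# Same helper at index time and at query time, so the two cannot disagree.
print(prepare("ClickHouse stores data by column.", "prefixed-model", "document"))
print(prepare("How does ClickHouse compare to MySQL?", "prefixed-model", "query"))
print(prepare("How does ClickHouse compare to MySQL?", "no-prefix-model", "query"))
```

Storing the profile name next to the index makes it harder to query with a different convention than the one used at index time.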

Same query, different top result

Take one query: "How does ClickHouse compare to MySQL?" MiniLM returned a note that only mentioned ClickHouse in passing. BGE returned the dedicated comparison note about column-oriented vs row storage. Same corpus, same question — different chunk fed into RAG.

I ran five queries like this across all three models. The table below uses outcomes, not filenames — whether the top hit was useful, not which file won.

Query (plain English)	MiniLM	bge-small	Nomic
Database replication / leader–follower	Correct topic, vague section	Correct topic, best section	Correct topic, vague section
React re-renders and memoization	Correct doc	Correct doc, best section	Correct doc
ClickHouse vs MySQL	Wrong — tangential mention only	Correct — dedicated comparison	Correct — dedicated comparison
Hexagonal architecture	Weak — no such note in corpus	Weak — nearest unrelated doc	Weak — nearest unrelated doc
CAP theorem	Wrong — unrelated topic	Correct topic (passing mention only)	Correct topic (passing mention only)

Takeaways:

Training objective beats dimension count. BGE and MiniLM are both 384-dimensional, yet BGE's top hit beat MiniLM's on four of the five queries (hexagonal architecture was weak for both).
More dimensions ≠ automatically better. Nomic (768d) never beat BGE on top-1 and was much slower to index.
You cannot retrieve what you never wrote. I do not have a hexagonal architecture note; search returns the nearest neighbor, not "I don't know."
Brief mentions lose to dedicated docs. CAP appears in passing in my consistency notes — there is no clean CAP explainer chunk to find.
Evaluate on your own queries. Public benchmarks are a starting point; your corpus is the real test.
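Evaluating on your own queries does not need much tooling. A minimal sketch, assuming you can label which sources count as a useful top hit for each query: the function name `top1_hit_rate`, the labels, and the two stub indexes are placeholders standing in for real model-plus-index combinations, not the results reported above.

```python
from typing import Callable


def top1_hit_rate(
    search: Callable[[str], list[str]],
    acceptable: dict[str, set[str]],
) -> float:
    """Fraction of queries whose top hit is one of the sources judged useful."""
    if not acceptable:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, good_sources in acceptable.items():
        results = search(query)
        if results and results[0] in good_sources:
            hits += 1
    return hits / len(acceptable)


# Hypothetical labels: for each query, the sources you would accept as a top hit.
labels = {
    "how does replication work": {"replication.md"},
    "why does my component re-render": {"react.md"},
}


# Stand-ins for real indexes; each returns ranked source paths for a query.
def stub_index_a(query: str) -> list[str]:
    return ["replication.md", "react.md"]


def stub_index_b(query: str) -> list[str]:
    return ["react.md"] if "re-render" in query else ["replication.md"]


for name, search_fn in [("index A", stub_index_a), ("index B", stub_index_b)]:
    print(name, top1_hit_rate(search_fn, labels))
```

A query with an empty set of acceptable sources always counts as a miss here, which is one way to keep "no such note in the corpus" cases visible in the score.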

I kept bge-small for everything after this step.

Step 3 — Four vector stores, one lesson

Next I indexed the same chunks and same embeddings into four backends:

Store	Role in the exercise	Search method
sqlite-vec	Zero-ops local file	Exact KNN
Qdrant	Dedicated vector DB	HNSW (approximate)
Redis Stack	In-memory + optional hybrid text search	HNSW
Milvus	Vector-native, schema-heavy	HNSW

Concepts map across all of them:

Idea	sqlite-vec	Qdrant	Redis	Milvus
Vector + metric	vec0 table	collection	HASH field	FLOAT_VECTOR column
Extra fields	chunks table	JSON payload	HASH fields	VARCHAR columns
Filter + search	limited	yes	yes	yes

HNSW (Hierarchical Navigable Small World) is the usual approximate index at scale: walk a graph instead of scanning every vector. At ~800 chunks it returned the same top hits as brute force.

Did the backends disagree?

No — not on ranking. Same embeddings + same cosine metric → same neighbors at this size.

Query	All four agreed?
clickhouse merge tree vs mysql	Yes. After the corpus grew, my own vector-search write-up sometimes ranked #1 because it mentions ClickHouse; that is a corpus issue, not a backend bug
database replication leader follower	Identical top hits across sqlite, Qdrant, Redis, Milvus

Latency differed — infrastructure, not quality:

Backend	Typical search	Why
Redis	~6–15 ms	Everything in RAM
sqlite-vec	~10 ms	No network; brute force still fine
Milvus	~12–20 ms	gRPC + HNSW
Qdrant	~15–70 ms	HTTP overhead

Headline: at hundreds of chunks, pick a store for ops and scale, not because one "understands" your text better.

When each made sense to me

Store	Best when
sqlite-vec	Learning, offline POC, no infra
Qdrant	App RAG with metadata filters, flexible JSON payload
Redis Stack	Already on Redis, hot in-memory set, hybrid keyword + vector
Milvus	Huge scale, rigid schema, partitions by topic or time range

Redis taught me that plain redis-server ≠ Redis Stack — vector search needs the RediSearch module. Redis also showed that corpus size ≈ RAM budget (~800 chunks was a few MB; millions would not fit the same way).

Qdrant and Milvus both reinforced: filter during search, not after. If you take global top-100 then discard unwanted hits in application code, you can easily end up with nothing useful left.
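The difference is easy to see on toy data. In this sketch the chunks, topics, and distances are invented, and the distances are precomputed so no embedding model is needed. Real stores apply the filter inside the index search rather than with a list comprehension, but the effect on the results is the same.

```python
# Hypothetical chunks with a metadata field and a precomputed distance to the query.
chunks = [
    {"id": "db-1", "topic": "databases", "distance": 0.10},
    {"id": "db-2", "topic": "databases", "distance": 0.12},
    {"id": "db-3", "topic": "databases", "distance": 0.15},
    {"id": "fe-1", "topic": "frontend", "distance": 0.30},
    {"id": "fe-2", "topic": "frontend", "distance": 0.35},
]


def post_filter(chunks, topic, k):
    """Take the global top-k first, then drop hits with the wrong topic."""
    top_k = sorted(chunks, key=lambda c: c["distance"])[:k]
    return [c["id"] for c in top_k if c["topic"] == topic]


def filtered_search(chunks, topic, k):
    """Restrict to the topic first, then take the top-k among what is left."""
    candidates = [c for c in chunks if c["topic"] == topic]
    return [c["id"] for c in sorted(candidates, key=lambda c: c["distance"])[:k]]


print(post_filter(chunks, "frontend", k=3))      # global top-3 are all databases
print(filtered_search(chunks, "frontend", k=3))  # frontend chunks survive
```

With k=3 the post-filter variant returns nothing, because every slot in the global top-3 was taken by chunks the filter then discards.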

Step 4 — RAG: retrieval is the ceiling

Retrieval alone gives you ranked chunks. RAG adds generation: top-k chunks → prompt → LLM answer with citations.

Example that worked: "What is ClickHouse column storage?" → retrieved my ClickHouse comparison chunks → the LLM described column-oriented storage and cited the right sources.

Prompt pattern that helped: "Answer using ONLY the context below. Cite [n]." When retrieval was good, hallucination dropped. When retrieval was wrong, the LLM was confidently wrong anyway.
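A minimal sketch of that prompt pattern: number the retrieved chunks so the model has something to cite. The `build_prompt` name, the `source` and `text` keys, and the exact layout around the quoted instruction are assumptions for illustration.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{n}] {chunk['source']}\n{chunk['text']}"
        for n, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks.
retrieved = [
    {"source": "clickhouse.md", "text": "ClickHouse stores each column separately."},
    {"source": "mysql.md", "text": "InnoDB stores rows together in pages."},
]
print(build_prompt("What is ClickHouse column storage?", retrieved))
```

Printing this prompt without sending it to a model is also a cheap way to inspect exactly what the LLM would see.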

Lesson I will not forget: RAG quality ceiling = retrieval quality. Debug the chunks before blaming the model. I added a "context only" mode that skips the LLM — invaluable when an answer looks plausible but wrong.

Also: top-k × chunk_size must fit the LLM context window. At ~800-char chunks and k=5, that is manageable; at scale you rerank or compress.
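The budget arithmetic can be made explicit. This sketch assumes roughly 4 characters per token, which is a common rule of thumb for English text rather than a real tokenizer count; the 8192-token window and the 1000 tokens reserved for instructions and the answer are hypothetical numbers.

```python
import math


def estimate_context_tokens(k: int, chunk_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token count for k chunks; chars_per_token is a heuristic, not a tokenizer."""
    return math.ceil(k * chunk_chars / chars_per_token)


def fits_context(k: int, chunk_chars: int, window_tokens: int, reserved_tokens: int = 1000) -> bool:
    """True if the chunks plus a reserve for instructions and the answer fit the window."""
    return estimate_context_tokens(k, chunk_chars) + reserved_tokens <= window_tokens


print(estimate_context_tokens(k=5, chunk_chars=800))
print(fits_context(k=5, chunk_chars=800, window_tokens=8192))
print(fits_context(k=50, chunk_chars=800, window_tokens=8192))
```

Under that assumption, five 800-character chunks come to about 1,000 tokens, while fifty such chunks would already overflow the hypothetical window.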

The full picture

┌─────────────┐     ┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Raw docs    │────▶│ Chunker  │────▶│ Embedding model │────▶│ Vector store │
└─────────────┘     └──────────┘     └─────────────────┘     └──────┬───────┘
                                                                      │
User question ──▶ embed query ──▶ nearest neighbors ──▶ top-k chunks
                                                                      │
                                                                      ▼
                                    Context + "cite [n]" ──▶ LLM answer

Your job: chunking, embedding, re-index policy, query embedding, RAG prompt, LLM.

Vector DB's job: store vectors, run KNN/ANN, attach metadata, filter during search.

Embeddings are derived data — when text changes, re-embed. The vector index is not the source of truth for documents.

Things to watch for

Never mix embedding models in one index — even two 384-dim models are incompatible spaces.
Model change = full re-index. You cannot swap models on existing vectors.
Corpus coverage beats clever search. Missing topic → plausible wrong neighbor.
Curate what you index. My own vector-search write-up sometimes outranked the ClickHouse doc because it mentioned ClickHouse in comparison tables.
Positional chunk IDs (File.md::5) break when documents are edited; use stable content hashes if you need incremental sync.
Vector carries meaning; payload carries rules. Source path, dates, tags — filter fields, not embedding magic.
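The content-hash idea from the list above can be sketched as follows. The helper names (`chunk_id`, `plan_sync`) and the 16-character truncation are illustrative choices, not part of any vector store's API.

```python
import hashlib


def chunk_id(text: str) -> str:
    """ID derived from the chunk's content, not from its position in the file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def plan_sync(indexed_ids: set[str], current_chunks: list[str]):
    """Work out which chunks need embedding and which stored IDs are stale."""
    current = {chunk_id(text): text for text in current_chunks}
    to_embed = {cid: text for cid, text in current.items() if cid not in indexed_ids}
    to_delete = indexed_ids - set(current)
    return to_embed, to_delete


# Before the edit: two chunks are indexed.
indexed = {chunk_id("alpha"), chunk_id("beta")}

# After the edit: a new intro shifts positions, and one chunk's text changed.
to_embed, to_delete = plan_sync(indexed, ["intro", "alpha", "beta changed"])
print(sorted(to_embed.values()))  # only new or changed text gets re-embedded
print(len(to_delete))             # the old version of the changed chunk is stale
```

Two chunks with identical text share an ID under this scheme. Mixing the source path into the hash avoids that, at the cost of re-embedding when a file is renamed.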

What I deliberately did not learn (yet)

Everything beyond the ~800 chunks of this exercise is mostly scaling: same ideas, harder ops.

Sharding, replication, embedding as a separate service
One shared index with query-time filters vs many separate indexes
Reranking, hybrid search tuning, evaluation harnesses
Billion-vector index types (IVF, product quantization)

That is real production work. It builds on the fundamentals above; it does not replace understanding them.

If you remember only five things

Embeddings turn text into points in space — similar meaning is nearby, keywords optional.
Pick and evaluate one embedding model on your queries — retrieval-tuned small models often beat bigger general ones.
Chunk well — retrieval returns paragraphs, not files; bad chunks cap RAG quality.
Vector DB choice is mostly infrastructure at small scale — same neighbors, different RAM/disk/ops trade-offs.
RAG = retrieve first, generate second — fix retrieval before tuning the LLM.

Closing

~800 chunks was enough to see semantic search work, watch models mis-rank, and feel RAG succeed or fail based on what got retrieved.

I learn this kind of thing by doing, not by reading another diagram. If you want the same, index something small you already have — notes, READMEs, runbooks — pick one model, one store, and five queries you care about. That is enough to start.

vector-search-lab