Kailash Sankar

# Vector Search and RAG: A Primer

A short learning path from a weekend project: I indexed my personal markdown notes (~800 chunks), tried a few local embedding models, stored the same vectors in four different backends, and wired up simple RAG. Not a production guide — just the basics, with honest results from a corpus small enough to reason about.


The idea, without the jargon pile

Keyword search looks for shared words. Vector search converts text into a list of numbers (an embedding), treats that list as a point in space, and finds nearby points. Similar meaning → nearby, even when the words differ.

That is the retrieval half of RAG (Retrieval-Augmented Generation):

your docs → split into chunks → embed → store in a vector index
your question → embed → find nearest chunks → pass chunks to an LLM → answer

The vector database does not understand language. It only stores vectors and finds neighbors. All the "meaning" comes from the embedding model you chose upstream.


What I indexed

| Aspect | Details |
| --- | --- |
| Corpus | My personal dev-notes wiki: database notes, system design summaries, frontend/backend cheatsheets |
| Size | ~50 markdown files → ~800 chunks after splitting |
| Chunking | Split on headings, then ~800-character windows with ~120-character overlap |
| Goal | Each retrieved hit should be a readable paragraph, not a whole file |

Chunking matters as much as the model. Bad splits give you the right topic in the wrong paragraph — and no amount of vector magic fixes that.
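A minimal sketch of the chunking scheme described above, assuming plain markdown input. The function names (`split_on_headings`, `window`, `chunk_markdown`) are hypothetical and this is not the exact code from the project; it only illustrates "split on headings, then ~800-character windows with ~120-character overlap".

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split markdown into sections, each starting at a heading line."""
    # Zero-width split before any line that starts with 1-6 '#' and a space.
    # Assumption: headings inside fenced code blocks are not handled specially.
    parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    return [part.strip() for part in parts if part.strip()]


def window(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Cut text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_markdown(markdown: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Heading-aware chunking: windows never span two sections."""
    chunks = []
    for section in split_on_headings(markdown):
        chunks.extend(window(section, size, overlap))
    return chunks
```

Character windows can cut mid-sentence; the overlap is what keeps a sentence that straddles a boundary readable in at least one chunk.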


Step 1 — Prove the pipeline locally (sqlite-vec)

I started with the smallest possible stack: a local embedding model (runs on my laptop, no API key) and sqlite-vec — vectors stored in a SQLite file, cosine similarity search in SQL. No Docker, no server.

First win: I searched for "clickhouse merge tree vs mysql" and the top hit was my ClickHouse notes — a comparison table about column-oriented storage vs row stores. No shared keywords required; the embedding captured the intent.

What clicked:

  • One chunk → one point in space. For the model I ended up preferring, that point is 384 numbers: not one number per word, but a single set of coordinates for the whole paragraph.
  • Lower cosine distance = better match (think angle between directions, not km/miles on a map).
  • At ~800 chunks, brute-force "compare to every vector" is fast enough. You do not need fancy indexes on day one.
  • Even sqlite-vec splits metadata (file path, text) from vectors — the same pattern every vector DB uses, just with different names.
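The points above can be sketched with nothing but the standard library. This is not the sqlite-vec API: plain `sqlite3` with a Python-defined `cosine_distance` SQL function stands in for it, and the table names, file paths, and 3-dimensional vectors are made up for illustration. It shows brute-force search (every vector is compared), lower distance ranking first, and metadata kept in a separate table from the vectors.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats as a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    """Inverse of pack(): float32 BLOB back to a tuple of floats."""
    return struct.unpack(f"{len(blob) // 4}f", blob)


def cosine_distance(a_blob, b_blob):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    a, b = unpack(a_blob), unpack(b_blob)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


db = sqlite3.connect(":memory:")
db.create_function("cosine_distance", 2, cosine_distance)
# Metadata and vectors live in separate tables, joined by chunk id.
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, body TEXT)")
db.execute("CREATE TABLE vectors (chunk_id INTEGER REFERENCES chunks(id), embedding BLOB)")

# Toy 3-d vectors; a real model would produce e.g. 384 numbers per chunk.
rows = [
    ("clickhouse.md", "Column stores vs row stores", [0.9, 0.1, 0.0]),
    ("react.md", "Memoization and re-renders", [0.0, 0.2, 0.9]),
    ("replication.md", "Leader and follower replicas", [0.3, 0.9, 0.1]),
]
for path, body, vec in rows:
    cur = db.execute("INSERT INTO chunks (path, body) VALUES (?, ?)", (path, body))
    db.execute("INSERT INTO vectors VALUES (?, ?)", (cur.lastrowid, pack(vec)))


def search(query_vec, k=2):
    """Brute force: score every stored vector, return the k closest."""
    return db.execute(
        "SELECT c.path, cosine_distance(v.embedding, ?) AS distance "
        "FROM vectors v JOIN chunks c ON c.id = v.chunk_id "
        "ORDER BY distance LIMIT ?",
        (pack(query_vec), k),
    ).fetchall()


hits = search([1.0, 0.0, 0.0])
```

Here `hits` lists `clickhouse.md` first because its vector points in nearly the same direction as the query, regardless of vector length.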

Step 2 — Not all embedding models behave the same

Same corpus, three models, three indexes. Same five test queries. I compared whether the top hit was actually useful — not absolute scores across models (different models live in different spaces).

| Model | Dimensions | Index time |
| --- | --- | --- |
| MiniLM | 384 | ~20s |
| bge-small | 384 | ~45s |
| Nomic | 768 | ~3.5 min |

Some retrieval-tuned models want different prefixes at index vs query time (e.g. E5-style models use passage: for documents and query: for questions, while the BGE model cards recommend an instruction prefix on queries only). MiniLM needs no prefix. Wrong prefixes still produce vectors; they just hurt quality quietly.
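One way to keep prefixes from drifting apart is to route every text through a single helper, so the index side and the query side cannot disagree silently. A minimal sketch: the style names and the `with_prefix` helper are hypothetical, and the actual prefix strings for any given model should be taken from its model card rather than from this table.

```python
# Maps a prefix style to (document prefix, query prefix).
# "passage_query" mirrors the passage:/query: convention mentioned above;
# "none" is for models that expect raw text.
PREFIX_STYLES = {
    "passage_query": ("passage: ", "query: "),
    "none": ("", ""),
}


def with_prefix(text: str, style: str, role: str) -> str:
    """Return text with the prefix the model expects for this role."""
    if role not in ("document", "query"):
        raise ValueError(f"unknown role: {role!r}")
    doc_prefix, query_prefix = PREFIX_STYLES[style]
    return (doc_prefix if role == "document" else query_prefix) + text
```

Calling `with_prefix(chunk, style, "document")` at index time and `with_prefix(question, style, "query")` at search time keeps the two paths in sync when the model changes.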

Same query, different top result

Take one query: "How does ClickHouse compare to MySQL?" MiniLM returned a note that only mentioned ClickHouse in passing. BGE returned the dedicated comparison note about column-oriented vs row storage. Same corpus, same question — different chunk fed into RAG.

I ran five queries like this across all three models. The table below uses outcomes, not filenames — whether the top hit was useful, not which file won.

| Query (plain English) | MiniLM | bge-small | Nomic |
| --- | --- | --- | --- |
| Database replication / leader–follower | Correct topic, vague section | Correct topic, best section | Correct topic, vague section |
| React re-renders and memoization | Correct doc | Correct doc, best section | Correct doc |
| ClickHouse vs MySQL | Wrong: tangential mention only | Correct: dedicated comparison | Correct: dedicated comparison |
| Hexagonal architecture | Weak: no such note in corpus | Weak: nearest unrelated doc | Weak: nearest unrelated doc |
| CAP theorem | Wrong: unrelated topic | Correct topic (passing mention only) | Correct topic (passing mention only) |

Takeaways:

  • Training objective beats dimension count. BGE and MiniLM are both 384-dimensional; BGE won top-1 on four of five queries.
  • More dimensions ≠ automatically better. Nomic (768d) never beat BGE on top-1 and was much slower to index.
  • You cannot retrieve what you never wrote. I do not have a hexagonal architecture note; search returns the nearest neighbor, not "I don't know."
  • Brief mentions lose to dedicated docs. CAP appears in passing in my consistency notes — there is no clean CAP explainer chunk to find.
  • Evaluate on your own queries. Public benchmarks are a starting point; your corpus is the real test.
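"Evaluate on your own queries" can start as a few lines of code. The sketch below computes a top-1 hit rate from hand-labeled queries; the chunk IDs, the two stub search functions, and the labels are invented for illustration and are not results from this write-up.

```python
from typing import Callable


def top1_hit_rate(
    search: Callable[[str], list[str]],
    labeled: dict[str, set[str]],
) -> float:
    """Fraction of queries whose top result is one of the acceptable chunk IDs."""
    hits = 0
    for query, acceptable in labeled.items():
        results = search(query)
        if results and results[0] in acceptable:
            hits += 1
    return hits / len(labeled)


# Hypothetical labels: for each query, which chunk IDs count as a useful top hit.
labeled = {
    "clickhouse vs mysql": {"clickhouse.md::comparison"},
    "database replication": {"replication.md::leader-follower"},
}

# Stand-ins for "embed the query with model X, search index X".
model_a = {
    "clickhouse vs mysql": ["storage.md::intro"],
    "database replication": ["replication.md::leader-follower"],
}
model_b = {
    "clickhouse vs mysql": ["clickhouse.md::comparison"],
    "database replication": ["replication.md::leader-follower"],
}

rate_a = top1_hit_rate(lambda q: model_a[q], labeled)
rate_b = top1_hit_rate(lambda q: model_b[q], labeled)
```

Comparing hit rates sidesteps the problem that raw distances from different models are not comparable.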

I kept bge-small for everything after this step.


Step 3 — Four vector stores, one lesson

Next I indexed the same chunks and same embeddings into four backends:

| Store | Role in the exercise | Search method |
| --- | --- | --- |
| sqlite-vec | Zero-ops local file | Exact KNN |
| Qdrant | Dedicated vector DB | HNSW (approximate) |
| Redis Stack | In-memory + optional hybrid text search | HNSW |
| Milvus | Vector-native, schema-heavy | HNSW |

Concepts map across all of them:

| Idea | sqlite-vec | Qdrant | Redis | Milvus |
| --- | --- | --- | --- | --- |
| Vector + metric | vec0 table | collection | HASH field | FLOAT_VECTOR column |
| Extra fields | chunks table | JSON payload | HASH fields | VARCHAR columns |
| Filter + search | limited | yes | yes | yes |

HNSW (Hierarchical Navigable Small World) is the usual approximate index at scale: walk a graph instead of scanning every vector. At ~800 chunks it returned the same top hits as brute force.

Did the backends disagree?

No — not on ranking. Same embeddings + same cosine metric → same neighbors at this size.

| Query | All four agreed? |
| --- | --- |
| clickhouse merge tree vs mysql | Yes (after the corpus grew, my own vector-search write-up sometimes ranked #1; meta docs mentioning ClickHouse stole rank, not a backend bug) |
| database replication leader follower | Identical top hits across sqlite, Qdrant, Redis, Milvus |
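Agreement between an exact backend and an approximate one can be checked mechanically rather than by eye. A small sketch, assuming each backend returns a ranked list of chunk IDs; the IDs below are placeholders, not hits from this corpus.

```python
def topk_overlap(exact: list[str], approx: list[str], k: int) -> float:
    """Share of the exact top-k that the approximate search also returned.

    This is recall@k with the exact (brute-force) result as ground truth.
    Assumes both lists contain at least k IDs.
    """
    return len(set(exact[:k]) & set(approx[:k])) / k


# Placeholder rankings: brute force vs an HNSW-backed store.
exact_hits = ["c1", "c2", "c3", "c4", "c5"]
hnsw_hits = ["c1", "c3", "c2", "c5", "c9"]

overlap_at_5 = topk_overlap(exact_hits, hnsw_hits, k=5)
overlap_at_1 = topk_overlap(exact_hits, hnsw_hits, k=1)
```

An overlap of 1.0 at the k you actually feed to RAG means the approximate index costs nothing in quality at that corpus size.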

Latency differed — infrastructure, not quality:

| Backend | Typical search | Why |
| --- | --- | --- |
| Redis | ~6–15 ms | Everything in RAM |
| sqlite-vec | ~10 ms | No network; brute force still fine |
| Milvus | ~12–20 ms | gRPC + HNSW |
| Qdrant | ~15–70 ms | HTTP overhead |

Headline: at hundreds of chunks, pick a store for ops and scale, not because one "understands" your text better.
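Latency numbers like these are easy to reproduce with a small timing harness. The sketch below is one possible way to measure, not the method used for the table: `search` is any callable that runs one query against a backend, and the median is used so a single slow request does not dominate.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    search: Callable[[str], object],
    queries: list[str],
    repeats: int = 20,
) -> float:
    """Median wall-clock time per search call, in milliseconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Embedding the query usually belongs outside the timed call; otherwise the measurement mostly reflects the embedding model, not the store.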

When each made sense to me

| Store | Best when |
| --- | --- |
| sqlite-vec | Learning, offline POC, no infra |
| Qdrant | App RAG with metadata filters, flexible JSON payload |
| Redis Stack | Already on Redis, hot in-memory set, hybrid keyword + vector |
| Milvus | Huge scale, rigid schema, partitions by topic or time range |

Redis taught me that plain redis-server ≠ Redis Stack — vector search needs the RediSearch module. Redis also showed that corpus size ≈ RAM budget (~800 chunks was a few MB; millions would not fit the same way).
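The RAM point is simple arithmetic. A rough sketch, assuming vectors are stored as float32 (4 bytes per value) and counting raw vectors only; index structures, chunk text, and metadata add to this.

```python
def vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default)."""
    return num_chunks * dims * bytes_per_value


small = vector_bytes(800, 384)        # 1,228,800 bytes
large = vector_bytes(1_000_000, 384)  # 1,536,000,000 bytes

print(f"{small / 2**20:.2f} MiB")  # prints "1.17 MiB"
print(f"{large / 2**30:.2f} GiB")  # prints "1.43 GiB"
```

At 384 dimensions, each additional million chunks costs roughly another 1.4 GiB before any index overhead.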

Qdrant and Milvus both reinforced: filter during search, not after. If you take global top-100 then discard unwanted hits in application code, you can easily end up with nothing useful left.
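The difference is easy to demonstrate on toy data. In this sketch the points, payloads, and function names are hypothetical, and exact search stands in for what a real store does inside its index; the only thing that changes between the two functions is when the filter runs.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


# (id, vector, payload): three frontend notes sit closest to the query.
points = [
    ("a", [1.0, 0.0], {"topic": "frontend"}),
    ("b", [0.9, 0.1], {"topic": "frontend"}),
    ("c", [0.8, 0.2], {"topic": "frontend"}),
    ("d", [0.5, 0.5], {"topic": "database"}),
]


def search_then_filter(query, k, predicate):
    """Global top-k first, filter in application code afterwards."""
    ranked = sorted(points, key=lambda p: cosine_distance(query, p[1]))[:k]
    return [p[0] for p in ranked if predicate(p[2])]


def filter_then_search(query, k, predicate):
    """Restrict to matching points first, then take the top-k among them."""
    candidates = [p for p in points if predicate(p[2])]
    ranked = sorted(candidates, key=lambda p: cosine_distance(query, p[1]))[:k]
    return [p[0] for p in ranked]


def is_database(payload):
    return payload["topic"] == "database"


after = search_then_filter([1.0, 0.0], k=2, predicate=is_database)
during = filter_then_search([1.0, 0.0], k=2, predicate=is_database)
```

With `k=2`, `after` is empty because both global top hits are frontend notes, while `during` still returns the database note.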


Step 4 — RAG: retrieval is the ceiling

Retrieval alone gives you ranked chunks. RAG adds generation: top-k chunks → prompt → LLM answer with citations.

Example that worked: "What is ClickHouse column storage?" → retrieved my ClickHouse comparison chunks → the LLM described column-oriented storage and cited the right sources.

Prompt pattern that helped: "Answer using ONLY the context below. Cite [n]." When retrieval was good, hallucination dropped. When retrieval was wrong, the LLM was confidently wrong anyway.

Lesson I will not forget: RAG quality ceiling = retrieval quality. Debug the chunks before blaming the model. I added a "context only" mode that skips the LLM — invaluable when an answer looks plausible but wrong.
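The prompt pattern and the "context only" mode fit in one small function. A minimal sketch under stated assumptions: `retrieve` and `generate` are hypothetical callables standing in for the vector search and the LLM client, and the prompt layout beyond the quoted instruction line is my own guess at a reasonable format.

```python
from typing import Callable, Optional

Chunk = tuple[str, str]  # (source path, chunk text)


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{n}] ({source})\n{text}"
        for n, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str, int], list[Chunk]],
    generate: Optional[Callable[[str], str]] = None,
    k: int = 5,
) -> str:
    """Retrieve, then generate. With generate=None, return the prompt only."""
    prompt = build_prompt(question, retrieve(question, k))
    if generate is None:
        return prompt  # "context only" mode: inspect what the LLM would see
    return generate(prompt)
```

Running `answer(question, retrieve)` without a generator shows exactly which chunks were retrieved, which is usually the fastest way to tell a retrieval failure from a generation failure.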

Also: top-k × chunk_size must fit the LLM context window. At ~800-char chunks and k=5, that is manageable; at scale you rerank or compress.
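The budget check is a one-liner worth writing down. This sketch assumes roughly 4 characters per token, which is a common rule of thumb for English text and not a property of any specific tokenizer, and it reserves room for the instructions, the question, and the answer.

```python
def estimated_context_tokens(
    k: int, chunk_chars: int, chars_per_token: float = 4.0
) -> float:
    """Rough token count of k chunks of chunk_chars characters each."""
    return k * chunk_chars / chars_per_token


def fits_context(
    k: int,
    chunk_chars: int,
    context_tokens: int,
    reserved_tokens: int = 1024,
    chars_per_token: float = 4.0,
) -> bool:
    """True if the chunks plus a reserve fit the model's context window."""
    needed = estimated_context_tokens(k, chunk_chars, chars_per_token)
    return needed + reserved_tokens <= context_tokens
```

With ~800-character chunks and k=5 the estimate is about 1,000 tokens, which fits comfortably in a 4,096-token window; k=50 at the same chunk size would not.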


The full picture

┌─────────────┐     ┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Raw docs    │────▶│ Chunker  │────▶│ Embedding model │────▶│ Vector store │
└─────────────┘     └──────────┘     └─────────────────┘     └──────┬───────┘
                                                                      │
User question ──▶ embed query ──▶ nearest neighbors ──▶ top-k chunks
                                                                      │
                                                                      ▼
                                    Context + "cite [n]" ──▶ LLM answer

Your job: chunking, embedding, re-index policy, query embedding, RAG prompt, LLM.

Vector DB's job: store vectors, run KNN/ANN, attach metadata, filter during search.

Embeddings are derived data — when text changes, re-embed. The vector index is not the source of truth for documents.


Things to watch for

  • Never mix embedding models in one index — even two 384-dim models are incompatible spaces.
  • Model change = full re-index. You cannot swap models on existing vectors.
  • Corpus coverage beats clever search. Missing topic → plausible wrong neighbor.
  • Curate what you index. My own vector-search write-up sometimes outranked the ClickHouse doc because it mentioned ClickHouse in comparison tables.
  • Positional chunk IDs (File.md::5) break when documents are edited, because every later chunk shifts position. Use stable content hashes if you need incremental sync.
  • Vector carries meaning; payload carries rules. Source path, dates, tags — filter fields, not embedding magic.
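The content-hash idea from the list above can be sketched as follows. The ID format and the `plan_sync` helper are hypothetical; the point is that an ID derived from the chunk text survives insertions elsewhere in the file, so only new or changed chunks need re-embedding.

```python
import hashlib


def chunk_id(path: str, text: str) -> str:
    """Stable ID: file path plus a hash of the chunk's own content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{path}::{digest}"


def plan_sync(
    indexed_ids: set[str],
    current_chunks: list[tuple[str, str]],
) -> tuple[list[tuple[str, str]], set[str]]:
    """Compare what is indexed with what exists now.

    Returns (chunks to embed and insert, IDs to delete from the index).
    """
    current = {chunk_id(path, text): (path, text) for path, text in current_chunks}
    to_embed = [chunk for cid, chunk in current.items() if cid not in indexed_ids]
    to_delete = indexed_ids - set(current)
    return to_embed, to_delete
```

An edited chunk shows up as one delete plus one insert, while an untouched chunk that merely moved down the file keeps its ID and is skipped.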

What I deliberately did not learn (yet)

Everything past ~800 chunks in this exercise was mostly scaling — same ideas, harder ops:

  • Sharding, replication, embedding as a separate service
  • One shared index with query-time filters vs many separate indexes
  • Reranking, hybrid search tuning, evaluation harnesses
  • Billion-vector index types (IVF, product quantization)

That is real production work. It builds on the fundamentals above; it does not replace understanding them.


If you remember only five things

  1. Embeddings turn text into points in space — similar meaning is nearby, keywords optional.
  2. Pick and evaluate one embedding model on your queries — retrieval-tuned small models often beat bigger general ones.
  3. Chunk well — retrieval returns paragraphs, not files; bad chunks cap RAG quality.
  4. Vector DB choice is mostly infrastructure at small scale — same neighbors, different RAM/disk/ops trade-offs.
  5. RAG = retrieve first, generate second — fix retrieval before tuning the LLM.

Closing

~800 chunks was enough to see semantic search work, watch models mis-rank, and feel RAG succeed or fail based on what got retrieved.

I learn this kind of thing by doing, not by reading another diagram. If you want the same, index something small you already have — notes, READMEs, runbooks — pick one model, one store, and five queries you care about. That is enough to start.


vector-search-lab
