A short learning path from a weekend project: I indexed my personal markdown notes (~800 chunks), tried a few local embedding models, stored the same vectors in four different backends, and wired up simple RAG. Not a production guide — just the basics, with honest results from a corpus small enough to reason about.
The idea, without the jargon pile
Keyword search looks for shared words. Vector search converts text into a list of numbers (an embedding), treats that list as a point in space, and finds nearby points. Similar meaning → nearby, even when the words differ.
That is the retrieval half of RAG (Retrieval-Augmented Generation):
your docs → split into chunks → embed → store in a vector index
your question → embed → find nearest chunks → pass chunks to an LLM → answer
The vector database does not understand language. It only stores vectors and finds neighbors. All the "meaning" comes from the embedding model you chose upstream.
What I indexed
| Aspect | Details |
|---|---|
| Corpus | My personal dev-notes wiki: database notes, system design summaries, frontend/backend cheatsheets |
| Size | ~50 markdown files → ~800 chunks after splitting |
| Chunking | Split on headings, then ~800-character windows with ~120-character overlap |
| Goal | Each retrieved hit should be a readable paragraph, not a whole file |
Chunking matters as much as the model. Bad splits give you the right topic in the wrong paragraph — and no amount of vector magic fixes that.
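A minimal sketch of the chunking rule from the table above (headings first, then ~800-character windows with ~120-character overlap). The function names are made up for illustration, and this is not the exact splitter used for the results in this post; it also treats any line starting with `#` as a heading, so fenced code containing such lines would be split too.

```python
import re


def split_on_headings(markdown: str) -> list[str]:
    """Split a markdown document into sections, one per heading."""
    # Zero-width split just before any line starting with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


def sliding_windows(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Cut text into windows of at most `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return windows


def chunk_markdown(markdown: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Headings first, then fixed-size windows inside each section."""
    chunks = []
    for section in split_on_headings(markdown):
        chunks.extend(sliding_windows(section, size, overlap))
    return chunks
```

Splitting on headings before windowing keeps a window from straddling two unrelated sections, which is one way to avoid "right topic, wrong paragraph" hits.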
Step 1 — Prove the pipeline locally (sqlite-vec)
I started with the smallest possible stack: a local embedding model (runs on my laptop, no API key) and sqlite-vec — vectors stored in a SQLite file, cosine similarity search in SQL. No Docker, no server.
First win: I searched for "clickhouse merge tree vs mysql" and the top hit was my ClickHouse notes — a comparison table about column-oriented storage vs row stores. No shared keywords required; the embedding captured the intent.
What clicked:
- One chunk → one point in space. For the model I ended up preferring, that is 384 numbers: not one number per word, but one set of coordinates for the whole paragraph.
- Lower cosine distance = better match (think angle between directions, not km/miles on a map).
- At ~800 chunks, brute-force "compare to every vector" is fast enough. You do not need fancy indexes on day one.
- Even sqlite-vec splits metadata (file path, text) from vectors — the same pattern every vector DB uses, just with different names.
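To make the points above concrete, here is a rough sketch of the same pattern using only Python's standard library: one table for metadata, one for vectors, and a brute-force cosine-distance scan. It deliberately does not use sqlite-vec (so no `vec0` table); the table names, file paths, and 3-number toy vectors are invented for illustration, where a real embedding would have 384 values.

```python
import math
import sqlite3
import struct


def pack(vec: list[float]) -> bytes:
    """Store a vector as a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """0.0 means same direction; larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


conn = sqlite3.connect(":memory:")
# Metadata and vectors live in separate tables, joined by chunk id.
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, body TEXT)")
conn.execute("CREATE TABLE vectors (chunk_id INTEGER PRIMARY KEY, embedding BLOB)")


def add_chunk(path: str, body: str, vec: list[float]) -> None:
    cur = conn.execute("INSERT INTO chunks (path, body) VALUES (?, ?)", (path, body))
    conn.execute("INSERT INTO vectors VALUES (?, ?)", (cur.lastrowid, pack(vec)))


def search(conn: sqlite3.Connection, query_vec: list[float], k: int = 3):
    """Brute force: score every stored vector, keep the k smallest distances."""
    scored = [
        (cosine_distance(query_vec, unpack(blob)), chunk_id)
        for chunk_id, blob in conn.execute("SELECT chunk_id, embedding FROM vectors")
    ]
    scored.sort()
    results = []
    for distance, chunk_id in scored[:k]:
        path, body = conn.execute(
            "SELECT path, body FROM chunks WHERE id = ?", (chunk_id,)
        ).fetchone()
        results.append((distance, path, body))
    return results


# Toy data: hypothetical paths and hand-written vectors.
add_chunk("clickhouse.md", "column-oriented vs row stores", [0.9, 0.1, 0.0])
add_chunk("react.md", "re-renders and memoization", [0.0, 0.2, 0.9])
print(search(conn, [1.0, 0.0, 0.0], k=1)[0][1])
```

A full scan like this is linear in the number of chunks, which is why it stays comfortable at ~800 and stops being comfortable somewhere well beyond that.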
Step 2 — Not all embedding models behave the same
Same corpus, three models, three indexes. Same five test queries. I compared whether the top hit was actually useful — not absolute scores across models (different models live in different spaces).
| Model | Dimensions | Index time |
|---|---|---|
| MiniLM | 384 | ~20s |
| bge-small | 384 | ~45s |
| Nomic | 768 | ~3.5 min |
Some retrieval-tuned models expect different prefixes at index time and at query time. E5-style models, for example, use passage: for documents and query: for questions, while BGE's model card recommends an instruction prefix on queries only. MiniLM needs no prefix. Wrong prefixes still produce vectors; they just hurt quality quietly.
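One way to keep prefixes from silently drifting is to route every text through a single helper that knows the role (document or query). This is only a sketch: the model keys are placeholders and the prefix strings are examples, so the exact values should come from the model card of whatever model is actually in use.

```python
# Placeholder model keys with example prefixes; verify against the model card.
PREFIXES = {
    "no-prefix-model": {"document": "", "query": ""},
    "prefixed-model": {"document": "passage: ", "query": "query: "},
}


def prepare(texts: list[str], model_key: str, role: str) -> list[str]:
    """Add the right prefix for this model and role before embedding."""
    if role not in ("document", "query"):
        raise ValueError("role must be 'document' or 'query'")
    prefix = PREFIXES[model_key][role]
    return [prefix + text for text in texts]


print(prepare(["How does ClickHouse compare to MySQL?"], "prefixed-model", "query"))
```

Calling `prepare(..., role="document")` at index time and `prepare(..., role="query")` at search time makes the asymmetry explicit instead of relying on memory.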
Same query, different top result
Take one query: "How does ClickHouse compare to MySQL?" MiniLM returned a note that only mentioned ClickHouse in passing. BGE returned the dedicated comparison note about column-oriented vs row storage. Same corpus, same question — different chunk fed into RAG.
I ran five queries like this across all three models. The table below uses outcomes, not filenames — whether the top hit was useful, not which file won.
| Query (plain English) | MiniLM | bge-small | Nomic |
|---|---|---|---|
| Database replication / leader–follower | Correct topic, vague section | Correct topic, best section | Correct topic, vague section |
| React re-renders and memoization | Correct doc | Correct doc, best section | Correct doc |
| ClickHouse vs MySQL | Wrong — tangential mention only | Correct — dedicated comparison | Correct — dedicated comparison |
| Hexagonal architecture | Weak — no such note in corpus | Weak — nearest unrelated doc | Weak — nearest unrelated doc |
| CAP theorem | Wrong — unrelated topic | Correct topic (passing mention only) | Correct topic (passing mention only) |
Takeaways:
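- Training objective beats dimension count. BGE and MiniLM are both 384-dimensional, yet BGE had the best or joint-best top hit on four of five queries; the fifth (hexagonal architecture) had no matching note for any model to find.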
- More dimensions ≠ automatically better. Nomic (768d) never beat BGE on top-1 and was much slower to index.
- You cannot retrieve what you never wrote. I do not have a hexagonal architecture note; search returns the nearest neighbor, not "I don't know."
- Brief mentions lose to dedicated docs. CAP appears in passing in my consistency notes — there is no clean CAP explainer chunk to find.
- Evaluate on your own queries. Public benchmarks are a starting point; your corpus is the real test.
I kept bge-small for everything after this step.
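The comparison above was done by eye, but the same idea fits in a few lines: for each query, list the chunk IDs that would count as a useful top hit, then score each model by how often its top result is in that list. Everything here is hypothetical (the function name, the query keys, the chunk IDs, and the stand-in search function); a query with no acceptable chunk simply counts as a miss.

```python
from typing import Callable


def top1_hit_rate(
    search: Callable[[str], list[str]],
    acceptable: dict[str, set[str]],
) -> float:
    """Fraction of queries whose top-ranked chunk ID is in the acceptable set."""
    hits = 0
    for query, good_ids in acceptable.items():
        ranked = search(query)
        if ranked and ranked[0] in good_ids:
            hits += 1
    return hits / len(acceptable)


# Hypothetical labels: which chunk IDs would be a useful top hit per query.
ACCEPTABLE = {
    "q-replication": {"replication-3"},
    "q-clickhouse": {"clickhouse-1", "clickhouse-2"},
    "q-not-in-corpus": set(),  # no note exists, so any top hit is a miss
}

# Stand-in for a real model + index: canned rankings per query.
CANNED = {
    "q-replication": ["replication-3", "consistency-1"],
    "q-clickhouse": ["vector-search-7", "clickhouse-1"],
    "q-not-in-corpus": ["frontend-2"],
}


def fake_search(query: str) -> list[str]:
    return CANNED.get(query, [])


print(f"top-1 hit rate: {top1_hit_rate(fake_search, ACCEPTABLE):.2f}")
```

Swapping `fake_search` for one function per model gives a like-for-like number without comparing raw distances across embedding spaces.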
Step 3 — Four vector stores, one lesson
Next I indexed the same chunks and same embeddings into four backends:
| Store | Role in the exercise | Search method |
|---|---|---|
| sqlite-vec | Zero-ops local file | Exact KNN |
| Qdrant | Dedicated vector DB | HNSW (approximate) |
| Redis Stack | In-memory + optional hybrid text search | HNSW |
| Milvus | Vector-native, schema-heavy | HNSW |
Concepts map across all of them:
| Idea | sqlite-vec | Qdrant | Redis | Milvus |
|---|---|---|---|---|
| Vector + metric | vec0 table | collection | HASH field | FLOAT_VECTOR column |
| Extra fields | chunks table | JSON payload | HASH fields | VARCHAR columns |
| Filter + search | limited | yes | yes | yes |
HNSW (Hierarchical Navigable Small World) is the usual approximate index at scale: walk a graph instead of scanning every vector. At ~800 chunks it returned the same top hits as brute force.
Did the backends disagree?
No — not on ranking. Same embeddings + same cosine metric → same neighbors at this size.
| Query | All four agreed? |
|---|---|
| clickhouse merge tree vs mysql | Yes. After the corpus grew, my own vector-search write-up sometimes ranked #1 because meta docs mentioning ClickHouse stole rank; that is a corpus issue, not a backend bug |
| database replication leader follower | Identical top hits across sqlite, Qdrant, Redis, Milvus |
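A check like the one in this table can be automated by treating the exact (brute-force) result list as the baseline and measuring how much of its top-k each approximate backend returns. The sketch below assumes each backend has already been queried and reduced to a list of chunk IDs; the IDs and the helper names are invented.

```python
def overlap_at_k(baseline: list[str], candidate: list[str], k: int) -> float:
    """Share of the baseline's top-k IDs that also appear in the candidate's top-k."""
    expected = set(baseline[:k])
    if not expected:
        raise ValueError("baseline has no results to compare against")
    return len(expected & set(candidate[:k])) / len(expected)


def compare_backends(
    results: dict[str, list[str]], baseline: str, k: int = 5
) -> dict[str, float]:
    """Overlap of every backend's top-k with the exact baseline's top-k."""
    return {
        name: overlap_at_k(results[baseline], ids, k)
        for name, ids in results.items()
        if name != baseline
    }


# Made-up chunk IDs standing in for one query's results per backend.
RESULTS = {
    "sqlite-vec": ["c12", "c40", "c7"],
    "qdrant": ["c12", "c40", "c7"],
    "redis": ["c12", "c7", "c40"],   # same set, different order
    "milvus": ["c12", "c40", "c99"],  # one neighbor missed
}

print(compare_backends(RESULTS, baseline="sqlite-vec", k=3))
```

Set overlap ignores order, so a stricter check would also compare the lists position by position.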
Latency differed — infrastructure, not quality:
| Backend | Typical search | Why |
|---|---|---|
| Redis | ~6–15 ms | Everything in RAM |
| sqlite-vec | ~10 ms | No network; brute force still fine |
| Milvus | ~12–20 ms | gRPC + HNSW |
| Qdrant | ~15–70 ms | HTTP overhead |
Headline: at hundreds of chunks, pick a store for ops and scale, not because one "understands" your text better.
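Latency numbers like these are easy to distort with a cold first call, so a small harness that warms up and reports the median is a reasonable starting point. This is a generic sketch, not the script behind the table; `search` is any callable that runs one query against one backend, and the dummy lambda below only stands in for it.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(
    search: Callable[[str], object], query: str, runs: int = 20, warmup: int = 3
) -> dict[str, float]:
    """Time repeated searches and summarise them in milliseconds."""
    for _ in range(warmup):
        search(query)  # let connections and caches settle; not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        search(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "min_ms": samples[0],
        "max_ms": samples[-1],
    }


# Dummy search that just does a little work, to show the call shape.
stats = measure_latency_ms(lambda q: sum(range(10_000)), "clickhouse vs mysql")
print(stats)
```

Reporting min and max alongside the median makes ranges such as "15–70 ms" visible instead of hiding them in an average.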
When each made sense to me
| Store | Best when |
|---|---|
| sqlite-vec | Learning, offline POC, no infra |
| Qdrant | App RAG with metadata filters, flexible JSON payload |
| Redis Stack | Already on Redis, hot in-memory set, hybrid keyword + vector |
| Milvus | Huge scale, rigid schema, partitions by topic or time range |
Redis taught me that plain redis-server ≠ Redis Stack — vector search needs the RediSearch module. Redis also showed that corpus size ≈ RAM budget (~800 chunks was a few MB; millions would not fit the same way).
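The RAM point can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts raw vector bytes only and assumes 4-byte float32 values; chunk text, metadata, and index overhead come on top, and the 5-million figure is just an example of "millions".

```python
def vector_bytes(n_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default)."""
    return n_chunks * dims * bytes_per_value


small = vector_bytes(800, 384)          # this exercise
large = vector_bytes(5_000_000, 384)    # hypothetical larger corpus

print(f"800 chunks: {small / 1e6:.2f} MB of vectors")
print(f"5M chunks:  {large / 1e9:.2f} GB of vectors")
```

At 384 dimensions that is roughly 1.2 MB for 800 chunks and several GB for a few million, before any text or index structures are counted.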
Qdrant and Milvus both reinforced: filter during search, not after. If you take global top-100 then discard unwanted hits in application code, you can easily end up with nothing useful left.
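The failure mode is easy to reproduce with plain lists. In this toy sketch the three nearest chunks are all about frontend topics, so taking the global top 3 and then filtering for database notes leaves nothing, while filtering first still returns results. The data and names are invented, and a real vector DB applies the filter inside its index search rather than sorting a Python list.

```python
from typing import Callable, NamedTuple


class Hit(NamedTuple):
    chunk_id: str
    topic: str
    distance: float


def top_k_then_filter(hits: list[Hit], k: int, keep: Callable[[Hit], bool]) -> list[Hit]:
    """Post-filtering: take the global top-k, then drop unwanted hits."""
    nearest = sorted(hits, key=lambda h: h.distance)[:k]
    return [h for h in nearest if keep(h)]


def filter_then_top_k(hits: list[Hit], k: int, keep: Callable[[Hit], bool]) -> list[Hit]:
    """Filtering during search: only allowed hits compete for the top-k."""
    allowed = [h for h in hits if keep(h)]
    return sorted(allowed, key=lambda h: h.distance)[:k]


HITS = [
    Hit("react-1", "frontend", 0.10),
    Hit("react-2", "frontend", 0.12),
    Hit("css-1", "frontend", 0.15),
    Hit("db-1", "database", 0.30),
    Hit("db-2", "database", 0.35),
]


def is_database(hit: Hit) -> bool:
    return hit.topic == "database"


print(top_k_then_filter(HITS, 3, is_database))   # nothing survives
print(filter_then_top_k(HITS, 3, is_database))   # both database chunks
```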
Step 4 — RAG: retrieval is the ceiling
Retrieval alone gives you ranked chunks. RAG adds generation: top-k chunks → prompt → LLM answer with citations.
Example that worked: "What is ClickHouse column storage?" → retrieved my ClickHouse comparison chunks → the LLM described column-oriented storage and cited the right sources.
Prompt pattern that helped: "Answer using ONLY the context below. Cite [n]." When retrieval was good, hallucination dropped. When retrieval was wrong, the LLM was confidently wrong anyway.
Lesson I will not forget: RAG quality ceiling = retrieval quality. Debug the chunks before blaming the model. I added a "context only" mode that skips the LLM — invaluable when an answer looks plausible but wrong.
Also: top-k × chunk_size must fit the LLM context window. At ~800-char chunks and k=5, that is manageable; at scale you rerank or compress.
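Putting the prompt pattern, the context budget, and the "context only" mode together might look like the sketch below. The function names and the 4000-character budget are assumptions, the budget counts characters rather than tokens, and `llm` is any callable that takes a prompt string and returns text (no specific client library is implied).

```python
from typing import Callable, Optional

INSTRUCTION = "Answer using ONLY the context below. Cite [n]."


def format_context(chunks: list[tuple[str, str]], max_chars: int = 4000) -> str:
    """Number the chunks [1], [2], ... and stop before the budget is exceeded."""
    entries = []
    used = 0
    for n, (source, text) in enumerate(chunks, start=1):
        entry = f"[{n}] ({source}) {text}"
        if used + len(entry) > max_chars:
            break  # lower-ranked chunks are dropped, never truncated mid-text
        entries.append(entry)
        used += len(entry)
    return "\n\n".join(entries)


def build_prompt(question: str, chunks: list[tuple[str, str]], max_chars: int = 4000) -> str:
    context = format_context(chunks, max_chars)
    return f"{INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(
    question: str,
    chunks: list[tuple[str, str]],
    llm: Optional[Callable[[str], str]] = None,
    context_only: bool = False,
) -> str:
    """With context_only (or no LLM), return just the retrieved context for debugging."""
    if context_only or llm is None:
        return format_context(chunks)
    return llm(build_prompt(question, chunks))


# Hypothetical retrieved chunks: (source path, chunk text), best match first.
retrieved = [("clickhouse.md", "Column-oriented storage keeps each column together.")]
print(answer("What is ClickHouse column storage?", retrieved, context_only=True))
```

Reading the output of `context_only=True` first answers the question "did retrieval find the right chunk?" before any model is involved.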
The full picture
┌─────────────┐     ┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Raw docs    │────▶│ Chunker  │────▶│ Embedding model │────▶│ Vector store │
└─────────────┘     └──────────┘     └─────────────────┘     └──────┬───────┘
                                                                    │
     User question ──▶ embed query ──▶ nearest neighbors ──▶ top-k chunks
                                                                    │
                                                                    ▼
                                                          Context + "cite [n]" ──▶ LLM answer
Your job: chunking, embedding, re-index policy, query embedding, RAG prompt, LLM.
Vector DB's job: store vectors, run KNN/ANN, attach metadata, filter during search.
Embeddings are derived data — when text changes, re-embed. The vector index is not the source of truth for documents.
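A re-index policy can start with something as small as recording which model built the index and refusing to mix. The sketch below is one possible shape, with invented names; the point it illustrates is that a dimension check alone cannot tell two 384-dimensional models apart, so the model name has to be stored too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """What was used to build the index; stored next to the vectors."""
    model_name: str
    dimensions: int


def needs_full_reindex(stored: IndexMeta, current: IndexMeta) -> bool:
    """Any change of model or dimension invalidates every stored vector."""
    return stored != current


def check_vector(meta: IndexMeta, vector: list[float]) -> None:
    """Cheap guard at insert time; catches size mismatches only."""
    if len(vector) != meta.dimensions:
        raise ValueError(
            f"expected {meta.dimensions} values for {meta.model_name}, got {len(vector)}"
        )


stored = IndexMeta("MiniLM", 384)
current = IndexMeta("bge-small", 384)
print(needs_full_reindex(stored, current))  # same size, different space
```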
Things to watch for
- Never mix embedding models in one index — even two 384-dim models are incompatible spaces.
- Model change = full re-index. You cannot swap models on existing vectors.
- Corpus coverage beats clever search. Missing topic → plausible wrong neighbor.
- Curate what you index. My own vector-search write-up sometimes outranked the ClickHouse doc because it mentioned ClickHouse in comparison tables.
- Positional chunk IDs (File.md::5) break when documents are edited; use stable content hashes if you need incremental sync.
- Vector carries meaning; payload carries rules. Source path, dates, and tags are filter fields, not embedding magic.
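For the chunk-ID point in the list above, a content-hash ID might look like the sketch below. The ID format and function names are assumptions, and truncating SHA-256 to 16 hex characters is a readability choice rather than a requirement. Inserting a paragraph at the top of a file shifts every positional ID, but with hashed IDs only the new paragraph needs embedding.

```python
import hashlib


def chunk_id(source_path: str, text: str) -> str:
    """ID derived from the chunk's content, not its position in the file."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{source_path}::{digest}"


def plan_sync(indexed_ids: set[str], current_chunks: list[tuple[str, str]]):
    """Return (chunks to embed, stale IDs to delete) for an incremental update."""
    current = {chunk_id(path, text): (path, text) for path, text in current_chunks}
    to_embed = [chunk for cid, chunk in current.items() if cid not in indexed_ids]
    to_delete = sorted(indexed_ids - set(current))
    return to_embed, to_delete


before = [("File.md", "intro"), ("File.md", "replication notes")]
indexed = {chunk_id(path, text) for path, text in before}

# A new paragraph is inserted at the top; the other chunks are unchanged.
after = [("File.md", "new paragraph")] + before
to_embed, to_delete = plan_sync(indexed, after)
print(to_embed, to_delete)
```

One trade-off of this scheme: two identical chunks in the same file collapse into a single ID, which is usually acceptable for retrieval but worth knowing.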
What I deliberately did not learn (yet)
Everything beyond this ~800-chunk exercise is mostly a scaling problem, with the same ideas and harder ops:
- Sharding, replication, embedding as a separate service
- One shared index with query-time filters vs many separate indexes
- Reranking, hybrid search tuning, evaluation harnesses
- Billion-vector index types (IVF, product quantization)
That is real production work. It builds on the fundamentals above; it does not replace understanding them.
If you remember only five things
- Embeddings turn text into points in space — similar meaning is nearby, keywords optional.
- Pick one embedding model and evaluate it on your own queries. Retrieval-tuned small models often beat bigger general ones.
- Chunk well — retrieval returns paragraphs, not files; bad chunks cap RAG quality.
- Vector DB choice is mostly infrastructure at small scale — same neighbors, different RAM/disk/ops trade-offs.
- RAG = retrieve first, generate second — fix retrieval before tuning the LLM.
Closing
~800 chunks was enough to see semantic search work, watch models mis-rank, and feel RAG succeed or fail based on what got retrieved.
I learn this kind of thing by doing, not by reading another diagram. If you want the same, index something small you already have — notes, READMEs, runbooks — pick one model, one store, and five queries you care about. That is enough to start.