Vector Database Tutorial: Build Semantic Search from Scratch (2026)
A vector database stores data as high-dimensional vectors (embeddings) and retrieves records by semantic similarity rather than exact keyword matching. This tutorial builds a complete pipeline: embeddings, ChromaDB, semantic search, metadata filtering, and a full PDF Q&A example.
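To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, the measure this tutorial configures ChromaDB to use. The three-dimensional vectors are hand-made for illustration; real embeddings have hundreds or thousands of dimensions and come from a model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not produced by any embedding model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A vector database performs this kind of comparison at scale, typically with an approximate index instead of comparing against every stored vector.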
Install
pip install chromadb openai
Generate Embeddings
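from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding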
Set Up ChromaDB
import chromadb
db = chromadb.PersistentClient(path="./chroma_store")
collection = db.get_or_create_collection("documents", metadata={"hnsw:space": "cosine"})
Add Documents
texts = ["Python is readable", "Rust is fast", "Docker containers isolate apps"]
vectors = [embed(t) for t in texts]
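# Illustrative metadata so that records can later be filtered by a `language` field.
metadatas = [{"language": "python"}, {"language": "rust"}, {"topic": "devops"}]
collection.add(ids=["1", "2", "3"], embeddings=vectors, documents=texts, metadatas=metadatas)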
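Hand-assigned IDs such as "1", "2", "3" are fine for a demo, but they are easy to reuse by accident once the collection grows. One possible approach, sketched here with a hypothetical `stable_id` helper that is not part of ChromaDB, is to derive each ID from a hash of the document text.

```python
import hashlib

def stable_id(text: str) -> str:
    """Derive a deterministic ID from the document text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

texts = ["Python is readable", "Rust is fast", "Docker containers isolate apps"]
ids = [stable_id(t) for t in texts]

# The same text always maps to the same ID, so re-running the indexing step
# with collection.upsert(ids=ids, ...) overwrites records instead of duplicating them.
for doc_id, text in zip(ids, texts):
    print(doc_id, text)
```

Truncating the digest to 16 hex characters is an arbitrary choice that keeps IDs short; use the full digest if collisions are a concern.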
Semantic Search
def search(query: str, n: int = 3):
    results = collection.query(query_embeddings=[embed(query)], n_results=n)
    # With the cosine space, distance = 1 - cosine similarity, so 1 - dist is the similarity.
    return [(doc, round(1 - dist, 3)) for doc, dist in
            zip(results["documents"][0], results["distances"][0])]

for doc, score in search("running apps in containers"):
    print(f"[{score}] {doc}")
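To show what the query step computes conceptually, here is a brute-force sketch in plain Python. It is a stand-in under stated assumptions: the vectors are hand-made toy values, not real embeddings, and ChromaDB uses an approximate HNSW index, not a full scan like this one.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (smaller means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def brute_force_search(query_vec, vectors, docs, n=3):
    """Score every stored vector against the query and keep the n closest."""
    scored = [(doc, cosine_distance(query_vec, vec)) for doc, vec in zip(docs, vectors)]
    scored.sort(key=lambda pair: pair[1])  # ascending distance
    # Convert distance back to a similarity score, as the search() function does.
    return [(doc, round(1 - dist, 3)) for doc, dist in scored[:n]]

docs = ["Python is readable", "Rust is fast", "Docker containers isolate apps"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical
query_vec = [0.1, 0.2, 0.9]  # hypothetical query pointing mostly toward the third doc

results = brute_force_search(query_vec, toy_vectors, docs, n=2)
for doc, score in results:
    print(f"[{score}] {doc}")
```

A full scan costs time proportional to the number of stored vectors, which is why vector databases rely on approximate indexes once collections grow.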
Filter by Metadata
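# Only records whose `language` metadata equals "python" are considered,
# so documents must be added with a `metadatas` entry for the filter to match anything.
results = collection.query(
    query_embeddings=[embed("web development")],
    n_results=2,
    where={"language": {"$eq": "python"}},
)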
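The effect of a `where` clause can be sketched in plain Python: the filter narrows the candidate set by metadata, and similarity ranking then applies only to what remains. The `matches` function below is a simplified, hypothetical stand-in that handles only the `{"field": {"$eq": value}}` form. It is not ChromaDB's implementation, and the metadata values are illustrative.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Simplified stand-in for a `where` filter: supports only {"field": {"$eq": value}}."""
    for field, condition in where.items():
        if metadata.get(field) != condition["$eq"]:
            return False
    return True

# Hypothetical records with illustrative metadata.
records = [
    {"id": "1", "document": "Python is readable", "metadata": {"language": "python"}},
    {"id": "2", "document": "Rust is fast", "metadata": {"language": "rust"}},
    {"id": "3", "document": "Docker containers isolate apps", "metadata": {"topic": "devops"}},
]

where = {"language": {"$eq": "python"}}
candidates = [r for r in records if matches(r["metadata"], where)]
print([r["document"] for r in candidates])  # ['Python is readable']
```

Records that lack the filtered field, like the third one here, never match. This is why a filter returns nothing on a collection whose documents were stored without metadata.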
Scaling
- ChromaDB — local, great for dev and < 1M vectors
- Pinecone — managed, billions of vectors, serverless
- pgvector — if you're already on PostgreSQL
- Qdrant — high-performance Rust, rich filtering
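A rough sizing calculation can help when choosing among these options. The sketch below estimates raw vector storage only, assuming 32-bit floats and the 1536-dimensional default output of text-embedding-3-small. Index structures, documents, and metadata add overhead that is not modeled here.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors alone (assumes float32 by default)."""
    return num_vectors * dimensions * bytes_per_value

DIMS = 1536  # default embedding size of text-embedding-3-small

for n in (10_000, 1_000_000, 100_000_000):
    gib = raw_vector_bytes(n, DIMS) / 1024**3
    print(f"{n:>11,} vectors ~ {gib:,.2f} GiB")
```

Under these assumptions, one million vectors need about 6.1 GB for the vectors alone, which still fits on a single machine. At hundreds of millions of vectors, the same arithmetic points toward a managed or distributed store.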
Originally published at kalyna.pro