DEV Community

Serhii Kalyna

Posted on • Originally published at kalyna.pro

Vector Database Tutorial: Build Semantic Search from Scratch (2026)

A vector database stores data as high-dimensional vectors and retrieves it by semantic similarity rather than by exact keyword match. This tutorial builds a complete pipeline: embeddings, ChromaDB, semantic search, metadata filtering, and a full PDF Q&A example.
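
To make "retrieves by semantic similarity" concrete, here is a minimal sketch in plain Python with no database involved. The three-dimensional vectors, the `store` dict, and the `nearest` helper are invented for illustration; real embeddings have far more dimensions (1536 by default for text-embedding-3-small), and a vector database replaces this full scan with an approximate nearest-neighbor index such as HNSW.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are made up for illustration.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.2, 0.9],
}

def nearest(query_vec: list[float], k: int = 2) -> list[tuple[str, float]]:
    """Brute-force search: score every stored vector, return the top k."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

top = nearest([0.85, 0.15, 0.05])
for name, score in top:
    print(f"{name}: {score:.3f}")
```

The query vector points in nearly the same direction as "cat" and "kitten", so those two rank ahead of "truck" even though no strings were compared.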

Install

pip install chromadb openai

Generate Embeddings

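from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding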

Set Up ChromaDB

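import chromadb

# PersistentClient writes the index to disk, so data survives restarts.
db = chromadb.PersistentClient(path="./chroma_store")
# "hnsw:space": "cosine" makes queries return cosine distances.
collection = db.get_or_create_collection("documents", metadata={"hnsw:space": "cosine"})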

Add Documents

texts = ["Python is readable", "Rust is fast", "Docker containers isolate apps"]
vectors = [embed(t) for t in texts]
collection.add(ids=["1","2","3"], embeddings=vectors, documents=texts)
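
The list comprehension above sends one request per text. The embeddings endpoint also accepts a list of strings as `input`, so a batch can go out in a single call. The sketch below is one possible way to do that; `embed_batch` is a hypothetical helper, and `FakeEmbeddings` is an offline stand-in (not part of any library) that only mimics the response shape so the example runs without an API key.

```python
from types import SimpleNamespace

def embed_batch(client, texts: list[str],
                model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed many texts in one request instead of one request per text."""
    response = client.embeddings.create(model=model, input=texts)
    # Each item carries an index; sort on it so vectors line up with texts.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

class FakeEmbeddings:
    """Hypothetical offline stand-in that mimics the response shape."""
    def create(self, model: str, input: list[str]) -> SimpleNamespace:
        # Fake 2-d "embedding": text length plus a zero, for illustration only.
        data = [SimpleNamespace(index=i, embedding=[float(len(t)), 0.0])
                for i, t in enumerate(input)]
        return SimpleNamespace(data=data)

fake_client = SimpleNamespace(embeddings=FakeEmbeddings())
texts = ["Python is readable", "Rust is fast", "Docker containers isolate apps"]
vectors = embed_batch(fake_client, texts)
print(len(vectors))
```

With the real client from the earlier section, the call would be `embed_batch(client, texts)`, and the returned list can be passed to `collection.add` as `embeddings`.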

Semantic Search

def search(query: str, n: int = 3):
    results = collection.query(query_embeddings=[embed(query)], n_results=n)
    # Cosine distance is 1 - cosine similarity, so 1 - dist recovers the score.
    return [(doc, round(1 - dist, 3)) for doc, dist in
            zip(results["documents"][0], results["distances"][0])]

for doc, score in search("running apps in containers"):
    print(f"[{score}] {doc}")
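
The `[0]` indexing in `search` is there because a query result holds one inner list per query embedding, and `search` sends exactly one. The sketch below uses a hand-written dictionary in that shape, with invented distances, to show the conversion without a database or API key; `fake_results` and `to_scores` are illustrative names only.

```python
# Hand-written stand-in for a query result; the distances are invented.
# Each field holds one inner list per query embedding, hence the index below.
fake_results = {
    "ids": [["3", "1", "2"]],
    "documents": [["Docker containers isolate apps",
                   "Python is readable",
                   "Rust is fast"]],
    "distances": [[0.412, 0.803, 0.851]],
}

def to_scores(results: dict, query_index: int = 0) -> list[tuple[str, float]]:
    """Pair each document with a similarity score (1 - cosine distance)."""
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return [(doc, round(1 - dist, 3)) for doc, dist in zip(docs, dists)]

scores = to_scores(fake_results)
for doc, score in scores:
    print(f"[{score}] {doc}")
```

Because cosine distance ranges from 0 to 2, the resulting score ranges from 1 (same direction) down to -1 (opposite direction), and results arrive sorted from nearest to farthest.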

Filter by Metadata

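# Only records whose metadata has language == "python" are eligible,
# so documents must have been added with a metadatas=[...] argument.
results = collection.query(
    query_embeddings=[embed("web development")],
    n_results=2,
    where={"language": {"$eq": "python"}},
)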
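
A `where` clause narrows the set of records that can be returned; similarity ranking then applies only to the records that pass. The plain-Python sketch below illustrates that idea for the `$eq` form used above. It is not ChromaDB's implementation, and the records and their `language` values are hypothetical.

```python
# Illustrative stand-in for a metadata filter; not ChromaDB's implementation.
# The records and the "language" field values are hypothetical.
records = [
    {"id": "1", "document": "Flask builds web apps", "metadata": {"language": "python"}},
    {"id": "2", "document": "Actix builds web services", "metadata": {"language": "rust"}},
    {"id": "3", "document": "Django ships an admin UI", "metadata": {"language": "python"}},
]

def matches(metadata: dict, where: dict) -> bool:
    """Support only the {"field": {"$eq": value}} form."""
    for field, condition in where.items():
        if metadata.get(field) != condition["$eq"]:
            return False
    return True

where = {"language": {"$eq": "python"}}
candidates = [r for r in records if matches(r["metadata"], where)]
print([r["id"] for r in candidates])
```

Only the surviving candidates would be scored against the query vector, which is why a record without a `language` entry in its metadata can never appear in the filtered results.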

Scaling

  • ChromaDB — local, great for dev and < 1M vectors
  • Pinecone — managed, billions of vectors, serverless
  • pgvector — if you're already on PostgreSQL
  • Qdrant — high-performance Rust, rich filtering

