Vector Databases Explained: How AI Actually Understands Your Text

When I first saw that King - Man + Woman ≈ Queen in vector space, something clicked. Not intellectually — I'd read about word embeddings before. But seeing it actually work, watching the maths produce the right answer from pure numbers, was the moment I finally understood why everyone was excited about embeddings.

Vector databases are the backbone of every modern AI application — semantic search, recommendation engines, RAG systems, chatbots that actually know your data. But most tutorials skip the why and jump straight to pip install. That's like learning SQL without understanding what a relational database actually does.

Let me fix that.


What Are Embeddings? (And Why They Matter)

Traditional databases store data as rows and columns. You search with exact matches: WHERE name = 'pizza'. That's great for structured data. It's terrible for meaning.

Vector databases store data as embeddings — arrays of numbers that capture semantic meaning. The sentence "I love pizza" becomes something like:

[0.23, -0.41, 0.87, 0.12, -0.56, ...]  // 1,536 numbers

Here's the magic: sentences with similar meanings have similar numbers, regardless of the words used.

"I love pizza"              → [0.23, -0.41, 0.87, ...]
"Pizza is my favourite food" → [0.25, -0.39, 0.85, ...]  // Very close!
"I love debugging"          → [0.67, 0.12, -0.34, ...]   // Very different.

The embedding model (like OpenAI's text-embedding-3-small) has been trained on billions of text examples. It's learned that "love" and "favourite" carry similar weight, that "pizza" and "food" are related, and that "pizza" and "debugging" occupy completely different corners of meaning-space.

This is what makes semantic search possible. You don't search for keywords. You search for meaning.
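
Want to see this for yourself? Here's a minimal sketch using the OpenAI Python SDK (the v1-style client; it assumes OPENAI_API_KEY is set in your environment and uses the text-embedding-3-small model mentioned above):

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="I love pizza"
)

vector = response.data[0].embedding  # a plain Python list of 1,536 floats
print(len(vector), vector[:5])

Swap the input for "Pizza is my favourite food" and the two vectors come out close, exactly as in the example above.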


How Similarity Search Works

Once your text is converted to vectors, you need to measure how close two vectors are. There are three common approaches, and which one you pick matters.

Cosine Similarity (Most Common)

Measures the angle between two vectors. Ignores magnitude, focuses on direction.

  • 1.0 = identical meaning
  • 0.0 = completely unrelated
  • -1.0 = opposite meaning

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

Use when: Your embeddings vary in magnitude (most text use cases). This is the default for a reason.

Euclidean Distance

Measures the straight-line distance between two points in vector space. Smaller = more similar.

Use when: You care about absolute differences, not just direction. Less common for text, more useful for image features or numerical data.
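
For reference, a minimal NumPy sketch (same two-vector inputs as the cosine function above):

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two points; smaller means more similar.
    return np.linalg.norm(np.array(a) - np.array(b))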

Dot Product

Like cosine similarity, but it does take magnitude into account. Vectors that are both similar in direction and large in magnitude score highest.

Use when: Your embedding model is designed for it (some models normalise vectors, making dot product equivalent to cosine).
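
As a quick sketch, the dot product itself is a one-liner, and normalising the vectors first makes it equivalent to cosine similarity (which is exactly what those models rely on):

import numpy as np

def dot_product(a, b):
    return float(np.dot(a, b))

def normalise(v):
    # Scale to unit length; the dot product of unit vectors equals cosine similarity.
    v = np.array(v)
    return v / np.linalg.norm(v)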

My recommendation: Start with cosine similarity. It's robust, well-understood, and works for almost every text use case. Only switch if you have a specific reason.


Indexing: Why Brute Force Doesn't Scale

Here's the problem: if you have 1 million documents, comparing your query vector to every single stored vector takes 1 million similarity calculations. That's O(n). Slow. Unusable at scale.
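
To make that concrete, brute-force search is just "score everything, sort, take the top k". A rough sketch, assuming your stored embeddings sit in a NumPy matrix:

import numpy as np

def brute_force_search(query, stored, top_k=5):
    # stored: (n, dim) matrix of embeddings; query: (dim,) vector.
    # Normalise both so the dot product equals cosine similarity.
    stored = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = stored @ query                  # one similarity per stored vector: this is the O(n) part
    return np.argsort(scores)[::-1][:top_k]  # indices of the top-k most similar documents

Fine at 10,000 documents. Painful at 1 million.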

Vector databases solve this with specialised index structures.

HNSW (Hierarchical Navigable Small World)

The most popular index type. Think of it like a network of neighbours.

Imagine you're looking for a specific house in a city. Brute force: check every house on every street. HNSW: ask a neighbour, they point you closer, you ask that person's neighbour, they point you closer still. A few hops and you're there.

Technically, HNSW builds a multi-layer graph where each node connects to its nearest neighbours. Higher layers are sparse (for big jumps), lower layers are dense (for precision). Search starts at the top and navigates down.

  • Search time: O(log n) — millions of vectors in milliseconds
  • Tradeoff: Uses more memory (stores the graph structure)
  • Best for: Most use cases. Fast, accurate, well-supported
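
You'll almost never build HNSW yourself, but a small sketch with the hnswlib library shows the parameters you'll see in every vector DB's config (M, ef_construction, ef); the random data is just a stand-in for real embeddings:

import hnswlib
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, M=16, ef_construction=200)  # graph connectivity / build-time quality
index.add_items(vectors, ids=np.arange(10_000))

index.set_ef(64)  # search-time accuracy vs speed knob
labels, distances = index.knn_query(vectors[0], k=5)  # approximate nearest neighbours in milliseconds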

IVF (Inverted File Index)

Divides vector space into clusters (like postal codes). At query time, only searches the nearest clusters instead of everything.

  • Search time: Faster than brute force; accuracy is typically lower than HNSW at comparable speed
  • Tradeoff: Needs tuning (how many clusters? how many to search?)
  • Best for: Very large datasets where memory is constrained
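
Here's a rough FAISS sketch of the idea; the number of clusters (nlist) and the number searched per query (nprobe) are exactly the knobs you'd tune, and the values below are only illustrative:

import faiss
import numpy as np

dim = 384
vectors = np.random.rand(100_000, dim).astype(np.float32)

quantizer = faiss.IndexFlatL2(dim)               # used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, 1024) # 1,024 clusters
index.train(vectors)                             # learn the cluster centroids
index.add(vectors)

index.nprobe = 16                                # only search the 16 nearest clusters
distances, ids = index.search(vectors[:1], 5)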

Product Quantization (PQ)

Compresses vectors by splitting them into sub-vectors and approximating each one. Dramatically reduces memory usage at the cost of some accuracy.

  • Tradeoff: Lossy compression — some precision loss
  • Best for: Billions of vectors where memory is the bottleneck
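
In FAISS this is usually combined with IVF. A sketch where each 384-dimensional float32 vector (1,536 bytes) is compressed to 48 one-byte codes; the numbers are assumptions chosen only to show the ratio:

import faiss
import numpy as np

dim = 384
vectors = np.random.rand(100_000, dim).astype(np.float32)

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 48, 8)  # 1,024 clusters; 48 sub-vectors, 8 bits each
index.train(vectors)
index.add(vectors)

index.nprobe = 16
distances, ids = index.search(vectors[:1], 5)  # results are approximate: the compression is lossy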

My take: HNSW is the right choice for 95% of applications. It's the default in Pinecone, Weaviate, and Qdrant for good reason. Only reach for IVF or PQ when you're dealing with genuinely massive scale or tight memory constraints.


Choosing a Vector Database

This is where people get stuck. There are too many options and every vendor says they're the best. Here's an honest comparison based on what I've actually used:

Pinecone — Managed, Zero Ops

# Note: written against the older pinecone-client (v2) interface; newer SDK versions use a Pinecone class instead of init().
import pinecone

pinecone.init(api_key="your-key", environment="us-east-1")
index = pinecone.Index("my-index")

# Upsert
index.upsert(vectors=[
    ("doc1", [0.23, -0.41, 0.87, ...], {"text": "Original document"}),
])

# Query
results = index.query(vector=[0.25, -0.39, 0.85, ...], top_k=5)

Pros: Fully managed. No infrastructure to worry about. Scales automatically. Great for teams that don't want to manage databases.
Cons: Vendor lock-in. More expensive at scale. Limited querying beyond similarity search.
Best for: Production apps where you want zero ops overhead.

Weaviate — Open Source, Hybrid Search

Pros: Open source. Supports hybrid search (vector + keyword BM25 in one query). GraphQL API. Modules for auto-vectorisation.
Cons: More complex to set up and manage. Heavier resource usage.
Best for: Applications that need both semantic and keyword search — which, in practice, is most production RAG systems.
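
For flavour, here's a rough hybrid-search sketch against a local instance using the older v3-style weaviate-client (the Document class and its properties are hypothetical, and the newer v4 client has a different API):

import weaviate

client = weaviate.Client("http://localhost:8080")

results = (
    client.query
    .get("Document", ["content"])
    .with_hybrid(query="what food do people enjoy?", alpha=0.5)  # alpha blends keyword (0) and vector (1) scoring
    .with_limit(5)
    .do()
)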

Chroma — Lightweight, Great for Prototyping

import chromadb

client = chromadb.Client()
collection = client.create_collection("my-collection")

collection.add(
    documents=["I love pizza", "Pizza is my favourite food"],
    ids=["doc1", "doc2"]
)

results = collection.query(query_texts=["What food do you enjoy?"], n_results=2)

Pros: Dead simple API. Runs locally. Handles embedding generation for you. Perfect for experimentation.
Cons: Not designed for production scale. Limited configuration.
Best for: Prototyping, local development, learning.

pgvector — Postgres Extension

CREATE EXTENSION vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536)
);

-- Similarity search
SELECT content, embedding <=> '[0.23, -0.41, ...]' AS distance
FROM documents
ORDER BY distance
LIMIT 5;

Pros: Use your existing Postgres. No new infrastructure. SQL familiarity. Transactions and joins with your regular data.
Cons: Slower than purpose-built vector DBs at scale. Limited index types.
Best for: Teams already on Postgres who don't want another database. Works well under 1M vectors.
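
One addition worth making before you get anywhere near that 1M mark: without an index, the query above is a sequential scan, i.e. the brute-force problem from earlier. pgvector supports both HNSW and IVFFlat indexes; the operator classes below assume you're querying with cosine distance (<=>):

-- HNSW index for cosine distance (pgvector 0.5+); use vector_l2_ops for Euclidean.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Or IVFFlat, which needs a cluster count and works best once the table already has data.
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);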

Qdrant — Rust-Based, Fast, Self-Hosted

Pros: Very fast (written in Rust). Rich filtering alongside vector search. Open source with a managed option.
Cons: Smaller community than Pinecone/Weaviate. Fewer integrations.
Best for: Performance-sensitive applications. Teams comfortable self-hosting.
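
A rough sketch with the qdrant-client Python package, running in-memory for the example (the collection name, payload fields, and 3-dimensional vectors are all made up to keep it short; real embeddings would be 1,536-dimensional):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(":memory:")  # or point at a self-hosted instance, e.g. QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.23, -0.41, 0.87], payload={"text": "I love pizza", "lang": "en"})],
)

# Vector search combined with a payload filter (the rich filtering mentioned above).
hits = client.search(
    collection_name="docs",
    query_vector=[0.25, -0.39, 0.85],
    query_filter=Filter(must=[FieldCondition(key="lang", match=MatchValue(value="en"))]),
    limit=5,
)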


The Vector DB Decision Checklist

When someone asks me "which vector database should I use?", I run through this:

1. Are you prototyping or building for production?
Prototyping → Chroma. Get something working in 10 minutes.

2. Do you already have Postgres and < 1M vectors?
Yes → pgvector. Don't add infrastructure you don't need.

3. Do you need hybrid search (vector + keyword)?
Yes → Weaviate. Its hybrid search is genuinely best-in-class.

4. Do you want zero infrastructure management?
Yes → Pinecone. Pay more, worry less.

5. Is raw query speed your top priority?
Yes → Qdrant. Rust-level performance.

6. Budget-constrained and self-hosting?
Weaviate or Qdrant, both open source.

There's no universal "best." There's "best for your use case, your team, and your stage."


Putting It Together: The Embedding Pipeline

Here's the complete flow for any AI application using vector search:

1. Document comes in (PDF, webpage, chat message, etc.)
2. Split into chunks (paragraphs, sections — overlap by 200 chars)
3. Generate embeddings via API (OpenAI, Cohere, or local model)
4. Store embeddings + metadata in your vector database
5. User asks a question
6. Embed the question using the SAME model
7. Query vector DB for top-k similar chunks
8. Pass retrieved chunks + question to an LLM
9. LLM generates an answer grounded in YOUR data

Steps 5-9 are what people call RAG (Retrieval-Augmented Generation). But the quality of the answer in step 9 depends entirely on steps 2-4. Bad chunking, the wrong embedding model, or a poorly configured index means the LLM gets irrelevant context and hallucinates confidently.

The foundation matters. Get the vector layer right, and everything built on top works. Get it wrong, and no amount of prompt engineering will save you.
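
To make steps 1-7 concrete, here's a minimal sketch using Chroma (as in the prototyping example above), letting it handle the embedding. The chunking helper is deliberately naive and the sizes are illustrative; the final LLM call (steps 8-9) is left out:

import chromadb

def chunk(text, size=500, overlap=200):
    # Naive character-based chunking with overlap; real pipelines split on paragraphs or sections.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

client = chromadb.Client()
collection = client.create_collection("docs")

document = "Pizza is my favourite food. I could eat it every day..."   # step 1: document comes in
pieces = chunk(document)                                               # step 2: split into chunks

collection.add(                                                        # steps 3-4: embed + store (Chroma embeds for you)
    documents=pieces,
    ids=[f"doc1-chunk{i}" for i in range(len(pieces))],
)

results = collection.query(                                            # steps 5-7: embed the question, fetch top-k chunks
    query_texts=["What food do you enjoy?"],
    n_results=2,
)
context = results["documents"][0]  # these chunks + the question go to your LLM (steps 8-9)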


What's your use case? I'd genuinely love to hear — drop a comment and I'll tell you which vector DB fits. Building semantic search? RAG chatbot? Recommendation engine? The answer changes based on what you're actually building.
