85. Embeddings and Vector Search: Memory for Language Models

A language model has no memory.

You ask it a question. It generates an answer from its pretrained weights. Those weights encode general knowledge learned from training data that was frozen months ago.

Your company's internal documentation? Not in there. Yesterday's news? Not in there. The specific customer complaint from last Tuesday? Definitely not in there.

The model cannot retrieve facts it was not trained on. Without retrieval, you are limited to what the model memorized during pretraining.

Vector search solves this. Convert your documents to dense vectors. Store them in a vector database. When a question comes in, convert it to a vector. Find the most similar document vectors. Give those documents to the language model as context.

Now the model can answer questions about your specific knowledge base. It retrieves relevant information at query time, not at training time.


What Embeddings Are (Revisited With Production Focus)

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

print("Embedding: a dense vector representation of text.")
print()
print("Input:  any text string (word, sentence, paragraph, document)")
print("Output: a fixed-size vector of floats (e.g., 384 or 768 dimensions)")
print()
print("Key property: similar meaning → similar vector")
print("              distance in vector space = semantic distance")
print()

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Machine learning is a subset of AI.",
    "Deep learning uses neural networks.",
    "The weather is sunny today.",
    "Artificial intelligence learns from data.",
    "It is raining cats and dogs.",
]

embeddings = model.encode(sentences)
print(f"Embedding model: all-MiniLM-L6-v2")
print(f"Input sentences: {len(sentences)}")
print(f"Embedding shape: {embeddings.shape}  (5 sentences × 384 dimensions)")
print()

sim_matrix = cosine_similarity(embeddings)
print("Cosine similarity matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            sim = sim_matrix[i, j]
            note = " ← similar!" if sim > 0.7 else ""
            print(f"  {sim:.3f}  '{s1[:35]}''{s2[:35]}'{note}")

Output:

Cosine similarity matrix:
  0.821  'Machine learning is a subset of AI.' ↔ 'Artificial intelligence learns from data.' ← similar!
  0.743  'Machine learning is a subset of AI.' ↔ 'Deep learning uses neural networks.' ← similar!
  0.231  'Machine learning is a subset of AI.' ↔ 'The weather is sunny today.'
  0.189  'Machine learning is a subset of AI.' ↔ 'It is raining cats and dogs.'
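
For reference, cosine_similarity from scikit-learn is doing nothing exotic: it is the dot product of two vectors divided by the product of their norms. A minimal NumPy version, reusing the embeddings array from the snippet above:

def cosine(a, b):
    # dot product divided by the product of vector lengths:
    # 1.0 = same direction, 0.0 = orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# should match sim_matrix[0, 3] from the output above
print(cosine(embeddings[0], embeddings[3]))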

Building a Semantic Search System

corpus = [
    "Python is a high-level programming language known for readability.",
    "Machine learning algorithms improve through experience without explicit programming.",
    "Deep learning is a subset of machine learning using neural networks.",
    "The transformer architecture revolutionized natural language processing.",
    "BERT uses bidirectional attention to understand context in both directions.",
    "GPT generates text by predicting the next token in a sequence.",
    "Vector databases store embeddings and enable fast similarity search.",
    "RAG combines retrieval systems with language models for knowledge access.",
    "Fine-tuning adapts a pretrained model to a specific downstream task.",
    "Cosine similarity measures the angle between two vectors in space.",
    "The attention mechanism lets models focus on relevant parts of input.",
    "Tokenization converts raw text into numerical tokens for models.",
    "Embeddings map words or sentences to dense low-dimensional vectors.",
    "Gradient descent optimizes model parameters by following the loss gradient.",
    "Overfitting occurs when a model memorizes training data but fails to generalize.",
]

corpus_embeddings = model.encode(corpus, show_progress_bar=False)
print(f"Corpus: {len(corpus)} documents")
print(f"Embeddings shape: {corpus_embeddings.shape}")
print()

def semantic_search(query, corpus, corpus_embeddings, model, top_k=3):
    query_embedding  = model.encode([query])
    similarities     = cosine_similarity(query_embedding, corpus_embeddings)[0]
    top_indices      = np.argsort(similarities)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            "text":  corpus[idx],
            "score": similarities[idx],
            "rank":  len(results) + 1
        })
    return results

queries = [
    "How do neural networks learn?",
    "What is the purpose of tokenization?",
    "How can I prevent overfitting?",
]

for query in queries:
    results = semantic_search(query, corpus, corpus_embeddings, model)
    print(f"Query: '{query}'")
    for r in results:
        print(f"  [{r['rank']}] {r['score']:.3f}  {r['text']}")
    print()

Output:

Query: 'How do neural networks learn?'
  [1] 0.823  Machine learning algorithms improve through experience without explicit programming.
  [2] 0.798  Gradient descent optimizes model parameters by following the loss gradient.
  [3] 0.776  Deep learning is a subset of machine learning using neural networks.

Query: 'What is the purpose of tokenization?'
  [1] 0.891  Tokenization converts raw text into numerical tokens for models.
  [2] 0.712  Embeddings map words or sentences to dense low-dimensional vectors.
  [3] 0.634  BERT uses bidirectional attention to understand context in both directions.
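
If you would rather not write the ranking loop yourself, sentence-transformers ships a util.semantic_search helper that does the same top-k cosine scoring (it accepts NumPy arrays or tensors). A minimal sketch against the same corpus:

from sentence_transformers import util

# one result list per query; each hit carries the corpus index and cosine score
hits = util.semantic_search(
    model.encode(["How do neural networks learn?"]),
    corpus_embeddings,
    top_k=3
)[0]

for hit in hits:
    print(f"  {hit['score']:.3f}  {corpus[hit['corpus_id']]}")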

Vector Databases: Production-Grade Storage

print("Why Not Just Use NumPy Cosine Similarity?")
print()
print("  For 1,000 documents: NumPy works fine.")
print("  For 1,000,000 documents: computing similarity against all docs")
print("  at every query is slow and memory-intensive.")
print()
print("  1M docs × 384 dims × 4 bytes = 1.5 GB just for storage.")
print("  Brute-force search = 1M dot products per query.")
print("  At 100 queries/second: 100M dot products/second.")
print()
print("Vector databases solve this with approximate nearest neighbor (ANN) indexes.")
print()

ann_algorithms = {
    "HNSW": (
        "Hierarchical Navigable Small World",
        "Graph-based. Fast search. High recall. Most popular.",
        "Used by: Chroma, Qdrant, Pinecone"
    ),
    "IVF":  (
        "Inverted File Index",
        "Cluster-based. Good for billions of vectors.",
        "Used by: Faiss, Milvus"
    ),
    "LSH":  (
        "Locality Sensitive Hashing",
        "Hash-based. Very fast. Lower recall.",
        "Good for: streaming, exact doesn't matter"
    ),
    "ScaNN":(
        "Scalable Nearest Neighbors",
        "Google's algorithm. Quantization-aware.",
        "Used by: Google Search, YouTube recommendations"
    ),
}

print("ANN Index Algorithms:")
for algo, (full_name, description, usage) in ann_algorithms.items():
    print(f"  {algo:<6}: {full_name}")
    print(f"         {description}")
    print(f"         {usage}")
    print()
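
To make that brute-force baseline concrete, here is what exhaustive search over a million vectors looks like. The data is synthetic and the timing depends on your machine, but it shows the per-query cost an ANN index is designed to avoid:

import time
import numpy as np

n_docs, dim = 1_000_000, 384
rng  = np.random.default_rng(0)
docs = rng.random((n_docs, dim), dtype=np.float32)     # ~1.5 GB, as computed above
docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # normalize once at index time

query = rng.random(dim, dtype=np.float32)
query /= np.linalg.norm(query)

start  = time.time()
scores = docs @ query                                  # 1M dot products = cosine scores
top5   = np.argsort(scores)[::-1][:5]                  # exact top 5
print(f"Brute-force query over {n_docs:,} docs took {time.time() - start:.3f}s")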

ChromaDB: Simplest Vector Database

print("ChromaDB: Simplest vector database for development")
print()
print("pip install chromadb")
print()

chroma_code = """
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize
client     = chromadb.Client()
collection = client.create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"}
)

# Add documents
documents = [
    "Machine learning learns from data.",
    "Deep learning uses neural networks.",
    "Python is great for data science.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents).tolist()

collection.add(
    documents  = documents,
    embeddings = embeddings,
    ids        = ["doc_1", "doc_2", "doc_3"],
    metadatas  = [{"source": "wiki"}, {"source": "wiki"}, {"source": "blog"}]
)

# Query
query = "How do neural networks work?"
query_embedding = model.encode([query]).tolist()

results = collection.query(
    query_embeddings = query_embedding,
    n_results        = 2,
    where            = {"source": "wiki"}  # metadata filtering!
)

print(results["documents"])
# [['Deep learning uses neural networks.', 'Machine learning learns from data.']]
"""

print(chroma_code)
print()
print("ChromaDB features:")
print("  Runs in-memory or persistent on disk")
print("  Metadata filtering (filter by source, date, category)")
print("  Built-in HNSW indexing")
print("  Python-first API")
print("  Perfect for prototyping RAG systems")
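
The client above lives in memory and disappears when the process exits. For the persistent mode mentioned in the feature list, ChromaDB provides a persistent client; a minimal sketch (the path is arbitrary):

import chromadb

# writes documents and the HNSW index to disk so they survive restarts
client     = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"}
)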

FAISS: Production-Scale Search

print("FAISS: Facebook AI Similarity Search")
print()
print("pip install faiss-cpu  # or faiss-gpu")
print()

faiss_code = """
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode corpus
corpus    = [...]  # your documents
embeddings = model.encode(corpus).astype("float32")

# Normalize so inner product = cosine similarity
faiss.normalize_L2(embeddings)

# Build HNSW index over inner-product (cosine) similarity
dim   = embeddings.shape[1]  # 384
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # 32 graph connections per node
index.add(embeddings)

print(f"Index contains {index.ntotal} vectors")

# Search
query     = "How does attention work?"
query_emb = model.encode([query]).astype("float32")
faiss.normalize_L2(query_emb)

distances, indices = index.search(query_emb, k=5)

print("Top 5 results:")
for dist, idx in zip(distances[0], indices[0]):
    print(f"  sim={dist:.3f}  {corpus[idx]}")

# Save and load
faiss.write_index(index, "my_index.faiss")
index = faiss.read_index("my_index.faiss")
"""

print(faiss_code)
print()
print("FAISS strengths:")
print("  Handles billions of vectors")
print("  GPU acceleration available")
print("  Multiple index types (Flat, IVF, HNSW, PQ)")
print("  Used in production at Facebook, LinkedIn, Spotify")
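
Of the other index types listed above, IVF is the usual choice when the corpus gets too large for a flat or HNSW index. Unlike HNSW, an IVF index must be trained on a sample of the data before vectors are added; a minimal sketch with illustrative sizes:

import faiss
import numpy as np

dim, nlist = 384, 1024                  # nlist = number of clusters (roughly sqrt of corpus size)
quantizer  = faiss.IndexFlatIP(dim)     # coarse quantizer that assigns vectors to clusters
index      = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

vectors = np.random.rand(200_000, dim).astype("float32")
faiss.normalize_L2(vectors)

index.train(vectors)                    # learn the cluster centroids
index.add(vectors)
index.nprobe = 16                       # clusters visited per query: higher = better recall, slower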

Choosing the Right Embedding Model

embedding_models = {
    "all-MiniLM-L6-v2": {
        "dims":    384,
        "params":  "22M",
        "speed":   "fastest",
        "quality": "good",
        "use":     "prototyping, high-throughput"
    },
    "all-mpnet-base-v2": {
        "dims":    768,
        "params":  "109M",
        "speed":   "moderate",
        "quality": "better",
        "use":     "production, balanced"
    },
    "text-embedding-3-small": {
        "dims":    1536,
        "params":  "OpenAI API",
        "speed":   "API call",
        "quality": "excellent",
        "use":     "production, no local GPU"
    },
    "text-embedding-3-large": {
        "dims":    3072,
        "params":  "OpenAI API",
        "speed":   "API call",
        "quality": "best",
        "use":     "high-accuracy requirements"
    },
    "e5-large-v2": {
        "dims":    1024,
        "params":  "335M",
        "speed":   "slow",
        "quality": "excellent",
        "use":     "accuracy-critical, local"
    },
    "bge-large-en-v1.5": {
        "dims":    1024,
        "params":  "335M",
        "speed":   "slow",
        "quality": "excellent",
        "use":     "MTEB leaderboard top performer"
    },
}

print(f"{'Model':<28} {'Dims':>6} {'Params':>12} {'Speed':>10} {'Quality':>10}")
print("=" * 70)
for name, info in embedding_models.items():
    print(f"{name:<28} {info['dims']:>6} {info['params']:>12} "
          f"{info['speed']:>10} {info['quality']:>10}")

print()
print("Rule of thumb:")
print("  Start with all-MiniLM-L6-v2 (fast, free, good enough for most cases)")
print("  Upgrade to mpnet or e5 if quality matters more than speed")
print("  Use OpenAI embeddings if you are already calling the API anyway")
print("  Check MTEB leaderboard for the latest benchmark rankings:")
print("  huggingface.co/spaces/mteb/leaderboard")
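
If you take the OpenAI route from the table, embedding is an API call rather than a local model. A minimal sketch, assuming the openai package is installed and OPENAI_API_KEY is set in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Machine learning is a subset of AI."]
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions, matching the table above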

Chunking Strategy

print("Chunking: How to Split Documents Before Embedding")
print()
print("Embeddings represent fixed-size text.")
print("Long documents must be split into chunks.")
print("The chunking strategy dramatically affects retrieval quality.")
print()

strategies = {
    "Fixed size": {
        "description": "Split every N characters or tokens",
        "pros":        "Simple, predictable",
        "cons":        "Splits mid-sentence, loses context",
        "use":         "Quick prototype only"
    },
    "Sentence splitting": {
        "description": "Split at sentence boundaries",
        "pros":        "Preserves complete thoughts",
        "cons":        "Variable chunk size, single sentence may lack context",
        "use":         "Short documents, chat logs"
    },
    "Recursive character splitting": {
        "description": "Split at paragraphs → sentences → words as needed",
        "pros":        "Respects document structure",
        "cons":        "Slightly more complex",
        "use":         "Most documents, LangChain default"
    },
    "Semantic chunking": {
        "description": "Split when embedding similarity drops significantly",
        "pros":        "Preserves semantic coherence",
        "cons":        "Slow (encodes every sentence), more complex",
        "use":         "High-quality RAG pipelines"
    },
    "Document-structure aware": {
        "description": "Split at headers/sections/pages",
        "pros":        "Preserves document organization",
        "cons":        "Requires document parsing",
        "use":         "PDFs, technical docs, books"
    },
}

for name, info in strategies.items():
    print(f"  {name}:")
    print(f"    {info['description']}")
    print(f"{info['pros']}")
    print(f"{info['cons']}")
    print(f"    Use: {info['use']}")
    print()

print("Practical recommendation:")
print("  chunk_size=512 tokens, chunk_overlap=64 tokens")
print("  Overlap preserves context at chunk boundaries")
print("  Use LangChain's RecursiveCharacterTextSplitter or similar")
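
A sketch of that recommendation with LangChain's splitter. Note the sizes below are measured in characters (use the tiktoken-based constructor for true token counts), and depending on your LangChain version the import may be langchain.text_splitter instead:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraphs, then sentences, then words
)

chunks = splitter.split_text(long_document)    # long_document: any long string you want to index
print(f"{len(chunks)} chunks, first chunk starts: {chunks[0][:80]}...")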

Evaluating Retrieval Quality

def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    hits = sum(1 for doc in retrieved_k if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    hits = sum(1 for doc in retrieved_k if doc in relevant)
    return hits / len(relevant) if relevant else 0

def mean_reciprocal_rank(retrieved, relevant):
    for rank, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1 / rank
    return 0

retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_2"]
relevant  = {"doc_1", "doc_3"}

k_values = [1, 2, 3, 5]
print("Retrieval Evaluation Metrics:")
print(f"\nRetrieved: {retrieved}")
print(f"Relevant:  {sorted(relevant)}")
print()
print(f"{'Metric':<25} {'Value':>8}")
print("-" * 35)
for k in k_values:
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"  Precision@{k:<3}            {p:>8.3f}")
    print(f"  Recall@{k:<3}               {r:>8.3f}")

mrr = mean_reciprocal_rank(retrieved, relevant)
print(f"  MRR                      {mrr:>8.3f}")
print()
print("Good retrieval system targets:")
print("  Recall@5 > 0.8  (find 80% of relevant docs in top 5)")
print("  Precision@3 > 0.5  (at least half of top 3 are relevant)")
print("  MRR > 0.7  (relevant doc usually appears in top 2)")
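
To apply these metrics to the semantic_search function from earlier, you need a small hand-labeled set of query-to-relevant-document pairs. A minimal sketch, where the gold labels are hypothetical corpus indices:

# hypothetical gold labels: query -> indices of relevant docs in the corpus
gold = {
    "How do neural networks learn?":        {2, 13},
    "What is the purpose of tokenization?": {11},
}

for query, relevant_ids in gold.items():
    results   = semantic_search(query, corpus, corpus_embeddings, model, top_k=5)
    retrieved = [corpus.index(r["text"]) for r in results]
    print(f"{query}")
    print(f"  Recall@5: {recall_at_k(retrieved, relevant_ids, 5):.2f}")
    print(f"  MRR:      {mean_reciprocal_rank(retrieved, relevant_ids):.2f}")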

A Resource Worth Reading

Pinecone published a free e-book called "Vector Databases" at pinecone.io/learn/vector-database that covers indexing algorithms, embedding models, and production deployment in depth. The content is vendor-neutral despite the source. One of the most comprehensive treatments of vector search available for free. Search "Pinecone vector database guide."

The MTEB (Massive Text Embedding Benchmark) leaderboard at huggingface.co/spaces/mteb/leaderboard ranks embedding models across 56 tasks. It is the definitive resource for selecting embedding models and is updated as new models are released. Check it before choosing a model for production.


Try This

Create vector_search_practice.py.

Part 1: build a semantic search system from scratch. Take any 50+ document dataset (Wikipedia articles, news, documentation). Encode all documents. Accept a query string. Return top 5 most similar documents with scores. Test with 10 different queries. Are the results sensible?

Part 2: compare embedding models. Encode the same 20 sentences with all-MiniLM-L6-v2 and all-mpnet-base-v2. For 5 query-answer pairs, check which model ranks the correct answer higher. Does the larger model improve retrieval?

Part 3: chunking experiment. Take one long document (a Wikipedia article, a paper, a blog post). Split it with three strategies: fixed 200-word chunks, sentence-by-sentence, and 512-token chunks with 64-token overlap. Index all three versions. Run 5 queries. Which chunking produces more coherent retrieved passages?

Part 4: implement ChromaDB. Install ChromaDB. Create a collection. Add 50 documents with metadata (source, date, category). Query with metadata filtering. Return only documents from a specific source. Verify the metadata filter works.


What's Next

You can embed documents and search them by semantic similarity. The next post connects this to language models: Retrieval-Augmented Generation. A user asks a question. You retrieve the relevant documents. You pass them to the LLM as context. The LLM generates an answer grounded in your specific knowledge. This is how production AI assistants actually work.
