Farhan Khan
Vector Databases

1️⃣ Introduction

🔹 Vector

  • A vector is simply an ordered list (array) of numbers.
  • Can represent data points in 2D, 3D, or higher dimensions.
  • Example:
    • 2D → [3.5, 7.2]
    • 3D → [1.2, -4.5, 6.0]
  • In machine learning, vectors are used to describe positions in a multi-dimensional space.
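
In code, a vector is just an array of numbers, e.g., with NumPy:

import numpy as np

v2 = np.array([3.5, 7.2])         # 2D vector
v3 = np.array([1.2, -4.5, 6.0])   # 3D vector
print(v3.shape)                   # (3,) -> three dimensions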

🔹 Embedding Vector

  • An embedding vector (or just embedding) is a special kind of vector generated by an embedding model.
  • Purpose: represent complex data (text, images, audio, etc.) in a way that captures meaning and similarity.
  • All embeddings are vectors, but not all vectors are embeddings.
  • Dimensions (e.g., 384, 768, 1536) are fixed by the model design.
    • Lightweight (384–512 dimensions) → e.g., all-MiniLM-L6-v2 (384d), Universal Sentence Encoder (512d)
    • Standard (768–1,536 dimensions) → e.g., all-mpnet-base-v2 (768d), OpenAI text-embedding-ada-002 (1,536d)
    • High capacity (2,000–3,000+ dimensions) → e.g., OpenAI text-embedding-3-large (3,072d)
  • Higher dimensions = richer detail, but more costly to store and search.
  • Represents the semantic meaning of data, so similar concepts are placed close together in vector space.
    • “dog” and “puppy” → embedding vectors close together.
    • “dog” and “car” → embedding vectors far apart.
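
A quick way to see this in code (a sketch assuming the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned above):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
emb = model.encode(["dog", "puppy", "car"])

print(emb.shape)                      # (3, 384)
print(util.cos_sim(emb[0], emb[1]))   # "dog" vs "puppy" -> higher similarity
print(util.cos_sim(emb[0], emb[2]))   # "dog" vs "car"   -> lower similarity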

🔹 Positional Encoding

  • Positional encoding is a technique used in Transformer models to provide word order information to embeddings.
  • Purpose: ensure that the sequence of tokens matters, so sentences with the same words in different orders have different meanings.
  • Implemented by adding a positional vector to each token embedding before feeding it into the model.
    • Can be sinusoidal (fixed mathematical functions).
    • Or learned (trainable position embeddings).
  • Formula (sinusoidal encoding):
    • PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Sentence-level embeddings indirectly include positional information, since it’s baked into the Transformer during encoding.
  • Preserves contextual meaning by distinguishing different word orders.
    • “dog bites man” → one meaning.
    • “man bites dog” → very different meaning.
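
A minimal NumPy sketch of the sinusoidal formula above (assumes an even d_model):

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even indices
    pe[:, 1::2] = np.cos(angles)             # odd indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=768)
print(pe.shape)  # (10, 768); token_embeddings + pe is what the Transformer sees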

🔹 Vector Databases

  • A vector database is a system built to store and retrieve vectors, especially embeddings.
  • Core features:
    • Persistence: store large volumes of embeddings.
    • Similarity search: find nearest neighbors using cosine similarity, dot product, or Euclidean distance.
    • Indexing: HNSW, IVF, Annoy, PQ for fast retrieval.
    • Database functions: CRUD operations, filtering, replication, and scaling.
  • Purpose: enable semantic search → finding results based on meaning instead of exact keyword matches.

2️⃣ Persistence

  • Vectors often number in the millions or billions, so keeping them only in memory is not practical.
  • Persistence ensures embeddings are stored long-term and survive restarts or failures.
  • CRUD: create, read, update, and delete vectors.
  • Durability: vectors are written to disk or distributed storage.
  • Index persistence: not just the raw vectors but also the indexing structures (like HNSW graphs) are persisted.
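
To make this concrete, here is a minimal CRUD-plus-persistence sketch using the Chroma library (an assumption for illustration; the rest of this post uses OpenAI and Elasticsearch):

import chromadb

client = chromadb.PersistentClient(path="./vector_store")   # data survives restarts
col = client.get_or_create_collection("documents")

# Create
col.add(ids=["1"], embeddings=[[0.12, -0.87, 0.33]],
        documents=["The cat sat on the mat"])
# Read
print(col.get(ids=["1"])["documents"])
# Update
col.update(ids=["1"], embeddings=[[0.10, -0.80, 0.30]])
# Delete
col.delete(ids=["1"])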

🔹 Example: creating an embedding (OpenAI)

from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat"
)

vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions

🔹 Example: insertion (Elasticsearch)
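
Note: for the kNN searches shown later, the embedding field needs a dense_vector mapping first (a sketch assuming Elasticsearch 8.x; dims must match the embedding length, here 3 for the toy vectors):

PUT /documents
{
  "mappings": {
    "properties": {
      "content":   { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}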

PUT /documents/_doc/1
{
  "content": "The cat sat on the mat",
  "embedding": [0.12, -0.87, 0.33]
}

PUT /documents/_doc/2
{
  "content": "The dog chased the ball",
  "embedding": [0.91, 0.04, -0.22]
}

3️⃣ Similarity Search

  • Core function of a vector database → find vectors closest in meaning to a query vector.
  • Powers semantic search, recommendations, fraud detection, and Retrieval-Augmented Generation (RAG).

🔹 Common distance metrics

  • Cosine similarity: the cosine of the angle between two vectors; ignores magnitude.
  • Euclidean distance: straight-line distance between two points in vector space.
  • Dot product: measures alignment and magnitude; equal to cosine similarity when vectors are normalized.
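
The three metrics in NumPy, using the toy vectors from the earlier examples:

import numpy as np

a = np.array([0.12, -0.87, 0.33])
b = np.array([0.15, -0.52, 0.48])

dot = np.dot(a, b)                                      # dot product
euclidean = np.linalg.norm(a - b)                       # straight-line distance
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity

print(dot, euclidean, cosine)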

🔹 Example: semantic search (Elasticsearch)

k-Nearest Neighbors (kNN) is the task of finding the k most similar vectors to a query vector. In vector databases, it’s the core operation behind semantic search, recommendations, and RAG.

POST /documents/_search
{
  "size": 2,
  "knn": {
    "field": "embedding",
    "query_vector": [0.15, -0.52, 0.48, ...],
    "k": 2,
    "num_candidates": 10
  },
  "_source": ["content"]
}

4️⃣ Indexing

  • Searching vectors directly without an index is too slow for large datasets.
  • Indexing structures organize vectors so that nearest-neighbor queries can be answered efficiently.
  • These methods implement Approximate Nearest Neighbor (ANN) search → trading a bit of accuracy for major speed gains.

🔹 Popular ANN-based algorithms

All of the algorithms below are ANN methods; they differ in how they organize vectors and in how they trade memory, accuracy, and update flexibility against speed.

HNSW (Hierarchical Navigable Small World Graph)

  • Builds a multi-layer graph where each vector connects to its neighbors.
  • ✅ Fast queries, high recall, supports dynamic updates.
  • ❌ High memory consumption, complex to tune.
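
A minimal HNSW sketch, assuming the hnswlib package (M, ef_construction, and ef values are illustrative):

import numpy as np
import hnswlib

dim = 128
data = np.random.rand(10_000, dim).astype(np.float32)

# Build the multi-layer HNSW graph (cosine distance)
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, np.arange(10_000))

index.set_ef(50)   # higher ef = better recall, slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)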

IVF (Inverted File Index)

  • Clusters vectors into groups (using k-means or similar).
  • At query time, only the most relevant clusters are searched.
  • ✅ Efficient for large datasets, reduces search scope.
  • ❌ Accuracy depends on clustering quality.

PQ (Product Quantization)

  • Compresses vectors into compact codes to save memory.
  • ✅ Great for billion-scale datasets, reduces storage dramatically.
  • ❌ Some loss in accuracy due to compression.
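
IVF and PQ are often combined; a minimal sketch using the FAISS library's IndexIVFPQ (FAISS is an assumption here, and nlist, m, and nprobe are illustrative):

import numpy as np
import faiss

d = 128                                                # vector dimensionality
xb = np.random.rand(100_000, d).astype(np.float32)    # database vectors
xq = np.random.rand(5, d).astype(np.float32)          # query vectors

# IVF: cluster vectors into nlist cells; PQ: compress each vector into m 8-bit codes
nlist, m = 1024, 16
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

index.train(xb)    # learn the clusters and PQ codebooks
index.add(xb)
index.nprobe = 10  # search only the 10 closest clusters

distances, ids = index.search(xq, 5)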

Annoy (Approximate Nearest Neighbors)

  • Builds multiple random projection trees for searching.
  • ✅ Lightweight, simple to use, good for read-heavy static datasets.
  • ❌ Slower than HNSW at very large scale; the index is immutable once built, so updates require a full rebuild.
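
A minimal Annoy sketch, assuming the annoy package:

import numpy as np
from annoy import AnnoyIndex

dim = 128
index = AnnoyIndex(dim, "angular")   # "angular" is roughly cosine distance

# Add items, then build the random-projection trees (read-only after build)
for i in range(10_000):
    index.add_item(i, np.random.rand(dim).tolist())
index.build(10)            # more trees = better accuracy, bigger index
index.save("vectors.ann")  # memory-mappable file, good for read-heavy workloads

query = np.random.rand(dim).tolist()
neighbor_ids = index.get_nns_by_vector(query, 5)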

ScaNN (Scalable Nearest Neighbors, Google)

  • Optimized for high-dimensional, large-scale data.
  • ✅ Very fast, optimized for Google-scale workloads, low memory footprint.
  • ❌ Less community support, limited flexibility outside Google’s ecosystem.

🔹 Trade-offs

  • Speed vs. Memory: Some indexes (like HNSW) are fast but memory-hungry.
  • Accuracy vs. Compression: PQ saves space but reduces precision.
  • Dynamic vs. Static Data: HNSW handles updates well, while Annoy is better for static data.

5️⃣ Filtering & Metadata

  • Pure similarity search often returns results that are semantically close but not contextually relevant.
  • Real-world apps combine semantic similarity with structured filters (e.g., category, date, user ID).
  • Example: “find similar support tickets from the past month” or “recommend products in the Shoes category.”

🔹 How it works

  • Each vector is stored with metadata → key-value pairs like:
    • category: "electronics"
    • created_at: "2025-09-01"
    • user_id: 12345
  • At query time, the DB runs a hybrid search: 1) Apply metadata filters. 2) Run similarity search only on the filtered subset.

🔹 Getting the query vector (two common options)

from openai import OpenAI
client = OpenAI()

q = "Wireless noise-cancelling headphones"
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=q
).data[0].embedding  # length = 1536
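
A second common option is a local open-source model (a sketch assuming the sentence-transformers package). Either way, the query must be embedded with the same model that produced the stored vectors, otherwise the dimensions and geometry won't match:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode("Wireless noise-cancelling headphones").tolist()  # length = 384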

🔹 Example with metadata filter (Elasticsearch)

POST /documents/_search
{
  "size": 3,
  "knn": {
    "field": "embedding",
    "query_vector": [/* paste emb here, e.g., 0.12, -0.87, 0.33, ... */],
    "k": 3,
    "num_candidates": 10,
    "filter": {
      "term": { "category": "electronics" }
    }
  },
  "_source": ["content"]
}
