1️⃣ Introduction
🔹 Vector
- A vector is simply an ordered list (array) of numbers.
- Can represent data points in 2D, 3D, or higher dimensions.
- Example:
  - 2D → [3.5, 7.2]
  - 3D → [1.2, -4.5, 6.0]
- In machine learning, vectors are used to describe positions in a multi-dimensional space.
🔹 Embedding Vector
- An embedding vector (or just embedding) is a special kind of vector generated by an embedding model.
- Purpose: represent complex data (text, images, audio, etc.) in a way that captures meaning and similarity.
- All embeddings are vectors, but not all vectors are embeddings.
- Dimensions (e.g., 384, 768, 1536) are fixed by the model design.
- Lightweight (384–512 dimensions) → e.g., all-MiniLM-L6-v2 (384d), Universal Sentence Encoder (512d)
- Standard (768–1,536 dimensions) → e.g., all-mpnet-base-v2 (768d), OpenAI text-embedding-ada-002 (1,536d)
- High capacity (2,000–3,000+ dimensions) → e.g., OpenAI text-embedding-3-large (3,072d)
- Higher dimensions = richer detail, but more costly to store and search.
- Represents the semantic meaning of data, so similar concepts are placed close together in vector space.
- “dog” and “puppy” → embedding vectors close together.
- “dog” and “car” → embedding vectors far apart.
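To make "close together / far apart" concrete, here is a minimal sketch using the all-MiniLM-L6-v2 model mentioned above (assuming the sentence-transformers package is installed); exact scores vary, but "dog"/"puppy" should score clearly higher than "dog"/"car".

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
vecs = model.encode(["dog", "puppy", "car"])

print(util.cos_sim(vecs[0], vecs[1]))  # "dog" vs "puppy" -> higher cosine similarity
print(util.cos_sim(vecs[0], vecs[2]))  # "dog" vs "car"   -> noticeably lower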
🔹 Positional Encoding
- Positional encoding is a technique used in Transformer models to provide word order information to embeddings.
- Purpose: ensure that the sequence of tokens matters, so sentences with the same words in different orders have different meanings.
- Implemented by adding a positional vector to each token embedding before feeding it into the model.
- Can be sinusoidal (fixed mathematical functions).
- Or learned (trainable position embeddings).
- Formula (sinusoidal encoding; implemented in the sketch after this list):
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Sentence-level embeddings indirectly include positional information, since it’s baked into the Transformer during encoding.
- Preserves contextual meaning by distinguishing different word orders.
- “dog bites man” → one meaning.
- “man bites dog” → very different meaning.
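The sinusoidal formulas above translate directly into code; a minimal NumPy sketch (the function name and sizes are illustrative):

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
# In a Transformer, this matrix is added to the token embeddings: x = token_embeddings + pe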
🔹 Vector Databases
- A vector database is a system built to store and retrieve vectors, especially embeddings.
- Core features:
- Persistence: store large volumes of embeddings.
- Similarity search: find nearest neighbors using cosine similarity, dot product, or Euclidean distance.
- Indexing: HNSW, IVF, Annoy, PQ for fast retrieval.
- Database functions: CRUD operations, filtering, replication, and scaling.
- Purpose: enable semantic search → finding results based on meaning instead of exact keyword matches.
2️⃣ Persistence
- Vectors often number in the millions or billions, so keeping them only in memory is not practical.
- Persistence ensures embeddings are stored long-term and survive restarts or failures.
- CRUD: create, read, update, and delete vectors.
- Durability: vectors are written to disk or distributed storage.
- Index persistence: not just raw vectors, but also the indexing structures (like HNSW graphs).
🔹 Example: creating an embedding (OpenAI)
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat"
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions
🔹 Example: insertion (ElasticSearch)
PUT /documents/_doc/1
{
  "content": "The cat sat on the mat",
  "embedding": [0.12, -0.87, 0.33]
}

PUT /documents/_doc/2
{
  "content": "The dog chased the ball",
  "embedding": [0.91, 0.04, -0.22]
}
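Note: the 3-dimensional vectors above are shortened for illustration (the embedding created earlier has 1,536 values). For the kNN searches below to work, the embedding field also needs a dense_vector mapping before documents are indexed; a minimal sketch, assuming Elasticsearch 8.x (dims is 3 only to match the toy vectors, in practice it would match the model, e.g., 1536):

PUT /documents
{
  "mappings": {
    "properties": {
      "content":   { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}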
3️⃣ Similarity Search
- Core function of a vector database → find vectors closest in meaning to a query vector.
- Powers semantic search, recommendations, fraud detection, and Retrieval-Augmented Generation (RAG).
🔹 Common distance metrics
- Cosine similarity: measures angle between vectors.
- Euclidean distance: straight-line distance in vector space.
- Dot product: measures alignment, often used when vectors are normalized.
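A quick NumPy sketch of the three metrics on toy 3-dimensional vectors (illustrative values, not real embeddings):

import numpy as np

a = np.array([0.12, -0.87, 0.33])
b = np.array([0.15, -0.52, 0.48])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, ignores magnitude
euclidean = np.linalg.norm(a - b)                                # straight-line distance
dot = np.dot(a, b)                                               # alignment; equals cosine when a and b are normalized

print(cosine, euclidean, dot)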
🔹 Example: semantic search (Elasticsearch)
k-Nearest Neighbors (kNN) is the task of finding the k most similar vectors to a query vector. In vector databases, it’s the core operation behind semantic search, recommendations, and RAG.
POST /documents/_search
{
  "size": 2,
  "knn": {
    "field": "embedding",
    "query_vector": [0.15, -0.52, 0.48, ...],
    "k": 2,
    "num_candidates": 10
  },
  "_source": ["content"]
}
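The same query can be issued from Python; a minimal sketch, assuming the official elasticsearch 8.x client and a locally running cluster (the URL is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

resp = es.search(
    index="documents",
    knn={
        "field": "embedding",
        "query_vector": [0.15, -0.52, 0.48],  # toy 3-dimensional vector, matching the docs above
        "k": 2,
        "num_candidates": 10,
    },
    source=["content"],
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])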
4️⃣ Indexing
- Searching vectors directly without an index is too slow for large datasets.
- Indexing structures organize vectors so that nearest-neighbor queries can be answered efficiently.
- These methods implement Approximate Nearest Neighbor (ANN) search → trading a bit of accuracy for major speed gains.
🔹 Popular ANN-based algorithms
Approximate Nearest Neighbor (ANN) search finds close (not always exact) neighbors quickly, trading a small amount of accuracy for large gains in speed and scalability; all of the algorithms below are ANN-based.
HNSW (Hierarchical Navigable Small World Graph)
- Builds a multi-layer graph where each vector connects to its neighbors.
- ✅ Fast queries, high recall, supports dynamic updates.
- ❌ High memory consumption, complex to tune.
IVF (Inverted File Index)
- Clusters vectors into groups (using k-means or similar).
- At query time, only the most relevant clusters are searched.
- ✅ Efficient for large datasets, reduces search scope.
- ❌ Accuracy depends on clustering quality.
PQ (Product Quantization)
- Compresses vectors into compact codes to save memory.
- ✅ Great for billion-scale datasets, reduces storage dramatically.
- ❌ Some loss in accuracy due to compression.
Annoy (Approximate Nearest Neighbors Oh Yeah)
- Builds multiple random projection trees for searching.
- ✅ Lightweight, simple to use, good for read-heavy static datasets.
- ❌ Slower than HNSW at very large scale, poor for frequent updates.
ScaNN (Scalable Nearest Neighbors, Google)
- Optimized for high-dimensional, large-scale data.
- ✅ Very fast, optimized for Google-scale workloads, low memory footprint.
- ❌ Less community support, limited flexibility outside Google’s ecosystem.
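To make one of these concrete, here is a minimal HNSW sketch using the hnswlib library (not covered above; shown only as one possible ANN implementation, with random vectors standing in for real embeddings):

import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)  # build-time knobs
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # query-time accuracy/speed trade-off
labels, distances = index.knn_query(data[0], k=5)
print(labels, distances)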
🔹 Trade-offs
- Speed vs. Memory: Some indexes (like HNSW) are fast but memory-hungry.
- Accuracy vs. Compression: PQ saves space but reduces precision.
- Dynamic vs. Static Data: HNSW handles updates well, while Annoy is better for static data.
5️⃣ Filtering & Metadata
- Pure similarity search often returns results that are semantically close but not contextually relevant.
- Real-world apps combine semantic similarity with structured filters (e.g., category, date, user ID).
- Example: “find similar support tickets from the past month” or “recommend products in the Shoes category.”
🔹 How it works
- Each vector is stored with metadata → key-value pairs like:
  - category: "electronics"
  - created_at: "2025-09-01"
  - user_id: 12345
- At query time, the DB runs a hybrid search: 1) Apply metadata filters. 2) Run similarity search only on the filtered subset.
🔹 Getting the query vector (two common options)
from openai import OpenAI

client = OpenAI()
q = "Wireless noise-cancelling headphones"
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=q
).data[0].embedding  # length = 1536
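That covers the hosted-API route; a common second option (an assumption here, since only one is shown) is a local open-source model such as all-MiniLM-L6-v2 from the intro. A minimal sketch, assuming sentence-transformers is installed:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode("Wireless noise-cancelling headphones").tolist()  # length = 384

Whichever option you pick, the query vector's dimensionality must match the dims of the indexed embedding field, so use the same model for indexing and querying.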
🔹 Example with metadata filter (Elasticsearch)
The knn filter clause applies the metadata filter before the similarity search (Elasticsearch 8.x syntax):

POST /documents/_search
{
  "size": 3,
  "knn": {
    "field": "embedding",
    "query_vector": [/* paste emb here, e.g., 0.12, -0.87, 0.33, ... */],
    "k": 3,
    "num_candidates": 10,
    "filter": { "term": { "category": "electronics" } }
  },
  "_source": ["content"]
}