1️⃣ Introduction
🔹 Vector
- A vector is simply an ordered list (array) of numbers.
- Can represent data points in 2D, 3D, or higher dimensions.
- Example:
  - 2D → [3.5, 7.2]
  - 3D → [1.2, -4.5, 6.0]
- In machine learning, vectors are used to describe positions in a multi-dimensional space.
🔹 Embedding Vector
- An embedding vector (or just embedding) is a special kind of vector generated by an embedding model.
- Purpose: represent complex data (text, images, audio, etc.) in a way that captures meaning and similarity.
- All embeddings are vectors, but not all vectors are embeddings.
- Dimensions (e.g., 384, 768, 1536) are fixed by the model design.
- Lightweight (384–512 dimensions) → e.g., all-MiniLM-L6-v2 (384d), Universal Sentence Encoder (512d)
- Standard (768–1,536 dimensions) → e.g., all-mpnet-base-v2 (768d), OpenAI text-embedding-ada-002 (1,536d)
- High capacity (2,000–3,000+ dimensions) → e.g., OpenAI text-embedding-3-large (3,072d)
- Higher dimensions = richer detail, but more costly to store and search.
- Represents the semantic meaning of data, so similar concepts are placed close together in vector space.
- “dog” and “puppy” → embedding vectors close together.
- “dog” and “car” → embedding vectors far apart.
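To make "close together / far apart" concrete, here is a minimal sketch using the all-MiniLM-L6-v2 model mentioned above (assuming the sentence-transformers package is installed); exact scores vary, but "dog"/"puppy" should score clearly higher than "dog"/"car".

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
vecs = model.encode(["dog", "puppy", "car"])

print(util.cos_sim(vecs[0], vecs[1]))  # "dog" vs "puppy" -> higher cosine similarity
print(util.cos_sim(vecs[0], vecs[2]))  # "dog" vs "car"   -> noticeably lower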
🔹 Positional Encoding
- Positional encoding is a technique used in Transformer models to provide word order information to embeddings.
- Purpose: ensure that the sequence of tokens matters, so sentences with the same words in different orders have different meanings.
- Implemented by adding a positional vector to each token embedding before feeding it into the model.
- Can be sinusoidal (fixed mathematical functions).
- Or learned (trainable position embeddings).
- Formula (sinusoidal encoding; implemented in the sketch after this list):
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Sentence-level embeddings indirectly include positional information, since it’s baked into the Transformer during encoding.
- Preserves contextual meaning by distinguishing different word orders.
- “dog bites man” → one meaning.
- “man bites dog” → very different meaning.
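The sinusoidal formulas above translate directly into code; a minimal NumPy sketch (the function name and sizes are illustrative):

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # even dimension indices 2i
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
# In a Transformer, this matrix is added to the token embeddings: x = token_embeddings + pe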
🔹 Vector Databases
- A vector database is a system built to store and retrieve vectors, especially embeddings.
- Core features:
- Persistence: store large volumes of embeddings.
- Similarity search: find nearest neighbors using cosine similarity, dot product, or Euclidean distance.
- Indexing: HNSW, IVF, Annoy, PQ for fast retrieval.
- Database functions: CRUD operations, filtering, replication, and scaling.
- Purpose: enable semantic search → finding results based on meaning instead of exact keyword matches.
2️⃣ Persistence
- Vectors often number in the millions or billions, so keeping them only in memory is not practical.
- Persistence ensures embeddings are stored long-term and survive restarts or failures.
- CRUD: create, read, update, and delete vectors.
- Durability: vectors are written to disk or distributed storage.
- Index persistence: not just raw vectors, but also the indexing structures (like HNSW graphs).
🔹 Example: creating an embedding (OpenAI)
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat"
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions
🔹 Example: insertion (ElasticSearch)
PUT /documents/_doc/1
{
  "content": "The cat sat on the mat",
  "embedding": [0.12, -0.87, 0.33]
}

PUT /documents/_doc/2
{
  "content": "The dog chased the ball",
  "embedding": [0.91, 0.04, -0.22]
}
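Note: the 3-dimensional vectors above are shortened for illustration (the embedding created earlier has 1,536 values). For the kNN searches below to work, the embedding field also needs a dense_vector mapping before documents are indexed; a minimal sketch, assuming Elasticsearch 8.x (dims is 3 only to match the toy vectors, in practice it would match the model, e.g., 1536):

PUT /documents
{
  "mappings": {
    "properties": {
      "content":   { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}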
3️⃣ Similarity Search
- Core function of a vector database → find vectors closest in meaning to a query vector.
- Powers semantic search, recommendations, fraud detection, and Retrieval-Augmented Generation (RAG).
🔹 Common distance metrics
- Cosine similarity: measures angle between vectors.
- Euclidean distance: straight-line distance in vector space.
- Dot product: measures alignment, often used when vectors are normalized.
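A quick NumPy sketch of the three metrics on toy 3-dimensional vectors (illustrative values, not real embeddings):

import numpy as np

a = np.array([0.12, -0.87, 0.33])
b = np.array([0.15, -0.52, 0.48])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, ignores magnitude
euclidean = np.linalg.norm(a - b)                                # straight-line distance
dot = np.dot(a, b)                                               # alignment; equals cosine when a and b are normalized

print(cosine, euclidean, dot)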
🔹 Example: semantic search (Elasticsearch)
k-Nearest Neighbors (kNN) is the task of finding the k most similar vectors to a query vector. In vector databases, it’s the core operation behind semantic search, recommendations, and RAG.
POST /documents/_search
{
  "size": 2,
  "knn": {
    "field": "embedding",
    "query_vector": [0.15, -0.52, 0.48, ...],
    "k": 2,
    "num_candidates": 10
  },
  "_source": ["content"]
}
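The same query can be issued from Python; a minimal sketch, assuming the official elasticsearch 8.x client and a locally running cluster (the URL is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder URL

resp = es.search(
    index="documents",
    knn={
        "field": "embedding",
        "query_vector": [0.15, -0.52, 0.48],  # toy 3-dimensional vector, matching the docs above
        "k": 2,
        "num_candidates": 10,
    },
    source=["content"],
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])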
4️⃣ Indexing
- Searching vectors directly without an index is too slow for large datasets.
- Indexing structures organize vectors so that nearest-neighbor queries can be answered efficiently.
- These methods implement Approximate Nearest Neighbor (ANN) search → trading a bit of accuracy for major speed gains.
🔹 Popular ANN-based algorithms
Approximate Nearest Neighbor (ANN) search finds close (not always exact) neighbors quickly, trading a small amount of accuracy for large gains in speed and scalability; all of the algorithms below are ANN-based.
HNSW (Hierarchical Navigable Small World Graph)
- Builds a multi-layer graph where each vector connects to its neighbors.
- ✅ Fast queries, high recall, supports dynamic updates.
- ❌ High memory consumption, complex to tune.
IVF (Inverted File Index)
- Clusters vectors into groups (using k-means or similar).
- At query time, only the most relevant clusters are searched.
- ✅ Efficient for large datasets, reduces search scope.
- ❌ Accuracy depends on clustering quality.
PQ (Product Quantization)
- Compresses vectors into compact codes to save memory.
- ✅ Great for billion-scale datasets, reduces storage dramatically.
- ❌ Some loss in accuracy due to compression.
Annoy (Approximate Nearest Neighbors Oh Yeah)
- Builds multiple random projection trees for searching.
- ✅ Lightweight, simple to use, good for read-heavy static datasets.
- ❌ Slower than HNSW at very large scale, poor for frequent updates.
ScaNN (Scalable Nearest Neighbors, Google)
- Optimized for high-dimensional, large-scale data.
- ✅ Very fast, optimized for Google-scale workloads, low memory footprint.
- ❌ Less community support, limited flexibility outside Google’s ecosystem.
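To make one of these concrete, here is a minimal HNSW sketch using the hnswlib library (not covered above; shown only as one possible ANN implementation, with random vectors standing in for real embeddings):

import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)  # build-time knobs
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # query-time accuracy/speed trade-off
labels, distances = index.knn_query(data[0], k=5)
print(labels, distances)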
🔹 Trade-offs
- Speed vs. Memory: Some indexes (like HNSW) are fast but memory-hungry.
- Accuracy vs. Compression: PQ saves space but reduces precision.
- Dynamic vs. Static Data: HNSW handles updates well, while Annoy is better for static data.
5️⃣ Filtering & Metadata
- Pure similarity search often returns results that are semantically close but not contextually relevant.
- Real-world apps combine semantic similarity with structured filters (e.g., category, date, user ID).
- Example: “find similar support tickets from the past month” or “recommend products in the Shoes category.”
🔹 How it works
- Each vector is stored with metadata → key-value pairs like:
  - category: "electronics"
  - created_at: "2025-09-01"
  - user_id: 12345
- At query time, the DB runs a hybrid search: 1) Apply metadata filters. 2) Run similarity search only on the filtered subset.
🔹 Getting the query vector (two common options)
from openai import OpenAI

client = OpenAI()
q = "Wireless noise-cancelling headphones"
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=q
).data[0].embedding  # length = 1536
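That covers the hosted-API route; a common second option (an assumption here, since only one is shown) is a local open-source model such as all-MiniLM-L6-v2 from the intro. A minimal sketch, assuming sentence-transformers is installed:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode("Wireless noise-cancelling headphones").tolist()  # length = 384

Whichever option you pick, the query vector's dimensionality must match the dims of the indexed embedding field, so use the same model for indexing and querying.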
🔹 Example with metadata filter (Elasticsearch)
The knn filter clause applies the metadata filter before the similarity search (Elasticsearch 8.x syntax):

POST /documents/_search
{
  "size": 3,
  "knn": {
    "field": "embedding",
    "query_vector": [/* paste emb here, e.g., 0.12, -0.87, 0.33, ... */],
    "k": 3,
    "num_candidates": 10,
    "filter": { "term": { "category": "electronics" } }
  },
  "_source": ["content"]
}