
Sabarish Sathasivan

AWS Vector Databases Part 1: Embeddings, Dimensions & Similarity

This is Part 1 of a series exploring vector databases on AWS.

We recently evaluated multiple AWS vector database options to understand their trade-offs, performance characteristics, and real-world use cases. Before comparing services, it’s important to understand the core concepts that power vector search.

In this part, we’ll cover embeddings, dimensions, and similarity search — the foundation of every RAG and semantic search system.

What Are Embeddings?

Let’s say you're building a customer support chatbot.

A user asks: “How do I change my login info?”
Your FAQ has: “Resetting your password.”

A keyword search might miss this. But as humans, we know they mean the same thing. That’s the gap embeddings solve.

An embedding is a numerical representation of content (text, image, code) where similar meaning leads to similar numbers. So even if the words differ, the intent stays close.

How Embeddings Are Created

Here's what happens under the hood when you pass a sentence to an embedding model:

"How do I reset my password?"
        │
        ▼
   Tokenization         →  ["How", "do", "I", "reset", "my", "password", "?"]
        │
        ▼
   Embedding Model       →  Neural network (e.g., Titan v2)
        │
        ▼
   Vector Output         →  [0.021, -0.438, 0.712, ..., 0.155] (1,024 floats)

The important part isn’t the numbers themselves—it’s that similar sentences produce vectors that are close to each other.
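That "closeness" can be made concrete with cosine similarity. The sketch below uses tiny, hand-made 4-dimensional vectors (real models output hundreds or thousands of floats, and the values here are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dim embeddings (real models produce 256-1,024 floats)
reset_password = [0.82, 0.10, -0.31, 0.44]   # "How do I reset my password?"
change_login   = [0.79, 0.15, -0.28, 0.40]   # "How do I change my login info?"
pizza_recipe   = [-0.12, 0.91, 0.33, -0.05]  # "Best homemade pizza dough recipe"

print(cosine_similarity(reset_password, change_login))  # close to 1.0
print(cosine_similarity(reset_password, pizza_recipe))  # much lower
```

Even though "reset my password" and "change my login info" share almost no keywords, their vectors point in nearly the same direction, while the pizza sentence points somewhere else entirely.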

On AWS, you can generate embeddings using models like:

  • Titan Text Embeddings V2
  • Titan Embeddings G1 - Text
  • Cohere Embed English v3
  • Cohere Embed Multilingual v3

If you're getting started, Amazon Titan Embeddings V2 is a solid default—simple, cost-effective, and good enough for most use cases.

Note: The model you use for ingestion (storing data) must be the exact same model you use for inference (querying). If you embed your database using Amazon Titan but try to query it using an OpenAI embedding, the math won't align, and your search results will be complete gibberish.

Dimensions

So far, we’ve seen that embeddings are just lists of numbers representing meaning. The next question is: how many numbers are in that list? That’s where dimensions come in.

A dimension is simply the number of values in an embedding list (vector). Different models produce different dimensions:

  • Cohere Embed English v3, Cohere Embed Multilingual v3 → 1,024
  • Amazon Titan Embeddings → 1,024 (default), 512, 256

Note: Historically, more dimensions meant better accuracy but higher storage costs and slower searches. Amazon Titan Text Embeddings V2 changed the game: it supports "flexible" dimensions, so you can generate a 1,024-dimension vector and "truncate" it down to 512 or 256.

  • 1,024 dimensions: maximum "nuance" and accuracy.
  • 256 dimensions: up to 4x less storage cost and faster search speeds, with only a marginal hit to accuracy.
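The idea behind truncation can be sketched in a few lines: keep the first 256 values of the full vector and re-normalize to unit length. (The random vector below is a stand-in for real Titan output; in practice you would request the smaller dimension directly from the model.)

```python
import math
import random

random.seed(42)

def normalize(v):
    """Scale a vector to unit length so cosine comparisons stay valid."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Stand-in for a full-size 1,024-dim embedding
full = normalize([random.gauss(0, 1) for _ in range(1024)])

# Truncate to the first 256 values, then re-normalize
truncated = normalize(full[:256])

# Storage at 4 bytes per float32: 4,096 bytes vs 1,024 bytes -> the "4x" saving
print(len(full), len(truncated))  # 1024 256
```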

Distance Metrics: Measuring Similarity

Once you have thousands of embeddings stored, the database uses a distance metric to find the "nearest neighbors" to your query.

Metric           Logic                    Use Case
Cosine           Angle between vectors    The standard. Best for text and RAG; ignores document length.
Euclidean (L2)   Straight-line distance   Best for images or fixed-size data where magnitude matters.
Inner Product    Direction + magnitude    Best for recommendations where popularity or "weight" matters.
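All three metrics are a few lines of arithmetic. A minimal sketch (the vectors `a` and `b` are made up; `b` is deliberately a scaled copy of `a` to highlight how the metrics disagree):

```python
import math

def cosine(a, b):
    """Angle-based similarity: 1.0 = same direction, regardless of length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def euclidean(a, b):
    """Straight-line distance between the vector tips: 0.0 = identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Direction AND magnitude: longer vectors score higher."""
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction, twice the magnitude

print(cosine(a, b))         # 1.0  (identical direction)
print(euclidean(a, b))      # ~3.74 (nonzero distance)
print(inner_product(a, b))  # 28.0 (rewards b's magnitude)
```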

Why Metrics Matter: The "Wordiness" Problem

Let’s compare two vectors representing the same topic:

  • Vector A: [1, 2, 3] (A short, concise help article)
  • Vector B: [10, 20, 30] (A very long, detailed whitepaper on the same topic)

Even though they cover the exact same intent, different metrics interpret them wildly differently:

  • Cosine Similarity (The Compass): Sees that both arrows point at the exact same target in space. It gives them a perfect match score. This is why it’s the standard for RAG—you want your short question to match a long document.
  • Euclidean Distance (The Ruler): Measures the physical distance between the "tips" of the arrows. Because Vector B is so much longer, the ruler sees them as miles apart and may treat them as unrelated.
  • Inner Product (The Spotlight): It sees that both point the same way, but it gives Vector B a "higher" score because it is stronger/longer. This is perfect for recommendation engines where you want to highlight "heavy-hitting" content.

Here is the same effect in a realistic scenario:

  • Query A: "How do I reset my password?"
  • Doc B: "A guide to password resets for new users."

They mean the same thing, but Doc B is 1,000 words long.

  • Cosine correctly identifies them as a match because it ignores the extra "fluff" and focuses on the intent.
  • Euclidean might fail because the sheer volume of words in Doc B pushes its vector too far away from the short query.

Key Takeaway: For 95% of AWS text-based applications (Chatbots, Q&A, Knowledge Bases), use Cosine Similarity. It is the default in Aurora pgvector, OpenSearch, and S3 Vectors for a reason: it focuses on meaning over length.

👉 Continue reading: In Part 2, we’ll explore vector search patterns (KNN vs ANN), hybrid search, metadata filtering, and chunking — and how they impact real-world systems.
