surajrkhonde

Posted on Jun 23 • Edited on Jul 9

Phase 2: Embeddings & Semantic Search

#ai #rag #programming #beginners

From Text to Vectors: The Complete Story

The Story Starts: Why Can't We Just Search for Words?

👦 Nephew: Uncle! Phase 1 is done. Now we have clean chunks. Can't we just search for keywords?

👨‍🦳 Uncle: Let me ask you something. Someone applies for a job and says:

Resume says: "I managed a team of 10 people for 3 years"

Job Description asks: "Do you have leadership experience?"

Are these related?

👦 Nephew: Obviously yes! But the words are different. Team management and leadership.

👨‍🦳 Uncle: Exactly. A computer doing simple keyword search would say NO.

Keyword Search:
Resume: "team management"
Question: "leadership"
Match? NO (different words)
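👨‍🦳 Uncle: If you want to see that failure in code, here is a minimal TypeScript sketch. The stop-word list and the function names (keywords, sharedKeywords) are made up for illustration; a real keyword engine does more, but it hits the same wall.

```typescript
// Toy keyword matcher: two texts "match" only if they share a meaningful word.
// STOP_WORDS is a tiny illustrative list, not a standard one.
const STOP_WORDS = new Set(["i", "a", "of", "for", "do", "you", "have", "the", "is"]);

function keywords(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOP_WORDS.has(w)));
}

function sharedKeywords(a: string, b: string): string[] {
  const right = keywords(b);
  return Array.from(keywords(a)).filter((w) => right.has(w));
}

const resume = "I managed a team of 10 people for 3 years";
const jobQuestion = "Do you have leadership experience?";

// The two sentences are clearly related, yet they share no keyword at all.
console.log(sharedKeywords(resume, jobQuestion));
```

The matcher only sees spelling, so "managed a team" and "leadership" never meet.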

This is the problem Phase 2 solves.

👦 Nephew: How?

👨‍🦳 Uncle: By understanding meaning, not just words.

PHASE 2 OVERVIEW: Three Key Concepts

Phase 1: Document Ingestion ✅
    ↓
Phase 2: Embeddings & Semantic Search ← WE ARE HERE
    │
    ├─ Step 1: Tokenization
    │         (Text → Token IDs)
    │
    ├─ Step 2: Embedding Layer
    │         (Token IDs → Vectors)
    │
    └─ Step 3: Semantic Search
              (Find Similar Vectors)
    ↓
Phase 3: Safety & Production

STEP 1: TOKENIZATION

What Is Tokenization?

👨‍🦳 Uncle: Computers don't understand words like "React" or "JavaScript". They only understand numbers.

A tokenizer converts text into numbers called token IDs.

Text:           "React is awesome"
                    ↓
Tokenizer converts to:
Token IDs:      [4521, 318, 9876]

👦 Nephew: But why not just one number per word?

👨‍🦳 Uncle: Good question. Look at this:

Word:          "JavaScriptDeveloper"

Tokenizer might see:
["Java", "Script", "Developer"]  ← 3 tokens
or
["JavaScript", "Developer"]      ← 2 tokens

Depends on the tokenizer's vocabulary.
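👨‍🦳 Uncle: Here is a toy version of that idea in TypeScript. It is a sketch, not how a production tokenizer is built: real ones learn their vocabulary from data, while this one simply takes the longest piece it recognizes. The two vocabularies are invented for the example.

```typescript
// Toy subword tokenizer: greedy longest match against a vocabulary.
// It only shows why the same string splits differently under different vocabularies.
function tokenize(text: string, vocab: Set<string>): string[] {
  const tokens: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = text.length;
    // Shrink the candidate until the vocabulary knows it (or one character is left).
    while (end > start + 1 && !vocab.has(text.slice(start, end))) {
      end--;
    }
    tokens.push(text.slice(start, end));
    start = end;
  }
  return tokens;
}

const smallVocab = new Set(["Java", "Script", "Developer"]);
const largerVocab = new Set(["Java", "Script", "Developer", "JavaScript"]);

console.log(tokenize("JavaScriptDeveloper", smallVocab));  // 3 tokens
console.log(tokenize("JavaScriptDeveloper", largerVocab)); // 2 tokens
```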

Why Is This Important?

👨‍🦳 Uncle: Because the embedding model doesn't receive text. It receives token IDs.

Text Input:
"Employees must submit resignation 60 days before"
           ↓
Tokenizer (internal to embedding model):
[1001, 245, 782, 4521, 99, 1234, ...]
           ↓
Embedding Layer looks up each token:
1001 → [0.12, -0.45, 0.78, ...]
245  → [0.88, -0.12, 0.34, ...]
           ↓
Model layers add context, then the token vectors are combined into one final vector

Real Example: Your Name

👨‍🦳 Uncle: Your name is Suraj Khonde. How does the tokenizer handle it?

If the tokenizer has seen "Suraj" often:
"Suraj" → Single Token → 6108

If not:
"Suraj" → Breaks into: ["Su", "raj"]
                      → [5234, 9876]

Same with "Khonde":
If common: → Single token
If rare:   → Multiple tokens

Result might be:
"Suraj Khonde" → [5234, 9876, 7621, 3344]

👦 Nephew: So less common words use more tokens?

👨‍🦳 Uncle: Exactly! This is why we count tokens, not words.

Word count: 2 words
Token count: 4 tokens

Embedding APIs bill per token and limit input length in tokens, so for cost we count tokens!
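👨‍🦳 Uncle: A small sketch of counting tokens instead of words. The IDs are the made-up ones from the example; the pieces "Kho" and "nde" are my assumption, since we never said how "Khonde" actually splits.

```typescript
// Hypothetical vocabulary: illustrative pieces and IDs, not from any real tokenizer.
const pieceIds = new Map<string, number>([
  ["Su", 5234],
  ["raj", 9876],
  ["Kho", 7621],
  ["nde", 3344],
]);
const UNKNOWN_ID = 0; // stand-in for pieces missing from the vocabulary

function encode(text: string): number[] {
  const ids: number[] = [];
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    let start = 0;
    while (start < word.length) {
      // Greedy longest match: shrink the piece until the vocabulary knows it.
      let end = word.length;
      while (end > start + 1 && !pieceIds.has(word.slice(start, end))) end--;
      ids.push(pieceIds.get(word.slice(start, end)) ?? UNKNOWN_ID);
      start = end;
    }
  }
  return ids;
}

const fullName = "Suraj Khonde";
const wordCount = fullName.split(/\s+/).length;
const tokenIds = encode(fullName);

console.log(`${wordCount} words, ${tokenIds.length} tokens`, tokenIds);
```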

STEP 2: EMBEDDING LAYER

The Bridge: From Numbers to Meaning

👨‍🦳 Uncle: Now we have token IDs. But they're meaningless.

Token ID: 4521

This is just an identifier. Like an employee ID.
Employee #4521 is just a number.

The embedding layer transforms this into meaningful vectors.

👦 Nephew: How?

👨‍🦳 Uncle: Think of a giant lookup table inside the model:

Token ID    Vector (Meaning)
─────────────────────────────────────────
4521   →  [0.13, -0.46, 0.79, ... 1536 values]
318    →  [0.91, -0.21, 0.12, ... 1536 values]
9876   →  [0.44, 0.77, -0.33, ... 1536 values]

When model sees token 4521, it looks up:
4521 → [0.13, -0.46, 0.79, ...]
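👨‍🦳 Uncle: In code, the lookup table is just a map from token ID to a row of numbers. This sketch uses 3 numbers per row instead of 1536 and averages the rows (mean pooling) to get one vector. That averaging is an assumption for illustration: real models pass the rows through more layers before combining them.

```typescript
// Toy embedding table: token ID -> vector (first three values from the table above).
const embeddingTable = new Map<number, number[]>([
  [4521, [0.13, -0.46, 0.79]],
  [318, [0.91, -0.21, 0.12]],
  [9876, [0.44, 0.77, -0.33]],
]);

function lookup(tokenId: number): number[] {
  const row = embeddingTable.get(tokenId);
  if (row === undefined) throw new Error(`unknown token id ${tokenId}`);
  return row;
}

// Mean pooling: average the token vectors position by position.
function meanPool(vectors: number[][]): number[] {
  const first = vectors[0];
  if (first === undefined) throw new Error("nothing to pool");
  return first.map(
    (_, d) => vectors.reduce((sum, v) => sum + (v[d] ?? 0), 0) / vectors.length
  );
}

// "React is awesome" -> [4521, 318, 9876] -> one vector for the whole text
const sentenceVector = meanPool([4521, 318, 9876].map(lookup));
console.log(sentenceVector);
```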

👦 Nephew: What are those numbers? [0.13, -0.46, 0.79]?

👨‍🦳 Uncle: Those numbers encode meaning. They describe what the token represents.

Who Created These Numbers?

👨‍🦳 Uncle: When the embedding model was first created, these numbers were mostly random.

But then something magical happened: training.

During training on billions of documents, the model learned:

React appears with JavaScript many times
↓
So React and JavaScript vectors move closer

Node.js appears with Express many times
↓
So Node.js and Express vectors move closer

Dog appears with Puppy many times
↓
So Dog and Puppy vectors move closer

After training on billions of examples, the vectors ended up in meaningful positions.
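👨‍🦳 Uncle: To make "vectors move closer" concrete, here is a toy update step. It is only a cartoon of training: real models minimize a loss over huge batches and also push unrelated items apart. The function pullTogether, the rate, and the starting values are all invented for this sketch.

```typescript
type Vec = number[];

function distance(a: Vec, b: Vec): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - (b[i] ?? 0)) ** 2, 0));
}

// One toy "training step": each vector moves a fraction of the way toward the other.
function pullTogether(a: Vec, b: Vec, rate: number): [Vec, Vec] {
  const newA = a.map((x, i) => x + rate * ((b[i] ?? 0) - x));
  const newB = b.map((x, i) => x + rate * ((a[i] ?? 0) - x));
  return [newA, newB];
}

let react: Vec = [0.9, -0.2, 0.1];      // arbitrary starting positions
let javascript: Vec = [-0.5, 0.7, 0.3];

const before = distance(react, javascript);
// Pretend the two words were seen together five times.
for (let step = 0; step < 5; step++) {
  [react, javascript] = pullTogether(react, javascript, 0.1);
}
const after = distance(react, javascript);

console.log({ before, after }); // the gap shrinks with every co-occurrence
```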

👦 Nephew: So React and JavaScript have similar vectors?

👨‍🦳 Uncle: Yes! That's why semantic search works.

KEY CONCEPT: Why 1536 Dimensions?

The Problem with Too Few Dimensions

👨‍🦳 Uncle: Suppose the embedding only used 1 number:

React   → 5
Python  → 7
JavaScript → 6

Now what? React and JavaScript are close (5 vs 6), 
but so are React and Python (5 vs 7).
All three look similar with one number!

❌ Not enough information.

Why So Many Numbers?

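👨‍🦳 Uncle: Language is incredibly complex. (1536 is simply the output size of OpenAI's text-embedding-3-small; other models use sizes such as 384, 768, or 3072.) The vector needs to capture: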

Dimension 1: Frontend vs Backend
Dimension 2: Web vs Systems Programming
Dimension 3: Dynamic vs Compiled
Dimension 4: Syntax Style
Dimension 5: Ecosystem Size
...
Dimension 1536: [something the model learned]

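Each dimension captures a different aspect of meaning. (The labels above are a teaching aid: real dimensions are learned, and most have no neat human-readable name.)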

👦 Nephew: So with 1536 dimensions, React can be described as:

React: [0.9 (very frontend), 0.2 (mostly web), 0.95 (dynamic), ...]
Node.js: [0.3 (mostly backend), 0.3 (mostly web), 0.95 (dynamic), ...]
Python: [0.1 (rarely frontend), 0.6 (web and systems), 0.5 (in the middle), ...]

👨‍🦳 Uncle: Exactly! And notice:

React and Node.js both have 0.95 (dynamic) → similar there
React is 0.9 frontend, Node is 0.3 → different there
Python is 0.5 (middle) → different from both

This richness requires many dimensions.
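👨‍🦳 Uncle: You can check that reasoning with a few lines of TypeScript. This sketch uses only two of the illustrative scores, [frontend, dynamic]; the numbers are teaching values, not output from a real model.

```typescript
// Illustrative scores: [frontend, dynamic]
const reactScores = [0.9, 0.95];
const nodeScores = [0.3, 0.95];
const pythonScores = [0.1, 0.5];

// Gap per dimension: where exactly do two items differ?
function gaps(a: number[], b: number[]): number[] {
  return a.map((x, i) => Math.abs(x - (b[i] ?? 0)));
}

// Overall straight-line distance across all dimensions.
function euclidean(a: number[], b: number[]): number {
  return Math.sqrt(gaps(a, b).reduce((sum, g) => sum + g * g, 0));
}

console.log(gaps(reactScores, nodeScores));        // differ on frontend, agree on dynamic
console.log(euclidean(reactScores, nodeScores));   // smaller distance
console.log(euclidean(reactScores, pythonScores)); // larger distance
```

With a single number there is only one way to be near or far. With more dimensions, two items can agree on one axis and disagree on another.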

Mental Model: Google Maps

👨‍🦳 Uncle: Think of Google Maps. Bangalore has coordinates:

Latitude: 12.9716
Longitude: 77.5946

These 2 numbers uniquely identify Bangalore.

Embeddings work similarly:

React has 1536 coordinates in "semantic space":
[0.13, -0.46, 0.79, ..., X1536]

This position uniquely identifies React's meaning.

STEP 3: SEMANTIC SEARCH

Keyword Search vs Semantic Search

👦 Nephew: So now we have vectors for all chunks. How do we search?

👨‍🦳 Uncle: Let me show you the difference:

Keyword Search:
─────────────────
User asks: "What is notice period?"
System searches for exact words: "notice" + "period"

Document contains: "resignation requirements"
Match? NO ❌

Semantic Search:
────────────────
User asks: "What is notice period?"
System converts to vector: [0.21, -0.44, 0.88, ...]

Document: "resignation requirements"
System converts to vector: [0.22, -0.40, 0.85, ...]

Compare vectors: Very similar! ✅

Cosine Similarity: How Similar Are Two Vectors?

👨‍🦳 Uncle: We need a way to measure "how similar" two vectors are.

Think of vectors as arrows pointing in a direction.

Vector A: ↗️ (pointing up-right)
Vector B: ↗️ (pointing up-right)
Similarity: HIGH (same direction)

Vector A: ↗️
Vector B: →️ (pointing right)
Similarity: MEDIUM (somewhat same)

Vector A: ↗️
Vector B: ↙️ (pointing down-left)
Similarity: LOW (opposite directions)

👦 Nephew: So cosine similarity measures "angle between vectors"?

👨‍🦳 Uncle: Exactly! In mathematics:

Cosine Similarity Score Range:
─────────────────────────────
1.0  = Vectors point in exactly the same direction (length is ignored)
0.5  = 60° angle between them (medium similarity)
0.0  = 90° angle (unrelated)
-0.5 = 120° angle (somewhat opposite)
-1.0 = Opposite directions (rare in practice for text embeddings)
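👨‍🦳 Uncle: The formula behind that table is the dot product divided by the product of the two lengths. A minimal TypeScript sketch, with a helper (atAngle, my own name) that builds 2-D vectors at known angles so you can check the table yourself:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) throw new Error("a zero vector has no direction");
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Unit vector at a given angle in 2-D, used only to test the angles in the table.
function atAngle(degrees: number): number[] {
  const radians = (degrees * Math.PI) / 180;
  return [Math.cos(radians), Math.sin(radians)];
}

const reference = atAngle(0);
for (const angle of [0, 60, 90, 120, 180]) {
  console.log(`${angle}°`, cosineSimilarity(reference, atAngle(angle)).toFixed(2));
}
```

Because the result is divided by both lengths, scaling a vector up or down does not change its score. Only direction matters.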

Real Example

👨‍🦳 Uncle: Suppose the vector DB contains:

Question Vector: 
"What is notice period?" 
→ [0.21, -0.44, 0.88, ...]

Document Chunks:
─────────────────────────────────────
Chunk 1: "Employees must serve 30 days notice period"
→ [0.22, -0.40, 0.85, ...]
Cosine Similarity: 0.95 ✅ (Very Similar)

Chunk 2: "Resignation must be submitted in writing"
→ [0.19, -0.42, 0.87, ...]
Cosine Similarity: 0.92 ✅ (Similar)

Chunk 3: "Office timings are 9 AM to 5 PM"
→ [0.05, 0.88, -0.12, ...]
Cosine Similarity: 0.15 ❌ (Not Related)

Chunk 4: "Coffee is available in cafeteria"
→ [-0.45, 0.12, 0.33, ...]
Cosine Similarity: -0.05 ❌ (Unrelated)

The system returns the top 3 chunks with highest similarity.

👦 Nephew: Why top 3 and not all?

👨‍🦳 Uncle: Because:

Efficiency - Why send irrelevant chunks to LLM?
Context window - LLM has limited capacity
Cost - More tokens = more money
Quality - Less noise = better answers

This is called Top-K Retrieval (usually K=3-5).
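👨‍🦳 Uncle: Top-K itself is tiny in code: sort by score, keep the first K. This sketch reuses the four scores from the example. The minScore cutoff is my own optional addition, not something every system has; it is there because plain top 3 out of 4 still lets the weak "office timings" chunk through.

```typescript
interface ScoredChunk {
  text: string;
  score: number; // cosine similarity against the query
}

function topK(chunks: ScoredChunk[], k: number, minScore = -1): ScoredChunk[] {
  return chunks
    .filter((c) => c.score >= minScore) // filter returns a copy, so the input is not mutated
    .sort((x, y) => y.score - x.score)  // highest score first
    .slice(0, k);
}

const scoredChunks: ScoredChunk[] = [
  { text: "Office timings are 9 AM to 5 PM", score: 0.15 },
  { text: "Employees must serve 30 days notice period", score: 0.95 },
  { text: "Coffee is available in cafeteria", score: -0.05 },
  { text: "Resignation must be submitted in writing", score: 0.92 },
];

console.log(topK(scoredChunks, 3));      // includes the 0.15 chunk
console.log(topK(scoredChunks, 3, 0.5)); // hypothetical cutoff drops it
```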

Complete Flow: Query Time

👨‍🦳 Uncle: Now let's trace a complete query:

USER ASKS:
"What is our notice period?"

STEP 1: Tokenization (inside embedding model)
─────────────────────────────────────
"What is our notice period?"
→ Tokenizer
→ [789, 44, 123, 555, 321]

STEP 2: Embedding (inside embedding model)
─────────────────────────────────────
Token IDs: [789, 44, 123, 555, 321]
→ Embedding Layer looks up each
→ Combines into one vector
Query Vector: [0.21, -0.44, 0.88, ... 1536 values]

STEP 3: Vector Database Search
─────────────────────────────────────
Query Vector: [0.21, -0.44, 0.88, ...]
↓
Compare against the stored chunk vectors (conceptually all of them; an index narrows this down)
↓
Compute cosine similarity for each
↓
Sort by score (highest first)
↓
Return Top-K (K=3 in this example)

STEP 4: Pass to LLM
─────────────────────────────────────
Retrieved chunks:
1. "Employees receive 30 days notice period"
2. "Manager approval required for resignation"
3. "Notice period starts from submission date"

Prompt:
Context:
[chunk 1]
[chunk 2]
[chunk 3]

Question: What is our notice period?

STEP 5: Generate Answer
─────────────────────────────────────
LLM reads context and answers:
"The notice period is 30 days from submission."

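👨‍🦳 Uncle: Step 4 is plain string assembly. A minimal sketch, assuming the simple "Context + Question" layout shown in the flow; the function name buildPrompt is hypothetical, and real prompts usually add instructions on top.

```typescript
// Assemble the retrieved chunks and the user question into one prompt string.
function buildPrompt(chunks: string[], question: string): string {
  const context = chunks.map((chunk, i) => `${i + 1}. ${chunk}`).join("\n");
  return `Context:\n${context}\n\nQuestion: ${question}`;
}

const retrievedChunks = [
  "Employees receive 30 days notice period",
  "Manager approval required for resignation",
  "Notice period starts from submission date",
];

const finalPrompt = buildPrompt(retrievedChunks, "What is our notice period?");
console.log(finalPrompt);
```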
The Question Everyone Asks: "Where's the Magic?"

👦 Nephew: So the magic is... the vectors?

👨‍🦳 Uncle: Yes and no. The magic is in how vectors are created.

During training on billions of documents, the model learned:

Notice period
↔
Resignation requirements
↔
Notice duration

These concepts appear together many times.
So their vectors end up close to each other.

It's not magic. It's pattern recognition at scale.

Why Different Models Give Different Vectors

👦 Nephew: If I use OpenAI embeddings, then switch to Cohere, what happens?

👨‍🦳 Uncle: Different model = different vector space.

OpenAI text-embedding-3-small:
"React" → [0.13, -0.46, 0.79, ...]

Cohere Embed:
"React" → [0.57, 0.12, -0.34, ...]

Different numbers!

👦 Nephew: Can I compare them?

👨‍🦳 Uncle: NO! They're in different "semantic spaces".

It's like:
Celsius: 25 degrees
Fahrenheit: 77 degrees

Different scales. Can't compare directly.
And it's worse than temperature: there is no conversion formula between two embedding spaces.

When you change embedding models, you must re-embed everything.
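👨‍🦳 Uncle: One cheap safety net is to store the model name next to every vector and refuse to compare across models. A sketch under that assumption; the type Embedded, the function names, and the label "other-embedding-model" are hypothetical.

```typescript
interface Embedded {
  model: string;    // which embedding model produced the vector
  vector: number[];
}

// Refuse to compare vectors that live in different vector spaces.
function assertComparable(a: Embedded, b: Embedded): void {
  if (a.model !== b.model) {
    throw new Error(`Different vector spaces: ${a.model} vs ${b.model}. Re-embed first.`);
  }
  if (a.vector.length !== b.vector.length) {
    throw new Error(`Dimension mismatch: ${a.vector.length} vs ${b.vector.length}`);
  }
}

function dotProduct(a: Embedded, b: Embedded): number {
  assertComparable(a, b);
  return a.vector.reduce((sum, x, i) => sum + x * (b.vector[i] ?? 0), 0);
}

const storedChunk: Embedded = { model: "text-embedding-3-small", vector: [0.13, -0.46, 0.79] };
const sameModelQuery: Embedded = { model: "text-embedding-3-small", vector: [0.21, -0.44, 0.88] };
const otherModelQuery: Embedded = { model: "other-embedding-model", vector: [0.57, 0.12, -0.34] };

console.log(dotProduct(storedChunk, sameModelQuery)); // fine: same model
// dotProduct(storedChunk, otherModelQuery) would throw instead of returning a meaningless score
```

Without the guard, the math still runs and returns a number. It just means nothing.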

Production Concern: Model Migrations

👨‍🦳 Uncle: This is where RAG gets expensive.

Your company has:

100,000 documents
After chunking: 5 million chunks
After embedding: 5 million vectors stored

New, better embedding model appears:

Old Model: 92% accuracy
New Model: 95% accuracy

But cost of migration: $50,000 (an illustrative figure for re-embedding and re-indexing 5M vectors)

CTO's question:
"Is a 3-point improvement worth $50,000?"

Usually NO. So we don't upgrade.

👦 Nephew: So companies are stuck with old models?

👨‍🦳 Uncle: Yes, until the improvement is significant enough. This is why:

Many enterprises still use old APIs
Many use old databases
Migration is expensive and risky

The best technology doesn't always win. The best ROI wins.

Smart Companies: Gradual Migration

👨‍🦳 Uncle: Smart companies do:

Old documents (stored 2 years ago)
→ Use old embedding model

New documents (uploaded today)
→ Use new embedding model

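Until the old documents are re-embedded:
- Keep the two sets in separate indexes (their vectors can't be compared)
- Embed each query with both models and search each index with its own query vector

Gradually migrate in background:
- When compute and storage capacity allow
- When costs are low
- During low-traffic hours, so nobody notices the performance dip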

Why Store Original Chunk Text?

👦 Nephew: Uncle, we stored the vector. Why also store the chunk text?

👨‍🦳 Uncle: Great question! It's for the future.

Scenario:

Today: Using Model V1
Store:
{
  "chunk_text": "Employees receive 30 days notice",
  "embedding": [0.21, -0.44, 0.88, ...],
  "embedding_model": "text-embedding-3-small"
}

Tomorrow: Model V2 is available (better!)

Can we do:
Old Vector → New Vector?
NO ❌

But we can do:
Old Text → New Model → New Vector
YES ✅

This is why chunk_text is precious!

Without chunk_text:

You'd have to:
1. Find original PDF
2. Parse it again
3. Clean it again
4. Chunk it again
5. Embed it with new model

= Huge pain and cost
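👨‍🦳 Uncle: With chunk_text stored, the migration is a simple loop. This is a sketch only: fakeEmbedV2 is a stand-in I made up, not a real model, and a real embedding call would be an async network request done in batches.

```typescript
interface StoredChunk {
  chunk_text: string;
  embedding: number[];
  embedding_model: string;
}

type EmbedFn = (text: string) => number[];

// Re-embed every record that was not produced by the target model, using the stored text.
function reembed(records: StoredChunk[], newModel: string, embed: EmbedFn): StoredChunk[] {
  return records.map((r) =>
    r.embedding_model === newModel
      ? r
      : { chunk_text: r.chunk_text, embedding: embed(r.chunk_text), embedding_model: newModel }
  );
}

// Stand-in for a real embedding call: it just derives two numbers from the text.
const fakeEmbedV2: EmbedFn = (text) => [text.length / 100, text.split(" ").length / 10];

const oldRecords: StoredChunk[] = [
  {
    chunk_text: "Employees receive 30 days notice",
    embedding: [0.21, -0.44, 0.88],
    embedding_model: "text-embedding-3-small",
  },
];

const migrated = reembed(oldRecords, "model-v2", fakeEmbedV2); // "model-v2" is a hypothetical name
console.log(migrated);
```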

Vector Database: Making Search Fast

The Problem: 5 Million Vectors!

👦 Nephew: If we have 5 million vectors, don't we compare with all of them?

👨‍🦳 Uncle: That would be:

Query Vector
↓
Compare with 5 million vectors = 5 million calculations
↓
Takes, say, 30 seconds

User waits 30 seconds. Bad experience.

This is where Vector Databases come in (Pinecone, Qdrant, or Postgres with the pgvector extension).

How Vector Databases Work Fast

👨‍🦳 Uncle: They use indexing. Like Google Maps.

Google doesn't search the entire world to find Bangalore. It uses:

Regions → Countries → States → Cities → Exact Location

Similarly, vector databases use:

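Approximate Nearest Neighbor (ANN)
↓
Builds an index over the vectors (for example, by dividing the semantic space into regions)
↓
When you search, quickly finds the most promising region
↓
Only compares against the vectors in that region
↓
Returns nearest neighbors in about 100ms instead of 30 seconds
↓
Trade-off: results are approximate, so a true nearest neighbor is occasionally missed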
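👨‍🦳 Uncle: Here is the "regions" idea as a toy index. It is a sketch with hand-picked region centers; real indexes learn their regions from the data or use a graph structure instead, and all names here (buildIndex, search, handPickedCentroids) are my own.

```typescript
type Vector = number[];

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return dot / Math.sqrt(na * nb);
}

// Which region (centroid) does this vector point toward most?
function nearestCentroid(v: Vector, centroids: Vector[]): number {
  let best = 0;
  let bestScore = -Infinity;
  centroids.forEach((c, i) => {
    const score = cosine(v, c);
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  });
  return best;
}

// Index time: drop every stored vector into the bucket of its nearest centroid.
function buildIndex(vectors: Vector[], centroids: Vector[]): Vector[][] {
  const buckets: Vector[][] = centroids.map(() => []);
  for (const v of vectors) {
    const bucket = buckets[nearestCentroid(v, centroids)];
    if (bucket !== undefined) bucket.push(v);
  }
  return buckets;
}

// Query time: score only the vectors in the query's own bucket.
function search(query: Vector, buckets: Vector[][], centroids: Vector[], k: number) {
  const bucket = buckets[nearestCentroid(query, centroids)] ?? [];
  const results = bucket
    .map((v) => ({ vector: v, score: cosine(query, v) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
  return { results, compared: bucket.length };
}

const storedVectors: Vector[] = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.95]];
const handPickedCentroids: Vector[] = [[1, 0], [0, 1]];
const ivfBuckets = buildIndex(storedVectors, handPickedCentroids);

const outcome = search([1, 0.2], ivfBuckets, handPickedCentroids, 1);
console.log(outcome.compared, "of", storedVectors.length, "vectors compared");
```

A vector sitting just across a region border would be missed by this search, which is exactly the "approximate" part of ANN.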

Interview-Level Answers

Q1: What is Tokenization?

👨‍🦳 Uncle: Here's your answer:

"Tokenization is the process of breaking down text into smaller units called tokens and converting them into token IDs. Since embedding models work with numbers, not text, a tokenizer converts words or subwords into numeric identifiers. For example, 'React is awesome' might become [4521, 318, 9876] where each number represents a token. The granularity depends on the tokenizer's vocabulary - common words might be single tokens, while rare words could be split into multiple tokens."

Q2: What is an Embedding?

"An embedding is a vector representation of text that captures semantic meaning. When a token ID is passed to the embedding layer, the model looks it up in a learned embedding matrix and returns a fixed-length vector (1536 numbers for a model such as text-embedding-3-small) that collectively represents the token's meaning. These numbers are learned during the model's training process - similar concepts end up with vectors that point in similar directions in this high-dimensional space."

Q3: Why 1536 Dimensions?

"One number cannot capture the complexity of language. The embedding model needs to represent many relationships simultaneously: whether a word relates to programming, frontend vs backend, syntax style, ecosystem size, etc. With 1536 dimensions, each dimension can capture a different aspect of meaning. This richness allows the model to distinguish between concepts that would look similar with fewer dimensions."

Q4: How Does Semantic Search Work?

"Semantic search works by converting both queries and documents into embeddings (vectors), then using cosine similarity to find documents whose vectors point in similar directions. If a user asks 'notice period' and a document discusses 'resignation requirements', both will have vectors that point in similar directions despite using different words, because the model learned during training on billions of documents that the two are related."

Q5: What Happens When You Change Embedding Models?

"When you switch embedding models, you must re-embed all stored chunks because different models produce vectors in different vector spaces. You can't compare vectors from Model A with query vectors from Model B - it's like mixing Celsius and Fahrenheit. Companies must evaluate the accuracy improvement against the cost of re-indexing. For example, improving from 92% to 95% accuracy might not justify a $50,000 re-embedding cost."

The Complete Architecture So Far

PHASE 1: DOCUMENT INGESTION ✅
─────────────────────────────
PDF Upload
  ↓
File Hash Check
  ↓
Parse & Clean
  ↓
Chunking
  ↓
Deduplication
  ↓
Store Chunk Text

PHASE 2: EMBEDDINGS ← YOU ARE HERE
─────────────────────────────
Chunk Text
  ↓
Tokenization (internal to embedding model)
  ↓
Token IDs
  ↓
Embedding Layer
  ↓
Vector
  ↓
Store in Vector Database

PHASE 2 (CONTINUED): QUERY TIME
─────────────────────────────
User Question
  ↓
Tokenization + Embedding
  ↓
Query Vector
  ↓
Vector Database Search (ANN)
  ↓
Top-K Similar Chunks
  ↓
Send to LLM
  ↓
Generate Answer

Summary: What Phase 2 Solves

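Problem                       Phase 1 only (keyword search)         With Phase 2 (embeddings)
──────────────────────────────────────────────────────────────────────────────────────────────
Large documents               ✅ Chunking                            ✅ Handled
Exact keyword matching        ✅ Works, but ❌ fails for synonyms     ✅ Synonyms match by meaning
Similar concepts              ❌ Keyword search misses               ✅ Vector similarity finds
Scale (millions of chunks)    ❌ Search is slow                      ✅ Vector DB indexes fast
Meaning understanding         ❌ Can't understand meaning            ✅ Embeddings capture it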

Key Takeaways

Tokenization = Text → Numbers (done by embedding model internally)
Embedding = Numbers → Meaningful vectors (captures semantic relationships)
Semantic Search = Find vectors pointing in similar direction (cosine similarity)
Vector Databases = Index for fast nearest-neighbor search
Re-embedding Cost = Why companies don't switch models often
Chunk Text Storage = Insurance for future model upgrades

Next: Phase 2 Node.js Code

Now that you understand:

How tokenization works
How embeddings are created
How semantic search finds similar documents
How to measure similarity
Production considerations for migrations

We'll implement Phase 2 in Node.js:

Create embeddings for chunks
Store vectors in pgvector
Implement semantic search
Handle model versioning

Ready?

Remember: Less noise, more action. Phase 2 is where the actual intelligence happens.
