DEV Community

surajrkhonde


Phase 2: Embeddings & Semantic Search

From Text to Vectors: The Complete Story


The Story Starts: Why Can't We Just Search for Words?

👦 Nephew: Uncle! Phase 1 is done. Now we have clean chunks. Can't we just search for keywords?

👨‍🦳 Uncle: Let me ask you something. Someone applies for a job and says:

Resume says: "I managed a team of 10 people for 3 years"

Job Description asks: "Do you have leadership experience?"

Are these related?

👦 Nephew: Obviously yes! But the words are different. Team management and leadership.

👨‍🦳 Uncle: Exactly. A computer doing simple keyword search would say NO.

Keyword Search:
Resume: "team management"
Question: "leadership"
Match? NO (different words)
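To make the failure concrete, here is a minimal sketch of naive keyword matching on the two sentences above. The `keywords` helper is a made-up name for illustration, and real keyword search engines do more (stemming, ranking), but the core limitation is the same: no shared words, no match.

```python
import re

def keywords(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

resume = "I managed a team of 10 people for 3 years"
question = "Do you have leadership experience?"

# Words the two sentences share; keyword search can only match on these.
shared = keywords(resume) & keywords(question)
print(shared)  # set()
```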

This is the problem Phase 2 solves.

👦 Nephew: How?

👨‍🦳 Uncle: By understanding meaning, not just words.


PHASE 2 OVERVIEW: Three Key Concepts

Phase 1: Document Ingestion ✅
    ↓
Phase 2: Embeddings & Semantic Search ← WE ARE HERE
    │
    ├─ Step 1: Tokenization
    │         (Text → Token IDs)
    │
    ├─ Step 2: Embedding Layer
    │         (Token IDs → Vectors)
    │
    └─ Step 3: Semantic Search
              (Find Similar Vectors)
    ↓
Phase 3: Safety & Production

STEP 1: TOKENIZATION

What Is Tokenization?

👨‍🦳 Uncle: Computers don't understand words like "React" or "JavaScript". They only understand numbers.

A tokenizer converts text into numbers called token IDs.

Text:           "React is awesome"
                    ↓
Tokenizer converts to:
Token IDs:      [4521, 318, 9876]

👦 Nephew: But why not just one number per word?

👨‍🦳 Uncle: Good question. Look at this:

Word:          "JavaScriptDeveloper"

Tokenizer might see:
["Java", "Script", "Developer"]   3 tokens
or
["JavaScript", "Developer"]       2 tokens

Depends on the tokenizer's vocabulary.
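The effect of vocabulary on the split can be sketched with a toy tokenizer. This is a simplification: it uses greedy longest-match over a hand-written vocabulary, whereas real tokenizers learn their vocabulary from data (for example with byte-pair encoding). The vocabularies and IDs below are invented for illustration.

```python
def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match: at each position take the longest vocab entry."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers {text[i]!r}")
    return ids

small_vocab = {"Java": 1, "Script": 2, "Developer": 3}
large_vocab = {**small_vocab, "JavaScript": 4}

print(tokenize("JavaScriptDeveloper", small_vocab))  # [1, 2, 3]
print(tokenize("JavaScriptDeveloper", large_vocab))  # [4, 3]
```

Same text, different vocabulary, different token count: 3 tokens versus 2.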

Why Is This Important?

👨‍🦳 Uncle: Because the embedding model doesn't receive text. It receives token IDs.

Text Input:
"Employees must submit resignation 60 days in advance"
           ↓
Tokenizer (internal to embedding model):
[1001, 245, 782, 4521, 99, 1234, ...]
           ↓
Embedding Layer looks up each token:
1001 → [0.12, -0.45, 0.78, ...]
245  → [0.88, -0.12, 0.34, ...]
           ↓
Model layers add context, then pooling combines them into one final vector
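A minimal sketch of the lookup-then-combine idea follows. It assumes a made-up 3-dimensional table and uses mean pooling (averaging the token vectors) as the combining step. Mean pooling is one common choice, not necessarily what any particular model does, and real models also run the token vectors through many layers before pooling.

```python
# Hypothetical 3-dimensional embedding table: token ID -> vector.
EMBEDDING_TABLE = {
    1001: [0.12, -0.45, 0.78],
    245: [0.88, -0.12, 0.34],
    782: [0.50, 0.30, -0.10],
}

def embed(token_ids):
    """Look up each token's vector, then average them (mean pooling)."""
    vectors = [EMBEDDING_TABLE[t] for t in token_ids]
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

sentence_vector = embed([1001, 245, 782])
print([round(x, 2) for x in sentence_vector])  # [0.5, -0.09, 0.34]
```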

Real Example: Your Name

👨‍🦳 Uncle: Your name is Suraj Khonde. How does the tokenizer handle it?

If the tokenizer has seen "Suraj" often:
"Suraj" → Single Token → [8812]

If not:
"Suraj" → Breaks into: ["Su", "raj"]
                      → [5234, 9876]

Same with "Khonde":
If common: → Single token
If rare:   → Multiple tokens

Result might be (IDs are made up for illustration):
"Suraj Khonde" → [5234, 9876, 7621, 3344]

👦 Nephew: So less common words use more tokens?

👨‍🦳 Uncle: Exactly! This is why we count tokens, not words.

Word count: 2 words
Token count: 4 tokens

For billing and cost, we count tokens!
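As a rough sketch of what "count tokens, not words" means for cost, the snippet below compares the two counts and estimates a bill. The workload size and the price are made-up figures, and `estimate_cost` is a hypothetical helper. Check your provider's current pricing before relying on any number.

```python
def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Cost in dollars for embedding token_count tokens."""
    return token_count / 1_000_000 * price_per_million_tokens

text = "Suraj Khonde"
token_ids = [5234, 9876, 7621, 3344]  # the example split above

print("words:", len(text.split()))   # words: 2
print("tokens:", len(token_ids))     # tokens: 4

# Hypothetical workload and price, purely for illustration.
chunks = 1_000
tokens_per_chunk = 400
price = 0.10  # dollars per million tokens (made-up figure)
print(f"${estimate_cost(chunks * tokens_per_chunk, price):.2f}")  # $0.04
```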


STEP 2: EMBEDDING LAYER

The Bridge: From Numbers to Meaning

👨‍🦳 Uncle: Now we have token IDs. But they're meaningless.

Token ID: 4521

This is just an identifier. Like an employee ID.
Employee #4521 is just a number.

The embedding layer transforms this into meaningful vectors.

👦 Nephew: How?

👨‍🦳 Uncle: Think of a giant lookup table inside the model:

Token ID    Vector (Meaning)
─────────────────────────────────────────
4521   →  [0.13, -0.46, 0.79, ... 1536 values]
318    →  [0.91, -0.21, 0.12, ... 1536 values]
9876   →  [0.44, 0.77, -0.33, ... 1536 values]

When model sees token 4521, it looks up:
4521 → [0.13, -0.46, 0.79, ...]

👦 Nephew: What are those numbers? [0.13, -0.46, 0.79]?

👨‍🦳 Uncle: Those numbers encode meaning. They describe what the token represents.

Who Created These Numbers?

👨‍🦳 Uncle: When the embedding model was first created, these numbers were mostly random.

But then something magical happened: training.

During training on billions of documents, the model learned:

React appears with JavaScript many times
↓
So React and JavaScript vectors move closer

Node.js appears with Express many times
↓
So Node.js and Express vectors move closer

Dog appears with Puppy many times
↓
So Dog and Puppy vectors move closer
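The "vectors move closer" idea can be caricatured in a few lines. This is not how embedding models are actually trained (they minimize a loss over huge datasets). It is a toy update, with invented starting vectors, that nudges two co-occurring words toward each other and shows their similarity rising.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def pull_together(a, b, lr=0.1):
    """Move each vector a small step toward the other (a toy 'training' update)."""
    new_a = [x + lr * (y - x) for x, y in zip(a, b)]
    new_b = [y + lr * (x - y) for x, y in zip(a, b)]
    return new_a, new_b

# Made-up starting positions, standing in for random initialization.
react = [0.9, -0.2, 0.1]
javascript = [-0.3, 0.8, 0.4]

before = cosine(react, javascript)
for _ in range(20):  # 20 co-occurrences seen during "training"
    react, javascript = pull_together(react, javascript)
after = cosine(react, javascript)

print(f"before={before:.2f} after={after:.2f}")
```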

After training on billions of examples, the vectors ended up in meaningful positions.

👦 Nephew: So React and JavaScript have similar vectors?

👨‍🦳 Uncle: Yes! That's why semantic search works.


KEY CONCEPT: Why 1536 Dimensions?

The Problem with Too Few Dimensions

👨‍🦳 Uncle: Suppose the embedding only used 1 number:

React   → 5
Python  → 7
JavaScript → 6

Now what? React and JavaScript are close (5 vs 6), 
but so are React and Python (5 vs 7).
All three look similar with one number!
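Here is a small sketch of why one number is not enough. The second coordinate ("how frontend-focused is it?") and all its values are invented for illustration. The point is only that an extra dimension lets Python move far away from React while JavaScript stays close.

```python
import math

# One number per word, as in the example above.
one_d = {"React": [5.0], "JavaScript": [6.0], "Python": [7.0]}

# Same first number, plus a hypothetical "frontend-ness" coordinate.
two_d = {"React": [5.0, 9.0], "JavaScript": [6.0, 8.0], "Python": [7.0, 1.0]}

for space in (one_d, two_d):
    to_js = math.dist(space["React"], space["JavaScript"])
    to_python = math.dist(space["React"], space["Python"])
    print(round(to_js, 2), round(to_python, 2))
# 1.0 2.0
# 1.41 8.25
```

With one dimension, Python is only twice as far from React as JavaScript is. With two, it is nearly six times as far.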

❌ Not enough information.

Why So Many Numbers?

👨‍🦳 Uncle: Language is incredibly complex. The vector needs to capture:

Dimension 1: Frontend vs Backend
Dimension 2: Web vs Systems Programming
Dimension 3: Dynamic vs Compiled
Dimension 4: Syntax Style
Dimension 5: Ecosystem Size
...
Dimension 1536: [something the model learned]

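Each dimension captures a different aspect of meaning. (The labels above are a teaching aid: in a real model the dimensions are learned, and most of them don't map to a clean human-readable concept.)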

👦 Nephew: So with 1536 dimensions, React can be described as:

React: [0.9 (very frontend), 0.2 (mostly web), 0.95 (dynamic), ...]
Node.js: [0.3 (mostly backend), 0.2 (mostly web), 0.95 (dynamic), ...]
Python: [0.1 (rarely frontend), 0.5 (web and systems), 0.9 (dynamic), ...]

👨‍🦳 Uncle: Exactly! And notice:

React and Node.js both have 0.95 (dynamic) → similar there
React is 0.9 frontend, Node is 0.3 → different there
Python is 0.5 on web vs systems (middle) → different from both

This richness requires many dimensions.

Mental Model: Google Maps

👨‍🦳 Uncle: Think of Google Maps. Bangalore has coordinates:

Latitude: 12.9716
Longitude: 77.5946

These 2 numbers uniquely identify Bangalore.

Embeddings work similarly:

React has 1536 coordinates in "semantic space":
[0.13, -0.46, 0.79, ..., X1536]

This position uniquely identifies React's meaning.

STEP 3: SEMANTIC SEARCH

Keyword Search vs Semantic Search

👦 Nephew: So now we have vectors for all chunks. How do we search?

👨‍🦳 Uncle: Let me show you the difference:

Keyword Search:
─────────────────
User asks: "What is notice period?"
System searches for exact words: "notice" + "period"

Document contains: "resignation requirements"
Match? NO ❌

Semantic Search:
────────────────
User asks: "What is notice period?"
System converts to vector: [0.21, -0.44, 0.88, ...]

Document: "resignation requirements"
System converts to vector: [0.22, -0.40, 0.85, ...]

Compare vectors: Very similar! ✅

Cosine Similarity: How Similar Are Two Vectors?

👨‍🦳 Uncle: We need a way to measure "how similar" two vectors are.

Think of vectors as arrows pointing in a direction.

Vector A: ↗️ (pointing up-right)
Vector B: ↗️ (pointing up-right)
Similarity: HIGH (same direction)

Vector A: ↗️
Vector B: →️ (pointing right)
Similarity: MEDIUM (somewhat same)

Vector A: ↗️
Vector B: ↙️ (pointing down-left)
Similarity: LOW (opposite directions)

👦 Nephew: So cosine similarity measures "angle between vectors"?

👨‍🦳 Uncle: Exactly! In mathematics:

Cosine Similarity Score Range:
─────────────────────────────
1.0  = Vectors point in exactly same direction (identical)
0.5  = 60° angle between them (medium similarity)
0.0  = 90° angle (unrelated)
-0.5 = 120° angle (somewhat opposite)
-1.0 = Opposite directions (contradictory)
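The score table above can be checked directly. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below builds 2-D arrows at known angles (the `unit_vector` helper is just for this demo) and prints the score for each.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def unit_vector(angle_degrees):
    """A 2-D arrow of length 1 pointing at the given angle."""
    r = math.radians(angle_degrees)
    return [math.cos(r), math.sin(r)]

reference = unit_vector(0)
for angle in (0, 60, 90, 120, 180):
    score = cosine_similarity(reference, unit_vector(angle))
    print(f"{angle:>3} degrees -> {score:+.2f}")
#   0 degrees -> +1.00
#  60 degrees -> +0.50
#  90 degrees -> +0.00
# 120 degrees -> -0.50
# 180 degrees -> -1.00
```

Note that only direction matters: scaling either vector up or down leaves the score unchanged.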

Real Example

👨‍🦳 Uncle: Suppose the vector DB contains:

Question Vector: 
"What is notice period?" 
→ [0.21, -0.44, 0.88, ...]

Document Chunks:
─────────────────────────────────────
Chunk 1: "Employees must serve 30 days notice period"
→ [0.22, -0.40, 0.85, ...]
Cosine Similarity: 0.95 ✅ (Very Similar)

Chunk 2: "Resignation must be submitted in writing"
→ [0.19, -0.42, 0.87, ...]
Cosine Similarity: 0.92 ✅ (Similar)

Chunk 3: "Office timings are 9 AM to 5 PM"
→ [0.05, 0.88, -0.12, ...]
Cosine Similarity: 0.15 ❌ (Not Related)

Chunk 4: "Coffee is available in cafeteria"
→ [-0.45, 0.12, 0.33, ...]
Cosine Similarity: -0.05 ❌ (Unrelated)

The system returns the top 3 chunks with the highest similarity.

👦 Nephew: Why top 3 and not all?

👨‍🦳 Uncle: Because:

  1. Efficiency - Why send irrelevant chunks to LLM?
  2. Context window - LLM has limited capacity
  3. Cost - More tokens = more money
  4. Quality - Less noise = better answers

This is called Top-K Retrieval (usually K=3-5).
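Top-K retrieval over already-scored chunks can be sketched as below, using the four chunks and scores from the example. The optional `min_score` cutoff is an addition of this sketch, not something described above. It is one way to avoid returning a weak match just because K has not been filled yet.

```python
import heapq

scored_chunks = [
    ("Employees must serve 30 days notice period", 0.95),
    ("Resignation must be submitted in writing", 0.92),
    ("Office timings are 9 AM to 5 PM", 0.15),
    ("Coffee is available in cafeteria", -0.05),
]

def top_k(scored, k, min_score=None):
    """Return the k highest-scoring (text, score) pairs, best first."""
    best = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    if min_score is not None:
        # Assumption of this sketch: drop anything below the cutoff.
        best = [pair for pair in best if pair[1] >= min_score]
    return best

for text, score in top_k(scored_chunks, k=3, min_score=0.5):
    print(f"{score:.2f}  {text}")
# 0.95  Employees must serve 30 days notice period
# 0.92  Resignation must be submitted in writing
```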


Complete Flow: Query Time

👨‍🦳 Uncle: Now let's trace a complete query:

USER ASKS:
"What is our notice period?"

STEP 1: Tokenization (inside embedding model)
─────────────────────────────────────
"What is our notice period?"
→ Tokenizer
→ [789, 44, 123, 555, 321]

STEP 2: Embedding (inside embedding model)
─────────────────────────────────────
Token IDs: [789, 44, 123, 555, 321]
→ Embedding Layer looks up each
→ Combines into one vector
Query Vector: [0.21, -0.44, 0.88, ... 1536 values]

STEP 3: Vector Database Search
─────────────────────────────────────
Query Vector: [0.21, -0.44, 0.88, ...]
↓
Compare against ALL stored chunk vectors
↓
Compute cosine similarity for each
↓
Sort by score (highest first)
↓
Return Top-K (K=3 in this example)

STEP 4: Pass to LLM
─────────────────────────────────────
Retrieved chunks:
1. "Employees receive 30 days notice period"
2. "Manager approval required for resignation"
3. "Notice period starts from submission date"

Prompt:
Context:
[chunk 1]
[chunk 2]
[chunk 3]

Question: What is our notice period?

STEP 5: Generate Answer
─────────────────────────────────────
LLM reads context and answers:
"The notice period is 30 days from submission."
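Step 4 (turning retrieved chunks into a prompt) is plain string assembly. Here is a minimal sketch, assuming the same "Context / Question" layout shown above. `build_prompt` is a hypothetical helper name, and production prompts usually add instructions on top of this.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks under 'Context:' and the question after them."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

retrieved = [
    "Employees receive 30 days notice period",
    "Manager approval required for resignation",
    "Notice period starts from submission date",
]

prompt = build_prompt("What is our notice period?", retrieved)
print(prompt)
```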

The Question Everyone Asks: "Where's the Magic?"

👦 Nephew: So the magic is... the vectors?

👨‍🦳 Uncle: Yes and no. The magic is in how vectors are created.

During training on billions of documents, the model learned:

Notice period
↔
Resignation requirements
↔
Notice duration

These concepts appear together many times.
So their vectors end up close to each other.

It's not magic. It's pattern recognition at scale.

Why Different Models Give Different Vectors

👦 Nephew: If I use OpenAI embeddings, then switch to Cohere, what happens?

👨‍🦳 Uncle: Different model = different vector space.

OpenAI text-embedding-3-small:
"React" → [0.13, -0.46, 0.79, ...]

Cohere Embed:
"React" → [0.57, 0.12, -0.34, ...]

Different numbers!

👦 Nephew: Can I compare them?

👨‍🦳 Uncle: NO! They're in different "semantic spaces".

It's like:
Celsius: 25 degrees
Fahrenheit: 77 degrees

Different scales. Can't compare directly.
(And unlike temperature, there is no conversion formula between two embedding spaces.)

When you change embedding models, you must re-embed everything.
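One way to guard against mixing spaces by accident is to carry the model name with every vector and refuse to compare across models. The sketch below assumes a small `Embedding` wrapper and the placeholder names "model-a" and "model-b". None of this is a real library API.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedding:
    model: str                 # which embedding model produced the vector
    values: tuple[float, ...]  # the vector itself

def similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, refusing to mix vectors from different models."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} with {b.model!r}")
    dot = sum(x * y for x, y in zip(a.values, b.values))
    return dot / (math.hypot(*a.values) * math.hypot(*b.values))

query = Embedding("model-a", (0.21, -0.44, 0.88))
chunk_same_model = Embedding("model-a", (0.22, -0.40, 0.85))
chunk_other_model = Embedding("model-b", (0.57, 0.12, -0.34))

print(round(similarity(query, chunk_same_model), 3))
try:
    similarity(query, chunk_other_model)
except ValueError as err:
    print("refused:", err)
```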


Production Concern: Model Migrations

👨‍🦳 Uncle: This is where RAG gets expensive.

Your company has:

100,000 documents
After chunking: 5 million chunks
After embedding: 5 million vectors stored

New, better embedding model appears:

Old Model: 92% accuracy
New Model: 95% accuracy

But cost of migration: $50,000 (illustrative: re-embedding 5M chunks, plus re-indexing and testing)

CTO's question:
"Is 3% improvement worth $50,000?"

Usually NO. So we don't upgrade.
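The CTO's question is simple arithmetic once you put a value on accuracy. The sketch below uses the figures from the example. `value_per_point` (what one accuracy point is worth to the business) is a hypothetical input you would have to estimate yourself.

```python
def cost_per_point(migration_cost: float, old_accuracy: float, new_accuracy: float) -> float:
    """Dollars spent per percentage point of accuracy gained."""
    gain = new_accuracy - old_accuracy
    if gain <= 0:
        raise ValueError("new model must be more accurate than the old one")
    return migration_cost / gain

def is_worth_it(migration_cost, old_accuracy, new_accuracy, value_per_point):
    """True if the estimated value of the gain covers the migration cost."""
    return (new_accuracy - old_accuracy) * value_per_point >= migration_cost

print(round(cost_per_point(50_000, 92, 95), 2))             # 16666.67
print(is_worth_it(50_000, 92, 95, value_per_point=10_000))  # False
print(is_worth_it(50_000, 92, 95, value_per_point=20_000))  # True
```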

👦 Nephew: So companies are stuck with old models?

👨‍🦳 Uncle: Yes, until the improvement is significant enough. This is why:

  1. Many enterprises still use old APIs
  2. Many use old databases
  3. Migration is expensive and risky

The best technology doesn't always win. The best ROI wins.

Smart Companies: Gradual Migration

👨‍🦳 Uncle: Smart companies do:

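Old documents (stored 2 years ago)
→ Keep in an index built with the old embedding model

New documents (uploaded today)
→ Go into a separate index built with the new embedding model

At query time:
→ Embed the question once per model, and search each index
  with the vector from its own model

Gradually migrate in background:
- When storage capacity allows
- When costs are low
- When nobody notices the performance dip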
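A gradual migration needs two pieces of bookkeeping: knowing which models are still in use (each one needs its own index and its own query vector), and moving old records over in small batches. The sketch below covers only that bookkeeping. The model names and the `Record` shape are assumptions, and the actual re-embedding call is left as a comment.

```python
from dataclasses import dataclass

OLD_MODEL = "embed-v1"  # hypothetical model names
NEW_MODEL = "embed-v2"

@dataclass
class Record:
    chunk_text: str
    embedding_model: str

def models_to_query(records):
    """Each model still in use needs its own index and its own query vector."""
    return sorted({r.embedding_model for r in records})

def migrate_batch(records, batch_size):
    """Move up to batch_size old records to the new model; return how many moved."""
    batch = [r for r in records if r.embedding_model == OLD_MODEL][:batch_size]
    for record in batch:
        # A real job would re-embed record.chunk_text with the new model here.
        record.embedding_model = NEW_MODEL
    return len(batch)

records = [Record(f"chunk {i}", OLD_MODEL) for i in range(5)]
records.append(Record("uploaded today", NEW_MODEL))

print(models_to_query(records))  # ['embed-v1', 'embed-v2']
while migrate_batch(records, batch_size=2):
    pass  # in production: wait for a quiet period between batches
print(models_to_query(records))  # ['embed-v2']
```

Once only one model remains, the old index can be dropped and queries go back to a single embedding call.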

Why Store Original Chunk Text?

👦 Nephew: Uncle, we stored the vector. Why also store the chunk text?

👨‍🦳 Uncle: Great question! It's for the future.

Scenario:

Today: Using Model V1
Store:
{
  "chunk_text": "Employees receive 30 days notice",
  "embedding": [0.21, -0.44, 0.88, ...],
  "embedding_model": "text-embedding-3-small"
}

Tomorrow: Model V2 is available (better!)

Can we do:
Old Vector → New Vector?
NO ❌

But we can do:
Old Text → New Model → New Vector
YES ✅

This is why chunk_text is precious!

Without chunk_text:

You'd have to:
1. Find original PDF
2. Parse it again
3. Clean it again
4. Chunk it again
5. Embed it with new model

= Huge pain and cost
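With `chunk_text` stored, an upgrade is just "read text, embed again, overwrite". Here is a minimal sketch using the record shape from the scenario above. `fake_embed_v2` is a placeholder that only makes the example runnable, and "model-v2" is an invented name. A real version would call the new embedding model.

```python
from typing import Callable

def reembed(record: dict, embed: Callable[[str], list], model_name: str) -> dict:
    """Return a copy of the record with a fresh vector built from chunk_text."""
    return {
        **record,
        "embedding": embed(record["chunk_text"]),
        "embedding_model": model_name,
    }

def fake_embed_v2(text: str) -> list:
    # Placeholder so the sketch runs; a real embedder would call the new model.
    return [float(len(text)), float(text.count(" "))]

stored = {
    "chunk_text": "Employees receive 30 days notice",
    "embedding": [0.21, -0.44, 0.88],
    "embedding_model": "text-embedding-3-small",
}

upgraded = reembed(stored, fake_embed_v2, "model-v2")
print(upgraded["embedding_model"], upgraded["embedding"])  # model-v2 [32.0, 4.0]
```

The old vector is never read. Only the text and the new model are needed.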

Vector Database: Making Search Fast

The Problem: 5 Million Vectors!

👦 Nephew: If we have 5 million vectors, don't we compare with all of them?

👨‍🦳 Uncle: That would be:

Query Vector
↓
Compare with 5 million vectors = 5 million calculations
↓
Takes, say, 30 seconds (illustrative)

User waits 30 seconds. Bad experience.

This is where Vector Databases come in (Pinecone, Qdrant, pgvector).

How Vector Databases Work Fast

👨‍🦳 Uncle: They use indexing. Like Google Maps.

Google doesn't search the entire world to find Bangalore. It uses:

Regions → Countries → States → Cities → Exact Location

Similarly, vector databases use:

Approximate Nearest Neighbor (ANN)
↓
Divides the semantic space into regions (one common approach)
↓
When you search, quickly finds the closest region(s)
↓
Only searches those regions
↓
Returns nearest neighbors in ~100ms instead of 30 seconds
(approximate: it may occasionally miss a true nearest neighbor)
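The region idea can be sketched as a toy inverted-file index: file each vector under its nearest region center, then at query time search only the matching region. The centers and vectors below are made up, the centers are hard-coded rather than learned, and real vector databases use more sophisticated structures, so treat this as an illustration of the principle only.

```python
import math

def nearest(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

# Hypothetical region centers; real indexes learn these from the data.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.1, 0.2], [0.3, -0.1], [9.8, 10.1], [10.2, 9.9], [9.5, 10.4]]

# Build step: file every vector under its nearest centroid.
regions = {i: [] for i in range(len(centroids))}
for v in vectors:
    regions[nearest(v, centroids)].append(v)

def ann_search(query):
    """Search only the region whose center is closest to the query."""
    region = regions[nearest(query, centroids)]
    best = region[nearest(query, region)]
    return best, len(region)

match, compared = ann_search([9.9, 10.0])
print(match, f"(compared {compared} of {len(vectors)} vectors)")
# [9.8, 10.1] (compared 3 of 5 vectors)
```

The saving here is small, but the same idea with thousands of regions is what turns millions of comparisons into a few thousand. The trade-off is that a true nearest neighbor sitting just across a region boundary can be missed.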

Interview-Level Answers

Q1: What is Tokenization?

👨‍🦳 Uncle: Here's your answer:

"Tokenization is the process of breaking down text into smaller units called tokens and converting them into token IDs. Since embedding models work with numbers, not text, a tokenizer converts words or subwords into numeric identifiers. For example, 'React is awesome' might become [4521, 318, 9876] where each number represents a token. The granularity depends on the tokenizer's vocabulary - common words might be single tokens, while rare words could be split into multiple tokens."

Q2: What is an Embedding?

"An embedding is a vector representation of text that captures semantic meaning. When a token ID is passed to the embedding layer, the model looks it up in a learned embedding matrix, returning a fixed-length vector (1536 numbers in our example) that collectively represents the token's meaning. These numbers are learned during the model's training process - similar concepts end up with vectors that point in similar directions in this high-dimensional space."

Q3: Why 1536 Dimensions?

"One number cannot capture the complexity of language. The embedding model needs to represent many relationships simultaneously: whether a word relates to programming, frontend vs backend, syntax style, ecosystem size, etc. With 1536 dimensions, each dimension can capture a different aspect of meaning. This richness allows the model to distinguish between concepts that would look similar with fewer dimensions."

Q4: How Does Semantic Search Work?

"Semantic search works by converting both queries and documents into embeddings (vectors), then using cosine similarity to find documents whose vectors point in similar directions. If a user asks 'notice period' and a document discusses 'resignation requirements', both will have vectors that point in similar directions despite using different words, because the model learned during training on billions of documents that these concepts are related."

Q5: What Happens When You Change Embedding Models?

"When you switch embedding models, you must re-embed all stored chunks because different models produce vectors in different vector spaces. You can't compare vectors from Model A with query vectors from Model B - it's like mixing Celsius and Fahrenheit. Companies must evaluate the accuracy improvement against the cost of re-indexing. For example, improving from 92% to 95% accuracy might not justify a $50,000 re-embedding cost."


The Complete Architecture So Far

PHASE 1: DOCUMENT INGESTION ✅
─────────────────────────────
PDF Upload
  ↓
File Hash Check
  ↓
Parse & Clean
  ↓
Chunking
  ↓
Deduplication
  ↓
Store Chunk Text

PHASE 2: EMBEDDINGS ← YOU ARE HERE
─────────────────────────────
Chunk Text
  ↓
Tokenization (internal to embedding model)
  ↓
Token IDs
  ↓
Embedding Layer
  ↓
Vector
  ↓
Store in Vector Database

PHASE 2 (CONTINUED): QUERY TIME
─────────────────────────────
User Question
  ↓
Tokenization + Embedding
  ↓
Query Vector
  ↓
Vector Database Search (ANN)
  ↓
Top-K Similar Chunks
  ↓
Send to LLM
  ↓
Generate Answer

Summary: What Phase 2 Solves

Problem | Phase 1 | Phase 2
--- | --- | ---
Large Documents | ✅ Chunking | ✅ Handled
Exact Keyword Matching | ✅ Works, but ❌ fails for synonyms | ✅ Synonyms found by vector similarity
Similar Concepts | ❌ Keyword search misses | ✅ Vector similarity finds
Scale (Millions of chunks) | ❌ Search is slow | ✅ Vector DB index makes search fast
Meaning Understanding | ❌ Can't understand meaning | ✅ Embeddings capture it

Key Takeaways

  1. Tokenization = Text → Numbers (done by embedding model internally)
  2. Embedding = Numbers → Meaningful vectors (captures semantic relationships)
  3. Semantic Search = Find vectors pointing in similar direction (cosine similarity)
  4. Vector Databases = Index for fast nearest-neighbor search
  5. Re-embedding Cost = Why companies don't switch models often
  6. Chunk Text Storage = Insurance for future model upgrades

Next: Phase 2 Node.js Code

Now that you understand:

  • How tokenization works
  • How embeddings are created
  • How semantic search finds similar documents
  • How to measure similarity
  • Production considerations for migrations

We'll implement Phase 2 in Node.js:

  • Create embeddings for chunks
  • Store vectors in pgvector
  • Implement semantic search
  • Handle model versioning

Ready?


Remember: Less noise, more action. Phase 2 is where the actual intelligence happens.
