Embeddings 101 for RAG: Why Storage & Compute Explode?

🚀 Welcome to Part 1 of our Embedding Optimization Series! Over the next few posts, we’ll tackle half‑precision floats, quantization algorithms, index tuning, and more.

👋 “Hey, how hard can vector search really be?”

That’s what I muttered a couple of months ago when I started a PoC for a “second brain” that could answer questions across a corpus of 10 million arXiv-scale papers.

Fast-forward a couple of weeks: my laptop looked like a space heater, RAM usage sat at 100 %, and Postgres was politely telling me to go buy more memory.

The culprit? Embeddings – those innocent-looking vectors that turn text into geometry so that “neural” search feels like magic.

Today we’re going to unpack exactly why they explode in size, how that explosion kills a RAG pipeline, and the three-step escape plan we’ll prove works over the next six posts.

No PhD required, but by the end you’ll be able to back-of-the-napkin the RAM cost of any dataset in under 30 seconds.


1. 30-Second Refresher: What Is an Embedding?

Imagine every paragraph in your corpus is a tiny star floating in space.

An embedding model (OpenAI’s text-embedding-3-small, Microsoft’s E5, BAAI’s BGE … pick your flavor) reads the paragraph and outputs a point in N-dimensional space such that:

  • Similar paragraphs are close.
  • Dissimilar ones are far.

Mathematically it’s just a dense vector of N numbers.

Most embedding models today use N = 1536 (OpenAI) or N = 768 (open-source models like E5-base).

Each number is a 32-bit IEEE-754 float, so:

float32: 4 bytes × dims
1536 d → 6 144 bytes ≈ 6.0 KB / vector
3072 d → 12 288 bytes ≈ 12.0 KB / vector
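You can check the per-vector cost directly. Here is a minimal sketch using the open-source all-MiniLM-L6-v2 model from the mini-lab later in this post; the closing loop is just the same arithmetic applied to the dimensions quoted above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dimensional, float32 output
vec = model.encode(["how hard can vector search be?"])[0]

print(vec.shape, vec.dtype)                            # (384,) float32
print("bytes per vector:", vec.nbytes)                 # 384 × 4 = 1536 bytes

for dims in (384, 768, 1536, 3072):                    # same math for the hosted models
    print(f"{dims:>5} d → {dims * 4:>6} bytes ≈ {dims * 4 / 1024:.1f} KB / vector (float32)")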

2. RAG in One Diagram

User Question ─► Embedding Model ─► Vector ─►
ANN Index ─► Top-k Chunks ─► LLM ─► Answer

The ANN Index (Approximate Nearest Neighbor) is where the pain lives.

It has to hold every chunk’s vector in RAM for fast look-ups.
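To make the “everything in RAM” point concrete, here is a minimal brute-force version of that retrieval step with a toy random corpus (real ANN indexes such as HNSW avoid scanning every vector, but the vectors still have to be resident):

import numpy as np

dims, n_chunks, k = 384, 100_000, 5
corpus = np.random.randn(n_chunks, dims).astype(np.float32)   # every chunk vector, resident in RAM
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)       # normalize so dot product = cosine similarity

query = np.random.randn(dims).astype(np.float32)
query /= np.linalg.norm(query)

scores = corpus @ query                                       # one pass over all 100 k vectors
top_k = np.argsort(-scores)[:k]                               # indices of the k most similar chunks

print("corpus RAM:", corpus.nbytes / 2**20, "MiB")            # 100 000 × 384 × 4 B ≈ 146 MiB
print("top-k chunk ids:", top_k)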

If you chunk aggressively (say 256 tokens, about 200 words), you end up with roughly:

10 million papers × 30 chunks/paper = 300 M vectors

3. The Moment of Horror: Memory Math

Dataset & Chunking      #Vectors   Float32 RAM
250 M chunks, 384 d     250 M      250 M × 1.5 KB ≈    358 GB
250 M chunks, 768 d     250 M      250 M × 3 KB   ≈    715 GB
250 M chunks, 1536 d    250 M      250 M × 6 KB   ≈  1 431 GB
1 B chunks, 3072 d      1 B        1 B × 12 KB    ≈ 11 444 GB
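Those rows are plain arithmetic. A quick sketch that reproduces the float32 column (bytes converted to GiB, labelled GB as in the table, no index overhead yet):

# reproduce the float32 column of the table above
for n_vectors, dims in [(250e6, 384), (250e6, 768), (250e6, 1536), (1e9, 3072)]:
    ram_gb = n_vectors * dims * 4 / 2**30            # 4 bytes per float32 dimension
    print(f"{n_vectors:>13,.0f} vectors × {dims:>4} d → {ram_gb:>7,.0f} GB")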

That’s why you see tweets like “We moved our index to 12 × r6i.8xlarge and it still swapped.”


4. Why Bigger Dimension = Sharper Search … Until It Doesn’t

Higher N gives the model more “room” to separate concepts, but the volume of the space grows exponentially with N.

If you double the dimension, you need exponentially more data to keep the same density of neighbors.

In practice the marginal gain above ~1024 dims is tiny for most retrieval tasks, but the storage cost is still linear.

Volume grows exponentially but recall plateaus.

  • Qdrant’s benchmark, on the same 100 k OpenAI-embedding subset:
    • text-embedding-3-large (3072 d), rescored → 97–99 % recall
    • text-embedding-3-small (512 d), rescored → 71–91 % recall

Take-away: doubling dims from 1536→3072 only nets ~2 % extra recall but doubles your cloud bill.


5. The Three-Step Escape Plan (Series Road-Map)

We’ll shrink vectors in three increasingly aggressive stages:

  1. halfvec (16-bit floats) – 50 % smaller, almost no recall drop.
-- before halfvec
1536 dims * 4 bytes = 6144 bytes = 6 KB / vector

-- after halfvec
1536 dims * 2 bytes = 3072 bytes = 3 KB / vector

Real benchmark: 593 k vectors, 384 d

  • float32: 913 MB
  • float16: 456 MB (2× saving)
  2. int8 / scalar quantization – 75 % smaller, still < 1 % recall drop.
-- after halfvec
1536 dims * 2 bytes = 3072 bytes = 3 KB / vector

-- after scalar quantization
1536 dims * 1 byte = 1536 bytes = 1.5 KB / vector

Speed vs Recall

  • 3.77x faster search.
  • 4x memory reduction
  • 99% recall after reranking.
  3. binary quantization – 32× smaller, ~4–7 % recall drop, but we’ll recover the drop with oversampling + reranking.

Every step is configurable in open-source tools (pgvector, Qdrant, Milvus, FAISS), and we’ll walk through code snippets for each; a minimal numpy sketch of steps 2 and 3 follows below.
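To see what steps 2 and 3 actually do to the numbers, here is a toy numpy sketch: per-dimension min/max scaling for the scalar step and sign bits for the binary step. The random data and exact scaling scheme are illustrative assumptions; production engines handle calibration, storage, and rescoring for you.

import numpy as np

rng = np.random.default_rng(0)
embs = rng.normal(size=(10_000, 1536)).astype(np.float32)   # toy stand-in for real embeddings

# step 2: scalar (int8-style) quantization: per-dimension min/max scaling into one byte
lo, hi = embs.min(axis=0), embs.max(axis=0)
scale = (hi - lo) / 255.0
codes = np.round((embs - lo) / scale).astype(np.uint8)       # 1 byte per dimension
dequantized = codes * scale + lo                             # rough reconstruction used for rescoring

# step 3: binary quantization: keep only the sign of each dimension
bits = np.packbits(embs > 0, axis=1)                         # 1 bit per dimension

print("float32:", embs.nbytes / 2**20, "MiB")    # 10 000 × 1536 × 4 B ≈ 58.6 MiB
print("int8   :", codes.nbytes / 2**20, "MiB")   # ≈ 14.6 MiB (4× smaller)
print("binary :", bits.nbytes / 2**20, "MiB")    # ≈ 1.8 MiB (32× smaller)
print("max reconstruction error:", float(np.abs(embs - dequantized).max()))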


6. Mini-Lab: Feel the Pain in 20 Lines

We’ll now materialize the actual byte counts on a small Wikipedia slice.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np

# 10 k Wikipedia articles → ~19 k chunks (256-word window, 128-word stride)
ds = load_dataset("wikipedia", "20220301.en", split="train[:10000]")
model = SentenceTransformer("all-MiniLM-L6-v2")               # 384-dimensional embeddings

chunks = []
for article in ds:
    words = article["text"].split()
    for i in range(0, len(words), 128):                       # 50 % overlap between chunks
        chunks.append(" ".join(words[i:i+256]))

embs = model.encode(chunks, show_progress_bar=True)           # shape (n_chunks, 384), float32
print("Float32 size:", embs.nbytes, "bytes")                  # ≈ 30.3 MB
print("Float16 size:", embs.astype(np.float16).nbytes)        # ≈ 15.2 MB
print("Int8 size   :", embs.astype(np.int8).nbytes)           # ≈ 7.6 MB (size only; real int8 needs scaling)
print("Binary size :", np.packbits(embs > 0, axis=1).nbytes)  # ≈ 0.95 MB (1 bit per dimension)

Extrapolate that ~30 MB to the full English Wikipedia dump (6.5 M articles ≈ 200 M chunks):

  • float32 → ~300 GB
  • float16 → ~150 GB
  • int8 → ~75 GB
  • binary → ~9.4 GB

7. Production Symptoms (So You Know It’s Not Just You)

Symptom                                Root cause
Index build takes > 3 h on 8 vCPUs     float32 is memory-bandwidth bound
Query latency spikes to 300 ms         RAM-to-CPU cache misses
A100 GPU idles at 10 % utilization     CPU choking on vector distance math
Quarterly re-embedding hurts           300 GB of egress + ingress each time

8. The Hidden Cost Nobody Mentions: Versioning

You re-embed every quarter when the model improves.

With float32, that’s 1.8 TB of egress + ingress; with halfvec + binary it drops to < 60 GB.

That alone can shave > $500/month off egress fees.
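A quick sanity check on those transfer numbers, assuming the 300 M-vector, 1536-d corpus from section 2 and counting only the raw vector payload in one direction (metadata and protocol overhead excluded):

# raw payload for one full re-embed of 300 M vectors at 1536 d
n_vectors, dims = 300e6, 1536

for name, bytes_per_dim in [("float32", 4), ("float16", 2), ("int8", 1), ("binary", 1 / 8)]:
    size_gb = n_vectors * dims * bytes_per_dim / 1e9
    print(f"{name:>7}: {size_gb:8.1f} GB per transfer")   # float32 ≈ 1843 GB, binary ≈ 58 GB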


9. TL;DR Cheat-Sheet (Print & Pin)

RAM(GB) = (#chunks) × (dims) × bytes_per_dim / 2^30
          × 1.5  (ANN overhead)

bytes_per_dim:
  float32 = 4
  float16 = 2
  int8    = 1
  binary  = 1/8
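The same cheat sheet as a throwaway Python helper; the 1.5× ANN overhead is the rough fudge factor from the formula above, not a measured constant:

BYTES_PER_DIM = {"float32": 4, "float16": 2, "int8": 1, "binary": 1 / 8}

def index_ram_gb(n_chunks: int, dims: int, dtype: str = "float32", ann_overhead: float = 1.5) -> float:
    """Back-of-the-napkin RAM estimate (GiB) for an ANN index."""
    return n_chunks * dims * BYTES_PER_DIM[dtype] / 2**30 * ann_overhead

# the 250 M-chunk, 1536-d row from section 3, now with index overhead
print(index_ram_gb(250_000_000, 1536, "float32"))   # ≈ 2146 GiB
print(index_ram_gb(250_000_000, 1536, "binary"))    # ≈ 67 GiB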

10. Next Part: Diving into halfvec

We’ll convert the same vector index to 16-bit floats, measure recall vs latency, and prove that the 50 % saving is basically free.
